* Stability of git-archive, breaking (?) the Github universe, and a possible solution @ 2023-01-31 0:06 Eli Schwartz 2023-01-31 7:49 ` Ævar Arnfjörð Bjarmason 2023-01-31 9:54 ` Stability of git-archive, breaking (?) the Github universe, and a possible solution brian m. carlson 0 siblings, 2 replies; 57+ messages in thread From: Eli Schwartz @ 2023-01-31 0:06 UTC (permalink / raw) To: Git List; +Cc: brian m. carlson For those that haven't seen, github changed its checksums for all "source code" artifacts attached to any git repository with tags. This change is now reverted due to widespread breakage -- and the lack of advance warning. The technical details of the change appear simple: they upgraded git. Probably the main discussion, complete with Github employees from this mailing list responding: https://github.com/bazel-contrib/SIG-rules-authors/issues/11#issuecomment-1409438954 Consequences of that discussion, attempting to mitigate issues by warning people that it already happened: https://github.blog/changelog/2023-01-30-git-archive-checksums-may-change/ And where I first saw it: https://github.com/mesonbuild/wrapdb/pull/884 Historically speaking, git-archive has been stable minus... a bug fix or two in rare cases, specifically relating to an inability to transcribe the contents of the git repo at all, I think? And the other factor is the compression algorithm used, which is generally GNU gzip, and historically whatever the system `gzip` command is. And gzip is a stable format. It's a worn-out, battle-weary format, even -- it's not the best at compressing, and it's not the best at decompressing, and "all the cool kids" are working on cooler formats, such as zstd which does indeed regularly change its byte output between versions. But the advantage of gzip is that it's good *enough*, and it's probably *everywhere*, and it's *reliable*. GNU gzip is reproducible. 
busybox gzip was fixed to agree with GNU gzip (this is relevant to the handful of people running software forges on, say, Alpine Linux): https://reproducible-builds.org/reports/2019-08/#upstream-news ... Nevertheless, I've seen the sentiment a few times that git doesn't like committing to output stability of git-archive, because it isn't officially documented (but it's not entirely clear what the benefits of changing are). And yet, git endeavors to do so, in order to prevent unnecessary breakage of people who embody Hyrum's Law and need that stability. Even with the new change to the compressor, git-archive is still reproducible, it's the internal gzip compressor that isn't. (This may be fixable, possibly by embedding an implementation from busybox or from GNU gzip? I'm not going to discuss that right now, though I think it's an interesting avenue of exploration.) I've thought about this now and then over the last couple of years, because I think I have a reasonable compromise that might make everyone (or at least most people) happy, and now seems like a good idea to mention it. What does everyone think about offering versioned git-archive outputs? This could be user-selectable as an option to `git archive`, but the main goal would be to select a good versioned output format depending on what is being archived. 
So: - first things first, un-default the internal compressor again - implement a v2 archive format, where the internal compressor is the default -- no other changes - teach git to select an archive format based on the date of the object being archived - when given a commit/tag ID to archive, check which support frame the committer date falls inside - for tree IDs, always use the latest format (it always uses the current date anyway) - schedule a date, for the sake of argument, 6 months after the next scheduled release date of git version X.Y in which this change goes live; bake this into the git sources as a transition date, all commits or tags generated after this date fall into the next format support frame The end result is that for all historic commits or tags, `git archive` will always produce the same output. This can be documented in the git-archive manpage: "the produced archive is guaranteed to be reproducible, unless you override the `tar.<format>.command` or your system compressor is not reproducible". For *new* commits or tags, everyone gets the benefit of fascinating, cool new archive formats with useful improvements at the tar container level, which is apparently a very desirable feature. The git project no longer has to worry, at all, about whether users will come to complain about how their build pipelines suddenly fail with checksum issues. The git project can simply, fearlessly, go implement innovative new changes without giving any thought to backwards compatibility. It is, simply, that those new changes only apply to projects which are still under active development, and which push new commits or tag new releases after the transition date. Old states of existing projects (regardless of whether they are still actively updating) can go have their old and apparently inefficient archives and don't get cool new stuff. That's fine. 
They're also increasingly rarely used, because they are, after all, old -- and most likely only used for historic archival purposes. If the worst comes to worst, well, they managed to produce a somehow useful archive with an older version of git -- nothing will *break* if they don't get the cool new stuff. And for the vast majority of new downloads for new stuff, the in-process compressor saves one fork+exec and is a bit more efficient, I guess? A note on the transition date: I suggested 6 months after the scheduled release date, because this gives everyone running a software forge time to update git itself, and have everything ready, in time to handle the first wave of commits and tags that naturally occur after the transition date. And you don't want it to be immediate, because then people will take days or weeks to deploy and the most recent archives will change For the purposes of this thought experiment, we assume that people don't routinely set the system time to a year in the future. This will only be done in situations such as, say, testing a git upgrade deployment for a software forge. ... "And then no one ever complained about archive checksums changing again." 🤞🙏🥺 -- Eli Schwartz ^ permalink raw reply [flat|nested] 57+ messages in thread
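The date-based selection proposed in this message can be sketched in shell. This is only an illustration of the idea, not anything git implements: the function name `select_archive_format`, the labels `v1` and `v2`, and the transition timestamp are all hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch: choose an archive format version from a committer date.
# TRANSITION_EPOCH is a made-up constant standing in for a date that would be
# baked into the git sources (here: 2024-01-01 00:00:00 UTC).
TRANSITION_EPOCH=1704067200

# select_archive_format [committer-epoch-seconds]
# Prints "v1" for commits made before the transition date, "v2" otherwise.
# With no argument (the tree-ID case) it prints the latest format.
select_archive_format() {
	if [ -z "${1:-}" ]; then
		echo v2
	elif [ "$1" -lt "$TRANSITION_EPOCH" ]; then
		echo v1
	else
		echo v2
	fi
}
```

In a real repository the committer timestamp could be read with `git log -1 --format=%ct <commit>` and passed to the function; an old commit would then keep selecting `v1` no matter which git version produces the archive.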
* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution 2023-01-31 0:06 Stability of git-archive, breaking (?) the Github universe, and a possible solution Eli Schwartz @ 2023-01-31 7:49 ` Ævar Arnfjörð Bjarmason 2023-01-31 9:11 ` Eli Schwartz 2023-02-02 9:32 ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Ævar Arnfjörð Bjarmason 2023-01-31 9:54 ` Stability of git-archive, breaking (?) the Github universe, and a possible solution brian m. carlson 1 sibling, 2 replies; 57+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2023-01-31 7:49 UTC (permalink / raw) To: Eli Schwartz Cc: Git List, brian m. carlson, René Scharfe, Johannes Schindelin, Jeff King On Mon, Jan 30 2023, Eli Schwartz wrote: > For those that haven't seen, github changed its checksums for all > "source code" artifacts attached to any git repository with tags. This > change is now reverted due to widespread breakage -- and the lack of > advance warning. The technical details of the change appear simple: they > upgraded git. > > Probably the main discussion, complete with Github employees from this > mailing list responding: > > https://github.com/bazel-contrib/SIG-rules-authors/issues/11#issuecomment-1409438954 > > Consequences of that discussion, attempting to mitigate issues by > warning people that it already happened: > > https://github.blog/changelog/2023-01-30-git-archive-checksums-may-change/ > > And where I first saw it: https://github.com/mesonbuild/wrapdb/pull/884 Maybe I'm the only one that missed this on a first reading, but I couldn't find what specific change in Git was being discussed. But it's linked from the now-strikethrough portion of that github.blog URL: 4f4be00d302 (archive-tar: use internal gzip by default, 2022-06-15), first released with v2.38.0. That's the change to use gzip as a library instead of gzip(1). I've added the author to the CC list, as well as others in the initial ML discussion. 
The ML discussion about that series starts at: https://lore.kernel.org/git/pull.145.git.gitgitgadget@gmail.com/ For that change specifically I had this comment at the time: https://lore.kernel.org/git/220615.86wndhwt9a.gmgdl@evledraar.gmail.com/ The response from René (https://lore.kernel.org/git/3ed80afd-34b3-afd8-5ffb-0187a4475ee1@web.de/) fills in the "why" missing from the commit message itself: "It's to avoid a run dependency [on gzip(1)] [...] and you can set tar.tgz.command='gzip -cn' to get the old behavior. Saving energy is a better default, though." We can discuss how worthwhile that trade-off is, especially in the face of this behavior change GitHub encountered, but I don't think it was the intent with this change to change the output (but maybe René was aware of that, but didn't note it). Which brings me to... > Historically speaking, git-archive has been stable minus... a bug fix or > two in rare cases, specifically relating to an inability to transcribe > the contents of the git repo at all, I think? And the other factor is > the compression algorithm used, which is generally GNU gzip, and > historically whatever the system `gzip` command is. > > And gzip is a stable format. It's a worn-out, battle-weary format, even > -- it's not the best at compressing, and it's not the best at > decompressing, and "all the cool kids" are working on cooler formats, > such as zstd which does indeed regularly change its byte output between > versions. But the advantage of gzip is that it's good *enough*, and it's > probably *everywhere*, and it's *reliable*. > > GNU gzip is reproducible. busybox gzip was fixed to agree with GNU gzip > (this is relevant to the handful of people running software forges on, > say, Alpine Linux): > > https://reproducible-builds.org/reports/2019-08/#upstream-news > > ... 
> > Nevertheless, I've seen the sentiment a few times that git doesn't like > committing to output stability of git-archive, because it isn't > officially documented (but it's not entirely clear what the benefits of > changing are). And yet, git endeavors to do so, in order to prevent > unnecessary breakage of people who embody Hyrum's Law and need that > stability. ...Yes, this has been discussed many times on-list. My recollection of those discussions in general is that we were mostly talking about the "tar" format itself, moreso than "gzip", although in this case it's a change in the gzip component that changed the output. It's not clear to me (and I'm asking instead of digging myself, as I assume someone at GitHub has dug already) whether our change to the "internal gzip" is necessarily going to result in a different hash, or did we just forget to provide some option to the library to get the same result as gzip(1). A major thing you're eliding here is that even if "tar" or "gzip" is a "a worn-out, battle-weary format" that does *not* translate to it being a trivial matter to maintain byte-for-byte compatibility in the archives (or compression stream) you produce, even though the resulting output once un-archived or un-compressed is guaranteed to be the same. We ship our own "tar" for the purposes of this discussion (the archive.c code etc.), but offload the "gzip" part to either an external library (which is new in v2.38.0, and the subject of this discussion), or to GNU's gzip command. I have no idea if the "gzip" part of this would be as easy as saying "we'll default to gzip(1)", you note "GNU gzip is reproducible. busybox gzip was fixed to agree with GNU gzip", but does the same apply to other "gzip(1)"? I know of at least the BSD gzip. Even then, has even GNU gzip promised that it will forever maintain byte-for-byte compatibility in its output? 
> Even with the new change to the compressor, git-archive is still > reproducible, it's the internal gzip compressor that isn't. (This may be > fixable, possibly by embedding an implementation from busybox or from > GNU gzip? I'm not going to discuss that right now, though I think it's > an interesting avenue of exploration.) So first, aside from whatever the git project does about the default, have you tried running the newer git version with a tar.tgz.command='gzip -cn' and seeing if it's compatible with the old version? It's unclear from the blog post's "we are reverting this change for now" whether that meant a revert of the git version (probably), or a revert back to using gzip(1). > I've thought about this now and then over the last couple of years, > because I think I have a reasonable compromise that might make everyone > (or at least most people) happy, and now seems like a good idea to > mention it. > > What does everyone think about offering versioned git-archive outputs? > This could be user-selectable as an option to `git archive`, but the > main goal would be to select a good versioned output format depending on > what is being archived. 
So: > > - first things first, un-default the internal compressor again > - implement a v2 archive format, where the internal compressor is the > default -- no other changes > - teach git to select an archive format based on the date of the object > being archived > - when given a commit/tag ID to archive, check which support frame the > committer date falls inside > - for tree IDs, always use the latest format (it always uses the > current date anyway) > - schedule a date, for the sake of argument, 6 months after the next > scheduled release date of git version X.Y in which this change goes > live; bake this into the git sources as a transition date, all commits > or tags generated after this date fall into the next format support > frame > > The end result is that for all historic commits or tags, `git archive` > will always produce the same output. This can be documented in the > git-archive manpage: "the produced archive is guaranteed to be > reproducible, unless you override the `tar.<format>.command` or your > system compressor is not reproducible". > > For *new* commits or tags, everyone gets the benefit of fascinating, > cool new archive formats with useful improvements at the tar container > level, which is apparently a very desirable feature. The git project no > longer has to worry, at all, about whether users will come to complain > about how their build pipelines suddenly fail with checksum issues. The > git project can simply, fearlessly, go implement innovative new changes > without giving any thought to backwards compatibility. > > It is, simply, that those new changes only apply to projects which are > still under active development, and which push new commits or tag new > releases after the transition date. > > Old states of existing projects (regardless of whether they are still > actively updating) can go have their old and apparently inefficient > archives and don't get cool new stuff. That's fine. 
They're also > increasingly rarely used, because they are, after all, old -- and most > likely only used for historic archival purposes. If the worst comes to > worst, well, they managed to produce a somehow useful archive with an > older version of git -- nothing will *break* if they don't get the cool > new stuff. > > And for the vast majority of new downloads for new stuff, the in-process > compressor saves one fork+exec and is a bit more efficient, I guess? > > A note on the transition date: I suggested 6 months after the scheduled > release date, because this gives everyone running a software forge time > to update git itself, and have everything ready, in time to handle the > first wave of commits and tags that naturally occur after the transition > date. And you don't want it to be immediate, because then people will > take days or weeks to deploy and the most recent archives will change > > For the purposes of this thought experiment, we assume that people don't > routinely set the system time to a year in the future. This will only be > done in situations such as, say, testing a git upgrade deployment for a > software forge. This sounds like a workable transition plan, but it assumes that we had a really good reason to change to the "internal gzip" by default, and that we must move forward with that change in some way. I don't think that's the case per the linked-to on-list discussion; the aim was just to provide output if gzip(1) wasn't available, so all we'd need is the pseudocode of: - Prepare our tar stream - Try to stream it to gzip(1) - If that fails with "command does not exist" fall back to the internal one (possibly with a warning about possibly-different output) Then systems without a gzip(1) could produce output (which René was aiming for), but those with a system gzip(1) (e.g. GitHub's production installation) could just continue to use it. That's still a band-aid on the larger questions I raised above, i.e. 
whether we'd want to forever guarantee the output of "git archive" itself, and of the "tar.tgz.command". My off-the-cuff response to that is that we should probably: - Guarantee the "git archive" output itself (without compression), leaving the out that it *may* change in the future with notice (or we'd just version it) - Switch back to using gzip(1) by default, whatever gzip(1) that happens to be. But: - Not promise that the total end result will be byte-for-byte the same, as that would imply a promise about the external gzip(1). - Just prominently note in our docs that if you want the archive->compression result to be byte-for-byte the same as in the past it's up to you to ensure that your compressor gives you that guarantee. ^ permalink raw reply [flat|nested] 57+ messages in thread
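The fallback pseudocode in this message ("try gzip(1), otherwise use the internal one") can be sketched as a small shell function. It is a simplification under stated assumptions: the name `pick_compressor` is hypothetical, and it checks for the command up front rather than reacting to a failed spawn as git's C code would. The string `git archive gzip` is the magic value that selects the internal implementation, not a command that gets run here.

```shell
#!/bin/sh
# pick_compressor [command-name]   (hypothetical helper, not part of git)
# Prints the filter to use: "<command> -cn" when the external command exists
# on PATH, otherwise the magic "git archive gzip" value that selects git's
# internal implementation, plus a warning on stderr.
pick_compressor() {
	cmd=${1:-gzip}
	if command -v "$cmd" >/dev/null 2>&1; then
		echo "$cmd -cn"
	else
		echo "warning: $cmd not found, output may differ" >&2
		echo "git archive gzip"
	fi
}
```

The optional argument exists only so the missing-command branch can be exercised on a machine that does have gzip installed.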
* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution 2023-01-31 7:49 ` Ævar Arnfjörð Bjarmason @ 2023-01-31 9:11 ` Eli Schwartz 2023-02-02 9:32 ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Ævar Arnfjörð Bjarmason 1 sibling, 0 replies; 57+ messages in thread From: Eli Schwartz @ 2023-01-31 9:11 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: Git List, brian m. carlson, René Scharfe, Johannes Schindelin, Jeff King Quick response for now... On 1/31/23 2:49 AM, Ævar Arnfjörð Bjarmason wrote: > So first, aside from whatever the git project does about the default, > have you tried running the newer git version with a > tar.tgz.command='gzip -cn' and seeing if it's compatible with the old > version? > > It's unclear from the blog post's "we are reverting this change for now" > whether that meant a revert of the git version (probably), or a revert > back to using gzip(1). I do not know which one Github internally did, but I can confirm that the gzipped tarballs which github started shipping, when gunzipped, produced an uncompressed tarball that was byte-identical to uncompressed editions of the historic ones. i.e. you could do this: ``` wget ${important_archive_release} gzip -dc < ${important_archive_localfile} | gzip -cn > ${important_archive_localfile}.new ``` And: - they have different checksums - the .new file has reverted to the same checksum as historic versions from last year that are frozen into manifests That was part of my original investigation, before I located the public conversations. -- Eli Schwartz ^ permalink raw reply [flat|nested] 57+ messages in thread
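A related local check, sketched under the assumption that a `gzip` accepting `-c`, `-d` and `-n` is installed: the hypothetical helper `gzip_cn_is_stable` compresses the same input twice with `gzip -cn` and confirms both that the bytes match and that decompression restores the original. It only demonstrates determinism of one gzip binary on one machine; it says nothing about stability across gzip versions or implementations.

```shell
#!/bin/sh
# gzip_cn_is_stable <file>   (hypothetical helper)
# Succeeds if compressing <file> twice with "gzip -cn" yields identical bytes
# and decompressing the result restores the original content.
gzip_cn_is_stable() {
	tmp=$(mktemp -d) || return 1
	gzip -cn <"$1" >"$tmp/a.gz" &&
	gzip -cn <"$1" >"$tmp/b.gz" &&
	cmp -s "$tmp/a.gz" "$tmp/b.gz" &&
	gzip -dc <"$tmp/a.gz" | cmp -s - "$1"
	ret=$?
	rm -rf "$tmp"
	return $ret
}
```

The `-n` flag matters here: without it, gzip reading from a pipe may record the current time in the header, so two runs would differ even with identical input.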
* [PATCH 0/9] git archive: use gzip again by default, document output stabilty 2023-01-31 7:49 ` Ævar Arnfjörð Bjarmason 2023-01-31 9:11 ` Eli Schwartz @ 2023-02-02 9:32 ` Ævar Arnfjörð Bjarmason 2023-02-02 9:32 ` [PATCH 1/9] archive & tar config docs: de-duplicate configuration section Ævar Arnfjörð Bjarmason ` (11 more replies) 1 sibling, 12 replies; 57+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2023-02-02 9:32 UTC (permalink / raw) To: git Cc: Junio C Hamano, Eli Schwartz, René Scharfe, brian m . carlson, Konstantin Ryabitsev, Michal Suchánek, Raymond E . Pasco, demerphq, Theodore Ts'o, Ævar Arnfjörð Bjarmason As reported in https://lore.kernel.org/git/a812a664-67ea-c0ba-599f-cb79e2d96694@gmail.com/ changing the default "tgz" output method from "gzip(1)" to our internal "git archive gzip" (using zlib) broke things for users in the wild that assume that the "git archive" output is stable, most notably GitHub: https://github.com/orgs/community/discussions/45830 Leaving aside the larger question of whether we're going to promise output stability for "git archive" in general, the motivation for that change was to have a working compression method on systems that lacked a gzip(1). As the disruption of changing the default isn't worth it, let's use gzip(1) again by default, and only fall back on the new "git archive gzip" if it isn't available. The later parts of this series then document and test for the output stability of the command. We're not promising anything new there, except that we now promise that we're going to use "gzip" as the default compressor, but that it's up to that command to be stable, should the user desire output stability. The documentation discusses the various caveats involved, suggests alternatives to checksumming compressed archives, but in the end notes what's been the policy so far: We're not promising that the "tar" output is going to be stable. 
The early parts of this series (1-2/9) are clean-up for existing config drift, as later in the series we'll otherwise need to change the divergent config documentation in two places. CI & branch for this at: https://github.com/avar/git/tree/avar/archive-internal-gzip-not-the-default Ævar Arnfjörð Bjarmason (9): archive & tar config docs: de-duplicate configuration section git config docs: document "tar.<format>.{command,remote}" archiver API: make the "flags" in "struct archiver" an enum archive: omit the shell for built-in "command" filters archive-tar.c: move internal gzip implementation to a function archive: use "gzip -cn" for stability, not "git archive gzip" test-lib.sh: add a lazy GZIP prerequisite archive tests: test for "gzip -cn" and "git archive gzip" stability git archive docs: document output non-stability Documentation/config/tar.txt | 29 +++++++- Documentation/git-archive.txt | 96 +++++++++++++++++++------- archive-tar.c | 78 ++++++++++++++------- archive.h | 11 +-- t/t5000-tar-tree.sh | 2 - t/t5005-archive-stability.sh | 70 +++++++++++++++++++ t/t5562-http-backend-content-length.sh | 2 - t/test-lib.sh | 4 ++ 8 files changed, 231 insertions(+), 61 deletions(-) create mode 100755 t/t5005-archive-stability.sh -- 2.39.1.1392.g63e6d408230 ^ permalink raw reply [flat|nested] 57+ messages in thread
* [PATCH 1/9] archive & tar config docs: de-duplicate configuration section 2023-02-02 9:32 ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Ævar Arnfjörð Bjarmason @ 2023-02-02 9:32 ` Ævar Arnfjörð Bjarmason 2023-02-02 9:32 ` [PATCH 2/9] git config docs: document "tar.<format>.{command,remote}" Ævar Arnfjörð Bjarmason ` (10 subsequent siblings) 11 siblings, 0 replies; 57+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2023-02-02 9:32 UTC (permalink / raw) To: git Cc: Junio C Hamano, Eli Schwartz, René Scharfe, brian m . carlson, Konstantin Ryabitsev, Michal Suchánek, Raymond E . Pasco, demerphq, Theodore Ts'o, Ævar Arnfjörð Bjarmason The "tar.umask" documentation was initially added in [1], and was duplicated from the start. Then with [2] the two started drifting apart. Let's consolidate them with a change like the ones made in the commits merged in [3]. 1. ce1a79b6a74 (tar-tree: add the "tar.umask" config option, 2006-07-20) 2. 687157c736d (Documentation: update tar.umask default, 2007-08-21) 3. 7a54d740451 (Merge branch 'ab/dedup-config-and-command-docs', 2022-09-14) Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> --- Documentation/config/tar.txt | 4 +++- Documentation/git-archive.txt | 8 +------- 2 files changed, 4 insertions(+), 8 deletions(-) diff --git a/Documentation/config/tar.txt b/Documentation/config/tar.txt index de8ff48ea9d..c68e294bbc5 100644 --- a/Documentation/config/tar.txt +++ b/Documentation/config/tar.txt @@ -3,4 +3,6 @@ tar.umask:: tar archive entries. The default is 0002, which turns off the world write bit. The special value "user" indicates that the archiving user's umask will be used instead. See umask(2) and - linkgit:git-archive[1]. + linkgit:git-archive[1] for + details. If `--remote` is used then only the configuration of + the remote repository takes effect. 
diff --git a/Documentation/git-archive.txt b/Documentation/git-archive.txt index 60c040988bb..bbb407d4975 100644 --- a/Documentation/git-archive.txt +++ b/Documentation/git-archive.txt @@ -131,13 +131,7 @@ tar CONFIGURATION ------------- -tar.umask:: - This variable can be used to restrict the permission bits of - tar archive entries. The default is 0002, which turns off the - world write bit. The special value "user" indicates that the - archiving user's umask will be used instead. See umask(2) for - details. If `--remote` is used then only the configuration of - the remote repository takes effect. +include::config/tar.txt[] tar.<format>.command:: This variable specifies a shell command through which the tar -- 2.39.1.1392.g63e6d408230 ^ permalink raw reply related [flat|nested] 57+ messages in thread
* [PATCH 2/9] git config docs: document "tar.<format>.{command,remote}" 2023-02-02 9:32 ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Ævar Arnfjörð Bjarmason 2023-02-02 9:32 ` [PATCH 1/9] archive & tar config docs: de-duplicate configuration section Ævar Arnfjörð Bjarmason @ 2023-02-02 9:32 ` Ævar Arnfjörð Bjarmason 2023-02-02 9:32 ` [PATCH 3/9] archiver API: make the "flags" in "struct archiver" an enum Ævar Arnfjörð Bjarmason ` (9 subsequent siblings) 11 siblings, 0 replies; 57+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2023-02-02 9:32 UTC (permalink / raw) To: git Cc: Junio C Hamano, Eli Schwartz, René Scharfe, brian m . carlson, Konstantin Ryabitsev, Michal Suchánek, Raymond E . Pasco, demerphq, Theodore Ts'o, Ævar Arnfjörð Bjarmason Since the "tar.<format>.command" and "tar.<format>.remote" configuration was added in [1] and [2], we have not included it in the "git-config(1)" docs themselves. Since we're including "Documentation/config/tar.txt" in "Documentation/config/git-archive.txt" as of the preceding commit, let's move this documentation to the former, to be included in the latter. This is a move-only change, aside from changing the mention of "`git archive`" to "linkgit:git-archive[1]", for consistency with other such mentions. 1. 767cf4579f0 (archive: implement configurable tar filters, 2011-06-21) 2. 7b97730b764 (upload-archive: allow user to turn off filters, 2011-06-21) Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> --- Documentation/config/tar.txt | 18 ++++++++++++++++++ Documentation/git-archive.txt | 18 ------------------ 2 files changed, 18 insertions(+), 18 deletions(-) diff --git a/Documentation/config/tar.txt b/Documentation/config/tar.txt index c68e294bbc5..894c1163bb9 100644 --- a/Documentation/config/tar.txt +++ b/Documentation/config/tar.txt @@ -6,3 +6,21 @@ tar.umask:: linkgit:git-archive[1] for details. 
If `--remote` is used then only the configuration of the remote repository takes effect. + +tar.<format>.command:: + This variable specifies a shell command through which the tar + output generated by linkgit:git-archive[1] should be piped. The command + is executed using the shell with the generated tar file on its + standard input, and should produce the final output on its + standard output. Any compression-level options will be passed + to the command (e.g., `-9`). ++ +The `tar.gz` and `tgz` formats are defined automatically and use the +magic command `git archive gzip` by default, which invokes an internal +implementation of gzip. + +tar.<format>.remote:: + If true, enable the format for use by remote clients via + linkgit:git-upload-archive[1]. Defaults to false for + user-defined formats, but true for the `tar.gz` and `tgz` + formats. diff --git a/Documentation/git-archive.txt b/Documentation/git-archive.txt index bbb407d4975..268e797f03a 100644 --- a/Documentation/git-archive.txt +++ b/Documentation/git-archive.txt @@ -133,24 +133,6 @@ CONFIGURATION include::config/tar.txt[] -tar.<format>.command:: - This variable specifies a shell command through which the tar - output generated by `git archive` should be piped. The command - is executed using the shell with the generated tar file on its - standard input, and should produce the final output on its - standard output. Any compression-level options will be passed - to the command (e.g., `-9`). -+ -The `tar.gz` and `tgz` formats are defined automatically and use the -magic command `git archive gzip` by default, which invokes an internal -implementation of gzip. - -tar.<format>.remote:: - If true, enable the format for use by remote clients via - linkgit:git-upload-archive[1]. Defaults to false for - user-defined formats, but true for the `tar.gz` and `tgz` - formats. - [[ATTRIBUTES]] ATTRIBUTES ---------- -- 2.39.1.1392.g63e6d408230 ^ permalink raw reply related [flat|nested] 57+ messages in thread
* [PATCH 3/9] archiver API: make the "flags" in "struct archiver" an enum 2023-02-02 9:32 ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Ævar Arnfjörð Bjarmason 2023-02-02 9:32 ` [PATCH 1/9] archive & tar config docs: de-duplicate configuration section Ævar Arnfjörð Bjarmason 2023-02-02 9:32 ` [PATCH 2/9] git config docs: document "tar.<format>.{command,remote}" Ævar Arnfjörð Bjarmason @ 2023-02-02 9:32 ` Ævar Arnfjörð Bjarmason 2023-02-02 9:32 ` [PATCH 4/9] archive: omit the shell for built-in "command" filters Ævar Arnfjörð Bjarmason ` (8 subsequent siblings) 11 siblings, 0 replies; 57+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2023-02-02 9:32 UTC (permalink / raw) To: git Cc: Junio C Hamano, Eli Schwartz, René Scharfe, brian m . carlson, Konstantin Ryabitsev, Michal Suchánek, Raymond E . Pasco, demerphq, Theodore Ts'o, Ævar Arnfjörð Bjarmason Refactor the "#define" pattern in the archiver.h to use a new "enum archiver_flags". This isn't a functional change, but will make adding new flags in a subsequent commit easier to reason about. 
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> --- archive.h | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/archive.h b/archive.h index 08bed3ed3af..6b51288c2ed 100644 --- a/archive.h +++ b/archive.h @@ -36,13 +36,15 @@ const char *archive_format_from_filename(const char *filename); /* archive backend stuff */ -#define ARCHIVER_WANT_COMPRESSION_LEVELS 1 -#define ARCHIVER_REMOTE 2 -#define ARCHIVER_HIGH_COMPRESSION_LEVELS 4 +enum archiver_flags { + ARCHIVER_WANT_COMPRESSION_LEVELS = 1<<0, + ARCHIVER_REMOTE = 1<<1, + ARCHIVER_HIGH_COMPRESSION_LEVELS = 1<<2, +}; struct archiver { const char *name; int (*write_archive)(const struct archiver *, struct archiver_args *); - unsigned flags; + enum archiver_flags flags; char *filter_command; }; void register_archiver(struct archiver *); -- 2.39.1.1392.g63e6d408230 ^ permalink raw reply related [flat|nested] 57+ messages in thread
* [PATCH 4/9] archive: omit the shell for built-in "command" filters 2023-02-02 9:32 ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Ævar Arnfjörð Bjarmason ` (2 preceding siblings ...) 2023-02-02 9:32 ` [PATCH 3/9] archiver API: make the "flags" in "struct archiver" an enum Ævar Arnfjörð Bjarmason @ 2023-02-02 9:32 ` Ævar Arnfjörð Bjarmason 2023-02-02 9:32 ` [PATCH 5/9] archive-tar.c: move internal gzip implementation to a function Ævar Arnfjörð Bjarmason ` (7 subsequent siblings) 11 siblings, 0 replies; 57+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2023-02-02 9:32 UTC (permalink / raw) To: git Cc: Junio C Hamano, Eli Schwartz, René Scharfe, brian m . carlson, Konstantin Ryabitsev, Michal Suchánek, Raymond E . Pasco, demerphq, Theodore Ts'o, Ævar Arnfjörð Bjarmason Since the "tar.<format>.command" interface was added in [1] we've promised to invoke the shell to run it if e.g. "gzip -cn" is configured. That common format was then added as a default in [2]. But if we have no such configuration we can safely assume that the user isn't expecting the "gzip" to be invoked via a shell, and we can skip the "sh" process. We are intentionally not treating a configured "tar.<format>.command=<cmd>" where "<cmd>" is equivalent to our hardcoded "<cmd>" the same as the hardcoded default. If the user has configured e.g. "gzip -cn" they may be relying on what the shell gives them over a direct execve() of "gzip". This makes us marginally faster, but the real point is to make the error handling easier to deal with. When we're using the shell we can't as easily tell whether e.g. the "gzip" we spawned failed, i.e. "start_command()" won't fail, because we can find the "sh". A subsequent commit will tweak the default that [3] introduced to be a fallback instead, at which point we'll need this for correctness. 1. 767cf4579f0 (archive: implement configurable tar filters, 2011-06-21) 2. 
0e804e09938 (archive: provide builtin .tar.gz filter, 2011-06-21) 3. 4f4be00d302 (archive-tar: use internal gzip by default, 2022-06-15) Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> --- Documentation/config/tar.txt | 3 +++ archive-tar.c | 17 +++++++++++++---- archive.h | 1 + 3 files changed, 17 insertions(+), 4 deletions(-) diff --git a/Documentation/config/tar.txt b/Documentation/config/tar.txt index 894c1163bb9..5456fc617a2 100644 --- a/Documentation/config/tar.txt +++ b/Documentation/config/tar.txt @@ -18,6 +18,9 @@ tar.<format>.command:: The `tar.gz` and `tgz` formats are defined automatically and use the magic command `git archive gzip` by default, which invokes an internal implementation of gzip. ++ +The automatically defined commands do not invoke the shell, avoiding +the minor overhead of an extra sh(1) process. tar.<format>.remote:: If true, enable the format for use by remote clients via diff --git a/archive-tar.c b/archive-tar.c index f8fad2946ef..8c5de949c64 100644 --- a/archive-tar.c +++ b/archive-tar.c @@ -367,12 +367,13 @@ static struct archiver *find_tar_filter(const char *name, size_t len) } static int tar_filter_config(const char *var, const char *value, - void *data UNUSED) + void *data) { struct archiver *ar; const char *name; const char *type; size_t namelen; + int *configured = data; if (parse_config_key(var, "tar", &name, &namelen, &type) < 0 || !name) return 0; @@ -388,6 +389,9 @@ static int tar_filter_config(const char *var, const char *value, tar_filters[nr_tar_filters++] = ar; } + if (configured && *configured) + ar->flags |= ARCHIVER_COMMAND_FROM_CONFIG; + if (!strcmp(type, "command")) { if (!value) return config_error_nonbool(var); @@ -495,8 +499,12 @@ static int write_tar_filter_archive(const struct archiver *ar, if (args->compression_level >= 0) strbuf_addf(&cmd, " -%d", args->compression_level); - strvec_push(&filter.args, cmd.buf); - filter.use_shell = 1; + if (ar->flags & ARCHIVER_COMMAND_FROM_CONFIG) { + 
strvec_push(&filter.args, cmd.buf); + filter.use_shell = 1; + } else { + strvec_split(&filter.args, cmd.buf); + } filter.in = -1; filter.silent_exec_failure = 1; @@ -526,13 +534,14 @@ static struct archiver tar_archiver = { void init_tar_archiver(void) { int i; + int configured = 1; register_archiver(&tar_archiver); tar_filter_config("tar.tgz.command", internal_gzip_command, NULL); tar_filter_config("tar.tgz.remote", "true", NULL); tar_filter_config("tar.tar.gz.command", internal_gzip_command, NULL); tar_filter_config("tar.tar.gz.remote", "true", NULL); - git_config(git_tar_config, NULL); + git_config(git_tar_config, &configured); for (i = 0; i < nr_tar_filters; i++) { /* omit any filters that never had a command configured */ if (tar_filters[i]->filter_command) diff --git a/archive.h b/archive.h index 6b51288c2ed..9686b3b5cc1 100644 --- a/archive.h +++ b/archive.h @@ -40,6 +40,7 @@ enum archiver_flags { ARCHIVER_WANT_COMPRESSION_LEVELS = 1<<0, ARCHIVER_REMOTE = 1<<1, ARCHIVER_HIGH_COMPRESSION_LEVELS = 1<<2, + ARCHIVER_COMMAND_FROM_CONFIG = 1<<3, }; struct archiver { const char *name; -- 2.39.1.1392.g63e6d408230 ^ permalink raw reply related [flat|nested] 57+ messages in thread
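The difference between the two code paths in the patch above can be sketched in plain shell. This is only an illustration, not code from git: run_via_shell and run_split are hypothetical helpers, and run_split merely approximates what strvec_split() followed by a direct exec does (split on whitespace, no shell parsing).

```sh
#!/bin/sh
# Hypothetical helpers, for illustration only (not part of git).

# Configured filter: the whole string is handed to sh, so shell
# syntax such as pipes and redirections is honored.
run_via_shell () {
	sh -c "$1"
}

# Built-in filter: split the string on whitespace and execute the
# first word directly; "|" and friends are just ordinary arguments.
run_split () {
	set -f		# no glob expansion while splitting
	set -- $1	# deliberately unquoted: split on whitespace
	set +f
	"$@"
}

run_via_shell 'echo hello | tr a-z A-Z'	# the pipeline runs
run_split 'echo hello | tr a-z A-Z'	# echo receives "|" as an argument
```

A configured command may depend on shell parsing that the split form would silently change, which is presumably why only the built-in default skips the shell.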
* [PATCH 5/9] archive-tar.c: move internal gzip implementation to a function 2023-02-02 9:32 ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Ævar Arnfjörð Bjarmason ` (3 preceding siblings ...) 2023-02-02 9:32 ` [PATCH 4/9] archive: omit the shell for built-in "command" filters Ævar Arnfjörð Bjarmason @ 2023-02-02 9:32 ` Ævar Arnfjörð Bjarmason 2023-02-02 9:32 ` [PATCH 6/9] archive: use "gzip -cn" for stability, not "git archive gzip" Ævar Arnfjörð Bjarmason ` (6 subsequent siblings) 11 siblings, 0 replies; 57+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2023-02-02 9:32 UTC (permalink / raw) To: git Cc: Junio C Hamano, Eli Schwartz, René Scharfe, brian m . carlson, Konstantin Ryabitsev, Michal Suchánek, Raymond E . Pasco, demerphq, Theodore Ts'o, Ævar Arnfjörð Bjarmason Refactor the code added in 76d7602631a (archive-tar: add internal gzip implementation, 2022-06-15) so that the magic "git archive gzip" command is implemented as a function we call. A subsequent commit will start using this as a fallback, but for now there are no functional changes here.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> --- archive-tar.c | 43 +++++++++++++++++++++++++------------------ 1 file changed, 25 insertions(+), 18 deletions(-) diff --git a/archive-tar.c b/archive-tar.c index 8c5de949c64..dfc133deac7 100644 --- a/archive-tar.c +++ b/archive-tar.c @@ -465,12 +465,33 @@ static void tgz_write_block(const void *data) static const char internal_gzip_command[] = "git archive gzip"; -static int write_tar_filter_archive(const struct archiver *ar, - struct archiver_args *args) +static int gzip_internally(const struct archiver *ar, + struct archiver_args *args) { #if ZLIB_VERNUM >= 0x1221 struct gz_header_s gzhead = { .os = 3 }; /* Unix, for reproducibility */ #endif + int r; + + write_block = tgz_write_block; + git_deflate_init_gzip(&gzstream, args->compression_level); +#if ZLIB_VERNUM >= 0x1221 + if (deflateSetHeader(&gzstream.z, &gzhead) != Z_OK) + BUG("deflateSetHeader() called too late"); +#endif + gzstream.next_out = outbuf; + gzstream.avail_out = sizeof(outbuf); + + r = write_tar_archive(ar, args); + + tgz_deflate(Z_FINISH); + git_deflate_end(&gzstream); + return r; +} + +static int write_tar_filter_archive(const struct archiver *ar, + struct archiver_args *args) +{ struct strbuf cmd = STRBUF_INIT; struct child_process filter = CHILD_PROCESS_INIT; int r; @@ -478,22 +499,8 @@ static int write_tar_filter_archive(const struct archiver *ar, if (!ar->filter_command) BUG("tar-filter archiver called with no filter defined"); - if (!strcmp(ar->filter_command, internal_gzip_command)) { - write_block = tgz_write_block; - git_deflate_init_gzip(&gzstream, args->compression_level); -#if ZLIB_VERNUM >= 0x1221 - if (deflateSetHeader(&gzstream.z, &gzhead) != Z_OK) - BUG("deflateSetHeader() called too late"); -#endif - gzstream.next_out = outbuf; - gzstream.avail_out = sizeof(outbuf); - - r = write_tar_archive(ar, args); - - tgz_deflate(Z_FINISH); - git_deflate_end(&gzstream); - return r; - } + if (!strcmp(ar->filter_command, 
internal_gzip_command)) + return gzip_internally(ar, args); strbuf_addstr(&cmd, ar->filter_command); if (args->compression_level >= 0) -- 2.39.1.1392.g63e6d408230 ^ permalink raw reply related [flat|nested] 57+ messages in thread
* [PATCH 6/9] archive: use "gzip -cn" for stability, not "git archive gzip" 2023-02-02 9:32 ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Ævar Arnfjörð Bjarmason ` (4 preceding siblings ...) 2023-02-02 9:32 ` [PATCH 5/9] archive-tar.c: move internal gzip implementation to a function Ævar Arnfjörð Bjarmason @ 2023-02-02 9:32 ` Ævar Arnfjörð Bjarmason 2023-02-02 9:32 ` [PATCH 7/9] test-lib.sh: add a lazy GZIP prerequisite Ævar Arnfjörð Bjarmason ` (5 subsequent siblings) 11 siblings, 0 replies; 57+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2023-02-02 9:32 UTC (permalink / raw) To: git Cc: Junio C Hamano, Eli Schwartz, René Scharfe, brian m . carlson, Konstantin Ryabitsev, Michal Suchánek, Raymond E . Pasco, demerphq, Theodore Ts'o, Ævar Arnfjörð Bjarmason This reverts and amends [1] so that we don't use "git archive gzip" by default, but only fall back on it when we cannot invoke "gzip". As noted in the discussion at [2], that commit, first released with v2.38.0, caused widespread breakage in the wild: Hosting sites like GitHub tend to offer a feature to download tagged releases as archives, which are generated by some variant of "git archive --format=tgz". Downstream distributors then tend to (re-)download those archives as-is, hardcoding their known hash in their packaging systems. See [3], [4] etc. for reports of those systems breaking in conjunction with [1]. The rationale is entirely missing from the commit message for [1], but as seen in the question about that in [5] and the reply at [6], at the time it was to "avoid a run[time] dependency; the build/test dependency remains.". It's not immediately apparent what the second part of that is referring to, as [1] also removed the "GZIP" prerequisite from some tests. The answer is that we still have other tests that need "GZIP", but those are invoking "gzip(1)" explicitly.
In any case, whatever promises we make in the future about the stability and non-stability of "git archive" output (or the derived compressed artifact), this amount of fallout isn't worth it to get to the stated goal in [1]. Let's instead default to "gzip -cn" again, but if we can't find it, fall back on "git archive gzip". Note that we'll only fall back if that "gzip -cn" is ours, not if it comes from the user's own "tar.<format>.command" configuration. If we do need the fallback we'll warn about it. No such warning will be emitted if the user has explicitly asked for "git archive gzip". 1. 4f4be00d302 (archive-tar: use internal gzip by default, 2022-06-15) 2. https://lore.kernel.org/git/a812a664-67ea-c0ba-599f-cb79e2d96694@gmail.com/ 3. https://github.com/Homebrew/homebrew-core/issues/121877 4. https://github.com/bazel-contrib/SIG-rules-authors/issues/11 5. https://lore.kernel.org/git/220615.86wndhwt9a.gmgdl@evledraar.gmail.com/ 6. https://lore.kernel.org/git/3ed80afd-34b3-afd8-5ffb-0187a4475ee1@web.de/ Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> --- Documentation/config/tar.txt | 8 ++++++-- archive-tar.c | 20 +++++++++++++++----- 2 files changed, 21 insertions(+), 7 deletions(-) diff --git a/Documentation/config/tar.txt b/Documentation/config/tar.txt index 5456fc617a2..37f24baa73a 100644 --- a/Documentation/config/tar.txt +++ b/Documentation/config/tar.txt @@ -16,8 +16,12 @@ tar.<format>.command:: to the command (e.g., `-9`). + The `tar.gz` and `tgz` formats are defined automatically and use the -magic command `git archive gzip` by default, which invokes an internal -implementation of gzip. +command `gzip -cn` by default. An internal gzip implementation can be +used by specifying the value `git archive gzip`. ++ +If 'gzip -cn' cannot be executed we'll fall back on `git archive gzip` +with a warning, if you don't have a gzip(1) and would like to use the +internal `git archive gzip` without warning, configure it explicitly.
+ The automatically defined commands do not invoke the shell, avoiding the minor overhead of an extra sh(1) process. diff --git a/archive-tar.c b/archive-tar.c index dfc133deac7..26efb911ebc 100644 --- a/archive-tar.c +++ b/archive-tar.c @@ -464,6 +464,7 @@ static void tgz_write_block(const void *data) } static const char internal_gzip_command[] = "git archive gzip"; +static const char gzip_cn_command[] = "gzip -cn"; static int gzip_internally(const struct archiver *ar, struct archiver_args *args) @@ -494,12 +495,15 @@ static int write_tar_filter_archive(const struct archiver *ar, { struct strbuf cmd = STRBUF_INIT; struct child_process filter = CHILD_PROCESS_INIT; + int filter_is_gzip_cn = 0; int r; if (!ar->filter_command) BUG("tar-filter archiver called with no filter defined"); - if (!strcmp(ar->filter_command, internal_gzip_command)) + if (!strcmp(ar->filter_command, gzip_cn_command)) + filter_is_gzip_cn = 1; + else if (!strcmp(ar->filter_command, internal_gzip_command)) return gzip_internally(ar, args); strbuf_addstr(&cmd, ar->filter_command); @@ -515,8 +519,14 @@ static int write_tar_filter_archive(const struct archiver *ar, filter.in = -1; filter.silent_exec_failure = 1; - if (start_command(&filter) < 0) - die_errno(_("unable to start '%s' filter"), cmd.buf); + if (start_command(&filter) < 0) { + if (!filter_is_gzip_cn) + die_errno(_("unable to start '%s' filter"), cmd.buf); + + warning_errno(_("unable to start '%s' filter, falling back to '%s'"), + cmd.buf, internal_gzip_command); + return gzip_internally(ar, args); + } close(1); if (dup2(filter.in, 1) < 0) die_errno(_("unable to redirect descriptor")); @@ -544,9 +554,9 @@ void init_tar_archiver(void) int configured = 1; register_archiver(&tar_archiver); - tar_filter_config("tar.tgz.command", internal_gzip_command, NULL); + tar_filter_config("tar.tgz.command", gzip_cn_command, NULL); tar_filter_config("tar.tgz.remote", "true", NULL); - tar_filter_config("tar.tar.gz.command", internal_gzip_command, NULL); + 
tar_filter_config("tar.tar.gz.command", gzip_cn_command, NULL); tar_filter_config("tar.tar.gz.remote", "true", NULL); git_config(git_tar_config, &configured); for (i = 0; i < nr_tar_filters; i++) { -- 2.39.1.1392.g63e6d408230 ^ permalink raw reply related [flat|nested] 57+ messages in thread
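The fallback policy described in the commit message above can be sketched as a small shell decision function. This is a simplified illustration, not git's implementation: choose_gzip_filter is a hypothetical name, and "command -v" only stands in for start_command() failing to execute the program. In the patch itself a configured command runs through "sh", so a missing program would surface later rather than at start_command().

```sh
#!/bin/sh
# Hypothetical sketch of the fallback policy; not git's implementation.
#
# choose_gzip_filter <program> <from-config>
#   <from-config> is 1 when the command came from the user's
#   tar.<format>.command setting, 0 when it is the built-in default.
# Prints "external" or "internal" on stdout and returns non-zero in
# the case where git would die instead of falling back.
choose_gzip_filter () {
	program=$1
	from_config=$2

	if command -v "$program" >/dev/null 2>&1
	then
		# The external compressor can be started: use it.
		echo external
	elif test "$from_config" = 1
	then
		# The user asked for this command explicitly: no fallback.
		echo "fatal: unable to start '$program' filter" >&2
		return 1
	else
		# Built-in default is missing: warn and use the internal gzip.
		echo "warning: unable to start '$program' filter, falling back to 'git archive gzip'" >&2
		echo internal
	fi
}

choose_gzip_filter sh 0
choose_gzip_filter no-such-gzip-program 0
```

The second call assumes no program named "no-such-gzip-program" exists on PATH, so it takes the warning branch.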
* [PATCH 7/9] test-lib.sh: add a lazy GZIP prerequisite 2023-02-02 9:32 ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Ævar Arnfjörð Bjarmason ` (5 preceding siblings ...) 2023-02-02 9:32 ` [PATCH 6/9] archive: use "gzip -cn" for stability, not "git archive gzip" Ævar Arnfjörð Bjarmason @ 2023-02-02 9:32 ` Ævar Arnfjörð Bjarmason 2023-02-02 9:32 ` [PATCH 8/9] archive tests: test for "gzip -cn" and "git archive gzip" stability Ævar Arnfjörð Bjarmason ` (4 subsequent siblings) 11 siblings, 0 replies; 57+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2023-02-02 9:32 UTC (permalink / raw) To: git Cc: Junio C Hamano, Eli Schwartz, René Scharfe, brian m . carlson, Konstantin Ryabitsev, Michal Suchánek, Raymond E . Pasco, demerphq, Theodore Ts'o, Ævar Arnfjörð Bjarmason Move the "gzip --version" lazy prerequisite added in [1] and copy/pasted to another test in [2] to test-lib.sh. A subsequent commit will add a third user, let's first stop duplicating it. 1. 96174145fc3 (t5000: simplify gzip prerequisite checks, 2013-12-03) 2. 6c213e863ae (http-backend: respect CONTENT_LENGTH for receive-pack, 2018-07-27) Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> --- t/t5000-tar-tree.sh | 2 -- t/t5562-http-backend-content-length.sh | 2 -- t/test-lib.sh | 4 ++++ 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh index d4730481384..e1fa34bb828 100755 --- a/t/t5000-tar-tree.sh +++ b/t/t5000-tar-tree.sh @@ -38,8 +38,6 @@ test_lazy_prereq TAR_NEEDS_PAX_FALLBACK ' ) ' -test_lazy_prereq GZIP 'gzip --version' - get_pax_header() { file=$1 header=$2= diff --git a/t/t5562-http-backend-content-length.sh b/t/t5562-http-backend-content-length.sh index b68ec22d3fd..e83aa336fa8 100755 --- a/t/t5562-http-backend-content-length.sh +++ b/t/t5562-http-backend-content-length.sh @@ -3,8 +3,6 @@ test_description='test git-http-backend respects CONTENT_LENGTH' . 
./test-lib.sh -test_lazy_prereq GZIP 'gzip --version' - verify_http_result() { # some fatal errors still produce status 200 # so check if there is the error message diff --git a/t/test-lib.sh b/t/test-lib.sh index 01e88781dd2..33bb9fe991f 100644 --- a/t/test-lib.sh +++ b/t/test-lib.sh @@ -1922,6 +1922,10 @@ test_lazy_prereq LONG_IS_64BIT ' test_lazy_prereq TIME_IS_64BIT 'test-tool date is64bit' test_lazy_prereq TIME_T_IS_64BIT 'test-tool date time_t-is64bit' +test_lazy_prereq GZIP ' + gzip --version +' + test_lazy_prereq CURL ' curl --version ' -- 2.39.1.1392.g63e6d408230 ^ permalink raw reply related [flat|nested] 57+ messages in thread
* [PATCH 8/9] archive tests: test for "gzip -cn" and "git archive gzip" stability 2023-02-02 9:32 ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Ævar Arnfjörð Bjarmason ` (6 preceding siblings ...) 2023-02-02 9:32 ` [PATCH 7/9] test-lib.sh: add a lazy GZIP prerequisite Ævar Arnfjörð Bjarmason @ 2023-02-02 9:32 ` Ævar Arnfjörð Bjarmason 2023-02-02 9:32 ` [PATCH 9/9] git archive docs: document output non-stability Ævar Arnfjörð Bjarmason ` (3 subsequent siblings) 11 siblings, 0 replies; 57+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2023-02-02 9:32 UTC (permalink / raw) To: git Cc: Junio C Hamano, Eli Schwartz, René Scharfe, brian m . carlson, Konstantin Ryabitsev, Michal Suchánek, Raymond E . Pasco, demerphq, Theodore Ts'o, Ævar Arnfjörð Bjarmason If our test suite is instrumented to run the first "test_cmp_bin" in "test_done" it'll mostly pass, but fail on a few tests, such as "t5319-multi-pack-index.sh". Those tests reveal edge cases where the output of "gzip -cn" is different than that of "git archive gzip" for the same input. Let's extract a minimal version of the part of "t5319-multi-pack-index.sh" which triggers it, and add a test for archival stability. Whatever we ultimately decide to promise when it comes to this stability (see [1]) it'll be better to go into any behavior difference knowing that's what we're about to do, rather than discover widespread breakage due to already released Git versions. The "GZIP_TRIVIALLY_STABLE" code here is added because on OSX even a trivial *.tgz generated by the two methods will be different. 1. 
https://lore.kernel.org/git/a812a664-67ea-c0ba-599f-cb79e2d96694@gmail.com/ Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> --- t/t5005-archive-stability.sh | 70 ++++++++++++++++++++++++++++++++++++ 1 file changed, 70 insertions(+) create mode 100755 t/t5005-archive-stability.sh diff --git a/t/t5005-archive-stability.sh b/t/t5005-archive-stability.sh new file mode 100755 index 00000000000..c7532886920 --- /dev/null +++ b/t/t5005-archive-stability.sh @@ -0,0 +1,70 @@ +#!/bin/sh + +test_description='git archive stabilty' + +TEST_PASSES_SANITIZE_LEAK=true +. ./test-lib.sh + +create_archive_file_with_config () { + local file="$1" && + local config="$2" && + shift 2 && + + test_when_finished "rm -rf \"$file\"" && + git -c tar.tgz.command="$config" archive -o "$file" HEAD +} + +setup_gzip_vs_git_archive_gzip () { + create_archive_file_with_config "expect.tgz" "gzip -cn" && + create_archive_file_with_config "actual.tgz" "git archive gzip" +} + +test_lazy_prereq GZIP_TRIVIALLY_STABLE ' + git clone "$TRASH_DIRECTORY" . && + test_commit P && + setup_gzip_vs_git_archive_gzip && + test_cmp_bin expect.tgz actual.tgz +' + +if ! test_have_prereq GZIP_TRIVIALLY_STABLE +then + skip_all='skipping gzip v.s. git archive gzip tests, even trivial content differs' + test_done +fi + +# The first test_expect_success is after the "skip_all" so we'll get +# the skip summary in prove(1) output. +test_expect_success 'setup' ' + test_commit A +' + +test_expect_success GZIP '"gzip -cn" and v.s. 
"git archive gzip" produce the same output still' ' + setup_gzip_vs_git_archive_gzip && + test_cmp_bin expect.tgz actual.tgz +' + +generate_objects () { + i=$1 + iii=$(printf '%03i' $i) + { + echo $iii && + test-tool genrandom "$iii" 8192 + } >file_$iii && + git update-index --add file_$iii +} + +test_expect_success 'create objects with (stable) random data' ' + test_commit initial && + for i in $(test_seq 1 5) + do + generate_objects $i || return 1 + done && + git commit -m"add objects" +' + +test_expect_success GZIP '"gzip -cn" and v.s. "git archive gzip" have differing output' ' + setup_gzip_vs_git_archive_gzip && + ! test_cmp_bin expect.tgz actual.tgz +' + +test_done -- 2.39.1.1392.g63e6d408230 ^ permalink raw reply related [flat|nested] 57+ messages in thread
* [PATCH 9/9] git archive docs: document output non-stability 2023-02-02 9:32 ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Ævar Arnfjörð Bjarmason ` (7 preceding siblings ...) 2023-02-02 9:32 ` [PATCH 8/9] archive tests: test for "gzip -cn" and "git archive gzip" stability Ævar Arnfjörð Bjarmason @ 2023-02-02 9:32 ` Ævar Arnfjörð Bjarmason 2023-02-02 10:25 ` brian m. carlson 2023-02-02 16:17 ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Phillip Wood ` (2 subsequent siblings) 11 siblings, 1 reply; 57+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2023-02-02 9:32 UTC (permalink / raw) To: git Cc: Junio C Hamano, Eli Schwartz, René Scharfe, brian m . carlson, Konstantin Ryabitsev, Michal Suchánek, Raymond E . Pasco, demerphq, Theodore Ts'o, Ævar Arnfjörð Bjarmason There's an ongoing discussion about the output stability of "git archive"[1] as a follow-up to the incident GitHub experienced when upgrading to v2.38.0[2]. In a preceding commit we reverted the immediate cause of that incident, which was that we'd moved away from "gzip -cn" as the default compression method in favor of the internal "git archive gzip" in [3]. Let's follow that up by documenting the non-promises we've always maintained with regard to "git archive"'s output stability. We may want to make stronger promises in this area, but this change avoids addressing that question. Instead we're discussing that we've changed this in the past, aren't changing it willy-nilly, but it may change again in the future. The only new promise here that we haven't explicitly maintained historically is that we're promising to forever shell out to the system's "gzip" by default. Whether it produces stable output once that happens we leave up to the "gzip" tool. We're also discussing the caveats & differences in output with SHA-1 and SHA-256 repositories, and trying to steer users towards more stable alternatives.
First by using "git verify-tag" and the like to verify releases, and if they really must checksum generated output, to encourage them to at least checksum the "tar" output contained within the compressed output, not the compressed output itself. 1. https://lore.kernel.org/git/a812a664-67ea-c0ba-599f-cb79e2d96694@gmail.com/ 2. https://github.com/orgs/community/discussions/45830 3. 4f4be00d302 (archive-tar: use internal gzip by default, 2022-06-15) Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> --- Documentation/git-archive.txt | 70 ++++++++++++++++++++++++++++++++++- 1 file changed, 69 insertions(+), 1 deletion(-) diff --git a/Documentation/git-archive.txt b/Documentation/git-archive.txt index 268e797f03a..78f1b033cb7 100644 --- a/Documentation/git-archive.txt +++ b/Documentation/git-archive.txt @@ -14,6 +14,7 @@ SYNOPSIS [--remote=<repo> [--exec=<git-upload-archive>]] <tree-ish> [<path>...] +[[DESCRIPTION]] DESCRIPTION ----------- Creates an archive of the specified format containing the tree @@ -28,7 +29,7 @@ case the commit time as recorded in the referenced commit object is used instead. Additionally the commit ID is stored in a global extended pax header if the tar format is used; it can be extracted using 'git get-tar-commit-id'. In ZIP files it is stored as a file -comment. +comment. See the <<STABILITY,OUTPUT STABILITY>> section below. OPTIONS ------- @@ -202,6 +203,73 @@ EXAMPLES You can use it specifying `--format=tar.xz`, or by creating an output file like `-o foo.tar.xz`. +[[STABILITY]] +OUTPUT STABILITY +---------------- + +The output of 'git archive' is not guaranteed to be stable, and may +change between versions. + +There are many valid ways to encode the same data in the tar format +itself. For non-`tar` arguments to the `--format` option we rely on +external tools (or libraries) for compressing the output we generate. + +The `tar` format contains the commit ID in the pax header (see the +<<DESCRIPTION>> section above). 
A repository that's been migrated from +SHA-1 to SHA-256 will therefore have different `tar` output for the +"same" commit. See `extension.objectFormat` in linkgit:git-config[1]. + +Instead of relying on the output of `git archive`, you should prefer +to stick to git's own transport protocols, and e.g. validate releases +with linkgit:git-tag[1]'s `--verify` option. + +Despite the output of `git archive` having never been promised to be +stable, various users in the wild have come to rely on that being the +case. + +Most notably, large hosting providers provide a way to download a +given tagged release as a `git archive`. Some downstream tools then +expect the content of that archive to be stable. When that's changed +widespread breakage has been observed, see +https://github.com/orgs/community/discussions/45830 for one such case. + +While we won't promise that the output won't change in the future, we +are aware of these users, and will try to avoid changing it +willy-nilly. Furthermore, we make the following promises: + +* The default gzip compression tool will continue to be gzip(1). If + you rely on this being e.g. GNU gzip for the purposes of stability, + it's up to you to ensure that its output is stable across + versions. ++ + +We in turn promise to not e.g. make the internal "git archive gzip" +implementation the default, as it produces different ouput than +gzip(1) in some case. + +* We will do our best not to change the "tar" output itself, but won't + promise that we're never going to change it. ++ +If you must avoid using "git" itself for the tree validation, you +should be checksumming the uncompressed "tar" output, not e.g. the +compressed "tgz" output. ++ + +This ensures that you're only relying on the output emitted by git +itself, and avoiding the additional dependency on external +compression. ++ +See +https://git.kernel.org/pub/scm/linux/kernel/git/mricon/korg-helpers.git/tree/get-verified-tarball +for an implementation of that workflow. 
+ +* We promise that a given version of git will emit stable "tar" output + for the same tree ID (but not commit ID, see the discussion in the + <<DESCRIPTION>> section above). ++ +While you shouldn't assume that different versions of git will emit +the same output, you can assume (e.g. for the purposes of caching) +that a given version's output is stable. SEE ALSO -------- -- 2.39.1.1392.g63e6d408230 ^ permalink raw reply related [flat|nested] 57+ messages in thread
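The advice in the documentation patch above, to checksum the uncompressed "tar" stream rather than the compressed "tgz", can be sketched as follows. This is a minimal illustration, not the get-verified-tarball script the patch links to: tar_stream_sha256 is a hypothetical helper, and it assumes gzip(1) and sha256sum(1) are installed.

```sh
#!/bin/sh
# Hypothetical helper, for illustration only.
#
# Print the SHA-256 of the data inside a gzip-compressed archive,
# i.e. of the tar stream itself. The result does not depend on which
# compressor, or which compression level, produced the .gz file.
tar_stream_sha256 () {
	gzip -dc "$1" | sha256sum | cut -d' ' -f1
}

# Typical use inside a repository ("v1.0" is a placeholder tag name):
#	git archive --format=tar.gz -o release.tar.gz v1.0
#	tar_stream_sha256 release.tar.gz
```

A checksum taken this way still depends on the tar bytes that git emits, so it only removes the dependency on the external compressor, as the patch text says.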
* Re: [PATCH 9/9] git archive docs: document output non-stability 2023-02-02 9:32 ` [PATCH 9/9] git archive docs: document output non-stability Ævar Arnfjörð Bjarmason @ 2023-02-02 10:25 ` brian m. carlson 2023-02-02 10:30 ` Ævar Arnfjörð Bjarmason 2023-02-02 16:34 ` Junio C Hamano 0 siblings, 2 replies; 57+ messages in thread From: brian m. carlson @ 2023-02-02 10:25 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: git, Junio C Hamano, Eli Schwartz, René Scharfe, Konstantin Ryabitsev, Michal Suchánek, Raymond E . Pasco, demerphq, Theodore Ts'o [-- Attachment #1: Type: text/plain, Size: 3929 bytes --] On 2023-02-02 at 09:32:29, Ævar Arnfjörð Bjarmason wrote: > +[[STABILITY]] > +OUTPUT STABILITY > +---------------- > + > +The output of 'git archive' is not guaranteed to be stable, and may > +change between versions. > + > +There are many valid ways to encode the same data in the tar format > +itself. For non-`tar` arguments to the `--format` option we rely on > +external tools (or libraries) for compressing the output we generate. > + > +The `tar` format contains the commit ID in the pax header (see the > +<<DESCRIPTION>> section above). A repository that's been migrated from > +SHA-1 to SHA-256 will therefore have different `tar` output for the > +"same" commit. See `extension.objectFormat` in linkgit:git-config[1]. > + > +Instead of relying on the output of `git archive`, you should prefer > +to stick to git's own transport protocols, and e.g. validate releases > +with linkgit:git-tag[1]'s `--verify` option. > + > +Despite the output of `git archive` having never been promised to be > +stable, various users in the wild have come to rely on that being the > +case. > + > +Most notably, large hosting providers provide a way to download a > +given tagged release as a `git archive`. Some downstream tools then > +expect the content of that archive to be stable. 
When that's changed > +widespread breakage has been observed, see > +https://github.com/orgs/community/discussions/45830 for one such case. > + > +While we won't promise that the output won't change in the future, we > +are aware of these users, and will try to avoid changing it > +willy-nilly. Furthermore, we make the following promises: > + > +* The default gzip compression tool will continue to be gzip(1). If > + you rely on this being e.g. GNU gzip for the purposes of stability, > + it's up to you to ensure that its output is stable across > + versions. > ++ > + > +We in turn promise to not e.g. make the internal "git archive gzip" > +implementation the default, as it produces different ouput than > +gzip(1) in some case. I think this is fine up to here. > +* We will do our best not to change the "tar" output itself, but won't > + promise that we're never going to change it. > ++ > +If you must avoid using "git" itself for the tree validation, you > +should be checksumming the uncompressed "tar" output, not e.g. the > +compressed "tgz" output. > ++ I don't think I want to state this, because it implies that the changes I made that broke kernel.org (making tar.umask apply to pax headers) wouldn't have been allowed. We should probably just state that "we won't promise that the tar output won't change between versions". Maybe, "We won't change the tar output needlessly, but it may change from time to time." That is, we won't be "let's change the format just to mix it up for users", but if there's a valuable patch that could be applied, then we might well take it. As I said, it's my goal to provide more concrete guarantees in a future patch, probably this weekend. > +* We promise that a given version of git will emit stable "tar" output > + for the same tree ID (but not commit ID, see the discussion in the > + <<DESCRIPTION>> section above). I think that section contradicts this. 
The tree version uses the current timestamp, which would make the archive change based on the time of day. > +While you shouldn't assume that different versions of git will emit > +the same output, you can assume (e.g. for the purposes of caching) > +that a given version's output is stable. Unfortunately, this isn't actually true if someone uses export-subst. That's because adding unrelated objects can increase the length of abbreviations, and then the tar contents can be different. I've actually seen this in the wild. Modulo that, yes, I agree with this. -- brian m. carlson (he/him or they/them) Toronto, Ontario, CA [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 263 bytes --] ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 9/9] git archive docs: document output non-stability 2023-02-02 10:25 ` brian m. carlson @ 2023-02-02 10:30 ` Ævar Arnfjörð Bjarmason 2023-02-02 16:34 ` Junio C Hamano 1 sibling, 0 replies; 57+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2023-02-02 10:30 UTC (permalink / raw) To: brian m. carlson Cc: git, Junio C Hamano, Eli Schwartz, René Scharfe, Konstantin Ryabitsev, Michal Suchánek, Raymond E . Pasco, demerphq, Theodore Ts'o On Thu, Feb 02 2023, brian m. carlson wrote: >> +* We will do our best not to change the "tar" output itself, but won't >> + promise that we're never going to change it. >> ++ >> +If you must avoid using "git" itself for the tree validation, you >> +should be checksumming the uncompressed "tar" output, not e.g. the >> +compressed "tgz" output. >> ++ > > I don't think I want to state this, because it implies that the changes > I made that broke kernel.org (making tar.umask apply to pax headers) > wouldn't have been allowed. I don't see how "we'll do our best, but it might change" precludes that... > We should probably just state that "we > won't promise that the tar output won't change between versions". Maybe, ...but it sounds like you'd like this "softer" promise. I think it's saying the same, but picked the "we'll try not to" wording because I think it more accurately reflects reality, but... > "We won't change the tar output needlessly, but it may change from time > to time." That is, we won't be "let's change the format just to mix it > up for users", but if there's a valuable patch that could be applied, > then we might well take it. ...here we're back (at least per my reading) to basically what my proposed patch said. I'm happy to improve/change the wording, but I'm confused about the "because it implies" part you noted. > As I said, it's my goal to provide more concrete guarantees in a future > patch, probably this weekend. 
I think that would be great, but also think that if we're going to make new guarantees it's probably best applied on top of a series such as this, which aside from the reverting back to gzip as the default attempts to clarify the status quo. > >> +* We promise that a given version of git will emit stable "tar" output >> + for the same tree ID (but not commit ID, see the discussion in the >> + <<DESCRIPTION>> section above). > > I think that section contradicts this. The tree version uses the > current timestamp, which would make the archive change based on the time > of day. Thanks! It's referring back to the previous discussion, but I managed to somehow get the tree & commit cases reversed. >> +While you shouldn't assume that different versions of git will emit >> +the same output, you can assume (e.g. for the purposes of caching) >> +that a given version's output is stable. > > Unfortunately, this isn't actually true if someone uses export-subst. > That's because adding unrelated objects can increase the length of > abbreviations, and then the tar contents can be different. I've > actually seen this in the wild. > > Modulo that, yes, I agree with this. I didn't know about the export-subst case, I'll add that caveat in there. Thanks! ^ permalink raw reply [flat|nested] 57+ messages in thread
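To illustrate the export-subst caveat discussed above: an abbreviated object name only needs to be long enough to be unique, so adding an unrelated object that shares a prefix can lengthen it. The following is a minimal sketch in plain Python, not git's actual algorithm (git's real minimum length is longer and scales with repository size). The helper name `shortest_unique_prefix`, the minimum of 4, and the object IDs are all made up for illustration.

```python
def shortest_unique_prefix(target, all_ids, minimum=4):
    """Return the shortest prefix of `target` (at least `minimum` chars)
    that no other ID in `all_ids` starts with."""
    others = [i for i in all_ids if i != target]
    for length in range(minimum, len(target) + 1):
        prefix = target[:length]
        if not any(o.startswith(prefix) for o in others):
            return prefix
    return target


# Hypothetical object IDs, shortened to 12 hex digits for readability.
commit = "1a2b3c4d5e6f"
repo_before = [commit, "9f8e7d6c5b4a"]
# An unrelated object is added that happens to share a 6-digit prefix.
repo_after = repo_before + ["1a2b3c99aaaa"]

abbrev_before = shortest_unique_prefix(commit, repo_before)
abbrev_after = shortest_unique_prefix(commit, repo_after)
print(abbrev_before, abbrev_after)
```

A file that expands something like `$Format:%h$` under export-subst would then contain a different string before and after, so the tar bytes differ even though the archived commit and the git version are the same.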
* Re: [PATCH 9/9] git archive docs: document output non-stability 2023-02-02 10:25 ` brian m. carlson 2023-02-02 10:30 ` Ævar Arnfjörð Bjarmason @ 2023-02-02 16:34 ` Junio C Hamano 2023-02-04 17:46 ` brian m. carlson 1 sibling, 1 reply; 57+ messages in thread From: Junio C Hamano @ 2023-02-02 16:34 UTC (permalink / raw) To: brian m. carlson Cc: Ævar Arnfjörð Bjarmason, git, Eli Schwartz, René Scharfe, Konstantin Ryabitsev, Michal Suchánek, Raymond E . Pasco, demerphq, Theodore Ts'o "brian m. carlson" <sandals@crustytoothpaste.net> writes: >> +* We will do our best not to change the "tar" output itself, but won't >> + promise that we're never going to change it. >> ++ >> +If you must avoid using "git" itself for the tree validation, you >> +should be checksumming the uncompressed "tar" output, not e.g. the >> +compressed "tgz" output. >> ++ > > I don't think I want to state this, because it implies that the changes > I made that broke kernel.org (making tar.umask apply to pax headers) > wouldn't have been allowed. We should probably just state that "we > won't promise that the tar output won't change between versions". Maybe, > "We won't change the tar output needlessly, but it may change from time > to time." That is, we won't be "let's change the format just to mix it > up for users", but if there's a valuable patch that could be applied, > then we might well take it. I agree with you. Giving "will do our best not to" is still too strong for that. We won't change the format willy-nilly but when there is a good reason to do so, we should be able to fix or improve the output. >> +While you shouldn't assume that different versions of git will emit >> +the same output, you can assume (e.g. for the purposes of caching) >> +that a given version's output is stable. > > Unfortunately, this isn't actually true if someone uses export-subst. > That's because adding unrelated objects can increase the length of > abbreviations, and then the tar contents can be different. 
I've > actually seen this in the wild. "subst" is certainly an issue, especially when the substitution is unstable. There shouldn't be cross platform differences to break bit-for-bit stability at least for "tar" format, as we do not rely on any external library. Can we say the same for "zip"? I thought we throw the blob at git_deflate_*() so the exact bitstream is up to the libz implementation? ^ permalink raw reply [flat|nested] 57+ messages in thread
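The quoted advice to checksum the uncompressed "tar" stream rather than the "tgz" can be sketched as follows. This uses only Python's standard library and stands in for "git archive" with a hand-built tarball, so it is an illustration of the idea under those assumptions, not of git's own output. The helper names `make_tar` and `tar_digest_of_tgz` and the file contents are hypothetical.

```python
import gzip
import hashlib
import io
import tarfile


def make_tar(files):
    """Build a tar from {name: bytes} with fixed metadata (mtime 0)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
        for name in sorted(files):
            info = tarfile.TarInfo(name)
            info.size = len(files[name])
            info.mtime = 0
            tar.addfile(info, io.BytesIO(files[name]))
    return buf.getvalue()


def sha256(data):
    return hashlib.sha256(data).hexdigest()


def tar_digest_of_tgz(tgz_bytes):
    """Checksum the decompressed tar stream, ignoring how it was compressed."""
    return sha256(gzip.decompress(tgz_bytes))


tar_bytes = make_tar({
    "README": b"hello\n" * 1000,
    "src/main.c": b"int main(void) { return 0; }\n",
})

# Two "compressors" that disagree on the compressed bytes.
tgz_a = gzip.compress(tar_bytes, compresslevel=1, mtime=0)
tgz_b = gzip.compress(tar_bytes, compresslevel=9, mtime=0)

print(sha256(tgz_a) == sha256(tgz_b))                        # compressed checksums
print(tar_digest_of_tgz(tgz_a) == tar_digest_of_tgz(tgz_b))  # tar checksums
```

The compressed checksums differ while the tar checksums agree, which is why a verifier that decompresses first is immune to a change of gzip implementation. It is still exposed to any change in the tar stream itself, as the thread notes.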
* Re: [PATCH 9/9] git archive docs: document output non-stability 2023-02-02 16:34 ` Junio C Hamano @ 2023-02-04 17:46 ` brian m. carlson 0 siblings, 0 replies; 57+ messages in thread From: brian m. carlson @ 2023-02-04 17:46 UTC (permalink / raw) To: Junio C Hamano Cc: Ævar Arnfjörð Bjarmason, git, Eli Schwartz, René Scharfe, Konstantin Ryabitsev, Michal Suchánek, Raymond E . Pasco, demerphq, Theodore Ts'o On 2023-02-02 at 16:34:44, Junio C Hamano wrote: > There shouldn't be cross platform differences to break bit-for-bit > stability at least for "tar" format, as we do not rely on any > external library. Can we say the same for "zip"? I thought we > throw the blob at git_deflate_*() so the exact bitstream is up to > the libz implementation? That's also true. There, we can't use gzip, so we do whatever libz does. For Zip, I believe we embed a local timestamp, so the output is also dependent on the time zone. I don't know enough about the Zip format to say if there are any other things that may vary. -- brian m. carlson (he/him or they/them) Toronto, Ontario, CA ^ permalink raw reply [flat|nested] 57+ messages in thread
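The time zone dependence mentioned for Zip follows from the format's classic DOS date/time field, which stores a wall-clock time with no zone and two-second resolution. The sketch below assumes, as brian believes, that the writer converts the commit time to local time before storing it. The function name `dos_datetime_fields` and the example timestamp are made up, and this is not git's code.

```python
from datetime import datetime, timedelta, timezone


def dos_datetime_fields(epoch_seconds, utc_offset_hours):
    """Return the (year, month, day, hour, minute, second) a zip writer
    would store if it converted the timestamp to the given local zone."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    local = datetime.fromtimestamp(epoch_seconds, tz)
    # The DOS time field counts seconds in steps of two: round down.
    return (local.year, local.month, local.day,
            local.hour, local.minute, local.second - local.second % 2)


commit_time = 1675355684  # 2023-02-02 16:34:44 UTC, an arbitrary example

stored_in_utc = dos_datetime_fields(commit_time, 0)
stored_in_toronto = dos_datetime_fields(commit_time, -5)
print(stored_in_utc, stored_in_toronto)
```

The same commit archived on two machines with different zone settings would thus carry different header bytes, independent of anything the deflate implementation does.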
* Re: [PATCH 0/9] git archive: use gzip again by default, document output stabilty 2023-02-02 9:32 ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Ævar Arnfjörð Bjarmason ` (8 preceding siblings ...) 2023-02-02 9:32 ` [PATCH 9/9] git archive docs: document output non-stability Ævar Arnfjörð Bjarmason @ 2023-02-02 16:17 ` Phillip Wood 2023-02-02 16:40 ` Junio C Hamano ` (2 more replies) 2023-02-02 16:25 ` Junio C Hamano 2023-02-02 19:23 ` Raymond E. Pasco 11 siblings, 3 replies; 57+ messages in thread From: Phillip Wood @ 2023-02-02 16:17 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason, git Cc: Junio C Hamano, Eli Schwartz, René Scharfe, brian m . carlson, Konstantin Ryabitsev, Michal Suchánek, Raymond E . Pasco, demerphq, Theodore Ts'o Hi Ævar On 02/02/2023 09:32, Ævar Arnfjörð Bjarmason wrote: > As reported in > https://lore.kernel.org/git/a812a664-67ea-c0ba-599f-cb79e2d96694@gmail.com/ > changing the default "tgz" output method of from "gzip(1)" to our > internal "git archive gzip" (using zlib ) broke things for users in > the wild that assume that the "git archive" output is stable, most > notably GitHub: https://github.com/orgs/community/discussions/45830 > > Leaving aside the larger question of whether we're going to promise > output stability for "git archive" in general, the motivation for that > change was to have a working compression method on systems that lacked > a gzip(1). As I recall the reduction in cpu time used to create a compressed archive was a factor in making it the default. > As the disruption of changing the default isn't worth it, let's use > gzip(1) again by default, and only fall back on the new "git archive > gzip" if it isn't available. 
Playing devil's advocate for a moment as we're not going to promise that the compressed output of "git archive" will be stable in the future perhaps we should use this breakage as an opportunity to highlight that to users and to advertize the config setting that allows them to use gzip for compressing archives. Reverting the change gives the misleading impression that we're making a commitment to keeping the output stable. The focus of this thread seems to be the problems relating to github which they have already addressed. I think there is general agreement that it is not practical to promise that the compressed output of "git archive" is stable so maybe it is better to make that clear now while users can work around it in the short term with a config setting rather than waiting until we're faced with some security or other issue that forces a change to the output which users cannot work around so easily. Best Wishes Phillip > The later parts of this series then document and test for the output > stability of the command. > > We're not promising anything new there, except that we now promise > that we're going to use "gzip" as the default compressor, but that > it's up to that command to be stable, should the user desire output > stability. > > The documentation discusses the various caveats involved, suggests > alternatives to checksumming compressed archives, but in the end notes > what's been the policy so far: We're not promising that the "tar" > output is going to be stable. > > The early parts of this series (1-2/9) are clean-up for existing > config drift, as later in the series we'll otherwise need to change > the divergent config documentation in two places. 
> > CI & branch for this at: > https://github.com/avar/git/tree/avar/archive-internal-gzip-not-the-default > > Ævar Arnfjörð Bjarmason (9): > archive & tar config docs: de-duplicate configuration section > git config docs: document "tar.<format>.{command,remote}" > archiver API: make the "flags" in "struct archiver" an enum > archive: omit the shell for built-in "command" filters > archive-tar.c: move internal gzip implementation to a function > archive: use "gzip -cn" for stability, not "git archive gzip" > test-lib.sh: add a lazy GZIP prerequisite > archive tests: test for "gzip -cn" and "git archive gzip" stability > git archive docs: document output non-stability > > Documentation/config/tar.txt | 29 +++++++- > Documentation/git-archive.txt | 96 +++++++++++++++++++------- > archive-tar.c | 78 ++++++++++++++------- > archive.h | 11 +-- > t/t5000-tar-tree.sh | 2 - > t/t5005-archive-stability.sh | 70 +++++++++++++++++++ > t/t5562-http-backend-content-length.sh | 2 - > t/test-lib.sh | 4 ++ > 8 files changed, 231 insertions(+), 61 deletions(-) > create mode 100755 t/t5005-archive-stability.sh > ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 0/9] git archive: use gzip again by default, document output stabilty 2023-02-02 16:17 ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Phillip Wood @ 2023-02-02 16:40 ` Junio C Hamano 2023-02-03 13:49 ` Ævar Arnfjörð Bjarmason 2023-02-03 15:47 ` Theodore Ts'o 2 siblings, 0 replies; 57+ messages in thread From: Junio C Hamano @ 2023-02-02 16:40 UTC (permalink / raw) To: Phillip Wood Cc: Ævar Arnfjörð Bjarmason, git, Eli Schwartz, René Scharfe, brian m . carlson, Konstantin Ryabitsev, Michal Suchánek, Raymond E . Pasco, demerphq, Theodore Ts'o Phillip Wood <phillip.wood123@gmail.com> writes: > ... Reverting the change > gives the misleading impression that we're making a commitment to > keeping the output stable. The focus of this thread seems to be the > problems relating to github which they have already addressed. > > I think there is general agreement that it is not practical to promise > that the compressed output of "git archive" is stable so maybe it is > better to make that clear now while users can work around it in the > short term with a config setting rather than waiting until we're faced > with some security or other issue that forces a change to the output > which users cannot work around so easily. I love to see somebody else play the devil's advocate role. Thanks for all of the above. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 0/9] git archive: use gzip again by default, document output stabilty 2023-02-02 16:17 ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Phillip Wood 2023-02-02 16:40 ` Junio C Hamano @ 2023-02-03 13:49 ` Ævar Arnfjörð Bjarmason 2023-02-06 14:46 ` Phillip Wood 2023-02-03 15:47 ` Theodore Ts'o 2 siblings, 1 reply; 57+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2023-02-03 13:49 UTC (permalink / raw) To: phillip.wood Cc: git, Junio C Hamano, Eli Schwartz, René Scharfe, brian m . carlson, Konstantin Ryabitsev, Michal Suchánek, Raymond E . Pasco, demerphq, Theodore Ts'o On Thu, Feb 02 2023, Phillip Wood wrote: > On 02/02/2023 09:32, Ævar Arnfjörð Bjarmason wrote: >> As reported in >> https://lore.kernel.org/git/a812a664-67ea-c0ba-599f-cb79e2d96694@gmail.com/ >> changing the default "tgz" output method of from "gzip(1)" to our >> internal "git archive gzip" (using zlib ) broke things for users in >> the wild that assume that the "git archive" output is stable, most >> notably GitHub: https://github.com/orgs/community/discussions/45830 >> >> Leaving aside the larger question of whether we're going to promise >> output stability for "git archive" in general, the motivation for that >> change was to have a working compression method on systems that lacked >> a gzip(1). > > As I recall the reduction in cpu time used to create a compressed > archive was a factor in making it the default. I read those references in 76d7602631a (archive-tar: add internal gzip implementation, 2022-06-15) more of a "it's not [much] slower", the flip to the default in 4f4be00d302 (archive-tar: use internal gzip by default, 2022-06-15) didn't discuss it. So I didn't think it was important enough to mention (even though we're now back to the faster "gzip" method). >> As the disruption of changing the default isn't worth it, let's use >> gzip(1) again by default, and only fall back on the new "git archive >> gzip" if it isn't available. 
> > Playing devil's advocate for a moment as we're not going to promise > that the compressed output of "git archive" will be stable in the > future perhaps we should use this breakage as an opportunity to > highlight that to users and to advertize the config setting that > allows them to use gzip for compressing archives. If we were trying to intentionally break things for those users we could do a lot better than "git archive gzip", whose output is mostly the same as "gzip"; we could tweak one of the headers to make it different all the time. But I think it's better to advocate for such intentional chaos-monkeying as a follow-up to this more conservative "oops, we broke stuff, it's easy not to break it, so let's not do it". > Reverting the change gives the misleading impression that we're making > a commitment to keeping the output stable. I don't see how you can conclude that from this series. It explicitly states that we make no such promises; what it does is go back to allowing the gzip(1) command to make its own promises. > The focus of this thread seems to be the > problems relating to github which they have already addressed. Which they've addressed by reverting the change, but while they're a major user of git they're not the only one. They just happened to use "git archive". I think it would be a mistake to conclude that everyone who's run into this has already done so, or is aware of it. > I think there is general agreement that it is not practical to promise > that the compressed output of "git archive" is stable so maybe it is > better[...] ...better than what? This seems to imply that this series is making new promises about the output stability, which it isn't doing. > [...]to make that clear now while users can work around it in the > short term with a config setting rather than waiting until we're faced > with some security or other issue that forces a change to the output > which users cannot work around so easily.
I think it's always been clear that you can use that setting. For ages we've been saying: The `tar.gz` and `tgz` formats are defined automatically and use the command `gzip -cn` by default. Then v2.38.0 changed it to: [...] magic command `git archive gzip` by default Which IMO was easily missed among other "Performance, Internal Implementation, Development Support etc." items in the release notes, which said: Teach "git archive" to (optionally and then by default) avoid spawning an external "gzip" process when creating ".tar.gz" (and ".tgz") archives. But I agree that all of this is subjective. To me a 2% reduction in CPU use (at the cost of a ~20% increase in wallclock) and the benefits of teaching users that they can't rely on our "gzip" output seem unclear or hypothetical. Whereas the widespread breakage reported is very real, and we should consider GitHub as a canary for that, not the start & end of its potential impact. As we didn't have a strong reason to change this in the first place (and as my series shows, we can have our cake & eat it too if we don't have a "gzip") I think the obvious choice is to go back to using "gzip". ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 0/9] git archive: use gzip again by default, document output stabilty 2023-02-03 13:49 ` Ævar Arnfjörð Bjarmason @ 2023-02-06 14:46 ` Phillip Wood 0 siblings, 0 replies; 57+ messages in thread From: Phillip Wood @ 2023-02-06 14:46 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: git, Junio C Hamano, Eli Schwartz, René Scharfe, brian m . carlson, Konstantin Ryabitsev, Michal Suchánek, Raymond E . Pasco, demerphq, Theodore Ts'o On 03/02/2023 13:49, Ævar Arnfjörð Bjarmason wrote: > > On Thu, Feb 02 2023, Phillip Wood wrote: >> Reverting the change gives the misleading impression that we're making >> a commitment to keeping the output stable. > > I don't see how you can conclude that from this series. It explicitly > states that we make no such promises, what it does is go back to > allowing the gzip(1) command to make its own promises. This series would not be happening if we were not reverting a change to the compressed output of 'git archive'. The documentation updates are very welcome but I think we're undermining the message that the compressed output can change by reverting that change. >> The focus of this thread seems to be the >> problems relating to github which they have already addressed. > > Which they've addressed by reverting the change, but while they're a > major user of git they're not the only one. They just happened to use > "git archive". > > I think it would be a mistake to conclude that everyone who's run into > this has already done so, or is aware of it. I've spent some time trying to find reports of problems caused by this change and have not seen anything apart from the issue with GitHub. Although it takes a while for new versions of git to get into linux distributions if there is a widespread problem we normally hear about it pretty quickly. This change has been in two releases now. 
If anyone does have a problem there is an easy fix in the form of setting tar.<format>.command. >> I think there is general agreement that it is not practical to promise >> that the compressed output of "git archive" is stable so maybe it is >> better[...] > > ...better than what? This seems to imply that this series is making new > promises about the output stability, which it isn't doing. It's better that people realize now that they cannot rely on the output being stable, while they can safely work around the problem and work on a proper fix, rather than waiting until the change in output is caused by a security issue in gzip, which would mean the workaround is no longer safe. Best Wishes Phillip >> [...]to make that clear now while users can work around it in the >> short term with a config setting rather than waiting until we're faced >> with some security or other issue that forces a change to the output >> which users cannot work around so easily. > > I think it's always been clear that you can use that setting. For ages > we've been saying: > > The `tar.gz` and `tgz` formats are defined automatically and use the > command `gzip -cn` by default. > > Then v2.38.0 changed it to: > > [...] > magic command `git archive gzip` by default > > Which IMO was easily missed among other "Performance, Internal > Implementation, Development Support etc." items in the release notes, > which said: > > Teach "git archive" to (optionally and then by default) avoid > spawning an external "gzip" process when creating ".tar.gz" (and > ".tgz") archives. > > But I agree that all of this is subjective. To me a 2% reduction in CPU > use (at the cost of ~20% increase in wallclock) & some unclear benefits > to teaching users that they can't rely on our "gzip" output seems > unclear or hypothetical. > > Whereas the widespread breakage reported is very real, Where are the reports of widespread breakage outside of GitHub?
> and we should > consider GitHub as a canary for that, not the start & end of its > potential impact. > > As we didn't have a strong reason to change this in the first place (and > as my series shows, we can have our cake & eat it too if we don't have a > "gzip") I think the obvious choice is to go back to using "gzip". ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 0/9] git archive: use gzip again by default, document output stabilty 2023-02-02 16:17 ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Phillip Wood 2023-02-02 16:40 ` Junio C Hamano 2023-02-03 13:49 ` Ævar Arnfjörð Bjarmason @ 2023-02-03 15:47 ` Theodore Ts'o 2 siblings, 0 replies; 57+ messages in thread From: Theodore Ts'o @ 2023-02-03 15:47 UTC (permalink / raw) To: phillip.wood Cc: Ævar Arnfjörð Bjarmason, git, Junio C Hamano, Eli Schwartz, René Scharfe, brian m . carlson, Konstantin Ryabitsev, Michal Suchánek, Raymond E . Pasco, demerphq On Thu, Feb 02, 2023 at 04:17:09PM +0000, Phillip Wood wrote: > Playing devil's advocate for a moment as we're not going to promise that the > compressed output of "git archive" will be stable in the future perhaps we > should use this breakage as an opportunity to highlight that to users and to > advertize the config setting that allows them to use gzip for compressing > archives. Reverting the change gives the misleading impression that we're > making a commitment to keeping the output stable. The focus of this thread > seems to be the problems relating to github which they have already > addressed. > > I think there is general agreement that it is not practical to promise that > the compressed output of "git archive" is stable so maybe it is better to > make that clear now while users can work around it in the short term with a > config setting rather than waiting until we're faced with some security or > other issue that forces a change to the output which users cannot work > around so easily. I would be in favor of adding a config option that allows using the internal gzip option, although leaving the default as it was to keep things compatible. The reason for that is that it should be easy for a forge provider such as GitHub to break things, deliberately. Sound insane? Hear me out. At $WORK, we have a highly reliable system, Paxos. It is a highly fault-tolerant system, so it rarely fails.
But "rarely fails" is not the same as "never fails". And hopefully, things should degrade gracefully if there is a Paxos outage. But as the Google SRE's are fond of saying, "Hope is not a strategy". So periodically, the people who run the Paxos service will deliberately force downtime for a short amount of time. The fact that they will do this is well advertised, and scheduled ahead of time --- and teams responsible for user-facing services are supposed to make sure that end-users don't notice when this happens. Maybe they won't be able to update configurations as easily while Paxos is down, but it shouldn't cause a user-visible outage. So what I would recommend to the GitHub product manager is that once a quarter, on a well-advertised date, they flip the switch and break the git archive checksums for say, an hour. Then next quarter, they advertise that the switch will be thrown for 2 hours, doubling each time, until it is ramped up to 16 hours. This will provide the necessary nudge so that all of these badly designed systems that depend on downloaded archives of arbitrary git hubs to be stable will rethink their position, while minimizing the end-user customer impact. Otherwise, I predict that Bazel, homebrew, etc will continue to rely on this ill-considered assumption, and at some point in the future, when we *do* have a much better reason to want to make a change to the tar or compression algorithm, all of these end users will once again scream bloody murder. Of course, this is going to be up to each forge provider to decide whether they want to do this. But we can make it easy for them to do this thing, and I'd argue it is in our interest to make it easy for them to do this. Otherwise we'll get constrained in the future by the fear of massive user blowback, no matter what we say in our documentation regarding "no promises --- and next time, we really mean it!" - Ted ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 0/9] git archive: use gzip again by default, document output stabilty 2023-02-02 9:32 ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Ævar Arnfjörð Bjarmason ` (9 preceding siblings ...) 2023-02-02 16:17 ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Phillip Wood @ 2023-02-02 16:25 ` Junio C Hamano 2023-02-04 18:08 ` René Scharfe 2023-02-02 19:23 ` Raymond E. Pasco 11 siblings, 1 reply; 57+ messages in thread From: Junio C Hamano @ 2023-02-02 16:25 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: git, Eli Schwartz, René Scharfe, brian m . carlson, Konstantin Ryabitsev, Michal Suchánek, Raymond E . Pasco, demerphq, Theodore Ts'o Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: > As the disruption of changing the default isn't worth it, let's use > gzip(1) again by default, and only fall back on the new "git archive > gzip" if it isn't available. It perhaps is OK, and lets us answer "ugh, the compressed output of 'git archive' is unstable again" with "we didn't change anything, perhaps you changed your gzip(1)?" when they fix bugs or improve compression or whatever. Of course that is not an overall win for the end users, but in the short term until gzip gets such a change, we would presumably get the "same" output as before. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 0/9] git archive: use gzip again by default, document output stabilty 2023-02-02 16:25 ` Junio C Hamano @ 2023-02-04 18:08 ` René Scharfe 2023-02-05 21:30 ` Ævar Arnfjörð Bjarmason 0 siblings, 1 reply; 57+ messages in thread From: René Scharfe @ 2023-02-04 18:08 UTC (permalink / raw) To: Junio C Hamano, Ævar Arnfjörð Bjarmason Cc: git, Eli Schwartz, brian m . carlson, Konstantin Ryabitsev, Michal Suchánek, Raymond E . Pasco, demerphq, Theodore Ts'o Am 02.02.23 um 17:25 schrieb Junio C Hamano: > Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: > >> As the disruption of changing the default isn't worth it, let's use >> gzip(1) again by default, and only fall back on the new "git archive >> gzip" if it isn't available. > > It perhaps is OK, and lets us answer "ugh, the compressed output of > 'git archive' is unstable again" with "we didn't change anything, > perhaps you changed your gzip(1)?" when they fix bugs or improve > compression or whatever. Of course that is not an overall win for > the end users, but in the short term until gzip gets such a change, > we would presumably get the "same" output as before. Restoring the old default is an understandable reflex. In theory it worsens consistency and stability of the output, but in practice using whatever was found in $PATH did work before -- or at least it was not our problem if it didn't. Are there still people left that would benefit from such a step back, however? As far as I understand forges like GitHub relied on git archive producing the same tgz output across versions. That assumption was violated, trust lost. They had to learn about the configuration option tar.tgz.command or find some other way to cope. Changing the default again won't undo that. René ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 0/9] git archive: use gzip again by default, document output stabilty 2023-02-04 18:08 ` René Scharfe @ 2023-02-05 21:30 ` Ævar Arnfjörð Bjarmason 2023-02-12 17:41 ` René Scharfe 0 siblings, 1 reply; 57+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2023-02-05 21:30 UTC (permalink / raw) To: René Scharfe Cc: Junio C Hamano, git, Eli Schwartz, brian m . carlson, Konstantin Ryabitsev, Michal Suchánek, Raymond E . Pasco, demerphq, Theodore Ts'o On Sat, Feb 04 2023, René Scharfe wrote: > Am 02.02.23 um 17:25 schrieb Junio C Hamano: >> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: >> >>> As the disruption of changing the default isn't worth it, let's use >>> gzip(1) again by default, and only fall back on the new "git archive >>> gzip" if it isn't available. >> >> It perhaps is OK, and lets us answer "ugh, the compressed output of >> 'git archive' is unstable again" with "we didn't change anything, >> perhaps you changed your gzip(1)?" when they fix bugs or improve >> compression or whatever. Of course that is not an overall win for >> the end users, but in the short term until gzip gets such a change, >> we would presumably get the "same" output as before. > > Restoring the old default is an understandable reflex. In theory it > worsens consistency and stability of the output, but in practice using > whatever was found in $PATH did work before -- or at least it was not > our problem if it didn't. "In theory" because the user might be flip-flopping between different gzip(1) versions? > Are there still people left that would benefit from such a step back, > however? As far as I understand forges like GitHub relied on git > archive producing the same tgz output across versions. That assumption > was violated, trust lost. They had to learn about the configuration > option tar.tgz.command or find some other way to cope. Changing the > default again won't undo that. 
I think it's safe to assume that git is used by enough users that anything breaking at a major hosting provider is likely to have a very long tail in the wild, almost all of which we'll never see in "this broke for me" reports to this ML. So no, that ship has clearly sailed for GitHub, but this series aims to address more than that. Even if it wasn't for that breakage, I think 4/9 and 6/9 here show the main problem you were trying to solve in making "git archive gzip" the default didn't need to be solved by changing the default. I.e. the aim was to have it work when "gzip(1)" wasn't available, which we can do by falling back only if we can't invoke it, rather than changing the long-standing default. ^ permalink raw reply [flat|nested] 57+ messages in thread
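The "fall back only if we can't invoke it" behavior described above can be sketched roughly as follows. This is a Python stand-in for the idea, not the code in the series: the function name `compress_tgz` is hypothetical, Python's `gzip` module plays the role of the internal zlib-based implementation, and the only external command assumed is `gzip -cn`, the default named in the thread.

```python
import gzip
import shutil
import subprocess


def compress_tgz(tar_bytes):
    """Prefer an external gzip(1); fall back to the in-process compressor.

    Returns (compressed_bytes, which), where `which` records the path
    taken, since the two compressors need not agree byte for byte.
    """
    gzip_path = shutil.which("gzip")
    if gzip_path is not None:
        try:
            result = subprocess.run(
                [gzip_path, "-cn"],  # -c: write to stdout, -n: no name/timestamp
                input=tar_bytes,
                stdout=subprocess.PIPE,
                check=True,
            )
            return result.stdout, "external"
        except (OSError, subprocess.CalledProcessError):
            pass  # found on $PATH but could not be run; fall through
    # mtime=0 keeps the gzip header free of the current time.
    return gzip.compress(tar_bytes, mtime=0), "internal"


payload = b"example tar stream " * 100
compressed, which = compress_tgz(payload)
print(which, len(compressed))
```

Returning which path was taken makes the trade-off visible: the result always decompresses to the same tar stream, but the compressed bytes can depend on whether, and which, gzip happens to be installed.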
* Re: [PATCH 0/9] git archive: use gzip again by default, document output stabilty 2023-02-05 21:30 ` Ævar Arnfjörð Bjarmason @ 2023-02-12 17:41 ` René Scharfe 0 siblings, 0 replies; 57+ messages in thread From: René Scharfe @ 2023-02-12 17:41 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: Junio C Hamano, git, Eli Schwartz, brian m . carlson, Konstantin Ryabitsev, Michal Suchánek, Raymond E . Pasco, demerphq, Theodore Ts'o Am 05.02.23 um 22:30 schrieb Ævar Arnfjörð Bjarmason: > > On Sat, Feb 04 2023, René Scharfe wrote: > >> Am 02.02.23 um 17:25 schrieb Junio C Hamano: >>> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: >>> >>>> As the disruption of changing the default isn't worth it, let's use >>>> gzip(1) again by default, and only fall back on the new "git archive >>>> gzip" if it isn't available. >>> >>> It perhaps is OK, and lets us answer "ugh, the compressed output of >>> 'git archive' is unstable again" with "we didn't change anything, >>> perhaps you changed your gzip(1)?" when they fix bugs or improve >>> compression or whatever. Of course that is not an overall win for >>> the end users, but in the short term until gzip gets such a change, >>> we would presumably get the "same" output as before. >> >> Restoring the old default is an understandable reflex. In theory it >> worsens consistency and stability of the output, but in practice using >> whatever was found in $PATH did work before -- or at least it was not >> our problem if it didn't. > > "In theory" because the user might be flip-flopping between different > gzip(1) versions? No flopping needed. We can't control what's in $PATH. There are OS-specific replacements for GNU gzip in NetBSD/FreeBSD/macOS and OpenBSD. People could use pigz. Or cat, for that matter. Different versions of different tools might produce different output. There are alternative to the original libz as well, e.g. libz-ng. 
We don't control which one or which version is installed, either, but we could do so if we wanted by importing one of them like we did with LibXDiff. > Even if it wasn't for that breakage, I think 4/9 and 6/9 here show the > main problem you were trying to solve in making "git archive gzip" the > default didn't need to be solved by changing the default. I.e. the aim > was to have it work when "gzip(1)" wasn't available, which we can do by > falling back only if we can't invoke it, rather than changing the > long-standing default. The aim was to no longer depend on gzip. That goal was already met by providing the internal implementation, without changing the default. Git for Windows for example could use it in their config and drop gzip. Calling gzip if available, warning if it isn't and using the internal implementation adds yet more variance. No longer allowing gzip to be a shell alias might confuse someone. The automatic fallback would only benefit users that don't want to touch /etc/gitconfig, have nobody to do it for them and don't care about warnings -- hopefully not a big crowd. I didn't intend the change of default to be that painful, but don't see the point in going back now that we're through. The new default is better -- one less dependency to care about. If we do need to go back, however, then a known-good state makes more sense than a smart fallback with some new twists. René ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 0/9] git archive: use gzip again by default, document output stabilty 2023-02-02 9:32 ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Ævar Arnfjörð Bjarmason ` (10 preceding siblings ...) 2023-02-02 16:25 ` Junio C Hamano @ 2023-02-02 19:23 ` Raymond E. Pasco 2023-02-03 8:06 ` [PATCH] archive: document output stability concerns Raymond E. Pasco 11 siblings, 1 reply; 57+ messages in thread From: Raymond E. Pasco @ 2023-02-02 19:23 UTC (permalink / raw) To: phillip.wood, Ævar Arnfjörð Bjarmason, git Cc: Junio C Hamano, Eli Schwartz, René Scharfe, brian m . carlson, Konstantin Ryabitsev, Michal Suchánek, demerphq, Theodore Ts'o February 2, 2023 11:17 AM, "Phillip Wood" <phillip.wood123@gmail.com> wrote: > Playing devil's advocate for a moment as we're not going to promise that the compressed output of > "git archive" will be stable in the future perhaps we should use this breakage as an opportunity to > highlight that to users and to advertize the config setting that allows them to use gzip for > compressing archives. Reverting the change gives the misleading impression that we're making a > commitment to keeping the output stable. The focus of this thread seems to be the problems relating > to github which they have already addressed. > > I think there is general agreement that it is not practical to promise that the compressed output > of "git archive" is stable so maybe it is better to make that clear now while users can work around > it in the short term with a config setting rather than waiting until we're faced with some security > or other issue that forces a change to the output which users cannot work around so easily. Reverting to the behavior of "use some arbitrary gzip from $PATH" would be a poor decision whether or not git were willing to make some commitment to gzip stability, because Git does not control arbitrary gzips on the user's $PATH. 
If Git did want to promise gzip stability, it could only start from something like the current internal implementation along with a vendored zlib; if it doesn't, as appears to be the case, then the internal implementation is superior for the other reasons already discussed. If the user wants to depend on a particular gzip executable they supply, this configuration knob already exists for them. Since there is no guarantee of stability, but there has been a popular misconception that there is some such guarantee (e.g., [1]), some kind of STABILITY section describing how there isn't any and suggesting ways the user can attain more stability via configuration seems to be a good idea. [1]: https://lists.reproducible-builds.org/pipermail/rb-general/2021-October/002422.html ^ permalink raw reply [flat|nested] 57+ messages in thread
* [PATCH] archive: document output stability concerns 2023-02-02 19:23 ` Raymond E. Pasco @ 2023-02-03 8:06 ` Raymond E. Pasco 0 siblings, 0 replies; 57+ messages in thread
From: Raymond E. Pasco @ 2023-02-03 8:06 UTC (permalink / raw)
To: ray
Cc: avarab, demerphq, eschwartz93, git, gitster, konstantin, l.s.r, msuchanek, phillip.wood, sandals, tytso

In 4f4be00d302 (archive-tar: use internal gzip by default), the 'git
archive' command switched to using an internal compression filter
implemented with zlib rather than invoking a 'gzip' binary, for the
'.tar.gz' / '.tgz' output formats. This change brought to light a
common misconception that the output of 'git archive' is intended to be
byte-for-byte stable.

While this is not the case, stable archive output is desirable for many
applications; we discuss concerns related to output stability and
suggest ways in which the user can control the compression used with
the "tar.<format>.command" configuration option.

Signed-off-by: Raymond E. Pasco <ray@ameretat.dev>
---
I think that something along these lines should be included in the
docs, but that the behavior should be kept the same. If it is decided
later to stabilize output, e.g. by vendoring a blessed zlib version
forever, the current state as of 2.38 is the best starting point; and
reverting a useful change because of external breakage which already
has a solution, while also promising instability, seems like a poor
choice.

 Documentation/git-archive.txt | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/Documentation/git-archive.txt b/Documentation/git-archive.txt
index 60c040988b..77acdacdf8 100644
--- a/Documentation/git-archive.txt
+++ b/Documentation/git-archive.txt
@@ -178,6 +178,41 @@ appropriate export-ignore in its `.gitattributes`), adjust the checked out
 option. Alternatively you can keep necessary attributes that should apply
 while archiving any tree in your `$GIT_DIR/info/attributes` file.
 
+[[STABILITY]]
+STABILITY
+---------
+
+'git archive' does not guarantee that precisely identical archive files
+will be produced for invocations on the same commit or tree.
+
+'git archive' uses an internal implementation of `tar` archiving
+for the `tar` format, which includes the commit ID in an extended
+pax header. For the `tgz` and `tar.gz` formats, it is augmented with
+a compression filter applied to the output, which is implemented by
+'git archive' by linking to the system zlib.
+
+If the commit ID of the "same" commit is different, for instance in the
+case of an object format migration from SHA-1 to SHA-256, the `tar`
+archive will necessarily differ due to including a different ID.
+
+The output of the compression filter is less deterministic than
+the output of the `tar` implementation, because the versions
+of zlib used may differ. The internal compression filter can be
+replaced with a particular command specified by the user using the
+`tar.<format>.command` configuration option; for instance, a particular
+gzip binary provided by the user could be specified here for consistent
+output.
+
+The `tar` format used by 'git archive' is unlikely to change
+frequently, but is not guaranteed to be completely stable; its output
+will remain identical at least within the same Git version.
+
+The `zip` format has similar concerns to the `tar.gz` and `tgz`
+formats; ZIP archiving is implemented internally, but the Deflate
+compression used relies on the linked zlib. However, because archiving
+and compression are combined into a single operation, there is no
+user-specifiable filter command for the `zip` format.
+
 EXAMPLES
 --------
 `git archive --format=tar --prefix=junk/ HEAD | (cd /var/tmp/ && tar xf -)`::
-- 
2.39.1.561.g98d13ac3e7

^ permalink raw reply related [flat|nested] 57+ messages in thread
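As a rough illustration of the `tar.<format>.command` knob that the proposed STABILITY section points to, the Python sketch below only assembles a 'git archive' command line that pins the compressor for a single invocation. The helper name `archive_argv` and the output file name are invented for this example; "gzip -cn" is, as far as I know, the filter Git used by default before 4f4be00d302.

```python
import shlex


def archive_argv(rev, output, compressor="gzip -cn"):
    """Build a 'git archive' call that pins the tar.gz compression filter."""
    return [
        "git",
        # One-off configuration override for this invocation only.
        "-c", f"tar.tar.gz.command={compressor}",
        "archive",
        "--format=tar.gz",
        f"--output={output}",
        rev,
    ]


if __name__ == "__main__":
    # Print the command in a form that can be pasted into a shell.
    print(shlex.join(archive_argv("HEAD", "snapshot.tar.gz")))
```

Pinning the command this way only moves the stability question to whichever gzip binary is named; the output is as stable as that binary is.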
* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution 2023-01-31 0:06 Stability of git-archive, breaking (?) the Github universe, and a possible solution Eli Schwartz 2023-01-31 7:49 ` Ævar Arnfjörð Bjarmason @ 2023-01-31 9:54 ` brian m. carlson 2023-01-31 11:31 ` Ævar Arnfjörð Bjarmason ` (3 more replies) 1 sibling, 4 replies; 57+ messages in thread From: brian m. carlson @ 2023-01-31 9:54 UTC (permalink / raw) To: Eli Schwartz; +Cc: Git List [-- Attachment #1: Type: text/plain, Size: 4823 bytes --] On 2023-01-31 at 00:06:44, Eli Schwartz wrote: > Nevertheless, I've seen the sentiment a few times that git doesn't like > committing to output stability of git-archive, because it isn't > officially documented (but it's not entirely clear what the benefits of > changing are). And yet, git endeavors to do so, in order to prevent > unnecessary breakage of people who embody Hyrum's Law and need that > stability. I'm one of the GitHub employees who chimed in there, and I'm also a Git contributor in my own time (and I am speaking here only in my personal capacity, since this is a personal address). I made a change some years back to the archive format to fix the permissions on pax headers when extracted as files, and kernel.org was relying on that and broke. Linus yelled at me because of that. Since then, I've been very opposed to us guaranteeing output format consistency without explicitly doing so. I had sent some patches before that I don't think ever got picked up that documented this explicitly. I very much don't want people to come to rely on our behaviour unless we explicitly guarantee it. > What does everyone think about offering versioned git-archive outputs? > This could be user-selectable as an option to `git archive`, but the > main goal would be to select a good versioned output format depending on > what is being archived. 
So: > > - first things first, un-default the internal compressor again > - implement a v2 archive format, where the internal compressor is the > default -- no other changes > - teach git to select an archive format based on the date of the object > being archived > - when given a commit/tag ID to archive, check which support frame the > committer date falls inside > - for tree IDs, always use the latest format (it always uses the > current date anyway) > - schedule a date, for the sake of argument, 6 months after the next > scheduled release date of git version X.Y in which this change goes > live; bake this into the git sources as a transition date, all commits > or tags generated after this date fall into the next format support > frame I am actually very much in favour of providing a standard, deterministic version of pax (the extended tar format) that we use and documenting it as a standard so that other archive tools can use that. That is, we document some canonical tar format that is bit-for-bit identical that we (and hopefully GNU tar and libarchive) will agree should be used to serialize files for software interchange. I don't think this should be dependent on the date at all, but I do believe it should be versioned and tested, and the version number embedded as a pax header. I think this would be valuable for simply having reproducible archives in general, including for things like Docker containers, Debian packages, Rust crates, and more, and I'm happy to work with others on such a format, as I've said in the past on the list. People can opt-in to whatever format they want when creating an archive and continue to use that forever if they like. Part of the reason I think this is valuable is that once SHA-1 and SHA-256 interoperability is present, git archive will change the contents of the archive format, since it will embed a SHA-256 hash into the file instead of a SHA-1 hash, since that's what's in the repository. 
Thus, we can't produce an archive that's deterministic in the face of SHA-1/SHA-256 interoperability concerns, and we need to create a new format that doesn't contain that data embedded in it. Having said that, I don't think this should be based on the timestamp of the file, since that means that two otherwise identical archives differing in timestamp aren't ever going to be the same, and we do see people who import or vendor other projects. Nor do I think we should attempt to provide consistent compression, since I believe the output of things like zlib has changed in the past, and we can't continually carry an old, potentially insecure version of zlib just because the output changed. People should be able to implement compression using gzip, zlib, pigz, miniz_oxide, or whatever if they want, since people implement Git in many different languages, and we won't want to force people using memory-safe languages like Go and Rust to explicitly use zlib for archives. That may mean that it's important for people to actually decompress the archive before checking hashes if they want deterministic behaviour, and I'm okay with that. You already have to do that if you're verifying the signature on Git tarballs, since only the uncompressed tar archive is signed, so I don't think this is out of the question. -- brian m. carlson (he/him or they/them) Toronto, Ontario, CA [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 263 bytes --] ^ permalink raw reply [flat|nested] 57+ messages in thread
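To make the "decompress before checking hashes" suggestion concrete, here is a small Python sketch. It builds one tar stream in memory, compresses it at two different levels, and compares digests of the decompressed bytes. The helper names `make_tar` and `tar_digest` are invented for this example, and the two compression levels merely stand in for two different gzip implementations.

```python
import gzip
import hashlib
import io
import tarfile


def make_tar(files):
    """Serialize a {name: bytes} mapping into an uncompressed tar stream."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in sorted(files):
            info = tarfile.TarInfo(name)  # mtime, uid and gid default to 0
            info.size = len(files[name])
            tar.addfile(info, io.BytesIO(files[name]))
    return buf.getvalue()


def tar_digest(gz_bytes):
    """SHA-256 of the tar stream inside a .tar.gz, ignoring the compression."""
    return hashlib.sha256(gzip.decompress(gz_bytes)).hexdigest()


body = "\n".join(f"line {i}: {i * i}" for i in range(2000)).encode()
tar_bytes = make_tar({"README": b"hello\n", "data.txt": body})

# Two compressed forms of the same tar stream.
fast = gzip.compress(tar_bytes, compresslevel=1, mtime=0)
best = gzip.compress(tar_bytes, compresslevel=9, mtime=0)

print(tar_digest(fast) == tar_digest(best))
```

A checksum taken this way survives a change of compressor, at the cost of decompressing the download before verifying it.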
* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution 2023-01-31 9:54 ` Stability of git-archive, breaking (?) the Github universe, and a possible solution brian m. carlson @ 2023-01-31 11:31 ` Ævar Arnfjörð Bjarmason 2023-01-31 15:05 ` Konstantin Ryabitsev ` (2 subsequent siblings) 3 siblings, 0 replies; 57+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2023-01-31 11:31 UTC (permalink / raw) To: brian m. carlson; +Cc: Eli Schwartz, Git List On Tue, Jan 31 2023, brian m. carlson wrote: > Part of the reason I think this is valuable is that once SHA-1 and > SHA-256 interoperability is present, git archive will change the > contents of the archive format, since it will embed a SHA-256 hash into > the file instead of a SHA-1 hash, since that's what's in the repository. > Thus, we can't produce an archive that's deterministic in the face of > SHA-1/SHA-256 interoperability concerns, and we need to create a new > format that doesn't contain that data embedded in it. I don't see why a format change would be required in this context. If a repository were to switch over to SHA-256 wouldn't a better solution to this be to disambiguate whether you're requesting a SHA-1 or SHA-256 derived archive in the URL? E.g. to never serve up an archive with a SHA-256 embedded in the header at: https://github.com/git/git/archive/refs/tags/v2.39.1.tar.gz But require a URL like: https://github.com/git/git/archive-sha256/refs/tags/v2.39.1.tar.gz If you did that then existing archives would continue to have the same byte-for-byte content (assuming that the result of this discussion is that we support that forever), but they'd always be generated with "-c extensions.objectFormat=sha1". For always-SHA256 repos such a URL would fail to generate anything. But for repos that used to be SHA-1 but are now SHA-256 either URL would work, but the PAX header would be different, referring to the SHA-1 or SHA-256 commit, respectively. 
Whereas your proposal seems to be that we should omit that SHA-(1|256) from the "comment" entirely. That would seem to require either a one-off change of all existing archives, or some cut-off date (or other marker). If you've got a cut-off, you could also just use it to decide whether to generate a SHA-1 or SHA-256 archive, and without that you'd be back to the one-off breakage. I also find it very useful that we've got the commit OID in the archive, as it allows for round-tripping from archives back to the relevant repository commit. Losing that entirely for SHA-1<->SHA-256 interop would be unfortunate, especially if it turns out we could have easily kept it. > Having said that, I don't think this should be based on the timestamp of > the file, since that means that two otherwise identical archives > differing in timestamp aren't ever going to be the same, and we do see > people who import or vendor other projects. Yes, I agree that doing this by that sort of heuristic would be bad. > Nor do I think we should > attempt to provide consistent compression, since I believe the output of > things like zlib has changed in the past, and we can't continually carry > an old, potentially insecure version of zlib just because the output > changed. People should be able to implement compression using gzip, > zlib, pigz, miniz_oxide, or whatever if they want, since people > implement Git in many different languages, and we won't want to force > people using memory-safe languages like Go and Rust to explicitly use > zlib for archives. As I noted in the side-thread I think an acceptable solution would be to push the problem of the consistent compressor downstream. I.e. if a site like GitHub wants to maintain a potentially old version of GNU gzip that should be up to them. But I think it's a valid concern that we should guarantee the stability of the archive format. ^ permalink raw reply [flat|nested] 57+ messages in thread
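The round-tripping mentioned above relies on the commit ID being stored as a "comment" record in a pax global header at the start of the tar stream, which is what 'git get-tar-commit-id' reads back. The Python sketch below imitates that with the standard `tarfile` module; it does not call Git, the helper names are invented, and the object ID is a made-up placeholder.

```python
import io
import tarfile


def tar_with_commit_id(oid):
    """Write a tiny pax-format tar whose global header carries comment=<oid>."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT,
                      pax_headers={"comment": oid}) as tar:
        data = b"hello\n"
        info = tarfile.TarInfo("README")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_commit_id(tar_bytes):
    """Return the global 'comment' record, or None if the archive has none."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tar:
        return tar.pax_headers.get("comment")


# Placeholder ID, not a real commit.
fake_oid = "0123456789abcdef0123456789abcdef01234567"
print(read_commit_id(tar_with_commit_id(fake_oid)))
```

Because the ID is part of the tar stream itself, the same tree archived from a SHA-1 and a SHA-256 repository differs in exactly this record, which is the interop concern discussed in this subthread.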
* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution 2023-01-31 9:54 ` Stability of git-archive, breaking (?) the Github universe, and a possible solution brian m. carlson 2023-01-31 11:31 ` Ævar Arnfjörð Bjarmason @ 2023-01-31 15:05 ` Konstantin Ryabitsev 2023-01-31 22:32 ` brian m. carlson 2023-01-31 15:56 ` Eli Schwartz 2023-02-01 12:42 ` Ævar Arnfjörð Bjarmason 3 siblings, 1 reply; 57+ messages in thread From: Konstantin Ryabitsev @ 2023-01-31 15:05 UTC (permalink / raw) To: brian m. carlson, Eli Schwartz, Git List On Tue, Jan 31, 2023 at 09:54:58AM +0000, brian m. carlson wrote: > I'm one of the GitHub employees who chimed in there, and I'm also a Git > contributor in my own time (and I am speaking here only in my personal > capacity, since this is a personal address). I made a change some years > back to the archive format to fix the permissions on pax headers when > extracted as files, and kernel.org was relying on that and broke. Linus > yelled at me because of that. > > Since then, I've been very opposed to us guaranteeing output format > consistency without explicitly doing so. I had sent some patches before > that I don't think ever got picked up that documented this explicitly. > I very much don't want people to come to rely on our behaviour unless we > explicitly guarantee it. I understand your position, but I also think it's one of those things that happen despite your best efforts to prevent it. :) May I suggest adding a "git-archive --stable" that offers this guarantee, simply as a matter of codifying the fact that the world has built infrastructure around git's repeatable output. Maybe just for .tar (and .tar.gz). I know this complicates the code and makes it more "expensive" to maintain, but it would be dramatically less expensive than changing the established practices around the world. -K ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution 2023-01-31 15:05 ` Konstantin Ryabitsev @ 2023-01-31 22:32 ` brian m. carlson 2023-02-01 9:40 ` Ævar Arnfjörð Bjarmason 2023-02-01 12:17 ` Raymond E. Pasco 0 siblings, 2 replies; 57+ messages in thread From: brian m. carlson @ 2023-01-31 22:32 UTC (permalink / raw) To: Konstantin Ryabitsev; +Cc: Eli Schwartz, Git List [-- Attachment #1: Type: text/plain, Size: 1709 bytes --] On 2023-01-31 at 15:05:55, Konstantin Ryabitsev wrote: > On Tue, Jan 31, 2023 at 09:54:58AM +0000, brian m. carlson wrote: > > I'm one of the GitHub employees who chimed in there, and I'm also a Git > > contributor in my own time (and I am speaking here only in my personal > > capacity, since this is a personal address). I made a change some years > > back to the archive format to fix the permissions on pax headers when > > extracted as files, and kernel.org was relying on that and broke. Linus > > yelled at me because of that. > > > > Since then, I've been very opposed to us guaranteeing output format > > consistency without explicitly doing so. I had sent some patches before > > that I don't think ever got picked up that documented this explicitly. > > I very much don't want people to come to rely on our behaviour unless we > > explicitly guarantee it. > > I understand your position, but I also think it's one of those things that > happen despite your best efforts to prevent it. :) > > May I suggest adding a "git-archive --stable" that offers this guarantee, > simply as a matter of codifying the fact that the world has built > infrastructure around git's repeatable output. Maybe just for .tar (and > .tar.gz). It is my intention to implement just .tar. That's my proposal: simply a pax-based format that serializes in a consistent way according to a predefined spec. As far as whether other people want to implement consistent compression, they are welcome to also write a spec and implement it. 
I personally feel that's too hard to get right and am not planning on working on it. -- brian m. carlson (he/him or they/them) Toronto, Ontario, CA [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 263 bytes --] ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution 2023-01-31 22:32 ` brian m. carlson @ 2023-02-01 9:40 ` Ævar Arnfjörð Bjarmason 2023-02-01 11:34 ` demerphq 2023-02-01 23:16 ` brian m. carlson 2023-02-01 12:17 ` Raymond E. Pasco 1 sibling, 2 replies; 57+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2023-02-01 9:40 UTC (permalink / raw) To: brian m. carlson; +Cc: Konstantin Ryabitsev, Eli Schwartz, Git List On Tue, Jan 31 2023, brian m. carlson wrote: > As far as whether other people want to implement consistent compression, > they are welcome to also write a spec and implement it. I personally > feel that's too hard to get right and am not planning on working on it. "A spec" here seems like overkill to me, so far on that front we've been shelling out to gzip(1), and the breakage/event that triggered this thread is rectified by starting to do that again by default. It means that someone writing a clean-room implementation of git would likely run into the same issue, if they used e.g. the Go language and a native Go implementation of deflate. But so what? We don't need to make promises for all potential git implementations, just this one. So we could add a blurb like this to the docs: As people have come to rely on the exact "deflate" implementation, "git archive" promises to invoke the system's "gzip" binary by default, under the assumption that its output is stable. If that's no longer the case you'll need to complain to whoever maintains your local "gzip". If we wanted to be even more helpful we could bundle and ship an old version of GNU gzip with our sources, and either default to that, or offer it as a "--stable" implementation of deflate. 
That would be going above & beyond what's needed IMO, but still a lot easier than the daunting task of writing a specification that exactly described GNU gzip's current behavior, to the point where you could clean-room implement it and be guaranteed byte-for-byte compatibility. 
* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution 2023-02-01 9:40 ` Ævar Arnfjörð Bjarmason @ 2023-02-01 11:34 ` demerphq 2023-02-01 12:21 ` Michal Suchánek 2023-02-01 23:16 ` brian m. carlson 1 sibling, 1 reply; 57+ messages in thread From: demerphq @ 2023-02-01 11:34 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: brian m. carlson, Konstantin Ryabitsev, Eli Schwartz, Git List On Wed, 1 Feb 2023 at 11:26, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote: > That would be going above & beyond what's needed IMO, but still a lot > easier than the daunting task of writing a specification that exactly > described GNU gzip's current behavior, to the point where you could > clean-room implement it and be guaranteed byte-for-byte compatibility. Why does it have to be gzip? It is not that hard to come up with a relatively good compression algorithm that is stable if you aren't expecting super fast performance or super good compression. If all you need is good enough but stability is a hard requirement then algorithms like LZW are available (it has been out of patent since ~2003), and produce reasonable results. If people want a stable archive then they might have to use some tool that git provides to decompress and they might not get the best compression ratios, nor speed, but they would get stability. You can write a decent LZW implementation in a few hundred lines of code. With a bit of care you could implement it in a way that allows you to compute the true hash digest of the compressed data without actually decompressing it as well, which would address some of the concerns that brian raised with regard to security I think. Why does this email remind me of that old canard that any sufficiently advanced piece of software gains the ability to send emails? :-) cheers, Yves -- perl -Mre=debug -e "/just|another|perl|hacker/" ^ permalink raw reply [flat|nested] 57+ messages in thread
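For a sense of scale, a textbook LZW round trip fits in a few dozen lines of Python. The sketch below is only that textbook algorithm, emitting a list of integer codes; it is not a proposed Git format, it leaves out the bit-packing and dictionary-size limits a real file format would need, and the function names are invented for this example.

```python
def lzw_compress(data):
    """Encode bytes as a list of LZW codes (no bit-packing, unbounded table)."""
    table = {bytes([i]): i for i in range(256)}
    next_code = 256
    out = []
    w = b""
    for byte in data:
        wc = w + bytes([byte])
        if wc in table:
            w = wc
        else:
            out.append(table[w])
            table[wc] = next_code
            next_code += 1
            w = bytes([byte])
    if w:
        out.append(table[w])
    return out


def lzw_decompress(codes):
    """Invert lzw_compress by rebuilding the same table while decoding."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    next_code = 256
    w = table[codes[0]]
    out = [w]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == next_code:
            # The code refers to the entry that is about to be defined.
            entry = w + w[:1]
        else:
            raise ValueError(f"invalid LZW code: {code}")
        out.append(entry)
        table[next_code] = w + entry[:1]
        next_code += 1
        w = entry
    return b"".join(out)
```

The output depends only on the input bytes and on these few rules, so two independent implementations that follow them produce identical codes. That property is what the message above is after, at the price of a weaker compression ratio than deflate.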
* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution 2023-02-01 11:34 ` demerphq @ 2023-02-01 12:21 ` Michal Suchánek 2023-02-01 12:48 ` demerphq 0 siblings, 1 reply; 57+ messages in thread From: Michal Suchánek @ 2023-02-01 12:21 UTC (permalink / raw) To: demerphq Cc: Ævar Arnfjörð Bjarmason, brian m. carlson, Konstantin Ryabitsev, Eli Schwartz, Git List On Wed, Feb 01, 2023 at 12:34:06PM +0100, demerphq wrote: > On Wed, 1 Feb 2023 at 11:26, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote: > > That would be going above & beyond what's needed IMO, but still a lot > > easier than the daunting task of writing a specification that exactly > > described GNU gzip's current behavior, to the point where you could > > clean-room implement it and be guaranteed byte-for-byte compatibility. > > Why does it have to be gzip? It is not that hard to come up with a historical reasons? ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution 2023-02-01 12:21 ` Michal Suchánek @ 2023-02-01 12:48 ` demerphq 2023-02-01 13:43 ` Ævar Arnfjörð Bjarmason 0 siblings, 1 reply; 57+ messages in thread From: demerphq @ 2023-02-01 12:48 UTC (permalink / raw) To: Michal Suchánek Cc: Ævar Arnfjörð Bjarmason, brian m. carlson, Konstantin Ryabitsev, Eli Schwartz, Git List On Wed, 1 Feb 2023, 20:21 Michal Suchánek, <msuchanek@suse.de> wrote: > > On Wed, Feb 01, 2023 at 12:34:06PM +0100, demerphq wrote: > > Why does it have to be gzip? It is not that hard to come up with a > historical reasons? Currently git doesn't advertise that archive creation is stable right[1]? So I wrote that with the assumption that this new compression would only be used when making a new archive with a hypothetical new '--stable' option. So historical reasons don't come up. Or was there some other form of history that you meant? I'm just trying to point out here that stable compression is doable and doesn't need to be as complex as specifying a stable gzip format. I am not even saying git should just do this, just that it /could/ if it decided that stability was important, and that doing so wouldn't involve the complexity that Avar was implying would be needed. Simple compression like LZ variants are pretty straightforward to implement, achieve pretty good compression and can run pretty fast. Yves [1] if it did the issue kicking off this thread would not have happened as there would be a test that would have noticed the change. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution 2023-02-01 12:48 ` demerphq @ 2023-02-01 13:43 ` Ævar Arnfjörð Bjarmason 2023-02-01 15:21 ` demerphq 0 siblings, 1 reply; 57+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2023-02-01 13:43 UTC (permalink / raw) To: demerphq Cc: Michal Suchánek, brian m. carlson, Konstantin Ryabitsev, Eli Schwartz, Git List On Wed, Feb 01 2023, demerphq wrote: > On Wed, 1 Feb 2023, 20:21 Michal Suchánek, <msuchanek@suse.de> wrote: >> >> On Wed, Feb 01, 2023 at 12:34:06PM +0100, demerphq wrote: >> > Why does it have to be gzip? It is not that hard to come up with a > >> historical reasons? > > Currently git doesn't advertise that archive creation is stable > right[1]? So I wrote that with the assumption that this new > compression would only be used when making a new archive with a > hypothetical new '--stable' option. So historical reasons don't come > up. Or was there some other form of history that you meant? We haven't advertised it, but people have come to rely on it, as the widespread breakages reported when upgrading to v2.38.0 at the start of this thread show. That's unfortunate, and those people probably shouldn't have done that, but that's water under the bridge. I think it would be irresponsible to change the output willy-nilly at this point, especially when it seems rather easy to find some compromise everyone will be happy with. > I'm just trying to point out here that stable compression is doable > and doesn't need to be as complex as specifying a stable gzip format. > I am not even saying git should just do this, just that it /could/ if > it decided that stability was important, and that doing so wouldn't > involve the complexity that Avar was implying would be needed. Simple > compression like LZ variants are pretty straightforward to implement, > achieve pretty good compression and can run pretty fast. 
> > Yves > [1] if it did the issue kicking off this thread would not have > happened as there would be a test that would have noticed the change. I have some patches I'm about to submit to address issues in this thread, and it does add *a* test for archive output stability. But I'm not at all confident that it's exhaustive. I just found it by experiment, by locating tests of ours where the "git archive" output at the end is different with gzip and "git archive gzip". But is it guaranteed to find all potential cases where repository content might trigger different output with different gzip implementations? I don't know, but probably not. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution 2023-02-01 13:43 ` Ævar Arnfjörð Bjarmason @ 2023-02-01 15:21 ` demerphq 2023-02-01 18:56 ` Theodore Ts'o 0 siblings, 1 reply; 57+ messages in thread From: demerphq @ 2023-02-01 15:21 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: Michal Suchánek, brian m. carlson, Konstantin Ryabitsev, Eli Schwartz, Git List On Wed, 1 Feb 2023 at 14:49, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote: > > > On Wed, Feb 01 2023, demerphq wrote: > > > On Wed, 1 Feb 2023, 20:21 Michal Suchánek, <msuchanek@suse.de> wrote: > >> > >> On Wed, Feb 01, 2023 at 12:34:06PM +0100, demerphq wrote: > >> > Why does it have to be gzip? It is not that hard to come up with a > > > >> historical reasons? > > > > Currently git doesn't advertise that archive creation is stable > > right[1]? So I wrote that with the assumption that this new > > compression would only be used when making a new archive with a > > hypothetical new '--stable' option. So historical reasons don't come > > up. Or was there some other form of history that you meant? > > We haven't advertised it, but people have come to rely on it, as the > widespread breakages reported when upgrading to v2.38.0 at the start of > this thread show. > > That's unfortunate, and those people probably shouldn't have done that, > but that's water under the bridge. I think it would be irresponsible to > change the output willy-nilly at this point, especially when it seems > rather easy to find some compromise everyone will be happy with. > > > I'm just trying to point out here that stable compression is doable > > and doesn't need to be as complex as specifying a stable gzip format. > > I am not even saying git should just do this, just that it /could/ if > > it decided that stability was important, and that doing so wouldn't > > involve the complexity that Avar was implying would be needed. 
Simple > > compression like LZ variants are pretty straightforward to implement, > > achieve pretty good compression and can run pretty fast. > > > > Yves > > [1] if it did the issue kicking off this thread would not have > > happened as there would be a test that would have noticed the change. > > I have some patches I'm about to submit to address issues in this > thread, and it does add *a* test for archive output stability. > > But I'm not at all confident that it's exhaustive. I just found it by > experiment, by locating tests of ours where the "git archive" output at > the end is different with gzip and "git archive gzip". > > But is it guaranteed to find all potential cases where repository > content might trigger different output with different gzip > implementations? I don't know, but probably not. BTW, I just happened to be looking at the zstd docs (I am updating code that uses it), and I saw this: Zstandard's format is stable and documented in [RFC8878](https://datatracker.ietf.org/doc/html/rfc8878). Multiple independent implementations are already available. This repository represents the reference implementation, provided as an open-source dual [BSD](LICENSE) and [GPLv2](COPYING) licensed **C** library, and a command line utility producing and decoding `.zst`, `.gz`, `.xz` and `.lz4` files. Should your project require another programming language, a list of known ports and bindings is provided on [Zstandard homepage](http://www.zstd.net/#other-languages). So it sounds like that is a spec you could use. Not sure exactly what they mean by "stable", but given the .gz compatibility maybe it would be worth considering. It's a lot faster than zlib. (The library I support includes Snappy, Zlib, and Zstd, and the latter is faster and better than the other two.) Yves -- perl -Mre=debug -e "/just|another|perl|hacker/" ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution 2023-02-01 15:21 ` demerphq @ 2023-02-01 18:56 ` Theodore Ts'o 2023-02-02 21:19 ` Joey Hess 0 siblings, 1 reply; 57+ messages in thread From: Theodore Ts'o @ 2023-02-01 18:56 UTC (permalink / raw) To: demerphq Cc: Ævar Arnfjörð Bjarmason, Michal Suchánek, brian m. carlson, Konstantin Ryabitsev, Eli Schwartz, Git List

If the goal is stable tar.gz files, Debian has a very nice solution called pristine-tar[1]. This allows you to store a tar.gz image in a very efficient way, by leveraging the objects in the git repository.

[1] https://manpages.debian.org/unstable/pristine-tar/pristine-tar.1.en.html

The data is stored on the pristine-tar branch, and is quite efficient:

% git show --stat pristine-tar
commit 56dded989c9e0c852b8af9ae72ffe94270bfd34a (origin/pristine-tar, github/pristine-tar, pristine-tar)
Author: Theodore Ts'o <tytso@mit.edu>
Date: Thu Dec 30 01:06:13 2021 -0500

    pristine-tar data for e2fsprogs_1.46.5.orig.tar.gz

 e2fsprogs_1.46.5.orig.tar.gz.asc | 11 +++++++++++
 e2fsprogs_1.46.5.orig.tar.gz.delta | Bin 0 -> 59034 bytes
 e2fsprogs_1.46.5.orig.tar.gz.id | 1 +
 3 files changed, 12 insertions(+)

And this allows me to reproduce the original tar.gz file, along with a GPG signature file, which is about 9 megabytes. The *.id file contains the git commit from which the tar file was generated, and this is what allows the *.delta file to be as small as it is.

% pristine-tar checkout e2fsprogs_1.46.5.orig.tar.gz -s e2fsprogs_1.46.5.orig.tar.gz.asc
pristine-tar: successfully generated e2fsprogs_1.46.5.orig.tar.gz
pristine-tar: successfully generated e2fsprogs_1.46.5.orig.tar.gz.asc
% ls -sh e2fsprogs_1.46.5.orig.tar.gz*
9.1M e2fsprogs_1.46.5.orig.tar.gz
4.0K e2fsprogs_1.46.5.orig.tar.gz.asc
% gpg e2fsprogs_1.46.5.orig.tar.gz.asc
gpg: WARNING: no command supplied. Trying to guess what you mean ...
gpg: assuming signed data in 'e2fsprogs_1.46.5.orig.tar.gz'
gpg: Signature made Thu 30 Dec 2021 01:02:52 AM EST
gpg: using RSA key 2B69B954DBFE0879288137C9F2F95956950D81A3
gpg: Good signature from "Theodore Ts'o <tytso@mit.edu>" [ultimate]
gpg: aka "Theodore Ts'o <tytso@debian.org>" [ultimate]
gpg: aka "Theodore Ts'o <tytso@google.com>" [ultimate]
Primary key fingerprint: 3AB0 57B7 E78D 945C 8C55 91FB D36F 769B C118 04F0
Subkey fingerprint: 2B69 B954 DBFE 0879 2881 37C9 F2F9 5956 950D 81A3

This is currently a Debian special, and while its functionality was designed to work well with Debian packaging workflows, it's a general tool that could be used in multiple contexts, not just for Debian packaging.

If I recall correctly, pristine-tar is currently in maintenance mode, and I suspect if someone was interested in investing time into making pristine-tar more portable to other OS's, including MacOS and Windows, and maybe potentially even integrating into git directly, the current maintainer of pristine-tar might be quite happy to let other people give the code more TLC.

- Ted

^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution 2023-02-01 18:56 ` Theodore Ts'o @ 2023-02-02 21:19 ` Joey Hess 2023-02-03 4:02 ` Theodore Ts'o 0 siblings, 1 reply; 57+ messages in thread From: Joey Hess @ 2023-02-02 21:19 UTC (permalink / raw) To: Theodore Ts'o; +Cc: Git List [-- Attachment #1: Type: text/plain, Size: 494 bytes --] In my opinion as the original developer of pristine-tar, it's too complicated to be usefully used by git. The problem it solves is of a larger scope than the problem git has here. (I hope.) Developing pristine-tar did entail much investigation of past changes in compressor outputs. I know that gzip's output has sometimes not been deterministic as recently as 2012, see for example https://git.savannah.gnu.org/cgit/gzip.git/commit/?id=0a284baeaedca68017f46d2646e4 -- see shy jo [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution 2023-02-02 21:19 ` Joey Hess @ 2023-02-03 4:02 ` Theodore Ts'o 2023-02-03 13:32 ` Ævar Arnfjörð Bjarmason 0 siblings, 1 reply; 57+ messages in thread From: Theodore Ts'o @ 2023-02-03 4:02 UTC (permalink / raw) To: Joey Hess; +Cc: Git List On Thu, Feb 02, 2023 at 05:19:30PM -0400, Joey Hess wrote: > In my opinion as the original developer of pristine-tar, it's too > complicated to be usefully used by git. The problem it solves is of a > larger scope than the problem git has here. (I hope.) Well, the problem which I believe folks on this thread are trying to deal with is a way to reconstruct a bit-for-bit compressed tarball of a particular release in a way that minimizes the cost of storage in the git tree. One way of doing that would be to guarantee that git archive would return something which is always bit-for-bit identical. Another way is to use something like pristine tar. I'll grant that pristine tar does solve a bit more of the problem than what has been stated, since it allows the creator of the tarball to remove some files, or add some auto-generated files (e.g., after running autoreconf), and so in that way, pristine tar does solve a somewhat larger problem than what was expressed in this thread. That being said, however, pristine-tar is **extremely** useful, and I'm very happy, and very thankful, that you wrote it. It has been super, super useful. Cheers, - Ted ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution 2023-02-03 4:02 ` Theodore Ts'o @ 2023-02-03 13:32 ` Ævar Arnfjörð Bjarmason 0 siblings, 0 replies; 57+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2023-02-03 13:32 UTC (permalink / raw) To: Theodore Ts'o; +Cc: Joey Hess, Git List On Thu, Feb 02 2023, Theodore Ts'o wrote: > On Thu, Feb 02, 2023 at 05:19:30PM -0400, Joey Hess wrote: >> In my opinion as the original developer of pristine-tar, it's too >> complicated to be usefully used by git. The problem it solves is of a >> larger scope than the problem git has here. (I hope.) > > Well, the problem which I believe folks on this thread are trying to > deal with is a way to reconstruct a bit-for-bit compressed tarball of > a particular release in a way that minimizes the cost of storage in > the git tree. One way of doing that would be to guarantee that git > archive would return something which is always bit-for-bit identical. > Another way is to use something like pristine tar. I think that's what this side-thread has devolved into, but I honestly don't see how that's useful or more than tangentially related to the problem noted at the start of the thread. If you are writing a new system that consumes "git archive" output something like what I'm proposing to add in [1] should nicely sidestep this issue, just checksum the uncompressed archive (assuming you're OK with our soft "tar" guarantees), or "git tag -v" (if you can) etc. That part of the docs is just a summary of what Konstantin Ryabitsev pointed out in a side-thread. One might also imagine any other number of trivial solutions to the problem, e.g. people interested in this can unpack the archive, and then (needs to guarantee sorted order, which I think find(1) doesn't, but just as a POC): (cd unpacked && find . -type f -printf "%f\n" -exec cat {} \; | sha256sum) Or whatever. 
But any such solution to the abstract problem isn't going to help the existing users whose systems broke because they were assuming certain things about the "git archive" output. For those users I think (as my proposed series does) we should just do whatever we can to limit the disruption, as my proposed [2] does by switching back to "gzip". For those users who are creating new systems that might use "git archive" today we then just need to update the documentation going forward. Maybe those could use "pristine-tar", or perhaps they can use some entirely different distribution mechanism. 1. https://lore.kernel.org/git/patch-9.9-b40833b2168-20230202T093212Z-avarab@gmail.com/ 2. https://lore.kernel.org/git/cover-0.9-00000000000-20230202T093212Z-avarab@gmail.com/ ^ permalink raw reply [flat|nested] 57+ messages in thread
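As an illustration of the "checksum the uncompressed archive" and "hash the unpacked files in sorted order" ideas in the message above, here is a minimal Python sketch. It is not part of the proposed patches. The function names `sha256_of_uncompressed`, `sha256_of_tree` and `_demo_tar` are hypothetical, the demo archive is built in memory rather than produced by "git archive", and `gzip.compress(..., mtime=0)` assumes Python 3.8 or later.

```python
import gzip
import hashlib
import io
import tarfile


def sha256_of_uncompressed(tar_gz_bytes: bytes) -> str:
    """Hash the tar stream inside a .tar.gz, ignoring how it was compressed."""
    return hashlib.sha256(gzip.decompress(tar_gz_bytes)).hexdigest()


def sha256_of_tree(tar_gz_bytes: bytes) -> str:
    """Hash regular files as (path, length, content), visited in sorted path order."""
    digest = hashlib.sha256()
    with tarfile.open(fileobj=io.BytesIO(tar_gz_bytes), mode="r:gz") as tar:
        members = sorted(
            (m for m in tar.getmembers() if m.isfile()), key=lambda m: m.name
        )
        for member in members:
            data = tar.extractfile(member).read()
            digest.update(member.name.encode("utf-8") + b"\0")
            digest.update(len(data).to_bytes(8, "big"))
            digest.update(data)
    return digest.hexdigest()


def _demo_tar() -> bytes:
    """Build a small in-memory tar (hypothetical stand-in for "git archive" output)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name, data in (("b.txt", b"beta\n" * 200), ("a.txt", b"alpha\n" * 200)):
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


demo_tar = _demo_tar()
# Same tar payload, two different (both valid) gzip encodings.
stored = gzip.compress(demo_tar, compresslevel=0, mtime=0)
best = gzip.compress(demo_tar, compresslevel=9, mtime=0)

print("compressed bytes identical:  ", stored == best)
print("uncompressed digests match:  ",
      sha256_of_uncompressed(stored) == sha256_of_uncompressed(best))
print("sorted-tree digests match:   ",
      sha256_of_tree(stored) == sha256_of_tree(best))
```

The first digest still depends on the exact tar bytes, so it only helps if the tar layer is stable. The second depends only on file paths and contents, at the cost of ignoring modes, symlinks and other metadata.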
* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution 2023-02-01 9:40 ` Ævar Arnfjörð Bjarmason 2023-02-01 11:34 ` demerphq @ 2023-02-01 23:16 ` brian m. carlson 2023-02-01 23:37 ` Junio C Hamano 2023-02-02 0:42 ` Ævar Arnfjörð Bjarmason 1 sibling, 2 replies; 57+ messages in thread From: brian m. carlson @ 2023-02-01 23:16 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: Konstantin Ryabitsev, Eli Schwartz, Git List [-- Attachment #1: Type: text/plain, Size: 1784 bytes --] On 2023-02-01 at 09:40:57, Ævar Arnfjörð Bjarmason wrote: > "A spec" here seems like overkill to me, so far on that front we've been > shelling out to gzip(1), and the breakage/event that triggered this > thread is rectified by starting to do that again by default. Sure, that will fix the immediate problem. > But so what? We don't need to make promises for all potential git > implementations, just this one. So we could add a blurb like this to the > docs: > > As people have come to rely on the exact "deflate" > implementation "git archive" promises to invoke the system's > "gzip" binary by default, under the assumption that its output > is stable. If that's no longer the case you'll need to complain > to whoever maintains your local "gzip". I don't think a blurb is necessary, but you're basically underscoring the problem, which is that nobody is willing to promise that compression is consistent, but yet people want to rely on that fact. I'm willing to write and implement a consistent tar spec and to guarantee compatibility with that, but the tension here is that people also want gzip to never change its byte format ever, which frankly seems unrealistic without explicit guarantees. Maybe the authors will agree to promise that, but it seems unlikely. > If we wanted to be even more helpful we could bundle and ship an old > version of GNU gzip with our sources, and either default to that, or > offer it as a "--stable" implementation of deflate. 
That would probably break things, because gzip is GPLv3, and we'd need to ship a much older GPLv2 gzip, which would probably differ from the current behaviour, and might also have some security problems. -- brian m. carlson (he/him or they/them) Toronto, Ontario, CA [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 263 bytes --] ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution 2023-02-01 23:16 ` brian m. carlson @ 2023-02-01 23:37 ` Junio C Hamano 2023-02-02 23:01 ` brian m. carlson 2023-02-02 0:42 ` Ævar Arnfjörð Bjarmason 1 sibling, 1 reply; 57+ messages in thread From: Junio C Hamano @ 2023-02-01 23:37 UTC (permalink / raw) To: brian m. carlson Cc: Ævar Arnfjörð Bjarmason, Konstantin Ryabitsev, Eli Schwartz, Git List "brian m. carlson" <sandals@crustytoothpaste.net> writes: > I don't think a blurb is necessary, but you're basically underscoring > the problem, which is that nobody is willing to promise that compression > is consistent, but yet people want to rely on that fact. I'm willing to > write and implement a consistent tar spec and to guarantee compatibility > with that, but the tension here is that people also want gzip to never > change its byte format ever, which frankly seems unrealistic without > explicit guarantees. Maybe the authors will agree to promise that, but > it seems unlikely. Just to step back a bit, where does the distinction between guaranteeing the tar format stability and gzip compressed bitstream stability come from? At both levels, the same thing can be expressed in multiple different ways, I think, but spelling out how exactly the compressor compresses is more involved than spelling out how entries in a tar archive is ordered and each entry is expressed, or something? > That would probably break things, because gzip is GPLv3, and we'd need > to ship a much older GPLv2 gzip, which would probably differ from the > current behaviour, and might also have some security problems. Yup, security issues may make bit-for-bit-stability unrealistic. IIRC, the last time we had discussion on this topic, we settled on stability across the same version of Git (i.e. deterministic result)? ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution 2023-02-01 23:37 ` Junio C Hamano @ 2023-02-02 23:01 ` brian m. carlson 2023-02-02 23:47 ` rsbecker 0 siblings, 1 reply; 57+ messages in thread From: brian m. carlson @ 2023-02-02 23:01 UTC (permalink / raw) To: Junio C Hamano Cc: Ævar Arnfjörð Bjarmason, Konstantin Ryabitsev, Eli Schwartz, Git List [-- Attachment #1: Type: text/plain, Size: 2575 bytes --] On 2023-02-01 at 23:37:19, Junio C Hamano wrote: > "brian m. carlson" <sandals@crustytoothpaste.net> writes: > > > I don't think a blurb is necessary, but you're basically underscoring > > the problem, which is that nobody is willing to promise that compression > > is consistent, but yet people want to rely on that fact. I'm willing to > > write and implement a consistent tar spec and to guarantee compatibility > > with that, but the tension here is that people also want gzip to never > > change its byte format ever, which frankly seems unrealistic without > > explicit guarantees. Maybe the authors will agree to promise that, but > > it seems unlikely. > > Just to step back a bit, where does the distinction between > guaranteeing the tar format stability and gzip compressed bitstream > stability come from? At both levels, the same thing can be > expressed in multiple different ways, I think, but spelling out how > exactly the compressor compresses is more involved than spelling out > how entries in a tar archive is ordered and each entry is expressed, > or something? Yes, at least with my understanding about how gzip and compression in general work. The tar format (and the pax format which builds on it) can mostly be restricted by explaining what data is to be included in the pax and tar headers and how it is to be formatted. 
If we say, we will always write such and such information in the pax header and sort the keys, and we write such and such information in the tar header, then the format is completely deterministic, and we can make nice guarantees. My understanding about how Lempel-Ziv-based compression algorithms work is that there's a lot more freedom to decide how best to compress things and that there isn't always a logical obvious choice, but I will admit my understanding is relatively limited. If someone thinks we can effectively succeed in supporting compression more than just relying on gzip, I would be delighted to be shown to be wrong. > > That would probably break things, because gzip is GPLv3, and we'd need > > to ship a much older GPLv2 gzip, which would probably differ from the > > current behaviour, and might also have some security problems. > > Yup, security issues may make bit-for-bit-stability unrealistic. > IIRC, the last time we had discussion on this topic, we settled > on stability across the same version of Git (i.e. deterministic > result)? Yes, I think that's what we agreed. -- brian m. carlson (he/him or they/them) Toronto, Ontario, CA [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 263 bytes --] ^ permalink raw reply [flat|nested] 57+ messages in thread
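The distinction drawn in the message above can be sketched in Python. This is only an illustration under stated assumptions: `deterministic_tar` is a hypothetical helper, and its fixed header values (mode 0644, uid/gid 0, mtime 0, empty owner names) are arbitrary choices for the demo, not a description of what "git archive" writes. The compression step uses zlib framing rather than gzip framing, but both wrap the same DEFLATE stream.

```python
import io
import tarfile
import zlib
from typing import Mapping


def deterministic_tar(files: Mapping[str, bytes]) -> bytes:
    """Write a tar whose bytes depend only on the (path, content) pairs."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
        # Sorting the paths removes any dependence on insertion order.
        for name in sorted(files):
            info = tarfile.TarInfo(name)
            info.size = len(files[name])
            # Pin every header field that could otherwise vary between runs.
            info.mtime = 0
            info.mode = 0o644
            info.uid = info.gid = 0
            info.uname = info.gname = ""
            tar.addfile(info, io.BytesIO(files[name]))
    return buf.getvalue()


files = {
    "src/main.c": b"int main(void) { return 0; }\n" * 50,
    "README": b"hello\n" * 100,
}
tar_one = deterministic_tar(files)
# Same files, inserted in the opposite order.
tar_two = deterministic_tar(dict(reversed(list(files.items()))))
print("tar bytes identical:", tar_one == tar_two)

# The compressor has real freedom: each level yields a valid stream for
# the same input, and every one of them decompresses to the same tar.
streams = {level: zlib.compress(tar_one, level) for level in (0, 1, 9)}
for level, stream in streams.items():
    print("level", level, "->", len(stream), "bytes")
print("distinct compressed outputs:", len(set(streams.values())))
print("all round-trip to the same tar:",
      all(zlib.decompress(s) == tar_one for s in streams.values()))
```

Pinning the tar layer is a matter of fixing a handful of header fields and an ordering. Pinning the compressed layer would mean pinning the compressor's match-finding and block-splitting decisions, which is the part nobody in the thread is willing to promise.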
* RE: Stability of git-archive, breaking (?) the Github universe, and a possible solution 2023-02-02 23:01 ` brian m. carlson @ 2023-02-02 23:47 ` rsbecker 2023-02-03 13:18 ` Ævar Arnfjörð Bjarmason 0 siblings, 1 reply; 57+ messages in thread From: rsbecker @ 2023-02-02 23:47 UTC (permalink / raw) To: 'brian m. carlson', 'Junio C Hamano' Cc: 'Ævar Arnfjörð Bjarmason', 'Konstantin Ryabitsev', 'Eli Schwartz', 'Git List' On February 2, 2023 6:02 PM, brian m. carlson wrote: >On 2023-02-01 at 23:37:19, Junio C Hamano wrote: >> "brian m. carlson" <sandals@crustytoothpaste.net> writes: >> >> > I don't think a blurb is necessary, but you're basically >> > underscoring the problem, which is that nobody is willing to promise >> > that compression is consistent, but yet people want to rely on that >> > fact. I'm willing to write and implement a consistent tar spec and >> > to guarantee compatibility with that, but the tension here is that >> > people also want gzip to never change its byte format ever, which >> > frankly seems unrealistic without explicit guarantees. Maybe the >> > authors will agree to promise that, but it seems unlikely. >> >> Just to step back a bit, where does the distinction between >> guaranteeing the tar format stability and gzip compressed bitstream >> stability come from? At both levels, the same thing can be expressed >> in multiple different ways, I think, but spelling out how exactly the >> compressor compresses is more involved than spelling out how entries >> in a tar archive is ordered and each entry is expressed, or something? > >Yes, at least with my understanding about how gzip and compression in general >work. > >The tar format (and the pax format which builds on it) can mostly be restricted by >explaining what data is to be included in the pax and tar headers and how it is to be >formatted. 
If we say, we will always write such and such information in the pax >header and sort the keys, and we write such and such information in the tar header, >then the format is completely deterministic, and we can make nice guarantees. > >My understanding about how Lempel-Ziv-based compression algorithms work is that >there's a lot more freedom to decide how best to compress things and that there >isn't always a logical obvious choice, but I will admit my understanding is relatively >limited. If someone thinks we can effectively succeed in supporting compression >more than just relying on gzip, I would be delighted to be shown to be wrong. The nice part about gzip is that it is generally available on virtually all platforms (or can be easily obtained). Other compression forms, like bz2, which sometimes produces more dense compression, are not necessarily available. Availability is something I would be worried about (clone and checkout failures). Tar formats are also to be used carefully. Not all platform implementations of tar support all variants. "ustar" is fairly common but there are others that are not. Interoperability needs to be the biggest factor in this decision, IMHO, rather than compression rates. The alternative is having git supply its own implementation, but that is a longer term migration problem, resembling the SHA-256 migration. > >> > That would probably break things, because gzip is GPLv3, and we'd >> > need to ship a much older GPLv2 gzip, which would probably differ >> > from the current behaviour, and might also have some security problems. >> >> Yup, security issues may make bit-for-bit-stability unrealistic. >> IIRC, the last time we had discussion on this topic, we settled on >> stability across the same version of Git (i.e. deterministic result)? In the old days, it was export concerns. Fortunately, git never really hit those in a post-2007 timeframe. I would not bank on this issue staying off the table. 
--Randall ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution 2023-02-02 23:47 ` rsbecker @ 2023-02-03 13:18 ` Ævar Arnfjörð Bjarmason 0 siblings, 0 replies; 57+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2023-02-03 13:18 UTC (permalink / raw) To: rsbecker Cc: 'brian m. carlson', 'Junio C Hamano', 'Konstantin Ryabitsev', 'Eli Schwartz', 'Git List' On Thu, Feb 02 2023, rsbecker@nexbridge.com wrote: > On February 2, 2023 6:02 PM, brian m. carlson wrote: >>On 2023-02-01 at 23:37:19, Junio C Hamano wrote: >>> "brian m. carlson" <sandals@crustytoothpaste.net> writes: >>> >>> > I don't think a blurb is necessary, but you're basically >>> > underscoring the problem, which is that nobody is willing to promise >>> > that compression is consistent, but yet people want to rely on that >>> > fact. I'm willing to write and implement a consistent tar spec and >>> > to guarantee compatibility with that, but the tension here is that >>> > people also want gzip to never change its byte format ever, which >>> > frankly seems unrealistic without explicit guarantees. Maybe the >>> > authors will agree to promise that, but it seems unlikely. >>> >>> Just to step back a bit, where does the distinction between >>> guaranteeing the tar format stability and gzip compressed bitstream >>> stability come from? At both levels, the same thing can be expressed >>> in multiple different ways, I think, but spelling out how exactly the >>> compressor compresses is more involved than spelling out how entries >>> in a tar archive is ordered and each entry is expressed, or something? >> >>Yes, at least with my understanding about how gzip and compression in general >>work. >> >>The tar format (and the pax format which builds on it) can mostly be restricted by >>explaining what data is to be included in the pax and tar headers and how it is to be >>formatted. 
If we say, we will always write such and such information in the pax >>header and sort the keys, and we write such and such information in the tar header, >>then the format is completely deterministic, and we can make nice guarantees. >> >>My understanding about how Lempel-Ziv-based compression algorithms work is that >>there's a lot more freedom to decide how best to compress things and that there >>isn't always a logical obvious choice, but I will admit my understanding is relatively >>limited. If someone thinks we can effectively succeed in supporting compression >>more than just relying on gzip, I would be delighted to be shown to be wrong. > > The nice part about gzip is that it is generally available on > virtually all platforms (or can be easily obtained). Other compression > forms, like bz2, which sometimes produces more dense compression, are > not necessarily available. Availability is something I would be > worried about... I agree with all of that, gzip is in such wide use for a reason. >... (clone and checkout failures). But how would a hypothetical obscure format for "git archive" contribute to clone or checkout failures? Are you thinking of our use of zlib for e.g. loose objects? That's unrelated to this discussion (and I don't think anyone relies on their compressed checksum). > Tar formats are also to be used carefully. Not all platform > implementations of tar support all variants. "ustar" is fairly common > but there are others that are not. Interoperability needs to be the > biggest factor in this decision, IMHO, rather than compression rates. For "git archive" whether you care about interoperability depends on the target audience of your archive, and in any case I don't see why we need to worry about it, except to perhaps note that some are more portable than others if we e.g. had a built-in "tar.bz2" helper method. 
> The alternative is having git supply its own implementation, but that > is a longer term migration problem, resembling the SHA-256 migration. I've noted elsewhere in this thread that I don't see the point of shipping a fallback "gzip" beyond the "git archive gzip" we have already, but even if we did that the scope of that seems pretty simple, and *much* easier than the SHA-256 migration. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution 2023-02-01 23:16 ` brian m. carlson 2023-02-01 23:37 ` Junio C Hamano @ 2023-02-02 0:42 ` Ævar Arnfjörð Bjarmason 1 sibling, 0 replies; 57+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2023-02-02 0:42 UTC (permalink / raw) To: brian m. carlson; +Cc: Konstantin Ryabitsev, Eli Schwartz, Git List On Wed, Feb 01 2023, brian m. carlson wrote: > [[PGP Signed Part:Undecided]] > On 2023-02-01 at 09:40:57, Ævar Arnfjörð Bjarmason wrote: >> "A spec" here seems like overkill to me, so far on that front we've been >> shelling out to gzip(1), and the breakage/event that triggered this >> thread is rectified by starting to do that again by default. > > Sure, that will fix the immediate problem. > >> But so what? We don't need to make promises for all potential git >> implementations, just this one. So we could add a blurb like this to the >> docs: >> >> As people have come to rely on the exact "deflate" >> implementation "git archive" promises to invoke the system's >> "gzip" binary by default, under the assumption that its output >> is stable. If that's no longer the case you'll need to complain >> to whoever maintains your local "gzip". > > I don't think a blurb is necessary, but you're basically underscoring > the problem, which is that nobody is willing to promise that compression > is consistent, but yet people want to rely on that fact. I'm willing to > write and implement a consistent tar spec and to guarantee compatibility > with that, but the tension here is that people also want gzip to never > change its byte format ever, which frankly seems unrealistic without > explicit guarantees. Maybe the authors will agree to promise that, but > it seems unlikely. Maybe they won't, the point is that an upgrade of git wouldn't break github in the way that's been observed, instead that potential breakage would happen whenever the OS (or whatever's providing "gzip") is upgraded. 
So, if gzip promises to never change such sites can upgrade it without issues, but if it does they'll presumably need to pin it forever. And those sites that don't care about "git archive" stability can use whatever their local "gzip" is, without caring that the output might change. >> If we wanted to be even more helpful we could bundle and ship an old >> version of GNU gzip with our sources, and either default to that, or >> offer it as a "--stable" implementation of deflate. > > That would probably break things, because gzip is GPLv3, and we'd need > to ship a much older GPLv2 gzip, which would probably differ from the > current behaviour, and might also have some security problems. We're way off in the realm of the hypothetical, I don't think we need a gzip fallback, we can make it the issue of the rare downstream user who needs such stability. But if we shipped a last-good gzip my understanding of software licensing is that we could ship the GPLv3 version. The issue with combining GPLv3 and GPLv2 works is if you do something like upgrade our wildmatch.c to the GPLv3 version (ours is derived from an older GPLv2 version). Then our combined work is derived from two different licenses. But if you're just invoking a different process those two sources can use incompatible licenses. There's established precedent for that throughout the industry, and it's the FSF's position on the matter. So if we offered to build a gzip for you from GPLv3 sources shipped in-tree that wouldn't infect the rest of git's GPLv2 code, any more than Debian shipping both git and gzip is cross-contaminating the two. It might cause us some hassle with distributors for whom any mention of GPLv3 is anathema (e.g. Apple), but I understand that that's general paranoia about its patent clauses impacting the distributor, not a license incompatibility. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution 2023-01-31 22:32 ` brian m. carlson 2023-02-01 9:40 ` Ævar Arnfjörð Bjarmason @ 2023-02-01 12:17 ` Raymond E. Pasco 1 sibling, 0 replies; 57+ messages in thread From: Raymond E. Pasco @ 2023-02-01 12:17 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason, brian m. carlson Cc: Konstantin Ryabitsev, Eli Schwartz, Git List February 1, 2023 4:40 AM, "Ævar Arnfjörð Bjarmason" <avarab@gmail.com> wrote: > As people have come to rely on the exact "deflate" > implementation "git archive" promises to invoke the system's > "gzip" binary by default, under the assumption that its output > is stable. If that's no longer the case you'll need to complain > to whoever maintains your local "gzip". Surely if reproducibility of .tar.gz files is the goal, "invoke whatever arbitrary binary on $PATH happens to be called gzip" is a poor solution. It is only even possible to consider stabilizing gzip output as a goal for Git (although this seems ill-advised for the reasons Brian already discussed) in the post-2.38 world where git is doing the gzipping. If one has the requirement to substitute one's own specific compressor, there is an option for that. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution 2023-01-31 9:54 ` Stability of git-archive, breaking (?) the Github universe, and a possible solution brian m. carlson 2023-01-31 11:31 ` Ævar Arnfjörð Bjarmason 2023-01-31 15:05 ` Konstantin Ryabitsev @ 2023-01-31 15:56 ` Eli Schwartz 2023-01-31 16:20 ` Konstantin Ryabitsev 2023-02-01 1:33 ` brian m. carlson 2023-02-01 12:42 ` Ævar Arnfjörð Bjarmason 3 siblings, 2 replies; 57+ messages in thread From: Eli Schwartz @ 2023-01-31 15:56 UTC (permalink / raw) To: brian m. carlson, Git List On 1/31/23 4:54 AM, brian m. carlson wrote: > Part of the reason I think this is valuable is that once SHA-1 and > SHA-256 interoperability is present, git archive will change the > contents of the archive format, since it will embed a SHA-256 hash into > the file instead of a SHA-1 hash, since that's what's in the repository. > Thus, we can't produce an archive that's deterministic in the face of > SHA-1/SHA-256 interoperability concerns, and we need to create a new > format that doesn't contain that data embedded in it. I assume that whatever the reason for originally embedding the OID into the file is still an applicable reason even if a new PAX format is established for the use of git-archive. It may not be a great reason -- I don't know. Perhaps there's an argument to remove it. But can't that be done irrespective of standardizing the PAX format? ... I'm not deeply knowledgeable about the SHA-256 transition work -- or knowledgeable at all about it, frankly. (Also my understanding was it seems to have stalled as discussed in https://lwn.net/Articles/898522/ -- I understand that you're still enthusiastic about the work? But that doesn't really answer "is there a timeframe for that to ever happen".) But I sort of assumed that the transition work would already have to embed a fair bit of information into the repository about the whole process? 
Would it not be possible to determine whether a given tag started life as SHA-1 or SHA-256? Maybe even just a date when the repository was converted to work with both, and embed the OID based on whether the tag is tagging contents that were created after that conversion? Seems to me like the problem should be solvable if people want to solve it. ... git-archive run on a commit obviously doesn't have this problem -- it can simply embed the OID for the same argument it was called with. But I assume it's far more common to access tag-based github endpoints. :D > Having said that, I don't think this should be based on the timestamp of > the file, since that means that two otherwise identical archives > differing in timestamp aren't ever going to be the same, and we do see > people who import or vendor other projects. The timestamp of the output file? Surely not. But I only suggested the timestamp of the commit/tag metadata that git-archive is asked to produce output for. And we would need that in order to solve the problem that reproducible github API archive endpoints poses. I'm not sure what the "import or vendor other projects" angle here means. Do you mean people who copy a directory of files into their project? Who expects this to be the same to begin with? And doesn't embedding the OID kill this idea, since the entire point of git commit sha's is that you shouldn't (it should be prohibitively unrealistic to) be able to produce the same one twice in different contexts? I have never said to myself "ah yes, I really would like to be able to download a git auto-generated tarball for project A, and compare its hash to the tarball for project B, and have them compare identical even though they are different projects with different commits". 
IMHO this isn't an interesting problem to solve -- the interesting problem to solve is that a single absolute URL to a downloadable file should be able to offer documented guarantees that it will always be the same file, even though it is generated on the fly. > Nor do I think we should > attempt to provide consistent compression, since I believe the output of > things like zlib has changed in the past, and we can't continually carry > an old, potentially insecure version of zlib just because the output > changed. People should be able to implement compression using gzip, > zlib, pigz, miniz_oxide, or whatever if they want, since people > implement Git in many different languages, and we won't want to force > people using memory-safe languages like Go and Rust to explicitly use > zlib for archives. I do not think it is realistic or reasonable for people to implement compression using intentionally incompatible replacements for gzip and expect interoperability of any sort. I also don't think people *have* to implement compression in rust using zlib, but if they are going to make a git-alike that produces archives, it would be worth it for them to write whatever memory-safe rust is necessary to memory-safely produce the same output stream of bytes. It's no less feasible than making sure that busybox gzip and GNU gzip produce the same output, surely. Alternatively, they could just not bother with gzip at all, and make their git-alike produce zstd-compressed tarballs, which change their byte outputs every time a new zstd release is published. :D Again, why limit yourself to gzip if you want to be innovative anyway. > That may mean that it's important for people to actually decompress the > archive before checking hashes if they want deterministic behaviour, and > I'm okay with that. You already have to do that if you're verifying the > signature on Git tarballs, since only the uncompressed tar archive is > signed, so I don't think this is out of the question. 
This is a very kernel.org-centric view of things, I think. I have rarely seen PGP signatures applied to the uncompressed tar except in that context. The vast majority of tarballs with signatures have signed a single compressed tarball and don't concern themselves with, say, providing a rotating backdated changeable list of compression formats with a single signature covering all of them. Nevertheless, in order to handle kernel.org-style tarballs, you are entirely correct that one should be able to handle this. From experience, I can say that this needs to be selected on a per-tarball basis. Since signature files have filenames, we can match their stems and given foo.tar.asc and foo.tar.gz, check the signature of the output of gzip -dc < foo.tar.gz, but given foo.tar.gz.asc and foo.tar.gz, simply check the signature of the original foo.tar.gz. This doesn't really work for checksums, because you need to settle on one or the other everywhere or else embed decompression information into your checksum metadata field. And for tarballs that are generated once and uploaded to ftp storage, not repeatedly generated on the fly, we know the checksum will never legitimately change, so we *want* to hash the compressed file. Decompressing kernel.org tarballs in order to run PGP on them is *slow*. Although at least one can verify the checksums first without decompression, which is virtually guaranteed to catch invalid source code releases, so if you ever progress to the PGP verification stage it's unlikely to be wasted effort -- that tarball is definitely getting used to build something. -- Eli Schwartz ^ permalink raw reply [flat|nested] 57+ messages in thread
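The stem-matching rule described in the message above (foo.tar.asc means "verify the decompressed tar", foo.tar.gz.asc means "verify the file as downloaded") could be sketched roughly as follows. This is only an illustration: the function name signature_covers and its return strings are hypothetical, and it assumes detached signatures always end in ".asc" and that the tarball carries exactly one compression extension.

```python
def signature_covers(tarball: str, signature: str) -> str:
    """Decide what a detached signature was made over, by filename stem.

    Returns "compressed" when the signature names the tarball as downloaded
    (foo.tar.gz.asc), or "uncompressed" when it names the inner tar
    (foo.tar.asc), in which case the tarball is decompressed before checking.
    """
    suffix = ".asc"  # assumption: detached signatures use this extension
    if not signature.endswith(suffix):
        raise ValueError(f"not a detached signature name: {signature}")
    signed_name = signature[: -len(suffix)]

    if signed_name == tarball:
        return "compressed"

    # Drop only the last extension (.gz, .xz, ...) to get the inner tar name.
    inner_tar, dot, _compression = tarball.rpartition(".")
    if dot and signed_name == inner_tar:
        return "uncompressed"

    raise ValueError(f"{signature} does not match {tarball}")
```

A caller would run the actual PGP check on either the downloaded bytes or the decompressed stream depending on the result; that step is left out here because it depends on the local PGP tooling.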
* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution 2023-01-31 15:56 ` Eli Schwartz @ 2023-01-31 16:20 ` Konstantin Ryabitsev 2023-01-31 16:34 ` Eli Schwartz 2023-02-01 1:33 ` brian m. carlson 1 sibling, 1 reply; 57+ messages in thread From: Konstantin Ryabitsev @ 2023-01-31 16:20 UTC (permalink / raw) To: Eli Schwartz; +Cc: brian m. carlson, Git List On Tue, Jan 31, 2023 at 10:56:52AM -0500, Eli Schwartz wrote: > And for tarballs that are generated once and uploaded to ftp storage, > not repeatedly generated on the fly, we know the checksum will never > legitimately change, so we *want* to hash the compressed file. > Decompressing kernel.org tarballs in order to run PGP on them is *slow*. FWIW, the most correct way is: * download sha256sums.asc and verify its signature (auto-signed by infra) * download the tarball you want and verify that the checksum matches * uncompress and verify the PGP signature (signed by developer) This script implements this workflow: https://git.kernel.org/pub/scm/linux/kernel/git/mricon/korg-helpers.git/tree/get-verified-tarball -K ^ permalink raw reply [flat|nested] 57+ messages in thread
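The middle step of the workflow listed above (check the downloaded tarball against the already-verified checksum list before spending time on decompression) might look something like this sketch. It assumes the checksum list uses the usual sha256sum layout of "<hex digest>  <filename>" per line and that its signature has already been verified separately; the helper names expected_digest and checksum_matches are made up for illustration, and the two PGP steps are not shown.

```python
import hashlib


def expected_digest(manifest: str, filename: str) -> str:
    """Look up the hex digest recorded for filename in a sha256sum-style list."""
    for line in manifest.splitlines():
        parts = line.split()
        # sha256sum may prefix the name with "*" for binary mode.
        if len(parts) == 2 and parts[1].lstrip("*") == filename:
            return parts[0].lower()
    raise KeyError(filename)


def checksum_matches(tarball_bytes: bytes, manifest: str, filename: str) -> bool:
    """Hash the compressed file as downloaded and compare with the manifest."""
    actual = hashlib.sha256(tarball_bytes).hexdigest()
    return actual == expected_digest(manifest, filename)
```

Only when this check passes is it worth decompressing the tarball for the developer-signature step.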
* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution 2023-01-31 16:20 ` Konstantin Ryabitsev @ 2023-01-31 16:34 ` Eli Schwartz 2023-01-31 20:34 ` Konstantin Ryabitsev 2023-01-31 20:45 ` Michal Suchánek 0 siblings, 2 replies; 57+ messages in thread From: Eli Schwartz @ 2023-01-31 16:34 UTC (permalink / raw) To: Konstantin Ryabitsev; +Cc: brian m. carlson, Git List On 1/31/23 11:20 AM, Konstantin Ryabitsev wrote: > On Tue, Jan 31, 2023 at 10:56:52AM -0500, Eli Schwartz wrote: >> And for tarballs that are generated once and uploaded to ftp storage, >> not repeatedly generated on the fly, we know the checksum will never >> legitimately change, so we *want* to hash the compressed file. >> Decompressing kernel.org tarballs in order to run PGP on them is *slow*. > > FWIW, the most correct way is: > > * download sha256sums.asc and verify its signature (auto-signed by infra) > * download the tarball you want and verify that the checksum matches > * uncompress and verify the PGP signature (signed by developer) > > This script implements this workflow: > https://git.kernel.org/pub/scm/linux/kernel/git/mricon/korg-helpers.git/tree/get-verified-tarball This is just what I said, but with an additional first step for when you are updating to a new tarball and don't have your own checksums integrated into your own ecosystem tracking. In most contexts, it's utterly unacceptable to not remember the checksum of the file you used last time and instead simply trust PGP identity verification. This permits upstream the technical means to be malicious, and re-upload a totally different tarball with the same name, different contents, and different PGP signature, and you will never notice because the PGP signature is still okay. Just because I trust you all doesn't mean I should ignore existing best practices to make sure that I always use the same reviewed byte-identical tarball -- or find out exactly why it changed. 
-- Eli Schwartz ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution 2023-01-31 16:34 ` Eli Schwartz @ 2023-01-31 20:34 ` Konstantin Ryabitsev 2023-01-31 20:45 ` Michal Suchánek 1 sibling, 0 replies; 57+ messages in thread From: Konstantin Ryabitsev @ 2023-01-31 20:34 UTC (permalink / raw) To: Eli Schwartz; +Cc: brian m. carlson, Git List On Tue, Jan 31, 2023 at 11:34:59AM -0500, Eli Schwartz wrote: > In most contexts, it's utterly unacceptable to not remember the checksum > of the file you used last time and instead simply trust PGP identity > verification. This permits upstream the technical means to be malicious, > and re-upload a totally different tarball with the same name, different > contents, and different PGP signature, and you will never notice because > the PGP signature is still okay. Yes, it's true, and it's something that Sigstore tries to address. That said, if I wanted to trojan a download and had access to both the infrastructure and the developer's credentials, I wouldn't pick a months-old release for this purpose. I would wait until I see a new release coming out and then swap it mid-flight. This lets me defeat even transparency-log based solutions like sigstore. (I'll probably be giving a talk at the Linux Security Summit titled "How to trojan the Linux Kernel" where I'll go into some of these considerations. :)) -K ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution 2023-01-31 16:34 ` Eli Schwartz 2023-01-31 20:34 ` Konstantin Ryabitsev @ 2023-01-31 20:45 ` Michal Suchánek 1 sibling, 0 replies; 57+ messages in thread From: Michal Suchánek @ 2023-01-31 20:45 UTC (permalink / raw) To: Eli Schwartz; +Cc: Konstantin Ryabitsev, brian m. carlson, Git List On Tue, Jan 31, 2023 at 11:34:59AM -0500, Eli Schwartz wrote: > On 1/31/23 11:20 AM, Konstantin Ryabitsev wrote: > > On Tue, Jan 31, 2023 at 10:56:52AM -0500, Eli Schwartz wrote: > >> And for tarballs that are generated once and uploaded to ftp storage, > >> not repeatedly generated on the fly, we know the checksum will never > >> legitimately change, so we *want* to hash the compressed file. > >> Decompressing kernel.org tarballs in order to run PGP on them is *slow*. > > > > FWIW, the most correct way is: > > > > * download sha256sums.asc and verify its signature (auto-signed by infra) > > * download the tarball you want and verify that the checksum matches > > * uncompress and verify the PGP signature (signed by developer) > > > > This script implements this workflow: > > https://git.kernel.org/pub/scm/linux/kernel/git/mricon/korg-helpers.git/tree/get-verified-tarball > > > This is just what I said, but with an additional first step for when you > are updating to a new tarball and don't have your own checksums > integrated into your own ecosystem tracking. > > In most contexts, it's utterly unacceptable to not remember the checksum > of the file you used last time and instead simply trust PGP identity > verification. This permits upstream the technical means to be malicious, > and re-upload a totally different tarball with the same name, different > contents, and different PGP signature, and you will never notice because > the PGP signature is still okay. But where is the hash remembered? 
The signature is a hash+signature; if you can replace that, you can also replace a hash without a signature. You can store hashes of anything you want locally, and indeed such stored hashes in some build systems did detect some code hosting corruption, but that's not for upstream to do; that's something that only an unrelated third party can do. Thanks Michal ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution 2023-01-31 15:56 ` Eli Schwartz 2023-01-31 16:20 ` Konstantin Ryabitsev @ 2023-02-01 1:33 ` brian m. carlson 1 sibling, 0 replies; 57+ messages in thread From: brian m. carlson @ 2023-02-01 1:33 UTC (permalink / raw) To: Eli Schwartz; +Cc: Git List [-- Attachment #1: Type: text/plain, Size: 11930 bytes --] On 2023-01-31 at 15:56:52, Eli Schwartz wrote: > On 1/31/23 4:54 AM, brian m. carlson wrote: > > Part of the reason I think this is valuable is that once SHA-1 and > > SHA-256 interoperability is present, git archive will change the > > contents of the archive format, since it will embed a SHA-256 hash into > > the file instead of a SHA-1 hash, since that's what's in the repository. > > Thus, we can't produce an archive that's deterministic in the face of > > SHA-1/SHA-256 interoperability concerns, and we need to create a new > > format that doesn't contain that data embedded in it. > > > I assume that whatever the reason for originally embedding the OID into > the file is still an applicable reason even if a new PAX format is > established for the use of git-archive. > > It may not be a great reason -- I don't know. Perhaps there's an > argument to remove it. But can't that be done irrespective of > standardizing the PAX format? > > ... > > I'm not deeply knowledgeable about the SHA-256 transition work -- or > knowledgeable at all about it, frankly. (Also my understanding was it > seems to have stalled as discussed in https://lwn.net/Articles/898522/ > -- I understand that you're still enthusiastic about the work? But that > doesn't really answer "is there a timeframe for that to ever happen".) The timeframe is when my employer pays me to work on it. Right now, I've implemented functional SHA-256 repositories but am currently a bit on the way to burnout and am very selective about what things I'm doing outside of work. 
My hope is that my employer will find time for me to work on the interop stuff soon, but I'm not at liberty to discuss this more in depth at the moment. > But I sort of assumed that the transition work would already have to > embed a fair bit of information into the repository about the whole > process? Would it not be possible to determine whether a given tag > started life as SHA-1 or SHA-256? Maybe even just a date when the > repository was converted to work with both, and embed the OID based on > whether the tag is tagging contents that were created after that conversion? It's designed such that the two objects are completely interoperable and can be accessed by either name, depending on how the repository is configured locally. There may be a signature for one algorithm, both, or neither, so it's hard to say definitively what version it's created with. That is completely intentional since the goal is to transition seamlessly from one to another at any point depending on the preferences of the owner of the local repository. > > Having said that, I don't think this should be based on the timestamp of > > the file, since that means that two otherwise identical archives > > differing in timestamp aren't ever going to be the same, and we do see > > people who import or vendor other projects. > > > The timestamp of the output file? Surely not. But I only suggested the > timestamp of the commit/tag metadata that git-archive is asked to > produce output for. And we would need that in order to solve the problem > that reproducible github API archive endpoints poses. I think it would simply be easier to say, "This is the command-line option that implements canonical tar version 1." If you want a reproducible archive, you use that command-line option, and your uncompressed tar archive is reproducible. Otherwise, you get the same guarantees on reproducibility that we've always provided, which is absolutely none. 
Using commit and tag metadata doesn't solve the problem of trees, which would use the current timestamp. It's better to solve the problem in a consistent way, which would mean embedding a fixed timestamp (probably the Epoch) into those tree tarballs. In my view, using the commit or tag timestamp is very risky, because it changes the behaviour at some point in the future without notifying people. If we produce a tar archive that isn't readable by FooZip, say, then nobody will realize that until we actually start producing them, several months after the release. And, I should point out, this still poses problems for GitHub and other forges, because GitHub doesn't run the latest release right away; we usually trail a version or two. So using the commit or tag timestamp might mean that on an upgrade, suddenly the behaviour changes because the new version has a change (which was scheduled to have occurred in the past) but the old version doesn't. In addition, the one guarantee we've given with archives in the past is that the same version of Git with the same input (flags, repository, etc.) will produce deterministic results (that is, the same output), and I think we're likely to run afoul of that with a timestamp-based approach. I don't want the archive to suddenly be different because I happened to do "git commit --amend" to update just a commit message and we happened to cross that timestamp threshold. > I'm not sure what the "import or vendor other projects" angle here > means. Do you mean people who copy a directory of files into their > project? Who expects this to be the same to begin with? And doesn't > embedding the OID kill this idea, since the entire point of git commit > sha's is that you shouldn't (it should be prohibitively unrealistic to) > be able to produce the same one twice in different contexts? We have people who import the entirety of Chromium into a project at one time to work on a browser-based project. 
> I have never said to myself "ah yes, I really would like to be able to > download a git auto-generated tarball for project A, and compare its > hash to the tarball for project B, and have them compare identical even > though they are different projects with different commits". IMHO this > isn't an interesting problem to solve -- the interesting problem to > solve is that a single absolute URL to a downloadable file should be > able to offer documented guarantees that it will always be the same > file, even though it is generated on the fly. I do think having identical output for identical contents is very valuable. If our goal is reproducible output, we should endeavour to produce identical output for identical input. What we're specifically trying to move away from is varying output based on the same input. > I do not think it is realistic or reasonable for people to implement > compression using intentionally incompatible replacements for gzip and > expect interoperability of any sort. I disagree completely. The gzip and zlib formats are documented in RFCs and have been since 1996. There are already at least a half-dozen interoperable implementations, including zlib, gzip, pigz, Go's standard library, miniz_oxide, and the Windows archiver. I'm sure if I searched I could find at least half a dozen more. > I also don't think people *have* to implement compression in rust using > zlib, but if they are going to make a git-alike that produces archives, > it would be worth it for them to write whatever memory-safe rust is > necessary to memory-safely produce the same output stream of bytes. It's > no less feasible than making sure that busybox gzip and GNU gzip produce > the same output, surely. I don't agree at all. The Go standard library couldn't achieve that, because busybox and gzip are GPL and doing that would almost certainly require looking at the code, which would require the Go standard library to be GPL as well. 
The same thing goes for zlib, which is permissively licensed, and which is clearly the obvious choice if we had to settle on a standard, since it's a shared library. That also ignores tools like pigz which provide parallel compression and can provide an order of magnitude performance increase, but which won't provide an identical byte stream. Why should we require people to use a single core if they have a very large archive that could compress several times as fast with a parallel operation? My goal is to produce tar archives that are interoperable based on a spec. That spec would be implementable by Git, GNU tar, libarchive, or anyone else, by reading the spec and following it. That's very different from saying, "Well, just make your program do exactly the same thing as this other one without sharing any code." If you want to write a spec for canonical gzip, I'm interested in reading it, but I think it's practically going to be difficult to achieve. > > That may mean that it's important for people to actually decompress the > > archive before checking hashes if they want deterministic behaviour, and > > I'm okay with that. You already have to do that if you're verifying the > > signature on Git tarballs, since only the uncompressed tar archive is > > signed, so I don't think this is out of the question. > > > This is a very kernel.org-centric view of things, I think. I have rarely > seen PGP signatures applied to the uncompressed tar except in that > context. The vast majority of tarballs with signatures have signed a > single compressed tarball and don't concern themselves with, say, > providing a rotating backdated changeable list of compression formats > with a single signature covering all of them. Sure, and that's a valid approach if you have a consistent, persistent tarball. However, Git does not persist data forever in tarballs, and people want to use different versions to get the same data, which is a new guarantee that we'd be providing. 
That is an easy guarantee to provide with tar, but not an easy guarantee to provide with the gzip format, as we've all just seen. > From experience, I can say that this needs to be selected on a > per-tarball basis. Since signature files have filenames, we can match > their stems and given foo.tar.asc and foo.tar.gz, check the signature of > the output of gzip -dc < foo.tar.gz, but given foo.tar.gz.asc and > foo.tar.gz, simply check the signature of the original foo.tar.gz. > > This doesn't really work for checksums, because you need to settle on > one or the other everywhere or else embed decompression information into > your checksum metadata field. I don't think that's absolutely required. You need to know how to decompress the archive, and you can have a hash for the tarball before decompression or after decompression, as well as possibly needing to deal with multiple different hash algorithms. I've implemented this myself when I was a vendor of Git and lots of other software, and we would take the hash of the compressed or decompressed archive as shipped by the vendor and verify it, as long as the hash was sufficiently strong. > And for tarballs that are generated once and uploaded to ftp storage, > not repeatedly generated on the fly, we know the checksum will never > legitimately change, so we *want* to hash the compressed file. > Decompressing kernel.org tarballs in order to run PGP on them is *slow*. > Although at least one can verify the checksums first without > decompression, which is virtually guaranteed to catch invalid source > code releases, so if you ever progress to the PGP verification stage > it's unlikely to be wasted effort -- that tarball is definitely getting > used to build something. Sure, and if you want to generate tarballs once and upload them to storage, go ahead. That's always an option. Even GitHub provides you the option to do that with release assets if you want.
My proposal is to provide deterministic archives in a functionally and practically achievable way with nothing more than a version of Git, which I think we can do with tar, but not gzip. I'm happy to be proven wrong if you can develop a spec for canonical gzip compression. -- brian m. carlson (he/him or they/them) Toronto, Ontario, CA ^ permalink raw reply [flat|nested] 57+ messages in thread
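To illustrate the distinction drawn in the message above between a stable tar stream and an unstable gzip wrapper, here is a small sketch. The byte string tar_stream is only a stand-in for real "git archive" output, and the two gzip.compress calls merely simulate two different compressors by varying the level and the header timestamp; nothing here reflects how Git itself compresses.

```python
import gzip
import hashlib


def compressed_digest(blob: bytes) -> str:
    """SHA-256 of the .tar.gz exactly as it would be downloaded."""
    return hashlib.sha256(blob).hexdigest()


def tar_stream_digest(blob: bytes) -> str:
    """SHA-256 of the tar stream after removing the gzip layer."""
    return hashlib.sha256(gzip.decompress(blob)).hexdigest()


# Stand-in for an uncompressed tar stream (hypothetical data).
tar_stream = b"stand-in for tar bytes\n" * 1000

# Two "compressors": different level and different header timestamp,
# so the compressed bytes are not identical.
archive_a = gzip.compress(tar_stream, compresslevel=1, mtime=1)
archive_b = gzip.compress(tar_stream, compresslevel=9, mtime=2)

print(compressed_digest(archive_a) == compressed_digest(archive_b))
print(tar_stream_digest(archive_a) == tar_stream_digest(archive_b))
```

The first comparison is over the compressed bytes, which differ here at least in the gzip header; the second is over the decompressed content, which is the only layer the proposal would promise to keep identical.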
* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution 2023-01-31 9:54 ` Stability of git-archive, breaking (?) the Github universe, and a possible solution brian m. carlson ` (2 preceding siblings ...) 2023-01-31 15:56 ` Eli Schwartz @ 2023-02-01 12:42 ` Ævar Arnfjörð Bjarmason 2023-02-01 23:18 ` brian m. carlson 3 siblings, 1 reply; 57+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2023-02-01 12:42 UTC (permalink / raw) To: brian m. carlson; +Cc: Eli Schwartz, Git List On Tue, Jan 31 2023, brian m. carlson wrote: > Since then, I've been very opposed to us guaranteeing output format > consistency without explicitly doing so. I had sent some patches before > that I don't think ever got picked up that documented this explicitly. > I very much don't want people to come to rely on our behaviour unless we > explicitly guarantee it. FWIW I think the reason that didn't get picked up (I went back and read the discussion) is that there was some feedback on the v1, [1] suggested (at least to me) that you'd re-roll it, but that re-roll never seems to have made it to the list. 1. https://lore.kernel.org/git/YD7aDwX%2FaiRN0GZs@camp.crustytoothpaste.net/ ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution 2023-02-01 12:42 ` Ævar Arnfjörð Bjarmason @ 2023-02-01 23:18 ` brian m. carlson 0 siblings, 0 replies; 57+ messages in thread From: brian m. carlson @ 2023-02-01 23:18 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason; +Cc: Eli Schwartz, Git List On 2023-02-01 at 12:42:54, Ævar Arnfjörð Bjarmason wrote: > > On Tue, Jan 31 2023, brian m. carlson wrote: > > > Since then, I've been very opposed to us guaranteeing output format > > consistency without explicitly doing so. I had sent some patches before > > that I don't think ever got picked up that documented this explicitly. > > I very much don't want people to come to rely on our behaviour unless we > > explicitly guarantee it. > > FWIW I think the reason that didn't get picked up (I went back and read > the discussion) is that there was some feedback on the v1, [1] suggested > (at least to me) that you'd re-roll it, but that re-roll never seems to > have made it to the list. That may very well have been the case. As mentioned upthread, I have very limited time to work on Git these days, and sometimes things just fall through the cracks. -- brian m. carlson (he/him or they/them) Toronto, Ontario, CA ^ permalink raw reply [flat|nested] 57+ messages in thread
end of thread, other threads:[~2023-02-12 17:41 UTC | newest] Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2023-01-31 0:06 Stability of git-archive, breaking (?) the Github universe, and a possible solution Eli Schwartz 2023-01-31 7:49 ` Ævar Arnfjörð Bjarmason 2023-01-31 9:11 ` Eli Schwartz 2023-02-02 9:32 ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Ævar Arnfjörð Bjarmason 2023-02-02 9:32 ` [PATCH 1/9] archive & tar config docs: de-duplicate configuration section Ævar Arnfjörð Bjarmason 2023-02-02 9:32 ` [PATCH 2/9] git config docs: document "tar.<format>.{command,remote}" Ævar Arnfjörð Bjarmason 2023-02-02 9:32 ` [PATCH 3/9] archiver API: make the "flags" in "struct archiver" an enum Ævar Arnfjörð Bjarmason 2023-02-02 9:32 ` [PATCH 4/9] archive: omit the shell for built-in "command" filters Ævar Arnfjörð Bjarmason 2023-02-02 9:32 ` [PATCH 5/9] archive-tar.c: move internal gzip implementation to a function Ævar Arnfjörð Bjarmason 2023-02-02 9:32 ` [PATCH 6/9] archive: use "gzip -cn" for stability, not "git archive gzip" Ævar Arnfjörð Bjarmason 2023-02-02 9:32 ` [PATCH 7/9] test-lib.sh: add a lazy GZIP prerequisite Ævar Arnfjörð Bjarmason 2023-02-02 9:32 ` [PATCH 8/9] archive tests: test for "gzip -cn" and "git archive gzip" stability Ævar Arnfjörð Bjarmason 2023-02-02 9:32 ` [PATCH 9/9] git archive docs: document output non-stability Ævar Arnfjörð Bjarmason 2023-02-02 10:25 ` brian m. carlson 2023-02-02 10:30 ` Ævar Arnfjörð Bjarmason 2023-02-02 16:34 ` Junio C Hamano 2023-02-04 17:46 ` brian m. 
carlson 2023-02-02 16:17 ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Phillip Wood 2023-02-02 16:40 ` Junio C Hamano 2023-02-03 13:49 ` Ævar Arnfjörð Bjarmason 2023-02-06 14:46 ` Phillip Wood 2023-02-03 15:47 ` Theodore Ts'o 2023-02-02 16:25 ` Junio C Hamano 2023-02-04 18:08 ` René Scharfe 2023-02-05 21:30 ` Ævar Arnfjörð Bjarmason 2023-02-12 17:41 ` René Scharfe 2023-02-02 19:23 ` Raymond E. Pasco 2023-02-03 8:06 ` [PATCH] archive: document output stability concerns Raymond E. Pasco 2023-01-31 9:54 ` Stability of git-archive, breaking (?) the Github universe, and a possible solution brian m. carlson 2023-01-31 11:31 ` Ævar Arnfjörð Bjarmason 2023-01-31 15:05 ` Konstantin Ryabitsev 2023-01-31 22:32 ` brian m. carlson 2023-02-01 9:40 ` Ævar Arnfjörð Bjarmason 2023-02-01 11:34 ` demerphq 2023-02-01 12:21 ` Michal Suchánek 2023-02-01 12:48 ` demerphq 2023-02-01 13:43 ` Ævar Arnfjörð Bjarmason 2023-02-01 15:21 ` demerphq 2023-02-01 18:56 ` Theodore Ts'o 2023-02-02 21:19 ` Joey Hess 2023-02-03 4:02 ` Theodore Ts'o 2023-02-03 13:32 ` Ævar Arnfjörð Bjarmason 2023-02-01 23:16 ` brian m. carlson 2023-02-01 23:37 ` Junio C Hamano 2023-02-02 23:01 ` brian m. carlson 2023-02-02 23:47 ` rsbecker 2023-02-03 13:18 ` Ævar Arnfjörð Bjarmason 2023-02-02 0:42 ` Ævar Arnfjörð Bjarmason 2023-02-01 12:17 ` Raymond E. Pasco 2023-01-31 15:56 ` Eli Schwartz 2023-01-31 16:20 ` Konstantin Ryabitsev 2023-01-31 16:34 ` Eli Schwartz 2023-01-31 20:34 ` Konstantin Ryabitsev 2023-01-31 20:45 ` Michal Suchánek 2023-02-01 1:33 ` brian m. carlson 2023-02-01 12:42 ` Ævar Arnfjörð Bjarmason 2023-02-01 23:18 ` brian m. carlson