git.vger.kernel.org archive mirror
* Stability of git-archive, breaking (?) the Github universe, and a possible solution
@ 2023-01-31  0:06 Eli Schwartz
  2023-01-31  7:49 ` Ævar Arnfjörð Bjarmason
  2023-01-31  9:54 ` Stability of git-archive, breaking (?) the Github universe, and a possible solution brian m. carlson
  0 siblings, 2 replies; 57+ messages in thread
From: Eli Schwartz @ 2023-01-31  0:06 UTC (permalink / raw)
  To: Git List; +Cc: brian m. carlson

For those that haven't seen, github changed its checksums for all
"source code" artifacts attached to any git repository with tags. This
change is now reverted due to widespread breakage -- and the lack of
advance warning. The technical details of the change appear simple: they
upgraded git.

Probably the main discussion, complete with Github employees from this
mailing list responding:

https://github.com/bazel-contrib/SIG-rules-authors/issues/11#issuecomment-1409438954

Consequences of that discussion, attempting to mitigate issues by
warning people that it already happened:

https://github.blog/changelog/2023-01-30-git-archive-checksums-may-change/

And where I first saw it: https://github.com/mesonbuild/wrapdb/pull/884

Historically speaking, git-archive has been stable minus... a bug fix or
two in rare cases, specifically relating to an inability to transcribe
the contents of the git repo at all, I think? And the other factor is
the compression algorithm used, which is generally GNU gzip, and
historically whatever the system `gzip` command is.

And gzip is a stable format. It's a worn-out, battle-weary format, even
-- it's not the best at compressing, and it's not the best at
decompressing, and "all the cool kids" are working on cooler formats,
such as zstd which does indeed regularly change its byte output between
versions. But the advantage of gzip is that it's good *enough*, and it's
probably *everywhere*, and it's *reliable*.

GNU gzip is reproducible. busybox gzip was fixed to agree with GNU gzip
(this is relevant to the handful of people running software forges on,
say, Alpine Linux):

https://reproducible-builds.org/reports/2019-08/#upstream-news

...

Nevertheless, I've seen the sentiment a few times that git doesn't like
committing to output stability of git-archive, because it isn't
officially documented (but it's not entirely clear what the benefits of
changing are). And yet, git endeavors to do so, in order to prevent
unnecessary breakage of people who embody Hyrum's Law and need that
stability.

Even with the new change to the compressor, git-archive is still
reproducible, it's the internal gzip compressor that isn't. (This may be
fixable, possibly by embedding an implementation from busybox or from
GNU gzip? I'm not going to discuss that right now, though I think it's
an interesting avenue of exploration.)

I've thought about this now and then over the last couple of years,
because I think I have a reasonable compromise that might make everyone
(or at least most people) happy, and now seems like a good time to
mention it.

What does everyone think about offering versioned git-archive outputs?
This could be user-selectable as an option to `git archive`, but the
main goal would be to select a good versioned output format depending on
what is being archived. So:

- first things first, un-default the internal compressor again
- implement a v2 archive format, where the internal compressor is the
  default -- no other changes
- teach git to select an archive format based on the date of the object
  being archived
  - when given a commit/tag ID to archive, check which support frame the
    committer date falls inside
  - for tree IDs, always use the latest format (it always uses the
    current date anyway)
- schedule a date, for the sake of argument, 6 months after the next
  scheduled release date of git version X.Y in which this change goes
  live; bake this into the git sources as a transition date, all commits
  or tags generated after this date fall into the next format support
  frame
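
A minimal sketch of the selection rule in the list above, in Python for brevity. Everything named here is hypothetical: `TRANSITION_DATE`, `select_archive_format`, and the "v1"/"v2" labels are made up for illustration, and the concrete date is an arbitrary placeholder, not something git defines.

```python
from datetime import datetime, timezone
from typing import Optional

# Hypothetical transition date baked into the sources. The proposal does
# not fix a concrete date; this value is only a placeholder.
TRANSITION_DATE = datetime(2024, 1, 1, tzinfo=timezone.utc)


def select_archive_format(committer_date: Optional[datetime]) -> str:
    """Pick an archive format version for the object being archived.

    committer_date is None for bare tree IDs, which carry no date of
    their own and therefore always get the latest format.
    """
    if committer_date is None:
        return "v2"
    # Commits or tags created before the transition keep the old format;
    # anything at or after it falls into the next support frame.
    return "v1" if committer_date < TRANSITION_DATE else "v2"
```

The point of the design is that the decision depends only on data stored in the repository, so any git version that knows the transition date makes the same choice for the same commit.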


The end result is that for all historic commits or tags, `git archive`
will always produce the same output. This can be documented in the
git-archive manpage: "the produced archive is guaranteed to be
reproducible, unless you override the `tar.<format>.command` or your
system compressor is not reproducible".

For *new* commits or tags, everyone gets the benefit of fascinating,
cool new archive formats with useful improvements at the tar container
level, which is apparently a very desirable feature. The git project no
longer has to worry, at all, about whether users will come to complain
about how their build pipelines suddenly fail with checksum issues. The
git project can simply, fearlessly, go implement innovative new changes
without giving any thought to backwards compatibility.

It is, simply, that those new changes only apply to projects which are
still under active development, and which push new commits or tag new
releases after the transition date.

Old states of existing projects (regardless of whether they are still
actively updating) can go have their old and apparently inefficient
archives and don't get cool new stuff. That's fine. They're also
increasingly rarely used, because they are, after all, old -- and most
likely only used for historic archival purposes. If the worst comes to
worst, well, they managed to produce a somehow useful archive with an
older version of git -- nothing will *break* if they don't get the cool
new stuff.

And for the vast majority of new downloads for new stuff, the in-process
compressor saves one fork+exec and is a bit more efficient, I guess?

A note on the transition date: I suggested 6 months after the scheduled
release date, because this gives everyone running a software forge time
to update git itself, and have everything ready, in time to handle the
first wave of commits and tags that naturally occur after the transition
date. And you don't want it to be immediate, because then people will
take days or weeks to deploy and the most recent archives will change.


For the purposes of this thought experiment, we assume that people don't
routinely set the system time to a year in the future. This will only be
done in situations such as, say, testing a git upgrade deployment for a
software forge.

...


"And then no one ever complained about archive checksums changing again."

🤞🙏🥺

-- 
Eli Schwartz


* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution
  2023-01-31  0:06 Stability of git-archive, breaking (?) the Github universe, and a possible solution Eli Schwartz
@ 2023-01-31  7:49 ` Ævar Arnfjörð Bjarmason
  2023-01-31  9:11   ` Eli Schwartz
  2023-02-02  9:32   ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Ævar Arnfjörð Bjarmason
  2023-01-31  9:54 ` Stability of git-archive, breaking (?) the Github universe, and a possible solution brian m. carlson
  1 sibling, 2 replies; 57+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-01-31  7:49 UTC (permalink / raw)
  To: Eli Schwartz
  Cc: Git List, brian m. carlson, René Scharfe,
	Johannes Schindelin, Jeff King


On Mon, Jan 30 2023, Eli Schwartz wrote:

> For those that haven't seen, github changed its checksums for all
> "source code" artifacts attached to any git repository with tags. This
> change is now reverted due to widespread breakage -- and the lack of
> advance warning. The technical details of the change appear simple: they
> upgraded git.
>
> Probably the main discussion, complete with Github employees from this
> mailing list responding:
>
> https://github.com/bazel-contrib/SIG-rules-authors/issues/11#issuecomment-1409438954
>
> Consequences of that discussion, attempting to mitigate issues by
> warning people that it already happened:
>
> https://github.blog/changelog/2023-01-30-git-archive-checksums-may-change/
>
> And where I first saw it: https://github.com/mesonbuild/wrapdb/pull/884

Maybe I'm the only one that missed this on a first reading, but I
couldn't find what specific change in Git was being discussed.

But it's linked from the now-strikethrough portion of that github.blog
URL: 4f4be00d302 (archive-tar: use internal gzip by default,
2022-06-15), first released with v2.38.0.

That's the change to use gzip as a library instead of gzip(1). I've
added the author to the CC list, as well as others in the initial ML
discussion.

The ML discussion about that series starts at:
https://lore.kernel.org/git/pull.145.git.gitgitgadget@gmail.com/

For that change specifically I had this comment at the time:
https://lore.kernel.org/git/220615.86wndhwt9a.gmgdl@evledraar.gmail.com/

The response from René
(https://lore.kernel.org/git/3ed80afd-34b3-afd8-5ffb-0187a4475ee1@web.de/)
fills in the "why" missing from the commit message itself:

	"It's to avoid a run dependency [on gzip(1)] [...] and you can
	set tar.tgz.command='gzip -cn' to get the old behavior.  Saving
	energy is a better default, though."

We can discuss how worthwhile that trade-off is, especially in the face
of this behavior change GitHub encountered, but I don't think the intent
of this change was to alter the output (though maybe René was aware of
that and didn't note it).

Which brings me to...

> Historically speaking, git-archive has been stable minus... a bug fix or
> two in rare cases, specifically relating to an inability to transcribe
> the contents of the git repo at all, I think? And the other factor is
> the compression algorithm used, which is generally GNU gzip, and
> historically whatever the system `gzip` command is.
>
> And gzip is a stable format. It's a worn-out, battle-weary format, even
> -- it's not the best at compressing, and it's not the best at
> decompressing, and "all the cool kids" are working on cooler formats,
> such as zstd which does indeed regularly change its byte output between
> versions. But the advantage of gzip is that it's good *enough*, and it's
> probably *everywhere*, and it's *reliable*.
>
> GNU gzip is reproducible. busybox gzip was fixed to agree with GNU gzip
> (this is relevant to the handful of people running software forges on,
> say, Alpine Linux):
>
> https://reproducible-builds.org/reports/2019-08/#upstream-news
>
> ...
>
> Nevertheless, I've seen the sentiment a few times that git doesn't like
> committing to output stability of git-archive, because it isn't
> officially documented (but it's not entirely clear what the benefits of
> changing are). And yet, git endeavors to do so, in order to prevent
> unnecessary breakage of people who embody Hyrum's Law and need that
> stability.

...Yes, this has been discussed many times on-list.

My recollection of those discussions in general is that we were mostly
talking about the "tar" format itself, more so than "gzip", although in
this case it's a change in the gzip component that changed the output.

It's not clear to me (and I'm asking instead of digging myself, as I
assume someone at GitHub has dug already) whether our change to the
"internal gzip" is necessarily going to result in a different hash, or
did we just forget to provide some option to the library to get the same
result as gzip(1).

A major thing you're eliding here is that even if "tar" or "gzip" is
"a worn-out, battle-weary format", that does *not* translate to it being
a trivial matter to maintain byte-for-byte compatibility in the archives
(or compression stream) you produce, even though the resulting output
once un-archived or un-compressed is guaranteed to be the same.

We ship our own "tar" for the purposes of this discussion (the archive.c
code etc.), but offload the "gzip" part to either an external library
(which is new in v2.38.0, and the subject of this discussion), or to
GNU's gzip command.

I have no idea if the "gzip" part of this would be as easy as saying
"we'll default to gzip(1)". You note "GNU gzip is reproducible. busybox
gzip was fixed to agree with GNU gzip", but does the same apply to other
"gzip(1)" implementations? I know of at least the BSD gzip.

Even then, has even GNU gzip promised that it will forever maintain
byte-for-byte compatibility in its output?

> Even with the new change to the compressor, git-archive is still
> reproducible, it's the internal gzip compressor that isn't. (This may be
> fixable, possibly by embedding an implementation from busybox or from
> GNU gzip? I'm not going to discuss that right now, though I think it's
> an interesting avenue of exploration.)

So first, aside from whatever the git project does about the default,
have you tried running the newer git version with a
tar.tgz.command='gzip -cn' and seeing if it's compatible with the old
version?

It's unclear from the blog post's "we are reverting this change for now"
whether that meant a revert of the git version (probably), or a revert
back to using gzip(1).

> I've thought about this now and then over the last couple of years,
> because I think I have a reasonable compromise that might make everyone
> (or at least most people) happy, and now seems like a good time to
> mention it.
>
> What does everyone think about offering versioned git-archive outputs?
> This could be user-selectable as an option to `git archive`, but the
> main goal would be to select a good versioned output format depending on
> what is being archived. So:
>
> - first things first, un-default the internal compressor again
> - implement a v2 archive format, where the internal compressor is the
>   default -- no other changes
> - teach git to select an archive format based on the date of the object
>   being archived
>   - when given a commit/tag ID to archive, check which support frame the
>     committer date falls inside
>   - for tree IDs, always use the latest format (it always uses the
>     current date anyway)
> - schedule a date, for the sake of argument, 6 months after the next
>   scheduled release date of git version X.Y in which this change goes
>   live; bake this into the git sources as a transition date, all commits
>   or tags generated after this date fall into the next format support
>   frame
>
> The end result is that for all historic commits or tags, `git archive`
> will always produce the same output. This can be documented in the
> git-archive manpage: "the produced archive is guaranteed to be
> reproducible, unless you override the `tar.<format>.command` or your
> system compressor is not reproducible".
>
> For *new* commits or tags, everyone gets the benefit of fascinating,
> cool new archive formats with useful improvements at the tar container
> level, which is apparently a very desirable feature. The git project no
> longer has to worry, at all, about whether users will come to complain
> about how their build pipelines suddenly fail with checksum issues. The
> git project can simply, fearlessly, go implement innovative new changes
> without giving any thought to backwards compatibility.
>
> It is, simply, that those new changes only apply to projects which are
> still under active development, and which push new commits or tag new
> releases after the transition date.
>
> Old states of existing projects (regardless of whether they are still
> actively updating) can go have their old and apparently inefficient
> archives and don't get cool new stuff. That's fine. They're also
> increasingly rarely used, because they are, after all, old -- and most
> likely only used for historic archival purposes. If the worst comes to
> worst, well, they managed to produce a somehow useful archive with an
> older version of git -- nothing will *break* if they don't get the cool
> new stuff.
>
> And for the vast majority of new downloads for new stuff, the in-process
> compressor saves one fork+exec and is a bit more efficient, I guess?
>
> A note on the transition date: I suggested 6 months after the scheduled
> release date, because this gives everyone running a software forge time
> to update git itself, and have everything ready, in time to handle the
> first wave of commits and tags that naturally occur after the transition
> date. And you don't want it to be immediate, because then people will
> take days or weeks to deploy and the most recent archives will change.
>
> For the purposes of this thought experiment, we assume that people don't
> routinely set the system time to a year in the future. This will only be
> done in situations such as, say, testing a git upgrade deployment for a
> software forge.

This sounds like a workable transition plan, but it assumes that we had
a really good reason to change to the "internal gzip" by default, and
that we must move forward with that change in some way.

I don't think that's the case, per the linked-to on-list discussion. The
aim was just to provide output if gzip(1) wasn't available, so all we'd
need is the pseudocode of:

	- Prepare our tar stream
	- Try to stream it to gzip(1)
	- If that fails with "command does not exist" fall back to the
	  internal one (possibly with a warning about possibly-different
	  output)

Then systems without a gzip(1) could produce output (which René was
aiming for), but those with a system gzip(1) (e.g. GitHub's production
installation) could just continue to use it.
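
A rough sketch of that fallback order, written in Python rather than C to keep it short. It is not how git implements anything: `compress_tar_stream` is a hypothetical name, and Python's zlib-backed `gzip` module merely stands in for git's internal compressor.

```python
import gzip
import subprocess
import sys


def compress_tar_stream(tar_bytes: bytes) -> bytes:
    """Compress with the system gzip(1), falling back to an in-process one."""
    try:
        # "gzip -cn": write to stdout, omit the original name and timestamp.
        result = subprocess.run(
            ["gzip", "-cn"],
            input=tar_bytes,
            stdout=subprocess.PIPE,
            check=True,
        )
        return result.stdout
    except FileNotFoundError:
        # No gzip(1) on PATH: warn, since the fallback compressor may
        # produce different bytes for the same input.
        print("warning: gzip(1) not found; output may differ", file=sys.stderr)
        return gzip.compress(tar_bytes, mtime=0)
```

Either branch yields a valid gzip stream of the same payload; only the compressed bytes, and hence the checksum of the .tar.gz, may differ between the two.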

That's still a band-aid on the larger questions I raised above,
i.e. whether we'd want to forever guarantee the output of "git archive"
itself, and of the "tar.tgz.command".

My off-the-cuff response to that is that we should probably:

 - Guarantee the "git archive" output itself (without compression),
   leaving the out that it *may* change in the future with notice (or
   we'd just version it)

 - Switch back to using gzip(1) by default, whatever gzip(1) that
   happens to be.

But we should not:

 - Promise that the total end result will be byte-for-byte the same, as
   that would imply a promise about the external gzip(1).

Instead, we'd just prominently note in our docs that if you want the
archive->compression result to be byte-for-byte identical with the past,
it's up to you to ensure that your compressor gives you that guarantee.


* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution
  2023-01-31  7:49 ` Ævar Arnfjörð Bjarmason
@ 2023-01-31  9:11   ` Eli Schwartz
  2023-02-02  9:32   ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Ævar Arnfjörð Bjarmason
  1 sibling, 0 replies; 57+ messages in thread
From: Eli Schwartz @ 2023-01-31  9:11 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Git List, brian m. carlson, René Scharfe,
	Johannes Schindelin, Jeff King

Quick response for now...

On 1/31/23 2:49 AM, Ævar Arnfjörð Bjarmason wrote:
> So first, aside from whatever the git project does about the default,
> have you tried running the newer git version with a
> tar.tgz.command='gzip -cn' and seeing if it's compatible with the old
> version?
> 
> It's unclear from the blog post's "we are reverting this change for now"
> whether that meant a revert of the git version (probably), or a revert
> back to using gzip(1).


I do not know which one Github internally did, but I can confirm that
the gzipped tarballs which github started shipping, when gunzipped,
produced an uncompressed tarball that was byte-identical to uncompressed
editions of the historic ones.

i.e. you could do this:

```
wget "${important_archive_release}"

# decompress the downloaded archive and recompress it with the system gzip
gzip -dc < "${important_archive_localfile}" | gzip -cn > "${important_archive_localfile}.new"
```

And:
- they have different checksums
- the .new file has reverted to the same checksum as historic versions
  from last year that are frozen into manifests

That was part of my original investigation, before I located the public
conversations.
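
The same effect can be reproduced without any GitHub download. The Python sketch below is only an illustration under stated assumptions: the payload is made up, and two different settings of Python's `gzip` module stand in for two different gzip implementations.

```python
import gzip
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of data."""
    return hashlib.sha256(data).hexdigest()


# Stand-in for an uncompressed tarball (made-up payload).
payload = b"example tar contents\n" * 512

# Two compressor configurations standing in for two gzip implementations:
# they differ in compression level and in the header timestamp.
old_style = gzip.compress(payload, compresslevel=9, mtime=0)
new_style = gzip.compress(payload, compresslevel=1, mtime=1675123200)

compressed_match = sha256_hex(old_style) == sha256_hex(new_style)
payload_match = (
    sha256_hex(gzip.decompress(old_style))
    == sha256_hex(gzip.decompress(new_style))
)

print("compressed checksums match:  ", compressed_match)
print("uncompressed checksums match:", payload_match)
```

The compressed checksums disagree while the uncompressed ones agree, which is the situation manifests with frozen .tar.gz checksums ran into.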


-- 
Eli Schwartz


* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution
  2023-01-31  0:06 Stability of git-archive, breaking (?) the Github universe, and a possible solution Eli Schwartz
  2023-01-31  7:49 ` Ævar Arnfjörð Bjarmason
@ 2023-01-31  9:54 ` brian m. carlson
  2023-01-31 11:31   ` Ævar Arnfjörð Bjarmason
                     ` (3 more replies)
  1 sibling, 4 replies; 57+ messages in thread
From: brian m. carlson @ 2023-01-31  9:54 UTC (permalink / raw)
  To: Eli Schwartz; +Cc: Git List


On 2023-01-31 at 00:06:44, Eli Schwartz wrote:
> Nevertheless, I've seen the sentiment a few times that git doesn't like
> committing to output stability of git-archive, because it isn't
> officially documented (but it's not entirely clear what the benefits of
> changing are). And yet, git endeavors to do so, in order to prevent
> unnecessary breakage of people who embody Hyrum's Law and need that
> stability.

I'm one of the GitHub employees who chimed in there, and I'm also a Git
contributor in my own time (and I am speaking here only in my personal
capacity, since this is a personal address).  I made a change some years
back to the archive format to fix the permissions on pax headers when
extracted as files, and kernel.org was relying on that and broke.  Linus
yelled at me because of that.

Since then, I've been very opposed to us guaranteeing output format
consistency without explicitly doing so.  I had sent some patches before
that I don't think ever got picked up that documented this explicitly.
I very much don't want people to come to rely on our behaviour unless we
explicitly guarantee it.

> What does everyone think about offering versioned git-archive outputs?
> This could be user-selectable as an option to `git archive`, but the
> main goal would be to select a good versioned output format depending on
> what is being archived. So:
> 
> - first things first, un-default the internal compressor again
> - implement a v2 archive format, where the internal compressor is the
>   default -- no other changes
> - teach git to select an archive format based on the date of the object
>   being archived
>   - when given a commit/tag ID to archive, check which support frame the
>     committer date falls inside
>   - for tree IDs, always use the latest format (it always uses the
>     current date anyway)
> - schedule a date, for the sake of argument, 6 months after the next
>   scheduled release date of git version X.Y in which this change goes
>   live; bake this into the git sources as a transition date, all commits
>   or tags generated after this date fall into the next format support
>   frame

I am actually very much in favour of providing a standard, deterministic
version of pax (the extended tar format) that we use and documenting it
as a standard so that other archive tools can use that.  That is, we
document some canonical tar format that is bit-for-bit identical that we
(and hopefully GNU tar and libarchive) will agree should be used to
serialize files for software interchange.  I don't think this should be
dependent on the date at all, but I do believe it should be versioned
and tested, and the version number embedded as a pax header.  I think
this would be valuable for simply having reproducible archives in
general, including for things like Docker containers, Debian packages,
Rust crates, and more, and I'm happy to work with others on such a
format, as I've said in the past on the list.  People can opt-in to
whatever format they want when creating an archive and continue to use
that forever if they like.

Part of the reason I think this is valuable is that once SHA-1 and
SHA-256 interoperability is present, git archive will change the
contents of the archive format, since it will embed a SHA-256 hash into
the file instead of a SHA-1 hash, since that's what's in the repository.
Thus, we can't produce an archive that's deterministic in the face of
SHA-1/SHA-256 interoperability concerns, and we need to create a new
format that doesn't contain that data embedded in it.

Having said that, I don't think this should be based on the timestamp of
the file, since that means that two otherwise identical archives
differing in timestamp aren't ever going to be the same, and we do see
people who import or vendor other projects.  Nor do I think we should
attempt to provide consistent compression, since I believe the output of
things like zlib has changed in the past, and we can't continually carry
an old, potentially insecure version of zlib just because the output
changed.  People should be able to implement compression using gzip,
zlib, pigz, miniz_oxide, or whatever if they want, since people
implement Git in many different languages, and we won't want to force
people using memory-safe languages like Go and Rust to explicitly use
zlib for archives.

That may mean that it's important for people to actually decompress the
archive before checking hashes if they want deterministic behaviour, and
I'm okay with that.  You already have to do that if you're verifying the
signature on Git tarballs, since only the uncompressed tar archive is
signed, so I don't think this is out of the question.
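
A small sketch of what "decompress before checking hashes" could look like on the consumer side, in Python. The helper name `sha256_of_uncompressed` is hypothetical, and the expected digest is assumed to have been recorded over the uncompressed tar stream.

```python
import gzip
import hashlib
import io


def sha256_of_uncompressed(gz_bytes: bytes, chunk_size: int = 65536) -> str:
    """Hash the decompressed stream, ignoring how it was compressed."""
    digest = hashlib.sha256()
    with gzip.GzipFile(fileobj=io.BytesIO(gz_bytes)) as stream:
        # Read in chunks so large archives need not fit in memory twice.
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A manifest that stores this digest instead of the digest of the .tar.gz keeps verifying no matter which gzip, zlib, or other implementation produced the compressed file.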
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA



* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution
  2023-01-31  9:54 ` Stability of git-archive, breaking (?) the Github universe, and a possible solution brian m. carlson
@ 2023-01-31 11:31   ` Ævar Arnfjörð Bjarmason
  2023-01-31 15:05   ` Konstantin Ryabitsev
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 57+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-01-31 11:31 UTC (permalink / raw)
  To: brian m. carlson; +Cc: Eli Schwartz, Git List


On Tue, Jan 31 2023, brian m. carlson wrote:

> Part of the reason I think this is valuable is that once SHA-1 and
> SHA-256 interoperability is present, git archive will change the
> contents of the archive format, since it will embed a SHA-256 hash into
> the file instead of a SHA-1 hash, since that's what's in the repository.
> Thus, we can't produce an archive that's deterministic in the face of
> SHA-1/SHA-256 interoperability concerns, and we need to create a new
> format that doesn't contain that data embedded in it.

I don't see why a format change would be required in this context.

If a repository were to switch over to SHA-256 wouldn't a better
solution to this be to disambiguate whether you're requesting a SHA-1 or
SHA-256 derived archive in the URL? E.g. to never serve up an archive
with a SHA-256 embedded in the header at:

	https://github.com/git/git/archive/refs/tags/v2.39.1.tar.gz

But require a URL like:

	https://github.com/git/git/archive-sha256/refs/tags/v2.39.1.tar.gz

If you did that then existing archives would continue to have the same
byte-for-byte content (assuming that the result of this discussion is
that we support that forever), but they'd always be generated with "-c
extensions.objectFormat=sha1". For always-SHA256 repos such a URL would
fail to generate anything.

But for repos that used to be SHA-1 but are now SHA-256 either URL would
work, but the PAX header would be different, referring to the SHA-1 or
SHA-256 commit, respectively.
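
To illustrate the header difference: "git archive" records the commit ID as a "comment" in a pax global extended header. The Python sketch below builds a tiny archive in memory (not with git) and guesses the hash algorithm from the length of that ID. `build_archive` and `embedded_hash_kind` are hypothetical helper names, and the length check is a simplifying assumption.

```python
import io
import tarfile


def build_archive(commit_id: str) -> bytes:
    """Build a tiny pax archive with commit_id as a global comment."""
    buf = io.BytesIO()
    with tarfile.open(
        fileobj=buf,
        mode="w",
        format=tarfile.PAX_FORMAT,
        pax_headers={"comment": commit_id},
    ) as tar:
        body = b"hello\n"
        info = tarfile.TarInfo("README")
        info.size = len(body)
        tar.addfile(info, io.BytesIO(body))
    return buf.getvalue()


def embedded_hash_kind(archive: bytes) -> str:
    """Guess the hash algorithm from the length of the embedded commit ID."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:") as tar:
        commit_id = tar.pax_headers.get("comment", "")
    # SHA-1 object IDs are 40 hex digits, SHA-256 ones are 64.
    return {40: "sha1", 64: "sha256"}.get(len(commit_id), "unknown")
```

Two archives of identical file contents therefore still differ byte-for-byte if one embeds the SHA-1 name of the commit and the other the SHA-256 name.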

Whereas your proposal seems to be that we should omit that SHA-(1|256)
from the "comment" entirely. That would seem to require either a one-off
change of all existing archives, or some cut-off date (or other marker).

If you've got a cut-off, you could also just use it to decide whether to
generate a SHA-1 or SHA-256 archive, and without that you'd be back to
the one-off breakage.

I also find it very useful that we've got the commit OID in the archive,
as it allows for round-tripping from archives back to the relevant
repository commit. Losing that entirely for SHA-1<->SHA-256 interop
would be unfortunate, especially if it turns out we could have easily
kept it.

> Having said that, I don't think this should be based on the timestamp of
> the file, since that means that two otherwise identical archives
> differing in timestamp aren't ever going to be the same, and we do see
> people who import or vendor other projects.

Yes, I agree that doing this by that sort of heuristic would be bad.

> Nor do I think we should
> attempt to provide consistent compression, since I believe the output of
> things like zlib has changed in the past, and we can't continually carry
> an old, potentially insecure version of zlib just because the output
> changed.  People should be able to implement compression using gzip,
> zlib, pigz, miniz_oxide, or whatever if they want, since people
> implement Git in many different languages, and we won't want to force
> people using memory-safe languages like Go and Rust to explicitly use
> zlib for archives.

As I noted in the side-thread I think an acceptable solution would be to
push the problem of the consistent compressor downstream. I.e. if a site
like GitHub wants to maintain a potentially old version of GNU gzip that
should be up to them.

But I think it's a valid concern that we should guarantee the stability
of the archive format.


* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution
  2023-01-31  9:54 ` Stability of git-archive, breaking (?) the Github universe, and a possible solution brian m. carlson
  2023-01-31 11:31   ` Ævar Arnfjörð Bjarmason
@ 2023-01-31 15:05   ` Konstantin Ryabitsev
  2023-01-31 22:32     ` brian m. carlson
  2023-01-31 15:56   ` Eli Schwartz
  2023-02-01 12:42   ` Ævar Arnfjörð Bjarmason
  3 siblings, 1 reply; 57+ messages in thread
From: Konstantin Ryabitsev @ 2023-01-31 15:05 UTC (permalink / raw)
  To: brian m. carlson, Eli Schwartz, Git List

On Tue, Jan 31, 2023 at 09:54:58AM +0000, brian m. carlson wrote:
> I'm one of the GitHub employees who chimed in there, and I'm also a Git
> contributor in my own time (and I am speaking here only in my personal
> capacity, since this is a personal address).  I made a change some years
> back to the archive format to fix the permissions on pax headers when
> extracted as files, and kernel.org was relying on that and broke.  Linus
> yelled at me because of that.
> 
> Since then, I've been very opposed to us guaranteeing output format
> consistency without explicitly doing so.  I had sent some patches before
> that I don't think ever got picked up that documented this explicitly.
> I very much don't want people to come to rely on our behaviour unless we
> explicitly guarantee it.

I understand your position, but I also think it's one of those things that
happen despite your best efforts to prevent it. :)

May I suggest adding a "git-archive --stable" that offers this guarantee,
simply as a matter of codifying the fact that the world has built
infrastructure around git's repeatable output. Maybe just for .tar (and
.tar.gz).

I know this complicates the code and makes it more "expensive" to maintain,
but it would be dramatically less expensive than changing the established
practices around the world.

-K


* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution
  2023-01-31  9:54 ` Stability of git-archive, breaking (?) the Github universe, and a possible solution brian m. carlson
  2023-01-31 11:31   ` Ævar Arnfjörð Bjarmason
  2023-01-31 15:05   ` Konstantin Ryabitsev
@ 2023-01-31 15:56   ` Eli Schwartz
  2023-01-31 16:20     ` Konstantin Ryabitsev
  2023-02-01  1:33     ` brian m. carlson
  2023-02-01 12:42   ` Ævar Arnfjörð Bjarmason
  3 siblings, 2 replies; 57+ messages in thread
From: Eli Schwartz @ 2023-01-31 15:56 UTC (permalink / raw)
  To: brian m. carlson, Git List

On 1/31/23 4:54 AM, brian m. carlson wrote:
> Part of the reason I think this is valuable is that once SHA-1 and
> SHA-256 interoperability is present, git archive will change the
> contents of the archive format, since it will embed a SHA-256 hash into
> the file instead of a SHA-1 hash, since that's what's in the repository.
> Thus, we can't produce an archive that's deterministic in the face of
> SHA-1/SHA-256 interoperability concerns, and we need to create a new
> format that doesn't contain that data embedded in it.


I assume that whatever the reason for originally embedding the OID into
the file is still an applicable reason even if a new PAX format is
established for the use of git-archive.

It may not be a great reason -- I don't know. Perhaps there's an
argument to remove it. But can't that be done irrespective of
standardizing the PAX format?

...

I'm not deeply knowledgeable about the SHA-256 transition work -- or
knowledgeable at all about it, frankly. (Also my understanding was it
seems to have stalled as discussed in https://lwn.net/Articles/898522/
-- I understand that you're still enthusiastic about the work? But that
doesn't really answer "is there a timeframe for that to ever happen".)

But I sort of assumed that the transition work would already have to
embed a fair bit of information into the repository about the whole
process? Would it not be possible to determine whether a given tag
started life as SHA-1 or SHA-256? Maybe even just a date when the
repository was converted to work with both, and embed the OID based on
whether the tag is tagging contents that were created after that conversion?

Seems to me like the problem should be solvable if people want to solve it.

...

git-archive run on a commit obviously doesn't have this problem -- it
can simply embed the OID for the same argument it was called with. But I
assume it's far more common to access tag-based github endpoints. :D


> Having said that, I don't think this should be based on the timestamp of
> the file, since that means that two otherwise identical archives
> differing in timestamp aren't ever going to be the same, and we do see
> people who import or vendor other projects. 


The timestamp of the output file? Surely not. But I only suggested the
timestamp of the commit/tag metadata that git-archive is asked to
produce output for. And we would need that in order to solve the problem
that reproducible github API archive endpoints pose.

I'm not sure what the "import or vendor other projects" angle here
means. Do you mean people who copy a directory of files into their
project? Who expects this to be the same to begin with? And doesn't
embedding the OID kill this idea, since the entire point of git commit
sha's is that you shouldn't (it should be prohibitively unrealistic to)
be able to produce the same one twice in different contexts?

I have never said to myself "ah yes, I really would like to be able to
download a git auto-generated tarball for project A, and compare its
hash to the tarball for project B, and have them compare identical even
though they are different projects with different commits". IMHO this
isn't an interesting problem to solve -- the interesting problem to
solve is that a single absolute URL to a downloadable file should be
able to offer documented guarantees that it will always be the same
file, even though it is generated on the fly.


> Nor do I think we should
> attempt to provide consistent compression, since I believe the output of
> things like zlib has changed in the past, and we can't continually carry
> an old, potentially insecure version of zlib just because the output
> changed.  People should be able to implement compression using gzip,
> zlib, pigz, miniz_oxide, or whatever if they want, since people
> implement Git in many different languages, and we won't want to force
> people using memory-safe languages like Go and Rust to explicitly use
> zlib for archives.


I do not think it is realistic or reasonable for people to implement
compression using intentionally incompatible replacements for gzip and
expect interoperability of any sort.

I also don't think people *have* to implement compression in rust using
zlib, but if they are going to make a git-alike that produces archives,
it would be worth it for them to write whatever memory-safe rust is
necessary to memory-safely produce the same output stream of bytes. It's
no less feasible than making sure that busybox gzip and GNU gzip produce
the same output, surely.

Alternatively, they could just not bother with gzip at all, and make
their git-alike produce zstd-compressed tarballs, which change their
byte outputs every time a new zstd release is published. :D Again, why
limit yourself to gzip if you want to be innovative anyway.


> That may mean that it's important for people to actually decompress the
> archive before checking hashes if they want deterministic behaviour, and
> I'm okay with that.  You already have to do that if you're verifying the
> signature on Git tarballs, since only the uncompressed tar archive is
> signed, so I don't think this is out of the question.


This is a very kernel.org-centric view of things, I think. I have rarely
seen PGP signatures applied to the uncompressed tar except in that
context. The vast majority of tarballs with signatures have signed a
single compressed tarball and don't concern themselves with, say,
providing a rotating backdated changeable list of compression formats
with a single signature covering all of them.

Nevertheless, in order to handle kernel.org-style tarballs, you are
entirely correct that one should be able to handle this.

From experience, I can say that this needs to be selected on a
per-tarball basis. Since signature files have filenames, we can match
their stems and given foo.tar.asc and foo.tar.gz, check the signature of
the output of gzip -dc < foo.tar.gz, but given foo.tar.gz.asc and
foo.tar.gz, simply check the signature of the original foo.tar.gz.
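
A minimal sketch of that selection rule, in Python, assuming gzip-only
tarballs and ".asc" signature files. The helper name signed_bytes is
hypothetical, and the actual PGP check is left out; this only picks the
bytes a signature is expected to cover:

```python
import gzip
from pathlib import Path


def signed_bytes(tarball: Path, signature: Path) -> bytes:
    """Return the bytes that `signature` is expected to cover.

    Hypothetical helper: matches the signature's stem against the
    tarball's name, as described above. Only gzip is handled.
    """
    if signature.stem == tarball.name:
        # foo.tar.gz.asc: the compressed file itself was signed.
        return tarball.read_bytes()
    if signature.stem == tarball.stem:
        # foo.tar.asc: the uncompressed tar was signed; this is the
        # equivalent of `gzip -dc < foo.tar.gz`.
        return gzip.decompress(tarball.read_bytes())
    raise ValueError(f"{signature.name} does not match {tarball.name}")
```

The result would then be handed to whatever verifier is in use. Reading
the whole file into memory is a simplification; a real tool would stream.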

This doesn't really work for checksums, because you need to settle on
one or the other everywhere or else embed decompression information into
your checksum metadata field.

And for tarballs that are generated once and uploaded to ftp storage,
not repeatedly generated on the fly, we know the checksum will never
legitimately change, so we *want* to hash the compressed file.
Decompressing kernel.org tarballs in order to run PGP on them is *slow*.
Although at least one can verify the checksums first without
decompression, which is virtually guaranteed to catch invalid source
code releases, so if you ever progress to the PGP verification stage
it's unlikely to be wasted effort -- that tarball is definitely getting
used to build something.


-- 
Eli Schwartz

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution
  2023-01-31 15:56   ` Eli Schwartz
@ 2023-01-31 16:20     ` Konstantin Ryabitsev
  2023-01-31 16:34       ` Eli Schwartz
  2023-02-01  1:33     ` brian m. carlson
  1 sibling, 1 reply; 57+ messages in thread
From: Konstantin Ryabitsev @ 2023-01-31 16:20 UTC (permalink / raw)
  To: Eli Schwartz; +Cc: brian m. carlson, Git List

On Tue, Jan 31, 2023 at 10:56:52AM -0500, Eli Schwartz wrote:
> And for tarballs that are generated once and uploaded to ftp storage,
> not repeatedly generated on the fly, we know the checksum will never
> legitimately change, so we *want* to hash the compressed file.
> Decompressing kernel.org tarballs in order to run PGP on them is *slow*.

FWIW, the most correct way is:

* download sha256sums.asc and verify its signature (auto-signed by infra)
* download the tarball you want and verify that the checksum matches
* uncompress and verify the PGP signature (signed by developer)

This script implements this workflow:
https://git.kernel.org/pub/scm/linux/kernel/git/mricon/korg-helpers.git/tree/get-verified-tarball
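
For illustration only, the second step and the decompression half of the
third can be sketched in Python as below. This is not taken from the
script above; the function names are hypothetical, gzip is assumed as the
compression format, and both signature checks (on the checksum list and
on the uncompressed tar) are left to an external PGP tool:

```python
import gzip
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large tarballs need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_then_decompress(tarball: Path, expected_sha256: str) -> Path:
    """Verify the compressed file's checksum, then decompress it.

    `expected_sha256` is assumed to come from a checksum list whose
    signature was already verified. Raises ValueError on mismatch.
    """
    actual = sha256_of(tarball)
    if actual != expected_sha256:
        raise ValueError(f"checksum mismatch for {tarball.name}: {actual}")
    plain = tarball.with_suffix("")  # foo.tar.gz -> foo.tar
    with gzip.open(tarball, "rb") as src, plain.open("wb") as dst:
        shutil.copyfileobj(src, dst)
    return plain  # this is the file the developer signature covers
```

Checking the cheap checksum first means the slow decompression only
happens for files that already match.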

-K

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution
  2023-01-31 16:20     ` Konstantin Ryabitsev
@ 2023-01-31 16:34       ` Eli Schwartz
  2023-01-31 20:34         ` Konstantin Ryabitsev
  2023-01-31 20:45         ` Michal Suchánek
  0 siblings, 2 replies; 57+ messages in thread
From: Eli Schwartz @ 2023-01-31 16:34 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: brian m. carlson, Git List

On 1/31/23 11:20 AM, Konstantin Ryabitsev wrote:
> On Tue, Jan 31, 2023 at 10:56:52AM -0500, Eli Schwartz wrote:
>> And for tarballs that are generated once and uploaded to ftp storage,
>> not repeatedly generated on the fly, we know the checksum will never
>> legitimately change, so we *want* to hash the compressed file.
>> Decompressing kernel.org tarballs in order to run PGP on them is *slow*.
> 
> FWIW, the most correct way is:
> 
> * download sha256sums.asc and verify its signature (auto-signed by infra)
> * download the tarball you want and verify that the checksum matches
> * uncompress and verify the PGP signature (signed by developer)
> 
> This script implements this workflow:
> https://git.kernel.org/pub/scm/linux/kernel/git/mricon/korg-helpers.git/tree/get-verified-tarball


This is just what I said, but with an additional first step for when you
are updating to a new tarball and don't have your own checksums
integrated into your own ecosystem tracking.

In most contexts, it's utterly unacceptable to not remember the checksum
of the file you used last time and instead simply trust PGP identity
verification. This permits upstream the technical means to be malicious,
and re-upload a totally different tarball with the same name, different
contents, and different PGP signature, and you will never notice because
the PGP signature is still okay.

Just because I trust you all doesn't mean I should ignore existing best
practices to make sure that I always use the same reviewed
byte-identical tarball -- or find out exactly why it changed.


-- 
Eli Schwartz

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution
  2023-01-31 16:34       ` Eli Schwartz
@ 2023-01-31 20:34         ` Konstantin Ryabitsev
  2023-01-31 20:45         ` Michal Suchánek
  1 sibling, 0 replies; 57+ messages in thread
From: Konstantin Ryabitsev @ 2023-01-31 20:34 UTC (permalink / raw)
  To: Eli Schwartz; +Cc: brian m. carlson, Git List

On Tue, Jan 31, 2023 at 11:34:59AM -0500, Eli Schwartz wrote:
> In most contexts, it's utterly unacceptable to not remember the checksum
> of the file you used last time and instead simply trust PGP identity
> verification. This permits upstream the technical means to be malicious,
> and re-upload a totally different tarball with the same name, different
> contents, and different PGP signature, and you will never notice because
> the PGP signature is still okay.

Yes, it's true, and it's something that Sigstore tries to address.

That said, if I wanted to trojan a download and had access to both the
infrastructure and the developer's credentials, I wouldn't pick a months-old
release for this purpose. I would wait until I see a new release coming out
and then swap it mid-flight. This lets me defeat even transparency-log based
solutions like sigstore.

(I'll probably be giving a talk at the Linux Security Summit titled "How to
trojan the Linux Kernel" where I'll go into some of these considerations. :))

-K

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution
  2023-01-31 16:34       ` Eli Schwartz
  2023-01-31 20:34         ` Konstantin Ryabitsev
@ 2023-01-31 20:45         ` Michal Suchánek
  1 sibling, 0 replies; 57+ messages in thread
From: Michal Suchánek @ 2023-01-31 20:45 UTC (permalink / raw)
  To: Eli Schwartz; +Cc: Konstantin Ryabitsev, brian m. carlson, Git List

On Tue, Jan 31, 2023 at 11:34:59AM -0500, Eli Schwartz wrote:
> On 1/31/23 11:20 AM, Konstantin Ryabitsev wrote:
> > On Tue, Jan 31, 2023 at 10:56:52AM -0500, Eli Schwartz wrote:
> >> And for tarballs that are generated once and uploaded to ftp storage,
> >> not repeatedly generated on the fly, we know the checksum will never
> >> legitimately change, so we *want* to hash the compressed file.
> >> Decompressing kernel.org tarballs in order to run PGP on them is *slow*.
> > 
> > FWIW, the most correct way is:
> > 
> > * download sha256sums.asc and verify its signature (auto-signed by infra)
> > * download the tarball you want and verify that the checksum matches
> > * uncompress and verify the PGP signature (signed by developer)
> > 
> > This script implements this workflow:
> > https://git.kernel.org/pub/scm/linux/kernel/git/mricon/korg-helpers.git/tree/get-verified-tarball
> 
> 
> This is just what I said, but with an additional first step for when you
> are updating to a new tarball and don't have your own checksums
> integrated into your own ecosystem tracking.
> 
> In most contexts, it's utterly unacceptable to not remember the checksum
> of the file you used last time and instead simply trust PGP identity
> verification. This permits upstream the technical means to be malicious,
> and re-upload a totally different tarball with the same name, different
> contents, and different PGP signature, and you will never notice because
> the PGP signature is still okay.

But where is the hash remembered?

The signature is a hash+signature; if you can replace that, you can also
replace a hash without a signature.

You can store hashes of anything you want locally, and indeed such
stored hashes in some build systems did detect some code hosting
corruption, but that's not for upstream to do; that's something that only
an unrelated third party can do.

Thanks

Michal

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution
  2023-01-31 15:05   ` Konstantin Ryabitsev
@ 2023-01-31 22:32     ` brian m. carlson
  2023-02-01  9:40       ` Ævar Arnfjörð Bjarmason
  2023-02-01 12:17       ` Raymond E. Pasco
  0 siblings, 2 replies; 57+ messages in thread
From: brian m. carlson @ 2023-01-31 22:32 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: Eli Schwartz, Git List

On 2023-01-31 at 15:05:55, Konstantin Ryabitsev wrote:
> On Tue, Jan 31, 2023 at 09:54:58AM +0000, brian m. carlson wrote:
> > I'm one of the GitHub employees who chimed in there, and I'm also a Git
> > contributor in my own time (and I am speaking here only in my personal
> > capacity, since this is a personal address).  I made a change some years
> > back to the archive format to fix the permissions on pax headers when
> > extracted as files, and kernel.org was relying on that and broke.  Linus
> > yelled at me because of that.
> > 
> > Since then, I've been very opposed to us guaranteeing output format
> > consistency without explicitly doing so.  I had sent some patches before
> > that I don't think ever got picked up that documented this explicitly.
> > I very much don't want people to come to rely on our behaviour unless we
> > explicitly guarantee it.
> 
> I understand your position, but I also think it's one of those things that
> happen despite your best efforts to prevent it. :)
> 
> May I suggest adding a "git-archive --stable" that offers this guarantee,
> simply as a matter of codifying the fact that the world has built
> infrastructure around git's repeatable output. Maybe just for .tar (and
> .tar.gz).

It is my intention to implement just .tar.  That's my proposal: simply a
pax-based format that serializes in a consistent way according to a
predefined spec.

As far as whether other people want to implement consistent compression,
they are welcome to also write a spec and implement it.  I personally
feel that's too hard to get right and am not planning on working on it.
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution
  2023-01-31 15:56   ` Eli Schwartz
  2023-01-31 16:20     ` Konstantin Ryabitsev
@ 2023-02-01  1:33     ` brian m. carlson
  1 sibling, 0 replies; 57+ messages in thread
From: brian m. carlson @ 2023-02-01  1:33 UTC (permalink / raw)
  To: Eli Schwartz; +Cc: Git List

On 2023-01-31 at 15:56:52, Eli Schwartz wrote:
> On 1/31/23 4:54 AM, brian m. carlson wrote:
> > Part of the reason I think this is valuable is that once SHA-1 and
> > SHA-256 interoperability is present, git archive will change the
> > contents of the archive format, since it will embed a SHA-256 hash into
> > the file instead of a SHA-1 hash, since that's what's in the repository.
> > Thus, we can't produce an archive that's deterministic in the face of
> > SHA-1/SHA-256 interoperability concerns, and we need to create a new
> > format that doesn't contain that data embedded in it.
> 
> 
> I assume that whatever the reason for originally embedding the OID into
> the file is still an applicable reason even if a new PAX format is
> established for the use of git-archive.
> 
> It may not be a great reason -- I don't know. Perhaps there's an
> argument to remove it. But can't that be done irrespective of
> standardizing the PAX format?
> 
> ...
> 
> I'm not deeply knowledgeable about the SHA-256 transition work -- or
> knowledgeable at all about it, frankly. (Also my understanding was it
> seems to have stalled as discussed in https://lwn.net/Articles/898522/
> -- I understand that you're still enthusiastic about the work? But that
> doesn't really answer "is there a timeframe for that to ever happen".)

The timeframe is when my employer pays me to work on it.  Right now,
I've implemented functional SHA-256 repositories but am currently a bit
on the way to burnout and am very selective about what things I'm doing
outside of work.  My hope is that my employer will find time for me to
work on the interop stuff soon, but I'm not at liberty to discuss this
more in depth at the moment.

> But I sort of assumed that the transition work would already have to
> embed a fair bit of information into the repository about the whole
> process? Would it not be possible to determine whether a given tag
> started life as SHA-1 or SHA-256? Maybe even just a date when the
> repository was converted to work with both, and embed the OID based on
> whether the tag is tagging contents that were created after that conversion?

It's designed such that the two objects are completely interoperable and
can be accessed by either name, depending on how the repository is
configured locally.  There may be a signature for one algorithm, both,
or neither, so it's hard to say definitively what version it's created
with.  That is completely intentional since the goal is to transition
seamlessly from one to another at any point depending on the preferences
of the owner of the local repository.

> > Having said that, I don't think this should be based on the timestamp of
> > the file, since that means that two otherwise identical archives
> > differing in timestamp aren't ever going to be the same, and we do see
> > people who import or vendor other projects. 
> 
> 
> The timestamp of the output file? Surely not. But I only suggested the
> timestamp of the commit/tag metadata that git-archive is asked to
> produce output for. And we would need that in order to solve the problem
> that reproducible github API archive endpoints pose.

I think it would simply be easier to say, "This is the command-line
option that implements canonical tar version 1."  If you want a
reproducible archive, you use that command-line option, and your
uncompressed tar archive is reproducible.  Otherwise, you get the same
guarantees on reproducibility that we've always provided, which is
absolutely none.

Using commit and tag metadata doesn't solve the problem of trees, which
would use the current timestamp.  It's better to solve the problem in a
consistent way, which would mean embedding a fixed timestamp (probably
the Epoch) into those tree tarballs.
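
As a rough illustration of the idea (not Git's archive format, and not
the proposed spec, which doesn't exist yet), here is a Python sketch that
serializes a set of files to a pax tar with every varying field pinned.
The function name canonical_tar and the chosen mode and ownership values
are assumptions made for the example:

```python
import hashlib
import io
import tarfile


def canonical_tar(files: dict[str, bytes]) -> bytes:
    """Serialize `files` (path -> contents) to a reproducible pax tar."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
        for name in sorted(files):       # fixed member order
            info = tarfile.TarInfo(name)
            info.size = len(files[name])
            info.mtime = 0               # the Epoch, not the current time
            info.mode = 0o644            # fixed permissions
            info.uid = info.gid = 0      # no local user or group ids
            info.uname = info.gname = ""
            tar.addfile(info, io.BytesIO(files[name]))
    return buf.getvalue()


# The same contents give the same bytes, whatever the insertion order.
digest = hashlib.sha256(canonical_tar({"b.txt": b"2", "a.txt": b"1"})).hexdigest()
```

Nothing here depends on the clock or on who runs it, which is the
property a fixed timestamp is meant to buy.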

In my view, using the commit or tag timestamp is very risky, because it
changes the behaviour at some point in the future without notifying
people.  If we produce a tar archive that isn't readable by FooZip, say,
then nobody will realize that until we actually start producing them,
several months after the release.  And, I should point out, this still
poses problems for GitHub and other forges, because GitHub doesn't run
the latest release right away; we usually trail a version or two.  So
using the commit or tag timestamp might mean that on an upgrade,
suddenly the behaviour changes because the new version has a change
(which was scheduled to have occurred in the past) but the old version
doesn't.

In addition, the one guarantee we've given with archives in the past is
that the same version of Git with the same input (flags, repository,
etc.) will produce deterministic results (that is, the same output), and
I think we're likely to run afoul of that with a timestamp-based
approach.  I don't want the archive to suddenly be different because I
happened to do "git commit --amend" to update just a commit message and
we happened to cross that timestamp threshold.

> I'm not sure what the "import or vendor other projects" angle here
> means. Do you mean people who copy a directory of files into their
> project? Who expects this to be the same to begin with? And doesn't
> embedding the OID kill this idea, since the entire point of git commit
> sha's is that you shouldn't (it should be prohibitively unrealistic to)
> be able to produce the same one twice in different contexts?

We have people who import the entirety of Chromium into a project at
one time to work on a browser-based project.

> I have never said to myself "ah yes, I really would like to be able to
> download a git auto-generated tarball for project A, and compare its
> hash to the tarball for project B, and have them compare identical even
> though they are different projects with different commits". IMHO this
> isn't an interesting problem to solve -- the interesting problem to
> solve is that a single absolute URL to a downloadable file should be
> able to offer documented guarantees that it will always be the same
> file, even though it is generated on the fly.

I do think having identical output for identical contents is very
valuable.  If our goal is reproducible output, we should endeavour to
produce identical output for identical input.  What we're specifically
trying to move away from is varying output based on the same input.

> I do not think it is realistic or reasonable for people to implement
> compression using intentionally incompatible replacements for gzip and
> expect interoperability of any sort.

I disagree completely.  The gzip and zlib formats are documented in RFCs
and have been since 1996.  There are already at least a half-dozen
interoperable implementations, including zlib, gzip, pigz, Go's standard
library, miniz_oxide, and the Windows archiver.  I'm sure if I searched
I could find at least half a dozen more.

> I also don't think people *have* to implement compression in rust using
> zlib, but if they are going to make a git-alike that produces archives,
> it would be worth it for them to write whatever memory-safe rust is
> necessary to memory-safely produce the same output stream of bytes. It's
> no less feasible than making sure that busybox gzip and GNU gzip produce
> the same output, surely.

I don't agree at all.  The Go standard library couldn't achieve that,
because busybox and gzip are GPL and doing that would almost certainly
require looking at the code, which would require the Go standard library
to be GPL as well.  The same thing goes for zlib, which is permissively
licensed, and which is clearly the obvious choice if we had to settle on
a standard, since it's a shared library.

That also ignores tools like pigz which provide parallel compression and
can provide an order of magnitude performance increase, but which won't
provide an identical byte stream.  Why should we require people to use a
single core if they have a very large archive that could compress
several times as fast with a parallel operation?
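
A small Python sketch of the distinction, with Python's gzip module
standing in for two different compressors (the payload and the parameter
choices are made up for the example). Both streams are valid gzip and
decompress to the same bytes, but the compressed files are not identical:

```python
import gzip
import hashlib

payload = b"identical tar bytes\n" * 1000

# Two valid gzip encodings of the same payload: a different compression
# level and a different timestamp in the gzip header.
a = gzip.compress(payload, compresslevel=1, mtime=0)
b = gzip.compress(payload, compresslevel=9, mtime=1)


def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


# Hashing the compressed files disagrees; hashing the contents agrees.
compressed_match = sha256(a) == sha256(b)
decompressed_match = sha256(gzip.decompress(a)) == sha256(gzip.decompress(b))
```

Here compressed_match is False and decompressed_match is True, which is
why a hash over the uncompressed tar survives a change of compressor.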

My goal is to produce tar archives that are interoperable based on a
spec.  That spec would be implementable by Git, GNU tar, libarchive, or
anyone else, by reading the spec and following it.  That's very
different from saying, "Well, just make your program do exactly the same
thing as this other one without sharing any code."  If you want to write
a spec for canonical gzip, I'm interested in reading it, but I think
it's practically going to be difficult to achieve.

> > That may mean that it's important for people to actually decompress the
> > archive before checking hashes if they want deterministic behaviour, and
> > I'm okay with that.  You already have to do that if you're verifying the
> > signature on Git tarballs, since only the uncompressed tar archive is
> > signed, so I don't think this is out of the question.
> 
> 
> This is a very kernel.org-centric view of things, I think. I have rarely
> seen PGP signatures applied to the uncompressed tar except in that
> context. The vast majority of tarballs with signatures have signed a
> single compressed tarball and don't concern themselves with, say,
> providing a rotating backdated changeable list of compression formats
> with a single signature covering all of them.

Sure, and that's a valid approach if you have a consistent, persistent
tarball.  However, Git does not persist data forever in tarballs, and
people want to use different versions to get the same data, which is a
new guarantee that we'd be providing.  That is an easy guarantee to
provide with tar, but not an easy guarantee to provide with the gzip
format, as we've all just seen.

> From experience, I can say that this needs to be selected on a
> per-tarball basis. Since signature files have filenames, we can match
> their stems and given foo.tar.asc and foo.tar.gz, check the signature of
> the output of gzip -dc < foo.tar.gz, but given foo.tar.gz.asc and
> foo.tar.gz, simply check the signature of the original foo.tar.gz.
> 
> This doesn't really work for checksums, because you need to settle on
> one or the other everywhere or else embed decompression information into
> your checksum metadata field.

I don't think that's absolutely required.  You need to know how to
decompress the archive, and you can have a hash for the tarball before
decompression or after decompression, as well as possibly needing to
deal with multiple different hash algorithms.  I've implemented this
myself when I was a vendor of Git and lots of other software, and we
would take the hash of the compressed or decompressed archive as shipped
by the vendor and verify it, as long as the hash was sufficiently
strong.

> And for tarballs that are generated once and uploaded to ftp storage,
> not repeatedly generated on the fly, we know the checksum will never
> legitimately change, so we *want* to hash the compressed file.
> Decompressing kernel.org tarballs in order to run PGP on them is *slow*.
> Although at least one can verify the checksums first without
> decompression, which is virtually guaranteed to catch invalid source
> code releases, so if you ever progress to the PGP verification stage
> it's unlikely to be wasted effort -- that tarball is definitely getting
> used to build something.

Sure, and if you want to generate tarballs once and upload them to
storage, go ahead.  That's always an option.  Even GitHub provides you
the option to do that with release assets if you want.

My proposal is to provide deterministic archives in a functionally and
practically achievable way with nothing more than a version of Git,
which I think we can do with tar, but not gzip.  I'm happy to be proven
wrong if you can develop a spec for canonical gzip compression.
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution
  2023-01-31 22:32     ` brian m. carlson
@ 2023-02-01  9:40       ` Ævar Arnfjörð Bjarmason
  2023-02-01 11:34         ` demerphq
  2023-02-01 23:16         ` brian m. carlson
  2023-02-01 12:17       ` Raymond E. Pasco
  1 sibling, 2 replies; 57+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-02-01  9:40 UTC (permalink / raw)
  To: brian m. carlson; +Cc: Konstantin Ryabitsev, Eli Schwartz, Git List


On Tue, Jan 31 2023, brian m. carlson wrote:

> As far as whether other people want to implement consistent compression,
> they are welcome to also write a spec and implement it.  I personally
> feel that's too hard to get right and am not planning on working on it.

"A spec" here seems like overkill to me, so far on that front we've been
shelling out to gzip(1), and the breakage/event that triggered this
thread is rectified by starting to do that again by default.

It means that someone writing a clean-room implementation of git would
likely run into the same issue, if they used e.g. the Go language and a
native Go implementation of deflate.

But so what? We don't need to make promises for all potential git
implementations, just this one. So we could add a blurb like this to the
docs:

	As people have come to rely on the exact "deflate"
	implementation "git archive" promises to invoke the system's
	"gzip" binary by default, under the assumption that its output
	is stable. If that's no longer the case you'll need to complain
	to whoever maintains your local "gzip".

If we wanted to be even more helpful we could bundle and ship an old
version of GNU gzip with our sources, and either default to that, or
offer it as a "--stable" implementation of deflate.

That would be going above & beyond what's needed IMO, but still a lot
easier than the daunting task of writing a specification that exactly
described GNU gzip's current behavior, to the point where you could
clean-room implement it and be guaranteed byte-for-byte compatibility.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution
  2023-02-01  9:40       ` Ævar Arnfjörð Bjarmason
@ 2023-02-01 11:34         ` demerphq
  2023-02-01 12:21           ` Michal Suchánek
  2023-02-01 23:16         ` brian m. carlson
  1 sibling, 1 reply; 57+ messages in thread
From: demerphq @ 2023-02-01 11:34 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: brian m. carlson, Konstantin Ryabitsev, Eli Schwartz, Git List

On Wed, 1 Feb 2023 at 11:26, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> That would be going above & beyond what's needed IMO, but still a lot
> easier than the daunting task of writing a specification that exactly
> described GNU gzip's current behavior, to the point where you could
> clean-room implement it and be guaranteed byte-for-byte compatibility.

Why does it have to be gzip? It is not that hard to come up with a
relatively good compression algorithm that is stable if you aren't
expecting super fast performance or super good compression. If all you
need is good enough but stability is a hard requirement then
algorithms like LZW are available (it has been out of patent since
~2003), and produce reasonable results. If people want a stable
archive then they might have to use some tool that git provides to
decompress and they might not get the best compression ratios, nor
speed, but they would get stability. You can write a decent LZW
implementation in a few hundred lines of code. With a bit of care you
could implement it in a way that allows you to compute the true hash
digest of the compressed data without actually decompressing it as
well, which would address some of the concerns that brian raised with
regard to security I think.

Why does this email remind me of that old canard that any sufficiently
advanced piece of software gains the ability to send emails? :-)

cheers,
Yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution
  2023-01-31 22:32     ` brian m. carlson
  2023-02-01  9:40       ` Ævar Arnfjörð Bjarmason
@ 2023-02-01 12:17       ` Raymond E. Pasco
  1 sibling, 0 replies; 57+ messages in thread
From: Raymond E. Pasco @ 2023-02-01 12:17 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, brian m. carlson
  Cc: Konstantin Ryabitsev, Eli Schwartz, Git List

February 1, 2023 4:40 AM, "Ævar Arnfjörð Bjarmason" <avarab@gmail.com> wrote:
> As people have come to rely on the exact "deflate"
> implementation "git archive" promises to invoke the system's
> "gzip" binary by default, under the assumption that its output
> is stable. If that's no longer the case you'll need to complain
> to whoever maintains your local "gzip".

Surely if reproducibility of .tar.gz files is the goal, "invoke
whatever arbitrary binary on $PATH happens to be called gzip" is a
poor solution.

It is only even possible to consider stabilizing gzip output as a
goal for Git (although this seems ill-advised for the reasons
Brian already discussed) in the post-2.38 world where git is
doing the gzipping.

If one has the requirement to substitute one's own specific
compressor, there is an option for that.
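
A minimal sketch of that option, for illustration only: the
"tar.<format>.command" configuration names the filter the tar stream
is piped through. The throwaway repository, the file name and the
identity passed to "git commit" are assumptions made up for this
example; only "tar.tar.gz.command" and "gzip -cn" come from the
discussion.

```shell
#!/bin/sh
# Sketch: choose the compressor "git archive" uses for tar.gz output.
set -e

# Throwaway repository (hypothetical content, only for illustration).
repo=$(mktemp -d)
out=$(mktemp -d)
git init -q "$repo"
echo "hello" >"$repo/file.txt"
git -C "$repo" add file.txt
git -C "$repo" -c user.name="Example" -c user.email="example@example.com" \
	commit -q -m "initial"

# tar.<format>.command is the filter the tar stream is piped through.
# "gzip -cn" writes to stdout and leaves the name and timestamp out of
# the gzip header.
git -C "$repo" -c tar.tar.gz.command="gzip -cn" \
	archive --format=tar.gz -o "$out/a.tar.gz" HEAD

# Confirm the result is a well-formed gzip stream.
gzip -t "$out/a.tar.gz" && echo "archive is valid gzip"
```

Pointing the same setting at a pinned compressor binary would keep
the bytes tied to that binary rather than to whatever "gzip" is on
$PATH.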

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution
  2023-02-01 11:34         ` demerphq
@ 2023-02-01 12:21           ` Michal Suchánek
  2023-02-01 12:48             ` demerphq
  0 siblings, 1 reply; 57+ messages in thread
From: Michal Suchánek @ 2023-02-01 12:21 UTC (permalink / raw)
  To: demerphq
  Cc: Ævar Arnfjörð Bjarmason, brian m. carlson,
	Konstantin Ryabitsev, Eli Schwartz, Git List

On Wed, Feb 01, 2023 at 12:34:06PM +0100, demerphq wrote:
> On Wed, 1 Feb 2023 at 11:26, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> > That would be going above & beyond what's needed IMO, but still a lot
> > easier than the daunting task of writing a specification that exactly
> > described GNU gzip's current behavior, to the point where you could
> > clean-room implement it and be guaranteed byte-for-byte compatibility.
> 
> Why does it have to be gzip? It is not that hard to come up with a
historical reasons?

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution
  2023-01-31  9:54 ` Stability of git-archive, breaking (?) the Github universe, and a possible solution brian m. carlson
                     ` (2 preceding siblings ...)
  2023-01-31 15:56   ` Eli Schwartz
@ 2023-02-01 12:42   ` Ævar Arnfjörð Bjarmason
  2023-02-01 23:18     ` brian m. carlson
  3 siblings, 1 reply; 57+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-02-01 12:42 UTC (permalink / raw)
  To: brian m. carlson; +Cc: Eli Schwartz, Git List


On Tue, Jan 31 2023, brian m. carlson wrote:

> Since then, I've been very opposed to us guaranteeing output format
> consistency without explicitly doing so.  I had sent some patches before
> that I don't think ever got picked up that documented this explicitly.
> I very much don't want people to come to rely on our behaviour unless we
> explicitly guarantee it.

FWIW I think the reason that didn't get picked up (I went back and read
the discussion) is that there was some feedback on the v1; [1] suggested
(at least to me) that you'd re-roll it, but that re-roll never seems to
have made it to the list.

1. https://lore.kernel.org/git/YD7aDwX%2FaiRN0GZs@camp.crustytoothpaste.net/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution
  2023-02-01 12:21           ` Michal Suchánek
@ 2023-02-01 12:48             ` demerphq
  2023-02-01 13:43               ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 57+ messages in thread
From: demerphq @ 2023-02-01 12:48 UTC (permalink / raw)
  To: Michal Suchánek
  Cc: Ævar Arnfjörð Bjarmason, brian m. carlson,
	Konstantin Ryabitsev, Eli Schwartz, Git List

On Wed, 1 Feb 2023, 20:21 Michal Suchánek, <msuchanek@suse.de> wrote:
>
> On Wed, Feb 01, 2023 at 12:34:06PM +0100, demerphq wrote:
> > Why does it have to be gzip? It is not that hard to come up with a

> historical reasons?

Currently git doesn't advertise that archive creation is stable
right[1]? So I wrote that with the assumption that this new
compression would only be used when making a new archive with a
hypothetical new '--stable' option. So historical reasons don't come
up. Or was there some other form of history that you meant?

I'm just trying to point out here that stable compression is doable
and doesn't need to be as complex as specifying a stable gzip format.
I am not even saying git should just do this, just that it /could/ if
it decided that stability was important, and that doing so wouldn't
involve the complexity that Avar was implying would be needed.  Simple
compression like LZ variants are pretty straightforward to implement,
achieve pretty good compression and can run pretty fast.

Yves
[1] if it did the issue kicking off this thread would not have
happened as there would be a test that would have noticed the change.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution
  2023-02-01 12:48             ` demerphq
@ 2023-02-01 13:43               ` Ævar Arnfjörð Bjarmason
  2023-02-01 15:21                 ` demerphq
  0 siblings, 1 reply; 57+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-02-01 13:43 UTC (permalink / raw)
  To: demerphq
  Cc: Michal Suchánek, brian m. carlson, Konstantin Ryabitsev,
	Eli Schwartz, Git List


On Wed, Feb 01 2023, demerphq wrote:

> On Wed, 1 Feb 2023, 20:21 Michal Suchánek, <msuchanek@suse.de> wrote:
>>
>> On Wed, Feb 01, 2023 at 12:34:06PM +0100, demerphq wrote:
>> > Why does it have to be gzip? It is not that hard to come up with a
>
>> historical reasons?
>
> Currently git doesn't advertise that archive creation is stable
> right[1]? So I wrote that with the assumption that this new
> compression would only be used when making a new archive with a
> hypothetical new '--stable' option. So historical reasons don't come
> up. Or was there some other form of history that you meant?

We haven't advertised it, but people have come to rely on it, as the
widespread breakages reported when upgrading to v2.38.0 at the start of
this thread show.

That's unfortunate, and those people probably shouldn't have done that,
but that's water under the bridge. I think it would be irresponsible to
change the output willy-nilly at this point, especially when it seems
rather easy to find some compromise everyone will be happy with.

> I'm just trying to point out here that stable compression is doable
> and doesn't need to be as complex as specifying a stable gzip format.
> I am not even saying git should just do this, just that it /could/ if
> it decided that stability was important, and that doing so wouldn't
> involve the complexity that Avar was implying would be needed.  Simple
> compression like LZ variants are pretty straightforward to implement,
> achieve pretty good compression and can run pretty fast.
>
> Yves
> [1] if it did the issue kicking off this thread would not have
> happened as there would be a test that would have noticed the change.

I have some patches I'm about to submit to address issues in this
thread, and it does add *a* test for archive output stability.

But I'm not at all confident that it's exhaustive. I just found it by
experiment, by locating tests of ours where the "git archive" output at
the end is different with gzip and "git archive gzip".

But is it guaranteed to find all potential cases where repository
content might trigger different output with different gzip
implementations? I don't know, but probably not.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution
  2023-02-01 13:43               ` Ævar Arnfjörð Bjarmason
@ 2023-02-01 15:21                 ` demerphq
  2023-02-01 18:56                   ` Theodore Ts'o
  0 siblings, 1 reply; 57+ messages in thread
From: demerphq @ 2023-02-01 15:21 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Michal Suchánek, brian m. carlson, Konstantin Ryabitsev,
	Eli Schwartz, Git List

On Wed, 1 Feb 2023 at 14:49, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>
>
> On Wed, Feb 01 2023, demerphq wrote:
>
> > On Wed, 1 Feb 2023, 20:21 Michal Suchánek, <msuchanek@suse.de> wrote:
> >>
> >> On Wed, Feb 01, 2023 at 12:34:06PM +0100, demerphq wrote:
> >> > Why does it have to be gzip? It is not that hard to come up with a
> >
> >> historical reasons?
> >
> > Currently git doesn't advertise that archive creation is stable
> > right[1]? So I wrote that with the assumption that this new
> > compression would only be used when making a new archive with a
> > hypothetical new '--stable' option. So historical reasons don't come
> > up. Or was there some other form of history that you meant?
>
> We haven't advertised it, but people have come to rely on it, as the
> widespread breakages reported when upgrading to v2.38.0 at the start of
> this thread show.
>
> That's unfortunate, and those people probably shouldn't have done that,
> but that's water under the bridge. I think it would be irresponsible to
> change the output willy-nilly at this point, especially when it seems
> rather easy to find some compromise everyone will be happy with.
>
> > I'm just trying to point out here that stable compression is doable
> > and doesn't need to be as complex as specifying a stable gzip format.
> > I am not even saying git should just do this, just that it /could/ if
> > it decided that stability was important, and that doing so wouldn't
> > involve the complexity that Avar was implying would be needed.  Simple
> > compression like LZ variants are pretty straightforward to implement,
> > achieve pretty good compression and can run pretty fast.
> >
> > Yves
> > [1] if it did the issue kicking off this thread would not have
> > happened as there would be a test that would have noticed the change.
>
> I have some patches I'm about to submit to address issues in this
> thread, and it does add *a* test for archive output stability.
>
> But I'm not at all confident that it's exhaustive. I just found it by
> experiment, by locating tests of ours where the "git archive" output at
> the end is different with gzip and "git archive gzip".
>
> But is it guaranteed to find all potential cases where repository
> content might trigger different output with different gzip
> implementations? I don't know, but probably not.

BTW, I just happened to be looking at the zstd docs (I am updating
code that uses it), I saw this:

Zstandard's format is stable and documented in
[RFC8878](https://datatracker.ietf.org/doc/html/rfc8878). Multiple
independent implementations are already available.
This repository represents the reference implementation, provided as
an open-source dual [BSD](LICENSE) and [GPLv2](COPYING) licensed **C**
library,
and a command line utility producing and decoding `.zst`, `.gz`, `.xz`
and `.lz4` files.
Should your project require another programming language,
a list of known ports and bindings is provided on [Zstandard
homepage](http://www.zstd.net/#other-languages).

So it sounds like that is a spec you could use. Not sure exactly what
they mean by "stable", but given the .gz compatibility maybe it would
be worth considering. It's a lot faster than zlib. (The library I
support includes Snappy, Zlib, and Zstd, and the latter is faster and
better than the other two.)

Yves
-- 
perl -Mre=debug -e "/just|another|perl|hacker/"

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution
  2023-02-01 15:21                 ` demerphq
@ 2023-02-01 18:56                   ` Theodore Ts'o
  2023-02-02 21:19                     ` Joey Hess
  0 siblings, 1 reply; 57+ messages in thread
From: Theodore Ts'o @ 2023-02-01 18:56 UTC (permalink / raw)
  To: demerphq
  Cc: Ævar Arnfjörð Bjarmason, Michal Suchánek,
	brian m. carlson, Konstantin Ryabitsev, Eli Schwartz, Git List

If the goal is stable tar.gz files, Debian has a very nice solution
called pristine-tar[1].  This allows you to store a tar.gz image in a
very efficient way, by leveraging the objects in the git repository.

[1] https://manpages.debian.org/unstable/pristine-tar/pristine-tar.1.en.html

The data is stored on the pristine-tar branch, and is quite efficient:

% git show --stat pristine-tar
commit 56dded989c9e0c852b8af9ae72ffe94270bfd34a (origin/pristine-tar, github/pristine-tar, pristine-tar)
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Thu Dec 30 01:06:13 2021 -0500

    pristine-tar data for e2fsprogs_1.46.5.orig.tar.gz

 e2fsprogs_1.46.5.orig.tar.gz.asc   |  11 +++++++++++
 e2fsprogs_1.46.5.orig.tar.gz.delta | Bin 0 -> 59034 bytes
 e2fsprogs_1.46.5.orig.tar.gz.id    |   1 +
 3 files changed, 12 insertions(+)

And this allows me to reproduce the original tar.gz file, along with a
GPG signature file, which is about 9 megabytes.  The *.id file
contains the git commit from which the tar file was generated, and
this is what allows the *.delta file to be as small as it is.

% pristine-tar checkout e2fsprogs_1.46.5.orig.tar.gz -s e2fsprogs_1.46.5.orig.tar.gz.asc
pristine-tar: successfully generated e2fsprogs_1.46.5.orig.tar.gz
pristine-tar: successfully generated e2fsprogs_1.46.5.orig.tar.gz.asc

% ls -sh e2fsprogs_1.46.5.orig.tar.gz*
9.1M e2fsprogs_1.46.5.orig.tar.gz  4.0K e2fsprogs_1.46.5.orig.tar.gz.asc

% gpg e2fsprogs_1.46.5.orig.tar.gz.asc
gpg: WARNING: no command supplied.  Trying to guess what you mean ...
gpg: assuming signed data in 'e2fsprogs_1.46.5.orig.tar.gz'
gpg: Signature made Thu 30 Dec 2021 01:02:52 AM EST
gpg:                using RSA key 2B69B954DBFE0879288137C9F2F95956950D81A3
gpg: Good signature from "Theodore Ts'o <tytso@mit.edu>" [ultimate]
gpg:                 aka "Theodore Ts'o <tytso@debian.org>" [ultimate]
gpg:                 aka "Theodore Ts'o <tytso@google.com>" [ultimate]
Primary key fingerprint: 3AB0 57B7 E78D 945C 8C55  91FB D36F 769B C118 04F0
     Subkey fingerprint: 2B69 B954 DBFE 0879 2881  37C9 F2F9 5956 950D 81A3

This is currently a Debian special, and while its functionality was
designed to work well with Debian packaging workflows, it's a
general tool that could be used in multiple contexts, not just for
Debian packaging.

If I recall correctly, pristine-tar is currently in maintenance mode,
and I suspect if someone was interested in investing time into making
pristine-tar more portable to other OS's, including MacOS and Windows,
and maybe potentially even integrating into git directly, the current
maintainer of pristine-tar might be quite happy to let other people
give the code more TLC.

						- Ted

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution
  2023-02-01  9:40       ` Ævar Arnfjörð Bjarmason
  2023-02-01 11:34         ` demerphq
@ 2023-02-01 23:16         ` brian m. carlson
  2023-02-01 23:37           ` Junio C Hamano
  2023-02-02  0:42           ` Ævar Arnfjörð Bjarmason
  1 sibling, 2 replies; 57+ messages in thread
From: brian m. carlson @ 2023-02-01 23:16 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Konstantin Ryabitsev, Eli Schwartz, Git List

[-- Attachment #1: Type: text/plain, Size: 1784 bytes --]

On 2023-02-01 at 09:40:57, Ævar Arnfjörð Bjarmason wrote:
> "A spec" here seems like overkill to me, so far on that front we've been
> shelling out to gzip(1), and the breakage/event that triggered this
> thread is rectified by starting to do that again by default.

Sure, that will fix the immediate problem.

> But so what? We don't need to make promises for all potential git
> implementations, just this one. So we could add a blurb like this to the
> docs:
> 
> 	As people have come to rely on the exact "deflate"
> 	implementation "git archive" promises to invoke the system's
> 	"gzip" binary by default, under the assumption that its output
> 	is stable. If that's no longer the case you'll need to complain
> 	to whoever maintains your local "gzip".

I don't think a blurb is necessary, but you're basically underscoring
the problem, which is that nobody is willing to promise that compression
is consistent, but yet people want to rely on that fact.  I'm willing to
write and implement a consistent tar spec and to guarantee compatibility
with that, but the tension here is that people also want gzip to never
change its byte format ever, which frankly seems unrealistic without
explicit guarantees.  Maybe the authors will agree to promise that, but
it seems unlikely.

> If we wanted to be even more helpful we could bundle and ship an old
> version of GNU gzip with our sources, and either default to that, or
> offer it as a "--stable" implementation of deflate.

That would probably break things, because gzip is GPLv3, and we'd need
to ship a much older GPLv2 gzip, which would probably differ from the
current behaviour, and might also have some security problems.
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution
  2023-02-01 12:42   ` Ævar Arnfjörð Bjarmason
@ 2023-02-01 23:18     ` brian m. carlson
  0 siblings, 0 replies; 57+ messages in thread
From: brian m. carlson @ 2023-02-01 23:18 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Eli Schwartz, Git List

[-- Attachment #1: Type: text/plain, Size: 950 bytes --]

On 2023-02-01 at 12:42:54, Ævar Arnfjörð Bjarmason wrote:
> 
> On Tue, Jan 31 2023, brian m. carlson wrote:
> 
> > Since then, I've been very opposed to us guaranteeing output format
> > consistency without explicitly doing so.  I had sent some patches before
> > that I don't think ever got picked up that documented this explicitly.
> > I very much don't want people to come to rely on our behaviour unless we
> > explicitly guarantee it.
> 
> FWIW I think the reason that didn't get picked up (I went back and read
> the discussion) is that there was some feedback on the v1, [1] suggested
> (at least to me) that you'd re-roll it, but that re-roll never seems to
> have made it to the list.

That may very well have been the case.  As mentioned upthread, I have
very limited time to work on Git these days, and sometimes things just
fall through the cracks.
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution
  2023-02-01 23:16         ` brian m. carlson
@ 2023-02-01 23:37           ` Junio C Hamano
  2023-02-02 23:01             ` brian m. carlson
  2023-02-02  0:42           ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 57+ messages in thread
From: Junio C Hamano @ 2023-02-01 23:37 UTC (permalink / raw)
  To: brian m. carlson
  Cc: Ævar Arnfjörð Bjarmason, Konstantin Ryabitsev,
	Eli Schwartz, Git List

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> I don't think a blurb is necessary, but you're basically underscoring
> the problem, which is that nobody is willing to promise that compression
> is consistent, but yet people want to rely on that fact.  I'm willing to
> write and implement a consistent tar spec and to guarantee compatibility
> with that, but the tension here is that people also want gzip to never
> change its byte format ever, which frankly seems unrealistic without
> explicit guarantees.  Maybe the authors will agree to promise that, but
> it seems unlikely.

Just to step back a bit, where does the distinction between
guaranteeing the tar format stability and gzip compressed bitstream
stability come from?  At both levels, the same thing can be
expressed in multiple different ways, I think, but spelling out how
exactly the compressor compresses is more involved than spelling out
how entries in a tar archive are ordered and each entry is expressed,
or something?

> That would probably break things, because gzip is GPLv3, and we'd need
> to ship a much older GPLv2 gzip, which would probably differ from the
> current behaviour, and might also have some security problems.

Yup, security issues may make bit-for-bit-stability unrealistic.
IIRC, the last time we had discussion on this topic, we settled
on stability across the same version of Git (i.e. deterministic
result)?
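
As an illustration of that weaker, same-version notion of stability,
the sketch below runs "git archive" twice on one commit with one git
binary and compares checksums of the uncompressed tar output. The
throwaway repository and the use of sha256sum are assumptions of the
example, and it says nothing about what happens across git versions.

```shell
#!/bin/sh
# Sketch: same git binary, same commit, two runs of "git archive".
set -e

# Throwaway repository (hypothetical content, only for illustration).
repo=$(mktemp -d)
git init -q "$repo"
echo "content" >"$repo/a.txt"
git -C "$repo" add a.txt
git -C "$repo" -c user.name="Example" -c user.email="example@example.com" \
	commit -q -m "initial"

# Checksum the uncompressed tar stream of the same commit twice.
first=$(git -C "$repo" archive --format=tar HEAD | sha256sum)
second=$(git -C "$repo" archive --format=tar HEAD | sha256sum)

[ "$first" = "$second" ] && echo "same git, same commit: identical tar bytes"
```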

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution
  2023-02-01 23:16         ` brian m. carlson
  2023-02-01 23:37           ` Junio C Hamano
@ 2023-02-02  0:42           ` Ævar Arnfjörð Bjarmason
  1 sibling, 0 replies; 57+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-02-02  0:42 UTC (permalink / raw)
  To: brian m. carlson; +Cc: Konstantin Ryabitsev, Eli Schwartz, Git List


On Wed, Feb 01 2023, brian m. carlson wrote:

> [[PGP Signed Part:Undecided]]
> On 2023-02-01 at 09:40:57, Ævar Arnfjörð Bjarmason wrote:
>> "A spec" here seems like overkill to me, so far on that front we've been
>> shelling out to gzip(1), and the breakage/event that triggered this
>> thread is rectified by starting to do that again by default.
>
> Sure, that will fix the immediate problem.
>
>> But so what? We don't need to make promises for all potential git
>> implementations, just this one. So we could add a blurb like this to the
>> docs:
>> 
>> 	As people have come to rely on the exact "deflate"
>> 	implementation "git archive" promises to invoke the system's
>> 	"gzip" binary by default, under the assumption that its output
>> 	is stable. If that's no longer the case you'll need to complain
>> 	to whoever maintains your local "gzip".
>
> I don't think a blurb is necessary, but you're basically underscoring
> the problem, which is that nobody is willing to promise that compression
> is consistent, but yet people want to rely on that fact.  I'm willing to
> write and implement a consistent tar spec and to guarantee compatibility
> with that, but the tension here is that people also want gzip to never
> change its byte format ever, which frankly seems unrealistic without
> explicit guarantees.  Maybe the authors will agree to promise that, but
> it seems unlikely.

Maybe they won't, the point is that an upgrade of git wouldn't break
github in the way that's been observed, instead that potential breakage
would happen whenever the OS (or whatever's providing "gzip") is
upgraded.

So, if gzip promises to never change such sites can upgrade it without
issues, but if it does they'll presumably need to pin it forever.

And those sites that don't care about "git archive" stability can use
whatever their local "gzip" is, without caring that the output might
change.

>> If we wanted to be even more helpful we could bundle and ship an old
>> version of GNU gzip with our sources, and either default to that, or
>> offer it as a "--stable" implementation of deflate.
>
> That would probably break things, because gzip is GPLv3, and we'd need
> to ship a much older GPLv2 gzip, which would probably differ from the
> current behaviour, and might also have some security problems.

We're way off in the realm of the hypothetical, I don't think we need a
gzip fallback, we can make it the issue of the rare downstream user who
needs such stability.

But if we shipped a last-good gzip my understanding of software
licensing is that we could ship the GPLv3 version.

The issue with combining GPLv3 and GPLv2 works is if you do something
like upgrade our wildmatch.c to the GPLv3 version (ours is derived from
an older GPLv2 version). Then our combined work is derived from two
different licenses.

But if you're just invoking a different process those two sources can
use incompatible licenses. There's established precedent for that
throughout the industry, and it's the FSF's position on the matter.

So if we offered to build a gzip for you from GPLv3 sources shipped
in-tree that wouldn't infect the rest of git's GPLv2 code, any more than
Debian shipping both git and gzip is cross-contaminating the two.

It might cause us some hassle with distributors for whom any mention of
GPLv3 is anathema (e.g. Apple), but I understand that that's general
paranoia about its patent clauses impacting the distributor, not a
license incompatibility.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 0/9] git archive: use gzip again by default, document output stabilty
  2023-01-31  7:49 ` Ævar Arnfjörð Bjarmason
  2023-01-31  9:11   ` Eli Schwartz
@ 2023-02-02  9:32   ` Ævar Arnfjörð Bjarmason
  2023-02-02  9:32     ` [PATCH 1/9] archive & tar config docs: de-duplicate configuration section Ævar Arnfjörð Bjarmason
                       ` (11 more replies)
  1 sibling, 12 replies; 57+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-02-02  9:32 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Eli Schwartz, René Scharfe,
	brian m . carlson, Konstantin Ryabitsev, Michal Suchánek,
	Raymond E . Pasco, demerphq, Theodore Ts'o,
	Ævar Arnfjörð Bjarmason

As reported in
https://lore.kernel.org/git/a812a664-67ea-c0ba-599f-cb79e2d96694@gmail.com/
changing the default "tgz" output method from "gzip(1)" to our
internal "git archive gzip" (using zlib) broke things for users in
the wild that assume that the "git archive" output is stable, most
notably GitHub: https://github.com/orgs/community/discussions/45830

Leaving aside the larger question of whether we're going to promise
output stability for "git archive" in general, the motivation for that
change was to have a working compression method on systems that lacked
a gzip(1).

As the disruption of changing the default isn't worth it, let's use
gzip(1) again by default, and only fall back on the new "git archive
gzip" if it isn't available.

The later parts of this series then document and test for the output
stability of the command.

We're not promising anything new there, except that we now promise
that we're going to use "gzip" as the default compressor, but that
it's up to that command to be stable, should the user desire output
stability.

The documentation discusses the various caveats involved, suggests
alternatives to checksumming compressed archives, but in the end notes
what's been the policy so far: We're not promising that the "tar"
output is going to be stable.
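
One such alternative can be sketched as follows: checksum the
decompressed stream rather than the compressed file. This is only an
illustration, not text from the patches. The payload generated with
"seq" stands in for a tar stream, and the use of sha256sum is an
assumption of the example.

```shell
#!/bin/sh
# Sketch: compressed bytes vary with the compressor, the payload does not.
set -e

work=$(mktemp -d)

# Stand-in for an archive's tar stream: some compressible bytes.
seq 1 20000 >"$work/payload"

# Compress the same payload with two different settings.
gzip -1 -cn "$work/payload" >"$work/fast.gz"
gzip -9 -cn "$work/payload" >"$work/best.gz"

# Checksums of the compressed files depend on the compressor and its
# settings, so these two will typically differ.
sha256sum "$work/fast.gz" "$work/best.gz"

# The checksum of the decompressed stream is independent of how the
# data was compressed.
sum_fast=$(gzip -dc "$work/fast.gz" | sha256sum)
sum_best=$(gzip -dc "$work/best.gz" | sha256sum)
[ "$sum_fast" = "$sum_best" ] && echo "decompressed checksums match"
```

The trade-off is that a consumer has to decompress before it can
verify anything.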

The early parts of this series (1-2/9) are clean-up for existing
config drift, as later in the series we'll otherwise need to change
the divergent config documentation in two places.

CI & branch for this at:
https://github.com/avar/git/tree/avar/archive-internal-gzip-not-the-default

Ævar Arnfjörð Bjarmason (9):
  archive & tar config docs: de-duplicate configuration section
  git config docs: document "tar.<format>.{command,remote}"
  archiver API: make the "flags" in "struct archiver" an enum
  archive: omit the shell for built-in "command" filters
  archive-tar.c: move internal gzip implementation to a function
  archive: use "gzip -cn" for stability, not "git archive gzip"
  test-lib.sh: add a lazy GZIP prerequisite
  archive tests: test for "gzip -cn" and "git archive gzip" stability
  git archive docs: document output non-stability

 Documentation/config/tar.txt           | 29 +++++++-
 Documentation/git-archive.txt          | 96 +++++++++++++++++++-------
 archive-tar.c                          | 78 ++++++++++++++-------
 archive.h                              | 11 +--
 t/t5000-tar-tree.sh                    |  2 -
 t/t5005-archive-stability.sh           | 70 +++++++++++++++++++
 t/t5562-http-backend-content-length.sh |  2 -
 t/test-lib.sh                          |  4 ++
 8 files changed, 231 insertions(+), 61 deletions(-)
 create mode 100755 t/t5005-archive-stability.sh

-- 
2.39.1.1392.g63e6d408230


^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 1/9] archive & tar config docs: de-duplicate configuration section
  2023-02-02  9:32   ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Ævar Arnfjörð Bjarmason
@ 2023-02-02  9:32     ` Ævar Arnfjörð Bjarmason
  2023-02-02  9:32     ` [PATCH 2/9] git config docs: document "tar.<format>.{command,remote}" Ævar Arnfjörð Bjarmason
                       ` (10 subsequent siblings)
  11 siblings, 0 replies; 57+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-02-02  9:32 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Eli Schwartz, René Scharfe,
	brian m . carlson, Konstantin Ryabitsev, Michal Suchánek,
	Raymond E . Pasco, demerphq, Theodore Ts'o,
	Ævar Arnfjörð Bjarmason

The "tar.umask" documentation was initially added in [1], and was
duplicated from the start. Then with [2] the two started drifting
apart. Let's consolidate them with a change like the ones made in the
commits merged in [3].

1. ce1a79b6a74 (tar-tree: add the "tar.umask" config option,
   2006-07-20)
2. 687157c736d (Documentation: update tar.umask default, 2007-08-21)
3. 7a54d740451 (Merge branch 'ab/dedup-config-and-command-docs',
   2022-09-14)

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 Documentation/config/tar.txt  | 4 +++-
 Documentation/git-archive.txt | 8 +-------
 2 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/Documentation/config/tar.txt b/Documentation/config/tar.txt
index de8ff48ea9d..c68e294bbc5 100644
--- a/Documentation/config/tar.txt
+++ b/Documentation/config/tar.txt
@@ -3,4 +3,6 @@ tar.umask::
 	tar archive entries.  The default is 0002, which turns off the
 	world write bit.  The special value "user" indicates that the
 	archiving user's umask will be used instead.  See umask(2) and
-	linkgit:git-archive[1].
+	linkgit:git-archive[1] for
+	details. If `--remote` is used then only the configuration of
+	the remote repository takes effect.
diff --git a/Documentation/git-archive.txt b/Documentation/git-archive.txt
index 60c040988bb..bbb407d4975 100644
--- a/Documentation/git-archive.txt
+++ b/Documentation/git-archive.txt
@@ -131,13 +131,7 @@ tar
 CONFIGURATION
 -------------
 
-tar.umask::
-	This variable can be used to restrict the permission bits of
-	tar archive entries.  The default is 0002, which turns off the
-	world write bit.  The special value "user" indicates that the
-	archiving user's umask will be used instead.  See umask(2) for
-	details.  If `--remote` is used then only the configuration of
-	the remote repository takes effect.
+include::config/tar.txt[]
 
 tar.<format>.command::
 	This variable specifies a shell command through which the tar
-- 
2.39.1.1392.g63e6d408230


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 2/9] git config docs: document "tar.<format>.{command,remote}"
  2023-02-02  9:32   ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Ævar Arnfjörð Bjarmason
  2023-02-02  9:32     ` [PATCH 1/9] archive & tar config docs: de-duplicate configuration section Ævar Arnfjörð Bjarmason
@ 2023-02-02  9:32     ` Ævar Arnfjörð Bjarmason
  2023-02-02  9:32     ` [PATCH 3/9] archiver API: make the "flags" in "struct archiver" an enum Ævar Arnfjörð Bjarmason
                       ` (9 subsequent siblings)
  11 siblings, 0 replies; 57+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-02-02  9:32 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Eli Schwartz, René Scharfe,
	brian m . carlson, Konstantin Ryabitsev, Michal Suchánek,
	Raymond E . Pasco, demerphq, Theodore Ts'o,
	Ævar Arnfjörð Bjarmason

Since the "tar.<format>.command" and "tar.<format>.remote"
configuration was added in [1] and [2], we have not included it in the
"git-config(1)" docs themselves.

Since we're including "Documentation/config/tar.txt" in
"Documentation/git-archive.txt" as of the preceding commit, let's move
this documentation to the former, to be included in the latter.

This is a move-only change, aside from changing the mention of "`git
archive`" to "linkgit:git-archive[1]", for consistency with other such
mentions.

1. 767cf4579f0 (archive: implement configurable tar filters,
   2011-06-21)
2. 7b97730b764 (upload-archive: allow user to turn off filters,
   2011-06-21)

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 Documentation/config/tar.txt  | 18 ++++++++++++++++++
 Documentation/git-archive.txt | 18 ------------------
 2 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/Documentation/config/tar.txt b/Documentation/config/tar.txt
index c68e294bbc5..894c1163bb9 100644
--- a/Documentation/config/tar.txt
+++ b/Documentation/config/tar.txt
@@ -6,3 +6,21 @@ tar.umask::
 	linkgit:git-archive[1] for
 	details. If `--remote` is used then only the configuration of
 	the remote repository takes effect.
+
+tar.<format>.command::
+	This variable specifies a shell command through which the tar
+	output generated by linkgit:git-archive[1] should be piped. The command
+	is executed using the shell with the generated tar file on its
+	standard input, and should produce the final output on its
+	standard output. Any compression-level options will be passed
+	to the command (e.g., `-9`).
++
+The `tar.gz` and `tgz` formats are defined automatically and use the
+magic command `git archive gzip` by default, which invokes an internal
+implementation of gzip.
+
+tar.<format>.remote::
+	If true, enable the format for use by remote clients via
+	linkgit:git-upload-archive[1]. Defaults to false for
+	user-defined formats, but true for the `tar.gz` and `tgz`
+	formats.
diff --git a/Documentation/git-archive.txt b/Documentation/git-archive.txt
index bbb407d4975..268e797f03a 100644
--- a/Documentation/git-archive.txt
+++ b/Documentation/git-archive.txt
@@ -133,24 +133,6 @@ CONFIGURATION
 
 include::config/tar.txt[]
 
-tar.<format>.command::
-	This variable specifies a shell command through which the tar
-	output generated by `git archive` should be piped. The command
-	is executed using the shell with the generated tar file on its
-	standard input, and should produce the final output on its
-	standard output. Any compression-level options will be passed
-	to the command (e.g., `-9`).
-+
-The `tar.gz` and `tgz` formats are defined automatically and use the
-magic command `git archive gzip` by default, which invokes an internal
-implementation of gzip.
-
-tar.<format>.remote::
-	If true, enable the format for use by remote clients via
-	linkgit:git-upload-archive[1]. Defaults to false for
-	user-defined formats, but true for the `tar.gz` and `tgz`
-	formats.
-
 [[ATTRIBUTES]]
 ATTRIBUTES
 ----------
-- 
2.39.1.1392.g63e6d408230


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 3/9] archiver API: make the "flags" in "struct archiver" an enum
  2023-02-02  9:32   ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Ævar Arnfjörð Bjarmason
  2023-02-02  9:32     ` [PATCH 1/9] archive & tar config docs: de-duplicate configuration section Ævar Arnfjörð Bjarmason
  2023-02-02  9:32     ` [PATCH 2/9] git config docs: document "tar.<format>.{command,remote}" Ævar Arnfjörð Bjarmason
@ 2023-02-02  9:32     ` Ævar Arnfjörð Bjarmason
  2023-02-02  9:32     ` [PATCH 4/9] archive: omit the shell for built-in "command" filters Ævar Arnfjörð Bjarmason
                       ` (8 subsequent siblings)
  11 siblings, 0 replies; 57+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-02-02  9:32 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Eli Schwartz, René Scharfe,
	brian m . carlson, Konstantin Ryabitsev, Michal Suchánek,
	Raymond E . Pasco, demerphq, Theodore Ts'o,
	Ævar Arnfjörð Bjarmason

Refactor the "#define" pattern in archive.h to use a new "enum
archiver_flags". This isn't a functional change, but will make adding
new flags in a subsequent commit easier to reason about.
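
As a quick check that the flag values stay the same, the shifts used in
the new enum evaluate to the same integers as the old "#define"
values. The sketch below only redoes that arithmetic in POSIX sh; the
variable names mirror the C constants, and the shell form is purely
illustrative, not part of this patch:

```shell
# Illustrative only: the same bit arithmetic as the C enum, done in POSIX sh.
ARCHIVER_WANT_COMPRESSION_LEVELS=$((1 << 0))   # was "#define ... 1"
ARCHIVER_REMOTE=$((1 << 1))                    # was "#define ... 2"
ARCHIVER_HIGH_COMPRESSION_LEVELS=$((1 << 2))   # was "#define ... 4"

# Flags combine with bitwise OR and are tested with bitwise AND.
flags=$((ARCHIVER_WANT_COMPRESSION_LEVELS | ARCHIVER_REMOTE))
if [ $((flags & ARCHIVER_REMOTE)) -ne 0 ]; then
	remote_enabled=yes
else
	remote_enabled=no
fi
echo "flags=$flags remote_enabled=$remote_enabled"
```

Writing the values as shifts makes it obvious which bit the next flag
should take.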

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 archive.h | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/archive.h b/archive.h
index 08bed3ed3af..6b51288c2ed 100644
--- a/archive.h
+++ b/archive.h
@@ -36,13 +36,15 @@ const char *archive_format_from_filename(const char *filename);
 
 /* archive backend stuff */
 
-#define ARCHIVER_WANT_COMPRESSION_LEVELS 1
-#define ARCHIVER_REMOTE 2
-#define ARCHIVER_HIGH_COMPRESSION_LEVELS 4
+enum archiver_flags {
+	ARCHIVER_WANT_COMPRESSION_LEVELS = 1<<0,
+	ARCHIVER_REMOTE = 1<<1,
+	ARCHIVER_HIGH_COMPRESSION_LEVELS = 1<<2,
+};
 struct archiver {
 	const char *name;
 	int (*write_archive)(const struct archiver *, struct archiver_args *);
-	unsigned flags;
+	enum archiver_flags flags;
 	char *filter_command;
 };
 void register_archiver(struct archiver *);
-- 
2.39.1.1392.g63e6d408230


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 4/9] archive: omit the shell for built-in "command" filters
  2023-02-02  9:32   ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Ævar Arnfjörð Bjarmason
                       ` (2 preceding siblings ...)
  2023-02-02  9:32     ` [PATCH 3/9] archiver API: make the "flags" in "struct archiver" an enum Ævar Arnfjörð Bjarmason
@ 2023-02-02  9:32     ` Ævar Arnfjörð Bjarmason
  2023-02-02  9:32     ` [PATCH 5/9] archive-tar.c: move internal gzip implementation to a function Ævar Arnfjörð Bjarmason
                       ` (7 subsequent siblings)
  11 siblings, 0 replies; 57+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-02-02  9:32 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Eli Schwartz, René Scharfe,
	brian m . carlson, Konstantin Ryabitsev, Michal Suchánek,
	Raymond E . Pasco, demerphq, Theodore Ts'o,
	Ævar Arnfjörð Bjarmason

Since the "tar.<format>.command" interface was added in [1] we've
promised to run the command via the shell if e.g. "gzip -cn" is
configured. That common format was then added as a default in [2].

But if we have no such configuration we can safely assume that the
user isn't expecting the "gzip" to be invoked via a shell, and we can
skip the "sh" process.

We are intentionally not treating a configured
"tar.<format>.command=<cmd>" the same as our hardcoded default, even
when the configured "<cmd>" is identical to the hardcoded one. If the
user has configured e.g. "gzip -cn" they may be relying on what the
shell gives them over a direct execve() of "gzip".
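
To illustrate what "relying on the shell" can mean, here is a minimal
sketch outside of git. It assumes a hypothetical configured command
containing a pipe, and it mimics the two code paths only roughly: "sh
-c" stands in for "use_shell = 1", and plain word splitting stands in
for splitting the command into an argument vector:

```shell
# Illustrative only: why a configured command may depend on the shell.
cmd='gzip -cn | cat'	# hypothetical user configuration containing a pipe

# Run through the shell, the "|" is interpreted as a pipeline.
via_shell=$(printf 'data\n' | sh -c "$cmd" | gzip -dc)

# Split on whitespace and executed directly, "|" and "cat" are passed to
# gzip as file names, so gzip fails instead of compressing stdin.
empty=$(mktemp -d)
if (
	cd "$empty" &&
	set -- $cmd &&
	printf 'data\n' | "$@" >/dev/null 2>&1
)
then
	direct_status=0
else
	direct_status=$?
fi
rmdir "$empty"
echo "via_shell=$via_shell direct_status=$direct_status"
```

The first form round-trips the input, while the second exits with a
non-zero status. That is why only our own hardcoded command skips the
shell.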

This makes us marginally faster, but the real point is to make the
error handling easier to deal with. When we're using the shell we
can't as easily tell whether e.g. the "gzip" we spawned failed to
start, i.e. "start_command()" won't fail, because we can find the "sh".

A subsequent commit will tweak the default that [3] introduced to be a
fallback instead, at which point we'll need this for correctness.

1. 767cf4579f0 (archive: implement configurable tar filters, 2011-06-21)
2. 0e804e09938 (archive: provide builtin .tar.gz filter, 2011-06-21)
3. 4f4be00d302 (archive-tar: use internal gzip by default, 2022-06-15)

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 Documentation/config/tar.txt |  3 +++
 archive-tar.c                | 17 +++++++++++++----
 archive.h                    |  1 +
 3 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/Documentation/config/tar.txt b/Documentation/config/tar.txt
index 894c1163bb9..5456fc617a2 100644
--- a/Documentation/config/tar.txt
+++ b/Documentation/config/tar.txt
@@ -18,6 +18,9 @@ tar.<format>.command::
 The `tar.gz` and `tgz` formats are defined automatically and use the
 magic command `git archive gzip` by default, which invokes an internal
 implementation of gzip.
++
+The automatically defined commands do not invoke the shell, avoiding
+the minor overhead of an extra sh(1) process.
 
 tar.<format>.remote::
 	If true, enable the format for use by remote clients via
diff --git a/archive-tar.c b/archive-tar.c
index f8fad2946ef..8c5de949c64 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -367,12 +367,13 @@ static struct archiver *find_tar_filter(const char *name, size_t len)
 }
 
 static int tar_filter_config(const char *var, const char *value,
-			     void *data UNUSED)
+			     void *data)
 {
 	struct archiver *ar;
 	const char *name;
 	const char *type;
 	size_t namelen;
+	int *configured = data;
 
 	if (parse_config_key(var, "tar", &name, &namelen, &type) < 0 || !name)
 		return 0;
@@ -388,6 +389,9 @@ static int tar_filter_config(const char *var, const char *value,
 		tar_filters[nr_tar_filters++] = ar;
 	}
 
+	if (configured && *configured)
+		ar->flags |= ARCHIVER_COMMAND_FROM_CONFIG;
+
 	if (!strcmp(type, "command")) {
 		if (!value)
 			return config_error_nonbool(var);
@@ -495,8 +499,12 @@ static int write_tar_filter_archive(const struct archiver *ar,
 	if (args->compression_level >= 0)
 		strbuf_addf(&cmd, " -%d", args->compression_level);
 
-	strvec_push(&filter.args, cmd.buf);
-	filter.use_shell = 1;
+	if (ar->flags & ARCHIVER_COMMAND_FROM_CONFIG) {
+		strvec_push(&filter.args, cmd.buf);
+		filter.use_shell = 1;
+	} else {
+		strvec_split(&filter.args, cmd.buf);
+	}
 	filter.in = -1;
 	filter.silent_exec_failure = 1;
 
@@ -526,13 +534,14 @@ static struct archiver tar_archiver = {
 void init_tar_archiver(void)
 {
 	int i;
+	int configured = 1;
 	register_archiver(&tar_archiver);
 
 	tar_filter_config("tar.tgz.command", internal_gzip_command, NULL);
 	tar_filter_config("tar.tgz.remote", "true", NULL);
 	tar_filter_config("tar.tar.gz.command", internal_gzip_command, NULL);
 	tar_filter_config("tar.tar.gz.remote", "true", NULL);
-	git_config(git_tar_config, NULL);
+	git_config(git_tar_config, &configured);
 	for (i = 0; i < nr_tar_filters; i++) {
 		/* omit any filters that never had a command configured */
 		if (tar_filters[i]->filter_command)
diff --git a/archive.h b/archive.h
index 6b51288c2ed..9686b3b5cc1 100644
--- a/archive.h
+++ b/archive.h
@@ -40,6 +40,7 @@ enum archiver_flags {
 	ARCHIVER_WANT_COMPRESSION_LEVELS = 1<<0,
 	ARCHIVER_REMOTE = 1<<1,
 	ARCHIVER_HIGH_COMPRESSION_LEVELS = 1<<2,
+	ARCHIVER_COMMAND_FROM_CONFIG = 1<<3,
 };
 struct archiver {
 	const char *name;
-- 
2.39.1.1392.g63e6d408230


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 5/9] archive-tar.c: move internal gzip implementation to a function
  2023-02-02  9:32   ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Ævar Arnfjörð Bjarmason
                       ` (3 preceding siblings ...)
  2023-02-02  9:32     ` [PATCH 4/9] archive: omit the shell for built-in "command" filters Ævar Arnfjörð Bjarmason
@ 2023-02-02  9:32     ` Ævar Arnfjörð Bjarmason
  2023-02-02  9:32     ` [PATCH 6/9] archive: use "gzip -cn" for stability, not "git archive gzip" Ævar Arnfjörð Bjarmason
                       ` (6 subsequent siblings)
  11 siblings, 0 replies; 57+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-02-02  9:32 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Eli Schwartz, René Scharfe,
	brian m . carlson, Konstantin Ryabitsev, Michal Suchánek,
	Raymond E . Pasco, demerphq, Theodore Ts'o,
	Ævar Arnfjörð Bjarmason

Refactor the code added in 76d7602631a (archive-tar: add internal gzip
implementation, 2022-06-15) to call the magic "git archive gzip"
command as a function.

A subsequent commit will start using this as a fallback, but for now
there are no functional changes here.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 archive-tar.c | 43 +++++++++++++++++++++++++------------------
 1 file changed, 25 insertions(+), 18 deletions(-)

diff --git a/archive-tar.c b/archive-tar.c
index 8c5de949c64..dfc133deac7 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -465,12 +465,33 @@ static void tgz_write_block(const void *data)
 
 static const char internal_gzip_command[] = "git archive gzip";
 
-static int write_tar_filter_archive(const struct archiver *ar,
-				    struct archiver_args *args)
+static int gzip_internally(const struct archiver *ar,
+			   struct archiver_args *args)
 {
 #if ZLIB_VERNUM >= 0x1221
 	struct gz_header_s gzhead = { .os = 3 }; /* Unix, for reproducibility */
 #endif
+	int r;
+
+	write_block = tgz_write_block;
+	git_deflate_init_gzip(&gzstream, args->compression_level);
+#if ZLIB_VERNUM >= 0x1221
+	if (deflateSetHeader(&gzstream.z, &gzhead) != Z_OK)
+		BUG("deflateSetHeader() called too late");
+#endif
+	gzstream.next_out = outbuf;
+	gzstream.avail_out = sizeof(outbuf);
+
+	r = write_tar_archive(ar, args);
+
+	tgz_deflate(Z_FINISH);
+	git_deflate_end(&gzstream);
+	return r;
+}
+
+static int write_tar_filter_archive(const struct archiver *ar,
+				    struct archiver_args *args)
+{
 	struct strbuf cmd = STRBUF_INIT;
 	struct child_process filter = CHILD_PROCESS_INIT;
 	int r;
@@ -478,22 +499,8 @@ static int write_tar_filter_archive(const struct archiver *ar,
 	if (!ar->filter_command)
 		BUG("tar-filter archiver called with no filter defined");
 
-	if (!strcmp(ar->filter_command, internal_gzip_command)) {
-		write_block = tgz_write_block;
-		git_deflate_init_gzip(&gzstream, args->compression_level);
-#if ZLIB_VERNUM >= 0x1221
-		if (deflateSetHeader(&gzstream.z, &gzhead) != Z_OK)
-			BUG("deflateSetHeader() called too late");
-#endif
-		gzstream.next_out = outbuf;
-		gzstream.avail_out = sizeof(outbuf);
-
-		r = write_tar_archive(ar, args);
-
-		tgz_deflate(Z_FINISH);
-		git_deflate_end(&gzstream);
-		return r;
-	}
+	if (!strcmp(ar->filter_command, internal_gzip_command))
+		return gzip_internally(ar, args);
 
 	strbuf_addstr(&cmd, ar->filter_command);
 	if (args->compression_level >= 0)
-- 
2.39.1.1392.g63e6d408230


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 6/9] archive: use "gzip -cn" for stability, not "git archive gzip"
  2023-02-02  9:32   ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Ævar Arnfjörð Bjarmason
                       ` (4 preceding siblings ...)
  2023-02-02  9:32     ` [PATCH 5/9] archive-tar.c: move internal gzip implementation to a function Ævar Arnfjörð Bjarmason
@ 2023-02-02  9:32     ` Ævar Arnfjörð Bjarmason
  2023-02-02  9:32     ` [PATCH 7/9] test-lib.sh: add a lazy GZIP prerequisite Ævar Arnfjörð Bjarmason
                       ` (5 subsequent siblings)
  11 siblings, 0 replies; 57+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-02-02  9:32 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Eli Schwartz, René Scharfe,
	brian m . carlson, Konstantin Ryabitsev, Michal Suchánek,
	Raymond E . Pasco, demerphq, Theodore Ts'o,
	Ævar Arnfjörð Bjarmason

This reverts and amends [1] so that we don't use "git archive gzip" by
default, but only fall back on it when we cannot invoke "gzip".

As noted in the discussion at [2], that commit, first released with
v2.38.0, caused widespread breakage in the wild: Hosting sites like
GitHub tend to offer a feature to download tagged releases as
archives, which are generated by some variant of "git archive
--format=tgz".

Downstream distributors then tend to (re-)download those archives
as-is, hardcoding their known hashes in their packaging systems. See
[3], [4] etc. for reports of those systems breaking in conjunction
with [1].

The rationale is entirely missing from the commit message for [1],
but as seen in the question about that in [5] and the reply at [6],
at the time it was to "avoid a run[time] dependency; the build/test
dependency remains.".

It's not immediately apparent what the second part of that is
referring to, as [1] also removed the "GZIP" prerequisite from some
tests. The answer is that we still have other tests that need "GZIP",
but those are invoking "gzip(1)" explicitly.

In any case, whatever promises we make in the future about the
stability and non-stability of "git archive" output (or the derived
compressed artifact), this amount of fallout isn't worth it to get to
the stated goal in [1].

Let's instead default to "gzip -cn" again, but if we can't find it
fall back on "git archive gzip". Note that we'll only fall back if
that "gzip -cn" is ours, not if it comes from the user's own
"tar.<format>.command" configuration.

If we do need the fallback we'll warn about it. No such warning will
be emitted if the user has explicitly asked for "git archive gzip".
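
For users who'd rather not depend on the default at all, pinning the
filter explicitly looks roughly like the sketch below. The throwaway
repository, file name and identity are made up for illustration; the
only thing taken from this series is the "tar.tgz.command" key and
the "gzip -cn" value:

```shell
# Sketch: pin the tgz filter explicitly instead of relying on the default.
repo=$(mktemp -d)
git init -q "$repo"
printf 'hello\n' >"$repo/file.txt"
git -C "$repo" add file.txt
git -C "$repo" -c user.name=Example -c user.email=example@example.com \
	commit -q -m initial

# A user-configured command is used as-is; no fallback is attempted for it.
git -C "$repo" -c tar.tgz.command='gzip -cn' \
	archive --format=tgz -o "$repo/a.tgz" HEAD
git -C "$repo" -c tar.tgz.command='gzip -cn' \
	archive --format=tgz -o "$repo/b.tgz" HEAD

# With the same git, the same gzip and the same commit, the bytes agree.
if cmp -s "$repo/a.tgz" "$repo/b.tgz"
then
	same=yes
else
	same=no
fi
echo "same=$same"
rm -rf "$repo"
```

Configuring "git archive gzip" in the same way selects the internal
implementation without the warning, on versions of git that have it.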

1. 4f4be00d302 (archive-tar: use internal gzip by default, 2022-06-15)
2. https://lore.kernel.org/git/a812a664-67ea-c0ba-599f-cb79e2d96694@gmail.com/
3. https://github.com/Homebrew/homebrew-core/issues/121877
4. https://github.com/bazel-contrib/SIG-rules-authors/issues/11
5. https://lore.kernel.org/git/220615.86wndhwt9a.gmgdl@evledraar.gmail.com/
6. https://lore.kernel.org/git/3ed80afd-34b3-afd8-5ffb-0187a4475ee1@web.de/

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 Documentation/config/tar.txt |  8 ++++++--
 archive-tar.c                | 20 +++++++++++++++-----
 2 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/Documentation/config/tar.txt b/Documentation/config/tar.txt
index 5456fc617a2..37f24baa73a 100644
--- a/Documentation/config/tar.txt
+++ b/Documentation/config/tar.txt
@@ -16,8 +16,12 @@ tar.<format>.command::
 	to the command (e.g., `-9`).
 +
 The `tar.gz` and `tgz` formats are defined automatically and use the
-magic command `git archive gzip` by default, which invokes an internal
-implementation of gzip.
+command `gzip -cn` by default. An internal gzip implementation can be
+used by specifying the value `git archive gzip`.
++
+If `gzip -cn` cannot be executed we'll fall back on `git archive gzip`
+with a warning. If you don't have a gzip(1) and would like to use the
+internal `git archive gzip` without a warning, configure it explicitly.
 +
 The automatically defined commands do not invoke the shell, avoiding
 the minor overhead of an extra sh(1) process.
diff --git a/archive-tar.c b/archive-tar.c
index dfc133deac7..26efb911ebc 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -464,6 +464,7 @@ static void tgz_write_block(const void *data)
 }
 
 static const char internal_gzip_command[] = "git archive gzip";
+static const char gzip_cn_command[] = "gzip -cn";
 
 static int gzip_internally(const struct archiver *ar,
 			   struct archiver_args *args)
@@ -494,12 +495,15 @@ static int write_tar_filter_archive(const struct archiver *ar,
 {
 	struct strbuf cmd = STRBUF_INIT;
 	struct child_process filter = CHILD_PROCESS_INIT;
+	int filter_is_gzip_cn = 0;
 	int r;
 
 	if (!ar->filter_command)
 		BUG("tar-filter archiver called with no filter defined");
 
-	if (!strcmp(ar->filter_command, internal_gzip_command))
+	if (!strcmp(ar->filter_command, gzip_cn_command))
+		filter_is_gzip_cn = 1;
+	else if (!strcmp(ar->filter_command, internal_gzip_command))
 		return gzip_internally(ar, args);
 
 	strbuf_addstr(&cmd, ar->filter_command);
@@ -515,8 +519,14 @@ static int write_tar_filter_archive(const struct archiver *ar,
 	filter.in = -1;
 	filter.silent_exec_failure = 1;
 
-	if (start_command(&filter) < 0)
-		die_errno(_("unable to start '%s' filter"), cmd.buf);
+	if (start_command(&filter) < 0) {
+		if (!filter_is_gzip_cn)
+			die_errno(_("unable to start '%s' filter"), cmd.buf);
+
+		warning_errno(_("unable to start '%s' filter, falling back to '%s'"),
+			      cmd.buf, internal_gzip_command);
+		return gzip_internally(ar, args);
+	}
 	close(1);
 	if (dup2(filter.in, 1) < 0)
 		die_errno(_("unable to redirect descriptor"));
@@ -544,9 +554,9 @@ void init_tar_archiver(void)
 	int configured = 1;
 	register_archiver(&tar_archiver);
 
-	tar_filter_config("tar.tgz.command", internal_gzip_command, NULL);
+	tar_filter_config("tar.tgz.command", gzip_cn_command, NULL);
 	tar_filter_config("tar.tgz.remote", "true", NULL);
-	tar_filter_config("tar.tar.gz.command", internal_gzip_command, NULL);
+	tar_filter_config("tar.tar.gz.command", gzip_cn_command, NULL);
 	tar_filter_config("tar.tar.gz.remote", "true", NULL);
 	git_config(git_tar_config, &configured);
 	for (i = 0; i < nr_tar_filters; i++) {
-- 
2.39.1.1392.g63e6d408230


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 7/9] test-lib.sh: add a lazy GZIP prerequisite
  2023-02-02  9:32   ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Ævar Arnfjörð Bjarmason
                       ` (5 preceding siblings ...)
  2023-02-02  9:32     ` [PATCH 6/9] archive: use "gzip -cn" for stability, not "git archive gzip" Ævar Arnfjörð Bjarmason
@ 2023-02-02  9:32     ` Ævar Arnfjörð Bjarmason
  2023-02-02  9:32     ` [PATCH 8/9] archive tests: test for "gzip -cn" and "git archive gzip" stability Ævar Arnfjörð Bjarmason
                       ` (4 subsequent siblings)
  11 siblings, 0 replies; 57+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-02-02  9:32 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Eli Schwartz, René Scharfe,
	brian m . carlson, Konstantin Ryabitsev, Michal Suchánek,
	Raymond E . Pasco, demerphq, Theodore Ts'o,
	Ævar Arnfjörð Bjarmason

Move the "gzip --version" lazy prerequisite added in [1] and
copy/pasted to another test in [2] to test-lib.sh. A subsequent commit
will add a third user, so let's first stop duplicating it.

1. 96174145fc3 (t5000: simplify gzip prerequisite checks, 2013-12-03)
2. 6c213e863ae (http-backend: respect CONTENT_LENGTH for receive-pack,
   2018-07-27)

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 t/t5000-tar-tree.sh                    | 2 --
 t/t5562-http-backend-content-length.sh | 2 --
 t/test-lib.sh                          | 4 ++++
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
index d4730481384..e1fa34bb828 100755
--- a/t/t5000-tar-tree.sh
+++ b/t/t5000-tar-tree.sh
@@ -38,8 +38,6 @@ test_lazy_prereq TAR_NEEDS_PAX_FALLBACK '
 	)
 '
 
-test_lazy_prereq GZIP 'gzip --version'
-
 get_pax_header() {
 	file=$1
 	header=$2=
diff --git a/t/t5562-http-backend-content-length.sh b/t/t5562-http-backend-content-length.sh
index b68ec22d3fd..e83aa336fa8 100755
--- a/t/t5562-http-backend-content-length.sh
+++ b/t/t5562-http-backend-content-length.sh
@@ -3,8 +3,6 @@
 test_description='test git-http-backend respects CONTENT_LENGTH'
 . ./test-lib.sh
 
-test_lazy_prereq GZIP 'gzip --version'
-
 verify_http_result() {
 	# some fatal errors still produce status 200
 	# so check if there is the error message
diff --git a/t/test-lib.sh b/t/test-lib.sh
index 01e88781dd2..33bb9fe991f 100644
--- a/t/test-lib.sh
+++ b/t/test-lib.sh
@@ -1922,6 +1922,10 @@ test_lazy_prereq LONG_IS_64BIT '
 test_lazy_prereq TIME_IS_64BIT 'test-tool date is64bit'
 test_lazy_prereq TIME_T_IS_64BIT 'test-tool date time_t-is64bit'
 
+test_lazy_prereq GZIP '
+	gzip --version
+'
+
 test_lazy_prereq CURL '
 	curl --version
 '
-- 
2.39.1.1392.g63e6d408230


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 8/9] archive tests: test for "gzip -cn" and "git archive gzip" stability
  2023-02-02  9:32   ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Ævar Arnfjörð Bjarmason
                       ` (6 preceding siblings ...)
  2023-02-02  9:32     ` [PATCH 7/9] test-lib.sh: add a lazy GZIP prerequisite Ævar Arnfjörð Bjarmason
@ 2023-02-02  9:32     ` Ævar Arnfjörð Bjarmason
  2023-02-02  9:32     ` [PATCH 9/9] git archive docs: document output non-stability Ævar Arnfjörð Bjarmason
                       ` (3 subsequent siblings)
  11 siblings, 0 replies; 57+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-02-02  9:32 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Eli Schwartz, René Scharfe,
	brian m . carlson, Konstantin Ryabitsev, Michal Suchánek,
	Raymond E . Pasco, demerphq, Theodore Ts'o,
	Ævar Arnfjörð Bjarmason

If our test suite is instrumented to run the first "test_cmp_bin" in
"test_done" it'll mostly pass, but fail on a few tests, such as
"t5319-multi-pack-index.sh". Those tests reveal edge cases where the
output of "gzip -cn" is different than that of "git archive gzip" for
the same input.

Let's extract a minimal version of the part of
"t5319-multi-pack-index.sh" which triggers it, and add a test for
archival stability.

Whatever we ultimately decide to promise when it comes to this
stability (see [1]) it'll be better to go into any behavior difference
knowing that's what we're about to do, rather than discover widespread
breakage due to already released Git versions.

The "GZIP_TRIVIALLY_STABLE" code here is added because on OSX even a
trivial *.tgz generated by the two methods will be different.

1. https://lore.kernel.org/git/a812a664-67ea-c0ba-599f-cb79e2d96694@gmail.com/

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 t/t5005-archive-stability.sh | 70 ++++++++++++++++++++++++++++++++++++
 1 file changed, 70 insertions(+)
 create mode 100755 t/t5005-archive-stability.sh

diff --git a/t/t5005-archive-stability.sh b/t/t5005-archive-stability.sh
new file mode 100755
index 00000000000..c7532886920
--- /dev/null
+++ b/t/t5005-archive-stability.sh
@@ -0,0 +1,70 @@
+#!/bin/sh
+
+test_description='git archive stability'
+
+TEST_PASSES_SANITIZE_LEAK=true
+. ./test-lib.sh
+
+create_archive_file_with_config () {
+	local file="$1" &&
+	local config="$2" &&
+	shift 2 &&
+
+	test_when_finished "rm -rf \"$file\"" &&
+	git -c tar.tgz.command="$config" archive -o "$file" HEAD
+}
+
+setup_gzip_vs_git_archive_gzip () {
+	create_archive_file_with_config "expect.tgz" "gzip -cn" &&
+	create_archive_file_with_config "actual.tgz" "git archive gzip"
+}
+
+test_lazy_prereq GZIP_TRIVIALLY_STABLE '
+	git clone "$TRASH_DIRECTORY" . &&
+	test_commit P &&
+	setup_gzip_vs_git_archive_gzip &&
+	test_cmp_bin expect.tgz actual.tgz
+'
+
+if ! test_have_prereq GZIP_TRIVIALLY_STABLE
+then
+	skip_all='skipping gzip vs. git archive gzip tests, even trivial content differs'
+	test_done
+fi
+
+# The first test_expect_success is after the "skip_all" so we'll get
+# the skip summary in prove(1) output.
+test_expect_success 'setup' '
+	test_commit A
+'
+
+test_expect_success GZIP '"gzip -cn" and "git archive gzip" still produce the same output' '
+	setup_gzip_vs_git_archive_gzip &&
+	test_cmp_bin expect.tgz actual.tgz
+'
+
+generate_objects () {
+	i=$1
+	iii=$(printf '%03i' $i)
+	{
+		echo $iii &&
+		test-tool genrandom "$iii" 8192
+	} >file_$iii &&
+	git update-index --add file_$iii
+}
+
+test_expect_success 'create objects with (stable) random data' '
+	test_commit initial &&
+	for i in $(test_seq 1 5)
+	do
+		generate_objects $i || return 1
+	done &&
+	git commit -m"add objects"
+'
+
+test_expect_success GZIP '"gzip -cn" and "git archive gzip" have differing output' '
+	setup_gzip_vs_git_archive_gzip &&
+	! test_cmp_bin expect.tgz actual.tgz
+'
+
+test_done
-- 
2.39.1.1392.g63e6d408230


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 9/9] git archive docs: document output non-stability
  2023-02-02  9:32   ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Ævar Arnfjörð Bjarmason
                       ` (7 preceding siblings ...)
  2023-02-02  9:32     ` [PATCH 8/9] archive tests: test for "gzip -cn" and "git archive gzip" stability Ævar Arnfjörð Bjarmason
@ 2023-02-02  9:32     ` Ævar Arnfjörð Bjarmason
  2023-02-02 10:25       ` brian m. carlson
  2023-02-02 16:17     ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Phillip Wood
                       ` (2 subsequent siblings)
  11 siblings, 1 reply; 57+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-02-02  9:32 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Eli Schwartz, René Scharfe,
	brian m . carlson, Konstantin Ryabitsev, Michal Suchánek,
	Raymond E . Pasco, demerphq, Theodore Ts'o,
	Ævar Arnfjörð Bjarmason

There's an ongoing discussion about the output stability of "git
archive"[1] as a follow-up to the incident GitHub experienced when
upgrading to v2.38.0[2].

In a preceding commit we reverted the immediate cause of that
incident, which was that we'd moved away from "gzip -cn" as the
default compression method in favor of the internal "git archive gzip"
in [3].

Let's follow that up by documenting the non-promises we've always
maintained with regard to "git archive"'s output stability. We may
want to make stronger promises in this area, but this change avoids
addressing that question.

Instead we note that the output has changed in the past, that we
don't change it willy-nilly, and that it may change again in the
future. The only new promise here, one we haven't explicitly made
historically, is that we'll forever shell out to the system's "gzip"
by default. Whether that produces stable output we leave up to the
"gzip" tool.

We're also discussing the caveats & differences in output between
SHA-1 and SHA-256 repositories, and trying to steer users towards more
stable alternatives: first by using "git verify-tag" and the like to
verify releases, and, if they really must checksum generated output,
by encouraging them to at least checksum the "tar" output contained
within the compressed output, not the compressed output itself.
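
The idea of checksumming the inner "tar" stream can be sketched
without git at all. Everything below is stand-in data: the file name,
the "inner_sum" helper and the use of cksum(1) are assumptions for
illustration, and a real workflow would use a cryptographic hash such
as SHA-256 on a downloaded release:

```shell
# Sketch: checksum the tar stream inside a .tar.gz, not the .tar.gz itself.
work=$(mktemp -d)
printf 'hello\n' >"$work/file.txt"

# Create one tar stream, then compress it with two different settings.
tar -cf "$work/src.tar" -C "$work" file.txt
gzip -cn -1 <"$work/src.tar" >"$work/fast.tar.gz"
gzip -cn -9 <"$work/src.tar" >"$work/best.tar.gz"

# Hypothetical helper: checksum the decompressed tar stream.
# cksum(1) is used only because it is in POSIX; it is not cryptographic.
inner_sum () {
	gzip -dc "$1" | cksum
}

sum_fast=$(inner_sum "$work/fast.tar.gz")
sum_best=$(inner_sum "$work/best.tar.gz")
if [ "$sum_fast" = "$sum_best" ]
then
	echo "inner tar checksums match"
fi
rm -rf "$work"
```

The two compressed files may well differ byte for byte, but the
checksum of what they decompress to is the same, so only git's own
"tar" output needs to stay stable for such a check to keep passing.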

1. https://lore.kernel.org/git/a812a664-67ea-c0ba-599f-cb79e2d96694@gmail.com/
2. https://github.com/orgs/community/discussions/45830
3. 4f4be00d302 (archive-tar: use internal gzip by default, 2022-06-15)

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 Documentation/git-archive.txt | 70 ++++++++++++++++++++++++++++++++++-
 1 file changed, 69 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-archive.txt b/Documentation/git-archive.txt
index 268e797f03a..78f1b033cb7 100644
--- a/Documentation/git-archive.txt
+++ b/Documentation/git-archive.txt
@@ -14,6 +14,7 @@ SYNOPSIS
 	      [--remote=<repo> [--exec=<git-upload-archive>]] <tree-ish>
 	      [<path>...]
 
+[[DESCRIPTION]]
 DESCRIPTION
 -----------
 Creates an archive of the specified format containing the tree
@@ -28,7 +29,7 @@ case the commit time as recorded in the referenced commit object is
 used instead.  Additionally the commit ID is stored in a global
 extended pax header if the tar format is used; it can be extracted
 using 'git get-tar-commit-id'. In ZIP files it is stored as a file
-comment.
+comment. See the <<STABILITY,OUTPUT STABILITY>> section below.
 
 OPTIONS
 -------
@@ -202,6 +203,73 @@ EXAMPLES
 	You can use it specifying `--format=tar.xz`, or by creating an
 	output file like `-o foo.tar.xz`.
 
+[[STABILITY]]
+OUTPUT STABILITY
+----------------
+
+The output of 'git archive' is not guaranteed to be stable, and may
+change between versions.
+
+There are many valid ways to encode the same data in the tar format
+itself. For non-`tar` arguments to the `--format` option we rely on
+external tools (or libraries) for compressing the output we generate.
+
+The `tar` format contains the commit ID in the pax header (see the
+<<DESCRIPTION>> section above). A repository that's been migrated from
+SHA-1 to SHA-256 will therefore have different `tar` output for the
+"same" commit. See `extension.objectFormat` in linkgit:git-config[1].
+
+Instead of relying on the output of `git archive`, you should prefer
+to stick to git's own transport protocols, and e.g. validate releases
+with linkgit:git-tag[1]'s `--verify` option.
+
+Despite the output of `git archive` having never been promised to be
+stable, various users in the wild have come to rely on that being the
+case.
+
+Most notably, large hosting providers provide a way to download a
+given tagged release as a `git archive`. Some downstream tools then
+expect the content of that archive to be stable. When that's changed
+widespread breakage has been observed, see
+https://github.com/orgs/community/discussions/45830 for one such case.
+
+While we won't promise that the output won't change in the future, we
+are aware of these users, and will try to avoid changing it
+willy-nilly. Furthermore, we make the following promises:
+
+* The default gzip compression tool will continue to be gzip(1). If
+  you rely on this being e.g. GNU gzip for the purposes of stability,
+  it's up to you to ensure that its output is stable across
+  versions.
++
+
+We in turn promise to not e.g. make the internal "git archive gzip"
+implementation the default, as it produces different output than
+gzip(1) in some cases.
+
+* We will do our best not to change the "tar" output itself, but won't
+  promise that we're never going to change it.
++
+If you must avoid using "git" itself for the tree validation, you
+should be checksumming the uncompressed "tar" output, not e.g. the
+compressed "tgz" output.
++
+
+This ensures that you're only relying on the output emitted by git
+itself, and avoiding the additional dependency on external
+compression.
++
+See
+https://git.kernel.org/pub/scm/linux/kernel/git/mricon/korg-helpers.git/tree/get-verified-tarball
+for an implementation of that workflow.
+
+* We promise that a given version of git will emit stable "tar" output
+  for the same tree ID (but not commit ID, see the discussion in the
+  <<DESCRIPTION>> section above).
++
+While you shouldn't assume that different versions of git will emit
+the same output, you can assume (e.g. for the purposes of caching)
+that a given version's output is stable.
 
 SEE ALSO
 --------
-- 
2.39.1.1392.g63e6d408230


* Re: [PATCH 9/9] git archive docs: document output non-stability
  2023-02-02  9:32     ` [PATCH 9/9] git archive docs: document output non-stability Ævar Arnfjörð Bjarmason
@ 2023-02-02 10:25       ` brian m. carlson
  2023-02-02 10:30         ` Ævar Arnfjörð Bjarmason
  2023-02-02 16:34         ` Junio C Hamano
  0 siblings, 2 replies; 57+ messages in thread
From: brian m. carlson @ 2023-02-02 10:25 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Junio C Hamano, Eli Schwartz, René Scharfe,
	Konstantin Ryabitsev, Michal Suchánek, Raymond E . Pasco,
	demerphq, Theodore Ts'o

On 2023-02-02 at 09:32:29, Ævar Arnfjörð Bjarmason wrote:
> +[[STABILITY]]
> +OUTPUT STABILITY
> +----------------
> +
> +The output of 'git archive' is not guaranteed to be stable, and may
> +change between versions.
> +
> +There are many valid ways to encode the same data in the tar format
> +itself. For non-`tar` arguments to the `--format` option we rely on
> +external tools (or libraries) for compressing the output we generate.
> +
> +The `tar` format contains the commit ID in the pax header (see the
> +<<DESCRIPTION>> section above). A repository that's been migrated from
> +SHA-1 to SHA-256 will therefore have different `tar` output for the
> +"same" commit. See `extension.objectFormat` in linkgit:git-config[1].
> +
> +Instead of relying on the output of `git archive`, you should prefer
> +to stick to git's own transport protocols, and e.g. validate releases
> +with linkgit:git-tag[1]'s `--verify` option.
> +
> +Despite the output of `git archive` having never been promised to be
> +stable, various users in the wild have come to rely on that being the
> +case.
> +
> +Most notably, large hosting providers provide a way to download a
> +given tagged release as a `git archive`. Some downstream tools then
> +expect the content of that archive to be stable. When that's changed
> +widespread breakage has been observed, see
> +https://github.com/orgs/community/discussions/45830 for one such case.
> +
> +While we won't promise that the output won't change in the future, we
> +are aware of these users, and will try to avoid changing it
> +willy-nilly. Furthermore, we make the following promises:
> +
> +* The default gzip compression tool will continue to be gzip(1). If
> +  you rely on this being e.g. GNU gzip for the purposes of stability,
> +  it's up to you to ensure that its output is stable across
> +  versions.
> ++
> +
> +We in turn promise to not e.g. make the internal "git archive gzip"
> +implementation the default, as it produces different ouput than
> +gzip(1) in some case.

I think this is fine up to here.

> +* We will do our best not to change the "tar" output itself, but won't
> +  promise that we're never going to change it.
> ++
> +If you must avoid using "git" itself for the tree validation, you
> +should be checksumming the uncompressed "tar" output, not e.g. the
> +compressed "tgz" output.
> ++

I don't think I want to state this, because it implies that the changes
I made that broke kernel.org (making tar.umask apply to pax headers)
wouldn't have been allowed.  We should probably just state that "we
won't promise that the tar output won't change between versions". Maybe,
"We won't change the tar output needlessly, but it may change from time
to time."  That is, we won't be "let's change the format just to mix it
up for users", but if there's a valuable patch that could be applied,
then we might well take it.

As I said, it's my goal to provide more concrete guarantees in a future
patch, probably this weekend.

> +* We promise that a given version of git will emit stable "tar" output
> +  for the same tree ID (but not commit ID, see the discussion in the
> +  <<DESCRIPTION>> section above).

I think that section contradicts this.  The tree version uses the
current timestamp, which would make the archive change based on the time
of day.
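
As a rough illustration of the timestamp point (a sketch only: it uses
Python's tarfile module as a stand-in for git's own tar writer, and the
file name, payload and timestamp are made up), the same content
archived with a different mtime produces different tar bytes:

```python
import hashlib
import io
import tarfile


def make_tar(mtime: int) -> bytes:
    """Build a one-file tar archive in memory with the given mtime."""
    payload = b"hello\n"
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
        info = tarfile.TarInfo(name="README")  # hypothetical file name
        info.size = len(payload)
        info.mtime = mtime
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


commit_time = 1675330349  # arbitrary fixed timestamp, standing in for a commit time
same_a = digest(make_tar(commit_time))
same_b = digest(make_tar(commit_time))
later = digest(make_tar(commit_time + 60))  # stands in for "current time" a minute later

print(same_a == same_b)  # fixed timestamp: archives compare equal
print(same_a == later)   # moving timestamp: archives compare unequal
```

With a fixed timestamp the two digests agree; once the timestamp moves,
they no longer do, which is the situation described above for archives
made from a bare tree ID.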

> +While you shouldn't assume that different versions of git will emit
> +the same output, you can assume (e.g. for the purposes of caching)
> +that a given version's output is stable.

Unfortunately, this isn't actually true if someone uses export-subst.
That's because adding unrelated objects can increase the length of
abbreviations, and then the tar contents can be different.  I've
actually seen this in the wild.

Modulo that, yes, I agree with this.
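
The export-subst caveat above can be modelled with a small sketch. This
is a simplified model, not git's actual abbreviation code: the object
IDs are invented, and the helper name and the minimum length of 7 are
assumptions for illustration. It shows how adding an unrelated object
can lengthen the shortest unique abbreviation of an existing one:

```python
from typing import Iterable


def shortest_unique_prefix(target: str, object_ids: Iterable[str], minimum: int = 7) -> str:
    """Return the shortest prefix of target (at least `minimum` chars)
    that is not a prefix of any other object ID in the collection."""
    others = [oid for oid in object_ids if oid != target]
    for length in range(minimum, len(target) + 1):
        prefix = target[:length]
        if not any(oid.startswith(prefix) for oid in others):
            return prefix
    return target


# Hypothetical 40-character object IDs.
target = "1f2e3d4c5b" + "0" * 30
existing = "a" * 40
unrelated = "1f2e3d4a" + "1" * 32  # shares the first 7 characters with target

before = shortest_unique_prefix(target, {target, existing})
after = shortest_unique_prefix(target, {target, existing, unrelated})

print(before)
print(after)
```

If an abbreviated ID like this is substituted into a file via
export-subst, the file contents (and so the tar bytes) would differ
between the two repository states, even though the tree is unchanged.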
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA

* Re: [PATCH 9/9] git archive docs: document output non-stability
  2023-02-02 10:25       ` brian m. carlson
@ 2023-02-02 10:30         ` Ævar Arnfjörð Bjarmason
  2023-02-02 16:34         ` Junio C Hamano
  1 sibling, 0 replies; 57+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-02-02 10:30 UTC (permalink / raw)
  To: brian m. carlson
  Cc: git, Junio C Hamano, Eli Schwartz, René Scharfe,
	Konstantin Ryabitsev, Michal Suchánek, Raymond E . Pasco,
	demerphq, Theodore Ts'o


On Thu, Feb 02 2023, brian m. carlson wrote:

>> +* We will do our best not to change the "tar" output itself, but won't
>> +  promise that we're never going to change it.
>> ++
>> +If you must avoid using "git" itself for the tree validation, you
>> +should be checksumming the uncompressed "tar" output, not e.g. the
>> +compressed "tgz" output.
>> ++
>
> I don't think I want to state this, because it implies that the changes
> I made that broke kernel.org (making tar.umask apply to pax headers)
> wouldn't have been allowed.

I don't see how "we'll do our best, but it might change" precludes that...

> We should probably just state that "we
> won't promise that the tar output won't change between versions". Maybe,

...but it sounds like you'd like this "softer" promise. I think it's
saying the same, but picked the "we'll try not to" wording because I
think it more accurately reflects reality, but...

> "We won't change the tar output needlessly, but it may change from time
> to time."  That is, we won't be "let's change the format just to mix it
> up for users", but if there's a valuable patch that could be applied,
> then we might well take it.

...here we're back (at least per my reading) to basically what my
proposed patch said. I'm happy to improve/change the wording, but I'm
confused about the "because it implies" part you noted.

> As I said, it's my goal to provide more concrete guarantees in a future
> patch, probably this weekend.

I think that would be great, but also think that if we're going to make
new guarantees it's probably best applied on top of a series such as
this, which, aside from reverting back to gzip as the default,
attempts to clarify the status quo.
>
>> +* We promise that a given version of git will emit stable "tar" output
>> +  for the same tree ID (but not commit ID, see the discussion in the
>> +  <<DESCRIPTION>> section above).
>
> I think that section contradicts this.  The tree version uses the
> current timestamp, which would make the archive change based on the time
> of day.

Thanks! It's referring back to the previous discussion, but I managed to
somehow get the tree & commit cases reversed.

>> +While you shouldn't assume that different versions of git will emit
>> +the same output, you can assume (e.g. for the purposes of caching)
>> +that a given version's output is stable.
>
> Unfortunately, this isn't actually true if someone uses export-subst.
> That's because adding unrelated objects can increase the length of
> abbreviations, and then the tar contents can be different.  I've
> actually seen this in the wild.
>
> Modulo that, yes, I agree with this.

I didn't know about the export-subst case, I'll add that caveat in
there. Thanks!

* Re: [PATCH 0/9] git archive: use gzip again by default, document output stabilty
  2023-02-02  9:32   ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Ævar Arnfjörð Bjarmason
                       ` (8 preceding siblings ...)
  2023-02-02  9:32     ` [PATCH 9/9] git archive docs: document output non-stability Ævar Arnfjörð Bjarmason
@ 2023-02-02 16:17     ` Phillip Wood
  2023-02-02 16:40       ` Junio C Hamano
                         ` (2 more replies)
  2023-02-02 16:25     ` Junio C Hamano
  2023-02-02 19:23     ` Raymond E. Pasco
  11 siblings, 3 replies; 57+ messages in thread
From: Phillip Wood @ 2023-02-02 16:17 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, git
  Cc: Junio C Hamano, Eli Schwartz, René Scharfe,
	brian m . carlson, Konstantin Ryabitsev, Michal Suchánek,
	Raymond E . Pasco, demerphq, Theodore Ts'o

Hi Ævar

On 02/02/2023 09:32, Ævar Arnfjörð Bjarmason wrote:
> As reported in
> https://lore.kernel.org/git/a812a664-67ea-c0ba-599f-cb79e2d96694@gmail.com/
> changing the default "tgz" output method of from "gzip(1)" to our
> internal "git archive gzip" (using zlib ) broke things for users in
> the wild that assume that the "git archive" output is stable, most
> notably GitHub: https://github.com/orgs/community/discussions/45830
 >
> Leaving aside the larger question of whether we're going to promise
> output stability for "git archive" in general, the motivation for that
> change was to have a working compression method on systems that lacked
> a gzip(1).

As I recall, the reduction in CPU time used to create a compressed
archive was a factor in making it the default.

> As the disruption of changing the default isn't worth it, let's use
> gzip(1) again by default, and only fall back on the new "git archive
> gzip" if it isn't available.

Playing devil's advocate for a moment: as we're not going to promise that
the compressed output of "git archive" will be stable in the future,
perhaps we should use this breakage as an opportunity to highlight that
to users and to advertise the config setting that allows them to use
gzip for compressing archives. Reverting the change gives the misleading
impression that we're making a commitment to keeping the output stable.
The focus of this thread seems to be the problems relating to github,
which they have already addressed.

I think there is general agreement that it is not practical to promise
that the compressed output of "git archive" is stable, so maybe it is
better to make that clear now, while users can work around it in the
short term with a config setting, rather than waiting until we're faced
with some security or other issue that forces a change to the output
which users cannot work around so easily.

Best Wishes

Phillip


> The later parts of this series then document and test for the output
> stability of the command.
> 
> We're not promising anything new there, except that we now promise
> that we're going to use "gzip" as the default compressor, but that
> it's up to that command to be stable, should the user desire output
> stability.
> 
> The documentation discusses the various caveats involved, suggests
> alternatives to checksumming compressed archives, but in the end notes
> what's been the policy so far: We're not promising that the "tar"
> output is going to be stable.
> 
> The early parts of this series (1-2/9) are clean-up for existing
> config drift, as later in the series we'll otherwise need to change
> the divergent config documentation in two places.
> 
> CI & branch for this at:
> https://github.com/avar/git/tree/avar/archive-internal-gzip-not-the-default
> 
> Ævar Arnfjörð Bjarmason (9):
>    archive & tar config docs: de-duplicate configuration section
>    git config docs: document "tar.<format>.{command,remote}"
>    archiver API: make the "flags" in "struct archiver" an enum
>    archive: omit the shell for built-in "command" filters
>    archive-tar.c: move internal gzip implementation to a function
>    archive: use "gzip -cn" for stability, not "git archive gzip"
>    test-lib.sh: add a lazy GZIP prerequisite
>    archive tests: test for "gzip -cn" and "git archive gzip" stability
>    git archive docs: document output non-stability
> 
>   Documentation/config/tar.txt           | 29 +++++++-
>   Documentation/git-archive.txt          | 96 +++++++++++++++++++-------
>   archive-tar.c                          | 78 ++++++++++++++-------
>   archive.h                              | 11 +--
>   t/t5000-tar-tree.sh                    |  2 -
>   t/t5005-archive-stability.sh           | 70 +++++++++++++++++++
>   t/t5562-http-backend-content-length.sh |  2 -
>   t/test-lib.sh                          |  4 ++
>   8 files changed, 231 insertions(+), 61 deletions(-)
>   create mode 100755 t/t5005-archive-stability.sh
> 

* Re: [PATCH 0/9] git archive: use gzip again by default, document output stabilty
  2023-02-02  9:32   ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Ævar Arnfjörð Bjarmason
                       ` (9 preceding siblings ...)
  2023-02-02 16:17     ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Phillip Wood
@ 2023-02-02 16:25     ` Junio C Hamano
  2023-02-04 18:08       ` René Scharfe
  2023-02-02 19:23     ` Raymond E. Pasco
  11 siblings, 1 reply; 57+ messages in thread
From: Junio C Hamano @ 2023-02-02 16:25 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Eli Schwartz, René Scharfe, brian m . carlson,
	Konstantin Ryabitsev, Michal Suchánek, Raymond E . Pasco,
	demerphq, Theodore Ts'o

Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:

> As the disruption of changing the default isn't worth it, let's use
> gzip(1) again by default, and only fall back on the new "git archive
> gzip" if it isn't available.

It perhaps is OK, and lets us answer "ugh, the compressed output of
'git archive' is unstable again" with "we didn't change anything,
perhaps you changed your gzip(1)?" when they fix bugs or improve
compression or whatever.  Of course that is not an overall win for
the end users, but in the short term until gzip gets such a change,
we would presumably get the "same" output as before.

* Re: [PATCH 9/9] git archive docs: document output non-stability
  2023-02-02 10:25       ` brian m. carlson
  2023-02-02 10:30         ` Ævar Arnfjörð Bjarmason
@ 2023-02-02 16:34         ` Junio C Hamano
  2023-02-04 17:46           ` brian m. carlson
  1 sibling, 1 reply; 57+ messages in thread
From: Junio C Hamano @ 2023-02-02 16:34 UTC (permalink / raw)
  To: brian m. carlson
  Cc: Ævar Arnfjörð Bjarmason, git, Eli Schwartz,
	René Scharfe, Konstantin Ryabitsev, Michal Suchánek,
	Raymond E . Pasco, demerphq, Theodore Ts'o

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

>> +* We will do our best not to change the "tar" output itself, but won't
>> +  promise that we're never going to change it.
>> ++
>> +If you must avoid using "git" itself for the tree validation, you
>> +should be checksumming the uncompressed "tar" output, not e.g. the
>> +compressed "tgz" output.
>> ++
>
> I don't think I want to state this, because it implies that the changes
> I made that broke kernel.org (making tar.umask apply to pax headers)
> wouldn't have been allowed.  We should probably just state that "we
> won't promise that the tar output won't change between versions". Maybe,
> "We won't change the tar output needlessly, but it may change from time
> to time."  That is, we won't be "let's change the format just to mix it
> up for users", but if there's a valuable patch that could be applied,
> then we might well take it.

I agree with you.  Giving "will do our best not to" is still too
strong for that.  We won't change the format willy-nilly but when
there is a good reason to do so, we should be able to fix or improve
the output.

>> +While you shouldn't assume that different versions of git will emit
>> +the same output, you can assume (e.g. for the purposes of caching)
>> +that a given version's output is stable.
>
> Unfortunately, this isn't actually true if someone uses export-subst.
> That's because adding unrelated objects can increase the length of
> abbreviations, and then the tar contents can be different.  I've
> actually seen this in the wild.

"subst" is certainly an issue, especially when the substitution is
unstable.

There shouldn't be cross platform differences to break bit-for-bit
stability at least for "tar" format, as we do not rely on any
external library.  Can we say the same for "zip"?  I thought we
throw the blob at git_deflate_*() so the exact bitstream is up to
the libz implementation?

* Re: [PATCH 0/9] git archive: use gzip again by default, document output stabilty
  2023-02-02 16:17     ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Phillip Wood
@ 2023-02-02 16:40       ` Junio C Hamano
  2023-02-03 13:49       ` Ævar Arnfjörð Bjarmason
  2023-02-03 15:47       ` Theodore Ts'o
  2 siblings, 0 replies; 57+ messages in thread
From: Junio C Hamano @ 2023-02-02 16:40 UTC (permalink / raw)
  To: Phillip Wood
  Cc: Ævar Arnfjörð Bjarmason, git, Eli Schwartz,
	René Scharfe, brian m . carlson, Konstantin Ryabitsev,
	Michal Suchánek, Raymond E . Pasco, demerphq,
	Theodore Ts'o

Phillip Wood <phillip.wood123@gmail.com> writes:

> ... Reverting the change
> gives the misleading impression that we're making a commitment to
> keeping the output stable. The focus of this thread seems to be the
> problems relating to github which they have already addressed.
>
> I think there is general agreement that it is not practical to promise
> that the compressed output of "git archive" is stable so maybe it is
> better to make that clear now while users can work around it in the
> short term with a config setting rather than waiting until we're faced
> with some security or other issue that forces a change to the output
> which users cannot work around so easily.

I love to see somebody else play the devil's advocate role.  Thanks
for all of the above.


* Re: [PATCH 0/9] git archive: use gzip again by default, document output stabilty
  2023-02-02  9:32   ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Ævar Arnfjörð Bjarmason
                       ` (10 preceding siblings ...)
  2023-02-02 16:25     ` Junio C Hamano
@ 2023-02-02 19:23     ` Raymond E. Pasco
  2023-02-03  8:06       ` [PATCH] archive: document output stability concerns Raymond E. Pasco
  11 siblings, 1 reply; 57+ messages in thread
From: Raymond E. Pasco @ 2023-02-02 19:23 UTC (permalink / raw)
  To: phillip.wood, Ævar Arnfjörð Bjarmason, git
  Cc: Junio C Hamano, Eli Schwartz, René Scharfe,
	brian m . carlson, Konstantin Ryabitsev, Michal Suchánek,
	demerphq, Theodore Ts'o

February 2, 2023 11:17 AM, "Phillip Wood" <phillip.wood123@gmail.com> wrote:
> Playing devil's advocate for a moment as we're not going to promise that the compressed output of
> "git archive" will be stable in the future perhaps we should use this breakage as an opportunity to
> highlight that to users and to advertize the config setting that allows them to use gzip for
> compressing archives. Reverting the change gives the misleading impression that we're making a
> commitment to keeping the output stable. The focus of this thread seems to be the problems relating
> to github which they have already addressed.
> 
> I think there is general agreement that it is not practical to promise that the compressed output
> of "git archive" is stable so maybe it is better to make that clear now while users can work around
> it in the short term with a config setting rather than waiting until we're faced with some security
> or other issue that forces a change to the output which users cannot work around so easily.

Reverting to the behavior of "use some arbitrary gzip from $PATH" would
be a poor decision whether or not git were willing to make some
commitment to gzip stability, because Git does not control arbitrary
gzips on the user's $PATH. If Git did want to promise gzip stability, it 
could only start from something like the current internal implementation
along with a vendored zlib; if it doesn't, as appears to be the case, 
then the internal implementation is superior for the other reasons 
already discussed.

If the user wants to depend on a particular gzip executable they supply, 
this configuration knob already exists for them.

Since there is no guarantee of stability, but there has been a popular 
misconception that there is some such guarantee (e.g., [1]), some kind 
of STABILITY section describing how there isn't any and suggesting ways
the user can attain more stability via configuration seems to be a good
idea.

[1]: https://lists.reproducible-builds.org/pipermail/rb-general/2021-October/002422.html

* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution
  2023-02-01 18:56                   ` Theodore Ts'o
@ 2023-02-02 21:19                     ` Joey Hess
  2023-02-03  4:02                       ` Theodore Ts'o
  0 siblings, 1 reply; 57+ messages in thread
From: Joey Hess @ 2023-02-02 21:19 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Git List

In my opinion as the original developer of pristine-tar, it's too
complicated to be usefully used by git. The problem it solves is of a
larger scope than the problem git has here. (I hope.)

Developing pristine-tar did entail much investigation of past changes in
compressor outputs. I know that gzip's output has sometimes not been
deterministic as recently as 2012, see for example
https://git.savannah.gnu.org/cgit/gzip.git/commit/?id=0a284baeaedca68017f46d2646e4

-- 
see shy jo

* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution
  2023-02-01 23:37           ` Junio C Hamano
@ 2023-02-02 23:01             ` brian m. carlson
  2023-02-02 23:47               ` rsbecker
  0 siblings, 1 reply; 57+ messages in thread
From: brian m. carlson @ 2023-02-02 23:01 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Ævar Arnfjörð Bjarmason, Konstantin Ryabitsev,
	Eli Schwartz, Git List

On 2023-02-01 at 23:37:19, Junio C Hamano wrote:
> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
> 
> > I don't think a blurb is necessary, but you're basically underscoring
> > the problem, which is that nobody is willing to promise that compression
> > is consistent, but yet people want to rely on that fact.  I'm willing to
> > write and implement a consistent tar spec and to guarantee compatibility
> > with that, but the tension here is that people also want gzip to never
> > change its byte format ever, which frankly seems unrealistic without
> > explicit guarantees.  Maybe the authors will agree to promise that, but
> > it seems unlikely.
> 
> Just to step back a bit, where does the distinction between
> guaranteeing the tar format stability and gzip compressed bitstream
> stability come from?  At both levels, the same thing can be
> expressed in multiple different ways, I think, but spelling out how
> exactly the compressor compresses is more involved than spelling out
> how entries in a tar archive is ordered and each entry is expressed,
> or something?

Yes, at least with my understanding about how gzip and compression in
general work.

The tar format (and the pax format which builds on it) can mostly be
restricted by explaining what data is to be included in the pax and tar
headers and how it is to be formatted.  If we say, we will always write
such and such information in the pax header and sort the keys, and we
write such and such information in the tar header, then the format is
completely deterministic, and we can make nice guarantees.

My understanding about how Lempel-Ziv-based compression algorithms work
is that there's a lot more freedom to decide how best to compress things
and that there isn't always a logical obvious choice, but I will admit
my understanding is relatively limited.  If someone thinks we can
effectively succeed in supporting compression more than just relying on
gzip, I would be delighted to be shown to be wrong.
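
To make the distinction concrete, here is a minimal sketch, assuming
Python's tarfile and gzip modules as stand-ins for git's tar writer and
for gzip(1); the file name and payload are invented. The same tar bytes
are compressed at two different levels, and checksums are compared at
both layers:

```python
import gzip
import hashlib
import io
import tarfile


def build_tar() -> bytes:
    """Build a small tar archive in memory with fixed metadata."""
    payload = b"".join(
        b"line %d of a hypothetical source file\n" % i for i in range(2000)
    )
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
        info = tarfile.TarInfo(name="project/main.c")  # hypothetical path
        info.size = len(payload)
        info.mtime = 0  # fixed timestamp so the tar layer is deterministic
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def sha256(data: bytes) -> str:
    """Return the SHA-256 hex digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


tar_bytes = build_tar()

# mtime=0 keeps the timestamp out of the gzip header, similar in spirit
# to "gzip -n"; only the compression level differs between the two.
fast = gzip.compress(tar_bytes, compresslevel=1, mtime=0)
best = gzip.compress(tar_bytes, compresslevel=9, mtime=0)

compressed_differ = sha256(fast) != sha256(best)
tar_matches = (
    sha256(gzip.decompress(fast))
    == sha256(gzip.decompress(best))
    == sha256(tar_bytes)
)

print(compressed_differ)
print(tar_matches)
```

Both compressed streams are valid gzip and decompress to identical tar
bytes, but a checksum over the compressed bytes would be expected to
differ between them. A checksum over the decompressed tar depends only
on the tar writer, which is the layer that can be specified
deterministically.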

> > That would probably break things, because gzip is GPLv3, and we'd need
> > to ship a much older GPLv2 gzip, which would probably differ from the
> > current behaviour, and might also have some security problems.
> 
> Yup, security issues may make bit-for-bit-stability unrealistic.
> IIRC, the last time we had discussion on this topic, we settled
> on stability across the same version of Git (i.e. deterministic
> result)?

Yes, I think that's what we agreed.
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA

* RE: Stability of git-archive, breaking (?) the Github universe, and a possible solution
  2023-02-02 23:01             ` brian m. carlson
@ 2023-02-02 23:47               ` rsbecker
  2023-02-03 13:18                 ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 57+ messages in thread
From: rsbecker @ 2023-02-02 23:47 UTC (permalink / raw)
  To: 'brian m. carlson', 'Junio C Hamano'
  Cc: 'Ævar Arnfjörð Bjarmason',
	'Konstantin Ryabitsev', 'Eli Schwartz',
	'Git List'

On February 2, 2023 6:02 PM, brian m. carlson wrote:
>On 2023-02-01 at 23:37:19, Junio C Hamano wrote:
>> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
>>
>> > I don't think a blurb is necessary, but you're basically
>> > underscoring the problem, which is that nobody is willing to promise
>> > that compression is consistent, but yet people want to rely on that
>> > fact.  I'm willing to write and implement a consistent tar spec and
>> > to guarantee compatibility with that, but the tension here is that
>> > people also want gzip to never change its byte format ever, which
>> > frankly seems unrealistic without explicit guarantees.  Maybe the
>> > authors will agree to promise that, but it seems unlikely.
>>
>> Just to step back a bit, where does the distinction between
>> guaranteeing the tar format stability and gzip compressed bitstream
>> stability come from?  At both levels, the same thing can be expressed
>> in multiple different ways, I think, but spelling out how exactly the
>> compressor compresses is more involved than spelling out how entries
>> in a tar archive is ordered and each entry is expressed, or something?
>
>Yes, at least with my understanding about how gzip and compression in general
>work.
>
>The tar format (and the pax format which builds on it) can mostly be restricted by
>explaining what data is to be included in the pax and tar headers and how it is to be
>formatted.  If we say, we will always write such and such information in the pax
>header and sort the keys, and we write such and such information in the tar header,
>then the format is completely deterministic, and we can make nice guarantees.
>
>My understanding about how Lempel-Ziv-based compression algorithms work is that
>there's a lot more freedom to decide how best to compress things and that there
>isn't always a logical obvious choice, but I will admit my understanding is relatively
>limited.  If someone thinks we can effectively succeed in supporting compression
>more than just relying on gzip, I would be delighted to be shown to be wrong.

The nice part about gzip is that it is generally available on virtually all platforms (or can be easily obtained). Other compression formats, like bz2, which sometimes produce denser output, are not necessarily available. Availability is something I would be worried about (clone and checkout failures).

Tar formats are also to be used carefully. Not all platform implementations of tar support all variants. "ustar" is fairly common but there are others that are not. Interoperability needs to be the biggest factor in this decision, IMHO, rather than compression rates.

The alternative is having git supply its own implementation, but that is a longer term migration problem, resembling the SHA-256 migration.

>
>> > That would probably break things, because gzip is GPLv3, and we'd
>> > need to ship a much older GPLv2 gzip, which would probably differ
>> > from the current behaviour, and might also have some security problems.
>>
>> Yup, security issues may make bit-for-bit-stability unrealistic.
>> IIRC, the last time we had discussion on this topic, we settled on
>> stability across the same version of Git (i.e. deterministic result)?

In the old days, it was export concerns. Fortunately, git never really hit those in a post-2007 timeframe. I would not bank on this issue staying off the table.

--Randall


* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution
  2023-02-02 21:19                     ` Joey Hess
@ 2023-02-03  4:02                       ` Theodore Ts'o
  2023-02-03 13:32                         ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 57+ messages in thread
From: Theodore Ts'o @ 2023-02-03  4:02 UTC (permalink / raw)
  To: Joey Hess; +Cc: Git List

On Thu, Feb 02, 2023 at 05:19:30PM -0400, Joey Hess wrote:
> In my opinion as the original developer of pristine-tar, it's too
> complicated to be usefully used by git. The problem it solves is of a
> larger scope than the problem git has here. (I hope.)

Well, the problem which I believe folks on this thread are trying to
deal with is a way to reconstruct a bit-for-bit compressed tarball of
a particular release in a way that minimizes the cost of storage in
the git tree.  One way of doing that would be to guarantee that git
archive would return something which is always bit-for-bit identical.
Another way is to use something like pristine tar.

I'll grant that pristine tar does solve a bit more of the problem than
what has been stated, since it allows the creator of the tarball to
remove some files, or add some auto-generated files (e.g., after
running autoreconf), and so in that way, pristine tar does solve a
somewhat larger problem than what was expressed in this thread.

That being said, however, pristine-tar is **extremely** useful, and
I'm very happy, and very thankful, that you wrote it.  It has been
super, super useful.

Cheers,

						- Ted

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH] archive: document output stability concerns
  2023-02-02 19:23     ` Raymond E. Pasco
@ 2023-02-03  8:06       ` Raymond E. Pasco
  0 siblings, 0 replies; 57+ messages in thread
From: Raymond E. Pasco @ 2023-02-03  8:06 UTC (permalink / raw)
  To: ray
  Cc: avarab, demerphq, eschwartz93, git, gitster, konstantin, l.s.r,
	msuchanek, phillip.wood, sandals, tytso

In 4f4be00d302 (archive-tar: use internal gzip by default), the 'git
archive' command switched to using an internal compression filter
implemented with zlib rather than invoking a 'gzip' binary, for the
'.tar.gz' / '.tgz' output formats.

This change brought to light a common misconception that the output of
'git archive' is intended to be byte-for-byte stable. While this is not
the case, stable archive output is desirable for many applications; we
discuss concerns related to output stability and suggest ways in which
the user can control the compression used with the
"tar.<format>.command" configuration option.

Signed-off-by: Raymond E. Pasco <ray@ameretat.dev>
---
I think that something along these lines should be included in the
docs, but that the behavior should be kept the same. If it is decided
later to stabilize output, e.g. by vendoring a blessed zlib version
forever, the current state as of 2.38 is the best starting point;
and reverting a useful change because of external breakage which
already has a solution, while also promising instability, seems like
a poor choice.

 Documentation/git-archive.txt | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/Documentation/git-archive.txt b/Documentation/git-archive.txt
index 60c040988b..77acdacdf8 100644
--- a/Documentation/git-archive.txt
+++ b/Documentation/git-archive.txt
@@ -178,6 +178,41 @@ appropriate export-ignore in its `.gitattributes`), adjust the checked out
 option.  Alternatively you can keep necessary attributes that should apply
 while archiving any tree in your `$GIT_DIR/info/attributes` file.
 
+[[STABILITY]]
+STABILITY
+---------
+
+'git archive' does not guarantee that precisely identical archive files
+will be produced for invocations on the same commit or tree.
+
+'git archive' uses an internal implementation of `tar` archiving
+for the `tar` format, which includes the commit ID in an extended
+pax header.  For the `tgz` and `tar.gz` formats, it is augmented with
+a compression filter applied to the output, which is implemented by
+'git archive' by linking to the system zlib.
+
+If the commit ID of the "same" commit is different, for instance in the
+case of an object format migration from SHA-1 to SHA-256, the `tar`
+archive will necessarily differ due to including a different ID.
+
+The output of the compression filter is less deterministic than
+the output of the `tar` implementation, because the versions
+of zlib used may differ. The internal compression filter can be
+replaced with a particular command specified by the user using the
+`tar.<format>.command` configuration option; for instance, a particular
+gzip binary provided by the user could be specified here for consistent
+output.
+
+The `tar` format used by 'git archive' is unlikely to change
+frequently, but is not guaranteed to be completely stable; its output
+will remain identical at least within the same Git version.
+
+The `zip` format has similar concerns to the `tar.gz` and `tgz`
+formats; ZIP archiving is implemented internally, but the Deflate
+compression used relies on the linked zlib. However, because archiving
+and compression are combined into a single operation, there is no
+user-specifiable filter command for the `zip` format.
+
 EXAMPLES
 --------
 `git archive --format=tar --prefix=junk/ HEAD | (cd /var/tmp/ && tar xf -)`::
-- 
2.39.1.561.g98d13ac3e7


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution
  2023-02-02 23:47               ` rsbecker
@ 2023-02-03 13:18                 ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 57+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-02-03 13:18 UTC (permalink / raw)
  To: rsbecker
  Cc: 'brian m. carlson', 'Junio C Hamano',
	'Konstantin Ryabitsev', 'Eli Schwartz',
	'Git List'


On Thu, Feb 02 2023, rsbecker@nexbridge.com wrote:

> On February 2, 2023 6:02 PM, brian m. carlson wrote:
>>On 2023-02-01 at 23:37:19, Junio C Hamano wrote:
>>> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
>>>
>>> > I don't think a blurb is necessary, but you're basically
>>> > underscoring the problem, which is that nobody is willing to promise
>>> > that compression is consistent, but yet people want to rely on that
>>> > fact.  I'm willing to write and implement a consistent tar spec and
>>> > to guarantee compatibility with that, but the tension here is that
>>> > people also want gzip to never change its byte format ever, which
>>> > frankly seems unrealistic without explicit guarantees.  Maybe the
>>> > authors will agree to promise that, but it seems unlikely.
>>>
>>> Just to step back a bit, where does the distinction between
>>> guaranteeing the tar format stability and gzip compressed bitstream
>>> stability come from?  At both levels, the same thing can be expressed
>>> in multiple different ways, I think, but spelling out how exactly the
>>> compressor compresses is more involved than spelling out how entries
>>> in a tar archive is ordered and each entry is expressed, or something?
>>
>>Yes, at least with my understanding about how gzip and compression in general
>>work.
>>
>>The tar format (and the pax format which builds on it) can mostly be restricted by
>>explaining what data is to be included in the pax and tar headers and how it is to be
>>formatted.  If we say, we will always write such and such information in the pax
>>header and sort the keys, and we write such and such information in the tar header,
>>then the format is completely deterministic, and we can make nice guarantees.
>>
>>My understanding about how Lempel-Ziv-based compression algorithms work is that
>>there's a lot more freedom to decide how best to compress things and that there
>>isn't always a logical obvious choice, but I will admit my understanding is relatively
>>limited.  If someone thinks we can effectively succeed in supporting compression
>>more than just relying on gzip, I would be delighted to be shown to be wrong.
>
> The nice part about gzip is that it is generally available on
> virtually all platforms (or can be easily obtained). Other compression
> forms, like bz2, which sometimes produces more dense compression, are
> not necessarily available. Availability is something I would be
> worried about...

I agree with all of that, gzip is in such wide use for a reason. 

>... (clone and checkout failures).

But how would a hypothetical obscure format for "git archive" contribute
to clone or checkout failures? Are you thinking of our use of zlib for
e.g. loose objects? That's unrelated to this discussion (and I don't
think anyone relies on their compressed checksum).

> Tar formats are also to be used carefully. Not all platform
> implementations of tar support all variants. "ustar" is fairly common
> but there are others that are not. Interoperability needs to be the
> biggest factor in this decision, IMHO, rather than compression rates.

For "git archive" whether you care about interoperability depends on the
target audience of your archive, and in any case I don't see why we need
to worry about it, except to perhaps note that some are more portable
than others if we e.g. had a built-in "tar.bz2" helper method.

> The alternative is having git supply its own implementation, but that
> is a longer term migration problem, resembling the SHA-256 migration.

I've noted elsewhere in this thread that I don't see the point of
shipping a fallback "gzip" beyond the "git archive gzip" we have
already, but even if we did that the scope of that seems pretty simple,
and *much* easier than the SHA-256 migration.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution
  2023-02-03  4:02                       ` Theodore Ts'o
@ 2023-02-03 13:32                         ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 57+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-02-03 13:32 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Joey Hess, Git List


On Thu, Feb 02 2023, Theodore Ts'o wrote:

> On Thu, Feb 02, 2023 at 05:19:30PM -0400, Joey Hess wrote:
>> In my opinion as the original developer of pristine-tar, it's too
>> complicated to be usefully used by git. The problem it solves is of a
>> larger scope than the problem git has here. (I hope.)
>
> Well, the problem which I believe folks on this thread are trying to
> deal with is a way to reconstruct a bit-for-bit compressed tarball of
> a particular release in a way that minimizes the cost of storage in
> the git tree.  One way of doing that would be to guarantee that git
> archive would return something which is always bit-for-bit identical.
> Another way is to use something like pristine tar.

I think that's what this side-thread has devolved into, but I honestly
don't see how that's useful or more than tangentially related to the
problem noted at the start of the thread.

If you are writing a new system that consumes "git archive" output
something like what I'm proposing to add in [1] should nicely sidestep
this issue, just checksum the uncompressed archive (assuming you're OK
with our soft "tar" guarantees), or "git tag -v" (if you can) etc.

That part of the docs is just a summary of what Konstantin Ryabitsev
pointed out in a side-thread.
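
As a rough illustration of the "checksum the uncompressed archive" idea
above, here is a minimal Python sketch. It is not taken from the proposed
series; the payload is a stand-in for real "git archive --format=tar"
output, and the helper name payload_sha256 is made up for this example.
It only shows that two gzip streams can differ byte-for-byte while the
data inside them hashes identically.

```python
import gzip
import hashlib


def payload_sha256(gz_bytes: bytes) -> str:
    """Return the SHA-256 of the data inside a gzip stream, not of the stream itself."""
    return hashlib.sha256(gzip.decompress(gz_bytes)).hexdigest()


# Stand-in for the bytes of an uncompressed tar stream (an assumption for the demo).
tar_bytes = b"pretend this is a tar stream\n" * 1000

# Same payload, wrapped in gzip headers that differ only in the embedded timestamp.
first = gzip.compress(tar_bytes, mtime=0)
second = gzip.compress(tar_bytes, mtime=1)

compressed_differ = first != second
payload_matches = payload_sha256(first) == payload_sha256(second)
print(compressed_differ, payload_matches)
```

A consumer that records payload_sha256() instead of a checksum of the
.tar.gz file is insulated from compressor changes, though it still
depends on the tar stream itself staying the same.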

One might also imagine any other number of trivial solutions to the
problem, e.g. people interested in this can unpack the archive, and then
(needs to guarantee sorted order, which I think find(1) doesn't, but
just as a POC):

	(cd unpacked && find . -type f -printf "%f\n" -exec cat {} \; | sha256sum)

Or whatever.
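
A sorted variant of that POC might look like the following Python
sketch. This is only one possible scheme, not something git provides:
the function name tree_sha256, the length-prefixed framing and the
make_tree demo helper are all assumptions made for illustration. It
hashes regular files only, so file modes, symlinks and empty directories
are ignored.

```python
import hashlib
import tempfile
from pathlib import Path


def tree_sha256(root) -> str:
    """Hash every regular file under root in sorted relative-path order."""
    root = Path(root)
    digest = hashlib.sha256()
    # Sort by relative POSIX path so the result does not depend on readdir order.
    relative_paths = sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )
    for rel in relative_paths:
        name = rel.encode("utf-8")
        data = (root / rel).read_bytes()
        # Length-prefix each field so name/content boundaries are unambiguous.
        for chunk in (name, data):
            digest.update(len(chunk).to_bytes(8, "big"))
            digest.update(chunk)
    return digest.hexdigest()


def make_tree(files: dict) -> str:
    """Hypothetical demo helper: write files to a temp dir and return its digest."""
    with tempfile.TemporaryDirectory() as tmp:
        for rel, data in files.items():
            target = Path(tmp) / rel
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(data)
        return tree_sha256(tmp)


a = make_tree({"README": b"hi\n", "src/main.c": b"int main(void){return 0;}\n"})
b = make_tree({"src/main.c": b"int main(void){return 0;}\n", "README": b"hi\n"})
c = make_tree({"README": b"hi!\n", "src/main.c": b"int main(void){return 0;}\n"})
print(a == b, a != c)
```

Unlike the one-liner, this includes the path relative to the root rather
than just the basename, so two files with the same name in different
directories are told apart.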

But any such solution to the abstract problem isn't going to help the
existing users whose systems broke because they were assuming certain
things about the "git archive" output.

For those users I think (as my proposed series does) we should just do
whatever we can to limit the disruption, as my proposed [2] does by
switching back to "gzip".

For those users who are creating new systems that might use "git
archive" today we then just need to update the documentation going
forward. Maybe those could use "pristine-tar", or perhaps they can use
some entirely different distribution mechanism.

1. https://lore.kernel.org/git/patch-9.9-b40833b2168-20230202T093212Z-avarab@gmail.com/
2. https://lore.kernel.org/git/cover-0.9-00000000000-20230202T093212Z-avarab@gmail.com/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 0/9] git archive: use gzip again by default, document output stabilty
  2023-02-02 16:17     ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Phillip Wood
  2023-02-02 16:40       ` Junio C Hamano
@ 2023-02-03 13:49       ` Ævar Arnfjörð Bjarmason
  2023-02-06 14:46         ` Phillip Wood
  2023-02-03 15:47       ` Theodore Ts'o
  2 siblings, 1 reply; 57+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-02-03 13:49 UTC (permalink / raw)
  To: phillip.wood
  Cc: git, Junio C Hamano, Eli Schwartz, René Scharfe,
	brian m . carlson, Konstantin Ryabitsev, Michal Suchánek,
	Raymond E . Pasco, demerphq, Theodore Ts'o


On Thu, Feb 02 2023, Phillip Wood wrote:

> On 02/02/2023 09:32, Ævar Arnfjörð Bjarmason wrote:
>> As reported in
>> https://lore.kernel.org/git/a812a664-67ea-c0ba-599f-cb79e2d96694@gmail.com/
>> changing the default "tgz" output method from "gzip(1)" to our
>> internal "git archive gzip" (using zlib) broke things for users in
>> the wild that assume that the "git archive" output is stable, most
>> notably GitHub: https://github.com/orgs/community/discussions/45830
>>
>> Leaving aside the larger question of whether we're going to promise
>> output stability for "git archive" in general, the motivation for that
>> change was to have a working compression method on systems that lacked
>> a gzip(1).
>
> As I recall the reduction in cpu time used to create a compressed
> archive was a factor in making it the default.

I read those references in 76d7602631a (archive-tar: add internal gzip
implementation, 2022-06-15) as more of an "it's not [much] slower"; the
flip to the default in 4f4be00d302 (archive-tar: use internal gzip by
default, 2022-06-15) didn't discuss it.

So I didn't think it was important enough to mention (even though we're
now back to the faster "gzip" method).

>> As the disruption of changing the default isn't worth it, let's use
>> gzip(1) again by default, and only fall back on the new "git archive
>> gzip" if it isn't available.
>
> Playing devil's advocate for a moment as we're not going to promise
> that the compressed output of "git archive" will be stable in the
> future perhaps we should use this breakage as an opportunity to
> highlight that to users and to advertize the config setting that
> allows them to use gzip for compressing archives.

If we were trying to intentionally break things for those users we could
do a lot better than "git archive gzip", whose output is mostly the same
as "gzip", we could tweak one of the headers to make it different all
the time.

But I think it's better to advocate for such intentional chaos-monkeying
as a follow-up to this more conservative "oops, we broke stuff, it's
easy not to break it, so let's not do it".

> Reverting the change gives the misleading impression that we're making
> a commitment to keeping the output stable.

I don't see how you can conclude that from this series. It explicitly
states that we make no such promises, what it does is go back to
allowing the gzip(1) command to make its own promises.

> The focus of this thread seems to be the
> problems relating to github which they have already addressed.

Which they've addressed by reverting the change, but while they're a
major user of git they're not the only one. They just happened to use
"git archive".

I think it would be a mistake to conclude that everyone who's run into
this has already done so, or is aware of it.

> I think there is general agreement that it is not practical to promise
> that the compressed output of "git archive" is stable so maybe it is
> better[...]

...better than what? This seems to imply that this series is making new
promises about the output stability, which it isn't doing.

> [...]to make that clear now while users can work around it in the
> short term with a config setting rather than waiting until we're faced
> with some security or other issue that forces a change to the output
> which users cannot work around so easily.

I think it's always been clear that you can use that setting. For ages
we've been saying:

	The `tar.gz` and `tgz` formats are defined automatically and use the
	command `gzip -cn` by default.

Then v2.38.0 changed it to:

	[...]
        magic command `git archive gzip` by default

Which IMO was easily missed among other "Performance, Internal
Implementation, Development Support etc." items in the release notes,
which said:

   Teach "git archive" to (optionally and then by default) avoid
   spawning an external "gzip" process when creating ".tar.gz" (and
   ".tgz") archives.

But I agree that all of this is subjective. To me, a 2% reduction in CPU
use (at the cost of a ~20% increase in wallclock time) and the benefit
of teaching users that they can't rely on our "gzip" output both seem
unclear or hypothetical.

Whereas the widespread breakage reported is very real, and we should
consider GitHub as a canary for that, not the start and end of its
potential impact.

As we didn't have a strong reason to change this in the first place (and
as my series shows, we can have our cake & eat it too if we don't have a
"gzip") I think the obvious choice is to go back to using "gzip".

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 0/9] git archive: use gzip again by default, document output stabilty
  2023-02-02 16:17     ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Phillip Wood
  2023-02-02 16:40       ` Junio C Hamano
  2023-02-03 13:49       ` Ævar Arnfjörð Bjarmason
@ 2023-02-03 15:47       ` Theodore Ts'o
  2 siblings, 0 replies; 57+ messages in thread
From: Theodore Ts'o @ 2023-02-03 15:47 UTC (permalink / raw)
  To: phillip.wood
  Cc: Ævar Arnfjörð Bjarmason, git, Junio C Hamano,
	Eli Schwartz, René Scharfe, brian m . carlson,
	Konstantin Ryabitsev, Michal Suchánek, Raymond E . Pasco,
	demerphq

On Thu, Feb 02, 2023 at 04:17:09PM +0000, Phillip Wood wrote:
> Playing devil's advocate for a moment as we're not going to promise that the
> compressed output of "git archive" will be stable in the future perhaps we
> should use this breakage as an opportunity to highlight that to users and to
> advertize the config setting that allows them to use gzip for compressing
> archives. Reverting the change gives the misleading impression that we're
> making a commitment to keeping the output stable. The focus of this thread
> seems to be the problems relating to github which they have already
> addressed.
> 
> I think there is general agreement that it is not practical to promise that
> the compressed output of "git archive" is stable so maybe it is better to
> make that clear now while users can work around it in the short term with a
> config setting rather than waiting until we're faced with some security or
> other issue that forces a change to the output which users cannot work
> around so easily.

I would be in favor of adding a config option that allows using the
internal gzip option, while leaving the default as it was to keep
things compatible.

The reason is that it should be easy for a forge provider such as
GitHub to break things, deliberately.  Sound insane?  Hear me out.

At $WORK, we have a highly reliable system, Paxos.  It is a highly
fault-tolerant system, so it rarely fails.  But "rarely fails" is not
the same as "never fails".  And hopefully, things should degrade
gracefully if there is a Paxos outage.  But as the Google SREs are
fond of saying, "Hope is not a strategy".

So periodically, the people who run the Paxos service will
deliberately force downtime for a short amount of time.  The fact that
they will do this is well advertised, and scheduled ahead of time ---
and teams responsible for user-facing services are supposed to make
sure that end-users don't notice when this happens.  Maybe they won't
be able to update configurations as easily while Paxos is down, but it
shouldn't cause a user-visible outage.

So what I would recommend to the GitHub product manager, is that once
a quarter, on a well-advertised date, that they flip the switch and
break the git archive checksums for say, an hour.  Then next quarter,
they advertise that the switch will be thrown for 2 hours, doubling
each time, until it is ramped up to 16 hours.

This will provide the necessary nudge so that all of these badly
designed systems that depend on archives downloaded from arbitrary git
hubs being stable will rethink their position, while minimizing the
end-user customer impact.  Otherwise, I predict that Bazel, homebrew,
etc. will continue to rely on this ill-considered assumption, and at
some point in the future, when we *do* have a much better reason to
want to make a change to the tar or compression algorithm, all of
these end users will once again scream bloody murder.

Of course, this is going to be up to each forge provider to decide
whether they want to do this.  But we can make it easy for them to do
this thing, and I'd argue it is in our interest to make it easy for
them to do this.  Otherwise we'll get constrained in the future by the
fear of massive user blowback, no matter what we say in our
documentation regarding "no promises --- and next time, we really
mean it!"

	      	       	       	    	  - Ted

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 9/9] git archive docs: document output non-stability
  2023-02-02 16:34         ` Junio C Hamano
@ 2023-02-04 17:46           ` brian m. carlson
  0 siblings, 0 replies; 57+ messages in thread
From: brian m. carlson @ 2023-02-04 17:46 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Ævar Arnfjörð Bjarmason, git, Eli Schwartz,
	René Scharfe, Konstantin Ryabitsev, Michal Suchánek,
	Raymond E . Pasco, demerphq, Theodore Ts'o

[-- Attachment #1: Type: text/plain, Size: 688 bytes --]

On 2023-02-02 at 16:34:44, Junio C Hamano wrote:
> There shouldn't be cross platform differences to break bit-for-bit
> stability at least for "tar" format, as we do not rely on any
> external library.  Can we say the same for "zip"?  I thought we
> throw the blob at git_deflate_*() so the exact bitstream is up to
> the libz implementation?

That's also true.  There, we can't use gzip, so we do whatever libz
does.  For Zip, I believe we embed a local timestamp, so the output is
also dependent on the time zone.  I don't know enough about the Zip
format to say if there are any other things that may vary.
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 0/9] git archive: use gzip again by default, document output stabilty
  2023-02-02 16:25     ` Junio C Hamano
@ 2023-02-04 18:08       ` René Scharfe
  2023-02-05 21:30         ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 57+ messages in thread
From: René Scharfe @ 2023-02-04 18:08 UTC (permalink / raw)
  To: Junio C Hamano, Ævar Arnfjörð Bjarmason
  Cc: git, Eli Schwartz, brian m . carlson, Konstantin Ryabitsev,
	Michal Suchánek, Raymond E . Pasco, demerphq,
	Theodore Ts'o

Am 02.02.23 um 17:25 schrieb Junio C Hamano:
> Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:
>
>> As the disruption of changing the default isn't worth it, let's use
>> gzip(1) again by default, and only fall back on the new "git archive
>> gzip" if it isn't available.
>
> It perhaps is OK, and lets us answer "ugh, the compressed output of
> 'git archive' is unstable again" with "we didn't change anything,
> perhaps you changed your gzip(1)?" when they fix bugs or improve
> compression or whatever.  Of course that is not an overall win for
> the end users, but in the short term until gzip gets such a change,
> we would presumably get the "same" output as before.

Restoring the old default is an understandable reflex.  In theory it
worsens consistency and stability of the output, but in practice using
whatever was found in $PATH did work before -- or at least it was not
our problem if it didn't.

Are there still people left that would benefit from such a step back,
however?  As far as I understand forges like GitHub relied on git
archive producing the same tgz output across versions.  That assumption
was violated, trust lost.  They had to learn about the configuration
option tar.tgz.command or find some other way to cope.  Changing the
default again won't undo that.

René


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 0/9] git archive: use gzip again by default, document output stabilty
  2023-02-04 18:08       ` René Scharfe
@ 2023-02-05 21:30         ` Ævar Arnfjörð Bjarmason
  2023-02-12 17:41           ` René Scharfe
  0 siblings, 1 reply; 57+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-02-05 21:30 UTC (permalink / raw)
  To: René Scharfe
  Cc: Junio C Hamano, git, Eli Schwartz, brian m . carlson,
	Konstantin Ryabitsev, Michal Suchánek, Raymond E . Pasco,
	demerphq, Theodore Ts'o


On Sat, Feb 04 2023, René Scharfe wrote:

> Am 02.02.23 um 17:25 schrieb Junio C Hamano:
>> Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:
>>
>>> As the disruption of changing the default isn't worth it, let's use
>>> gzip(1) again by default, and only fall back on the new "git archive
>>> gzip" if it isn't available.
>>
>> It perhaps is OK, and lets us answer "ugh, the compressed output of
>> 'git archive' is unstable again" with "we didn't change anything,
>> perhaps you changed your gzip(1)?" when they fix bugs or improve
>> compression or whatever.  Of course that is not an overall win for
>> the end users, but in the short term until gzip gets such a change,
>> we would presumably get the "same" output as before.
>
> Restoring the old default is an understandable reflex.  In theory it
> worsens consistency and stability of the output, but in practice using
> whatever was found in $PATH did work before -- or at least it was not
> our problem if it didn't.

"In theory" because the user might be flip-flopping between different
gzip(1) versions?

> Are there still people left that would benefit from such a step back,
> however?  As far as I understand forges like GitHub relied on git
> archive producing the same tgz output across versions.  That assumption
> was violated, trust lost.  They had to learn about the configuration
> option tar.tgz.command or find some other way to cope.  Changing the
> default again won't undo that.

I think it's safe to assume that git is used by enough users that
anything breaking at a major hosting provider is likely to have a very
long tail in the wild, almost all of which we'll never see in "this
broke for me" reports to this ML.

So no, that ship has clearly sailed for GitHub, but this series aims to
address more than that.

Even if it wasn't for that breakage, I think 4/9 and 6/9 here show the
main problem you were trying to solve in making "git archive gzip" the
default didn't need to be solved by changing the default. I.e. the aim
was to have it work when "gzip(1)" wasn't available, which we can do by
falling back only if we can't invoke it, rather than changing the
long-standing default.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 0/9] git archive: use gzip again by default, document output stabilty
  2023-02-03 13:49       ` Ævar Arnfjörð Bjarmason
@ 2023-02-06 14:46         ` Phillip Wood
  0 siblings, 0 replies; 57+ messages in thread
From: Phillip Wood @ 2023-02-06 14:46 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Junio C Hamano, Eli Schwartz, René Scharfe,
	brian m . carlson, Konstantin Ryabitsev, Michal Suchánek,
	Raymond E . Pasco, demerphq, Theodore Ts'o

On 03/02/2023 13:49, Ævar Arnfjörð Bjarmason wrote:
> 
> On Thu, Feb 02 2023, Phillip Wood wrote:
>> Reverting the change gives the misleading impression that we're making
>> a commitment to keeping the output stable.
> 
> I don't see how you can conclude that from this series. It explicitly
> states that we make no such promises, what it does is go back to
> allowing the gzip(1) command to make its own promises.

This series would not be happening if we were not reverting a change to 
the compressed output of 'git archive'. The documentation updates are 
very welcome but I think we're undermining the message that the 
compressed output can change by reverting that change.

>> The focus of this thread seems to be the
>> problems relating to github which they have already addressed.
> 
> Which they've addressed by reverting the change, but while they're a
> major user of git they're not the only one. They just happened to use
> "git archive".
> 
> I think it would be a mistake to conclude that everyone who's run into
> this has already done so, or is aware of it.

I've spent some time trying to find reports of problems caused by this
change and have not seen anything apart from the issue with GitHub.
Although it takes a while for new versions of git to get into Linux
distributions, if there is a widespread problem we normally hear about it
pretty quickly. This change has been in two releases now. If anyone does
have a problem there is an easy fix in the form of setting
tar.<format>.command.

>> I think there is general agreement that it is not practical to promise
>> that the compressed output of "git archive" is stable so maybe it is
>> better[...]
> 
> ...better than what? This seems to imply that this series is making new
> promises about the output stability, which it isn't doing.

It's better that people realize now that they cannot rely on the output
being stable, while they can safely work around the problem and work on
a proper fix, rather than waiting until the change in output is caused
by a security issue in gzip, which would mean the workaround is no
longer safe.

Best Wishes

Phillip

>> [...]to make that clear now while users can work around it in the
>> short term with a config setting rather than waiting until we're faced
>> with some security or other issue that forces a change to the output
>> which users cannot work around so easily.
> 
> I think it's always been clear that you can use that setting. For ages
> we've been saying:
> 
> 	The `tar.gz` and `tgz` formats are defined automatically and use the
> 	command `gzip -cn` by default.
> 
> Then v2.38.0 changed it to:
> 
> 	[...]
>          magic command `git archive gzip` by default
> 
> Which IMO was easily missed among other "Performance, Internal
> Implementation, Development Support etc." items in the release notes,
> which said:
> 
>     Teach "git archive" to (optionally and then by default) avoid
>     spawning an external "gzip" process when creating ".tar.gz" (and
>     ".tgz") archives.
> 
> But I agree that all of this is subjective. To me a 2% reduction in CPU
> use (at the cost of ~20% increase in wallclock) & some unclear benefits
> to teaching users that they can't rely on our "gzip" output seems
> unclear or hypothetical.
> 
> Whereas the widespread breakage reported is very real,

Where are the reports of widespread breakage outside of GitHub?

> and we should
> consider GitHub as a canary for that, not the start and end of its
> potential impact.
> 
> As we didn't have a strong reason to change this in the first place (and
> as my series shows, we can have our cake & eat it too if we don't have a
> "gzip") I think the obvious choice is to go back to using "gzip".

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 0/9] git archive: use gzip again by default, document output stabilty
  2023-02-05 21:30         ` Ævar Arnfjörð Bjarmason
@ 2023-02-12 17:41           ` René Scharfe
  0 siblings, 0 replies; 57+ messages in thread
From: René Scharfe @ 2023-02-12 17:41 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Junio C Hamano, git, Eli Schwartz, brian m . carlson,
	Konstantin Ryabitsev, Michal Suchánek, Raymond E . Pasco,
	demerphq, Theodore Ts'o

Am 05.02.23 um 22:30 schrieb Ævar Arnfjörð Bjarmason:
>
> On Sat, Feb 04 2023, René Scharfe wrote:
>
>> Am 02.02.23 um 17:25 schrieb Junio C Hamano:
>>> Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:
>>>
>>>> As the disruption of changing the default isn't worth it, let's use
>>>> gzip(1) again by default, and only fall back on the new "git archive
>>>> gzip" if it isn't available.
>>>
>>> It perhaps is OK, and lets us answer "ugh, the compressed output of
>>> 'git archive' is unstable again" with "we didn't change anything,
>>> perhaps you changed your gzip(1)?" when they fix bugs or improve
>>> compression or whatever.  Of course that is not an overall win for
>>> the end users, but in the short term until gzip gets such a change,
>>> we would presumably get the "same" output as before.
>>
>> Restoring the old default is an understandable reflex.  In theory it
>> worsens consistency and stability of the output, but in practice using
>> whatever was found in $PATH did work before -- or at least it was not
>> our problem if it didn't.
>
> "In theory" because the user might be flip-flopping between different
> gzip(1) versions?

No flopping needed.  We can't control what's in $PATH.  There are
OS-specific replacements for GNU gzip in NetBSD/FreeBSD/macOS and
OpenBSD.  People could use pigz.  Or cat, for that matter.  Different
versions of different tools might produce different output.
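
To illustrate the point about differing output, here is a minimal Python
sketch (not anything git itself does). It models "different tools" as two
compression levels of Python's gzip module, which is an assumption made
purely for illustration; the payload is a made-up stand-in for the
uncompressed tar stream. The compressed bytes differ while the
decompressed content is identical, so a checksum taken over the
uncompressed stream does not depend on the compressor.

```python
import gzip
import hashlib


def sha256(data: bytes) -> str:
    """Return the hex SHA-256 digest of data."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical stand-in for the uncompressed tar stream of an archive.
payload = b"".join(
    b"file-%d: %s\n" % (i, hashlib.sha256(b"%d" % i).hexdigest().encode())
    for i in range(500)
)

# Two "different compressors", modelled as two compression levels.
# mtime=0 keeps the timestamp out of the header, similar in spirit to
# the -n flag in "gzip -cn".
fast = gzip.compress(payload, compresslevel=1, mtime=0)
best = gzip.compress(payload, compresslevel=9, mtime=0)

# The compressed bytes (and so their checksums) are not the same...
compressed_differ = sha256(fast) != sha256(best)

# ...but both decompress to exactly the same content.
payloads_match = (
    sha256(gzip.decompress(fast))
    == sha256(gzip.decompress(best))
    == sha256(payload)
)

print("compressed checksums differ:", compressed_differ)
print("uncompressed checksums match:", payloads_match)
```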

There are alternatives to the original libz as well, e.g. zlib-ng.  We
don't control which one or which version is installed, either, but we
could do so if we wanted by importing one of them like we did with
LibXDiff.

> Even if it wasn't for that breakage, I think 4/9 and 6/9 here show the
> main problem you were trying to solve in making "git archive gzip" the
> default didn't need to be solved by changing the default. I.e. the aim
> was to have it work when "gzip(1)" wasn't available, which we can do by
> falling back only if we can't invoke it, rather than changing the
> long-standing default.

The aim was to no longer depend on gzip.  That goal was already met by
providing the internal implementation, without changing the default.
Git for Windows for example could use it in their config and drop gzip.

Calling gzip if available, warning if it isn't and using the internal
implementation adds yet more variance.  No longer allowing gzip to be a
shell alias might confuse someone.  The automatic fallback would only
benefit users that don't want to touch /etc/gitconfig, have nobody to
do it for them and don't care about warnings -- hopefully not a big
crowd.
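
The "call gzip if available, otherwise fall back" behaviour discussed
above can be sketched roughly as follows. This is a Python illustration
under stated assumptions, not the code from the patch series: the
function name gzip_with_fallback and the backend labels are hypothetical,
and Python's zlib-based gzip module stands in for the internal
implementation. It shows where the extra variance comes from: which
backend produced the bytes depends on what happens to be in $PATH.

```python
import gzip
import shutil
import subprocess
from typing import Tuple


def gzip_with_fallback(data: bytes) -> Tuple[bytes, str]:
    """Compress data with an external gzip if one can be invoked,
    otherwise fall back to the in-process zlib-based implementation.

    Returns the compressed bytes and a label naming the backend used.
    """
    exe = shutil.which("gzip")
    if exe is not None:
        try:
            # "-c" writes to stdout, "-n" omits name and timestamp.
            result = subprocess.run(
                [exe, "-cn"],
                input=data,
                stdout=subprocess.PIPE,
                stderr=subprocess.DEVNULL,
                check=True,
            )
            return result.stdout, "external gzip"
        except (OSError, subprocess.CalledProcessError):
            # Could not invoke it after all; use the fallback below.
            pass
    return gzip.compress(data, mtime=0), "internal zlib"


data = b"hello, archive\n" * 100
compressed, backend = gzip_with_fallback(data)
print("backend used:", backend)
```

Either backend yields a valid gzip stream, but nothing in the result says
which one ran, so two machines can silently disagree on the bytes.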

I didn't intend the change of default to be that painful, but don't see
the point in going back now that we're through.  The new default is
better -- one less dependency to care about.  If we do need to go
back, however, then a known-good state makes more sense than a smart
fallback with some new twists.

René

Thread overview: 57+ messages
2023-01-31  0:06 Stability of git-archive, breaking (?) the Github universe, and a possible solution Eli Schwartz
2023-01-31  7:49 ` Ævar Arnfjörð Bjarmason
2023-01-31  9:11   ` Eli Schwartz
2023-02-02  9:32   ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Ævar Arnfjörð Bjarmason
2023-02-02  9:32     ` [PATCH 1/9] archive & tar config docs: de-duplicate configuration section Ævar Arnfjörð Bjarmason
2023-02-02  9:32     ` [PATCH 2/9] git config docs: document "tar.<format>.{command,remote}" Ævar Arnfjörð Bjarmason
2023-02-02  9:32     ` [PATCH 3/9] archiver API: make the "flags" in "struct archiver" an enum Ævar Arnfjörð Bjarmason
2023-02-02  9:32     ` [PATCH 4/9] archive: omit the shell for built-in "command" filters Ævar Arnfjörð Bjarmason
2023-02-02  9:32     ` [PATCH 5/9] archive-tar.c: move internal gzip implementation to a function Ævar Arnfjörð Bjarmason
2023-02-02  9:32     ` [PATCH 6/9] archive: use "gzip -cn" for stability, not "git archive gzip" Ævar Arnfjörð Bjarmason
2023-02-02  9:32     ` [PATCH 7/9] test-lib.sh: add a lazy GZIP prerequisite Ævar Arnfjörð Bjarmason
2023-02-02  9:32     ` [PATCH 8/9] archive tests: test for "gzip -cn" and "git archive gzip" stability Ævar Arnfjörð Bjarmason
2023-02-02  9:32     ` [PATCH 9/9] git archive docs: document output non-stability Ævar Arnfjörð Bjarmason
2023-02-02 10:25       ` brian m. carlson
2023-02-02 10:30         ` Ævar Arnfjörð Bjarmason
2023-02-02 16:34         ` Junio C Hamano
2023-02-04 17:46           ` brian m. carlson
2023-02-02 16:17     ` [PATCH 0/9] git archive: use gzip again by default, document output stabilty Phillip Wood
2023-02-02 16:40       ` Junio C Hamano
2023-02-03 13:49       ` Ævar Arnfjörð Bjarmason
2023-02-06 14:46         ` Phillip Wood
2023-02-03 15:47       ` Theodore Ts'o
2023-02-02 16:25     ` Junio C Hamano
2023-02-04 18:08       ` René Scharfe
2023-02-05 21:30         ` Ævar Arnfjörð Bjarmason
2023-02-12 17:41           ` René Scharfe
2023-02-02 19:23     ` Raymond E. Pasco
2023-02-03  8:06       ` [PATCH] archive: document output stability concerns Raymond E. Pasco
2023-01-31  9:54 ` Stability of git-archive, breaking (?) the Github universe, and a possible solution brian m. carlson
2023-01-31 11:31   ` Ævar Arnfjörð Bjarmason
2023-01-31 15:05   ` Konstantin Ryabitsev
2023-01-31 22:32     ` brian m. carlson
2023-02-01  9:40       ` Ævar Arnfjörð Bjarmason
2023-02-01 11:34         ` demerphq
2023-02-01 12:21           ` Michal Suchánek
2023-02-01 12:48             ` demerphq
2023-02-01 13:43               ` Ævar Arnfjörð Bjarmason
2023-02-01 15:21                 ` demerphq
2023-02-01 18:56                   ` Theodore Ts'o
2023-02-02 21:19                     ` Joey Hess
2023-02-03  4:02                       ` Theodore Ts'o
2023-02-03 13:32                         ` Ævar Arnfjörð Bjarmason
2023-02-01 23:16         ` brian m. carlson
2023-02-01 23:37           ` Junio C Hamano
2023-02-02 23:01             ` brian m. carlson
2023-02-02 23:47               ` rsbecker
2023-02-03 13:18                 ` Ævar Arnfjörð Bjarmason
2023-02-02  0:42           ` Ævar Arnfjörð Bjarmason
2023-02-01 12:17       ` Raymond E. Pasco
2023-01-31 15:56   ` Eli Schwartz
2023-01-31 16:20     ` Konstantin Ryabitsev
2023-01-31 16:34       ` Eli Schwartz
2023-01-31 20:34         ` Konstantin Ryabitsev
2023-01-31 20:45         ` Michal Suchánek
2023-02-01  1:33     ` brian m. carlson
2023-02-01 12:42   ` Ævar Arnfjörð Bjarmason
2023-02-01 23:18     ` brian m. carlson
