On 2023-01-31 at 00:06:44, Eli Schwartz wrote:
> Nevertheless, I've seen the sentiment a few times that git doesn't like
> committing to output stability of git-archive, because it isn't
> officially documented (but it's not entirely clear what the benefits of
> changing are). And yet, git endeavors to do so, in order to prevent
> unnecessary breakage of people who embody Hyrum's Law and need that
> stability.

I'm one of the GitHub employees who chimed in there, and I'm also a Git
contributor in my own time (and I am speaking here only in my personal
capacity, since this is a personal address).

I made a change some years back to the archive format to fix the
permissions on pax headers when extracted as files, and kernel.org was
relying on that and broke. Linus yelled at me because of that.

Since then, I've been very opposed to us guaranteeing output format
consistency without explicitly doing so. I had sent some patches before
that I don't think ever got picked up that documented this explicitly.
I very much don't want people to come to rely on our behaviour unless we
explicitly guarantee it.

> What does everyone think about offering versioned git-archive outputs?
> This could be user-selectable as an option to `git archive`, but the
> main goal would be to select a good versioned output format depending on
> what is being archived.
> So:
>
> - first things first, un-default the internal compressor again
> - implement a v2 archive format, where the internal compressor is the
>   default -- no other changes
> - teach git to select an archive format based on the date of the object
>   being archived
>   - when given a commit/tag ID to archive, check which support frame the
>     committer date falls inside
>   - for tree IDs, always use the latest format (it always uses the
>     current date anyway)
> - schedule a date, for the sake of argument, 6 months after the next
>   scheduled release date of git version X.Y in which this change goes
>   live; bake this into the git sources as a transition date, all commits
>   or tags generated after this date fall into the next format support
>   frame

I am actually very much in favour of providing a standard, deterministic
version of pax (the extended tar format) that we use and documenting it
as a standard so that other archive tools can use that. That is, we
document some canonical tar format that is bit-for-bit identical that we
(and hopefully GNU tar and libarchive) will agree should be used to
serialize files for software interchange.

I don't think this should be dependent on the date at all, but I do
believe it should be versioned and tested, and the version number
embedded as a pax header.

I think this would be valuable for simply having reproducible archives
in general, including for things like Docker containers, Debian
packages, Rust crates, and more, and I'm happy to work with others on
such a format, as I've said in the past on the list. People can opt in
to whatever format they want when creating an archive and continue to
use that forever if they like.

Part of the reason I think this is valuable is that once SHA-1 and
SHA-256 interoperability is present, git archive will change the
contents of the archive format, since it will embed a SHA-256 hash into
the file instead of a SHA-1 hash, since that's what's in the repository.
Thus, we can't produce an archive that's deterministic in the face of
SHA-1/SHA-256 interoperability concerns, and we need to create a new
format that doesn't contain that data embedded in it.

Having said that, I don't think this should be based on the timestamp of
the file, since that means that two otherwise identical archives
differing in timestamp aren't ever going to be the same, and we do see
people who import or vendor other projects.

Nor do I think we should attempt to provide consistent compression,
since I believe the output of things like zlib has changed in the past,
and we can't continually carry an old, potentially insecure version of
zlib just because the output changed. People should be able to
implement compression using gzip, zlib, pigz, miniz_oxide, or whatever
if they want, since people implement Git in many different languages,
and we won't want to force people using memory-safe languages like Go
and Rust to explicitly use zlib for archives.

That may mean that it's important for people to actually decompress the
archive before checking hashes if they want deterministic behaviour, and
I'm okay with that. You already have to do that if you're verifying the
signature on Git tarballs, since only the uncompressed tar archive is
signed, so I don't think this is out of the question.
--
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA
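
To illustrate the point about decompressing before checking hashes, here
is a minimal Python sketch. It is not how Git or any existing tool does
it: the helper names build_tar and uncompressed_sha256 are hypothetical,
and the tiny in-memory tar with fixed metadata only stands in for a real
archive. It compresses the same tar bytes at two gzip levels and hashes
the decompressed stream, which does not depend on the compressor used.

```python
import gzip
import hashlib
import io
import tarfile


def build_tar(files):
    """Build a small uncompressed pax tar in memory (illustrative stand-in)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
        # Sort names and pin the timestamp so the tar bytes are repeatable.
        for name, data in sorted(files.items()):
            info = tarfile.TarInfo(name)
            info.size = len(data)
            info.mtime = 0
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def uncompressed_sha256(gz_bytes, chunk_size=65536):
    """Hash the decompressed stream rather than the compressed bytes."""
    digest = hashlib.sha256()
    with gzip.GzipFile(fileobj=io.BytesIO(gz_bytes)) as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


tar_bytes = build_tar({
    "README": b"hello\n",
    "src/main.c": b"int main(void) { return 0; }\n",
})

# Two different compressor settings; the compressed bytes may differ.
fast = gzip.compress(tar_bytes, compresslevel=1)
small = gzip.compress(tar_bytes, compresslevel=9)

# The hash of the decompressed tar is the same either way.
print(uncompressed_sha256(fast) == uncompressed_sha256(small))
```

A consumer that pins the hash of the uncompressed tar can then accept
output from gzip, pigz, or any other conforming implementation.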