* Fw: Curiosity
       [not found] <Wlh_w2gSCDQ2ieJnIY7TStWrzxbwP98SNRIFMTYpva7SRFipqk63HEYFVF7wFn1oSHOkQNsjWGOa5L49vyRlvSLbuZqpmvOaDOHmFkdt2zw=@protonmail.com>
@ 2021-12-15  3:52 ` João Victor Bonfim
  2021-12-15 18:07   ` Junio C Hamano
  0 siblings, 1 reply; 14+ messages in thread
From: João Victor Bonfim @ 2021-12-15  3:52 UTC (permalink / raw)
  To: git

I sent this message to Junio Hamano kind of forever ago, and since then I haven't been able to follow up on it (I am writing a report on Git to conclude my technician course and get my certification, yada yada yada, so I couldn't get to it). These days I have been reading Junio's responses on the git mailing list archive (https://marc.info/?l=git or rather https://marc.info/?a=118086005800002&r=1&w=4) from May until now to see if Junio said anything. Junio didn't, but I did read https://marc.info/?i=xmqqpmudng5x.fsf%20()%20gitster%20!%20g and kind of felt that it was targeted at me, or at people like me at least...

`:-)  - me sweating in exasperation.

Also since then, I may have improved on my confusing line of thought, so here is the past message and my current version so to speak:

------- Second attempt --------

Since Git is used for almost everything at this point, is there any intent to provide better support for non-textual file types? Why do I say this? Take this game mod which I follow as an example -> https://github.com/SolariusScorch/XComFiles <- whenever I clone it, Git takes forever to download 452 MB of files, some of which, from my perspective, aren't being delta compressed like the text files are (since, whenever I read a log of what changes were made, git creates and undoes modes for all binary files, some of which only changed by a pixel from one colour to another).

From my perspective it would be interesting to improve Git's effectiveness/performance for such files, since some projects are very heavy on multimedia that isn't hard coded, and those will eventually come around to using git. From a personal perspective: I intend to create an open source game and track it with git; however, I worry that it might take forever for users to clone the repo if a few versions of a single file of, perhaps, some gigabytes in size aren't stored and compressed efficiently and all the versions are instead stored in full, totalling some terabytes of storage for a few such files.

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
Wednesday, 27, May, 2021, 22:12, João Victor Bonfim <JoaoVictorBonfim@protonmail.com> wrote:

> I am assuming you are the Git maintainer, hence this message; otherwise, forgive me.
> Considering the ubiquity of Git as a versioning system and my own questions about the future of software development, especially game development, is there any intent to provide support for non-textual file types? What I mean is that binary files, from my perspective as a user, are tracked in full rather than partially; that is, the files are discarded and replaced if they are altered when, instead, the differences between versions could be tracked. Of course this would require several changes to Git so it can interpret images and so on, but I think it could be good for software development that makes extensive use of multimedia and, therefore, may need better tracking for such material.
>
> Do you understand where I want to get to?
>
> Graciously yours, João Victor Bonfim.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Fw: Curiosity
  2021-12-15  3:52 ` Fw: Curiosity João Victor Bonfim
@ 2021-12-15 18:07   ` Junio C Hamano
  2021-12-15 23:45     ` João Victor Bonfim
  2021-12-16  2:19     ` brian m. carlson
  0 siblings, 2 replies; 14+ messages in thread
From: Junio C Hamano @ 2021-12-15 18:07 UTC (permalink / raw)
  To: João Victor Bonfim; +Cc: git

João Victor Bonfim  <JoaoVictorBonfim@protonmail.com> writes:

> I sent this message to Junio Hamano kinda of forever ago, since
> then I haven't been able to address it or do anything about it
> really...

My spam filter has learned that anything that goes to the gitster@
address without being cc'ed to the git@vger list is to be caught, so
it is very plausible that I didn't see it.  Sending any inquiry here on
the list is the right thing to do, especially because it is likely
that I may not be the area expert for whatever you want to learn
about Git, while there are others who are more familiar with various
parts of the system and other ways the system is used.

You will also increase your chances to be read if you made your
message look more like the ones typically posted here (see the
archive), by wrapping overly long lines, etc.

> Since Git is almost used for everything at this point, is there
> any intent on providing better support for non textual file types?
> Why do I say this? Take this game mod which I follow as example ->
> https://github.com/SolariusScorch/XComFiles <- whenever I clone it
> Git takes a significant forever amount of time to download 452 MB
> of files whose some part, from my perspective, isn't being delta
> compressed like the text files are (since, whenever reading a log
> of what changes were made, git creates and undoes modes for all
> binary files, some of which only changed by a pixel from one
> colour to another).

Our delta compression does not care whether the contents are text or
binary, so if it is not compressed well, it can be a sign that
the contents are not compressible to begin with, at least with the
xdelta binary-diff-patch engine we use.  Improvement designs,
algorithms and patches are always welcome ;-)



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Fw: Curiosity
  2021-12-15 18:07   ` Junio C Hamano
@ 2021-12-15 23:45     ` João Victor Bonfim
  2021-12-16  2:19     ` brian m. carlson
  1 sibling, 0 replies; 14+ messages in thread
From: João Victor Bonfim @ 2021-12-15 23:45 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

> João Victor Bonfim JoaoVictorBonfim@protonmail.com writes:
>
> Our delta compression does not care whether the contents are text or
> binary, so if it is not compressed well, so it can be a sign that
> the contents are not compressible to begin with, at least with the
> xdelta binary-diff-patch engine we use. Improvement designs,
> algorithms and patches are always welcome ;-)

Gosh, I wish I could do anything about it.

I am but a mere code monkey, haven't done much writing practice either.

Maybe one day, but that is yet to be seen.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Fw: Curiosity
  2021-12-15 18:07   ` Junio C Hamano
  2021-12-15 23:45     ` João Victor Bonfim
@ 2021-12-16  2:19     ` brian m. carlson
  2021-12-16 21:20       ` João Victor Bonfim
  1 sibling, 1 reply; 14+ messages in thread
From: brian m. carlson @ 2021-12-16  2:19 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: João Victor Bonfim, git

On 2021-12-15 at 18:07:20, Junio C Hamano wrote:
> João Victor Bonfim  <JoaoVictorBonfim@protonmail.com> writes:
> > Since Git is almost used for everything at this point, is there
> > any intent on providing better support for non textual file types?
> > Why do I say this? Take this game mod which I follow as example ->
> > https://github.com/SolariusScorch/XComFiles <- whenever I clone it
> > Git takes a significant forever amount of time to download 452 MB
> > of files whose some part, from my perspective, isn't being delta
> > compressed like the text files are (since, whenever reading a log
> > of what changes were made, git creates and undoes modes for all
> > binary files, some of which only changed by a pixel from one
> > colour to another).
> 
> Our delta compression does not care whether the contents are text or
> binary, so if it is not compressed well, so it can be a sign that
> the contents are not compressible to begin with, at least with the
> xdelta binary-diff-patch engine we use.  Improvement designs,
> algorithms and patches are always welcome ;-)

To expand on this, if what you're storing is already compressed, like
Ogg Vorbis files or PNGs, like are found in that repository, then
generally they will not delta well.  This is also true of things like
Microsoft Office or OpenOffice documents, because they're essentially
Zip files.

The delta algorithm looks for similarities between files to compress
them.  If a file is already compressed using something like Deflate,
used in PNGs and Zip files, then even very similar files will generally
look very different, so deltification will generally be ineffective.
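
To make the point above concrete, here is a minimal sketch in Python. It
assumes a made-up "image" (pseudo-random bytes from a 16-value palette; the
helper names make_pixels, similarity and compare are hypothetical), changes
exactly one byte, and reports how many byte positions still agree before and
after zlib's Deflate. Position-by-position agreement is only a crude proxy
for what a delta algorithm can reuse, and the exact numbers depend on the
zlib build, so treat the output as an illustration rather than a measurement
of Git's behaviour.

```python
import random
import zlib


def make_pixels(n: int, seed: int = 0) -> bytes:
    """Return n pseudo-random bytes from a 16-value palette (stand-in for image data)."""
    rng = random.Random(seed)
    return bytes(rng.randrange(16) for _ in range(n))


def similarity(a: bytes, b: bytes) -> float:
    """Fraction of byte positions where a and b agree, relative to the longer input."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    same = sum(x == y for x, y in zip(a, b))
    return same / longest


def compare(n: int = 20000):
    """Change one byte and return (similarity of raw data, similarity of deflated data)."""
    original = make_pixels(n)
    edited = bytearray(original)
    edited[10] = (edited[10] + 1) % 16  # always differs from the old value
    edited = bytes(edited)
    raw_sim = similarity(original, edited)
    packed_sim = similarity(zlib.compress(original), zlib.compress(edited))
    return raw_sim, packed_sim


if __name__ == "__main__":
    raw_sim, packed_sim = compare()
    print(f"uncompressed similarity: {raw_sim:.5f}")
    print(f"deflated similarity:     {packed_sim:.5f}")
```

The uncompressed pair differs in a single position by construction; how far
the deflated pair diverges depends on where the edit lands and on the
compressor, which is why the effect is described as "generally" rather than
"always".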

There are two main solutions to this.  One is to store your data
uncompressed in the repository and compress it as part of a build step.
This makes your checkouts larger, but it makes your repository smaller.

The other is to store them outside of the repository proper.  Some folks
use Git LFS for this, but you could also just store a manifest with file
names and secure hashes, plus a download location for a public server.
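
The manifest idea could be sketched roughly as follows. This is only an
illustration under stated assumptions: the manifest layout (a "base_url"
plus a "files" mapping of relative path to SHA-256 hex digest) and the
function names are hypothetical, not something Git or Git LFS defines.
Only the small manifest would be committed; the assets themselves live on
the download server.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large assets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(asset_dir: Path, base_url: str) -> dict:
    """Map every file under asset_dir to its SHA-256 (hypothetical layout)."""
    files = {}
    for path in sorted(asset_dir.rglob("*")):
        if path.is_file():
            files[path.relative_to(asset_dir).as_posix()] = sha256_of(path)
    return {"base_url": base_url, "files": files}


def dump_manifest(manifest: dict) -> str:
    """Serialize with sorted keys so the committed text diffs cleanly."""
    return json.dumps(manifest, indent=2, sort_keys=True)


def verify(asset_dir: Path, manifest: dict) -> list:
    """Return names of files that are missing or whose hash does not match."""
    bad = []
    for rel, expected in manifest["files"].items():
        path = asset_dir / rel
        if not path.is_file() or sha256_of(path) != expected:
            bad.append(rel)
    return bad
```

A build step would download each listed file from base_url and call verify
afterwards; an empty list means the local assets match what the commit
recorded.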
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Fw: Curiosity
  2021-12-16  2:19     ` brian m. carlson
@ 2021-12-16 21:20       ` João Victor Bonfim
  2021-12-16 21:33         ` Martin Fick
  0 siblings, 1 reply; 14+ messages in thread
From: João Victor Bonfim @ 2021-12-16 21:20 UTC (permalink / raw)
  To: brian m. carlson; +Cc: Junio C Hamano, git

> To expand on this, if what you're storing is already compressed, like
> Ogg Vorbis files or PNGs, like are found in that repository, then
> generally they will not delta well. This is also true of things like
> Microsoft Office or OpenOffice documents, because they're essentially
> Zip files.
>
> The delta algorithm looks for similarities between files to compress
> them. If a file is already compressed using something like Deflate,
> used in PNGs and Zip files, then even very similar files will generally
> look very different, so deltification will generally be ineffective.

This also explains why Git opens a new mode every time an edit is made,
since it cannot recognize any similarities between the files, even
though there are some.

> There are two main solutions to this. One is to store your data
> uncompressed in the repository and compress it as part of a build step.
> This makes your checkouts larger, but it makes your repository smaller.
>
> The other is to store them outside of the repository proper. Some folks
> use Git LFS for this, but you could also just store a manifest with file
> names and secure hashes, plus a download location for a public server.

Maybe I am thinking too far outside the box, but wouldn't it be more
effective for git to identify compressed files, especially in edge cases
where the compression doesn't interact well with delta compression,
decompress them for repo storage while also storing the compression
algorithm as some metadata tag (like a text string or an ID code decided
beforehand), and, when creating the working copies, restore the
compression to its original state on checkout?

Of course you would also need reversing functions when you want to
checkout the info back to repo.

Just throwing ideas out there.

-------------------------------

João Victor Bonfim, any pronouns are welcome.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Fw: Curiosity
  2021-12-16 21:20       ` João Victor Bonfim
@ 2021-12-16 21:33         ` Martin Fick
  2021-12-16 21:42           ` Junio C Hamano
  2021-12-18  0:15           ` João Victor Bonfim
  0 siblings, 2 replies; 14+ messages in thread
From: Martin Fick @ 2021-12-16 21:33 UTC (permalink / raw)
  To: João Victor Bonfim; +Cc: brian m. carlson, Junio C Hamano, git

On 2021-12-16 14:20, João Victor Bonfim wrote:
>> To expand on this, if what you're storing is already compressed, like
>> Ogg Vorbis files or PNGs, like are found in that repository, then
>> generally they will not delta well. This is also true of things like
>> Microsoft Office or OpenOffice documents, because they're essentially
>> Zip files.
>> 
>> The delta algorithm looks for similarities between files to compress
>> them. If a file is already compressed using something like Deflate,
>> used in PNGs and Zip files, then even very similar files will 
>> generally
>> look very different, so deltification will generally be ineffective.
...
> Maybe I am thinking too outside the box, but wouldn't it be quite more
> effective for git to identify compressed files, specially on edge cases
> where the compression doesn't have a good chemistry with delta 
> compression,
> decompress them for repo storage while also storing the compression
> algorithm as some metadata tag (like a text string or an ID code 
> decided
> beforehand), and, when creating the work mirrors, return the 
> compression
> to its default state before checkout?

I suspect that for most algorithms and their implementations, this would
not result in repeatable "recompressed" results. Thus the checked-out
files might be different every time you checked them out. :(

-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code
Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Fw: Curiosity
  2021-12-16 21:33         ` Martin Fick
@ 2021-12-16 21:42           ` Junio C Hamano
  2021-12-18  0:17             ` João Victor Bonfim
  2021-12-18  0:15           ` João Victor Bonfim
  1 sibling, 1 reply; 14+ messages in thread
From: Junio C Hamano @ 2021-12-16 21:42 UTC (permalink / raw)
  To: Martin Fick; +Cc: João Victor Bonfim, brian m. carlson, git

Martin Fick <mfick@codeaurora.org> writes:

> On 2021-12-16 14:20, João Victor Bonfim wrote:
>>> To expand on this, if what you're storing is already compressed, like
>>> Ogg Vorbis files or PNGs, like are found in that repository, then
>>> generally they will not delta well. This is also true of things like
>>> Microsoft Office or OpenOffice documents, because they're essentially
>>> Zip files.
>>> The delta algorithm looks for similarities between files to
>>> compress
>>> them. If a file is already compressed using something like Deflate,
>>> used in PNGs and Zip files, then even very similar files will
>>> generally
>>> look very different, so deltification will generally be ineffective.
> ...
>> Maybe I am thinking too outside the box, but wouldn't it be quite more
>> effective for git to identify compressed files, specially on edge cases
>> where the compression doesn't have a good chemistry with delta
>> compression,
>> decompress them for repo storage while also storing the compression
>> algorithm as some metadata tag (like a text string or an ID code
>> decided
>> beforehand), and, when creating the work mirrors, return the
>> compression
>> to its default state before checkout?
>
> I suspect that for most algorithms and their implementations, this would
> not result in repeatable "recompressed" results. Thus the checked-out
> files might be different every time you checked them out. :(

That is probably too application specific to be in core-git, but it
is probably a good application for smudge/clean filters like brian
alluded to?
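
A smudge/clean pair for gzip files might look roughly like the sketch
below. The filter name "gzraw" and the script name gzfilter.py are
hypothetical; the filter.<name>.clean and filter.<name>.smudge config keys
and the .gitattributes "filter" attribute are Git's documented mechanism.
The clean side stores the decompressed payload in the repository, and the
smudge side recompresses it with a fixed level and a zeroed timestamp. Note
that this only restores the original .gz bytes if the original was produced
with the same compressor and settings, which is the concern Martin raised.

```python
import gzip
import sys

# Wiring (names are assumptions):
#   .gitattributes:   *.gz filter=gzraw
#   git config filter.gzraw.clean  "python3 gzfilter.py clean"
#   git config filter.gzraw.smudge "python3 gzfilter.py smudge"


def clean(blob: bytes) -> bytes:
    """Working tree -> repository: store the decompressed payload."""
    return gzip.decompress(blob)


def smudge(payload: bytes) -> bytes:
    """Repository -> working tree: recompress with fixed level and timestamp."""
    return gzip.compress(payload, compresslevel=9, mtime=0)


def run(mode: str) -> None:
    """Git feeds the file content on stdin and reads the result from stdout."""
    if mode not in ("clean", "smudge"):
        raise SystemExit("usage: gzfilter.py clean|smudge")
    data = sys.stdin.buffer.read()
    sys.stdout.buffer.write(clean(data) if mode == "clean" else smudge(data))


if __name__ == "__main__" and len(sys.argv) == 2:
    run(sys.argv[1])
```

With this in place the repository would hold plain, delta-friendly data,
at the price of checked-out .gz files that are equivalent to, but not
necessarily byte-identical with, what was originally added.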

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Fw: Curiosity
  2021-12-16 21:33         ` Martin Fick
  2021-12-16 21:42           ` Junio C Hamano
@ 2021-12-18  0:15           ` João Victor Bonfim
  2021-12-18  0:24             ` Junio C Hamano
                               ` (2 more replies)
  1 sibling, 3 replies; 14+ messages in thread
From: João Victor Bonfim @ 2021-12-18  0:15 UTC (permalink / raw)
  To: Martin Fick; +Cc: brian m. carlson, Junio C Hamano, git

> I suspect that for most algorithms and their implementations, this would
> not result in repeatable "recompressed" results. Thus the checked-out
> files might be different every time you checked them out. :(

How or why?

Sincere question.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Fw: Curiosity
  2021-12-16 21:42           ` Junio C Hamano
@ 2021-12-18  0:17             ` João Victor Bonfim
  0 siblings, 0 replies; 14+ messages in thread
From: João Victor Bonfim @ 2021-12-18  0:17 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Martin Fick, brian m. carlson, git

> That is probably too application specific to be in core-git, but it

Application specific in the sense that it is too much of an edge case to be useful to all git users?

> is probably a good application for smudge/clean filters like brian
>
> alluded to?

Perhaps.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Fw: Curiosity
  2021-12-18  0:15           ` João Victor Bonfim
@ 2021-12-18  0:24             ` Junio C Hamano
  2021-12-18  0:50               ` João Victor Bonfim
  2021-12-18  1:06             ` Martin Fick
  2021-12-18  1:34             ` brian m. carlson
  2 siblings, 1 reply; 14+ messages in thread
From: Junio C Hamano @ 2021-12-18  0:24 UTC (permalink / raw)
  To: João Victor Bonfim; +Cc: Martin Fick, brian m. carlson, git

João Victor Bonfim  <JoaoVictorBonfim@protonmail.com> writes:

>> I suspect that for most algorithms and their implementations, this would
>> not result in repeatable "recompressed" results. Thus the checked-out
>> files might be different every time you checked them out. :(
>
> How or why?
>
> Sincere question.

Two immediate things that come to my mind are lossy compression
algorithms (jpeg pictures?) and compressors that do not necessarily
produce bit-for-bit identical results (e.g. gzip by default embeds
timestamp unless explicitly told not to from a command line option).
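
The gzip timestamp effect can be illustrated with Python's gzip module (the
helper names gzip_at and header_mtime are made up for this sketch). The
gzip header stores a modification time in bytes 4 to 7, so the same input
compressed at two different times yields different files unless the
timestamp is pinned, which is what the command-line tool's -n option does.

```python
import gzip


def gzip_at(data: bytes, when: int) -> bytes:
    """Compress data, recording `when` (seconds since the epoch) in the header."""
    return gzip.compress(data, mtime=when)


def header_mtime(blob: bytes) -> int:
    """Read the MTIME field: bytes 4..7 of the gzip header, little-endian."""
    return int.from_bytes(blob[4:8], "little")


if __name__ == "__main__":
    data = b"same payload" * 100
    first = gzip_at(data, 1639700000)
    second = gzip_at(data, 1639700001)  # one second later
    print("identical files:   ", first == second)
    print("identical payloads:", gzip.decompress(first) == gzip.decompress(second))
    print("header timestamps: ", header_mtime(first), header_mtime(second))
```

Pinning the timestamp removes this one source of variation; it does nothing
about the others discussed in this thread, such as compressor version or
settings.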

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Fw: Curiosity
  2021-12-18  0:24             ` Junio C Hamano
@ 2021-12-18  0:50               ` João Victor Bonfim
  0 siblings, 0 replies; 14+ messages in thread
From: João Victor Bonfim @ 2021-12-18  0:50 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Martin Fick, brian m. carlson, git

Yeah, that sounds reasonable, Junio.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Fw: Curiosity
  2021-12-18  0:15           ` João Victor Bonfim
  2021-12-18  0:24             ` Junio C Hamano
@ 2021-12-18  1:06             ` Martin Fick
  2021-12-18  1:34             ` brian m. carlson
  2 siblings, 0 replies; 14+ messages in thread
From: Martin Fick @ 2021-12-18  1:06 UTC (permalink / raw)
  To: João Victor Bonfim; +Cc: brian m. carlson, Junio C Hamano, git

On 2021-12-17 17:15, João Victor Bonfim wrote:
>> I suspect that for most algorithms and their implementations, this would
>> not result in repeatable "recompressed" results. Thus the checked-out
>> files might be different every time you checked them out. :(
> 
> How or why?
> 

Here are some reasons I can think of (I am no expert):

1) Most compression formats are file formats, not exact algorithms, thus 
different program implementations of similar algorithms can create 
vastly different outputs.

2) The same program will evolve over time, get improvements, bug fixes, 
etc. so each version of the same program could vary over time even with 
the same settings. The same program version on different platforms could 
have different output.

3) Settings: compression programs have compression levels, perhaps
memory utilization parameters... The way the program measures these may
be non-deterministic and non-repeatable.

4) Threading. Some compression algorithms, such as git repack itself,
can use several threads to analyze the input data. And since the timing 
between different threads is not deterministic, when cooperating, they 
can have different results.

Much of this has to do with the idea that there is usually no such thing 
as "done" when it comes to compression. You can probably search 
infinitely to try and find more data patterns to compress the data more. 
Thus compression programs have to have limits based on heuristics (how 
far to look ahead/behind, how many patterns to remember...) programmed 
into them to come to an end somehow. How these limits are determined can 
sometimes be non deterministic, it may even involve system resources 
(how much RAM the machine has, how long it has run...) or system config.

I hope that helps,

-Martin


-- 
The Qualcomm Innovation Center, Inc. is a member of Code
Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Fw: Curiosity
  2021-12-18  0:15           ` João Victor Bonfim
  2021-12-18  0:24             ` Junio C Hamano
  2021-12-18  1:06             ` Martin Fick
@ 2021-12-18  1:34             ` brian m. carlson
  2021-12-18  1:40               ` João Victor Bonfim
  2 siblings, 1 reply; 14+ messages in thread
From: brian m. carlson @ 2021-12-18  1:34 UTC (permalink / raw)
  To: João Victor Bonfim; +Cc: Martin Fick, Junio C Hamano, git

On 2021-12-18 at 00:15:59, João Victor Bonfim wrote:
> > I suspect that for most algorithms and their implementations, this would
> > not result in repeatable "recompressed" results. Thus the checked-out
> > files might be different every time you checked them out. :(
> 
> How or why?
> 
> Sincere question.

A lossless compression algorithm has to produce an encoded value that,
when decoded, must produce the original input.  Ideally, it will also
reduce the file size of the original input.  Beyond that, there's a
great deal of freedom to implement that.

Just taking Deflate, which is used in zlib and gzip, as an example,
there are different compression settings that control the size of the
window to use that affect compression speed, quality of compression
(resulting size), and memory usage.  One might prefer using gzip -1 to
get better performance or use less memory, or gzip -9 to reduce the file
size as much as possible.
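
A minimal sketch of this, assuming Python's standard zlib module and a
made-up sample input: the same data is compressed at level 1 and at
level 9. Both streams decode back to the identical input, yet the
compressed bytes themselves differ, so knowing only the decompressed
content is not enough to recreate a particular compressed file.

```python
import zlib

# Made-up sample input; repetitive text so that compression has work to do.
data = b"the quick brown fox jumps over the lazy dog\n" * 1000

fast = zlib.compress(data, 1)  # level 1: favour speed
best = zlib.compress(data, 9)  # level 9: favour small output

# Lossless: both streams must decode to exactly the original input.
fast_ok = zlib.decompress(fast) == data
best_ok = zlib.decompress(best) == data

print("input bytes:       ", len(data))
print("level 1 bytes:     ", len(fast))
print("level 9 bytes:     ", len(best))
print("identical streams: ", fast == best)
print("both round-trip:   ", fast_ok and best_ok)
```

The exact sizes depend on the zlib build in use, so none are quoted
here.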

Even when the same settings are used, the technique used can vary
between versions of the software.  For example, GitHub effectively uses
git archive to generate archives, and one time when they upgraded their
servers, the compression changed in the tarballs and zip files, and
everybody who was relying on the archives being bit-for-bit identical[0]
had a problem.

So it would be nearly impossible to produce bit-for-bit repeatable
results without specifying a specific, hard-coded implementation, and
even in that case, the behavior might need to change for security
reasons, so it would end up being difficult to achieve.

[0] Neither Git nor GitHub provides this guarantee, so please do not
make this mistake.  If you need a fixed bit-for-bit tarball, save it as
a release artifact.
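
One way to act on this, sketched under assumptions: keep the exact
tarball bytes that were published and record a SHA-256 digest next to
them, so that later copies are compared against the stored file rather
than against a regenerated archive. The file name and contents below are
hypothetical stand-ins; only hashlib, tempfile and pathlib from the
Python standard library are used.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Stand-in for a real release tarball (hypothetical name and contents).
with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "project-1.0.tar.gz"
    artifact.write_bytes(b"stand-in for real tarball bytes")

    recorded = sha256_of(artifact)             # publish this with the file
    matches = sha256_of(artifact) == recorded  # what a later check would do

print("recorded digest:", recorded)
print("still matches:  ", matches)
```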
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA

* Re: Fw: Curiosity
  2021-12-18  1:34             ` brian m. carlson
@ 2021-12-18  1:40               ` João Victor Bonfim
  0 siblings, 0 replies; 14+ messages in thread
From: João Victor Bonfim @ 2021-12-18  1:40 UTC (permalink / raw)
  To: brian m. carlson; +Cc: Martin Fick, Junio C Hamano, git

How does one make a release artifact?
o-o

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐

On Friday, 17 December 2021 at 22:34, brian m. carlson <sandals@crustytoothpaste.net> wrote:

> On 2021-12-18 at 00:15:59, João Victor Bonfim wrote:
>
> > > I suspect that for most algorithms and their implementations, this would
> > > not result in repeatable "recompressed" results. Thus the checked-out
> > > files might be different every time you checked them out. :(
> >
> > How or why?
> >
> > Sincere question.
>
> A lossless compression algorithm has to produce an encoded value that,
> when decoded, must produce the original input. Ideally, it will also
> reduce the file size of the original input. Beyond that, there's a
> great deal of freedom to implement that.
>
> Just taking Deflate, which is used in zlib and gzip, as an example,
> there are different compression settings that control the size of the
> window to use that affect compression speed, quality of compression
> (resulting size), and memory usage. One might prefer using gzip -1 to
> get better performance or use less memory, or gzip -9 to reduce the file
> size as much as possible.
>
> Even when the same settings are used, the technique used can vary
> between versions of the software. For example, GitHub effectively uses
> git archive to generate archives, and one time when they upgraded their
> servers, the compression changed in the tarballs and zip files, and
> everybody who was relying on the archives being bit-for-bit identical[0]
> had a problem.
>
> So it would be nearly impossible to produce bit-for-bit repeatable
> results without specifying a specific, hard-coded implementation, and
> even in that case, the behavior might need to change for security
> reasons, so it would end up being difficult to achieve.
>
> [0] Neither Git nor GitHub provides this guarantee, so please do not
> make this mistake. If you need a fixed bit-for-bit tarball, save it as
> a release artifact.
> --
> brian m. carlson (he/him or they/them)
> Toronto, Ontario, CA

Thread overview: 14+ messages
     [not found] <Wlh_w2gSCDQ2ieJnIY7TStWrzxbwP98SNRIFMTYpva7SRFipqk63HEYFVF7wFn1oSHOkQNsjWGOa5L49vyRlvSLbuZqpmvOaDOHmFkdt2zw=@protonmail.com>
2021-12-15  3:52 ` Fw: Curiosity João Victor Bonfim
2021-12-15 18:07   ` Junio C Hamano
2021-12-15 23:45     ` João Victor Bonfim
2021-12-16  2:19     ` brian m. carlson
2021-12-16 21:20       ` João Victor Bonfim
2021-12-16 21:33         ` Martin Fick
2021-12-16 21:42           ` Junio C Hamano
2021-12-18  0:17             ` João Victor Bonfim
2021-12-18  0:15           ` João Victor Bonfim
2021-12-18  0:24             ` Junio C Hamano
2021-12-18  0:50               ` João Victor Bonfim
2021-12-18  1:06             ` Martin Fick
2021-12-18  1:34             ` brian m. carlson
2021-12-18  1:40               ` João Victor Bonfim
