* Is there some script to find un-delta-able objects?
From: Ævar Arnfjörð Bjarmason @ 2018-10-05 14:20 UTC (permalink / raw)
To: Git List
I.e. something to generate the .gitattributes file using this format:
https://git-scm.com/docs/gitattributes#_packing_objects
Some stuff is obvious, like "*.gpg binary -delta", but I'm wondering if
there's some repo scanner utility to spew this out for a given repo.
* Re: Is there some script to find un-delta-able objects?
From: Jeff King @ 2018-10-05 16:19 UTC (permalink / raw)
To: Ævar Arnfjörð Bjarmason; +Cc: Git List
On Fri, Oct 05, 2018 at 04:20:27PM +0200, Ævar Arnfjörð Bjarmason wrote:
> I.e. something to generate the .gitattributes file using this format:
>
> https://git-scm.com/docs/gitattributes#_packing_objects
>
> Some stuff is obvious, like "*.gpg binary -delta", but I'm wondering if
> there's some repo scanner utility to spew this out for a given repo.
I'm not sure what you mean by "un-delta-able" objects. Do you mean ones
where we're not likely to find a delta? Or ones where Git will not try
to look for a delta?
If the latter, I think the only rules are the "-delta" attribute and the
object size. You should be able to use "git check-attr" and "git cat-file"
to get that info.
If the former, I don't know how you would know. We can only report on
what isn't a delta _yet_.
-Peff
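A minimal sketch of the "git check-attr" half of that suggestion, for the case where the question is which paths Git will not try to delta. It assumes the usual "<path>: <attribute>: <info>" output of check-attr; the function name no_delta_paths is made up for illustration. The size rule is, as far as I know, governed by core.bigFileThreshold (512 MiB by default), which this sketch does not check.

```shell
#!/bin/sh
# Hypothetical helper: read "git check-attr delta" output on stdin and
# print only the paths whose "delta" attribute is explicitly unset
# (i.e. paths matched by a "-delta" rule in .gitattributes).
no_delta_paths() {
    # Strip the trailing ": delta: unset" and print only the lines where
    # that substitution happened; all other lines are dropped.
    sed -n 's/: delta: unset$//p'
}

# Possible usage inside a repository (not run here):
#   git ls-files | git check-attr --stdin delta | no_delta_paths
```

Paths with unusual characters may be quoted by "git ls-files", so treat the output as a rough listing rather than something to parse further.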
* Re: Is there some script to find un-delta-able objects?
From: Ævar Arnfjörð Bjarmason @ 2018-10-05 16:44 UTC (permalink / raw)
To: Jeff King; +Cc: Git List, Michael Haggerty
On Fri, Oct 05 2018, Jeff King wrote:
> On Fri, Oct 05, 2018 at 04:20:27PM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> I.e. something to generate the .gitattributes file using this format:
>>
>> https://git-scm.com/docs/gitattributes#_packing_objects
>>
>> Some stuff is obvious, like "*.gpg binary -delta", but I'm wondering if
>> there's some repo scanner utility to spew this out for a given repo.
>
> I'm not sure what you mean by "un-delta-able" objects. Do you mean ones
> where we're not likely to find a delta? Or ones where Git will not try
> to look for a delta?
>
> If the latter, I think the only rules are the "-delta" attribute and the
> object size. You should be able to use git-check-attr and "git-cat-file"
> to get that info.
>
> If the former, I don't know how you would know. We can only report on
> what isn't a delta _yet_.
Some version of the former. Ones where we haven't found any (or many)
useful deltas yet. E.g. say I had a repository with a lot of files
generated by this command at various points in the history:
    dd if=/dev/urandom of=file.binary count=1024 bs=1024
I'd like some script similar to git-sizer which could report that the
packed+compressed+delta'd version of the 10 *.binary files I had in my
history had a 1:1 ratio of how large they were in .git vs. how large
the sum of each file retrieved by "git show" was (i.e. uncompressed,
un-delta'd).
That doesn't mean that tomorrow I won't commit 10 new objects which
would have a really good delta ratio to those 10 existing files,
bringing the ratio to ~1:2, but a report like:
    <ratio> <extension>
for a given repo could be fed into .gitattributes to say we
shouldn't bother to delta files of certain extensions.
* Re: Is there some script to find un-delta-able objects?
From: Junio C Hamano @ 2018-10-05 16:47 UTC (permalink / raw)
To: Jeff King; +Cc: Ævar Arnfjörð Bjarmason, Git List
Jeff King <peff@peff.net> writes:
> On Fri, Oct 05, 2018 at 04:20:27PM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> I.e. something to generate the .gitattributes file using this format:
>>
>> https://git-scm.com/docs/gitattributes#_packing_objects
>>
>> Some stuff is obvious, like "*.gpg binary -delta", but I'm wondering if
>> there's some repo scanner utility to spew this out for a given repo.
>
> I'm not sure what you mean by "un-delta-able" objects. Do you mean ones
> where we're not likely to find a delta? Or ones where Git will not try
> to look for a delta?
>
> If the latter, I think the only rules are the "-delta" attribute and the
> object size. You should be able to use git-check-attr and "git-cat-file"
> to get that info.
>
> If the former, I don't know how you would know. We can only report on
> what isn't a delta _yet_.
I am reasonably sure that the question is about solving the former
so that "-delta" attribute is set appropriately.
Initially, I thought that an undeltifiable object is likely to have
higher randomness than deltifiable ones, and that this could be exploited.
But if you have such a highly random blob A (and no other object
like it) in the repository and then later acquire another blob B
that happens to share most of its data with A, then A and B will each
pass the "highly random" test by themselves, yet each can still
be expressed as a delta derived from the other. So your "what isn't
a delta yet" is a reasonable assessment of what can mechanically be
known.
Knowledge/heuristic like "No two '*.gpg' files are expected to be
alike" needs something more than the randomness of individual files,
I guess.
* Re: Is there some script to find un-delta-able objects?
From: Jeff King @ 2018-10-05 16:56 UTC (permalink / raw)
To: Ævar Arnfjörð Bjarmason; +Cc: Git List, Michael Haggerty
On Fri, Oct 05, 2018 at 06:44:25PM +0200, Ævar Arnfjörð Bjarmason wrote:
> Some version of the former. Ones where we haven't found any (or much of)
> useful deltas yet. E.g. say I had a repository with a lot of files
> generated by this command at various points in the history:
>
> dd if=/dev/urandom of=file.binary count=1024 bs=1024
>
> Some script similar to git-sizer which could report that the
> packed+compressed+delta'd version of the 10 *.binary files I had in my
> history had a 1:1 ratio of how large they were in .git, v.s. how large
> the sum of each file retrieved by "git show" was (i.e. uncompressed,
> un-delta'd).
You can get the uncompressed and on-disk sizes with:
    git cat-file --batch-all-objects \
        --batch-check='%(objectname) %(objectsize) %(objectsize:disk)'
and then compare the sizes/ratios however you like. If you want just a
subset of the blobs, drop the "--batch-all-objects" and just feed the
object names (or even "HEAD:filename") on stdin.
> That doesn't mean that tomorrow I won't commit 10 new objects which
> would have a really good delta ratio to those 10 existing files,
> bringing the ratio to ~1:2, but if I had some report like:
>
> <ratio> <extension>
>
> For a given repo that could be fed into .gitattributes to say we
> shouldn't bother to delta files of certain extensions.
I don't know of a tool that does that, but I think a modest application
of perl to the cat-file output would produce it.
-Peff
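A rough sketch of the post-processing suggested above, using awk instead of perl. It assumes paths are taken from "git rev-list --objects --all" and carried through cat-file's %(rest) placeholder; the function name delta_ratio_by_ext and the "(none)" bucket are made up for illustration. The ratio reflects zlib compression as well as deltas, and %(objectsize:disk) charges the full size to whichever object happens to be the delta base, so only the per-extension totals are worth reading.

```shell
#!/bin/sh
# Hypothetical helper: read lines of "<type> <size> <disk-size> <path>" on
# stdin and print "<ratio> <extension>" per extension, where ratio is
# (total on-disk bytes) / (total uncompressed bytes), highest ratio first.
delta_ratio_by_ext() {
    awk '
    $1 == "blob" && NF >= 4 {
        # Rebuild the path from field 4 onward (runs of spaces collapse,
        # which does not matter for picking out the extension).
        path = $4
        for (i = 5; i <= NF; i++) path = path " " $i
        # Keep only the basename, then take the text from the last dot on.
        n = split(path, parts, "/")
        base = parts[n]
        ext = "(none)"
        # RSTART > 1 keeps dotfiles such as ".gitignore" in "(none)".
        if (match(base, /\.[^.]+$/) && RSTART > 1)
            ext = substr(base, RSTART)
        raw[ext]  += $2
        disk[ext] += $3
    }
    END {
        for (ext in raw)
            if (raw[ext] > 0)
                printf "%.2f %s\n", disk[ext] / raw[ext], ext
    }' | sort -rn
}

# Possible usage inside a repository (not run here):
#   git rev-list --objects --all |
#   git cat-file --batch-check='%(objecttype) %(objectsize) %(objectsize:disk) %(rest)' |
#   delta_ratio_by_ext
```

Extensions with a ratio close to 1.00 would be the candidates for a "-delta" line in .gitattributes, with the caveat raised earlier in the thread: this only reports what has not been delta'd yet.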