git.vger.kernel.org archive mirror
* Is there some script to find un-delta-able objects?
@ 2018-10-05 14:20 Ævar Arnfjörð Bjarmason
  2018-10-05 16:19 ` Jeff King
  0 siblings, 1 reply; 5+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-10-05 14:20 UTC
  To: Git List

I.e. something to generate the .gitattributes file using this format:

https://git-scm.com/docs/gitattributes#_packing_objects

Some stuff is obvious, like "*.gpg binary -delta", but I'm wondering if
there's some repo scanner utility to spew this out for a given repo.


* Re: Is there some script to find un-delta-able objects?
  2018-10-05 14:20 Is there some script to find un-delta-able objects? Ævar Arnfjörð Bjarmason
@ 2018-10-05 16:19 ` Jeff King
  2018-10-05 16:44   ` Ævar Arnfjörð Bjarmason
  2018-10-05 16:47   ` Junio C Hamano
  0 siblings, 2 replies; 5+ messages in thread
From: Jeff King @ 2018-10-05 16:19 UTC
  To: Ævar Arnfjörð Bjarmason; +Cc: Git List

On Fri, Oct 05, 2018 at 04:20:27PM +0200, Ævar Arnfjörð Bjarmason wrote:

> I.e. something to generate the .gitattributes file using this format:
> 
> https://git-scm.com/docs/gitattributes#_packing_objects
> 
> Some stuff is obvious, like "*.gpg binary -delta", but I'm wondering if
> there's some repo scanner utility to spew this out for a given repo.

I'm not sure what you mean by "un-delta-able" objects. Do you mean ones
where we're not likely to find a delta? Or ones where Git will not try
to look for a delta?

If the latter, I think the only rules are the "-delta" attribute and the
object size. You should be able to use git-check-attr and "git-cat-file"
to get that info.

If the former, I don't know how you would know. We can only report on
what isn't a delta _yet_.

-Peff


* Re: Is there some script to find un-delta-able objects?
  2018-10-05 16:19 ` Jeff King
@ 2018-10-05 16:44   ` Ævar Arnfjörð Bjarmason
  2018-10-05 16:56     ` Jeff King
  2018-10-05 16:47   ` Junio C Hamano
  1 sibling, 1 reply; 5+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-10-05 16:44 UTC
  To: Jeff King; +Cc: Git List, Michael Haggerty


On Fri, Oct 05 2018, Jeff King wrote:

> On Fri, Oct 05, 2018 at 04:20:27PM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> I.e. something to generate the .gitattributes file using this format:
>>
>> https://git-scm.com/docs/gitattributes#_packing_objects
>>
>> Some stuff is obvious, like "*.gpg binary -delta", but I'm wondering if
>> there's some repo scanner utility to spew this out for a given repo.
>
> I'm not sure what you mean by "un-delta-able" objects. Do you mean ones
> where we're not likely to find a delta? Or ones where Git will not try
> to look for a delta?
>
> If the latter, I think the only rules are the "-delta" attribute and the
> object size. You should be able to use git-check-attr and "git-cat-file"
> to get that info.
>
> If the former, I don't know how you would know. We can only report on
> what isn't a delta _yet_.

Some version of the former. Ones where we haven't found any (or much of)
useful deltas yet. E.g. say I had a repository with a lot of files
generated by this command at various points in the history:

    dd if=/dev/urandom of=file.binary count=1024 bs=1024

Some script similar to git-sizer which could report that the
packed+compressed+delta'd version of the 10 *.binary files I had in my
history had a 1:1 ratio of how large they were in .git, vs. how large
the sum of each file retrieved by "git show" was (i.e. uncompressed,
un-delta'd).

That doesn't mean that tomorrow I won't commit 10 new objects which
would have a really good delta ratio to those 10 existing files,
bringing the ratio to ~1:2, but if I had some report like:

    <ratio> <extension>

For a given repo that could be fed into .gitattributes to say we
shouldn't bother to delta files of certain extensions.


* Re: Is there some script to find un-delta-able objects?
  2018-10-05 16:19 ` Jeff King
  2018-10-05 16:44   ` Ævar Arnfjörð Bjarmason
@ 2018-10-05 16:47   ` Junio C Hamano
  1 sibling, 0 replies; 5+ messages in thread
From: Junio C Hamano @ 2018-10-05 16:47 UTC
  To: Jeff King; +Cc: Ævar Arnfjörð Bjarmason, Git List

Jeff King <peff@peff.net> writes:

> On Fri, Oct 05, 2018 at 04:20:27PM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> I.e. something to generate the .gitattributes file using this format:
>> 
>> https://git-scm.com/docs/gitattributes#_packing_objects
>> 
>> Some stuff is obvious, like "*.gpg binary -delta", but I'm wondering if
>> there's some repo scanner utility to spew this out for a given repo.
>
> I'm not sure what you mean by "un-delta-able" objects. Do you mean ones
> where we're not likely to find a delta? Or ones where Git will not try
> to look for a delta?
>
> If the latter, I think the only rules are the "-delta" attribute and the
> object size. You should be able to use git-check-attr and "git-cat-file"
> to get that info.
>
> If the former, I don't know how you would know. We can only report on
> what isn't a delta _yet_.

I am reasonably sure that the question is about solving the former
so that "-delta" attribute is set appropriately.

Initially, I thought that an undeltifiable object would likely have
higher randomness than deltifiable ones, and that this could be
exploited.  But if you have such a highly random blob A (and no other
object like it) in the repository and then later acquire another blob B
that happens to share most of its data with A, then A and B will each
pass the "highly random" test by themselves, yet each can still be
expressed as a delta derived from the other.  So your "what isn't
a delta yet" is a reasonable assessment of what can be known
mechanically.

Knowledge/heuristic like "No two '*.gpg' files are expected to be
alike" needs something more than the randomness of individual files,
I guess.
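
To make the point above concrete, here is a minimal sketch (not an
existing Git tool) of a per-blob randomness test. The function name
byte_entropy and the sample sizes are assumptions chosen for
illustration only. It computes the Shannon entropy of a blob's byte
distribution; two blobs that share almost all of their data both score
close to the maximum, so the score alone cannot say whether a delta
between them would be tiny.

```python
import math
import os
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    n = len(data)
    # Sum -p * log2(p) over the observed byte values.
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())


if __name__ == "__main__":
    # Blob A is random; blob B is A with only the last 16 bytes changed.
    a = os.urandom(64 * 1024)
    b = a[:-16] + os.urandom(16)
    print(f"entropy of A: {byte_entropy(a):.3f} bits/byte")
    print(f"entropy of B: {byte_entropy(b):.3f} bits/byte")
    # Both look "highly random", yet they are identical up to the tail.
    print(f"shared prefix: {len(a) - 16} of {len(a)} bytes")
```

Both blobs look incompressible in isolation, which is why a heuristic
based only on individual files cannot stand in for knowledge about how
files of a given type relate to each other.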



* Re: Is there some script to find un-delta-able objects?
  2018-10-05 16:44   ` Ævar Arnfjörð Bjarmason
@ 2018-10-05 16:56     ` Jeff King
  0 siblings, 0 replies; 5+ messages in thread
From: Jeff King @ 2018-10-05 16:56 UTC
  To: Ævar Arnfjörð Bjarmason; +Cc: Git List, Michael Haggerty

On Fri, Oct 05, 2018 at 06:44:25PM +0200, Ævar Arnfjörð Bjarmason wrote:

> Some version of the former. Ones where we haven't found any (or much of)
> useful deltas yet. E.g. say I had a repository with a lot of files
> generated by this command at various points in the history:
> 
>     dd if=/dev/urandom of=file.binary count=1024 bs=1024
> 
> Some script similar to git-sizer which could report that the
> packed+compressed+delta'd version of the 10 *.binary files I had in my
> history had a 1:1 ratio of how large they were in .git, vs. how large
> the sum of each file retrieved by "git show" was (i.e. uncompressed,
> un-delta'd).

You can get the uncompressed and on-disk sizes with:

  git cat-file --batch-all-objects \
    --batch-check='%(objectname) %(objectsize) %(objectsize:disk)'

and then compare the sizes/ratios however you like. If you want just a
subset of the blobs, drop the "--batch-all-objects" and just feed the
object names (or even "HEAD:filename") on stdin.

> That doesn't mean that tomorrow I won't commit 10 new objects which
> would have a really good delta ratio to those 10 existing files,
> bringing the ratio to ~1:2, but if I had some report like:
> 
>     <ratio> <extension>
> 
> For a given repo that could be fed into .gitattributes to say we
> shouldn't bother to delta files of certain extensions.

I don't know of a tool that does that, but I think a modest application
of perl to the cat-file output would produce it.
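
As a rough sketch of that idea (in Python rather than perl, and not an
existing tool), the script below aggregates cat-file output into the
"<ratio> <extension>" report. It assumes each input line carries a
path, which the plain --batch-all-objects invocation does not provide;
one way to get paths is to feed "git rev-list --objects --all" into
cat-file and append %(rest) to the format:

    git rev-list --objects --all |
    git cat-file --batch-check='%(objectname) %(objecttype) %(objectsize) %(objectsize:disk) %(rest)'

The function name ratios_by_extension and the "(none)" label for
extensionless paths are made up for this example.

```python
import os
import sys
from collections import defaultdict


def ratios_by_extension(lines):
    """Aggregate on-disk vs. uncompressed blob sizes per file extension.

    Each input line is expected to look like:
        <objectname> <objecttype> <objectsize> <objectsize:disk> <path>
    Returns a list of (ratio, extension) tuples, highest ratio first.
    """
    totals = defaultdict(lambda: [0, 0])  # ext -> [uncompressed, on-disk]
    for line in lines:
        # Split at most 4 times so paths containing spaces stay intact.
        fields = line.rstrip("\n").split(" ", 4)
        if len(fields) < 5 or fields[1] != "blob" or not fields[4]:
            continue  # skip commits, trees, tags, and entries without a path
        size, disk, path = int(fields[2]), int(fields[3]), fields[4]
        ext = os.path.splitext(path)[1] or "(none)"
        totals[ext][0] += size
        totals[ext][1] += disk
    report = [(disk / size, ext)
              for ext, (size, disk) in totals.items() if size > 0]
    return sorted(report, reverse=True)


if __name__ == "__main__":
    # Reads the cat-file output on stdin, prints "<ratio> <extension>".
    for ratio, ext in ratios_by_extension(sys.stdin):
        print(f"{ratio:.2f} {ext}")
```

A ratio near 1.00 means that neither zlib compression nor deltas have
saved much for that extension so far. As discussed earlier in the
thread, that only describes the current pack, not what future objects
might delta against.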

-Peff
