From: Bo Chen <chen@chenirvine.org>
To: Sergio <sergio.callegari@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: GSoC - Some questions on the idea of
Date: Fri, 30 Mar 2012 15:51:20 -0400
Message-ID: <CA+M5ThTPyic=RhFL2SvuNB0xBWOHxNTaUZrYMB144UjpjCiLoQ@mail.gmail.com>
In-Reply-To: <loom.20120328T131530-717@post.gmane.org>

Please disregard my last email.
Below is a more readable version.
The sub-problems of the "delta for large files" problem:

1 large file

1.1 text file (always deltas well? needs to be confirmed)

1.2 binary file

1.2.1 general binary file (not encrypted or compressed, and not one of
the other cases that definitely cannot delta well)

1.2.1.1 delta well (ok)
1.2.1.2 does not delta well (improvement?)

1.2.2 encrypted file (improvement? One straightforward method is to
decrypt the file before delta-ing it; however, we don't always have
the decryption key. Other ideas?)

1.2.3 compressed file (improvement? Decompress before delta-ing it? Other?)
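
To make item 1.2.3 concrete, here is a minimal Python sketch (not Git
code) that counts how many fixed-size blocks two versions of a file
share, first raw and then after zlib compression. The helper names
shared_block_fraction and sample_text, the 64-byte block size, and the
synthetic input are all assumptions made up for illustration; Git's
real delta code works differently, so treat the result only as a rough
indication.

```python
import random
import zlib

BLOCK = 64  # block size in bytes; an arbitrary choice for this sketch


def shared_block_fraction(old: bytes, new: bytes, block: int = BLOCK) -> float:
    """Fraction of aligned block-sized pieces of `new` found anywhere in `old`."""
    # Every block-sized substring of `old`, at any byte offset.
    seen = {old[i:i + block] for i in range(len(old) - block + 1)}
    # Non-overlapping, aligned pieces of `new` (a trailing partial piece is ignored).
    pieces = [new[i:i + block] for i in range(0, len(new) - block + 1, block)]
    if not pieces:
        return 0.0
    return sum(p in seen for p in pieces) / len(pieces)


def sample_text(n_words: int, seed: int = 0) -> bytes:
    """Deterministic, compressible pseudo-text (hypothetical test input)."""
    rng = random.Random(seed)
    vocab = [b"delta", b"chunk", b"object", b"pack", b"blob", b"tree", b"commit", b"ref"]
    return b" ".join(rng.choice(vocab) for _ in range(n_words))


if __name__ == "__main__":
    old = sample_text(10000)
    new = b"one small edit. " + old  # a short insertion at the start of the file

    raw = shared_block_fraction(old, new)
    packed = shared_block_fraction(zlib.compress(old), zlib.compress(new))
    print(f"shared blocks, raw:        {raw:.3f}")
    print(f"shared blocks, compressed: {packed:.3f}")
```

With an input like this, the raw versions should share nearly every
block, while the compressed versions generally share far fewer, which
is the motivation for asking whether to decompress before delta-ing.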

Can anyone give me some feedback to help refine the problem further? Thanks.

Bo

On Wed, Mar 28, 2012 at 7:33 AM, Sergio <sergio.callegari@gmail.com> wrote:
> Nguyen Thai Ngoc Duy <pclouds <at> gmail.com> writes:
>
>>
>> On Wed, Mar 28, 2012 at 11:38 AM, Bo Chen <chen <at> chenirvine.org> wrote:
>> > Hi, Everyone. This is Bo Chen. I am interested in the idea of "Better
>> > big-file support".
>> >
>> > As it is described in the idea page,
>> > "Many large files (like media) do not delta very well. However, some
>> > do (like VM disk images). Git could split large objects into smaller
>> > chunks, similar to bup, and find deltas between these much more
>> > manageable chunks. There are some preliminary patches in this
>> > direction, but they are in need of review and expansion."
>> >
>> > Can anyone elaborate a little bit why many large files do not delta
>> > very well?
>>
>> Large files are usually binary. Depending on the type of binary, they
>> may or may not delta well. Those that are compressed/encrypted
>> obviously don't delta well because one change can make the final
>> result completely different.
>
> I would add that the larger a file, the larger the temptation to use a
> compressed format for it, so that large files are often compressed binaries.
>
> For these, a trick to obtain good deltas can be to decompress before splitting
> into chunks with the rsync algorithm. Git filters can already be used for this,
> but it can be tricky to ensure that the decompress - recompress roundtrip
> re-creates the original compressed file.
>
> Furthermore, some compressed binaries are internally composed of multiple streams
> (think of a zip archive containing multiple files, but this is by no means
> limited to zip). In this case, it is frequent to have many possible orderings of
> the streams. If so, the best deltas can be obtained by sorting the streams in
> some 'canonical' order and decompressing. Even without decompressing, sorting
> alone can obtain good results as long as changes are only due to changes in a
> single stream of the container. Personally, I know of no example of git filters
> used to perform this sorting, which can be extremely tricky because the file
> must remain recoverable in its original stream order.
>
> Maybe (but this is just speculation), once the bup-inspired file chunking
> support is in place, people will start contributing filters to improve the
> management of many types of standard files (obviously 'improve' in terms of
> space efficiency as filters can be quite slow).
>
> Sergio
>
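
The chunk splitting mentioned in the quoted text (bup-style, rsync
algorithm) can be sketched as follows. This is only an illustration
under stated assumptions: it uses a plain additive rolling sum, not
bup's or rsync's actual checksum, and split_chunks, WINDOW and MASK
are hypothetical names and values chosen for the example. The point is
that cut points depend only on the last WINDOW bytes, so an insertion
early in a file disturbs only the chunks near it.

```python
import random
from typing import List

WINDOW = 48   # bytes covered by the rolling sum (assumed value)
MASK = 0x3F   # cut when the low 6 bits are all ones (about 64-byte chunks on random data)


def split_chunks(data: bytes, window: int = WINDOW, mask: int = MASK) -> List[bytes]:
    """Split `data` at content-defined boundaries using a rolling byte sum."""
    chunks = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # drop the byte that just left the window
        if (rolling & mask) == mask:
            chunks.append(data[start:i + 1])  # cut right after byte i
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # whatever is left after the last cut
    return chunks


if __name__ == "__main__":
    rng = random.Random(1)
    old = bytes(rng.randrange(256) for _ in range(20000))
    new = b"inserted prefix" + old  # simulate an edit near the start of the file

    old_chunks = split_chunks(old)
    new_chunks = split_chunks(new)
    reused = set(old_chunks) & set(new_chunks)
    print(f"{len(old_chunks)} old chunks, {len(new_chunks)} new chunks, "
          f"{len(reused)} distinct chunks reused")
```

Because a cut is decided from the window contents alone, every old
chunk that starts at least WINDOW bytes into the file reappears
unchanged in the new version, so only the chunks around the edit would
need to be stored again.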