All of lore.kernel.org
 help / color / mirror / Atom feed
From: Sergio Callegari <sergio.callegari@gmail.com>
To: Jeff King <peff@peff.net>
Cc: Neal Kreitzinger <nkreitzinger@gmail.com>,
	Bo Chen <chen@chenirvine.org>,
	git@vger.kernel.org
Subject: Re: GSoC - Some questions on the idea of
Date: Tue, 03 Apr 2012 11:58:58 +0200	[thread overview]
Message-ID: <4F7AC9E2.60203@gmail.com> (raw)
In-Reply-To: <20120402210708.GA28926@sigill.intra.peff.net>

On 02/04/2012 23:07, Jeff King wrote:
>> gitattributes or gitconfig could configure the big-file handler for
>> specified files.  Known/supported filetypes like gif, png, zip, pdf,
>> etc., could be auto-configured by git.  Any
>> yet-unknown/yet-unsupported filetypes could be configured manually by
>> the user, e.g.
>> *.zgp=bigcontainer
> This is a tempting route (and one I've even suggested myself before),
> but I think ultimately it is a bad way to go. The problem is that
> splitting is only half of the equation. Once you have split contents,
> you have to use them intelligently, which means looking at the sha1s of
> each split chunk and discarding whole chunks as "the same" without even
> looking at the contents.
>
> Which means that it is very important that your chunking algorithm
> remain stable from version to version. A change in the algorithm is
> going to completely negate the benefits of chunking in the first place.
> So something configurable, or something that is not applied consistently
> (because it depends on each user's git config, or even on the specific
> version of a tool used) can end up being no help at all.
Isn't this the same with filters? The clean algorithms should remain 
stable from
version to version. Filters are often perceived as simpler, so that this 
stability seems easier to achieve, but it is not necessarily the case.
> Properly applied, I think a content-aware chunking algorithm could
> out-perform a generic one. But I think we need to first find out exactly
> how well the generic algorithm can perform. It may be "good enough"
> compared to the hassle that inconsistent application of a content-aware
> algorithm will cause.
Absolutely true, but why not giving freedom to the user to chose? Git 
could provide the bupsplit mechanism and at the same time have a means 
so that the user can plug in a different machinery for specific file 
types.  In this case, it is the user responsibility to do it right.

One could have a special 'filter' for splitting/unsplitting. Say

[splitfilter "XXX"]
     split = xxx
     unsplit = uxxx

xxx is given the file to split on stdin and returns on stdout a stream 
made of an index header and the concatenation of the parts in which the 
file should be split. For unsplitting uxxx is given on stdin the index 
and the concatenation of parts and returns on stdout the binary file.

bupsplit and bupunsplit could be built in, with other tools being user 
provided.  If the users gets them wrong it is ultimately his/her 
responsibility. In the end, the user is given even 'rm' isn't he/she? 
Git could provide a header file defining the index header format to help 
the coding of the alternative, more specific splitters. If people devise 
some of them that look promising, they can probably be collected in contrib.

Possibly, the index header could comprise starting positions for the 
various parts in the stream, but also 'names' for them. This would let 
reusing blob and tree objects to physically store the various parts. For 
bupsplit, names could be flat (e.g. sequence numbers like 0000, 0001). 
For files that are container, they could reflect the inner names. 
Perspectively, one could even devise specific diff tools for these 
'special' trees of split-object components. With this, when storing say 
a very large zip file in git, these tools could help saying things like 
'from version x to version y, only that specific part in the zip file 
has changed'.

Sergio

  reply	other threads:[~2012-04-03  9:59 UTC|newest]

Thread overview: 43+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-03-28  4:38 GSoC - Some questions on the idea of "Better big-file support" Bo Chen
2012-03-28  6:19 ` Nguyen Thai Ngoc Duy
2012-03-28 11:33   ` GSoC - Some questions on the idea of Sergio
2012-03-30 19:44     ` Bo Chen
2012-03-30 19:51     ` Bo Chen
2012-03-30 20:34       ` Jeff King
2012-03-30 23:08         ` Bo Chen
2012-03-31 11:02           ` Sergio Callegari
2012-03-31 16:18             ` Neal Kreitzinger
2012-04-02 21:07               ` Jeff King
2012-04-03  9:58                 ` Sergio Callegari [this message]
2012-04-11  1:24                 ` Neal Kreitzinger
2012-04-11  6:04                   ` Jonathan Nieder
2012-04-11 16:29                     ` Neal Kreitzinger
2012-04-11 22:09                       ` Jeff King
2012-04-11 16:35                     ` Neal Kreitzinger
2012-04-11 16:44                     ` Neal Kreitzinger
2012-04-11 17:20                       ` Jonathan Nieder
2012-04-11 18:51                         ` Junio C Hamano
2012-04-11 19:03                           ` Jonathan Nieder
2012-04-11 18:23                     ` Neal Kreitzinger
2012-04-11 21:35                   ` Jeff King
2012-04-12 19:29                     ` Neal Kreitzinger
2012-04-12 21:03                       ` Jeff King
     [not found]                         ` <4F8A2EBD.1070407@gmail.com>
2012-04-15  2:15                           ` Jeff King
2012-04-15  2:33                             ` Neal Kreitzinger
2012-04-16 14:54                               ` Jeff King
2012-05-10 21:43                             ` Neal Kreitzinger
2012-05-10 22:39                               ` Jeff King
2012-04-12 21:08                       ` Neal Kreitzinger
2012-04-13 21:36                       ` Bo Chen
2012-03-31 15:19         ` Neal Kreitzinger
2012-04-02 21:40           ` Jeff King
2012-04-02 22:19             ` Junio C Hamano
2012-04-03 10:07               ` Jeff King
2012-03-31 16:49         ` Neal Kreitzinger
2012-03-31 20:28         ` Neal Kreitzinger
2012-03-31 21:27           ` Bo Chen
2012-04-01  4:22             ` Nguyen Thai Ngoc Duy
2012-04-01 23:30               ` Bo Chen
2012-04-02  1:00                 ` Nguyen Thai Ngoc Duy
2012-03-30 19:11   ` GSoC - Some questions on the idea of "Better big-file support" Bo Chen
2012-03-30 19:54     ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4F7AC9E2.60203@gmail.com \
    --to=sergio.callegari@gmail.com \
    --cc=chen@chenirvine.org \
    --cc=git@vger.kernel.org \
    --cc=nkreitzinger@gmail.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.