All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jeff King <peff@peff.net>
To: Neal Kreitzinger <nkreitzinger@gmail.com>
Cc: Sergio Callegari <sergio.callegari@gmail.com>,
	Bo Chen <chen@chenirvine.org>,
	git@vger.kernel.org
Subject: Re: GSoC - Some questions on the idea of
Date: Mon, 2 Apr 2012 17:07:08 -0400	[thread overview]
Message-ID: <20120402210708.GA28926@sigill.intra.peff.net> (raw)
In-Reply-To: <4F772E48.3030708@gmail.com>

On Sat, Mar 31, 2012 at 11:18:16AM -0500, Neal Kreitzinger wrote:

> On 3/31/2012 6:02 AM, Sergio Callegari wrote:
> >I wonder if it could make sense to have some pluggable mechanism for
> > file splitting. Something under the lines of filters, so to say.
> >Bupsplit can be a rather general mechanism, but large binaries that
> >are containers (zip, jar, docx, tgz, pdf - seen as a collection of
> >streams) may possibly be more conveniently split by their inherent
> >components.
> >
> 
> gitattributes or gitconfig could configure the big-file handler for
> specified files.  Known/supported filetypes like gif, png, zip, pdf,
> etc., could be auto-configured by git.  Any
> yet-unknown/yet-unsupported filetypes could be configured manually by
> the user, e.g.
> *.zgp=bigcontainer

This is a tempting route (and one I've even suggested myself before),
but I think ultimately it is a bad way to go. The problem is that
splitting is only half of the equation. Once you have split contents,
you have to use them intelligently, which means looking at the sha1s of
each split chunk and discarding whole chunks as "the same" without even
looking at the contents.

Which means that it is very important that your chunking algorithm
remain stable from version to version. A change in the algorithm is
going to completely negate the benefits of chunking in the first place.
So something configurable, or something that is not applied consistently
(because it depends on each user's git config, or even on the specific
version of a tool used) can end up being no help at all.

Properly applied, I think a content-aware chunking algorithm could
out-perform a generic one. But I think we need to first find out exactly
how well the generic algorithm can perform. It may be "good enough"
compared to the hassle that inconsistent application of a content-aware
algorithm will cause.  So I wouldn't rule it out, but I'd rather try the
bup-style splitting first, and see how good (or bad) it is.

-Peff

  reply	other threads:[~2012-04-02 21:07 UTC|newest]

Thread overview: 43+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-03-28  4:38 GSoC - Some questions on the idea of "Better big-file support" Bo Chen
2012-03-28  6:19 ` Nguyen Thai Ngoc Duy
2012-03-28 11:33   ` GSoC - Some questions on the idea of Sergio
2012-03-30 19:44     ` Bo Chen
2012-03-30 19:51     ` Bo Chen
2012-03-30 20:34       ` Jeff King
2012-03-30 23:08         ` Bo Chen
2012-03-31 11:02           ` Sergio Callegari
2012-03-31 16:18             ` Neal Kreitzinger
2012-04-02 21:07               ` Jeff King [this message]
2012-04-03  9:58                 ` Sergio Callegari
2012-04-11  1:24                 ` Neal Kreitzinger
2012-04-11  6:04                   ` Jonathan Nieder
2012-04-11 16:29                     ` Neal Kreitzinger
2012-04-11 22:09                       ` Jeff King
2012-04-11 16:35                     ` Neal Kreitzinger
2012-04-11 16:44                     ` Neal Kreitzinger
2012-04-11 17:20                       ` Jonathan Nieder
2012-04-11 18:51                         ` Junio C Hamano
2012-04-11 19:03                           ` Jonathan Nieder
2012-04-11 18:23                     ` Neal Kreitzinger
2012-04-11 21:35                   ` Jeff King
2012-04-12 19:29                     ` Neal Kreitzinger
2012-04-12 21:03                       ` Jeff King
     [not found]                         ` <4F8A2EBD.1070407@gmail.com>
2012-04-15  2:15                           ` Jeff King
2012-04-15  2:33                             ` Neal Kreitzinger
2012-04-16 14:54                               ` Jeff King
2012-05-10 21:43                             ` Neal Kreitzinger
2012-05-10 22:39                               ` Jeff King
2012-04-12 21:08                       ` Neal Kreitzinger
2012-04-13 21:36                       ` Bo Chen
2012-03-31 15:19         ` Neal Kreitzinger
2012-04-02 21:40           ` Jeff King
2012-04-02 22:19             ` Junio C Hamano
2012-04-03 10:07               ` Jeff King
2012-03-31 16:49         ` Neal Kreitzinger
2012-03-31 20:28         ` Neal Kreitzinger
2012-03-31 21:27           ` Bo Chen
2012-04-01  4:22             ` Nguyen Thai Ngoc Duy
2012-04-01 23:30               ` Bo Chen
2012-04-02  1:00                 ` Nguyen Thai Ngoc Duy
2012-03-30 19:11   ` GSoC - Some questions on the idea of "Better big-file support" Bo Chen
2012-03-30 19:54     ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20120402210708.GA28926@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=chen@chenirvine.org \
    --cc=git@vger.kernel.org \
    --cc=nkreitzinger@gmail.com \
    --cc=sergio.callegari@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.