From: Jeff King <peff@peff.net>
To: Bo Chen <chen@chenirvine.org>
Cc: Nguyen Thai Ngoc Duy <pclouds@gmail.com>, git@vger.kernel.org
Subject: Re: GSoC - Some questions on the idea of "Better big-file support".
Date: Fri, 30 Mar 2012 15:54:04 -0400
Message-ID: <20120330195404.GA20189@sigill.intra.peff.net>
In-Reply-To: <CA+M5ThS1XiaGJWmSvfwXoqebnH6fK3h6cC7OnQQi=LXzcA0GRw@mail.gmail.com>

On Fri, Mar 30, 2012 at 03:11:40PM -0400, Bo Chen wrote:

> Just to clear up one of my confusions: a delta operation finds the
> differences between different versions of the same file, right?
> As far as I know, delta encoding re-encodes a file based on the
> differences between neighboring blocks, which helps compress the file
> because after delta encoding there is more similar data within the
> file. Can anyone elaborate a little on the relation between the
> delta operation in git and the delta encoding described above? Thanks.

Sort of. Git is snapshot based. So each version of a file is its own
"object", and from a high-level view, we store all objects. But we store
the logical objects themselves in packfiles, in which the actual
representation of the object may be stored as a difference to another
object (which is likely to be a different version of the same file, but
does not have to be).
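
To make the "stored as a difference to another object" idea concrete,
here is a rough sketch in Python. It is not git's pack format or its
delta algorithm; it only illustrates the general shape of a delta as a
list of "copy this range from the base" and "insert these literal bytes"
instructions. The names make_delta and apply_delta are made up for the
example, and difflib stands in for a real delta finder.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes) -> list:
    """Describe target as copies from base plus literal inserts.

    Illustrative only: difflib is used here as a stand-in matcher.
    """
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # These bytes already exist in the base: record offset and length.
            ops.append(("copy", i1, i2 - i1))
        elif tag in ("replace", "insert"):
            # These bytes exist only in the target: store them literally.
            ops.append(("insert", target[j1:j2]))
        # "delete" needs no instruction; those base bytes are just not copied.
    return ops


def apply_delta(base: bytes, ops: list) -> bytes:
    """Rebuild the full target object from the base and the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)
```

The logical object is still the whole snapshot; apply_delta(base,
make_delta(base, target)) gives back exactly the target bytes. The base
can be any object, which is why it does not have to be an earlier
version of the same file.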

Here's some background reading:

  http://progit.org/book/ch1-3.html

  http://progit.org/book/ch9-4.html

> I am wondering why we cannot divide the two 2GB files into chunks and
> delta them chunk by chunk. Is there any difference, except a little more
> I/O?

It's more complicated than that. What if the file is re-ordered? You
would want to compare early chunks in one version against later chunks
in the other. So yes, you can reduce memory pressure by doing more I/O,
but doing too much I/O will be very slow. Coming up with a solution is
part of what this project is about. And chunking is part of that
solution.
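
A small Python sketch of the re-ordering problem, under the assumption
of fixed-size chunks (the function names and the 4096-byte chunk size
are made up for the example). Comparing chunk N only against chunk N
finds nothing when the two halves of a file are swapped, while indexing
the old chunks by hash finds every one of them, at the cost of keeping
an index around.

```python
import hashlib


def fixed_chunks(data: bytes, size: int) -> list:
    """Cut data into consecutive pieces of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def positional_matches(old: bytes, new: bytes, size: int) -> int:
    """Count chunks that are identical at the same position in both files."""
    pairs = zip(fixed_chunks(old, size), fixed_chunks(new, size))
    return sum(a == b for a, b in pairs)


def indexed_matches(old: bytes, new: bytes, size: int) -> int:
    """Count new chunks whose content appears anywhere in the old file."""
    index = {hashlib.sha1(c).digest() for c in fixed_chunks(old, size)}
    return sum(hashlib.sha1(c).digest() in index
               for c in fixed_chunks(new, size))


first, second = b"A" * 4096, b"B" * 4096
old = first + second
new = second + first  # same content, re-ordered

print(positional_matches(old, new, 4096))  # 0
print(indexed_matches(old, new, 4096))     # 2
```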

> > Read about rsync algorithm [2]. Bup [1] implements the same (I think)
> > algorithm, but on top of git. For preliminary patches, have a look at
> > jc/split-blob series at commit 4a1242d in git.git.
> 
> To clear up another confusion of mine: the file which has been updated
> (added, deleted, or modified) is first delta-compressed, and then
> synchronized to the remote repo by some mechanism (rsync?). I am
> wondering what the relationship is between the delta operation and
> rsync.

No, the updated file is delta compressed into a packfile, and the
packfile is transmitted. Rsync comes into play because it uses a novel
chunking algorithm, which was copied by bup (and is referred to as the
"bupsplit" algorithm). Read up on how bup works and why it was invented.
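
For a feel of what that kind of chunking does, here is a minimal Python
sketch of content-defined splitting with a rolling sum. It is neither
rsync's checksum nor bup's actual code; the plain byte sum, the 64-byte
window, and the 6-bit mask are assumptions chosen to keep the example
short. The point is that a boundary depends only on the last few bytes
seen, so inserting data near the start of a file does not shift every
later chunk.

```python
WINDOW = 64          # bytes covered by the rolling sum (illustrative)
MASK = (1 << 6) - 1  # split when the low 6 bits are all ones (illustrative)


def split(data: bytes) -> list:
    """Cut data wherever the rolling sum of the last WINDOW bytes hits MASK."""
    chunks = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= WINDOW:
            # Slide the window: drop the byte that just fell out of it.
            rolling -= data[i - WINDOW]
        if (rolling & MASK) == MASK:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes with no boundary
    return chunks
```

If you run split() on a file and on the same file with a few bytes
prepended, the chunks beyond the first window come out byte-for-byte
identical, so they can be stored or transmitted once. Fixed-size chunks
would all differ after such an insertion.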

-Peff