git.vger.kernel.org archive mirror
From: Linus Torvalds <torvalds@osdl.org>
To: Christopher Li <git@chrisli.org>
Cc: Git Mailing List <git@vger.kernel.org>
Subject: Re: CAREFUL! No more delta object support!
Date: Tue, 28 Jun 2005 09:45:28 -0700 (PDT)
Message-ID: <Pine.LNX.4.58.0506280921480.19755@ppc970.osdl.org>
In-Reply-To: <20050628103852.GB21533@64m.dyndns.org>



On Tue, 28 Jun 2005, Christopher Li wrote:
>
> That is all a nice improvement to address the space usage issue.
> 
> Should people just run repacking once in a while, or are new objects
> automatically added to the pack file?

While adding a new object to a pack file is _possible_ (you add it to the
end of the pack-file, and re-generate the index file), I would strongly
suggest against it for several reasons:

 - It's a lot more complex and expensive than just writing a new file.  
   Much better to make the pack generation be an off-line thing, and make 
   new object creation really cheap.

 - it has serious locking issues, and if something goes wrong you are just 
   horribly screwed. This implies, for example, that to be safe you really 
   have to use fsync() etc at every point (and be careful about writing 
   the index), making the update even _more_ expensive. Over NFS you need 
   to be extremely careful to make sure that everybody got the right lock, 
   yadda yadda.

   Packing things off-line just means that _all_ of these problems go 
   away.

 - There are operations that want to remove objects (I do that all the 
   time: I do something stupid, and decide to undo it, or I just do a 
   "git-update-cache" and notice that I need to do more work so I edit it 
   some more and actually never commit the first version)

   If _adding_ to the file had some serious correctness issues, _removing_ 
   an object from a file is even worse. MUCH worse. Now you don't just 
   have to lock against other people creating new objects, now you have to 
   lock against updates (or totally re-write the whole big file and do an 
   atomic "rename").

 - it can actually generate worse packing. The current "offline" method 
   means that we can pack any version of a file against any other version 
   of a file, and we do. We pick the closest version we can find, and we 
   try to always pack against the bigger one (deletes are smaller deltas, 
   and the biggest one tends to be the latest version, so this not only
   means that the delta is denser, it also means that the latest version -
   which is likely to be the biggest and most often used - tends to be
   non-delta).

   In contrast, updating the pack file means that you always write the 
   latest version as a delta, which means that you're doing things 
   _exactly_ the wrong way around both for performance and size.

 - Finally: packing allows us to optimize for locality. In particular, 
   I write out the pack file in "recency" order, ie the top-most objects 
   go first, and in particular, the "commit" objects go at the very top of 
   the file. Why? Because it means that the commit objects (which are 
   heavily used for the history generation by pretty much anything, since 
   "git-rev-list" will access them) are packed together, and in the right 
   order.

   Again, you can't do that if you do on-line updates as opposed to 
   offline packing.
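
To make the delta-direction point above concrete, here is a minimal sketch
of an off-line delta plan. It is not the actual pack code: the `Version`
type and `plan_deltas` function are hypothetical names, and "closest
version" is approximated by "next bigger version by size", which is only an
assumption for illustration. The biggest version is stored whole and every
other version is planned as a delta against a bigger one.

```python
from dataclasses import dataclass


@dataclass
class Version:
    name: str    # hypothetical label for one version of a file
    data: bytes  # full contents of that version


def plan_deltas(versions):
    """Map each version name to its delta base name (None = stored whole).

    Assumption for this sketch: closeness is approximated by size, so each
    version is deltified against the next bigger version.
    """
    # Biggest first: the biggest version is stored as a full object.
    ordered = sorted(versions, key=lambda v: len(v.data), reverse=True)
    plan = {}
    previous = None
    for version in ordered:
        # Always delta against a bigger version, so the delta mostly
        # describes deletions and stays small.
        plan[version.name] = previous.name if previous else None
        previous = version
    return plan
```

An on-line appender cannot produce this plan, because it only ever sees the
newest version last and has to write that one as the delta.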

So the usage pattern I envision is to pack stuff maybe once a month
(depending on how much changes, of course), because then you really do get
the best of both worlds: the simplicity of individual objects for recent
work and the optimal packing and ordering that you can really work on for
the longer range case. And your project never grows very big.
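
The "re-write the whole big file and do an atomic rename" approach
mentioned earlier is what an off-line repack can lean on. A minimal sketch,
assuming a POSIX-like filesystem where renaming within one directory is
atomic; `replace_file_atomically` is a hypothetical helper, not part of
git:

```python
import os
import tempfile


def replace_file_atomically(path, data):
    """Write data to a temporary file, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so the final rename
    # never crosses a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes are on disk first
        os.replace(tmp_path, path)  # readers see the old file or the new one
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Readers never observe a half-written file, and nobody has to take a lock
while the new file is being generated.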

Btw, I'm not claiming that my current pack format is "optimal" of course.  
For example, while I write all objects in recency order, right now that
means that if a recent object has been written as a delta that depends on
an older one, I actually write the delta first (correct) but I won't write
the older object until its recency ordering (wrong).

That kind of thing is trivial to fix (eventually), but it's an example of
where ordering matters (ie if it's the other way around: the delta is the
older object, it's probably better to leave it at the end of the file,
since it's probably not going to be accessed much, making the effective
packing at the head more efficient). It's also an example of the kinds
of things we can do exactly because we're doing the packing off-line.
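
The ordering described above, including the fix for the delta-before-base
case, can be sketched as follows. This is an illustration only, not the
real pack writer: `Obj`, its fields, and `write_order` are hypothetical
names. Commits go first, everything is sorted by recency, and when a delta
is written its base is written right after it instead of waiting for the
base's own recency slot.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Obj:
    name: str                   # hypothetical object identifier
    kind: str                   # "commit", "tree" or "blob"
    recency: int                # 0 = most recent
    base: Optional[str] = None  # name of the delta base, None if not a delta


def write_order(objects):
    """Return object names in the order they would be written to the pack."""
    by_name = {o.name: o for o in objects}
    # Commits first, then everything else; most recent first within each group.
    ranked = sorted(objects, key=lambda o: (o.kind != "commit", o.recency))
    order = []
    seen = set()

    def emit(obj):
        if obj.name in seen:
            return
        seen.add(obj.name)
        order.append(obj.name)
        # Write the delta first, then pull its base up right behind it.
        if obj.base is not None and obj.base in by_name:
            emit(by_name[obj.base])

    for obj in ranked:
        emit(obj)
    return order
```

An old delta whose base is recent is left alone by this sketch: the base is
written in its own recency slot and the delta stays near the end of the
file.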

			Linus
