From: Linus Torvalds <torvalds@osdl.org>
To: Christopher Li <git@chrisli.org>
Cc: Git Mailing List <git@vger.kernel.org>
Subject: Re: CAREFUL! No more delta object support!
Date: Tue, 28 Jun 2005 09:45:28 -0700 (PDT)
Message-ID: <Pine.LNX.4.58.0506280921480.19755@ppc970.osdl.org>
In-Reply-To: <20050628103852.GB21533@64m.dyndns.org>
On Tue, 28 Jun 2005, Christopher Li wrote:
>
> That is all nice improvement to address the space usage issue.
>
> Should people just run repacking once in a while, or are new objects
> automatically added to the pack file?
While adding a new object to a pack file is _possible_ (you add it to the
end of the pack-file, and re-generate the index file), I would strongly
suggest against it for several reasons:
- It's a lot more complex and expensive than just writing a new file.
Much better to make the pack generation be an off-line thing, and make
new object creation really cheap.
- It has serious locking issues, and if something goes wrong you are just
horribly screwed. This implies, for example, that to be safe you really
have to use fsync() etc at every point (and be careful about writing
the index), making the update even _more_ expensive. Over NFS you need
to be extremely careful to make sure that everybody got the right lock,
yadda yadda.
Packing things off-line just means that _all_ of these problems go
away.
- There are operations that want to remove objects (I do that all the
time: I do something stupid, and decide to undo it, or I just do a
"git-update-cache" and notice that I need to do more work so I edit it
some more and actually never commit the first version)
If _adding_ to the file had some serious correctness issues, _removing_
an object from a file is even worse. MUCH worse. Now you don't just
have to lock against other people creating new objects, now you have to
lock against updates (or totally re-write the whole big file and do an
atomic "rename").
- It can actually generate worse packing. The current "offline" method
means that we can pack any version of a file against any other version
of a file, and we do. We pick the closest version we can find, and we
try to always pack against the bigger one (deletes are smaller deltas,
and the biggest one tends to be the latest version, so this not only
means that the delta is denser, it also means that the latest version -
which is likely to be the biggest and most often used - tends to be
non-delta).
In contrast, updating the pack file means that you always write the
latest version as a delta, which means that you're doing things
_exactly_ the wrong way around both for performance and size.
- Finally: packing allows us to optimize for locality. In particular,
I write out the pack file in "recency" order, ie the top-most objects
go first, and in particular, the "commit" objects go at the very top of
the file. Why? Because it means that the commit objects (which are
heavily used for the history generation by pretty much anything, since
"git-rev-list" will access them) are packed together, and in the right
order.
Again, you can't do that if you do on-line updates as opposed to
offline packing.
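To make the delta-direction point in the list above concrete (pack against the bigger version, because deletes are smaller deltas), here is a minimal sketch. It is not the pack-file delta encoder: it uses Python's difflib to find matching line ranges and a made-up cost model in which a copy instruction costs a fixed COPY_COST bytes, inserted data costs its own length, and deleted data costs nothing. The names delta_cost, COPY_COST, old_version and new_version are assumptions for illustration only.

```python
from difflib import SequenceMatcher

COPY_COST = 8  # assumed fixed cost, in bytes, of one "copy range from base" instruction


def delta_cost(base, target):
    """Estimate the size of a delta that rebuilds `target` from `base`.

    Both arguments are lists of byte strings (lines). Copies cost a small
    constant, inserted data costs its full length, and deletions are free
    because the delta simply never copies those lines.
    """
    matcher = SequenceMatcher(a=base, b=target, autojunk=False)
    cost = 0
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            cost += COPY_COST
        elif tag in ("insert", "replace"):
            cost += sum(len(line) for line in target[j1:j2])
        # tag == "delete": nothing needs to be stored
    return cost


# Hypothetical history: the newer version is the older one plus five lines.
old_version = [b"line %d\n" % i for i in range(10)]
new_version = old_version + [b"added line %d\n" % i for i in range(5)]

# Offline packing: keep the bigger (newer) version whole, delta the older one.
small_as_delta = delta_cost(base=new_version, target=old_version)

# Online appending: the newest version has to be the delta.
big_as_delta = delta_cost(base=old_version, target=new_version)

print(small_as_delta, big_as_delta)
```

Under this toy cost model the first number is the smaller one: rebuilding the old version from the new one is a single copy, while rebuilding the new version from the old one has to carry all of the added lines.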
So the usage pattern I envision is to pack stuff maybe once a month
(depending on how much changes, of course), because then you really do get
the best of both worlds: the simplicity of individual objects for recent
work and the optimal packing and ordering that you can really work on for
the longer range case. And your project never grows very big.
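The usage pattern described above (cheap individual objects for recent work, an occasional offline repack) can be sketched roughly as follows. This is a toy model, not git's on-disk format: objects are named by the plain SHA-1 of their raw bytes, the "pack" is an uncompressed concatenation, and the index is a JSON file. The function names and the directory layout (loose_dir, pack_dir) are assumptions for illustration.

```python
import hashlib
import json
import os
import tempfile


def _atomic_write(path, data):
    """Write to a temporary file in the same directory, then rename it into place."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path), prefix="tmp_")
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.replace(tmp, path)


def write_loose(loose_dir, data):
    """Creating an object is just writing one new file; nothing shared is modified."""
    name = hashlib.sha1(data).hexdigest()
    path = os.path.join(loose_dir, name)
    if not os.path.exists(path):
        _atomic_write(path, data)
    return name


def repack(loose_dir, pack_dir):
    """Offline step: fold all loose objects into one new pack, then prune them."""
    names = sorted(n for n in os.listdir(loose_dir) if not n.startswith("tmp_"))
    if not names:
        return None
    blob, index = b"", {}
    for name in names:
        with open(os.path.join(loose_dir, name), "rb") as f:
            data = f.read()
        index[name] = [len(blob), len(data)]  # offset and length inside the pack
        blob += data
    pack_id = hashlib.sha1("".join(names).encode()).hexdigest()
    _atomic_write(os.path.join(pack_dir, pack_id + ".pack"), blob)
    _atomic_write(os.path.join(pack_dir, pack_id + ".idx"), json.dumps(index).encode())
    # Only after pack and index are both in place are the loose copies removed.
    for name in names:
        os.remove(os.path.join(loose_dir, name))
    return pack_id


def read_object(loose_dir, pack_dir, name):
    """Look in the loose objects first, then in every pack index."""
    loose_path = os.path.join(loose_dir, name)
    if os.path.exists(loose_path):
        with open(loose_path, "rb") as f:
            return f.read()
    for entry in os.listdir(pack_dir):
        if entry.endswith(".idx"):
            with open(os.path.join(pack_dir, entry), "rb") as f:
                index = json.load(f)
            if name in index:
                offset, length = index[name]
                with open(os.path.join(pack_dir, entry[:-4] + ".pack"), "rb") as f:
                    f.seek(offset)
                    return f.read(length)
    raise KeyError(name)
```

Note that existing pack files are never modified in this sketch: a repack only adds a new pack and index and then deletes loose files, so readers never see a half-updated pack.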
Btw, I'm not claiming that my current pack format is "optimal" of course.
For example, while I write all objects in recency order, right now that
means that if a recent object has been written as a delta that depends on
an older one, I actually write the delta first (correct) but I won't write
the older object until its recency ordering (wrong).
That kind of thing is trivial to fix (eventually), but it's an example of
where ordering matters (ie if it's the other way around: the delta is the
older object, it's probably better to leave it at the end of the file,
since it's probably not going to be accessed much, making the effective
packing at the head more efficient). It's also an example of the kinds
of things we can do exactly because we're doing the packing off-line.
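One possible shape of the ordering fix described above, as a sketch only: walk the objects in recency order, and whenever a delta is written, write its base (and that base's base, and so on) immediately after it. An older delta whose base is more recent is left at its own, later position. The function name pack_write_order and the object names are hypothetical, and this is not claimed to be what the actual pack writer does.

```python
def pack_write_order(recency_order, delta_base):
    """Return the order in which objects would be written to the pack.

    recency_order: object names, most recent first.
    delta_base: maps a delta object's name to the name of its base object.
    Each object is emitted at its recency position unless it was already
    pulled forward as the base of a more recent delta.
    """
    written = set()
    order = []
    for name in recency_order:
        # Follow the delta chain: the object itself, then its base, and so on.
        while name is not None and name not in written:
            written.add(name)
            order.append(name)
            name = delta_base.get(name)
    return order


# Hypothetical objects, most recent first.
recency = ["a", "b", "c", "d"]
# "a" is a recent delta against the older "c";
# "d" is an old delta against the more recent "b".
bases = {"a": "c", "d": "b"}

print(pack_write_order(recency, bases))  # ['a', 'c', 'b', 'd']
```

Here "c" is pulled forward to sit right behind the recent delta "a" that needs it, while the old delta "d" stays at the end of the file, since its base "b" has already been written by the time "d" is reached.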
Linus