* auto-packing on kernel.org? please?
@ 2005-10-13 18:44 Linus Torvalds
[not found] ` <434EABFD.5070604@zytor.com>
2005-11-21 19:01 ` Carl Baldwin
0 siblings, 2 replies; 33+ messages in thread
From: Linus Torvalds @ 2005-10-13 18:44 UTC (permalink / raw)
To: H. Peter Anvin; +Cc: Git Mailing List
I know we tried this once earlier, and it caused problems, but that was
when pack-files were new, and not everybody could handle them. These days,
if you can't handle pack-files, kernel.org is already pretty useless,
because all the major packages use them anyway, because people have
packed their repositories by hand.
So I'm suggesting we try to do an automatic repack every once in a while.
In my suggestion, there would be two levels of repacking: "incremental"
and "full", and both of them would count the number of files before they
run, so that you'd only do it when it seems worthwhile.
This is a _really_ simple heuristic:
- incremental repacking run every day:
	#
	# Check if we have more than a couple of hundred
	# unpacked objects - approximated by whether we
	# have a "00" directory with more than one file in it.
	#
	# This means that we don't repack projects that
	# don't have a lot of work going on.
	#
	# Note: with really new versions of git, the "00"
	# directory may not exist if it has been pruned
	# away, so handle that gracefully.
	#
	export GIT_DIR=${1:-.}
	objs=$(find "$GIT_DIR/objects/00" -type f 2> /dev/null | wc -l)
	if [ "$objs" -gt 1 ]; then
		git repack &&
		git prune-packed
	fi
- "full repack" every week if the number of packs has grown to be bigger
than say 10 (ie even a very active projects will never have a full
repack more than every other week)
	#
	# Check if we have lots of packs, where "lots" is defined as 10.
	#
	# Note: with something that was generated with an old version
	# of git, the "pack" directory may not exist, so handle that
	# gracefully.
	#
	export GIT_DIR=${1:-.}
	packs=$(find "$GIT_DIR/objects/pack" -name '*.idx' 2> /dev/null | wc -l)
	if [ "$packs" -gt 10 ]; then
		git repack -a -d &&
		git prune-packed
	fi
- do a full repack of everything once to start with.
	export GIT_DIR=${1:-.}
	git repack -a -d &&
	git prune-packed
The above three trivial scripts just take a single argument, which becomes
the GIT_DIR (and if no argument is given, it defaults to ".")
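On kernel.org these would presumably be driven from cron over everything under /pub/scm. A minimal driver might look like the sketch below; `run_repack_tree` and the `daily-repack.sh` script name are hypothetical, as is the assumption that every repository directory ends in ".git":

```shell
# run_repack_tree: run the given command on every bare repository found
# under a root directory, passing each repository path as the argument
# that becomes GIT_DIR in the scripts above.
run_repack_tree () {
	root=$1; shift
	find "$root" -type d -name '*.git' -prune | while read -r dir
	do
		"$@" "$dir"
	done
}

# e.g., from a daily cron job:
#	run_repack_tree /pub/scm sh daily-repack.sh
```

The -prune keeps find from descending into the repositories themselves, so the walk stays cheap even for big trees.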
Is there any reason not to do this? Right now mirroring is slow, and
webgit is also getting to be very slow sometimes. I bet we'd be _much_
better off with this kind of setup.
NOTE! The above is the "stupid" approach, which totally ignores alternate
directories, and isn't able to take advantage of the fact that many
projects could share objects. But it's simple, and it's efficient (eg it
won't spend time on things like the large historic archives which don't
change, but that would be expensive to repack if you didn't check for the
need).
So we could try to come up with a better approach eventually, which would
automatically notice alternate directories and not repack stuff that
exists there, but I'm pretty sure that the above would already help a
_lot_, and while pack-files have been around forever, the
"alternates" support is still pretty new, so the above is also the "safer"
thing to do.
We'd only do the automatic thing on stuff under /pub/scm, of course: not
stuff in people's home directories etc.
Peter?
Linus
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [kernel.org users] Re: auto-packing on kernel.org? please?
[not found] ` <434EC07C.30505@pobox.com>
@ 2005-10-13 21:23 ` Linus Torvalds
2005-10-16 14:33 ` Dirk Behme
0 siblings, 1 reply; 33+ messages in thread
From: Linus Torvalds @ 2005-10-13 21:23 UTC (permalink / raw)
To: Jeff Garzik; +Cc: H. Peter Anvin, users, Git Mailing List, Junio C Hamano
On Thu, 13 Oct 2005, Jeff Garzik wrote:
>
> Right now, things go through an expand-contract cycle:
>
> * people base repos off of Marcelo or Linus's git repo, including using those
> pack files (saves download bandwidth, disk space through hardlinks).
>
> * as 3rd parties and Marcelo/Linus merge stuff, .git/objects/* grows with
> individual files.
>
> * once a month/release/whatever, Linus packs his repo, allowing all the repos
> following his to use those pack files, pruning a ton of objects off of
> kernel.org.
>
> I have real users of my git repos who can't just download a 100MB pack file in
> an hour, it takes them many hours.
Argh.
Ok, I'm going to follow this up with three small patches that add a "-l"
flag to "git repack", which does only a "local repack" (ie it will pack
only objects that are _not_ in packs in alternate object directories).
That will hopefully mean that this usage case is supported too.
Linus
* Re: [kernel.org users] Re: auto-packing on kernel.org? please?
2005-10-13 21:23 ` [kernel.org users] " Linus Torvalds
@ 2005-10-16 14:33 ` Dirk Behme
2005-10-16 15:44 ` Daniel Barkalow
0 siblings, 1 reply; 33+ messages in thread
From: Dirk Behme @ 2005-10-16 14:33 UTC (permalink / raw)
To: Git Mailing List
> On Thu, 13 Oct 2005, Jeff Garzik wrote:
>
>>I have real users of my git repos who can't just download a 100MB pack file in
>>an hour, it takes them many hours.
Seems that I'm one of these users (but using another repo).
Pack files are very nice, saving bandwidth and disk space. But what I
dislike is that I often have to download the same information twice: the
remote .git/objects/* repo grows and I update my local repo daily against
it. Then once a month/release/whatever the .git/objects/* are packed into
one file. This new pack file is then downloaded as well, but most/all of
the information in this file is already in my local repo and is downloaded
again. Something like
- detect that there is a new pack file in the remote repo
- check what is in this remote pack file
- if no or only a few .git/objects/* are missing from the local repo,
download the missing ones and create an identical copy of the remote
pack file using local .git/objects/*. Don't download the remote pack file.
- remove all local .git/objects/* now in the pack file
would be nice.
Or is this already possible? Or do I misunderstand something?
Dirk
* Re: [kernel.org users] Re: auto-packing on kernel.org? please?
2005-10-16 14:33 ` Dirk Behme
@ 2005-10-16 15:44 ` Daniel Barkalow
2005-10-16 16:12 ` Nick Hengeveld
0 siblings, 1 reply; 33+ messages in thread
From: Daniel Barkalow @ 2005-10-16 15:44 UTC (permalink / raw)
To: Dirk Behme; +Cc: Git Mailing List
On Sun, 16 Oct 2005, Dirk Behme wrote:
> > On Thu, 13 Oct 2005, Jeff Garzik wrote:
> >
> > >I have real users of my git repos who can't just download a 100MB pack file
> > >in
> > >an hour, it takes them many hours.
>
> Seems that I'm one of these users (but using another repo).
>
> Pack files are very nice saving bandwidth and disk space. But what I dislike is
> that I often have to download same information twice: Remote .git/objects/*
> repo grows and I update my local repo daily against this. Then once a
> month/release/whatever .git/objects/* are packed into one file. This new pack
> file then is downloaded as well, but most/all of the information in this file
> is already in my local repo and downloaded again. Something like
>
> - detect that there is new pack file in remote repo
> - check what is in this remote pack file
> - if in local repo no or only few .git/objects/* are missing, download the
> missing ones and create an identical copy of remote pack file using local
> .git/objects/*. Don't download remote pack file.
This is the problem: it's impossible to download only a few objects from a
pack file from an HTTP server, because those don't exist on the server as
separate files.
The current HTTP code actually never downloads a pack file unless a needed
object is not anywhere else, at which point it has no choice but to
download the pack.
-Daniel
*This .sig left intentionally blank*
* Re: [kernel.org users] Re: auto-packing on kernel.org? please?
2005-10-16 15:44 ` Daniel Barkalow
@ 2005-10-16 16:12 ` Nick Hengeveld
2005-10-16 16:23 ` Brian Gerst
0 siblings, 1 reply; 33+ messages in thread
From: Nick Hengeveld @ 2005-10-16 16:12 UTC (permalink / raw)
To: Daniel Barkalow; +Cc: Dirk Behme, Git Mailing List
On Sun, Oct 16, 2005 at 11:44:46AM -0400, Daniel Barkalow wrote:
> This is the problem: it's impossible to download only a few objects from a
> pack file from an HTTP server, because those don't exist on the server as
> separate files.
Is it possible to determine the object locations inside the remote pack
file? If so, it would be possible to use Range: headers to download
selected objects from a pack.
--
For a successful technology, reality must take precedence over public
relations, for nature cannot be fooled.
* Re: [kernel.org users] Re: auto-packing on kernel.org? please?
2005-10-16 16:12 ` Nick Hengeveld
@ 2005-10-16 16:23 ` Brian Gerst
2005-10-16 16:56 ` Junio C Hamano
` (2 more replies)
0 siblings, 3 replies; 33+ messages in thread
From: Brian Gerst @ 2005-10-16 16:23 UTC (permalink / raw)
To: Nick Hengeveld; +Cc: Daniel Barkalow, Dirk Behme, Git Mailing List
Nick Hengeveld wrote:
> On Sun, Oct 16, 2005 at 11:44:46AM -0400, Daniel Barkalow wrote:
>
>> This is the problem: it's impossible to download only a few objects from a
>> pack file from an HTTP server, because those don't exist on the server as
>> separate files.
>
> Is it possible to determine the object locations inside the remote pack
> file? If so, it would be possible to use Range: headers to download
> selected objects from a pack.
>
Not possible because the entire pack is compressed.
--
Brian Gerst
* Re: [kernel.org users] Re: auto-packing on kernel.org? please?
2005-10-16 16:23 ` Brian Gerst
@ 2005-10-16 16:56 ` Junio C Hamano
2005-10-16 21:33 ` Nick Hengeveld
2005-10-16 17:10 ` Johannes Schindelin
2005-10-16 17:15 ` Brian Gerst
2 siblings, 1 reply; 33+ messages in thread
From: Junio C Hamano @ 2005-10-16 16:56 UTC (permalink / raw)
To: git
Brian Gerst <bgerst@didntduck.org> writes:
>> Is it possible to determine the object locations inside the remote
>> pack
>> file? If so, it would be possible to use Range: headers to download
>> selected objects from a pack.
That's what the .idx file is for, except that after you fetch
the range, you may find you would need something else that the
object is delta against.
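The byte arithmetic involved is simple: the .idx file records each object's offset into the pack, and consecutive offsets (in pack order) bound each object, giving the span for a Range: request. The helper below is purely illustrative -- `object_range` is not a git command, and the literal offsets in the example are made up:

```shell
# object_range: given pack offsets in pack order on stdin (one per
# line, as they would be extracted from a .idx file) and an object
# index i, print the "start-end" byte range for an HTTP Range request.
# The last object's range is left open-ended, since only the pack
# trailer follows it.
object_range () {
	awk -v i="$1" '
		{ off[NR - 1] = $1 }
		END {
			if (i + 1 < NR)
				printf "%d-%d\n", off[i], off[i + 1] - 1
			else
				printf "%d-\n", off[i]
		}
	'
}

# e.g.:	printf '12\n300\n4500\n' | object_range 1   -> 300-4499
#	curl -r "$(... | object_range 1)" http://host/repo.git/objects/pack/pack-....pack
```

Even with the right range in hand, the caveat above still applies: the fetched data may be a delta whose base requires a further range fetch.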
* Re: [kernel.org users] Re: auto-packing on kernel.org? please?
2005-10-16 16:23 ` Brian Gerst
2005-10-16 16:56 ` Junio C Hamano
@ 2005-10-16 17:10 ` Johannes Schindelin
2005-10-16 17:15 ` Brian Gerst
2 siblings, 0 replies; 33+ messages in thread
From: Johannes Schindelin @ 2005-10-16 17:10 UTC (permalink / raw)
To: Brian Gerst; +Cc: Nick Hengeveld, Daniel Barkalow, Dirk Behme, Git Mailing List
Hi,
On Sun, 16 Oct 2005, Brian Gerst wrote:
> Nick Hengeveld wrote:
> > On Sun, Oct 16, 2005 at 11:44:46AM -0400, Daniel Barkalow wrote:
> >
> > > This is the problem: it's impossible to download only a few objects from a
> > > pack file from an HTTP server, because those don't exist on the server as
> > > separate files.
> >
> > Is it possible to determine the object locations inside the remote pack
> > file? If so, it would be possible to use Range: headers to download
> > selected objects from a pack.
> >
>
> Not possible because the entire pack is compressed.
Maybe we should introduce an option which only packs objects of a minimal
age (something like "pack only objects 2 days and older")? This could be
used for auto-packing as long as HTTP is the preferred protocol, so that
if you update daily, you already have those objects.
Alternatively, git-prune-packed could have an option to prune only those
objects older than 2 days.
Ciao,
Dscho
* Re: [kernel.org users] Re: auto-packing on kernel.org? please?
2005-10-16 16:23 ` Brian Gerst
2005-10-16 16:56 ` Junio C Hamano
2005-10-16 17:10 ` Johannes Schindelin
@ 2005-10-16 17:15 ` Brian Gerst
2 siblings, 0 replies; 33+ messages in thread
From: Brian Gerst @ 2005-10-16 17:15 UTC (permalink / raw)
To: Nick Hengeveld; +Cc: Daniel Barkalow, Dirk Behme, Git Mailing List
Brian Gerst wrote:
> Nick Hengeveld wrote:
>
>> On Sun, Oct 16, 2005 at 11:44:46AM -0400, Daniel Barkalow wrote:
>>
>>> This is the problem: it's impossible to download only a few objects
>>> from a pack file from an HTTP server, because those don't exist on
>>> the server as separate files.
>>
>>
>> Is it possible to determine the object locations inside the remote pack
>> file? If so, it would be possible to use Range: headers to download
>> selected objects from a pack.
>>
>
> Not possible because the entire pack is compressed.
I should have looked at the source more closely before stating that.
Each object gets compressed individually, so this would be possible.
--
Brian Gerst
* Re: [kernel.org users] Re: auto-packing on kernel.org? please?
2005-10-16 16:56 ` Junio C Hamano
@ 2005-10-16 21:33 ` Nick Hengeveld
2005-10-16 22:12 ` Junio C Hamano
2005-10-17 19:13 ` Daniel Barkalow
0 siblings, 2 replies; 33+ messages in thread
From: Nick Hengeveld @ 2005-10-16 21:33 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
On Sun, Oct 16, 2005 at 09:56:49AM -0700, Junio C Hamano wrote:
> That's what the .idx file is for, except that after you fetch
> the range, you may find you would need something else that the
> object is delta against.
Would it make sense to load the pack indexes for each base up front,
and then fetch individual objects from a pack if they exist in one of
a base's pack indexes? In such a case, it may not even make sense to
try fetching the object directly first.
What are the circumstances under which it makes more sense to fetch the
whole pack rather than fetching individual objects from it?
--
For a successful technology, reality must take precedence over public
relations, for nature cannot be fooled.
* Re: [kernel.org users] Re: auto-packing on kernel.org? please?
2005-10-16 21:33 ` Nick Hengeveld
@ 2005-10-16 22:12 ` Junio C Hamano
2005-10-17 6:06 ` Nick Hengeveld
2005-10-17 19:13 ` Daniel Barkalow
1 sibling, 1 reply; 33+ messages in thread
From: Junio C Hamano @ 2005-10-16 22:12 UTC (permalink / raw)
To: Nick Hengeveld; +Cc: git
Nick Hengeveld <nickh@reactrix.com> writes:
> On Sun, Oct 16, 2005 at 09:56:49AM -0700, Junio C Hamano wrote:
>
>> That's what the .idx file is for, except that after you fetch
>> the range, you may find you would need something else that the
>> object is delta against.
>
> Would it make sense to load the pack indexes for each base up front,
> and then fetch individual objects from a pack if they exist in one of
> a base's pack indexes? In such a case, it may not even make sense to
> try fetching the object directly first.
>
> What are the circumstances under which it makes more sense to fetch the
> whole pack rather than fetching individual objects from it?
It would make sense if we end up needing most of them anyway, I
think.
We are probably far from this, but ideally, we should be able to
set up something like this.
We encourage the server side to prepare packs this way [*1*].
-- development --> time --> flows --> this --> way -->
(optional)
full ------------------------------------------------
base ---------------
6mo ---------------------------------
3mo -------------------
1mo --------------
2wk ----------
1wk -----
^
last pack optimization
That is, a big base pack (say v2.6.12), and multiple packs to
bring people that were in-sync at various times up-to-date to the
time when the set of packs was last optimized. Any objects
created after the last pack optimization time are left unpacked
until the next pack optimization time. It might not be a bad
idea to also have a "full" pack.
For example, if you were in-sync 5-months ago, fetching 3mo pack
would not be enough and you would need to get 6mo pack to become
up-to-date wrt the last pack optimization (say 3 days ago). You
would have obtained the objects not in pack, created within the
last 3 days, already as individual objects before realizing that
you would need to fetch some pack.
Then, we can teach git-http-fetch to do:
- If an object is unavailable unpacked, get all the indices
from that repository (and probably its alternates while we
are at it).
- Among the set of packs that contain the object we are
currently interested in, try to find the "best" pack. The
definition of "best" would be a balancing act of finding the
one that contains the least number of objects we already
have, and the one that contains the most number of objects we
do not have yet.
The commit walker always goes from present to past, so you would
start from fetching the latest, presumably unpacked objects, and
as soon as you hit the last pack optimization boundary, you have
choices of multiple packs. If you are relatively up-to-date,
you would find that 1mo pack has more things you already have
than 1wk pack, although both of them would fit the bill -- at
that point you choose to download 1wk pack. On the other hand,
if you are behind, you may find that 3mo pack has more things
you do not have than 1wk or 2wk or 1mo pack, and using 3mo pack
would become the right choice for you.
I think most repositories have a few related heads and their
heads almost never rewind, so favoring the pack that contains
the most number of objects we do not have would be the right
strategy in practice for the downloader.
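The "favor the pack with the most objects we do not have" rule amounts to a trivial selection over the candidate packs. A sketch, with an assumed input format of one pack per line as "name have missing" (counts of objects we already have and objects we lack):

```shell
# pick_pack: read candidate packs as "name have missing" lines and
# print the name of the pack offering the most objects we do not yet
# have (first one wins on a tie).
pick_pack () {
	awk '$3 > best { best = $3; name = $1 } END { if (name) print name }'
}

# e.g. an up-to-date repository might see:
#	printf '1wk 50 10\n3mo 20 400\n' | pick_pack
```

A fuller heuristic would also penalize packs full of objects we already have, as described above, but the shape of the decision is the same.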
[Footnote]
*1* This is different from a proposal posted on the list earlier
by somebody (I think it was Pasky but I may be mistaken) which
looked like this:
-- development --> time --> flows --> this --> way -->
base ---------------
6mo --------------
3mo -----
1mo ----
2wk -----
1wk -----
The thing is, the sum of the 3mo+1mo+2wk+1wk packs in the latter scheme
tends to be a lot bigger than the size of 3mo pack in the former
scheme.
* Re: [kernel.org users] Re: auto-packing on kernel.org? please?
2005-10-16 22:12 ` Junio C Hamano
@ 2005-10-17 6:06 ` Nick Hengeveld
2005-10-17 8:21 ` Junio C Hamano
0 siblings, 1 reply; 33+ messages in thread
From: Nick Hengeveld @ 2005-10-17 6:06 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
On Sun, Oct 16, 2005 at 03:12:03PM -0700, Junio C Hamano wrote:
> - Among the set of packs that contain the object we are
> currently interested in, try to find the "best" pack. The
> definition of "best" would be a balancing act of finding the
> one that contains the least number of objects we already
> have, and the one that contains the most number of objects we
> do not have yet.
To get a complete list of objects we do not have yet, fetch will need
to walk all the trees first and then make another pass to process
all the missing objects. Is it worth considering a case where the
missing objects are packed along with objects that don't need to be
transferred? From the use cases you described, it's not clear that
situation would ever really happen.
If the blobs have been packed, it seems likely that the tree objects will
also be packed, so fetching them during the first pass will either involve
fetching a pack without being able to determine which is best or fetching
the appropriate ranges from packs to get the tree objects.
--
For a successful technology, reality must take precedence over public
relations, for nature cannot be fooled.
* Re: [kernel.org users] Re: auto-packing on kernel.org? please?
2005-10-17 6:06 ` Nick Hengeveld
@ 2005-10-17 8:21 ` Junio C Hamano
2005-10-17 17:41 ` Nick Hengeveld
0 siblings, 1 reply; 33+ messages in thread
From: Junio C Hamano @ 2005-10-17 8:21 UTC (permalink / raw)
To: Nick Hengeveld; +Cc: git
Nick Hengeveld <nickh@reactrix.com> writes:
> To get a complete list of objects we do not have yet, fetch will need
> to walk all the trees first and then make another pass to process
> all the missing objects.
Notice I did not say "we do not have yet but we will need" -- I
just said "we do not have yet".
The assumption, which is the property the suggested packing
strategy has, is that older objects that are needed to complete
the history leading to the current tip are packed in those
n-month/n-week packs, so if we do not have them we would likely
be needing them, although we might not have walked that far back
in history yet.
The previous "packing strategy" picture was certainly too
simplified. Obviously we would not want to repack everything
every week for different periods all the way back -- we would
want to leave the old huge packs untouched to help the server side (and
mirroring), so instead of having a single "pack optimization
boundary", we would probably need some staggering as well for
archived material.
This is a revised example.
1yr -----
9mo --------
6mo ----------
3mo ------------------
1mo ------------
2wk --------
1wk ----
We keep track of "the current heads and tags" for each week.
Every week, we can do something like this:
- rotate the record, and create a new one:
	mv .save/wk11 .save/wk12
	mv .save/wk10 .save/wk11
	mv .save/wk9 .save/wk10
	...
	mv .save/wk0 .save/wk1
	find .git/refs -type f -print | xargs cat >.save/wk0
- prepare a pack to allow a single pack fetch to bring a
repository that had everything reachable from wk$N refs
up-to-date to the current, for selected recent weeks (say N=1,
2, 4, 12):
	for N in 1 2 4 12
	do
		name=$(git-rev-list --objects \
			$(sed -e 's/^/^/' .save/wk$N) \
			$(cat .save/wk0) |
			git-pack-objects pack) &&
		mv pack-$name.* .git/objects/pack/.
	done
remove the pack files that we created this way last week from
the repository (if the repository did not have any activity
during the last week, we would have created the same set of
packs -- make sure we do not remove them).
- except that we keep the longest-period pack (i.e. N=12 in this
example) once every N weeks (that's how the 1yr, 9mo, and 6mo packs in
the picture are kept).
This way, really old stuff (say, older than 3mo) will stay
intact and will not be repacked, so people reasonably up-to-date
(within 12 weeks in the example) need to fetch only one pack
(and unpacked objects since the last pack optimization), but
people without the ancient history need to go further back.
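The weekly rotation in the first step above can also be written as a loop instead of a cascade of mv's. A sketch only -- the `rotate_saves` name is made up, and the .save layout and 12-week depth just follow the example:

```shell
# rotate_saves: shift the weekly ref snapshots up by one slot
# (wk0 -> wk1, wk1 -> wk2, ...), freeing wk0 for this week's record.
# Works from the oldest slot down so nothing is overwritten too early.
rotate_saves () {
	dir=$1 depth=$2
	i=$depth
	while [ "$i" -gt 0 ]
	do
		prev=$((i - 1))
		if [ -f "$dir/wk$prev" ]; then
			mv "$dir/wk$prev" "$dir/wk$i"
		fi
		i=$prev
	done
}

# followed, as in the mail, by:
#	rotate_saves .save 12
#	find .git/refs -type f -print | xargs cat >.save/wk0
```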
* Re: [kernel.org users] Re: auto-packing on kernel.org? please?
2005-10-17 8:21 ` Junio C Hamano
@ 2005-10-17 17:41 ` Nick Hengeveld
2005-10-17 20:08 ` Junio C Hamano
0 siblings, 1 reply; 33+ messages in thread
From: Nick Hengeveld @ 2005-10-17 17:41 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
On Mon, Oct 17, 2005 at 01:21:14AM -0700, Junio C Hamano wrote:
> The assumption, which is the property the suggested packing
> strategy has, is that older objects that are needed to complete
> the history leading to the current tip are packed in those
> n-month/n-week packs, so if we do not have them we would likely
> be needing them, although we might not have walked that far back
> in history yet.
Gotcha - I'm still thinking in terms of content distribution, where
you only need a specific version of a tree to be available locally
and explicitly don't want to transfer history. In our case, using
packs doesn't make sense at the moment.
--
For a successful technology, reality must take precedence over public
relations, for nature cannot be fooled.
* Re: [kernel.org users] Re: auto-packing on kernel.org? please?
2005-10-16 21:33 ` Nick Hengeveld
2005-10-16 22:12 ` Junio C Hamano
@ 2005-10-17 19:13 ` Daniel Barkalow
1 sibling, 0 replies; 33+ messages in thread
From: Daniel Barkalow @ 2005-10-17 19:13 UTC (permalink / raw)
To: Nick Hengeveld; +Cc: Junio C Hamano, git
On Sun, 16 Oct 2005, Nick Hengeveld wrote:
> On Sun, Oct 16, 2005 at 09:56:49AM -0700, Junio C Hamano wrote:
>
> > That's what the .idx file is for, except that after you fetch
> > the range, you may find you would need something else that the
> > object is delta against.
>
> Would it make sense to load the pack indexes for each base up front,
> and then fetch individual objects from a pack if they exist in one of
> a base's pack indexes? In such a case, it may not even make sense to
> try fetching the object directly first.
At the start, you have the option of either fetching the list of packs or
the object. There are three cases:
1) the object isn't available separately; we need to fetch the list of
packs to find it in a pack.
2) there aren't any new packs; we need to fetch the object individually.
3) the object is present both individually and in a pack.
(2) is more common than (1), because we don't repack every update. (3)
doesn't happen at all, currently, because we prune after packing. So it
makes most sense to try the object at once.
On the other hand, the parallel code should probably do both at the same
time, since it can, and it only causes notable latency, not bandwidth. We
probably also ought to speculatively get any new index files in parallel
with whatever else we're doing, since it is likely that we'll need some
pack at some point, and then we'll need all the index files to decide what
pack to get.
> What are the circumstances under which it makes more sense to fetch the
> whole pack rather than fetching individual objects from it?
I'm not sure there's a good way of deciding without a plan for what
conditions cause there to be a choice.
-Daniel
*This .sig left intentionally blank*
* Re: [kernel.org users] Re: auto-packing on kernel.org? please?
2005-10-17 17:41 ` Nick Hengeveld
@ 2005-10-17 20:08 ` Junio C Hamano
2005-10-17 22:56 ` Daniel Barkalow
2005-10-17 23:54 ` Nick Hengeveld
0 siblings, 2 replies; 33+ messages in thread
From: Junio C Hamano @ 2005-10-17 20:08 UTC (permalink / raw)
To: Nick Hengeveld; +Cc: git
Nick Hengeveld <nickh@reactrix.com> writes:
> Gotcha - I'm still thinking in terms of content distribution, where
> you only need a specific version of a tree to be available locally
> and explicitly don't want to transfer history.
In other words, you'd want to also support CVS-like "working
tree has the specific version, and history is not kept here, but
available on demand, possibly over the network" mode of
operation. I'd say why not. We could aim to have "working tree
has the specific version and partial history of recent versions,
and the ancient history is available on demand, possibly over
the network" mode of operation.
It is somewhat different from the primary focus of what we have
been doing, but I think it is a natural extension. The
invariant is that once you have a ref pointing at a specific
commit, everything reachable from it ought to be available to
you.
And we have extended the definition of "available" over time.
Initially, you needed to have individual objects, and then we
made it so they could live in packs, and now they could even be
borrowed from another repository via alternates. We currently
do not consider "lazily fetchable over the network" as
"available", but I do not object too much to that, as long as it
is an optional feature.
This probably is a post 1.0 item, though. Off the top of my
head, we would need:
- a way for the user to say "unless I ask explicitly otherwise,
do not bother me if the commits older than these ones are
incomplete" -- an milder version of cauterizing commit chain
via info/grafts.
- a way for the user to say "this time I am explicitly
overriding the above -- I am interested in older history".
- change to fsck-objects, fetch- and probably upload-pack on
the other end, and commit walkers to honor the above two.
Most of these can probably be done by existing info/grafts
mechanism, but even then definitely would need a nicer user
interface.
Once this is in place, range requests to pick data for
individual objects from packs residing on a remote HTTP server
would start to make sense.
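The grafts-based version of the first two items could be sketched as below. The grafts file format is real -- git reads $GIT_DIR/info/grafts one record per line, and a line containing only a commit name gives that commit no parents -- but the helper names here are hypothetical:

```shell
# cauterize: make the given commit appear parentless to the history
# walkers by adding a grafts entry with no parents, hiding everything
# older ("do not bother me about commits older than this one").
cauterize () {
	git_dir=$1 commit=$2
	mkdir -p "$git_dir/info"
	echo "$commit" >>"$git_dir/info/grafts"
}

# uncauterize: drop the graft again -- "this time I am explicitly
# interested in older history".
uncauterize () {
	git_dir=$1 commit=$2
	grep -v "^$commit\$" "$git_dir/info/grafts" >"$git_dir/info/grafts.new"
	mv "$git_dir/info/grafts.new" "$git_dir/info/grafts"
}
```

The nicer user interface mentioned above would sit on top of exactly this kind of add/remove pair.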
* Re: [kernel.org users] Re: auto-packing on kernel.org? please?
2005-10-17 20:08 ` Junio C Hamano
@ 2005-10-17 22:56 ` Daniel Barkalow
2005-10-17 23:19 ` Linus Torvalds
2005-10-17 23:54 ` Nick Hengeveld
1 sibling, 1 reply; 33+ messages in thread
From: Daniel Barkalow @ 2005-10-17 22:56 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Nick Hengeveld, git
On Mon, 17 Oct 2005, Junio C Hamano wrote:
> Nick Hengeveld <nickh@reactrix.com> writes:
>
> > Gotcha - I'm still thinking in terms of content distribution, where
> > you only need a specific version of a tree to be available locally
> > and explicitly don't want to transfer history.
>
> In other words, you'd want to also support CVS-like "working
> tree has the specific version, and history is not kept here, but
> available on demand, possibly over the network" mode of
> operation. I'd say why not. We could aim to have "working tree
> has the specific version and partial history of recent versions,
> and the ancient history is available on demand, possibly over
> the network" mode of operation.
>
> It is somewhat different from the primary focus of what we have
> been doing, but I think it is a natural extension. The
> invariant is that once you have a ref pointing at a specific
> commit, everything reachable from it ought to be available to
> you.
Wouldn't "git fetch http://.../foo.git/ master^{tree}" do the right thing?
You get only the current tree, and write a ref to the tree instead of the
commit, maintaining the invariant. Of course, fetch.c needs a bit of work
so that it can fetch objects in the process of figuring out what
refspec it's really trying to fetch, but that should be simple
enough.
Of course, this really isolates you from the history, since you don't even
remember what the commit was that you've got the tree from, but that may
not be an issue in a pure content distribution setup. Also, a pack file of
a single tree isn't going to be terribly efficient, because pack files
mostly exploit the high similarity between different versions of the same
file.
My other idea is to have a file of things that you expect to be missing,
even though they are referenced, and where to expect to find them if
necessary. Then you could download the latest commit, mark its parents
(unless you have them) as known-missing, and write the ref.
-Daniel
*This .sig left intentionally blank*
* Re: [kernel.org users] Re: auto-packing on kernel.org? please?
2005-10-17 22:56 ` Daniel Barkalow
@ 2005-10-17 23:19 ` Linus Torvalds
0 siblings, 0 replies; 33+ messages in thread
From: Linus Torvalds @ 2005-10-17 23:19 UTC (permalink / raw)
To: Daniel Barkalow; +Cc: Junio C Hamano, Nick Hengeveld, git
On Mon, 17 Oct 2005, Daniel Barkalow wrote:
>
> Wouldn't "git fetch http://.../foo.git/ master^{tree}" do the right thing?
The pack pullers have trouble with anything that isn't commit-based,
because they do all the "figure out what we have in common" logic based on
the commit history.
So if you fetch a tree, it by definition doesn't _have_ any history, and
the pack pullers will always pack the whole tree. I think.
Linus
* Re: [kernel.org users] Re: auto-packing on kernel.org? please?
2005-10-17 20:08 ` Junio C Hamano
2005-10-17 22:56 ` Daniel Barkalow
@ 2005-10-17 23:54 ` Nick Hengeveld
1 sibling, 0 replies; 33+ messages in thread
From: Nick Hengeveld @ 2005-10-17 23:54 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
On Mon, Oct 17, 2005 at 01:08:03PM -0700, Junio C Hamano wrote:
> - a way for the user to say "unless I ask explicitly otherwise,
> do not bother me if the commits older than these ones are
> incomplete" -- a milder version of cauterizing the commit chain
> via info/grafts.
>
> - a way for the user to say "this time I am explicitly
> overriding the above -- I am interested in older history".
>
> - change to fsck-objects, fetch- and probably upload-pack on
> the other end, and commit walkers to honor the above two.
That's how I interpreted the -c and -a command-line arguments to the
commit walkers. git-fetch calls them with -a but we've been using -t
to only follow the tree objects and it's been working great.
Perhaps that would be a good way for the commit walker to decide whether
to transfer a full pack file - it may not make sense if it wasn't told
to get history.
--
For a successful technology, reality must take precedence over public
relations, for nature cannot be fooled.
* Re: auto-packing on kernel.org? please?
2005-10-13 18:44 auto-packing on kernel.org? please? Linus Torvalds
[not found] ` <434EABFD.5070604@zytor.com>
@ 2005-11-21 19:01 ` Carl Baldwin
2005-11-21 19:24 ` Linus Torvalds
1 sibling, 1 reply; 33+ messages in thread
From: Carl Baldwin @ 2005-11-21 19:01 UTC (permalink / raw)
To: Linus Torvalds; +Cc: H. Peter Anvin, Git Mailing List
I have a question about automatic repacking.
I am thinking of turning something like Linus' repacking heuristic loose
on my repositories. I just want to make sure it is as safe as possible.
At the core of the incremental and full repack strategies are these
statements.
Incremental...
> git repack &&
> git prune-packed
Full...
> git repack -a -d &&
> git prune-packed
Are there some built in safety checks in 'git repack' and/or 'git
prune-packed' to guard against corruption? In the long run, I would
feel more comfortable with something like this:
git repack
git verify-pack <new pack>
git prune-packed
Would something like this even work with 'git repack -a -d'? Is there a
way to do something like the following for a full repack to achieve the
ultimate in paranoia?
git repack -a
git verify-pack <new pack file>
git trash-redundant-packs <new pack file>
git prune-packed
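(Note that `git trash-redundant-packs` above is hypothetical; no such command exists. The same "verify before destroy" sequence can be approximated with commands that do exist. A self-contained sketch using a throwaway repository:)

```shell
#!/bin/sh
# Paranoid full repack: verify the new pack before deleting anything.
# "git trash-redundant-packs" does not exist; a second "repack -a -d"
# after verification stands in for it.  Throwaway repo for demonstration.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q demo && cd demo
git config user.email you@example.com
git config user.name Example
echo hello > file && git add file && git commit -qm initial

git repack -a -q                 # write the new pack, delete nothing yet
for idx in .git/objects/pack/*.idx
do
    git verify-pack "$idx"       # set -e aborts here on a corrupt pack
done
git repack -a -d -q              # verification passed: drop old packs
git prune-packed -q              # ...and the now-redundant loose objects
```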
Carl
On Thu, Oct 13, 2005 at 11:44:30AM -0700, Linus Torvalds wrote:
>
> I know we tried this once earlier, and it caused problems, but that was
> when pack-files were new, and not everybody could handle them. These days,
> if you can't handle pack-files, kernel.org is already pretty useless,
> because all the major packages use them anyway, because people have
> packed their repositories by hand.
>
> So I'm suggesting we try to do an automatic repack every once in a while.
>
> In my suggestion, there would be two levels of repacking: "incremental"
> and "full", and both of them would count the number of files before they
> run, so that you'd only do it when it seems worthwhile.
>
> This is a _really_ simple heuristic:
>
> - incremental repacking run every day:
>
> #
> # Check if we have more than a couple of hundred
> # unpacked objects - approximated by whether the
> # "00" fan-out directory contains at least one file.
> #
> # This means that we don't repack projects that
> # don't have a lot of work going on.
> #
> # Note: with really new versions of git, the "00"
> # directory may not exist if it has been pruned
> # away, so handle that gracefully.
> #
> export GIT_DIR=${1:-.}
> objs=$(find "$GIT_DIR/objects/00" -type f 2> /dev/null | wc -l)
> if [ "$obj" -gt 0 ]; then
> git repack &&
> git prune-packed
> fi
>
> - "full repack" every week if the number of packs has grown to be bigger
> than say 10 (ie even a very active project will never have a full
> repack more than every other week)
>
> #
> # Check if we have lots of packs, where "lots" is defined as 10.
> #
> # Note: with something that was generated with an old version
> # of git, the "pack" directory may not exist, so handle that
> # gracefully.
> #
> export GIT_DIR=${1:-.}
> packs=$(find "$GIT_DIR/objects/pack" -name '*.idx' 2> /dev/null | wc -l)
> if [ "$packs" -gt 10 ]; then
> git repack -a -d &&
> git prune-packed
> fi
>
> - do a full repack of everything once to start with.
>
> export GIT_DIR=${1:-.}
> git repack -a -d &&
> git prune-packed
>
> the above three trivial scripts just take a single argument, which becomes
> the GIT_DIR (and if no argument exists, it would default to ".")
>
> Is there any reason not to do this? Right now mirroring is slow, and
> webgit is also getting to be very slow sometimes. I bet we'd be _much_
> better off with this kind of setup.
>
> NOTE! The above is the "stupid" approach, which totally ignores alternate
> directories, and isn't able to take advantage of the fact that many
> projects could share objects. But it's simple, and it's efficient (eg it
> won't spend time on things like the large historic archives which don't
> change, but that would be expensive to repack if you didn't check for the
> need).
>
> So we could try to come up with a better approach eventually, which would
> automatically notice alternate directories and not repack stuff that
> exists there, but I'm pretty sure that the above would already help a
> _lot_, and while pack-files have been been around forever, the
> "alternates" support is still pretty new, so the above is also the "safer"
> thing to do.
>
> We'd only do the automatic thing on stuff under /pub/scm, of course: not
> stuff in peoples home directories etc..
>
> Peter?
>
> Linus
--
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Carl Baldwin Systems VLSI Laboratory
Hewlett Packard Company
MS 88 work: 970 898-1523
3404 E. Harmony Rd. work: Carl.N.Baldwin@hp.com
Fort Collins, CO 80525 home: Carl@ecBaldwin.net
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
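For reference, the three quoted scripts fold naturally into one cron-driven dispatcher. A sketch with the thresholds and layout taken from the quoted mail; the `daily`/`weekly`/`init` mode names are invented here:

```shell
#!/bin/sh
# Hypothetical cron dispatcher combining the three quoted scripts.
# Usage: repo-maint [daily|weekly|init] [git-dir]
set -e
mode=${1:-daily}
GIT_DIR=${2:-.}
export GIT_DIR

case "$mode" in
daily)
    # incremental: repack only if loose objects have piled up in "00"
    objs=$(find "$GIT_DIR/objects/00" -type f 2>/dev/null | wc -l)
    if [ "$objs" -gt 0 ]; then
        git repack && git prune-packed
    fi
    ;;
weekly)
    # full: repack only once "lots" (>10) of packs have accumulated
    packs=$(find "$GIT_DIR/objects/pack" -name '*.idx' 2>/dev/null | wc -l)
    if [ "$packs" -gt 10 ]; then
        git repack -a -d && git prune-packed
    fi
    ;;
init)
    # one-time full repack to start from a clean state
    git repack -a -d && git prune-packed
    ;;
*)
    echo "usage: $0 [daily|weekly|init] [git-dir]" >&2
    exit 1
    ;;
esac
```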
* Re: auto-packing on kernel.org? please?
2005-11-21 19:01 ` Carl Baldwin
@ 2005-11-21 19:24 ` Linus Torvalds
2005-11-21 19:58 ` Junio C Hamano
` (2 more replies)
0 siblings, 3 replies; 33+ messages in thread
From: Linus Torvalds @ 2005-11-21 19:24 UTC (permalink / raw)
To: Carl Baldwin; +Cc: H. Peter Anvin, Git Mailing List
On Mon, 21 Nov 2005, Carl Baldwin wrote:
>
> I have a question about automatic repacking.
>
> I am thinking of turning something like Linus' repacking heuristic loose
> on my repositories. I just want to make sure it is as safe as possible.
>
> At the core of the incremental and full repack strategies are these
> statements.
>
> Incremental...
> > git repack &&
> > git prune-packed
>
> Full...
> > git repack -a -d &&
> > git prune-packed
NOTE! Since that email, "git repack" has gotten a "local" option (-l),
which is very useful if the repositories have pointers to alternates.
So do
git repack -l
instead, to get much better packs (and "-a -d" for the full case, of
course).
Other than that, the old email suggestion should still be fine.
> Are there some built in safety checks in 'git repack' and/or 'git
> prune-packed' to guard against corruption? In the long run, I would
> feel more comfortable with something like this:
>
> git repack
> git verify-pack <new pack>
> git prune-packed
You can certainly do that if you are nervous. It might even be a good
idea: just for fun, I just did
git clone -l git git-clone
cd git-clone
# pick an object at random
rm .git/objects/f7/c3d39fe3db6da3a307da385a7a1cb563ed15f7
git repack -a -d
and it said:
error: Could not read f7c3d39fe3db6da3a307da385a7a1cb563ed15f7
fatal: bad tree object f7c3d39fe3db6da3a307da385a7a1cb563ed15f7
but then it created the pack _anyway_, and said:
Packing 27 objects
Pack pack-13bfca704078175c1c1c59964553b14f7b952651 created.
and happily removed all the old ones.
So right now, repacking a broken archive can actually break it even more.
NOTE! Your "git verify-pack" wouldn't even catch this: the _pack_ is fine,
it's just incomplete.
Of course, this only happens if the repository was broken to begin with,
so arguably it's not that bad. But it does show that git-repack should be
more careful and return an error more aggressively.
Can anybody tell me how to do that sanely? Right now we do
..
name=$(git-rev-list --objects $rev_list $(git-rev-parse $rev_parse) |
git-pack-objects --non-empty $pack_objects .tmp-pack) ||
exit 1
..
and the thing is, the "git-pack-objects" thing is happy, it's the
"git-rev-list" that fails. So because the last command in the pipeline
returns ok, we think it all is ok..
(This is one of the reasons I much prefer working in C over working in
shell: it may be twenty times more lines, but when you have a problem, the
fix is always obvious..)
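The pitfall is easy to reproduce in isolation: a pipeline reports only its last command's exit status, so an upstream failure (`git-rev-list` in the script above, modelled here by `false`) is invisible:

```shell
# A shell pipeline reports only its last command's exit status,
# so the failure of the first command is hidden:
false | cat
echo "plain pipeline exit: $?"    # prints 0 - the failure is invisible

# where available, bash's pipefail surfaces it:
bash -c 'set -o pipefail; false | cat; echo "with pipefail: $?"'
```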
Anyway, with that fixed, a "git repack" in many ways would be a mini-fsck,
so it should be very safe in general. Modulo any other bugs like the
above.
Linus
* Re: auto-packing on kernel.org? please?
2005-11-21 19:24 ` Linus Torvalds
@ 2005-11-21 19:58 ` Junio C Hamano
2005-11-21 20:38 ` Linus Torvalds
2005-11-22 5:26 ` Chuck Lever
2005-11-22 17:25 ` Carl Baldwin
2 siblings, 1 reply; 33+ messages in thread
From: Junio C Hamano @ 2005-11-21 19:58 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git
Linus Torvalds <torvalds@osdl.org> writes:
> Can anybody tell me how to do that sanely? Right now we do
>
> ..
> name=$(git-rev-list --objects $rev_list $(git-rev-parse $rev_parse) |
> git-pack-objects --non-empty $pack_objects .tmp-pack) ||
> exit 1
> ..
>
> and the thing is, the "git-pack-objects" thing is happy, it's the
> "git-rev-list" that fails. So because the last command in the pipeline
> returns ok, we think it all is ok..
One cop-out: do fsck-objects upfront before making a pack. This
would populate your buffer cache so it might not be a bad thing.
Alternatively:
name=$( {
git-rev-list --objects $rev_list $(git-rev-parse $rev_parse) ||
echo Gaaahhh
} | git-pack-objects --non-empty $pack_objects .tmp-pack)
* Re: auto-packing on kernel.org? please?
2005-11-21 19:58 ` Junio C Hamano
@ 2005-11-21 20:38 ` Linus Torvalds
2005-11-21 21:35 ` Junio C Hamano
0 siblings, 1 reply; 33+ messages in thread
From: Linus Torvalds @ 2005-11-21 20:38 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
On Mon, 21 Nov 2005, Junio C Hamano wrote:
>
> One cop-out: do fsck-objects upfront before making a pack. This
> would populate your buffer cache so it might not be a bad thing.
Well, it's extremely expensive most of the time. It's often as expensive
as the packing itself. So I don't like that option very much.
> Alternatively:
>
> name=$( {
> git-rev-list --objects $rev_list $(git-rev-parse $rev_parse) ||
> echo Gaaahhh
> } | git-pack-objects --non-empty $pack_objects .tmp-pack)
Actually, some dim memories prodded me to some man-page digging, and the
"pipefail" option in particular.
It seems to be a common option to both ksh and bash, so
set -o pipefail
seems like it should fix this. Sadly, I think it's pretty recent in bash
(ksh apparently got it in -93, bash seems to have gotten it only as of
version 3.0, which is definitely recent enough that we can't just assume
it).
[ Also, bash seems to have a variable called $PIPESTATUS, but that's
bash-specific (I don't know when it was enabled). ]
Anyway, doing a
set -o pipefail
should never be the wrong thing to do, but the problem is figuring out
whether the option is available or not, since if it isn't available, it's
considered an error ;/
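The usual way around "an unsupported option is itself an error" is to probe in a subshell first, so the failed `set -o` cannot affect the main script; a sketch:

```shell
# "set -o pipefail" is an error on shells that lack it, so probe for
# it in a subshell and enable it only where it exists:
if (set -o pipefail) 2>/dev/null
then
    set -o pipefail
    echo "pipefail enabled"
else
    echo "pipefail unsupported; fall back to another error check"
fi
```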
So with all that, how about we take your "Gaah" idea, and simplify it:
just pipe stderr too. That, together with making git-pack-objects tell
what garbage it got, actually does the right thing:
[torvalds@g5 git-clone]$ git repack -a -d
fatal: expected sha1, got garbage:
error: Could not read 7f59dbbb8f8d479c1d31453eac06ec765436a780
with this pretty simple patch.
Whaddaya think?
Linus
---
diff --git a/git-repack.sh b/git-repack.sh
index 4e16d34..c0f271d 100755
--- a/git-repack.sh
+++ b/git-repack.sh
@@ -41,7 +41,7 @@ esac
if [ "$local" ]; then
pack_objects="$pack_objects --local"
fi
-name=$(git-rev-list --objects $rev_list $(git-rev-parse $rev_parse) |
+name=$(git-rev-list --objects $rev_list $(git-rev-parse $rev_parse) 2>&1 |
git-pack-objects --non-empty $pack_objects .tmp-pack) ||
exit 1
if [ -z "$name" ]; then
diff --git a/pack-objects.c b/pack-objects.c
index 4e941e7..8864a31 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -524,7 +524,7 @@ int main(int argc, char **argv)
unsigned char sha1[20];
if (get_sha1_hex(line, sha1))
- die("expected sha1, got garbage");
+ die("expected sha1, got garbage:\n %s", line);
hash = 0;
p = line+40;
while (*p) {
* Re: auto-packing on kernel.org? please?
2005-11-21 20:38 ` Linus Torvalds
@ 2005-11-21 21:35 ` Junio C Hamano
0 siblings, 0 replies; 33+ messages in thread
From: Junio C Hamano @ 2005-11-21 21:35 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git
Linus Torvalds <torvalds@osdl.org> writes:
> ...just pipe stderr too. That, together with making git-pack-objects tell
> what garbage it got, actually does the right thing:
>
> [torvalds@g5 git-clone]$ git repack -a -d
> fatal: expected sha1, got garbage:
> error: Could not read 7f59dbbb8f8d479c1d31453eac06ec765436a780
>
> with this pretty simple patch.
>
> Whaddaya think?
Obviously the right thing to do ;-). I like it.
* Re: auto-packing on kernel.org? please?
2005-11-21 19:24 ` Linus Torvalds
2005-11-21 19:58 ` Junio C Hamano
@ 2005-11-22 5:26 ` Chuck Lever
2005-11-22 5:41 ` Linus Torvalds
2005-11-22 17:25 ` Carl Baldwin
2 siblings, 1 reply; 33+ messages in thread
From: Chuck Lever @ 2005-11-22 5:26 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Carl Baldwin, H. Peter Anvin, Git Mailing List
Linus Torvalds wrote:
>
> On Mon, 21 Nov 2005, Carl Baldwin wrote:
>
>>I have a question about automatic repacking.
>>
>>I am thinking of turning something like Linus' repacking heuristic loose
>>on my repositories. I just want to make sure it is as safe as possible.
>>
>>At the core of the incremental and full repack strategies are these
>>statements.
>>
>>Incremental...
>>
>>> git repack &&
>>> git prune-packed
>>
>>Full...
>>
>>> git repack -a -d &&
>>> git prune-packed
>
>
> NOTE! Since that email, "git repack" has gotten a "local" option (-l),
> which is very useful if the repositories have pointers to alternates.
>
> So do
>
> git repack -l
>
> instead, to get much better packs (and "-a -d" for the full case, of
> course).
>
> Other than that, the old email suggestion should still be fine.
i've been playing with "git repack" on StGIT-managed repositories.
on NFS, using packs instead of individual objects is quite a bit faster,
because a single NFS GETATTR will tell you if your NFS client's cached
pack file is still valid, whereas a whole bunch of GETATTRs are required
for validating individual object files.
there are some things repacking does that breaks StGIT, though.
git repack -d
seems to remove old commits that StGIT was still depending on.
git repack -a -n
seems to work fine with StGIT, as does
git prune-packed
i'm really interested in trying out the new command to remove redundant
objects and packs, but haven't gotten around to it yet.
* Re: auto-packing on kernel.org? please?
2005-11-22 5:26 ` Chuck Lever
@ 2005-11-22 5:41 ` Linus Torvalds
2005-11-22 14:13 ` Catalin Marinas
2005-11-22 18:18 ` Chuck Lever
0 siblings, 2 replies; 33+ messages in thread
From: Linus Torvalds @ 2005-11-22 5:41 UTC (permalink / raw)
To: Chuck Lever
Cc: Carl Baldwin, H. Peter Anvin, Git Mailing List, Catalin Marinas
On Tue, 22 Nov 2005, Chuck Lever wrote:
>
> there are some things repacking does that breaks StGIT, though.
>
> git repack -d
>
> seems to remove old commits that StGIT was still depending on.
If that is true, then "git-fsck-cache" probably also reports errors on a
StGIT repository. No? Basically, it implies that the tool doesn't know how
to find all the "heads".
Could somebody (Catalin?) perhaps tell how tools like git-fsck-cache and
git-repack could figure out which objects are still in use by stgit?
Preferably with some generic mechanism that _other_ projects (not just
stgit) might want to use?
The preferred way would be to just list the references somewhere under
.git/refs/stgit, in which case fsck and repack should pick them up
automatically (so clearly stgit doesn't do that right now ;).
It also implies that doing a "git prune" will do horribly bad things to a
stgit repo, since it would remove all the objects that it thinks aren't
reachable..
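The mechanism being suggested can be demonstrated with a throwaway repository: a commit with no ref pointing at it (like an unapplied StGIT patch) shows up as dangling and would be pruned, while recording it under `.git/refs/` (the `refs/stgit` location follows the suggestion above) makes it reachable again. A sketch:

```shell
#!/bin/sh
# Demo: an unreferenced commit is "dangling" and prunable; recording
# it under .git/refs/ protects it from fsck complaints and prune.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.email you@example.com
git config user.name Example
echo a > f && git add f && git commit -qm initial

tree=$(git rev-parse "HEAD^{tree}")
orphan=$(git commit-tree -m patch "$tree")      # commit nothing points at
git fsck | grep -q "dangling commit $orphan"    # prune would delete it

mkdir -p .git/refs/stgit                        # location suggested above
echo "$orphan" > .git/refs/stgit/patch1
[ -z "$(git fsck)" ]                            # nothing dangling any more
```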
> git repack -a -n
>
> seems to work fine with StGIT,
Well, it "works", but not "fine". Since it doesn't know about the stgit
objects, it won't ever pack them.
But maybe that's what stgit wants (since they are "temporary"), but it
does mean that if you see a big advantage from packing, you might be
losing some of it.
Linus
* Re: auto-packing on kernel.org? please?
2005-11-22 5:41 ` Linus Torvalds
@ 2005-11-22 14:13 ` Catalin Marinas
2005-11-22 17:05 ` Linus Torvalds
[not found] ` <7v64qkfwhe.fsf@assigned-by-dhcp.cox.net>
2005-11-22 18:18 ` Chuck Lever
1 sibling, 2 replies; 33+ messages in thread
From: Catalin Marinas @ 2005-11-22 14:13 UTC (permalink / raw)
To: Linus Torvalds
Cc: Chuck Lever, Carl Baldwin, H. Peter Anvin, Git Mailing List
On 22/11/05, Linus Torvalds <torvalds@osdl.org> wrote:
> On Tue, 22 Nov 2005, Chuck Lever wrote:
> > there are some things repacking does that breaks StGIT, though.
> >
> > git repack -d
> >
> > seems to remove old commits that StGIT was still depending on.
>
> If that is true, then "git-fsck-cache" probably also reports errors on a
> StGIT repository. No? Basically, it implies that the tool doesn't know how
> to find all the "heads".
Indeed, 'git repack -d' or 'git prune' might remove the patches which
are not applied since there is no link to them from .git/refs/.
> Could somebody (Catalin?) perhaps tell how tools like git-fsck-cache and
> git-repack could figure out which objects are still in use by stgit?
They don't figure this out at the moment. I initially thought about
implementing these commands in StGIT so that they would pass the
proper references.
> Preferably with some generic mechanism that _other_ projects (not just
> stgit) might want to use?
>
> The preferred way would be to just list the references somewhere under
> .git/refs/stgit, in which case fsck and repack should pick them up
> automatically (so clearly stgit doesn't do that right now ;).
I thought about adding .git/refs/patches/<branch>/* files
corresponding to every StGIT patch. Are the above git commands
looking at all depths in the .git/refs/ directory?
> > git repack -a -n
> >
> > seems to work fine with StGIT,
>
> Well, it "works", but not "fine". Since it doesn't know about the stgit
> objects, it won't ever pack them.
>
> But maybe that's what stgit wants (since they are "temporary"), but it
> does mean that if you see a big advantage from packing, you might be
> losing some of it.
The 'git repack -a' command would include the applied patches in the
newly created pack but leave out the unapplied ones. It would be even
better to leave all of them out since the StGIT patches are frequently
changed but an independent mechanism for this would complicate GIT -
'git repack' shouldn't pack any of the objects found in
.git/refs/patches/, even if they are reachable via .git/refs/heads/*
(and maybe call the patches directory something like
.git/refs/unpackable or volatile).
--
Catalin
* Re: auto-packing on kernel.org? please?
2005-11-22 14:13 ` Catalin Marinas
@ 2005-11-22 17:05 ` Linus Torvalds
[not found] ` <7v64qkfwhe.fsf@assigned-by-dhcp.cox.net>
1 sibling, 0 replies; 33+ messages in thread
From: Linus Torvalds @ 2005-11-22 17:05 UTC (permalink / raw)
To: Catalin Marinas
Cc: Chuck Lever, Carl Baldwin, H. Peter Anvin, Git Mailing List
On Tue, 22 Nov 2005, Catalin Marinas wrote:
>
> > The preferred way would be to just list the references somewhere under
> > .git/refs/stgit, in which case fsck and repack should pick them up
> > automatically (so clearly stgit doesn't do that right now ;).
>
> I thought about adding .git/refs/patches/<branch>/* files
> corresponding to every StGIT patch. Are the above git commands
> looking at all depths in the .git/refs/ directory?
Yes. Or at least they're supposed to. If they are not, it's a bug
regardless, and we'll fix it.
> The 'git repack -a' command would include the applied patches in the
> newly created pack but leave out the unapplied ones. It would be even
> better to leave all of them out since the StGIT patches are frequently
> changed but an independent mechanism for this would complicate GIT -
> 'git repack' shouldn't pack any of the objects found in
> .git/refs/patches/, even if they are reachable via .git/refs/heads/*
> (and maybe call the patches directory something like
> .git/refs/unpackable or volatile).
If we have some default location (and .git/refs/patches/ sounds good), we
can make git do the right thing - find them for git-fsck-objects, and
ignore them for git-repack.
Linus
* Re: auto-packing on kernel.org? please?
2005-11-21 19:24 ` Linus Torvalds
2005-11-21 19:58 ` Junio C Hamano
2005-11-22 5:26 ` Chuck Lever
@ 2005-11-22 17:25 ` Carl Baldwin
2005-11-22 17:58 ` Linus Torvalds
2 siblings, 1 reply; 33+ messages in thread
From: Carl Baldwin @ 2005-11-22 17:25 UTC (permalink / raw)
To: Linus Torvalds; +Cc: H. Peter Anvin, Git Mailing List
On Mon, Nov 21, 2005 at 11:24:11AM -0800, Linus Torvalds wrote:
> NOTE! Since that email, "git repack" has gotten a "local" option (-l),
> which is very useful if the repositories have pointers to alternates.
>
> So do
>
> git repack -l
>
> instead, to get much better packs (and "-a -d" for the full case, of
> course).
I'm assuming that this option will have no effect on a repository with
no alternates file.
> Other that than, the old email suggestion should still be fine.
[snip]
> You can certainly do that if you are nervous. It might even be a good
> idea: just for fun, I just did
>
> git clone -l git git-clone
> cd git-clone
>
> # pick an object at random
> rm .git/objects/f7/c3d39fe3db6da3a307da385a7a1cb563ed15f7
>
> git repack -a -d
>
> and it said:
>
> error: Could not read f7c3d39fe3db6da3a307da385a7a1cb563ed15f7
> fatal: bad tree object f7c3d39fe3db6da3a307da385a7a1cb563ed15f7
>
> but then it created the pack _anyway_, and said:
>
> Packing 27 objects
> Pack pack-13bfca704078175c1c1c59964553b14f7b952651 created.
>
> and happily removed all the old ones.
>
> So right now, repacking a broken archive can actually break it even more.
Interesting.
> NOTE! Your "git verify-pack" wouldn't even catch this: the _pack_ is fine,
> it's just incomplete.
In my opinion, git repack did the right thing in creating the pack even
if the result is more broken.  Starting with a broken repository was the real
problem. git repack shouldn't need to worry too much about it.
Looking at it from the nervous repository admin's point of view I think
he would want to make sure that the repository is good to begin with. I
think this should be left up to the repository owner and maybe not git
repack. Although, the check that you do following this is probably a
good idea.
> Of course, this only happens if the repository was broken to begin with,
> so arguably it's not that bad. But it does show that git-repack should be
> more careful and return an error more aggressively.
>
> Can anybody tell me how to do that sanely? Right now we do
>
> ..
> name=$(git-rev-list --objects $rev_list $(git-rev-parse $rev_parse) |
> git-pack-objects --non-empty $pack_objects .tmp-pack) ||
> exit 1
> ..
>
> and the thing is, the "git-pack-objects" thing is happy, it's the
> "git-rev-list" that fails. So because the last command in the pipeline
> returns ok, we think it all is ok..
>
> (This is one of the reasons I much prefer working in C over working in
> shell: it may be twenty times more lines, but when you have a problem, the
> fix is always obvious..)
>
> Anyway, with that fixed, a "git repack" in many ways would be a mini-fsck,
> so it should be very safe in general. Modulo any other bugs like the
> above.
>
> Linus
*NOTE* There is one question that I feel remains unanswered. Is it
possible to split up the repack -a and repack -d so that the nervous
repository owner can insert a git verify-pack in the middle?
I'm not nearly this nervous about repositories that I keep for myself
but I have ownership of some repositories on which many people may
depend. I will feel better if I can verify the pack separately from
git-repack before I do the (potentially destructive) -d to remove old
packs.
I don't mean to say that I don't trust git repack to do the right thing.
Fundamentally, I just think that I shouldn't depend on it to do the
right thing in order to avoid corruption in my repository.
Carl
PS I love that the git object store is designed so that object files
never *need* to be removed, renamed, modified or otherwise touched in
any way after being written to disk. I think this makes git inherently
extremely safe from corruption unlike many other older repository
designs. The only thing that breaks this inherent safety is the desire
to pack repositories to avoid bloat.
That is why I want to be a little paranoid when I do the repacking. I
want to maintain some inherent safety in the process that I use to pack
them. This kind of inherent safety is much more valuable than even the
highest quality code written to actually do the packing.
* Re: auto-packing on kernel.org? please?
2005-11-22 17:25 ` Carl Baldwin
@ 2005-11-22 17:58 ` Linus Torvalds
0 siblings, 0 replies; 33+ messages in thread
From: Linus Torvalds @ 2005-11-22 17:58 UTC (permalink / raw)
To: Carl Baldwin; +Cc: H. Peter Anvin, Git Mailing List
On Tue, 22 Nov 2005, Carl Baldwin wrote:
> On Mon, Nov 21, 2005 at 11:24:11AM -0800, Linus Torvalds wrote:
> > NOTE! Since that email, "git repack" has gotten a "local" option (-l),
> > which is very useful if the repositories have pointers to alternates.
> >
> > So do
> >
> > git repack -l
> >
> > instead, to get much better packs (and "-a -d" for the full case, of
> > course).
>
> I'm assuming that this option will have no effect on a repository with
> no alternates file.
Correct.
The only thing it does is that when it looks up an object, if it's not in
our _own_ ".git/objects/" dir, it won't pack it.
Actually, that's not entirely true. It isn't smart enough to know where
every object exists, so it only knows about remote _packs_. So what
happens is that if you do
git repack -l -a -d
it will create a pack-file that contains _all_ unpacked objects (whether
local or not) and all objects that are in local packs (because of the
"-a"), but not any objects that are in "alternate packs".
Which is actually exactly what you want, if you are in the situation that
kernel.org is, and you have people who point their alternates to mine:
when I repack my objects, they'll use my packs, but other than that,
they'll prefer to use their own packs over any unpacked objects.
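That behaviour can be seen with a shared clone (`git clone -s`, which sets up the alternates pointer) plus a local commit; a self-contained sketch:

```shell
#!/bin/sh
# Demo of "git repack -l" with alternates: a shared clone borrows the
# upstream pack, and -l keeps borrowed objects out of the local repack.
set -e
tmp=$(mktemp -d) && cd "$tmp"

git init -q upstream
(
    cd upstream
    git config user.email you@example.com
    git config user.name Example
    echo a > f && git add f && git commit -qm one
    git repack -a -d -q          # upstream history lives in a pack
)

git clone -q -s upstream borrower   # -s points alternates at upstream
cd borrower
git config user.email you@example.com
git config user.name Example
echo b >> f && git commit -qam two  # purely local history

git repack -l -a -d -q              # packs only what upstream lacks
git count-objects -v                # in-pack counts local objects only
```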
> > So right now, repacking a broken archive can actually break it even more.
>
> Interesting.
Well, with the latest git repack script, that should no longer be true.
> > NOTE! Your "git verify-pack" wouldn't even catch this: the _pack_ is fine,
> > it's just incomplete.
>
> In my opinion, git repack did the right thing in creating the pack even
> if it is more broken. Starting with a broken repository was the real
> problem. git repack shouldn't need to worry too much about it.
Well, "git repack" did the wrong thing in that it never _noticed_, and it
then removed all old packs - even though those old packs contained objects
that we hadn't repacked because of the broken repository.
Of course, _usually_ a broken repository is just that - broken. The way
you fix a broken repo is to find a non-broken one, and clone that.
However, sometimes what you can do (if you literally just lost a few
objects) is to find a non-broken repo, and make that the _alternates_, in
which case you may be able to save any work you had in the broken one
(assuming you only lost objects that were available somewhere else).
> Looking at it from the nervous repository admin's point of view I think
> he would want to make sure that the repository is good to begin with.
Doing an fsck is certainly always a good idea. I do a "shallow" fsck
usually several times a day ("shallow" means that it doesn't fsck packs,
only new objects that I have acquired since the last repacking), and I do a
full fsck a couple of times a week.
I don't actually know why I do that, though. I don't think I've really
_ever_ had a broken repo since some very early days, except for the cases
where I break things on purpose (like remove an object to check whether
"git repack" does the right thing or not). I'm just used to it, and the
shallow fsck takes a fraction of a second, so I tend to do it after each
pull.
So I really think that an admin has to be more than "nervous" to worry
about it. He has to be really anal.
(Now, doing a repack and a fsck every week or so might be good, and
automatic shallow fsck's daily is probably a great idea too. After all, it
_is_ checking checksums, so if you worry about security and want to make
sure that nobody is trying to break in and do bad things to your repo, a
regular fsck is a good thing even if you're not otherwise worried about
corruption).
> *NOTE* There is one question that I feel remains unanswered. Is it
> possible to split up the repack -a and repack -d so that the nervous
> repository owner can insert a git verify-pack in the middle?
They are already split up inside "git-repack", so we could add a hook
there, I guess. See the git-repack.sh file, and notice how it does the
"remove_redundant" part only after it has created the new pack-file and
done a "sync".
> I don't mean to say that I don't trust git repack to do the right thing.
> Fundamentally, I just think that I shouldn't depend on it to do the
> right thing in order to avoid corruption in my repository.
That's good. However, as the previous failure of git repack showed, to
some degree the more likely failure mode is actually that the pack
generated by "git repack" is perfectly fine, but it's not _complete_. Say
we have a bug in git repack, for example.
Another case where it's not complete is when you have deleted a branch.
"git repack -a -d" will effectively do a "git prune" wrt objects that are
no longer reachable, and that were in the old packs.
So I'd actually suggest a slightly different approach. When-ever you
remove old objects (whether it's "git prune" or "git prune-packed" or "git
repack -a -d"), you might want to have an option that doesn't actually
_remove_ them, but just moves them into ".git/attic" or something like
that.
Then you can clean up the attic after doing your weekly full fsck or
something. And it has the advantage that if somebody has deleted a branch,
and notices later that maybe he wanted that branch back, you can "unprune"
all the objects, run "git-fsck-objects --full" to find any dangling
commits, and you'll have all your branches back.
So in many ways it would perhaps be nicer to have that kind of "safe
remove" option to the pruning commands?
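A minimal sketch of such a "safe remove", built only from commands that exist today. The attic layout and the parsing of "git prune -n" dry-run output are assumptions; no such option exists in git itself.

```shell
#!/bin/sh
# Hypothetical "safe remove": instead of letting "git prune" delete
# unreachable loose objects, move them into .git/attic.

prune_to_attic() {
	gitdir=$1
	# "git prune -n" is a dry run: it prints "<sha1> <type>" for
	# each loose object it would remove, deleting nothing.
	GIT_DIR=$gitdir git prune -n | while read sha1 type
	do
		dir=$(echo "$sha1" | cut -c1-2)
		file=$(echo "$sha1" | cut -c3-)
		mkdir -p "$gitdir/attic/$dir" &&
		mv "$gitdir/objects/$dir/$file" "$gitdir/attic/$dir/$file"
	done
}

# Demo: create an unreachable blob, then move it to the attic.
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" commit -q --allow-empty -m init
sha=$(echo junk | git -C "$repo" hash-object -w --stdin)
prune_to_attic "$repo/.git"
test -f "$repo/.git/attic/$(echo "$sha" | cut -c1-2)/$(echo "$sha" | cut -c3-)" &&
	echo "object moved to attic"
```

To "unprune", you would move the attic subdirectories back under objects/ and run a full fsck to rediscover the dangling commits.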
Linus
* Re: auto-packing on kernel.org? please?
2005-11-22 5:41 ` Linus Torvalds
2005-11-22 14:13 ` Catalin Marinas
@ 2005-11-22 18:18 ` Chuck Lever
2005-11-23 14:18 ` Catalin Marinas
1 sibling, 1 reply; 33+ messages in thread
From: Chuck Lever @ 2005-11-22 18:18 UTC (permalink / raw)
To: Linus Torvalds
Cc: Carl Baldwin, H. Peter Anvin, Git Mailing List, Catalin Marinas
Linus Torvalds wrote:
>
> On Tue, 22 Nov 2005, Chuck Lever wrote:
>
>>there are some things repacking does that breaks StGIT, though.
>>
>>git repack -d
>>
>>seems to remove old commits that StGIT was still depending on.
>
>
> If that is true, then "git-fsck-cache" probably also reports errors on a
> StGIT repository. No? Basically, it implies that the tool doesn't know how
> to find all the "heads".
indeed. this is one area where StGIT is "not safe" to use with other
porcelains. these raw GIT commands can show a bunch of confusing
"dangling references" type errors, or actually modify the index in ways
that eliminate StGIT-related commits that aren't currently attached to
any ancestry. (i think Catalin mentioned these are related to the
unapplied patches in a stack, but there could be others; see below).
> The preferred way would be to just list the references somewhere under
> .git/refs/stgit, in which case fsck and repack should pick them up
> automatically (so clearly stgit doesn't do that right now ;).
that could be an extremely large number of commits on a large repository
with a lot of patches that have been worked on over a long period. so
whatever mechanism is created to do this needs to scale well in the
number of commits.
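For what it's worth, the refs-based approach can be demonstrated with plain plumbing. This is a sketch: the ref naming (refs/stgit/<patch>/top) is invented here, and StGIT does not actually do this.

```shell
#!/bin/sh
# Sketch: anchor an otherwise-unreachable commit with a ref under
# refs/stgit/ so fsck, prune and repack treat it as reachable.

repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" commit -q --allow-empty -m init

# Fabricate a commit that no branch points at, standing in for an
# unapplied patch's top commit.
head=$(git -C "$repo" rev-parse HEAD)
tree=$(git -C "$repo" rev-parse 'HEAD^{tree}')
patch_top=$(git -C "$repo" commit-tree "$tree" -p "$head" -m "patch top")

# Without a ref, "git prune" would consider it garbage ...
git -C "$repo" prune -n | grep -q "$patch_top" && echo "unreachable before"

# ... but one update-ref makes it visible to all the core tools.
git -C "$repo" update-ref refs/stgit/mypatch/top "$patch_top"
git -C "$repo" prune -n | grep -q "$patch_top" || echo "reachable after"
```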
> It also implies that doing a "git prune" will do horribly bad things to a
> stgit repo, since it would remove all the objects that it thinks aren't
> reachable..
yup. been there, done that. lucky for me i have an excellent hourly
backup scheme.
>>git repack -a -n
>>
>>seems to work fine with StGIT,
>
>
> Well, it "works", but not "fine". Since it doesn't know about the stgit
> objects, it won't ever pack them.
ah!
> But maybe that's what stgit wants (since they are "temporary"), but it
> does mean that if you see a big advantage from packing, you might be
> losing some of it.
actually, those commits aren't all that "temporary". the
history/revision feature i'm working on would like to maintain all the
commits ever done to an StGIT patch.
the only time you can throw away such commits is when the patch is
deleted or when it is finally committed to the repository via "stg
commit". otherwise, keeping these commits in a pack would be quite a
good thing.
maybe the first thing to do is to get a basic understanding of an StGIT
commit's lifetime.
* Re: auto-packing on kernel.org? please?
[not found] ` <7v1x18eddp.fsf@assigned-by-dhcp.cox.net>
@ 2005-11-23 14:10 ` Catalin Marinas
0 siblings, 0 replies; 33+ messages in thread
From: Catalin Marinas @ 2005-11-23 14:10 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Git Mailing List
On 22/11/05, Junio C Hamano <junkio@cox.net> wrote:
> Catalin Marinas <catalin.marinas@gmail.com> writes:
>
> > What I meant is any object whose exact reference is found in
> > refs/patches (not reachable via refs/patches), even if it is reachable
> > from refs/heads.
>
> do you mean you
> keep blobs and trees in refs/patches, or "exactly found in
> refs/patches" imply "commits in refs/patches and trees and blobs
> reachable from it"? If the latter I think it amounts to the
> same thing. If some of the blobs are shared with what is
> reachable from refs/heads or refs/tags I would presume you would
> want to pack them.
Each patch needs two commit objects and two tree objects (with the
corresponding blobs). I now understand where the problem appears. Most
of the blobs should actually be packed since they are part of the base
of the stack.
Since refs/heads files always point to the top of the stack, the
applied patches (the corresponding objects) would be automatically
packed. The alternative would be to only pack the objects reachable
from refs/bases but that's really StGIT-specific.
Another algorithm would be to avoid packing objects reachable from
refs/patches but not reachable from refs/bases but this would probably
complicate GIT.
> And the "volatile" idea may be a good way of doing this.
> Perhaps "git repack --volatile <glob>" to name paths under
> .git/refs to mark things not to be packed, with a per-repository
> configuration item to give default 'volatile' patterns? I could
> use it when packing my repository to exclude things that are
> only reachable from "pu" branch.
After I eventually understood what you meant, the above would still
include the already applied StGIT patches since they are reachable via
HEAD. Maybe StGIT could avoid modifying refs/heads but I think it
would lose some benefits.
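A rough approximation of the proposed "repack --volatile <glob>" is possible with existing plumbing: enumerate only the objects reachable from refs that do NOT match the volatile pattern, and pack just those. The "refs/patches" glob is the StGIT example from this thread; rev-list's --exclude needs a reasonably modern git, and the scratch-repo demo is illustrative.

```shell
#!/bin/sh
# Sketch: pack only what is reachable from non-volatile refs;
# objects reachable solely from refs/patches/* stay loose.

repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" commit -q --allow-empty -m init
git -C "$repo" update-ref refs/patches/p1 HEAD

# List objects reachable from every ref except refs/patches/* and
# feed the list to pack-objects on stdin.
git -C "$repo" rev-list --objects --exclude='refs/patches/*' --all |
	git -C "$repo" pack-objects "$repo/.git/objects/pack/pack" > /dev/null
echo "packed non-volatile objects"
```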
--
Catalin
* Re: auto-packing on kernel.org? please?
2005-11-22 18:18 ` Chuck Lever
@ 2005-11-23 14:18 ` Catalin Marinas
0 siblings, 0 replies; 33+ messages in thread
From: Catalin Marinas @ 2005-11-23 14:18 UTC (permalink / raw)
To: cel; +Cc: Linus Torvalds, Carl Baldwin, H. Peter Anvin, Git Mailing List
On 22/11/05, Chuck Lever <cel@citi.umich.edu> wrote:
> Linus Torvalds wrote:
> > But maybe that's what stgit wants (since they are "temporary"), but it
> > does mean that if you see a big advantage from packing, you might be
> > losing some of it.
>
> actually, those commits aren't all that "temporary". the
> history/revision feature i'm working on would like to maintain all the
> commits ever done to an StGIT patch.
That's to avoid pruning them, but you might not always want to add them
to a pack.
> the only time you can throw away such commits is when the patch is
> deleted or when it is finally committed to the repository via "stg
> commit". otherwise, keeping these commits in a pack would be quite a
> good thing.
>
> maybe the first thing to do is to get a basic understanding of an StGIT
> commit's lifetime.
My initial idea was to throw the old commit away once a patch is
refreshed. Even if you want to preserve the history, it would be only
preserved until you send the patch to be merged upstream and you would
delete it locally. If all the patches are meant to be sent upstream at
some point, you can avoid packing them.
--
Catalin
end of thread
Thread overview: 33+ messages
2005-10-13 18:44 auto-packing on kernel.org? please? Linus Torvalds
[not found] ` <434EABFD.5070604@zytor.com>
[not found] ` <434EC07C.30505@pobox.com>
2005-10-13 21:23 ` [kernel.org users] " Linus Torvalds
2005-10-16 14:33 ` Dirk Behme
2005-10-16 15:44 ` Daniel Barkalow
2005-10-16 16:12 ` Nick Hengeveld
2005-10-16 16:23 ` Brian Gerst
2005-10-16 16:56 ` Junio C Hamano
2005-10-16 21:33 ` Nick Hengeveld
2005-10-16 22:12 ` Junio C Hamano
2005-10-17 6:06 ` Nick Hengeveld
2005-10-17 8:21 ` Junio C Hamano
2005-10-17 17:41 ` Nick Hengeveld
2005-10-17 20:08 ` Junio C Hamano
2005-10-17 22:56 ` Daniel Barkalow
2005-10-17 23:19 ` Linus Torvalds
2005-10-17 23:54 ` Nick Hengeveld
2005-10-17 19:13 ` Daniel Barkalow
2005-10-16 17:10 ` Johannes Schindelin
2005-10-16 17:15 ` Brian Gerst
2005-11-21 19:01 ` Carl Baldwin
2005-11-21 19:24 ` Linus Torvalds
2005-11-21 19:58 ` Junio C Hamano
2005-11-21 20:38 ` Linus Torvalds
2005-11-21 21:35 ` Junio C Hamano
2005-11-22 5:26 ` Chuck Lever
2005-11-22 5:41 ` Linus Torvalds
2005-11-22 14:13 ` Catalin Marinas
2005-11-22 17:05 ` Linus Torvalds
[not found] ` <7v64qkfwhe.fsf@assigned-by-dhcp.cox.net>
[not found] ` <b0943d9e0511220946o3b62842ey@mail.gmail.com>
[not found] ` <7v1x18eddp.fsf@assigned-by-dhcp.cox.net>
2005-11-23 14:10 ` Catalin Marinas
2005-11-22 18:18 ` Chuck Lever
2005-11-23 14:18 ` Catalin Marinas
2005-11-22 17:25 ` Carl Baldwin
2005-11-22 17:58 ` Linus Torvalds