* auto-packing on kernel.org? please?
@ 2005-10-13 18:44 Linus Torvalds
  [not found] ` <434EABFD.5070604@zytor.com>
  2005-11-21 19:01 ` Carl Baldwin

From: Linus Torvalds @ 2005-10-13 18:44 UTC (permalink / raw)
To: H. Peter Anvin; +Cc: Git Mailing List

I know we tried this once earlier, and it caused problems, but that was
when pack-files were new and not everybody could handle them. These days,
if you can't handle pack-files, kernel.org is already pretty useless,
because all the major packages use them anyway: people have packed their
repositories by hand.

So I'm suggesting we try to do an automatic repack every once in a while.

In my suggestion, there would be two levels of repacking: "incremental"
and "full", and both of them would count the number of files before they
run, so that you'd only do it when it seems worthwhile.

This is a _really_ simple heuristic:

 - incremental repacking, run every day:

	#
	# Check if we have more than a couple of hundred
	# unpacked objects - approximated by whether we
	# have any "00" directory with more than one
	#
	# This means that we don't repack projects that
	# don't have a lot of work going on.
	#
	# Note: with really new versions of git, the "00"
	# directory may not exist if it has been pruned
	# away, so handle that gracefully.
	#
	export GIT_DIR=${1:-.}
	objs=$(find "$GIT_DIR/objects/00" -type f 2> /dev/null | wc -l)
	if [ "$objs" -gt 0 ]; then
		git repack &&
		git prune-packed
	fi

 - "full repack" every week if the number of packs has grown bigger than,
   say, 10 (i.e. even a very active project will never have a full repack
   more than every other week):

	#
	# Check if we have lots of packs, where "lots" is defined as 10.
	#
	# Note: with something that was generated with an old version
	# of git, the "pack" directory may not exist, so handle that
	# gracefully.
	#
	export GIT_DIR=${1:-.}
	packs=$(find "$GIT_DIR/objects/pack" -name '*.idx' 2> /dev/null | wc -l)
	if [ "$packs" -gt 10 ]; then
		git repack -a -d &&
		git prune-packed
	fi

 - do a full repack of everything once to start with:

	export GIT_DIR=${1:-.}
	git repack -a -d &&
	git prune-packed

The above three trivial scripts just take a single argument, which becomes
the GIT_DIR (and if no argument is given, it defaults to ".").

Is there any reason not to do this? Right now mirroring is slow, and
gitweb is also getting to be very slow sometimes. I bet we'd be _much_
better off with this kind of setup.

NOTE! The above is the "stupid" approach, which totally ignores alternate
directories, and isn't able to take advantage of the fact that many
projects could share objects. But it's simple, and it's efficient (e.g. it
won't spend time on things like the large historic archives which don't
change, but which would be expensive to repack if you didn't check for the
need).

So we could try to come up with a better approach eventually, which would
automatically notice alternate directories and not repack stuff that
exists there, but I'm pretty sure that the above would already help a
_lot_, and while pack-files have been around forever, the "alternates"
support is still pretty new, so the above is also the "safer" thing to do.

We'd only do the automatic thing on stuff under /pub/scm, of course: not
stuff in people's home directories etc..

	Peter?

		Linus

^ permalink raw reply	[flat|nested] 33+ messages in thread
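[Editorial note: the two heuristics above fold naturally into one
cron-able wrapper. The following is a hypothetical sketch, not a script
from the thread; it reuses the thresholds from the message and assumes
the space-separated `git repack` / `git prune-packed` spellings.]

```shell
#!/bin/sh
# auto-repack.sh -- sketch of the daily/weekly heuristic above.
# Usage: auto-repack.sh [GIT_DIR]   (defaults to ".")

# Pure decision helper: $1 = loose-object count (from the "00" fan-out
# directory, a proxy for ~256x that many), $2 = pack count.
repack_mode() {
	if [ "$2" -gt 10 ]; then
		echo full		# weekly-style full repack
	elif [ "$1" -gt 0 ]; then
		echo incremental	# daily-style incremental repack
	else
		echo none
	fi
}

GIT_DIR=${1:-.}
export GIT_DIR
loose=$(find "$GIT_DIR/objects/00" -type f 2>/dev/null | wc -l)
packs=$(find "$GIT_DIR/objects/pack" -name '*.idx' 2>/dev/null | wc -l)

case $(repack_mode "$loose" "$packs") in
full)        git repack -a -d && git prune-packed ;;
incremental) git repack && git prune-packed ;;
esac
```

Separating the decision from the action keeps the threshold logic
testable without touching a real repository.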
[parent not found: <434EABFD.5070604@zytor.com>]
[parent not found: <434EC07C.30505@pobox.com>]
* Re: [kernel.org users] Re: auto-packing on kernel.org? please?
From: Linus Torvalds @ 2005-10-13 21:23 UTC (permalink / raw)
To: Jeff Garzik; +Cc: H. Peter Anvin, users, Git Mailing List, Junio C Hamano

On Thu, 13 Oct 2005, Jeff Garzik wrote:
>
> Right now, things go through an expand-contract cycle:
>
>  * people base repos off of Marcelo or Linus's git repo, including using
>    those pack files (saves download bandwidth, disk space through
>    hardlinks).
>
>  * as 3rd parties and Marcelo/Linus merge stuff, .git/objects/* grows
>    with individual files.
>
>  * once a month/release/whatever, Linus packs his repo, allowing all the
>    repos following his to use those pack files, pruning a ton of objects
>    off of kernel.org.
>
> I have real users of my git repos who can't just download a 100MB pack
> file in an hour, it takes them many hours.

Argh. Ok, I'm going to follow this up with three small patches that add a
"-l" flag to "git repack", which does only a "local repack" (ie it will
pack only objects that are _not_ in packs in alternate object
directories).

That will hopefully mean that this usage case is supported too.

		Linus
* Re: [kernel.org users] Re: auto-packing on kernel.org? please?
From: Dirk Behme @ 2005-10-16 14:33 UTC (permalink / raw)
To: Git Mailing List

> On Thu, 13 Oct 2005, Jeff Garzik wrote:
>
> > I have real users of my git repos who can't just download a 100MB pack
> > file in an hour, it takes them many hours.

Seems that I'm one of these users (but using another repo).

Pack files are very nice for saving bandwidth and disk space. But what I
dislike is that I often have to download the same information twice: the
remote .git/objects/* repo grows, and I update my local repo daily against
it. Then once a month/release/whatever, .git/objects/* are packed into one
file. This new pack file then is downloaded as well, but most/all of the
information in this file is already in my local repo, so it is downloaded
again. Something like

- detect that there is a new pack file in the remote repo
- check what is in this remote pack file
- if no or only a few .git/objects/* are missing from the local repo,
  download the missing ones and create an identical copy of the remote
  pack file using local .git/objects/*. Don't download the remote pack
  file.
- remove all local .git/objects/* now in the pack file

would be nice. Or is this already possible? Or do I misunderstand
anything?

Dirk
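[Editorial note: the comparison step Dirk sketches — list the remote
pack's contents and subtract what is already present locally — could look
like the following. This is a hypothetical sketch: `git show-index` does
dump a pack index as `offset sha1` lines, but reproducing a byte-identical
copy of the remote pack from local objects is not addressed here, and
`pack-XXXX.idx` is a placeholder name.]

```shell
# List the object names recorded in a downloaded pack index file.
# "git show-index" reads the .idx on stdin and prints one object per
# line as "<offset> <sha1> [crc]".
remote_objects() {
	git show-index < "$1" | awk '{print $2}' | sort
}

# Pure helper: print lines of the wanted list ($2) that are absent from
# the have list ($1) -- i.e. the objects we would still need to fetch.
missing_objects() {
	grep -vxF -f "$1" "$2"
}
```

With the missing list in hand, a fetcher could grab those few objects
individually and skip the pack download entirely when the list is short.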
* Re: [kernel.org users] Re: auto-packing on kernel.org? please?
From: Daniel Barkalow @ 2005-10-16 15:44 UTC (permalink / raw)
To: Dirk Behme; +Cc: Git Mailing List

On Sun, 16 Oct 2005, Dirk Behme wrote:

> > On Thu, 13 Oct 2005, Jeff Garzik wrote:
> >
> > > I have real users of my git repos who can't just download a 100MB
> > > pack file in an hour, it takes them many hours.
>
> Seems that I'm one of these users (but using another repo).
>
> Pack files are very nice for saving bandwidth and disk space. But what
> I dislike is that I often have to download the same information twice:
> the remote .git/objects/* repo grows, and I update my local repo daily
> against it. Then once a month/release/whatever, .git/objects/* are
> packed into one file. This new pack file then is downloaded as well,
> but most/all of the information in this file is already in my local
> repo. Something like
>
> - detect that there is a new pack file in the remote repo
> - check what is in this remote pack file
> - if no or only a few .git/objects/* are missing from the local repo,
>   download the missing ones and create an identical copy of the remote
>   pack file using local .git/objects/*. Don't download the remote pack
>   file.

This is the problem: it's impossible to download only a few objects from
a pack file from an HTTP server, because those don't exist on the server
as separate files.

The current HTTP code actually never downloads a pack file unless a
needed object is not anywhere else, at which point it has no choice but
to download the pack.

	-Daniel
*This .sig left intentionally blank*
* Re: [kernel.org users] Re: auto-packing on kernel.org? please?
From: Nick Hengeveld @ 2005-10-16 16:12 UTC (permalink / raw)
To: Daniel Barkalow; +Cc: Dirk Behme, Git Mailing List

On Sun, Oct 16, 2005 at 11:44:46AM -0400, Daniel Barkalow wrote:

> This is the problem: it's impossible to download only a few objects from
> a pack file from an HTTP server, because those don't exist on the server
> as separate files.

Is it possible to determine the object locations inside the remote pack
file?  If so, it would be possible to use Range: headers to download
selected objects from a pack.

-- 
For a successful technology, reality must take precedence over public
relations, for nature cannot be fooled.
* Re: [kernel.org users] Re: auto-packing on kernel.org? please?
From: Brian Gerst @ 2005-10-16 16:23 UTC (permalink / raw)
To: Nick Hengeveld; +Cc: Daniel Barkalow, Dirk Behme, Git Mailing List

Nick Hengeveld wrote:
> On Sun, Oct 16, 2005 at 11:44:46AM -0400, Daniel Barkalow wrote:
>
> > This is the problem: it's impossible to download only a few objects
> > from a pack file from an HTTP server, because those don't exist on
> > the server as separate files.
>
> Is it possible to determine the object locations inside the remote pack
> file? If so, it would be possible to use Range: headers to download
> selected objects from a pack.

Not possible because the entire pack is compressed.

--
Brian Gerst
* Re: [kernel.org users] Re: auto-packing on kernel.org? please?
From: Junio C Hamano @ 2005-10-16 16:56 UTC (permalink / raw)
To: git

Brian Gerst <bgerst@didntduck.org> writes:

> > Is it possible to determine the object locations inside the remote
> > pack file? If so, it would be possible to use Range: headers to
> > download selected objects from a pack.

That's what the .idx file is for, except that after you fetch the range,
you may find you would need something else that the object is delta
against.
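[Editorial note: the offsets in that `offset sha1` listing are enough to
turn the Range-header idea into concrete byte ranges. A hypothetical
sketch follows; note that the last object in a pack has no successor, so
its upper bound would have to come from the pack file's size, which this
helper deliberately does not handle.]

```shell
# Read "offset sha1" lines (as printed by "git show-index") on stdin and
# print the byte range "start-end" covering object $1, using the next
# object's offset as the exclusive upper bound. Prints nothing when the
# object is absent or is the last one in the pack.
range_for() {
	sort -n | awk -v want="$1" '
		$2 == want { start = $1; found = 1; next }
		found      { printf "%d-%d\n", start, $1 - 1; exit }
	'
}
```

Hypothetical use (pack name and sha1 are placeholders):
`git show-index < pack-XXXX.idx | range_for $sha1` gives the value to put
in an HTTP `Range: bytes=...` request against `pack-XXXX.pack`.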
* Re: [kernel.org users] Re: auto-packing on kernel.org? please?
From: Nick Hengeveld @ 2005-10-16 21:33 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git

On Sun, Oct 16, 2005 at 09:56:49AM -0700, Junio C Hamano wrote:

> That's what the .idx file is for, except that after you fetch
> the range, you may find you would need something else that the
> object is delta against.

Would it make sense to load the pack indexes for each base up front, and
then fetch individual objects from a pack if they exist in one of a
base's pack indexes?  In such a case, it may not even make sense to try
fetching the object directly first.

What are the circumstances under which it makes more sense to fetch the
whole pack rather than fetching individual objects from it?

-- 
For a successful technology, reality must take precedence over public
relations, for nature cannot be fooled.
* Re: [kernel.org users] Re: auto-packing on kernel.org? please?
From: Junio C Hamano @ 2005-10-16 22:12 UTC (permalink / raw)
To: Nick Hengeveld; +Cc: git

Nick Hengeveld <nickh@reactrix.com> writes:

> On Sun, Oct 16, 2005 at 09:56:49AM -0700, Junio C Hamano wrote:
>
> > That's what the .idx file is for, except that after you fetch
> > the range, you may find you would need something else that the
> > object is delta against.
>
> Would it make sense to load the pack indexes for each base up front,
> and then fetch individual objects from a pack if they exist in one of
> a base's pack indexes?  In such a case, it may not even make sense to
> try fetching the object directly first.
>
> What are the circumstances under which it makes more sense to fetch the
> whole pack rather than fetching individual objects from it?

It would make sense if we end up needing most of them anyway, I think.

We are probably far from this, but ideally, we should be able to set up
something like this.  We encourage the server side to prepare packs this
way [*1*]:

    -- development --> time --> flows --> this --> way -->

    (optional)
    full ---------------------------------------------
    base ---------------
    6mo                 ------------------------------
    3mo                         ----------------------
    1mo                                 --------------
    2wk                                      ---------
    1wk                                          -----
                                                      ^
                                        last pack optimization

That is, a big base pack (say, v2.6.12), and multiple packs to bring
people that were in sync at various times up to date to the time when the
set of packs was last optimized.  Any objects created after the last pack
optimization are left unpacked until the next pack optimization.  It
might not be a bad idea to also have a "full" pack.

For example, if you were in sync 5 months ago, fetching the 3mo pack
would not be enough, and you would need to get the 6mo pack to become up
to date wrt the last pack optimization (say, 3 days ago).  You would have
obtained the objects not in any pack, created within the last 3 days,
already as individual objects before realizing that you would need to
fetch some pack.

Then, we can teach git-http-fetch to do:

 - If an object is unavailable unpacked, get all the indices from that
   repository (and probably its alternates while we are at it).

 - Among the set of packs that contain the object we are currently
   interested in, try to find the "best" pack.  The definition of "best"
   would be a balancing act of finding the one that contains the least
   number of objects we already have, and the one that contains the most
   number of objects we do not have yet.

The commit walker always goes from present to past, so you would start
from fetching the latest, presumably unpacked objects, and as soon as you
hit the last pack optimization boundary, you have a choice of multiple
packs.  If you are relatively up to date, you would find that the 1mo
pack has more things you already have than the 1wk pack, although both of
them would fit the bill -- at that point you choose to download the 1wk
pack.  On the other hand, if you are behind, you may find that the 3mo
pack has more things you do not have than the 1wk, 2wk, or 1mo packs, and
using the 3mo pack would become the right choice for you.

I think most repositories have a few related heads whose tips almost
never rewind, so favoring the pack that contains the most objects we do
not have would be the right strategy in practice for the downloader.

[Footnote]

*1* This is different from a proposal posted on the list earlier by
somebody (I think it was Pasky, but I may be mistaken) which looked like
this:

    -- development --> time --> flows --> this --> way -->

    base ------- 6mo ------- 3mo ----- 1mo ---- 2wk ---- 1wk ----

The thing is, the sum of the 3mo+1mo+2wk+1wk packs in the latter scheme
tends to be a lot bigger than the 3mo pack in the former scheme.
* Re: [kernel.org users] Re: auto-packing on kernel.org? please?
From: Nick Hengeveld @ 2005-10-17 6:06 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git

On Sun, Oct 16, 2005 at 03:12:03PM -0700, Junio C Hamano wrote:

> - Among the set of packs that contain the object we are
>   currently interested in, try to find the "best" pack.  The
>   definition of "best" would be a balancing act of finding the
>   one that contains the least number of objects we already
>   have, and the one that contains the most number of objects we
>   do not have yet.

To get a complete list of objects we do not have yet, fetch will need to
walk all the trees first and then make another pass to process all the
missing objects.

Is it worth considering a case where the missing objects are packed along
with objects that don't need to be transferred?  From the use cases you
described, it's not clear that situation would ever really happen.

If the blobs have been packed, it seems likely that the tree objects will
also be packed, so fetching them during the first pass will either
involve fetching a pack without being able to determine which is best, or
fetching the appropriate ranges from packs to get the tree objects.

-- 
For a successful technology, reality must take precedence over public
relations, for nature cannot be fooled.
* Re: [kernel.org users] Re: auto-packing on kernel.org? please?
From: Junio C Hamano @ 2005-10-17 8:21 UTC (permalink / raw)
To: Nick Hengeveld; +Cc: git

Nick Hengeveld <nickh@reactrix.com> writes:

> To get a complete list of objects we do not have yet, fetch will need
> to walk all the trees first and then make another pass to process
> all the missing objects.

Notice I did not say "we do not have yet but we will need" -- I just said
"we do not have yet".

The assumption, which is the property the suggested packing strategy has,
is that older objects that are needed to complete the history leading to
the current tip are packed in those n-month/n-week packs, so if we do not
have them, we would likely be needing them, although we might not have
walked that far back in history yet.

The previous "packing strategy" picture was certainly too simplified.
Obviously we would not want to repack everything every week for different
periods all the way back -- we would want to leave old huge packs
untouched to help the server side (and mirroring), so instead of having a
single "pack optimization boundary", we would probably need some
staggering as well for archived material.  This is a revised example:

    1yr ----- 9mo -------- 6mo ---------- 3mo ------------------ 1mo ------------ 2wk -------- 1wk ----

We keep track of "the current heads and tags" for each week.  Every week,
we can do something like this:

 - rotate the record, and create a new one:

	mv .save/wk11 .save/wk12
	mv .save/wk10 .save/wk11
	mv .save/wk9  .save/wk10
	...
	mv .save/wk0  .save/wk1
	find .git/refs -type f -print | xargs cat >.save/wk0

 - prepare a pack to allow a single pack fetch to bring a repository that
   had everything reachable from wk$N refs up to date to the current, for
   selected recent weeks (say N=1, 2, 4, 12):

	for N in 1 2 4 12
	do
		name=$(git-rev-list --objects \
			$(sed -e 's/^/^/' .save/wk$N) \
			$(cat .save/wk0) |
			git-pack-objects pack-) &&
		mv pack-$name.* .git/objects/pack/.
	done

   Then remove the pack files that we created this way last week from the
   repository.  (If the repository did not have any activity during the
   last week, we would have created the same set of packs -- make sure we
   do not remove them.)

 - except that we keep the longest-period one (i.e. N=12 in this example)
   every N weeks (that's how the 1yr, 9mo, and 6mo packs in the picture
   are kept).

This way, really old stuff (say, older than 3mo) will stay intact and
will not be repacked, so people reasonably up to date (within 12 weeks in
the example) need to fetch only one pack (plus unpacked objects since the
last pack optimization), but people without the ancient history need to
go further back.
* Re: [kernel.org users] Re: auto-packing on kernel.org? please?
From: Nick Hengeveld @ 2005-10-17 17:41 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git

On Mon, Oct 17, 2005 at 01:21:14AM -0700, Junio C Hamano wrote:

> The assumption, which is the property the suggested packing
> strategy has, is that older objects that are needed to complete
> the history leading to the current tip are packed in those
> n-month/n-week packs, so if we do not have them we would likely
> be needing them, although we might not have walked that far back
> in history yet.

Gotcha - I'm still thinking in terms of content distribution, where you
only need a specific version of a tree to be available locally and
explicitly don't want to transfer history.

In our case, using packs doesn't make sense at the moment.

-- 
For a successful technology, reality must take precedence over public
relations, for nature cannot be fooled.
* Re: [kernel.org users] Re: auto-packing on kernel.org? please?
From: Junio C Hamano @ 2005-10-17 20:08 UTC (permalink / raw)
To: Nick Hengeveld; +Cc: git

Nick Hengeveld <nickh@reactrix.com> writes:

> Gotcha - I'm still thinking in terms of content distribution, where
> you only need a specific version of a tree to be available locally
> and explicitly don't want to transfer history.

In other words, you'd want to also support CVS-like "working tree has the
specific version, and history is not kept here, but is available on
demand, possibly over the network" mode of operation.  I'd say why not.
We could aim to have "working tree has the specific version and partial
history of recent versions, and the ancient history is available on
demand, possibly over the network" mode of operation.

It is somewhat different from the primary focus of what we have been
doing, but I think it is a natural extension.  The invariant is that once
you have a ref pointing at a specific commit, everything reachable from
it ought to be available to you.  And we have extended the definition of
"available" over time.  Initially, you needed to have individual objects;
then we made it so they could live in packs, and now they can even be
borrowed from another repository via alternates.  We currently do not
consider "lazily fetchable over the network" as "available", but I do not
object too much to that, as long as it is an optional feature.

This probably is a post-1.0 item, though.  Off the top of my head, we
would need:

 - a way for the user to say "unless I explicitly ask otherwise, do not
   bother me if the commits older than these ones are incomplete" -- a
   milder version of cauterizing the commit chain via info/grafts.

 - a way for the user to say "this time I am explicitly overriding the
   above -- I am interested in older history".

 - changes to fsck-objects, fetch- and probably upload-pack on the other
   end, and the commit walkers, to honor the above two.

Most of these can probably be done with the existing info/grafts
mechanism, but even then we would definitely need a nicer user interface.

Once this is in place, range requests to pick data for individual objects
from packs residing on a remote HTTP server would start to make sense.
* Re: [kernel.org users] Re: auto-packing on kernel.org? please?
From: Daniel Barkalow @ 2005-10-17 22:56 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Nick Hengeveld, git

On Mon, 17 Oct 2005, Junio C Hamano wrote:

> Nick Hengeveld <nickh@reactrix.com> writes:
>
> > Gotcha - I'm still thinking in terms of content distribution, where
> > you only need a specific version of a tree to be available locally
> > and explicitly don't want to transfer history.
>
> In other words, you'd want to also support CVS-like "working tree has
> the specific version, and history is not kept here, but is available
> on demand, possibly over the network" mode of operation.  I'd say why
> not.  We could aim to have "working tree has the specific version and
> partial history of recent versions, and the ancient history is
> available on demand, possibly over the network" mode of operation.
>
> It is somewhat different from the primary focus of what we have been
> doing, but I think it is a natural extension.  The invariant is that
> once you have a ref pointing at a specific commit, everything
> reachable from it ought to be available to you.

Wouldn't "git fetch http://.../foo.git/ master^{tree}" do the right
thing?  You get only the current tree, and write a ref to the tree
instead of the commit, maintaining the invariant.  Of course, fetch.c
needs a bit of work so that it can fetch objects in the process of
figuring out what the refspec it's really trying to fetch resolves to,
but that should be simple enough.

Of course, this really isolates you from the history, since you don't
even remember which commit you got the tree from, but that may not be an
issue in a pure content distribution setup.

Also, a pack file of a single tree isn't going to be terribly efficient,
because pack files mostly exploit the high similarity between different
versions of the same file.

My other idea is to have a file of things that you expect to be missing,
even though they are referenced, and where to expect to find them if
necessary.  Then you could download the latest commit, mark its parents
(unless you have them) as known-missing, and write the ref.

	-Daniel
*This .sig left intentionally blank*
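[Editorial note: the known-missing list could start from the parent
headers of the one commit that was fetched. The sketch below is
hypothetical -- the `known-missing` filename is made up, and a real
design would also need the "where to expect to find them" half.]

```shell
# Read a raw commit object's text on stdin and print its parent sha1s.
# Commit headers end at the first blank line, so anything after that
# (the commit message) is ignored.
parents_of() {
	awk '/^$/ { exit } $1 == "parent" { print $2 }'
}

# Hypothetical use: after fetching only the tip commit, record its
# parents as expected-to-be-missing, then write the ref:
#	git cat-file commit $tip | parents_of >> .git/known-missing
#	git update-ref refs/heads/master $tip
```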
* Re: [kernel.org users] Re: auto-packing on kernel.org? please?
From: Linus Torvalds @ 2005-10-17 23:19 UTC (permalink / raw)
To: Daniel Barkalow; +Cc: Junio C Hamano, Nick Hengeveld, git

On Mon, 17 Oct 2005, Daniel Barkalow wrote:
>
> Wouldn't "git fetch http://.../foo.git/ master^{tree}" do the right
> thing?

The pack pullers have trouble with anything that isn't commit-based,
because they do all the "figure out what we have in common" logic based
on the commit history.

So if you fetch a tree, it by definition doesn't _have_ any history, and
the pack pullers will always pack the whole tree.

I think.

		Linus
* Re: [kernel.org users] Re: auto-packing on kernel.org? please?
From: Nick Hengeveld @ 2005-10-17 23:54 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git

On Mon, Oct 17, 2005 at 01:08:03PM -0700, Junio C Hamano wrote:

> - a way for the user to say "unless I explicitly ask otherwise,
>   do not bother me if the commits older than these ones are
>   incomplete" -- a milder version of cauterizing the commit chain
>   via info/grafts.
>
> - a way for the user to say "this time I am explicitly
>   overriding the above -- I am interested in older history".
>
> - changes to fsck-objects, fetch- and probably upload-pack on
>   the other end, and the commit walkers, to honor the above two.

That's how I interpreted the -c and -a command-line arguments to the
commit walkers.  git-fetch calls them with -a, but we've been using -t
to only follow the tree objects, and it's been working great.

Perhaps that would be a good way for the commit walker to decide whether
to transfer a full pack file - it may not make sense if it wasn't told to
get history.

-- 
For a successful technology, reality must take precedence over public
relations, for nature cannot be fooled.
* Re: [kernel.org users] Re: auto-packing on kernel.org? please?
From: Daniel Barkalow @ 2005-10-17 19:13 UTC (permalink / raw)
To: Nick Hengeveld; +Cc: Junio C Hamano, git

On Sun, 16 Oct 2005, Nick Hengeveld wrote:

> On Sun, Oct 16, 2005 at 09:56:49AM -0700, Junio C Hamano wrote:
>
> > That's what the .idx file is for, except that after you fetch
> > the range, you may find you would need something else that the
> > object is delta against.
>
> Would it make sense to load the pack indexes for each base up front,
> and then fetch individual objects from a pack if they exist in one of
> a base's pack indexes?  In such a case, it may not even make sense to
> try fetching the object directly first.

At the start, you have the option of either fetching the list of packs or
the object.  There are three cases:

 1) the object isn't available separately; we need to fetch the list of
    packs to find it in a pack.

 2) there aren't any new packs; we need to fetch the object individually.

 3) the object is present both individually and in a pack.

(2) is more common than (1), because we don't repack on every update.
(3) doesn't happen at all, currently, because we prune after packing.  So
it makes most sense to try the object first.

On the other hand, the parallel code should probably do both at the same
time, since it can, and it only costs some latency, not bandwidth.  We
probably also ought to speculatively get any new index files in parallel
with whatever else we're doing, since it is likely that we'll need some
pack at some point, and then we'll need all the index files to decide
which pack to get.

> What are the circumstances under which it makes more sense to fetch the
> whole pack rather than fetching individual objects from it?

I'm not sure there's a good way of deciding without a plan for what
conditions cause there to be a choice.

	-Daniel
*This .sig left intentionally blank*
* Re: [kernel.org users] Re: auto-packing on kernel.org? please?
From: Johannes Schindelin @ 2005-10-16 17:10 UTC (permalink / raw)
To: Brian Gerst; +Cc: Nick Hengeveld, Daniel Barkalow, Dirk Behme, Git Mailing List

Hi,

On Sun, 16 Oct 2005, Brian Gerst wrote:

> Nick Hengeveld wrote:
> > On Sun, Oct 16, 2005 at 11:44:46AM -0400, Daniel Barkalow wrote:
> >
> > > This is the problem: it's impossible to download only a few objects
> > > from a pack file from an HTTP server, because those don't exist on
> > > the server as separate files.
> >
> > Is it possible to determine the object locations inside the remote
> > pack file? If so, it would be possible to use Range: headers to
> > download selected objects from a pack.
>
> Not possible because the entire pack is compressed.

Maybe we should introduce an option which packs only objects of a minimal
age (something like "pack only objects 2 days and older")?  This could be
used for autopacking as long as HTTP is the preferred protocol, so that
if you update daily, you already have those objects.

Alternatively, git-prune-packed could have an option to prune only those
objects older than 2 days.

Ciao,
Dscho
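[Editorial note: no such age option existed at the time. A rough
approximation of the idea -- feed only sufficiently old loose objects to
the packer, leaving the freshest ones unpacked for cheap individual HTTP
fetches -- might look like this hypothetical sketch; the two-day cutoff
and the path handling are purely illustrative.]

```shell
# Convert a loose-object path like ".git/objects/ab/cdef..." back into
# the object name "abcdef..." (the fan-out directory supplies the first
# two hex digits).
path_to_name() {
	sed 's|.*objects/\(..\)/\([0-9a-f]*\)$|\1\2|'
}

# Sketch of the age-cutoff packing pass: pack only loose objects whose
# files have not been touched for more than 2 days.
#	find "$GIT_DIR/objects" -type f -mtime +2 -path '*objects/??/*' |
#		path_to_name |
#		git pack-objects "$GIT_DIR/objects/pack/pack"
```

Using file mtime as a proxy for object age is itself an assumption; a
real implementation would more likely consult commit dates.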
* Re: [kernel.org users] Re: auto-packing on kernel.org? please?
From: Brian Gerst @ 2005-10-16 17:15 UTC (permalink / raw)
To: Nick Hengeveld; +Cc: Daniel Barkalow, Dirk Behme, Git Mailing List

Brian Gerst wrote:
> Nick Hengeveld wrote:
> > On Sun, Oct 16, 2005 at 11:44:46AM -0400, Daniel Barkalow wrote:
> >
> > > This is the problem: it's impossible to download only a few objects
> > > from a pack file from an HTTP server, because those don't exist on
> > > the server as separate files.
> >
> > Is it possible to determine the object locations inside the remote
> > pack file? If so, it would be possible to use Range: headers to
> > download selected objects from a pack.
>
> Not possible because the entire pack is compressed.

I should have looked at the source more closely before stating that.
Each object is compressed individually, so this would be possible.

--
Brian Gerst
* Re: auto-packing on kernel.org? please?
  2005-10-13 18:44 auto-packing on kernel.org? please? Linus Torvalds
       [not found] ` <434EABFD.5070604@zytor.com>
@ 2005-11-21 19:01 ` Carl Baldwin
  2005-11-21 19:24   ` Linus Torvalds
  1 sibling, 1 reply; 33+ messages in thread
From: Carl Baldwin @ 2005-11-21 19:01 UTC (permalink / raw)
To: Linus Torvalds; +Cc: H. Peter Anvin, Git Mailing List

I have a question about automatic repacking.

I am thinking of turning something like Linus' repacking heuristic loose
on my repositories.  I just want to make sure it is as safe as possible.

At the core of the incremental and full repack strategies are these
statements.

Incremental...

> git repack &&
> git prune-packed

Full...

> git repack -a -d &&
> git prune-packed

Are there some built-in safety checks in 'git repack' and/or 'git
prune-packed' to guard against corruption?  In the long run, I would
feel more comfortable with something like this:

    git repack
    git verify-pack <new pack>
    git prune-packed

Would something like this even work with 'git repack -a -d'?  Is there a
way to do something like the following for a full repack to achieve the
ultimate in paranoia?

    git repack -a
    git verify-pack <new pack file>
    git trash-redundant-packs <new pack file>
    git prune-packed

Carl

On Thu, Oct 13, 2005 at 11:44:30AM -0700, Linus Torvalds wrote:
>
> I know we tried this once earlier, and it caused problems, but that was
> when pack-files were new, and not everybody could handle them. These days,
> if you can't handle pack-files, kernel.org is already pretty useless,
> because all the major packages use them anyway, because people have
> packed their repositories by hand.
>
> So I'm suggesting we try to do an automatic repack every once in a while.
>
> In my suggestion, there would be two levels of repacking: "incremental"
> and "full", and both of them would count the number of files before they
> run, so that you'd only do it when it seems worthwhile.
>
> This is a _really_ simple heuristic:
>
>  - incremental repacking run every day:
>
>     #
>     # Check if we have more than a couple of hundred
>     # unpacked objects - approximated by whether we
>     # have any "00" directory with more than one
>     #
>     # This means that we don't repack projects
>     # that don't have a lot of work going on.
>     #
>     # Note: with really new versions of git, the "00"
>     # directory may not exist if it has been pruned
>     # away, so handle that gracefully.
>     #
>     export GIT_DIR=${1:-.}
>     objs=$(find "$GIT_DIR/objects/00" -type f 2> /dev/null | wc -l)
>     if [ "$objs" -gt 0 ]; then
>         git repack &&
>         git prune-packed
>     fi
>
>  - "full repack" every week if the number of packs has grown to be bigger
>    than say 10 (ie even a very active project will never have a full
>    repack more than every other week)
>
>     #
>     # Check if we have lots of packs, where "lots" is defined as 10.
>     #
>     # Note: with something that was generated with an old version
>     # of git, the "pack" directory may not exist, so handle that
>     # gracefully.
>     #
>     export GIT_DIR=${1:-.}
>     packs=$(find "$GIT_DIR/objects/pack" -name '*.idx' 2> /dev/null | wc -l)
>     if [ "$packs" -gt 10 ]; then
>         git repack -a -d &&
>         git prune-packed
>     fi
>
>  - do a full repack of everything once to start with.
>
>     export GIT_DIR=${1:-.}
>     git repack -a -d &&
>     git prune-packed
>
> the above three trivial scripts just take a single argument, which becomes
> the GIT_DIR (and if no argument exists, it would default to ".")
>
> Is there any reason not to do this? Right now mirroring is slow, and
> webgit is also getting to be very slow sometimes. I bet we'd be _much_
> better off with this kind of setup.
>
> NOTE! The above is the "stupid" approach, which totally ignores alternate
> directories, and isn't able to take advantage of the fact that many
> projects could share objects.
> But it's simple, and it's efficient (eg it won't spend time on things
> like the large historic archives which don't change, but that would be
> expensive to repack if you didn't check for the need).
>
> So we could try to come up with a better approach eventually, which would
> automatically notice alternate directories and not repack stuff that
> exists there, but I'm pretty sure that the above would already help a
> _lot_, and while pack-files have been around forever, the
> "alternates" support is still pretty new, so the above is also the "safer"
> thing to do.
>
> We'd only do the automatic thing on stuff under /pub/scm, of course: not
> stuff in peoples home directories etc..
>
> Peter?
>
> 		Linus
> -
> To unsubscribe from this list: send the line "unsubscribe git" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

--
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 Carl Baldwin                        Systems VLSI Laboratory
 Hewlett Packard Company             MS 88
 work: 970 898-1523                  3404 E. Harmony Rd.
 work: Carl.N.Baldwin@hp.com         Fort Collins, CO 80525
 home: Carl@ecBaldwin.net
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

^ permalink raw reply	[flat|nested] 33+ messages in thread
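Carl's paranoid sequence can be approximated with commands that do exist (`git repack`, `git verify-pack`, `git prune-packed`); note that `git trash-redundant-packs` in his mail is hypothetical, not a real command. A minimal sketch against a throwaway repository, assuming a reasonably modern git on the PATH (the flags used here are current git, not necessarily the 2005 versions):

```shell
#!/bin/sh
# Cautious incremental repack: verify the freshly written pack
# before pruning the loose objects it replaces.
set -e
repo=$(mktemp -d)                 # throwaway repository for illustration
git init -q "$repo"
cd "$repo"
echo hello > f
git add f
git -c user.name=t -c user.email=t@example.com commit -q -m init

git repack -q                     # write a new pack; loose objects remain
for idx in .git/objects/pack/*.idx; do
    git verify-pack "$idx"        # abort (via set -e) on a corrupt pack
done
git prune-packed -q               # only now drop the now-packed loose objects
```

Nothing is deleted until `verify-pack` has succeeded, which preserves the "never touch what has already been written" property Carl is after.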
* Re: auto-packing on kernel.org? please?
  2005-11-21 19:01 ` Carl Baldwin
@ 2005-11-21 19:24 ` Linus Torvalds
  2005-11-21 19:58   ` Junio C Hamano
  ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Linus Torvalds @ 2005-11-21 19:24 UTC (permalink / raw)
To: Carl Baldwin; +Cc: H. Peter Anvin, Git Mailing List

On Mon, 21 Nov 2005, Carl Baldwin wrote:
>
> I have a question about automatic repacking.
>
> I am thinking of turning something like Linus' repacking heuristic loose
> on my repositories. I just want to make sure it is as safe as possible.
>
> At the core of the incremental and full repack strategies are these
> statements.
>
> Incremental...
> > git repack &&
> > git prune-packed
>
> Full...
> > git repack -a -d &&
> > git prune-packed

NOTE! Since that email, "git repack" has gotten a "local" option (-l),
which is very useful if the repositories have pointers to alternates.

So do

    git repack -l

instead, to get much better packs (and "-a -d" for the full case, of
course).

Other than that, the old email suggestion should still be fine.

> Are there some built-in safety checks in 'git repack' and/or 'git
> prune-packed' to guard against corruption? In the long run, I would
> feel more comfortable with something like this:
>
>     git repack
>     git verify-pack <new pack>
>     git prune-packed

You can certainly do that if you are nervous. It might even be a good
idea: just for fun, I just did

    git clone -l git git-clone
    cd git-clone

    # pick an object at random
    rm .git/objects/f7/c3d39fe3db6da3a307da385a7a1cb563ed15f7

    git repack -a -d

and it said:

    error: Could not read f7c3d39fe3db6da3a307da385a7a1cb563ed15f7
    fatal: bad tree object f7c3d39fe3db6da3a307da385a7a1cb563ed15f7

but then it created the pack _anyway_, and said:

    Packing 27 objects
    Pack pack-13bfca704078175c1c1c59964553b14f7b952651 created.

and happily removed all the old ones.

So right now, repacking a broken archive can actually break it even more.

NOTE!
Your "git verify-pack" wouldn't even catch this: the _pack_ is fine,
it's just incomplete.

Of course, this only happens if the repository was broken to begin with,
so arguably it's not that bad. But it does show that git-repack should be
more careful and return an error more aggressively.

Can anybody tell me how to do that sanely? Right now we do

    ..
    name=$(git-rev-list --objects $rev_list $(git-rev-parse $rev_parse) |
        git-pack-objects --non-empty $pack_objects .tmp-pack) ||
        exit 1
    ..

and the thing is, the "git-pack-objects" thing is happy, it's the
"git-rev-list" that fails. So because the last command in the pipeline
returns ok, we think it all is ok..

(This is one of the reasons I much prefer working in C over working in
shell: it may be twenty times more lines, but when you have a problem, the
fix is always obvious..)

Anyway, with that fixed, a "git repack" in many ways would be a mini-fsck,
so it should be very safe in general. Modulo any other bugs like the
above.

		Linus

^ permalink raw reply	[flat|nested] 33+ messages in thread
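The pitfall Linus describes is inherent to shell pipelines and easy to reproduce in any POSIX shell: `$?` reflects only the last command of a pipeline, so an upstream failure is silently swallowed. A minimal illustration:

```shell
# The left side of the pipe fails, but the pipeline's exit status
# is taken from 'cat', the last command -- so the caller sees success.
out=$( { echo deadbeef; false; } | cat )
status=$?
echo "output=$out status=$status"   # status is 0 despite the failure
```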
* Re: auto-packing on kernel.org? please?
  2005-11-21 19:24 ` Linus Torvalds
@ 2005-11-21 19:58 ` Junio C Hamano
  2005-11-21 20:38   ` Linus Torvalds
  2005-11-22  5:26   ` Chuck Lever
  2005-11-22 17:25   ` Carl Baldwin
  2 siblings, 1 reply; 33+ messages in thread
From: Junio C Hamano @ 2005-11-21 19:58 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git

Linus Torvalds <torvalds@osdl.org> writes:

> Can anybody tell me how to do that sanely? Right now we do
>
>     ..
>     name=$(git-rev-list --objects $rev_list $(git-rev-parse $rev_parse) |
>         git-pack-objects --non-empty $pack_objects .tmp-pack) ||
>         exit 1
>     ..
>
> and the thing is, the "git-pack-objects" thing is happy, it's the
> "git-rev-list" that fails. So because the last command in the pipeline
> returns ok, we think it all is ok..

One cop-out: do fsck-objects upfront before making a pack.  This
would populate your buffer cache so it might not be a bad thing.

Alternatively:

    name=$( {
        git-rev-list --objects $rev_list $(git-rev-parse $rev_parse) ||
        echo Gaaahhh
    } | git-pack-objects --non-empty $pack_objects .tmp-pack)

^ permalink raw reply	[flat|nested] 33+ messages in thread
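Junio's second suggestion is a sentinel pattern: when the producer fails, inject a line the consumer is guaranteed to reject ("Gaaahhh" is never a valid sha1), so the failure finally surfaces in the last pipeline stage. A generic sketch of the same idea, with a hypothetical `produce` function standing in for git-rev-list and a neutral consumer:

```shell
# Sentinel pattern: a failing producer appends a marker line that
# the consumer (or the caller) can detect and reject.
produce() {
    echo cafebabe       # some partial output...
    return 1            # ...then the producer fails
}

result=$( { produce || echo FAILED-SENTINEL; } | cat )

case "$result" in
*FAILED-SENTINEL*) ok=no ;;   # producer failed: do not trust the output
*)                 ok=yes ;;
esac
echo "producer ok: $ok"
```

In git's case the consumer itself dies on the sentinel, which is even stronger: the caller cannot fail to notice.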
* Re: auto-packing on kernel.org? please?
  2005-11-21 19:58 ` Junio C Hamano
@ 2005-11-21 20:38 ` Linus Torvalds
  2005-11-21 21:35   ` Junio C Hamano
  0 siblings, 1 reply; 33+ messages in thread
From: Linus Torvalds @ 2005-11-21 20:38 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git

On Mon, 21 Nov 2005, Junio C Hamano wrote:
>
> One cop-out: do fsck-objects upfront before making a pack. This
> would populate your buffer cache so it might not be a bad thing.

Well, it's extremely expensive most of the time. It's often as expensive
as the packing itself. So I don't like that option very much.

> Alternatively:
>
>     name=$( {
>         git-rev-list --objects $rev_list $(git-rev-parse $rev_parse) ||
>         echo Gaaahhh
>     } | git-pack-objects --non-empty $pack_objects .tmp-pack)

Actually, some dim memories prodded me to some man-page digging, and the
"pipefail" option in particular. It seems to be a common option to both
ksh and bash, so

    set -o pipefail

seems like it should fix this. Sadly, I think it's pretty recent in bash
(ksh apparently got it in -93, bash seems to have gotten it only as of
version 3.0, which is definitely recent enough that we can't just assume
it).

[ Also, bash seems to have a variable called $PIPESTATUS, but that's
  bash-specific (I don't know when it was enabled). ]

Anyway, doing a

    set -o pipefail

should never be the wrong thing to do, but the problem is figuring out
whether the option is available or not, since if it isn't available, it's
considered an error ;/

So with all that, how about we take your "Gaah" idea, and simplify it:
just pipe stderr too. That, together with making git-pack-objects tell
what garbage it got, actually does the right thing:

    [torvalds@g5 git-clone]$ git repack -a -d
    fatal: expected sha1, got garbage:
     error: Could not read 7f59dbbb8f8d479c1d31453eac06ec765436a780

with this pretty simple patch. Whaddaya think?
		Linus

---
diff --git a/git-repack.sh b/git-repack.sh
index 4e16d34..c0f271d 100755
--- a/git-repack.sh
+++ b/git-repack.sh
@@ -41,7 +41,7 @@ esac
 if [ "$local" ]; then
 	pack_objects="$pack_objects --local"
 fi
-name=$(git-rev-list --objects $rev_list $(git-rev-parse $rev_parse) |
+name=$(git-rev-list --objects $rev_list $(git-rev-parse $rev_parse) 2>&1 |
 	git-pack-objects --non-empty $pack_objects .tmp-pack) ||
 	exit 1
 if [ -z "$name" ]; then
diff --git a/pack-objects.c b/pack-objects.c
index 4e941e7..8864a31 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -524,7 +524,7 @@ int main(int argc, char **argv)
 		unsigned char sha1[20];
 
 		if (get_sha1_hex(line, sha1))
-			die("expected sha1, got garbage");
+			die("expected sha1, got garbage:\n %s", line);
 		hash = 0;
 		p = line+40;
 		while (*p) {

^ permalink raw reply related	[flat|nested] 33+ messages in thread
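The run-time availability problem Linus mentions can be probed directly: attempting `set -o pipefail` in a subshell reveals whether the current shell supports it (bash >= 3.0 and ksh93 do; older POSIX shells may not). A sketch:

```shell
# Probe for pipefail support, then show that it propagates an
# upstream failure into the pipeline's exit status.
if (set -o pipefail) 2>/dev/null; then
    set -o pipefail
    { echo deadbeef; false; } | cat >/dev/null
    status=$?             # 1: the failing left-hand side is now visible
    set +o pipefail
else
    status=unsupported    # old shell: fall back to e.g. the sentinel trick
fi
echo "pipefail status: $status"
```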
* Re: auto-packing on kernel.org? please?
  2005-11-21 20:38 ` Linus Torvalds
@ 2005-11-21 21:35 ` Junio C Hamano
  0 siblings, 0 replies; 33+ messages in thread
From: Junio C Hamano @ 2005-11-21 21:35 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git

Linus Torvalds <torvalds@osdl.org> writes:

> ...just pipe stderr too. That, together with making git-pack-objects tell
> what garbage it got, actually does the right thing:
>
>     [torvalds@g5 git-clone]$ git repack -a -d
>     fatal: expected sha1, got garbage:
>      error: Could not read 7f59dbbb8f8d479c1d31453eac06ec765436a780
>
> with this pretty simple patch.
>
> Whaddaya think?

Obviously the right thing to do ;-).  I like it.

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: auto-packing on kernel.org? please? 2005-11-21 19:24 ` Linus Torvalds 2005-11-21 19:58 ` Junio C Hamano @ 2005-11-22 5:26 ` Chuck Lever 2005-11-22 5:41 ` Linus Torvalds 2005-11-22 17:25 ` Carl Baldwin 2 siblings, 1 reply; 33+ messages in thread From: Chuck Lever @ 2005-11-22 5:26 UTC (permalink / raw) To: Linus Torvalds; +Cc: Carl Baldwin, H. Peter Anvin, Git Mailing List [-- Attachment #1: Type: text/plain, Size: 1504 bytes --] Linus Torvalds wrote: > > On Mon, 21 Nov 2005, Carl Baldwin wrote: > >>I have a question about automatic repacking. >> >>I am thinking of turning something like Linus' repacking heuristic loose >>on my repositories. I just want to make sure it is as safe as possible. >> >>At the core of the incremental and full repack strategies are these >>statements. >> >>Incremental... >> >>> git repack && >>> git prune-packed >> >>Full... >> >>> git repack -a -d && >>> git prune-packed > > > NOTE! Since that email, "git repack" has gotten a "local" option (-l), > which is very useful if the repositories have pointers to alternates. > > So do > > git repack -l > > instead, to get much better packs (and "-a -d" for the full case, of > course). > > Other that than, the old email suggestion should still be fine. i've been playing with "git repack" on StGIT-managed repositories. on NFS, using packs instead of individual objects is quite a bit faster, because a single NFS GETATTR will tell you if your NFS client's cached pack file is still valid, whereas a whole bunch of GETATTRs are required for validating individual object files. there are some things repacking does that breaks StGIT, though. git repack -d seems to remove old commits that StGIT was still depending on. git repack -a -n seems to work fine with StGIT, as does git prune-packed i'm really interested in trying out the new command to remove redundant objects and packs, but haven't gotten around to it yet. 
^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: auto-packing on kernel.org? please? 2005-11-22 5:26 ` Chuck Lever @ 2005-11-22 5:41 ` Linus Torvalds 2005-11-22 14:13 ` Catalin Marinas 2005-11-22 18:18 ` Chuck Lever 0 siblings, 2 replies; 33+ messages in thread From: Linus Torvalds @ 2005-11-22 5:41 UTC (permalink / raw) To: Chuck Lever Cc: Carl Baldwin, H. Peter Anvin, Git Mailing List, Catalin Marinas On Tue, 22 Nov 2005, Chuck Lever wrote: > > there are some things repacking does that breaks StGIT, though. > > git repack -d > > seems to remove old commits that StGIT was still depending on. If that is true, then "git-fsck-cache" probably also reports errors on a StGIT repository. No? Basically, it implies that the tool doesn't know how to find all the "heads". Could somebody (Catalin?) perhaps tell how tools like git-fsck-cache and git-repack could figure out which objects are still in use by stgit? Preferably with some generic mechanism that _other_ projects (not just stgit) might want to use? The preferred way would be to just list the references somewhere under .git/refs/stgit, in which case fsck and repack should pick them up automatically (so clearly stgit doesn't do that right now ;). It also implies that doing a "git prune" will do horribly bad things to a stgit repo, since it would remove all the objects that it thinks aren't reachable.. > git repack -a -n > > seems to work fine with StGIT, Well, it "works", but not "fine". Since it doesn't know about the stgit objects, it won't ever pack them. But maybe that's what stgit wants (since they are "temporary"), but it does mean that if you see a big advantage from packing, you might be losing some of it. Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: auto-packing on kernel.org? please?
  2005-11-22  5:41 ` Linus Torvalds
@ 2005-11-22 14:13 ` Catalin Marinas
  2005-11-22 17:05   ` Linus Torvalds
       [not found]   ` <7v64qkfwhe.fsf@assigned-by-dhcp.cox.net>
  1 sibling, 2 replies; 33+ messages in thread
From: Catalin Marinas @ 2005-11-22 14:13 UTC (permalink / raw)
To: Linus Torvalds
Cc: Chuck Lever, Carl Baldwin, H. Peter Anvin, Git Mailing List

On 22/11/05, Linus Torvalds <torvalds@osdl.org> wrote:
> On Tue, 22 Nov 2005, Chuck Lever wrote:
> > there are some things repacking does that breaks StGIT, though.
> >
> > git repack -d
> >
> > seems to remove old commits that StGIT was still depending on.
>
> If that is true, then "git-fsck-cache" probably also reports errors on a
> StGIT repository. No? Basically, it implies that the tool doesn't know how
> to find all the "heads".

Indeed, 'git repack -d' or 'git prune' might remove the patches which
are not applied since there is no link to them from .git/refs/.

> Could somebody (Catalin?) perhaps tell how tools like git-fsck-cache and
> git-repack could figure out which objects are still in use by stgit?

They don't figure this out at the moment. I initially thought about
implementing these commands in StGIT so that they would pass the
proper references.

> Preferably with some generic mechanism that _other_ projects (not just
> stgit) might want to use?
>
> The preferred way would be to just list the references somewhere under
> .git/refs/stgit, in which case fsck and repack should pick them up
> automatically (so clearly stgit doesn't do that right now ;).

I thought about adding .git/refs/patches/<branch>/* files
corresponding to every StGIT patch. Are the above git commands
looking at all depths in the .git/refs/ directory?

> > git repack -a -n
> >
> > seems to work fine with StGIT,
>
> Well, it "works", but not "fine". Since it doesn't know about the stgit
> objects, it won't ever pack them.
>
> But maybe that's what stgit wants (since they are "temporary"), but it
> does mean that if you see a big advantage from packing, you might be
> losing some of it.

The 'git repack -a' command would include the applied patches in the
newly created pack but leave out the unapplied ones. It would be even
better to leave all of them out since the StGIT patches are frequently
changed but an independent mechanism for this would complicate GIT -
'git repack' shouldn't pack any of the objects found in
.git/refs/patches/, even if they are reachable via .git/refs/heads/*
(and maybe call the patches directory something like
.git/refs/unpackable or volatile).

--
Catalin

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: auto-packing on kernel.org? please? 2005-11-22 14:13 ` Catalin Marinas @ 2005-11-22 17:05 ` Linus Torvalds [not found] ` <7v64qkfwhe.fsf@assigned-by-dhcp.cox.net> 1 sibling, 0 replies; 33+ messages in thread From: Linus Torvalds @ 2005-11-22 17:05 UTC (permalink / raw) To: Catalin Marinas Cc: Chuck Lever, Carl Baldwin, H. Peter Anvin, Git Mailing List On Tue, 22 Nov 2005, Catalin Marinas wrote: > > > The preferred way would be to just list the references somewhere under > > .git/refs/stgit, in which case fsck and repack should pick them up > > automatically (so clearly stgit doesn't do that right now ;). > > I thought about adding .git/refs/patches/<branch>/* files > corresponding to the every StGIT patch. Are the above git commands > looking at all depths in the .git/refs/ directory? Yes. Or at least they're supposed to. If they are not, it's a bug regardless, and we'll fix it. > The 'git repack -a' command would include the applied patches in the > newly created pack but leave out the unapplied ones. It would be even > better to leave all of them out since the StGIT patches are frequently > changed but an independent mechanism for this would complicate GIT - > 'git repack' shouldn't pack any of the objects found in > .git/refs/patches/, even if they are reachable via .git/refs/heads/* > (and maybe call the patches directory something like > .git/refs/unpackable or volatile). If we have some default location (and .git/refs/patches/ sounds good), we can make git do the right thing - find them for git-fsck-objects, and ignore them for git-repack. Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
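The convention settled on here is easy to check with a current git: a commit that no ref points at shows up as dangling in fsck (and is what `git prune` would delete), while a ref at any depth under .git/refs - e.g. refs/patches/&lt;branch&gt;/&lt;patch&gt; - protects it. A sketch in a throwaway repository; the ref name below is only an example:

```shell
#!/bin/sh
set -e
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
echo base > f
git add f
git -c user.name=t -c user.email=t@example.com commit -q -m base

# Make a commit that no branch references, like an unapplied patch.
tree=$(git write-tree)
parent=$(git rev-parse HEAD)
orphan=$(echo patch | git -c user.name=t -c user.email=t@example.com \
    commit-tree "$tree" -p "$parent")

before=$(git fsck 2>/dev/null | grep -c '^dangling commit' || true)

# A ref anywhere under .git/refs makes the commit reachable again;
# fsck, repack and prune all walk the refs directory recursively.
git update-ref refs/patches/master/patch1 "$orphan"
after=$(git fsck 2>/dev/null | grep -c '^dangling commit' || true)

echo "dangling before=$before after=$after"
```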
[parent not found: <7v64qkfwhe.fsf@assigned-by-dhcp.cox.net>]
[parent not found: <b0943d9e0511220946o3b62842ey@mail.gmail.com>]
[parent not found: <7v1x18eddp.fsf@assigned-by-dhcp.cox.net>]
* Re: auto-packing on kernel.org? please?
       [not found] ` <7v1x18eddp.fsf@assigned-by-dhcp.cox.net>
@ 2005-11-23 14:10 ` Catalin Marinas
  0 siblings, 0 replies; 33+ messages in thread
From: Catalin Marinas @ 2005-11-23 14:10 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Git Mailing List

On 22/11/05, Junio C Hamano <junkio@cox.net> wrote:
> Catalin Marinas <catalin.marinas@gmail.com> writes:
>
> > What I meant is any object whose exact reference is found in
> > refs/patches (not reachable via refs/patches), even if it is reachable
> > from refs/heads.
>
> do you mean you
> keep blobs and trees in refs/patches, or "exactly found in
> refs/patches" imply "commits in refs/patches and trees and blobs
> reachable from it"? If the latter I think it amounts to the
> same thing. If some of the blobs are shared with what is
> reachable from refs/heads or refs/tags I would presume you would
> want to pack them.

Each patch needs to have 2 commit and 2 tree objects (with the
corresponding blobs). I now understand where the problem appears. Most
of the blobs should actually be packed since they are part of the base
of the stack. Since refs/heads files always point to the top of the
stack, the applied patches (the corresponding objects) would be
automatically packed.

The alternative would be to only pack the objects reachable from
refs/bases but that's really StGIT-specific. Another algorithm would
be to avoid packing objects reachable from refs/patches but not
reachable from refs/bases but this would probably complicate GIT.

> And the "volatile" idea may be a good way of doing this.
> Perhaps "git repack --volatile <glob>" to name paths under
> .git/refs to mark things not to be packed, with a per-repository
> configuration item to give default 'volatile' patterns? I could
> use it when packing my repository to exclude things that are
> only reachable from "pu" branch.
After I eventually understood what you meant, the above would still
include the already applied StGIT patches since they are reachable via
HEAD. Maybe StGIT could avoid modifying refs/heads but I think it would
lose some benefits.

--
Catalin

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: auto-packing on kernel.org? please? 2005-11-22 5:41 ` Linus Torvalds 2005-11-22 14:13 ` Catalin Marinas @ 2005-11-22 18:18 ` Chuck Lever 2005-11-23 14:18 ` Catalin Marinas 1 sibling, 1 reply; 33+ messages in thread From: Chuck Lever @ 2005-11-22 18:18 UTC (permalink / raw) To: Linus Torvalds Cc: Carl Baldwin, H. Peter Anvin, Git Mailing List, Catalin Marinas [-- Attachment #1: Type: text/plain, Size: 2357 bytes --] Linus Torvalds wrote: > > On Tue, 22 Nov 2005, Chuck Lever wrote: > >>there are some things repacking does that breaks StGIT, though. >> >>git repack -d >> >>seems to remove old commits that StGIT was still depending on. > > > If that is true, then "git-fsck-cache" probably also reports errors on a > StGIT repository. No? Basically, it implies that the tool doesn't know how > to find all the "heads". indeed. this is one area where StGIT is "not safe" to use with other porcelains. these raw GIT commands can show a bunch of confusing "dangling references" type errors, or actually modify the index in ways that eliminate StGIT-related commits that aren't currently attached to any ancestry. (i think Catalin mentioned these are related to the unapplied patches in a stack, but there could be others; see below). > The preferred way would be to just list the references somewhere under > .git/refs/stgit, in which case fsck and repack should pick them up > automatically (so clearly stgit doesn't do that right now ;). that could be an extremely large number of commits on a large repository with a lot of patches that have been worked on over a long period. so whatever mechanism is created to do this needs to scale well in the number of commits. > It also implies that doing a "git prune" will do horribly bad things to a > stgit repo, since it would remove all the objects that it thinks aren't > reachable.. yup. been there, done that. lucky for me i have an excellent hourly backup scheme. 
>>git repack -a -n >> >>seems to work fine with StGIT, > > > Well, it "works", but not "fine". Since it doesn't know about the stgit > objects, it won't ever pack them. ah! > But maybe that's what stgit wants (since they are "temporary"), but it > does mean that if you see a big advantage from packing, you might be > losing some of it. actually, those commits aren't all that "temporary". the history/revision feature i'm working on would like to maintain all the commits ever done to an StGIT patch. the only time you can throw away such commits is when the patch is deleted or when it is finally committed to the repository via "stg commit". otherwise, keeping these commits in a pack would be quite a good thing. maybe the first thing to do is to get a basic understanding of an StGIT commit's lifetime. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: auto-packing on kernel.org? please? 2005-11-22 18:18 ` Chuck Lever @ 2005-11-23 14:18 ` Catalin Marinas 0 siblings, 0 replies; 33+ messages in thread From: Catalin Marinas @ 2005-11-23 14:18 UTC (permalink / raw) To: cel; +Cc: Linus Torvalds, Carl Baldwin, H. Peter Anvin, Git Mailing List On 22/11/05, Chuck Lever <cel@citi.umich.edu> wrote: > Linus Torvalds wrote: > > But maybe that's what stgit wants (since they are "temporary"), but it > > does mean that if you see a big advantage from packing, you might be > > losing some of it. > > actually, those commits aren't all that "temporary". the > history/revision feature i'm working on would like to maintain all the > commits ever done to an StGIT patch. That's to avoid pruning them but you might not always want to add them to a pack. > the only time you can throw away such commits is when the patch is > deleted or when it is finally committed to the repository via "stg > commit". otherwise, keeping these commits in a pack would be quite a > good thing. > > maybe the first thing to do is to get a basic understanding of an StGIT > commit's lifetime. My initial idea was to throw the old commit away once a patch is refreshed. Even if you want to preserve the history, it would be only preserved until you send the patch to be merged upstream and you would delete it locally. If all the patches are meant to be sent upstream at some point, you can avoid packing them. -- Catalin ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: auto-packing on kernel.org? please? 2005-11-21 19:24 ` Linus Torvalds 2005-11-21 19:58 ` Junio C Hamano 2005-11-22 5:26 ` Chuck Lever @ 2005-11-22 17:25 ` Carl Baldwin 2005-11-22 17:58 ` Linus Torvalds 2 siblings, 1 reply; 33+ messages in thread From: Carl Baldwin @ 2005-11-22 17:25 UTC (permalink / raw) To: Linus Torvalds; +Cc: H. Peter Anvin, Git Mailing List On Mon, Nov 21, 2005 at 11:24:11AM -0800, Linus Torvalds wrote: > NOTE! Since that email, "git repack" has gotten a "local" option (-l), > which is very useful if the repositories have pointers to alternates. > > So do > > git repack -l > > instead, to get much better packs (and "-a -d" for the full case, of > course). I'm assuming that this option will have no effect on a repository with no alternates file. > Other that than, the old email suggestion should still be fine. [snip] > You can certainly do that if you are nervous. It might even be a good > idea: just for fun, I just did > > git clone -l git git-clone > cd git-clone > > # pick an object at random > rm .git/objects/f7/c3d39fe3db6da3a307da385a7a1cb563ed15f7 > > git repack -a -d > > and it said: > > error: Could not read f7c3d39fe3db6da3a307da385a7a1cb563ed15f7 > fatal: bad tree object f7c3d39fe3db6da3a307da385a7a1cb563ed15f7 > > but then it created the pack _anyway_, and said: > > Packing 27 objects > Pack pack-13bfca704078175c1c1c59964553b14f7b952651 created. > > and happily removed all the old ones. > > So right now, repacking a broken archive can actually break it even more. Interesting. > NOTE! Your "git verify-pack" wouldn't even catch this: the _pack_ is fine, > it's just incomplete. In my opinion, git repack did the right thing in creating the pack even if it is more broken. Starting with a broken repository was the real problem. git repack shouldn't need to worry too much about it. Looking at it from the nervous repository admin's point of view I think he would want to make sure that the repository is good to begin with. 
I think this should be left up to the repository owner and maybe not git repack. Although, the check that you do following this is probably a good idea. > Of course, this only happens if the repository was broken to begin with, > so arguably it's not that bad. But it does show that git-repack should be > more careful and return an error more aggressively. > > Can anybody tell me how to do that sanely? Right now we do > > .. > name=$(git-rev-list --objects $rev_list $(git-rev-parse $rev_parse) | > git-pack-objects --non-empty $pack_objects .tmp-pack) || > exit 1 > .. > > and the thing is, the "git-pack-objects" thing is happy, it's the > "git-rev-list" that fails. So because the last command in the pipeline > returns ok, we think it all is ok.. > > (This is one of the reasons I much prefer working in C over working in > shell: it may be twenty times more lines, but when you have a problem, the > fix is always obvious..) > > Anyway, with that fixed, a "git repack" in many ways would be a mini-fsck, > so it should be very safe in general. Modulo any other bugs like the > above. > > Linus *NOTE* There is one question that I feel remains unanswered. Is it possible to split up the repack -a and repack -d so that the nervous repository owner can insert a git verify-pack in the middle. I'm not nearly this nervous about repositories that I keep for myself but I have ownership of some repositories on which many people may depend. I will feel better if I can verify the pack separately from git-repack before I do the (potentially destructive) -d to remove old packs. I don't mean to say that I don't trust git repack to do the right thing. Fundamentally, I just think that I shouldn't depend on it to do the right thing in order to avoid corruption in my repository. Carl PS I love that the git object store is designed so that object files never *need* to be removed, renamed, modified or otherwise touched in any way after being written to disk. 
I think this makes git inherently extremely safe from corruption, unlike
many other older repository designs.  The only thing that breaks this
inherent safety is the desire to pack repositories to avoid bloat.  That
is why I want to be a little paranoid when I do the repacking.  I want
to maintain some inherent safety in the process that I use to pack them.
This kind of inherent safety is much more valuable than even the highest
quality code written to actually do the packing.

-- 
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 Carl Baldwin                       Systems VLSI Laboratory
 Hewlett Packard Company            MS 88
 work: 970 898-1523                 3404 E. Harmony Rd.
 work: Carl.N.Baldwin@hp.com        Fort Collins, CO 80525
 home: Carl@ecBaldwin.net
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

^ permalink raw reply	[flat|nested] 33+ messages in thread
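[Editor's note: the masked pipeline failure quoted in the message above
(git-rev-list fails, git-pack-objects succeeds, and the pipeline reports
only the last stage's status) can be reproduced without git at all.  A
minimal bash sketch, with a failing subshell standing in for git-rev-list
and `sort` standing in for the happy git-pack-objects; PIPESTATUS and
pipefail are bash features, not part of the plain Bourne shell that
git-repack.sh of that era used, which is part of why the failure was
awkward to catch:]

```shell
#!/bin/bash
# A producer that fails (like git-rev-list on a missing object), piped
# into a consumer that succeeds (like git-pack-objects writing a pack).

(echo some-object; exit 1) | sort > /dev/null
echo "plain pipeline: $?"        # prints 0 -- the producer's failure is masked

# bash records every stage's exit code in the PIPESTATUS array:
(echo some-object; exit 1) | sort > /dev/null
echo "stages: ${PIPESTATUS[*]}"  # prints "1 0" -- the first stage failed

# With pipefail, the pipeline's status reflects any failed stage:
set -o pipefail
(echo some-object; exit 1) | sort > /dev/null
echo "pipefail: $?"              # prints 1
```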
* Re: auto-packing on kernel.org? please?
  2005-11-22 17:25         ` Carl Baldwin
@ 2005-11-22 17:58           ` Linus Torvalds
  0 siblings, 0 replies; 33+ messages in thread
From: Linus Torvalds @ 2005-11-22 17:58 UTC (permalink / raw)
  To: Carl Baldwin; +Cc: H. Peter Anvin, Git Mailing List

On Tue, 22 Nov 2005, Carl Baldwin wrote:
> On Mon, Nov 21, 2005 at 11:24:11AM -0800, Linus Torvalds wrote:
> > NOTE! Since that email, "git repack" has gotten a "local" option (-l),
> > which is very useful if the repositories have pointers to alternates.
> >
> > So do
> >
> > 	git repack -l
> >
> > instead, to get much better packs (and "-a -d" for the full case, of
> > course).
>
> I'm assuming that this option will have no effect on a repository with
> no alternates file.

Correct.  The only thing it does is that when it looks up an object, if
it's not in our _own_ ".git/objects/" dir, it won't pack it.

Actually, that's not entirely true.  It isn't smart enough to know where
every object exists, so it only knows about remote _packs_.  So what
happens is that if you do

	git repack -l -a -d

it will create a pack-file that contains _all_ unpacked objects (whether
local or not) and all objects that are in local packs (because of the
"-a"), but not any objects that are in "alternate packs".

Which is actually exactly what you want, if you are in the situation
that kernel.org is, and you have people who point their alternates to
mine: when I repack my objects, they'll use my packs, but other than
that, they'll prefer to use their own packs over any unpacked objects.

> > So right now, repacking a broken archive can actually break it even more.
>
> Interesting.

Well, with the latest git repack script, that should no longer be true.

> > NOTE! Your "git verify-pack" wouldn't even catch this: the _pack_ is fine,
> > it's just incomplete.
>
> In my opinion, git repack did the right thing in creating the pack even
> if it is more broken.  Starting with a broken repository was the real
> problem.
> git repack shouldn't need to worry too much about it.

Well, "git repack" did the wrong thing in that it never _noticed_, and
it then removed all old packs - even though those old packs contained
objects that we hadn't repacked because of the broken repository.

Of course, _usually_ a broken repository is just that - broken.  The way
you fix a broken repo is to find a non-broken one, and clone that.

However, sometimes what you can do (if you literally just lost a few
objects) is to find a non-broken repo, and make that the _alternates_,
in which case you may be able to save any work you had in the broken one
(assuming you only lost objects that were available somewhere else).

> Looking at it from the nervous repository admin's point of view I think
> he would want to make sure that the repository is good to begin with.

Doing an fsck is certainly always a good idea.  I do a "shallow" fsck
usually several times a day ("shallow" means that it doesn't fsck packs,
only new objects that I have acquired since the last repacking), and I
do a full fsck a couple of times a week.

I don't actually know why I do that, though.  I don't think I've really
_ever_ had a broken repo since some very early days, except for the
cases where I break things on purpose (like remove an object to check
whether "git repack" does the right thing or not).  I'm just used to it,
and the shallow fsck takes a fraction of a second, so I tend to do it
after each pull.

So I really think that an admin has to be more than "nervous" to worry
about it.  He has to be really anal.

(Now, doing a repack and a fsck every week or so might be good, and
automatic shallow fsck's daily is probably a great idea too.  After all,
it _is_ checking checksums, so if you worry about security and want to
make sure that nobody is trying to break in and do bad things to your
repo, a regular fsck is a good thing even if you're not otherwise
worried about corruption).

> *NOTE* There is one question that I feel remains unanswered.
> Is it possible to split up the repack -a and repack -d so that the
> nervous repository owner can insert a git verify-pack in the middle?

They are already split up inside "git-repack", so we could add a hook
there, I guess.  See the git-repack.sh file, and notice how it does the
"remove_redundant" part only after it has created the new pack-file and
done a "sync".

> I don't mean to say that I don't trust git repack to do the right thing.
> Fundamentally, I just think that I shouldn't depend on it to do the
> right thing in order to avoid corruption in my repository.

That's good.  However, as the previous failure of git repack showed, to
some degree the more likely failure mode is actually that the pack
generated by "git repack" is perfectly fine, but it's not _complete_.
Say we have a bug in git repack, for example.

Another case where it's not complete is when you have deleted a branch.
"git repack -a -d" will effectively do a "git prune" wrt objects that
are no longer reachable, and that were in the old packs.

So I'd actually suggest a slightly different approach.  Whenever you
remove old objects (whether it's "git prune" or "git prune-packed" or
"git repack -a -d"), you might want to have an option that doesn't
actually _remove_ them, but just moves them into ".git/attic" or
something like that.

Then you can clean up the attic after doing your weekly full fsck or
something.  And it has the advantage that if somebody has deleted a
branch, and notices later that maybe he wanted that branch back, you can
"unprune" all the objects, run "git-fsck-objects --full" to find any
dangling commits, and you'll have all your branches back.

So in many ways it would perhaps be nicer to have that kind of "safe
remove" option to the pruning commands?

		Linus

^ permalink raw reply	[flat|nested] 33+ messages in thread
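[Editor's note: the "safe remove" idea in the message above - relocating
pruned objects into ".git/attic" instead of unlinking them - can be
sketched in portable shell.  The interface here is entirely hypothetical
(no such option ever looked like this): `safe_remove` takes paths
relative to $GIT_DIR on stdin, and `unprune` moves the whole attic back
into place.  The demo uses a throwaway directory layout standing in for
loose objects, so no git is required:]

```shell
#!/bin/sh
# Hypothetical "safe remove": move files under $GIT_DIR/attic, keeping
# their relative paths, so an "unprune" can restore them later.

safe_remove() {
    while IFS= read -r rel; do
        dest="$GIT_DIR/attic/$rel"
        mkdir -p "$(dirname "$dest")"
        mv "$GIT_DIR/$rel" "$dest"
    done
}

unprune() {
    # Move everything in the attic back where it came from.
    (cd "$GIT_DIR/attic" && find . -type f) | while IFS= read -r rel; do
        mkdir -p "$GIT_DIR/$(dirname "$rel")"
        mv "$GIT_DIR/attic/$rel" "$GIT_DIR/$rel"
    done
}

# Demo: a fake loose object in a fake object store.
GIT_DIR=$(mktemp -d)/.git
mkdir -p "$GIT_DIR/objects/f7"
echo data > "$GIT_DIR/objects/f7/c3d39f"

echo "objects/f7/c3d39f" | safe_remove
test ! -e "$GIT_DIR/objects/f7/c3d39f" && echo "moved to attic"

unprune
test -e "$GIT_DIR/objects/f7/c3d39f" && echo "restored"
```

Cleaning the attic after a successful weekly fsck would then just be
`rm -rf "$GIT_DIR/attic"`.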
end of thread, other threads:[~2005-11-23 14:18 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-10-13 18:44 auto-packing on kernel.org? please? Linus Torvalds
     [not found] ` <434EABFD.5070604@zytor.com>
     [not found]   ` <434EC07C.30505@pobox.com>
2005-10-13 21:23     ` [kernel.org users] " Linus Torvalds
2005-10-16 14:33       ` Dirk Behme
2005-10-16 15:44         ` Daniel Barkalow
2005-10-16 16:12           ` Nick Hengeveld
2005-10-16 16:23             ` Brian Gerst
2005-10-16 16:56               ` Junio C Hamano
2005-10-16 21:33                 ` Nick Hengeveld
2005-10-16 22:12                   ` Junio C Hamano
2005-10-17  6:06                     ` Nick Hengeveld
2005-10-17  8:21                       ` Junio C Hamano
2005-10-17 17:41                         ` Nick Hengeveld
2005-10-17 20:08                           ` Junio C Hamano
2005-10-17 22:56                             ` Daniel Barkalow
2005-10-17 23:19                               ` Linus Torvalds
2005-10-17 23:54                                 ` Nick Hengeveld
2005-10-17 19:13                           ` Daniel Barkalow
2005-10-16 17:10           ` Johannes Schindelin
2005-10-16 17:15             ` Brian Gerst
2005-11-21 19:01 ` Carl Baldwin
2005-11-21 19:24   ` Linus Torvalds
2005-11-21 19:58     ` Junio C Hamano
2005-11-21 20:38       ` Linus Torvalds
2005-11-21 21:35         ` Junio C Hamano
2005-11-22  5:26     ` Chuck Lever
2005-11-22  5:41       ` Linus Torvalds
2005-11-22 14:13         ` Catalin Marinas
2005-11-22 17:05           ` Linus Torvalds
     [not found]             ` <7v64qkfwhe.fsf@assigned-by-dhcp.cox.net>
     [not found]               ` <b0943d9e0511220946o3b62842ey@mail.gmail.com>
     [not found]                 ` <7v1x18eddp.fsf@assigned-by-dhcp.cox.net>
2005-11-23 14:10                   ` Catalin Marinas
2005-11-22 18:18           ` Chuck Lever
2005-11-23 14:18             ` Catalin Marinas
2005-11-22 17:25     ` Carl Baldwin
2005-11-22 17:58       ` Linus Torvalds
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).