* Git is not scalable with too many refs/*
From: NAKAMURA Takumi @ 2011-06-09 3:44 UTC
To: git

Hello, Git. This is my first post here.

I have tried tagging each commit as "refs/tags/rXXXXXX" on a git-svn repo locally (over 100k refs/tags). Indeed, it made several operations extremely slow, even with packed refs and packed objects. I then gave up on pushing the tags upstream (that would have been a terror). :p

I know it might be crazy in the git way, but it would bring me conveniences. (e.g. git log --oneline --decorate shows me each svn revision.) I would like to work on Git so that it can live with many tags.

* Issues, as far as I have investigated:

- git show --decorate is always slow. In decorate.c, every commit is inspected.
- git rev-list --quiet --objects $upstream --not --all spends much time, even when it is expected to exit with 0. As you know, it is used in builtin/fetch.c.
- git-upload-pack shows "all" refs to me if upstream has too many refs.

I would like to work on the items below if they would be valuable:

- Get rid of inspecting commits in packed-refs in the decorate code.
- Implement sort-by-hash packed-refs (not sort-by-name).
- Implement more effective pruning of --not --all in revision.c.
- Think about a protocol enhancement to transfer many refs more efficiently.

I am happy to work on these issues; thank you.

...Takumi
* Re: Git is not scalable with too many refs/*
From: Sverre Rabbelier @ 2011-06-09 6:50 UTC
To: NAKAMURA Takumi, Shawn O. Pearce
Cc: git

Heya,

[+shawn, who runs into something similar with Gerrit]

On Thu, Jun 9, 2011 at 05:44, NAKAMURA Takumi <geek4civic@gmail.com> wrote:
> Hello, Git. This is my first post here.
>
> I have tried tagging each commit as "refs/tags/rXXXXXX" on a git-svn
> repo locally (over 100k refs/tags). Indeed, it made several operations
> extremely slow, even with packed refs and packed objects. I then gave
> up on pushing the tags upstream.
> [...]
> I am happy to work on these issues; thank you.

-- 
Cheers,

Sverre Rabbelier
* Re: Git is not scalable with too many refs/*
From: Shawn Pearce @ 2011-06-09 15:23 UTC
To: Sverre Rabbelier
Cc: NAKAMURA Takumi, git

On Wed, Jun 8, 2011 at 23:50, Sverre Rabbelier <srabbelier@gmail.com> wrote:
> [+shawn, who runs into something similar with Gerrit]
> On Thu, Jun 9, 2011 at 05:44, NAKAMURA Takumi <geek4civic@gmail.com> wrote:
>> I have tried tagging each commit as "refs/tags/rXXXXXX" on a git-svn
>> repo locally (over 100k refs/tags).

As Jakub pointed out, use git notes for this. They were designed to scale to >100,000 annotations.

>> Indeed, it made several operations extremely slow, even with packed
>> refs and packed objects.

Having a reference to every commit in the repository is horrifically slow. We run into this with Gerrit Code Review and I need to find another solution. Git just wasn't meant to process repositories like this.

-- 
Shawn.
* Re: Git is not scalable with too many refs/*
From: A Large Angry SCM @ 2011-06-09 15:52 UTC
To: Shawn Pearce
Cc: Sverre Rabbelier, NAKAMURA Takumi, git

On 06/09/2011 11:23 AM, Shawn Pearce wrote:
> On Thu, Jun 9, 2011 at 05:44, NAKAMURA Takumi<geek4civic@gmail.com> wrote:
>> I have tried tagging each commit as "refs/tags/rXXXXXX" on a git-svn
>> repo locally (over 100k refs/tags).
>
> As Jakub pointed out, use git notes for this. They were designed to
> scale to >100,000 annotations.
>
>> Indeed, it made several operations extremely slow, even with packed
>> refs and packed objects.
>
> Having a reference to every commit in the repository is horrifically
> slow. We run into this with Gerrit Code Review and I need to find
> another solution. Git just wasn't meant to process repositories like
> this.

Assuming a very large number of refs, what is it that makes git so horrifically slow? Is there a design or implementation lesson here?
* Re: Git is not scalable with too many refs/*
From: Shawn Pearce @ 2011-06-09 15:56 UTC
To: A Large Angry SCM
Cc: Sverre Rabbelier, NAKAMURA Takumi, git

On Thu, Jun 9, 2011 at 08:52, A Large Angry SCM <gitzilla@gmail.com> wrote:
> Assuming a very large number of refs, what is it that makes git so
> horrifically slow? Is there a design or implementation lesson here?

A few things.

Git does a sequential scan of all references when it first needs to access references for an operation. This requires reading the entire packed-refs file, and the recursive scan of the "refs/" subdirectory for any loose refs that might override the packed-refs file.

A lot of operations toss every commit that a reference points at into the revision walker's LRU queue. If you have a tag pointing to every commit, then the entire project history enters the LRU queue at once, up front. That queue is managed with O(N^2) insertion time. And the entire queue has to be filled before anything can be output.

-- 
Shawn.
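To make the O(N^2) insertion cost concrete: the revision walker keeps its pending commits in a date-sorted singly linked list, so each insertion scans from the head. A minimal sketch of that structure (simplified from git's commit_list code; the struct fields here are pared down for illustration):

    #include <stdlib.h>

    struct commit {
        unsigned long date;              /* committer timestamp */
        /* ... object data elided ... */
    };

    struct commit_list {
        struct commit *item;
        struct commit_list *next;
    };

    /*
     * Insert one commit, keeping the list sorted by descending date.
     * Each call scans from the head, so it is O(queue length) per
     * insert; seeding the queue with N tips (one tag per commit)
     * therefore costs O(N^2) before anything can be output.
     */
    static struct commit_list *insert_by_date(struct commit *item,
                                              struct commit_list **list)
    {
        struct commit_list **pp = list;
        struct commit_list *new_entry;

        while (*pp && (*pp)->item->date >= item->date)
            pp = &(*pp)->next;           /* the linear scan */

        new_entry = malloc(sizeof(*new_entry));
        new_entry->item = item;
        new_entry->next = *pp;
        *pp = new_entry;
        return new_entry;
    }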
* Re: Git is not scalable with too many refs/*
From: Jeff King @ 2011-06-09 16:26 UTC
To: Shawn Pearce
Cc: A Large Angry SCM, Sverre Rabbelier, NAKAMURA Takumi, git

On Thu, Jun 09, 2011 at 08:56:50AM -0700, Shawn O. Pearce wrote:

> A lot of operations toss every commit that a reference points at into
> the revision walker's LRU queue. If you have a tag pointing to every
> commit, then the entire project history enters the LRU queue at once,
> up front. That queue is managed with O(N^2) insertion time. And the
> entire queue has to be filled before anything can be output.

We ran into this recently at GitHub. Since our many-refs repos were mostly forks, we had a lot of duplicate commits, and were able to solve it with ea5f220 (fetch: avoid repeated commits in mark_complete, 2011-05-19).

However, I also worked up a faster priority queue implementation that would work in the general case:

  http://thread.gmane.org/gmane.comp.version-control.git/174003/focus=174005

I suspect it would speed up the original poster's slow fetch. The problem is that a fast priority queue doesn't have quite the same access patterns as a linked list, so replacing all of the commit_lists in git with the priority queue would be quite a painful undertaking. So we are left with using the fast queue only in specific hot spots.

-Peff
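For readers curious what "a faster priority queue" means here: a binary heap keyed on commit date gives O(log N) push and pop in place of the list's O(N) insert. A minimal sketch, reusing the simplified struct commit from the sketch above (this is an illustration, not the patch from the linked thread):

    #include <stdlib.h>

    struct commit_queue {
        struct commit **heap;
        int len, alloc;
    };

    /* Push, keeping the newest commit date at the root: O(log N). */
    static void queue_push(struct commit_queue *q, struct commit *c)
    {
        int i;
        if (q->len == q->alloc) {
            q->alloc = q->alloc ? 2 * q->alloc : 64;
            q->heap = realloc(q->heap, q->alloc * sizeof(*q->heap));
        }
        i = q->len++;
        while (i > 0) {                        /* sift up */
            int parent = (i - 1) / 2;
            if (q->heap[parent]->date >= c->date)
                break;
            q->heap[i] = q->heap[parent];
            i = parent;
        }
        q->heap[i] = c;
    }

    /* Pop the most recent commit: O(log N). */
    static struct commit *queue_pop(struct commit_queue *q)
    {
        struct commit *top, *last;
        int i = 0;
        if (!q->len)
            return NULL;
        top = q->heap[0];
        last = q->heap[--q->len];
        for (;;) {                             /* sift down */
            int child = 2 * i + 1;
            if (child >= q->len)
                break;
            if (child + 1 < q->len &&
                q->heap[child + 1]->date > q->heap[child]->date)
                child++;
            if (last->date >= q->heap[child]->date)
                break;
            q->heap[i] = q->heap[child];
            i = child;
        }
        q->heap[i] = last;
        return top;
    }

Filling the queue with N entries is then O(N log N) instead of O(N^2), which matters enormously at 100K refs.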
* Re: Git is not scalable with too many refs/*
From: NAKAMURA Takumi @ 2011-06-10 3:59 UTC
To: Jeff King
Cc: Shawn Pearce, A Large Angry SCM, Sverre Rabbelier, git

Good afternoon, Git! Thank you all for the comments.

Jakub and Shawn: sure, notes should be used in this case, I agree.

> (eg. git log --oneline --decorate shows me each svn revision)

My example may have been misleading. I meant that tags give me a pretty abbreviation everywhere in Git. I would be happier still if tags could serve as a bi-directional alias, as Stephen mentions.

It would also be better if git-svn could record its metadata into notes, I think. :D

Stephen,

2011/6/10 Stephen Bash <bash@genarts.com>:
> I've seen two different workflows develop:
> 1) Hacking on some code in Git, the programmer finds something wrong. Using Git tools he can pickaxe/bisect/etc. and find that the problem traces back to a commit imported from Subversion.
> 2) The programmer finds something wrong, asks a coworker, the coworker says "see bug XYZ", and bug XYZ says "Fixed in r20356".
>
> I agree notes are the right answer for (1), but for (2) you really want a cross-reference table from Subversion rev number to Git commit.

That is the point I wanted to make, thank you! I work with svn people; they often speak in svn revision numbers (and I have to answer them with svn revisions, too).

> In our office we created the cross-reference table once by walking the Git tree and storing it as a file (we had some degenerate cases where one SVN rev mapped to multiple Git commits, but I don't remember the details), but it's not really usable from Git. Lightweight tags would be an awesome solution (if they worked). Perhaps a custom subcommand is a reasonable middle ground.

Reconstructing the svnrev-to-commit mapping can be done by git-svn itself. Unfortunately, git-svn's .rev_map is sorted by revision number. I think subcommands would be of little use unless they were pluggable into Git as a "smart-tag resolver".

Peff,

First of all, thank you for your work at GitHub. Awesome! I didn't know GitHub had refs issues. (Yeah, I should not push 100k tags to GitHub for now. :p)

I work on both Linux and Windows. A repo with many refs can make Git on Windows awfully slow (even slower than on Linux!). I hope to work on improving various performance issues on Windows, too.

FYI, I have tweaked git-rev-list so that commits are not sorted by date with --quiet. It improves git-fetch (git rev-list --not --all) performance when objects are well packed.

...Takumi
* Re: Git is not scalable with too many refs/*
From: Jeff King @ 2011-06-13 22:27 UTC
To: NAKAMURA Takumi
Cc: Shawn Pearce, A Large Angry SCM, Sverre Rabbelier, git

On Fri, Jun 10, 2011 at 12:59:47PM +0900, NAKAMURA Takumi wrote:

> 2011/6/10 Stephen Bash <bash@genarts.com>:
> > I agree notes are the right answer for (1), but for (2) you really
> > want a cross-reference table from Subversion rev number to Git commit.
>
> That is the point I wanted to make, thank you! I work with svn people;
> they often speak in svn revision numbers (and I have to answer them
> with svn revisions, too).

Yeah, there is no simple way to do the bi-directional mapping in git. If all you want are decorations on commits, notes are definitely the way to go. They are optimized for lookup in the commit -> data direction.

But if you want data -> commit, the only mapping we have is refs, and they are not well optimized for the many-refs use case. Packed refs are better than loose refs, but I think right now we just load them all into an in-memory linked list. We could load them into a more efficient in-memory data structure, or we could perhaps even mmap the packed-refs file and binary search it in place.

But lookup is only part of the problem. There are algorithms that want to look at all the refs (notably fetching and pushing), which are going to be a bit slower. We don't have a way to tell those algorithms that those refs are uninteresting for reachability analysis, because they are just pointing to parts of the graph that are already reachable by regular refs. Maybe there could be a part of the refs namespace that is ignored by "--all". I dunno. That seems like a weird inconsistency.

> FYI, I have tweaked git-rev-list so that commits are not sorted by date
> with --quiet. It improves git-fetch (git rev-list --not --all)
> performance when objects are well packed.

I'm not sure that is a good solution. Even with --quiet, we will be walking the commit graph to find merge bases to see if things are connected. The walking code expects date sorting; I'm not sure what changing that assumption will do to the code.

-Peff
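As a sketch of the "mmap and binary search in place" idea: packed-refs lines are variable-length ("<sha1> <refname>\n"), so a midpoint probe has to snap to the start of a line before comparing. A hedged illustration, assuming a simplified, well-formed file with no header or peeled lines (this is not git's actual API):

    #include <string.h>

    /*
     * Binary search an mmap'd, name-sorted packed-refs buffer for
     * "<40-hex-sha1> <refname>\n"; returns the start of the matching
     * line, or NULL. Hypothetical helper for illustration.
     */
    static const char *find_packed_ref(const char *buf, size_t len,
                                       const char *refname)
    {
        size_t lo = 0, hi = len, n = strlen(refname);

        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            const char *eol, *name;
            size_t next;
            int cmp;

            while (mid > lo && buf[mid - 1] != '\n')
                mid--;                         /* snap to start of line */
            eol = memchr(buf + mid, '\n', len - mid);
            next = eol ? (size_t)(eol - buf) + 1 : len;
            name = buf + mid + 41;             /* skip "<sha1> " */
            cmp = strncmp(name, refname, n);
            if (!cmp)
                cmp = (name[n] == '\n') ? 0 : 1;  /* longer sorts later */
            if (!cmp)
                return buf + mid;              /* found it */
            if (cmp < 0)
                lo = next;                     /* target is further down */
            else
                hi = mid;                      /* target is earlier */
        }
        return NULL;
    }

This keeps lookups at O(log N) line probes without first parsing the whole file into a list, at the cost of a little pointer arithmetic per probe.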
* Re: Git is not scalable with too many refs/*
From: Andreas Ericsson @ 2011-06-14 0:17 UTC
To: NAKAMURA Takumi
Cc: Jeff King, Shawn Pearce, A Large Angry SCM, Sverre Rabbelier, git

On 06/10/2011 05:59 AM, NAKAMURA Takumi wrote:
> 2011/6/10 Stephen Bash<bash@genarts.com>:
>> I've seen two different workflows develop:
>> 1) Hacking on some code in Git, the programmer finds something wrong. Using Git tools he can pickaxe/bisect/etc. and find that the problem traces back to a commit imported from Subversion.
>> 2) The programmer finds something wrong, asks a coworker, the coworker says "see bug XYZ", and bug XYZ says "Fixed in r20356".
>>
>> I agree notes are the right answer for (1), but for (2) you really want a cross-reference table from Subversion rev number to Git commit.
> [...]

If you're using svn metadata in the commit text, you can always do "git log -p --grep=@20356" to get the commits relevant to that one. It's not as fast as "git show svn-20356", but it's not exactly glacial either, and it would avoid the problems you're having now.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and terror, I think we should give some serious thought to declaring war on peace.
* Re: Git is not scalable with too many refs/*
From: Jeff King @ 2011-06-14 0:30 UTC
To: Andreas Ericsson
Cc: NAKAMURA Takumi, Shawn Pearce, A Large Angry SCM, Sverre Rabbelier, git

On Tue, Jun 14, 2011 at 02:17:58AM +0200, Andreas Ericsson wrote:

> If you're using svn metadata in the commit text, you can always do
> "git log -p --grep=@20356" to get the commits relevant to that one.
> It's not as fast as "git show svn-20356", but it's not exactly
> glacial either, and it would avoid the problems you're having now.

If we do end up putting this data into notes eventually (which I think we _should_ do, because then you aren't locked into having this svn cruft in your commit messages for all time, but can rather choose whether or not to display it), it would be nice to have a --grep-notes feature in git-log. Or maybe --grep should look in notes by default, too, if we are showing them.

I suspect the feature would be really easy to implement, if somebody is looking for a gentle introduction to git, or a fun way to spend an hour. :)

-Peff
* Re: Git is not scalable with too many refs/*
From: Junio C Hamano @ 2011-06-14 4:41 UTC
To: Jeff King
Cc: Andreas Ericsson, NAKAMURA Takumi, Shawn Pearce, A Large Angry SCM, Sverre Rabbelier, git

Jeff King <peff@peff.net> writes:

> I suspect the feature would be really easy to implement, if somebody is
> looking for a gentle introduction to git, or a fun way to spend an hour.

I would rather want to see if somebody can come up with a flexible reverse-mapping feature around notes. It does not have to be completely generic; just being flexible enough is fine.
* Re: Git is not scalable with too many refs/*
From: Sverre Rabbelier @ 2011-06-14 7:26 UTC
To: Junio C Hamano
Cc: Jeff King, Andreas Ericsson, NAKAMURA Takumi, Shawn Pearce, A Large Angry SCM, git

Heya,

On Tue, Jun 14, 2011 at 06:41, Junio C Hamano <gitster@pobox.com> wrote:
> I would rather want to see if somebody can come up with a flexible
> reverse-mapping feature around notes. It does not have to be completely
> generic; just being flexible enough is fine.

Wouldn't it be enough to simply create a note on 'r651235' with the git ref as its contents?

-- 
Cheers,

Sverre Rabbelier
* Re: Git is not scalable with too many refs/*
From: Johan Herland @ 2011-06-14 10:02 UTC
To: Sverre Rabbelier
Cc: git, Junio C Hamano, Jeff King, Andreas Ericsson, NAKAMURA Takumi, Shawn Pearce, A Large Angry SCM

On Tuesday 14 June 2011, Sverre Rabbelier wrote:
> On Tue, Jun 14, 2011 at 06:41, Junio C Hamano <gitster@pobox.com> wrote:
> > I would rather want to see if somebody can come up with a flexible
> > reverse-mapping feature around notes. It does not have to be
> > completely generic; just being flexible enough is fine.
>
> Wouldn't it be enough to simply create a note on 'r651235' with the
> git ref as its contents?

Not quite sure what you mean by "create a note on 'r651235'". You could devise a scheme where you compute SHA1('r651235'), and then create a note on the resulting hash.

Notes are named by the SHA1 of the object they annotate, but there is no hard requirement (as long as you stay away from "git notes prune") that the SHA1 being annotated actually exists as a valid Git object in your repo.

Hence, you can use notes to annotate _anything_ that can be uniquely reduced to a SHA1 hash.

...Johan

-- 
Johan Herland, <johan@herland.net>
www.herland.net
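To make "uniquely reduced to a SHA1 hash" concrete: git hashes a blob as SHA1("blob <size>\0" + data), so a string like "r651235" deterministically yields a key a note can hang off of. A small sketch using OpenSSL's SHA1 (an assumption for illustration; git has its own hashing internals):

    #include <openssl/sha.h>
    #include <stdio.h>
    #include <string.h>

    /* Compute the SHA-1 git would give a blob containing `data`,
     * i.e. SHA1("blob <len>\0" + data), usable as a notes key. */
    static void blob_key(const char *data, unsigned char out[20])
    {
        char hdr[32];
        /* +1 so the header's trailing NUL is hashed, as git does */
        int hdrlen = snprintf(hdr, sizeof(hdr), "blob %zu",
                              strlen(data)) + 1;
        SHA_CTX ctx;
        SHA1_Init(&ctx);
        SHA1_Update(&ctx, hdr, hdrlen);
        SHA1_Update(&ctx, data, strlen(data));
        SHA1_Final(out, &ctx);
    }

    int main(void)
    {
        unsigned char sha1[20];
        int i;
        blob_key("r651235", sha1);
        for (i = 0; i < 20; i++)
            printf("%02x", sha1[i]);
        printf("\n");
        return 0;
    }

The printed hash matches `printf 'r651235' | git hash-object --stdin`, so a note attached at that key maps r651235 to whatever you store in the note body.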
* Re: Git is not scalable with too many refs/*
From: Sverre Rabbelier @ 2011-06-14 10:34 UTC
To: Johan Herland
Cc: git, Junio C Hamano, Jeff King, Andreas Ericsson, NAKAMURA Takumi, Shawn Pearce, A Large Angry SCM

Heya,

On Tue, Jun 14, 2011 at 12:02, Johan Herland <johan@herland.net> wrote:
> Not quite sure what you mean by "create a note on 'r651235'". You could
> devise a scheme where you compute SHA1('r651235'), and then create a
> note on the resulting hash.

I was thinking they could annotate anything, even non-SHAs, but in that case, yes, the SHA of the revision would work just as well.

-- 
Cheers,

Sverre Rabbelier
* Re: Git is not scalable with too many refs/*
From: Jeff King @ 2011-06-14 17:02 UTC
To: Johan Herland
Cc: Sverre Rabbelier, git, Junio C Hamano, Andreas Ericsson, NAKAMURA Takumi, Shawn Pearce, A Large Angry SCM

On Tue, Jun 14, 2011 at 12:02:46PM +0200, Johan Herland wrote:

> > Wouldn't it be enough to simply create a note on 'r651235' with the
> > git ref as its contents?
>
> Not quite sure what you mean by "create a note on 'r651235'". You could
> devise a scheme where you compute SHA1('r651235'), and then create a
> note on the resulting hash.
> [...]
> Hence, you can use notes to annotate _anything_ that can be uniquely
> reduced to a SHA1 hash.

I lean against that as a solution. I think "git gc" will probably eventually learn to do "git notes prune", at which point we would start losing people's data. So I think it is better to keep the definition of notes a little tighter now, and say "the left-hand side of a notes mapping must be a referenced object". We can always loosen it later.

On top of that, though, the SHA1 solution is not all that pleasant. It lets you do exact lookups, but you have no way of iterating over the list of svn revisions.

I also think we can do something a little more lightweight. The user has already created and is maintaining a mapping in one direction via the notes. We just need the inverse mapping, which we can generate programmatically. So it can be a straight cache, with the SHA1 of the notes tree determining the cache validity (i.e., if the forward mapping in the notes tree changes, you regenerate the cache from scratch).

We would want to store the cache in an on-disk format that could be searched easily. Possibly something like the packed-refs format would be sufficient, if we mmap'd and binary searched it. It would be dirt simple if we used an existing key/value store like gdbm or tokyocabinet, but we usually try to avoid extra dependencies.

-Peff
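A sketch of the cache Jeff describes, under the assumption that the cache file records the notes tree it was built from in its first line, making staleness a single string comparison (file layout and names invented for illustration):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct rev_entry {
        char key[64];       /* note contents, e.g. "r1828" */
        char commit[41];    /* annotated commit, hex */
    };

    static int cmp_entry(const void *a, const void *b)
    {
        return strcmp(((const struct rev_entry *)a)->key,
                      ((const struct rev_entry *)b)->key);
    }

    /*
     * Invert the forward notes mapping (entries[] filled by walking
     * the notes tree) into a sorted file; the header's notes tree
     * SHA-1 is the validity token: if it no longer matches the
     * current notes ref, throw the file away and regenerate.
     */
    static void write_cache(FILE *out, const char *notes_tree_sha1,
                            struct rev_entry *entries, size_t n)
    {
        size_t i;
        qsort(entries, n, sizeof(*entries), cmp_entry);
        fprintf(out, "%s\n", notes_tree_sha1);
        for (i = 0; i < n; i++)
            fprintf(out, "%s %s\n", entries[i].key, entries[i].commit);
    }

The sorted body can then be mmap'd and binary searched much like the packed-refs idea earlier in the thread.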
* Re: Git is not scalable with too many refs/*
From: Shawn Pearce @ 2011-06-14 19:20 UTC
To: Jeff King
Cc: Johan Herland, Sverre Rabbelier, git, Junio C Hamano, Andreas Ericsson, NAKAMURA Takumi, A Large Angry SCM

On Tue, Jun 14, 2011 at 10:02, Jeff King <peff@peff.net> wrote:
> I also think we can do something a little more lightweight. The user has
> already created and is maintaining a mapping in one direction via the
> notes. We just need the inverse mapping, which we can generate
> programmatically. So it can be a straight cache, with the SHA1 of the
> notes tree determining the cache validity (i.e., if the forward mapping
> in the notes tree changes, you regenerate the cache from scratch).
>
> We would want to store the cache in an on-disk format that could be
> searched easily. Possibly something like the packed-refs format would be
> sufficient, if we mmap'd and binary searched it. It would be dirt simple
> if we used an existing key/value store like gdbm or tokyocabinet, but we
> usually try to avoid extra dependencies.

Yeah, not a bad idea. Use a series of SSTable-like things, like Hadoop uses. It doesn't need to be as complex as the Hadoop SSTable concept: just a simple, immutable, sorted string-to-string mapping file, with edits applied by creating an overlay file that contains new/updated entries.

As you point out, we can use the notes tree to tell us the validity of the cache, and do incremental updates. If the current cache doesn't match the notes ref, compute the tree diff between the current cache's source tree and the new tree, and create a new SSTable-like thing that has the relevant updates as an overlay of the existing tables. After some time you will have many of these little overlay files, and a GC can just merge them down to a single file.

The only problem is, you probably want this "reverse notes index" to be indexing a portion of the note blob text, not all of it. That is, we want the SVN note text to say something like "SVN Revision: r1828" so `git log --notes=svn` shows us something more useful than just "r1828". But in the reverse index, we may only want the key to be "r1828". So you need some sort of small mapping function to decide what to put into that reverse index.

-- 
Shawn.
* Re: Git is not scalable with too many refs/*
From: Jeff King @ 2011-06-14 19:47 UTC
To: Shawn Pearce
Cc: Johan Herland, Sverre Rabbelier, git, Junio C Hamano, Andreas Ericsson, NAKAMURA Takumi, A Large Angry SCM

On Tue, Jun 14, 2011 at 12:20:29PM -0700, Shawn O. Pearce wrote:

> Yeah, not a bad idea. Use a series of SSTable-like things, like Hadoop
> uses. [...] After some time you will have many of these little overlay
> files, and a GC can just merge them down to a single file.

I was really hoping that it would be fast enough that we could simply blow away the old mapping and recreate it from scratch. That gets us out of writing any journaling-type code with overlays. For something like svn revisions, it's probably fine to take an extra second or two to build the cache after we do a fetch. But it wouldn't scale to something that was getting updated frequently.

If we're going to start doing clever database-y things, I'd much rather use a proven key/value db solution like tokyocabinet. I'm just not sure how to degrade gracefully when the db library isn't available. Don't allow reverse mappings? Fall back to something slow?

> The only problem is, you probably want this "reverse notes index" to
> be indexing a portion of the note blob text, not all of it. [...] So
> you need some sort of small mapping function to decide what to put
> into that reverse index.

I had assumed that we would just be writing r1828 into the note. The output via git log is actually pretty readable:

  $ git notes --ref=svn/revisions add -m r1828
  $ git show --notes=svn/revisions
  ...
  Notes (svn/revisions):
      r1828

Of course this is just one use case.

For that matter, we have to figure out how one would actually reference the reverse mapping. If we have a simple, pure-reverse mapping, we can just generate and cache it on the fly, and give a special syntax. Like:

  $ git log notes/svn/revisions@{revnote:r1828}

which would invert the notes/svn/revisions tree, search for r1828, and reference the resulting commit.

If you had something more heavyweight that actually needed to parse during the mapping, you might have something like:

  $ : set up the mapping
  $ git config revnote.svn.map 'SVN Revision: (r[0-9]+)'
  $ : do the reverse; we should be able to build the cache on the fly
  $ git notes reverse r1828
  346ab9aaa1cf7b1ed2dd2c0a67bccc5b8ec23f7c
  $ : so really you could have a similar ref syntax, though
  $ : this would require some ref parser updates, as we currently
  $ : assume anything to the left of @{} is a real ref
  $ git log r1828@{revnote:svn}

The syntaxes are not as nice as having a real ref. In the last example, we could probably look for the contents of "@{}" as a possible revnote mapping (since we've already had to name it via the configuration), to make it "r1828@{svn}". Or you could even come up with a default set of revnotes to consider, so that if we look up "r1828" and it isn't a real ref, we fall back to trying r1828@{revnote:svn}.

I dunno. I'm just throwing ideas out at this point.

-Peff
* Re: Git is not scalable with too many refs/*
From: Shawn Pearce @ 2011-06-14 20:12 UTC
To: Jeff King
Cc: Johan Herland, Sverre Rabbelier, git, Junio C Hamano, Andreas Ericsson, NAKAMURA Takumi, A Large Angry SCM

On Tue, Jun 14, 2011 at 12:47, Jeff King <peff@peff.net> wrote:
> I was really hoping that it would be fast enough that we could simply
> blow away the old mapping and recreate it from scratch. That gets us out
> of writing any journaling-type code with overlays. For something like
> svn revisions, it's probably fine to take an extra second or two to
> build the cache after we do a fetch. But it wouldn't scale to something
> that was getting updated frequently.
>
> If we're going to start doing clever database-y things, I'd much rather
> use a proven key/value db solution like tokyocabinet. I'm just not sure
> how to degrade gracefully when the db library isn't available. Don't
> allow reverse mappings? Fall back to something slow?

This is why I would prefer to build the solution into Git. It's not that bad to do a sorted string file. Take something simple that is similar to a pack file:

  GSST | vers | rcnt | srctree | base
  [ klen key vlen value ]*
  [ roff ]*
  SHA1(all_of_above)

where vers is a version number, rcnt is the number of records, srctree is the SHA-1 of the notes tree this thing was indexed from, and base is the SHA-1 of the notes tree this "applies on top of". There are then rcnt records in the file, each using variable-length key-length and value-length fields (klen, vlen), with variable-length keys and values. At the end of the file are rcnt 4-byte offsets to the start of each key.

When writing the file, write all of the above to a temporary file, then rename it to $GIT_DIR/cache/db-$SHA1.db, as that is a unique name. It's easy to prepare the list of entries in memory as an array of structs of key/value pairs, sort them with qsort(), write them out and update offsets as you go, then dump out the offset table at the end.
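A rough sketch of what writing such a table might look like, following the layout above (the struct and names are invented for illustration, and fixed 4-byte lengths are used where the description would allow varints):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct gsst_rec { char *key, *value; };

    static int rec_cmp(const void *a, const void *b)
    {
        return strcmp(((const struct gsst_rec *)a)->key,
                      ((const struct gsst_rec *)b)->key);
    }

    /* Write header, sorted records, then the offset table; the
     * trailing SHA-1 over the whole file is elided for brevity. */
    static void gsst_write(FILE *out, const unsigned char srctree[20],
                           const unsigned char base[20],
                           struct gsst_rec *recs, uint32_t rcnt)
    {
        uint32_t *roff = malloc(rcnt * sizeof(uint32_t));
        uint32_t vers = 1, off, i;

        qsort(recs, rcnt, sizeof(*recs), rec_cmp);
        fwrite("GSST", 4, 1, out);
        fwrite(&vers, 4, 1, out);
        fwrite(&rcnt, 4, 1, out);
        fwrite(srctree, 20, 1, out);      /* notes tree this indexes */
        fwrite(base, 20, 1, out);         /* overlay base, or zeros */
        off = 4 + 4 + 4 + 20 + 20;
        for (i = 0; i < rcnt; i++) {
            uint32_t klen = strlen(recs[i].key);
            uint32_t vlen = strlen(recs[i].value);
            roff[i] = off;                /* where this record starts */
            fwrite(&klen, 4, 1, out);
            fwrite(recs[i].key, klen, 1, out);
            fwrite(&vlen, 4, 1, out);
            fwrite(recs[i].value, vlen, 1, out);
            off += 4 + klen + 4 + vlen;
        }
        fwrite(roff, 4, rcnt, out);       /* enables binary search */
        free(roff);
    }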
One could compress the offset table by only storing every N offsets; readers perform a binary search until they find the first key that sorts before their desired key, then sequentially scan records until they locate the correct entry... but I'm not sure the space savings are really worthwhile here.

When reading, scan the directory and read the headers of each file. If a file has your target srctree, your cache is current and you can read it. If a key isn't in this file, you open the file named $GIT_DIR/cache/db-$base.db and try again there, walking back along the base chain until base is '0'x40 (or some other marker in the header to denote there is no base file).

GC is just a matter of merging the sorted files together. Follow along all of the base pointers, open all of them, scan through the records, and write out the first key that is defined. I guess we need a small "delete" bit in the records to indicate that a particular key/value was removed from the database. Since this is a reverse mapping, duplicates are possible, and readers that want all values need to scan back to the base file, but skip base entries that were marked deleted in a newer file.

Updating is just preparing a new file that uses the current srctree as its base, and only inserting/sorting the paths that were different in the notes.

We probably need to store these files keyed by their notes ref, so we can find "svn/revisions" separately from "bugzilla" (another hypothetical mapping of Bugzilla bug ids to commit SHA-1s, based on notes that attach bug numbers to commits).

I don't think it's that bad. Maybe it's a bit too much complexity for version 1 to support these incremental update files, but it shouldn't be that hard.

> > The only problem is, you probably want this "reverse notes index" to
> > be indexing a portion of the note blob text, not all of it. [...]
>
> I had assumed that we would just be writing r1828 into the note. The
> output via git log is actually pretty readable:
>
>   $ git notes --ref=svn/revisions add -m r1828
>   $ git show --notes=svn/revisions
>   ...
>   Notes (svn/revisions):
>       r1828
>
> Of course this is just one use case.

Thanks; I keep forgetting that notes print the note ref name out before the text, so this annotation is already present. That makes it much more likely that the bare "r1828" text is acceptable in the note, and that the reverse index key is just the entire content of the blob. :-)

> For that matter, we have to figure out how one would actually reference
> the reverse mapping. If we have a simple, pure-reverse mapping, we can
> just generate and cache it on the fly, and give a special syntax.
> Like:
>
>   $ git log notes/svn/revisions@{revnote:r1828}

Uhm. Ick.

> The syntaxes are not as nice as having a real ref. In the last example,
> we could probably look for the contents of "@{}" as a possible revnote
> mapping (since we've already had to name it via the configuration), to
> make it "r1828@{svn}". Or you could even come up with a default set of
> revnotes to consider, so that if we look up "r1828" and it isn't a real
> ref, we fall back to trying r1828@{revnote:svn}.
Or, what about setting up a fake ref namespace:

  git config ref.refs/remotes/svn/*.from refs/notes/svn/revisions

Then `git log svn/r1828` works. But these aren't real references. We would only want to consider them if a request matched the glob, so `git for-each-ref` and `git upload-pack` aren't reporting these things by default, and neither is `git log --all` or `gitk --all`.

I agree a syntax that works out of the box without a configuration file change would be nicer. But we are running out of operators to do that with. `git log notes/svn/revisions@{revnote:r1828}`, as you propose above, is at least workable...

-- 
Shawn.
* Re: Git is not scalable with too many refs/*
From: Martin Fick @ 2011-09-08 19:53 UTC
To: git

Just thought that I should add some numbers to this thread, as it seems that later versions of git are worse off by several orders of magnitude on this one.

We have a Gerrit repo with just under 100K refs in refs/changes/*. When I fetch them all with git 1.7.6, it does not seem to complete. Even after 5 days, it is just under half way through the ref numbers! It appears (though I am not sure) that it is getting slower with time, so it may not even complete after 10 days; I couldn't wait any longer. However, the same command completes in under 8 minutes with git 1.7.3.3 on the same machine!

Syncing 100K refs:
  git 1.7.6      > 8 days?
  git 1.7.3.3    ~8 mins

That is quite a difference! Have there been any obvious changes to git that could cause this? If needed, I can bisect git to find out where things go sour, but I thought that perhaps someone already understands the problem and why older versions aren't nearly as bad as recent ones.

Some more things that I have tried: after syncing the repo locally with all 100K refs under refs/changes, I cloned it locally again and tried fetching locally with both git 1.7.6 and 1.7.3.3. I got the same results as remotely, so it does not appear to be related to round trips.

The original git remote syncing takes just a bit of time, and then it outputs lines like these:

  ...
  * [new branch]      refs/changes/13/66713/2 -> refs/changes/13/66713/2
  * [new branch]      refs/changes/13/66713/3 -> refs/changes/13/66713/3
  * [new branch]      refs/changes/13/66713/4 -> refs/changes/13/66713/4
  * [new branch]      refs/changes/13/66713/5 -> refs/changes/13/66713/5
  ...

This is the part that takes forever. The lines seem to scroll by slower and slower (with git 1.7.6). In the beginning, the lines come at about a screenful a minute; after 5 days, about one a minute. My CPU is pegged at 100% during this time (one core).

Since I have some good test data for this, let me know if I should test anything specific.

Thanks,

-Martin

Employee of Qualcomm Innovation Center, Inc. which is a member of Code Aurora Forum
* Re: Git is not scalable with too many refs/*
From: Martin Fick @ 2011-09-09 0:52 UTC
To: git

An update: I bisected it down to this commit:

  88a21979c5717e3f37b9691e90b6dbf2b94c751a
  fetch/pull: recurse into submodules when necessary

Since this can be disabled with the --no-recurse-submodules switch, I tried that, and indeed, even with the latest 1.7.7rc it becomes fast (~8 mins) again. The strange part is that the repository does not have any submodules. Anyway, I hope this can be useful to others, since it is a workaround which speeds things up enormously. Let me know if you have any other tests that you want me to perform.

-Martin
* Re: Git is not scalable with too many refs/*
From: Thomas Rast @ 2011-09-09 1:05 UTC
To: Martin Fick, Jens Lehmann
Cc: git

Martin Fick wrote:
> An update: I bisected it down to this commit:
>
>   88a21979c5717e3f37b9691e90b6dbf2b94c751a
>   fetch/pull: recurse into submodules when necessary
>
> Since this can be disabled with the --no-recurse-submodules switch, I
> tried that, and indeed, even with the latest 1.7.7rc it becomes fast
> (~8 mins) again. The strange part is that the repository does not have
> any submodules. [...]

Jens should know about this, so let's Cc him.

I took a quick look, and I'm guessing that there's at least one quadratic behaviour: in check_for_new_submodule_commits() I see

+	const char *argv[] = {NULL, NULL, "--not", "--all", NULL};
+	int argc = ARRAY_SIZE(argv) - 1;
+
+	init_revisions(&rev, NULL);

which means that the --all needs to walk all commits reachable from all refs and flag them as uninteresting. But that function is called for every ref update, so IIUC the time spent is on the order of (#ref updates * #commits).

-- 
Thomas Rast
trast@{inf,student}.ethz.ch
* Re: Git is not scalable with too many refs/*
From: Thomas Rast @ 2011-09-09 1:13 UTC
To: Martin Fick, Jens Lehmann
Cc: git

Thomas Rast wrote:
> +	const char *argv[] = {NULL, NULL, "--not", "--all", NULL};
> +	int argc = ARRAY_SIZE(argv) - 1;
> +
> +	init_revisions(&rev, NULL);
>
> which means that the --all needs to walk all commits reachable from
> all refs and flag them as uninteresting.

Scratch that: it "only" needs to mark every tip commit and then walk them back to about where the interesting commits end.

In any case, since the uninteresting set only gets larger, it should be possible to reuse the same revision walker.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch
* Re: Git is not scalable with too many refs/*
From: Jens Lehmann @ 2011-09-09 15:59 UTC
To: Martin Fick
Cc: git

On 09.09.2011 02:52, Martin Fick wrote:
> An update: I bisected it down to this commit:
>
>   88a21979c5717e3f37b9691e90b6dbf2b94c751a
>   fetch/pull: recurse into submodules when necessary
>
> Since this can be disabled with the --no-recurse-submodules switch, I
> tried that, and indeed, even with the latest 1.7.7rc it becomes fast
> (~8 mins) again. [...]

Thanks for nailing that one down. I'm currently looking into bringing back decent performance here.
* Re: Git is not scalable with too many refs/*
From: Martin Fick @ 2011-09-25 20:43 UTC
To: git
Cc: Christian Couder

A coworker of mine pointed out that a simple

  git checkout

can also take a rather long time (> 3 minutes) when run on a repo with ~100K refs. While this is not massive like the other problem I reported, it still seems like more than one would expect. So I tried an older version of git, and to my surprise/delight, it was much faster (0.2s). I bisected this issue too, and it seems that the "offending" commit is 680955702990c1d4bfb3c6feed6ae9c6cb5c3c07:

  commit 680955702990c1d4bfb3c6feed6ae9c6cb5c3c07
  Author: Christian Couder <chriscool@tuxfamily.org>
  Date:   Fri Jan 23 10:06:53 2009 +0100

      replace_object: add mechanism to replace objects found in
      "refs/replace/"

      The code implementing this mechanism has been copied more-or-less
      from the commit graft code.

      This mechanism is used in "read_sha1_file". sha1 passed to this
      function that match a ref name in "refs/replace/" are replaced by
      the sha1 that has been read in the ref.

      We "die" if the replacement recursion depth is too high or if we
      can't read the replacement object.

      Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
      Signed-off-by: Junio C Hamano <gitster@pobox.com>

Now, I suspect this commit is desirable, but I was hoping that a look at it might inspire someone to find an obvious problem with it.

Thanks,

-Martin

-- 
Employee of Qualcomm Innovation Center, Inc. which is a member of Code Aurora Forum
* Re: Git is not scalable with too many refs/*
From: Christian Couder @ 2011-09-26 12:41 UTC
To: Martin Fick
Cc: git, Christian Couder

On Sun, Sep 25, 2011 at 10:43 PM, Martin Fick <mfick@codeaurora.org> wrote:
> A coworker of mine pointed out that a simple
>
>   git checkout
>
> can also take a rather long time (> 3 minutes) when run on a repo with
> ~100K refs.

Are all these refs packed?

> While this is not massive like the other problem I reported, it still
> seems like more than one would expect. So I tried an older version of
> git, and to my surprise/delight, it was much faster (0.2s). I bisected
> this issue too, and it seems that the "offending" commit is
> 680955702990c1d4bfb3c6feed6ae9c6cb5c3c07:
>
>   commit 680955702990c1d4bfb3c6feed6ae9c6cb5c3c07
>   Author: Christian Couder <chriscool@tuxfamily.org>
>   Date:   Fri Jan 23 10:06:53 2009 +0100
>
>       replace_object: add mechanism to replace objects found in
>       "refs/replace/"
> [...]
> Now, I suspect this commit is desirable, but I was hoping that a look
> at it might inspire someone to find an obvious problem with it.

I don't think there is an obvious problem with it, but it would be nice if you could dig a bit deeper.

The first thing that could take a lot of time is the call to for_each_replace_ref() in this function:

+static void prepare_replace_object(void)
+{
+	static int replace_object_prepared;
+
+	if (replace_object_prepared)
+		return;
+
+	for_each_replace_ref(register_replace_ref, NULL);
+	replace_object_prepared = 1;
+}

Another thing is calling replace_object_pos() repeatedly in lookup_replace_object().

Thanks,
Christian.
* Re: Git is not scalable with too many refs/*
From: Martin Fick @ 2011-09-26 17:47 UTC
To: Christian Couder
Cc: git, Christian Couder

On Monday, September 26, 2011 06:41:04 am Christian Couder wrote:
> On Sun, Sep 25, 2011 at 10:43 PM, Martin Fick <mfick@codeaurora.org> wrote:
> > A coworker of mine pointed out that a simple
> >
> >   git checkout
> >
> > can also take a rather long time (> 3 minutes) when run on a repo
> > with ~100K refs.
>
> Are all these refs packed?

I think so. Is there a way to find out for sure?

-Martin
* Re: Git is not scalable with too many refs/*
From: Christian Couder @ 2011-09-26 18:56 UTC
To: Martin Fick
Cc: Christian Couder, git

On Monday 26 September 2011 19:47:56 Martin Fick wrote:
> On Monday, September 26, 2011 06:41:04 am Christian Couder wrote:
> > Are all these refs packed?
>
> I think so. Is there a way to find out for sure?

After "git pack-refs --all" I get:

  $ find .git/refs/ -type f
  .git/refs/remotes/origin/HEAD
  .git/refs/stash

So I suppose that if such a find gives you only a few files, all (or most) of your refs are packed.

Best regards,
Christian.
* Re: Git is not scalable with too many refs/*
From: Martin Fick @ 2011-09-30 16:41 UTC
To: git
Cc: Christian Couder, Thomas Rast, René Scharfe, Julian Phillips, Michael Haggerty

> On Monday, September 26, 2011 12:56:04 pm Christian Couder wrote:
> After "git pack-refs --all" I get:

OK.

So many great improvements in ref scalability; thanks, everyone! It is getting so good that I had to take a step back and re-evaluate what we consider good/bad. On doing so, I can't help but think that fetches still need some improvement.

Fetches had the worst regression of all (> 8 days), so the massive fix to bring that down to 7.5 minutes was awesome. 7-8 minutes sounded pretty good two weeks ago, especially when a checkout took 5+ minutes! But now that almost every other operation has been sped up, that is starting to feel a bit on the slow side again. My spidey sense tells me something is still not quite right in the fetch path.

Here is some more data to back up my spidey sense: after all the improvements, a noop fetch of all the changes (noop meaning they are all already up to date) takes around 3 minutes in the non-gced (non-packed-refs) case. The same noop takes only ~12s in the gced (packed-refs) case!

I dug into this a bit further. I took a non-gced, non-packed-refs repo, and this time, instead of gcing it to get packed refs, I only ran the above "git pack-refs --all", so that objects did not get gced. With this, the noop fetch was also only around 12s. This confirms that the non-gced objects are not interfering with the noop fetch; the problem really is just the unpacked refs. Just to confirm that the FS is not horribly slow, I ran "find .git/refs", and it takes only about 0.4s for about 80K results!

So, while I understand that a full fetch will actually have to transfer quite a bit of data, the noop fetch seems to still be suffering in the non-gced (non-packed-refs) case. If that time were improved, I suspect that the full fetch would improve by at least an equivalent amount, if not more. Any thoughts?

-Martin

-- 
Employee of Qualcomm Innovation Center, Inc. which is a member of Code Aurora Forum
* Re: Git is not scalable with too many refs/*
From: Martin Fick @ 2011-09-30 19:26 UTC
To: git
Cc: Christian Couder, Thomas Rast, René Scharfe, Julian Phillips, Michael Haggerty

On Friday, September 30, 2011 10:41:13 am Martin Fick wrote:
> I dug into this a bit further. I took a non-gced, non-packed-refs
> repo, and this time, instead of gcing it to get packed refs, I only
> ran the above "git pack-refs --all", so that objects did not get gced.
> With this, the noop fetch was also only around 12s. This confirms that
> the non-gced objects are not interfering with the noop fetch; the
> problem really is just the unpacked refs. Just to confirm that the FS
> is not horribly slow, I ran "find .git/refs", and it takes only about
> 0.4s for about 80K results!

Is there a way I can force refs to always be packed? I didn't see a config option for this. I would like to try a fetch this way, even if I have to make a small code tweak. I tried simulating on-the-fly ref packing every now and then by running the pack from another repo during the fetch; it actually slowed things down (by more than the time it took to do the packs).

-Martin

-- 
Employee of Qualcomm Innovation Center, Inc. which is a member of Code Aurora Forum
* Re: Git is not scalable with too many refs/*
From: Martin Fick @ 2011-09-30 21:02 UTC
To: git
Cc: Christian Couder, Thomas Rast, René Scharfe, Julian Phillips, Michael Haggerty

On Friday, September 30, 2011 10:41:13 am Martin Fick wrote:
> ...the massive fix to bring that down to 7.5 minutes was awesome.
> 7-8 minutes sounded pretty good two weeks ago, especially when a
> checkout took 5+ minutes! But now that almost every other operation
> has been sped up, that is starting to feel a bit on the slow side
> again. My spidey sense tells me something is still not quite right in
> the fetch path.

I guess I overlooked that there were two sides to this equation. Even though I have been doing my fetches locally, I was using the file:// protocol, and it appears that the remote side was running git 1.7.6, which was in my path the whole time. Eliminating that from my path and pointing to the "best" binary, with all the fixes on both the remote and local sides, the full fetch does indeed speed up quite a bit: it goes from about 7.5 minutes down to ~5 minutes! Previously the remote seemed to spend the extra time primarily after

  remote: Counting objects: 316961

yet before

  remote: Compressing objects

> Here is some more data to back up my spidey sense: after all the
> improvements, a noop fetch of all the changes (noop meaning they are
> all already up to date) takes around 3 minutes in the non-gced
> (non-packed-refs) case. The same noop takes only ~12s in the gced
> (packed-refs) case!

I believe (it is hard to go back and be sure) that the timings above which gave me 3 minutes resulted from the remote using git 1.7.6. Now, with the good binary, in both repos (packed and unpacked) I get great warm-cache times of about 11-13s for a noop fetch. It is interesting to note that cold-cache times are 20s for packed refs and 1m30s for unpacked refs. I guess that makes some sense. But this does leave me thinking that packed refs should become the default, with a config option to disable packing? That might still help a fetch.

Since a full sync is now down to about 5 minutes, I broke the output down a bit. It appears that the longest part (2:45m) is now the time spent scrolling through each change. Each one of these lines takes about 2ms:

  * [new branch]      refs/changes/99/71199/1 -> refs/changes/99/71199/1

That seems fast, but at about 80K refs... So, are there any obvious O(N) loops over the refs happening inside each of the [new branch] iterations?

-Martin

-- 
Employee of Qualcomm Innovation Center, Inc. which is a member of Code Aurora Forum
* Re: Git is not scalable with too many refs/* 2011-09-30 21:02 ` Martin Fick @ 2011-09-30 22:06 ` Martin Fick 2011-10-01 20:41 ` Junio C Hamano ` (2 more replies) 0 siblings, 3 replies; 126+ messages in thread From: Martin Fick @ 2011-09-30 22:06 UTC (permalink / raw) To: git Cc: Christian Couder, Thomas Rast, René Scharfe, Julian Phillips, Michael Haggerty On Friday, September 30, 2011 03:02:30 pm Martin Fick wrote: > On Friday, September 30, 2011 10:41:13 am Martin Fick wrote: > Since a full sync is now done to about 5mins, I broke > down the output a bit. It appears that the longest part > (2:45m) is now the time spent scrolling though each > change still. Each one of these takes about 2ms: > * [new branch] refs/changes/99/71199/1 -> > refs/changes/99/71199/1 > > Seems fast, but at about 80K... So, are there any obvious > N loops over the refs happening inside each of of the > [new branch] iterations? OK, I narrowed it down I believe. If I comment out the invalidate_cached_refs() line in write_ref_sha1(), it speeds through this section. I guess this makes sense, we invalidate the cache and have to rebuild it after every new ref is added? Perhaps a simple fix would be to move the invalidation right after all the refs are updated? Maybe write_ref_sha1 could take in a flag to tell it to not invalidate the cache so that during iterative updates it could be disabled and then run manually after the update? -Martin -- Employee of Qualcomm Innovation Center, Inc. which is a member of Code Aurora Forum ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-30 22:06 ` Martin Fick @ 2011-10-01 20:41 ` Junio C Hamano 2011-10-02 5:19 ` Michael Haggerty 2011-10-03 18:12 ` Martin Fick 2011-10-08 20:59 ` Martin Fick 2 siblings, 1 reply; 126+ messages in thread From: Junio C Hamano @ 2011-10-01 20:41 UTC (permalink / raw) To: Martin Fick Cc: git, Christian Couder, Thomas Rast, René Scharfe, Julian Phillips, Michael Haggerty Martin Fick <mfick@codeaurora.org> writes: > I guess this makes sense, we invalidate the cache and have > to rebuild it after every new ref is added? Perhaps a > simple fix would be to move the invalidation right after all > the refs are updated? Maybe write_ref_sha1 could take in a > flag to tell it to not invalidate the cache so that during > iterative updates it could be disabled and then run manually > after the update? It might make sense, on top of Julian's patch, to add a bit that says "the contents of this ref-array is current but the array is not sorted", and whenever somebody runs add_ref(), append it also to the ref-array (so that the contents do not have to be re-read from the filesystem) but flip the "unsorted" bit on. Then update look-up and iteration to sort the array when "unsorted" bit is on without re-reading the contents from the filesystem. ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-10-01 20:41 ` Junio C Hamano @ 2011-10-02 5:19 ` Michael Haggerty 2011-10-03 0:46 ` Martin Fick 0 siblings, 1 reply; 126+ messages in thread From: Michael Haggerty @ 2011-10-02 5:19 UTC (permalink / raw) To: Junio C Hamano Cc: Martin Fick, git, Christian Couder, Thomas Rast, René Scharfe, Julian Phillips On 10/01/2011 10:41 PM, Junio C Hamano wrote: > Martin Fick <mfick@codeaurora.org> writes: >> I guess this makes sense, we invalidate the cache and have >> to rebuild it after every new ref is added? Perhaps a >> simple fix would be to move the invalidation right after all >> the refs are updated? Maybe write_ref_sha1 could take in a >> flag to tell it to not invalidate the cache so that during >> iterative updates it could be disabled and then run manually >> after the update? > > It might make sense, on top of Julian's patch, to add a bit that says "the > contents of this ref-array is current but the array is not sorted", and > whenever somebody runs add_ref(), append it also to the ref-array (so that > the contents do not have to be re-read from the filesystem) but flip the > "unsorted" bit on. Then update look-up and iteration to sort the array > when "unsorted" bit is on without re-reading the contents from the > filesystem. My WIP patch series does one better than this; it keeps track of what part of the array is already sorted so that a reference can be found in the sorted part of the array using binary search, and if it is not found there a linear search is done through the unsorted part of the array. I also have some code (not pushed) that adds some intelligence to make the use case repeat many times: check if reference exists add reference efficient by picking optimal intervals to re-sort the array. (This sort can also be faster if most of the array is already sorted: sort the new entries using qsort then merge sort them into the already-sorted part of the list.) But there is another reason that we cannot currently update the reference cache on the fly rather than invalidating it after each change: symbolic references are stored *resolved* in the reference cache, and no record is kept of the reference that they refer to. Therefore it is possible that the addition or modification of an arbitrary reference can affect how a symbolic reference is resolved, but there is not enough information in the cache to track this. IMO the correct solution is to store symbolic references un-resolved. Given that lookup is going to become much faster, the slowdown in reference resolution should not be a big performance penalty, whereas reference updating could become *much* faster. Michael -- Michael Haggerty mhagger@alum.mit.edu http://softwareswirl.blogspot.com/ ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-10-02 5:19 ` Michael Haggerty @ 2011-10-03 0:46 ` Martin Fick 2011-10-04 8:08 ` Michael Haggerty 0 siblings, 1 reply; 126+ messages in thread From: Martin Fick @ 2011-10-03 0:46 UTC (permalink / raw) To: Michael Haggerty, Junio C Hamano Cc: git, Christian Couder, Thomas Rast, René Scharfe, Julian Phillips Michael Haggerty <mhagger@alum.mit.edu> wrote: >On 10/01/2011 10:41 PM, Junio C Hamano wrote: >> Martin Fick <mfick@codeaurora.org> writes: >>> I guess this makes sense, we invalidate the cache and have >>> to rebuild it after every new ref is added? Perhaps a >>> simple fix would be to move the invalidation right after all >>> the refs are updated? Maybe write_ref_sha1 could take in a >>> flag to tell it to not invalidate the cache so that during >>> iterative updates it could be disabled and then run manually >>> after the update? >> >I >also have some code (not pushed) that adds some intelligence to make >the use case > > repeat many times: > check if reference exists > add reference Would it be possible to separate the two steps into separate loops somehow? Could it instead look like this: > repeat many times: > check if reference exists > repeat many times: > add reference It might be difficult with the current functions to achive this, but it would allow the cache to be invalidated over and over in loop two without impacting performance since all the lookups could be done in the first loop. Of course, this would likely require checking for dups before running the first loop. -Martin Employee of Qualcomm Innovation Center,Inc. which is a member of Code Aurora Forum ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-10-03 0:46 ` Martin Fick @ 2011-10-04 8:08 ` Michael Haggerty 0 siblings, 0 replies; 126+ messages in thread From: Michael Haggerty @ 2011-10-04 8:08 UTC (permalink / raw) To: Martin Fick Cc: Junio C Hamano, git, Christian Couder, Thomas Rast, René Scharfe, Julian Phillips On 10/03/2011 02:46 AM, Martin Fick wrote: > Michael Haggerty <mhagger@alum.mit.edu> wrote: >> I >> also have some code (not pushed) that adds some intelligence to make >> the use case >> >> repeat many times: >> check if reference exists >> add reference > > Would it be possible to separate the two steps into separate loops somehow? Could it instead look like this: > >> repeat many times: >> check if reference exists > >> repeat many times: >> add reference Undoubtedly this would be possible. But I'd rather make the refs code efficient and general enough that its users don't need to worry about such things. > [...] Of course, this would likely require checking for dups > before running the first loop. Yes, and this "checking for dups before running the first loop" is approximately the same work that would have to be done within a smarter version of the refs code. Michael -- Michael Haggerty mhagger@alum.mit.edu http://softwareswirl.blogspot.com/ ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-30 22:06 ` Martin Fick 2011-10-01 20:41 ` Junio C Hamano @ 2011-10-03 18:12 ` Martin Fick 2011-10-03 19:42 ` Junio C Hamano 2011-10-04 8:16 ` Michael Haggerty 2011-10-08 20:59 ` Martin Fick 2 siblings, 2 replies; 126+ messages in thread From: Martin Fick @ 2011-10-03 18:12 UTC (permalink / raw) To: git Cc: Christian Couder, Thomas Rast, René Scharfe, Julian Phillips, Michael Haggerty On Friday, September 30, 2011 04:06:31 pm Martin Fick wrote: > > OK, I narrowed it down I believe. If I comment out the > invalidate_cached_refs() line in write_ref_sha1(), it > speeds through this section. > > I guess this makes sense, we invalidate the cache and > have to rebuild it after every new ref is added? > Perhaps a simple fix would be to move the invalidation > right after all the refs are updated? Maybe > write_ref_sha1 could take in a flag to tell it to not > invalidate the cache so that during iterative updates it > could be disabled and then run manually after the > update? Would this solution be acceptable if I submitted a patch to do it? My test shows that this will make a full fetch of ~80K changes go from 4:50min to 1:50min, -Martin -- Employee of Qualcomm Innovation Center, Inc. which is a member of Code Aurora Forum ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-10-03 18:12 ` Martin Fick @ 2011-10-03 19:42 ` Junio C Hamano 2011-10-04 8:16 ` Michael Haggerty 1 sibling, 0 replies; 126+ messages in thread From: Junio C Hamano @ 2011-10-03 19:42 UTC (permalink / raw) To: Martin Fick Cc: git, Christian Couder, Thomas Rast, René Scharfe, Julian Phillips, Michael Haggerty Martin Fick <mfick@codeaurora.org> writes: >> I guess this makes sense, we invalidate the cache and >> have to rebuild it after every new ref is added? >> Perhaps a simple fix would be to move the invalidation >> right after all the refs are updated? Maybe >> write_ref_sha1 could take in a flag to tell it to not >> invalidate the cache so that during iterative updates it >> could be disabled and then run manually after the >> update? > > Would this solution be acceptable if I submitted a patch to > do it? My test shows that this will make a full fetch of > ~80K changes go from 4:50min to 1:50min, As long as the resulting code does not introduce new races with another process updating refs while the bulk update is running, I wouldn't have an issue with it. ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-10-03 18:12 ` Martin Fick 2011-10-03 19:42 ` Junio C Hamano @ 2011-10-04 8:16 ` Michael Haggerty 1 sibling, 0 replies; 126+ messages in thread From: Michael Haggerty @ 2011-10-04 8:16 UTC (permalink / raw) To: Martin Fick Cc: git, Christian Couder, Thomas Rast, René Scharfe, Julian Phillips On 10/03/2011 08:12 PM, Martin Fick wrote: > On Friday, September 30, 2011 04:06:31 pm Martin Fick wrote: >> OK, I narrowed it down I believe. If I comment out the >> invalidate_cached_refs() line in write_ref_sha1(), it >> speeds through this section. >> >> I guess this makes sense, we invalidate the cache and >> have to rebuild it after every new ref is added? >> Perhaps a simple fix would be to move the invalidation >> right after all the refs are updated? Maybe >> write_ref_sha1 could take in a flag to tell it to not >> invalidate the cache so that during iterative updates it >> could be disabled and then run manually after the >> update? > > Would this solution be acceptable if I submitted a patch to > do it? My test shows that this will make a full fetch of > ~80K changes go from 4:50min to 1:50min, No, no, no. Let's fix up the refs cache once and for all and avoid adding special case code all over the place. * With minor changes, we can make it possible to invalidate single refs instead of the whole the refs cache. And we can teach the refs code to invalidate refs by itself when necessary, so that other code can become stupider and more decoupled from the refs code. * With other minor changes (mostly implemented), we can support a partly-sorted refs list that decides intelligently when to resort itself. This will give most of the performance benefit of circumventing the refs cache API with none of the chaos. Michael -- Michael Haggerty mhagger@alum.mit.edu http://softwareswirl.blogspot.com/ ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-30 22:06 ` Martin Fick 2011-10-01 20:41 ` Junio C Hamano 2011-10-03 18:12 ` Martin Fick @ 2011-10-08 20:59 ` Martin Fick 2011-10-09 5:43 ` Michael Haggerty 2 siblings, 1 reply; 126+ messages in thread From: Martin Fick @ 2011-10-08 20:59 UTC (permalink / raw) To: git Cc: Christian Couder, Thomas Rast, René Scharfe, Julian Phillips, Michael Haggerty On Friday, September 30, 2011 04:06:31 pm Martin Fick wrote: > On Friday, September 30, 2011 03:02:30 pm Martin Fick wrote: > > On Friday, September 30, 2011 10:41:13 am Martin Fick > > wrote: > > Since a full sync is now done to about 5mins, I broke > > down the output a bit. It appears that the longest > > part (2:45m) is now the time spent scrolling though > > each > > > > change still. Each one of these takes about 2ms: > > * [new branch] refs/changes/99/71199/1 -> > > > > refs/changes/99/71199/1 > > > > Seems fast, but at about 80K... So, are there any > > obvious N loops over the refs happening inside each of > > of the [new branch] iterations? > > OK, I narrowed it down I believe. If I comment out the > invalidate_cached_refs() line in write_ref_sha1(), it > speeds through this section. > > I guess this makes sense, we invalidate the cache and > have to rebuild it after every new ref is added? > Perhaps a simple fix would be to move the invalidation > right after all the refs are updated? Maybe > write_ref_sha1 could take in a flag to tell it to not > invalidate the cache so that during iterative updates it > could be disabled and then run manually after the > update? OK, this thing has been bugging me... I found some more surprising results, I hope you can follow because there are corner cases here which have surprising impacts. ** Important fact: ** --------------- ** When I clone my repo, it has about 4K tags which ** come in packed to the clone. ** This fact has a heavy impact on how I test things. If I choose to delete these packed-refs from the cloned repo and then do a fetch of the changes, all of the tags are also fetched along with these changes. This means that if I want to test the impact of having packed-refs vs no packed refs, on my change fetches, I need to first delete the packed-refs file, and second fetch all the tags again, so that when I fetch the changes, the repo only actually fetches changes, not all the tags! So, with this in mind, I have discovered, that the fetch performance degradation by invalidating the caches in write_ref_sha1() is actually due to the packed-refs being reloaded and resorted again on each ref insertion (not the loose refs)!!! Remember the important fact above? Yeah, those silly 4K refs (not a huge number, not 61K!) take a while to reread from the file and sort. When this is done for 61K changes, it adds a lot of time to a fetch. The sad part is that, of course, the packed-refs don't really need to be invalidated since we never add new refs as packed refs during a fetch (but apparently we do during a clone)! Also noteworthy is that invalidating the loose refs, does not cause a big delay. Some data: 1) A fetch of the changes in my series with all good external patches applied takes about 7:30min. 2) A fetch of the changes with #1 invalidate_cache_refs() commented out in write_ref_sha1() takes about 1:50min. 3) A fetch of the changes with #1 with invalidate_cache_refs() in write_ref_sha1() replaced with a call to my custom invalidate_loose_cache_refs() takes about 1:50min. 
4) A fetch with #1 on a repo with packed-refs deleted after the clone, takes about ~5min. ** This is a strange regression which threw me off. In this case, all the tags are refetched in addition to the changes, this seems to cause some weird interaction that makes things take longer than they should (#5 + #6 = 2:10m << #4 5min). 5) A fetch with #1 on a repo with packed-refs deleted after the clone, and then a fetch done to get all the tags (see #6), takes only 1:30m!!!! 6) A fetch to get all the **TAGS** with packed-refs deleted after the clone, takes about 40s. ---Additional side data/tests: 7) A fetch of the changes with #1 and a special flag causing the packed-refs to be read from the file, but not parsed or sorted, takes 2:34min. So just the repeated reads add at least 40s. 8) A fetch of the changes with #1 and a special flag causing the packed-refs to be read from the file, parsed, but NOT sorted, takes 3:40min. So the parsing appears to take an additional minute at least. I think that all of this might explain why no matter how good Michael's intentions are with his patch series, his series isn't likely to fix this problem unless he does not invalidate the packed-refs after each insertion. I tried preventing this invalidation in his series to prove this, but unfortunately, it appears that in his series it is no longer possible to only invalidate just the packed-refs? :( Michael, I hope I am completely wrong about that... Are there any good consistency reasons to invalidate the packed refs in write_ref_sha1()? If not, would you accept a patch to simply skip this invalidation (to only invalidate the loose refs)? Thanks, -Martin -- Employee of Qualcomm Innovation Center, Inc. which is a member of Code Aurora Forum ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-10-08 20:59 ` Martin Fick @ 2011-10-09 5:43 ` Michael Haggerty 0 siblings, 0 replies; 126+ messages in thread From: Michael Haggerty @ 2011-10-09 5:43 UTC (permalink / raw) To: Martin Fick Cc: git, Christian Couder, Thomas Rast, René Scharfe, Julian Phillips [-- Attachment #1: Type: text/plain, Size: 2441 bytes --] On 10/08/2011 10:59 PM, Martin Fick wrote: > [...] > So, with this in mind, I have discovered, that the fetch > performance degradation by invalidating the caches in > write_ref_sha1() is actually due to the packed-refs being > reloaded and resorted again on each ref insertion (not the > loose refs)!!! Good point. > I think that all of this might explain why no matter how > good Michael's intentions are with his patch series, his > series isn't likely to fix this problem I never claimed that my patch fixes all use cases, or cures cancer either :-) One step at a time. > unless he does not > invalidate the packed-refs after each insertion. I tried > preventing this invalidation in his series to prove this, > but unfortunately, it appears that in his series it is no > longer possible to only invalidate just the packed-refs? :( > Michael, I hope I am completely wrong about that... Yes, you are completely wrong. I just implemented more selective cache invalidation on top of the patch series. I think your suggestion is safe because only non-symbolic references can be stored in the packed refs; therefore the modification of a loose ref can never affect the value of a packed ref. Of course a loose ref can *hide* the value of a packed ref, but in such cases the packed ref is never read anyway. And the *deletion* of a loose ref can expose a previously-hidden packed ref, but this case is handled by delete_ref(), which explicitly invalidates the packed-ref cache. While I was at it, I also: * In delete_ref(), only invalidate the packed reference cache if the reference that is being deleted actually *is* among the packed references. * Changed the code to stop invalidating the ref caches for submodules. In the code paths where the cache invalidation was being done, only main-module references were being changed. However, I'm not familiar enough with submodules to know if/when submodule references *can* be changed. It could be that the submodule reference caches have to be invalidated under some circumstances; the current code might be buggy in this area. The changes are pushed to github. They don't make any significant difference to my "refperf" results (attached), so perhaps a new benchmark should be added. But I'm curious to see how they affect your timings. 
Michael -- Michael Haggerty mhagger@alum.mit.edu http://softwareswirl.blogspot.com/ [-- Attachment #2: refperf-summary.out --] [-- Type: text/plain, Size: 5251 bytes --] =================================== ======= ======= ======= ======= ======= ======= ======= ======= ======= ======= Test name [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] =================================== ======= ======= ======= ======= ======= ======= ======= ======= ======= ======= branch-loose-cold 3.19 3.15 3.10 3.19 3.25 0.70 0.61 0.74 0.66 0.56 branch-loose-warm 0.19 0.19 0.20 0.19 0.19 0.00 0.00 0.00 0.00 0.00 for-each-ref-loose-cold 3.73 3.45 3.55 3.39 3.44 3.40 3.50 3.52 3.70 3.51 for-each-ref-loose-warm 0.44 0.44 0.44 0.43 0.43 0.43 0.43 0.43 0.43 0.43 checkout-loose-cold 3.35 3.23 3.23 3.15 3.29 0.65 0.71 0.76 0.66 0.69 checkout-loose-warm 0.19 0.19 0.20 0.18 0.19 0.01 0.01 0.01 0.01 0.00 checkout-orphan-loose 0.19 0.19 0.19 0.18 0.19 0.00 0.00 0.00 0.00 0.00 checkout-from-detached-loose-cold 7.80 4.17 4.17 4.05 4.09 4.07 4.26 4.23 4.18 4.08 checkout-from-detached-loose-warm 1.01 1.01 1.02 1.02 1.04 1.03 1.04 1.04 1.02 1.04 branch-contains-loose-cold 35.76 35.80 36.15 36.67 35.13 36.29 36.37 36.03 36.70 36.01 branch-contains-loose-warm 33.01 33.62 33.52 33.51 32.41 33.51 33.71 32.10 33.70 31.99 pack-refs-loose 4.19 4.20 4.25 4.21 4.20 4.21 4.20 4.19 4.24 4.21 branch-packed-cold 0.79 0.62 0.60 0.66 0.65 0.58 0.68 0.72 0.60 0.61 branch-packed-warm 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 for-each-ref-packed-cold 0.96 0.97 0.97 0.93 0.89 0.92 0.98 0.96 0.92 0.96 for-each-ref-packed-warm 0.26 0.26 0.26 0.26 0.26 0.26 0.26 0.27 0.27 0.27 checkout-packed-cold 16.14 16.16 16.74 2.04 2.03 2.09 2.06 2.13 2.03 2.00 checkout-packed-warm 0.17 0.17 0.18 0.19 0.18 0.17 0.27 0.18 0.19 0.18 checkout-orphan-packed 0.02 0.01 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 checkout-from-detached-packed-cold 16.24 15.96 16.80 1.99 2.06 2.01 2.08 2.10 1.97 1.96 checkout-from-detached-packed-warm 15.04 14.96 15.76 0.77 0.81 0.79 0.83 0.80 0.79 0.80 branch-contains-packed-cold 36.18 36.98 36.92 35.19 34.97 35.09 33.34 33.87 34.27 34.51 branch-contains-packed-warm 35.27 35.12 36.20 33.52 32.76 33.49 33.65 32.96 33.68 32.34 clone-loose-cold 9.09 9.22 9.15 9.10 9.19 9.03 9.09 9.25 8.96 9.03 clone-loose-warm 5.57 5.85 5.65 5.55 5.61 5.64 5.65 5.61 5.74 5.59 fetch-nothing-loose 1.43 1.43 1.44 1.44 1.45 1.45 1.46 1.44 1.44 1.44 pack-refs 0.08 0.08 0.08 0.08 0.09 0.08 0.09 0.08 0.08 0.08 fetch-nothing-packed 1.44 1.43 1.44 1.44 1.44 1.44 1.44 1.44 1.44 1.44 clone-packed-cold 1.35 1.26 1.30 1.32 1.28 1.35 1.38 1.35 1.29 1.21 clone-packed-warm 0.36 0.35 0.35 0.36 0.36 0.36 0.35 0.36 0.37 0.35 fetch-everything-cold 30.29 30.01 29.79 29.04 29.84 29.25 29.30 29.26 29.76 29.30 fetch-everything-warm 26.20 26.04 26.40 25.60 26.22 25.83 25.82 25.85 26.68 25.73 =================================== ======= ======= ======= ======= ======= ======= ======= ======= ======= ======= [0] f696543 (tag: v1.7.6) Git 1.7.6 [1] 703f05a (tag: v1.7.7) Git 1.7.7 [2] 27897d2 (origin/master) Merge remote-tracking branch 'gitster/mh/iterate-refs' [3] 558b49c is_refname_available(): reimplement using do_for_each_ref_in_list() [4] 1658397 Store references hierarchically [5] 5f5a126 get_ref_dir(): add a recursive option [6] a306af1 get_ref_dir(): read one whole directory before descending into subdirs [7] fd53cf7 add_ref(): change to take a (struct ref_entry *) as second argument [8] 9944c7f (origin/testing) read_packed_refs(): keep track of the directory being worked in [9] 
cb75c57 (origin/ok, origin/hierarchical-refs, origin/HEAD) refs.c: call clear_cached_ref_cache() from repack_without_ref() ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-26 12:41 ` Christian Couder 2011-09-26 17:47 ` Martin Fick @ 2011-09-28 19:38 ` Martin Fick 2011-09-28 22:10 ` Martin Fick 1 sibling, 1 reply; 126+ messages in thread From: Martin Fick @ 2011-09-28 19:38 UTC (permalink / raw) To: Christian Couder; +Cc: git, Christian Couder, Thomas Rast, Julian Phillips On Monday, September 26, 2011 06:41:04 am Christian Couder wrote: > On Sun, Sep 25, 2011 at 10:43 PM, Martin Fick <mfick@codeaurora.org> wrote: ... > > git checkout > > > > can also take rather long periods of time > 3 mins when > > run on a repo with ~100K refs. ... > > So, I bisected this issue also, and it seems that the > > "offending" commit is ... > > commit 680955702990c1d4bfb3c6feed6ae9c6cb5c3c07 > > Author: Christian Couder <chriscool@tuxfamily.org> > > > > replace_object: add mechanism to replace objects > > found in "refs/replace/" ... > I don't think there is an obvious problem with it, but it > would be nice if you could dig a bit deeper. > > The first thing that could take a lot of time is the call > to for_each_replace_ref() in this function: > > +static void prepare_replace_object(void) > +{ > + static int replace_object_prepared; > + > + if (replace_object_prepared) > + return; > + > + for_each_replace_ref(register_replace_ref, NULL); > + replace_object_prepared = 1; > +} The time was actually spent in for_each_replace_ref() which calls get_loose_refs() which has the recursive bug that Julian Phillips fixed 2 days ago. Good to see that this fix helps other use cases too. So with that bug fixed, the thing taking the most time now for a git checkout with ~100K refs seems to be the orphan check as Thomas predicted. The strange part with this, is that the orphan check seems to take only about ~20s in the repo where the refs aren't packed. However, in the repo where they are packed, this check takes at least 5min! This seems a bit unusual, doesn't it? Is the filesystem that much better at indexing refs than git's pack mechanism? Seems unlikely, the unpacked refs take 312M in the FS, the packed ones only take about 4.3M. I suspect their is something else unexpected going on here in the packed ref case. Any thoughts? I will dig deeper... -Martin -- Employee of Qualcomm Innovation Center, Inc. which is a member of Code Aurora Forum ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-28 19:38 ` Martin Fick @ 2011-09-28 22:10 ` Martin Fick 2011-09-29 0:54 ` Julian Phillips 0 siblings, 1 reply; 126+ messages in thread From: Martin Fick @ 2011-09-28 22:10 UTC (permalink / raw) To: Christian Couder; +Cc: git, Christian Couder, Thomas Rast, Julian Phillips On Wednesday, September 28, 2011 01:38:04 pm Martin Fick wrote: > On Monday, September 26, 2011 06:41:04 am Christian > Couder > > wrote: > > On Sun, Sep 25, 2011 at 10:43 PM, Martin Fick > > <mfick@codeaurora.org> wrote: > ... > > > > git checkout > > > > > > can also take rather long periods of time > 3 mins > > > when run on a repo with ~100K refs. > > ... > > > > So, I bisected this issue also, and it seems that > > > the > > > > > > "offending" commit is > > ... > > > > commit 680955702990c1d4bfb3c6feed6ae9c6cb5c3c07 > > > Author: Christian Couder <chriscool@tuxfamily.org> > > > > > > replace_object: add mechanism to replace objects > > > > > > found in "refs/replace/" > > ... > > > I don't think there is an obvious problem with it, but > > it would be nice if you could dig a bit deeper. > > > > The first thing that could take a lot of time is the > > call to for_each_replace_ref() in this function: > > > > +static void prepare_replace_object(void) > > +{ > > + static int replace_object_prepared; > > + > > + if (replace_object_prepared) > > + return; > > + > > + for_each_replace_ref(register_replace_ref, > > NULL); + replace_object_prepared = 1; > > +} > > The time was actually spent in for_each_replace_ref() > which calls get_loose_refs() which has the recursive bug > that Julian Phillips fixed 2 days ago. Good to see that > this fix helps other use cases too. > > So with that bug fixed, the thing taking the most time > now for a git checkout with ~100K refs seems to be the > orphan check as Thomas predicted. The strange part with > this, is that the orphan check seems to take only about > ~20s in the repo where the refs aren't packed. However, > in the repo where they are packed, this check takes at > least 5min! This seems a bit unusual, doesn't it? Is > the filesystem that much better at indexing refs than > git's pack mechanism? Seems unlikely, the unpacked refs > take 312M in the FS, the packed ones only take about > 4.3M. I suspect their is something else unexpected > going on here in the packed ref case. > > Any thoughts? I will dig deeper... I think the problem is that resolve_ref() walks a linked list of searching for the packed ref. Does this mean that packed refs are not indexed at all? > > -Martin -- Employee of Qualcomm Innovation Center, Inc. which is a member of Code Aurora Forum ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-28 22:10 ` Martin Fick @ 2011-09-29 0:54 ` Julian Phillips 2011-09-29 1:37 ` Martin Fick 0 siblings, 1 reply; 126+ messages in thread From: Julian Phillips @ 2011-09-29 0:54 UTC (permalink / raw) To: Martin Fick; +Cc: Christian Couder, git, Christian Couder, Thomas Rast On Wed, 28 Sep 2011 16:10:48 -0600, Martin Fick wrote: > On Wednesday, September 28, 2011 01:38:04 pm Martin Fick > wrote: -- snip -- >> So with that bug fixed, the thing taking the most time >> now for a git checkout with ~100K refs seems to be the >> orphan check as Thomas predicted. The strange part with >> this, is that the orphan check seems to take only about >> ~20s in the repo where the refs aren't packed. However, >> in the repo where they are packed, this check takes at >> least 5min! This seems a bit unusual, doesn't it? Is >> the filesystem that much better at indexing refs than >> git's pack mechanism? Seems unlikely, the unpacked refs >> take 312M in the FS, the packed ones only take about >> 4.3M. I suspect their is something else unexpected >> going on here in the packed ref case. >> >> Any thoughts? I will dig deeper... > > I think the problem is that resolve_ref() walks a linked > list of searching for the packed ref. Does this mean that > packed refs are not indexed at all? Are you sure that it is walking the linked list that is the problem? I've created a test repo with ~100k refs/changes/... style refs, and ~40000 refs/heads/... style refs, and checkout can walk the list of ~140k refs seven times in 85ms user time including doing whatever other processing is needed for checkout. The real time is only 114ms - but then my test repo has no real data in. If resolve_ref() walking the linked list of refs was the problem, then I would expect my test repo to show the same problem. It doesn't, a pre ref-packing checkout took minutes (~0.5s user time), whereas a ref-packed checkout takes ~0.1s. So, I would suggest that the problem lies elsewhere. Have you tried running a checkout whilst profiling? -- Julian ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-29 0:54 ` Julian Phillips @ 2011-09-29 1:37 ` Martin Fick 2011-09-29 2:19 ` Julian Phillips 0 siblings, 1 reply; 126+ messages in thread From: Martin Fick @ 2011-09-29 1:37 UTC (permalink / raw) To: Julian Phillips; +Cc: Christian Couder, git, Christian Couder, Thomas Rast On Wednesday 28 September 2011 18:59:09 Martin Fick wrote: > Julian Phillips <julian@quantumfyre.co.uk> wrote: > > On Wed, 28 Sep 2011 16:10:48 -0600, Martin Fick wrote: > >> So with that bug fixed, the thing taking the most time > >> now for a git checkout with ~100K refs seems to be the > >> orphan check as Thomas predicted. The strange part with > >> this, is that the orphan check seems to take only about > >> ~20s in the repo where the refs aren't packed. However, > >> in the repo where they are packed, this check takes at > >> least 5min! This seems a bit unusual, doesn't it? Is > >> the filesystem that much better at indexing refs than > >> git's pack mechanism? Seems unlikely, the unpacked refs > >> take 312M in the FS, the packed ones only take about > >> 4.3M. I suspect their is something else unexpected > >> going on here in the packed ref case. > >> > >> Any thoughts? I will dig deeper... > > > > I think the problem is that resolve_ref() walks a linked > > list of searching for the packed ref. Does this mean that > > packed refs are not indexed at all? > > Are you sure that it is walking the linked list that is the problem? It sure seems like it. > I've created a test repo with ~100k refs/changes/... style refs, and > ~40000 refs/heads/... style refs, and checkout can walk the list of > ~140k refs seven times in 85ms user time including doing whatever other > processing is needed for checkout. The real time is only 114ms - but > then my test repo has no real data in. If I understand what you are saying, it sounds like you do not have a very good test case. The amount of time it takes for checkout depends on how long it takes to find a ref with the sha1 that you are on. If that sha1 is so early in the list of refs that it only took you 7 traversals to find it, then that is not a very good testcase. I think that you should probably try making an orphaned ref (checkout a detached head, commit to it), that is probably the worst testcase since it should then have to search all 140K refs to eventually give up. Again, if I understand what you are saying, if it took 85ms for 7 traversals, then it takes approximately 10ms per traversal, that's only 100/s! If you have to traverse it 140K times, that should work out to 1400s ~ 23mins. > If resolve_ref() walking the linked list of refs was the problem, then > I would expect my test repo to show the same problem. It doesn't, a pre > ref-packing checkout took minutes (~0.5s user time), whereas a > ref-packed checkout takes ~0.1s. So, I would suggest that the problem > lies elsewhere. > > Have you tried running a checkout whilst profiling? No, to be honest, I am not familiar with any profilling tools. -Martin Employee of Qualcomm Innovation Center,Inc. which is a member of Code Aurora Forum ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-29 1:37 ` Martin Fick @ 2011-09-29 2:19 ` Julian Phillips 2011-09-29 16:38 ` Martin Fick 2011-09-29 18:27 ` René Scharfe 0 siblings, 2 replies; 126+ messages in thread From: Julian Phillips @ 2011-09-29 2:19 UTC (permalink / raw) To: Martin Fick; +Cc: Christian Couder, git, Christian Couder, Thomas Rast On Wed, 28 Sep 2011 19:37:18 -0600, Martin Fick wrote: > On Wednesday 28 September 2011 18:59:09 Martin Fick wrote: >> Julian Phillips <julian@quantumfyre.co.uk> wrote: -- snip -- >> I've created a test repo with ~100k refs/changes/... style refs, and >> ~40000 refs/heads/... style refs, and checkout can walk the list of >> ~140k refs seven times in 85ms user time including doing whatever >> other >> processing is needed for checkout. The real time is only 114ms - but >> then my test repo has no real data in. > > If I understand what you are saying, it sounds like you do not have a > very good test case. The amount of time it takes for checkout depends > on how long it takes to find a ref with the sha1 that you are on. If > that sha1 is so early in the list of refs that it only took you 7 > traversals to find it, then that is not a very good testcase. I think > that you should probably try making an orphaned ref (checkout a > detached head, commit to it), that is probably the worst testcase > since it should then have to search all 140K refs to eventually give > up. > > Again, if I understand what you are saying, if it took 85ms for 7 > traversals, then it takes approximately 10ms per traversal, that's > only 100/s! If you have to traverse it 140K times, that should work > out to 1400s ~ 23mins. Well, it's no more than 10ms per traversal - since the rest of the work presumably takes some time too ... However, I had forgotten to make the orphaned commit as you suggest - and then _bang_ 7N^2, it tries seven different variants of each ref (which is silly as they are all fully qualified), and with packed refs it has to search for them each time, all to turn names into hashes that we already know to start with. So, yes - it is that list traversal. Does the following help? diff --git a/builtin/checkout.c b/builtin/checkout.c index 5e356a6..f0f4ca1 100644 --- a/builtin/checkout.c +++ b/builtin/checkout.c @@ -605,7 +605,7 @@ static int add_one_ref_to_rev_list_arg(const char *refname, int flags, void *cb_data) { - add_one_rev_list_arg(cb_data, refname); + add_one_rev_list_arg(cb_data, strdup(sha1_to_hex(sha1))); return 0; } -- Julian ^ permalink raw reply related [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-29 2:19 ` Julian Phillips @ 2011-09-29 16:38 ` Martin Fick 2011-09-29 18:26 ` Julian Phillips 2011-09-29 18:27 ` René Scharfe 1 sibling, 1 reply; 126+ messages in thread From: Martin Fick @ 2011-09-29 16:38 UTC (permalink / raw) To: Julian Phillips; +Cc: Christian Couder, git, Christian Couder, Thomas Rast On Wednesday, September 28, 2011 08:19:16 pm Julian Phillips wrote: > On Wed, 28 Sep 2011 19:37:18 -0600, Martin Fick wrote: > > On Wednesday 28 September 2011 18:59:09 Martin Fick wrote: > >> Julian Phillips <julian@quantumfyre.co.uk> wrote: > -- snip -- > > >> I've created a test repo with ~100k refs/changes/... > >> style refs, and ~40000 refs/heads/... style refs, and > >> checkout can walk the list of ~140k refs seven times > >> in 85ms user time including doing whatever other > >> processing is needed for checkout. The real time is > >> only 114ms - but then my test repo has no real data > >> in. > > > > If I understand what you are saying, it sounds like you > > do not have a very good test case. The amount of time > > it takes for checkout depends on how long it takes to > > find a ref with the sha1 that you are on. If that sha1 > > is so early in the list of refs that it only took you > > 7 traversals to find it, then that is not a very good > > testcase. I think that you should probably try making > > an orphaned ref (checkout a detached head, commit to > > it), that is probably the worst testcase since it > > should then have to search all 140K refs to eventually > > give up. > > > > Again, if I understand what you are saying, if it took > > 85ms for 7 traversals, then it takes approximately > > 10ms per traversal, that's only 100/s! If you have to > > traverse it 140K times, that should work out to 1400s > > ~ 23mins. > > Well, it's no more than 10ms per traversal - since the > rest of the work presumably takes some time too ... > > However, I had forgotten to make the orphaned commit as > you suggest - and then _bang_ 7N^2, it tries seven > different variants of each ref (which is silly as they > are all fully qualified), and with packed refs it has to > search for them each time, all to turn names into hashes > that we already know to start with. > > So, yes - it is that list traversal. > > Does the following help? > > diff --git a/builtin/checkout.c b/builtin/checkout.c > index 5e356a6..f0f4ca1 100644 > --- a/builtin/checkout.c > +++ b/builtin/checkout.c > @@ -605,7 +605,7 @@ static int > add_one_ref_to_rev_list_arg(const char *refname, > int flags, > void *cb_data) > { > - add_one_rev_list_arg(cb_data, refname); > + add_one_rev_list_arg(cb_data, > strdup(sha1_to_hex(sha1))); return 0; > } Yes, but in some strange ways. :) First, let me clarify that all the tests here involve your "sort fix" from 2 days ago applied first. In the packed ref repo, it brings the time down to about ~10s (from > 5 mins). In the unpacked ref repo, it brings it down to about the same thing ~10s, but it was only starting at about ~20s. So, I have to ask, what does that change do, I don't quite understand it? Does it just do only one lookup per ref by normalizing it? Is the list still being traversed, just about 7 time less now? Should the packed_ref list simply be put in an array which could be binary searched instead, it is a fixed list once loaded, no? 
I prototyped a packed_ref implementation using the hash.c provided in the git sources and it seemed to speed a checkout up to almost instantaneous, but I was getting a few collisions so the implementation was not good enough. That is when I started to wonder if an array wouldn't be better in this case? Now I also decided to go back and test a noop fetch (a refetch) of all the changes (since this use case is still taking way longer than I think it should, even with the submodule fix posted earlier). Up until this point, even the sorting fix did not help. So I tried it with this fix. In the unpackref case, it did not seem to change (2~4mins). However, in the packed ref change (which was previously also about 2-4mins), this now only takes about 10-15s! Any clues as to why the unpacked refs would still be so slow on noop fetches and not be sped up by this? -Martin -- Employee of Qualcomm Innovation Center, Inc. which is a member of Code Aurora Forum ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-29 16:38 ` Martin Fick @ 2011-09-29 18:26 ` Julian Phillips 0 siblings, 0 replies; 126+ messages in thread From: Julian Phillips @ 2011-09-29 18:26 UTC (permalink / raw) To: Martin Fick; +Cc: Christian Couder, git, Christian Couder, Thomas Rast On Thu, 29 Sep 2011 10:38:44 -0600, Martin Fick wrote: > On Wednesday, September 28, 2011 08:19:16 pm Julian Phillips > wrote: -- snip -- >> However, I had forgotten to make the orphaned commit as >> you suggest - and then _bang_ 7N^2, it tries seven >> different variants of each ref (which is silly as they >> are all fully qualified), and with packed refs it has to >> search for them each time, all to turn names into hashes >> that we already know to start with. >> >> So, yes - it is that list traversal. >> >> Does the following help? >> >> diff --git a/builtin/checkout.c b/builtin/checkout.c >> index 5e356a6..f0f4ca1 100644 >> --- a/builtin/checkout.c >> +++ b/builtin/checkout.c >> @@ -605,7 +605,7 @@ static int >> add_one_ref_to_rev_list_arg(const char *refname, >> int flags, >> void *cb_data) >> { >> - add_one_rev_list_arg(cb_data, refname); >> + add_one_rev_list_arg(cb_data, >> strdup(sha1_to_hex(sha1))); return 0; >> } > > > Yes, but in some strange ways. :) > > First, let me clarify that all the tests here involve your > "sort fix" from 2 days ago applied first. > > In the packed ref repo, it brings the time down to about > ~10s (from > 5 mins). In the unpacked ref repo, it brings > it down to about the same thing ~10s, but it was only > starting at about ~20s. > > So, I have to ask, what does that change do, I don't quite > understand it? Does it just do only one lookup per ref by > normalizing it? Is the list still being traversed, just > about 7 time less now? In order to check for orphaned commits, checkout effectively calls rev-list passing it a list of the names of all the refs as input. The rev-list code then has to go through this list and convert each entry into an actual hash that it can look up in the object database. This is where the N^2 comes in for packed refs, as it calles resolve_ref() for each ref in the list (N), which then loops through the list of all refs (N) to find a match. However, the code that creates the list of refs to pass to the rev-list code already knows the hash for each ref. So the change above passes the hashes to rev-list, which then doesn't need to lookup the ref - it just converts the string form hash back to binary form, avoiding the N^2 work altogether. This is why packed and unpacked are about the same speed, as they are now doing the same amount of work. > Should the packed_ref list simply be > put in an array which could be binary searched instead, it > is a fixed list once loaded, no? A quick look at the code suggests that probably both the list of loose refs, and the list of packed refs could both be stored as binary searchable arrays, or in an ordered hash table. Though whether it is actually necessary I don't know. So far, it seems to have been possible to fix performance issues whilst keeping the simple lists ... > I prototyped a packed_ref implementation using the hash.c > provided in the git sources and it seemed to speed a > checkout up to almost instantaneous, but I was getting a few > collisions so the implementation was not good enough. That > is when I started to wonder if an array wouldn't be better > in this case? 
> > > > Now I also decided to go back and test a noop fetch (a > refetch) of all the changes (since this use case is still > taking way longer than I think it should, even with the > submodule fix posted earlier). Up until this point, even > the sorting fix did not help. So I tried it with this fix. > In the unpackref case, it did not seem to change (2~4mins). > However, in the packed ref change (which was previously also > about 2-4mins), this now only takes about 10-15s! > > Any clues as to why the unpacked refs would still be so slow > on noop fetches and not be sped up by this? Not really. I wouldn't expect this change to have any effect on fetch, but I haven't actually looked into it. -- Julian ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-29 2:19 ` Julian Phillips 2011-09-29 16:38 ` Martin Fick @ 2011-09-29 18:27 ` René Scharfe 2011-09-29 19:10 ` Junio C Hamano ` (2 more replies) 1 sibling, 3 replies; 126+ messages in thread From: René Scharfe @ 2011-09-29 18:27 UTC (permalink / raw) To: Julian Phillips Cc: Martin Fick, Christian Couder, git, Christian Couder, Thomas Rast, Junio C Hamano Am 29.09.2011 04:19, schrieb Julian Phillips: > Does the following help? > > diff --git a/builtin/checkout.c b/builtin/checkout.c > index 5e356a6..f0f4ca1 100644 > --- a/builtin/checkout.c > +++ b/builtin/checkout.c > @@ -605,7 +605,7 @@ static int add_one_ref_to_rev_list_arg(const char > *refname, > int flags, > void *cb_data) > { > - add_one_rev_list_arg(cb_data, refname); > + add_one_rev_list_arg(cb_data, strdup(sha1_to_hex(sha1))); > return 0; > } Hmm. Can we get rid of the multiple ref lookups fixed by the above *and* the overhead of dealing with a textual argument list at the same time by calling add_pending_object directly, like this? (Factoring out add_pending_sha1 should be a separate patch..) René --- builtin/checkout.c | 39 ++++++++++++--------------------------- revision.c | 11 ++++++++--- revision.h | 1 + 3 files changed, 21 insertions(+), 30 deletions(-) diff --git a/builtin/checkout.c b/builtin/checkout.c index 5e356a6..84e0cdc 100644 --- a/builtin/checkout.c +++ b/builtin/checkout.c @@ -588,24 +588,11 @@ static void update_refs_for_switch(struct checkout_opts *opts, report_tracking(new); } -struct rev_list_args { - int argc; - int alloc; - const char **argv; -}; - -static void add_one_rev_list_arg(struct rev_list_args *args, const char *s) -{ - ALLOC_GROW(args->argv, args->argc + 1, args->alloc); - args->argv[args->argc++] = s; -} - -static int add_one_ref_to_rev_list_arg(const char *refname, - const unsigned char *sha1, - int flags, - void *cb_data) +static int add_pending_uninteresting_ref(const char *refname, + const unsigned char *sha1, + int flags, void *cb_data) { - add_one_rev_list_arg(cb_data, refname); + add_pending_sha1(cb_data, refname, sha1, flags | UNINTERESTING); return 0; } @@ -685,19 +672,17 @@ static void suggest_reattach(struct commit *commit, struct rev_info *revs) */ static void orphaned_commit_warning(struct commit *commit) { - struct rev_list_args args = { 0, 0, NULL }; struct rev_info revs; - - add_one_rev_list_arg(&args, "(internal)"); - add_one_rev_list_arg(&args, sha1_to_hex(commit->object.sha1)); - add_one_rev_list_arg(&args, "--not"); - for_each_ref(add_one_ref_to_rev_list_arg, &args); - add_one_rev_list_arg(&args, "--"); - add_one_rev_list_arg(&args, NULL); + struct object *object = &commit->object; init_revisions(&revs, NULL); - if (setup_revisions(args.argc - 1, args.argv, &revs, NULL) != 1) - die(_("internal error: only -- alone should have been left")); + setup_revisions(0, NULL, &revs, NULL); + + object->flags &= ~UNINTERESTING; + add_pending_object(&revs, object, sha1_to_hex(object->sha1)); + + for_each_ref(add_pending_uninteresting_ref, &revs); + if (prepare_revision_walk(&revs)) die(_("internal error in revision walk")); if (!(commit->object.flags & UNINTERESTING)) diff --git a/revision.c b/revision.c index c46cfaa..2e8aa33 100644 --- a/revision.c +++ b/revision.c @@ -185,6 +185,13 @@ static struct object *get_reference(struct rev_info *revs, const char *name, con return object; } +void add_pending_sha1(struct rev_info *revs, const char *name, + const unsigned char *sha1, unsigned int flags) +{ + struct object *object = 
get_reference(revs, name, sha1, flags); + add_pending_object(revs, object, name); +} + static struct commit *handle_commit(struct rev_info *revs, struct object *object, const char *name) { unsigned long flags = object->flags; @@ -832,9 +839,7 @@ struct all_refs_cb { static int handle_one_ref(const char *path, const unsigned char *sha1, int flag, void *cb_data) { struct all_refs_cb *cb = cb_data; - struct object *object = get_reference(cb->all_revs, path, sha1, - cb->all_flags); - add_pending_object(cb->all_revs, object, path); + add_pending_sha1(cb->all_revs, path, sha1, cb->all_flags); return 0; } diff --git a/revision.h b/revision.h index 3d64ada..4541265 100644 --- a/revision.h +++ b/revision.h @@ -191,6 +191,7 @@ extern void add_object(struct object *obj, const char *name); extern void add_pending_object(struct rev_info *revs, struct object *obj, const char *name); +extern void add_pending_sha1(struct rev_info *revs, const char *name, const unsigned char *sha1, unsigned int flags); extern void add_head_to_pending(struct rev_info *); -- 1.7.7.rc1 ^ permalink raw reply related [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-29 18:27 ` René Scharfe @ 2011-09-29 19:10 ` Junio C Hamano 2011-09-29 4:18 ` [PATCH] refs: Use binary search to lookup refs faster Julian Phillips 2011-09-29 20:44 ` Git is not scalable with too many refs/* Martin Fick 2011-09-29 19:10 ` Julian Phillips 2011-09-29 20:11 ` Martin Fick 2 siblings, 2 replies; 126+ messages in thread From: Junio C Hamano @ 2011-09-29 19:10 UTC (permalink / raw) To: René Scharfe Cc: Julian Phillips, Martin Fick, Christian Couder, git, Christian Couder, Thomas Rast René Scharfe <rene.scharfe@lsrfire.ath.cx> writes: > Hmm. Can we get rid of the multiple ref lookups fixed by the above > *and* the overhead of dealing with a textual argument list at the same > time by calling add_pending_object directly, like this? (Factoring > out add_pending_sha1 should be a separate patch..) I haven't tested it or thought about it through, but it smells right ;-) Also we would probably want to drop "next" field from "struct ref_list" (i.e. making it not a linear list), introduce a new "struct ref_array" that is a ALLOC_GROW() managed array of pointers to "struct ref_list", make get_packed_refs() and get_loose_refs() return a pointer to "struct ref_array" after sorting the array contents by "name". Then resolve_ref() can do a bisection search in the packed refs array when it does not find a loose ref. ^ permalink raw reply [flat|nested] 126+ messages in thread
* [PATCH] refs: Use binary search to lookup refs faster 2011-09-29 19:10 ` Junio C Hamano @ 2011-09-29 4:18 ` Julian Phillips 2011-09-29 21:57 ` Junio C Hamano ` (2 more replies) 2011-09-29 20:44 ` Git is not scalable with too many refs/* Martin Fick 1 sibling, 3 replies; 126+ messages in thread From: Julian Phillips @ 2011-09-29 4:18 UTC (permalink / raw) To: Junio C Hamano Cc: Julian Phillips, Martin Fick, Christian Couder, git, Christian Couder, Thomas Rast Currently we linearly search through lists of refs when we need to find a specific ref. This can be very slow if we need to lookup a large number of refs. By changing to a binary search we can make this faster. In order to be able to use a binary search we need to change from using linked lists to arrays, which we can manage using ALLOC_GROW. We can now also use the standard library qsort function to sort the refs arrays. Signed-off-by: Julian Phillips <julian@quantumfyre.co.uk> --- Something like this? refs.c | 328 ++++++++++++++++++++++++++-------------------------------------- 1 files changed, 131 insertions(+), 197 deletions(-) diff --git a/refs.c b/refs.c index a49ff74..e411bea 100644 --- a/refs.c +++ b/refs.c @@ -8,14 +8,18 @@ #define REF_KNOWS_PEELED 04 #define REF_BROKEN 010 -struct ref_list { - struct ref_list *next; +struct ref_entry { unsigned char flag; /* ISSYMREF? ISPACKED? */ unsigned char sha1[20]; unsigned char peeled[20]; char name[FLEX_ARRAY]; }; +struct ref_array { + int nr, alloc; + struct ref_entry **refs; +}; + static const char *parse_ref_line(char *line, unsigned char *sha1) { /* @@ -44,108 +48,55 @@ static const char *parse_ref_line(char *line, unsigned char *sha1) return line; } -static struct ref_list *add_ref(const char *name, const unsigned char *sha1, - int flag, struct ref_list *list, - struct ref_list **new_entry) +static void add_ref(const char *name, const unsigned char *sha1, + int flag, struct ref_array *refs, + struct ref_entry **new_entry) { int len; - struct ref_list *entry; + struct ref_entry *entry; /* Allocate it and add it in.. 
*/ len = strlen(name) + 1; - entry = xmalloc(sizeof(struct ref_list) + len); + entry = xmalloc(sizeof(struct ref) + len); hashcpy(entry->sha1, sha1); hashclr(entry->peeled); memcpy(entry->name, name, len); entry->flag = flag; - entry->next = list; if (new_entry) *new_entry = entry; - return entry; + ALLOC_GROW(refs->refs, refs->nr + 1, refs->alloc); + refs->refs[refs->nr++] = entry; } -/* merge sort the ref list */ -static struct ref_list *sort_ref_list(struct ref_list *list) +static int ref_entry_cmp(const void *a, const void *b) { - int psize, qsize, last_merge_count, cmp; - struct ref_list *p, *q, *l, *e; - struct ref_list *new_list = list; - int k = 1; - int merge_count = 0; - - if (!list) - return list; - - do { - last_merge_count = merge_count; - merge_count = 0; - - psize = 0; - - p = new_list; - q = new_list; - new_list = NULL; - l = NULL; + struct ref_entry *one = *(struct ref_entry **)a; + struct ref_entry *two = *(struct ref_entry **)b; + return strcmp(one->name, two->name); +} - while (p) { - merge_count++; +static void sort_ref_array(struct ref_array *array) +{ + qsort(array->refs, array->nr, sizeof(*array->refs), ref_entry_cmp); +} - while (psize < k && q->next) { - q = q->next; - psize++; - } - qsize = k; - - while ((psize > 0) || (qsize > 0 && q)) { - if (qsize == 0 || !q) { - e = p; - p = p->next; - psize--; - } else if (psize == 0) { - e = q; - q = q->next; - qsize--; - } else { - cmp = strcmp(q->name, p->name); - if (cmp < 0) { - e = q; - q = q->next; - qsize--; - } else if (cmp > 0) { - e = p; - p = p->next; - psize--; - } else { - if (hashcmp(q->sha1, p->sha1)) - die("Duplicated ref, and SHA1s don't match: %s", - q->name); - warning("Duplicated ref: %s", q->name); - e = q; - q = q->next; - qsize--; - free(e); - e = p; - p = p->next; - psize--; - } - } +static struct ref_entry *search_ref_array(struct ref_array *array, const char *name) +{ + struct ref_entry *e, **r; + int len; - e->next = NULL; + len = strlen(name) + 1; + e = xmalloc(sizeof(struct ref) + len); + memcpy(e->name, name, len); - if (l) - l->next = e; - if (!new_list) - new_list = e; - l = e; - } + r = bsearch(&e, array->refs, array->nr, sizeof(*array->refs), ref_entry_cmp); - p = q; - }; + free(e); - k = k * 2; - } while ((last_merge_count != merge_count) || (last_merge_count != 1)); + if (r == NULL) + return NULL; - return new_list; + return *r; } /* @@ -155,38 +106,37 @@ static struct ref_list *sort_ref_list(struct ref_list *list) static struct cached_refs { char did_loose; char did_packed; - struct ref_list *loose; - struct ref_list *packed; + struct ref_array loose; + struct ref_array packed; } cached_refs, submodule_refs; -static struct ref_list *current_ref; +static struct ref_entry *current_ref; -static struct ref_list *extra_refs; +static struct ref_array extra_refs; -static void free_ref_list(struct ref_list *list) +static void free_ref_array(struct ref_array *array) { - struct ref_list *next; - for ( ; list; list = next) { - next = list->next; - free(list); - } + int i; + for (i = 0; i < array->nr; i++) + free(array->refs[i]); + free(array->refs); + array->nr = array->alloc = 0; + array->refs = NULL; } static void invalidate_cached_refs(void) { struct cached_refs *ca = &cached_refs; - if (ca->did_loose && ca->loose) - free_ref_list(ca->loose); - if (ca->did_packed && ca->packed) - free_ref_list(ca->packed); - ca->loose = ca->packed = NULL; + if (ca->did_loose) + free_ref_array(&ca->loose); + if (ca->did_packed) + free_ref_array(&ca->packed); ca->did_loose = ca->did_packed = 0; } static void 
read_packed_refs(FILE *f, struct cached_refs *cached_refs) { - struct ref_list *list = NULL; - struct ref_list *last = NULL; + struct ref_entry *last = NULL; char refline[PATH_MAX]; int flag = REF_ISPACKED; @@ -205,7 +155,7 @@ static void read_packed_refs(FILE *f, struct cached_refs *cached_refs) name = parse_ref_line(refline, sha1); if (name) { - list = add_ref(name, sha1, flag, list, &last); + add_ref(name, sha1, flag, &cached_refs->packed, &last); continue; } if (last && @@ -215,21 +165,20 @@ static void read_packed_refs(FILE *f, struct cached_refs *cached_refs) !get_sha1_hex(refline + 1, sha1)) hashcpy(last->peeled, sha1); } - cached_refs->packed = sort_ref_list(list); + sort_ref_array(&cached_refs->packed); } void add_extra_ref(const char *name, const unsigned char *sha1, int flag) { - extra_refs = add_ref(name, sha1, flag, extra_refs, NULL); + add_ref(name, sha1, flag, &extra_refs, NULL); } void clear_extra_refs(void) { - free_ref_list(extra_refs); - extra_refs = NULL; + free_ref_array(&extra_refs); } -static struct ref_list *get_packed_refs(const char *submodule) +static struct ref_array *get_packed_refs(const char *submodule) { const char *packed_refs_file; struct cached_refs *refs; @@ -237,7 +186,7 @@ static struct ref_list *get_packed_refs(const char *submodule) if (submodule) { packed_refs_file = git_path_submodule(submodule, "packed-refs"); refs = &submodule_refs; - free_ref_list(refs->packed); + free_ref_array(&refs->packed); } else { packed_refs_file = git_path("packed-refs"); refs = &cached_refs; @@ -245,18 +194,17 @@ static struct ref_list *get_packed_refs(const char *submodule) if (!refs->did_packed || submodule) { FILE *f = fopen(packed_refs_file, "r"); - refs->packed = NULL; if (f) { read_packed_refs(f, refs); fclose(f); } refs->did_packed = 1; } - return refs->packed; + return &refs->packed; } -static struct ref_list *get_ref_dir(const char *submodule, const char *base, - struct ref_list *list) +static void get_ref_dir(const char *submodule, const char *base, + struct ref_array *array) { DIR *dir; const char *path; @@ -299,7 +247,7 @@ static struct ref_list *get_ref_dir(const char *submodule, const char *base, if (stat(refdir, &st) < 0) continue; if (S_ISDIR(st.st_mode)) { - list = get_ref_dir(submodule, ref, list); + get_ref_dir(submodule, ref, array); continue; } if (submodule) { @@ -314,12 +262,11 @@ static struct ref_list *get_ref_dir(const char *submodule, const char *base, hashclr(sha1); flag |= REF_BROKEN; } - list = add_ref(ref, sha1, flag, list, NULL); + add_ref(ref, sha1, flag, array, NULL); } free(ref); closedir(dir); } - return list; } struct warn_if_dangling_data { @@ -356,21 +303,21 @@ void warn_dangling_symref(FILE *fp, const char *msg_fmt, const char *refname) for_each_rawref(warn_if_dangling_symref, &data); } -static struct ref_list *get_loose_refs(const char *submodule) +static struct ref_array *get_loose_refs(const char *submodule) { if (submodule) { - free_ref_list(submodule_refs.loose); - submodule_refs.loose = get_ref_dir(submodule, "refs", NULL); - submodule_refs.loose = sort_ref_list(submodule_refs.loose); - return submodule_refs.loose; + free_ref_array(&submodule_refs.loose); + get_ref_dir(submodule, "refs", &submodule_refs.loose); + sort_ref_array(&submodule_refs.loose); + return &submodule_refs.loose; } if (!cached_refs.did_loose) { - cached_refs.loose = get_ref_dir(NULL, "refs", NULL); - cached_refs.loose = sort_ref_list(cached_refs.loose); + get_ref_dir(NULL, "refs", &cached_refs.loose); + sort_ref_array(&cached_refs.loose); 
cached_refs.did_loose = 1; } - return cached_refs.loose; + return &cached_refs.loose; } /* We allow "recursive" symbolic refs. Only within reason, though */ @@ -381,8 +328,8 @@ static int resolve_gitlink_packed_ref(char *name, int pathlen, const char *refna { FILE *f; struct cached_refs refs; - struct ref_list *ref; - int retval; + struct ref_entry *ref; + int retval = -1; strcpy(name + pathlen, "packed-refs"); f = fopen(name, "r"); @@ -390,17 +337,12 @@ static int resolve_gitlink_packed_ref(char *name, int pathlen, const char *refna return -1; read_packed_refs(f, &refs); fclose(f); - ref = refs.packed; - retval = -1; - while (ref) { - if (!strcmp(ref->name, refname)) { - retval = 0; - memcpy(result, ref->sha1, 20); - break; - } - ref = ref->next; + ref = search_ref_array(&refs.packed, refname); + if (ref != NULL) { + memcpy(result, ref->sha1, 20); + retval = 0; } - free_ref_list(refs.packed); + free_ref_array(&refs.packed); return retval; } @@ -501,15 +443,13 @@ const char *resolve_ref(const char *ref, unsigned char *sha1, int reading, int * git_snpath(path, sizeof(path), "%s", ref); /* Special case: non-existing file. */ if (lstat(path, &st) < 0) { - struct ref_list *list = get_packed_refs(NULL); - while (list) { - if (!strcmp(ref, list->name)) { - hashcpy(sha1, list->sha1); - if (flag) - *flag |= REF_ISPACKED; - return ref; - } - list = list->next; + struct ref_array *packed = get_packed_refs(NULL); + struct ref_entry *r = search_ref_array(packed, ref); + if (r != NULL) { + hashcpy(sha1, r->sha1); + if (flag) + *flag |= REF_ISPACKED; + return ref; } if (reading || errno != ENOENT) return NULL; @@ -584,7 +524,7 @@ int read_ref(const char *ref, unsigned char *sha1) #define DO_FOR_EACH_INCLUDE_BROKEN 01 static int do_one_ref(const char *base, each_ref_fn fn, int trim, - int flags, void *cb_data, struct ref_list *entry) + int flags, void *cb_data, struct ref_entry *entry) { if (prefixcmp(entry->name, base)) return 0; @@ -630,18 +570,12 @@ int peel_ref(const char *ref, unsigned char *sha1) return -1; if ((flag & REF_ISPACKED)) { - struct ref_list *list = get_packed_refs(NULL); + struct ref_array *array = get_packed_refs(NULL); + struct ref_entry *r = search_ref_array(array, ref); - while (list) { - if (!strcmp(list->name, ref)) { - if (list->flag & REF_KNOWS_PEELED) { - hashcpy(sha1, list->peeled); - return 0; - } - /* older pack-refs did not leave peeled ones */ - break; - } - list = list->next; + if (r != NULL && r->flag & REF_KNOWS_PEELED) { + hashcpy(sha1, r->peeled); + return 0; } } @@ -660,36 +594,39 @@ fallback: static int do_for_each_ref(const char *submodule, const char *base, each_ref_fn fn, int trim, int flags, void *cb_data) { - int retval = 0; - struct ref_list *packed = get_packed_refs(submodule); - struct ref_list *loose = get_loose_refs(submodule); + int retval = 0, i, p = 0, l = 0; + struct ref_array *packed = get_packed_refs(submodule); + struct ref_array *loose = get_loose_refs(submodule); - struct ref_list *extra; + struct ref_array *extra = &extra_refs; - for (extra = extra_refs; extra; extra = extra->next) - retval = do_one_ref(base, fn, trim, flags, cb_data, extra); + for (i = 0; i < extra->nr; i++) + retval = do_one_ref(base, fn, trim, flags, cb_data, extra->refs[i]); - while (packed && loose) { - struct ref_list *entry; - int cmp = strcmp(packed->name, loose->name); + while (p < packed->nr && l < loose->nr) { + struct ref_entry *entry; + int cmp = strcmp(packed->refs[p]->name, loose->refs[l]->name); if (!cmp) { - packed = packed->next; + p++; continue; } if (cmp > 0) 
{ - entry = loose; - loose = loose->next; + entry = loose->refs[l++]; } else { - entry = packed; - packed = packed->next; + entry = packed->refs[p++]; } retval = do_one_ref(base, fn, trim, flags, cb_data, entry); if (retval) goto end_each; } - for (packed = packed ? packed : loose; packed; packed = packed->next) { - retval = do_one_ref(base, fn, trim, flags, cb_data, packed); + if (l < loose->nr) { + p = l; + packed = loose; + } + + for (; p < packed->nr; p++) { + retval = do_one_ref(base, fn, trim, flags, cb_data, packed->refs[p]); if (retval) goto end_each; } @@ -1005,24 +942,24 @@ static int remove_empty_directories(const char *file) } static int is_refname_available(const char *ref, const char *oldref, - struct ref_list *list, int quiet) -{ - int namlen = strlen(ref); /* e.g. 'foo/bar' */ - while (list) { - /* list->name could be 'foo' or 'foo/bar/baz' */ - if (!oldref || strcmp(oldref, list->name)) { - int len = strlen(list->name); + struct ref_array *array, int quiet) +{ + int i, namlen = strlen(ref); /* e.g. 'foo/bar' */ + for (i = 0; i < array->nr; i++ ) { + struct ref_entry *entry = array->refs[i]; + /* entry->name could be 'foo' or 'foo/bar/baz' */ + if (!oldref || strcmp(oldref, entry->name)) { + int len = strlen(entry->name); int cmplen = (namlen < len) ? namlen : len; - const char *lead = (namlen < len) ? list->name : ref; - if (!strncmp(ref, list->name, cmplen) && + const char *lead = (namlen < len) ? entry->name : ref; + if (!strncmp(ref, entry->name, cmplen) && lead[cmplen] == '/') { if (!quiet) error("'%s' exists; cannot create '%s'", - list->name, ref); + entry->name, ref); return 0; } } - list = list->next; } return 1; } @@ -1129,18 +1066,13 @@ static struct lock_file packlock; static int repack_without_ref(const char *refname) { - struct ref_list *list, *packed_ref_list; - int fd; - int found = 0; + struct ref_array *packed; + struct ref_entry *ref; + int fd, i; - packed_ref_list = get_packed_refs(NULL); - for (list = packed_ref_list; list; list = list->next) { - if (!strcmp(refname, list->name)) { - found = 1; - break; - } - } - if (!found) + packed = get_packed_refs(NULL); + ref = search_ref_array(packed, refname); + if (ref == NULL) return 0; fd = hold_lock_file_for_update(&packlock, git_path("packed-refs"), 0); if (fd < 0) { @@ -1148,17 +1080,19 @@ static int repack_without_ref(const char *refname) return error("cannot delete '%s' from packed refs", refname); } - for (list = packed_ref_list; list; list = list->next) { + for (i = 0; i < packed->nr; i++) { char line[PATH_MAX + 100]; int len; - if (!strcmp(refname, list->name)) + ref = packed->refs[i]; + + if (!strcmp(refname, ref->name)) continue; len = snprintf(line, sizeof(line), "%s %s\n", - sha1_to_hex(list->sha1), list->name); + sha1_to_hex(ref->sha1), ref->name); /* this should not happen but just being defensive */ if (len > sizeof(line)) - die("too long a refname '%s'", list->name); + die("too long a refname '%s'", ref->name); write_or_die(fd, line, len); } return commit_lock_file(&packlock); -- 1.7.6.1 ^ permalink raw reply related [flat|nested] 126+ messages in thread
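The core pattern in the patch above, stated in isolation: keep the entries in an array of pointers, sort once with qsort(), and look names up with bsearch(). A self-contained sketch of that pattern (made-up types, not git's code):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct entry {
	char name[32];  /* fixed-size here for brevity; the patch
	                   uses a FLEX_ARRAY tail instead */
};

static int entry_cmp(const void *a, const void *b)
{
	/* qsort/bsearch pass pointers to the array elements, and
	   the elements are themselves pointers, hence the double
	   dereference -- the same subtlety ref_entry_cmp() handles */
	const struct entry *one = *(const struct entry *const *)a;
	const struct entry *two = *(const struct entry *const *)b;
	return strcmp(one->name, two->name);
}

static struct entry *search(struct entry **refs, int nr, const char *name)
{
	struct entry key, *kp = &key, **found;

	snprintf(key.name, sizeof(key.name), "%s", name);
	found = bsearch(&kp, refs, nr, sizeof(*refs), entry_cmp);
	return found ? *found : NULL;
}

int main(void)
{
	struct entry a = { "refs/heads/maint" };
	struct entry b = { "refs/heads/master" };
	struct entry c = { "refs/tags/v1.0" };
	struct entry *refs[] = { &c, &a, &b };

	qsort(refs, 3, sizeof(*refs), entry_cmp);  /* O(n log n), once */
	struct entry *hit = search(refs, 3, "refs/heads/master"); /* O(log n) */
	printf("%s\n", hit ? hit->name : "(not found)");
	return 0;
}

The sketch sidesteps one detail: because these entries have a fixed-size name field, the search key can live on the stack, whereas the patch's FLEX_ARRAY entries force search_ref_array() to allocate a temporary entry to use as the key.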
* Re: [PATCH] refs: Use binary search to lookup refs faster 2011-09-29 4:18 ` [PATCH] refs: Use binary search to lookup refs faster Julian Phillips @ 2011-09-29 21:57 ` Junio C Hamano 2011-09-29 22:04 ` [PATCH v2] " Julian Phillips 2011-09-29 22:06 ` [PATCH] " Junio C Hamano 2 siblings, 0 replies; 126+ messages in thread From: Junio C Hamano @ 2011-09-29 21:57 UTC (permalink / raw) To: Julian Phillips Cc: Martin Fick, Christian Couder, git, Christian Couder, Thomas Rast Julian Phillips <julian@quantumfyre.co.uk> writes: > Currently we linearly search through lists of refs when we need to > find a specific ref. This can be very slow if we need to lookup a > large number of refs. By changing to a binary search we can make this > faster. > > In order to be able to use a binary search we need to change from > using linked lists to arrays, which we can manage using ALLOC_GROW. > > We can now also use the standard library qsort function to sort the > refs arrays. > > Signed-off-by: Julian Phillips <julian@quantumfyre.co.uk> > --- > > Something like this? > > refs.c | 328 ++++++++++++++++++++++++++-------------------------------------- > 1 files changed, 131 insertions(+), 197 deletions(-) > > diff --git a/refs.c b/refs.c > index a49ff74..e411bea 100644 > --- a/refs.c > +++ b/refs.c > @@ -8,14 +8,18 @@ > #define REF_KNOWS_PEELED 04 > #define REF_BROKEN 010 > > -struct ref_list { > - struct ref_list *next; > +struct ref_entry { > unsigned char flag; /* ISSYMREF? ISPACKED? */ > unsigned char sha1[20]; > unsigned char peeled[20]; > char name[FLEX_ARRAY]; > }; > > +struct ref_array { > + int nr, alloc; > + struct ref_entry **refs; > +}; > + Yeah, I can say "something like that" without looking at the rest of the patch ;-) The rest should naturally follow from the above data structures. ^ permalink raw reply [flat|nested] 126+ messages in thread
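The ALLOC_GROW macro that the patch leans on for array management comes from git's cache.h. For reference, it looked roughly like this at the time; the exact growth constants below are quoted from memory and should be treated as an approximation rather than a verbatim copy (xrealloc is git's die-on-failure realloc wrapper):

#define alloc_nr(x) (((x)+16)*3/2)

/*
 * Make sure array 'x' (with current capacity 'alloc') has room for
 * 'nr' elements, growing geometrically so that a long run of
 * add_ref() calls performs O(log n) reallocations rather than one
 * reallocation per appended entry.
 */
#define ALLOC_GROW(x, nr, alloc) \
	do { \
		if ((nr) > alloc) { \
			if (alloc_nr(alloc) < (nr)) \
				alloc = (nr); \
			else \
				alloc = alloc_nr(alloc); \
			x = xrealloc((x), alloc * sizeof(*(x))); \
		} \
	} while (0)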
* [PATCH v2] refs: Use binary search to lookup refs faster 2011-09-29 4:18 ` [PATCH] refs: Use binary search to lookup refs faster Julian Phillips 2011-09-29 21:57 ` Junio C Hamano @ 2011-09-29 22:04 ` Julian Phillips 2011-09-29 22:06 ` [PATCH] " Junio C Hamano 2 siblings, 0 replies; 126+ messages in thread From: Julian Phillips @ 2011-09-29 22:04 UTC (permalink / raw) To: Julian Phillips Cc: Julian Phillips, Martin Fick, Christian Couder, git, Christian Couder, Thomas Rast Currently we linearly search through lists of refs when we need to find a specific ref. This can be very slow if we need to lookup a large number of refs. By changing to a binary search we can make this faster. In order to be able to use a binary search we need to change from using linked lists to arrays, which we can manage using ALLOC_GROW. We can now also use the standard library qsort function to sort the refs arrays. Signed-off-by: Julian Phillips <julian@quantumfyre.co.uk> --- Previous version caused a regression in the test suite ... :$ refs.c | 329 ++++++++++++++++++++++++++-------------------------------------- 1 files changed, 133 insertions(+), 196 deletions(-) diff --git a/refs.c b/refs.c index a49ff74..35bba97 100644 --- a/refs.c +++ b/refs.c @@ -8,14 +8,18 @@ #define REF_KNOWS_PEELED 04 #define REF_BROKEN 010 -struct ref_list { - struct ref_list *next; +struct ref_entry { unsigned char flag; /* ISSYMREF? ISPACKED? */ unsigned char sha1[20]; unsigned char peeled[20]; char name[FLEX_ARRAY]; }; +struct ref_array { + int nr, alloc; + struct ref_entry **refs; +}; + static const char *parse_ref_line(char *line, unsigned char *sha1) { /* @@ -44,108 +48,58 @@ static const char *parse_ref_line(char *line, unsigned char *sha1) return line; } -static struct ref_list *add_ref(const char *name, const unsigned char *sha1, - int flag, struct ref_list *list, - struct ref_list **new_entry) +static void add_ref(const char *name, const unsigned char *sha1, + int flag, struct ref_array *refs, + struct ref_entry **new_entry) { int len; - struct ref_list *entry; + struct ref_entry *entry; /* Allocate it and add it in.. 
*/ len = strlen(name) + 1; - entry = xmalloc(sizeof(struct ref_list) + len); + entry = xmalloc(sizeof(struct ref) + len); hashcpy(entry->sha1, sha1); hashclr(entry->peeled); memcpy(entry->name, name, len); entry->flag = flag; - entry->next = list; if (new_entry) *new_entry = entry; - return entry; + ALLOC_GROW(refs->refs, refs->nr + 1, refs->alloc); + refs->refs[refs->nr++] = entry; } -/* merge sort the ref list */ -static struct ref_list *sort_ref_list(struct ref_list *list) +static int ref_entry_cmp(const void *a, const void *b) { - int psize, qsize, last_merge_count, cmp; - struct ref_list *p, *q, *l, *e; - struct ref_list *new_list = list; - int k = 1; - int merge_count = 0; - - if (!list) - return list; - - do { - last_merge_count = merge_count; - merge_count = 0; - - psize = 0; + struct ref_entry *one = *(struct ref_entry **)a; + struct ref_entry *two = *(struct ref_entry **)b; + return strcmp(one->name, two->name); +} - p = new_list; - q = new_list; - new_list = NULL; - l = NULL; +static void sort_ref_array(struct ref_array *array) +{ + qsort(array->refs, array->nr, sizeof(*array->refs), ref_entry_cmp); +} - while (p) { - merge_count++; +static struct ref_entry *search_ref_array(struct ref_array *array, const char *name) +{ + struct ref_entry *e, **r; + int len; - while (psize < k && q->next) { - q = q->next; - psize++; - } - qsize = k; - - while ((psize > 0) || (qsize > 0 && q)) { - if (qsize == 0 || !q) { - e = p; - p = p->next; - psize--; - } else if (psize == 0) { - e = q; - q = q->next; - qsize--; - } else { - cmp = strcmp(q->name, p->name); - if (cmp < 0) { - e = q; - q = q->next; - qsize--; - } else if (cmp > 0) { - e = p; - p = p->next; - psize--; - } else { - if (hashcmp(q->sha1, p->sha1)) - die("Duplicated ref, and SHA1s don't match: %s", - q->name); - warning("Duplicated ref: %s", q->name); - e = q; - q = q->next; - qsize--; - free(e); - e = p; - p = p->next; - psize--; - } - } + if (name == NULL) + return NULL; - e->next = NULL; + len = strlen(name) + 1; + e = xmalloc(sizeof(struct ref) + len); + memcpy(e->name, name, len); - if (l) - l->next = e; - if (!new_list) - new_list = e; - l = e; - } + r = bsearch(&e, array->refs, array->nr, sizeof(*array->refs), ref_entry_cmp); - p = q; - }; + free(e); - k = k * 2; - } while ((last_merge_count != merge_count) || (last_merge_count != 1)); + if (r == NULL) + return NULL; - return new_list; + return *r; } /* @@ -155,38 +109,37 @@ static struct ref_list *sort_ref_list(struct ref_list *list) static struct cached_refs { char did_loose; char did_packed; - struct ref_list *loose; - struct ref_list *packed; + struct ref_array loose; + struct ref_array packed; } cached_refs, submodule_refs; -static struct ref_list *current_ref; +static struct ref_entry *current_ref; -static struct ref_list *extra_refs; +static struct ref_array extra_refs; -static void free_ref_list(struct ref_list *list) +static void free_ref_array(struct ref_array *array) { - struct ref_list *next; - for ( ; list; list = next) { - next = list->next; - free(list); - } + int i; + for (i = 0; i < array->nr; i++) + free(array->refs[i]); + free(array->refs); + array->nr = array->alloc = 0; + array->refs = NULL; } static void invalidate_cached_refs(void) { struct cached_refs *ca = &cached_refs; - if (ca->did_loose && ca->loose) - free_ref_list(ca->loose); - if (ca->did_packed && ca->packed) - free_ref_list(ca->packed); - ca->loose = ca->packed = NULL; + if (ca->did_loose) + free_ref_array(&ca->loose); + if (ca->did_packed) + free_ref_array(&ca->packed); ca->did_loose = 
ca->did_packed = 0; } static void read_packed_refs(FILE *f, struct cached_refs *cached_refs) { - struct ref_list *list = NULL; - struct ref_list *last = NULL; + struct ref_entry *last = NULL; char refline[PATH_MAX]; int flag = REF_ISPACKED; @@ -205,7 +158,7 @@ static void read_packed_refs(FILE *f, struct cached_refs *cached_refs) name = parse_ref_line(refline, sha1); if (name) { - list = add_ref(name, sha1, flag, list, &last); + add_ref(name, sha1, flag, &cached_refs->packed, &last); continue; } if (last && @@ -215,21 +168,20 @@ static void read_packed_refs(FILE *f, struct cached_refs *cached_refs) !get_sha1_hex(refline + 1, sha1)) hashcpy(last->peeled, sha1); } - cached_refs->packed = sort_ref_list(list); + sort_ref_array(&cached_refs->packed); } void add_extra_ref(const char *name, const unsigned char *sha1, int flag) { - extra_refs = add_ref(name, sha1, flag, extra_refs, NULL); + add_ref(name, sha1, flag, &extra_refs, NULL); } void clear_extra_refs(void) { - free_ref_list(extra_refs); - extra_refs = NULL; + free_ref_array(&extra_refs); } -static struct ref_list *get_packed_refs(const char *submodule) +static struct ref_array *get_packed_refs(const char *submodule) { const char *packed_refs_file; struct cached_refs *refs; @@ -237,7 +189,7 @@ static struct ref_list *get_packed_refs(const char *submodule) if (submodule) { packed_refs_file = git_path_submodule(submodule, "packed-refs"); refs = &submodule_refs; - free_ref_list(refs->packed); + free_ref_array(&refs->packed); } else { packed_refs_file = git_path("packed-refs"); refs = &cached_refs; @@ -245,18 +197,17 @@ static struct ref_list *get_packed_refs(const char *submodule) if (!refs->did_packed || submodule) { FILE *f = fopen(packed_refs_file, "r"); - refs->packed = NULL; if (f) { read_packed_refs(f, refs); fclose(f); } refs->did_packed = 1; } - return refs->packed; + return &refs->packed; } -static struct ref_list *get_ref_dir(const char *submodule, const char *base, - struct ref_list *list) +static void get_ref_dir(const char *submodule, const char *base, + struct ref_array *array) { DIR *dir; const char *path; @@ -299,7 +250,7 @@ static struct ref_list *get_ref_dir(const char *submodule, const char *base, if (stat(refdir, &st) < 0) continue; if (S_ISDIR(st.st_mode)) { - list = get_ref_dir(submodule, ref, list); + get_ref_dir(submodule, ref, array); continue; } if (submodule) { @@ -314,12 +265,11 @@ static struct ref_list *get_ref_dir(const char *submodule, const char *base, hashclr(sha1); flag |= REF_BROKEN; } - list = add_ref(ref, sha1, flag, list, NULL); + add_ref(ref, sha1, flag, array, NULL); } free(ref); closedir(dir); } - return list; } struct warn_if_dangling_data { @@ -356,21 +306,21 @@ void warn_dangling_symref(FILE *fp, const char *msg_fmt, const char *refname) for_each_rawref(warn_if_dangling_symref, &data); } -static struct ref_list *get_loose_refs(const char *submodule) +static struct ref_array *get_loose_refs(const char *submodule) { if (submodule) { - free_ref_list(submodule_refs.loose); - submodule_refs.loose = get_ref_dir(submodule, "refs", NULL); - submodule_refs.loose = sort_ref_list(submodule_refs.loose); - return submodule_refs.loose; + free_ref_array(&submodule_refs.loose); + get_ref_dir(submodule, "refs", &submodule_refs.loose); + sort_ref_array(&submodule_refs.loose); + return &submodule_refs.loose; } if (!cached_refs.did_loose) { - cached_refs.loose = get_ref_dir(NULL, "refs", NULL); - cached_refs.loose = sort_ref_list(cached_refs.loose); + get_ref_dir(NULL, "refs", &cached_refs.loose); + 
sort_ref_array(&cached_refs.loose); cached_refs.did_loose = 1; } - return cached_refs.loose; + return &cached_refs.loose; } /* We allow "recursive" symbolic refs. Only within reason, though */ @@ -381,8 +331,8 @@ static int resolve_gitlink_packed_ref(char *name, int pathlen, const char *refna { FILE *f; struct cached_refs refs; - struct ref_list *ref; - int retval; + struct ref_entry *ref; + int retval = -1; strcpy(name + pathlen, "packed-refs"); f = fopen(name, "r"); @@ -390,17 +340,12 @@ static int resolve_gitlink_packed_ref(char *name, int pathlen, const char *refna return -1; read_packed_refs(f, &refs); fclose(f); - ref = refs.packed; - retval = -1; - while (ref) { - if (!strcmp(ref->name, refname)) { - retval = 0; - memcpy(result, ref->sha1, 20); - break; - } - ref = ref->next; + ref = search_ref_array(&refs.packed, refname); + if (ref != NULL) { + memcpy(result, ref->sha1, 20); + retval = 0; } - free_ref_list(refs.packed); + free_ref_array(&refs.packed); return retval; } @@ -501,15 +446,13 @@ const char *resolve_ref(const char *ref, unsigned char *sha1, int reading, int * git_snpath(path, sizeof(path), "%s", ref); /* Special case: non-existing file. */ if (lstat(path, &st) < 0) { - struct ref_list *list = get_packed_refs(NULL); - while (list) { - if (!strcmp(ref, list->name)) { - hashcpy(sha1, list->sha1); - if (flag) - *flag |= REF_ISPACKED; - return ref; - } - list = list->next; + struct ref_array *packed = get_packed_refs(NULL); + struct ref_entry *r = search_ref_array(packed, ref); + if (r != NULL) { + hashcpy(sha1, r->sha1); + if (flag) + *flag |= REF_ISPACKED; + return ref; } if (reading || errno != ENOENT) return NULL; @@ -584,7 +527,7 @@ int read_ref(const char *ref, unsigned char *sha1) #define DO_FOR_EACH_INCLUDE_BROKEN 01 static int do_one_ref(const char *base, each_ref_fn fn, int trim, - int flags, void *cb_data, struct ref_list *entry) + int flags, void *cb_data, struct ref_entry *entry) { if (prefixcmp(entry->name, base)) return 0; @@ -630,18 +573,12 @@ int peel_ref(const char *ref, unsigned char *sha1) return -1; if ((flag & REF_ISPACKED)) { - struct ref_list *list = get_packed_refs(NULL); + struct ref_array *array = get_packed_refs(NULL); + struct ref_entry *r = search_ref_array(array, ref); - while (list) { - if (!strcmp(list->name, ref)) { - if (list->flag & REF_KNOWS_PEELED) { - hashcpy(sha1, list->peeled); - return 0; - } - /* older pack-refs did not leave peeled ones */ - break; - } - list = list->next; + if (r != NULL && r->flag & REF_KNOWS_PEELED) { + hashcpy(sha1, r->peeled); + return 0; } } @@ -660,36 +597,39 @@ fallback: static int do_for_each_ref(const char *submodule, const char *base, each_ref_fn fn, int trim, int flags, void *cb_data) { - int retval = 0; - struct ref_list *packed = get_packed_refs(submodule); - struct ref_list *loose = get_loose_refs(submodule); + int retval = 0, i, p = 0, l = 0; + struct ref_array *packed = get_packed_refs(submodule); + struct ref_array *loose = get_loose_refs(submodule); - struct ref_list *extra; + struct ref_array *extra = &extra_refs; - for (extra = extra_refs; extra; extra = extra->next) - retval = do_one_ref(base, fn, trim, flags, cb_data, extra); + for (i = 0; i < extra->nr; i++) + retval = do_one_ref(base, fn, trim, flags, cb_data, extra->refs[i]); - while (packed && loose) { - struct ref_list *entry; - int cmp = strcmp(packed->name, loose->name); + while (p < packed->nr && l < loose->nr) { + struct ref_entry *entry; + int cmp = strcmp(packed->refs[p]->name, loose->refs[l]->name); if (!cmp) { - packed = 
packed->next; + p++; continue; } if (cmp > 0) { - entry = loose; - loose = loose->next; + entry = loose->refs[l++]; } else { - entry = packed; - packed = packed->next; + entry = packed->refs[p++]; } retval = do_one_ref(base, fn, trim, flags, cb_data, entry); if (retval) goto end_each; } - for (packed = packed ? packed : loose; packed; packed = packed->next) { - retval = do_one_ref(base, fn, trim, flags, cb_data, packed); + if (l < loose->nr) { + p = l; + packed = loose; + } + + for (; p < packed->nr; p++) { + retval = do_one_ref(base, fn, trim, flags, cb_data, packed->refs[p]); if (retval) goto end_each; } @@ -1005,24 +945,24 @@ static int remove_empty_directories(const char *file) } static int is_refname_available(const char *ref, const char *oldref, - struct ref_list *list, int quiet) -{ - int namlen = strlen(ref); /* e.g. 'foo/bar' */ - while (list) { - /* list->name could be 'foo' or 'foo/bar/baz' */ - if (!oldref || strcmp(oldref, list->name)) { - int len = strlen(list->name); + struct ref_array *array, int quiet) +{ + int i, namlen = strlen(ref); /* e.g. 'foo/bar' */ + for (i = 0; i < array->nr; i++ ) { + struct ref_entry *entry = array->refs[i]; + /* entry->name could be 'foo' or 'foo/bar/baz' */ + if (!oldref || strcmp(oldref, entry->name)) { + int len = strlen(entry->name); int cmplen = (namlen < len) ? namlen : len; - const char *lead = (namlen < len) ? list->name : ref; - if (!strncmp(ref, list->name, cmplen) && + const char *lead = (namlen < len) ? entry->name : ref; + if (!strncmp(ref, entry->name, cmplen) && lead[cmplen] == '/') { if (!quiet) error("'%s' exists; cannot create '%s'", - list->name, ref); + entry->name, ref); return 0; } } - list = list->next; } return 1; } @@ -1129,18 +1069,13 @@ static struct lock_file packlock; static int repack_without_ref(const char *refname) { - struct ref_list *list, *packed_ref_list; - int fd; - int found = 0; + struct ref_array *packed; + struct ref_entry *ref; + int fd, i; - packed_ref_list = get_packed_refs(NULL); - for (list = packed_ref_list; list; list = list->next) { - if (!strcmp(refname, list->name)) { - found = 1; - break; - } - } - if (!found) + packed = get_packed_refs(NULL); + ref = search_ref_array(packed, refname); + if (ref == NULL) return 0; fd = hold_lock_file_for_update(&packlock, git_path("packed-refs"), 0); if (fd < 0) { @@ -1148,17 +1083,19 @@ static int repack_without_ref(const char *refname) return error("cannot delete '%s' from packed refs", refname); } - for (list = packed_ref_list; list; list = list->next) { + for (i = 0; i < packed->nr; i++) { char line[PATH_MAX + 100]; int len; - if (!strcmp(refname, list->name)) + ref = packed->refs[i]; + + if (!strcmp(refname, ref->name)) continue; len = snprintf(line, sizeof(line), "%s %s\n", - sha1_to_hex(list->sha1), list->name); + sha1_to_hex(ref->sha1), ref->name); /* this should not happen but just being defensive */ if (len > sizeof(line)) - die("too long a refname '%s'", list->name); + die("too long a refname '%s'", ref->name); write_or_die(fd, line, len); } return commit_lock_file(&packlock); -- 1.7.6.1 ^ permalink raw reply related [flat|nested] 126+ messages in thread
* Re: [PATCH] refs: Use binary search to lookup refs faster 2011-09-29 4:18 ` [PATCH] refs: Use binary search to lookup refs faster Julian Phillips 2011-09-29 21:57 ` Junio C Hamano 2011-09-29 22:04 ` [PATCH v2] " Julian Phillips @ 2011-09-29 22:06 ` Junio C Hamano 2011-09-29 22:11 ` [PATCH v3] " Julian Phillips 2 siblings, 1 reply; 126+ messages in thread From: Junio C Hamano @ 2011-09-29 22:06 UTC (permalink / raw) To: Julian Phillips Cc: Martin Fick, Christian Couder, git, Christian Couder, Thomas Rast Julian Phillips <julian@quantumfyre.co.uk> writes: > +static void add_ref(const char *name, const unsigned char *sha1, > + int flag, struct ref_array *refs, > + struct ref_entry **new_entry) > { > int len; > - struct ref_list *entry; > + struct ref_entry *entry; > > /* Allocate it and add it in.. */ > len = strlen(name) + 1; > - entry = xmalloc(sizeof(struct ref_list) + len); > + entry = xmalloc(sizeof(struct ref) + len); This should be sizeof(struct ref_entry), no? There is another such misallocation in search_ref_array() where it prepares a temporary. ^ permalink raw reply [flat|nested] 126+ messages in thread
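The misallocation Junio spotted is easy to miss because it fails safe. Both structs end in a flexible array member, so the allocation must be the base struct size plus the string length; taking sizeof of the wrong struct merely changes how much memory is requested. A sketch with hypothetical struct names (not git's) shows why the bug stayed invisible to the test suite, as Julian notes in his reply below:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct small {          /* stands in for ref_entry, the type
                           actually being allocated */
	unsigned char flag;
	char name[];    /* flexible array member, like FLEX_ARRAY */
};

struct big {            /* stands in for "struct ref", which happens
                           to be larger than struct small */
	char other_fields[64];
	char name[];
};

static struct small *make_entry(const char *name)
{
	size_t len = strlen(name) + 1;
	/* Correct: base size of the type actually allocated. */
	struct small *e = malloc(sizeof(struct small) + len);
	/*
	 * The v1/v2 mistake was the moral equivalent of writing
	 * sizeof(struct big) here.  Because struct big is larger,
	 * that merely over-allocates: nothing crashes and no test
	 * fails.  Had the sizes been the other way around, it would
	 * have been silent heap corruption instead.
	 */
	if (!e)
		exit(1);
	memcpy(e->name, name, len);
	e->flag = 0;
	return e;
}

int main(void)
{
	struct small *e = make_entry("refs/heads/master");
	printf("%s\n", e->name);
	free(e);
	return 0;
}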
* [PATCH v3] refs: Use binary search to lookup refs faster 2011-09-29 22:06 ` [PATCH] " Junio C Hamano @ 2011-09-29 22:11 ` Julian Phillips 2011-09-29 23:48 ` Junio C Hamano 2011-09-30 1:13 ` Martin Fick 0 siblings, 2 replies; 126+ messages in thread From: Julian Phillips @ 2011-09-29 22:11 UTC (permalink / raw) To: Junio C Hamano Cc: Martin Fick, Christian Couder, git, Christian Couder, Thomas Rast Currently we linearly search through lists of refs when we need to find a specific ref. This can be very slow if we need to lookup a large number of refs. By changing to a binary search we can make this faster. In order to be able to use a binary search we need to change from using linked lists to arrays, which we can manage using ALLOC_GROW. We can now also use the standard library qsort function to sort the refs arrays. Signed-off-by: Julian Phillips <julian@quantumfyre.co.uk> --- On Thu, 29 Sep 2011 15:06:03 -0700, Junio C Hamano wrote: > Julian Phillips <julian@quantumfyre.co.uk> writes: > >> +static void add_ref(const char *name, const unsigned char *sha1, >> + int flag, struct ref_array *refs, >> + struct ref_entry **new_entry) >> { >> int len; >> - struct ref_list *entry; >> + struct ref_entry *entry; >> >> /* Allocate it and add it in.. */ >> len = strlen(name) + 1; >> - entry = xmalloc(sizeof(struct ref_list) + len); >> + entry = xmalloc(sizeof(struct ref) + len); > > This should be sizeof(struct ref_entry), no? There is another such > misallocation in search_ref_array() where it prepares a temporary. Indeed, thanks. Looks like two instances of not noticing that "struct ref" already existed managed to survive. Drat. Of course since "struct ref" is bigger than "struct ref_entry", everthing worked fine ... so no failed tests to tip me off. refs.c | 329 ++++++++++++++++++++++++++-------------------------------------- 1 files changed, 133 insertions(+), 196 deletions(-) diff --git a/refs.c b/refs.c index a49ff74..4c01d79 100644 --- a/refs.c +++ b/refs.c @@ -8,14 +8,18 @@ #define REF_KNOWS_PEELED 04 #define REF_BROKEN 010 -struct ref_list { - struct ref_list *next; +struct ref_entry { unsigned char flag; /* ISSYMREF? ISPACKED? */ unsigned char sha1[20]; unsigned char peeled[20]; char name[FLEX_ARRAY]; }; +struct ref_array { + int nr, alloc; + struct ref_entry **refs; +}; + static const char *parse_ref_line(char *line, unsigned char *sha1) { /* @@ -44,108 +48,58 @@ static const char *parse_ref_line(char *line, unsigned char *sha1) return line; } -static struct ref_list *add_ref(const char *name, const unsigned char *sha1, - int flag, struct ref_list *list, - struct ref_list **new_entry) +static void add_ref(const char *name, const unsigned char *sha1, + int flag, struct ref_array *refs, + struct ref_entry **new_entry) { int len; - struct ref_list *entry; + struct ref_entry *entry; /* Allocate it and add it in.. 
*/ len = strlen(name) + 1; - entry = xmalloc(sizeof(struct ref_list) + len); + entry = xmalloc(sizeof(struct ref_entry) + len); hashcpy(entry->sha1, sha1); hashclr(entry->peeled); memcpy(entry->name, name, len); entry->flag = flag; - entry->next = list; if (new_entry) *new_entry = entry; - return entry; + ALLOC_GROW(refs->refs, refs->nr + 1, refs->alloc); + refs->refs[refs->nr++] = entry; } -/* merge sort the ref list */ -static struct ref_list *sort_ref_list(struct ref_list *list) +static int ref_entry_cmp(const void *a, const void *b) { - int psize, qsize, last_merge_count, cmp; - struct ref_list *p, *q, *l, *e; - struct ref_list *new_list = list; - int k = 1; - int merge_count = 0; - - if (!list) - return list; - - do { - last_merge_count = merge_count; - merge_count = 0; - - psize = 0; + struct ref_entry *one = *(struct ref_entry **)a; + struct ref_entry *two = *(struct ref_entry **)b; + return strcmp(one->name, two->name); +} - p = new_list; - q = new_list; - new_list = NULL; - l = NULL; +static void sort_ref_array(struct ref_array *array) +{ + qsort(array->refs, array->nr, sizeof(*array->refs), ref_entry_cmp); +} - while (p) { - merge_count++; +static struct ref_entry *search_ref_array(struct ref_array *array, const char *name) +{ + struct ref_entry *e, **r; + int len; - while (psize < k && q->next) { - q = q->next; - psize++; - } - qsize = k; - - while ((psize > 0) || (qsize > 0 && q)) { - if (qsize == 0 || !q) { - e = p; - p = p->next; - psize--; - } else if (psize == 0) { - e = q; - q = q->next; - qsize--; - } else { - cmp = strcmp(q->name, p->name); - if (cmp < 0) { - e = q; - q = q->next; - qsize--; - } else if (cmp > 0) { - e = p; - p = p->next; - psize--; - } else { - if (hashcmp(q->sha1, p->sha1)) - die("Duplicated ref, and SHA1s don't match: %s", - q->name); - warning("Duplicated ref: %s", q->name); - e = q; - q = q->next; - qsize--; - free(e); - e = p; - p = p->next; - psize--; - } - } + if (name == NULL) + return NULL; - e->next = NULL; + len = strlen(name) + 1; + e = xmalloc(sizeof(struct ref_entry) + len); + memcpy(e->name, name, len); - if (l) - l->next = e; - if (!new_list) - new_list = e; - l = e; - } + r = bsearch(&e, array->refs, array->nr, sizeof(*array->refs), ref_entry_cmp); - p = q; - }; + free(e); - k = k * 2; - } while ((last_merge_count != merge_count) || (last_merge_count != 1)); + if (r == NULL) + return NULL; - return new_list; + return *r; } /* @@ -155,38 +109,37 @@ static struct ref_list *sort_ref_list(struct ref_list *list) static struct cached_refs { char did_loose; char did_packed; - struct ref_list *loose; - struct ref_list *packed; + struct ref_array loose; + struct ref_array packed; } cached_refs, submodule_refs; -static struct ref_list *current_ref; +static struct ref_entry *current_ref; -static struct ref_list *extra_refs; +static struct ref_array extra_refs; -static void free_ref_list(struct ref_list *list) +static void free_ref_array(struct ref_array *array) { - struct ref_list *next; - for ( ; list; list = next) { - next = list->next; - free(list); - } + int i; + for (i = 0; i < array->nr; i++) + free(array->refs[i]); + free(array->refs); + array->nr = array->alloc = 0; + array->refs = NULL; } static void invalidate_cached_refs(void) { struct cached_refs *ca = &cached_refs; - if (ca->did_loose && ca->loose) - free_ref_list(ca->loose); - if (ca->did_packed && ca->packed) - free_ref_list(ca->packed); - ca->loose = ca->packed = NULL; + if (ca->did_loose) + free_ref_array(&ca->loose); + if (ca->did_packed) + free_ref_array(&ca->packed); 
ca->did_loose = ca->did_packed = 0; } static void read_packed_refs(FILE *f, struct cached_refs *cached_refs) { - struct ref_list *list = NULL; - struct ref_list *last = NULL; + struct ref_entry *last = NULL; char refline[PATH_MAX]; int flag = REF_ISPACKED; @@ -205,7 +158,7 @@ static void read_packed_refs(FILE *f, struct cached_refs *cached_refs) name = parse_ref_line(refline, sha1); if (name) { - list = add_ref(name, sha1, flag, list, &last); + add_ref(name, sha1, flag, &cached_refs->packed, &last); continue; } if (last && @@ -215,21 +168,20 @@ static void read_packed_refs(FILE *f, struct cached_refs *cached_refs) !get_sha1_hex(refline + 1, sha1)) hashcpy(last->peeled, sha1); } - cached_refs->packed = sort_ref_list(list); + sort_ref_array(&cached_refs->packed); } void add_extra_ref(const char *name, const unsigned char *sha1, int flag) { - extra_refs = add_ref(name, sha1, flag, extra_refs, NULL); + add_ref(name, sha1, flag, &extra_refs, NULL); } void clear_extra_refs(void) { - free_ref_list(extra_refs); - extra_refs = NULL; + free_ref_array(&extra_refs); } -static struct ref_list *get_packed_refs(const char *submodule) +static struct ref_array *get_packed_refs(const char *submodule) { const char *packed_refs_file; struct cached_refs *refs; @@ -237,7 +189,7 @@ static struct ref_list *get_packed_refs(const char *submodule) if (submodule) { packed_refs_file = git_path_submodule(submodule, "packed-refs"); refs = &submodule_refs; - free_ref_list(refs->packed); + free_ref_array(&refs->packed); } else { packed_refs_file = git_path("packed-refs"); refs = &cached_refs; @@ -245,18 +197,17 @@ static struct ref_list *get_packed_refs(const char *submodule) if (!refs->did_packed || submodule) { FILE *f = fopen(packed_refs_file, "r"); - refs->packed = NULL; if (f) { read_packed_refs(f, refs); fclose(f); } refs->did_packed = 1; } - return refs->packed; + return &refs->packed; } -static struct ref_list *get_ref_dir(const char *submodule, const char *base, - struct ref_list *list) +static void get_ref_dir(const char *submodule, const char *base, + struct ref_array *array) { DIR *dir; const char *path; @@ -299,7 +250,7 @@ static struct ref_list *get_ref_dir(const char *submodule, const char *base, if (stat(refdir, &st) < 0) continue; if (S_ISDIR(st.st_mode)) { - list = get_ref_dir(submodule, ref, list); + get_ref_dir(submodule, ref, array); continue; } if (submodule) { @@ -314,12 +265,11 @@ static struct ref_list *get_ref_dir(const char *submodule, const char *base, hashclr(sha1); flag |= REF_BROKEN; } - list = add_ref(ref, sha1, flag, list, NULL); + add_ref(ref, sha1, flag, array, NULL); } free(ref); closedir(dir); } - return list; } struct warn_if_dangling_data { @@ -356,21 +306,21 @@ void warn_dangling_symref(FILE *fp, const char *msg_fmt, const char *refname) for_each_rawref(warn_if_dangling_symref, &data); } -static struct ref_list *get_loose_refs(const char *submodule) +static struct ref_array *get_loose_refs(const char *submodule) { if (submodule) { - free_ref_list(submodule_refs.loose); - submodule_refs.loose = get_ref_dir(submodule, "refs", NULL); - submodule_refs.loose = sort_ref_list(submodule_refs.loose); - return submodule_refs.loose; + free_ref_array(&submodule_refs.loose); + get_ref_dir(submodule, "refs", &submodule_refs.loose); + sort_ref_array(&submodule_refs.loose); + return &submodule_refs.loose; } if (!cached_refs.did_loose) { - cached_refs.loose = get_ref_dir(NULL, "refs", NULL); - cached_refs.loose = sort_ref_list(cached_refs.loose); + get_ref_dir(NULL, "refs", &cached_refs.loose); + 
sort_ref_array(&cached_refs.loose); cached_refs.did_loose = 1; } - return cached_refs.loose; + return &cached_refs.loose; } /* We allow "recursive" symbolic refs. Only within reason, though */ @@ -381,8 +331,8 @@ static int resolve_gitlink_packed_ref(char *name, int pathlen, const char *refna { FILE *f; struct cached_refs refs; - struct ref_list *ref; - int retval; + struct ref_entry *ref; + int retval = -1; strcpy(name + pathlen, "packed-refs"); f = fopen(name, "r"); @@ -390,17 +340,12 @@ static int resolve_gitlink_packed_ref(char *name, int pathlen, const char *refna return -1; read_packed_refs(f, &refs); fclose(f); - ref = refs.packed; - retval = -1; - while (ref) { - if (!strcmp(ref->name, refname)) { - retval = 0; - memcpy(result, ref->sha1, 20); - break; - } - ref = ref->next; + ref = search_ref_array(&refs.packed, refname); + if (ref != NULL) { + memcpy(result, ref->sha1, 20); + retval = 0; } - free_ref_list(refs.packed); + free_ref_array(&refs.packed); return retval; } @@ -501,15 +446,13 @@ const char *resolve_ref(const char *ref, unsigned char *sha1, int reading, int * git_snpath(path, sizeof(path), "%s", ref); /* Special case: non-existing file. */ if (lstat(path, &st) < 0) { - struct ref_list *list = get_packed_refs(NULL); - while (list) { - if (!strcmp(ref, list->name)) { - hashcpy(sha1, list->sha1); - if (flag) - *flag |= REF_ISPACKED; - return ref; - } - list = list->next; + struct ref_array *packed = get_packed_refs(NULL); + struct ref_entry *r = search_ref_array(packed, ref); + if (r != NULL) { + hashcpy(sha1, r->sha1); + if (flag) + *flag |= REF_ISPACKED; + return ref; } if (reading || errno != ENOENT) return NULL; @@ -584,7 +527,7 @@ int read_ref(const char *ref, unsigned char *sha1) #define DO_FOR_EACH_INCLUDE_BROKEN 01 static int do_one_ref(const char *base, each_ref_fn fn, int trim, - int flags, void *cb_data, struct ref_list *entry) + int flags, void *cb_data, struct ref_entry *entry) { if (prefixcmp(entry->name, base)) return 0; @@ -630,18 +573,12 @@ int peel_ref(const char *ref, unsigned char *sha1) return -1; if ((flag & REF_ISPACKED)) { - struct ref_list *list = get_packed_refs(NULL); + struct ref_array *array = get_packed_refs(NULL); + struct ref_entry *r = search_ref_array(array, ref); - while (list) { - if (!strcmp(list->name, ref)) { - if (list->flag & REF_KNOWS_PEELED) { - hashcpy(sha1, list->peeled); - return 0; - } - /* older pack-refs did not leave peeled ones */ - break; - } - list = list->next; + if (r != NULL && r->flag & REF_KNOWS_PEELED) { + hashcpy(sha1, r->peeled); + return 0; } } @@ -660,36 +597,39 @@ fallback: static int do_for_each_ref(const char *submodule, const char *base, each_ref_fn fn, int trim, int flags, void *cb_data) { - int retval = 0; - struct ref_list *packed = get_packed_refs(submodule); - struct ref_list *loose = get_loose_refs(submodule); + int retval = 0, i, p = 0, l = 0; + struct ref_array *packed = get_packed_refs(submodule); + struct ref_array *loose = get_loose_refs(submodule); - struct ref_list *extra; + struct ref_array *extra = &extra_refs; - for (extra = extra_refs; extra; extra = extra->next) - retval = do_one_ref(base, fn, trim, flags, cb_data, extra); + for (i = 0; i < extra->nr; i++) + retval = do_one_ref(base, fn, trim, flags, cb_data, extra->refs[i]); - while (packed && loose) { - struct ref_list *entry; - int cmp = strcmp(packed->name, loose->name); + while (p < packed->nr && l < loose->nr) { + struct ref_entry *entry; + int cmp = strcmp(packed->refs[p]->name, loose->refs[l]->name); if (!cmp) { - packed = 
packed->next; + p++; continue; } if (cmp > 0) { - entry = loose; - loose = loose->next; + entry = loose->refs[l++]; } else { - entry = packed; - packed = packed->next; + entry = packed->refs[p++]; } retval = do_one_ref(base, fn, trim, flags, cb_data, entry); if (retval) goto end_each; } - for (packed = packed ? packed : loose; packed; packed = packed->next) { - retval = do_one_ref(base, fn, trim, flags, cb_data, packed); + if (l < loose->nr) { + p = l; + packed = loose; + } + + for (; p < packed->nr; p++) { + retval = do_one_ref(base, fn, trim, flags, cb_data, packed->refs[p]); if (retval) goto end_each; } @@ -1005,24 +945,24 @@ static int remove_empty_directories(const char *file) } static int is_refname_available(const char *ref, const char *oldref, - struct ref_list *list, int quiet) -{ - int namlen = strlen(ref); /* e.g. 'foo/bar' */ - while (list) { - /* list->name could be 'foo' or 'foo/bar/baz' */ - if (!oldref || strcmp(oldref, list->name)) { - int len = strlen(list->name); + struct ref_array *array, int quiet) +{ + int i, namlen = strlen(ref); /* e.g. 'foo/bar' */ + for (i = 0; i < array->nr; i++ ) { + struct ref_entry *entry = array->refs[i]; + /* entry->name could be 'foo' or 'foo/bar/baz' */ + if (!oldref || strcmp(oldref, entry->name)) { + int len = strlen(entry->name); int cmplen = (namlen < len) ? namlen : len; - const char *lead = (namlen < len) ? list->name : ref; - if (!strncmp(ref, list->name, cmplen) && + const char *lead = (namlen < len) ? entry->name : ref; + if (!strncmp(ref, entry->name, cmplen) && lead[cmplen] == '/') { if (!quiet) error("'%s' exists; cannot create '%s'", - list->name, ref); + entry->name, ref); return 0; } } - list = list->next; } return 1; } @@ -1129,18 +1069,13 @@ static struct lock_file packlock; static int repack_without_ref(const char *refname) { - struct ref_list *list, *packed_ref_list; - int fd; - int found = 0; + struct ref_array *packed; + struct ref_entry *ref; + int fd, i; - packed_ref_list = get_packed_refs(NULL); - for (list = packed_ref_list; list; list = list->next) { - if (!strcmp(refname, list->name)) { - found = 1; - break; - } - } - if (!found) + packed = get_packed_refs(NULL); + ref = search_ref_array(packed, refname); + if (ref == NULL) return 0; fd = hold_lock_file_for_update(&packlock, git_path("packed-refs"), 0); if (fd < 0) { @@ -1148,17 +1083,19 @@ static int repack_without_ref(const char *refname) return error("cannot delete '%s' from packed refs", refname); } - for (list = packed_ref_list; list; list = list->next) { + for (i = 0; i < packed->nr; i++) { char line[PATH_MAX + 100]; int len; - if (!strcmp(refname, list->name)) + ref = packed->refs[i]; + + if (!strcmp(refname, ref->name)) continue; len = snprintf(line, sizeof(line), "%s %s\n", - sha1_to_hex(list->sha1), list->name); + sha1_to_hex(ref->sha1), ref->name); /* this should not happen but just being defensive */ if (len > sizeof(line)) - die("too long a refname '%s'", list->name); + die("too long a refname '%s'", ref->name); write_or_die(fd, line, len); } return commit_lock_file(&packlock); -- 1.7.6.1 ^ permalink raw reply related [flat|nested] 126+ messages in thread
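Apart from the sizeof fix, v3 has the same shape as before, and the rewritten do_for_each_ref() is worth a close read: it is a standard merge over two name-sorted arrays in which a name present in both arrays is taken from the loose side (the packed entry is skipped on a tie). A standalone sketch of that control flow, draining the leftover side with two plain loops instead of the patch's trick of pointing 'packed' at 'loose':

#include <stdio.h>
#include <string.h>

typedef int (*each_fn)(const char *name);

static int merge_iterate(const char **packed, int pn,
			 const char **loose, int ln, each_fn fn)
{
	int p = 0, l = 0, ret;

	while (p < pn && l < ln) {
		int cmp = strcmp(packed[p], loose[l]);
		if (!cmp) {    /* ref is both packed and loose:    */
			p++;   /* loose wins; skip the packed one  */
			continue;
		}
		ret = fn(cmp > 0 ? loose[l++] : packed[p++]);
		if (ret)
			return ret;
	}
	for (; p < pn; p++)    /* drain whichever side remains */
		if ((ret = fn(packed[p])))
			return ret;
	for (; l < ln; l++)
		if ((ret = fn(loose[l])))
			return ret;
	return 0;
}

static int show(const char *name) { printf("%s\n", name); return 0; }

int main(void)
{
	const char *packed[] = { "refs/heads/maint", "refs/heads/master" };
	const char *loose[]  = { "refs/heads/master", "refs/heads/topic" };
	return merge_iterate(packed, 2, loose, 2, show);
}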
* Re: [PATCH v3] refs: Use binary search to lookup refs faster 2011-09-29 22:11 ` [PATCH v3] " Julian Phillips @ 2011-09-29 23:48 ` Junio C Hamano 2011-09-30 15:30 ` Michael Haggerty 2011-09-30 1:13 ` Martin Fick 1 sibling, 1 reply; 126+ messages in thread From: Junio C Hamano @ 2011-09-29 23:48 UTC (permalink / raw) To: Julian Phillips Cc: Michael Haggerty, Martin Fick, Christian Couder, git, Christian Couder, Thomas Rast This version looks sane, although I have a suspicion that it may have some interaction with what Michael may be working on. Thanks. ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: [PATCH v3] refs: Use binary search to lookup refs faster 2011-09-29 23:48 ` Junio C Hamano @ 2011-09-30 15:30 ` Michael Haggerty 2011-09-30 16:38 ` Junio C Hamano 0 siblings, 1 reply; 126+ messages in thread From: Michael Haggerty @ 2011-09-30 15:30 UTC (permalink / raw) To: Junio C Hamano Cc: Julian Phillips, Martin Fick, Christian Couder, git, Christian Couder, Thomas Rast On 09/30/2011 01:48 AM, Junio C Hamano wrote: > This version looks sane, although I have a suspicion that it may have > some interaction with what Michael may be working on. Indeed, I have almost equivalent changes in the giant patch series that I am working on [1]. The branch is very experimental. The tip currently passes all the tests, but it has a known performance regression in connection if "git fetch" is used to fetch many commits. But before comparing ref-related optimizations, we have an *urgent* need for a decent performance test suite. There are many slightly different scenarios that have very different performance characteristics, and we have to be sure that we are optimizing for the whole palette of many-reference use cases. So I made an attempt at a kludgey but somewhat flexible performance-testing script [2]. I don't know whether something like this should be integrated into the git project, and if so where; suggestions are welcome. To run the tests, from the root of the git source tree:

make                           # make sure git is up-to-date
t/make-refperf-repo --help
t/make-refperf-repo [OPTIONS]
t/refperf
cat refperf.times              # See the results

The default repo has 5k commits in a linear series with one reference on each commit. (These numbers can both be adjusted.) The reference namespace can be laid out a few ways:

* Many references in a single "directory" vs. sharded over many "directories"
* In lexicographic order by commit, in reverse order, or "shuffled".

By default, the repo is written to "refperf-repo". The time it takes to create the test repository is itself also an interesting benchmark. For example, on the maint branch it is terribly slow unless it is passed either the --pack-refs-interval=N (with N, say 100) or --no-replace-object option. I also noticed that if it is run like

t/make-refperf-repo --refs=5000 --commits=5000 \
    --pack-refs-interval=100

(one ref per commit), git-pack-refs becomes precipitously and dramatically slower after the 2000th commit. I haven't had time yet for systematic benchmarks of other git versions. See the refperf script to see what sorts of benchmarks I have built into it so far. The refperf test is non-destructive; it always copies from "refperf-repo" to "refperf-repo-copy" and does its tests in the copy; therefore a test repo can be reused. The timing data are written to "refperf.times" and other output to "refperf.log".
Here are my refperf results for the "maint" branch on my notebook with the default "make-refperf-repo" arguments (times in seconds):

3.36  git branch (cold)
0.01  git branch (warm)
0.04  git for-each-ref
3.08  git checkout (cold)
0.01  git checkout (warm)
0.00  git checkout --orphan (warm)
0.15  git checkout from detached orphan
0.12  git pack-refs
1.17  git branch (cold)
0.00  git branch (warm)
0.17  git for-each-ref
0.95  git checkout (cold)
0.00  git checkout (warm)
0.00  git checkout --orphan (warm)
0.21  git checkout from detached orphan
0.18  git branch -a --contains
7.67  git clone
0.06  git fetch (nothing)
0.01  git pack-refs
0.05  git fetch (nothing, packed)
0.10  git clone of a ref-packed repo
0.63  git fetch (everything)

Probably we should test with even more references than this, but this test already shows that some commands are quite sluggish. There are some more things that could be added, like:

* Branches vs. annotated tags
* References on the tips of branches in a more typical "branchy" repository.
* git describe --all
* git log --decorate
* git gc
* git filter-branch (This has very different performance characteristics because it is a script that invokes git many times.)

I suggest that we try to do systematic benchmarking of any changes that we claim are performance optimizations and share before/after results in the cover letter for the patch series.

Michael

[1] branch hierarchical-refs at git://github.com/mhagger/git.git
[2] branch refperf at git://github.com/mhagger/git.git

-- Michael Haggerty mhagger@alum.mit.edu http://softwareswirl.blogspot.com/ ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: [PATCH v3] refs: Use binary search to lookup refs faster 2011-09-30 15:30 ` Michael Haggerty @ 2011-09-30 16:38 ` Junio C Hamano 2011-09-30 17:56 ` [PATCH] refs: Remove duplicates after sorting with qsort Julian Phillips 2011-10-02 5:15 ` [PATCH v3] refs: Use binary search to lookup refs faster Michael Haggerty 0 siblings, 2 replies; 126+ messages in thread From: Junio C Hamano @ 2011-09-30 16:38 UTC (permalink / raw) To: Michael Haggerty Cc: Julian Phillips, Martin Fick, Christian Couder, git, Christian Couder, Thomas Rast Michael Haggerty <mhagger@alum.mit.edu> writes: > On 09/30/2011 01:48 AM, Junio C Hamano wrote: >> This version looks sane, although I have a suspicion that it may have >> some interaction with what Michael may be working on. > > Indeed, I have almost equivalent changes in the giant patch series that > I am working on [1]. Good; that was the primary thing I wanted to know. I want to take Julian's patch early but if the approach and data structures were drastically different from what you are cooking, that would force unnecessary reroll on your part, which I wanted to avoid. Thanks. ^ permalink raw reply [flat|nested] 126+ messages in thread
* [PATCH] refs: Remove duplicates after sorting with qsort 2011-09-30 16:38 ` Junio C Hamano @ 2011-09-30 17:56 ` Julian Phillips 2011-10-02 5:15 ` [PATCH v3] refs: Use binary search to lookup refs faster Michael Haggerty 1 sibling, 0 replies; 126+ messages in thread From: Julian Phillips @ 2011-09-30 17:56 UTC (permalink / raw) To: Junio C Hamano Cc: Julian Phillips, Martin Fick, Christian Couder, git, Christian Couder, Thomas Rast, Michael Haggerty The previous custom merge sort would drop duplicate entries as part of the sort. It would also die if the duplicate entries had different sha1 values. The standard library qsort doesn't do this, so we have to do it manually afterwards. Signed-off-by: Julian Phillips <julian@quantumfyre.co.uk> --- On Fri, 30 Sep 2011 09:38:54 -0700, Junio C Hamano wrote: > Michael Haggerty <mhagger@alum.mit.edu> writes: > >> On 09/30/2011 01:48 AM, Junio C Hamano wrote: >>> This version looks sane, although I have a suspicion that it may >>> have >>> some interaction with what Michael may be working on. >> >> Indeed, I have almost equivalent changes in the giant patch series >> that >> I am working on [1]. > > Good; that was the primary thing I wanted to know. I want to take > Julian's patch early but if the approach and data structures were > drastically different from what you are cooking, that would force > unnecessary reroll on your part, which I wanted to avoid. > > Thanks. I had a quick look at Michael's code, and it reminded me that I had missed one thing out. If we want to keep the duplicate detection & removal from the original merge sort then this patch is needed on top of v3 of the binary search. Though I never could figure out how duplicate refs were supposed to appear ... I tested by editing packed-refs, but I assume that isn't "supported". refs.c | 22 ++++++++++++++++++++++ 1 files changed, 22 insertions(+), 0 deletions(-) diff --git a/refs.c b/refs.c index 4c01d79..cf080ee 100644 --- a/refs.c +++ b/refs.c @@ -77,7 +77,29 @@ static int ref_entry_cmp(const void *a, const void *b) static void sort_ref_array(struct ref_array *array) { + int i = 0, j = 1; + + /* Nothing to sort unless there are at least two entries */ + if (array->nr < 2) + return; + qsort(array->refs, array->nr, sizeof(*array->refs), ref_entry_cmp); + + /* Remove any duplicates from the ref_array */ + for (; j < array->nr; j++) { + struct ref_entry *a = array->refs[i]; + struct ref_entry *b = array->refs[j]; + if (!strcmp(a->name, b->name)) { + if (hashcmp(a->sha1, b->sha1)) + die("Duplicated ref, and SHA1s don't match: %s", + a->name); + warning("Duplicated ref: %s", a->name); + continue; + } + i++; + array->refs[i] = array->refs[j]; + } + array->nr = i + 1; } static struct ref_entry *search_ref_array(struct ref_array *array, const char *name) -- 1.7.6.1 ^ permalink raw reply related [flat|nested] 126+ messages in thread
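The pass added above is the classic in-place uniquification of a sorted array: after qsort() equal names are adjacent, so a trailing write index compacts the array in one O(n) scan, keeping the first occurrence of each name. Stripped of the patch's sha1 consistency check (warn on duplicates, die when the sha1s disagree), the shape is the following sketch, not git's code:

#include <stdio.h>
#include <string.h>

static int unique_names(const char **names, int nr)
{
	int i = 0, j;

	if (nr < 2)
		return nr;
	for (j = 1; j < nr; j++) {
		if (!strcmp(names[i], names[j]))
			continue;       /* drop duplicate of names[i] */
		names[++i] = names[j];  /* keep first occurrence only */
	}
	return i + 1;                   /* new, possibly smaller, nr  */
}

int main(void)
{
	/* already sorted, as after qsort() in sort_ref_array() */
	const char *names[] = { "a", "b", "b", "c", "c", "c" };
	int nr = unique_names(names, 6);
	for (int i = 0; i < nr; i++)
		printf("%s\n", names[i]);  /* prints: a b c */
	return 0;
}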
* Re: [PATCH v3] refs: Use binary search to lookup refs faster 2011-09-30 16:38 ` Junio C Hamano 2011-09-30 17:56 ` [PATCH] refs: Remove duplicates after sorting with qsort Julian Phillips @ 2011-10-02 5:15 ` Michael Haggerty 2011-10-02 5:45 ` Junio C Hamano 1 sibling, 1 reply; 126+ messages in thread From: Michael Haggerty @ 2011-10-02 5:15 UTC (permalink / raw) To: Junio C Hamano Cc: Julian Phillips, Martin Fick, Christian Couder, git, Christian Couder, Thomas Rast On 09/30/2011 06:38 PM, Junio C Hamano wrote: > Michael Haggerty <mhagger@alum.mit.edu> writes: > >> On 09/30/2011 01:48 AM, Junio C Hamano wrote: >>> This version looks sane, although I have a suspicion that it may have >>> some interaction with what Michael may be working on. >> >> Indeed, I have almost equivalent changes in the giant patch series that >> I am working on [1]. > > Good; that was the primary thing I wanted to know. I want to take > Julian's patch early but if the approach and data structures were > drastically different from what you are cooking, that would force > unnecessary reroll on your part, which I wanted to avoid. Um, well, my patch series includes the same changes that Julian's patch wants to introduce, but following lots of other changes, cleanups, documentation improvements, etc. Moreover, my patch series builds on mh/iterate-refs, with which Julian's patch conflicts. In other words, it would be a real mess to reroll my series on top of Julian's patch. (That is of course not to imply that I hold a mutex on refs.c.) Because it changes a data structure that is used throughout refs.c, Julian's patch changes a lot of lines of code. I think that the switch from linked list + linear search to array + binary search is a pretty obvious win in terms of code complexity and *potential* performance improvement, but empirically I haven't seen any claims that it brings performance improvements beyond "René's patch". (Though, honestly, I've lost track of which "René's patch" is being discussed and I don't see anything relevant in Junio's tree.) Intuitively, given that populating the reference cache involves O(N) I/O, speeding up lookups can only help if there are very many ref lookups within a single git invocation. I think we will get a bigger improvement by avoiding the reading of unneeded loose refs, reading them one subdirectory at a time instead of always reading them en masse. I wanted to reach that milestone before submitting my changes. My preference would be:

1. Merge jp/get-ref-dir-unsorted, perhaps even into maint. It is a simple, noninvasive, and obvious improvement and helps performance a lot in an important use case.

2. Hold off on merging Julian's binary search series for a while to give me a chance to avoid conflict hell.

3. Evaluate René's patch on its own merits; if it makes sense regardless of the binary search speedups, then it can be accepted independently to give most of the performance benefit already.

Are there any other patches in this area that I've forgotten? Michael -- Michael Haggerty mhagger@alum.mit.edu http://softwareswirl.blogspot.com/ ^ permalink raw reply [flat|nested] 126+ messages in thread
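Michael's intuition about O(N) I/O dominating can be made concrete with a toy measurement. With N refs cached, a linear lookup costs O(N) string compares and a binary search O(log N), so the lookup strategy only becomes visible when a single invocation performs many lookups; that is consistent with Martin's report elsewhere in the thread that the binary search patch shows no improvement over René's fix in whole-command benchmarks. A rough standalone micro-benchmark sketch (sizes and names are illustrative only, not git's code):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define NR      100000  /* refs held in the cache                 */
#define LOOKUPS 1000    /* lookups performed by one "invocation"  */

static int cmpstr(const void *a, const void *b)
{
	return strcmp(*(char *const *)a, *(char *const *)b);
}

static double seconds(void)
{
	return (double)clock() / CLOCKS_PER_SEC;
}

int main(void)
{
	char **names = malloc(NR * sizeof(*names));
	char buf[32], *key = buf;
	double t;
	int i, j;

	for (i = 0; i < NR; i++) {
		snprintf(buf, sizeof(buf), "refs/tags/r%06d", i);
		names[i] = strdup(buf);
	}
	qsort(names, NR, sizeof(*names), cmpstr);

	t = seconds();
	for (i = 0; i < LOOKUPS; i++) {
		snprintf(buf, sizeof(buf), "refs/tags/r%06d", rand() % NR);
		for (j = 0; j < NR; j++)         /* linear scan, O(N) */
			if (!strcmp(names[j], buf))
				break;
	}
	printf("linear:  %.3fs\n", seconds() - t);

	t = seconds();
	for (i = 0; i < LOOKUPS; i++) {
		snprintf(buf, sizeof(buf), "refs/tags/r%06d", rand() % NR);
		bsearch(&key, names, NR, sizeof(*names), cmpstr); /* O(log N) */
	}
	printf("bsearch: %.3fs\n", seconds() - t);
	return 0;
}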
* Re: [PATCH v3] refs: Use binary search to lookup refs faster 2011-10-02 5:15 ` [PATCH v3] refs: Use binary search to lookup refs faster Michael Haggerty @ 2011-10-02 5:45 ` Junio C Hamano 2011-10-04 20:58 ` Junio C Hamano 0 siblings, 1 reply; 126+ messages in thread From: Junio C Hamano @ 2011-10-02 5:45 UTC (permalink / raw) To: Michael Haggerty Cc: Julian Phillips, Martin Fick, Christian Couder, git, Christian Couder, Thomas Rast Michael Haggerty <mhagger@alum.mit.edu> writes: > Um, well, my patch series includes the same changes that Julian's wants > to introduce, but following lots of other changes, cleanups, > documentation improvements, etc. Moreover, my patch series builds on > mh/iterate-refs, with which Julian's patch conflicts. In other words, > it would be a real mess to reroll my series on top of Julian's patch. Conflicts during re-rolling were not something I was worried too much about---that is just a fact of life. We cannot easily resolve two topics that want to go in totally different directions, but we should be able to converge two topics that want to take the same approach in the end, especially when one is a subset of the other. ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: [PATCH v3] refs: Use binary search to lookup refs faster 2011-10-02 5:45 ` Junio C Hamano @ 2011-10-04 20:58 ` Junio C Hamano 0 siblings, 0 replies; 126+ messages in thread From: Junio C Hamano @ 2011-10-04 20:58 UTC (permalink / raw) To: Michael Haggerty Cc: Julian Phillips, Martin Fick, Christian Couder, git, Christian Couder, Thomas Rast Junio C Hamano <gitster@pobox.com> writes: > Michael Haggerty <mhagger@alum.mit.edu> writes: > >> Um, well, my patch series includes the same changes that Julian's wants >> to introduce, but following lots of other changes, cleanups, >> documentation improvements, etc. Moreover, my patch series builds on >> mh/iterate-refs, with which Julian's patch conflicts. In other words, >> it would be a real mess to reroll my series on top of Julian's patch. > > Conflicts during re-rolling was not something I was worried too much > about---that is just the fact of life. We cannot easily resolve two topics > that want to go in totally different direction, but we should be able to > converge two topics that want to take the same approach in the end, > especially one is a subset of the other. Ah, also I should have noted that I have a fix-up between mh/iterate-refs and Julian's patch already queued on 'pu'. I am planning to make mh/iterate-refs graduate to 'master' soonish, so hopefully things will become simpler. Thanks. ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: [PATCH v3] refs: Use binary search to lookup refs faster 2011-09-29 22:11 ` [PATCH v3] " Julian Phillips 2011-09-29 23:48 ` Junio C Hamano @ 2011-09-30 1:13 ` Martin Fick 2011-09-30 3:44 ` Junio C Hamano 1 sibling, 1 reply; 126+ messages in thread From: Martin Fick @ 2011-09-30 1:13 UTC (permalink / raw) To: Julian Phillips Cc: Junio C Hamano, Christian Couder, git, Christian Couder, Thomas Rast On Thursday, September 29, 2011 04:11:42 pm Julian Phillips wrote: > Currently we linearly search through lists of refs when > we need to find a specific ref. This can be very slow > if we need to lookup a large number of refs. By > changing to a binary search we can make this faster. > > In order to be able to use a binary search we need to > change from using linked lists to arrays, which we can > manage using ALLOC_GROW. > > We can now also use the standard library qsort function > to sort the refs arrays. > This works for me, however unfortunately, I cannot find any scenarios where it improves anything over the previous fix by René. :( I tested many things: clones, fetches, fetch no-ops, checkouts, garbage collection. I am a bit surprised, because I thought that my hash-map hack still improved checkouts on packed refs, but it could just be that my hack was buggy and did not actually do a full orphan check. Thanks, -Martin ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: [PATCH v3] refs: Use binary search to lookup refs faster 2011-09-30 1:13 ` Martin Fick @ 2011-09-30 3:44 ` Junio C Hamano 2011-09-30 8:04 ` Julian Phillips 2011-09-30 15:45 ` Martin Fick 0 siblings, 2 replies; 126+ messages in thread From: Junio C Hamano @ 2011-09-30 3:44 UTC (permalink / raw) To: Martin Fick Cc: Julian Phillips, Christian Couder, git, Christian Couder, Thomas Rast Martin Fick <mfick@codeaurora.org> writes: > This works for me, however unfortunately, I cannot find any > scenarios where it improves anything over the previous fix > by René. :( Nevertheless, I would appreciate it if you can try this _without_ René's patch. This attempts to make resolve_ref() cheap for _any_ caller. René's patch avoids calling it in one specific callchain. They address different issues. René's patch is probably an independently good change (I haven't thought about the interactions with the topics in flight and its implications on the future direction), but would not help other/new callers that make many calls to resolve_ref(). ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: [PATCH v3] refs: Use binary search to lookup refs faster 2011-09-30 3:44 ` Junio C Hamano @ 2011-09-30 8:04 ` Julian Phillips 2011-09-30 15:45 ` Martin Fick 1 sibling, 0 replies; 126+ messages in thread From: Julian Phillips @ 2011-09-30 8:04 UTC (permalink / raw) To: Junio C Hamano Cc: Martin Fick, Christian Couder, git, Christian Couder, Thomas Rast On Thu, 29 Sep 2011 20:44:40 -0700, Junio C Hamano wrote: > Martin Fick <mfick@codeaurora.org> writes: > >> This works for me, however unfortunately, I cannot find any >> scenarios where it improves anything over the previous fix >> by René. :( > > Nevertheless, I would appreciate it if you can try this _without_ > René's > patch. This attempts to make resolve_ref() cheap for _any_ caller. > René's > patch avoids calling it in one specific callchain. > > They address different issues. René's patch is probably an > independently > good change (I haven't thought about the interactions with the topics > in > flight and its implications on the future direction), but would not > help > other/new callers that make many calls to resolve_ref(). It certainly helps with my test repo (~140k refs, of which ~40k are branches). User times for checkout starting from an orphaned commit are: No fix : ~16m8s + Binary Search : ~4s + René's patch : ~2s (The 2s includes both patches, though the timing is the same for René's patch alone) -- Julian ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: [PATCH v3] refs: Use binary search to lookup refs faster 2011-09-30 3:44 ` Junio C Hamano 2011-09-30 8:04 ` Julian Phillips @ 2011-09-30 15:45 ` Martin Fick 1 sibling, 0 replies; 126+ messages in thread From: Martin Fick @ 2011-09-30 15:45 UTC (permalink / raw) To: Junio C Hamano Cc: Julian Phillips, Christian Couder, git, Christian Couder, Thomas Rast On Thursday, September 29, 2011 09:44:40 pm Junio C Hamano wrote: > Martin Fick <mfick@codeaurora.org> writes: > > This works for me, however unfortunately, I cannot find > > any scenarios where it improves anything over the > > previous fix by René. :( > > Nevertheless, I would appreciate it if you can try this > _without_ René's patch. This attempts to make > resolve_ref() cheap for _any_ caller. René's patch > avoids calling it in one specific callchain. > > They address different issues. René's patch is probably > an independently good change (I haven't thought about > the interactions with the topics in flight and its > implications on the future direction), but would not > help other/new callers that make many calls to > resolve_ref(). Agreed. Here is what I am seeing without René's patch. Checkout in a NON-packed-ref repo takes about 20s; with patch v3 of the binary search, it takes about 11s (1s slower than René's patch). Checkout in a packed-ref repo takes about 5:30min; with patch v3 of the binary search, it takes about 10s (also 1s slower than René's patch). I'd say that's not bad; it seems like the 1s difference comes from doing the search 60K+ times (my tests don't quite scan the full list), so the search seems to scale well with patch v3. -Martin -- Employee of Qualcomm Innovation Center, Inc. which is a member of Code Aurora Forum ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-29 19:10 ` Junio C Hamano 2011-09-29 4:18 ` [PATCH] refs: Use binary search to lookup refs faster Julian Phillips @ 2011-09-29 20:44 ` Martin Fick 1 sibling, 0 replies; 126+ messages in thread From: Martin Fick @ 2011-09-29 20:44 UTC (permalink / raw) To: Junio C Hamano Cc: René Scharfe, Julian Phillips, Christian Couder, git, Christian Couder, Thomas Rast On Thursday, September 29, 2011 01:10:06 pm Junio C Hamano wrote: > Also we would probably want to drop "next" field from > "struct ref_list" (i.e. making it not a linear list), > introduce a new "struct ref_array" that is a > ALLOC_GROW() managed array of pointers to "struct > ref_list", make get_packed_refs() and get_loose_refs() > return a pointer to "struct ref_array" after sorting the > array contents by "name". Then resolve_ref() can do a > bisection search in the packed refs array when it does > not find a loose ref. That would be nice, and I suspect it would shave a bit more off the orphan check and possibly even a fetch. If I understood all that, I might try. But I might need some hand-holding; my C is pretty rusty... Is there a bisection search library in git already to use? Is there a git sorting library for the array also? -Martin -- Employee of Qualcomm Innovation Center, Inc. which is a member of Code Aurora Forum ^ permalink raw reply [flat|nested] 126+ messages in thread
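For what it's worth, both building blocks Martin asks about already exist in the C standard library: qsort(3) for the sorting (which the v3 patch's commit message says it uses) and bsearch(3) for the bisection search. A minimal sketch of the ref_array idea Junio outlines, with the layout simplified (the real entries would embed the name and be grown with ALLOC_GROW):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-ins for the structures Junio describes. */
struct ref_entry {
	unsigned char sha1[20];
	const char *name;
};

struct ref_array {
	int nr, alloc;
	struct ref_entry **refs;
};

/* One comparator serves both qsort() and bsearch(); each is handed
 * pointers to array elements, which here are pointers to entries. */
static int ref_entry_cmp(const void *a, const void *b)
{
	struct ref_entry *one = *(struct ref_entry *const *)a;
	struct ref_entry *two = *(struct ref_entry *const *)b;
	return strcmp(one->name, two->name);
}

static void sort_ref_array(struct ref_array *array)
{
	qsort(array->refs, array->nr, sizeof(*array->refs), ref_entry_cmp);
}

static struct ref_entry *search_ref_array(struct ref_array *array, const char *name)
{
	struct ref_entry key, *kptr = &key, **found;

	key.name = name;
	found = bsearch(&kptr, array->refs, array->nr, sizeof(*array->refs), ref_entry_cmp);
	return found ? *found : NULL;
}

int main(void)
{
	struct ref_entry a = { { 0 }, "refs/heads/master" };
	struct ref_entry b = { { 0 }, "refs/tags/v1.0" };
	struct ref_entry *entries[] = { &b, &a };
	struct ref_array arr = { 2, 2, entries };

	sort_ref_array(&arr);
	printf("found: %s\n", search_ref_array(&arr, "refs/tags/v1.0")->name);
	return 0;
}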
* Re: Git is not scalable with too many refs/* 2011-09-29 18:27 ` René Scharfe 2011-09-29 19:10 ` Junio C Hamano @ 2011-09-29 19:10 ` Julian Phillips 2011-09-29 20:11 ` Martin Fick 2 siblings, 0 replies; 126+ messages in thread From: Julian Phillips @ 2011-09-29 19:10 UTC (permalink / raw) To: René Scharfe Cc: Martin Fick, Christian Couder, git, Christian Couder, Thomas Rast, Junio C Hamano On Thu, 29 Sep 2011 20:27:43 +0200, René Scharfe wrote: > Am 29.09.2011 04:19, schrieb Julian Phillips: >> Does the following help? >> >> diff --git a/builtin/checkout.c b/builtin/checkout.c >> index 5e356a6..f0f4ca1 100644 >> --- a/builtin/checkout.c >> +++ b/builtin/checkout.c >> @@ -605,7 +605,7 @@ static int add_one_ref_to_rev_list_arg(const >> char >> *refname, >> int flags, >> void *cb_data) >> { >> - add_one_rev_list_arg(cb_data, refname); >> + add_one_rev_list_arg(cb_data, strdup(sha1_to_hex(sha1))); >> return 0; >> } > > Hmm. Can we get rid of the multiple ref lookups fixed by the above > *and* the overhead of dealing with a textual argument list at the > same > time by calling add_pending_object directly, like this? (Factoring > out add_pending_sha1 should be a separate patch..) Seems like a good idea. I get the same sort of times as with my patch, but it makes the code _feel_ much nicer (and slightly smaller). Mine was definitely more of a "it's 2am, but I think the problem is here" type of patch ;) -- Julian ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-29 18:27 ` René Scharfe 2011-09-29 19:10 ` Junio C Hamano 2011-09-29 19:10 ` Julian Phillips @ 2011-09-29 20:11 ` Martin Fick 2011-09-30 9:12 ` René Scharfe 2 siblings, 1 reply; 126+ messages in thread From: Martin Fick @ 2011-09-29 20:11 UTC (permalink / raw) To: René Scharfe Cc: Julian Phillips, Christian Couder, git, Christian Couder, Thomas Rast, Junio C Hamano On Thursday, September 29, 2011 12:27:43 pm René Scharfe wrote: > Hmm. Can we get rid of the multiple ref lookups fixed by > the above *and* the overhead of dealing with a textual > argument list at the same time by calling > add_pending_object directly, like this? (Factoring out > add_pending_sha1 should be a separate patch..) René, Your patch works well for me. It achieves about the same gains as Julian's patch. Thanks! Once all these performance fixes for large ref counts get merged, it sure should help the Gerrit community. I wonder how it might impact Gerrit mirroring... -Martin Employee of Qualcomm Innovation Center, Inc. which is a member of Code Aurora Forum ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-29 20:11 ` Martin Fick @ 2011-09-30 9:12 ` René Scharfe 2011-09-30 16:09 ` Martin Fick 2011-09-30 16:52 ` Junio C Hamano 0 siblings, 2 replies; 126+ messages in thread From: René Scharfe @ 2011-09-30 9:12 UTC (permalink / raw) To: Martin Fick Cc: Julian Phillips, Christian Couder, git, Christian Couder, Thomas Rast, Junio C Hamano Hi Martin, Am 29.09.2011 22:11, schrieb Martin Fick: > Your patch works well for me. It achieves about the same > gains as Julian's patch. Thanks! OK, and what happens if you apply the following patch on top of my first one? It avoids going through all the refs a second time during cleanup, at the cost of going through the list of all known objects. I wonder if that's any faster in your case. Thanks, René diff --git a/builtin/checkout.c b/builtin/checkout.c index 84e0cdc..a4b1003 100644 --- a/builtin/checkout.c +++ b/builtin/checkout.c @@ -596,15 +596,14 @@ static int add_pending_uninteresting_ref(const char *refname, return 0; } -static int clear_commit_marks_from_one_ref(const char *refname, - const unsigned char *sha1, - int flags, - void *cb_data) +static void clear_commit_marks_for_all(unsigned int mark) { - struct commit *commit = lookup_commit_reference_gently(sha1, 1); - if (commit) - clear_commit_marks(commit, -1); - return 0; + unsigned int i, max = get_max_object_index(); + for (i = 0; i < max; i++) { + struct object *object = get_indexed_object(i); + if (object && object->type == OBJ_COMMIT) + object->flags &= ~mark; + } } static void describe_one_orphan(struct strbuf *sb, struct commit *commit) @@ -690,8 +689,7 @@ static void orphaned_commit_warning(struct commit *commit) else describe_detached_head(_("Previous HEAD position was"), commit); - clear_commit_marks(commit, -1); - for_each_ref(clear_commit_marks_from_one_ref, NULL); + clear_commit_marks_for_all(ALL_REV_FLAGS); } static int switch_branches(struct checkout_opts *opts, struct branch_info *new) ^ permalink raw reply related [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-30 9:12 ` René Scharfe @ 2011-09-30 16:09 ` Martin Fick 2011-09-30 16:52 ` Junio C Hamano 0 siblings, 0 replies; 126+ messages in thread From: Martin Fick @ 2011-09-30 16:09 UTC (permalink / raw) To: René Scharfe Cc: Julian Phillips, Christian Couder, git, Christian Couder, Thomas Rast, Junio C Hamano On Friday, September 30, 2011 03:12:08 am René Scharfe wrote: > OK, and what happens if you apply the following patch on > top of my first one? It avoids going through all the > refs a second time during cleanup, at the cost of going > through the list of all known objects. I wonder if > that's any faster in your case. This patch helps a bit more. It seems to shave about another .5s off in both the packed and non-packed cases, with or without the binary search. -Martin > diff --git a/builtin/checkout.c b/builtin/checkout.c > index 84e0cdc..a4b1003 100644 > --- a/builtin/checkout.c > +++ b/builtin/checkout.c > @@ -596,15 +596,14 @@ static int > add_pending_uninteresting_ref(const char *refname, > return 0; > } > > -static int clear_commit_marks_from_one_ref(const char > *refname, - const unsigned char *sha1, > - int flags, > - void *cb_data) > +static void clear_commit_marks_for_all(unsigned int > mark) { > - struct commit *commit = > lookup_commit_reference_gently(sha1, 1); - if (commit) > - clear_commit_marks(commit, -1); > - return 0; > + unsigned int i, max = get_max_object_index(); > + for (i = 0; i < max; i++) { > + struct object *object = get_indexed_object(i); > + if (object && object->type == OBJ_COMMIT) > + object->flags &= ~mark; > + } > } > > static void describe_one_orphan(struct strbuf *sb, > struct commit *commit) @@ -690,8 +689,7 @@ static void > orphaned_commit_warning(struct commit *commit) else > describe_detached_head(_("Previous HEAD position > was"), commit); > > - clear_commit_marks(commit, -1); > - for_each_ref(clear_commit_marks_from_one_ref, NULL); > + clear_commit_marks_for_all(ALL_REV_FLAGS); > } > > static int switch_branches(struct checkout_opts *opts, > struct branch_info *new) -- Employee of Qualcomm Innovation Center, Inc. which is a member of Code Aurora Forum ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-30 9:12 ` René Scharfe 2011-09-30 16:09 ` Martin Fick @ 2011-09-30 16:52 ` Junio C Hamano 2011-09-30 18:17 ` René Scharfe 1 sibling, 1 reply; 126+ messages in thread From: Junio C Hamano @ 2011-09-30 16:52 UTC (permalink / raw) To: René Scharfe Cc: Martin Fick, Julian Phillips, Christian Couder, git, Christian Couder, Thomas Rast, Junio C Hamano René Scharfe <rene.scharfe@lsrfire.ath.cx> writes: > Hi Martin, > > Am 29.09.2011 22:11, schrieb Martin Fick: >> Your patch works well for me. It achieves about the same >> gains as Julian's patch. Thanks! > > OK, and what happens if you apply the following patch on top of my first > one? It avoids going through all the refs a second time during cleanup, > at the cost of going through the list of all known objects. I wonder if > that's any faster in your case. > ... > static void describe_one_orphan(struct strbuf *sb, struct commit *commit) > @@ -690,8 +689,7 @@ static void orphaned_commit_warning(struct commit *commit) > else > describe_detached_head(_("Previous HEAD position was"), commit); > > - clear_commit_marks(commit, -1); > - for_each_ref(clear_commit_marks_from_one_ref, NULL); > + clear_commit_marks_for_all(ALL_REV_FLAGS); > } The function already clears all the flag bits from commits near the tips of all the refs (i.e. whatever commits it traverses until it gets to the fork point), so it cannot be reused in other contexts where the caller - first marks commit objects with some flag bits for its own purpose, unrelated to the "orphaned"-ness check; - calls this function to issue a warning; and then - uses the flag it set earlier to do something useful, which requires the "clean up after yourself, by clearing only the bits you used without disturbing other bits that you do not use" pattern. It might be a better solution to not bother to clear the marks at all; would it break anything in this codepath? ^ permalink raw reply [flat|nested] 126+ messages in thread
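To make the contract concrete, here is a hypothetical caller of such a warning helper; MY_MARK and everything else below is made up for illustration:

/* A caller that owns MY_MARK for its own traversal. If the warning
 * helper wipes every flag bit (ALL_REV_FLAGS) instead of only the
 * bits it set itself, step 3 below silently breaks. */
#define MY_MARK (1u << 10)		/* made-up flag bit */

struct commit { unsigned int flags; };

void example(struct commit *tip)
{
	tip->flags |= MY_MARK;		/* 1: mark for our own purpose */

	/* 2: the warning helper runs its own traversal here; it should
	 * clear only the bits *it* used on the way... */

	if (tip->flags & MY_MARK) {	/* 3: ...or this test is now broken */
		/* do something useful */
	}
}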
* Re: Git is not scalable with too many refs/* 2011-09-30 16:52 ` Junio C Hamano @ 2011-09-30 18:17 ` René Scharfe 2011-10-01 15:28 ` René Scharfe 0 siblings, 1 reply; 126+ messages in thread From: René Scharfe @ 2011-09-30 18:17 UTC (permalink / raw) To: Junio C Hamano Cc: Martin Fick, Julian Phillips, Christian Couder, git, Christian Couder, Thomas Rast Am 30.09.2011 18:52, schrieb Junio C Hamano: > René Scharfe <rene.scharfe@lsrfire.ath.cx> writes: > >> Hi Martin, >> >> Am 29.09.2011 22:11, schrieb Martin Fick: >>> Your patch works well for me. It achieves about the same >>> gains as Julian's patch. Thanks! >> >> OK, and what happens if you apply the following patch on top of my first >> one? It avoids going through all the refs a second time during cleanup, >> at the cost of going through the list of all known objects. I wonder if >> that's any faster in your case. >> ... >> static void describe_one_orphan(struct strbuf *sb, struct commit *commit) >> @@ -690,8 +689,7 @@ static void orphaned_commit_warning(struct commit *commit) >> else >> describe_detached_head(_("Previous HEAD position was"), commit); >> >> - clear_commit_marks(commit, -1); >> - for_each_ref(clear_commit_marks_from_one_ref, NULL); >> + clear_commit_marks_for_all(ALL_REV_FLAGS); >> } > > The function already clears all the flag bits from commits near the tips of > all the refs (i.e. whatever commits it traverses until it gets to the fork > point), so it cannot be reused in other contexts where the caller > > - first marks commit objects with some flag bits for its own purpose, > unrelated to the "orphaned"-ness check; > - calls this function to issue a warning; and then > - uses the flag it set earlier to do something useful, > > which requires the "clean up after yourself, by clearing only the bits you > used without disturbing other bits that you do not use" pattern. Yes, clear_commit_marks_for_all is a bit brutal. Callers could clear specific bits (e.g. SEEN|UNINTERESTING) instead of ALL_REV_FLAGS, though. > It might be a better solution to not bother to clear the marks at all; > would it break anything in this codepath? Unfortunately, yes; the cleanup part was added by 5c08dc48 later, when it became apparent that it's really needed. However, since the patch only buys us a 5% speedup I'm not sure it's worth it in its current form. René ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-30 18:17 ` René Scharfe @ 2011-10-01 15:28 ` René Scharfe 2011-10-01 15:38 ` [PATCH 1/8] checkout: check for "Previous HEAD" notice in t2020 René Scharfe ` (7 more replies) 0 siblings, 8 replies; 126+ messages in thread From: René Scharfe @ 2011-10-01 15:28 UTC (permalink / raw) To: Junio C Hamano Cc: Martin Fick, Julian Phillips, Christian Couder, git, Christian Couder, Thomas Rast Am 30.09.2011 20:17, schrieb René Scharfe: > Am 30.09.2011 18:52, schrieb Junio C Hamano: >> It might be a better solution to not bother to clear the marks at >> all; would it break anything in this codepath? > > Unfortunately, yes; the cleanup part was added by 5c08dc48 later, > when it became apparent that it's really needed. > > However, since the patch only buys us a 5% speedup I'm not sure it's > worth it in its current form. I found something better: a trick used by bisect and bundle. They copy the list of pending objects from rev_info before calling prepare_revision_walk and then go through it to clean up the commit marks without going through the refs again. And I think we can even improve it a little. The following patches tighten some orphan/detached head tests a little; then comes a resend of my first patch on this topic, only split up into two; then four patches that introduce the trick mentioned above (which could perhaps be squashed together); and the last one is a bonus refactoring patch. bisect.c | 20 +++++++------- builtin/checkout.c | 58 +++++++++++++------------------------------ bundle.c | 11 +++----- commit.c | 14 ++++++++++ commit.h | 1 + revision.c | 14 +++++++--- revision.h | 2 + t/t2020-checkout-detach.sh | 7 ++++- 8 files changed, 64 insertions(+), 63 deletions(-) René ^ permalink raw reply [flat|nested] 126+ messages in thread
* [PATCH 1/8] checkout: check for "Previous HEAD" notice in t2020 2011-10-01 15:28 ` René Scharfe @ 2011-10-01 15:38 ` René Scharfe 2011-10-01 19:02 ` Sverre Rabbelier 2011-10-01 15:43 ` [PATCH 2/8] revision: factor out add_pending_sha1 René Scharfe ` (6 subsequent siblings) 7 siblings, 1 reply; 126+ messages in thread From: René Scharfe @ 2011-10-01 15:38 UTC (permalink / raw) Cc: Junio C Hamano, Martin Fick, Julian Phillips, Christian Couder, git, Christian Couder, Thomas Rast If we leave a detached head, exactly one of two things happens: either checkout warns about it being an orphan or describes it as a courtesy. Test t2020 already checked that the warning is shown as needed. This patch also checks for the description. Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx> --- t/t2020-checkout-detach.sh | 7 +++++-- 1 files changed, 5 insertions(+), 2 deletions(-) diff --git a/t/t2020-checkout-detach.sh b/t/t2020-checkout-detach.sh index 2366f0f..068fba4 100755 --- a/t/t2020-checkout-detach.sh +++ b/t/t2020-checkout-detach.sh @@ -12,11 +12,14 @@ check_not_detached () { } ORPHAN_WARNING='you are leaving .* commit.*behind' +PREV_HEAD_DESC='Previous HEAD position was' check_orphan_warning() { - test_i18ngrep "$ORPHAN_WARNING" "$1" + test_i18ngrep "$ORPHAN_WARNING" "$1" && + test_i18ngrep ! "$PREV_HEAD_DESC" "$1" } check_no_orphan_warning() { - test_i18ngrep ! "$ORPHAN_WARNING" "$1" + test_i18ngrep ! "$ORPHAN_WARNING" "$1" && + test_i18ngrep "$PREV_HEAD_DESC" "$1" } reset () { -- 1.7.7 ^ permalink raw reply related [flat|nested] 126+ messages in thread
* Re: [PATCH 1/8] checkout: check for "Previous HEAD" notice in t2020 2011-10-01 15:38 ` [PATCH 1/8] checkout: check for "Previous HEAD" notice in t2020 René Scharfe @ 2011-10-01 19:02 ` Sverre Rabbelier 0 siblings, 0 replies; 126+ messages in thread From: Sverre Rabbelier @ 2011-10-01 19:02 UTC (permalink / raw) To: René Scharfe Cc: Junio C Hamano, Martin Fick, Julian Phillips, Christian Couder, git, Christian Couder, Thomas Rast Heya, On Sat, Oct 1, 2011 at 17:38, René Scharfe <rene.scharfe@lsrfire.ath.cx> wrote: > If we leave a detached head, exactly one of two things happens: either > checkout warns about it being an orphan or describes it as a courtesy. > Test t2020 already checked that the warning is shown as needed. This > patch also checks for the description. A cover letter would have been nice for such a long series :). -- Cheers, Sverre Rabbelier ^ permalink raw reply [flat|nested] 126+ messages in thread
* [PATCH 2/8] revision: factor out add_pending_sha1 2011-10-01 15:28 ` René Scharfe 2011-10-01 15:38 ` [PATCH 1/8] checkout: check for "Previous HEAD" notice in t2020 René Scharfe @ 2011-10-01 15:43 ` René Scharfe 2011-10-01 15:51 ` [PATCH 3/8] checkout: use add_pending_{object,sha1} in orphan check René Scharfe ` (5 subsequent siblings) 7 siblings, 0 replies; 126+ messages in thread From: René Scharfe @ 2011-10-01 15:43 UTC (permalink / raw) Cc: Junio C Hamano, Martin Fick, Julian Phillips, Christian Couder, git, Christian Couder, Thomas Rast This function is a combination of the static get_reference and add_pending_object. It can be used to easily queue objects by hash. Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx> --- The next patch is going to use it in checkout. revision.c | 11 ++++++++--- revision.h | 1 + 2 files changed, 9 insertions(+), 3 deletions(-) diff --git a/revision.c b/revision.c index c46cfaa..2e8aa33 100644 --- a/revision.c +++ b/revision.c @@ -185,6 +185,13 @@ static struct object *get_reference(struct rev_info *revs, const char *name, con return object; } +void add_pending_sha1(struct rev_info *revs, const char *name, + const unsigned char *sha1, unsigned int flags) +{ + struct object *object = get_reference(revs, name, sha1, flags); + add_pending_object(revs, object, name); +} + static struct commit *handle_commit(struct rev_info *revs, struct object *object, const char *name) { unsigned long flags = object->flags; @@ -832,9 +839,7 @@ struct all_refs_cb { static int handle_one_ref(const char *path, const unsigned char *sha1, int flag, void *cb_data) { struct all_refs_cb *cb = cb_data; - struct object *object = get_reference(cb->all_revs, path, sha1, - cb->all_flags); - add_pending_object(cb->all_revs, object, path); + add_pending_sha1(cb->all_revs, path, sha1, cb->all_flags); return 0; } diff --git a/revision.h b/revision.h index 3d64ada..4541265 100644 --- a/revision.h +++ b/revision.h @@ -191,6 +191,7 @@ extern void add_object(struct object *obj, const char *name); extern void add_pending_object(struct rev_info *revs, struct object *obj, const char *name); +extern void add_pending_sha1(struct rev_info *revs, const char *name, const unsigned char *sha1, unsigned int flags); extern void add_head_to_pending(struct rev_info *); -- 1.7.7 ^ permalink raw reply related [flat|nested] 126+ messages in thread
* [PATCH 3/8] checkout: use add_pending_{object,sha1} in orphan check 2011-10-01 15:28 ` René Scharfe 2011-10-01 15:38 ` [PATCH 1/8] checkout: check for "Previous HEAD" notice in t2020 René Scharfe 2011-10-01 15:43 ` [PATCH 2/8] revision: factor out add_pending_sha1 René Scharfe @ 2011-10-01 15:51 ` René Scharfe 2011-10-01 15:56 ` [PATCH 4/8] revision: add leak_pending flag René Scharfe ` (4 subsequent siblings) 7 siblings, 0 replies; 126+ messages in thread From: René Scharfe @ 2011-10-01 15:51 UTC (permalink / raw) Cc: Junio C Hamano, Martin Fick, Julian Phillips, Christian Couder, git, Christian Couder, Thomas Rast Instead of building a list of textual arguments for setup_revisions, use add_pending_object and add_pending_sha1 to queue the objects directly. This is both faster and simpler. Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx> --- builtin/checkout.c | 39 ++++++++++++--------------------------- 1 files changed, 12 insertions(+), 27 deletions(-) diff --git a/builtin/checkout.c b/builtin/checkout.c index 5e356a6..84e0cdc 100644 --- a/builtin/checkout.c +++ b/builtin/checkout.c @@ -588,24 +588,11 @@ static void update_refs_for_switch(struct checkout_opts *opts, report_tracking(new); } -struct rev_list_args { - int argc; - int alloc; - const char **argv; -}; - -static void add_one_rev_list_arg(struct rev_list_args *args, const char *s) -{ - ALLOC_GROW(args->argv, args->argc + 1, args->alloc); - args->argv[args->argc++] = s; -} - -static int add_one_ref_to_rev_list_arg(const char *refname, - const unsigned char *sha1, - int flags, - void *cb_data) +static int add_pending_uninteresting_ref(const char *refname, + const unsigned char *sha1, + int flags, void *cb_data) { - add_one_rev_list_arg(cb_data, refname); + add_pending_sha1(cb_data, refname, sha1, flags | UNINTERESTING); return 0; } @@ -685,19 +672,17 @@ static void suggest_reattach(struct commit *commit, struct rev_info *revs) */ static void orphaned_commit_warning(struct commit *commit) { - struct rev_list_args args = { 0, 0, NULL }; struct rev_info revs; - - add_one_rev_list_arg(&args, "(internal)"); - add_one_rev_list_arg(&args, sha1_to_hex(commit->object.sha1)); - add_one_rev_list_arg(&args, "--not"); - for_each_ref(add_one_ref_to_rev_list_arg, &args); - add_one_rev_list_arg(&args, "--"); - add_one_rev_list_arg(&args, NULL); + struct object *object = &commit->object; init_revisions(&revs, NULL); - if (setup_revisions(args.argc - 1, args.argv, &revs, NULL) != 1) - die(_("internal error: only -- alone should have been left")); + setup_revisions(0, NULL, &revs, NULL); + + object->flags &= ~UNINTERESTING; + add_pending_object(&revs, object, sha1_to_hex(object->sha1)); + + for_each_ref(add_pending_uninteresting_ref, &revs); + if (prepare_revision_walk(&revs)) die(_("internal error in revision walk")); if (!(commit->object.flags & UNINTERESTING)) -- 1.7.7 ^ permalink raw reply related [flat|nested] 126+ messages in thread
* [PATCH 4/8] revision: add leak_pending flag 2011-10-01 15:28 ` René Scharfe ` (2 preceding siblings ...) 2011-10-01 15:51 ` [PATCH 3/8] checkout: use add_pending_{object,sha1} in orphan check René Scharfe @ 2011-10-01 15:56 ` René Scharfe 2011-10-01 16:01 ` [PATCH 5/8] bisect: use " René Scharfe ` (3 subsequent siblings) 7 siblings, 0 replies; 126+ messages in thread From: René Scharfe @ 2011-10-01 15:56 UTC (permalink / raw) Cc: Junio C Hamano, Martin Fick, Julian Phillips, Christian Couder, git, Christian Couder, Thomas Rast The new flag leak_pending in struct rev_info can be used to prevent prepare_revision_walk from freeing the list of pending objects. It will still forget about them, so it really is leaked. This behaviour may look weird at first, but it can be useful if the pointer to the list is saved before calling prepare_revision_walk. Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx> --- The next three patches are going to use this flag. revision.c | 3 ++- revision.h | 1 + 2 files changed, 3 insertions(+), 1 deletions(-) diff --git a/revision.c b/revision.c index 2e8aa33..6d329b4 100644 --- a/revision.c +++ b/revision.c @@ -1974,7 +1974,8 @@ int prepare_revision_walk(struct rev_info *revs) } e++; } - free(list); + if (!revs->leak_pending) + free(list); if (revs->no_walk) return 0; diff --git a/revision.h b/revision.h index 4541265..366a9b4 100644 --- a/revision.h +++ b/revision.h @@ -97,6 +97,7 @@ struct rev_info { date_mode_explicit:1, preserve_subject:1; unsigned int disable_stdin:1; + unsigned int leak_pending:1; enum date_mode date_mode; -- 1.7.7 ^ permalink raw reply related [flat|nested] 126+ messages in thread
* [PATCH 5/8] bisect: use leak_pending flag 2011-10-01 15:28 ` René Scharfe ` (3 preceding siblings ...) 2011-10-01 15:56 ` [PATCH 4/8] revision: add leak_pending flag René Scharfe @ 2011-10-01 16:01 ` René Scharfe 2011-10-01 16:02 ` [PATCH 6/8] bundle: " René Scharfe ` (2 subsequent siblings) 7 siblings, 0 replies; 126+ messages in thread From: René Scharfe @ 2011-10-01 16:01 UTC (permalink / raw) Cc: Junio C Hamano, Martin Fick, Julian Phillips, Christian Couder, git, Christian Couder, Thomas Rast Instead of creating a copy of the list of pending objects, copy the struct object_array that points to it, turn on leak_pending, and thus cause prepare_revision_walk to leave it to us. And free it once we're done. Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx> --- bisect.c | 13 ++++++++----- 1 files changed, 8 insertions(+), 5 deletions(-) diff --git a/bisect.c b/bisect.c index c7b7d79..a05504f 100644 --- a/bisect.c +++ b/bisect.c @@ -831,12 +831,14 @@ static int check_ancestors(const char *prefix) bisect_rev_setup(&revs, prefix, "^%s", "%s", 0); /* Save pending objects, so they can be cleaned up later. */ - memset(&pending_copy, 0, sizeof(pending_copy)); - for (i = 0; i < revs.pending.nr; i++) - add_object_array(revs.pending.objects[i].item, - revs.pending.objects[i].name, - &pending_copy); + pending_copy = revs.pending; + revs.leak_pending = 1; + /* + * bisect_common calls prepare_revision_walk right away, which + * (together with .leak_pending = 1) makes us the sole owner of + * the list of pending objects. + */ bisect_common(&revs); res = (revs.commits != NULL); @@ -845,6 +847,7 @@ static int check_ancestors(const char *prefix) struct object *o = pending_copy.objects[i].item; clear_commit_marks((struct commit *)o, ALL_REV_FLAGS); } + free(pending_copy.objects); return res; } -- 1.7.7 ^ permalink raw reply related [flat|nested] 126+ messages in thread
* [PATCH 6/8] bundle: use leak_pending flag 2011-10-01 15:28 ` René Scharfe ` (4 preceding siblings ...) 2011-10-01 16:01 ` [PATCH 5/8] bisect: use " René Scharfe @ 2011-10-01 16:02 ` René Scharfe 2011-10-01 16:09 ` [PATCH 7/8] checkout: " René Scharfe 2011-10-01 16:16 ` [PATCH 8/8] commit: factor out clear_commit_marks_for_object_array René Scharfe 7 siblings, 0 replies; 126+ messages in thread From: René Scharfe @ 2011-10-01 16:02 UTC (permalink / raw) Cc: Junio C Hamano, Martin Fick, Julian Phillips, Christian Couder, git, Christian Couder, Thomas Rast Instead of creating a copy of the list of pending objects, copy the struct object_array that points to it, turn on leak_pending, and thus cause prepare_revision_walk to leave it to us. And free it once we're done. Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx> --- bundle.c | 8 +++----- 1 files changed, 3 insertions(+), 5 deletions(-) diff --git a/bundle.c b/bundle.c index f48fd7d..26cc9ab 100644 --- a/bundle.c +++ b/bundle.c @@ -122,11 +122,8 @@ int verify_bundle(struct bundle_header *header, int verbose) req_nr = revs.pending.nr; setup_revisions(2, argv, &revs, NULL); - memset(&refs, 0, sizeof(struct object_array)); - for (i = 0; i < revs.pending.nr; i++) { - struct object_array_entry *e = revs.pending.objects + i; - add_object_array(e->item, e->name, &refs); - } + refs = revs.pending; + revs.leak_pending = 1; if (prepare_revision_walk(&revs)) die("revision walk setup failed"); @@ -146,6 +143,7 @@ int verify_bundle(struct bundle_header *header, int verbose) for (i = 0; i < refs.nr; i++) clear_commit_marks((struct commit *)refs.objects[i].item, -1); + free(refs.objects); if (verbose) { struct ref_list *r; -- 1.7.7 ^ permalink raw reply related [flat|nested] 126+ messages in thread
* [PATCH 7/8] checkout: use leak_pending flag 2011-10-01 15:28 ` René Scharfe ` (5 preceding siblings ...) 2011-10-01 16:02 ` [PATCH 6/8] bundle: " René Scharfe @ 2011-10-01 16:09 ` René Scharfe 2011-10-01 16:16 ` [PATCH 8/8] commit: factor out clear_commit_marks_for_object_array René Scharfe 7 siblings, 0 replies; 126+ messages in thread From: René Scharfe @ 2011-10-01 16:09 UTC (permalink / raw) Cc: Junio C Hamano, Martin Fick, Julian Phillips, Christian Couder, git, Christian Couder, Thomas Rast Instead of going through all the references again when we clear the commit marks, do it like bisect and bundle and gain ownership of the list of pending objects which we constructed from those references. We simply copy the struct object_array that points to the list, set the flag leak_pending and then prepare_revision_walk won't destroy it and it's ours. We use it to clear the marks and free it at the end. Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx> --- builtin/checkout.c | 25 ++++++++++++------------- 1 files changed, 12 insertions(+), 13 deletions(-) diff --git a/builtin/checkout.c b/builtin/checkout.c index 84e0cdc..cfd7e59 100644 --- a/builtin/checkout.c +++ b/builtin/checkout.c @@ -596,17 +596,6 @@ static int add_pending_uninteresting_ref(const char *refname, return 0; } -static int clear_commit_marks_from_one_ref(const char *refname, - const unsigned char *sha1, - int flags, - void *cb_data) -{ - struct commit *commit = lookup_commit_reference_gently(sha1, 1); - if (commit) - clear_commit_marks(commit, -1); - return 0; -} - static void describe_one_orphan(struct strbuf *sb, struct commit *commit) { parse_commit(commit); @@ -674,6 +663,8 @@ static void orphaned_commit_warning(struct commit *commit) { struct rev_info revs; struct object *object = &commit->object; + struct object_array refs; + unsigned int i; init_revisions(&revs, NULL); setup_revisions(0, NULL, &revs, NULL); @@ -683,6 +674,9 @@ static void orphaned_commit_warning(struct commit *commit) for_each_ref(add_pending_uninteresting_ref, &revs); + refs = revs.pending; + revs.leak_pending = 1; + if (prepare_revision_walk(&revs)) die(_("internal error in revision walk")); if (!(commit->object.flags & UNINTERESTING)) @@ -690,8 +684,13 @@ static void orphaned_commit_warning(struct commit *commit) else describe_detached_head(_("Previous HEAD position was"), commit); - clear_commit_marks(commit, -1); - for_each_ref(clear_commit_marks_from_one_ref, NULL); + for (i = 0; i < refs.nr; i++) { + struct object *o = refs.objects[i].item; + struct commit *c = lookup_commit_reference_gently(o->sha1, 1); + if (c) + clear_commit_marks(c, ALL_REV_FLAGS); + } + free(refs.objects); } static int switch_branches(struct checkout_opts *opts, struct branch_info *new) -- 1.7.7 ^ permalink raw reply related [flat|nested] 126+ messages in thread
* [PATCH 8/8] commit: factor out clear_commit_marks_for_object_array 2011-10-01 15:28 ` René Scharfe ` (6 preceding siblings ...) 2011-10-01 16:09 ` [PATCH 7/8] checkout: " René Scharfe @ 2011-10-01 16:16 ` René Scharfe 7 siblings, 0 replies; 126+ messages in thread From: René Scharfe @ 2011-10-01 16:16 UTC (permalink / raw) Cc: Junio C Hamano, Martin Fick, Julian Phillips, Christian Couder, git, Christian Couder, Thomas Rast Factor out the code to clear the commit marks for a whole struct object_array from builtin/checkout.c into its own exported function clear_commit_marks_for_object_array and use it in bisect and bundle as well. It handles tags and commits and ignores objects of any other type. Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx> --- bisect.c | 7 ++----- builtin/checkout.c | 8 +------- bundle.c | 3 +-- commit.c | 14 ++++++++++++++ commit.h | 1 + 5 files changed, 19 insertions(+), 14 deletions(-) diff --git a/bisect.c b/bisect.c index a05504f..b4547b9 100644 --- a/bisect.c +++ b/bisect.c @@ -826,7 +826,7 @@ static int check_ancestors(const char *prefix) { struct rev_info revs; struct object_array pending_copy; - int i, res; + int res; bisect_rev_setup(&revs, prefix, "^%s", "%s", 0); @@ -843,10 +843,7 @@ static int check_ancestors(const char *prefix) res = (revs.commits != NULL); /* Clean up objects used, as they will be reused. */ - for (i = 0; i < pending_copy.nr; i++) { - struct object *o = pending_copy.objects[i].item; - clear_commit_marks((struct commit *)o, ALL_REV_FLAGS); - } + clear_commit_marks_for_object_array(&pending_copy, ALL_REV_FLAGS); free(pending_copy.objects); return res; diff --git a/builtin/checkout.c b/builtin/checkout.c index cfd7e59..683819b 100644 --- a/builtin/checkout.c +++ b/builtin/checkout.c @@ -664,7 +664,6 @@ static void orphaned_commit_warning(struct commit *commit) struct rev_info revs; struct object *object = &commit->object; struct object_array refs; - unsigned int i; init_revisions(&revs, NULL); setup_revisions(0, NULL, &revs, NULL); @@ -684,12 +683,7 @@ static void orphaned_commit_warning(struct commit *commit) else describe_detached_head(_("Previous HEAD position was"), commit); - for (i = 0; i < refs.nr; i++) { - struct object *o = refs.objects[i].item; - struct commit *c = lookup_commit_reference_gently(o->sha1, 1); - if (c) - clear_commit_marks(c, ALL_REV_FLAGS); - } + clear_commit_marks_for_object_array(&refs, ALL_REV_FLAGS); free(refs.objects); } diff --git a/bundle.c b/bundle.c index 26cc9ab..a8ea918 100644 --- a/bundle.c +++ b/bundle.c @@ -141,8 +141,7 @@ int verify_bundle(struct bundle_header *header, int verbose) refs.objects[i].name); } - for (i = 0; i < refs.nr; i++) - clear_commit_marks((struct commit *)refs.objects[i].item, -1); + clear_commit_marks_for_object_array(&refs, ALL_REV_FLAGS); free(refs.objects); if (verbose) { diff --git a/commit.c b/commit.c index 97b4327..50af007 100644 --- a/commit.c +++ b/commit.c @@ -430,6 +430,20 @@ void clear_commit_marks(struct commit *commit, unsigned int mark) } } +void clear_commit_marks_for_object_array(struct object_array *a, unsigned mark) +{ + struct object *object; + struct commit *commit; + unsigned int i; + + for (i = 0; i < a->nr; i++) { + object = a->objects[i].item; + commit = lookup_commit_reference_gently(object->sha1, 1); + if (commit) + clear_commit_marks(commit, mark); + } +} + struct commit *pop_commit(struct commit_list **stack) { struct commit_list *top = *stack; diff --git a/commit.h b/commit.h index 12d100b..0a4c730 100644 --- a/commit.h +++ b/commit.h @@ -126,6 +126,7 @@ struct commit *pop_most_recent_commit(struct commit_list **list, struct commit *pop_commit(struct commit_list **stack); void clear_commit_marks(struct commit *commit, unsigned int mark); +void clear_commit_marks_for_object_array(struct object_array *a, unsigned mark); /* * Performs an in-place topological sort of list supplied. -- 1.7.7 ^ permalink raw reply related [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-25 20:43 ` Martin Fick 2011-09-26 12:41 ` Christian Couder @ 2011-09-26 15:15 ` Martin Fick 2011-09-26 15:21 ` Sverre Rabbelier 2011-09-26 18:07 ` Martin Fick 2011-09-26 15:32 ` Michael Haggerty 2 siblings, 2 replies; 126+ messages in thread From: Martin Fick @ 2011-09-26 15:15 UTC (permalink / raw) To: git, Julian Phillips OK, I have found what I believe is another performance regression for large ref counts (~100K). When I run git br on my repo, which only has one branch but has ~100K refs under refs/changes (a Gerrit repo), it normally takes 3-6 mins depending on whether my caches are fresh or not. After bisecting some older changes, I noticed that this commit seems to be where things start to get slow: c774aab98ce6c5ef7aaacbef38da0a501eb671d4 commit c774aab98ce6c5ef7aaacbef38da0a501eb671d4 Author: Julian Phillips <julian@quantumfyre.co.uk> Date: Tue Apr 17 02:42:50 2007 +0100 refs.c: add a function to sort a ref list, rather then sorting on add Rather than sorting the refs list while building it, sort in one go after it is built using a merge sort. This has a large performance boost with large numbers of refs. It shouldn't happen that we read duplicate entries into the same list, but just in case sort_ref_list drops them if the SHA1s are the same, or dies, as we have no way of knowing which one is the correct one. Signed-off-by: Julian Phillips <julian@quantumfyre.co.uk> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <junkio@cox.net> which is a bit strange since that commit's purpose was to actually speed things up in the case of many refs. Just to verify, I reverted the commit on 1.7.7.rc0.73 and sure enough, things sped up, down to the 14-20s range depending on caching. If this change does not actually speed things up, should it be reverted? Or was there a bug in the change that makes it not do what it was supposed to do? Thanks, -Martin -- Employee of Qualcomm Innovation Center, Inc. which is a member of Code Aurora Forum ^ permalink raw reply [flat|nested] 126+ messages in thread
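For readers curious what the bisected commit's approach looks like, here is a hedged, self-contained sketch of a merge sort over a singly linked ref list; the struct is stripped down, and the real commit also drops duplicate entries during the merge, which this sketch omits:

#include <stdio.h>
#include <string.h>

/* Simplified ref list; the real one also carries a sha1 and flags. */
struct ref_list {
	struct ref_list *next;
	const char *name;
};

static struct ref_list *merge(struct ref_list *a, struct ref_list *b)
{
	struct ref_list head, *tail = &head;

	while (a && b) {
		if (strcmp(a->name, b->name) <= 0) {
			tail->next = a;
			a = a->next;
		} else {
			tail->next = b;
			b = b->next;
		}
		tail = tail->next;
	}
	tail->next = a ? a : b;
	return head.next;
}

/* Sort in one go after the list is built, instead of insertion-sorting
 * on every add: split at the middle, sort both halves, merge them. */
static struct ref_list *sort_ref_list(struct ref_list *list)
{
	struct ref_list *slow = list, *fast = list, *mid;

	if (!list || !list->next)
		return list;
	while (fast->next && fast->next->next) {
		slow = slow->next;
		fast = fast->next->next;
	}
	mid = slow->next;
	slow->next = NULL;
	return merge(sort_ref_list(list), sort_ref_list(mid));
}

int main(void)
{
	struct ref_list c = { NULL, "refs/tags/r3" };
	struct ref_list a = { &c, "refs/heads/master" };
	struct ref_list b = { &a, "refs/tags/r1" };
	struct ref_list *sorted;

	for (sorted = sort_ref_list(&b); sorted; sorted = sorted->next)
		printf("%s\n", sorted->name);
	return 0;
}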
* Re: Git is not scalable with too many refs/* 2011-09-26 15:15 ` Git is not scalable with too many refs/* Martin Fick @ 2011-09-26 15:21 ` Sverre Rabbelier 2011-09-26 15:48 ` Martin Fick 2011-09-26 18:07 ` Martin Fick 1 sibling, 1 reply; 126+ messages in thread From: Sverre Rabbelier @ 2011-09-26 15:21 UTC (permalink / raw) To: Martin Fick; +Cc: git, Julian Phillips Heya, On Mon, Sep 26, 2011 at 17:15, Martin Fick <mfick@codeaurora.org> wrote: > If this change does not actually speed things up, should it > be reverted? Or was there a bug in the change that makes it > not do what it was supposed to do? It probably looks at the refs in refs/changes while it shouldn't, hence worsening your performance compared to not looking at those refs. I assume that it does improve your situation if you have all those refs under say refs/heads. -- Cheers, Sverre Rabbelier ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-26 15:21 ` Sverre Rabbelier @ 2011-09-26 15:48 ` Martin Fick 2011-09-26 15:56 ` Sverre Rabbelier 0 siblings, 1 reply; 126+ messages in thread From: Martin Fick @ 2011-09-26 15:48 UTC (permalink / raw) To: Sverre Rabbelier; +Cc: git, Julian Phillips On Monday, September 26, 2011 09:21:30 am Sverre Rabbelier wrote: > Heya, > > On Mon, Sep 26, 2011 at 17:15, Martin Fick <mfick@codeaurora.org> wrote: > > If this change does not actually speed things up, > > should it be reverted? Or was there a bug in the > > change that makes it not do what it was supposed to > > do? > > It probably looks at the refs in refs/changes while it > shouldn't, hence worsening your performance compared to > not looking at those refs. I assume that it does improve > your situation if you have all those refs under say > refs/heads. Hmm, I was thinking that too, and I just did a test. Instead of storing the changes under refs/changes, I fetched them under refs/heads/changes and then ran git 1.7.6 and it took about 3 mins. Then, I ran the 1.7.7.rc0.73 with c774aab98ce6c5ef7aaacbef38da0a501eb671d4 reverted and it only took 13s! So, if this indeed tests what you were suggesting, I think it shows that even in the intended case this change slowed things down? -Martin -- Employee of Qualcomm Innovation Center, Inc. which is a member of Code Aurora Forum ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-26 15:48 ` Martin Fick @ 2011-09-26 15:56 ` Sverre Rabbelier 2011-09-26 16:38 ` Martin Fick 0 siblings, 1 reply; 126+ messages in thread From: Sverre Rabbelier @ 2011-09-26 15:56 UTC (permalink / raw) To: Martin Fick; +Cc: git, Julian Phillips Heya, On Mon, Sep 26, 2011 at 17:48, Martin Fick <mfick@codeaurora.org> wrote: > Hmm, I was thinking that too, and I just did a test. > > Instead of storing the changes under refs/changes, I fetched > them under refs/heads/changes and then ran git 1.7.6 and it > took about 3 mins. Then, I ran the 1.7.7.rc0.73 with > c774aab98ce6c5ef7aaacbef38da0a501eb671d4 reverted and it > only took 13s! So, if this indeed tests what you were > suggesting, I think it shows that even in the intended case > this change slowed things down? And if you run 1.7.7 without that commit reverted? -- Cheers, Sverre Rabbelier ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-26 15:56 ` Sverre Rabbelier @ 2011-09-26 16:38 ` Martin Fick 2011-09-26 16:49 ` Julian Phillips 0 siblings, 1 reply; 126+ messages in thread From: Martin Fick @ 2011-09-26 16:38 UTC (permalink / raw) To: Sverre Rabbelier; +Cc: git, Julian Phillips On Monday, September 26, 2011 09:56:50 am Sverre Rabbelier wrote: > Heya, > > On Mon, Sep 26, 2011 at 17:48, Martin Fick <mfick@codeaurora.org> wrote: > > Hmm, I was thinking that too, and I just did a test. > > > > Instead of storing the changes under refs/changes, I > > fetched them under refs/heads/changes and then ran git > > 1.7.6 and it took about 3 mins. Then, I ran the > > 1.7.7.rc0.73 with > > c774aab98ce6c5ef7aaacbef38da0a501eb671d4 reverted and > > it only took 13s! So, if this indeed tests what you > > were suggesting, I think it shows that even in the > > intended case this change slowed things down? > > And if you run 1.7.7 without that commit reverted? Sorry, I probably confused things by mentioning 1.7.6; the bad commit was way before that, back in the early 1.5 days... As for 1.7.7, I don't think that exists yet, so did you mean the 1.7.7.rc0.73 version that I mentioned above, without the revert? Strangely enough, that ends up being 1.7.7.rc0.72.g4b5ea. That is also slow with refs/heads/changes: > 3 mins. -Martin -- Employee of Qualcomm Innovation Center, Inc. which is a member of Code Aurora Forum ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-26 16:38 ` Martin Fick @ 2011-09-26 16:49 ` Julian Phillips 0 siblings, 0 replies; 126+ messages in thread From: Julian Phillips @ 2011-09-26 16:49 UTC (permalink / raw) To: Martin Fick; +Cc: Sverre Rabbelier, git On Mon, 26 Sep 2011 10:38:34 -0600, Martin Fick wrote: > On Monday, September 26, 2011 09:56:50 am Sverre Rabbelier > wrote: >> Heya, >> >> On Mon, Sep 26, 2011 at 17:48, Martin Fick > <mfick@codeaurora.org> wrote: >> > Hmm, I was thinking that too, and I just did a test. >> > >> > Instead of storing the changes under refs/changes, I >> > fetched them under refs/heads/changes and then ran git >> > 1.7.6 and it took about 3 mins. Then, I ran the >> > 1.7.7.rc0.73 with >> > c774aab98ce6c5ef7aaacbef38da0a501eb671d4 reverted and >> > it only took 13s! So, if this indeed tests what you >> > were suggesting, I think it shows that even in the >> > intended case this change slowed things down? >> >> And if you run 1.7.7 without that commit reverted? > > Sorry, I probably confused things by mentioning 1.7.6, the > bad commit was way before that early 1.5 days... > > As for 1.7.7, I don't think that exists yet, so did you mean > the 1.7.7.rc0.73 version that I mentioned above without the > revert? Strangely enough, that ends up being > 1.7.7.rc0.72.g4b5ea. That is also slow with > refs/heads/changes > 3mins. Hmm ... something interesting is going on. I created a little test repo with ~100k unpacked refs. I tried "time git branch" with three versions of git, and I got (hot cache times): git version 1.7.6.1: ~1.2s git version 1.7.7.rc3: ~1.2s git version 1.7.7.rc3.1.gbc93f: ~40s Where the third was with the commit reverted. That was almost 40s of 100% CPU - my poor laptop had to turn the fans up to noisy ... > -Martin -- Julian ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-26 15:15 ` Git is not scalable with too many refs/* Martin Fick 2011-09-26 15:21 ` Sverre Rabbelier @ 2011-09-26 18:07 ` Martin Fick 2011-09-26 18:37 ` Julian Phillips 1 sibling, 1 reply; 126+ messages in thread From: Martin Fick @ 2011-09-26 18:07 UTC (permalink / raw) To: git; +Cc: Julian Phillips On Monday, September 26, 2011 09:15:29 am Martin Fick wrote: > OK, I have found what I believe is another performance > regression for large ref counts (~100K). > > When I run git br on my repo, which only has one branch > but has ~100K refs under refs/changes (a Gerrit repo), it > normally takes 3-6 mins depending on whether my caches > are fresh or not. After bisecting some older changes, I > noticed that this commit seems to be where things start to > get slow: c774aab98ce6c5ef7aaacbef38da0a501eb671d4 > > > commit c774aab98ce6c5ef7aaacbef38da0a501eb671d4 > Author: Julian Phillips <julian@quantumfyre.co.uk> > Date: Tue Apr 17 02:42:50 2007 +0100 > > refs.c: add a function to sort a ref list, rather > then sorting on add > > Rather than sorting the refs list while building it, > sort in one > go after it is built using a merge sort. This has a > large > performance boost with large numbers of refs. > > It shouldn't happen that we read duplicate entries > into the same > list, but just in case sort_ref_list drops them if > the SHA1s are > the same, or dies, as we have no way of knowing which > one is the > correct one. > > Signed-off-by: Julian Phillips > <julian@quantumfyre.co.uk> > Acked-by: Linus Torvalds > <torvalds@linux-foundation.org> Signed-off-by: Junio C > Hamano <junkio@cox.net> > > > > which is a bit strange since that commit's purpose was to > actually speed things up in the case of many refs. Just > to verify, I reverted the commit on 1.7.7.rc0.73 and > sure enough, things sped up, down to the 14-20s range > depending on caching. > > If this change does not actually speed things up, should > it be reverted? Or was there a bug in the change that > makes it not do what it was supposed to do? Ahh, I think I have some more clues. So while this change does not speed things up for me normally, I found a case where it does! I set my .git/config to have [core] compression = 0 and ran git-gc on my repo. Now, with a modern git with this optimization in it (1.7.6, 1.7.7.rc0...), 'git branch' is almost instantaneous (.05s)! But, if I revert c774aa it takes > ~15s. So, it appears that this optimization is foiled by compression? In the case when this optimization helps, it saves about 15s; when it hurts (with compression), it seems to cost > 3 mins. I am not sure this optimization is worth it. Would there be some way for it to adjust to the repo conditions? Thanks, -Martin -- Employee of Qualcomm Innovation Center, Inc. which is a member of Code Aurora Forum ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-26 18:07 ` Martin Fick @ 2011-09-26 18:37 ` Julian Phillips 2011-09-26 20:01 ` Martin Fick 0 siblings, 1 reply; 126+ messages in thread From: Julian Phillips @ 2011-09-26 18:37 UTC (permalink / raw) To: Martin Fick; +Cc: git On Mon, 26 Sep 2011 12:07:52 -0600, Martin Fick wrote: -- snip -- > Ahh, I think I have some more clues. So while this change > does not speed things up for me normally, I found a case > where it does! I set my .git/config to have > > [core] > compression = 0 > > and ran git-gc on my repo. Now, with a modern git with this > optimization in it (1.7.6, 1.7.7.rc0...), 'git branch' is > almost instantaneous (.05s)! But, if I revert c774aa it > takes > ~15s. I don't understand this. I don't see why core.compression should have anything to do with refs ... > So, it appears that this optimization is foiled by > compression? In the case when this optimization helps, it > save about 15s, when it hurts (with compression), it seems > to cost > 3mins. I am not sure this optimization is worth > it? Would there be someway for it to adjust to the repo > conditions? Well, in the case I tried it was 1.2s vs 40s. It would seem that you have managed to find some corner case. It doesn't seem right to punish everyone who has large numbers of refs by making their commands take orders of magnitude longer to save one person 3m. Much better to find, understand and fix the actual cause. I really can't see what effect core.compression can have on loading the ref_list. Certainly the sort doesn't load anything from the object database. It would be really good to profile and find out what is taking all the time - I am assuming that the CPU is at 100% for the 3+ minutes? Random thought. What happens to the with compression case if you leave the commit in, but add a sleep(15) to the end of sort_refs_list? -- Julian ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-26 18:37 ` Julian Phillips @ 2011-09-26 20:01 ` Martin Fick 2011-09-26 20:07 ` Junio C Hamano 2011-09-26 20:28 ` Julian Phillips 1 sibling, 2 replies; 126+ messages in thread From: Martin Fick @ 2011-09-26 20:01 UTC (permalink / raw) To: Julian Phillips; +Cc: git On Monday, September 26, 2011 12:37:10 pm Julian Phillips wrote: > On Mon, 26 Sep 2011 12:07:52 -0600, Martin Fick wrote: > -- snip -- > > > Ahh, I think I have some more clues. So while this > > change does not speed things up for me normally, I > > found a case where it does! I set my .git/config to > > have > > > > [core] > > > > compression = 0 > > > > and ran git-gc on my repo. Now, with a modern git with > > this optimization in it (1.7.6, 1.7.7.rc0...), 'git > > branch' is almost instantaneous (.05s)! But, if I > > revert c774aa it takes > ~15s. > > I don't understand this. I don't see why > core.compression should have anything to do with refs > ... > > > So, it appears that this optimization is foiled by > > compression? In the case when this optimization helps, > > it saves about 15s; when it hurts (with compression), > > it seems to cost > 3 mins. I am not sure this > > optimization is worth it. Would there be some way for > > it to adjust to the repo conditions? > > Well, in the case I tried it was 1.2s vs 40s. It would > seem that you have managed to find some corner case. It > doesn't seem right to punish everyone who has large > numbers of refs by making their commands take orders of > magnitude longer to save one person 3m. Much better to > find, understand and fix the actual cause. I am not sure mine is the corner case; it is a real repo (albeit a Gerrit repo with strange refs/changes), while it sounds like yours is a test repo. It seems likely that whatever you did to create the test repo makes it perform well? I am also guessing that it is not the refs which are the problem but the objects, since the refs don't get compressed, do they? Does your repo have real data in it (not just 100K refs)? My repo compressed is about ~2G and uncompressed is ~1.1G. Yes, the compressed one is larger than the uncompressed one. Since the compressed repo above was larger, I thought that I should at least gc it. After git gc, it is ~1.1G, so it looks like the size difference was really because of not having gced it at first after fetching the 100K refs. After a gc, the repo does perform similarly to the uncompressed one (which was achieved via gc). After gc, it takes ~.05s to do a 'git branch' with 1.7.6 and git.1.7.7.rc0.72.g4b5ea. It also takes a bit more than 15s with the patch reverted. So it appears that compression is not likely the culprit, but rather the need to be gced. So, maybe you are correct, maybe my repo is the corner case? Is a repo which needs to be gced considered a corner case? Should git be able to detect that the repo is so in desperate need of gcing? Is it normal for git to need to gc right after a clone and then fetching ~100K refs? I am not sure what is right here; if this patch makes a repo which needs gcing degrade by 5 to 10 times more than the benefit it brings, it still seems questionable to me. > I really can't see what effect core.compression can have > on loading the ref_list. Certainly the sort doesn't > load anything from the object database. It would be > really good to profile and find out what is taking all > the time - I am assuming that the CPU is at 100% for the > 3+ minutes?
Yes, 100% CPU (I mostly run the tests at least twice and have 8G of RAM, so I think the entire repo gets cached). > Random thought. What happens to the with compression > case if you leave the commit in, but add a sleep(15) to > the end of sort_refs_list? Why, what are you thinking? Hmm, I am trying this on the non gced repo and it doesn't seem to be completing (no cpu usage)! It appears that perhaps it is being called many times (the sleeping would explain no cpu usage)?!? This could be a real problem, this should only get called once right? -Martin -- Employee of Qualcomm Innovation Center, Inc. which is a member of Code Aurora Forum ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-26 20:01 ` Martin Fick @ 2011-09-26 20:07 ` Junio C Hamano 2011-09-26 20:28 ` Julian Phillips 1 sibling, 0 replies; 126+ messages in thread From: Junio C Hamano @ 2011-09-26 20:07 UTC (permalink / raw) To: Martin Fick; +Cc: Julian Phillips, git Martin Fick <mfick@codeaurora.org> writes: > After a gc, the repo does perform the similar to the > uncompressed one (which was achieved via gc). After gc, it > takes ~.05s do to a 'git branch' with 1.7.6 and > git.1.7.7.rc0.72.g4b5ea. It also takes a bit more than 15s > with the patch reverted. So it appears that compression is > not likely the culprit, but rather the need to be gced. Isn't packing refs part of "gc" these days? ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-26 20:01 ` Martin Fick 2011-09-26 20:07 ` Junio C Hamano @ 2011-09-26 20:28 ` Julian Phillips 2011-09-26 21:39 ` Martin Fick 1 sibling, 1 reply; 126+ messages in thread From: Julian Phillips @ 2011-09-26 20:28 UTC (permalink / raw) To: Martin Fick; +Cc: git On Mon, 26 Sep 2011 14:01:38 -0600, Martin Fick wrote: -- snip -- > So, maybe you are correct, maybe my repo is the corner case? > Is a repo which needs to be gced considered a corner case? > Should git be able to detect that the repo is so in > desperate need of gcing? Is it normal for git to need to gc > right after a clone and then fetching ~100K refs? Were you 100k refs packed before the gc? If not, perhaps your refs are causing a lot of trouble for the merge sort? They will be written out sorted to the packed-refs file, so the merge sort won't have to do any real work when loading them after that... > I am not sure what is right here, if this patch makes a repo > which needs gcing degrade 5 to 10 times worse than the > benefit of this patch, it still seems questionable to me. Well - it does this _for your repo_, that doesn't automatically mean that it does generally, or frequently. For instance, none of my normal repos that have a lot of refs are Gerrit ones, and I wouldn't be surprised if they benefitted from the merge sort (assuming that I am right that the merge sort is taking a long time on your gerrit refs). Besides, you would be better off running gc, and thus getting the benefit too. >> Random thought. What happens to the with compression >> case if you leave the commit in, but add a sleep(15) to >> the end of sort_refs_list? > > Why, what are you thinking? Hmm, I am trying this on the > non gced repo and it doesn't seem to be completing (no cpu > usage)! It appears that perhaps it is being called many > times (the sleeping would explain no cpu usage)?!? This > could be a real problem, this should only get called once > right? I was just wondering if the time taken to get the refs was changing the interaction with something else. Not very likely, but ... I added a print statement, and it was called four times when I had unpacked refs, and once with packed. So, maybe you are hitting some nasty case with unpacked refs. If you use a print statement instead of a sleep, how many times does sort_refs_lists get called in your unpacked case? It may well also be worth calculating the time taken to do the sort. -- Julian ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-26 20:28 ` Julian Phillips @ 2011-09-26 21:39 ` Martin Fick 2011-09-26 21:52 ` Martin Fick 2011-09-26 22:30 ` Julian Phillips 0 siblings, 2 replies; 126+ messages in thread From: Martin Fick @ 2011-09-26 21:39 UTC (permalink / raw) To: Julian Phillips; +Cc: git On Monday, September 26, 2011 02:28:53 pm Julian Phillips wrote: > On Mon, 26 Sep 2011 14:01:38 -0600, Martin Fick wrote: > -- snip -- > > > So, maybe you are correct, maybe my repo is the corner > > case? Is a repo which needs to be gced considered a > > corner case? Should git be able to detect that the > > repo is so in desperate need of gcing? Is it normal > > for git to need to gc right after a clone and then > > fetching ~100K refs? > > Were you 100k refs packed before the gc? If not, perhaps > your refs are causing a lot of trouble for the merge > sort? They will be written out sorted to the > packed-refs file, so the merge sort won't have to do any > real work when loading them after that... I am not sure how to determine that (?), but I think they were packed. Under .git/objects/pack there were 2 large files, both close to 500MB. Those 2 files constituted most of the space in the repo (I was wrong about the repo sizes, that included the working dir, so think about half the quoted sizes for all of .git). So does that mean it is mostly packed? Aside from the pack and idx files, there was nothing else under the objects dir. After gcing, it is down to just one ~500MB pack file. > > I am not sure what is right here, if this patch makes a > > repo which needs gcing degrade 5 to 10 times worse > > than the benefit of this patch, it still seems > > questionable to me. > > Well - it does this _for your repo_, that doesn't > automatically mean that it does generally, or > frequently. Oh, def agreed! I just didn't want to discount it so quickly as being a corner case. > For instance, none of my normal repos that > have a lot of refs are Gerrit ones, and I wouldn't be > surprised if they benefitted from the merge sort > (assuming that I am right that the merge sort is taking > a long time on your gerrit refs). > > Besides, you would be better off running gc, and thus > getting the benefit too. Agreed, which is why I was asking if git should have noticed my "degenerate" case and auto gced? But hopefully, there is an actual bug here somewhere and we both will get to eat our cake. :) > >> Random thought. What happens to the with compression > >> case if you leave the commit in, but add a sleep(15) > >> to the end of sort_refs_list? > > > > Why, what are you thinking? Hmm, I am trying this on > > the non gced repo and it doesn't seem to be completing > > (no cpu usage)! It appears that perhaps it is being > > called many times (the sleeping would explain no cpu > > usage)?!? This could be a real problem, this should > > only get called once right? > > I was just wondering if the time taken to get the refs > was changing the interaction with something else. Not > very likely, but ... > > I added a print statement, and it was called four times > when I had unpacked refs, and once with packed. So, > maybe you are hitting some nasty case with unpacked > refs. If you use a print statement instead of a sleep, > how many times does sort_refs_lists get called in your > unpacked case? It may well also be worth calculating > the time taken to do the sort. In my case it was called 18785 times! Any other tests I should run? -Martin -- Employee of Qualcomm Innovation Center, Inc. 
which is a member of Code Aurora Forum ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-26 21:39 ` Martin Fick @ 2011-09-26 21:52 ` Martin Fick 2011-09-26 23:26 ` Julian Phillips 2011-09-26 22:30 ` Julian Phillips 1 sibling, 1 reply; 126+ messages in thread From: Martin Fick @ 2011-09-26 21:52 UTC (permalink / raw) To: Julian Phillips; +Cc: git On Monday, September 26, 2011 03:39:33 pm Martin Fick wrote: > On Monday, September 26, 2011 02:28:53 pm Julian Phillips > wrote: > > >> Random thought. What happens to the with > > >> compression case if you leave the commit in, but > > >> add a sleep(15) to the end of sort_refs_list? > > > > > > Why, what are you thinking? Hmm, I am trying this on > > > the non gced repo and it doesn't seem to be > > > completing (no cpu usage)! It appears that perhaps > > > it is being called many times (the sleeping would > > > explain no cpu usage)?!? This could be a real > > > problem, this should only get called once right? > > > > I was just wondering if the time taken to get the refs > > was changing the interaction with something else. Not > > very likely, but ... > > > > I added a print statement, and it was called four times > > when I had unpacked refs, and once with packed. So, > > maybe you are hitting some nasty case with unpacked > > refs. If you use a print statement instead of a sleep, > > how many times does sort_refs_lists get called in your > > unpacked case? It may well also be worth calculating > > the time taken to do the sort. > > In my case it was called 18785 times! Any other tests I > should run? Gerrit stores the changes in directories under refs/changes named after the last 2 digits of the change. Then under each change it stores each patchset. So it looks like this: refs/changes/dd/change_num/ps_num I noticed that: ls refs/changes/* | wc -l -> 18876 somewhat close, but not super close to 18785, I am not sure if that is a clue. It's almost like each change is causing a re-sort, -Martin -- Employee of Qualcomm Innovation Center, Inc. which is a member of Code Aurora Forum ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-26 21:52 ` Martin Fick @ 2011-09-26 23:26 ` Julian Phillips 2011-09-26 23:37 ` David Michael Barr ` (3 more replies) 0 siblings, 4 replies; 126+ messages in thread From: Julian Phillips @ 2011-09-26 23:26 UTC (permalink / raw) To: Martin Fick; +Cc: git On Mon, 26 Sep 2011 15:52:04 -0600, Martin Fick wrote: > On Monday, September 26, 2011 03:39:33 pm Martin Fick wrote: >> On Monday, September 26, 2011 02:28:53 pm Julian Phillips >> wrote: >> > >> Random thought. What happens to the with >> > >> compression case if you leave the commit in, but >> > >> add a sleep(15) to the end of sort_refs_list? >> > > >> > > Why, what are you thinking? Hmm, I am trying this on >> > > the non gced repo and it doesn't seem to be >> > > completing (no cpu usage)! It appears that perhaps >> > > it is being called many times (the sleeping would >> > > explain no cpu usage)?!? This could be a real >> > > problem, this should only get called once right? >> > >> > I was just wondering if the time taken to get the refs >> > was changing the interaction with something else. Not >> > very likely, but ... >> > >> > I added a print statement, and it was called four times >> > when I had unpacked refs, and once with packed. So, >> > maybe you are hitting some nasty case with unpacked >> > refs. If you use a print statement instead of a sleep, >> > how many times does sort_refs_lists get called in your >> > unpacked case? It may well also be worth calculating >> > the time taken to do the sort. >> >> In my case it was called 18785 times! Any other tests I >> should run? > > Gerrit stores the changes in directories under refs/changes > named after the last 2 digits of the change. Then under > each change it stores each patchset. So it looks like this: > refs/changes/dd/change_num/ps_num > > I noticed that: > > ls refs/changes/* | wc -l > -> 18876 > > somewhat close, but not super close to 18785, I am not sure > if that is a clue. It's almost like each change is causing > a re-sort, basically, it is ... Back when I made that change, I failed to notice that get_ref_dir was recursive for subdirectories ... sorry ... Hopefully this should speed things up. My test repo went from ~17m user time, to ~2.5s. Packing still make things much faster of course. diff --git a/refs.c b/refs.c index a615043..212e7ec 100644 --- a/refs.c +++ b/refs.c @@ -319,7 +319,7 @@ static struct ref_list *get_ref_dir(const char *submodule, c free(ref); closedir(dir); } - return sort_ref_list(list); + return list; } struct warn_if_dangling_data { @@ -361,11 +361,13 @@ static struct ref_list *get_loose_refs(const char *submodu if (submodule) { free_ref_list(submodule_refs.loose); submodule_refs.loose = get_ref_dir(submodule, "refs", NULL); + submodule_refs.loose = sort_refs_list(submodule_refs.loose); return submodule_refs.loose; } if (!cached_refs.did_loose) { cached_refs.loose = get_ref_dir(NULL, "refs", NULL); + cached_refs.loose = sort_refs_list(cached_refs.loose); cached_refs.did_loose = 1; } return cached_refs.loose; > > > -Martin -- Julian ^ permalink raw reply related [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-26 23:26 ` Julian Phillips @ 2011-09-26 23:37 ` David Michael Barr 2011-09-27 1:01 ` [PATCH] refs.c: Fix slowness with numerous loose refs David Barr 2011-09-26 23:38 ` Git is not scalable with too many refs/* Junio C Hamano ` (2 subsequent siblings) 3 siblings, 1 reply; 126+ messages in thread From: David Michael Barr @ 2011-09-26 23:37 UTC (permalink / raw) To: Julian Phillips; +Cc: Martin Fick, git On Tue, Sep 27, 2011 at 9:26 AM, Julian Phillips <julian@quantumfyre.co.uk> wrote: > > On Mon, 26 Sep 2011 15:52:04 -0600, Martin Fick wrote: >> >> On Monday, September 26, 2011 03:39:33 pm Martin Fick wrote: >>> >>> On Monday, September 26, 2011 02:28:53 pm Julian Phillips >>> wrote: >>> > >> Random thought. What happens to the with >>> > >> compression case if you leave the commit in, but >>> > >> add a sleep(15) to the end of sort_refs_list? >>> > > >>> > > Why, what are you thinking? Hmm, I am trying this on >>> > > the non gced repo and it doesn't seem to be >>> > > completing (no cpu usage)! It appears that perhaps >>> > > it is being called many times (the sleeping would >>> > > explain no cpu usage)?!? This could be a real >>> > > problem, this should only get called once right? >>> > >>> > I was just wondering if the time taken to get the refs >>> > was changing the interaction with something else. Not >>> > very likely, but ... >>> > >>> > I added a print statement, and it was called four times >>> > when I had unpacked refs, and once with packed. So, >>> > maybe you are hitting some nasty case with unpacked >>> > refs. If you use a print statement instead of a sleep, >>> > how many times does sort_refs_lists get called in your >>> > unpacked case? It may well also be worth calculating >>> > the time taken to do the sort. >>> >>> In my case it was called 18785 times! Any other tests I >>> should run? >> >> Gerrit stores the changes in directories under refs/changes >> named after the last 2 digits of the change. Then under >> each change it stores each patchset. So it looks like this: >> refs/changes/dd/change_num/ps_num >> >> I noticed that: >> >> ls refs/changes/* | wc -l >> -> 18876 >> >> somewhat close, but not super close to 18785, I am not sure >> if that is a clue. It's almost like each change is causing >> a re-sort, > > basically, it is ... > > Back when I made that change, I failed to notice that get_ref_dir was recursive for subdirectories ... sorry ... > > Hopefully this should speed things up. My test repo went from ~17m user time, to ~2.5s. > Packing still make things much faster of course. 
> > diff --git a/refs.c b/refs.c > index a615043..212e7ec 100644 > --- a/refs.c > +++ b/refs.c > @@ -319,7 +319,7 @@ static struct ref_list *get_ref_dir(const char *submodule, c > free(ref); > closedir(dir); > } > - return sort_ref_list(list); > + return list; > } > > struct warn_if_dangling_data { > @@ -361,11 +361,13 @@ static struct ref_list *get_loose_refs(const char *submodu > if (submodule) { > free_ref_list(submodule_refs.loose); > submodule_refs.loose = get_ref_dir(submodule, "refs", NULL); > + submodule_refs.loose = sort_refs_list(submodule_refs.loose); > return submodule_refs.loose; > } > > if (!cached_refs.did_loose) { > cached_refs.loose = get_ref_dir(NULL, "refs", NULL); > + cached_refs.loose = sort_refs_list(cached_refs.loose); > cached_refs.did_loose = 1; > } > return cached_refs.loose; > > > >> >> >> -Martin > > -- > Julian > -- > To unsubscribe from this list: send the line "unsubscribe git" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html Well done! I'll try to compose a patch attributed to Julian with the information from this thread. -- David Barr ^ permalink raw reply [flat|nested] 126+ messages in thread
* [PATCH] refs.c: Fix slowness with numerous loose refs 2011-09-26 23:37 ` David Michael Barr @ 2011-09-27 1:01 ` David Barr 2011-09-27 2:04 ` David Michael Barr 0 siblings, 1 reply; 126+ messages in thread From: David Barr @ 2011-09-27 1:01 UTC (permalink / raw) To: Git Mailing List; +Cc: Julian Phillips, Martin Fick, Junio C Hamano, David Barr Martin Fick reported: OK, I have found what I believe is another performance regression for large ref counts (~100K). When I run git br on my repo which only has one branch, but has ~100K refs under ref/changes (a gerrit repo), it takes normally 3-6mins depending on whether my caches are fresh or not. After bisecting some older changes, I noticed that this ref seems to be where things start to get slow: v1.5.2-rc0~21^2 (refs.c: add a function to sort a ref list, rather then sorting on add) (Julian Phillips, Apr 17, 2007) Martin Fick observed that sort_refs_lists() was called almost as many times as there were loose refs. Julian Phillips commented: Back when I made that change, I failed to notice that get_ref_dir was recursive for subdirectories ... sorry ... Hopefully this should speed things up. My test repo went from ~17m user time, to ~2.5s. Packing still make things much faster of course. Martin Fick acked: Excellent! This works (almost, in my refs.c it is called sort_ref_list, not sort_refs_list). So, on the non garbage collected repo, git branch now takes ~.5s, and in the garbage collected one it takes only ~.05s! [db: summarised transcript, rewrote patch to fix callee not callers] [attn jch: patch applies to maint] Analyzed-by: Martin Fick <mfick@codeaurora.org> Inspired-by: Julian Phillips <julian@quantumfyre.co.uk> Acked-by: Martin Fick <mfick@codeaurora.org> Signed-off-by: David Barr <davidbarr@google.com> --- refs.c | 14 ++++++++++---- 1 files changed, 10 insertions(+), 4 deletions(-) diff --git a/refs.c b/refs.c index 4c1fd47..e40a09c 100644 --- a/refs.c +++ b/refs.c @@ -255,8 +255,8 @@ static struct ref_list *get_packed_refs(const char *submodule) return refs->packed; } -static struct ref_list *get_ref_dir(const char *submodule, const char *base, - struct ref_list *list) +static struct ref_list *walk_ref_dir(const char *submodule, const char *base, + struct ref_list *list) { DIR *dir; const char *path; @@ -299,7 +299,7 @@ static struct ref_list *get_ref_dir(const char *submodule, const char *base, if (stat(refdir, &st) < 0) continue; if (S_ISDIR(st.st_mode)) { - list = get_ref_dir(submodule, ref, list); + list = walk_ref_dir(submodule, ref, list); continue; } if (submodule) { @@ -319,7 +319,13 @@ static struct ref_list *get_ref_dir(const char *submodule, const char *base, free(ref); closedir(dir); } - return sort_ref_list(list); + return list; +} + +static struct ref_list *get_ref_dir(const char *submodule, const char *base, + struct ref_list *list) +{ + return sort_ref_list(walk_ref_dir(submodule, base, list)); } struct warn_if_dangling_data { -- 1.7.5.75.g69330 ^ permalink raw reply related [flat|nested] 126+ messages in thread
* Re: [PATCH] refs.c: Fix slowness with numerous loose refs 2011-09-27 1:01 ` [PATCH] refs.c: Fix slowness with numerous loose refs David Barr @ 2011-09-27 2:04 ` David Michael Barr 0 siblings, 0 replies; 126+ messages in thread From: David Michael Barr @ 2011-09-27 2:04 UTC (permalink / raw) To: Git Mailing List Cc: Julian Phillips, Martin Fick, Junio C Hamano, David Barr, Shawn O. Pearce +cc Shawn O. Pearce I used the following to generate a test repo shaped like a gerrit mirror with unpacked refs (10k, because life is too short for 100k tests): cd test.git git init touch empty git add empty git commit -m 'empty' REV=`git rev-parse HEAD` for ((d=0;d<100;++d)); do for ((n=0;n<100;++n)); do let r=n*100+d mkdir -p .git/refs/changes/$d/$r echo $REV > .git/refs/changes/$d/$r/1 done done time git branch xyz With warm caches... Git 1.7.6.4: real 0m8.232s user 0m7.842s sys 0m0.385s Git 1.7.6.4, with patch below: real 0m0.394s user 0m0.069s sys 0m0.324s On Tue, Sep 27, 2011 at 11:01 AM, David Barr <davidbarr@google.com> wrote: > Martin Fick reported: > OK, I have found what I believe is another performance > regression for large ref counts (~100K). > > When I run git br on my repo which only has one branch, but > has ~100K refs under ref/changes (a gerrit repo), it takes > normally 3-6mins depending on whether my caches are fresh or > not. After bisecting some older changes, I noticed that > this ref seems to be where things start to get slow: > v1.5.2-rc0~21^2 (refs.c: add a function to sort a ref list, > rather then sorting on add) (Julian Phillips, Apr 17, 2007) > > Martin Fick observed that sort_refs_lists() was called almost > as many times as there were loose refs. > > Julian Phillips commented: > Back when I made that change, I failed to notice that get_ref_dir > was recursive for subdirectories ... sorry ... > > Hopefully this should speed things up. My test repo went from > ~17m user time, to ~2.5s. > Packing still make things much faster of course. > > Martin Fick acked: > Excellent! This works (almost, in my refs.c it is called > sort_ref_list, not sort_refs_list). So, on the non garbage > collected repo, git branch now takes ~.5s, and in the > garbage collected one it takes only ~.05s! 
> > [db: summarised transcript, rewrote patch to fix callee not callers] > > [attn jch: patch applies to maint] > > Analyzed-by: Martin Fick <mfick@codeaurora.org> > Inspired-by: Julian Phillips <julian@quantumfyre.co.uk> > Acked-by: Martin Fick <mfick@codeaurora.org> > Signed-off-by: David Barr <davidbarr@google.com> > --- > refs.c | 14 ++++++++++---- > 1 files changed, 10 insertions(+), 4 deletions(-) > > diff --git a/refs.c b/refs.c > index 4c1fd47..e40a09c 100644 > --- a/refs.c > +++ b/refs.c > @@ -255,8 +255,8 @@ static struct ref_list *get_packed_refs(const char *submodule) > return refs->packed; > } > > -static struct ref_list *get_ref_dir(const char *submodule, const char *base, > - struct ref_list *list) > +static struct ref_list *walk_ref_dir(const char *submodule, const char *base, > + struct ref_list *list) > { > DIR *dir; > const char *path; > @@ -299,7 +299,7 @@ static struct ref_list *get_ref_dir(const char *submodule, const char *base, > if (stat(refdir, &st) < 0) > continue; > if (S_ISDIR(st.st_mode)) { > - list = get_ref_dir(submodule, ref, list); > + list = walk_ref_dir(submodule, ref, list); > continue; > } > if (submodule) { > @@ -319,7 +319,13 @@ static struct ref_list *get_ref_dir(const char *submodule, const char *base, > free(ref); > closedir(dir); > } > - return sort_ref_list(list); > + return list; > +} > + > +static struct ref_list *get_ref_dir(const char *submodule, const char *base, > + struct ref_list *list) > +{ > + return sort_ref_list(walk_ref_dir(submodule, base, list)); > } > > struct warn_if_dangling_data { > -- > 1.7.5.75.g69330 > > -- David Barr | Software Engineer | davidbarr@google.com | 614-3438-8348 ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-26 23:26 ` Julian Phillips 2011-09-26 23:37 ` David Michael Barr @ 2011-09-26 23:38 ` Junio C Hamano 2011-09-27 0:00 ` [PATCH] Don't sort ref_list too early Julian Phillips 2011-09-27 0:12 ` Git is not scalable with too many refs/* Martin Fick 2011-09-27 8:20 ` Sverre Rabbelier 3 siblings, 1 reply; 126+ messages in thread From: Junio C Hamano @ 2011-09-26 23:38 UTC (permalink / raw) To: Julian Phillips; +Cc: Martin Fick, git Julian Phillips <julian@quantumfyre.co.uk> writes: > Back when I made that change, I failed to notice that get_ref_dir was > recursive for subdirectories ... sorry ... Aha, I also was blind while I was watching this discussion from the sideline, and I thought I re-read the codepath involved X-<. Indeed we were sorting the list way too early and the patch looks correct. Thanks. > Hopefully this should speed things up. My test repo went from ~17m > user time, to ~2.5s. > Packing still make things much faster of course. > > diff --git a/refs.c b/refs.c > index a615043..212e7ec 100644 > --- a/refs.c > +++ b/refs.c > @@ -319,7 +319,7 @@ static struct ref_list *get_ref_dir(const char > *submodule, c > free(ref); > closedir(dir); > } > - return sort_ref_list(list); > + return list; > } > > struct warn_if_dangling_data { > @@ -361,11 +361,13 @@ static struct ref_list *get_loose_refs(const > char *submodu > if (submodule) { > free_ref_list(submodule_refs.loose); > submodule_refs.loose = get_ref_dir(submodule, "refs", > NULL); > + submodule_refs.loose = > sort_refs_list(submodule_refs.loose); > return submodule_refs.loose; > } > > if (!cached_refs.did_loose) { > cached_refs.loose = get_ref_dir(NULL, "refs", NULL); > + cached_refs.loose = sort_refs_list(cached_refs.loose); > cached_refs.did_loose = 1; > } > return cached_refs.loose; > > > >> >> >> -Martin ^ permalink raw reply [flat|nested] 126+ messages in thread
* [PATCH] Don't sort ref_list too early 2011-09-26 23:38 ` Git is not scalable with too many refs/* Junio C Hamano @ 2011-09-27 0:00 ` Julian Phillips 2011-10-02 4:58 ` Michael Haggerty 0 siblings, 1 reply; 126+ messages in thread From: Julian Phillips @ 2011-09-27 0:00 UTC (permalink / raw) To: Junio C Hamano; +Cc: Martin Fick, git get_ref_dir is called recursively for subdirectories, which means that we were calling sort_ref_list for each directory of refs instead of once for all the refs. This is a massive wast of processing, so now just call sort_ref_list on the result of the top-level get_ref_dir, so that the sort is only done once. In the common case of only a few different directories of refs the difference isn't very noticable, but it becomes very noticeable when you have a large number of direcotries containing refs (e.g. as created by Gerrit). Reported by Martin Fick. Signed-off-by: Julian Phillips <julian@quantumfyre.co.uk> --- This time the typos are fixed too ... perhaps I wrote the original commit at 1am too ... :$ refs.c | 4 +++- 1 files changed, 3 insertions(+), 1 deletions(-) diff --git a/refs.c b/refs.c index a615043..a49ff74 100644 --- a/refs.c +++ b/refs.c @@ -319,7 +319,7 @@ static struct ref_list *get_ref_dir(const char *submodule, const char *base, free(ref); closedir(dir); } - return sort_ref_list(list); + return list; } struct warn_if_dangling_data { @@ -361,11 +361,13 @@ static struct ref_list *get_loose_refs(const char *submodule) if (submodule) { free_ref_list(submodule_refs.loose); submodule_refs.loose = get_ref_dir(submodule, "refs", NULL); + submodule_refs.loose = sort_ref_list(submodule_refs.loose); return submodule_refs.loose; } if (!cached_refs.did_loose) { cached_refs.loose = get_ref_dir(NULL, "refs", NULL); + cached_refs.loose = sort_ref_list(cached_refs.loose); cached_refs.did_loose = 1; } return cached_refs.loose; -- 1.7.6.1 ^ permalink raw reply related [flat|nested] 126+ messages in thread
* Re: [PATCH] Don't sort ref_list too early 2011-09-27 0:00 ` [PATCH] Don't sort ref_list too early Julian Phillips @ 2011-10-02 4:58 ` Michael Haggerty 0 siblings, 0 replies; 126+ messages in thread From: Michael Haggerty @ 2011-10-02 4:58 UTC (permalink / raw) To: Julian Phillips; +Cc: Junio C Hamano, Martin Fick, git On 09/27/2011 02:00 AM, Julian Phillips wrote: > get_ref_dir is called recursively for subdirectories, which means that > we were calling sort_ref_list for each directory of refs instead of > once for all the refs. This is a massive wast of processing, so now > just call sort_ref_list on the result of the top-level get_ref_dir, so > that the sort is only done once. +1 I think this patch should also be considered for maint, since it is noninvasive and fixes a bad performance regression. Michael -- Michael Haggerty mhagger@alum.mit.edu http://softwareswirl.blogspot.com/ ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-26 23:26 ` Julian Phillips 2011-09-26 23:37 ` David Michael Barr 2011-09-26 23:38 ` Git is not scalable with too many refs/* Junio C Hamano @ 2011-09-27 0:12 ` Martin Fick 2011-09-27 0:22 ` Julian Phillips 2011-09-27 8:20 ` Sverre Rabbelier 3 siblings, 1 reply; 126+ messages in thread From: Martin Fick @ 2011-09-27 0:12 UTC (permalink / raw) To: Julian Phillips; +Cc: git On Monday, September 26, 2011 05:26:55 pm Julian Phillips wrote: > On Mon, 26 Sep 2011 15:52:04 -0600, Martin Fick wrote: > > On Monday, September 26, 2011 03:39:33 pm Martin Fick wrote: > >> On Monday, September 26, 2011 02:28:53 pm Julian > >> Phillips > >> In my case it was called 18785 times! Any other tests > >> I should run? > > > > Gerrit stores the changes in directories under > > refs/changes named after the last 2 digits of the > > change. Then under each change it stores each > > patchset. So it looks like this: > > refs/changes/dd/change_num/ps_num > > > > I noticed that: > > ls refs/changes/* | wc -l > > -> 18876 > > > > somewhat close, but not super close to 18785, I am not > > sure if that is a clue. It's almost like each change > > is causing a re-sort, > > basically, it is ... > > Back when I made that change, I failed to notice that > get_ref_dir was recursive for subdirectories ... sorry > ... > > Hopefully this should speed things up. My test repo went > from ~17m user time, to ~2.5s. > Packing still make things much faster of course. Excellent! This works (almost, in my refs.c it is called sort_ref_list, not sort_refs_list). So, on the non garbage collected repo, git branch now takes ~.5s, and in the garbage collected one it takes only ~.05s! Thanks way much!!! -Martin -- Employee of Qualcomm Innovation Center, Inc. which is a member of Code Aurora Forum ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-27 0:12 ` Git is not scalable with too many refs/* Martin Fick @ 2011-09-27 0:22 ` Julian Phillips 2011-09-27 2:34 ` Martin Fick 0 siblings, 1 reply; 126+ messages in thread From: Julian Phillips @ 2011-09-27 0:22 UTC (permalink / raw) To: Martin Fick; +Cc: git On Mon, 26 Sep 2011 18:12:31 -0600, Martin Fick wrote: > On Monday, September 26, 2011 05:26:55 pm Julian Phillips > wrote: -- snip -- >> Back when I made that change, I failed to notice that >> get_ref_dir was recursive for subdirectories ... sorry >> ... >> >> Hopefully this should speed things up. My test repo went >> from ~17m user time, to ~2.5s. >> Packing still make things much faster of course. > > Excellent! This works (almost, in my refs.c it is called > sort_ref_list, not sort_refs_list). Yeah, in mine too ;) It's late and I got the compile/send mail sequence backwards. :$ It's fixed in the proper patch email. > So, on the non garbage > collected repo, git branch now takes ~.5s, and in the > garbage collected one it takes only ~.05s! That sounds a lot better. Hopefully other commands should be faster now too. > Thanks way much!!! No problem. Thank you for all the time you've put in to help chase this down. Makes it so much easier when the person with original problem mucks in with the investigation. Just think how much time you've saved for anyone with a large number of those Gerrit change refs ;) -- Julian ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-27 0:22 ` Julian Phillips @ 2011-09-27 2:34 ` Martin Fick 2011-09-27 7:59 ` Julian Phillips 0 siblings, 1 reply; 126+ messages in thread From: Martin Fick @ 2011-09-27 2:34 UTC (permalink / raw) To: Julian Phillips; +Cc: git > Julian Phillips <julian@quantumfyre.co.uk> wrote: >On Mon, 26 Sep 2011 18:12:31 -0600, Martin Fick wrote: >That sounds a lot better. Hopefully other commands should be faster >now too. Yeah, I will try this in a few other places to see. >> Thanks way much!!! > >No problem. Thank you for all the time you've put in to help chase >this down. Makes it so much easier when the person with original >problem mucks in with the investigation. >Just think how much time you've saved for anyone with a large number of > >those Gerrit change refs ;) Perhaps this is a naive question, but why are all these refs being put into a list to be sorted, only to be discarded soon thereafter anyway? After all, git branch knows that it isn't going to print these, and the refs are stored precategorized, so why not only grab the refs which matter upfront? -Martin ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-27 2:34 ` Martin Fick @ 2011-09-27 7:59 ` Julian Phillips 0 siblings, 0 replies; 126+ messages in thread From: Julian Phillips @ 2011-09-27 7:59 UTC (permalink / raw) To: Martin Fick; +Cc: git On Mon, 26 Sep 2011 20:34:02 -0600, Martin Fick wrote: >> Julian Phillips <julian@quantumfyre.co.uk> wrote: >>On Mon, 26 Sep 2011 18:12:31 -0600, Martin Fick wrote: >>That sounds a lot better. Hopefully other commands should be faster >>now too. > > Yeah, I will try this in a few other places to see. > >>> Thanks way much!!! >> >>No problem. Thank you for all the time you've put in to help chase >>this down. Makes it so much easier when the person with original >>problem mucks in with the investigation. >>Just think how much time you've saved for anyone with a large number >> of >> >>those Gerrit change refs ;) > > Perhaps this is a naive question, but why are all these refs being > put into a list to be sorted, only to be discarded soon thereafter > anyway? After all, git branch knows that it isn't going to print > these, and the refs are stored precategorized, so why not only grab > the refs which matter upfront? I can't say that I am aware of a specific decision having been taken on the subject, but I'll have a guess at the reason: The extra code it would take to have an API for getting a list of only a subset of the refs has never been considered worth the cost. It would take effort to implement, test and maintain - and it would have to be done separately for packed and unpacked cases to avoid still loading and discarding unwanted refs. All that to not do something that no-one has noticed taking any time? Until now, I doubt anyone has considered it something that was a problem - and now that even with 100k refs it takes less than a second, I doubt anyone will feel all that inclined to have a crack at it now either. -- Julian ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-26 23:26 ` Julian Phillips ` (2 preceding siblings ...) 2011-09-27 0:12 ` Git is not scalable with too many refs/* Martin Fick @ 2011-09-27 8:20 ` Sverre Rabbelier 2011-09-27 9:01 ` Julian Phillips 3 siblings, 1 reply; 126+ messages in thread From: Sverre Rabbelier @ 2011-09-27 8:20 UTC (permalink / raw) To: Julian Phillips; +Cc: Martin Fick, git, Junio C Hamano, David Michael Barr Heya, On Tue, Sep 27, 2011 at 01:26, Julian Phillips <julian@quantumfyre.co.uk> wrote: > Back when I made that change, I failed to notice that get_ref_dir was > recursive for subdirectories ... sorry ... > > Hopefully this should speed things up. My test repo went from ~17m user > time, to ~2.5s. > Packing still make things much faster of course. Can we perhaps also have some tests to prevent this from happening again? -- Cheers, Sverre Rabbelier ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-27 8:20 ` Sverre Rabbelier @ 2011-09-27 9:01 ` Julian Phillips 2011-09-27 10:01 ` Sverre Rabbelier 2011-09-27 11:07 ` Michael Haggerty 0 siblings, 2 replies; 126+ messages in thread From: Julian Phillips @ 2011-09-27 9:01 UTC (permalink / raw) To: Sverre Rabbelier; +Cc: Martin Fick, git, Junio C Hamano, David Michael Barr On Tue, 27 Sep 2011 10:20:29 +0200, Sverre Rabbelier wrote: > Heya, > > On Tue, Sep 27, 2011 at 01:26, Julian Phillips > <julian@quantumfyre.co.uk> wrote: >> Back when I made that change, I failed to notice that get_ref_dir >> was >> recursive for subdirectories ... sorry ... >> >> Hopefully this should speed things up. My test repo went from ~17m >> user >> time, to ~2.5s. >> Packing still make things much faster of course. > > Can we perhaps also have some tests to prevent this from happening > again? Um ... any suggestion what to test? It has to be hot-cache, otherwise time taken to read the refs from disk will mean that it is always slow. On my Mac it seems to _always_ be slow reading the refs from disk, so even the "fast" case still takes ~17m. Also, what counts as ok, and what as broken? -- Julian ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-27 9:01 ` Julian Phillips @ 2011-09-27 10:01 ` Sverre Rabbelier 2011-09-27 10:25 ` Nguyen Thai Ngoc Duy 2011-09-27 11:07 ` Michael Haggerty 1 sibling, 1 reply; 126+ messages in thread From: Sverre Rabbelier @ 2011-09-27 10:01 UTC (permalink / raw) To: Julian Phillips; +Cc: Martin Fick, git, Junio C Hamano, David Michael Barr Heya, On Tue, Sep 27, 2011 at 11:01, Julian Phillips <julian@quantumfyre.co.uk> wrote: > It has to be hot-cache, otherwise time taken to read the refs from disk will > mean that it is always slow. On my Mac it seems to _always_ be slow reading > the refs from disk, so even the "fast" case still takes ~17m. Ah, that seems unfortunate. Not sure how to test it then. -- Cheers, Sverre Rabbelier ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-27 10:01 ` Sverre Rabbelier @ 2011-09-27 10:25 ` Nguyen Thai Ngoc Duy 0 siblings, 0 replies; 126+ messages in thread From: Nguyen Thai Ngoc Duy @ 2011-09-27 10:25 UTC (permalink / raw) To: Sverre Rabbelier Cc: Julian Phillips, Martin Fick, git, Junio C Hamano, David Michael Barr On Tue, Sep 27, 2011 at 8:01 PM, Sverre Rabbelier <srabbelier@gmail.com> wrote: > Heya, > > On Tue, Sep 27, 2011 at 11:01, Julian Phillips <julian@quantumfyre.co.uk> wrote: >> It has to be hot-cache, otherwise time taken to read the refs from disk will >> mean that it is always slow. On my Mac it seems to _always_ be slow reading >> the refs from disk, so even the "fast" case still takes ~17m. > > Ah, that seems unfortunate. Not sure how to test it then. If you care about performance, a perf test suite could be made, perhaps as a separate project. The output would be charts or spreadsheets, that interesting parties can look at and point out regressions. We may start with a set of common used operations. -- Duy ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-27 9:01 ` Julian Phillips 2011-09-27 10:01 ` Sverre Rabbelier @ 2011-09-27 11:07 ` Michael Haggerty 2011-09-27 12:10 ` Julian Phillips 1 sibling, 1 reply; 126+ messages in thread From: Michael Haggerty @ 2011-09-27 11:07 UTC (permalink / raw) To: Julian Phillips Cc: Sverre Rabbelier, Martin Fick, git, Junio C Hamano, David Michael Barr On 09/27/2011 11:01 AM, Julian Phillips wrote: > It has to be hot-cache, otherwise time taken to read the refs from disk > will mean that it is always slow. On my Mac it seems to _always_ be > slow reading the refs from disk, so even the "fast" case still takes ~17m. This case should be helped by lazy-loading of loose references, which I am working on. So if you develop some benchmarking code, it would help me with my work. Michael -- Michael Haggerty mhagger@alum.mit.edu http://softwareswirl.blogspot.com/ ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-27 11:07 ` Michael Haggerty @ 2011-09-27 12:10 ` Julian Phillips 0 siblings, 0 replies; 126+ messages in thread From: Julian Phillips @ 2011-09-27 12:10 UTC (permalink / raw) To: Michael Haggerty Cc: Sverre Rabbelier, Martin Fick, git, Junio C Hamano, David Michael Barr [-- Attachment #1: Type: text/plain, Size: 1249 bytes --] On Tue, 27 Sep 2011 13:07:15 +0200, Michael Haggerty wrote: > On 09/27/2011 11:01 AM, Julian Phillips wrote: >> It has to be hot-cache, otherwise time taken to read the refs from >> disk >> will mean that it is always slow. On my Mac it seems to _always_ be >> slow reading the refs from disk, so even the "fast" case still takes >> ~17m. > > This case should be helped by lazy-loading of loose references, which > I > am working on. So if you develop some benchmarking code, it would > help > me with my work. The attached script creates the repo structure I was testing with ... If you create a repo with 100k refs it takes quite a while to read the refs from disk. If you are lazy-loading then it should take practically no time, since the only interesting ref is refs/heads/master. The following is the hot-cache timing for "./refs-stress c 40000", with the sorting patch applied (wasn't prepared to wait for numbers with 100k refs). jp3@rayne: refs>(cd c; time ~/misc/git/git/git branch) * master real 0m0.885s user 0m0.161s sys 0m0.722s After doing "rm -rf c/.git/refs/changes/*", I get: jp3@rayne: refs>(cd c; time ~/misc/git/git/git branch) * master real 0m0.004s user 0m0.001s sys 0m0.002s -- Julian [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: refs-stress --] [-- Type: text/x-java; name=refs-stress, Size: 1406 bytes --] #!/usr/bin/env python import os import random import subprocess import sys def die(msg): print >> sys.stderr, msg sys.exit(1) def new_ref(a, b, commit): d = ".git/refs/changes/%d/%d" % (a, b) if not os.path.exists(d): os.makedirs(d) e = 1 p = "%s/%d" % (d, e) while os.path.exists(p): e += 1 p = "%s/%d" % (d, e) f = open(p, "w") f.write(commit) f.close() def make_refs(count, commit): while count > 0: sys.stdout.write("left: %d%s\r" % (count, " " * 30)) a = random.randrange(10, 30) b = random.randrange(10000, 50000) new_ref(a, b, commit) count -= 1 print "refs complete" def main(): if len(sys.argv) != 3: die("usage: %s <name> <ref count>" % sys.argv[0]) _, name, refs = sys.argv os.mkdir(name) os.chdir(name) if subprocess.call(["git", "init"]) != 0: die("failed to init repo") f = open("foobar.txt", "w") f.write("%s: %s refs\n" % (name, refs)) f.close() if subprocess.call(["git", "add", "foobar.txt"]) != 0: die("failed to add foobar.txt") if subprocess.call(["git", "commit", "-m", "inital commit"]) != 0: die("failed to create initial commit") commit = subprocess.check_output(["git", "show-ref", "-s", "master"]).strip() make_refs(int(refs), commit) if __name__ == "__main__": main() ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-26 21:39 ` Martin Fick 2011-09-26 21:52 ` Martin Fick @ 2011-09-26 22:30 ` Julian Phillips 1 sibling, 0 replies; 126+ messages in thread From: Julian Phillips @ 2011-09-26 22:30 UTC (permalink / raw) To: Martin Fick; +Cc: git On Mon, 26 Sep 2011 15:39:33 -0600, Martin Fick wrote: > On Monday, September 26, 2011 02:28:53 pm Julian Phillips > wrote: >> On Mon, 26 Sep 2011 14:01:38 -0600, Martin Fick wrote: >> -- snip -- >> >> > So, maybe you are correct, maybe my repo is the corner >> > case? Is a repo which needs to be gced considered a >> > corner case? Should git be able to detect that the >> > repo is so in desperate need of gcing? Is it normal >> > for git to need to gc right after a clone and then >> > fetching ~100K refs? >> >> Were you 100k refs packed before the gc? If not, perhaps >> your refs are causing a lot of trouble for the merge >> sort? They will be written out sorted to the >> packed-refs file, so the merge sort won't have to do any >> real work when loading them after that... > > I am not sure how to determine that (?), but I think they > were packed. Under .git/objects/pack there were 2 large > files, both close to 500MB. Those 2 files constituted most > of the space in the repo (I was wrong about the repo sizes, > that included the working dir, so think about half the > quoted sizes for all of .git). So does that mean it is > mostly packed? Aside from the pack and idx files, there was > nothing else under the objects dir. After gcing, it is down > to just one ~500MB pack file. If refs are listed under .git/refs/... they are unpacked, if they are listed in .git/packed-refs they are packed. They can be in both if updated since the last pack. >> > I am not sure what is right here, if this patch makes a >> > repo which needs gcing degrade 5 to 10 times worse >> > than the benefit of this patch, it still seems >> > questionable to me. >> >> Well - it does this _for your repo_, that doesn't >> automatically mean that it does generally, or >> frequently. > > Oh, def agreed! I just didn't want to discount it so quickly > as being a corner case. > > >> For instance, none of my normal repos that >> have a lot of refs are Gerrit ones, and I wouldn't be >> surprised if they benefitted from the merge sort >> (assuming that I am right that the merge sort is taking >> a long time on your gerrit refs). >> >> Besides, you would be better off running gc, and thus >> getting the benefit too. > > Agreed, which is why I was asking if git should have noticed > my "degenerate" case and auto gced? But hopefully, there is > an actual bug here somewhere and we both will get to eat our > cake. :) I think automatic gc is currently only triggered by unpacked objects, not unpacked refs ... perhaps the auto-gc should cover refs too? >> >> Random thought. What happens to the with compression >> >> case if you leave the commit in, but add a sleep(15) >> >> to the end of sort_refs_list? >> > >> > Why, what are you thinking? Hmm, I am trying this on >> > the non gced repo and it doesn't seem to be completing >> > (no cpu usage)! It appears that perhaps it is being >> > called many times (the sleeping would explain no cpu >> > usage)?!? This could be a real problem, this should >> > only get called once right? >> >> I was just wondering if the time taken to get the refs >> was changing the interaction with something else. Not >> very likely, but ... 
>> >> I added a print statement, and it was called four times >> when I had unpacked refs, and once with packed. So, >> maybe you are hitting some nasty case with unpacked >> refs. If you use a print statement instead of a sleep, >> how many times does sort_refs_lists get called in your >> unpacked case? It may well also be worth calculating >> the time taken to do the sort. > > In my case it was called 18785 times! Any other tests I > should run? That's a lot of sorts. I really can't see why there would need to be more than one ... I've created a new test repo, using a more complicated method to construct the 100k refs, and it took ~40m to run "git branch" instead of the 1.2s for the previous repo. So, I think the ref naming pattern used by Gerrit is definitely triggering something odd. However, progress is a bit slow - now that it takes over 1/2 an hour to try things out ... -- Julian ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-25 20:43 ` Martin Fick 2011-09-26 12:41 ` Christian Couder 2011-09-26 15:15 ` Git is not scalable with too many refs/* Martin Fick @ 2011-09-26 15:32 ` Michael Haggerty 2011-09-26 15:42 ` Martin Fick 2 siblings, 1 reply; 126+ messages in thread From: Michael Haggerty @ 2011-09-26 15:32 UTC (permalink / raw) To: Martin Fick; +Cc: git, Christian Couder On 09/25/2011 10:43 PM, Martin Fick wrote: > A coworker of mine pointed out to me that a simple > > git checkout > > can also take rather long periods of time > 3 mins when run > on a repo with ~100K refs. > > While this is not massive like the other problem I reported, > it still seems like it is more than one would expect. So, I > tried an older version of git, and to my surprise/delight, > it was much faster (.2s). So, I bisected this issue also, > and it seems that the "offending" commit is > 680955702990c1d4bfb3c6feed6ae9c6cb5c3c07: I'm still working on changes to store references hierarchically in the cache and read them lazily. I hope that it will help some scaling problems with large number of refs. Unfortunately I keep getting tangled up in side issues, so it is taking a lot longer than expected. But there's still hope. Michael -- Michael Haggerty mhagger@alum.mit.edu http://softwareswirl.blogspot.com/ ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-26 15:32 ` Michael Haggerty @ 2011-09-26 15:42 ` Martin Fick 2011-09-26 16:25 ` Thomas Rast 0 siblings, 1 reply; 126+ messages in thread From: Martin Fick @ 2011-09-26 15:42 UTC (permalink / raw) To: Michael Haggerty; +Cc: git, Christian Couder On Monday, September 26, 2011 09:32:14 am Michael Haggerty wrote: > On 09/25/2011 10:43 PM, Martin Fick wrote: > > A coworker of mine pointed out to me that a simple > > > > git checkout > > > > can also take rather long periods of time > 3 mins when > > run on a repo with ~100K refs. > > > > While this is not massive like the other problem I > > reported, it still seems like it is more than one > > would expect. So, I tried an older version of git, > > and to my surprise/delight, it was much faster (.2s). > > So, I bisected this issue also, and it seems that the > > "offending" commit is > > > 680955702990c1d4bfb3c6feed6ae9c6cb5c3c07: > I'm still working on changes to store references > hierarchically in the cache and read them lazily. I > hope that it will help some scaling problems with large > number of refs. > > Unfortunately I keep getting tangled up in side issues, > so it is taking a lot longer than expected. But there's > still hope. > > Michael Thanks Michael, I look forward to those changes. In the meantime however, I will try to take advantage of the current inefficiencies of large ref counts to attempt to find places where there are obvious problems in the code paths. I suspect that there are several commands in git which inadvertently scan all the refs when they probably shouldn't. Since this is likely very slow now, it should be easy to find those, if it were faster, this might get overlooked. I feel like git checkout is one of those cases, it does not seem like git checkout should be affected by the number of refs in a repo? -Martin -- Employee of Qualcomm Innovation Center, Inc. which is a member of Code Aurora Forum ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-26 15:42 ` Martin Fick @ 2011-09-26 16:25 ` Thomas Rast 0 siblings, 0 replies; 126+ messages in thread From: Thomas Rast @ 2011-09-26 16:25 UTC (permalink / raw) To: Martin Fick; +Cc: Michael Haggerty, git, Christian Couder Martin Fick wrote: > > I suspect that there are several commands in git > which inadvertently scan all the refs when they probably > shouldn't. [...] I feel like git checkout is one of those cases, > it does not seem like git checkout should be affected by the > number of refs in a repo? git-checkout checks whether you are leaving any unreferenced (orphaned) commits behind when you leave a detached HEAD, which requires that it scan the history of all refs for the commit you just left. So unless you disable that warning it'll be pretty expensive regardless. -- Thomas Rast trast@{inf,student}.ethz.ch ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-08 19:53 ` Martin Fick 2011-09-09 0:52 ` Martin Fick @ 2011-09-09 13:50 ` Michael Haggerty 2011-09-09 15:51 ` Michael Haggerty 2011-09-09 16:03 ` Jens Lehmann 1 sibling, 2 replies; 126+ messages in thread From: Michael Haggerty @ 2011-09-09 13:50 UTC (permalink / raw) To: Martin Fick; +Cc: git On 09/08/2011 09:53 PM, Martin Fick wrote: > Just thought that I should add some numbers to this thread as it seems that > the later versions of git are worse off by several orders of magnitude on > this one. > > We have a Gerrit repo with just under 100K refs in refs/changes/*. When I > fetch them all with git 1.7.6 it does not seem to complete. Even after 5 > days, it is just under half way through the ref #s! [...] I recently reported very slow performance when doing a "git filter-branch" involving only about 1000 tags, with hints of O(N^3) scaling [1]. That could certainly explain enormous runtimes for 100k refs. References are cached in git in a single linked list, so it is easy to imagine O(N^2) all over the place (which is bad enough for 100k references). I am working on improving the situation by reorganizing how the reference cache is stored in memory, but progress is slow. I'm not sure whether your problem is related. For example, it is not obvious to me why the commit that you cite (88a21979) would make the reference problem so dramatically worse. I suggest the following experiments to characterize the problem: 1. Fetch the references in batches of a few hundred each, and see if that dramatically decreases the total time. 2. Same as (1), except run "git pack-refs --all --prune" between the batches. In my experiments, packing references made a dramatic difference in runtimes. 3. Try using the --no-replace-objects option (I assume that it can be used like "git --no-replace-objects fetch ..."). In my case this option made a dramatic improvement in the runtimes. 4. Try a test using a repository generated something like the test script that I posted in [1]. If it also gives pathologically bad performance, then it can serve as a test case to use while we debug the problem. Yours, Michael [1] http://comments.gmane.org/gmane.comp.version-control.git/177103 -- Michael Haggerty mhagger@alum.mit.edu http://softwareswirl.blogspot.com/ ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-09 13:50 ` Michael Haggerty @ 2011-09-09 15:51 ` Michael Haggerty 2011-09-09 16:03 ` Jens Lehmann 1 sibling, 0 replies; 126+ messages in thread From: Michael Haggerty @ 2011-09-09 15:51 UTC (permalink / raw) To: Martin Fick; +Cc: git I have answered some of my own questions: On 09/09/2011 03:50 PM, Michael Haggerty wrote: > 3. Try using the --no-replace-objects option (I assume that it can be > used like "git --no-replace-objects fetch ..."). In my case this option > made a dramatic improvement in the runtimes. This does not seem to help much. > 4. Try a test using a repository generated something like the test > script that I posted in [1]. If it also gives pathologically bad > performance, then it can serve as a test case to use while we debug the > problem. Yes, a simple test repo like that created by the script is enough to reproduce the problem. The slowdown becomes very obvious after only a few hundred references. Curiously, "git clone" is very fast under the same circumstances that "git fetch" is excruciatingly slow. According to strace, git seems to be repopulating the ref cache after each new ref is created (it walks through the whole refs subdirectory and reads every file). Apparently the ref cache is being discarded completely whenever a ref is added (which can and should be fixed) and then being reloaded for some reason (though single refs can be inspected much faster without reading the cache). This situation should be improved by the hierarchical refcache changes that I'm working on plus smarter updating (rather than discarding) of the cache when a new reference is created. Some earlier speculation in this thread was that that slowdowns might be caused by "pessimal" ordering of revisions in the walker queue. But my test repository shards the references in such a way that the lexical order of the refnames does not correspond to the topological order of the commits. So that can't be the whole story. Michael -- Michael Haggerty mhagger@alum.mit.edu http://softwareswirl.blogspot.com/ ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-09-09 13:50 ` Michael Haggerty 2011-09-09 15:51 ` Michael Haggerty @ 2011-09-09 16:03 ` Jens Lehmann 1 sibling, 0 replies; 126+ messages in thread From: Jens Lehmann @ 2011-09-09 16:03 UTC (permalink / raw) To: Michael Haggerty; +Cc: Martin Fick, git Am 09.09.2011 15:50, schrieb Michael Haggerty: > On 09/08/2011 09:53 PM, Martin Fick wrote: >> Just thought that I should add some numbers to this thread as it seems that >> the later versions of git are worse off by several orders of magnitude on >> this one. >> >> We have a Gerrit repo with just under 100K refs in refs/changes/*. When I >> fetch them all with git 1.7.6 it does not seem to complete. Even after 5 >> days, it is just under half way through the ref #s! [...] > > I recently reported very slow performance when doing a "git > filter-branch" involving only about 1000 tags, with hints of O(N^3) > scaling [1]. That could certainly explain enormous runtimes for 100k refs. > > References are cached in git in a single linked list, so it is easy to > imagine O(N^2) all over the place (which is bad enough for 100k > references). I am working on improving the situation by reorganizing > how the reference cache is stored in memory, but progress is slow. > > I'm not sure whether your problem is related. For example, it is not > obvious to me why the commit that you cite (88a21979) would make the > reference problem so dramatically worse. 88a21979 is the reason, as since then a "git rev-list <sha1> --not --all" is run for *every* updated ref to find out all new commits fetched for that ref. And if you have 100K of them ... ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-06-09 15:56 ` Shawn Pearce 2011-06-09 16:26 ` Jeff King @ 2011-06-10 7:41 ` Andreas Ericsson 2011-06-10 19:41 ` Shawn Pearce 1 sibling, 1 reply; 126+ messages in thread From: Andreas Ericsson @ 2011-06-10 7:41 UTC (permalink / raw) To: Shawn Pearce; +Cc: A Large Angry SCM, Sverre Rabbelier, NAKAMURA Takumi, git On 06/09/2011 05:56 PM, Shawn Pearce wrote: > On Thu, Jun 9, 2011 at 08:52, A Large Angry SCM<gitzilla@gmail.com> wrote: >> On 06/09/2011 11:23 AM, Shawn Pearce wrote: >>> Having a reference to every commit in the repository is horrifically >>> slow. We run into this with Gerrit Code Review and I need to find >>> another solution. Git just wasn't meant to process repositories like >>> this. >> >> Assuming a very large number of refs, what is it that makes git so >> horrifically slow? Is there a design or implementation lesson here? > > A few things. > > Git does a sequential scan of all references when it first needs to > access references for an operation. This requires reading the entire > packed-refs file, and the recursive scan of the "refs/" subdirectory > for any loose refs that might override the packed-refs file. > > A lot of operations toss every commit that a reference points at into > the revision walker's LRU queue. If you have a tag pointing to every > commit, then the entire project history enters the LRU queue at once, > up front. That queue is managed with O(N^2) insertion time. And the > entire queue has to be filled before anything can be output. > Hmm. Since we're using pre-hashed data with an obvious lookup method we should be able to do much, much better than O(n^2) for insertion and better than O(n) for worst-case lookups. I'm thinking a 1-byte trie, resulting in a depth of 20 and thus a lookup and insertion complexity of 20. It would waste some memory but it might be worth it for fixed asymptotic complexity for both insertion and lookup. -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 Considering the successes of the wars on alcohol, poverty, drugs and terror, I think we should give some serious thought to declaring war on peace. ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-06-10 7:41 ` Andreas Ericsson @ 2011-06-10 19:41 ` Shawn Pearce 2011-06-10 20:12 ` Jakub Narebski ` (2 more replies) 0 siblings, 3 replies; 126+ messages in thread From: Shawn Pearce @ 2011-06-10 19:41 UTC (permalink / raw) To: Andreas Ericsson Cc: A Large Angry SCM, Sverre Rabbelier, NAKAMURA Takumi, git On Fri, Jun 10, 2011 at 00:41, Andreas Ericsson <ae@op5.se> wrote: > On 06/09/2011 05:56 PM, Shawn Pearce wrote: >> >> A lot of operations toss every commit that a reference points at into >> the revision walker's LRU queue. If you have a tag pointing to every >> commit, then the entire project history enters the LRU queue at once, >> up front. That queue is managed with O(N^2) insertion time. And the >> entire queue has to be filled before anything can be output. > > Hmm. Since we're using pre-hashed data with an obvious lookup method > we should be able to do much, much better than O(n^2) for insertion > and better than O(n) for worst-case lookups. I'm thinking a 1-byte > trie, resulting in a depth of 20 and thus a lookup and insertion > complexity of 20. It would waste some memory but it might be worth it > for fixed asymptotic complexity for both insertion and lookup. Not really. The queue isn't sorted by SHA-1. It's sorted by commit timestamp, descending. Those aren't pre-hashed. The O(N^2) insertion is because the code is trying to find where this commit belongs in the list of commits as sorted by commit timestamp. There are some priority queue data structures designed for this sort of work, e.g. a calendar queue might help. But it's not as simple as a 1-byte trie. -- Shawn. ^ permalink raw reply [flat|nested] 126+ messages in thread
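The insertion Shawn refers to is the date-ordered commit_list insertion in commit.c; a simplified sketch of it (field and function names approximate the real source rather than quoting it) shows where the O(N^2) comes from:

    #include <stdlib.h>

    struct commit { unsigned long date; }; /* reduced to the relevant field */

    struct commit_list {
            struct commit *item;
            struct commit_list *next;
    };

    /* The list is kept newest-first; finding the insertion point is a
     * linear scan, so inserting N commits costs O(N^2) overall. */
    static struct commit_list *insert_by_date(struct commit *item,
                                              struct commit_list **list)
    {
            struct commit_list **pp = list;
            struct commit_list *new_entry;

            while (*pp && (*pp)->item->date >= item->date)
                    pp = &(*pp)->next;

            new_entry = malloc(sizeof(*new_entry));
            new_entry->item = item;
            new_entry->next = *pp;
            *pp = new_entry;
            return new_entry;
    }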
* Re: Git is not scalable with too many refs/* 2011-06-10 19:41 ` Shawn Pearce @ 2011-06-10 20:12 ` Jakub Narebski 2011-06-10 20:35 ` Jeff King 2011-06-13 7:08 ` Andreas Ericsson 2 siblings, 0 replies; 126+ messages in thread From: Jakub Narebski @ 2011-06-10 20:12 UTC (permalink / raw) To: Shawn Pearce Cc: Andreas Ericsson, A Large Angry SCM, Sverre Rabbelier, NAKAMURA Takumi, git Shawn Pearce <spearce@spearce.org> writes: > On Fri, Jun 10, 2011 at 00:41, Andreas Ericsson <ae@op5.se> wrote: >> On 06/09/2011 05:56 PM, Shawn Pearce wrote: >>> >>> A lot of operations toss every commit that a reference points at into >>> the revision walker's LRU queue. If you have a tag pointing to every >>> commit, then the entire project history enters the LRU queue at once, >>> up front. That queue is managed with O(N^2) insertion time. And the >>> entire queue has to be filled before anything can be output. >> >> Hmm. Since we're using pre-hashed data with an obvious lookup method >> we should be able to do much, much better than O(n^2) for insertion >> and better than O(n) for worst-case lookups. I'm thinking a 1-byte >> trie, resulting in a depth of 20 and thus a lookup and insertion >> complexity of 20. It would waste some memory but it might be worth it >> for fixed asymptotic complexity for both insertion and lookup. > > Not really. > > The queue isn't sorted by SHA-1. It's sorted by commit timestamp, > descending. Those aren't pre-hashed. The O(N^2) insertion is because > the code is trying to find where this commit belongs in the list of > commits as sorted by commit timestamp. > > There are some priority queue data structures designed for this sort of > work, e.g. a calendar queue might help. But it's not as simple as a > 1-byte trie. In the case of Subversion revision numbers (the revision-number-to-hash mapping), sorted by name (at least in version order) means sorted by date. I wonder if there is a data structure for which this is the optimum insertion order (the way almost-sorted data is the best case for insertion sort). -- Jakub Narebski Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 126+ messages in thread
* Re: Git is not scalable with too many refs/* 2011-06-10 19:41 ` Shawn Pearce 2011-06-10 20:12 ` Jakub Narebski @ 2011-06-10 20:35 ` Jeff King 2011-06-13 7:08 ` Andreas Ericsson 2 siblings, 0 replies; 126+ messages in thread From: Jeff King @ 2011-06-10 20:35 UTC (permalink / raw) To: Shawn Pearce Cc: Andreas Ericsson, A Large Angry SCM, Sverre Rabbelier, NAKAMURA Takumi, git On Fri, Jun 10, 2011 at 12:41:39PM -0700, Shawn O. Pearce wrote: > Not really. > > The queue isn't sorted by SHA-1. It's sorted by commit timestamp, > descending. Those aren't pre-hashed. The O(N^2) insertion is because > the code is trying to find where this commit belongs in the list of > commits as sorted by commit timestamp. > > There are some priority queue data structures designed for this sort of > work, e.g. a calendar queue might help. But it's not as simple as a > 1-byte trie. All you really need is a heap-based priority queue, which gives O(lg n) insertion and popping (and O(1) peeking at the top). I even wrote one and posted it recently (I won't dig up the reference, but I posted it elsewhere in this thread, I think). The problem is that many parts of the code assume that commit_list is a linked list and do fast iterations, or even splicing. It's nothing you couldn't get around with some work, but it turns out to involve a lot of code changes. -Peff ^ permalink raw reply [flat|nested] 126+ messages in thread
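Peff's patch is not quoted in this thread, but the data structure he means is generic; a from-scratch sketch keyed on the reduced struct commit from the earlier sketch (not his posted code) could be:

    #include <stdlib.h>

    /* Binary max-heap on commit date: O(lg n) push and pop, O(1) peek,
     * replacing the O(N) scan of the linked-list insertion above.
     * Reuses the reduced 'struct commit' from the previous sketch. */
    struct commit_queue {
            struct commit **items;
            size_t nr, alloc;
    };

    static void queue_push(struct commit_queue *q, struct commit *c)
    {
            size_t i = q->nr++;

            if (q->nr > q->alloc) {
                    q->alloc = q->alloc ? 2 * q->alloc : 64;
                    q->items = realloc(q->items, q->alloc * sizeof(*q->items));
            }
            while (i && q->items[(i - 1) / 2]->date < c->date) {
                    q->items[i] = q->items[(i - 1) / 2]; /* sift parent down */
                    i = (i - 1) / 2;
            }
            q->items[i] = c;
    }

    static struct commit *queue_pop(struct commit_queue *q)
    {
            struct commit *top, *last;
            size_t i = 0, child;

            if (!q->nr)
                    return NULL;
            top = q->items[0];
            last = q->items[--q->nr];
            while ((child = 2 * i + 1) < q->nr) {
                    if (child + 1 < q->nr &&
                        q->items[child + 1]->date > q->items[child]->date)
                            child++;               /* take the newer child */
                    if (last->date >= q->items[child]->date)
                            break;
                    q->items[i] = q->items[child]; /* sift the hole down */
                    i = child;
            }
            q->items[i] = last;
            return top;
    }

As Peff notes, the heap itself is the easy part; the work is in the commit_list call sites that assume a spliceable linked list.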
* Re: Git is not scalable with too many refs/* 2011-06-10 19:41 ` Shawn Pearce 2011-06-10 20:12 ` Jakub Narebski 2011-06-10 20:35 ` Jeff King @ 2011-06-13 7:08 ` Andreas Ericsson 2 siblings, 0 replies; 126+ messages in thread From: Andreas Ericsson @ 2011-06-13 7:08 UTC (permalink / raw) To: Shawn Pearce; +Cc: A Large Angry SCM, Sverre Rabbelier, NAKAMURA Takumi, git On 06/10/2011 09:41 PM, Shawn Pearce wrote: > On Fri, Jun 10, 2011 at 00:41, Andreas Ericsson<ae@op5.se> wrote: >> On 06/09/2011 05:56 PM, Shawn Pearce wrote: >>> >>> A lot of operations toss every commit that a reference points at into >>> the revision walker's LRU queue. If you have a tag pointing to every >>> commit, then the entire project history enters the LRU queue at once, >>> up front. That queue is managed with O(N^2) insertion time. And the >>> entire queue has to be filled before anything can be output. >> >> Hmm. Since we're using pre-hashed data with an obvious lookup method >> we should be able to do much, much better than O(n^2) for insertion >> and better than O(n) for worst-case lookups. I'm thinking a 1-byte >> trie, resulting in a depth of 20 and thus a lookup and insertion >> complexity of 20. It would waste some memory but it might be worth it >> for fixed asymptotic complexity for both insertion and lookup. > > Not really. > > The queue isn't sorted by SHA-1. It's sorted by commit timestamp, > descending. Those aren't pre-hashed. The O(N^2) insertion is because > the code is trying to find where this commit belongs in the list of > commits as sorted by commit timestamp. > Hmm. We should still be able to do better than that, and particularly for the "tag-each-commit" workflow. Since it's most likely those tags are generated using incrementing numbers, we could have a cut-off where we first parse all the refs and make an optimistic assumption that an alphabetical sort of the refs provides a map of insertion-points for the commits. Since the best-case behaviour is still O(1) for insertion sort and it's unlikely that thousands of refs are in random order, that should cause the vast majority of the refs we insert to follow the best-case scenario. This will fall on its arse when people start doing hg-ref -> git-commit tags of course, but that doesn't seem to be happening, or at least not to the same extent as with the svn-revision -> git-commit mapping. We're still not improving the asymptotic complexity, but it's a pretty safe bet that we for a vast majority of cases improve wallclock runtime by a hefty amount with a relatively minor effort. -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 Considering the successes of the wars on alcohol, poverty, drugs and terror, I think we should give some serious thought to declaring war on peace. ^ permalink raw reply [flat|nested] 126+ messages in thread
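Concretely, the optimistic scheme Andreas sketches could be as small as remembering the previous insertion point; hypothetical code built on the insert_by_date() sketch from earlier in the thread, not a patch against revision.c:

    /* If commits arrive in (nearly) descending date order, each one
     * belongs at or just past the previous insertion point, making the
     * common case O(1). Illustrative only; in real code the hint would
     * live in the walker state rather than a file-static. */
    static struct commit_list **last_pos;

    static struct commit_list *insert_by_date_hinted(struct commit *item,
                                                     struct commit_list **list)
    {
            struct commit_list **pp = list;
            struct commit_list *new_entry;

            /* Everything before last_pos is newer than the entry at it,
             * so if that entry is still newer than 'item' we can skip
             * the whole prefix and resume the scan there. */
            if (last_pos && *last_pos && (*last_pos)->item->date >= item->date)
                    pp = last_pos;
            while (*pp && (*pp)->item->date >= item->date)
                    pp = &(*pp)->next;

            new_entry = malloc(sizeof(*new_entry));
            new_entry->item = item;
            new_entry->next = *pp;
            *pp = new_entry;
            last_pos = pp;
            return new_entry;
    }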
* Re: Git is not scalable with too many refs/* 2011-06-09 3:44 Git is not scalable with too many refs/* NAKAMURA Takumi 2011-06-09 6:50 ` Sverre Rabbelier @ 2011-06-09 11:18 ` Jakub Narebski 2011-06-09 15:42 ` Stephen Bash 1 sibling, 1 reply; 126+ messages in thread From: Jakub Narebski @ 2011-06-09 11:18 UTC (permalink / raw) To: NAKAMURA Takumi; +Cc: git NAKAMURA Takumi <geek4civic@gmail.com> writes: > Hello, Git. It is my 1st post here. > > I have tried tagging each commit as "refs/tags/rXXXXXX" on git-svn > repo locally. (over 100k refs/tags.) [...] That's insane. You would do much better to mark each commit with a note. Notes are designed to be scalable. See e.g. this thread [RFD] Proposal for git-svn: storing SVN metadata (git-svn-id) in notes http://article.gmane.org/gmane.comp.version-control.git/174657 -- Jakub Narebski Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 126+ messages in thread
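For illustration, the workflow Jakub suggests looks roughly like this (the notes ref name "svn" and the revision number are made up for the example):

    # attach the svn revision to a commit instead of creating a tag:
    git notes --ref=svn add -m "r100042" <commit>

    # the annotations then show up in log output on request:
    git log --notes=svn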
* Re: Git is not scalable with too many refs/* 2011-06-09 11:18 ` Jakub Narebski @ 2011-06-09 15:42 ` Stephen Bash 0 siblings, 0 replies; 126+ messages in thread From: Stephen Bash @ 2011-06-09 15:42 UTC (permalink / raw) To: Jakub Narebski; +Cc: git, NAKAMURA Takumi ----- Original Message ----- > From: "Jakub Narebski" <jnareb@gmail.com> > To: "NAKAMURA Takumi" <geek4civic@gmail.com> > Cc: "git" <git@vger.kernel.org> > Sent: Thursday, June 9, 2011 7:18:09 AM > Subject: Re: Git is not scalable with too many refs/* > NAKAMURA Takumi <geek4civic@gmail.com> writes: > > > Hello, Git. It is my 1st post here. > > > > I have tried tagging each commit as "refs/tags/rXXXXXX" on git-svn > > repo locally. (over 100k refs/tags.) > [...] > > That's insane. You would do much better to mark each commit with > note. Notes are designed to be scalable. See e.g. this thread > > [RFD] Proposal for git-svn: storing SVN metadata (git-svn-id) in notes > http://article.gmane.org/gmane.comp.version-control.git/174657 As a reformed SVN user (i.e. not using it anymore ;]) I agree that 100k tags seems crazy, but I was contemplating doing the exact same thing as Takumi. Skimming that thread, I didn't see the key point (IMO): notes can map from commits to a "name" (or other information), tags map from a "name" to commits. I've seen two different workflows develop: 1) Hacking on some code in Git the programmer finds something wrong. Using Git tools he can pickaxe/bisect/etc. and find that the problem traces back to a commit imported from Subversion. 2) The programmer finds something wrong, asks coworker, coworker says "see bug XYZ", bug XYZ says "Fixed in r20356". I agree notes is the right answer for (1), but for (2) you really want a cross reference table from Subversion rev number to Git commit. In our office we created the cross reference table once by walking the Git tree and storing it as a file (we had some degenerate cases where one SVN rev mapped to multiple Git commits, but I don't remember the details), but it's not really usable from Git. Lightweight tags would be an awesome solution (if they worked). Perhaps a custom subcommand is a reasonable middle ground. Thanks, Stephen ^ permalink raw reply [flat|nested] 126+ messages in thread
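For what it's worth, the one-off table Stephen describes can be rebuilt from a git-svn import with plain plumbing; a rough sketch, assuming the usual git-svn-id lines in the commit messages and ignoring the degenerate one-rev-to-many-commits cases he mentions:

    git rev-list --all | while read sha; do
        rev=$(git show -s --format=%B "$sha" |
              sed -n 's/^git-svn-id: .*@\([0-9][0-9]*\) .*/r\1/p')
        test -n "$rev" && echo "$rev $sha"
    done | sort >svn-to-git.txt

    # "Fixed in r20356" then resolves with:
    grep '^r20356 ' svn-to-git.txt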