On Tue, 27 Mar 2007, Uwe Kleine-König wrote:
>
> 	embeddedproject$ git ls-tree HEAD | grep linux
> 	040000 commit 0123456789abcde0...	linux-2.6
>
> (or how ever you save submodules).  Then you might have to duplicate the
> objects of linux-2.6, because they are part of both histories.

No they are not. Unless you do it wrong.

The *only* object that is part of the superproject would be the tree that
*contains* that entry itself. We should *never* automatically follow such
an entry down, *exactly* because that doesn't scale.

So to actually follow that entry for something like a recursive diff, you'd
literally "cd into linux, and start 'git diff' from commit 0123456.."

In other words, the subproject would be its own project, and the
superproject never sees it as "part of itself".

I really think, for example, that the "git diff" family of programs
(diff-index, diff-tree, diff-files) and things like "git ls-tree" should
literally:

 - have a mode where they don't even recurse into subprojects, and I
   personally think that it could/should be the default!

 - when they recurse, they should literally (at least to begin with) do
   that kind of

	"fork() ; if (child) { chdir(subproject); execve(myself) }"

The latter is really to make sure that *even*by*mistake* we don't screw
things up and tie the sub/superproject together too tightly.

I'm serious. I really think that the first version (which ends up being
the one that sets semantics) should be very careful here, so that
subprojects never get mixed up with the superproject.

And I'm also serious about the "don't recurse into subproject by default
at all". If I'm at the superproject, and I maintain the superproject, I
think the state of the subprojects themselves is a totally separate issue.
It's quite a valid thing to do to maintain the build infrastructure, and
if I'm the maintainer of that, and I do "git diff", I sure as hell don't
want to wait for git to do "git diff" on the subprojects when there are
5000 of them!

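
To make the "fork, chdir, exec" recursion above concrete, here is a rough
sketch in C. It is not how git does or would do it; run_in_subproject() is
a made-up helper, and execvp() (which searches PATH) stands in for the
"execve(myself)" step purely to keep the example short. The point it
illustrates is that the child becomes a completely fresh process inside
the subproject, so no in-memory state of the superproject can leak into it.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run argv[0] with arguments argv inside the
 * directory 'subproject', in a separate process.
 *
 * Returns the child's exit status, 127 if the child could not chdir
 * or exec, or -1 if fork/waitpid failed or the child died abnormally.
 */
int run_in_subproject(const char *subproject, char *const argv[])
{
	pid_t pid;
	int status;

	assert(subproject != NULL && argv != NULL && argv[0] != NULL);

	pid = fork();
	if (pid < 0)
		return -1;

	if (pid == 0) {
		/* Child: step into the subproject, then replace ourselves. */
		if (chdir(subproject) < 0)
			_exit(127);
		execvp(argv[0], argv);
		_exit(127);	/* only reached if the exec failed */
	}

	/* Parent: wait for the child, retrying if a signal interrupts us. */
	while (waitpid(pid, &status, 0) < 0) {
		if (errno != EINTR)
			return -1;
	}
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller in the superproject would do something like
run_in_subproject("linux-2.6", args) with args set to
{ "git", "diff", NULL }, and only look at the exit status. The parent
never opens the subproject's object database itself.
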
Sure, "git diff" is fast (on the kernel, it takes me 0.069s on a clean
tree), but

 - multiply that 0.069s by 5000 and it's not so fast any more

 - when you have a thousand subprojects, it's quite possible (even likely)
   that all your directories won't fit in the cache any more, and suddenly
   even a single "git diff" takes several seconds.

Really! Try this on the Linux tree (that "drop_caches" thing needs root
privileges):

	echo 3 > /proc/sys/vm/drop_caches
	git diff

and see it take something like 5 seconds. Now, imagine that you have a
hundred subprojects, and they're big enough that the caches are *never*
warm.

People sometimes don't seem to understand what "scalability" really means.
Scalability means that something that is so fast that you don't even
*think* about it will become a major bottleneck when you do it a thousand
times, and the working set has grown so big that it totally blows out
several levels of caches (both CPU caches and disk caches)

		Linus