On Tue, 7 Apr 2009, Björn Steinbrink wrote:

> On 2009.04.07 09:13:45 -0400, Nicolas Pitre wrote:
> > Having git-rev-list consume about 2G RSS for the enumeration of 4M
> > objects is simply unacceptable, period.  This is the equivalent of 500
> > bytes per object pinned in memory on average, just for listing objects,
> > which is completely silly.  We ought to do better than that.
>
> Ah, crap, I might have been fooled by "ps aux", top actually shows about
> 1.3G being shared, likely the mmapped pack files. And that will be
> reused, assuming the box has enough memory to keep all that stuff.

Right.  And since the pack is mapped read-only, it can be paged out
easily by the OS.  And if that doesn't help, we already have the
core.packedGitWindowSize and core.packedGitLimit config options to
play with.

> But that's still 700MB or about 150 bytes per object on average.
>
> A "struct tree" is 40 bytes here, adding the average path length (19 in
> this repo) that's 59 bytes, leaving about 90 bytes of "overhead" per
> object, as in the end we seem to care only about the sha1 and the path
> name.

I'm starting to think more seriously about pack v4 again, where each
path component is indexed in a table.  Because most tree objects are
different revisions of the same path, this could represent a
significant saving in memory as well.

> And in the upload-pack case, there's also pack-objects running
> concurrently, already going up to 950M RSS/100M shared _while_ the
> rev-list is still running. So that's 3G of memory usage (2G if you
> ignore the shared stuff) before the "Compressing objects" part even
> starts. And of course, pack-objects will apparently start to mmap the
> pack files only after the rev-list finished, so a "smart" OS might have
> removed a lot of the mmapped stuff from memory again, causing it to be
> re-read.
:-/

The first low-hanging fruit to help this case is to make upload-pack use
the --revs argument with pack-objects to let it do the object enumeration
itself directly, instead of relying on the rev-list output through a pipe.
This is what 'git repack' does already.  pack-objects has to access the
pack anyway, so this would eliminate an extra access from a different
process.


Nicolas
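
To make the path table idea mentioned above a bit more concrete, here is a rough sketch of why indexing path components could save memory. This is only an illustration and not the pack v4 format: the PathTable class, its method names and the example paths are all made up for this message.

```python
class PathTable:
    """Hypothetical table giving each distinct path component a small integer id."""

    def __init__(self):
        self._index = {}        # component string -> integer id
        self._components = []   # integer id -> component string

    def intern(self, path):
        """Return the path as a tuple of component ids, adding new components as needed."""
        ids = []
        for component in path.split("/"):
            if component not in self._index:
                self._index[component] = len(self._components)
                self._components.append(component)
            ids.append(self._index[component])
        return tuple(ids)

    def lookup(self, ids):
        """Rebuild the path string from a tuple of component ids."""
        return "/".join(self._components[i] for i in ids)

    def __len__(self):
        return len(self._components)


table = PathTable()

# Three revisions of the same file plus one sibling (made-up paths):
# the path strings repeat, but the table only grows for new components.
paths = ["drivers/net/e1000/main.c"] * 3 + ["drivers/net/e1000/hw.c"]
encoded = [table.intern(p) for p in paths]

print(len(table), "distinct components for", len(paths), "paths")
print(encoded)
```

Each additional revision of an already-seen path then costs a few small integers rather than another copy of the whole path string, which is the kind of saving hoped for when most tree objects are different revisions of the same path.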