[PATCH 0/2] diffcore-break optimizations

* [PATCH 0/2] diffcore-break optimizations
@ 2009-11-16 15:53 Jeff King
  2009-11-16 15:56 ` [PATCH 1/2] diffcore-break: free filespec data as we go Jeff King
  2009-11-16 16:02 ` [PATCH 2/2] diffcore-break: save cnt_data for other phases Jeff King
From: Jeff King @ 2009-11-16 15:53 UTC
  To: Junio C Hamano; +Cc: git

On one of my more ridiculously gigantic repositories, I recently tried
to make a commit, and git ran out of memory in the process. The
repository has about 3 gigabytes of data, and I made a small-ish change
to every file. Pathological, yes, but I think we can do better than
chugging for 5 minutes and dying.

The culprit turned out to be memory usage in diffcore-break, which is on
by default for "git status" (and for the "git commit" template message).
It wants to have every changed blob in memory at once, which is just
silly.
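
To make the memory difference concrete, here is a toy model in C. It is
only a sketch, not the diffcore code: the toy_* names and the byte
accounting are made up for illustration, and "loading" a blob is faked
with malloc. It walks a list of (old, new) pairs and reports the peak
number of bytes held, either keeping everything loaded or releasing each
pair once it has been looked at.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a blob; not git's struct diff_filespec. */
struct toy_blob {
	size_t size;
	char *data;	/* NULL while the contents are not loaded */
};

static size_t live_bytes, peak_bytes;

/* Pretend to read the blob contents from the object store. */
static void toy_load(struct toy_blob *b)
{
	if (b->data)
		return;
	b->data = malloc(b->size);
	if (!b->data)
		abort();
	memset(b->data, 'x', b->size);
	live_bytes += b->size;
	if (live_bytes > peak_bytes)
		peak_bytes = live_bytes;
}

/* Release the contents; the blob can be loaded again later if needed. */
static void toy_free(struct toy_blob *b)
{
	if (!b->data)
		return;
	assert(live_bytes >= b->size);
	free(b->data);
	b->data = NULL;
	live_bytes -= b->size;
}

/*
 * Visit every (one[i], two[i]) pair once and return the peak number of
 * bytes that were loaded at the same time. With free_as_we_go set, each
 * pair is released as soon as it has been examined.
 */
static size_t toy_scan(struct toy_blob *one, struct toy_blob *two,
		       int nr, int free_as_we_go)
{
	int i;

	live_bytes = peak_bytes = 0;
	for (i = 0; i < nr; i++) {
		toy_load(&one[i]);
		toy_load(&two[i]);
		/* a real implementation would compare the contents here */
		if (free_as_we_go) {
			toy_free(&one[i]);
			toy_free(&two[i]);
		}
	}
	/* drop whatever is still loaded before returning */
	for (i = 0; i < nr; i++) {
		toy_free(&one[i]);
		toy_free(&two[i]);
	}
	return peak_bytes;
}
```

Without freeing, the peak is the sum of all blob sizes; with freeing, it
is the size of the largest single pair. The cost is that a later phase
which wants the same contents has to load them again.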

The patches are:

  [1/2]: diffcore-break: free filespec data as we go

  This addresses the memory consumption issue. If you have enough
  memory, it doesn't actually yield a speed improvement, but neither
  does it show any slowdown for practical workloads.

  There is a theoretical slowdown when doing -B -M, because the rename
  phase has to re-fetch the blobs from the object store. However, I
  wasn't able to measure any slowdown for real-world cases (like "git
  log --summary -M -B >/dev/null" on git.git).

  I did manage to produce the slowdown on a pathological case: ten
  20-megabyte files, each copied with a slight modification to another
  file, and then replaced with totally different contents (so each one
  will be broken and then trigger an inexact rename). That diff went
  from 16s to 17s.

  But I improved that and more with the next optimization.

  [2/2]: diffcore-break: save cnt_data for other phases

  Rename detection already saves cnt_data this way, and since break
  detection uses the same data format, there is little reason not to do
  the same here. My pathological case
  above went from 17s down to 12s. I wasn't able to detect any speedup
  or slowdown for sane cases.

  So I doubt anybody will even notice this, but I think since we can
  address pathological cases, we might as well (and as you will see, the
  code change is quite small).
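
The caching idea behind the second patch can be sketched in C as below.
This is an illustration under loud assumptions, not the patch itself:
the toy_* names are hypothetical, and a plain byte histogram stands in
for git's real cnt_data, which is a different and more elaborate
structure. The point is only that a summary derived from a file's
contents is stored on the file's record the first time it is computed,
so a later phase comparing the same file can reuse it.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a filespec that can carry a cached summary. */
struct toy_spec {
	const char *data;
	size_t size;
	unsigned *cnt;	/* cached byte histogram, or NULL if not built yet */
};

static int toy_hash_runs;	/* how many times a histogram was built */

/* Build a 256-entry byte histogram of the contents (the expensive part). */
static unsigned *toy_count(const char *data, size_t size)
{
	unsigned *cnt = calloc(256, sizeof(*cnt));
	size_t i;

	if (!cnt)
		abort();
	for (i = 0; i < size; i++)
		cnt[(unsigned char)data[i]]++;
	toy_hash_runs++;
	return cnt;
}

/*
 * Return the number of bytes the two files have in common, judged by
 * their histograms. If "cache" is non-zero, each histogram is kept on
 * its spec so that later calls skip the counting; otherwise anything
 * built here is thrown away again.
 */
static size_t toy_common(struct toy_spec *src, struct toy_spec *dst, int cache)
{
	unsigned *s = src->cnt ? src->cnt : toy_count(src->data, src->size);
	unsigned *d = dst->cnt ? dst->cnt : toy_count(dst->data, dst->size);
	size_t common = 0;
	int i;

	for (i = 0; i < 256; i++)
		common += s[i] < d[i] ? s[i] : d[i];

	if (cache) {
		src->cnt = s;
		dst->cnt = d;
	} else {
		if (s != src->cnt)
			free(s);
		if (d != dst->cnt)
			free(d);
	}
	return common;
}
```

If one phase calls toy_common() with caching on, a second phase
comparing the same two specs does no counting at all. The caller owns
the cached histograms and has to free them when the specs go away.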

All of that being said, I was able to do my commit, but I still had to
wait five minutes for it to chug through 3G of data. :) I am tempted to
add a "quick" mode to git-commit, but perhaps such a ridiculous case is
rare enough not to worry about. I worked around it by writing my commit
message separately and using "git commit -F".

-Peff
