* Performance issue exposed by git-filter-branch
@ 2010-12-17  1:07 Ken Brownfield
  2010-12-17  1:45 ` Jonathan Nieder
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Ken Brownfield @ 2010-12-17  1:07 UTC (permalink / raw)
  To: git

I have a large git repository (1,757,784 objects, 209,282 commits) from which I have been planning to filter out large portions of the tree (~36,000 of ~132,000 files).  I first ran git-filter-branch on this repository about a year ago:

git filter-branch --index-filter 'git rm -r --cached --ignore-unmatch -- bigdirtree stuff/a stuff/b stuff/c stuff/dir/{a,b,c}' --prune-empty --tag-name-filter cat -- --all

The process took around 25 hours, back when the repository was at ~101k commits.  This wasn't ideal, but it could be completed over a weekend maintenance window.  There are 50 daily active committers to this repository, so the window has to be short.

However, we didn't have time to roll out the newly filtered repo (it involves everyone recloning, etc.) until now.

Now that the repository has grown, the same filter-branch process takes 6.5 *days* at 100% CPU on the same machine (2x4 Xeon, x86_64) on git 1.7.3.2.  There's no I/O, memory, or other resource contention.

I tend to doubt there are any multi-processing opportunities with this process, so at this point git-filter-branch is no longer feasible.

This is an oprofile sample (all samples >1%) taken roughly one day into the 6.5-day Rewrite stage:

[...]
11594     1.0208  git                      git                      add_index_entry
11616     1.0228  vmlinux-debug-2.6.24-28-server vmlinux-debug-2.6.24-28-server find_lock_page
12624     1.1115  git                      git                      decode_tree_entry
13065     1.1504  git                      git                      refresh_index
13757     1.2113  git                      git                      match_pathspec
14041     1.2363  git                      git                      read_packed_refs
18309     1.6121  vmlinux-debug-2.6.24-28-server vmlinux-debug-2.6.24-28-server unmap_vmas
20014     1.7622  libc-2.7.so              libc-2.7.so              _int_malloc
24248     2.1350  git                      git                      find_cache_pos
24560     2.1625  git                      git                      find_pack_entry_one
29042     2.5571  vmlinux-debug-2.6.24-28-server vmlinux-debug-2.6.24-28-server debug
31202     2.7473  libz.so.1.2.3.3          libz.so.1.2.3.3          inflate
34941     3.0765  git                      git                      df_name_compare
36749     3.2357  libz.so.1.2.3.3          libz.so.1.2.3.3          inflate_fast
41704     3.6720  git                      git                      index_name_pos
46908     4.1302  vmlinux-debug-2.6.24-28-server vmlinux-debug-2.6.24-28-server clear_page
92554     8.1493  libc-2.7.so              libc-2.7.so              memcpy
127439   11.2208  libcrypto.so.0.9.8       libcrypto.so.0.9.8       sha1_block_data_order
188373   16.5860  git                      git                      cache_name_compare

cache_name_compare (and the presumed follow-ons of memcpy/sha/malloc/etc) is the major consumer.

Sampling the filter only 2k commits into the Rewrite stage shows:

[...]
12058     1.0135  vmlinux-debug-2.6.24-28-server vmlinux-debug-2.6.24-28-server do_path_lookup
13532     1.1374  vmlinux-debug-2.6.24-28-server vmlinux-debug-2.6.24-28-server copy_user_generic_string
13934     1.1712  vmlinux-debug-2.6.24-28-server vmlinux-debug-2.6.24-28-server clear_page
16565     1.3924  git                      git                      cache_name_compare
16948     1.4246  libc-2.7.so              libc-2.7.so              memcpy
16969     1.4263  vmlinux-debug-2.6.24-28-server vmlinux-debug-2.6.24-28-server system_call
19189     1.6129  vmlinux-debug-2.6.24-28-server vmlinux-debug-2.6.24-28-server debug
22697     1.9078  git                      git                      add_ref
31112     2.6151  ext3                     ext3                     (no symbols)
33925     2.8516  git                      git                      sort_ref_list
34026     2.8600  vmlinux-debug-2.6.24-28-server vmlinux-debug-2.6.24-28-server _atomic_dec_and_lock
39304     3.3037  git                      git                      read_packed_refs
43920     3.6917  vmlinux-debug-2.6.24-28-server vmlinux-debug-2.6.24-28-server __link_path_walk
58504     4.9175  libc-2.7.so              libc-2.7.so              strcmp
79957     6.7208  vmlinux-debug-2.6.24-28-server vmlinux-debug-2.6.24-28-server __d_lookup
168696   14.1797  git                      git                      prepare_packed_git_one

The process is still pretty slow (1-2 commits per second) but cache_name_compare is in the background.

Is there a way to apply the optimizations mentioned in that old thread to the code paths used by git-filter-branch (mainly git-read and git-rm, seemingly), or is there another way to investigate and improve the performance of the filter?

Outside of this specific issue, it might be worth taking a look at the overall performance of git-filter-branch: a shell loop spawning core git executables for every commit probably isn't ideal, but there is lower-hanging fruit.  Running the filter in a single process might allow better caching and reduce duplicated work (maybe even parallelization?), but I'm just guessing.

Our tree is quite large, but the O(n^2) nature of this process is pretty crippling for the larger repositories that are bound to be in the wild.  And while filter-branch isn't an everyday thing, when I /do/ need to use it, I won't be able to wait a week. :-)

I'd appreciate any feedback or suggestions anyone might have!

Thanks,
Ken

PS: On an unrelated note, I would recommend that the following code in git-filter-branch:

277:rev_args=$(git rev-parse --revs-only "$@")

be changed to write its output to a temporary file, which is then fed to the "git rev-list" at line 289 via "--stdin".  For larger trees, expanding $rev_args on the "git rev-list" command line exceeds some shells' command-line buffer size (from direct experience).
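
Roughly, a sketch (the temporary-file name is illustrative, and the real rev-list invocation at line 289 carries more options):

	rev_args_file=$(mktemp) &&
	git rev-parse --revs-only "$@" >"$rev_args_file"
	# ... later, at what is now line 289:
	git rev-list --stdin <"$rev_args_file"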

* Re: Performance issue exposed by git-filter-branch
  2010-12-17  1:07 Performance issue exposed by git-filter-branch Ken Brownfield
@ 2010-12-17  1:45 ` Jonathan Nieder
  2010-12-17  2:31   ` Ken Brownfield
  2010-12-17  1:54 ` Thomas Rast
  2010-12-17 13:01 ` Nguyen Thai Ngoc Duy
  2 siblings, 1 reply; 15+ messages in thread
From: Jonathan Nieder @ 2010-12-17  1:45 UTC (permalink / raw)
  To: Ken Brownfield; +Cc: git, David Barr

Hi Ken,

Ken Brownfield wrote:

> Is there a way to apply the optimizations mentioned in that old
> thread to the code paths used by git-filter-branch (mainly git-read
> and git-rm, seemingly), or is there another way to investigate and
> improve the performance of the filter?

Which old thread?

You might be able to get faster results using the approach of [1]
(using "git cat-file --batch-check" to collect the trees you want
and "git fast-import" to paste them together), which avoids unpacking
trees when not needed.
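
For the collection step, something like the following would list each commit's root tree without unpacking anything (a sketch; assumes every ref is wanted):

	git rev-list --reverse --topo-order --all | sed 's/$/^{tree}/' |
		git cat-file --batch-check

That prints one "<tree sha1> tree <size>" line per commit, oldest first.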

Hope that helps,
Jonathan

[1] http://repo.or.cz/w/git/barrbrain/github.git/commitdiff/db-svn-filter-root

* Re: Performance issue exposed by git-filter-branch
  2010-12-17  1:07 Performance issue exposed by git-filter-branch Ken Brownfield
  2010-12-17  1:45 ` Jonathan Nieder
@ 2010-12-17  1:54 ` Thomas Rast
  2010-12-17  2:36   ` Ken Brownfield
  2010-12-17 13:01 ` Nguyen Thai Ngoc Duy
  2 siblings, 1 reply; 15+ messages in thread
From: Thomas Rast @ 2010-12-17  1:54 UTC (permalink / raw)
  To: Ken Brownfield; +Cc: git

Ken Brownfield wrote:
> git filter-branch --index-filter 'git rm -r --cached --ignore-unmatch -- bigdirtree stuff/a stuff/b stuff/c stuff/dir/{a,b,c}' --prune-empty --tag-name-filter cat -- --all
[...]
> Now that the same repository has grown, this same filter-branch
> process now takes 6.5 *days* at 100% CPU on the same machine (2x4
> Xeon, x86_64) on git-1.7.3.2.  There's no I/O, memory, or other
> resource contention.

If all you do is an index-filter for deletion, I think it should be
rather easy to achieve good results by filtering the fast-export
stream to remove these files, and then piping that back to
fast-import.

(It's just that AFAIK nobody has written that code yet.)
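
To make the idea concrete, here is a minimal sketch of such a filter in Python.  It is a starting point, not a hardened tool: it assumes fast-export's default mark-based output (no "inline" data, no rename/copy commands), assumes unquoted paths, and does not prune commits that become empty the way --prune-empty would:

	import sys

	# Path prefixes to strip from history (illustrative; taken from the
	# filter-branch invocation quoted above).
	PREFIXES = ('bigdirtree', 'stuff/a', 'stuff/b', 'stuff/c')

	def dropped(path):
	    return any(path == p or path.startswith(p + '/') for p in PREFIXES)

	inp, out = sys.stdin.buffer, sys.stdout.buffer
	for line in iter(inp.readline, b''):
	    if line.startswith(b'data '):
	        # Copy the counted payload (blob contents, commit messages)
	        # verbatim; it must never be scanned for commands.
	        out.write(line)
	        out.write(inp.read(int(line[5:])))
	        continue
	    if line.startswith(b'M '):    # filemodify: M <mode> <dataref> <path>
	        if dropped(line.split(b' ', 3)[3].rstrip(b'\n').decode()):
	            continue
	    elif line.startswith(b'D '):  # filedelete: D <path>
	        if dropped(line[2:].rstrip(b'\n').decode()):
	            continue
	    out.write(line)

It would be driven as something like "git fast-export --all | python filter.py | git fast-import" inside a fresh repository.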

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

* Re: Performance issue exposed by git-filter-branch
  2010-12-17  1:45 ` Jonathan Nieder
@ 2010-12-17  2:31   ` Ken Brownfield
  2010-12-17  3:22     ` Jonathan Nieder
  0 siblings, 1 reply; 15+ messages in thread
From: Ken Brownfield @ 2010-12-17  2:31 UTC (permalink / raw)
  To: git; +Cc: David Barr

The thread titled "git and larger trees, not so fast?".  Some of the history is lost, but here's the earliest post I can find:

http://lists-archives.org/git/627040-git-and-larger-trees-not-so-fast.html

On GMANE:
http://article.gmane.org/gmane.comp.version-control.git/55460/match=git+larger+trees+not+so+fast

But I can't figure out how to show the whole thread.

Sorry, that paragraph of my email disappeared. :-(

Ken

On Dec 16, 2010, at 5:45 PM, Jonathan Nieder wrote:

> Hi Ken,
> 
> Ken Brownfield wrote:
> 
>> Is there a way to apply the optimizations mentioned in that old
>> thread to the code paths used by git-filter-branch (mainly git-read
>> and git-rm, seemingly), or is there another way to investigate and
>> improve the performance of the filter?
> 
> Which old thread?
> 
> You might be able to get faster results using the approach of [1]
> (using "git cat-file --batch-check" to collect the trees you want
> and "git fast-import" to paste them together), which avoids unpacking
> trees when not needed.
> 
> Hope that helps,
> Jonathan
> 
> [1] http://repo.or.cz/w/git/barrbrain/github.git/commitdiff/db-svn-filter-root

* Re: Performance issue exposed by git-filter-branch
  2010-12-17  1:54 ` Thomas Rast
@ 2010-12-17  2:36   ` Ken Brownfield
  2010-12-17  2:51     ` Jakub Narebski
  2010-12-17  3:08     ` Jonathan Nieder
  0 siblings, 2 replies; 15+ messages in thread
From: Ken Brownfield @ 2010-12-17  2:36 UTC (permalink / raw)
  To: git

I had considered this approach (and the one mentioned by Jonathan) but there are no git tools to actually perform the filter I wanted on the export in this form.  I could (and will) parse fast-export and make an attempt at filtering files/directories... my concern is that I won't do it right, and will introduce subtle corruption.  But if there's no existing tool, I'll take a crack at it. :-)

Thanks for your suggestions so far,

Ken

PS: This was my exact first thought, since I had previously been accustomed to performing "svnadmin dump / svndumpfilter / svnadmin load" on this repository when it was in SVN.

On Dec 16, 2010, at 5:54 PM, Thomas Rast wrote:

> Ken Brownfield wrote:
>> git filter-branch --index-filter 'git rm -r --cached --ignore-unmatch -- bigdirtree stuff/a stuff/b stuff/c stuff/dir/{a,b,c}' --prune-empty --tag-name-filter cat -- --all
> [...]
>> Now that the same repository has grown, this same filter-branch
>> process now takes 6.5 *days* at 100% CPU on the same machine (2x4
>> Xeon, x86_64) on git-1.7.3.2.  There's no I/O, memory, or other
>> resource contention.
> 
> If all you do is an index-filter for deletion, I think it should be
> rather easy to achieve good results by filtering the fast-export
> stream to remove these files, and then piping that back to
> fast-import.
> 
> (It's just that AFAIK nobody has written that code yet.)
> 
> -- 
> Thomas Rast
> trast@{inf,student}.ethz.ch

* Re: Performance issue exposed by git-filter-branch
  2010-12-17  2:36   ` Ken Brownfield
@ 2010-12-17  2:51     ` Jakub Narebski
  2010-12-21  4:49       ` Ken Brownfield
  2010-12-17  3:08     ` Jonathan Nieder
  1 sibling, 1 reply; 15+ messages in thread
From: Jakub Narebski @ 2010-12-17  2:51 UTC (permalink / raw)
  To: Ken Brownfield; +Cc: git, Jakub Narebski

Please do not toppost.

Ken Brownfield <krb@irridia.com> writes:

> I had considered this approach (and the one mentioned by Jonathan)
> but there are no git tools to actually perform the filter I wanted
> on the export in this form.  I could (and will) parse fast-export
> and make an attempt at filtering files/directories... my concern is
> that I won't do it right, and will introduce subtle corruption.  But
> if there's no existing tool, I'll take a crack at it. :-)

You can try ESR's reposurgeon:

  http://www.catb.org/~esr/reposurgeon/

Its limitation is that it loads the structure of the DAG of revisions
(but not blobs, i.e. file contents) into memory, IIRC.  It is not
streaming but "DOM"-based; otherwise some commands would not work.


By the way, the git-filter-branch documentation recommends using an
index-filter with git-update-index instead of a tree-filter with
git-rm, and, if a tree-filter is needed, using a fast filesystem,
e.g. a RAM-backed one.

But probably you know all that.
-- 
Jakub Narebski
Poland
ShadeHawk on #git

* Re: Performance issue exposed by git-filter-branch
  2010-12-17  2:36   ` Ken Brownfield
  2010-12-17  2:51     ` Jakub Narebski
@ 2010-12-17  3:08     ` Jonathan Nieder
  2010-12-17  5:39       ` Elijah Newren
  1 sibling, 1 reply; 15+ messages in thread
From: Jonathan Nieder @ 2010-12-17  3:08 UTC (permalink / raw)
  To: Ken Brownfield; +Cc: git, David Barr, Elijah Newren, skimo, Eric Raymond

Ken Brownfield wrote:

> I had considered this approach (and the one mentioned by Jonathan)
> but there are no git tools to actually perform the filter I wanted
> on the export in this form.

Keep in mind that the two suggestions were subtly different from one
another.

For the "filter fast-import stream" technique, apparently there is a
tool called reposurgeon[1] to do that.  git_fast_filter[2] has the
same purpose, too, if I remember correctly.

For the unpack-trees avoidance technique, true, the only example I
know of is the one I mentioned[3].  The idea would be to sort the
commits you want in topological order and replay them, for each one
going like so:

	M 040000 <old tree id> ""
	D bad/directory/one
	D bad/directory/two

using fast-import from git.git master.  (Older versions of fast-import
do not properly handle replacing the root directory, so if that sort
of compatibility is important, you'd have to use a directory listing
and fill in all the _good_ directories instead.)
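
For concreteness, one replayed commit in the fast-import input language might look like this (placeholder names throughout; "from :1" refers to the previously replayed parent by its mark):

	commit refs/heads/filtered
	mark :2
	committer C O Mitter <committer@example.com> 1292550000 -0800
	data 13
	original msg
	from :1
	M 040000 <old tree id> ""
	D bad/directory/one
	D bad/directory/two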

I'd be glad to look at any work in this direction.  Something like it
would be useful for postprocessing when importing from svn repos.

Thanks,
Jonathan

[1] http://esr.ibiblio.org/?p=2718
[2] http://thread.gmane.org/gmane.comp.version-control.git/116028
and links therein
[3] http://thread.gmane.org/gmane.comp.version-control.git/158375

* Re: Performance issue exposed by git-filter-branch
  2010-12-17  2:31   ` Ken Brownfield
@ 2010-12-17  3:22     ` Jonathan Nieder
  2010-12-17  3:37       ` Jonathan Nieder
  0 siblings, 1 reply; 15+ messages in thread
From: Jonathan Nieder @ 2010-12-17  3:22 UTC (permalink / raw)
  To: Ken Brownfield; +Cc: git, David Barr, Thomas Rast, Jakub Narebski

Ken Brownfield wrote:

> The thread titled "git and larger trees, not so fast?".

Here it is[1].  Sorry to say, the improvements discussed there
were made right away and indeed had a dramatic effect.

Jonathan

[1] http://thread.gmane.org/gmane.comp.version-control.git/55458/focus=55643

* Re: Performance issue exposed by git-filter-branch
  2010-12-17  3:22     ` Jonathan Nieder
@ 2010-12-17  3:37       ` Jonathan Nieder
  0 siblings, 0 replies; 15+ messages in thread
From: Jonathan Nieder @ 2010-12-17  3:37 UTC (permalink / raw)
  To: Ken Brownfield; +Cc: git, David Barr, Thomas Rast, Jakub Narebski

Jonathan Nieder wrote:
> Ken Brownfield wrote:

>> The thread titled "git and larger trees, not so fast?".
>
> Here it is[1].  Sorry to say, the improvements discussed there
> were made right away and indeed had a dramatic effect.

Of course I missed your point. :)

filter-branch --index-filter works a little like this, for each commit:

 . find the underlying tree
 . read-tree: unpack that tree and all of its subtrees into
   the index file.  That is, convert from a recursive structure

   /:
	COPYING
	Documentation/
	INSTALL
	Makefile
	...

   Documentation/:
	CodingGuidelines
	Makefile
	...

   into a flat structure

	COPYING
	Documentation/CodingGuidelines
	Documentation/Makefile
	Documentation/RelNotes/1.5.0.txt
	...

 . rm: find entries matching certain patterns and remove them
   from the index file.  This takes two passes through the index:
   first to find matching entries, second to write the result to
   disk.
 . write-tree: write new trees into the object store.  That is,
   convert from a flat structure back to a recursive structure.

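In plumbing terms, one iteration looks roughly like this (a sketch; a throwaway index file stands in for filter-branch's real temporary index):

	export GIT_INDEX_FILE=/tmp/filter-index  # scratch index, not .git/index
	git read-tree <commit>^{tree}            # recursive trees -> flat index
	git rm -r --cached --ignore-unmatch -- bad/directory/one bad/directory/two
	new_tree=$(git write-tree)               # flat index -> recursive trees
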
This is convenient, but it does not sound to me like the most
efficient way to eliminate a few subtrees from each commit.  That is
why I was suggesting a method that avoids unpacking some trees
altogether.

That said, speedups for read-tree, rm, and write-tree would certainly
be nice to have.  One project of interest to some people is to give
the index file a recursive structure, so finding the entries to remove
in the "git rm" example could be faster.

* Re: Performance issue exposed by git-filter-branch
  2010-12-17  3:08     ` Jonathan Nieder
@ 2010-12-17  5:39       ` Elijah Newren
  2011-02-04 21:17         ` Ken Brownfield
  0 siblings, 1 reply; 15+ messages in thread
From: Elijah Newren @ 2010-12-17  5:39 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Ken Brownfield, git, David Barr, skimo, Eric Raymond

On Thu, Dec 16, 2010 at 8:08 PM, Jonathan Nieder <jrnieder@gmail.com> wrote:
> Ken Brownfield wrote:
>
>> I had considered this approach (and the one mentioned by Jonathan)
>> but there are no git tools to actually perform the filter I wanted
>> on the export in this form.
>
> Keep in mind that the two suggestions were subtly different from one
> another.
>
> For the "filter fast-import stream" technique, apparently there is a
> tool called reposurgeon[1] to do that.  git_fast_filter[2] has the
> same purpose, too, if I remember correctly.

Yes, git_fast_filter was written precisely because git-filter-branch
took waaaaaay too long.  IIRC, git-filter-branch would have taken
about 2-3 months for our use case (there's no way we could have shut
down the repositories for that long), whereas git_fast_filter (sitting
between fast-export and fast-import) allowed us to drop that to
about an hour (we couldn't use --index-filter with filter-branch as we
needed to do a number of operations on the actual file contents as
well).

All git_fast_filter really does is parse the fast-export output into
some basic Python data structures, making it easy for you to modify
those structures as necessary (assuming basic Python skills, though if
you only need to do what one of the examples shows, you can get away
without even that), and then pipe the results back out in the format
fast-import expects.  It ships with a few examples; removing existing
files is one of the simple ones.

I haven't really bothered keeping the public repository up-to-date
since there hasn't been any prior external interest in it, but we
haven't modified it much internally either, and most of those
modifications are likely for niche stuff that you wouldn't need.

Elijah

* Re: Performance issue exposed by git-filter-branch
  2010-12-17  1:07 Performance issue exposed by git-filter-branch Ken Brownfield
  2010-12-17  1:45 ` Jonathan Nieder
  2010-12-17  1:54 ` Thomas Rast
@ 2010-12-17 13:01 ` Nguyen Thai Ngoc Duy
  2010-12-21  4:59   ` Ken Brownfield
  2 siblings, 1 reply; 15+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2010-12-17 13:01 UTC (permalink / raw)
  To: Ken Brownfield; +Cc: git

On Fri, Dec 17, 2010 at 8:07 AM, Ken Brownfield <krb@irridia.com> wrote:
> cache_name_compare (and the presumed follow-ons of memcpy/sha/malloc/etc) is the major consumer.

Other people have given you alternative approaches. I'm just wondering
if we can improve something here. cache_name_compare() is essentially
memcmp() on two full paths. A tree-based index might help. How long
are your file names on average? Are your trees deep?
-- 
Duy

* Re: Performance issue exposed by git-filter-branch
  2010-12-17  2:51     ` Jakub Narebski
@ 2010-12-21  4:49       ` Ken Brownfield
  0 siblings, 0 replies; 15+ messages in thread
From: Ken Brownfield @ 2010-12-21  4:49 UTC (permalink / raw)
  To: git

This does pretty much exactly what I want (and a lot more), but reposurgeon is now over three days into reading the fast-export stream at 100% CPU.  My guess is that it's about 30% done.

It does look like a great tool for smaller repositories.

Thanks for the suggestion, though!  It looks like git_fast_filter is my next stop.

Ken

On Dec 16, 2010, at 6:51 PM, Jakub Narebski wrote:

> Please do not toppost.
> 
> Ken Brownfield <krb@irridia.com> writes:
> 
>> I had considered this approach (and the one mentioned by Jonathan)
>> but there are no git tools to actually perform the filter I wanted
>> on the export in this form.  I could (and will) parse fast-export
> and make an attempt at filtering files/directories... my concern is
>> that I won't do it right, and will introduce subtle corruption.  But
>> if there's no existing tool, I'll take a crack at it. :-)
> 
> You can try ESR's reposurgeon:
> 
>  http://www.catb.org/~esr/reposurgeon/
> 
> Its limitation is that it loads the structure of the DAG of revisions
> (but not blobs, i.e. file contents) into memory, IIRC.  It is not
> streaming but "DOM"-based; otherwise some commands would not work.
> 
> 
> By the way, the git-filter-branch documentation recommends using an
> index-filter with git-update-index instead of a tree-filter with
> git-rm, and, if a tree-filter is needed, using a fast filesystem,
> e.g. a RAM-backed one.
> 
> But probably you know all that.
> -- 
> Jakub Narebski
> Poland
> ShadeHawk on #git

* Re: Performance issue exposed by git-filter-branch
  2010-12-17 13:01 ` Nguyen Thai Ngoc Duy
@ 2010-12-21  4:59   ` Ken Brownfield
  0 siblings, 0 replies; 15+ messages in thread
From: Ken Brownfield @ 2010-12-21  4:59 UTC (permalink / raw)
  To: git

Filename lengths: min 4 avg 50 median 51 max 178
Directory depths: min 1 avg 5 median 6 max 13

Pretty standard Python naming/hierarchy, really.

Thanks!
Ken

On Dec 17, 2010, at 5:01 AM, Nguyen Thai Ngoc Duy wrote:

> On Fri, Dec 17, 2010 at 8:07 AM, Ken Brownfield <krb@irridia.com> wrote:
>> cache_name_compare (and the presumed follow-ons of memcpy/sha/malloc/etc) is the major consumer.
> 
> Other people have given you alternative approaches. I'm just wondering
> if we can improve something here. cache_name_compare() is essentially
> memcmp() on two full paths. A tree-based index might help. How long
> are your file names on average? Are your trees deep?
> -- 
> Duy

* Re: Performance issue exposed by git-filter-branch
  2010-12-17  5:39       ` Elijah Newren
@ 2011-02-04 21:17         ` Ken Brownfield
  2011-02-05 14:21           ` Elijah Newren
  0 siblings, 1 reply; 15+ messages in thread
From: Ken Brownfield @ 2011-02-04 21:17 UTC (permalink / raw)
  To: git, Elijah Newren

Thanks for the feedback on git_fast_filter.  It takes 11.5 hours on our repository instead of 6.5 days, so that's a significant improvement. :-)  I have a couple of observations:

1) You said that your repo would have taken 2-3 months to filter with git-filter-branch, and the time was reduced to ~1hr.  I'm surprised our reduction was not quite as dramatic, although I presume the variability of repo contents is the explanation.

2) The resulting repository pack files are actually much larger.  A garbage collection reduces the size below the original, but only slightly.  I'm concerned that the recreated repository has redundant or inefficiently stored information, but I'm not sure how to verify what objects are taking up what space.

3) git_fast_filter doesn't currently support remote submodules.  When it tries to parse a submodule line, the regex fails and the code aborts:

Expected:
	M 100644 :433236 foo/bar/bletch
Received, something like:
	M 100644 cd821b4c0ea8e9493069ff43712a0b09 foo/bar/bletch

To correct the issue, I modified git_fast_filter to simply skip these.  While we no longer utilize remote submodules, I would prefer not to have them removed.

Any feedback on what the proper behavior would be in the submodule case?  Perhaps this is covered in your internal version?

Thanks,
-- 
Ken

On Dec 16, 2010, at 9:39 PM, Elijah Newren wrote:

> On Thu, Dec 16, 2010 at 8:08 PM, Jonathan Nieder <jrnieder@gmail.com> wrote:
>> Ken Brownfield wrote:
>> 
>>> I had considered this approach (and the one mentioned by Jonathan)
>>> but there are no git tools to actually perform the filter I wanted
>>> on the export in this form.
>> 
>> Keep in mind that the two suggestions were subtly different from one
>> another.
>> 
>> For the "filter fast-import stream" technique, apparently there is a
>> tool called reposurgeon[1] to do that.  git_fast_filter[2] has the
>> same purpose, too, if I remember correctly.
> 
> Yes, git_fast_filter was written precisely because git-filter-branch
> took waaaaaay too long.  IIRC, git-filter-branch would have taken
> about 2-3 months for our use case (there's no way we could have shut
> down the repositories for that long), whereas git_fast_filter (sitting
> between fast-export and fast-import) allowed us to drop that to
> about an hour (we couldn't use --index-filter with filter-branch as we
> needed to do a number of operations on the actual file contents as
> well).
> 
> All git_fast_filter really does is parse the fast-export output into
> some basic Python data structures, making it easy for you to modify
> those structures as necessary (assuming basic Python skills, though if
> you only need to do what one of the examples shows, you can get away
> without even that), and then pipe the results back out in the format
> fast-import expects.  It ships with a few examples; removing existing
> files is one of the simple ones.
> 
> I haven't really bothered keeping the public repository up-to-date
> since there hasn't been any prior external interest in it, but we
> haven't modified it much internally either, and most of those
> modifications are likely for niche stuff that you wouldn't need.
> 
> Elijah

* Re: Performance issue exposed by git-filter-branch
  2011-02-04 21:17         ` Ken Brownfield
@ 2011-02-05 14:21           ` Elijah Newren
  0 siblings, 0 replies; 15+ messages in thread
From: Elijah Newren @ 2011-02-05 14:21 UTC (permalink / raw)
  To: Ken Brownfield; +Cc: git

Hi,

On Fri, Feb 4, 2011 at 2:17 PM, Ken Brownfield <krb@irridia.com> wrote:
> Thanks for the feedback on git_fast_filter.  It takes 11.5 hours on our repository instead of 6.5 days, so that's a significant improvement. :-)  I have a couple of observations:
>
> 1) You said that your repo would have taken 2-3 months to filter with git-filter-branch, and the time was reduced to ~1hr.  I'm surprised our reduction was not quite as dramatic, although I presume the variability of repo contents is the explanation.

Variability of the repo certainly would account for some differences,
though I suspect more of the differences come from what kind of
filtering we were doing.  For example, the advantage of
git_fast_filter over filter-branch's --index-filter will be much less
than its advantage over filter-branch's --tree-filter.  Further, in my
case, I was parsing and potentially editing the contents of all files,
which becomes much more painful with filter-branch, as you'll need to
re-edit the exact same contents in every revision of history in which
the file remains unchanged (in other words, duplicating the same work
hundreds or thousands of times).  With git_fast_filter, I only needed
to parse/edit a given version of some file exactly once.  That's what
really helped in my case.

> 2) The resulting repository pack files are actually much larger.  A garbage collection reduces the size below the original, but only slightly.  I'm concerned that the recreated repository has redundant or inefficiently stored information, but I'm not sure how to verify what objects are taking up what space.

You may want to use packinfo.pl from under contrib/stats/ in the git
repository to find out what objects take up how much space.  From my
notes on using it for this purpose:

  git verify-pack -v .git/objects/pack/pack-<sha1sum>.idx |
    packinfo.pl -tree -filenames > tree-info.txt
  sort -k 4 -n tree-info.txt | grep -v '^$' | less

> 3) git_fast_filter doesn't currently support remote submodules.  When it tries to parse a submodule line, the regex fails and the code aborts:
>
> Expected:
>        M 100644 :433236 foo/bar/bletch
> Received, something like:
>        M 100644 cd821b4c0ea8e9493069ff43712a0b09 foo/bar/bletch
>
> To correct the issue, I modified git_fast_filter to simply skip these.  While we no longer utilize remote submodules, I would prefer not to have them removed.
>
> Any feedback on what the proper behavior would be in the submodule case?  Perhaps this is covered in your internal version?

git_fast_filter would need to be modified to handle this kind of
input and create an appropriate object type, and that object type
would need to know how to output itself properly later.  Since
submodules haven't really been relevant for me, I've never bothered
implementing this[*].  The assumption that git-fast-export will
produce numeric ids (i.e. that submodules are not present) is somewhat
hardwired in, so it'd take a little bit of refactoring, though
probably not too bad.

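For what it's worth, the parsing half might be as small as widening one pattern.  A hypothetical sketch (not the actual git_fast_filter internals):

	import re

	# Hypothetical: accept both a mark reference (":433236") and a raw
	# object name on filemodify lines, so gitlink entries survive parsing.
	# An object type that re-emits the raw id unchanged is still needed.
	FILEMODIFY_RE = re.compile(
	    r'^M (?P<mode>\d+) (?P<dataref>:\d+|[0-9a-f]+) (?P<path>.*)$')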

Elijah

[*] Well, actually we did hit it once somewhat recently when someone
created a commit containing a submodule...and then also immediately
reverted it.  Since we don't want to use submodules, I simply put in a
hack that would recognize them and unconditionally strip them out on
the input parsing end, which sounds like the same thing you did.
That's obviously not what you're asking for.
