All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] Improve Git performance on big trees
@ 2010-01-14 15:02 Nguyễn Thái Ngọc Duy
  2010-01-14 15:02 ` [PATCH 1/2] rm: only refresh entries that we may touch Nguyễn Thái Ngọc Duy
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2010-01-14 15:02 UTC (permalink / raw)
  To: git; +Cc: Nguyễn Thái Ngọc Duy

The discussion about using Git on Gentoo comes up again [1] and it
looks like we can improve Git a bit if user only works on some
subdirectories. In such cases, Git still does whole-tree check, which
takes quite some time. "git status" takes 3 seconds on my machine, but
I suspect lstat() is not the only culprit, "git diff --exit-code" is
about 1 sec.

These patches makes "git rm <path>" and "git status <path>" a bit
faster. Almost 1 sec for "git rm foo" still seems too long though,
probably due to writing a 9MB index.

[1] http://article.gmane.org/gmane.linux.gentoo.devel/64522

Nguyễn Thái Ngọc Duy (2):
  rm: only refresh entries that we may touch
  status: only touch path we may need to check

 builtin-commit.c |    2 +-
 builtin-rm.c     |    2 +-
 wt-status.c      |    2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/2] rm: only refresh entries that we may touch
  2010-01-14 15:02 [PATCH 0/2] Improve Git performance on big trees Nguyễn Thái Ngọc Duy
@ 2010-01-14 15:02 ` Nguyễn Thái Ngọc Duy
  2010-01-16  4:39   ` Junio C Hamano
  2010-01-14 15:02 ` [PATCH 2/2] status: only touch path we may need to check Nguyễn Thái Ngọc Duy
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 10+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2010-01-14 15:02 UTC (permalink / raw)
  To: git; +Cc: Nguyễn Thái Ngọc Duy

This gets rid of the whole tree cache refresh. Instead only path that
we touch will get refreshed. We may still lstat() more than needed,
but it'd be better playing safe.

This potentially reduces a large number of lstat() on big trees. Take
gentoo-x86 tree for example, which has roughly 80k files:

Unmodified Git:

$ time git rm --cached skel.ChangeLog
rm 'skel.ChangeLog'

real    0m1.323s
user    0m0.806s
sys     0m0.517s

Modified Git:

$ time git rm --cached skel.ChangeLog
rm 'skel.ChangeLog'

real    0m0.903s
user    0m0.780s
sys     0m0.100s

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 builtin-rm.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/builtin-rm.c b/builtin-rm.c
index 57975db..4cac3d1 100644
--- a/builtin-rm.c
+++ b/builtin-rm.c
@@ -169,7 +169,6 @@ int cmd_rm(int argc, const char **argv, const char *prefix)
 
 	if (read_cache() < 0)
 		die("index file corrupt");
-	refresh_cache(REFRESH_QUIET);
 
 	pathspec = get_pathspec(prefix, argv);
 	seen = NULL;
@@ -181,6 +180,7 @@ int cmd_rm(int argc, const char **argv, const char *prefix)
 		struct cache_entry *ce = active_cache[i];
 		if (!match_pathspec(pathspec, ce->name, ce_namelen(ce), 0, seen))
 			continue;
+		refresh_cache_entry(ce, 1);
 		add_list(ce->name);
 	}
 
-- 
1.6.6.181.g5ee6

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 2/2] status: only touch path we may need to check
  2010-01-14 15:02 [PATCH 0/2] Improve Git performance on big trees Nguyễn Thái Ngọc Duy
  2010-01-14 15:02 ` [PATCH 1/2] rm: only refresh entries that we may touch Nguyễn Thái Ngọc Duy
@ 2010-01-14 15:02 ` Nguyễn Thái Ngọc Duy
  2010-01-14 16:02   ` Sverre Rabbelier
  2010-01-16  4:41   ` Junio C Hamano
  2010-01-14 15:21 ` [PATCH 0/2] Improve Git performance on big trees Martin Langhoff
  2010-01-17  8:43 ` [PATCH] rm: only refresh entries that we may touch Nguyễn Thái Ngọc Duy
  3 siblings, 2 replies; 10+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2010-01-14 15:02 UTC (permalink / raw)
  To: git; +Cc: Nguyễn Thái Ngọc Duy

This patch gets rid of whole-tree cache refresh and untracked file
search. Instead only specified path will be looked at.

Again some numbers on gentoo-x86, ~80k files:

Unmodified Git:

$ time git st eclass/
nothing to commit (working directory clean)

real    0m3.211s
user    0m1.977s
sys     0m1.135s

Modified Git:

$ time ~/w/git/git st eclass/
nothing to commit (working directory clean)

real    0m1.587s
user    0m1.426s
sys     0m0.111s

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 builtin-commit.c |    2 +-
 wt-status.c      |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/builtin-commit.c b/builtin-commit.c
index 6199db7..dae6cb3 100644
--- a/builtin-commit.c
+++ b/builtin-commit.c
@@ -1024,7 +1024,7 @@ int cmd_status(int argc, const char **argv, const char *prefix)
 		s.pathspec = get_pathspec(prefix, argv);
 
 	read_cache();
-	refresh_cache(REFRESH_QUIET|REFRESH_UNMERGED);
+	refresh_index(&the_index, REFRESH_QUIET|REFRESH_UNMERGED, s.pathspec, NULL, NULL);
 	s.is_initial = get_sha1(s.reference, sha1) ? 1 : 0;
 	s.in_merge = in_merge;
 	wt_status_collect(&s);
diff --git a/wt-status.c b/wt-status.c
index 5d56988..65feb29 100644
--- a/wt-status.c
+++ b/wt-status.c
@@ -343,7 +343,7 @@ static void wt_status_collect_untracked(struct wt_status *s)
 			DIR_SHOW_OTHER_DIRECTORIES | DIR_HIDE_EMPTY_DIRECTORIES;
 	setup_standard_excludes(&dir);
 
-	fill_directory(&dir, NULL);
+	fill_directory(&dir, s->pathspec);
 	for (i = 0; i < dir.nr; i++) {
 		struct dir_entry *ent = dir.entries[i];
 		if (!cache_name_is_other(ent->name, ent->len))
-- 
1.6.6.181.g5ee6

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 0/2] Improve Git performance on big trees
  2010-01-14 15:02 [PATCH 0/2] Improve Git performance on big trees Nguyễn Thái Ngọc Duy
  2010-01-14 15:02 ` [PATCH 1/2] rm: only refresh entries that we may touch Nguyễn Thái Ngọc Duy
  2010-01-14 15:02 ` [PATCH 2/2] status: only touch path we may need to check Nguyễn Thái Ngọc Duy
@ 2010-01-14 15:21 ` Martin Langhoff
  2010-01-15 13:36   ` Nguyen Thai Ngoc Duy
  2010-01-17  8:43 ` [PATCH] rm: only refresh entries that we may touch Nguyễn Thái Ngọc Duy
  3 siblings, 1 reply; 10+ messages in thread
From: Martin Langhoff @ 2010-01-14 15:21 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy; +Cc: git

2010/1/14 Nguyễn Thái Ngọc Duy <pclouds@gmail.com>:
> Almost 1 sec for "git rm foo" still seems too long though,
> probably due to writing a 9MB index.

One of the main issues there is that the Gentoo dir tree seems to be
very flat. The kernel tree is huge, but much deeper, and does not have
a huge top-level directory -- and git handles it fairly well.

Perhaps the Gentoo tree can be rearranged to be more nested? If your
devs strongly prefer a flat view of it, that could be arranged with
symlinks.

Alternatively, each port can be in its own repo, and you can make a
"top level repo" using git submodules -- this is what Fedora/RH is
exploring at the moment.

cheers,



m
-- 
 martin.langhoff@gmail.com
 martin@laptop.org -- School Server Architect
 - ask interesting questions
 - don't get distracted with shiny stuff  - working code first
 - http://wiki.laptop.org/go/User:Martinlanghoff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] status: only touch path we may need to check
  2010-01-14 15:02 ` [PATCH 2/2] status: only touch path we may need to check Nguyễn Thái Ngọc Duy
@ 2010-01-14 16:02   ` Sverre Rabbelier
  2010-01-16  4:41   ` Junio C Hamano
  1 sibling, 0 replies; 10+ messages in thread
From: Sverre Rabbelier @ 2010-01-14 16:02 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy; +Cc: git

Heya,

2010/1/14 Nguyễn Thái Ngọc Duy <pclouds@gmail.com>:
> $ time git st eclass/
> real    0m3.211s

> $ time ~/w/git/git st eclass/
> real    0m1.587s

Wow, nice reduction!

-- 
Cheers,

Sverre Rabbelier

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 0/2] Improve Git performance on big trees
  2010-01-14 15:21 ` [PATCH 0/2] Improve Git performance on big trees Martin Langhoff
@ 2010-01-15 13:36   ` Nguyen Thai Ngoc Duy
  0 siblings, 0 replies; 10+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2010-01-15 13:36 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: git

On 1/14/10, Martin Langhoff <martin.langhoff@gmail.com> wrote:
> 2010/1/14 Nguyễn Thái Ngọc Duy <pclouds@gmail.com>:
>
> > Almost 1 sec for "git rm foo" still seems too long though,
>  > probably due to writing a 9MB index.
>
>
> One of the main issues there is that the Gentoo dir tree seems to be
>  very flat. The kernel tree is huge, but much deeper, and does not have
>  a huge top-level directory -- and git handles it fairly well.

Kernel tree is about one third the size of gentoo-x86 in terms of
worktree. It takes ~1 sec to do "git status" on kernel tree and 3 secs
on gentoo-x86, quite proportional. Except that 3 secs feel much longer
than 1 :-)

Directory structure may also affect performance though. Kernel tree
only has ~2k directories while gentoo-x86 has ~20k shallow dirs.

>  Perhaps the Gentoo tree can be rearranged to be more nested? If your
>  devs strongly prefer a flat view of it, that could be arranged with
>  symlinks.
>
>  Alternatively, each port can be in its own repo, and you can make a
>  "top level repo" using git submodules -- this is what Fedora/RH is
>  exploring at the moment.

There were discussions of these in the past. And to my experience,
these ideas won't go anywhere. The number of worktree files may drop
significantly after Git migration because I believe all ChangeLog
(~13k files on total ~80k) will be gone.
-- 
Duy

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] rm: only refresh entries that we may touch
  2010-01-14 15:02 ` [PATCH 1/2] rm: only refresh entries that we may touch Nguyễn Thái Ngọc Duy
@ 2010-01-16  4:39   ` Junio C Hamano
  2010-01-16 10:58     ` Nguyen Thai Ngoc Duy
  0 siblings, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2010-01-16  4:39 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy; +Cc: git

Nguyễn Thái Ngọc Duy  <pclouds@gmail.com> writes:

> diff --git a/builtin-rm.c b/builtin-rm.c
> index 57975db..4cac3d1 100644
> --- a/builtin-rm.c
> +++ b/builtin-rm.c
> @@ -169,7 +169,6 @@ int cmd_rm(int argc, const char **argv, const char *prefix)
>  
>  	if (read_cache() < 0)
>  		die("index file corrupt");
> -	refresh_cache(REFRESH_QUIET);
>  
>  	pathspec = get_pathspec(prefix, argv);
>  	seen = NULL;
> @@ -181,6 +180,7 @@ int cmd_rm(int argc, const char **argv, const char *prefix)
>  		struct cache_entry *ce = active_cache[i];
>  		if (!match_pathspec(pathspec, ce->name, ce_namelen(ce), 0, seen))
>  			continue;
> +		refresh_cache_entry(ce, 1);

Why does this pass "1" instead of "0"?  The existing code does not give
refresh_cache() REFRESH_REALLY bit, and a patch that is marked as a pure
optimization should pass 0.  If you really mean it, please spell it as
CE_MATCH_IGNORE_VALID and justify why it is a good change in a separate
patch.

Thanks.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] status: only touch path we may need to check
  2010-01-14 15:02 ` [PATCH 2/2] status: only touch path we may need to check Nguyễn Thái Ngọc Duy
  2010-01-14 16:02   ` Sverre Rabbelier
@ 2010-01-16  4:41   ` Junio C Hamano
  1 sibling, 0 replies; 10+ messages in thread
From: Junio C Hamano @ 2010-01-16  4:41 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy; +Cc: git

Will queue this one (but not the "rm" one, yet).  Thanks.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] rm: only refresh entries that we may touch
  2010-01-16  4:39   ` Junio C Hamano
@ 2010-01-16 10:58     ` Nguyen Thai Ngoc Duy
  0 siblings, 0 replies; 10+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2010-01-16 10:58 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On 1/16/10, Junio C Hamano <gitster@pobox.com> wrote:
> Nguyễn Thái Ngọc Duy  <pclouds@gmail.com> writes:
>
>  > diff --git a/builtin-rm.c b/builtin-rm.c
>  > index 57975db..4cac3d1 100644
>  > --- a/builtin-rm.c
>  > +++ b/builtin-rm.c
>  > @@ -169,7 +169,6 @@ int cmd_rm(int argc, const char **argv, const char *prefix)
>  >
>  >       if (read_cache() < 0)
>  >               die("index file corrupt");
>  > -     refresh_cache(REFRESH_QUIET);
>  >
>  >       pathspec = get_pathspec(prefix, argv);
>  >       seen = NULL;
>  > @@ -181,6 +180,7 @@ int cmd_rm(int argc, const char **argv, const char *prefix)
>  >               struct cache_entry *ce = active_cache[i];
>  >               if (!match_pathspec(pathspec, ce->name, ce_namelen(ce), 0, seen))
>  >                       continue;
>  > +             refresh_cache_entry(ce, 1);
>
>
> Why does this pass "1" instead of "0"?  The existing code does not give
>  refresh_cache() REFRESH_REALLY bit, and a patch that is marked as a pure
>  optimization should pass 0.  If you really mean it, please spell it as
>  CE_MATCH_IGNORE_VALID and justify why it is a good change in a separate
>  patch.

"refresh_cache(REFRESH_QUIET);" does pass "0" indeed. Can you please
change it to "0"? Thanks.
-- 
Duy

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH] rm: only refresh entries that we may touch
  2010-01-14 15:02 [PATCH 0/2] Improve Git performance on big trees Nguyễn Thái Ngọc Duy
                   ` (2 preceding siblings ...)
  2010-01-14 15:21 ` [PATCH 0/2] Improve Git performance on big trees Martin Langhoff
@ 2010-01-17  8:43 ` Nguyễn Thái Ngọc Duy
  3 siblings, 0 replies; 10+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2010-01-17  8:43 UTC (permalink / raw)
  To: git, Junio C Hamano; +Cc: Nguyễn Thái Ngọc Duy

This gets rid of the whole tree cache refresh. Instead only path that
we touch will get refreshed. We may still lstat() more than needed,
but it'd be better playing safe.

This potentially reduces a large number of lstat() on big trees. Take
gentoo-x86 tree for example, which has roughly 80k files:

Unmodified Git:

$ time git rm --cached skel.ebuild
rm 'skel.ebuild'

real    0m1.441s
user    0m0.821s
sys     0m0.531s

Modified Git:

$ time ~/w/git/git rm --cached skel.ebuild
rm 'skel.ebuild'

real    0m0.941s
user    0m0.828s
sys     0m0.091s

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 In the previous attempt, refresh_cache_entry() returns a new cache_entry.
 It does not modify the_index, so tests failed.

 builtin-rm.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/builtin-rm.c b/builtin-rm.c
index 57975db..f3772c8 100644
--- a/builtin-rm.c
+++ b/builtin-rm.c
@@ -169,9 +169,10 @@ int cmd_rm(int argc, const char **argv, const char *prefix)
 
 	if (read_cache() < 0)
 		die("index file corrupt");
-	refresh_cache(REFRESH_QUIET);
 
 	pathspec = get_pathspec(prefix, argv);
+	refresh_index(&the_index, REFRESH_QUIET, pathspec, NULL, NULL);
+
 	seen = NULL;
 	for (i = 0; pathspec[i] ; i++)
 		/* nothing */;
-- 
1.6.6.181.g5ee6

^ permalink raw reply related	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2010-01-17  8:44 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-01-14 15:02 [PATCH 0/2] Improve Git performance on big trees Nguyễn Thái Ngọc Duy
2010-01-14 15:02 ` [PATCH 1/2] rm: only refresh entries that we may touch Nguyễn Thái Ngọc Duy
2010-01-16  4:39   ` Junio C Hamano
2010-01-16 10:58     ` Nguyen Thai Ngoc Duy
2010-01-14 15:02 ` [PATCH 2/2] status: only touch path we may need to check Nguyễn Thái Ngọc Duy
2010-01-14 16:02   ` Sverre Rabbelier
2010-01-16  4:41   ` Junio C Hamano
2010-01-14 15:21 ` [PATCH 0/2] Improve Git performance on big trees Martin Langhoff
2010-01-15 13:36   ` Nguyen Thai Ngoc Duy
2010-01-17  8:43 ` [PATCH] rm: only refresh entries that we may touch Nguyễn Thái Ngọc Duy

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.