* [PATCH 0/2] Improve Git performance on big trees
@ 2010-01-14 15:02 Nguyễn Thái Ngọc Duy
2010-01-14 15:02 ` [PATCH 1/2] rm: only refresh entries that we may touch Nguyễn Thái Ngọc Duy
` (3 more replies)
0 siblings, 4 replies; 10+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2010-01-14 15:02 UTC (permalink / raw)
To: git; +Cc: Nguyễn Thái Ngọc Duy
The discussion about using Git on Gentoo comes up again [1] and it
looks like we can improve Git a bit if user only works on some
subdirectories. In such cases, Git still does whole-tree check, which
takes quite some time. "git status" takes 3 seconds on my machine, but
I suspect lstat() is not the only culprit, "git diff --exit-code" is
about 1 sec.
These patches makes "git rm <path>" and "git status <path>" a bit
faster. Almost 1 sec for "git rm foo" still seems too long though,
probably due to writing a 9MB index.
[1] http://article.gmane.org/gmane.linux.gentoo.devel/64522
Nguyễn Thái Ngọc Duy (2):
rm: only refresh entries that we may touch
status: only touch path we may need to check
builtin-commit.c | 2 +-
builtin-rm.c | 2 +-
wt-status.c | 2 +-
3 files changed, 3 insertions(+), 3 deletions(-)
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH 1/2] rm: only refresh entries that we may touch
2010-01-14 15:02 [PATCH 0/2] Improve Git performance on big trees Nguyễn Thái Ngọc Duy
@ 2010-01-14 15:02 ` Nguyễn Thái Ngọc Duy
2010-01-16 4:39 ` Junio C Hamano
2010-01-14 15:02 ` [PATCH 2/2] status: only touch path we may need to check Nguyễn Thái Ngọc Duy
` (2 subsequent siblings)
3 siblings, 1 reply; 10+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2010-01-14 15:02 UTC (permalink / raw)
To: git; +Cc: Nguyễn Thái Ngọc Duy
This gets rid of the whole tree cache refresh. Instead only path that
we touch will get refreshed. We may still lstat() more than needed,
but it'd be better playing safe.
This potentially reduces a large number of lstat() on big trees. Take
gentoo-x86 tree for example, which has roughly 80k files:
Unmodified Git:
$ time git rm --cached skel.ChangeLog
rm 'skel.ChangeLog'
real 0m1.323s
user 0m0.806s
sys 0m0.517s
Modified Git:
$ time git rm --cached skel.ChangeLog
rm 'skel.ChangeLog'
real 0m0.903s
user 0m0.780s
sys 0m0.100s
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
builtin-rm.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/builtin-rm.c b/builtin-rm.c
index 57975db..4cac3d1 100644
--- a/builtin-rm.c
+++ b/builtin-rm.c
@@ -169,7 +169,6 @@ int cmd_rm(int argc, const char **argv, const char *prefix)
if (read_cache() < 0)
die("index file corrupt");
- refresh_cache(REFRESH_QUIET);
pathspec = get_pathspec(prefix, argv);
seen = NULL;
@@ -181,6 +180,7 @@ int cmd_rm(int argc, const char **argv, const char *prefix)
struct cache_entry *ce = active_cache[i];
if (!match_pathspec(pathspec, ce->name, ce_namelen(ce), 0, seen))
continue;
+ refresh_cache_entry(ce, 1);
add_list(ce->name);
}
--
1.6.6.181.g5ee6
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 2/2] status: only touch path we may need to check
2010-01-14 15:02 [PATCH 0/2] Improve Git performance on big trees Nguyễn Thái Ngọc Duy
2010-01-14 15:02 ` [PATCH 1/2] rm: only refresh entries that we may touch Nguyễn Thái Ngọc Duy
@ 2010-01-14 15:02 ` Nguyễn Thái Ngọc Duy
2010-01-14 16:02 ` Sverre Rabbelier
2010-01-16 4:41 ` Junio C Hamano
2010-01-14 15:21 ` [PATCH 0/2] Improve Git performance on big trees Martin Langhoff
2010-01-17 8:43 ` [PATCH] rm: only refresh entries that we may touch Nguyễn Thái Ngọc Duy
3 siblings, 2 replies; 10+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2010-01-14 15:02 UTC (permalink / raw)
To: git; +Cc: Nguyễn Thái Ngọc Duy
This patch gets rid of whole-tree cache refresh and untracked file
search. Instead only specified path will be looked at.
Again some numbers on gentoo-x86, ~80k files:
Unmodified Git:
$ time git st eclass/
nothing to commit (working directory clean)
real 0m3.211s
user 0m1.977s
sys 0m1.135s
Modified Git:
$ time ~/w/git/git st eclass/
nothing to commit (working directory clean)
real 0m1.587s
user 0m1.426s
sys 0m0.111s
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
builtin-commit.c | 2 +-
wt-status.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/builtin-commit.c b/builtin-commit.c
index 6199db7..dae6cb3 100644
--- a/builtin-commit.c
+++ b/builtin-commit.c
@@ -1024,7 +1024,7 @@ int cmd_status(int argc, const char **argv, const char *prefix)
s.pathspec = get_pathspec(prefix, argv);
read_cache();
- refresh_cache(REFRESH_QUIET|REFRESH_UNMERGED);
+ refresh_index(&the_index, REFRESH_QUIET|REFRESH_UNMERGED, s.pathspec, NULL, NULL);
s.is_initial = get_sha1(s.reference, sha1) ? 1 : 0;
s.in_merge = in_merge;
wt_status_collect(&s);
diff --git a/wt-status.c b/wt-status.c
index 5d56988..65feb29 100644
--- a/wt-status.c
+++ b/wt-status.c
@@ -343,7 +343,7 @@ static void wt_status_collect_untracked(struct wt_status *s)
DIR_SHOW_OTHER_DIRECTORIES | DIR_HIDE_EMPTY_DIRECTORIES;
setup_standard_excludes(&dir);
- fill_directory(&dir, NULL);
+ fill_directory(&dir, s->pathspec);
for (i = 0; i < dir.nr; i++) {
struct dir_entry *ent = dir.entries[i];
if (!cache_name_is_other(ent->name, ent->len))
--
1.6.6.181.g5ee6
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH 0/2] Improve Git performance on big trees
2010-01-14 15:02 [PATCH 0/2] Improve Git performance on big trees Nguyễn Thái Ngọc Duy
2010-01-14 15:02 ` [PATCH 1/2] rm: only refresh entries that we may touch Nguyễn Thái Ngọc Duy
2010-01-14 15:02 ` [PATCH 2/2] status: only touch path we may need to check Nguyễn Thái Ngọc Duy
@ 2010-01-14 15:21 ` Martin Langhoff
2010-01-15 13:36 ` Nguyen Thai Ngoc Duy
2010-01-17 8:43 ` [PATCH] rm: only refresh entries that we may touch Nguyễn Thái Ngọc Duy
3 siblings, 1 reply; 10+ messages in thread
From: Martin Langhoff @ 2010-01-14 15:21 UTC (permalink / raw)
To: Nguyễn Thái Ngọc Duy; +Cc: git
2010/1/14 Nguyễn Thái Ngọc Duy <pclouds@gmail.com>:
> Almost 1 sec for "git rm foo" still seems too long though,
> probably due to writing a 9MB index.
One of the main issues there is that the Gentoo dir tree seems to be
very flat. The kernel tree is huge, but much deeper, and does not have
a huge top-level directory -- and git handles it fairly well.
Perhaps the Gentoo tree can be rearranged to be more nested? If your
devs strongly prefer a flat view of it, that could be arranged with
symlinks.
Alternatively, each port can be in its own repo, and you can make a
"top level repo" using git submodules -- this is what Fedora/RH is
exploring at the moment.
cheers,
m
--
martin.langhoff@gmail.com
martin@laptop.org -- School Server Architect
- ask interesting questions
- don't get distracted with shiny stuff - working code first
- http://wiki.laptop.org/go/User:Martinlanghoff
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 2/2] status: only touch path we may need to check
2010-01-14 15:02 ` [PATCH 2/2] status: only touch path we may need to check Nguyễn Thái Ngọc Duy
@ 2010-01-14 16:02 ` Sverre Rabbelier
2010-01-16 4:41 ` Junio C Hamano
1 sibling, 0 replies; 10+ messages in thread
From: Sverre Rabbelier @ 2010-01-14 16:02 UTC (permalink / raw)
To: Nguyễn Thái Ngọc Duy; +Cc: git
Heya,
2010/1/14 Nguyễn Thái Ngọc Duy <pclouds@gmail.com>:
> $ time git st eclass/
> real 0m3.211s
> $ time ~/w/git/git st eclass/
> real 0m1.587s
Wow, nice reduction!
--
Cheers,
Sverre Rabbelier
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 0/2] Improve Git performance on big trees
2010-01-14 15:21 ` [PATCH 0/2] Improve Git performance on big trees Martin Langhoff
@ 2010-01-15 13:36 ` Nguyen Thai Ngoc Duy
0 siblings, 0 replies; 10+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2010-01-15 13:36 UTC (permalink / raw)
To: Martin Langhoff; +Cc: git
On 1/14/10, Martin Langhoff <martin.langhoff@gmail.com> wrote:
> 2010/1/14 Nguyễn Thái Ngọc Duy <pclouds@gmail.com>:
>
> > Almost 1 sec for "git rm foo" still seems too long though,
> > probably due to writing a 9MB index.
>
>
> One of the main issues there is that the Gentoo dir tree seems to be
> very flat. The kernel tree is huge, but much deeper, and does not have
> a huge top-level directory -- and git handles it fairly well.
Kernel tree is about one third the size of gentoo-x86 in terms of
worktree. It takes ~1 sec to do "git status" on kernel tree and 3 secs
on gentoo-x86, quite proportional. Except that 3 secs feel much longer
than 1 :-)
Directory structure may also affect performance though. Kernel tree
only has ~2k directories while gentoo-x86 has ~20k shallow dirs.
> Perhaps the Gentoo tree can be rearranged to be more nested? If your
> devs strongly prefer a flat view of it, that could be arranged with
> symlinks.
>
> Alternatively, each port can be in its own repo, and you can make a
> "top level repo" using git submodules -- this is what Fedora/RH is
> exploring at the moment.
There were discussions of these in the past. And to my experience,
these ideas won't go anywhere. The number of worktree files may drop
significantly after Git migration because I believe all ChangeLog
(~13k files on total ~80k) will be gone.
--
Duy
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 1/2] rm: only refresh entries that we may touch
2010-01-14 15:02 ` [PATCH 1/2] rm: only refresh entries that we may touch Nguyễn Thái Ngọc Duy
@ 2010-01-16 4:39 ` Junio C Hamano
2010-01-16 10:58 ` Nguyen Thai Ngoc Duy
0 siblings, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2010-01-16 4:39 UTC (permalink / raw)
To: Nguyễn Thái Ngọc Duy; +Cc: git
Nguyễn Thái Ngọc Duy <pclouds@gmail.com> writes:
> diff --git a/builtin-rm.c b/builtin-rm.c
> index 57975db..4cac3d1 100644
> --- a/builtin-rm.c
> +++ b/builtin-rm.c
> @@ -169,7 +169,6 @@ int cmd_rm(int argc, const char **argv, const char *prefix)
>
> if (read_cache() < 0)
> die("index file corrupt");
> - refresh_cache(REFRESH_QUIET);
>
> pathspec = get_pathspec(prefix, argv);
> seen = NULL;
> @@ -181,6 +180,7 @@ int cmd_rm(int argc, const char **argv, const char *prefix)
> struct cache_entry *ce = active_cache[i];
> if (!match_pathspec(pathspec, ce->name, ce_namelen(ce), 0, seen))
> continue;
> + refresh_cache_entry(ce, 1);
Why does this pass "1" instead of "0"? The existing code does not give
refresh_cache() REFRESH_REALLY bit, and a patch that is marked as a pure
optimization should pass 0. If you really mean it, please spell it as
CE_MATCH_IGNORE_VALID and justify why it is a good change in a separate
patch.
Thanks.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 2/2] status: only touch path we may need to check
2010-01-14 15:02 ` [PATCH 2/2] status: only touch path we may need to check Nguyễn Thái Ngọc Duy
2010-01-14 16:02 ` Sverre Rabbelier
@ 2010-01-16 4:41 ` Junio C Hamano
1 sibling, 0 replies; 10+ messages in thread
From: Junio C Hamano @ 2010-01-16 4:41 UTC (permalink / raw)
To: Nguyễn Thái Ngọc Duy; +Cc: git
Will queue this one (but not the "rm" one, yet). Thanks.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 1/2] rm: only refresh entries that we may touch
2010-01-16 4:39 ` Junio C Hamano
@ 2010-01-16 10:58 ` Nguyen Thai Ngoc Duy
0 siblings, 0 replies; 10+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2010-01-16 10:58 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
On 1/16/10, Junio C Hamano <gitster@pobox.com> wrote:
> Nguyễn Thái Ngọc Duy <pclouds@gmail.com> writes:
>
> > diff --git a/builtin-rm.c b/builtin-rm.c
> > index 57975db..4cac3d1 100644
> > --- a/builtin-rm.c
> > +++ b/builtin-rm.c
> > @@ -169,7 +169,6 @@ int cmd_rm(int argc, const char **argv, const char *prefix)
> >
> > if (read_cache() < 0)
> > die("index file corrupt");
> > - refresh_cache(REFRESH_QUIET);
> >
> > pathspec = get_pathspec(prefix, argv);
> > seen = NULL;
> > @@ -181,6 +180,7 @@ int cmd_rm(int argc, const char **argv, const char *prefix)
> > struct cache_entry *ce = active_cache[i];
> > if (!match_pathspec(pathspec, ce->name, ce_namelen(ce), 0, seen))
> > continue;
> > + refresh_cache_entry(ce, 1);
>
>
> Why does this pass "1" instead of "0"? The existing code does not give
> refresh_cache() REFRESH_REALLY bit, and a patch that is marked as a pure
> optimization should pass 0. If you really mean it, please spell it as
> CE_MATCH_IGNORE_VALID and justify why it is a good change in a separate
> patch.
"refresh_cache(REFRESH_QUIET);" does pass "0" indeed. Can you please
change it to "0"? Thanks.
--
Duy
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH] rm: only refresh entries that we may touch
2010-01-14 15:02 [PATCH 0/2] Improve Git performance on big trees Nguyễn Thái Ngọc Duy
` (2 preceding siblings ...)
2010-01-14 15:21 ` [PATCH 0/2] Improve Git performance on big trees Martin Langhoff
@ 2010-01-17 8:43 ` Nguyễn Thái Ngọc Duy
3 siblings, 0 replies; 10+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2010-01-17 8:43 UTC (permalink / raw)
To: git, Junio C Hamano; +Cc: Nguyễn Thái Ngọc Duy
This gets rid of the whole tree cache refresh. Instead only path that
we touch will get refreshed. We may still lstat() more than needed,
but it'd be better playing safe.
This potentially reduces a large number of lstat() on big trees. Take
gentoo-x86 tree for example, which has roughly 80k files:
Unmodified Git:
$ time git rm --cached skel.ebuild
rm 'skel.ebuild'
real 0m1.441s
user 0m0.821s
sys 0m0.531s
Modified Git:
$ time ~/w/git/git rm --cached skel.ebuild
rm 'skel.ebuild'
real 0m0.941s
user 0m0.828s
sys 0m0.091s
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
In the previous attempt, refresh_cache_entry() returns a new cache_entry.
It does not modify the_index, so tests failed.
builtin-rm.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)
diff --git a/builtin-rm.c b/builtin-rm.c
index 57975db..f3772c8 100644
--- a/builtin-rm.c
+++ b/builtin-rm.c
@@ -169,9 +169,10 @@ int cmd_rm(int argc, const char **argv, const char *prefix)
if (read_cache() < 0)
die("index file corrupt");
- refresh_cache(REFRESH_QUIET);
pathspec = get_pathspec(prefix, argv);
+ refresh_index(&the_index, REFRESH_QUIET, pathspec, NULL, NULL);
+
seen = NULL;
for (i = 0; pathspec[i] ; i++)
/* nothing */;
--
1.6.6.181.g5ee6
^ permalink raw reply related [flat|nested] 10+ messages in thread
end of thread, other threads:[~2010-01-17 8:44 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-01-14 15:02 [PATCH 0/2] Improve Git performance on big trees Nguyễn Thái Ngọc Duy
2010-01-14 15:02 ` [PATCH 1/2] rm: only refresh entries that we may touch Nguyễn Thái Ngọc Duy
2010-01-16 4:39 ` Junio C Hamano
2010-01-16 10:58 ` Nguyen Thai Ngoc Duy
2010-01-14 15:02 ` [PATCH 2/2] status: only touch path we may need to check Nguyễn Thái Ngọc Duy
2010-01-14 16:02 ` Sverre Rabbelier
2010-01-16 4:41 ` Junio C Hamano
2010-01-14 15:21 ` [PATCH 0/2] Improve Git performance on big trees Martin Langhoff
2010-01-15 13:36 ` Nguyen Thai Ngoc Duy
2010-01-17 8:43 ` [PATCH] rm: only refresh entries that we may touch Nguyễn Thái Ngọc Duy
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.