* [RFC] Faster git grep.
@ 2013-07-25 18:29 Ondřej Bílka
  2013-07-25 20:08 ` Jeff King
  2013-07-25 20:41 ` Junio C Hamano
  0 siblings, 2 replies; 6+ messages in thread
From: Ondřej Bílka @ 2013-07-25 18:29 UTC (permalink / raw)
  To: git

Hi, 

When I run git grep on a big codebase (gcc), it executes slowly.
I am thinking of adding an option to speed up search time.

One solution would be to use the same trick as was done in Google Code:
build and keep a database of trigraphs and which files contain how many
of them. When a query is made, check only those files that have the
appropriate combination of trigraphs.
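
A rough sketch of the idea (Python for illustration; the helper names
and the simple set-based index are my own assumptions, not an existing
implementation):

```python
from collections import defaultdict

def trigrams(text):
    """The set of all overlapping 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def build_index(files):
    """Map each trigraph to the set of paths containing it.
    `files` is a dict of path -> content, standing in for blob data."""
    index = defaultdict(set)
    for path, content in files.items():
        for t in trigrams(content):
            index[t].add(path)
    return index

def candidates(index, all_paths, query):
    """Paths that contain every trigraph of `query`; only these need
    to be scanned by the real grep."""
    tris = trigrams(query)
    if not tris:  # query shorter than 3 chars: cannot filter
        return set(all_paths)
    return set.intersection(*(index.get(t, set()) for t in tris))
```

A literal match can only occur in a file that contains all trigraphs of
the query, so the intersection above safely narrows the files to grep.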

Updating the database would be relatively inexpensive compared to the
disk access we need to do anyway.

Space usage might be a problem, which is why I decided to go the option
route.

Comments, pointers?

Ondra


* Re: [RFC] Faster git grep.
  2013-07-25 18:29 [RFC] Faster git grep Ondřej Bílka
@ 2013-07-25 20:08 ` Jeff King
  2013-07-25 20:41 ` Junio C Hamano
  1 sibling, 0 replies; 6+ messages in thread
From: Jeff King @ 2013-07-25 20:08 UTC (permalink / raw)
  To: Ondřej Bílka; +Cc: git

On Thu, Jul 25, 2013 at 08:29:05PM +0200, Ondřej Bílka wrote:

> One solution would be to use the same trick as was done in Google Code:
> build and keep a database of trigraphs and which files contain how many
> of them. When a query is made, check only those files that have the
> appropriate combination of trigraphs.

That seems like a sensible approach.

> Updating the database would be relatively inexpensive compared to the
> disk access we need to do anyway.

Yes, I think it can be quite cheap, since you would only need to
re-index files that have changed (and git is very quick at telling you
what those files are).

> Space usage might be a problem, which is why I decided to go the
> option route.
> 
> Comments, pointers?

I think it is a good idea, but it does not need to be part of core git.
It seems more like you would want to glue together an existing
code-indexing solution (like codesearch) with git (which would provide
the list of files to index and to search).

If that proves useful in practice, but the interface is clunky for
whatever reason, then a good follow-on project could be to build support
for updating and using the index via the usual "git grep".

-Peff


* Re: [RFC] Faster git grep.
  2013-07-25 18:29 [RFC] Faster git grep Ondřej Bílka
  2013-07-25 20:08 ` Jeff King
@ 2013-07-25 20:41 ` Junio C Hamano
  2013-07-25 21:31   ` Ondřej Bílka
  1 sibling, 1 reply; 6+ messages in thread
From: Junio C Hamano @ 2013-07-25 20:41 UTC (permalink / raw)
  To: Ondřej Bílka; +Cc: git

Ondřej Bílka <neleai@seznam.cz> writes:

> One solution would be to use the same trick as was done in Google Code:
> build and keep a database of trigraphs and which files contain how many
> of them. When a query is made, check only those files that have the
> appropriate combination of trigraphs.

This depends on how you go about trying to reduce the database
overhead, I think.  For example, a very naive approach would be to
create such a trigraph hit index for each and every commit for all
paths.  When "git grep $commit $pattern" is run, you would consult
such a table with $commit and potential trigraphs derived from the
$pattern to grab the potential paths your hits _might_ be in.

But the contents of a path usually do not change in each and every
commit.  So you may want to instead index with the blob object names
(i.e. which trigraphs appear in what blobs).  But once you go that
route, your "git grep $commit $pattern" needs to read and enumerate
all the blobs that appear in $commit's tree, and see which blobs may
potentially have hits.  Then you would need to build an index every
time you make a new commit for blobs whose trigraphs have not been
counted.
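
The blob-keyed variant could look roughly like this (Python for
illustration; the index layout and helper names are my assumptions,
with `git ls-tree -r` enumerating the commit's tree):

```python
import subprocess

def blobs_in_commit(commit):
    """Yield (blob_sha, path) for every file in a commit's tree,
    via `git ls-tree -r` (assumes it runs inside a git repository)."""
    out = subprocess.check_output(["git", "ls-tree", "-r", commit],
                                  text=True)
    for line in out.splitlines():
        meta, path = line.split("\t", 1)
        mode, otype, sha = meta.split()
        if otype == "blob":
            yield sha, path

def candidate_paths(blob_index, blobs, query_trigraphs):
    """Keep only the paths whose blob contains every query trigraph.
    `blob_index` maps blob sha -> set of trigraphs in that blob (this
    sketch assumes every blob has already been indexed); since a blob's
    content never changes, each entry is computed once and reused
    across all commits that reference the blob."""
    return [path for sha, path in blobs
            if query_trigraphs <= blob_index.get(sha, frozenset())]
```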

Nice thing is that once a blob (or a commit for that matter) is
created and its object name is known, its contents will not change,
so you can index once and reuse it many times.  But I am not yet
convinced if pre-indexing is an overall win, compared to the cost of
maintaining such a database.


* Re: [RFC] Faster git grep.
  2013-07-25 20:41 ` Junio C Hamano
@ 2013-07-25 21:31   ` Ondřej Bílka
  2013-07-26  1:28     ` Junio C Hamano
  0 siblings, 1 reply; 6+ messages in thread
From: Ondřej Bílka @ 2013-07-25 21:31 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Thu, Jul 25, 2013 at 01:41:13PM -0700, Junio C Hamano wrote:
> Ondřej Bílka <neleai@seznam.cz> writes:
> 
> > One solution would be to use the same trick as was done in Google
> > Code: build and keep a database of trigraphs and which files contain
> > how many of them. When a query is made, check only those files that
> > have the appropriate combination of trigraphs.
> 
> This depends on how you go about trying to reduce the database
> overhead, I think.  For example, a very naive approach would be to
> create such a trigraph hit index for each and every commit for all
> paths.  When "git grep $commit $pattern" is run, you would consult
> such a table with $commit and potential trigraphs derived from the
> $pattern to grab the potential paths your hits _might_ be in.
>
Do you think that "git grep $commit $pattern" is run in more than 1% of
cases, compared to "git grep $pattern"?

If grepping a random commit in history is an important use case, then
keeping db information in history makes sense. Otherwise, just having a
database for the current version and updating it on the fly as the
version changes is enough.
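
A sketch of such an on-the-fly update (Python for illustration; the
function names are mine). Git itself reports which paths changed, and
only those get re-indexed:

```python
import subprocess

def changed_paths(old_rev, new_rev):
    """Paths that differ between two revisions; git computes this
    cheaply, so only these files need re-indexing."""
    out = subprocess.check_output(
        ["git", "diff", "--name-only", old_rev, new_rev], text=True)
    return out.splitlines()

def trigrams(text):
    """The set of all overlapping 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def reindex(index, path, old_content, new_content):
    """Update a trigraph -> set-of-paths index in place for one file."""
    for t in trigrams(old_content) - trigrams(new_content):
        index.get(t, set()).discard(path)
    for t in trigrams(new_content) - trigrams(old_content):
        index.setdefault(t, set()).add(path)
```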
> But the contents of a path usually do not change in each and every
> commit.  So you may want to instead index with the blob object names
> (i.e. which trigraphs appear in what blobs).  But once you go that
> route, your "git grep $commit $pattern" needs to read and enumerate
> all the blobs that appear in $commit's tree, and see which blobs may
> potentially have hits.  Then you would need to build an index every
> time you make a new commit for blobs whose trigraphs have not been
> counted.
> 
> Nice thing is that once a blob (or a commit for that matter) is
> created and its object name is known, its contents will not change,
> so you can index once and reuse it many times.  But I am not yet
> convinced if pre-indexing is an overall win, compared to the cost of
> maintaining such a database.


* Re: [RFC] Faster git grep.
  2013-07-25 21:31   ` Ondřej Bílka
@ 2013-07-26  1:28     ` Junio C Hamano
  2013-07-26  5:45       ` Ondřej Bílka
  0 siblings, 1 reply; 6+ messages in thread
From: Junio C Hamano @ 2013-07-26  1:28 UTC (permalink / raw)
  To: Ondřej Bílka; +Cc: git

Ondřej Bílka <neleai@seznam.cz> writes:

> If grepping a random commit in history is an important use case, then
> keeping db information in history makes sense. Otherwise, just having a
> database for the current version and updating it on the fly as the
> version changes is enough.

Will you reindex every time I do "git checkout next; git checkout
master"?


* Re: [RFC] Faster git grep.
  2013-07-26  1:28     ` Junio C Hamano
@ 2013-07-26  5:45       ` Ondřej Bílka
  0 siblings, 0 replies; 6+ messages in thread
From: Ondřej Bílka @ 2013-07-26  5:45 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Thu, Jul 25, 2013 at 06:28:50PM -0700, Junio C Hamano wrote:
> Ondřej Bílka <neleai@seznam.cz> writes:
> 
> > If grepping a random commit in history is an important use case, then
> > keeping db information in history makes sense. Otherwise, just having
> > a database for the current version and updating it on the fly as the
> > version changes is enough.
> 
> Will you reindex every time I do "git checkout next; git checkout
> master"?

This is a separate issue, as you would need to change the index anyway;
the number of changes would be proportional to the size of the diff, so
you would not gain much. A possible problem here is that you would end
up changing many files. A possible solution is to do the rebuilding in
the background.

For frequent switching between branches that are vastly different, the
best solution seems to me to be keeping a separate index for each
branch.

Also, the data structure is: trigraph -> list of files with counts.
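
That structure could be sketched like this (Python for illustration;
the names are mine), with one such index kept per branch:

```python
from collections import Counter, defaultdict

def count_trigrams(text):
    """Multiset of the overlapping 3-character substrings of text."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def build_index(files):
    """The structure described above: trigraph -> {path: count}.
    `files` maps path -> content."""
    index = defaultdict(Counter)
    for path, content in files.items():
        for t, n in count_trigrams(content).items():
            index[t][path] += n
    return index

# One index per branch, e.g. (hypothetical):
# indexes = {"master": build_index(files_on_master),
#            "next": build_index(files_on_next)}
```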

