All of lore.kernel.org
 help / color / mirror / Atom feed
* git-grep while excluding files in a blacklist
@ 2012-01-17  9:14 Dov Grobgeld
  2012-01-17  9:19 ` Nguyen Thai Ngoc Duy
  0 siblings, 1 reply; 43+ messages in thread
From: Dov Grobgeld @ 2012-01-17  9:14 UTC (permalink / raw)
  To: git

Does git-grep allow searching for a pattern in all files *except*
files matching a pattern. E.g. in our project we have multiple DLL's
in git, but when searching I would like to exclude these for speed. Is
that possible with git-grep?

Thanks,
Dov

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: git-grep while excluding files in a blacklist
  2012-01-17  9:14 git-grep while excluding files in a blacklist Dov Grobgeld
@ 2012-01-17  9:19 ` Nguyen Thai Ngoc Duy
  2012-01-17 20:09   ` Junio C Hamano
  0 siblings, 1 reply; 43+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-01-17  9:19 UTC (permalink / raw)
  To: Dov Grobgeld; +Cc: git

On Tue, Jan 17, 2012 at 4:14 PM, Dov Grobgeld <dov.grobgeld@gmail.com> wrote:
> Does git-grep allow searching for a pattern in all files *except*
> files matching a pattern. E.g. in our project we have multiple DLL's
> in git, but when searching I would like to exclude these for speed. Is
> that possible with git-grep?

Not from command line, no. You can put "*.dll" to .gitignore file then
"git grep --exclude-standard".
-- 
Duy

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: git-grep while excluding files in a blacklist
  2012-01-17  9:19 ` Nguyen Thai Ngoc Duy
@ 2012-01-17 20:09   ` Junio C Hamano
  2012-01-18  1:24     ` Nguyen Thai Ngoc Duy
  0 siblings, 1 reply; 43+ messages in thread
From: Junio C Hamano @ 2012-01-17 20:09 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: Dov Grobgeld, git

Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes:

> On Tue, Jan 17, 2012 at 4:14 PM, Dov Grobgeld <dov.grobgeld@gmail.com> wrote:
>> Does git-grep allow searching for a pattern in all files *except*
>> files matching a pattern. E.g. in our project we have multiple DLL's
>> in git, but when searching I would like to exclude these for speed. Is
>> that possible with git-grep?
>
> Not from command line, no. You can put "*.dll" to .gitignore file then
> "git grep --exclude-standard".

No rush, but is this something we would eventually want to handle with the
negative pathspec?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: git-grep while excluding files in a blacklist
  2012-01-17 20:09   ` Junio C Hamano
@ 2012-01-18  1:24     ` Nguyen Thai Ngoc Duy
  2012-01-23  9:37       ` [PATCH] Don't search files with an unset "grep" attribute conrad.irwin
  0 siblings, 1 reply; 43+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-01-18  1:24 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Dov Grobgeld, git

On Wed, Jan 18, 2012 at 3:09 AM, Junio C Hamano <gitster@pobox.com> wrote:
> Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes:
>
>> On Tue, Jan 17, 2012 at 4:14 PM, Dov Grobgeld <dov.grobgeld@gmail.com> wrote:
>>> Does git-grep allow searching for a pattern in all files *except*
>>> files matching a pattern. E.g. in our project we have multiple DLL's
>>> in git, but when searching I would like to exclude these for speed. Is
>>> that possible with git-grep?
>>
>> Not from command line, no. You can put "*.dll" to .gitignore file then
>> "git grep --exclude-standard".
>
> No rush, but is this something we would eventually want to handle with the
> negative pathspec?

Definitely. But because I'm stuck at adding "seen" feature from
match_pathspec_depth to tree_entry_interesting, that probably won't
happen this year. Adding "--exclude=<pattern>" to git-grep is a more
plausible option.
-- 
Duy

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH] Don't search files with an unset "grep" attribute
  2012-01-18  1:24     ` Nguyen Thai Ngoc Duy
@ 2012-01-23  9:37       ` conrad.irwin
  2012-01-23 18:33         ` Junio C Hamano
  0 siblings, 1 reply; 43+ messages in thread
From: conrad.irwin @ 2012-01-23  9:37 UTC (permalink / raw)
  To: git; +Cc: Nguyen Thai Ngoc Duy, Conrad Irwin, Junio C Hamano, Dov Grobgeld

From: Conrad Irwin <conrad.irwin@gmail.com>

Nguyen Thai Ngoc Duy <pclouds <at> gmail.com> writes:
> Definitely. But because I'm stuck at adding "seen" feature from
> match_pathspec_depth to tree_entry_interesting, that probably won't happen
> this year. Adding "--exclude=<pattern>" to git-grep is a more plausible
> option.

I think the .gitattributes mechanism is a better place to configure this, as the
files I with to exclude from grep are always the same files (test fixtures
mainly, minified library code also). I often want to make git-diff be quieter
about them too, so I'll be setting the -diff attribute already.

I've attached a patch, which adds the "grep" attribute, which can (for now) only
be unset. This has the same effect for me as my original patch [1], but should
also improve life for people with large blobs as described above.

In future we could extend the attribute to give a meaning to set values, but I'm
not yet sure what that would look like. We could also add an --exclude=<foo>
flag to grep for people who want to configure grep on a per-run basis, but I
think that is a much less common desire.

The failing test attached to this patch is a symptom of a larger issue with the
way that git-grep handles objects that are not at the root of the repository. A
more obvious symptom can be revealed by comparing the output of:

  git grep int HEAD:./builtin

  cd builtin; git grep int HEAD:./

The problem is that grep doesn't correctly separate the path from the revision
part of the spec. It's currently unobvious to me how to fix this but I hope
someone more familiar with the code (Nguyen or Junio) might be able to see a
way.

Yours,
Conrad

[1] http://thread.gmane.org/gmane.comp.version-control.git/179299

---8<---


To set -grep on an file in .gitattributes will cause that file to be
skipped completely while grepping. This can be used to reduce the number
of false positives when your repository contains files that are
uninteresting to you, for example test fixtures, dlls or minified source
code.

The other approach considered was to allow an --exclude flag to grep at runtime,
however that better serves the less common use-case of wanting to customise the
list of files per-invocation rather than statically.

Signed-off-by: Conrad Irwin <conrad.irwin@gmail.com>
---
 Documentation/git-grep.txt      |    7 +++++++
 Documentation/gitattributes.txt |    9 +++++++++
 builtin/grep.c                  |    8 ++++++++
 grep.c                          |   21 +++++++++++++++++++++
 grep.h                          |    1 +
 t/t7810-grep.sh                 |   30 ++++++++++++++++++++++++++++++
 6 files changed, 76 insertions(+), 0 deletions(-)

diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
index 6a8b1e3..7c74165 100644
--- a/Documentation/git-grep.txt
+++ b/Documentation/git-grep.txt
@@ -242,6 +242,13 @@ OPTIONS
 	If given, limit the search to paths matching at least one pattern.
 	Both leading paths match and glob(7) patterns are supported.
 
+ATTRIBUTES
+----------
+
+grep::
+	If the grep attribute is unset on a file using the linkgit:gitattributes[1]
+	mechanism, then that file will not be searched.
+
 Examples
 --------
 
diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index a85b187..3ffcec7 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -869,6 +869,15 @@ If this attribute is not set or has an invalid value, the value of the
 `gui.encoding` configuration variable is used instead
 (See linkgit:git-config[1]).
 
+Configuring files to search
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+`grep`
+^^^^^^
+
+If the attribute `grep` is unset for a file then linkgit:git-grep[1]
+will ignore that file while searching for matches.
+
 
 USING MACRO ATTRIBUTES
 ----------------------
diff --git a/builtin/grep.c b/builtin/grep.c
index 9ce064a..9f8dfc0 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -398,6 +398,10 @@ static int grep_sha1(struct grep_opt *opt, const unsigned char *sha1,
 	struct strbuf pathbuf = STRBUF_INIT;
 	char *name;
 
+	if (!should_grep_path(opt, filename + tree_name_len)) {
+		return 0;
+	}
+
 	if (opt->relative && opt->prefix_length) {
 		quote_path_relative(filename + tree_name_len, -1, &pathbuf,
 				    opt->prefix);
@@ -464,6 +468,10 @@ static int grep_file(struct grep_opt *opt, const char *filename)
 	struct strbuf buf = STRBUF_INIT;
 	char *name;
 
+	if (!should_grep_path(opt, filename)) {
+		return 0;
+	}
+
 	if (opt->relative && opt->prefix_length)
 		quote_path_relative(filename, -1, &buf, opt->prefix);
 	else
diff --git a/grep.c b/grep.c
index 486230b..e948576 100644
--- a/grep.c
+++ b/grep.c
@@ -1,5 +1,6 @@
 #include "cache.h"
 #include "grep.h"
+#include "attr.h"
 #include "userdiff.h"
 #include "xdiff-interface.h"
 
@@ -829,6 +830,26 @@ static inline void grep_attr_unlock(struct grep_opt *opt)
 #define grep_attr_unlock(opt)
 #endif
 
+static struct git_attr_check attr_check[1];
+static void setup_attr_check(void)
+{
+	if (attr_check[0].attr)
+		return; /* already done */
+	attr_check[0].attr = git_attr("grep");
+}
+int should_grep_path(struct grep_opt *opt, const char *name) {
+	int ret = 1;
+
+	grep_attr_lock(opt);
+	setup_attr_check();
+	git_check_attr(name, 1, attr_check);
+	if (ATTR_FALSE(attr_check[0].value))
+		ret = 0;
+	grep_attr_unlock(opt);
+
+	return ret;
+}
+
 static int match_funcname(struct grep_opt *opt, const char *name, char *bol, char *eol)
 {
 	xdemitconf_t *xecfg = opt->priv;
diff --git a/grep.h b/grep.h
index fb205f3..266002d 100644
--- a/grep.h
+++ b/grep.h
@@ -129,6 +129,7 @@ extern void append_header_grep_pattern(struct grep_opt *, enum grep_header_field
 extern void compile_grep_patterns(struct grep_opt *opt);
 extern void free_grep_patterns(struct grep_opt *opt);
 extern int grep_buffer(struct grep_opt *opt, const char *name, char *buf, unsigned long size);
+extern int should_grep_path(struct grep_opt *opt, const char *name);
 
 extern struct grep_opt *grep_opt_dup(const struct grep_opt *opt);
 extern int grep_threads_ok(const struct grep_opt *opt);
diff --git a/t/t7810-grep.sh b/t/t7810-grep.sh
index 7ba5b16..c991518 100755
--- a/t/t7810-grep.sh
+++ b/t/t7810-grep.sh
@@ -871,4 +871,34 @@ test_expect_success 'mimic ack-grep --group' '
 	test_cmp expected actual
 '
 
+test_expect_success 'with -grep attribute' '
+	echo "hello.c -grep" >.gitattributes &&
+	test_must_fail git grep printf &&
+	rm .gitattributes
+'
+
+test_expect_success 'with -grep attribute on specific file' '
+	echo "hello.c -grep" >.gitattributes &&
+	test_must_fail git grep printf hello.c &&
+	rm .gitattributes
+'
+
+test_expect_success 'with -grep attribute on specific revision' '
+	echo "hello.c -grep" >.gitattributes &&
+	test_must_fail git grep printf HEAD &&
+	rm .gitattributes
+'
+
+test_expect_success 'with -grep attribute on specific file/revision' '
+	echo "hello.c -grep" >.gitattributes &&
+	test_must_fail git grep printf HEAD hello.c &&
+	rm .gitattributes
+'
+
+test_expect_failure 'with -grep attribute on specific tree' '
+	echo "hello.c -grep" >.gitattributes &&
+	test_must_fail git grep printf HEAD:hello.c &&
+	rm .gitattributes
+'
+
 test_done
-- 
1.7.9.rc2.1.g1fdd3

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH] Don't search files with an unset "grep" attribute
  2012-01-23  9:37       ` [PATCH] Don't search files with an unset "grep" attribute conrad.irwin
@ 2012-01-23 18:33         ` Junio C Hamano
  2012-01-23 22:59           ` Conrad Irwin
  0 siblings, 1 reply; 43+ messages in thread
From: Junio C Hamano @ 2012-01-23 18:33 UTC (permalink / raw)
  To: conrad.irwin; +Cc: git, Nguyen Thai Ngoc Duy, Dov Grobgeld

conrad.irwin@gmail.com writes:

> ---8<---
>
> To set -grep on an file in .gitattributes will cause that file to be
> skipped completely while grepping. This can be used to reduce the number
> of false positives when your repository contains files that are
> uninteresting to you, for example test fixtures, dlls or minified source
> code.

Please reword this to describe the problem being solved first (why it
needs to be solved, in what situation you cannot live without the
feature), and then describe the approach you used to solve it.

Plain "grep" does this:

	$ grep world hello.*
	hello.c: printf("Hello, world.\n");
        Binary file hello.o matches

in order to avoid uselessly spewing garbage at you while reminding you
that the command is not showing the whole truth and you _might_ want to
look into the elided file if you wanted to with "grep -a world hello.o".
Compared to this, it feels that the result your patch goes too far and
hides the truth a bit too much to my taste. Maybe it is just me.

Perhaps you, or all participants of your particular project, usually do
not want to see any grep hits from minified.js, but you may still want to
be able to say "I want to dig deeper and make sure I have copyright
strings in all files", for example.  It is unclear how you envision to
support such a use case building on top of this patch.

Your "attributes only" is not an acceptable solution in the longer run,
even though it is a good first step (i.e. "attributes first and other
enhancement later"). There should be an easy way to get the best of both
worlds.

> The other approach considered was to allow an --exclude flag to grep at
> runtime, however that better serves the less common use-case of wanting
> to customise the list of files per-invocation rather than statically.

I doubt that it is justifiable to call per-invocation "the less common".

By the way, if the uninteresting ones are dll and minified.js, I wonder
why it is insufficient to mark them binary, i.e. uninteresting for the
purpose of all textual operations not just grep but also for diff.

I am *not* going to ask why they are treated as source and tracked by git
to begin with.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH] Don't search files with an unset "grep" attribute
  2012-01-23 18:33         ` Junio C Hamano
@ 2012-01-23 22:59           ` Conrad Irwin
  2012-01-24  6:59             ` Junio C Hamano
  0 siblings, 1 reply; 43+ messages in thread
From: Conrad Irwin @ 2012-01-23 22:59 UTC (permalink / raw)
  To: git; +Cc: Nguyen Thai Ngoc Duy, Conrad Irwin, Junio C Hamano, Dov Grobgeld

Junio C Hamano <gitster <at> pobox.com> writes:
> Please reword this to describe the problem being solved first (why it
> needs to be solved, in what situation you cannot live without the
> feature), and then describe the approach you used to solve it.
> 

Done — I also removed the extraneous braces from the patch.

> 
> in order to avoid uselessly spewing garbage at you while reminding you
> that the command is not showing the whole truth and you _might_ want to
> look into the elided file if you wanted to with "grep -a world hello.o".
> Compared to this, it feels that the result your patch goes too far and
> hides the truth a bit too much to my taste. Maybe it is just me.

I used to use this approach, hooking into the "diff" attribute directly to mark
a file as binary, however that was clearly a hack. When developing this patch I
went through a few iterations, one in which -grep meant "treat this file as
binary", however explaining that in the man page is subtle and ugly: "HINT: you
might want to set a file as binary because you don't want to see results from it
when grepping".  It's much more obvious to have -grep mean "don't show me
results".

A nicer alternative could be to allow "grep=binary" (and for completeness
"grep=text") in addition to "-grep". Then people who want to see matches but not
the contents of the matches can tell grep that their files are "binary". It
would also make sense to add "grep=binary" to the binary macro-attribute. We
could even extend the system arbitrarily to allow something like the textconv
attributes of git-diff... one step at a time is probably better though.

> Perhaps you, or all participants of your particular project, usually do
> not want to see any grep hits from minified.js, but you may still want to
> be able to say "I want to dig deeper and make sure I have copyright
> strings in all files", for example.  It is unclear how you envision to
> support such a use case building on top of this patch.

I think that it would be very reasonable to add a flag to grep to tell it to
ignore the attribute temporarily (like --no-textconv on git-diff) and update the
"-a" shorthand to imply "--text --no-exclude-attribute".

Yours,
Conrad

---8<---

Git grep is used by developers to search the code stored in their repositories,
however it can give noisy results when the repository contains files that are
not of direct interest to development. Examples of such files include test
fixtures, dlls, or minified source code.

To help these developers search efficiently, git grep will now use the
gitattributes mechanism to ignore all files with an unset "grep" attribute.

Another approach considered was to allow an --exclude flag to grep at runtime,
however this is more clunky to use when the set of files to be excluded is
fixed.

Signed-off-by: Conrad Irwin <conrad.irwin@gmail.com>
---
 Documentation/git-grep.txt      |    7 +++++++
 Documentation/gitattributes.txt |    9 +++++++++
 builtin/grep.c                  |    6 ++++++
 grep.c                          |   21 +++++++++++++++++++++
 grep.h                          |    1 +
 t/t7810-grep.sh                 |   30 ++++++++++++++++++++++++++++++
 6 files changed, 74 insertions(+), 0 deletions(-)

diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
index 6a8b1e3..7c74165 100644
--- a/Documentation/git-grep.txt
+++ b/Documentation/git-grep.txt
@@ -242,6 +242,13 @@ OPTIONS
 	If given, limit the search to paths matching at least one pattern.
 	Both leading paths match and glob(7) patterns are supported.
 
+ATTRIBUTES
+----------
+
+grep::
+	If the grep attribute is unset on a file using the linkgit:gitattributes[1]
+	mechanism, then that file will not be searched.
+
 Examples
 --------
 
diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index a85b187..3ffcec7 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -869,6 +869,15 @@ If this attribute is not set or has an invalid value, the value of the
 `gui.encoding` configuration variable is used instead
 (See linkgit:git-config[1]).
 
+Configuring files to search
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+`grep`
+^^^^^^
+
+If the attribute `grep` is unset for a file then linkgit:git-grep[1]
+will ignore that file while searching for matches.
+
 
 USING MACRO ATTRIBUTES
 ----------------------
diff --git a/builtin/grep.c b/builtin/grep.c
index 9ce064a..a7817fe 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -398,6 +398,9 @@ static int grep_sha1(struct grep_opt *opt, const unsigned char *sha1,
 	struct strbuf pathbuf = STRBUF_INIT;
 	char *name;
 
+	if (!should_grep_path(opt, filename + tree_name_len))
+		return 0;
+
 	if (opt->relative && opt->prefix_length) {
 		quote_path_relative(filename + tree_name_len, -1, &pathbuf,
 				    opt->prefix);
@@ -464,6 +467,9 @@ static int grep_file(struct grep_opt *opt, const char *filename)
 	struct strbuf buf = STRBUF_INIT;
 	char *name;
 
+	if (!should_grep_path(opt, filename))
+		return 0;
+
 	if (opt->relative && opt->prefix_length)
 		quote_path_relative(filename, -1, &buf, opt->prefix);
 	else
diff --git a/grep.c b/grep.c
index 486230b..e948576 100644
--- a/grep.c
+++ b/grep.c
@@ -1,5 +1,6 @@
 #include "cache.h"
 #include "grep.h"
+#include "attr.h"
 #include "userdiff.h"
 #include "xdiff-interface.h"
 
@@ -829,6 +830,26 @@ static inline void grep_attr_unlock(struct grep_opt *opt)
 #define grep_attr_unlock(opt)
 #endif
 
+static struct git_attr_check attr_check[1];
+static void setup_attr_check(void)
+{
+	if (attr_check[0].attr)
+		return; /* already done */
+	attr_check[0].attr = git_attr("grep");
+}
+int should_grep_path(struct grep_opt *opt, const char *name) {
+	int ret = 1;
+
+	grep_attr_lock(opt);
+	setup_attr_check();
+	git_check_attr(name, 1, attr_check);
+	if (ATTR_FALSE(attr_check[0].value))
+		ret = 0;
+	grep_attr_unlock(opt);
+
+	return ret;
+}
+
 static int match_funcname(struct grep_opt *opt, const char *name, char *bol, char *eol)
 {
 	xdemitconf_t *xecfg = opt->priv;
diff --git a/grep.h b/grep.h
index fb205f3..266002d 100644
--- a/grep.h
+++ b/grep.h
@@ -129,6 +129,7 @@ extern void append_header_grep_pattern(struct grep_opt *, enum grep_header_field
 extern void compile_grep_patterns(struct grep_opt *opt);
 extern void free_grep_patterns(struct grep_opt *opt);
 extern int grep_buffer(struct grep_opt *opt, const char *name, char *buf, unsigned long size);
+extern int should_grep_path(struct grep_opt *opt, const char *name);
 
 extern struct grep_opt *grep_opt_dup(const struct grep_opt *opt);
 extern int grep_threads_ok(const struct grep_opt *opt);
diff --git a/t/t7810-grep.sh b/t/t7810-grep.sh
index 7ba5b16..c991518 100755
--- a/t/t7810-grep.sh
+++ b/t/t7810-grep.sh
@@ -871,4 +871,34 @@ test_expect_success 'mimic ack-grep --group' '
 	test_cmp expected actual
 '
 
+test_expect_success 'with -grep attribute' '
+	echo "hello.c -grep" >.gitattributes &&
+	test_must_fail git grep printf &&
+	rm .gitattributes
+'
+
+test_expect_success 'with -grep attribute on specific file' '
+	echo "hello.c -grep" >.gitattributes &&
+	test_must_fail git grep printf hello.c &&
+	rm .gitattributes
+'
+
+test_expect_success 'with -grep attribute on specific revision' '
+	echo "hello.c -grep" >.gitattributes &&
+	test_must_fail git grep printf HEAD &&
+	rm .gitattributes
+'
+
+test_expect_success 'with -grep attribute on specific file/revision' '
+	echo "hello.c -grep" >.gitattributes &&
+	test_must_fail git grep printf HEAD hello.c &&
+	rm .gitattributes
+'
+
+test_expect_failure 'with -grep attribute on specific tree' '
+	echo "hello.c -grep" >.gitattributes &&
+	test_must_fail git grep printf HEAD:hello.c &&
+	rm .gitattributes
+'
+
 test_done
-- 
1.7.9.rc2.1.g1fdd3

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH] Don't search files with an unset "grep" attribute
  2012-01-23 22:59           ` Conrad Irwin
@ 2012-01-24  6:59             ` Junio C Hamano
  2012-01-25 21:46               ` Jeff King
  0 siblings, 1 reply; 43+ messages in thread
From: Junio C Hamano @ 2012-01-24  6:59 UTC (permalink / raw)
  To: Conrad Irwin; +Cc: git, Nguyen Thai Ngoc Duy, Dov Grobgeld

Conrad Irwin <conrad.irwin@gmail.com> writes:

>> in order to avoid uselessly spewing garbage at you while reminding you
>> that the command is not showing the whole truth and you _might_ want to
>> look into the elided file if you wanted to with "grep -a world hello.o".
>> Compared to this, it feels that the result your patch goes too far and
>> hides the truth a bit too much to my taste. Maybe it is just me.
>
> I used to use this approach, hooking into the "diff" attribute directly to mark
> a file as binary, however that was clearly a hack.

After thinking about this a bit more, I have to say I disagree that it is
a hack.

I agree that the issue you are trying to address is real, but I think the
approach taken by the patch is flawed on three counts:

 - You cannot do "grep -a" equivalent once you set your "-grep" attribute;

 - The user has flexibility to set "diff" and "grep" independently, which
   is an unnecessary complication [*1*]; and

 - The user does not get any hint that potential hits are elided, like
   normal "grep" would do.

I also think we can fix the above problems, while simplifying the external
interface visible to the end users.

So let's step back a bit and take a look at the handling of files for
which you do not want to see patch output and/or you do not want to see
grep hits, in a fictional but realisitic use scenario.

I am building a small embedded applicance that links with a binary blob
firmware, xyzzy.so, supplied by an upstream vendor, and this blob is
updated from time to time. In order to keep track of what binary blob went
into which product release, this binary blob is stored in the repository
together with the source files. Because it is useless to view binary
changes in "git log -p" output, I would naturally mark this as "binary".

Now, is it realistic to expect that I might want to see "grep" hits from
this binary blob? I do not think so [*2*].

When I say "xyzzy.so is a binary" in my .gitattributes file, according to
the current documentation, I am saying that "do not run diff, and also do
not run EOL conversion on this file" [*3*], but from the end user's point
of view, it is natural that I am entitled to expect much more than that.
Without having to know how many different kinds of textual operations the
SCM I happen to be using supports, I am telling it that I do not want to
see _any_ textual operations done on it. If a new version of "git grep"
learns to honor the attributes system, I want my "this is binary" to be
considered by the improved "git grep".

If you think about it this way from the very high level description of the
problem, aka, end user's point of view, it is fairly clear that tying the
"binary" attribute to "git grep" to allow us to override the built-in
buffer_is_binary() call you see in grep.c gives the most intuitive result,
without forcing the user to do anything more than they are already doing.

Because "binary" decomposes to "-diff" and "-text", we have two choices.
We _could_ say "-text" should be the one that overrides buffer_is_binary()
autodetection, but using "-diff" instead for that purpose opens the door
for us to give our users even more intuitive and consistent experience.

Suppose that this binary blob firmware came with an API manual formatted
in PDF, xyzzy.pdf, also supplied by the vendor. It is also kept in the
repository, but again, running textual diff between updated versions of
the PDF documentation would not be very useful. I however may have a
textconv filter defined for it to grab the text by running pdf2ascii.

Now if my "git show --textconv xyzzy.pdf" has an output line that says a
string "XYZZY API 2.0" was added to the current version, wouldn't it be
natural for me to expect that "git grep --textconv 'XYZZY API' xyzzy.pdf"
to find it [*4*]?

As an added bonus, if you truly want to omit _any_ hit from "git grep" for
your minified.js files, you can easily do so. Just define a textconv
filter that yields an empty string for them, and "grep --textconv" won't
produce any matches, not even "Binary file ... matches".

So I am inclined to think that reusing the infrastructure we already have
for the `diff` attribute as much as possible would be a better design for
this new feature you are proposing.

The name of the attribute `diff` is somewhat unfortunate, but the end user
interface to the feature at the highest level is already called "binary"
attribute (macro), and our low-level documentation can give precise
meaning of the attribute while alluding to the origin of the name so that
intelligent readers would understand it pretty easily (see attached
patch).

Besides, we already tell the users in "Marking files as binary" section to
mark them with "-diff" in the attributes file if they want their contents
treated as binary.


[Footnotes]

*1* An unused flexibility like this does nothing useful, other than
forcing the user to flip two attributes always as a pair, when one should
suffice.

*2* The essence of my previous message was "why it is insufficient to mark
them binary, i.e. uninteresting for the purpose of all textual operations
not just grep but also for diff." which was unanswered.

*3* This is because "binary" is defined as an attribute macro that expands
to "-diff -text".

*4* This also suggests that the infrastructure we already have for driving
"git diff" is a good match for "git grep".  The "funcname" patterns, which
"git grep -p" already can borrow and use from "diff" infrastructure, is
another indication that using `diff` is a good match for "git grep".


 Documentation/gitattributes.txt |   15 ++++++++++++---
 1 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index a85b187..63844c4 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -388,9 +388,18 @@ Generating diff text
 `diff`
 ^^^^^^
 
-The attribute `diff` affects how 'git' generates diffs for particular
-files. It can tell git whether to generate a textual patch for the path
-or to treat the path as a binary file.  It can also affect what line is
+The attribute `diff` affects if `git` treats the contents of file as text
+or binary. Historically, `git diff` and its friends were the first to use
+this attribute, hence its name, but the attribute governs textual
+operations in general, including `git grep`.
+
+For `git diff` and its friends, differences in contents marked as text
+produces a textual patch, while differences in non-text are reported only
+as "Binary files differ". For `git grep`, matches in contents marked as
+text are reported by showing lines that contain them, while matches in
+non-text are reported only as "Binary file ... matches".
+
+This attribute can also affect what line is
 shown on the hunk header `@@ -k,l +n,m @@` line, tell git to use an
 external command to generate the diff, or ask git to convert binary
 files to a text format before generating the diff.

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH] Don't search files with an unset "grep" attribute
  2012-01-24  6:59             ` Junio C Hamano
@ 2012-01-25 21:46               ` Jeff King
  2012-01-26 13:51                 ` Stephen Bash
                                   ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Jeff King @ 2012-01-25 21:46 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Conrad Irwin, git, Nguyen Thai Ngoc Duy, Dov Grobgeld

On Mon, Jan 23, 2012 at 10:59:45PM -0800, Junio C Hamano wrote:

> Conrad Irwin <conrad.irwin@gmail.com> writes:
> > I used to use this approach, hooking into the "diff" attribute directly to mark
> > a file as binary, however that was clearly a hack.
> 
> After thinking about this a bit more, I have to say I disagree that it is
> a hack.

I kind of agree.

The biggest problem is that the name is wrong.  The "diff.*.command"
option really is about generating a diff between two blobs of a certain
type. But "diff.*.textconv" and "diff.*.binary" are really just
attributes of the file, and may or may not have to do with generating a
diff. Ditto for diff.*.funcname, I think.

You argue, and I agree, that if we are talking about attributes of the
files and not diff-specific things, then other parts of git can and
should make use of that information.

So if this was all spelled:

  $ cat .gitattributes
  *.pdf filetype=pdf
  $ cat .git/config
  [filetype "pdf"]
          binary = true
          textconv = pdf2txt

I think it would be a no-brainer that those type attributes should apply
to "git grep".

So maybe the first step on this path would be to introduce something
like "filetype" as a new attribute, have "diff" respect its settings,
and recommend people set up filetypes as appropriate. Or maybe that just
makes things more confusing in the long run, and we are better off
simply accepting that the name is slightly misleading. But either way,
it seems clear that git should be respecting gitattributes at the very
least to mark files as binary (and I think we already use funcname
patterns; textconv is a slightly stickier subject, so I'll address that
below).

But what I'm not sure I agree with is that the idea of "I don't want to
include path X in my grep" maps to "just mark the file as binary". For
example, in git we carry around a lot of code in compat/ that comes from
other places. I generally don't want to see grep results from
compat/nedmalloc/, because that isn't git code.

But should I mark everything in compat/nedmalloc as binary? I don't
think so. I _do_ want to see changes in nedmalloc in "git log" or "git
diff". They don't bother me because they're infrequent, and we still
want to produce regular text patches for the list when they come up. But
because nedmalloc contains a lot of lines of code (even though they
don't change a lot), it happens to produce a lot of uninteresting
matches when grepping.

>  - The user has flexibility to set "diff" and "grep" independently, which
>    is an unnecessary complication [*1*]; and

In the nedmalloc case, if we are to have "grep" and "diff" attributes
that behave similarly, you potentially do want to set them differently.
It would be nice to be able to treat them differently in the cases you
wanted to, but not _have_ to do so. Attribute macros can almost
implement this. You could add "-grep" to binary. But you can't (as far
as I know) do "macro=foo" to handle arbitrary diff drivers. I suspect we
could extend the rules to allow macros that take an argument and pass it
to their constituent attributes.

However, I think this is the wrong road to go down. You would want
macros like this _if_ you have grep and diff attributes that basically
do the same thing (e.g., marking a file as binary for diff versus binary
for grep). But I think that's a wrong road to go down. More likely a
file is binary or it is not binary, and the problem is conflating "file
is binary" with "I do not usually want to grep this file".

I'd much rather see grep inherit diff's file attributes unconditionally
(whether we still call them "diff" or not), and add a grep attribute
that is about "usually this is worth grepping". And then have a
tri-state command-line option for grep to either include uninteresting
things, exclude them, or to give terse output for them (mention that
there were matches in foo.c, but not each one). Probably defaulting to
terse output.

That makes the case you presented work out of the box: things marked as
binary for diffing look binary to grep, and we give the usual terse
"binary file matches" output. The user doesn't have to do anything.  For
more complex cases like nedmalloc, you can still achieve the "this is
text, but it's usually boring" behavior. And if you really want to do a
thorough grep, you can just "git grep --exclude=none".

> So let's step back a bit and take a look at the handling of files for
> which you do not want to see patch output and/or you do not want to see
> grep hits, in a fictional but realisitic use scenario.
> [...]

I think this is a nice user story, and it fits in with your suggested
git behavior. But I think there are other stories, too, like the
nedmalloc one. And it would be nice to make them work without hurting
the simplicity of the case you mentioned.

> If you think about it this way from the very high level description of the
> problem, aka, end user's point of view, it is fairly clear that tying the
> "binary" attribute to "git grep" to allow us to override the built-in
> buffer_is_binary() call you see in grep.c gives the most intuitive result,
> without forcing the user to do anything more than they are already doing.

This is not a complaint about the core of your point, but rather an
aside that should be considered: how many people are really using the
binary macro attribute? Personally, I never use it, because when I mark
something with a "diff" attribute, it is because I am telling git about
a specific diff driver (usually textconv). Otherwise I don't bother
setting attributes at all, because git's binary detection tends to be
good.  This leaves me without setting "-text", of course, but I don't
generally care because I don't do CRLF conversion at all.

So I think it is worth considering not just people setting "binary", but
how users of just "diff" (both "-diff" and "diff=foo") will want things
to work.

> Suppose that this binary blob firmware came with an API manual formatted
> in PDF, xyzzy.pdf, also supplied by the vendor. It is also kept in the
> repository, but again, running textual diff between updated versions of
> the PDF documentation would not be very useful. I however may have a
> textconv filter defined for it to grab the text by running pdf2ascii.
> 
> Now if my "git show --textconv xyzzy.pdf" has an output line that says a
> string "XYZZY API 2.0" was added to the current version, wouldn't it be
> natural for me to expect that "git grep --textconv 'XYZZY API' xyzzy.pdf"
> to find it [*4*]?

This is an interesting concept. As a user of textconv, I already have
some specialized grep tools for matching inside binary files (e.g., I
have a tool that greps within exif tags of images). But being able to do
so with "git grep", and even at an arbitrary revision, is kind of neat.

I would worry about turning it on by default, since the results could be
misleading. In particular, your pattern "foo" might be in the binary
file but not in the textconv'd version, leading you to think you had
found all instances of "foo" but had not (or much more subtle, things
like line breaks really matter during the conversion if you are going to
be using "grep -C").

Making it available by "--textconv" seems reasonable to me, though. The
only inconsistency is that it's on by default for "git show", but would
not be for "git grep".

Perhaps I am being overly paranoid on the "misleading" bit above. It
seems to me that grep has the room to be a lot more subtle, because an
omission from the output is considered "did not match". But you could
construct equally weird scenarios for "git show" (e.g., you changed
"foo" to "bar" but that part did not appear in the textconv portion.
Which is really a quality-of-implementation issue for your textconv
filter).

> As an added bonus, if you truly want to omit _any_ hit from "git grep" for
> your minified.js files, you can easily do so. Just define a textconv
> filter that yields an empty string for them, and "grep --textconv" won't
> produce any matches, not even "Binary file ... matches".

Clever. But then you will never ever see a diff for that file, either,
because we will consider all changes to be empty (actually, I didn't
check, but you may get the diff header without any content, similar to
the stat-dirty entries).

-Peff

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] Don't search files with an unset "grep" attribute
  2012-01-25 21:46               ` Jeff King
@ 2012-01-26 13:51                 ` Stephen Bash
  2012-01-26 17:29                   ` Jeff King
  2012-01-26 16:45                 ` Michael Haggerty
  2012-02-01  8:01                 ` Junio C Hamano
  2 siblings, 1 reply; 43+ messages in thread
From: Stephen Bash @ 2012-01-26 13:51 UTC (permalink / raw)
  To: Jeff King
  Cc: Conrad Irwin, git, Nguyen Thai Ngoc Duy, Dov Grobgeld, Junio C Hamano

----- Original Message -----
> From: "Jeff King" <peff@peff.net>
> Sent: Wednesday, January 25, 2012 4:46:26 PM
> Subject: Re: [PATCH] Don't search files with an unset "grep" attribute
>
> ... snip ...
> 
> So if this was all spelled:
> 
>   $ cat .gitattributes
>   *.pdf filetype=pdf
>   $ cat .git/config
>   [filetype "pdf"]
>           binary = true
>           textconv = pdf2txt
> 
> I think it would be a no-brainer that those type attributes should
> apply to "git grep".

Looking at this purely as a user, what difference/advantage would that bring versus

  $ cat .gitattributes
  *.pdf binary=true textconv=pdf2text

or

  $ cat .gitattributes
  [attr]pdf binary=true textconv=pdf2text
  *.pdf pdf

(admittedly I have no clue if gitattributes actually supports anything like this)

I guess my point is as a user, I've gravitated to "gitattributes is about files in my repo, gitconfig is about Git's behavior" (though this is a grey area).

To partially answer my own question: one advantage of putting the filetype information in a config file is it allows system- and user-wide filetype settings.  In my personal experience I've always handled that information on a per-repository basis, but that doesn't mean everyone would want to.

Thanks,
Stephen

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] Don't search files with an unset "grep" attribute
  2012-01-25 21:46               ` Jeff King
  2012-01-26 13:51                 ` Stephen Bash
@ 2012-01-26 16:45                 ` Michael Haggerty
  2012-01-27  6:35                   ` Jeff King
  2012-02-01  8:01                 ` Junio C Hamano
  2 siblings, 1 reply; 43+ messages in thread
From: Michael Haggerty @ 2012-01-26 16:45 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, Conrad Irwin, git, Nguyen Thai Ngoc Duy, Dov Grobgeld

On 01/25/2012 10:46 PM, Jeff King wrote:
> But what I'm not sure I agree with is that the idea of "I don't want to
> include path X in my grep" maps to "just mark the file as binary".

Anybody who wants this policy can simply set

    [attr]binary -diff -text -grep

If they want finer granularity, they can adjust the settings for
particular file types or for particular files.

> But should I mark everything in compat/nedmalloc as binary? I don't
> think so. I _do_ want to see changes in nedmalloc in "git log" or "git
> diff". They don't bother me because they're infrequent, and we still
> want to produce regular text patches for the list when they come up. But
> because nedmalloc contains a lot of lines of code (even though they
> don't change a lot), it happens to produce a lot of uninteresting
> matches when grepping.

I think decisions such as whether to include an imported module in "git
diff" output is a personal preference and should not be decided at the
level of the git project.  The in-tree .gitattributes files should, by
and large, just *describe* the files and leave it to users to associate
policies with the tags (or at least make it possible for users to
override the policies) via .git/info/attributes.  For example, the
repository could set an "external=nedmalloc" attribute on all files
under compat/nedmalloc, and users could themselves configure a macro
"[attr]external -diff -grep" (or maybe something like
"[attr]external=nedmalloc -diff -grep") if that is their preference.

> It would be nice to be able to treat them differently in the cases you
> wanted to, but not _have_ to do so. Attribute macros can almost
> implement this. You could add "-grep" to binary. But you can't (as far
> as I know) do "macro=foo" to handle arbitrary diff drivers. I suspect we
> could extend the rules to allow macros that take an argument and pass it
> to their constituent attributes.

Is it really common to want to use the same argument on multiple macros
without also wanting to set other things specifically?  If not, then
there is not much reason to complicate macros with argument support.

For example, I do something like

    [attr]type-python type=python text diff=python check-ws
    *.py type-python

    [attr]type-makefile type=makefile text diff check-ws -check-tab
    Makefile.* type-makefile

for the main file types in my repository, and it is not very cumbersome.

"type-python" and "type=python" seem redundant but they are not.
"type-python" is needed so that it can be used as a macro.
"type=python" makes it easier to inquire about the type of a file using
something like "git check-attr type -- PATH" rather than having to
inquire about each possible type-* attribute.  It might be nice to
support a slightly extended macro definition syntax like

    [attr]type=python text diff=python check-ws
    *.py type=python

    [attr]type=makefile text diff check-ws -check-tab
    Makefile.* type=makefile

(i.e., macros that are only triggered for particular values of an
attribute).

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] Don't search files with an unset "grep" attribute
  2012-01-26 13:51                 ` Stephen Bash
@ 2012-01-26 17:29                   ` Jeff King
  0 siblings, 0 replies; 43+ messages in thread
From: Jeff King @ 2012-01-26 17:29 UTC (permalink / raw)
  To: Stephen Bash
  Cc: Conrad Irwin, git, Nguyen Thai Ngoc Duy, Dov Grobgeld, Junio C Hamano

On Thu, Jan 26, 2012 at 08:51:52AM -0500, Stephen Bash wrote:

> >   $ cat .gitattributes
> >   *.pdf filetype=pdf
> >   $ cat .git/config
> >   [filetype "pdf"]
> >           binary = true
> >           textconv = pdf2txt
> 
> Looking at this purely as a user, what difference/advantage would that bring versus
> 
>   $ cat .gitattributes
>   *.pdf binary=true textconv=pdf2text

For "binary", probably not much. But for textconv, it is all about the
split between attributes and config, as mentioned below:

> To partially answer my own question: one advantage of putting the
> filetype information in a config file is it allows system- and
> user-wide filetype settings.  In my personal experience I've always
> handled that information on a per-repository basis, but that doesn't
> mean everyone would want to.

Right. Setting things system-wide instead of per-repo is one advantage.
But more important is that attributes are not per-repo, but rather
"per-project". They get committed, and everybody who works on the
project shares them.

In your example, the gitattributes get committed, and the project is
mandating "you _will_ use pdf2text to view diffs of these files". But
that may not be appropriate for everybody who clones. Somebody may have
a different pdf-to-text converter. Somebody may simply have pdf2txt at a
different path, or need different options. Or somebody may want to skip
it altogether and use an external diff command, or even just see the
files as binary.

By splitting the information across the two files, the project gets to
say "this file is of type pdf", and then each user gets to decide "how
do I want to diff pdf files?"

-Peff

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] Don't search files with an unset "grep" attribute
  2012-01-26 16:45                 ` Michael Haggerty
@ 2012-01-27  6:35                   ` Jeff King
  0 siblings, 0 replies; 43+ messages in thread
From: Jeff King @ 2012-01-27  6:35 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: Junio C Hamano, Conrad Irwin, git, Nguyen Thai Ngoc Duy, Dov Grobgeld

On Thu, Jan 26, 2012 at 05:45:16PM +0100, Michael Haggerty wrote:

> I think decisions such as whether to include an imported module in "git
> diff" output is a personal preference and should not be decided at the
> level of the git project.

You're right. I thought of it as an annotation that the project could
mark via .gitattributes, or the user could mark via .git/info/attributes.
But that is not following the right split of responsibility for
attributes and config. The attributes should annotate "this isn't really
part of the regular git code base" or "this is really part of the
nedmalloc codebase". And then the _config_ should say "when I am
grepping, I am not interested in nedmalloc". I.e.:

  # mark a set of paths with an attribute
  echo "compat/nedmalloc external" >>.gitattributes

  # and then ignore that attribute for this grep
  git grep --exclude-attr=external 

  # or for all greps
  git config --add grep.exclude external

and git doesn't even have to care about what the attribute is called.
It's between the project and the user how they want to annotate their
files, and how they want to feed them to grep.

Or any other program, for that matter. I wonder if this could also be a
more powerful way of grouping files to be included or excluded from diff
pathspecs. Something like (and I'm just talking off the top of my head,
so there may be some syntactic conflicts here):

  # annotate some files
  cat >>.gitattributes <<-\EOF
  t/t????-*.sh test-script
  t/lib-*.sh test-script
  t/test-lib.sh test-script
  EOF

  # and then consider the tagged files to be a group, and look only at
  # that group
  git log :attr:test-script

  # ditto, but imagine we had the negative pathspecs Duy has proposed
  git log :~attr:test-script

That seems kind of cool to me. But maybe it is getting into crazy
over-engineering. I like the idea that we don't need a new option to
grep or diff; rather it is simply a new syntax for mentioning paths.

> The in-tree .gitattributes files should, by and large, just *describe*
> the files and leave it to users to associate policies with the tags
> (or at least make it possible for users to override the policies) via
> .git/info/attributes.  For example, the repository could set an
> "external=nedmalloc" attribute on all files under compat/nedmalloc,
> and users could themselves configure a macro "[attr]external -diff
> -grep" (or maybe something like "[attr]external=nedmalloc -diff
> -grep") if that is their preference.

So obviously I took what you were saying here and ran with it above. But
I do disagree with one thing here: the attributes should be giving some
tag to the paths, but the actual decision about whether to grep should
be part of the _config_. That's the usual split we have for all of the
other attributes, and I think it makes sense and has worked well.

> Is it really common to want to use the same argument on multiple macros
> without also wanting to set other things specifically?  If not, then
> there is not much reason to complicate macros with argument support.

I dunno. I admit my attribute usage tends to just match by extension,
and I generally only have one or two such lines.

> For example, I do something like
> 
>     [attr]type-python type=python text diff=python check-ws
>     *.py type-python
> 
>     [attr]type-makefile type=makefile text diff check-ws -check-tab
>     Makefile.* type-makefile
> 
> for the main file types in my repository, and it is not very cumbersome.

I think it's not a big deal if you are making your own macros. I was
more concerned that people would want to use the "binary" macro to get
the "-grep" automagic, but could not do so because they don't want
"-diff", but rather "diff=foo".

Anyway, after reading your response and thinking on it more, I think
"-grep" is totally the wrong way to go.  If the files are marked binary,
then grep should be respecting "-diff" or the "diff.*.binary" config. If
we want to do more advanced exclusion, then the right place for that is
the config file (or the weird :attr pathspec thing I mentioned above).

> "type-python" and "type=python" seem redundant but they are not.
> "type-python" is needed so that it can be used as a macro.
> "type=python" makes it easier to inquire about the type of a file using
> something like "git check-attr type -- PATH" rather than having to
> inquire about each possible type-* attribute.  It might be nice to
> support a slightly extended macro definition syntax like
> 
>     [attr]type=python text diff=python check-ws
>     *.py type=python
> 
>     [attr]type=makefile text diff check-ws -check-tab
>     Makefile.* type=makefile
> 
> (i.e., macros that are only triggered for particular values of an
> attribute).

I don't think there's any semantic reason why that is not workable. It's
simply not syntactically allowed at this point.

-Peff

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] Don't search files with an unset "grep" attribute
  2012-01-25 21:46               ` Jeff King
  2012-01-26 13:51                 ` Stephen Bash
  2012-01-26 16:45                 ` Michael Haggerty
@ 2012-02-01  8:01                 ` Junio C Hamano
  2012-02-01  8:20                   ` Jeff King
  2 siblings, 1 reply; 43+ messages in thread
From: Junio C Hamano @ 2012-02-01  8:01 UTC (permalink / raw)
  To: Jeff King; +Cc: Conrad Irwin, git, Nguyen Thai Ngoc Duy, Dov Grobgeld

Jeff King <peff@peff.net> writes:

> On Mon, Jan 23, 2012 at 10:59:45PM -0800, Junio C Hamano wrote:
>
>> Conrad Irwin <conrad.irwin@gmail.com> writes:
>> > I used to use this approach, hooking into the "diff" attribute directly to mark
>> > a file as binary, however that was clearly a hack.
>> 
>> After thinking about this a bit more, I have to say I disagree that it is
>> a hack.
>
> I kind of agree.
>
> The biggest problem is that the name is wrong.  The "diff.*.command"
> option really is about generating a diff between two blobs of a certain
> type. But "diff.*.textconv" and "diff.*.binary" are really just
> attributes of the file, and may or may not have to do with generating a
> diff. Ditto for diff.*.funcname, I think.
>
> You argue, and I agree, that if we are talking about attributes of the
> files and not diff-specific things, then other parts of git can and
> should make use of that information.
>
> So if this was all spelled:
>
>   $ cat .gitattributes
>   *.pdf filetype=pdf
>   $ cat .git/config
>   [filetype "pdf"]
>           binary = true
>           textconv = pdf2txt
>
> I think it would be a no-brainer that those type attributes should apply
> to "git grep".

I think this discussion has, instead of forking into two equally
interesting subthreads, veered to a more intellectually stimulating
tangent and we ended up losing focus.

Regardless of what to do with "I do not want to grep in these types of
files" and "I want textconv applied when grepping in these types", which
would be new attributes to implement two new features, I would like to see
us first concentrate on fixing the "binary" issue.  When somebody tells us
"Your autodetection may screw it up, but this file is binary; just show
'Binary files differ.' when comparing." with "-diff" (or "binary"), we
should honor that when "git grep" decides if it should take the 'Binary
file matches' codepath.  We currently do not, and it clearly is a bug.

This is especially made somewhat urgent because I do not want a half-baked
"two pathspecs" approach that only "git grep" knows about when we add the
support for "git grep --exclude-path=...".

We should have to teach the underlying machinery that matches pathspec
about negative pathspec entries only once. After we have done so, all the
callers, not just "git grep", should be able to take advantage of the
change by just learning to place negative pathspec entries in the "struct
pathspec" they pass to the machinery.  Doing anything else will lead to
madness of adding ad-hoc "here we should further filter with the other
negative 'struct pathspec'" in each and every application.

But I suspect that it would not materialize anytime soon.  And I also
suspect that the correct handling of 'Binary file matches', which is a
pure bugfix, should solve the original issue started these threads 90% in
practice.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] Don't search files with an unset "grep" attribute
  2012-02-01  8:01                 ` Junio C Hamano
@ 2012-02-01  8:20                   ` Jeff King
  2012-02-01  9:10                     ` Jeff King
  0 siblings, 1 reply; 43+ messages in thread
From: Jeff King @ 2012-02-01  8:20 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Conrad Irwin, git, Nguyen Thai Ngoc Duy, Dov Grobgeld

On Wed, Feb 01, 2012 at 12:01:41AM -0800, Junio C Hamano wrote:

> > So if this was all spelled:
> >
> >   $ cat .gitattributes
> >   *.pdf filetype=pdf
> >   $ cat .git/config
> >   [filetype "pdf"]
> >           binary = true
> >           textconv = pdf2txt
> >
> > I think it would be a no-brainer that those type attributes should apply
> > to "git grep".
> 
> I think this discussion has, instead of forking into two equally
> interesting subthreads, veered to a more intellectually stimulating
> tangent and we ended up losing focus.

That's what I'm here for.

> Regardless of what to do with "I do not want to grep in these types of
> files" and "I want textconv applied when grepping in these types", which
> would be new attributes to implement two new features, I would like to see
> us first concentrate on fixing the "binary" issue.  When somebody tells us
> "Your autodetection may screw it up, but this file is binary; just show
> 'Binary files differ.' when comparing." with "-diff" (or "binary"), we
> should honor that when "git grep" decides if it should take the 'Binary
> file matches' codepath.  We currently do not, and it clearly is a bug.

Right. It may have been lost in the verbosity of what I wrote in my
previous email, but I completely agree. With the caveat that one should
also respect "diff=foo" coupled with "diff.foo.binary = true" as making
something binary. But that is already handled transparently by the
userdiff.[ch] code (which seems like the obvious entry point for
grep to use for attribute lookup, and which we already use there for
funcname lookup).

The trivial-ish patch is below.

> We should have to teach the underlying machinery that matches pathspec
> about negative pathspec entries only once. After we have done so, all the
> callers, not just "git grep", should be able to take advantage of the
> change by just learning to place negative pathspec entries in the "struct
> pathspec" they pass to the machinery.  Doing anything else will lead to
> madness of adding ad-hoc "here we should further filter with the other
> negative 'struct pathspec'" in each and every application.

Yes, I agree.

> But I suspect that it would not materialize anytime soon.  And I also
> suspect that the correct handling of 'Binary file matches', which is a
> pure bugfix, should solve the original issue started these threads 90% in
> practice.

Also agree. Let's fix the bug and then give it some time to see whether
people really want more explicit exclusions.

Here's the bug-fix patch. Not quite ready for inclusion, as it obviously
needs tests and a commit message. Also, we can cache the result of the
userdiff lookup so the funcname code doesn't have to look it up again.

---
diff --git a/grep.c b/grep.c
index b29d09c..d7ab054 100644
--- a/grep.c
+++ b/grep.c
@@ -960,6 +960,15 @@ static void std_output(struct grep_opt *opt, const void *buf, size_t size)
 	fwrite(buf, size, 1, stdout);
 }
 
+static int grep_buffer_is_binary(const char *path,
+				 char *buf, unsigned long size)
+{
+	struct userdiff_driver *drv = userdiff_find_by_path(path);
+	if (drv && drv->binary != -1)
+		return drv->binary;
+	return buffer_is_binary(buf, size);
+}
+
 static int grep_buffer_1(struct grep_opt *opt, const char *name,
 			 char *buf, unsigned long size, int collect_hits)
 {
@@ -994,11 +1003,11 @@ static int grep_buffer_1(struct grep_opt *opt, const char *name,
 
 	switch (opt->binary) {
 	case GREP_BINARY_DEFAULT:
-		if (buffer_is_binary(buf, size))
+		if (grep_buffer_is_binary(name, buf, size))
 			binary_match_only = 1;
 		break;
 	case GREP_BINARY_NOMATCH:
-		if (buffer_is_binary(buf, size))
+		if (grep_buffer_is_binary(name, buf, size))
 			return 0; /* Assume unmatch */
 		break;
 	case GREP_BINARY_TEXT:

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH] Don't search files with an unset "grep" attribute
  2012-02-01  8:20                   ` Jeff King
@ 2012-02-01  9:10                     ` Jeff King
  2012-02-01  9:28                       ` Conrad Irwin
  2012-02-01 16:28                       ` [PATCH] Don't search files with an unset "grep" attribute Junio C Hamano
  0 siblings, 2 replies; 43+ messages in thread
From: Jeff King @ 2012-02-01  9:10 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Conrad Irwin, git, Nguyen Thai Ngoc Duy, Dov Grobgeld

On Wed, Feb 01, 2012 at 03:20:05AM -0500, Jeff King wrote:

> Here's the bug-fix patch. Not quite ready for inclusion, as it obviously
> needs tests and a commit message. Also, we can cache the result of the
> userdiff lookup so the funcname code doesn't have to look it up again.

Actually, it's a little bit more complicated. I was looking at a
slightly old version of grep.c. Since 0579f91 (grep: enable threading
with -p and -W using lazy attribute lookup, 2011-12-12), the lookup
happens in lots of sub-functions, and locking is required.

So this is what the patch looks like with proper locking and caching of
the looked-up driver. It's quite messy because the cached driver pointer
has to get passed around quite a bit. And I'm not sure it buys that much
in practice. The cost of attribute lookup _is_ noticeable (which I'll
discuss below), but funcname lookup only happens when we get a grep hit.
So unless you are searching for something extremely common, you're only
going to do a lookup very occasionally (compared to the load of actually
searching through the files). So all of the messiness and caching may
not be worth the effort, as I wasn't able to measure a performance gain.

But there's more. Respecting binary attributes does mean looking up
attributes for _every_ file. And that has a noticeable impact. My
best-of-five for "git grep foo" on linux-2.6 went from 0.302s to 0.392s.
Yuck.

Part of the problem, I suspect, is that the attribute lookup code is
optimized for locality. We only unwind as much of the stack as we need,
so looking at "foo/bar/baz.c" after "foo/bar/bleep.c" is much cheaper
than looking at "some/other/directory.c". But with threaded grep, that
locality is likely lost, as we are mixing up attribute requests from
different threads.

Given that binary lookup means we need every file's gitattribute, it
might be better to look them up serially at the beginning of the
program, and then pass the resulting userdiff driver to grep_buffer
along with each path.

---
diff --git a/grep.c b/grep.c
index 486230b..3ca840a 100644
--- a/grep.c
+++ b/grep.c
@@ -829,15 +829,28 @@ static inline void grep_attr_unlock(struct grep_opt *opt)
 #define grep_attr_unlock(opt)
 #endif
 
-static int match_funcname(struct grep_opt *opt, const char *name, char *bol, char *eol)
+static struct userdiff_driver *get_cached_userdiff(struct grep_opt *opt,
+						   const char *path,
+						   struct userdiff_driver **drv)
 {
-	xdemitconf_t *xecfg = opt->priv;
-	if (xecfg && !xecfg->find_func) {
-		struct userdiff_driver *drv;
+	if (!*drv) {
 		grep_attr_lock(opt);
-		drv = userdiff_find_by_path(name);
+		*drv = userdiff_find_by_path(path);
+		if (!*drv)
+			*drv = userdiff_find_by_name("default");
 		grep_attr_unlock(opt);
-		if (drv && drv->funcname.pattern) {
+	}
+	return *drv;
+}
+
+static int match_funcname(struct grep_opt *opt, const char *name,
+			  struct userdiff_driver **drv_p,
+			  char *bol, char *eol)
+{
+	xdemitconf_t *xecfg = opt->priv;
+	if (xecfg && !xecfg->find_func) {
+		struct userdiff_driver *drv = get_cached_userdiff(opt, name, drv_p);
+		if (drv->funcname.pattern) {
 			const struct userdiff_funcname *pe = &drv->funcname;
 			xdiff_set_find_func(xecfg, pe->pattern, pe->cflags);
 		} else {
@@ -859,6 +872,7 @@ static int match_funcname(struct grep_opt *opt, const char *name, char *bol, cha
 }
 
 static void show_funcname_line(struct grep_opt *opt, const char *name,
+			       struct userdiff_driver **drv_p,
 			       char *buf, char *bol, unsigned lno)
 {
 	while (bol > buf) {
@@ -871,20 +885,21 @@ static void show_funcname_line(struct grep_opt *opt, const char *name,
 		if (lno <= opt->last_shown)
 			break;
 
-		if (match_funcname(opt, name, bol, eol)) {
+		if (match_funcname(opt, name, drv_p, bol, eol)) {
 			show_line(opt, bol, eol, name, lno, '=');
 			break;
 		}
 	}
 }
 
-static void show_pre_context(struct grep_opt *opt, const char *name, char *buf,
-			     char *bol, char *end, unsigned lno)
+static void show_pre_context(struct grep_opt *opt, const char *name,
+			     struct userdiff_driver **drv_p,
+			     char *buf, char *bol, char *end, unsigned lno)
 {
 	unsigned cur = lno, from = 1, funcname_lno = 0;
 	int funcname_needed = !!opt->funcname;
 
-	if (opt->funcbody && !match_funcname(opt, name, bol, end))
+	if (opt->funcbody && !match_funcname(opt, name, drv_p, bol, end))
 		funcname_needed = 2;
 
 	if (opt->pre_context < lno)
@@ -900,7 +915,7 @@ static void show_pre_context(struct grep_opt *opt, const char *name, char *buf,
 		while (bol > buf && bol[-1] != '\n')
 			bol--;
 		cur--;
-		if (funcname_needed && match_funcname(opt, name, bol, eol)) {
+		if (funcname_needed && match_funcname(opt, name, drv_p, bol, eol)) {
 			funcname_lno = cur;
 			funcname_needed = 0;
 		}
@@ -908,7 +923,7 @@ static void show_pre_context(struct grep_opt *opt, const char *name, char *buf,
 
 	/* We need to look even further back to find a function signature. */
 	if (opt->funcname && funcname_needed)
-		show_funcname_line(opt, name, buf, bol, cur);
+		show_funcname_line(opt, name, drv_p, buf, bol, cur);
 
 	/* Back forward. */
 	while (cur < lno) {
@@ -983,6 +998,17 @@ static void std_output(struct grep_opt *opt, const void *buf, size_t size)
 	fwrite(buf, size, 1, stdout);
 }
 
+static int grep_buffer_is_binary(struct grep_opt *opt,
+				 const char *path,
+				 char *buf, unsigned long size,
+				 struct userdiff_driver **drv_p)
+{
+	struct userdiff_driver *drv = get_cached_userdiff(opt, path, drv_p);
+	if (drv && drv->binary != -1)
+		return drv->binary;
+	return buffer_is_binary(buf, size);
+}
+
 static int grep_buffer_1(struct grep_opt *opt, const char *name,
 			 char *buf, unsigned long size, int collect_hits)
 {
@@ -996,6 +1022,7 @@ static int grep_buffer_1(struct grep_opt *opt, const char *name,
 	int show_function = 0;
 	enum grep_context ctx = GREP_CONTEXT_HEAD;
 	xdemitconf_t xecfg;
+	struct userdiff_driver *drv = NULL;
 
 	if (!opt->output)
 		opt->output = std_output;
@@ -1017,11 +1044,11 @@ static int grep_buffer_1(struct grep_opt *opt, const char *name,
 
 	switch (opt->binary) {
 	case GREP_BINARY_DEFAULT:
-		if (buffer_is_binary(buf, size))
+		if (grep_buffer_is_binary(opt, name, buf, size, &drv))
 			binary_match_only = 1;
 		break;
 	case GREP_BINARY_NOMATCH:
-		if (buffer_is_binary(buf, size))
+		if (grep_buffer_is_binary(opt, name, buf, size, &drv))
 			return 0; /* Assume unmatch */
 		break;
 	case GREP_BINARY_TEXT:
@@ -1099,16 +1126,16 @@ static int grep_buffer_1(struct grep_opt *opt, const char *name,
 			 * pre-context lines, we would need to show them.
 			 */
 			if (opt->pre_context || opt->funcbody)
-				show_pre_context(opt, name, buf, bol, eol, lno);
+				show_pre_context(opt, name, &drv, buf, bol, eol, lno);
 			else if (opt->funcname)
-				show_funcname_line(opt, name, buf, bol, lno);
+				show_funcname_line(opt, name, &drv, buf, bol, lno);
 			show_line(opt, bol, eol, name, lno, ':');
 			last_hit = lno;
 			if (opt->funcbody)
 				show_function = 1;
 			goto next_line;
 		}
-		if (show_function && match_funcname(opt, name, bol, eol))
+		if (show_function && match_funcname(opt, name, &drv, bol, eol))
 			show_function = 0;
 		if (show_function ||
 		    (last_hit && lno <= last_hit + opt->post_context)) {

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH] Don't search files with an unset "grep" attribute
  2012-02-01  9:10                     ` Jeff King
@ 2012-02-01  9:28                       ` Conrad Irwin
  2012-02-01 22:14                         ` Jeff King
  2012-02-01 16:28                       ` [PATCH] Don't search files with an unset "grep" attribute Junio C Hamano
  1 sibling, 1 reply; 43+ messages in thread
From: Conrad Irwin @ 2012-02-01  9:28 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git, Nguyen Thai Ngoc Duy, Dov Grobgeld

On Wed, Feb 1, 2012 at 1:10 AM, Jeff King <peff@peff.net> wrote:
> On Wed, Feb 01, 2012 at 03:20:05AM -0500, Jeff King wrote:
>
> Actually, it's a little bit more complicated. I was looking at a
> slightly old version of grep.c. Since 0579f91 (grep: enable threading
> with -p and -W using lazy attribute lookup, 2011-12-12), the lookup
> happens in lots of sub-functions, and locking is required.

Heh, you just beat me to it.

> But there's more. Respecting binary attributes does mean looking up
> attributes for _every_ file. And that has a noticeable impact. My
> best-of-five for "git grep foo" on linux-2.6 went from 0.302s to 0.392s.
> Yuck.

The first time I introduced this behaviour[1], I made it conditional
on a preference — those who wanted "good" grep could set the
preference, while those who wanted "fast" grep could not. I think
that's not a good idea, though if the performance issues are
show-stoppers, I'd suggest the opposite preference (so speed-freaks
can disable the checks).

Tests from [1] included below in case they're still useful (they pass
with your change)

[1] http://article.gmane.org/gmane.comp.version-control.git/179299/match=grep
---

diff --git a/t/t7008-grep-binary.sh b/t/t7008-grep-binary.sh
index 917a264..4d94461 100755
--- a/t/t7008-grep-binary.sh
+++ b/t/t7008-grep-binary.sh
@@ -99,4 +99,23 @@ test_expect_success 'git grep y<NUL>x a' "
        test_must_fail git grep -f f a
 "

+test_expect_success 'git -c grep.binaryFiles=1 grep ina a' "
+       echo 'a diff' > .gitattributes &&
+       printf 'binaryQfile' | q_to_nul >a &&
+       echo 'a:binaryQfile' | q_to_nul >expect &&
+       git -c grep.binaryFiles=1 grep ina a > actual &&
+       rm .gitattributes &&
+       test_cmp expect actual
+"
+test_expect_success 'git -c grep.binaryFiles=1 grep tex t' "
+       echo 'text' > t &&
+       git add t &&
+       echo 't -diff' > .gitattributes &&
+       echo Binary file t matches >expect &&
+       git -c grep.binaryFiles=1 grep tex t >actual &&
+       rm .gitattributes &&
+       test_cmp expect actual
+"
+
+
 test_done

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH] Don't search files with an unset "grep" attribute
  2012-02-01  9:10                     ` Jeff King
  2012-02-01  9:28                       ` Conrad Irwin
@ 2012-02-01 16:28                       ` Junio C Hamano
  1 sibling, 0 replies; 43+ messages in thread
From: Junio C Hamano @ 2012-02-01 16:28 UTC (permalink / raw)
  To: Jeff King; +Cc: Conrad Irwin, git, Nguyen Thai Ngoc Duy, Dov Grobgeld

Jeff King <peff@peff.net> writes:

> Part of the problem, I suspect, is that the attribute lookup code is
> optimized for locality. We only unwind as much of the stack as we need,
> so looking at "foo/bar/baz.c" after "foo/bar/bleep.c" is much cheaper
> than looking at "some/other/directory.c". But with threaded grep, that
> locality is likely lost, as we are mixing up attribute requests from
> different threads.
>
> Given that binary lookup means we need every file's gitattribute, it
> might be better to look them up serially at the beginning of the
> program, and then pass the resulting userdiff driver to grep_buffer
> along with each path.

Yeah, that was my impression when the performance of threaded grep was
discussed, which was before this "let's honor binary attribute".

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] Don't search files with an unset "grep" attribute
  2012-02-01  9:28                       ` Conrad Irwin
@ 2012-02-01 22:14                         ` Jeff King
  2012-02-01 23:20                           ` Jeff King
                                             ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Jeff King @ 2012-02-01 22:14 UTC (permalink / raw)
  To: Conrad Irwin; +Cc: Junio C Hamano, git, Nguyen Thai Ngoc Duy, Dov Grobgeld

On Wed, Feb 01, 2012 at 01:28:47AM -0800, Conrad Irwin wrote:

> > But there's more. Respecting binary attributes does mean looking up
> > attributes for _every_ file. And that has a noticeable impact. My
> > best-of-five for "git grep foo" on linux-2.6 went from 0.302s to 0.392s.
> > Yuck.
> 
> The first time I introduced this behaviour[1], I made it conditional
> on a preference — those who wanted "good" grep could set the
> preference, while those who wanted "fast" grep could not. I think
> that's not a good idea, though if the performance issues are
> show-stoppers, I'd suggest the opposite preference (so speed-freaks
> can disable the checks).

I've been able to get somewhat better performance by hoisting the
attribute lookup into the parent thread. That means it happens in order
(which lets the attr code's stack optimizations work), and there's no
lock contention.

I'll post finished patches with numbers in a few minutes.

> Tests from [1] included below in case they're still useful (they pass
> with your change)

Thanks, I'll include them.

-Peff

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] Don't search files with an unset "grep" attribute
  2012-02-01 22:14                         ` Jeff King
@ 2012-02-01 23:20                           ` Jeff King
  2012-02-02  2:03                             ` Junio C Hamano
  2012-02-01 23:21                           ` [PATCH 1/2] grep: let grep_buffer callers specify a binary flag Jeff King
  2012-02-01 23:21                           ` [PATCH 2/2] grep: respect diff attributes for binary-ness Jeff King
  2 siblings, 1 reply; 43+ messages in thread
From: Jeff King @ 2012-02-01 23:20 UTC (permalink / raw)
  To: Conrad Irwin; +Cc: Junio C Hamano, git, Nguyen Thai Ngoc Duy, Dov Grobgeld

On Wed, Feb 01, 2012 at 05:14:37PM -0500, Jeff King wrote:

> > The first time I introduced this behaviour[1], I made it conditional
> > on a preference — those who wanted "good" grep could set the
> > preference, while those who wanted "fast" grep could not. I think
> > that's not a good idea, though if the performance issues are
> > show-stoppers, I'd suggest the opposite preference (so speed-freaks
> > can disable the checks).
> 
> I've been able to get somewhat better performance by hoisting the
> attribute lookup into the parent thread. That means it happens in order
> (which lets the attr code's stack optimizations work), and there's no
> lock contention.
> 
> I'll post finished patches with numbers in a few minutes.

OK, here they are. After playing with some options, I'm satisfied this
is a sane way to do it. I don't think it's worth having a config option.
There is a measurable slowdown, but it's simply not that big.

  [1/2]: grep: let grep_buffer callers specify a binary flag
  [2/2]: grep: respect diff attributes for binary-ness

There are a few optimizations I didn't do that you could put on top:

  1. When "-a" is given, we can avoid the attribute lookup altogether.

  2. When "-I" is given, we can actually check attributes _before_
     loading the file or blob into memory. This can help with very large
     binaries.

  3. When "-I" is given but we have no attribute, we can stream the
     beginning of the file or blob to check for binary-ness, and then
     avoid loading the whole thing if it turns out to be binary.

I think (1) and (2) should be easy. Doing (3) is a little messier,
because binary detection happens inside grep_buffer, but we can hoist it
out. However, for large files, it might be nice to have a streaming grep
interface anyway, and (3) could be part of that.

-Peff

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 1/2] grep: let grep_buffer callers specify a binary flag
  2012-02-01 22:14                         ` Jeff King
  2012-02-01 23:20                           ` Jeff King
@ 2012-02-01 23:21                           ` Jeff King
  2012-02-02  0:47                             ` Junio C Hamano
  2012-02-01 23:21                           ` [PATCH 2/2] grep: respect diff attributes for binary-ness Jeff King
  2 siblings, 1 reply; 43+ messages in thread
From: Jeff King @ 2012-02-01 23:21 UTC (permalink / raw)
  To: Conrad Irwin; +Cc: Junio C Hamano, git, Nguyen Thai Ngoc Duy, Dov Grobgeld

The caller of grep_buffer may have extra information about
whether a buffer is binary or not (e.g., from configuration).
Let's give them a chance to pass along that information and
override our binary auto-detection.

Callers can still pass "-1" to get the regular
auto-detection (and all callers are converted to do this,
meaning there should be no behavior change yet).

We could maintain source compatibility for callers by adding
a new "grep_buffer_with_flags" and leaving "grep_buffer" as
a wrapper that always passes "-1". But there are only 5
callers of grep_buffer, and only 1 of those (grepping commit
buffers) will not be converted to pass something useful in
the next patch. So it's simpler to just add a "-1" there.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/grep.c |    8 ++++----
 grep.c         |   23 ++++++++++++++++-------
 grep.h         |    2 +-
 revision.c     |    1 +
 4 files changed, 22 insertions(+), 12 deletions(-)

diff --git a/builtin/grep.c b/builtin/grep.c
index 9ce064a..e328316 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -221,14 +221,14 @@ static void *run(void *arg)
 			void* data = load_sha1(w->identifier, &sz, w->name);
 
 			if (data) {
-				hit |= grep_buffer(opt, w->name, data, sz);
+				hit |= grep_buffer(opt, w->name, -1, data, sz);
 				free(data);
 			}
 		} else if (w->type == WORK_FILE) {
 			size_t sz;
 			void* data = load_file(w->identifier, &sz);
 			if (data) {
-				hit |= grep_buffer(opt, w->name, data, sz);
+				hit |= grep_buffer(opt, w->name, -1, data, sz);
 				free(data);
 			}
 		} else {
@@ -421,7 +421,7 @@ static int grep_sha1(struct grep_opt *opt, const unsigned char *sha1,
 		if (!data)
 			hit = 0;
 		else
-			hit = grep_buffer(opt, name, data, sz);
+			hit = grep_buffer(opt, name, -1, data, sz);
 
 		free(data);
 		free(name);
@@ -483,7 +483,7 @@ static int grep_file(struct grep_opt *opt, const char *filename)
 		if (!data)
 			hit = 0;
 		else
-			hit = grep_buffer(opt, name, data, sz);
+			hit = grep_buffer(opt, name, -1, data, sz);
 
 		free(data);
 		free(name);
diff --git a/grep.c b/grep.c
index 486230b..e547db2 100644
--- a/grep.c
+++ b/grep.c
@@ -983,8 +983,16 @@ static void std_output(struct grep_opt *opt, const void *buf, size_t size)
 	fwrite(buf, size, 1, stdout);
 }
 
+static int grep_buffer_is_binary(char *buf, unsigned long size, int flag)
+{
+	if (flag == -1)
+		flag = buffer_is_binary(buf, size);
+	return flag;
+}
+
 static int grep_buffer_1(struct grep_opt *opt, const char *name,
-			 char *buf, unsigned long size, int collect_hits)
+			 int is_binary, char *buf, unsigned long size,
+			 int collect_hits)
 {
 	char *bol = buf;
 	unsigned long left = size;
@@ -1017,11 +1025,11 @@ static int grep_buffer_1(struct grep_opt *opt, const char *name,
 
 	switch (opt->binary) {
 	case GREP_BINARY_DEFAULT:
-		if (buffer_is_binary(buf, size))
+		if (grep_buffer_is_binary(buf, size, is_binary))
 			binary_match_only = 1;
 		break;
 	case GREP_BINARY_NOMATCH:
-		if (buffer_is_binary(buf, size))
+		if (grep_buffer_is_binary(buf, size, is_binary))
 			return 0; /* Assume unmatch */
 		break;
 	case GREP_BINARY_TEXT:
@@ -1182,23 +1190,24 @@ static int chk_hit_marker(struct grep_expr *x)
 	}
 }
 
-int grep_buffer(struct grep_opt *opt, const char *name, char *buf, unsigned long size)
+int grep_buffer(struct grep_opt *opt, const char *name, int is_binary,
+		char *buf, unsigned long size)
 {
 	/*
 	 * we do not have to do the two-pass grep when we do not check
 	 * buffer-wide "all-match".
 	 */
 	if (!opt->all_match)
-		return grep_buffer_1(opt, name, buf, size, 0);
+		return grep_buffer_1(opt, name, is_binary, buf, size, 0);
 
 	/* Otherwise the toplevel "or" terms hit a bit differently.
 	 * We first clear hit markers from them.
 	 */
 	clr_hit_marker(opt->pattern_expression);
-	grep_buffer_1(opt, name, buf, size, 1);
+	grep_buffer_1(opt, name, is_binary, buf, size, 1);
 
 	if (!chk_hit_marker(opt->pattern_expression))
 		return 0;
 
-	return grep_buffer_1(opt, name, buf, size, 0);
+	return grep_buffer_1(opt, name, is_binary, buf, size, 0);
 }
diff --git a/grep.h b/grep.h
index fb205f3..8447e4c 100644
--- a/grep.h
+++ b/grep.h
@@ -128,7 +128,7 @@ extern void append_grep_pattern(struct grep_opt *opt, const char *pat, const cha
 extern void append_header_grep_pattern(struct grep_opt *, enum grep_header_field, const char *);
 extern void compile_grep_patterns(struct grep_opt *opt);
 extern void free_grep_patterns(struct grep_opt *opt);
-extern int grep_buffer(struct grep_opt *opt, const char *name, char *buf, unsigned long size);
+extern int grep_buffer(struct grep_opt *opt, const char *name, int is_binary, char *buf, unsigned long size);
 
 extern struct grep_opt *grep_opt_dup(const struct grep_opt *opt);
 extern int grep_threads_ok(const struct grep_opt *opt);
diff --git a/revision.c b/revision.c
index c97d834..3dcd968 100644
--- a/revision.c
+++ b/revision.c
@@ -2150,6 +2150,7 @@ static int commit_match(struct commit *commit, struct rev_info *opt)
 		return 1;
 	return grep_buffer(&opt->grep_filter,
 			   NULL, /* we say nothing, not even filename */
+			   -1,
 			   commit->buffer, strlen(commit->buffer));
 }
 
-- 
1.7.9.3.gc3fce1.dirty

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 2/2] grep: respect diff attributes for binary-ness
  2012-02-01 22:14                         ` Jeff King
  2012-02-01 23:20                           ` Jeff King
  2012-02-01 23:21                           ` [PATCH 1/2] grep: let grep_buffer callers specify a binary flag Jeff King
@ 2012-02-01 23:21                           ` Jeff King
  2 siblings, 0 replies; 43+ messages in thread
From: Jeff King @ 2012-02-01 23:21 UTC (permalink / raw)
  To: Conrad Irwin; +Cc: Junio C Hamano, git, Nguyen Thai Ngoc Duy, Dov Grobgeld

There is currently no way for users to tell git-grep that a
particular path is or is not a binary file; instead, grep
always relies on its auto-detection (or the user specifying
"-a" to treat all binary-looking files like text).

This patch teaches git-grep to use the same attribute lookup
that is used by git-diff. We could add a new "grep" flag,
but that is unnecessarily complex and unlikely to be useful.
Despite the name, the "-diff" attribute (or "diff=foo" and
the associated diff.foo.binary config option) are really
about describing the contents of the path. It's simply
historical that diff was the only thing that cared about
these attributes in the past.

And if this simple approach turns out to be insufficient, we
still have a backwards-compatible path forward: we can add a
separate "grep" attribute, and fall back to respecting
"diff" if it is unset.

There are a few things worth nothing about the
implementation.

One is that the attribute lookup happens outside of the
grep.[ch] interface (i.e., outside of grep_buffer). We could
do it at a lower level, which would be slightly more
convenient for callers. However, this interacts badly with
threading.  The attribute-lookup code performs best when
lookup order matches the filesystem order (so looking up
"a/b/c" is cheaper if we just looked up "a/b/d" than if we
just did "x/y/z"). Because we issue many simultaneous
requests to grep_buffer, performing the attribute lookup at
that level will cause requests from unrelated paths to
interleave, and we lose the locality that makes the lookup
optimization work.

Instead, in the threaded case we check the attributes as
they are added to the work queue, meaning they are looked up
in the optimal order (in the single threaded case, this is a
non-issue, as we process the files serially in the optimal
order).

Here are a few numbers showing the difference. The first is
a best-of-five time for "git grep foo" on the linux-2.6 repo
before this patch (on a 4-core HT processor, using 8
threads):

  real    0m0.306s
  user    0m1.512s
  sys     0m0.412s

Now here's the time for the same operation with a trial
implementation looking up attributes in grep_buffer,
showing a 31% slowdown:

  real    0m0.401s
  user    0m1.760s
  sys     0m0.636s

And here's the same operation with this patch, with only an
11% slowdown:

  real    0m0.339s
  user    0m1.444s
  sys     0m0.584s

Note that while the percentages are big, the absolute
numbers are pretty small. In particular, this is a very
inexpensive grep to do. A more complex regex should have the
same absolute slowdown from the attribute lookup, but it
would be a much smaller percentage of the total processing
time.  So doing it this way is not a huge win, but it does
help on small greps.

The second issue worth noting is that while we do a full
attribute lookup, we pass along only the binary flag to
grep_buffer. When the "-p" flag is given to grep, we will
actually look up the same attributes to find funcname
patterns of matches. We could pass along the pointer to the
userdiff driver for reuse.

However, it's not worth doing this. It clutters the code, as
the driver has to be passed through a large number of helper
functions (and pollutes the grep_buffer interface with
userdiff code). And in my tests, it didn't actually improve
performance. Because we only have to look up the attribute
for a grep hit, in most cases we will only do the funcname
lookup for a small subset of files. The cost of the extra
lookups turns out to be negligible.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/grep.c         |   26 ++++++++++++++++++++++----
 t/t7008-grep-binary.sh |   24 ++++++++++++++++++++++++
 2 files changed, 46 insertions(+), 4 deletions(-)

diff --git a/builtin/grep.c b/builtin/grep.c
index e328316..bb38804 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -42,6 +42,7 @@ enum work_type {WORK_SHA1, WORK_FILE};
 struct work_item {
 	enum work_type type;
 	char *name;
+	int is_binary;
 
 	/* if type == WORK_SHA1, then 'identifier' is a SHA1,
 	 * otherwise type == WORK_FILE, and 'identifier' is a NUL
@@ -113,6 +114,14 @@ static pthread_cond_t cond_result;
 
 static int skip_first_line;
 
+static int path_is_binary(const char *path)
+{
+	struct userdiff_driver *drv = userdiff_find_by_path(path);
+	if (drv)
+		return drv->binary;
+	return -1;
+}
+
 static void add_work(enum work_type type, char *name, void *id)
 {
 	grep_lock();
@@ -123,6 +132,11 @@ static void add_work(enum work_type type, char *name, void *id)
 
 	todo[todo_end].type = type;
 	todo[todo_end].name = name;
+
+	pthread_mutex_lock(&grep_attr_mutex);
+	todo[todo_end].is_binary = path_is_binary(name);
+	pthread_mutex_unlock(&grep_attr_mutex);
+
 	todo[todo_end].identifier = id;
 	todo[todo_end].done = 0;
 	strbuf_reset(&todo[todo_end].out);
@@ -221,14 +235,16 @@ static void *run(void *arg)
 			void* data = load_sha1(w->identifier, &sz, w->name);
 
 			if (data) {
-				hit |= grep_buffer(opt, w->name, -1, data, sz);
+				hit |= grep_buffer(opt, w->name, w->is_binary,
+						   data, sz);
 				free(data);
 			}
 		} else if (w->type == WORK_FILE) {
 			size_t sz;
 			void* data = load_file(w->identifier, &sz);
 			if (data) {
-				hit |= grep_buffer(opt, w->name, -1, data, sz);
+				hit |= grep_buffer(opt, w->name, w->is_binary,
+						   data, sz);
 				free(data);
 			}
 		} else {
@@ -421,7 +437,8 @@ static int grep_sha1(struct grep_opt *opt, const unsigned char *sha1,
 		if (!data)
 			hit = 0;
 		else
-			hit = grep_buffer(opt, name, -1, data, sz);
+			hit = grep_buffer(opt, name, path_is_binary(name),
+					  data, sz);
 
 		free(data);
 		free(name);
@@ -483,7 +500,8 @@ static int grep_file(struct grep_opt *opt, const char *filename)
 		if (!data)
 			hit = 0;
 		else
-			hit = grep_buffer(opt, name, -1, data, sz);
+			hit = grep_buffer(opt, name, path_is_binary(name),
+					  data, sz);
 
 		free(data);
 		free(name);
diff --git a/t/t7008-grep-binary.sh b/t/t7008-grep-binary.sh
index 917a264..fd6410f 100755
--- a/t/t7008-grep-binary.sh
+++ b/t/t7008-grep-binary.sh
@@ -99,4 +99,28 @@ test_expect_success 'git grep y<NUL>x a' "
 	test_must_fail git grep -f f a
 "
 
+test_expect_success 'grep respects binary diff attribute' '
+	echo text >t &&
+	git add t &&
+	echo t:text >expect &&
+	git grep text t >actual &&
+	test_cmp expect actual &&
+	echo "t -diff" >.gitattributes &&
+	echo "Binary file t matches" >expect &&
+	git grep text t >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'grep respects not-binary diff attribute' '
+	echo binQary | q_to_nul >b &&
+	git add b &&
+	echo "Binary file b matches" >expect &&
+	git grep bin b >actual &&
+	test_cmp expect actual &&
+	echo "b diff" >.gitattributes &&
+	echo "b:binQary" >expect &&
+	git grep bin b | nul_to_q >actual &&
+	test_cmp expect actual
+'
+
 test_done
-- 
1.7.9.3.gc3fce1.dirty

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/2] grep: let grep_buffer callers specify a binary flag
  2012-02-01 23:21                           ` [PATCH 1/2] grep: let grep_buffer callers specify a binary flag Jeff King
@ 2012-02-02  0:47                             ` Junio C Hamano
  2012-02-02  0:52                               ` Jeff King
  0 siblings, 1 reply; 43+ messages in thread
From: Junio C Hamano @ 2012-02-02  0:47 UTC (permalink / raw)
  To: Jeff King; +Cc: Conrad Irwin, git, Nguyen Thai Ngoc Duy, Dov Grobgeld

Jeff King <peff@peff.net> writes:

> The caller of grep_buffer may have extra information about
> whether a buffer is binary or not (e.g., from configuration).
> Let's give them a chance to pass along that information and
> override our binary auto-detection.

Hrm, I would have expected a patch that turns "const char *name" into a
structure that has name and drv as its members, so that later we can tell
the function more about the nature of the contents. Or a separate pointer
to drv in place of your "binary" flag word.

I am not saying that your patch is wrong. It was just somewhat unexpected
that "binary" is the only additional thing we want to tell the function.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/2] grep: let grep_buffer callers specify a binary flag
  2012-02-02  0:47                             ` Junio C Hamano
@ 2012-02-02  0:52                               ` Jeff King
  2012-02-02  8:17                                 ` [PATCH 0/9] respect binary attribute in grep Jeff King
  0 siblings, 1 reply; 43+ messages in thread
From: Jeff King @ 2012-02-02  0:52 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Conrad Irwin, git, Nguyen Thai Ngoc Duy, Dov Grobgeld

On Wed, Feb 01, 2012 at 04:47:38PM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > The caller of grep_buffer may have extra information about
> > whether a buffer is binary or not (e.g., from configuration).
> > Let's give them a chance to pass along that information and
> > override our binary auto-detection.
> 
> Hrm, I would have expected a patch that turns "const char *name" into a
> structure that has name and drv as its members, so that later we can tell
> the function more about the nature of the contents. Or a separate pointer
> to drv in place of your "binary" flag word.

Hmm. Yeah, I would be OK with that, as it's really encapsulating the
idea of "stuff we want to tell grep about this buffer".

What I really didn't want to do was pass the userdiff driver directly,
as that feels way too much like an implementation detail (and while it
can be used to avoid further lookups, it doesn't seem to make a
difference in practice -- see the following patch). And passing it
around became a big messy chore.

But if it were a "struct grep_context" that encapsulated those things, I
think it would be much nicer (it could even carry a pointer to "struct
grep_opt" in it, making the code _more_ pleasant to read, not less).

I'll take a look at re-working it that way.

-Peff

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] Don't search files with an unset "grep" attribute
  2012-02-01 23:20                           ` Jeff King
@ 2012-02-02  2:03                             ` Junio C Hamano
  0 siblings, 0 replies; 43+ messages in thread
From: Junio C Hamano @ 2012-02-02  2:03 UTC (permalink / raw)
  To: Jeff King; +Cc: Conrad Irwin, git, Nguyen Thai Ngoc Duy, Dov Grobgeld

Jeff King <peff@peff.net> writes:

> There are a few optimizations I didn't do that you could put on top:
>
>   1. When "-a" is given, we can avoid the attribute lookup altogether.

Correct.

>   2. When "-I" is given, we can actually check attributes _before_
>      loading the file or blob into memory. This can help with very large
>      binaries.

Nice.

> ... However, for large files, it might be nice to have a streaming grep
> interface anyway, and (3) could be part of that.

Even nicer ;-)

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 0/9] respect binary attribute in grep
  2012-02-02  0:52                               ` Jeff King
@ 2012-02-02  8:17                                 ` Jeff King
  2012-02-02  8:18                                   ` [PATCH 1/9] grep: make locking flag global Jeff King
                                                     ` (12 more replies)
  0 siblings, 13 replies; 43+ messages in thread
From: Jeff King @ 2012-02-02  8:17 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Thomas Rast, Conrad Irwin, git, Nguyen Thai Ngoc Duy, Dov Grobgeld

[+cc Thomas, as I am mangling some of his recent work with my
     refactoring]

On Wed, Feb 01, 2012 at 07:52:09PM -0500, Jeff King wrote:

> > Hrm, I would have expected a patch that turns "const char *name" into a
> > structure that has name and drv as its members, so that later we can tell
> > the function more about the nature of the contents. Or a separate pointer
> > to drv in place of your "binary" flag word.
> [...]
> I'll take a look at re-working it that way.

Thanks for a dose of sanity. The result turned out much easier to read
(and explain in the commit messages, as it was simple to break into
smaller commits). In particular, the "don't read binary-marked files at
all with -I" optimization became very natural.

I implemented all of the other optimizations I mentioned except the
"only stream the first few bytes when auto-detecting binary-ness" one.
However, it should be easy to do on top of these changes. I need to
re-visit the similar change to diff_filespec_is_binary, and I'll do both
at the same time.

The patches are:

  [1/9]: grep: make locking flag global
  [2/9]: grep: move sha1-reading mutex into low-level code
  [3/9]: grep: refactor the concept of "grep source" into an object
  [4/9]: convert git-grep to use grep_source interface
  [5/9]: grep: drop grep_buffer's "name" parameter
  [6/9]: grep: cache userdiff_driver in grep_source

    These are all refactoring that should have no behavior change.

  [7/9]: grep: respect diff attributes for binary-ness

    This is the point of the series. :)

  [8/9]: grep: load file data after checking binary-ness
  [9/9]: grep: pre-load userdiff drivers when threaded

    And these two are simple optimizations.

-Peff

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 1/9] grep: make locking flag global
  2012-02-02  8:17                                 ` [PATCH 0/9] respect binary attribute in grep Jeff King
@ 2012-02-02  8:18                                   ` Jeff King
  2012-02-02  8:18                                   ` [PATCH 2/9] grep: move sha1-reading mutex into low-level code Jeff King
                                                     ` (11 subsequent siblings)
  12 siblings, 0 replies; 43+ messages in thread
From: Jeff King @ 2012-02-02  8:18 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Thomas Rast, Conrad Irwin, git, Nguyen Thai Ngoc Duy, Dov Grobgeld

The low-level grep code traditionally didn't care about
threading, as it doesn't do any threading itself and didn't
call out to other non-thread-safe code.  That changed with
0579f91 (grep: enable threading with -p and -W using lazy
attribute lookup, 2011-12-12), which pushed the lookup of
funcname attributes (which is not thread-safe) into the
low-level grep code.

As a result, the low-level code learned about a new global
"grep_attr_mutex" to serialize access to the attribute code.
A multi-threaded caller (e.g., builtin/grep.c) is expected
to initialize the mutex and set "use_threads" in the
grep_opt structure. The low-level code only uses the lock if
use_threads is set.

However, putting the use_threads flag into the grep_opt
struct is not the most logical place. Whether threading is
in use is not something that matters for each call to
grep_buffer, but is instead global to the whole program
(i.e., if any thread is doing multi-threaded grep, every
other thread, even if it thinks it is doing its own
single-threaded grep, would need to use the locking).  In
practice, this distinction isn't a problem for us, because
the only user of multi-threaded grep is "git-grep", which
does nothing except call grep.

This patch turns the opt->use_threads flag into a global
flag. More important than the nit-picking semantic argument
above is that this means that the locking functions don't
need to actually have access to a grep_opt to know whether
to lock. Which in turn can make adding new locks simpler, as
we don't need to pass around a grep_opt.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/grep.c |    4 ++--
 grep.c         |   18 ++++++++++--------
 grep.h         |    2 +-
 3 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/builtin/grep.c b/builtin/grep.c
index 5c2ae94..06983f9 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -259,6 +259,7 @@ static void start_threads(struct grep_opt *opt)
 	pthread_cond_init(&cond_add, NULL);
 	pthread_cond_init(&cond_write, NULL);
 	pthread_cond_init(&cond_result, NULL);
+	grep_use_locks = 1;
 
 	for (i = 0; i < ARRAY_SIZE(todo); i++) {
 		strbuf_init(&todo[i].out, 0);
@@ -307,6 +308,7 @@ static int wait_all(void)
 	pthread_cond_destroy(&cond_add);
 	pthread_cond_destroy(&cond_write);
 	pthread_cond_destroy(&cond_result);
+	grep_use_locks = 0;
 
 	return hit;
 }
@@ -1030,8 +1032,6 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
 	use_threads = 0;
 #endif
 
-	opt.use_threads = use_threads;
-
 #ifndef NO_PTHREADS
 	if (use_threads) {
 		if (!(opt.name_only || opt.unmatch_name_only || opt.count)
diff --git a/grep.c b/grep.c
index 486230b..7a67c2f 100644
--- a/grep.c
+++ b/grep.c
@@ -807,26 +807,28 @@ static void show_line(struct grep_opt *opt, char *bol, char *eol,
 }
 
 #ifndef NO_PTHREADS
+int grep_use_locks;
+
 /*
  * This lock protects access to the gitattributes machinery, which is
  * not thread-safe.
  */
 pthread_mutex_t grep_attr_mutex;
 
-static inline void grep_attr_lock(struct grep_opt *opt)
+static inline void grep_attr_lock(void)
 {
-	if (opt->use_threads)
+	if (grep_use_locks)
 		pthread_mutex_lock(&grep_attr_mutex);
 }
 
-static inline void grep_attr_unlock(struct grep_opt *opt)
+static inline void grep_attr_unlock(void)
 {
-	if (opt->use_threads)
+	if (grep_use_locks)
 		pthread_mutex_unlock(&grep_attr_mutex);
 }
 #else
-#define grep_attr_lock(opt)
-#define grep_attr_unlock(opt)
+#define grep_attr_lock()
+#define grep_attr_unlock()
 #endif
 
 static int match_funcname(struct grep_opt *opt, const char *name, char *bol, char *eol)
@@ -834,9 +836,9 @@ static int match_funcname(struct grep_opt *opt, const char *name, char *bol, cha
 	xdemitconf_t *xecfg = opt->priv;
 	if (xecfg && !xecfg->find_func) {
 		struct userdiff_driver *drv;
-		grep_attr_lock(opt);
+		grep_attr_lock();
 		drv = userdiff_find_by_path(name);
-		grep_attr_unlock(opt);
+		grep_attr_unlock();
 		if (drv && drv->funcname.pattern) {
 			const struct userdiff_funcname *pe = &drv->funcname;
 			xdiff_set_find_func(xecfg, pe->pattern, pe->cflags);
diff --git a/grep.h b/grep.h
index fb205f3..3653bb3 100644
--- a/grep.h
+++ b/grep.h
@@ -116,7 +116,6 @@ struct grep_opt {
 	int show_hunk_mark;
 	int file_break;
 	int heading;
-	int use_threads;
 	void *priv;
 
 	void (*output)(struct grep_opt *opt, const void *data, size_t size);
@@ -138,6 +137,7 @@ extern int grep_threads_ok(const struct grep_opt *opt);
  * Mutex used around access to the attributes machinery if
  * opt->use_threads.  Must be initialized/destroyed by callers!
  */
+extern int grep_use_locks;
 extern pthread_mutex_t grep_attr_mutex;
 #endif
 
-- 
1.7.9.3.gc3fce1.dirty

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 2/9] grep: move sha1-reading mutex into low-level code
  2012-02-02  8:17                                 ` [PATCH 0/9] respect binary attribute in grep Jeff King
  2012-02-02  8:18                                   ` [PATCH 1/9] grep: make locking flag global Jeff King
@ 2012-02-02  8:18                                   ` Jeff King
  2012-02-02  8:19                                   ` [PATCH 3/9] grep: refactor the concept of "grep source" into an object Jeff King
                                                     ` (10 subsequent siblings)
  12 siblings, 0 replies; 43+ messages in thread
From: Jeff King @ 2012-02-02  8:18 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Thomas Rast, Conrad Irwin, git, Nguyen Thai Ngoc Duy, Dov Grobgeld

The multi-threaded git-grep code needs to serialize access
to the thread-unsafe read_sha1_file call. It does this with
a mutex that is local to builtin/grep.c.

Let's instead push this down into grep.c, where it can be
used by both builtin/grep.c and grep.c. This will let us
safely teach the low-level grep.c code tricks that involve
reading from the object db.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/grep.c |   29 ++++++-----------------------
 grep.c         |    6 ++++++
 grep.h         |   17 +++++++++++++++++
 3 files changed, 29 insertions(+), 23 deletions(-)

diff --git a/builtin/grep.c b/builtin/grep.c
index 06983f9..f4402fa 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -85,21 +85,6 @@ static inline void grep_unlock(void)
 		pthread_mutex_unlock(&grep_mutex);
 }
 
-/* Used to serialize calls to read_sha1_file. */
-static pthread_mutex_t read_sha1_mutex;
-
-static inline void read_sha1_lock(void)
-{
-	if (use_threads)
-		pthread_mutex_lock(&read_sha1_mutex);
-}
-
-static inline void read_sha1_unlock(void)
-{
-	if (use_threads)
-		pthread_mutex_unlock(&read_sha1_mutex);
-}
-
 /* Signalled when a new work_item is added to todo. */
 static pthread_cond_t cond_add;
 
@@ -254,7 +239,7 @@ static void start_threads(struct grep_opt *opt)
 	int i;
 
 	pthread_mutex_init(&grep_mutex, NULL);
-	pthread_mutex_init(&read_sha1_mutex, NULL);
+	pthread_mutex_init(&grep_read_mutex, NULL);
 	pthread_mutex_init(&grep_attr_mutex, NULL);
 	pthread_cond_init(&cond_add, NULL);
 	pthread_cond_init(&cond_write, NULL);
@@ -303,7 +288,7 @@ static int wait_all(void)
 	}
 
 	pthread_mutex_destroy(&grep_mutex);
-	pthread_mutex_destroy(&read_sha1_mutex);
+	pthread_mutex_destroy(&grep_read_mutex);
 	pthread_mutex_destroy(&grep_attr_mutex);
 	pthread_cond_destroy(&cond_add);
 	pthread_cond_destroy(&cond_write);
@@ -313,8 +298,6 @@ static int wait_all(void)
 	return hit;
 }
 #else /* !NO_PTHREADS */
-#define read_sha1_lock()
-#define read_sha1_unlock()
 
 static int wait_all(void)
 {
@@ -376,9 +359,9 @@ static void *lock_and_read_sha1_file(const unsigned char *sha1, enum object_type
 {
 	void *data;
 
-	read_sha1_lock();
+	grep_read_lock();
 	data = read_sha1_file(sha1, type, size);
-	read_sha1_unlock();
+	grep_read_unlock();
 	return data;
 }
 
@@ -617,10 +600,10 @@ static int grep_object(struct grep_opt *opt, const struct pathspec *pathspec,
 		struct strbuf base;
 		int hit, len;
 
-		read_sha1_lock();
+		grep_read_lock();
 		data = read_object_with_reference(obj->sha1, tree_type,
 						  &size, NULL);
-		read_sha1_unlock();
+		grep_read_unlock();
 
 		if (!data)
 			die(_("unable to read tree (%s)"), sha1_to_hex(obj->sha1));
diff --git a/grep.c b/grep.c
index 7a67c2f..db58a29 100644
--- a/grep.c
+++ b/grep.c
@@ -826,6 +826,12 @@ static inline void grep_attr_unlock(void)
 	if (grep_use_locks)
 		pthread_mutex_unlock(&grep_attr_mutex);
 }
+
+/*
+ * Same as git_attr_mutex, but protecting the thread-unsafe object db access.
+ */
+pthread_mutex_t grep_read_mutex;
+
 #else
 #define grep_attr_lock()
 #define grep_attr_unlock()
diff --git a/grep.h b/grep.h
index 3653bb3..4f1b025 100644
--- a/grep.h
+++ b/grep.h
@@ -139,6 +139,23 @@ extern int grep_threads_ok(const struct grep_opt *opt);
  */
 extern int grep_use_locks;
 extern pthread_mutex_t grep_attr_mutex;
+extern pthread_mutex_t grep_read_mutex;
+
+static inline void grep_read_lock(void)
+{
+	if (grep_use_locks)
+		pthread_mutex_lock(&grep_read_mutex);
+}
+
+static inline void grep_read_unlock(void)
+{
+	if (grep_use_locks)
+		pthread_mutex_unlock(&grep_read_mutex);
+}
+
+#else
+#define grep_read_lock()
+#define grep_read_unlock()
 #endif
 
 #endif
-- 
1.7.9.3.gc3fce1.dirty

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 3/9] grep: refactor the concept of "grep source" into an object
  2012-02-02  8:17                                 ` [PATCH 0/9] respect binary attribute in grep Jeff King
  2012-02-02  8:18                                   ` [PATCH 1/9] grep: make locking flag global Jeff King
  2012-02-02  8:18                                   ` [PATCH 2/9] grep: move sha1-reading mutex into low-level code Jeff King
@ 2012-02-02  8:19                                   ` Jeff King
  2012-02-02  8:19                                   ` [PATCH 4/9] convert git-grep to use grep_source interface Jeff King
                                                     ` (9 subsequent siblings)
  12 siblings, 0 replies; 43+ messages in thread
From: Jeff King @ 2012-02-02  8:19 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Thomas Rast, Conrad Irwin, git, Nguyen Thai Ngoc Duy, Dov Grobgeld

The main interface to the low-level grep code is
grep_buffer, which takes a pointer to a buffer and a size.
This is convenient and flexible (we use it to grep commit
bodies, files on disk, and blobs by sha1), but it makes it
hard to pass extra information about what we are grepping
(either for correctness, like overriding binary
auto-detection, or for optimizations, like lazily loading
blob contents).

Instead, let's encapsulate the idea of a "grep source",
including the buffer, its size, and where the data is coming
from. This is similar to the diff_filespec structure used by
the diff code (unsurprising, since future patches will
implement some of the same optimizations found there).

The diffstat is slightly scarier than the actual patch
content. Most of the modified lines are simply replacing
access to raw variables with their counterparts that are now
in a "struct grep_source". Most of the added lines were
taken from builtin/grep.c, which partially abstracted the
idea of grep sources (for file vs sha1 sources).

Instead of dropping the now-redundant code, this patch
leaves builtin/grep.c using the traditional grep_buffer
interface (which now wraps the grep_source interface). That
makes it easy to test that there is no change of behavior
(yet).

Signed-off-by: Jeff King <peff@peff.net>
---
 grep.c |  198 +++++++++++++++++++++++++++++++++++++++++++++++++++++-----------
 grep.h |   22 +++++++
 2 files changed, 186 insertions(+), 34 deletions(-)

diff --git a/grep.c b/grep.c
index db58a29..8204ca2 100644
--- a/grep.c
+++ b/grep.c
@@ -837,13 +837,13 @@ pthread_mutex_t grep_read_mutex;
 #define grep_attr_unlock()
 #endif
 
-static int match_funcname(struct grep_opt *opt, const char *name, char *bol, char *eol)
+static int match_funcname(struct grep_opt *opt, struct grep_source *gs, char *bol, char *eol)
 {
 	xdemitconf_t *xecfg = opt->priv;
 	if (xecfg && !xecfg->find_func) {
 		struct userdiff_driver *drv;
 		grep_attr_lock();
-		drv = userdiff_find_by_path(name);
+		drv = userdiff_find_by_path(gs->name);
 		grep_attr_unlock();
 		if (drv && drv->funcname.pattern) {
 			const struct userdiff_funcname *pe = &drv->funcname;
@@ -866,33 +866,33 @@ static int match_funcname(struct grep_opt *opt, const char *name, char *bol, cha
 	return 0;
 }
 
-static void show_funcname_line(struct grep_opt *opt, const char *name,
-			       char *buf, char *bol, unsigned lno)
+static void show_funcname_line(struct grep_opt *opt, struct grep_source *gs,
+			       char *bol, unsigned lno)
 {
-	while (bol > buf) {
+	while (bol > gs->buf) {
 		char *eol = --bol;
 
-		while (bol > buf && bol[-1] != '\n')
+		while (bol > gs->buf && bol[-1] != '\n')
 			bol--;
 		lno--;
 
 		if (lno <= opt->last_shown)
 			break;
 
-		if (match_funcname(opt, name, bol, eol)) {
-			show_line(opt, bol, eol, name, lno, '=');
+		if (match_funcname(opt, gs, bol, eol)) {
+			show_line(opt, bol, eol, gs->name, lno, '=');
 			break;
 		}
 	}
 }
 
-static void show_pre_context(struct grep_opt *opt, const char *name, char *buf,
+static void show_pre_context(struct grep_opt *opt, struct grep_source *gs,
 			     char *bol, char *end, unsigned lno)
 {
 	unsigned cur = lno, from = 1, funcname_lno = 0;
 	int funcname_needed = !!opt->funcname;
 
-	if (opt->funcbody && !match_funcname(opt, name, bol, end))
+	if (opt->funcbody && !match_funcname(opt, gs, bol, end))
 		funcname_needed = 2;
 
 	if (opt->pre_context < lno)
@@ -901,14 +901,14 @@ static void show_pre_context(struct grep_opt *opt, const char *name, char *buf,
 		from = opt->last_shown + 1;
 
 	/* Rewind. */
-	while (bol > buf &&
+	while (bol > gs->buf &&
 	       cur > (funcname_needed == 2 ? opt->last_shown + 1 : from)) {
 		char *eol = --bol;
 
-		while (bol > buf && bol[-1] != '\n')
+		while (bol > gs->buf && bol[-1] != '\n')
 			bol--;
 		cur--;
-		if (funcname_needed && match_funcname(opt, name, bol, eol)) {
+		if (funcname_needed && match_funcname(opt, gs, bol, eol)) {
 			funcname_lno = cur;
 			funcname_needed = 0;
 		}
@@ -916,7 +916,7 @@ static void show_pre_context(struct grep_opt *opt, const char *name, char *buf,
 
 	/* We need to look even further back to find a function signature. */
 	if (opt->funcname && funcname_needed)
-		show_funcname_line(opt, name, buf, bol, cur);
+		show_funcname_line(opt, gs, bol, cur);
 
 	/* Back forward. */
 	while (cur < lno) {
@@ -924,7 +924,7 @@ static void show_pre_context(struct grep_opt *opt, const char *name, char *buf,
 
 		while (*eol != '\n')
 			eol++;
-		show_line(opt, bol, eol, name, cur, sign);
+		show_line(opt, bol, eol, gs->name, cur, sign);
 		bol = eol + 1;
 		cur++;
 	}
@@ -991,11 +991,10 @@ static void std_output(struct grep_opt *opt, const void *buf, size_t size)
 	fwrite(buf, size, 1, stdout);
 }
 
-static int grep_buffer_1(struct grep_opt *opt, const char *name,
-			 char *buf, unsigned long size, int collect_hits)
+static int grep_source_1(struct grep_opt *opt, struct grep_source *gs, int collect_hits)
 {
-	char *bol = buf;
-	unsigned long left = size;
+	char *bol;
+	unsigned long left;
 	unsigned lno = 1;
 	unsigned last_hit = 0;
 	int binary_match_only = 0;
@@ -1023,13 +1022,16 @@ static int grep_buffer_1(struct grep_opt *opt, const char *name,
 	}
 	opt->last_shown = 0;
 
+	if (grep_source_load(gs) < 0)
+		return 0;
+
 	switch (opt->binary) {
 	case GREP_BINARY_DEFAULT:
-		if (buffer_is_binary(buf, size))
+		if (buffer_is_binary(gs->buf, gs->size))
 			binary_match_only = 1;
 		break;
 	case GREP_BINARY_NOMATCH:
-		if (buffer_is_binary(buf, size))
+		if (buffer_is_binary(gs->buf, gs->size))
 			return 0; /* Assume unmatch */
 		break;
 	case GREP_BINARY_TEXT:
@@ -1043,6 +1045,8 @@ static int grep_buffer_1(struct grep_opt *opt, const char *name,
 
 	try_lookahead = should_lookahead(opt);
 
+	bol = gs->buf;
+	left = gs->size;
 	while (left) {
 		char *eol, ch;
 		int hit;
@@ -1091,14 +1095,14 @@ static int grep_buffer_1(struct grep_opt *opt, const char *name,
 			if (opt->status_only)
 				return 1;
 			if (opt->name_only) {
-				show_name(opt, name);
+				show_name(opt, gs->name);
 				return 1;
 			}
 			if (opt->count)
 				goto next_line;
 			if (binary_match_only) {
 				opt->output(opt, "Binary file ", 12);
-				output_color(opt, name, strlen(name),
+				output_color(opt, gs->name, strlen(gs->name),
 					     opt->color_filename);
 				opt->output(opt, " matches\n", 9);
 				return 1;
@@ -1107,23 +1111,23 @@ static int grep_buffer_1(struct grep_opt *opt, const char *name,
 			 * pre-context lines, we would need to show them.
 			 */
 			if (opt->pre_context || opt->funcbody)
-				show_pre_context(opt, name, buf, bol, eol, lno);
+				show_pre_context(opt, gs, bol, eol, lno);
 			else if (opt->funcname)
-				show_funcname_line(opt, name, buf, bol, lno);
-			show_line(opt, bol, eol, name, lno, ':');
+				show_funcname_line(opt, gs, bol, lno);
+			show_line(opt, bol, eol, gs->name, lno, ':');
 			last_hit = lno;
 			if (opt->funcbody)
 				show_function = 1;
 			goto next_line;
 		}
-		if (show_function && match_funcname(opt, name, bol, eol))
+		if (show_function && match_funcname(opt, gs, bol, eol))
 			show_function = 0;
 		if (show_function ||
 		    (last_hit && lno <= last_hit + opt->post_context)) {
 			/* If the last hit is within the post context,
 			 * we need to show this line.
 			 */
-			show_line(opt, bol, eol, name, lno, '-');
+			show_line(opt, bol, eol, gs->name, lno, '-');
 		}
 
 	next_line:
@@ -1141,7 +1145,7 @@ static int grep_buffer_1(struct grep_opt *opt, const char *name,
 		return 0;
 	if (opt->unmatch_name_only) {
 		/* We did not see any hit, so we want to show this */
-		show_name(opt, name);
+		show_name(opt, gs->name);
 		return 1;
 	}
 
@@ -1155,7 +1159,7 @@ static int grep_buffer_1(struct grep_opt *opt, const char *name,
 	 */
 	if (opt->count && count) {
 		char buf[32];
-		output_color(opt, name, strlen(name), opt->color_filename);
+		output_color(opt, gs->name, strlen(gs->name), opt->color_filename);
 		output_sep(opt, ':');
 		snprintf(buf, sizeof(buf), "%u\n", count);
 		opt->output(opt, buf, strlen(buf));
@@ -1190,23 +1194,149 @@ static int chk_hit_marker(struct grep_expr *x)
 	}
 }
 
-int grep_buffer(struct grep_opt *opt, const char *name, char *buf, unsigned long size)
+int grep_source(struct grep_opt *opt, struct grep_source *gs)
 {
 	/*
 	 * we do not have to do the two-pass grep when we do not check
 	 * buffer-wide "all-match".
 	 */
 	if (!opt->all_match)
-		return grep_buffer_1(opt, name, buf, size, 0);
+		return grep_source_1(opt, gs, 0);
 
 	/* Otherwise the toplevel "or" terms hit a bit differently.
 	 * We first clear hit markers from them.
 	 */
 	clr_hit_marker(opt->pattern_expression);
-	grep_buffer_1(opt, name, buf, size, 1);
+	grep_source_1(opt, gs, 1);
 
 	if (!chk_hit_marker(opt->pattern_expression))
 		return 0;
 
-	return grep_buffer_1(opt, name, buf, size, 0);
+	return grep_source_1(opt, gs, 0);
+}
+
+int grep_buffer(struct grep_opt *opt, const char *name, char *buf, unsigned long size)
+{
+	struct grep_source gs;
+	int r;
+
+	grep_source_init(&gs, GREP_SOURCE_BUF, name, NULL);
+	gs.buf = buf;
+	gs.size = size;
+
+	r = grep_source(opt, &gs);
+
+	grep_source_clear(&gs);
+	return r;
+}
+
+void grep_source_init(struct grep_source *gs, enum grep_source_type type,
+		      const char *name, const void *identifier)
+{
+	gs->type = type;
+	gs->name = name ? xstrdup(name) : NULL;
+	gs->buf = NULL;
+	gs->size = 0;
+
+	switch (type) {
+	case GREP_SOURCE_FILE:
+		gs->identifier = xstrdup(identifier);
+		break;
+	case GREP_SOURCE_SHA1:
+		gs->identifier = xmalloc(20);
+		memcpy(gs->identifier, identifier, 20);
+		break;
+	case GREP_SOURCE_BUF:
+		gs->identifier = NULL;
+	}
+}
+
+void grep_source_clear(struct grep_source *gs)
+{
+	free(gs->name);
+	gs->name = NULL;
+	free(gs->identifier);
+	gs->identifier = NULL;
+	grep_source_clear_data(gs);
+}
+
+void grep_source_clear_data(struct grep_source *gs)
+{
+	switch (gs->type) {
+	case GREP_SOURCE_FILE:
+	case GREP_SOURCE_SHA1:
+		free(gs->buf);
+		gs->buf = NULL;
+		gs->size = 0;
+		break;
+	case GREP_SOURCE_BUF:
+		/* leave user-provided buf intact */
+		break;
+	}
+}
+
+static int grep_source_load_sha1(struct grep_source *gs)
+{
+	enum object_type type;
+
+	grep_read_lock();
+	gs->buf = read_sha1_file(gs->identifier, &type, &gs->size);
+	grep_read_unlock();
+
+	if (!gs->buf)
+		return error(_("'%s': unable to read %s"),
+			     gs->name,
+			     sha1_to_hex(gs->identifier));
+	return 0;
+}
+
+static int grep_source_load_file(struct grep_source *gs)
+{
+	const char *filename = gs->identifier;
+	struct stat st;
+	char *data;
+	size_t size;
+	int i;
+
+	if (lstat(filename, &st) < 0) {
+	err_ret:
+		if (errno != ENOENT)
+			error(_("'%s': %s"), filename, strerror(errno));
+		return -1;
+	}
+	if (!S_ISREG(st.st_mode))
+		return -1;
+	size = xsize_t(st.st_size);
+	i = open(filename, O_RDONLY);
+	if (i < 0)
+		goto err_ret;
+	data = xmalloc(size + 1);
+	if (st.st_size != read_in_full(i, data, size)) {
+		error(_("'%s': short read %s"), filename, strerror(errno));
+		close(i);
+		free(data);
+		return -1;
+	}
+	close(i);
+	data[size] = 0;
+
+	gs->buf = data;
+	gs->size = size;
+	return 0;
+}
+
+int grep_source_load(struct grep_source *gs)
+{
+	if (gs->buf)
+		return 0;
+
+	switch (gs->type) {
+	case GREP_SOURCE_FILE:
+		return grep_source_load_file(gs);
+	case GREP_SOURCE_SHA1:
+		return grep_source_load_sha1(gs);
+	case GREP_SOURCE_BUF:
+		return gs->buf ? 0 : -1;
+	}
+	die("BUG: invalid grep_source type");
 }
diff --git a/grep.h b/grep.h
index 4f1b025..e386ca4 100644
--- a/grep.h
+++ b/grep.h
@@ -129,6 +129,28 @@ extern void compile_grep_patterns(struct grep_opt *opt);
 extern void free_grep_patterns(struct grep_opt *opt);
 extern int grep_buffer(struct grep_opt *opt, const char *name, char *buf, unsigned long size);
 
+struct grep_source {
+	char *name;
+
+	enum grep_source_type {
+		GREP_SOURCE_SHA1,
+		GREP_SOURCE_FILE,
+		GREP_SOURCE_BUF,
+	} type;
+	void *identifier;
+
+	char *buf;
+	unsigned long size;
+};
+
+void grep_source_init(struct grep_source *gs, enum grep_source_type type,
+		      const char *name, const void *identifier);
+int grep_source_load(struct grep_source *gs);
+void grep_source_clear_data(struct grep_source *gs);
+void grep_source_clear(struct grep_source *gs);
+
+int grep_source(struct grep_opt *opt, struct grep_source *gs);
+
 extern struct grep_opt *grep_opt_dup(const struct grep_opt *opt);
 extern int grep_threads_ok(const struct grep_opt *opt);
 
-- 
1.7.9.3.gc3fce1.dirty

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 4/9] convert git-grep to use grep_source interface
  2012-02-02  8:17                                 ` [PATCH 0/9] respect binary attribute in grep Jeff King
                                                     ` (2 preceding siblings ...)
  2012-02-02  8:19                                   ` [PATCH 3/9] grep: refactor the concept of "grep source" into an object Jeff King
@ 2012-02-02  8:19                                   ` Jeff King
  2012-02-02  8:20                                   ` [PATCH 5/9] grep: drop grep_buffer's "name" parameter Jeff King
                                                     ` (8 subsequent siblings)
  12 siblings, 0 replies; 43+ messages in thread
From: Jeff King @ 2012-02-02  8:19 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Thomas Rast, Conrad Irwin, git, Nguyen Thai Ngoc Duy, Dov Grobgeld

The grep_source interface (as opposed to grep_buffer) will
eventually gives us a richer interface for telling the
low-level grep code about our buffers. Eventually this will
lead to things like better binary-file handling. For now, it
lets us drop a lot of now-redundant code.

The conversion is mostly straight-forward. One thing to note
is that the memory ownership rules for "struct grep_source"
are different than the "struct work_item" found here (the
former will copy things like the filename, rather than
taking ownership). Therefore you will also see some slight
tweaking of when filename buffers are released.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/grep.c |  142 +++++++++-----------------------------------------------
 1 files changed, 23 insertions(+), 119 deletions(-)

diff --git a/builtin/grep.c b/builtin/grep.c
index f4402fa..bc85a20 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -29,25 +29,12 @@ static int use_threads = 1;
 #define THREADS 8
 static pthread_t threads[THREADS];
 
-static void *load_sha1(const unsigned char *sha1, unsigned long *size,
-		       const char *name);
-static void *load_file(const char *filename, size_t *sz);
-
-enum work_type {WORK_SHA1, WORK_FILE};
-
 /* We use one producer thread and THREADS consumer
  * threads. The producer adds struct work_items to 'todo' and the
  * consumers pick work items from the same array.
  */
 struct work_item {
-	enum work_type type;
-	char *name;
-
-	/* if type == WORK_SHA1, then 'identifier' is a SHA1,
-	 * otherwise type == WORK_FILE, and 'identifier' is a NUL
-	 * terminated filename.
-	 */
-	void *identifier;
+	struct grep_source source;
 	char done;
 	struct strbuf out;
 };
@@ -98,7 +85,8 @@ static pthread_cond_t cond_result;
 
 static int skip_first_line;
 
-static void add_work(enum work_type type, char *name, void *id)
+static void add_work(enum grep_source_type type, const char *name,
+		     const void *id)
 {
 	grep_lock();
 
@@ -106,9 +94,7 @@ static void add_work(enum work_type type, char *name, void *id)
 		pthread_cond_wait(&cond_write, &grep_mutex);
 	}
 
-	todo[todo_end].type = type;
-	todo[todo_end].name = name;
-	todo[todo_end].identifier = id;
+	grep_source_init(&todo[todo_end].source, type, name, id);
 	todo[todo_end].done = 0;
 	strbuf_reset(&todo[todo_end].out);
 	todo_end = (todo_end + 1) % ARRAY_SIZE(todo);
@@ -136,21 +122,6 @@ static struct work_item *get_work(void)
 	return ret;
 }
 
-static void grep_sha1_async(struct grep_opt *opt, char *name,
-			    const unsigned char *sha1)
-{
-	unsigned char *s;
-	s = xmalloc(20);
-	memcpy(s, sha1, 20);
-	add_work(WORK_SHA1, name, s);
-}
-
-static void grep_file_async(struct grep_opt *opt, char *name,
-			    const char *filename)
-{
-	add_work(WORK_FILE, name, xstrdup(filename));
-}
-
 static void work_done(struct work_item *w)
 {
 	int old_done;
@@ -177,8 +148,7 @@ static void work_done(struct work_item *w)
 
 			write_or_die(1, p, len);
 		}
-		free(w->name);
-		free(w->identifier);
+		grep_source_clear(&w->source);
 	}
 
 	if (old_done != todo_done)
@@ -201,25 +171,8 @@ static void *run(void *arg)
 			break;
 
 		opt->output_priv = w;
-		if (w->type == WORK_SHA1) {
-			unsigned long sz;
-			void* data = load_sha1(w->identifier, &sz, w->name);
-
-			if (data) {
-				hit |= grep_buffer(opt, w->name, data, sz);
-				free(data);
-			}
-		} else if (w->type == WORK_FILE) {
-			size_t sz;
-			void* data = load_file(w->identifier, &sz);
-			if (data) {
-				hit |= grep_buffer(opt, w->name, data, sz);
-				free(data);
-			}
-		} else {
-			assert(0);
-		}
-
+		hit |= grep_source(opt, &w->source);
+		grep_source_clear_data(&w->source);
 		work_done(w);
 	}
 	free_grep_patterns(arg);
@@ -365,23 +318,10 @@ static void *lock_and_read_sha1_file(const unsigned char *sha1, enum object_type
 	return data;
 }
 
-static void *load_sha1(const unsigned char *sha1, unsigned long *size,
-		       const char *name)
-{
-	enum object_type type;
-	void *data = lock_and_read_sha1_file(sha1, &type, size);
-
-	if (!data)
-		error(_("'%s': unable to read %s"), name, sha1_to_hex(sha1));
-
-	return data;
-}
-
 static int grep_sha1(struct grep_opt *opt, const unsigned char *sha1,
 		     const char *filename, int tree_name_len)
 {
 	struct strbuf pathbuf = STRBUF_INIT;
-	char *name;
 
 	if (opt->relative && opt->prefix_length) {
 		quote_path_relative(filename + tree_name_len, -1, &pathbuf,
@@ -391,87 +331,51 @@ static int grep_sha1(struct grep_opt *opt, const unsigned char *sha1,
 		strbuf_addstr(&pathbuf, filename);
 	}
 
-	name = strbuf_detach(&pathbuf, NULL);
-
 #ifndef NO_PTHREADS
 	if (use_threads) {
-		grep_sha1_async(opt, name, sha1);
+		add_work(GREP_SOURCE_SHA1, pathbuf.buf, sha1);
+		strbuf_release(&pathbuf);
 		return 0;
 	} else
 #endif
 	{
+		struct grep_source gs;
 		int hit;
-		unsigned long sz;
-		void *data = load_sha1(sha1, &sz, name);
-		if (!data)
-			hit = 0;
-		else
-			hit = grep_buffer(opt, name, data, sz);
 
-		free(data);
-		free(name);
-		return hit;
-	}
-}
+		grep_source_init(&gs, GREP_SOURCE_SHA1, pathbuf.buf, sha1);
+		strbuf_release(&pathbuf);
+		hit = grep_source(opt, &gs);
 
-static void *load_file(const char *filename, size_t *sz)
-{
-	struct stat st;
-	char *data;
-	int i;
-
-	if (lstat(filename, &st) < 0) {
-	err_ret:
-		if (errno != ENOENT)
-			error(_("'%s': %s"), filename, strerror(errno));
-		return NULL;
-	}
-	if (!S_ISREG(st.st_mode))
-		return NULL;
-	*sz = xsize_t(st.st_size);
-	i = open(filename, O_RDONLY);
-	if (i < 0)
-		goto err_ret;
-	data = xmalloc(*sz + 1);
-	if (st.st_size != read_in_full(i, data, *sz)) {
-		error(_("'%s': short read %s"), filename, strerror(errno));
-		close(i);
-		free(data);
-		return NULL;
+		grep_source_clear(&gs);
+		return hit;
 	}
-	close(i);
-	data[*sz] = 0;
-	return data;
 }
 
 static int grep_file(struct grep_opt *opt, const char *filename)
 {
 	struct strbuf buf = STRBUF_INIT;
-	char *name;
 
 	if (opt->relative && opt->prefix_length)
 		quote_path_relative(filename, -1, &buf, opt->prefix);
 	else
 		strbuf_addstr(&buf, filename);
-	name = strbuf_detach(&buf, NULL);
 
 #ifndef NO_PTHREADS
 	if (use_threads) {
-		grep_file_async(opt, name, filename);
+		add_work(GREP_SOURCE_FILE, buf.buf, filename);
+		strbuf_release(&buf);
 		return 0;
 	} else
 #endif
 	{
+		struct grep_source gs;
 		int hit;
-		size_t sz;
-		void *data = load_file(filename, &sz);
-		if (!data)
-			hit = 0;
-		else
-			hit = grep_buffer(opt, name, data, sz);
 
-		free(data);
-		free(name);
+		grep_source_init(&gs, GREP_SOURCE_FILE, buf.buf, filename);
+		strbuf_release(&buf);
+		hit = grep_source(opt, &gs);
+
+		grep_source_clear(&gs);
 		return hit;
 	}
 }
-- 
1.7.9.3.gc3fce1.dirty

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 5/9] grep: drop grep_buffer's "name" parameter
  2012-02-02  8:17                                 ` [PATCH 0/9] respect binary attribute in grep Jeff King
                                                     ` (3 preceding siblings ...)
  2012-02-02  8:19                                   ` [PATCH 4/9] convert git-grep to use grep_source interface Jeff King
@ 2012-02-02  8:20                                   ` Jeff King
  2012-02-02  8:20                                   ` [PATCH 6/9] grep: cache userdiff_driver in grep_source Jeff King
                                                     ` (7 subsequent siblings)
  12 siblings, 0 replies; 43+ messages in thread
From: Jeff King @ 2012-02-02  8:20 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Thomas Rast, Conrad Irwin, git, Nguyen Thai Ngoc Duy, Dov Grobgeld

Before the grep_source interface existed, grep_buffer was
used by two types of callers:

  1. Ones which pulled a file into a buffer, and then wanted
     to supply the file's name for the output (i.e.,
     git grep).

  2. Ones which really just wanted to grep a buffer (i.e.,
     git log --grep).

Callers in set (1) should now be using grep_source. Callers
in set (2) always pass NULL for the "name" parameter of
grep_buffer. We can therefore get rid of this now-useless
parameter.

Signed-off-by: Jeff King <peff@peff.net>
---
This one isn't necessary, obviously, but I think it's a nice clean-up
after the last two patches.

 grep.c     |    4 ++--
 grep.h     |    2 +-
 revision.c |    1 -
 3 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/grep.c b/grep.c
index 8204ca2..2a3fe7c 100644
--- a/grep.c
+++ b/grep.c
@@ -1215,12 +1215,12 @@ int grep_source(struct grep_opt *opt, struct grep_source *gs)
 	return grep_source_1(opt, gs, 0);
 }
 
-int grep_buffer(struct grep_opt *opt, const char *name, char *buf, unsigned long size)
+int grep_buffer(struct grep_opt *opt, char *buf, unsigned long size)
 {
 	struct grep_source gs;
 	int r;
 
-	grep_source_init(&gs, GREP_SOURCE_BUF, name, NULL);
+	grep_source_init(&gs, GREP_SOURCE_BUF, NULL, NULL);
 	gs.buf = buf;
 	gs.size = size;
 
diff --git a/grep.h b/grep.h
index e386ca4..8bf3001 100644
--- a/grep.h
+++ b/grep.h
@@ -127,7 +127,7 @@ extern void append_grep_pattern(struct grep_opt *opt, const char *pat, const cha
 extern void append_header_grep_pattern(struct grep_opt *, enum grep_header_field, const char *);
 extern void compile_grep_patterns(struct grep_opt *opt);
 extern void free_grep_patterns(struct grep_opt *opt);
-extern int grep_buffer(struct grep_opt *opt, const char *name, char *buf, unsigned long size);
+extern int grep_buffer(struct grep_opt *opt, char *buf, unsigned long size);
 
 struct grep_source {
 	char *name;
diff --git a/revision.c b/revision.c
index c97d834..819ff01 100644
--- a/revision.c
+++ b/revision.c
@@ -2149,7 +2149,6 @@ static int commit_match(struct commit *commit, struct rev_info *opt)
 	if (!opt->grep_filter.pattern_list && !opt->grep_filter.header_list)
 		return 1;
 	return grep_buffer(&opt->grep_filter,
-			   NULL, /* we say nothing, not even filename */
 			   commit->buffer, strlen(commit->buffer));
 }
 
-- 
1.7.9.3.gc3fce1.dirty

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 6/9] grep: cache userdiff_driver in grep_source
  2012-02-02  8:17                                 ` [PATCH 0/9] respect binary attribute in grep Jeff King
                                                     ` (4 preceding siblings ...)
  2012-02-02  8:20                                   ` [PATCH 5/9] grep: drop grep_buffer's "name" parameter Jeff King
@ 2012-02-02  8:20                                   ` Jeff King
  2012-02-02 18:34                                     ` Junio C Hamano
  2012-02-02  8:21                                   ` [PATCH 7/9] grep: respect diff attributes for binary-ness Jeff King
                                                     ` (6 subsequent siblings)
  12 siblings, 1 reply; 43+ messages in thread
From: Jeff King @ 2012-02-02  8:20 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Thomas Rast, Conrad Irwin, git, Nguyen Thai Ngoc Duy, Dov Grobgeld

Right now, grep only uses the userdiff_driver for one thing:
looking up funcname patterns for "-p" and "-W".  As new uses
for userdiff drivers are added to the grep code, we want to
minimize attribute lookups, which can be expensive.

It might seem at first that this would also optimize multiple
lookups when the funcname pattern for a file is needed
multiple times. However, the compiled funcname pattern is
already cached in struct grep_opt's "priv" member, so
multiple lookups are already suppressed.

Signed-off-by: Jeff King <peff@peff.net>
---
 grep.c |   22 ++++++++++++++++------
 grep.h |    4 ++++
 2 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/grep.c b/grep.c
index 2a3fe7c..bb18569 100644
--- a/grep.c
+++ b/grep.c
@@ -841,12 +841,9 @@ static int match_funcname(struct grep_opt *opt, struct grep_source *gs, char *bo
 {
 	xdemitconf_t *xecfg = opt->priv;
 	if (xecfg && !xecfg->find_func) {
-		struct userdiff_driver *drv;
-		grep_attr_lock();
-		drv = userdiff_find_by_path(gs->name);
-		grep_attr_unlock();
-		if (drv && drv->funcname.pattern) {
-			const struct userdiff_funcname *pe = &drv->funcname;
+		grep_source_load_driver(gs);
+		if (gs->driver->funcname.pattern) {
+			const struct userdiff_funcname *pe = &gs->driver->funcname;
 			xdiff_set_find_func(xecfg, pe->pattern, pe->cflags);
 		} else {
 			xecfg = opt->priv = NULL;
@@ -1237,6 +1234,7 @@ void grep_source_init(struct grep_source *gs, enum grep_source_type type,
 	gs->name = name ? xstrdup(name) : NULL;
 	gs->buf = NULL;
 	gs->size = 0;
+	gs->driver = NULL;
 
 	switch (type) {
 	case GREP_SOURCE_FILE:
@@ -1340,3 +1338,15 @@ int grep_source_load(struct grep_source *gs)
 	}
 	die("BUG: invalid grep_source type");
 }
+
+void grep_source_load_driver(struct grep_source *gs)
+{
+	if (gs->driver)
+		return;
+
+	grep_attr_lock();
+	gs->driver = userdiff_find_by_path(gs->name);
+	if (!gs->driver)
+		gs->driver = userdiff_find_by_name("default");
+	grep_attr_unlock();
+}
diff --git a/grep.h b/grep.h
index 8bf3001..73b28c2 100644
--- a/grep.h
+++ b/grep.h
@@ -9,6 +9,7 @@ typedef int pcre_extra;
 #endif
 #include "kwset.h"
 #include "thread-utils.h"
+#include "userdiff.h"
 
 enum grep_pat_token {
 	GREP_PATTERN,
@@ -141,6 +142,8 @@ struct grep_source {
 
 	char *buf;
 	unsigned long size;
+
+	struct userdiff_driver *driver;
 };
 
 void grep_source_init(struct grep_source *gs, enum grep_source_type type,
@@ -148,6 +151,7 @@ void grep_source_init(struct grep_source *gs, enum grep_source_type type,
 int grep_source_load(struct grep_source *gs);
 void grep_source_clear_data(struct grep_source *gs);
 void grep_source_clear(struct grep_source *gs);
+void grep_source_load_driver(struct grep_source *gs);
 
 int grep_source(struct grep_opt *opt, struct grep_source *gs);
 
-- 
1.7.9.3.gc3fce1.dirty

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 7/9] grep: respect diff attributes for binary-ness
  2012-02-02  8:17                                 ` [PATCH 0/9] respect binary attribute in grep Jeff King
                                                     ` (5 preceding siblings ...)
  2012-02-02  8:20                                   ` [PATCH 6/9] grep: cache userdiff_driver in grep_source Jeff King
@ 2012-02-02  8:21                                   ` Jeff King
  2012-02-02  8:21                                   ` [PATCH 8/9] grep: load file data after checking binary-ness Jeff King
                                                     ` (5 subsequent siblings)
  12 siblings, 0 replies; 43+ messages in thread
From: Jeff King @ 2012-02-02  8:21 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Thomas Rast, Conrad Irwin, git, Nguyen Thai Ngoc Duy, Dov Grobgeld

There is currently no way for users to tell git-grep that a
particular path is or is not a binary file; instead, grep
always relies on its auto-detection (or the user specifying
"-a" to treat all binary-looking files like text).

This patch teaches git-grep to use the same attribute lookup
that is used by git-diff. We could add a new "grep" flag,
but that is unnecessarily complex and unlikely to be useful.
Despite the name, the "-diff" attribute (or "diff=foo" and
the associated diff.foo.binary config option) are really
about describing the contents of the path. It's simply
historical that diff was the only thing that cared about
these attributes in the past.

And if this simple approach turns out to be insufficient, we
still have a backwards-compatible path forward: we can add a
separate "grep" attribute, and fall back to respecting
"diff" if it is unset.

Signed-off-by: Jeff King <peff@peff.net>
---
 grep.c                 |   16 ++++++++++++++--
 grep.h                 |    1 +
 t/t7008-grep-binary.sh |   24 ++++++++++++++++++++++++
 3 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/grep.c b/grep.c
index bb18569..a50d161 100644
--- a/grep.c
+++ b/grep.c
@@ -1024,11 +1024,11 @@ static int grep_source_1(struct grep_opt *opt, struct grep_source *gs, int colle
 
 	switch (opt->binary) {
 	case GREP_BINARY_DEFAULT:
-		if (buffer_is_binary(gs->buf, gs->size))
+		if (grep_source_is_binary(gs))
 			binary_match_only = 1;
 		break;
 	case GREP_BINARY_NOMATCH:
-		if (buffer_is_binary(gs->buf, gs->size))
+		if (grep_source_is_binary(gs))
 			return 0; /* Assume unmatch */
 		break;
 	case GREP_BINARY_TEXT:
@@ -1350,3 +1350,15 @@ void grep_source_load_driver(struct grep_source *gs)
 		gs->driver = userdiff_find_by_name("default");
 	grep_attr_unlock();
 }
+
+int grep_source_is_binary(struct grep_source *gs)
+{
+	grep_source_load_driver(gs);
+	if (gs->driver->binary != -1)
+		return gs->driver->binary;
+
+	if (!grep_source_load(gs))
+		return buffer_is_binary(gs->buf, gs->size);
+
+	return 0;
+}
diff --git a/grep.h b/grep.h
index 73b28c2..36e49d8 100644
--- a/grep.h
+++ b/grep.h
@@ -152,6 +152,7 @@ int grep_source_load(struct grep_source *gs);
 void grep_source_clear_data(struct grep_source *gs);
 void grep_source_clear(struct grep_source *gs);
 void grep_source_load_driver(struct grep_source *gs);
+int grep_source_is_binary(struct grep_source *gs);
 
 int grep_source(struct grep_opt *opt, struct grep_source *gs);
 
diff --git a/t/t7008-grep-binary.sh b/t/t7008-grep-binary.sh
index 917a264..fd6410f 100755
--- a/t/t7008-grep-binary.sh
+++ b/t/t7008-grep-binary.sh
@@ -99,4 +99,28 @@ test_expect_success 'git grep y<NUL>x a' "
 	test_must_fail git grep -f f a
 "
 
+test_expect_success 'grep respects binary diff attribute' '
+	echo text >t &&
+	git add t &&
+	echo t:text >expect &&
+	git grep text t >actual &&
+	test_cmp expect actual &&
+	echo "t -diff" >.gitattributes &&
+	echo "Binary file t matches" >expect &&
+	git grep text t >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'grep respects not-binary diff attribute' '
+	echo binQary | q_to_nul >b &&
+	git add b &&
+	echo "Binary file b matches" >expect &&
+	git grep bin b >actual &&
+	test_cmp expect actual &&
+	echo "b diff" >.gitattributes &&
+	echo "b:binQary" >expect &&
+	git grep bin b | nul_to_q >actual &&
+	test_cmp expect actual
+'
+
 test_done
-- 
1.7.9.3.gc3fce1.dirty

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 8/9] grep: load file data after checking binary-ness
  2012-02-02  8:17                                 ` [PATCH 0/9] respect binary attribute in grep Jeff King
                                                     ` (6 preceding siblings ...)
  2012-02-02  8:21                                   ` [PATCH 7/9] grep: respect diff attributes for binary-ness Jeff King
@ 2012-02-02  8:21                                   ` Jeff King
  2012-02-02  8:24                                   ` [PATCH 9/9] grep: pre-load userdiff drivers when threaded Jeff King
                                                     ` (4 subsequent siblings)
  12 siblings, 0 replies; 43+ messages in thread
From: Jeff King @ 2012-02-02  8:21 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Thomas Rast, Conrad Irwin, git, Nguyen Thai Ngoc Duy, Dov Grobgeld

Usually we load each file to grep into memory, check whether
it's binary, and then either grep it (the default) or not
(if "-I" was given).

In the "-I" case, we can skip loading the file entirely if
it is marked as binary via gitattributes. On my giant
3-gigabyte media repository, doing "git grep -I foo" went
from:

  real    0m0.712s
  user    0m0.044s
  sys     0m4.780s

to:

  real    0m0.026s
  user    0m0.016s
  sys     0m0.020s

Obviously this is an extreme example. The repo is almost
entirely binary files, and you can see that we spent all of
our time asking the kernel to read() the data. However, with
a cold disk cache, even avoiding a few binary files can have
an impact.

Signed-off-by: Jeff King <peff@peff.net>
---
 grep.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/grep.c b/grep.c
index a50d161..3821400 100644
--- a/grep.c
+++ b/grep.c
@@ -1019,9 +1019,6 @@ static int grep_source_1(struct grep_opt *opt, struct grep_source *gs, int colle
 	}
 	opt->last_shown = 0;
 
-	if (grep_source_load(gs) < 0)
-		return 0;
-
 	switch (opt->binary) {
 	case GREP_BINARY_DEFAULT:
 		if (grep_source_is_binary(gs))
@@ -1042,6 +1039,9 @@ static int grep_source_1(struct grep_opt *opt, struct grep_source *gs, int colle
 
 	try_lookahead = should_lookahead(opt);
 
+	if (grep_source_load(gs) < 0)
+		return 0;
+
 	bol = gs->buf;
 	left = gs->size;
 	while (left) {
-- 
1.7.9.3.gc3fce1.dirty

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 9/9] grep: pre-load userdiff drivers when threaded
  2012-02-02  8:17                                 ` [PATCH 0/9] respect binary attribute in grep Jeff King
                                                     ` (7 preceding siblings ...)
  2012-02-02  8:21                                   ` [PATCH 8/9] grep: load file data after checking binary-ness Jeff King
@ 2012-02-02  8:24                                   ` Jeff King
  2012-02-02  8:30                                   ` [PATCH 0/9] respect binary attribute in grep Jeff King
                                                     ` (3 subsequent siblings)
  12 siblings, 0 replies; 43+ messages in thread
From: Jeff King @ 2012-02-02  8:24 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Thomas Rast, Conrad Irwin, git, Nguyen Thai Ngoc Duy, Dov Grobgeld

The low-level grep_source code will automatically load the
userdiff driver to see whether a file is binary. However,
when we are threaded, it will load the drivers in a
non-deterministic order, handling each one as its assigned
thread happens to be scheduled.

Meanwhile, the attribute lookup code (which underlies the
userdiff driver lookup) is optimized to handle paths in
sequential order (because they tend to share the same
gitattributes files). Multi-threading the lookups destroys
the locality and makes this optimization less effective.

We can fix this by pre-loading the userdiff driver in the
main thread, before we hand off the file to a worker thread.
My best-of-five for "git grep foo" on the linux-2.6
repository went from:

  real    0m0.391s
  user    0m1.708s
  sys     0m0.584s

to:

  real    0m0.360s
  user    0m1.576s
  sys     0m0.572s

Not a huge speedup, but it's quite easy to do. The only
trick is that we shouldn't perform this optimization if "-a"
was used, in which case we won't bother checking whether
the files are binary at all.

Signed-off-by: Jeff King <peff@peff.net>
---
The speedup is especially unimpressive when you consider that it won't
grow as the grep load grows. This is a pretty fast grep. If you used a
real regex, the whole thing would take even longer, and you will still
only be shaving off a few tens of milliseconds. So I wouldn't be
heart-broken if this patch was dropped. I included it because it's easy
to do, and maybe somebody with a slower machine would find the absolute
time difference more noticeable.

 builtin/grep.c |   10 ++++++----
 1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/builtin/grep.c b/builtin/grep.c
index bc85a20..9fc3e95 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -85,8 +85,8 @@ static pthread_cond_t cond_result;
 
 static int skip_first_line;
 
-static void add_work(enum grep_source_type type, const char *name,
-		     const void *id)
+static void add_work(struct grep_opt *opt, enum grep_source_type type,
+		     const char *name, const void *id)
 {
 	grep_lock();
 
@@ -95,6 +95,8 @@ static void add_work(enum grep_source_type type, const char *name,
 	}
 
 	grep_source_init(&todo[todo_end].source, type, name, id);
+	if (opt->binary != GREP_BINARY_TEXT)
+		grep_source_load_driver(&todo[todo_end].source);
 	todo[todo_end].done = 0;
 	strbuf_reset(&todo[todo_end].out);
 	todo_end = (todo_end + 1) % ARRAY_SIZE(todo);
@@ -333,7 +335,7 @@ static int grep_sha1(struct grep_opt *opt, const unsigned char *sha1,
 
 #ifndef NO_PTHREADS
 	if (use_threads) {
-		add_work(GREP_SOURCE_SHA1, pathbuf.buf, sha1);
+		add_work(opt, GREP_SOURCE_SHA1, pathbuf.buf, sha1);
 		strbuf_release(&pathbuf);
 		return 0;
 	} else
@@ -362,7 +364,7 @@ static int grep_file(struct grep_opt *opt, const char *filename)
 
 #ifndef NO_PTHREADS
 	if (use_threads) {
-		add_work(GREP_SOURCE_FILE, buf.buf, filename);
+		add_work(opt, GREP_SOURCE_FILE, buf.buf, filename);
 		strbuf_release(&buf);
 		return 0;
 	} else
-- 
1.7.9.3.gc3fce1.dirty

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/9] respect binary attribute in grep
  2012-02-02  8:17                                 ` [PATCH 0/9] respect binary attribute in grep Jeff King
                                                     ` (8 preceding siblings ...)
  2012-02-02  8:24                                   ` [PATCH 9/9] grep: pre-load userdiff drivers when threaded Jeff King
@ 2012-02-02  8:30                                   ` Jeff King
  2012-02-02 11:00                                   ` Thomas Rast
                                                     ` (2 subsequent siblings)
  12 siblings, 0 replies; 43+ messages in thread
From: Jeff King @ 2012-02-02  8:30 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Thomas Rast, Conrad Irwin, git, Nguyen Thai Ngoc Duy, Dov Grobgeld

On Thu, Feb 02, 2012 at 03:17:47AM -0500, Jeff King wrote:

> I implemented all of the other optimizations I mentioned except the
> "only stream the first few bytes when auto-detecting binary-ness" one.
> However, it should be easy to do on top of these changes. I need to
> re-visit the similar change to diff_filespec_is_binary, and I'll do both
> at the same time.

Oh, and I didn't even think about implementing streaming grep.  The
context-finding code relies on being able to backtrack through the file
in memory. We _could_ implement streaming only for binary files (i.e.,
when we will just print "Binary file foo matches"). However, I suspect
people with big binary files will want to be using "-I" anyway, so as to
avoid even pulling the data from disk at all.

We might eventually want to add a config-option version of "-I" for
people who have repositories of mixed source code and large binary
assets.

-Peff

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/9] respect binary attribute in grep
  2012-02-02  8:17                                 ` [PATCH 0/9] respect binary attribute in grep Jeff King
                                                     ` (9 preceding siblings ...)
  2012-02-02  8:30                                   ` [PATCH 0/9] respect binary attribute in grep Jeff King
@ 2012-02-02 11:00                                   ` Thomas Rast
  2012-02-02 11:07                                     ` Jeff King
  2012-02-02 18:39                                   ` Junio C Hamano
  2012-02-04 19:22                                   ` Pete Wyckoff
  12 siblings, 1 reply; 43+ messages in thread
From: Thomas Rast @ 2012-02-02 11:00 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, Thomas Rast, Conrad Irwin, git,
	Nguyen Thai Ngoc Duy, Dov Grobgeld

Jeff King <peff@peff.net> writes:

> [+cc Thomas, as I am mangling some of his recent work with my
>      refactoring]

Mangling?  I think it all looks very good.

My original plan was to make use_threads git-global, instead of
grep-global (and shift responsibility to the subsystems instead of their
users), but that's just me and the patches aren't ready yet.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/9] respect binary attribute in grep
  2012-02-02 11:00                                   ` Thomas Rast
@ 2012-02-02 11:07                                     ` Jeff King
  0 siblings, 0 replies; 43+ messages in thread
From: Jeff King @ 2012-02-02 11:07 UTC (permalink / raw)
  To: Thomas Rast
  Cc: Junio C Hamano, Thomas Rast, Conrad Irwin, git,
	Nguyen Thai Ngoc Duy, Dov Grobgeld

On Thu, Feb 02, 2012 at 12:00:47PM +0100, Thomas Rast wrote:

> My original plan was to make use_threads git-global, instead of
> grep-global (and shift responsibility to the subsystems instead of their
> users), but that's just me and the patches aren't ready yet.

Yeah, having just dug into the threading code in grep a bit, I agree
that would be a saner approach. The locking is all bolted-on, so you end
up with these weird contracts between code, like the low-level grep code
asking anybody who might be multi-threading it to initialize the mutexes
to cover access to a totally different subsystem. I'd much rather each
subsystem just take care of itself.

-Peff

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 6/9] grep: cache userdiff_driver in grep_source
  2012-02-02  8:20                                   ` [PATCH 6/9] grep: cache userdiff_driver in grep_source Jeff King
@ 2012-02-02 18:34                                     ` Junio C Hamano
  2012-02-02 19:37                                       ` Jeff King
  0 siblings, 1 reply; 43+ messages in thread
From: Junio C Hamano @ 2012-02-02 18:34 UTC (permalink / raw)
  To: Jeff King
  Cc: Thomas Rast, Conrad Irwin, git, Nguyen Thai Ngoc Duy, Dov Grobgeld

Jeff King <peff@peff.net> writes:

> -		grep_attr_lock();
> -		drv = userdiff_find_by_path(gs->name);
> -		grep_attr_unlock();
> -		if (drv && drv->funcname.pattern) {
> -			const struct userdiff_funcname *pe = &drv->funcname;
> +		grep_source_load_driver(gs);
> +		if (gs->driver->funcname.pattern) {
> +			const struct userdiff_funcname *pe = &gs->driver->funcname;

When we load driver, gs->driver gets at least "default" driver, so we no
longer need to check for drv != NULL as we used to?  Is that the reason
for the slight difference here?

> @@ -1237,6 +1234,7 @@ void grep_source_init(struct grep_source *gs, enum grep_source_type type,
>  	gs->name = name ? xstrdup(name) : NULL;
>  	gs->buf = NULL;
>  	gs->size = 0;
> +	gs->driver = NULL;
>  
>  	switch (type) {
>  	case GREP_SOURCE_FILE:

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/9] respect binary attribute in grep
  2012-02-02  8:17                                 ` [PATCH 0/9] respect binary attribute in grep Jeff King
                                                     ` (10 preceding siblings ...)
  2012-02-02 11:00                                   ` Thomas Rast
@ 2012-02-02 18:39                                   ` Junio C Hamano
  2012-02-04 19:22                                   ` Pete Wyckoff
  12 siblings, 0 replies; 43+ messages in thread
From: Junio C Hamano @ 2012-02-02 18:39 UTC (permalink / raw)
  To: Jeff King
  Cc: Thomas Rast, Conrad Irwin, git, Nguyen Thai Ngoc Duy, Dov Grobgeld

Jeff King <peff@peff.net> writes:

> ... The result turned out much easier to read
> (and explain in the commit messages, as it was simple to break into
> smaller commits)....

Indeed the series is very nicely done ;-)

Thanks.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 6/9] grep: cache userdiff_driver in grep_source
  2012-02-02 18:34                                     ` Junio C Hamano
@ 2012-02-02 19:37                                       ` Jeff King
  0 siblings, 0 replies; 43+ messages in thread
From: Jeff King @ 2012-02-02 19:37 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Thomas Rast, Conrad Irwin, git, Nguyen Thai Ngoc Duy, Dov Grobgeld

On Thu, Feb 02, 2012 at 10:34:07AM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > -		grep_attr_lock();
> > -		drv = userdiff_find_by_path(gs->name);
> > -		grep_attr_unlock();
> > -		if (drv && drv->funcname.pattern) {
> > -			const struct userdiff_funcname *pe = &drv->funcname;
> > +		grep_source_load_driver(gs);
> > +		if (gs->driver->funcname.pattern) {
> > +			const struct userdiff_funcname *pe = &gs->driver->funcname;
> 
> When we load driver, gs->driver gets at least "default" driver, so we no
> longer need to check for drv != NULL as we used to?  Is that the reason
> for the slight difference here?

Yes, exactly.

We could just leave gs->driver NULL instead of looking up "default", and
then use NULL to signal to the calling code that defaults should be
used. But NULL is interpreted by grep_source_load_driver as "we did not
look up the driver yet", so the common case of "no driver" would mean we
accidentally do the lookup multiple times.  The diff_filespec code uses
the same convention to solve the same problem.

Speaking of which, there was some notion in my mind that a "grep_source"
and a "diff_filespec" were very similar objects, and that we could
possibly unify the implementations. I decided against that route with
this series, as it would have involved pretty heavy refactoring of the
diff code to prevent a fairly small amount of code duplication.

-Peff

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/9] respect binary attribute in grep
  2012-02-02  8:17                                 ` [PATCH 0/9] respect binary attribute in grep Jeff King
                                                     ` (11 preceding siblings ...)
  2012-02-02 18:39                                   ` Junio C Hamano
@ 2012-02-04 19:22                                   ` Pete Wyckoff
  2012-02-04 23:18                                     ` Jeff King
  12 siblings, 1 reply; 43+ messages in thread
From: Pete Wyckoff @ 2012-02-04 19:22 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, Thomas Rast, Conrad Irwin, git,
	Nguyen Thai Ngoc Duy, Dov Grobgeld

I took a look at this series.  It's nice.  My worry was that the
extra open() of non-existent .gitattributes files in all the
directories would cause performance problems across networked
filesystems like NFS.

My usual (non-public) repository has order:

    100k files
     10k directories

and no files marked as binary.  The grep string is such that it
is disk-bound, and not expected to match in any file (or binary):
"time ~/src/git/bin-wrappers/git grep unfindable-string".

With your change, there are 10k new open() calls looking for
.gitattributes in each directory, all of which return ENOENT.
This turns out to have an insignificant impact on performance due
to the much bigger time sink of stat()-ing all the files.

I think this happens to be true because the gitattributes lookups
run in parallel to all the file stat work, as the main thread
dispatches file work while doing its own gitattributes lookups.

It could be plausible that deep directory structures with few
grep-able files will suffer with this change.  For example, many
big binary blobs in deep directory hierarchies, but also some
useful files here and there.

One could argue that with the use of .gitattributes to specify
which blobs should not be searched, this series makes this faster
by not having to to read the binary blobs at all.  And I'd be
okay with that.

Just FYI that there may be a performance impact on certain
repositories.

		-- Pete

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/9] respect binary attribute in grep
  2012-02-04 19:22                                   ` Pete Wyckoff
@ 2012-02-04 23:18                                     ` Jeff King
  0 siblings, 0 replies; 43+ messages in thread
From: Jeff King @ 2012-02-04 23:18 UTC (permalink / raw)
  To: Pete Wyckoff
  Cc: Junio C Hamano, Thomas Rast, Conrad Irwin, git,
	Nguyen Thai Ngoc Duy, Dov Grobgeld

On Sat, Feb 04, 2012 at 02:22:52PM -0500, Pete Wyckoff wrote:

> I took a look at this series.  It's nice.  My worry was that the
> extra open() of non-existent .gitattributes files in all the
> directories would cause performance problems across networked
> filesystems like NFS.

Yeah, I was able to measure a small slow-down on a quick grep even with
a warm cache. So it does take some extra effort, but I think the
correctness is worth it (and note that the slow down is in the tens of
milliseconds if you have a reasonable stat()).

If people have big trees on NFS (or some other slow-stat system) where
these lookups are actually a problem, I'd rather see a global option to
disable .gitattributes lookups for both diff and grep (i.e., a "trust
me, I'm not using gitattributes, and don't bother with stat" flag). In
practice, though, I think such a thing is not necessary because the
stat() is local to the file being examined (e.g., for "foo/bar/baz", we
look only at "foo/bar/.gitattributes", "foo/.gitattributes", and
".gitattributes", without having to touch other parts of the tree).

Anyway, thanks for doing some performance testing. More data is always
good.

> It could be plausible that deep directory structures with few
> grep-able files will suffer with this change.  For example, many
> big binary blobs in deep directory hierarchies, but also some
> useful files here and there.
>
> One could argue that with the use of .gitattributes to specify
> which blobs should not be searched, this series makes this faster
> by not having to to read the binary blobs at all.  And I'd be
> okay with that.

Yes, exactly. I think this will end up being a big win for such cases,
because the cost of loading even one large binary file from disk will
dwarf all of the stats. But it does depend on people marking their
binaries and using "-I".

-Peff

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2012-02-04 23:18 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-01-17  9:14 git-grep while excluding files in a blacklist Dov Grobgeld
2012-01-17  9:19 ` Nguyen Thai Ngoc Duy
2012-01-17 20:09   ` Junio C Hamano
2012-01-18  1:24     ` Nguyen Thai Ngoc Duy
2012-01-23  9:37       ` [PATCH] Don't search files with an unset "grep" attribute conrad.irwin
2012-01-23 18:33         ` Junio C Hamano
2012-01-23 22:59           ` Conrad Irwin
2012-01-24  6:59             ` Junio C Hamano
2012-01-25 21:46               ` Jeff King
2012-01-26 13:51                 ` Stephen Bash
2012-01-26 17:29                   ` Jeff King
2012-01-26 16:45                 ` Michael Haggerty
2012-01-27  6:35                   ` Jeff King
2012-02-01  8:01                 ` Junio C Hamano
2012-02-01  8:20                   ` Jeff King
2012-02-01  9:10                     ` Jeff King
2012-02-01  9:28                       ` Conrad Irwin
2012-02-01 22:14                         ` Jeff King
2012-02-01 23:20                           ` Jeff King
2012-02-02  2:03                             ` Junio C Hamano
2012-02-01 23:21                           ` [PATCH 1/2] grep: let grep_buffer callers specify a binary flag Jeff King
2012-02-02  0:47                             ` Junio C Hamano
2012-02-02  0:52                               ` Jeff King
2012-02-02  8:17                                 ` [PATCH 0/9] respect binary attribute in grep Jeff King
2012-02-02  8:18                                   ` [PATCH 1/9] grep: make locking flag global Jeff King
2012-02-02  8:18                                   ` [PATCH 2/9] grep: move sha1-reading mutex into low-level code Jeff King
2012-02-02  8:19                                   ` [PATCH 3/9] grep: refactor the concept of "grep source" into an object Jeff King
2012-02-02  8:19                                   ` [PATCH 4/9] convert git-grep to use grep_source interface Jeff King
2012-02-02  8:20                                   ` [PATCH 5/9] grep: drop grep_buffer's "name" parameter Jeff King
2012-02-02  8:20                                   ` [PATCH 6/9] grep: cache userdiff_driver in grep_source Jeff King
2012-02-02 18:34                                     ` Junio C Hamano
2012-02-02 19:37                                       ` Jeff King
2012-02-02  8:21                                   ` [PATCH 7/9] grep: respect diff attributes for binary-ness Jeff King
2012-02-02  8:21                                   ` [PATCH 8/9] grep: load file data after checking binary-ness Jeff King
2012-02-02  8:24                                   ` [PATCH 9/9] grep: pre-load userdiff drivers when threaded Jeff King
2012-02-02  8:30                                   ` [PATCH 0/9] respect binary attribute in grep Jeff King
2012-02-02 11:00                                   ` Thomas Rast
2012-02-02 11:07                                     ` Jeff King
2012-02-02 18:39                                   ` Junio C Hamano
2012-02-04 19:22                                   ` Pete Wyckoff
2012-02-04 23:18                                     ` Jeff King
2012-02-01 23:21                           ` [PATCH 2/2] grep: respect diff attributes for binary-ness Jeff King
2012-02-01 16:28                       ` [PATCH] Don't search files with an unset "grep" attribute Junio C Hamano

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.