All of lore.kernel.org
 help / color / mirror / Atom feed
From: Michael J Gruber <git@drmicha.warpmail.net>
To: Jeff King <peff@peff.net>
Cc: Junio C Hamano <gitster@pobox.com>, git@vger.kernel.org
Subject: Re: [WIP/RFH/RFD/PATCH] grep: allow to use textconv filters
Date: Tue, 05 Feb 2013 17:21:18 +0100	[thread overview]
Message-ID: <5111317E.8060906@drmicha.warpmail.net> (raw)
In-Reply-To: <20130205111353.GD24973@sigill.intra.peff.net>

Jeff King venit, vidit, dixit 05.02.2013 12:13:
> On Mon, Feb 04, 2013 at 04:27:31PM +0100, Michael J Gruber wrote:
> 
>> Recently and not so recently, we made sure that log/grep type operations
>> use textconv filters when a userfacing diff would do the same:
>>
>> ef90ab6 (pickaxe: use textconv for -S counting, 2012-10-28)
>> b1c2f57 (diff_grep: use textconv buffers for add/deleted files, 2012-10-28)
>> 0508fe5 (combine-diff: respect textconv attributes, 2011-05-23)
>>
>> "git grep" currently does not use textconv filters at all, that is
>> neither for displaying the match and context nor for the actual grepping.
>>
>> Introduce a binary mode "--textconv" (in addition to "--text" and "-I")
>> which makes git grep use any configured textconv filters for grepping
>> and output purposes.
> 
> Sounds like a reasonable goal.
> 
>>     The difficulty is in getting the different cases (blob/sha1 vs.
>>     worktree) right, and in making the changes minimally invasive. It seems
>>     that some more refactoring could help: "git show --textconv" does not
>>     use textconv filters when used on blobs either. (It does for diffs, of
>>     course.) Most existing helper functions are tailored for diffs.
> 
> I think "git show" with blobs originally did not because we have no
> filename with which to look up the attributes. IIRC, the patches to
> support "cat-file --textconv" taught get_sha1_with_context to report the
> path at which we found a blob. I suspect it is mostly a matter of
> plumbing that information from the revision parser through to
> show_blob_object.
> 
>>     Nota bene: --textconv does not affect "diff --stat" either...
> 
> Yeah, though I wonder if it should be on by default. The diffstat for a
> binary file, unlike the diff, is already useful. The diffstat of the
> textconv'd data may _also_ be useful, of course.
> 
>> @@ -659,6 +659,9 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
>>  		OPT_SET_INT('I', NULL, &opt.binary,
>>  			N_("don't match patterns in binary files"),
>>  			GREP_BINARY_NOMATCH),
>> +		OPT_SET_INT(0, "textconv", &opt.binary,
>> +			N_("process binary files with textconv filters"),
>> +			GREP_BINARY_TEXTCONV),
> 
> Is this really a new form of GREP_BINARY_*? What happens when a file
> does not have a textconv filter?
> 
> I would expect this to be more like diff's "--textconv" flag, which is
> really "allow textconv to be respected". Then you could do:
> 
>   git grep -I --textconv foo
> 
> to grep in the text version of files which support it, and ignore the
> rest.
> 
>> -static int grep_source_load(struct grep_source *gs);
>> -static int grep_source_is_binary(struct grep_source *gs);
>> +static int grep_source_load(struct grep_source *gs, struct grep_opt *opt);
>> +static int grep_source_is_binary(struct grep_source *gs, struct grep_opt *opt);
> 
> Hmm. grep_source_load is more or less the analogue of
> diff_populate_filespec, which does not know about textconv at all. So I
> feel like this might be going in at the wrong layer...
> 
>> @@ -1354,14 +1356,15 @@ static int grep_source_1(struct grep_opt *opt, struct grep_source *gs, int colle
>>  
>>  	switch (opt->binary) {
>>  	case GREP_BINARY_DEFAULT:
>> -		if (grep_source_is_binary(gs))
>> +		if (grep_source_is_binary(gs, opt))
>>  			binary_match_only = 1;
>>  		break;
>>  	case GREP_BINARY_NOMATCH:
>> -		if (grep_source_is_binary(gs))
>> +		if (grep_source_is_binary(gs, opt))
>>  			return 0; /* Assume unmatch */
>>  		break;
> 
> The is_binary function learned about "opt" so that it could pass it to
> grep_source_load, which might do the textconv for us. But we _know_ that
> we will not here, because we see that we have other GREP_BINARY flags.
> 
> And when we do have _TEXTCONV:
> 
>>  	case GREP_BINARY_TEXT:
>> +	case GREP_BINARY_TEXTCONV:
>>  		break;
> 
> We do not call is_binary. :)
> 
>> -static int grep_source_load_sha1(struct grep_source *gs)
>> +static int grep_source_load_sha1(struct grep_source *gs, struct grep_opt *opt)
>>  {
>>  	enum object_type type;
>> -
>>  	grep_read_lock();
>> -	gs->buf = read_sha1_file(gs->identifier, &type, &gs->size);
>> +	if (opt->binary == GREP_BINARY_TEXTCONV) {
>> +		struct diff_filespec *df = alloc_filespec(gs->name);
>> +		gs->size = fill_textconv(gs->driver, df, &gs->buf);
>> +		free_filespec(df);
>> +	} else {
>> +		gs->buf = read_sha1_file(gs->identifier, &type, &gs->size);
>> +	}
> 
> So here we do the textconv for the sha1 case. But what about file
> sources?
> 
> This is why I think the layer is wrong; you want the fill_textconv
> function to call your abstract _load function (in the diff_filespec
> world, it is diff_filespec_populate, but it is the moral equivalent).
> And you want it to hold off as long as possible in case we can pull the
> value from cache, or feed the working tree version of a file straight to
> the filter.
> 
>> -void grep_source_load_driver(struct grep_source *gs)
>> +void grep_source_load_driver(struct grep_source *gs, struct grep_opt *opt)
>>  {
>>  	if (gs->driver)
>>  		return;
>>  
>> -	grep_attr_lock();
>> +	grep_attr_lock(); //TODO
>> +	printf("Looking up userdiff driver for: %s", gs->path);
>>  	if (gs->path)
>>  		gs->driver = userdiff_find_by_path(gs->path);
>>  	if (!gs->driver)
>>  		gs->driver = userdiff_find_by_name("default");
>> +	if (opt->binary == GREP_BINARY_TEXTCONV)
>> +		gs->driver = userdiff_get_textconv(gs->driver);
>>  	grep_attr_unlock();
>>  }
> 
> This is wrong. The point of userdiff_get_textconv is that it will return
> NULL when we are not doing textconv for this path. So you can use it
> like:
> 
>   struct userdiff_driver *textconv = userdiff_get_textconv(gs->driver);
> 
>   if (textconv) {
>           /* ok, we are doing textconv. Call our fill_textconv
>            * equivalent. */
>   }
>   else {
>           /* nope, plain old file. */
>   }
> 
> But by assigning it on top of gs->driver, you're going to end up with a
> NULL driver sometimes. And the post-condition of the load_driver
> function is that gs->driver always points to a valid driver (even if it
> is the default one). I wouldn't be surprised if this causes segfaults.
> 
> 
> So I would do it more like the patch below. Only lightly tested by me.
> There are some refactoring opportunities if you want to bring
> grep_source and diff_filespec closer together.
> 
> ---
> diff --git a/builtin/grep.c b/builtin/grep.c
> index 8025964..915c8ef 100644
> --- a/builtin/grep.c
> +++ b/builtin/grep.c
> @@ -659,6 +659,8 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
>  		OPT_SET_INT('I', NULL, &opt.binary,
>  			N_("don't match patterns in binary files"),
>  			GREP_BINARY_NOMATCH),
> +		OPT_BOOL(0, "textconv", &opt.allow_textconv,
> +			 N_("process binary files with textconv filters")),
>  		{ OPTION_INTEGER, 0, "max-depth", &opt.max_depth, N_("depth"),
>  			N_("descend at most <depth> levels"), PARSE_OPT_NONEG,
>  			NULL, 1 },
> diff --git a/grep.c b/grep.c
> index 4bd1b8b..3880d64 100644
> --- a/grep.c
> +++ b/grep.c
> @@ -2,6 +2,8 @@
>  #include "grep.h"
>  #include "userdiff.h"
>  #include "xdiff-interface.h"
> +#include "diff.h"
> +#include "diffcore.h"
>  
>  static int grep_source_load(struct grep_source *gs);
>  static int grep_source_is_binary(struct grep_source *gs);
> @@ -1321,6 +1323,58 @@ static void std_output(struct grep_opt *opt, const void *buf, size_t size)
>  	fwrite(buf, size, 1, stdout);
>  }
>  
> +static int fill_textconv_grep(struct userdiff_driver *driver,
> +			      struct grep_source *gs)
> +{
> +	struct diff_filespec *df;
> +	char *buf;
> +	size_t size;
> +
> +	if (!driver || !driver->textconv)
> +		return grep_source_load(gs);
> +
> +	/*
> +	 * The textconv interface is intimately tied to diff_filespecs, so we
> +	 * have to pretend to be one. If we could unify the grep_source
> +	 * and diff_filespec structs, this mess could just go away.
> +	 */
> +	df = alloc_filespec(gs->path);
> +	switch (gs->type) {
> +	case GREP_SOURCE_SHA1:
> +		fill_filespec(df, gs->identifier, 1, 0100644);
> +		break;
> +	case GREP_SOURCE_FILE:
> +		fill_filespec(df, null_sha1, 0, 0100644);
> +		break;
> +	default:
> +		die("BUG: attempt to textconv something without a path?");
> +	}
> +
> +	/*
> +	 * fill_textconv is not remotely thread-safe; it may load objects
> +	 * behind the scenes, and it modifies the global diff tempfile
> +	 * structure.
> +	 */
> +	grep_read_lock();
> +	size = fill_textconv(driver, df, &buf);
> +	grep_read_unlock();
> +	free_filespec(df);
> +
> +	/*
> +	 * The normal fill_textconv usage by the diff machinery would just keep
> +	 * the textconv'd buf separate from the diff_filespec. But much of the
> +	 * grep code passes around a grep_source and assumes that its "buf"
> +	 * pointer is the beginning of the thing we are searching. So let's
> +	 * install our textconv'd version into the grep_source, taking care not
> +	 * to leak any existing buffer.
> +	 */
> +	grep_source_clear_data(gs);
> +	gs->buf = buf;
> +	gs->size = size;
> +
> +	return 0;
> +}
> +
>  static int grep_source_1(struct grep_opt *opt, struct grep_source *gs, int collect_hits)
>  {
>  	char *bol;
> @@ -1331,6 +1385,7 @@ static int grep_source_1(struct grep_opt *opt, struct grep_source *gs, int colle
>  	unsigned count = 0;
>  	int try_lookahead = 0;
>  	int show_function = 0;
> +	struct userdiff_driver *textconv = NULL;
>  	enum grep_context ctx = GREP_CONTEXT_HEAD;
>  	xdemitconf_t xecfg;
>  
> @@ -1352,19 +1407,36 @@ static int grep_source_1(struct grep_opt *opt, struct grep_source *gs, int colle
>  	}
>  	opt->last_shown = 0;
>  
> -	switch (opt->binary) {
> -	case GREP_BINARY_DEFAULT:
> -		if (grep_source_is_binary(gs))
> -			binary_match_only = 1;
> -		break;
> -	case GREP_BINARY_NOMATCH:
> -		if (grep_source_is_binary(gs))
> -			return 0; /* Assume unmatch */
> -		break;
> -	case GREP_BINARY_TEXT:
> -		break;
> -	default:
> -		die("bug: unknown binary handling mode");
> +	if (opt->allow_textconv) {
> +		grep_source_load_driver(gs);
> +		/*
> +		 * We might set up the shared textconv cache data here, which
> +		 * is not thread-safe.
> +		 */
> +		grep_attr_lock();
> +		textconv = userdiff_get_textconv(gs->driver);
> +		grep_attr_unlock();
> +	}
> +
> +	/*
> +	 * We know the result of a textconv is text, so we only have to care
> +	 * about binary handling if we are not using it.
> +	 */
> +	if (!textconv) {
> +		switch (opt->binary) {
> +		case GREP_BINARY_DEFAULT:
> +			if (grep_source_is_binary(gs))
> +				binary_match_only = 1;
> +			break;
> +		case GREP_BINARY_NOMATCH:
> +			if (grep_source_is_binary(gs))
> +				return 0; /* Assume unmatch */
> +			break;
> +		case GREP_BINARY_TEXT:
> +			break;
> +		default:
> +			die("bug: unknown binary handling mode");
> +		}
>  	}
>  
>  	memset(&xecfg, 0, sizeof(xecfg));
> @@ -1372,7 +1444,7 @@ static int grep_source_1(struct grep_opt *opt, struct grep_source *gs, int colle
>  
>  	try_lookahead = should_lookahead(opt);
>  
> -	if (grep_source_load(gs) < 0)
> +	if (fill_textconv_grep(textconv, gs) < 0)
>  		return 0;
>  
>  	bol = gs->buf;
> diff --git a/grep.h b/grep.h
> index 8fc854f..94a7ac2 100644
> --- a/grep.h
> +++ b/grep.h
> @@ -106,6 +106,7 @@ struct grep_opt {
>  #define GREP_BINARY_NOMATCH	1
>  #define GREP_BINARY_TEXT	2
>  	int binary;
> +	int allow_textconv;
>  	int extended;
>  	int use_reflog_filter;
>  	int pcre;
> 

Thanks Jeff, that helps a lot! It covers "grep expr" and "grep expr rev
-- path" just fine. I'll look into "grep expr rev:path" which does not
work yet because of an empty driver.

I also have "show --textconv" covered and a suggestion for "cat-file
--textconv" (to work without a textconv filter).

Expect a mini-series soon :)

Michael

  reply	other threads:[~2013-02-05 16:21 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-02-04 15:27 [WIP/RFH/RFD/PATCH] grep: allow to use textconv filters Michael J Gruber
2013-02-04 17:12 ` Junio C Hamano
2013-02-05  8:48   ` Michael J Gruber
2013-02-05 11:13 ` Jeff King
2013-02-05 16:21   ` Michael J Gruber [this message]
2013-02-05 20:11     ` Jeff King
2013-02-06 15:08       ` [RFC/PATCH 0/4] textconv for show and grep Michael J Gruber
2013-02-06 15:08         ` [RFC/PATCH 1/4] show: obey --textconv for blobs Michael J Gruber
2013-02-06 16:53           ` Junio C Hamano
2013-02-06 22:12             ` Jeff King
2013-02-06 23:49               ` Junio C Hamano
2013-02-07  0:10                 ` Jeff King
2013-02-07  0:26                   ` Junio C Hamano
2013-02-07  8:48             ` Michael J Gruber
2013-02-06 22:06           ` Jeff King
2013-02-07  9:05             ` Michael J Gruber
2013-02-07  9:11               ` Jeff King
2013-02-07  9:34                 ` Michael J Gruber
2013-02-07  9:43                   ` Jeff King
2013-02-06 15:08         ` [RFC/PATCH 2/4] cat-file: do not die on --textconv without textconv filters Michael J Gruber
2013-02-06 16:47           ` Junio C Hamano
2013-02-06 22:19           ` Jeff King
2013-02-06 22:23             ` Junio C Hamano
2013-02-06 22:43               ` Jeff King
2013-02-06 15:08         ` [RFC/PATCH 3/4] grep: allow to use " Michael J Gruber
2013-02-06 15:12           ` Matthieu Moy
2013-02-06 22:23           ` Jeff King
2013-02-06 15:08         ` [RFC/PATCH 4/4] grep: obey --textconv for the case rev:path Michael J Gruber
2013-02-06 22:36           ` Jeff King
2013-02-07  9:05             ` Michael J Gruber
2013-02-07  9:26               ` Jeff King
2013-02-07  9:47                 ` Michael J Gruber
2013-02-07  9:55                   ` Jeff King
2013-02-07 10:31                     ` Michael J Gruber
2013-02-07 18:03                       ` Junio C Hamano
2013-02-08 11:27                         ` Michael J Gruber
2013-02-06 16:55         ` [RFC/PATCH 0/4] textconv for show and grep Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5111317E.8060906@drmicha.warpmail.net \
    --to=git@drmicha.warpmail.net \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.