All of lore.kernel.org
 help / color / mirror / Atom feed
* git grep -P fatal: pcre_exec failed with error code -8
@ 2017-11-05  0:06 Дилян Палаузов
  2017-11-05  2:16 ` Jeff King
  0 siblings, 1 reply; 7+ messages in thread
From: Дилян Палаузов @ 2017-11-05  0:06 UTC (permalink / raw)
  To: git

Hello,

with git 2.14.3 linked with libpcre.so.1.2.9 when I do:
   git clone https://github.com/django/django
   cd django
   git grep -P "if.*([^\s])+\s+and\s+\1" 
django/contrib/admin/static/admin/js/vendor/select2/select2.full.min.js
the output is:
   fatal: pcre_exec failed with error code -8


(But not with
git clone https://github.com/select2/select2
cd select2
git grep -P "if.*([^\s])+\s+and\s+\1" dist/js/select2.full.min.js

as the two select2.full.min.js files differ slightly in their size)

Any ideas?

Regards
   Dilian

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: git grep -P fatal: pcre_exec failed with error code -8
  2017-11-05  0:06 git grep -P fatal: pcre_exec failed with error code -8 Дилян Палаузов
@ 2017-11-05  2:16 ` Jeff King
  2017-11-05  9:41   ` Дилян Палаузов
  0 siblings, 1 reply; 7+ messages in thread
From: Jeff King @ 2017-11-05  2:16 UTC (permalink / raw)
  To: Дилян
	Палаузов
  Cc: git

On Sun, Nov 05, 2017 at 01:06:21AM +0100, Дилян Палаузов wrote:

> with git 2.14.3 linked with libpcre.so.1.2.9 when I do:
>   git clone https://github.com/django/django
>   cd django
>   git grep -P "if.*([^\s])+\s+and\s+\1"
> django/contrib/admin/static/admin/js/vendor/select2/select2.full.min.js
> the output is:
>   fatal: pcre_exec failed with error code -8

Code -8 is PCRE_ERROR_MATCHLIMIT. And "man pcreapi" has this to say:

  The match_limit field provides a means of preventing PCRE from
  using up a vast amount of resources when running patterns that
  are not going to match, but which have a very large number of
  possibilities in their search trees. The classic example is a
  pattern that uses nested unlimited repeats.

  Internally, pcre_exec() uses a function called match(), which
  it calls repeatedly (sometimes recursively). The limit set by
  match_limit is imposed on the number of times this function is
  called during a match, which has the effect of limiting the
  amount of backtracking that can take place. For patterns that
  are not anchored, the count restarts from zero for each posi‐
  tion in the subject string.

  When pcre_exec() is called with a pattern that was successfully
  studied with a JIT option, the way that the matching is exe‐
  cuted is entirely different. However, there is still the pos‐
  sibility of runaway matching that goes on for a very long time,
  and so the match_limit value is also used in this case (but in
  a different way) to limit how long the matching can continue.

  The default value for the limit can be set when PCRE is built;
  the default default is 10 million, which handles all but the
  most extreme cases. You can override the default by suppling
  pcre_exec() with a pcre_extra block in which match_limit is
  set, and PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If
  the limit is exceeded, pcre_exec() returns PCRE_ERROR_MATCH‐
  LIMIT.

So your pattern is just really expensive and is running afoul of pcre's
backtracking limits (and it's not helped by the fact that the file is
basically one giant line).

There's no way to ask Git to specify a larger match_limit to pcre, but
you might be able to construct your pattern in a way that involves less
backtracking. It looks like you're trying to find things like "if foo
and foo"?

Should the captured term actually be "([^\s]+)" (with the "+" on the
_inside_ of the capture? Or maybe I'm just misunderstanding your goal.

-Peff

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: git grep -P fatal: pcre_exec failed with error code -8
  2017-11-05  2:16 ` Jeff King
@ 2017-11-05  9:41   ` Дилян Палаузов
  2017-11-06 10:31     ` Jeff King
  0 siblings, 1 reply; 7+ messages in thread
From: Дилян Палаузов @ 2017-11-05  9:41 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Hello,

thanks for your answer.

I understand that the PCRE's stack can get exhausted for some files, but 
in such cases, git grep shall proceed with the other files, and print at 
the end/stderr for which files the pattern was not applied.  Such 
behaviour would be more usefull than the current one.

Regards
   Dilian

On 11/05/2017 03:16 AM, Jeff King wrote:
> On Sun, Nov 05, 2017 at 01:06:21AM +0100, Дилян Палаузов wrote:
> 
>> with git 2.14.3 linked with libpcre.so.1.2.9 when I do:
>>    git clone https://github.com/django/django
>>    cd django
>>    git grep -P "if.*([^\s])+\s+and\s+\1"
>> django/contrib/admin/static/admin/js/vendor/select2/select2.full.min.js
>> the output is:
>>    fatal: pcre_exec failed with error code -8
> 
> Code -8 is PCRE_ERROR_MATCHLIMIT. And "man pcreapi" has this to say:
> 
>    The match_limit field provides a means of preventing PCRE from
>    using up a vast amount of resources when running patterns that
>    are not going to match, but which have a very large number of
>    possibilities in their search trees. The classic example is a
>    pattern that uses nested unlimited repeats.
> 
>    Internally, pcre_exec() uses a function called match(), which
>    it calls repeatedly (sometimes recursively). The limit set by
>    match_limit is imposed on the number of times this function is
>    called during a match, which has the effect of limiting the
>    amount of backtracking that can take place. For patterns that
>    are not anchored, the count restarts from zero for each posi‐
>    tion in the subject string.
> 
>    When pcre_exec() is called with a pattern that was successfully
>    studied with a JIT option, the way that the matching is exe‐
>    cuted is entirely different. However, there is still the pos‐
>    sibility of runaway matching that goes on for a very long time,
>    and so the match_limit value is also used in this case (but in
>    a different way) to limit how long the matching can continue.
> 
>    The default value for the limit can be set when PCRE is built;
>    the default default is 10 million, which handles all but the
>    most extreme cases. You can override the default by suppling
>    pcre_exec() with a pcre_extra block in which match_limit is
>    set, and PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If
>    the limit is exceeded, pcre_exec() returns PCRE_ERROR_MATCH‐
>    LIMIT.
> 
> So your pattern is just really expensive and is running afoul of pcre's
> backtracking limits (and it's not helped by the fact that the file is
> basically one giant line).
> 
> There's no way to ask Git to specify a larger match_limit to pcre, but
> you might be able to construct your pattern in a way that involves less
> backtracking. It looks like you're trying to find things like "if foo
> and foo"?
> 
> Should the captured term actually be "([^\s]+)" (with the "+" on the
> _inside_ of the capture? Or maybe I'm just misunderstanding your goal.
> 
> -Peff
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: git grep -P fatal: pcre_exec failed with error code -8
  2017-11-05  9:41   ` Дилян Палаузов
@ 2017-11-06 10:31     ` Jeff King
  2017-11-06 11:50       ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 7+ messages in thread
From: Jeff King @ 2017-11-06 10:31 UTC (permalink / raw)
  To: Дилян
	Палаузов
  Cc: git

On Sun, Nov 05, 2017 at 10:41:17AM +0100, Дилян Палаузов wrote:

> I understand that the PCRE's stack can get exhausted for some files, but in
> such cases, git grep shall proceed with the other files, and print at the
> end/stderr for which files the pattern was not applied.  Such behaviour
> would be more usefull than the current one.

Yes, I had a similar thought. It does feel a little funny for us to
basically treat an error as "no match" for non-interactive use, but then
the current behavior works out to be more or less the same (we return an
error code which most shell scripts would interpret as failure).

IOW, I think something like this is probably the right direction:

diff --git a/grep.c b/grep.c
index ce6a48e634..2c152e5908 100644
--- a/grep.c
+++ b/grep.c
@@ -427,7 +427,7 @@ static int pcre1match(struct grep_pat *p, const char *line, const char *eol,
 	}
 
 	if (ret < 0 && ret != PCRE_ERROR_NOMATCH)
-		die("pcre_exec failed with error code %d", ret);
+		warning("pcre_exec failed with error code %d", ret);
 	if (ret > 0) {
 		ret = 0;
 		match->rm_so = ovector[0];

but possibly:

  1. It would be nice to report the filename that we couldn't match on.
     But we don't know it at this level of the code (and it might not be
     a file at all that we are matching). So probably we'd want to pass
     the error much further up the call stack. This is tricky as there
     are multiple regex libraries we can use, and the return value gets
     normalized to 1/0 for hit/not-hit long before we get as far as
     something that knows the filename.

     We might need to do something invasive like adding an extra
     parameter to hold the error message, and passing it through the
     whole stack.

  2. We should still try to exit with an exit code other than "1" to
     indicate we hit an error besides "no lines were found".

  3. Other regex libraries might need similar treatment. Probably
     pcre2match() needs it. It doesn't look like regexec() can ever
     return an error besides REG_NOMATCH.

-Peff

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: git grep -P fatal: pcre_exec failed with error code -8
  2017-11-06 10:31     ` Jeff King
@ 2017-11-06 11:50       ` Ævar Arnfjörð Bjarmason
  2017-11-06 12:24         ` Jeff King
  0 siblings, 1 reply; 7+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2017-11-06 11:50 UTC (permalink / raw)
  To: Jeff King
  Cc: Дилян
	Палаузов,
	git


On Mon, Nov 06 2017, Jeff King jotted:

> On Sun, Nov 05, 2017 at 10:41:17AM +0100, Дилян Палаузов wrote:
>
>> I understand that the PCRE's stack can get exhausted for some files, but in
>> such cases, git grep shall proceed with the other files, and print at the
>> end/stderr for which files the pattern was not applied.  Such behaviour
>> would be more usefull than the current one.
>
> Yes, I had a similar thought. It does feel a little funny for us to
> basically treat an error as "no match" for non-interactive use, but then
> the current behavior works out to be more or less the same (we return an
> error code which most shell scripts would interpret as failure).
>
> IOW, I think something like this is probably the right direction:
>
> diff --git a/grep.c b/grep.c
> index ce6a48e634..2c152e5908 100644
> --- a/grep.c
> +++ b/grep.c
> @@ -427,7 +427,7 @@ static int pcre1match(struct grep_pat *p, const char *line, const char *eol,
>  	}
>
>  	if (ret < 0 && ret != PCRE_ERROR_NOMATCH)
> -		die("pcre_exec failed with error code %d", ret);
> +		warning("pcre_exec failed with error code %d", ret);
>  	if (ret > 0) {
>  		ret = 0;
>  		match->rm_so = ovector[0];
>
> but possibly:
>
>   1. It would be nice to report the filename that we couldn't match on.
>      But we don't know it at this level of the code (and it might not be
>      a file at all that we are matching). So probably we'd want to pass
>      the error much further up the call stack. This is tricky as there
>      are multiple regex libraries we can use, and the return value gets
>      normalized to 1/0 for hit/not-hit long before we get as far as
>      something that knows the filename.
>
>      We might need to do something invasive like adding an extra
>      parameter to hold the error message, and passing it through the
>      whole stack.
>
>   2. We should still try to exit with an exit code other than "1" to
>      indicate we hit an error besides "no lines were found".
>
>   3. Other regex libraries might need similar treatment. Probably
>      pcre2match() needs it. It doesn't look like regexec() can ever
>      return an error besides REG_NOMATCH.
>
> -Peff

Some replies to the thread in general, didn't want to spread this out
into different replies.

 * Yes this sucks.

 * Just emitting a warning without an appropriate exit code would suck
   more, would break batch jobs & whatnot that expcept certain results
   from grep.

 * As you point out it would be nice to print out the file name we
   didn't match on, we'd need to pass the grep_source struct down
   further, it goes as far as grep_source_1 but stops there and isn't
   passed to e.g. look_ahead(), which calls patmatch() which calls the
   engine-specific matcher and would need to report the error. We could
   just do this, would slow down things a bit (probably trivally) but we
   could emit better error messages in genreal.

 * You can adjust these limts in PCRE in Git, although it doesn't help
   in this case, you just add (*LIMIT_MATCH=NUM) or
   (*LIMIT_RECURSION=NUM) (or both) to the start of the pattern. See
   pcresyntax(3) or pcre2syntax(3) man pages depending on what version
   you have installed.

 * While regexec() won't return an error its version of dealing with
   this is (at least under glibc) to balloon CPU/memory use until the
   OOMkiller kills git (although not on this particular pattern).

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: git grep -P fatal: pcre_exec failed with error code -8
  2017-11-06 11:50       ` Ævar Arnfjörð Bjarmason
@ 2017-11-06 12:24         ` Jeff King
  2017-11-06 14:04           ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 7+ messages in thread
From: Jeff King @ 2017-11-06 12:24 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Дилян
	Палаузов,
	git

On Mon, Nov 06, 2017 at 12:50:45PM +0100, Ævar Arnfjörð Bjarmason wrote:

> Some replies to the thread in general, didn't want to spread this out
> into different replies.
> 
>  * Yes this sucks.
> 
>  * Just emitting a warning without an appropriate exit code would suck
>    more, would break batch jobs & whatnot that expcept certain results
>    from grep.

That was my first thought, too, but something that does:

  git grep foo && echo found

would behave basically the same. Do you mean here scripts that actually
do:

  git grep foo
  case "$?" in
  0) echo found ;;
  1) echo not found ;;
  *) echo wtf? ;;
  esac

I agree it would be nice to at least have _some_ way to distinguish
between those final two cases.

Though something like "git log --grep" is even more complicated. We
perhaps _would_ want to distinguish between:

  - match (in which case we print the commit)

  - no match (in which case we do not)

  - error (in which case we do not print, but also mark the exit code as
    failing)

>  * As you point out it would be nice to print out the file name we
>    didn't match on, we'd need to pass the grep_source struct down
>    further, it goes as far as grep_source_1 but stops there and isn't
>    passed to e.g. look_ahead(), which calls patmatch() which calls the
>    engine-specific matcher and would need to report the error. We could
>    just do this, would slow down things a bit (probably trivally) but we
>    could emit better error messages in genreal.

I'm not sure if the grep_source has enough information for all cases.
E.g., if you hit an error while grepping in commit headers, you'd
probably want to mention the oid of the commit. There's an "identifier"
field in the grep_source, but it's opaque.

The caller may also want to do more things than just print an error
(like the exit code adjustment I mentioned above). Which implies to me
we should be passing the error information up, not trying to bring the
context down.

>  * You can adjust these limts in PCRE in Git, although it doesn't help
>    in this case, you just add (*LIMIT_MATCH=NUM) or
>    (*LIMIT_RECURSION=NUM) (or both) to the start of the pattern. See
>    pcresyntax(3) or pcre2syntax(3) man pages depending on what version
>    you have installed.

I saw that in the pcre manual, but I had the impression you can't
actually raise the limits above the defaults with that, only lower them.

>  * While regexec() won't return an error its version of dealing with
>    this is (at least under glibc) to balloon CPU/memory use until the
>    OOMkiller kills git (although not on this particular pattern).

So in a sense our current behavior with pcre is the same. We just have
to provoke the death ourselves. ;)

-Peff

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: git grep -P fatal: pcre_exec failed with error code -8
  2017-11-06 12:24         ` Jeff King
@ 2017-11-06 14:04           ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 7+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2017-11-06 14:04 UTC (permalink / raw)
  To: Jeff King
  Cc: Дилян
	Палаузов,
	git


On Mon, Nov 06 2017, Jeff King jotted:

> On Mon, Nov 06, 2017 at 12:50:45PM +0100, Ævar Arnfjörð Bjarmason wrote:
>
>> Some replies to the thread in general, didn't want to spread this out
>> into different replies.
>>
>>  * Yes this sucks.
>>
>>  * Just emitting a warning without an appropriate exit code would suck
>>    more, would break batch jobs & whatnot that expcept certain results
>>    from grep.
>
> That was my first thought, too, but something that does:
>
>   git grep foo && echo found
>
> would behave basically the same. Do you mean here scripts that actually
> do:
>
>   git grep foo
>   case "$?" in
>   0) echo found ;;
>   1) echo not found ;;
>   *) echo wtf? ;;
>   esac
>
> I agree it would be nice to at least have _some_ way to distinguish
> between those final two cases.

Maybe we should emit a different exit code, but I just had in mind that
we could continue but the same non-zero as when the grep fails now, but
with an additional warning.

> Though something like "git log --grep" is even more complicated. We
> perhaps _would_ want to distinguish between:
>
>   - match (in which case we print the commit)
>
>   - no match (in which case we do not)
>
>   - error (in which case we do not print, but also mark the exit code as
>     failing)
>
>>  * As you point out it would be nice to print out the file name we
>>    didn't match on, we'd need to pass the grep_source struct down
>>    further, it goes as far as grep_source_1 but stops there and isn't
>>    passed to e.g. look_ahead(), which calls patmatch() which calls the
>>    engine-specific matcher and would need to report the error. We could
>>    just do this, would slow down things a bit (probably trivally) but we
>>    could emit better error messages in genreal.
>
> I'm not sure if the grep_source has enough information for all cases.
> E.g., if you hit an error while grepping in commit headers, you'd
> probably want to mention the oid of the commit. There's an "identifier"
> field in the grep_source, but it's opaque.
>
> The caller may also want to do more things than just print an error
> (like the exit code adjustment I mentioned above). Which implies to me
> we should be passing the error information up, not trying to bring the
> context down.

Indeed, I was just thinking of the case where we're grepping a file in
the tree, not when the engine is used for --grep et al.

>>  * You can adjust these limts in PCRE in Git, although it doesn't help
>>    in this case, you just add (*LIMIT_MATCH=NUM) or
>>    (*LIMIT_RECURSION=NUM) (or both) to the start of the pattern. See
>>    pcresyntax(3) or pcre2syntax(3) man pages depending on what version
>>    you have installed.
>
> I saw that in the pcre manual, but I had the impression you can't
> actually raise the limits above the defaults with that, only lower them.

You can, you just can't set them to anything less conservative than
limits already set via the C API, but we don't set any of those.

I.e. PCRE allows you to say expose a regex field in a web form (as we do
with gitweb) with really conservative settings that can't be overriden
via a (*LIMIT) set in the pattern.

But since we don't use that C API PCRE runs in a mode where the user
gets set whatever limit they want in the pattern (or other
pattern-altering switch), which makes sense for interactive git-grep
use.

>>  * While regexec() won't return an error its version of dealing with
>>    this is (at least under glibc) to balloon CPU/memory use until the
>>    OOMkiller kills git (although not on this particular pattern).
>
> So in a sense our current behavior with pcre is the same. We just have
> to provoke the death ourselves. ;)

Indeed :)

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2017-11-06 14:04 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-05  0:06 git grep -P fatal: pcre_exec failed with error code -8 Дилян Палаузов
2017-11-05  2:16 ` Jeff King
2017-11-05  9:41   ` Дилян Палаузов
2017-11-06 10:31     ` Jeff King
2017-11-06 11:50       ` Ævar Arnfjörð Bjarmason
2017-11-06 12:24         ` Jeff King
2017-11-06 14:04           ` Ævar Arnfjörð Bjarmason

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.