git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* git diff looping?
@ 2009-06-16  1:37 John Bito
  2009-06-16  2:44 ` Jeff Epler
  2009-06-16 11:47 ` Jeff King
  0 siblings, 2 replies; 37+ messages in thread
From: John Bito @ 2009-06-16  1:37 UTC (permalink / raw)
  To: git

Running Git 1.6.1 on Solaris 10, git diff seems to go into a loop -
consuming CPU and producing no output after a little bit.  While the
repository isn't small, it's not huge (it's
http://repo.or.cz/w/egit.git). I've tried the following:

$ git diff v0.4.0  > ~/t.diff
$ git diff v0.4.0 HEAD  > ~/t.diff
Both sit indefinitely eating most of 1 CPU
$ git fsck
Exits quickly with no output.

I don't mind going to a newer version of Git, but I'd like to have
some idea of the source of the problem before I start trying things.

Thanks!
John

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git diff looping?
  2009-06-16  1:37 git diff looping? John Bito
@ 2009-06-16  2:44 ` Jeff Epler
  2009-06-16  2:53   ` John Bito
  2009-06-16 11:47 ` Jeff King
  1 sibling, 1 reply; 37+ messages in thread
From: Jeff Epler @ 2009-06-16  2:44 UTC (permalink / raw)
  To: John Bito; +Cc: git

It's apples and oranges, but I had no problem on
    $ git --version
    git version 1.5.4.3
    $ uname -a
    Linux platte 2.6.24-24-generic #1 SMP Wed Apr 15 15:11:35 UTC 2009 x86_64 GNU/Linux
or 
    $ /usr/src/git/git --version
    git version 1.6.3.2.31.g7af0

    $ git clone git://repo.or.cz/egit.git
    $ cd egit
    $ git diff v0.4.0 | wc -l
    47943

    $ /usr/src/git/git diff v0.4.0 | wc -l
    47943

Jeff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git diff looping?
  2009-06-16  2:44 ` Jeff Epler
@ 2009-06-16  2:53   ` John Bito
  0 siblings, 0 replies; 37+ messages in thread
From: John Bito @ 2009-06-16  2:53 UTC (permalink / raw)
  To: Jeff Epler; +Cc: git

Thanks, Jeff!

I didn't mean to suggest there was a problem in the (origin)
repository.  I was wondering if anyone could recommend an approach to
finding out what went wrong.  I can certainly clone a fresh repo and
apply my changes there.  I'd just like to have some idea of what went
wrong.

~John

On Mon, Jun 15, 2009 at 7:44 PM, Jeff Epler<jepler@unpythonic.net> wrote:
> It's apples and oranges, but I had no problem on
>    $ git --version
>    git version 1.5.4.3
>    $ uname -a
>    Linux platte 2.6.24-24-generic #1 SMP Wed Apr 15 15:11:35 UTC 2009 x86_64 GNU/Linux
> or
>    $ /usr/src/git/git --version
>    git version 1.6.3.2.31.g7af0
>
>    $ git clone git://repo.or.cz/egit.git
>    $ cd egit
>    $ git diff v0.4.0 | wc -l
>    47943
>
>    $ /usr/src/git/git diff v0.4.0 | wc -l
>    47943
>
> Jeff
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git diff looping?
  2009-06-16  1:37 git diff looping? John Bito
  2009-06-16  2:44 ` Jeff Epler
@ 2009-06-16 11:47 ` Jeff King
  2009-06-16 12:07   ` Jeff King
                     ` (2 more replies)
  1 sibling, 3 replies; 37+ messages in thread
From: Jeff King @ 2009-06-16 11:47 UTC (permalink / raw)
  To: John Bito; +Cc: git

On Mon, Jun 15, 2009 at 06:37:21PM -0700, John Bito wrote:

> Running Git 1.6.1 on Solaris 10, git diff seems to go into a loop -
> consuming CPU and producing no output after a little bit.  While the
> repository isn't small, it's not huge (it's
> http://repo.or.cz/w/egit.git). I've tried the following:

I can reproduce the problem on Solaris 8 using git v1.6.3. It seems to
be caused by a horribly slow system regex implementation; it really
chokes on the regex we use to find the "funcname" line for java files. I
tried running "git diff v0.4.0" and it still hadn't finished after 90
seconds. Then I did:

  git config diff.java.xfuncname foo ;# some garbage regex
  git diff v0.4.0

and it completed in about 2.5 seconds.

Can you try that and see if it works around the problem for you?

If anybody wants to look further into the problem, I think it is
specifically triggered by this file (and the built-in xfuncname for java
files):

  $ git clone git://repo.or.cz/egit.git
  $ git diff v0.4.0 -- \
    org.spearce.egit.core.test/src/org/spearce/egit/core/op/T0001_ConnectProviderOperationTest.java

which isn't even all that big a file, but it is either causing some
horrible algorithmic behavior in the regex library, or is outright
sending it into an infinite loop.

I tried building against the code in compat/regex; it completes in a
reasonable amount of time, though it is still noticeably slow. With
system regex, the diff given above doesn't complete in less than 90
seconds (at which I get bored and kill it). With compat/regex, it
completes in about 2.2 seconds. Disabling the xfuncname, it completes in
0.14 seconds.

So I think it is a viable solution to recommend building against
compat/regex on Solaris, but I think there is still room for improvement
in what we ship in compat/.

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git diff looping?
  2009-06-16 11:47 ` Jeff King
@ 2009-06-16 12:07   ` Jeff King
  2009-06-16 12:11     ` [PATCH 1/2] Makefile: refactor regex compat support Jeff King
  2009-06-16 12:14     ` [PATCH " Jeff King
  2009-06-16 15:48   ` git diff looping? John Bito
  2009-06-16 16:51   ` Junio C Hamano
  2 siblings, 2 replies; 37+ messages in thread
From: Jeff King @ 2009-06-16 12:07 UTC (permalink / raw)
  To: John Bito; +Cc: Junio C Hamano, git

On Tue, Jun 16, 2009 at 07:47:26AM -0400, Jeff King wrote:

>   $ git clone git://repo.or.cz/egit.git
>   $ git diff v0.4.0 -- \
>     org.spearce.egit.core.test/src/org/spearce/egit/core/op/T0001_ConnectProviderOperationTest.java
> 
> which isn't even all that big a file, but it is either causing some
> horrible algorithmic behavior in the regex library, or is outright
> sending it into an infinite loop.
> 
> I tried building against the code in compat/regex; it completes in a
> reasonable amount of time, though it is still noticeably slow. With
> system regex, the diff given above doesn't complete in less than 90
> seconds (at which I get bored and kill it). With compat/regex, it
> completes in about 2.2 seconds. Disabling the xfuncname, it completes in
> 0.14 seconds.

And here is a patch series to use compat/regex on Solaris. I think the
first one should be non-controversial, as it just makes the knob more
convenient to turn. The second one is up for debate.

  1/2: Makefile: refactor regex compat support
  2/2: Makefile: use compat regex on Solaris

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH 1/2] Makefile: refactor regex compat support
  2009-06-16 12:07   ` Jeff King
@ 2009-06-16 12:11     ` Jeff King
  2009-06-16 18:47       ` Johannes Sixt
  2009-06-16 12:14     ` [PATCH " Jeff King
  1 sibling, 1 reply; 37+ messages in thread
From: Jeff King @ 2009-06-16 12:11 UTC (permalink / raw)
  To: John Bito; +Cc: Junio C Hamano, Johannes Sixt, git

There was no tweakable knob to use the regex compat code; it
was embedded in the mingw build. Since other platforms may
want to use it, let's factor it out in the usual way for
build configuration knobs.

Signed-off-by: Jeff King <peff@peff.net>
---
This should behave the same as before on all platforms. The only one I
might have screwed up by doing it wrong is mingw, but I have no machine
to test on. Johannes, can you confirm that it is right?

 Makefile |   11 +++++++++--
 1 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/Makefile b/Makefile
index 04bf8b1..a1e4e45 100644
--- a/Makefile
+++ b/Makefile
@@ -182,6 +182,8 @@ all::
 #
 # Define NO_CROSS_DIRECTORY_HARDLINKS if you plan to distribute the installed
 # programs as a tar, where bin/ and libexec/ might be on different file systems.
+#
+# Define NO_REGEX if you have no or inferior regex support in your C library.
 
 GIT-VERSION-FILE: .FORCE-GIT-VERSION-FILE
 	@$(SHELL_PATH) ./GIT-VERSION-GEN
@@ -863,10 +865,11 @@ ifneq (,$(findstring MINGW,$(uname_S)))
 	USE_WIN32_MMAP = YesPlease
 	UNRELIABLE_FSTAT = UnfortunatelyYes
 	OBJECT_CREATION_USES_RENAMES = UnfortunatelyNeedsTo
-	COMPAT_CFLAGS += -D__USE_MINGW_ACCESS -DNOGDI -Icompat -Icompat/regex -Icompat/fnmatch
+	NO_REGEX = YesPlease
+	COMPAT_CFLAGS += -D__USE_MINGW_ACCESS -DNOGDI -Icompat -Icompat/fnmatch
 	COMPAT_CFLAGS += -DSNPRINTF_SIZE_CORR=1
 	COMPAT_CFLAGS += -DSTRIP_EXTENSION=\".exe\"
-	COMPAT_OBJS += compat/mingw.o compat/fnmatch/fnmatch.o compat/regex/regex.o compat/winansi.o
+	COMPAT_OBJS += compat/mingw.o compat/fnmatch/fnmatch.o compat/winansi.o
 	EXTLIBS += -lws2_32
 	X = .exe
 endif
@@ -1157,6 +1160,10 @@ endif
 ifdef UNRELIABLE_FSTAT
 	BASIC_CFLAGS += -DUNRELIABLE_FSTAT
 endif
+ifdef NO_REGEX
+	COMPAT_CFLAGS += -Icompat/regex
+	COMPAT_OBJS += compat/regex/regex.o
+endif
 
 ifeq ($(TCLTK_PATH),)
 NO_TCLTK=NoThanks
-- 
1.6.3.2.225.gb8364.dirty

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 2/2] Makefile: use compat regex on Solaris
  2009-06-16 12:07   ` Jeff King
  2009-06-16 12:11     ` [PATCH 1/2] Makefile: refactor regex compat support Jeff King
@ 2009-06-16 12:14     ` Jeff King
  1 sibling, 0 replies; 37+ messages in thread
From: Jeff King @ 2009-06-16 12:14 UTC (permalink / raw)
  To: John Bito; +Cc: Junio C Hamano, Brandon Casey, git

The system regex is either slow or buggy for complex
patterns, like the built-in xfuncname pattern for java
files.

Signed-off-by: Jeff King <peff@peff.net>
---
If people want to see the speed difference, try:

  $ git clone git://repo.or.cz/egit.git
  $ cd egit
  $ time git diff v0.4.0 >/dev/null

before and after.

 Makefile |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/Makefile b/Makefile
index a1e4e45..0e09ec8 100644
--- a/Makefile
+++ b/Makefile
@@ -713,6 +713,7 @@ ifeq ($(uname_S),SunOS)
 	NO_HSTRERROR = YesPlease
 	NO_MKDTEMP = YesPlease
 	NO_MKSTEMPS = YesPlease
+	NO_REGEX = YesPlease
 	ifneq ($(uname_R),5.11)
 		OLD_ICONV = UnfortunatelyYes
 	endif
-- 
1.6.3.2.225.gb8364.dirty

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: git diff looping?
  2009-06-16 11:47 ` Jeff King
  2009-06-16 12:07   ` Jeff King
@ 2009-06-16 15:48   ` John Bito
  2009-06-16 16:51   ` Junio C Hamano
  2 siblings, 0 replies; 37+ messages in thread
From: John Bito @ 2009-06-16 15:48 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Thank you, Jeff!

Configuring the dummy regex allows the diff process to complete on
Solaris 10, as well.

~John

On Tue, Jun 16, 2009 at 4:47 AM, Jeff King<peff@peff.net> wrote:
> On Mon, Jun 15, 2009 at 06:37:21PM -0700, John Bito wrote:
>
>> Running Git 1.6.1 on Solaris 10, git diff seems to go into a loop -
>> consuming CPU and producing no output after a little bit.  While the
>> repository isn't small, it's not huge (it's
>> http://repo.or.cz/w/egit.git). I've tried the following:
>
> I can reproduce the problem on Solaris 8 using git v1.6.3. It seems to
> be caused by a horribly slow system regex implementation; it really
> chokes on the regex we use to find the "funcname" line for java files. I
> tried running "git diff v0.4.0" and it still hadn't finished after 90
> seconds. Then I did:
>
>  git config diff.java.xfuncname foo ;# some garbage regex
>  git diff v0.4.0
>
> and it completed in about 2.5 seconds.
>
> Can you try that and see if it works around the problem for you?
>
> If anybody wants to look further into the problem, I think it is
> specifically triggered by this file (and the built-in xfuncname for java
> files):
>
>  $ git clone git://repo.or.cz/egit.git
>  $ git diff v0.4.0 -- \
>    org.spearce.egit.core.test/src/org/spearce/egit/core/op/T0001_ConnectProviderOperationTest.java
>
> which isn't even all that big a file, but it is either causing some
> horrible algorithmic behavior in the regex library, or is outright
> sending it into an infinite loop.
>
> I tried building against the code in compat/regex; it completes in a
> reasonable amount of time, though it is still noticeably slow. With
> system regex, the diff given above doesn't complete in less than 90
> seconds (at which I get bored and kill it). With compat/regex, it
> completes in about 2.2 seconds. Disabling the xfuncname, it completes in
> 0.14 seconds.
>
> So I think it is a viable solution to recommend building against
> compat/regex on Solaris, but I think there is still room for improvement
> in what we ship in compat/.
>
> -Peff
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git diff looping?
  2009-06-16 11:47 ` Jeff King
  2009-06-16 12:07   ` Jeff King
  2009-06-16 15:48   ` git diff looping? John Bito
@ 2009-06-16 16:51   ` Junio C Hamano
  2009-06-16 17:15     ` Jeff King
  2009-06-16 17:16     ` git diff looping? John Bito
  2 siblings, 2 replies; 37+ messages in thread
From: Junio C Hamano @ 2009-06-16 16:51 UTC (permalink / raw)
  To: Jeff King; +Cc: John Bito, git

Jeff King <peff@peff.net> writes:

> I can reproduce the problem on Solaris 8 using git v1.6.3. It seems to
> be caused by a horribly slow system regex implementation; it really
> chokes on the regex we use to find the "funcname" line for java files.

Hmm.  Is running under LC_ALL=C LANG=C _with_ the slow system regex help?

> I tried building against the code in compat/regex; it completes in a
> reasonable amount of time, though it is still noticeably slow. With
> system regex, the diff given above doesn't complete in less than 90
> seconds (at which I get bored and kill it). With compat/regex, it
> completes in about 2.2 seconds. Disabling the xfuncname, it completes in
> 0.14 seconds.

In this particular case it is clear that a good way to fix the problem is
to replace Solaris's dumb regex implemention with what comes in compat/,
but I at the same time have to wonder if that funcname pattern for java
can somehow be simplified, so that it does not to require so sophisticated
implementation of regexp?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git diff looping?
  2009-06-16 16:51   ` Junio C Hamano
@ 2009-06-16 17:15     ` Jeff King
  2009-06-16 17:35       ` Brandon Casey
  2009-06-17  8:46       ` Paolo Bonzini
  2009-06-16 17:16     ` git diff looping? John Bito
  1 sibling, 2 replies; 37+ messages in thread
From: Jeff King @ 2009-06-16 17:15 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: John Bito, git

On Tue, Jun 16, 2009 at 09:51:24AM -0700, Junio C Hamano wrote:

> > I can reproduce the problem on Solaris 8 using git v1.6.3. It seems to
> > be caused by a horribly slow system regex implementation; it really
> > chokes on the regex we use to find the "funcname" line for java files.
> 
> Hmm.  Is running under LC_ALL=C LANG=C _with_ the slow system regex help?

No, it remains extremely slow (it is possible that it _is_ faster,
though, but I never managed to run either case to completion; they are
both clearly orders of magnitude off of acceptable).

> In this particular case it is clear that a good way to fix the problem is
> to replace Solaris's dumb regex implemention with what comes in compat/,
> but I at the same time have to wonder if that funcname pattern for java
> can somehow be simplified, so that it does not to require so sophisticated
> implementation of regexp?

That may be a possibility. The default pattern is actually two regexes
(one is a "do not match this" and the other is "match this"). The
problematic one seems to be (and that is a space and a tab between the
brackets):

  ^[      ]*(([   ]*[A-Za-z_][A-Za-z_0-9]*){2,}[  ]*\([^;]*)$

which I determined by setting diff.java.xfuncname just to that (and it
remains slow). Whereas setting it to:

  ^[     ]*(catch|do|for|if|instanceof|new|return|switch|throw|while)

completes in about 5 seconds of CPU time (in the actual pattern it is
negated, but that shouldn't matter as we do the negation ourselves).

Now that being said, 5 seconds is still embarrassingly bad. Watch this
(with the solaris system regex):

  $ git config diff.java.xfuncname '^[ 	]*(catch|do|for|if|instanceof|new|return|switch|throw|while)'
  $ time git diff v0.4.0 >/dev/null
  real    0m5.869s
  user    0m4.720s
  sys     0m0.200s

  $ git config diff.java.xfuncname foo
  $ time git diff v0.4.0 >/dev/null
  real    0m1.895s
  user    0m0.980s
  sys     0m0.210s

So besides learning that this machine is horribly slow, we can see that
running that relatively simple regex takes almost 4 seconds, compared to
a little over 1 second to do the entire rest of the diff. I am inclined
to say that regex performance like that is so bad that we shouldn't care
about optimizing for it, and just use something else.

Bear in mind that the same engine will be used for "grep", too. So you
aren't really doing "git grep" users any favors by linking against such
an awful library.

Really, that performance is so bad that I'm beginning to wonder if I am
somehow measuring something wrong. How could they ship something so
crappy through so many versions?

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git diff looping?
  2009-06-16 16:51   ` Junio C Hamano
  2009-06-16 17:15     ` Jeff King
@ 2009-06-16 17:16     ` John Bito
  2009-06-16 17:24       ` Jeff King
  1 sibling, 1 reply; 37+ messages in thread
From: John Bito @ 2009-06-16 17:16 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git

The Solaris 10 server here isn't set up to build git.  git/Makefile
isn't compatible with /usr/ccs/bin/make. Is it desired to have a
Makefile that's portable to the Sun tools?

I was going to test Jeff's patch, but I probably won't install GNU
make on this machine unless I find I more compelling need to build git
on Solaris.

If it would help folks out, I'd be willing to try to create a Makefile
patch that works with the Sun tools, but I don't currently have a
Linux machine that I can easily use to verify compatibility.


On Tue, Jun 16, 2009 at 9:51 AM, Junio C Hamano<gitster@pobox.com> wrote:
> Jeff King <peff@peff.net> writes:
>
>> I can reproduce the problem on Solaris 8 using git v1.6.3. It seems to
>> be caused by a horribly slow system regex implementation; it really
>> chokes on the regex we use to find the "funcname" line for java files.
>
> Hmm.  Is running under LC_ALL=C LANG=C _with_ the slow system regex help?
>
>> I tried building against the code in compat/regex; it completes in a
>> reasonable amount of time, though it is still noticeably slow. With
>> system regex, the diff given above doesn't complete in less than 90
>> seconds (at which I get bored and kill it). With compat/regex, it
>> completes in about 2.2 seconds. Disabling the xfuncname, it completes in
>> 0.14 seconds.
>
> In this particular case it is clear that a good way to fix the problem is
> to replace Solaris's dumb regex implemention with what comes in compat/,
> but I at the same time have to wonder if that funcname pattern for java
> can somehow be simplified, so that it does not to require so sophisticated
> implementation of regexp?
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git diff looping?
  2009-06-16 17:16     ` git diff looping? John Bito
@ 2009-06-16 17:24       ` Jeff King
  0 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2009-06-16 17:24 UTC (permalink / raw)
  To: John Bito; +Cc: Junio C Hamano, git

On Tue, Jun 16, 2009 at 10:16:39AM -0700, John Bito wrote:

> The Solaris 10 server here isn't set up to build git.  git/Makefile
> isn't compatible with /usr/ccs/bin/make. Is it desired to have a
> Makefile that's portable to the Sun tools?

No, the Makefile is hopelessly GNU, and that is intentional: the subset
of make that is portable means there are a lot of things you just can't
do. I think it was decided long ago that it wasn't worth trying to
support non-gmake.

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git diff looping?
  2009-06-16 17:15     ` Jeff King
@ 2009-06-16 17:35       ` Brandon Casey
  2009-06-16 17:39         ` John Bito
  2009-06-16 20:22         ` Brandon Casey
  2009-06-17  8:46       ` Paolo Bonzini
  1 sibling, 2 replies; 37+ messages in thread
From: Brandon Casey @ 2009-06-16 17:35 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, John Bito, git

Jeff King wrote:
> On Tue, Jun 16, 2009 at 09:51:24AM -0700, Junio C Hamano wrote:
> 
>>> I can reproduce the problem on Solaris 8 using git v1.6.3. It seems to
>>> be caused by a horribly slow system regex implementation; it really
>>> chokes on the regex we use to find the "funcname" line for java files.
>> Hmm.  Is running under LC_ALL=C LANG=C _with_ the slow system regex help?
> 
> No, it remains extremely slow (it is possible that it _is_ faster,
> though, but I never managed to run either case to completion; they are
> both clearly orders of magnitude off of acceptable).

I haven't tried setting LC_ALL, LANG, but this Solaris regex is MANY orders
of magnitude slower.  I've been running your example diff on the egit
repository for 2 hours and it still hasn't finished.  The compat/regex
version finished in 3 seconds.  Solaris 10 x86.

-brandon

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git diff looping?
  2009-06-16 17:35       ` Brandon Casey
@ 2009-06-16 17:39         ` John Bito
  2009-06-16 17:41           ` Jeff King
  2009-06-16 20:22         ` Brandon Casey
  1 sibling, 1 reply; 37+ messages in thread
From: John Bito @ 2009-06-16 17:39 UTC (permalink / raw)
  To: Brandon Casey; +Cc: Jeff King, Junio C Hamano, git

I believe the issue is that Solaris implements 'extended' regular
expressions only in regcomp/regexec.  The implementation of
regcmp/regex seems to be from SysV and supports only 'basic' regular
expressions.

On Tue, Jun 16, 2009 at 10:35 AM, Brandon Casey<casey@nrlssc.navy.mil> wrote:
> Jeff King wrote:
>> On Tue, Jun 16, 2009 at 09:51:24AM -0700, Junio C Hamano wrote:
>>
>>>> I can reproduce the problem on Solaris 8 using git v1.6.3. It seems to
>>>> be caused by a horribly slow system regex implementation; it really
>>>> chokes on the regex we use to find the "funcname" line for java files.
>>> Hmm.  Is running under LC_ALL=C LANG=C _with_ the slow system regex help?
>>
>> No, it remains extremely slow (it is possible that it _is_ faster,
>> though, but I never managed to run either case to completion; they are
>> both clearly orders of magnitude off of acceptable).
>
> I haven't tried setting LC_ALL, LANG, but this Solaris regex is MANY orders
> of magnitude slower.  I've been running your example diff on the egit
> repository for 2 hours and it still hasn't finished.  The compat/regex
> version finished in 3 seconds.  Solaris 10 x86.
>
> -brandon
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git diff looping?
  2009-06-16 17:39         ` John Bito
@ 2009-06-16 17:41           ` Jeff King
  0 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2009-06-16 17:41 UTC (permalink / raw)
  To: John Bito; +Cc: Brandon Casey, Junio C Hamano, git

On Tue, Jun 16, 2009 at 10:39:39AM -0700, John Bito wrote:

> I believe the issue is that Solaris implements 'extended' regular
> expressions only in regcomp/regexec.  The implementation of
> regcmp/regex seems to be from SysV and supports only 'basic' regular
> expressions.

The regexps in question end up being compiled by regcomp (see
xdiff-interface.c:xdiff_set_find_func), so I don't think that is the
issue.

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/2] Makefile: refactor regex compat support
  2009-06-16 12:11     ` [PATCH 1/2] Makefile: refactor regex compat support Jeff King
@ 2009-06-16 18:47       ` Johannes Sixt
  2009-06-16 19:05         ` Jeff King
  0 siblings, 1 reply; 37+ messages in thread
From: Johannes Sixt @ 2009-06-16 18:47 UTC (permalink / raw)
  To: Jeff King; +Cc: John Bito, Junio C Hamano, git

On Dienstag, 16. Juni 2009, Jeff King wrote:
> There was no tweakable knob to use the regex compat code; it
> was embedded in the mingw build. Since other platforms may
> want to use it, let's factor it out in the usual way for
> build configuration knobs.
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
> This should behave the same as before on all platforms. The only one I
> might have screwed up by doing it wrong is mingw, but I have no machine
> to test on. Johannes, can you confirm that it is right?

It compiles and passes t/t40* (diff stuff) on Windows. But the patch conflicts 
with recent master, though nothing worrisome.

-- Hannes

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/2] Makefile: refactor regex compat support
  2009-06-16 18:47       ` Johannes Sixt
@ 2009-06-16 19:05         ` Jeff King
  2009-06-16 19:07           ` [PATCH v2 " Jeff King
  2009-06-16 19:08           ` [PATCH v2 2/2] Makefile: use compat regex on Solaris Jeff King
  0 siblings, 2 replies; 37+ messages in thread
From: Jeff King @ 2009-06-16 19:05 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: John Bito, Junio C Hamano, git

On Tue, Jun 16, 2009 at 08:47:28PM +0200, Johannes Sixt wrote:

> It compiles and passes t/t40* (diff stuff) on Windows. But the patch
> conflicts with recent master, though nothing worrisome.

Thanks for checking.

Yeah, it looks like several changes got merged to master right after
my branch point. All the conflicts are purely textual. For convenience,
I'll repost a rebased version.

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v2 1/2] Makefile: refactor regex compat support
  2009-06-16 19:05         ` Jeff King
@ 2009-06-16 19:07           ` Jeff King
  2009-06-16 19:08           ` [PATCH v2 2/2] Makefile: use compat regex on Solaris Jeff King
  1 sibling, 0 replies; 37+ messages in thread
From: Jeff King @ 2009-06-16 19:07 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: John Bito, Junio C Hamano, git

There was no tweakable knob to use the regex compat code; it
was embedded in the mingw build. Since other platforms may
want to use it, let's factor it out in the usual way for
build configuration knobs.

Signed-off-by: Jeff King <peff@peff.net>
---
Rebased on today's master to resolve textual conflicts.

 Makefile |   11 +++++++++--
 1 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/Makefile b/Makefile
index 41ab8e9..0cb21da 100644
--- a/Makefile
+++ b/Makefile
@@ -194,6 +194,8 @@ all::
 #
 # Define USE_NED_ALLOCATOR if you want to replace the platforms default
 # memory allocators with the nedmalloc allocator written by Niall Douglas.
+#
+# Define NO_REGEX if you have no or inferior regex support in your C library.
 
 GIT-VERSION-FILE: .FORCE-GIT-VERSION-FILE
 	@$(SHELL_PATH) ./GIT-VERSION-GEN
@@ -884,9 +886,10 @@ ifneq (,$(findstring MINGW,$(uname_S)))
 	USE_NED_ALLOCATOR = YesPlease
 	UNRELIABLE_FSTAT = UnfortunatelyYes
 	OBJECT_CREATION_USES_RENAMES = UnfortunatelyNeedsTo
-	COMPAT_CFLAGS += -D__USE_MINGW_ACCESS -DNOGDI -Icompat -Icompat/regex -Icompat/fnmatch
+	NO_REGEX = YesPlease
+	COMPAT_CFLAGS += -D__USE_MINGW_ACCESS -DNOGDI -Icompat -Icompat/fnmatch
 	COMPAT_CFLAGS += -DSTRIP_EXTENSION=\".exe\"
-	COMPAT_OBJS += compat/mingw.o compat/fnmatch/fnmatch.o compat/regex/regex.o compat/winansi.o
+	COMPAT_OBJS += compat/mingw.o compat/fnmatch/fnmatch.o compat/winansi.o
 	EXTLIBS += -lws2_32
 	X = .exe
 ifneq (,$(wildcard ../THIS_IS_MSYSGIT))
@@ -1200,6 +1203,10 @@ endif
 ifdef UNRELIABLE_FSTAT
 	BASIC_CFLAGS += -DUNRELIABLE_FSTAT
 endif
+ifdef NO_REGEX
+	COMPAT_CFLAGS += -Icompat/regex
+	COMPAT_OBJS += compat/regex/regex.o
+endif
 
 ifdef USE_NED_ALLOCATOR
        COMPAT_CFLAGS += -DUSE_NED_ALLOCATOR -DOVERRIDE_STRDUP -DNDEBUG -DREPLACE_SYSTEM_ALLOCATOR -Icompat/nedmalloc
-- 
1.6.3.2.411.gffb5

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v2 2/2] Makefile: use compat regex on Solaris
  2009-06-16 19:05         ` Jeff King
  2009-06-16 19:07           ` [PATCH v2 " Jeff King
@ 2009-06-16 19:08           ` Jeff King
  2009-06-16 20:07             ` Brandon Casey
  2009-06-17 13:15             ` Mike Ralphson
  1 sibling, 2 replies; 37+ messages in thread
From: Jeff King @ 2009-06-16 19:08 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: John Bito, Junio C Hamano, git

The system regex is either slow or buggy for complex
patterns, like the built-in xfuncname pattern for java
files.

Signed-off-by: Jeff King <peff@peff.net>
---
Rebased to today's master to resolve textual conflicts.

 Makefile |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/Makefile b/Makefile
index 0cb21da..3bd0c08 100644
--- a/Makefile
+++ b/Makefile
@@ -725,6 +725,7 @@ ifeq ($(uname_S),SunOS)
 	NO_MEMMEM = YesPlease
 	NO_MKDTEMP = YesPlease
 	NO_MKSTEMPS = YesPlease
+	NO_REGEX = YesPlease
 	ifeq ($(uname_R),5.7)
 		NEEDS_RESOLV = YesPlease
 		NO_IPV6 = YesPlease
-- 
1.6.3.2.411.gffb5

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 2/2] Makefile: use compat regex on Solaris
  2009-06-16 19:08           ` [PATCH v2 2/2] Makefile: use compat regex on Solaris Jeff King
@ 2009-06-16 20:07             ` Brandon Casey
  2009-06-17 13:15             ` Mike Ralphson
  1 sibling, 0 replies; 37+ messages in thread
From: Brandon Casey @ 2009-06-16 20:07 UTC (permalink / raw)
  To: Jeff King; +Cc: Johannes Sixt, John Bito, Junio C Hamano, git

Jeff King wrote:
> The system regex is either slow or buggy for complex
> patterns, like the built-in xfuncname pattern for java
> files.
> 
> Signed-off-by: Jeff King <peff@peff.net>
> ---
> Rebased to today's master to resolve textual conflicts.
> 
>  Makefile |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
> 
> diff --git a/Makefile b/Makefile
> index 0cb21da..3bd0c08 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -725,6 +725,7 @@ ifeq ($(uname_S),SunOS)
>  	NO_MEMMEM = YesPlease
>  	NO_MKDTEMP = YesPlease
>  	NO_MKSTEMPS = YesPlease
> +	NO_REGEX = YesPlease
>  	ifeq ($(uname_R),5.7)
>  		NEEDS_RESOLV = YesPlease
>  		NO_IPV6 = YesPlease


You need to add -DHAVE_ALLOCA_H to the BASIC_CFLAGS statement in
the SunOS section of the Makefile so that alloca.h will be included
in compat/regex/regex.c.  This is necessary for the SUNWspro compiler.
It takes me a long time to compile, so I haven't checked yet whether
this causes any problems for the GNU compiler.

-brandon

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git diff looping?
  2009-06-16 17:35       ` Brandon Casey
  2009-06-16 17:39         ` John Bito
@ 2009-06-16 20:22         ` Brandon Casey
  1 sibling, 0 replies; 37+ messages in thread
From: Brandon Casey @ 2009-06-16 20:22 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, John Bito, git

Brandon Casey wrote:
> Jeff King wrote:
>> On Tue, Jun 16, 2009 at 09:51:24AM -0700, Junio C Hamano wrote:
>>
>>>> I can reproduce the problem on Solaris 8 using git v1.6.3. It seems to
>>>> be caused by a horribly slow system regex implementation; it really
>>>> chokes on the regex we use to find the "funcname" line for java files.
>>> Hmm.  Is running under LC_ALL=C LANG=C _with_ the slow system regex help?
>> No, it remains extremely slow (it is possible that it _is_ faster,
>> though, but I never managed to run either case to completion; they are
>> both clearly orders of magnitude off of acceptable).
> 
> I haven't tried setting LC_ALL, LANG, but this Solaris regex is MANY orders
> of magnitude slower.  I've been running your example diff on the egit
> repository for 2 hours and it still hasn't finished.  The compat/regex
> version finished in 3 seconds.  Solaris 10 x86.

Ok, I don't think this call is going to finish.  'git diff v0.4.0' on
Solaris 10 x86 using the native regex library.  It has been running now
for over 4.5 hours.

If you're interested in a data point from another non-gnu regex library,
I ran the same test on a mips IRIX6.5.  It took 19.5 secs, and this is
not a young machine.  It takes 4 secs when diff.java.xfuncname is set
to 'foo'.

-brandon

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git diff looping?
  2009-06-16 17:15     ` Jeff King
  2009-06-16 17:35       ` Brandon Casey
@ 2009-06-17  8:46       ` Paolo Bonzini
  2009-06-17 10:23         ` Jeff King
  1 sibling, 1 reply; 37+ messages in thread
From: Paolo Bonzini @ 2009-06-17  8:46 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, John Bito, git


> Really, that performance is so bad that I'm beginning to wonder if I am
> somehow measuring something wrong. How could they ship something so
> crappy through so many versions?

Because without some care in the matcher, the regex can be exponential. 
This happens because you can backtrack arbitrarily from [A-Za-z_0-9]* 
into [A-Za-z_] and ironically it also causes the regex not to work as 
intended; for example "catch(" can match the complex part of the regex 
(e.g. the first repetition can be "c" and the second can be "atch".

We can make it faster and more correct at the expense of additional 
complication.

Starting from:

^[ \t]*(([ \t]*[A-Za-z_][A-Za-z_0-9]*){2,}[ \t]*\([^;]*)$

we have to:

1) move [ \t] at the end of the repeated subexpression so that it 
removes the need for the [ \t] after

^[ \t]*(([A-Za-z_][A-Za-z_0-9]*[ \t]*){2,}\([^;]*)$

2) make sure that at least one space/tab is eaten on all but the last 
occurrence of the repeated subexpression.  To this end the LHS of {2,} 
is duplicated, once with [ \t]+ and once with [ \t]*.  The repetition 
itself becomes a + since the last occurrence is now separately handled:

^[ \t]*(([A-Za-z_][A-Za-z_0-9]*[ \t]+)+[A-Za-z_][A-Za-z_0-9]*
[ \t]*\([^;]*)$

Paolo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git diff looping?
  2009-06-17  8:46       ` Paolo Bonzini
@ 2009-06-17 10:23         ` Jeff King
  2009-06-17 11:02           ` Paolo Bonzini
                             ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Jeff King @ 2009-06-17 10:23 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Junio C Hamano, John Bito, git

On Wed, Jun 17, 2009 at 10:46:21AM +0200, Paolo Bonzini wrote:

> 2) make sure that at least one space/tab is eaten on all but the last  
> occurrence of the repeated subexpression.  To this end the LHS of {2,} is 
> duplicated, once with [ \t]+ and once with [ \t]*.  The repetition itself 
> becomes a + since the last occurrence is now separately handled:
>
> ^[ \t]*(([A-Za-z_][A-Za-z_0-9]*[ \t]+)+[A-Za-z_][A-Za-z_0-9]*
> [ \t]*\([^;]*)$

Thanks, I can confirm that this is _much_ faster. Here are some timings
from my Solaris 8 box for the "git diff v0.4.0" case using the system
and compat engines, and using three regexes: the original that git is
using now, an updated one with your regex above[1] replacing the second
line of the stock pattern, and a baseline regex of "." which should take
virtually no time at all.

  system,  orig: infinite
  system, paolo:   2.5s
  system,   ".":   0.6s
  compat,  orig: 288.0s
  compat, paolo:   1.5s
  compat,   ".":   0.6s

So it goes from infinite to 2.5s. Which still spends 3 times as long
matching funcname regexes as it does actually calculating the diff. The
compat library is a little better, but still chokes pretty badly on the
original regex.

Let's compare compat to the glibc implementation on my Debian box:

  system,  orig:   0.22s
  system, paolo:   0.22s
  system,   ".":   0.15s
  compat,  orig: 150.88s
  compat, paolo:   0.43s
  compat,   ".":   0.15s

Besides the exponential behavior on the original regex, it is still
about twice as slow as the system one.

So I think there are three possible optimizations worth considering:

  1. Replace the builtin diff.java.xfuncname pattern with what Paolo
     suggested (though I haven't verified its correctness beyond a
     cursory look at the results). This is easy to do, and will help
     people with crappy system regex libraries and people on
     compat/regex/ (right now just mingw) a _lot_. The downside is that
     it's a little harder to read the regex, but not terribly so.

  2. Recommend NO_REGEX for people with slow system regex libraries.
     This is also easy to do, and will help people even if we do (1) for
     two reasons:

       a. we process user-defined regexes through diff.*.xfuncname
          patterns, as well as through "git grep"; so we are protecting
          against poor performance when they give us a complex regex

       b. even on more reasonable regexps like Paolo's, we seem to get a
          2:1 speedup over the Solaris system library

  3. Replace compat/regex with something faster. It still produces
     exponential behavior in complex cases where glibc does not, and it
     seems to be about 1/3 as fast on Paolo's regex.

     I haven't looked at how large or how portable the glibc
     implementation is. Another alternative is that we could provide a
     simple compat/ as now, and have better support for linking against
     an external library like pcre, if it is available.

-Peff

[1] Note if you are cutting and pasting Paolo's regex into the C code,
    the "\(" needs to be "\\(", which I screwed up in my initial
    timings. :)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git diff looping?
  2009-06-17 10:23         ` Jeff King
@ 2009-06-17 11:02           ` Paolo Bonzini
  2009-06-17 11:31           ` Andreas Ericsson
  2009-06-17 14:26           ` [PATCH] avoid exponential regex match for java and objc function names Paolo Bonzini
  2 siblings, 0 replies; 37+ messages in thread
From: Paolo Bonzini @ 2009-06-17 11:02 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, John Bito, git

>   system,  orig:   0.22s
>   system, paolo:   0.22s
>   system,   ".":   0.15s
>   compat,  orig: 150.88s
>   compat, paolo:   0.43s
>   compat,   ".":   0.15s
> 
> Besides the exponential behavior on the original regex, it is still
> about twice as slow as the system one.

The reason is that the glibc regex is a DFA-based matcher.  It is much 
slower on regexes with backreferences, but otherwise it is faster.

>   1. Replace the builtin diff.java.xfuncname pattern with what Paolo
>      suggested (though I haven't verified its correctness beyond a
>      cursory look at the results).

I checked it a bit harder, but still it is not easy to check because of 
the false positives in the original regex.  I'm pretty sure it's correct 
  though; I find it even easier to read (though longer) than the 
original one.

>      I haven't looked at how large or how portable the glibc
>      implementation is.

Decently portable, but I don't think it's worth it.  Users that write 
regexes so complex should know of the exponential behavior, I think.

Paolo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git diff looping?
  2009-06-17 10:23         ` Jeff King
  2009-06-17 11:02           ` Paolo Bonzini
@ 2009-06-17 11:31           ` Andreas Ericsson
  2009-06-17 13:08             ` Paolo Bonzini
  2009-06-17 14:26           ` [PATCH] avoid exponential regex match for java and objc function names Paolo Bonzini
  2 siblings, 1 reply; 37+ messages in thread
From: Andreas Ericsson @ 2009-06-17 11:31 UTC (permalink / raw)
  To: Jeff King; +Cc: Paolo Bonzini, Junio C Hamano, John Bito, git

Jeff King wrote:
> 
>   3. Replace compat/regex with something faster. It still produces
>      exponential behavior in complex cases where glibc does not, and it
>      seems to be about 1/3 as fast on Paolo's regex.
> 
>      I haven't looked at how large or how portable the glibc
>      implementation is. Another alternative is that we could provide a
>      simple compat/ as now, and have better support for linking against
>      an external library like pcre, if it is available.
> 

The glibc implementation is quite large. Cutting the library-specific
cruft it still sits at about 10k LOC.

Using PCRE is a no-go, as it uses perl-compatible regexes even for the
posix-compatible API, as per pcreposix(3):

       When  PCRE  is  called  via these functions, it is only the API that is
       POSIX-like in style. The syntax and semantics of  the  regular  expres-
       sions  themselves  are  still  those of Perl, subject to the setting of
       various PCRE options, as described below. "POSIX-like in  style"  means
       that  the  API  approximates  to  the POSIX definition; it is not fully
       POSIX-compatible, and in multi-byte encoding  domains  it  is  probably
       even less compatible.

This would probably surprise some "git grep" users quite a lot, I think.

I like your other two suggestions though. The stuff already in compat/
seems to work well enough, so with Paolo's improved pattern it should
be fine.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git diff looping?
  2009-06-17 11:31           ` Andreas Ericsson
@ 2009-06-17 13:08             ` Paolo Bonzini
  2009-06-17 13:16               ` Andreas Ericsson
  0 siblings, 1 reply; 37+ messages in thread
From: Paolo Bonzini @ 2009-06-17 13:08 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano, John Bito, git


> The glibc implementation is quite large. Cutting the library-specific
> cruft it still sits at about 10k LOC.
> 
> Using PCRE is a no-go, as it uses perl-compatible regexes even for the
> posix-compatible API, as per pcreposix(3):

I have a PCRE fork that has POSIX semantics (except the braindead 
leftmost-longest *sub*expressions).  It weighs 8kLOC, you can find it in 
branch ssed of GNU sed's git repository.

Paolo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 2/2] Makefile: use compat regex on Solaris
  2009-06-16 19:08           ` [PATCH v2 2/2] Makefile: use compat regex on Solaris Jeff King
  2009-06-16 20:07             ` Brandon Casey
@ 2009-06-17 13:15             ` Mike Ralphson
  2009-06-17 13:55               ` Mike Ralphson
  1 sibling, 1 reply; 37+ messages in thread
From: Mike Ralphson @ 2009-06-17 13:15 UTC (permalink / raw)
  To: Jeff King, Junio C Hamano; +Cc: Johannes Sixt, John Bito, git

2009/6/16 Jeff King <peff@peff.net>
>
> The system regex is either slow or buggy for complex
> patterns, like the built-in xfuncname pattern for java
> files.

Also required on AIX (5.3). I thought this was a major performance
regression somewhere else, but my 'old stable' version dates from a
time when we were briefly using compat/regex on this platform before
we reverted that change and switched over to extended regexes.

With the above:
3m33.399s

Without:
... manually aborted after 58m !

git 1.6.0.2.229.g1293c
3m21.645s

Could you squash

+       NO_REGEX = YesPlease

in ifeq ($(uname_S),AIX) please? Shout if follow-up patch preferred.

Signed-off-by: Mike Ralphson <mike@abacus.co.uk>

Mike

PS Pound to a penny INTERNAL_QSORT would be a win on Solaris too...

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git diff looping?
  2009-06-17 13:08             ` Paolo Bonzini
@ 2009-06-17 13:16               ` Andreas Ericsson
  2009-06-17 13:58                 ` Paolo Bonzini
  0 siblings, 1 reply; 37+ messages in thread
From: Andreas Ericsson @ 2009-06-17 13:16 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Jeff King, Junio C Hamano, John Bito, git

Paolo Bonzini wrote:
> 
>> The glibc implementation is quite large. Cutting the library-specific
>> cruft it still sits at about 10k LOC.
>>
>> Using PCRE is a no-go, as it uses perl-compatible regexes even for the
>> posix-compatible API, as per pcreposix(3):
> 
> I have a PCRE fork that has POSIX semantics (except the braindead 
> leftmost-longest *sub*expressions).  It weighs 8kLOC, you can find it in 
> branch ssed of GNU sed's git repository.
> 

Sounds neat. Do you by any chance have some performance measurements
for it? If the work's already done and it provides a significant
improvement I'm all for it ;-)

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 2/2] Makefile: use compat regex on Solaris
  2009-06-17 13:15             ` Mike Ralphson
@ 2009-06-17 13:55               ` Mike Ralphson
  0 siblings, 0 replies; 37+ messages in thread
From: Mike Ralphson @ 2009-06-17 13:55 UTC (permalink / raw)
  To: Jeff King, Junio C Hamano, Paolo Bonzini; +Cc: Johannes Sixt, John Bito, git

2009/6/17 Mike Ralphson <mike.ralphson@gmail.com>:
> Also required on AIX (5.3).

Scratch that, sorry - I hadn't seen the related thread (git diff looping?)

Paolo's fix is the only one required for AIX.

Sorry for the noise.

Mike

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: git diff looping?
  2009-06-17 13:16               ` Andreas Ericsson
@ 2009-06-17 13:58                 ` Paolo Bonzini
  0 siblings, 0 replies; 37+ messages in thread
From: Paolo Bonzini @ 2009-06-17 13:58 UTC (permalink / raw)
  To: Andreas Ericsson; +Cc: Jeff King, Junio C Hamano, John Bito, git


> Sounds neat. Do you by any chance have some performance measurements
> for it? If the work's already done and it provides a significant
> improvement I'm all for it ;-)

It's very very fast, but only as fast as a backtracking matcher can be. 
  I think it would trounce glibc on my regex but probably not on the 
buggy one.

Paolo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH] avoid exponential regex match for java and objc function names
  2009-06-17 10:23         ` Jeff King
  2009-06-17 11:02           ` Paolo Bonzini
  2009-06-17 11:31           ` Andreas Ericsson
@ 2009-06-17 14:26           ` Paolo Bonzini
  2009-06-17 15:46             ` demerphq
  2009-06-17 16:42             ` Junio C Hamano
  2 siblings, 2 replies; 37+ messages in thread
From: Paolo Bonzini @ 2009-06-17 14:26 UTC (permalink / raw)
  To: git; +Cc: peff, gitster

In the old regex

^[ \t]*(([ \t]*[A-Za-z_][A-Za-z_0-9]*){2,}[ \t]*\([^;]*)$
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

you can backtrack arbitrarily from [A-Za-z_0-9]* into [A-Za-z_], thus
causing an exponential number of backtracks.  Ironically it also causes
the regex not to work as intended; for example "catch" can match the
underlined part of the regex, the first repetition matching "c" and
the second matching "atch".

The replacement regex avoids this problem, because it makes sure that
at least a space/tab is eaten on each repetition.  In other words,
a suffix of a repetition can never be a prefix of the next repetition.

Signed-off-by: Paolo Bonzini <bonzini@gnu.org>
---
 userdiff.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/userdiff.c b/userdiff.c
index d556da9..57529ae 100644
--- a/userdiff.c
+++ b/userdiff.c
@@ -13,7 +13,8 @@ PATTERNS("html", "^[ \t]*(<[Hh][1-6][ \t].*>.*)$",
 	 "[^<>= \t]+|[^[:space:]]|[\x80-\xff]+"),
 PATTERNS("java",
 	 "!^[ \t]*(catch|do|for|if|instanceof|new|return|switch|throw|while)\n"
-	 "^[ \t]*(([ \t]*[A-Za-z_][A-Za-z_0-9]*){2,}[ \t]*\\([^;]*)$",
+	 "^[ \t]*(([A-Za-z_][A-Za-z_0-9]*[ \t]+)+[A-Za-z_][A-Za-z_0-9]*[ \t]*\\([^;]*)$",
+	 /* -- */
 	 "[a-zA-Z_][a-zA-Z0-9_]*"
 	 "|[-+0-9.e]+[fFlL]?|0[xXbB]?[0-9a-fA-F]+[lL]?"
 	 "|[-+*/<>%&^|=!]="
@@ -25,7 +26,7 @@ PATTERNS("objc",
 	 /* Objective-C methods */
 	 "^[ \t]*([-+][ \t]*\\([ \t]*[A-Za-z_][A-Za-z_0-9* \t]*\\)[ \t]*[A-Za-z_].*)$\n"
 	 /* C functions */
-	 "^[ \t]*(([ \t]*[A-Za-z_][A-Za-z_0-9]*){2,}[ \t]*\\([^;]*)$\n"
+	 "^[ \t]*(([A-Za-z_][A-Za-z_0-9]*[ \t]+)+[A-Za-z_][A-Za-z_0-9]*[ \t]*\\([^;]*)$\n"
 	 /* Objective-C class/protocol definitions */
 	 "^(@(implementation|interface|protocol)[ \t].*)$",
 	 /* -- */
-- 
1.6.0.3

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH] avoid exponential regex match for java and objc function  names
  2009-06-17 14:26           ` [PATCH] avoid exponential regex match for java and objc function names Paolo Bonzini
@ 2009-06-17 15:46             ` demerphq
  2009-06-17 15:56               ` Jeff King
  2009-06-17 16:42             ` Junio C Hamano
  1 sibling, 1 reply; 37+ messages in thread
From: demerphq @ 2009-06-17 15:46 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: git, peff, gitster

Just  a note, but If the  Java regex library you are using supports
the PCRE compatible (?>...) atomic matching construct (or their
equivalent *+ and ++) then these patterns can be significantly
improved beyond their current state.


2009/6/17 Paolo Bonzini <bonzini@gnu.org>:
> In the old regex
>
> ^[ \t]*(([ \t]*[A-Za-z_][A-Za-z_0-9]*){2,}[ \t]*\([^;]*)$
>        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> you can backtrack arbitrarily from [A-Za-z_0-9]* into [A-Za-z_], thus
> causing an exponential number of backtracks.  Ironically it also causes
> the regex not to work as intended; for example "catch" can match the
> underlined part of the regex, the first repetition matching "c" and
> the second matching "atch".
>
> The replacement regex avoids this problem, because it makes sure that
> at least a space/tab is eaten on each repetition.  In other words,
> a suffix of a repetition can never be a prefix of the next repetition.
>
> Signed-off-by: Paolo Bonzini <bonzini@gnu.org>
> ---
>  userdiff.c |    5 +++--
>  1 files changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/userdiff.c b/userdiff.c
> index d556da9..57529ae 100644
> --- a/userdiff.c
> +++ b/userdiff.c
> @@ -13,7 +13,8 @@ PATTERNS("html", "^[ \t]*(<[Hh][1-6][ \t].*>.*)$",
>         "[^<>= \t]+|[^[:space:]]|[\x80-\xff]+"),
>  PATTERNS("java",
>         "!^[ \t]*(catch|do|for|if|instanceof|new|return|switch|throw|while)\n"
> -        "^[ \t]*(([ \t]*[A-Za-z_][A-Za-z_0-9]*){2,}[ \t]*\\([^;]*)$",
> +        "^[ \t]*(([A-Za-z_][A-Za-z_0-9]*[ \t]+)+[A-Za-z_][A-Za-z_0-9]*[ \t]*\\([^;]*)$",
> +        /* -- */
>         "[a-zA-Z_][a-zA-Z0-9_]*"
>         "|[-+0-9.e]+[fFlL]?|0[xXbB]?[0-9a-fA-F]+[lL]?"
>         "|[-+*/<>%&^|=!]="
> @@ -25,7 +26,7 @@ PATTERNS("objc",
>         /* Objective-C methods */
>         "^[ \t]*([-+][ \t]*\\([ \t]*[A-Za-z_][A-Za-z_0-9* \t]*\\)[ \t]*[A-Za-z_].*)$\n"
>         /* C functions */
> -        "^[ \t]*(([ \t]*[A-Za-z_][A-Za-z_0-9]*){2,}[ \t]*\\([^;]*)$\n"
> +        "^[ \t]*(([A-Za-z_][A-Za-z_0-9]*[ \t]+)+[A-Za-z_][A-Za-z_0-9]*[ \t]*\\([^;]*)$\n"
>         /* Objective-C class/protocol definitions */
>         "^(@(implementation|interface|protocol)[ \t].*)$",
>         /* -- */
> --
> 1.6.0.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe git" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



-- 
perl -Mre=debug -e "/just|another|perl|hacker/"

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] avoid exponential regex match for java and objc function names
  2009-06-17 15:46             ` demerphq
@ 2009-06-17 15:56               ` Jeff King
  2009-06-17 16:00                 ` demerphq
  0 siblings, 1 reply; 37+ messages in thread
From: Jeff King @ 2009-06-17 15:56 UTC (permalink / raw)
  To: demerphq; +Cc: Paolo Bonzini, git, gitster

On Wed, Jun 17, 2009 at 05:46:54PM +0200, demerphq wrote:

> Just  a note, but If the  Java regex library you are using supports
> the PCRE compatible (?>...) atomic matching construct (or their
> equivalent *+ and ++) then these patterns can be significantly
> improved beyond their current state.

To clarify, this isn't a java regex library, but rather regexps used to
match function names inside java language files when generating diffs.
The regex library itself is the POSIX regex routines provided by libc.

PCRE syntax is nice, but we don't want to require it for every build,
and it's important to have the same syntax everywhere (so that, e.g.,
your config from one build works on a different build).

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] avoid exponential regex match for java and objc function  names
  2009-06-17 15:56               ` Jeff King
@ 2009-06-17 16:00                 ` demerphq
  2009-06-17 16:04                   ` Paolo Bonzini
  0 siblings, 1 reply; 37+ messages in thread
From: demerphq @ 2009-06-17 16:00 UTC (permalink / raw)
  To: Jeff King; +Cc: Paolo Bonzini, git, gitster

2009/6/17 Jeff King <peff@peff.net>:
> On Wed, Jun 17, 2009 at 05:46:54PM +0200, demerphq wrote:
>
>> Just  a note, but If the  Java regex library you are using supports
>> the PCRE compatible (?>...) atomic matching construct (or their
>> equivalent *+ and ++) then these patterns can be significantly
>> improved beyond their current state.
>
> To clarify, this isn't a java regex library, but rather regexps used to
> match function names inside java language files when generating diffs.
> The regex library itself is the POSIX regex routines provided by libc.
>
> PCRE syntax is nice, but we don't want to require it for every build,
> and it's important to have the same syntax everywhere (so that, e.g.,
> your config from one build works on a different build).

Ah ok. Im not familiar with the finer points of the POSIX engine, but
PCRE and Perl's engine, and most similar engines are not true regular
expression engines and thus benefit *greatly* from atomic matching if
it is available.

Like the difference between heat-death performance (or stack
overflow), and running instantly.

Yves



-- 
perl -Mre=debug -e "/just|another|perl|hacker/"

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] avoid exponential regex match for java and objc function names
  2009-06-17 16:00                 ` demerphq
@ 2009-06-17 16:04                   ` Paolo Bonzini
  0 siblings, 0 replies; 37+ messages in thread
From: Paolo Bonzini @ 2009-06-17 16:04 UTC (permalink / raw)
  To: demerphq; +Cc: Jeff King, Paolo Bonzini, git, gitster


> Ah ok. Im not familiar with the finer points of the POSIX engine, but
> PCRE and Perl's engine, and most similar engines are not true regular
> expression engines and thus benefit *greatly* from atomic matching if
> it is available.
> 
> Like the difference between heat-death performance (or stack
> overflow), and running instantly.

You can almost always fix the regex to avoid this, by ensuring that 
whenever you have (...)+ (or *) a suffix of the subexpression cannot be 
a prefix of the subexpression too.  This is what my patch did -- 
changing a bad regex to a nicely behaving one.

Paolo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] avoid exponential regex match for java and objc function names
  2009-06-17 14:26           ` [PATCH] avoid exponential regex match for java and objc function names Paolo Bonzini
  2009-06-17 15:46             ` demerphq
@ 2009-06-17 16:42             ` Junio C Hamano
  2009-06-18  6:45               ` Paolo Bonzini
  1 sibling, 1 reply; 37+ messages in thread
From: Junio C Hamano @ 2009-06-17 16:42 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: git, peff

Paolo Bonzini <bonzini@gnu.org> writes:

> In the old regex
>
> ^[ \t]*(([ \t]*[A-Za-z_][A-Za-z_0-9]*){2,}[ \t]*\([^;]*)$
>         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> you can backtrack arbitrarily from [A-Za-z_0-9]* into [A-Za-z_], thus
> causing an exponential number of backtracks.  Ironically it also causes
> the regex not to work as intended; for example "catch" can match the
> underlined part of the regex, the first repetition matching "c" and
> the second matching "atch".
>
> The replacement regex avoids this problem, because it makes sure that
> at least a space/tab is eaten on each repetition.  In other words,
> a suffix of a repetition can never be a prefix of the next repetition.

Thanks; nicely done.

Should I remove the "/* -- */" or is it for better readability I should
keep?

> Signed-off-by: Paolo Bonzini <bonzini@gnu.org>
> ---
>  userdiff.c |    5 +++--
>  1 files changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/userdiff.c b/userdiff.c
> index d556da9..57529ae 100644
> --- a/userdiff.c
> +++ b/userdiff.c
> @@ -13,7 +13,8 @@ PATTERNS("html", "^[ \t]*(<[Hh][1-6][ \t].*>.*)$",
>  	 "[^<>= \t]+|[^[:space:]]|[\x80-\xff]+"),
>  PATTERNS("java",
>  	 "!^[ \t]*(catch|do|for|if|instanceof|new|return|switch|throw|while)\n"
> -	 "^[ \t]*(([ \t]*[A-Za-z_][A-Za-z_0-9]*){2,}[ \t]*\\([^;]*)$",
> +	 "^[ \t]*(([A-Za-z_][A-Za-z_0-9]*[ \t]+)+[A-Za-z_][A-Za-z_0-9]*[ \t]*\\([^;]*)$",
> +	 /* -- */
>  	 "[a-zA-Z_][a-zA-Z0-9_]*"
>  	 "|[-+0-9.e]+[fFlL]?|0[xXbB]?[0-9a-fA-F]+[lL]?"
>  	 "|[-+*/<>%&^|=!]="
> @@ -25,7 +26,7 @@ PATTERNS("objc",
>  	 /* Objective-C methods */
>  	 "^[ \t]*([-+][ \t]*\\([ \t]*[A-Za-z_][A-Za-z_0-9* \t]*\\)[ \t]*[A-Za-z_].*)$\n"
>  	 /* C functions */
> -	 "^[ \t]*(([ \t]*[A-Za-z_][A-Za-z_0-9]*){2,}[ \t]*\\([^;]*)$\n"
> +	 "^[ \t]*(([A-Za-z_][A-Za-z_0-9]*[ \t]+)+[A-Za-z_][A-Za-z_0-9]*[ \t]*\\([^;]*)$\n"
>  	 /* Objective-C class/protocol definitions */
>  	 "^(@(implementation|interface|protocol)[ \t].*)$",
>  	 /* -- */
> -- 
> 1.6.0.3

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] avoid exponential regex match for java and objc function names
  2009-06-17 16:42             ` Junio C Hamano
@ 2009-06-18  6:45               ` Paolo Bonzini
  0 siblings, 0 replies; 37+ messages in thread
From: Paolo Bonzini @ 2009-06-18  6:45 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Paolo Bonzini, git, peff


> Should I remove the "/* -- */" or is it for better readability I should
> keep?

It helps detecting the separation between the function regex and the 
word regex:

>> -	 "^[ \t]*(([ \t]*[A-Za-z_][A-Za-z_0-9]*){2,}[ \t]*\\([^;]*)$",
>> +	 "^[ \t]*(([A-Za-z_][A-Za-z_0-9]*[ \t]+)+[A-Za-z_][A-Za-z_0-9]*[ \t]*\\([^;]*)$",
>> +	 /* -- */

I stole the idea from the Objective-C part:

>>  	 /* C functions */
>> -	 "^[ \t]*(([ \t]*[A-Za-z_][A-Za-z_0-9]*){2,}[ \t]*\\([^;]*)$\n"
>> +	 "^[ \t]*(([A-Za-z_][A-Za-z_0-9]*[ \t]+)+[A-Za-z_][A-Za-z_0-9]*[ \t]*\\([^;]*)$\n"
>>  	 /* Objective-C class/protocol definitions */
>>  	 "^(@(implementation|interface|protocol)[ \t].*)$",
>>  	 /* -- */

Paolo

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2009-06-18  6:46 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-06-16  1:37 git diff looping? John Bito
2009-06-16  2:44 ` Jeff Epler
2009-06-16  2:53   ` John Bito
2009-06-16 11:47 ` Jeff King
2009-06-16 12:07   ` Jeff King
2009-06-16 12:11     ` [PATCH 1/2] Makefile: refactor regex compat support Jeff King
2009-06-16 18:47       ` Johannes Sixt
2009-06-16 19:05         ` Jeff King
2009-06-16 19:07           ` [PATCH v2 " Jeff King
2009-06-16 19:08           ` [PATCH v2 2/2] Makefile: use compat regex on Solaris Jeff King
2009-06-16 20:07             ` Brandon Casey
2009-06-17 13:15             ` Mike Ralphson
2009-06-17 13:55               ` Mike Ralphson
2009-06-16 12:14     ` [PATCH " Jeff King
2009-06-16 15:48   ` git diff looping? John Bito
2009-06-16 16:51   ` Junio C Hamano
2009-06-16 17:15     ` Jeff King
2009-06-16 17:35       ` Brandon Casey
2009-06-16 17:39         ` John Bito
2009-06-16 17:41           ` Jeff King
2009-06-16 20:22         ` Brandon Casey
2009-06-17  8:46       ` Paolo Bonzini
2009-06-17 10:23         ` Jeff King
2009-06-17 11:02           ` Paolo Bonzini
2009-06-17 11:31           ` Andreas Ericsson
2009-06-17 13:08             ` Paolo Bonzini
2009-06-17 13:16               ` Andreas Ericsson
2009-06-17 13:58                 ` Paolo Bonzini
2009-06-17 14:26           ` [PATCH] avoid exponential regex match for java and objc function names Paolo Bonzini
2009-06-17 15:46             ` demerphq
2009-06-17 15:56               ` Jeff King
2009-06-17 16:00                 ` demerphq
2009-06-17 16:04                   ` Paolo Bonzini
2009-06-17 16:42             ` Junio C Hamano
2009-06-18  6:45               ` Paolo Bonzini
2009-06-16 17:16     ` git diff looping? John Bito
2009-06-16 17:24       ` Jeff King

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).