All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3] Fix a segfault caused by regexec() being called on mmap()ed data
@ 2016-09-05 15:44 Johannes Schindelin
  2016-09-05 15:45 ` [PATCH 1/3] Demonstrate a problem: our pickaxe code assumes NUL-terminated buffers Johannes Schindelin
                   ` (5 more replies)
  0 siblings, 6 replies; 66+ messages in thread
From: Johannes Schindelin @ 2016-09-05 15:44 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano

This patch series addresses a problem where `git diff` is called using
`-G` or `-S --pickaxe-regex` on new-born files that are configured
without user diff drivers, and that hence get mmap()ed into memory.

The problem with that: mmap()ed memory is *not* NUL-terminated, yet the
pickaxe code calls regexec() on it just the same.

This problem has been reported by my colleague Chris Sidi.

Please note that this patch series is a hot fix I applied to Git for
Windows (the bug does not trigger a segmentation fault for me on Linux,
strangely enough, but it is really a problem on Windows).

So at least I have a workaround in place. Ideally, though, we would
NUL-terminate the buffers only when needed, or somehow call regexec() on
ptr/size parameters instead of passing a supposedly NUL-terminated
string to it?


Johannes Schindelin (3):
  Demonstrate a problem: our pickaxe code assumes NUL-terminated buffers
  diff_populate_filespec: NUL-terminate buffers
  diff_grep: add assertions verifying that the buffers are
    NUL-terminated

 diff.c                  |  9 +++++++++
 diffcore-pickaxe.c      |  2 ++
 t/t4059-diff-pickaxe.sh | 22 ++++++++++++++++++++++
 3 files changed, 33 insertions(+)
 create mode 100755 t/t4059-diff-pickaxe.sh

Published-As: https://github.com/dscho/git/releases/tag/mmap-regexec-v1
Fetch-It-Via: git fetch https://github.com/dscho/git mmap-regexec-v1

-- 
2.10.0.windows.1.2.g732a511

base-commit: 6ebdac1bab966b720d776aa43ca188fe378b1f4b

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 1/3] Demonstrate a problem: our pickaxe code assumes NUL-terminated buffers
  2016-09-05 15:44 [PATCH 0/3] Fix a segfault caused by regexec() being called on mmap()ed data Johannes Schindelin
@ 2016-09-05 15:45 ` Johannes Schindelin
  2016-09-06 18:43   ` Jeff King
  2016-09-05 15:45 ` [PATCH 2/3] diff_populate_filespec: NUL-terminate buffers Johannes Schindelin
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 66+ messages in thread
From: Johannes Schindelin @ 2016-09-05 15:45 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano

When our pickaxe code feeds file contents to regexec(), it implicitly
assumes that the file contents are read into implicitly NUL-terminated
buffers (i.e. that we overallocate by 1, appending a single '\0').

This is not so.

In particular when the file contents are simply mmap()ed, we can be
virtually certain that the buffer is preceding uninitialized bytes, or
invalid pages.

Note that the test we add here is known to be flakey: we simply cannot
know whether the byte following the mmap()ed ones is a NUL or not.

Typically, on Linux the test passes. On Windows, it fails virtually
every time due to an access violation (that's a segmentation fault for
you Unix-y people out there). And Windows would be correct: the
regexec() call wants to operate on a regular, NUL-terminated string,
there is no NUL in the mmap()ed memory range, and it is undefined
whether the next byte is even legal to access.

When run with --valgrind it demonstrates quite clearly the breakage, of
course.

So we simply mark it with `test_expect_success` for now.

This test case represents a Minimal, Complete and Verifiable Example of
a breakage reported by Chris Sidi.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 t/t4059-diff-pickaxe.sh | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)
 create mode 100755 t/t4059-diff-pickaxe.sh

diff --git a/t/t4059-diff-pickaxe.sh b/t/t4059-diff-pickaxe.sh
new file mode 100755
index 0000000..f0bf50b
--- /dev/null
+++ b/t/t4059-diff-pickaxe.sh
@@ -0,0 +1,22 @@
+#!/bin/sh
+#
+# Copyright (c) 2016 Johannes Schindelin
+#
+
+test_description='Pickaxe options'
+
+. ./test-lib.sh
+
+test_expect_success setup '
+	test_commit initial &&
+	printf "%04096d" 0 >4096-zeroes.txt &&
+	git add 4096-zeroes.txt &&
+	test_tick &&
+	git commit -m "A 4k file"
+'
+test_expect_success '-G matches' '
+	git diff --name-only -G "^0{4096}$" HEAD^ >out &&
+	test 4096-zeroes.txt = "$(cat out)"
+'
+
+test_done
-- 
2.10.0.windows.1.2.g732a511



^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 2/3] diff_populate_filespec: NUL-terminate buffers
  2016-09-05 15:44 [PATCH 0/3] Fix a segfault caused by regexec() being called on mmap()ed data Johannes Schindelin
  2016-09-05 15:45 ` [PATCH 1/3] Demonstrate a problem: our pickaxe code assumes NUL-terminated buffers Johannes Schindelin
@ 2016-09-05 15:45 ` Johannes Schindelin
  2016-09-06  7:06   ` Jeff King
  2016-09-05 15:45 ` [PATCH 3/3] diff_grep: add assertions verifying that the buffers are NUL-terminated Johannes Schindelin
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 66+ messages in thread
From: Johannes Schindelin @ 2016-09-05 15:45 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano

While the xdiff machinery is quite capable of working with strings given
as pointer and size, Git's add-on functionality simply assumes that we
are operating on NUL-terminated strings, e.g. y running regexec() on the
provided pointer, with no way to pass the size, too.

In general, this assumption is wrong.

It is true that many code paths populate the mmfile_t structure silently
appending a NUL, e.g. when running textconv on a temporary file and
reading the results back into an strbuf.

The assumption is most definitely wrong, however, when mmap()ing a file.

Practically, we seemed to be lucky that the bytes after mmap()ed memory
were 1) accessible and 2) somehow contained NUL bytes *somewhere*.

In a use case reported by Chris Sidi, it turned out that the mmap()ed
file had the precise size of a memory page, and on Windows the bytes
after memory-mapped pages are in general not valid.

This patch works around that issue, giving us time to discuss the best
course how to fix this problem more generally.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 diff.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/diff.c b/diff.c
index 534c12e..32f7f46 100644
--- a/diff.c
+++ b/diff.c
@@ -2826,6 +2826,15 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
 			s->data = strbuf_detach(&buf, &size);
 			s->size = size;
 			s->should_free = 1;
+		} else {
+			/* data must be NUL-terminated so e.g. for regexec() */
+			char *data = xmalloc(s->size + 1);
+			memcpy(data, s->data, s->size);
+			data[s->size] = '\0';
+			munmap(s->data, s->size);
+			s->should_munmap = 0;
+			s->data = data;
+			s->should_free = 1;
 		}
 	}
 	else {
-- 
2.10.0.windows.1.2.g732a511



^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 3/3] diff_grep: add assertions verifying that the buffers are NUL-terminated
  2016-09-05 15:44 [PATCH 0/3] Fix a segfault caused by regexec() being called on mmap()ed data Johannes Schindelin
  2016-09-05 15:45 ` [PATCH 1/3] Demonstrate a problem: our pickaxe code assumes NUL-terminated buffers Johannes Schindelin
  2016-09-05 15:45 ` [PATCH 2/3] diff_populate_filespec: NUL-terminate buffers Johannes Schindelin
@ 2016-09-05 15:45 ` Johannes Schindelin
  2016-09-06  7:08   ` Jeff King
  2016-09-05 19:10 ` [PATCH 0/3] Fix a segfault caused by regexec() being called on mmap()ed data Junio C Hamano
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 66+ messages in thread
From: Johannes Schindelin @ 2016-09-05 15:45 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano

Before calling regexec() on the file contents, we better be certain that
the strings fulfill the contract of C strings assumed by said function.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 diffcore-pickaxe.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
index 55067ca..88820b6 100644
--- a/diffcore-pickaxe.c
+++ b/diffcore-pickaxe.c
@@ -49,6 +49,8 @@ static int diff_grep(mmfile_t *one, mmfile_t *two,
 	xpparam_t xpp;
 	xdemitconf_t xecfg;
 
+	assert(!one || one->ptr[one->size] == '\0');
+	assert(!two || two->ptr[two->size] == '\0');
 	if (!one)
 		return !regexec(regexp, two->ptr, 1, &regmatch, 0);
 	if (!two)
-- 
2.10.0.windows.1.2.g732a511

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/3] Fix a segfault caused by regexec() being called on mmap()ed data
  2016-09-05 15:44 [PATCH 0/3] Fix a segfault caused by regexec() being called on mmap()ed data Johannes Schindelin
                   ` (2 preceding siblings ...)
  2016-09-05 15:45 ` [PATCH 3/3] diff_grep: add assertions verifying that the buffers are NUL-terminated Johannes Schindelin
@ 2016-09-05 19:10 ` Junio C Hamano
  2016-09-06  7:12   ` Jeff King
  2016-09-06 13:21   ` Johannes Schindelin
  2016-09-06  6:58 ` Jeff King
  2016-09-08  7:31 ` [PATCH v2 " Johannes Schindelin
  5 siblings, 2 replies; 66+ messages in thread
From: Junio C Hamano @ 2016-09-05 19:10 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git

Johannes Schindelin <johannes.schindelin@gmx.de> writes:

> This patch series addresses a problem where `git diff` is called using
> `-G` or `-S --pickaxe-regex` on new-born files that are configured
> without user diff drivers, and that hence get mmap()ed into memory.

Good spotting.  This has been with us almost forever; I do not think
the original pickaxe had it, but I am sure it is broken after the
"--pickaxe-regex" enhancement.

I am somehow surprised that this is a problem on Windows, though.
Wouldn't we be at least running CRLF conversions, and causing diff
or grep machinery to work on a NUL-terminated buffer?  The convesion
code would have to look at mmap'ed memory but I do not think it
assumes NUL-termination.  Perhaps people on Windows do not usually
use straight-through and that is why this was discovered after many
years, or something?  In any case, that is a digression.

> Windows (the bug does not trigger a segmentation fault for me on Linux,
> strangely enough, but it is really a problem on Windows).

I think it is an issue on all platforms that lets us use mmap().
When the size of a file is multiple of pagesize, the byte past the
end of the file can very well fall on an unmapped address.

> So at least I have a workaround in place. Ideally, though, we would
> NUL-terminate the buffers only when needed, or somehow call regexec() on
> ptr/size parameters instead of passing a supposedly NUL-terminated
> string to it?

I see two reasonable approaches.

 * We may want to revisit the performance numbers to see if using
   mmap() to read from the working tree files still buys us much.
   If not, we should stop borrowing from the working tree using
   mmap(); instead just slurp in and NUL-terminate it.

 * We could use <ptr,len> variant of regexp engine as you proposed,
   which I think is a preferrable solution.  Do people know of a
   widely accepted implementation that we can throw into compat/ as
   fallback that is compatible with GPLv2?


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/3] Fix a segfault caused by regexec() being called on mmap()ed data
  2016-09-05 15:44 [PATCH 0/3] Fix a segfault caused by regexec() being called on mmap()ed data Johannes Schindelin
                   ` (3 preceding siblings ...)
  2016-09-05 19:10 ` [PATCH 0/3] Fix a segfault caused by regexec() being called on mmap()ed data Junio C Hamano
@ 2016-09-06  6:58 ` Jeff King
  2016-09-06 14:13   ` Johannes Schindelin
  2016-09-08  7:31 ` [PATCH v2 " Johannes Schindelin
  5 siblings, 1 reply; 66+ messages in thread
From: Jeff King @ 2016-09-06  6:58 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, Junio C Hamano

On Mon, Sep 05, 2016 at 05:44:57PM +0200, Johannes Schindelin wrote:

> The problem with that: mmap()ed memory is *not* NUL-terminated, yet the
> pickaxe code calls regexec() on it just the same.
> 
> This problem has been reported by my colleague Chris Sidi.
> 
> Please note that this patch series is a hot fix I applied to Git for
> Windows (the bug does not trigger a segmentation fault for me on Linux,
> strangely enough, but it is really a problem on Windows).

This has come up before, and I think somebody mentioned that on Linux,
you are OK unless the buffer ends right at a page boundary (i.e., the
buffer size is a multiple of the page size). I don't know if that's true
or not.

> So at least I have a workaround in place. Ideally, though, we would
> NUL-terminate the buffers only when needed, or somehow call regexec() on
> ptr/size parameters instead of passing a supposedly NUL-terminated
> string to it?

There's some discussion in:

  http://public-inbox.org/git/20121030121747.GA4231@sigill.intra.peff.net/#r

and the thread below it. The quickest way to fix regexec() would be to
have everybody use the built-in GNU regex in compat/. People seemed
somewhat positive on that direction, but we never followed up.

-Peff

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 2/3] diff_populate_filespec: NUL-terminate buffers
  2016-09-05 15:45 ` [PATCH 2/3] diff_populate_filespec: NUL-terminate buffers Johannes Schindelin
@ 2016-09-06  7:06   ` Jeff King
  2016-09-06 16:02     ` Johannes Schindelin
  0 siblings, 1 reply; 66+ messages in thread
From: Jeff King @ 2016-09-06  7:06 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, Junio C Hamano

On Mon, Sep 05, 2016 at 05:45:06PM +0200, Johannes Schindelin wrote:

> It is true that many code paths populate the mmfile_t structure silently
> appending a NUL, e.g. when running textconv on a temporary file and
> reading the results back into an strbuf.
> 
> The assumption is most definitely wrong, however, when mmap()ing a file.
> 
> Practically, we seemed to be lucky that the bytes after mmap()ed memory
> were 1) accessible and 2) somehow contained NUL bytes *somewhere*.
> 
> In a use case reported by Chris Sidi, it turned out that the mmap()ed
> file had the precise size of a memory page, and on Windows the bytes
> after memory-mapped pages are in general not valid.
> 
> This patch works around that issue, giving us time to discuss the best
> course how to fix this problem more generally.

I don't know if we are in that much of a rush. This bug has been around
for many years (the thread I linked earlier is from 2012). Yes, it's bad
and annoying, but we can probably spend a few days discussing the
solution.

> diff --git a/diff.c b/diff.c
> index 534c12e..32f7f46 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -2826,6 +2826,15 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
>  			s->data = strbuf_detach(&buf, &size);
>  			s->size = size;
>  			s->should_free = 1;
> +		} else {
> +			/* data must be NUL-terminated so e.g. for regexec() */
> +			char *data = xmalloc(s->size + 1);
> +			memcpy(data, s->data, s->size);
> +			data[s->size] = '\0';
> +			munmap(s->data, s->size);
> +			s->should_munmap = 0;
> +			s->data = data;
> +			s->should_free = 1;
>  		}

Without having done a complete audit recently, my gut and my
recollection from previous discussions is that regexec() really is the
culprit here for the diff code[1]. If we are going to do a workaround
like this, I think we could limit it only to cases where know it
matters, like --pickaxe-regex.

Can it be triggered with -G? I thought that operated on the diff content
itself, which would always be in a heap buffer (which should be NUL
terminated, but if it isn't, that would be a separate fix from this).

-Peff

[1] We do make the assumption elsewhere that git objects are
    NUL-terminated, but that is enforced by the object-reading code
    (with the exception of streamed blobs, but those are obviously dealt
    with separately anyway).

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/3] diff_grep: add assertions verifying that the buffers are NUL-terminated
  2016-09-05 15:45 ` [PATCH 3/3] diff_grep: add assertions verifying that the buffers are NUL-terminated Johannes Schindelin
@ 2016-09-06  7:08   ` Jeff King
  2016-09-06 16:04     ` Johannes Schindelin
  0 siblings, 1 reply; 66+ messages in thread
From: Jeff King @ 2016-09-06  7:08 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, Junio C Hamano

On Mon, Sep 05, 2016 at 05:45:09PM +0200, Johannes Schindelin wrote:

> Before calling regexec() on the file contents, we better be certain that
> the strings fulfill the contract of C strings assumed by said function.

If you have a buffer that is exactly "size" bytes and you are worried
about regexec reading off the end, then...

> diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
> index 55067ca..88820b6 100644
> --- a/diffcore-pickaxe.c
> +++ b/diffcore-pickaxe.c
> @@ -49,6 +49,8 @@ static int diff_grep(mmfile_t *one, mmfile_t *two,
>  	xpparam_t xpp;
>  	xdemitconf_t xecfg;
>  
> +	assert(!one || one->ptr[one->size] == '\0');
> +	assert(!two || two->ptr[two->size] == '\0');
>  	if (!one)
>  		return !regexec(regexp, two->ptr, 1, &regmatch, 0);

...don't your asserts also read off the end?

So you might still segfault, though you do catch a case where we have N
bytes of junk before the end of the page (and you have a 255/256 chance
of catching it).

-Peff

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/3] Fix a segfault caused by regexec() being called on mmap()ed data
  2016-09-05 19:10 ` [PATCH 0/3] Fix a segfault caused by regexec() being called on mmap()ed data Junio C Hamano
@ 2016-09-06  7:12   ` Jeff King
  2016-09-06 14:06     ` Johannes Schindelin
  2016-09-06 13:21   ` Johannes Schindelin
  1 sibling, 1 reply; 66+ messages in thread
From: Jeff King @ 2016-09-06  7:12 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Johannes Schindelin, git

On Mon, Sep 05, 2016 at 12:10:11PM -0700, Junio C Hamano wrote:

>  * We could use <ptr,len> variant of regexp engine as you proposed,
>    which I think is a preferrable solution.  Do people know of a
>    widely accepted implementation that we can throw into compat/ as
>    fallback that is compatible with GPLv2?

Maybe the one already in compat/regex? ;P

I think re_search() the correct replacement function but it's been a
while since I've looked into it.

-Peff

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/3] Fix a segfault caused by regexec() being called on mmap()ed data
  2016-09-05 19:10 ` [PATCH 0/3] Fix a segfault caused by regexec() being called on mmap()ed data Junio C Hamano
  2016-09-06  7:12   ` Jeff King
@ 2016-09-06 13:21   ` Johannes Schindelin
  1 sibling, 0 replies; 66+ messages in thread
From: Johannes Schindelin @ 2016-09-06 13:21 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Hi Junio,

On Mon, 5 Sep 2016, Junio C Hamano wrote:

> Johannes Schindelin <johannes.schindelin@gmx.de> writes:
> 
> > This patch series addresses a problem where `git diff` is called using
> > `-G` or `-S --pickaxe-regex` on new-born files that are configured
> > without user diff drivers, and that hence get mmap()ed into memory.
> 
> Good spotting.  This has been with us almost forever; I do not think
> the original pickaxe had it, but I am sure it is broken after the
> "--pickaxe-regex" enhancement.

Agreed, regexec() is the call where it segfaults.

> I am somehow surprised that this is a problem on Windows, though.
> Wouldn't we be at least running CRLF conversions, and causing diff
> or grep machinery to work on a NUL-terminated buffer?

It is true that the CR/LF conversion hides this problem. In fact, in the
case reported to me, it turned out that the segfault happened only
recently, when the repository was switched from LF line endings to CR/LF
line endings.

That switch is unfortunately required: it saves *tons* of time because the
regular CR/LF conversion just takes too much time. It was worse before the
repository defined the .gitattributes: the auto-detection contributed to
the time spent by Git.

So yes: the CR/LF conversion hid the bug, but no, we cannot re-introduce
the CR/LF conversion into said repository.

> The convesion code would have to look at mmap'ed memory but I do not
> think it assumes NUL-termination.  Perhaps people on Windows do not
> usually use straight-through and that is why this was discovered after
> many years, or something?  In any case, that is a digression.

Indeed, it is.

> > Windows (the bug does not trigger a segmentation fault for me on
> > Linux, strangely enough, but it is really a problem on Windows).
> 
> I think it is an issue on all platforms that lets us use mmap().
> When the size of a file is multiple of pagesize, the byte past the
> end of the file can very well fall on an unmapped address.

Correct.

> > So at least I have a workaround in place. Ideally, though, we would
> > NUL-terminate the buffers only when needed, or somehow call regexec()
> > on ptr/size parameters instead of passing a supposedly NUL-terminated
> > string to it?
> 
> I see two reasonable approaches.
> 
>  * We may want to revisit the performance numbers to see if using
>    mmap() to read from the working tree files still buys us much.
>    If not, we should stop borrowing from the working tree using
>    mmap(); instead just slurp in and NUL-terminate it.

I would like to warn against putting too much stock into such a test,
unless it is performed on Linux, MacOSX, Windows and various BSDs. That
would make it hard, of course, to come up with a definitive result, but we
simply should not make the mistake of over-optimizing for one platform.
We used to, of course, and look how much performance it costs e.g. on
Windows.

>  * We could use <ptr,len> variant of regexp engine as you proposed,
>    which I think is a preferrable solution.  Do people know of a
>    widely accepted implementation that we can throw into compat/ as
>    fallback that is compatible with GPLv2?

As suggested by Peff, there is a compat/regex/, and I will spout my
thoughts in a reply to his mail.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/3] Fix a segfault caused by regexec() being called on mmap()ed data
  2016-09-06  7:12   ` Jeff King
@ 2016-09-06 14:06     ` Johannes Schindelin
  2016-09-06 18:29       ` Jeff King
  0 siblings, 1 reply; 66+ messages in thread
From: Johannes Schindelin @ 2016-09-06 14:06 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git

Hi Peff & Junio,

On Tue, 6 Sep 2016, Jeff King wrote:

> On Mon, Sep 05, 2016 at 12:10:11PM -0700, Junio C Hamano wrote:
> 
> >  * We could use <ptr,len> variant of regexp engine as you proposed,
> >    which I think is a preferrable solution.  Do people know of a
> >    widely accepted implementation that we can throw into compat/ as
> >    fallback that is compatible with GPLv2?
> 
> Maybe the one already in compat/regex? ;P

Indeed. That happens to be the implementation used by Git for Windows,
anyway.

> I think re_search() the correct replacement function but it's been a
> while since I've looked into it.

The segfault I investigated happened in a call to strlen(). I see many
calls to strlen() in compat/regex/... The one that triggers the segfault
is in regexec(), compat/regex/regexec.c:241.

As to re_search(): I have not been able to reason about its callees in a
reasonable amount of time. I agree that they *should* not run over the
buffer, but I cannot easily verify it.

The bigger problem is that re_search() is defined in the __USE_GNU section
of regex.h, and I do not think it is appropriate to universally #define
said constant before #include'ing regex.h. So it would appear that major
surgery would be required if we wanted to use regular expressions on
strings that are not NUL-terminated.

So I agree that a better idea may be to simply ensure NUL-terminated
buffers when we require them, although that still might be tricky. More on
that in a reply to your comment to that end.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/3] Fix a segfault caused by regexec() being called on mmap()ed data
  2016-09-06  6:58 ` Jeff King
@ 2016-09-06 14:13   ` Johannes Schindelin
  0 siblings, 0 replies; 66+ messages in thread
From: Johannes Schindelin @ 2016-09-06 14:13 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Junio C Hamano

Hi Peff,

On Tue, 6 Sep 2016, Jeff King wrote:

> On Mon, Sep 05, 2016 at 05:44:57PM +0200, Johannes Schindelin wrote:
> 
> > The problem with that: mmap()ed memory is *not* NUL-terminated, yet
> > the pickaxe code calls regexec() on it just the same.
> > 
> > This problem has been reported by my colleague Chris Sidi.
> > 
> > Please note that this patch series is a hot fix I applied to Git for
> > Windows (the bug does not trigger a segmentation fault for me on
> > Linux, strangely enough, but it is really a problem on Windows).
> 
> This has come up before, and I think somebody mentioned that on Linux,
> you are OK unless the buffer ends right at a page boundary (i.e., the
> buffer size is a multiple of the page size). I don't know if that's true
> or not.

In my tests on Linux, even when the buffer ended right at the page
boundary, the memory after that was still legal to access, and typically
had a NUL *somewhere*.

That's happenstance, of course, and could very well result in false
positives (however unlikely that is).

> > So at least I have a workaround in place. Ideally, though, we would
> > NUL-terminate the buffers only when needed, or somehow call regexec() on
> > ptr/size parameters instead of passing a supposedly NUL-terminated
> > string to it?
> 
> There's some discussion in:
> 
>   http://public-inbox.org/git/20121030121747.GA4231@sigill.intra.peff.net/#r
> 
> and the thread below it. The quickest way to fix regexec() would be to
> have everybody use the built-in GNU regex in compat/. People seemed
> somewhat positive on that direction, but we never followed up.

I had a brief look, and it is not pretty. You would have to pay me good
money to dive in and try to implement a regexecn() based on what we have
in compat/regex/. And then people would still complain, I guess, for not
using the native regex support, where available.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 2/3] diff_populate_filespec: NUL-terminate buffers
  2016-09-06  7:06   ` Jeff King
@ 2016-09-06 16:02     ` Johannes Schindelin
  2016-09-06 18:41       ` Jeff King
  0 siblings, 1 reply; 66+ messages in thread
From: Johannes Schindelin @ 2016-09-06 16:02 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Junio C Hamano

Hi Peff,

On Tue, 6 Sep 2016, Jeff King wrote:

> On Mon, Sep 05, 2016 at 05:45:06PM +0200, Johannes Schindelin wrote:
> 
> > It is true that many code paths populate the mmfile_t structure
> > silently appending a NUL, e.g. when running textconv on a temporary
> > file and reading the results back into an strbuf.
> > 
> > The assumption is most definitely wrong, however, when mmap()ing a
> > file.
> > 
> > Practically, we seemed to be lucky that the bytes after mmap()ed
> > memory were 1) accessible and 2) somehow contained NUL bytes
> > *somewhere*.
> > 
> > In a use case reported by Chris Sidi, it turned out that the mmap()ed
> > file had the precise size of a memory page, and on Windows the bytes
> > after memory-mapped pages are in general not valid.
> > 
> > This patch works around that issue, giving us time to discuss the best
> > course how to fix this problem more generally.
> 
> I don't know if we are in that much of a rush.

I am ;-)

> This bug has been around for many years (the thread I linked earlier is
> from 2012). Yes, it's bad and annoying, but we can probably spend a few
> days discussing the solution.

Sure we can. But I got to have a solution due to a recent switch from
storing LF to storing CR/LF in the repository (that resulted in a
noticable performance improvement): combined with -G being an integral
part of the workflow in the project that reported the issue, it is
essential that this bug gets fixed. Before I go mostly offline.

> > diff --git a/diff.c b/diff.c
> > index 534c12e..32f7f46 100644
> > --- a/diff.c
> > +++ b/diff.c
> > @@ -2826,6 +2826,15 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
> >  			s->data = strbuf_detach(&buf, &size);
> >  			s->size = size;
> >  			s->should_free = 1;
> > +		} else {
> > +			/* data must be NUL-terminated so e.g. for regexec() */
> > +			char *data = xmalloc(s->size + 1);
> > +			memcpy(data, s->data, s->size);
> > +			data[s->size] = '\0';
> > +			munmap(s->data, s->size);
> > +			s->should_munmap = 0;
> > +			s->data = data;
> > +			s->should_free = 1;
> >  		}
> 
> Without having done a complete audit recently, my gut and my
> recollection from previous discussions is that regexec() really is the
> culprit here for the diff code[1]. If we are going to do a workaround
> like this, I think we could limit it only to cases where know it
> matters, like --pickaxe-regex.

Sure.

We could introduce a new NEEDS_NUL flag.

It will still be quite tricky, because we have to touch a function that is
rather at the bottom of the food chain: diff_populate_filespec() is called
from fill_textconv(), which in turn is called from pickaxe_match(), and
only pickaxe_match() knows whether we want to call regexec() or not (it
depends on its regexp parameter).

Adding a flag to diff_populate_filespec() sounds really reasonable until
you see how many call sites fill_textconv() has.

See below for a better idea.

> Can it be triggered with -G?

It can, and it is, as demonstrated by the test I introduced in 1/3.

> I thought that operated on the diff content itself, which would always
> be in a heap buffer (which should be NUL terminated, but if it isn't,
> that would be a separate fix from this).

That is true.

Except when preimage or postimage does not exist. In which case we call

	regexec(regexp, two->ptr, 1, &regmatch, 0);

or the same with one->ptr. Note the notable absence of two->size.

> [1] We do make the assumption elsewhere that git objects are
>     NUL-terminated, but that is enforced by the object-reading code
>     (with the exception of streamed blobs, but those are obviously dealt
>     with separately anyway).

I know. I am the reason you introduced that, because I added code to
fsck.c that assumes that tag/commit messages are NUL-terminated.

So now for the better idea.

While I was researching the code for this reply, I hit upon one thing that
I never knew existed, introduced in f96e567 (grep: use REG_STARTEND for
all matching if available, 2010-05-22). Apparently, NetBSD introduced an
extension to regexec() where you can specify buffer boundaries using
REG_STARTEND. Which is pretty much what we need.

So I have this as my current proof-of-concept (which passes the test
suite, but is white-space corrupted, because I really have no time to get
non-white-space-corrupted text into this here mailer):

-- snipsnap --
diff --git a/diff.c b/diff.c
index 534c12e..2c5a360 100644
--- a/diff.c
+++ b/diff.c
@@ -951,7 +951,13 @@ static int find_word_boundaries(mmfile_t *buffer,
regex_t *word_regex,
 {
 	if (word_regex && *begin < buffer->size) {
 		regmatch_t match[1];
-		if (!regexec(word_regex, buffer->ptr + *begin, 1, match,
		0)) {
+		int f = 0;
+#ifdef REG_STARTEND
+		match[0].rm_so = 0;
+		match[0].rm_eo = *end - *begin;
+		f = REG_STARTEND;
+#endif
+		if (!regexec(word_regex, buffer->ptr + *begin, 1, match,
f)) {
 			char *p = memchr(buffer->ptr + *begin +
match[0].rm_so,
 					'\n', match[0].rm_eo -
match[0].rm_so);
 			*end = p ? p - buffer->ptr : match[0].rm_eo +
*begin;
@@ -994,7 +1000,7 @@ static void diff_words_fill(struct diff_words_buffer
*buffer, mmfile_t *out,
 	buffer->orig[0].begin = buffer->orig[0].end = buffer->text.ptr;
 	buffer->orig_nr = 1;
 
-	for (i = 0; i < buffer->text.size; i++) {
+	for (i = 0, j = buffer->text.size; i < buffer->text.size; i++) {
 		if (find_word_boundaries(&buffer->text, word_regex, &i,
&j))
 			return;
 
diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
index 55067ca..2cd09e2 100644
--- a/diffcore-pickaxe.c
+++ b/diffcore-pickaxe.c
@@ -23,7 +23,9 @@ static void diffgrep_consume(void *priv, char *line,
unsigned long len)
 {
 	struct diffgrep_cb *data = priv;
 	regmatch_t regmatch;
+#ifndef REG_STARTEND
 	int hold;
+#endif
 
 	if (line[0] != '+' && line[0] != '-')
 		return;
@@ -33,11 +35,18 @@ static void diffgrep_consume(void *priv, char *line,
unsigned long len)
 		 * caller early.
 		 */
 		return;
+#ifdef REG_STARTEND
+	regmatch.rm_so = 0;
+	regmatch.rm_eo = len;
+	data->hit = !regexec(data->regexp, line + 1, 1,
+			     &regmatch, REG_STARTEND);
+#else
 	/* Yuck -- line ought to be "const char *"! */
 	hold = line[len];
 	line[len] = '\0';
-	data->hit = !regexec(data->regexp, line + 1, 1, &regmatch, 0);
+	data->hit = !regexec(data->regexp, line + 1, 1, &regmatch, f);
 	line[len] = hold;
+#endif
 }
 
 static int diff_grep(mmfile_t *one, mmfile_t *two,
@@ -49,10 +58,24 @@ static int diff_grep(mmfile_t *one, mmfile_t *two,
 	xpparam_t xpp;
 	xdemitconf_t xecfg;
 
-	if (!one)
-		return !regexec(regexp, two->ptr, 1, &regmatch, 0);
-	if (!two)
-		return !regexec(regexp, one->ptr, 1, &regmatch, 0);
+	if (!one) {
+		int flags = 0;
+#ifdef REG_STARTEND
+		regmatch.rm_so = 0;
+		regmatch.rm_eo = two->size;
+		flags = REG_STARTEND;
+#endif
+		return !regexec(regexp, two->ptr, 1, &regmatch, flags);
+	}
+	if (!two) {
+		int flags = 0;
+#ifdef REG_STARTEND
+		regmatch.rm_so = 0;
+		regmatch.rm_eo = one->size;
+		flags = REG_STARTEND;
+#endif
+		return !regexec(regexp, one->ptr, 1, &regmatch, flags);
+	}
 
 	/*
 	 * We have both sides; need to run textual diff and see if
@@ -83,7 +106,13 @@ static unsigned int contains(mmfile_t *mf, regex_t
*regexp, kwset_t kws)
 		regmatch_t regmatch;
 		int flags = 0;
 
+#ifndef REG_STARTEND
 		assert(data[sz] == '\0');
+#else
+		regmatch.rm_so = 0;
+		regmatch.rm_eo = sz;
+		flags |= REG_STARTEND;
+#endif
 		while (*data && !regexec(regexp, data, 1, &regmatch,
flags)) {
 			flags |= REG_NOTBOL;
 			data += regmatch.rm_eo;
diff --git a/xdiff-interface.c b/xdiff-interface.c
index f34ea76..c179d43 100644
--- a/xdiff-interface.c
+++ b/xdiff-interface.c
@@ -218,7 +218,7 @@ static long ff_regexp(const char *line, long len,
 	struct ff_regs *regs = priv;
 	regmatch_t pmatch[2];
 	int i;
-	int result = -1;
+	int result = -1, flags = 0;
 
 	/* Exclude terminating newline (and cr) from matching */
 	if (len > 0 && line[len-1] == '\n') {
@@ -228,11 +228,20 @@ static long ff_regexp(const char *line, long len,
 			len--;
 	}
 
+#ifndef REG_STARTEND
 	line_buffer = xstrndup(line, len); /* make NUL terminated */
+#else
+	line_buffer = (char *)line;
+	flags = REG_STARTEND;
+#endif
 
 	for (i = 0; i < regs->nr; i++) {
 		struct ff_reg *reg = regs->array + i;
-		if (!regexec(&reg->re, line_buffer, 2, pmatch, 0)) {
+#ifdef REG_STARTEND
+		pmatch->rm_so = 0;
+		pmatch->rm_eo = len;
+#endif
+		if (!regexec(&reg->re, line_buffer, 2, pmatch, flags)) {
 			if (reg->negate)
 				goto fail;
 			break;
@@ -249,7 +258,9 @@ static long ff_regexp(const char *line, long len,
 		result--;
 	memcpy(buffer, line, result);
  fail:
+#ifndef REG_STARTEND
 	free(line_buffer);
+#endif
 	return result;
 }

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/3] diff_grep: add assertions verifying that the buffers are NUL-terminated
  2016-09-06  7:08   ` Jeff King
@ 2016-09-06 16:04     ` Johannes Schindelin
  0 siblings, 0 replies; 66+ messages in thread
From: Johannes Schindelin @ 2016-09-06 16:04 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Junio C Hamano

Hi Peff,

On Tue, 6 Sep 2016, Jeff King wrote:

> On Mon, Sep 05, 2016 at 05:45:09PM +0200, Johannes Schindelin wrote:
> 
> > Before calling regexec() on the file contents, we better be certain that
> > the strings fulfill the contract of C strings assumed by said function.
> 
> If you have a buffer that is exactly "size" bytes and you are worried
> about regexec reading off the end, then...
> 
> > diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
> > index 55067ca..88820b6 100644
> > --- a/diffcore-pickaxe.c
> > +++ b/diffcore-pickaxe.c
> > @@ -49,6 +49,8 @@ static int diff_grep(mmfile_t *one, mmfile_t *two,
> >  	xpparam_t xpp;
> >  	xdemitconf_t xecfg;
> >  
> > +	assert(!one || one->ptr[one->size] == '\0');
> > +	assert(!two || two->ptr[two->size] == '\0');
> >  	if (!one)
> >  		return !regexec(regexp, two->ptr, 1, &regmatch, 0);
> 
> ...don't your asserts also read off the end?

Yes, they would read off the end, *unless* a NUL was somehow appended to
the buffers.

> So you might still segfault, though you do catch a case where we have N
> bytes of junk before the end of the page (and you have a 255/256 chance
> of catching it).

Right. The assertion may fail, or a segfault happen. In both cases,
assumptions are violated and we need to fix the code.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/3] Fix a segfault caused by regexec() being called on mmap()ed data
  2016-09-06 14:06     ` Johannes Schindelin
@ 2016-09-06 18:29       ` Jeff King
  2016-09-08  7:29         ` Johannes Schindelin
  0 siblings, 1 reply; 66+ messages in thread
From: Jeff King @ 2016-09-06 18:29 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Junio C Hamano, git

On Tue, Sep 06, 2016 at 04:06:32PM +0200, Johannes Schindelin wrote:

> > I think re_search() the correct replacement function but it's been a
> > while since I've looked into it.
> 
> The segfault I investigated happened in a call to strlen(). I see many
> calls to strlen() in compat/regex/... The one that triggers the segfault
> is in regexec(), compat/regex/regexec.c:241.

Yes, that is the important one, I think. The others are for patterns,
error msgs, etc. Of course strlen() is not the only function that cares
about NUL delimiters (and there might even be a "while (*p)" somewhere
in the code).

I always assumed the _point_ of re_search taking a ptr/len pair was
exactly to handle this case. The documentation[1] says:

   `string` is the string you want to match; it can contain newline and
   null characters. `size` is the length of that string.

Which seems pretty definitive to me (that's for re_match(), but
re_search() is defined in the docs in terms of re_match()).

[1] http://www.delorie.com/gnu/docs/regex/regex_47.html

> As to re_search(): I have not been able to reason about its callees in a
> reasonable amount of time. I agree that they *should* not run over the
> buffer, but I cannot easily verify it.

Between the documentation above, and the fact that your new test passes
when we switch to it (see below), I feel pretty good about it.

> The bigger problem is that re_search() is defined in the __USE_GNU section
> of regex.h, and I do not think it is appropriate to universally #define
> said constant before #include'ing regex.h. So it would appear that major
> surgery would be required if we wanted to use regular expressions on
> strings that are not NUL-terminated.

We can contain this to the existing compat/regexec/regexec.c, and just
provide a wrapper that is similar to regexec but takes a ptr/len pair.

Like:

diff --git a/compat/regex/regex.h b/compat/regex/regex.h
index 61c9683..b2dd0b7 100644
--- a/compat/regex/regex.h
+++ b/compat/regex/regex.h
@@ -569,6 +569,11 @@ extern int regexec (const regex_t *__restrict __preg,
 		    regmatch_t __pmatch[__restrict_arr],
 		    int __eflags);
 
+extern int regexecn (const regex_t *__restrict __preg,
+		     const char *__restrict __cstring, size_t __length,
+		     size_t __nmatch, regmatch_t __pmatch[__restrict_arr],
+		     int __eflags);
+
 extern size_t regerror (int __errcode, const regex_t *__restrict __preg,
 			char *__restrict __errbuf, size_t __errbuf_size);
 
diff --git a/compat/regex/regexec.c b/compat/regex/regexec.c
index eb5e1d4..8afe26b 100644
--- a/compat/regex/regexec.c
+++ b/compat/regex/regexec.c
@@ -217,15 +217,16 @@ static reg_errcode_t extend_buffers (re_match_context_t *mctx)
    We return 0 if we find a match and REG_NOMATCH if not.  */
 
 int
-regexec (
+regexecn (
 	const regex_t *__restrict preg,
 	const char *__restrict string,
+	size_t length,
 	size_t nmatch,
 	regmatch_t pmatch[],
 	int eflags)
 {
   reg_errcode_t err;
-  int start, length;
+  int start;
 
   if (eflags & ~(REG_NOTBOL | REG_NOTEOL | REG_STARTEND))
     return REG_BADPAT;
@@ -238,7 +239,7 @@ regexec (
   else
     {
       start = 0;
-      length = strlen (string);
+      /* length already passed in */
     }
 
   __libc_lock_lock (dfa->lock);
@@ -252,6 +253,17 @@ regexec (
   return err != REG_NOERROR;
 }
 
+int
+regexec (
+	const regex_t *__restrict preg,
+	const char *__restrict string,
+	size_t nmatch,
+	regmatch_t pmatch[],
+	int eflags)
+{
+  return regexecn(preg, string, strlen(string), nmatch, pmatch, eflags);
+}
+
 #ifdef _LIBC
 # include <shlib-compat.h>
 versioned_symbol (libc, __regexec, regexec, GLIBC_2_3_4);
diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
index 55067ca..fdd08dd 100644
--- a/diffcore-pickaxe.c
+++ b/diffcore-pickaxe.c
@@ -50,9 +50,9 @@ static int diff_grep(mmfile_t *one, mmfile_t *two,
 	xdemitconf_t xecfg;
 
 	if (!one)
-		return !regexec(regexp, two->ptr, 1, &regmatch, 0);
+		return !regexecn(regexp, two->ptr, two->size, 1, &regmatch, 0);
 	if (!two)
-		return !regexec(regexp, one->ptr, 1, &regmatch, 0);
+		return !regexecn(regexp, one->ptr, one->size, 1, &regmatch, 0);
 
 	/*
 	 * We have both sides; need to run textual diff and see if

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [PATCH 2/3] diff_populate_filespec: NUL-terminate buffers
  2016-09-06 16:02     ` Johannes Schindelin
@ 2016-09-06 18:41       ` Jeff King
  2016-09-07 18:31         ` Junio C Hamano
  2016-09-08  7:49         ` Johannes Schindelin
  0 siblings, 2 replies; 66+ messages in thread
From: Jeff King @ 2016-09-06 18:41 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, Junio C Hamano

On Tue, Sep 06, 2016 at 06:02:59PM +0200, Johannes Schindelin wrote:

> It will still be quite tricky, because we have to touch a function that is
> rather at the bottom of the food chain: diff_populate_filespec() is called
> from fill_textconv(), which in turn is called from pickaxe_match(), and
> only pickaxe_match() knows whether we want to call regexec() or not (it
> depends on its regexp parameter).
> 
> Adding a flag to diff_populate_filespec() sounds really reasonable until
> you see how many call sites fill_textconv() has.

I was thinking of something quite gross, like a global "switch to using
slower-but-safer NUL termination" flag (but I agree with Junio's point
elsewhere that we do not even know if it is "slower").

> > I thought that operated on the diff content itself, which would always
> > be in a heap buffer (which should be NUL terminated, but if it isn't,
> > that would be a separate fix from this).
> 
> That is true.
> 
> Except when preimage or postimage does not exist. In which case we call
> 
> 	regexec(regexp, two->ptr, 1, &regmatch, 0);
> 
> or the same with one->ptr. Note the notable absence of two->size.

Thanks, I forgot about that case.

> > [1] We do make the assumption elsewhere that git objects are
> >     NUL-terminated, but that is enforced by the object-reading code
> >     (with the exception of streamed blobs, but those are obviously dealt
> >     with separately anyway).
> 
> I know. I am the reason you introduced that, because I added code to
> fsck.c that assumes that tag/commit messages are NUL-terminated.

Sort of. I think it has been part of the design since e871b64
(unpack_sha1_file: zero-pad the unpacked object., 2005-05-25), though I
do recall that we missed a code path that did its allocation differently
(in index-pack, IIRC).

Anyway, that is neither here nor there for the diff code, which as you
noticed may operate on things besides git objects.

> So now for the better idea.
> 
> While I was researching the code for this reply, I hit upon one thing that
> I never knew existed, introduced in f96e567 (grep: use REG_STARTEND for
> all matching if available, 2010-05-22). Apparently, NetBSD introduced an
> extension to regexec() where you can specify buffer boundaries using
> REG_STARTEND. Which is pretty much what we need.

Yes, and compat/regex support this, too. My question is whether it is
portable. I see:

> diff --git a/diff.c b/diff.c
> index 534c12e..2c5a360 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -951,7 +951,13 @@ static int find_word_boundaries(mmfile_t *buffer,
> regex_t *word_regex,
>  {
>  	if (word_regex && *begin < buffer->size) {
>  		regmatch_t match[1];
> -		if (!regexec(word_regex, buffer->ptr + *begin, 1, match,
> 		0)) {
> +		int f = 0;
> +#ifdef REG_STARTEND
> +		match[0].rm_so = 0;
> +		match[0].rm_eo = *end - *begin;
> +		f = REG_STARTEND;
> +#endif
> +		if (!regexec(word_regex, buffer->ptr + *begin, 1, match,
> f)) {

What happens to those poor souls on systems without REG_STARTEND? Do
they get to keep segfaulting?

I think the solution is to push them into setting NO_REGEX. So looking
at this versus a "regexecn", it seems:

  - this lets people keep using their native regexec if it supports
    STARTEND

  - this is a bit more clunky to use at the callsites (though we could
    _create_ a portable regexecn wrapper that uses this technique on top
    of the native regex library)

But I much prefer this approach to copying the data just to add a NUL.

-Peff

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 1/3] Demonstrate a problem: our pickaxe code assumes NUL-terminated buffers
  2016-09-05 15:45 ` [PATCH 1/3] Demonstrate a problem: our pickaxe code assumes NUL-terminated buffers Johannes Schindelin
@ 2016-09-06 18:43   ` Jeff King
  2016-09-08  7:53     ` Johannes Schindelin
  0 siblings, 1 reply; 66+ messages in thread
From: Jeff King @ 2016-09-06 18:43 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, Junio C Hamano

On Mon, Sep 05, 2016 at 05:45:02PM +0200, Johannes Schindelin wrote:

> Typically, on Linux the test passes. On Windows, it fails virtually
> every time due to an access violation (that's a segmentation fault for
> you Unix-y people out there). And Windows would be correct: the
> regexec() call wants to operate on a regular, NUL-terminated string,
> there is no NUL in the mmap()ed memory range, and it is undefined
> whether the next byte is even legal to access.
> 
> When run with --valgrind it demonstrates quite clearly the breakage, of
> course.
> 
> So we simply mark it with `test_expect_success` for now.

I'd prefer if this were marked as expect_failure. It fails reliably for
me on Linux, even without --valgrind. But even if that were not so,
there is no reason to hurt bisectability of somebody running with
"--valgrind" (not when it costs so little to mark it correctly).

-Peff

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 2/3] diff_populate_filespec: NUL-terminate buffers
  2016-09-06 18:41       ` Jeff King
@ 2016-09-07 18:31         ` Junio C Hamano
  2016-09-08  7:52           ` Johannes Schindelin
  2016-09-08  7:49         ` Johannes Schindelin
  1 sibling, 1 reply; 66+ messages in thread
From: Junio C Hamano @ 2016-09-07 18:31 UTC (permalink / raw)
  To: Jeff King; +Cc: Johannes Schindelin, git

Jeff King <peff@peff.net> writes:

> What happens to those poor souls on systems without REG_STARTEND? Do
> they get to keep segfaulting?
>
> I think the solution is to push them into setting NO_REGEX. So looking
> at this versus a "regexecn", it seems:
>
>   - this lets people keep using their native regexec if it supports
>     STARTEND
>
>   - this is a bit more clunky to use at the callsites (though we could
>     _create_ a portable regexecn wrapper that uses this technique on top
>     of the native regex library)
>
> But I much prefer this approach to copying the data just to add a NUL.

I first thought "push them to NO_REGEX" to mean "they live with
crippled Git that does not do regexp" and went "Huh?", but it merely
means "let's avoid platform regex library and use on from the
compat/ hierarchy", which would solve the STARTEND portability issue
for everybody.

Which is very good.

The idea to create a thin regexecn() wrapper also sounds like a good
idea, too.  The changes to the callsites in the demonstration patch
does look a bit clunky to me, too.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/3] Fix a segfault caused by regexec() being called on mmap()ed data
  2016-09-06 18:29       ` Jeff King
@ 2016-09-08  7:29         ` Johannes Schindelin
  2016-09-08  8:00           ` Jeff King
  0 siblings, 1 reply; 66+ messages in thread
From: Johannes Schindelin @ 2016-09-08  7:29 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git

Hi Peff,

sorry for the late answer, I was really busy trying to come up with a new
and improved version of the patch series, and while hunting a bug I
introduced got bogged down with other tasks.

The good news is that I made up my mind about releasing a Git for Windows
v2.10.0(2): originally, I had planned to do that today, to have time for
any hot fixes until Sunday, if necessary, before going semi-dark.

FWIW I am now trying to track my plans for v2.10.0(2) (or v2.10.1, if
upstream Git v2.10.1 is released before) on GitHub:

	https://github.com/git-for-windows/git/milestone/3

On Tue, 6 Sep 2016, Jeff King wrote:

> On Tue, Sep 06, 2016 at 04:06:32PM +0200, Johannes Schindelin wrote:
> 
> > > I think re_search() the correct replacement function but it's been a
> > > while since I've looked into it.
> > 
> > The segfault I investigated happened in a call to strlen(). I see many
> > calls to strlen() in compat/regex/... The one that triggers the segfault
> > is in regexec(), compat/regex/regexec.c:241.
> 
> Yes, that is the important one, I think. The others are for patterns,
> error msgs, etc. Of course strlen() is not the only function that cares
> about NUL delimiters (and there might even be a "while (*p)" somewhere
> in the code).
> 
> I always assumed the _point_ of re_search taking a ptr/len pair was
> exactly to handle this case. The documentation[1] says:
> 
>    `string` is the string you want to match; it can contain newline and
>    null characters. `size` is the length of that string.
> 
> Which seems pretty definitive to me (that's for re_match(), but
> re_search() is defined in the docs in terms of re_match()).

Right. The problem is: I *really* want to avoid using GNU-isms.

> > The bigger problem is that re_search() is defined in the __USE_GNU section
> > of regex.h, and I do not think it is appropriate to universally #define
> > said constant before #include'ing regex.h. So it would appear that major
> > surgery would be required if we wanted to use regular expressions on
> > strings that are not NUL-terminated.
> 
> We can contain this to the existing compat/regexec/regexec.c, and just
> provide a wrapper that is similar to regexec but takes a ptr/len pair.

But we can do even better than that: we can provide a wrapper that uses
REG_STARTEND where available (which is really the majority of platforms we
care about: Linux, MacOSX, Windows, and even the *BSDs). Where it is not
available, we simply malloc(), memcpy() and append a NUL.

Which is what my v2 does (will send it out in a moment).

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH v2 0/3] Fix a segfault caused by regexec() being called on mmap()ed data
  2016-09-05 15:44 [PATCH 0/3] Fix a segfault caused by regexec() being called on mmap()ed data Johannes Schindelin
                   ` (4 preceding siblings ...)
  2016-09-06  6:58 ` Jeff King
@ 2016-09-08  7:31 ` Johannes Schindelin
  2016-09-08  7:31   ` [PATCH v2 2/3] Introduce a function to run regexec() on non-NUL-terminated buffers Johannes Schindelin
                     ` (4 more replies)
  5 siblings, 5 replies; 66+ messages in thread
From: Johannes Schindelin @ 2016-09-08  7:31 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King

This patch series addresses a problem where `git diff` is called using
`-G` or `-S --pickaxe-regex` on new-born files that are configured
without user diff drivers, and that hence get mmap()ed into memory.

The problem with that: mmap()ed memory is *not* NUL-terminated, yet the
pickaxe code calls regexec() on it just the same.

This problem has been reported by my colleague Chris Sidi.

We solve this by introducing a helper, regexec_buf(), that takes a
pointer and a length instead of a NUL-terminated string.

This helper then uses REG_STARTEND where available, and falls back to
allocating and constructing a NUL-terminated string. Given the
wide-spread support for REG_STARTEND (Linux has it, MacOSX has it, Git
for Windows has it because it uses compat/regex/ that has it), I think
this is a fair trade-off.

Changes since v1:

- completely revamped the approach: now we no longer force-append a NUL
  whenever mmap()ing buffers for use by the diff machinery, but we fix
  things at the regexec() level.


Johannes Schindelin (3):
  Demonstrate a problem: our pickaxe code assumes NUL-terminated buffers
  Introduce a function to run regexec() on non-NUL-terminated buffers
  Use the newly-introduced regexec_buf() function

 diff.c                  |  3 ++-
 diffcore-pickaxe.c      | 18 ++++++++----------
 git-compat-util.h       | 21 +++++++++++++++++++++
 t/t4061-diff-pickaxe.sh | 22 ++++++++++++++++++++++
 xdiff-interface.c       | 13 ++++---------
 5 files changed, 57 insertions(+), 20 deletions(-)
 create mode 100755 t/t4061-diff-pickaxe.sh

Published-As: https://github.com/dscho/git/releases/tag/mmap-regexec-v2
Fetch-It-Via: git fetch https://github.com/dscho/git mmap-regexec-v2

Interdiff vs v1:

 diff --git a/diff.c b/diff.c
 index 32f7f46..526775a 100644
 --- a/diff.c
 +++ b/diff.c
 @@ -951,7 +951,8 @@ static int find_word_boundaries(mmfile_t *buffer, regex_t *word_regex,
  {
  	if (word_regex && *begin < buffer->size) {
  		regmatch_t match[1];
 -		if (!regexec(word_regex, buffer->ptr + *begin, 1, match, 0)) {
 +		if (!regexec_buf(word_regex, buffer->ptr + *begin,
 +				 buffer->size - *begin, 1, match, 0)) {
  			char *p = memchr(buffer->ptr + *begin + match[0].rm_so,
  					'\n', match[0].rm_eo - match[0].rm_so);
  			*end = p ? p - buffer->ptr : match[0].rm_eo + *begin;
 @@ -2826,15 +2827,6 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
  			s->data = strbuf_detach(&buf, &size);
  			s->size = size;
  			s->should_free = 1;
 -		} else {
 -			/* data must be NUL-terminated so e.g. for regexec() */
 -			char *data = xmalloc(s->size + 1);
 -			memcpy(data, s->data, s->size);
 -			data[s->size] = '\0';
 -			munmap(s->data, s->size);
 -			s->should_munmap = 0;
 -			s->data = data;
 -			s->should_free = 1;
  		}
  	}
  	else {
 diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
 index 88820b6..9795ca1 100644
 --- a/diffcore-pickaxe.c
 +++ b/diffcore-pickaxe.c
 @@ -23,7 +23,6 @@ static void diffgrep_consume(void *priv, char *line, unsigned long len)
  {
  	struct diffgrep_cb *data = priv;
  	regmatch_t regmatch;
 -	int hold;
  
  	if (line[0] != '+' && line[0] != '-')
  		return;
 @@ -33,11 +32,8 @@ static void diffgrep_consume(void *priv, char *line, unsigned long len)
  		 * caller early.
  		 */
  		return;
 -	/* Yuck -- line ought to be "const char *"! */
 -	hold = line[len];
 -	line[len] = '\0';
 -	data->hit = !regexec(data->regexp, line + 1, 1, &regmatch, 0);
 -	line[len] = hold;
 +	data->hit = !regexec_buf(data->regexp, line + 1, len - 1, 1,
 +				 &regmatch, 0);
  }
  
  static int diff_grep(mmfile_t *one, mmfile_t *two,
 @@ -49,12 +45,12 @@ static int diff_grep(mmfile_t *one, mmfile_t *two,
  	xpparam_t xpp;
  	xdemitconf_t xecfg;
  
 -	assert(!one || one->ptr[one->size] == '\0');
 -	assert(!two || two->ptr[two->size] == '\0');
  	if (!one)
 -		return !regexec(regexp, two->ptr, 1, &regmatch, 0);
 +		return !regexec_buf(regexp, two->ptr, two->size,
 +				    1, &regmatch, 0);
  	if (!two)
 -		return !regexec(regexp, one->ptr, 1, &regmatch, 0);
 +		return !regexec_buf(regexp, one->ptr, one->size,
 +				    1, &regmatch, 0);
  
  	/*
  	 * We have both sides; need to run textual diff and see if
 @@ -85,8 +81,8 @@ static unsigned int contains(mmfile_t *mf, regex_t *regexp, kwset_t kws)
  		regmatch_t regmatch;
  		int flags = 0;
  
 -		assert(data[sz] == '\0');
 -		while (*data && !regexec(regexp, data, 1, &regmatch, flags)) {
 +		while (*data &&
 +		       !regexec_buf(regexp, data, sz, 1, &regmatch, flags)) {
  			flags |= REG_NOTBOL;
  			data += regmatch.rm_eo;
  			if (*data && regmatch.rm_so == regmatch.rm_eo)
 diff --git a/git-compat-util.h b/git-compat-util.h
 index db89ba7..19128b3 100644
 --- a/git-compat-util.h
 +++ b/git-compat-util.h
 @@ -965,6 +965,27 @@ void git_qsort(void *base, size_t nmemb, size_t size,
  #define qsort git_qsort
  #endif
  
 +static inline int regexec_buf(const regex_t *preg, const char *buf, size_t size,
 +			      size_t nmatch, regmatch_t pmatch[], int eflags)
 +{
 +#ifdef REG_STARTEND
 +	assert(nmatch > 0 && pmatch);
 +	pmatch[0].rm_so = 0;
 +	pmatch[0].rm_eo = size;
 +	return regexec(preg, buf, nmatch, pmatch, eflags | REG_STARTEND);
 +#else
 +	char *buf2 = xmalloc(size + 1);
 +	int ret;
 +
 +	memcpy(buf2, buf, size);
 +	buf2[size] = '\0';
 +	ret = regexec(preg, buf2, nmatch, pmatch, eflags);
 +	free(buf2);
 +
 +	return ret;
 +#endif
 +}
 +
  #ifndef DIR_HAS_BSD_GROUP_SEMANTICS
  # define FORCE_DIR_SET_GID S_ISGID
  #else
 diff --git a/t/t4059-diff-pickaxe.sh b/t/t4061-diff-pickaxe.sh
 similarity index 91%
 rename from t/t4059-diff-pickaxe.sh
 rename to t/t4061-diff-pickaxe.sh
 index f0bf50b..5929f2e 100755
 --- a/t/t4059-diff-pickaxe.sh
 +++ b/t/t4061-diff-pickaxe.sh
 @@ -14,7 +14,7 @@ test_expect_success setup '
  	test_tick &&
  	git commit -m "A 4k file"
  '
 -test_expect_success '-G matches' '
 +test_expect_failure '-G matches' '
  	git diff --name-only -G "^0{4096}$" HEAD^ >out &&
  	test 4096-zeroes.txt = "$(cat out)"
  '
 diff --git a/xdiff-interface.c b/xdiff-interface.c
 index f34ea76..50702a2 100644
 --- a/xdiff-interface.c
 +++ b/xdiff-interface.c
 @@ -214,11 +214,10 @@ struct ff_regs {
  static long ff_regexp(const char *line, long len,
  		char *buffer, long buffer_size, void *priv)
  {
 -	char *line_buffer;
  	struct ff_regs *regs = priv;
  	regmatch_t pmatch[2];
  	int i;
 -	int result = -1;
 +	int result;
  
  	/* Exclude terminating newline (and cr) from matching */
  	if (len > 0 && line[len-1] == '\n') {
 @@ -228,18 +227,16 @@ static long ff_regexp(const char *line, long len,
  			len--;
  	}
  
 -	line_buffer = xstrndup(line, len); /* make NUL terminated */
 -
  	for (i = 0; i < regs->nr; i++) {
  		struct ff_reg *reg = regs->array + i;
 -		if (!regexec(&reg->re, line_buffer, 2, pmatch, 0)) {
 +		if (!regexec_buf(&reg->re, line, len, 2, pmatch, 0)) {
  			if (reg->negate)
 -				goto fail;
 +				return -1;
  			break;
  		}
  	}
  	if (regs->nr <= i)
 -		goto fail;
 +		return -1;
  	i = pmatch[1].rm_so >= 0 ? 1 : 0;
  	line += pmatch[i].rm_so;
  	result = pmatch[i].rm_eo - pmatch[i].rm_so;
 @@ -248,8 +245,6 @@ static long ff_regexp(const char *line, long len,
  	while (result > 0 && (isspace(line[result - 1])))
  		result--;
  	memcpy(buffer, line, result);
 - fail:
 -	free(line_buffer);
  	return result;
  }
  

-- 
2.10.0.windows.1.10.g803177d

base-commit: 6ebdac1bab966b720d776aa43ca188fe378b1f4b

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH v2 2/3] Introduce a function to run regexec() on non-NUL-terminated buffers
  2016-09-08  7:31 ` [PATCH v2 " Johannes Schindelin
@ 2016-09-08  7:31   ` Johannes Schindelin
  2016-09-08  8:04     ` Jeff King
  2016-09-08  7:31   ` [PATCH v2 1/3] Demonstrate a problem: our pickaxe code assumes NUL-terminated buffers Johannes Schindelin
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 66+ messages in thread
From: Johannes Schindelin @ 2016-09-08  7:31 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King

We just introduced a test that demonstrates that our sloppy use of
regexec() on a mmap()ed area can result in incorrect results or even
hard crashes.

So what we need to fix this is a function that calls regexec() on a
length-delimited, rather than a NUL-terminated, string.

Happily, there is an extension to regexec() introduced by the NetBSD
project and present in all major regex implementation including
Linux', MacOSX' and the one Git includes in compat/regex/: by using
the (non-POSIX) REG_STARTEND flag, it is possible to tell the
regexec() function that it should only look at the offsets between
pmatch[0].rm_so and pmatch[0].rm_eo.

That is exactly what we need.

Since support for REG_STARTEND is so widespread by now, let's just
introduce a helper function that uses it, and fall back to allocating
and constructing a NUL-terminated when REG_STARTEND is not available.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 git-compat-util.h | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/git-compat-util.h b/git-compat-util.h
index db89ba7..19128b3 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -965,6 +965,27 @@ void git_qsort(void *base, size_t nmemb, size_t size,
 #define qsort git_qsort
 #endif
 
+static inline int regexec_buf(const regex_t *preg, const char *buf, size_t size,
+			      size_t nmatch, regmatch_t pmatch[], int eflags)
+{
+#ifdef REG_STARTEND
+	assert(nmatch > 0 && pmatch);
+	pmatch[0].rm_so = 0;
+	pmatch[0].rm_eo = size;
+	return regexec(preg, buf, nmatch, pmatch, eflags | REG_STARTEND);
+#else
+	char *buf2 = xmalloc(size + 1);
+	int ret;
+
+	memcpy(buf2, buf, size);
+	buf2[size] = '\0';
+	ret = regexec(preg, buf2, nmatch, pmatch, eflags);
+	free(buf2);
+
+	return ret;
+#endif
+}
+
 #ifndef DIR_HAS_BSD_GROUP_SEMANTICS
 # define FORCE_DIR_SET_GID S_ISGID
 #else
-- 
2.10.0.windows.1.10.g803177d



^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH v2 1/3] Demonstrate a problem: our pickaxe code assumes NUL-terminated buffers
  2016-09-08  7:31 ` [PATCH v2 " Johannes Schindelin
  2016-09-08  7:31   ` [PATCH v2 2/3] Introduce a function to run regexec() on non-NUL-terminated buffers Johannes Schindelin
@ 2016-09-08  7:31   ` Johannes Schindelin
  2016-09-08  7:31   ` [PATCH v2 3/3] Use the newly-introduced regexec_buf() function Johannes Schindelin
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 66+ messages in thread
From: Johannes Schindelin @ 2016-09-08  7:31 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King

When our pickaxe code feeds file contents to regexec(), it implicitly
assumes that the file contents are read into implicitly NUL-terminated
buffers (i.e. that we overallocate by 1, appending a single '\0').

This is not so.

In particular when the file contents are simply mmap()ed, we can be
virtually certain that the buffer is preceding uninitialized bytes, or
invalid pages.

Note that the test we add here is known to be flakey: we simply cannot
know whether the byte following the mmap()ed ones is a NUL or not.

Typically, on Linux the test passes. On Windows, it fails virtually
every time due to an access violation (that's a segmentation fault for
you Unix-y people out there). And Windows would be correct: the
regexec() call wants to operate on a regular, NUL-terminated string,
there is no NUL in the mmap()ed memory range, and it is undefined
whether the next byte is even legal to access.

When run with --valgrind it demonstrates quite clearly the breakage, of
course.

Being marked with `test_expect_failure`, this test will sometimes be
declare "TODO fixed", even if it only passes by mistake.

This test case represents a Minimal, Complete and Verifiable Example of
a breakage reported by Chris Sidi.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 t/t4061-diff-pickaxe.sh | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)
 create mode 100755 t/t4061-diff-pickaxe.sh

diff --git a/t/t4061-diff-pickaxe.sh b/t/t4061-diff-pickaxe.sh
new file mode 100755
index 0000000..5929f2e
--- /dev/null
+++ b/t/t4061-diff-pickaxe.sh
@@ -0,0 +1,22 @@
+#!/bin/sh
+#
+# Copyright (c) 2016 Johannes Schindelin
+#
+
+test_description='Pickaxe options'
+
+. ./test-lib.sh
+
+test_expect_success setup '
+	test_commit initial &&
+	printf "%04096d" 0 >4096-zeroes.txt &&
+	git add 4096-zeroes.txt &&
+	test_tick &&
+	git commit -m "A 4k file"
+'
+test_expect_failure '-G matches' '
+	git diff --name-only -G "^0{4096}$" HEAD^ >out &&
+	test 4096-zeroes.txt = "$(cat out)"
+'
+
+test_done
-- 
2.10.0.windows.1.10.g803177d



^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH v2 3/3] Use the newly-introduced regexec_buf() function
  2016-09-08  7:31 ` [PATCH v2 " Johannes Schindelin
  2016-09-08  7:31   ` [PATCH v2 2/3] Introduce a function to run regexec() on non-NUL-terminated buffers Johannes Schindelin
  2016-09-08  7:31   ` [PATCH v2 1/3] Demonstrate a problem: our pickaxe code assumes NUL-terminated buffers Johannes Schindelin
@ 2016-09-08  7:31   ` Johannes Schindelin
  2016-09-08  7:54     ` Johannes Schindelin
  2016-09-08  7:33   ` [PATCH v2 0/3] Fix a segfault caused by regexec() being called on mmap()ed data Johannes Schindelin
  2016-09-08  7:57   ` [PATCH v3 " Johannes Schindelin
  4 siblings, 1 reply; 66+ messages in thread
From: Johannes Schindelin @ 2016-09-08  7:31 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King

The new regexec_buf() function operates on buffers with an explicitly
specified length, rather than NUL-terminated strings.

We need to use this function whenever the buffer we want to pass to
regexec() may have been mmap()ed (and is hence not NUL-terminated).

Note: the original motivation for this patch was to fix a bug where
`git diff -G <regex>` would crash. This patch converts more callers,
though, some of which explicitly allocated and constructed
NUL-terminated strings (or worse: modified read-only buffers to insert
NULs).

Some of the buffers actually may be NUL-terminated. As regexec_buf()
uses REG_STARTEND where available, but has to fall back to allocating
and constructing NUL-terminated strings where REG_STARTEND is not
available, this makes the code less efficient in the latter case.

However, given the widespread support for REG_STARTEND, combined with
the improved ease of code maintenance, we strike the balance in favor
of REG_STARTEND.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 diff.c             |  3 ++-
 diffcore-pickaxe.c | 18 ++++++++----------
 xdiff-interface.c  | 13 ++++---------
 3 files changed, 14 insertions(+), 20 deletions(-)

diff --git a/diff.c b/diff.c
index 534c12e..526775a 100644
--- a/diff.c
+++ b/diff.c
@@ -951,7 +951,8 @@ static int find_word_boundaries(mmfile_t *buffer, regex_t *word_regex,
 {
 	if (word_regex && *begin < buffer->size) {
 		regmatch_t match[1];
-		if (!regexec(word_regex, buffer->ptr + *begin, 1, match, 0)) {
+		if (!regexec_buf(word_regex, buffer->ptr + *begin,
+				 buffer->size - *begin, 1, match, 0)) {
 			char *p = memchr(buffer->ptr + *begin + match[0].rm_so,
 					'\n', match[0].rm_eo - match[0].rm_so);
 			*end = p ? p - buffer->ptr : match[0].rm_eo + *begin;
diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
index 55067ca..9795ca1 100644
--- a/diffcore-pickaxe.c
+++ b/diffcore-pickaxe.c
@@ -23,7 +23,6 @@ static void diffgrep_consume(void *priv, char *line, unsigned long len)
 {
 	struct diffgrep_cb *data = priv;
 	regmatch_t regmatch;
-	int hold;
 
 	if (line[0] != '+' && line[0] != '-')
 		return;
@@ -33,11 +32,8 @@ static void diffgrep_consume(void *priv, char *line, unsigned long len)
 		 * caller early.
 		 */
 		return;
-	/* Yuck -- line ought to be "const char *"! */
-	hold = line[len];
-	line[len] = '\0';
-	data->hit = !regexec(data->regexp, line + 1, 1, &regmatch, 0);
-	line[len] = hold;
+	data->hit = !regexec_buf(data->regexp, line + 1, len - 1, 1,
+				 &regmatch, 0);
 }
 
 static int diff_grep(mmfile_t *one, mmfile_t *two,
@@ -50,9 +46,11 @@ static int diff_grep(mmfile_t *one, mmfile_t *two,
 	xdemitconf_t xecfg;
 
 	if (!one)
-		return !regexec(regexp, two->ptr, 1, &regmatch, 0);
+		return !regexec_buf(regexp, two->ptr, two->size,
+				    1, &regmatch, 0);
 	if (!two)
-		return !regexec(regexp, one->ptr, 1, &regmatch, 0);
+		return !regexec_buf(regexp, one->ptr, one->size,
+				    1, &regmatch, 0);
 
 	/*
 	 * We have both sides; need to run textual diff and see if
@@ -83,8 +81,8 @@ static unsigned int contains(mmfile_t *mf, regex_t *regexp, kwset_t kws)
 		regmatch_t regmatch;
 		int flags = 0;
 
-		assert(data[sz] == '\0');
-		while (*data && !regexec(regexp, data, 1, &regmatch, flags)) {
+		while (*data &&
+		       !regexec_buf(regexp, data, sz, 1, &regmatch, flags)) {
 			flags |= REG_NOTBOL;
 			data += regmatch.rm_eo;
 			if (*data && regmatch.rm_so == regmatch.rm_eo)
diff --git a/xdiff-interface.c b/xdiff-interface.c
index f34ea76..50702a2 100644
--- a/xdiff-interface.c
+++ b/xdiff-interface.c
@@ -214,11 +214,10 @@ struct ff_regs {
 static long ff_regexp(const char *line, long len,
 		char *buffer, long buffer_size, void *priv)
 {
-	char *line_buffer;
 	struct ff_regs *regs = priv;
 	regmatch_t pmatch[2];
 	int i;
-	int result = -1;
+	int result;
 
 	/* Exclude terminating newline (and cr) from matching */
 	if (len > 0 && line[len-1] == '\n') {
@@ -228,18 +227,16 @@ static long ff_regexp(const char *line, long len,
 			len--;
 	}
 
-	line_buffer = xstrndup(line, len); /* make NUL terminated */
-
 	for (i = 0; i < regs->nr; i++) {
 		struct ff_reg *reg = regs->array + i;
-		if (!regexec(&reg->re, line_buffer, 2, pmatch, 0)) {
+		if (!regexec_buf(&reg->re, line, len, 2, pmatch, 0)) {
 			if (reg->negate)
-				goto fail;
+				return -1;
 			break;
 		}
 	}
 	if (regs->nr <= i)
-		goto fail;
+		return -1;
 	i = pmatch[1].rm_so >= 0 ? 1 : 0;
 	line += pmatch[i].rm_so;
 	result = pmatch[i].rm_eo - pmatch[i].rm_so;
@@ -248,8 +245,6 @@ static long ff_regexp(const char *line, long len,
 	while (result > 0 && (isspace(line[result - 1])))
 		result--;
 	memcpy(buffer, line, result);
- fail:
-	free(line_buffer);
 	return result;
 }
 
-- 
2.10.0.windows.1.10.g803177d

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [PATCH v2 0/3] Fix a segfault caused by regexec() being called on mmap()ed data
  2016-09-08  7:31 ` [PATCH v2 " Johannes Schindelin
                     ` (2 preceding siblings ...)
  2016-09-08  7:31   ` [PATCH v2 3/3] Use the newly-introduced regexec_buf() function Johannes Schindelin
@ 2016-09-08  7:33   ` Johannes Schindelin
  2016-09-08  8:13     ` Jeff King
  2016-09-08  7:57   ` [PATCH v3 " Johannes Schindelin
  4 siblings, 1 reply; 66+ messages in thread
From: Johannes Schindelin @ 2016-09-08  7:33 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King

Hi,

On Thu, 8 Sep 2016, Johannes Schindelin wrote:

> We solve this by introducing a helper, regexec_buf(), that takes a
> pointer and a length instead of a NUL-terminated string.

BTW I should have clarified why I decided on another name than regexecn()
(I had considered this even before reading Peff's proposed patch): the <n>
in string functions suggest a limiting of NUL-terminated strings. In other
words, if n = 100 and the provided pointer points to a NUL-terminated
string of length 3, the *n function will treat it as a string of length 3.

That is not what regexec_buf() does: it ignores the NUL. Hence the
different name.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 2/3] diff_populate_filespec: NUL-terminate buffers
  2016-09-06 18:41       ` Jeff King
  2016-09-07 18:31         ` Junio C Hamano
@ 2016-09-08  7:49         ` Johannes Schindelin
  2016-09-08  8:22           ` Jeff King
  1 sibling, 1 reply; 66+ messages in thread
From: Johannes Schindelin @ 2016-09-08  7:49 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Junio C Hamano

Hi Peff,

On Tue, 6 Sep 2016, Jeff King wrote:

> On Tue, Sep 06, 2016 at 06:02:59PM +0200, Johannes Schindelin wrote:
> 
> > It will still be quite tricky, because we have to touch a function that is
> > rather at the bottom of the food chain: diff_populate_filespec() is called
> > from fill_textconv(), which in turn is called from pickaxe_match(), and
> > only pickaxe_match() knows whether we want to call regexec() or not (it
> > depends on its regexp parameter).
> > 
> > Adding a flag to diff_populate_filespec() sounds really reasonable until
> > you see how many call sites fill_textconv() has.
> 
> I was thinking of something quite gross, like a global "switch to using
> slower-but-safer NUL termination" flag (but I agree with Junio's point
> elsewhere that we do not even know if it is "slower").

Urgh.

;-)

> > So now for the better idea.
> > 
> > While I was researching the code for this reply, I hit upon one thing
> > that I never knew existed, introduced in f96e567 (grep: use
> > REG_STARTEND for all matching if available, 2010-05-22). Apparently,
> > NetBSD introduced an extension to regexec() where you can specify
> > buffer boundaries using REG_STARTEND. Which is pretty much what we
> > need.
> 
> Yes, and compat/regex support this, too. My question is whether it is
> portable.

That is only one question.

Another, important question is: is it efficient?

I have no idea whether there exists any hardware-accelerated regex library
out there, maybe even using CUDA (I know that there is some code out there
using SSE to perform LF -> CR/LF conversion, unfortunately it is
intentionally incompatible with GPLv2).

We cannot simply switch everybody and her dog to compat/regex/ just
because we want to avoid a segfault.

> > diff --git a/diff.c b/diff.c
> > index 534c12e..2c5a360 100644
> > --- a/diff.c
> > +++ b/diff.c
> > @@ -951,7 +951,13 @@ static int find_word_boundaries(mmfile_t *buffer,
> > regex_t *word_regex,
> >  {
> >  	if (word_regex && *begin < buffer->size) {
> >  		regmatch_t match[1];
> > -		if (!regexec(word_regex, buffer->ptr + *begin, 1, match,
> > 		0)) {
> > +		int f = 0;
> > +#ifdef REG_STARTEND
> > +		match[0].rm_so = 0;
> > +		match[0].rm_eo = *end - *begin;
> > +		f = REG_STARTEND;
> > +#endif
> > +		if (!regexec(word_regex, buffer->ptr + *begin, 1, match,
> > f)) {

Heh. You introduced the same bug I did. Or maybe you just fetched my
mmap-regexec branch and looked at an intermediate iteration?

The problem with this patch is that *end is uninitialized. I actually
initialized it in my patch, but it was still incorrect. I settled on using
buffer->size - *begin in the end.

> What happens to those poor souls on systems without REG_STARTEND? Do
> they get to keep segfaulting?

Of course not. Those poor souls on systems without REG_STARTEND pay a
little price for that: malloc(); memcpy(); *end = '\0'; ... free();

I think it is worth it: maintenance of the code is much easier that way
than forcing everybody and her dog and her dog's hamster to compat/regex/.

> But I much prefer this approach to copying the data just to add a NUL.

I think it is not worth the burden. The only regex implementation in
semi-widespread use that do not support REG_STARTEND seems to be musl.

I'd rather not spend *so much* effort just to support an obscure platform.
Not when the users of that obscure platform could spend that effort
themselves. And probably won't, because we only copy data to add a NUL on
those platforms when regexec() is called on an mmfile_t.

Better to keep it simple,
Dscho

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 2/3] diff_populate_filespec: NUL-terminate buffers
  2016-09-07 18:31         ` Junio C Hamano
@ 2016-09-08  7:52           ` Johannes Schindelin
  0 siblings, 0 replies; 66+ messages in thread
From: Johannes Schindelin @ 2016-09-08  7:52 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git

Hi Junio,

On Wed, 7 Sep 2016, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > What happens to those poor souls on systems without REG_STARTEND? Do
> > they get to keep segfaulting?
> >
> > I think the solution is to push them into setting NO_REGEX. So looking
> > at this versus a "regexecn", it seems:
> >
> >   - this lets people keep using their native regexec if it supports
> >     STARTEND
> >
> >   - this is a bit more clunky to use at the callsites (though we could
> >     _create_ a portable regexecn wrapper that uses this technique on top
> >     of the native regex library)
> >
> > But I much prefer this approach to copying the data just to add a NUL.
> 
> I first thought "push them to NO_REGEX" to mean "they live with
> crippled Git that does not do regexp" and went "Huh?", but it merely
> means "let's avoid platform regex library and use on from the
> compat/ hierarchy", which would solve the STARTEND portability issue
> for everybody.
> 
> Which is very good.
> 
> The idea to create a thin regexecn() wrapper also sounds like a good
> idea, too.  The changes to the callsites in the demonstration patch
> does look a bit clunky to me, too.

The demonstration patch was only meant as a mere demonstration where this
leads us. I DRY'd it up quite a bit (which was my plan all along, but it
was faster to make the changes in place, to avoid a full-sale
recompilation due to a central header change; you might not care because
you use Linux with its native POSIX, while I have to use MSYS2, making
even my builds slower).

And I really do not think that it would be a good idea to use
compat/regex/ for everybody, even if they already have a working regex.h
on their system.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 1/3] Demonstrate a problem: our pickaxe code assumes NUL-terminated buffers
  2016-09-06 18:43   ` Jeff King
@ 2016-09-08  7:53     ` Johannes Schindelin
  0 siblings, 0 replies; 66+ messages in thread
From: Johannes Schindelin @ 2016-09-08  7:53 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Junio C Hamano

Hi Peff,

On Tue, 6 Sep 2016, Jeff King wrote:

> On Mon, Sep 05, 2016 at 05:45:02PM +0200, Johannes Schindelin wrote:
> 
> > Typically, on Linux the test passes. On Windows, it fails virtually
> > every time due to an access violation (that's a segmentation fault for
> > you Unix-y people out there). And Windows would be correct: the
> > regexec() call wants to operate on a regular, NUL-terminated string,
> > there is no NUL in the mmap()ed memory range, and it is undefined
> > whether the next byte is even legal to access.
> > 
> > When run with --valgrind it demonstrates quite clearly the breakage, of
> > course.
> > 
> > So we simply mark it with `test_expect_success` for now.
> 
> I'd prefer if this were marked as expect_failure. It fails reliably for
> me on Linux, even without --valgrind. But even if that were not so,
> there is no reason to hurt bisectability of somebody running with
> "--valgrind" (not when it costs so little to mark it correctly).

Forgot to say in the cover letter: I did change this from
test_expect_success to test_expect_failure.

But of course, now I remember that I failed to change it back in 3/3. Bah.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v2 3/3] Use the newly-introduced regexec_buf() function
  2016-09-08  7:31   ` [PATCH v2 3/3] Use the newly-introduced regexec_buf() function Johannes Schindelin
@ 2016-09-08  7:54     ` Johannes Schindelin
  2016-09-08  8:10       ` Jeff King
  0 siblings, 1 reply; 66+ messages in thread
From: Johannes Schindelin @ 2016-09-08  7:54 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King

Hi,

On Thu, 8 Sep 2016, Johannes Schindelin wrote:

> The new regexec_buf() function operates on buffers with an explicitly
> specified length, rather than NUL-terminated strings.
> 
> We need to use this function whenever the buffer we want to pass to
> regexec() may have been mmap()ed (and is hence not NUL-terminated).
> 
> Note: the original motivation for this patch was to fix a bug where
> `git diff -G <regex>` would crash. This patch converts more callers,
> though, some of which explicitly allocated and constructed
> NUL-terminated strings (or worse: modified read-only buffers to insert
> NULs).
> 
> Some of the buffers actually may be NUL-terminated. As regexec_buf()
> uses REG_STARTEND where available, but has to fall back to allocating
> and constructing NUL-terminated strings where REG_STARTEND is not
> available, this makes the code less efficient in the latter case.
> 
> However, given the widespread support for REG_STARTEND, combined with
> the improved ease of code maintenance, we strike the balance in favor
> of REG_STARTEND.
> 
> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> ---
>  diff.c             |  3 ++-
>  diffcore-pickaxe.c | 18 ++++++++----------
>  xdiff-interface.c  | 13 ++++---------
>  3 files changed, 14 insertions(+), 20 deletions(-)

I just realized that this should switch the test_expect_failure from 1/3
to a test_expect_success.

Will send out v3 in a moment.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH v3 0/3] Fix a segfault caused by regexec() being called on mmap()ed data
  2016-09-08  7:31 ` [PATCH v2 " Johannes Schindelin
                     ` (3 preceding siblings ...)
  2016-09-08  7:33   ` [PATCH v2 0/3] Fix a segfault caused by regexec() being called on mmap()ed data Johannes Schindelin
@ 2016-09-08  7:57   ` Johannes Schindelin
  2016-09-08  7:57     ` [PATCH v3 1/3] Demonstrate a problem: our pickaxe code assumes NUL-terminated buffers Johannes Schindelin
                       ` (3 more replies)
  4 siblings, 4 replies; 66+ messages in thread
From: Johannes Schindelin @ 2016-09-08  7:57 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King

This patch series addresses a problem where `git diff` is called using
`-G` or `-S --pickaxe-regex` on new-born files that are configured
without user diff drivers, and that hence get mmap()ed into memory.

The problem with that: mmap()ed memory is *not* NUL-terminated, yet the
pickaxe code calls regexec() on it just the same.

This problem has been reported by my colleague Chris Sidi.

We solve this by introducing a helper, regexec_buf(), that takes a
pointer and a length instead of a NUL-terminated string.

This helper then uses REG_STARTEND where available, and falls back to
allocating and constructing a NUL-terminated string. Given the
wide-spread support for REG_STARTEND (Linux has it, MacOSX has it, Git
for Windows has it because it uses compat/regex/ that has it), I think
this is a fair trade-off.

Changes since v2:

- changed 3/3 to switch the test_expect_failure from 1/3 to a
  test_expect_success


Johannes Schindelin (3):
  Demonstrate a problem: our pickaxe code assumes NUL-terminated buffers
  Introduce a function to run regexec() on non-NUL-terminated buffers
  Use the newly-introduced regexec_buf() function

 diff.c                  |  3 ++-
 diffcore-pickaxe.c      | 18 ++++++++----------
 git-compat-util.h       | 21 +++++++++++++++++++++
 t/t4061-diff-pickaxe.sh | 22 ++++++++++++++++++++++
 xdiff-interface.c       | 13 ++++---------
 5 files changed, 57 insertions(+), 20 deletions(-)
 create mode 100755 t/t4061-diff-pickaxe.sh

Published-As: https://github.com/dscho/git/releases/tag/mmap-regexec-v3
Fetch-It-Via: git fetch https://github.com/dscho/git mmap-regexec-v3

Interdiff vs v2:

 diff --git a/t/t4061-diff-pickaxe.sh b/t/t4061-diff-pickaxe.sh
 index 5929f2e..f0bf50b 100755
 --- a/t/t4061-diff-pickaxe.sh
 +++ b/t/t4061-diff-pickaxe.sh
 @@ -14,7 +14,7 @@ test_expect_success setup '
  	test_tick &&
  	git commit -m "A 4k file"
  '
 -test_expect_failure '-G matches' '
 +test_expect_success '-G matches' '
  	git diff --name-only -G "^0{4096}$" HEAD^ >out &&
  	test 4096-zeroes.txt = "$(cat out)"
  '

-- 
2.10.0.windows.1.10.g803177d

base-commit: 6ebdac1bab966b720d776aa43ca188fe378b1f4b

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH v3 1/3] Demonstrate a problem: our pickaxe code assumes NUL-terminated buffers
  2016-09-08  7:57   ` [PATCH v3 " Johannes Schindelin
@ 2016-09-08  7:57     ` Johannes Schindelin
  2016-09-08  7:58     ` [PATCH v3 2/3] Introduce a function to run regexec() on non-NUL-terminated buffers Johannes Schindelin
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 66+ messages in thread
From: Johannes Schindelin @ 2016-09-08  7:57 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King

When our pickaxe code feeds file contents to regexec(), it implicitly
assumes that the file contents are read into implicitly NUL-terminated
buffers (i.e. that we overallocate by 1, appending a single '\0').

This is not so.

In particular when the file contents are simply mmap()ed, we can be
virtually certain that the buffer is preceding uninitialized bytes, or
invalid pages.

Note that the test we add here is known to be flakey: we simply cannot
know whether the byte following the mmap()ed ones is a NUL or not.

Typically, on Linux the test passes. On Windows, it fails virtually
every time due to an access violation (that's a segmentation fault for
you Unix-y people out there). And Windows would be correct: the
regexec() call wants to operate on a regular, NUL-terminated string,
there is no NUL in the mmap()ed memory range, and it is undefined
whether the next byte is even legal to access.

When run with --valgrind it demonstrates quite clearly the breakage, of
course.

Being marked with `test_expect_failure`, this test will sometimes be
declare "TODO fixed", even if it only passes by mistake.

This test case represents a Minimal, Complete and Verifiable Example of
a breakage reported by Chris Sidi.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 t/t4061-diff-pickaxe.sh | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)
 create mode 100755 t/t4061-diff-pickaxe.sh

diff --git a/t/t4061-diff-pickaxe.sh b/t/t4061-diff-pickaxe.sh
new file mode 100755
index 0000000..5929f2e
--- /dev/null
+++ b/t/t4061-diff-pickaxe.sh
@@ -0,0 +1,22 @@
+#!/bin/sh
+#
+# Copyright (c) 2016 Johannes Schindelin
+#
+
+test_description='Pickaxe options'
+
+. ./test-lib.sh
+
+test_expect_success setup '
+	test_commit initial &&
+	printf "%04096d" 0 >4096-zeroes.txt &&
+	git add 4096-zeroes.txt &&
+	test_tick &&
+	git commit -m "A 4k file"
+'
+test_expect_failure '-G matches' '
+	git diff --name-only -G "^0{4096}$" HEAD^ >out &&
+	test 4096-zeroes.txt = "$(cat out)"
+'
+
+test_done
-- 
2.10.0.windows.1.10.g803177d



^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH v3 2/3] Introduce a function to run regexec() on non-NUL-terminated buffers
  2016-09-08  7:57   ` [PATCH v3 " Johannes Schindelin
  2016-09-08  7:57     ` [PATCH v3 1/3] Demonstrate a problem: our pickaxe code assumes NUL-terminated buffers Johannes Schindelin
@ 2016-09-08  7:58     ` Johannes Schindelin
  2016-09-08 17:03       ` Junio C Hamano
  2016-09-08  7:59     ` [PATCH v3 3/3] Use the newly-introduced regexec_buf() function Johannes Schindelin
  2016-09-21 18:23     ` [PATCH v4 0/3] Fix a segfault caused by regexec() being called on mmap()ed data Johannes Schindelin
  3 siblings, 1 reply; 66+ messages in thread
From: Johannes Schindelin @ 2016-09-08  7:58 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King

We just introduced a test that demonstrates that our sloppy use of
regexec() on a mmap()ed area can result in incorrect results or even
hard crashes.

So what we need to fix this is a function that calls regexec() on a
length-delimited, rather than a NUL-terminated, string.

Happily, there is an extension to regexec() introduced by the NetBSD
project and present in all major regex implementation including
Linux', MacOSX' and the one Git includes in compat/regex/: by using
the (non-POSIX) REG_STARTEND flag, it is possible to tell the
regexec() function that it should only look at the offsets between
pmatch[0].rm_so and pmatch[0].rm_eo.

That is exactly what we need.

Since support for REG_STARTEND is so widespread by now, let's just
introduce a helper function that uses it, and fall back to allocating
and constructing a NUL-terminated when REG_STARTEND is not available.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 git-compat-util.h | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/git-compat-util.h b/git-compat-util.h
index db89ba7..19128b3 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -965,6 +965,27 @@ void git_qsort(void *base, size_t nmemb, size_t size,
 #define qsort git_qsort
 #endif
 
+static inline int regexec_buf(const regex_t *preg, const char *buf, size_t size,
+			      size_t nmatch, regmatch_t pmatch[], int eflags)
+{
+#ifdef REG_STARTEND
+	assert(nmatch > 0 && pmatch);
+	pmatch[0].rm_so = 0;
+	pmatch[0].rm_eo = size;
+	return regexec(preg, buf, nmatch, pmatch, eflags | REG_STARTEND);
+#else
+	char *buf2 = xmalloc(size + 1);
+	int ret;
+
+	memcpy(buf2, buf, size);
+	buf2[size] = '\0';
+	ret = regexec(preg, buf2, nmatch, pmatch, eflags);
+	free(buf2);
+
+	return ret;
+#endif
+}
+
 #ifndef DIR_HAS_BSD_GROUP_SEMANTICS
 # define FORCE_DIR_SET_GID S_ISGID
 #else
-- 
2.10.0.windows.1.10.g803177d



^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH v3 3/3] Use the newly-introduced regexec_buf() function
  2016-09-08  7:57   ` [PATCH v3 " Johannes Schindelin
  2016-09-08  7:57     ` [PATCH v3 1/3] Demonstrate a problem: our pickaxe code assumes NUL-terminated buffers Johannes Schindelin
  2016-09-08  7:58     ` [PATCH v3 2/3] Introduce a function to run regexec() on non-NUL-terminated buffers Johannes Schindelin
@ 2016-09-08  7:59     ` Johannes Schindelin
  2016-09-08 17:09       ` Junio C Hamano
  2016-09-21 18:23     ` [PATCH v4 0/3] Fix a segfault caused by regexec() being called on mmap()ed data Johannes Schindelin
  3 siblings, 1 reply; 66+ messages in thread
From: Johannes Schindelin @ 2016-09-08  7:59 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King

The new regexec_buf() function operates on buffers with an explicitly
specified length, rather than NUL-terminated strings.

We need to use this function whenever the buffer we want to pass to
regexec() may have been mmap()ed (and is hence not NUL-terminated).

Note: the original motivation for this patch was to fix a bug where
`git diff -G <regex>` would crash. This patch converts more callers,
though, some of which explicitly allocated and constructed
NUL-terminated strings (or worse: modified read-only buffers to insert
NULs).

Some of the buffers actually may be NUL-terminated. As regexec_buf()
uses REG_STARTEND where available, but has to fall back to allocating
and constructing NUL-terminated strings where REG_STARTEND is not
available, this makes the code less efficient in the latter case.

However, given the widespread support for REG_STARTEND, combined with
the improved ease of code maintenance, we strike the balance in favor
of REG_STARTEND.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 diff.c                  |  3 ++-
 diffcore-pickaxe.c      | 18 ++++++++----------
 t/t4061-diff-pickaxe.sh |  2 +-
 xdiff-interface.c       | 13 ++++---------
 4 files changed, 15 insertions(+), 21 deletions(-)

diff --git a/diff.c b/diff.c
index 534c12e..526775a 100644
--- a/diff.c
+++ b/diff.c
@@ -951,7 +951,8 @@ static int find_word_boundaries(mmfile_t *buffer, regex_t *word_regex,
 {
 	if (word_regex && *begin < buffer->size) {
 		regmatch_t match[1];
-		if (!regexec(word_regex, buffer->ptr + *begin, 1, match, 0)) {
+		if (!regexec_buf(word_regex, buffer->ptr + *begin,
+				 buffer->size - *begin, 1, match, 0)) {
 			char *p = memchr(buffer->ptr + *begin + match[0].rm_so,
 					'\n', match[0].rm_eo - match[0].rm_so);
 			*end = p ? p - buffer->ptr : match[0].rm_eo + *begin;
diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
index 55067ca..9795ca1 100644
--- a/diffcore-pickaxe.c
+++ b/diffcore-pickaxe.c
@@ -23,7 +23,6 @@ static void diffgrep_consume(void *priv, char *line, unsigned long len)
 {
 	struct diffgrep_cb *data = priv;
 	regmatch_t regmatch;
-	int hold;
 
 	if (line[0] != '+' && line[0] != '-')
 		return;
@@ -33,11 +32,8 @@ static void diffgrep_consume(void *priv, char *line, unsigned long len)
 		 * caller early.
 		 */
 		return;
-	/* Yuck -- line ought to be "const char *"! */
-	hold = line[len];
-	line[len] = '\0';
-	data->hit = !regexec(data->regexp, line + 1, 1, &regmatch, 0);
-	line[len] = hold;
+	data->hit = !regexec_buf(data->regexp, line + 1, len - 1, 1,
+				 &regmatch, 0);
 }
 
 static int diff_grep(mmfile_t *one, mmfile_t *two,
@@ -50,9 +46,11 @@ static int diff_grep(mmfile_t *one, mmfile_t *two,
 	xdemitconf_t xecfg;
 
 	if (!one)
-		return !regexec(regexp, two->ptr, 1, &regmatch, 0);
+		return !regexec_buf(regexp, two->ptr, two->size,
+				    1, &regmatch, 0);
 	if (!two)
-		return !regexec(regexp, one->ptr, 1, &regmatch, 0);
+		return !regexec_buf(regexp, one->ptr, one->size,
+				    1, &regmatch, 0);
 
 	/*
 	 * We have both sides; need to run textual diff and see if
@@ -83,8 +81,8 @@ static unsigned int contains(mmfile_t *mf, regex_t *regexp, kwset_t kws)
 		regmatch_t regmatch;
 		int flags = 0;
 
-		assert(data[sz] == '\0');
-		while (*data && !regexec(regexp, data, 1, &regmatch, flags)) {
+		while (*data &&
+		       !regexec_buf(regexp, data, sz, 1, &regmatch, flags)) {
 			flags |= REG_NOTBOL;
 			data += regmatch.rm_eo;
 			if (*data && regmatch.rm_so == regmatch.rm_eo)
diff --git a/t/t4061-diff-pickaxe.sh b/t/t4061-diff-pickaxe.sh
index 5929f2e..f0bf50b 100755
--- a/t/t4061-diff-pickaxe.sh
+++ b/t/t4061-diff-pickaxe.sh
@@ -14,7 +14,7 @@ test_expect_success setup '
 	test_tick &&
 	git commit -m "A 4k file"
 '
-test_expect_failure '-G matches' '
+test_expect_success '-G matches' '
 	git diff --name-only -G "^0{4096}$" HEAD^ >out &&
 	test 4096-zeroes.txt = "$(cat out)"
 '
diff --git a/xdiff-interface.c b/xdiff-interface.c
index f34ea76..50702a2 100644
--- a/xdiff-interface.c
+++ b/xdiff-interface.c
@@ -214,11 +214,10 @@ struct ff_regs {
 static long ff_regexp(const char *line, long len,
 		char *buffer, long buffer_size, void *priv)
 {
-	char *line_buffer;
 	struct ff_regs *regs = priv;
 	regmatch_t pmatch[2];
 	int i;
-	int result = -1;
+	int result;
 
 	/* Exclude terminating newline (and cr) from matching */
 	if (len > 0 && line[len-1] == '\n') {
@@ -228,18 +227,16 @@ static long ff_regexp(const char *line, long len,
 			len--;
 	}
 
-	line_buffer = xstrndup(line, len); /* make NUL terminated */
-
 	for (i = 0; i < regs->nr; i++) {
 		struct ff_reg *reg = regs->array + i;
-		if (!regexec(&reg->re, line_buffer, 2, pmatch, 0)) {
+		if (!regexec_buf(&reg->re, line, len, 2, pmatch, 0)) {
 			if (reg->negate)
-				goto fail;
+				return -1;
 			break;
 		}
 	}
 	if (regs->nr <= i)
-		goto fail;
+		return -1;
 	i = pmatch[1].rm_so >= 0 ? 1 : 0;
 	line += pmatch[i].rm_so;
 	result = pmatch[i].rm_eo - pmatch[i].rm_so;
@@ -248,8 +245,6 @@ static long ff_regexp(const char *line, long len,
 	while (result > 0 && (isspace(line[result - 1])))
 		result--;
 	memcpy(buffer, line, result);
- fail:
-	free(line_buffer);
 	return result;
 }
 
-- 
2.10.0.windows.1.10.g803177d

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/3] Fix a segfault caused by regexec() being called on mmap()ed data
  2016-09-08  7:29         ` Johannes Schindelin
@ 2016-09-08  8:00           ` Jeff King
  2016-09-09 10:09             ` Johannes Schindelin
  0 siblings, 1 reply; 66+ messages in thread
From: Jeff King @ 2016-09-08  8:00 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Junio C Hamano, git

On Thu, Sep 08, 2016 at 09:29:58AM +0200, Johannes Schindelin wrote:

> sorry for the late answer, I was really busy trying to come up with a new
> and improved version of the patch series, and while hunting a bug I
> introduced got bogged down with other tasks.

No problem. I am not in a hurry.

> > I always assumed the _point_ of re_search taking a ptr/len pair was
> > exactly to handle this case. The documentation[1] says:
> > 
> >    `string` is the string you want to match; it can contain newline and
> >    null characters. `size` is the length of that string.
> > 
> > Which seems pretty definitive to me (that's for re_match(), but
> > re_search() is defined in the docs in terms of re_match()).
> 
> Right. The problem is: I *really* want to avoid using GNU-isms.

I don't think GNU-isms are a problem if we wrap them to give a nice
interface, and if we rely on having compat/regex. But if you mean "I do
not want to rely on using compat/regex everywhere", then OK. I can see
arguments both for and against using a consistent regex library, but I
do not care that much either way myself.

> > We can contain this to the existing compat/regexec/regexec.c, and just
> > provide a wrapper that is similar to regexec but takes a ptr/len pair.
> 
> But we can do even better than that: we can provide a wrapper that uses
> REG_STARTEND where available (which is really the majority of platforms we
> care about: Linux, MacOSX, Windows, and even the *BSDs). Where it is not
> available, we simply malloc(), memcpy() and append a NUL.

Doesn't that make things much _worse_ for people on systems without
REG_STARTEND? If we imagine that most regexec calls would operate on a
NUL-terminated buffer, then they are now paying the extra malloc and
copy for each call to regexec_buf(), even if the buffer was already
NUL-terminated (because they have no idea whether it was or not).

I think I'd rather just have:

  #ifndef REG_STARTEND
  #error "Your regex library sucks. Compile with NO_REGEX=NeedsStartEnd"
  #endif

(or you could just use REG_STARTEND and let the compiler complain, but
then the user has to figure out the right knob to twiddle).


One other question about REG_STARTEND is: what does it do with NULs
inside the buffer? Certainly glibc (and our compat/regex) treat it as a
buffer with a particular length and ignore embedded NULs, as we want.
But the NetBSD documentation says only:

     REG_STARTEND   The string is considered to start at string +
		    pmatch[0].rm_so and to have a terminating NUL
		    located at string + pmatch[0].rm_eo (there need not
		    actually be a NUL at that location), 

Besides avoiding a segfault, one of the benefits of regcomp_buf() is
that we will now find pickaxe-regex strings inside mixed binary/text
files. But it's not clear to me that NetBSD's implementation does this.

I guess we can assume it is fine (it is certainly no _worse_ than the
current behavior), and if people's platforms do not handle it, they can
build with NO_REGEX.

-Peff

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v2 2/3] Introduce a function to run regexec() on non-NUL-terminated buffers
  2016-09-08  7:31   ` [PATCH v2 2/3] Introduce a function to run regexec() on non-NUL-terminated buffers Johannes Schindelin
@ 2016-09-08  8:04     ` Jeff King
  2016-09-09  9:45       ` Johannes Schindelin
  0 siblings, 1 reply; 66+ messages in thread
From: Jeff King @ 2016-09-08  8:04 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, Junio C Hamano

On Thu, Sep 08, 2016 at 09:31:11AM +0200, Johannes Schindelin wrote:

> diff --git a/git-compat-util.h b/git-compat-util.h
> index db89ba7..19128b3 100644
> --- a/git-compat-util.h
> +++ b/git-compat-util.h
> @@ -965,6 +965,27 @@ void git_qsort(void *base, size_t nmemb, size_t size,
>  #define qsort git_qsort
>  #endif
>  
> +static inline int regexec_buf(const regex_t *preg, const char *buf, size_t size,
> +			      size_t nmatch, regmatch_t pmatch[], int eflags)
> +{
> +#ifdef REG_STARTEND
> +	assert(nmatch > 0 && pmatch);
> +	pmatch[0].rm_so = 0;
> +	pmatch[0].rm_eo = size;
> +	return regexec(preg, buf, nmatch, pmatch, eflags | REG_STARTEND);
> +#else
> +	char *buf2 = xmalloc(size + 1);
> +	int ret;
> +
> +	memcpy(buf2, buf, size);
> +	buf2[size] = '\0';

I mentioned elsewhere that I'd prefer we just push people into using
compat/regex if they don't have REG_STARTEND. But if we _do_ keep this
fallback, note that the above has a buffer overflow (think what happens
when "size" is the maximum value for a size_t).  You can avoid it by
using xmallocz().

-Peff

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v2 3/3] Use the newly-introduced regexec_buf() function
  2016-09-08  7:54     ` Johannes Schindelin
@ 2016-09-08  8:10       ` Jeff King
  2016-09-08  8:14         ` Jeff King
  0 siblings, 1 reply; 66+ messages in thread
From: Jeff King @ 2016-09-08  8:10 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, Junio C Hamano

On Thu, Sep 08, 2016 at 09:54:51AM +0200, Johannes Schindelin wrote:

> >  diff.c             |  3 ++-
> >  diffcore-pickaxe.c | 18 ++++++++----------
> >  xdiff-interface.c  | 13 ++++---------
> >  3 files changed, 14 insertions(+), 20 deletions(-)
> 
> I just realized that this should switch the test_expect_failure from 1/3
> to a test_expect_success.

Yep. I wonder if we also would want to test that we correctly find
regexes inside binary files.

E.g., given a mixed binary/text file like:

  printf 'binary\0text' >file &&
  git add file &&
  git commit -m file

then "git log -Stext" will find that file, but "--pickaxe-regex" will
not (using stock git). Ditto for "-Gtext".

Your patch should fix that.

-Peff

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v2 0/3] Fix a segfault caused by regexec() being called on mmap()ed data
  2016-09-08  7:33   ` [PATCH v2 0/3] Fix a segfault caused by regexec() being called on mmap()ed data Johannes Schindelin
@ 2016-09-08  8:13     ` Jeff King
  0 siblings, 0 replies; 66+ messages in thread
From: Jeff King @ 2016-09-08  8:13 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, Junio C Hamano

On Thu, Sep 08, 2016 at 09:33:29AM +0200, Johannes Schindelin wrote:

> On Thu, 8 Sep 2016, Johannes Schindelin wrote:
> 
> > We solve this by introducing a helper, regexec_buf(), that takes a
> > pointer and a length instead of a NUL-terminated string.
> 
> BTW I should have clarified why I decided on another name than regexecn()
> (I had considered this even before reading Peff's proposed patch): the <n>
> in string functions suggest a limiting of NUL-terminated strings. In other
> words, if n = 100 and the provided pointer points to a NUL-terminated
> string of length 3, the *n function will treat it as a string of length 3.
> 
> That is not what regexec_buf() does: it ignores the NUL. Hence the
> different name.

I agree that is a better name (this was the exact thing I wondered about
with REG_STARTEND, but certainly what we _want_ is true "_buf"
semantics).

I guess an argument that REG_STARTEND does what we want everywhere is
that the GNU implementation does what we want, and since it has been
around for over a decade presumably _somebody_ would have complained if
it did not match the NetBSD behavior.

-Peff

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v2 3/3] Use the newly-introduced regexec_buf() function
  2016-09-08  8:10       ` Jeff King
@ 2016-09-08  8:14         ` Jeff King
  2016-09-08  8:35           ` Jeff King
  0 siblings, 1 reply; 66+ messages in thread
From: Jeff King @ 2016-09-08  8:14 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, Junio C Hamano

On Thu, Sep 08, 2016 at 04:10:24AM -0400, Jeff King wrote:

> On Thu, Sep 08, 2016 at 09:54:51AM +0200, Johannes Schindelin wrote:
> 
> > >  diff.c             |  3 ++-
> > >  diffcore-pickaxe.c | 18 ++++++++----------
> > >  xdiff-interface.c  | 13 ++++---------
> > >  3 files changed, 14 insertions(+), 20 deletions(-)
> > 
> > I just realized that this should switch the test_expect_failure from 1/3
> > to a test_expect_success.
> 
> Yep. I wonder if we also would want to test that we correctly find
> regexes inside binary files.
> 
> E.g., given a mixed binary/text file like:
> 
>   printf 'binary\0text' >file &&
>   git add file &&
>   git commit -m file
> 
> then "git log -Stext" will find that file, but "--pickaxe-regex" will
> not (using stock git). Ditto for "-Gtext".
> 
> Your patch should fix that.

Of course if I had actually _looked carefully_ at your patch, I would
have seen that your test doesn't just check that we don't segfault, but
actually confirms that we find the entry.

Sorry for the noise.

-Peff

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 2/3] diff_populate_filespec: NUL-terminate buffers
  2016-09-08  7:49         ` Johannes Schindelin
@ 2016-09-08  8:22           ` Jeff King
  2016-09-08 16:57             ` Junio C Hamano
  0 siblings, 1 reply; 66+ messages in thread
From: Jeff King @ 2016-09-08  8:22 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, Junio C Hamano

On Thu, Sep 08, 2016 at 09:49:38AM +0200, Johannes Schindelin wrote:

> > > diff --git a/diff.c b/diff.c
> > > index 534c12e..2c5a360 100644
> > > --- a/diff.c
> > > +++ b/diff.c
> > > @@ -951,7 +951,13 @@ static int find_word_boundaries(mmfile_t *buffer,
> > > regex_t *word_regex,
> > >  {
> > >  	if (word_regex && *begin < buffer->size) {
> > >  		regmatch_t match[1];
> > > -		if (!regexec(word_regex, buffer->ptr + *begin, 1, match,
> > > 		0)) {
> > > +		int f = 0;
> > > +#ifdef REG_STARTEND
> > > +		match[0].rm_so = 0;
> > > +		match[0].rm_eo = *end - *begin;
> > > +		f = REG_STARTEND;
> > > +#endif
> > > +		if (!regexec(word_regex, buffer->ptr + *begin, 1, match,
> > > f)) {
> 
> Heh. You introduced the same bug I did. Or maybe you just fetched my
> mmap-regexec branch and looked at an intermediate iteration?

I do not think I introduced anything.  The quoted text is what you
sent. Which is perhaps why it has your bug. :)

> > But I much prefer this approach to copying the data just to add a NUL.
> 
> I think it is not worth the burden. The only regex implementation in
> semi-widespread use that do not support REG_STARTEND seems to be musl.
> 
> I'd rather not spend *so much* effort just to support an obscure platform.
> Not when the users of that obscure platform could spend that effort
> themselves. And probably won't, because we only copy data to add a NUL on
> those platforms when regexec() is called on an mmfile_t.

I'm confused about what you think I'm proposing. I was saying I _like_
something like regexec_buf() instead of copying the data. Which seems
like the simpler thing to me (and presumably to you). Or do you mean
using compat/regex to build on re_search() consistently? I do not think
that is all that complex; the question is only whether people really
want to use their own regex libraries.

Between the two options for regexec_buf(), I think you have convinced me
that REG_STARTEND is better than just using compat/regex everywhere. I
do think the fallback for platforms like musl should be "use
compat/regex" and not doing an expensive copy (which in most cases is
not even necessary).

-Peff

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v2 3/3] Use the newly-introduced regexec_buf() function
  2016-09-08  8:14         ` Jeff King
@ 2016-09-08  8:35           ` Jeff King
  2016-09-08 19:06             ` Ramsay Jones
  0 siblings, 1 reply; 66+ messages in thread
From: Jeff King @ 2016-09-08  8:35 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, Junio C Hamano

On Thu, Sep 08, 2016 at 04:14:46AM -0400, Jeff King wrote:

> On Thu, Sep 08, 2016 at 04:10:24AM -0400, Jeff King wrote:
> 
> > On Thu, Sep 08, 2016 at 09:54:51AM +0200, Johannes Schindelin wrote:
> > 
> > > >  diff.c             |  3 ++-
> > > >  diffcore-pickaxe.c | 18 ++++++++----------
> > > >  xdiff-interface.c  | 13 ++++---------
> > > >  3 files changed, 14 insertions(+), 20 deletions(-)
> > > 
> > > I just realized that this should switch the test_expect_failure from 1/3
> > > to a test_expect_success.
> > 
> > Yep. I wonder if we also would want to test that we correctly find
> > regexes inside binary files.
> > 
> > E.g., given a mixed binary/text file like:
> > 
> >   printf 'binary\0text' >file &&
> >   git add file &&
> >   git commit -m file
> > 
> > then "git log -Stext" will find that file, but "--pickaxe-regex" will
> > not (using stock git). Ditto for "-Gtext".
> > 
> > Your patch should fix that.
> 
> Of course if I had actually _looked carefully_ at your patch, I would
> have seen that your test doesn't just check that we don't segfault, but
> actually confirms that we find the entry.
> 
> Sorry for the noise.

Actually, I take it back again. Your test case doesn't have an embedded
NUL in it (so we check that git finds it, but aside from the lack of
segfault, stock git would already find it).

Sorry for the double-noise.

-Peff

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 2/3] diff_populate_filespec: NUL-terminate buffers
  2016-09-08  8:22           ` Jeff King
@ 2016-09-08 16:57             ` Junio C Hamano
  2016-09-08 18:22               ` Johannes Schindelin
  2016-09-08 18:48               ` Jeff King
  0 siblings, 2 replies; 66+ messages in thread
From: Junio C Hamano @ 2016-09-08 16:57 UTC (permalink / raw)
  To: Jeff King; +Cc: Johannes Schindelin, git

Jeff King <peff@peff.net> writes:

> Between the two options for regexec_buf(), I think you have convinced me
> that REG_STARTEND is better than just using compat/regex everywhere. I
> do think the fallback for platforms like musl should be "use
> compat/regex" and not doing an expensive copy (which in most cases is
> not even necessary).

I agree with you that it would be the best approach to build
regexec_buf() that unconditionally uses REG_STARTEND and tell people
without REG_STARTEND to use compat/regex instead of their platform
regex library.

The description in Makefile may want to be rephrased to clarify.

-# Define NO_REGEX if you have no or inferior regex support in your C library.
+# Define NO_REGEX if your C library lacks regex support with REG_STARTEND
+# feature.

The word "inferior" is not giving any useful information there.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 2/3] Introduce a function to run regexec() on non-NUL-terminated buffers
  2016-09-08  7:58     ` [PATCH v3 2/3] Introduce a function to run regexec() on non-NUL-terminated buffers Johannes Schindelin
@ 2016-09-08 17:03       ` Junio C Hamano
  0 siblings, 0 replies; 66+ messages in thread
From: Junio C Hamano @ 2016-09-08 17:03 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, Jeff King

Johannes Schindelin <johannes.schindelin@gmx.de> writes:

> We just introduced a test that demonstrates that our sloppy use of
> regexec() on a mmap()ed area can result in incorrect results or even
> hard crashes.
>
> So what we need to fix this is a function that calls regexec() on a
> length-delimited, rather than a NUL-terminated, string.
>
> Happily, there is an extension to regexec() introduced by the NetBSD
> project and present in all major regex implementation including
> Linux', MacOSX' and the one Git includes in compat/regex/: by using
> the (non-POSIX) REG_STARTEND flag, it is possible to tell the
> regexec() function that it should only look at the offsets between
> pmatch[0].rm_so and pmatch[0].rm_eo.
>
> That is exactly what we need.

Yes, that is good.

> Since support for REG_STARTEND is so widespread by now, let's just
> introduce a helper function that uses it, and fall back to allocating
> and constructing a NUL-terminated when REG_STARTEND is not available.

I do not think this fallback is good; we do ship a compat/ fallback
that does support REG_STARTEND and you'd want to use that.  Not
having the copying fallback means you do not even have to worry
about the size+1 overflow and fix it with xmallocz() ;-)

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 3/3] Use the newly-introduced regexec_buf() function
  2016-09-08  7:59     ` [PATCH v3 3/3] Use the newly-introduced regexec_buf() function Johannes Schindelin
@ 2016-09-08 17:09       ` Junio C Hamano
  2016-09-09  9:52         ` Johannes Schindelin
  0 siblings, 1 reply; 66+ messages in thread
From: Junio C Hamano @ 2016-09-08 17:09 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, Jeff King

Johannes Schindelin <johannes.schindelin@gmx.de> writes:

> @@ -33,11 +32,8 @@ static void diffgrep_consume(void *priv, char *line, unsigned long len)
>  		 * caller early.
>  		 */
>  		return;
> -	/* Yuck -- line ought to be "const char *"! */
> -	hold = line[len];
> -	line[len] = '\0';
> -	data->hit = !regexec(data->regexp, line + 1, 1, &regmatch, 0);
> -	line[len] = hold;
> +	data->hit = !regexec_buf(data->regexp, line + 1, len - 1, 1,
> +				 &regmatch, 0);

This is an unexpected happy surprise.  It really feels good to see
that "Yuck" line go.

> @@ -228,18 +227,16 @@ static long ff_regexp(const char *line, long len,
>  			len--;
>  	}
>  
> -	line_buffer = xstrndup(line, len); /* make NUL terminated */
> -
>  	for (i = 0; i < regs->nr; i++) {
>  		struct ff_reg *reg = regs->array + i;
> -		if (!regexec(&reg->re, line_buffer, 2, pmatch, 0)) {
> +		if (!regexec_buf(&reg->re, line, len, 2, pmatch, 0)) {

So is this hunk.  Removing unnecessary copying is a very good thing.

Please give these three patches a common prefix, e.g.

	regex: -G<pattern> feeds a non NUL-terminated string to	regexec() and fails
        regex: add regexec_buf() that can work on a non NUL-terminated string
	regex: use regexec_buf()

or something like that.

Also I agree with Peff that a test with an embedded NUL would be a
good thing.

This round is so close to perfect.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 2/3] diff_populate_filespec: NUL-terminate buffers
  2016-09-08 16:57             ` Junio C Hamano
@ 2016-09-08 18:22               ` Johannes Schindelin
  2016-09-08 18:48               ` Jeff King
  1 sibling, 0 replies; 66+ messages in thread
From: Johannes Schindelin @ 2016-09-08 18:22 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git

Hi Peff & Junio,

On Thu, 8 Sep 2016, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > Between the two options for regexec_buf(), I think you have convinced me
> > that REG_STARTEND is better than just using compat/regex everywhere. I
> > do think the fallback for platforms like musl should be "use
> > compat/regex" and not doing an expensive copy (which in most cases is
> > not even necessary).
> 
> I agree with you that it would be the best approach to build
> regexec_buf() that unconditionally uses REG_STARTEND and tell people
> without REG_STARTEND to use compat/regex instead of their platform
> regex library.
> 
> The description in Makefile may want to be rephrased to clarify.
> 
> -# Define NO_REGEX if you have no or inferior regex support in your C library.
> +# Define NO_REGEX if your C library lacks regex support with REG_STARTEND
> +# feature.

Okay, I will make that happen. But first I need to sleep.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 2/3] diff_populate_filespec: NUL-terminate buffers
  2016-09-08 16:57             ` Junio C Hamano
  2016-09-08 18:22               ` Johannes Schindelin
@ 2016-09-08 18:48               ` Jeff King
  1 sibling, 0 replies; 66+ messages in thread
From: Jeff King @ 2016-09-08 18:48 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Johannes Schindelin, git

On Thu, Sep 08, 2016 at 09:57:43AM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > Between the two options for regexec_buf(), I think you have convinced me
> > that REG_STARTEND is better than just using compat/regex everywhere. I
> > do think the fallback for platforms like musl should be "use
> > compat/regex" and not doing an expensive copy (which in most cases is
> > not even necessary).
> 
> I agree with you that it would be the best approach to build
> regexec_buf() that unconditionally uses REG_STARTEND and tell people
> without REG_STARTEND to use compat/regex instead of their platform
> regex library.
> 
> The description in Makefile may want to be rephrased to clarify.
> 
> -# Define NO_REGEX if you have no or inferior regex support in your C library.
> +# Define NO_REGEX if your C library lacks regex support with REG_STARTEND
> +# feature.
> 
> The word "inferior" is not giving any useful information there.

Yeah, very much agreed (and the "#error" when we lack REG_STARTEND I
mentioned earlier may be a good hint).

-Peff

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v2 3/3] Use the newly-introduced regexec_buf() function
  2016-09-08  8:35           ` Jeff King
@ 2016-09-08 19:06             ` Ramsay Jones
  2016-09-08 19:53               ` Jeff King
  0 siblings, 1 reply; 66+ messages in thread
From: Ramsay Jones @ 2016-09-08 19:06 UTC (permalink / raw)
  To: Jeff King, Johannes Schindelin; +Cc: git, Junio C Hamano



On 08/09/16 09:35, Jeff King wrote:
> On Thu, Sep 08, 2016 at 04:14:46AM -0400, Jeff King wrote:
> 
>> On Thu, Sep 08, 2016 at 04:10:24AM -0400, Jeff King wrote:
>>
>>> On Thu, Sep 08, 2016 at 09:54:51AM +0200, Johannes Schindelin wrote:
>>>
>>>>>  diff.c             |  3 ++-
>>>>>  diffcore-pickaxe.c | 18 ++++++++----------
>>>>>  xdiff-interface.c  | 13 ++++---------
>>>>>  3 files changed, 14 insertions(+), 20 deletions(-)
>>>>
>>>> I just realized that this should switch the test_expect_failure from 1/3
>>>> to a test_expect_success.
>>>
>>> Yep. I wonder if we also would want to test that we correctly find
>>> regexes inside binary files.
>>>
>>> E.g., given a mixed binary/text file like:
>>>
>>>   printf 'binary\0text' >file &&
>>>   git add file &&
>>>   git commit -m file
>>>
>>> then "git log -Stext" will find that file, but "--pickaxe-regex" will
>>> not (using stock git). Ditto for "-Gtext".
>>>
>>> Your patch should fix that.
>>
>> Of course if I had actually _looked carefully_ at your patch, I would
>> have seen that your test doesn't just check that we don't segfault, but
>> actually confirms that we find the entry.
>>
>> Sorry for the noise.
> 
> Actually, I take it back again. Your test case doesn't have an embedded
> NUL in it (so we check that git finds it, but aside from the lack of
> segfault, stock git would already find it).

This reminds me ... despite the native cygwin regex library no longer
having the 'regex bug' (ie t0070.5 now passes), I still have NO_REGEX
set on cygwin. This is because, when building with the native library,
we have an "unexpected pass" for t7008.12, which looks like:

test_expect_failure 'git grep .fi a' '
        git grep .fi a
'
[where the file a is set up earlier by: echo 'binaryQfile' | q_to_nul >a]

commit f96e5673 ("grep: use REG_STARTEND for all matching if available",
22-05-2010) introduced this test and expects ".. NUL characters themselves
are not matched in any way". With the native library on cygwin they are
matched, with the compat/regex they are not. Indeed, if you use the system
'grep' command (rather than 'git grep'), then it will also not match ... :-D

Slightly off topic, but ...

ATB,
Ramsay Jones



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v2 3/3] Use the newly-introduced regexec_buf() function
  2016-09-08 19:06             ` Ramsay Jones
@ 2016-09-08 19:53               ` Jeff King
  2016-09-08 21:30                 ` Junio C Hamano
  0 siblings, 1 reply; 66+ messages in thread
From: Jeff King @ 2016-09-08 19:53 UTC (permalink / raw)
  To: Ramsay Jones; +Cc: Johannes Schindelin, git, Junio C Hamano

On Thu, Sep 08, 2016 at 08:06:35PM +0100, Ramsay Jones wrote:

> > Actually, I take it back again. Your test case doesn't have an embedded
> > NUL in it (so we check that git finds it, but aside from the lack of
> > segfault, stock git would already find it).
> 
> This reminds me ... despite the native cygwin regex library no longer
> having the 'regex bug' (ie t0070.5 now passes), I still have NO_REGEX
> set on cygwin. This is because, when building with the native library,
> we have an "unexpected pass" for t7008.12, which looks like:
> 
> test_expect_failure 'git grep .fi a' '
>         git grep .fi a
> '
> [where the file a is set up earlier by: echo 'binaryQfile' | q_to_nul >a]
> 
> commit f96e5673 ("grep: use REG_STARTEND for all matching if available",
> 22-05-2010) introduced this test and expects ".. NUL characters themselves
> are not matched in any way". With the native library on cygwin they are
> matched, with the compat/regex they are not. Indeed, if you use the system
> 'grep' command (rather than 'git grep'), then it will also not match ... :-D
> 
> Slightly off topic, but ...

Hmm. So it sounds like the "regmatch" in grep.c could go away in favor
of Johannes's regexec_buf(), and cygwin ought to be using NO_REGEX.

-Peff

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v2 3/3] Use the newly-introduced regexec_buf() function
  2016-09-08 19:53               ` Jeff King
@ 2016-09-08 21:30                 ` Junio C Hamano
  0 siblings, 0 replies; 66+ messages in thread
From: Junio C Hamano @ 2016-09-08 21:30 UTC (permalink / raw)
  To: Jeff King; +Cc: Ramsay Jones, Johannes Schindelin, git

Jeff King <peff@peff.net> writes:

>> commit f96e5673 ("grep: use REG_STARTEND for all matching if available",
>> 22-05-2010) introduced this test and expects ".. NUL characters themselves
>> are not matched in any way". With the native library on cygwin they are
>> matched, with the compat/regex they are not. Indeed, if you use the system
>> 'grep' command (rather than 'git grep'), then it will also not match ... :-D
>> 
>> Slightly off topic, but ...
>
> Hmm. So it sounds like the "regmatch" in grep.c could go away in favor
> of Johannes's regexec_buf(), and cygwin ought to be using NO_REGEX.

Sounds like a plan.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v2 2/3] Introduce a function to run regexec() on non-NUL-terminated buffers
  2016-09-08  8:04     ` Jeff King
@ 2016-09-09  9:45       ` Johannes Schindelin
  2016-09-09  9:59         ` Jeff King
  0 siblings, 1 reply; 66+ messages in thread
From: Johannes Schindelin @ 2016-09-09  9:45 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Junio C Hamano

Hi Peff,

On Thu, 8 Sep 2016, Jeff King wrote:

> On Thu, Sep 08, 2016 at 09:31:11AM +0200, Johannes Schindelin wrote:
> 
> > diff --git a/git-compat-util.h b/git-compat-util.h
> > index db89ba7..19128b3 100644
> > --- a/git-compat-util.h
> > +++ b/git-compat-util.h
> > @@ -965,6 +965,27 @@ void git_qsort(void *base, size_t nmemb, size_t size,
> >  #define qsort git_qsort
> >  #endif
> >  
> > +static inline int regexec_buf(const regex_t *preg, const char *buf, size_t size,
> > +			      size_t nmatch, regmatch_t pmatch[], int eflags)
> > +{
> > +#ifdef REG_STARTEND
> > +	assert(nmatch > 0 && pmatch);
> > +	pmatch[0].rm_so = 0;
> > +	pmatch[0].rm_eo = size;
> > +	return regexec(preg, buf, nmatch, pmatch, eflags | REG_STARTEND);
> > +#else
> > +	char *buf2 = xmalloc(size + 1);
> > +	int ret;
> > +
> > +	memcpy(buf2, buf, size);
> > +	buf2[size] = '\0';
> 
> I mentioned elsewhere that I'd prefer we just push people into using
> compat/regex if they don't have REG_STARTEND. But if we _do_ keep this
> fallback, note that the above has a buffer overflow (think what happens
> when "size" is the maximum value for a size_t).  You can avoid it by
> using xmallocz().

That buffer overflow does not exist: If size were the maximum value for
size_t, then buf->ptr would point at a buffer that occupies the entire
available memory, meaning that there is no space left for buf->ptr, let
alone for buf.

But I get your point. It is better to be consistent and use the same logic
for *all* allocations.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 3/3] Use the newly-introduced regexec_buf() function
  2016-09-08 17:09       ` Junio C Hamano
@ 2016-09-09  9:52         ` Johannes Schindelin
  2016-09-09  9:57           ` Jeff King
  2016-09-09 17:49           ` Junio C Hamano
  0 siblings, 2 replies; 66+ messages in thread
From: Johannes Schindelin @ 2016-09-09  9:52 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jeff King

Hi Junio,

On Thu, 8 Sep 2016, Junio C Hamano wrote:

> Please give these three patches a common prefix, e.g.
> 
> 	regex: -G<pattern> feeds a non NUL-terminated string to	regexec() and fails
>         regex: add regexec_buf() that can work on a non NUL-terminated string
> 	regex: use regexec_buf()
> 
> or something like that.

Done.

> Also I agree with Peff that a test with an embedded NUL would be a
> good thing.

This is something I will leave to somebody else, as it was not my
intention to fix this and I *really* have more pressing things to do right
now... Sorry!

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 3/3] Use the newly-introduced regexec_buf() function
  2016-09-09  9:52         ` Johannes Schindelin
@ 2016-09-09  9:57           ` Jeff King
  2016-09-09 10:41             ` Johannes Schindelin
  2016-09-09 17:49           ` Junio C Hamano
  1 sibling, 1 reply; 66+ messages in thread
From: Jeff King @ 2016-09-09  9:57 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Junio C Hamano, git

On Fri, Sep 09, 2016 at 11:52:50AM +0200, Johannes Schindelin wrote:

> > Also I agree with Peff that a test with an embedded NUL would be a
> > good thing.
> 
> This is something I will leave to somebody else, as it was not my
> intention to fix this and I *really* have more pressing things to do right
> now... Sorry!

I think it is literally just squashing this into your final patch:

diff --git a/t/t4061-diff-pickaxe.sh b/t/t4061-diff-pickaxe.sh
index f0bf50b..37b8dde 100755
--- a/t/t4061-diff-pickaxe.sh
+++ b/t/t4061-diff-pickaxe.sh
@@ -19,4 +19,13 @@ test_expect_success '-G matches' '
 	test 4096-zeroes.txt = "$(cat out)"
 '
 
+test_expect_success '-G matches after embedded NUL' '
+	printf "one\0two" >file &&
+	git add file &&
+	git commit -m embedded &&
+	echo embedded >expect &&
+	git log -Gtwo --format=%s >actual &&
+	test_cmp expect actual
+'
+
 test_done

-Peff

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [PATCH v2 2/3] Introduce a function to run regexec() on non-NUL-terminated buffers
  2016-09-09  9:45       ` Johannes Schindelin
@ 2016-09-09  9:59         ` Jeff King
  0 siblings, 0 replies; 66+ messages in thread
From: Jeff King @ 2016-09-09  9:59 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, Junio C Hamano

On Fri, Sep 09, 2016 at 11:45:01AM +0200, Johannes Schindelin wrote:

> > I mentioned elsewhere that I'd prefer we just push people into using
> > compat/regex if they don't have REG_STARTEND. But if we _do_ keep this
> > fallback, note that the above has a buffer overflow (think what happens
> > when "size" is the maximum value for a size_t).  You can avoid it by
> > using xmallocz().
> 
> That buffer overflow does not exist: If size were the maximum value for
> size_t, then buf->ptr would point at a buffer that occupies the entire
> available memory, meaning that there is no space left for buf->ptr, let
> alone for buf.

True. I fixed quite a lot of these last summer, but they are only really
dangerous when we have not already allocated the buffer.

> But I get your point. It is better to be consistent and use the same logic
> for *all* allocations.

Yep. Also, it is easier to audit if you do not have to trace back and
see that even though we do overflow the argument to malloc, it can't
happen because of memory constraints (this one is fairly obvious, but
quite a few that I fixed previously involved complicated reasoning about
how much RAM you could use).

-Peff

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/3] Fix a segfault caused by regexec() being called on mmap()ed data
  2016-09-08  8:00           ` Jeff King
@ 2016-09-09 10:09             ` Johannes Schindelin
  2016-09-09 17:46               ` Junio C Hamano
  0 siblings, 1 reply; 66+ messages in thread
From: Johannes Schindelin @ 2016-09-09 10:09 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git

[-- Attachment #1: Type: text/plain, Size: 2506 bytes --]

Hi Peff,

On Thu, 8 Sep 2016, Jeff King wrote:

> On Thu, Sep 08, 2016 at 09:29:58AM +0200, Johannes Schindelin wrote:
> 
> > sorry for the late answer, I was really busy trying to come up with a new
> > and improved version of the patch series, and while hunting a bug I
> > introduced got bogged down with other tasks.
> 
> No problem. I am not in a hurry.

I kind of am. The second half of September, I won't be able to do much of
anything Git-related, and this is a major bug that blocks some important
work.

So I kind of have to press on that front.

> I think I'd rather just have:
> 
>   #ifndef REG_STARTEND
>   #error "Your regex library sucks. Compile with NO_REGEX=NeedsStartEnd"
>   #endif

Done. Although I permitted myself to reword this a little ;-)

> One other question about REG_STARTEND is: what does it do with NULs
> inside the buffer? Certainly glibc (and our compat/regex) treat it as a
> buffer with a particular length and ignore embedded NULs, as we want.
> But the NetBSD documentation says only:
> 
>      REG_STARTEND   The string is considered to start at string +
> 		    pmatch[0].rm_so and to have a terminating NUL
> 		    located at string + pmatch[0].rm_eo (there need not
> 		    actually be a NUL at that location), 
> 
> Besides avoiding a segfault, one of the benefits of regcomp_buf() is
> that we will now find pickaxe-regex strings inside mixed binary/text
> files. But it's not clear to me that NetBSD's implementation does this.
> 
> I guess we can assume it is fine (it is certainly no _worse_ than the
> current behavior), and if people's platforms do not handle it, they can
> build with NO_REGEX.

René mentioned in f96e567 (grep: use REG_STARTEND for all matching if
available, 2010-05-22) something along the lines of REG_STARTEND being
able to parse beyond NULs. My interpretation of NetBSD's documentation
agrees with your interpretation, though, that the buffers are still
thought of as being NUL-terminated, even if rm_eo makes the code *not*
look at that particular NUL.

Be that as it may: it is completely outside the purpose of my patch series
to take care of making it possible for Git's regex functions to match
buffers with embedded NULs. The only purpose of my patch series is to fix
the crash that was reported to me due to regexec() reading past a mmap()ed
buffer. I already let myself being talked into fixing more things than
that, and I have to leave it at that.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 3/3] Use the newly-introduced regexec_buf() function
  2016-09-09  9:57           ` Jeff King
@ 2016-09-09 10:41             ` Johannes Schindelin
  0 siblings, 0 replies; 66+ messages in thread
From: Johannes Schindelin @ 2016-09-09 10:41 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git

Hi Peff,

On Fri, 9 Sep 2016, Jeff King wrote:

> On Fri, Sep 09, 2016 at 11:52:50AM +0200, Johannes Schindelin wrote:
> 
> > > Also I agree with Peff that a test with an embedded NUL would be a
> > > good thing.
> > 
> > This is something I will leave to somebody else, as it was not my
> > intention to fix this and I *really* have more pressing things to do right
> > now... Sorry!
> 
> I think it is literally just squashing this into your final patch:
> 
> diff --git a/t/t4061-diff-pickaxe.sh b/t/t4061-diff-pickaxe.sh
> index f0bf50b..37b8dde 100755
> --- a/t/t4061-diff-pickaxe.sh
> +++ b/t/t4061-diff-pickaxe.sh
> @@ -19,4 +19,13 @@ test_expect_success '-G matches' '
>  	test 4096-zeroes.txt = "$(cat out)"
>  '
>  
> +test_expect_success '-G matches after embedded NUL' '
> +	printf "one\0two" >file &&
> +	git add file &&
> +	git commit -m embedded &&
> +	echo embedded >expect &&
> +	git log -Gtwo --format=%s >actual &&
> +	test_cmp expect actual
> +'
> +
>  test_done

Thank you for providing me with the patch.

However, the whole idea of supporting regular expressions on buffers with
embedded NULs *is* different from the purpose of this patch series.

And in my quick web search, I got the impression that the presence of
REG_STARTEND really does not guarantee that regexec() won't stop at the
first NUL when rm_eo points after it.

So yeah, the patch would be easy to squash in, but the entire "rat's tail"
of making sure that this works everywhere, *in addition* to making sure
that the crash on mmap()ed buffers no longer occurs, would just delay this
patch series.

And unfortunately I do not have time for that right now.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/3] Fix a segfault caused by regexec() being called on mmap()ed data
  2016-09-09 10:09             ` Johannes Schindelin
@ 2016-09-09 17:46               ` Junio C Hamano
  0 siblings, 0 replies; 66+ messages in thread
From: Junio C Hamano @ 2016-09-09 17:46 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Jeff King, git

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

>> Besides avoiding a segfault, one of the benefits of regcomp_buf() is
>> that we will now find pickaxe-regex strings inside mixed binary/text
>> files. But it's not clear to me that NetBSD's implementation does this.
>> 
>> I guess we can assume it is fine (it is certainly no _worse_ than the
>> current behavior), and if people's platforms do not handle it, they can
>> build with NO_REGEX.
>
> René mentioned in f96e567 (grep: use REG_STARTEND for all matching if
> available, 2010-05-22) something along the lines of REG_STARTEND being
> able to parse beyond NULs. My interpretation of NetBSD's documentation
> agrees with your interpretation, though, that the buffers are still
> thought of as being NUL-terminated, even if rm_eo makes the code *not*
> look at that particular NUL.
>
> Be that as it may: it is completely outside the purpose of my patch series
> to take care of making it possible for Git's regex functions to match
> buffers with embedded NULs.

I think you two have agreed that regexec_buf() wrapper that always
relies on REG_STARTEND is a good first step whether we want to do an
embedded NUL, that we just need that good first step for now, and
that that good first step would not block future progress, i.e. our
wanting to handle embedded NUL.

So let's see how well the first step flies in practice.  We tell
people to build with NO_REGEX if their platform's regexp does not do
(any form of) REG_STARTEND.  We _might_ later have to tell them to
set NO_REGX even if they are on NetBSD and has REG_STARTEND that may
not handle embedded NUL the way we want, but that can safely be left
to the future.

Thanks.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 3/3] Use the newly-introduced regexec_buf() function
  2016-09-09  9:52         ` Johannes Schindelin
  2016-09-09  9:57           ` Jeff King
@ 2016-09-09 17:49           ` Junio C Hamano
  1 sibling, 0 replies; 66+ messages in thread
From: Junio C Hamano @ 2016-09-09 17:49 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, Jeff King

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

>> Also I agree with Peff that a test with an embedded NUL would be a
>> good thing.
>
> This is something I will leave to somebody else, as it was not my
> intention to fix this and I *really* have more pressing things to do right
> now... Sorry!

As I said a few minutes ago, I think we can stop _before_ worrying
about an embedded NUL, which is something we haven't handled before
anyway so it is a new feature that can be built later outside the
scope of this series.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH v4 0/3] Fix a segfault caused by regexec() being called on mmap()ed data
  2016-09-08  7:57   ` [PATCH v3 " Johannes Schindelin
                       ` (2 preceding siblings ...)
  2016-09-08  7:59     ` [PATCH v3 3/3] Use the newly-introduced regexec_buf() function Johannes Schindelin
@ 2016-09-21 18:23     ` Johannes Schindelin
  2016-09-21 18:23       ` [PATCH v4 1/3] regex: -G<pattern> feeds a non NUL-terminated string to regexec() and fails Johannes Schindelin
                         ` (3 more replies)
  3 siblings, 4 replies; 66+ messages in thread
From: Johannes Schindelin @ 2016-09-21 18:23 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Benjamin Kramer, René Scharfe

[-- Attachment #1: Type: text/plain, Size: 4565 bytes --]

[Cc:ing Benjamin Kramer & René Scharfe because they both worked on
the REG_STARTEND code in grep.c that I replace in this iteration of the
patch series]

This patch series addresses a problem where `git diff` is called using
`-G` or `-S --pickaxe-regex` on new-born files that are configured
without user diff drivers, and that hence get mmap()ed into memory.

The problem with that: mmap()ed memory is *not* NUL-terminated, yet the
pickaxe code calls regexec() on it just the same.

This problem has been reported by my colleague Chris Sidi.

We solve this by introducing a helper, regexec_buf(), that takes a
pointer and a length instead of a NUL-terminated string.

This helper then uses REG_STARTEND where available, and falls back to
allocating and constructing a NUL-terminated string. Given the
wide-spread support for REG_STARTEND (Linux has it, MacOSX has it, Git
for Windows has it because it uses compat/regex/ that has it), I think
this is a fair trade-off.

Changes since v3:

- reworded the onelines as per Junio's suggestions.

- removed fallback when REG_STARTEND is not supported, in favor of
  requiring NO_REGEX.

- removed the regmatch() function from grep.c, in favor of using
  regexec_buf().


Johannes Schindelin (3):
  regex: -G<pattern> feeds a non NUL-terminated string to regexec() and
    fails
  regex: add regexec_buf() that can work on a non NUL-terminated string
  regex: use regexec_buf()

 Makefile                |  3 ++-
 diff.c                  |  3 ++-
 diffcore-pickaxe.c      | 18 ++++++++----------
 git-compat-util.h       | 13 +++++++++++++
 grep.c                  | 14 ++------------
 t/t4061-diff-pickaxe.sh | 22 ++++++++++++++++++++++
 xdiff-interface.c       | 13 ++++---------
 7 files changed, 53 insertions(+), 33 deletions(-)
 create mode 100755 t/t4061-diff-pickaxe.sh

Published-As: https://github.com/dscho/git/releases/tag/mmap-regexec-v4
Fetch-It-Via: git fetch https://github.com/dscho/git mmap-regexec-v4

Interdiff vs v3:

 diff --git a/Makefile b/Makefile
 index df4f86b..c6f7f66 100644
 --- a/Makefile
 +++ b/Makefile
 @@ -301,7 +301,8 @@ all::
  # crashes due to allocation and free working on different 'heaps'.
  # It's defined automatically if USE_NED_ALLOCATOR is set.
  #
 -# Define NO_REGEX if you have no or inferior regex support in your C library.
 +# Define NO_REGEX if your C library lacks regex support with REG_STARTEND
 +# feature.
  #
  # Define HAVE_DEV_TTY if your system can open /dev/tty to interact with the
  # user.
 diff --git a/git-compat-util.h b/git-compat-util.h
 index 627ec5f..8aab0c3 100644
 --- a/git-compat-util.h
 +++ b/git-compat-util.h
 @@ -977,25 +977,17 @@ void git_qsort(void *base, size_t nmemb, size_t size,
  #define qsort git_qsort
  #endif
  
 +#ifndef REG_STARTEND
 +#error "Git requires REG_STARTEND support. Compile with NO_REGEX=NeedsStartEnd"
 +#endif
 +
  static inline int regexec_buf(const regex_t *preg, const char *buf, size_t size,
  			      size_t nmatch, regmatch_t pmatch[], int eflags)
  {
 -#ifdef REG_STARTEND
  	assert(nmatch > 0 && pmatch);
  	pmatch[0].rm_so = 0;
  	pmatch[0].rm_eo = size;
  	return regexec(preg, buf, nmatch, pmatch, eflags | REG_STARTEND);
 -#else
 -	char *buf2 = xmalloc(size + 1);
 -	int ret;
 -
 -	memcpy(buf2, buf, size);
 -	buf2[size] = '\0';
 -	ret = regexec(preg, buf2, nmatch, pmatch, eflags);
 -	free(buf2);
 -
 -	return ret;
 -#endif
  }
  
  #ifndef DIR_HAS_BSD_GROUP_SEMANTICS
 diff --git a/grep.c b/grep.c
 index d7d00b8..1194d35 100644
 --- a/grep.c
 +++ b/grep.c
 @@ -898,17 +898,6 @@ static int fixmatch(struct grep_pat *p, char *line, char *eol,
  	}
  }
  
 -static int regmatch(const regex_t *preg, char *line, char *eol,
 -		    regmatch_t *match, int eflags)
 -{
 -#ifdef REG_STARTEND
 -	match->rm_so = 0;
 -	match->rm_eo = eol - line;
 -	eflags |= REG_STARTEND;
 -#endif
 -	return regexec(preg, line, 1, match, eflags);
 -}
 -
  static int patmatch(struct grep_pat *p, char *line, char *eol,
  		    regmatch_t *match, int eflags)
  {
 @@ -919,7 +908,8 @@ static int patmatch(struct grep_pat *p, char *line, char *eol,
  	else if (p->pcre_regexp)
  		hit = !pcrematch(p, line, eol, match, eflags);
  	else
 -		hit = !regmatch(&p->regexp, line, eol, match, eflags);
 +		hit = !regexec_buf(&p->regexp, line, eol - line, 1, match,
 +				   eflags);
  
  	return hit;
  }

-- 
2.10.0.windows.1.10.g803177d

base-commit: f6727b0509ec3417a5183ba6e658143275a734f5

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH v4 1/3] regex: -G<pattern> feeds a non NUL-terminated string to regexec() and fails
  2016-09-21 18:23     ` [PATCH v4 0/3] Fix a segfault caused by regexec() being called on mmap()ed data Johannes Schindelin
@ 2016-09-21 18:23       ` Johannes Schindelin
  2016-09-21 18:24       ` [PATCH v4 2/3] regex: add regexec_buf() that can work on a non NUL-terminated string Johannes Schindelin
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 66+ messages in thread
From: Johannes Schindelin @ 2016-09-21 18:23 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Benjamin Kramer, René Scharfe

When our pickaxe code feeds file contents to regexec(), it implicitly
assumes that the file contents are read into implicitly NUL-terminated
buffers (i.e. that we overallocate by 1, appending a single '\0').

This is not so.

In particular when the file contents are simply mmap()ed, we can be
virtually certain that the buffer is preceding uninitialized bytes, or
invalid pages.

Note that the test we add here is known to be flakey: we simply cannot
know whether the byte following the mmap()ed ones is a NUL or not.

Typically, on Linux the test passes. On Windows, it fails virtually
every time due to an access violation (that's a segmentation fault for
you Unix-y people out there). And Windows would be correct: the
regexec() call wants to operate on a regular, NUL-terminated string,
there is no NUL in the mmap()ed memory range, and it is undefined
whether the next byte is even legal to access.

When run with --valgrind it demonstrates quite clearly the breakage, of
course.

Being marked with `test_expect_failure`, this test will sometimes be
declare "TODO fixed", even if it only passes by mistake.

This test case represents a Minimal, Complete and Verifiable Example of
a breakage reported by Chris Sidi.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 t/t4061-diff-pickaxe.sh | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)
 create mode 100755 t/t4061-diff-pickaxe.sh

diff --git a/t/t4061-diff-pickaxe.sh b/t/t4061-diff-pickaxe.sh
new file mode 100755
index 0000000..5929f2e
--- /dev/null
+++ b/t/t4061-diff-pickaxe.sh
@@ -0,0 +1,22 @@
+#!/bin/sh
+#
+# Copyright (c) 2016 Johannes Schindelin
+#
+
+test_description='Pickaxe options'
+
+. ./test-lib.sh
+
+test_expect_success setup '
+	test_commit initial &&
+	printf "%04096d" 0 >4096-zeroes.txt &&
+	git add 4096-zeroes.txt &&
+	test_tick &&
+	git commit -m "A 4k file"
+'
+test_expect_failure '-G matches' '
+	git diff --name-only -G "^0{4096}$" HEAD^ >out &&
+	test 4096-zeroes.txt = "$(cat out)"
+'
+
+test_done
-- 
2.10.0.windows.1.10.g803177d



^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH v4 2/3] regex: add regexec_buf() that can work on a non NUL-terminated string
  2016-09-21 18:23     ` [PATCH v4 0/3] Fix a segfault caused by regexec() being called on mmap()ed data Johannes Schindelin
  2016-09-21 18:23       ` [PATCH v4 1/3] regex: -G<pattern> feeds a non NUL-terminated string to regexec() and fails Johannes Schindelin
@ 2016-09-21 18:24       ` Johannes Schindelin
  2016-09-21 19:17         ` Junio C Hamano
  2016-09-21 18:24       ` [PATCH v4 3/3] regex: use regexec_buf() Johannes Schindelin
  2016-09-21 22:04       ` [PATCH v4 0/3] Fix a segfault caused by regexec() being called on mmap()ed data Jeff King
  3 siblings, 1 reply; 66+ messages in thread
From: Johannes Schindelin @ 2016-09-21 18:24 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Benjamin Kramer, René Scharfe

We just introduced a test that demonstrates that our sloppy use of
regexec() on a mmap()ed area can result in incorrect results or even
hard crashes.

So what we need to fix this is a function that calls regexec() on a
length-delimited, rather than a NUL-terminated, string.

Happily, there is an extension to regexec() introduced by the NetBSD
project and present in all major regex implementation including
Linux', MacOSX' and the one Git includes in compat/regex/: by using
the (non-POSIX) REG_STARTEND flag, it is possible to tell the
regexec() function that it should only look at the offsets between
pmatch[0].rm_so and pmatch[0].rm_eo.

That is exactly what we need.

Since support for REG_STARTEND is so widespread by now, let's just
introduce a helper function that uses it, and fall back to allocating
and constructing a NUL-terminated when REG_STARTEND is not available.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 Makefile          |  3 ++-
 git-compat-util.h | 13 +++++++++++++
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index df4f86b..c6f7f66 100644
--- a/Makefile
+++ b/Makefile
@@ -301,7 +301,8 @@ all::
 # crashes due to allocation and free working on different 'heaps'.
 # It's defined automatically if USE_NED_ALLOCATOR is set.
 #
-# Define NO_REGEX if you have no or inferior regex support in your C library.
+# Define NO_REGEX if your C library lacks regex support with REG_STARTEND
+# feature.
 #
 # Define HAVE_DEV_TTY if your system can open /dev/tty to interact with the
 # user.
diff --git a/git-compat-util.h b/git-compat-util.h
index 37cce07..8aab0c3 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -977,6 +977,19 @@ void git_qsort(void *base, size_t nmemb, size_t size,
 #define qsort git_qsort
 #endif
 
+#ifndef REG_STARTEND
+#error "Git requires REG_STARTEND support. Compile with NO_REGEX=NeedsStartEnd"
+#endif
+
+static inline int regexec_buf(const regex_t *preg, const char *buf, size_t size,
+			      size_t nmatch, regmatch_t pmatch[], int eflags)
+{
+	assert(nmatch > 0 && pmatch);
+	pmatch[0].rm_so = 0;
+	pmatch[0].rm_eo = size;
+	return regexec(preg, buf, nmatch, pmatch, eflags | REG_STARTEND);
+}
+
 #ifndef DIR_HAS_BSD_GROUP_SEMANTICS
 # define FORCE_DIR_SET_GID S_ISGID
 #else
-- 
2.10.0.windows.1.10.g803177d



^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH v4 3/3] regex: use regexec_buf()
  2016-09-21 18:23     ` [PATCH v4 0/3] Fix a segfault caused by regexec() being called on mmap()ed data Johannes Schindelin
  2016-09-21 18:23       ` [PATCH v4 1/3] regex: -G<pattern> feeds a non NUL-terminated string to regexec() and fails Johannes Schindelin
  2016-09-21 18:24       ` [PATCH v4 2/3] regex: add regexec_buf() that can work on a non NUL-terminated string Johannes Schindelin
@ 2016-09-21 18:24       ` Johannes Schindelin
  2016-09-21 19:18         ` Junio C Hamano
  2016-09-21 22:03         ` Jeff King
  2016-09-21 22:04       ` [PATCH v4 0/3] Fix a segfault caused by regexec() being called on mmap()ed data Jeff King
  3 siblings, 2 replies; 66+ messages in thread
From: Johannes Schindelin @ 2016-09-21 18:24 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Benjamin Kramer, René Scharfe

The new regexec_buf() function operates on buffers with an explicitly
specified length, rather than NUL-terminated strings.

We need to use this function whenever the buffer we want to pass to
regexec() may have been mmap()ed (and is hence not NUL-terminated).

Note: the original motivation for this patch was to fix a bug where
`git diff -G <regex>` would crash. This patch converts more callers,
though, some of which explicitly allocated and constructed
NUL-terminated strings (or worse: modified read-only buffers to insert
NULs).

Some of the buffers actually may be NUL-terminated. As regexec_buf()
uses REG_STARTEND where available, but has to fall back to allocating
and constructing NUL-terminated strings where REG_STARTEND is not
available, this makes the code less efficient in the latter case.

However, given the widespread support for REG_STARTEND, combined with
the improved ease of code maintenance, we strike the balance in favor
of REG_STARTEND.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 diff.c                  |  3 ++-
 diffcore-pickaxe.c      | 18 ++++++++----------
 grep.c                  | 14 ++------------
 t/t4061-diff-pickaxe.sh |  2 +-
 xdiff-interface.c       | 13 ++++---------
 5 files changed, 17 insertions(+), 33 deletions(-)

diff --git a/diff.c b/diff.c
index c6da383..fb99235 100644
--- a/diff.c
+++ b/diff.c
@@ -952,7 +952,8 @@ static int find_word_boundaries(mmfile_t *buffer, regex_t *word_regex,
 {
 	if (word_regex && *begin < buffer->size) {
 		regmatch_t match[1];
-		if (!regexec(word_regex, buffer->ptr + *begin, 1, match, 0)) {
+		if (!regexec_buf(word_regex, buffer->ptr + *begin,
+				 buffer->size - *begin, 1, match, 0)) {
 			char *p = memchr(buffer->ptr + *begin + match[0].rm_so,
 					'\n', match[0].rm_eo - match[0].rm_so);
 			*end = p ? p - buffer->ptr : match[0].rm_eo + *begin;
diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
index 55067ca..9795ca1 100644
--- a/diffcore-pickaxe.c
+++ b/diffcore-pickaxe.c
@@ -23,7 +23,6 @@ static void diffgrep_consume(void *priv, char *line, unsigned long len)
 {
 	struct diffgrep_cb *data = priv;
 	regmatch_t regmatch;
-	int hold;
 
 	if (line[0] != '+' && line[0] != '-')
 		return;
@@ -33,11 +32,8 @@ static void diffgrep_consume(void *priv, char *line, unsigned long len)
 		 * caller early.
 		 */
 		return;
-	/* Yuck -- line ought to be "const char *"! */
-	hold = line[len];
-	line[len] = '\0';
-	data->hit = !regexec(data->regexp, line + 1, 1, &regmatch, 0);
-	line[len] = hold;
+	data->hit = !regexec_buf(data->regexp, line + 1, len - 1, 1,
+				 &regmatch, 0);
 }
 
 static int diff_grep(mmfile_t *one, mmfile_t *two,
@@ -50,9 +46,11 @@ static int diff_grep(mmfile_t *one, mmfile_t *two,
 	xdemitconf_t xecfg;
 
 	if (!one)
-		return !regexec(regexp, two->ptr, 1, &regmatch, 0);
+		return !regexec_buf(regexp, two->ptr, two->size,
+				    1, &regmatch, 0);
 	if (!two)
-		return !regexec(regexp, one->ptr, 1, &regmatch, 0);
+		return !regexec_buf(regexp, one->ptr, one->size,
+				    1, &regmatch, 0);
 
 	/*
 	 * We have both sides; need to run textual diff and see if
@@ -83,8 +81,8 @@ static unsigned int contains(mmfile_t *mf, regex_t *regexp, kwset_t kws)
 		regmatch_t regmatch;
 		int flags = 0;
 
-		assert(data[sz] == '\0');
-		while (*data && !regexec(regexp, data, 1, &regmatch, flags)) {
+		while (*data &&
+		       !regexec_buf(regexp, data, sz, 1, &regmatch, flags)) {
 			flags |= REG_NOTBOL;
 			data += regmatch.rm_eo;
 			if (*data && regmatch.rm_so == regmatch.rm_eo)
diff --git a/grep.c b/grep.c
index d7d00b8..1194d35 100644
--- a/grep.c
+++ b/grep.c
@@ -898,17 +898,6 @@ static int fixmatch(struct grep_pat *p, char *line, char *eol,
 	}
 }
 
-static int regmatch(const regex_t *preg, char *line, char *eol,
-		    regmatch_t *match, int eflags)
-{
-#ifdef REG_STARTEND
-	match->rm_so = 0;
-	match->rm_eo = eol - line;
-	eflags |= REG_STARTEND;
-#endif
-	return regexec(preg, line, 1, match, eflags);
-}
-
 static int patmatch(struct grep_pat *p, char *line, char *eol,
 		    regmatch_t *match, int eflags)
 {
@@ -919,7 +908,8 @@ static int patmatch(struct grep_pat *p, char *line, char *eol,
 	else if (p->pcre_regexp)
 		hit = !pcrematch(p, line, eol, match, eflags);
 	else
-		hit = !regmatch(&p->regexp, line, eol, match, eflags);
+		hit = !regexec_buf(&p->regexp, line, eol - line, 1, match,
+				   eflags);
 
 	return hit;
 }
diff --git a/t/t4061-diff-pickaxe.sh b/t/t4061-diff-pickaxe.sh
index 5929f2e..f0bf50b 100755
--- a/t/t4061-diff-pickaxe.sh
+++ b/t/t4061-diff-pickaxe.sh
@@ -14,7 +14,7 @@ test_expect_success setup '
 	test_tick &&
 	git commit -m "A 4k file"
 '
-test_expect_failure '-G matches' '
+test_expect_success '-G matches' '
 	git diff --name-only -G "^0{4096}$" HEAD^ >out &&
 	test 4096-zeroes.txt = "$(cat out)"
 '
diff --git a/xdiff-interface.c b/xdiff-interface.c
index 3bfc69c..060038c 100644
--- a/xdiff-interface.c
+++ b/xdiff-interface.c
@@ -214,11 +214,10 @@ struct ff_regs {
 static long ff_regexp(const char *line, long len,
 		char *buffer, long buffer_size, void *priv)
 {
-	char *line_buffer;
 	struct ff_regs *regs = priv;
 	regmatch_t pmatch[2];
 	int i;
-	int result = -1;
+	int result;
 
 	/* Exclude terminating newline (and cr) from matching */
 	if (len > 0 && line[len-1] == '\n') {
@@ -228,18 +227,16 @@ static long ff_regexp(const char *line, long len,
 			len--;
 	}
 
-	line_buffer = xstrndup(line, len); /* make NUL terminated */
-
 	for (i = 0; i < regs->nr; i++) {
 		struct ff_reg *reg = regs->array + i;
-		if (!regexec(&reg->re, line_buffer, 2, pmatch, 0)) {
+		if (!regexec_buf(&reg->re, line, len, 2, pmatch, 0)) {
 			if (reg->negate)
-				goto fail;
+				return -1;
 			break;
 		}
 	}
 	if (regs->nr <= i)
-		goto fail;
+		return -1;
 	i = pmatch[1].rm_so >= 0 ? 1 : 0;
 	line += pmatch[i].rm_so;
 	result = pmatch[i].rm_eo - pmatch[i].rm_so;
@@ -248,8 +245,6 @@ static long ff_regexp(const char *line, long len,
 	while (result > 0 && (isspace(line[result - 1])))
 		result--;
 	memcpy(buffer, line, result);
- fail:
-	free(line_buffer);
 	return result;
 }
 
-- 
2.10.0.windows.1.10.g803177d

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 2/3] regex: add regexec_buf() that can work on a non NUL-terminated string
  2016-09-21 18:24       ` [PATCH v4 2/3] regex: add regexec_buf() that can work on a non NUL-terminated string Johannes Schindelin
@ 2016-09-21 19:17         ` Junio C Hamano
  2016-09-22 18:38           ` Johannes Schindelin
  0 siblings, 1 reply; 66+ messages in thread
From: Junio C Hamano @ 2016-09-21 19:17 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, Jeff King, Benjamin Kramer, René Scharfe

Johannes Schindelin <johannes.schindelin@gmx.de> writes:

> ...
> Happily, there is an extension to regexec() introduced by the NetBSD
> project and present in all major regex implementation including
> Linux', MacOSX' and the one Git includes in compat/regex/: by using
> the (non-POSIX) REG_STARTEND flag, it is possible to tell the
> regexec() function that it should only look at the offsets between
> pmatch[0].rm_so and pmatch[0].rm_eo.
>
> That is exactly what we need.

Wonderful.

> Since support for REG_STARTEND is so widespread by now, let's just
> introduce a helper function that uses it, and fall back to allocating
> and constructing a NUL-terminated when REG_STARTEND is not available.

I'd somehow reword the last paragraph here, though ;-)

    Since support for REG_STARTEND is so widespread by now, let's just
    introduce a helper function that always uses it, and tell people
    on a platform whose regex library does not support it to use the
    one from our compat/regex/ directory.

The patch itself looks very sane.  Thanks.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 3/3] regex: use regexec_buf()
  2016-09-21 18:24       ` [PATCH v4 3/3] regex: use regexec_buf() Johannes Schindelin
@ 2016-09-21 19:18         ` Junio C Hamano
  2016-09-21 20:09           ` Junio C Hamano
  2016-09-21 22:03         ` Jeff King
  1 sibling, 1 reply; 66+ messages in thread
From: Junio C Hamano @ 2016-09-21 19:18 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, Jeff King, Benjamin Kramer, René Scharfe

Johannes Schindelin <johannes.schindelin@gmx.de> writes:

> The new regexec_buf() function operates on buffers with an explicitly
> specified length, rather than NUL-terminated strings.
>
> We need to use this function whenever the buffer we want to pass to
> regexec() may have been mmap()ed (and is hence not NUL-terminated).
>
> Note: the original motivation for this patch was to fix a bug where
> `git diff -G <regex>` would crash. This patch converts more callers,
> though, some of which explicitly allocated and constructed
> NUL-terminated strings (or worse: modified read-only buffers to insert
> NULs).
>
> Some of the buffers actually may be NUL-terminated. As regexec_buf()
> uses REG_STARTEND where available, but has to fall back to allocating
> and constructing NUL-terminated strings where REG_STARTEND is not
> available, this makes the code less efficient in the latter case.
>
> However, given the widespread support for REG_STARTEND, combined with
> the improved ease of code maintenance, we strike the balance in favor
> of REG_STARTEND.

The last paragraph can go (2/3 was already justified separately),
and the paragraph before that needs rewording, as you no longer do
the "duplicate, run regexec, and free" dance.

Will comment on the patch text itself later.

Thanks for following it through.  This topic actually fell under my
radar until now.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 3/3] regex: use regexec_buf()
  2016-09-21 19:18         ` Junio C Hamano
@ 2016-09-21 20:09           ` Junio C Hamano
  0 siblings, 0 replies; 66+ messages in thread
From: Junio C Hamano @ 2016-09-21 20:09 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, Jeff King, Benjamin Kramer, René Scharfe

Junio C Hamano <gitster@pobox.com> writes:

> Johannes Schindelin <johannes.schindelin@gmx.de> writes:
>
>> The new regexec_buf() function operates on buffers with an explicitly
>> specified length, rather than NUL-terminated strings.
>>
>> We need to use this function whenever the buffer we want to pass to
>> regexec() may have been mmap()ed (and is hence not NUL-terminated).
>>
>> Note: the original motivation for this patch was to fix a bug where
>> `git diff -G <regex>` would crash. This patch converts more callers,
>> though, some of which explicitly allocated and constructed
>> NUL-terminated strings (or worse: modified read-only buffers to insert
>> NULs).

Also, I think there is nobody that modified read-only buffer.
diffgrep_consume() does say "Yuck -- line ought to be const",
but its "line" parameter is actually a non-const exactly for
this yuckiness (iow, it knew what it was doing).

Perhaps like so?

    regex: use regexec_buf()
    
    The new regexec_buf() function operates on buffers with an explicitly
    specified length, rather than NUL-terminated strings.
    
    We need to use this function whenever the buffer we want to pass to
    regexec(3) may have been mmap(2)ed (and is hence not NUL-terminated).
    
    Note: the original motivation for this patch was to fix a bug where
    `git diff -G <regex>` would crash. This patch converts more callers,
    though, some of which allocated to construct NUL-terminated strings,
    or worse, modified buffers to temporarily insert NULs while calling
    regexec(3).  By converting them to use regexec_buf() they have become
    much cleaner.
    
    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
    Signed-off-by: Junio C Hamano <gitster@pobox.com>

The patch text looked good to me.

Thanks.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 3/3] regex: use regexec_buf()
  2016-09-21 18:24       ` [PATCH v4 3/3] regex: use regexec_buf() Johannes Schindelin
  2016-09-21 19:18         ` Junio C Hamano
@ 2016-09-21 22:03         ` Jeff King
  2016-09-25 14:01           ` Johannes Schindelin
  1 sibling, 1 reply; 66+ messages in thread
From: Jeff King @ 2016-09-21 22:03 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: git, Junio C Hamano, Benjamin Kramer, René Scharfe

On Wed, Sep 21, 2016 at 08:24:14PM +0200, Johannes Schindelin wrote:

> The new regexec_buf() function operates on buffers with an explicitly
> specified length, rather than NUL-terminated strings.
> 
> We need to use this function whenever the buffer we want to pass to
> regexec() may have been mmap()ed (and is hence not NUL-terminated).
> 
> Note: the original motivation for this patch was to fix a bug where
> `git diff -G <regex>` would crash. This patch converts more callers,
> though, some of which explicitly allocated and constructed
> NUL-terminated strings (or worse: modified read-only buffers to insert
> NULs).

Nice. I probably would have split these into their own patch, but I
think it is OK here.

> @@ -228,18 +227,16 @@ static long ff_regexp(const char *line, long len,
>  			len--;
>  	}
>  
> -	line_buffer = xstrndup(line, len); /* make NUL terminated */
> -

Nice to see this one going away in particular, since it's called quite a
lot. According to perf, "git log -p" on git.git drops about 1.5 million
malloc calls (about 9% of the total). And here are best-of-five results
for that same command:

  [before]
  real    0m14.676s
  user    0m13.988s
  sys     0m0.676s

  [after]
  real    0m14.394s
  user    0m13.624s
  sys     0m0.760s

Not a _huge_ improvement, but more significant than the run-to-run
noise.

-Peff

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 0/3] Fix a segfault caused by regexec() being called on mmap()ed data
  2016-09-21 18:23     ` [PATCH v4 0/3] Fix a segfault caused by regexec() being called on mmap()ed data Johannes Schindelin
                         ` (2 preceding siblings ...)
  2016-09-21 18:24       ` [PATCH v4 3/3] regex: use regexec_buf() Johannes Schindelin
@ 2016-09-21 22:04       ` Jeff King
  3 siblings, 0 replies; 66+ messages in thread
From: Jeff King @ 2016-09-21 22:04 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: git, Junio C Hamano, Benjamin Kramer, René Scharfe

On Wed, Sep 21, 2016 at 08:23:11PM +0200, Johannes Schindelin wrote:

> We solve this by introducing a helper, regexec_buf(), that takes a
> pointer and a length instead of a NUL-terminated string.
> 
> This helper then uses REG_STARTEND where available, and falls back to
> allocating and constructing a NUL-terminated string. Given the
> wide-spread support for REG_STARTEND (Linux has it, MacOSX has it, Git
> for Windows has it because it uses compat/regex/ that has it), I think
> this is a fair trade-off.

I did a double-take on this, but then read:

> Changes since v3:
> [...]
> - removed fallback when REG_STARTEND is not supported, in favor of
>   requiring NO_REGEX.

So I think we are all in agreement. :)

With the exception of a few commit message fixups that Junio already
pointed out, this looks good to me. Thanks.

-Peff

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 2/3] regex: add regexec_buf() that can work on a non NUL-terminated string
  2016-09-21 19:17         ` Junio C Hamano
@ 2016-09-22 18:38           ` Johannes Schindelin
  0 siblings, 0 replies; 66+ messages in thread
From: Johannes Schindelin @ 2016-09-22 18:38 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jeff King, Benjamin Kramer, René Scharfe

Hi Junio,

On Wed, 21 Sep 2016, Junio C Hamano wrote:

> Johannes Schindelin <johannes.schindelin@gmx.de> writes:
> 
> > ...
> > Happily, there is an extension to regexec() introduced by the NetBSD
> > project and present in all major regex implementation including
> > Linux', MacOSX' and the one Git includes in compat/regex/: by using
> > the (non-POSIX) REG_STARTEND flag, it is possible to tell the
> > regexec() function that it should only look at the offsets between
> > pmatch[0].rm_so and pmatch[0].rm_eo.
> >
> > That is exactly what we need.
> 
> Wonderful.
> 
> > Since support for REG_STARTEND is so widespread by now, let's just
> > introduce a helper function that uses it, and fall back to allocating
> > and constructing a NUL-terminated when REG_STARTEND is not available.
> 
> I'd somehow reword the last paragraph here, though ;-)

Oh drats. I thought I had prepared the fixed commit messages already :-(

What you have in `pu` as of today looks good.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 3/3] regex: use regexec_buf()
  2016-09-21 22:03         ` Jeff King
@ 2016-09-25 14:01           ` Johannes Schindelin
  0 siblings, 0 replies; 66+ messages in thread
From: Johannes Schindelin @ 2016-09-25 14:01 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Junio C Hamano, Benjamin Kramer, René Scharfe

Hi Peff,

On Wed, 21 Sep 2016, Jeff King wrote:

> On Wed, Sep 21, 2016 at 08:24:14PM +0200, Johannes Schindelin wrote:
> 
> > The new regexec_buf() function operates on buffers with an explicitly
> > specified length, rather than NUL-terminated strings.
> > 
> > We need to use this function whenever the buffer we want to pass to
> > regexec() may have been mmap()ed (and is hence not NUL-terminated).
> > 
> > Note: the original motivation for this patch was to fix a bug where
> > `git diff -G <regex>` would crash. This patch converts more callers,
> > though, some of which explicitly allocated and constructed
> > NUL-terminated strings (or worse: modified read-only buffers to insert
> > NULs).
> 
> Nice. I probably would have split these into their own patch, but I
> think it is OK here.

True. I guess I was a little lazy...

> > @@ -228,18 +227,16 @@ static long ff_regexp(const char *line, long len,
> >  			len--;
> >  	}
> >  
> > -	line_buffer = xstrndup(line, len); /* make NUL terminated */
> > -
> 
> Nice to see this one going away in particular, since it's called quite a
> lot. According to perf, "git log -p" on git.git drops about 1.5 million
> malloc calls (about 9% of the total). And here are best-of-five results
> for that same command:
> 
>   [before]
>   real    0m14.676s
>   user    0m13.988s
>   sys     0m0.676s
> 
>   [after]
>   real    0m14.394s
>   user    0m13.624s
>   sys     0m0.760s
> 
> Not a _huge_ improvement, but more significant than the run-to-run
> noise.

Oh, that's nice! As you guessed, my aim was not to improve performance,
but it is a pretty side effect...

Thanks for testing!
Dscho

^ permalink raw reply	[flat|nested] 66+ messages in thread

end of thread, other threads:[~2016-09-25 14:02 UTC | newest]

Thread overview: 66+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-05 15:44 [PATCH 0/3] Fix a segfault caused by regexec() being called on mmap()ed data Johannes Schindelin
2016-09-05 15:45 ` [PATCH 1/3] Demonstrate a problem: our pickaxe code assumes NUL-terminated buffers Johannes Schindelin
2016-09-06 18:43   ` Jeff King
2016-09-08  7:53     ` Johannes Schindelin
2016-09-05 15:45 ` [PATCH 2/3] diff_populate_filespec: NUL-terminate buffers Johannes Schindelin
2016-09-06  7:06   ` Jeff King
2016-09-06 16:02     ` Johannes Schindelin
2016-09-06 18:41       ` Jeff King
2016-09-07 18:31         ` Junio C Hamano
2016-09-08  7:52           ` Johannes Schindelin
2016-09-08  7:49         ` Johannes Schindelin
2016-09-08  8:22           ` Jeff King
2016-09-08 16:57             ` Junio C Hamano
2016-09-08 18:22               ` Johannes Schindelin
2016-09-08 18:48               ` Jeff King
2016-09-05 15:45 ` [PATCH 3/3] diff_grep: add assertions verifying that the buffers are NUL-terminated Johannes Schindelin
2016-09-06  7:08   ` Jeff King
2016-09-06 16:04     ` Johannes Schindelin
2016-09-05 19:10 ` [PATCH 0/3] Fix a segfault caused by regexec() being called on mmap()ed data Junio C Hamano
2016-09-06  7:12   ` Jeff King
2016-09-06 14:06     ` Johannes Schindelin
2016-09-06 18:29       ` Jeff King
2016-09-08  7:29         ` Johannes Schindelin
2016-09-08  8:00           ` Jeff King
2016-09-09 10:09             ` Johannes Schindelin
2016-09-09 17:46               ` Junio C Hamano
2016-09-06 13:21   ` Johannes Schindelin
2016-09-06  6:58 ` Jeff King
2016-09-06 14:13   ` Johannes Schindelin
2016-09-08  7:31 ` [PATCH v2 " Johannes Schindelin
2016-09-08  7:31   ` [PATCH v2 2/3] Introduce a function to run regexec() on non-NUL-terminated buffers Johannes Schindelin
2016-09-08  8:04     ` Jeff King
2016-09-09  9:45       ` Johannes Schindelin
2016-09-09  9:59         ` Jeff King
2016-09-08  7:31   ` [PATCH v2 1/3] Demonstrate a problem: our pickaxe code assumes NUL-terminated buffers Johannes Schindelin
2016-09-08  7:31   ` [PATCH v2 3/3] Use the newly-introduced regexec_buf() function Johannes Schindelin
2016-09-08  7:54     ` Johannes Schindelin
2016-09-08  8:10       ` Jeff King
2016-09-08  8:14         ` Jeff King
2016-09-08  8:35           ` Jeff King
2016-09-08 19:06             ` Ramsay Jones
2016-09-08 19:53               ` Jeff King
2016-09-08 21:30                 ` Junio C Hamano
2016-09-08  7:33   ` [PATCH v2 0/3] Fix a segfault caused by regexec() being called on mmap()ed data Johannes Schindelin
2016-09-08  8:13     ` Jeff King
2016-09-08  7:57   ` [PATCH v3 " Johannes Schindelin
2016-09-08  7:57     ` [PATCH v3 1/3] Demonstrate a problem: our pickaxe code assumes NUL-terminated buffers Johannes Schindelin
2016-09-08  7:58     ` [PATCH v3 2/3] Introduce a function to run regexec() on non-NUL-terminated buffers Johannes Schindelin
2016-09-08 17:03       ` Junio C Hamano
2016-09-08  7:59     ` [PATCH v3 3/3] Use the newly-introduced regexec_buf() function Johannes Schindelin
2016-09-08 17:09       ` Junio C Hamano
2016-09-09  9:52         ` Johannes Schindelin
2016-09-09  9:57           ` Jeff King
2016-09-09 10:41             ` Johannes Schindelin
2016-09-09 17:49           ` Junio C Hamano
2016-09-21 18:23     ` [PATCH v4 0/3] Fix a segfault caused by regexec() being called on mmap()ed data Johannes Schindelin
2016-09-21 18:23       ` [PATCH v4 1/3] regex: -G<pattern> feeds a non NUL-terminated string to regexec() and fails Johannes Schindelin
2016-09-21 18:24       ` [PATCH v4 2/3] regex: add regexec_buf() that can work on a non NUL-terminated string Johannes Schindelin
2016-09-21 19:17         ` Junio C Hamano
2016-09-22 18:38           ` Johannes Schindelin
2016-09-21 18:24       ` [PATCH v4 3/3] regex: use regexec_buf() Johannes Schindelin
2016-09-21 19:18         ` Junio C Hamano
2016-09-21 20:09           ` Junio C Hamano
2016-09-21 22:03         ` Jeff King
2016-09-25 14:01           ` Johannes Schindelin
2016-09-21 22:04       ` [PATCH v4 0/3] Fix a segfault caused by regexec() being called on mmap()ed data Jeff King

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.