[PATCH] utf8.c: fix strbuf_utf8_replace copying the last NUL to dst string

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] utf8.c: fix strbuf_utf8_replace copying the last NUL to dst string
@ 2014-07-29 13:10 Nguyễn Thái Ngọc Duy
  2014-07-29 19:56 ` Junio C Hamano
  2014-08-10  7:05 ` [PATCH v2] utf8.c: fix strbuf_utf8_replace() consuming data beyond input string Nguyễn Thái Ngọc Duy
  0 siblings, 2 replies; 5+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2014-07-29 13:10 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy

When utf8_width(&src) is called with *src == NULL (because the
source string ends with an ansi sequence), it returns 0 and steps
'src' by one. This stepping makes strbuf_utf8_replace add NUL to the
destination string at the end of the loop. Check and break the loop
early.

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 utf8.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/utf8.c b/utf8.c
index b30790d..cd090a1 100644
--- a/utf8.c
+++ b/utf8.c
@@ -381,6 +381,8 @@ void strbuf_utf8_replace(struct strbuf *sb_src, int pos, int width,
 			src += n;
 			dst += n;
 		}
+		if (src >= end)
+			break;
 
 		old = src;
 		n = utf8_width((const char**)&src, NULL);
-- 
2.1.0.rc0.66.gb9187ad

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] utf8.c: fix strbuf_utf8_replace copying the last NUL to dst string
  2014-07-29 13:10 [PATCH] utf8.c: fix strbuf_utf8_replace copying the last NUL to dst string Nguyễn Thái Ngọc Duy
@ 2014-07-29 19:56 ` Junio C Hamano
  2014-07-30 10:20   ` Duy Nguyen
  2014-08-10  7:05 ` [PATCH v2] utf8.c: fix strbuf_utf8_replace() consuming data beyond input string Nguyễn Thái Ngọc Duy
  1 sibling, 1 reply; 5+ messages in thread
From: Junio C Hamano @ 2014-07-29 19:56 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy; +Cc: git

Nguyễn Thái Ngọc Duy  <pclouds@gmail.com> writes:

> When utf8_width(&src) is called with *src == NULL (because the
> source string ends with an ansi sequence),

I am not sure what you mean by "because" here.  Do you mean somebody
(who?) decides to call the utf8_width() with NULL pointer stored in
*src because of "ansi sequence"?

What do you mean by "ansi sequence"?  I'll assume that you mean
those terminal control that all use bytes with hi-bit clear.

At the very beginning of utf8_width(), *start can be cleared to
point at a NULL pointer by pick_one_utf8_char() if the pointer that
comes into utf8_width() originally points at an invalid UTF-8
string, but as far as I can see, ESC (or any bytes that would be
used in those terminal control sequences like colors and cursor
control) will simply be returned as a single byte, without going
into error path that clears *start = NULL.

Puzzled...

> it returns 0 and steps
> 'src' by one. 

Here "it" refers to utf8_width()?  Who steps 'src' by one?

Ahh, did you mean *src == NUL, i.e. "already at the end of the
string"?

I think utf8_width() called with an empty string should not move the
pointer past that end-of-string NUL in the first place.  It makes me
wonder if it would be a better fix to make it not to do that (and
return 0), but if we declare it is the caller's fault, perhaps we
may want to add

	if (!**start)
        	die("BUG: empty string to utf8_width()???");

at the very beginning of utf8_width(), even before it calls
pick-one-utf8-char.

Still puzzled...

> This stepping makes strbuf_utf8_replace add NUL to the
> destination string at the end of the loop. Check and break the loop
> early.
>
> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
> ---
>  utf8.c | 2 ++
>  1 file changed, 2 insertions(+)

Tests?

> diff --git a/utf8.c b/utf8.c
> index b30790d..cd090a1 100644
> --- a/utf8.c
> +++ b/utf8.c
> @@ -381,6 +381,8 @@ void strbuf_utf8_replace(struct strbuf *sb_src, int pos, int width,
>  			src += n;
>  			dst += n;
>  		}
> +		if (src >= end)
> +			break;
>  
>  		old = src;
>  		n = utf8_width((const char**)&src, NULL);

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] utf8.c: fix strbuf_utf8_replace copying the last NUL to dst string
  2014-07-29 19:56 ` Junio C Hamano
@ 2014-07-30 10:20   ` Duy Nguyen
  2014-07-30 18:23     ` Junio C Hamano
  0 siblings, 1 reply; 5+ messages in thread
From: Duy Nguyen @ 2014-07-30 10:20 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Tue, Jul 29, 2014 at 12:56:24PM -0700, Junio C Hamano wrote:
> Nguyễn Thái Ngọc Duy  <pclouds@gmail.com> writes:
> 
> > When utf8_width(&src) is called with *src == NULL (because the
> > source string ends with an ansi sequence),
> 
> I am not sure what you mean by "because" here.  Do you mean somebody
> (who?) decides to call the utf8_width() with NULL pointer stored in
> *src because of "ansi sequence"?
> 
> What do you mean by "ansi sequence"?  I'll assume that you mean
> those terminal control that all use bytes with hi-bit clear.

OK let's try with an example, strbuf_utf8_replace(&sb, 0, 0, "") where
"sb" contains "\033[m". The expected result is exactly the same string
with length of 3. After this block

		while ((n = display_mode_esc_sequence_len(src))) {
			memcpy(dst, src, n);
			src += n;
			dst += n;
		}

src will be right after 'm', *src is NUL (iow. src == end). After

		n = utf8_width((const char**)&src, NULL);

n is zero but 'src' is advanced by another character, so
src - old_src == 1. This NUL character is counted as part of the
string, so after the _setlen, we have the same string with length of
_4_ (instead of 3).

> At the very beginning of utf8_width(), *start can be cleared to
> point at a NULL pointer by pick_one_utf8_char() if the pointer that
> comes into utf8_width() originally points at an invalid UTF-8
> string, but as far as I can see, ESC (or any bytes that would be
> used in those terminal control sequences like colors and cursor
> control) will simply be returned as a single byte, without going
> into error path that clears *start = NULL.
> 
> Puzzled...
> 
> > it returns 0 and steps 'src' by one. 
> 
> Here "it" refers to utf8_width()?  Who steps 'src' by one?

utf8_width() steps 'src'.

> 
> Ahh, did you mean *src == NUL, i.e. "already at the end of the
> string"?

Yes.. I guess you have a better commit message prepared for me now :)

> I think utf8_width() called with an empty string should not move the
> pointer past that end-of-string NUL in the first place.  It makes me
> wonder if it would be a better fix to make it not to do that (and
> return 0), but if we declare it is the caller's fault, perhaps we
> may want to add
> 
> 	if (!**start)
>         	die("BUG: empty string to utf8_width()???");
> 
> at the very beginning of utf8_width(), even before it calls
> pick-one-utf8-char.
> 
> Still puzzled...

I don't know. I mean, in a UTF-8 byte sequence, NUL is a valid
character (part of the ASCII subset), so advancing '*start' by one
makes sense. Whether we have any call sites crazy enough to do that on
purpose is a different matter.

> > This stepping makes strbuf_utf8_replace add NUL to the
> > destination string at the end of the loop. Check and break the loop
> > early.
> >
> > Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
> > ---
> >  utf8.c | 2 ++
> >  1 file changed, 2 insertions(+)
> 
> Tests?

Ugh.. this was triggered by one of the alignment specifier in log
--pretty. Probably easier to export strbuf_utf8_replace() as a
test-command and verify the output?

--
Duy

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] utf8.c: fix strbuf_utf8_replace copying the last NUL to dst string
  2014-07-30 10:20   ` Duy Nguyen
@ 2014-07-30 18:23     ` Junio C Hamano
  0 siblings, 0 replies; 5+ messages in thread
From: Junio C Hamano @ 2014-07-30 18:23 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: git

Duy Nguyen <pclouds@gmail.com> writes:

>> > it returns 0 and steps 'src' by one. 
>> 
>> Here "it" refers to utf8_width()?  Who steps 'src' by one?
>
> utf8_width() steps 'src'.
>
>> 
>> Ahh, did you mean *src == NUL, i.e. "already at the end of the
>> string"?
>
> Yes.. I guess you have a better commit message prepared for me now :)

Heh.  At least you realized that the log message was uninformative ;-)

>> I think utf8_width() called with an empty string should not move the
>> pointer past that end-of-string NUL in the first place.  It makes me
>> wonder if it would be a better fix to make it not to do that (and
>> return 0), but if we declare it is the caller's fault, perhaps we
>> may want to add
>> 
>> 	if (!**start)
>>         	die("BUG: empty string to utf8_width()???");
>> 
>> at the very beginning of utf8_width(), even before it calls
>> pick-one-utf8-char.
>> 
>> Still puzzled...
>
> I don't know. I mean, in a UTF-8 byte sequence, NUL is a valid
> character (part of the ASCII subset), so advancing '*start' by one
> makes sense.

Dubious.  NUL may be valid but it is valid only in the sense that it
will mark the end of the string, i.e. the caller is expected to do:

        const char *string = ...;

        while (not reached the end of the string) {
                this_char = pick_one_char(&string);
                do something to this_char;
        }

and there are two ways to structure such a loop:

 (1) Make pick-one-char signal the termination condition.  i.e.

	while ((this_char = pick_one(&string)) != EOF)
		do something to this_char;

     That's a typical getchar()-style loop.  It would better not
     advance string when it returns EOF.

 (2) Make it the responsibility of the caller not to call beyond the
     end. i.e.

	while (string < end) {
        	this_char = pick_one_char(&string);
		do something to this char;
	}

and your patch takes the latter, which I think is in line with the
way how existing callchain works.  So the addition of "BUG" merely
is to make sure that everybody follows that same convention.

We cannot declare that it is caller's responsibility to feed
sensible input to the function only at one callsite and allow other
callsites feed garbage to the same function, after all, no?

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH v2] utf8.c: fix strbuf_utf8_replace() consuming data beyond input string
  2014-07-29 13:10 [PATCH] utf8.c: fix strbuf_utf8_replace copying the last NUL to dst string Nguyễn Thái Ngọc Duy
  2014-07-29 19:56 ` Junio C Hamano
@ 2014-08-10  7:05 ` Nguyễn Thái Ngọc Duy
  1 sibling, 0 replies; 5+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2014-08-10  7:05 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy

The main loop in strbuf_utf8_replace() could summed up as:

  while ('src' is still valid) {
    1) advance 'src' to copy ANSI escape sequences
    2) advance 'src' to copy/replace visible characters
  }

The problem is after #1, 'src' may have reached the end of the string
(so 'src' points to NUL) and #2 will continue to copy that NUL as if
it's a normal character. Because the output is stored in a strbuf,
this NULL accounted in the 'len' field as well. Check after #1 and
break the loop if necessary.

The test does not look obvious, but the combination of %>>() should
make a call trace like this

  show_log()
  pretty_print_commit()
  format_commit_message()
  strbuf_expand()
  format_commit_item()
  format_and_pad_commit()
  strbuf_utf8_replace()

where %C(auto)%d would insert a color reset escape sequence in the end
of the string given to strbuf_utf8_replace() and show_log() uses
fwrite() to send everything to stdout (including the incorrect NUL
inserted by strbuf_utf8_replace)

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 t/t4205-log-pretty-formats.sh | 7 +++++++
 utf8.c                        | 3 +++
 2 files changed, 10 insertions(+)

diff --git a/t/t4205-log-pretty-formats.sh b/t/t4205-log-pretty-formats.sh
index 349c531..de0cc4a 100755
--- a/t/t4205-log-pretty-formats.sh
+++ b/t/t4205-log-pretty-formats.sh
@@ -431,6 +431,13 @@ EOF
 	test_cmp expected actual
 '
 
+test_expect_success 'strbuf_utf8_replace() not producing NUL' '
+	git log --color --pretty="tformat:%<(10,trunc)%s%>>(10,ltrunc)%C(auto)%d" |
+		test_decode_color |
+		nul_to_q >actual &&
+	! grep Q actual
+'
+
 # get new digests (with no abbreviations)
 head1=$(git rev-parse --verify HEAD~0) &&
 head2=$(git rev-parse --verify HEAD~1) &&
diff --git a/utf8.c b/utf8.c
index b30790d..401a6a5 100644
--- a/utf8.c
+++ b/utf8.c
@@ -382,6 +382,9 @@ void strbuf_utf8_replace(struct strbuf *sb_src, int pos, int width,
 			dst += n;
 		}
 
+		if (src >= end)
+			break;
+
 		old = src;
 		n = utf8_width((const char**)&src, NULL);
 		if (!src) 	/* broken utf-8, do nothing */
-- 
2.1.0.rc0.78.gc0d8480

^ permalink raw reply related	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2014-08-10  7:09 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-29 13:10 [PATCH] utf8.c: fix strbuf_utf8_replace copying the last NUL to dst string Nguyễn Thái Ngọc Duy
2014-07-29 19:56 ` Junio C Hamano
2014-07-30 10:20   ` Duy Nguyen
2014-07-30 18:23     ` Junio C Hamano
2014-08-10  7:05 ` [PATCH v2] utf8.c: fix strbuf_utf8_replace() consuming data beyond input string Nguyễn Thái Ngọc Duy

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.