[PATCH] builtin/apply.c: use iswspace() to detect line-ending-like chars

* [PATCH] builtin/apply.c: use iswspace() to detect line-ending-like chars
@ 2014-03-20 19:39 George Papanikolaou
  2014-03-21  2:48 ` Eric Sunshine
  2014-03-21 11:14 ` Michael Haggerty
  0 siblings, 2 replies; 9+ messages in thread
From: George Papanikolaou @ 2014-03-20 19:39 UTC (permalink / raw)
  To: gitster; +Cc: git, George Papanikolaou

Removing the bloat of checking for both '\r' and '\n' with the prettier
iswspace() function which checks for other characters as well. (read: \f \t \v)
---

This is one more attempt to clean up the fuzzy_matchlines() function as part of a
microproject for GSoC. The other, more clearly specified microprojects were taken.
I'm obviously planning on applying.

Thanks

Signed-off-by: George 'papanikge' Papanikolaou <g3orge.app@gmail.com>

 builtin/apply.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/builtin/apply.c b/builtin/apply.c
index b0d0986..912a53a 100644
--- a/builtin/apply.c
+++ b/builtin/apply.c
@@ -295,9 +295,9 @@ static int fuzzy_matchlines(const char *s1, size_t n1,
 	int result = 0;
 
 	/* ignore line endings */
-	while ((*last1 == '\r') || (*last1 == '\n'))
+	while (iswspace(*last1))
 		last1--;
-	while ((*last2 == '\r') || (*last2 == '\n'))
+	while (iswspace(*last2))
 		last2--;
 
 	/* skip leading whitespace */
-- 
1.9.0

* Re: [PATCH] builtin/apply.c: use iswspace() to detect line-ending-like chars
  2014-03-20 19:39 [PATCH] builtin/apply.c: use iswspace() to detect line-ending-like chars George Papanikolaou
@ 2014-03-21  2:48 ` Eric Sunshine
       [not found]   ` <CAByyCQBmCTfW0HBL04MMqwm+bDe4Rb6n+MfWdYUQ6M6yW_u=yw@mail.gmail.com>
  2014-03-21 11:14 ` Michael Haggerty
  1 sibling, 1 reply; 9+ messages in thread
From: Eric Sunshine @ 2014-03-21  2:48 UTC (permalink / raw)
  To: George Papanikolaou; +Cc: Junio C Hamano, Git List

On Thu, Mar 20, 2014 at 3:39 PM, George Papanikolaou
<g3orge.app@gmail.com> wrote:
> Removing the bloat of checking for both '\r' and '\n' with the prettier
> iswspace() function which checks for other characters as well. (read: \f \t \v)

Use imperative mood. "Remove" rather than "Removing".

Bloat? Prettier? Subjective stuff.

Did you verify that it is safe to strip all whitespace characters
rather than only line-endings? Perhaps say so in the commit message.

Why the choice of iswspace()? These are normal-width character
strings, so why apply a wide-character function?

More below.

> ---
>
> This is one more attempt to clean up the fuzzy_matchlines() function as part of a
> microproject for GSoC. The other, more clearly specified microprojects were taken.
> I'm obviously planning on applying.
>
> Thanks
>
> Signed-off-by: George 'papanikge' Papanikolaou <g3orge.app@gmail.com>
>
>  builtin/apply.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/builtin/apply.c b/builtin/apply.c
> index b0d0986..912a53a 100644
> --- a/builtin/apply.c
> +++ b/builtin/apply.c
> @@ -295,9 +295,9 @@ static int fuzzy_matchlines(const char *s1, size_t n1,
>         int result = 0;
>
>         /* ignore line endings */
> -       while ((*last1 == '\r') || (*last1 == '\n'))
> +       while (iswspace(*last1))
>                 last1--;
> -       while ((*last2 == '\r') || (*last2 == '\n'))
> +       while (iswspace(*last2))
>                 last2--;

Doesn't this change turn the comment preceding this code into a
half-truth? Perhaps update the comment?

>         /* skip leading whitespace */
> --
> 1.9.0

* Re: [PATCH] builtin/apply.c: use iswspace() to detect line-ending-like chars
  2014-03-20 19:39 [PATCH] builtin/apply.c: use iswspace() to detect line-ending-like chars George Papanikolaou
  2014-03-21  2:48 ` Eric Sunshine
@ 2014-03-21 11:14 ` Michael Haggerty
  2014-03-25  4:54   ` Junio C Hamano
  1 sibling, 1 reply; 9+ messages in thread
From: Michael Haggerty @ 2014-03-21 11:14 UTC (permalink / raw)
  To: George Papanikolaou; +Cc: gitster, git

On 03/20/2014 08:39 PM, George Papanikolaou wrote:
> Removing the bloat of checking for both '\r' and '\n' with the prettier
> iswspace() function which checks for other characters as well. (read: \f \t \v)
> ---
> 
> This is one more attempt to clean up the fuzzy_matchlines() function as part of a
> microproject for GSoC. The other, more clearly specified microprojects were taken.
> I'm obviously planning on applying.
> 
> Thanks
> 
> Signed-off-by: George 'papanikge' Papanikolaou <g3orge.app@gmail.com>
> 
>  builtin/apply.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/builtin/apply.c b/builtin/apply.c
> index b0d0986..912a53a 100644
> --- a/builtin/apply.c
> +++ b/builtin/apply.c
> @@ -295,9 +295,9 @@ static int fuzzy_matchlines(const char *s1, size_t n1,
>  	int result = 0;
>  
>  	/* ignore line endings */
> -	while ((*last1 == '\r') || (*last1 == '\n'))
> +	while (iswspace(*last1))
>  		last1--;
> -	while ((*last2 == '\r') || (*last2 == '\n'))
> +	while (iswspace(*last2))
>  		last2--;
>  
>  	/* skip leading whitespace */
> 

In addition to Eric's comments...

What happens if the string consists *only* of whitespace?

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

* Re: [PATCH] builtin/apply.c: use iswspace() to detect line-ending-like chars
       [not found]   ` <CAByyCQBmCTfW0HBL04MMqwm+bDe4Rb6n+MfWdYUQ6M6yW_u=yw@mail.gmail.com>
@ 2014-03-21 23:07     ` Eric Sunshine
       [not found]     ` <CAPig+cTct-42w5S=OUS_DQ2cD5X9nWa_eUVoFBGTT7nAEahi5g@mail.gmail.com>
  1 sibling, 0 replies; 9+ messages in thread
From: Eric Sunshine @ 2014-03-21 23:07 UTC (permalink / raw)
  To: George Papanikolaou, Git List, Michael Haggerty

[Please reply on-list to review comments. Other people may learn from
the discussion or have comments of their own.]

On Fri, Mar 21, 2014 at 6:00 PM, George Papanikolaou
<g3orge.app@gmail.com> wrote:
> On Fri, Mar 21, 2014 at 4:48 AM, Eric Sunshine <sunshine@sunshineco.com> wrote:
>>
>> Did you verify that it is safe to strip all whitespace characters
>> rather than only line-endings? Perhaps say so in the commit message.
>>
>> Why the choice of iswspace()? These are normal-width character
>> strings, so why apply a wide-character function?
>>
> why not?

Because it's unnecessary and invites confusion from people reading
code since they now have to wonder if there is something unusual and
non-obvious going on. Worse, the two loops immediately below the ones
you changed, as well as the rest of the function, use plain isspace(),
which really ramps up the "huh?"-factor from the reader.

The original code has the asset of being clear and obvious. Changing
these two loops to use a wide-character function makes it less so.
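
A minimal sketch of the difference, assuming only the standard
<ctype.h> isspace(). The helper names is_line_ending() and
is_space_byte() are hypothetical and do not appear in builtin/apply.c.
It puts the two-character check from the original loops next to a
narrow-character whitespace check, which accepts a larger set of bytes:

```c
#include <ctype.h>

/* Hypothetical helper: the test the original loops perform. */
static int is_line_ending(char c)
{
	return c == '\r' || c == '\n';
}

/*
 * Hypothetical helper: narrow-character whitespace test.
 * The cast matters because passing a negative value other than EOF
 * to isspace() is undefined behavior in standard C.
 */
static int is_space_byte(char c)
{
	return isspace((unsigned char)c) != 0;
}
```

With these helpers, ' ' and '\t' are rejected by is_line_ending() but
accepted by is_space_byte(), so swapping one test for the other is a
behavior change and not only a cosmetic one.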

> since at this point it is checking for any non-readable
> characters at the end of the buffer, I figured we should check for the
> "wide-character" function that covers these.

Neither the function comment nor the existing code implies that it is
checking for "any non-readable characters". (I'm not even sure what
that means.) The only thing the existing code says at that point is
that it is ignoring line-endings.

> It is true that the
> comment should change in that matter.
>
> Also why wouldn't it be safe? And how can I check?

You're changing the behavior of the function (assuming I'm reading
correctly), which is why I asked if you verified that doing so was
safe. The existing code considers "foo bar" and "foo bar " to be
different. With your change, they are considered equal, which is
actually more in line with what the function comment says.
Nevertheless, callers may be relying upon the existing behavior.

At the very least, the unit tests should be run as a quick check of
whether this behavior change introduces problems. Manual inspection of
callers also wouldn't hurt.

There's also the issue that Michael raised when he asked what would
happen if either string was composed of whitespace only. The existing
code is not robust and can crash, but your change may increase the
likelihood of the crash.
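
One possible way to make the trailing scan robust, sketched under the
assumption that the caller has a pointer and a length as
fuzzy_matchlines() does. The function name strip_line_ending() is
hypothetical. It works on the length instead of walking a pointer
backwards, so it cannot step before the start of the buffer:

```c
#include <stddef.h>

/*
 * Hypothetical helper: return the length of s[0..n) with trailing
 * '\r' and '\n' bytes dropped. The n > 0 test is checked first, so
 * s[n - 1] is never evaluated for an empty range.
 */
static size_t strip_line_ending(const char *s, size_t n)
{
	while (n > 0 && (s[n - 1] == '\r' || s[n - 1] == '\n'))
		n--;
	return n;
}
```

A line that consists only of line-ending characters, or an empty line,
yields a length of 0 here instead of reading out of bounds.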

> Thanks
>
> --
> papanikge's surrogate email.
> I may reply back.
> http://www.5slingshots.com/

* Re: [PATCH] builtin/apply.c: use iswspace() to detect line-ending-like chars
       [not found]     ` <CAPig+cTct-42w5S=OUS_DQ2cD5X9nWa_eUVoFBGTT7nAEahi5g@mail.gmail.com>
@ 2014-03-22  9:33       ` George Papanikolaou
  2014-03-23  9:35         ` Eric Sunshine
  0 siblings, 1 reply; 9+ messages in thread
From: George Papanikolaou @ 2014-03-22  9:33 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: Git List, Michael Haggerty

On Sat, Mar 22, 2014 at 12:46 AM, Eric Sunshine <sunshine@sunshineco.com> wrote:
>
> Because it's unnecessary and invites confusion from people reading the
> code since they now have to wonder if there is something unusual and
> non-obvious going on. Worse, the two loops immediately below the ones you
> changed, as well as the rest of the function, use plain isspace(),
> which really ramps up the "huh?"-factor from the reader.
>
> The original code has the asset of being clear and obvious. Changing
> these two loops to use a wide-character function makes it less so.
>

Yes I understand it does add a factor of ambiguity.

>
> Neither the function comment nor the existing code implies that it is
> checking for "any non-readable characters". (I'm not even sure what
> that means.) The only thing the existing code says at that point is
> that it is ignoring line-endings.
>

I mean characters that are not printable, unlike letters, numbers, dots, etc.

>
> You're changing the behavior of the function (assuming I'm reading it
> correctly), which is why I asked if you verified that doing so was
> safe. The existing code considers "foo bar" and "foo bar " to be
> different. With your change, they are considered equal, which is
> actually more in line with what the function comment says.
> Nevertheless, callers may be relying upon the existing behavior.
>
> At the very least, the unit tests should be run as a quick check of
> whether this behavior change introduces problems. Manual inspection
> of callers also wouldn't hurt.
>

I did not think about that possibility, because I ran `make` and the
tests passed so I thought that that would be ok.

Anyway, do you have any ideas on how to improve that function?

Thanks again for the feedback.

-- 
papanikge's surrogate email.
I may reply back.
http://www.5slingshots.com/

* Re: [PATCH] builtin/apply.c: use iswspace() to detect line-ending-like chars
  2014-03-22  9:33       ` George Papanikolaou
@ 2014-03-23  9:35         ` Eric Sunshine
  0 siblings, 0 replies; 9+ messages in thread
From: Eric Sunshine @ 2014-03-23  9:35 UTC (permalink / raw)
  To: George Papanikolaou; +Cc: Git List, Michael Haggerty

On Sat, Mar 22, 2014 at 5:33 AM, George Papanikolaou
<g3orge.app@gmail.com> wrote:
> On Sat, Mar 22, 2014 at 12:46 AM, Eric Sunshine <sunshine@sunshineco.com> wrote:
>> Because it's unnecessary and invites confusion from people reading the
>> code since they now have to wonder if there is something unusual and
>> non-obvious going on. Worse, the two loops immediately below the ones you
>> changed, as well as the rest of the function, use plain isspace(),
>> which really ramps up the "huh?"-factor from the reader.
>>
>> The original code has the asset of being clear and obvious. Changing
>> these two loops to use a wide-character function makes it less so.
>>
> Yes I understand it does add a factor of ambiguity.
>
>> Neither the function comment nor the existing code implies that it is
>> checking for "any non-readable characters". (I'm not even sure what
>> that means.) The only thing the existing code says at that point is
>> that it is ignoring line-endings.
>>
> I mean characters that are not printable, unlike letters, numbers, dots, etc.

It's still not clear how this answer relates to my question about why
you used iswspace() rather than isspace().

Nothing in the code or comments indicates that it wants to ignore
non-printing characters. Even if the intention of your change had
indeed been to ignore such characters, you would have used !isprint()
or !iswprint().
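
To make the distinction concrete, here is a small sketch assuming the
standard <ctype.h> classification in the default "C" locale. The helper
names is_ws() and is_nonprinting() are hypothetical. "Whitespace" and
"non-printing" overlap but are not the same set:

```c
#include <ctype.h>

/* Hypothetical helper: true for whitespace bytes. */
static int is_ws(char c)
{
	return isspace((unsigned char)c) != 0;
}

/* Hypothetical helper: true for bytes that are not printable. */
static int is_nonprinting(char c)
{
	return !isprint((unsigned char)c);
}
```

In the "C" locale a space is whitespace yet printable, the BEL
character '\a' is non-printing yet not whitespace, and a tab is both.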

>> You're changing the behavior of the function (assuming I'm reading it
>> correctly), which is why I asked if you verified that doing so was
>> safe. The existing code considers "foo bar" and "foo bar " to be
>> different. With your change, they are considered equal, which is
>> actually more in line with what the function comment says.
>> Nevertheless, callers may be relying upon the existing behavior.
>>
>> At the very least, the unit tests should be run as a quick check of
>> whether this behavior change introduces problems. Manual inspection
>> of callers also wouldn't hurt.
>>
> I did not think about that possibility, because I ran `make` and the
> tests passed so I thought that that would be ok.

Unit tests may cover a lot of functionality, but there will always be
holes in the coverage. Thus, it's a good idea to examine callers and
surrounding code manually, as well.

Since this is a behavior change, it deserves mention in the commit
message, as well as assurance that you verified (as best you can) that
it did not break existing callers. (It also wouldn't hurt to mention
that it brings the code more in line with the function documentation.)

> Anyway, do you have any ideas on how to improve that function?

Michael gave you a strong clue when he asked what would happen, with
your change in place, if the string consisted only of whitespace. The
loops you touched are already fragile, even without your change.
Making them more robust would likely be considered an improvement.

> Thanks again for the feedback.
>
> --
> papanikge's surrogate email.
> I may reply back.
> http://www.5slingshots.com/

* Re: [PATCH] builtin/apply.c: use iswspace() to detect line-ending-like chars
  2014-03-21 11:14 ` Michael Haggerty
@ 2014-03-25  4:54   ` Junio C Hamano
  2014-03-26 16:58     ` George Papanikolaou
  0 siblings, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2014-03-25  4:54 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: George Papanikolaou, git

Michael Haggerty <mhagger@alum.mit.edu> writes:

>> -	while ((*last1 == '\r') || (*last1 == '\n'))
>> +	while (iswspace(*last1))
>>  		last1--;
>> -	while ((*last2 == '\r') || (*last2 == '\n'))
>> +	while (iswspace(*last2))
>>  		last2--;
>>  
>>  	/* skip leading whitespace */
>> 
>
> In addition to Eric's comments...
>
> What happens if the string consists *only* of whitespace?

Also, why would casting char to wchar_t without any conversion be
safe and/or sane?

I would sort-of understand if the change were to use isspace(), but
I do not think that is a correct conversion, either.  Isn't a pair
of strings "a bc" and "a bc " supposed not to match?

My understanding is that two strings that differ only at places
where they have runs of whitespaces whose length differ are to
compare the same, e.g. "a_bc__" and "a__bc_" (SP replaced with _ to
make them stand out).  Ignoring whitespace change is very different
from ignoring all whitespaces (the latter of which would make "a b"
and "ab" match).

As a tangent, I have a suspicion that the current implementation may
be wrong at the beginning of the string.  Wouldn't it match " abc"
and "abc", even though these two strings shouldn't match?
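
The semantics described above can be sketched as follows. This is an
illustration only, not the code in builtin/apply.c, and the names
is_ws() and match_ignoring_ws_amount() are hypothetical. It assumes
the strict reading in which a run of whitespace must be present on
both sides at the same place, and only its length may differ:

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: narrow-character whitespace test. */
static int is_ws(char c)
{
	return isspace((unsigned char)c) != 0;
}

/*
 * Hypothetical sketch: return 1 if a[0..na) and b[0..nb) are equal
 * once the length of each whitespace run is ignored, else 0.
 * Leading and trailing runs get no special treatment.
 */
static int match_ignoring_ws_amount(const char *a, size_t na,
				    const char *b, size_t nb)
{
	size_t i = 0, j = 0;

	while (i < na && j < nb) {
		if (is_ws(a[i]) && is_ws(b[j])) {
			/* both sides have a run here: skip each run whole */
			while (i < na && is_ws(a[i]))
				i++;
			while (j < nb && is_ws(b[j]))
				j++;
		} else if (a[i] == b[j]) {
			i++;
			j++;
		} else {
			/* differing bytes, or a run on one side only */
			return 0;
		}
	}
	/* match only if both inputs were consumed completely */
	return i == na && j == nb;
}
```

Under this assumption "a bc  " and "a  bc " match, while "a b" versus
"ab", " abc" versus "abc", and "a bc" versus "a bc " do not. Whether
that is the behavior the option should have is exactly the open
question in this message.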

* Re: [PATCH] builtin/apply.c: use iswspace() to detect line-ending-like chars
  2014-03-25  4:54   ` Junio C Hamano
@ 2014-03-26 16:58     ` George Papanikolaou
  2014-03-26 18:02       ` Junio C Hamano
  0 siblings, 1 reply; 9+ messages in thread
From: George Papanikolaou @ 2014-03-26 16:58 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Michael Haggerty, Git List

On Tue, Mar 25, 2014 at 6:54 AM, Junio C Hamano <gitster@pobox.com> wrote:
> As a tangent, I have a suspicion that the current implementation may
> be wrong at the beginning of the string.  Wouldn't it match " abc"
> and "abc", even though these two strings shouldn't match?

Wouldn't that be accomplished by just removing the leading whitespace check?

I'm somewhat confused about what the function should match. I haven't
grasped it.

--
papanikge's surrogate email.
I may reply back.
http://www.5slingshots.com/

* Re: [PATCH] builtin/apply.c: use iswspace() to detect line-ending-like chars
  2014-03-26 16:58     ` George Papanikolaou
@ 2014-03-26 18:02       ` Junio C Hamano
  0 siblings, 0 replies; 9+ messages in thread
From: Junio C Hamano @ 2014-03-26 18:02 UTC (permalink / raw)
  To: George Papanikolaou; +Cc: Michael Haggerty, Git List

George Papanikolaou <g3orge.app@gmail.com> writes:

> On Tue, Mar 25, 2014 at 6:54 AM, Junio C Hamano <gitster@pobox.com> wrote:
>> As a tangent, I have a suspicion that the current implementation may
>> be wrong at the beginning of the string.  Wouldn't it match " abc"
>> and "abc", even though these two strings shouldn't match?
>
> Wouldn't that be accomplished by just removing the leading whitespace check?

Yes.  I was wondering *what* semantics we want in the first place;
how to implement what I suggested is so trivial that it goes without
saying for the intended audiences of that comment ;-).

> I'm somewhat confused about what the function should match. I haven't
> grasped it.

This function is used when attempting to resurrect a patch that is
whitespace-damaged.  The patch may want to change a line "a_bc" in
the original into something else [*1*], and we may not find "a_bc"
in the current source, but there may be "a__bc" (two spaces instead
of one the whitespace-damaged patch claims to expect).  By ignoring
the amount of whitespaces, it forces "git apply" to consider that
"a_bc" in the broken patch meant to refer to "a__bc" in reality.

I _think_ the original motivation of ignore_ws_change was to match
the "--ignore-space-change" option of "diff", i.e. "ignore changes
in the amount of white space".  I just checked the source
(xdiff/xutils.c) and made sure that "git diff" does not treat the
beginning of line any differently hence "_a_bc" and "a_bc" are not
considered a match under its --ignore-space-change option.

The current implementation of "apply --ignore-space-change" that
ignores leading whitespaces (not "ignore changes in the amount of
leading whitespaces") is likely to be a bug from this point of view.

But I wanted to hear opinions from other Git experts [*2*].  Hence
my "As a tangent, I have a suspicion".

[Footnote]

*1* This mode is not enabled by default.  I am not even sure if
    anybody sane would (or should) use this option.  Sure, the fuzzy
    match may be able to find the original line that the patch
    author may have meant to patch even when it is whitespace-damaged
    because it does not fully trust what the original lines exactly
    say (i.e. context lines prefixed by " " and old lines prefixed
    by "-").  What makes it sane for us to trust what the
    replacement lines (i.e. new lines prefixed by "+") in such a
    mangled patch says?

*2* For example, somebody may be able to point out that "this is
    meant to match the option of the same name 'diff' has", which is
    my assumption that leads to the above discussion, may not be
    true.
