git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Dragan Simic <dsimic@manjaro.org>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org, rsbecker@nexbridge.com, github@seichter.de
Subject: Re: [PATCH 1/3] config.txt: describe whitespace characters further and more accurately
Date: Thu, 14 Mar 2024 07:20:00 +0100	[thread overview]
Message-ID: <ff7b0a2ead90ad9a9456141da5e4df4a@manjaro.org> (raw)
In-Reply-To: <xmqqjzm51ugt.fsf@gitster.g>

On 2024-03-14 02:18, Junio C Hamano wrote:
> Dragan Simic <dsimic@manjaro.org> writes:
>> -The syntax is fairly flexible and permissive; whitespaces are mostly
>> -ignored.  The '#' and ';' characters begin comments to the end of 
>> line,
>> -blank lines are ignored.
>> +The syntax is fairly flexible and permissive.  Whitespace characters,
>> +which in this context are the space character (SP) and the horizontal
>> +tabulation (HT), are mostly ignored.  The '#' and ';' characters 
>> begin
>> +comments to the end of line.  Blank lines are ignored.
> 
> OK, except for "whitespace characters"---do we need to say
> "whitespace characters", after we already listed HT and SP are the
> ones, instead of just "whitespaces"?

I also spent some time thinking about that.  To me, the plural form, 
i.e.
"whitespaces", simply doesn't sound very good, because "whitespace" 
feels
to me more like a mass noun, and I really haven't seen it used in plural
form in other projects.

>>  A line that defines a value can be continued to the next line by
>> +ending it with a `\`; the backslash and the end-of-line are stripped.
>> +Leading whitespace characters after 'name =', the remainder of the
>>  line after the first comment character '#' or ';', and trailing
>> +whitespace characters of the line are discarded unless they are 
>> enclosed
>> +in double quotes.  This discarding of the trailing whitespace 
>> characters
>> +also applies after the remainder of the line after the comment 
>> character
>> +is discarded.
> 
> "also" makes it sound as if we do it twice, once to remove trailing
> whitespaces after the remainder of the line after '#", and then trim
> the trailing whitespaces after we removed the comment.

Good point, I also felt it the same way, but went with such wording 
simply
because I thought it should be more understandable to the users, despite
being technically a bit incorrect.

> I wonder if we can make it clearer by following the step-by-step
> nature of the earlier part of the paragraph through.  We already say
> the folded line processing is done first, so break things down in
> conceptual phases/steps, perhaps like
> 
>  * The backslash at the end-of-line is removed, together with the
>    end-of-line, to form a single long line.
> 
>  * Anything that come after the first unquoted comment character,
>    either '#' or ';', are discarded.
> 
>  * The leading and trailing whitespaces around the value part
>    (i.e. what follows 'name =') are discarded.
> 
>  * Remaining unquoted whitespaces inside the value part are munged.

Hmm, I'm not really sure that such a description would be more clear
to the users, despite being technically more correct.  I'll think a bit
more about it.

>> +Any number of internal whitespace characters found within
>> +the value are converted to the same number of space (SP) characters.
> 
> The last one sounds like a bug to me, by the way.
> 
> At least the very original 17712991 (Add ".git/config" file parser,
> 2005-10-10) squashed a run of whitespace characters into a single
> SP, which makes sense as a "clean-up".
> 
> But ebdaae37 (config: Keep inner whitespace verbatim, 2009-07-30),
> while claiming to "Keep" inner whitespaces, broke it by replacing
> any isspace() bytes that are not SP with SP, contradicting its
> stated purpose.

Thank you for the investigation.  The ebdaae37 commit certainly
introduced a bug to the value parsing, which presumably has remained
undected because the included test passes.

The way I see it, fixing the bug may actually be a breaking change,
because some user configurations may actually rely on the current
(mis)behavior.  This makes me somewhat afraid that fixing this bug,
which I already thought about, may actually do more harm than good.
However, fixing this bug seems to be only right thing to do, which
I'll explain further below.

> As the latest change by the author of that change is from more than
> 10 years ago, I do not expect that he is still interested in this
> part of the codebase, but thanks to a very clearly written log
> message, we can read what the motivation behind that change was, and
> seeing that what the code does contradicts with the stated
> motivation we can safely declare that this is an ancient bug.

Agreed, the evidence is clear.

> Fixing that bug can of course be left outside the series.  For those
> who are looking for microproject ideas who discovered this message
> by searching for the #leftoverbits keyword, one possible fix would
> be to revert ebdaae37, make sure a value with any whitespace in it
> gets quoted, and document clearly that an unquoted run of
> whitespaces is squashed into a single SP.  Another way that is
> milder is to finish what ebdaae37 wanted to do and retain the
> whitespaces "verbatim".

I already though about fixing the bug so the value parser actually does
what git-config(1) currently says, but as I already noted above, I'm
afraid a bit that fixing this bug may actually do more harm than good.

Though, further investigation shows that setting a configuration value,
by invoking git-config(1), converts value-internal tabs into "\t" escape
sequences, which the value-parsing logic doesn't "squash" into spaces.
That's why the test included in the ebdaae37 commit passes.  On the 
other
hand, value-internal literal tab characters, found in a configuration
file, do get "squashed" by the value-parsing logic, so I'd say that the
only right thing to do is to fix this bug by making the value-internal
whitespace characters preserved verbatim.

I'd be happy to include the bugfix into this series, if my 
above-mentioned
fears prove to be unnecessary.

  reply	other threads:[~2024-03-14  6:20 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-03-12 15:55 [PATCH 0/3] Improve the documentation and test coverage for whitespace and comments Dragan Simic
2024-03-12 15:55 ` [PATCH 1/3] config.txt: describe whitespace characters further and more accurately Dragan Simic
2024-03-14  1:18   ` Junio C Hamano
2024-03-14  6:20     ` Dragan Simic [this message]
2024-03-14 16:45       ` Junio C Hamano
2024-03-14 18:48         ` Dragan Simic
2024-03-12 15:55 ` [PATCH 2/3] config.txt: perform some minor reformatting Dragan Simic
2024-03-14  1:58   ` Junio C Hamano
2024-03-14  6:20     ` Dragan Simic
2024-03-14 16:22       ` Junio C Hamano
2024-03-14 18:40         ` Dragan Simic
2024-03-12 15:55 ` [PATCH 3/3] t1300: add tests for internal whitespace and inline comments Dragan Simic
2024-03-14  2:18   ` Junio C Hamano
2024-03-14  6:20     ` Dragan Simic

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ff7b0a2ead90ad9a9456141da5e4df4a@manjaro.org \
    --to=dsimic@manjaro.org \
    --cc=git@vger.kernel.org \
    --cc=github@seichter.de \
    --cc=gitster@pobox.com \
    --cc=rsbecker@nexbridge.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).