Should the --encoding argument to log/show commands make any guarantees about their output?

All of lore.kernel.org
 help / color / mirror / Atom feed

* Should the --encoding argument to log/show commands make any guarantees about their output?
@ 2015-06-15  8:50 Jan-Philip Gehrcke
  2015-06-15 16:21 ` Torsten Bögershausen
  2015-06-17 16:42 ` Junio C Hamano
  0 siblings, 2 replies; 9+ messages in thread
From: Jan-Philip Gehrcke @ 2015-06-15  8:50 UTC (permalink / raw)
  To: git

Hello,

I was surprised to see that the output of

     git log --encoding=utf-8 "--format=format:%b"

can contain byte sequences that are invalid in UTF-8. Note: I am using 
git 2.1.4 and the %b format specifier represents the commit message body.

I have seen this with the Linux git repository and the following test:

     git log --encoding=utf-8 "--format=format:%b" | python2 -c \
         'import sys; [l.decode("utf-8") for l in sys.stdin]'

Soon enough errors like this appears:

     'utf8' codec can't decode byte 0xf6 in position 19

The help message to the --encoding argument reads:

> The commit objects record the encoding used for the log message in
> their encoding header; this option can be used to tell the command to
> re-code the commit log message in the encoding preferred by the user

I realize that this message does not give any guarantee about the output 
of the command, in the sense that --encoding=utf-8 produces valid UTF-8 
data in all cases.

However, I wonder what --encoding precisely does and if it has the 
behavior most users would expect.

Let me describe what I think it currently does:

The program attempts to re-code a log message, so it follows the chain

	raw input -> unicode -> raw output

For the first step, knowledge about the input encoding is required. This 
is retrieved from the encoding header of the commit object if present or 
(from the docs) "lack of this header implies that the commit log message 
is encoded in UTF-8." If this step fails (if the entry contains a byte 
sequence that is invalid in the specified/assumed input codec), the 
procedure is aborted and the data is dumped as is (obviously without 
applying the requested output encoding).

Is that correct?

 From my point of view the most natural abstraction of a log *message* 
is *text*, not bytes. The same is true for author names. If I want to 
build a tool chain on top of log/show, this usually means that I want to 
work with text information. Hence, I want to retrieve text (a sequence 
of code points) from git show/log. Text must be transported in encoded 
form, sure, but it must not contain byte sequences that are invalid in 
this codec. Because otherwise it's just not text anymore.

Hence, from my point of view, the rational that git show/log should be 
able to output *text* information means that they should not emit byte 
sequences that are invalid in the codec specified via the --encoding 
argument. In the current situation, the work of dealing with invalid 
byte sequences is just outsourced to software further below in the tool 
chain (at some point a replacement character � should be displayed to 
the user instead of the invalid raw bytes).

I am not entirely sure where this discussion should lead to. However, I 
think that if the behavior of the software will not be changed, then the 
documentation for the --encoding option should be more precise and 
clarify what actually happens behind the scenes. What do you think?

Cheers,

Jan-Philip Gehrcke

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Should the --encoding argument to log/show commands make any guarantees about their output?
  2015-06-15  8:50 Should the --encoding argument to log/show commands make any guarantees about their output? Jan-Philip Gehrcke
@ 2015-06-15 16:21 ` Torsten Bögershausen
  2015-06-16  9:38   ` Jan-Philip Gehrcke
  2015-06-17 16:42 ` Junio C Hamano
  1 sibling, 1 reply; 9+ messages in thread
From: Torsten Bögershausen @ 2015-06-15 16:21 UTC (permalink / raw)
  To: Jan-Philip Gehrcke, git

On 2015-06-15 10.50, Jan-Philip Gehrcke wrote:
> Hello,
> 
> I was surprised to see that the output of
> 
>     git log --encoding=utf-8 "--format=format:%b"
> 
> can contain byte sequences that are invalid in UTF-8. Note: I am using git 2.1.4 and the %b format specifier represents the commit message body.
> 
> I have seen this with the Linux git repository and the following test:
> 
>     git log --encoding=utf-8 "--format=format:%b" | python2 -c \
>         'import sys; [l.decode("utf-8") for l in sys.stdin]'
> 
> Soon enough errors like this appears:
> 
>     'utf8' codec can't decode byte 0xf6 in position 19
> 
> The help message to the --encoding argument reads:
> 
>> The commit objects record the encoding used for the log message in
>> their encoding header; this option can be used to tell the command to
>> re-code the commit log message in the encoding preferred by the user
> 
> I realize that this message does not give any guarantee about the output of the command, in the sense that --encoding=utf-8 produces valid UTF-8 data in all cases.
> 
> However, I wonder what --encoding precisely does and if it has the behavior most users would expect.
> 
> Let me describe what I think it currently does:
> 
> The program attempts to re-code a log message, so it follows the chain
> 
>     raw input -> unicode -> raw output
Not sure what "raw input/output" means.
But there is only one reencode step involved, e.g.
input(8859) -> output(UTF-8)
When the encoding of the commit message is undefined, UTF-8 is assumed.
But Git does no verify if the encoding is really UTF-8.
We could guess that if it is not UTF-8 then it is ISO-8859-1, but that is not implemented.

> 
> For the first step, knowledge about the input encoding is required. 
When someone does a commit where the commit message does not conform to UTF-8,
This message is shown from Git:
"Warning: commit message did not conform to UTF-8.\n"
"You may want to amend it after fixing the message, or set the config\n"
"variable i18n.commitencoding to the encoding your project uses.\n";

If the user ignores this warning, how should Git guess the encoding  ?
(Later Git versions try do an auto-conversion assuming ISO-8859-1) ,
but that doesn't help real existing repos.

> This is retrieved from the encoding header of the commit object if present or (from the docs) 
>"lack of this header implies that the commit log message is encoded in UTF-8." 
>If this step fails (if the entry contains a byte sequence that is invalid in the specified/assumed input codec), 
>the procedure is aborted and the data is dumped as is (obviously without applying the requested output encoding).
> 
> Is that correct?
Yes, see above.
> 
> From my point of view the most natural abstraction of a log *message* is *text*, not bytes. 

>The same is true for author names. 

>If I want to build a tool chain on top of log/show, this usually means that I want to work with text information. 
>Hence, I want to retrieve text (a sequence of code points) from git show/log. 
>Text must be transported in encoded form, sure, 
>but it must not contain byte sequences that are invalid in this codec. 
>Because otherwise it's just not text anymore.
> 
Call it corrupted.
> Hence, from my point of view, the rational that git show/log should be able to output *text* information means
> that they should not emit byte sequences that are invalid in the codec specified via the --encoding argument. 
> In the current situation, the work of dealing with invalid byte sequences is just outsourced to software
> further below in the tool chain 
>(at some point a replacement character � should be displayed to the user instead of the invalid raw bytes).
> 
> I am not entirely sure where this discussion should lead to. 
Yes, until someone writes a patch to improve either the documentation or the code,
nothing will be changed.
> However, I think that if the behavior of the software will not be changed, 
>then the documentation for the --encoding option should be more precise and 
>clarify what actually happens behind the scenes. What do you think?
Patches are more than welcome.
> 
> 
> Cheers,
> 
> 
> Jan-Philip Gehrcke

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Should the --encoding argument to log/show commands make any guarantees about their output?
  2015-06-15 16:21 ` Torsten Bögershausen
@ 2015-06-16  9:38   ` Jan-Philip Gehrcke
  2015-06-16 20:04     ` Torsten Bögershausen
  0 siblings, 1 reply; 9+ messages in thread
From: Jan-Philip Gehrcke @ 2015-06-16  9:38 UTC (permalink / raw)
  To: Torsten Bögershausen, git

On 15.06.2015 18:21, Torsten Bögershausen wrote:
> On 2015-06-15 10.50, Jan-Philip Gehrcke wrote:
>> Let me describe what I think it currently does:
>>
>> The program attempts to re-code a log message, so it follows the chain
>>
>>      raw input -> unicode -> raw output
> Not sure what "raw input/output" means.
> But there is only one reencode step involved, e.g.
> input(8859) -> output(UTF-8)

We surely agree. With "raw" I meant a sequence of bytes, and with 
"unicode" I meant the intermediate state in the process of re-encoding 
(which can be thought of as decoding and encoding with a transient 
intermediate state).

> If the user ignores this warning, how should Git guess the encoding  ?

I entirely appreciate that there is no satisfying solution to this very 
problem.

>> If this step fails (if the entry contains a byte sequence that is invalid in the specified/assumed input codec),
>> the procedure is aborted and the data is dumped as is (obviously without applying the requested output encoding).
>>
>> Is that correct?
> Yes, see above.

Thanks!

>> Hence, from my point of view, the rational that git show/log should be able to output *text* information means
>> that they should not emit byte sequences that are invalid in the codec specified via the --encoding argument.
>> In the current situation, the work of dealing with invalid byte sequences is just outsourced to software
>> further below in the tool chain
>> (at some point a replacement character � should be displayed to the user instead of the invalid raw bytes).
>>
>> I am not entirely sure where this discussion should lead to.
> Yes, until someone writes a patch to improve either the documentation or the code,
> nothing will be changed.
>> However, I think that if the behavior of the software will not be changed,
>> then the documentation for the --encoding option should be more precise and
>> clarify what actually happens behind the scenes. What do you think?
> Patches are more than welcome.

I'd be willing to contribute, but of course there must be a discussion 
and an agreement before that, if there is need to change something at 
all, and what exactly.

To this discussion I would like to contribute that I am of the opinion 
that there should be a command line option to make git show/log/friends 
emit a byte stream that is guaranteed to be valid in a given codec.

That would require detection and treatment of those cases where 
corrupted text resides in the repository (we cannot prevent it from 
entering the repository, as discussed). In these cases, one could emit a 
replacement symbol (e.g. '?') per invalid byte subsequence (this seems 
to be more established than just swallowing the invalid byte sequence).

What do you think?

I think the --encoding option would have ideal semantics for described 
behavior.

However, I guess maintaining backwards compatibility is an issue here. 
On the other hand, I realize that the --encoding option undergoes 
changes: the docs for git log in release 2.4.3 do not even list the 
--encoding option anymore. Why is that? I haven't found a corresponding 
changelog/release notes entry.

Thanks,

Jan-Philip

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Should the --encoding argument to log/show commands make any guarantees about their output?
  2015-06-16  9:38   ` Jan-Philip Gehrcke
@ 2015-06-16 20:04     ` Torsten Bögershausen
  0 siblings, 0 replies; 9+ messages in thread
From: Torsten Bögershausen @ 2015-06-16 20:04 UTC (permalink / raw)
  To: Jan-Philip Gehrcke, Torsten Bögershausen, git

On 2015-06-16 11.38, Jan-Philip Gehrcke wrote:
> On 15.06.2015 18:21, Torsten Bögershausen wrote:
>> On 2015-06-15 10.50, Jan-Philip Gehrcke wrote:
>>> Let me describe what I think it currently does:
>>>
>>> The program attempts to re-code a log message, so it follows the chain
>>>
>>>      raw input -> unicode -> raw output
>> Not sure what "raw input/output" means.
>> But there is only one reencode step involved, e.g.
>> input(8859) -> output(UTF-8)

I probably need to correct myself:
pretty.c
void format_commit_message(const struct commit *commit,

The the message is converted from the commit encoding into UTF-8.
When the log encoding is different from UTF-8,
There is a second conversio (as you said).
(But not in your case, here the second conversion is skipped)

[snip]

> 
> I'd be willing to contribute, but of course there must be a discussion and an
> agreement before that, if there is need to change something at all, and what
> exactly.
[]
> What do you think?
See commit.c:

/*
 * This verifies that the buffer is in proper utf8 format.
 *
 * If it isn't, it assumes any non-utf8 characters are Latin1,
 * and does the conversion.
 */
static int verify_utf8(struct strbuf *buf)

> 
> I think the --encoding option would have ideal semantics for described behavior.
> 
> However, I guess maintaining backwards compatibility is an issue here. On the
> other hand, I realize that the --encoding option undergoes changes: the docs for
> git log in release 2.4.3 do not even list the --encoding option anymore. Why is
> that? I haven't found a corresponding changelog/release notes entry.

It still seems to work:
git log --encoding ISO-8859-1 | xxd | grep gersha | less
includes this line:
7465 6e20 42f6 6765 7273 6861 7573 656e  ten B.gershausen

So my suggestion (in short):
move verify_utf8() from commit.c into utf8.c,
make it non-static, and add a prototype in utf8.h

Use that function in pretty.c (and commit.c), test it.
Make a patch out of it and send it to the list.

In the ideal world the patch will include a test case,
but I don't know how easy it is to create such a commit.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Should the --encoding argument to log/show commands make any guarantees about their output?
  2015-06-15  8:50 Should the --encoding argument to log/show commands make any guarantees about their output? Jan-Philip Gehrcke
  2015-06-15 16:21 ` Torsten Bögershausen
@ 2015-06-17 16:42 ` Junio C Hamano
  2015-06-17 17:07   ` Jan-Philip Gehrcke
  1 sibling, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2015-06-17 16:42 UTC (permalink / raw)
  To: Jan-Philip Gehrcke; +Cc: git

Jan-Philip Gehrcke <jgehrcke@googlemail.com> writes:

> I was surprised to see that the output of
>
>     git log --encoding=utf-8 "--format=format:%b"
>
> can contain byte sequences that are invalid in UTF-8. Note: I am using
> git 2.1.4 and the %b format specifier represents the commit message
> body.

Yeah, if the original was bad and cannot be sanely expressed in
UTF-8, you have two options.  You can show the contents as raw bytes
recorded in the object with a warning so that the user can use it as
such (e.g. perhaps the original was indeed an iso8859-2 but was
incorrectly marked as UTF-8, or something like that, and a human
that is more intelligent than a tool _could_ guess and attempt to
recover).  Or you can error out and refuse to produce output.

We deliberately made a design choice to take the former option.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Should the --encoding argument to log/show commands make any guarantees about their output?
  2015-06-17 16:42 ` Junio C Hamano
@ 2015-06-17 17:07   ` Jan-Philip Gehrcke
  2015-06-17 18:46     ` Jeff King
  2015-06-17 19:55     ` Torsten Bögershausen
  0 siblings, 2 replies; 9+ messages in thread
From: Jan-Philip Gehrcke @ 2015-06-17 17:07 UTC (permalink / raw)
  To: git

On 17.06.2015 18:42, Junio C Hamano wrote:
> Jan-Philip Gehrcke <jgehrcke@googlemail.com> writes:
>
>> I was surprised to see that the output of
>>
>>      git log --encoding=utf-8 "--format=format:%b"
>>
>> can contain byte sequences that are invalid in UTF-8. Note: I am using
>> git 2.1.4 and the %b format specifier represents the commit message
>> body.
>
> Yeah, if the original was bad and cannot be sanely expressed in
> UTF-8, you have two options.  You can show the contents as raw bytes
> recorded in the object with a warning so that the user can use it as
> such (e.g. perhaps the original was indeed an iso8859-2 but was
> incorrectly marked as UTF-8, or something like that, and a human
> that is more intelligent than a tool _could_ guess and attempt to
> recover).  Or you can error out and refuse to produce output.

The two-option scenario is totally clear. Although one must stress that 
the "error-out" option can, as discussed, be kept minimally invasive: it 
is sufficient (and common) to just skip those byte sequences (and 
replace them with a replacement symbol) that would be invalid in the 
requested output encoding. This would retain as much information as 
possible while guaranteeing a subsequent decoder to retrieve valid input.

> We deliberately made a design choice to take the former option.

I totally support this design choice in general, especially when 
invoking `git whatever` without options. This here is, I think, mainly 
about documentation and the semantics of "--encoding". From my point of 
view, `--encoding=utf-8` semantically suggests that the output *is* 
valid UTF-8. But it is not, not always. May initial question was: what 
do you think about this? Should we

* just make this more clear in the docs and/or
* should we adjust the behavior of --encoding or
* should we do something entirely different, like adding a new command 
line option or
* should we just leave things as they are?

Thanks and cheers,

Jan-Philip

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Should the --encoding argument to log/show commands make any guarantees about their output?
  2015-06-17 17:07   ` Jan-Philip Gehrcke
@ 2015-06-17 18:46     ` Jeff King
  2015-06-17 20:02       ` Junio C Hamano
  2015-06-17 19:55     ` Torsten Bögershausen
  1 sibling, 1 reply; 9+ messages in thread
From: Jeff King @ 2015-06-17 18:46 UTC (permalink / raw)
  To: Jan-Philip Gehrcke; +Cc: Junio C Hamano, git

On Wed, Jun 17, 2015 at 07:07:48PM +0200, Jan-Philip Gehrcke wrote:

> The two-option scenario is totally clear. Although one must stress that the
> "error-out" option can, as discussed, be kept minimally invasive: it is
> sufficient (and common) to just skip those byte sequences (and replace them
> with a replacement symbol) that would be invalid in the requested output
> encoding. This would retain as much information as possible while
> guaranteeing a subsequent decoder to retrieve valid input.

I think "munge into valid UTF-8, even if it means losing data" is a
totally valid and useful option. I'm not completely sure that git should
do that, though.  E.g., you could just as easily do:

  git log --encoding=utf8 | drop_invalid_utf8 | your_script

Or quite possibly, your_script could do the munging itself while reading
the data. I do not know much about Python's input handling, but in Perl,
it is easy to say "the input is utf8, and replace anything bogus with a
substitution character"[1].

> Should we
> 
> * just make this more clear in the docs and/or
> * should we adjust the behavior of --encoding or
> * should we do something entirely different, like adding a new command line
> option or
> * should we just leave things as they are?

I would vote for a documentation change, perhaps like:

Subject: docs: clarify that --encoding can produce invalid sequences

In the common case that the commit encoding matches the
output encoding, we do not touch the buffer at all, which
makes things much more efficient. But it might be unclear to
a consumer that we will pass through bogus sequences.

Signed-off-by: Jeff King <peff@peff.net>
---
 Documentation/pretty-options.txt | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/Documentation/pretty-options.txt b/Documentation/pretty-options.txt
index 74aa01a..642af6e 100644
--- a/Documentation/pretty-options.txt
+++ b/Documentation/pretty-options.txt
@@ -37,7 +37,10 @@ people using 80-column terminals.
 	in their encoding header; this option can be used to tell the
 	command to re-code the commit log message in the encoding
 	preferred by the user.  For non plumbing commands this
-	defaults to UTF-8.
+	defaults to UTF-8. Note that if an object claims to be encoded
+	in `X` and we are outputting in `X`, we will output the object
+	verbatim; this means that invalid sequences in the original
+	commit may be copied to the output.
 
 --notes[=<ref>]::
 	Show the notes (see linkgit:git-notes[1]) that annotate the
-- 
2.4.4.719.g3984bc6

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: Should the --encoding argument to log/show commands make any guarantees about their output?
  2015-06-17 17:07   ` Jan-Philip Gehrcke
  2015-06-17 18:46     ` Jeff King
@ 2015-06-17 19:55     ` Torsten Bögershausen
  1 sibling, 0 replies; 9+ messages in thread
From: Torsten Bögershausen @ 2015-06-17 19:55 UTC (permalink / raw)
  To: Jan-Philip Gehrcke, git

> * just make this more clear in the docs and/or
> * should we adjust the behavior of --encoding or
> * should we do something entirely different, like adding a new command line
> option or
The general spirit is to keep things backwards compatible, so that users which
expect the "raw" (and possible corrupted UTF-8) data still get the same results,
when they updata their Git installation.

A new command line option will allow users to get clean UTF-8.

One suggestion could be
--fixbroken=ISO-8859-1    (a)
--fixbroken=octalescape   (b)
--fixbroken=hexescape     (c)

(a) would replace  "0xf6" with "0xc3 0xb6"
(b) could write "\366"
(c) could write "<F6>"

The exact form of the syntax can be discussed of course.

However, I would probably start with (a), and add other options
if needed.

> * should we just leave things as they are?
.... not the ideal thing.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Should the --encoding argument to log/show commands make any guarantees about their output?
  2015-06-17 18:46     ` Jeff King
@ 2015-06-17 20:02       ` Junio C Hamano
  0 siblings, 0 replies; 9+ messages in thread
From: Junio C Hamano @ 2015-06-17 20:02 UTC (permalink / raw)
  To: Jeff King; +Cc: Jan-Philip Gehrcke, git

Jeff King <peff@peff.net> writes:

> I would vote for a documentation change, perhaps like:
>
> Subject: docs: clarify that --encoding can produce invalid sequences
>
> In the common case that the commit encoding matches the
> output encoding, we do not touch the buffer at all, which
> makes things much more efficient. But it might be unclear to
> a consumer that we will pass through bogus sequences.
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---

Sounds like a sensible thing to do.

>  Documentation/pretty-options.txt | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/pretty-options.txt b/Documentation/pretty-options.txt
> index 74aa01a..642af6e 100644
> --- a/Documentation/pretty-options.txt
> +++ b/Documentation/pretty-options.txt
> @@ -37,7 +37,10 @@ people using 80-column terminals.
>  	in their encoding header; this option can be used to tell the
>  	command to re-code the commit log message in the encoding
>  	preferred by the user.  For non plumbing commands this
> -	defaults to UTF-8.
> +	defaults to UTF-8. Note that if an object claims to be encoded
> +	in `X` and we are outputting in `X`, we will output the object
> +	verbatim; this means that invalid sequences in the original
> +	commit may be copied to the output.
>  
>  --notes[=<ref>]::
>  	Show the notes (see linkgit:git-notes[1]) that annotate the

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2015-06-17 20:02 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-15  8:50 Should the --encoding argument to log/show commands make any guarantees about their output? Jan-Philip Gehrcke
2015-06-15 16:21 ` Torsten Bögershausen
2015-06-16  9:38   ` Jan-Philip Gehrcke
2015-06-16 20:04     ` Torsten Bögershausen
2015-06-17 16:42 ` Junio C Hamano
2015-06-17 17:07   ` Jan-Philip Gehrcke
2015-06-17 18:46     ` Jeff King
2015-06-17 20:02       ` Junio C Hamano
2015-06-17 19:55     ` Torsten Bögershausen

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.