Re: [PATCH v1] convert: add support for 'encoding' attribute

From: Lars Schneider <larsxschneider@gmail.com>
To: "Torsten Bögershausen" <tboegi@web.de>
Cc: Lars Schneider <lars.schneider@autodesk.com>,
	Git List <git@vger.kernel.org>,
	gitster@pobox.com, peff@peff.net, patrick@luehne.de
Subject: Re: [PATCH v1] convert: add support for 'encoding' attribute
Date: Thu, 28 Dec 2017 17:14:25 +0100	[thread overview]
Message-ID: <BF5C58AC-C1E6-4AD5-A4F8-C4CC5C9108F7@gmail.com> (raw)
In-Reply-To: <20171217171404.GA18175@tor.lan>

> On 17 Dec 2017, at 18:14, Torsten Bögershausen <tboegi@web.de> wrote:
> 
> On Mon, Dec 11, 2017 at 04:50:23PM +0100, lars.schneider@autodesk.com wrote:
>> From: Lars Schneider <larsxschneider@gmail.com>
>> 

>> +`encoding`
>> +^^^^^^^^^^
>> +
>> +By default Git assumes UTF-8 encoding for text files.  The `encoding`
> 
> This is probably "ASCII" and it's supersets like ISO-8859-1 or UTF-8.

I am not sure I understand what you want to tell me here.
Do you think UTF-8 encoding is not correct in the sentence above?

> 
>> +attribute sets the encoding to be used in the working directory.
>> +If the path is added to the index, then Git encodes the content to
>> +UTF-8.  On checkout the content is encoded back to the original
>> +encoding.  Consequently, you can use all built-in Git text processing
>> +tools (e.g. git diff, line ending conversions, etc.) with your
>> +non-UTF-8 encoded file.
>> +
>> +Please note that re-encoding content can cause errors and requires
>> +resources. Use the `encoding` attribute only if you cannot store
>> +a file in UTF-8 encoding and if you want Git to be able to process
>> +the text.
>> +
>> +------------------------
>> +*.txt		text encoding=UTF-16
>> +------------------------
> 
> I think that encoding (or worktree-encoding as said elsewhere) must imply text.
> (That is not done in convert.c, but assuming binary or even auto
> does not make much sense to me)

"text" only means "LF". "-text" means CRLF which would be perfectly fine
for UTF-16. Therefore I don't think we need to imply text.
Does this make sense?

> 
> 
>> +
>> +All `iconv` encodings with a stable round-trip conversion to and from
>> +UTF-8 are supported.  You can see a full list with the following command:
>> +
>> +------------------------
>> +iconv --list
>> +------------------------
>> +
>> `eol`
>> ^^^^^
>> 
>> diff --git a/convert.c b/convert.c
>> index 20d7ab67bd..ee19c17104 100644
>> --- a/convert.c
>> +++ b/convert.c
>> @@ -7,6 +7,7 @@
>> #include "sigchain.h"
>> #include "pkt-line.h"
>> #include "sub-process.h"
>> +#include "utf8.h"
>> 
>> /*
>>  * convert.c - convert a file when checking it out and checking it in.
>> @@ -256,6 +257,149 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
>> 
>> }
> 
> I would avoid to use these #ifdefs here.
> All functions can be in strbuf.c (and may have #ifdefs there), but not here.

I'll try that. But wouldn't it make more sense to move the functions to utf.c?

> 
>> 
>> +#ifdef NO_ICONV
>> +#ifndef _ICONV_T
>> +/* The type is just a placeholder and not actually used. */
>> +typedef void* iconv_t;
>> +#endif
>> +#endif
>> +
>> +static struct encoding {
>> +	const char *name;
>> +	iconv_t to_git;       /* conversion to Git's canonical encoding (UTF-8) */
>> +	iconv_t to_worktree;  /* conversion to user-defined encoding */
>> +	struct encoding *next;
>> +} *encoding, **encoding_tail;
> 
> This seems like an optimazation, assuning that iconv_open() eats a lot of ressources.
> I don't think this is the case. (Undere MacOS we run iconv_open() together with
> every opendir(), and optimizing this out did not show any measurable improvements)

True, but then I would need to free() the memory in a lot of places.
Therefore I thought it is easier to keep the object. OK for you?

>> +static const char *default_encoding = "utf-8";
> 
> The most portable form is "UTF-8" (correct me if that is wrong)

It shouldn't matter. But I've changed it to uppercase to be on the safe side.

>> +static iconv_t invalid_conversion = (iconv_t)-1;
>> +
>> +static int encode_to_git(const char *path, const char *src, size_t src_len,
>> +			 struct strbuf *buf, struct encoding *enc)
>> +{
>> +#ifndef NO_ICONV
>> +	char *dst, *re_src;
>> +	int dst_len, re_src_len;
>> +
>> +	/*
>> +	 * No encoding is specified or there is nothing to encode.
>> +	 * Tell the caller that the content was not modified.
>> +	 */
>> +	if (!enc || (src && !src_len))
>> +		return 0;
>> +
>> +	/*
>> +	 * Looks like we got called from "would_convert_to_git()".
>> +	 * This means Git wants to know if it would encode (= modify!)
>> +	 * the content. Let's answer with "yes", since an encoding was
>> +	 * specified.
>> +	 */
>> +	if (!buf && !src)
>> +		return 1;
>> +
>> +	if (enc->to_git == invalid_conversion) {
>> +		enc->to_git = iconv_open(default_encoding, encoding->name);
>> +		if (enc->to_git == invalid_conversion)
>> +			warning(_("unsupported encoding %s"), encoding->name);
>> +	}
> 
> 	/* There are 2 different types of reaction:
> 	      Either users  know what that a warning means: You asked for problems,
> 	      	        and do the right thing. Other may may ignore the warning
> 			       - in both cases an error is useful */

Agreed!

>> +	if (enc->to_worktree == invalid_conversion)
>> +		enc->to_worktree = iconv_open(encoding->name, default_encoding);
>> +
>> +	/*
>> +	 * Do nothing if the encoding is not supported. This could happen in
>> +	 * two cases:
>> +	 *   (1) The encoding is garbage. That is no problem as we would not
>> +	 *       encode the content to UTF-8 on "add" and we would not encode
>> +	 *       it back on "checkout".
>> +	 *   (2) Git users use different iconv versions that support different
>> +	 *       encodings. This could lead to problems, as the content might
>> +	 *       not be encoded on "add" but encoded back on "checkout" (or
>> +	 *       the other way around).
>> +	 * We print a one-time warning to the user in both cases above.
>> +	 */
> 
> Isn't an error better than "garbage in -> garbage out" ?

Agreed. I changed the warning to an error.

>> diff --git a/t/t0028-encoding.sh b/t/t0028-encoding.sh
> 
> (I didn't review the test yet)
> 
> Another comment for a possible improvement:
> "git diff"  could be told to write the worktree-encoding into the diff,
> and "git apply" can pick that up. 

Yes, we could do that. However, I would tackle that in a separate series.

Thanks,
Lars