Re: RFC: Native clean/smudge filter for UTF-16 files

From: Lars Schneider <larsxschneider@gmail.com>
To: Jeff King <peff@peff.net>
Cc: Git List <git@vger.kernel.org>,
	Junio C Hamano <gitster@pobox.com>,
	Johannes Schindelin <Johannes.Schindelin@gmx.de>,
	Shawn Pearce <spearce@spearce.org>,
	Jonathan Nieder <jrnieder@gmail.com>,
	Jakub Narebski <jnareb@gmail.com>
Subject: Re: RFC: Native clean/smudge filter for UTF-16 files
Date: Sun, 3 Dec 2017 19:48:01 +0100	[thread overview]
Message-ID: <759F0C3A-8C46-4685-BB27-6D508B26BB49@gmail.com> (raw)
In-Reply-To: <20171124180401.GB29190@sigill>

> On 24 Nov 2017, at 19:04, Jeff King <peff@peff.net> wrote:
> 
> On Thu, Nov 23, 2017 at 04:18:59PM +0100, Lars Schneider wrote:
> 
>> Alternatively, I could add a native attribute to Git that translates UTF-16 
>> to UTF-8 and back. A conversion function is already available in "mingw.h" [3]
>> on Windows. Limiting this feature to Windows wouldn't be a problem from my
>> point of view as UTF-16 is only relevant on Windows anyways. The attribute 
>> could look like this:
>> 
>>    *.txt        text encoding=utf-16
>> 
>> There was a previous discussion on the topic and Jonathan already suggested
>> a "native" clean/smudge filter in 2010 [4]. Also the "encoding" attribute
>> is already present but, as far as I can tell, is only used by the git gui
>> for viewing [5].
> 
> I would not want to see a proliferation of built-in filters, but it
> really seems like text-encoding conversion is a broad and practical one
> that many people might benefit from. So just like line-ending
> conversion, which _could_ be done by generic filters, it makes sense to
> me to support it natively for speed and simplicity.
> 
>> Do you think a patch that converts UTF-16 files to UTF-8 via an attribute
>> "encoding=utf-16" on Windows would have a chance to get accepted?
> 
> You haven't fully specified the semantics here, so let me sketch out
> what I think it ought to look like:

Thank you :-)

> - declare utf8 the "canonical" in-repo representation, just as we have
>   declared LF for line endings (alternatively this could be
>   configurable, but if we can get away with declaring utf8 the one true
>   encoding, that cuts out a lot of corner cases).

100% agree on UTF-8 as canonical representation and I wouldn't make 
that configurable as it would open a can of worms.

> - if core.convertEncoding is true, then for any file with an
>   encoding=foo attribute, internally run iconv(foo, utf8) in
>   convert_to_git(), and likewise iconv(utf8, foo) in
>   convert_to_working_tree.
> 
> - I'm not sure if core.convertEncoding should be enabled by default. If
>   it's a noop as long as there's no encoding attribute, then it's
>   probably fine. But I would not want accidental conversion or any
>   slowdown for the common case that the user wants no conversion.

I think we should mimic the behavior of "eol=crlf/lf" attribute.

AFAIK whenever I set "*.ext text eol=crlf", then I can be sure the 
file is checked out with CRLF independent of any of my local config
settings. Isn't that correct? I would expect a similar behavior if
"*.ext text encoding=utf16" is set. Wouldn't that mean that we do
not need a "core.convertEncoding" config?

I do 100% agree that it must be a noop if there is no encoding 
attribute present.

> - I doubt we'd want a "core.autoEncoding" similar to "core.autocrlf". I
>   don't think people consistently have all utf-16 files (the way they
>   might have all CRLF files) rather a few files that must be utf-16.

Agreed!

> - I have actually seen two types of utf-16 in git repos in the wild:
>   files which really must be utf-16 (because some tool demands it) and
>   files which happen to be utf-16, but could just as easily be utf-8
>   (and the user simply does not notice and commits utf-16, but doesn't
>   realize it until much later when their diffs are unreadable).
> 
>   For the first case, the "encoding" thing above would work fine. For
>   the second case, in theory we could have an option that takes any
>   file with a "text" attribute and no "encoding" attribute, and
>   converts it to utf-8.

TBH I would label that a "non-goal". Because AFAIK we cannot reliability
detect the encoding automatically.

>   I suspect that's opening a can of worms for false positives similar
>   to core.autocrlf. And performance drops as we try to guess the
>   encoding and convert all incoming data.
> 
>   So I mention it mostly as a direction I think we probably _don't_
>   want to go. Anybody with the "this could have been utf-8 all along"
>   type of file can remedy it by converting and committing the result.

100 % agree.

> Omitting all of the "we shouldn't do this" bullet points, it seems
> pretty simple and sane to me.
> 
> There is one other approach, which is to really store utf-16 in the
> repository and better teach the diff tools to handle it (which are
> really the main thing in git that cares about looking into the blob
> contents). You can do this already with a textconv filter, but:
> 
>  1. It's slow (though cacheable).
> 
>  2. It doesn't work unless each repo configures the filter (so not on
>     sites like GitHub, unless we define a micro-format that diff=utf16
>     should be textconv'd on display, and get all implementations to
>     respect that).

Actually, rendering diffs on Git hosting sites such as GitHub is one
of my goals. Therefore, storing content as UTF-16 wouldn't be a solution
for me.

>  3. Textconv patches look good, but can't be applied. This occasionally
>     makes things awkward, depending on your workflow.

TBH I dont't understand what you mean here. What do you mean with
"textconv patches"?

>  4. You have to actually mark each file with an attribute, which is
>     slightly annoying and more thing to remember to do (I see this from
>     the server side, since people often commit utf-16 without any
>     attributes, and then get annoyed when they see the file marked as
>     binary).
> 
> We've toyed with the idea at GitHub of auto-detecting UTF-16 BOMs and
> doing an "auto-textconv" to utf-8 (for our human-readable diffs only, of
> course). That solves (1), (2), and (4), but leaves (3). I actually
> looked into using libicu to do it not just for UTF-16, but to detect any
> encoding. It turned out to be really slow, though. :)
> 
> So anyway, that is an alternate strategy, but I think I like "canonical
> in-repo text is utf-8" approach a lot more, since then git operations
> work consistently. There are still a few rough edges (e.g., I'm not sure
> if you could apply a utf-8 patch directly to a utf-16 working tree file.
> Certainly not using "patch", but I'm not sure how well "git apply" would
> handle that case either). But I think it would mostly Just Work as long
> as people were willing to set their encoding attributes.

Agreed!

Thanks for your thoughts! I'll try to come up with a v1 of this idea.

- Lars