All of lore.kernel.org
 help / color / mirror / Atom feed
From: Lars Schneider <larsxschneider@gmail.com>
To: Jeff King <peff@peff.net>
Cc: Git List <git@vger.kernel.org>,
	Junio C Hamano <gitster@pobox.com>,
	Johannes Schindelin <Johannes.Schindelin@gmx.de>,
	Shawn Pearce <spearce@spearce.org>,
	Jonathan Nieder <jrnieder@gmail.com>,
	Jakub Narebski <jnareb@gmail.com>
Subject: Re: RFC: Native clean/smudge filter for UTF-16 files
Date: Sun, 3 Dec 2017 19:48:01 +0100	[thread overview]
Message-ID: <759F0C3A-8C46-4685-BB27-6D508B26BB49@gmail.com> (raw)
In-Reply-To: <20171124180401.GB29190@sigill>


> On 24 Nov 2017, at 19:04, Jeff King <peff@peff.net> wrote:
> 
> On Thu, Nov 23, 2017 at 04:18:59PM +0100, Lars Schneider wrote:
> 
>> Alternatively, I could add a native attribute to Git that translates UTF-16 
>> to UTF-8 and back. A conversion function is already available in "mingw.h" [3]
>> on Windows. Limiting this feature to Windows wouldn't be a problem from my
>> point of view as UTF-16 is only relevant on Windows anyways. The attribute 
>> could look like this:
>> 
>>    *.txt        text encoding=utf-16
>> 
>> There was a previous discussion on the topic and Jonathan already suggested
>> a "native" clean/smudge filter in 2010 [4]. Also the "encoding" attribute
>> is already present but, as far as I can tell, is only used by the git gui
>> for viewing [5].
> 
> I would not want to see a proliferation of built-in filters, but it
> really seems like text-encoding conversion is a broad and practical one
> that many people might benefit from. So just like line-ending
> conversion, which _could_ be done by generic filters, it makes sense to
> me to support it natively for speed and simplicity.
> 
>> Do you think a patch that converts UTF-16 files to UTF-8 via an attribute
>> "encoding=utf-16" on Windows would have a chance to get accepted?
> 
> You haven't fully specified the semantics here, so let me sketch out
> what I think it ought to look like:

Thank you :-)


> - declare utf8 the "canonical" in-repo representation, just as we have
>   declared LF for line endings (alternatively this could be
>   configurable, but if we can get away with declaring utf8 the one true
>   encoding, that cuts out a lot of corner cases).

100% agree on UTF-8 as canonical representation and I wouldn't make 
that configurable as it would open a can of worms.


> - if core.convertEncoding is true, then for any file with an
>   encoding=foo attribute, internally run iconv(foo, utf8) in
>   convert_to_git(), and likewise iconv(utf8, foo) in
>   convert_to_working_tree.
> 
> - I'm not sure if core.convertEncoding should be enabled by default. If
>   it's a noop as long as there's no encoding attribute, then it's
>   probably fine. But I would not want accidental conversion or any
>   slowdown for the common case that the user wants no conversion.

I think we should mimic the behavior of "eol=crlf/lf" attribute.

AFAIK whenever I set "*.ext text eol=crlf", then I can be sure the 
file is checked out with CRLF independent of any of my local config
settings. Isn't that correct? I would expect a similar behavior if
"*.ext text encoding=utf16" is set. Wouldn't that mean that we do
not need a "core.convertEncoding" config?

I do 100% agree that it must be a noop if there is no encoding 
attribute present.


> - I doubt we'd want a "core.autoEncoding" similar to "core.autocrlf". I
>   don't think people consistently have all utf-16 files (the way they
>   might have all CRLF files) rather a few files that must be utf-16.

Agreed!


> - I have actually seen two types of utf-16 in git repos in the wild:
>   files which really must be utf-16 (because some tool demands it) and
>   files which happen to be utf-16, but could just as easily be utf-8
>   (and the user simply does not notice and commits utf-16, but doesn't
>   realize it until much later when their diffs are unreadable).
> 
>   For the first case, the "encoding" thing above would work fine. For
>   the second case, in theory we could have an option that takes any
>   file with a "text" attribute and no "encoding" attribute, and
>   converts it to utf-8.

TBH I would label that a "non-goal". Because AFAIK we cannot reliability
detect the encoding automatically.


>   I suspect that's opening a can of worms for false positives similar
>   to core.autocrlf. And performance drops as we try to guess the
>   encoding and convert all incoming data.
> 
>   So I mention it mostly as a direction I think we probably _don't_
>   want to go. Anybody with the "this could have been utf-8 all along"
>   type of file can remedy it by converting and committing the result.

100 % agree.


> Omitting all of the "we shouldn't do this" bullet points, it seems
> pretty simple and sane to me.
> 
> There is one other approach, which is to really store utf-16 in the
> repository and better teach the diff tools to handle it (which are
> really the main thing in git that cares about looking into the blob
> contents). You can do this already with a textconv filter, but:
> 
>  1. It's slow (though cacheable).
> 
>  2. It doesn't work unless each repo configures the filter (so not on
>     sites like GitHub, unless we define a micro-format that diff=utf16
>     should be textconv'd on display, and get all implementations to
>     respect that).

Actually, rendering diffs on Git hosting sites such as GitHub is one
of my goals. Therefore, storing content as UTF-16 wouldn't be a solution
for me.


>  3. Textconv patches look good, but can't be applied. This occasionally
>     makes things awkward, depending on your workflow.

TBH I dont't understand what you mean here. What do you mean with
"textconv patches"?


>  4. You have to actually mark each file with an attribute, which is
>     slightly annoying and more thing to remember to do (I see this from
>     the server side, since people often commit utf-16 without any
>     attributes, and then get annoyed when they see the file marked as
>     binary).
> 
> We've toyed with the idea at GitHub of auto-detecting UTF-16 BOMs and
> doing an "auto-textconv" to utf-8 (for our human-readable diffs only, of
> course). That solves (1), (2), and (4), but leaves (3). I actually
> looked into using libicu to do it not just for UTF-16, but to detect any
> encoding. It turned out to be really slow, though. :)
> 
> So anyway, that is an alternate strategy, but I think I like "canonical
> in-repo text is utf-8" approach a lot more, since then git operations
> work consistently. There are still a few rough edges (e.g., I'm not sure
> if you could apply a utf-8 patch directly to a utf-16 working tree file.
> Certainly not using "patch", but I'm not sure how well "git apply" would
> handle that case either). But I think it would mostly Just Work as long
> as people were willing to set their encoding attributes.

Agreed!

Thanks for your thoughts! I'll try to come up with a v1 of this idea.

- Lars

  parent reply	other threads:[~2017-12-03 18:48 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-11-23 15:18 RFC: Native clean/smudge filter for UTF-16 files Lars Schneider
2017-11-23 20:09 ` Torsten Bögershausen
2017-11-24 18:04 ` Jeff King
2017-11-25  2:32   ` Junio C Hamano
2017-12-03 18:48   ` Lars Schneider [this message]
2017-12-04 17:23     ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=759F0C3A-8C46-4685-BB27-6D508B26BB49@gmail.com \
    --to=larsxschneider@gmail.com \
    --cc=Johannes.Schindelin@gmx.de \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jnareb@gmail.com \
    --cc=jrnieder@gmail.com \
    --cc=peff@peff.net \
    --cc=spearce@spearce.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.