All of lore.kernel.org
 help / color / mirror / Atom feed
* RFC: Native clean/smudge filter for UTF-16 files
@ 2017-11-23 15:18 Lars Schneider
  2017-11-23 20:09 ` Torsten Bögershausen
  2017-11-24 18:04 ` Jeff King
  0 siblings, 2 replies; 6+ messages in thread
From: Lars Schneider @ 2017-11-23 15:18 UTC (permalink / raw)
  To: Git List
  Cc: Junio C Hamano, Johannes Schindelin, Shawn Pearce,
	Jonathan Nieder, Jakub Narebski, Jeff King

Hi,

I am working with a team that owns a repository with lots of UTF-16 files.
Converting these files to UTF-8 is no good option as downstream applications
require the UTF-16 encoding. Keeping the files in UTF-16 is no good option
either as Git and Git related tools (e.g. GitHub) consider the files binary
and consequently do not render diffs.

The obvious solution is to setup a clean/smudge filter like this [1]:
    [filter "winutf16"]
        clean = iconv -f utf-16 -t utf-8
        smudge = iconv -f utf-8 -t utf-16

In general this works well but the "per-file" clean/smudge process invocation 
can be slow for many files. I could apply the same trick that we used for Git
LFS and write a iconv that processes all files with a single invocation (see
"edcc85814c convert: add filter.<driver>.process option" [2]). 

Alternatively, I could add a native attribute to Git that translates UTF-16 
to UTF-8 and back. A conversion function is already available in "mingw.h" [3]
on Windows. Limiting this feature to Windows wouldn't be a problem from my
point of view as UTF-16 is only relevant on Windows anyways. The attribute 
could look like this:

    *.txt        text encoding=utf-16

There was a previous discussion on the topic and Jonathan already suggested
a "native" clean/smudge filter in 2010 [4]. Also the "encoding" attribute
is already present but, as far as I can tell, is only used by the git gui
for viewing [5].

Do you think a patch that converts UTF-16 files to UTF-8 via an attribute
"encoding=utf-16" on Windows would have a chance to get accepted?

Thanks,
Lars

[1] https://github.com/msysgit/msysgit/issues/113#issue-13142846
[2] https://github.com/git/git/commit/edcc85814c87ebd7f3b1b7d3979fac3dfb84d308
[3] https://github.com/git/git/blob/14c63a9dc093d6738454f6369a4f5663ca732cf7/compat/mingw.h#L501-L533
[4] https://public-inbox.org/git/20101022195331.GA12014@burratino/
[5] https://github.com/git/git/commit/1ffca60f0b0395e1e593e64d66e7ed3c47d8517e

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: RFC: Native clean/smudge filter for UTF-16 files
  2017-11-23 15:18 RFC: Native clean/smudge filter for UTF-16 files Lars Schneider
@ 2017-11-23 20:09 ` Torsten Bögershausen
  2017-11-24 18:04 ` Jeff King
  1 sibling, 0 replies; 6+ messages in thread
From: Torsten Bögershausen @ 2017-11-23 20:09 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Git List, Junio C Hamano, Johannes Schindelin, Shawn Pearce,
	Jonathan Nieder, Jakub Narebski, Jeff King

On Thu, Nov 23, 2017 at 04:18:59PM +0100, Lars Schneider wrote:
> Hi,
> 
> I am working with a team that owns a repository with lots of UTF-16 files.
> Converting these files to UTF-8 is no good option as downstream applications
> require the UTF-16 encoding. Keeping the files in UTF-16 is no good option
> either as Git and Git related tools (e.g. GitHub) consider the files binary
> and consequently do not render diffs.
> 
> The obvious solution is to setup a clean/smudge filter like this [1]:
>     [filter "winutf16"]
>         clean = iconv -f utf-16 -t utf-8
>         smudge = iconv -f utf-8 -t utf-16
> 
> In general this works well but the "per-file" clean/smudge process invocation 
> can be slow for many files. I could apply the same trick that we used for Git
> LFS and write a iconv that processes all files with a single invocation (see
> "edcc85814c convert: add filter.<driver>.process option" [2]). 
> 
> Alternatively, I could add a native attribute to Git that translates UTF-16 
> to UTF-8 and back. A conversion function is already available in "mingw.h" [3]
> on Windows. Limiting this feature to Windows wouldn't be a problem from my
> point of view as UTF-16 is only relevant on Windows anyways. The attribute 
> could look like this:
> 
>     *.txt        text encoding=utf-16
> 

I would probably vote for UTF-16LE.
But we can specify whatever iconv allows, see below.

[]
> Do you think a patch that converts UTF-16 files to UTF-8 via an attribute
> "encoding=utf-16" on Windows would have a chance to get accepted?

Yes.
The thing is that we have convert.c and utf8.c (reencode_string_iconv()
and possible strbuf.c, which feels that they are in charge here.

Having a path using mingw.h would mean that this is Windows only,
and that is not a good solution.
(I sometimes host files on a Linux box, export them via SAMBA to
 Mac and Windows. Having that kind of setup allows to commit
 directly on the Linux fileserver)

But even if you don't have that setup, cross platform is a must, I would say.
There may be possible optimizations using xutftowcsn() and friends under
Windows, but they may come later into the picture (or are not needed)
What file sizes are we talking about ?
100K ?
100M ?

It is even possible to hook into the streaming interface.

> 
> Thanks,
> Lars
> 

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: RFC: Native clean/smudge filter for UTF-16 files
  2017-11-23 15:18 RFC: Native clean/smudge filter for UTF-16 files Lars Schneider
  2017-11-23 20:09 ` Torsten Bögershausen
@ 2017-11-24 18:04 ` Jeff King
  2017-11-25  2:32   ` Junio C Hamano
  2017-12-03 18:48   ` Lars Schneider
  1 sibling, 2 replies; 6+ messages in thread
From: Jeff King @ 2017-11-24 18:04 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Git List, Junio C Hamano, Johannes Schindelin, Shawn Pearce,
	Jonathan Nieder, Jakub Narebski

On Thu, Nov 23, 2017 at 04:18:59PM +0100, Lars Schneider wrote:

> Alternatively, I could add a native attribute to Git that translates UTF-16 
> to UTF-8 and back. A conversion function is already available in "mingw.h" [3]
> on Windows. Limiting this feature to Windows wouldn't be a problem from my
> point of view as UTF-16 is only relevant on Windows anyways. The attribute 
> could look like this:
> 
>     *.txt        text encoding=utf-16
>
> There was a previous discussion on the topic and Jonathan already suggested
> a "native" clean/smudge filter in 2010 [4]. Also the "encoding" attribute
> is already present but, as far as I can tell, is only used by the git gui
> for viewing [5].

I would not want to see a proliferation of built-in filters, but it
really seems like text-encoding conversion is a broad and practical one
that many people might benefit from. So just like line-ending
conversion, which _could_ be done by generic filters, it makes sense to
me to support it natively for speed and simplicity.

> Do you think a patch that converts UTF-16 files to UTF-8 via an attribute
> "encoding=utf-16" on Windows would have a chance to get accepted?

You haven't fully specified the semantics here, so let me sketch out
what I think it ought to look like:

 - declare utf8 the "canonical" in-repo representation, just as we have
   declared LF for line endings (alternatively this could be
   configurable, but if we can get away with declaring utf8 the one true
   encoding, that cuts out a lot of corner cases).

 - if core.convertEncoding is true, then for any file with an
   encoding=foo attribute, internally run iconv(foo, utf8) in
   convert_to_git(), and likewise iconv(utf8, foo) in
   convert_to_working_tree.

 - I'm not sure if core.convertEncoding should be enabled by default. If
   it's a noop as long as there's no encoding attribute, then it's
   probably fine. But I would not want accidental conversion or any
   slowdown for the common case that the user wants no conversion.

 - I doubt we'd want a "core.autoEncoding" similar to "core.autocrlf". I
   don't think people consistently have all utf-16 files (the way they
   might have all CRLF files) rather a few files that must be utf-16.

 - I have actually seen two types of utf-16 in git repos in the wild:
   files which really must be utf-16 (because some tool demands it) and
   files which happen to be utf-16, but could just as easily be utf-8
   (and the user simply does not notice and commits utf-16, but doesn't
   realize it until much later when their diffs are unreadable).

   For the first case, the "encoding" thing above would work fine. For
   the second case, in theory we could have an option that takes any
   file with a "text" attribute and no "encoding" attribute, and
   converts it to utf-8.

   I suspect that's opening a can of worms for false positives similar
   to core.autocrlf. And performance drops as we try to guess the
   encoding and convert all incoming data.

   So I mention it mostly as a direction I think we probably _don't_
   want to go. Anybody with the "this could have been utf-8 all along"
   type of file can remedy it by converting and committing the result.

Omitting all of the "we shouldn't do this" bullet points, it seems
pretty simple and sane to me.

There is one other approach, which is to really store utf-16 in the
repository and better teach the diff tools to handle it (which are
really the main thing in git that cares about looking into the blob
contents). You can do this already with a textconv filter, but:

  1. It's slow (though cacheable).

  2. It doesn't work unless each repo configures the filter (so not on
     sites like GitHub, unless we define a micro-format that diff=utf16
     should be textconv'd on display, and get all implementations to
     respect that).

  3. Textconv patches look good, but can't be applied. This occasionally
     makes things awkward, depending on your workflow.

  4. You have to actually mark each file with an attribute, which is
     slightly annoying and more thing to remember to do (I see this from
     the server side, since people often commit utf-16 without any
     attributes, and then get annoyed when they see the file marked as
     binary).

We've toyed with the idea at GitHub of auto-detecting UTF-16 BOMs and
doing an "auto-textconv" to utf-8 (for our human-readable diffs only, of
course). That solves (1), (2), and (4), but leaves (3). I actually
looked into using libicu to do it not just for UTF-16, but to detect any
encoding. It turned out to be really slow, though. :)

So anyway, that is an alternate strategy, but I think I like "canonical
in-repo text is utf-8" approach a lot more, since then git operations
work consistently. There are still a few rough edges (e.g., I'm not sure
if you could apply a utf-8 patch directly to a utf-16 working tree file.
Certainly not using "patch", but I'm not sure how well "git apply" would
handle that case either). But I think it would mostly Just Work as long
as people were willing to set their encoding attributes.

-Peff

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: RFC: Native clean/smudge filter for UTF-16 files
  2017-11-24 18:04 ` Jeff King
@ 2017-11-25  2:32   ` Junio C Hamano
  2017-12-03 18:48   ` Lars Schneider
  1 sibling, 0 replies; 6+ messages in thread
From: Junio C Hamano @ 2017-11-25  2:32 UTC (permalink / raw)
  To: Jeff King
  Cc: Lars Schneider, Git List, Johannes Schindelin, Shawn Pearce,
	Jonathan Nieder, Jakub Narebski

Jeff King <peff@peff.net> writes:

> So anyway, that is an alternate strategy, but I think I like "canonical
> in-repo text is utf-8" approach a lot more, since then git operations
> work consistently. There are still a few rough edges (e.g., I'm not sure

Sounds like a good way forward.

> if you could apply a utf-8 patch directly to a utf-16 working tree file.
> Certainly not using "patch", but I'm not sure how well "git apply" would
> handle that case either). But I think it would mostly Just Work as long
> as people were willing to set their encoding attributes.

It should work (or fail) just like applying LF patch to CRLF working
tree, so I wouldn't worry too much about it.

Thanks.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: RFC: Native clean/smudge filter for UTF-16 files
  2017-11-24 18:04 ` Jeff King
  2017-11-25  2:32   ` Junio C Hamano
@ 2017-12-03 18:48   ` Lars Schneider
  2017-12-04 17:23     ` Jeff King
  1 sibling, 1 reply; 6+ messages in thread
From: Lars Schneider @ 2017-12-03 18:48 UTC (permalink / raw)
  To: Jeff King
  Cc: Git List, Junio C Hamano, Johannes Schindelin, Shawn Pearce,
	Jonathan Nieder, Jakub Narebski


> On 24 Nov 2017, at 19:04, Jeff King <peff@peff.net> wrote:
> 
> On Thu, Nov 23, 2017 at 04:18:59PM +0100, Lars Schneider wrote:
> 
>> Alternatively, I could add a native attribute to Git that translates UTF-16 
>> to UTF-8 and back. A conversion function is already available in "mingw.h" [3]
>> on Windows. Limiting this feature to Windows wouldn't be a problem from my
>> point of view as UTF-16 is only relevant on Windows anyways. The attribute 
>> could look like this:
>> 
>>    *.txt        text encoding=utf-16
>> 
>> There was a previous discussion on the topic and Jonathan already suggested
>> a "native" clean/smudge filter in 2010 [4]. Also the "encoding" attribute
>> is already present but, as far as I can tell, is only used by the git gui
>> for viewing [5].
> 
> I would not want to see a proliferation of built-in filters, but it
> really seems like text-encoding conversion is a broad and practical one
> that many people might benefit from. So just like line-ending
> conversion, which _could_ be done by generic filters, it makes sense to
> me to support it natively for speed and simplicity.
> 
>> Do you think a patch that converts UTF-16 files to UTF-8 via an attribute
>> "encoding=utf-16" on Windows would have a chance to get accepted?
> 
> You haven't fully specified the semantics here, so let me sketch out
> what I think it ought to look like:

Thank you :-)


> - declare utf8 the "canonical" in-repo representation, just as we have
>   declared LF for line endings (alternatively this could be
>   configurable, but if we can get away with declaring utf8 the one true
>   encoding, that cuts out a lot of corner cases).

100% agree on UTF-8 as canonical representation and I wouldn't make 
that configurable as it would open a can of worms.


> - if core.convertEncoding is true, then for any file with an
>   encoding=foo attribute, internally run iconv(foo, utf8) in
>   convert_to_git(), and likewise iconv(utf8, foo) in
>   convert_to_working_tree.
> 
> - I'm not sure if core.convertEncoding should be enabled by default. If
>   it's a noop as long as there's no encoding attribute, then it's
>   probably fine. But I would not want accidental conversion or any
>   slowdown for the common case that the user wants no conversion.

I think we should mimic the behavior of "eol=crlf/lf" attribute.

AFAIK whenever I set "*.ext text eol=crlf", then I can be sure the 
file is checked out with CRLF independent of any of my local config
settings. Isn't that correct? I would expect a similar behavior if
"*.ext text encoding=utf16" is set. Wouldn't that mean that we do
not need a "core.convertEncoding" config?

I do 100% agree that it must be a noop if there is no encoding 
attribute present.


> - I doubt we'd want a "core.autoEncoding" similar to "core.autocrlf". I
>   don't think people consistently have all utf-16 files (the way they
>   might have all CRLF files) rather a few files that must be utf-16.

Agreed!


> - I have actually seen two types of utf-16 in git repos in the wild:
>   files which really must be utf-16 (because some tool demands it) and
>   files which happen to be utf-16, but could just as easily be utf-8
>   (and the user simply does not notice and commits utf-16, but doesn't
>   realize it until much later when their diffs are unreadable).
> 
>   For the first case, the "encoding" thing above would work fine. For
>   the second case, in theory we could have an option that takes any
>   file with a "text" attribute and no "encoding" attribute, and
>   converts it to utf-8.

TBH I would label that a "non-goal". Because AFAIK we cannot reliability
detect the encoding automatically.


>   I suspect that's opening a can of worms for false positives similar
>   to core.autocrlf. And performance drops as we try to guess the
>   encoding and convert all incoming data.
> 
>   So I mention it mostly as a direction I think we probably _don't_
>   want to go. Anybody with the "this could have been utf-8 all along"
>   type of file can remedy it by converting and committing the result.

100 % agree.


> Omitting all of the "we shouldn't do this" bullet points, it seems
> pretty simple and sane to me.
> 
> There is one other approach, which is to really store utf-16 in the
> repository and better teach the diff tools to handle it (which are
> really the main thing in git that cares about looking into the blob
> contents). You can do this already with a textconv filter, but:
> 
>  1. It's slow (though cacheable).
> 
>  2. It doesn't work unless each repo configures the filter (so not on
>     sites like GitHub, unless we define a micro-format that diff=utf16
>     should be textconv'd on display, and get all implementations to
>     respect that).

Actually, rendering diffs on Git hosting sites such as GitHub is one
of my goals. Therefore, storing content as UTF-16 wouldn't be a solution
for me.


>  3. Textconv patches look good, but can't be applied. This occasionally
>     makes things awkward, depending on your workflow.

TBH I dont't understand what you mean here. What do you mean with
"textconv patches"?


>  4. You have to actually mark each file with an attribute, which is
>     slightly annoying and more thing to remember to do (I see this from
>     the server side, since people often commit utf-16 without any
>     attributes, and then get annoyed when they see the file marked as
>     binary).
> 
> We've toyed with the idea at GitHub of auto-detecting UTF-16 BOMs and
> doing an "auto-textconv" to utf-8 (for our human-readable diffs only, of
> course). That solves (1), (2), and (4), but leaves (3). I actually
> looked into using libicu to do it not just for UTF-16, but to detect any
> encoding. It turned out to be really slow, though. :)
> 
> So anyway, that is an alternate strategy, but I think I like "canonical
> in-repo text is utf-8" approach a lot more, since then git operations
> work consistently. There are still a few rough edges (e.g., I'm not sure
> if you could apply a utf-8 patch directly to a utf-16 working tree file.
> Certainly not using "patch", but I'm not sure how well "git apply" would
> handle that case either). But I think it would mostly Just Work as long
> as people were willing to set their encoding attributes.

Agreed!

Thanks for your thoughts! I'll try to come up with a v1 of this idea.

- Lars

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: RFC: Native clean/smudge filter for UTF-16 files
  2017-12-03 18:48   ` Lars Schneider
@ 2017-12-04 17:23     ` Jeff King
  0 siblings, 0 replies; 6+ messages in thread
From: Jeff King @ 2017-12-04 17:23 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Git List, Junio C Hamano, Johannes Schindelin, Shawn Pearce,
	Jonathan Nieder, Jakub Narebski

On Sun, Dec 03, 2017 at 07:48:01PM +0100, Lars Schneider wrote:

> > - if core.convertEncoding is true, then for any file with an
> >   encoding=foo attribute, internally run iconv(foo, utf8) in
> >   convert_to_git(), and likewise iconv(utf8, foo) in
> >   convert_to_working_tree.
> > 
> > - I'm not sure if core.convertEncoding should be enabled by default. If
> >   it's a noop as long as there's no encoding attribute, then it's
> >   probably fine. But I would not want accidental conversion or any
> >   slowdown for the common case that the user wants no conversion.
> 
> I think we should mimic the behavior of "eol=crlf/lf" attribute.
> 
> AFAIK whenever I set "*.ext text eol=crlf", then I can be sure the 
> file is checked out with CRLF independent of any of my local config
> settings. Isn't that correct? I would expect a similar behavior if
> "*.ext text encoding=utf16" is set. Wouldn't that mean that we do
> not need a "core.convertEncoding" config?

Yeah, on further thought, that's probably the right thing. Both "eol"
and "encoding" attributes are definite indications of what should happen
(unlike "text", which is just saying you _could_ convert line endings if
you wished to, and therefore has to be used in conjunction with a config
setting).

I like the name "encoding" for the attribute, but I do wonder if this
would bite anybody using it already for other purposes (like gitk).

> > There is one other approach, which is to really store utf-16 in the
> > repository and better teach the diff tools to handle it (which are
> > really the main thing in git that cares about looking into the blob
> > contents). You can do this already with a textconv filter, but:
> > 
> >  1. It's slow (though cacheable).
> > 
> >  2. It doesn't work unless each repo configures the filter (so not on
> >     sites like GitHub, unless we define a micro-format that diff=utf16
> >     should be textconv'd on display, and get all implementations to
> >     respect that).
> 
> Actually, rendering diffs on Git hosting sites such as GitHub is one
> of my goals. Therefore, storing content as UTF-16 wouldn't be a solution
> for me.

If there were a convention for specifying the attribute, I think sites
like GitHub would start respecting it in the server-side diffs (though
like I said, we could also just auto-detect via BOM without even
requiring any attributes to be set up).

> >  3. Textconv patches look good, but can't be applied. This occasionally
> >     makes things awkward, depending on your workflow.
> 
> TBH I dont't understand what you mean here. What do you mean with
> "textconv patches"?

I mean the patch produced by "git diff" when textconv is in effect. That
patch cannot be applied to the original content. E.g.:

  git init
  echo "* diff=foo" >.git/info/attributes
  git config diff.foo.textconv "sed s/^/foo:/"

  echo base >file
  git add file
  git commit -m base

  echo change >file
  git diff >patch

  git reset --hard
  git apply patch

That works in the absence of the textconv, but not with it. (For a real
binary file, you'd probably need "diff --binary" to produce a usable
patch, but the principle is the same).

-Peff

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2017-12-04 17:23 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-23 15:18 RFC: Native clean/smudge filter for UTF-16 files Lars Schneider
2017-11-23 20:09 ` Torsten Bögershausen
2017-11-24 18:04 ` Jeff King
2017-11-25  2:32   ` Junio C Hamano
2017-12-03 18:48   ` Lars Schneider
2017-12-04 17:23     ` Jeff King

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.