All of lore.kernel.org
 help / color / mirror / Atom feed
* [BUG?] iconv used as textconv, and spurious ^M on added lines on Windows
@ 2017-03-30 19:35 Jakub Narębski
  2017-03-30 20:00 ` Jeff King
  2017-03-31 12:38 ` Torsten Bögershausen
  0 siblings, 2 replies; 10+ messages in thread
From: Jakub Narębski @ 2017-03-30 19:35 UTC (permalink / raw)
  To: git

Hello,

Recently I had to work on a project which uses legacy 8-bit encoding
(namely cp1250 encoding) instead of utf-8 for text files (LaTeX
documents).  My terminal, that is Git Bash from Git for Windows is set
up for utf-8.

I wanted for "git diff" and friends to return something sane on said
utf-8 terminal, instead of mojibake.  There is 'encoding'
gitattribute... but it works only for GUI ('git gui', that is).

Therefore I have (ab)used textconv facility to convert from cp1250 of
file encoding to utf-8 encoding of console.

I have set the following in .gitattributes file:

  ## LaTeX documents in cp1250 encoding
  *.tex text diff=mylatex

The 'mylatex' driver is defined as:

  [diff "mylatex"]
        xfuncname = "^(\\\\((sub)*section|chapter|part)\\*{0,1}\\{.*)$"
        wordRegex = "\\\\[a-zA-Z]+|[{}]|\\\\.|[^\\{}[:space:]]+"
        textconv  = \"C:/Program Files/Git/usr/bin/iconv.exe\" -f cp1250 -t utf-8
        cachetextconv = true

And everything would be all right... if not the fact that Git appends
spurious ^M to added lines in the `git diff` output.  Files use CRLF
end-of-line convention (the native MS Windows one).

  $ git diff test.tex
  diff --git a/test.tex b/test.tex
  index 029646e..250ab16 100644
  --- a/test.tex
  +++ b/test.tex
  @@ -1,4 +1,4 @@
  -\documentclass{article}
  +\documentclass{mwart}^M
  
   \usepackage[cp1250]{inputenc}
   \usepackage{polski}

What gives?  Why there is this ^M tacked on the end of added lines,
while it is not present in deleted lines, nor in content lines?

Puzzled.

P.S. Git has `i18n.commitEncoding` and `i18n.logOutputEncoding`; pity
that it doesn't supports in core `encoding` attribute together with
having `i18n.outputEncoding`.
--
Jakub Narębski



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [BUG?] iconv used as textconv, and spurious ^M on added lines on Windows
  2017-03-30 19:35 [BUG?] iconv used as textconv, and spurious ^M on added lines on Windows Jakub Narębski
@ 2017-03-30 20:00 ` Jeff King
  2017-03-31 13:24   ` Jakub Narębski
  2017-03-31 12:38 ` Torsten Bögershausen
  1 sibling, 1 reply; 10+ messages in thread
From: Jeff King @ 2017-03-30 20:00 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: git

On Thu, Mar 30, 2017 at 09:35:27PM +0200, Jakub Narębski wrote:

> And everything would be all right... if not the fact that Git appends
> spurious ^M to added lines in the `git diff` output.  Files use CRLF
> end-of-line convention (the native MS Windows one).
> 
>   $ git diff test.tex
>   diff --git a/test.tex b/test.tex
>   index 029646e..250ab16 100644
>   --- a/test.tex
>   +++ b/test.tex
>   @@ -1,4 +1,4 @@
>   -\documentclass{article}
>   +\documentclass{mwart}^M
>   
>    \usepackage[cp1250]{inputenc}
>    \usepackage{polski}
> 
> What gives?  Why there is this ^M tacked on the end of added lines,
> while it is not present in deleted lines, nor in content lines?

Perhaps it's trailing whitespace highlighting for added lines? You can
add "cr-at-eol" to core.whitespace to suppress it.

I suspect in the normal case that git is doing line-ending conversion,
but it's suppressed when textconv is in use.

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [BUG?] iconv used as textconv, and spurious ^M on added lines on Windows
  2017-03-30 19:35 [BUG?] iconv used as textconv, and spurious ^M on added lines on Windows Jakub Narębski
  2017-03-30 20:00 ` Jeff King
@ 2017-03-31 12:38 ` Torsten Bögershausen
  2017-03-31 19:44   ` Jakub Narębski
  1 sibling, 1 reply; 10+ messages in thread
From: Torsten Bögershausen @ 2017-03-31 12:38 UTC (permalink / raw)
  To: Jakub Narębski, git

On 30.03.17 21:35, Jakub Narębski wrote:
> Hello,
> 
> Recently I had to work on a project which uses legacy 8-bit encoding
> (namely cp1250 encoding) instead of utf-8 for text files (LaTeX
> documents).  My terminal, that is Git Bash from Git for Windows is set
> up for utf-8.
> 
> I wanted for "git diff" and friends to return something sane on said
> utf-8 terminal, instead of mojibake.  There is 'encoding'
> gitattribute... but it works only for GUI ('git gui', that is).
> 
> Therefore I have (ab)used textconv facility to convert from cp1250 of
> file encoding to utf-8 encoding of console.
> 
> I have set the following in .gitattributes file:
> 
>   ## LaTeX documents in cp1250 encoding
>   *.tex text diff=mylatex
> 
> The 'mylatex' driver is defined as:
> 
>   [diff "mylatex"]
>         xfuncname = "^(\\\\((sub)*section|chapter|part)\\*{0,1}\\{.*)$"
>         wordRegex = "\\\\[a-zA-Z]+|[{}]|\\\\.|[^\\{}[:space:]]+"
>         textconv  = \"C:/Program Files/Git/usr/bin/iconv.exe\" -f cp1250 -t utf-8
>         cachetextconv = true
> 
> And everything would be all right... if not the fact that Git appends
> spurious ^M to added lines in the `git diff` output.  Files use CRLF
> end-of-line convention (the native MS Windows one).
> 
>   $ git diff test.tex
>   diff --git a/test.tex b/test.tex
>   index 029646e..250ab16 100644
>   --- a/test.tex
>   +++ b/test.tex
>   @@ -1,4 +1,4 @@
>   -\documentclass{article}
>   +\documentclass{mwart}^M
>   
>    \usepackage[cp1250]{inputenc}
>    \usepackage{polski}
> 
> What gives?  Why there is this ^M tacked on the end of added lines,
> while it is not present in deleted lines, nor in content lines?
> 
> Puzzled.
> 
> P.S. Git has `i18n.commitEncoding` and `i18n.logOutputEncoding`; pity
> that it doesn't supports in core `encoding` attribute together with
> having `i18n.outputEncoding`.
> --
> Jakub Narębski
> 
> 
Is there a chance to give us a receipt how to reproduce it?
A complete test script or ?
(I don't want to speculate, if the invocation of iconv is the problem,
 where stdout is not in "binary mode", or however this is called under Windows)





^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [BUG?] iconv used as textconv, and spurious ^M on added lines on Windows
  2017-03-30 20:00 ` Jeff King
@ 2017-03-31 13:24   ` Jakub Narębski
  2017-04-01  6:08     ` Jeff King
  0 siblings, 1 reply; 10+ messages in thread
From: Jakub Narębski @ 2017-03-31 13:24 UTC (permalink / raw)
  To: Jeff King; +Cc: git

W dniu 30.03.2017 o 22:00, Jeff King pisze:
> On Thu, Mar 30, 2017 at 09:35:27PM +0200, Jakub Narębski wrote:
> 
>> And everything would be all right... if not the fact that Git appends
>> spurious ^M to added lines in the `git diff` output.  Files use CRLF
>> end-of-line convention (the native MS Windows one).
>>
>>   $ git diff test.tex
>>   diff --git a/test.tex b/test.tex
>>   index 029646e..250ab16 100644
>>   --- a/test.tex
>>   +++ b/test.tex
>>   @@ -1,4 +1,4 @@
>>   -\documentclass{article}
>>   +\documentclass{mwart}^M
>>   
>>    \usepackage[cp1250]{inputenc}
>>    \usepackage{polski}
>>
>> What gives?  Why there is this ^M tacked on the end of added lines,
>> while it is not present in deleted lines, nor in content lines?

Gah, I forgot that Git for Windows installed with default options uses
`core.autocrlf=true`, so file contents is stored in repository and
in the index using LF end-of-line convention -- that is why there is
no ^M in pre-image (in removed lines).
 
> Perhaps it's trailing whitespace highlighting for added lines? You can
> add "cr-at-eol" to core.whitespace to suppress it.

Thanks! That solves the problem (or rather workarounds it).

> 
> I suspect in the normal case that git is doing line-ending conversion,
> but it's suppressed when textconv is in use.

I would not consider this a bug if not for the fact that there is no ^M
without using iconv as textconv.

Compare (without textconv => no ^M, but mojibake):

  $ git diff test.txt
  diff --git a/test.txt b/test.txt
  index 029646e..38cd657 100644
  --- a/test.txt
  +++ b/test.txt
  @@ -1,9 +1,10 @@
  -\documentclass{article}
  +\documentclass{mwart}
  
   \usepackage[cp1250]{inputenc}
   \usepackage{polski}
  
   \begin{document}
  +Za<BF><F3><B3><E6> g<EA><9C>l<B9> ja<9F><F1>!
  
   \end{document}

with the following (with textconv => no gibberish, but ^M):

  $ git diff test.tex
  diff --git a/test.tex b/test.tex
  index 029646e..38cd657 100644
  --- a/test.tex
  +++ b/test.tex
  @@ -1,9 +1,10 @@
  -\documentclass{article}
  +\documentclass{mwart}^M
  
   \usepackage[cp1250]{inputenc}
   \usepackage{polski}
  
   \begin{document}
  +Zażółć gęślą jaźń!^M
  
   \end{document}

-- 
Jakub Narębski


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [BUG?] iconv used as textconv, and spurious ^M on added lines on Windows
  2017-03-31 12:38 ` Torsten Bögershausen
@ 2017-03-31 19:44   ` Jakub Narębski
  2017-04-02  4:34     ` Torsten Bögershausen
  0 siblings, 1 reply; 10+ messages in thread
From: Jakub Narębski @ 2017-03-31 19:44 UTC (permalink / raw)
  To: Torsten Bögershausen, git

W dniu 31.03.2017 o 14:38, Torsten Bögershausen pisze:
> On 30.03.17 21:35, Jakub Narębski wrote:
>> Hello,
>>
>> Recently I had to work on a project which uses legacy 8-bit encoding
>> (namely cp1250 encoding) instead of utf-8 for text files (LaTeX
>> documents).  My terminal, that is Git Bash from Git for Windows is set
>> up for utf-8.
>>
>> I wanted for "git diff" and friends to return something sane on said
>> utf-8 terminal, instead of mojibake.  There is 'encoding'
>> gitattribute... but it works only for GUI ('git gui', that is).
>>
>> Therefore I have (ab)used textconv facility to convert from cp1250 of
>> file encoding to utf-8 encoding of console.
>>
>> I have set the following in .gitattributes file:
>>
>>   ## LaTeX documents in cp1250 encoding
>>   *.tex text diff=mylatex
>>
>> The 'mylatex' driver is defined as:
>>
>>   [diff "mylatex"]
>>         xfuncname = "^(\\\\((sub)*section|chapter|part)\\*{0,1}\\{.*)$"
>>         wordRegex = "\\\\[a-zA-Z]+|[{}]|\\\\.|[^\\{}[:space:]]+"
>>         textconv  = \"C:/Program Files/Git/usr/bin/iconv.exe\" -f cp1250 -t utf-8
>>         cachetextconv = true
>>
>> And everything would be all right... if not the fact that Git appends
>> spurious ^M to added lines in the `git diff` output.  Files use CRLF
>> end-of-line convention (the native MS Windows one).
>>
>>   $ git diff test.tex
>>   diff --git a/test.tex b/test.tex
>>   index 029646e..250ab16 100644
>>   --- a/test.tex
>>   +++ b/test.tex
>>   @@ -1,4 +1,4 @@
>>   -\documentclass{article}
>>   +\documentclass{mwart}^M
>>   
>>    \usepackage[cp1250]{inputenc}
>>    \usepackage{polski}
>>
>> What gives?  Why there is this ^M tacked on the end of added lines,
>> while it is not present in deleted lines, nor in content lines?
>>
>> Puzzled.
>>
>> P.S. Git has `i18n.commitEncoding` and `i18n.logOutputEncoding`; pity
>> that it doesn't supports in core `encoding` attribute together with
>> having `i18n.outputEncoding`.
>
> Is there a chance to give us a receipt how to reproduce it?
> A complete test script or ?
> (I don't want to speculate, if the invocation of iconv is the problem,
>  where stdout is not in "binary mode", or however this is called under Windows)

I'm sorry, I though I posted whole recipe, but I missed some details
in the above description of the case.

First, files are stored on filesystem using CRLF eol (DOS end-of-line
convention).  Due to `core.autocrlf` they are converted to LF in blobs,
that is in the index and in the repository.

Second, a textconv with filter preserving end-of-line needs to be
configured.  I have used `iconv`, but I suspect that the problem would
happen also for `cat`.

In the .gitattributes file, or .git/info/attributes add, for example:

  *.tex text diff=myconv

In the .git/config configure the textconv filter, for example:

  [diff "myconv"]
         textconv  = iconv.exe -f cp1250 -t utf-8

Create a file which filename matches the attribute line, and which
uses CRLF end of line convention, and add it to Git (adding it to
the index):

  $ printf "foo\r\n" >foo.tex
  $ git add foo.tex

Modify file (also with CRLF):

  $ printf "bar\r\n" >foo.tex

Check the difference

  $ git diff foo.tex

HTH
-- 
Jakub Narębski


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [BUG?] iconv used as textconv, and spurious ^M on added lines on Windows
  2017-03-31 13:24   ` Jakub Narębski
@ 2017-04-01  6:08     ` Jeff King
  2017-04-01 18:31       ` Jakub Narębski
  0 siblings, 1 reply; 10+ messages in thread
From: Jeff King @ 2017-04-01  6:08 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: git

On Fri, Mar 31, 2017 at 03:24:48PM +0200, Jakub Narębski wrote:

> > I suspect in the normal case that git is doing line-ending conversion,
> > but it's suppressed when textconv is in use.
> 
> I would not consider this a bug if not for the fact that there is no ^M
> without using iconv as textconv.

I don't think it's a bug, though. You have told Git that you will
convert the contents (whatever their format) into the canonical format,
but your program to do so includes a CR.

We _could_ further process with other canonicalizations, but I'm not
sure that is a good idea (line-endings sound reasonably harmless, but
almost certainly we should not be doing clean/smudge filtering). And I'm
not sure if there would be any compatibility fallouts.

So I think the behavior is perhaps not what you want, but it's not an
unreasonable one. And the solution is to define your textconv such that
it produces clean LF-only output. Perhaps:

  [diff.whatever]
  textconv = "iconv ... | tr -d '\r'"

?

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [BUG?] iconv used as textconv, and spurious ^M on added lines on Windows
  2017-04-01  6:08     ` Jeff King
@ 2017-04-01 18:31       ` Jakub Narębski
  2017-04-02  7:45         ` Jeff King
  0 siblings, 1 reply; 10+ messages in thread
From: Jakub Narębski @ 2017-04-01 18:31 UTC (permalink / raw)
  To: Jeff King; +Cc: git

W dniu 01.04.2017 o 08:08, Jeff King pisze:
> On Fri, Mar 31, 2017 at 03:24:48PM +0200, Jakub Narębski wrote:
> 
>>> I suspect in the normal case that git is doing line-ending conversion,
>>> but it's suppressed when textconv is in use.
>>
>> I would not consider this a bug if not for the fact that there is no ^M
>> without using iconv as textconv.
> 
> I don't think it's a bug, though. You have told Git that you will
> convert the contents (whatever their format) into the canonical format,
> but your program to do so includes a CR.

Well, I have not declared file binary with "binary = true" in diff driver
definition, isn't it?

> 
> We _could_ further process with other canonicalizations, but I'm not
> sure that is a good idea (line-endings sound reasonably harmless, but
> almost certainly we should not be doing clean/smudge filtering). And I'm
> not sure if there would be any compatibility fallouts.

Yes, gitattributes(5) defines interaction between 'text'/'eol', 'ident'
and 'filter' attributes, but nothing about 'diff' and 'text'/'eol'.

> 
> So I think the behavior is perhaps not what you want, but it's not an
> unreasonable one. And the solution is to define your textconv such that
> it produces clean LF-only output. Perhaps:
> 
>   [diff.whatever]
>   textconv = "iconv ... | tr -d '\r'"

Well, either this (or equivalent using dos2unix), or using 'whitespace'
attribute (or 'core.whitespace') with cr-at-eol.


P.S. What do you think about Git supporting 'encoding' attribute (or
'core.encoding' config) plus 'core.outputEncoding' in-core?

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [BUG?] iconv used as textconv, and spurious ^M on added lines on Windows
  2017-03-31 19:44   ` Jakub Narębski
@ 2017-04-02  4:34     ` Torsten Bögershausen
  0 siblings, 0 replies; 10+ messages in thread
From: Torsten Bögershausen @ 2017-04-02  4:34 UTC (permalink / raw)
  To: Jakub Narębski, git

On 2017-03-31 21:44, Jakub Narębski wrote:
> W dniu 31.03.2017 o 14:38, Torsten Bögershausen pisze:
>> On 30.03.17 21:35, Jakub Narębski wrote:
>>> Hello,
>>>
>>> Recently I had to work on a project which uses legacy 8-bit encoding
>>> (namely cp1250 encoding) instead of utf-8 for text files (LaTeX
>>> documents).  My terminal, that is Git Bash from Git for Windows is set
>>> up for utf-8.
>>>
>>> I wanted for "git diff" and friends to return something sane on said
>>> utf-8 terminal, instead of mojibake.  There is 'encoding'
>>> gitattribute... but it works only for GUI ('git gui', that is).
>>>
>>> Therefore I have (ab)used textconv facility to convert from cp1250 of
>>> file encoding to utf-8 encoding of console.
>>>
>>> I have set the following in .gitattributes file:
>>>
>>>   ## LaTeX documents in cp1250 encoding
>>>   *.tex text diff=mylatex
>>>
>>> The 'mylatex' driver is defined as:
>>>
>>>   [diff "mylatex"]
>>>         xfuncname = "^(\\\\((sub)*section|chapter|part)\\*{0,1}\\{.*)$"
>>>         wordRegex = "\\\\[a-zA-Z]+|[{}]|\\\\.|[^\\{}[:space:]]+"
>>>         textconv  = \"C:/Program Files/Git/usr/bin/iconv.exe\" -f cp1250 -t utf-8
>>>         cachetextconv = true
>>>
>>> And everything would be all right... if not the fact that Git appends
>>> spurious ^M to added lines in the `git diff` output.  Files use CRLF
>>> end-of-line convention (the native MS Windows one).
>>>
>>>   $ git diff test.tex
>>>   diff --git a/test.tex b/test.tex
>>>   index 029646e..250ab16 100644
>>>   --- a/test.tex
>>>   +++ b/test.tex
>>>   @@ -1,4 +1,4 @@
>>>   -\documentclass{article}
>>>   +\documentclass{mwart}^M
>>>   
>>>    \usepackage[cp1250]{inputenc}
>>>    \usepackage{polski}
>>>
>>> What gives?  Why there is this ^M tacked on the end of added lines,
>>> while it is not present in deleted lines, nor in content lines?
>>>
>>> Puzzled.
>>>
>>> P.S. Git has `i18n.commitEncoding` and `i18n.logOutputEncoding`; pity
>>> that it doesn't supports in core `encoding` attribute together with
>>> having `i18n.outputEncoding`.
>>
>> Is there a chance to give us a receipt how to reproduce it?
>> A complete test script or ?
>> (I don't want to speculate, if the invocation of iconv is the problem,
>>  where stdout is not in "binary mode", or however this is called under Windows)
> 
> I'm sorry, I though I posted whole recipe, but I missed some details
> in the above description of the case.
> 
> First, files are stored on filesystem using CRLF eol (DOS end-of-line
> convention).  Due to `core.autocrlf` they are converted to LF in blobs,
> that is in the index and in the repository.
> 
> Second, a textconv with filter preserving end-of-line needs to be
> configured.  I have used `iconv`, but I suspect that the problem would
> happen also for `cat`.
> 
> In the .gitattributes file, or .git/info/attributes add, for example:
> 
>   *.tex text diff=myconv
> 
> In the .git/config configure the textconv filter, for example:
> 
>   [diff "myconv"]
>          textconv  = iconv.exe -f cp1250 -t utf-8
> 
> Create a file which filename matches the attribute line, and which
> uses CRLF end of line convention, and add it to Git (adding it to
> the index):
> 
>   $ printf "foo\r\n" >foo.tex
>   $ git add foo.tex
> 
> Modify file (also with CRLF):
> 
>   $ printf "bar\r\n" >foo.tex
> 
> Check the difference
> 
>   $ git diff foo.tex
> 
> HTH
> 

There seems to be a bug in Git, when it comes to "git diff".
Before we feed the content of the working tree into the
diff machinery, a call to convert_to_git() should be made.
But it seems as there is something missing, the expected
"+fox" becomes a "+foxQ"

#!/bin/sh

test_description='CRLF with diff filter'

. ./test-lib.sh

test_expect_success 'setup' '
	git config core.autocrlf input &&
	printf "foo\r\n" >foo.tex &&
  git add foo.tex &&
	echo >.gitattributes &&
	git checkout -b master &&
	git add .gitattributes &&
	git commit -m "Add foo.txt" &&
	cat >.git/config <<-\EOF
	[diff "myconv"]
	       textconv  = sed -e "s/f/g"
EOF
'

test_expect_success 'check EOL in diff' '
	printf "fox\r\n" >foo.tex &&
	cat >expect <<-\EOF &&
diff --git a/foo.tex b/foo.tex
index 257cc56..88c2893 100644
--- a/foo.tex
+++ b/foo.tex
@@ -1 +1 @@
-foo
+fox
EOF
	git diff foo.tex | tr "\015" Q >actual &&
	test_cmp expect actual
'

test_done


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [BUG?] iconv used as textconv, and spurious ^M on added lines on Windows
  2017-04-01 18:31       ` Jakub Narębski
@ 2017-04-02  7:45         ` Jeff King
  2017-04-02 11:40           ` Jakub Narębski
  0 siblings, 1 reply; 10+ messages in thread
From: Jeff King @ 2017-04-02  7:45 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: git

On Sat, Apr 01, 2017 at 08:31:27PM +0200, Jakub Narębski wrote:

> W dniu 01.04.2017 o 08:08, Jeff King pisze:
> > On Fri, Mar 31, 2017 at 03:24:48PM +0200, Jakub Narębski wrote:
> > 
> >>> I suspect in the normal case that git is doing line-ending conversion,
> >>> but it's suppressed when textconv is in use.
> >>
> >> I would not consider this a bug if not for the fact that there is no ^M
> >> without using iconv as textconv.
> > 
> > I don't think it's a bug, though. You have told Git that you will
> > convert the contents (whatever their format) into the canonical format,
> > but your program to do so includes a CR.
> 
> Well, I have not declared file binary with "binary = true" in diff driver
> definition, isn't it?

I don't think binary has anything to do with it. A textconv filter takes
input (binary or not) and delivers a normalized representation to feed
to the diff algorithm. There's no further post-processing, and it's the
responsibility of the filter to deliver the bytes it wants diffed.

Like I said, I could see an argument for treating the filter output as
text to be pre-processed, but that's not how it works (and I don't think
it is a good idea to change it now, unless by adding an option to the
diff filter).

> P.S. What do you think about Git supporting 'encoding' attribute (or
> 'core.encoding' config) plus 'core.outputEncoding' in-core?

Supporting an "encoding" attribute to normalize file encodings in diffs
seems reasonable to me. But it would have to be enabled only for
human-readable diffs, as the result could not be applied (so the same as
textconv).

I don't think core.outputEncoding is necessarily a good idea. We are not
really equipped anything that isn't an ascii superset, as we intermingle
the bytes with ascii diff headers (though I think that is true of the
commitEncoding stuff; I assume everything breaks horribly if you tried
to set that to UTF-16, but I've never tried it).

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [BUG?] iconv used as textconv, and spurious ^M on added lines on Windows
  2017-04-02  7:45         ` Jeff King
@ 2017-04-02 11:40           ` Jakub Narębski
  0 siblings, 0 replies; 10+ messages in thread
From: Jakub Narębski @ 2017-04-02 11:40 UTC (permalink / raw)
  To: Jeff King; +Cc: git

W dniu 02.04.2017 o 09:45, Jeff King pisze:
> On Sat, Apr 01, 2017 at 08:31:27PM +0200, Jakub Narębski wrote:
> 
>> W dniu 01.04.2017 o 08:08, Jeff King pisze:
>>> On Fri, Mar 31, 2017 at 03:24:48PM +0200, Jakub Narębski wrote:
>>>
>>>>> I suspect in the normal case that git is doing line-ending conversion,
>>>>> but it's suppressed when textconv is in use.
>>>>
>>>> I would not consider this a bug if not for the fact that there is no ^M
>>>> without using iconv as textconv.
>>>
>>> I don't think it's a bug, though. You have told Git that you will
>>> convert the contents (whatever their format) into the canonical format,
>>> but your program to do so includes a CR.
>>
>> Well, I have not declared file binary with "binary = true" in diff driver
>> definition, isn't it?
> 
> I don't think binary has anything to do with it. A textconv filter takes
> input (binary or not) and delivers a normalized representation to feed
> to the diff algorithm. There's no further post-processing, and it's the
> responsibility of the filter to deliver the bytes it wants diffed.
> 
> Like I said, I could see an argument for treating the filter output as
> text to be pre-processed, but that's not how it works (and I don't think
> it is a good idea to change it now, unless by adding an option to the
> diff filter).

I think that actually there is something wrong.

If textconv really gets normalized representation of pre-image (the index
version) and post-image (the filesystem version), as it should I think,
both pre-image lines ('-') and post-image lines ('+') should use CRLF,
so there should be no warning, i.e. ^M

Or textconv filter gets normalized representation (it looks this way
when examining diff result saved to file with `git diff test.tex >test.diff`;
I were unable to use `tr '\r' 'Q', either I got "fatal: bad config line"
from Git, or "tr: extra operand" from tr), and somehow Git mistakes
what is happening and writes those ^M.

If I understand it correctly, if pre-image, post-image and context
all use the same eol, there should be no warning, isn't it?

> 
>> P.S. What do you think about Git supporting 'encoding' attribute (or
>> 'core.encoding' config) plus 'core.outputEncoding' in-core?
> 
> Supporting an "encoding" attribute to normalize file encodings in diffs
> seems reasonable to me. But it would have to be enabled only for
> human-readable diffs, as the result could not be applied (so the same as
> textconv).

I was thinking about human readable diffs, and 'git show <blob>', same
as with textconv.

> 
> I don't think core.outputEncoding is necessarily a good idea. We are not
> really equipped anything that isn't an ascii superset, as we intermingle
> the bytes with ascii diff headers (though I think that is true of the
> commitEncoding stuff; I assume everything breaks horribly if you tried
> to set that to UTF-16, but I've never tried it).

Well, the understanding would be that the same limitation as for 
core.logOutputEncoding (documented if it isn't) that only encodings that
are ASCII compatibile are supported.
 
-- 
Jakub Narębski


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2017-04-02 11:41 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-30 19:35 [BUG?] iconv used as textconv, and spurious ^M on added lines on Windows Jakub Narębski
2017-03-30 20:00 ` Jeff King
2017-03-31 13:24   ` Jakub Narębski
2017-04-01  6:08     ` Jeff King
2017-04-01 18:31       ` Jakub Narębski
2017-04-02  7:45         ` Jeff King
2017-04-02 11:40           ` Jakub Narębski
2017-03-31 12:38 ` Torsten Bögershausen
2017-03-31 19:44   ` Jakub Narębski
2017-04-02  4:34     ` Torsten Bögershausen

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.