* [BUG?] iconv used as textconv, and spurious ^M on added lines on Windows
@ 2017-03-30 19:35 Jakub Narębski
2017-03-30 20:00 ` Jeff King
2017-03-31 12:38 ` Torsten Bögershausen
0 siblings, 2 replies; 10+ messages in thread
From: Jakub Narębski @ 2017-03-30 19:35 UTC (permalink / raw)
To: git
Hello,
Recently I had to work on a project which uses a legacy 8-bit encoding
(namely cp1250) instead of utf-8 for text files (LaTeX documents). My
terminal, that is Git Bash from Git for Windows, is set up for utf-8.
I wanted "git diff" and friends to return something sane on said
utf-8 terminal, instead of mojibake. There is the 'encoding'
gitattribute... but it works only for the GUI ('git gui', that is).
Therefore I have (ab)used the textconv facility to convert from the
cp1250 file encoding to the utf-8 encoding of the console.
I have set the following in .gitattributes file:
## LaTeX documents in cp1250 encoding
*.tex text diff=mylatex
The 'mylatex' driver is defined as:
[diff "mylatex"]
xfuncname = "^(\\\\((sub)*section|chapter|part)\\*{0,1}\\{.*)$"
wordRegex = "\\\\[a-zA-Z]+|[{}]|\\\\.|[^\\{}[:space:]]+"
textconv = \"C:/Program Files/Git/usr/bin/iconv.exe\" -f cp1250 -t utf-8
cachetextconv = true
And everything would be all right... if not for the fact that Git appends
a spurious ^M to added lines in the `git diff` output. The files use the CRLF
end-of-line convention (the native MS Windows one).
$ git diff test.tex
diff --git a/test.tex b/test.tex
index 029646e..250ab16 100644
--- a/test.tex
+++ b/test.tex
@@ -1,4 +1,4 @@
-\documentclass{article}
+\documentclass{mwart}^M
\usepackage[cp1250]{inputenc}
\usepackage{polski}
What gives? Why is there this ^M tacked on the end of added lines,
while it is not present in deleted lines, nor in context lines?
Puzzled.
P.S. Git has `i18n.commitEncoding` and `i18n.logOutputEncoding`; it is a
pity that it doesn't support the `encoding` attribute in core, together
with an `i18n.outputEncoding`.
--
Jakub Narębski
* Re: [BUG?] iconv used as textconv, and spurious ^M on added lines on Windows
2017-03-30 19:35 [BUG?] iconv used as textconv, and spurious ^M on added lines on Windows Jakub Narębski
@ 2017-03-30 20:00 ` Jeff King
2017-03-31 13:24 ` Jakub Narębski
2017-03-31 12:38 ` Torsten Bögershausen
1 sibling, 1 reply; 10+ messages in thread
From: Jeff King @ 2017-03-30 20:00 UTC (permalink / raw)
To: Jakub Narębski; +Cc: git
On Thu, Mar 30, 2017 at 09:35:27PM +0200, Jakub Narębski wrote:
> And everything would be all right... if not the fact that Git appends
> spurious ^M to added lines in the `git diff` output. Files use CRLF
> end-of-line convention (the native MS Windows one).
>
> $ git diff test.tex
> diff --git a/test.tex b/test.tex
> index 029646e..250ab16 100644
> --- a/test.tex
> +++ b/test.tex
> @@ -1,4 +1,4 @@
> -\documentclass{article}
> +\documentclass{mwart}^M
>
> \usepackage[cp1250]{inputenc}
> \usepackage{polski}
>
> What gives? Why there is this ^M tacked on the end of added lines,
> while it is not present in deleted lines, nor in content lines?
Perhaps it's trailing whitespace highlighting for added lines? You can
add "cr-at-eol" to core.whitespace to suppress it.
I suspect in the normal case that git is doing line-ending conversion,
but it's suppressed when textconv is in use.
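As a minimal sketch of the cr-at-eol suggestion above (the scratch repository created with `mktemp` is only an assumption to keep the example self-contained), the setting can be applied per repository like this:

```shell
# Work in a scratch repository so nothing real is touched (illustrative only).
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Treat a carriage return at end of line as part of the line terminator,
# so it is not flagged as trailing whitespace in diffs.
git config core.whitespace cr-at-eol

# Show the value that is now in effect.
git config --get core.whitespace
```

This changes how whitespace errors are highlighted; it does not alter the bytes that a textconv command produces.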
-Peff
* Re: [BUG?] iconv used as textconv, and spurious ^M on added lines on Windows
2017-03-30 19:35 [BUG?] iconv used as textconv, and spurious ^M on added lines on Windows Jakub Narębski
2017-03-30 20:00 ` Jeff King
@ 2017-03-31 12:38 ` Torsten Bögershausen
2017-03-31 19:44 ` Jakub Narębski
1 sibling, 1 reply; 10+ messages in thread
From: Torsten Bögershausen @ 2017-03-31 12:38 UTC (permalink / raw)
To: Jakub Narębski, git
On 30.03.17 21:35, Jakub Narębski wrote:
> Hello,
>
> Recently I had to work on a project which uses legacy 8-bit encoding
> (namely cp1250 encoding) instead of utf-8 for text files (LaTeX
> documents). My terminal, that is Git Bash from Git for Windows is set
> up for utf-8.
>
> I wanted for "git diff" and friends to return something sane on said
> utf-8 terminal, instead of mojibake. There is 'encoding'
> gitattribute... but it works only for GUI ('git gui', that is).
>
> Therefore I have (ab)used textconv facility to convert from cp1250 of
> file encoding to utf-8 encoding of console.
>
> I have set the following in .gitattributes file:
>
> ## LaTeX documents in cp1250 encoding
> *.tex text diff=mylatex
>
> The 'mylatex' driver is defined as:
>
> [diff "mylatex"]
> xfuncname = "^(\\\\((sub)*section|chapter|part)\\*{0,1}\\{.*)$"
> wordRegex = "\\\\[a-zA-Z]+|[{}]|\\\\.|[^\\{}[:space:]]+"
> textconv = \"C:/Program Files/Git/usr/bin/iconv.exe\" -f cp1250 -t utf-8
> cachetextconv = true
>
> And everything would be all right... if not the fact that Git appends
> spurious ^M to added lines in the `git diff` output. Files use CRLF
> end-of-line convention (the native MS Windows one).
>
> $ git diff test.tex
> diff --git a/test.tex b/test.tex
> index 029646e..250ab16 100644
> --- a/test.tex
> +++ b/test.tex
> @@ -1,4 +1,4 @@
> -\documentclass{article}
> +\documentclass{mwart}^M
>
> \usepackage[cp1250]{inputenc}
> \usepackage{polski}
>
> What gives? Why there is this ^M tacked on the end of added lines,
> while it is not present in deleted lines, nor in content lines?
>
> Puzzled.
>
> P.S. Git has `i18n.commitEncoding` and `i18n.logOutputEncoding`; pity
> that it doesn't supports in core `encoding` attribute together with
> having `i18n.outputEncoding`.
> --
> Jakub Narębski
>
>
Is there a chance you could give us a recipe for how to reproduce it?
A complete test script, perhaps?
(I don't want to speculate whether the invocation of iconv is the problem,
with stdout not being in "binary mode", or whatever this is called under Windows.)
* Re: [BUG?] iconv used as textconv, and spurious ^M on added lines on Windows
2017-03-30 20:00 ` Jeff King
@ 2017-03-31 13:24 ` Jakub Narębski
2017-04-01 6:08 ` Jeff King
0 siblings, 1 reply; 10+ messages in thread
From: Jakub Narębski @ 2017-03-31 13:24 UTC (permalink / raw)
To: Jeff King; +Cc: git
On 30.03.2017 at 22:00, Jeff King wrote:
> On Thu, Mar 30, 2017 at 09:35:27PM +0200, Jakub Narębski wrote:
>
>> And everything would be all right... if not the fact that Git appends
>> spurious ^M to added lines in the `git diff` output. Files use CRLF
>> end-of-line convention (the native MS Windows one).
>>
>> $ git diff test.tex
>> diff --git a/test.tex b/test.tex
>> index 029646e..250ab16 100644
>> --- a/test.tex
>> +++ b/test.tex
>> @@ -1,4 +1,4 @@
>> -\documentclass{article}
>> +\documentclass{mwart}^M
>>
>> \usepackage[cp1250]{inputenc}
>> \usepackage{polski}
>>
>> What gives? Why there is this ^M tacked on the end of added lines,
>> while it is not present in deleted lines, nor in content lines?
Gah, I forgot that Git for Windows installed with default options uses
`core.autocrlf=true`, so file contents are stored in the repository and
in the index using the LF end-of-line convention -- that is why there is
no ^M in the pre-image (in removed lines).
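To see that effect in isolation, the following sketch compares the bytes in the working tree with the bytes stored in the index. It assumes a scratch repository with `core.autocrlf=true` set explicitly; the file name foo.tex is just an example.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config core.autocrlf true

# A one-line file with a CRLF line ending.
printf 'foo\r\n' >foo.tex
git add foo.tex

# Working tree copy: 5 bytes, ends in \r\n.
od -c foo.tex

# Blob in the index: 4 bytes, ends in \n only.
git cat-file -p :foo.tex | od -c
```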
> Perhaps it's trailing whitespace highlighting for added lines? You can
> add "cr-at-eol" to core.whitespace to suppress it.
Thanks! That solves the problem (or rather works around it).
>
> I suspect in the normal case that git is doing line-ending conversion,
> but it's suppressed when textconv is in use.
I would not consider this a bug if not for the fact that there is no ^M
without using iconv as textconv.
Compare (without textconv => no ^M, but mojibake):
$ git diff test.txt
diff --git a/test.txt b/test.txt
index 029646e..38cd657 100644
--- a/test.txt
+++ b/test.txt
@@ -1,9 +1,10 @@
-\documentclass{article}
+\documentclass{mwart}
\usepackage[cp1250]{inputenc}
\usepackage{polski}
\begin{document}
+Za<BF><F3><B3><E6> g<EA><9C>l<B9> ja<9F><F1>!
\end{document}
with the following (with textconv => no gibberish, but ^M):
$ git diff test.tex
diff --git a/test.tex b/test.tex
index 029646e..38cd657 100644
--- a/test.tex
+++ b/test.tex
@@ -1,9 +1,10 @@
-\documentclass{article}
+\documentclass{mwart}^M
\usepackage[cp1250]{inputenc}
\usepackage{polski}
\begin{document}
+Zażółć gęślą jaźń!^M
\end{document}
--
Jakub Narębski
* Re: [BUG?] iconv used as textconv, and spurious ^M on added lines on Windows
2017-03-31 12:38 ` Torsten Bögershausen
@ 2017-03-31 19:44 ` Jakub Narębski
2017-04-02 4:34 ` Torsten Bögershausen
0 siblings, 1 reply; 10+ messages in thread
From: Jakub Narębski @ 2017-03-31 19:44 UTC (permalink / raw)
To: Torsten Bögershausen, git
On 31.03.2017 at 14:38, Torsten Bögershausen wrote:
> On 30.03.17 21:35, Jakub Narębski wrote:
>> Hello,
>>
>> Recently I had to work on a project which uses legacy 8-bit encoding
>> (namely cp1250 encoding) instead of utf-8 for text files (LaTeX
>> documents). My terminal, that is Git Bash from Git for Windows is set
>> up for utf-8.
>>
>> I wanted for "git diff" and friends to return something sane on said
>> utf-8 terminal, instead of mojibake. There is 'encoding'
>> gitattribute... but it works only for GUI ('git gui', that is).
>>
>> Therefore I have (ab)used textconv facility to convert from cp1250 of
>> file encoding to utf-8 encoding of console.
>>
>> I have set the following in .gitattributes file:
>>
>> ## LaTeX documents in cp1250 encoding
>> *.tex text diff=mylatex
>>
>> The 'mylatex' driver is defined as:
>>
>> [diff "mylatex"]
>> xfuncname = "^(\\\\((sub)*section|chapter|part)\\*{0,1}\\{.*)$"
>> wordRegex = "\\\\[a-zA-Z]+|[{}]|\\\\.|[^\\{}[:space:]]+"
>> textconv = \"C:/Program Files/Git/usr/bin/iconv.exe\" -f cp1250 -t utf-8
>> cachetextconv = true
>>
>> And everything would be all right... if not the fact that Git appends
>> spurious ^M to added lines in the `git diff` output. Files use CRLF
>> end-of-line convention (the native MS Windows one).
>>
>> $ git diff test.tex
>> diff --git a/test.tex b/test.tex
>> index 029646e..250ab16 100644
>> --- a/test.tex
>> +++ b/test.tex
>> @@ -1,4 +1,4 @@
>> -\documentclass{article}
>> +\documentclass{mwart}^M
>>
>> \usepackage[cp1250]{inputenc}
>> \usepackage{polski}
>>
>> What gives? Why there is this ^M tacked on the end of added lines,
>> while it is not present in deleted lines, nor in content lines?
>>
>> Puzzled.
>>
>> P.S. Git has `i18n.commitEncoding` and `i18n.logOutputEncoding`; pity
>> that it doesn't supports in core `encoding` attribute together with
>> having `i18n.outputEncoding`.
>
> Is there a chance to give us a receipt how to reproduce it?
> A complete test script or ?
> (I don't want to speculate, if the invocation of iconv is the problem,
> where stdout is not in "binary mode", or however this is called under Windows)
I'm sorry, I thought I had posted the whole recipe, but I missed some details
in the above description of the case.
First, files are stored on the filesystem using CRLF eol (the DOS end-of-line
convention). Due to `core.autocrlf` they are converted to LF in blobs,
that is in the index and in the repository.
Second, a textconv filter that preserves end-of-line needs to be
configured. I have used `iconv`, but I suspect that the problem would
also happen with `cat`.
In the .gitattributes file, or .git/info/attributes add, for example:
*.tex text diff=myconv
In the .git/config configure the textconv filter, for example:
[diff "myconv"]
textconv = iconv.exe -f cp1250 -t utf-8
Create a file whose filename matches the attribute line, and which
uses the CRLF end-of-line convention, and add it to Git (adding it to
the index):
$ printf "foo\r\n" >foo.tex
$ git add foo.tex
Modify the file (also with CRLF):
$ printf "bar\r\n" >foo.tex
Check the difference:
$ git diff foo.tex
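Put together, the steps above can be run as one script. This is a sketch under a few assumptions: it uses a scratch directory, sets `core.autocrlf=true` explicitly (the recipe relies on it), calls `iconv` rather than `iconv.exe`, and pipes the result through `tr` only so that any carriage return becomes a visible Q.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config core.autocrlf true

# Attribute line and diff driver from the recipe.
echo '*.tex text diff=myconv' >.gitattributes
git config diff.myconv.textconv 'iconv -f cp1250 -t utf-8'

# Add a CRLF file, then modify it (still CRLF).
printf 'foo\r\n' >foo.tex
git add foo.tex
printf 'bar\r\n' >foo.tex

# Show the diff, turning any CR into a visible Q.
git diff foo.tex | tr '\015' Q
```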
HTH
--
Jakub Narębski
* Re: [BUG?] iconv used as textconv, and spurious ^M on added lines on Windows
2017-03-31 13:24 ` Jakub Narębski
@ 2017-04-01 6:08 ` Jeff King
2017-04-01 18:31 ` Jakub Narębski
0 siblings, 1 reply; 10+ messages in thread
From: Jeff King @ 2017-04-01 6:08 UTC (permalink / raw)
To: Jakub Narębski; +Cc: git
On Fri, Mar 31, 2017 at 03:24:48PM +0200, Jakub Narębski wrote:
> > I suspect in the normal case that git is doing line-ending conversion,
> > but it's suppressed when textconv is in use.
>
> I would not consider this a bug if not for the fact that there is no ^M
> without using iconv as textconv.
I don't think it's a bug, though. You have told Git that you will
convert the contents (whatever their format) into the canonical format,
but your program to do so includes a CR.
We _could_ further process with other canonicalizations, but I'm not
sure that is a good idea (line-endings sound reasonably harmless, but
almost certainly we should not be doing clean/smudge filtering). And I'm
not sure if there would be any compatibility fallouts.
So I think the behavior is perhaps not what you want, but it's not an
unreasonable one. And the solution is to define your textconv such that
it produces clean LF-only output. Perhaps:
[diff.whatever]
textconv = "iconv ... | tr -d '\r'"
?
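One way to wire that up, as a sketch rather than a definitive recipe: Git passes the file to convert as an argument to the textconv command, so in a pipeline the file name has to be handed to `iconv` explicitly, here through a small shell function named `f` (an arbitrary name). Setting the value with `git config` instead of editing the config file by hand sidesteps the config-file escaping rules; `\015` is the octal code for CR. The driver name `myconv`, the cp1250 source encoding and the scratch repository are assumptions for illustration.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config core.autocrlf true
echo '*.tex text diff=myconv' >.gitattributes

# The function receives the file name as $1, converts the encoding,
# then strips carriage returns so only LF line endings reach the diff.
git config diff.myconv.textconv \
    'f() { iconv -f cp1250 -t utf-8 "$1" | tr -d "\015"; }; f'

printf 'foo\r\n' >foo.tex
git add foo.tex
printf 'bar\r\n' >foo.tex

# od -c makes any remaining \r visible in the diff output.
git diff foo.tex | od -c
```

A separate wrapper script on PATH would work the same way and may be easier to read than a one-line function.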
-Peff
* Re: [BUG?] iconv used as textconv, and spurious ^M on added lines on Windows
2017-04-01 6:08 ` Jeff King
@ 2017-04-01 18:31 ` Jakub Narębski
2017-04-02 7:45 ` Jeff King
0 siblings, 1 reply; 10+ messages in thread
From: Jakub Narębski @ 2017-04-01 18:31 UTC (permalink / raw)
To: Jeff King; +Cc: git
On 01.04.2017 at 08:08, Jeff King wrote:
> On Fri, Mar 31, 2017 at 03:24:48PM +0200, Jakub Narębski wrote:
>
>>> I suspect in the normal case that git is doing line-ending conversion,
>>> but it's suppressed when textconv is in use.
>>
>> I would not consider this a bug if not for the fact that there is no ^M
>> without using iconv as textconv.
>
> I don't think it's a bug, though. You have told Git that you will
> convert the contents (whatever their format) into the canonical format,
> but your program to do so includes a CR.
Well, I have not declared the file binary with "binary = true" in the diff
driver definition, have I?
>
> We _could_ further process with other canonicalizations, but I'm not
> sure that is a good idea (line-endings sound reasonably harmless, but
> almost certainly we should not be doing clean/smudge filtering). And I'm
> not sure if there would be any compatibility fallouts.
Yes, gitattributes(5) defines interaction between 'text'/'eol', 'ident'
and 'filter' attributes, but nothing about 'diff' and 'text'/'eol'.
>
> So I think the behavior is perhaps not what you want, but it's not an
> unreasonable one. And the solution is to define your textconv such that
> it produces clean LF-only output. Perhaps:
>
> [diff.whatever]
> textconv = "iconv ... | tr -d '\r'"
Well, either this (or an equivalent using dos2unix), or using the 'whitespace'
attribute (or 'core.whitespace') with cr-at-eol.
P.S. What do you think about Git supporting an 'encoding' attribute (or
'core.encoding' config) plus 'core.outputEncoding' in-core?
Best,
--
Jakub Narębski
* Re: [BUG?] iconv used as textconv, and spurious ^M on added lines on Windows
2017-03-31 19:44 ` Jakub Narębski
@ 2017-04-02 4:34 ` Torsten Bögershausen
0 siblings, 0 replies; 10+ messages in thread
From: Torsten Bögershausen @ 2017-04-02 4:34 UTC (permalink / raw)
To: Jakub Narębski, git
On 2017-03-31 21:44, Jakub Narębski wrote:
> On 31.03.2017 at 14:38, Torsten Bögershausen wrote:
>> On 30.03.17 21:35, Jakub Narębski wrote:
>>> Hello,
>>>
>>> Recently I had to work on a project which uses legacy 8-bit encoding
>>> (namely cp1250 encoding) instead of utf-8 for text files (LaTeX
>>> documents). My terminal, that is Git Bash from Git for Windows is set
>>> up for utf-8.
>>>
>>> I wanted for "git diff" and friends to return something sane on said
>>> utf-8 terminal, instead of mojibake. There is 'encoding'
>>> gitattribute... but it works only for GUI ('git gui', that is).
>>>
>>> Therefore I have (ab)used textconv facility to convert from cp1250 of
>>> file encoding to utf-8 encoding of console.
>>>
>>> I have set the following in .gitattributes file:
>>>
>>> ## LaTeX documents in cp1250 encoding
>>> *.tex text diff=mylatex
>>>
>>> The 'mylatex' driver is defined as:
>>>
>>> [diff "mylatex"]
>>> xfuncname = "^(\\\\((sub)*section|chapter|part)\\*{0,1}\\{.*)$"
>>> wordRegex = "\\\\[a-zA-Z]+|[{}]|\\\\.|[^\\{}[:space:]]+"
>>> textconv = \"C:/Program Files/Git/usr/bin/iconv.exe\" -f cp1250 -t utf-8
>>> cachetextconv = true
>>>
>>> And everything would be all right... if not the fact that Git appends
>>> spurious ^M to added lines in the `git diff` output. Files use CRLF
>>> end-of-line convention (the native MS Windows one).
>>>
>>> $ git diff test.tex
>>> diff --git a/test.tex b/test.tex
>>> index 029646e..250ab16 100644
>>> --- a/test.tex
>>> +++ b/test.tex
>>> @@ -1,4 +1,4 @@
>>> -\documentclass{article}
>>> +\documentclass{mwart}^M
>>>
>>> \usepackage[cp1250]{inputenc}
>>> \usepackage{polski}
>>>
>>> What gives? Why there is this ^M tacked on the end of added lines,
>>> while it is not present in deleted lines, nor in content lines?
>>>
>>> Puzzled.
>>>
>>> P.S. Git has `i18n.commitEncoding` and `i18n.logOutputEncoding`; pity
>>> that it doesn't supports in core `encoding` attribute together with
>>> having `i18n.outputEncoding`.
>>
>> Is there a chance to give us a receipt how to reproduce it?
>> A complete test script or ?
>> (I don't want to speculate, if the invocation of iconv is the problem,
>> where stdout is not in "binary mode", or however this is called under Windows)
>
> I'm sorry, I though I posted whole recipe, but I missed some details
> in the above description of the case.
>
> First, files are stored on filesystem using CRLF eol (DOS end-of-line
> convention). Due to `core.autocrlf` they are converted to LF in blobs,
> that is in the index and in the repository.
>
> Second, a textconv with filter preserving end-of-line needs to be
> configured. I have used `iconv`, but I suspect that the problem would
> happen also for `cat`.
>
> In the .gitattributes file, or .git/info/attributes add, for example:
>
> *.tex text diff=myconv
>
> In the .git/config configure the textconv filter, for example:
>
> [diff "myconv"]
> textconv = iconv.exe -f cp1250 -t utf-8
>
> Create a file which filename matches the attribute line, and which
> uses CRLF end of line convention, and add it to Git (adding it to
> the index):
>
> $ printf "foo\r\n" >foo.tex
> $ git add foo.tex
>
> Modify file (also with CRLF):
>
> $ printf "bar\r\n" >foo.tex
>
> Check the difference
>
> $ git diff foo.tex
>
> HTH
>
There seems to be a bug in Git when it comes to "git diff".
Before we feed the content of the working tree into the
diff machinery, a call to convert_to_git() should be made.
But it seems that something is missing: the expected
"+fox" becomes "+foxQ".
#!/bin/sh
test_description='CRLF with diff filter'
. ./test-lib.sh
test_expect_success 'setup' '
git config core.autocrlf input &&
printf "foo\r\n" >foo.tex &&
git add foo.tex &&
echo >.gitattributes &&
git checkout -b master &&
git add .gitattributes &&
git commit -m "Add foo.txt" &&
cat >.git/config <<-\EOF
[diff "myconv"]
textconv = sed -e "s/f/g"
EOF
'
test_expect_success 'check EOL in diff' '
printf "fox\r\n" >foo.tex &&
cat >expect <<-\EOF &&
diff --git a/foo.tex b/foo.tex
index 257cc56..88c2893 100644
--- a/foo.tex
+++ b/foo.tex
@@ -1 +1 @@
-foo
+fox
EOF
git diff foo.tex | tr "\015" Q >actual &&
test_cmp expect actual
'
test_done
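The `tr "\015" Q` step in the test is what makes a stray carriage return show up as a visible character. A standalone sketch of that trick, outside the Git test harness:

```shell
# \015 is the octal code for CR; replace it with a visible Q.
printf 'fox\r\n' | tr '\015' Q

# A line without CR passes through unchanged.
printf 'fox\n' | tr '\015' Q
```

The first command prints foxQ and the second prints fox, which is the difference the test's expect file is checking for.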
* Re: [BUG?] iconv used as textconv, and spurious ^M on added lines on Windows
2017-04-01 18:31 ` Jakub Narębski
@ 2017-04-02 7:45 ` Jeff King
2017-04-02 11:40 ` Jakub Narębski
0 siblings, 1 reply; 10+ messages in thread
From: Jeff King @ 2017-04-02 7:45 UTC (permalink / raw)
To: Jakub Narębski; +Cc: git
On Sat, Apr 01, 2017 at 08:31:27PM +0200, Jakub Narębski wrote:
> On 01.04.2017 at 08:08, Jeff King wrote:
> > On Fri, Mar 31, 2017 at 03:24:48PM +0200, Jakub Narębski wrote:
> >
> >>> I suspect in the normal case that git is doing line-ending conversion,
> >>> but it's suppressed when textconv is in use.
> >>
> >> I would not consider this a bug if not for the fact that there is no ^M
> >> without using iconv as textconv.
> >
> > I don't think it's a bug, though. You have told Git that you will
> > convert the contents (whatever their format) into the canonical format,
> > but your program to do so includes a CR.
>
> Well, I have not declared file binary with "binary = true" in diff driver
> definition, isn't it?
I don't think binary has anything to do with it. A textconv filter takes
input (binary or not) and delivers a normalized representation to feed
to the diff algorithm. There's no further post-processing, and it's the
responsibility of the filter to deliver the bytes it wants diffed.
Like I said, I could see an argument for treating the filter output as
text to be pre-processed, but that's not how it works (and I don't think
it is a good idea to change it now, unless by adding an option to the
diff filter).
> P.S. What do you think about Git supporting 'encoding' attribute (or
> 'core.encoding' config) plus 'core.outputEncoding' in-core?
Supporting an "encoding" attribute to normalize file encodings in diffs
seems reasonable to me. But it would have to be enabled only for
human-readable diffs, as the result could not be applied (so the same as
textconv).
I don't think core.outputEncoding is necessarily a good idea. We are not
really equipped to handle anything that isn't an ascii superset, as we intermingle
the bytes with ascii diff headers (though I think that is true of the
commitEncoding stuff; I assume everything breaks horribly if you tried
to set that to UTF-16, but I've never tried it).
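To illustrate the ASCII-superset point with a sketch (it assumes an `iconv` that knows the UTF-16LE encoding name): in UTF-16 every ASCII character is accompanied by a NUL byte, so such content cannot simply be interleaved with ASCII diff headers.

```shell
# "a" plus newline is 2 bytes in UTF-8 but 4 bytes in UTF-16LE,
# because each character gains a NUL byte.
printf 'a\n' | iconv -f UTF-8 -t UTF-16LE | od -c
```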
-Peff
* Re: [BUG?] iconv used as textconv, and spurious ^M on added lines on Windows
2017-04-02 7:45 ` Jeff King
@ 2017-04-02 11:40 ` Jakub Narębski
0 siblings, 0 replies; 10+ messages in thread
From: Jakub Narębski @ 2017-04-02 11:40 UTC (permalink / raw)
To: Jeff King; +Cc: git
On 02.04.2017 at 09:45, Jeff King wrote:
> On Sat, Apr 01, 2017 at 08:31:27PM +0200, Jakub Narębski wrote:
>
>> On 01.04.2017 at 08:08, Jeff King wrote:
>>> On Fri, Mar 31, 2017 at 03:24:48PM +0200, Jakub Narębski wrote:
>>>
>>>>> I suspect in the normal case that git is doing line-ending conversion,
>>>>> but it's suppressed when textconv is in use.
>>>>
>>>> I would not consider this a bug if not for the fact that there is no ^M
>>>> without using iconv as textconv.
>>>
>>> I don't think it's a bug, though. You have told Git that you will
>>> convert the contents (whatever their format) into the canonical format,
>>> but your program to do so includes a CR.
>>
>> Well, I have not declared file binary with "binary = true" in diff driver
>> definition, isn't it?
>
> I don't think binary has anything to do with it. A textconv filter takes
> input (binary or not) and delivers a normalized representation to feed
> to the diff algorithm. There's no further post-processing, and it's the
> responsibility of the filter to deliver the bytes it wants diffed.
>
> Like I said, I could see an argument for treating the filter output as
> text to be pre-processed, but that's not how it works (and I don't think
> it is a good idea to change it now, unless by adding an option to the
> diff filter).
I think that there actually is something wrong.
If textconv really gets the normalized representation of the pre-image (the
index version) and the post-image (the filesystem version), as I think it
should, both pre-image lines ('-') and post-image lines ('+') should use CRLF,
so there should be no warning, i.e. no ^M.
Or the textconv filter gets the normalized representation (it looks this way
when examining the diff result saved to a file with `git diff test.tex >test.diff`;
I was unable to use `tr '\r' 'Q'`: either I got "fatal: bad config line"
from Git, or "tr: extra operand" from tr), and somehow Git mistakes
what is happening and writes those ^M.
If I understand it correctly, if the pre-image, post-image and context
all use the same eol, there should be no warning, right?
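A possible explanation for the "tr: extra operand" message, offered as a guess: Git hands the file name to the textconv command as an argument, so when the command is a pipeline the name ends up on the last command, here `tr`. The "fatal: bad config line" message is likely the config parser rejecting `\r`, which does not appear to be among the escapes accepted in a hand-edited config file. The sketch below imitates the first case outside Git; `sh -c ... sh /dev/null` merely stands in for how the argument gets appended and is an assumption about the mechanism, not a copy of Git's code.

```shell
# The extra argument lands on tr, which takes no file operands and fails.
if sh -c 'printf "bar\r\n" | tr -d "\015" "$@"' sh /dev/null 2>/dev/null
then
    echo "tr accepted the extra argument"
else
    echo "tr rejected the extra argument"
fi
```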
>
>> P.S. What do you think about Git supporting 'encoding' attribute (or
>> 'core.encoding' config) plus 'core.outputEncoding' in-core?
>
> Supporting an "encoding" attribute to normalize file encodings in diffs
> seems reasonable to me. But it would have to be enabled only for
> human-readable diffs, as the result could not be applied (so the same as
> textconv).
I was thinking about human readable diffs, and 'git show <blob>', same
as with textconv.
>
> I don't think core.outputEncoding is necessarily a good idea. We are not
> really equipped anything that isn't an ascii superset, as we intermingle
> the bytes with ascii diff headers (though I think that is true of the
> commitEncoding stuff; I assume everything breaks horribly if you tried
> to set that to UTF-16, but I've never tried it).
Well, the understanding would be that the same limitation applies as for
i18n.logOutputEncoding (to be documented, if it isn't already): only encodings
that are ASCII-compatible are supported.
--
Jakub Narębski