* git clone corrupts file.
       [not found] ` <BN6PR15MB14261C40E614CC11416388B4CBFA9@BN6PR15MB1426.namprd15.prod.outlook.com>
@ 2021-08-13 18:54   ` Russell, Scott
  2021-08-13 22:30     ` brian m. carlson
  0 siblings, 1 reply; 13+ messages in thread
From: Russell, Scott @ 2021-08-13 18:54 UTC (permalink / raw)
  To: git

What did you do before the bug happened? git clone
What did you expect to happen? file cloned matches github copy

What happened instead? file corrupted, does not match github copy  see example
What's different between what you expected and what actually happened?  corruption


[System Info]

git version:
git version 2.31.1.windows.1

cpu: x86_64

sizeof-long: 4

sizeof-size_t: 8

shell-path: /bin/sh
feature: 
fsmonitor--daemon
uname: Windows 10.0 17134 

compiler info: gnuc: 10.2

libc info: no libc information available

$SHELL (typically, interactive shell): <unset>

[Enabled Hooks]
not run from a git repository - no hooks to show

File from git. 

਍⼀⼀ 䴀椀挀爀漀猀漀昀琀 嘀椀猀甀愀氀 䌀⬀⬀ 最攀渀攀爀愀琀攀搀 椀渀挀氀甀搀攀 昀椀氀攀⸀ഀഀ
// Used by CamTest.rc
਍⼀⼀ഀഀ
#define IDM_ABOUTBOX                    0x0010
਍⌀搀攀昀椀渀攀 䤀䐀䐀开䄀䈀伀唀吀䈀伀堀                    ㄀  ഀഀ

File in github.  

//{{NO_DEPENDENCIES}}
// Microsoft Visual C++ generated include file.
// Used by CamTest.rc
//


Thanks, 

Scott Russell
Staff SW Engineer 
NCR Corporation 
Phone: +17706237512
mailto:Scott.Russell2@ncr.com  |  http://www.ncr.com/
       


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: git clone corrupts file.
  2021-08-13 18:54   ` git clone corrupts file Russell, Scott
@ 2021-08-13 22:30     ` brian m. carlson
  2021-08-16 15:24       ` Russell, Scott
  0 siblings, 1 reply; 13+ messages in thread
From: brian m. carlson @ 2021-08-13 22:30 UTC (permalink / raw)
  To: Russell, Scott; +Cc: git

On 2021-08-13 at 18:54:43, Russell, Scott wrote:
> File from git.
> 
> ਍⼀⼀ 䴀椀挀爀漀猀漀昀琀 嘀椀猀甀愀氀 䌀⬀⬀ 最攀渀攀爀愀琀攀搀 椀渀挀氀甀搀攀 昀椀氀攀⸀ഀഀ
> // Used by CamTest.rc
> ਍⼀⼀ഀഀ
> #define IDM_ABOUTBOX                    0x0010
> ਍⌀搀攀昀椀渀攀 䤀䐀䐀开䄀䈀伀唀吀䈀伀堀                    ㄀  ഀഀ
> 
> File in github. 
> 
> //{{NO_DEPENDENCIES}}
> // Microsoft Visual C++ generated include file.
> // Used by CamTest.rc
> //

We're probably going to need a little more information about this.  My
guess as to what's happening here is that the editor you're using to
view the file is set to read files as UTF-16, but the repository has
them stored in UTF-8, or (less likely) vice versa.

Can you tell us what editor or other tool you're using to view the file
and what settings it's using for text encoding?  Can you tell us about
any working-tree-encoding declarations in your .gitattributes?  You can
use "git check-attr -a PATH" to see more information about that.

What code page are you using on your system?  Are you using PowerShell,
CMD, or Git Bash?  If you're using Git Bash, what are your locale
settings?
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA


* RE: git clone corrupts file.
  2021-08-13 22:30     ` brian m. carlson
@ 2021-08-16 15:24       ` Russell, Scott
  2021-08-16 16:53         ` Jeff King
  0 siblings, 1 reply; 13+ messages in thread
From: Russell, Scott @ 2021-08-16 15:24 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git

Brian,  

Thanks for your interest in this issue.   The issue has been determined to have 2 factors. 

1.  The files corrupted are in Unicode.   Though the .h file mentioned certainly doesn't have to be Unicode, it can be ANSI, we have other files that must be Unicode.  We use Unicode in quite a number of our text files.
2.  Git appears to corrupt the file by changing line endings.  
          a.   Github has the correct file.  It displays correctly there.  When downloaded as binary or text from Github in a browser, it is not corrupted. 
          b.   Git seems to change line endings as if the file were ANSI or another single-byte encoding, not Unicode. 
          c.   Git has the setting core.autocrlf false (set with git config).   But apparently, it is not being observed.   
          d.   The .gitconfig file has the [core] section with the entry autocrlf = false following the section.  
          e.   There is a .gitattributes file in the repo.   
          f.    Entries in .gitattributes specified by type are specified for the affected files. 
                        *.h     text eol=crlf
                        *.ini   text eol=crlf

If you look at the 1st line of the binary view of the original file, it looks like this:

FF FE 2F 00 2F 00 7B 00   7B 00 4E 00 4F 00 5F 00
44 00 45 00 50 00 45 00  4E 00 44 00 45 00 4E 00 
43 00 49 00 45 00 53 00  7D 00 7D 00 0D 00 0A 00   	Note - Unicode CR LF  0D 00 0A 00   

2nd line 
2F 00 2F 00 20 00 4D 00  69 00 63 00 72 00 6F 00  etc.   

If you look at the git file, it looks very similar.   
However, git has put a non-Unicode CR LF at the end of the line, 
plus an extra NULL.   This extra NULL throws the 2-byte Unicode encoding off and corrupts the line that follows.  On the line after that, the next extra NULL realigns the 2-byte encoding, so that line appears uncorrupted.  
You can see that in my original email.   Every other line is not readable.  

FF FE 2F 00 2F 00 7B 00   7B 00 4E 00 4F 00 5F 00
44 00 45 00 50 00 45 00  4E 00 44 00 45 00 4E 00 
43 00 49 00 45 00 53 00  7D 00 7D 00 0D 00 0D 0A   	Note - single-byte CR LF inserted: 0D 0A   

2nd line 
00 2F 00 2F 00 20 00 4D 00  69 00 63 00 72 00 6F  etc.   
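
The byte shift described above can be reproduced outside Git. The following is a minimal Python sketch, not Git's actual conversion code: naive_lf_to_crlf is a hypothetical helper that inserts a single-byte CR before every LF byte not already preceded by a CR byte, which is the effect a byte-wise LF-to-CRLF conversion would have on UTF-16LE data. The sample lines are taken from the file shown in this thread.

```python
import codecs

# Four CRLF-terminated lines, encoded as little-endian UTF-16 with a BOM.
lines = [
    "//{{NO_DEPENDENCIES}}",
    "// Microsoft Visual C++ generated include file.",
    "// Used by CamTest.rc",
    "//",
]
original = codecs.BOM_UTF16_LE + "".join(l + "\r\n" for l in lines).encode("utf-16-le")


def naive_lf_to_crlf(data: bytes) -> bytes:
    """Hypothetical helper: add a one-byte CR before each LF byte that lacks one."""
    out = bytearray()
    for i, byte in enumerate(data):
        if byte == 0x0A and (i == 0 or data[i - 1] != 0x0D):
            out.append(0x0D)
        out.append(byte)
    return bytes(out)


converted = naive_lf_to_crlf(original)

# Drop the 2-byte BOM and decode the rest the way a UTF-16 aware editor would.
decoded = converted[2:].decode("utf-16-le", errors="replace")
for line in decoded.split("\n"):
    print(repr(line))
```

In UTF-16LE the LF is stored as 0A 00, so the byte in front of 0A is 00 rather than 0D, and the helper adds a CR to every line. Each inserted byte shifts the following line by one byte, and the next inserted byte shifts it back, which matches the alternating readable and unreadable lines in the report.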

I would like git to observe the autocrlf false as directed.   

It's important that we retain 2 byte Unicode file encoding in many of our files.   And that git not add single byte CR LF into our 2 byte files.  
We can't convert the files to other encoding for convenience of git.  

Thanks, 

Scott Russell
Staff SW Engineer 
NCR Corporation 
Phone: +17706237512
Scott.Russell2@ncr.com  |  ncr.com
       


* Re: git clone corrupts file.
  2021-08-16 15:24       ` Russell, Scott
@ 2021-08-16 16:53         ` Jeff King
  2021-08-16 17:39           ` Russell, Scott
  2021-08-16 18:51           ` Jeff King
  0 siblings, 2 replies; 13+ messages in thread
From: Jeff King @ 2021-08-16 16:53 UTC (permalink / raw)
  To: Russell, Scott; +Cc: brian m. carlson, git

On Mon, Aug 16, 2021 at 03:24:28PM +0000, Russell, Scott wrote:

> 1.  The files corrupted are in Unicode.   Though the .h file mentioned
>     certainly doesn't have to be Unicode, it can be ANSI, we have
>     other files that must be Unicode.  We use Unicode in quite a
>     number of our text files.

By Unicode, I'll assume you mean UTF-16, since your example below
appears to have a BOM marker at the beginning (FF FE).

Unlike UTF-8, UTF-16 is not a superset of ASCII, and thus can't be
treated as "text" by Git (e.g., the line ending byte is no longer just
hex "0A", but "00 0A").
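
To illustrate the difference at the byte level, here is a small Python sketch (the variable names are only for illustration) comparing how the same two characters are encoded:

```python
sample = "A\n"

utf8_bytes = sample.encode("utf-8")        # one byte per ASCII character
utf16_bytes = sample.encode("utf-16-le")   # two bytes per character, low byte first

print(utf8_bytes.hex(" "))    # 41 0a
print(utf16_bytes.hex(" "))   # 41 00 0a 00

# Only the UTF-8 form is byte-for-byte identical to ASCII.
print(utf8_bytes == sample.encode("ascii"))    # True
print(utf16_bytes == sample.encode("ascii"))   # False
```

In a little-endian file the two bytes of the line feed appear in the order 0A 00, as in the hex dump quoted earlier in the thread.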

>           f.    Entries in .gitattributes specified by type are specified for the affected files. 
>                         *.h     text eol=crlf
>                         *.ini   text eol=crlf

So this is your problem. The "text" attribute is telling Git to treat
the file as text (which will handle any ASCII-superset encoding like
UTF-8, ISO-8859-1, etc, but not others like UTF-16, UTF-32, etc).

Depending on what's in your repo and what you want to have happen,
you'll want to:

  - remove that attribute, if all of your ".h" files are UTF-16

  - if only some are UTF-16, you'll need to provide patterns that
    distinguish between the two types by giving them different
    attributes (e.g., "-text" should override for specific files)

  - you can stop there if you don't need line-ending conversion for
    UTF-16 files (and there may be little point; Git will treat them as
    binary for the purposes of diffing, so there is little point in
    matching the canonical in-repo endings)

  - if you do want to do line ending conversion (or any other
    modifications on them), you can do so with a custom clean/smudge
    filter (see the "filter" attribute in "git help attributes")

> I would like git to observe the autocrlf false as directed.

Hopefully the above explains it, but just to be clear, this isn't
autocrlf kicking in, but rather the "text" and "eol" attributes you've
specified.

> We can't convert the files to other encoding for convenience of git.

If you're happy enough not being able to get meaningful text diffs for
these files from Git, then the above should make your problem go away.

But an alternative workflow, if you really want UTF-16 in the working
tree, is to convert between UTF-8 and UTF-16 as the files go in and out
of the working tree. There's no built-in support for that, but you could
do it with a custom clean/smudge filter. That would let Git store UTF-8
internally, do diffs, etc.
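
As a rough illustration of that idea, the two directions of such a filter could look like the Python sketch below. The function names clean and smudge simply mirror the filter terminology and are assumptions for this example; a real filter would read the file content on standard input, write the result to standard output, and be wired up through the filter.<name>.clean and filter.<name>.smudge settings described in "git help attributes". Line endings are deliberately left untouched here.

```python
import codecs


def clean(worktree_bytes: bytes) -> bytes:
    """Working tree (UTF-16, BOM selects byte order) -> repository (UTF-8)."""
    return worktree_bytes.decode("utf-16").encode("utf-8")


def smudge(repo_bytes: bytes) -> bytes:
    """Repository (UTF-8) -> working tree (UTF-16LE with a leading BOM)."""
    return codecs.BOM_UTF16_LE + repo_bytes.decode("utf-8").encode("utf-16-le")


# Round trip on one sample line.
working_copy = smudge("// Used by CamTest.rc\r\n".encode("utf-8"))
print(working_copy[:6].hex(" "))
print(clean(working_copy))
```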

One lighter alternative to that is to actually store UTF-16 in the
repository as you are now, but provide a textconv filter (see diff
attributes in "git help attributes") to convert it to UTF-8 on the fly
when showing a diff. You won't be able to apply such a diff, but they're
useful for human eyes.
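
A textconv helper for that lighter approach could be as small as the following Python sketch. As documented in "git help attributes", the textconv program takes the name of the file to convert as its single argument and produces the text on standard output; the function name utf16_to_utf8 and the demo file name are assumptions for illustration only.

```python
import codecs
import os
import tempfile


def utf16_to_utf8(path: str) -> bytes:
    """Read a UTF-16 file (the BOM selects the byte order) and return UTF-8 bytes."""
    with open(path, "rb") as handle:
        return handle.read().decode("utf-16").encode("utf-8")


# Demo: write a small UTF-16LE file with a BOM, then convert it.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "resource.h")
    with open(sample_path, "wb") as handle:
        handle.write(codecs.BOM_UTF16_LE + "// Used by CamTest.rc\r\n".encode("utf-16-le"))
    converted = utf16_to_utf8(sample_path)

print(converted)
```

A real helper would take the path from its command-line argument and write the returned bytes to standard output instead of using a temporary file.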

-Peff


* RE: git clone corrupts file.
  2021-08-16 16:53         ` Jeff King
@ 2021-08-16 17:39           ` Russell, Scott
  2021-08-16 18:49             ` Jeff King
  2021-08-16 18:51           ` Jeff King
  1 sibling, 1 reply; 13+ messages in thread
From: Russell, Scott @ 2021-08-16 17:39 UTC (permalink / raw)
  To: Jeff King; +Cc: brian m. carlson, git

Jeff,  

Thanks for your reply.   
We don't want any EOL handling of any file.  That's why we specify autocrlf false.  

We would like git to not do any CR LF conversion on any file, whether they be ANSI or Unicode.   They had the right endings when we checked them in. 
We don't want them adjusted.

Does removing the eol=crlf fix that? 

	You said:  UTF-16 ...  can't be treated as "text" by Git.

We can't make any changes to the files to suit git.   We just need git to store and retrieve files as committed.  

Will removing the 

	eol=crlf 

from the line 

*.ini text 

in the attributes file stop the issue?  

If not, does .gitattributes allow a path?  Such that we could say 

\config\Language Specific\*   type  -    If these are Unicode, what would we say here?   Can it not be text?  Then binary?  
*.ini	text 


Thanks, 

Scott Russell
Staff SW Engineer 
NCR Corporation 
Phone: +17706237512
Scott.Russell2@ncr.com  |  ncr.com
       


* Re: git clone corrupts file.
  2021-08-16 17:39           ` Russell, Scott
@ 2021-08-16 18:49             ` Jeff King
  2021-08-16 18:52               ` Russell, Scott
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff King @ 2021-08-16 18:49 UTC (permalink / raw)
  To: Russell, Scott; +Cc: brian m. carlson, git

On Mon, Aug 16, 2021 at 05:39:12PM +0000, Russell, Scott wrote:

> We don't want any EOL handling of any file.  That's why we specify autocrlf false.

Right, but it's not the whole story. autocrlf is an older and broader
mechanism for doing line-ending conversion. From its documentation in
"git help config":

  core.autocrlf
    Setting this variable to "true" is the same as setting the text
    attribute to "auto" on all files and core.eol to "crlf".[...]

You obviously don't want that, but you _also_ don't want to set the text
and eol attributes on individual paths, either.

> We would like git to not do any cr lf conversion on any file.   Whether
> they be ANSI or Unicode.   They had the right endings when we checked
> them in.
> We don't want them adjusted.
> 
> Does removing the eol = cr lf fix that?

That might be sufficient. You may also need to drop "text", as well.
Otherwise core.eol will kick in and do conversions. (Sorry, I don't use
Windows and it has been a long time since I looked into these options,
so you may have to do some experimenting).

> 	You said:  UTF-16 ...  can't be treated as "text" by Git.
> 
> We can't make any changes to the files to suit git.   We just need git to store and retrieve files as committed.

Right. That's what it does by default (if you don't set any .gitattributes).

What I mean by "can't be treated as text" is that Git will not correctly
implement text features like CRLF conversion, nor diffs, for such an
encoding. It is effectively a binary file from Git's perspective.

> Will removing the 
> 
> 	eol=cr lf 
> 
> from the line 
> 
> *.ini text 
> 
> from the attributes file stop the issue?
> 
> If not, does .gitattributes allow a path?  Such that we could say 
> 
> \config\Language Specific\*   type  -    If these are Unicode, what would we say here.   Can it not be text?  Then binary?  
> *.ini	text

If you simply drop the attributes entirely, Git will use its
auto-detection to determine whether a file is binary, which looks for
NULs (and UTF-16 files are generally full of them). So I suspect that
would do it. You can also provide the "-text" attribute to override that
and make sure no line-ending conversion is done.
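
The NUL-based detection can be pictured with the simplified Python sketch below. It is not Git's implementation: the helper name looks_binary and the 8000-byte window are assumptions for illustration.

```python
def looks_binary(data: bytes, window: int = 8000) -> bool:
    """Treat data as binary if a NUL byte appears within the first `window` bytes."""
    return b"\x00" in data[:window]


ansi_text = "#define IDM_ABOUTBOX 0x0010\r\n".encode("cp1252")
utf16_text = "#define IDM_ABOUTBOX 0x0010\r\n".encode("utf-16")

print(looks_binary(ansi_text))    # False
print(looks_binary(utf16_text))   # True
```

Because every ASCII character in a UTF-16 file carries a zero byte, even a short UTF-16 header trips a check of this kind, while the same text in a single-byte encoding does not.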

If you want to override a specific file, then yes, you can provide paths
(I don't recall offhand whether we allow backslashes in the patterns;
you may need to use forward slashes). You can also put the pattern "*"
in the "config/Language Specific/.gitattributes" to have it apply only
to that directory (and ones below it).

The patterns are the same as those in .gitignore files; see the section
"PATTERN FORMAT" in "git help ignore".

-Peff


* Re: git clone corrupts file.
  2021-08-16 16:53         ` Jeff King
  2021-08-16 17:39           ` Russell, Scott
@ 2021-08-16 18:51           ` Jeff King
  2021-08-16 18:53             ` Russell, Scott
  2021-08-16 21:50             ` brian m. carlson
  1 sibling, 2 replies; 13+ messages in thread
From: Jeff King @ 2021-08-16 18:51 UTC (permalink / raw)
  To: Russell, Scott; +Cc: brian m. carlson, git

On Mon, Aug 16, 2021 at 12:53:36PM -0400, Jeff King wrote:

> But an alternative workflow, if you really want UTF-16 in the working
> tree, is to convert between UTF-8 and UTF-16 as the files go in and out
> of the working tree. There's no built-in support for that, but you could
> do it with a custom clean/smudge filter. That would let Git store UTF-8
> internally, do diffs, etc.

Oh, by the way, I totally forgot that we added an internal version of
this, which is easier to configure and much more efficient. See the
"working-tree-encoding" attribute in "git help attributes".

Just in case you do want to go that route.

-Peff


* RE: git clone corrupts file.
  2021-08-16 18:49             ` Jeff King
@ 2021-08-16 18:52               ` Russell, Scott
  0 siblings, 0 replies; 13+ messages in thread
From: Russell, Scott @ 2021-08-16 18:52 UTC (permalink / raw)
  To: Jeff King; +Cc: brian m. carlson, git

Jeff,  

Thanks for your response.  I will try these suggestions.  I suspect I can come to some solution.   


Thanks, 

Scott Russell
Staff SW Engineer 
NCR Corporation 
Phone: +17706237512
Scott.Russell2@ncr.com  |  ncr.com
       


* RE: git clone corrupts file.
  2021-08-16 18:51           ` Jeff King
@ 2021-08-16 18:53             ` Russell, Scott
  2021-08-16 21:50             ` brian m. carlson
  1 sibling, 0 replies; 13+ messages in thread
From: Russell, Scott @ 2021-08-16 18:53 UTC (permalink / raw)
  To: Jeff King; +Cc: brian m. carlson, git

Okay, thanks.  I will look for that.  


Thanks, 

Scott Russell
Staff SW Engineer 
NCR Corporation 
Phone: +17706237512
Scott.Russell2@ncr.com  |  ncr.com
       


* Re: git clone corrupts file.
  2021-08-16 18:51           ` Jeff King
  2021-08-16 18:53             ` Russell, Scott
@ 2021-08-16 21:50             ` brian m. carlson
  2021-08-16 22:04               ` Russell, Scott
  1 sibling, 1 reply; 13+ messages in thread
From: brian m. carlson @ 2021-08-16 21:50 UTC (permalink / raw)
  To: Jeff King; +Cc: Russell, Scott, git

On 2021-08-16 at 18:51:04, Jeff King wrote:
> On Mon, Aug 16, 2021 at 12:53:36PM -0400, Jeff King wrote:
> 
> > But an alternative workflow, if you really want UTF-16 in the working
> > tree, is to convert between UTF-8 and UTF-16 as the files go in and out
> > of the working tree. There's no built-in support for that, but you could
> > do it with a custom clean/smudge filter. That would let Git store UTF-8
> > internally, do diffs, etc.
> 
> Oh, by the way, I totally forgot that we added an internal version of
> this, which is easier to configure and much more efficient. See the
> "working-tree-encoding" attribute in "git help attributes".
> 
> Just in case you do want to go that route.

The specific information you need is located in the Git FAQ[0], but
roughly, you would probably want something like this:

*.h text eol=crlf working-tree-encoding=UTF-16LE-BOM

That means that when checked out, the file will be in the format that
legacy Windows programs prefer (CRLF with little-endian UTF-16 with a
BOM), but will be stored internally in Git with LF and UTF-8.  That will
make things like git diff work much better, but still permit things to
be in the working tree as you wish.
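
Conceptually, that attribute line asks for a round trip like the one in this Python sketch. It is only a model of the behaviour described above, not Git's code, and the function names to_worktree and to_repository are made up for the example. The point to notice is the order: line endings are converted while the content is still UTF-8, and only then is the result re-encoded as UTF-16.

```python
import codecs


def to_worktree(repo_bytes: bytes) -> bytes:
    """Repository form (UTF-8, LF) -> working-tree form (UTF-16LE with BOM, CRLF)."""
    text = repo_bytes.decode("utf-8").replace("\n", "\r\n")
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


def to_repository(worktree_bytes: bytes) -> bytes:
    """Working-tree form -> repository form; the inverse of to_worktree."""
    text = worktree_bytes.decode("utf-16")  # the BOM selects the byte order and is dropped
    return text.replace("\r\n", "\n").encode("utf-8")


stored = b"//{{NO_DEPENDENCIES}}\n// Used by CamTest.rc\n"
checked_out = to_worktree(stored)

print(checked_out[:2].hex(" "))             # ff fe
print(checked_out[-4:].hex(" "))            # 0d 00 0a 00
print(to_repository(checked_out) == stored)  # True
```

Converting in this order is what keeps the CR and LF as proper two-byte UTF-16 code units instead of single bytes spliced into the middle of the stream.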

If you really don't want those to be modified at all, then you'd want to
write this:

*.h -text

However, Git will consider these files to be binary, since they are, and
git diff won't work on them without a textconv filter.

[0] https://git-scm.com/docs/gitfaq#windows-text-binary
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA


* RE: git clone corrupts file.
  2021-08-16 21:50             ` brian m. carlson
@ 2021-08-16 22:04               ` Russell, Scott
  2021-08-16 22:19                 ` brian m. carlson
  0 siblings, 1 reply; 13+ messages in thread
From: Russell, Scott @ 2021-08-16 22:04 UTC (permalink / raw)
  To: brian m. carlson, Jeff King; +Cc: git

Thanks Brian,  

I appreciate the guidance.   All our .h files can all be converted to ANSI.   I don't know why we seem to have just one saved as Unicode. 
But it was a wake-up call, and led to the discovery of other files that are not correct.  

Upon reading the help on .gitattributes, I was reminded that Visual Studio on Windows can save some .rc files as Unicode.  
I think that almost all are ANSI, but that leaves the possibility that any one saved as Unicode could unexpectedly fail to compile due to the conversion.  

Our *.ini files are a mix: mostly ANSI, but more than a few are Unicode.  
I don't know how to handle a mixture.

Perhaps I will have to specify 

*.ini -text.  

Unless, does .gitattributes allow paths to be specified?  In effect, use 

Path/path/path/*  text eol=crlf working-tree-encoding=UTF-16LE-BOM

And otherwise, 
*.ini text 	- these would be ANSI if not in path/path/path  

Thanks, 

Scott Russell
Staff SW Engineer 
NCR Corporation 
Phone: +17706237512
Scott.Russell2@ncr.com  |  ncr.com
       


* Re: git clone corrupts file.
  2021-08-16 22:04               ` Russell, Scott
@ 2021-08-16 22:19                 ` brian m. carlson
  2021-08-16 22:26                   ` Russell, Scott
  0 siblings, 1 reply; 13+ messages in thread
From: brian m. carlson @ 2021-08-16 22:19 UTC (permalink / raw)
  To: Russell, Scott; +Cc: Jeff King, git

On 2021-08-16 at 22:04:20, Russell, Scott wrote:
> Thanks Brian,
> 
> I appreciate the guidance.   All our .h files can all be converted to ANSI.   I don't know why we seemed to have just one saved as Unicode.
> But it was a wakeup, and led to discovery of other files not correct.
> 
> Upon reading the help on .gitattributes, I was reminded that Windows Visual Studio can save some .rc files as Unicode.
> I think that most all are ANSI but that leaves the possible result that any one saved as Unicode could unexpectedly fail compiling due to the conversion.

I do want to specify a distinction here.  You're referring to "Unicode"
and "ANSI", which traditionally mean, on Windows, little-endian UTF-16
with BOM and Windows-1252.  You do not generally want Windows-1252, or
the encoding on which it's based, ISO-8859-1.  Those are obsolete and
have been for well over a decade.  It's unfortunate that many Windows
programs continue to use these terms, because neither "Unicode" nor
"ANSI" describes an actual character set according to IANA.

What is going to work best here is UTF-8 without a BOM.  Most Windows
programs can handle that these days, but some still don't.  If you try
to save things as "ANSI" without a working-tree-encoding and they aren't
completely ASCII files, then you will end up with some weird diff output
at the very least.

If the files are completely ASCII, then no working-tree-encoding is
necessary, because ASCII is a subset of UTF-8.
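
The difference between pure ASCII, Windows-1252 and UTF-8 shows up as soon as a non-ASCII character is involved. A short Python sketch (the sample strings are invented for illustration):

```python
ascii_only = "#define IDM_ABOUTBOX 0x0010"
accented = "caf\u00e9"  # "cafe" with an acute accent on the final letter

# Pure ASCII text has the same bytes in ASCII, Windows-1252 and UTF-8.
same_bytes = (
    ascii_only.encode("ascii")
    == ascii_only.encode("cp1252")
    == ascii_only.encode("utf-8")
)
print(same_bytes)

# A non-ASCII character is one byte in Windows-1252 but two bytes in UTF-8.
print(accented.encode("cp1252").hex(" "))
print(accented.encode("utf-8").hex(" "))

# The Windows-1252 bytes are not valid UTF-8.
try:
    accented.encode("cp1252").decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)
```

This is why a file saved as "ANSI" only behaves like UTF-8 for as long as it stays within the ASCII range.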

> We have a mix of *.ini files which are a mix of mostly ANSI and more than a few others are Unicode.
> I don't know how to handle a mixture.
> 
> Perhaps I will have to specify
> 
> *.ini -text.
> 
> Unless, does .gitattributes allow paths to be specified?  In effect use the
> 
> Path/path/path/*  text lf=crlf working-tree-encoding=UTF-16LE-BOM

Yes, path patterns are allowed (though note that the attribute is
"eol", not "lf").  See the gitattributes(5) manual page for the details.
You can even do this:

dir/sub/path/*.ini text eol=crlf working-tree-encoding=UTF-16LE-BOM
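
As a rough illustration of what that attribute line asks for, the Python
sketch below mimics the two conversions: UTF-8 with LF line endings in the
repository, UTF-16LE with a BOM and CRLF line endings in the working tree.
This is not Git's actual implementation; the function names and the sample
.ini content are hypothetical, and it assumes the stored content uses LF
only.

```python
import codecs


def to_working_tree(repo_bytes: bytes) -> bytes:
    """Roughly what checkout produces: UTF-8/LF -> UTF-16LE with BOM/CRLF."""
    text = repo_bytes.decode("utf-8").replace("\n", "\r\n")
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


def to_repository(worktree_bytes: bytes) -> bytes:
    """Roughly what 'git add' stores: UTF-16LE with BOM/CRLF -> UTF-8/LF."""
    if not worktree_bytes.startswith(codecs.BOM_UTF16_LE):
        raise ValueError("expected a UTF-16LE byte-order mark")
    text = worktree_bytes[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    return text.replace("\r\n", "\n").encode("utf-8")


# Hypothetical .ini content as it would be stored in the repository.
stored = "[camera]\nname=Test\n".encode("utf-8")
checked_out = to_working_tree(stored)
```

The round trip to_repository(checked_out) == stored is the property that
matters: the repository keeps one canonical form and diffs stay readable,
while Windows tools still see the encoding they expect.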

One thing I forgot to mention is that after modifying your
.gitattributes file, you'll want to run "git add --renormalize ." and
then commit both the .gitattributes file and any resulting changes.
Otherwise, you may end up with files that aren't converted the way you
want.
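
The sequence above could be scripted along these lines.  This is only a
sketch: the helper names and the commit message are hypothetical, and the
"git status --short" step is an added assumption so the changes can be
reviewed before committing.

```python
import subprocess


def renormalize_steps(message="Normalize encodings per .gitattributes"):
    """Return the git commands to run, in order, as argument lists."""
    return [
        ["git", "add", ".gitattributes"],       # stage the attribute rules
        ["git", "add", "--renormalize", "."],   # re-apply them to tracked files
        ["git", "status", "--short"],           # review what changed
        ["git", "commit", "-m", message],       # commit rules and conversions
    ]


def run_steps(steps, runner=subprocess.run):
    """Run each command, stopping at the first failure."""
    for argv in steps:
        runner(argv, check=True)

# Usage, from the top of the repository:
#     run_steps(renormalize_steps())
```

Passing the runner in as a parameter means the command list can be inspected
or dry-run without touching a real repository.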
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: git clone corrupts file.
  2021-08-16 22:19                 ` brian m. carlson
@ 2021-08-16 22:26                   ` Russell, Scott
  0 siblings, 0 replies; 13+ messages in thread
From: Russell, Scott @ 2021-08-16 22:26 UTC (permalink / raw)
  To: brian m. carlson; +Cc: Jeff King, git

Ok, thanks for all the help.
I think with the path in .gitattributes it will be fine.

dir/sub/path/*.ini text eol=crlf working-tree-encoding=UTF-16LE-BOM

I will give that a try and see how it works out.   And thanks especially for the advice on "git add --renormalize".   I would never have known to do that.


Thanks, 

Scott Russell
Staff SW Engineer 
NCR Corporation 
Phone: +17706237512
Scott.Russell2@ncr.com  |  ncr.com
       


^ permalink raw reply	[flat|nested] 13+ messages in thread

