* mingw, windows, crlf/lf, and git
@ 2007-02-11 23:13 Mark Levedahl
  2007-02-11 23:34 ` Johannes Schindelin
                   ` (4 more replies)
  0 siblings, 5 replies; 83+ messages in thread
From: Mark Levedahl @ 2007-02-11 23:13 UTC (permalink / raw)
  To: Git Mailing List

I am NOT intending to start a flamewar O:-) , so please don't turn this 
into one.

The recent threads on a mingw git port are explicit in the intent to 
provide a Windows native git. I believe there is a fundamental conflict 
here with the position, clearly stated by Linus, that git does not alter 
content in any way. Windows suffers the curse of DOS line endings (\r\n 
vs \n), and a true port to Windows *must* allow for \r\n and \n to be 
semantically the same thing as most large projects end up with a mixture 
of such files and/or are targeting cross-platform capabilities. The 
major competing solutions git seeks to supplant (cvs, cvsnt, svn, hg) 
have capability to recognize "text" files and transparently replace \r\n 
with \n on input, the reverse on output, and ignore all such differences 
on diff operations. To be relevant on native Windows, git must do the 
same. Otherwise, git will be deemed "too weird" and dismissed in favor 
of a tool "that works."

There is no use in debating the technical merits of \r\n vs \n vs \r vs 
whatever, nor of not converting. Really. Just accept that there is a 
fundamental requirement that any version control tool on Windows be able 
to silently convert between \r\n and \n. To believe otherwise is to 
expect that the conversion be pushed elsewhere into the tool chain in 
use, and that won't happen as the competition already provides this 
conversion capability.

So, I think the git project needs to come to an explicit position on 
this, basically being:

1) git is a POSIX only tool (i.e., there will be no \r\n munging), or
2) a Windows port of git will handle and mung \r\n and \n line endings.

If the answer is 1, the mingw port is a waste of time as it simply won't 
be usable by its target audience. If the answer is 2, then I think a 
very careful design of this capability is in order.

Comments?

BTW, I have addressed this in my own world using a pre-commit script 
that converts textfile line endings into \n, recognizing that our 
Windows tool chain handles such files perfectly well, while our Linux 
toolchain requires it.
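
A minimal sketch of the kind of conversion such a pre-commit step performs, assuming the file is already in memory. The name `crlf_to_lf` is hypothetical and is not taken from git or from the script mentioned above:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: rewrite every "\r\n" pair in buf as "\n", in place,
 * and return the new length. A lone '\r' is left untouched.
 */
size_t crlf_to_lf(char *buf, size_t len)
{
	size_t in, out = 0;

	assert(buf != NULL || len == 0);
	for (in = 0; in < len; in++) {
		/* Drop a CR only when it is immediately followed by LF. */
		if (buf[in] == '\r' && in + 1 < len && buf[in + 1] == '\n')
			continue;
		buf[out++] = buf[in];
	}
	return out;
}
```

The conversion never grows the data, which is why it can be done in place; the reverse direction (LF to CRLF) would need a larger output buffer.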

Mark Levedahl

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: mingw, windows, crlf/lf, and git
  2007-02-11 23:13 mingw, windows, crlf/lf, and git Mark Levedahl
@ 2007-02-11 23:34 ` Johannes Schindelin
  2007-02-12  0:46   ` Jakub Narebski
  2007-02-12  0:14 ` Robin Rosenberg
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 83+ messages in thread
From: Johannes Schindelin @ 2007-02-11 23:34 UTC (permalink / raw)
  To: Mark Levedahl; +Cc: Git Mailing List

Hi,

On Sun, 11 Feb 2007, Mark Levedahl wrote:

> The major competing solutions git seeks to supplant (cvs, cvsnt, svn, 
> hg) have capability to recognize "text" files and transparently replace 
> \r\n with \n on input, the reverse on output, and ignore all such 
> differences on diff operations.

Agree with transformations on input and output; disagree on diff.

The problem is that it really is a transformation. Since most Windows tools 
(at least those used in portable software) handle \n without \r quite 
well, thank you, I'd tend towards the view point: do not mess with line 
endings pre-commit/post-checkout.

Even MacOSX uses \n now, instead of \r.

Of course, for those projects which _use_ CRLF: they can continue with it. 
Git has no problem with those line endings.

The only problem CVS tried to solve (badly) was to be able to checkout 
text files on DOS, Unix _and_ MacOS. In practice, though, this use case 
does not matter anymore IMHO.

Ciao,
Dscho

* Re: mingw, windows, crlf/lf, and git
  2007-02-11 23:13 mingw, windows, crlf/lf, and git Mark Levedahl
  2007-02-11 23:34 ` Johannes Schindelin
@ 2007-02-12  0:14 ` Robin Rosenberg
  2007-02-12  2:37   ` Mark Levedahl
  2007-02-12  4:24 ` Theodore Tso
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 83+ messages in thread
From: Robin Rosenberg @ 2007-02-12  0:14 UTC (permalink / raw)
  To: Mark Levedahl; +Cc: Git Mailing List

On Monday 12 February 2007 00:13, Mark Levedahl wrote:
> The recent threads on a mingw git port are explicit in the intent to 
> provide a Windows native git. I believe there is a fundamental conflict 
> here with the position, clearly stated by Linus, that git does not alter 
> content in any way. Windows suffers the curse of DOS line endings (\r\n 
> vs \n), and a true port to Windows *must* allow for \r\n and \n to be 
> semantically the same thing as most large projects end up with a mixture 
> of such files and/or are targeting cross-platform capabilities. The 
> major competing solutions git seeks to supplant (cvs, cvsnt, svn, hg) 
> have capability to recognize "text" files and transparently replace \r\n 
> with \n on input, the reverse on output, and ignore all such differences 
> on diff operations. To be relevant on native Windows, git must do the 
> same. Otherwise, git will be deemed "too weird" and dismissed in favor 
> of a tool "that works."
> 
As of today git is a POSIX tool simply because it's not fully ported to
other environments. I brought this up quite some time ago, and didn't face
heavy artillery then, and wouldn't today either. The code is still missing
though. I didn't write it then, because it's not my #1 priority, and nobody
else did. Linus even did a rough sketch, but that's it.

I guess git will get this feature when someone does the code for it.

-- robin

* Re: mingw, windows, crlf/lf, and git
  2007-02-11 23:34 ` Johannes Schindelin
@ 2007-02-12  0:46   ` Jakub Narebski
  2007-02-12  2:36     ` Mark Levedahl
  2007-02-12 11:21     ` Johannes Schindelin
  0 siblings, 2 replies; 83+ messages in thread
From: Jakub Narebski @ 2007-02-12  0:46 UTC (permalink / raw)
  To: git

Johannes Schindelin wrote:
> On Sun, 11 Feb 2007, Mark Levedahl wrote:
> 
>> The major competing solutions git seeks to supplant (cvs, cvsnt, svn, 
>> hg) have capability to recognize "text" files and transparently replace 
>> \r\n with \n on input, the reverse on output, and ignore all such 
>> differences on diff operations.
> 
> Agree with transformations on input and output; disagree on diff.

I wonder if this could/should be solved with adding some option to git-diff,
similar to --ignore-space-change and --ignore-all-space...

Just a [idle] thought.
-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git

* Re: mingw, windows, crlf/lf, and git
  2007-02-12  0:46   ` Jakub Narebski
@ 2007-02-12  2:36     ` Mark Levedahl
  2007-02-12 11:21     ` Johannes Schindelin
  1 sibling, 0 replies; 83+ messages in thread
From: Mark Levedahl @ 2007-02-12  2:36 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git

Jakub Narebski wrote:
> Johannes Schindelin wrote:
>   
>> On Sun, 11 Feb 2007, Mark Levedahl wrote:
>>
>>     
>>> The major competing solutions git seeks to supplant (cvs, cvsnt, svn, 
>>> hg) have capability to recognize "text" files and transparently replace 
>>> \r\n with \n on input, the reverse on output, and ignore all such 
>>> differences on diff operations.
>>>       
>> Agree with transformations on input and output; disagree on diff.
>>     
>
> I wonder if this could/should be solved with adding some option to git-diff,
> similar to --ignore-space-change and --ignore-all-space...
>
> Just a [idle] thought.
>   
That would work. Assuming blobs are stored with \n, diff just has to 
open files in 'rt' mode rather than just 'r' and the \r\n are 
transformed  on read so are never seen by git code. That is basically 
what Windows native tools do, but they also write files opened in 'wt' 
mode so \n become \r\n on output. Of course, if this were an option, 
users could look for line ending differences if they cared.
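
The effect of reading in text mode can be approximated portably by comparing two buffers while treating CRLF as LF. This is only a sketch under that assumption; `skip_cr` and `equal_ignoring_crlf` are hypothetical names, not git functions or git-diff options:

```c
#include <assert.h>
#include <stddef.h>

/* Step over a CR at position i if it is immediately followed by LF. */
static size_t skip_cr(const char *s, size_t len, size_t i)
{
	if (i + 1 < len && s[i] == '\r' && s[i + 1] == '\n')
		return i + 1;
	return i;
}

/*
 * Hypothetical comparison: returns 1 if a and b are identical once every
 * CRLF pair is read as a single LF, and 0 otherwise.
 */
int equal_ignoring_crlf(const char *a, size_t alen,
			const char *b, size_t blen)
{
	size_t i = 0, j = 0;

	assert(a != NULL && b != NULL);
	for (;;) {
		i = skip_cr(a, alen, i);
		j = skip_cr(b, blen, j);
		if (i >= alen || j >= blen)
			break;
		if (a[i] != b[j])
			return 0;
		i++;
		j++;
	}
	/* Equal only if both buffers were consumed completely. */
	return i >= alen && j >= blen;
}
```

A lone CR that is not followed by LF still counts as a difference, so users who care about stray carriage returns would continue to see them.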

Mark

* Re: mingw, windows, crlf/lf, and git
  2007-02-12  0:14 ` Robin Rosenberg
@ 2007-02-12  2:37   ` Mark Levedahl
  0 siblings, 0 replies; 83+ messages in thread
From: Mark Levedahl @ 2007-02-12  2:37 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: Mark Levedahl, Git Mailing List

Robin Rosenberg wrote:
> As of today git is a POSIX tool simply because it's not fully ported to
> other environments. I brought this up quite some time ago, and didn't face
> heavy artillery then, and wouldn't today either. The code is still missing
> though. I didn't write it then, because it's not my #1 priority, and nobody
> else did. Linus even did a rough sketch, but that's it.
So, the basic design for this feature exists where? I would assume this 
would include a file mode indicator set in the blob or tree designating 
the blob is "text", along with mechanism to specify for a project what 
files are "text", along with some safety valve to check and not do 
transformation when the file does not look text-ish.
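
One possible shape for such a safety valve is a cheap content sniff: refuse to convert anything that contains a NUL byte near its start. This is a common heuristic, offered here as an assumption rather than as the design discussed in this thread; `looks_like_text` and `SNIFF_LIMIT` are hypothetical names:

```c
#include <assert.h>
#include <stddef.h>

#define SNIFF_LIMIT 8000 /* assumed cap on how many bytes are inspected */

/*
 * Hypothetical safety valve: treat a buffer as text only if none of its
 * first SNIFF_LIMIT bytes is a NUL byte. Returns 1 for text-ish, 0 otherwise.
 */
int looks_like_text(const char *buf, size_t len)
{
	size_t i, n = len < SNIFF_LIMIT ? len : SNIFF_LIMIT;

	assert(buf != NULL || len == 0);
	for (i = 0; i < n; i++)
		if (buf[i] == '\0')
			return 0;
	return 1;
}
```

A check like this errs on the side of leaving data alone: a file wrongly classified as binary simply keeps its line endings, whereas a binary file wrongly converted would be corrupted.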

Mark

* Re: mingw, windows, crlf/lf, and git
  2007-02-11 23:13 mingw, windows, crlf/lf, and git Mark Levedahl
  2007-02-11 23:34 ` Johannes Schindelin
  2007-02-12  0:14 ` Robin Rosenberg
@ 2007-02-12  4:24 ` Theodore Tso
  2007-02-12  7:28   ` David Lang
                     ` (2 more replies)
  2007-02-13  2:02 ` Junio C Hamano
  2007-02-13  3:32 ` Alexander Litvinov
  4 siblings, 3 replies; 83+ messages in thread
From: Theodore Tso @ 2007-02-12  4:24 UTC (permalink / raw)
  To: Mark Levedahl; +Cc: Git Mailing List

On Sun, Feb 11, 2007 at 06:13:16PM -0500, Mark Levedahl wrote:
> I am NOT intending to start a flamewar O:-) , so please don't turn this 
> into one.
> 
> The recent threads on a mingw git port are explicit in the intent to 
> provide a Windows native git. I believe there is a fundamental conflict 
> here with the position, clearly stated by Linus, that git does not alter 
> content in any way. Windows suffers the curse of DOS line endings (\r\n 
> vs \n), and a true port to Windows *must* allow for \r\n and \n to be 
> semantically the same thing as most large projects end up with a mixture 
> of such files and/or are targeting cross-platform capabilities. The 
> major competing solutions git seeks to supplant (cvs, cvsnt, svn, hg) 
> have capability to recognize "text" files and transparently replace \r\n 
> with \n on input, the reverse on output, and ignore all such differences 
> on diff operations. To be relevant on native Windows, git must do the 
> same. Otherwise, git will be deemed "too weird" and dismissed in favor 
> of a tool "that works."

So this is something that I've tried proposing to the Mercurial
developers, but it's never been implemented in hg.  It'll be
interesting to see what the git community thinks.  :-)

My proposal does require adding a file type to each file, as tracked
metadata, which may doom it from the start.  If you add a file type,
then you have to support mutating the file type, and some way of
handling merge conflicts (generally, picking one type or another).

Then for each file type, we implement a set of interfaces (perhaps as
simple as a series of executables named git-<type>-<operation>) which
if present, transforms the file from its live format to the canonical
format which is actually checked in and back again.  Besides using
this for the DOS CR/LF problem, it also allows for an efficient
storage of things like OpenOffice files which are a zipped set of .xml
files.  By decompressing them before pushing them into the SCM, it
means that if the user makes a tiny spelling correction in their
OpenOffice file, the delta stored in the git repository can be much
more efficiently stored (since the diff of the .xml tree will be
small, whereas the diff of the entire compressed file is likely going
to be close to the entire size of the .odt file).

Another nice thing to provide for each file type would be a
pretty-printer for the diffs, so it becomes easier to see the delta
between two versions of an OpenOffice file in a textual window.

So, is this idea sane or completely insane?  Hopefully it passes
Linus's it-solves-multiple-problems-at-once test, at least.  :-)

							- Ted

* Re: mingw, windows, crlf/lf, and git
  2007-02-12  4:24 ` Theodore Tso
@ 2007-02-12  7:28   ` David Lang
  2007-02-12 11:36   ` Johannes Schindelin
  2007-02-12 17:20   ` Linus Torvalds
  2 siblings, 0 replies; 83+ messages in thread
From: David Lang @ 2007-02-12  7:28 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Mark Levedahl, Git Mailing List

On Sun, 11 Feb 2007, Theodore Tso wrote:

> Then for each file type, we implement a set of interfaces (perhaps as
> simple as a series of executables named git-<type>-<operation>) which
> if present, transforms the file from its live format to the canonical
> format which is actually checked in and back again.  Besides using
> this for the DOS CR/LF problem, it also allows for an efficient
> storage of things like OpenOffice files which are a zipped set of .xml
> files.  By decompressing them before pushing them into the SCM, it
> means that if the user makes a tiny spelling correction in their
> OpenOffice file, the delta stored in the git repository can be much
> more efficiently stored (since the diff of the .xml tree will be
> small, whereas the diff of the entire compressed file is likely going
> to be close to the entire size of the .odt file).
>
> Another nice thing to provide for each file type would be a
> pretty-printer for the diffs, so it becomes easier to see the delta
> between two versions of an OpenOffice file in a textual window.
>
> So, is this idea sane or completely insane?  Hopefully it passes
> Linus's it-solves-multiple-problems-at-once test, at least.  :-)

There have been other things discussed that could use the 'do this on checkout'
hooks. Specifically, on the issue of using git to manage /etc, the need to
save/restore permissions requires a hook on checkout that doesn't exist yet.
This sounds like it would solve that problem as well.

David Lang

* Re: mingw, windows, crlf/lf, and git
  2007-02-12  0:46   ` Jakub Narebski
  2007-02-12  2:36     ` Mark Levedahl
@ 2007-02-12 11:21     ` Johannes Schindelin
  1 sibling, 0 replies; 83+ messages in thread
From: Johannes Schindelin @ 2007-02-12 11:21 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Mark Levedahl, git

Hi,

[Cc'ing git list, which I sometimes have to do when Jakub replies]

On Mon, 12 Feb 2007, Jakub Narebski wrote:

> Johannes Schindelin wrote:
> > On Sun, 11 Feb 2007, Mark Levedahl wrote:
> > 
> >> The major competing solutions git seeks to supplant (cvs, cvsnt, svn, 
> >> hg) have capability to recognize "text" files and transparently replace 
> >> \r\n with \n on input, the reverse on output, and ignore all such 
> >> differences on diff operations.
> > 
> > Agree with transformations on input and output; disagree on diff.
> 
> I wonder if this could/should be solved with adding some option to git-diff,
> similar to --ignore-space-change and --ignore-all-space...

It could be done, but those options were introduced for CRLF breakage in 
the first place.

You need --ignore-crlf-breakage? Just holler.

Ciao,
Dscho

* Re: mingw, windows, crlf/lf, and git
  2007-02-12  4:24 ` Theodore Tso
  2007-02-12  7:28   ` David Lang
@ 2007-02-12 11:36   ` Johannes Schindelin
  2007-02-12 17:20   ` Linus Torvalds
  2 siblings, 0 replies; 83+ messages in thread
From: Johannes Schindelin @ 2007-02-12 11:36 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Mark Levedahl, Git Mailing List

Hi,

On Sun, 11 Feb 2007, Theodore Tso wrote:

> My proposal does require adding a file type to each file, as tracked
> metadata, which may doom it from the start.

I'd rather do that a la .gitignore, i.e. make this handling dependent on 
file name patterns. It is not only backwards compatible (from the 
viewpoint of the repository format), it also avoids having to specify over 
and over again that yes, this new .odt file _is_ an OpenOffice document.

> Then for each file type, we implement a set of interfaces (perhaps as
> simple as a series of executables named git-<type>-<operation>) which
> if present, transforms the file from its live format to the canonical
> format which is actually checked in and back again.

Again, I propose a slight change: Let's add a transformation driver like 
the merge driver: this allows inlining common operations like unzipping, 
CRLF->LF conversion, etc.
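
A rough sketch of what an inlined transformation driver table could look like, assuming drivers are looked up by name and convert a buffer in place to its canonical form. Every name below (`transform_driver`, `find_driver`, the driver names "text" and "binary") is hypothetical; none of this is an existing git interface:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Converts buf in place to the canonical (checked-in) form; returns new length. */
typedef size_t (*to_canonical_fn)(char *buf, size_t len);

/* Driver for content that must not be touched. */
static size_t identity(char *buf, size_t len)
{
	(void)buf;
	return len;
}

/* Driver that rewrites each CRLF pair as LF. */
static size_t strip_cr_before_lf(char *buf, size_t len)
{
	size_t in, out = 0;

	for (in = 0; in < len; in++) {
		if (buf[in] == '\r' && in + 1 < len && buf[in + 1] == '\n')
			continue; /* drop the CR of a CRLF pair */
		buf[out++] = buf[in];
	}
	return out;
}

/* Hypothetical built-in driver table; the driver names are made up. */
static const struct transform_driver {
	const char *name;
	to_canonical_fn to_canonical;
} drivers[] = {
	{ "binary", identity },
	{ "text", strip_cr_before_lf },
};

/* Returns the driver registered under name, or NULL if there is none. */
const struct transform_driver *find_driver(const char *name)
{
	size_t i;

	assert(name != NULL);
	for (i = 0; i < sizeof(drivers) / sizeof(drivers[0]); i++)
		if (strcmp(drivers[i].name, name) == 0)
			return &drivers[i];
	return NULL;
}
```

An unknown name returning NULL leaves room for falling back to an external program, which is where something like unzipping an .odt file would plug in.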

Ciao,
Dscho

* Re: mingw, windows, crlf/lf, and git
  2007-02-12  4:24 ` Theodore Tso
  2007-02-12  7:28   ` David Lang
  2007-02-12 11:36   ` Johannes Schindelin
@ 2007-02-12 17:20   ` Linus Torvalds
  2007-02-12 22:37     ` Johannes Schindelin
  2007-02-12 22:54     ` Junio C Hamano
  2 siblings, 2 replies; 83+ messages in thread
From: Linus Torvalds @ 2007-02-12 17:20 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Mark Levedahl, Git Mailing List



On Sun, 11 Feb 2007, Theodore Tso wrote:
> 
> So this is something that I've tried proposing to the Mercurial
> developers, but it's never been implemented in hg.  It'll be
> interesting to see what the git community thinks.  :-)
> 
> My proposal does require adding a file type to each file, as tracked
> metadata, which may doom it from the start.  If you add a file type,
> then you have to support mutating the file type, and some way of
> handling merge conflicts (generally, picking one type or another).

I agree that a file-type approach would work, but I personally think it's 
too inflexible (just cr/lf vs lf? There are tons of other interesting 
issues that are valid). I also think it falls down on another (and in some 
ways much more fundamental problem): these things exist EVEN WHEN THE FILE 
ITSELF DOES NOT EXIST!

In other words, a policy about cr/lf is *not* a policy about actual 
content. It's something much more: it's a policy about representation in 
general, which includes *potential* content. It should obviously take 
effect on "git add" even with content that didn't exist before, and to 
work well, it should do so without the user having to think about it.

Equally importantly, this happens with content that was added by people 
who simply DO NOT CARE. In other words, I think a "file type" thing 
fundamentally cannot work, because under UNIX, it would be stupid and 
pointless, so any project that is maintained under UNIX might _add_ the 
file types, but since they won't matter, they'll inevitably be wrong (ie 
people forgot to mark a binary thing binary, or a text thing as text).

So: file types or attributes are broken. They cannot work well.

But enough on the negative rambling, I do have a positive and constructive 
suggestion, because I actually think I have a great model for it. But I've 
never cared enough (and since the main target would be some windows issue, 
I suspect I never really _will_ care enough) to really worry about it.

Anyway, if somebody really wants to look at this, and wants to create 
something that is actually _usable_, my suggestion is to simply extend on 
the ".gitignore" file approach. The great thing about .gitignore is that

 (a) you can track it like you track any other file

     This makes merges a *lot* easier. You see it as conflicts, you can 
     fix it up, and in general, you can use all the same tools with it as 
     you use with anything else. In contrast, explicit per-file filetypes 
     are _horrible_ for maintenance.

 (b) you can add to it with *patterns*, which is exactly what you want for 
     file types.

     You can do things like

	*.bin: binary
	*: text

     to say "everything that matches *.bin is binary, the rest is text", 
     and solves the maintenance issue trivially. Everybody will like it. 
     For the kernel, for example, we'd have a really easy

	Documentation/logo.gif: binary
	*: text

     and that would probably take care of it.

     You can also have a few default file patterns built in, which would 
     take care of it for 99% of all projects without anybody ever having 
     to even think about it - even under DOS.

 (c) it doesn't actually affect database representation, it only changes 
     behaviour for programs, which is also  exactly what you want (if you 
     have per-file "file types", you end up having serious problems at 
     merge time: when I say "affect database representation", I don't mean 
     that I think git cannot change its database, I literally mean at a 
     "higher" level: representing per-file attributes is a DISASTER from a 
     merge situation)

     So not only is it backwards-compatible with traditional git usage, 
     it's much more fundamentally simple: it doesn't add any new core data 
     structures or rules. All the core stays exactly as it is, and it just 
     affects higher-level behaviour. And that's important: one reason git 
     has been so stable is that the really core data structures are really 
     really stable and simple.

     Even when we did *really* core changes like the whole packfile thing, 
     the fundamental data structures didn't change at all *conceptually*.

 (d) it's actually a lot more flexible than file types.

     Merge strategies, anybody? We can easily have the default merge 
     strategy be the normal three-way merge (which is obviously the right 
     thing for almost anything), but how about something like

	*.doc: binary,merge=doc-merge

     which tells git that it should use a separate "doc-merge" program to 
     merge those kinds of files when it needs to do a nontrivial merge..

 (e) exactly like ".gitignore", you should also be able to have a 
     ".git/info/exclude" file that is your _private_ rules, and 
     per-directory ".gitignore" files that are the _hierarchical_ rules.

     This just makes maintenance much simpler. Not one big file that has 
     everything, and that clashes. Make the top-level one contain all the 
     generic default rules, and then lower down we can have more specific 
     rules for very specific things, exactly like the kernel .gitignore 
     files do. The top-level file should *not* have to know all the 
     details of some architecture- or sub-project specific file behaviour.

     Similarly, having an untracked file (.git/info/exclude) allows people 
     to have rules that make sense for *them*, but that might not make 
     sense for the upstream developers (say, somebody crazy enough to 
     develop Linux under Windows). So people can have their purely local 
     rules without forcing them on others.

Anyway, that would be my suggestion. Call it ".gitattributes" or 
something. Make it a nice ASCII format, exactly like .gitignore, and make 
all the rules exactly the same, except it has a ": <attributelist>" at the 
end for each line.
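
The "pattern: attributelist" format described above can be sketched as follows. This assumes lines arrive with the trailing newline already stripped, that the first matching rule wins (read off the ordering of the examples, where the specific pattern precedes the catch-all `*`), and that POSIX `fnmatch()` is an acceptable stand-in for the .gitignore matcher. `attr_rule`, `parse_rule_line` and `lookup_attrs` are hypothetical names:

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>
#include <string.h>

struct attr_rule {
	const char *pattern; /* glob, as in .gitignore */
	const char *attrs;   /* raw attribute list, e.g. "binary" */
};

/*
 * Split "pattern: attrs" at the last ':' by writing a NUL into line.
 * Returns 0 on success, -1 if there is no ':' or the pattern is empty.
 */
int parse_rule_line(char *line, struct attr_rule *out)
{
	char *colon;

	assert(line != NULL && out != NULL);
	colon = strrchr(line, ':');
	if (!colon || colon == line)
		return -1;
	*colon++ = '\0';
	while (*colon == ' ' || *colon == '\t')
		colon++;
	out->pattern = line;
	out->attrs = colon;
	return 0;
}

/* Returns the attribute list of the first rule matching path, or NULL. */
const char *lookup_attrs(const struct attr_rule *rules, size_t n,
			 const char *path)
{
	size_t i;

	assert(rules != NULL && path != NULL);
	for (i = 0; i < n; i++)
		if (fnmatch(rules[i].pattern, path, 0) == 0)
			return rules[i].attrs;
	return NULL;
}
```

With flags set to 0, `*` in `fnmatch()` also matches `/`, so a top-level `*: text` rule covers files in subdirectories; real .gitignore matching has more rules than this sketch models.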

Start off supporting just "binary" and "text", but keep in mind that 
people may want other things. Individualized merge strategies etc.
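
To leave room for more than "binary" and "text", the attribute list itself could be parsed into a small structure. A minimal sketch, assuming comma-separated attributes and a `merge=<program>` form as in the `*.doc` example; `file_attrs` and `parse_attr_list` are hypothetical names, and treating unknown attributes as errors is an arbitrary choice made here:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct file_attrs {
	int is_binary;         /* 1 for "binary", 0 for "text" (assumed default) */
	char merge_driver[64]; /* empty string: use the default three-way merge */
};

/*
 * Parse a list such as "binary,merge=doc-merge" into out.
 * Returns 0 on success, -1 on an unknown attribute or an overlong list.
 * Uses strtok(), so it is not reentrant.
 */
int parse_attr_list(const char *list, struct file_attrs *out)
{
	char copy[256];
	char *tok;

	assert(list != NULL && out != NULL);
	if (strlen(list) >= sizeof(copy))
		return -1;
	strcpy(copy, list);
	out->is_binary = 0;
	out->merge_driver[0] = '\0';

	for (tok = strtok(copy, ","); tok; tok = strtok(NULL, ",")) {
		if (strcmp(tok, "binary") == 0)
			out->is_binary = 1;
		else if (strcmp(tok, "text") == 0)
			out->is_binary = 0;
		else if (strncmp(tok, "merge=", 6) == 0)
			snprintf(out->merge_driver, sizeof(out->merge_driver),
				 "%s", tok + 6);
		else
			return -1; /* unknown attribute */
	}
	return 0;
}
```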

Also, keep in mind that a *lot* of git operations will work purely on a 
SHA1 level, and those operations fundamentally *will*not*care* about file 
types. So when you merge a file, for example, the initial merge will be 
done purely on SHA1's, and git would do all the normal "if it didn't 
change in branch 1, take the branch 2 version directly" without ever even 
*looking* at any file rules.

This is important, because this is what makes git efficient for large 
projects, and which would allow git to _remain_ efficient even in the face 
of having to read all those complex .gitattributes files. When we merge 
two repositories with 20,000+ files, we usually really only "merge" a 
couple of the files. 
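
The SHA1-level decision can be sketched as a per-path comparison of three object names that never opens a file and never consults any attribute rules. This is an illustration of the idea only, assuming raw 20-byte SHA-1 values; `trivial_merge` and the enum names are hypothetical:

```c
#include <assert.h>
#include <string.h>

#define OID_LEN 20 /* raw SHA-1 length in bytes */

enum merge_result { TAKE_OURS, TAKE_THEIRS, NEEDS_CONTENT_MERGE };

/*
 * Decide a per-file merge purely from object names. Only the
 * NEEDS_CONTENT_MERGE case would ever have to look at file contents
 * (and hence at any text/binary rules).
 */
enum merge_result trivial_merge(const unsigned char *base,
				const unsigned char *ours,
				const unsigned char *theirs)
{
	assert(base != NULL && ours != NULL && theirs != NULL);
	if (memcmp(ours, theirs, OID_LEN) == 0)
		return TAKE_OURS;   /* both sides ended up identical */
	if (memcmp(base, ours, OID_LEN) == 0)
		return TAKE_THEIRS; /* unchanged on our side */
	if (memcmp(base, theirs, OID_LEN) == 0)
		return TAKE_OURS;   /* unchanged on their side */
	return NEEDS_CONTENT_MERGE;
}
```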

Same goes for "text" mode. The "text" thing would only affect things like 
"git add" etc that use "git-update-index" to calculate the new SHA1. We'd 
never use it "normally". "git diff" would still be instantaneous, because 
the git index shows the file still matches, and that is all done on a SHA1 
only level. So only when you do a "git add" or when it needs to refresh 
the index because the file changed, and it reads in the file, will it 
actually care about whether it's a text or a binary file.

This is actually *exactly* what you want. Not just for performance, but 
simply because this is also how you can take something like the Linux 
archive, and "just use it" under Windows, even if your editor adds (or 
wants) CR/LF.

Btw, how would I implement this? If I really were energetic enough to 
implement it, I would do:

 (a) Add a flag to "git-ls-files" logic to add "type information" in 
     front.

     Not only do you want this *anyway* for other reasons, but for
     binary/text, the thing you actually care most about is "git add", and 
     it already basically just does "take this file pattern, feed it 
     through git-ls-files, and add those files". So you'd get it basically 
     for free.

     It is also fairly easy to add at this stage, because you can simply 
     look for all the places that work with "info/exclude" and 
     ".gitignore", and you know that "Ahh, I need to teach these exact 
     places to understand about attributes". So you'd add an 
     "add_attributes_from_file()" function etc etc.

     Quite straightforward. In fact, you might be able to use the 
     gitignore parsing *as*is*, and just teach it about more flags than 
     just "ignore": both in "struct dir_entry" and in "struct exclude".

 (b) Teach the git-update-index logic about hashing text blobs.

 (c) Profit!

It really should be fairly straightforward. I'm sure it wouldn't be 
*entirely* trivial, but I'm also fairly sure that somebody reasonably 
competent could do it in a couple of days (with testing) if they were just 
sufficiently motivated to get started.

Anybody?

		Linus

* Re: mingw, windows, crlf/lf, and git
  2007-02-12 17:20   ` Linus Torvalds
@ 2007-02-12 22:37     ` Johannes Schindelin
  2007-02-12 23:02       ` Linus Torvalds
  2007-02-12 22:54     ` Junio C Hamano
  1 sibling, 1 reply; 83+ messages in thread
From: Johannes Schindelin @ 2007-02-12 22:37 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Theodore Tso, Mark Levedahl, Git Mailing List

Hi,

[I agree on the .gitignore approach; see my other mail in this thread]

On Mon, 12 Feb 2007, Linus Torvalds wrote:

> Btw, how would I implement this? If I really were energetic enough to 
> implement it, I would do:
> 
>  (a) Add a flag to "git-ls-files" logic to add "type information" in 
>      front.
> 
>      Not only do you want this *anyway* for other reasons, but for
>      binary/text, the thing you actually care most about is "git add", and 
>      it already basically just does "take this file pattern, feed it 
>      through git-ls-files, and add those files". So you'd get it basically 
>      for free.
> 
>      It is also fairly easy to add at this stage, because you can simply 
>      look for all the places that work with "info/exclude" and 
>      ".gitignore", and you know that "Ahh, I need to teach these exact 
>      places to understand about attributes". So you'd add an 
>      "add_attributes_from_file()" function etc etc.
> 
>      Quite straightforward. In fact, you might be able to use the 
>      gitignore parsing *as*is*, and just teach it about more flags than 
>      just "ignore": both in "struct dir_entry" and in "struct exclude".
> 
>  (b) Teach the git-update-index logic about hashing text blobs.
> 
>  (c) Profit!

Not so fast.

In order for this to be _useful_, you also have to have a way to _extract_ 
the text blobs. Not only for read-tree, but _also_ for diff. It makes no 
sense at all to have this transformation one-way. For diff, you _might_ 
want to have a diff beautifier (for example the .odt thing), but read-tree 
is _really_ important.

Ciao,
Dscho

* Re: mingw, windows, crlf/lf, and git
  2007-02-12 17:20   ` Linus Torvalds
  2007-02-12 22:37     ` Johannes Schindelin
@ 2007-02-12 22:54     ` Junio C Hamano
  2007-02-12 23:02       ` Junio C Hamano
                         ` (2 more replies)
  1 sibling, 3 replies; 83+ messages in thread
From: Junio C Hamano @ 2007-02-12 22:54 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Theodore Tso, Mark Levedahl, Git Mailing List

Linus Torvalds <torvalds@linux-foundation.org> writes:

> Btw, how would I implement this? If I really were energetic enough to 
> implement it, I would do:
>
>  (a) Add a flag to "git-ls-files" logic to add "type information" in 
>      front.
>
>      Not only do you want this *anyway* for other reasons, but for
>      binary/text, the thing you actually care most about is "git add", and 
>      it already basically just does "take this file pattern, feed it 
>      through git-ls-files, and add those files". So you'd get it basically 
>      for free.
>
>      It is also fairly easy to add at this stage, because you can simply 
>      look for all the places that work with "info/exclude" and 
>      ".gitignore", and you know that "Ahh, I need to teach these exact 
>      places to understand about attributes". So you'd add an 
>      "add_attributes_from_file()" function etc etc.
>
>      Quite straightforward. In fact, you might be able to use the 
>      gitignore parsing *as*is*, and just teach it about more flags than 
>      just "ignore": both in "struct dir_entry" and in "struct exclude".
>
>  (b) Teach the git-update-index logic about hashing text blobs.

I agree that we can assume editors can grok files with LF
end-of-line just fine and we would not need to do the reverse
conversion on checkout paths (e.g. "read-tree -u", "checkout-index").

Textual diff generation needs to learn the CRLF-to-LF conversion
in diff_populate_filespec(); this needs to be done even when the
caller wants size_only.

Oops.

Not so fast.  What's your plan for st_size?

>  (c) Profit!
>
> It really should be fairly straightforward. I'm sure it wouldn't be 
> *entirely* trivial, but I'm also fairly sure that somebody reasonably 
> competent could do it in a couple of days (with testing) if they were just 
> sufficiently motivated to get started.
>
> Anybody?

Not me.

* Re: mingw, windows, crlf/lf, and git
  2007-02-12 22:37     ` Johannes Schindelin
@ 2007-02-12 23:02       ` Linus Torvalds
  0 siblings, 0 replies; 83+ messages in thread
From: Linus Torvalds @ 2007-02-12 23:02 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Theodore Tso, Mark Levedahl, Git Mailing List



On Mon, 12 Feb 2007, Johannes Schindelin wrote:
>
> >  (c) Profit!
> 
> Not so fast.

Aww! And just when I _finally_ had a "step 2".

> In order for this to be _useful_, you also have to have a way to _extract_ 
> the text blobs. Not only for read-tree, but _also_ for diff.

Actually, my argument is that we don't need it all that much.

For example, your "read-tree" argument is actually wrong. Anything that is 
in a tree is _already_ fixed to be '\n'. So as long as we keep to things 
like

	git diff version1..version2

we'll actually always get the right version.

Also, the index will make sure that we don't even *try* to diff normal 
checked out files.

So the only time you actually really need to test the .gitattributes file 
is when you do an "open blob in working tree". And once you do that 
function right, and just make sure both git-update-index and yes, the 
"diff against working tree" cases use it, you really should be mostly 
done.

Both git-update-index and git-diff-files want basically the same 
interface:

	struct file_buf {
		const char *buf;
		unsigned long size;
		int flags;
	};

	int read_file(const char *path, struct file_buf *);
	void close_file(struct file_buf *);

and we should use that instead of the current "open + stat + mmap/read + 
close" sequences.

It really shouldn't be too nasty.
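
A rough sketch of what such a pair of helpers might look like, purely as an
illustration and not git's actual code. The FILE_BUF_TEXT flag is a
hypothetical name, the buffer is plain malloc'ed memory rather than an mmap,
and "buf" is left non-const here so the conversion can be done in place:

```c
#include <stdio.h>
#include <stdlib.h>

#define FILE_BUF_TEXT 1	/* hypothetical flag: fold CRLF to LF while reading */

struct file_buf {
	char *buf;
	unsigned long size;
	int flags;
};

/* Read the whole file; with FILE_BUF_TEXT set, drop the CR of each CRLF pair. */
int read_file(const char *path, struct file_buf *fb)
{
	FILE *f = fopen(path, "rb");
	unsigned long n, in, out = 0;
	long len;

	fb->buf = NULL;
	fb->size = 0;
	if (!f)
		return -1;
	if (fseek(f, 0, SEEK_END) || (len = ftell(f)) < 0 || fseek(f, 0, SEEK_SET))
		goto fail;
	n = (unsigned long)len;
	fb->buf = malloc(n ? n : 1);
	if (!fb->buf || fread(fb->buf, 1, n, f) != n)
		goto fail;
	fclose(f);

	if (fb->flags & FILE_BUF_TEXT) {
		for (in = 0; in < n; in++) {
			if (fb->buf[in] == '\r' && in + 1 < n && fb->buf[in + 1] == '\n')
				continue;	/* skip the CR, keep the LF */
			fb->buf[out++] = fb->buf[in];
		}
		n = out;
	}
	fb->size = n;	/* post-conversion size, not st_size */
	return 0;
fail:
	free(fb->buf);
	fb->buf = NULL;
	fclose(f);
	return -1;
}

/* Release whatever read_file() allocated. */
void close_file(struct file_buf *fb)
{
	free(fb->buf);
	fb->buf = NULL;
	fb->size = 0;
}
```

The point of the sketch is only that callers never see the on-disk bytes
directly; whether a path counts as text would still have to come from
somewhere else.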

		Linus


* Re: mingw, windows, crlf/lf, and git
  2007-02-12 22:54     ` Junio C Hamano
@ 2007-02-12 23:02       ` Junio C Hamano
  2007-02-12 23:09       ` Linus Torvalds
  2007-02-12 23:24       ` Johannes Schindelin
  2 siblings, 0 replies; 83+ messages in thread
From: Junio C Hamano @ 2007-02-12 23:02 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Theodore Tso, Mark Levedahl, Git Mailing List

Junio C Hamano <junkio@cox.net> writes:

> Linus Torvalds <torvalds@linux-foundation.org> writes:
>
>> Btw, how would I implement this? If I really were energetic enough to 
>> implement it, I would do:
>> ...
>>  (b) Teach the git-update-index logic about hashing text blobs.
>
> I agree that we can assume editors can grok files with LF
> end-of-line just fine and we would not need to do the reverse
> conversion on checkout paths (e.g. "read-tree -u", "checkout-index").
>
> Textual diff generation needs to learn the CRLF-to-LF conversion
> in diff_populate_filespec(); this needs to be done even when the
> caller wants size_only.
>
> Oops.
>
> Not so fast.  What's your plan for st_size?

If I were to do this, I would say the cache should store the
size on the filesystem in stat fields.  Which means that the
object name recorded is text blob _after_ line endings are
normalized to LF, and its exploded size does not necessarily
match the cached size.

So this means that whoever does the diff_populate_filespec()
change needs to be careful, but it is not such a big deal.


* Re: mingw, windows, crlf/lf, and git
  2007-02-12 22:54     ` Junio C Hamano
  2007-02-12 23:02       ` Junio C Hamano
@ 2007-02-12 23:09       ` Linus Torvalds
  2007-02-12 23:25         ` Linus Torvalds
  2007-02-12 23:24       ` Johannes Schindelin
  2 siblings, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2007-02-12 23:09 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Theodore Tso, Mark Levedahl, Git Mailing List



On Mon, 12 Feb 2007, Junio C Hamano wrote:
> 
> Not so fast.  What's your plan for st_size?

Umm. There's two (very distinct) uses for st_size.

The one that we actually use to validate the current index obviously must 
match the "OS returned value". It contains all the CR/LF stuff.

The one where we actually read the file and run SHA1 on the result must 
obviously be the post-conversion one.

But it shouldn't be a problem. We'll always know which one matters: the 
index case is always about pure stat information (and has no meaning 
outside of that, really - after all, it's no different from st_mode etc, 
and we actually keep it in a special binary format that is endian-safe!) 
and the "real object" case is always about the *data* we use to compare 
with.

I don't think we ever mix the two anyway.

		Linus


* Re: mingw, windows, crlf/lf, and git
  2007-02-12 23:25         ` Linus Torvalds
@ 2007-02-12 23:23           ` David Lang
  0 siblings, 0 replies; 83+ messages in thread
From: David Lang @ 2007-02-12 23:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Junio C Hamano, Theodore Tso, Mark Levedahl, Git Mailing List

On Mon, 12 Feb 2007, Linus Torvalds wrote:

> So we'd just need to pass in the information about whether it's binary or
> not, and then do something like
>
> 	@@ -2091,6 +2091,10 @@ int index_fd(unsigned char *sha1, int fd, struct stat *st, int write_object, con
>
> 	 	if (!type)
> 	 		type = blob_type;
> 	+#ifndef __UNIX__
> 	+	if (text && !strcmp(type, blob_type))
> 	+		convert_crlf_to_lf(&buf, &size);
> 	+#endif
> 	 	if (write_object)
> 	 		ret = write_sha1_file(buf, size, type, sha1);
> 	 	else
>
> and that would take care of a lot of things (yeah, I'd not do it that way
> in practice, but really doesn't look that nasty - it's actually much
> nastier to have to look up the text/binary type in the first place).

you could do something like this and it would deal with the crlf/lf problem, but 
if you instead put in the conversion hooks like Ted suggested then you can 
actually gain a LOT more.

his example of openoffice documents that are gzipped xml files is a very good 
one. if the 'conversion' is to gunzip on checkin and gzip on checkout then the 
core git logic will work on the nice diffable xml instead of the compressed 
binary blob.

if this is extensible to arbitrary helper functions to do the conversions I'll 
bet that there are many other cases that can use this.

I think the big question needs to be, is this helper app a filter, or can it be 
passed a filename as the destination (which would let it do things like set 
permissions on the files it creates), or should it be both?

David Lang


* Re: mingw, windows, crlf/lf, and git
  2007-02-12 22:54     ` Junio C Hamano
  2007-02-12 23:02       ` Junio C Hamano
  2007-02-12 23:09       ` Linus Torvalds
@ 2007-02-12 23:24       ` Johannes Schindelin
  2007-02-12 23:42         ` Junio C Hamano
  2007-02-13  0:32         ` Mark Levedahl
  2 siblings, 2 replies; 83+ messages in thread
From: Johannes Schindelin @ 2007-02-12 23:24 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Linus Torvalds, Theodore Tso, Mark Levedahl, Git Mailing List

Hi,

On Mon, 12 Feb 2007, Junio C Hamano wrote:

> I agree that we can assume editors can grok files with LF end-of-line 
> just fine and we would not need to do the reverse conversion on checkout 
> paths (e.g. "read-tree -u", "checkout-index").

In that case, a simple pre-commit hook would suffice.

No, the problem mentioned by Mark was a very real one: you _cannot_ rely 
on Windows' editors not to fsck up with line endings. The worst case is if 
the file contains _some_ CRLF and _some_ LF. Almost always I had the 
problem that it now converted _all_ LFs to CRLFs. Even those which already 
were converted.

So, if we are to support text mode, it is not one-way. If we do one-way, 
we really do _not_ support text mode, but pre-commit conversion to LF 
style text. And in this case, core git does not need _any_ change.

Ciao,
Dscho


* Re: mingw, windows, crlf/lf, and git
  2007-02-12 23:09       ` Linus Torvalds
@ 2007-02-12 23:25         ` Linus Torvalds
  2007-02-12 23:23           ` David Lang
  0 siblings, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2007-02-12 23:25 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Theodore Tso, Mark Levedahl, Git Mailing List



On Mon, 12 Feb 2007, Linus Torvalds wrote:
> 
> But it shouldn't be a problem. We'll always know which one matters: the 
> index case is always about pure stat information (and has no meaning 
> outside of that, really - after all, it's no different from st_mode etc, 
> and we actually keep it in a special binary format that is endian-safe!) 
> and the "real object" case is always about the *data* we use to compare 
> with.

In fact, for git-update-index, I think it's *literally* as easy as just 
changing "index_fd()" to convert the buffer on-the-fly as needed, before 
we actually call "write_sha1_file()" or "hash_sha1_file()".

So we'd just need to pass in the information about whether it's binary or 
not, and then do something like

	@@ -2091,6 +2091,10 @@ int index_fd(unsigned char *sha1, int fd, struct stat *st, int write_object, con
	 
	 	if (!type)
	 		type = blob_type;
	+#ifndef __UNIX__
	+	if (text && !strcmp(type, blob_type))
	+		convert_crlf_to_lf(&buf, &size);
	+#endif
	 	if (write_object)
	 		ret = write_sha1_file(buf, size, type, sha1);
	 	else

and that would take care of a lot of things (yeah, I'd not do it that way 
in practice, but really doesn't look that nasty - it's actually much 
nastier to have to look up the text/binary type in the first place).

Something similar looks to be true in diff generation. The core "compare 
two SHA1's at a time" doesn't need any changes, but the code that actually 
reads in the temporary file from disk obviously does. But even that is 
just _one_ point, afaik - "diff_populate_filespec()":

	@@ -1362,6 +1362,10 @@ int diff_populate_filespec(struct diff_filespec *s, int size_only)
	 		if (fd < 0)
	 			goto err_empty;
	 		s->data = xmmap(NULL, s->size, PROT_READ, MAP_PRIVATE, fd, 0);
	+#ifndef __UNIX__
	+		if (text)
	+			convert_crlf_to_lf(&s->data, &s->size);
	+#endif
	 		close(fd);
	 		s->should_munmap = 1;
	 	}

(and again, that's not real code, it would also need to change the 
"should_munmap" flag to indicate the state of the _new_ "data" thing).
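
For illustration, one way a "convert_crlf_to_lf()" helper could behave. This
is a sketch under assumptions, not code from git: since the data may be a
read-only mmap, it copies into a malloc'ed buffer instead of editing in
place, and the return value (an invented convention) tells the caller whether
the buffer was replaced, which is exactly the information "should_munmap"
would need:

```c
#include <stdlib.h>

/*
 * Hypothetical helper: fold every CRLF pair in *buf to a single LF.
 * Returns 1 if *buf now points at a malloc'ed copy (caller must free it
 * and must no longer munmap it), 0 if nothing needed converting, and
 * -1 if allocation failed.  The original buffer is never written to.
 */
int convert_crlf_to_lf(char **buf, unsigned long *size)
{
	const char *src = *buf;
	unsigned long i, out = 0, crlf = 0;
	char *dst;

	/* First pass: count CRLF pairs so we know whether to copy at all. */
	for (i = 0; i + 1 < *size; i++)
		if (src[i] == '\r' && src[i + 1] == '\n')
			crlf++;
	if (!crlf)
		return 0;

	dst = malloc(*size - crlf);
	if (!dst)
		return -1;

	/* Second pass: copy everything except the CR of each CRLF pair. */
	for (i = 0; i < *size; i++) {
		if (src[i] == '\r' && i + 1 < *size && src[i + 1] == '\n')
			continue;
		dst[out++] = src[i];
	}
	*buf = dst;
	*size = out;
	return 1;
}
```

A lone CR that is not followed by LF is left alone in this sketch; only the
two-byte CRLF sequence is treated as a line ending.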

		Linus


* Re: mingw, windows, crlf/lf, and git
  2007-02-12 23:24       ` Johannes Schindelin
@ 2007-02-12 23:42         ` Junio C Hamano
  2007-02-12 23:46           ` David Lang
  2007-02-12 23:50           ` Johannes Schindelin
  2007-02-13  0:32         ` Mark Levedahl
  1 sibling, 2 replies; 83+ messages in thread
From: Junio C Hamano @ 2007-02-12 23:42 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Linus Torvalds, Theodore Tso, Mark Levedahl, Git Mailing List

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> Hi,
>
> On Mon, 12 Feb 2007, Junio C Hamano wrote:
>
>> I agree that we can assume editors can grok files with LF end-of-line 
>> just fine and we would not need to do the reverse conversion on checkout 
>> paths (e.g. "read-tree -u", "checkout-index").
>
> In that case, a simple pre-commit hook would suffice.
>
> No, the problem mentioned by Mark was a very real one: you _cannot_ rely 
> on Windows' editors not to fsck up with line endings. The worst case is if 
> the file contains _some_ CRLF and _some_ LF. Almost always I had the 
> problem that it now converted _all_ LFs to CRLFs. Even those which already 
> were converted.
>
> So, if we are to support text mode, it is not one-way. If we do one-way, 
> we really do _not_ support text mode, but pre-commit conversion to LF 
> style text. And in this case, core git does not need _any_ change.

Well, I disagree on two counts.

 - I do not see how you propose to solve some CRLF and some LF
   case with both-ways conversion.

 - A pre-commit hook would not be sufficient.  In an edit, diff,
   test and then commit cycle, the diff and test steps need to
   look at whatever the editor left on the filesystem, so the
   change to diff_populate_filespec() is needed to make the diff
   part work.


   


* Re: mingw, windows, crlf/lf, and git
  2007-02-12 23:42         ` Junio C Hamano
@ 2007-02-12 23:46           ` David Lang
  2007-02-12 23:50           ` Johannes Schindelin
  1 sibling, 0 replies; 83+ messages in thread
From: David Lang @ 2007-02-12 23:46 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Johannes Schindelin, Linus Torvalds, Theodore Tso, Mark Levedahl,
	Git Mailing List

On Mon, 12 Feb 2007, Junio C Hamano wrote:

>> Hi,
>>
>> On Mon, 12 Feb 2007, Junio C Hamano wrote:
>>
>>> I agree that we can assume editors can grok files with LF end-of-line
>>> just fine and we would not need to do the reverse conversion on checkout
>>> paths (e.g. "read-tree -u", "checkout-index").
>>
>> In that case, a simple pre-commit hook would suffice.
>>
>> No, the problem mentioned by Mark was a very real one: you _cannot_ rely
>> on Windows' editors not to fsck up with line endings. The worst case is if
>> the file contains _some_ CRLF and _some_ LF. Almost always I had the
>> problem that it now converted _all_ LFs to CRLFs. Even those which already
>> were converted.
>>
>> So, if we are to support text mode, it is not one-way. If we do one-way,
>> we really do _not_ support text mode, but pre-commit conversion to LF
>> style text. And in this case, core git does not need _any_ change.
>
> Well I disagree in two counts.
>
> - I do not see how you propose to solve some CRLF and some LF
>   case with both-ways conversion.

the expectation is that the some-of-each situation is unlikely to happen if you 
convert all the time.

and if you do end up with a mixed ending file, the next time you check it in 
from a windows box it should clean it up.

David Lang


* Re: mingw, windows, crlf/lf, and git
  2007-02-12 23:42         ` Junio C Hamano
  2007-02-12 23:46           ` David Lang
@ 2007-02-12 23:50           ` Johannes Schindelin
  2007-02-13  0:59             ` Mark Levedahl
  1 sibling, 1 reply; 83+ messages in thread
From: Johannes Schindelin @ 2007-02-12 23:50 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Linus Torvalds, Theodore Tso, Mark Levedahl, Git Mailing List

Hi,

On Mon, 12 Feb 2007, Junio C Hamano wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> 
> > Hi,
> >
> > On Mon, 12 Feb 2007, Junio C Hamano wrote:
> >
> >> I agree that we can assume editors can grok files with LF end-of-line 
> >> just fine and we would not need to do the reverse conversion on checkout 
> >> paths (e.g. "read-tree -u", "checkout-index").
> >
> > In that case, a simple pre-commit hook would suffice.
> >
> > No, the problem mentioned by Mark was a very real one: you _cannot_ rely 
> > on Windows' editors not to fsck up with line endings. The worst case is if 
> > the file contains _some_ CRLF and _some_ LF. Almost always I had the 
> > problem that it now converted _all_ LFs to CRLFs. Even those which already 
> > were converted.
> >
> > So, if we are to support text mode, it is not one-way. If we do one-way, 
> > we really do _not_ support text mode, but pre-commit conversion to LF 
> > style text. And in this case, core git does not need _any_ change.
> 
> Well I disagree in two counts.
> 
>  - I do not see how you propose to solve some CRLF and some LF
>    case with both-ways conversion.

Very easy. Forward: s/\r\n/\n/g. Backward: s/\(^\|[^\r]\)\n/\1\r\n/g.
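
Spelled out in C, the backward direction might look like the sketch below
(the function name is made up for illustration; it is not an existing git
helper). It adds a CR only in front of an LF that does not already have one:

```c
#include <stdlib.h>

/*
 * Hypothetical helper: return a malloc'ed, NUL-terminated copy of src in
 * which every bare LF (one not preceded by CR) becomes CRLF.  Existing
 * CRLF pairs are copied unchanged.  Returns NULL if allocation fails.
 */
char *lf_to_crlf(const char *src, unsigned long size, unsigned long *out_size)
{
	unsigned long i, out = 0, bare = 0;
	char *dst;

	/* Count the LFs that need a CR so the output can be sized exactly. */
	for (i = 0; i < size; i++)
		if (src[i] == '\n' && (i == 0 || src[i - 1] != '\r'))
			bare++;

	dst = malloc(size + bare + 1);
	if (!dst)
		return NULL;

	for (i = 0; i < size; i++) {
		if (src[i] == '\n' && (i == 0 || src[i - 1] != '\r'))
			dst[out++] = '\r';
		dst[out++] = src[i];
	}
	dst[out] = '\0';
	*out_size = out;
	return dst;
}
```

Note that the forward direction maps both "a\r\n" and "a\n" to "a\n", so for a
file with mixed endings this pair normalizes the file rather than restoring
the original bytes.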

>  - Pre-commit hook would not be sufficient.  In a edit, diff,
>    test and then commit cycle, diff and test step needs to look
>    at whatever the editor left on the filesystem, so the changes
>    to populate-filespec is needed to make diff part work.

Yes, you are right.

However, since this is all post-1.5.0 (right? Right?) why not go with more 
of Ted's proposal, and make this whole mess also usable for other things 
than just crlf issues?

And I _really_ think that you do not help Windows people by doing this 
one-way thing.

Ciao,
Dscho


* Re: mingw, windows, crlf/lf, and git
  2007-02-12 23:24       ` Johannes Schindelin
  2007-02-12 23:42         ` Junio C Hamano
@ 2007-02-13  0:32         ` Mark Levedahl
  1 sibling, 0 replies; 83+ messages in thread
From: Mark Levedahl @ 2007-02-13  0:32 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Junio C Hamano, Linus Torvalds, Theodore Tso, Git Mailing List

Johannes Schindelin wrote:
> Hi,
>
> On Mon, 12 Feb 2007, Junio C Hamano wrote:
>
>   
>> I agree that we can assume editors can grok files with LF end-of-line 
>> just fine and we would not need to do the reverse conversion on checkout 
>> paths (e.g. "read-tree -u", "checkout-index").
>>     
>
> In that case, a simple pre-commit hook would suffice.
>
> No, the problem mentioned by Mark was a very real one: you _cannot_ rely 
> on Windows' editors not to fsck up with line endings. The worst case is if 
> the file contains _some_ CRLF and _some_ LF. Almost always I had the 
> problem that it now converted _all_ LFs to CRLFs. Even those which already 
> were converted.
>
> So, if we are to support text mode, it is not one-way. If we do one-way, 
> we really do _not_ support text mode, but pre-commit conversion to LF 
> style text. And in this case, core git does not need _any_ change.
>
> Ciao,
> Dscho
In my work flow, I am using a pre-commit script that (among other 
things) rewrites all text files to have \n endings. This is a one-way 
conversion, and does work well for the set of tools I am using. The 
converters I use I wrote years ago, and are smart enough to deal with 
mixtures of \n, \r\n, and \r line endings in one file, transforming all 
into one unified form. d2u / u2d were not that robust when I last tried 
them (years ago), but that robustness is an absolute necessity.

However, I don't think the one-way conversion is acceptable across the 
board. While the only Windows editor I am aware of that doesn't grok \n 
is Notepad (the moral equivalent of edlin), I suspect that undue reliance 
upon this will still lead to grief. If nothing else, someone, somewhere 
will find that their beloved crlf's are missing and will complain. 
Loudly. And in the lore, git will become known for being "weird."  So, I 
suspect a checkout script is necessary.

Mark


* Re: mingw, windows, crlf/lf, and git
  2007-02-12 23:50           ` Johannes Schindelin
@ 2007-02-13  0:59             ` Mark Levedahl
  2007-02-13  1:06               ` Johannes Schindelin
  2007-02-13  5:18               ` Jeff King
  0 siblings, 2 replies; 83+ messages in thread
From: Mark Levedahl @ 2007-02-13  0:59 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Junio C Hamano, Linus Torvalds, Theodore Tso, Mark Levedahl,
	Git Mailing List

Johannes Schindelin wrote:
> However, since this is all post-1.5.0 (right? Right?) why not go with more 
> of Ted's proposal, and make this whole mess also usable for other things 
> than just crlf issues
Whatever is done, it needs to be robust to the notion that people will 
fail to set the correct file type somewhere. Current cvsnt is fairly 
good at autodetecting and setting text vs binary file type, and enforces 
this across all platforms, so things don't go awry too often. It is in 
my experience more reliable than subversion, which basically relies upon 
file extensions mapping to mime types to identify content. All of which 
is far too low a standard of accuracy for a version control 
system: I lost many files per year due to the above nonsense, so I worry 
about trying to create a very general transform solution and not making 
it really, really failsafe. Having projects define individual globbing 
patterns is good, double checking the content for sanity is an absolute 
must, but I don't think that is enough. I suspect the solution should 
include round-trip conversion when creating blobs to assure that the 
input can be exactly reconstructed by the inverse transformation (and 
therefore possibly rejecting input with mixed line endings). A similar 
check could be applied on checkout.
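
To make the round-trip idea concrete, here is a minimal sketch, assuming a
platform whose working-tree convention is CRLF. All three function names are
hypothetical. The check converts to LF, converts back, and accepts the input
only if the bytes come back identical:

```c
#include <stdlib.h>
#include <string.h>

/* CRLF -> LF into dst (at least n bytes); returns the converted length. */
static unsigned long crlf_to_lf(const char *src, unsigned long n, char *dst)
{
	unsigned long i, out = 0;

	for (i = 0; i < n; i++) {
		if (src[i] == '\r' && i + 1 < n && src[i + 1] == '\n')
			continue;	/* drop the CR of a CRLF pair */
		dst[out++] = src[i];
	}
	return out;
}

/* Bare LF -> CRLF into dst (at least 2 * n bytes); returns the length. */
static unsigned long lf_to_crlf(const char *src, unsigned long n, char *dst)
{
	unsigned long i, out = 0;

	for (i = 0; i < n; i++) {
		if (src[i] == '\n' && (i == 0 || src[i - 1] != '\r'))
			dst[out++] = '\r';
		dst[out++] = src[i];
	}
	return out;
}

/*
 * Returns 1 if checkin followed by checkout would reproduce buf exactly,
 * 0 if it would not (e.g. mixed line endings), -1 on allocation failure.
 */
int roundtrip_is_exact(const char *buf, unsigned long size)
{
	char *lf = malloc(size + 1);
	char *back = malloc(2 * size + 1);
	unsigned long lf_size, back_size;
	int ok;

	if (!lf || !back) {
		free(lf);
		free(back);
		return -1;
	}
	lf_size = crlf_to_lf(buf, size, lf);
	back_size = lf_to_crlf(lf, lf_size, back);
	ok = back_size == size && !memcmp(back, buf, size);
	free(lf);
	free(back);
	return ok;
}
```

With this sketch a pure-CRLF file passes, while a file mixing CRLF and bare LF
fails and could be rejected or flagged before a blob is ever created.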

Perhaps I'm too paranoid, but I've been burnt way too many times by 
text/binary mode stuff to let this part be trivialized. Maybe it only 
gets enabled by core.ImReallyParanoid, but I want that option.

Mark


* Re: mingw, windows, crlf/lf, and git
  2007-02-13  0:59             ` Mark Levedahl
@ 2007-02-13  1:06               ` Johannes Schindelin
  2007-02-13  1:13                 ` Shawn O. Pearce
  2007-02-13  1:36                 ` Mark Levedahl
  2007-02-13  5:18               ` Jeff King
  1 sibling, 2 replies; 83+ messages in thread
From: Johannes Schindelin @ 2007-02-13  1:06 UTC (permalink / raw)
  To: Mark Levedahl
  Cc: Junio C Hamano, Linus Torvalds, Theodore Tso, Mark Levedahl,
	Git Mailing List

Hi,

On Mon, 12 Feb 2007, Mark Levedahl wrote:

> Perhaps I'm too paranoid, but I've been burnt way too many times by 
> text/binary mode stuff to let this part be trivialized. Maybe it only 
> gets enabled by core.ImReallyParanoid, but I want that option.

Be aware that what you proposed costs many CPU cycles. I am totally 
opposed to enabling that option by default on all platforms. I am okay 
with .gitattributes (but I would call it .gitfiletypes), but I am _not_ 
okay with git being _too much_ fscked up by Windows. Microsoft has done 
enough harm already.

Ciao,
Dscho


* Re: mingw, windows, crlf/lf, and git
  2007-02-13  1:06               ` Johannes Schindelin
@ 2007-02-13  1:13                 ` Shawn O. Pearce
  2007-02-13  1:20                   ` David Lang
  2007-02-13  1:36                 ` Mark Levedahl
  1 sibling, 1 reply; 83+ messages in thread
From: Shawn O. Pearce @ 2007-02-13  1:13 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Mark Levedahl, Junio C Hamano, Linus Torvalds, Theodore Tso,
	Mark Levedahl, Git Mailing List

Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> On Mon, 12 Feb 2007, Mark Levedahl wrote:
> 
> > Perhaps I'm too paranoid, but I've been burnt way too many times by 
> > text/binary mode stuff to let this part be trivialized. Maybe it only 
> > gets enabled by core.ImReallyParanoid, but I want that option.
> 
> Be aware that what you proposed costs many CPU cycles. I am totally 
> opposed to enabling that option by default on all platforms. I am okay 
> with .gitattributes (but I would call it .gitfiletypes), but I am _not_ 
> okay with git being _too much_ fscked up by Windows. Microsoft has done 
> enough harm already.

Indeed; this type of checking should only occur if there is a filter
applied to a file.  Most files in most projects would hopefully
just be considered to be byte streams to Git, like they are today,
and thus not incur any additional overhead, beyond matching their
type to determine they are in fact just a byte stream.

The type could be cached in the index; or at least a single bit
which says "I'm just a byte stream, thanks" so that the matching
only needs to occur during an initial read-tree.

-- 
Shawn.


* Re: mingw, windows, crlf/lf, and git
  2007-02-13  1:13                 ` Shawn O. Pearce
@ 2007-02-13  1:20                   ` David Lang
  0 siblings, 0 replies; 83+ messages in thread
From: David Lang @ 2007-02-13  1:20 UTC (permalink / raw)
  To: Shawn O. Pearce
  Cc: Johannes Schindelin, Mark Levedahl, Junio C Hamano,
	Linus Torvalds, Theodore Tso, Mark Levedahl, Git Mailing List

On Mon, 12 Feb 2007, Shawn O. Pearce wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
>> On Mon, 12 Feb 2007, Mark Levedahl wrote:
>>
>>> Perhaps I'm too paranoid, but I've been burnt way too many times by
>>> text/binary mode stuff to let this part be trivialized. Maybe it only
>>> gets enabled by core.ImReallyParanoid, but I want that option.
>>
>> Be aware that what you proposed costs many CPU cycles. I am totally
>> opposed to enabling that option by default on all platforms. I am okay
>> with .gitattributes (but I would call it .gitfiletypes), but I am _not_
>> okay with git being _too much_ fscked up by Windows. Microsoft has done
>> enough harm already.
>
> Indeed; this type of checking should only occur if there is a filter
> applied to a file.  Most files in most projects would hopefully
> just be considered to be byte streams to Git, like they are today,
> and thus not incur any additional overhead, beyond matching their
> type to determine they are in fact just a byte stream.
>
> The type could be cached in the index; or at least a single bit
> which says "I'm just a byte stream, thanks" so that the matching
> only needs to occur during an initial read-tree.

for the limited case of line endings it may be reasonable to define the internal 
git format to be lf, and if you are running on a platform that uses this natively 
no conversion is needed

one possible way to make this a general feature is to have the helper script 
have a --needed flag that tells git if it would do anything on the current 
platform or not. this way you don't need to run it (and sanity check it) if it's 
not needed.

David Lang


* Re: mingw, windows, crlf/lf, and git
  2007-02-13  1:06               ` Johannes Schindelin
  2007-02-13  1:13                 ` Shawn O. Pearce
@ 2007-02-13  1:36                 ` Mark Levedahl
  1 sibling, 0 replies; 83+ messages in thread
From: Mark Levedahl @ 2007-02-13  1:36 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Junio C Hamano, Linus Torvalds, Theodore Tso, Mark Levedahl,
	Git Mailing List

Johannes Schindelin wrote:
> Hi,
> 
> On Mon, 12 Feb 2007, Mark Levedahl wrote:
> 
>> Perhaps I'm too paranoid, but I've been burnt way too many times by 
>> text/binary mode stuff to let this part be trivialized. Maybe it only 
>> gets enabled by core.ImReallyParanoid, but I want that option.
> 
> Be aware that what you proposed costs many CPU cycles. I am totally 
> opposed to enabling that option by default on all platforms. I am okay 
> with .gitattributes (but I would call it .gitfiletypes), but I am _not_ 
> okay with git being _too much_ fscked up by Windows. Microsoft has done 
> enough harm already.

I would assume that none of this crlf stuff exists at all on Linux / 
Unix / Posix, so if done right it has zero impact outside of the Windows 
nuthouse. Inside that, folks are already so used to incredible slowness 
in file I/O that I'm not sure the round tripping I suggest as a check 
would be very noticeable, but in any case I fully agree it should be 
optional even there. However, if git could support something that never 
screws up, absolutely guaranteeing data integrity in the presence of 
these transforms, that would be a first in this arena and I believe a 
significant selling point.

Mark


* Re: mingw, windows, crlf/lf, and git
  2007-02-11 23:13 mingw, windows, crlf/lf, and git Mark Levedahl
                   ` (2 preceding siblings ...)
  2007-02-12  4:24 ` Theodore Tso
@ 2007-02-13  2:02 ` Junio C Hamano
  2007-02-13  3:21   ` Mark Levedahl
  2007-02-13  3:32 ` Alexander Litvinov
  4 siblings, 1 reply; 83+ messages in thread
From: Junio C Hamano @ 2007-02-13  2:02 UTC (permalink / raw)
  To: Mark Levedahl; +Cc: Git Mailing List

Mark Levedahl <mlevedahl@verizon.net> writes:

> I am NOT intending to start a flamewar O:-) , so please don't turn
> this into one.

Heh, a lofty goal.  And I am glad to see that a thread full of
constructive suggestion is already going on.

So now I do not have to fear starting a flamewar; I can safely
vent.

> The recent threads on a mingw git port are explicit in the intent to
> provide a Windows native git. I believe there is a fundamental
> conflict here with the position, clearly stated by Linus, that git
> does not alter content in any way. Windows suffers the curse of DOS
> line endings (\r\n vs \n), and a true port to Windows *must* allow for
> \r\n and \n to be semantically the same thing as most large projects
> end up with a mixture of such files and/or are targeting
> cross-platform capabilities. The major competing solutions git seeks
> to supplant (cvs, cvsnt, svn, hg) have capability to recognize "text"
> files and transparently replace \r\n with \n on input, the reverse on
> output, and ignore all such differences on diff operations. To be
> relevant on native Windows, git must do the same. Otherwise, git will
> be deemed "too wierd" and dismissed in favor of a tool "that works."
> 
> There is no use to debating the technical merits of \r\n vs \n vs \r
> vs whatever, nor of not converting. Really. Just accept that there is
> a fundamental requirement that any version control tool on Windows be
> able to silently convert between \r\n and \n. To believe otherwise is
> to expect that the conversion be pushed elsewhere into the tool chain
> in use, and that won't happen as the competition already provide this
> conversion capability.

I think there is a fundamental misconception in the above.  I do
not know about others, but to me personally, I do not see any
"seeking to supplant", nor "competition".  It's not like I or
the people who raised git into its current shape are begging
Windows users to consider using git and bending over backwards to
please them.  You should hone your diplomacy.

Current git may or may not match what they need, and if it does
not match what they need, making it match what they need is
primarily the responsibility of them.  If Windows users find
something in git that is interesting and useful, but if they
find something else lacking in it to be truly useful for them,
they can submit patches, or if they cannot implement the changes
themselves but only have wishlist items, then _they_ can do the
begging.

People in the git community are certainly a friendly and helpful
bunch, and some (including me) are unfortunate enough that
sometimes they have to touch Windows, so some degree of need is
felt to support Windows better even within the community, but it
has never been high priority.  Making it higher priority by
bringing in better ideas and starting the fire must come from
people who care more about Windows than me and Linus.

> So, I think the git project needs to come to an explicit position on
> this, basically being:
>
> 1) git is a POSIX only tool (i.e., there will be no \r\n munging), or
> 2) a Windows port of git will handle and mung \r\n and \n line endings.

I do not think the git project needs to do any such thing.  The
project evolves reflecting the needs of its users, and the
design is not decided upfront without doing any feasibility
study.  I would certainly not say our position is (1), IOW, I
would not say we will rule out Windows support.  If it can be
reasonably done without harming the code, why not?

Depending on how cleanly a change Windows users want is done
without negatively affecting the existing users, it may or may
not be judged acceptable.  We will know only when we see at
least the design and preferably the code.  I feel no need to
decide between (1) and (2) upfront before that happens.

> If the answer is 1, the mingw port is a waste of time as it simply
> won't be usable by its target audience. If the answer is 2, then I
> think a very careful design of this capability is in order.
>
> Comments?

This is not just you, and fortunately it does not happen very
often in the git community, but I find it _very_ irritating when
somebody says: "here is a patch, I'll do the doc, test, and
tidying up if this patch is accepted".  I usually pretend to be
a nice person and accept the patch when it is obviously good,
or pretend that I was too busy and did not notice such a
message, but I feel _very_ tempted to say: "if you care deeply
enough that what you did is useful, I expect you'd perfect it
whether or not I apply your patch to my tree right now.  If even
the original author, you, do not find it worth perfecting, then
I am not interested at all."

Even if all existing git community members felt (1) above and
were unwilling to accept line-end conversions (which by now you
already know is not the case -- and that is why I waited until
now to address this as a separate "attitude" issue), if somebody
who works on Windows is motivated enough to make git work better
for him, he can fork (and forking is very easy with git).  If
the forked git works well both on Windows and on non-Windows,
people who initially felt (1) will realize that they were wrong
and then the codebase can be merged back together (and merging
the forked projects is very easy with git).

It's open source.  People shouldn't worry too much about what
they have done being "wasted".  You are not even talking about what
you've already done -- you are talking about what you _might_
do.

And your saying "If 2, then we need to think carefully" was VERY
good.  My point is that you did not have to say the "Is it 1, or
is it 2, and if 2 then" part.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: mingw, windows, crlf/lf, and git
  2007-02-13  2:02 ` Junio C Hamano
@ 2007-02-13  3:21   ` Mark Levedahl
  2007-02-13  6:05     ` Junio C Hamano
  0 siblings, 1 reply; 83+ messages in thread
From: Mark Levedahl @ 2007-02-13  3:21 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Mark Levedahl, Git Mailing List

Junio C Hamano wrote:
 > Mark Levedahl <mlevedahl@verizon.net> writes:
 >
 >> I am NOT intending to start a flamewar O:-) , so please don't turn
 >> this into one.
 >
 > Heh, a lofty goal.  And I am glad to see that a thread full of
 > constructive suggestion is already going on.
 >
 > So now I do not have to fear starting a flamewar; I can safely
 > vent.

Junio,

I meant absolutely no offense in anything I wrote, and sincerely 
apologize if any was taken. My past experiences caused me to be 
skeptical that a significant change to accommodate a very bad design of 
Windows would be accepted here. Happily, that skepticism was misplaced. 
I am much heartened by the responses, and am optimistic a good solution 
will be found that is acceptable to all. It is very clear that the group 
is open and supportive of working through the issues to help this, and I 
intend to contribute to that solution. (If nothing else, I would like to 
be known for something besides some modest ability to hack around Tk bugs).

So, I trust the flamethrowers can remain buried.

Mark

* Re: mingw, windows, crlf/lf, and git
  2007-02-11 23:13 mingw, windows, crlf/lf, and git Mark Levedahl
                   ` (3 preceding siblings ...)
  2007-02-13  2:02 ` Junio C Hamano
@ 2007-02-13  3:32 ` Alexander Litvinov
  2007-02-13 10:06   ` Johannes Schindelin
  4 siblings, 1 reply; 83+ messages in thread
From: Alexander Litvinov @ 2007-02-13  3:32 UTC (permalink / raw)
  To: Mark Levedahl; +Cc: Git Mailing List

On Monday 12 February 2007 05:13, Mark Levedahl wrote:
> 1) git is a POSIX only tool (i.e., there will be no \r\n munging), or
> 2) a Windows port of git will handle and mung \r\n and \n line endings.
>
> If the answer is 1, the mingw port is a waste of time as it simply won't
> be usable by its target audience. If the answer is 2, then I think a
> very careful design of this capability is in order.

I strongly object to this statement. I develop one project under Windows and
use Cygwin git for this. Yes, I have a problem with git assuming that the line
ending is \n, but most of the trouble is with diff and rebase. In general git
works well with \r\n line endings.

When I have a file that was converted from dos to unix format (or from unix to
dos), git generates a big diff. But anyway, the C++ compiler works well with
both formats, and in this case I simply convert the file to dos format and git
shows a nice diff again. If the unix format was committed to git, I simply
change the format and commit that file again.

The only trouble is rebase: it does not like \r\n endings and often produces
unexpected merge conflicts. But I don't use rebase often enough to really
investigate and try to solve the problem.

* Re: mingw, windows, crlf/lf, and git
  2007-02-13  0:59             ` Mark Levedahl
  2007-02-13  1:06               ` Johannes Schindelin
@ 2007-02-13  5:18               ` Jeff King
  1 sibling, 0 replies; 83+ messages in thread
From: Jeff King @ 2007-02-13  5:18 UTC (permalink / raw)
  To: Mark Levedahl
  Cc: Johannes Schindelin, Junio C Hamano, Linus Torvalds,
	Theodore Tso, Mark Levedahl, Git Mailing List

On Mon, Feb 12, 2007 at 07:59:50PM -0500, Mark Levedahl wrote:

> fail to set the correct file type somewhere. Current cvsnt is fairly 
> good at autodetecting and setting text vs binary file type, and enforces 
> this across all platforms, so things don't go awry too often. It is in 

There is obviously much sentiment that this should _not_ be the default
(and I agree). But if arbitrary filters are possible, then you can
theoretically write an 'autocrlf' filter which will try to do the right
thing, and you could set it for some or all files:

  echo '*: autocrlf' >.gitattributes

but it would be off by default. If we implement this, everyone has to
"pay" for .gitattributes (even if you don't use it, we have to look it
up to make sure you're not using it!), but nobody has to pay for any
filters they don't use.
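
As a rough illustration of the lookup such a scheme would need (this is not an
existing git interface): the sketch below maps a path to a filter name through
glob patterns, using POSIX fnmatch(). The struct attr_rule type, the
filter_for_path() function, and the "last matching rule wins" policy are all
assumptions made for this example.

```c
#include <fnmatch.h>
#include <stddef.h>

/* Hypothetical in-memory form of a line such as "*.txt: autocrlf". */
struct attr_rule {
	const char *pattern;	/* glob pattern matched against the path */
	const char *filter;	/* name of the filter to apply */
};

/*
 * Return the filter named by the last rule whose pattern matches the
 * path, or NULL when no rule matches (i.e. no filter is applied).
 */
const char *filter_for_path(const struct attr_rule *rules, size_t nr,
			    const char *path)
{
	const char *filter = NULL;
	size_t i;

	for (i = 0; i < nr; i++)
		if (fnmatch(rules[i].pattern, path, 0) == 0)
			filter = rules[i].filter;
	return filter;
}
```

With rules for "*.txt" and "*.c" that both name "autocrlf", a path such as
"logo.png" gets NULL back: the lookup still happens, but no filter runs.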

-Peff

* Re: mingw, windows, crlf/lf, and git
  2007-02-13  3:21   ` Mark Levedahl
@ 2007-02-13  6:05     ` Junio C Hamano
  0 siblings, 0 replies; 83+ messages in thread
From: Junio C Hamano @ 2007-02-13  6:05 UTC (permalink / raw)
  To: Mark Levedahl; +Cc: Mark Levedahl, Git Mailing List

Mark Levedahl <mdl123@verizon.net> writes:

> I meant absolutely no offense in anything I wrote, and sincerely
> apologize if any was taken.

None taken, although I admit that I was somewhat annoyed, having
to write the first part of my response.

* Re: mingw, windows, crlf/lf, and git
  2007-02-13  3:32 ` Alexander Litvinov
@ 2007-02-13 10:06   ` Johannes Schindelin
  2007-02-13 12:16     ` Alexander Litvinov
  2007-02-13 16:52     ` Linus Torvalds
  0 siblings, 2 replies; 83+ messages in thread
From: Johannes Schindelin @ 2007-02-13 10:06 UTC (permalink / raw)
  To: Alexander Litvinov; +Cc: Mark Levedahl, Git Mailing List

Hi,

On Tue, 13 Feb 2007, Alexander Litvinov wrote:

> When I have a file that was converted from dos to unix format (or from
> unix to dos), git generates a big diff. But anyway, the C++ compiler
> works well with both formats, and in this case I simply convert the file
> to dos format and git shows a nice diff again. If the unix format was
> committed to git, I simply change the format and commit that file again.

That's awful!

> The only trouble is rebase: it does not like \r\n endings and often
> produces unexpected merge conflicts. But I don't use rebase often enough
> to really investigate and try to solve the problem.

Well, if everybody thinks like you, maybe we do not have to change 
anything for Windows after all?

Ciao,
Dscho

* Re: mingw, windows, crlf/lf, and git
  2007-02-13 10:06   ` Johannes Schindelin
@ 2007-02-13 12:16     ` Alexander Litvinov
  2007-02-13 12:37       ` Johannes Schindelin
  2007-02-13 19:36       ` Mark Levedahl
  2007-02-13 16:52     ` Linus Torvalds
  1 sibling, 2 replies; 83+ messages in thread
From: Alexander Litvinov @ 2007-02-13 12:16 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Git Mailing List

On Tuesday 13 February 2007 16:06, Johannes Schindelin wrote:
> Hi,
>
> On Tue, 13 Feb 2007, Alexander Litvinov wrote:
> > When I have a file that was converted from dos to unix format (or from
> > unix to dos), git generates a big diff. But anyway, the C++ compiler
> > works well with both formats, and in this case I simply convert the file
> > to dos format and git shows a nice diff again. If the unix format was
> > committed to git, I simply change the format and commit that file again.
>
> That's awful!
If you are trying to build a history that looks good, you are right: this is a
terrible workflow.

> > The only trouble is rebase: it does not like \r\n endings and often
> > produces unexpected merge conflicts. But I don't use rebase often enough
> > to really investigate and try to solve the problem.
>
> Well, if everybody thinks like you, maybe we do not have to change
> anything for Windows after all?
I still wish to have a working rebase, so it would be nice if git handled \r\n
somehow. But please do not reproduce the behavior of cvs: under cygwin it
still uses \n!

By the way, most Windows programmers I work with say 'git is cool, but is
there a GUI like tortoise or wincvs?' :-)

* Re: mingw, windows, crlf/lf, and git
  2007-02-13 12:16     ` Alexander Litvinov
@ 2007-02-13 12:37       ` Johannes Schindelin
  2007-02-13 19:36       ` Mark Levedahl
  1 sibling, 0 replies; 83+ messages in thread
From: Johannes Schindelin @ 2007-02-13 12:37 UTC (permalink / raw)
  To: Alexander Litvinov; +Cc: Git Mailing List

Hi,

On Tue, 13 Feb 2007, Alexander Litvinov wrote:

> Tuesday 13 February 2007 16:06 Johannes Schindelin:
> > At some stage, Alexander wrote this:
> 
> > > The only trouble is rebase: it does not like \r\n endings and often
> > > produces unexpected merge conflicts. But I don't use rebase often
> > > enough to really investigate and try to solve the problem.
> >
> > Well, if everybody thinks like you, maybe we do not have to change
> > anything for Windows after all?
>
> I still wish to have a working rebase, so it would be nice if git
> handled \r\n somehow. But please do not reproduce the behavior of cvs:
> under cygwin it still uses \n!

You really should teach format-patch to output \n patches, and keep all 
your blobs CR free.

> By the way, most Windows programmers I work with say 'git is cool, but
> is there a GUI like tortoise or wincvs?' :-)

Some time ago, I started playing with a shell extension. Now that MinGW 
git is almost there, I might clean it up... Would you be interested in 
working on it, or is this just wishtalk?

Ciao,
Dscho

* Re: mingw, windows, crlf/lf, and git
  2007-02-13 10:06   ` Johannes Schindelin
  2007-02-13 12:16     ` Alexander Litvinov
@ 2007-02-13 16:52     ` Linus Torvalds
  2007-02-13 17:23       ` Linus Torvalds
                         ` (2 more replies)
  1 sibling, 3 replies; 83+ messages in thread
From: Linus Torvalds @ 2007-02-13 16:52 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Alexander Litvinov, Mark Levedahl, Git Mailing List



On Tue, 13 Feb 2007, Johannes Schindelin wrote:
> 
> On Tue, 13 Feb 2007, Alexander Litvinov wrote:
> 
> > When I have a file that was converted from dos to unix format (or from
> > unix to dos), git generates a big diff. But anyway, the C++ compiler
> > works well with both formats, and in this case I simply convert the file
> > to dos format and git shows a nice diff again. If the unix format was
> > committed to git, I simply change the format and commit that file again.
> 
> That's awful!
> 
> > The only trouble is rebase: it does not like \r\n endings and often
> > produces unexpected merge conflicts. But I don't use rebase often enough
> > to really investigate and try to solve the problem.
> 
> Well, if everybody thinks like you, maybe we do not have to change 
> anything for Windows after all?

No no no.

It's going to be _horrible_ if people start interesting projects in 
Windows, and there are files in a git repository that are encoded with 
CRLF. 

I'd much rather just get this right, and that means "no hooks". If people 
start using commit hooks etc, that will just mean that they won't use them 
for all-windows environments (why use it? Everybody has CRLF, and 
everybody _wants_ CRLF), or it will just be relatively expensive to have a 
complex hook anyway.

So I think we should plan on something like .gitattributes or similar, so 
that we _can_ handle mixed environments well, without any real setup or 
any real costs.

The costs really shouldn't be too high - we tend to avoid doing any 
expensive working tree changes *anyway*. For example, even "git checkout" 
has a huge optimization to avoid rewriting files that are already ok, so 
doing things like switching whole branches usually wouldn't even need any 
conversion for most files - even on platforms like Windows that need the 
conversion in the first place.

So considering that it looks _trivial_ for git-update-index, fairly easy 
for diff generation, and I doubt "git checkout" is really likely to be any 
worse either, this should just be something we do.

The *ONLY* case where we may not be able to do things automatically is 
actually a much more subtle one: "git cat-file". If we just get a SHA1, we 
don't know what the path to look it up was like, and thus we can never 
know whether it's a binary or a text object. With "-p" we can trivially 
guess, of course, but "git cat-file blob" simply must not do that!

But that really doesn't sound like a big problem to me ;)

		Linus

* Re: mingw, windows, crlf/lf, and git
  2007-02-13 16:52     ` Linus Torvalds
@ 2007-02-13 17:23       ` Linus Torvalds
  2007-02-13 17:23         ` Linus Torvalds
                           ` (2 more replies)
  2007-02-13 17:25       ` Nicolas Pitre
  2007-02-13 18:04       ` Johannes Schindelin
  2 siblings, 3 replies; 83+ messages in thread
From: Linus Torvalds @ 2007-02-13 17:23 UTC (permalink / raw)
  To: Johannes Schindelin, Junio C Hamano
  Cc: Alexander Litvinov, Mark Levedahl, Git Mailing List



On Tue, 13 Feb 2007, Linus Torvalds wrote:
> 
> I'd much rather just get this right, and that means "no hooks". If people 
> start using commit hooks etc, that will just mean that they won't use them 
> for all-windows environments (why use it? Everybody has CRLF, and 
> everybody _wants_ CRLF), or it will just be relatively expensive to have a 
> complex hook anyway.
> 
> So I think we should plan on something like .gitattributes or similar, so 
> that we _can_ handle mixed environments well, without any real setup or 
> any real costs.

Here's a patch that I think we can merge right now. There may be other 
places that need this, but this at least points out the three places that 
read/write working tree files for git update-index, checkout and diff 
respectively. That should cover a lot of it.

Some day we can actually implement it. In the meantime, this points out a 
place for people to start. We *can* even start with a really simple "we do 
CRLF conversion automatically, regardless of filename" kind of approach, 
that just looks at the data (all three cases have the _full_ file data 
already in memory) and says "ok, this is text, so let's convert to/from 
DOS format directly".

THAT somebody can write in ten minutes, and it would already make git much 
nicer on a DOS/Windows platform, I suspect.

And it would be totally zero-cost if you just make it a config option 
(but please make it dynamic with the _default_ just being 0/1 depending 
on whether it's UNIX/Windows, just so that UNIX people can _test_ it 
easily).
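
A minimal sketch of that idea, assuming a hypothetical config callback
(handle_config() and auto_crlf_enabled() are made-up names, and "core.autocrlf"
is used here only as a placeholder key): the default comes from the build
platform, and the flag can still be flipped at run time so that UNIX users can
test the conversion.

```c
#include <string.h>

/* Compile-time default: on for Windows builds, off everywhere else. */
#ifdef _WIN32
#define AUTO_CRLF_DEFAULT 1
#else
#define AUTO_CRLF_DEFAULT 0
#endif

/* Run-time flag; it starts out at the platform default. */
static int auto_crlf = AUTO_CRLF_DEFAULT;

/*
 * Hypothetical config callback: returns 1 if it recognized the
 * variable and updated the flag, 0 if the variable is not ours.
 * Only the literal values "true" and "1" switch the flag on.
 */
int handle_config(const char *var, const char *value)
{
	if (strcmp(var, "core.autocrlf") != 0)
		return 0;
	auto_crlf = value && (!strcmp(value, "true") || !strcmp(value, "1"));
	return 1;
}

/* Accessor used by the conversion code paths. */
int auto_crlf_enabled(void)
{
	return auto_crlf;
}
```

Because the check is a single integer test, a platform that leaves the flag
off pays essentially nothing for the feature.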

		Linus

* Re: mingw, windows, crlf/lf, and git
  2007-02-13 17:23       ` Linus Torvalds
@ 2007-02-13 17:23         ` Linus Torvalds
  2007-02-13 18:00         ` Junio C Hamano
  2007-02-13 18:05         ` Johannes Schindelin
  2 siblings, 0 replies; 83+ messages in thread
From: Linus Torvalds @ 2007-02-13 17:23 UTC (permalink / raw)
  To: Johannes Schindelin, Junio C Hamano
  Cc: Alexander Litvinov, Mark Levedahl, Git Mailing List



On Tue, 13 Feb 2007, Linus Torvalds wrote:
> 
> Here's a patch [...]

No. HERE's the trivial stupid patch that just marks the core places.

		Linus
---
diff --git a/diff.c b/diff.c
index aaab309..13b9b6c 100644
--- a/diff.c
+++ b/diff.c
@@ -1364,6 +1364,7 @@ int diff_populate_filespec(struct diff_filespec *s, int size_only)
 		s->data = xmmap(NULL, s->size, PROT_READ, MAP_PRIVATE, fd, 0);
 		close(fd);
 		s->should_munmap = 1;
+		/* FIXME! CRLF -> LF conversion goes here, based on "s->path" */
 	}
 	else {
 		char type[20];
diff --git a/entry.c b/entry.c
index 0ebf0f0..c2641dd 100644
--- a/entry.c
+++ b/entry.c
@@ -89,6 +89,7 @@ static int write_entry(struct cache_entry *ce, char *path, struct checkout *stat
 			return error("git-checkout-index: unable to create file %s (%s)",
 				path, strerror(errno));
 		}
+		/* FIXME: LF -> CRLF conversion goes here, based on "ce->name" */
 		wrote = write_in_full(fd, new, size);
 		close(fd);
 		free(new);
diff --git a/sha1_file.c b/sha1_file.c
index 0d4bf80..8ad7fad 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2091,6 +2091,7 @@ int index_fd(unsigned char *sha1, int fd, struct stat *st, int write_object, con
 
 	if (!type)
 		type = blob_type;
+	/* FIXME: CRLF -> LF conversion here for blobs! We'll need the path! */
 	if (write_object)
 		ret = write_sha1_file(buf, size, type, sha1);
 	else

* Re: mingw, windows, crlf/lf, and git
  2007-02-13 16:52     ` Linus Torvalds
  2007-02-13 17:23       ` Linus Torvalds
@ 2007-02-13 17:25       ` Nicolas Pitre
  2007-02-13 18:04       ` Johannes Schindelin
  2 siblings, 0 replies; 83+ messages in thread
From: Nicolas Pitre @ 2007-02-13 17:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Schindelin, Alexander Litvinov, Mark Levedahl, Git Mailing List

On Tue, 13 Feb 2007, Linus Torvalds wrote:

> The *ONLY* case where we may not be able to do things automatically is 
> actually a much more subtle one: "git cat-file". If we just get a SHA1, we 
> don't know what the path to look it up was like, and thus we can never 
> know whether it's a binary or a text object. With "-p" we can trivially 
> guess, of course, but "git cat-file blob" simply must not do that!

git-cat-file, and its counterpart git-hash-object, are fairly low level 
plumbing.  Anyone using them should be aware of the issue and apply the 
needed conversion.  And actually, since we're going to have the 
conversion routines in the core, we'd only need to add a --crlf argument 
to both of them to optionally perform the conversion since the user of 
those commands is more likely to know if the conversion is needed.


Nicolas

* Re: mingw, windows, crlf/lf, and git
  2007-02-13 17:23       ` Linus Torvalds
  2007-02-13 17:23         ` Linus Torvalds
@ 2007-02-13 18:00         ` Junio C Hamano
  2007-02-13 19:07           ` Linus Torvalds
  2007-02-13 18:05         ` Johannes Schindelin
  2 siblings, 1 reply; 83+ messages in thread
From: Junio C Hamano @ 2007-02-13 18:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Schindelin, Alexander Litvinov, Mark Levedahl, Git Mailing List

Linus Torvalds <torvalds@linux-foundation.org> writes:

> Here's a patch that I think we can merge right now. There may be other 
> places that need this, but this at least points out the three places that 
> read/write working tree files for git update-index, checkout and diff 
> respectively. That should cover a lot of it.

Thanks, applied.  I think git-apply has separate codepaths for
both reading and writing; I won't look into them before 1.5.0
but people are welcome to help advancing the cause before I get
to it ;-).

* Re: mingw, windows, crlf/lf, and git
  2007-02-13 16:52     ` Linus Torvalds
  2007-02-13 17:23       ` Linus Torvalds
  2007-02-13 17:25       ` Nicolas Pitre
@ 2007-02-13 18:04       ` Johannes Schindelin
  2007-02-13 18:11         ` Junio C Hamano
  2007-02-13 18:39         ` Linus Torvalds
  2 siblings, 2 replies; 83+ messages in thread
From: Johannes Schindelin @ 2007-02-13 18:04 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Alexander Litvinov, Mark Levedahl, Git Mailing List

Hi,

On Tue, 13 Feb 2007, Linus Torvalds wrote:

> On Tue, 13 Feb 2007, Johannes Schindelin wrote:
> > 
> > On Tue, 13 Feb 2007, Alexander Litvinov wrote:
> > 
> > > The only trouble is rebase: it does not like \r\n endings and often
> > > produces unexpected merge conflicts. But I don't use rebase often
> > > enough to really investigate and try to solve the problem.
> > 
> > Well, if everybody thinks like you, maybe we do not have to change 
> > anything for Windows after all?
> 
> No no no.
> 
> It's going to be _horrible_ if people start interesting projects in 
> Windows, and there are files in a git repository that are encoded with 
> CRLF.
> 
> I'd much rather just get this right, and that means "no hooks".

No hooks means something like cvsnt does, and that means no .gitattributes 
either. (BTW I really hate .gitattributes, as it does not at all say what 
this is about; it's about file _conversions_, not attributes).

CVSNT analyzes the files, and guesses if they are text, and only then 
activates the text mode.

I am strongly opposed to including something like that. (It was already 
proposed, and your "no hooks" suggests the same.)

However, I am slightly positive about the .gitfiletypes approach, _iff_ we 
think about more than just text/binary from the start. If we do it right, 
it will buy us more.

Ciao,
Dscho

* Re: mingw, windows, crlf/lf, and git
  2007-02-13 17:23       ` Linus Torvalds
  2007-02-13 17:23         ` Linus Torvalds
  2007-02-13 18:00         ` Junio C Hamano
@ 2007-02-13 18:05         ` Johannes Schindelin
  2 siblings, 0 replies; 83+ messages in thread
From: Johannes Schindelin @ 2007-02-13 18:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Junio C Hamano, Alexander Litvinov, Mark Levedahl, Git Mailing List

Hi,

On Tue, 13 Feb 2007, Linus Torvalds wrote:

> Here's a patch that I think we can merge right now.

Why the haste all of a sudden? Your patch is easy to apply for anyone 
who wants to work on text/binary or arbitrary file types. No need to rush 
a developers-only patch into git.git.

Ciao,
Dscho

* Re: mingw, windows, crlf/lf, and git
  2007-02-13 18:04       ` Johannes Schindelin
@ 2007-02-13 18:11         ` Junio C Hamano
  2007-02-13 18:39         ` Linus Torvalds
  1 sibling, 0 replies; 83+ messages in thread
From: Junio C Hamano @ 2007-02-13 18:11 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Linus Torvalds, Alexander Litvinov, Mark Levedahl, Git Mailing List

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> No hooks means something like cvsnt does, and that means no .gitattributes 
> either. (BTW I really hate .gitattributes, as it does not at all say what 
> this is about; it's about file _conversions_, not attributes).

> However, I am slightly positive about the .gitfiletypes approach, _iff_ we 
> think about more than just text/binary from the start. If we do it right, 
> it will buy us more.

We might start with only binary/text attributes, but we may add
more later, e.g. chmod=o-rwx.  I do not see much difference
between attributes vs filetypes.

* Re: mingw, windows, crlf/lf, and git
  2007-02-13 18:04       ` Johannes Schindelin
  2007-02-13 18:11         ` Junio C Hamano
@ 2007-02-13 18:39         ` Linus Torvalds
  2007-02-13 18:42           ` Johannes Schindelin
  1 sibling, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2007-02-13 18:39 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Alexander Litvinov, Mark Levedahl, Git Mailing List



On Tue, 13 Feb 2007, Johannes Schindelin wrote:
> 
> No hooks means something like cvsnt does, and that means no .gitattributes 
> either. (BTW I really hate .gitattributes, as it does not at all say what 
> this is about; it's about file _conversions_, not attributes).

No, it *is* about attributes.

In order to know how to convert, you need to know the attributes of the 
file.

So it's not about conversion: we would ALWAYS do conversion. It's about 
the fact that in order to do the conversion, we need to know what the 
attributes of the file are - is it text, or what.

And the equal point is that there are _other_ attributes that git might 
care about. The "merge strategy" attribute, for example. Or "owner" 
attributes for files etc.

		Linus

* Re: mingw, windows, crlf/lf, and git
  2007-02-13 18:39         ` Linus Torvalds
@ 2007-02-13 18:42           ` Johannes Schindelin
  0 siblings, 0 replies; 83+ messages in thread
From: Johannes Schindelin @ 2007-02-13 18:42 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Alexander Litvinov, Mark Levedahl, Git Mailing List

Hi,

On Tue, 13 Feb 2007, Linus Torvalds wrote:

> On Tue, 13 Feb 2007, Johannes Schindelin wrote:
> > 
> > No hooks means something like cvsnt does, and that means no .gitattributes 
> > either. (BTW I really hate .gitattributes, as it does not at all say what 
> > this is about; it's about file _conversions_, not attributes).
> 
> No, it *is* about attributes.
> 
> In order to know how to convert, you need to know the attributes of the 
> file.
> 
> So it's not about conversion: we would ALWAYS do conversion. It's about 
> the fact that in order to do the conversion, we need to know what the 
> attributes of the file are - is it text, or what.
> 
> And the equal point is that there are _other_ attributes that git might 
> care about. The "merge strategy" attribute, for example. Or "owner" 
> attributes for files etc.

Yes, you're right. Colour me converted (pun intended).

Ciao,
Dscho

* Re: mingw, windows, crlf/lf, and git
  2007-02-13 18:00         ` Junio C Hamano
@ 2007-02-13 19:07           ` Linus Torvalds
  2007-02-13 20:42             ` Sam Ravnborg
                               ` (3 more replies)
  0 siblings, 4 replies; 83+ messages in thread
From: Linus Torvalds @ 2007-02-13 19:07 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Johannes Schindelin, Alexander Litvinov, Mark Levedahl, Git Mailing List



On Tue, 13 Feb 2007, Junio C Hamano wrote:
> 
> Thanks, applied.  I think git-apply has separate codepaths for
> both reading and writing; I won't look into them before 1.5.0
> but people are welcome to help advancing the cause before I get
> to it ;-).

Actually, I did it myself.

This is a "lazy man's auto-CRLF", and it really is pretty simple.

It currently does NOT know about file attributes, so it does its 
conversion purely based on content. Maybe that is more in the "git 
philosophy" anyway, since content is king, but I think we should try to do 
the file attributes to turn it off on demand.

Anyway, BY DEFAULT it is off regardless, because it requires a

	[core]
		AutoCRLF = true

in your config file to be enabled. We could make that the default for 
Windows, of course, the same way we do some other things (filemode etc).

But you can actually enable it on UNIX, and it will cause:

 - "git update-index" will write blobs without CRLF
 - "git diff" will diff working tree files without CRLF
 - "git checkout" will write files to the working tree _with_ CRLF

and things work fine.

Funnily, it actually shows an odd file in git itself:

	git clone -n git test-crlf
	cd test-crlf
	git config core.autocrlf true
	git checkout
	git diff

shows a diff for "Documentation/docbook-xsl.css". Why? Because we have 
actually checked in that file *with* CRLF! So when "core.autocrlf" is 
true, we'll always generate a *different* hash for it in the index, 
because the index hash will be for the content _without_ CRLF.

Is this complete? I dunno. It seems to work for me. It doesn't use the 
filename at all right now, and that's probably a deficiency (we could 
certainly make the "is_binary()" heuristics also take standard filename 
heuristics into account).

I don't pass in the filename at all for the "index_fd()" case 
(git-update-index), so that would need to be passed around, but this 
actually works fine.

NOTE NOTE NOTE! The "is_binary()" heuristics are totally made-up by yours 
truly. I will not guarantee that they work at all reasonably. Caveat 
emptor. But it _is_ simple, and it _is_ safe, since it's all off by 
default.

The patch is pretty simple - the biggest part is the new "convert.c" file, 
but even that is really just basic stuff that anybody can write in 
"Teaching C 101" as a final project for their first class in programming. 
Not to say that it's bug-free, of course - but at least we're not talking 
about rocket surgery here.
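
For readers who want to see the two directions side by side, here is a minimal
standalone sketch. It is not the code from the patch below: crlf_to_lf() and
lf_to_crlf() are hypothetical names, there is no binary detection, and unlike
the patch it simply leaves bare CR characters alone instead of refusing to
convert the whole file.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Check-in direction: drop the CR of every CRLF pair.
 * Returns a malloc'd buffer (caller frees) and stores its length in
 * *out_len; returns NULL if allocation fails.
 */
char *crlf_to_lf(const char *in, size_t len, size_t *out_len)
{
	char *out = malloc(len ? len : 1);
	size_t i, n = 0;

	if (!out)
		return NULL;
	for (i = 0; i < len; i++) {
		/* Skip a CR only when an LF follows; bare CRs are kept. */
		if (in[i] == '\r' && i + 1 < len && in[i + 1] == '\n')
			continue;
		out[n++] = in[i];
	}
	*out_len = n;
	return out;
}

/*
 * Checkout direction: put a CR in front of every LF that does not
 * already have one. Same ownership rules as crlf_to_lf().
 */
char *lf_to_crlf(const char *in, size_t len, size_t *out_len)
{
	size_t i, n = 0, extra = 0;
	char *out;

	/* First pass: count the LFs that need a CR added. */
	for (i = 0; i < len; i++)
		if (in[i] == '\n' && (i == 0 || in[i - 1] != '\r'))
			extra++;

	out = malloc((len + extra) ? (len + extra) : 1);
	if (!out)
		return NULL;

	/* Second pass: copy, inserting the missing CRs. */
	for (i = 0; i < len; i++) {
		if (in[i] == '\n' && (i == 0 || in[i - 1] != '\r'))
			out[n++] = '\r';
		out[n++] = in[i];
	}
	*out_len = n;
	return out;
}
```

Feeding "a\r\nb\r\n" through crlf_to_lf() and the result through lf_to_crlf()
gives back the original six bytes, which is the round trip that update-index
followed by checkout is meant to perform for text files.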

		Linus

---
commit f0731319497ac8121bd901a91fc33d715745d3af
Author: Linus Torvalds <torvalds@osdl.org>
Date:   Tue Feb 13 10:56:50 2007 -0800

    Add "auto-CRLF" conversion logic
    
    It's simple and it's stupid.  But it actually seems to work.  What more
    can you want?
    
    It's not enabled by default: you need to add a
    
    	[core]
    		AutoCRLF = true
    
    to your .git/config file to enable it universally.
    
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 Makefile      |    3 +-
 cache.h       |    5 ++
 config.c      |    5 ++
 convert.c     |  179 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 diff.c        |   16 +++++
 entry.c       |   15 +++++
 environment.c |    1 +
 sha1_file.c   |   22 +++++++-
 8 files changed, 244 insertions(+), 2 deletions(-)
 create mode 100644 convert.c

diff --git a/Makefile b/Makefile
index 40bdcff..60496ff 100644
--- a/Makefile
+++ b/Makefile
@@ -262,7 +262,8 @@ LIB_OBJS = \
 	revision.o pager.o tree-walk.o xdiff-interface.o \
 	write_or_die.o trace.o list-objects.o grep.o \
 	alloc.o merge-file.o path-list.o help.o unpack-trees.o $(DIFF_OBJS) \
-	color.o wt-status.o archive-zip.o archive-tar.o shallow.o utf8.o
+	color.o wt-status.o archive-zip.o archive-tar.o shallow.o utf8.o \
+	convert.o
 
 BUILTIN_OBJS = \
 	builtin-add.o \
diff --git a/cache.h b/cache.h
index c62b0b0..9c019e8 100644
--- a/cache.h
+++ b/cache.h
@@ -201,6 +201,7 @@ extern const char *apply_default_whitespace;
 extern int zlib_compression_level;
 extern size_t packed_git_window_size;
 extern size_t packed_git_limit;
+extern int auto_crlf;
 
 #define GIT_REPO_VERSION 0
 extern int repository_format_version;
@@ -468,4 +469,8 @@ extern int nfvasprintf(char **str, const char *fmt, va_list va);
 extern void trace_printf(const char *format, ...);
 extern void trace_argv_printf(const char **argv, int count, const char *format, ...);
 
+/* convert.c */
+extern int convert_to_git(const char *path, char **bufp, unsigned long *sizep);
+extern int convert_to_working_tree(const char *path, char **bufp, unsigned long *sizep);
+
 #endif /* CACHE_H */
diff --git a/config.c b/config.c
index d821071..ffe0212 100644
--- a/config.c
+++ b/config.c
@@ -324,6 +324,11 @@ int git_default_config(const char *var, const char *value)
 		return 0;
 	}
 
+	if (!strcmp(var, "core.autocrlf")) {
+		auto_crlf = git_config_bool(var, value);
+		return 0;
+	}
+
 	if (!strcmp(var, "user.name")) {
 		strlcpy(git_default_name, value, sizeof(git_default_name));
 		return 0;
diff --git a/convert.c b/convert.c
new file mode 100644
index 0000000..c04b6c2
--- /dev/null
+++ b/convert.c
@@ -0,0 +1,179 @@
+#include "cache.h"
+/*
+ * convert.c - convert a file when checking it out and checking it in.
+ *
+ * This should use the pathname to decide on whether it wants to do some
+ * more interesting conversions (automatic gzip/unzip, general format
+ * conversions etc etc), but by default it just does automatic CRLF<->LF
+ * translation when the "auto_crlf" option is set.
+ */
+
+struct text_stat {
+	/* CR, LF and CRLF counts */
+	unsigned cr, lf, crlf;
+
+	/* These are just approximations! */
+	unsigned printable, nonprintable;
+};
+
+static void gather_stats(const char *buf, unsigned long size, struct text_stat *stats)
+{
+	unsigned long i;
+
+	memset(stats, 0, sizeof(*stats));
+
+	for (i = 0; i < size; i++) {
+		unsigned char c = buf[i];
+		if (c == '\r') {
+			stats->cr++;
+			if (i+1 < size && buf[i+1] == '\n')
+				stats->crlf++;
+			continue;
+		}
+		if (c == '\n') {
+			stats->lf++;
+			continue;
+		}
+		if (c == '\t' || (c >= 32 && c < 127)) {
+			stats->printable++;
+			continue;
+		}
+		stats->nonprintable++;
+	}
+}
+
+/*
+ * This is just a heuristic!
+ *
+ * We do allow nonprintable characters (utf-8 and latin1 etc), but we
+ * require that they are just a fairly small percentage of the total
+ * file. 
+ */
+static int is_binary(unsigned long size, struct text_stat *stats)
+{
+	if (stats->nonprintable > (size >> 3))
+		return 1;
+	/*
+	 * Other heuristics? Average line length might be relevant,
+	 * as might LF vs CR vs CRLF counts..
+	 *
+	 * NOTE! It might be normal to have a low ratio of CRLF to LF
+	 * (somebody starts with a LF-only file and edits it with an editor
+	 * that adds CRLF only to lines that are added..). But do  we
+	 * want to support CR-only? Probably not.
+	 */
+	return 0;
+}
+
+int convert_to_git(const char *path, char **bufp, unsigned long *sizep)
+{
+	char *buffer, *nbuf;
+	unsigned long size, nsize;
+	struct text_stat stats;
+
+	/*
+	 * FIXME! Other pluggable conversions should go here,
+	 * based on filename patterns. Right now we just do the
+	 * stupid auto-CRLF one.
+	 */
+	if (!auto_crlf)
+		return 0;
+
+	size = *sizep;
+	if (!size)
+		return 0;
+	buffer = *bufp;
+
+	gather_stats(buffer, size, &stats);
+
+	/* No CR? Nothing to convert, regardless. */
+	if (!stats.cr)
+		return 0;
+
+	/*
+	 * We're currently not going to even try to convert stuff
+	 * that has bare CR characters. Does anybody do that crazy
+	 * stuff?
+	 */
+	if (stats.cr != stats.crlf)
+		return 0;
+
+	/*
+	 * And add some heuristics for binary vs text, of course.. 
+	 */
+	if (is_binary(size, &stats))
+		return 0;
+
+	/*
+	 * Ok, allocate a new buffer, fill it in, and return true
+	 * to let the caller know that we switched buffers on it.
+	 */
+	nsize = size - stats.crlf;
+	nbuf = xmalloc(nsize);
+	*bufp = nbuf;
+	*sizep = nsize;
+	do {
+		unsigned char c = *buffer++;
+		if (c != '\r')
+			*nbuf++ = c;
+	} while (--size);
+
+	return 1;
+}
+
+int convert_to_working_tree(const char *path, char **bufp, unsigned long *sizep)
+{
+	char *buffer, *nbuf;
+	unsigned long size, nsize;
+	struct text_stat stats;
+	unsigned char last;
+
+	/*
+	 * FIXME! Other pluggable conversions should go here,
+	 * based on filename patterns. Right now we just do the
+	 * stupid auto-CRLF one.
+	 */
+	if (!auto_crlf)
+		return 0;
+
+	size = *sizep;
+	if (!size)
+		return 0;
+	buffer = *bufp;
+
+	gather_stats(buffer, size, &stats);
+
+	/* No LF? Nothing to convert, regardless. */
+	if (!stats.lf)
+		return 0;
+
+	/* Was it already in CRLF format? */
+	if (stats.lf == stats.crlf)
+		return 0;
+
+	/* If we have any bare CR characters, we're not going to touch it */
+	if (stats.cr != stats.crlf)
+		return 0;
+
+	if (is_binary(size, &stats))
+		return 0;
+
+	/*
+	 * Ok, allocate a new buffer, fill it in, and return true
+	 * to let the caller know that we switched buffers on it.
+	 */
+	nsize = size + stats.lf - stats.crlf;
+	nbuf = xmalloc(nsize);
+	*bufp = nbuf;
+	*sizep = nsize;
+	last = 0;
+	do {
+		unsigned char c = *buffer++;
+		if (c == '\n' && last != '\r')
+			*nbuf++ = '\r';
+		*nbuf++ = c;
+		last = c;
+	} while (--size);
+
+	return 1;
+}
diff --git a/diff.c b/diff.c
index aaab309..561587c 100644
--- a/diff.c
+++ b/diff.c
@@ -1332,6 +1332,9 @@ int diff_populate_filespec(struct diff_filespec *s, int size_only)
 	    reuse_worktree_file(s->path, s->sha1, 0)) {
 		struct stat st;
 		int fd;
+		char *buf;
+		unsigned long size;
+
 		if (lstat(s->path, &st) < 0) {
 			if (errno == ENOENT) {
 			err_empty:
@@ -1364,6 +1367,19 @@ int diff_populate_filespec(struct diff_filespec *s, int size_only)
 		s->data = xmmap(NULL, s->size, PROT_READ, MAP_PRIVATE, fd, 0);
 		close(fd);
 		s->should_munmap = 1;
+
+		/*
+		 * Convert from working tree format to canonical git format
+		 */
+		buf = s->data;
+		size = s->size;
+		if (convert_to_git(s->path, &buf, &size)) {
+			munmap(s->data, s->size);
+			s->should_munmap = 0;
+			s->data = buf;
+			s->size = size;
+			s->should_free = 1;
+		}
 	}
 	else {
 		char type[20];
diff --git a/entry.c b/entry.c
index 0ebf0f0..472a9ef 100644
--- a/entry.c
+++ b/entry.c
@@ -78,6 +78,9 @@ static int write_entry(struct cache_entry *ce, char *path, struct checkout *stat
 			path, sha1_to_hex(ce->sha1));
 	}
 	switch (ntohl(ce->ce_mode) & S_IFMT) {
+		char *buf;
+		unsigned long nsize;
+
 	case S_IFREG:
 		if (to_tempfile) {
 			strcpy(path, ".merge_file_XXXXXX");
@@ -89,6 +92,18 @@ static int write_entry(struct cache_entry *ce, char *path, struct checkout *stat
 			return error("git-checkout-index: unable to create file %s (%s)",
 				path, strerror(errno));
 		}
+
+		/*
+		 * Convert from git internal format to working tree format
+		 */
+		buf = new;
+		nsize = size;
+		if (convert_to_working_tree(ce->name, &buf, &nsize)) {
+			free(new);
+			new = buf;
+			size = nsize;
+		}
+
 		wrote = write_in_full(fd, new, size);
 		close(fd);
 		free(new);
diff --git a/environment.c b/environment.c
index 54c22f8..2fa0960 100644
--- a/environment.c
+++ b/environment.c
@@ -28,6 +28,7 @@ size_t packed_git_window_size = DEFAULT_PACKED_GIT_WINDOW_SIZE;
 size_t packed_git_limit = DEFAULT_PACKED_GIT_LIMIT;
 int pager_in_use;
 int pager_use_color = 1;
+int auto_crlf = 0;
 
 static const char *git_dir;
 static char *git_object_dir, *git_index_file, *git_refs_dir, *git_graft_file;
diff --git a/sha1_file.c b/sha1_file.c
index 0d4bf80..6ec67b2 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2082,7 +2082,7 @@ int index_fd(unsigned char *sha1, int fd, struct stat *st, int write_object, con
 {
 	unsigned long size = st->st_size;
 	void *buf;
-	int ret;
+	int ret, re_allocated = 0;
 
 	buf = "";
 	if (size)
@@ -2091,10 +2091,30 @@ int index_fd(unsigned char *sha1, int fd, struct stat *st, int write_object, con
 
 	if (!type)
 		type = blob_type;
+
+	/*
+	 * Convert blobs to git internal format
+	 */
+	if (!strcmp(type, blob_type)) {
+		unsigned long nsize = size;
+		char *nbuf = buf;
+		if (convert_to_git(NULL, &nbuf, &nsize)) {
+			if (size)
+				munmap(buf, size);
+			size = nsize;
+			buf = nbuf;
+			re_allocated = 1;
+		}
+	}
+
 	if (write_object)
 		ret = write_sha1_file(buf, size, type, sha1);
 	else
 		ret = hash_sha1_file(buf, size, type, sha1);
+	if (re_allocated) {
+		free(buf);
+		return ret;
+	}
 	if (size)
 		munmap(buf, size);
 	return ret;
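
For readers who want to try the conversion rules outside git, the two directions handled by convert.c above can be sketched as a standalone pair of functions. This is only an illustration under stated assumptions: the names crlf_to_lf() and lf_to_crlf() are hypothetical and not part of the patch, plain malloc() stands in for xmalloc(), and the binary-vs-text heuristic is left out.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not from the patch): drop the CR of every CRLF pair.
 * Returns a malloc'ed buffer and stores its length in *out_len.
 * Like the patch, it gives up (returns NULL) if a bare CR is present.
 */
char *crlf_to_lf(const char *buf, size_t len, size_t *out_len)
{
	size_t i, n = 0;
	char *out;

	assert(buf != NULL && out_len != NULL);
	for (i = 0; i < len; i++)
		if (buf[i] == '\r' && !(i + 1 < len && buf[i + 1] == '\n'))
			return NULL;	/* bare CR: leave the content alone */

	out = malloc(len ? len : 1);
	if (!out)
		return NULL;
	for (i = 0; i < len; i++)
		if (buf[i] != '\r')
			out[n++] = buf[i];
	*out_len = n;
	return out;
}

/*
 * Hypothetical helper (not from the patch): put a CR in front of every LF
 * that does not already have one. Worst case every byte is an LF, so the
 * output can be at most twice the input size.
 */
char *lf_to_crlf(const char *buf, size_t len, size_t *out_len)
{
	size_t i, n = 0;
	char last = 0;
	char *out;

	assert(buf != NULL && out_len != NULL);
	out = malloc(2 * len + 1);
	if (!out)
		return NULL;
	for (i = 0; i < len; i++) {
		if (buf[i] == '\n' && last != '\r')
			out[n++] = '\r';
		out[n++] = buf[i];
		last = buf[i];
	}
	*out_len = n;
	return out;
}
```

The sketch over-allocates in lf_to_crlf() instead of counting line endings first, which keeps it short; the patch computes the exact size from the gathered statistics.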


* Re: mingw, windows, crlf/lf, and git
  2007-02-13 12:16     ` Alexander Litvinov
  2007-02-13 12:37       ` Johannes Schindelin
@ 2007-02-13 19:36       ` Mark Levedahl
  2007-02-13 20:32         ` Linus Torvalds
  2007-02-13 21:58         ` Robin Rosenberg
  1 sibling, 2 replies; 83+ messages in thread
From: Mark Levedahl @ 2007-02-13 19:36 UTC (permalink / raw)
  To: git

Alexander Litvinov wrote:

> In a message of Tuesday 13 February 2007 16:06, Johannes Schindelin wrote:
>> Hi,
>>
>> On Tue, 13 Feb 2007, Alexander Litvinov wrote:
>> > When I have a file that was converted from DOS to Unix format (or from
>> > Unix to DOS), git generates a big diff. But anyway, the C++ compiler works
>> > well with both formats, and in this case I simply convert the file to DOS
>> > format and git shows a nice diff again. If the Unix format was committed
>> > to git, I simply change the format and commit that file again.
>>
>> That's awful!
> If you are trying to build a history that looks good, you are right: this
> is a terrible workflow.
> 
>> > The only trouble is rebase: it does not like \r\n endings and often
>> > produces unexpected merge conflicts. But I don't use rebase often enough
>> > to really investigate and try to solve the problem.
>>
>> Well, if everybody thinks like you, maybe we do not have to change
>> anything for Windows after all?
> I still wish to have a working rebase, so it would be nice if git handled
> \r\n somehow. But please do not reproduce the behavior of cvs: under
> cygwin it still uses \n!

Cygwin != Windows, Cygwin is a POSIX emulation layer with the explicit goal
of providing user tools behaving exactly as they do under Linux, and this
includes line ending style.

So, the Cygwin ports of various Linux tools are not expected to satisfy
users who want native Win32 behavior. This is where the mingw port of git
fits in. Yes, under Cygwin git can track files with \r\n endings, but: 
1) Those projects are not portable to non-windows platforms, and 
2) As you noted, git will have trouble with rebase, merge, etc. as there is
an assumption of \n endings throughout.

A proper win32 port will accept any of \n, \r\n as valid line endings (add
\r to support Mac pre-OSX if anyone cares, I still occasionally see such
files), treat any of them as semantically equal, and enforce the user's
chosen style (\n or \r\n) on output. cvsnt and svn under Windows do this
today, serving up "text" files from the same repository with \n endings or
\r\n endings depending upon the client, and this is what we need a win32 git
to do as well.
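
To make "semantically equal" concrete, here is a minimal sketch of a comparison that treats \r\n and \n as the same line ending. The function names are hypothetical and chosen for illustration only; nothing like this exists in git under these names, and bare \r (old Mac style) is deliberately left as an ordinary character.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the next "logical" byte of buf, or -1 at the end.
 * A CR that is immediately followed by LF is skipped, so CRLF reads as LF.
 */
static int next_logical_char(const char *buf, size_t len, size_t *pos)
{
	if (*pos >= len)
		return -1;
	if (buf[*pos] == '\r' && *pos + 1 < len && buf[*pos + 1] == '\n')
		(*pos)++;	/* skip the CR of a CRLF pair */
	return (unsigned char)buf[(*pos)++];
}

/* Hypothetical helper: 1 if a and b differ only in CRLF vs LF, else 0. */
int same_text_ignoring_eol(const char *a, size_t alen,
			   const char *b, size_t blen)
{
	size_t i = 0, j = 0;

	assert(a != NULL && b != NULL);
	for (;;) {
		int ca = next_logical_char(a, alen, &i);
		int cb = next_logical_char(b, blen, &j);
		if (ca != cb)
			return 0;
		if (ca < 0)
			return 1;	/* both buffers ended together */
	}
}
```

With this, "a\r\nb\n" and "a\nb\r\n" compare equal, while "a\rb" and "ab" do not.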

Mark


* Re: mingw, windows, crlf/lf, and git
  2007-02-13 19:36       ` Mark Levedahl
@ 2007-02-13 20:32         ` Linus Torvalds
  2007-02-14  1:42           ` Mark Levedahl
  2007-02-13 21:58         ` Robin Rosenberg
  1 sibling, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2007-02-13 20:32 UTC (permalink / raw)
  To: Mark Levedahl; +Cc: git



On Tue, 13 Feb 2007, Mark Levedahl wrote:
> 
> A proper win32 port will accept any of \n, \r\n as valid line endings (add
> \r to support Mac pre-OSX if anyone cares, I still occasionally see such
> files), treat any of them as semantically equal, and enforce the user's
> chosen style (\n or \r\n) on output.

The patch I sent out does that, except right now the "autocrlf" flag is 
just a pure boolean.

I could easily make it take a ternary value:
 - off (normal UNIX semantics - never change anything)
 - on (turn CRLF->LF on input, turn LF->CRLF on output)
 - input-only (turn CRLF->LF on input, leave LF alone on output)

that would be just a couple of extra lines (almost all of them in the 
config file parsing logic).

[ The "output-only" case is obviously possible, but insane. It would turn 
  a LF-only file into CRLF on output, and then not turn it back on input, 
  so doing any "git commit -a" would basically turn every single line 
  into CRLF, which you do NOT want. 

  So hopefully that explains the three - not four - cases ]

And the patch already leaves files that the user doesn't touch alone (ie 
if you check something out with CRLF turned off, and then turn it on in 
the config, nobody will care - the checked-out copy will have LF-only even 
if explicitly re-checking it out would turn it into CRLF, but that's fine).

It would be interesting to hear if the patch works for the MinGW people in 
particular. People using git with a Cygnus environment are probably used 
to trying to keep files with just LF, since they are really trying to do a 
UNIX environment on top of Windows. But I suspect that MinGW people are 
more likely to use native Windows tools for things, and then perhaps just 
a smattering of UNIXy tools..

			Linus


* Re: mingw, windows, crlf/lf, and git
  2007-02-13 19:07           ` Linus Torvalds
@ 2007-02-13 20:42             ` Sam Ravnborg
  2007-02-13 21:08               ` Nicolas Pitre
                                 ` (3 more replies)
  2007-02-14  5:16             ` Junio C Hamano
                               ` (2 subsequent siblings)
  3 siblings, 4 replies; 83+ messages in thread
From: Sam Ravnborg @ 2007-02-13 20:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Junio C Hamano, Johannes Schindelin, Alexander Litvinov,
	Mark Levedahl, Git Mailing List

> Anyway, BY DEFAULT it is off regardless, because it requires a
> 
> 	[core]
> 		AutoCRLF = true
> 
> in your config file to be enabled. We could make that the default for 
> Windows, of course, the same way we do some other things (filemode etc).

This whole auto CRLF thing seems to deal with DOS issues that I personally
have not encountered since a looong time ago.
Granted, notepad in Windows does not understand UNIX files, but that's a bug
in notepad and everyone knows that wordpad can be used.

I wonder what we are really trying to address here. Or in other words,
could the original poster maybe tell which Windows IDEs do
not handle UNIX files properly?

core git today should not care about CRLF as opposed to LF end-of-line
as long as the end-of-line is consistent - correct?

So defaulting to autoCRLF in Windows/DOS environments was maybe
sane 10 years ago but today that seems to be the wrong thing to do.
For certain projects the option could be useful if the tool-set in
the project *requires* CRLF, but if the toolset, like all modern toolsets,
supports both CRLF and LF, then git had better avoid changing the
end-of-line marker.

	Sam


* Re: mingw, windows, crlf/lf, and git
  2007-02-13 20:42             ` Sam Ravnborg
@ 2007-02-13 21:08               ` Nicolas Pitre
  2007-02-13 23:19               ` David Lang
                                 ` (2 subsequent siblings)
  3 siblings, 0 replies; 83+ messages in thread
From: Nicolas Pitre @ 2007-02-13 21:08 UTC (permalink / raw)
  To: Sam Ravnborg
  Cc: Linus Torvalds, Junio C Hamano, Johannes Schindelin,
	Alexander Litvinov, Mark Levedahl, Git Mailing List

On Tue, 13 Feb 2007, Sam Ravnborg wrote:

> This whole auto CRLF thing seems to deal with DOS issues that I personally
> have not encountered since a looong time ago.

Maybe you haven't shared a work environment with Windows users since a 
looong time ago.

> Granted, notepad in Windows does not understand UNIX files, but that's a bug
> in notepad and everyone knows that wordpad can be used.
> 
> I wonder what we are really trying to address here. Or in other words,
> could the original poster maybe tell which Windows IDEs do
> not handle UNIX files properly?

Windows IDEs can _create_ files.  Those files will be CRLF infected.

Also, some of them read UNIX files just fine, but they will use CRLF to 
end newly added lines despite the rest of the file using only LF.

> core git today should not care about CRLF as opposed to LF end-of-line
> as long as the end-of-line is consistent - correct?

Consistency won't come on its own if it is not enforced in some way.

> So defaulting to autoCRLF in Windows/DOS environments was maybe
> sane 10 years ago but today that seems to be the wrong thing to do.
> For certain projects the option could be useful if the tool-set in
> the project *requires* CRLF, but if the toolset, like all modern toolsets,
> supports both CRLF and LF, then git had better avoid changing the
> end-of-line marker.

Rather, git had better enforce consistency; otherwise there will be a mix 
of every possible combination as soon as Windows and UNIX users work on 
the same project.


Nicolas


* Re: mingw, windows, crlf/lf, and git
  2007-02-13 19:36       ` Mark Levedahl
  2007-02-13 20:32         ` Linus Torvalds
@ 2007-02-13 21:58         ` Robin Rosenberg
  2007-02-14  1:18           ` Mark Levedahl
  1 sibling, 1 reply; 83+ messages in thread
From: Robin Rosenberg @ 2007-02-13 21:58 UTC (permalink / raw)
  To: Mark Levedahl; +Cc: git

On Tuesday 13 February 2007 20:36, Mark Levedahl wrote:
> Alexander Litvinov wrote:
> 
> > In a message of Tuesday 13 February 2007 16:06, Johannes Schindelin wrote:
> >> Hi,
> >>
> >> On Tue, 13 Feb 2007, Alexander Litvinov wrote:
> >> > When I have a file that was converted from DOS to Unix format (or from
> >> > Unix to DOS), git generates a big diff. But anyway, the C++ compiler works
> >> > well with both formats, and in this case I simply convert the file to DOS
> >> > format and git shows a nice diff again. If the Unix format was committed
> >> > to git, I simply change the format and commit that file again.
> >>
> >> That's awful!
> > If you are trying to build a history that looks good, you are right: this
> > is a terrible workflow.
> > 
> >> > The only trouble is rebase: it does not like \r\n endings and often
> >> > produces unexpected merge conflicts. But I don't use rebase often enough
> >> > to really investigate and try to solve the problem.
> >>
> >> Well, if everybody thinks like you, maybe we do not have to change
> >> anything for Windows after all?
> > I still wish to have a working rebase, so it would be nice if git handled
> > \r\n somehow. But please do not reproduce the behavior of cvs: under
> > cygwin it still uses \n!
> 
> Cygwin != Windows, Cygwin is a POSIX emulation layer with the explicit goal
> of providing user tools behaving exactly as they do under Linux, and this
> includes line ending style.

Line ending style is selectable in cygwin, both on a global level and path level (cygwin 
mounts). If you use CVS for windows development using CRLF works well and
is the only option if you want to use the same working area with both native CVS clients
like TortoiseCVS and the cygwin client. I use the CRLF style by default and LF only
for selected directories. The only annoying thing I see is that files transformed by patch end 
up with LF-only line endings.

> So, the Cygwin ports of various Linux tools are not expected to satisfy
> users who want native Win32 behavior. This is where the mingw port of git
> fits in. Yes, under Cygwin git can track files with \r\n endings, but: 
> 1) Those projects are not portable to non-windows platforms, and 
> 2) As you noted, git will have trouble with rebase, merge, etc. as there is
> an assumption of \n endings throughout.

Even if there is a native port, I'm inclined to want to use the cygwin version 
anyway because of the nice shell and scripting capabilities and large selection of packages
that match what I'm used to in Linux. Git under cygwin should do CRLF transformations 
according to the same rules that apply to text files in cygwin.

-- robin


* Re: mingw, windows, crlf/lf, and git
  2007-02-13 20:42             ` Sam Ravnborg
  2007-02-13 21:08               ` Nicolas Pitre
@ 2007-02-13 23:19               ` David Lang
  2007-02-13 23:28               ` Linus Torvalds
  2007-02-14  3:47               ` Alexander Litvinov
  3 siblings, 0 replies; 83+ messages in thread
From: David Lang @ 2007-02-13 23:19 UTC (permalink / raw)
  To: Sam Ravnborg
  Cc: Linus Torvalds, Junio C Hamano, Johannes Schindelin,
	Alexander Litvinov, Mark Levedahl, Git Mailing List

On Tue, 13 Feb 2007, Sam Ravnborg wrote:

>
> I wonder what we are really trying to address here. Or in other words,
> could the original poster maybe tell which Windows IDEs do
> not handle UNIX files properly?
>
> core git today should not care about CRLF as opposed to LF end-of-line
> as long as the end-of-line is consistent - correct?
>
> So defaulting to autoCRLF in Windows/DOS environments was maybe
> sane 10 years ago but today that seems to be the wrong thing to do.
> For certain projects the option could be useful if the tool-set in
> the project *requires* CRLF, but if the toolset, like all modern toolsets,
> supports both CRLF and LF, then git had better avoid changing the
> end-of-line marker.

I've actually run into grief on this subject with perl scripts within the last 
year (files from windows systems with crlf not working cleanly on a linux system 
with just lf)

this is real, not just historic

David Lang


* Re: mingw, windows, crlf/lf, and git
  2007-02-13 20:42             ` Sam Ravnborg
  2007-02-13 21:08               ` Nicolas Pitre
  2007-02-13 23:19               ` David Lang
@ 2007-02-13 23:28               ` Linus Torvalds
  2007-02-14  8:41                 ` Sam Ravnborg
  2007-02-14  3:47               ` Alexander Litvinov
  3 siblings, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2007-02-13 23:28 UTC (permalink / raw)
  To: Sam Ravnborg
  Cc: Junio C Hamano, Johannes Schindelin, Alexander Litvinov,
	Mark Levedahl, Git Mailing List



On Tue, 13 Feb 2007, Sam Ravnborg wrote:
> 
> This whole auto CRLF thing seems to deal with DOS issues that I personally
> have not encountered since a looong time ago.

Maybe you stopped using DOS a loong time ago ;)

It's definitely an issue. Yes, all windows programs basically *understand* 
files that have just LF. But almost all of them will *write* files with 
CRLF.

(Which means that I suspect I made the default for "auto_crlf" be wrong in 
my patch: I probably should not default to checking out with CRLF, but 
checking out with just LF, and only do the CRLF->LF conversion on input).

Anybody who has ever worked with _any_ Windows people has long since 
learnt that they always end up having to convert CRLF to just LF when they
get files. Even _I_ know it, and I seldom have to work with people who use 
Windows ;)

So it's a good idea to try to make sure that Windows users don't corrupt 
files by adding CRLF where there is no need for them into a git archive. 
We hope to convert those people to a real OS some day ("here's a nickel, 
boy"), and to make it easier for them to do it, making sure that their 
projects in -git are already in a sane format is probably a good idea.

			Linus


* Re: mingw, windows, crlf/lf, and git
  2007-02-13 21:58         ` Robin Rosenberg
@ 2007-02-14  1:18           ` Mark Levedahl
  0 siblings, 0 replies; 83+ messages in thread
From: Mark Levedahl @ 2007-02-14  1:18 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: git

Robin Rosenberg wrote:
> 
> Even if there is a native port, I'm inclined to want to use the cygwin version 
> anyway because of the nice shell and scripting capabilities and large selection of packages
> that match what I'm used to in Linux. Git under cygwin should do CRLF transformations 
> according to the same rules that apply to text files in cygwin.
> 
> -- robin

The cygwin project is explicitly trying to bury the "text" mount option 
and drive towards binary (= \n line endings) only. They once had a rule 
that all cygwin programs fully grok \r\n, but that ethic disappeared a 
couple of years ago, it was just too hard. The cygwin git port itself 
will not operate on a text mount, it requires a binary mount, so crlf 
translations are simply not available with git under cygwin.

Mark


* Re: mingw, windows, crlf/lf, and git
  2007-02-13 20:32         ` Linus Torvalds
@ 2007-02-14  1:42           ` Mark Levedahl
  2007-02-14  2:16             ` Linus Torvalds
  0 siblings, 1 reply; 83+ messages in thread
From: Mark Levedahl @ 2007-02-14  1:42 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

Linus Torvalds wrote:
> 
> On Tue, 13 Feb 2007, Mark Levedahl wrote:
>> A proper win32 port will accept any of \n, \r\n as valid line endings (add
>> \r to support Mac pre-OSX if anyone cares, I still occasionally see such
>> files), treat any of them as semantically equal, and enforce the user's
>> chosen style (\n or \r\n) on output.
> 
> The patch I sent out does that, except right now the "autocrlf" flag is 
> just a pure boolean.
> 
> I could easily make it take a ternary value:
>  - off (normal UNIX semantics - never change anything)
>  - on (turn CRLF->LF on input, turn LF->CRLF on output)
>  - input-only (turn CRLF->LF on input, leave LF alone on output)
> 
> 
> 			Linus

Wow, this is an incredible response: I expected I was going to be 
studying git internals for a while to get to this point. Thank you!

The ternary value is definitely useful. As noted elsewhere, most tools 
on windows are very happy with \n ending, few honor those line endings 
when files are modified, and fewer still allow the user to specify use 
of \n for new files. However, cygwin tools in particular are not 
tolerant of crlf, so for that environment it makes sense to banish crlf 
and the input-only option is most likely the best default setting there.

Mark


* Re: mingw, windows, crlf/lf, and git
  2007-02-14  1:42           ` Mark Levedahl
@ 2007-02-14  2:16             ` Linus Torvalds
  0 siblings, 0 replies; 83+ messages in thread
From: Linus Torvalds @ 2007-02-14  2:16 UTC (permalink / raw)
  To: Mark Levedahl; +Cc: git



On Tue, 13 Feb 2007, Mark Levedahl wrote:
> 
> The ternary value is definitely useful. As noted elsewhere, most tools on
> windows are very happy with \n ending, few honor those line endings when files
> are modified, and fewer still allow the user to specify use of \n for new
> files. However, cygwin tools in particular are not tolerant of crlf, so for
> that environment it makes sense to banish crlf and the input-only option is
> most likely the best default setting there.

Here's an UNTESTED patch on top of the patch I already sent, which allows 
you to do

	[core]
		AutoCRLF = input

and it should do only the CRLF->LF translation (ie it simplifies CRLF only 
when reading working tree files, but when checking out files, it leaves 
the LF alone, and doesn't turn it into a CRLF).

And by "untested" I mean that it looks ok and seems to compile, but I 
really didn't do anything else.

		Linus
---
diff --git a/config.c b/config.c
index ffe0212..e8ae919 100644
--- a/config.c
+++ b/config.c
@@ -325,6 +325,10 @@ int git_default_config(const char *var, const char *value)
 	}
 
 	if (!strcmp(var, "core.autocrlf")) {
+		if (value && !strcasecmp(value, "input")) {
+			auto_crlf = -1;
+			return 0;
+		}
 		auto_crlf = git_config_bool(var, value);
 		return 0;
 	}
diff --git a/convert.c b/convert.c
index c04b6c2..b5a47c2 100644
--- a/convert.c
+++ b/convert.c
@@ -133,7 +133,7 @@ int convert_to_working_tree(const char *path, char **bufp, unsigned long *sizep)
 	 * based on filename patterns. Right now we just do the
 	 * stupid auto-CRLF one.
 	 */
-	if (!auto_crlf)
+	if (auto_crlf <= 0)
 		return 0;
 
 	size = *sizep;
diff --git a/environment.c b/environment.c
index 2fa0960..570e32a 100644
--- a/environment.c
+++ b/environment.c
@@ -28,7 +28,7 @@ size_t packed_git_window_size = DEFAULT_PACKED_GIT_WINDOW_SIZE;
 size_t packed_git_limit = DEFAULT_PACKED_GIT_LIMIT;
 int pager_in_use;
 int pager_use_color = 1;
-int auto_crlf = 0;
+int auto_crlf = 0;	/* 1: both ways, -1: only when adding git objects */
 
 static const char *git_dir;
 static char *git_object_dir, *git_index_file, *git_refs_dir, *git_graft_file;


* Re: mingw, windows, crlf/lf, and git
  2007-02-13 20:42             ` Sam Ravnborg
                                 ` (2 preceding siblings ...)
  2007-02-13 23:28               ` Linus Torvalds
@ 2007-02-14  3:47               ` Alexander Litvinov
  3 siblings, 0 replies; 83+ messages in thread
From: Alexander Litvinov @ 2007-02-14  3:47 UTC (permalink / raw)
  To: Sam Ravnborg
  Cc: Linus Torvalds, Junio C Hamano, Johannes Schindelin,
	Mark Levedahl, Git Mailing List

In a message of Wednesday 14 February 2007 02:42, Sam Ravnborg wrote:
> I wonder what we are really trying to address here. Or in other words,
> could the original poster maybe tell which Windows IDEs do
> not handle UNIX files properly?
MS VC uses a text file for its project file, but it does not like \n line 
endings, only \r\n.


* Re: mingw, windows, crlf/lf, and git
  2007-02-13 19:07           ` Linus Torvalds
  2007-02-13 20:42             ` Sam Ravnborg
@ 2007-02-14  5:16             ` Junio C Hamano
  2007-02-14  5:36               ` Linus Torvalds
  2007-02-14 11:36             ` Alexander Litvinov
  2007-02-14 16:16             ` Johannes Sixt
  3 siblings, 1 reply; 83+ messages in thread
From: Junio C Hamano @ 2007-02-14  5:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Schindelin, Alexander Litvinov, Mark Levedahl, Git Mailing List

Linus Torvalds <torvalds@linux-foundation.org> writes:

> NOTE NOTE NOTE! The "is_binary()" heuristics are totally made-up by yours 
> truly. I will not guarantee that they work at all reasonable. Caveat 
> emptor. But it _is_ simple, and it _is_ safe, since it's all off by 
> default.

It might be safe for some definition of safe, but it is very
Asian unfriendly.

I'd probably suggest replacing it with what GNU diff uses, which
we stole and implemented in diff.c::mmfile_is_binary().


* Re: mingw, windows, crlf/lf, and git
  2007-02-14  5:16             ` Junio C Hamano
@ 2007-02-14  5:36               ` Linus Torvalds
  2007-02-14 11:10                 ` Johannes Schindelin
  0 siblings, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2007-02-14  5:36 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Johannes Schindelin, Alexander Litvinov, Mark Levedahl, Git Mailing List



On Tue, 13 Feb 2007, Junio C Hamano wrote:
> 
> It might be safe for some definition of safe, but it is very
> Asian unfriendly.
> 
> I'd probably suggest replacing it with what GNU diff uses, which
> we stole and implemented in diff.c::mmfile_is_binary().

Well, the thing is, mmfile_is_binary() doesn't really have a big downside 
if it's wrong one way or the other.

In contrast, LF<->CRLF conversion, if wrong, actually corrupts binary files. 
So I felt it was better to be really safe than sorry. It's *much* better 
to miss some CRLF translation than to do too much of it.

That said, I'm sure it could be improved a lot. In particular, characters 
in the range 0x00 - 0x1f are clearly "more binary" than the 0x7f+ range, 
with the obvious exceptions (tab, cr, lf).

0x00 - which is the only one mmfile_is_binary() uses - is arguably the 
"most binary" one, of course, but it might be interesting to give 
different weights to the whole range.. In particular, especially for small 
files, the fact that there is no 0x00 byte in no way indicates that it's 
not "binary".
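
One way to read the weighting idea above is a score in which a NUL byte counts more than other control characters, and those count more than bytes in the 0x7f+ range. The sketch below is purely illustrative: the function name, the weights (8, 4, 1) and the threshold are invented for the example and have not been tuned or tested against real files.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical weighted heuristic (weights are arbitrary assumptions):
 *   NUL byte                  -> 8 points ("most binary")
 *   other control chars <0x20 -> 4 points, except tab, CR and LF
 *   bytes >= 0x7f             -> 1 point  (latin1, utf-8, ...)
 * The buffer is called binary when the score exceeds its size, so text
 * made up entirely of high-bit bytes still passes as text.
 */
int looks_binary_weighted(const unsigned char *buf, size_t size)
{
	size_t i, score = 0;

	assert(buf != NULL || size == 0);
	for (i = 0; i < size; i++) {
		unsigned char c = buf[i];

		if (c == '\t' || c == '\r' || c == '\n')
			continue;
		if (c == 0)
			score += 8;
		else if (c < 32)
			score += 4;
		else if (c >= 127)
			score += 1;
	}
	return score > size;
}
```

With these particular weights a single NUL in a large file is not enough to call it binary, which may well be the wrong trade-off given that missing a conversion is much cheaper than corrupting a file.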

This whole issue is obviously one reason I'd like to involve the filename 
itself, and make it use a ".gitattributes" file - exactly because that 
allows you to be much more aggressive and more precise.

(0x00 may be one of the more _common_ characters in many binary files, 
which makes it a good character to search for too, so I don't really have 
any hugely strong opinions here. After all, the whole heuristic is off by 
default anyway, so it's "really safe" ;^)

			Linus


* Re: mingw, windows, crlf/lf, and git
  2007-02-13 23:28               ` Linus Torvalds
@ 2007-02-14  8:41                 ` Sam Ravnborg
  2007-02-14 16:28                   ` Linus Torvalds
  0 siblings, 1 reply; 83+ messages in thread
From: Sam Ravnborg @ 2007-02-14  8:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Junio C Hamano, Johannes Schindelin, Alexander Litvinov,
	Mark Levedahl, Git Mailing List

> > This whole auto CRLF thing seems to deal with DOS issues that I personally
> > have not encountered since a looong time ago.
> 
> Maybe you stopped using DOS a loong time ago ;)
Unfortunately not. (Sitting with a Windows 2000 laptop atm but saved by ssh).

> 
> It's definitely an issue. Yes, all windows programs basically *understand* 
> files that have just LF. But almost all of them will *write* files with 
> CRLF.

So the point of git supporting CRLF -> LF is interoperability between
UNIX* programs and Windows programs, which is another domain.

My main objection is to the proposal to make conversion the default when many
users do not need it. For the UNIX* compatibility thing, having the conversion
at the lowest layer makes sense.

> (Which means that I suspect I made the default for "auto_crlf" be wrong in 
> my patch: I probably should not default to checking out with CRLF, but 
> checking out with just LF, and only do the CRLF->LF conversion on input).
Except that it seems a few br0ken programs still do not support LF as the
end-of-line marker, so .gitattributes may have to take special care here.

	Sam


* Re: mingw, windows, crlf/lf, and git
  2007-02-14  5:36               ` Linus Torvalds
@ 2007-02-14 11:10                 ` Johannes Schindelin
  2007-02-14 14:26                   ` Mark Levedahl
  2007-02-14 15:44                   ` Linus Torvalds
  0 siblings, 2 replies; 83+ messages in thread
From: Johannes Schindelin @ 2007-02-14 11:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Junio C Hamano, Alexander Litvinov, Mark Levedahl, Git Mailing List

Hi,

On Tue, 13 Feb 2007, Linus Torvalds wrote:

> 0x00 - which is the only one mmfile_is_binary() uses - is arguably the 
> "most binary" one, of course, but it might be interesting to give 
> different weights to the whole range.. In particular, especially for 
> small files, the fact that there is no 0x00 byte in no way indicates 
> that it's not "binary".

Last time I checked, the text files never had lines longer than 200 
characters (I chose this intentionally large). So, it might be a good 
heuristic to check the maximal line length, and refuse to believe that 
it's text once a certain (configurable) threshold is reached.
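
The line-length heuristic described above could be sketched roughly as
follows. This is only an illustration: the names max_line_length() and
looks_like_text_by_line_length() are hypothetical, not functions in git,
and the threshold is whatever the caller configures (200 is the figure
mentioned above).

```c
#include <stddef.h>

/* Hypothetical helper: length of the longest line in buf, newline excluded. */
size_t max_line_length(const char *buf, size_t size)
{
	size_t longest = 0, current = 0;

	for (size_t i = 0; i < size; i++) {
		if (buf[i] == '\n') {
			if (current > longest)
				longest = current;
			current = 0;
		} else {
			current++;
		}
	}
	/* The last line may lack a trailing newline. */
	if (current > longest)
		longest = current;
	return longest;
}

/* Hypothetical policy: believe it is text only if no line exceeds the threshold. */
int looks_like_text_by_line_length(const char *buf, size_t size, size_t threshold)
{
	return max_line_length(buf, size) <= threshold;
}
```

Note that in this sketch a CR before the LF counts towards the line length.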

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: mingw, windows, crlf/lf, and git
  2007-02-13 19:07           ` Linus Torvalds
  2007-02-13 20:42             ` Sam Ravnborg
  2007-02-14  5:16             ` Junio C Hamano
@ 2007-02-14 11:36             ` Alexander Litvinov
  2007-02-14 16:37               ` Linus Torvalds
  2007-02-14 16:16             ` Johannes Sixt
  3 siblings, 1 reply; 83+ messages in thread
From: Alexander Litvinov @ 2007-02-14 11:36 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List

On Wednesday 14 February 2007 01:07, Linus Torvalds wrote:
> Actually, I did it myself.
>
> This is a "lazy man's auto-CRLF", and it really is pretty simple.

Wow ! Thanks. 

I just tried this patch and it works! From now on I can use git-cvsimport under
Linux and then clone it to cygwin and work there with full history. Nice,
very nice. In my case text file detection works well, as most of our files
are .cpp and .h

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: mingw, windows, crlf/lf, and git
  2007-02-14 11:10                 ` Johannes Schindelin
@ 2007-02-14 14:26                   ` Mark Levedahl
  2007-02-14 15:51                     ` Linus Torvalds
  2007-02-14 15:56                     ` Johannes Schindelin
  2007-02-14 15:44                   ` Linus Torvalds
  1 sibling, 2 replies; 83+ messages in thread
From: Mark Levedahl @ 2007-02-14 14:26 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Linus Torvalds, Junio C Hamano, Alexander Litvinov,
	Mark Levedahl, Git Mailing List

Johannes Schindelin wrote:
> Last time I checked, the text files never had lines longer than 200 
> characters (I chose this intentionally large). So, it might be a good 
> heuristic to check the maximal line length, and refuse to believe that 
> it's text once a certain (configurable) threshold is reached.
>
> Ciao,
> Dsch
Unfortunately, on my program we have folks using text files with single
lines over 60,000 characters long; these are data files. Think for
example of a comma or tab separated data file saved from a spreadsheet.
In this case, the files are pure ascii. So, the line length could be
something else to take into account, but is not decisive by itself.

To recap, we have the following various suggestions to determine textness:

1) ratio of ascii to non-ascii characters, possibly weighting some chars 
more than others
2) line length
3) existence of a null (\0)
4) file name globbing
5) roundtrip ( lf(crlf(file)) == file )

I don't think any one suggestion is completely adequate for all uses;
all need to be available and somehow configurable. This suggests to me a
core.AutoCRLFstrategy variable that is a comma separated list of methods 
to use (set to a reasonable default of course that does not cause 
runtime headaches on Unix): a file would be deemed binary unless all 
listed methods declare the file as text (with an empty list disabling 
AutoCRLF detection).
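
A rough sketch of how such a strategy list might be evaluated, assuming
hypothetical method names "nonul" and "lonecr" and a hypothetical
is_text_by_strategy() entry point (none of these exist in git; they are
made up for illustration). A file counts as text only if every listed
method agrees, and an empty list disables detection:

```c
#include <stddef.h>
#include <string.h>

typedef int (*text_check_fn)(const char *buf, size_t size);

/* Illustrative check: no NUL byte anywhere in the buffer. */
int check_no_nul(const char *buf, size_t size)
{
	return memchr(buf, '\0', size) == NULL;
}

/* Illustrative check: every CR is immediately followed by LF. */
int check_no_lone_cr(const char *buf, size_t size)
{
	for (size_t i = 0; i < size; i++)
		if (buf[i] == '\r' && (i + 1 == size || buf[i + 1] != '\n'))
			return 0;
	return 1;
}

struct named_check {
	const char *name;
	text_check_fn fn;
};

/* Hypothetical method names; a real list would cover all suggestions. */
static const struct named_check known_checks[] = {
	{ "nonul", check_no_nul },
	{ "lonecr", check_no_lone_cr },
};

/*
 * Returns 1 only if every method named in the comma-separated strategy
 * declares the buffer to be text. An empty strategy or an unknown
 * method name returns 0, i.e. the file is treated as binary.
 */
int is_text_by_strategy(const char *strategy, const char *buf, size_t size)
{
	const char *p = strategy;
	size_t nchecks = sizeof(known_checks) / sizeof(known_checks[0]);

	if (!*p)
		return 0;	/* empty list: detection disabled */
	while (*p) {
		size_t len = strcspn(p, ",");
		int matched = 0;

		for (size_t i = 0; i < nchecks; i++) {
			if (strlen(known_checks[i].name) == len &&
			    !strncmp(known_checks[i].name, p, len)) {
				matched = 1;
				if (!known_checks[i].fn(buf, size))
					return 0;
			}
		}
		if (!matched)
			return 0;	/* unknown method: stay conservative */
		p += len;
		if (*p == ',')
			p++;
	}
	return 1;
}
```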

Mark

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: mingw, windows, crlf/lf, and git
  2007-02-14 11:10                 ` Johannes Schindelin
  2007-02-14 14:26                   ` Mark Levedahl
@ 2007-02-14 15:44                   ` Linus Torvalds
  2007-02-14 15:53                     ` Johannes Schindelin
  1 sibling, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2007-02-14 15:44 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Junio C Hamano, Alexander Litvinov, Mark Levedahl, Git Mailing List



On Wed, 14 Feb 2007, Johannes Schindelin wrote:
> 
> Last time I checked, the text files never had lines longer than 200 
> characters (I chose this intentionally large). So, it might be a good 
> heuristic to check the maximal line length,

No, some broken editor programs and people use "flowing text" files, where 
a newline is actually a _paragraph_ end. You have lines in the hundreds 
(and thousands) of characters, and the program will just flow the text for 
you.

Ugh. Horrible, I know. 

		Linus

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: mingw, windows, crlf/lf, and git
  2007-02-14 14:26                   ` Mark Levedahl
@ 2007-02-14 15:51                     ` Linus Torvalds
  2007-02-14 16:39                       ` Junio C Hamano
  2007-02-14 15:56                     ` Johannes Schindelin
  1 sibling, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2007-02-14 15:51 UTC (permalink / raw)
  To: Mark Levedahl
  Cc: Johannes Schindelin, Junio C Hamano, Alexander Litvinov,
	Mark Levedahl, Git Mailing List



On Wed, 14 Feb 2007, Mark Levedahl wrote:
> 
> To recap, we have the following various suggestions to determine textness:
> 
> 1) ratio of ascii to non-ascii characters, possibly weighting some chars more
> than others
> 2) line length
> 3) existence of a null (\0)
> 4) file name globbing
> 5) roundtrip ( lf(crlf(file)) == file )

Actually, my patch already had one that you didn't mention: 
 6) CR never shows up alone.

So the patch I sent out basically had the following rules:
 - no more than ~10% of all characters being other than regular printable 
   ASCII (where any control character except for newline/cr/tab was deemed 
   nonprintable)
 - any "lonely" CR automatically means it's binary, and I would refuse 
   to convert that to a LF (the test in the code is that CRLF count must 
   match CR count)

but the "roundtrip" rule is much too strict (it's actually perfectly 
possible for an editor to add CRLF characters only to new _lines_, leaving 
old lines with just LF - or the other way around. In fact, the editor I 
use under Linux does exactly that in reverse - if I add new lines, it will 
add those without CR, but will leave old lines with CRLF alone).
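
The two rules above can be sketched as follows. This is a simplified
illustration, not the patch itself: the struct and function names carry
a _sketch suffix to mark them as hypothetical, and "all characters" is
assumed here to mean printable plus nonprintable bytes, with CR and LF
counted separately.

```c
#include <stddef.h>

struct text_stat_sketch {
	unsigned cr, lf, crlf;
	unsigned printable, nonprintable;
};

/* Count line endings and (non)printable bytes in one pass. */
void gather_stats_sketch(const char *buf, size_t size,
			 struct text_stat_sketch *st)
{
	st->cr = st->lf = st->crlf = 0;
	st->printable = st->nonprintable = 0;

	for (size_t i = 0; i < size; i++) {
		unsigned char c = (unsigned char)buf[i];

		if (c == '\r') {
			st->cr++;
			if (i + 1 < size && buf[i + 1] == '\n')
				st->crlf++;
			continue;
		}
		if (c == '\n') {
			st->lf++;
			continue;
		}
		/* Tab and regular printable ASCII count as printable. */
		if (c == '\t' || (c >= 32 && c < 127))
			st->printable++;
		else
			st->nonprintable++;
	}
}

int is_binary_sketch(const struct text_stat_sketch *st)
{
	/* A "lonely" CR: the CR count must match the CRLF count. */
	if (st->cr != st->crlf)
		return 1;
	/* More than ~10% nonprintable bytes (assumed denominator, see above). */
	if ((unsigned long long)st->nonprintable * 10 >
	    (unsigned long long)st->printable + st->nonprintable)
		return 1;
	return 0;
}
```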

I think that to help asian languages (or strange text-files in utf8 or 
Latin1 too, for that matter: test-files with _just_ special characters), I 
should probably make the rule be that only the 0-31 range is special.

			Linus

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: mingw, windows, crlf/lf, and git
  2007-02-14 15:44                   ` Linus Torvalds
@ 2007-02-14 15:53                     ` Johannes Schindelin
  0 siblings, 0 replies; 83+ messages in thread
From: Johannes Schindelin @ 2007-02-14 15:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Junio C Hamano, Alexander Litvinov, Mark Levedahl, Git Mailing List

Hi,

On Wed, 14 Feb 2007, Linus Torvalds wrote:

> On Wed, 14 Feb 2007, Johannes Schindelin wrote:
> > 
> > Last time I checked, the text files never had lines longer than 200 
> > characters (I chose this intentionally large). So, it might be a good 
> > heuristic to check the maximal line length,
> 
> No, some broken editor programs and people use "flowing text" files, where 
> a newline is actually a _paragraph_ end.

Good point.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: mingw, windows, crlf/lf, and git
  2007-02-14 14:26                   ` Mark Levedahl
  2007-02-14 15:51                     ` Linus Torvalds
@ 2007-02-14 15:56                     ` Johannes Schindelin
  2007-02-14 16:23                       ` Linus Torvalds
  2007-02-14 17:28                       ` Mark Levedahl
  1 sibling, 2 replies; 83+ messages in thread
From: Johannes Schindelin @ 2007-02-14 15:56 UTC (permalink / raw)
  To: Mark Levedahl
  Cc: Linus Torvalds, Junio C Hamano, Alexander Litvinov,
	Mark Levedahl, Git Mailing List

Hi,

On Wed, 14 Feb 2007, Mark Levedahl wrote:

> This suggests to me a core.AutoCRLFstrategy variable that is a comma 
> separated list of methods to use (set to a reasonable default of course 
> that does not cause runtime headaches on Unix): a file would be deemed 
> binary unless all listed methods declare the file as text (with an empty 
> list disabling AutoCRLF detection).

This sounds regretfully complex. Somebody (you?) mentioned that cvsnt does 
a kick-ass job here. Does cvsnt need strategies? I don't think so. Neither 
do we. Someone who cares enough should just rip^H^H^Hlook at cvsnt's text 
detection.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: mingw, windows, crlf/lf, and git
  2007-02-13 19:07           ` Linus Torvalds
                               ` (2 preceding siblings ...)
  2007-02-14 11:36             ` Alexander Litvinov
@ 2007-02-14 16:16             ` Johannes Sixt
  2007-02-14 16:53               ` Linus Torvalds
  3 siblings, 1 reply; 83+ messages in thread
From: Johannes Sixt @ 2007-02-14 16:16 UTC (permalink / raw)
  To: git

Linus Torvalds wrote:
> 
> On Tue, 13 Feb 2007, Junio C Hamano wrote:
> >
> > Thanks, applied.  I think git-apply has separate codepaths for
> > both reading and writing; I won't look into them before 1.5.0
> > but people are welcome to help advancing the cause before I get
> > to it ;-).
> 
> Actually, I did it myself.
> 
> This is a "lazy man's auto-CRLF", and it really is pretty simple.

Thanks a lot, busy beaver! I gave this a quick spin with a few
interesting operations: merges and rebase. Merges leave the merge
results with only LFs behind. Rebasing seems to work as expected
(working files have CRLFs), except when merges are needed.

Doesn't git-unpack-file also need to call into the converter?

-- Hannes

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: mingw, windows, crlf/lf, and git
  2007-02-14 15:56                     ` Johannes Schindelin
@ 2007-02-14 16:23                       ` Linus Torvalds
  2007-02-14 17:28                       ` Mark Levedahl
  1 sibling, 0 replies; 83+ messages in thread
From: Linus Torvalds @ 2007-02-14 16:23 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Mark Levedahl, Junio C Hamano, Alexander Litvinov, Mark Levedahl,
	Git Mailing List



On Wed, 14 Feb 2007, Johannes Schindelin wrote:
> 
> This sounds regretfully complex. Somebody (you?) mentioned that cvsnt does 
> a kick-ass job here. Does cvsnt need strategies? I don't think so. Neither 
> do we. Someone who cares enough should just rip^H^H^Hlook at cvsnt's text 
> detection.

Well, one thing to keep in mind is that for source code in particular, 
this really very seldom is an issue.

So you can do a really *bad* job in theory, and in practice it really 
works very very well.

Very few people keep binary blobs in any SCM archive _anyway_, partly 
because they've always been told that it's unsafe (and with a lot of SCM's 
it is), but even more because binary blobs are almost always generated by 
some build method, so normally you'd never version them in the first 
place, or versioning isn't all that helpful.

And most binary blobs are so *obviously* binary that even the stupidest 
algorithm on earth will get it right. The only hard cases actually tend to 
be really tiny files, or literally test-sequences.

Tiny files are hard because:

 - they (by being tiny) have so few characters that they can easily lack 
   a "fingerprint" character (eg a NUL character or similar). 

 - tiny files are a lot more likely than bigger files to have strange 
   statistics that throw some more "sophisticated" rule off the scent. 
   Something like a "10% rule" tends to work fine if you have a big text, 
   and ten percent is still a reasonable number to average things out 
   over, but what if you only had ten characters to begin with?

The good news is that tiny files can usually be considered text, since 
you'd seldom use a binary format for something really small anyway.

So I suspect that IN PRACTICE, especially if you come as a CVS replacement 
(where binary files are just damn hard to get right even under the best of 
circumstances!), you can do just about anything, including just saying 
"everything is text", and you'd be fine.

It's entirely possible that that is exactly what CVSNT does ;)

		Linus

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: mingw, windows, crlf/lf, and git
  2007-02-14  8:41                 ` Sam Ravnborg
@ 2007-02-14 16:28                   ` Linus Torvalds
  2007-02-14 16:47                     ` Sam Ravnborg
  0 siblings, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2007-02-14 16:28 UTC (permalink / raw)
  To: Sam Ravnborg
  Cc: Junio C Hamano, Johannes Schindelin, Alexander Litvinov,
	Mark Levedahl, Git Mailing List



On Wed, 14 Feb 2007, Sam Ravnborg wrote:
>
> > (Which means that I suspect I made the default for "auto_crlf" be wrong in 
> > my patch: I probably should not default to checking out with CRLF, but 
> > checking out with just LF, and only do the CRLF->LF conversion on input).
>
> Except that it seems a few br0ken programs still do not support LF as an
> end-of-line marker, so .gitattributes may have to take special care here.

Yes, but I also think that even without .gitattributes, you just want to 
have a default for what "text" actually means, and it's entirely possible 
that the default should be: "check out with just LF, and on check-in turn 
CRLF into LF".

But exactly because _some_ programs might want to always see CRLF on input 
too, it should be overridable. 

Or maybe the default should be "turn into CRLF", and there should just be 
an option to make it check out as LF-only.

Regardless, I think that is independent of ".gitattributes". The 
_attribute_ should be "text", but what it then means in practice is a 
separate flag.

And yes, we *could* have a per-file attribute ("text,crlf-checkout") which 
could be used to say "I want to always check out as crlf regardless of any 
other policy") and the same for lf-only, but I seriously doubt that 
anybody really needs that kind of knob-tweaking. At some point it's just 
fine to say "you're crazy".
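
For the "on check-in turn CRLF into LF" half of that default, a minimal
sketch might look like this. The function name crlf_to_lf_sketch() is
hypothetical, and leaving a lone CR untouched is an assumption of the
sketch rather than a statement about what the patch does:

```c
#include <stddef.h>

/*
 * Hypothetical helper: rewrite buf in place, turning every CRLF pair
 * into a single LF. Lone CR bytes are kept. Returns the new length.
 */
size_t crlf_to_lf_sketch(char *buf, size_t size)
{
	size_t out = 0;

	for (size_t i = 0; i < size; i++) {
		if (buf[i] == '\r' && i + 1 < size && buf[i + 1] == '\n')
			continue;	/* drop the CR; the LF is copied next */
		buf[out++] = buf[i];
	}
	return out;
}
```

Because the output is never longer than the input, the conversion can
reuse the input buffer.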

			Linus

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: mingw, windows, crlf/lf, and git
  2007-02-14 11:36             ` Alexander Litvinov
@ 2007-02-14 16:37               ` Linus Torvalds
  2007-02-14 17:18                 ` Junio C Hamano
  0 siblings, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2007-02-14 16:37 UTC (permalink / raw)
  To: Alexander Litvinov; +Cc: Junio C Hamano, Git Mailing List



On Wed, 14 Feb 2007, Alexander Litvinov wrote:
> 
> I just tried this patch and it works! From now on I can use git-cvsimport under
> Linux and then clone it to cygwin and work there with full history. Nice, 
> very nice.

Btw, it didn't do any commit message conversion etc, so you'll still 
always see commit messages with LF-only, and if you _create_ commits, you 
need to make sure that whatever program you use will do the right thing.

> In my case text file detection works well, as most of our files
> are .cpp and .h

Yeah, considering that it worked in my testing for "git" itself, I'm not 
surprised. Source code tends to look the same..

		Linus

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: mingw, windows, crlf/lf, and git
  2007-02-14 15:51                     ` Linus Torvalds
@ 2007-02-14 16:39                       ` Junio C Hamano
  2007-02-14 17:01                         ` Linus Torvalds
  0 siblings, 1 reply; 83+ messages in thread
From: Junio C Hamano @ 2007-02-14 16:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mark Levedahl, Johannes Schindelin, Alexander Litvinov,
	Mark Levedahl, Git Mailing List

Linus Torvalds <torvalds@linux-foundation.org> writes:

> Actually, my patch already had one that you didn't mention: 
>  6) CR never shows up alone.

Older Macs ;-)?

> So the patch I sent out basically had the following rules:
>  - no more than ~10% of all characters being other than regular printable 
>    ASCII (where any control character except for newline/cr/tab was deemed 
>    nonprintable)
>  - any "lonely" CR automatically means it's binary, and I would refuse 
>    to convert that to a LF (the test in the code is that CRLF count must 
>    match CR count)
> ...
> I think that to help asian languages (or strange text-files in utf8 or 
> Latin1 too, for that matter: test-files with _just_ special characters), I 
> should probably make the rule be that only the 0-31 range is special.

I would agree.  0-31 except HT, CR, LF and ESC would be a good
idea; that would not harm text in UTF-8, EUC based various
locales nor ISO 2022.

Patch is relative to 'pu'.
-- >8 --

diff --git a/convert.c b/convert.c
index ebcf717..b6b7c66 100644
--- a/convert.c
+++ b/convert.c
@@ -13,7 +13,7 @@ struct text_stat {
 	unsigned cr, lf, crlf;
 
 	/* These are just approximations! */
-	unsigned printable, nonprintable, nul;
+	unsigned printable, nonprintable;
 };
 
 static void gather_stats(const char *buf, unsigned long size, struct text_stat *stats)
@@ -34,13 +34,11 @@ static void gather_stats(const char *buf, unsigned long size, struct text_stat *
 			stats->lf++;
 			continue;
 		}
-		if (c == '\t' || (c >= 32 && c < 127)) {
-			stats->printable++;
+		if ((c < 32) && (c != '\t' && c != '\033')) {
+			stats->nonprintable++;
 			continue;
 		}
-		if (!c)
-			stats->nul++;
-		stats->nonprintable++;
+		stats->printable++;
 	}
 }
 
@@ -50,7 +48,7 @@ static void gather_stats(const char *buf, unsigned long size, struct text_stat *
 static int is_binary(unsigned long size, struct text_stat *stats)
 {
 
-	if (stats->nul)
+	if (stats->nonprintable)
 		return 1;
 	/*
 	 * Other heuristics? Average line length might be relevant,

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* Re: mingw, windows, crlf/lf, and git
  2007-02-14 16:28                   ` Linus Torvalds
@ 2007-02-14 16:47                     ` Sam Ravnborg
  0 siblings, 0 replies; 83+ messages in thread
From: Sam Ravnborg @ 2007-02-14 16:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Junio C Hamano, Johannes Schindelin, Alexander Litvinov,
	Mark Levedahl, Git Mailing List

On Wed, Feb 14, 2007 at 08:28:24AM -0800, Linus Torvalds wrote:
> 
> 
> On Wed, 14 Feb 2007, Sam Ravnborg wrote:
> >
> > > (Which means that I suspect I made the default for "auto_crlf" be wrong in 
> > > my patch: I probably should not default to checking out with CRLF, but 
> > > checking out with just LF, and only do the CRLF->LF conversion on input).
> >
> > Except that it seems a few br0ken programs still do not support LF as an
> > end-of-line marker, so .gitattributes may have to take special care here.
> 
> Yes, but I also think that even without .gitattributes, you just want to 
> have a default for what "text" actually means, and it's entirely possible 
> that the default should be: "check out with just LF, and on check-in turn 
> CRLF into LF".
The definition of what is "text" and what action to take upon check-in /
check-out of text are two separate things.

I could see it as beneficial as a per-project or even as an overall
git-policy to say "checkin-as-LF" - "checkout-as-LF" to overcome
interoperability issues as more tools become UNIX* based.

> 
> But exactly because _some_ programs might want to always see CRLF on input 
> too, it should be overridable. 
Which is where I see .gitattributes come into play.
-> A rule that says files with extension .prj and of type "text" shall not see
any conversion.

In this way almost all "text" over time get a proper format and the remaining
brain-dead tools that continue to save in CRLF format will not destroy the sane
LF format.

If anything becomes the default I would vote for LF. But overridable.

My editor-of-choice does eol auto-sense. If I recall correctly, it scans the
first 200 lines, counts the number of CR, LF and CRLF, and based on this judges
the actual eol character used. But not all editors are that sensible :-(
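
That kind of auto-sense could be sketched like this. It is a guess at
the general idea, not the editor's actual algorithm: the name
sense_eol(), the enum, and the tie-breaking order (CRLF, then LF, then
CR) are all assumptions made for illustration.

```c
#include <stddef.h>

enum eol_style { EOL_UNKNOWN, EOL_LF, EOL_CRLF, EOL_CR };

/* Look at up to max_lines line endings and pick the most common style. */
enum eol_style sense_eol(const char *buf, size_t size, unsigned max_lines)
{
	unsigned cr = 0, lf = 0, crlf = 0, lines = 0;

	for (size_t i = 0; i < size && lines < max_lines; i++) {
		if (buf[i] == '\r') {
			if (i + 1 < size && buf[i + 1] == '\n') {
				crlf++;
				i++;	/* consume the LF of the pair */
			} else {
				cr++;
			}
			lines++;
		} else if (buf[i] == '\n') {
			lf++;
			lines++;
		}
	}
	if (!lines)
		return EOL_UNKNOWN;
	/* Ties are broken in favour of CRLF, then LF (an arbitrary choice). */
	if (crlf >= lf && crlf >= cr)
		return EOL_CRLF;
	if (lf >= cr)
		return EOL_LF;
	return EOL_CR;
}
```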

	Sam

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: mingw, windows, crlf/lf, and git
  2007-02-14 16:16             ` Johannes Sixt
@ 2007-02-14 16:53               ` Linus Torvalds
  0 siblings, 0 replies; 83+ messages in thread
From: Linus Torvalds @ 2007-02-14 16:53 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: git



On Wed, 14 Feb 2007, Johannes Sixt wrote:
>
> Thanks a lot, busy beaver! I gave this a quick spin with a few
> interesting operations: merges and rebase. Merges leave the merge
> results with only LFs behind.

Yes. Merge uses "git-cat-file" (well, it historically did, now that it's 
built-in it still does the equivalent operation).

I already talked about how git-cat-file was special ;)

> Rebasing seems to work as expected (working files have CRLFs), except 
> when merges are needed.

Well, it always "merges", but yes, you mean three-way data merges. The 
normal SHA1-direct merges will just use the normal git-read-tree thing 
which is the same as checkout.

> Doesn't git-unpack-file also need to call into the converter?

See earlier discussions. git-cat-file (and git-unpack-file, which is just 
a version of it, really) don't have the original filename, so we'll need 
to extend on it some way in order to support file attributes even in 
theory. So before we do that, I'd hate to do any format conversion there.

Yes, yes, right now it ignores the filename *anyway*, but the point is, 
right now that's a "small implementation detail". I would NOT want to do 
this if I couldn't know the filename at all!

The merge algorithms actually obviously *do* know the filename of the 
things that they are going to merge, so the filename information does 
exist. It's just not passed on far enough.

Finally, one comment: if you use "autocrlf = input" (my second patch), all 
of this works even now, since the default is to just leave things as 
LF-only anyway. In fact, even with "autocrlf = on", nothing should really 
*break* except for silly editors that actually *require* CRLF.

IOW, it's more important to do the CRLF->LF conversion than it is to do 
the LF->CRLF one ;)
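
For completeness, the less important LF->CRLF direction could be
sketched as below. The name lf_to_crlf_sketch() is hypothetical, and
leaving an already-present CRLF alone is an assumption of this sketch.
Unlike CRLF->LF, the result can be longer than the input, so a new
buffer is needed:

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper: return a malloc'd copy of buf in which every LF
 * not already preceded by CR is expanded to CRLF. The result is also
 * NUL-terminated for convenience. Returns NULL if allocation fails.
 */
char *lf_to_crlf_sketch(const char *buf, size_t size, size_t *out_size)
{
	size_t extra = 0, o = 0;
	char *out;

	/* First pass: count the bare LFs that need a CR inserted. */
	for (size_t i = 0; i < size; i++)
		if (buf[i] == '\n' && (i == 0 || buf[i - 1] != '\r'))
			extra++;

	out = malloc(size + extra + 1);
	if (!out)
		return NULL;

	/* Second pass: copy, inserting CR before each bare LF. */
	for (size_t i = 0; i < size; i++) {
		if (buf[i] == '\n' && (i == 0 || buf[i - 1] != '\r'))
			out[o++] = '\r';
		out[o++] = buf[i];
	}
	out[o] = '\0';
	*out_size = o;
	return out;
}
```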

		Linus

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: mingw, windows, crlf/lf, and git
  2007-02-14 16:39                       ` Junio C Hamano
@ 2007-02-14 17:01                         ` Linus Torvalds
  2007-02-14 17:29                           ` Junio C Hamano
  0 siblings, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2007-02-14 17:01 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Mark Levedahl, Johannes Schindelin, Alexander Litvinov,
	Mark Levedahl, Git Mailing List



On Wed, 14 Feb 2007, Junio C Hamano wrote:

> Linus Torvalds <torvalds@linux-foundation.org> writes:
> 
> > Actually, my patch already had one that you didn't mention: 
> >  6) CR never shows up alone.
> 
> Older Macs ;-)?

Yeah, I think we can ignore them..

Let's see if anybody ever complains ;)

> I would agree.  0-31 except HT, CR, LF and ESC would be a good
> idea; that would not harm text in UTF-8, EUC based various
> locales nor ISO 2022.

You could possibly add 127 to the list too (it's ascii DEL, I don't know 
if you should ever see it in anything that has anything to do with text).

> -	if (stats->nul)
> +	if (stats->nonprintable)

But this is too harsh.

It's quite common to have the occasional FF character. Some things really 
do use it for page breaks. So saying that *any* nonprintable character is 
bad is not a good idea.

Same goes for BS (some programs use it to show bold and underlined text: 
man-pages, for example).

		Linus

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: mingw, windows, crlf/lf, and git
  2007-02-14 16:37               ` Linus Torvalds
@ 2007-02-14 17:18                 ` Junio C Hamano
  0 siblings, 0 replies; 83+ messages in thread
From: Junio C Hamano @ 2007-02-14 17:18 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Alexander Litvinov, Git Mailing List

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Wed, 14 Feb 2007, Alexander Litvinov wrote:
>> 
>> I just tried this patch and it works! From now on I can use git-cvsimport under
>> Linux and then clone it to cygwin and work there with full history. Nice, 
>> very nice.
>
> Btw, it didn't do any commit message conversion etc, so you'll still 
> always see commit messages with LF-only, and if you _create_ commits, you 
> need to make sure that whatever program you use will do the right thing.

I think stripspace removes CR so we should be Ok.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: mingw, windows, crlf/lf, and git
  2007-02-14 15:56                     ` Johannes Schindelin
  2007-02-14 16:23                       ` Linus Torvalds
@ 2007-02-14 17:28                       ` Mark Levedahl
  2007-02-14 18:17                         ` Robin Rosenberg
  1 sibling, 1 reply; 83+ messages in thread
From: Mark Levedahl @ 2007-02-14 17:28 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Mark Levedahl, Linus Torvalds, Junio C Hamano,
	Alexander Litvinov, Git Mailing List

Johannes Schindelin wrote:
> Hi,
>
> On Wed, 14 Feb 2007, Mark Levedahl wrote:
>   
> This sounds regretfully complex. Somebody (you?) mentioned that cvsnt does 
> a kick-ass job here. Does cvsnt need strategies? I don't think so. Neither 
> do we. Someone who cares enough should just rip^H^H^Hlook at cvsnt's text 
> detection.
>
> Ciao,
> Dscho
>   
I agree that is complex. I started thinking of PAM when I wrote that,
leading to "this ain't gonna work." But in the modern-day,
let's-all-feel-good spirit of "there are no stupid ideas, just some are
better," I threw it out anyway.

As to cvsnt, my actual feeling is I'd like to kick it in the ass, it has 
destroyed too many files for me over the years, binary and text, so I 
don't think its strategies are very good. That is why I'm kicking these 
ideas around, if I thought I knew the "right" way I would have written 
it already.

Mark

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: mingw, windows, crlf/lf, and git
  2007-02-14 17:01                         ` Linus Torvalds
@ 2007-02-14 17:29                           ` Junio C Hamano
  2007-02-14 17:43                             ` Linus Torvalds
  0 siblings, 1 reply; 83+ messages in thread
From: Junio C Hamano @ 2007-02-14 17:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mark Levedahl, Johannes Schindelin, Alexander Litvinov,
	Mark Levedahl, Git Mailing List

Linus Torvalds <torvalds@linux-foundation.org> writes:

>> -	if (stats->nul)
>> +	if (stats->nonprintable)
>
> But this is too harsh.
>
> It's quite common to have the occasional FF character. Some things really 
> do use it for page breaks. So saying that *any* nonprintable character is 
> bad is not a good idea.
>
> Same goes for BS (some programs use it to show bold and underlined text: 
> man-pages, for example).

Ok.  How about adding BS and FF to the Ok set, and checking if
bad ones are less than 1% of the good ones?

diff --git a/convert.c b/convert.c
index b6b7c66..b0c7641 100644
--- a/convert.c
+++ b/convert.c
@@ -34,11 +34,22 @@ static void gather_stats(const char *buf, unsigned long size, struct text_stat *
 			stats->lf++;
 			continue;
 		}
-		if ((c < 32) && (c != '\t' && c != '\033')) {
+		if (c == 127)
+			/* DEL */
 			stats->nonprintable++;
-			continue;
+		else if (c < 32) {
+			switch (c) {
+				/* BS, HT, ESC and FF */
+			case '\b': case '\t': case '\033': case '\014':
+				stats->printable++;
+				break;
+			default:
+				stats->nonprintable++;
+			}
+			
 		}
-		stats->printable++;
+		else
+			stats->printable++;
 	}
 }
 
@@ -48,7 +59,7 @@ static void gather_stats(const char *buf, unsigned long size, struct text_stat *
 static int is_binary(unsigned long size, struct text_stat *stats)
 {
 
-	if (stats->nonprintable)
+	if ((stats->printable >> 7) < stats->nonprintable)
 		return 1;
 	/*
 	 * Other heuristics? Average line length might be relevant,
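
A note on the threshold in the hunk above: shifting right by 7 divides
by 128, so the test fires when nonprintable bytes exceed about 1/128
(roughly 0.78%) of the printable ones, which is where "less than 1%"
comes from. The comparison can be tried in isolation with a small
stand-alone function (the name too_many_nonprintables() is made up for
this illustration):

```c
/* Same comparison as in the patch, applied to plain counters. */
int too_many_nonprintables(unsigned printable, unsigned nonprintable)
{
	return (printable >> 7) < nonprintable;
}
```

With 1280 printable bytes, 10 nonprintable bytes still pass and 11 do
not. With fewer than 128 printable bytes the shift yields 0, so a single
nonprintable byte is enough to call the buffer binary.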

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* Re: mingw, windows, crlf/lf, and git
  2007-02-14 17:29                           ` Junio C Hamano
@ 2007-02-14 17:43                             ` Linus Torvalds
  0 siblings, 0 replies; 83+ messages in thread
From: Linus Torvalds @ 2007-02-14 17:43 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Mark Levedahl, Johannes Schindelin, Alexander Litvinov,
	Mark Levedahl, Git Mailing List



On Wed, 14 Feb 2007, Junio C Hamano wrote:
> 
> Ok.  How about adding BS and FF to the Ok set, and checking if
> bad ones are less than 1% of the good ones?

I think that looks fine.

		Linus

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: mingw, windows, crlf/lf, and git
  2007-02-14 17:28                       ` Mark Levedahl
@ 2007-02-14 18:17                         ` Robin Rosenberg
  2007-02-14 18:31                           ` Linus Torvalds
  0 siblings, 1 reply; 83+ messages in thread
From: Robin Rosenberg @ 2007-02-14 18:17 UTC (permalink / raw)
  To: Mark Levedahl
  Cc: Johannes Schindelin, Mark Levedahl, Linus Torvalds,
	Junio C Hamano, Alexander Litvinov, Git Mailing List

On Wednesday 14 February 2007 18:28, Mark Levedahl wrote:
> As to cvsnt, my actual feeling is I'd like to kick it in the ass, it has 
> destroyed too many files for me over the years, binary and text, so I 
> don't think its strategies are very good. That is why I'm kicking these 
> ideas around, if I thought I knew the "right" way I would have written 
> it already.

That may be why an excellent piece of software, TortoiseCVS,  doesn't trust 
cvs or cvsnt to do the job. Here is how they do the binary detection (and 
some more):

http://tortoisecvs.cvs.sourceforge.net/tortoisecvs/TortoiseCVS/src/CVSGlue/CVSStatus.cpp?revision=1.172&view=markup

-- robin

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: mingw, windows, crlf/lf, and git
  2007-02-14 18:17                         ` Robin Rosenberg
@ 2007-02-14 18:31                           ` Linus Torvalds
  2007-02-14 20:24                             ` Robin Rosenberg
  0 siblings, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2007-02-14 18:31 UTC (permalink / raw)
  To: Robin Rosenberg
  Cc: Mark Levedahl, Johannes Schindelin, Mark Levedahl,
	Junio C Hamano, Alexander Litvinov, Git Mailing List



On Wed, 14 Feb 2007, Robin Rosenberg wrote:
> 
> That may be why an excellent piece of software, TortoiseCVS,  doesn't trust 
> cvs or cvsnt to do the job. Here is how they do the binary detection (and 
> some more):
> 
> http://tortoisecvs.cvs.sourceforge.net/tortoisecvs/TortoiseCVS/src/CVSGlue/CVSStatus.cpp?revision=1.172&view=markup

Well, it does seem to boil down to what Junio already got to:

 - 0-31 and 127 are never in text, except for BEL, BS, HT, LF, FF, CR and 
   ESC.
 - 128-255 can all be in either iso-8859 or extended ascii (or they 
   explicitly add NEL but not 128+27 to "normal ASCII", which is strange)

So they've effectively added BEL and ESC to the list of characters that 
Junio has now. But they also make it an absolute error to have anything 
else (no "1% rule").
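
The strict rule summarized above (no tolerance, a fixed set of allowed
control characters, everything from 128 up accepted) could be sketched
as follows. This illustrates the summary only; it is not taken from the
TortoiseCVS source, and both function names are hypothetical:

```c
#include <stddef.h>

/* Returns 1 if the byte may appear in text under the strict rule. */
int is_text_byte_strict(unsigned char c)
{
	if (c >= 128)
		return 1;	/* ISO-8859 / extended ASCII range */
	if (c >= 32 && c != 127)
		return 1;	/* regular printable ASCII */
	switch (c) {
	case '\a':	/* BEL */
	case '\b':	/* BS */
	case '\t':	/* HT */
	case '\n':	/* LF */
	case '\f':	/* FF */
	case '\r':	/* CR */
	case 0x1b:	/* ESC */
		return 1;
	default:
		return 0;	/* any other control character, or DEL */
	}
}

/* A single disallowed byte makes the whole buffer binary. */
int is_text_strict(const char *buf, size_t size)
{
	for (size_t i = 0; i < size; i++)
		if (!is_text_byte_strict((unsigned char)buf[i]))
			return 0;
	return 1;
}
```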

But they also do the filename tests, and I think that's more important in 
many ways.

		Linus

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: mingw, windows, crlf/lf, and git
  2007-02-14 18:31                           ` Linus Torvalds
@ 2007-02-14 20:24                             ` Robin Rosenberg
  0 siblings, 0 replies; 83+ messages in thread
From: Robin Rosenberg @ 2007-02-14 20:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mark Levedahl, Johannes Schindelin, Mark Levedahl,
	Junio C Hamano, Alexander Litvinov, Git Mailing List

On Wednesday 14 February 2007 19:31, Linus Torvalds wrote:
> 
> On Wed, 14 Feb 2007, Robin Rosenberg wrote:
> > 
> > That may be why an excellent piece of software, TortoiseCVS,  doesn't trust 
> > cvs or cvsnt to do the job. Here is how they do the binary detection (and 
> > some more):
> > 
> > http://tortoisecvs.cvs.sourceforge.net/tortoisecvs/TortoiseCVS/src/CVSGlue/CVSStatus.cpp?revision=1.172&view=markup
> 
> Well, it does seem to boil down to what Junio already got to:
> 
>  - 0-31 and 127 are never in text, except for BEL, BS, HT, LF, FF, CR and 
>    ESC.
>  - 128-255 can all be in either iso-8859 or extended ascii (or they 
>    explicitly add NEL but not 128+27 to "normal ASCII", which is strange)
>
> So they've effectively added BEL and ESC to the list of characters that 
ESC in particular used to be common in DOS/Windows, and quite a few still
linger in older code.

> Junio has now. But they also make it an absolute error to have anything 
> else (no "1% rule").
Can this 1% rule be justified by real cases rather than hypothetical ones? It makes 
it harder to understand why the tool makes a particular decision.

> But they also do the filename tests, and I think that's more important in 
> many ways.

A unixy tool like git should maybe use magic too :).

Btw, a filename test (via .gitignore or something similar) would in practice
give us the binary flag: just list a filename instead of a pattern.
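
A rough Python sketch of that idea, assuming a hypothetical list of
glob-style entries (this is not an existing git mechanism; BINARY_PATTERNS
and flagged_binary are made-up names). An entry without wildcards simply
acts as a per-file binary flag:

```python
from fnmatch import fnmatchcase

# Hypothetical entries, written the way ignore-style patterns are listed.
# The last one has no wildcard, so it flags exactly one file.
BINARY_PATTERNS = ["*.png", "*.zip", "docs/logo.dat"]


def flagged_binary(path: str, patterns=BINARY_PATTERNS) -> bool:
    """True if path matches any listed pattern or literal filename."""
    return any(fnmatchcase(path, pattern) for pattern in patterns)
```

Here flagged_binary("docs/logo.dat") is True because of the literal entry,
while flagged_binary("docs/readme.txt") is False and would fall through to
content-based detection.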

-- robin
