* mingw, windows, crlf/lf, and git
@ 2007-02-11 23:13 Mark Levedahl
2007-02-11 23:34 ` Johannes Schindelin
` (4 more replies)
0 siblings, 5 replies; 83+ messages in thread
From: Mark Levedahl @ 2007-02-11 23:13 UTC (permalink / raw)
To: Git Mailing List
I am NOT intending to start a flamewar O:-) , so please don't turn this
into one.
The recent threads on a mingw git port are explicit in the intent to
provide a Windows native git. I believe there is a fundamental conflict
here with the position, clearly stated by Linus, that git does not alter
content in any way. Windows suffers the curse of DOS line endings (\r\n
vs \n), and a true port to Windows *must* allow for \r\n and \n to be
semantically the same thing as most large projects end up with a mixture
of such files and/or are targeting cross-platform capabilities. The
major competing solutions git seeks to supplant (cvs, cvsnt, svn, hg)
have capability to recognize "text" files and transparently replace \r\n
with \n on input, the reverse on output, and ignore all such differences
on diff operations. To be relevant on native Windows, git must do the
same. Otherwise, git will be deemed "too weird" and dismissed in favor
of a tool "that works."
There is no use to debating the technical merits of \r\n vs \n vs \r vs
whatever, nor of not converting. Really. Just accept that there is a
fundamental requirement that any version control tool on Windows be able
to silently convert between \r\n and \n. To believe otherwise is to
expect that the conversion be pushed elsewhere into the tool chain in
use, and that won't happen as the competition already provides this
conversion capability.
So, I think the git project needs to come to an explicit position on
this, basically being:
1) git is a POSIX only tool (i.e., there will be no \r\n munging), or
2) a Windows port of git will handle and mung \r\n and \n line endings.
If the answer is 1, the mingw port is a waste of time as it simply won't
be usable by its target audience. If the answer is 2, then I think a
very careful design of this capability is in order.
Comments?
BTW, I have addressed this in my own world using a pre-commit script
that converts textfile line endings into \n, recognizing that our
Windows tool chain handles such files perfectly well, while our Linux
toolchain requires it.
Mark Levedahl
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-11 23:13 mingw, windows, crlf/lf, and git Mark Levedahl
@ 2007-02-11 23:34 ` Johannes Schindelin
2007-02-12 0:46 ` Jakub Narebski
2007-02-12 0:14 ` Robin Rosenberg
` (3 subsequent siblings)
4 siblings, 1 reply; 83+ messages in thread
From: Johannes Schindelin @ 2007-02-11 23:34 UTC (permalink / raw)
To: Mark Levedahl; +Cc: Git Mailing List
Hi,
On Sun, 11 Feb 2007, Mark Levedahl wrote:
> The major competing solutions git seeks to supplant (cvs, cvsnt, svn,
> hg) have capability to recognize "text" files and transparently replace
> \r\n with \n on input, the reverse on output, and ignore all such
> differences on diff operations.
Agree with transformations on input and output; disagree on diff.
The problem is that it really is a transformation. Since most Windows tools
(at least those used in portable software) handle \n without \r quite
well, thank you, I'd tend towards the view point: do not mess with line
endings pre-commit/post-checkout.
Even MacOSX uses \n now, instead of \r.
Of course, for those projects which _use_ CRLF: they can continue with it.
Git has no problem with those line endings.
The only problem CVS tried to solve (badly) was to be able to checkout
text files on DOS, Unix _and_ MacOS. In practice, though, this use case
does not matter anymore IMHO.
Ciao,
Dscho
* Re: mingw, windows, crlf/lf, and git
2007-02-11 23:13 mingw, windows, crlf/lf, and git Mark Levedahl
2007-02-11 23:34 ` Johannes Schindelin
@ 2007-02-12 0:14 ` Robin Rosenberg
2007-02-12 2:37 ` Mark Levedahl
2007-02-12 4:24 ` Theodore Tso
` (2 subsequent siblings)
4 siblings, 1 reply; 83+ messages in thread
From: Robin Rosenberg @ 2007-02-12 0:14 UTC (permalink / raw)
To: Mark Levedahl; +Cc: Git Mailing List
On Monday 12 February 2007 00:13, Mark Levedahl wrote:
> The recent threads on a mingw git port are explicit in the intent to
> provide a Windows native git. I believe there is a fundamental conflict
> here with the position, clearly stated by Linus, that git does not alter
> content in any way. Windows suffers the curse of DOS line endings (\r\n
> vs \n), and a true port to Windows *must* allow for \r\n and \n to be
> semantically the same thing as most large projects end up with a mixture
> of such files and/or are targeting cross-platform capabilities. The
> major competing solutions git seeks to supplant (cvs, cvsnt, svn, hg)
> have capability to recognize "text" files and transparently replace \r\n
> with \n on input, the reverse on output, and ignore all such differences
> on diff operations. To be relevant on native Windows, git must do the
> same. Otherwise, git will be deemed "too weird" and dismissed in favor
> of a tool "that works."
>
As of today git is a POSIX tool simply because it's not fully ported to
other environments. I brought this up quite a while ago, and didn't face heavy artillery
then, and wouldn't today either. The code is still missing though. I didn't
write it then, because it isn't my #1 priority, and nobody else did. Linus even did a
rough sketch, but that's it.
I guess git will get this feature when someone does the code for it.
-- robin
* Re: mingw, windows, crlf/lf, and git
2007-02-11 23:34 ` Johannes Schindelin
@ 2007-02-12 0:46 ` Jakub Narebski
2007-02-12 2:36 ` Mark Levedahl
2007-02-12 11:21 ` Johannes Schindelin
0 siblings, 2 replies; 83+ messages in thread
From: Jakub Narebski @ 2007-02-12 0:46 UTC (permalink / raw)
To: git
Johannes Schindelin wrote:
> On Sun, 11 Feb 2007, Mark Levedahl wrote:
>
>> The major competing solutions git seeks to supplant (cvs, cvsnt, svn,
>> hg) have capability to recognize "text" files and transparently replace
>> \r\n with \n on input, the reverse on output, and ignore all such
>> differences on diff operations.
>
> Agree with transformations on input and output; disagree on diff.
I wonder if this could/should be solved with adding some option to git-diff,
similar to --ignore-space-change and --ignore-all-space...
Just an [idle] thought.
--
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git
* Re: mingw, windows, crlf/lf, and git
2007-02-12 0:46 ` Jakub Narebski
@ 2007-02-12 2:36 ` Mark Levedahl
2007-02-12 11:21 ` Johannes Schindelin
1 sibling, 0 replies; 83+ messages in thread
From: Mark Levedahl @ 2007-02-12 2:36 UTC (permalink / raw)
To: Jakub Narebski; +Cc: git
Jakub Narebski wrote:
> Johannes Schindelin wrote:
>
>> On Sun, 11 Feb 2007, Mark Levedahl wrote:
>>
>>
>>> The major competing solutions git seeks to supplant (cvs, cvsnt, svn,
>>> hg) have capability to recognize "text" files and transparently replace
>>> \r\n with \n on input, the reverse on output, and ignore all such
>>> differences on diff operations.
>>>
>> Agree with transformations on input and output; disagree on diff.
>>
>
> I wonder if this could/should be solved with adding some option to git-diff,
> similar to --ignore-space-change and --ignore-all-space...
>
> Just an [idle] thought.
>
That would work. Assuming blobs are stored with \n, diff just has to
open files in 'rt' mode rather than just 'r' and the \r\n are
transformed on read so are never seen by git code. That is basically
what Windows native tools do, but they also write files opened in 'wt'
mode so \n become \r\n on output. Of course, if this were an option,
users could look for line ending differences if they cared.
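
As an illustration of the read-side transformation described above, here is a minimal C sketch. The function name `crlf_to_lf` is hypothetical and does not come from git; the sketch also assumes that only a CR immediately followed by LF is dropped, so a lone CR survives.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Convert every "\r\n" pair in buf[0..len) to "\n", in place.
 * A lone '\r' that is not followed by '\n' is kept unchanged.
 * Returns the new length, which is never larger than len.
 * (Hypothetical helper for illustration; not a git function.)
 */
size_t crlf_to_lf(char *buf, size_t len)
{
    size_t in, out = 0;

    assert(buf != NULL || len == 0);
    for (in = 0; in < len; in++) {
        if (buf[in] == '\r' && in + 1 < len && buf[in + 1] == '\n')
            continue;   /* drop the CR; the LF is copied on the next pass */
        buf[out++] = buf[in];
    }
    return out;
}
```

Doing the conversion explicitly on a buffer, rather than relying on the C runtime's 'rt' mode, keeps the behaviour identical on every platform, since POSIX systems treat 'r' and 'rt' the same.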
Mark
* Re: mingw, windows, crlf/lf, and git
2007-02-12 0:14 ` Robin Rosenberg
@ 2007-02-12 2:37 ` Mark Levedahl
0 siblings, 0 replies; 83+ messages in thread
From: Mark Levedahl @ 2007-02-12 2:37 UTC (permalink / raw)
To: Robin Rosenberg; +Cc: Mark Levedahl, Git Mailing List
Robin Rosenberg wrote:
> As of today git is a POSIX tool simply because it's not fully ported to
> other environments. I brought this up quite a while ago, and didn't face heavy artillery
> then, and wouldn't today either. The code is still missing though. I didn't
> write it then, because it isn't my #1 priority, and nobody else did. Linus even did a
> rough sketch, but that's it.
So, the basic design for this feature exists where? I would assume this
would include a file mode indicator set in the blob or tree designating
the blob is "text", along with mechanism to specify for a project what
files are "text", along with some safety valve to check and not do
transformation when the file does not look text-ish.
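
The "safety valve" mentioned above could look roughly like the following C sketch. The name `looks_like_text`, the `SNIFF_LEN` limit, and the rule itself (a NUL byte near the start means binary) are all assumptions chosen for illustration; nothing in this thread specifies the actual heuristic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical limit: only this many leading bytes are inspected. */
#define SNIFF_LEN 8000

/*
 * Return 1 if the buffer looks like text, 0 otherwise.
 * Assumed rule: any NUL byte within the first SNIFF_LEN bytes
 * marks the content as binary, so no conversion should be applied.
 */
int looks_like_text(const char *buf, size_t len)
{
    size_t i, n = len < SNIFF_LEN ? len : SNIFF_LEN;

    assert(buf != NULL || len == 0);
    for (i = 0; i < n; i++)
        if (buf[i] == '\0')
            return 0;
    return 1;
}
```

A caller would run this check before any line-ending conversion and leave the content untouched when it returns 0, even if the file was declared "text".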
Mark
* Re: mingw, windows, crlf/lf, and git
2007-02-11 23:13 mingw, windows, crlf/lf, and git Mark Levedahl
2007-02-11 23:34 ` Johannes Schindelin
2007-02-12 0:14 ` Robin Rosenberg
@ 2007-02-12 4:24 ` Theodore Tso
2007-02-12 7:28 ` David Lang
` (2 more replies)
2007-02-13 2:02 ` Junio C Hamano
2007-02-13 3:32 ` Alexander Litvinov
4 siblings, 3 replies; 83+ messages in thread
From: Theodore Tso @ 2007-02-12 4:24 UTC (permalink / raw)
To: Mark Levedahl; +Cc: Git Mailing List
On Sun, Feb 11, 2007 at 06:13:16PM -0500, Mark Levedahl wrote:
> I am NOT intending to start a flamewar O:-) , so please don't turn this
> into one.
>
> The recent threads on a mingw git port are explicit in the intent to
> provide a Windows native git. I believe there is a fundamental conflict
> here with the position, clearly stated by Linus, that git does not alter
> content in any way. Windows suffers the curse of DOS line endings (\r\n
> vs \n), and a true port to Windows *must* allow for \r\n and \n to be
> semantically the same thing as most large projects end up with a mixture
> of such files and/or are targeting cross-platform capabilities. The
> major competing solutions git seeks to supplant (cvs, cvsnt, svn, hg)
> have capability to recognize "text" files and transparently replace \r\n
> with \n on input, the reverse on output, and ignore all such differences
> on diff operations. To be relevant on native Windows, git must do the
> same. Otherwise, git will be deemed "too weird" and dismissed in favor
> of a tool "that works."
So this is something that I've tried proposing to the Mercurial
developers, but it's never been implemented in hg. It'll be
interesting to see what the git community thinks. :-)
My proposal does require adding a file type to each file, as tracked
metadata, which may doom it from the start. If you add a file type,
then you have to support mutating the file type, and some way of
handling merge conflicts (generally, picking one type or another).
Then for each file type, we implement a set of interfaces (perhaps as
simple as a series of executables named git-<type>-<operation>) which
if present, transforms the file from its live format to the canonical
format which is actually checked in and back again. Besides using
this for the DOS CR/LF problem, it also allows for an efficient
storage of things like OpenOffice files which are a zipped set of .xml
files. By decompressing them before pushing them into the SCM, it
means that if the user makes a tiny spelling correction in their
OpenOffice file, the delta stored in the git repository can be much
more efficiently stored (since the diff of the .xml tree will be
small, whereas the diff of the entire compressed file is likely going
to be close to the entire size of the .odt file).
Another nice thing to provide for each file type would be a
pretty-printer for the diffs, so it becomes easier to see the delta
between two versions of an OpenOffice file in a textual window.
So, is this idea sane or completely insane? Hopefully it passes
Linus's it-solves-multiple-problems-at-once test, at least. :-)
- Ted
* Re: mingw, windows, crlf/lf, and git
2007-02-12 4:24 ` Theodore Tso
@ 2007-02-12 7:28 ` David Lang
2007-02-12 11:36 ` Johannes Schindelin
2007-02-12 17:20 ` Linus Torvalds
2 siblings, 0 replies; 83+ messages in thread
From: David Lang @ 2007-02-12 7:28 UTC (permalink / raw)
To: Theodore Tso; +Cc: Mark Levedahl, Git Mailing List
On Sun, 11 Feb 2007, Theodore Tso wrote:
> Then for each file type, we implement a set of interfaces (perhaps as
> simple as a series of executables named git-<type>-<operation>) which
> if present, transforms the file from its live format to the canonical
> format which is actually checked in and back again. Besides using
> this for the DOS CR/LF problem, it also allows for an efficient
> storage of things like OpenOffice files which are a zipped set of .xml
> files. By decompressing them before pushing them into the SCM, it
> means that if the user makes a tiny spelling correction in their
> OpenOffice file, the delta stored in the git repository can be much
> more efficiently stored (since the diff of the .xml tree will be
> small, whereas the diff of the entire compressed file is likely going
> to be close to the entire size of the .odt file).
>
> Another nice thing to provide for each file type would be a
> pretty-printer for the diffs, so it becomes easier to see the delta
> between two versions of an OpenOffice file in a textual window.
>
> So, is this idea sane or completely insane? Hopefully it passes
> Linus's it-solves-multiple-problems-at-once test, at least. :-)
There have been other things discussed that could use the 'do this on checkout'
hooks. Specifically, on the issue of using git to manage /etc, the need to
save/restore permissions requires a hook on checkout that doesn't exist yet.
This sounds like it would solve that problem as well.
David Lang
* Re: mingw, windows, crlf/lf, and git
2007-02-12 0:46 ` Jakub Narebski
2007-02-12 2:36 ` Mark Levedahl
@ 2007-02-12 11:21 ` Johannes Schindelin
1 sibling, 0 replies; 83+ messages in thread
From: Johannes Schindelin @ 2007-02-12 11:21 UTC (permalink / raw)
To: Jakub Narebski; +Cc: Mark Levedahl, git
Hi,
[Cc'ing git list, which I sometimes have to do when Jakub replies]
On Mon, 12 Feb 2007, Jakub Narebski wrote:
> Johannes Schindelin wrote:
> > On Sun, 11 Feb 2007, Mark Levedahl wrote:
> >
> >> The major competing solutions git seeks to supplant (cvs, cvsnt, svn,
> >> hg) have capability to recognize "text" files and transparently replace
> >> \r\n with \n on input, the reverse on output, and ignore all such
> >> differences on diff operations.
> >
> > Agree with transformations on input and output; disagree on diff.
>
> I wonder if this could/should be solved with adding some option to git-diff,
> similar to --ignore-space-change and --ignore-all-space...
It could be done, but those options were introduced for CRLF breakage in
the first place.
You need --ignore-crlf-breakage? Just holler.
Ciao,
Dscho
* Re: mingw, windows, crlf/lf, and git
2007-02-12 4:24 ` Theodore Tso
2007-02-12 7:28 ` David Lang
@ 2007-02-12 11:36 ` Johannes Schindelin
2007-02-12 17:20 ` Linus Torvalds
2 siblings, 0 replies; 83+ messages in thread
From: Johannes Schindelin @ 2007-02-12 11:36 UTC (permalink / raw)
To: Theodore Tso; +Cc: Mark Levedahl, Git Mailing List
Hi,
On Sun, 11 Feb 2007, Theodore Tso wrote:
> My proposal does require adding a file type to each file, as tracked
> metadata, which may doom it from the start.
I'd rather do that a la .gitignore, i.e. make this handling dependent on
file name patterns. It is not only backwards compatible (from the
viewpoint of the repository format), it also avoids having to specify over
and over again that yes, this new .odt file _is_ an OpenOffice document.
> Then for each file type, we implement a set of interfaces (perhaps as
> simple as a series of executables named git-<type>-<operation>) which
> if present, transforms the file from its live format to the canonical
> format which is actually checked in and back again.
Again, I propose a slight change: Let's add a transformation driver like
the merge driver: this allows inlining common operations like unzipping,
CRLF->LF conversion, etc.
Ciao,
Dscho
* Re: mingw, windows, crlf/lf, and git
2007-02-12 4:24 ` Theodore Tso
2007-02-12 7:28 ` David Lang
2007-02-12 11:36 ` Johannes Schindelin
@ 2007-02-12 17:20 ` Linus Torvalds
2007-02-12 22:37 ` Johannes Schindelin
2007-02-12 22:54 ` Junio C Hamano
2 siblings, 2 replies; 83+ messages in thread
From: Linus Torvalds @ 2007-02-12 17:20 UTC (permalink / raw)
To: Theodore Tso; +Cc: Mark Levedahl, Git Mailing List
On Sun, 11 Feb 2007, Theodore Tso wrote:
>
> So this is something that I've tried proposing to the Mercurial
> developers, but it's never been implemented in hg. It'll be
> interesting to see what the git community thinks. :-)
>
> My proposal does require adding a file type to each file, as tracked
> metadata, which may doom it from the start. If you add a file type,
> then you have to support mutating the file type, and some way of
> handling merge conflicts (generally, picking one type or another).
I agree that a file-type approach would work, but I personally think it's
too inflexible (just cr/lf vs lf? There are tons of other interesting
issues that are valid). I also think it falls down on another (and in some
ways much more fundamental problem): these things exist EVEN WHEN THE FILE
ITSELF DOES NOT EXIST!
In other words, a policy about cr/lf is *not* a policy about actual
content. It's something much more: it's a policy about representation in
general, which includes *potential* content. It should obviously take
effect on "git add" even with content that didn't exist before, and to
work well, it should do so without the user having to think about it.
Equally importantly, this happens with content that was added by people
who simply DO NOT CARE. In other words, I think a "file type" thing
fundamentally cannot work, because under UNIX, it would be stupid and
pointless, so any project that is maintained under UNIX might _add_ the
file types, but since they won't matter, they'll inevitably be wrong (ie
people forgot to mark a binary thing binary, or a text thing as text).
So: file types or attributes are broken. They cannot work well.
But enough on the negative rambling, I do have a positive and constructive
suggestion, because I actually think I have a great model for it. But I've
never cared enough (and since the main target would be some windows issue,
I suspect I never really _will_ care enough) to really worry about it.
Anyway, if somebody really wants to look at this, and wants to create
something that is actually _usable_, my suggestion is to simply extend on
the ".gitignore" file approach. The great thing about .gitignore is that
(a) you can track it like you track any other file
This makes merges a *lot* easier. You see it as conflicts, you can
fix it up, and in general, you can use all the same tools with it as
you use with anything else. In contrast, explicit per-file filetypes
are _horrible_ for maintenance.
(b) you can add to it with *patterns*, which is exactly what you want for
file types.
You can do things like
*.bin: binary
*: text
to say "everything that matches *.bin is binary, the rest is text",
and solves the maintenance issue trivially. Everybody will like it.
For the kernel, for example, we'd have a really easy
Documentation/logo.gif: binary
*: text
and that would probably take care of it.
You can also have a few default file patterns built in, which would
take care of it for 99% of all projects without anybody ever having
to even think about it - even under DOS.
(c) it doesn't actually affect database representation, it only changes
behaviour for programs, which is also exactly what you want (if you
have per-file "file types", you end up having serious problems at
merge time: when I say "affect database representation", I don't mean
that I think git cannot change its database, I literally mean at a
"higher" level: representing per-file attributes is a DISASTER from a
merge situation)
So not only is it backwards-compatible with traditional git usage,
it's much more fundamentally simple: it doesn't add any new core data
structures or rules. All the core stays exactly as it is, and it just
affects higher-level behaviour. And that's important: one reason git
has been so stable is that the really core data structures are really
really stable and simple.
Even when we did *really* core changes like the whole packfile thing,
the fundamental data structures didn't change at all *conceptually*.
(d) it's actually a lot more flexible than file types.
Merge strategies, anybody? We can easily have the default merge
strategy be the normal three-way merge (which is obviously the right
thing for almost anything), but how about something like
*.doc: binary,merge=doc-merge
which tells git that it should use a separate "doc-merge" program to
merge those kinds of files when it needs to do a nontrivial merge..
(e) exactly like ".gitignore", you should also be able to have a
".git/info/exclude" file that is your _private_ rules, and
per-directory ".gitignore" files that are the _hierarchical_ rules.
This just makes maintenance much simpler. Not one big file that has
everything, and that clashes. Make the top-level one contain all the
generic default rules, and then lower down we can have more specific
rules for very specific things, exactly like the kernel .gitignore
files do. The top-level file should *not* have to know all the
details of some architecture- or sub-project specific file behaviour.
Similarly, having an untracked file (.git/info/exclude) allows people
to have rules that make sense for *them*, but that might not make
sense for the upstream developers (say, somebody crazy enough to
develop Linux under Windows). So people can have their purely local
rules without forcing them on others.
Anyway, that would be my suggestion. Call it ".gitattributes" or
something. Make it a nice ASCII format, exactly like .gitignore, and make
all the rules exactly the same, except it has a ": <attributelist>" at the
end for each line.
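
To make the proposed "<pattern>: <attributelist>" line format concrete, here is a minimal C sketch of parsing and lookup. Everything named here (`struct attr_rule`, `parse_attr_line`, `lookup_attrs`) is hypothetical and not existing git code. The sketch assumes that lines arrive with the trailing newline already stripped, that the first matching rule wins (as the "*.bin" before "*" ordering in the examples suggests), and that POSIX `fnmatch()` is an acceptable stand-in for the .gitignore matching rules.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>
#include <string.h>

/* One parsed rule; the struct and its field names are hypothetical. */
struct attr_rule {
    const char *pattern;   /* glob such as "*.bin" */
    const char *attrs;     /* attribute list such as "binary" */
};

/*
 * Split a line of the form "<pattern>: <attributelist>" in place.
 * Returns 0 on success, -1 if the line has no ": " separator.
 */
int parse_attr_line(char *line, struct attr_rule *rule)
{
    char *sep;

    assert(line != NULL && rule != NULL);
    sep = strstr(line, ": ");
    if (!sep)
        return -1;
    *sep = '\0';               /* terminate the pattern part */
    rule->pattern = line;
    rule->attrs = sep + 2;     /* skip over ": " */
    return 0;
}

/*
 * Return the attribute list of the first rule whose pattern matches
 * path, or NULL if no rule matches. With flags == 0, '*' in a pattern
 * also matches '/' characters in the path.
 */
const char *lookup_attrs(const struct attr_rule *rules, size_t nr,
                         const char *path)
{
    size_t i;

    for (i = 0; i < nr; i++)
        if (fnmatch(rules[i].pattern, path, 0) == 0)
            return rules[i].attrs;
    return NULL;
}
```

With the two example rules "*.bin: binary" and "*: text", a path such as "firmware/blob.bin" resolves to "binary" and everything else falls through to "text". Per-directory files and the private .git/info/exclude-style file are left out of this sketch.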
Start off supporting just "binary" and "text", but keep in mind that
people may want other things. Individualized merge strategies etc.
Also, keep in mind that a *lot* of git operations will work purely on a
SHA1 level, and those operations fundamentally *will*not*care* about file
types. So when you merge a file, for example, the initial merge will be
done purely on SHA1's, and git would do all the normal "if it didn't
change in branch 1, take the branch 2 version directly" without ever even
*looking* at any file rules.
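
The SHA1-level shortcut described above can be sketched in C as follows. The function `trivial_merge` and the `enum merge_result` names are hypothetical, chosen only to illustrate the rule; the point is that the decision uses nothing but the three object names, so no attribute file is ever consulted on this path.

```c
#include <assert.h>
#include <string.h>

#define OID_LEN 20   /* raw SHA-1 size in bytes */

/* Hypothetical result codes for this illustration. */
enum merge_result {
    TAKE_OURS,
    TAKE_THEIRS,
    NEED_CONTENT_MERGE
};

/*
 * Decide a per-file merge from object names alone.
 * base is the common ancestor, ours and theirs are the two branches.
 */
enum merge_result trivial_merge(const unsigned char *base,
                                const unsigned char *ours,
                                const unsigned char *theirs)
{
    assert(base && ours && theirs);
    if (!memcmp(ours, theirs, OID_LEN))
        return TAKE_OURS;       /* both sides ended up identical */
    if (!memcmp(base, ours, OID_LEN))
        return TAKE_THEIRS;     /* only their side changed */
    if (!memcmp(base, theirs, OID_LEN))
        return TAKE_OURS;       /* only our side changed */
    return NEED_CONTENT_MERGE;  /* both changed: file rules matter now */
}
```

Only the last case would need to open the blobs, and only there would a per-pattern rule such as a custom merge program come into play.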
This is important, because this is what makes git efficient for large
projects, and which would allow git to _remain_ efficient even in the face
of having to read all those complex .gitattributes files. When we merge
two repositories with 20,000+ files, we usually really only "merge" a
couple of the files.
Same goes for "text" mode. The "text" thing would only affect things like
"git add" etc that use "git-update-index" to calculate the new SHA1. We'd
never use it "normally". "git diff" would still be instantaneous, because
the git index shows the file still matches, and that is all done on a SHA1
only level. So only when you do a "git add" or when it needs to refresh
the index because the file changed, and it reads in the file, will it
actually care about whether it's a text or a binary file.
This is actually *exactly* what you want. Not just for performance, but
simply because this is also how you can take something like the Linux
archive, and "just use it" under Windows, even if your editor adds (or
wants) CR/LF.
Btw, how would I implement this? If I really were energetic enough to
implement it, I would do:
(a) Add a flag to "git-ls-files" logic to add "type information" in
front.
Not only do you want this *anyway* for other reasons, but for
binary/text, the thing you actually care most about is "git add", and
it already basically just does "take this file pattern, feed it
through git-ls-files, and add those files". So you'd get it basically
for free.
It is also fairly easy to add at this stage, because you can simply
look for all the places that work with "info/exclude" and
".gitignore", and you know that "Ahh, I need to teach these exact
places to understand about attributes". So you'd add an
"add_attributes_from_file()" function etc etc.
Quite straightforward. In fact, you might be able to use the
gitignore parsing *as*is*, and just teach it about more flags than
just "ignore": both in "struct dir_entry" and in "struct exclude".
(b) Teach the git-update-index logic about hashing text blobs.
(c) Profit!
It really should be fairly straightforward. I'm sure it wouldn't be
*entirely* trivial, but I'm also fairly sure that somebody reasonably
competent could do it in a couple of days (with testing) if they were just
sufficiently motivated to get started.
Anybody?
Linus
* Re: mingw, windows, crlf/lf, and git
2007-02-12 17:20 ` Linus Torvalds
@ 2007-02-12 22:37 ` Johannes Schindelin
2007-02-12 23:02 ` Linus Torvalds
2007-02-12 22:54 ` Junio C Hamano
1 sibling, 1 reply; 83+ messages in thread
From: Johannes Schindelin @ 2007-02-12 22:37 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Theodore Tso, Mark Levedahl, Git Mailing List
Hi,
[I agree on the .gitignore approach; see my other mail in this thread]
On Mon, 12 Feb 2007, Linus Torvalds wrote:
> Btw, how would I implement this? If I really were energetic enough to
> implement it, I would do:
>
> (a) Add a flag to "git-ls-files" logic to add "type information" in
> front.
>
> Not only do you want this *anyway* for other reasons, but for
> binary/text, the thing you actually care most about is "git add", and
> it already basically just does "take this file pattern, feed it
> through git-ls-files, and add those files". So you'd get it basically
> for free.
>
> It is also fairly easy to add at this stage, because you can simply
> look for all the places that work with "info/exclude" and
> ".gitignore", and you know that "Ahh, I need to teach these exact
> places to understand about attributes". So you'd add an
> "add_attributes_from_file()" function etc etc.
>
> Quite straightforward. In fact, you might be able to use the
> gitignore parsing *as*is*, and just teach it about more flags than
> just "ignore": both in "struct dir_entry" and in "struct exclude".
>
> (b) Teach the git-update-index logic about hashing text blobs.
>
> (c) Profit!
Not so fast.
In order for this to be _useful_, you also have to have a way to _extract_
the text blobs. Not only for read-tree, but _also_ for diff. It makes no
sense at all to have this transformation one-way. For diff, you _might_
want to have a diff beautifier (for example the .odt thing), but read-tree
is _really_ important.
Ciao,
Dscho
* Re: mingw, windows, crlf/lf, and git
2007-02-12 17:20 ` Linus Torvalds
2007-02-12 22:37 ` Johannes Schindelin
@ 2007-02-12 22:54 ` Junio C Hamano
2007-02-12 23:02 ` Junio C Hamano
` (2 more replies)
1 sibling, 3 replies; 83+ messages in thread
From: Junio C Hamano @ 2007-02-12 22:54 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Theodore Tso, Mark Levedahl, Git Mailing List
Linus Torvalds <torvalds@linux-foundation.org> writes:
> Btw, how would I implement this? If I really were energetic enough to
> implement it, I would do:
>
> (a) Add a flag to "git-ls-files" logic to add "type information" in
> front.
>
> Not only do you want this *anyway* for other reasons, but for
> binary/text, the thing you actually care most about is "git add", and
> it already basically just does "take this file pattern, feed it
> through git-ls-files, and add those files". So you'd get it basically
> for free.
>
> It is also fairly easy to add at this stage, because you can simply
> look for all the places that work with "info/exclude" and
> ".gitignore", and you know that "Ahh, I need to teach these exact
> places to understand about attributes". So you'd add an
> "add_attributes_from_file()" function etc etc.
>
> Quite straightforward. In fact, you might be able to use the
> gitignore parsing *as*is*, and just teach it about more flags than
> just "ignore": both in "struct dir_entry" and in "struct exclude".
>
> (b) Teach the git-update-index logic about hashing text blobs.
I agree that we can assume editors can grok files with LF
end-of-line just fine and we would not need to do the reverse
conversion on checkout paths (e.g. "read-tree -u", "checkout-index").
Textual diff generation needs to learn the CRLF-to-LF conversion
in diff_populate_filespec(); this needs to be done even when the
caller wants size_only.
Oops.
Not so fast. What's your plan for st_size?
> (c) Profit!
>
> It really should be fairly straightforward. I'm sure it wouldn't be
> *entirely* trivial, but I'm also fairly sure that somebody reasonably
> competent could do it in a couple of days (with testing) if they were just
> sufficiently motivated to get started.
>
> Anybody?
Not me.
* Re: mingw, windows, crlf/lf, and git
2007-02-12 22:37 ` Johannes Schindelin
@ 2007-02-12 23:02 ` Linus Torvalds
0 siblings, 0 replies; 83+ messages in thread
From: Linus Torvalds @ 2007-02-12 23:02 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: Theodore Tso, Mark Levedahl, Git Mailing List
On Mon, 12 Feb 2007, Johannes Schindelin wrote:
>
> > (c) Profit!
>
> Not so fast.
Aww! And just when I _finally_ had a "step 2".
> In order for this to be _useful_, you also have to have a way to _extract_
> the text blobs. Not only for read-tree, but _also_ for diff.
Actually, my argument is that we don't need it all that much.
For example, your "read-tree" argument is actually wrong. Anything that is
in a tree is _already_ fixed to be '\n'. So as long as we keep to things
like
git diff version1..version2
we'll actually always get the right version.
Also, the index will make sure that we don't even *try* to diff normal
checked out files.
So the only time you actually really need to test the .gitattributes file
is when you do an "open blob in working tree". And once you do that
function right, and just make sure both git-update-index and yes, the
"diff against working tree" cases use it, you really should be mostly
done.
Both git-update-index and git-diff-files want basically the same
interface:
struct file_buf {
    const char *buf;
    unsigned long size;
    int flags;
};
int read_file(const char *path, struct file_buf *);
void close_file(struct file_buf *);
and we should use that instead of the current "open + stat + mmap/read +
close" sequences.
It really shouldn't be too nasty.
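
A minimal, hedged C sketch of how the interface above might be filled in follows. It is an illustration only, not git code: the `FILE_BUF_TEXT` flag is a hypothetical name, the caller is assumed to have already decided from the attribute rules whether the path is text, `buf` is made non-const here so it can be freed, and plain stdio is used in place of the "open + stat + mmap/read" sequence.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define FILE_BUF_TEXT 1   /* hypothetical flag: normalize CRLF to LF */

struct file_buf {
    char *buf;            /* non-const in this sketch so it can be freed */
    unsigned long size;   /* size after any conversion */
    int flags;            /* set by the caller before read_file() */
};

/*
 * Read the working-tree file at path into fb. If fb->flags contains
 * FILE_BUF_TEXT, every "\r\n" is collapsed to "\n" before returning.
 * Returns 0 on success, -1 on any error.
 */
int read_file(const char *path, struct file_buf *fb)
{
    FILE *f;
    long len;
    size_t got, in, out = 0;

    assert(path != NULL && fb != NULL);
    f = fopen(path, "rb");
    if (!f)
        return -1;
    if (fseek(f, 0, SEEK_END) || (len = ftell(f)) < 0 ||
        fseek(f, 0, SEEK_SET)) {
        fclose(f);
        return -1;
    }
    fb->buf = malloc((size_t)len + 1);
    if (!fb->buf) {
        fclose(f);
        return -1;
    }
    got = fread(fb->buf, 1, (size_t)len, f);
    fclose(f);
    if (got != (size_t)len) {
        free(fb->buf);
        fb->buf = NULL;
        return -1;
    }
    if (fb->flags & FILE_BUF_TEXT) {
        for (in = 0; in < got; in++) {
            if (fb->buf[in] == '\r' && in + 1 < got &&
                fb->buf[in + 1] == '\n')
                continue;           /* drop the CR of a CRLF pair */
            fb->buf[out++] = fb->buf[in];
        }
    } else {
        out = got;                  /* binary: keep bytes as they are */
    }
    fb->size = out;
    return 0;
}

/* Release the buffer obtained from read_file(). */
void close_file(struct file_buf *fb)
{
    assert(fb != NULL);
    free(fb->buf);
    fb->buf = NULL;
    fb->size = 0;
}
```

Note that `fb->size` is the post-conversion size used for hashing and diffing, which can be smaller than the size the filesystem reports for the same file.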
Linus
* Re: mingw, windows, crlf/lf, and git
2007-02-12 22:54 ` Junio C Hamano
@ 2007-02-12 23:02 ` Junio C Hamano
2007-02-12 23:09 ` Linus Torvalds
2007-02-12 23:24 ` Johannes Schindelin
2 siblings, 0 replies; 83+ messages in thread
From: Junio C Hamano @ 2007-02-12 23:02 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Theodore Tso, Mark Levedahl, Git Mailing List
Junio C Hamano <junkio@cox.net> writes:
> Linus Torvalds <torvalds@linux-foundation.org> writes:
>
>> Btw, how would I implement this? If I really were energetic enough to
>> implement it, I would do:
>> ...
>> (b) Teach the git-update-index logic about hashing text blobs.
>
> I agree that we can assume editors can grok files with LF
> end-of-line just fine and we would not need to do the reverse
> conversion on checkout paths (e.g. "read-tree -u", "checkout-index").
>
> Textual diff generation needs to learn the CRLF-to-LF conversion
> in diff_populate_filespec(); this needs to be done even when the
> caller wants size_only.
>
> Oops.
>
> Not so fast. What's your plan for st_size?
If I were to do this, I would say the cache should store the
size on the filesystem in stat fields. Which means that the
object name recorded is text blob _after_ line endings are
normalized to LF, and its exploded size does not necessarily
match the cached size.
So this means that whoever does the diff_populate_filespec()
change needs to be careful, but it is not such a big deal.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-12 22:54 ` Junio C Hamano
2007-02-12 23:02 ` Junio C Hamano
@ 2007-02-12 23:09 ` Linus Torvalds
2007-02-12 23:25 ` Linus Torvalds
2007-02-12 23:24 ` Johannes Schindelin
2 siblings, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2007-02-12 23:09 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Theodore Tso, Mark Levedahl, Git Mailing List
On Mon, 12 Feb 2007, Junio C Hamano wrote:
>
> Not so fast. What's your plan for st_size?
Umm. There's two (very distinct) uses for st_size.
The one that we actually use to validate the current index obviously must
match the "OS returned value". It contains all the CR/LF stuff.
The one where we actually read the file and run SHA1 on the result must
obviously be the post-conversion one.
But it shouldn't be a problem. We'll always know which one matters: the
index case is always about pure stat information (and has no meaning
outside of that, really - after all, it's no different from st_mode etc,
and we actually keep it in a special binary format that is endian-safe!)
and the "real object" case is always about the *data* we use to compare
with.
I don't think we ever mix the two anyway.
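To make the distinction concrete, here is a small sketch (not git code;
the struct and function names are made up for illustration) that
computes both numbers for one working-tree buffer: the size that stat()
would report, and the size of the data that would actually be hashed
after CRLF -> LF conversion.

```c
#include <assert.h>
#include <stddef.h>

struct text_sizes {
	size_t stat_size;	/* raw bytes on disk, CRs included */
	size_t blob_size;	/* bytes after CRLF -> LF, i.e. what is hashed */
};

/* Measure a raw working-tree buffer both ways. */
struct text_sizes measure_text(const char *raw, size_t len)
{
	struct text_sizes s;
	size_t i, crlf = 0;

	assert(raw || len == 0);
	/* Each CRLF pair shrinks by exactly one byte when normalized. */
	for (i = 0; i + 1 < len; i++)
		if (raw[i] == '\r' && raw[i + 1] == '\n')
			crlf++;
	s.stat_size = len;
	s.blob_size = len - crlf;
	return s;
}
```

The index would only ever compare stat_size against st_size, and the
object code would only ever see blob_size.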
Linus
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-12 23:25 ` Linus Torvalds
@ 2007-02-12 23:23 ` David Lang
0 siblings, 0 replies; 83+ messages in thread
From: David Lang @ 2007-02-12 23:23 UTC (permalink / raw)
To: Linus Torvalds
Cc: Junio C Hamano, Theodore Tso, Mark Levedahl, Git Mailing List
On Mon, 12 Feb 2007, Linus Torvalds wrote:
> So we'd just need to pass in the information about whether it's binary or
> not, and then do something like
>
> @@ -2091,6 +2091,10 @@ int index_fd(unsigned char *sha1, int fd, struct stat *st, int write_object, con
>
> if (!type)
> type = blob_type;
> +#ifndef __UNIX__
> + if (text && !strcmp(type, blob_type))
> + convert_crlf_to_lf(&buf, &size);
> +#endif
> if (write_object)
> ret = write_sha1_file(buf, size, type, sha1);
> else
>
> and that would take care of a lot of things (yeah, I'd not do it that way
> in practice, but really doesn't look that nasty - it's actually much
> nastier to have to look up the text/binary type in the first place).
you could do something like this and it would deal with the crlf/lf problem, but
if you instead put in the conversion hooks like Ted suggested then you can
actually gain a LOT more.
his example of openoffice documents that are gzipped xml files is a very good
one. if the 'conversion' is to gunzip on checkin and gzip on checkout then the
core git logic will work on the nice diffable xml instead of the compressed
binary blob.
if this is extensible to arbitrary helper functions to do the conversions I'll
bet that there are many other cases that can use this.
I think the big question needs to be, is this helper app a filter, or can it be
passed a filename as the destination (which would let it do things like set
permissions on the files it creates), or should it be both?
David Lang
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-12 22:54 ` Junio C Hamano
2007-02-12 23:02 ` Junio C Hamano
2007-02-12 23:09 ` Linus Torvalds
@ 2007-02-12 23:24 ` Johannes Schindelin
2007-02-12 23:42 ` Junio C Hamano
2007-02-13 0:32 ` Mark Levedahl
2 siblings, 2 replies; 83+ messages in thread
From: Johannes Schindelin @ 2007-02-12 23:24 UTC (permalink / raw)
To: Junio C Hamano
Cc: Linus Torvalds, Theodore Tso, Mark Levedahl, Git Mailing List
Hi,
On Mon, 12 Feb 2007, Junio C Hamano wrote:
> I agree that we can assume editors can grok files with LF end-of-line
> just fine and we would not need to do the reverse conversion on checkout
> paths (e.g. "read-tree -u", "checkout-index").
In that case, a simple pre-commit hook would suffice.
No, the problem mentioned by Mark was a very real one: you _cannot_ rely
on Windows' editors not to fsck up with line endings. The worst case is if
the file contains _some_ CRLF and _some_ LF. Almost always I had the
problem that it now converted _all_ LFs to CRLFs. Even those which already
were converted.
So, if we are to support text mode, it is not one-way. If we do one-way,
we really do _not_ support text mode, but pre-commit conversion to LF
style text. And in this case, core git does not need _any_ change.
Ciao,
Dscho
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-12 23:09 ` Linus Torvalds
@ 2007-02-12 23:25 ` Linus Torvalds
2007-02-12 23:23 ` David Lang
0 siblings, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2007-02-12 23:25 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Theodore Tso, Mark Levedahl, Git Mailing List
On Mon, 12 Feb 2007, Linus Torvalds wrote:
>
> But it shouldn't be a problem. We'll always know which one matters: the
> index case is always about pure stat information (and has no meaning
> outside of that, really - after all, it's no different from st_mode etc,
> and we actually keep it in a special binary format that is endian-safe!)
> and the "real object" case is always about the *data* we use to compare
> with.
In fact, for git-update-index, I think it's *literally* as easy as just
changing "index_fd()" to convert the buffer on-the-fly as needed, before
we actually call "write_sha1_file()" or "hash_sha1_file()".
So we'd just need to pass in the information about whether it's binary or
not, and then do something like
@@ -2091,6 +2091,10 @@ int index_fd(unsigned char *sha1, int fd, struct stat *st, int write_object, con
if (!type)
type = blob_type;
+#ifndef __UNIX__
+ if (text && !strcmp(type, blob_type))
+ convert_crlf_to_lf(&buf, &size);
+#endif
if (write_object)
ret = write_sha1_file(buf, size, type, sha1);
else
and that would take care of a lot of things (yeah, I'd not do it that way
in practice, but really doesn't look that nasty - it's actually much
nastier to have to look up the text/binary type in the first place).
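The convert_crlf_to_lf() call above is only named, not defined. One
possible shape for it, as a sketch under assumptions: the return value
convention is invented here, and the function hands back a fresh
malloc()ed copy instead of modifying the input, because the input may be
a read-only mapping.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * If the buffer contains any CRLF pair, point *buf at a freshly
 * malloc()ed LF-only copy, update *size and return 1 (the caller must
 * free() the new buffer and still release the old one).
 * Return 0 and leave everything untouched if there is nothing to
 * convert, or -1 if the allocation fails.
 */
int convert_crlf_to_lf(char **buf, unsigned long *size)
{
	const char *src;
	char *dst;
	unsigned long i, out = 0, crlf = 0;

	assert(buf && size && (*buf || *size == 0));
	src = *buf;
	for (i = 0; i + 1 < *size; i++)
		if (src[i] == '\r' && src[i + 1] == '\n')
			crlf++;
	if (!crlf)
		return 0;

	dst = malloc(*size - crlf);
	if (!dst)
		return -1;
	for (i = 0; i < *size; i++) {
		/* Skip a CR only when an LF follows it directly. */
		if (src[i] == '\r' && i + 1 < *size && src[i + 1] == '\n')
			continue;
		dst[out++] = src[i];
	}
	*buf = dst;
	*size = out;
	return 1;
}
```

A lone CR that is not followed by LF is kept as-is, so old Mac-style
files pass through unchanged.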
Something similar looks to be true in diff generation. The core "compare
two SHA1's at a time" doesn't need any changes, but the code that actually
reads in the temporary file from disk obviously does. But even that is
just _one_ point, afaik - "diff_populate_filespec()":
@@ -1362,6 +1362,10 @@ int diff_populate_filespec(struct diff_filespec *s, int size_only)
if (fd < 0)
goto err_empty;
s->data = xmmap(NULL, s->size, PROT_READ, MAP_PRIVATE, fd, 0);
+#ifndef __UNIX__
+ if (text)
+ convert_crlf_to_lf(&s->data, &s->size);
+#endif
close(fd);
s->should_munmap = 1;
}
(and again, that's not real code; it would also need to change the
"should_munmap" flag to indicate the state of the _new_ "data" thing).
Linus
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-12 23:24 ` Johannes Schindelin
@ 2007-02-12 23:42 ` Junio C Hamano
2007-02-12 23:46 ` David Lang
2007-02-12 23:50 ` Johannes Schindelin
2007-02-13 0:32 ` Mark Levedahl
1 sibling, 2 replies; 83+ messages in thread
From: Junio C Hamano @ 2007-02-12 23:42 UTC (permalink / raw)
To: Johannes Schindelin
Cc: Linus Torvalds, Theodore Tso, Mark Levedahl, Git Mailing List
Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> Hi,
>
> On Mon, 12 Feb 2007, Junio C Hamano wrote:
>
>> I agree that we can assume editors can grok files with LF end-of-line
>> just fine and we would not need to do the reverse conversion on checkout
>> paths (e.g. "read-tree -u", "checkout-index").
>
> In that case, a simple pre-commit hook would suffice.
>
> No, the problem mentioned by Mark was a very real one: you _cannot_ rely
> on Windows' editors not to fsck up with line endings. The worst case is if
> the file contains _some_ CRLF and _some_ LF. Almost always I had the
> problem that it now converted _all_ LFs to CRLFs. Even those which already
> were converted.
>
> So, if we are to support text mode, it is not one-way. If we do one-way,
> we really do _not_ support text mode, but pre-commit conversion to LF
> style text. And in this case, core git does not need _any_ change.
Well, I disagree on two counts.
 - I do not see how you propose to solve the some-CRLF-and-some-LF
   case with both-ways conversion.
 - A pre-commit hook would not be sufficient. In an edit, diff,
   test and then commit cycle, the diff and test steps need to look
   at whatever the editor left on the filesystem, so the change
   to populate-filespec is needed to make the diff part work.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-12 23:42 ` Junio C Hamano
@ 2007-02-12 23:46 ` David Lang
2007-02-12 23:50 ` Johannes Schindelin
1 sibling, 0 replies; 83+ messages in thread
From: David Lang @ 2007-02-12 23:46 UTC (permalink / raw)
To: Junio C Hamano
Cc: Johannes Schindelin, Linus Torvalds, Theodore Tso, Mark Levedahl,
Git Mailing List
On Mon, 12 Feb 2007, Junio C Hamano wrote:
>> Hi,
>>
>> On Mon, 12 Feb 2007, Junio C Hamano wrote:
>>
>>> I agree that we can assume editors can grok files with LF end-of-line
>>> just fine and we would not need to do the reverse conversion on checkout
>>> paths (e.g. "read-tree -u", "checkout-index").
>>
>> In that case, a simple pre-commit hook would suffice.
>>
>> No, the problem mentioned by Mark was a very real one: you _cannot_ rely
>> on Windows' editors not to fsck up with line endings. The worst case is if
>> the file contains _some_ CRLF and _some_ LF. Almost always I had the
>> problem that it now converted _all_ LFs to CRLFs. Even those which already
>> were converted.
>>
>> So, if we are to support text mode, it is not one-way. If we do one-way,
>> we really do _not_ support text mode, but pre-commit conversion to LF
>> style text. And in this case, core git does not need _any_ change.
>
> Well I disagree in two counts.
>
> - I do not see how you propose to solve some CRLF and some LF
> case with both-ways conversion.
the expectation is that the some-of-each situation is unlikely to happen if you
convert all the time.
and if you do end up with a mixed ending file, the next time you check it in
from a windows box it should clean it up.
David Lang
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-12 23:42 ` Junio C Hamano
2007-02-12 23:46 ` David Lang
@ 2007-02-12 23:50 ` Johannes Schindelin
2007-02-13 0:59 ` Mark Levedahl
1 sibling, 1 reply; 83+ messages in thread
From: Johannes Schindelin @ 2007-02-12 23:50 UTC (permalink / raw)
To: Junio C Hamano
Cc: Linus Torvalds, Theodore Tso, Mark Levedahl, Git Mailing List
Hi,
On Mon, 12 Feb 2007, Junio C Hamano wrote:
> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>
> > Hi,
> >
> > On Mon, 12 Feb 2007, Junio C Hamano wrote:
> >
> >> I agree that we can assume editors can grok files with LF end-of-line
> >> just fine and we would not need to do the reverse conversion on checkout
> >> paths (e.g. "read-tree -u", "checkout-index").
> >
> > In that case, a simple pre-commit hook would suffice.
> >
> > No, the problem mentioned by Mark was a very real one: you _cannot_ rely
> > on Windows' editors not to fsck up with line endings. The worst case is if
> > the file contains _some_ CRLF and _some_ LF. Almost always I had the
> > problem that it now converted _all_ LFs to CRLFs. Even those which already
> > were converted.
> >
> > So, if we are to support text mode, it is not one-way. If we do one-way,
> > we really do _not_ support text mode, but pre-commit conversion to LF
> > style text. And in this case, core git does not need _any_ change.
>
> Well I disagree in two counts.
>
> - I do not see how you propose to solve some CRLF and some LF
> case with both-ways conversion.
Very easy. Forward: s/\r\n/\n/. Backward: s/\(^\|[^\r]\)\n/\1\r\n/.
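For illustration, the backward direction (add a CR only in front of an
LF that does not already have one) could look roughly like this in C.
This is a sketch, not proposed git code; the function name and the
"return a new NUL-terminated buffer" convention are assumptions.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Return a malloc()ed, NUL-terminated copy of src in which every bare
 * LF has become CRLF. LFs already preceded by CR are left alone, so a
 * file with mixed endings comes out uniformly CRLF. The new length is
 * stored in *out_len. Returns NULL if the allocation fails.
 */
char *lf_to_crlf(const char *src, size_t len, size_t *out_len)
{
	size_t i, bare = 0, out = 0;
	char *dst;

	assert(src && out_len);
	/* First pass: count the LFs that need a CR. */
	for (i = 0; i < len; i++)
		if (src[i] == '\n' && (i == 0 || src[i - 1] != '\r'))
			bare++;

	dst = malloc(len + bare + 1);
	if (!dst)
		return NULL;
	/* Second pass: copy, inserting the missing CRs. */
	for (i = 0; i < len; i++) {
		if (src[i] == '\n' && (i == 0 || src[i - 1] != '\r'))
			dst[out++] = '\r';
		dst[out++] = src[i];
	}
	dst[out] = '\0';
	*out_len = out;
	return dst;
}
```

Because existing CRLF pairs are skipped, running this twice gives the
same result as running it once, which avoids the "converted again"
problem described earlier.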
> - Pre-commit hook would not be sufficient. In a edit, diff,
> test and then commit cycle, diff and test step needs to look
> at whatever the editor left on the filesystem, so the changes
> to populate-filespec is needed to make diff part work.
Yes, you are right.
However, since this is all post-1.5.0 (right? Right?) why not go with more
of Ted's proposal, and make this whole mess also usable for other things
than just crlf issues?
And I _really_ think that you do not help Windows people by doing this
one-way thing.
Ciao,
Dscho
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-12 23:24 ` Johannes Schindelin
2007-02-12 23:42 ` Junio C Hamano
@ 2007-02-13 0:32 ` Mark Levedahl
1 sibling, 0 replies; 83+ messages in thread
From: Mark Levedahl @ 2007-02-13 0:32 UTC (permalink / raw)
To: Johannes Schindelin
Cc: Junio C Hamano, Linus Torvalds, Theodore Tso, Git Mailing List
Johannes Schindelin wrote:
> Hi,
>
> On Mon, 12 Feb 2007, Junio C Hamano wrote:
>
>
>> I agree that we can assume editors can grok files with LF end-of-line
>> just fine and we would not need to do the reverse conversion on checkout
>> paths (e.g. "read-tree -u", "checkout-index").
>>
>
> In that case, a simple pre-commit hook would suffice.
>
> No, the problem mentioned by Mark was a very real one: you _cannot_ rely
> on Windows' editors not to fsck up with line endings. The worst case is if
> the file contains _some_ CRLF and _some_ LF. Almost always I had the
> problem that it now converted _all_ LFs to CRLFs. Even those which already
> were converted.
>
> So, if we are to support text mode, it is not one-way. If we do one-way,
> we really do _not_ support text mode, but pre-commit conversion to LF
> style text. And in this case, core git does not need _any_ change.
>
> Ciao,
> Dscho
In my work flow, I am using a pre-commit script that (among other
things) rewrites all text files to have \n endings. This is a one-way
conversion, and does work well for the set of tools I am using. The
converters I use I wrote years ago, and are smart enough to deal with
mixtures of \n, \r\n, and \r line endings in one file, transforming all
into one unified form. d2u / u2d were not that robust when I last tried
them (years ago), but this is an absolute necessity.
However, I don't think the one-way conversion is acceptable across the
board. While the only Windows editor I am aware of that doesn't grok \n
is Notepad (the moral equivalent of edlin), I suspect that undue reliance
upon this will still lead to grief. If nothing else, someone, somewhere
will find that their beloved crlf's are missing and will complain.
Loudly. And in the lore, git will become known for being "weird." So, I
suspect a checkout script is necessary.
Mark
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-12 23:50 ` Johannes Schindelin
@ 2007-02-13 0:59 ` Mark Levedahl
2007-02-13 1:06 ` Johannes Schindelin
2007-02-13 5:18 ` Jeff King
0 siblings, 2 replies; 83+ messages in thread
From: Mark Levedahl @ 2007-02-13 0:59 UTC (permalink / raw)
To: Johannes Schindelin
Cc: Junio C Hamano, Linus Torvalds, Theodore Tso, Mark Levedahl,
Git Mailing List
Johannes Schindelin wrote:
> However, since this is all post-1.5.0 (right? Right?) why not go with more
> of Ted's proposal, and make this whole mess also usable for other things
> than just crlf issues
Whatever is done, it needs to be robust to the notion that people will
fail to set the correct file type somewhere. Current cvsnt is fairly
good at autodetecting and setting text vs binary file type, and enforces
this across all platforms, so things don't go awry too often. It is in
my experience more reliable than subversion, which basically relies upon
file extensions mapping to mime types to identify content. All of this
is far too low a standard of accuracy for a version control
system: I lost many files per year due to the above nonsense, so I worry
about trying to create a very general transform solution and not making
it really, really failsafe. Having projects define individual globbing
patterns is good, double checking the content for sanity is an absolute
must, but I don't think that is enough. I suspect the solution should
include round-trip conversion when creating blobs to assure that the
input can be exactly reconstructed by the inverse transformation (and
therefore possibly rejecting input with mixed line endings). A similar
check could be applied on checkout.
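A rough sketch of such a round-trip check, assuming the forward
transform is "drop the CR of every CRLF" and the inverse is "put a CR in
front of every LF". All names here are hypothetical and nothing like
this exists in git today.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Forward: drop the CR of every CRLF pair. dst must hold len bytes. */
static size_t to_lf(const char *src, size_t len, char *dst)
{
	size_t i, out = 0;

	for (i = 0; i < len; i++) {
		if (src[i] == '\r' && i + 1 < len && src[i + 1] == '\n')
			continue;
		dst[out++] = src[i];
	}
	return out;
}

/* Inverse: put a CR before every LF. dst must hold 2 * len bytes. */
static size_t to_crlf(const char *src, size_t len, char *dst)
{
	size_t i, out = 0;

	for (i = 0; i < len; i++) {
		if (src[i] == '\n')
			dst[out++] = '\r';
		dst[out++] = src[i];
	}
	return out;
}

/*
 * Return 1 if converting raw to LF and back reproduces it byte for
 * byte, 0 if it does not, -1 if memory could not be allocated.
 */
int crlf_round_trips(const char *raw, size_t len)
{
	char *lf, *back;
	size_t lf_len, back_len;
	int ok;

	assert(raw);
	lf = malloc(len + 1);
	back = malloc(2 * len + 1);
	if (!lf || !back) {
		free(lf);
		free(back);
		return -1;
	}
	lf_len = to_lf(raw, len, lf);
	back_len = to_crlf(lf, lf_len, back);
	ok = back_len == len && memcmp(back, raw, len) == 0;
	free(lf);
	free(back);
	return ok;
}
```

Under these assumptions a file with mixed endings fails the check, and
so does a pure-LF file in a CRLF working tree, so the caller would have
to decide whether to reject such input or store it unconverted.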
Perhaps I'm too paranoid, but I've been burnt way too many times by
text/binary mode stuff to let this part be trivialized. Maybe it only
gets enabled by core.ImReallyParanoid, but I want that option.
Mark
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-13 0:59 ` Mark Levedahl
@ 2007-02-13 1:06 ` Johannes Schindelin
2007-02-13 1:13 ` Shawn O. Pearce
2007-02-13 1:36 ` Mark Levedahl
2007-02-13 5:18 ` Jeff King
1 sibling, 2 replies; 83+ messages in thread
From: Johannes Schindelin @ 2007-02-13 1:06 UTC (permalink / raw)
To: Mark Levedahl
Cc: Junio C Hamano, Linus Torvalds, Theodore Tso, Mark Levedahl,
Git Mailing List
Hi,
On Mon, 12 Feb 2007, Mark Levedahl wrote:
> Perhaps I'm too paranoid, but I've been burnt way too many times by
> text/binary mode stuff to let this part be trivialized. Maybe it only
> gets enabled by core.ImReallyParanoid, but I want that option.
Be aware that what you proposed costs many CPU cycles. I am totally
opposed to enabling that option by default on all platforms. I am okay
with .gitattributes (but I would call it .gitfiletypes), but I am _not_
okay with git being _too much_ fscked up by Windows. Microsoft has done
enough harm already.
Ciao,
Dscho
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-13 1:06 ` Johannes Schindelin
@ 2007-02-13 1:13 ` Shawn O. Pearce
2007-02-13 1:20 ` David Lang
2007-02-13 1:36 ` Mark Levedahl
1 sibling, 1 reply; 83+ messages in thread
From: Shawn O. Pearce @ 2007-02-13 1:13 UTC (permalink / raw)
To: Johannes Schindelin
Cc: Mark Levedahl, Junio C Hamano, Linus Torvalds, Theodore Tso,
Mark Levedahl, Git Mailing List
Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> On Mon, 12 Feb 2007, Mark Levedahl wrote:
>
> > Perhaps I'm too paranoid, but I've been burnt way too many times by
> > text/binary mode stuff to let this part be trivialized. Maybe it only
> > gets enabled by core.ImReallyParanoid, but I want that option.
>
> Be aware that what you proposed costs many CPU cycles. I am totally
> opposed to enabling that option by default on all platforms. I am okay
> with .gitattributes (but I would call it .gitfiletypes), but I am _not_
> okay with git being _too much_ fscked up by Windows. Microsoft has done
> enough harm already.
Indeed; this type of checking should only occur if there is a filter
applied to a file. Most files in most projects would hopefully
just be considered to be byte streams to Git, like they are today,
and thus not incur any additional overhead, beyond matching their
type to determine they are in fact just a byte stream.
The type could be cached in the index; or at least a single bit
which says "I'm just a byte stream, thanks" so that the matching
only needs to occur during an initial read-tree.
--
Shawn.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-13 1:13 ` Shawn O. Pearce
@ 2007-02-13 1:20 ` David Lang
0 siblings, 0 replies; 83+ messages in thread
From: David Lang @ 2007-02-13 1:20 UTC (permalink / raw)
To: Shawn O. Pearce
Cc: Johannes Schindelin, Mark Levedahl, Junio C Hamano,
Linus Torvalds, Theodore Tso, Mark Levedahl, Git Mailing List
On Mon, 12 Feb 2007, Shawn O. Pearce wrote:
> Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
>> On Mon, 12 Feb 2007, Mark Levedahl wrote:
>>
>>> Perhaps I'm too paranoid, but I've been burnt way too many times by
>>> text/binary mode stuff to let this part be trivialized. Maybe it only
>>> gets enabled by core.ImReallyParanoid, but I want that option.
>>
>> Be aware that what you proposed costs many CPU cycles. I am totally
>> opposed to enabling that option by default on all platforms. I am okay
>> with .gitattributes (but I would call it .gitfiletypes), but I am _not_
>> okay with git being _too much_ fscked up by Windows. Microsoft has done
>> enough harm already.
>
> Indeed; this type of checking should only occur if there is a filter
> applied to a file. Most files in most projects would hopefully
> just be considered to be byte streams to Git, like they are today,
> and thus not incur any additional overhead, beyond matching their
> type to determine they are in fact just a byte stream.
>
> The type could be cached in the index; or at least a single bit
> which says "I'm just a byte stream, thanks" so that the matching
> only needs to occur during an initial read-tree.
for the limited case of line endings it may be reasonable to define the internal
git format to be lf, and if you are running on a platform that uses this natively
no conversion is needed.
one possible way to make this be a general feature is to have the helper script
have a --needed flag that tells git if it would do anything on the current
platform or not. this way you don't need to run it (and sanity check it) if it's
not needed.
David Lang
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-13 1:06 ` Johannes Schindelin
2007-02-13 1:13 ` Shawn O. Pearce
@ 2007-02-13 1:36 ` Mark Levedahl
1 sibling, 0 replies; 83+ messages in thread
From: Mark Levedahl @ 2007-02-13 1:36 UTC (permalink / raw)
To: Johannes Schindelin
Cc: Junio C Hamano, Linus Torvalds, Theodore Tso, Mark Levedahl,
Git Mailing List
Johannes Schindelin wrote:
> Hi,
>
> On Mon, 12 Feb 2007, Mark Levedahl wrote:
>
>> Perhaps I'm too paranoid, but I've been burnt way too many times by
>> text/binary mode stuff to let this part be trivialized. Maybe it only
>> gets enabled by core.ImReallyParanoid, but I want that option.
>
> Be aware that what you proposed costs many CPU cycles. I am totally
> opposed to enabling that option by default on all platforms. I am okay
> with .gitattributes (but I would call it .gitfiletypes), but I am _not_
> okay with git being _too much_ fscked up by Windows. Microsoft has done
> enough harm already.
I would assume that none of this crlf stuff exists at all on Linux /
Unix / Posix, so if done right has zero impact outside of the Windows
nuthouse. Inside that, folks are already so used to incredible slowness
in file I/O that I'm not sure the round tripping I suggest as a check
would be very noticeable, but in any case I fully agree it should be
optional even there. However, if git could support something that never
screws up, absolutely guaranteeing data integrity in the presence of
these transforms, that would be a first in this arena and I believe a
significant selling point.
Mark
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-11 23:13 mingw, windows, crlf/lf, and git Mark Levedahl
` (2 preceding siblings ...)
2007-02-12 4:24 ` Theodore Tso
@ 2007-02-13 2:02 ` Junio C Hamano
2007-02-13 3:21 ` Mark Levedahl
2007-02-13 3:32 ` Alexander Litvinov
4 siblings, 1 reply; 83+ messages in thread
From: Junio C Hamano @ 2007-02-13 2:02 UTC (permalink / raw)
To: Mark Levedahl; +Cc: Git Mailing List
Mark Levedahl <mlevedahl@verizon.net> writes:
> I am NOT intending to start a flamewar O:-) , so please don't turn
> this into one.
Heh, a lofty goal. And I am glad to see that a thread full of
constructive suggestions is already going on.
So now I do not have to fear starting a flamewar; I can safely
vent.
> The recent threads on a mingw git port are explicit in the intent to
> provide a Windows native git. I believe there is a fundamental
> conflict here with the position, clearly stated by Linus, that git
> does not alter content in any way. Windows suffers the curse of DOS
> line endings (\r\n vs \n), and a true port to Windows *must* allow for
> \r\n and \n to be semantically the same thing as most large projects
> end up with a mixture of such files and/or are targeting
> cross-platform capabilities. The major competing solutions git seeks
> to supplant (cvs, cvsnt, svn, hg) have capability to recognize "text"
> files and transparently replace \r\n with \n on input, the reverse on
> output, and ignore all such differences on diff operations. To be
> relevant on native Windows, git must do the same. Otherwise, git will
> be deemed "too wierd" and dismissed in favor of a tool "that works."
>
> There is no use to debating the technical merits of \r\n vs \n vs \r
> vs whatever, nor of not converting. Really. Just accept that there is
> a fundamental requirement that any version control tool on Windows be
> able to silently convert between \r\n and \n. To believe otherwise is
> to expect that the conversion be pushed elsewhere into the tool chain
> in use, and that won't happen as the competition already provide this
> conversion capability.
I think there is a fundamental misconception in the above. I do
not know about others, but to me personally, I do not see any
"seeking to supplant", nor "competition". It's not like I or
the people who raised git into its current shape are begging
Windows users to consider using git and bending over backwards to
please them. You should hone your diplomacy.
Current git may or may not match what they need, and if it does
not match what they need, making it match what they need is
primarily the responsibility of them. If Windows users find
something in git that is interesting and useful, but if they
find something else lacking in it to be truly useful for them,
they can submit patches, or if they cannot implement the changes
themselves but only have wishlist items, then _they_ can do the
begging.
People in the git community are certainly a friendly and helpful
bunch, and some (including me) are unfortunate enough that
sometimes they have to touch Windows, so some degree of need is
felt to support Windows better even within the community, but it
has never been high priority. Making it higher priority by
bringing in better ideas and starting the fire must come from
people who care more about Windows than me and Linus.
> So, I think the git project needs to come to an explicit position on
> this, basically being:
>
> 1) git is a POSIX only tool (i.e., there will be no \r\n munging), or
> 2) a Windows port of git will handle and mung \r\n and \n line endings.
I do not think the git project needs to do any such thing. The
project evolves reflecting the needs of its users, and the
design is not decided upfront without doing any feasibility
study. I would certainly not say our position is (1), IOW, I
would not say we will rule out Windows support. If it can be
reasonably done without harming the code, why not?
Depending on how cleanly a change Windows users want is done
without negatively affecting the existing users, it may or may
not be judged acceptable. We will know only when we see at
least the design and preferably the code. I feel no need to
decide between (1) and (2) upfront before that happens.
> If the answer is 1, the mingw port is a waste of time as it simply
> won't be usable by its target audience. If the answer is 2, then I
> think a very careful design of this capability is in order.
>
> Comments?
This is not just you, and fortunately it does not happen very
often in git community, but I find it _very_ irritating when
somebody says: "here is a patch, I'll do the doc, test, and
tidying up if this patch is accepted". I usually pretend to be
a nice person and accept the patch when it is obviously good,
or pretend that I was too busy and did not notice such a
message, but I feel _very_ tempted to say: "if you care deeply
enough that what you did is useful, I expect you'd perfect it
whether or not I apply your patch to my tree right now. If even
the original author, you, do not find it worth perfecting, then
I am not interested at all."
Even if all existing git community members felt (1) above and
were unwilling to accept line-end conversions (which by now you
already know is not the case -- and that is why I waited until
now to address this as a separate "attitude" issue), if somebody
who works on Windows is motivated enough to make git work better
for him, he can fork (and forking is very easy with git). If
the forked git works well both on Windows and on non Windows,
people who initially felt (1) will realize that they were wrong
and then the codebase can be merged back together (and merging
the forked projects is very easy with git).
It's open source. People shouldn't worry too much about what
they have done "wasted". You are not even talking about what
you've already done -- you are talking about what you _might_
do.
And your saying "If 2, then we need to think carefully" was VERY
good. My point is that you did not have to say "Is it 1, or is
it 2, and if 2 then" part.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-13 2:02 ` Junio C Hamano
@ 2007-02-13 3:21 ` Mark Levedahl
2007-02-13 6:05 ` Junio C Hamano
0 siblings, 1 reply; 83+ messages in thread
From: Mark Levedahl @ 2007-02-13 3:21 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Mark Levedahl, Git Mailing List
Junio C Hamano wrote:
> Mark Levedahl <mlevedahl@verizon.net> writes:
>
>> I am NOT intending to start a flamewar O:-) , so please don't turn
>> this into one.
>
> Heh, a lofty goal. And I am glad to see that a thread full of
> constructive suggestion is already going on.
>
> So now I do not have to fear starting a flamewar; I can safely
> vent.
Junio,
I meant absolutely no offense in anything I wrote, and sincerely
apologize if any was taken. My past experiences caused me to be
skeptical that a significant change to accommodate a very bad design of
Windows would be accepted here. Happily, that skepticism was misplaced.
I am much heartened by the responses, and am optimistic a good solution
will be found that is acceptable to all. It is very clear that the group
is open and supportive of working through the issues to help this, and I
intend to contribute to that solution. (If nothing else, I would like to
be known for something besides some modest ability to hack around Tk bugs).
So, I trust the flamethrowers can remain buried.
Mark
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-11 23:13 mingw, windows, crlf/lf, and git Mark Levedahl
` (3 preceding siblings ...)
2007-02-13 2:02 ` Junio C Hamano
@ 2007-02-13 3:32 ` Alexander Litvinov
2007-02-13 10:06 ` Johannes Schindelin
4 siblings, 1 reply; 83+ messages in thread
From: Alexander Litvinov @ 2007-02-13 3:32 UTC (permalink / raw)
To: Mark Levedahl; +Cc: Git Mailing List
On Monday 12 February 2007 05:13, Mark Levedahl wrote:
> 1) git is a POSIX only tool (i.e., there will be no \r\n munging), or
> 2) a Windows port of git will handle and mung \r\n and \n line endings.
>
> If the answer is 1, the mingw port is a waste of time as it simply won't
> be usable by its target audience. If the answer is 2, then I think a
> very careful design of this capability is in order.
I strongly object to this statement. I develop one project under Windows and
use Cygwin git for this. Yes, I have a problem with git assuming that the line
ending is \n, but most of the trouble is with diff and rebase. In general git
works well with \r\n line endings.
When a file has been converted from dos to unix format (or from unix to
dos), git generates a big diff. But anyway, the C++ compiler works well with
both formats, and in this case I simply convert the file back to dos format and
git again shows a nice diff. If the unix format was committed to git, I simply
change the format and commit that file again.
The only trouble is rebase: it does not like \r\n endings and often produces
unexpected merge conflicts. But I don't use rebase often enough to really
investigate and try to solve the problem.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-13 0:59 ` Mark Levedahl
2007-02-13 1:06 ` Johannes Schindelin
@ 2007-02-13 5:18 ` Jeff King
1 sibling, 0 replies; 83+ messages in thread
From: Jeff King @ 2007-02-13 5:18 UTC (permalink / raw)
To: Mark Levedahl
Cc: Johannes Schindelin, Junio C Hamano, Linus Torvalds,
Theodore Tso, Mark Levedahl, Git Mailing List
On Mon, Feb 12, 2007 at 07:59:50PM -0500, Mark Levedahl wrote:
> fail to set the correct file type somewhere. Current cvsnt is fairly
> good at autodetecting and setting text vs binary file type, and enforces
> this across all platforms, so things don't go awry too often. It is in
There is obviously much sentiment that this should _not_ be the default
(and I agree). But if arbitrary filters are possible, then you can
theoretically write an 'autocrlf' filter which will try to do the right
thing, and you could set it for some or all files:
echo '*: autocrlf' >.gitattributes
but it would be off by default. If we implement this, everyone has to
"pay" for .gitattributes (even if you don't use it, we have to look it
up to make sure you're not using it!), but nobody has to pay for any
filters they don't use.
-Peff
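A rough sketch of the lookup cost described above, assuming a hypothetical rule table parsed from lines such as `*: autocrlf`. The names `path_filter` and `lookup_filter` are invented for illustration and do not exist in git; only POSIX `fnmatch()` is a real API here.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical: one parsed rule, e.g. the line "*.txt: autocrlf". */
struct path_filter {
	const char *pattern;	/* shell glob matched against the path */
	const char *filter;	/* name of the filter to apply */
};

/*
 * Return the filter name for a path, or NULL when no rule matches.
 * With an empty rule table the loop body never runs, so a repository
 * that defines no filters only pays for the lookup itself.
 */
static const char *lookup_filter(const struct path_filter *rules, size_t nr,
				 const char *path)
{
	size_t i;

	assert(path != NULL);
	for (i = 0; i < nr; i++) {
		if (fnmatch(rules[i].pattern, path, 0) == 0)
			return rules[i].filter;
	}
	return NULL;
}
```

Every path goes through the lookup, but a filter only runs for paths whose rule names it, which is the "pay only for what you use" split described above.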
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-13 3:21 ` Mark Levedahl
@ 2007-02-13 6:05 ` Junio C Hamano
0 siblings, 0 replies; 83+ messages in thread
From: Junio C Hamano @ 2007-02-13 6:05 UTC (permalink / raw)
To: Mark Levedahl; +Cc: Mark Levedahl, Git Mailing List
Mark Levedahl <mdl123@verizon.net> writes:
> I meant absolutely no offense in anything I wrote, and sincerely
> apologize if any was taken.
None taken, although I admit that I was somewhat annoyed, having
to write the first part of my response.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-13 3:32 ` Alexander Litvinov
@ 2007-02-13 10:06 ` Johannes Schindelin
2007-02-13 12:16 ` Alexander Litvinov
2007-02-13 16:52 ` Linus Torvalds
0 siblings, 2 replies; 83+ messages in thread
From: Johannes Schindelin @ 2007-02-13 10:06 UTC (permalink / raw)
To: Alexander Litvinov; +Cc: Mark Levedahl, Git Mailing List
Hi,
On Tue, 13 Feb 2007, Alexander Litvinov wrote:
> When a file has been converted from DOS to Unix format (or from Unix to
> DOS), git generates a big diff. But the C++ compiler works well with both
> formats anyway, so in this case I simply convert the file back to DOS
> format and git shows a nice diff again. If the Unix format was committed
> to git, I simply change the format and commit that file again.
That's awful!
> The only trouble is rebase: it does not like \r\n endings and often
> produces unexpected merge conflicts. But I don't use rebase often enough
> to really investigate and try to solve the problem.
Well, if everybody thinks like you, maybe we do not have to change
anything for Windows after all?
Ciao,
Dscho
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-13 10:06 ` Johannes Schindelin
@ 2007-02-13 12:16 ` Alexander Litvinov
2007-02-13 12:37 ` Johannes Schindelin
2007-02-13 19:36 ` Mark Levedahl
2007-02-13 16:52 ` Linus Torvalds
1 sibling, 2 replies; 83+ messages in thread
From: Alexander Litvinov @ 2007-02-13 12:16 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: Git Mailing List
On Tuesday 13 February 2007 16:06, Johannes Schindelin wrote:
> Hi,
>
> On Tue, 13 Feb 2007, Alexander Litvinov wrote:
> > When a file has been converted from DOS to Unix format (or from Unix to
> > DOS), git generates a big diff. But the C++ compiler works well with both
> > formats anyway, so in this case I simply convert the file back to DOS
> > format and git shows a nice diff again. If the Unix format was committed
> > to git, I simply change the format and commit that file again.
>
> That's awful!
If you are trying to build a history that looks good, you are right: this is a
terrible workflow.
> > The only trouble is rebase: it does not like \r\n endings and often
> > produces unexpected merge conflicts. But I don't use rebase often enough
> > to really investigate and try to solve the problem.
>
> Well, if everybody thinks like you, maybe we do not have to change
> anything for Windows after all?
I would still like a working rebase, so it would be nice if git handled \r\n
somehow. But please do not reproduce the behavior of cvs: under Cygwin it
still uses \n!
By the way, most Windows programmers I work with say 'git is cool, but is
there a GUI like Tortoise or WinCVS?' :-)
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-13 12:16 ` Alexander Litvinov
@ 2007-02-13 12:37 ` Johannes Schindelin
2007-02-13 19:36 ` Mark Levedahl
1 sibling, 0 replies; 83+ messages in thread
From: Johannes Schindelin @ 2007-02-13 12:37 UTC (permalink / raw)
To: Alexander Litvinov; +Cc: Git Mailing List
Hi,
On Tue, 13 Feb 2007, Alexander Litvinov wrote:
> Tuesday 13 February 2007 16:06 Johannes Schindelin:
> > At some stage, Alexander wrote this:
>
> > > The only trouble is rebase: it does not like \r\n endings and often
> > > produces unexpected merge conflicts. But I don't use rebase often
> > > enough to really investigate and try to solve the problem.
> >
> > Well, if everybody thinks like you, maybe we do not have to change
> > anything for Windows after all?
>
> I would still like a working rebase, so it would be nice if git handled
> \r\n somehow. But please do not reproduce the behavior of cvs: under
> Cygwin it still uses \n!
You really should teach format-patch to output \n patches, and keep all
your blobs CR free.
> By the way, most Windows programmers I work with say 'git is cool, but
> is there a GUI like Tortoise or WinCVS?' :-)
Some time ago, I started playing with a shell extension. Now that MinGW
git is almost there, I might clean it up... Would you be interested in
working on it, or is this just wishtalk?
Ciao,
Dscho
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-13 10:06 ` Johannes Schindelin
2007-02-13 12:16 ` Alexander Litvinov
@ 2007-02-13 16:52 ` Linus Torvalds
2007-02-13 17:23 ` Linus Torvalds
` (2 more replies)
1 sibling, 3 replies; 83+ messages in thread
From: Linus Torvalds @ 2007-02-13 16:52 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: Alexander Litvinov, Mark Levedahl, Git Mailing List
On Tue, 13 Feb 2007, Johannes Schindelin wrote:
>
> On Tue, 13 Feb 2007, Alexander Litvinov wrote:
>
> > When a file has been converted from DOS to Unix format (or from Unix to
> > DOS), git generates a big diff. But the C++ compiler works well with both
> > formats anyway, so in this case I simply convert the file back to DOS
> > format and git shows a nice diff again. If the Unix format was committed
> > to git, I simply change the format and commit that file again.
>
> That's awful!
>
> > The only trouble is rebase: it does not like \r\n endings and often
> > produces unexpected merge conflicts. But I don't use rebase often enough
> > to really investigate and try to solve the problem.
>
> Well, if everybody thinks like you, maybe we do not have to change
> anything for Windows after all?
No no no.
It's going to be _horrible_ if people start interesting projects in
Windows, and there are files in a git repository that are encoded with
CRLF.
I'd much rather just get this right, and that means "no hooks". If people
start using commit hooks etc, that will just mean that they won't use them
for all-windows environments (why use it? Everybody has CRLF, and
everybody _wants_ CRLF), or it will just be relatively expensive to have a
complex hook anyway.
So I think we should plan on something like .gitattributes or similar, so
that we _can_ handle mixed environments well, without any real setup or
any real costs.
The costs really shouldn't be too high - we tend to avoid doing any
expensive working tree changes *anyway*. For example, even "git checkout"
has a huge optimization to avoid rewriting files that are already ok, so
doing things like switching whole branches usually wouldn't even need any
conversion for most files - even on platforms like Windows that need the
conversion in the first place.
So considering that it looks _trivial_ for git-update-index, fairly easy
for diff generation, and I doubt "git checkout" is really likely to be any
worse either, this should just be something we do.
The *ONLY* case where we may not be able to do things automatically is
actually a much more subtle one: "git cat-file". If we just get a SHA1, we
don't know what the path to look it up was like, and thus we can never
know whether it's a binary or a text object. With "-p" we can trivially
guess, of course, but "git cat-file blob" simply must not do that!
But that really doesn't sound like a big problem to me ;)
Linus
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-13 16:52 ` Linus Torvalds
@ 2007-02-13 17:23 ` Linus Torvalds
2007-02-13 17:23 ` Linus Torvalds
` (2 more replies)
2007-02-13 17:25 ` Nicolas Pitre
2007-02-13 18:04 ` Johannes Schindelin
2 siblings, 3 replies; 83+ messages in thread
From: Linus Torvalds @ 2007-02-13 17:23 UTC (permalink / raw)
To: Johannes Schindelin, Junio C Hamano
Cc: Alexander Litvinov, Mark Levedahl, Git Mailing List
On Tue, 13 Feb 2007, Linus Torvalds wrote:
>
> I'd much rather just get this right, and that means "no hooks". If people
> start using commit hooks etc, that will just mean that they won't use them
> for all-windows environments (why use it? Everybody has CRLF, and
> everybody _wants_ CRLF), or it will just be relatively expensive to have a
> complex hook anyway.
>
> So I think we should plan on something like .gitattributes or similar, so
> that we _can_ handle mixed environments well, without any real setup or
> any real costs.
Here's a patch that I think we can merge right now. There may be other
places that need this, but this at least points out the three places that
read/write working tree files for git update-index, checkout and diff
respectively. That should cover a lot of it.
Some day we can actually implement it. In the meantime, this points out a
place for people to start. We *can* even start with a really simple "we do
CRLF conversion automatically, regardless of filename" kind of approach,
that just looks at the data (all three cases have the _full_ file data
already in memory) and says "ok, this is text, so let's convert to/from
DOS format directly".
THAT somebody can write in ten minutes, and it would already make git much
nicer on a DOS/Windows platform, I suspect.
And it would be totally zero-cost if you just make it a config option
(but please make it dynamic with the _default_ just being 0/1 depending
on whether it's UNIX/Windows, just so that UNIX people can _test_ it
easily).
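
A minimal sketch of that "dynamic, with a platform-dependent default" idea, under stated assumptions: `handle_config` is a hypothetical stand-in for git's real config callback, the boolean parsing is deliberately simplistic, and the key name `core.autocrlf` is the one used later in this thread.

```c
#include <assert.h>
#include <string.h>

/* The compiled-in default differs per platform, but it is only a default. */
#ifdef _WIN32
#define AUTO_CRLF_DEFAULT 1
#else
#define AUTO_CRLF_DEFAULT 0
#endif

static int auto_crlf = AUTO_CRLF_DEFAULT;

/*
 * Hypothetical config handler: returns 1 if the key was recognised
 * and applied, 0 otherwise. Only "true"/"1" enable the option here;
 * a real parser would accept more spellings.
 */
static int handle_config(const char *var, const char *value)
{
	assert(var != NULL && value != NULL);
	if (strcmp(var, "core.autocrlf") != 0)
		return 0;
	auto_crlf = !strcmp(value, "true") || !strcmp(value, "1");
	return 1;
}
```

Because the setting is read at run time, a UNIX user can switch it on to test the conversion paths, and a Windows user can switch it off again.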
Linus
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-13 17:23 ` Linus Torvalds
@ 2007-02-13 17:23 ` Linus Torvalds
2007-02-13 18:00 ` Junio C Hamano
2007-02-13 18:05 ` Johannes Schindelin
2 siblings, 0 replies; 83+ messages in thread
From: Linus Torvalds @ 2007-02-13 17:23 UTC (permalink / raw)
To: Johannes Schindelin, Junio C Hamano
Cc: Alexander Litvinov, Mark Levedahl, Git Mailing List
On Tue, 13 Feb 2007, Linus Torvalds wrote:
>
> Here's a patch [...]
No. HERE's the trivial stupid patch that just marks the core places.
Linus
---
diff --git a/diff.c b/diff.c
index aaab309..13b9b6c 100644
--- a/diff.c
+++ b/diff.c
@@ -1364,6 +1364,7 @@ int diff_populate_filespec(struct diff_filespec *s, int size_only)
s->data = xmmap(NULL, s->size, PROT_READ, MAP_PRIVATE, fd, 0);
close(fd);
s->should_munmap = 1;
+ /* FIXME! CRLF -> LF conversion goes here, based on "s->path" */
}
else {
char type[20];
diff --git a/entry.c b/entry.c
index 0ebf0f0..c2641dd 100644
--- a/entry.c
+++ b/entry.c
@@ -89,6 +89,7 @@ static int write_entry(struct cache_entry *ce, char *path, struct checkout *stat
return error("git-checkout-index: unable to create file %s (%s)",
path, strerror(errno));
}
+ /* FIXME: LF -> CRLF conversion goes here, based on "ce->name" */
wrote = write_in_full(fd, new, size);
close(fd);
free(new);
diff --git a/sha1_file.c b/sha1_file.c
index 0d4bf80..8ad7fad 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2091,6 +2091,7 @@ int index_fd(unsigned char *sha1, int fd, struct stat *st, int write_object, con
if (!type)
type = blob_type;
+ /* FIXME: CRLF -> LF conversion here for blobs! We'll need the path! */
if (write_object)
ret = write_sha1_file(buf, size, type, sha1);
else
^ permalink raw reply related [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-13 16:52 ` Linus Torvalds
2007-02-13 17:23 ` Linus Torvalds
@ 2007-02-13 17:25 ` Nicolas Pitre
2007-02-13 18:04 ` Johannes Schindelin
2 siblings, 0 replies; 83+ messages in thread
From: Nicolas Pitre @ 2007-02-13 17:25 UTC (permalink / raw)
To: Linus Torvalds
Cc: Johannes Schindelin, Alexander Litvinov, Mark Levedahl, Git Mailing List
On Tue, 13 Feb 2007, Linus Torvalds wrote:
> The *ONLY* case where we may not be able to do things automatically is
> actually a much more subtle one: "git cat-file". If we just get a SHA1, we
> don't know what the path to look it up was like, and thus we can never
> know whether it's a binary or a text object. With "-p" we can trivially
> guess, of course, but "git cat-file blob" simply must not do that!
git-cat-file, and its counter part git-hash-object, are fairly low level
plumbing. Anyone using them should be aware of the issue and apply the
needed conversion. And actually, since we're going to have the
conversion routines in the core, we'd only need to add a --crlf argument
to both of them to optionally perform the conversion since the user of
those commands is more likely to know if the conversion is needed.
Nicolas
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-13 17:23 ` Linus Torvalds
2007-02-13 17:23 ` Linus Torvalds
@ 2007-02-13 18:00 ` Junio C Hamano
2007-02-13 19:07 ` Linus Torvalds
2007-02-13 18:05 ` Johannes Schindelin
2 siblings, 1 reply; 83+ messages in thread
From: Junio C Hamano @ 2007-02-13 18:00 UTC (permalink / raw)
To: Linus Torvalds
Cc: Johannes Schindelin, Alexander Litvinov, Mark Levedahl, Git Mailing List
Linus Torvalds <torvalds@linux-foundation.org> writes:
> Here's a patch that I think we can merge right now. There may be other
> places that need this, but this at least points out the three places that
> read/write working tree files for git update-index, checkout and diff
> respectively. That should cover a lot of it.
Thanks, applied. I think git-apply has separate codepaths for
both reading and writing; I won't look into them before 1.5.0
but people are welcome to help advancing the cause before I get
to it ;-).
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-13 16:52 ` Linus Torvalds
2007-02-13 17:23 ` Linus Torvalds
2007-02-13 17:25 ` Nicolas Pitre
@ 2007-02-13 18:04 ` Johannes Schindelin
2007-02-13 18:11 ` Junio C Hamano
2007-02-13 18:39 ` Linus Torvalds
2 siblings, 2 replies; 83+ messages in thread
From: Johannes Schindelin @ 2007-02-13 18:04 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Alexander Litvinov, Mark Levedahl, Git Mailing List
Hi,
On Tue, 13 Feb 2007, Linus Torvalds wrote:
> On Tue, 13 Feb 2007, Johannes Schindelin wrote:
> >
> > On Tue, 13 Feb 2007, Alexander Litvinov wrote:
> >
> > > The only trouble is rebase: it does not like \r\n endings and
> > > often produces unexpected merge conflicts. But I don't use rebase
> > > often enough to really investigate and try to solve the problem.
> >
> > Well, if everybody thinks like you, maybe we do not have to change
> > anything for Windows after all?
>
> No no no.
>
> It's going to be _horrible_ if people start interesting projects in
> Windows, and there are files in a git repository that are encoded with
> CRLF.
>
> I'd much rather just get this right, and that means "no hooks".
No hooks means something like cvsnt does, and that means no .gitattributes
either. (BTW I really hate .gitattributes, as it does not at all say what
this is about; it's about file _conversions_, not attributes).
CVSNT analyzes the files, and guesses if they are text, and only then
activates the text mode.
I am strongly opposed to including something like that. (It was already
proposed, and your "no hooks" suggests the same.)
However, I am slightly positive about the .gitfiletypes approach, _iff_ we
think about more than just text/binary from the start. If we do it right,
it will buy us more.
Ciao,
Dscho
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-13 17:23 ` Linus Torvalds
2007-02-13 17:23 ` Linus Torvalds
2007-02-13 18:00 ` Junio C Hamano
@ 2007-02-13 18:05 ` Johannes Schindelin
2 siblings, 0 replies; 83+ messages in thread
From: Johannes Schindelin @ 2007-02-13 18:05 UTC (permalink / raw)
To: Linus Torvalds
Cc: Junio C Hamano, Alexander Litvinov, Mark Levedahl, Git Mailing List
Hi,
On Tue, 13 Feb 2007, Linus Torvalds wrote:
> Here's a patch that I think we can merge right now.
Why the haste all of a sudden? Your patch can easily be applied by anyone
who wants to work on text/binary or arbitrary file types. No need to rush
a developers-only patch into git.git.
Ciao,
Dscho
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-13 18:04 ` Johannes Schindelin
@ 2007-02-13 18:11 ` Junio C Hamano
2007-02-13 18:39 ` Linus Torvalds
1 sibling, 0 replies; 83+ messages in thread
From: Junio C Hamano @ 2007-02-13 18:11 UTC (permalink / raw)
To: Johannes Schindelin
Cc: Linus Torvalds, Alexander Litvinov, Mark Levedahl, Git Mailing List
Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> No hooks means something like cvsnt does, and that means no .gitattributes
> either. (BTW I really hate .gitattributes, as it does not at all say what
> this is about; it's about file _conversions_, not attributes).
> However, I am slightly positive about the .gitfiletypes approach, _iff_ we
> think about more than just text/binary from the start. If we do it right,
> it will buy us more.
We might start with only binary/text attributes, but we may add
more later, e.g. chmod=o-rwx. I do not see much difference
between attributes vs filetypes.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-13 18:04 ` Johannes Schindelin
2007-02-13 18:11 ` Junio C Hamano
@ 2007-02-13 18:39 ` Linus Torvalds
2007-02-13 18:42 ` Johannes Schindelin
1 sibling, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2007-02-13 18:39 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: Alexander Litvinov, Mark Levedahl, Git Mailing List
On Tue, 13 Feb 2007, Johannes Schindelin wrote:
>
> No hooks means something like cvsnt does, and that means no .gitattributes
> either. (BTW I really hate .gitattributes, as it does not at all say what
> this is about; it's about file _conversions_, not attributes).
No, it *is* about attributes.
In order to know how to convert, you need to know the attributes of the
file.
So it's not about conversion: we would ALWAYS do conversion. It's about
the fact that in order to do the conversion, we need to know what the
attributes of the file are - is it text, or what.
And the equal point is that there are _other_ attributes that git might
care about. The "merge strategy" attribute, for example. Or "owner"
attributes for files etc.
Linus
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-13 18:39 ` Linus Torvalds
@ 2007-02-13 18:42 ` Johannes Schindelin
0 siblings, 0 replies; 83+ messages in thread
From: Johannes Schindelin @ 2007-02-13 18:42 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Alexander Litvinov, Mark Levedahl, Git Mailing List
Hi,
On Tue, 13 Feb 2007, Linus Torvalds wrote:
> On Tue, 13 Feb 2007, Johannes Schindelin wrote:
> >
> > No hooks means something like cvsnt does, and that means no .gitattributes
> > either. (BTW I really hate .gitattributes, as it does not at all say what
> > this is about; it's about file _conversions_, not attributes).
>
> No, it *is* about attributes.
>
> In order to know how to convert, you need to know the attributes of the
> file.
>
> So it's not about conversion: we would ALWAYS do conversion. It's about
> the fact that in order to do the conversion, we need to know what the
> > attributes of the file are - is it text, or what.
>
> And the equal point is that there are _other_ attributes that git might
> care about. The "merge strategy" attribute, for example. Or "owner"
> attributes for files etc.
Yes, you're right. Colour me converted (pun intended).
Ciao,
Dscho
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-13 18:00 ` Junio C Hamano
@ 2007-02-13 19:07 ` Linus Torvalds
2007-02-13 20:42 ` Sam Ravnborg
` (3 more replies)
0 siblings, 4 replies; 83+ messages in thread
From: Linus Torvalds @ 2007-02-13 19:07 UTC (permalink / raw)
To: Junio C Hamano
Cc: Johannes Schindelin, Alexander Litvinov, Mark Levedahl, Git Mailing List
On Tue, 13 Feb 2007, Junio C Hamano wrote:
>
> Thanks, applied. I think git-apply has separate codepaths for
> both reading and writing; I won't look into them before 1.5.0
> but people are welcome to help advancing the cause before I get
> to it ;-).
Actually, I did it myself.
This is a "lazy man's auto-CRLF", and it really is pretty simple.
It currently does NOT know about file attributes, so it does its
conversion purely based on content. Maybe that is more in the "git
philosophy" anyway, since content is king, but I think we should try to do
the file attributes to turn it off on demand.
Anyway, BY DEFAULT it is off regardless, because it requires a
[core]
AutoCRLF = true
in your config file to be enabled. We could make that the default for
Windows, of course, the same way we do some other things (filemode etc).
But you can actually enable it on UNIX, and it will cause:
- "git update-index" will write blobs without CRLF
- "git diff" will diff working tree files without CRLF
- "git checkout" will write files to the working tree _with_ CRLF
and things work fine.
Funnily, it actually shows an odd file in git itself:
git clone -n git test-crlf
cd test-crlf
git config core.autocrlf true
git checkout
git diff
shows a diff for "Documentation/docbook-xsl.css". Why? Because we have
actually checked in that file *with* CRLF! So when "core.autocrlf" is
true, we'll always generate a *different* hash for it in the index,
because the index hash will be for the content _without_ CRLF.
Is this complete? I dunno. It seems to work for me. It doesn't use the
filename at all right now, and that's probably a deficiency (we could
certainly make the "is_binary()" heuristics also take standard filename
heuristics into account).
I don't pass in the filename at all for the "index_fd()" case
(git-update-index), so that would need to be passed around, but this
actually works fine.
NOTE NOTE NOTE! The "is_binary()" heuristics are totally made-up by yours
truly. I will not guarantee that they work at all reasonably. Caveat
emptor. But it _is_ simple, and it _is_ safe, since it's all off by
default.
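
To make the heuristic concrete, here is a standalone sketch of the byte classification and the one-eighth threshold used by the patch below. It mirrors `gather_stats()` and `is_binary()` from convert.c, but the name `looks_binary` and the use of `size_t` are changes made for this illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct text_stat {
	unsigned cr, lf, crlf;			/* line-ending counts */
	unsigned printable, nonprintable;	/* rough byte classes */
};

/* Count line endings and classify every other byte in one pass. */
static void gather_stats(const char *buf, size_t size, struct text_stat *stats)
{
	size_t i;

	assert(buf != NULL && stats != NULL);
	memset(stats, 0, sizeof(*stats));
	for (i = 0; i < size; i++) {
		unsigned char c = (unsigned char)buf[i];

		if (c == '\r') {
			stats->cr++;
			if (i + 1 < size && buf[i + 1] == '\n')
				stats->crlf++;
		} else if (c == '\n') {
			stats->lf++;
		} else if (c == '\t' || (c >= 32 && c < 127)) {
			stats->printable++;
		} else {
			stats->nonprintable++;
		}
	}
}

/* "Binary" means more than one byte in eight is non-printable. */
static int looks_binary(size_t size, const struct text_stat *stats)
{
	return stats->nonprintable > (size >> 3);
}
```

For a ten-byte buffer such as "one\r\ntwo\r\n" the threshold is one non-printable byte and none are found, so it is treated as text. An eight-byte buffer containing three control bytes exceeds its threshold of one and is treated as binary.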
The patch is pretty simple - the biggest part is the new "convert.c" file,
but even that is really just basic stuff that anybody can write in
"Teaching C 101" as a final project for their first class in programming.
Not to say that it's bug-free, of course - but at least we're not talking
about rocket surgery here.
Linus
---
commit f0731319497ac8121bd901a91fc33d715745d3af
Author: Linus Torvalds <torvalds@osdl.org>
Date: Tue Feb 13 10:56:50 2007 -0800
Add "auto-CRLF" conversion logic
It's simple and it's stupid. But it actually seems to work. What more
can you want?
It's not enabled by default: you need to add a
[core]
AutoCRLF = true
to your .git/config file to enable it universally.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
Makefile | 3 +-
cache.h | 5 ++
config.c | 5 ++
convert.c | 179 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
diff.c | 16 +++++
entry.c | 15 +++++
environment.c | 1 +
sha1_file.c | 22 +++++++-
8 files changed, 244 insertions(+), 2 deletions(-)
create mode 100644 convert.c
diff --git a/Makefile b/Makefile
index 40bdcff..60496ff 100644
--- a/Makefile
+++ b/Makefile
@@ -262,7 +262,8 @@ LIB_OBJS = \
revision.o pager.o tree-walk.o xdiff-interface.o \
write_or_die.o trace.o list-objects.o grep.o \
alloc.o merge-file.o path-list.o help.o unpack-trees.o $(DIFF_OBJS) \
- color.o wt-status.o archive-zip.o archive-tar.o shallow.o utf8.o
+ color.o wt-status.o archive-zip.o archive-tar.o shallow.o utf8.o \
+ convert.o
BUILTIN_OBJS = \
builtin-add.o \
diff --git a/cache.h b/cache.h
index c62b0b0..9c019e8 100644
--- a/cache.h
+++ b/cache.h
@@ -201,6 +201,7 @@ extern const char *apply_default_whitespace;
extern int zlib_compression_level;
extern size_t packed_git_window_size;
extern size_t packed_git_limit;
+extern int auto_crlf;
#define GIT_REPO_VERSION 0
extern int repository_format_version;
@@ -468,4 +469,8 @@ extern int nfvasprintf(char **str, const char *fmt, va_list va);
extern void trace_printf(const char *format, ...);
extern void trace_argv_printf(const char **argv, int count, const char *format, ...);
+/* convert.c */
+extern int convert_to_git(const char *path, char **bufp, unsigned long *sizep);
+extern int convert_to_working_tree(const char *path, char **bufp, unsigned long *sizep);
+
#endif /* CACHE_H */
diff --git a/config.c b/config.c
index d821071..ffe0212 100644
--- a/config.c
+++ b/config.c
@@ -324,6 +324,11 @@ int git_default_config(const char *var, const char *value)
return 0;
}
+ if (!strcmp(var, "core.autocrlf")) {
+ auto_crlf = git_config_bool(var, value);
+ return 0;
+ }
+
if (!strcmp(var, "user.name")) {
strlcpy(git_default_name, value, sizeof(git_default_name));
return 0;
diff --git a/convert.c b/convert.c
new file mode 100644
index 0000000..c04b6c2
--- /dev/null
+++ b/convert.c
@@ -0,0 +1,179 @@
+#include "cache.h"
+/*
+ * convert.c - convert a file when checking it out and checking it in.
+ *
+ * This should use the pathname to decide on whether it wants to do some
+ * more interesting conversions (automatic gzip/unzip, general format
+ * conversions etc etc), but by default it just does automatic CRLF<->LF
+ * translation when the "auto_crlf" option is set.
+ */
+
+struct text_stat {
+ /* CR, LF and CRLF counts */
+ unsigned cr, lf, crlf;
+
+ /* These are just approximations! */
+ unsigned printable, nonprintable;
+};
+
+static void gather_stats(const char *buf, unsigned long size, struct text_stat *stats)
+{
+ unsigned long i;
+
+ memset(stats, 0, sizeof(*stats));
+
+ for (i = 0; i < size; i++) {
+ unsigned char c = buf[i];
+ if (c == '\r') {
+ stats->cr++;
+ if (i+1 < size && buf[i+1] == '\n')
+ stats->crlf++;
+ continue;
+ }
+ if (c == '\n') {
+ stats->lf++;
+ continue;
+ }
+ if (c == '\t' || (c >= 32 && c < 127)) {
+ stats->printable++;
+ continue;
+ }
+ stats->nonprintable++;
+ }
+}
+
+/*
+ * This is just a heuristic!
+ *
+ * We do allow nonprintable characters (utf-8 and latin1 etc), but we
+ * require that they are just a fairly small percentage of the total
+ * file.
+ */
+static int is_binary(unsigned long size, struct text_stat *stats)
+{
+ if (stats->nonprintable > (size >> 3))
+ return 1;
+ /*
+ * Other heuristics? Average line length might be relevant,
+ * as might LF vs CR vs CRLF counts..
+ *
+ * NOTE! It might be normal to have a low ratio of CRLF to LF
+ * (somebody starts with a LF-only file and edits it with an editor
+ * that adds CRLF only to lines that are added..). But do we
+ * want to support CR-only? Probably not.
+ */
+ return 0;
+}
+
+int convert_to_git(const char *path, char **bufp, unsigned long *sizep)
+{
+ char *buffer, *nbuf;
+ unsigned long size, nsize;
+ struct text_stat stats;
+
+ /*
+ * FIXME! Other pluggable conversions should go here,
+ * based on filename patterns. Right now we just do the
+ * stupid auto-CRLF one.
+ */
+ if (!auto_crlf)
+ return 0;
+
+ size = *sizep;
+ if (!size)
+ return 0;
+ buffer = *bufp;
+
+ gather_stats(buffer, size, &stats);
+
+ /* No CR? Nothing to convert, regardless. */
+ if (!stats.cr)
+ return 0;
+
+ /*
+ * We're currently not going to even try to convert stuff
+ * that has bare CR characters. Does anybody do that crazy
+ * stuff?
+ */
+ if (stats.cr != stats.crlf)
+ return 0;
+
+ /*
+ * And add some heuristics for binary vs text, of course..
+ */
+ if (is_binary(size, &stats))
+ return 0;
+
+ /*
+ * Ok, allocate a new buffer, fill it in, and return true
+ * to let the caller know that we switched buffers on it.
+ */
+ nsize = size - stats.crlf;
+ nbuf = xmalloc(nsize);
+ *bufp = nbuf;
+ *sizep = nsize;
+ do {
+ unsigned char c = *buffer++;
+ if (c != '\r')
+ *nbuf++ = c;
+ } while (--size);
+
+ return 1;
+}
+
+int convert_to_working_tree(const char *path, char **bufp, unsigned long *sizep)
+{
+ char *buffer, *nbuf;
+ unsigned long size, nsize;
+ struct text_stat stats;
+ unsigned char last;
+
+ /*
+ * FIXME! Other pluggable conversions should go here,
+ * based on filename patterns. Right now we just do the
+ * stupid auto-CRLF one.
+ */
+ if (!auto_crlf)
+ return 0;
+
+ size = *sizep;
+ if (!size)
+ return 0;
+ buffer = *bufp;
+
+ gather_stats(buffer, size, &stats);
+
+ /* No LF? Nothing to convert, regardless. */
+ if (!stats.lf)
+ return 0;
+
+ /* Was it already in CRLF format? */
+ if (stats.lf == stats.crlf)
+ return 0;
+
+ /* If we have any bare CR characters, we're not going to touch it */
+ if (stats.cr != stats.crlf)
+ return 0;
+
+ if (is_binary(size, &stats))
+ return 0;
+
+ /*
+ * Ok, allocate a new buffer, fill it in, and return true
+ * to let the caller know that we switched buffers on it.
+ */
+ nsize = size + stats.lf - stats.crlf;
+ nbuf = xmalloc(nsize);
+ *bufp = nbuf;
+ *sizep = nsize;
+ last = 0;
+ do {
+ unsigned char c = *buffer++;
+ if (c == '\n' && last != '\r')
+ *nbuf++ = '\r';
+ *nbuf++ = c;
+ last = c;
+ } while (--size);
+
+ return 1;
+}
diff --git a/diff.c b/diff.c
index aaab309..561587c 100644
--- a/diff.c
+++ b/diff.c
@@ -1332,6 +1332,9 @@ int diff_populate_filespec(struct diff_filespec *s, int size_only)
reuse_worktree_file(s->path, s->sha1, 0)) {
struct stat st;
int fd;
+ char *buf;
+ unsigned long size;
+
if (lstat(s->path, &st) < 0) {
if (errno == ENOENT) {
err_empty:
@@ -1364,6 +1367,19 @@ int diff_populate_filespec(struct diff_filespec *s, int size_only)
s->data = xmmap(NULL, s->size, PROT_READ, MAP_PRIVATE, fd, 0);
close(fd);
s->should_munmap = 1;
+
+ /*
+ * Convert from working tree format to canonical git format
+ */
+ buf = s->data;
+ size = s->size;
+ if (convert_to_git(s->path, &buf, &size)) {
+ munmap(s->data, s->size);
+ s->should_munmap = 0;
+ s->data = buf;
+ s->size = size;
+ s->should_free = 1;
+ }
}
else {
char type[20];
diff --git a/entry.c b/entry.c
index 0ebf0f0..472a9ef 100644
--- a/entry.c
+++ b/entry.c
@@ -78,6 +78,9 @@ static int write_entry(struct cache_entry *ce, char *path, struct checkout *stat
path, sha1_to_hex(ce->sha1));
}
switch (ntohl(ce->ce_mode) & S_IFMT) {
+ char *buf;
+ unsigned long nsize;
+
case S_IFREG:
if (to_tempfile) {
strcpy(path, ".merge_file_XXXXXX");
@@ -89,6 +92,18 @@ static int write_entry(struct cache_entry *ce, char *path, struct checkout *stat
return error("git-checkout-index: unable to create file %s (%s)",
path, strerror(errno));
}
+
+ /*
+ * Convert from git internal format to working tree format
+ */
+ buf = new;
+ nsize = size;
+ if (convert_to_working_tree(ce->name, &buf, &nsize)) {
+ free(new);
+ new = buf;
+ size = nsize;
+ }
+
wrote = write_in_full(fd, new, size);
close(fd);
free(new);
diff --git a/environment.c b/environment.c
index 54c22f8..2fa0960 100644
--- a/environment.c
+++ b/environment.c
@@ -28,6 +28,7 @@ size_t packed_git_window_size = DEFAULT_PACKED_GIT_WINDOW_SIZE;
size_t packed_git_limit = DEFAULT_PACKED_GIT_LIMIT;
int pager_in_use;
int pager_use_color = 1;
+int auto_crlf = 0;
static const char *git_dir;
static char *git_object_dir, *git_index_file, *git_refs_dir, *git_graft_file;
diff --git a/sha1_file.c b/sha1_file.c
index 0d4bf80..6ec67b2 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2082,7 +2082,7 @@ int index_fd(unsigned char *sha1, int fd, struct stat *st, int write_object, con
{
unsigned long size = st->st_size;
void *buf;
- int ret;
+ int ret, re_allocated = 0;
buf = "";
if (size)
@@ -2091,10 +2091,30 @@ int index_fd(unsigned char *sha1, int fd, struct stat *st, int write_object, con
if (!type)
type = blob_type;
+
+ /*
+ * Convert blobs to git internal format
+ */
+ if (!strcmp(type, blob_type)) {
+ unsigned long nsize = size;
+ char *nbuf = buf;
+ if (convert_to_git(NULL, &nbuf, &nsize)) {
+ if (size)
+ munmap(buf, size);
+ size = nsize;
+ buf = nbuf;
+ re_allocated = 1;
+ }
+ }
+
if (write_object)
ret = write_sha1_file(buf, size, type, sha1);
else
ret = hash_sha1_file(buf, size, type, sha1);
+ if (re_allocated) {
+ free(buf);
+ return ret;
+ }
if (size)
munmap(buf, size);
return ret;
^ permalink raw reply related [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-13 12:16 ` Alexander Litvinov
2007-02-13 12:37 ` Johannes Schindelin
@ 2007-02-13 19:36 ` Mark Levedahl
2007-02-13 20:32 ` Linus Torvalds
2007-02-13 21:58 ` Robin Rosenberg
1 sibling, 2 replies; 83+ messages in thread
From: Mark Levedahl @ 2007-02-13 19:36 UTC (permalink / raw)
To: git
Alexander Litvinov wrote:
> On Tuesday 13 February 2007 16:06, Johannes Schindelin wrote:
>> Hi,
>>
>> On Tue, 13 Feb 2007, Alexander Litvinov wrote:
>> > When I have a file that was converted from DOS to Unix format (or from
>> > Unix to DOS), git generates a big diff. But anyway, the C++ compiler works
>> > well with both formats, and in this case I simply convert the file to DOS
>> > format and git shows a nice diff again. If the Unix format was committed
>> > to git, I simply change the format and commit that file again.
>>
>> That's awful!
> If you are trying to build history that looks good, you are right: this is
> a terrible workflow.
>
>> > The only trouble is rebase: it does not like \r\n endings and often
>> > produces unexpected merge conflicts. But I don't use rebase often enough
>> > to really investigate and try to solve the problem.
>>
>> Well, if everybody thinks like you, maybe we do not have to change
>> anything for Windows after all?
> I still wish to have a working rebase, so it would be nice if git handled
> \r\n somehow. But please do not reproduce the behavior of cvs: under cygwin
> it still uses \n!
Cygwin != Windows, Cygwin is a POSIX emulation layer with the explicit goal
of providing user tools behaving exactly as they do under Linux, and this
includes line ending style.
So, the Cygwin ports of various Linux tools are not expected to satisfy
users who want native Win32 behavior. This is where the mingw port of git
fits in. Yes, under Cygwin git can track files with \r\n endings, but:
1) Those projects are not portable to non-windows platforms, and
2) As you noted, git will have trouble with rebase, merge, etc. as there is
an assumption of \n endings throughout.
A proper win32 port will accept any of \n, \r\n as valid line endings (add
\r to support Mac pre-OSX if anyone cares, I still occasionally see such
files), treat any of them as semantically equal, and enforce the user's
chosen style (\n or \r\n) on output. cvsnt and svn under Windows do this
today, serving up "text" files from the same repository with \n endings or
\r\n endings depending upon the client, and that is what we need a win32 git to
do as well.
Mark
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-13 19:36 ` Mark Levedahl
@ 2007-02-13 20:32 ` Linus Torvalds
2007-02-14 1:42 ` Mark Levedahl
2007-02-13 21:58 ` Robin Rosenberg
1 sibling, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2007-02-13 20:32 UTC (permalink / raw)
To: Mark Levedahl; +Cc: git
On Tue, 13 Feb 2007, Mark Levedahl wrote:
>
> A proper win32 port will accept any of \n, \r\n as valid line endings (add
> \r to support Mac pre-OSX if anyone cares, I still occasionally see such
> files), treat any of them as semantically equal, and enforce the user's
> chosen style (\n or \r\n) on output.
The patch I sent out does that, except right now the "autocrlf" flag is
just a pure boolean.
I could easily make it take a ternary value:
- off (normal UNIX semantics - never change anything)
- on (turn CRLF->LF on input, turn LF->CRLF on output)
- input-only (turn CRLF->LF on input, leave LF alone on output)
that would be just a couple of extra lines (almost all of them in the
config file parsing logic).
[ The "output-only" case is obviously possible, but insane. It would turn
a LF-only file into CRLF on output, and then not turn it back on input,
so doing any "git commit -a" would basically turn every single line
into CRLF, which you do NOT want.
So hopefully that explains the three - not four - cases ]
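[ Sketch, not part of the patch: the three settings listed above can be
modelled as a small tri-state. The enum and function names here are
hypothetical, and treating a missing or unknown value as "off" is an
assumption, not something the patch specifies. ]

```c
#include <stddef.h>
#include <strings.h>	/* strcasecmp (POSIX) */

/* Hypothetical names; the numeric values are only illustrative. */
enum crlf_mode {
	CRLF_INPUT = -1,	/* CRLF->LF on input, LF left alone on output */
	CRLF_OFF = 0,		/* never change anything */
	CRLF_ON = 1		/* CRLF->LF on input, LF->CRLF on output */
};

/* Map a config value string to a mode. Unknown or missing values are
 * treated as "off" here, which is an assumption of this sketch. */
enum crlf_mode parse_crlf_mode(const char *value)
{
	if (!value)
		return CRLF_OFF;
	if (!strcasecmp(value, "input"))
		return CRLF_INPUT;
	if (!strcasecmp(value, "true") || !strcasecmp(value, "on") ||
	    !strcasecmp(value, "yes"))
		return CRLF_ON;
	return CRLF_OFF;
}

/* Both "on" and "input-only" strip CRs when content enters git. */
int converts_on_input(enum crlf_mode mode)
{
	return mode != CRLF_OFF;
}

/* Only "on" adds CRs when files are written to the working tree. */
int converts_on_output(enum crlf_mode mode)
{
	return mode == CRLF_ON;
}
```

The missing fourth combination (output without input) is exactly the
"insane" case described above, so the sketch gives it no enum value.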
And the patch already leaves files that the user doesn't touch alone (ie
if you check something out with CRLF turned off, and then turn it on in
the config, nobody will care - the checked-out copy will have LF-only even
if explicitly re-checking it out would turn it into CRLF, but that's fine).
It would be interesting to hear if the patch works for the MinGW people in
particular. People using git with a Cygwin environment are probably used
to trying to keep files with just LF, since they are really trying to do a
UNIX environment on top of Windows. But I suspect that MinGW people are
more likely to use native Windows tools for things, and then perhaps just
a smattering of UNIXy tools..
Linus
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-13 19:07 ` Linus Torvalds
@ 2007-02-13 20:42 ` Sam Ravnborg
2007-02-13 21:08 ` Nicolas Pitre
` (3 more replies)
2007-02-14 5:16 ` Junio C Hamano
` (2 subsequent siblings)
3 siblings, 4 replies; 83+ messages in thread
From: Sam Ravnborg @ 2007-02-13 20:42 UTC (permalink / raw)
To: Linus Torvalds
Cc: Junio C Hamano, Johannes Schindelin, Alexander Litvinov,
Mark Levedahl, Git Mailing List
> Anyway, BY DEFAULT it is off regardless, because it requires a
>
> [core]
> AutoCRLF = true
>
> in your config file to be enabled. We could make that the default for
> Windows, of course, the same way we do some other things (filemode etc).
This whole auto CRLF thing seems to deal with DOS issues that I personally
have not encountered in a looong time.
Granted, notepad in Windows does not understand UNIX files, but that's a bug
in notepad and everyone knows that wordpad can be used.
I wonder what we are really trying to address here. Or in other words,
could the original poster maybe tell which Windows IDEs do
not handle UNIX files properly?
core git today should not care about CRLF as opposed to LF end-of-line
as long as the end-of-line is consistent - correct?
So defaulting to autoCRLF in Windows/DOS environments was maybe
sane 10 years ago, but today that seems to be the wrong thing to do.
For certain projects the option could be useful if the toolset in
the project *requires* CRLF, but if the toolset, like all modern toolsets,
supports both CRLF and LF, then git had better avoid changing the end-of-line marker.
Sam
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-13 20:42 ` Sam Ravnborg
@ 2007-02-13 21:08 ` Nicolas Pitre
2007-02-13 23:19 ` David Lang
` (2 subsequent siblings)
3 siblings, 0 replies; 83+ messages in thread
From: Nicolas Pitre @ 2007-02-13 21:08 UTC (permalink / raw)
To: Sam Ravnborg
Cc: Linus Torvalds, Junio C Hamano, Johannes Schindelin,
Alexander Litvinov, Mark Levedahl, Git Mailing List
On Tue, 13 Feb 2007, Sam Ravnborg wrote:
> This whole auto CRLF thing seems to deal with DOS issues that I personally
> have not encountered in a looong time.
Maybe you haven't shared a work environment with Windows users in a
looong time.
> Granted, notepad in Windows does not understand UNIX files, but that's a bug
> in notepad and everyone knows that wordpad can be used.
>
> I wonder what we are really trying to address here. Or in other words,
> could the original poster maybe tell which Windows IDEs do
> not handle UNIX files properly?
Windows IDEs can _create_ files. Those files will be CRLF infected.
Also some of them read UNIX files just fine but they will use CRLF to
end new added lines despite the rest of the file using only LF.
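[ Sketch, not from the thread: the mixed-ending situation described above
can be detected by counting the three kinds of line terminator in a
buffer. The struct and function names are hypothetical. ]

```c
#include <stddef.h>

/* Hypothetical tally of the line terminators found in a buffer. */
struct eol_counts {
	size_t bare_lf;	/* "\n" not preceded by "\r" */
	size_t crlf;	/* "\r\n" pairs */
	size_t lone_cr;	/* "\r" not followed by "\n" */
};

/* Walk the buffer once and classify every CR and LF. */
void count_eols(const char *buf, size_t len, struct eol_counts *out)
{
	size_t i;

	out->bare_lf = out->crlf = out->lone_cr = 0;
	for (i = 0; i < len; i++) {
		if (buf[i] == '\r') {
			if (i + 1 < len && buf[i + 1] == '\n') {
				out->crlf++;
				i++;	/* skip the LF of this pair */
			} else {
				out->lone_cr++;
			}
		} else if (buf[i] == '\n') {
			out->bare_lf++;
		}
	}
}

/* A file is "mixed" when both LF-only and CRLF lines are present. */
int has_mixed_eols(const struct eol_counts *c)
{
	return c->bare_lf > 0 && c->crlf > 0;
}
```

An editor that appends CRLF lines to an LF-only file produces exactly
the case where has_mixed_eols() returns true.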
> core git today should not care about CRLF as opposed to LF end-of-line
> as long as the end-of-line is consistent - correct?
Consistency won't come on its own if not enforced in some way.
> So defaulting to autoCRLF in Windows/DOS environments was maybe
> sane 10 years ago, but today that seems to be the wrong thing to do.
> For certain projects the option could be useful if the toolset in
> the project *requires* CRLF, but if the toolset, like all modern toolsets,
> supports both CRLF and LF, then git had better avoid changing the end-of-line marker.
Rather, git had better enforce consistency, otherwise there will only be a
mix of possible combinations as soon as Windows and UNIX users work on the
same project.
Nicolas
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-13 19:36 ` Mark Levedahl
2007-02-13 20:32 ` Linus Torvalds
@ 2007-02-13 21:58 ` Robin Rosenberg
2007-02-14 1:18 ` Mark Levedahl
1 sibling, 1 reply; 83+ messages in thread
From: Robin Rosenberg @ 2007-02-13 21:58 UTC (permalink / raw)
To: Mark Levedahl; +Cc: git
On Tuesday 13 February 2007 20:36, Mark Levedahl wrote:
> Alexander Litvinov wrote:
>
> > On Tuesday 13 February 2007 16:06, Johannes Schindelin wrote:
> >> Hi,
> >>
> >> On Tue, 13 Feb 2007, Alexander Litvinov wrote:
> >> > When I have a file that was converted from DOS to Unix format (or from
> >> > Unix to DOS), git generates a big diff. But anyway, the C++ compiler works
> >> > well with both formats, and in this case I simply convert the file to DOS
> >> > format and git shows a nice diff again. If the Unix format was committed
> >> > to git, I simply change the format and commit that file again.
> >>
> >> That's awful!
> > If you are trying to build history that looks good, you are right: this is
> > a terrible workflow.
> >
> >> > The only trouble is rebase: it does not like \r\n endings and often
> >> > produces unexpected merge conflicts. But I don't use rebase often enough
> >> > to really investigate and try to solve the problem.
> >>
> >> Well, if everybody thinks like you, maybe we do not have to change
> >> anything for Windows after all?
> > I still wish to have a working rebase, so it would be nice if git handled
> > \r\n somehow. But please do not reproduce the behavior of cvs: under cygwin
> > it still uses \n!
>
> Cygwin != Windows, Cygwin is a POSIX emulation layer with the explicit goal
> of providing user tools behaving exactly as they do under Linux, and this
> includes line ending style.
Line ending style is selectable in cygwin, both on a global level and path level (cygwin
mounts). If you use CVS for Windows development, using CRLF works well and
is the only option if you want to use the same working area with both native CVS clients
like TortoiseCVS and the cygwin client. I use the CRLF style by default and LF only
for selected directories. The only annoying thing I see is that files transformed by patch end
up with LF-only line endings.
> So, the Cygwin ports of various Linux tools are not expected to satisfy
> users who want native Win32 behavior. This is where the mingw port of git
> fits in. Yes, under Cygwin git can track files with \r\n endings, but:
> 1) Those projects are not portable to non-windows platforms, and
> 2) As you noted, git will have trouble with rebase, merge, etc. as there is
> an assumption of \n endings throughout.
Even if there is a native port, I'm inclined to want to use the cygwin version
anyway because of the nice shell and scripting capabilities and large selection of packages
that match what I'm used to in Linux. Git under cygwin should do CRLF transformations
according to the same rules that apply to text files in cygwin.
-- robin
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-13 20:42 ` Sam Ravnborg
2007-02-13 21:08 ` Nicolas Pitre
@ 2007-02-13 23:19 ` David Lang
2007-02-13 23:28 ` Linus Torvalds
2007-02-14 3:47 ` Alexander Litvinov
3 siblings, 0 replies; 83+ messages in thread
From: David Lang @ 2007-02-13 23:19 UTC (permalink / raw)
To: Sam Ravnborg
Cc: Linus Torvalds, Junio C Hamano, Johannes Schindelin,
Alexander Litvinov, Mark Levedahl, Git Mailing List
On Tue, 13 Feb 2007, Sam Ravnborg wrote:
>
> I wonder what we are really trying to address here. Or in other words
> could the original poster maybe tell which Windows IDEs do
> not handle UNIX files properly?
>
> core git today should not care about CRLF as opposed to LF end-of-line
> as long as the end-of-line is consistent - correct?
>
> So defaulting to autoCRLF in Windows/DOS environments was maybe
> sane 10 years ago, but today that seems to be the wrong thing to do.
> For certain projects the option could be useful if the toolset in
> the project *requires* CRLF, but if the toolset, like all modern toolsets,
> supports both CRLF and LF, then git had better avoid changing the end-of-line marker.
I've actually run into grief on this subject with perl scripts within the last
year (files from windows systems with crlf not working cleanly on a linux system
with just lf)
this is real, not just historic
David Lang
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-13 20:42 ` Sam Ravnborg
2007-02-13 21:08 ` Nicolas Pitre
2007-02-13 23:19 ` David Lang
@ 2007-02-13 23:28 ` Linus Torvalds
2007-02-14 8:41 ` Sam Ravnborg
2007-02-14 3:47 ` Alexander Litvinov
3 siblings, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2007-02-13 23:28 UTC (permalink / raw)
To: Sam Ravnborg
Cc: Junio C Hamano, Johannes Schindelin, Alexander Litvinov,
Mark Levedahl, Git Mailing List
On Tue, 13 Feb 2007, Sam Ravnborg wrote:
>
> This whole auto CRLF thing seems to deal with DOS issues that I personally
> have not encountered in a looong time.
Maybe you stopped using DOS a loong time ago ;)
It's definitely an issue. Yes, all windows programs basically *understand*
files that have just LF. But almost all of them will *write* files with
CRLF.
(Which means that I suspect I made the default for "auto_crlf" be wrong in
my patch: I probably should not default to checking out with CRLF, but
checking out with just LF, and only do the CRLF->LF conversion on input).
Anybody who has ever worked with _any_ Windows people has long since
learnt that they always end up having to convert CRLF to just LF when they
get files. Even _I_ know it, and I seldom have to work with people who use
Windows ;)
So it's a good idea to try to make sure that Windows users don't corrupt
files by adding CRLF where there is no need for them into a git archive.
We hope to convert those people to a real OS some day ("here's a nickel,
boy"), and to make it easier for them to do it, making sure that their
projects in -git are already in a sane format is probably a good idea.
Linus
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-13 21:58 ` Robin Rosenberg
@ 2007-02-14 1:18 ` Mark Levedahl
0 siblings, 0 replies; 83+ messages in thread
From: Mark Levedahl @ 2007-02-14 1:18 UTC (permalink / raw)
To: Robin Rosenberg; +Cc: git
Robin Rosenberg wrote:
>
> Even if there is a native port, I'm inclined to want to use the cygwin version
> anyway because of the nice shell and scripting capabilities and large selection of packages
> that match what I'm used to in Linux. Git under cygwin should do CRLF transformations
> according to the same rules that apply to text files in cygwin.
>
> -- robin
The cygwin project is explicitly trying to bury the "text" mount option
and drive towards binary (= \n line endings) only. They once had a rule
that all cygwin programs fully grok \r\n, but that ethic disappeared a
couple of years ago, it was just too hard. The cygwin git port itself
will not operate on a text mount, it requires a binary mount, so crlf
translations are simply not available with git under cygwin.
Mark
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-13 20:32 ` Linus Torvalds
@ 2007-02-14 1:42 ` Mark Levedahl
2007-02-14 2:16 ` Linus Torvalds
0 siblings, 1 reply; 83+ messages in thread
From: Mark Levedahl @ 2007-02-14 1:42 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git
Linus Torvalds wrote:
>
> On Tue, 13 Feb 2007, Mark Levedahl wrote:
>> A proper win32 port will accept any of \n, \r\n as valid line endings (add
>> \r to support Mac pre-OSX if anyone cares, I still occasionally see such
>> files), treat any of them as semantically equal, and enforce the user's
>> chosen style (\n or \r\n) on output.
>
> The patch I sent out does that, except right now the "autocrlf" flag is
> just a pure boolean.
>
> I could easily make it take a ternary value:
> - off (normal UNIX semantics - never change anything)
> - on (turn CRLF->LF on input, turn LF->CRLF on output)
> - input-only (turn CRLF->LF on input, leave LF alone on output)
>
>
> Linus
Wow, this is an incredible response: I expected I was going to be
studying git internals for a while to get to this point. Thank you!
The ternary value is definitely useful. As noted elsewhere, most tools
on windows are very happy with \n ending, few honor those line endings
when files are modified, and fewer still allow the user to specify use
of \n for new files. However, cygwin tools in particular are not
tolerant of crlf, so for that environment it makes sense to banish crlf
and the input-only option is most likely the best default setting there.
Mark
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-14 1:42 ` Mark Levedahl
@ 2007-02-14 2:16 ` Linus Torvalds
0 siblings, 0 replies; 83+ messages in thread
From: Linus Torvalds @ 2007-02-14 2:16 UTC (permalink / raw)
To: Mark Levedahl; +Cc: git
On Tue, 13 Feb 2007, Mark Levedahl wrote:
>
> The ternary value is definitely useful. As noted elsewhere, most tools on
> windows are very happy with \n ending, few honor those line endings when files
> are modified, and fewer still allow the user to specify use of \n for new
> files. However, cygwin tools in particular are not tolerant of crlf, so for
> that environment it makes sense to banish crlf and the input-only option is
> most likely the best default setting there.
Here's an UNTESTED patch on top of the patch I already sent, which allows
you to do
[core]
AutoCRLF = input
and it should do only the CRLF->LF translation (ie it simplifies CRLF only
when reading working tree files, but when checking out files, it leaves
the LF alone, and doesn't turn it into a CRLF).
And by "untested" I mean that it looks ok and seems to compile, but I
really didn't do anything else.
Linus
---
diff --git a/config.c b/config.c
index ffe0212..e8ae919 100644
--- a/config.c
+++ b/config.c
@@ -325,6 +325,10 @@ int git_default_config(const char *var, const char *value)
}
if (!strcmp(var, "core.autocrlf")) {
+ if (value && !strcasecmp(value, "input")) {
+ auto_crlf = -1;
+ return 0;
+ }
auto_crlf = git_config_bool(var, value);
return 0;
}
diff --git a/convert.c b/convert.c
index c04b6c2..b5a47c2 100644
--- a/convert.c
+++ b/convert.c
@@ -133,7 +133,7 @@ int convert_to_working_tree(const char *path, char **bufp, unsigned long *sizep)
* based on filename patterns. Right now we just do the
* stupid auto-CRLF one.
*/
- if (!auto_crlf)
+ if (auto_crlf <= 0)
return 0;
size = *sizep;
diff --git a/environment.c b/environment.c
index 2fa0960..570e32a 100644
--- a/environment.c
+++ b/environment.c
@@ -28,7 +28,7 @@ size_t packed_git_window_size = DEFAULT_PACKED_GIT_WINDOW_SIZE;
size_t packed_git_limit = DEFAULT_PACKED_GIT_LIMIT;
int pager_in_use;
int pager_use_color = 1;
-int auto_crlf = 0;
+int auto_crlf = 0; /* 1: both ways, -1: only when adding git objects */
static const char *git_dir;
static char *git_object_dir, *git_index_file, *git_refs_dir, *git_graft_file;
^ permalink raw reply related [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-13 20:42 ` Sam Ravnborg
` (2 preceding siblings ...)
2007-02-13 23:28 ` Linus Torvalds
@ 2007-02-14 3:47 ` Alexander Litvinov
3 siblings, 0 replies; 83+ messages in thread
From: Alexander Litvinov @ 2007-02-14 3:47 UTC (permalink / raw)
To: Sam Ravnborg
Cc: Linus Torvalds, Junio C Hamano, Johannes Schindelin,
Mark Levedahl, Git Mailing List
On Wednesday 14 February 2007 02:42, Sam Ravnborg wrote:
> I wonder what we are really trying to address here. Or in other words,
> could the original poster maybe tell which Windows IDEs do
> not handle UNIX files properly?
MS VC uses text files for its project files but doesn't like \n line
endings, only \r\n.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-13 19:07 ` Linus Torvalds
2007-02-13 20:42 ` Sam Ravnborg
@ 2007-02-14 5:16 ` Junio C Hamano
2007-02-14 5:36 ` Linus Torvalds
2007-02-14 11:36 ` Alexander Litvinov
2007-02-14 16:16 ` Johannes Sixt
3 siblings, 1 reply; 83+ messages in thread
From: Junio C Hamano @ 2007-02-14 5:16 UTC (permalink / raw)
To: Linus Torvalds
Cc: Johannes Schindelin, Alexander Litvinov, Mark Levedahl, Git Mailing List
Linus Torvalds <torvalds@linux-foundation.org> writes:
> NOTE NOTE NOTE! The "is_binary()" heuristics are totally made-up by yours
> truly. I will not guarantee that they work at all reasonably. Caveat
> emptor. But it _is_ simple, and it _is_ safe, since it's all off by
> default.
It might be safe for some definition of safe, but it is very
Asian unfriendly.
I'd probably suggest replacing it with what GNU diff uses, which
we stole and implemented in diff.c::mmfile_is_binary().
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-14 5:16 ` Junio C Hamano
@ 2007-02-14 5:36 ` Linus Torvalds
2007-02-14 11:10 ` Johannes Schindelin
0 siblings, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2007-02-14 5:36 UTC (permalink / raw)
To: Junio C Hamano
Cc: Johannes Schindelin, Alexander Litvinov, Mark Levedahl, Git Mailing List
On Tue, 13 Feb 2007, Junio C Hamano wrote:
>
> It might be safe for some definition of safe, but it is very
> Asian unfriendly.
>
> I'd probably suggest replacing it with what GNU diff uses, which
> we stole and implemented in diff.c::mmfile_is_binary().
Well, the thing is, mmfile_is_binary() doesn't really have a big downside
if it's wrong one way or the other.
In contrast LF<->CRLF conversion, if wrong, actually corrupts binary files.
So I felt it was better to be really safe than sorry. It's *much* better
to miss some CRLF translation than to do too much of it.
That said, I'm sure it could be improved a lot. In particular, characters
in the range 0x00 - 0x1f are clearly "more binary" than the 0x7f+ range,
with the obvious exceptions (tab, cr, lf).
0x00 - which is the only one mmfile_is_binary() uses - is arguably the
"most binary" one, of course, but it might be interesting to give
different weights to the whole range.. In particular, especially for small
files, the fact that there is no 0x00 byte in no way indicates that it's
not "binary".
This whole issue is obviously one reason I'd like to involve the filename
itself, and make it use a ".gitattributes" file - exactly because that
allows you to be much more aggressive and more precise.
(0x00 may be one of the more _common_ characters in many binary files,
which makes it a good character to search for too, so I don't really have
any hugely strong opinions here. After all, the whole heuristic is off by
default anyway, so it's "really safe" ;^)
Linus
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-13 23:28 ` Linus Torvalds
@ 2007-02-14 8:41 ` Sam Ravnborg
2007-02-14 16:28 ` Linus Torvalds
0 siblings, 1 reply; 83+ messages in thread
From: Sam Ravnborg @ 2007-02-14 8:41 UTC (permalink / raw)
To: Linus Torvalds
Cc: Junio C Hamano, Johannes Schindelin, Alexander Litvinov,
Mark Levedahl, Git Mailing List
> > This whole auto CRLF thing seems to deal with DOS issues that I personally
> > have not encountered in a looong time.
>
> Maybe you stopped using DOS a loong time ago ;)
Unfortunately not. (Sitting with a Windows 2000 laptop atm but saved by ssh).
>
> It's definitely an issue. Yes, all windows programs basically *understand*
> files that have just LF. But almost all of them will *write* files with
> CRLF.
So the point of git supporting CRLF -> LF is interoperability between
UNIX* programs and Windows programs, which is another domain.
My main objection is to the proposal to make conversion the default when many
users do not need it. For the UNIX* compatibility thing, having conversion at
the lowest layer makes sense.
> (Which means that I suspect I made the default for "auto_crlf" be wrong in
> my patch: I probably should not default to checking out with CRLF, but
> checking out with just LF, and only do the CRLF->LF conversion on input).
Except that it seems a few br0ken programs still do not support LF as
end-of-line marker - so .gitattributes may take special care here.
Sam
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-14 5:36 ` Linus Torvalds
@ 2007-02-14 11:10 ` Johannes Schindelin
2007-02-14 14:26 ` Mark Levedahl
2007-02-14 15:44 ` Linus Torvalds
0 siblings, 2 replies; 83+ messages in thread
From: Johannes Schindelin @ 2007-02-14 11:10 UTC (permalink / raw)
To: Linus Torvalds
Cc: Junio C Hamano, Alexander Litvinov, Mark Levedahl, Git Mailing List
Hi,
On Tue, 13 Feb 2007, Linus Torvalds wrote:
> 0x00 - which is the only one mmfile_is_binary() uses - is arguably the
> "most binary" one, of course, but it might be interesting to give
> different weights to the whole range.. In particular, especially for
> small files, the fact that there is no 0x00 byte in no way indicates
> that it's not "binary".
Last time I checked, the text files never had lines longer than 200
characters (I chose this intentionally large). So, it might be a good
heuristic to check the maximal line length, and refuse to believe that
it's text once a certain (configurable) threshold is reached.
Ciao,
Dscho
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-13 19:07 ` Linus Torvalds
2007-02-13 20:42 ` Sam Ravnborg
2007-02-14 5:16 ` Junio C Hamano
@ 2007-02-14 11:36 ` Alexander Litvinov
2007-02-14 16:37 ` Linus Torvalds
2007-02-14 16:16 ` Johannes Sixt
3 siblings, 1 reply; 83+ messages in thread
From: Alexander Litvinov @ 2007-02-14 11:36 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List
On Wednesday 14 February 2007 01:07, Linus Torvalds wrote:
> Actually, I did it myself.
>
> This is a "lazy man's auto-CRLF", and it really is pretty simple.
Wow ! Thanks.
I just tried this patch and it works! From now on I can use git-cvsimport under
Linux and then clone it to cygwin and work there with full history. Nice,
very nice. In my case text file detection works well, as most of our files
are .cpp and .h
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-14 11:10 ` Johannes Schindelin
@ 2007-02-14 14:26 ` Mark Levedahl
2007-02-14 15:51 ` Linus Torvalds
2007-02-14 15:56 ` Johannes Schindelin
2007-02-14 15:44 ` Linus Torvalds
1 sibling, 2 replies; 83+ messages in thread
From: Mark Levedahl @ 2007-02-14 14:26 UTC (permalink / raw)
To: Johannes Schindelin
Cc: Linus Torvalds, Junio C Hamano, Alexander Litvinov,
Mark Levedahl, Git Mailing List
Johannes Schindelin wrote:
> Last time I checked, the text files never had lines longer than 200
> characters (I chose this intentionally large). So, it might be a good
> heuristic to check the maximal line length, and refuse to believe that
> it's text once a certain (configurable) threshold is reached.
>
> Ciao,
> Dscho
Unfortunately, on my program we have folks using text files with single
lines over 60,000 characters long, these are data files. Think for
example of a comma or tab separated data file saved from a spreadsheet.
In this case, the files are pure ascii. So, the line length could be
something else to take into account, but is not decisive by itself.
To recap, we have the following various suggestions to determine textness:
1) ratio of ascii to non-ascii characters, possibly weighting some chars
more than others
2) line length
3) existence of a null (\0)
4) file name globbing
5) roundtrip: lf(crlf(file)) == file
I don't think any one suggestion is completely adequate for all uses,
all need to be available, somehow configurable. This suggests to me a
core.AutoCRLFstrategy variable that is a comma separated list of methods
to use (set to a reasonable default of course that does not cause
runtime headaches on Unix): a file would be deemed binary unless all
listed methods declare the file as text (with an empty list disabling
AutoCRLF detection).
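[ Sketch, not from the thread: suggestion 5 can be read as "convert to
CRLF, convert back to LF, and compare with the original". The helper
names below are hypothetical, and the LF->CRLF step only mirrors the
rule used in the patch (add a CR before any LF that lacks one). ]

```c
#include <stdlib.h>
#include <string.h>

/* LF -> CRLF: add a CR before every LF not already preceded by one.
 * Returns a malloc'd buffer (caller frees) or NULL on allocation failure. */
char *to_crlf(const char *in, size_t len, size_t *outlen)
{
	size_t i, n = 0;
	char last = 0;
	char *out = malloc(2 * len + 1);	/* worst case: every byte is LF */

	if (!out)
		return NULL;
	for (i = 0; i < len; i++) {
		if (in[i] == '\n' && last != '\r')
			out[n++] = '\r';
		out[n++] = in[i];
		last = in[i];
	}
	*outlen = n;
	return out;
}

/* CRLF -> LF: drop the CR of every CRLF pair, keep lone CRs. */
char *to_lf(const char *in, size_t len, size_t *outlen)
{
	size_t i, n = 0;
	char *out = malloc(len + 1);

	if (!out)
		return NULL;
	for (i = 0; i < len; i++) {
		if (in[i] == '\r' && i + 1 < len && in[i + 1] == '\n')
			continue;
		out[n++] = in[i];
	}
	*outlen = n;
	return out;
}

/* Returns 1 when lf(crlf(buf)) reproduces buf byte for byte, else 0. */
int roundtrips(const char *buf, size_t len)
{
	size_t clen, llen;
	char *c, *l;
	int same;

	c = to_crlf(buf, len, &clen);
	if (!c)
		return 0;
	l = to_lf(c, clen, &llen);
	free(c);
	if (!l)
		return 0;
	same = (llen == len) && memcmp(l, buf, len) == 0;
	free(l);
	return same;
}
```

With these definitions the check fails for any buffer that already
contains a CRLF pair, so it is a strict test: a file with mixed endings
is rejected even if it is plainly text.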
Mark
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-14 11:10 ` Johannes Schindelin
2007-02-14 14:26 ` Mark Levedahl
@ 2007-02-14 15:44 ` Linus Torvalds
2007-02-14 15:53 ` Johannes Schindelin
1 sibling, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2007-02-14 15:44 UTC (permalink / raw)
To: Johannes Schindelin
Cc: Junio C Hamano, Alexander Litvinov, Mark Levedahl, Git Mailing List
On Wed, 14 Feb 2007, Johannes Schindelin wrote:
>
> Last time I checked, the text files never had lines longer than 200
> characters (I chose this intentionally large). So, it might be a good
> heuristic to check the maximal line length,
No, some broken editor programs and people use "flowing text" files, where
a newline is actually a _paragraph_ end. You have lines in the hundreds
(and thousands) of characters, and the program will just flow the text for
you.
Ugh. Horrible, I know.
Linus
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-14 14:26 ` Mark Levedahl
@ 2007-02-14 15:51 ` Linus Torvalds
2007-02-14 16:39 ` Junio C Hamano
2007-02-14 15:56 ` Johannes Schindelin
1 sibling, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2007-02-14 15:51 UTC (permalink / raw)
To: Mark Levedahl
Cc: Johannes Schindelin, Junio C Hamano, Alexander Litvinov,
Mark Levedahl, Git Mailing List
On Wed, 14 Feb 2007, Mark Levedahl wrote:
>
> To recap, we have the following various suggestions to determine textness:
>
> 1) ratio of ascii to non-ascii characters, possibly weighting some chars more
> than others
> 2) line length
> 3) existence of a null (\0)
> 4) file name globbing
> 5) roundtrip: lf(crlf(file)) == file
Actually, my patch already had one that you didn't mention:
6) CR never shows up alone.
So the patch I sent out basically had the following rules:
- no more than ~10% of all characters being other than regular printable
ASCII (where any control character except for newline/cr/tab was deemed
nonprintable)
- any "lonely" CR automatically means it's binary, and I would refuse
to convert that to a LF (the test in the code is that CRLF count must
match CR count)
but the "roundtrip" rule is much too strict (it's actually perfectly
possible for an editor to add CRLF characters only to new _lines_, leaving
old lines with just LF - or the other way around. In fact, the editor I
use under Linux does exactly that in reverse - if I add new lines, it will
add those without CR, but will leave old lines with CRLF alone).
I think that to help asian languages (or strange text-files in utf8 or
Latin1 too, for that matter: test-files with _just_ special characters), I
should probably make the rule be that only the 0-31 range is special.
Linus
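
The two rules described in this message can be sketched roughly as follows. This is a standalone illustration, not the actual convert.c code: the struct and function names are made up, and the exact form of the "~10%" comparison (line endings excluded from the denominator) is an assumption.

```c
#include <assert.h>
#include <stddef.h>

struct crlf_stats {
    unsigned long cr, crlf, printable, nonprintable;
};

/* Count CRs, CRLF pairs, and printable vs. nonprintable bytes. */
static void count_chars(const char *buf, size_t size, struct crlf_stats *st)
{
    size_t i;

    assert(buf != NULL && st != NULL);
    st->cr = st->crlf = st->printable = st->nonprintable = 0;
    for (i = 0; i < size; i++) {
        unsigned char c = (unsigned char)buf[i];

        if (c == '\r') {
            st->cr++;
            if (i + 1 < size && buf[i + 1] == '\n')
                st->crlf++;
            continue;
        }
        if (c == '\n')
            continue;
        /* Regular printable ASCII plus tab; everything else is suspect. */
        if (c == '\t' || (c >= 32 && c < 127))
            st->printable++;
        else
            st->nonprintable++;
    }
}

static int looks_like_text(const char *buf, size_t size)
{
    struct crlf_stats st;

    count_chars(buf, size, &st);
    /* Rule 6: a "lonely" CR means the CR and CRLF counts differ. */
    if (st.cr != st.crlf)
        return 0;
    /* Assumed form of the ~10% rule over the counted characters. */
    if (st.nonprintable * 10 > st.printable + st.nonprintable)
        return 0;
    return 1;
}
```

Note that in this form every byte in the 128-255 range counts as nonprintable, which is exactly the problem for UTF-8 or Latin1 text raised in the last paragraph of the message.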
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-14 15:44 ` Linus Torvalds
@ 2007-02-14 15:53 ` Johannes Schindelin
0 siblings, 0 replies; 83+ messages in thread
From: Johannes Schindelin @ 2007-02-14 15:53 UTC (permalink / raw)
To: Linus Torvalds
Cc: Junio C Hamano, Alexander Litvinov, Mark Levedahl, Git Mailing List
Hi,
On Wed, 14 Feb 2007, Linus Torvalds wrote:
> On Wed, 14 Feb 2007, Johannes Schindelin wrote:
> >
> > Last time I checked, the text files never had lines longer than 200
> > characters (I chose this intentionally large). So, it might be a good
> > heuristic to check the maximal line length,
>
> No, some broken editor programs and people use "flowing text" files, where
> a newline is actually a _paragraph_ end.
Good point.
Ciao,
Dscho
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-14 14:26 ` Mark Levedahl
2007-02-14 15:51 ` Linus Torvalds
@ 2007-02-14 15:56 ` Johannes Schindelin
2007-02-14 16:23 ` Linus Torvalds
2007-02-14 17:28 ` Mark Levedahl
1 sibling, 2 replies; 83+ messages in thread
From: Johannes Schindelin @ 2007-02-14 15:56 UTC (permalink / raw)
To: Mark Levedahl
Cc: Linus Torvalds, Junio C Hamano, Alexander Litvinov,
Mark Levedahl, Git Mailing List
Hi,
On Wed, 14 Feb 2007, Mark Levedahl wrote:
> This suggests to me a core.AutoCRLFstrategy variable that is a comma
> separated list of methods to use (set to a reasonable default of course
> that does not cause runtime headaches on Unix): a file would be deemed
> binary unless all listed methods declare the file as text (with an empty
> list disabling AutoCRLF detection).
This sounds regretfully complex. Somebody (you?) mentioned that cvsnt does
a kick-ass job here. Does cvsnt need strategies? I don't think so. Neither
do we. Someone who cares enough should just rip^H^H^Hlook at cvsnt's text
detection.
Ciao,
Dscho
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-13 19:07 ` Linus Torvalds
` (2 preceding siblings ...)
2007-02-14 11:36 ` Alexander Litvinov
@ 2007-02-14 16:16 ` Johannes Sixt
2007-02-14 16:53 ` Linus Torvalds
3 siblings, 1 reply; 83+ messages in thread
From: Johannes Sixt @ 2007-02-14 16:16 UTC (permalink / raw)
To: git
Linus Torvalds wrote:
>
> On Tue, 13 Feb 2007, Junio C Hamano wrote:
> >
> > Thanks, applied. I think git-apply has separate codepaths for
> > both reading and writing; I won't look into them before 1.5.0
> > but people are welcome to help advancing the cause before I get
> > to it ;-).
>
> Actually, I did it myself.
>
> This is a "lazy man's auto-CRLF", and it really is pretty simple.
Thanks a lot, busy beaver! I gave this a quick spin with a few
interesting operations: merges and rebase. Merges leave the merge
results with only LFs behind. Rebasing seems to work as expected
(working files have CRLFs), except when merges are needed.
Doesn't git-unpack-file also need to call into the converter?
-- Hannes
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-14 15:56 ` Johannes Schindelin
@ 2007-02-14 16:23 ` Linus Torvalds
2007-02-14 17:28 ` Mark Levedahl
1 sibling, 0 replies; 83+ messages in thread
From: Linus Torvalds @ 2007-02-14 16:23 UTC (permalink / raw)
To: Johannes Schindelin
Cc: Mark Levedahl, Junio C Hamano, Alexander Litvinov, Mark Levedahl,
Git Mailing List
On Wed, 14 Feb 2007, Johannes Schindelin wrote:
>
> This sounds regretfully complex. Somebody (you?) mentioned that cvsnt does
> a kick-ass job here. Does cvsnt need strategies? I don't think so. Neither
> do we. Someone who cares enough should just rip^H^H^Hlook at cvsnt's text
> detection.
Well, one thing to keep in mind is that for source code in particular,
this really very seldom is an issue.
So you can do a really *bad* job in theory, and in practice it really
works very very well.
Very few people keep binary blobs in any SCM archive _anyway_, partly
because they've always been told that it's unsafe (and with a lot of SCM's
it is), but even more because binary blobs are almost always generated by
some build method, so normally you'd never version them in the first
place, or versioning isn't all that helpful.
And most binary blobs are so *obviously* binary that even the stupidest
algorithm on earth will get it right. The only hard cases actually tend to
be really tiny files, or literally test-sequences.
Tiny files are hard because:
- they (by being tiny) have so few characters that they can easily lack
a "fingerprint" character (eg a NUL character or similar).
- tiny files are a lot more likely than bigger files to have strange
statistics that throw some more "sophisticated" rule off the scent.
Something like a "10% rule" tends to work fine if you have a big text,
and ten percent is still a reasonable number to average things out
over, but what if you only had ten characters to begin with?
The good news is that tiny files can usually be considered text, since
you'd seldom use a binary format for something really small anyway.
So I suspect that IN PRACTICE, especially if you come as a CVS replacement
(where binary files are just damn hard to get right even under the best of
circumstances!), you can do just about anything, including just saying
"everything is text", and you'd be fine.
It's entirely possible that that is exactly what CVSNT does ;)
Linus
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-14 8:41 ` Sam Ravnborg
@ 2007-02-14 16:28 ` Linus Torvalds
2007-02-14 16:47 ` Sam Ravnborg
0 siblings, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2007-02-14 16:28 UTC (permalink / raw)
To: Sam Ravnborg
Cc: Junio C Hamano, Johannes Schindelin, Alexander Litvinov,
Mark Levedahl, Git Mailing List
On Wed, 14 Feb 2007, Sam Ravnborg wrote:
>
> > (Which means that I suspect I made the default for "auto_crlf" be wrong in
> > my patch: I probably should not default to checking out with CRLF, but
> > checking out with just LF, and only do the CRLF->LF conversion on input).
>
> Except that it seems a few br0ken programs still do not support LF as
> end-of-line marker - so .gitattributes may take special care here.
Yes, but I also think that even without .gitattributes, you just want to
have a default for what "text" actually means, and it's entirely possible
that the default should be: "check out with just LF, and on check-in turn
CRLF into LF".
But exactly because _some_ programs might want to always see CRLF on input
too, it should be overridable.
Or maybe the default should be "turn into CRLF", and there should just be
an option to make it check out as LF-only.
Regardless, I think that is independent of ".gitattributes". The
_attribute_ should be "text", but what it then means in practice is a
separate flag.
And yes, we *could* have a per-file attribute ("text,crlf-checkout") which
could be used to say "I want to always check out as crlf regardless of any
other policy") and the same for lf-only, but I seriously doubt that
anybody really needs that kind of knob-tweaking. At some point it's just
fine to say "you're crazy".
Linus
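
To make the split between the "text" attribute and the checkout policy concrete, here is a minimal sketch. The enum and function names are hypothetical and are not taken from the patch; it only assumes that check-in always normalizes CRLF to LF and that a separate flag decides what checkout writes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical policy flag, separate from the per-file "text" attribute. */
enum eol_policy { EOL_LF, EOL_CRLF };

/*
 * Check-in direction: drop the CR of every CRLF pair.
 * A lonely CR is left untouched. "out" must hold at least "size" bytes.
 * Returns the number of bytes written.
 */
static size_t crlf_to_lf(const char *in, size_t size, char *out)
{
    size_t i, n = 0;

    assert(in != NULL && out != NULL);
    for (i = 0; i < size; i++) {
        if (in[i] == '\r' && i + 1 < size && in[i + 1] == '\n')
            continue;
        out[n++] = in[i];
    }
    return n;
}

/*
 * Checkout direction: with EOL_CRLF, add a CR before every LF that does
 * not already have one; with EOL_LF, copy the data as-is.
 * "out" must hold at least 2 * size bytes.
 */
static size_t checkout_text(const char *in, size_t size, char *out,
                            enum eol_policy policy)
{
    size_t i, n = 0;

    assert(in != NULL && out != NULL);
    for (i = 0; i < size; i++) {
        if (policy == EOL_CRLF && in[i] == '\n' &&
            (i == 0 || in[i - 1] != '\r'))
            out[n++] = '\r';
        out[n++] = in[i];
    }
    return n;
}
```

Whichever default wins, only the `policy` argument changes; the decision about whether a file is text at all stays elsewhere.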
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-14 11:36 ` Alexander Litvinov
@ 2007-02-14 16:37 ` Linus Torvalds
2007-02-14 17:18 ` Junio C Hamano
0 siblings, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2007-02-14 16:37 UTC (permalink / raw)
To: Alexander Litvinov; +Cc: Junio C Hamano, Git Mailing List
On Wed, 14 Feb 2007, Alexander Litvinov wrote:
>
> I just tried this patch and it works! From now I can use git-cvsimport under
> Linux and then clone it to cygwin and work there with full history. Nice,
> very nice.
Btw, it didn't do any commit message conversion etc, so you'll still
always see commit messages with LF-only, and if you _create_ commits, you
need to make sure that whatever program you use will do the right thing.
> In my case text file detection works well, as most of our files
> are .cpp and .h
Yeah, considering that it worked in my testing for "git" itself, I'm not
surprised. Source code tends to look the same..
Linus
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-14 15:51 ` Linus Torvalds
@ 2007-02-14 16:39 ` Junio C Hamano
2007-02-14 17:01 ` Linus Torvalds
0 siblings, 1 reply; 83+ messages in thread
From: Junio C Hamano @ 2007-02-14 16:39 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mark Levedahl, Johannes Schindelin, Alexander Litvinov,
Mark Levedahl, Git Mailing List
Linus Torvalds <torvalds@linux-foundation.org> writes:
> Actually, my patch already had one that you didn't mention:
> 6) CR never shows up alone.
Older Macs ;-)?
> So the patch I sent out basically had the following rules:
> - no more than ~10% of all characters being other than regular printable
> ASCII (where any control character except for newline/cr/tab was deemed
> nonprintable)
> - any "lonely" CR automatically means it's binary, and I would refuse
> to convert that to a LF (the test in the code is that CRLF count must
> match CR count)
> ...
> I think that to help asian languages (or strange text-files in utf8 or
> Latin1 too, for that matter: test-files with _just_ special characters), I
> should probably make the rule be that only the 0-31 range is special.
I would agree. 0-31 except HT, CR, LF and ESC would be a good
idea; that would not harm text in UTF-8, EUC based various
locales nor ISO 2022.
Patch is relative to 'pu'.
-- >8 --
diff --git a/convert.c b/convert.c
index ebcf717..b6b7c66 100644
--- a/convert.c
+++ b/convert.c
@@ -13,7 +13,7 @@ struct text_stat {
unsigned cr, lf, crlf;
/* These are just approximations! */
- unsigned printable, nonprintable, nul;
+ unsigned printable, nonprintable;
};
static void gather_stats(const char *buf, unsigned long size, struct text_stat *stats)
@@ -34,13 +34,11 @@ static void gather_stats(const char *buf, unsigned long size, struct text_stat *
stats->lf++;
continue;
}
- if (c == '\t' || (c >= 32 && c < 127)) {
- stats->printable++;
+ if ((c < 32) && (c != '\t' && c != '\033')) {
+ stats->nonprintable++;
continue;
}
- if (!c)
- stats->nul++;
- stats->nonprintable++;
+ stats->printable++;
}
}
@@ -50,7 +48,7 @@ static void gather_stats(const char *buf, unsigned long size, struct text_stat *
static int is_binary(unsigned long size, struct text_stat *stats)
{
- if (stats->nul)
+ if (stats->nonprintable)
return 1;
/*
* Other heuristics? Average line length might be relevant,
^ permalink raw reply related [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-14 16:28 ` Linus Torvalds
@ 2007-02-14 16:47 ` Sam Ravnborg
0 siblings, 0 replies; 83+ messages in thread
From: Sam Ravnborg @ 2007-02-14 16:47 UTC (permalink / raw)
To: Linus Torvalds
Cc: Junio C Hamano, Johannes Schindelin, Alexander Litvinov,
Mark Levedahl, Git Mailing List
On Wed, Feb 14, 2007 at 08:28:24AM -0800, Linus Torvalds wrote:
>
>
> On Wed, 14 Feb 2007, Sam Ravnborg wrote:
> >
> > > (Which means that I suspect I made the default for "auto_crlf" be wrong in
> > > my patch: I probably should not default to checking out with CRLF, but
> > > checking out with just LF, and only do the CRLF->LF conversion on input).
> >
> > Except that it seems a few br0ken programs still do not support LF as
> > end-of-line marker - so .gitattributes may take special care here.
>
> Yes, but I also think that even without .gitattributes, you just want to
> have a default for what "text" actually means, and it's entirely possible
> that the default should be: "check out with just LF, and on check-in turn
> CRLF into LF".
The definition of what is "text" and what action to take upon check-in /
check-out of text are two separate things.
I could see it as beneficial as a per-project or even as an overall
git-policy to say "checkin-as-LF" - "checkout-as-LF" to overcome
interoperability issues when more tools get UNIX* based.
>
> But exactly because _some_ programs might want to always see CRLF on input
> too, it should be overridable.
Which is where I see .gitattributes come into play.
-> A rule that says files with extension .prj and of type "text" shall not see
any conversion.
In this way almost all "text" over time get a proper format and the remaining
brain-dead tools that continue to save in CRLF format will not destroy the sane
LF format.
If anything gets to be the default I would vote for LF. But overridable.
My editor-of-choice does eol auto-sense. If I recall correctly it scans the
first 200 lines and counts the number of CR, LF and CRLF and based on this
judges the actual eol character used. But not all editors are that sensible :-(
Sam
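
The editor behaviour recalled above could look roughly like the sketch below. It is a guess at the idea rather than any particular editor's code: the names are hypothetical, and the tie-breaking (LF wins unless another style is strictly more common) is an assumption.

```c
#include <assert.h>
#include <stddef.h>

enum eol_style { EOL_STYLE_LF, EOL_STYLE_CRLF, EOL_STYLE_CR };

/*
 * Scan at most "max_lines" line endings and report the most common
 * style. A CRLF pair counts once, as CRLF, not as a CR plus an LF.
 */
static enum eol_style sense_eol(const char *buf, size_t size,
                                unsigned max_lines)
{
    unsigned long lf = 0, crlf = 0, cr = 0, lines = 0;
    size_t i;

    assert(buf != NULL);
    for (i = 0; i < size && lines < max_lines; i++) {
        if (buf[i] == '\r') {
            if (i + 1 < size && buf[i + 1] == '\n') {
                crlf++;
                i++; /* skip the LF of the pair */
            } else {
                cr++;
            }
            lines++;
        } else if (buf[i] == '\n') {
            lf++;
            lines++;
        }
    }
    if (crlf > lf && crlf >= cr)
        return EOL_STYLE_CRLF;
    if (cr > lf && cr > crlf)
        return EOL_STYLE_CR;
    return EOL_STYLE_LF;
}
```

Calling `sense_eol(buf, size, 200)` would correspond to the "first 200 lines" described above.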
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-14 16:16 ` Johannes Sixt
@ 2007-02-14 16:53 ` Linus Torvalds
0 siblings, 0 replies; 83+ messages in thread
From: Linus Torvalds @ 2007-02-14 16:53 UTC (permalink / raw)
To: Johannes Sixt; +Cc: git
On Wed, 14 Feb 2007, Johannes Sixt wrote:
>
> Thanks a lot, busy beaver! I gave this a quick spin with a few
> interesting operations: merges and rebase. Merges leave the merge
> results with only LFs behind.
Yes. Merge uses "git-cat-file" (well, it historically did, now that it's
built-in it still does the equivalent operation).
I already talked about how git-cat-file was special ;)
> Rebasing seems to work as expected (working files have CRLFs), except
> when merges are needed.
Well, it always "merges", but yes, you mean three-way data merges. The
normal SHA1-direct merges will just use the normal git-read-tree thing
which is the same as checkout.
> Doesn't git-unpack-file also need to call into the converter?
See earlier discussions. git-cat-file (and git-unpack-file, which is just
a version of it, really) don't have the original filename, so we'll need
to extend on it some way in order to support file attributes even in
theory. So before we do that, I'd hate to do any format conversion there.
Yes, yes, right now it ignores the filename *anyway*, but the point is,
right now that's a "small implementation detail". I would NOT want to do
this if I couldn't know the filename at all!
The merge algorithms actually obviously *do* know the filename for the
things that they are going to merge, so the filename information does
exist. It's just not passed on far enough.
Finally, one comment: if you use "autocrlf = input" (my second patch), all
of this works even now, since the default is to just leave things as
LF-only anyway. In fact, even with "autocrlf = on", nothing should really
*break* except for silly editors that actually *require* CRLF.
IOW, it's more important to do the CRLF->LF conversion than it is to do
the LF->CRLF one ;)
Linus
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-14 16:39 ` Junio C Hamano
@ 2007-02-14 17:01 ` Linus Torvalds
2007-02-14 17:29 ` Junio C Hamano
0 siblings, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2007-02-14 17:01 UTC (permalink / raw)
To: Junio C Hamano
Cc: Mark Levedahl, Johannes Schindelin, Alexander Litvinov,
Mark Levedahl, Git Mailing List
On Wed, 14 Feb 2007, Junio C Hamano wrote:
> Linus Torvalds <torvalds@linux-foundation.org> writes:
>
> > Actually, my patch already had one that you didn't mention:
> > 6) CR never shows up alone.
>
> Older Macs ;-)?
Yeah, I think we can ignore them..
Let's see if anybody ever complains ;)
> I would agree. 0-31 except HT, CR, LF and ESC would be a good
> idea; that would not harm text in UTF-8, EUC based various
> locales nor ISO 2022.
You could possibly add 127 to the list too (it's ascii DEL, I don't know
if you should ever see it in anything that has anything to do with text).
> - if (stats->nul)
> + if (stats->nonprintable)
But this is too harsh.
It's quite common to have the occasional FF character. Some things really
do use it for page breaks. So saying that *any* nonprintable character is
bad is not a good idea.
Same goes for BS (some programs use it to show bold and underlined text:
man-pages, for example).
Linus
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-14 16:37 ` Linus Torvalds
@ 2007-02-14 17:18 ` Junio C Hamano
0 siblings, 0 replies; 83+ messages in thread
From: Junio C Hamano @ 2007-02-14 17:18 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Alexander Litvinov, Git Mailing List
Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Wed, 14 Feb 2007, Alexander Litvinov wrote:
>>
>> I just tried this patch and it works! From now I can use git-cvsimport under
>> Linux and then clone it to cygwin and work there with full history. Nice,
>> very nice.
>
> Btw, it didn't do any commit message conversion etc, so you'll still
> always see commit messages with LF-only, and if you _create_ commits, you
> need to make sure that whatever program you use will do the right thing.
I think stripspace removes CR so we should be Ok.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-14 15:56 ` Johannes Schindelin
2007-02-14 16:23 ` Linus Torvalds
@ 2007-02-14 17:28 ` Mark Levedahl
2007-02-14 18:17 ` Robin Rosenberg
1 sibling, 1 reply; 83+ messages in thread
From: Mark Levedahl @ 2007-02-14 17:28 UTC (permalink / raw)
To: Johannes Schindelin
Cc: Mark Levedahl, Linus Torvalds, Junio C Hamano,
Alexander Litvinov, Git Mailing List
Johannes Schindelin wrote:
> Hi,
>
> On Wed, 14 Feb 2007, Mark Levedahl wrote:
>
> This sounds regretfully complex. Somebody (you?) mentioned that cvsnt does
> a kick-ass job here. Does cvsnt need strategies? I don't think so. Neither
> do we. Someone who cares enough should just rip^H^H^Hlook at cvsnt's text
> detection.
>
> Ciao,
> Dscho
>
I agree that is complex, I started thinking of PAM when I wrote that,
leading to, "this aint gonna work." But in the modern day let's all feel
good spirit of "there are no stupid ideas, just some are better" I threw
it out anyway.
As to cvsnt, my actual feeling is I'd like to kick it in the ass, it has
destroyed too many files for me over the years, binary and text, so I
don't think its strategies are very good. That is why I'm kicking these
ideas around, if I thought I knew the "right" way I would have written
it already.
Mark
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-14 17:01 ` Linus Torvalds
@ 2007-02-14 17:29 ` Junio C Hamano
2007-02-14 17:43 ` Linus Torvalds
0 siblings, 1 reply; 83+ messages in thread
From: Junio C Hamano @ 2007-02-14 17:29 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mark Levedahl, Johannes Schindelin, Alexander Litvinov,
Mark Levedahl, Git Mailing List
Linus Torvalds <torvalds@linux-foundation.org> writes:
>> - if (stats->nul)
>> + if (stats->nonprintable)
>
> But this is too harsh.
>
> It's quite common to have the occasional FF character. Some things really
> do use it for page breaks. So saying that *any* nonprintable character is
> bad is not a good idea.
>
> Same goes for BS (some programs use it to show bold and underlined text:
> man-pages, for example).
Ok. How about adding BS and FF to the Ok set, and checking if
bad ones are less than 1% of the good ones?
diff --git a/convert.c b/convert.c
index b6b7c66..b0c7641 100644
--- a/convert.c
+++ b/convert.c
@@ -34,11 +34,22 @@ static void gather_stats(const char *buf, unsigned long size, struct text_stat *
stats->lf++;
continue;
}
- if ((c < 32) && (c != '\t' && c != '\033')) {
+ if (c == 127)
+ /* DEL */
stats->nonprintable++;
- continue;
+ else if (c < 32) {
+ switch (c) {
+ /* BS, HT, ESC and FF */
+ case '\b': case '\t': case '\033': case '\014':
+ stats->printable++;
+ break;
+ default:
+ stats->nonprintable++;
+ }
+
}
- stats->printable++;
+ else
+ stats->printable++;
}
}
@@ -48,7 +59,7 @@ static void gather_stats(const char *buf, unsigned long size, struct text_stat *
static int is_binary(unsigned long size, struct text_stat *stats)
{
- if (stats->nonprintable)
+ if ((stats->printable >> 7) < stats->nonprintable)
return 1;
/*
* Other heuristics? Average line length might be relevant,
^ permalink raw reply related [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-14 17:29 ` Junio C Hamano
@ 2007-02-14 17:43 ` Linus Torvalds
0 siblings, 0 replies; 83+ messages in thread
From: Linus Torvalds @ 2007-02-14 17:43 UTC (permalink / raw)
To: Junio C Hamano
Cc: Mark Levedahl, Johannes Schindelin, Alexander Litvinov,
Mark Levedahl, Git Mailing List
On Wed, 14 Feb 2007, Junio C Hamano wrote:
>
> Ok. How about adding BS and FF to the Ok set, and checking if
> bad ones are less than 1% of the good ones?
I think that looks fine.
Linus
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-14 17:28 ` Mark Levedahl
@ 2007-02-14 18:17 ` Robin Rosenberg
2007-02-14 18:31 ` Linus Torvalds
0 siblings, 1 reply; 83+ messages in thread
From: Robin Rosenberg @ 2007-02-14 18:17 UTC (permalink / raw)
To: Mark Levedahl
Cc: Johannes Schindelin, Mark Levedahl, Linus Torvalds,
Junio C Hamano, Alexander Litvinov, Git Mailing List
On Wednesday 14 February 2007 18:28, Mark Levedahl wrote:
> As to cvsnt, my actual feeling is I'd like to kick it in the ass, it has
> destroyed too many files for me over the years, binary and text, so I
> don't think its strategies are very good. That is why I'm kicking these
> ideas around, if I thought I knew the "right" way I would have written
> it already.
That may be why an excellent piece of software, TortoiseCVS, doesn't trust
cvs or cvsnt to do the job. Here is how they do the binary detection (and
some more):
http://tortoisecvs.cvs.sourceforge.net/tortoisecvs/TortoiseCVS/src/CVSGlue/CVSStatus.cpp?revision=1.172&view=markup
-- robin
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-14 18:17 ` Robin Rosenberg
@ 2007-02-14 18:31 ` Linus Torvalds
2007-02-14 20:24 ` Robin Rosenberg
0 siblings, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2007-02-14 18:31 UTC (permalink / raw)
To: Robin Rosenberg
Cc: Mark Levedahl, Johannes Schindelin, Mark Levedahl,
Junio C Hamano, Alexander Litvinov, Git Mailing List
On Wed, 14 Feb 2007, Robin Rosenberg wrote:
>
> That may be why an excellent piece of software, TortoiseCVS, doesn't trust
> cvs or cvsnt to do the job. Here is how they do the binary detection (and
> some more):
>
> http://tortoisecvs.cvs.sourceforge.net/tortoisecvs/TortoiseCVS/src/CVSGlue/CVSStatus.cpp?revision=1.172&view=markup
Well, it does seem to boil down to what Junio already got to:
- 0-31 and 127 are never in text, except for BEL, BS, HT, LF, FF, CR and
ESC.
- 128-255 can all be in either iso-8859 or extended ascii (or they
explicitly add NEL but not 128+27 to "normal ASCII", which is strange)
So they've effectively added BEL and ESC to the list of characters that
Junio has now. But they also make it an absolute error to have anything
else (no "1% rule").
But they also do the filename tests, and I think that's more important in
many ways.
Linus
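
The strict rule summarized in this message can be sketched as below. It is based only on the summary here, not on the TortoiseCVS source, and the function names are hypothetical. It treats all of 128-255 as acceptable, which is one reading of the second bullet.

```c
#include <assert.h>
#include <stddef.h>

/* 1 if the byte may appear in text under the strict rule, else 0. */
static int is_text_byte(unsigned char c)
{
    if (c >= 128)
        return 1; /* iso-8859 / extended ASCII range */
    if (c == 127)
        return 0; /* DEL */
    if (c >= 32)
        return 1; /* printable ASCII */
    switch (c) {
    case '\a': /* BEL */
    case '\b': /* BS */
    case '\t': /* HT */
    case '\n': /* LF */
    case '\f': /* FF */
    case '\r': /* CR */
    case 27:   /* ESC */
        return 1;
    default:
        return 0;
    }
}

/* Absolute rule: a single disallowed byte makes the buffer binary. */
static int is_text_strict(const char *buf, size_t size)
{
    size_t i;

    assert(buf != NULL);
    for (i = 0; i < size; i++)
        if (!is_text_byte((unsigned char)buf[i]))
            return 0;
    return 1;
}
```

The difference from the ratio test discussed earlier is that there is no tolerance at all: one stray control character flips the verdict.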
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: mingw, windows, crlf/lf, and git
2007-02-14 18:31 ` Linus Torvalds
@ 2007-02-14 20:24 ` Robin Rosenberg
0 siblings, 0 replies; 83+ messages in thread
From: Robin Rosenberg @ 2007-02-14 20:24 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mark Levedahl, Johannes Schindelin, Mark Levedahl,
Junio C Hamano, Alexander Litvinov, Git Mailing List
On Wednesday 14 February 2007 19:31, Linus Torvalds wrote:
>
> On Wed, 14 Feb 2007, Robin Rosenberg wrote:
> >
> > That may be why an excellent piece of software, TortoiseCVS, doesn't trust
> > cvs or cvsnt to do the job. Here is how they do the binary detection (and
> > some more):
> >
> > http://tortoisecvs.cvs.sourceforge.net/tortoisecvs/TortoiseCVS/src/CVSGlue/CVSStatus.cpp?revision=1.172&view=markup
>
> Well, it does seem to boil down to what Junio already got to:
>
> - 0-31 and 127 are never in text, except for BEL, BS, HT, LF, FF, CR and
> ESC.
> - 128-255 can all be in either iso-8859 or extended ascii (or they
> explicitly add NEL but not 128+27 to "normal ASCII", which is strange)
>
> So they've effectively added BEL and ESC to the list of characters that
Especially ESC used to be common in DOS/Windows and quite a few hang around in
older code.
> Junio has now. But they also make it an absolute error to have anything
> else (no "1% rule").
Can this 1%-rule be motivated from real cases, rather than hypothetical ones? It makes
it harder to understand why the tool makes a particular decision.
> But they also do the filename tests, and I think that's more important in
> many ways.
A unixy tool like git should maybe use magic too :).
Btw the filename (like .gitignore or similar) test in practice would give us
the binary flag. Just list a filename instead of a pattern.
-- robin
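
A filename test of the kind suggested above might be sketched with POSIX fnmatch(), so that a plain filename and a glob pattern go through the same code path. The pattern list and the function name are hypothetical, and fnmatch() is assumed to be available (it is POSIX, so a native Windows build would need a replacement).

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>

/*
 * Returns 1 if "path" matches any entry in "patterns". An entry without
 * wildcards only matches that exact name, so listing a filename works
 * the same way as listing a pattern.
 */
static int is_listed_binary(const char *path,
                            const char *const *patterns, size_t n)
{
    size_t i;

    assert(path != NULL && (patterns != NULL || n == 0));
    for (i = 0; i < n; i++)
        if (fnmatch(patterns[i], path, 0) == 0)
            return 1;
    return 0;
}
```

With flags set to 0, `*` also matches `/`, so `*.png` would cover files in subdirectories too; a real implementation would have to decide whether that is wanted.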
^ permalink raw reply [flat|nested] 83+ messages in thread
end of thread, other threads:[~2007-02-14 20:23 UTC | newest]
Thread overview: 83+ messages
2007-02-11 23:13 mingw, windows, crlf/lf, and git Mark Levedahl
2007-02-11 23:34 ` Johannes Schindelin
2007-02-12 0:46 ` Jakub Narebski
2007-02-12 2:36 ` Mark Levedahl
2007-02-12 11:21 ` Johannes Schindelin
2007-02-12 0:14 ` Robin Rosenberg
2007-02-12 2:37 ` Mark Levedahl
2007-02-12 4:24 ` Theodore Tso
2007-02-12 7:28 ` David Lang
2007-02-12 11:36 ` Johannes Schindelin
2007-02-12 17:20 ` Linus Torvalds
2007-02-12 22:37 ` Johannes Schindelin
2007-02-12 23:02 ` Linus Torvalds
2007-02-12 22:54 ` Junio C Hamano
2007-02-12 23:02 ` Junio C Hamano
2007-02-12 23:09 ` Linus Torvalds
2007-02-12 23:25 ` Linus Torvalds
2007-02-12 23:23 ` David Lang
2007-02-12 23:24 ` Johannes Schindelin
2007-02-12 23:42 ` Junio C Hamano
2007-02-12 23:46 ` David Lang
2007-02-12 23:50 ` Johannes Schindelin
2007-02-13 0:59 ` Mark Levedahl
2007-02-13 1:06 ` Johannes Schindelin
2007-02-13 1:13 ` Shawn O. Pearce
2007-02-13 1:20 ` David Lang
2007-02-13 1:36 ` Mark Levedahl
2007-02-13 5:18 ` Jeff King
2007-02-13 0:32 ` Mark Levedahl
2007-02-13 2:02 ` Junio C Hamano
2007-02-13 3:21 ` Mark Levedahl
2007-02-13 6:05 ` Junio C Hamano
2007-02-13 3:32 ` Alexander Litvinov
2007-02-13 10:06 ` Johannes Schindelin
2007-02-13 12:16 ` Alexander Litvinov
2007-02-13 12:37 ` Johannes Schindelin
2007-02-13 19:36 ` Mark Levedahl
2007-02-13 20:32 ` Linus Torvalds
2007-02-14 1:42 ` Mark Levedahl
2007-02-14 2:16 ` Linus Torvalds
2007-02-13 21:58 ` Robin Rosenberg
2007-02-14 1:18 ` Mark Levedahl
2007-02-13 16:52 ` Linus Torvalds
2007-02-13 17:23 ` Linus Torvalds
2007-02-13 17:23 ` Linus Torvalds
2007-02-13 18:00 ` Junio C Hamano
2007-02-13 19:07 ` Linus Torvalds
2007-02-13 20:42 ` Sam Ravnborg
2007-02-13 21:08 ` Nicolas Pitre
2007-02-13 23:19 ` David Lang
2007-02-13 23:28 ` Linus Torvalds
2007-02-14 8:41 ` Sam Ravnborg
2007-02-14 16:28 ` Linus Torvalds
2007-02-14 16:47 ` Sam Ravnborg
2007-02-14 3:47 ` Alexander Litvinov
2007-02-14 5:16 ` Junio C Hamano
2007-02-14 5:36 ` Linus Torvalds
2007-02-14 11:10 ` Johannes Schindelin
2007-02-14 14:26 ` Mark Levedahl
2007-02-14 15:51 ` Linus Torvalds
2007-02-14 16:39 ` Junio C Hamano
2007-02-14 17:01 ` Linus Torvalds
2007-02-14 17:29 ` Junio C Hamano
2007-02-14 17:43 ` Linus Torvalds
2007-02-14 15:56 ` Johannes Schindelin
2007-02-14 16:23 ` Linus Torvalds
2007-02-14 17:28 ` Mark Levedahl
2007-02-14 18:17 ` Robin Rosenberg
2007-02-14 18:31 ` Linus Torvalds
2007-02-14 20:24 ` Robin Rosenberg
2007-02-14 15:44 ` Linus Torvalds
2007-02-14 15:53 ` Johannes Schindelin
2007-02-14 11:36 ` Alexander Litvinov
2007-02-14 16:37 ` Linus Torvalds
2007-02-14 17:18 ` Junio C Hamano
2007-02-14 16:16 ` Johannes Sixt
2007-02-14 16:53 ` Linus Torvalds
2007-02-13 18:05 ` Johannes Schindelin
2007-02-13 17:25 ` Nicolas Pitre
2007-02-13 18:04 ` Johannes Schindelin
2007-02-13 18:11 ` Junio C Hamano
2007-02-13 18:39 ` Linus Torvalds
2007-02-13 18:42 ` Johannes Schindelin