* CVS import
@ 2009-02-16  9:17 Ferry Huberts (Pelagic)
  2009-02-16 13:20 ` CVS import [SOLVED] Ferry Huberts (Pelagic)
  0 siblings, 1 reply; 56+ messages in thread
From: Ferry Huberts (Pelagic) @ 2009-02-16  9:17 UTC (permalink / raw)
  To: git

Hi list,

when I try to import a CVS repo with:

git cvsimport -v -i \
  -d :pserver:anonymous@javagroups.cvs.sourceforge.net:/cvsroot/javagroups \
  JGroups

I'm getting the errors:

New tests/other/org/jgroups/tests/adapttcp/Test.java: 5914 bytes
Use of uninitialized value in concatenation (.) or string at /usr/bin/git-cvsimport line 674, <CVS> line 652.
Use of uninitialized value in concatenation (.) or string at /usr/bin/git-cvsimport line 674, <CVS> line 652.
fatal: malformed index info 100666 	src/org/jgroups/util/RWLock.java
unable to write to git-update-index:  at /usr/bin/git-cvsimport line 679, <CVS> line 652.


I've seen this before when trying to import other repositories.
And since I'm not good with Perl I was wondering whether this sounds familiar
and if there's a fix for it.

I'm on Fedora 10 with the following packages:
git.x86_64                                                      1.6.0.6-1.fc10
git-all.x86_64                                                  1.6.0.6-1.fc10
git-arch.x86_64                                                 1.6.0.6-1.fc10
git-cvs.x86_64                                                  1.6.0.6-1.fc10
git-daemon.x86_64                                               1.6.0.6-1.fc10
git-email.x86_64                                                1.6.0.6-1.fc10
git-gui.x86_64                                                  1.6.0.6-1.fc10
git-svn.x86_64                                                  1.6.0.6-1.fc10
gitk.x86_64                                                     1.6.0.6-1.fc10
gitosis.noarch                                                  0.2-6.20080825git.fc10
gitweb.x86_64                                                   1.6.0.6-1.fc10



Ferry

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: CVS import [SOLVED]
  2009-02-16  9:17 CVS import Ferry Huberts (Pelagic)
@ 2009-02-16 13:20 ` Ferry Huberts (Pelagic)
  2009-02-16 13:45   ` Johannes Schindelin
  2009-02-16 20:32   ` Ferry Huberts (Pelagic)
  0 siblings, 2 replies; 56+ messages in thread
From: Ferry Huberts (Pelagic) @ 2009-02-16 13:20 UTC (permalink / raw)
  To: git

I solved it:

it has to do with the
core.autocrlf=input
core.safecrlf=true

settings I had in my global config.
Maybe the manual page should warn against having these defined?
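For anyone else hitting this: one possible per-repository workaround (a sketch; it assumes the offending values come from ~/.gitconfig) is to force conversion off in the import repository's .git/config, which overrides the global settings:

```
[core]
	autocrlf = false
	safecrlf = false
```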

Ferry


On Mon, February 16, 2009 10:17, Ferry Huberts (Pelagic) wrote:
> Hi list,
>
> when I try to import a CVS repo with:
>
> git cvsimport -v -i \
>   -d :pserver:anonymous@javagroups.cvs.sourceforge.net:/cvsroot/javagroups \
>   JGroups
>
> I'm getting the errors:
>
> New tests/other/org/jgroups/tests/adapttcp/Test.java: 5914 bytes
> Use of uninitialized value in concatenation (.) or string at /usr/bin/git-cvsimport line 674, <CVS> line 652.
> Use of uninitialized value in concatenation (.) or string at /usr/bin/git-cvsimport line 674, <CVS> line 652.
> fatal: malformed index info 100666 	src/org/jgroups/util/RWLock.java
> unable to write to git-update-index:  at /usr/bin/git-cvsimport line 679, <CVS> line 652.
>
>
> I've seen this before when trying to import other repositories.
> And since I'm not good with Perl I was wondering whether this sounds familiar
> and if there's a fix for it.
>
> I'm on Fedora 10 with the following packages:
> git.x86_64                                                      1.6.0.6-1.fc10
> git-all.x86_64                                                  1.6.0.6-1.fc10
> git-arch.x86_64                                                 1.6.0.6-1.fc10
> git-cvs.x86_64                                                  1.6.0.6-1.fc10
> git-daemon.x86_64                                               1.6.0.6-1.fc10
> git-email.x86_64                                                1.6.0.6-1.fc10
> git-gui.x86_64                                                  1.6.0.6-1.fc10
> git-svn.x86_64                                                  1.6.0.6-1.fc10
> gitk.x86_64                                                     1.6.0.6-1.fc10
> gitosis.noarch                                                  0.2-6.20080825git.fc10
> gitweb.x86_64                                                   1.6.0.6-1.fc10
>
>
>
> Ferry
>


* Re: CVS import [SOLVED]
  2009-02-16 13:20 ` CVS import [SOLVED] Ferry Huberts (Pelagic)
@ 2009-02-16 13:45   ` Johannes Schindelin
  2009-02-16 13:53     ` Johannes Schindelin
  2009-02-16 20:32   ` Ferry Huberts (Pelagic)
  1 sibling, 1 reply; 56+ messages in thread
From: Johannes Schindelin @ 2009-02-16 13:45 UTC (permalink / raw)
  To: Ferry Huberts (Pelagic); +Cc: git

Hi,

On Mon, 16 Feb 2009, Ferry Huberts (Pelagic) wrote:

> I solved it:
> 
> it has to do with the
> core.autocrlf=input
> core.safecrlf=true
> 
> settings I had in my global config.

Thanks!

> Maybe the manual page should warn against having these defined?

Maybe it should be solved differently?  As cvsimport needs to operate with 
autocrlf=false, it seems, it could set that variable when it creates a 
repository, and check the variable otherwise (erroring out if it is set 
inappropriately)?

Ciao,
Dscho


* Re: CVS import [SOLVED]
  2009-02-16 13:45   ` Johannes Schindelin
@ 2009-02-16 13:53     ` Johannes Schindelin
  2009-02-16 17:33       ` Ferry Huberts (Pelagic)
  0 siblings, 1 reply; 56+ messages in thread
From: Johannes Schindelin @ 2009-02-16 13:53 UTC (permalink / raw)
  To: Ferry Huberts (Pelagic); +Cc: git

Hi,

On Mon, 16 Feb 2009, Johannes Schindelin wrote:

> On Mon, 16 Feb 2009, Ferry Huberts (Pelagic) wrote:
> 
> > I solved it:
> > 
> > it has to do with the
> > core.autocrlf=input
> > core.safecrlf=true
> > 
> > settings I had in my global config.
> 
> Thanks!
> 
> > Maybe the manual page should warn against having these defined?
> 
> Maybe it should be solved differently?  As cvsimport needs to operate with 
> autocrlf=false, it seems, it could set that variable when it creates a 
> repository, and check the variable otherwise (erroring out if it is set 
> inappropriately)?

IOW something like this:

-- snip --
 git-cvsimport.perl |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/git-cvsimport.perl b/git-cvsimport.perl
index e439202..a27cc94 100755
--- a/git-cvsimport.perl
+++ b/git-cvsimport.perl
@@ -562,12 +562,16 @@ my %index; # holds filenames of one index per branch
 unless (-d $git_dir) {
 	system("git-init");
 	die "Cannot init the GIT db at $git_tree: $?\n" if $?;
+	system("git-config core.autocrlf false");
+	die "Cannot set core.autocrlf false" if $?;
 	system("git-read-tree");
 	die "Cannot init an empty tree: $?\n" if $?;
 
 	$last_branch = $opt_o;
 	$orig_branch = "";
 } else {
+	die "Cannot operate with core.autocrlf other than 'false'"
+		if (`git-config --bool core.autocrlf` =~ /true|input/);
 	open(F, "git-symbolic-ref HEAD |") or
 		die "Cannot run git-symbolic-ref: $!\n";
 	chomp ($last_branch = <F>);
-- snap --

If you could add a test to t9600 and a few words to the man page, that 
would be awesome.

Ciao,
Dscho

P.S.: I think the same strategy should be applied to git-svn...


* Re: CVS import [SOLVED]
  2009-02-16 13:53     ` Johannes Schindelin
@ 2009-02-16 17:33       ` Ferry Huberts (Pelagic)
  2009-02-16 18:11         ` Johannes Schindelin
  0 siblings, 1 reply; 56+ messages in thread
From: Ferry Huberts (Pelagic) @ 2009-02-16 17:33 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git

Johannes Schindelin wrote:
> Hi,
>
> On Mon, 16 Feb 2009, Johannes Schindelin wrote:
>
>   
>> On Mon, 16 Feb 2009, Ferry Huberts (Pelagic) wrote:
>>
>>     
>>> I solved it:
>>>
>>> it has to do with the
>>> core.autocrlf=input
>>> core.safecrlf=true
>>>
>>> settings I had in my global config.
>>>       
>> Thanks!
>>
>>     
>>> Maybe the manual page should warn against having these defined?
>>>       
>> Maybe it should be solved differently?  As cvsimport needs to operate with 
>> autocrlf=false, it seems, it could set that variable when it creates a 
>> repository, and check the variable otherwise (erroring out if it is set 
>> inappropriately)?
>>     
>
> IOW something like this:
>
> -- snip --
>  git-cvsimport.perl |    4 ++++
>  1 files changed, 4 insertions(+), 0 deletions(-)
>
> diff --git a/git-cvsimport.perl b/git-cvsimport.perl
> index e439202..a27cc94 100755
> --- a/git-cvsimport.perl
> +++ b/git-cvsimport.perl
> @@ -562,12 +562,16 @@ my %index; # holds filenames of one index per branch
>  unless (-d $git_dir) {
>  	system("git-init");
>  	die "Cannot init the GIT db at $git_tree: $?\n" if $?;
> +	system("git-config core.autocrlf false");
> +	die "Cannot set core.autocrlf false" if $?;
>  	system("git-read-tree");
>  	die "Cannot init an empty tree: $?\n" if $?;
>  
>  	$last_branch = $opt_o;
>  	$orig_branch = "";
>  } else {
> +	die "Cannot operate with core.autocrlf other than 'false'"
> +		if (`git-config --bool core.autocrlf` =~ /true|input/);
>  	open(F, "git-symbolic-ref HEAD |") or
>  		die "Cannot run git-symbolic-ref: $!\n";
>  	chomp ($last_branch = <F>);
> -- snap --
>
> If you could add a test to t9600 and a few words to the man page, that 
> would be awesome.
>
> Ciao,
> Dscho
>
> P.S.: I think the same strategy should be applied to git-svn...
>   
I'm willing to give it a stab but I'm not versed on Perl at all. C and 
Java I can do without breaking a sweat though.
Isn't it a better idea to have the original authors do this? They 
understand the code.
Also, doing this would constitute my first patch to git. I'm unfamiliar 
with its codebase and the requirements of the people that contribute to 
it. Willing to learn though :-)
My patches would then probably need some review and would take a bit 
longer to develop. If that's acceptable then I'm willing to try.

Ferry


* Re: CVS import [SOLVED]
  2009-02-16 17:33       ` Ferry Huberts (Pelagic)
@ 2009-02-16 18:11         ` Johannes Schindelin
  0 siblings, 0 replies; 56+ messages in thread
From: Johannes Schindelin @ 2009-02-16 18:11 UTC (permalink / raw)
  To: Ferry Huberts (Pelagic); +Cc: git

Hi,

On Mon, 16 Feb 2009, Ferry Huberts (Pelagic) wrote:

> Johannes Schindelin wrote:
>
> > On Mon, 16 Feb 2009, Johannes Schindelin wrote:
> >
> >   
> > > On Mon, 16 Feb 2009, Ferry Huberts (Pelagic) wrote:
> > >
> > >     
> > > > I solved it:
> > > >
> > > > it has to do with the
> > > > core.autocrlf=input
> > > > core.safecrlf=true
> > > >
> > > > settings I had in my global config.
> > > >       
> > > Thanks!
> > >
> > >     
> > > > Maybe the manual page should warn against having these defined?
> > > >       
> > > Maybe it should be solved differently?  As cvsimport needs to operate with
> > > autocrlf=false, it seems, it could set that variable when it creates a
> > > repository, and check the variable otherwise (erroring out if it is set
> > > inappropriately)?
> > >     
> >
> > IOW something like this:
> >
> > -- snip --
> >  git-cvsimport.perl |    4 ++++
> >  1 files changed, 4 insertions(+), 0 deletions(-)
> >
> > diff --git a/git-cvsimport.perl b/git-cvsimport.perl
> > index e439202..a27cc94 100755
> > --- a/git-cvsimport.perl
> > +++ b/git-cvsimport.perl
> > @@ -562,12 +562,16 @@ my %index; # holds filenames of one index per branch
> >  unless (-d $git_dir) {
> >   system("git-init");
> >   die "Cannot init the GIT db at $git_tree: $?\n" if $?;
> > +	system("git-config core.autocrlf false");
> > +	die "Cannot set core.autocrlf false" if $?;
> >   system("git-read-tree");
> >   die "Cannot init an empty tree: $?\n" if $?;
> >  
> >   $last_branch = $opt_o;
> >   $orig_branch = "";
> > } else {
> > +	die "Cannot operate with core.autocrlf other than 'false'"
> > +		if (`git-config --bool core.autocrlf` =~ /true|input/);
> >   open(F, "git-symbolic-ref HEAD |") or
> >   	die "Cannot run git-symbolic-ref: $!\n";
> > 	chomp ($last_branch = <F>);
> > -- snap --
> >
> > If you could add a test to t9600 and a few words to the man page, that would
> > be awesome.
> >
> > Ciao,
> > Dscho
> >
> > P.S.: I think the same strategy should be applied to git-svn...
> >   
> I'm willing to give it a stab but I'm not versed on Perl at all.

Well, the tests and the man page changes I asked you to do don't require 
Perl..

> Isn't it a better idea to have the original authors do this? They 
> understand the code.

No.  This is Open Source, and it is not their itch.

> Also, doing this would constitute my first patch to git. I'm unfamiliar 
> with its codebase and the requirements of the people that contribute to 
> it. Willing to learn though :-)

That's the best way to start.

> My patches would then probably need some review and would take a bit 
> longer to develop. If that's acceptable then I'm willing to try.

No problem, I am here (if you want to cope with me, that is).

Ciao,
Dscho


* Re: CVS import [SOLVED]
  2009-02-16 13:20 ` CVS import [SOLVED] Ferry Huberts (Pelagic)
  2009-02-16 13:45   ` Johannes Schindelin
@ 2009-02-16 20:32   ` Ferry Huberts (Pelagic)
  2009-02-16 20:59     ` Johannes Schindelin
  1 sibling, 1 reply; 56+ messages in thread
From: Ferry Huberts (Pelagic) @ 2009-02-16 20:32 UTC (permalink / raw)
  To: Ferry Huberts (Pelagic); +Cc: git

On Mon, February 16, 2009 14:20, Ferry Huberts (Pelagic) wrote:
> I solved it:
>
> it has to do with the
> core.autocrlf=input
> core.safecrlf=true
>
> settings I had in my global config.
> Maybe the manual page should warn against having these defined?
>

I'm working on it now, and did some more testing: it's actually the safecrlf setting, not the autocrlf option.
Which leaves me with some questions about what to do exactly:

   autocrlf  safecrlf
1  false     false
2  false     warn
3  false     true
4  input     false
5  input     warn
6  input     true
7  true      false
8  true      warn
9  true      true


1- git ignores the safecrlf flag; obviously acceptable
2- git ignores the safecrlf flag; so acceptable
3- git ignores the safecrlf flag, so acceptable
4- seems acceptable to me
5- seems acceptable to me
6- unacceptable
7- seems acceptable to me
8- seems acceptable to me
9- unacceptable

So, 6 and 9 (safecrlf==true) are definitely unacceptable. 1-3 are definitely acceptable.
How about the others?
Should these produce warnings?
Should the user use a 'force' option to make the import work (and acknowledge that he's calling for trouble)?
Should we enforce some setting? Which flags/setting?

Input appreciated :-)

Ferry


* Re: CVS import [SOLVED]
  2009-02-16 20:32   ` Ferry Huberts (Pelagic)
@ 2009-02-16 20:59     ` Johannes Schindelin
  2009-02-17 11:19       ` Ferry Huberts (Pelagic)
  2009-02-20 15:28       ` Jeff King
  0 siblings, 2 replies; 56+ messages in thread
From: Johannes Schindelin @ 2009-02-16 20:59 UTC (permalink / raw)
  To: Ferry Huberts (Pelagic); +Cc: git

Hi,

On Mon, 16 Feb 2009, Ferry Huberts (Pelagic) wrote:

> On Mon, February 16, 2009 14:20, Ferry Huberts (Pelagic) wrote:
> > I solved it:
> >
> > it has to do with the
> > core.autocrlf=input
> > core.safecrlf=true
> >
> > settings I had in my global config.
> > Maybe the manual page should warn against having these defined?
> >
> 
> I'm working on it now, and did some more testing: it's actually the 
> safecrlf setting, not the autocrlf option.

Oh.  That probably means that cvsimport gets confused by the extra 
warnings.

However, I think it is not correct to run cvsimport with autocrlf set to 
anything other than false anyway (and safecrlf would not trigger then, right?).

So IMHO the solution is still to force autocrlf off.

Ciao,
Dscho


* Re: CVS import [SOLVED]
  2009-02-16 20:59     ` Johannes Schindelin
@ 2009-02-17 11:19       ` Ferry Huberts (Pelagic)
  2009-02-17 14:18         ` Johannes Schindelin
  2009-02-20 15:28       ` Jeff King
  1 sibling, 1 reply; 56+ messages in thread
From: Ferry Huberts (Pelagic) @ 2009-02-17 11:19 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Ferry Huberts, git

Ok.

I tested all combinations of autocrlf and safecrlf on an artificial cvs
repository with only a dos text file and a Unix text file. Here are my results.


   autocrlf  safecrlf
1  false     false
2  false     warn
3  false     true
4  input     false
5  input     warn
6  input     true
7  true      false
8  true      warn
9  true      true


1- correct import, no warnings
2- correct import, no warnings
3- correct import, no warnings
4- correct import, no warnings
5- correct import, warning on the dos text file
6- correct import, no warnings
7- correct import, no warnings
8- correct import, no warnings
9- fail:

Initialized empty Git repository in /data/home.f9/ferry/testarea/cvsimport/wc.git.true.true/.git/
Running cvsps...
cvs_direct initialized to CVSROOT /data/home.f9/ferry/testarea/cvsimport/cvs
cvs rlog: Logging master
* UNKNOWN LINE * Branches:
Fetching dos.txt   v 1.1
New dos.txt: 25 bytes
Fetching unix.txt   v 1.1
New unix.txt: 22 bytes
fatal: LF would be replaced by CRLF in /tmp/gitcvs.RT1XN8
Use of uninitialized value $sha in scalar chomp at /usr/bin/git-cvsimport line 928.
Use of uninitialized value in concatenation (.) or string at /usr/bin/git-cvsimport line 674, <CVS> line 14.
fatal: malformed index info 100666 	unix.txt
unable to write to git-update-index:  at /usr/bin/git-cvsimport line 679, <CVS> line 14.


So 9 crashes while 6 does not. Apparently the artificial repo with the 2 text files
doesn't give enough coverage: my problem was with 6.

It seems that the import script does not detect a fatal error from git; it
does not check the return code and simply tries to continue.
It must be here (from line 923):

print "".($init ? "New" : "Update")." $fn: $size bytes\n" if $opt_v;
my $pid = open(my $F, '-|');
die $! unless defined $pid;
if (!$pid) {
	exec("git-hash-object", "-w", $tmpname)
	or die "Cannot create object: $!\n";
}
my $sha = <$F>;
chomp $sha;
close $F;


I think the culprit here is git-hash-object. Either it does not return a
non-zero exit code, or cvsimport does not see the exit code correctly.
I've traced it in the code to the file convert.c, function
static void check_safe_crlf(const char *path, int action,
                            struct text_stat *stats, enum safe_crlf checksafe)

It does do a 'die', which will exit with code 128, but the exit code
apparently isn't picked up by Perl. I'm stuck now, as I don't know Perl
well enough.


Back to the issue:
I think requiring autocrlf = false is too strict. Requiring autocrlf = false
should be enough. That combined with a bit of text in the manual page about
these settings: autocrlf = false is strongly recommended. Also, safecrlf is
required to be set to false.


* Re: CVS import [SOLVED]
  2009-02-17 11:19       ` Ferry Huberts (Pelagic)
@ 2009-02-17 14:18         ` Johannes Schindelin
  2009-02-17 15:16           ` Ferry Huberts (Pelagic)
  0 siblings, 1 reply; 56+ messages in thread
From: Johannes Schindelin @ 2009-02-17 14:18 UTC (permalink / raw)
  To: Ferry Huberts (Pelagic); +Cc: git

Hi,

On Tue, 17 Feb 2009, Ferry Huberts (Pelagic) wrote:

> 1- correct import, no warnings

When you say "correct import", do you mean that the import worked, or that 
the imported file is actually bytewise identical to what is stored in the 
CVS _repository_ (as opposed to the working directory)?

Ciao,
Dscho


* Re: CVS import [SOLVED]
  2009-02-17 14:18         ` Johannes Schindelin
@ 2009-02-17 15:16           ` Ferry Huberts (Pelagic)
  0 siblings, 0 replies; 56+ messages in thread
From: Ferry Huberts (Pelagic) @ 2009-02-17 15:16 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git

Johannes Schindelin wrote:
> Hi,
>
> On Tue, 17 Feb 2009, Ferry Huberts (Pelagic) wrote:
>
>   
>> 1- correct import, no warnings
>>     
>
> When you say "correct import", do you mean that the import worked, or that 
> the imported file is actually bytewise identical to what is stored in the 
> CVS _repository_ (as opposed to the working directory)?
>
>   
As far as I could tell (but I looked at the checkout, not the
repository) it means that the files were imported 1:1, with no modifications.
(Which is actually a bit weird, as for some of the cases the autocrlf
option was set to true or input.)

Ferry


* Re: CVS import [SOLVED]
  2009-02-16 20:59     ` Johannes Schindelin
  2009-02-17 11:19       ` Ferry Huberts (Pelagic)
@ 2009-02-20 15:28       ` Jeff King
  2009-02-20 16:25         ` Ferry Huberts (Pelagic)
  1 sibling, 1 reply; 56+ messages in thread
From: Jeff King @ 2009-02-20 15:28 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Ferry Huberts (Pelagic), git

On Mon, Feb 16, 2009 at 09:59:29PM +0100, Johannes Schindelin wrote:

> > I'm working on it now, and did some more testing: it's actually the 
> > safecrlf setting, not the autocrlf option.
> 
> Oh.  That probably means that cvsimport gets confused by the extra 
> warnings.
> 
> However, I think it is not correct to run cvsimport with autocrlf set to 
> anything other than false anyway (and safecrlf would not trigger then, right?).
> 
> So IMHO the solution is still to force autocrlf off.

I don't think that's right. What is happening is that git-hash-object is
barfing, and git-cvsimport is not properly detecting the error.
Something like this (untested) would make that better:

diff --git a/git-cvsimport.perl b/git-cvsimport.perl
index e439202..65e7990 100755
--- a/git-cvsimport.perl
+++ b/git-cvsimport.perl
@@ -926,6 +926,7 @@ while (<CVS>) {
 			my $sha = <$F>;
 			chomp $sha;
 			close $F;
+			$? and die "hash-object reported failure";
 			my $mode = pmode($cvs->{'mode'});
 			push(@new,[$mode, $sha, $fn]); # may be resurrected!
 		}

But the problem is not autocrlf. It is that the combination of "autocrlf
= input" and "safecrlf" is nonsensical. Just try this:

  $ git init
  $ git config core.autocrlf input
  $ git config core.safecrlf true
  $ printf 'DOS\r\n' >file
  $ git add file
  fatal: CRLF would be replaced by LF in file.

which makes sense. SafeCRLF is about making sure that the file will be
the same on checkin and checkout. But it won't, because we are only
doing CRLF conversion half the time.

So the best workaround is disabling safecrlf, which makes no sense with
his autocrlf setting. But I also think safecrlf could be smarter by
treating autocrlf=input as autocrlf=true. That is, we don't care if in
our _particular_ config it will come out the same; we care about whether
one could, if so inclined, get the CRLF's back to create a byte-for-byte
identical object.
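To make the round-trip argument above concrete, here is a small model (not git's actual code; a sketch of the semantics described in this thread) of the check safecrlf performs under each autocrlf setting:

```python
def checkin(data: bytes, autocrlf: str) -> bytes:
    # autocrlf=true or input: normalize CRLF to LF on the way in
    if autocrlf in ("true", "input"):
        return data.replace(b"\r\n", b"\n")
    return data

def checkout(data: bytes, autocrlf: str) -> bytes:
    # only autocrlf=true converts LF back to CRLF on the way out
    if autocrlf == "true":
        return data.replace(b"\n", b"\r\n")
    return data

def safecrlf_rejects(data: bytes, autocrlf: str) -> bool:
    # safecrlf=true: refuse the file if a checkin/checkout round trip
    # would not reproduce it byte for byte
    return checkout(checkin(data, autocrlf), autocrlf) != data

dos = b"DOS\r\n"
safecrlf_rejects(dos, "false")  # False: no conversion happens at all
safecrlf_rejects(dos, "true")   # False: CRLF is stripped, then restored
safecrlf_rejects(dos, "input")  # True: CRLF is stripped and never restored
```

With autocrlf=input the checkout side is the identity, so any file containing a CRLF fails the round trip, which is exactly the `git add` failure shown above.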

-Peff


* Re: CVS import [SOLVED]
  2009-02-20 15:28       ` Jeff King
@ 2009-02-20 16:25         ` Ferry Huberts (Pelagic)
  2009-02-20 17:29           ` autocrlf=input and safecrlf (was Re: CVS import [SOLVED]) Jeff King
  0 siblings, 1 reply; 56+ messages in thread
From: Ferry Huberts (Pelagic) @ 2009-02-20 16:25 UTC (permalink / raw)
  To: Jeff King; +Cc: Johannes Schindelin, Ferry Huberts, git

I replied in the thread with something comparable:
http://article.gmane.org/gmane.comp.version-control.git/110358

My suggestion is to make sure that safecrlf is set to false (see the end part of the mail)

Ferry

On Fri, February 20, 2009 16:28, Jeff King wrote:
> On Mon, Feb 16, 2009 at 09:59:29PM +0100, Johannes Schindelin wrote:
>
>> > I'm working on it now, and did some more testing: it's actually the
>> > safecrlf setting, not the autocrlf option.
>>
>> Oh.  That probably means that cvsimport gets confused by the extra
>> warnings.
>>
>> However, I think it is not correct to run cvsimport with autocrlf set to
>> anything other than false anyway (and safecrlf would not trigger then, right?).
>>
>> So IMHO the solution is still to force autocrlf off.
>
> I don't think that's right. What is happening is that git-hash-object is
> barfing, and git-cvsimport is not properly detecting the error.
> Something like this (untested) would make that better:
>
> diff --git a/git-cvsimport.perl b/git-cvsimport.perl
> index e439202..65e7990 100755
> --- a/git-cvsimport.perl
> +++ b/git-cvsimport.perl
> @@ -926,6 +926,7 @@ while (<CVS>) {
>  			my $sha = <$F>;
>  			chomp $sha;
>  			close $F;
> +			$? and die "hash-object reported failure";
>  			my $mode = pmode($cvs->{'mode'});
>  			push(@new,[$mode, $sha, $fn]); # may be resurrected!
>  		}
>
> But the problem is not autocrlf. It is that the combination of "autocrlf
> = input" and "safecrlf" is nonsensical. Just try this:
>
>   $ git init
>   $ git config core.autocrlf input
>   $ git config core.safecrlf true
>   $ printf 'DOS\r\n' >file
>   $ git add file
>   fatal: CRLF would be replaced by LF in file.
>
> which makes sense. SafeCRLF is about making sure that the file will be
> the same on checkin and checkout. But it won't, because we are only
> doing CRLF conversion half the time.
>
> So the best workaround is disabling safecrlf, which makes no sense with
> his autocrlf setting. But I also think safecrlf could be smarter by
> treating autocrlf=input as autocrlf=true. That is, we don't care if in
> our _particular_ config it will come out the same; we care about whether
> one could, if so inclined, get the CRLF's back to create a byte-for-byte
> identical object.
>
> -Peff
>


* autocrlf=input and safecrlf (was Re: CVS import [SOLVED])
  2009-02-20 16:25         ` Ferry Huberts (Pelagic)
@ 2009-02-20 17:29           ` Jeff King
  2009-02-20 23:24             ` Ferry Huberts (Pelagic)
  0 siblings, 1 reply; 56+ messages in thread
From: Jeff King @ 2009-02-20 17:29 UTC (permalink / raw)
  To: Ferry Huberts (Pelagic); +Cc: Johannes Schindelin, git

On Fri, Feb 20, 2009 at 05:25:43PM +0100, Ferry Huberts (Pelagic) wrote:

> I replied in the thread with something comparable:
> http://article.gmane.org/gmane.comp.version-control.git/110358
> 
> My suggestion is make sure that safecrlf is set to false (see the end
> part of the mail)

Oh, sorry, I missed that bit. You said:

> Back to the issue:
> I think requiring autocrlf = false is too strict.  Requiring autocrlf
> = false should be enough. That combined with a bit of text in the
> manual page about these settings: autocrlf = false is strongly
> recommended. Also, safecrlf is required to be set to false.

Assuming there is a typo and you meant to say "Requiring safecrlf =
false should be enough", then yes, I agree. But if you are recommending
to put that into the "git cvsimport" manpage, I'm not sure that makes
sense. Setting autocrlf to input and turning on safecrlf breaks much
more than that; you can't add any file that has a CRLF in it.  So such a
warning should probably go in the config description for those options.

I still think safecrlf could probably be made more useful in this case
to differentiate between "this will corrupt your data if you do a
checkout with your current config settings" and "this will corrupt your
data forever".  But I am not a user of either config variable, so maybe
there is some subtlety I'm missing.

-Peff


* Re: autocrlf=input and safecrlf (was Re: CVS import [SOLVED])
  2009-02-20 17:29           ` autocrlf=input and safecrlf (was Re: CVS import [SOLVED]) Jeff King
@ 2009-02-20 23:24             ` Ferry Huberts (Pelagic)
  2009-02-23  0:08               ` Jeff King
  0 siblings, 1 reply; 56+ messages in thread
From: Ferry Huberts (Pelagic) @ 2009-02-20 23:24 UTC (permalink / raw)
  To: Jeff King; +Cc: Johannes Schindelin, git

Jeff King wrote:
> On Fri, Feb 20, 2009 at 05:25:43PM +0100, Ferry Huberts (Pelagic) wrote:
>
>   
>> I replied in the thread with something comparable:
>> http://article.gmane.org/gmane.comp.version-control.git/110358
>>
>> My suggestion is make sure that safecrlf is set to false (see the end
>> part of the mail)
>>     
>
> Oh, sorry, I missed that bit. You said:
>
>   
>> Back to the issue:
>> I think requiring autocrlf = false is too strict.  Requiring autocrlf
>> = false should be enough. That combined with a bit of text in the
>> manual page about these settings: autocrlf = false is strongly
>> recommended. Also, safecrlf is required to be set to false.
>>     
>
> Assuming there is a typo and you meant to say "Requiring safecrlf =
> false should be enough", then yes, I agree. But if you are recommending
>   
yes that was a typo.
> to put that into the "git cvsimport" manpage, I'm not sure that makes
> sense. Setting autocrlf to input and turning on safecrlf breaks much
> more than that; you can't add any file that has a CRLF in it.  So such a
> warning should probably go in the config description for those options.
>
>   
I meant that I would add a patch that makes sure that a new repository 
is created with that option set to 'off', and that an existing repository 
would be checked for that option being 'off'. I suggested to _also_ add 
remarks about this in the man page of cvsimport. Johannes already 
suggested a patch, but that was for the autocrlf option (trivially 
converted to the safecrlf option)
> I still think safecrlf could probably be made more useful in this case
> to differentiate between "this will corrupt your data if you do a
> checkout with your current config settings" and "this will corrupt your
> data forever".  But I am not a user of either config variable, so maybe
> there is some subtlety I'm missing.
>
> -Peff
>   
I'm a user of these options myself. I maintain several large 
repositories that contain data used on both Unix and Windows 
platforms, and that have autocrlf=input and safecrlf=true set. This makes 
sure that everything is in Unix format.
Your remark about corrupting your data is a bit strong for my taste. 
Corruption from one point of view, making sure that everybody handles 
the same content from another :-)

Ferry


* Re: autocrlf=input and safecrlf (was Re: CVS import [SOLVED])
  2009-02-20 23:24             ` Ferry Huberts (Pelagic)
@ 2009-02-23  0:08               ` Jeff King
  2009-02-23  6:50                 ` Ferry Huberts (Pelagic)
  0 siblings, 1 reply; 56+ messages in thread
From: Jeff King @ 2009-02-23  0:08 UTC (permalink / raw)
  To: Ferry Huberts (Pelagic); +Cc: Johannes Schindelin, git

On Sat, Feb 21, 2009 at 12:24:11AM +0100, Ferry Huberts (Pelagic) wrote:

>> I still think safecrlf could probably be made more useful in this case
>> to differentiate between "this will corrupt your data if you do a
>> checkout with your current config settings" and "this will corrupt your
>> data forever".  But I am not a user of either config variable, so maybe
>> there is some subtlety I'm missing.
>
> I'm a user of these options myself. I maintain several large repositories 
> that contain data that is used both on Unix and Windows platforms and that 
> have the autocrlf=input and safecrlf=true. This makes sure that everything 
> is in Unix format.

OK, so there is some value to that combination, then, I suppose. It
seems like there must be some easier and more obvious way to say "reject
all CRLFs", but I can't think of one besides setting up a hook (which
would work at commit time, not add time).

> Your remark about corrupting your data is a bit strong for my taste.  
> Corruption from one point of view, making sure that everybody handles the 
> same content from another :-)

I'm not sure you understood what I meant. What I meant is that for some
set of data, applying CRLF->LF conversion is lossy, and will permanently
destroy the ability to restore the original data. For example, arbitrary
binary data which contains both CRLF and LF will have all CRLF become
LF, but you don't know which of the resulting LFs were originally CRLFs,
and which were just LFs. The data is corrupted, there is no way to get
back the original, and this is what safecrlf is about protecting against.
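[The irreversibility can be demonstrated in a few lines; this is a
stand-alone Python illustration, and the byte strings are made up:]

```python
# Mixed data: some CRLF line endings, some bare LFs (think of an
# arbitrary binary file).
original = b"header\r\npayload\nfooter\r\n"

# CRLF -> LF, as done on checkin.
converted = original.replace(b"\r\n", b"\n")

# The best possible "undo" turns *every* LF back into CRLF, including
# the one that was a bare LF to begin with, so the original is gone.
restored = converted.replace(b"\n", b"\r\n")
assert restored != original
```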

However, that safecrlf check is implemented by saying "with the current
autocrlf settings, would checkin and checkout get the same file?". In
the case of autocrlf=true, that check exactly prevents the data above
from being corrupted. But with autocrlf=input, it prevents _any_ CRs
from being converted, since checkout will not convert them back. So even
though your data is not irretrievable (the transformation _is_
reversible, you just don't have it enabled), safecrlf is still
triggering and refusing the content.
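[A toy model of that round-trip check; the function names are invented
and git's real logic lives in its C code, but the behaviour sketched
here matches the description above:]

```python
def checkin(data, autocrlf):
    # Both autocrlf=true and autocrlf=input convert CRLF -> LF on the
    # way into the repository.
    if autocrlf in ("true", "input"):
        return data.replace(b"\r\n", b"\n")
    return data

def checkout(data, autocrlf):
    # Only autocrlf=true converts LF -> CRLF on the way back out.
    if autocrlf == "true":
        return data.replace(b"\n", b"\r\n")
    return data

def safecrlf_ok(data, autocrlf):
    # safecrlf asks: does checkin followed by checkout reproduce the file?
    return checkout(checkin(data, autocrlf), autocrlf) == data

# A pure-CRLF file round-trips under autocrlf=true ...
assert safecrlf_ok(b"dos\r\nfile\r\n", "true")
# ... but under autocrlf=input any CR is refused, even though the
# conversion itself would be reversible for this file.
assert not safecrlf_ok(b"dos\r\nfile\r\n", "input")
# And under autocrlf=true a pure-LF file is refused, since checkout
# would rewrite it as CRLF.
assert not safecrlf_ok(b"unix\nfile\n", "true")
```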

And I was suggesting that it might be useful to distinguish between
those two situations. Because right now, with autocrlf=input you have
two choices:

  - safecrlf=false, in which you will corrupt mixed CRLF/LF data without
    any warning

  - safecrlf=true, in which case you are not allowed to check in CR at
    all

But there is no choice for "protect me from actual corruption, but
convert text files (i.e., all CRLF)".

I am a bit concerned about a proposal to set safecrlf=false in all
cvsimported repositories.  You are turning off the protection against
corrupting binary files.  _Even if_ the person has put safecrlf=true
into their ~/.gitconfig and thinks they are safe.

-Peff

* Re: autocrlf=input and safecrlf (was Re: CVS import [SOLVED])
  2009-02-23  0:08               ` Jeff King
@ 2009-02-23  6:50                 ` Ferry Huberts (Pelagic)
  2009-02-23  6:56                   ` Jeff King
  0 siblings, 1 reply; 56+ messages in thread
From: Ferry Huberts (Pelagic) @ 2009-02-23  6:50 UTC (permalink / raw)
  To: Jeff King; +Cc: Johannes Schindelin, git

Jeff King wrote:
> On Sat, Feb 21, 2009 at 12:24:11AM +0100, Ferry Huberts (Pelagic) wrote:
>
>>> I still think safecrlf could probably be made more useful in this case to differentiate between "this will corrupt your data if you do a checkout with your current config settings" and "this
will corrupt your data forever".  But I am not a user of either config variable, so maybe there is some subtlety I'm missing.
>> I'm a user of these options myself. I maintain several large repositories  that contain data that is used both on Unix and Windows platforms and that  have the autocrlf=input and safecrlf=true.
This makes sure that everything  is in Unix format.
>
> OK, so there is some value to that combination, then, I suppose. It seems like there must be some easier and more obvious way to say "reject all CRLFs", but I can't think of one besides setting up
a hook (which would work at commit time, not add time).
>
>> Your remark about corrupting your data is a bit strong for my taste.   Corruption from one point of view, making sure that everybody handles the  same content from another :-)
>
> I'm not sure you understood what I meant. What I meant is that for some set of data, applying CRLF->LF conversion is lossy, and will permanently destroy the ability to restore the original data.
For example, arbitrary binary data which contains both CRLF and LF will have all CRLF become LF, but you don't know which of the resulting LFs were originally CRLFs, and which were just LFs. The
data is corrupted, there is no way to get back the original, and this is what CRLF is about protecting.
>
> However, that safecrlf check is implemented by saying "with the current autocrlf settings, would checkin and checkout get the same file?". In the case of autocrlf=true, that that exactly prevents
the data above from being corrupted. But with autocrlf=input, it prevents _any_ CRs from being converted, since checkout will not convert them back. So even though your data is not irretrievable
(the transformation _is_
> reversible, you just don't have it enabled), safecrlf is still
> triggering and refusing the content.
>
> And I was suggesting that it might be useful to distinguish between those two situations. Because right now, with autocrlf=input you have two choices:
>
>   - safecrlf=false, in which you will corrupt mixed CRLF/LF data without
>     any warning
>
>   - safecrlf=true, in which case you are not allowed to check in CR at
>     all
>
> But there is no choice for "protect me from actual corruption, but convert text files (i.e., all CRLF)".
>
> I am a bit concerned about a proposal to set safecrlf=false in all cvsimported repositories.  You are turning off the protection against corrupting binary files.  _Even if_ the person has put
safecrlf=true into their ~/.gitconfig and thinks they are safe.
>
> -Peff

Ok.
I follow and agree. Full circle :-)
We're back to Johannes' proposal: make sure that autocrlf is set to false. Agree? If so, then I'll try to whip up a patch.

Ferry

* Re: autocrlf=input and safecrlf (was Re: CVS import [SOLVED])
  2009-02-23  6:50                 ` Ferry Huberts (Pelagic)
@ 2009-02-23  6:56                   ` Jeff King
  2009-02-23  7:09                     ` Ferry Huberts (Pelagic)
  0 siblings, 1 reply; 56+ messages in thread
From: Jeff King @ 2009-02-23  6:56 UTC (permalink / raw)
  To: Ferry Huberts (Pelagic); +Cc: Johannes Schindelin, git

[Can you please wrap your lines? They seem to be about 200 characters,
 and then the quoted text gets rewrapped, but without a '>' marker.
 I had a very hard time figuring out what was quoted and what was not.]

On Mon, Feb 23, 2009 at 07:50:48AM +0100, Ferry Huberts (Pelagic) wrote:

> > I am a bit concerned about a proposal to set safecrlf=false in all
> > cvsimported repositories.  You are turning off the protection
> > against corrupting binary files.  _Even if_ the person has put
> > safecrlf=true into their ~/.gitconfig and thinks they are safe.
> 
> Ok.  I follow and agree. Full circle :-) We're back to Johannes'
> proposal: make sure that autocrlf is set to false. Agree? If so, then
> I'll try to whip up a patch.

But won't that now import CRLF's into your new git repo? Especially on
Windows, where (IIRC) cvs gives you files with CRLF by default?

-Peff

* Re: autocrlf=input and safecrlf (was Re: CVS import [SOLVED])
  2009-02-23  6:56                   ` Jeff King
@ 2009-02-23  7:09                     ` Ferry Huberts (Pelagic)
  2009-02-23  7:10                       ` Jeff King
  0 siblings, 1 reply; 56+ messages in thread
From: Ferry Huberts (Pelagic) @ 2009-02-23  7:09 UTC (permalink / raw)
  To: Jeff King; +Cc: Ferry Huberts, Johannes Schindelin, git

On Mon, February 23, 2009 07:56, Jeff King wrote:
> [Can you please wrap your lines? They seem to be about 200 characters,
>  and then rewrap the quoted text, but without putting a '>' marker.
>  I had a very hard time figuring out what was quoted and what was not.]
>
> On Mon, Feb 23, 2009 at 07:50:48AM +0100, Ferry Huberts (Pelagic) wrote:
>
>> > I am a bit concerned about a proposal to set safecrlf=false in all
>> > cvsimported repositories.  You are turning off the protection
>> > against corrupting binary files.  _Even if_ the person has put
>> > safecrlf=true into their ~/.gitconfig and thinks they are safe.
>>
>> Ok.  I follow and agree. Full circle :-) We're back to Johannes'
>> proposal: make sure that autocrlf is set to false. Agree? If so, then
>> I'll try to whip up a patch.
>
> But won't that now import CRLF's into your new git repo? Especially on
> Windows, where (IIRC) cvs gives you files with CRLF by default?
>
> -Peff
>

Yes it would. But sadly that's the only way to make sure that the import
will work (without serious manual intervention).
I found this out the hard way.

I started these discussions to narrow down what we should actually patch.
Now it appears that we should do what Johannes originally proposed. Always
nice to be right, isn't it Johannes? :-)

Ferry

* Re: autocrlf=input and safecrlf (was Re: CVS import [SOLVED])
  2009-02-23  7:09                     ` Ferry Huberts (Pelagic)
@ 2009-02-23  7:10                       ` Jeff King
  2009-02-23  7:29                         ` Ferry Huberts (Pelagic)
  0 siblings, 1 reply; 56+ messages in thread
From: Jeff King @ 2009-02-23  7:10 UTC (permalink / raw)
  To: Ferry Huberts (Pelagic); +Cc: Johannes Schindelin, git

On Mon, Feb 23, 2009 at 08:09:03AM +0100, Ferry Huberts (Pelagic) wrote:

> > But won't that now import CRLF's into your new git repo? Especially on
> > Windows, where (IIRC) cvs gives you files with CRLF by default?
> 
> Yes it would. But sadly that's the only way to make sure that the import
> will work (without serious manual intervention).
> I found this out the hard way.

Wouldn't setting autocrlf=true, safecrlf=true do the import you want?
Then you could reset autocrlf to input after import but before checkout
time.

-Peff

* Re: autocrlf=input and safecrlf (was Re: CVS import [SOLVED])
  2009-02-23  7:10                       ` Jeff King
@ 2009-02-23  7:29                         ` Ferry Huberts (Pelagic)
  2009-02-24  6:11                           ` Jeff King
  0 siblings, 1 reply; 56+ messages in thread
From: Ferry Huberts (Pelagic) @ 2009-02-23  7:29 UTC (permalink / raw)
  To: Jeff King; +Cc: Ferry Huberts, Johannes Schindelin, git

On Mon, February 23, 2009 08:10, Jeff King wrote:
> On Mon, Feb 23, 2009 at 08:09:03AM +0100, Ferry Huberts (Pelagic) wrote:
>
>> > But won't that now import CRLF's into your new git repo? Especially on
>> > Windows, where (IIRC) cvs gives you files with CRLF by default?
>>
>> Yes it would. But sadly that's the only way to make sure that the import
>> will work (without serious manual intervention).
>> I found this out the hard way.
>
> Wouldn't setting autocrlf=true, safecrlf=true do the import you want?
> Then you could reset autocrlf to input after import but before checkout
> time.

No. As I demonstrated in my testing setup, the combination of autocrlf=true
and safecrlf=true ALWAYS makes the import NOT work (for repositories that
have CRLF files). In my own experience I also found that the combination
of autocrlf=input and safecrlf=true ALWAYS makes the import NOT work for
PRACTICAL repositories. That led me to the conclusion to require
safecrlf=false. From the discussion and arguments from you it appeared
that that wouldn't be enough. Therefore I think that we have to require
autocrlf=false (which makes git ignore the safecrlf setting).

Ferry

* Re: autocrlf=input and safecrlf (was Re: CVS import [SOLVED])
  2009-02-23  7:29                         ` Ferry Huberts (Pelagic)
@ 2009-02-24  6:11                           ` Jeff King
  2009-02-24  9:25                             ` Ferry Huberts (Pelagic)
  0 siblings, 1 reply; 56+ messages in thread
From: Jeff King @ 2009-02-24  6:11 UTC (permalink / raw)
  To: Ferry Huberts (Pelagic); +Cc: Johannes Schindelin, git

On Mon, Feb 23, 2009 at 08:29:57AM +0100, Ferry Huberts (Pelagic) wrote:

> > Wouldn't setting autocrlf=true, safecrlf=true do the import you want?
> > Then you could reset autocrlf to input after import but before checkout
> > time.
> 
> No. As I demonstrated in my testing setup the combination of autocrlf=true
> and safecrlf=true ALWAYS makes the import NOT work (for repositories that
> have CRLF files). In my own experience I also found that the combination

OK, sorry I missed that before.

It actually works fine with CRLF files. It breaks on _LF_ files.  Look
again at the output you posted, which shows it barfing while working on
unix.txt.

This is the flip-side of the CRLF and autocrlf=input problem; safecrlf
is protecting us from the case where the file would change on checkout,
in addition to when it would actually be corrupted.

But both of those checks (CRLF on autocrlf=input and safecrlf=true, and
LF on autocrlf=true and safecrlf=true) aren't useful to us here;
we are not coming from the working tree to git, and worried about
getting back. We are munging input coming from cvs, and whatever gets
put into the working tree is fine (as long as it is not binary
corruption).

So I think the right solution is a relaxed safecrlf mode that protects
against corruption, but not these other cases. And then git-cvsimport
should use that.
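[Such a relaxed mode might look like this; it is hypothetical, no such
git option exists today. It would refuse the conversion only when it is
irreversible, i.e. when a file mixes CRLF with bare LF:]

```python
def conversion_is_lossy(data):
    # CRLF -> LF is irreversible only when the file mixes CRLF line
    # endings with bare LFs: after conversion you can no longer tell
    # which LFs used to be CRLFs.
    stripped = data.replace(b"\r\n", b"")
    has_crlf = b"\r\n" in data
    has_bare_lf = b"\n" in stripped
    return has_crlf and has_bare_lf

assert not conversion_is_lossy(b"dos\r\nfile\r\n")   # pure CRLF: reversible
assert not conversion_is_lossy(b"unix\nfile\n")      # nothing to convert
assert conversion_is_lossy(b"mixed\r\nendings\n")    # would be corrupted
```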

In the meantime, detecting the situation is not a bad idea.

> of autocrlf=input and safecrlf=true ALWAYS makes the import NOT work for
> PRACTICAL repositories. That led me to the conclusion to require
> safecrlf=false. From the discussion and arguments from you it appeared
> that that wouldn't be enough. Therefore I think that we have to require
> autocrlf=false (which makes git ignore the safecrlf setting).

So yes, in some sense it is safecrlf that is broken. I'm just concerned
about tweaking the user's options behind their back. The import can
happen differently than they expected no matter which of safecrlf or
autocrlf you tweak. So I think you are better off to complain and die.

-Peff

* Re: autocrlf=input and safecrlf (was Re: CVS import [SOLVED])
  2009-02-24  6:11                           ` Jeff King
@ 2009-02-24  9:25                             ` Ferry Huberts (Pelagic)
  2009-02-25  6:56                               ` Jeff King
  0 siblings, 1 reply; 56+ messages in thread
From: Ferry Huberts (Pelagic) @ 2009-02-24  9:25 UTC (permalink / raw)
  To: Jeff King; +Cc: Ferry Huberts, Johannes Schindelin, git


> So yes, in some sense it is safecrlf that is broken. I'm just concerned
> about tweaking the user's options behind their back. The import can
> happen differently than they expected no matter which of safecrlf or
> autocrlf you tweak. So I think you are better off to complain and die.

The plan was:
- when creating a new git repo for cvs import: setup safecrlf=false
- when importing into an existing repo: check whether the safecrlf
  setting is set to false and crash and burn when not :-)
  (complain before going up in flames)

* Re: autocrlf=input and safecrlf (was Re: CVS import [SOLVED])
  2009-02-24  9:25                             ` Ferry Huberts (Pelagic)
@ 2009-02-25  6:56                               ` Jeff King
  2009-02-25  8:03                                 ` Ferry Huberts (Pelagic)
  0 siblings, 1 reply; 56+ messages in thread
From: Jeff King @ 2009-02-25  6:56 UTC (permalink / raw)
  To: Ferry Huberts (Pelagic); +Cc: Johannes Schindelin, git

On Tue, Feb 24, 2009 at 10:25:12AM +0100, Ferry Huberts (Pelagic) wrote:

> > So yes, in some sense it is safecrlf that is broken. I'm just concerned
> > about tweaking the user's options behind their back. The import can
> > happen differently than they expected no matter which of safecrlf or
> > autocrlf you tweak. So I think you are better off to complain and die.
> 
> The plan was:
> - when creating a new git repo for cvs import: setup safecrlf=false
> - when importing into an existing repo: check whether the safecrlf
>   setting is set to false and crash and burn when not :-)
>   (complain before going up in flames)

Why is it OK to silently change the settings in the first case, but not
the second? Don't both have the potential to screw up the user's import?

Also, are settings going to be unset after the first import? If so, then
further incremental imports will fail as described in your second case.
But if not, then safecrlf is turned off for that repo, even for
non-cvsimport commands, overriding anything in the user's ~/.gitconfig.
For somebody doing a one-shot import, they are paying that price without
any benefit.

-Peff

* Re: autocrlf=input and safecrlf (was Re: CVS import [SOLVED])
  2009-02-25  6:56                               ` Jeff King
@ 2009-02-25  8:03                                 ` Ferry Huberts (Pelagic)
  2009-02-25  9:03                                   ` Jeff King
  0 siblings, 1 reply; 56+ messages in thread
From: Ferry Huberts (Pelagic) @ 2009-02-25  8:03 UTC (permalink / raw)
  To: Jeff King; +Cc: Ferry Huberts, Johannes Schindelin, git



On Wed, February 25, 2009 07:56, Jeff King wrote:
> On Tue, Feb 24, 2009 at 10:25:12AM +0100, Ferry Huberts (Pelagic) wrote:
>
>> > So yes, in some sense it is safecrlf that is broken. I'm just concerned
>> > about tweaking the user's options behind their back. The import can
>> > happen differently than they expected no matter which of safecrlf or
>> > autocrlf you tweak. So I think you are better off to complain and die.
>>
>> The plan was:
>> - when creating a new git repo for cvs import: setup safecrlf=false
>> - when importing into an existing repo: check whether the safecrlf
>>   setting is set to false and crash and burn when not :-)
>>   (complain before going up in flames)
>
> Why is it OK to silently change the settings in the first case, but not
> the second? Don't both have the potential to screw up the user's import?
>

the option would be set up for the import repository only, not globally or system-wide

> Also, are settings going to be unset after the first import? If so, then
> further incremental imports will fail as described in your second case.
> But if not, then safecrlf is turned off for that repo, even for
> non-cvsimport commands, overriding anything in the user's ~/.gitconfig.
> For somebody doing a one-shot import, they are paying that price without
> any benefit.
>
This actually makes sense to me. I was only thinking about the continuous
import use-case. In that light it would be better to just complain and die
in the script. I guess I'll just implement that in the patch then.

Ferry

* Re: autocrlf=input and safecrlf (was Re: CVS import [SOLVED])
  2009-02-25  8:03                                 ` Ferry Huberts (Pelagic)
@ 2009-02-25  9:03                                   ` Jeff King
  0 siblings, 0 replies; 56+ messages in thread
From: Jeff King @ 2009-02-25  9:03 UTC (permalink / raw)
  To: Ferry Huberts (Pelagic); +Cc: Johannes Schindelin, git

On Wed, Feb 25, 2009 at 09:03:00AM +0100, Ferry Huberts (Pelagic) wrote:

> > Also, are settings going to be unset after the first import? If so, then
> > further incremental imports will fail as described in your second case.
> > But if not, then safecrlf is turned off for that repo, even for
> > non-cvsimport commands, overriding anything in the user's ~/.gitconfig.
> > For somebody doing a one-shot import, they are paying that price without
> > any benefit.
> >
> this actually makes sense to me. I was only thinking about the continuous
> import use-case. In that light it would be better to just complain and die
> in the script. I guess I'll just implement that in the patch then.

OK, that sounds reasonable to me.

-Peff

* Re: cvs import
  2006-09-16  3:39               ` Shawn Pearce
@ 2006-09-16  6:04                 ` Oswald Buddenhagen
  0 siblings, 0 replies; 56+ messages in thread
From: Oswald Buddenhagen @ 2006-09-16  6:04 UTC (permalink / raw)
  To: Shawn Pearce
  Cc: Markus Schiltknecht, Michael Haggerty, Jon Smirl,
	Martin Langhoff, Git Mailing List, monotone-devel, dev

On Fri, Sep 15, 2006 at 11:39:18PM -0400, Shawn Pearce wrote:
> On the other hand from what I understand of Monotone it needs
> the revisions in oldest->newest order, as does SVN.
> 
> Doing both orderings in cvs2noncvs is probably ugly.
>
Don't worry; as I know Mike, he'll come up with an abstract, outright
beautiful interface that makes you want to implement middle->oldnewest
just for the sake of doing it. :)

-- 
Hi! I'm a .signature virus! Copy me into your ~/.signature, please!
--
Chaos, panic, and disorder - my work here is done.

* Re: cvs import
  2006-09-15  7:37             ` Markus Schiltknecht
@ 2006-09-16  3:39               ` Shawn Pearce
  2006-09-16  6:04                 ` Oswald Buddenhagen
  0 siblings, 1 reply; 56+ messages in thread
From: Shawn Pearce @ 2006-09-16  3:39 UTC (permalink / raw)
  To: Markus Schiltknecht
  Cc: Michael Haggerty, Jon Smirl, Martin Langhoff, Git Mailing List,
	monotone-devel, dev

Markus Schiltknecht <markus@bluegap.ch> wrote:
> Shawn Pearce wrote:
> >I don't know how the Monotone guys feel about it but I think Git
> >is happy with the data in any order, just so long as the dependency
> >chains aren't fed out of order.  Which I think nearly all changeset
> >based SCMs would have an issue with.  So we should be just fine
> >with the current chronological order produced by cvs2svn.
> 
> I'd vote for splitting into file data (and delta / patches) import and 
> metadata import (author, changelog, DAG).
> 
> Monotone would be happiest if the file data were sent one file after 
> another and (inside each file) in the order of each file's single 
> history. That guarantees good import performance for monotone. I imagine 
> it's about the same for git. And if you have to somehow cache the files 
> anyway, subversion will benefit, too. (Well, at least the cache will 
> thank us with good performance).
>
> After all file data has been delivered, the metadata can be delivered. 
> As neither monotone nor git cares much if they are chronological across 
> branches, I'd vote for doing it that way.

Right.  I think that one of the cvs2svn guys had the right idea
here.  Provide two hooks: one early during the RCS file parse which
supplies a backend each full text file revision and another during
the very last stage which includes the "file" in the metadata stream
for commit.

This would give Git and Monotone a way to grab the full text for each
file and stream them out up front, then include only a "token" in the
metadata stream which identifies the specific revision.  Meanwhile
SVN can either cache the file revision during the early part or
ignore it, then dump out the full content during the metadata.
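[The two hooks could be modelled roughly like this; a Python sketch in
which every class and method name is invented, this is not cvs2svn's
actual API:]

```python
# Hook 1 delivers each revision's full text early, during RCS parsing;
# hook 2 delivers a metadata-only commit stream that references those
# revisions by (path, rev) token.

class Backend:
    def full_text(self, path, rev, data):
        """Hook 1: called once per file revision while parsing ,v files."""
        raise NotImplementedError

    def commit(self, author, log, files):
        """Hook 2: called in the final stage; files are (path, rev) tokens."""
        raise NotImplementedError

class GitLikeBackend(Backend):
    """Streams blobs out immediately; commits carry only tokens."""
    def __init__(self):
        self.blob_sizes = {}
        self.commits = []

    def full_text(self, path, rev, data):
        # Stand-in for writing a blob to the object store right away.
        self.blob_sizes[(path, rev)] = len(data)

    def commit(self, author, log, files):
        self.commits.append((author, log, list(files)))

be = GitLikeBackend()
be.full_text("src/main.c", "1.1", b"int main(void) { return 0; }\n")
be.commit("jon", "initial import", [("src/main.c", "1.1")])
assert ("src/main.c", "1.1") in be.blob_sizes
```

An SVN-like backend would instead cache (or ignore) the data in hook 1
and emit the full content in hook 2, as described above.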


As it happens Git doesn't care what order the file revisions come in.
If we don't repack the imported data we would prefer to get the
revisions in newest->oldest order so we can delta the older versions
against the newer versions (like RCS).  This also happens to be
the fastest way to extract the revision data from RCS.

On the other hand from what I understand of Monotone it needs
the revisions in oldest->newest order, as does SVN.

Doing both orderings in cvs2noncvs is probably ugly.  Doing just
oldest->newest (since 2/3 backends want that) would be acceptable
but would slow down Git imports as the RCS parsing overhead would
be much higher.

-- 
Shawn.

* Re: cvs import
  2006-09-14 15:50           ` Shawn Pearce
  2006-09-14 16:04             ` Jakub Narebski
@ 2006-09-15  7:37             ` Markus Schiltknecht
  2006-09-16  3:39               ` Shawn Pearce
  1 sibling, 1 reply; 56+ messages in thread
From: Markus Schiltknecht @ 2006-09-15  7:37 UTC (permalink / raw)
  To: Shawn Pearce
  Cc: Michael Haggerty, Jon Smirl, Martin Langhoff, Git Mailing List,
	monotone-devel, dev

Hi,

Shawn Pearce wrote:
> I don't know how the Monotone guys feel about it but I think Git
> is happy with the data in any order, just so long as the dependency
> chains aren't fed out of order.  Which I think nearly all changeset
> based SCMs would have an issue with.  So we should be just fine
> with the current chronological order produced by cvs2svn.

I'd vote for splitting into file data (and delta / patches) import and 
metadata import (author, changelog, DAG).

Monotone would be happiest if the file data were sent one file after 
another and (inside each file) in the order of each file's single 
history. That guarantees good import performance for monotone. I imagine 
it's about the same for git. And if you have to somehow cache the files 
anyway, subversion will benefit, too. (Well, at least the cache will 
thank us with good performance).

After all file data has been delivered, the metadata can be delivered. 
As neither monotone nor git cares much if they are chronological across 
branches, I'd vote for doing it that way.

Regards

Markus

* Re: cvs import
  2006-09-14 17:01                 ` Michael Haggerty
  2006-09-14 17:08                   ` Jakub Narebski
@ 2006-09-14 17:17                   ` Jon Smirl
  1 sibling, 0 replies; 56+ messages in thread
From: Jon Smirl @ 2006-09-14 17:17 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: monotone-devel, dev, git, Jakub Narebski

On 9/14/06, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> Jon Smirl wrote:
> > On 9/14/06, Jakub Narebski <jnareb@gmail.com> wrote:
> >> Shawn Pearce wrote:
> >>
> >> > Originally I wanted Jon Smirl to modify the cvs2svn (...)
> >>
> >> By the way, will cvs2git (modified cvs2svn) and git-fast-import be
> >> publicly available?
> >
> > It has some unresolved problems so I wasn't spreading it around everywhere.
> >
> > It is based on cvs2svn from August. There has been too much change to
> > the current cvs2svn to merge it anymore. [...]
> >
> > If the repo is missing branch tags cvs2svn may turn a single missing
> > branch into hundreds of branches. The Mozilla repo has about 1000
> > extra branches because of this.
>
> [To explain to our studio audience:] Currently, if there is an actual
> branch in CVS but no symbol associated with it, cvs2svn generates branch
> labels like "unlabeled-1.2.3", where "1.2.3" is the branch revision
> number in CVS for the particular file.  The problem is that the branch
> revision numbers for files in the same logical branch are usually
> different.  That is why many extra branches are generated.
>
> Such unnamed branches cannot reasonably be accessed via CVS anyway, and
> somebody probably made the conscious decision to delete the branch from
> CVS (though without doing it correctly).  Therefore such revisions are
> probably garbage.  It would be easy to add an option to discard such
> revisions, and we should probably do so.  (In fact, they can already be
> excluded with "--exclude=unlabeled-.*".)  The only caveat is that it is
> possible for other, named branches to sprout from an unnamed branch.  In
> this case either the second branch would have to be excluded too, or the
> unlabeled branch would have to be included.

In MozCVS there are important branches where the first label has been
deleted but there are subsequent branches off from the first branch.
These subsequent branches are still visible in CVS. Someone else had
this same problem on the cvs2svn list. This has happened twice on major
branches.

Manually looking at one of these it looks like the author wanted to
change the branch name. They made a branch with the wrong name,
branched again with the new name, and deleted the first branch.

> Alternatively, there was a suggestion to add heuristics to guess which
> files' "unlabeled" branches actually belong in the same original branch.
>  This would be a lot of work, and the result would never be very
> accurate (for one thing, there is no evidence of the branch whatsoever
> in files that had no commits on the branch).

You wrote up a detailed solution for this a few weeks ago on the
cvs2svn list. The basic idea is to look at the change sets on the
unlabeled branches. If change sets span multiple unlabeled branches,
there should be one unlabeled branch instead of multiple ones. That
would work to reduce the number of unlabeled branches down from 1000
to the true number which I believe is in the 10-20 range.
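[That heuristic can be sketched as union-find over the per-file branch
labels; a Python illustration in which the changeset data is made up:]

```python
# If one changeset touches files on several "unlabeled-*" branches,
# those per-file labels probably belong to the same logical branch,
# so merge them with union-find.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

changesets = [
    {"unlabeled-1.2.3", "unlabeled-1.4.2"},  # one commit, two labels
    {"unlabeled-1.4.2", "unlabeled-1.6.8"},
    {"unlabeled-2.1.1"},                     # genuinely separate branch
]
for cs in changesets:
    labels = sorted(cs)
    for label in labels[1:]:
        union(labels[0], label)

groups = {find(label) for cs in changesets for label in cs}
assert len(groups) == 2  # 2 logical branches instead of 4 labels
```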

Would the dependency-based model make these relationships more obvious?

>
> Other ideas are welcome.
>
> Michael
>


-- 
Jon Smirl
jonsmirl@gmail.com

* Re: cvs import
  2006-09-14 17:01                 ` Michael Haggerty
@ 2006-09-14 17:08                   ` Jakub Narebski
  2006-09-14 17:17                   ` Jon Smirl
  1 sibling, 0 replies; 56+ messages in thread
From: Jakub Narebski @ 2006-09-14 17:08 UTC (permalink / raw)
  To: monotone-devel; +Cc: dev, git

Michael Haggerty wrote:

> Alternatively, there was a suggestion to add heuristics to guess which
> files' "unlabeled" branches actually belong in the same original branch.
>  This would be a lot of work, and the result would never be very
> accurate (for one thing, there is no evidence of the branch whatsoever
> in files that had no commits on the branch).
> 
> Other ideas are welcome.

Interpolate the state of the repository according to timestamps, with some
coarse-graininess of course.

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git

* Re: cvs import
  2006-09-14 16:27               ` Jon Smirl
@ 2006-09-14 17:01                 ` Michael Haggerty
  2006-09-14 17:08                   ` Jakub Narebski
  2006-09-14 17:17                   ` Jon Smirl
  0 siblings, 2 replies; 56+ messages in thread
From: Michael Haggerty @ 2006-09-14 17:01 UTC (permalink / raw)
  To: Jon Smirl; +Cc: monotone-devel, dev, git, Jakub Narebski

Jon Smirl wrote:
> On 9/14/06, Jakub Narebski <jnareb@gmail.com> wrote:
>> Shawn Pearce wrote:
>>
>> > Originally I wanted Jon Smirl to modify the cvs2svn (...)
>>
>> By the way, will cvs2git (modified cvs2svn) and git-fast-import be
>> publicly available?
> 
> It has some unresolved problems so I wasn't spreading it around everywhere.
> 
> It is based on cvs2svn from August. There has been too much change to
> the current cvs2svn to merge it anymore. [...]
> 
> If the repo is missing branch tags cvs2svn may turn a single missing
> branch into hundreds of branches. The Mozilla repo has about 1000
> extra branches because of this.

[To explain to our studio audience:] Currently, if there is an actual
branch in CVS but no symbol associated with it, cvs2svn generates branch
labels like "unlabeled-1.2.3", where "1.2.3" is the branch revision
number in CVS for the particular file.  The problem is that the branch
revision numbers for files in the same logical branch are usually
different.  That is why many extra branches are generated.

Such unnamed branches cannot reasonably be accessed via CVS anyway, and
somebody probably made the conscious decision to delete the branch from
CVS (though without doing it correctly).  Therefore such revisions are
probably garbage.  It would be easy to add an option to discard such
revisions, and we should probably do so.  (In fact, they can already be
excluded with "--exclude=unlabeled-.*".)  The only caveat is that it is
possible for other, named branches to sprout from an unnamed branch.  In
this case either the second branch would have to be excluded too, or the
unlabeled branch would have to be included.

Alternatively, there was a suggestion to add heuristics to guess which
files' "unlabeled" branches actually belong in the same original branch.
 This would be a lot of work, and the result would never be very
accurate (for one thing, there is no evidence of the branch whatsoever
in files that had no commits on the branch).

Other ideas are welcome.

Michael

* Re: cvs import
  2006-09-14 16:04             ` Jakub Narebski
  2006-09-14 16:18               ` Shawn Pearce
@ 2006-09-14 16:27               ` Jon Smirl
  2006-09-14 17:01                 ` Michael Haggerty
  1 sibling, 1 reply; 56+ messages in thread
From: Jon Smirl @ 2006-09-14 16:27 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git, monotone-devel, dev

On 9/14/06, Jakub Narebski <jnareb@gmail.com> wrote:
> Shawn Pearce wrote:
>
> > Originally I wanted Jon Smirl to modify the cvs2svn (...)
>
> By the way, will cvs2git (modified cvs2svn) and git-fast-import publicly
> available?

It has some unresolved problems so I wasn't spreading it around everywhere.

It is based on cvs2svn from August. There has been too much change to
the current cvs2svn to merge it anymore. It is going to need a
significant rewrite. But cvs2svn will change again anyway if it
converts to the dependency model. It is better to get a
backend-independent interface built into cvs2svn.

It is not generating an accurate repo. cvs2svn is outputting tags
based on multiple revisions, which git can't do. I'm just tossing some
of the tag data that git can't handle. I base the tag on the first
revision, which is not correct.

If the repo is missing branch tags cvs2svn may turn a single missing
branch into hundreds of branches. The Mozilla repo has about 1000
extra branches because of this.

Sometimes cvs2svn will do a partial copy from another rev to generate
a new rev. Git doesn't do this, so I am tossing the copy requests. I
need to figure out how to hook into the data before cvs2svn tries to
copy things.

cvs2svn makes no attempt to detect merges so gitk will show 1,700
active branches when there are really only 10 currently active
branches in Mozilla.

That said, 99.9% of Mozilla CVS is in the output git repo, but it
isn't quite right.

If you still want the code I'll send it to you.

-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: cvs import
  2006-09-14 16:04             ` Jakub Narebski
@ 2006-09-14 16:18               ` Shawn Pearce
  2006-09-14 16:27               ` Jon Smirl
  1 sibling, 0 replies; 56+ messages in thread
From: Shawn Pearce @ 2006-09-14 16:18 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git, monotone-devel, dev

Jakub Narebski <jnareb@gmail.com> wrote:
> Shawn Pearce wrote:
> 
> > Originally I wanted Jon Smirl to modify the cvs2svn (...)
> 
> By the way, will cvs2git (modified cvs2svn) and git-fast-import publicly
> available?

Yes.  I want to submit git-fast-import to the main Git project and
ask Junio to bring it in.

However right now I feel like the code isn't up-to-snuff and won't
pass peer review on the Git mailing list.  So I wanted to spend a
little bit of time cleaning it up before asking Junio to carry it
in the main distribution.  My pack mmap window code is actually
part of that cleanup.

I think the goal of this thread is to try and merge the ideas
behind Jon's modified cvs2svn into the core cvs2svn, possibly
causing cvs2svn to be renamed to cvs2notcvs (or some such) and
having a slightly more modular output format so Git, Monotone and
Subversion can all benefit from the difficult-to-do-right changeset
generation logic.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: cvs import
  2006-09-14 15:50           ` Shawn Pearce
@ 2006-09-14 16:04             ` Jakub Narebski
  2006-09-14 16:18               ` Shawn Pearce
  2006-09-14 16:27               ` Jon Smirl
  2006-09-15  7:37             ` Markus Schiltknecht
  1 sibling, 2 replies; 56+ messages in thread
From: Jakub Narebski @ 2006-09-14 16:04 UTC (permalink / raw)
  To: git; +Cc: monotone-devel, dev, monotone-devel, git

Shawn Pearce wrote:

> Originally I wanted Jon Smirl to modify the cvs2svn (...)

By the way, will cvs2git (modified cvs2svn) and git-fast-import publicly
available?

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: cvs import
  2006-09-14  5:36         ` Michael Haggerty
@ 2006-09-14 15:50           ` Shawn Pearce
  2006-09-14 16:04             ` Jakub Narebski
  2006-09-15  7:37             ` Markus Schiltknecht
  0 siblings, 2 replies; 56+ messages in thread
From: Shawn Pearce @ 2006-09-14 15:50 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: Martin Langhoff, monotone-devel, Jon Smirl, dev, Git Mailing List

Michael Haggerty <mhagger@alum.mit.edu> wrote:
>  The only difference between our SCMs that might be difficult
> to paper over in a universal dumpfile is that SVN wants its changesets
> in chronological order, whereas I gather that others would prefer the
> data in dependency order branch by branch.

This really isn't an issue for Git.

Originally I wanted Jon Smirl to modify the cvs2svn code to emit
only one branch at a time as that would be much faster than jumping
around branches in chronological order.  But it turned out to
be too much work to change cvs2svn.  So git-fast-import (the Git
program that consumes the dump stream from Jon's modified cvs2svn)
maintains an LRU of the branches in memory and reloads inactive
branches as necessary when cvs2svn jumps around.

It turns out it didn't matter if the git-fast-import maintained 5
active branches in the LRU or 60.  Apparently the Mozilla repo didn't
jump around more than 5 branches at a time - most of the time anyway.

Branches in git-fast-import seemed to cost us only 2 MB of memory
per active branch on the Mozilla repository.  Holding 60 of them at
once (120 MB) is peanuts on most machines today.  But really only 5
(10 MB) were needed for an efficient import.
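A rough Python sketch of such a branch LRU (illustrative only; not the
actual git-fast-import internals, and the branch names are made up):

```python
from collections import OrderedDict

class BranchLRU:
    """Keep at most max_active branch states in memory; reload on miss."""

    def __init__(self, max_active=5):
        self.max_active = max_active
        self.active = OrderedDict()  # branch name -> in-memory tree state

    def lookup(self, branch):
        if branch in self.active:
            self.active.move_to_end(branch)       # mark most recently used
        else:
            if len(self.active) >= self.max_active:
                self.active.popitem(last=False)   # evict least recently used
            self.active[branch] = self._reload(branch)
        return self.active[branch]

    def _reload(self, branch):
        # Placeholder: the real importer re-reads the branch's tree from
        # the data it has already written, which is the expensive part.
        return {"name": branch}

lru = BranchLRU(max_active=2)
for b in ["trunk", "MOZILLA_1_8", "trunk", "AVIARY_1_0"]:
    lru.lookup(b)
print(list(lru.active))  # ['trunk', 'AVIARY_1_0'] -- MOZILLA_1_8 evicted
```

As long as cvs2svn rarely touches more branches than the LRU holds,
evictions (and thus reloads) stay rare.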


I don't know how the Monotone guys feel about it but I think Git
is happy with the data in any order, just so long as the dependency
chains aren't fed out of order.  Which I think nearly all changeset
based SCMs would have an issue with.  So we should be just fine
with the current chronological order produced by cvs2svn.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: cvs import
  2006-09-13 21:38       ` Jon Smirl
@ 2006-09-14  5:36         ` Michael Haggerty
  2006-09-14 15:50           ` Shawn Pearce
  0 siblings, 1 reply; 56+ messages in thread
From: Michael Haggerty @ 2006-09-14  5:36 UTC (permalink / raw)
  To: Jon Smirl
  Cc: Markus Schiltknecht, Martin Langhoff, Git Mailing List,
	monotone-devel, dev

Jon Smirl wrote:
> On 9/13/06, Markus Schiltknecht <markus@bluegap.ch> wrote:
>> Martin Langhoff wrote:
>> > On 9/14/06, Jon Smirl <jonsmirl@gmail.com> wrote:
>> >> Let's copy the git list too and maybe we can come up with one importer
>> >> for everyone.

That would be great.

> AFAIK none of the CVS converters are using the dependency algorithm.
> So the proposal on the table is to develop a new converter that uses
> the dependency data from CVS to form the change sets and then outputs
> this data in a form that all of the backends can consume. Of course
> each of the backends is going to have to write some code in order to
> consume this new import format.

Frankly, I think people are getting the priorities wrong by focusing on
the format of the output of cvs2svn.  Hacking a new output format onto
cvs2svn is a trivial matter of a couple hours of programming.

The real strength of cvs2svn (and I can say this without bragging
because most of this was done before I got involved in the project) is
that it handles dozens of peculiar corner cases and bizarre CVS
perversions, including a good test suite containing lots of twisted
little example repositories.  This is 90% of the intellectual content of
cvs2svn.

I've spent many, many hours refactoring and reengineering cvs2svn to
make it easy to modify and add new features.  The main thing that I want
to change is to use the dependency graph (rather than timestamps tweaked
to reflect dependency ordering) to deduce changesets.  But I would never
think of throwing away the "old" cvs2svn and starting anew, because then
I would have to add all the little corner cases again from scratch.

It would be nice to have a universal dumpfile format, but IMO not
critical.  The only difference between our SCMs that might be difficult
to paper over in a universal dumpfile is that SVN wants its changesets
in chronological order, whereas I gather that others would prefer the
data in dependency order branch by branch.

I say let cvs2svn (or if you like, we can rename it to "cvs2noncvs" :-)
) reconstruct the repository's change sets, then let us build several
backends that output the data in the format that is most convenient for
each project.

Michael

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: cvs import
  2006-09-14  5:21               ` Martin Langhoff
@ 2006-09-14  5:35                 ` Michael Haggerty
  0 siblings, 0 replies; 56+ messages in thread
From: Michael Haggerty @ 2006-09-14  5:35 UTC (permalink / raw)
  To: Martin Langhoff
  Cc: Jon Smirl, Markus Schiltknecht, Git Mailing List, monotone-devel, dev

Martin Langhoff wrote:
> On 9/14/06, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>> 2. Long-term continuous mirroring (backwards and forwards) between CVS
>> and another SCM, to allow people to use their preferred tool.  (I
>> actually think that this is a silly idea, but some people seem to like
>> it.)
> 
> Call me silly ;-) I use this all the time to track projects that use
> CVS or SVN, where I either
>
> [...]

Sorry, I guess I was speaking as a person who prefers and is most
familiar with centralized SCM.  But I see from your response that the
ultimate in decentralized development is that each developer decides
what SCM to use :-) and that incremental conversion makes sense in that
context.

Michael

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: cvs import
  2006-09-14  5:02             ` Michael Haggerty
  2006-09-14  5:21               ` Martin Langhoff
@ 2006-09-14  5:30               ` Jon Smirl
  1 sibling, 0 replies; 56+ messages in thread
From: Jon Smirl @ 2006-09-14  5:30 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: Martin Langhoff, Markus Schiltknecht, Git Mailing List,
	monotone-devel, dev

On 9/14/06, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> Jon Smirl wrote:
> > On 9/14/06, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> >> But aside from this point, I think an intrinsic part of implementing
> >> incremental conversion is "convert the subsequent changes to the CVS
> >> repository *subject to the constraints* imposed by decisions made in
> >> earlier conversion runs.  And the real trick is that things can be done
> >> in CVS (e.g., line-end changes, manual copying of files in the repo)
> >> that (a) are unversioned and (b) have retroactive effects that go
> >> arbitrarily far back in time.  This is the reason that I am pessimistic
> >> that incremental conversion will ever work robustly.
> >
> > We don't need really robust incremental conversion. It just needs to
> > work most of the time. Incremental conversion is usually used to track
> > the main CVS repo with the new tool while people decide if they like
> > the new tool. Commits will still flow to the CVS repo and get
> > incrementally copied to the new tool so that it tracks CVS in close to
> > real time.
>
> I hadn't thought of the idea of using incremental conversion as an
> advertising method for switching SCM systems :-)  But if changes flow
> back to CVS, doesn't this have to be pretty robust?

Changes flow back to CVS by using the new tool to generate a patch,
applying the patch to your CVS checkout, and committing it.

There are too many people working on Mozilla to get agreement to
switch in a short amount of time. git may need to mirror CVS for
several months. There are also other people pushing svn, monotone,
perforce, etc, etc, etc. Bottom line, Mozilla really needs a
distributed system because external companies are making large changes
and want their repos in house.

In my experience none of the other SCMs are up to taking on Mozilla
yet. Git has the tools, but I can't get a clean import.


I am using this process on Mozilla right now with git. I have a script
that updates my CVS tree overnight and then commits the changes into a
local git repo. I can then work on Mozilla using git but my history is
all messed up. When a change is ready I generate a diff against last
night's check out and apply it to my CVS tree and commit. CVS then
finds any merge problems for me.
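A sketch of such a nightly script in Python (the paths, the rsync-based
copy step, and the exact command flags are guesses; the thread does not
show the real script):

```python
import subprocess

CVS_TREE = "/work/mozilla-cvs"   # hypothetical CVS checkout location
GIT_TREE = "/work/mozilla-git"   # hypothetical git mirror

def sync_commands(cvs_tree, git_tree, stamp):
    """Commands for one nightly pass: update CVS, mirror files, snapshot."""
    return [
        # 1. Bring the CVS working tree up to date.
        (["cvs", "-q", "update", "-dP"], cvs_tree),
        # 2. Copy the updated files into the git tree, leaving .git intact.
        (["rsync", "-a", "--delete", "--exclude=CVS/", "--exclude=.git/",
          cvs_tree + "/", git_tree + "/"], git_tree),
        # 3. Record the snapshot -- one flattened commit per night, which
        #    is why the history ends up "messed up".
        (["git", "add", "-A"], git_tree),
        (["git", "commit", "-m", f"CVS snapshot {stamp}"], git_tree),
    ]

def nightly_sync(stamp):
    for cmd, cwd in sync_commands(CVS_TREE, GIT_TREE, stamp):
        subprocess.run(cmd, cwd=cwd, check=True)
```

Going the other way is then just `git diff <last-snapshot>` applied to
the CVS checkout, letting CVS surface any merge problems.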

>
> In our trial period, we simply did a single conversion to SVN and let
> people play with this test repository.  When we decided to switch over
> we did another full conversion and simply discarded the changes that had
> been made in the test SVN repository.
>
> The use cases that I had considered were:
>
> 1. For conversions that take days, one could do a full conversion while
> leaving CVS online, then take CVS offline and do only an incremental
> conversion to reduce SCM downtime.  This is of course less of an issue
> if you could bring the conversion time down to a couple hours for even
> the largest CVS repos.
>
> 2. Long-term continuous mirroring (backwards and forwards) between CVS
> and another SCM, to allow people to use their preferred tool.  (I
> actually think that this is a silly idea, but some people seem to like it.)
>
> For both of these applications, incremental conversion would have to be
> robust (for 1 it would at least have to give a clear indication of
> unrecoverable errors).
>
>
> Michael
>


-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: cvs import
  2006-09-14  5:02             ` Michael Haggerty
@ 2006-09-14  5:21               ` Martin Langhoff
  2006-09-14  5:35                 ` Michael Haggerty
  2006-09-14  5:30               ` Jon Smirl
  1 sibling, 1 reply; 56+ messages in thread
From: Martin Langhoff @ 2006-09-14  5:21 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: Jon Smirl, Markus Schiltknecht, Git Mailing List, monotone-devel, dev

On 9/14/06, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> 2. Long-term continuous mirroring (backwards and forwards) between CVS
> and another SCM, to allow people to use their preferred tool.  (I
> actually think that this is a silly idea, but some people seem to like it.)

Call me silly ;-) I use this all the time to track projects that use
CVS or SVN, where I either

 - Do have write access, but often develop offline (and I have a bunch
of perl/shell scripts to extract the patches and auto-commit them into
CVS/SVN).

 - Do have write access, but want to keep experimental work branches
without making much noise in the cvs repo -- and be able to merge
CVS's HEAD in repeatedly as you'd want.

 - Run "vendor-branch-tracking" setups for projects where I have a
custom branch of a FOSS software project, and repeatedly import updates
from upstream. This is the 'killer-app' of DSCMs IMHO.

It is not as robust as I'd like; with CVS, the git imports eventually
stray a bit from upstream and require manual fixing. But it is
_good_.

cheers,


martin

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: cvs import
  2006-09-14  4:34           ` Jon Smirl
@ 2006-09-14  5:02             ` Michael Haggerty
  2006-09-14  5:21               ` Martin Langhoff
  2006-09-14  5:30               ` Jon Smirl
  0 siblings, 2 replies; 56+ messages in thread
From: Michael Haggerty @ 2006-09-14  5:02 UTC (permalink / raw)
  To: Jon Smirl
  Cc: Martin Langhoff, Markus Schiltknecht, Git Mailing List,
	monotone-devel, dev

Jon Smirl wrote:
> On 9/14/06, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>> But aside from this point, I think an intrinsic part of implementing
>> incremental conversion is "convert the subsequent changes to the CVS
>> repository *subject to the constraints* imposed by decisions made in
>> earlier conversion runs."  And the real trick is that things can be done
>> in CVS (e.g., line-end changes, manual copying of files in the repo)
>> that (a) are unversioned and (b) have retroactive effects that go
>> arbitrarily far back in time.  This is the reason that I am pessimistic
>> that incremental conversion will ever work robustly.
> 
> We don't need really robust incremental conversion. It just needs to
> work most of the time. Incremental conversion is usually used to track
> the main CVS repo with the new tool while people decide if they like
> the new tool. Commits will still flow to the CVS repo and get
> incrementally copied to the new tool so that it tracks CVS in close to
> real time.

I hadn't thought of the idea of using incremental conversion as an
advertising method for switching SCM systems :-)  But if changes flow
back to CVS, doesn't this have to be pretty robust?

In our trial period, we simply did a single conversion to SVN and let
people play with this test repository.  When we decided to switch over
we did another full conversion and simply discarded the changes that had
been made in the test SVN repository.

The use cases that I had considered were:

1. For conversions that take days, one could do a full conversion while
leaving CVS online, then take CVS offline and do only an incremental
conversion to reduce SCM downtime.  This is of course less of an issue
if you could bring the conversion time down to a couple hours for even
the largest CVS repos.

2. Long-term continuous mirroring (backwards and forwards) between CVS
and another SCM, to allow people to use their preferred tool.  (I
actually think that this is a silly idea, but some people seem to like it.)

For both of these applications, incremental conversion would have to be
robust (for 1 it would at least have to give a clear indication of
unrecoverable errors).


Michael

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: cvs import
  2006-09-14  4:17         ` Michael Haggerty
  2006-09-14  4:34           ` Jon Smirl
@ 2006-09-14  4:40           ` Martin Langhoff
  1 sibling, 0 replies; 56+ messages in thread
From: Martin Langhoff @ 2006-09-14  4:40 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: Markus Schiltknecht, Jon Smirl, Git Mailing List, monotone-devel, dev

On 9/14/06, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> > IIRC, it places branch tags as late as possible. I haven't looked at
> > it in detail, but an import immediately after the first commit against
> > the branch may yield a different branchpoint from the same import done
> > a bit later.
>
> This is correct.  And IMO it makes sense from the standpoint of an
> all-at-once conversion.
>
> But I was under the impression that this wouldn't matter for
> content-indexed-based SCMs.  The content of all possible branching
> points is identical, and therefore from your point of view the topology
> should be the same, no?

Exactly. But if you shift the branching point to later, two things change

 - it is possible that (in some corner cases) the content itself
changes, as the branching point could end up being moved a couple of
commits "later". One of the downsides of cvs not being atomic.

 - even if the content does not change, rearranging history in git
is a no-no. git relies on history being 100% read-only.

> But aside from this point, I think an intrinsic part of implementing
> incremental conversion is "convert the subsequent changes to the CVS
> repository *subject to the constraints* imposed by decisions made in
> earlier conversion runs."

Yes, and that's a fundamental change in the algorithm. That's exactly
why I mentioned it in this thread ;-) Any incremental importer has to
make up some parts of history, and then remember what it has made up.

So part of the process becomes
 - figure out history on top of the history we already parsed
 - check whether the cvs repo now has any 'new' history that affects
already-parsed history negatively, and report those as errors

hmmmmmm.
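That two-step check can be sketched in Python (toy data model; real
changesets carry far more state than a string):

```python
def incremental_step(already_converted, fresh_parse):
    """One incremental pass: extend history, flag retroactive changes.

    already_converted: changesets recorded by earlier runs, oldest first.
    fresh_parse: the full changeset list from re-parsing the CVS repo now.
    """
    n = len(already_converted)
    # 1. New history on top of the history we already parsed.
    new_changesets = fresh_parse[n:]
    # 2. Anything that rewrites already-parsed history is an error: CVS
    #    allows unversioned edits whose effects reach arbitrarily far back.
    errors = [
        (i, old, new)
        for i, (old, new) in enumerate(zip(already_converted, fresh_parse))
        if old != new
    ]
    return new_changesets, errors

old = ["cs1", "cs2", "cs3"]
fresh = ["cs1", "cs2-edited", "cs3", "cs4"]
new, errs = incremental_step(old, fresh)
print(new)   # ['cs4']
print(errs)  # [(1, 'cs2', 'cs2-edited')] -- retroactive change detected
```

Only step 1 can be handled automatically; step 2 can merely report that
the made-up history no longer matches reality.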

> This is the reason that I am pessimistic
> that incremental conversion will ever work robustly.

We all are :) But for a repo that doesn't go through direct tampering,
we can improve the algorithm to be more stable.



martin

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: cvs import
  2006-09-14  4:17         ` Michael Haggerty
@ 2006-09-14  4:34           ` Jon Smirl
  2006-09-14  5:02             ` Michael Haggerty
  2006-09-14  4:40           ` Martin Langhoff
  1 sibling, 1 reply; 56+ messages in thread
From: Jon Smirl @ 2006-09-14  4:34 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: Martin Langhoff, Markus Schiltknecht, Git Mailing List,
	monotone-devel, dev

On 9/14/06, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> But aside from this point, I think an intrinsic part of implementing
> incremental conversion is "convert the subsequent changes to the CVS
> repository *subject to the constraints* imposed by decisions made in
> earlier conversion runs."  And the real trick is that things can be done
> in CVS (e.g., line-end changes, manual copying of files in the repo)
> that (a) are unversioned and (b) have retroactive effects that go
> arbitrarily far back in time.  This is the reason that I am pessimistic
> that incremental conversion will ever work robustly.

We don't need really robust incremental conversion. It just needs to
work most of the time. Incremental conversion is usually used to track
the main CVS repo with the new tool while people decide if they like
the new tool. Commits will still flow to the CVS repo and get
incrementally copied to the new tool so that it tracks CVS in close to
real time.

If the incremental import messes up you can always redo a full import,
but a full Mozilla import takes about 2 hours with the git tools. I
would always do a full import on the day of the actual cut over.

-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: cvs import
  2006-09-13 21:16       ` Martin Langhoff
@ 2006-09-14  4:17         ` Michael Haggerty
  2006-09-14  4:34           ` Jon Smirl
  2006-09-14  4:40           ` Martin Langhoff
  0 siblings, 2 replies; 56+ messages in thread
From: Michael Haggerty @ 2006-09-14  4:17 UTC (permalink / raw)
  To: Martin Langhoff
  Cc: Markus Schiltknecht, Jon Smirl, Git Mailing List, monotone-devel, dev

Martin Langhoff wrote:
> On 9/14/06, Markus Schiltknecht <markus@bluegap.ch> wrote:
>> Martin Langhoff wrote:
>> > On 9/14/06, Jon Smirl <jonsmirl@gmail.com> wrote:
>> >> Let's copy the git list too and maybe we can come up with one importer
>> >> for everyone.
>> >
>> > It's a really good idea. cvsps has been for a while a (limited, buggy)
>> > attempt at that. One thing that bothers me in the cvs2svn algorithm is
>> > that is not stable in its decisions about where the branching point is
>> > -- run the import twice at different times and it may tell you that
>> > the branching point has moved.
>>
>> Huh? Really? Why is that? I don't see reasons for such a thing happening
>> when studying the algorithm.
>>
>> For sure the proposed dependency-resolving algorithm which does not rely
>> on timestamps does not have that problem.
> 
> IIRC, it places branch tags as late as possible. I haven't looked at
> it in detail, but an import immediately after the first commit against
> the branch may yield a different branchpoint from the same import done
> a bit later.

This is correct.  And IMO it makes sense from the standpoint of an
all-at-once conversion.

But I was under the impression that this wouldn't matter for
content-indexed-based SCMs.  The content of all possible branching
points is identical, and therefore from your point of view the topology
should be the same, no?

But aside from this point, I think an intrinsic part of implementing
incremental conversion is "convert the subsequent changes to the CVS
repository *subject to the constraints* imposed by decisions made in
earlier conversion runs."  And the real trick is that things can be done
in CVS (e.g., line-end changes, manual copying of files in the repo)
that (a) are unversioned and (b) have retroactive effects that go
arbitrarily far back in time.  This is the reason that I am pessimistic
that incremental conversion will ever work robustly.

Michael

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: cvs import
  2006-09-14  2:30           ` [Monotone-devel] " Shawn Pearce
@ 2006-09-14  3:19             ` Daniel Carosone
  0 siblings, 0 replies; 56+ messages in thread
From: Daniel Carosone @ 2006-09-14  3:19 UTC (permalink / raw)
  To: Shawn Pearce
  Cc: Daniel Carosone, Keith Packard, monotone-devel, Jon Smirl, dev,
	Git Mailing List


[-- Attachment #1.1: Type: text/plain, Size: 1905 bytes --]

On Wed, Sep 13, 2006 at 10:30:17PM -0400, Shawn Pearce wrote:
> I don't know exactly how big it is but the Gentoo CVS repository
> is also considered to be very large (about the size of the Mozilla
> repository) and just as difficult to import.  It's either crashed or
> taken about a month to process with the current Git CVS->Git tools.

Ah, thanks for the tip.

> Since I know that the bulk of the Gentoo CVS repository is the
> portage tree I did a quick find|wc -l in my /usr/portage; its about
> 124,500 files.
> 
> It's interesting that Gentoo has almost as large a repository given
> that it's such a young project, compared to NetBSD and Mozilla.  :-)

Portage uses files and thus CVS very differently, though.  Each ebuild
for each package revision of each version of a third-party package
(like, say, monotone 0.28 and 0.29, and -r1, -r2 pkg bumps of those if
they were needed) is its own file that's added, maybe edited a couple
of times, and then deleted again later as new versions are added and
older ones retired.  These are copies and renames in the workspace,
but are invisible to CVS.  This uses up lots more files than a single
long-lived build that gets edited each time; the Attic dirs must have
huge numbers of files, way beyond the number that are live now.

This lets portage keep builds around in a HEAD checkout for multiple
versions at once, tagged internally with different statuses.
Effectively, these tags take the place of VCS-based branches and
releases, and are more flexible for end users tracking their favourite
applications while keeping the rest of their system stable.

If they had a VCS that supported file cloning and/or renaming, and
used that to follow history between these ebuild files, things would
be very different. There are some interesting use cases for VCS tools
in supporting this behaviour nicely, too.  

--
Dan.

[-- Attachment #1.2: Type: application/pgp-signature, Size: 186 bytes --]

[-- Attachment #2: Type: text/plain, Size: 158 bytes --]

_______________________________________________
Monotone-devel mailing list
Monotone-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/monotone-devel

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: cvs import
  2006-09-14  0:57       ` [Monotone-devel] " Jon Smirl
@ 2006-09-14  1:53         ` Daniel Carosone
  2006-09-14  2:30           ` [Monotone-devel] " Shawn Pearce
  0 siblings, 1 reply; 56+ messages in thread
From: Daniel Carosone @ 2006-09-14  1:53 UTC (permalink / raw)
  To: Jon Smirl; +Cc: dev, Keith Packard, monotone-devel, Git Mailing List


[-- Attachment #1.1: Type: text/plain, Size: 1676 bytes --]

On Wed, Sep 13, 2006 at 08:57:33PM -0400, Jon Smirl wrote:
> Mozilla is 120,000 files. The complexity comes from 10 years worth of
> history. A few of the files have around 1,700 revisions. There are
> about 1,600 branches and 1,000 tags. The branch number is inflated
> because cvs2svn is generating extra branches, the real number is
> around 700. The CVS repo takes 4.2GB disk space. cvs2svn turns this
> into 250,000 commits over about 1M unique revisions.

Those numbers are pretty close to those in the NetBSD repository, and
between them these probably represent just about the most extensive
public CVS test data available. 

I've only done imports of individual top-level dirs (what used to be
modules), like src and pkgsrc, because they're used independently and
don't really overlap.

src had about 180k commits over 1M versions of 120k files, 1000 tags
and 260 branches. pkgsrc had 110k commits over about half as many
files and versions thereof.  We too have a few hot files, one had
13,625 revisions.  xsrc adds a bunch more files and content, but not
many versions; that's mostly vendor branches and only some local
changes.  Between them the cvs ,v files take up 4.7G covering about 13
years of history.

One thing that was interesting was that "src" used to be several
different modules, but we rearranged the repository at one point to
match the checkout structure these modules produced (combining them
all under the src dir).  This doesn't seem to have upset the import at
all.  Just about every other form of CVS evil has been perpetrated in
this repository at some stage or other too, but always very carefully.

--
Dan.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: cvs import
  2006-09-13 23:42   ` [Monotone-devel] " Keith Packard
@ 2006-09-14  0:32     ` Nathaniel Smith
  2006-09-14  0:57       ` [Monotone-devel] " Jon Smirl
  0 siblings, 1 reply; 56+ messages in thread
From: Nathaniel Smith @ 2006-09-14  0:32 UTC (permalink / raw)
  To: Keith Packard; +Cc: dev, monotone-devel, Git Mailing List

On Wed, Sep 13, 2006 at 04:42:01PM -0700, Keith Packard wrote:
> However, this means that parsecvs must hold the entire tree state in
> memory, which turned out to be its downfall with large repositories.
> Worked great for all of X.org, not so good with Mozilla.

Does anyone know how big Mozilla (or other humonguous repos, like KDE)
are, in terms of number of files?

A few numbers for repositories I had lying around:
  Linux kernel -- ~21,000
  gcc -- ~42,000
  NetBSD "src" repo -- ~100,000
  uClinux distro -- ~110,000

These don't seem very intimidating... even if it takes an entire
kilobyte per CVS revision to store the information about it that we
need to make decisions about how to move the frontier... that's only
110 megabytes for the largest of these repos.  The frontier sweeping
algorithm only _needs_ to have available the current frontier, and the
current frontier+1.  Storing information on every version of every
file in memory might be worse; but since the algorithm accesses this
data in a linear way, it'd be easy enough to stick those in a
lookaside table on disk if really necessary, like a bdb or sqlite file
or something.
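A minimal sketch of such a lookaside table with the stdlib sqlite3
module (illustrative only; the column set is invented and Monotone's
cvs_import does not actually do this):

```python
import sqlite3

# On-disk lookaside table for per-revision metadata, accessed linearly.
db = sqlite3.connect(":memory:")  # use a file path for a real conversion
db.execute("""CREATE TABLE revs (
    path TEXT, rev TEXT, author TEXT, stamp INTEGER, log TEXT,
    PRIMARY KEY (path, rev))""")

# Fill during the RCS-parsing pass ...
db.executemany(
    "INSERT INTO revs VALUES (?, ?, ?, ?, ?)",
    [("src/main.c", "1.1", "dan", 1158192000, "initial import"),
     ("src/main.c", "1.2", "dan", 1158278400, "fix build")])

# ... then stream back in timestamp order while sweeping the frontier,
# so only the rows near the current frontier need to be in memory.
for path, rev in db.execute("SELECT path, rev FROM revs ORDER BY stamp"):
    print(path, rev)
```

Because the sweep reads the data once in order, the access pattern is
exactly what a disk-backed table handles well.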

(Again, in practice storing all the metadata for the entire 180k
revisions of the 100k files in the netbsd repo was possible on a
desktop.  Monotone's cvs_import does try somewhat to be frugal about
memory, though, interning strings and suchlike.)

-- Nathaniel

-- 
When the flush of a new-born sun fell first on Eden's green and gold,
Our father Adam sat under the Tree and scratched with a stick in the mould;
And the first rude sketch that the world had seen was joy to his mighty heart,
Till the Devil whispered behind the leaves, "It's pretty, but is it Art?"
  -- The Conundrum of the Workshops, Rudyard Kipling

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: cvs import
  2006-09-13 22:52 ` Nathaniel Smith
@ 2006-09-13 23:21   ` Daniel Carosone
  2006-09-13 23:42   ` [Monotone-devel] " Keith Packard
  1 sibling, 0 replies; 56+ messages in thread
From: Daniel Carosone @ 2006-09-13 23:21 UTC (permalink / raw)
  To: Markus Schiltknecht, monotone-devel, dev, Git Mailing List


[-- Attachment #1.1: Type: text/plain, Size: 3325 bytes --]

On Wed, Sep 13, 2006 at 03:52:00PM -0700, Nathaniel Smith wrote:
> This isn't a trivial problem.  I think the main thing you want to avoid
> is:
>     1  2  3  4
>     |  |  |  |
>   --o--o--o--o----- <-- current frontier
>     |  |  |  |
>     A  B  A  C
>        |
>        A
> There are a lot of approaches one could take here, on up to pulling
> out a full-on optimal constraint satisfaction system (if we can route
> chips, we should be able to pick a good ordering for accepting CVS
> edits, after all).  A really simple heuristic, though, would be to
> just pick the file whose next commit has the earliest timestamp, then
> group in all the other "next commits" with the same commit message,
> and (maybe) a similar timestamp.  

Pick the earliest first, or more generally: take all the file commits
immediately below the frontier.  Find revs further below the frontier
(up to some small depth or time limit) on other files that might match
them, based on changelog etc (the same grouping you describe, and we
do now).  Eliminate any of those that are not entirely on the frontier
(ie, have some other revision in the way, as with file 2).  Commit the
remaining set in time order. [*]

If you wind up with an empty set, then you need to split revs, but at
this point you have only conflicting revs on the frontier (i.e., you've
already committed all the other revs you can that might have avoided
this need, whereas we currently might be doing this too often).

For time order, you could look at each rev as having a time window,
from the first to last commit matching.  If the rev windows are
non-overlapping, commit them in order.  If the rev windows overlap, at
this point we already know the file changes don't overlap - we *could*
commit these as parallel heads and merge them, to better model the
original developer's overlapping commits.
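
The selection rule above can be sketched roughly as follows. This is
illustrative only; `eligible_groups`, the group dicts, and the
`window` field are invented names, not from any real importer:

```python
# Hypothetical sketch of the selection step described above.  A "group"
# is a set of per-file revisions matched up by changelog message; it is
# eligible only when every member sits immediately below the frontier
# (no other revision of that file in the way).
def eligible_groups(groups, frontier):
    """groups: [{'name', 'files': {filename: rev_index}, 'window': (t0, t1)}]
    frontier: {filename: index of the file's current revision}
    Returns the committable groups, ordered by the start of their time
    window ("commit the remaining set in time order")."""
    ok = [
        g for g in groups
        if all(i == frontier[f] + 1 for f, i in g["files"].items())
    ]
    return sorted(ok, key=lambda g: g["window"][0])

# The four-file situation from the quoted diagram: A touches files 1 and
# 3 on the frontier but also file 2 behind B, so only B and C survive.
frontier = {"1": 0, "2": 0, "3": 0, "4": 0}
groups = [
    {"name": "A", "files": {"1": 1, "2": 2, "3": 1}, "window": (5, 9)},
    {"name": "B", "files": {"2": 1}, "window": (3, 3)},
    {"name": "C", "files": {"4": 1}, "window": (4, 4)},
]
print([g["name"] for g in eligible_groups(groups, frontier)])  # -> ['B', 'C']
```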

> Handling file additions could potentially be slightly tricky in this
> model.  I guess it is not so bad, if you model added files as being
> present all along (so you never have to add whole new entries to
> the frontier), with each file starting out in a pre-birth state, and
> then addition of the file is the first edit performed on top of that,
> and you treat these edits like any other edits when considering how to
> advance the frontier.

CVS allows resurrections too..

> I have no particular idea on how to handle tags and branches here;
> I've never actually wrapped my head around CVS's model for those :-).
> I'm not seeing any obvious problem with handling them, though.

Tags could be modelled as another 'event' in the file graph, like a
commit. If your frontier advances through both revisions and a 'tag
this revision' event, the same sequencing as above would work. If tags
had been moved, this would wind up with a sequence in which commits
intervene between tag events, and we'd need to split the commits such
that we could end up with a revision matching the tagged content.
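
One way to picture the tag-as-event idea is to interleave tag events
into each file's revision stream; `file_events` and its arguments are
invented names for illustration:

```python
# Illustrative only: merge a file's edits and its tags into one event
# stream, so the frontier can sweep through "tag this revision" events
# exactly like it sweeps through commits.
def file_events(revisions, tags):
    """revisions: [(revnum, timestamp)] in RCS order;
    tags: {tagname: revnum}.  Yields ('edit', rev) and ('tag', name)
    events, each tag placed right after the revision it names."""
    by_rev = {}
    for tag, rev in tags.items():
        by_rev.setdefault(rev, []).append(tag)
    for rev, _t in revisions:
        yield ("edit", rev)
        for tag in sorted(by_rev.get(rev, [])):
            yield ("tag", tag)

events = list(file_events([("1.1", 0), ("1.2", 5)], {"REL_1": "1.1"}))
# -> [('edit', '1.1'), ('tag', 'REL_1'), ('edit', '1.2')]
```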

> In this approach, incremental conversion is cheap, easy, and robust --
> simply remember what frontier corresponded to the final revision
> imported, and restart the process directly at that frontier.

Hm. Except for the tagging idea above, because tags can be applied
behind a live cvs frontier.

--
Dan.



* Re: cvs import
       [not found] <45084400.1090906@bluegap.ch>
  2006-09-13 19:01 ` cvs import Jon Smirl
@ 2006-09-13 22:52 ` Nathaniel Smith
  2006-09-13 23:21   ` Daniel Carosone
  2006-09-13 23:42   ` [Monotone-devel] " Keith Packard
  1 sibling, 2 replies; 56+ messages in thread
From: Nathaniel Smith @ 2006-09-13 22:52 UTC (permalink / raw)
  To: Markus Schiltknecht; +Cc: dev, monotone-devel, Git Mailing List

On Wed, Sep 13, 2006 at 07:46:40PM +0200, Markus Schiltknecht wrote:
> Hi,
> 
> I've been trying to understand the cvsimport algorithm used by monotone 
> and wanted to adjust that to be more like the one in cvs2svn.
> 
> I've had some problems with cvs2svn itself and began to question the 
> algorithm used there. It turned out that the cvs2svn people have 
> discussed an improved algorithm and are about to write a cvs2svn 2.0. 
> The main problem with the current algorithm is that it depends on the 
> timestamp information stored in the CVS repository.
> 
> Instead, it would be much better to just take the dependencies of the 
> revisions into account, considering the timestamp an irrelevant (for the 
> import) attribute of the revision.

I just read over the thread on the cvs2svn list about this -- I have a
few random thoughts.  Take them with a grain of salt, since I haven't
actually tried writing a CVS importer myself...

Regarding the basic dependency-based algorithm, the approach of
throwing everything into blobs and then trying to tease them apart
again seems backwards.  What I'm thinking is, first we go through and
build the history graph for each file.  Now, advance a frontier across
all of these graphs simultaneously.  Your frontier is basically a
map <filename -> CVS revision>, that represents a tree snapshot.  The
basic loop is:
  1) pick some subset of files to advance to their next revision
  2) slide the frontier one CVS revision forward on each of those
     files
  3) snapshot the new frontier (write it to the target VCS as a new
     tree commit)
  4) go to step 1
Obviously, this will produce a target VCS history that respects the
CVS dependency graph, so that's good; it puts a strict limit on how
badly whatever heuristics we use can screw us over if they guess wrong
about things.  Also, it makes the problem much simpler -- all the
heuristics are now in step 1, where we are given a bunch of possible
edits, and we have to pick some subset of them to accept next.
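
The loop above can be sketched in a few lines. Everything here is
hypothetical scaffolding (`Rev`, `pick_subset`, `sweep` are made-up
names, and the placeholder step 1 just advances the file whose next
revision has the earliest timestamp):

```python
from collections import namedtuple

Rev = namedtuple("Rev", "time content")

def pick_subset(file_graphs, frontier):
    # Step 1 placeholder: advance only the file whose pending revision
    # carries the earliest timestamp.  All the real heuristics go here.
    candidates = [
        (file_graphs[name][i + 1].time, name)
        for name, i in frontier.items()
        if i + 1 < len(file_graphs[name])
    ]
    return [min(candidates)[1]] if candidates else []

def sweep(file_graphs):
    # The frontier is a map <filename -> index of current CVS revision>;
    # each snapshot of it becomes a tree commit in the target VCS.
    frontier = {name: 0 for name in file_graphs}
    snapshots = []
    while True:
        subset = pick_subset(file_graphs, frontier)       # step 1
        if not subset:
            return snapshots
        for name in subset:                               # step 2
            frontier[name] += 1
        snapshots.append(                                 # step 3
            {name: file_graphs[name][i] for name, i in frontier.items()}
        )                                                 # step 4: loop

history = {
    "a.c": [Rev(0, "a0"), Rev(1, "a1"), Rev(3, "a2")],
    "b.c": [Rev(0, "b0"), Rev(2, "b1")],
}
print(len(sweep(history)))  # three pending revisions -> prints 3
```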

This isn't a trivial problem.  I think the main thing you want to avoid
is:
    1  2  3  4
    |  |  |  |
  --o--o--o--o----- <-- current frontier
    |  |  |  |
    A  B  A  C
       |
       A
Say you have four files named "1", "2", "3", and "4".  We want to
slide the frontier down, and the next edits were originally created by
one of three commits, A, B, or C.  In this situation, we can take
commit B, or we can take commit C, but we don't want to take commit A
until _after_ we have taken commit B -- because otherwise we will end
up splitting A into two separate commits, giving the sequence A1, B, A2.

There are a lot of approaches one could take here, on up to pulling
out a full-on optimal constraint satisfaction system (if we can route
chips, we should be able to pick a good ordering for accepting CVS
edits, after all).  A really simple heuristic, though, would be to
just pick the file whose next commit has the earliest timestamp, then
group in all the other "next commits" with the same commit message,
and (maybe) a similar timestamp.  I have a suspicion that this
heuristic will work really, really well in practice.  Also, it's
cheap to apply, and worst case you accidentally split up a commit that
already had wacky timestamps, and we already know that we _have_ to do
that in some cases.
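
A minimal sketch of that heuristic, under stated assumptions: `FUZZ`
and `group_next_commits` are invented for this example, and the
timestamp window is an arbitrary guess:

```python
# Seed with the file whose pending revision has the earliest timestamp,
# then pull in every other file whose pending revision shares that
# commit message and lands within a small time window.
FUZZ = 300  # seconds of timestamp slack allowed within one CVS commit

def group_next_commits(pending):
    """pending: {filename: (timestamp, log_message)} for each file with
    a revision waiting just below the frontier; returns files to advance."""
    if not pending:
        return []
    seed = min(pending, key=lambda f: pending[f][0])
    t0, msg = pending[seed]
    return sorted(
        f for f, (t, m) in pending.items()
        if m == msg and abs(t - t0) <= FUZZ
    )

print(group_next_commits({
    "a.c": (100, "fix overflow"),
    "b.c": (102, "fix overflow"),
    "c.c": (101, "unrelated tweak"),
}))  # -> ['a.c', 'b.c']
```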

Handling file additions could potentially be slightly tricky in this
model.  I guess it is not so bad, if you model added files as being
present all along (so you never have to add whole new entries to
the frontier), with each file starting out in a pre-birth state, and
then addition of the file is the first edit performed on top of that,
and you treat these edits like any other edits when considering how to
advance the frontier.
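
The pre-birth idea could look something like this; `PRE_BIRTH`,
`with_pre_birth`, and `materialize` are invented names for the sketch:

```python
# Give every file, even one added years in, the same shape of history
# by prepending a synthetic marker revision.
PRE_BIRTH = object()

def with_pre_birth(raw_graphs):
    """raw_graphs: {filename: [rev, ...]}.  The frontier can now cover
    all files from the start; "addition" is just each file's first edit."""
    return {name: [PRE_BIRTH] + revs for name, revs in raw_graphs.items()}

def materialize(frontier, graphs):
    """Turn a frontier {filename: index} into a tree snapshot; files
    still in the pre-birth state simply don't appear in it."""
    return {
        name: graphs[name][i]
        for name, i in frontier.items()
        if graphs[name][i] is not PRE_BIRTH
    }

graphs = with_pre_birth({"old.c": ["o1"], "added_later.c": ["a1"]})
print(materialize({"old.c": 1, "added_later.c": 0}, graphs))
# -> {'old.c': 'o1'}   (added_later.c hasn't been born yet)
```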

I have no particular idea on how to handle tags and branches here;
I've never actually wrapped my head around CVS's model for those :-).
I'm not seeing any obvious problem with handling them, though.

In this approach, incremental conversion is cheap, easy, and robust --
simply remember what frontier corresponded to the final revision
imported, and restart the process directly at that frontier.


Regarding storing things on disk vs. in memory: we always used to
stress-test monotone's cvs importer with the gcc history; just a few
weeks ago someone did a test import of NetBSD's src repo (~180k
commits) on a desktop with 2 gigs of RAM.  It takes a pretty big
history to really require disk (and for that matter, people with
histories that big likely have a big enough organization that they can
get access to some big iron to run the conversion on -- and probably
will want to anyway, to make it run in reasonable time).

> Now, that can be used to convert from CVS to about anything else. 
> Obviously we were discussing about subversion, but then there was git, 
> too. And monotone.
> 
> I'm beginning to question if one could come up with a generally useful 
> cleaned-and-sane-CVS-changeset-dump-format, which could then be used by 
> importers to all sorts of VCSes. This would make monotone's cvsimport 
> function dependent on cvs2svn (and therefore python). But the general 
> try-to-get-something-useful-from-an-insane-CVS-repository-algorithm 
> would only have to be written once.
> 
> On the other hand, I see that lots of the cvsimport functionality for 
> monotone has already been written (rcs file parsing, stuffing files, 
> file deltas and complete revisions into the monotone database, etc..). 
> Changing it to a better algorithm does not seem to be _that_ much work 
> anymore. Plus the hard part seems to be to come up with a good 
> algorithm, not implementing it. And we could still exchange our 
> experience with the general algorithm with the cvs2svn people.
>
> Plus, the guy who mentioned git pointed out that git needs quite a 
> different dump-format than subversion to do an efficient conversion. I 
> think coming up with a generally-usable dump format would not be that easy.

Probably the biggest technical advantage of having the converter built
into monotone is that it makes it easy to import the file contents.
Since this data is huge (100x the repo size, maybe?), and the naive
algorithm for reconstructing takes time that is quadratic in the depth
of history, this is very valuable.  I'm not sure what sort of dump
format one could come up with that would avoid making this step very
expensive.

I also suspect that SVN's dump format is suboptimal at the metadata
level -- we would essentially have to run a lot of branch/tag
inferencing logic _again_ to go from SVN-style "one giant tree with
branches described as copies, and multiple copies allowed for
branches/tags that are built up over time", to monotone-style
"DAG of tree snapshots".  This would be substantially less annoying
inferencing logic than that needed to decipher CVS in the first place,
granted, and it's stuff we want to write at some point anyway to allow
SVN importing, but it adds another step where information could be
lost.  I may be biased because I grok monotone better, but I suspect
it would be much easier to losslessly convert a monotone-style history
to an svn-style history than vice versa; possibly a generic dumping
tool would want to generate output that looks more like monotone's
model?  The biggest stumbling block I see is if it is important to
build up branches and tags by multiple copies out of trunk -- there
isn't any way to represent that in monotone.  A generic tool could
also use some sort of hybrid model (e.g., dag-of-snapshots plus
some extra annotations), if that worked better.

It's also very nice that users don't need any external software to
import CVS->monotone, just because it cuts down on hassle, but I would
rather have a more hasslesome tool that worked than a less hasslesome
tool that didn't, and I'm not the one volunteering to write the code,
so :-).

Even if we _do_ end up writing two implementations of the algorithm,
we should share a test suite.  Testing cvs importers is way harder
than writing them, because there's no ground truth to compare your
program's output to... in fact, having two separate implementations
and testing them against each other would be useful to increase
confidence in each of them.

(I'm only on one of the CC'ed lists, so reply-to-all appreciated)

-- Nathaniel

-- 
"On arrival in my ward I was immediately served with lunch. `This is
what you ordered yesterday.' I pointed out that I had just arrived,
only to be told: `This is what your bed ordered.'"
  -- Letter to the Editor, The Times, September 2000


* Re: cvs import
  2006-09-13 21:05     ` Markus Schiltknecht
@ 2006-09-13 21:38       ` Jon Smirl
  2006-09-14  5:36         ` Michael Haggerty
  0 siblings, 1 reply; 56+ messages in thread
From: Jon Smirl @ 2006-09-13 21:38 UTC (permalink / raw)
  To: Markus Schiltknecht
  Cc: Martin Langhoff, Git Mailing List, monotone-devel, dev

On 9/13/06, Markus Schiltknecht <markus@bluegap.ch> wrote:
> Martin Langhoff wrote:
> > On 9/14/06, Jon Smirl <jonsmirl@gmail.com> wrote:
> >> Let's copy the git list too and maybe we can come up with one importer
> >> for everyone.
> >
> > It's a really good idea. cvsps has been for a while a (limited, buggy)
> > attempt at that.
>
> BTW: good point, I always thought about cvsps. Does anybody know what
> 'dump' format that uses?

cvsps has potential but the multiple missing branch labels in the
Mozilla CVS confuse it and it throws away important data.  Its
algorithm would need reworking too. cvs2svn is the only CVS converter
that imported Mozilla CVS on the first try and mostly got things
right.

Patchset format for cvsps
http://www.cobite.com/cvsps/README

AFAIK none of the CVS converters are using the dependency algorithm.
So the proposal on the table is to develop a new converter that uses
the dependency data from CVS to form the change sets and then outputs
this data in a form that all of the backends can consume. Of course
each of the backends is going to have to write some code in order to
consume this new import format.

>
> For sure its algorithm isn't that strong. cvs2svn is better, IMHO. The
> proposed dependency resolving algorithm will be even better /me thinks.
>
> Regards
>
> Markus
>


-- 
Jon Smirl
jonsmirl@gmail.com


* Re: cvs import
  2006-09-13 21:04     ` Markus Schiltknecht
  2006-09-13 21:15       ` Oswald Buddenhagen
@ 2006-09-13 21:16       ` Martin Langhoff
  2006-09-14  4:17         ` Michael Haggerty
  1 sibling, 1 reply; 56+ messages in thread
From: Martin Langhoff @ 2006-09-13 21:16 UTC (permalink / raw)
  To: Markus Schiltknecht; +Cc: Jon Smirl, Git Mailing List, monotone-devel, dev

On 9/14/06, Markus Schiltknecht <markus@bluegap.ch> wrote:
> Martin Langhoff wrote:
> > On 9/14/06, Jon Smirl <jonsmirl@gmail.com> wrote:
> >> Let's copy the git list too and maybe we can come up with one importer
> >> for everyone.
> >
> > It's a really good idea. cvsps has been for a while a (limited, buggy)
> > attempt at that. One thing that bothers me in the cvs2svn algorithm is
> > that it is not stable in its decisions about where the branching point is
> > -- run the import twice at different times and it may tell you that
> > the branching point has moved.
>
> Huh? Really? Why is that? I don't see reasons for such a thing happening
> when studying the algorithm.
>
> For sure the proposed dependency-resolving algorithm which does not rely
> on timestamps does not have that problem.

IIRC, it places branch tags as late as possible. I haven't looked at
it in detail, but an import immediately after the first commit against
the branch may yield a different branchpoint from the same import done
a bit later.

cheers,


martin


* Re: cvs import
  2006-09-13 21:04     ` Markus Schiltknecht
@ 2006-09-13 21:15       ` Oswald Buddenhagen
  2006-09-13 21:16       ` Martin Langhoff
  1 sibling, 0 replies; 56+ messages in thread
From: Oswald Buddenhagen @ 2006-09-13 21:15 UTC (permalink / raw)
  To: Markus Schiltknecht
  Cc: Martin Langhoff, Jon Smirl, Git Mailing List, monotone-devel, dev

On Wed, Sep 13, 2006 at 11:04:13PM +0200, Markus Schiltknecht wrote:
> Martin Langhoff wrote:
> >One thing that bothers me in the cvs2svn algorithm is
> >that it is not stable in its decisions about where the branching point is
> >-- run the import twice at different times and it may tell you that
> >the branching point has moved.
> 
> Huh? Really? Why is that? I don't see reasons for such a thing happening 
> when studying the algorithm.
> 
that's certainly due to some hash being iterated. python intentionally
randomizes this to make wrong assumptions obvious.
there is actually a patch pending to improve the branch source selection
drastically. maybe this is affected as well.

> For sure the proposed dependency-resolving algorithm which does not rely 
> on timestamps does not have that problem.
> 
i think that's unrelated.

-- 
Hi! I'm a .signature virus! Copy me into your ~/.signature, please!
--
Chaos, panic, and disorder - my work here is done.


* Re: cvs import
  2006-09-13 20:41   ` Martin Langhoff
  2006-09-13 21:04     ` Markus Schiltknecht
@ 2006-09-13 21:05     ` Markus Schiltknecht
  2006-09-13 21:38       ` Jon Smirl
  1 sibling, 1 reply; 56+ messages in thread
From: Markus Schiltknecht @ 2006-09-13 21:05 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Jon Smirl, Git Mailing List, monotone-devel, dev

Martin Langhoff wrote:
> On 9/14/06, Jon Smirl <jonsmirl@gmail.com> wrote:
>> Let's copy the git list too and maybe we can come up with one importer
>> for everyone.
> 
> It's a really good idea. cvsps has been for a while a (limited, buggy)
> attempt at that.

BTW: good point, I always thought about cvsps. Does anybody know what 
'dump' format that uses?

For sure its algorithm isn't that strong. cvs2svn is better, IMHO. The 
proposed dependency resolving algorithm will be even better /me thinks.

Regards

Markus


* Re: cvs import
  2006-09-13 20:41   ` Martin Langhoff
@ 2006-09-13 21:04     ` Markus Schiltknecht
  2006-09-13 21:15       ` Oswald Buddenhagen
  2006-09-13 21:16       ` Martin Langhoff
  2006-09-13 21:05     ` Markus Schiltknecht
  1 sibling, 2 replies; 56+ messages in thread
From: Markus Schiltknecht @ 2006-09-13 21:04 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: dev, monotone-devel, Jon Smirl, Git Mailing List

Martin Langhoff wrote:
> On 9/14/06, Jon Smirl <jonsmirl@gmail.com> wrote:
>> Let's copy the git list too and maybe we can come up with one importer
>> for everyone.
> 
> It's a really good idea. cvsps has been for a while a (limited, buggy)
> attempt at that. One thing that bothers me in the cvs2svn algorithm is
> that it is not stable in its decisions about where the branching point is
> -- run the import twice at different times and it may tell you that
> the branching point has moved.

Huh? Really? Why is that? I don't see reasons for such a thing happening 
when studying the algorithm.

For sure the proposed dependency-resolving algorithm which does not rely 
on timestamps does not have that problem.

Regards

Markus


* Re: cvs import
  2006-09-13 19:01 ` cvs import Jon Smirl
@ 2006-09-13 20:41   ` Martin Langhoff
  2006-09-13 21:04     ` Markus Schiltknecht
  2006-09-13 21:05     ` Markus Schiltknecht
  0 siblings, 2 replies; 56+ messages in thread
From: Martin Langhoff @ 2006-09-13 20:41 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Markus Schiltknecht, Git Mailing List, monotone-devel, dev

On 9/14/06, Jon Smirl <jonsmirl@gmail.com> wrote:
> Let's copy the git list too and maybe we can come up with one importer
> for everyone.

It's a really good idea. cvsps has been for a while a (limited, buggy)
attempt at that. One thing that bothers me in the cvs2svn algorithm is
that it is not stable in its decisions about where the branching point is
-- run the import twice at different times and it may tell you that
the branching point has moved.

This is problematic for incremental imports. If we fudged that "it's
around *here*" we'd better remember we said that and not go changing our
story. Git is too smart for that ;-)


martin


* Re: cvs import
       [not found] <45084400.1090906@bluegap.ch>
@ 2006-09-13 19:01 ` Jon Smirl
  2006-09-13 20:41   ` Martin Langhoff
  2006-09-13 22:52 ` Nathaniel Smith
  1 sibling, 1 reply; 56+ messages in thread
From: Jon Smirl @ 2006-09-13 19:01 UTC (permalink / raw)
  To: Markus Schiltknecht, Git Mailing List; +Cc: monotone-devel, dev

Let's copy the git list too and maybe we can come up with one importer
for everyone.

On 9/13/06, Markus Schiltknecht <markus@bluegap.ch> wrote:
> Hi,
>
> I've been trying to understand the cvsimport algorithm used by monotone
> and wanted to adjust that to be more like the one in cvs2svn.
>
> I've had some problems with cvs2svn itself and began to question the
> algorithm used there. It turned out that the cvs2svn people have
> discussed an improved algorithms and are about to write a cvs2svn 2.0.
> The main problem with the current algorithm is that it depends on the
> timestamp information stored in the CVS repository.
>
> Instead, it would be much better to just take the dependencies of the
> revisions into account, considering the timestamp an irrelevant (for the
> import) attribute of the revision.
>
> Now, that can be used to convert from CVS to about anything else.
> Obviously we were discussing about subversion, but then there was git,
> too. And monotone.
>
> I'm beginning to question if one could come up with a generally useful
> cleaned-and-sane-CVS-changeset-dump-format, which could then be used by
> importers to all sorts of VCSes. This would make monotone's cvsimport
> function dependent on cvs2svn (and therefore python). But the general
> try-to-get-something-useful-from-an-insane-CVS-repository-algorithm
> would only have to be written once.
>
> On the other hand, I see that lots of the cvsimport functionality for
> monotone has already been written (rcs file parsing, stuffing files,
> file deltas and complete revisions into the monotone database, etc..).
> Changing it to a better algorithm does not seem to be _that_ much work
> anymore. Plus the hard part seems to be to come up with a good
> algorithm, not implementing it. And we could still exchange our
> experience with the general algorithm with the cvs2svn people.
>
> Plus, the guy who mentioned git pointed out that git needs quite a
> different dump-format than subversion to do an efficient conversion. I
> think coming up with a generally-usable dump format would not be that easy.
>
> So you see, I'm slightly favoring the second implementation approach
> with a C++ implementation inside monotone.
>
> Thoughts or comments?
> Sorry, I forgot to mention some pointers:
>
> Here is the thread where I've started the discussion about the cvs2svn
> algorithm:
> http://cvs2svn.tigris.org/servlets/ReadMsg?list=dev&msgNo=1599
>
> And this is a proposal for an algorithm to do cvs imports independent of
> the timestamp:
> http://cvs2svn.tigris.org/servlets/ReadMsg?list=dev&msgNo=1451
>
> Markus
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@cvs2svn.tigris.org
> For additional commands, e-mail: dev-help@cvs2svn.tigris.org
>
>


-- 
Jon Smirl
jonsmirl@gmail.com


end of thread, other threads:[~2009-02-25  9:05 UTC | newest]

Thread overview: 56+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-02-16  9:17 CVS import Ferry Huberts (Pelagic)
2009-02-16 13:20 ` CVS import [SOLVED] Ferry Huberts (Pelagic)
2009-02-16 13:45   ` Johannes Schindelin
2009-02-16 13:53     ` Johannes Schindelin
2009-02-16 17:33       ` Ferry Huberts (Pelagic)
2009-02-16 18:11         ` Johannes Schindelin
2009-02-16 20:32   ` Ferry Huberts (Pelagic)
2009-02-16 20:59     ` Johannes Schindelin
2009-02-17 11:19       ` Ferry Huberts (Pelagic)
2009-02-17 14:18         ` Johannes Schindelin
2009-02-17 15:16           ` Ferry Huberts (Pelagic)
2009-02-20 15:28       ` Jeff King
2009-02-20 16:25         ` Ferry Huberts (Pelagic)
2009-02-20 17:29           ` autocrlf=input and safecrlf (was Re: CVS import [SOLVED]) Jeff King
2009-02-20 23:24             ` Ferry Huberts (Pelagic)
2009-02-23  0:08               ` Jeff King
2009-02-23  6:50                 ` Ferry Huberts (Pelagic)
2009-02-23  6:56                   ` Jeff King
2009-02-23  7:09                     ` Ferry Huberts (Pelagic)
2009-02-23  7:10                       ` Jeff King
2009-02-23  7:29                         ` Ferry Huberts (Pelagic)
2009-02-24  6:11                           ` Jeff King
2009-02-24  9:25                             ` Ferry Huberts (Pelagic)
2009-02-25  6:56                               ` Jeff King
2009-02-25  8:03                                 ` Ferry Huberts (Pelagic)
2009-02-25  9:03                                   ` Jeff King
     [not found] <45084400.1090906@bluegap.ch>
2006-09-13 19:01 ` cvs import Jon Smirl
2006-09-13 20:41   ` Martin Langhoff
2006-09-13 21:04     ` Markus Schiltknecht
2006-09-13 21:15       ` Oswald Buddenhagen
2006-09-13 21:16       ` Martin Langhoff
2006-09-14  4:17         ` Michael Haggerty
2006-09-14  4:34           ` Jon Smirl
2006-09-14  5:02             ` Michael Haggerty
2006-09-14  5:21               ` Martin Langhoff
2006-09-14  5:35                 ` Michael Haggerty
2006-09-14  5:30               ` Jon Smirl
2006-09-14  4:40           ` Martin Langhoff
2006-09-13 21:05     ` Markus Schiltknecht
2006-09-13 21:38       ` Jon Smirl
2006-09-14  5:36         ` Michael Haggerty
2006-09-14 15:50           ` Shawn Pearce
2006-09-14 16:04             ` Jakub Narebski
2006-09-14 16:18               ` Shawn Pearce
2006-09-14 16:27               ` Jon Smirl
2006-09-14 17:01                 ` Michael Haggerty
2006-09-14 17:08                   ` Jakub Narebski
2006-09-14 17:17                   ` Jon Smirl
2006-09-15  7:37             ` Markus Schiltknecht
2006-09-16  3:39               ` Shawn Pearce
2006-09-16  6:04                 ` Oswald Buddenhagen
2006-09-13 22:52 ` Nathaniel Smith
2006-09-13 23:21   ` Daniel Carosone
2006-09-13 23:42   ` [Monotone-devel] " Keith Packard
2006-09-14  0:32     ` Nathaniel Smith
2006-09-14  0:57       ` [Monotone-devel] " Jon Smirl
2006-09-14  1:53         ` Daniel Carosone
2006-09-14  2:30           ` [Monotone-devel] " Shawn Pearce
2006-09-14  3:19             ` Daniel Carosone
