cvs2svn conversion directly to git ready for experimentation

* cvs2svn conversion directly to git ready for experimentation
@ 2007-08-01  0:09 Michael Haggerty
  2007-08-01  0:41 ` Johannes Schindelin
                   ` (3 more replies)
  0 siblings, 4 replies; 40+ messages in thread
From: Michael Haggerty @ 2007-08-01  0:09 UTC (permalink / raw)
  To: git; +Cc: users

I am the maintainer of cvs2svn[1], which is a program for one-time
conversions from CVS to Subversion.  cvs2svn is very robust against the
many peculiarities of CVS and can convert just about every CVS
repository we have ever seen.

I've been working on a cvs2svn output pass that writes the converted CVS
repository directly into git rather than Subversion.  The code runs now
with at least one repository from our test suite of nasty CVS repositories.

Unfortunately, I am a complete git newbie, so I would very much
appreciate help from the git community with feedback and checking
whether the conversion output is reasonable and gitlike.

The git output is very preliminary and virtually untested, and has the
following limitations (hopefully to be removed in the near future):

- It is rather slow.  Among other things, it still uses RCS or CVS to
extract the contents of the CVS revisions, which will soon be changed to
win a factor of 2 or so.

- CVS allows a branch to be created from arbitrary combinations of
source revisions and/or source branches.  cvs2svn tries to create a
branch from a single source, but if it can't figure out how to, it
creates the branch using "merge" from multiple sources.  In pathological
situations, the number of merge sources for a branch can be arbitrarily
large.

- It is not very intelligent about creating tags.  When asked to create
a tag, it unconditionally creates a "tag fixup branch"[2] with the same
name and contents as the tag, then tags this branch.  The tag fixup
branch is never deleted.

- There are no checks that CVS branch and tag names are legal git names,
or indeed that any other similar limitations of git are honored.

- The data that should be fed to git-fast-import is written to two files,
which have to be loaded into git-fast-import manually.  Eventually I
will add an option to invoke git-fast-import automatically and pipe the
output directly into it.

- Only single projects can be converted at a time.  I don't think that
this will be a significant limitation when outputting to git.
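
The missing name check mentioned in the list above could look roughly
like the following Python sketch.  It is not part of cvs2svn; the
function name is_legal_git_name is made up for illustration, and the
rules are the ones documented for git-check-ref-format in current git
versions, so treat it as an approximation rather than an exhaustive test.

```python
import re

# Characters that git-check-ref-format documents as forbidden anywhere in a
# ref name: control characters, space, DEL, and ~ ^ : ? * [ \
_FORBIDDEN = re.compile(r'[\x00-\x20\x7f~^:?*\[\\]')


def is_legal_git_name(name):
    """Return True if `name` looks usable as a git branch or tag name.

    Hypothetical helper, not a cvs2svn or git API.
    """
    if not name or name == "@":
        return False
    if _FORBIDDEN.search(name):
        return False
    if ".." in name or "@{" in name:
        return False
    if name.endswith(".") or name.startswith("/") or name.endswith("/"):
        return False
    if "//" in name:
        return False
    # No slash-separated component may start with "." or end with ".lock".
    for component in name.split("/"):
        if component.startswith(".") or component.endswith(".lock"):
            return False
    return True
```

A converter could run such a check over all CVS symbol names before
writing any output and report the offenders, instead of letting
git-fast-import fail halfway through.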

To try it out:

1. Install svn (to be able to check out cvs2svn) and either cvs or rcs.

2. Check out the current trunk version of cvs2svn:

    svn co http://cvs2svn.tigris.org/svn/cvs2svn/trunk cvs2svn-trunk
    cd cvs2svn-trunk
    make check # ...optional

3. Configure cvs2svn for your conversion.  This has to be done via the
"options-file method"[3].  See cvs2svn-example.options and
test-data/main-cvsrepos/cvs2svn-git.options as examples; the former file
includes voluminous documentation.

4. Run cvs2svn.  This outputs two git-fast-import files, with the names
specified by your options file.  In the example, these files are named
'cvs2svn-tmp/git-blob.dat' and 'cvs2svn-tmp/git-dump.dat'.

5. Initialize a git repository, and load the dump files using
git-fast-import:

    git-init
    cat cvs2svn-tmp/git-blob.dat | \
        git-fast-import --export-marks=cvs2svn-tmp/git-marks.dat
    cat cvs2svn-tmp/git-dump.dat | \
        git-fast-import --import-marks=cvs2svn-tmp/git-marks.dat
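
The same two load steps can be scripted.  The following Python sketch is
only an illustration: the function names are made up, it spells the
commands as "git init" and "git fast-import" (the non-dashed forms), and
it assumes git is on the PATH.  It feeds each file on stdin instead of
using cat.

```python
import os
import subprocess


def fast_import_steps(blob_file, dump_file, marks_file):
    """Return (argv, input_path) pairs mirroring the two manual load steps."""
    return [
        # Step 1: load the blobs and remember their marks.
        (["git", "fast-import", "--export-marks=" + marks_file], blob_file),
        # Step 2: load the commits, resolving blob marks from step 1.
        (["git", "fast-import", "--import-marks=" + marks_file], dump_file),
    ]


def load_into_git(repo_dir, blob_file, dump_file, marks_file):
    """Initialize repo_dir and run both fast-import steps inside it."""
    # Resolve paths first, because git runs with repo_dir as its cwd.
    blob_file, dump_file, marks_file = (
        os.path.abspath(p) for p in (blob_file, dump_file, marks_file)
    )
    subprocess.run(["git", "init"], cwd=repo_dir, check=True)
    for argv, input_path in fast_import_steps(blob_file, dump_file, marks_file):
        with open(input_path, "rb") as stream:
            subprocess.run(argv, cwd=repo_dir, stdin=stream, check=True)
```

With the example file names from step 4, a call would be
load_into_git("myrepo", "cvs2svn-tmp/git-blob.dat",
"cvs2svn-tmp/git-dump.dat", "cvs2svn-tmp/git-marks.dat"), where "myrepo"
is an existing empty directory.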

I am looking forward to your feedback.  Even better would be if somebody
wants to join forces on this project.  I would be happy to supply the
cvs2svn knowledge if you can bring the git experience.

Michael

[1] http://cvs2svn.tigris.org/
[2] http://www.kernel.org/pub/software/scm/git/docs/git-fast-import.html
[3] http://cvs2svn.tigris.org/cvs2svn.html#cmd-vs-options

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: cvs2svn conversion directly to git ready for experimentation
  2007-08-01  0:09 cvs2svn conversion directly to git ready for experimentation Michael Haggerty
@ 2007-08-01  0:41 ` Johannes Schindelin
  2007-08-01 22:09 ` Jakub Narebski
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 40+ messages in thread
From: Johannes Schindelin @ 2007-08-01  0:41 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: git, users

Hi,

On Wed, 1 Aug 2007, Michael Haggerty wrote:

> 2. Check out the current trunk version of cvs2svn:
> 
>     svn co http://cvs2svn.tigris.org/svn/cvs2svn/trunk cvs2svn-trunk
>     cd cvs2svn-trunk
>     make check # ...optional

FWIW I tried to clone it with "git svn", and needed to prefix the url with 
"guest", i.e.

	$ git svn clone http://guest@cvs2svn.tigris.org/svn/cvs2svn/trunk

and it still did not work at once.  Somehow I managed to get the 
"Username" prompt, input "guest", and left the password empty.  Even then, 
only the second attempt succeeded (I guess somehow that "password" got 
stored in $HOME/.subversion/auth/...).

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: cvs2svn conversion directly to git ready for experimentation
  2007-08-01  0:09 cvs2svn conversion directly to git ready for experimentation Michael Haggerty
  2007-08-01  0:41 ` Johannes Schindelin
@ 2007-08-01 22:09 ` Jakub Narebski
  2007-08-02 16:58   ` Michael Haggerty
  2007-08-02 23:44   ` Jon Smirl
  2007-08-02  8:49 ` Steffen Prohaska
       [not found] ` <8b65902a0708010438s24d16109k601b52c04cf9c066@mail.gmail.com>
  3 siblings, 2 replies; 40+ messages in thread
From: Jakub Narebski @ 2007-08-01 22:09 UTC (permalink / raw)
  To: git; +Cc: users

Michael Haggerty wrote:

> I am the maintainer of cvs2svn[1], which is a program for one-time
> conversions from CVS to Subversion.  cvs2svn is very robust against the
> many peculiarities of CVS and can convert just about every CVS
> repository we have ever seen.
> 
> I've been working on a cvs2svn output pass that writes the converted CVS
> repository directly into git rather than Subversion.  The code runs now
> with at least one repository from our test suite of nasty CVS repositories.

Have you contacted Jon Smirl about his unpublished work on cvs2git,
a cvs2svn-based CVS-to-Git converter?

Quote from InterfacesFrontendsAndTools page on GIT wiki[1]:

  cvs2git is the unofficial name of Jon Smirl's modifications to cvs2svn.
  These modifications allow cvs2svn to generate a data stream which is
  consumed by Shawn Pearce's git-fast-import (now included in git.git).
  git-fast-import converts its input stream directly into a Git .pack file,
  minimizing the amount of IO required on large imports.

  Jon Smirl stopped working on cvs2git[2] because first, Mozilla (which was
  the main target of his work) decided not to move to git, and second
  because of troubles with cvs2svn architecture[*] (which it is based on).
  Jon Smirl has posted his impressions on working on CVS importer in 
  "Some tips for doing a CVS importer" thread[3].

References:
-----------
[1] http://git.or.cz/gitwiki/InterfacesFrontendsAndTools#head-23858c2cde0cef60443d8e73e6829a95f8e191ef
[2] http://msgid.gmane.org/9e4733910611190940y147992b8mbdfac5a51f42e0fe@mail.gmail.com
[3] http://marc.theaimsgroup.com/?t=116405956000001&r=1&w=2

Footnotes:
----------
[*] If I remember correctly, the authors of cvs2svn were talking about separating
the code dealing with disentangling CVS repository structure from the part
translating it into Subversion repository (with its quirks), and the part
generating Subversion repository.

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: cvs2svn conversion directly to git ready for experimentation
  2007-08-01  0:09 cvs2svn conversion directly to git ready for experimentation Michael Haggerty
  2007-08-01  0:41 ` Johannes Schindelin
  2007-08-01 22:09 ` Jakub Narebski
@ 2007-08-02  8:49 ` Steffen Prohaska
  2007-08-02 17:23   ` Michael Haggerty
                     ` (3 more replies)
       [not found] ` <8b65902a0708010438s24d16109k601b52c04cf9c066@mail.gmail.com>
  3 siblings, 4 replies; 40+ messages in thread
From: Steffen Prohaska @ 2007-08-02  8:49 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: git, users

Michael,

On Aug 1, 2007, at 2:09 AM, Michael Haggerty wrote:

> I am looking forward to your feedback.  Even better would be if  
> somebody
> wants to join forces on this project.  I would be happy to supply the
> cvs2svn knowledge if you can bring the git experience.

I tried it with revision trunk@3930 of cvs2svn. The results are as  
follows.

some WARNING: problem encoding log message: [...]

cvs2svn Statistics:
------------------
Total CVS Files:              9578
Total CVS Revisions:         66771
Total CVS Branches:         229121
Total CVS Tags:             371259
Total Unique Tags:             112
Total Unique Branches:          79
CVS Repos Size in KB:       210390
Total SVN Commits:           18178
First Revision Date:    Fri Jul 23 10:26:11 1999
Last Revision Date:     Thu Jul 19 17:50:40 2007
------------------
Timings (seconds):
------------------
3295   pass1    CollectRevsPass
   0   pass2    CollateSymbolsPass
3642   pass3    FilterSymbolsPass
   0   pass4    SortRevisionSummaryPass
   1   pass5    SortSymbolSummaryPass
 109   pass6    InitializeChangesetsPass
  56   pass7    BreakRevisionChangesetCyclesPass
  66   pass8    RevisionTopologicalSortPass
  54   pass9    BreakSymbolChangesetCyclesPass
  99   pass10   BreakAllChangesetCyclesPass
  92   pass11   TopologicalSortPass
  46   pass12   CreateRevsPass
   7   pass13   SortSymbolsPass
   2   pass14   IndexSymbolsPass
  70   pass15   OutputPass
7540   total

I checked that CVS head and two other branches match when checked
out from CVS and from the imported git archive. Everything is ok
(ignoring some differences introduced by keyword expansion).
Note, I tried earlier to use cvs2svn to import to svn followed by
git-svnimport to import to git. The repository resulting from
this two-step import did not even pass this minimal requirement of
matching checkouts from cvs and git.
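
A comparison of this kind can be automated along the following lines.
This Python sketch is an assumption about how one might do it, not the
procedure actually used here: the function names are made up, it
collapses expanded RCS keywords such as $Id: ... $ back to $Id$ before
comparing, and it does not handle the multi-line $Log$ keyword.

```python
import os
import re

# Single-line RCS keywords that CVS may expand in checked-out files.
_KEYWORD = re.compile(
    rb"\$(Author|Date|Header|Id|Locker|Name|RCSfile|Revision|Source|State)"
    rb":[^$\n]*\$"
)


def collapse_keywords(data):
    """Turn expanded keywords like $Revision: 1.4 $ back into $Revision$."""
    return _KEYWORD.sub(rb"$\1$", data)


def list_files(root, skip=("CVS", ".git")):
    """Return relative paths of all files under root, skipping metadata dirs."""
    found = set()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = [d for d in dirnames if d not in skip]
        for filename in filenames:
            found.add(os.path.relpath(os.path.join(dirpath, filename), root))
    return found


def differing_files(cvs_root, git_root):
    """Return sorted paths that are missing on one side or differ in content."""
    cvs_files, git_files = list_files(cvs_root), list_files(git_root)
    different = cvs_files ^ git_files  # present in only one checkout
    for path in cvs_files & git_files:
        with open(os.path.join(cvs_root, path), "rb") as a, \
             open(os.path.join(git_root, path), "rb") as b:
            if collapse_keywords(a.read()) != collapse_keywords(b.read()):
                different.add(path)
    return sorted(different)
```

An empty result for a given branch means the two checkouts agree up to
keyword expansion.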

cvs2svn created a lot of branches that are not present in CVS,
with names identical to CVS tags. Apparently these branches are
used to create a commit matching a certain CVS tag.

I checked one suspicious commit that indicates to me whether the root
points of branches are right. Note, git-cvsimport fails this check;
parsecvs and cvs2svn pass the check.

The branching structure looks, ... hmm ..., interesting. cvs2svn
manufactured commits to get the branching points right.
Apparently our CVS has some weird commits like 'unlabeled-1.1.1'
and two other named tags (maybe vendor branches?) that cause
these manufactured commits. In gitk I see long lines running
parallel to the cvs trunk all down to these weird CVS tags. They
are not very useful, although they might be correct. Note,
parsecvs imports our repository without such basically useless
links.  However, I can't verify if parsecvs gets something wrong.
Other branches are created over a couple of commits mixing in
several branches (maybe again our weird commits already
mentioned). See branching1.png, branching2.png, branching3.png.
[ I have to apologize, our cvs repository contains proprietary
   information, so I can't publish its history freely. ]

cvs2svn is the first tool besides parsecvs that worked for me,
that is, it imported the whole repository, passed the basic test of
matching checkouts from cvs and git, and got the one suspicious
commit right that I'm using for verifying the branching points.

[ I have no time to go into the details of all these tests.
   Therefore only a very short summary:
   All tools needed basic cleanup of a few corrupted ,v files and
      ,v files that were duplicated in Attic.
   git-cvsimport fails to create branches at the right commit.
   fromcvs's togit surrendered during the import.
   fromcvs's tohg accepted more of the history, but finally
     surrendered as well.
   parsecvs works for me (crashes on corrupted ,v files).
   cvs2svn followed by git-svnimport creates a wrong state at the
     tips of branches.
   cvs2svn direct git import works for me (reports corrupted ,v files).
   ]

Right now, I'd prefer the import by parsecvs because of the
simpler history. However, I don't know if I lose history
information by doing so. I'd start with a run of cvs2svn to validate
the overall structure of the CVS repository. Dealing with corruption
in the CVS repository seems to be superior in cvs2svn. It reports
errors when parsecvs just crashes.

	Steffen

[Attachments: branching1.png, branching2.png, branching3.png (image/png)]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: cvs2svn conversion directly to git ready for experimentation
       [not found] ` <8b65902a0708010438s24d16109k601b52c04cf9c066@mail.gmail.com>
@ 2007-08-02 15:34   ` Michael Haggerty
  2007-08-02 23:08     ` Martin Langhoff
  0 siblings, 1 reply; 40+ messages in thread
From: Michael Haggerty @ 2007-08-02 15:34 UTC (permalink / raw)
  To: Guilhem Bonnefille; +Cc: git, users

[I am CCing this response to the mailing lists.]

Guilhem Bonnefille wrote:
> On 8/1/07, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>> I am the maintainer of cvs2svn[1], which is a program for one-time
>> conversions from CVS to Subversion.  cvs2svn is very robust against the
>> many peculiarities of CVS and can convert just about every CVS
>> repository we have ever seen.
> 
> What are the differences with cvsps ( http://www.cobite.com/cvsps/ )?

I'm not extremely familiar with cvsps, and I don't really want to get
into a "my-tool-is-better-than-your-tool" kind of argument.  Instead I
will mention that the goals of the two projects are somewhat different:

cvs2svn is meant for one-time conversions from CVS, and therefore aims
for maximum conversion accuracy, robustness even in the presence of some
kinds of CVS repository corruption, intelligent translation of CVS
idioms to the idioms of a modern SCM, and scalability to large
repositories (by using on-disk databases instead of RAM for intermediate
data).  Conversion speed is not a primary goal of cvs2svn, and
incremental conversions are not supported at all.  cvs2svn requires
filesystem access to the CVS repository (it parses the RCS files directly).

cvsps is not a conversion tool at all, though it is used by other
conversion tools to generate the changesets.  It appears (I hope I am
not misinterpreting things) to emphasize speed and incremental
operation, for example attempting to make changesets consistent from one
run to the next, even if the CVS repository has been changed prudently
between runs.  cvsps does not appear to attempt to create atomic branch
and tag creation commits or handle CVS's special vendorbranch behavior.
cvsps operates via the CVS protocol; you don't need filesystem access
to the CVS repository.

I can also point you to a list of cvs2svn features, which includes a
list of some of the CVS quirks that it knows how to handle:

    http://cvs2svn.tigris.org/cvs2svn.html#features

cvs2svn includes a large suite of perverse CVS repositories that we use
for testing.  Many of them are derived from real-life CVS repositories
that people have had problems with.  It would be very interesting to see
how other conversion tools handle these repositories, but I don't expect
to have time to do so in the near future.

Michael

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: cvs2svn conversion directly to git ready for experimentation
  2007-08-01 22:09 ` Jakub Narebski
@ 2007-08-02 16:58   ` Michael Haggerty
  2007-08-02 23:44   ` Jon Smirl
  1 sibling, 0 replies; 40+ messages in thread
From: Michael Haggerty @ 2007-08-02 16:58 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git

Jakub Narebski wrote:
> Michael Haggerty wrote:
> Have you contacted Jon Smirl about his unpublished work on cvs2git,
> a cvs2svn-based CVS-to-Git converter?

Yes, I am familiar with Jon Smirl's work, and as soon as he let us know
what he was working on, we tried to help.  Unfortunately the cooperation
was not very fruitful.

- While Jon was (unknown to us) working on his git output patch, I was
working on a big cvs2svn rewrite to make cvs2svn more robust and easier
to hack.  By the time he contacted us, his patch did not apply to the
cvs2svn code.  The refactoring that obsoleted the patch, in fact, was
largely to remedy the very same architectural problems that were
hampering his work.

- In my opinion, Jon misdiagnosed the reason for the "fragmented branch
creation" problem that he claimed was preventing a clean conversion to
git, and he felt that we were not interested in fixing the problem.  In
fact, I was working on fixing another problem that I believe was the
*real* reason for the fragmented branch creation.  This fix is
implemented in cvs2svn version 2.0.

> Footnotes:
> ----------
> [*] If I remember correctly, the authors of cvs2svn were talking about separating
> the code dealing with disentangling CVS repository structure from the part
> translating it into Subversion repository (with its quirks), and the part
> generating Subversion repository.

Yes, this is now done, which was why it was only a couple of days of
programming for me to add a git output option.

Michael

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: cvs2svn conversion directly to git ready for experimentation
  2007-08-02  8:49 ` Steffen Prohaska
@ 2007-08-02 17:23   ` Michael Haggerty
  2007-08-02 19:22     ` Marko Macek
  2007-08-02 23:59     ` Jon Smirl
  2007-08-02 17:35   ` Simon 'corecode' Schubert
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 40+ messages in thread
From: Michael Haggerty @ 2007-08-02 17:23 UTC (permalink / raw)
  To: Steffen Prohaska; +Cc: git, users

Steffen Prohaska wrote:
> On Aug 1, 2007, at 2:09 AM, Michael Haggerty wrote:
>> I am looking forward to your feedback.  Even better would be if somebody
>> wants to join forces on this project.  I would be happy to supply the
>> cvs2svn knowledge if you can bring the git experience.
> 
> I tried it with revision trunk@3930 of cvs2svn. The results are as follows.

Thanks for the feedback!

> cvs2svn created a lot of branches that are not present in CVS,
> with names identical to CVS tags. Apparently these branches are
> used to create a commit matching a certain CVS tag.

That is correct.  This is something that I plan to work on, at least for
tags that can be created from a single source commit.

> The branching structure looks, ... hmm ..., interesting. cvs2svn
> manufactured commits to get the branching points right.
> Apparently our CVS has some weird commits like 'unlabeled-1.1.1'
> and two other named tags (maybe vendor branches?) that cause
> these manufactured commits. In gitk I see long lines running
> parallel to the cvs trunk all down to these weird CVS tags. They
> are not very useful, although they might be correct. Note,
> parsecvs imports our repository without such basically useless
> links.  However, I can't verify if parsecvs gets something wrong.

Branches with names like "unlabeled-1.1.1" come from CVS branches for
which the revisions are still contained in the RCS files but for which
the branch name has been deleted.  These wreak havoc on cvs2svn's
attempt to find simple branch sources and cause a proliferation of
basically useless branches.  The main problem is that cvs2svn does not
attempt to figure out that "unlabeled-1.2.4" in one file might be the
same as "unlabeled-1.2.6" in another etc.

An "unlabeled-1.1.1", in particular, means that the branch whose name
was deleted was a vendor branch.  The deletion of a vendor branch name
can cause even more mayhem.

In most cases it makes sense to exclude the unlabeled branches.  After
all, somebody tried to delete them, so they can't be that important,
right?  Use --exclude='unlabeled-.*', or add a line like this to your
options file:

ctx.symbol_strategy.add_rule(ExcludeRegexpStrategyRule(r'unlabeled-.*'))

This can of course cause problems if other branches or tags were
created that branched off of the unlabeled branch.  In such cases the
dependent branches/tags might have to be excluded too.
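
To preview which symbols such a pattern would drop, something like the
following standalone Python sketch can help.  It does not use cvs2svn at
all; split_symbols is a made-up helper, and it assumes that the exclude
rule matches the pattern against the whole symbol name.

```python
import re


def split_symbols(names, exclude_pattern=r"unlabeled-.*"):
    """Partition symbol names into (kept, excluded) by whole-name match.

    Hypothetical helper for previewing an exclude pattern; not cvs2svn code.
    """
    regexp = re.compile(exclude_pattern)
    kept, excluded = [], []
    for name in names:
        # fullmatch: the pattern has to cover the entire name.
        (excluded if regexp.fullmatch(name) else kept).append(name)
    return kept, excluded
```

Feeding it the list of branch and tag names from a trial conversion
shows which names the pattern would remove; branches or tags that were
created from the excluded ones still have to be identified by hand.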

> Other branches are created over a couple of commits mixing in
> several branches (maybe again our weird commits already
> mentioned). See branching1.png, branching2.png, branching3.png.
> [ I have to apologize, our cvs repository contains proprietary
>   information, so I can't publish its history freely. ]

This can definitely be caused by unlabeled branches.  It can also be
caused by branches rooted in a vendor branch.  In many cases, such
branches can actually be grafted onto trunk, but cvs2svn does not (yet)
attempt this.

> cvs2svn is the first tool besides parsecvs that worked for me,
> that is, it imported the whole repository, passed the basic test of
> matching checkouts from cvs and git, and got the one suspicious
> commit right that I'm using for verifying the branching points.
> 
> [ I have no time to go into the details of all these tests.
>   Therefore only a very short summary:
>   All tools needed basic cleanup of a few corrupted ,v files and
>      ,v files that were duplicated in Attic.
>   git-cvsimport fails to create branches at the right commit.
>   fromcvs's togit surrendered during the import.
>   fromcvs's tohg accepted more of the history, but finally
>     surrendered as well.
>   parsecvs works for me (crashes on corrupted ,v files).
>   cvs2svn followed by git-svnimport creates a wrong state at the
>     tips of branches.
>   cvs2svn direct git import works for me (reports corrupted ,v files).
>   ]

Thanks very much for this interesting summary.

> Right now, I'd prefer the import by parsecvs because of the
> simpler history. However, I don't know if I lose history
> information by doing so. I'd start by a run of cvs2svn to validate
> the overall structure of the CVS repository. Dealing with corruption
> in the CVS repository seems to be superior in cvs2svn. It reports
> errors when parsecvs just crashes.

If excluding the unlabeled branches does not fix things for you, I
suggest checking out the first revision on such a branch, and comparing
the results from CVS, from parsecvs, and from cvs2svn.  It *should* be
that the version of the file from the vendor branch is included in the
working copy.  cvs2svn should handle this correctly.  I am curious
whether parsecvs does.

Michael

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: cvs2svn conversion directly to git ready for experimentation
  2007-08-02  8:49 ` Steffen Prohaska
  2007-08-02 17:23   ` Michael Haggerty
@ 2007-08-02 17:35   ` Simon 'corecode' Schubert
  2007-08-02 19:13     ` Steffen Prohaska
  2007-08-02 20:43   ` Linus Torvalds
  2007-08-02 23:55   ` Jon Smirl
  3 siblings, 1 reply; 40+ messages in thread
From: Simon 'corecode' Schubert @ 2007-08-02 17:35 UTC (permalink / raw)
  To: Steffen Prohaska; +Cc: Michael Haggerty, git, users

Steffen Prohaska wrote:
>   fromcvs's togit surrendered during the import.
>   fromcvs's tohg accepted more of the history, but finally
>     surrendered as well.

Which repo is it you are converting?  Is this available somewhere?

I'd appreciate any reports concerning "surrenders" of fromcvs.  Additionally, it seems strange that tohg should have worked "better" than togit, as these are basically just different backends.

cheers
  simon

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: cvs2svn conversion directly to git ready for experimentation
  2007-08-02 17:35   ` Simon 'corecode' Schubert
@ 2007-08-02 19:13     ` Steffen Prohaska
  2007-08-02 19:29       ` Simon 'corecode' Schubert
  2007-08-02 23:37       ` Michael Haggerty
  0 siblings, 2 replies; 40+ messages in thread
From: Steffen Prohaska @ 2007-08-02 19:13 UTC (permalink / raw)
  To: Simon 'corecode' Schubert; +Cc: Michael Haggerty, git, users

Simon,

On Aug 2, 2007, at 7:35 PM, Simon 'corecode' Schubert wrote:

> Steffen Prohaska wrote:
>>   fromcvs's togit surrendered during the import.
>>   fromcvs's tohg accepted more of the history, but finally
>>     surrendered as well.
>
> Which repo is it you are converting?  Is this available somewhere?

Unfortunately not, the content is a proprietary software package.

> I'd appreciate any reports concerning "surrenders" of fromcvs.   
> Additionally, it seems strange that tohg should have worked  
> "better" than togit, as these are basically just different backends.

Some time has passed since I did the tests. I had no time to do a
detailed investigation then. I'll have more time now and will
prepare a bug report, which is not easy because I can't send you
the cvs repo, sorry. Any hints on what would be most helpful for you?

I remember that togit reported a broken pipe. My feeling was
that git-fast-import aborted, which may be the reason why tohg
worked better. I didn't try to understand more details. I never
read ruby code before and it was already a challenge for me to
get everything up and running (rcs, rbtree).

	Steffen

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: cvs2svn conversion directly to git ready for experimentation
  2007-08-02 17:23   ` Michael Haggerty
@ 2007-08-02 19:22     ` Marko Macek
  2007-08-02 23:59     ` Jon Smirl
  1 sibling, 0 replies; 40+ messages in thread
From: Marko Macek @ 2007-08-02 19:22 UTC (permalink / raw)
  To: Michael Haggerty, git, users, prohaska

Michael Haggerty wrote:
> This can definitely be caused by unlabeled branches.  It can also be
> caused by branches rooted in a vendor branch.  In many cases, such
> branches can actually be grafted onto trunk, but cvs2svn does not (yet)
> attempt this.

It would be nice to be able to exclude the vendor branch if only 
the initial commit was made on it (or maybe handle it better, by 
remapping the commits to the main branch when they match).

I have tested this on my repository and currently gitk draws 
large 'railroad switching stations' because many tags have the 
vendor branch as a parent (and in some cases also the parent branch, 
in addition to the parent commit).

	Mark

[Attachment: railroad.png (image/png)]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: cvs2svn conversion directly to git ready for experimentation
  2007-08-02 19:13     ` Steffen Prohaska
@ 2007-08-02 19:29       ` Simon 'corecode' Schubert
  2007-08-02 20:21         ` Robin Rosenberg
                           ` (2 more replies)
  2007-08-02 23:37       ` Michael Haggerty
  1 sibling, 3 replies; 40+ messages in thread
From: Simon 'corecode' Schubert @ 2007-08-02 19:29 UTC (permalink / raw)
  To: Steffen Prohaska; +Cc: Michael Haggerty, git, users

Steffen Prohaska wrote:
> I remember that togit reported a broken pipe. My feeling was
> that git-fast-import aborted, which may be the reason why tohg
> worked better. I didn't try to understand more details. I never
> read ruby code before and it was already a challenge for me to
> get everything up and running (rcs, rbtree).

yah, that pretty much tells me it is shawn's bug :)  but without more details, it is very hard to diagnose.  tohg should tell you which rcs revs are the offenders.  be sure to use a recent fromcvs however.

cheers
  simon

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: cvs2svn conversion directly to git ready for experimentation
  2007-08-02 19:29       ` Simon 'corecode' Schubert
@ 2007-08-02 20:21         ` Robin Rosenberg
       [not found]           ` <200708022221.13129.robin.rosenberg.lists-RgPrefM1rjDQT0dZR+AlfA@public.gmane.org>
                             ` (2 more replies)
  2007-08-02 22:02         ` Steffen Prohaska
  2007-08-03  3:07         ` Shawn O. Pearce
  2 siblings, 3 replies; 40+ messages in thread
From: Robin Rosenberg @ 2007-08-02 20:21 UTC (permalink / raw)
  To: Simon 'corecode' Schubert
  Cc: Steffen Prohaska, Michael Haggerty, git, users

torsdag 02 augusti 2007 skrev Simon 'corecode' Schubert:
> Steffen Prohaska wrote:
> > I remember that togit reported a broken pipe. My feeling was
> > that git-fast-import aborted, which may be the reason why tohg
> > worked better. I didn't try to understand more details. I never
> > read ruby code before and it was already a challenge for me to
> > get everything up and running (rcs, rbtree).
> 
> yah, that pretty much tells me it is shawn's bug :)  but without more
> details, it is very hard to diagnose.  tohg should tell you which rcs revs
> are the offenders.  be sure to use a recent fromcvs however.

If the bug is still unfixed and you haven't been able to diagnose for lack of 
repos, you could try the Eclipse CVS repo.

When I converted the Eclipse source to git I had a problem converting the 
whole repo, i.e. fast-import died, so I excluded some large parts that 
were effectively forks, and some websites.

-- robin

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: cvs2svn conversion directly to git ready for experimentation
       [not found]           ` <200708022221.13129.robin.rosenberg.lists-RgPrefM1rjDQT0dZR+AlfA@public.gmane.org>
@ 2007-08-02 20:31             ` Lübbe Onken
  0 siblings, 0 replies; 40+ messages in thread
From: Lübbe Onken @ 2007-08-02 20:31 UTC (permalink / raw)
  To: users-6zjzXkf2FExf8fUKLXF2/HdfcadvtA/q
  Cc: git-u79uwXL29TY76Z2rM5mHXA, users-6zjzXkf2FExf8fUKLXF2/HdfcadvtA/q

Hi Folks,

I guess that the initial poster sent this message to the TortoiseSVN
users list only by mistake, because the subject has nothing at all to do
with TortoiseSVN.

Could you please be so kind and remove the TortoiseSVN users list from
future replies to this thread?

thanks
-Lübbe

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: cvs2svn conversion directly to git ready for experimentation
  2007-08-02  8:49 ` Steffen Prohaska
  2007-08-02 17:23   ` Michael Haggerty
  2007-08-02 17:35   ` Simon 'corecode' Schubert
@ 2007-08-02 20:43   ` Linus Torvalds
  2007-08-02 23:19     ` Michael Haggerty
  2007-08-02 23:55   ` Jon Smirl
  3 siblings, 1 reply; 40+ messages in thread
From: Linus Torvalds @ 2007-08-02 20:43 UTC (permalink / raw)
  To: Steffen Prohaska; +Cc: Michael Haggerty, git, users

On Thu, 2 Aug 2007, Steffen Prohaska wrote:
> 
> Right now, I'd prefer the import by parsecvs because of the
> simpler history. However, I don't know if I lose history
> information by doing so. I'd start by a run of cvs2svn to validate
> the overall structure of the CVS repository.

Well, once imported, you could just go through the branches and tags, and 
just delete the ones you consider uninteresting, and then do a "git gc".

You'd want to re-pack after a fast-import anyway (regardless of the source 
of the fast-import input), so maybe cvs2svn ends up giving you a bit of
unnecessary info, but it should be easy enough to get rid of
after-the-fact.
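
A minimal sketch of that cleanup, assuming the names passed in are the
branches and tags judged uninteresting. The function name and the example
ref names are hypothetical; the function only builds the command lines, and
running them inside the imported repository is left to the caller:

```python
def cleanup_commands(branches, tags):
    """Build the git commands that drop unwanted refs and then repack."""
    cmds = []
    for name in branches:
        # -D deletes the branch even if it has not been merged anywhere.
        cmds.append(["git", "branch", "-D", name])
    for name in tags:
        cmds.append(["git", "tag", "-d", name])
    # Repack and clean up once, after all unwanted refs are gone.
    cmds.append(["git", "gc"])
    return cmds


# Hypothetical ref names, for illustration only.
for cmd in cleanup_commands(["unlabeled-1.1.1"], ["OLD_TAG"]):
    print(" ".join(cmd))
```

Each list can be handed to subprocess.run(cmd, check=True) from within the
repository once the selection of refs looks right.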

		Linus

* Re: cvs2svn conversion directly to git ready for experimentation
  2007-08-02 19:29       ` Simon 'corecode' Schubert
  2007-08-02 20:21         ` Robin Rosenberg
@ 2007-08-02 22:02         ` Steffen Prohaska
  2007-08-02 22:50           ` Simon 'corecode' Schubert
  2007-08-03  3:07         ` Shawn O. Pearce
  2 siblings, 1 reply; 40+ messages in thread
From: Steffen Prohaska @ 2007-08-02 22:02 UTC (permalink / raw)
  To: Simon 'corecode' Schubert; +Cc: Michael Haggerty, Git Mailing List

[-- Attachment #1: Type: text/plain, Size: 3884 bytes --]

Simon,

On Aug 2, 2007, at 9:29 PM, Simon 'corecode' Schubert wrote:

> Steffen Prohaska wrote:
>> I remember that togit reported a broken pipe. My feeling was
>> that git-fastimport aborted, which may be the reason why tohg
>> worked better. I didn't try to understand more details. I never
>> read ruby code before and it was already a challenge for me to
>> get everything up and running (rcs, rbtree).
>
> yah, that pretty much tells me it is shawn's bug :)  but without  
> more details, it is very hard to diagnose.

I tried again. Interestingly now togit works but tohg still fails.

togit starts with reporting

fatal: Not a valid object name

as the first line. But besides that it seems to work fine. What
concerns me a bit is that the last line togit reports is

committing set 18100/18173

I'd expect it should report 18173/18173.
The rest are git-fast-import statistics.

BTW, togit creates much more complex branching patterns than cvs2svn
does. The attached file branching.png displays a small view of a
branching pattern that extends downwards over a couple of screens.
I checked the cvs2svn history again. It doesn't contain anything
of similar complexity.


> tohg should tell you which rcs revs are the offenders.  be sure to  
> use a recent fromcvs however.

tohg fails (on the same repo that togit imported) with the
following error

Traceback (most recent call last):
   File "./tohg.py", line 102, in <module>
     destrepo.dispatch()
   File "./tohg.py", line 98, in dispatch
     func(*l[1:])
   File "./tohg.py", line 78, in cmd_commit
     extra = {'branch': branch})
   File "/sw/lib/python2.5/site-packages/mercurial/localrepo.py", line 736, in commit
     mn = self.manifest.add(m1, tr, linkrev, c1[0], c2[0], (new, remove))
   File "/sw/lib/python2.5/site-packages/mercurial/manifest.py", line 191, in add
     _("failed to remove %s from manifest") % f)
AssertionError: failed to remove X/Y.cpp from manifest
transaction abort!
rollback completed
./tohg.rb:200:in `readline': End of file reached while handling set [core/X/Y.cpp,v:1.19,core/X/Z.cpp,v:1.22,core/X/Attic/W,v:1.12] (EOFError)
         from ./tohg.rb:200:in `_commit'
         from ./tohg.rb:154:in `commit'
         from ./fromcvs.rb:894:in `commit'
         from ./fromcvs.rb:965:in `commit_sets'
         from ./tohg.rb:228


The versions I used are listed below. I adjusted tohg a bit to use
python 2.5 installed by fink. I'm working on Mac OS X.

$ cd fromcvs
$ hg tip
changeset:   103:cccdab84e9e5
tag:         tip
user:        Simon 'corecode' Schubert <corecode@fs.ei.tum.de>
date:        Mon Jul 16 23:49:52 2007 +0200
summary:     Add error handling on committing sets.
$ hg diff
diff -r cccdab84e9e5 tohg.rb
--- a/tohg.rb   Mon Jul 16 23:49:52 2007 +0200
+++ b/tohg.rb   Fri Jul 20 17:06:30 2007 +0200
@@ -60,7 +60,7 @@ class HGDestRepo
      @status = status
      @outs, @ins = \
-      Open2.popen2('python', File.join(File.dirname($0), 'tohg.py'), hgroot)
+      Open2.popen2('python2.5', File.join(File.dirname($0), 'tohg.py'), hgroot)
      @last_date = Time.at(@ins.readline.strip.to_i)
      @branches = {}
      while l = @ins.readline do


$ cd rcsparse
$ hg tip
changeset:   37:e871e108f2e4
tag:         tip
user:        Simon 'corecode' Schubert <corecode@fs.ei.tum.de>
date:        Sun Feb 18 15:46:29 2007 +0100
summary:     Return revision date in GMT, like RCS/CVS uses everywhere.

rbtree-0.2.0.tar.gz

ruby 1.8.2 (2004-12-25) [universal-darwin8.0]

$ cd git
$ git describe master
v1.5.3-rc3-120-g68d4229

$ hg --version
Mercurial Distributed SCM (version 0.9.3)

Copyright (C) 2005, 2006 Matt Mackall <mpm@selenic.com>
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ /sw/bin/python2.5 --version
Python 2.5.1


Hope this helps.

	Steffen


[-- Attachment #2.1: branching.png --]
[-- Type: application/applefile, Size: 73 bytes --]

[-- Attachment #2.2: branching.png --]
[-- Type: image/png, Size: 17562 bytes --]


* Re: cvs2svn conversion directly to git ready for experimentation
  2007-08-02 22:02         ` Steffen Prohaska
@ 2007-08-02 22:50           ` Simon 'corecode' Schubert
  2007-08-02 23:50             ` Michael Haggerty
  2007-08-04  8:28             ` Steffen Prohaska
  0 siblings, 2 replies; 40+ messages in thread
From: Simon 'corecode' Schubert @ 2007-08-02 22:50 UTC (permalink / raw)
  To: Steffen Prohaska; +Cc: Michael Haggerty, Git Mailing List

Steffen Prohaska wrote:
>> yah, that pretty much tells me it is shawn's bug :)  but without more 
>> details, it is very hard to diagnose.
> 
> I tried again. Interestingly now togit works but tohg still fails.
> 
> togit starts with reporting
> 
> fatal: Not a valid object name

that's fine.

> as the first line. But besides that it seems to work fine. What
> concerns me a bit is that the last line togit reports is
> 
> committing set 18100/18173
> 
> I'd expect it should report 18173/18173.

that's fine as well.  You only saw multiples of 100, but you didn't consider it would skip the intermediate ones, right? :)
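
To illustrate why the last line reads 18100 rather than 18173, here is a small sketch. It assumes, as described above, that a progress line is printed only for set numbers that are multiples of 100; the function names are made up and this is not the fromcvs code:

```python
def reported_sets(total, step=100):
    """Return the set numbers for which a progress line would be printed."""
    return [n for n in range(1, total + 1) if n % step == 0]


def last_progress_line(total, step=100):
    """Return the final progress line, or None if nothing was ever printed."""
    shown = reported_sets(total, step)
    if not shown:
        return None
    return "committing set %d/%d" % (shown[-1], total)


print(last_progress_line(18173))  # committing set 18100/18173
```

All 18173 sets are still committed; only the reporting skips the last 73.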

> BTW, togit creates much more complex branching patterns than cvs2svn
> does. The attached file branching.png displays a small view of a
> branching pattern that extends downwards over a couple of screens.
> I checked the cvs2svn history again. It doesn't contain anything
> of similar complexity.

haha yea, there is still some issue with duplicate branch names and the branchpoint.  if it doesn't get the branch right, it will always "pull" files from the parent branch.

did you do some manual RCS file copying or manual branch name changing of individual files?  this could be the reason.  I still have to find a simple repo to reproduce this.

> tohg fails (on the same repo that togit imported) with the
> following error
[..]
> AssertionError: failed to remove X/Y.cpp from manifest

This is a mercurial 0.9.3 error, as far as I can tell from the reports.  This never occurred here, and nobody reporting to me could ever reproduce this problem to pinpoint it.

cheers
  simon

-- 
Serve - BSD     +++  RENT this banner advert  +++    ASCII Ribbon   /"\
Work - Mac      +++  space for low €€€ NOW!1  +++      Campaign     \ /
Party Enjoy Relax   |   http://dragonflybsd.org      Against  HTML   \
Dude 2c 2 the max   !   http://golden-apple.biz       Mail + News   / \

* Re: cvs2svn conversion directly to git ready for experimentation
  2007-08-02 15:34   ` Michael Haggerty
@ 2007-08-02 23:08     ` Martin Langhoff
  2007-08-03  4:03       ` Johannes Schindelin
                         ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: Martin Langhoff @ 2007-08-02 23:08 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Guilhem Bonnefille, git, users

On 8/3/07, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> cvsps is not a conversion tool at all, though it is used by other
> conversion tools to generate the changesets.  It appears (I hope I am
> not misinterpreting things) to emphasize speed and incremental
> operation, for example attempting to make changesets consistent from one
> run to the next, even if the CVS repository has been changed prudently
> between runs.  cvsps does not appear to attempt to create atomic branch
> and tag creation commits or handle CVS's special vendorbranch behavior.
>  cvsps operates via the CVS protocol; you don't need filesystem access
> to the CVS repository.

100% in agreement. And though I can't claim to be happy with cvsps, in
many scenarios it is mighty useful, in spite of its significant warts.
 The "does incrementals" is hugely important these days, as lots of
people use git to run "vendor branches" of upstream projects that use
CVS.

To me, that's *the* killer-app feature of git. Of course, others see
different aspects of git as their deal-maker. But I'm sure I'm not
alone on this. Surely enough, others have written git-svn which
accomplishes this and more for those tracking SVN upstreams.

Is there any way we can tweak cvs2svn to run incrementals, even if
not as fast as cvsps/git-cvsimport? The "do it remotely" part can be
worked around in most cases.

cheers,

martin

* Re: cvs2svn conversion directly to git ready for experimentation
  2007-08-02 20:43   ` Linus Torvalds
@ 2007-08-02 23:19     ` Michael Haggerty
  2007-08-03  3:12       ` Shawn O. Pearce
  0 siblings, 1 reply; 40+ messages in thread
From: Michael Haggerty @ 2007-08-02 23:19 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Steffen Prohaska, git, users

Linus Torvalds wrote:
> On Thu, 2 Aug 2007, Steffen Prohaska wrote:
>> Right now, I'd prefer the import by parsecvs because of the
>> simpler history. However, I don't know if I lose history
>> information by doing so. I'd start by a run of cvs2svn to validate
>> the overall structure of the CVS repository.
> 
> Well, once imported, you could just go through the branches and tags, and 
> just delete the ones you consider uninteresting, and then do a "git gc".
> 
> You'd want to re-pack after a fast-import anyway (regardless of the source 
> of the fast-import input), so maybe cvs2svn ends up giving you a bit of
> unnecessary info, but it should be easy enough to get rid of
> after-the-fact.

The real goal is to get cvs2svn to include the useful information and
exclude the rest. :-)

I definitely want to address the problem of the helper branches used to
create tags.  This problem has two aspects:

1. The helper branches should be deleted after the tag has been defined.
 I simply couldn't figure out how to do this using git-fast-import, and
git-fast-import complained when I tried to use a branch called
"TAG_FIXUP" without the "refs/heads/" prefix.

2. The helper branch is not needed at all if an existing revision has
exactly the same contents as needed on the tag.  This requires cvs2svn
to keep a record of which files exist in the complete file tree on every
branch at every revision (which it can already do, though it is
expensive), and also to give it the smarts to choose the optimal tag
point (which it already does, except that it currently doesn't penalize
sources that require files to be deleted before making the tag).
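
The check described in item 2 can be sketched as follows. This is a
hypothetical illustration, not cvs2svn code: find_exact_source and the
dict-based representation of a file tree are assumptions made for the
example.

```python
def find_exact_source(tag_files, commits):
    """Return the id of the first commit whose tree equals the tag's tree.

    tag_files: dict mapping path -> CVS revision wanted on the tag.
    commits:   iterable of (commit_id, tree) pairs, where tree is a dict
               mapping path -> CVS revision present at that commit.
    """
    for commit_id, tree in commits:
        if tree == tag_files:
            return commit_id
    # No commit matches exactly, so a fixup branch would still be needed.
    return None


history = [
    ("c1", {"a.c": "1.1"}),
    ("c2", {"a.c": "1.2", "b.c": "1.1"}),
    ("c3", {"a.c": "1.2", "b.c": "1.1", "c.c": "1.1"}),
]
print(find_exact_source({"a.c": "1.2", "b.c": "1.1"}, history))  # c2
```

A tag that needs a.c 1.2 and b.c 1.1 but must not contain c.c matches c2
only; c3 would require a file to be deleted first, which is exactly the case
that the source selection does not yet penalize.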

If the problem is lots of seemingly-unnecessary merges involving a
vendor branch, then it is time for me or some other volunteer to add the
optimization of allowing branches to be grafted from the vendor branch
to trunk.  I know of the problem and have a good idea how to implement
it; it is just a matter of finding the time to get it done.

If the problem is unlabeled branches that can't be excluded (because
other branches or tags depend on them), then the real problem is that it
is not known which unlabeled branches in individual files correspond to
the same project-wide conceptual branch.  I have considered two
possibilities to improve this situation:

1. Allow unlabeled -- indeed any -- branches to be discarded even if
other branches or tags depend on them.  This could be done by
incorporating the content of the source revision (i.e., the revision on
the unlabeled branch that is going to be discarded) into the zeroth
revision of the daughter branch, then grafting the daughter onto the
branch from which the unlabeled branch sprouted.

2. Rename the unlabeled branches by figuring out which unlabeled branch
in fileA corresponds to which unlabeled branch in fileB, fileC, etc.
This would involve a tricky bit of matching file-wise dependency trees
onto one another to unify unlabeled branch labels, keeping in mind that:

  - The trees have other differences as well.
  - The unlabeled branch does not necessarily occur in every file.
  - There may be multiple unlabeled branches per file.

Michael

* Re: cvs2svn conversion directly to git ready for experimentation
  2007-08-02 19:13     ` Steffen Prohaska
  2007-08-02 19:29       ` Simon 'corecode' Schubert
@ 2007-08-02 23:37       ` Michael Haggerty
  1 sibling, 0 replies; 40+ messages in thread
From: Michael Haggerty @ 2007-08-02 23:37 UTC (permalink / raw)
  To: Steffen Prohaska; +Cc: Simon 'corecode' Schubert, git, users

Steffen Prohaska wrote:
> On Aug 2, 2007, at 7:35 PM, Simon 'corecode' Schubert wrote:
>> Steffen Prohaska wrote:
>>>   fromcvs's togit surrendered during the import.
>>>   fromcvs's tohg accepted more of the history, but finally
>>>     surrendered as well.
>>
>> Which repo is it you are converting?  Is this available somewhere?
> 
> Unfortunately not, the content is a proprietary software package.
> 
>> I'd appreciate any reports concerning "surrenders" of fromcvs. 
>> [...]
> 
> Some time passed since I did the tests. I had no time to do a
> detailed investigation then. I'll have more time now and will
> prepare a bug report, which is not easy because I can't send you
> the cvs repo, sorry.

I wrote a couple of scripts for dealing with just this situation for
cvs2svn bug reports, but they should also work for you, and I highly
recommend them.  Both scripts are included in the cvs2svn source tree:

1. contrib/destroy_repository.py [1] -- strips almost all of the
information out of a CVS repository, including author names, log
messages, and file contents (but not file names, commit dates, or
branch/tag names).  Most bugs are not affected by the omission of such
data.  Use of this script has the effect of deleting most information
that might be considered proprietary and also shrinking the size of the
test case considerably.  Use of this script is described in the script
comments itself and also in [2].

2. contrib/shrink_test_case.py [3] -- you provide the script with a
command that should "exit 0" if the bug you are looking for still
exists.  It does a kind of "binary search" through CVS repository space,
iteratively attempting to delete a chunk of the CVS repository, running
the test command, then (depending on whether the test succeeded) either
reverting or making permanent the deletion.  It can boil most test cases
down to just 1-3 files (though presumably not if the "problem" is a
23-way merge).  The things that it will try to delete are:

  - Entire directories and groups of directories
  - Entire files and groups of files
  - Branches within individual files
  - Tags within individual files

It does this in a somewhat optimal way, trying to minimize the number of
times that the test has to be run.  This script is documented in its own
comments and also in [4].
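
The general idea behind the second script can be sketched like this. It is
a simplified illustration under stated assumptions, not the code of
shrink_test_case.py: the repository is modelled as a plain list of items,
and still_fails stands in for the user-supplied test command.

```python
def shrink(items, still_fails):
    """Greedily delete chunks of items while still_fails(items) stays true."""
    items = list(items)
    chunk = max(len(items) // 2, 1)
    while True:
        start = 0
        while start < len(items):
            candidate = items[:start] + items[start + chunk:]
            if still_fails(candidate):
                # The bug survives without this chunk: keep the deletion
                # and retry at the same offset.
                items = candidate
            else:
                # The chunk was needed: revert the deletion and move on.
                start += chunk
        if chunk == 1:
            return items
        chunk //= 2


# Hypothetical bug that needs both a.c,v and z.c,v to be present.
files = ["a.c,v", "b.c,v", "c.c,v", "d.c,v", "e.c,v", "z.c,v"]
needs_both = lambda fs: "a.c,v" in fs and "z.c,v" in fs
print(shrink(files, needs_both))  # ['a.c,v', 'z.c,v']
```

Trying large chunks first keeps the number of test runs low when most of
the repository is irrelevant to the bug.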

Michael

[1]
http://cvs2svn.tigris.org/svn/cvs2svn/trunk/contrib/destroy_repository.py
[2] http://cvs2svn.tigris.org/faq.html#reportingbugs
[3] http://cvs2svn.tigris.org/svn/cvs2svn/trunk/contrib/shrink_test_case.py
[4] http://cvs2svn.tigris.org/faq.html#testcase

* Re: cvs2svn conversion directly to git ready for experimentation
  2007-08-01 22:09 ` Jakub Narebski
  2007-08-02 16:58   ` Michael Haggerty
@ 2007-08-02 23:44   ` Jon Smirl
  1 sibling, 0 replies; 40+ messages in thread
From: Jon Smirl @ 2007-08-02 23:44 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git, users

On 8/1/07, Jakub Narebski <jnareb@gmail.com> wrote:
> Michael Haggerty wrote:
>
> > I am the maintainer of cvs2svn[1], which is a program for one-time
> > conversions from CVS to Subversion. cvs2svn is very robust against the
> > many peculiarities of CVS and can convert just about every CVS
> > repository we have ever seen.
> >
> > I've been working on a cvs2svn output pass that writes the converted CVS
> > repository directly into git rather than Subversion. The code runs now
> > with at least one repository from our test suite of nasty CVS repositories.
>
> Have you contacted Jon Smirl about his unpublished work on cvs2git,
> cvs2svn based CVS to Git converter?

My converter was derived from Michael's cvs2svn code. The bulk of my
work was converting cvs2svn to output in a format that git-fastimport
could consume. This was all rather straightforward and there was
nothing really interesting in the code.

What it exposed were fundamental issues about the technical
complexities of trying to reconstruct a change set history from CVS
which didn't record all of the needed info.  I was never able to
construct a satisfactory git representation of the Mozilla CVS
repository.  Michael has had a long time to work on the change set
detection code and he's probably added some new strategies.

My code did include a CVS file parser for extracting all the revisions
from the file in a single pass. Doing that is a major performance
benefit.  I believe I posted the code to the cvs2svn mailing list. It
was about 200 lines of code. Forking off cvs a million times to
extract the revisions takes days to run.
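
For background on why a single pass is possible: an RCS file stores one
revision as full text and the neighbouring revisions as line-based deltas,
where "aN M" adds M lines after line N and "dN M" deletes M lines starting
at line N (line numbers refer to the text the delta is applied to). A
minimal sketch of applying one such delta in memory follows; it is not the
parser mentioned above, and the function name is made up.

```python
def apply_rcs_delta(lines, delta):
    """Apply one RCS delta (a list of lines) to a list of text lines."""
    out = []
    pos = 0  # index of the next original line not yet copied or deleted
    i = 0
    while i < len(delta):
        command = delta[i]
        op = command[0]
        start, count = (int(x) for x in command[1:].split())
        i += 1
        if op == "d":
            # Copy everything before line `start`, then skip `count` lines.
            out.extend(lines[pos:start - 1])
            pos = start - 1 + count
        elif op == "a":
            # Copy up to and including line `start`, then add the new lines.
            out.extend(lines[pos:start])
            pos = max(pos, start)
            out.extend(delta[i:i + count])
            i += count
        else:
            raise ValueError("unknown delta command: %r" % command)
    out.extend(lines[pos:])
    return out


old = ["a\n", "b\n", "c\n", "d\n"]
print(apply_rcs_delta(old, ["d2 1", "a3 1", "X\n"]))
```

Walking the delta chain this way yields every revision of a file without
starting an external process for each one.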

Same goes for forking git a million times. git-fastimport uses a pipe
to cvs2svn to avoid forking. git-fastimport also uses a technique from
the database world for bulk import, it imports everything without
indexing it. Indexing is done after the import finishes.
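
As a rough illustration of what travels over that pipe, the sketch below
builds the input for one file and one commit in the git-fast-import stream
format. The helper name, path, author and timestamp are made up for the
example; a real converter would emit many blobs and commits into the same
stream.

```python
def fast_import_stream(path, content, message, author, email, when):
    """Return bytes describing one blob and one commit for git-fast-import."""
    msg = message.encode("utf-8")
    parts = [
        b"blob\n",
        b"mark :1\n",
        # `data` is followed by the exact byte count, then the raw bytes.
        b"data %d\n" % len(content), content, b"\n",
        b"commit refs/heads/master\n",
        b"mark :2\n",
        ("committer %s <%s> %d +0000\n" % (author, email, when)).encode("utf-8"),
        b"data %d\n" % len(msg), msg, b"\n",
        # Attach the blob marked :1 to the commit as a regular file.
        ("M 100644 :1 %s\n" % path).encode("utf-8"),
        b"\n",
    ]
    return b"".join(parts)


stream = fast_import_stream("README", b"hello\n", "initial import",
                            "A U Thor", "author@example.com", 1186012800)
print(stream.decode("utf-8"))
```

The whole stream goes to the standard input of a single git-fast-import
process, so no git command has to be started per revision.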

Between parsing the CVS files internally and Shawn's git-fastimport,
it was possible to import Mozilla CVS (2.4G) in about 2 hours and
generate a 450MB pack file. You need 3GB of RAM to do this - if swap
happens the process will take weeks to finish.

> Quote from InterfacesFrontendsAndTools page on GIT wiki[1]:
>
>   cvs2git is the unofficial name of Jon Smirl's modifications to cvs2svn.
>   These modifications allow cvs2svn to generate a data stream which is
>   consumed by Shawn Pearce's git-fast-import (now included in git.git).
>   git-fast-import converts its input stream directly into a Git .pack file,
>   minimizing the amount of IO required on large imports.
>
>   Jon Smirl stopped working on cvs2git[2] because first, Mozilla (which was
>   the main target of his work) decided not to move to git, and second
>   because of troubles with cvs2svn architecture[*] (which it is based on).
>   Jon Smirl has posted his impressions on working on CVS importer in
>   "Some tips for doing a CVS importer" thread[3].
>
> References:
> -----------
> [1] http://git.or.cz/gitwiki/InterfacesFrontendsAndTools#head-23858c2cde0cef60443d8e73e6829a95f8e191ef
> [2] http://msgid.gmane.org/9e4733910611190940y147992b8mbdfac5a51f42e0fe@mail.gmail.com
> [3] http://marc.theaimsgroup.com/?t=116405956000001&r=1&w=2
>
> Footnotes:
> ----------
> [*] If I remember correctly authors of cvs2svn were talking about separating
> the code dealing with disentangling CVS repository structure from the part
> translating it into Subversion repository (with its quirks), and the part
> generating Subversion repository.
>
> --
> Jakub Narebski
> Warsaw, Poland
> ShadeHawk on #git

-- 
Jon Smirl
jonsmirl@gmail.com

* Re: cvs2svn conversion directly to git ready for experimentation
  2007-08-02 22:50           ` Simon 'corecode' Schubert
@ 2007-08-02 23:50             ` Michael Haggerty
  2007-08-03  8:40               ` Simon 'corecode' Schubert
  2007-08-04  8:28             ` Steffen Prohaska
  1 sibling, 1 reply; 40+ messages in thread
From: Michael Haggerty @ 2007-08-02 23:50 UTC (permalink / raw)
  To: Simon 'corecode' Schubert; +Cc: Steffen Prohaska, Git Mailing List

Simon 'corecode' Schubert wrote:
> Steffen Prohaska wrote:
>> BTW, togit creates much more complex branching patterns than cvs2svn
>> does. The attached file branching.png displays a small view of a
>> branching pattern that extends downwards over a couple of screens.
>> I checked the cvs2svn history again. It doesn't contain anything
>> of similar complexity.
> 
> haha yea, there is still some issue with duplicate branch names and the
> branchpoint.  if it doesn't get the branch right, it will always "pull"
> files from the parent branch.

This sounds very much like the problem reported by Daniel Jacobowitz
[1].  The problem is that if you create a branch A on a file, then
create branch B from branch A before making a commit on branch A, then
CVS doesn't record that branch A was the source of branch B.  (It treats
B as if it sprouted directly from the revision that was the *source* of
branch A.)  The same problem exists if "B" is a tag.

The only way to determine the correct branch hierarchy is to consider
the branch hierarchy of multiple files at the same time.
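
A small sketch of why the information is missing. In an RCS file a branch
symbol is stored as a number such as 1.2.0.4 (branch 1.2.4, sprouting from
revision 1.2) and a tag as a plain revision number. The function below is
an illustration with a made-up name, not cvs2svn code:

```python
def sprout_revision(number):
    """Return the revision that a CVS symbol number is attached to."""
    parts = number.split(".")
    if len(parts) % 2 == 1:
        # Plain branch number such as 1.1.1 (for example a vendor branch).
        return ".".join(parts[:-1])
    if len(parts) >= 4 and parts[-2] == "0":
        # Magic branch number such as 1.2.0.4, meaning branch 1.2.4.
        return ".".join(parts[:-2])
    # Even number of parts and no magic zero: a tag on that revision.
    return number


# Branch A, then branch B created from A before any commit on A, then a tag.
symbols = {"A": "1.2.0.2", "B": "1.2.0.4", "SOME_TAG": "1.2"}
for name, number in sorted(symbols.items()):
    print(name, "sprouts from", sprout_revision(number))
```

All three symbols point at revision 1.2, so nothing in this one file says
that B was created from A rather than directly from the source of A.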

cvs2svn 2.0 includes code to choose a "preferred parent" of each branch
and try to use that parent for every file that is on the branch.  It
helps simplify branch creation quite a bit.  The main limitation is that
it still doesn't consider the revision copied back to trunk from a
vendor branch as the possible parent of a branch whose nominal source
was on the vendor branch (a limitation that has come up elsewhere in
this thread).
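
One simple way to pick such a parent is a majority vote over the files,
sketched below. This is only an assumption about how a "preferred parent"
could be chosen, with hypothetical names and data; the heuristic actually
used by cvs2svn may well differ.

```python
from collections import Counter


def preferred_parents(possible_parents):
    """Pick, for each branch, the parent that is possible in the most files.

    possible_parents: dict mapping file -> dict mapping branch -> set of
    branches that could have been its source in that file.
    """
    votes = {}
    for per_file in possible_parents.values():
        for branch, candidates in per_file.items():
            votes.setdefault(branch, Counter()).update(candidates)
    return {branch: counts.most_common(1)[0][0]
            for branch, counts in votes.items() if counts}


files = {
    # No commit on A before B was created: the source of B is ambiguous.
    "x.c,v": {"B": {"trunk", "A"}},
    # A had a commit here before B was created, so only A is possible.
    "y.c,v": {"B": {"A"}},
}
print(preferred_parents(files))  # {'B': 'A'}
```

Looking at several files at once resolves the ambiguity that each file has
on its own.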

Michael

[1] http://cvs2svn.tigris.org/servlets/ReadMsg?list=dev&msgNo=1441

* Re: cvs2svn conversion directly to git ready for experimentation
  2007-08-02  8:49 ` Steffen Prohaska
                     ` (2 preceding siblings ...)
  2007-08-02 20:43   ` Linus Torvalds
@ 2007-08-02 23:55   ` Jon Smirl
  3 siblings, 0 replies; 40+ messages in thread
From: Jon Smirl @ 2007-08-02 23:55 UTC (permalink / raw)
  To: Steffen Prohaska; +Cc: Michael Haggerty, git, users

On 8/2/07, Steffen Prohaska <prohaska@zib.de> wrote:
> Right now, I'd prefer the import by parsecvs because of the
> simpler history. However, I don't know if I lose history
> information by doing so. I'd start by a run of cvs2svn to validate
> the overall structure of the CVS repository. Dealing with corruption
> in the CVS repository seems to be superior in cvs2svn. It reports
> errors when parsecvs just crashes.

Parsecvs silently throws away things that confuse it. cvs2svn is much
more careful about not losing track of anything. For example parsecvs
is unable to process Mozilla CVS and cvs2svn can. The branching in
Mozilla CVS is too complex for parsecvs to handle.

-- 
Jon Smirl
jonsmirl@gmail.com

* Re: cvs2svn conversion directly to git ready for experimentation
  2007-08-02 17:23   ` Michael Haggerty
  2007-08-02 19:22     ` Marko Macek
@ 2007-08-02 23:59     ` Jon Smirl
  2007-08-05  7:58       ` Oswald Buddenhagen
  1 sibling, 1 reply; 40+ messages in thread
From: Jon Smirl @ 2007-08-02 23:59 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Steffen Prohaska, git, users

On 8/2/07, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> Branches with names like "unlabeled-1.1.1" come from CVS branches for
> which the revisions are still contained in the RCS files but for which
> the branch name has been deleted.  These wreak havoc on cvs2svn's
> attempt to find simple branch sources and cause a proliferation of
> basically useless branches.  The main problem is that cvs2svn does not
> attempt to figure out that "unlabeled-1.2.4" in one file might be the
> same as "unlabeled-1.2.6" in another etc.

I seem to recall discussing an algorithm to fix this on the cvs2svn
mailing list. There was a somewhat simple way to work out that
"unlabeled-1.2.4" in one file might be the same as "unlabeled-1.2.6"
in another.

-- 
Jon Smirl
jonsmirl@gmail.com

* Re: cvs2svn conversion directly to git ready for experimentation
  2007-08-02 19:29       ` Simon 'corecode' Schubert
  2007-08-02 20:21         ` Robin Rosenberg
  2007-08-02 22:02         ` Steffen Prohaska
@ 2007-08-03  3:07         ` Shawn O. Pearce
  2 siblings, 0 replies; 40+ messages in thread
From: Shawn O. Pearce @ 2007-08-03  3:07 UTC (permalink / raw)
  To: Simon 'corecode' Schubert
  Cc: Steffen Prohaska, Michael Haggerty, git, users

Simon 'corecode' Schubert <corecode@fs.ei.tum.de> wrote:
> Steffen Prohaska wrote:
> >I remember that togit reported a broken pipe. My feeling was
> >that git-fastimport aborted, which may be the reason why tohg
> >worked better.
> 
> yah, that pretty much tells me it is shawn's bug :)  but without more 
> details, it is very hard to diagnose.  tohg should tell you which rcs revs 
> are the offenders.  be sure to use a recent fromcvs however.

Tonight I'm going to try and add crash dump reporting to fast-import.
Once that's in it should make debugging some of these failed imports
easier, as we'll be able to see the immediate commands leading up
to the crash and the internal state of fast-import when it barfed.

Of course one needs to locate an ugly repository and run on it...

-- 
Shawn.

* Re: cvs2svn conversion directly to git ready for experimentation
  2007-08-02 23:19     ` Michael Haggerty
@ 2007-08-03  3:12       ` Shawn O. Pearce
  0 siblings, 0 replies; 40+ messages in thread
From: Shawn O. Pearce @ 2007-08-03  3:12 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Linus Torvalds, Steffen Prohaska, git, users

Michael Haggerty <mhagger@alum.mit.edu> wrote:
> 1. The helper branches should be deleted after the tag has been defined.
>  I simply couldn't figure out how to do this using git-fast-import, and
> git-fast-import complained when I tried to use a branch called
> "TAG_FIXUP" without the "refs/heads/" prefix.

Two issues there:

* Deleting branches:

  I currently don't support this in fast-import, but I'll add support
  for it.  It's actually pretty simple to tell it to drop a branch,
  especially if the dang thing doesn't actually exist in the git
  repository yet (because it's only in-memory).

* Creating a branch without refs/heads/ prefix:

  This is a bug.  I had good intentions by trying to verify the
  name was one that didn't contain special reserved characters,
  but I wound up also requiring you to create branches only in the
  refs/heads/ namespace.  That was not what I wanted to do.  I'm
  patching it tonight.

-- 
Shawn.

* Re: cvs2svn conversion directly to git ready for experimentation
  2007-08-02 23:08     ` Martin Langhoff
@ 2007-08-03  4:03       ` Johannes Schindelin
  2007-08-03  6:48         ` Steffen Prohaska
  2007-08-03  7:10       ` Steffen Prohaska
  2007-08-03  8:36       ` Michael Haggerty
  2 siblings, 1 reply; 40+ messages in thread
From: Johannes Schindelin @ 2007-08-03  4:03 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Michael Haggerty, Guilhem Bonnefille, git, users

Hi,

On Fri, 3 Aug 2007, Martin Langhoff wrote:

> On 8/3/07, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> > cvsps is not a conversion tool at all, though it is used by other
> > conversion tools to generate the changesets.  It appears (I hope I am
> > not misinterpreting things) to emphasize speed and incremental
> > operation, for example attempting to make changesets consistent from one
> > run to the next, even if the CVS repository has been changed prudently
> > between runs.  cvsps does not appear to attempt to create atomic branch
> > and tag creation commits or handle CVS's special vendorbranch behavior.
> >  cvsps operates via the CVS protocol; you don't need filesystem access
> > to the CVS repository.
> 
> 100% in agreement. And though I can't claim to be happy with cvsps, in
> many scenarios it is mighty useful, in spite of its significant warts.
>  The "does incrementals" is hugely important these days, as lots of
> people use git to run "vendor branches" of upstream projects that use
> CVS.

Me too: 100% agreement.  A couple of people seem to be content to proclaim 
that their incomplete solutions are better, but at the end of the day,
they are as bad as the programs they purport to replace: incomplete.

For the moment, I help myself with tracking the different branches 
individually, but there, really, git-cvsimport is as good as the other 
"solutions", with the further advantage that they are actually hackable, 
and not closed to everybody outside a very small community.

So I look forward to testing cvs2svn(git-branch) this weekend.

Ciao,
Dscho

* Re: cvs2svn conversion directly to git ready for experimentation
  2007-08-03  4:03       ` Johannes Schindelin
@ 2007-08-03  6:48         ` Steffen Prohaska
  0 siblings, 0 replies; 40+ messages in thread
From: Steffen Prohaska @ 2007-08-03  6:48 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Martin Langhoff, Michael Haggerty, Guilhem Bonnefille, git, users

On Aug 3, 2007, at 6:03 AM, Johannes Schindelin wrote:

> On Fri, 3 Aug 2007, Martin Langhoff wrote:
>
>> On 8/3/07, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>>> cvsps is not a conversion tool at all, though it is used by other
>>> conversion tools to generate the changesets.  It appears (I hope I am
>>> not misinterpreting things) to emphasize speed and incremental
>>> operation, for example attempting to make changesets consistent from one
>>> run to the next, even if the CVS repository has been changed prudently
>>> between runs.  cvsps does not appear to attempt to create atomic branch
>>> and tag creation commits or handle CVS's special vendorbranch behavior.
>>>  cvsps operates via the CVS protocol; you don't need filesystem access
>>> to the CVS repository.
>>
>> 100% in agreement. And though I can't claim to be happy with cvsps, in
>> many scenarios it is mighty useful, in spite of its significant warts.
>>  The "does incrementals" is hugely important these days, as lots of
>> people use git to run "vendor branches" of upstream projects that use
>> CVS.
>
> Me too: 100% agreement.  A couple of people seem to be content to proclaim
> that their incomplete solutions are better, but at the end of the day,
> they are as bad as the programs they purport to replace: incomplete.
>
> For the moment, I help myself with tracking the different branches
> individually, but there, really, git-cvsimport is as good as the other
> "solutions", with the further advantage that they are actually hackable,
> and not closed to everybody outside a very small community.

I just want to add a warning. You should be suspicious of branches
imported using git-cvsimport (which is based on cvsps). If the time the
branch is created differs from the time of the first commit to the
branch, git-cvsimport may get the branching point wrong. This introduces
a race condition: someone may have committed changes to a file that is
later changed on the branch. At that point the history of the imported
branch is broken and git reports _wrong_ changesets.
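
To make this failure mode concrete, here is a minimal Python sketch. It
is an illustration only: it assumes a naive importer that guesses the
branch point from commit times, which is not claimed to be the actual
cvsps or git-cvsimport algorithm. The Commit class, the
guess_branch_point() function and the sample history are all
hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    time: int       # seconds since some arbitrary epoch
    branch: str     # "trunk" or a branch name
    revision: str   # CVS revision created by this commit


def guess_branch_point(commits, branch):
    """Naive guess: the last trunk commit before the first commit on `branch`."""
    first_on_branch = min(c.time for c in commits if c.branch == branch)
    earlier_trunk = [c for c in commits
                     if c.branch == "trunk" and c.time < first_on_branch]
    return max(earlier_trunk, key=lambda c: c.time).revision


# History of one file.  The branch "stable" was created from 1.1 (CVS records
# no commit for that), trunk then moved on to 1.2, and only afterwards did the
# branch receive its first commit.
history = [
    Commit(time=100, branch="trunk", revision="1.1"),
    Commit(time=200, branch="trunk", revision="1.2"),
    Commit(time=300, branch="stable", revision="1.1.2.1"),
]

true_branch_point = "1.1"   # implied by the revision number 1.1.2.1
guessed = guess_branch_point(history, "stable")
print("guessed:", guessed, "actual:", true_branch_point)
```

In this sketch the guess lands on the later trunk revision, so the
first changeset on the imported branch would appear to undo a trunk
change that was never on the branch in the first place.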

I ran into this issue and abandoned the use of git-cvsimport. It's too
dangerous for me. The testcase in [1] illustrates the problem. I still
strongly believe the warning should be stated in *BOLD* in the
documentation.

I'm not saying git-cvsimport is useless. But you should be suspicious
about the result of the import, especially if you plan to rely on
changesets derived from the imported repo, for example if you plan to do
cherry-picking or merging in git; or if you plan to blame people for
their stupid changes based on what you see in gitk (almost happened to
me ;).

	Steffen

[1] http://marc.info/?l=git&m=118260312708709&w=2

* Re: cvs2svn conversion directly to git ready for experimentation
  2007-08-02 23:08     ` Martin Langhoff
  2007-08-03  4:03       ` Johannes Schindelin
@ 2007-08-03  7:10       ` Steffen Prohaska
  2007-08-03  8:36       ` Michael Haggerty
  2 siblings, 0 replies; 40+ messages in thread
From: Steffen Prohaska @ 2007-08-03  7:10 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Michael Haggerty, Guilhem Bonnefille, git, users

On Aug 3, 2007, at 1:08 AM, Martin Langhoff wrote:

> Is there any way we can run tweak cvs2svn to run incrementals, even if
> not as fast as cvsps/git-cvsimport? The "do it remotely" part can be
> worked around in most cases.

What I currently do with parsecvs is to run complete imports again
on the repo. For 'normal' changes to cvs the old import can be fast
forwarded to the new import. However, if you add or remove files or
tweak revisions in some other abnormal way (cvs admin) this might fail.

In this case I manually search for the last common commit and rebase
the new commits onto the old, already imported branch. I need to do this
if I have already published the imported branch. Otherwise I can just as
well reset to the newly imported branch and rebase my work on top of it.
Some careful validation (git diff-*) is included in my workflow.

A complete run of parsecvs is fine for me because it is so fast. I run
git-filter-branch afterwards anyway to clean up some commit messages
and author information. This takes most of the time, because it spawns
tons of subprocesses.

I'd not recommend my approach for incremental imports every hour, but
you can run it every day (although I do less often). You only need to
validate the final result (fast forward or not). The rest can be fully
automated by some shell scripting.
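
The "fast forward or not" validation can be sketched roughly as
follows. This is a hypothetical Python illustration working on an
in-memory parent map (commit id -> list of parent ids), not the shell
scripting actually used here; the names ancestors() and
is_fast_forward() are made up for the example.

```python
def ancestors(parents, tip):
    """Return the set containing `tip` and every commit reachable from it."""
    seen = set()
    stack = [tip]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, ()))
    return seen


def is_fast_forward(parents, old_tip, new_tip):
    """The old import can be fast-forwarded if it is an ancestor of the new one."""
    return old_tip in ancestors(parents, new_tip)


# c1 <- c2 <- c3 is the normal case; x2/x3 is a re-import in which history
# after c1 was rewritten (e.g. after a retroactive change in CVS).
parents = {
    "c1": [],
    "c2": ["c1"],
    "c3": ["c2"],
    "x2": ["c1"],
    "x3": ["x2"],
}

print(is_fast_forward(parents, "c2", "c3"))
print(is_fast_forward(parents, "c2", "x3"))
```

Against a real repository the same question can be asked with
"git merge-base --is-ancestor <old> <new>", which reports the answer
through its exit status. Only when the check fails is the manual
search for the last common commit needed.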

	Steffen

* Re: cvs2svn conversion directly to git ready for experimentation
  2007-08-02 23:08     ` Martin Langhoff
  2007-08-03  4:03       ` Johannes Schindelin
  2007-08-03  7:10       ` Steffen Prohaska
@ 2007-08-03  8:36       ` Michael Haggerty
  2007-08-03 14:35         ` Patwardhan, Rajesh
  2 siblings, 1 reply; 40+ messages in thread
From: Michael Haggerty @ 2007-08-03  8:36 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Guilhem Bonnefille, git, users

Martin Langhoff wrote:
> Is there any way we can run tweak cvs2svn to run incrementals, even if
> not as fast as cvsps/git-cvsimport? The "do it remotely" part can be
> worked around in most cases.

I don't see any fundamental reason why not, but I think it would be a
significant amount of work.  There are two main issues:

1. With CVS, it is possible to change things retroactively, such as
changing which version of a file is included in a tag, or adding a new
file to a tag, or changing whether a file is text vs. binary.  And many
people copy and/or rename files within the CVS repository itself (to get
around CVS's inability to rename a file).  This makes it look like the
file has *always* existed under the new name and *never* existed under
the old name.  An incremental conversion tool would have to look
carefully for such changes and either handle them properly or complain
loudly and abort.

2. cvs2svn uses a lot of repository-wide information to make decisions
about how to group CVSItems into changesets, and a lot of these
decisions are based on heuristics.  Incremental conversion would require
that the decisions made in one cvs2svn run are recorded and treated as
unalterable in subsequent runs.
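
As a rough illustration of the first issue only, the "look carefully
and complain loudly" step could compare a metadata snapshot saved by
the previous run with the current state of the repository. The Python
sketch below is hypothetical: the snapshot layout, the function name
retroactive_changes() and the last_run parameter are assumptions made
for the example and do not correspond to anything in cvs2svn.

```python
def retroactive_changes(previous, current, last_run):
    """List changes to history that an earlier run has already converted.

    Each snapshot maps a file path to a dict with the keys
    "revisions" ({revision: timestamp}), "symbols" ({name: revision})
    and "binary" (bool).  `last_run` is the time of the previous run.
    """
    problems = []
    for path, old in previous.items():
        new = current.get(path)
        if new is None:
            problems.append(f"{path}: file vanished (renamed in the repository?)")
            continue
        for rev, stamp in old["revisions"].items():
            if new["revisions"].get(rev) != stamp:
                problems.append(f"{path}: revision {rev} changed or disappeared")
        for name, rev in old["symbols"].items():
            if new["symbols"].get(name) != rev:
                problems.append(f"{path}: symbol {name} moved or was deleted")
        if old["binary"] != new["binary"]:
            problems.append(f"{path}: text/binary flag changed")
    for path, new in current.items():
        # A file unknown to the last run must not carry history that predates it.
        if path not in previous and any(t <= last_run
                                        for t in new["revisions"].values()):
            problems.append(f"{path}: new file with history older than the last run")
    return problems


previous = {
    "src/main.c": {"revisions": {"1.1": 100, "1.2": 200},
                   "symbols": {"REL_1": "1.1"}, "binary": False},
}
current = {
    "src/main.c": {"revisions": {"1.1": 100, "1.2": 200, "1.3": 400},
                   "symbols": {"REL_1": "1.2"}, "binary": False},
    "src/renamed.c": {"revisions": {"1.1": 50}, "symbols": {}, "binary": False},
}

problems = retroactive_changes(previous, current, last_run=300)
for line in problems:
    print(line)
```

An incremental run would abort if the returned list is non-empty. The
second issue (freezing earlier changeset decisions) would need its own
persistent record and is not covered by this sketch.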

This hasn't been a priority in the Subversion world, because, frankly,
what reason would a person have to stick with CVS instead of switching
to Subversion, given that (1) they are intentionally so similar in
workflow, and (2) there is no significant competition from other
centralized SCMs?  But of course until the distributed SCM playing field
has been thinned out a bit, people will probably be reluctant to commit
to one or the other.

I don't expect to have time to implement incremental conversions in
cvs2svn in the near future.  (I'd much rather work on output back ends
to other distributed SCMs.)  But if any volunteers step forward (hint,
hint) I would be happy to help them get started and answer their
questions.  I think that cvs2svn is quite hackable now, so the learning
curve is hopefully much less frightening than when I started on the
project :-)

Michael

* Re: cvs2svn conversion directly to git ready for experimentation
  2007-08-02 23:50             ` Michael Haggerty
@ 2007-08-03  8:40               ` Simon 'corecode' Schubert
  0 siblings, 0 replies; 40+ messages in thread
From: Simon 'corecode' Schubert @ 2007-08-03  8:40 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Steffen Prohaska, Git Mailing List

Michael Haggerty wrote:
> Simon 'corecode' Schubert wrote:
>> Steffen Prohaska wrote:
>>> BTW, togit creates much more complex branching patterns than cvs2svn
>>> does. The attached file branching.png displays a small view of a
>>> branching pattern that extends downwards over a couple of screens.
>>> I checked the cvs2svn history again. It doesn't contain anything
>>> of similar complexity.
>> haha yea, there is still some issue with duplicate branch names and the
>> branchpoint.  if it doesn't get the branch right, it will always "pull"
>> files from the parent branch.
> 
> This sounds very much like the problem reported by Daniel Jacobowitz
> [1].  The problem is that if you create a branch A on a file, then
> create branch B from branch A before making a commit on branch A, then
> CVS doesn't record that branch A was the source of branch B.  (It treats
> B as if it sprouted directly from the revision that was the *source* of
> branch A.)  The same problem exists if "B" is a tag.

I think I have covered this case quite well.  I believe "my" problem
happens when there are files being copied manually within the repository
and then branch names being changed (or just branch names being
changed).  However, the name change just happens only on a subset of
files and branches, so you wind up with a commit which is part of two
branches.  Or something like that.  I really should have the time to
investigate this.

One elementary problem with CVS is that you can assign two branch names
to the same branch.  During conversion you need to choose one over the
other.

cheers
  simon

-- 
Serve - BSD     +++  RENT this banner advert  +++    ASCII Ribbon   /"\
Work - Mac      +++  space for low €€€ NOW!1  +++      Campaign     \ /
Party Enjoy Relax   |   http://dragonflybsd.org      Against  HTML   \
Dude 2c 2 the max   !   http://golden-apple.biz       Mail + News   / \

* RE:  Re: cvs2svn conversion directly to git ready for experimentation
  2007-08-03  8:36       ` Michael Haggerty
@ 2007-08-03 14:35         ` Patwardhan, Rajesh
  2007-08-03 15:41           ` Jon Smirl
  0 siblings, 1 reply; 40+ messages in thread
From: Patwardhan, Rajesh @ 2007-08-03 14:35 UTC (permalink / raw)
  To: Michael Haggerty, Martin Langhoff; +Cc: Guilhem Bonnefille, git, users

Hello Michael,
I will explain a scenario (we are going through this right now):
1) We have 10 years' worth of cvs data.
2) We want to move to svn.
3) The repository move should happen in such a way that development is
   not hampered for even 1 work day.
4) We have at least 4 major modules in cvs, each of which currently
   takes about 30 - 40 hours to convert.
5) With incremental conversions we can do a few things ...
	A) Keep the downtime for the hard cutoff minimal.
	B) Try out the svn move for the other auxiliary tools that are
	   needed by the SCM process.
	C) Do some meaningful testing and validation with simulated live
	   moves of changes from cvs to svn on a day-to-day basis, before
	   the actual move.

Hopefully this substantiates the request / need for incremental
moves. Or if someone out there has a better suggestion for such
scenarios, please point me in the right direction.

Regards,
Rajesh 

* Re: Re: cvs2svn conversion directly to git ready for experimentation
  2007-08-03 14:35         ` Patwardhan, Rajesh
@ 2007-08-03 15:41           ` Jon Smirl
  2007-08-03 16:42             ` Patwardhan, Rajesh
  2007-08-03 18:58             ` Michael Haggerty
  0 siblings, 2 replies; 40+ messages in thread
From: Jon Smirl @ 2007-08-03 15:41 UTC (permalink / raw)
  To: Patwardhan, Rajesh
  Cc: Michael Haggerty, Martin Langhoff, Guilhem Bonnefille, git, users

On 8/3/07, Patwardhan, Rajesh <rajesh.patwardhan@etrade.com> wrote:
>
> Hello Michael,
> I will explain a scenario (we are passing thru this right now)
> 1) you have 10 years worth of cvs data.
> 2) We want to move to svn.
> 3) The repository move should be in such a way that the development does
> not get hampered for any 1 work day.
> 4) We have atleast 4 major modules in cvs which takes about 30 - 40
> hours each for conversion currently.

There are known ways (that haven't been implemented) to get the 40 hr
number down to 1/2 hour. Would that be a better approach than doing
incremental imports?

> 5) With increamental conversions we can do a few things ...
>         A) Keep the downtime for hard cutoff minimal
>         B) try out the svn move for other auxillary tools that are
> needed by the SCM process.
>         C) Do some meaningful testing and validation with simulated live
> moves of changes from cvs to svn before the actual move on a day to day
> basis.
>
> Hopefuly this would substantiate the request \ need for increamental
> moves. Or if someone out there has a better suggestion for such
> scenario's please point me in the right direction.
>
> Regards,
> Rajesh

-- 
Jon Smirl
jonsmirl@gmail.com

* RE: Re: cvs2svn conversion directly to git ready for experimentation
  2007-08-03 15:41           ` Jon Smirl
@ 2007-08-03 16:42             ` Patwardhan, Rajesh
  2007-08-03 18:58             ` Michael Haggerty
  1 sibling, 0 replies; 40+ messages in thread
From: Patwardhan, Rajesh @ 2007-08-03 16:42 UTC (permalink / raw)
  To: Jon Smirl
  Cc: Michael Haggerty, Martin Langhoff, Guilhem Bonnefille, git, users

Thank you very much for the email.
Yes, if the conversion time can be brought down to 1/2 hour then that
would be really great.
We could run an automated cvs2svn conversion every day for testing, and
that way the maximum lag between cvs and the test svn repo would be 1 day.
Please do let me know when this is available.
Regards,
Rajesh 

* Re: cvs2svn conversion directly to git ready for experimentation
  2007-08-03 15:41           ` Jon Smirl
  2007-08-03 16:42             ` Patwardhan, Rajesh
@ 2007-08-03 18:58             ` Michael Haggerty
  2007-08-03 20:16               ` Jon Smirl
  1 sibling, 1 reply; 40+ messages in thread
From: Michael Haggerty @ 2007-08-03 18:58 UTC (permalink / raw)
  To: Jon Smirl
  Cc: Patwardhan, Rajesh, Martin Langhoff, Guilhem Bonnefille, git, users

[I set followup-to users@cvs2svn.tigris.org, since this has nothing to
do with git.]

Jon Smirl wrote:
> On 8/3/07, Patwardhan, Rajesh <rajesh.patwardhan@etrade.com> wrote:
>> Hello Michael,
>> I will explain a scenario (we are passing thru this right now)
>> 1) you have 10 years worth of cvs data.
>> 2) We want to move to svn.
>> 3) The repository move should be in such a way that the development does
>> not get hampered for any 1 work day.
>> 4) We have atleast 4 major modules in cvs which takes about 30 - 40
>> hours each for conversion currently.
> 
> There are known ways (that haven't been implemented) to get the 40 hr
> number down to 1/2 hour. Would that be a better approach than doing
> incremental imports?

Jon, I would like very much to hear how you propose to get a 60-fold
speed increase in cvs2svn.  I've never heard of any plausible way to
accomplish anything even close to this.

Please note that the user wants to convert to Subversion, not git.  But
even converting to git, I don't think that such speeds are possible
without massive changes that would include processing everything in RAM
and switching large parts of cvs2svn from Python to a compiled language.

Michael

* Re: cvs2svn conversion directly to git ready for experimentation
  2007-08-03 18:58             ` Michael Haggerty
@ 2007-08-03 20:16               ` Jon Smirl
  2007-08-03 20:27                 ` Jon Smirl
  0 siblings, 1 reply; 40+ messages in thread
From: Jon Smirl @ 2007-08-03 20:16 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: Patwardhan, Rajesh, Martin Langhoff, Guilhem Bonnefille, git, users

On 8/3/07, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> [I set followup-to users@cvs2svn.tigris.org, since this has nothing to
> do with git.]
>
> Jon Smirl wrote:
> > On 8/3/07, Patwardhan, Rajesh <rajesh.patwardhan@etrade.com> wrote:
> >> Hello Michael,
> >> I will explain a scenario (we are passing thru this right now)
> >> 1) you have 10 years worth of cvs data.
> >> 2) We want to move to svn.
> >> 3) The repository move should be in such a way that the development does
> >> not get hampered for any 1 work day.
> >> 4) We have atleast 4 major modules in cvs which takes about 30 - 40
> >> hours each for conversion currently.
> >
> > There are known ways (that haven't been implemented) to get the 40 hr
> > number down to 1/2 hour. Would that be a better approach than doing
> > incremental imports?
>
> Jon, I would like very much to hear how you propose to get an 60-fold
> speed increase in cvs2svn.  I've never heard of any plausible way to
> accomplish anything even close to this.
>
> Please note that the user wants to convert to Subversion, not git.  But
> even converting to git, I don't think that such speeds are possible
> without massive changes that would include processing everything in RAM
> and switching large parts of cvs2svn from Python to a compiled language.

Make a bulk importer for SVN like git-fast-import. I measured some SVN
imports and the bulk of the time was spent forking off SVN. Before
git-fast-import it would have taken git two weeks to import Mozilla
CVS.


-- 
Jon Smirl
jonsmirl@gmail.com

* Re: cvs2svn conversion directly to git ready for experimentation
  2007-08-03 20:16               ` Jon Smirl
@ 2007-08-03 20:27                 ` Jon Smirl
  0 siblings, 0 replies; 40+ messages in thread
From: Jon Smirl @ 2007-08-03 20:27 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: Patwardhan, Rajesh, Martin Langhoff, Guilhem Bonnefille, git, users

On 8/3/07, Jon Smirl <jonsmirl@gmail.com> wrote:
> Make a bulk importer for SVN like git-fastimport. I measured some SVN
> imports and the bulk of the time was spent forking off SVN. Before
> git-fast import it would have taken git two weeks to import Mozilla
> CVS.

And add a CVS parser to cvs2svn. Use the one I posted or write it again.
Fork is not a very fast operation; millions of forks take a week to run.

In the cvs2git code I did there was one process running cvs2svn and it
parsed the CVS files internally. A second process ran git-fast-import.
Nothing else was forked.

When I first started we were forking both git and cvs. When I ran
oprofile on it 95% of the CPU time was being spent in the kernel.
Linus helped me figure out what was going on. It was the overhead of
page table copies associated with millions of forks that was taking so
long. The solution is to eliminate the forks.

My first try with forks for both cvs and git took about a week to
import Mozilla CVS. After all the forks were eliminated I could import
Mozilla CVS in four hours.
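
The shape of the "no forks" approach can be sketched in a few lines of
Python: the converter writes every blob and commit into one stream that
a single long-running git fast-import process consumes, instead of
spawning a git command per revision. This is an illustration under
assumptions, not the cvs2git code mentioned above; write_commit(), the
committer identity and the sample data are hypothetical, while the
stream commands themselves (blob, mark, data, commit, committer, M)
are the ones git fast-import reads.

```python
import io


def write_commit(out, mark, path, content, message, when,
                 ref="refs/heads/master",
                 committer="Importer <importer@example.invalid>"):
    """Append one blob and one commit touching `path` to a fast-import stream.

    Uses marks `mark` (blob) and `mark + 1` (commit) and returns the next
    free mark.  `content` is bytes, `when` is seconds since the Unix epoch.
    """
    blob_mark = mark
    commit_mark = mark + 1
    msg = message.encode("utf-8")

    out.write(b"blob\n")
    out.write(b"mark :%d\n" % blob_mark)
    out.write(b"data %d\n" % len(content))
    out.write(content + b"\n")

    out.write(b"commit %s\n" % ref.encode("ascii"))
    out.write(b"mark :%d\n" % commit_mark)
    out.write(b"committer %s %d +0000\n" % (committer.encode("utf-8"), when))
    out.write(b"data %d\n" % len(msg))
    out.write(msg + b"\n")
    out.write(b"M 100644 :%d %s\n" % (blob_mark, path.encode("utf-8")))
    out.write(b"\n")
    return commit_mark + 1


# Build a two-commit stream in memory; no process is started here.
stream = io.BytesIO()
next_mark = write_commit(stream, 1, "README", b"hello\n",
                         "Initial import", 1186000000)
next_mark = write_commit(stream, next_mark, "README", b"hello again\n",
                         "Second change", 1186000600)
print(stream.getvalue().decode("utf-8"))
```

For a real import, `out` would be the stdin pipe of one
"git fast-import" process started once (for example with
subprocess.Popen), so the number of forks stays constant no matter how
many revisions are converted.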

-- 
Jon Smirl
jonsmirl@gmail.com

* Re: cvs2svn conversion directly to git ready for experimentation
  2007-08-02 22:50           ` Simon 'corecode' Schubert
  2007-08-02 23:50             ` Michael Haggerty
@ 2007-08-04  8:28             ` Steffen Prohaska
  1 sibling, 0 replies; 40+ messages in thread
From: Steffen Prohaska @ 2007-08-04  8:28 UTC (permalink / raw)
  To: Simon 'corecode' Schubert; +Cc: Michael Haggerty, Git Mailing List


On Aug 3, 2007, at 12:50 AM, Simon 'corecode' Schubert wrote:

> Steffen Prohaska wrote:
>>> yah, that pretty much tells me it is shawn's bug :)  but without more
>>> details, it is very hard to diagnose.
>> I tried again. Interestingly now togit works but tohg still fails.
>> togit starts with reporting
>> fatal: Not a valid object name
>
> that's fine.

Looks a bit scary. Could you hide the message from the user
if it's fine?

>> as the first line. But besides that it seems to work fine. What
>> concerns me a bit is that the last line togit reports is
>> committing set 18100/18173
>> I'd expect it should report 18173/18173.
>
> that's fine as well.  You only saw multiples of 100, but you didn't
> consider it would skip the intermediate ones, right? :)

I don't care about the intermediates, but only about the
last one. I'd expect that a successful import would report
as the last line 18173/18173. If the first number is smaller
than the second, this indicates to me that there's something
left to do.
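
What I have in mind is something like the following Python sketch,
which reports every 100th set and also always the final one. It is a
hypothetical illustration and not taken from togit; the function name
progress_lines() is made up, and only the message text follows the
output quoted above.

```python
def progress_lines(total, step=100):
    """Yield a progress line every `step` sets and always for the last set."""
    for done in range(1, total + 1):
        if done % step == 0 or done == total:
            yield f"committing set {done}/{total}"


lines = list(progress_lines(18173))
print(lines[-2])
print(lines[-1])
```

With this, a run that finishes normally ends on a line where both
numbers agree, and a total that happens to be a multiple of the step
is still reported only once.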


>> BTW, togit creates much more complex branching patterns than cvs2svn
>> does. The attached file branching.png displays a small view of a
>> branching pattern that extends downwards over a couple of screens.
>> I checked the cvs2svn history again. It doesn't contain anything
>> of similar complexity.
>
> haha yea, there is still some issue with duplicate branch names and the
> branchpoint.  if it doesn't get the branch right, it will always "pull"
> files from the parent branch.
>
> did you do some manual RCS file copying or manual branch name changing
> of individual files?  this could be the reason.  I still have to find a
> simple repo to reproduce this.

Maybe, the repo is 8 years old. It started before I joined the
development.

	Steffen

* Re: cvs2svn conversion directly to git ready for experimentation
  2007-08-02 23:59     ` Jon Smirl
@ 2007-08-05  7:58       ` Oswald Buddenhagen
  0 siblings, 0 replies; 40+ messages in thread
From: Oswald Buddenhagen @ 2007-08-05  7:58 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Michael Haggerty, Steffen Prohaska, git, users

On Thu, Aug 02, 2007 at 07:59:41PM -0400, Jon Smirl wrote:
> I seem to recall discussing an algorithm  to fix this on the cvs2svn
> mailing list. There was a somewhat simple way to correlate the
> "unlabeled-1.2.4" in one file might be the same as "unlabeled-1.2.6"
> problem.
> 
yes, name them after the first symbol that appears on them. like
unlabeled-1.2.4 being named __KDE_3_5_RELEASE because such a tag
(without the underscores, obviously) appears on it.
the naive per-file implementation doesn't get you that far, though.
again, one'd have to collect data from all files first, correlate
it and make a "majority vote". very similar to your favorite symbol
source problem. ;)
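
a rough python sketch of the "collect first, then vote" idea, under one
possible reading of it: every per-file unlabeled branch is named after
whichever of its symbols is carried by the most per-file branches
overall, with the earliest symbol winning ties. the input layout and
the name name_unlabeled_branches() are hypothetical, and this is not
how cvs2svn does it.

```python
from collections import Counter


def name_unlabeled_branches(symbols_by_branch):
    """Pick a name for each per-file unlabeled branch by majority vote.

    `symbols_by_branch` maps (path, branch_id) to the list of symbols that
    appear on that branch, in order of appearance.  Branches without any
    symbol get None.
    """
    # First pass over all files: count how many branches carry each symbol.
    votes = Counter()
    for symbols in symbols_by_branch.values():
        votes.update(set(symbols))

    # Second pass: per branch, the most widely shared symbol wins;
    # among equally popular symbols the one that appears first wins.
    names = {}
    for key, symbols in symbols_by_branch.items():
        if not symbols:
            names[key] = None
            continue
        names[key] = max(symbols,
                         key=lambda s: (votes[s], -symbols.index(s)))
    return names


symbols_by_branch = {
    ("a.c", "unlabeled-1.2.4"): ["KDE_3_5_RELEASE"],
    ("b.c", "unlabeled-1.2.6"): ["LOCAL_TAG", "KDE_3_5_RELEASE"],
    ("c.c", "unlabeled-1.1.2"): [],
}

names = name_unlabeled_branches(symbols_by_branch)
for key, name in names.items():
    print(key, "->", name)
```

the per-file version would have named the branch in b.c after
LOCAL_TAG; only the repository-wide count shows that it belongs with
the branch in a.c.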

-- 
Hi! I'm a .signature virus! Copy me into your ~/.signature, please!
--
Chaos, panic, and disorder - my work here is done.

end of thread, other threads:[~2007-08-05  7:58 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-08-01  0:09 cvs2svn conversion directly to git ready for experimentation Michael Haggerty
2007-08-01  0:41 ` Johannes Schindelin
2007-08-01 22:09 ` Jakub Narebski
2007-08-02 16:58   ` Michael Haggerty
2007-08-02 23:44   ` Jon Smirl
2007-08-02  8:49 ` Steffen Prohaska
2007-08-02 17:23   ` Michael Haggerty
2007-08-02 19:22     ` Marko Macek
2007-08-02 23:59     ` Jon Smirl
2007-08-05  7:58       ` Oswald Buddenhagen
2007-08-02 17:35   ` Simon 'corecode' Schubert
2007-08-02 19:13     ` Steffen Prohaska
2007-08-02 19:29       ` Simon 'corecode' Schubert
2007-08-02 20:21         ` Robin Rosenberg
     [not found]           ` <200708022221.13129.robin.rosenberg.lists-RgPrefM1rjDQT0dZR+AlfA@public.gmane.org>
2007-08-02 20:31             ` Lübbe Onken
2007-08-02 20:32           ` Lübbe Onken
2007-08-02 20:33           ` Lübbe Onken
2007-08-02 22:02         ` Steffen Prohaska
2007-08-02 22:50           ` Simon 'corecode' Schubert
2007-08-02 23:50             ` Michael Haggerty
2007-08-03  8:40               ` Simon 'corecode' Schubert
2007-08-04  8:28             ` Steffen Prohaska
2007-08-03  3:07         ` Shawn O. Pearce
2007-08-02 23:37       ` Michael Haggerty
2007-08-02 20:43   ` Linus Torvalds
2007-08-02 23:19     ` Michael Haggerty
2007-08-03  3:12       ` Shawn O. Pearce
2007-08-02 23:55   ` Jon Smirl
     [not found] ` <8b65902a0708010438s24d16109k601b52c04cf9c066@mail.gmail.com>
2007-08-02 15:34   ` Michael Haggerty
2007-08-02 23:08     ` Martin Langhoff
2007-08-03  4:03       ` Johannes Schindelin
2007-08-03  6:48         ` Steffen Prohaska
2007-08-03  7:10       ` Steffen Prohaska
2007-08-03  8:36       ` Michael Haggerty
2007-08-03 14:35         ` Patwardhan, Rajesh
2007-08-03 15:41           ` Jon Smirl
2007-08-03 16:42             ` Patwardhan, Rajesh
2007-08-03 18:58             ` Michael Haggerty
2007-08-03 20:16               ` Jon Smirl
2007-08-03 20:27                 ` Jon Smirl
