* I have end-of-lifed cvsps
@ 2013-12-12  0:17 Eric S. Raymond
  2013-12-12  3:38 ` Martin Langhoff
  0 siblings, 1 reply; 48+ messages in thread
From: Eric S. Raymond @ 2013-12-12  0:17 UTC (permalink / raw)
  To: git

On the git tools wiki, the first paragraph of the entry for cvsps now
reads:

  Warning: this code has been end-of-lifed by its maintainer in favor of
  cvs-fast-export. Several attempts over the space of a year to repair
  its deficient branch analysis and tag assignment have failed.  Do not
  use it unless you are converting a strictly linear repository and
  cannot get rsync/ssh read access to the repo masters. If you must use
  it, be prepared to inspect and manually correct the history using
  reposurgeon.

I tried very hard to salvage this program - the ability to
remote-fetch CVS repos without rsync access was appealing - but I
reached my limit earlier today when I actually found time to assemble
a test set of CVS repos and run head-to-head tests comparing cvsps
output to cvs-fast-export output.

I've long believed that cvs-fast-export has a better analyzer
than cvsps just from having read the code for both of them, and having
had to fix some serious bugs in cvsps that have no analogs in
cvs-fast-export.  Direct comparison of the stream outputs revealed
that the difference in quality was larger than I had previously grasped.

Alas, I'm afraid the cvsps repo analysis code turns out to be crap all
the way down on anything but the simplest linear and near-linear
cases, and it doesn't do so hot on even those (all this *after* I
fixed the most obvious bugs in the 2.x version). In retrospect, trying
to repair it was misdirected effort.

I recommend that git sever its dependency on this tool as soon as
possible. I have shipped a 3.13 release with deprecation warnings for
archival purposes, after which I will cease maintenance and redirect
anyone inquiring about cvsps to cvs-fast-export.

(I also maintain cvs-fast-export, but credit for the excellent analysis code 
goes to Keith Packard.  All I did was write the output stage, document
it, and fix a few minor bugs.)
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>

You [should] not examine legislation in the light of the benefits it will
convey if properly administered, but in the light of the wrongs it
would do and the harm it would cause if improperly administered
	-- Lyndon Johnson, former President of the U.S.


* Re: I have end-of-lifed cvsps
  2013-12-12  0:17 I have end-of-lifed cvsps Eric S. Raymond
@ 2013-12-12  3:38 ` Martin Langhoff
  2013-12-12  4:26   ` Eric S. Raymond
  0 siblings, 1 reply; 48+ messages in thread
From: Martin Langhoff @ 2013-12-12  3:38 UTC (permalink / raw)
  To: Eric S. Raymond; +Cc: Git Mailing List

On Wed, Dec 11, 2013 at 7:17 PM, Eric S. Raymond <esr@thyrsus.com> wrote:
> I tried very hard to salvage this program - the ability to
> remote-fetch CVS repos without rsync access was appealing

Is that the only thing we lose, if we abandon cvsps? More to the
point, is there today an incremental import option, outside of
git-cvsimport+cvsps?

[ I am a bit out of touch with the current codebase but I coded and
maintained a good part of it back in the day. However naive/limited
the cvsps parser was, it did help a lot of projects make the leap to
git... ]

regards,



m
-- 
 martin.langhoff@gmail.com
 -  ask interesting questions
 - don't get distracted with shiny stuff  - working code first
 ~ http://docs.moodle.org/en/User:Martin_Langhoff


* Re: I have end-of-lifed cvsps
  2013-12-12  3:38 ` Martin Langhoff
@ 2013-12-12  4:26   ` Eric S. Raymond
  2013-12-12 13:42     ` Martin Langhoff
  0 siblings, 1 reply; 48+ messages in thread
From: Eric S. Raymond @ 2013-12-12  4:26 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Git Mailing List

Martin Langhoff <martin.langhoff@gmail.com>:
> On Wed, Dec 11, 2013 at 7:17 PM, Eric S. Raymond <esr@thyrsus.com> wrote:
> > I tried very hard to salvage this program - the ability to
> > remote-fetch CVS repos without rsync access was appealing
> 
> Is that the only thing we lose, if we abandon cvsps? More to the
> point, is there today an incremental import option, outside of
> git-cvsimport+cvsps?

You'll have to remind me what you mean by "incremental" here. Possibly
it's something cvs-fast-export could support.

But what I'm trying to tell you is that, even after I've done a dozen
releases and fixed the worst problems I could find, cvsps is far too
likely to mangle anything that passes through it.  The idea that you
are preserving *anything* valuable by sticking with it is a mirage.

"That bear trap!  It's mangling your leg!"  "But it's so *shiny*..."

> [ I am a bit out of touch with the current codebase but I coded and
> maintained a good part of it back in the day. However naive/limited
> the cvsps parser was, it did help a lot of projects make the leap to
> git... ]

I fear those "lots of projects" have subtly damaged repository
histories, then.  I warned about this problem a year ago; today I
found out it is much worse than I knew then, in fact so bad that I
cannot responsibly do anything but try to get cvsps turfed out of use
*as soon as possible*.

And no, that should *not* wait on cvs-fast-export getting better 
support for "incremental" or any other legacy feature.  Every week
that cvsps remains the git project's choice is another week in which
somebody's project history is likely to get trashed.

This feels very strange and unpleasant.  I've never had to shoot one
of my own projects through the head before.

I blogged about it: http://esr.ibiblio.org/?p=5167

Ignore the malware warning. It's triggered by something else on ibiblio.org;
they're fixing it.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>


* Re: I have end-of-lifed cvsps
  2013-12-12  4:26   ` Eric S. Raymond
@ 2013-12-12 13:42     ` Martin Langhoff
  2013-12-12 17:17       ` Andreas Krey
                         ` (2 more replies)
  0 siblings, 3 replies; 48+ messages in thread
From: Martin Langhoff @ 2013-12-12 13:42 UTC (permalink / raw)
  To: Eric Raymond; +Cc: Git Mailing List

On Wed, Dec 11, 2013 at 11:26 PM, Eric S. Raymond <esr@thyrsus.com> wrote:
> You'll have to remind me what you mean by "incremental" here. Possibly
> it's something cvs-fast-export could support.

User can

 - run a cvs to git import at time T, resulting in repo G
 - make commits to cvs repo
 - run cvs to git import at time T1, pointed to G, and the import tool
will only add the new commits found in cvs between T and T1.

> But what I'm trying to tell you is that, even after I've done a dozen
> releases and fixed the worst problems I could find, cvsps is far too
> likely to mangle anything that passes through it.  The idea that you
> are preserving *anything* valuable by sticking with it is a mirage.

The bugs that lead to a mangled history are real. I acknowledge and
respect that.

However, with those limitations, the incremental feature has value in
many scenarios.

The two main ones are as follows:

 - A developer is tracking his/her own patches on top of a CVS-based
project with git. This is often done with git-svn for example. If
old/convoluted branches in the far past are mangled, this user won't
care; as long as HEAD->master and/or the current/recent branch are
consistent with reality, the tool fits a need.

 - A project plans to transition to git gradually. Experienced
developers who'd normally work on CVS HEAD start working on git (and
landing their work on CVS afterwards). Old/mangled branches and tags
are of little interest; the big value is CVS HEAD (which is linear)
and possibly recent release/stable branches. The history captured is
good enough for git blame/log/pickaxe along the "master" line. At
transition time the original CVS repo can be kept around in readonly
mode, so people can still checkout the exact contents of an old branch
or tag for example (assuming no destructive "surgery" was done in the
CVS repo).

The above examples assume that the CVS repos have used the "flying
fish" approach in the "interesting" (i.e., recent) parts of their history.

[ Simplifying a bit for non-CVS-geeks -- flying fish is using CVS HEAD
for your development, plus 'feature branches' that get landed, plus
long-lived 'stable release' branches. Most CVS projects in modern
times use flying fish, which is a lot like what the git project uses
in its own repo, but tuned to CVS's strengths (interesting commits
linearized in CVS HEAD).

Other approaches ('dovetail') tend to end up with unworkable messes
given CVS's weaknesses. ]

The cvsimport+cvsps combo does a reasonable (though imperfect) job on
'flying fish' CVS histories _and that is what most projects evolved to
use_. If other cvs import tools can handle crazy histories, hats off
to them. But careful with knifing cvsps!

cheers,



m
-- 
 martin.langhoff@gmail.com
 -  ask interesting questions
 - don't get distracted with shiny stuff  - working code first
 ~ http://docs.moodle.org/en/User:Martin_Langhoff


* Re: I have end-of-lifed cvsps
  2013-12-12 13:42     ` Martin Langhoff
@ 2013-12-12 17:17       ` Andreas Krey
  2013-12-12 17:26         ` Martin Langhoff
  2013-12-12 18:29         ` Eric S. Raymond
  2013-12-12 18:15       ` Eric S. Raymond
  2013-12-17 10:57       ` Jakub Narębski
  2 siblings, 2 replies; 48+ messages in thread
From: Andreas Krey @ 2013-12-12 17:17 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Eric Raymond, Git Mailing List

On Thu, 12 Dec 2013 08:42:25 +0000, Martin Langhoff wrote:
...
>  - run a cvs to git import at time T, resulting in repo G
>  - make commits to cvs repo
>  - run cvs to git import at time T1, pointed to G, and the import tool
> will only add the new commits found in cvs between T and T1.

I'm pretty sure that, given only G, the incremental approach wouldn't
work - some extra state would be required.

But anyway, the replacement question is a) how fast cvs-fast-export is
and b) whether its output is stable, that is, if the cvs repo C yields
a git repo G, will then C with a few extra commits yield G' where every
commit in G (as identified by its SHA1) is also in G', and G' additionally
contains the new commits that were made to the CVS repo.

If that is the case you effectively have an incremental mode, except that
it's not quite as fast.
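
That property is at least cheap to check mechanically. A rough sketch
in C - assuming the two conversions live side by side in old/ and new/;
a shell loop would do just as well:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    /* every commit SHA1 present in the first conversion... */
    FILE *p = popen("git -C old rev-list --all", "r");
    char sha[64], cmd[128];
    int missing = 0;

    if (p == NULL)
        return 1;
    while (fgets(sha, sizeof sha, p) != NULL) {
        sha[strcspn(sha, "\n")] = '\0';
        /* ...must also exist as an object in the second one */
        snprintf(cmd, sizeof cmd,
                 "git -C new cat-file -e %s 2>/dev/null", sha);
        if (system(cmd) != 0) {
            printf("unstable: %s missing from new conversion\n", sha);
            missing++;
        }
    }
    pclose(p);
    return missing != 0;
}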

At least that would be good enough for us - we ended up running a
filter-branch on the resulting history, and that takes some time anyway.

...
> The cvsimport+cvsps combo does a reasonable (though imperfect) job on
> 'flying fish' CVS histories _and that is what most projects evolved to
> use_. If other cvs import tools can handle crazy histories, hats off
> to them. But careful with knifing cvsps!

It won't magically disappear from your machine, and you have been warned. :-)

Andreas

-- 
"Totally trivial. Famous last words."
From: Linus Torvalds <torvalds@*.org>
Date: Fri, 22 Jan 2010 07:29:21 -0800


* Re: I have end-of-lifed cvsps
  2013-12-12 17:17       ` Andreas Krey
@ 2013-12-12 17:26         ` Martin Langhoff
  2013-12-12 18:35           ` Eric S. Raymond
  2013-12-12 18:29         ` Eric S. Raymond
  1 sibling, 1 reply; 48+ messages in thread
From: Martin Langhoff @ 2013-12-12 17:26 UTC (permalink / raw)
  To: Andreas Krey; +Cc: Eric Raymond, Git Mailing List

On Thu, Dec 12, 2013 at 12:17 PM, Andreas Krey <a.krey@gmx.de> wrote:
> But anyway, the replacement question is a) how fast cvs-fast-export is
> and b) whether its output is stable

In my prior work, the "better" CVS importers would not have stable
output, so were not appropriate for incremental imports.

And even the fastest ones were very slow on large repos.

That is why I am asking the question.

> It won't magically disappear from your machine, and you have been warned. :-)

However, esr is making the case that git-cvsimport should stop using
cvsps. My questions are aimed at understanding whether this amounts
to proposing that an important feature be dropped.

Perhaps a better alternative is now available.


m
-- 
 martin.langhoff@gmail.com
 -  ask interesting questions
 - don't get distracted with shiny stuff  - working code first
 ~ http://docs.moodle.org/en/User:Martin_Langhoff


* Re: I have end-of-lifed cvsps
  2013-12-12 13:42     ` Martin Langhoff
  2013-12-12 17:17       ` Andreas Krey
@ 2013-12-12 18:15       ` Eric S. Raymond
  2013-12-12 18:53         ` Martin Langhoff
  2013-12-17 10:57       ` Jakub Narębski
  2 siblings, 1 reply; 48+ messages in thread
From: Eric S. Raymond @ 2013-12-12 18:15 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Git Mailing List

Martin Langhoff <martin.langhoff@gmail.com>:
> On Wed, Dec 11, 2013 at 11:26 PM, Eric S. Raymond <esr@thyrsus.com> wrote:
> > You'll have to remind me what you mean by "incremental" here. Possibly
> > it's something cvs-fast-export could support.
> 
> User can
> 
>  - run a cvs to git import at time T, resulting in repo G
>  - make commits to cvs repo
>  - run cvs to git import at time T1, pointed to G, and the import tool
> will only add the new commits found in cvs between T and T1.

No, cvs-fast-export doesn't do that. However, it is fast enough that
you can probably just rebuild the whole repo each time you want to
move content. 

When I did the conversion of groff recently I was getting rates of
about 150 commits a second - and it will be faster now, because I
found an expensive operation in the output stage I could optimize
out.

Now that you have reminded me of this, I remember implementing a -i
option for cvsps-3.0 that could be combined with a time restriction 
to output incremental dumps. It's likely I could do the same
thing for cvs-fast-export.

> The above examples assume that the CVS repos have used the "flying
> fish" approach in the "interesting" (i.e., recent) parts of their history.
> 
> [ Simplifying a bit for non-CVS-geeks -- flying fish is using CVS HEAD
> for your development, plus 'feature branches' that get landed, plus
> long-lived 'stable release' branches. Most CVS projects in modern
> times use flying fish, which is a lot like what the git project uses
> in its own repo, but tuned to CVS's strengths (interesting commits
> linearized in CVS HEAD).
> 
> Other approaches ('dovetail') tend to end up with unworkable messes
> given CVS's weaknesses. ]

That terminology -- "flying fish" and "dovetail" -- is interesting, and
I have not heard it before.  It might be worth putting in the Jargon File.
Can you point me at examples of live usage?
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>


* Re: I have end-of-lifed cvsps
  2013-12-12 17:17       ` Andreas Krey
  2013-12-12 17:26         ` Martin Langhoff
@ 2013-12-12 18:29         ` Eric S. Raymond
  2013-12-12 19:08           ` Martin Langhoff
  1 sibling, 1 reply; 48+ messages in thread
From: Eric S. Raymond @ 2013-12-12 18:29 UTC (permalink / raw)
  To: Andreas Krey; +Cc: Martin Langhoff, Git Mailing List

Andreas Krey <a.krey@gmx.de>:
> But anyway, the replacement question is a) how fast the cvs-fast-export is
> and b) whether its output is stable, that is, if the cvs repo C yields
> a git repo G, will then C with a few extra commits yield G' where every
> commit in G (as identified by its SHA1) is also in G', and G' additionally
> contains the new commits that were made to the CVS repo.
> 
> If that is the case you effectively have an incremental mode, except that
> it's not quite as fast.

I am almost certain the output of cvs-fast-export is stable.  I
believe the output of cvsps-3.x was, too.  Not sure about 2.x.

I wrote the output stages for both cvsps-3.x and cvs-fast-export, and
went to some effort to verify that they write streams in the same
"most natural" way - marks sequential from :1, blobs always witten as
late as possible, fileops in the same sort order the git tools emit,
etc.
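
To make "most natural" concrete, here is a hand-written fragment in
that shape (not actual tool output; the name and date are made up) -
marks count up from :1, and the blob lands immediately before the
commit that first uses it:

blob
mark :1
data 13
hello, world

commit refs/heads/master
mark :2
committer Fred Foonly <fred@example.com> 1386900000 +0000
data 8
initial
M 100644 :1 README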

I have added writing a regression test to verify the stability
property to the TODO list. I will have this nailed down before the
next point release, in a few days.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>


* Re: I have end-of-lifed cvsps
  2013-12-12 17:26         ` Martin Langhoff
@ 2013-12-12 18:35           ` Eric S. Raymond
  0 siblings, 0 replies; 48+ messages in thread
From: Eric S. Raymond @ 2013-12-12 18:35 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Andreas Krey, Git Mailing List

Martin Langhoff <martin.langhoff@gmail.com>:
> In my prior work, the "better" CVS importers would not have stable
> output, so were not appropriate for incremental imports.

That is disturbing.  I would consider lack of stability a severe and
unacceptable failure mode in such a tool, if only because of the
difficulties it creates for proper regression testing.

If cvs-fast-export does not already have this property I will fix it 
so it does.  And document that fact.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>


* Re: I have end-of-lifed cvsps
  2013-12-12 18:15       ` Eric S. Raymond
@ 2013-12-12 18:53         ` Martin Langhoff
  0 siblings, 0 replies; 48+ messages in thread
From: Martin Langhoff @ 2013-12-12 18:53 UTC (permalink / raw)
  To: Eric Raymond; +Cc: Git Mailing List

On Thu, Dec 12, 2013 at 1:15 PM, Eric S. Raymond <esr@thyrsus.com> wrote:
> That terminology -- "flying fish" and "dovetail" -- is interesting, and
> I have not heard it before.  It might be worth putting in the Jargon File.
> Can you point me at examples of live usage?

The canonical reference would be
http://cvsbook.red-bean.com/cvsbook.html#Going%20Out%20On%20A%20Limb%20(How%20To%20Work%20With%20Branches%20And%20Survive)

Just by being on the Internet and widely referenced, it has probably
eclipsed earlier usages in google-juice. Karl Fogel may remember
where he got the names from.

cheers,



m
-- 
 martin.langhoff@gmail.com
 -  ask interesting questions
 - don't get distracted with shiny stuff  - working code first
 ~ http://docs.moodle.org/en/User:Martin_Langhoff


* Re: I have end-of-lifed cvsps
  2013-12-12 18:29         ` Eric S. Raymond
@ 2013-12-12 19:08           ` Martin Langhoff
  2013-12-12 19:39             ` Eric S. Raymond
  0 siblings, 1 reply; 48+ messages in thread
From: Martin Langhoff @ 2013-12-12 19:08 UTC (permalink / raw)
  To: Eric Raymond; +Cc: Andreas Krey, Git Mailing List

On Thu, Dec 12, 2013 at 1:29 PM, Eric S. Raymond <esr@thyrsus.com> wrote:
> I am almost certain the output of cvs-fast-export is stable.  I
> believe the output of cvsps-3.x was, too.  Not sure about 2.x.

IIRC, making the output stable is nontrivial, especially on branches.
Two cases are still in my mind, from when I was wrestling with cvsps.

1 - For a history with CVS HEAD and a long-running "stable release"
branch ("STABLE"), which branched at P1...

   a - adding a file only at the tip of STABLE "retroactively changes
history"  for P1 and perhaps CVS HEAD

   b - forgetting to properly tag a subset of files with the branch
tag, and doing it later retroactively changes history

2 - you can create a new branch or tag with files that do not belong
together in any "commit". Doing so changes history retroactively

... when I say "changes history", I mean that the importers I know
revise their guesses of what files were seen together in a 'commit'.
This is especially true for history recorded with early cvs versions
that did not record a 'commit id'.

cvsps has the strange "feature" that it will cache its
assumptions/guesses, and continue incrementally from there. So if a
change in the CVS repo means that the old guess is now invalidated, it
continues the charade instead of forcing a complete rewrite of the git
history.

Maybe the current crop of tools have developed stronger magic than
what was available a few years ago... the task did seem impossible to
me.

cheers,




m
-- 
 martin.langhoff@gmail.com
 -  ask interesting questions
 - don't get distracted with shiny stuff  - working code first
 ~ http://docs.moodle.org/en/User:Martin_Langhoff


* Re: I have end-of-lifed cvsps
  2013-12-12 19:08           ` Martin Langhoff
@ 2013-12-12 19:39             ` Eric S. Raymond
  2013-12-12 19:48               ` Martin Langhoff
  0 siblings, 1 reply; 48+ messages in thread
From: Eric S. Raymond @ 2013-12-12 19:39 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Andreas Krey, Git Mailing List

Martin Langhoff <martin.langhoff@gmail.com>:
> IIRC, making the output stable is nontrivial, especially on branches.
> Two cases are still in my mind, from when I was wrestling with cvsps.
> 
> 1 - For a history with CVS HEAD and a long-running "stable release"
> branch ("STABLE"), which branched at P1...
> 
>    a - adding a file only at the tip of STABLE "retroactively changes
> history"  for P1 and perhaps CVS HEAD
> 
>    b - forgetting to properly tag a subset of files with the branch
> tag, and doing it later retroactively changes history
> 
> 2 - you can create a new branch or tag with files that do not belong
> together in any "commit". Doing so changes history retroactively
> 
> ... when I say "changes history", I mean that the importers I know
> revise their guesses of what files were seen together in a 'commit'.
> This is especially true for history recorded with early cvs versions
> that did not record a 'commit id'.

Yikes!  That is a much stricter stability criterion than I thought you
were specifying.   No, cvs-fast-export probably doesn't satisfy all of these.
I think it would handle 1a in a stable way, but 1b and 2 would throw it.

I'm sure it can't be fooled in the presence of commitids, though,
because when it has those it doesn't try to do any similarity
matching.  And (this is the important point) it won't match any change
with a commit-id to any change without one.

What I think this means is that cvs-fast-export is stable if you are
using a server/client combination that generates commitids (that is,
GNU CVS of any version newer than 1.12 of 2004, or CVS-NT). It is
*not* necessary for stability that the entire history have them.

Here's how the logic works out:

1. Commits grouped by commitid are stable - nothing in CVS ever rewrites
those or assigns a duplicate.

2. No file change made with a commitid can destabilize a commit guess
made without them, because the similarity checker never tries to put both 
kinds in a single changeset.

Can you detect any flaw in this?
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>


* Re: I have end-of-lifed cvsps
  2013-12-12 19:39             ` Eric S. Raymond
@ 2013-12-12 19:48               ` Martin Langhoff
  2013-12-12 20:58                 ` Eric S. Raymond
  0 siblings, 1 reply; 48+ messages in thread
From: Martin Langhoff @ 2013-12-12 19:48 UTC (permalink / raw)
  To: Eric Raymond; +Cc: Andreas Krey, Git Mailing List

On Thu, Dec 12, 2013 at 2:39 PM, Eric S. Raymond <esr@thyrsus.com> wrote:
> Yikes!  That is a much stricter stability criterion than I thought you
> were specifying.

:-) -- cvsps's approach is: if you have a cache, you can remember the
lies you told earlier.

It is impossible to be stable purely from the source data in the face
of these issues.

CVS is truly a PoS.

> I think it would handle 1a in a stable way

That is pretty important. It matters that files added on a branch
don't affect HEAD or earlier checkouts of the branch.


> What I think this means is that cvs-fast-export is stable if you are
> using a server/client combination that generates commitids (that is,
> GNU CVS of any version newer than 1.12 of 2004, or CVS-NT). It is
> *not* necessary for stability that the entire history have them.
>
> Here's how the logic works out:
>
> 1. Commits grouped by commitid are stable - nothing in CVS ever rewrites
> those or assigns a duplicate.
>
> 2. No file change made with a commitid can destabilize a commit guess
> made without them, because the similarity checker never tries to put both
> kinds in a single changeset.
>
> Can you detect any flaw in this?

If someone creates a nonsensical tag or branch point, tagging files
from different commits, how do you handle it?

 - without commit ids, does it affect your guesses?

 - regardless of commit ids, do you synthesize an artificial commit?
How do you define parenthood for that artificial commit?

curious,



m
-- 
 martin.langhoff@gmail.com
 -  ask interesting questions
 - don't get distracted with shiny stuff  - working code first
 ~ http://docs.moodle.org/en/User:Martin_Langhoff


* Re: I have end-of-lifed cvsps
  2013-12-12 19:48               ` Martin Langhoff
@ 2013-12-12 20:58                 ` Eric S. Raymond
  2013-12-12 22:51                   ` Martin Langhoff
  0 siblings, 1 reply; 48+ messages in thread
From: Eric S. Raymond @ 2013-12-12 20:58 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Andreas Krey, Git Mailing List

Martin Langhoff <martin.langhoff@gmail.com>:
> If someone creates a nonsensical tag or branch point, tagging files
> from different commits, how do you handle it?
> 
>  - without commit ids, does it affect your guesses?

No.  Tagging is never used to deduce changesets. Look:

/*
 * The heart of the merge operation; detect when two
 * commits are "the same"
 */
static bool
rev_commit_match (rev_commit *a, rev_commit *b)
{
    /*
     * Versions of GNU CVS after 1.12 (2004) place a commitid in
     * each commit to track patch sets. Use it if present
     */
    if (a->commitid && b->commitid)
	return a->commitid == b->commitid;
    /* never merge a change bearing a commitid with one lacking it */
    if (a->commitid || b->commitid)
	return false;
    /*
     * No commitids at all: fall back to similarity matching - dates
     * within a small fuzz window, identical log message and author.
     * (commitid, log, and author are interned atoms, so pointer
     * comparison above and below is string comparison.)
     */
    if (!commit_time_close (a->date, b->date))
	return false;
    if (a->log != b->log)
	return false;
    if (a->author != b->author)
	return false;
    return true;
}

>  - regardless of commit ids, do you synthesize an artificial commit?
> How do you define parenthood for that artificial commit?

Because tagging is never used to deduce changesets, the case does not arise.

I have added an item to my to-do: document what the tool does with
inconsistent tags.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>


* Re: I have end-of-lifed cvsps
  2013-12-12 20:58                 ` Eric S. Raymond
@ 2013-12-12 22:51                   ` Martin Langhoff
  2013-12-12 23:04                     ` Eric S. Raymond
  0 siblings, 1 reply; 48+ messages in thread
From: Martin Langhoff @ 2013-12-12 22:51 UTC (permalink / raw)
  To: Eric Raymond; +Cc: Andreas Krey, Git Mailing List

On Thu, Dec 12, 2013 at 3:58 PM, Eric S. Raymond <esr@thyrsus.com> wrote:
>>  - regardless of commit ids, do you synthesize an artificial commit?
>> How do you define parenthood for that artificial commit?
>
> Because tagging is never used to deduce changesets, the case does not arise.

So if a branch has a nonsensical branching point, or a tag is
nonsensical, is it ignored and not imported?

curious,



m
-- 
 martin.langhoff@gmail.com
 -  ask interesting questions
 - don't get distracted with shiny stuff  - working code first
 ~ http://docs.moodle.org/en/User:Martin_Langhoff


* Re: I have end-of-lifed cvsps
  2013-12-12 22:51                   ` Martin Langhoff
@ 2013-12-12 23:04                     ` Eric S. Raymond
  2013-12-13  2:35                       ` Martin Langhoff
  0 siblings, 1 reply; 48+ messages in thread
From: Eric S. Raymond @ 2013-12-12 23:04 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Andreas Krey, Git Mailing List

Martin Langhoff <martin.langhoff@gmail.com>:
> On Thu, Dec 12, 2013 at 3:58 PM, Eric S. Raymond <esr@thyrsus.com> wrote:
> >>  - regardless of commit ids, do you synthesize an artificial commit?
> >> How do you define parenthood for that artificial commit?
> >
> > Because tagging is never used to deduce changesets, the case does not arise.
> 
> So if a branch has a nonsensical branching point, or a tag is
> nonsensical, is it ignored and not imported?

I don't know what happens when identically-named tags point at changes that
resolve into two different commits.  I will figure that out and document it.

There's evidence, in the form of some code that is #ifdefed out, that
Keith considered trying to make synthetic commits from tag cliques,
but abandoned the idea because he couldn't figure out how to assign
such cliques to a branch.

I'm not sure what counts as a nonsensical branching point. I do know that
Keith left this rather cryptic note in a README:

	Disjoint branch resolution. Branches occurring in a subset of the
	files are not correctly resolved; instead, an entirely disjoint
	history will be created containing the branch revisions and all
	parents back to the root. I'm not sure how to fix this; it seems
	to implicitly assume there will be only a single place to attach as
	branch parent, which may not be the case. In any case, the right
	revision will have a superset of the revisions present in the
	original branch parent; perhaps that will suffice.

-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>


* Re: I have end-of-lifed cvsps
  2013-12-12 23:04                     ` Eric S. Raymond
@ 2013-12-13  2:35                       ` Martin Langhoff
  2013-12-13  3:38                         ` Eric S. Raymond
  0 siblings, 1 reply; 48+ messages in thread
From: Martin Langhoff @ 2013-12-13  2:35 UTC (permalink / raw)
  To: Eric Raymond; +Cc: Andreas Krey, Git Mailing List

On Thu, Dec 12, 2013 at 6:04 PM, Eric S. Raymond <esr@thyrsus.com> wrote:
> I'm not sure what counts as a nonsensical branching point. I do know that
> Keith left this rather cryptic note in a README:

Keith names exactly what we are talking about. At that time, Keith was
struggling with the old xorg cvs repo, which had these and quite a few
other nasties. I was also struggling with the mozilla cvs repo, with
its own gremlins.

Between my earlier explanation and Keith's notes it should be clear to
you. It is absolutely trivial in CVS to have an "inconsistent"
checkout (for example, if you switch branch with the -l parameter
disabling recursion, or if you accidentally switch branch in a
subdirectory).

On that inconsistent checkout, nothing prevents you from tagging it,
nor from creating a new branch.
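
For instance (made-up branch and tag names):

    cvs update -l -r STABLE .   # -l disables recursion: only this
                                # directory moves to the branch
    cvs tag MIXED_SNAP          # happily tags the resulting mixed
                                # HEAD/STABLE working copy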

An importer with a 'consistent tree mentality' will look at the
files/revs involved in that tag (or branching point) and find no tree
to match.

CVS repos with that crap exist. x11/xorg did (Jim Gettys challenged me
to try importing it at an LCA, after the Bazaar NG folks passed on
it). Mozilla did as well.


IMHO it is a valid path to skip importing the tag/branch. As long as
main dev work was in HEAD, things end up ok (which goes back to my
flying fish notes).

cheers,



m
-- 
 martin.langhoff@gmail.com
 -  ask interesting questions
 - don't get distracted with shiny stuff  - working code first
 ~ http://docs.moodle.org/en/User:Martin_Langhoff


* Re: I have end-of-lifed cvsps
  2013-12-13  2:35                       ` Martin Langhoff
@ 2013-12-13  3:38                         ` Eric S. Raymond
  0 siblings, 0 replies; 48+ messages in thread
From: Eric S. Raymond @ 2013-12-13  3:38 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Andreas Krey, Git Mailing List

Martin Langhoff <martin.langhoff@gmail.com>:
> On Thu, Dec 12, 2013 at 6:04 PM, Eric S. Raymond <esr@thyrsus.com> wrote:
> > I'm not sure what counts as a nonsensical branching point. I do know that
> > Keith left this rather cryptic note in a README:
> 
> Keith names exactly what we are talking about.

Oh, yeah, I figured that much out.  What I wasn't clear on was whether
that's a complete description of "nonsensical branching point" or whether
there are other pathologies fundamentally *different* from that one.

I'm also not sure I have the end state of what cvs-fast-export does in that
case visualized correctly. When he says: "an entirely disjoint history will
be created containing the branch revisions and all parents back to the
root", I'm visualizing something like this:

  a----b----c----d----e----f----g----h
                  \
                   +----1----2----3---4

Suppose the root is a and our pathological branch point is at d; then it
sounds like he's saying cvs-fast-export will produce a changeset DAG
that looks like this:

  a----b'---c'---d'---e----f----g----h
   \
    +----b''---c''---d''----1----2----3----4

What I'm not clear on here is how b is related to b' and b'', c to c' and c'',
and d to d' and d''.  Which file changes go to which commit?  I shall have to
craft some broken RCS files to find out.

Have I explained that I'm building a test suite?  I intend to know exactly
what the tool does in these cases and document it.

> Between my earlier explanation and Keith's notes it should be clear to
> you. It is absolutely trivial in CVS to have an "inconsistent"
> checkout (for example, if you switch branch with the -l parameter
> disabling recursion, or if you accidentally switch branch in a
> subdirectory).

That last one sounds easy to fall into and nasty. 

> On that inconsistent checkout, nothing prevents you from tagging it,
> nor from creating a new branch.
> 
> An importer with a 'consistent tree mentality' will look at the
> files/revs involved in that tag (or branching point) and find no tree
> to match.
> 
> CVS repos with that crap exist. x11/xorg did (Jim Gettys challenged me
> to try importing it at an LCA, after the Bazaar NG folks passed on
> it). Mozilla did as well.
> 
> 
> IMHO it is a valid path to skip importing the tag/branch. As long as
> main dev work was in HEAD, things end up ok (which goes back to my
> flying fish notes).

The other way to handle it would be to translate the history as though every
branch of a file subset had been an attempt to branch everything.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>


* Re: I have end-of-lifed cvsps
  2013-12-12 13:42     ` Martin Langhoff
  2013-12-12 17:17       ` Andreas Krey
  2013-12-12 18:15       ` Eric S. Raymond
@ 2013-12-17 10:57       ` Jakub Narębski
  2013-12-17 11:18         ` Johan Herland
  2013-12-17 14:07         ` Eric S. Raymond
  2 siblings, 2 replies; 48+ messages in thread
From: Jakub Narębski @ 2013-12-17 10:57 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Eric Raymond, Git Mailing List

Martin Langhoff wrote:

> On Wed, Dec 11, 2013 at 11:26 PM, Eric S. Raymond <esr@thyrsus.com> wrote:
>> You'll have to remind me what you mean by "incremental" here. Possibly
>> it's something cvs-fast-export could support.
>
> User can
>
>   - run a cvs to git import at time T, resulting in repo G
>   - make commits to cvs repo
>   - run cvs to git import at time T1, pointed to G, and the import tool
> will only add the new commits found in cvs between T and T1.

I wonder if we can add support for incremental import once, for all
VCS supporting fast-export, in one place, namely at the remote-helper.

I don't know details, so I don't know if it is possible; certainly
unstable fast-export output would be a problem, unless some tricks
are used (like remembering mappings between versions).

-- 
Jakub Narębski


* Re: I have end-of-lifed cvsps
  2013-12-17 10:57       ` Jakub Narębski
@ 2013-12-17 11:18         ` Johan Herland
  2013-12-17 14:58           ` Eric S. Raymond
  2013-12-17 14:07         ` Eric S. Raymond
  1 sibling, 1 reply; 48+ messages in thread
From: Johan Herland @ 2013-12-17 11:18 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: Martin Langhoff, Eric Raymond, Git Mailing List

On Tue, Dec 17, 2013 at 11:57 AM, Jakub Narębski <jnareb@gmail.com> wrote:
> Martin Langhoff wrote:
>
>> On Wed, Dec 11, 2013 at 11:26 PM, Eric S. Raymond <esr@thyrsus.com> wrote:
>>>
>>> You'll have to remind me what you mean by "incremental" here. Possibly
>>> it's something cvs-fast-export could support.
>>
>> User can
>>
>>   - run a cvs to git import at time T, resulting in repo G
>>   - make commits to cvs repo
>>   - run cvs to git import at time T1, pointed to G, and the import tool
>> will only add the new commits found in cvs between T and T1.
>
> I wonder if we can add support for incremental import once, for all
> VCS supporting fast-export, in one place, namely at the remote-helper.
>
> I don't know details, so I don't know if it is possible; certainly
> unstable fast-export output would be a problem, unless some tricks
> are used (like remembering mappings between versions).

You could do this by mapping some CVS revision identifier (like a hash
over the file:revision pairs, if nothing better is available) to the
corresponding git commit; that would be useful when trying to match up
the git commits from a later import against the existing commits from
an earlier import.
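
A minimal sketch of one way to build such an identifier - hash the
sorted file:revision pairs that make up the CVS state (FNV-1a chosen
purely for brevity; the pairs below are made up):

#include <stdint.h>
#include <stdio.h>

static uint64_t fnv1a(uint64_t h, const char *s)
{
    while (*s) {
        h ^= (uint8_t)*s++;
        h *= 1099511628211ULL;          /* FNV-1a 64-bit prime */
    }
    return h;
}

int main(void)
{
    /* file:revision pairs, pre-sorted by path so the key does not
       depend on enumeration order */
    const char *pairs[] = { "src/bar.c:1.3", "src/foo.c:1.14", NULL };
    uint64_t h = 14695981039346656037ULL;   /* FNV offset basis */
    const char **p;

    for (p = pairs; *p != NULL; p++) {
        h = fnv1a(h, *p);
        h = fnv1a(h, "\n");             /* pair separator */
    }
    printf("CVS state key: %016llx\n", (unsigned long long)h);
    return 0;
}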

HOWEVER, this only solves the "cheap" half of the problem. The reason
people want incremental CVS import is to avoid having to repeatedly
convert the ENTIRE CVS history. This means that the CVS exporter must
learn to start from a given point in the CVS history (identified by
the above mapping) and then quickly and efficiently convert only the
"new stuff" without having to consult/convert the rest of the CVS
history. THIS is the hard part of incremental import. And it is much
harder for systems like CVS - where the starting point has a broken
concept of history...

...Johan

-- 
Johan Herland, <johan@herland.net>
www.herland.net


* Re: I have end-of-lifed cvsps
  2013-12-17 10:57       ` Jakub Narębski
  2013-12-17 11:18         ` Johan Herland
@ 2013-12-17 14:07         ` Eric S. Raymond
  2013-12-17 19:58           ` Jakub Narębski
  1 sibling, 1 reply; 48+ messages in thread
From: Eric S. Raymond @ 2013-12-17 14:07 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: Martin Langhoff, Git Mailing List

Jakub Narębski <jnareb@gmail.com>:
> I wonder if we can add support for incremental import once, for all
> VCS supporting fast-export, in one place, namely at the remote-helper.

Something in the pipeline - either the helper or the exporter - needs to
have an equivalent of cvs-fast-export's and cvsps's -i option, which
omits all commits before a specified time and generates cookies like
"from refs/heads/master^0" before each branch root in the incremental
dump.
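
In skeleton form the output-stage trick is no deeper than this (a toy
sketch, not the actual cvs-fast-export code - one branch, changesets
already resolved and sorted by date):

#include <stdio.h>
#include <time.h>

struct commit {
    time_t     date;
    const char *log;
};

int main(void)
{
    /* toy history standing in for a fully resolved changeset list */
    struct commit history[] = {
        { 1000, "already in the target repo" },
        { 2000, "new since the last lift" },
        { 3000, "also new" },
    };
    time_t threshold = 1500;    /* the -i cutoff time */
    int at_root = 1;
    size_t i;

    for (i = 0; i < sizeof history / sizeof history[0]; i++) {
        if (history[i].date <= threshold)
            continue;           /* suppress: emitted on a previous run */
        printf("commit refs/heads/master\n");
        printf("committer ...\ndata ...\n%s\n", history[i].log);
        if (at_root) {
            /* the cookie that grafts new work onto the existing tip */
            printf("from refs/heads/master^0\n");
            at_root = 0;
        }
        printf("\n");
    }
    return 0;
}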

This could be done in the wrapper, but only if the wrapper itself
includes an import-stream parser, interprets the output from the
exporter program, and re-emits it.  Having done similar things
myself in reposurgeon, I advise against this strategy; it would
introduce a level of complexity to the wrapper that doesn't belong
there, and make the exporter+wrapper combination harder to verify.

Fortunately, incremental dump is trivial to implement in the output
stage of an exporter if you have access to the exporter source code.
I've done it in two different exporters.  cvs-fast-export now has a
regression test for this case.

> I don't know details, so I don't know if it is possible; certainly
> unstable fast-export output would be a problem, unless some tricks
> are used (like remembering mappings between versions).

About such tricks I can only say "That way lies madness".  The present
Perl wrapper is buggy because it's over-complex.  The replacement wrapper
should do *less*, not more.

Stable output and incremental dump are reasonable things to demand of
your supported exporters.  cvs-fast-export has incremental dump
unconditionally, and stability relative to every CVS implementation
since 2004.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>


* Re: I have end-of-lifed cvsps
  2013-12-17 11:18         ` Johan Herland
@ 2013-12-17 14:58           ` Eric S. Raymond
  2013-12-17 17:52             ` Johan Herland
  0 siblings, 1 reply; 48+ messages in thread
From: Eric S. Raymond @ 2013-12-17 14:58 UTC (permalink / raw)
  To: Johan Herland; +Cc: Jakub Narębski, Martin Langhoff, Git Mailing List

Johan Herland <johan@herland.net>:
> HOWEVER, this only solves the "cheap" half of the problem. The reason
>> people want incremental CVS import is to avoid having to repeatedly
> convert the ENTIRE CVS history. This means that the CVS exporter must
> learn to start from a given point in the CVS history (identified by
> the above mapping) and then quickly and efficiently convert only the
> "new stuff" without having to consult/convert the rest of the CVS
> history. THIS is the hard part of incremental import. And it is much
> harder for systems like CVS - where the starting point has a broken
> concept of history...

I know of *no* importer that solves what you call the "deep" part of
the problem.  cvsps didn't, cvs-fast-export doesn't, cvs2git doesn't.
All take the easy way out; parse the entire history, and limit what
is emitted in the output stage.

Actually, given what I know about delta-file parsing I'd say a "true"
incremental CVS exporter would be so hard that it's really not worth the
bother.  The problem is the delta-based history representation.
Trying to interpret that without building a complete set of history
states in the process (which is most of the work a whole-history
exporter does) would be brutally difficult - barely possible in
principle maybe, but I wouldn't care to try it.

It's much more practical to tune up a whole-history exporter so it's
acceptably fast, then do incremental dumping by suppressing part of
the conversion in the output stage. 

cvs-fast-export's benchmark repo is the history of GNU troff.  That's
3057 commits in 1549 master files; when I reran it just now the
whole-history conversion took 49 seconds.  That's 3.7K commits a
minute, which is plenty fast enough for anything smaller than (say)
one of the *BSD repositories.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>


* Re: I have end-of-lifed cvsps
  2013-12-17 14:58           ` Eric S. Raymond
@ 2013-12-17 17:52             ` Johan Herland
  2013-12-17 18:47               ` Eric S. Raymond
  0 siblings, 1 reply; 48+ messages in thread
From: Johan Herland @ 2013-12-17 17:52 UTC (permalink / raw)
  To: Eric Raymond; +Cc: Jakub Narębski, Martin Langhoff, Git Mailing List

On Tue, Dec 17, 2013 at 3:58 PM, Eric S. Raymond <esr@thyrsus.com> wrote:
> Johan Herland <johan@herland.net>:
>> HOWEVER, this only solves the "cheap" half of the problem. The reason
>> people want incremental CVS import is to avoid having to repeatedly
>> convert the ENTIRE CVS history. This means that the CVS exporter must
>> learn to start from a given point in the CVS history (identified by
>> the above mapping) and then quickly and efficiently convert only the
>> "new stuff" without having to consult/convert the rest of the CVS
>> history. THIS is the hard part of incremental import. And it is much
>> harder for systems like CVS - where the starting point has a broken
>> concept of history...
>
> I know of *no* importer that solves what you call the "deep" part of
> the problem.  cvsps didn't, cvs-fast-export doesn't, cvs2git doesn't.
> All take the easy way out; parse the entire history, and limit what
> is emitted in the output stage.

Yes, and starting from a non-incremental importer, that's probably the
only viable way to approach incrementalism.

> Actually, given what I know about delta-file parsing I'd say a "true"
> incremental CVS exporter would be so hard that it's really not worth the
> bother.  The problem is the delta-based history representation.
> Trying to interpret that without building a complete set of history
> states in the process (which is most of the work a whole-history
> exporter does) would be brutally difficult - barely possible in
> principle maybe, but I wouldn't care to try it.

Agreed, you would either have to re-parse the entire ,v-file, or you
would have to store some (probably a lot of) intermediate state that
would allow you to resolve deltas of new revisions without having to
parse all the old revisions.

> It's much more practical to tune up a whole-history exporter so it's
> acceptably fast, then do incremental dumping by suppressing part of
> the conversion in the output stage.
>
> cvs-fast-export's benchmark repo is the history of GNU troff.  That's
> 3057 commits in 1549 master files; when I reran it just now the
> whole-history conversion took 49 seconds.  That's 3.7K commits a
> minute, which is plenty fast enough for anything smaller than (say)
> one of the *BSD repositories.

Those are impressive numbers, and in that scenario, using a
"repurposed" converter (i.e. whole-history converter that has been
taught to do incremental output) is undoubtedly the best solution.

However, I fear that you underestimate the number of users that want
to use Git against CVS repos that are orders of magnitude larger (in
both dimensions: #commits and #files) than your example repo. For
these repos, running a proper whole-history conversion takes hours -
or even days - and working incrementally on top of that is simply out
of the question. Obviously, they still need the whole-history
converter for the future point in time when they have collected enough
motivation/buy-in to migrate the entire project/company to a better
VCS, but until then, they want to use Git locally, while enduring CVS
on the server.

At my previous $DAYJOB, I was one of those people, and I ended up with
a two-pronged "solution" to the problem (this was ~5 years ago, so
I'm somewhat fuzzy on the details):

 1. Adopt an ad hoc incremental approach for working against the CVS
server: Keep a CVS checkout next to my git repo, and maintain a map
between corresponding states/commits in CVS and git. When I update
from CVS, apply the corresponding patch to the "cvs" branch in my git
repo. Rebase my git-based work on top of that, and use "git
cvsexportcommit" to propagate my Git work back to CVS. This is crude
and hacky as hell, but it provides me a local git-based workflow (a
rough sketch of the loop follows after this list).

 2. Start convincing fellow developers and lobby management about
switching away from CVS. We got a discussion started, gained momentum,
and eventually I got to spend most of my time preparing and performing
the full-history conversion from CVS to git. This happened mostly
before cvs2svn grew its cvs2git sibling, so I ended up writing a
custom converter for our particular variation of insane and demented
CVS practices. Today, I would probably have gone for cvs2git, or your
more recent work.
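
The loop in point 1 went roughly like this (a sketch with made-up
paths; I am fuzzy on the exact flags I used):

    cd ~/work/cvs-tree && cvs update        # sync from the server
    cd ~/work/git-tree
    git checkout cvs                        # apply the CVS diff here,
                                            # then commit it
    git rebase cvs my-work                  # replay my patches on top
    git cvsexportcommit -w ~/work/cvs-tree -c <sha>
                                            # push one commit back to CVS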

But back to my main point:

I believe there are two classes of CVS converters, and I have slowly
come to believe that they solve two fundamentally different problems.
The first problem is "how to faithfully recreate the project history
in a different VCS", which is solved by the full-history converters.
Case closed.

The second problem is somewhat harder to define, but I'll try: "how to
allow me to work productively against a CVS server, without having to
deal with the icky CVS bits". Compared to the first problem, the
parameters differ somewhat:

 - Conversion/synchronization time must be short to allow me to stay
productive and up-to-date with my colleagues.

 - Correctness of "current state" is very important. I must be sure
that my git working tree is identical to its CVS counterpart, so that
my git changes can be reproduced in CVS as faithfully as possible.

 - Correctness of "history" is less important. I can accept a
messy/incorrect Git history, since I can always query the CVS server
for the "correct" history (whatever that means in a CVS context...).

 - As a generic CVS user (not the CVS admin) I don't necessarily have
direct access to the ,v files stored on the CVS server.

Although a full-history converter with fairly stable output can be
made to support this second problem for repos up to a certain size,
there will probably still be users that want to work incrementally
against much bigger repos, and I don't think _any_
full-history-gone-incremental importer will be able to support the
biggest repos.

Consequently I believe that for these big repos it is _impossible_ to
get both fast incremental workflows and a high degree of (historical)
correctness.

cvsps tried to be all of the above, and failed badly at the
correctness criteria. Therefore I support your decision to "shoot it
through the head". I certainly also support any work towards making a
full-history converter work in an incremental manner, as it will be
immensely useful for smaller CVS repos. But at the same time we should
realize that it won't be a solution for incrementally working against
_large_ CVS repos.

Although it should have been made obvious a long time ago, the removal
of cvsps has now made it abundantly clear that Git currently provides
no way to support the incremental workflow against large CVS repos.
Maybe that is ok, and we can ignore that, waiting for the few
remaining large CVS repos to die? Or maybe we need a new effort to
fill this niche? Something that is NOT based on a full-history
converter, and does NOT try to guarantee a history-correct conversion,
but that DOES try to guarantee fast and relatively worry-free two-way
synchronization against a CVS server. Unfortunately (or fortunately,
depending on POV) I have not had to touch CVS in a long while, and I
don't see that changing soon, so it is not my itch to scratch.


...Johan

-- 
Johan Herland, <johan@herland.net>
www.herland.net


* Re: I have end-of-lifed cvsps
  2013-12-17 17:52             ` Johan Herland
@ 2013-12-17 18:47               ` Eric S. Raymond
  2013-12-17 21:26                 ` Johan Herland
  2013-12-18 23:44                 ` Michael Haggerty
  0 siblings, 2 replies; 48+ messages in thread
From: Eric S. Raymond @ 2013-12-17 18:47 UTC (permalink / raw)
  To: Johan Herland; +Cc: Jakub Narębski, Martin Langhoff, Git Mailing List

Johan Herland <johan@herland.net>:
> However, I fear that you underestimate the number of users that want
> to use Git against CVS repos that are orders of magnitude larger (in
> both dimensions: #commits and #files) than your example repo.

You may be right. See below...

I'm working with Alan Barrett now on trying to convert the NetBSD
repositories. They break cvs-fast-export through sheer bulk of
metadata, by running the machine out of core.  This is exactly
the kind of huge case that you're talking about.

Alan and I are going to take a good hard whack at modifying cvs-fast-export 
to make this work. Because there really aren't any feasible alternatives.
The analysis code in cvsps was never good enough. cvs2git, being written
in Python, would hit the core limit faster than anything written in C.

> Although a full-history converter with fairly stable output can be
> made to support this second problem for repos up to a certain size,
> there will probably still be users that want to work incrementally
> against much bigger repos, and I don't think _any_
> full-history-gone-incremental importer will be able to support the
> biggest repos.
> 
> Consequently I believe that for these big repos it is _impossible_ to
> get both fast incremental workflows and a high degree of (historical)
> correctness.
> 
> cvsps tried to be all of the above, and failed badly at the
> correctness criteria. Therefore I support your decision to "shoot it
> through the head". I certainly also support any work towards making a
> full-history converter work in an incremental manner, as it will be
> immensely useful for smaller CVS repos. But at the same time we should
> realize that it won't be a solution for incrementally working against
> _large_ CVS repos.

It is certainly the case that a sufficiently large CVS repo will break
anything, like a star with a mass over the Chandrasekhar limit becoming a 
black hole :-)

The question is how common such supermassive cases are. My own guess is that
the *BSD repos and a handful of the oldest GNU projects are pretty much the
whole set; everybody else converted to Subversion within the last decade. 
 
> Although it should have been made obvious a long time ago, the removal
> of cvsps has now made it abundantly clear that Git currently provides
> no way to support the incremental workflow against large CVS repos.
> Maybe that is ok, and we can ignore that, waiting for the few
> remaining large CVS repos to die? Or maybe we need a new effort to
> fill this niche? Something that is NOT based on a full-history
> converter, and does NOT try to guarantee a history-correct conversion,
> but that DOES try to guarantee fast and relatively worry-free two-way
> synchronization against a CVS server. Unfortunately (or fortunately,
> depending on POV) I have not had to touch CVS in a long while, and I
> don't see that changing soon, so it is not my itch to scratch.

Nor mine.  I find the very idea of writing anything that encourages
non-history-correct conversions disturbing and want no part of it.

Which matters, because right now the set of people working on CVS lifters
begins with me and ends with Michael Haggerty (cvs2git), who seems even
less interested in incremental conversion than I am.  Unless somebody
comes out of nowhere and wants to own that problem, it's not going
to get solved.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>


* Re: I have end-of-lifed cvsps
  2013-12-17 14:07         ` Eric S. Raymond
@ 2013-12-17 19:58           ` Jakub Narębski
  2013-12-17 21:02             ` Eric S. Raymond
  0 siblings, 1 reply; 48+ messages in thread
From: Jakub Narębski @ 2013-12-17 19:58 UTC (permalink / raw)
  To: Eric Raymond; +Cc: Martin Langhoff, Git Mailing List

On Tue, Dec 17, 2013 at 3:07 PM, Eric S. Raymond <esr@thyrsus.com> wrote:
> Jakub Narębski <jnareb@gmail.com>:

>> I wonder if we can add support for incremental import once, for all
>> VCS supporting fast-export, in one place, namely at the remote-helper.
>
> Something in the pipeline - either the helper or the exporter - needs to
> have an equivalent of cvs-fast-export's and cvsps's -i option, which
> omits all commits before a specified time and generates cookies like
> "from refs/heads/master^0" before each branch root in the incremental
> dump.

Errr... doesn't cvs-fast-export support --export-marks=<file> to save
progress and --import-marks=<file> to continue incremental import?
I *guess* that 'export' / 'import' capabilities-based remote helpers
use the 'export-marks <file>' / 'import-marks <file>' capabilities for
incremental import, also known as "fetch", don't they? But I might be
mistaken - I don't know enough about remote helpers...

I would check it in the cvs-fast-export manpage, but the page seems to
be down:

  http://isup.me/www.catb.org

    It's not just you! http://www.catb.org looks down from here.

> This could be done in the wrapper, but only if the wrapper itself
> includes an import-stream parser, interprets the output from the
> exporter program, and re-emits it.  Having done similar things
> myself in reposurgeon, I advise against this strategy; it would
> introduce a level of complexity to the wrapper that doesn't belong
> there, and make the exporter+wrapper combination harder to verify.

Right.

> Fortunately, incremental dump is trivial to implement in the output
> stage of an exporter if you have access to the exporter source code.
> I've done it in two different exporters.  cvs-fast-export now has a
> regression test for this case.

This is, I guess, assuming that information from later commits doesn't
change guesses about the shape of history made from earlier commits...

-- 
Jakub Narebski


* Re: I have end-of-lifed cvsps
  2013-12-17 19:58           ` Jakub Narębski
@ 2013-12-17 21:02             ` Eric S. Raymond
  2013-12-18  0:02               ` Jakub Narębski
  2013-12-18  0:04               ` Andreas Schwab
  0 siblings, 2 replies; 48+ messages in thread
From: Eric S. Raymond @ 2013-12-17 21:02 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: Martin Langhoff, Git Mailing List

Jakub Narębski <jnareb@gmail.com>:
> Errr... doesn't cvs-fast-export support --export-marks=<file> to save
> progress and --import-marks=<file> to continue incremental import?

No, cvs-fast-export does not have --export-marks. It doesn't generate the
SHA1s that feature would require. Even if it did, it's not clear how that would help.

> I would check it in the cvs-fast-export manpage, but the page seems to
> be down:
> 
>   http://isup.me/www.catb.org
> 
>     It's not just you! http://www.catb.org looks down from here.

Confirmed.  Looks like ibiblio is having a bad day.  I'll file a bug report. 

> > Fortunately, incremental dump is trivial to implement in the output
> > stage of an exporter if you have access to the exporter source code.
> > I've done it in two different exporters.  cvs-fast-export now has a
> > regression test for this case
> 
> This is, I guess, assuming that information from later commits doesn't
> change the guesses about the shape of history made from earlier commits...

That's the "stability" property that Martin Langhoff and I were discussing
earlier.

cvs-fast-export conversions are stable under incremental
lifting providing a commitid-generating version of CVS is in use
during each increment.  Portions of the history *before the first
lift* may lack commitids and will nevertheless remain stable through
the whole process.

All versions of CVS have generated commitids since 2004.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: I have end-of-lifed cvsps
  2013-12-17 18:47               ` Eric S. Raymond
@ 2013-12-17 21:26                 ` Johan Herland
  2013-12-17 22:41                   ` Eric S. Raymond
  2013-12-18 23:44                 ` Michael Haggerty
  1 sibling, 1 reply; 48+ messages in thread
From: Johan Herland @ 2013-12-17 21:26 UTC (permalink / raw)
  To: Eric Raymond; +Cc: Jakub Narębski, Martin Langhoff, Git Mailing List

On Tue, Dec 17, 2013 at 7:47 PM, Eric S. Raymond <esr@thyrsus.com> wrote:
> I'm working with Alan Barrett now on trying to convert the NetBSD
> repositories. They break cvs-fast-export through sheer bulk of
> metadata, by running the machine out of core.  This is exactly
> the kind of huge case that you're talking about.
>
> Alan and I are going to take a good hard whack at modifying cvs-fast-export
> to make this work. Because there really aren't any feasible alternatives.
> The analysis code in cvsps was never good enough. cvs2git, being written
> in Python, would hit the core limit faster than anything written in C.

Depends on how it organizes its data structures. Have you actually
tried running cvs2git on it? I'm not saying you are wrong, but I had
similar problems with my custom converter (also written in Python),
and solved them by adding multiple passes/phases instead of trying to
do too much work in fewer passes. In the end I ended up storing the
largest inter-phase data structures outside of Python (sqlite in my
case) to save memory. Obviously it cost a lot in runtime, but it meant
that I could actually chew through our largest CVS modules without
running out of memory.
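
For illustration, a minimal sketch of that sqlite-backed approach (the
schema and names here are made up, not lifted from my actual converter):

    import sqlite3

    db = sqlite3.connect("interphase.sqlite")
    db.execute("""CREATE TABLE IF NOT EXISTS rev_blob (
                      path TEXT NOT NULL,
                      revision TEXT NOT NULL,
                      blob_mark INTEGER NOT NULL,
                      PRIMARY KEY (path, revision))""")

    def record(path, revision, mark):
        # One pass writes the mapping to disk instead of keeping it in core.
        db.execute("INSERT OR REPLACE INTO rev_blob VALUES (?, ?, ?)",
                   (path, revision, mark))

    def lookup(path, revision):
        # A later pass reads it back as needed.
        row = db.execute("SELECT blob_mark FROM rev_blob"
                         " WHERE path = ? AND revision = ?",
                         (path, revision)).fetchone()
        return row[0] if row else None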

> It is certainly the case that a sufficiently large CVS repo will break
> anything, like a star with a mass over the Chandrasekhar limit becoming a
> black hole :-)

:) True, although it's not the sheer size of the files themselves that
is the actual problem. Most of those bytes are (deltified) file data,
which you can pretty much stream through and convert to a
corresponding fast-export stream of blob objects. The code for that
should be fairly straightforward (and should also be eminently
parallelizable, given enough cores and available I/O), resulting in a
table mapping CVS file:revision pairs to corresponding Git blob SHA1s,
and an accompanying (set of) packfile(s) holding said blobs.
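
For concreteness, a sketch of that blob-streaming phase in Python;
walk_revisions() and checkout_revision() are hypothetical stand-ins for
whatever reads the CVS side:

    import sys

    marks = {}       # (cvs_path, revision) -> fast-import mark number
    next_mark = 1

    def emit_blob(out, cvs_path, revision, content):
        # Write one 'blob' record in git-fast-import stream syntax and
        # remember which mark it was assigned.
        global next_mark
        mark, next_mark = next_mark, next_mark + 1
        out.write(b"blob\nmark :%d\ndata %d\n" % (mark, len(content)))
        out.write(content + b"\n")
        marks[(cvs_path, revision)] = mark
        return mark

    # for path, rev in walk_revisions(cvsroot):    # hypothetical
    #     emit_blob(sys.stdout.buffer, path, rev,
    #               checkout_revision(path, rev))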

The hard part comes when trying to correlate the metadata for all the
per-file revisions, and distill that into a consistent sequence/DAG of
changesets/commits across the entire CVS repo. And then, of course,
trying to fit all the branches and tags into that DAG of commits is
what really drives you mad... ;-)

> The question is how common such supermassive cases are. My own guess is that
> the *BSD repos and a handful of the oldest GNU projects are pretty much the
> whole set; everybody else converted to Subversion within the last decade.

You may be right. At least for the open-source cases. I suspect
there's still a considerable number of huge CVS repos within
companies' walls...

> I find the very idea of writing anything that encourages
> non-history-correct conversions disturbing and want no part of it.
>
> Which matters, because right now the set of people working on CVS lifters
> begins with me and ends with Michael Rafferty (cvs2git),

s/Rafferty/Haggerty/?

> who seems even
> less interested in incremental conversion than I am.  Unless somebody
> comes out of nowhere and wants to own that problem, it's not going
> to get solved.

Agreed. It would be nice to have something to point to for people that
want something similar to git-svn for CVS, but without a motivated
owner, it won't happen.

...Johan

-- 
Johan Herland, <johan@herland.net>
www.herland.net

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: I have end-of-lifed cvsps
  2013-12-17 21:26                 ` Johan Herland
@ 2013-12-17 22:41                   ` Eric S. Raymond
  0 siblings, 0 replies; 48+ messages in thread
From: Eric S. Raymond @ 2013-12-17 22:41 UTC (permalink / raw)
  To: Johan Herland; +Cc: Jakub Narębski, Martin Langhoff, Git Mailing List

Johan Herland <johan@herland.net>:
> > Alan and I are going to take a good hard whack at modifying cvs-fast-export
> > to make this work. Because there really aren't any feasible alternatives.
> > The analysis code in cvsps was never good enough. cvs2git, being written
> > in Python, would hit the core limit faster than anything written in C.
> 
> Depends on how it organizes its data structures. Have you actually
> tried running cvs2git on it? I'm not saying you are wrong, but I had
> similar problems with my custom converter (also written in Python),
> and solved them by adding multiple passes/phases instead of trying to
> do too much work in fewer passes. In the end I ended up storing the
> largest inter-phase data structures outside of Python (sqlite in my
> case) to save memory. Obviously it cost a lot in runtime, but it meant
> that I could actually chew through our largest CVS modules without
> running out of memory.

You make a good point.  cvs2git is descended from cvs2svn, which has
such a multipass organization - it will only have to avoid memory
limits per pass.  Alan and I will try that as a fallback if
cvs-fast-export continues to choke.
 
> > It is certainly the case that a sufficiently large CVS repo will break
> > anything, like a star with a mass over the Chandrasekhar limit becoming a
> > black hole :-)
> 
> :) True, although it's not the sheer size of the files themselves that
> is the actual problem. Most of those bytes are (deltified) file data,
> which you can pretty much stream through and convert to a
> corresponding fast-export stream of blob objects. The code for that
> should be fairly straightforward (and should also be eminently
> parallelizable, given enough cores and available I/O), resulting in a
> table mapping CVS file:revision pairs to corresponding Git blob SHA1s,
> and an accompanying (set of) packfile(s) holding said blobs.

Allowing for the fact that cvs-fast-export isn't git and doesn't use
SHA1s or packfiles, this is in fact how a large portion of
cvs-fast-export works.  The blob files get created during the walk
through the master file list, before actual topo analysis is done.

> The hard part comes when trying to correlate the metadata for all the
> per-file revisions, and distill that into a consistent sequence/DAG of
> changesets/commits across the entire CVS repo. And then, of course,
> trying to fit all the branches and tags into that DAG of commits is
> what really drives you mad... ;-)

Well I know this...:-)

> > The question is how common such supermassive cases are. My own guess is that
> > the *BSD repos and a handful of the oldest GNU projects are pretty much the
> > whole set; everybody else converted to Subversion within the last decade.
> 
> You may be right. At least for the open-source cases. I suspect
> there's still a considerable number of huge CVS repos within
> companies' walls...

If people with money want to hire me to slay those beasts, I'm available.
I'm not proud, I'll use cvs2git if I have to.
 
> > I find the very idea of writing anything that encourages
> > non-history-correct conversions disturbing and want no part of it.
> >
> > Which matters, because right now the set of people working on CVS lifters
> > begins with me and ends with Michael Rafferty (cvs2git),
> 
> s/Rafferty/Haggerty/?

Yup, I thinkoed.
 
> > who seems even
> > less interested in incremental conversion than I am.  Unless somebody
> > comes out of nowhere and wants to own that problem, it's not going
> > to get solved.
> 
> Agreed. It would be nice to have something to point to for people that
> want something similar to git-svn for CVS, but without a motivated
> owner, it won't happen.

I think the fact that it hasn't happened already is a good clue that
it's not going to. Given the decline curve of CVS usage, writing 
git-cvs might have looked like a decent investment of time once,
but that era probably ended five to eight years ago.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: I have end-of-lifed cvsps
  2013-12-17 21:02             ` Eric S. Raymond
@ 2013-12-18  0:02               ` Jakub Narębski
  2013-12-18  0:21                 ` Eric S. Raymond
  2013-12-18  0:04               ` Andreas Schwab
  1 sibling, 1 reply; 48+ messages in thread
From: Jakub Narębski @ 2013-12-18  0:02 UTC (permalink / raw)
  To: Eric Raymond; +Cc: Martin Langhoff, Git Mailing List

On Tue, Dec 17, 2013 at 10:02 PM, Eric S. Raymond <esr@thyrsus.com> wrote:
> Jakub Narębski <jnareb@gmail.com>:
>>
>> Errr... doesn't cvs-fast-export support --export-marks=<file> to save
>> progress and --import-marks=<file> to continue incremental import?
>
> No, cvs-fast-export does not have --export-marks. It doesn't generate the
> SHA1s that feature would require. Even if it did, it's not clear how that would help.

I was thinking about the following part of the git-fast-export
documentation for `--import-marks=<file>`:

  Any commits that have already been marked will not be exported again.
  If the backend uses a similar --import-marks file, this allows for incremental
  bidirectional exporting of the repository by keeping the marks the same
  across runs.

How does cvs-fast-export know where to start exporting from in incremental mode?

BTW, does cvs-fast-export support incremental *output*, or does it
also perform incremental *work*?

Anyway, that might mean that a generic fast-import-stream-based incremental
remote helper (i.e. one supporting proper thin fetch) is out of the question; perhaps
writing one for cvs / cvs-fe would bring incremental import from CVS to
git?

-- 
Jakub Narebski

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: I have end-of-lifed cvsps
  2013-12-17 21:02             ` Eric S. Raymond
  2013-12-18  0:02               ` Jakub Narębski
@ 2013-12-18  0:04               ` Andreas Schwab
  2013-12-18  0:25                 ` Eric S. Raymond
  1 sibling, 1 reply; 48+ messages in thread
From: Andreas Schwab @ 2013-12-18  0:04 UTC (permalink / raw)
  To: esr; +Cc: Jakub Narębski, Martin Langhoff, Git Mailing List

"Eric S. Raymond" <esr@thyrsus.com> writes:

> All versions of CVS have generated commitids since 2004.

Though older versions are still in use, e.g. sourceware.org still does
not generate commitids.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: I have end-of-lifed cvsps
  2013-12-18  0:02               ` Jakub Narębski
@ 2013-12-18  0:21                 ` Eric S. Raymond
  2013-12-18 15:39                   ` Jakub Narębski
  0 siblings, 1 reply; 48+ messages in thread
From: Eric S. Raymond @ 2013-12-18  0:21 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: Martin Langhoff, Git Mailing List

Jakub Narębski <jnareb@gmail.com>:
> > No, cvs-fast-export does not have --export-marks. It doesn't generate the
> > SHA1s that feature would require. Even if it did, it's not clear how that would help.
> 
> I was thinking about the following part of the git-fast-export
> documentation for `--import-marks=<file>`:
> 
>   Any commits that have already been marked will not be exported again.
>   If the backend uses a similar --import-marks file, this allows for incremental
>   bidirectional exporting of the repository by keeping the marks the same
>   across runs.

I understand that. But it's not relevant - cvs-fast-export doesn't know about
git SHA1s, and cannot.
 
> How does cvs-fast-export know where to start exporting from in incremental mode?

You give it a cutoff date. This is the same way cvsps-2.x and 3.x worked,
and it's what the cvsimport wrapper expects to pass down.

> BTW, does cvs-fast-export support incremental *output*, or does it
> also perform incremental *work*?

As I tried to explain previously in my response to Johan Herland, it's
incremental output only.  There is *no* CVS exporter known to me, or
him, that supports incremental work.  That would at best be impractically
difficult; given CVS's limitations it may actually be impossible. I wouldn't
bet against impossible.

> Anyway, that might mean that a generic fast-import-stream-based incremental
> remote helper (i.e. one supporting proper thin fetch) is out of the question; perhaps
> writing one for cvs / cvs-fe would bring incremental import from CVS to
> git?

Sorry, I don't understand that.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: I have end-of-lifed cvsps
  2013-12-18  0:04               ` Andreas Schwab
@ 2013-12-18  0:25                 ` Eric S. Raymond
  0 siblings, 0 replies; 48+ messages in thread
From: Eric S. Raymond @ 2013-12-18  0:25 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: Jakub Narębski, Martin Langhoff, Git Mailing List

Andreas Schwab <schwab@linux-m68k.org>:
> "Eric S. Raymond" <esr@thyrsus.com> writes:
> 
> > All versions of CVS have generated commitids since 2004.
> 
> Though older versions are still in use, e.g. sourceware.org still does
> not generate commitids.

That is awful.  Alas, there is not much anyone can do about stupidity
that determined.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: I have end-of-lifed cvsps
  2013-12-18  0:21                 ` Eric S. Raymond
@ 2013-12-18 15:39                   ` Jakub Narębski
  2013-12-18 16:23                     ` incremental fast-import and marks (Re: I have end-of-lifed cvsps) Jonathan Nieder
  2013-12-18 16:27                     ` I have end-of-lifed cvsps Eric S. Raymond
  0 siblings, 2 replies; 48+ messages in thread
From: Jakub Narębski @ 2013-12-18 15:39 UTC (permalink / raw)
  To: Eric Raymond; +Cc: Martin Langhoff, Git Mailing List

On Wed, Dec 18, 2013 at 1:21 AM, Eric S. Raymond <esr@thyrsus.com> wrote:
> Jakub Narębski <jnareb@gmail.com>:

>>> No, cvs-fast-export does not have --export-marks. It doesn't generate the
>>> SHA1s that feature would require. Even if it did, it's not clear how that would help.
>>
>> I was thinking about the following part of the git-fast-export
>> documentation for `--import-marks=<file>`:
>>
>>   Any commits that have already been marked will not be exported again.
>>   If the backend uses a similar --import-marks file, this allows for incremental
>>   bidirectional exporting of the repository by keeping the marks the same
>>   across runs.
>
> I understand that. But it's not relevant - cvs-fast-export doesn't know about
> git SHA1s, and cannot.

It is a bit strange that the markfile explicitly contains SHA-1s (":markid <SHA-1>")
instead of a generic reference to a commit; in the case of CVS that would be
the commitid (what to do for older repositories, though?), in the case of Bazaar
its revision id (GUID), etc.  Can we assume that one SCM's fast-export and
another SCM's fast-import use compatible commit names in the markfile?

>> How does cvs-fast-export know where to start exporting from in incremental mode?
>
> You give it a cutoff date. This is the same way cvsps-2.x and 3.x worked,
> and it's what the cvsimport wrapper expects to pass down.

Nice to know.

I think it would be possible for a remote-helper for cvs-fast-export to find
this cutoff date automatically (perhaps with some safety margin) when
fetching (incremental import).
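
A sketch of what I mean (the ref to inspect, the margin, and the date
format the -i option wants are all assumptions on my part):

    import datetime
    import subprocess

    def cutoff_date(ref="refs/heads/master", margin_hours=24):
        # Newest committer date already imported, minus a safety margin.
        ts = int(subprocess.check_output(
            ["git", "log", "-1", "--format=%ct", ref]).strip())
        when = datetime.datetime.utcfromtimestamp(ts)
        when -= datetime.timedelta(hours=margin_hours)
        return when.strftime("%Y-%m-%dT%H:%M:%SZ")

    # then something like:
    # subprocess.check_call(["cvs-fast-export", "-i", cutoff_date()])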

>> BTW, does cvs-fast-export support incremental *output*, or does it
>> also perform incremental *work*?
>
> As I tried to explain previously in my response to Johan Herland, it's
> incremental output only.  There is *no* CVS exporter known to me, or
> him, that supports incremental work.  That would at best be impractically
> difficult; given CVS's limitations it may actually be impossible. I wouldn't
> bet against impossible.

Even with saving (or re-calculating from git import) guesses about CVS
history made so far?

Anyway, I hope that incremental CVS import will be needed less
and less as CVS is replaced by more modern version control systems.

>> Anyway, that might mean that a generic fast-import-stream-based incremental
>> remote helper (i.e. one supporting proper thin fetch) is out of the question; perhaps
>> writing one for cvs / cvs-fe would bring incremental import from CVS to
>> git?
>
> Sorry, I don't understand that.

I was thinking about creating a remote-helper for cvs-fast-export, so that
git can use a local CVS repository as a "remote", using e.g. "cvsroot::<path>"
as the repo URL, and using this mechanism for incremental import (aka fetch).
(Or even "cvssync::<URL>" for automatic cvssync + cvs-fast-export).

But from what I understand this is not as easy as it seems, even with
the remote-helper API having support for fast-import streams.

-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 48+ messages in thread

* incremental fast-import and marks (Re: I have end-of-lifed cvsps)
  2013-12-18 15:39                   ` Jakub Narębski
@ 2013-12-18 16:23                     ` Jonathan Nieder
  2013-12-18 16:27                     ` I have end-of-lifed cvsps Eric S. Raymond
  1 sibling, 0 replies; 48+ messages in thread
From: Jonathan Nieder @ 2013-12-18 16:23 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Eric Raymond, Martin Langhoff, Git Mailing List

Jakub Narebski wrote:

> It is a bit strange that the markfile explicitly contains SHA-1s (":markid <SHA-1>")
> instead of a generic reference to a commit; in the case of CVS that would be
> the commitid (what to do for older repositories, though?), in the case of Bazaar
> its revision id (GUID), etc.

Usually importers use at least two separate files to save state, one
mapping between git object names and mark numbers, and the other mapping
between native revision identifiers and mark numbers.  That way,
when the importer uses marks to refer to previously imported commits or
blobs, fast-import knows what commits or blobs it is talking about.
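
A minimal sketch of joining the two files (the file names and the
native-side format are invented for illustration; fast-import's own
marks file really is one ":<mark> <sha1>" pair per line):

    def load_marks(path):
        # Each line: ":<mark> <name>"
        table = {}
        with open(path) as f:
            for line in f:
                mark, name = line.split()
                table[mark] = name
        return table

    git_marks = load_marks("git.marks")        # mark -> git SHA1
    native_marks = load_marks("native.marks")  # mark -> native revision id

    # Native revision -> git object name, joined on the shared marks:
    rev_to_sha1 = {rev: git_marks[mark]
                   for mark, rev in native_marks.items()
                   if mark in git_marks}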

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: I have end-of-lifed cvsps
  2013-12-18 15:39                   ` Jakub Narębski
  2013-12-18 16:23                     ` incremental fast-import and marks (Re: I have end-of-lifed cvsps) Jonathan Nieder
@ 2013-12-18 16:27                     ` Eric S. Raymond
  2013-12-18 16:53                       ` Martin Langhoff
  2013-12-18 17:46                       ` Jeff King
  1 sibling, 2 replies; 48+ messages in thread
From: Eric S. Raymond @ 2013-12-18 16:27 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: Martin Langhoff, Git Mailing List

Jakub Narębski <jnareb@gmail.com>:
> It is a bit strange that the markfile explicitly contains SHA-1s (":markid <SHA-1>")
> instead of a generic reference to a commit; in the case of CVS that would be
> the commitid (what to do for older repositories, though?), in the case of Bazaar
> its revision id (GUID), etc.  Can we assume that one SCM's fast-export and
> another SCM's fast-import use compatible commit names in the markfile?

For use in reposurgeon I have defined a generic cross-VCS reference to
a commit that I call an "action stamp"; it consists of an RFC3339 date followed by
a committer email address. Here's an example:

	 2013-02-06T09:35:10Z!esr@thyrsus.com

In any VCS with changesets (git, Subversion, bzr, Mercurial) this
almost always suffices to uniquely identify a commit. The "almost" is
because in these systems it is possible for a user to do multiple commits
in the same second.

And now you know why I wish git had subsecond timestamp resolution!  If it
did, uniqueness of these in a git stream could be guaranteed.

The implied model completely breaks for CVS, of course.  There you have to 
use commitids and plain give up when those don't exist.
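
Generating one is trivial; a sketch (commit_time is seconds since the
epoch, UTC):

    import datetime

    def action_stamp(commit_time, committer_email):
        when = datetime.datetime.utcfromtimestamp(commit_time)
        return when.strftime("%Y-%m-%dT%H:%M:%SZ") + "!" + committer_email

    print(action_stamp(1360143310, "esr@thyrsus.com"))
    # -> 2013-02-06T09:35:10Z!esr@thyrsus.com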
 
> I think it would be possible for a remote-helper for cvs-fast-export to find
> this cutoff date automatically (perhaps with some safety margin) when
> fetching (incremental import).

Yes.
 
> > As I tried to explain previously in my response to Johan Herland, it's
> > incremental output only.  There is *no* CVS exporter known to me, or
> > him, that supports incremental work.  That would at best be impractically
> > difficult; given CVS's limitations it may actually be impossible. I wouldn't
> > bet against impossible.
> 
> Even with saving (or re-calculating from git import) guesses about CVS
> history made so far?

Even with that.  cvsps-2.x tried to do something like this.  It was a lose.
 
> Anyway, I hope that incremental CVS import will be needed less
> and less as CVS is replaced by more modern version control systems.

I agree.  I have never understood why people on this list are attached to it.

> I was thinking about creating a remote-helper for cvs-fast-export, so that
> git can use a local CVS repository as a "remote", using e.g. "cvsroot::<path>"
> as the repo URL, and using this mechanism for incremental import (aka fetch).
> (Or even "cvssync::<URL>" for automatic cvssync + cvs-fast-export).
> 
> But from what I understand this is not as easy as it seems, even with
> the remote-helper API having support for fast-import streams.

It's a swamp I wouldn't want to walk into.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: I have end-of-lifed cvsps
  2013-12-18 16:27                     ` I have end-of-lifed cvsps Eric S. Raymond
@ 2013-12-18 16:53                       ` Martin Langhoff
  2013-12-18 19:54                         ` John Keeping
  2013-12-18 17:46                       ` Jeff King
  1 sibling, 1 reply; 48+ messages in thread
From: Martin Langhoff @ 2013-12-18 16:53 UTC (permalink / raw)
  To: Eric Raymond; +Cc: Jakub Narębski, Git Mailing List

On Wed, Dec 18, 2013 at 11:27 AM, Eric S. Raymond <esr@thyrsus.com> wrote:
>> Anyway, I hope that incremental CVS import will be needed less
>> and less as CVS is replaced by more modern version control systems.
>
> I agree.  I have never understood why people on this list are attached to it.

I think I have answered this question already once in this thread, and
a few times in similar threads with Eric in the past.

People track CVS repos that they have not control over. Smart
programmers forced to work with a corporate CVS repo. It happens also
with SVN, and witness the popularity of git-svn which can sanely
interact with an "active" svn repo.

This is a valid use case. Hard (impossible?) to support. But there
should be no surprise as to its reasons.

cheers,



m
-- 
 martin.langhoff@gmail.com
 -  ask interesting questions
 - don't get distracted with shiny stuff  - working code first
 ~ http://docs.moodle.org/en/User:Martin_Langhoff

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: I have end-of-lifed cvsps
  2013-12-18 16:27                     ` I have end-of-lifed cvsps Eric S. Raymond
  2013-12-18 16:53                       ` Martin Langhoff
@ 2013-12-18 17:46                       ` Jeff King
  2013-12-18 19:16                         ` Eric S. Raymond
  1 sibling, 1 reply; 48+ messages in thread
From: Jeff King @ 2013-12-18 17:46 UTC (permalink / raw)
  To: Eric S. Raymond; +Cc: Jakub Narębski, Martin Langhoff, Git Mailing List

On Wed, Dec 18, 2013 at 11:27:10AM -0500, Eric S. Raymond wrote:

> For use in reposurgeon I have defined a generic cross-VCS reference to
> a commit that I call an "action stamp"; it consists of an RFC3339 date followed by
> a committer email address. Here's an example:
> 
> 	 2013-02-06T09:35:10Z!esr@thyrsus.com
> 
> In any VCS with changesets (git, Subversion, bzr, Mercurial) this
> almost always suffices to uniquely identify a commit. The "almost" is
> because in these systems it is possible for a user to do multiple commits
> in the same second.

FWIW, this has quite a few collisions in git.git:

  $ git log --format='%ct %ce' | sort | uniq -c | sort -rn | head
     22 1172221032 normalperson@yhbt.net
     22 1172221031 normalperson@yhbt.net
     22 1172221029 normalperson@yhbt.net
     21 1190197351 gitster@pobox.com
     21 1172221030 normalperson@yhbt.net
     20 1190197350 gitster@pobox.com
     17 1172221033 normalperson@yhbt.net
     15 1263457676 gitster@pobox.com
     15 1193717011 gitster@pobox.com
     14 1367447590 gitster@pobox.com

In git, it may happen quite a bit during "git am" or "git rebase", in
which a large number of commits are replayed in a tight loop. You can
use the author timestamp instead, but it also collides (try "%at %ae" in
the above command instead).

> And now you know why I wish git had subsecond timestamp resolution!  If it
> did, uniqueness of these in a git stream could be guaranteed.

It's still not guaranteed. Even with sufficient resolution that no two
operations could possibly complete in the same time unit, clocks do not
always march forward. They get reset, they may skew from machine to
machine, the same operation may happen on different machines, etc. The
probability of such collisions is significantly reduced, though, if only
because the extra precision adds an essentially random factor.

But in some cases you might even see the same commit "replayed" on top
of different parts of the graph, or affecting different paths (e.g., by
filter-branch). I.e., no matter what your precision, multiple hacked-up
views of the changeset will still always have that same timestamp.

-Peff

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: I have end-of-lifed cvsps
  2013-12-18 17:46                       ` Jeff King
@ 2013-12-18 19:16                         ` Eric S. Raymond
  0 siblings, 0 replies; 48+ messages in thread
From: Eric S. Raymond @ 2013-12-18 19:16 UTC (permalink / raw)
  To: Jeff King; +Cc: Jakub Narębski, Martin Langhoff, Git Mailing List

Jeff King <peff@peff.net>:
> In git, it may happen quite a bit during "git am" or "git rebase", in
> which a large number of commits are replayed in a tight loop.

That's a good point - a repeatable real-world case in which we can
expect that behavior.

This case could be solved, though, with a slight tweak to the commit generator
in git (given subsecond timestamps).  It could keep the time of the last commit
and stall by an arbitrarily small amount, enough to show up as a timestamp
difference.
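
The shape of the idea, as a sketch (nothing like this exists in git
itself; it is purely illustrative):

    import time

    _last_stamp = 0.0

    def unique_timestamp():
        # Never hand out the same (sub-second) timestamp twice.
        global _last_stamp
        now = time.time()
        if now <= _last_stamp:
            now = _last_stamp + 1e-6   # stall by a microsecond
        _last_stamp = now
        return now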

Action stamps work pretty well inside reposurgeon because they're
mainly used to identify commits from older VCSes that can't run that
fast. Collisions are theoretically possible but I've never seen one in
the wild.

>                                                       You can
> use the author timestamp instead, but it also collides (try "%at %ae" in
> the above command instead).

Yes, obviously for the same reason. 
 
> > And now you know why I wish git had subsecond timestamp resolution!  If it
> > did, uniqueness of these in a git stream could be guaranteed.
> 
> It's still not guaranteed. Even with sufficient resolution that no two
> operations could possibly complete in the same time unit, clocks do not
> always march forward. They get reset, they may skew from machine to
> machine, the same operation may happen on different machines, etc.

Right...but the *same person* submitting operations from *different
machines* within the time window required to be caught by these effects
is at worst fantastically unlikely.  That case is exactly why action 
stamps have an email part.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: I have end-of-lifed cvsps
  2013-12-18 16:53                       ` Martin Langhoff
@ 2013-12-18 19:54                         ` John Keeping
  2013-12-18 20:20                           ` Eric S. Raymond
  0 siblings, 1 reply; 48+ messages in thread
From: John Keeping @ 2013-12-18 19:54 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Eric Raymond, Jakub Narębski, Git Mailing List

On Wed, Dec 18, 2013 at 11:53:47AM -0500, Martin Langhoff wrote:
> On Wed, Dec 18, 2013 at 11:27 AM, Eric S. Raymond <esr@thyrsus.com> wrote:
> >> Anyway, I hope that incremental CVS import will be needed less
> >> and less as CVS is replaced by more modern version control systems.
> >
> > I agree.  I have never understood why people on this list are attached to it.
> 
> I think I have answered this question already once in this thread, and
> a few times in similar threads with Eric in the past.
> 
> People track CVS repos that they have not control over. Smart
> programmers forced to work with a corporate CVS repo. It happens also
> with SVN, and witness the popularity of git-svn which can sanely
> interact with an "active" svn repo.
> 
> This is a valid use case. Hard (impossible?) to support. But there
> should be no surprise as to its reasons.

And at this point the git-cvsimport manpage says:

   WARNING: git cvsimport uses cvsps version 2, which is considered
   deprecated; it does not work with cvsps version 3 and later. If you
   are performing a one-shot import of a CVS repository consider using
   cvs2git[1] or parsecvs[2].

Which I think sums up the position nicely; if you're doing a one-shot
import then the standalone tools are going to be a better choice, but if
you're trying to use Git for your work on top of CVS the only choice is
cvsps with git-cvsimport.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: I have end-of-lifed cvsps
  2013-12-18 19:54                         ` John Keeping
@ 2013-12-18 20:20                           ` Eric S. Raymond
  2013-12-18 20:47                             ` Kent R. Spillner
  0 siblings, 1 reply; 48+ messages in thread
From: Eric S. Raymond @ 2013-12-18 20:20 UTC (permalink / raw)
  To: John Keeping; +Cc: Martin Langhoff, Jakub Narębski, Git Mailing List

John Keeping <john@keeping.me.uk>:
> Which I think sums up the position nicely; if you're doing a one-shot
> import then the standalone tools are going to be a better choice, but if
> you're trying to use Git for your work on top of CVS the only choice is
> cvsps with git-cvsimport.

Which will trash your history - the bugs in that are worse than the bugs
in 3.0, which are bad enough that I *terminated* it.

Lovely....
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: I have end-of-lifed cvsps
  2013-12-18 20:20                           ` Eric S. Raymond
@ 2013-12-18 20:47                             ` Kent R. Spillner
  0 siblings, 0 replies; 48+ messages in thread
From: Kent R. Spillner @ 2013-12-18 20:47 UTC (permalink / raw)
  To: esr; +Cc: John Keeping, Martin Langhoff, Jakub Narębski, Git Mailing List

> Which will trash your history - the bugs in that are worse than the bugs
> in 3.0, which are bad enough that I *terminated* it.

Which *might* trash your history.

cvsps v2 and git cvsimport work as advertised with simple, linear CVS
repositories.  I maintain a git mirror of an active CVS repo and run git
cvsimport every few days to sync with the latest upstream changes.  The
only problem I encountered so far was when you released cvsps v3 and broke
git cvsimport. :)  I had to manually downgrade to cvsps v2.2b1 and configure
my package manager to ignore cvsps updates, but I haven't had any problems
since.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: I have end-of-lifed cvsps
  2013-12-17 18:47               ` Eric S. Raymond
  2013-12-17 21:26                 ` Johan Herland
@ 2013-12-18 23:44                 ` Michael Haggerty
  2013-12-19  1:11                   ` Johan Herland
  2013-12-19  4:06                   ` Eric S. Raymond
  1 sibling, 2 replies; 48+ messages in thread
From: Michael Haggerty @ 2013-12-18 23:44 UTC (permalink / raw)
  To: esr; +Cc: Johan Herland, Jakub Narębski, Martin Langhoff, Git Mailing List

On 12/17/2013 07:47 PM, Eric S. Raymond wrote:
> Johan Herland <johan@herland.net>:
>> However, I fear that you underestimate the number of users that want
>> to use Git against CVS repos that are orders of magnitude larger (in
>> both dimensions: #commits and #files) than your example repo.
> 
> You may be right. See below...
> 
> I'm working with Alan Barrett now on trying to convert the NetBSD
> repositories. They break cvs-fast-export through sheer bulk of
> metadata, by running the machine out of core.  This is exactly
> the kind of huge case that you're talking about.
> 
> Alan and I are going to take a good hard whack at modifying cvs-fast-export 
> to make this work. Because there really aren't any feasible alternatives.
> The analysis code in cvsps was never good enough. cvs2git, being written
> in Python, would hit the core limit faster than anything written in C.

cvs2git goes to great lengths to store intermediate data to disk and
keep the working set small and therefore (despite the Python overhead) I
am confident that it scales better than cvs-fast-export.  My usual test
repo was gcc:

Total CVS Files:             25013
Total CVS Revisions:        578010
Total CVS Branches:        1487929
Total CVS Tags:           11435500
Total Unique Tags:             814
Total Unique Branches:         116
CVS Repos Size in KB:      2074248
Total SVN Commits:           64501

I also regularly converted mozilla (4.2 GB) and emacs (560 MB) for
testing purposes.  These could all be converted on a 32-bit computer.

Other projects that cvs2svn/cvs2git could handle: FreeBSD, Gentoo, KDE,
GNOME, PostgreSQL.  (Though for KDE, which I think was in the 16 GB
range, I know that they used a giant machine for the conversion.)

If you haven't tried cvs2git yet, please start it up somewhere in the
background.  It might take a while but it should have no trouble with
your repos, and then you can compare the tools based on experience
rather than speculation.

> Which matters, because right now the set of people working on CVS lifters
> begins with me and ends with Michael Rafferty (cvs2git), who seems even
> less interested in incremental conversion than I am.  Unless somebody
> comes out of nowhere and wants to own that problem, it's not going
> to get solved.

A correct incremental converter could be done (as long as the CVS users
don't literally change history retroactively) but it would be a lot of
work.  Parsing the CVS files isn't the problem; after all, CVS has to do
that every time you check out a branch.  The problem is the extra
bookkeeping that would be needed to keep the overlapping history
consistent between runs N and N+1 of the tool.  I sketched out what
would be necessary once and it came out to several solid weeks of work.

But the traffic on the cvs2svn/cvs2git mailing list has trailed off
essentially to zero, so either the software is perfect already (haha) or
most everybody has already converted.  Therefore I don't invest any
significant time in that project these days.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: I have end-of-lifed cvsps
  2013-12-18 23:44                 ` Michael Haggerty
@ 2013-12-19  1:11                   ` Johan Herland
  2013-12-19  9:31                     ` Michael Haggerty
  2013-12-19  4:06                   ` Eric S. Raymond
  1 sibling, 1 reply; 48+ messages in thread
From: Johan Herland @ 2013-12-19  1:11 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: Eric Raymond, Jakub Narębski, Martin Langhoff, Git Mailing List

On Thu, Dec 19, 2013 at 12:44 AM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> A correct incremental converter could be done (as long as the CVS users
> don't literally change history retroactively) but it would be a lot of work.

Although I agree with that sentence as it is stated, I also believe
that the parenthesized condition rules out a _majority_ of CVS repos of
non-trivial size/history. So even though a correct incremental
converter could be built, it would be pretty much useless if it did
not gracefully handle rewritten history. And in the face of rewritten
history it becomes pretty much impossible to define what a "correct"
conversion should even look like (not to mention the difficulty of
actually implementing that converter...).

Here are just a couple of things a CVS user can do (and that happened
fairly regularly at my previous $dayjob) that would make life
difficult for an incremental converter (and that also makes stable
output from a non-incremental converter hard to solve in practice):

 - A user "deletes" $file from $branch by simply removing the $branch
symbol on $file (cvs tag -B -d $branch $file). CVS stores no record of
this. Many non-incremental importers will see $file as never having
existed on $branch. An incremental importer starting from a previously
converted state, must somehow deal with that previous state no longer
existing from the POV of CVS.

 - A user moves a release tag on a few files to include a late bugfix
into an upcoming release (cvs tag -F -r $new_rev $tag $file). There
might be no single point in time where the tagged state existed in the
repo, it has become a "Frankentag". You could claim user error here,
and that such shortcuts should not happen, but that doesn't really
prevent it from ever happening. Recreating the tree state of the
Frankentag in Git is easy, but what kind of history do you construct
to lead up to that tree?

 - A modularized project develops code on HEAD, and makes regular
releases of each module by tagging the files in the module dir with
"$modulename-$version". Afterwards a project-wide "stable" tag is
moved on that subset of files to include the new module release into
the "stable" tag. ("stable" is conceptually a branch, but the CVS
mechanism used here is still the tag, since CVS branches cannot
"follow" eachother like in Git). This is pretty much the same
Frankentag scenario as above, except that in this case it might be
considered Best Practice (it was at our $dayjob), and not a
shortcut/user error made by a single user.

(None of these examples even involve the "cvs admin" which allows you
to do some truly scary and demented things to your CVS history...)

My point here is that people will use whatever available tools they
have to solve whatever problems they are currently having. And when
CVS is your tool, you will sooner or later end up with a "solution"
that irrevocably rewrites your CVS history.


...Johan

-- 
Johan Herland, <johan@herland.net>
www.herland.net

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: I have end-of-lifed cvsps
  2013-12-18 23:44                 ` Michael Haggerty
  2013-12-19  1:11                   ` Johan Herland
@ 2013-12-19  4:06                   ` Eric S. Raymond
  2013-12-19  9:43                     ` Michael Haggerty
  1 sibling, 1 reply; 48+ messages in thread
From: Eric S. Raymond @ 2013-12-19  4:06 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: Johan Herland, Jakub Narębski, Martin Langhoff, Git Mailing List

Michael Haggerty <mhagger@alum.mit.edu>:
> If you haven't tried cvs2git yet, please start it up somewhere in the
> background.  It might take a while but it should have no trouble with
> your repos, and then you can compare the tools based on experience
> rather than speculation.

That would be a good thing.

Michael, in case you're wondering why I've continued to work on
cvs-fast-export when cvs2git exists, there are exactly two reasons:
(a) it's a whole lot faster on repos that aren't large enough to
demand multipass, and (b) the single-whole-dumpfile output makes it a
better reposurgeon front end.

> But the traffic on the cvs2svn/cvs2git mailing list has trailed off
> essentially to zero, so either the software is perfect already (haha) or
> most everybody has already converted.  Therefore I don't invest any
> significant time in that project these days.

Reasonable.  I'm doing this as a temporary break from working on GPSD.
I don't expect to be investing a lot of time in it after I get it
to a 1.0 state.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: I have end-of-lifed cvsps
  2013-12-19  1:11                   ` Johan Herland
@ 2013-12-19  9:31                     ` Michael Haggerty
  2013-12-19 15:26                       ` Johan Herland
  0 siblings, 1 reply; 48+ messages in thread
From: Michael Haggerty @ 2013-12-19  9:31 UTC (permalink / raw)
  To: Johan Herland
  Cc: Eric Raymond, Jakub Narębski, Martin Langhoff, Git Mailing List

On 12/19/2013 02:11 AM, Johan Herland wrote:
> On Thu, Dec 19, 2013 at 12:44 AM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>> A correct incremental converter could be done (as long as the CVS users
>> don't literally change history retroactively) but it would be a lot of work.
> 
> Although I agree with that sentence as it is stated, I also believe
> that the parenthesized condition rules out a _majority_ of CVS repos of
> non-trivial size/history. So even though a correct incremental
> converter could be built, it would be pretty much useless if it did
> not gracefully handle rewritten history. And in the face of rewritten
> history it becomes pretty much impossible to define what a "correct"
> conversion should even look like (not to mention the difficulty of
> actually implementing that converter...).

A correct conversion would, conceptually, take a diff between the old
CVS history and the new CVS history (I'm talking about the history as a
whole, not a diff between two changesets), figure out what had changed,
and then figure out what Git commits to make to effect the same
conceptual changes in Git-land.

This means that the final Git history would have to depend not only on
the current entirety of the CVS history, but also on what the CVS
history *was* during previous incremental imports and how the tool chose
to represent that history in Git the previous rounds.

There is a tradeoff here.  The smarter the tool is, the fewer
restrictions would have to be made on what people can do in CVS.  For
example, it wouldn't be unreasonable to impose a rule that people are
not allowed to move files within the CVS repository (e.g., to fake
move-file-with-history) after the CVS <-> Git bridge is in use.  (Abuses
of the history that occurred *before* the first incremental conversion,
on the other hand, wouldn't be a problem.)  If the user of the
incremental tool has *no* influence on how his colleagues use CVS, then
the tool would have to be very smart and/or the user might
sometimes be forced to do another from-scratch conversion.

> Here are just a couple of things a CVS user can do (and that happened
> fairly regularly at my previous $dayjob) that would make life
> difficult for an incremental converter (and that also makes stable
> output from a non-incremental converter hard to solve in practice):
> 
>  - A user "deletes" $file from $branch by simply removing the $branch
> symbol on $file (cvs tag -B -d $branch $file). CVS stores no record of
> this. Many non-incremental importers will see $file as never having
> existed on $branch. An incremental importer starting from a previously
> converted state, must somehow deal with that previous state no longer
> existing from the POV of CVS.

No problem; the tool could just add a synthetic commit "git rm"ming the
file from the branch.  It wouldn't know *when* the file was deleted, so
it would have to pick a plausible date between the time of the last
incremental conversion and the one that discovers that the branch tag
has been removed from the file.  The resulting Git history would contain
more complete information than CVS's history.
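
Sketched as a fast-import fragment (the branch, author, and message
here are placeholders, not output of any existing tool):

    def synthetic_rm(out, branch, path, when):
        # 'when' is the plausible epoch timestamp chosen as described.
        author = b"cvs importer <importer@example.org>"
        msg = b"Delete %s (branch tag removed on the CVS side)\n" % path.encode()
        out.write(b"commit refs/heads/%s\n" % branch.encode())
        out.write(b"committer %s %d +0000\n" % (author, when))
        out.write(b"data %d\n%s" % (len(msg), msg))
        out.write(b"from refs/heads/%s^0\n" % branch.encode())
        out.write(b"D %s\n\n" % path.encode())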

>  - A user moves a release tag on a few files to include a late bugfix
> into an upcoming release (cvs tag -F -r $new_rev $tag $file). There
> might be no single point in time where the tagged state existed in the
> repo, it has become a "Frankentag". You could claim user error here,
> and that such shortcuts should not happen, but that doesn't really
> prevent it from ever happening. Recreating the tree state of the
> Frankentag in Git is easy, but what kind of history do you construct
> to lead up to that tree?

Frankentags (tags that include file versions that didn't occur
contemporaneously) can occur even with one-time CVS->Git conversions.
The only way to handle them is to create a Git branch representing the
tag and base it at a plausible Git commit, and then (on the branch)
issue a fixup commit that makes the contents of the branch equal to the
contents of the CVS branch.  This is a problem that cvs2git already handles.

A hypothetical incremental importer would have to notice the changes in
the branch contents between the previous conversion and the current one,
and create commits on the branch to bring it in line with the current
contents.  This is no uglier than what a one-shot conversion already has
to do.
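
In fast-import terms the fixup commit is just a wholesale tree reset;
a sketch, where tree maps paths to previously emitted blob marks (all
names are placeholders):

    def fixup_commit(out, branch, tree, when):
        # Make the branch tip exactly match the CVS tag's contents.
        author = b"cvs importer <importer@example.org>"
        msg = b"Fixup: match contents of moved CVS tag\n"
        out.write(b"commit refs/heads/%s\n" % branch.encode())
        out.write(b"committer %s %d +0000\n" % (author, when))
        out.write(b"data %d\n%s" % (len(msg), msg))
        out.write(b"from refs/heads/%s^0\n" % branch.encode())
        out.write(b"deleteall\n")        # start from an empty tree...
        for path, mark in sorted(tree.items()):
            out.write(b"M 100644 :%d %s\n" % (mark, path.encode()))
        out.write(b"\n")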

>  - A modularized project develops code on HEAD, and makes regular
> releases of each module by tagging the files in the module dir with
> "$modulename-$version". Afterwards a project-wide "stable" tag is
> moved on that subset of files to include the new module release into
> the "stable" tag. ("stable" is conceptually a branch, but the CVS
> mechanism used here is still the tag, since CVS branches cannot
> "follow" eachother like in Git). This is pretty much the same
> Frankentag scenario as above, except that in this case it might be
> considered Best Practice (it was at our $dayjob), and not a
> shortcut/user error made by a single user.

Same problem and same solution as above, as far as I can see.

> (None of these examples even involve the "cvs admin" which allows you
> to do some truly scary and demented things to your CVS history...)

Even some of these might be permitted.  For example:

* Obsoleting already-converted revisions: it's a pretty stupid thing to
do in most cases and the tool could just ignore such events, retaining
the history in Git.  If the revisions were obsoleted because they
contained proprietary information or something, then you've got a bigger
problem on your hands but one that you would have even if you were using
pure Git.

* Retroactive changes to log messages: would probably have to be ignored
or handled via notes.

* Changes to the "default branch" (another brain-dead CVS feature
related to vendor branches): I'd have to think about it.  But handling
vendor branches is already difficult for a one-time converter because
CVS retains too little info (but cvs2git does it except in the most
ambiguous cases).  An incremental importer would have *more* information
than a one-shot importer, because it would have a hope of catching the
change to the default branch at roughly the time it occurred.

> My point here is that people will use whatever available tools they
> have to solve whatever problems they are currently having. And when
> CVS is your tool, you will sooner or later end up with a "solution"
> that irrevocably rewrites your CVS history.

Yes, but I maintain that an incremental importer could keep a Git
history that is consistent with the CVS history in the sense that:

1. the result of checking out any branch or tag, right after a run of
the importer, gives the same results as checking the same branch or tag
out of CVS.

2. the Git history from one run is added to (never rewritten) by the
next run.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: I have end-of-lifed cvsps
  2013-12-19  4:06                   ` Eric S. Raymond
@ 2013-12-19  9:43                     ` Michael Haggerty
  0 siblings, 0 replies; 48+ messages in thread
From: Michael Haggerty @ 2013-12-19  9:43 UTC (permalink / raw)
  To: esr; +Cc: Johan Herland, Jakub Narębski, Martin Langhoff, Git Mailing List

On 12/19/2013 05:06 AM, Eric S. Raymond wrote:
> Michael Haggerty <mhagger@alum.mit.edu>:
>> If you haven't tried cvs2git yet, please start it up somewhere in the
>> background.  It might take a while but it should have no trouble with
>> your repos, and then you can compare the tools based on experience
>> rather than speculation.
> 
> That would be a good thing.
> 
> Michael, in case you're wondering why I've continued to work on
> cvs-fast-export when cvs2git exists, there are exactly two reasons:
> (a) it's a whole lot faster on repos that aren't large enough to
> demand multipass,

What difference does speed make on little repositories?  They are fast
enough anyway.

If you are worried about the speed of testing and iterating on your
reposurgeon configuration, then just write the output of cvs2git to a
temporary file and use the temporary file as input to reposurgeon.

> and (b) the single-whole-dumpfile output makes it a
> better reposurgeon front end.

I can't believe you are still hung up on this!  OK, just for you, here
it is: cvs2git-3.0, in gorgeous pipey purity:

    #! /bin/sh
    # mktemp needs the X's at the end of the template, so no .out suffix
    blobfile=$(mktemp /tmp/myblobs-XXXXXX)
    dumpfile=$(mktemp /tmp/mydump-XXXXXX)
    cvs2git-2.0 --blobfile="$blobfile" --dumpfile="$dumpfile" "$@" 1>&2 &&
    cat "$blobfile" "$dumpfile"
    rm "$blobfile" "$dumpfile"

I don't think that cvs2git-2.0 outputs any junk to stdout, but just in
case it does I've redirected stdout explicitly to stderr to avoid
commingling it with the output of this script.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: I have end-of-lifed cvsps
  2013-12-19  9:31                     ` Michael Haggerty
@ 2013-12-19 15:26                       ` Johan Herland
  2013-12-19 16:18                         ` Michael Haggerty
  0 siblings, 1 reply; 48+ messages in thread
From: Johan Herland @ 2013-12-19 15:26 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: Eric Raymond, Jakub Narębski, Martin Langhoff, Git Mailing List

On Thu, Dec 19, 2013 at 10:31 AM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> On 12/19/2013 02:11 AM, Johan Herland wrote:
>> On Thu, Dec 19, 2013 at 12:44 AM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>>> A correct incremental converter could be done (as long as the CVS users
>>> don't literally change history retroactively) but it would be a lot of work.
>>
>> Although I agree with that sentence as it is stated, I also believe
>> that the parenthesized condition rules out a _majority_ of CVS repos of
>> non-trivial size/history. So even though a correct incremental
>> converter could be built, it would be pretty much useless if it did
>> not gracefully handle rewritten history. And in the face of rewritten
>> history it becomes pretty much impossible to define what a "correct"
>> conversion should even look like (not to mention the difficulty of
>> actually implementing that converter...).
>
> A correct conversion would, conceptually, take a diff between the old
> CVS history and the new CVS history (I'm talking about the history as a
> whole, not a diff between two changesets), figure out what had changed,
> and then figure out what Git commits to make to effect the same
> conceptual changes in Git-land.
>
> This means that the final Git history would have to depend not only on
> the current entirety of the CVS history, but also on what the CVS
> history *was* during previous incremental imports and how the tool chose
> to represent that history in Git the previous rounds.
>
> There is a tradeoff here.  The smarter the tool is, the fewer
> restrictions would have to be made on what people can do in CVS.  For
> example, it wouldn't be unreasonable to impose a rule that people are
> not allowed to move files within the CVS repository (e.g., to fake
> move-file-with-history) after the CVS <-> Git bridge is in use.  (Abuses
> of the history that occurred *before* the first incremental conversion,
> on the other hand, wouldn't be a problem.)  If the user of the
> incremental tool has *no* influence on how his colleagues use CVS, then
> the tool would have to be very smart and/or the user might
> sometimes be forced to do another from-scratch conversion.

Agreed, but I find it quite ugly how the git history will end up
different depending on _when_ the incremental conversion is run. It
means that it will be impossible for two users to create the same Git
repo (matching SHA1s), unless they carefully synchronize all of their
conversion runs (at which point it's much simpler to run a single
conversion and then have both users fetch the result).

There is a continuum here in incremental converters:

At one end - given that you're always going to lose _some_ history -
you can go "screw it! let's not care about history at all!", and do
the fastest possible conversion: check out the current CVS version;
diff that against the previous CVS version; apply the diff to your Git
repo as a single commit. I suspect quite a lot of users would be happy
with this solution - at least as a temporary measure while they wait
for their surrounding organization to do a proper migration off CVS.
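
That end of the spectrum is almost trivial to sketch (error handling
and the nothing-changed case omitted; assumes a CVS checkout that is
also a git repo, with the CVS/ admin dirs in .gitignore):

    import subprocess

    def snapshot_sync(workdir):
        # Refresh the checkout, then record whatever changed as one commit.
        subprocess.check_call(["cvs", "-q", "update", "-dP"], cwd=workdir)
        subprocess.check_call(["git", "add", "-A"], cwd=workdir)
        subprocess.check_call(["git", "commit", "-m", "CVS snapshot"],
                              cwd=workdir)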

At the other end - you can realize that the CVS storage format on the
server is simply too lossy, and you can write a proxy or monitor that
intercepts CVS operations on the server, and replicates those in a
companion Git repo as soon as they occur in CVS. Whether you write a
CVS server monitor that detects changes to the CVS server files in
real time (using e.g. inotify or similar), or you write a CVS server
proxy that intercepts CVS commands from the user (also forwarding them
to the _real_ CVS server) is an implementation detail[*]. The
important thing is that you end up with a real-time stream of
changes that can be converted to corresponding changes in a Git repo.
That should give you the closest possible picture of what really happens
in a CVS repo, even better than what CVS stores in its on-disk format.
This would allow an organization to provide a (read-only) Git mirror
of their CVS repo.
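
A sketch of the monitor flavor, using crude mtime polling rather than
inotify just to keep it short:

    import os
    import time

    def watch_rcs_files(cvsroot, interval=5):
        # Yield each ,v master file whose mtime changes.
        seen = {}
        while True:
            for dirpath, _, files in os.walk(cvsroot):
                for name in files:
                    if name.endswith(",v"):
                        path = os.path.join(dirpath, name)
                        mtime = os.path.getmtime(path)
                        if seen.get(path) != mtime:
                            seen[path] = mtime
                            yield path   # hand off to the converter
            time.sleep(interval)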

What we have been discussing in this thread (various strategies for
fixing up broken history in Git) can be considered intermediate points
between the two extremes presented above: You try to recreate as much
history as possible, but realize that you sometimes need to simply
synthesize some fake history in order to make everything fit together.

>> Here are just a couple of things a CVS user can do (and that happened
>> fairly regularly at my previous $dayjob) that would make life
>> difficult for an incremental converter (and that also makes stable
>> output from a non-incremental converter hard to solve in practice):
>>
>>  - A user "deletes" $file from $branch by simply removing the $branch
>> symbol on $file (cvs tag -B -d $branch $file). CVS stores no record of
>> this. Many non-incremental importers will see $file as never having
>> existed on $branch. An incremental importer starting from a previously
>> converted state, must somehow deal with that previous state no longer
>> existing from the POV of CVS.
>
> No problem; the tool could just add a synthetic commit "git rm"ming the
> file from the branch.  It wouldn't know *when* the file was deleted, so
> it would have to pick a plausible date between the time of the last
> incremental conversion and the one that discovers that the branch tag
> has been removed from the file.  The resulting Git history would contain
> more complete information than CVS's history.

A server proxy/monitor analyzing CVS operations in real time would
know _exactly_ when the file was removed...
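
(For illustration, the back-dated synthetic deletion you describe
might be produced along these lines. Just a sketch: the repo path,
branch, file and date are all made up.)

import os
import subprocess

def synthesize_deletion(repo, branch, path, plausible_date):
    # Back-date the commit into the window between the two conversion runs.
    env = dict(os.environ,
               GIT_AUTHOR_DATE=plausible_date,
               GIT_COMMITTER_DATE=plausible_date)
    def git(*args):
        subprocess.run(("git",) + args, cwd=repo, env=env, check=True)
    git("checkout", branch)
    git("rm", "--", path)
    git("commit", "-m",
        "Synthetic commit: %s lost its %s tag in CVS" % (path, branch))

synthesize_deletion("/path/to/git-repo", "some-branch", "foo.c",
                    "2013-12-15T12:00:00")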

>>  - A user moves a release tag on a few files to include a late bugfix
>> into an upcoming release (cvs tag -F -r $new_rev $tag $file). There
>> might be no single point in time where the tagged state existed in the
>> repo; it has become a "Frankentag". You could claim user error here,
>> and that such shortcuts should not happen, but that doesn't really
>> prevent it from ever happening. Recreating the tree state of the
>> Frankentag in Git is easy, but what kind of history do you construct
>> to lead up to that tree?
>
> Frankentags (tags that include file versions that didn't occur
> contemporaneously) can occur even with one-time CVS->Git conversions.
> The only way to handle them is to create a Git branch representing the
> tag and base it at a plausible Git commit, and then (on the branch)
> issue a fixup commit that makes the contents of the branch equal to the
> contents of the CVS branch.  This is a problem that cvs2git already handles.
>
> A hypothetical incremental importer would have to notice the changes in
> the branch contents between the previous conversion and the current one,
> and create commits on the branch to bring it in line with the current
> contents.  This is no uglier than what a one-shot conversion already has
> to do.

True, but analyzing CVS operations in real time, you might be able to
recreate the moving (and adding/deleting) of tags as file edits (and
adds/deletes) in the corresponding Git branch.
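
As a sketch, the fixup approach described above could look like this
(export_tag_contents is a made-up placeholder for whatever extracts
the tagged file contents from CVS and overwrites the worktree with
them, deleting files not in the tag):

import subprocess

def fixup_tag(repo, tag, base_commit, export_tag_contents):
    def git(*args):
        subprocess.run(("git",) + args, cwd=repo, check=True)
    # Base a branch for the tag at a plausible commit...
    git("checkout", "-b", tag + "-fixup", base_commit)
    # ...force the worktree to match the CVS tag exactly...
    export_tag_contents(tag, repo)
    # ...and commit the difference, then point the Git tag at it.
    git("add", "-A")
    git("commit", "--allow-empty", "-m",
        "Fixup commit to match contents of CVS tag " + tag)
    git("tag", tag)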

>>  - A modularized project develops code on HEAD, and makes regular
>> releases of each module by tagging the files in the module dir with
>> "$modulename-$version". Afterwards a project-wide "stable" tag is
>> moved on that subset of files to include the new module release into
>> the "stable" tag. ("stable" is conceptually a branch, but the CVS
>> mechanism used here is still the tag, since CVS branches cannot
>> "follow" eachother like in Git). This is pretty much the same
>> Frankentag scenario as above, except that in this case it might be
>> considered Best Practice (it was at our $dayjob), and not a
>> shortcut/user error made by a single user.
>
> Same problem and same solution as above, as far as I can see.
>
>> (None of these examples even involve the "cvs admin" which allows you
>> to do some truly scary and demented things to your CVS history...)
>
> Even some of these might be permitted.  For example:
>
> * Obsoleting already-converted revisions: it's a pretty stupid thing to
> do in most cases and the tool could just ignore such events, retaining
> the history in Git.  If the revisions were obsoleted because they
> contained proprietary information or something, then you've got a bigger
> problem on your hands but one that you would have even if you were using
> pure Git.
>
> * Retroactive changes to log messages: would probably have to be ignored
> or handled via notes.
>
> * Changes to the "default branch" (another brain-dead CVS feature
> related to vendor branches): I'd have to think about it.  But handling
> vendor branches is already difficult for a one-time converter because
> CVS retains too little info (but cvs2git does it except in the most
> ambiguous cases).  An incremental importer would have *more* information
> than a one-shot importer, because it would have a hope of catching the
> change to the default branch at roughly the time it occurred.

Agreed, but if you want correct metadata (_when_ did these changes
happen, _who_ performed them), then you need to actually monitor the
CVS command stream (or CVS server files) in real time...

>> My point here is that people will use whatever available tools they
>> have to solve whatever problems they are currently having. And when
>> CVS is your tool, you will sooner or later end up with a "solution"
>> that irrevocably rewrites your CVS history.
>
> Yes, but I maintain that an incremental importer could keep a Git
> history that is consistent with the CVS history in the sense that:
>
> 1. the result of checking out any branch or tag, right after a run of
> the importer, gives the same results as checking the same branch or tag
> out of CVS.
>
> 2. the Git history from one run is added to (never rewritten) by the
> next run.

Yes, and even my simplest/fastest possible converter described above
can meet those criteria. After that, it really becomes a question of
_how_much_ CVS history you want to retain in your incremental import.
I have described the two extremes above. Interestingly, _both_ of
those extremes would look quite different from the
whole-history-gone-incremental converters represented by cvs2git and
cvs-fast-export, and _both_ of the extremes would probably also
provide a converted result quite a bit faster than anything in between
(one by virtue of depending on a single "cvs update" command, and the
other by monitoring the CVS server and performing the conversion to
Git in real time).


...Johan


[*]: That said, I suspect git-cvsserver would be a good starting point
for implementing a CVS server proxy, if someone is actually interested
in looking at this...

-- 
Johan Herland, <johan@herland.net>
www.herland.net


* Re: I have end-of-lifed cvsps
  2013-12-19 15:26                       ` Johan Herland
@ 2013-12-19 16:18                         ` Michael Haggerty
  0 siblings, 0 replies; 48+ messages in thread
From: Michael Haggerty @ 2013-12-19 16:18 UTC (permalink / raw)
  To: Johan Herland
  Cc: Eric Raymond, Jakub Narębski, Martin Langhoff, Git Mailing List

On 12/19/2013 04:26 PM, Johan Herland wrote:
> On Thu, Dec 19, 2013 at 10:31 AM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>> On 12/19/2013 02:11 AM, Johan Herland wrote:
>>> On Thu, Dec 19, 2013 at 12:44 AM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>>>> A correct incremental converter could be done (as long as the CVS users
>>>> don't literally change history retroactively) but it would be a lot of work.
>>>
>>> Although I agree with that sentence as it is stated, I also believe
>>> that the parenthesized condition rules out a _majority_ of CVS repos of
>>> non-trivial size/history. So even though a correct incremental
>>> converter could be built, it would be pretty much useless if it did
>>> not gracefully handle rewritten history. And in the face of rewritten
>>> history it becomes pretty much impossible to define what a "correct"
>>> conversion should even look like (not to mention the difficulty of
>>> actually implementing that converter...).
>>
>> A correct conversion would, conceptually, take a diff between the old
>> CVS history and the new CVS history (I'm talking about the history as a
>> whole, not a diff between two changesets), figure out what had changed,
>> and then figure out what Git commits to make to effect the same
>> conceptual changes in Git-land.
>>
>> This means that the final Git history would have to depend not only on
>> the current entirety of the CVS history, but also on what the CVS
>> history *was* during previous incremental imports and how the tool chose
>> to represent that history in Git in the previous rounds.
>>
>> There is a tradeoff here.  The smarter the tool is, the fewer
>> restrictions would have to be made on what people can do in CVS.  For
>> example, it wouldn't be unreasonable to impose a rule that people are
>> not allowed to move files within the CVS repository (e.g., to fake
>> move-file-with-history) after the CVS <-> Git bridge is in use.  (Abuses
>> of the history that occurred *before* the first incremental conversion,
>> on the other hand, wouldn't be a problem.)  If the user of the
>> incremental tool has *no* influence on how his colleagues use CVS, then
>> the tool would have to be very smart and/or the user might
>> sometimes be forced to do another from-scratch conversion.
> 
> Agreed, but I find it quite ugly how the git history will end up
> different depending on _when_ the incremental conversion is run. It
> means that it will be impossible for two users to create the same Git
> repo (matching SHA1s), unless they carefully synchronize all of their
> conversion runs

Even git-svn doesn't guarantee the same results over time.  The most
obvious scenario when it fails is when somebody changes an SVN commit's
metadata retroactively using something like "svn propedit --revprop
svn:log".  Consistency over time across two independent conversion
processes (that don't communicate) is not even theoretically possible.

> (at which point it's much simpler to run a single
> conversion and then have both users fetch the result).

Yes.  That is a very reasonable approach.

[Discussion of hypothetical real-time inode-watching or proxy-based
converter omitted here...]
> Agreed, but if you want correct metadata (_when_ did these changes
> happen, _who_ performed them), then you need to actually monitor the
> CVS command stream (or CVS server files) in real time...

In my opinion it is ridiculous to design a CVS <-> Git bridge that
uses back-channels to fill in historical data that even CVS
doesn't record.  Such a thing would require an intimate connection to
the CVS server from the IT department that is presumably blocking a real
move to Git.  So who would ever be able to use it?

The only reason to record extra information would be to enable the
bridge to do self-consistent incremental conversions, and in that case
the *only* extra information that has to be recorded is the information
that would have anyway landed in Git during the previous conversion.

>>> My point here is that people will use whatever available tools they
>>> have to solve whatever problems they are currently having. And when
>>> CVS is your tool, you will sooner or later end up with a "solution"
>>> that irrevocably rewrites your CVS history.
>>
>> Yes, but I maintain that an incremental importer could keep a Git
>> history that is consistent with the CVS history in the sense that:
>>
>> 1. the result of checking out any branch or tag, right after a run of
>> the importer, gives the same results as checking the same branch or tag
>> out of CVS.
>>
>> 2. the Git history from one run is added to (never rewritten) by the
>> next run.
> 
> Yes, and even my simplest/fastest possible converter described above
> can meet those criteria. After that, it really becomes a question of
> _how_much_ CVS history you want to retain in your incremental import.

I think you want enough history to make it pleasant to work with the
resulting Git repository.  That approximately means that you need some
semblance of the CVS commits to be reconstructed, with their correct
metadata, on the closest thing to their correct branches that is
consistent with the CVS - Git impedance mismatch.

> I have described the two extremes above. Interestingly, _both_ of
> those extremes would look quite different from the
> whole-history-gone-incremental converters represented by cvs2git and
> cvs-fast-export, and _both_ of the extremes would probably also
> provide a converted result quite a bit faster than anything in between
> (one by virtue of depending on a single "cvs update" command, and the
> other by monitoring the CVS server and performing the conversion to
> Git in real time).

I am not an extremist.  And I know how much work it would be to start a
project like this from scratch.  After all, what it can do should be a
strict superset of what a tool like cvs2git can do, and cvs2svn/cvs2git
(according to Ohloh's COCOMO estimate) contains the equivalent of 7
person-years of effort.

Anyway, this is all just blah blah unless somebody volunteers to work on
it.  And I think that is highly unlikely, especially given the
decreasing number of CVS repositories in the wild.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

