Update on SoC proposal: git-remote-svn

* Update on SoC proposal: git-remote-svn
@ 2010-04-13  5:29 Ramkumar Ramachandra
  2010-04-13  8:34 ` Sam Vilain
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Ramkumar Ramachandra @ 2010-04-13  5:29 UTC (permalink / raw)
  To: Sam Vilain; +Cc: Sverre Rabbelier, David Michael Barr, Git Mailing List

Hi,

Sam Vilain commented on my SoC proposal using Google's SoC interface,
and requested me to CC my response to the mailing list. His original
comment is also quoted below.

----------------------->8----------------------<8-------------------
Hi Sam,

> Hi Ramkumar, I've looked at this proposal and seen that it differs a
> bit from the version on the list, and I can't see the relevant
> discussion, so I'll just throw my bit in here - though note this is a
> technical comment and not a critique of your proposal.

There's been a lot of discussion, some on the list, and some more off
the list.

> First, consider not using the SVN API at all while prototyping the
> import part of the chain.  Instead, parse the 'svnadmin dump' stream
> from a local mirror.  This will allow you to tackle the actual
> problems involved and importing the data effectively, without
> suffering from the brain-damage that is the SVN API.  After all, the
> SVN API should be returning you all of the same information that the
> dump stream does, so you can treat making it work using the remote
> access API (eg svn_ra_replay, which is faster for mirroring AIUI) as a
> separate task.  You will also more easily spot information which you
> should be extracting from the API, but aren't - it's definitely all in
> the dump format; it has to be.  I received similar advice to this
> before building a perforce importer and let's just say it was
> invaluable.

Yes. I've studied the SVN API, and I agree with you- it's quite
horrible. Instead of providing a API that's transparently
backward-compatible, they've provided different methods for different
versions. There are also several variations of certain methods, and
this is quite confusing.

`svnadmin dump` is exactly how I plan to start out- I've already
discussed this with my to-be mentor, Sverre Rabbelier and David
Michael-Barr, who's building a new SVN exporter in his own time.

> Second, consider making the mirror phase emit directly to a tracking
> branch via git-fast-export, that is not intended to be checked out.
> Instead, it contains trees which correspond to revisions in the
> mirrored SVN repository.  Directory and file properties can be saved
> in the tree using specially named dotfiles, and revision properties
> can be saved in the commit log.  Perhaps I misread your intentions
> with the "stripped down svnsync" part, but syncing to a local SVN
> repository seems to me like a waste of time; people can just do that
> themselves if they choose anyway.  An SVN repository can easily be 10
> times the size of the corresponding git store, and it just seems like
> double-handling of the data and will make the whole process slower and
> more cumbersome than it needs to be.

> With all the blobs already in the git store, and the information
> needed to perform the data mining operation which is the extraction of
> git-style branch histories from the svn data, you will be working with
> data which is all in git-land, and exporting referencing blobs which
> are already in the store.  This will save you a LOT of time, as it
> means in this stage you are not handling the actual file images; just
> constructing branch histories in the git-fast-import stream.  Your
> branch miner will potentially be able to process thousands of
> revisions per second this way, even from python.

Agreed. Sverre, David and I discussed exactly this- The final version
of mirror-client will dump all the SVN information to a Git store
first, so we can do the mapping painlessly in Git. There are some
concerns about information loss though, which we'll have to deal with
as we go on.

> Also bear in mind
> that people might use SVN in a way that violates the expectations of
> this branch miner.  An example is putting a README file in the
> top-level projects directory, a heuristic approach might consider that
> the start of a new project and then mess up later stages.  Another
> example is people accidentally deleting trunk and re-adding it; the
> nice thing about this two-stage approach is that it allows advanced
> users to muck with the "raw" data (ie, this whole repository tracking
> branch) using git to do things like graft away the bad revisions, and
> then the second stage will use the corrected data.  Of course
> eventually, this detail will be hidden by the remote helper.

Excellent suggestion! I'll attempt to build the plumbing for the
mapping in a manner that exposes a sane interface.

> As a general comment - you must be careful in trying to assume that
> what you are attempting is even possible.  Sure, you want 'git clone
> svn://example.com/myrepo' to work, but what does that mean?  A
> repository in SVN is a filesystem, which can contain multiple
> projects.  In git, a repo is a single project.  People might expect to
> be able to clone the trunk URL for instance.  My advice there is to
> not support that use case at all, it's a complete can of worms which
> you will discover as you tackle the conversion algorithms.  Just focus
> on making the case where the complete repository is mirrored work for
> this project.  Mining a single branch out of SVN without all data
> available is the domain of git-svn and really you don't want to go
> there.

Hm, this is something that I hadn't thought about earlier. Thanks for
the suggestion- I will not attempt to go into complicated cases,
atleast in my summer term.

> Anyway like I say, please follow-up on the mailing list, and this
> advice can receive wider scrutiny.

Thank you for your valuable comment!

-- Ram

^ permalink raw reply	[flat|nested] 6+ messages in thread