* Update on SoC proposal: git-remote-svn
@ 2010-04-13  5:29 Ramkumar Ramachandra
  2010-04-13  8:34 ` Sam Vilain
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Ramkumar Ramachandra @ 2010-04-13  5:29 UTC (permalink / raw)
  To: Sam Vilain; +Cc: Sverre Rabbelier, David Michael Barr, Git Mailing List


Hi,

Sam Vilain commented on my SoC proposal through Google's SoC interface
and asked me to CC my response to the mailing list. His original
comments are quoted below.

----------------------->8----------------------<8-------------------
Hi Sam,

> Hi Ramkumar, I've looked at this proposal and seen that it differs a
> bit from the version on the list, and I can't see the relevant
> discussion, so I'll just throw my bit in here - though note this is a
> technical comment and not a critique of your proposal.

There's been a lot of discussion, some on the list, and some more off
the list.

> First, consider not using the SVN API at all while prototyping the
> import part of the chain.  Instead, parse the 'svnadmin dump' stream
> from a local mirror.  This will allow you to tackle the actual
> problems involved and importing the data effectively, without
> suffering from the brain-damage that is the SVN API.  After all, the
> SVN API should be returning you all of the same information that the
> dump stream does, so you can treat making it work using the remote
> access API (eg svn_ra_replay, which is faster for mirroring AIUI) as a
> separate task.  You will also more easily spot information which you
> should be extracting from the API, but aren't - it's definitely all in
> the dump format; it has to be.  I received similar advice to this
> before building a perforce importer and let's just say it was
> invaluable.

Yes. I've studied the SVN API, and I agree with you: it's quite
horrible. Instead of providing an API that's transparently
backward-compatible, they've provided different methods for different
versions. There are also several variations of certain methods, which
is quite confusing.

`svnadmin dump` is exactly how I plan to start out: I've already
discussed this with my to-be mentor Sverre Rabbelier, and with David
Michael Barr, who's building a new SVN exporter in his own time.
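
Just to give an idea of what that first step involves, here's a rough
sketch of a reader for an undeltified dump stream on stdin (purely
illustrative, not a committed design; property payloads are slurped
but not parsed):

import sys

def read_headers(stream):
    """Read one RFC-822-style header block from a dump stream; None at EOF."""
    headers = {}
    while True:
        line = stream.readline()
        if not line:
            return headers or None
        line = line.rstrip(b'\n')
        if not line:
            if headers:
                return headers
            continue  # skip blank lines between records
        key, _, value = line.partition(b': ')
        headers[key.decode()] = value.decode()

def records(stream):
    """Yield ('revision'|'node', headers, payload) for each dump record."""
    while True:
        headers = read_headers(stream)
        if headers is None:
            return
        # Property and text payloads are length-prefixed, so they can be
        # slurped (or skipped) without being understood.
        props = stream.read(int(headers.get('Prop-content-length', 0)))
        text = stream.read(int(headers.get('Text-content-length', 0)))
        if 'Revision-number' in headers:
            yield 'revision', headers, props
        elif 'Node-path' in headers:
            yield 'node', headers, text

if __name__ == '__main__':
    # Example: svnadmin dump /path/to/repo | python dumpreader.py
    for kind, headers, payload in records(sys.stdin.buffer):
        if kind == 'revision':
            print('r' + headers['Revision-number'])
        else:
            print('  %s %s' % (headers.get('Node-action'), headers['Node-path']))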

> Second, consider making the mirror phase emit directly to a tracking
> branch via git-fast-export, that is not intended to be checked out.
> Instead, it contains trees which correspond to revisions in the
> mirrored SVN repository.  Directory and file properties can be saved
> in the tree using specially named dotfiles, and revision properties
> can be saved in the commit log.  Perhaps I misread your intentions
> with the "stripped down svnsync" part, but syncing to a local SVN
> repository seems to me like a waste of time; people can just do that
> themselves if they choose anyway.  An SVN repository can easily be 10
> times the size of the corresponding git store, and it just seems like
> double-handling of the data and will make the whole process slower and
> more cumbersome than it needs to be.

> With all the blobs already in the git store, and the information
> needed to perform the data mining operation which is the extraction of
> git-style branch histories from the svn data, you will be working with
> data which is all in git-land, and exporting referencing blobs which
> are already in the store.  This will save you a LOT of time, as it
> means in this stage you are not handling the actual file images; just
> constructing branch histories in the git-fast-import stream.  Your
> branch miner will potentially be able to process thousands of
> revisions per second this way, even from python.

Agreed. Sverre, David and I discussed exactly this: the final version
of the mirror-client will dump all the SVN information into a Git
store first, so we can do the mapping painlessly in Git. There are
some concerns about information loss, though, which we'll have to deal
with as we go on.
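
To make the raw tracking branch idea concrete, here is a sketch of
emitting one fast-import commit per SVN revision, keeping the revision
properties in the commit message (the ref name, e-mail addresses and
trailer line are placeholders, nothing we've settled on):

import sys, time

def emit_commit(out, revnum, revprops, changed_blobs):
    """Write one git fast-import commit representing SVN revision `revnum`.

    `changed_blobs` maps repository-relative paths to the full file
    contents (bytes) changed in that revision; deletions, directory
    properties and copy info are left out of this sketch.
    """
    author = revprops.get('svn:author', 'nobody')
    # Keep every revision property verbatim so no SVN metadata is lost.
    message = revprops.get('svn:log', '') + '\n\n' + (
        'svn-revision: %d\n' % revnum) + ''.join(
        '%s: %s\n' % (k, v) for k, v in sorted(revprops.items()))
    msg = message.encode('utf-8')
    out.write(b'commit refs/remotes/svn/raw\n')
    out.write(('committer %s <%s@svn> %d +0000\n'
               % (author, author, int(time.time()))).encode('utf-8'))
    out.write(b'data %d\n' % len(msg))
    out.write(msg + b'\n')
    for path, blob in sorted(changed_blobs.items()):
        out.write(('M 100644 inline %s\n' % path).encode('utf-8'))
        out.write(b'data %d\n' % len(blob))
        out.write(blob + b'\n')

if __name__ == '__main__':
    # Example: pipe the output into `git fast-import` in an empty repository.
    emit_commit(sys.stdout.buffer, 1,
                {'svn:author': 'ram', 'svn:log': 'initial import',
                 'svn:date': '2010-04-13T05:29:00.000000Z'},
                {'trunk/README': b'hello from r1\n'})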

> Also bear in mind
> that people might use SVN in a way that violates the expectations of
> this branch miner.  An example is putting a README file in the
> top-level projects directory, a heuristic approach might consider that
> the start of a new project and then mess up later stages.  Another
> example is people accidentally deleting trunk and re-adding it; the
> nice thing about this two-stage approach is that it allows advanced
> users to muck with the "raw" data (ie, this whole repository tracking
> branch) using git to do things like graft away the bad revisions, and
> then the second stage will use the corrected data.  Of course
> eventually, this detail will be hidden by the remote helper.

Excellent suggestion! I'll attempt to build the plumbing for the
mapping in a manner that exposes a sane interface.

> As a general comment - you must be careful in trying to assume that
> what you are attempting is even possible.  Sure, you want 'git clone
> svn://example.com/myrepo' to work, but what does that mean?  A
> repository in SVN is a filesystem, which can contain multiple
> projects.  In git, a repo is a single project.  People might expect to
> be able to clone the trunk URL for instance.  My advice there is to
> not support that use case at all, it's a complete can of worms which
> you will discover as you tackle the conversion algorithms.  Just focus
> on making the case where the complete repository is mirrored work for
> this project.  Mining a single branch out of SVN without all data
> available is the domain of git-svn and really you don't want to go
> there.

Hm, this is something that I hadn't thought about earlier. Thanks for
the suggestion: I will not attempt to handle the complicated cases, at
least not during my summer term.

> Anyway like I say, please follow-up on the mailing list, and this
> advice can receive wider scrutiny.

Thank you for your valuable comment!

-- Ram

* Re: Update on SoC proposal: git-remote-svn
  2010-04-13  5:29 Update on SoC proposal: git-remote-svn Ramkumar Ramachandra
@ 2010-04-13  8:34 ` Sam Vilain
  2010-04-13 16:01 ` Sverre Rabbelier
  2010-04-14  6:33 ` Steven Michalske
  2 siblings, 0 replies; 6+ messages in thread
From: Sam Vilain @ 2010-04-13  8:34 UTC (permalink / raw)
  To: Ramkumar Ramachandra
  Cc: Sverre Rabbelier, David Michael Barr, Git Mailing List

Ramkumar Ramachandra wrote:
>> Hi Ramkumar, I've looked at this proposal and seen that it differs a
>> bit from the version on the list, and I can't see the relevant
>> discussion, so I'll just throw my bit in here - though note this is a
>> technical comment and not a critique of your proposal.
>>     
>
> There's been a lot of discussion, some on the list, and some more off
> the list.
>   

Right, that will be why :-).  No problem; apologies for not catching
the earlier discussion.  It's good to hear that you have collectively
reached a design very similar to the one suggested by my Perforce
importer experiment (SVN is, of course, a poor imitation of Perforce
in many ways).

>> Anyway like I say, please follow-up on the mailing list, and this
>> advice can receive wider scrutiny.
>>     
>
> Thank you for your valuable comment!
>   

No problem at all.  I will watch the development of this project with
interest.

All the best,
Sam

* Re: Update on SoC proposal: git-remote-svn
  2010-04-13  5:29 Update on SoC proposal: git-remote-svn Ramkumar Ramachandra
  2010-04-13  8:34 ` Sam Vilain
@ 2010-04-13 16:01 ` Sverre Rabbelier
  2010-04-14  6:33 ` Steven Michalske
  2 siblings, 0 replies; 6+ messages in thread
From: Sverre Rabbelier @ 2010-04-13 16:01 UTC (permalink / raw)
  To: Ramkumar Ramachandra; +Cc: Sam Vilain, David Michael Barr, Git Mailing List

Heya,

On Tue, Apr 13, 2010 at 07:29, Ramkumar Ramachandra <artagnon@gmail.com> wrote:
> Hm, this is something that I hadn't thought about earlier. Thanks for
> the suggestion- I will not attempt to go into complicated cases,
> atleast in my summer term.

I think we can safely say that "git clone svn://example.com/myrepo"
only has to work for svn repositories that have a 'sane' layout. If
you want a non-standard layout, you can do 'git init && git config
.... && git fetch', configuring everything the way you want before
fetching. We could perhaps even provide a tool to aid with configuring
things; or perhaps a mode that does no rewriting (just import the data
into git as-is) and then a tool that examines the history and
interactively helps you set up an appropriate rewrite config?
Interesting times are ahead :).
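
A first cut of such a helper could be a layout guesser over the
repository root listing; something like this (purely illustrative,
names and return values made up):

def guess_layout(toplevel_entries):
    """Guess how a repository root is laid out.

    `toplevel_entries` is the listing of the repository root, e.g. the
    output of `svn ls svn://example.com/myrepo`.  Returns 'standard' for
    the conventional trunk/branches/tags triple, 'flat' when none of the
    conventional names appear, and 'unknown' otherwise (multi-project
    roots, partial layouts, ...).
    """
    names = {entry.rstrip('/') for entry in toplevel_entries}
    conventional = {'trunk', 'branches', 'tags'}
    if 'trunk' in names and names & {'branches', 'tags'}:
        return 'standard'
    if not names & conventional:
        return 'flat'
    return 'unknown'

# Examples:
#   guess_layout(['trunk/', 'branches/', 'tags/'])      -> 'standard'
#   guess_layout(['src/', 'README'])                    -> 'flat'
#   guess_layout(['projectA/', 'projectB/', 'trunk/'])  -> 'unknown'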

-- 
Cheers,

Sverre Rabbelier

* Re: Update on SoC proposal: git-remote-svn
  2010-04-13  5:29 Update on SoC proposal: git-remote-svn Ramkumar Ramachandra
  2010-04-13  8:34 ` Sam Vilain
  2010-04-13 16:01 ` Sverre Rabbelier
@ 2010-04-14  6:33 ` Steven Michalske
  2010-04-14 12:52   ` David Michael Barr
  2010-04-14 17:15   ` David Michael Barr
  2 siblings, 2 replies; 6+ messages in thread
From: Steven Michalske @ 2010-04-14  6:33 UTC (permalink / raw)
  To: Ramkumar Ramachandra
  Cc: Sam Vilain, Sverre Rabbelier, David Michael Barr, Git Mailing List

Ramkumar,

In reading this I wondered how large an svn dump of one of the
repositories I monitor would be.  If I were to check out the svn root
of that repository, it would use well over 3TB of disk space; I filled
my 750GB drive with about a third of it checked out.  That's roughly
256MB of code with thousands of tags and hundreds of branches.

It looks like svnadmin dump defaults to dumping all data.  Fortunately  
it has a delta option, which looks like it would be needed to dump  
this repository I am speaking of without filling up many hard drives.

This might also be helped if the dumps are chunked into ranges of many
thousands of commits; that would keep the files more manageable.
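
For instance, producing such chunks could be scripted around
svnadmin's -r, --incremental and --deltas options; a rough sketch
(chunk size and file names are arbitrary):

import subprocess

def chunked_dump(repo_path, youngest, chunk=1000, deltas=True):
    """Dump `repo_path` in pieces of `chunk` revisions, one file per range.

    --incremental makes each piece carry only the changes for its range
    (instead of a full snapshot of its first revision), and --deltas
    additionally emits file contents as deltas rather than fulltexts.
    """
    for lower in range(0, youngest + 1, chunk):
        upper = min(lower + chunk - 1, youngest)
        cmd = ['svnadmin', 'dump', repo_path,
               '-r', '%d:%d' % (lower, upper), '--incremental']
        if deltas:
            cmd.append('--deltas')
        with open('repo-r%07d-%07d.dump' % (lower, upper), 'wb') as piece:
            subprocess.run(cmd, stdout=piece, check=True)

# Example: chunked_dump('/var/svn/bigrepo', youngest=150000)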

Just food for thought.

Steve

* Re: Update on SoC proposal: git-remote-svn
  2010-04-14  6:33 ` Steven Michalske
@ 2010-04-14 12:52   ` David Michael Barr
  2010-04-14 17:15   ` David Michael Barr
  1 sibling, 0 replies; 6+ messages in thread
From: David Michael Barr @ 2010-04-14 12:52 UTC (permalink / raw)
  To: Steven Michalske
  Cc: Ramkumar Ramachandra, Sam Vilain, Sverre Rabbelier, Git Mailing List

Hi Steve,

> In reading this I wondered how large an svn dump of one of the
> repositories I monitor would be.  If I were to check out the svn
> root of that repository, it would use well over 3TB of disk space;
> I filled my 750GB drive with about a third of it checked out.
> That's roughly 256MB of code with thousands of tags and hundreds
> of branches.

I encountered this issue with my first attempt to validate the output
of my dump conversion tool. My case wasn't as dire: 350GB would have
sufficed, but I was working in a 160GB partition.
Checking out tags side by side is a sure way to fill your disk.

> It looks like svnadmin dump defaults to dumping all data.
> Fortunately it has a delta option, which looks like it would be
> needed to dump this repository I am speaking of without filling
> up many hard drives.

The svn dump format is not quite that silly: even without
deltification it doesn't output blobs that are just an unaltered copy
from a previous revision.
Handling deltified dumps will greatly increase the complexity of the
import process, since blob content would have to be computed from
existing blobs rather than simply passed through.

> This might also be helped if the dumps are chunked into ranges of
> many thousands of commits; that would keep the files more
> manageable.

Being able to handle a dump stream reassembled from such
piecewise dumps is an important feature which I haven't finished
implementing yet.
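
Roughly, reassembly amounts to keeping the format-version/UUID
preamble from the first piece and appending only the revision records
from the rest; a sketch of that idea (not the tool's actual
implementation):

def reassemble(piece_paths, out):
    """Concatenate incremental dump pieces into a single stream.

    Keeps the SVN-fs-dump-format-version/UUID preamble from the first
    piece only; later pieces contribute just their revision records.
    Assumes the pieces are given in ascending revision order and are
    small enough to read into memory.
    """
    for index, path in enumerate(piece_paths):
        with open(path, 'rb') as piece:
            data = piece.read()
        if index > 0:
            start = data.find(b'Revision-number:')
            data = data[start:] if start != -1 else b''
        out.write(data)

# Example:
#   import sys, glob
#   reassemble(sorted(glob.glob('repo-r*.dump')), sys.stdout.buffer)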

> Just food for thought.

Thanks for the feed.

--
David Barr

* Re: Update on SoC proposal: git-remote-svn
  2010-04-14  6:33 ` Steven Michalske
  2010-04-14 12:52   ` David Michael Barr
@ 2010-04-14 17:15   ` David Michael Barr
  1 sibling, 0 replies; 6+ messages in thread
From: David Michael Barr @ 2010-04-14 17:15 UTC (permalink / raw)
  To: Steven Michalske
  Cc: Ramkumar Ramachandra, Sam Vilain, Sverre Rabbelier, Git Mailing List

Hi Steve,

> If I were to check out the svn root of that repository, it would use
> well over 3TB of disk space to have that checked out ...

This stirred my thoughts, and I whipped up a bash script that uses
SVK, find, shasum and ln to build a filesystem view of the root of an
svn repository while consuming only moderate storage:

SVK_DEPOT=""        # name of the SVK depot path mirroring the repository
MAX_REV=12345       # youngest revision to walk up to

CO_DIR=validation   # working copy that gets updated revision by revision
HASH_DIR=hashes     # content-addressed store of file images (hard links)

svk co -r1 /$SVK_DEPOT/ $CO_DIR
mkdir -p $HASH_DIR
for (( REV=1 ; REV<=MAX_REV ; ++REV )) do
  svk up -r$REV $CO_DIR
  # Hashify working copy: only files not yet hard-linked into the
  # store (-links 1) are hashed
  find $CO_DIR -type d -cmin -5 -prune -o \
    -type f -links 1 -exec shasum '{}' + | (
    while read HASH FILE ; do
      # Append 'x' so executables never share a store entry with
      # non-executables of identical content
      [ -x "$FILE" ] && HASH="$HASH"x
      # Link new content into the store, or replace the working file
      # with a hard link to the copy already stored
      ln "$FILE" $HASH_DIR/$HASH 2>/dev/null || \
        ln -f $HASH_DIR/$HASH "$FILE"
    done
  )
done

Important assumptions are that each update will take less than
5 minutes and that SVK writes to a temporary file and then renames it
to perform a modification.
I've used this to build a simple validation script for my project.
I estimate it will use about 20GB to represent my 1GB repo and take
about 3 hours.

--
David Barr
