* Is there a way to speed up remote-hg?
From: John Szakmeister @ 2013-04-20 11:07 UTC
  To: git, felipe.contreras

I really like the idea of remote-hg, but it appears to be awfully slow
on the clone step:

    ...
    progress revision 81499 'master' (81500/81664)
    progress revision 81599 'master' (81600/81664)
    Checking out files: 100% (3744/3744), done.
    git clone hg::https://bitbucket.org/python_mirrors/cpython  4484.61s user 41510.05s system 102% cpu 12:29:45.73 total

That seems like an awfully high price to pay.  Is there a way to speed
this up at all?  I realize the Python hg repo has more history than
others, but even a smaller project like Sphinx takes a while:

    git clone hg::https://bitbucket.org/birkenfeld/sphinx  56.41s user 90.86s system 98% cpu 2:28.87 total

I was just curious if something more could be done here.  I don't go
around cloning Python all the time, so it's not a big issue, but it'd
be nice if it was more performant.

Thanks!

-John


* Re: Is there a way to speed up remote-hg?
From: Felipe Contreras @ 2013-04-20 23:07 UTC
  To: John Szakmeister; +Cc: git

On Sat, Apr 20, 2013 at 6:07 AM, John Szakmeister <john@szakmeister.net> wrote:
> I really like the idea of remote-hg, but it appears to be awfully slow
> on the clone step:

The short answer is no. I do have a couple of patches that improve
performance, but not by a huge factor.

I have profiled the code, and there are two significant places where
performance is wasted:

1) Fetching the file contents

Extracting, decompressing, transferring, and then compressing and
storing the file contents is mostly unavoidable, unless we already
have the contents of that file, which in Git would be easy to
check by looking at the checksum (SHA-1). Unfortunately Mercurial
doesn't have that information. The SHA-1 that is stored is not of the
contents alone, but of the contents plus the parent checksums, which
means that if you revert a modification you made to a file, or move a
file, or do any operation that ends up in the same contents but from a
different path, the SHA-1 is different. This means the only way to know
if the contents are the same is to extract them and calculate the SHA-1
yourself, which defeats the purpose of the calculation.
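
To make the difference concrete, here is a minimal sketch of the two
hashing schemes. It assumes the documented formats (Git hashes a
"blob <size>\0" header plus the contents; Mercurial hashes the two
sorted parent node IDs followed by the stored text) and ignores
Mercurial's copy/rename metadata. The function names are made up for
illustration; this is not code from remote-hg.

```python
import hashlib

NULL_ID = b"\0" * 20  # Mercurial's node ID for "no parent"


def git_blob_id(data: bytes) -> str:
    """Git: SHA-1 of a small header plus the contents, nothing else."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def hg_file_node(data: bytes, p1: bytes = NULL_ID, p2: bytes = NULL_ID) -> str:
    """Mercurial: SHA-1 of the sorted parent node IDs plus the contents."""
    low, high = sorted((p1, p2))
    return hashlib.sha1(low + high + data).hexdigest()


original = b"hello\n"
changed = b"hello world\n"

# Three file revisions: create, modify, revert to the original text.
v1 = hg_file_node(original)
v2 = hg_file_node(changed, p1=bytes.fromhex(v1))
v3 = hg_file_node(original, p1=bytes.fromhex(v2))

print("git ids equal:", git_blob_id(original) == git_blob_id(original))
print("hg nodes equal:", v1 == v3)
```

The first and third revisions hold identical bytes, so Git would give
them one blob ID, while the Mercurial node IDs differ because the
parents differ.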

I've tried both calculating the SHA-1 and using a previous reference to
avoid the transfer, and doing the transfer and letting Git check for
existing objects; it doesn't make a difference.

This is by Mercurial's stupid design, and there's nothing we, or
anybody could do about it until they change it.

2) Checking for file changes

For each commit (or revision), we need to figure out which files were
modified, and for that, Mercurial has a neat shortcut that stores such
modifications in the commit context itself, so it's easy to retrieve.
Unfortunately, it's sometimes wrong.

Since the Mercurial tools never use this information for any real
work, only to show the changes to users, Mercurial folks never
noticed the contents they were storing were wrong. Which means if you
have a repository that started with old versions of Mercurial, chances
are this information would be wrong, and there's no real guarantee
that future versions won't have this problem, since to this day this
information continues to be used only to display stuff to the user.

So, since we cannot rely on this, we need to manually check for
differences the way Mercurial does, which blows performance away,
because you need to get the contents of the two parent revisions and
compare them. By contents I mean the manifest, or list of
files, which takes a considerable amount of time to retrieve.
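
As a rough illustration of that manual check, the sketch below models
a manifest as a plain dict mapping each path to its file node ID and
compares a parent manifest against a child one. The dict model and the
name changed_files are assumptions for illustration; this is not the
actual remote-hg code or the Mercurial API.

```python
def changed_files(parent_manifest, child_manifest):
    """Return (added_or_modified, removed) paths between two manifests.

    Each manifest is a dict of path -> file node ID. Every path in both
    manifests is visited, so the cost grows with the size of the tree,
    not with the number of files the commit actually touched.
    """
    modified = [path for path, node in child_manifest.items()
                if parent_manifest.get(path) != node]
    removed = [path for path in parent_manifest
               if path not in child_manifest]
    return sorted(modified), sorted(removed)


parent = {"README": "a1", "setup.py": "b2", "old.txt": "c3"}
child = {"README": "a1", "setup.py": "b9", "new.txt": "d4"}

print(changed_files(parent, child))
# (['new.txt', 'setup.py'], ['old.txt'])
```

Doing this for every revision of a repository with thousands of files
is where the time goes; trusting the stored list of modified files
would skip the full walk.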

For 1) there's nothing we can do, and for 2) we could trust the files
Mercurial thinks were modified, and that gives us a very significant
boost, but the repository will sometimes end up wrong. Most of the
time is spent on 2).

So unfortunately there's nothing we can do; that's just Mercurial's
design, and it really has nothing to do with Git. Any other tool would
have the same problems, even a tool that converts a Mercurial
repository to Mercurial (without using tricks).

It seems Bazaar is more sensible in this regard; 1) the checksums are
truly of the file contents, and 2) each revision does store the file
modifications correctly. So a clone in Bazaar is much faster. In my
opinion Mercurial just screwed up their design.

Cheers.

-- 
Felipe Contreras


* Re: Is there a way to speed up remote-hg?
From: John Szakmeister @ 2013-04-21 12:59 UTC
  To: Felipe Contreras; +Cc: git

On Sat, Apr 20, 2013 at 7:07 PM, Felipe Contreras
<felipe.contreras@gmail.com> wrote:
> On Sat, Apr 20, 2013 at 6:07 AM, John Szakmeister <john@szakmeister.net> wrote:
>> I really like the idea of remote-hg, but it appears to be awfully slow
>> on the clone step:
>
> The short answer is no. I do have a couple of patches that improve
> performance, but not by a huge factor.
>
> I have profiled the code, and there are two significant places where
> performance is wasted:
>
> 1) Fetching the file contents
>
> Extracting, decompressing, transferring, and then compressing and
> storing the file contents is mostly unavoidable, unless we already
> have the contents of such file, which in Git, it would be easy to
> check by analyzing the checksum (SHA-1). Unfortunately Mercurial
> doesn't have that information. The SHA-1 that is stored is not of the
> contents, but the contents and the parent checksum, which means that
> if you revert a modification you made to a file, or move a file, any
> operation that ends up in the same contents, but from a different
> path, the SHA-1 is different. This means the only way to know if the
> contents are the same, is by extracting, and calculating the SHA-1
> yourself, which defeats the purpose of what you want the calculation
> for.
>
> I've tried both calculating the SHA-1 and using a previous reference to
> avoid the transfer, and doing the transfer and letting Git check for
> existing objects; it doesn't make a difference.
>
> This is by Mercurial's stupid design, and there's nothing we, or
> anybody could do about it until they change it.

That's a bummer. :-(

> 2) Checking for file changes
>
> For each commit (or revision), we need to figure out which files were
> modified, and for that, Mercurial has a neat shortcut that stores such
> modifications in the commit context itself, so it's easy to retrieve.
> Unfortunately, it's sometimes wrong.
>
> Since the Mercurial tools never use this information for any real
> work, only to show the changes to users, Mercurial folks never
> noticed the contents they were storing were wrong. Which means if you
> have a repository that started with old versions of Mercurial, chances
> are this information would be wrong, and there's no real guarantee
> that future versions won't have this problem, since to this day this
> information continues to be used only to display stuff to the user.
>
> So, since we cannot rely on this, we need to manually check for
> differences the way Mercurial does, which blows performance away,
> because you need to get the contents of the two parent revisions and
> compare them. By contents I mean the manifest, or list of
> files, which takes a considerable amount of time to retrieve.

Eek!

> For 1) there's nothing we can do, and for 2) we could trust the files
> Mercurial thinks were modified, and that gives us a very significant
> boost, but the repository will sometimes end up wrong. Most of the
> time is spent on 2).
>
> So unfortunately there's nothing we can do, that's just Mercurial
> design, and it really has nothing to do with Git. Any other tool would
> have the same problems, even a tool that converts a Mercurial
> repository to Mercurial (without using tricks).
[snip]

That's unfortunate, but thank you for taking the time to explain!

-John
