All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mike Hommey <mh@glandium.org>
To: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: fast-import slowness when importing large files with small differences
Date: Sat, 30 Jun 2018 08:35:38 +0900	[thread overview]
Message-ID: <20180629233538.7zxxrvou4twqyd6d@glandium.org> (raw)
In-Reply-To: <87o9ftckhb.fsf@evledraar.gmail.com>

On Sat, Jun 30, 2018 at 12:10:24AM +0200, Ævar Arnfjörð Bjarmason wrote:
> 
> On Fri, Jun 29 2018, Mike Hommey wrote:
> 
> > I noticed some slowness when fast-importing data from the Firefox mercurial
> > repository, where fast-import spends more than 5 minutes importing ~2000
> > revisions of one particular file. I reduced a testcase while still
> > using real data. One could synthesize data with kind of the same
> > properties, but I figured real data could be useful.
> >
> > To reproduce:
> > $ git clone https://gist.github.com/b6b8edcff2005cc482cf84972adfbba9.git foo
> > $ git init bar
> > $ cd bar
> > $ python ../foo/import.py ../foo/data.gz | git fast-import --depth=2000
> >
> > [...]
> > So maybe it would make sense to consolidate the diff code (after all,
> > diff-delta.c is an old specialized fork of xdiff). With manual trimming
> > of common head and tail, this gets down to 3:33.
> >
> > I'll also note that Facebook has imported xdiff from the git code base
> > into mercurial and improved performance on it, so it might also be worth
> > looking at what's worth taking from there.
> 
> It would be interesting to see how does this compares with a more naïve
> approach of committing every version of this file one-at-a-time into a
> new repository (with & without gc.auto=0). Perhaps deltaing as we go is
> suboptimal compared to just writing out a lot of redundant data and
> repacking it all at once later.

"Just" writing 26GB? And that's only one file. If I were to do that for
the whole repository, it would yield a > 100GB pack. Instead of < 2GB
currently.

Mike

  reply	other threads:[~2018-06-29 23:35 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-06-29  9:44 fast-import slowness when importing large files with small differences Mike Hommey
2018-06-29 20:14 ` Stefan Beller
2018-06-29 20:28   ` [PATCH] xdiff: reduce indent heuristic overhead Stefan Beller
2018-06-29 21:17     ` Junio C Hamano
2018-06-29 23:37       ` Stefan Beller
2018-06-30  1:11         ` Jun Wu
2018-07-01 15:57     ` Michael Haggerty
2018-07-02 17:27       ` Stefan Beller
2018-07-03  9:15         ` Michael Haggerty
2018-07-27 22:23           ` Stefan Beller
2018-07-03 18:14       ` Junio C Hamano
2018-06-29 20:39   ` fast-import slowness when importing large files with small differences Jeff King
2018-06-29 20:51     ` Stefan Beller
2018-06-29 22:10 ` Ævar Arnfjörð Bjarmason
2018-06-29 23:35   ` Mike Hommey [this message]
2018-07-03 16:05     ` Ævar Arnfjörð Bjarmason
2018-07-03 22:38       ` Mike Hommey

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180629233538.7zxxrvou4twqyd6d@glandium.org \
    --to=mh@glandium.org \
    --cc=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.