* fast-import slowness when importing large files with small differences
@ 2018-06-29  9:44 Mike Hommey
  2018-06-29 20:14 ` Stefan Beller
  2018-06-29 22:10 ` Ævar Arnfjörð Bjarmason
  0 siblings, 2 replies; 17+ messages in thread

From: Mike Hommey @ 2018-06-29  9:44 UTC (permalink / raw)
To: git

Hi,

I noticed some slowness when fast-importing data from the Firefox mercurial
repository, where fast-import spends more than 5 minutes importing ~2000
revisions of one particular file. I reduced a testcase while still
using real data. One could synthesize data with kind of the same
properties, but I figured real data could be useful.

To reproduce:
$ git clone https://gist.github.com/b6b8edcff2005cc482cf84972adfbba9.git foo
$ git init bar
$ cd bar
$ python ../foo/import.py ../foo/data.gz | git fast-import --depth=2000

(--depth=2000 to minimize the pack size)

The python script doesn't have much overhead:
$ time python ../foo/import.py ../foo/data.gz > /dev/null
real    0m14.564s
user    0m9.813s
sys     0m4.703s

It generates about 26GB of data from that 4.2MB data.gz.
$ python ../foo/import.py ../foo/data.gz | time git fast-import --depth=2000
git-fast-import statistics:
---------------------------------------------------------------------
Alloc'd objects:       5000
Total objects:         1868 (       133 duplicates                  )
      blobs  :         1868 (       133 duplicates       1867 deltas of       1868 attempts)
      trees  :            0 (         0 duplicates          0 deltas of          0 attempts)
      commits:            0 (         0 duplicates          0 deltas of          0 attempts)
      tags   :            0 (         0 duplicates          0 deltas of          0 attempts)
Total branches:           0 (         0 loads     )
      marks:           1024 (         0 unique    )
      atoms:              0
Memory total:          2282 KiB
       pools:          2048 KiB
     objects:           234 KiB
---------------------------------------------------------------------
pack_report: getpagesize()            =       4096
pack_report: core.packedGitWindowSize = 1073741824
pack_report: core.packedGitLimit      = 35184372088832
pack_report: pack_used_ctr            =          0
pack_report: pack_mmap_calls          =          0
pack_report: pack_open_windows        =          0 /          0
pack_report: pack_mapped              =          0 /          0
---------------------------------------------------------------------
321.61user 6.60system 5:50.08elapsed 93%CPU (0avgtext+0avgdata 83192maxresident)k
0inputs+10568outputs (0major+38689minor)pagefaults 0swaps

(The resulting pack is 5.3MB, fwiw)

Obviously, sha1'ing 26GB is not going to be free, but it's also not the
dominating cost, according to perf:

 63.52%  git-fast-import  git-fast-import  [.] create_delta_index
 17.46%  git-fast-import  git-fast-import  [.] sha1_compression_states
  9.89%  git-fast-import  git-fast-import  [.] ubc_check
  6.23%  git-fast-import  git-fast-import  [.] create_delta
  2.49%  git-fast-import  git-fast-import  [.] sha1_process

That's a whole lot of time spent on create_delta_index.

FWIW, if delta were 100% free (yes, I tested that), the fast-import would
take 1:40 with the following profile:

 58.74%  git-fast-import  git-fast-import  [.] sha1_compression_states
 32.45%  git-fast-import  git-fast-import  [.] ubc_check
  8.25%  git-fast-import  git-fast-import  [.] sha1_process

I toyed with the idea of eliminating the common head and tail before
creating the delta, and got some promising results: a fast-import taking
3:22 instead of 5:50, with the following profile:

 34.67%  git-fast-import  git-fast-import  [.] create_delta_index
 30.88%  git-fast-import  git-fast-import  [.] sha1_compression_states
 17.15%  git-fast-import  git-fast-import  [.] ubc_check
  7.25%  git-fast-import  git-fast-import  [.] store_object
  4.47%  git-fast-import  git-fast-import  [.] sha1_process
  2.72%  git-fast-import  git-fast-import  [.] create_delta2

The resulting pack is, however, much larger (for some reason, many objects
are left non-deltified), and the deltas are partly broken (they don't apply
cleanly), but that just shows the code is not ready to be sent. I don't
expect working code to be much slower than this. The remaining question is
whether this is beneficial for more normal cases.

I also seem to remember, from testing a while ago, that xdiff somehow
handles those files faster than diff-delta, and I'm wondering if it would
make sense to make the pack code use xdiff. So I tested replacing
diff_delta with a call to xdi_diff_outf with a callback that does nothing,
and with zeroed-out xpparam_t and xdemitconf_t (not sure that's best,
though; I haven't looked very deeply), and that finished in 5:15 with the
following profile (without common head trimming; xdiff-interface apparently
does common tail trimming already):

 32.99%  git-fast-import  git-fast-import  [.] xdl_prepare_ctx.isra.0
 20.42%  git-fast-import  git-fast-import  [.] sha1_compression_states
 15.26%  git-fast-import  git-fast-import  [.] xdl_hash_record
 11.65%  git-fast-import  git-fast-import  [.] ubc_check
  3.09%  git-fast-import  git-fast-import  [.] xdl_recs_cmp
  3.03%  git-fast-import  git-fast-import  [.] sha1_process
  2.91%  git-fast-import  git-fast-import  [.] xdl_prepare_env

So maybe it would make sense to consolidate the diff code (after all,
diff-delta.c is an old specialized fork of xdiff).
With manual trimming of common head and tail, this gets down to 3:33.

I'll also note that Facebook has imported xdiff from the git code base
into mercurial and improved performance on it, so it might also be worth
looking at what's worth taking from there.

Cheers,
Mike

^ permalink raw reply	[flat|nested] 17+ messages in thread
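The common head/tail trimming Mike describes can be sketched as follows (a
hypothetical helper in Python, not the code he actually tested; the idea is
simply that the expensive delta/index step then only sees the differing
middle of the two buffers):

```python
def trim_common(a: bytes, b: bytes):
    """Split off the common head and tail of two buffers, so that an
    expensive delta/diff routine only needs to look at the middles."""
    # Common head: advance while the bytes match.
    head = 0
    limit = min(len(a), len(b))
    while head < limit and a[head] == b[head]:
        head += 1
    # Common tail: advance from the end, never overlapping the head.
    tail = 0
    while tail < limit - head and a[-1 - tail] == b[-1 - tail]:
        tail += 1
    mid_a = a[head:len(a) - tail]
    mid_b = b[head:len(b) - tail]
    return head, tail, mid_a, mid_b
```

For revisions of a large file with small differences, `mid_a` and `mid_b`
are tiny compared to the inputs, which is why trimming cuts so much time off
create_delta_index.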
* Re: fast-import slowness when importing large files with small differences
  2018-06-29  9:44 fast-import slowness when importing large files with small differences Mike Hommey
@ 2018-06-29 20:14 ` Stefan Beller
  2018-06-29 20:28   ` [PATCH] xdiff: reduce indent heuristic overhead Stefan Beller
  2018-06-29 20:39   ` fast-import slowness when importing large files with small differences Jeff King
  2018-06-29 22:10 ` Ævar Arnfjörð Bjarmason
  1 sibling, 2 replies; 17+ messages in thread

From: Stefan Beller @ 2018-06-29 20:14 UTC (permalink / raw)
To: Mike Hommey, Jameson Miller; +Cc: git

On Fri, Jun 29, 2018 at 3:18 AM Mike Hommey <mh@glandium.org> wrote:
>
> Hi,
>
> I noticed some slowness when fast-importing data from the Firefox mercurial
> repository, where fast-import spends more than 5 minutes importing ~2000
> revisions of one particular file. I reduced a testcase while still
> using real data. One could synthesize data with kind of the same
> properties, but I figured real data could be useful.

I cc'd Jameson, who refactored memory allocation in fast-import recently.
(I am not aware of other refactorings in the area of fast-import.)

> To reproduce:
[...]
> Memory total:          2282 KiB
>        pools:          2048 KiB
>      objects:           234 KiB
[...]

> Obviously, sha1'ing 26GB is not going to be free, but it's also not the
> dominating cost, according to perf:
>
>  63.52%  git-fast-import  git-fast-import  [.] create_delta_index

So this doesn't sound like a memory issue, but a diffing/deltaing issue.

> So maybe it would make sense to consolidate the diff code (after all,
> diff-delta.c is an old specialized fork of xdiff). With manual trimming
> of common head and tail, this gets down to 3:33.

This sounds interesting. I'd love to see that code unified.

> I'll also note that Facebook has imported xdiff from the git code base
> into mercurial and improved performance on it, so it might also be worth
> looking at what's worth taking from there.
So starting with https://www.mercurial-scm.org/repo/hg/rev/34e2ff1f9cd8
("xdiff: vendor xdiff library from git") they adapted it slightly:

$ hg log --template '{node|short} {desc|firstline}\n' -- mercurial/thirdparty/xdiff/
a2baa61bbb14 xdiff: move stdint.h to xdiff.h
d40b9e29c114 xdiff: fix a hard crash on Windows
651c80720eed xdiff: silence a 32-bit shift warning on Windows
d255744de97a xdiff: backport int64_t and uint64_t types to Windows
e5b14f5b8b94 xdiff: resolve signed unsigned comparison warning
f1ef0e53e628 xdiff: use int64 for hash table size
f0d9811dda8e xdiff: remove unused xpp and xecfg parameters
49fe6249937a xdiff: remove unused flags parameter
882657a9f768 xdiff: replace {unsigned ,}long with {u,}int64_t
0c7350656f93 xdiff: add comments for fields in xdfile_t
f33a87cf60cc xdiff: add a preprocessing step that trims files
3cf40112efb7 xdiff: remove xmerge related logic
90f8fe72446c xdiff: remove xemit related logic
b5bb0f99064d xdiff: remove unused structure, functions, and constants
09f320067591 xdiff: remove whitespace related feature
1f9bbd1d6b8a xdiff: fix builds on Windows
c420792217c8 xdiff: reduce indent heuristic overhead
b3c9c483cac9 xdiff: add a bdiff hunk mode
9e7b14caf67f xdiff: remove patience and histogram diff algorithms
34e2ff1f9cd8 xdiff: vendor xdiff library from git

Interesting pieces regarding performance:

c420792217c8 xdiff: reduce indent heuristic overhead
https://phab.mercurial-scm.org/rHGc420792217c89622482005c99e959b9071c109c5

f33a87cf60cc xdiff: add a preprocessing step that trims files
https://phab.mercurial-scm.org/rHGf33a87cf60ccb8b46e06b85e60bc5031420707d6

I'll see if I can make that into patches.

Thanks,
Stefan

^ permalink raw reply	[flat|nested] 17+ messages in thread
* [PATCH] xdiff: reduce indent heuristic overhead
  2018-06-29 20:14 ` Stefan Beller
@ 2018-06-29 20:28   ` Stefan Beller
  2018-06-29 21:17     ` Junio C Hamano
  2018-07-01 15:57     ` Michael Haggerty
  2018-06-29 20:39   ` fast-import slowness when importing large files with small differences Jeff King
  1 sibling, 2 replies; 17+ messages in thread

From: Stefan Beller @ 2018-06-29 20:28 UTC (permalink / raw)
To: sbeller; +Cc: mhagger, git, jamill, mh

This patch was written originally for mercurial at
https://phab.mercurial-scm.org/rHGc420792217c89622482005c99e959b9071c109c5

changeset:   36674:c420792217c8
user:        Jun Wu <quark@fb.com>
date:        Sat Mar 03 12:39:11 2018 -0800
files:       mercurial/thirdparty/xdiff/xdiffi.c
description:
xdiff: reduce indent heuristic overhead

Adds some threshold to avoid expensive cases, like:

```
#!python
open('a', 'w').write(" \n" * 1000000)
open('b', 'w').write(" \n" * 1000001)
```

The indent heuristic is O(N * 20) (N = 1000000) for the above case, and
makes diff much slower.

Before this patch (system git 2.14.2):

```
git diff --no-indent-heuristic a b  0.21s user 0.03s system 100% cpu 0.239 total
git diff --indent-heuristic a b     0.77s user 0.02s system 99% cpu 0.785 total
```

After this patch (git 2fc74f41, with xdiffi.c patched):

```
# with the changed xdiffi.c
git diff --indent-heuristic a b     0.16s user 0.01s system 90% cpu 0.188 total
git diff --no-indent-heuristic a b  0.18s user 0.01s system 99% cpu 0.192 total
```

Now turning on indent-heuristic has no visible impact on performance.

Differential Revision: https://phab.mercurial-scm.org/D2601

Signed-off-by: Stefan Beller <sbeller@google.com>
---
This applies on our master branch; I have not thought of a good commit
message or whether we need to test it.
Thanks,
Stefan

 xdiff/xdiffi.c | 38 +++++++++++++++++++++++++++++++++++---
 1 file changed, 35 insertions(+), 3 deletions(-)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 0de1ef463bf..c74ec77da58 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -807,6 +807,14 @@ static void xdl_bug(const char *msg)
 	exit(1);
 }
 
+/*
+ * For indentation heuristic, skip searching for better slide position after
+ * checking MAX_BORING lines without finding an improvement. This defends the
+ * indentation heuristic logic against pathological cases. The value is not
+ * picked scientifically but should be good enough.
+ */
+#define MAX_BORING 100
+
 /*
  * Move back and forward change groups for a consistent and pretty diff output.
  * This also helps in finding joinable change groups and reducing the diff
@@ -903,19 +911,43 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 			long shift, best_shift = -1;
 			struct split_score best_score;
 
-			for (shift = earliest_end; shift <= g.end; shift++) {
+			/*
+			 * This is O(N * MAX_BORING) (N = shift-able lines).
+			 * Even with MAX_BORING bounded to a small value, a
+			 * large N could still make this loop take several
+			 * times longer than the main diff algorithm. The
+			 * "boring" value is to help cut down N to something
+			 * like (MAX_BORING + groupsize).
+			 *
+			 * Scan from bottom to top. So we can exit the loop
+			 * without compromising the assumption "for a same best
+			 * score, pick the bottommost shift".
+			 */
+			int boring = 0;
+			for (shift = g.end; shift >= earliest_end; shift--) {
 				struct split_measurement m;
 				struct split_score score = {0, 0};
+				int cmp;
 
 				measure_split(xdf, shift, &m);
 				score_add_split(&m, &score);
 				measure_split(xdf, shift - groupsize, &m);
 				score_add_split(&m, &score);
-				if (best_shift == -1 ||
-				    score_cmp(&score, &best_score) <= 0) {
+
+				if (best_shift == -1) {
+					cmp = -1;
+				} else {
+					cmp = score_cmp(&score, &best_score);
+				}
+				if (cmp < 0) {
+					boring = 0;
 					best_score.effective_indent = score.effective_indent;
 					best_score.penalty = score.penalty;
 					best_shift = shift;
+				} else {
+					boring += 1;
+					if (boring >= MAX_BORING)
+						break;
 				}
 			}
-- 
2.18.0.399.gad0ab374a1-goog

^ permalink raw reply related	[flat|nested] 17+ messages in thread
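The scan-direction trick in the patch can be isolated into a small sketch
(hypothetical Python, with a list of scores standing in for the score_cmp
results; lower is better, and ties keep the bottommost candidate, mirroring
the patch's strict `cmp < 0` test):

```python
MAX_BORING = 100  # same constant as the patch

def best_with_boring_cutoff(scores, max_boring=MAX_BORING):
    """Scan candidate shift positions from bottom to top and stop after
    `max_boring` candidates in a row fail to improve on the best score
    seen so far."""
    best_idx, best = -1, None
    boring = 0
    for idx in range(len(scores) - 1, -1, -1):  # bottom to top
        score = scores[idx]
        if best is None or score < best:        # strict improvement only
            best_idx, best = idx, score
            boring = 0
        else:
            boring += 1
            if boring >= max_boring:
                break
    return best_idx
```

Because equal scores count as "boring", a long run of identical candidates
(the pathological all-blank-lines case) exits after `max_boring` steps while
still returning the bottommost position with the best score.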
* Re: [PATCH] xdiff: reduce indent heuristic overhead
  2018-06-29 20:28 ` [PATCH] xdiff: reduce indent heuristic overhead Stefan Beller
@ 2018-06-29 21:17   ` Junio C Hamano
  2018-06-29 23:37     ` Stefan Beller
  2018-07-01 15:57   ` Michael Haggerty
  1 sibling, 1 reply; 17+ messages in thread

From: Junio C Hamano @ 2018-06-29 21:17 UTC (permalink / raw)
To: Stefan Beller; +Cc: mhagger, git, jamill, mh

Stefan Beller <sbeller@google.com> writes:

> This patch was written originally for mercurial at
> https://phab.mercurial-scm.org/rHGc420792217c89622482005c99e959b9071c109c5
>
> changeset:   36674:c420792217c8
> user:        Jun Wu <quark@fb.com>
> date:        Sat Mar 03 12:39:11 2018 -0800
> files:       mercurial/thirdparty/xdiff/xdiffi.c
> description:
> xdiff: reduce indent heuristic overhead
> ...

This should come with an in-body header to credit the original author as
the author, I think.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* [PATCH] xdiff: reduce indent heuristic overhead
  2018-06-29 21:17 ` Junio C Hamano
@ 2018-06-29 23:37   ` Stefan Beller
  2018-06-30  1:11     ` Jun Wu
  0 siblings, 1 reply; 17+ messages in thread

From: Stefan Beller @ 2018-06-29 23:37 UTC (permalink / raw)
To: gitster; +Cc: git, jamill, mh, mhagger, sbeller, quark

From: Jun Wu <quark@fb.com>

This patch was written originally for mercurial at [1], adding a limit on
how long we'd be looking for an optimal indent heuristic. Choose the limit
high enough to only limit edge cases.

Adds some threshold to avoid expensive cases, like:

```
#!python
open('a', 'w').write(" \n" * 1000000)
open('b', 'w').write(" \n" * 1000001)
```

The indent heuristic is O(N * 20) (N = 1000000) for the above case, and
makes diff much slower.

Before this patch (system git 2.14.2):

```
git diff --no-indent-heuristic a b  0.21s user 0.03s system 100% cpu 0.239 total
git diff --indent-heuristic a b     0.77s user 0.02s system 99% cpu 0.785 total
```

After this patch (git 2fc74f41, with xdiffi.c patched):

```
# with the changed xdiffi.c
git diff --indent-heuristic a b     0.16s user 0.01s system 90% cpu 0.188 total
git diff --no-indent-heuristic a b  0.18s user 0.01s system 99% cpu 0.192 total
```

Now turning on indent-heuristic has no visible impact on performance.

Differential Revision: https://phab.mercurial-scm.org/D2601

[1] https://phab.mercurial-scm.org/rHGc420792217c89622482005c99e959b9071c109c5

Signed-off-by: Stefan Beller <sbeller@google.com>
---
Jun, Junio,

By changing the authorship we'd want to have a sign-off from the original
author before applying; in the previous attempt I was merely taking the
code from mercurial, as their copy of xdiff is also LGPLv2, so we are free
to use that.
Thanks,
Stefan

 xdiff/xdiffi.c | 38 +++++++++++++++++++++++++++++++++++---
 1 file changed, 35 insertions(+), 3 deletions(-)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 0de1ef463bf..c74ec77da58 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -807,6 +807,14 @@ static void xdl_bug(const char *msg)
 	exit(1);
 }
 
+/*
+ * For indentation heuristic, skip searching for better slide position after
+ * checking MAX_BORING lines without finding an improvement. This defends the
+ * indentation heuristic logic against pathological cases. The value is not
+ * picked scientifically but should be good enough.
+ */
+#define MAX_BORING 100
+
 /*
  * Move back and forward change groups for a consistent and pretty diff output.
  * This also helps in finding joinable change groups and reducing the diff
@@ -903,19 +911,43 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 			long shift, best_shift = -1;
 			struct split_score best_score;
 
-			for (shift = earliest_end; shift <= g.end; shift++) {
+			/*
+			 * This is O(N * MAX_BORING) (N = shift-able lines).
+			 * Even with MAX_BORING bounded to a small value, a
+			 * large N could still make this loop take several
+			 * times longer than the main diff algorithm. The
+			 * "boring" value is to help cut down N to something
+			 * like (MAX_BORING + groupsize).
+			 *
+			 * Scan from bottom to top. So we can exit the loop
+			 * without compromising the assumption "for a same best
+			 * score, pick the bottommost shift".
+			 */
+			int boring = 0;
+			for (shift = g.end; shift >= earliest_end; shift--) {
 				struct split_measurement m;
 				struct split_score score = {0, 0};
+				int cmp;
 
 				measure_split(xdf, shift, &m);
 				score_add_split(&m, &score);
 				measure_split(xdf, shift - groupsize, &m);
 				score_add_split(&m, &score);
-				if (best_shift == -1 ||
-				    score_cmp(&score, &best_score) <= 0) {
+
+				if (best_shift == -1) {
+					cmp = -1;
+				} else {
+					cmp = score_cmp(&score, &best_score);
+				}
+				if (cmp < 0) {
+					boring = 0;
 					best_score.effective_indent = score.effective_indent;
 					best_score.penalty = score.penalty;
 					best_shift = shift;
+				} else {
+					boring += 1;
+					if (boring >= MAX_BORING)
+						break;
 				}
 			}
-- 
2.18.0.399.gad0ab374a1-goog

^ permalink raw reply related	[flat|nested] 17+ messages in thread
* Re: [PATCH] xdiff: reduce indent heuristic overhead
  2018-06-29 23:37 ` Stefan Beller
@ 2018-06-30  1:11   ` Jun Wu
  0 siblings, 0 replies; 17+ messages in thread

From: Jun Wu @ 2018-06-30  1:11 UTC (permalink / raw)
To: Stefan Beller; +Cc: gitster, git, jamill, mh, mhagger

Excerpts from Stefan Beller's message of 2018-06-29 16:37:41 -0700:
> [...]
> Jun, Junio
>
> By changing the authorship we'd want to have a sign off from the original author,
> before applying; in the previous attempt, I was merely taking the code from
> mercurial as their copy of xdiff is also LGPLv2 so we are free to use that.

I'm fine with signing off the patch. I didn't send this one here mainly
because the indent heuristic was off by default at the time I made the
changes, and I wasn't sure how to test this properly according to the git
community standard. If this change is fine without additional tests, I can
send it myself, too.

> Thanks,
> Stefan
>
> [...]

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [PATCH] xdiff: reduce indent heuristic overhead
  2018-06-29 20:28 ` [PATCH] xdiff: reduce indent heuristic overhead Stefan Beller
  2018-06-29 21:17   ` Junio C Hamano
@ 2018-07-01 15:57   ` Michael Haggerty
  2018-07-02 17:27     ` Stefan Beller
  2018-07-03 18:14     ` Junio C Hamano
  1 sibling, 2 replies; 17+ messages in thread

From: Michael Haggerty @ 2018-07-01 15:57 UTC (permalink / raw)
To: Stefan Beller; +Cc: git, jamill, mh

On 06/29/2018 10:28 PM, Stefan Beller wrote:
> [...]
> Adds some threshold to avoid expensive cases, like:
>
> ```
> #!python
> open('a', 'w').write(" \n" * 1000000)
> open('b', 'w').write(" \n" * 1000001)
> ```
>
> The indent heuristic is O(N * 20) (N = 1000000) for the above case, and
> makes diff much slower.
> [...]
> +/*
> + * For indentation heuristic, skip searching for better slide position after
> + * checking MAX_BORING lines without finding an improvement. This defends the
> + * indentation heuristic logic against pathological cases. The value is not
> + * picked scientifically but should be good enough.
> + */
> +#define MAX_BORING 100

This is an interesting case, and a speed difference of almost a factor
of five seems impressive. But this is a pretty pathological case, isn't
it? And I'm pretty sure that the algorithm is `O(N)` both before and
after this change. Remember that to find `earliest_end` and `g.end`,
there has already been a scan through all 1000000 lines. In other words,
you're not improving how the overall algorithm scales with `N`; you're
only changing the constant factor in front. So it's a little bit
questionable whether it is worth complicating the code for this unusual
case.

But *if* we want to improve this case, I think that we could be smarter
about it.

By the time we get to this point in the code, we already know that there
is a "slider" hunk of length `M` (`groupsize`) that can be slid up or
down within a range of `N` (`g.end - earliest_end + 1`) possible
positions.
The interesting case here is `N ≫ M`, because then naively the number of
positions to try out is a lot bigger than the size of the hunk itself.
(In the case described above, `N` is 1000000 and `M` is 1.)

But how can that situation even arise? Remember, a hunk can only be slid
down by a line if the first line *after* the hunk is identical to the
first line *of* the hunk. It follows that if you shift a hunk down `M`
lines, then it has the same contents as when you started—you've just
rotated all of the hunk lines around once.

So if `N ≫ M`, there is necessarily a lot of repetition among the `N +
M` lines that the hunk could possibly overlay. Specifically, it must
consist of `floor((N + M)/M)` identical copies of the hunk, plus
possibly a few leftover lines constituting the start of another repetition.

Given this large amount of repetition, it seems to me that there is
never a need to scan more than the bottom `M + 1` possible positions [1]
plus the highest possible position [2] to be sure of finding the very
best one. In the pathological case that you described above, where `M`
is 1, only three positions have to be evaluated, not 100.

In fact, it *could* be that there is even more repetition, namely if the
hunk itself contains multiple copies of an even shorter block of `K`
lines. In that case, you would only have to scan `K + 1` positions at
the bottom plus one at the top to be sure to find the best hunk
position. This would be an interesting optimization for a case like

> open('a', 'w').write(" \n" * 1000000)
> open('b', 'w').write(" \n" * 1100000)

(`N = 1000000`, `M = 100000`, `K = 1`) or

> open('a', 'w').write("<item>\nMISSING\n</item>\n" * 1000000)
> open('b', 'w').write("<item>\nMISSING\n</item>\n" * 1100000)

(`N = 3000000`, `M = 300000`, `K = 3`). On the other hand, it's not
entirely trivial to find periodicity in a group of lines (i.e., to find
`K`), and I don't know offhand how that task scales with `M`.
Michael

[1] Actually, to be rigorously correct it might be necessary to check
even a bit more than `M + 1` positions at the bottom because the
heuristic looks a bit beyond the lines of the hunk.

[2] The position at the top has different predecessor lines than the
other positions, so it could have a lower score than all of the others.
It's worth checking it. Here too, to be rigorously correct it might be
necessary to check more than one position at the top because the
heuristic looks a bit beyond the lines of the hunk.

^ permalink raw reply	[flat|nested] 17+ messages in thread
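Michael's open question about finding `K` — the minimal period of the hunk
lines — can in fact be answered in time linear in `M`, using the classic
KMP failure-function argument. A sketch (hypothetical helper, not code from
the thread):

```python
def minimal_period(lines):
    """Smallest K such that `lines` consists of whole copies of its
    first K lines; returns len(lines) if there is no such repetition.

    Computes the KMP failure function, so the cost is O(M) in the hunk
    size -- i.e. finding K scales linearly with M."""
    n = len(lines)
    fail = [0] * (n + 1)   # fail[i]: longest proper border of lines[:i]
    k = 0
    for i in range(1, n):
        while k and lines[i] != lines[k]:
            k = fail[k]
        if lines[i] == lines[k]:
            k += 1
        fail[i + 1] = k
    period = n - fail[n]
    return period if n % period == 0 else n
```

So the `K + 1`-positions optimization would not add to the asymptotic cost:
the periodicity check itself is no more expensive than one pass over the hunk.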
* Re: [PATCH] xdiff: reduce indent heuristic overhead
  2018-07-01 15:57 ` Michael Haggerty
@ 2018-07-02 17:27   ` Stefan Beller
  2018-07-03  9:15     ` Michael Haggerty
  2018-07-03 18:14   ` Junio C Hamano
  1 sibling, 1 reply; 17+ messages in thread

From: Stefan Beller @ 2018-07-02 17:27 UTC (permalink / raw)
To: Michael Haggerty, quark; +Cc: git, Jameson Miller, Mike Hommey

On Sun, Jul 1, 2018 at 8:57 AM Michael Haggerty <mhagger@alum.mit.edu> wrote:
>
> On 06/29/2018 10:28 PM, Stefan Beller wrote:
> > [...]
> > Adds some threshold to avoid expensive cases, like:
> >
> > ```
> > #!python
> > open('a', 'w').write(" \n" * 1000000)
> > open('b', 'w').write(" \n" * 1000001)
> > ```
> >
> > The indent heuristic is O(N * 20) (N = 1000000) for the above case, and
> > makes diff much slower.
> > [...]
> > +/*
> > + * For indentation heuristic, skip searching for better slide position after
> > + * checking MAX_BORING lines without finding an improvement. This defends the
> > + * indentation heuristic logic against pathological cases. The value is not
> > + * picked scientifically but should be good enough.
> > + */
> > +#define MAX_BORING 100
>
> This is an interesting case, and a speed difference of almost a factor
> of five seems impressive. But this is a pretty pathological case, isn't
> it? And I'm pretty sure that the algorithm is `O(N)` both before and
> after this change. Remember that to find `earliest_end` and `g.end`,
> there has already been a scan through all 1000000 lines. In other words,
> you're not improving how the overall algorithm scales with `N`; you're
> only changing the constant factor in front. So it's a little bit
> questionable whether it is worth complicating the code for this unusual
> case.
>
> But *if* we want to improve this case, I think that we could be smarter
> about it.
>
> By the time we get to this point in the code, we already know that there
> is a "slider" hunk of length `M` (`groupsize`) that can be slid up or
> down within a range of `N` (`g.end - earliest_end + 1`) possible
> positions. The interesting case here is `N ≫ M`, because then naively
> the number of positions to try out is a lot bigger than the size of the
> hunk itself. (In the case described above, `N` is 1000000 and `M` is 1.)
>
> But how can that situation even arise? Remember, a hunk can only be slid
> down by a line if the first line *after* the hunk is identical to the
> first line *of* the hunk. It follows that if you shift a hunk down `M`
> lines, then it has the same contents as when you started—you've just
> rotated all of the hunk lines around once.
>
> So if `N ≫ M`, there is necessarily a lot of repetition among the `N +
> M` lines that the hunk could possibly overlay. Specifically, it must
> consist of `floor((N + M)/M)` identical copies of the hunk, plus
> possibly a few leftover lines constituting the start of another repetition.
>
> Given this large amount of repetition, it seems to me that there is
> never a need to scan more than the bottom `M + 1` possible positions [1]
> plus the highest possible position [2] to be sure of finding the very
> best one. In the pathological case that you described above, where `M`
> is 1, only three positions have to be evaluated, not 100.
>
> In fact, it *could* be that there is even more repetition, namely if the
> hunk itself contains multiple copies of an even shorter block of `K`
> lines. In that case, you would only have to scan `K + 1` positions at
> the bottom plus one at the top to be sure to find the best hunk
> position. This would be an interesting optimization for a case like
>
> > open('a', 'w').write(" \n" * 1000000)
> > open('b', 'w').write(" \n" * 1100000)
>
> (`N = 1000000`, `M = 100000`, `K = 1`) or
>
> > open('a', 'w').write("<item>\nMISSING\n</item>\n" * 1000000)
> > open('b', 'w').write("<item>\nMISSING\n</item>\n" * 1100000)
>
> (`N = 3000000`, `M = 300000`, `K = 3`). On the other hand, it's not
> entirely trivial to find periodicity in a group of lines (i.e., to find
> `K`), and I don't know offhand how that task scales with `M`.
>
> Michael
>
> [1] Actually, to be rigorously correct it might be necessary to check
> even a bit more than `M + 1` positions at the bottom because the
> heuristic looks a bit beyond the lines of the hunk.
>
> [2] The position at the top has different predecessor lines than the
> other positions, so it could have a lower score than all of the others.
> It's worth checking it. Here too, to be rigorously correct it might be
> necessary to check more than one position at the top because the
> heuristic looks a bit beyond the lines of the hunk.

So this suggests having MAX_BORING be "hunk size + some small constant
offset"?

Stefan

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [PATCH] xdiff: reduce indent heuristic overhead
  2018-07-02 17:27 ` Stefan Beller
@ 2018-07-03  9:15   ` Michael Haggerty
  2018-07-27 22:23     ` Stefan Beller
  0 siblings, 1 reply; 17+ messages in thread

From: Michael Haggerty @ 2018-07-03  9:15 UTC (permalink / raw)
To: Stefan Beller; +Cc: quark, Git Mailing List, jamill, mh

On Mon, Jul 2, 2018 at 7:27 PM Stefan Beller <sbeller@google.com> wrote:
> On Sun, Jul 1, 2018 at 8:57 AM Michael Haggerty <mhagger@alum.mit.edu> wrote:
> [...]
> So this suggests to have MAX_BORING to be
> "hunk size + some small constant offset" ?

That would be my suggestion, yes. There are cases where it will be
more expensive than a fixed `MAX_BORING`, but I bet on average it will
be faster. Plus, it should always give the right answer.

Michael

^ permalink raw reply	[flat|nested] 17+ messages in thread
* [PATCH] xdiff: reduce indent heuristic overhead
  2018-07-03  9:15 ` Michael Haggerty
@ 2018-07-27 22:23   ` Stefan Beller
  0 siblings, 0 replies; 17+ messages in thread

From: Stefan Beller @ 2018-07-27 22:23 UTC (permalink / raw)
To: mhagger; +Cc: git, jamill, mh, quark, sbeller

Skip searching for a better indent heuristic if we'd slide a hunk more
than its size. This is the easiest fix proposed in the analysis[1] in
response to a patch that mercurial took for xdiff to limit the searching
by a constant.

Using a performance test as:

    #!python
    open('a', 'w').write(" \n" * 1000000)
    open('b', 'w').write(" \n" * 1000001)

this patch reduces the execution of "git diff --no-index a b" from 0.70s
to 0.31s. However limiting the sliding to the size of the diff hunk,
which was proposed as a solution (and which I found easiest to implement
for now), is not optimal for cases like

    open('a', 'w').write(" \n" * 1000000)
    open('b', 'w').write(" \n" * 2000000)

as then we'd still slide 1000000 times. In addition to limiting the
sliding to the size of the hunk, also limit it by a constant. Choose 100
lines as the constant, as that is more than fits on a screen, which
really means that the diff sliding is probably not providing a lot of
benefit anyway.

[1] https://public-inbox.org/git/72ac1ac2-f567-f241-41d6-d0f83072e0b3@alum.mit.edu/

Reported-by: Jun Wu <quark@fb.com>
Analysis-by: Michael Haggerty <mhagger@alum.mit.edu>
Signed-off-by: Stefan Beller <sbeller@google.com>
---

> Plus, it should always give the right answer.

I was tempted to do just that, but I caved. The diff is correct, and the
hunk sliding is purely to appease the visual aspect of humans looking at
diffs. If your diff can slide more than a monitor height, you're not
interested in the best slid diff; something else is going on.

> There are cases where it will be
> more expensive than a fixed `MAX_BORING`, but I bet on average it will
> be faster.

So I did both, settling for performance as the utmost desire. ;-)

 xdiff/xdiffi.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 0de1ef463bf..91e98ee9869 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -591,6 +591,11 @@ static void measure_split(const xdfile_t *xdf, long split,
  */
 #define INDENT_WEIGHT 60
 
+/*
+ * How far do we slide a hunk at most?
+ */
+#define INDENT_HEURISTIC_MAX_SLIDING 100
+
 /*
  * Compute a badness score for the hypothetical split whose measurements are
  * stored in m. The weight factors were determined empirically using the tools and
@@ -903,7 +908,12 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 			long shift, best_shift = -1;
 			struct split_score best_score;
 
-			for (shift = earliest_end; shift <= g.end; shift++) {
+			shift = earliest_end;
+			if (g.end - groupsize - 1 > shift)
+				shift = g.end - groupsize - 1;
+			if (g.end - INDENT_HEURISTIC_MAX_SLIDING > shift)
+				shift = g.end - INDENT_HEURISTIC_MAX_SLIDING;
+			for (; shift <= g.end; shift++) {
 				struct split_measurement m;
 				struct split_score score = {0, 0};
 
-- 
2.18.0.345.g5c9ce644c3-goog

^ permalink raw reply related	[flat|nested] 17+ messages in thread
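The clamping in that final hunk can be re-expressed in Python (a sketch;
the helper name is mine, the variable names mirror the C). The loop then
runs from the returned value up to `group_end`, so it scores at most about
`min(groupsize + 1, max_sliding) + 1` positions instead of the full
`group_end - earliest_end + 1` range:

```python
def start_shift(earliest_end, group_end, groupsize, max_sliding=100):
    """Compute where the sliding loop starts after both clamps."""
    shift = earliest_end
    # Never slide more than the hunk size (plus one extra position).
    if group_end - groupsize - 1 > shift:
        shift = group_end - groupsize - 1
    # ...and never more than a constant, roughly a screenful of lines.
    if group_end - max_sliding > shift:
        shift = group_end - max_sliding
    return shift
```

For the pathological single-blank-line case (`groupsize = 1`,
`earliest_end = 0`, `group_end = 1000000`) the loop starts at 999998, i.e.
only three positions are scored, matching Michael's analysis.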
* Re: [PATCH] xdiff: reduce indent heuristic overhead
  2018-07-01 15:57 ` Michael Haggerty
  2018-07-02 17:27   ` Stefan Beller
@ 2018-07-03 18:14   ` Junio C Hamano
  1 sibling, 0 replies; 17+ messages in thread

From: Junio C Hamano @ 2018-07-03 18:14 UTC (permalink / raw)
To: Michael Haggerty; +Cc: Stefan Beller, git, jamill, mh

Michael Haggerty <mhagger@alum.mit.edu> writes:

> So if `N ≫ M`, there is necessarily a lot of repetition among the `N +
> M` lines that the hunk could possibly overlay. Specifically, it must
> consist of `floor((N + M)/M)` identical copies of the hunk, plus
> possibly a few leftover lines constituting the start of another repetition.
>
> Given this large amount of repetition, it seems to me that there is
> never a need to scan more than the bottom `M + 1` possible positions [1]
> plus the highest possible position [2] to be sure of finding the very
> best one. In the pathological case that you described above, where `M`
> is 1, only three positions have to be evaluated, not 100.

Nicely analysed.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: fast-import slowness when importing large files with small differences
From: Jeff King @ 2018-06-29 20:39 UTC
To: Stefan Beller; +Cc: Mike Hommey, Jameson Miller, git

On Fri, Jun 29, 2018 at 01:14:52PM -0700, Stefan Beller wrote:

> Interesting pieces regarding performance:
>
> c420792217c8 xdiff: reduce indent heuristic overhead
> https://phab.mercurial-scm.org/rHGc420792217c89622482005c99e959b9071c109c5
>
> f33a87cf60cc xdiff: add a preprocessing step that trims files
> https://phab.mercurial-scm.org/rHGf33a87cf60ccb8b46e06b85e60bc5031420707d6
>
> I'll see if I can make that into patches.

Apparently the second one is not so trivial as you might hope; see
https://public-inbox.org/git/1520337165-sup-4504@x1c/.

-Peff
* Re: fast-import slowness when importing large files with small differences
From: Stefan Beller @ 2018-06-29 20:51 UTC
To: Jeff King, quark; +Cc: Mike Hommey, Jameson Miller, git

+cc Jun Wu, original author of these patches

On Fri, Jun 29, 2018 at 1:39 PM Jeff King <peff@peff.net> wrote:
> > Interesting pieces regarding performance:
> >
> > c420792217c8 xdiff: reduce indent heuristic overhead
> > https://phab.mercurial-scm.org/rHGc420792217c89622482005c99e959b9071c109c5

Going by the mailing list, the first patch was not brought over yet, so
sending it here was warranted. Jun, you may want to take ownership of
https://public-inbox.org/git/20180629202811.131265-1-sbeller@google.com/
as I merely resent it to the git mailing list? If not, that is fine, too.

> > f33a87cf60cc xdiff: add a preprocessing step that trims files
> > https://phab.mercurial-scm.org/rHGf33a87cf60ccb8b46e06b85e60bc5031420707d6
> >
> > I'll see if I can make that into patches.
>
> Apparently the second one is not so trivial as you might hope; see
> https://public-inbox.org/git/1520337165-sup-4504@x1c/.

Thanks so much, this saves me further effort to dig there. So I'll just
stop porting this.

Thanks,
Stefan
* Re: fast-import slowness when importing large files with small differences
From: Ævar Arnfjörð Bjarmason @ 2018-06-29 22:10 UTC
To: Mike Hommey; +Cc: git

On Fri, Jun 29 2018, Mike Hommey wrote:

> I noticed some slowness when fast-importing data from the Firefox
> mercurial repository, where fast-import spends more than 5 minutes
> importing ~2000 revisions of one particular file. I reduced a testcase
> while still using real data. One could synthesize data with kind of
> the same properties, but I figured real data could be useful.
>
> To reproduce:
> $ git clone https://gist.github.com/b6b8edcff2005cc482cf84972adfbba9.git foo
> $ git init bar
> $ cd bar
> $ python ../foo/import.py ../foo/data.gz | git fast-import --depth=2000
>
> [...]
> So maybe it would make sense to consolidate the diff code (after all,
> diff-delta.c is an old specialized fork of xdiff). With manual trimming
> of common head and tail, this gets down to 3:33.
>
> I'll also note that Facebook has imported xdiff from the git code base
> into mercurial and improved performance on it, so it might also be
> worth looking at what's worth taking from there.

It would be interesting to see how this compares with a more naïve
approach of committing every version of this file one-at-a-time into a
new repository (with & without gc.auto=0). Perhaps deltaing as we go is
suboptimal compared to just writing out a lot of redundant data and
repacking it all at once later.
* Re: fast-import slowness when importing large files with small differences
From: Mike Hommey @ 2018-06-29 23:35 UTC
To: Ævar Arnfjörð Bjarmason; +Cc: git

On Sat, Jun 30, 2018 at 12:10:24AM +0200, Ævar Arnfjörð Bjarmason wrote:
> On Fri, Jun 29 2018, Mike Hommey wrote:
>
> > I noticed some slowness when fast-importing data from the Firefox
> > mercurial repository, where fast-import spends more than 5 minutes
> > importing ~2000 revisions of one particular file. I reduced a
> > testcase while still using real data. One could synthesize data with
> > kind of the same properties, but I figured real data could be useful.
> >
> > To reproduce:
> > $ git clone https://gist.github.com/b6b8edcff2005cc482cf84972adfbba9.git foo
> > $ git init bar
> > $ cd bar
> > $ python ../foo/import.py ../foo/data.gz | git fast-import --depth=2000
> >
> > [...]
> > So maybe it would make sense to consolidate the diff code (after all,
> > diff-delta.c is an old specialized fork of xdiff). With manual
> > trimming of common head and tail, this gets down to 3:33.
> >
> > I'll also note that Facebook has imported xdiff from the git code
> > base into mercurial and improved performance on it, so it might also
> > be worth looking at what's worth taking from there.
>
> It would be interesting to see how this compares with a more naïve
> approach of committing every version of this file one-at-a-time into a
> new repository (with & without gc.auto=0). Perhaps deltaing as we go is
> suboptimal compared to just writing out a lot of redundant data and
> repacking it all at once later.

"Just" writing 26GB? And that's only one file. If I were to do that for
the whole repository, it would yield a > 100GB pack, instead of < 2GB
currently.

Mike
* Re: fast-import slowness when importing large files with small differences
From: Ævar Arnfjörð Bjarmason @ 2018-07-03 16:05 UTC
To: Mike Hommey; +Cc: git

On Fri, Jun 29 2018, Mike Hommey wrote:

> On Sat, Jun 30, 2018 at 12:10:24AM +0200, Ævar Arnfjörð Bjarmason wrote:
>> On Fri, Jun 29 2018, Mike Hommey wrote:
>>
>> > I noticed some slowness when fast-importing data from the Firefox
>> > mercurial repository, where fast-import spends more than 5 minutes
>> > importing ~2000 revisions of one particular file. I reduced a
>> > testcase while still using real data. One could synthesize data with
>> > kind of the same properties, but I figured real data could be useful.
>> >
>> > To reproduce:
>> > $ git clone https://gist.github.com/b6b8edcff2005cc482cf84972adfbba9.git foo
>> > $ git init bar
>> > $ cd bar
>> > $ python ../foo/import.py ../foo/data.gz | git fast-import --depth=2000
>> >
>> > [...]
>> > So maybe it would make sense to consolidate the diff code (after
>> > all, diff-delta.c is an old specialized fork of xdiff). With manual
>> > trimming of common head and tail, this gets down to 3:33.
>> >
>> > I'll also note that Facebook has imported xdiff from the git code
>> > base into mercurial and improved performance on it, so it might also
>> > be worth looking at what's worth taking from there.
>>
>> It would be interesting to see how this compares with a more naïve
>> approach of committing every version of this file one-at-a-time into a
>> new repository (with & without gc.auto=0). Perhaps deltaing as we go
>> is suboptimal compared to just writing out a lot of redundant data and
>> repacking it all at once later.
>
> "Just" writing 26GB? And that's only one file. If I were to do that for
> the whole repository, it would yield a > 100GB pack, instead of < 2GB
> currently.

To clarify on my terse response: I mean to try this on an isolated test
case to see to what extent the problem you're describing is unique to
fast-import, and to what extent it's encountered during "normal" git use
when you commit all the revisions of that file in succession.

Perhaps the difference between the two would give some hint as to how to
proceed, or not.
* Re: fast-import slowness when importing large files with small differences
From: Mike Hommey @ 2018-07-03 22:38 UTC
To: Ævar Arnfjörð Bjarmason; +Cc: git

On Tue, Jul 03, 2018 at 06:05:16PM +0200, Ævar Arnfjörð Bjarmason wrote:
> On Fri, Jun 29 2018, Mike Hommey wrote:
>
> > On Sat, Jun 30, 2018 at 12:10:24AM +0200, Ævar Arnfjörð Bjarmason wrote:
> >> On Fri, Jun 29 2018, Mike Hommey wrote:
> >>
> >> > I noticed some slowness when fast-importing data from the Firefox
> >> > mercurial repository, where fast-import spends more than 5 minutes
> >> > importing ~2000 revisions of one particular file. I reduced a
> >> > testcase while still using real data. One could synthesize data
> >> > with kind of the same properties, but I figured real data could be
> >> > useful.
> >> >
> >> > To reproduce:
> >> > $ git clone https://gist.github.com/b6b8edcff2005cc482cf84972adfbba9.git foo
> >> > $ git init bar
> >> > $ cd bar
> >> > $ python ../foo/import.py ../foo/data.gz | git fast-import --depth=2000
> >> >
> >> > [...]
> >> > So maybe it would make sense to consolidate the diff code (after
> >> > all, diff-delta.c is an old specialized fork of xdiff). With manual
> >> > trimming of common head and tail, this gets down to 3:33.
> >> >
> >> > I'll also note that Facebook has imported xdiff from the git code
> >> > base into mercurial and improved performance on it, so it might
> >> > also be worth looking at what's worth taking from there.
> >>
> >> It would be interesting to see how this compares with a more naïve
> >> approach of committing every version of this file one-at-a-time into
> >> a new repository (with & without gc.auto=0). Perhaps deltaing as we
> >> go is suboptimal compared to just writing out a lot of redundant
> >> data and repacking it all at once later.
> >
> > "Just" writing 26GB? And that's only one file. If I were to do that
> > for the whole repository, it would yield a > 100GB pack, instead of
> > < 2GB currently.
>
> To clarify on my terse response: I mean to try this on an isolated test
> case to see to what extent the problem you're describing is unique to
> fast-import, and to what extent it's encountered during "normal" git
> use when you commit all the revisions of that file in succession.
>
> Perhaps the difference between the two would give some hint as to how
> to proceed, or not.

AIUI, git repack will end up creating delta indexes for every blob, so
the problem should exist there too, but because it will be comparing
"random" blobs, it can't take the same kinds of shortcuts as fast-import
could, because fast-import only cares about diffing with the last
imported blob. So while fast-import can reduce the amount of work it
does by not creating an index for the common head and tail of the
compared blobs, git repack can't.

Mike
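The head/tail trimming alluded to above could look roughly like this (Python, a hypothetical sketch — not git's actual implementation): strip the common prefix and suffix of consecutive blob versions so a delta index only has to be built over the small middle that actually changed.

```python
def trim_common(a, b):
    """Return (head, tail, mid_a, mid_b): the lengths of the common
    prefix and suffix of a and b, and the remaining middles that a
    delta index would actually need to cover."""
    head = 0
    while head < len(a) and head < len(b) and a[head] == b[head]:
        head += 1
    tail = 0
    # stop before the suffix overlaps the already-matched prefix
    while (tail < len(a) - head and tail < len(b) - head
           and a[len(a) - 1 - tail] == b[len(b) - 1 - tail]):
        tail += 1
    return head, tail, a[head:len(a) - tail], b[head:len(b) - tail]
```

For two 13MB blobs that differ only in a few lines, the middles passed on to index creation would be a few hundred bytes instead of 13MB — which is the shortcut fast-import could take (it always deltas against the previous blob) but repack, comparing arbitrary blob pairs, cannot amortize the same way.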
end of thread, other threads: [~2018-07-27 22:24 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-29  9:44 fast-import slowness when importing large files with small differences Mike Hommey
2018-06-29 20:14 ` Stefan Beller
2018-06-29 20:28   ` [PATCH] xdiff: reduce indent heuristic overhead Stefan Beller
2018-06-29 21:17     ` Junio C Hamano
2018-06-29 23:37       ` Stefan Beller
2018-06-30  1:11     ` Jun Wu
2018-07-01 15:57     ` Michael Haggerty
2018-07-02 17:27       ` Stefan Beller
2018-07-03  9:15         ` Michael Haggerty
2018-07-27 22:23           ` Stefan Beller
2018-07-03 18:14       ` Junio C Hamano
2018-06-29 20:39   ` fast-import slowness when importing large files with small differences Jeff King
2018-06-29 20:51     ` Stefan Beller
2018-06-29 22:10 ` Ævar Arnfjörð Bjarmason
2018-06-29 23:35   ` Mike Hommey
2018-07-03 16:05     ` Ævar Arnfjörð Bjarmason
2018-07-03 22:38       ` Mike Hommey