git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Elijah Newren <newren@gmail.com>
Cc: John Cai via GitGitGadget <gitgitgadget@gmail.com>,
	git@vger.kernel.org, John Cai <johncai86@gmail.com>
Subject: Re: [PATCH 2/2] diff: teach diff to read gitattribute diff-algorithm
Date: Thu, 09 Feb 2023 10:41:33 +0100	[thread overview]
Message-ID: <230209.86lel7xi8l.gmgdl@evledraar.gmail.com> (raw)
In-Reply-To: <CABPp-BGdKjnChEp4zeCcz24wiEJVJb9Tp40MTWn1m0LRZu+M+Q@mail.gmail.com>


On Wed, Feb 08 2023, Elijah Newren wrote:

> On Mon, Feb 6, 2023 at 9:24 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>>
>> On Sun, Feb 05 2023, John Cai via GitGitGadget wrote:
>>
>> > From: John Cai <johncai86@gmail.com>
>> > [...]
>> > +
>> > +             if (!o->xdl_opts_command_line) {
>> > +                     static struct attr_check *check;
>> > +                     const char *one_diff_algo;
>> > +                     const char *two_diff_algo;
>> > +
>> > +                     check = attr_check_alloc();
>> > +                     attr_check_append(check, git_attr("diff-algorithm"));
>> > +
>> > +                     git_check_attr(the_repository->index, NULL, one->path, check);
>> > +                     one_diff_algo = check->items[0].value;
>> > +                     git_check_attr(the_repository->index, NULL, two->path, check);
>> > +                     two_diff_algo = check->items[0].value;
>> > +
>> > +                     if (!ATTR_UNSET(one_diff_algo) && !ATTR_UNSET(two_diff_algo) &&
>> > +                             !strcmp(one_diff_algo, two_diff_algo))
>> > +                             set_diff_algorithm(o, one_diff_algo);
>> > +
>> > +                     attr_check_free(check);
>>
>> This is a bit nitpicky, but I for one would find this much easier to
>> read with some shorter variables, here just with "a" rather than
>> "one_diff_algo", "b" instead of "two_diff_algo", and splitting
>> "the_repository->index" into "istate" (untested):
>>
>>         +               if (!o->xdl_opts_command_line) {
>>         +                       static struct attr_check *check;
>>         +                       const char *a;
>>         +                       const char *b;
>>         +                       struct index_state *istate = the_repository->index;
>>         +
>>         +                       check = attr_check_alloc();
>>         +                       attr_check_append(check, git_attr("diff-algorithm"));
>>         +
>>         +                       git_check_attr(istate, NULL, one->path, check);
>>         +                       a = check->items[0].value;
>>         +                       git_check_attr(istate, NULL, two->path, check);
>>         +                       b = check->items[0].value;
>>         +
>>         +                       if (!ATTR_UNSET(a) && !ATTR_UNSET(b) && !strcmp(a, b))
>>         +                               set_diff_algorithm(o, a);
>>         +
>>         +                       attr_check_free(check);
>>         +               }
>>
>> That also nicely keeps the line length shorter.
>>
>> > @@ -333,6 +333,8 @@ struct diff_options {
>> >       int prefix_length;
>> >       const char *stat_sep;
>> >       int xdl_opts;
>> > +     /* If xdl_opts has been set via the command line. */
>> > +     int xdl_opts_command_line;
>> >
>> >       /* see Documentation/diff-options.txt */
>> >       char **anchors;
>> > diff --git a/t/lib-diff-alternative.sh b/t/lib-diff-alternative.sh
>> > index 8d1e408bb58..630c98ea65a 100644
>> > --- a/t/lib-diff-alternative.sh
>> > +++ b/t/lib-diff-alternative.sh
>> > @@ -107,8 +107,27 @@ EOF
>> >
>> >       STRATEGY=$1
>> >
>> > +     test_expect_success "$STRATEGY diff from attributes" '
>> > +             echo "file* diff-algorithm=$STRATEGY" >.gitattributes &&
>> > +             test_must_fail git diff --no-index file1 file2 > output &&
>> > +             test_cmp expect output
>> > +     '
>> > +
>> >       test_expect_success "$STRATEGY diff" '
>> > -             test_must_fail git diff --no-index "--$STRATEGY" file1 file2 > output &&
>> > +             test_must_fail git diff --no-index "--diff-algorithm=$STRATEGY" file1 file2 > output &&
>>
>> Nit: The usual style is ">output", not "> output".
>>
>> > +             test_cmp expect output
>> > +     '
>> > +
>> > +     test_expect_success "$STRATEGY diff command line precedence before attributes" '
>> > +             echo "file* diff-algorithm=meyers" >.gitattributes &&
>> > +             test_must_fail git diff --no-index "--diff-algorithm=$STRATEGY" file1 file2 > output &&
>> > +             test_cmp expect output
>> > +     '
>> > +
>> > +     test_expect_success "$STRATEGY diff attributes precedence before config" '
>> > +             git config diff.algorithm default &&
>> > +             echo "file* diff-algorithm=$STRATEGY" >.gitattributes &&
>> > +             test_must_fail git diff --no-index "--diff-algorithm=$STRATEGY" file1 file2 > output &&
>> >               test_cmp expect output
>> >       '
>> >
>> > @@ -166,5 +185,11 @@ EOF
>> >               test_must_fail git diff --no-index "--$STRATEGY" uniq1 uniq2 > output &&
>> >               test_cmp expect output
>> >       '
>> > +
>> > +     test_expect_success "$STRATEGY diff from attributes" '
>> > +             echo "file* diff-algorithm=$STRATEGY" >.gitattributes &&
>> > +             test_must_fail git diff --no-index uniq1 uniq2 > output &&
>> > +             test_cmp expect output
>> > +     '
>> >  }
>>
>> For some non-nitpicking, I do worry about exposing this as a DoS vector,
>> e.g. here's a diff between two distant points in git.git with the
>> various algorithms:
>>
>>         $ hyperfine -r 1 -L a patience,minimal,histogram,myers 'git diff --diff-algorithm={a} v2.0.0 v2.28.0'
>>         Benchmark 1: git diff --diff-algorithm=patience v2.0.0 v2.28.0
>>           Time (abs ≡):        42.121 s               [User: 41.879 s, System: 0.144 s]
>>
>>         Benchmark 2: git diff --diff-algorithm=minimal v2.0.0 v2.28.0
>>           Time (abs ≡):        35.634 s               [User: 35.473 s, System: 0.160 s]
>>
>>         Benchmark 3: git diff --diff-algorithm=histogram v2.0.0 v2.28.0
>>           Time (abs ≡):        46.912 s               [User: 46.657 s, System: 0.228 s]
>>
>>         Benchmark 4: git diff --diff-algorithm=myers v2.0.0 v2.28.0
>>           Time (abs ≡):        33.233 s               [User: 33.072 s, System: 0.160 s]
>>
>>         Summary
>>           'git diff --diff-algorithm=myers v2.0.0 v2.28.0' ran
>>             1.07 times faster than 'git diff --diff-algorithm=minimal v2.0.0 v2.28.0'
>>             1.27 times faster than 'git diff --diff-algorithm=patience v2.0.0 v2.28.0'
>>             1.41 times faster than 'git diff --diff-algorithm=histogram v2.0.0 v2.28.0'

[snip around]

> If you run with more than 1 run, are your numbers even repeatable?

Yes, but tl;dr it's diff.colorMoved=true, sorry, see below.

> I'm really surprised by these numbers; they aren't remotely close to
> what I compute.  Am I correct in understanding you ran these in
> git.git?  Was your computer overloaded?  Was your git.git in some
> serious need of repacking?  Was it on a network filesystem?  

Just on the box I regularly hack on, which isn't too overloaded, but I
re-did these a bit more seriously.

This is/was on a Hetzner EX41S
(https://docs.hetzner.com/robot/dedicated-server/general-information/root-server-hardware/),
but I tried it again now in /dev/shm/git.git with a fresh repo from:

	git clone --bare git@github.com:git/git.git

And (I didn't prune out the notice here where it says it's fuzzy (not
earlier either)).

There's some Java thing eating ~50% of 1/8 CPU cores, but otherwise the
box is pretty idle, and this is current "master" with "make CFLAGS=-O3":
	
	$ hyperfine -w 2 -r 10 -L a patience,minimal,histogram,myers './git -C /dev/shm/git.git diff --diff-algorithm={a} v2.0.0 v2.28.0'
	Benchmark 1: ./git -C /dev/shm/git.git diff --diff-algorithm=patience v2.0.0 v2.28.0
	  Time (mean ± σ):     41.131 s ±  0.550 s    [User: 40.990 s, System: 0.104 s]
	  Range (min … max):   40.323 s … 42.172 s    10 runs
	
	Benchmark 2: ./git -C /dev/shm/git.git diff --diff-algorithm=minimal v2.0.0 v2.28.0
	  Time (mean ± σ):     34.821 s ±  0.307 s    [User: 34.707 s, System: 0.100 s]
	  Range (min … max):   34.512 s … 35.523 s    10 runs
	
	Benchmark 3: ./git -C /dev/shm/git.git diff --diff-algorithm=histogram v2.0.0 v2.28.0
	  Time (mean ± σ):     45.443 s ±  0.274 s    [User: 45.328 s, System: 0.107 s]
	  Range (min … max):   44.932 s … 45.810 s    10 runs
	
	Benchmark 4: ./git -C /dev/shm/git.git diff --diff-algorithm=myers v2.0.0 v2.28.0
	  Time (mean ± σ):     33.016 s ±  0.505 s    [User: 32.893 s, System: 0.094 s]
	  Range (min … max):   32.376 s … 33.999 s    10 runs
	
	Summary
	  './git -C /dev/shm/git.git diff --diff-algorithm=myers v2.0.0 v2.28.0' ran
	    1.05 ± 0.02 times faster than './git -C /dev/shm/git.git diff --diff-algorithm=minimal v2.0.0 v2.28.0'
	    1.25 ± 0.03 times faster than './git -C /dev/shm/git.git diff --diff-algorithm=patience v2.0.0 v2.28.0'
	    1.38 ± 0.02 times faster than './git -C /dev/shm/git.git diff --diff-algorithm=histogram v2.0.0 v2.28.0'

So that's pretty much the same as my earlier results, but between the
fresh repo, ram disk, warmup & 10 measuremnets for each these should be
more accurate.

> Using git compiled from current main, I see:
>
> $ hyperfine -L a patience,minimal,histogram,myers './git diff
> --diff-algorithm={a} v2.0.0 v2.28.0'
> Benchmark 1: ./git diff --diff-algorithm=patience v2.0.0 v2.28.0
>   Time (mean ± σ):      1.142 s ±  0.033 s    [User: 1.022 s, System: 0.114 s]
>   Range (min … max):    1.117 s …  1.212 s    10 runs
>
> Benchmark 2: ./git diff --diff-algorithm=minimal v2.0.0 v2.28.0
>   Time (mean ± σ):      1.959 s ±  0.011 s    [User: 1.830 s, System: 0.120 s]
>   Range (min … max):    1.947 s …  1.976 s    10 runs
>
> Benchmark 3: ./git diff --diff-algorithm=histogram v2.0.0 v2.28.0
>   Time (mean ± σ):      1.187 s ±  0.007 s    [User: 1.065 s, System: 0.115 s]
>   Range (min … max):    1.175 s …  1.200 s    10 runs
>
> Benchmark 4: ./git diff --diff-algorithm=myers v2.0.0 v2.28.0
>   Time (mean ± σ):      1.194 s ±  0.007 s    [User: 1.068 s, System: 0.120 s]
>   Range (min … max):    1.184 s …  1.206 s    10 runs
>
> Summary
>   './git diff --diff-algorithm=patience v2.0.0 v2.28.0' ran
>     1.04 ± 0.03 times faster than './git diff
> --diff-algorithm=histogram v2.0.0 v2.28.0'
>     1.05 ± 0.03 times faster than './git diff --diff-algorithm=myers
> v2.0.0 v2.28.0'
>     1.71 ± 0.05 times faster than './git diff --diff-algorithm=minimal
> v2.0.0 v2.28.0'

But without diff.colorMoved=true I see basically your results:
	
	$ hyperfine -w 2 -r 10 -L a patience,minimal,histogram,myers './git -c diff.colorMoved=false -C /dev/shm/git.git diff --diff-algorithm={a} v2.0.0 v2.28.0'
	Benchmark 1: ./git -c diff.colorMoved=false -C /dev/shm/git.git diff --diff-algorithm=patience v2.0.0 v2.28.0
	  Time (mean ± σ):     760.9 ms ±  45.2 ms    [User: 698.6 ms, System: 62.1 ms]
	  Range (min … max):   719.0 ms … 862.2 ms    10 runs
	 
	Benchmark 2: ./git -c diff.colorMoved=false -C /dev/shm/git.git diff --diff-algorithm=minimal v2.0.0 v2.28.0
	  Time (mean ± σ):      1.347 s ±  0.041 s    [User: 1.281 s, System: 0.065 s]
	  Range (min … max):    1.305 s …  1.417 s    10 runs
	 
	Benchmark 3: ./git -c diff.colorMoved=false -C /dev/shm/git.git diff --diff-algorithm=histogram v2.0.0 v2.28.0
	  Time (mean ± σ):     826.3 ms ±  51.1 ms    [User: 767.5 ms, System: 58.6 ms]
	  Range (min … max):   773.7 ms … 929.8 ms    10 runs
	 
	Benchmark 4: ./git -c diff.colorMoved=false -C /dev/shm/git.git diff --diff-algorithm=myers v2.0.0 v2.28.0
	  Time (mean ± σ):     801.1 ms ±  39.4 ms    [User: 736.0 ms, System: 64.9 ms]
	  Range (min … max):   771.6 ms … 904.2 ms    10 runs
	 
	Summary
	  './git -c diff.colorMoved=false -C /dev/shm/git.git diff --diff-algorithm=patience v2.0.0 v2.28.0' ran
	    1.05 ± 0.08 times faster than './git -c diff.colorMoved=false -C /dev/shm/git.git diff --diff-algorithm=myers v2.0.0 v2.28.0'
	    1.09 ± 0.09 times faster than './git -c diff.colorMoved=false -C /dev/shm/git.git diff --diff-algorithm=histogram v2.0.0 v2.28.0'
	    1.77 ± 0.12 times faster than './git -c diff.colorMoved=false -C /dev/shm/git.git diff --diff-algorithm=minimal v2.0.0 v2.28.0'

So they're all within the fuzz-factor, except "minimal".

> And this is on a kind-of low-end refurbished laptop from a few years
> ago (although the repo was recently gc'ed).
>
> I'm biased towards histogram (and making it the default rather than
> making it configurable per file), but that's probably obvious given
> that I made ort use it unconditionally.  And when I made ort use it,
> it was actually a minor performance penalty (~1% IIRC)[*], but I
> thought it was worth it since (a) histogram diffs are more
> understandable to users in general, (b) the histogram diff data
> structures provide an idea for possibly solving some ugly corner cases
> that I don't see a way to solve with the other diffs.

To bring this all home the thing I was going for upthread was to raise
"is this a concern?" I was inclined to think "no", but didn't
know. Since someone who knows way more about the diffs than I probably
ever will (i.e. you :) isn't waiving their hands in panic here I think
we can just consider this checkbox ticked.

I.e. I've seen cases (and this is from vague recollection, I've got no
examples in front of me) where one diff algorithm is worse than others
on performance.

But if your intent is to DoS a service provider you can also just commit
some very long files or whatever, so whether the user can change the
diff algorithm is probably not that interesting, as long as the
performance is within some reasonable bound.

  reply	other threads:[~2023-02-09 10:09 UTC|newest]

Thread overview: 78+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-02-05  3:46 [PATCH 0/2] Teach diff to honor diff algorithms set through git attributes John Cai via GitGitGadget
2023-02-05  3:46 ` [PATCH 1/2] diff: consolidate diff algorithm option parsing John Cai via GitGitGadget
2023-02-06 16:20   ` Phillip Wood
2023-02-05  3:46 ` [PATCH 2/2] diff: teach diff to read gitattribute diff-algorithm John Cai via GitGitGadget
2023-02-05 17:50   ` Eric Sunshine
2023-02-06 13:10     ` John Cai
2023-02-06 16:27   ` Phillip Wood
2023-02-06 18:14     ` Eric Sunshine
2023-02-06 19:50     ` John Cai
2023-02-09  8:26       ` Elijah Newren
2023-02-09 10:31         ` "bad" diffs (was: [PATCH 2/2] diff: teach diff to read gitattribute diff-algorithm) Ævar Arnfjörð Bjarmason
2023-02-09 16:37         ` [PATCH 2/2] diff: teach diff to read gitattribute diff-algorithm John Cai
2023-02-06 16:39   ` Ævar Arnfjörð Bjarmason
2023-02-06 20:37     ` John Cai
2023-02-07 14:55       ` Phillip Wood
2023-02-07 17:00         ` John Cai
2023-02-09  9:09           ` Elijah Newren
2023-02-09 14:44             ` Phillip Wood
2023-02-10  9:57               ` Elijah Newren
2023-02-11 17:39                 ` Phillip Wood
2023-02-11  1:59               ` Jeff King
2023-02-15  2:35                 ` Elijah Newren
2023-02-15  4:21                   ` Jeff King
2023-02-15  5:20                     ` Junio C Hamano
2023-02-15 14:44                 ` Phillip Wood
2023-02-15 15:00                   ` Jeff King
2023-02-07 17:27         ` Ævar Arnfjörð Bjarmason
2023-02-15 14:47           ` Phillip Wood
2023-02-09  8:44       ` Elijah Newren
2023-02-14 21:16         ` John Cai
2023-02-15  3:41           ` Elijah Newren
2023-02-09  7:50     ` Elijah Newren
2023-02-09  9:41       ` Ævar Arnfjörð Bjarmason [this message]
2023-02-11  2:04         ` Jeff King
2023-02-07 17:56   ` Jeff King
2023-02-07 20:18     ` Ævar Arnfjörð Bjarmason
2023-02-07 20:47       ` Junio C Hamano
2023-02-07 21:05         ` Ævar Arnfjörð Bjarmason
2023-02-07 21:28           ` Junio C Hamano
2023-02-07 21:44             ` Ævar Arnfjörð Bjarmason
2023-02-09 16:34     ` John Cai
2023-02-11  1:39       ` Jeff King
2023-02-14 21:40 ` [PATCH v2 0/2] Teach diff to honor diff algorithms set through git attributes John Cai via GitGitGadget
2023-02-14 21:40   ` [PATCH v2 1/2] diff: consolidate diff algorithm option parsing John Cai via GitGitGadget
2023-02-15  2:38     ` Junio C Hamano
2023-02-15 23:34       ` John Cai
2023-02-15 23:42         ` Junio C Hamano
2023-02-16  2:14           ` Jeff King
2023-02-16  2:57             ` Junio C Hamano
2023-02-16 20:34               ` John Cai
2023-02-14 21:40   ` [PATCH v2 2/2] diff: teach diff to read gitattribute diff-algorithm John Cai via GitGitGadget
2023-02-15  2:56     ` Junio C Hamano
2023-02-15  3:20       ` Junio C Hamano
2023-02-16 20:37         ` John Cai
2023-02-17 20:21   ` [PATCH v3 0/2] Teach diff to honor diff algorithms set through git attributes John Cai via GitGitGadget
2023-02-17 20:21     ` [PATCH v3 1/2] diff: consolidate diff algorithm option parsing John Cai via GitGitGadget
2023-02-17 21:27       ` Junio C Hamano
2023-02-18  1:36       ` Elijah Newren
2023-02-17 20:21     ` [PATCH v3 2/2] diff: teach diff to read algorithm from diff driver John Cai via GitGitGadget
2023-02-17 21:50       ` Junio C Hamano
2023-02-18  2:56       ` Elijah Newren
2023-02-20 15:32         ` John Cai
2023-02-20 16:21           ` Elijah Newren
2023-02-20 16:49             ` John Cai
2023-02-20 17:32               ` Elijah Newren
2023-02-20 20:53                 ` John Cai
2023-02-22 19:47                 ` Jeff King
2023-02-24 17:44                   ` John Cai
2023-02-18  1:16     ` [PATCH v3 0/2] Teach diff to honor diff algorithms set through git attributes Elijah Newren
2023-02-20 13:37       ` John Cai
2023-02-20 21:04     ` [PATCH v4 " John Cai via GitGitGadget
2023-02-20 21:04       ` [PATCH v4 1/2] diff: consolidate diff algorithm option parsing John Cai via GitGitGadget
2023-02-20 21:04       ` [PATCH v4 2/2] diff: teach diff to read algorithm from diff driver John Cai via GitGitGadget
2023-02-21 17:34       ` [PATCH v4 0/2] Teach diff to honor diff algorithms set through git attributes Junio C Hamano
2023-02-21 18:05         ` Elijah Newren
2023-02-21 18:51           ` Junio C Hamano
2023-02-21 19:36             ` John Cai
2023-02-21 20:16               ` Elijah Newren

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=230209.86lel7xi8l.gmgdl@evledraar.gmail.com \
    --to=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=johncai86@gmail.com \
    --cc=newren@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).