Git Mailing List Archive on lore.kernel.org
 help / color / Atom feed
From: Jeff King <peff@peff.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: "Derrick Stolee" <stolee@gmail.com>,
	"SZEDER Gábor" <szeder.dev@gmail.com>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Stefan Beller" <sbeller@google.com>, git <git@vger.kernel.org>,
	"Duy Nguyen" <pclouds@gmail.com>
Subject: Re: [PoC -- do not apply 2/3] test-tree-bitmap: add "dump" mode
Date: Wed, 10 Oct 2018 23:13:03 -0400
Message-ID: <20181011031303.GB25067@sigill.intra.peff.net> (raw)
In-Reply-To: <xmqqy3b6occq.fsf@gitster-ct.c.googlers.com>

On Wed, Oct 10, 2018 at 09:48:53AM +0900, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > The one difference is the sort order: git's diff output is
> > in tree-sort order, so a subtree "foo" sorts like "foo/",
> > which is after "foo.bar". Whereas the bitmap path list has a
> > true byte sort, which puts "foo.bar" after "foo".
> 
> If we truly cared, it is easy enough to fix by having a custom
> comparison function in 1/3 used in collect_paths() phase.

Yep. I thought about doing it just so I could drop this "one difference"
note, but I got lazy.

Running this on linux.git, I do see a few other differences. It looks
like my code does actually compute lists of touched paths for some
merges (presumably using "-c"). That wasn't intended, and it would
actually make my timings less good, but my goal was just to get a rough
idea on size here (but see below).

> > +	/* dump it while we have the sorted order in memory */
> > +	for (i = 0; i < n; i++) {
> > +		printf("%s", sorted[i]->path);
> > +		putchar('\0');
> > +	}
> 
> With printf("%s%c", sorted[i]->path, '\0'); you can lose the braces.

Heh, I didn't really expect review at that level. I'm not even sure this
is a good direction to go versus something like the bloom filters (or
even a more full --raw cache). But if it is, this code is mostly
throw-away anyway, as we'd want to integrate it with the actual diff
code.

My original goal had mostly been to get an idea of the size, and the
"dump" half was there to verify that the results were roughly sane. But
it actually works for rough timing, too. I can generate roughly the same
results as "rev-list --all | diff-tree --stdin -t --name-only" in about
300ms, as opposed to 33s. So that's good.

But it's also a slight cheat, since I'm not actually traversing the
commits, but rather just opening up the bitmaps in the order we wrote
them. ;)

Actually walking the commits (and not looking at the trees) takes ~7s,
so it would at least be more like 33s versus 7.3s. With core.commitgraph,
it's more like 1.1s, so imagine 27s versus 1.4s, I guess.

That's also neglecting any load/lookup time for actual random-access to
the bitmaps. I doubt that's more than a few hundred ms, but that's just
a made-up number.

So I think the rough timings are favorable, but the real proof would
actually be using it from a revision walk, which I haven't written.

> > +	putchar('\0');
> > +
> >  	free(sorted);
> >  }
> >  
> > @@ -142,6 +150,8 @@ static void generate_bitmap(struct diff_queue_struct *q,
> >  
> >  	ewah = bitmap_to_ewah(bitmap);
> >  	ewah_serialize_strbuf(ewah, &out);
> > +
> > +	fwrite(data->commit->object.oid.hash, 1, GIT_SHA1_RAWSZ, stdout);
> >  	fwrite(out.buf, 1, out.len, stdout);
> 
> OK, so per commit, we have ewah bitmap that records the "changed
> paths" after the commit object name.  Makes sense.

Yeah. This format, btw, is garbage. It was just the smallest and
simplest thing I could think of that would work for my case. We'd want
random-access to the bitmaps for each commit, probably via an index
block in the commit-graph file.

> And the list of paths are based on the "one" side of the filepair.
> When we do an equivalent of "git show X", we see "diff-tree X~1 X"
> and by collecting the "one" side (i.e. subset of paths in the tree
> of X~1 that were modified when going to X) we say "commit X changed
> these paths".  Makes sense, too.

I didn't think too hard on whether we'd need to look at the "two" side
ever. I turned off renames, so we'd see deletions via the "one". I feel
like we'd miss additions in that case, though, but from my results, we
do not seem to.

-Peff

  reply index

Thread overview: 76+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-10-03 13:23 We should add a "git gc --auto" after "git clone" due to commit graph Ævar Arnfjörð Bjarmason
2018-10-03 13:36 ` SZEDER Gábor
2018-10-03 13:42   ` Derrick Stolee
2018-10-03 14:18     ` Ævar Arnfjörð Bjarmason
2018-10-03 14:01   ` Ævar Arnfjörð Bjarmason
2018-10-03 14:17     ` SZEDER Gábor
2018-10-03 14:22       ` Ævar Arnfjörð Bjarmason
2018-10-03 14:53         ` SZEDER Gábor
2018-10-03 15:19           ` Ævar Arnfjörð Bjarmason
2018-10-03 16:59             ` SZEDER Gábor
2018-10-05  6:09               ` Junio C Hamano
2018-10-10 22:07                 ` SZEDER Gábor
2018-10-10 23:01                   ` Ævar Arnfjörð Bjarmason
2018-10-03 19:08           ` Stefan Beller
2018-10-03 19:21             ` Jeff King
2018-10-03 20:35               ` Ævar Arnfjörð Bjarmason
2018-10-03 17:47         ` Stefan Beller
2018-10-03 18:47           ` Ævar Arnfjörð Bjarmason
2018-10-03 18:51             ` Jeff King
2018-10-03 18:59               ` Derrick Stolee
2018-10-03 19:18                 ` Jeff King
2018-10-08 16:41                   ` SZEDER Gábor
2018-10-08 16:57                     ` Derrick Stolee
2018-10-08 18:10                       ` SZEDER Gábor
2018-10-08 18:29                         ` Derrick Stolee
2018-10-09  3:08                           ` Jeff King
2018-10-09 13:48                             ` Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph) Derrick Stolee
2018-10-09 18:45                               ` Ævar Arnfjörð Bjarmason
2018-10-09 18:46                               ` Jeff King
2018-10-09 19:03                                 ` Derrick Stolee
2018-10-09 21:14                                   ` Jeff King
2018-10-09 23:12                                     ` Bloom Filters Jeff King
2018-10-09 23:13                                       ` [PoC -- do not apply 1/3] initial tree-bitmap proof of concept Jeff King
2018-10-09 23:14                                       ` [PoC -- do not apply 2/3] test-tree-bitmap: add "dump" mode Jeff King
2018-10-10  0:48                                         ` Junio C Hamano
2018-10-11  3:13                                           ` Jeff King [this message]
2018-10-09 23:14                                       ` [PoC -- do not apply 3/3] test-tree-bitmap: replace ewah with custom rle encoding Jeff King
2018-10-10  0:58                                         ` Junio C Hamano
2018-10-11  3:20                                           ` Jeff King
2018-10-11 12:33                                       ` Bloom Filters Derrick Stolee
2018-10-11 13:43                                         ` Jeff King
2018-10-09 21:30                             ` We should add a "git gc --auto" after "git clone" due to commit graph SZEDER Gábor
2018-10-09 19:34                       ` [PATCH 0/4] Bloom filter experiment SZEDER Gábor
2018-10-09 19:34                         ` [PATCH 1/4] Add a (very) barebones Bloom filter implementation SZEDER Gábor
2018-10-09 19:34                         ` [PATCH 2/4] commit-graph: write a Bloom filter containing changed paths for each commit SZEDER Gábor
2018-10-09 21:06                           ` Jeff King
2018-10-09 21:37                             ` SZEDER Gábor
2018-10-09 19:34                         ` [PATCH 3/4] revision.c: use the Bloom filter to speed up path-limited revision walks SZEDER Gábor
2018-10-09 19:34                         ` [PATCH 4/4] revision.c: add GIT_TRACE_BLOOM_FILTER for a bit of statistics SZEDER Gábor
2018-10-09 19:47                         ` [PATCH 0/4] Bloom filter experiment Derrick Stolee
2018-10-11  1:21                         ` [PATCH 0/2] Per-commit filter proof of concept Jonathan Tan
2018-10-11  1:21                           ` [PATCH 1/2] One filter per commit Jonathan Tan
2018-10-11  1:21                           ` [PATCH 2/2] Only make bloom filter for first parent Jonathan Tan
2018-10-11  7:37                           ` [PATCH 0/2] Per-commit filter proof of concept Ævar Arnfjörð Bjarmason
2018-10-15 14:39                         ` [PATCH 0/4] Bloom filter experiment Derrick Stolee
2018-10-16  4:45                           ` Junio C Hamano
2018-10-16 11:13                             ` Derrick Stolee
2018-10-16 12:57                               ` Ævar Arnfjörð Bjarmason
2018-10-16 13:03                                 ` Derrick Stolee
2018-10-18  2:00                                 ` Junio C Hamano
2018-10-16 23:41                           ` Jonathan Tan
2018-10-08 23:02                     ` We should add a "git gc --auto" after "git clone" due to commit graph Junio C Hamano
2018-10-03 14:32     ` Duy Nguyen
2018-10-03 16:45 ` Duy Nguyen
2018-10-04 21:42 ` [RFC PATCH] " Ævar Arnfjörð Bjarmason
2018-10-05 12:05   ` Derrick Stolee
2018-10-05 13:05     ` Ævar Arnfjörð Bjarmason
2018-10-05 13:45       ` Derrick Stolee
2018-10-05 14:04         ` Ævar Arnfjörð Bjarmason
2018-10-05 19:21         ` Jeff King
2018-10-05 19:41           ` Derrick Stolee
2018-10-05 19:47             ` Jeff King
2018-10-05 20:00               ` Derrick Stolee
2018-10-05 20:02                 ` Jeff King
2018-10-05 20:01               ` Ævar Arnfjörð Bjarmason
2018-10-05 20:09                 ` Jeff King
2018-10-11 12:49 [PATCH 1/2] One filter per commit Derrick Stolee
2018-10-11 19:11 ` [PATCH] Per-commit and per-parent filters for 2 parents Jonathan Tan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20181011031303.GB25067@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=pclouds@gmail.com \
    --cc=sbeller@google.com \
    --cc=stolee@gmail.com \
    --cc=szeder.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Git Mailing List Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/git/0 git/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 git git/ https://lore.kernel.org/git \
		git@vger.kernel.org
	public-inbox-index git

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.git


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git