All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jeff King <peff@peff.net>
To: Jakub Narebski <jnareb@gmail.com>
Cc: "Ted Ts'o" <tytso@mit.edu>,
	"Jonathan Nieder" <jrnieder@gmail.com>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Clemens Buchacher" <drizzd@aon.at>,
	git@vger.kernel.org, "Junio C Hamano" <gitster@pobox.com>
Subject: Re: generation numbers (was: [PATCH 0/4] Speed up git tag --contains)
Date: Thu, 7 Jul 2011 14:59:08 -0400	[thread overview]
Message-ID: <20110707185908.GB12044@sigill.intra.peff.net> (raw)
In-Reply-To: <201107062046.43820.jnareb@gmail.com>

On Wed, Jul 06, 2011 at 08:46:42PM +0200, Jakub Narebski wrote:

> > > Is it worth it to try to replicate this information across repositories?
> > 
> > Probably not. I suggested notes-cache just because the amount of code is
> > very trivial.
> 
> Well, generation numbers are universal and would help everybody.  For
> new commits with 'generation' header those would be always replicated,
> for old commits with 'generation' notes / notes-cache the can be
> replicated.

Sure. But it's not worth trying to transfer them between repositories,
when it only takes a few seconds to generate them locally (approximately
the same amount of time that "git rev-list --all >/dev/null" takes).

> > One problem with notes storage is that it's not well optimized for tiny
> > pieces of data like this (e.g., the generation number should fit in a
> > 32-bit unsigned int, as its max is the size of the longest single path
> > in the history graph). But notes are much more general; we will actually
> > map each commit to a blob object containing the generation number, which
> > is pretty wasteful.
> 
> Wasn't textconv-cache using commit-less notes?  The same can be done
> for generation notes-cache.

No, textconv actually uses parentless commits. But the issue I'm talking
about is not the history storage. It's the value storage. Instead of
pointing to a 32-bit int, we point to a 160-bit sha1 that references an
object containing the value (and the object header is going to be as big
as the value itself). So it's wasteful in storage, and it's wasteful in
the amount of work to do a lookup.

You could "cheat" and instead of storing the sha1 of a blob object in
the notes tree, use the lower 32 bits to store an actual value. I don't
think that currently breaks any assumptions in the notes code, but it
definitely is against the intent of it.

I wrote a patch to calculate and cache commit generation numbers on the
fly, and output them via the "%G" format placeholder. So we can get some
timings (these are from git.git):

  # baseline to compare against; print the commiter timestamp of every
  # commit, which is about how expensive it would be to parse and print
  # an embedded generation number
  $ time git log --format=%ct >/dev/null
  real    0m0.388s
  user    0m0.380s
  sys     0m0.004s

  # and the baseline amount of storage used (fully packed)
  $ du -s .git/objects
  47072   .git/objects

  # now the cost to generate the whole cache; slower, obviously, but it
  # only needs to happen once
  $ time git log --format=%G >/dev/null
  real    0m2.180s
  user    0m1.204s
  sys     0m0.960s

  # at which point everything is loose, and we are wasting tons of space
  $ du -s .git/objects
  171692  .git/objects

  # and traversing with generation lookup is still a bit slower than
  # without it
  $ time git log --format=%G >/dev/null
  real    0m0.822s
  user    0m0.544s
  sys     0m0.272s

  # but repacking helps with the space; now we're using only ~3M
  $ git gc
  $ du -s .git/objects
  50236   .git/objects

  # and traversal is faster, but still about 33% slower than our
  # baseline
  real    0m0.490s
  user    0m0.468s
  sys     0m0.020s

So I suspect we could do better with a data structure optimized for this
type of storage.

> Though it is still wasteful...  By the way, would we be using text
> representation (like in 'generation' commit header) or 32-bit integer
> binary representation in some ordering, or variable-length integer (I
> think git uses them somewhere)?

For a local lookup cache, I would use a fixed-size binary integer just
to keep the lookup data structure simple (then you know the width of
each record ahead of time). For a generation commit header, obviously we
would go with the ascii representation as we do for other headers.

> Nb. I wonder if 32-bit unsigned int would always be enough, for example
> Linux kernel + history.

Yeah, it should be plenty. There are only 250K commits in linux-2.6, and
the absolute worst case generation number is 250K (if they are all
linear). But the history is an n-ary tree, and the highest generation is
actually the height of the tree. So the more branchy the history, the
smaller the height.

The generation of v3.0-rc6, representing 6 years of history, is 79044.
At that rate, the linux-2.6 repository will overflow in a mere 320,000
years.

> I think it would be simpler to use generation header + generation notes.
> Or start with generation notes only.

The patch implementing generation notes is below. The implementation is
quite simple, but it would be nice if it were faster.

---
 commit.c |   81 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 commit.h |    2 +
 pretty.c |    3 ++
 3 files changed, 86 insertions(+), 0 deletions(-)

diff --git a/commit.c b/commit.c
index ac337c7..493517c 100644
--- a/commit.c
+++ b/commit.c
@@ -6,6 +6,7 @@
 #include "diff.h"
 #include "revision.h"
 #include "notes.h"
+#include "notes-cache.h"
 
 int save_commit_buffer = 1;
 
@@ -878,3 +879,83 @@ int commit_tree(const char *msg, unsigned char *tree,
 	strbuf_release(&buffer);
 	return result;
 }
+
+static struct notes_cache generations;
+
+static int generation_from_cache(struct commit *c, unsigned long *g)
+{
+	char *buf, *end;
+	size_t len;
+
+	buf = notes_cache_get(&generations, c->object.sha1, &len);
+	if (!buf)
+		return -1;
+
+	errno = 0;
+	*g = strtoul(buf, &end, 10);
+	if (errno == ERANGE || *end != '\0') {
+		free(buf);
+		return -1;
+	}
+
+	free(buf);
+	return 0;
+}
+
+static void generation_to_cache(struct commit *c, unsigned long g)
+{
+	char buf[64];
+	int len;
+
+	len = snprintf(buf, sizeof(buf), "%lu", g);
+	notes_cache_put(&generations, c->object.sha1, buf, len);
+}
+
+static unsigned long commit_generation_recurse(struct commit *c)
+{
+	struct commit_list *p;
+	unsigned long r;
+
+	if (!generation_from_cache(c, &r))
+		return r;
+
+	if (parse_commit(c) < 0)
+		die("unable to parse commit: %s", sha1_to_hex(c->object.sha1));
+
+	if (!c->parents)
+		return 0;
+
+	r = 0;
+	for (p = c->parents; p; p = p->next) {
+		unsigned long pgen = commit_generation_recurse(p->item);
+		if (pgen > r)
+			r = pgen;
+	}
+	r++;
+
+	generation_to_cache(c, r);
+	return r;
+}
+
+int installed_generation_writer;
+static void write_generation_cache(void)
+{
+	notes_cache_write(&generations);
+}
+
+unsigned long commit_generation(const struct commit *commit)
+{
+	unsigned long r;
+
+	if (!generations.tree.initialized)
+		notes_cache_init(&generations, "generations", "v1");
+
+	r = commit_generation_recurse((struct commit *)commit);
+
+	if (!installed_generation_writer) {
+		atexit(write_generation_cache);
+		installed_generation_writer = 1;
+	}
+
+	return r;
+}
diff --git a/commit.h b/commit.h
index a2d571b..bff6b36 100644
--- a/commit.h
+++ b/commit.h
@@ -176,4 +176,6 @@ extern int commit_tree(const char *msg, unsigned char *tree,
 		struct commit_list *parents, unsigned char *ret,
 		const char *author);
 
+unsigned long commit_generation(const struct commit *commit);
+
 #endif /* COMMIT_H */
diff --git a/pretty.c b/pretty.c
index f45eb54..8f1b321 100644
--- a/pretty.c
+++ b/pretty.c
@@ -965,6 +965,9 @@ static size_t format_commit_one(struct strbuf *sb, const char *placeholder,
 			return 2;
 		}
 		return 0;	/* unknown %g placeholder */
+	case 'G':
+		strbuf_addf(sb, "%lu", commit_generation(commit));
+		return 1;
 	case 'N':
 		if (c->pretty_ctx->show_notes) {
 			format_display_notes(commit->object.sha1, sb,
-- 
1.7.6.7.ge7132.dirty

  reply	other threads:[~2011-07-07 18:59 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-06-11 19:04 [PATCH 0/4] Speed up git tag --contains Ævar Arnfjörð Bjarmason
2011-06-11 19:04 ` [PATCH 1/4] tag: speed up --contains calculation Ævar Arnfjörð Bjarmason
2011-06-11 19:04 ` [PATCH 2/4] limit "contains" traversals based on commit timestamp Ævar Arnfjörð Bjarmason
2011-06-11 19:04 ` [PATCH 3/4] default core.clockskew variable to one day Ævar Arnfjörð Bjarmason
2011-06-11 19:04 ` [PATCH 4/4] Why is "git tag --contains" so slow? Ævar Arnfjörð Bjarmason
2011-07-06  6:40 ` [PATCH 0/4] Speed up git tag --contains Jeff King
2011-07-06  6:54   ` Jeff King
2011-07-06 19:06     ` Clemens Buchacher
2011-07-06  6:56   ` Jonathan Nieder
2011-07-06  7:03     ` Jeff King
2011-07-06 14:26       ` generation numbers (was: [PATCH 0/4] Speed up git tag --contains) Jakub Narebski
2011-07-06 15:01         ` Ted Ts'o
2011-07-06 18:12           ` Jeff King
2011-07-06 18:46             ` Jakub Narebski
2011-07-07 18:59               ` Jeff King [this message]
2011-07-07 19:34                 ` generation numbers Junio C Hamano
2011-07-07 20:31                   ` Jakub Narebski
2011-07-07 20:52                     ` A Large Angry SCM
2011-07-08  0:29                       ` Junio C Hamano
2011-07-08 22:57                   ` Jeff King
2011-07-06 23:22             ` Junio C Hamano
2011-07-07 19:08               ` Jeff King
2011-07-07 20:10                 ` Jakub Narebski
2018-01-12 18:56   ` [PATCH 0/4] Speed up git tag --contains csilvers
2018-03-03  5:15     ` Jeff King
2018-03-08 23:05       ` csilvers
2018-03-12 13:45       ` Derrick Stolee
2018-03-12 23:59         ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20110707185908.GB12044@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=avarab@gmail.com \
    --cc=drizzd@aon.at \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jnareb@gmail.com \
    --cc=jrnieder@gmail.com \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.