All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH/RFC 0/6] commit caching
@ 2013-01-29  9:14 Jeff King
  2013-01-29  9:15 ` [PATCH 1/6] csum-file: make sha1write const-correct Jeff King
                   ` (7 more replies)
  0 siblings, 8 replies; 43+ messages in thread
From: Jeff King @ 2013-01-29  9:14 UTC (permalink / raw)
  To: git; +Cc: Duy Nguyen, Shawn O. Pearce

This is the cleaned-up version of the commit caching patches I mentioned
here:

  http://article.gmane.org/gmane.comp.version-control.git/212329

The basic idea is to generate a cache file that sits alongside a
packfile and contains the timestamp, tree, and parents in a more compact
and easy-to-access format.

The timings from this one are roughly similar to what I posted earlier.
Unlike the earlier version, this one keeps the data for a single commit
together for better cache locality (though I don't think it made a big
difference in my tests, since my cold-cache timing test ends up touching
every commit anyway).  The short of it is that for an extra 31M of disk
space (~4%), I get a warm-cache speedup for "git rev-list --all" of
~4.2s to ~0.66s.

The big thing it does not (yet) do is use offsets to reference sha1s, as
Shawn suggested.  This would potentially drop the on-disk size from 84
bytes to 16 bytes per commit (or about 6M total for linux.git).

Coupled with using compression level 0 for trees (which do not compress
well at all, and yield only a 2% increase in size when left
uncompressed), my "git rev-list --objects --all" time drops from ~40s to
~25s. Perf reveals that we're spending most of the remaining time in
lookup_object. I've spent a fair bit of time trying to optimize that,
but with no luck; I think it's fairly close to optimal. The problem is
just that we call it a very large number of times, since it is the
mechanism by which we recognize that we have already processed each
sha1.

  [1/6]: csum-file: make sha1write const-correct
  [2/6]: strbuf: add string-chomping functions
  [3/6]: introduce pack metadata cache files
  [4/6]: introduce a commit metapack
  [5/6]: add git-metapack command
  [6/6]: commit: look up commit info in metapack

-Peff

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 1/6] csum-file: make sha1write const-correct
  2013-01-29  9:14 [PATCH/RFC 0/6] commit caching Jeff King
@ 2013-01-29  9:15 ` Jeff King
  2013-01-29  9:15 ` [PATCH 2/6] strbuf: add string-chomping functions Jeff King
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 43+ messages in thread
From: Jeff King @ 2013-01-29  9:15 UTC (permalink / raw)
  To: git; +Cc: Duy Nguyen, Shawn O. Pearce

We do not modify the buffer we are asked to write; mark it
with const so that callers with const buffers do not get
unnecessary complaints from the compiler.

Signed-off-by: Jeff King <peff@peff.net>
---
 csum-file.c | 6 +++---
 csum-file.h | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/csum-file.c b/csum-file.c
index 53f5375..465971c 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -11,7 +11,7 @@
 #include "progress.h"
 #include "csum-file.h"
 
-static void flush(struct sha1file *f, void *buf, unsigned int count)
+static void flush(struct sha1file *f, const void *buf, unsigned int count)
 {
 	if (0 <= f->check_fd && count)  {
 		unsigned char check_buffer[8192];
@@ -86,13 +86,13 @@ int sha1write(struct sha1file *f, void *buf, unsigned int count)
 	return fd;
 }
 
-int sha1write(struct sha1file *f, void *buf, unsigned int count)
+int sha1write(struct sha1file *f, const void *buf, unsigned int count)
 {
 	while (count) {
 		unsigned offset = f->offset;
 		unsigned left = sizeof(f->buffer) - offset;
 		unsigned nr = count > left ? left : count;
-		void *data;
+		const void *data;
 
 		if (f->do_crc)
 			f->crc32 = crc32(f->crc32, buf, nr);
diff --git a/csum-file.h b/csum-file.h
index 3b540bd..9dedb03 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -34,7 +34,7 @@ extern int sha1close(struct sha1file *, unsigned char *, unsigned int);
 extern struct sha1file *sha1fd_check(const char *name);
 extern struct sha1file *sha1fd_throughput(int fd, const char *name, struct progress *tp);
 extern int sha1close(struct sha1file *, unsigned char *, unsigned int);
-extern int sha1write(struct sha1file *, void *, unsigned int);
+extern int sha1write(struct sha1file *, const void *, unsigned int);
 extern void sha1flush(struct sha1file *f);
 extern void crc32_begin(struct sha1file *);
 extern uint32_t crc32_end(struct sha1file *);
-- 
1.8.0.2.16.g72e2fc9

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 2/6] strbuf: add string-chomping functions
  2013-01-29  9:14 [PATCH/RFC 0/6] commit caching Jeff King
  2013-01-29  9:15 ` [PATCH 1/6] csum-file: make sha1write const-correct Jeff King
@ 2013-01-29  9:15 ` Jeff King
  2013-01-29 10:15   ` Michael Haggerty
  2013-01-29  9:15 ` [PATCH 3/6] introduce pack metadata cache files Jeff King
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 43+ messages in thread
From: Jeff King @ 2013-01-29  9:15 UTC (permalink / raw)
  To: git; +Cc: Duy Nguyen, Shawn O. Pearce

Sometimes it is handy to cut a trailing string off the end
of a strbuf (e.g., a file extension). These helper functions
make it a one-liner.

Signed-off-by: Jeff King <peff@peff.net>
---
 strbuf.c | 11 +++++++++++
 strbuf.h |  2 ++
 2 files changed, 13 insertions(+)

diff --git a/strbuf.c b/strbuf.c
index 9a373be..8199ced 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -106,6 +106,17 @@ void strbuf_ltrim(struct strbuf *sb)
 	sb->buf[sb->len] = '\0';
 }
 
+void strbuf_chompmem(struct strbuf *sb, const void *data, size_t len)
+{
+	if (sb->len >= len && !memcmp(data, sb->buf + sb->len - len, len))
+		strbuf_setlen(sb, sb->len - len);
+}
+
+void strbuf_chompstr(struct strbuf *sb, const char *str)
+{
+	strbuf_chompmem(sb, str, strlen(str));
+}
+
 struct strbuf **strbuf_split_buf(const char *str, size_t slen,
 				 int terminator, int max)
 {
diff --git a/strbuf.h b/strbuf.h
index ecae4e2..3aeb815 100644
--- a/strbuf.h
+++ b/strbuf.h
@@ -42,6 +42,8 @@ extern void strbuf_ltrim(struct strbuf *);
 extern void strbuf_trim(struct strbuf *);
 extern void strbuf_rtrim(struct strbuf *);
 extern void strbuf_ltrim(struct strbuf *);
+extern void strbuf_chompmem(struct strbuf *, const void *, size_t);
+extern void strbuf_chompstr(struct strbuf *, const char *);
 extern int strbuf_cmp(const struct strbuf *, const struct strbuf *);
 
 /*
-- 
1.8.0.2.16.g72e2fc9

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 3/6] introduce pack metadata cache files
  2013-01-29  9:14 [PATCH/RFC 0/6] commit caching Jeff King
  2013-01-29  9:15 ` [PATCH 1/6] csum-file: make sha1write const-correct Jeff King
  2013-01-29  9:15 ` [PATCH 2/6] strbuf: add string-chomping functions Jeff King
@ 2013-01-29  9:15 ` Jeff King
  2013-01-29 17:35   ` Junio C Hamano
  2013-01-30  1:30   ` Duy Nguyen
  2013-01-29  9:16 ` [PATCH 4/6] introduce a commit metapack Jeff King
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 43+ messages in thread
From: Jeff King @ 2013-01-29  9:15 UTC (permalink / raw)
  To: git; +Cc: Duy Nguyen, Shawn O. Pearce

The on-disk packfile format is nicely compact, but it does
not always provide the fastest format for looking up
information. This patch introduces the concept of
"metapacks", optional metadata files which can live
alongside packs and represent their data in different ways.
This can allow space-time tradeoffs in accessing certain
object data.

Such space-time tradeoffs have traditionally gone into the
.idx file (e.g., the fact that we can quickly find an
object's offset in the packfile is due to the index). In
theory, cached data could also go into the .idx file.
However, keeping it in a separate file makes backwards
compatibility much simpler. Older versions of git can simply
ignore the extra files and use the existing methods for
accessing object data. This also makes metapacks optional,
so you can easily tune the space-time tradeoff on a per-repo
basis.

TODO: document on-disk format in Documentation/technical
TODO: document api

Signed-off-by: Jeff King <peff@peff.net>
---
 Makefile   |   2 +
 metapack.c | 158 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 metapack.h |  42 ++++++++++++++++
 3 files changed, 202 insertions(+)
 create mode 100644 metapack.c
 create mode 100644 metapack.h

diff --git a/Makefile b/Makefile
index 731b6a8..3e4ae1b 100644
--- a/Makefile
+++ b/Makefile
@@ -660,6 +660,7 @@ LIB_H += mergesort.h
 LIB_H += merge-blobs.h
 LIB_H += merge-recursive.h
 LIB_H += mergesort.h
+LIB_H += metapack.h
 LIB_H += notes-cache.h
 LIB_H += notes-merge.h
 LIB_H += notes.h
@@ -778,6 +779,7 @@ LIB_OBJS += mergesort.o
 LIB_OBJS += merge-blobs.o
 LIB_OBJS += merge-recursive.o
 LIB_OBJS += mergesort.o
+LIB_OBJS += metapack.o
 LIB_OBJS += name-hash.o
 LIB_OBJS += notes.o
 LIB_OBJS += notes-cache.o
diff --git a/metapack.c b/metapack.c
new file mode 100644
index 0000000..c859c95
--- /dev/null
+++ b/metapack.c
@@ -0,0 +1,158 @@
+#include "cache.h"
+#include "metapack.h"
+#include "csum-file.h"
+
+static struct sha1file *create_meta_tmp(void)
+{
+	char tmp[PATH_MAX];
+	int fd;
+
+	fd = odb_mkstemp(tmp, sizeof(tmp), "pack/tmp_meta_XXXXXX");
+	return sha1fd(fd, xstrdup(tmp));
+}
+
+static void write_meta_header(struct metapack_writer *mw, const char *id,
+			      uint32_t version)
+{
+	version = htonl(version);
+
+	sha1write(mw->out, "META", 4);
+	sha1write(mw->out, "\0\0\0\1", 4);
+	sha1write(mw->out, mw->pack->sha1, 20);
+	sha1write(mw->out, id, 4);
+	sha1write(mw->out, &version, 4);
+}
+
+void metapack_writer_init(struct metapack_writer *mw,
+			  const char *pack_idx,
+			  const char *name,
+			  int version)
+{
+	struct strbuf path = STRBUF_INIT;
+
+	memset(mw, 0, sizeof(*mw));
+
+	mw->pack = add_packed_git(pack_idx, strlen(pack_idx), 1);
+	if (!mw->pack || open_pack_index(mw->pack))
+		die("unable to open packfile '%s'", pack_idx);
+
+	strbuf_addstr(&path, pack_idx);
+	strbuf_chompstr(&path, ".idx");
+	strbuf_addch(&path, '.');
+	strbuf_addstr(&path, name);
+	mw->path = strbuf_detach(&path, NULL);
+
+	mw->out = create_meta_tmp();
+	write_meta_header(mw, name, version);
+}
+
+void metapack_writer_finish(struct metapack_writer *mw)
+{
+	const char *tmp = mw->out->name;
+
+	sha1close(mw->out, NULL, CSUM_FSYNC);
+	if (rename(tmp, mw->path))
+		die_errno("unable to rename temporary metapack file");
+
+	close_pack_index(mw->pack);
+	free(mw->pack);
+	free(mw->path);
+	free((char *)tmp);
+}
+
+void metapack_writer_add(struct metapack_writer *mw, const void *data, int len)
+{
+	sha1write(mw->out, data, len);
+}
+
+void metapack_writer_add_uint32(struct metapack_writer *mw, uint32_t v)
+{
+	v = htonl(v);
+	metapack_writer_add(mw, &v, 4);
+}
+
+void metapack_writer_foreach(struct metapack_writer *mw,
+			     metapack_writer_each_fn cb,
+			     void *data)
+{
+	const unsigned char *sha1;
+	uint32_t i = 0;
+
+	/*
+	 * We'll feed these to the callback in sorted order, since that is the
+	 * order that they are stored in the .idx file.
+	 */
+	while ((sha1 = nth_packed_object_sha1(mw->pack, i++)))
+		cb(mw, sha1, data);
+}
+
+int metapack_init(struct metapack *m,
+		  struct packed_git *pack,
+		  const char *name,
+		  uint32_t *version)
+{
+	struct strbuf path = STRBUF_INIT;
+	int fd;
+	struct stat st;
+
+	memset(m, 0, sizeof(*m));
+
+	strbuf_addstr(&path, pack->pack_name);
+	strbuf_chompstr(&path, ".pack");
+	strbuf_addch(&path, '.');
+	strbuf_addstr(&path, name);
+
+	fd = open(path.buf, O_RDONLY);
+	strbuf_release(&path);
+	if (fd < 0)
+		return -1;
+	if (fstat(fd, &st) < 0) {
+		close(fd);
+		return -1;
+	}
+
+	m->mapped_len = xsize_t(st.st_size);
+	m->mapped_buf = xmmap(NULL, m->mapped_len, PROT_READ, MAP_PRIVATE, fd, 0);
+	close(fd);
+
+	m->data = m->mapped_buf;
+	m->len = m->mapped_len;
+
+	if (m->len < 8 ||
+	    memcmp(m->mapped_buf, "META", 4) ||
+	    memcmp(m->mapped_buf + 4, "\0\0\0\1", 4)) {
+		warning("metapack '%s' for '%s' does not have a valid header",
+			name, pack->pack_name);
+		metapack_close(m);
+		return -1;
+	}
+	m->data += 8;
+	m->len -= 8;
+
+	if (m->len < 20 || hashcmp(m->data, pack->sha1)) {
+		warning("metapack '%s' for '%s' does not match pack sha1",
+			name, pack->pack_name);
+		metapack_close(m);
+		return -1;
+	}
+	m->data += 20;
+	m->len -= 20;
+
+	if (m->len < 8 || memcmp(m->data, name, 4)) {
+		warning("metapack '%s' for '%s' does not have expected header id",
+			name, pack->pack_name);
+		metapack_close(m);
+		return -1;
+	}
+	memcpy(version, m->data + 4, 4);
+	*version = ntohl(*version);
+	m->data += 8;
+	m->len -= 8;
+
+	return 0;
+}
+
+void metapack_close(struct metapack *m)
+{
+	munmap(m->mapped_buf, m->mapped_len);
+}
diff --git a/metapack.h b/metapack.h
new file mode 100644
index 0000000..6af17fe
--- /dev/null
+++ b/metapack.h
@@ -0,0 +1,42 @@
+#ifndef METAPACK_H
+#define METAPACK_H
+
+struct packed_git;
+struct sha1file;
+
+struct metapack_writer {
+	char *path;
+	struct packed_git *pack;
+	struct sha1file *out;
+};
+
+void metapack_writer_init(struct metapack_writer *mw,
+			  const char *pack_idx,
+			  const char *name,
+			  int version);
+void metapack_writer_add(struct metapack_writer *mw, const void *data, int len);
+void metapack_writer_add_uint32(struct metapack_writer *mw, uint32_t v);
+void metapack_writer_finish(struct metapack_writer *mw);
+
+typedef void (*metapack_writer_each_fn)(struct metapack_writer *,
+					const unsigned char *sha1,
+					void *data);
+void metapack_writer_foreach(struct metapack_writer *mw,
+			     metapack_writer_each_fn cb,
+			     void *data);
+
+struct metapack {
+	void *mapped_buf;
+	size_t mapped_len;
+
+	void *data;
+	size_t len;
+};
+
+int metapack_init(struct metapack *m,
+		  struct packed_git *pack,
+		  const char *name,
+		  uint32_t *version);
+void metapack_close(struct metapack *m);
+
+#endif
-- 
1.8.0.2.16.g72e2fc9

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 4/6] introduce a commit metapack
  2013-01-29  9:14 [PATCH/RFC 0/6] commit caching Jeff King
                   ` (2 preceding siblings ...)
  2013-01-29  9:15 ` [PATCH 3/6] introduce pack metadata cache files Jeff King
@ 2013-01-29  9:16 ` Jeff King
  2013-01-29 10:24   ` Michael Haggerty
                     ` (3 more replies)
  2013-01-29  9:16 ` [PATCH 5/6] add git-metapack command Jeff King
                   ` (3 subsequent siblings)
  7 siblings, 4 replies; 43+ messages in thread
From: Jeff King @ 2013-01-29  9:16 UTC (permalink / raw)
  To: git; +Cc: Duy Nguyen, Shawn O. Pearce

When we are doing a commit traversal that does not need to
look at the commit messages themselves (e.g., rev-list,
merge-base, etc), we spend a lot of time accessing,
decompressing, and parsing the commit objects just to find
the parent and timestamp information. We can make a
space-time tradeoff by caching that information on disk in a
compact, uncompressed format.

TODO: document on-disk format in Documentation/technical
TODO: document API

Signed-off-by: Jeff King <peff@peff.net>
---
 Makefile          |   2 +
 commit-metapack.c | 175 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 commit-metapack.h |  12 ++++
 3 files changed, 189 insertions(+)
 create mode 100644 commit-metapack.c
 create mode 100644 commit-metapack.h

diff --git a/Makefile b/Makefile
index 3e4ae1b..6ca5320 100644
--- a/Makefile
+++ b/Makefile
@@ -619,6 +619,7 @@ LIB_H += column.h
 LIB_H += cache.h
 LIB_H += color.h
 LIB_H += column.h
+LIB_H += commit-metapack.h
 LIB_H += commit.h
 LIB_H += compat/bswap.h
 LIB_H += compat/cygwin.h
@@ -730,6 +731,7 @@ LIB_OBJS += combine-diff.o
 LIB_OBJS += color.o
 LIB_OBJS += column.o
 LIB_OBJS += combine-diff.o
+LIB_OBJS += commit-metapack.o
 LIB_OBJS += commit.o
 LIB_OBJS += compat/obstack.o
 LIB_OBJS += compat/terminal.o
diff --git a/commit-metapack.c b/commit-metapack.c
new file mode 100644
index 0000000..2c19f48
--- /dev/null
+++ b/commit-metapack.c
@@ -0,0 +1,175 @@
+#include "cache.h"
+#include "commit-metapack.h"
+#include "metapack.h"
+#include "commit.h"
+#include "sha1-lookup.h"
+
+struct commit_metapack {
+	struct metapack mp;
+	uint32_t nr;
+	unsigned char *index;
+	unsigned char *data;
+	struct commit_metapack *next;
+};
+static struct commit_metapack *commit_metapacks;
+
+static struct commit_metapack *alloc_commit_metapack(struct packed_git *pack)
+{
+	struct commit_metapack *it = xcalloc(1, sizeof(*it));
+	uint32_t version;
+
+	if (metapack_init(&it->mp, pack, "commits", &version) < 0) {
+		free(it);
+		return NULL;
+	}
+	if (version != 1) {
+		/*
+		 * This file comes from a more recent git version. Don't bother
+		 * warning the user, as we'll just fallback to reading the
+		 * commits.
+		 */
+		metapack_close(&it->mp);
+		free(it);
+		return NULL;
+	}
+
+	if (it->mp.len < 4) {
+		warning("commit metapack for '%s' is truncated", pack->pack_name);
+		metapack_close(&it->mp);
+		free(it);
+		return NULL;
+	}
+	memcpy(&it->nr, it->mp.data, 4);
+	it->nr = ntohl(it->nr);
+
+	/*
+	 * We need 84 bytes for each entry: sha1(20), date(4), tree(20),
+	 * parents(40).
+	 */
+	if (it->mp.len < (84 * it->nr + 4)) {
+		warning("commit metapack for '%s' is truncated", pack->pack_name);
+		metapack_close(&it->mp);
+		free(it);
+		return NULL;
+	}
+
+	it->index = it->mp.data + 4;
+	it->data = it->index + 20 * it->nr;
+
+	return it;
+}
+
+static void prepare_commit_metapacks(void)
+{
+	static int initialized;
+	struct commit_metapack **tail = &commit_metapacks;
+	struct packed_git *p;
+
+	if (initialized)
+		return;
+
+	prepare_packed_git();
+	for (p = packed_git; p; p = p->next) {
+		struct commit_metapack *it = alloc_commit_metapack(p);
+
+		if (it) {
+			*tail = it;
+			tail = &it->next;
+		}
+	}
+
+	initialized = 1;
+}
+
+int commit_metapack(unsigned char *sha1,
+		    uint32_t *timestamp,
+		    unsigned char **tree,
+		    unsigned char **parent1,
+		    unsigned char **parent2)
+{
+	struct commit_metapack *p;
+
+	prepare_commit_metapacks();
+	for (p = commit_metapacks; p; p = p->next) {
+		unsigned char *data;
+		int pos = sha1_entry_pos(p->index, 20, 0, 0, p->nr, p->nr, sha1);
+		if (pos < 0)
+			continue;
+
+		/* timestamp(4) + tree(20) + parents(40) */
+		data = p->data + 64 * pos;
+		*timestamp = *(uint32_t *)data;
+		*timestamp = ntohl(*timestamp);
+		data += 4;
+		*tree = data;
+		data += 20;
+		*parent1 = data;
+		data += 20;
+		*parent2 = data;
+
+		return 0;
+	}
+
+	return -1;
+}
+
+static void get_commits(struct metapack_writer *mw,
+			const unsigned char *sha1,
+			void *data)
+{
+	struct commit_list ***tail = data;
+	enum object_type type = sha1_object_info(sha1, NULL);
+	struct commit *c;
+
+	if (type != OBJ_COMMIT)
+		return;
+
+	c = lookup_commit(sha1);
+	if (!c || parse_commit(c))
+		die("unable to read commit %s", sha1_to_hex(sha1));
+
+	/*
+	 * Our fixed-size parent list cannot represent root commits, nor
+	 * octopus merges. Just skip those commits, as we can fallback
+	 * in those rare cases to reading the actual commit object.
+	 */
+	if (!c->parents ||
+	    (c->parents && c->parents->next && c->parents->next->next))
+		return;
+
+	*tail = &commit_list_insert(c, *tail)->next;
+}
+
+void commit_metapack_write(const char *idx)
+{
+	struct metapack_writer mw;
+	struct commit_list *commits = NULL, *p;
+	struct commit_list **tail = &commits;
+	uint32_t nr = 0;
+
+	metapack_writer_init(&mw, idx, "commits", 1);
+
+	/* Figure out how many eligible commits we've got in this pack. */
+	metapack_writer_foreach(&mw, get_commits, &tail);
+	for (p = commits; p; p = p->next)
+		nr++;
+	metapack_writer_add_uint32(&mw, nr);
+
+	/* Then write an index of commit sha1s */
+	for (p = commits; p; p = p->next)
+		metapack_writer_add(&mw, p->item->object.sha1, 20);
+
+	/* Followed by the actual date/tree/parents data */
+	for (p = commits; p; p = p->next) {
+		struct commit *c = p->item;
+		metapack_writer_add_uint32(&mw, c->date);
+		metapack_writer_add(&mw, c->tree->object.sha1, 20);
+		metapack_writer_add(&mw, c->parents->item->object.sha1, 20);
+		metapack_writer_add(&mw,
+				    c->parents->next ?
+				    c->parents->next->item->object.sha1 :
+				    null_sha1, 20);
+	}
+
+	metapack_writer_finish(&mw);
+}
diff --git a/commit-metapack.h b/commit-metapack.h
new file mode 100644
index 0000000..4684573
--- /dev/null
+++ b/commit-metapack.h
@@ -0,0 +1,12 @@
+#ifndef METAPACK_COMMIT_H
+#define METAPACK_COMMIT_H
+
+int commit_metapack(unsigned char *sha1,
+		    uint32_t *timestamp,
+		    unsigned char **tree,
+		    unsigned char **parent1,
+		    unsigned char **parent2);
+
+void commit_metapack_write(const char *idx_file);
+
+#endif
-- 
1.8.0.2.16.g72e2fc9

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 5/6] add git-metapack command
  2013-01-29  9:14 [PATCH/RFC 0/6] commit caching Jeff King
                   ` (3 preceding siblings ...)
  2013-01-29  9:16 ` [PATCH 4/6] introduce a commit metapack Jeff King
@ 2013-01-29  9:16 ` Jeff King
  2013-01-29  9:16 ` [PATCH 6/6] commit: look up commit info in metapack Jeff King
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 43+ messages in thread
From: Jeff King @ 2013-01-29  9:16 UTC (permalink / raw)
  To: git; +Cc: Duy Nguyen, Shawn O. Pearce

This is a plumbing command for generating metapack files.
Right now it understands only the "commits" metapack (and
there is not yet a reader). Eventually we may want to build
this metapack automatically when we generate a new pack.
Let's be conservative for now, though, and let the idea
prove itself in practice before turning it on for everyone.

The commits metapack generated by this command is 84 bytes
per commit; for linux-2.6.git, this is about 31M.

TODO: documentation

Signed-off-by: Jeff King <peff@peff.net>
---
 .gitignore         |  1 +
 Makefile           |  1 +
 builtin.h          |  1 +
 builtin/metapack.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 git-repack.sh      |  2 +-
 git.c              |  1 +
 6 files changed, 78 insertions(+), 1 deletion(-)
 create mode 100644 builtin/metapack.c

diff --git a/.gitignore b/.gitignore
index 6669bf0..3f67a36 100644
--- a/.gitignore
+++ b/.gitignore
@@ -93,6 +93,7 @@
 /git-merge-subtree
 /git-mergetool
 /git-mergetool--lib
+/git-metapack
 /git-mktag
 /git-mktree
 /git-name-rev
diff --git a/Makefile b/Makefile
index 6ca5320..3899699 100644
--- a/Makefile
+++ b/Makefile
@@ -905,6 +905,7 @@ BUILTIN_OBJS += builtin/merge-tree.o
 BUILTIN_OBJS += builtin/merge-ours.o
 BUILTIN_OBJS += builtin/merge-recursive.o
 BUILTIN_OBJS += builtin/merge-tree.o
+BUILTIN_OBJS += builtin/metapack.o
 BUILTIN_OBJS += builtin/mktag.o
 BUILTIN_OBJS += builtin/mktree.o
 BUILTIN_OBJS += builtin/mv.o
diff --git a/builtin.h b/builtin.h
index faef559..30108ab 100644
--- a/builtin.h
+++ b/builtin.h
@@ -97,6 +97,7 @@ extern int cmd_merge_tree(int argc, const char **argv, const char *prefix);
 extern int cmd_merge_file(int argc, const char **argv, const char *prefix);
 extern int cmd_merge_recursive(int argc, const char **argv, const char *prefix);
 extern int cmd_merge_tree(int argc, const char **argv, const char *prefix);
+extern int cmd_metapack(int argc, const char **argv, const char *prefix);
 extern int cmd_mktag(int argc, const char **argv, const char *prefix);
 extern int cmd_mktree(int argc, const char **argv, const char *prefix);
 extern int cmd_mv(int argc, const char **argv, const char *prefix);
diff --git a/builtin/metapack.c b/builtin/metapack.c
new file mode 100644
index 0000000..5fee6cf
--- /dev/null
+++ b/builtin/metapack.c
@@ -0,0 +1,73 @@
+#include "builtin.h"
+#include "parse-options.h"
+#include "commit-metapack.h"
+
+static const char *metapack_usage[] = {
+	N_("git metapack [options] <packindex...>"),
+	NULL
+};
+
+#define METAPACK_COMMITS (1<<0)
+
+static void metapack_one(const char *idx, int type)
+{
+	if (type & METAPACK_COMMITS)
+		commit_metapack_write(idx);
+}
+
+static void metapack_all(int type)
+{
+	struct strbuf path = STRBUF_INIT;
+	size_t dirlen;
+	DIR *dh;
+	struct dirent *de;
+
+	strbuf_addstr(&path, get_object_directory());
+	strbuf_addstr(&path, "/pack");
+	dirlen = path.len;
+
+	dh = opendir(path.buf);
+	if (!dh)
+		die_errno("unable to open pack directory '%s'", path.buf);
+	while ((de = readdir(dh))) {
+		if (!has_extension(de->d_name, ".idx"))
+			continue;
+
+		strbuf_addch(&path, '/');
+		strbuf_addstr(&path, de->d_name);
+		metapack_one(path.buf, type);
+		strbuf_setlen(&path, dirlen);
+	}
+
+	closedir(dh);
+	strbuf_release(&path);
+}
+
+int cmd_metapack(int argc, const char **argv, const char *prefix)
+{
+	int all = 0;
+	int type = 0;
+	struct option opts[] = {
+		OPT_BOOL(0, "all", &all, N_("create metapacks for all packs")),
+		OPT_BIT(0, "commits", &type, N_("create commit metapacks"),
+			METAPACK_COMMITS),
+		OPT_END()
+	};
+
+	argc = parse_options(argc, argv, prefix, opts, metapack_usage, 0);
+
+	if (all && argc)
+		usage_msg_opt(_("pack arguments do not make sense with --all"),
+			      metapack_usage, opts);
+	if (!type)
+		usage_msg_opt(_("no metapack type specified"),
+			      metapack_usage, opts);
+
+	if (all)
+		metapack_all(type);
+	else
+		for (; *argv; argv++)
+			metapack_one(*argv, type);
+
+	return 0;
+}
diff --git a/git-repack.sh b/git-repack.sh
index 7579331..e6a9773 100755
--- a/git-repack.sh
+++ b/git-repack.sh
@@ -180,7 +180,7 @@ then
 		  do
 			case " $fullbases " in
 			*" $e "*) ;;
-			*)	rm -f "$e.pack" "$e.idx" "$e.keep" ;;
+			*)	rm -f "$e.pack" "$e.idx" "$e.keep" "$e.commits";;
 			esac
 		  done
 		)
diff --git a/git.c b/git.c
index b10c18b..f6e5552 100644
--- a/git.c
+++ b/git.c
@@ -365,6 +365,7 @@ static void handle_internal_command(int argc, const char **argv)
 		{ "merge-recursive-theirs", cmd_merge_recursive, RUN_SETUP | NEED_WORK_TREE },
 		{ "merge-subtree", cmd_merge_recursive, RUN_SETUP | NEED_WORK_TREE },
 		{ "merge-tree", cmd_merge_tree, RUN_SETUP },
+		{ "metapack", cmd_metapack, RUN_SETUP },
 		{ "mktag", cmd_mktag, RUN_SETUP },
 		{ "mktree", cmd_mktree, RUN_SETUP },
 		{ "mv", cmd_mv, RUN_SETUP | NEED_WORK_TREE },
-- 
1.8.0.2.16.g72e2fc9

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 6/6] commit: look up commit info in metapack
  2013-01-29  9:14 [PATCH/RFC 0/6] commit caching Jeff King
                   ` (4 preceding siblings ...)
  2013-01-29  9:16 ` [PATCH 5/6] add git-metapack command Jeff King
@ 2013-01-29  9:16 ` Jeff King
  2013-01-30  3:31 ` [PATCH/RFC 0/6] commit caching Duy Nguyen
  2013-01-31 17:14 ` Shawn Pearce
  7 siblings, 0 replies; 43+ messages in thread
From: Jeff King @ 2013-01-29  9:16 UTC (permalink / raw)
  To: git; +Cc: Duy Nguyen, Shawn O. Pearce

Now that we have the plumbing in place to generate and read
commit metapacks, we can hook them up to parse_commit to
fill in the traversal information much more quickly.

We only do so if save_commit_buffer is turned off;
otherwise, the callers will expect to be able to read
commit->buffer after parse_commit returns (and since our
cache obviously does not have that information, we must
leave it NULL). As callers learn to handle a NULL
commit->buffer, we can eventually relax this (while it might
seem like a useless no-op to use the cache if we are going
to load the commit anyway, many callers may first filter
based on the traversal, and end up loading the commit
message for only a subset of the commits).

With this patch (and having run "git metapack --all
--commits"), my best-of-five warm-cache "git rev-list
--count --all" traversal of linux-2.6.git drops from 4.219s
to 0.659s.

Similarly, cold-cache drops from 13.696s to 4.763s due to
the compactness of the cache data (but you are penalized, of
course, if you then want to actually look at the commit
messages, since you have not warmed them into the cache).

Signed-off-by: Jeff King <peff@peff.net>
---
 commit.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/commit.c b/commit.c
index e8eb0ae..b326201 100644
--- a/commit.c
+++ b/commit.c
@@ -8,6 +8,7 @@
 #include "notes.h"
 #include "gpg-interface.h"
 #include "mergesort.h"
+#include "commit-metapack.h"
 
 static struct commit_extra_header *read_commit_extra_header_lines(const char *buf, size_t len, const char **);
 
@@ -306,6 +307,24 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s
 	return 0;
 }
 
+static int parse_commit_metapack(struct commit *item)
+{
+	unsigned char *tree, *p1, *p2;
+	uint32_t ts;
+
+	if (commit_metapack(item->object.sha1, &ts, &tree, &p1, &p2) < 0)
+		return -1;
+
+	item->date = ts;
+	item->tree = lookup_tree(tree);
+	commit_list_insert(lookup_commit(p1), &item->parents);
+	if (!is_null_sha1(p2))
+		commit_list_insert(lookup_commit(p2), &item->parents->next);
+
+	item->object.parsed = 1;
+	return 0;
+}
+
 int parse_commit(struct commit *item)
 {
 	enum object_type type;
@@ -317,6 +336,10 @@ int parse_commit(struct commit *item)
 		return -1;
 	if (item->object.parsed)
 		return 0;
+
+	if (!save_commit_buffer && !parse_commit_metapack(item))
+		return 0;
+
 	buffer = read_sha1_file(item->object.sha1, &type, &size);
 	if (!buffer)
 		return error("Could not read %s",
-- 
1.8.0.2.16.g72e2fc9

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/6] strbuf: add string-chomping functions
  2013-01-29  9:15 ` [PATCH 2/6] strbuf: add string-chomping functions Jeff King
@ 2013-01-29 10:15   ` Michael Haggerty
  2013-01-29 11:10     ` Jeff King
  0 siblings, 1 reply; 43+ messages in thread
From: Michael Haggerty @ 2013-01-29 10:15 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Duy Nguyen, Shawn O. Pearce

On 01/29/2013 10:15 AM, Jeff King wrote:
> Sometimes it is handy to cut a trailing string off the end
> of a strbuf (e.g., a file extension). These helper functions
> make it a one-liner.
> 
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  strbuf.c | 11 +++++++++++
>  strbuf.h |  2 ++
>  2 files changed, 13 insertions(+)
> 
> diff --git a/strbuf.c b/strbuf.c
> index 9a373be..8199ced 100644
> --- a/strbuf.c
> +++ b/strbuf.c
> @@ -106,6 +106,17 @@ void strbuf_ltrim(struct strbuf *sb)
>  	sb->buf[sb->len] = '\0';
>  }
>  
> +void strbuf_chompmem(struct strbuf *sb, const void *data, size_t len)
> +{
> +	if (sb->len >= len && !memcmp(data, sb->buf + sb->len - len, len))
> +		strbuf_setlen(sb, sb->len - len);
> +}
> +
> +void strbuf_chompstr(struct strbuf *sb, const char *str)
> +{
> +	strbuf_chompmem(sb, str, strlen(str));
> +}
> +
>  struct strbuf **strbuf_split_buf(const char *str, size_t slen,
>  				 int terminator, int max)
>  {
> diff --git a/strbuf.h b/strbuf.h
> index ecae4e2..3aeb815 100644
> --- a/strbuf.h
> +++ b/strbuf.h
> @@ -42,6 +42,8 @@ extern void strbuf_ltrim(struct strbuf *);
>  extern void strbuf_trim(struct strbuf *);
>  extern void strbuf_rtrim(struct strbuf *);
>  extern void strbuf_ltrim(struct strbuf *);
> +extern void strbuf_chompmem(struct strbuf *, const void *, size_t);
> +extern void strbuf_chompstr(struct strbuf *, const char *);
>  extern int strbuf_cmp(const struct strbuf *, const struct strbuf *);
>  
>  /*
> 

It might be handy to have these functions return true/false based on
whether the suffix was actually found.

Please document the new functions in
Documentation/technical/api-strbuf.txt.  Personally I would also
advocate a "docstring" in the header file, but obviously that preference
is the exception rather than the rule in the git project :-(

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/6] introduce a commit metapack
  2013-01-29  9:16 ` [PATCH 4/6] introduce a commit metapack Jeff King
@ 2013-01-29 10:24   ` Michael Haggerty
  2013-01-29 11:13     ` Jeff King
  2013-01-29 17:38   ` Junio C Hamano
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 43+ messages in thread
From: Michael Haggerty @ 2013-01-29 10:24 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Duy Nguyen, Shawn O. Pearce

On 01/29/2013 10:16 AM, Jeff King wrote:
> When we are doing a commit traversal that does not need to
> look at the commit messages themselves (e.g., rev-list,
> merge-base, etc), we spend a lot of time accessing,
> decompressing, and parsing the commit objects just to find
> the parent and timestamp information. We can make a
> space-time tradeoff by caching that information on disk in a
> compact, uncompressed format.
> 
> TODO: document on-disk format in Documentation/technical
> TODO: document API

Would this be a good place to add the commit generation number that is
so enthusiastically discussed on the mailing list from time to time?

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/6] strbuf: add string-chomping functions
  2013-01-29 10:15   ` Michael Haggerty
@ 2013-01-29 11:10     ` Jeff King
  2013-01-30  5:00       ` Michael Haggerty
  0 siblings, 1 reply; 43+ messages in thread
From: Jeff King @ 2013-01-29 11:10 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: git, Duy Nguyen, Shawn O. Pearce

On Tue, Jan 29, 2013 at 11:15:34AM +0100, Michael Haggerty wrote:

> > +void strbuf_chompmem(struct strbuf *sb, const void *data, size_t len)
> > +{
> > +	if (sb->len >= len && !memcmp(data, sb->buf + sb->len - len, len))
> > +		strbuf_setlen(sb, sb->len - len);
> > +}
> > +
> > +void strbuf_chompstr(struct strbuf *sb, const char *str)
> > +{
> > +	strbuf_chompmem(sb, str, strlen(str));
> > +}
> > +
> It might be handy to have these functions return true/false based on
> whether the suffix was actually found.

Yeah, that sounds reasonable.

> Please document the new functions in
> Documentation/technical/api-strbuf.txt.  Personally I would also
> advocate a "docstring" in the header file, but obviously that preference
> is the exception rather than the rule in the git project :-(

Will do. I need to document the metapack functions, too, so I was thinking
about experimenting with some inline documentation systems.

-Peff

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/6] introduce a commit metapack
  2013-01-29 10:24   ` Michael Haggerty
@ 2013-01-29 11:13     ` Jeff King
  0 siblings, 0 replies; 43+ messages in thread
From: Jeff King @ 2013-01-29 11:13 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: git, Duy Nguyen, Shawn O. Pearce

On Tue, Jan 29, 2013 at 11:24:45AM +0100, Michael Haggerty wrote:

> On 01/29/2013 10:16 AM, Jeff King wrote:
> > When we are doing a commit traversal that does not need to
> > look at the commit messages themselves (e.g., rev-list,
> > merge-base, etc), we spend a lot of time accessing,
> > decompressing, and parsing the commit objects just to find
> > the parent and timestamp information. We can make a
> > space-time tradeoff by caching that information on disk in a
> > compact, uncompressed format.
> > 
> > TODO: document on-disk format in Documentation/technical
> > TODO: document API
> 
> Would this be a good place to add the commit generation number that is
> so enthusiastically discussed on the mailing list from time to time?

Yes, that is one of my goals. We may even be able to just replace the
timestamp field in the cache with a generation number. When it gets
pretty-printed we pull it out of the commit message again anyway, so in
theory the only use inside "struct commit" is for ordering. But I
haven't looked at all of the use sites yet to be sure nobody is
depending on it being an actual date stamp.

-Peff

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/6] introduce pack metadata cache files
  2013-01-29  9:15 ` [PATCH 3/6] introduce pack metadata cache files Jeff King
@ 2013-01-29 17:35   ` Junio C Hamano
  2013-01-30  6:47     ` Jeff King
  2013-01-30  1:30   ` Duy Nguyen
  1 sibling, 1 reply; 43+ messages in thread
From: Junio C Hamano @ 2013-01-29 17:35 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Duy Nguyen, Shawn O. Pearce

Jeff King <peff@peff.net> writes:

> +static void write_meta_header(struct metapack_writer *mw, const char *id,
> +			      uint32_t version)
> +{
> +	version = htonl(version);
> +
> +	sha1write(mw->out, "META", 4);
> +	sha1write(mw->out, "\0\0\0\1", 4);
> +	sha1write(mw->out, mw->pack->sha1, 20);
> +	sha1write(mw->out, id, 4);
> +	sha1write(mw->out, &version, 4);
> +}

It seems that you are very close to actually having a plumbing that
could also do the pack .idx files.  Until/unless that can be done, I
am not sure how much benefit we would be getting from a file format
that records a subtype "id" and a generic "META" type, instead of
just a single "id" as the type ehader.  But it is OK to use 8 extra
bytes if we can potentially gain something later.

Shouldn't id be validated with at least something like

	if (strlen(id) < 3)
		die("Bad id: %s", id);

to catch a call

	write_meta_header(&mw, "me", 47);

that will stuff 'm', 'e', NUL and the garbage the compiler/linker
combo has placed after that constant string in the 4-byte id field?

> +void metapack_writer_init(struct metapack_writer *mw,
> +			  const char *pack_idx,
> +			  const char *name,
> +			  int version)
> +{
> +	struct strbuf path = STRBUF_INIT;
> +
> +	memset(mw, 0, sizeof(*mw));
> +
> +	mw->pack = add_packed_git(pack_idx, strlen(pack_idx), 1);
> +	if (!mw->pack || open_pack_index(mw->pack))
> +		die("unable to open packfile '%s'", pack_idx);
> +
> +	strbuf_addstr(&path, pack_idx);
> +	strbuf_chompstr(&path, ".idx");
> +	strbuf_addch(&path, '.');
> +	strbuf_addstr(&path, name);

Your chompstr() does not even validate if the given name ends with
".idx", so this sounds like a glorified way to say

	strbuf_splice(&path, path->len - strlen("idx"), strlen("idx"),
			 name, strlen(name));

to me.

> +	mw->path = strbuf_detach(&path, NULL);
> +
> +	mw->out = create_meta_tmp();
> +	write_meta_header(mw, name, version);
> +}
> +
> +void metapack_writer_finish(struct metapack_writer *mw)
> +{
> +	const char *tmp = mw->out->name;
> +
> +	sha1close(mw->out, NULL, CSUM_FSYNC);
> +	if (rename(tmp, mw->path))
> +		die_errno("unable to rename temporary metapack file");

Who is responsible for running adjust_shared_perm()?  The caller, or
this function?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/6] introduce a commit metapack
  2013-01-29  9:16 ` [PATCH 4/6] introduce a commit metapack Jeff King
  2013-01-29 10:24   ` Michael Haggerty
@ 2013-01-29 17:38   ` Junio C Hamano
  2013-01-29 18:08     ` Junio C Hamano
  2013-01-30  7:07     ` Jeff King
  2013-01-30  3:36   ` Duy Nguyen
  2013-01-30 13:56   ` Duy Nguyen
  3 siblings, 2 replies; 43+ messages in thread
From: Junio C Hamano @ 2013-01-29 17:38 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Duy Nguyen, Shawn O. Pearce

Jeff King <peff@peff.net> writes:

> +int commit_metapack(unsigned char *sha1,
> +		    uint32_t *timestamp,
> +		    unsigned char **tree,
> +		    unsigned char **parent1,
> +		    unsigned char **parent2)
> +{
> +	struct commit_metapack *p;
> +
> +	prepare_commit_metapacks();
> +	for (p = commit_metapacks; p; p = p->next) {
> +		unsigned char *data;
> +		int pos = sha1_entry_pos(p->index, 20, 0, 0, p->nr, p->nr, sha1);

This is a tangent, but isn't it about time to rip out the check for
GIT_USE_LOOKUP in find_pack_entry_one(), I wonder.

> +	prepare_commit_metapacks();
> +	for (p = commit_metapacks; p; p = p->next) {
> +		unsigned char *data;
> +		int pos = sha1_entry_pos(p->index, 20, 0, 0, p->nr, p->nr, sha1);
> +		if (pos < 0)
> +			continue;
> +
> +		/* timestamp(4) + tree(20) + parents(40) */
> +		data = p->data + 64 * pos;
> +		*timestamp = *(uint32_t *)data;
> +		*timestamp = ntohl(*timestamp);
> +		data += 4;
> +		*tree = data;
> +		data += 20;
> +		*parent1 = data;
> +		data += 20;
> +		*parent2 = data;
> +
> +		return 0;
> +	}
> +
> +	return -1;
> +}

I am torn on this one.

These cached properties of a single commit will not change no matter
which pack it appears in, and it feels logically wrong, especially
when you record these object names in the full SHA-1 form, to tie a
"commit metapack" to a pack.  Logically there needs only one commit
metapack that describes all the commits known to the repository when
the metapack was created.

In order to reduce the disk footprint and I/O cost, the future
direction for this mechanism may want to point into an existing
store of SHA-1 hashes with a shorter file offset, and the .idx file
could be such a store, and in order to move in that direction, you
cannot avoid tying a metapack to a pack.

> +static void get_commits(struct metapack_writer *mw,
> +			const unsigned char *sha1,
> +			void *data)
> +{
> +	struct commit_list ***tail = data;
> +	enum object_type type = sha1_object_info(sha1, NULL);
> +	struct commit *c;
> +
> +	if (type != OBJ_COMMIT)
> +		return;
> +
> +	c = lookup_commit(sha1);
> +	if (!c || parse_commit(c))
> +		die("unable to read commit %s", sha1_to_hex(sha1));
> +
> +	/*
> +	 * Our fixed-size parent list cannot represent root commits, nor
> +	 * octopus merges. Just skip those commits, as we can fallback
> +	 * in those rare cases to reading the actual commit object.
> +	 */
> +	if (!c->parents ||
> +	    (c->parents && c->parents->next && c->parents->next->next))
> +		return;
> +
> +	*tail = &commit_list_insert(c, *tail)->next;
> +}

It feels somewhat wasteful to:

 - use commit_list for this, rather than an array of commit
   objects.  If you have a rough estimate of the number of commits
   in the pack, you could just preallocate a single array and use
   ALLOC_GROW() on it, no?

 - iterate over the .idx file and run sha1_object_info() and
   parse_commit() on many objects in the SHA-1 order.  Iterating in
   the way builtin/pack-objects.c::get_object_details() does avoids
   jumping around in existing packfiles, which may be more
   efficient, no?

> +void commit_metapack_write(const char *idx)
> +{
> +	struct metapack_writer mw;
> +	struct commit_list *commits = NULL, *p;
> +	struct commit_list **tail = &commits;
> +	uint32_t nr = 0;
> +
> +	metapack_writer_init(&mw, idx, "commits", 1);
> +
> +	/* Figure out how many eligible commits we've got in this pack. */
> +	metapack_writer_foreach(&mw, get_commits, &tail);
> +	for (p = commits; p; p = p->next)
> +		nr++;

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/6] introduce a commit metapack
  2013-01-29 17:38   ` Junio C Hamano
@ 2013-01-29 18:08     ` Junio C Hamano
  2013-01-30  7:12       ` Jeff King
  2013-01-30  7:07     ` Jeff King
  1 sibling, 1 reply; 43+ messages in thread
From: Junio C Hamano @ 2013-01-29 18:08 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Duy Nguyen, Shawn O. Pearce

Junio C Hamano <gitster@pobox.com> writes:

> I am torn on this one.
>
> These cached properties of a single commit will not change no matter
> which pack it appears in, and it feels logically wrong, especially
> when you record these object names in the full SHA-1 form, to tie a
> "commit metapack" to a pack.  Logically there needs only one commit
> metapack that describes all the commits known to the repository when
> the metapack was created.
>
> In order to reduce the disk footprint and I/O cost, the future
> direction for this mechanism may want to point into an existing
> store of SHA-1 hashes with a shorter file offset, and the .idx file
> could be such a store, and in order to move in that direction, you
> cannot avoid tying a metapack to a pack.

Have you considered if we can extend the .idx format version 2
without actually having to bump the version number?  My quick
reading of check_packed_git_idx() tells me that we have a gap after
the "large offset table" that we can place extensions, but I may be
mistaken.  If we indeed can, then perhaps adding a series of

	4-byte "id" header
        4-byte extension length (or 8-byte)
        ... N-byte worth of extension data ...

followed by

	20-byte SHA-1 checksum of all the extension sections
	8-byte file offset to the first extension section

at that gap, immediately before the trailer of the .idx file written
by git_SHA1_Final(), in a way similar to the index extension is done
may give us a natural way to deal with this.  The reader can first
read the offset recorded at the tail, checking if the offset is
plausible because it may not exist, the check the extension section
checksum, to see if the file has extension, or the file just ends
with the large offset table and the 28 bytes it tried to were from
the entries near the end of the large offset table.

If that is not worth attempting, perhaps we may still want to store
this as an optional extended section and bump the .idx format
version.

Then it will be very natural for the extension data that store the
commit metainfo to name objects in the pack the .idx file describes
by the offset in the SHA-1 table.

As we always say, .idx is a local cache and bumping its version is
not a huge headache compared to other longer term storage items.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/6] introduce pack metadata cache files
  2013-01-29  9:15 ` [PATCH 3/6] introduce pack metadata cache files Jeff King
  2013-01-29 17:35   ` Junio C Hamano
@ 2013-01-30  1:30   ` Duy Nguyen
  2013-01-30  6:50     ` Jeff King
  1 sibling, 1 reply; 43+ messages in thread
From: Duy Nguyen @ 2013-01-30  1:30 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Shawn O. Pearce

On Tue, Jan 29, 2013 at 4:15 PM, Jeff King <peff@peff.net> wrote:
> +static void write_meta_header(struct metapack_writer *mw, const char *id,
> +                             uint32_t version)
> +{
> +       version = htonl(version);
> +
> +       sha1write(mw->out, "META", 4);
> +       sha1write(mw->out, "\0\0\0\1", 4);
> +       sha1write(mw->out, mw->pack->sha1, 20);
> +       sha1write(mw->out, id, 4);
> +       sha1write(mw->out, &version, 4);
> +}

Because you go with all-commit-info-in-one-file, perhaps we should
have an uint32_t bitmap to describe what info this cache contains? So
far we need 4 bits for date, tree, 1st and 2nd parents (yes, I still
want to check if storing 1-parent commits only gains us anything on
some other repos). When commit count comes, it can take the fifth bit.
Reachability bitmap offsets can take the sixth bit, if we just append
the bitmaps at the end of the same file.
-- 
Duy

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH/RFC 0/6] commit caching
  2013-01-29  9:14 [PATCH/RFC 0/6] commit caching Jeff King
                   ` (5 preceding siblings ...)
  2013-01-29  9:16 ` [PATCH 6/6] commit: look up commit info in metapack Jeff King
@ 2013-01-30  3:31 ` Duy Nguyen
  2013-01-30  7:18   ` Jeff King
  2013-01-31 17:14 ` Shawn Pearce
  7 siblings, 1 reply; 43+ messages in thread
From: Duy Nguyen @ 2013-01-30  3:31 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Shawn O. Pearce

On Tue, Jan 29, 2013 at 4:14 PM, Jeff King <peff@peff.net> wrote:
> The timings from this one are roughly similar to what I posted earlier.
> Unlike the earlier version, this one keeps the data for a single commit
> together for better cache locality (though I don't think it made a big
> difference in my tests, since my cold-cache timing test ends up touching
> every commit anyway).  The short of it is that for an extra 31M of disk
> space (~4%), I get a warm-cache speedup for "git rev-list --all" of
> ~4.2s to ~0.66s.

Some data point on caching 1-parent vs 2-parent commits on webkit
repo, 26k commits. With your changes (caching 2-parent commits), the
.commits file takes 2241600 bytes. "rev-list --all --quiet":

0.06user 0.00system 0:00.08elapsed 95%CPU (0avgtext+0avgdata 26288maxresident)k
0inputs+0outputs (0major+2094minor)pagefaults 0swaps

With caching 1-parent commits only, the .commits file takes 1707900
bytes (30% less), the same rev-list command:

0.07user 0.00system 0:00.07elapsed 96%CPU (0avgtext+0avgdata 24144maxresident)k
0inputs+0outputs (0major+1960minor)pagefaults 0swaps

Compared to the timing without caching at all:

0.72user 0.02system 0:00.76elapsed 98%CPU (0avgtext+0avgdata 108976maxresident)k
0inputs+0outputs (0major+7272minor)pagefaults 0swaps

The performance loss in 1-parent case is not significant while disk
saving is (although it'll be less impressive after you do Shawn's
suggestion not storing SHA-1 directly)
-- 
Duy

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/6] introduce a commit metapack
  2013-01-29  9:16 ` [PATCH 4/6] introduce a commit metapack Jeff King
  2013-01-29 10:24   ` Michael Haggerty
  2013-01-29 17:38   ` Junio C Hamano
@ 2013-01-30  3:36   ` Duy Nguyen
  2013-01-30  7:12     ` Jeff King
  2013-01-30 13:56   ` Duy Nguyen
  3 siblings, 1 reply; 43+ messages in thread
From: Duy Nguyen @ 2013-01-30  3:36 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Shawn O. Pearce

On Tue, Jan 29, 2013 at 4:16 PM, Jeff King <peff@peff.net> wrote:
> +int commit_metapack(unsigned char *sha1,
> +                   uint32_t *timestamp,
> +                   unsigned char **tree,
> +                   unsigned char **parent1,
> +                   unsigned char **parent2)
> +{

Nit picking. tree, parent1 and parent2 can/should be "const unsigned char **".
-- 
Duy

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/6] strbuf: add string-chomping functions
  2013-01-29 11:10     ` Jeff King
@ 2013-01-30  5:00       ` Michael Haggerty
  0 siblings, 0 replies; 43+ messages in thread
From: Michael Haggerty @ 2013-01-30  5:00 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Duy Nguyen, Shawn O. Pearce

On 01/29/2013 12:10 PM, Jeff King wrote:
> On Tue, Jan 29, 2013 at 11:15:34AM +0100, Michael Haggerty wrote:
>> Please document the new functions in
>> Documentation/technical/api-strbuf.txt.  Personally I would also
>> advocate a "docstring" in the header file, but obviously that preference
>> is the exception rather than the rule in the git project :-(
> 
> Will do. I need to document the metapack functions, too, so I was thinking
> about experimenting with some inline documentation systems.

That would be great; it would make future API documentation much easier
and therefore (hopefully) more common.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/6] introduce pack metadata cache files
  2013-01-29 17:35   ` Junio C Hamano
@ 2013-01-30  6:47     ` Jeff King
  0 siblings, 0 replies; 43+ messages in thread
From: Jeff King @ 2013-01-30  6:47 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Duy Nguyen, Shawn O. Pearce

On Tue, Jan 29, 2013 at 09:35:12AM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > +static void write_meta_header(struct metapack_writer *mw, const char *id,
> > +			      uint32_t version)
> > +{
> > +	version = htonl(version);
> > +
> > +	sha1write(mw->out, "META", 4);
> > +	sha1write(mw->out, "\0\0\0\1", 4);
> > +	sha1write(mw->out, mw->pack->sha1, 20);
> > +	sha1write(mw->out, id, 4);
> > +	sha1write(mw->out, &version, 4);
> > +}
> 
> It seems that you are very close to actually having a plumbing that
> could also do the pack .idx files.  Until/unless that can be done, I
> am not sure how much benefit we would be getting from a file format
> that records a subtype "id" and a generic "META" type, instead of
> just a single "id" as the type ehader.  But it is OK to use 8 extra
> bytes if we can potentially gain something later.

Yeah, I considered going that route. I had initially envisioned having a
generic META file type that provided some services (like fixed-size
records), and then having individual subtypes below that. But as I
simplified the design, the META format became pretty much pointless. I
left it in as the 8 bytes are not really a big problem, and it means we
can treat metapacks generically in some cases without necessarily
knowing what is in them. But I don't have a specific use case in mind,
so perhaps it is just useless and confusing. I don't mind simplifying.

> Shouldn't id be validated with at least something like
> 
> 	if (strlen(id) < 3)
> 		die("Bad id: %s", id);
> 
> to catch a call
> 
> 	write_meta_header(&mw, "me", 47);
> 
> that will stuff 'm', 'e', NUL and the garbage the compiler/linker
> combo has placed after that constant string in the 4-byte id field?

Yes, the id does need to be at least 4 bytes. Since the id is intended
to be a static string, I had planned to just document the requirement in
the API documentation. I don't mind putting in a run-time check. I had
originally had a separate "id" parameter that could be "char id[4]", but
found that it was just redundant with the "name" parameter: you ended up
passing ("commit", "CMIT") or similar.

> > +	strbuf_addstr(&path, pack_idx);
> > +	strbuf_chompstr(&path, ".idx");
> > +	strbuf_addch(&path, '.');
> > +	strbuf_addstr(&path, name);
> 
> Your chompstr() does not even validate if the given name ends with
> ".idx",

Yeah, my intent was that it would be liberal in its input (i.e., take
just "pack-*"). E.g., you can run "git metapack pack/pack-XXXX".

> so this sounds like a glorified way to say
> 
> 	strbuf_splice(&path, path->len - strlen("idx"), strlen("idx"),
> 			 name, strlen(name));
> 
> to me.

Yup, though my version handles edge cases by not chomping (e.g., what
does splice do when path->len is less than 3?).

> > +void metapack_writer_finish(struct metapack_writer *mw)
> > +{
> > +	const char *tmp = mw->out->name;
> > +
> > +	sha1close(mw->out, NULL, CSUM_FSYNC);
> > +	if (rename(tmp, mw->path))
> > +		die_errno("unable to rename temporary metapack file");
> 
> Who is responsible for running adjust_shared_perm()?  The caller, or
> this function?

I didn't think about it at all, but it seems pretty obvious to me that
this function should do so. Thanks for pointing it out.

-Peff

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/6] introduce pack metadata cache files
  2013-01-30  1:30   ` Duy Nguyen
@ 2013-01-30  6:50     ` Jeff King
  0 siblings, 0 replies; 43+ messages in thread
From: Jeff King @ 2013-01-30  6:50 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: git, Shawn O. Pearce

On Wed, Jan 30, 2013 at 08:30:57AM +0700, Nguyen Thai Ngoc Duy wrote:

> On Tue, Jan 29, 2013 at 4:15 PM, Jeff King <peff@peff.net> wrote:
> > +static void write_meta_header(struct metapack_writer *mw, const char *id,
> > +                             uint32_t version)
> > +{
> > +       version = htonl(version);
> > +
> > +       sha1write(mw->out, "META", 4);
> > +       sha1write(mw->out, "\0\0\0\1", 4);
> > +       sha1write(mw->out, mw->pack->sha1, 20);
> > +       sha1write(mw->out, id, 4);
> > +       sha1write(mw->out, &version, 4);
> > +}
> 
> Because you go with all-commit-info-in-one-file, perhaps we should
> have an uint32_t bitmap to describe what info this cache contains?  So
> far we need 4 bits for date, tree, 1st and 2nd parents (yes, I still
> want to check if storing 1-parent commits only gains us anything on
> some other repos). When commit count comes, it can take the fifth bit.
> Reachability bitmap offsets can take the sixth bit, if we just append
> the bitmaps at the end of the same file.

I thought about having some programmatic self-describing header like
that. But it makes the implementation much harder to verify, and it is
not like there is much point in picking and choosing those bits. My plan
was to do a combination of:

  1. Put truly optional bits into a separate metapack (e.g.,
     reachability bitmaps).

  2. When something becomes obviously obsolete (e.g., we move to
     generation numbers instead of timestamps in commits), bump the
     version number.

-Peff

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/6] introduce a commit metapack
  2013-01-29 17:38   ` Junio C Hamano
  2013-01-29 18:08     ` Junio C Hamano
@ 2013-01-30  7:07     ` Jeff King
  1 sibling, 0 replies; 43+ messages in thread
From: Jeff King @ 2013-01-30  7:07 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Duy Nguyen, Shawn O. Pearce

On Tue, Jan 29, 2013 at 09:38:10AM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > +int commit_metapack(unsigned char *sha1,
> > +		    uint32_t *timestamp,
> > +		    unsigned char **tree,
> > +		    unsigned char **parent1,
> > +		    unsigned char **parent2)
> > +{
> > +	struct commit_metapack *p;
> > +
> > +	prepare_commit_metapacks();
> > +	for (p = commit_metapacks; p; p = p->next) {
> > +		unsigned char *data;
> > +		int pos = sha1_entry_pos(p->index, 20, 0, 0, p->nr, p->nr, sha1);
> 
> This is a tangent, but isn't it about time to rip out the check for
> GIT_USE_LOOKUP in find_pack_entry_one(), I wonder.

Rip it out and always use sha1_entry_pos, or rip it out and never use
it? I have never been able to measure any speedup from it; I just use it
here to avoid rewriting the binary search myself. I do not think there
is any disadvantage to using it, so I'd be in favor of just
standardizing on it for any sha1 binary searches.

> These cached properties of a single commit will not change no matter
> which pack it appears in, and it feels logically wrong, especially
> when you record these object names in the full SHA-1 form, to tie a
> "commit metapack" to a pack.  Logically there needs only one commit
> metapack that describes all the commits known to the repository when
> the metapack was created.

True. I had originally envisioned it as tied to the packfile to help
manage the lifecycle. You know you need to generate a metapack for some
objects when they get packed, and you know you can throw it away when
the associated pack goes away. And it means you can verify the integrity
of the metapack by matching it to a particular packfile.

However, if you just have a commit cache, you can always blow the whole
thing away and regenerate it for all objects in the repo. It does not
have tied to a pack (you do end up doing some extra work at regeneration
when you are not doing a full repack, but it is really not enough to
worry about).

> In order to reduce the disk footprint and I/O cost, the future
> direction for this mechanism may want to point into an existing
> store of SHA-1 hashes with a shorter file offset, and the .idx file
> could be such a store, and in order to move in that direction, you
> cannot avoid tying a metapack to a pack.

Yes. That was not one of my original goals for the commit cache, but I
do think it's a useful direction to go in. And reachability bitmaps
(which would eventually be their own metapacks) would generally want to
be per-pack, too, for the same reason.

> > +	*tail = &commit_list_insert(c, *tail)->next;
> > +}
> 
> It feels somewhat wasteful to:
> 
>  - use commit_list for this, rather than an array of commit
>    objects.  If you have a rough estimate of the number of commits
>    in the pack, you could just preallocate a single array and use
>    ALLOC_GROW() on it, no?

We don't have a rough estimate, but yes, we could just use an array and
trust ALLOC_GROW to be reasonable. The use of commit_list did not have a
particular reason other than that it was simple (an array means stuffing
the array pointer and the nr and alloc counts into a struct to get to
the callback). The performance of writing the cache is dominated by
accessing the objects themselves, anyway. I don't mind changing it,
though, if you think it's clearer as an array.

>  - iterate over the .idx file and run sha1_object_info() and
>    parse_commit() on many objects in the SHA-1 order.  Iterating in
>    the way builtin/pack-objects.c::get_object_details() does avoids
>    jumping around in existing packfiles, which may be more
>    efficient, no?

Probably, though generating the complete commit cache for linux-2.6.git
takes only about 7 seconds on my machine. I wasn't too concerned with
optimizing generation, since it will typically be dwarfed by repacking
costs. It might make more of a difference for doing a metapack on all
objects (e.g., reachability bitmaps).

The reason I do it in .idx order is that it feeds the callback in sorted
order, so a writer could in theory just use that output as-is (and my
initial version did that, as it wrote separate metapacks for each
element). This version now puts all data elements together (for cache
locality), and builds the in-memory list so we do not have to re-do
sha1_object_info repeatedly. So it could very easily just generate the
list in pack order and sort it at the end.

I'll look into that for the next version.

-Peff

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/6] introduce a commit metapack
  2013-01-29 18:08     ` Junio C Hamano
@ 2013-01-30  7:12       ` Jeff King
  2013-01-30  7:17         ` Junio C Hamano
  2013-01-30 15:56         ` Junio C Hamano
  0 siblings, 2 replies; 43+ messages in thread
From: Jeff King @ 2013-01-30  7:12 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Duy Nguyen, Shawn O. Pearce

On Tue, Jan 29, 2013 at 10:08:08AM -0800, Junio C Hamano wrote:

> > In order to reduce the disk footprint and I/O cost, the future
> > direction for this mechanism may want to point into an existing
> > store of SHA-1 hashes with a shorter file offset, and the .idx file
> > could be such a store, and in order to move in that direction, you
> > cannot avoid tying a metapack to a pack.
> 
> Have you considered if we can extend the .idx format version 2
> without actually having to bump the version number?  My quick
> reading of check_packed_git_idx() tells me that we have a gap after
> the "large offset table" that we can place extensions, but I may be
> mistaken.  If we indeed can, then perhaps adding a series of
> 
> 	4-byte "id" header
>         4-byte extension length (or 8-byte)
>         ... N-byte worth of extension data ...
> 
> followed by
> 
> 	20-byte SHA-1 checksum of all the extension sections
> 	8-byte file offset to the first extension section

I considered it, but didn't look into it closely. My feeling was that it
added extra complexity, without adding any real advantage that a
separate file would not.

>From this:

> Then it will be very natural for the extension data that store the
> commit metainfo to name objects in the pack the .idx file describes
> by the offset in the SHA-1 table.

I guess your argument is that putting it all in the same file makes it
more natural for there to be a data dependency.

> As we always say, .idx is a local cache and bumping its version is
> not a huge headache compared to other longer term storage items.

True, but it is even less headache if the file is totally separate and
optional.

-Peff

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/6] introduce a commit metapack
  2013-01-30  3:36   ` Duy Nguyen
@ 2013-01-30  7:12     ` Jeff King
  0 siblings, 0 replies; 43+ messages in thread
From: Jeff King @ 2013-01-30  7:12 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: git, Shawn O. Pearce

On Wed, Jan 30, 2013 at 10:36:10AM +0700, Nguyen Thai Ngoc Duy wrote:

> On Tue, Jan 29, 2013 at 4:16 PM, Jeff King <peff@peff.net> wrote:
> > +int commit_metapack(unsigned char *sha1,
> > +                   uint32_t *timestamp,
> > +                   unsigned char **tree,
> > +                   unsigned char **parent1,
> > +                   unsigned char **parent2)
> > +{
> 
> Nit picking. tree, parent1 and parent2 can/should be "const unsigned char **".

Thanks, I'll fix it in the next version.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/6] introduce a commit metapack
  2013-01-30  7:12       ` Jeff King
@ 2013-01-30  7:17         ` Junio C Hamano
  2013-02-01  9:21           ` Jeff King
  2013-01-30 15:56         ` Junio C Hamano
  1 sibling, 1 reply; 43+ messages in thread
From: Junio C Hamano @ 2013-01-30  7:17 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Duy Nguyen, Shawn O. Pearce

> True, but it is even less headache if the file is totally separate and
> optional.

Once you start thinking about using an offset to some list of SHA-1,
perhaps?  A section inside the same file can never go out of sync.
Also a longer-term advantage is that you can teach index-pack to do
this.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH/RFC 0/6] commit caching
  2013-01-30  3:31 ` [PATCH/RFC 0/6] commit caching Duy Nguyen
@ 2013-01-30  7:18   ` Jeff King
  2013-01-30  8:32     ` Duy Nguyen
  0 siblings, 1 reply; 43+ messages in thread
From: Jeff King @ 2013-01-30  7:18 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: git, Shawn O. Pearce

On Wed, Jan 30, 2013 at 10:31:43AM +0700, Nguyen Thai Ngoc Duy wrote:

> On Tue, Jan 29, 2013 at 4:14 PM, Jeff King <peff@peff.net> wrote:
> > The timings from this one are roughly similar to what I posted earlier.
> > Unlike the earlier version, this one keeps the data for a single commit
> > together for better cache locality (though I don't think it made a big
> > difference in my tests, since my cold-cache timing test ends up touching
> > every commit anyway).  The short of it is that for an extra 31M of disk
> > space (~4%), I get a warm-cache speedup for "git rev-list --all" of
> > ~4.2s to ~0.66s.
> 
> Some data point on caching 1-parent vs 2-parent commits on webkit
> repo, 26k commits. With your changes (caching 2-parent commits), the
> .commits file takes 2241600 bytes. "rev-list --all --quiet":

Hmm. My webkit repo has zero merges in it (though it is the older
svn-based one). What percentage of the one you have are merges? How does
your 1-parent cache perform on something like git.git, where about 25%
of all commits are merges?

> The performance loss in 1-parent case is not significant while disk
> saving is (although it'll be less impressive after you do Shawn's
> suggestion not storing SHA-1 directly)

Yeah, I think moving to offsets instead of sha1s is going to be a big
enough win that it won't matter anymore.

-Peff

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH/RFC 0/6] commit caching
  2013-01-30  7:18   ` Jeff King
@ 2013-01-30  8:32     ` Duy Nguyen
  0 siblings, 0 replies; 43+ messages in thread
From: Duy Nguyen @ 2013-01-30  8:32 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Shawn O. Pearce

On Wed, Jan 30, 2013 at 2:18 PM, Jeff King <peff@peff.net> wrote:
> On Wed, Jan 30, 2013 at 10:31:43AM +0700, Nguyen Thai Ngoc Duy wrote:
>
>> On Tue, Jan 29, 2013 at 4:14 PM, Jeff King <peff@peff.net> wrote:
>> > The timings from this one are roughly similar to what I posted earlier.
>> > Unlike the earlier version, this one keeps the data for a single commit
>> > together for better cache locality (though I don't think it made a big
>> > difference in my tests, since my cold-cache timing test ends up touching
>> > every commit anyway).  The short of it is that for an extra 31M of disk
>> > space (~4%), I get a warm-cache speedup for "git rev-list --all" of
>> > ~4.2s to ~0.66s.
>>
>> Some data point on caching 1-parent vs 2-parent commits on webkit
>> repo, 26k commits. With your changes (caching 2-parent commits), the
>> .commits file takes 2241600 bytes. "rev-list --all --quiet":
>
> Hmm. My webkit repo has zero merges in it (though it is the older
> svn-based one). What percentage of the one you have are merges? How does
> your 1-parent cache perform on something like git.git, where about 25%
> of all commits are merges?

git.git performs worse with 1-parent cache. But the point is it should
be customizable.

>> The performance loss in 1-parent case is not significant while disk
>> saving is (although it'll be less impressive after you do Shawn's
>> suggestion not storing SHA-1 directly)
>
> Yeah, I think moving to offsets instead of sha1s is going to be a big
> enough win that it won't matter anymore.

Yeah, if we use uint32_t instead of sha-1, the cache is just about
400k 2 parents for webkit, 312k for 1 parent. The total size is so
small that reduction does not really matter anymore.
-- 
Duy

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/6] introduce a commit metapack
  2013-01-29  9:16 ` [PATCH 4/6] introduce a commit metapack Jeff King
                     ` (2 preceding siblings ...)
  2013-01-30  3:36   ` Duy Nguyen
@ 2013-01-30 13:56   ` Duy Nguyen
  2013-01-30 14:16     ` Duy Nguyen
  2013-02-01 10:00     ` Jeff King
  3 siblings, 2 replies; 43+ messages in thread
From: Duy Nguyen @ 2013-01-30 13:56 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Shawn O. Pearce

On Tue, Jan 29, 2013 at 04:16:11AM -0500, Jeff King wrote:
> When we are doing a commit traversal that does not need to
> look at the commit messages themselves (e.g., rev-list,
> merge-base, etc), we spend a lot of time accessing,
> decompressing, and parsing the commit objects just to find
> the parent and timestamp information. We can make a
> space-time tradeoff by caching that information on disk in a
> compact, uncompressed format.

And this is a (messy) patch on top that avoids storing SHA-1
directly. On my linux-2.6.git (575 MB pack, 73 MB index), .commits
file is 5.2 MB and 27 MB with and without my patch respectively. Nice
shrinkage.

However, performance seems to suffer too. Maybe I do more lookups than
necessary, I don't know. I should probably measure the cost of
revindex separately.

git rev-list --all --quiet on vanilla git:

real    0m3.645s
user    0m3.556s
sys     0m0.080s

commit cache without my patch:

real    0m0.723s
user    0m0.677s
sys     0m0.045s

and with my patch:

real    0m1.338s
user    0m1.259s
sys     0m0.075s


Another point, but not really important at this stage, I think we have
memory leak somewhere (lookup_commit??). It used up to 800 MB RES on
linux-2.6.git while generating the cache.

-- 8< --
diff --git a/cache.h b/cache.h
index 1f96f65..8048d5b 100644
--- a/cache.h
+++ b/cache.h
@@ -1069,6 +1069,7 @@ extern struct packed_git *add_packed_git(const char *, int, int);
 extern const unsigned char *nth_packed_object_sha1(struct packed_git *, uint32_t);
 extern off_t nth_packed_object_offset(const struct packed_git *, uint32_t);
 extern off_t find_pack_entry_one(const unsigned char *, struct packed_git *);
+extern int find_pack_entry_pos(const unsigned char *, struct packed_git *);
 extern int is_pack_valid(struct packed_git *);
 extern void *unpack_entry(struct packed_git *, off_t, enum object_type *, unsigned long *);
 extern unsigned long unpack_object_header_buffer(const unsigned char *buf, unsigned long len, enum object_type *type, unsigned long *sizep);
diff --git a/commit-metapack.c b/commit-metapack.c
index 2c19f48..55f7ea9 100644
--- a/commit-metapack.c
+++ b/commit-metapack.c
@@ -3,11 +3,12 @@
 #include "metapack.h"
 #include "commit.h"
 #include "sha1-lookup.h"
+#include "pack-revindex.h"
 
 struct commit_metapack {
 	struct metapack mp;
-	uint32_t nr;
-	unsigned char *index;
+	struct packed_git *pack;
+	uint32_t first, last;
 	unsigned char *data;
 	struct commit_metapack *next;
 };
@@ -16,7 +17,7 @@ static struct commit_metapack *commit_metapacks;
 static struct commit_metapack *alloc_commit_metapack(struct packed_git *pack)
 {
 	struct commit_metapack *it = xcalloc(1, sizeof(*it));
-	uint32_t version;
+	uint32_t version, nr;
 
 	if (metapack_init(&it->mp, pack, "commits", &version) < 0) {
 		free(it);
@@ -39,22 +40,25 @@ static struct commit_metapack *alloc_commit_metapack(struct packed_git *pack)
 		free(it);
 		return NULL;
 	}
-	memcpy(&it->nr, it->mp.data, 4);
-	it->nr = ntohl(it->nr);
+	memcpy(&it->first, it->mp.data, 4);
+	it->first = ntohl(it->first);
+	memcpy(&it->last, it->mp.data + 4, 4);
+	it->last = ntohl(it->last);
+	nr = it->last - it->first + 1;
+	it->pack = pack;
 
 	/*
-	 * We need 84 bytes for each entry: sha1(20), date(4), tree(20),
-	 * parents(40).
+	 * We need 16 bytes for each entry: date(4), tree index(4),
+	 * parent indexes(8).
 	 */
-	if (it->mp.len < (84 * it->nr + 4)) {
+	if (it->mp.len < (16 * nr + 8)) {
 		warning("commit metapack for '%s' is truncated", pack->pack_name);
 		metapack_close(&it->mp);
 		free(it);
 		return NULL;
 	}
 
-	it->index = it->mp.data + 4;
-	it->data = it->index + 20 * it->nr;
+	it->data = it->mp.data + 8;
 
 	return it;
 }
@@ -81,31 +85,61 @@ static void prepare_commit_metapacks(void)
 	initialized = 1;
 }
 
+static const unsigned char *idx_to_sha1(struct packed_git *p,
+					uint32_t nth)
+{
+	struct revindex_entry *revindex = get_revindex(p);
+	if (!revindex)
+		return NULL;
+	return nth_packed_object_sha1(p, revindex[nth].nr);
+}
+
 int commit_metapack(unsigned char *sha1,
 		    uint32_t *timestamp,
-		    unsigned char **tree,
-		    unsigned char **parent1,
-		    unsigned char **parent2)
+		    const unsigned char **tree,
+		    const unsigned char **parent1,
+		    const unsigned char **parent2)
 {
 	struct commit_metapack *p;
 
 	prepare_commit_metapacks();
 	for (p = commit_metapacks; p; p = p->next) {
 		unsigned char *data;
-		int pos = sha1_entry_pos(p->index, 20, 0, 0, p->nr, p->nr, sha1);
-		if (pos < 0)
+		uint32_t p1, p2;
+		struct revindex_entry *re, *base;
+		off_t off;
+		uint32_t pos;
+
+		base = get_revindex(p->pack);
+		off = find_pack_entry_one(sha1, p->pack);
+		if (!off)
+			continue;
+		re = find_pack_revindex(p->pack, off);
+		if (!re)
+			continue;
+		pos = re - base;
+		if (pos < p->first || pos > p->last)
 			continue;
 
 		/* timestamp(4) + tree(20) + parents(40) */
-		data = p->data + 64 * pos;
+		data = p->data + 16 * (pos - p->first);
 		*timestamp = *(uint32_t *)data;
 		*timestamp = ntohl(*timestamp);
+		if (!*timestamp)
+			return -1;
 		data += 4;
-		*tree = data;
-		data += 20;
-		*parent1 = data;
-		data += 20;
-		*parent2 = data;
+		*tree = idx_to_sha1(p->pack, ntohl(*(uint32_t*)data));
+		data += 4;
+		p1 = ntohl(*(uint32_t*)data);
+		*parent1 = idx_to_sha1(p->pack, p1);
+		data += 4;
+		p2 = ntohl(*(uint32_t*)data);
+		if (p1 == p2)
+			*parent2 = null_sha1;
+		else
+			*parent2 = idx_to_sha1(p->pack, p2);
+		if (!*tree || !*parent1 || !*parent2)
+			return -1;
 
 		return 0;
 	}
@@ -113,63 +147,114 @@ int commit_metapack(unsigned char *sha1,
 	return -1;
 }
 
-static void get_commits(struct metapack_writer *mw,
-			const unsigned char *sha1,
-			void *data)
+static int get_commits(struct metapack_writer *mw,
+		       uint32_t nth,
+		       int dry_run)
 {
-	struct commit_list ***tail = data;
+	const unsigned char *sha1 = nth_packed_object_sha1(mw->pack, nth);
 	enum object_type type = sha1_object_info(sha1, NULL);
 	struct commit *c;
+	struct revindex_entry *revindex, *ridx;
+	int pt, p1, p2 = -1;
 
-	if (type != OBJ_COMMIT)
-		return;
+	if (type != OBJ_COMMIT) {
+		if (dry_run)
+			return -1;
+		metapack_writer_add_uint32(mw, 0); /* date */
+		metapack_writer_add_uint32(mw, 0); /* tree */
+		metapack_writer_add_uint32(mw, 0); /* 1st parent */
+		metapack_writer_add_uint32(mw, 0); /* 2nd tree */
+		return 0;
+	}
 
 	c = lookup_commit(sha1);
 	if (!c || parse_commit(c))
 		die("unable to read commit %s", sha1_to_hex(sha1));
 
+	if (c->buffer) {
+		free(c->buffer);
+		c->buffer = NULL;
+	}
+
 	/*
 	 * Our fixed-size parent list cannot represent root commits, nor
 	 * octopus merges. Just skip those commits, as we can fallback
 	 * in those rare cases to reading the actual commit object.
 	 */
 	if (!c->parents ||
-	    (c->parents && c->parents->next && c->parents->next->next))
-		return;
+	    (c->parents && c->parents->next && c->parents->next->next) ||
+	    /* edge commits are out too */
+	    (pt = find_pack_entry_pos(c->tree->object.sha1, mw->pack)) == -1 ||
+	    (p1 = find_pack_entry_pos(c->parents->item->object.sha1, mw->pack)) == -1 ||
+	    (c->parents->next &&
+	     (p2 = find_pack_entry_pos(c->parents->next->item->object.sha1, mw->pack)) == -1) ||
+	    /*
+	     * we set the 2nd parent the same as 1st parent as an
+	     * indication that 2nd parent does not exist. Normal
+	     * commits should never have two same parents, but just in
+	     * case..
+	     */
+	    p1 == p2 ||
+	    /* zero date is reserved to say this is not a valid entry */
+	    c->date == 0) {
+		if (dry_run)
+			return -1;
+		metapack_writer_add_uint32(mw, 0); /* date */
+		metapack_writer_add_uint32(mw, 0); /* tree */
+		metapack_writer_add_uint32(mw, 0); /* 1st parent */
+		metapack_writer_add_uint32(mw, 0); /* 2nd tree */
+		return 0;
+	}
+
+	if (dry_run)
+		return 0;
+
+	revindex = get_revindex(mw->pack);
 
-	*tail = &commit_list_insert(c, *tail)->next;
+	metapack_writer_add_uint32(mw, c->date);
+	ridx = find_pack_revindex(mw->pack,
+				  nth_packed_object_offset(mw->pack, pt));
+	metapack_writer_add_uint32(mw, ridx - revindex);
+	ridx = find_pack_revindex(mw->pack,
+				  nth_packed_object_offset(mw->pack, p1));
+	metapack_writer_add_uint32(mw, ridx - revindex);
+	if (p2 != -1)
+		ridx = find_pack_revindex(mw->pack,
+					  nth_packed_object_offset(mw->pack, p2));
+	metapack_writer_add_uint32(mw, ridx - revindex);
+	return 0;
 }
 
 void commit_metapack_write(const char *idx)
 {
 	struct metapack_writer mw;
-	struct commit_list *commits = NULL, *p;
-	struct commit_list **tail = &commits;
-	uint32_t nr = 0;
+	uint32_t i, first = 0xffffffff, last = 0;
+	struct revindex_entry *revidx;
 
 	metapack_writer_init(&mw, idx, "commits", 1);
 
-	/* Figure out how many eligible commits we've got in this pack. */
-	metapack_writer_foreach(&mw, get_commits, &tail);
-	for (p = commits; p; p = p->next)
-		nr++;
-	metapack_writer_add_uint32(&mw, nr);
+	packed_git = mw.pack;
+	revidx = get_revindex(mw.pack);
 
-	/* Then write an index of commit sha1s */
-	for (p = commits; p; p = p->next)
-		metapack_writer_add(&mw, p->item->object.sha1, 20);
+	/*
+	 * Figure out how many eligible commits we've got in this pack.
+	 */
+	for (i = 0; i < mw.pack->num_objects; i++) {
+		int ret = get_commits(&mw, revidx[i].nr, 1);
+		if (ret == -1)	/* not cached */
+			continue;
+		if (i < first)
+			first = i;
+		if (i > last)
+			last = i;
+	}
+
+	metapack_writer_add_uint32(&mw, first);
+	metapack_writer_add_uint32(&mw, last);
 
 	/* Followed by the actual date/tree/parents data */
-	for (p = commits; p; p = p->next) {
-		struct commit *c = p->item;
-		metapack_writer_add_uint32(&mw, c->date);
-		metapack_writer_add(&mw, c->tree->object.sha1, 20);
-		metapack_writer_add(&mw, c->parents->item->object.sha1, 20);
-		metapack_writer_add(&mw,
-				    c->parents->next ?
-				    c->parents->next->item->object.sha1 :
-				    null_sha1, 20);
-	}
+	for (i = first; i <= last; i++)
+		get_commits(&mw, revidx[i].nr, 0);
 
 	metapack_writer_finish(&mw);
 }
diff --git a/commit-metapack.h b/commit-metapack.h
index 4684573..caf85be 100644
--- a/commit-metapack.h
+++ b/commit-metapack.h
@@ -3,9 +3,9 @@
 
 int commit_metapack(unsigned char *sha1,
 		    uint32_t *timestamp,
-		    unsigned char **tree,
-		    unsigned char **parent1,
-		    unsigned char **parent2);
+		    const unsigned char **tree,
+		    const unsigned char **parent1,
+		    const unsigned char **parent2);
 
 void commit_metapack_write(const char *idx_file);
 
diff --git a/pack-revindex.c b/pack-revindex.c
index 77a0465..d58dd02 100644
--- a/pack-revindex.c
+++ b/pack-revindex.c
@@ -111,12 +111,10 @@ static void create_pack_revindex(struct pack_revindex *rix)
 	qsort(rix->revindex, num_ent, sizeof(*rix->revindex), cmp_offset);
 }
 
-struct revindex_entry *find_pack_revindex(struct packed_git *p, off_t ofs)
+struct revindex_entry *get_revindex(struct packed_git *p)
 {
 	int num;
-	int lo, hi;
 	struct pack_revindex *rix;
-	struct revindex_entry *revindex;
 
 	if (!pack_revindex_hashsz)
 		init_pack_revindex();
@@ -127,7 +125,13 @@ struct revindex_entry *find_pack_revindex(struct packed_git *p, off_t ofs)
 	rix = &pack_revindex[num];
 	if (!rix->revindex)
 		create_pack_revindex(rix);
-	revindex = rix->revindex;
+	return rix->revindex;
+}
+
+struct revindex_entry *find_pack_revindex(struct packed_git *p, off_t ofs)
+{
+	int lo, hi;
+	struct revindex_entry *revindex = get_revindex(p);
 
 	lo = 0;
 	hi = p->num_objects + 1;
diff --git a/pack-revindex.h b/pack-revindex.h
index 8d5027a..cea85db 100644
--- a/pack-revindex.h
+++ b/pack-revindex.h
@@ -6,6 +6,7 @@ struct revindex_entry {
 	unsigned int nr;
 };
 
+struct revindex_entry *get_revindex(struct packed_git *p);
 struct revindex_entry *find_pack_revindex(struct packed_git *p, off_t ofs);
 void discard_revindex(void);
 
diff --git a/sha1_file.c b/sha1_file.c
index 40b2329..1acab8c 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1978,8 +1978,8 @@ off_t nth_packed_object_offset(const struct packed_git *p, uint32_t n)
 	}
 }
 
-off_t find_pack_entry_one(const unsigned char *sha1,
-				  struct packed_git *p)
+int find_pack_entry_pos(const unsigned char *sha1,
+			struct packed_git *p)
 {
 	const uint32_t *level1_ofs = p->index_data;
 	const unsigned char *index = p->index_data;
@@ -1992,7 +1992,7 @@ off_t find_pack_entry_one(const unsigned char *sha1,
 
 	if (!index) {
 		if (open_pack_index(p))
-			return 0;
+			return -1;
 		level1_ofs = p->index_data;
 		index = p->index_data;
 	}
@@ -2019,9 +2019,7 @@ off_t find_pack_entry_one(const unsigned char *sha1,
 	if (use_lookup) {
 		int pos = sha1_entry_pos(index, stride, 0,
 					 lo, hi, p->num_objects, sha1);
-		if (pos < 0)
-			return 0;
-		return nth_packed_object_offset(p, pos);
+		return pos;
 	}
 
 	do {
@@ -2032,15 +2030,24 @@ off_t find_pack_entry_one(const unsigned char *sha1,
 			printf("lo %u hi %u rg %u mi %u\n",
 			       lo, hi, hi - lo, mi);
 		if (!cmp)
-			return nth_packed_object_offset(p, mi);
+			return mi;
 		if (cmp > 0)
 			hi = mi;
 		else
 			lo = mi+1;
 	} while (lo < hi);
-	return 0;
+	return -1;
 }
 
+off_t find_pack_entry_one(const unsigned char *sha1,
+				  struct packed_git *p)
+{
+	int pos = find_pack_entry_pos(sha1, p);
+	if (pos < 0)
+		return 0;
+	else
+		return nth_packed_object_offset(p, pos);
+}
 int is_pack_valid(struct packed_git *p)
 {
 	/* An already open pack is known to be valid. */
-- 
1.8.1.1.459.g5970e58

-- 8< --

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/6] introduce a commit metapack
  2013-01-30 13:56   ` Duy Nguyen
@ 2013-01-30 14:16     ` Duy Nguyen
  2013-01-31 11:06       ` Duy Nguyen
  2013-02-01 10:00     ` Jeff King
  1 sibling, 1 reply; 43+ messages in thread
From: Duy Nguyen @ 2013-01-30 14:16 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Shawn O. Pearce

On Wed, Jan 30, 2013 at 8:56 PM, Duy Nguyen <pclouds@gmail.com> wrote:
> However, performance seems to suffer too. Maybe I do more lookups than
> necessary, I don't know.

Yes, I should have stored the position in the sha-1 <-> offset map
instead of the position of the object in .pack file. Even so,
performance does not improve.

> I should probably measure the cost of revindex separately.

And the cost of create_pack_revindex() is 0.6 sec :-(

Perhaps we could store abbrev sha-1 instead of full sha-1. Nice
space/time trade-off.
-- 
Duy

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/6] introduce a commit metapack
  2013-01-30  7:12       ` Jeff King
  2013-01-30  7:17         ` Junio C Hamano
@ 2013-01-30 15:56         ` Junio C Hamano
  2013-01-31 17:03           ` Shawn Pearce
  1 sibling, 1 reply; 43+ messages in thread
From: Junio C Hamano @ 2013-01-30 15:56 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Duy Nguyen, Shawn O. Pearce

Jeff King <peff@peff.net> writes:

>>From this:
>
>> Then it will be very natural for the extension data that store the
>> commit metainfo to name objects in the pack the .idx file describes
>> by the offset in the SHA-1 table.
>
> I guess your argument is that putting it all in the same file makes it
> more natural for there to be a data dependency.

It is more about the "I am torn on this one" I mentioned earlier.

It would be more "logical" if this weren't tied to a particular
pack, as the properties of a commit you record in this series do not
depend on which pack the commit is in, and such a repository-global
file by definition cannot be inside anybody's .idx.

But if we split the information into separate pieces and store one
piece per .idx for implementation reasons, it is crazy not to at
least consider it a longer term goal to put it inside .idx file.

Of course, it is more convenient to store this kind of things in a
separate file while experimenting and improving the mechanism, but I
do not think we want to see each packfile in a repository comes with
47 auxiliary files with different suffixes 5 years down the road.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/6] introduce a commit metapack
  2013-01-30 14:16     ` Duy Nguyen
@ 2013-01-31 11:06       ` Duy Nguyen
  2013-02-01 10:15         ` Jeff King
                           ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Duy Nguyen @ 2013-01-31 11:06 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Shawn O. Pearce

On Wed, Jan 30, 2013 at 09:16:29PM +0700, Duy Nguyen wrote:
> Perhaps we could store abbrev sha-1 instead of full sha-1. Nice
> space/time trade-off.

Following the on-disk format experiment yesterday, I changed the
format to:

 - a list a _short_ SHA-1 of cached commits
 - a list of cache entries, each (5 uint32_t) consists of:
   - uint32_t for the index in .idx sha-1 table to get full SHA-1 of
     the commit
   - uint32_t for timestamp
   - uint32_t for tree, 1st and 2nd parents for the index in .idx
     table

The length of SHA-1 is chosen to be able to unambiguously identify any
cached commits. Full SHA-1 check is done after to catch false
positives. For linux-2.6, SHA-1 length is 6 bytes, git and many
moderate-sized projects are 4 bytes. So it's 26 bytes per commit for
linux-2.6.git, or 8MB .commits file. Not as good as revindex approach
(5.5MB) but way better than the current one (27MB). And still good
enough for caching 2 parents unconditionally.

Performance seems to improve a tiny bit, probably because of more
compact search space: 0.6s with my patch vs 0.7s without (vs 3.4s
without cache).

-- 8< --
diff --git a/cache.h b/cache.h
index 1f96f65..8048d5b 100644
--- a/cache.h
+++ b/cache.h
@@ -1069,6 +1069,7 @@ extern struct packed_git *add_packed_git(const char *, int, int);
 extern const unsigned char *nth_packed_object_sha1(struct packed_git *, uint32_t);
 extern off_t nth_packed_object_offset(const struct packed_git *, uint32_t);
 extern off_t find_pack_entry_one(const unsigned char *, struct packed_git *);
+extern int find_pack_entry_pos(const unsigned char *, struct packed_git *);
 extern int is_pack_valid(struct packed_git *);
 extern void *unpack_entry(struct packed_git *, off_t, enum object_type *, unsigned long *);
 extern unsigned long unpack_object_header_buffer(const unsigned char *buf, unsigned long len, enum object_type *type, unsigned long *sizep);
diff --git a/commit-metapack.c b/commit-metapack.c
index 2c19f48..c984b8e 100644
--- a/commit-metapack.c
+++ b/commit-metapack.c
@@ -4,11 +4,21 @@
 #include "commit.h"
 #include "sha1-lookup.h"
 
+struct commit_entry {
+	uint32_t commit;	/* nth_packed_object_sha1 to get own SHA-1 */
+	uint32_t timestamp;
+	uint32_t tree;		/* nth_packed_object_sha1 to get tree SHA-1 */
+	uint32_t parent1; /* nth_packed_object_sha1 to get 1st parent SHA-1 */
+	uint32_t parent2; /* nth_packed_object_sha1 to get 2nd parent SHA-1 */
+};
+
 struct commit_metapack {
 	struct metapack mp;
 	uint32_t nr;
+	uint32_t abbrev_len;
+	struct packed_git *pack;
 	unsigned char *index;
-	unsigned char *data;
+	struct commit_entry *data;
 	struct commit_metapack *next;
 };
 static struct commit_metapack *commit_metapacks;
@@ -41,20 +51,23 @@ static struct commit_metapack *alloc_commit_metapack(struct packed_git *pack)
 	}
 	memcpy(&it->nr, it->mp.data, 4);
 	it->nr = ntohl(it->nr);
+	memcpy(&it->abbrev_len, it->mp.data + 4, 4);
+	it->abbrev_len = ntohl(it->abbrev_len);
+	it->pack = pack;
 
 	/*
-	 * We need 84 bytes for each entry: sha1(20), date(4), tree(20),
-	 * parents(40).
+	 * We need 20+abbrev_len bytes for each entry: abbrev sha-1,
+	 * date(4), tree index(4), parent indexes(8).
 	 */
-	if (it->mp.len < (84 * it->nr + 4)) {
+	if (it->mp.len < ((sizeof(*it->data) + it->abbrev_len) * it->nr + 8)) {
 		warning("commit metapack for '%s' is truncated", pack->pack_name);
 		metapack_close(&it->mp);
 		free(it);
 		return NULL;
 	}
 
-	it->index = it->mp.data + 4;
-	it->data = it->index + 20 * it->nr;
+	it->index = it->mp.data + 8;
+	it->data = (struct commit_entry*)(it->index + it->abbrev_len * it->nr);
 
 	return it;
 }
@@ -83,29 +96,51 @@ static void prepare_commit_metapacks(void)
 
 int commit_metapack(unsigned char *sha1,
 		    uint32_t *timestamp,
-		    unsigned char **tree,
-		    unsigned char **parent1,
-		    unsigned char **parent2)
+		    const unsigned char **tree,
+		    const unsigned char **parent1,
+		    const unsigned char **parent2)
 {
 	struct commit_metapack *p;
 
 	prepare_commit_metapacks();
 	for (p = commit_metapacks; p; p = p->next) {
-		unsigned char *data;
-		int pos = sha1_entry_pos(p->index, 20, 0, 0, p->nr, p->nr, sha1);
+		struct commit_entry *data;
+		uint32_t p1, p2;
+		unsigned lo, hi, mi;
+		int pos;
+
+		/* sha1_entry_pos does not work with abbreviated sha-1 */
+		lo = 0;
+		hi = p->nr;
+		pos = -1;
+		do {
+			unsigned mi = (lo + hi) / 2;
+			int cmp = memcmp(p->index + mi * p->abbrev_len, sha1, p->abbrev_len);
+
+			if (!cmp) {
+				pos = mi;
+				break;
+			}
+			if (cmp > 0)
+				hi = mi;
+			else
+				lo = mi+1;
+		} while (lo < hi);
 		if (pos < 0)
 			continue;
 
-		/* timestamp(4) + tree(20) + parents(40) */
-		data = p->data + 64 * pos;
-		*timestamp = *(uint32_t *)data;
-		*timestamp = ntohl(*timestamp);
-		data += 4;
-		*tree = data;
-		data += 20;
-		*parent1 = data;
-		data += 20;
-		*parent2 = data;
+		data = p->data + pos;
+
+		/* full sha-1 check again */
+		if (hashcmp(nth_packed_object_sha1(p->pack, ntohl(data->commit)), sha1))
+			continue;
+
+		*timestamp = ntohl(data->timestamp);
+		*tree = nth_packed_object_sha1(p->pack, ntohl(data->tree));
+		p1 = ntohl(data->parent1);
+		*parent1 = nth_packed_object_sha1(p->pack, p1);
+		p2 = ntohl(data->parent2);
+		*parent2 = p1 == p2 ? null_sha1 : nth_packed_object_sha1(p->pack, p2);
 
 		return 0;
 	}
@@ -113,13 +148,20 @@ int commit_metapack(unsigned char *sha1,
 	return -1;
 }
 
+struct write_cb {
+	struct commit_list **tail;
+	int abbrev_len;
+	const unsigned char *last_sha1;
+};
+
 static void get_commits(struct metapack_writer *mw,
 			const unsigned char *sha1,
 			void *data)
 {
-	struct commit_list ***tail = data;
+	struct write_cb *write_cb = (struct write_cb *)data;
 	enum object_type type = sha1_object_info(sha1, NULL);
 	struct commit *c;
+	int p1, p2;
 
 	if (type != OBJ_COMMIT)
 		return;
@@ -128,47 +170,95 @@ static void get_commits(struct metapack_writer *mw,
 	if (!c || parse_commit(c))
 		die("unable to read commit %s", sha1_to_hex(sha1));
 
+	if (c->buffer) {
+		free(c->buffer);
+		c->buffer = NULL;
+	}
+
 	/*
 	 * Our fixed-size parent list cannot represent root commits, nor
 	 * octopus merges. Just skip those commits, as we can fallback
 	 * in those rare cases to reading the actual commit object.
 	 */
 	if (!c->parents ||
-	    (c->parents && c->parents->next && c->parents->next->next))
+	    (c->parents && c->parents->next && c->parents->next->next) ||
+	    /* edge commits are out too */
+	    find_pack_entry_pos(c->tree->object.sha1, mw->pack) == -1 ||
+	    (p1 = find_pack_entry_pos(c->parents->item->object.sha1, mw->pack)) == -1 ||
+	    (c->parents->next &&
+	     (p2 = find_pack_entry_pos(c->parents->next->item->object.sha1, mw->pack)) == -1) ||
+	    /*
+	     * we set the 2nd parent the same as 1st parent as an
+	     * indication that 2nd parent does not exist. Normal
+	     * commits should never have two same parents, but just in
+	     * case..
+	     */
+	    p1 == p2)
 		return;
 
-	*tail = &commit_list_insert(c, *tail)->next;
+	/*
+	 * Make sure we store the abbr sha-1 long enough to
+	 * unambiguously identify any cached commits in the pack.
+	 */
+	while (write_cb->abbrev_len < 20 &&
+	       write_cb->last_sha1 &&
+	       !memcmp(write_cb->last_sha1, sha1, write_cb->abbrev_len))
+		write_cb->abbrev_len++;
+	/*
+	 * A bit sensitive to metapack_writer_foreach. "sha1" must not
+	 * be changed even after this function exits.
+	 */
+	write_cb->last_sha1 = sha1;
+
+	write_cb->tail = &commit_list_insert(c, write_cb->tail)->next;
 }
 
 void commit_metapack_write(const char *idx)
 {
 	struct metapack_writer mw;
 	struct commit_list *commits = NULL, *p;
-	struct commit_list **tail = &commits;
+	struct write_cb write_cb;
 	uint32_t nr = 0;
 
 	metapack_writer_init(&mw, idx, "commits", 1);
 
+	write_cb.tail = &commits;
+	write_cb.abbrev_len = 1;
+	write_cb.last_sha1 = NULL;
+
 	/* Figure out how many eligible commits we've got in this pack. */
-	metapack_writer_foreach(&mw, get_commits, &tail);
+	metapack_writer_foreach(&mw, get_commits, &write_cb);
 	for (p = commits; p; p = p->next)
 		nr++;
+
 	metapack_writer_add_uint32(&mw, nr);
+	metapack_writer_add_uint32(&mw, write_cb.abbrev_len);
 
 	/* Then write an index of commit sha1s */
 	for (p = commits; p; p = p->next)
-		metapack_writer_add(&mw, p->item->object.sha1, 20);
+		metapack_writer_add(&mw, p->item->object.sha1, write_cb.abbrev_len);
 
 	/* Followed by the actual date/tree/parents data */
 	for (p = commits; p; p = p->next) {
 		struct commit *c = p->item;
+		int pos;
+
+		pos = find_pack_entry_pos(c->object.sha1, mw.pack);
+		metapack_writer_add_uint32(&mw, pos);
+
 		metapack_writer_add_uint32(&mw, c->date);
-		metapack_writer_add(&mw, c->tree->object.sha1, 20);
-		metapack_writer_add(&mw, c->parents->item->object.sha1, 20);
-		metapack_writer_add(&mw,
-				    c->parents->next ?
-				    c->parents->next->item->object.sha1 :
-				    null_sha1, 20);
+
+		pos = find_pack_entry_pos(c->tree->object.sha1, mw.pack);
+		metapack_writer_add_uint32(&mw, pos);
+
+		pos = find_pack_entry_pos(c->parents->item->object.sha1, mw.pack);
+		metapack_writer_add_uint32(&mw, pos);
+
+		if (c->parents->next) {
+			struct object *o = &c->parents->next->item->object;
+			pos = find_pack_entry_pos(o->sha1, mw.pack);
+		}
+		metapack_writer_add_uint32(&mw, pos);
 	}
 
 	metapack_writer_finish(&mw);
diff --git a/commit-metapack.h b/commit-metapack.h
index 4684573..caf85be 100644
--- a/commit-metapack.h
+++ b/commit-metapack.h
@@ -3,9 +3,9 @@
 
 int commit_metapack(unsigned char *sha1,
 		    uint32_t *timestamp,
-		    unsigned char **tree,
-		    unsigned char **parent1,
-		    unsigned char **parent2);
+		    const unsigned char **tree,
+		    const unsigned char **parent1,
+		    const unsigned char **parent2);
 
 void commit_metapack_write(const char *idx_file);
 
diff --git a/sha1_file.c b/sha1_file.c
index 40b2329..1acab8c 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1978,8 +1978,8 @@ off_t nth_packed_object_offset(const struct packed_git *p, uint32_t n)
 	}
 }
 
-off_t find_pack_entry_one(const unsigned char *sha1,
-				  struct packed_git *p)
+int find_pack_entry_pos(const unsigned char *sha1,
+			struct packed_git *p)
 {
 	const uint32_t *level1_ofs = p->index_data;
 	const unsigned char *index = p->index_data;
@@ -1992,7 +1992,7 @@ off_t find_pack_entry_one(const unsigned char *sha1,
 
 	if (!index) {
 		if (open_pack_index(p))
-			return 0;
+			return -1;
 		level1_ofs = p->index_data;
 		index = p->index_data;
 	}
@@ -2019,9 +2019,7 @@ off_t find_pack_entry_one(const unsigned char *sha1,
 	if (use_lookup) {
 		int pos = sha1_entry_pos(index, stride, 0,
 					 lo, hi, p->num_objects, sha1);
-		if (pos < 0)
-			return 0;
-		return nth_packed_object_offset(p, pos);
+		return pos;
 	}
 
 	do {
@@ -2032,15 +2030,24 @@ off_t find_pack_entry_one(const unsigned char *sha1,
 			printf("lo %u hi %u rg %u mi %u\n",
 			       lo, hi, hi - lo, mi);
 		if (!cmp)
-			return nth_packed_object_offset(p, mi);
+			return mi;
 		if (cmp > 0)
 			hi = mi;
 		else
 			lo = mi+1;
 	} while (lo < hi);
-	return 0;
+	return -1;
 }
 
+off_t find_pack_entry_one(const unsigned char *sha1,
+				  struct packed_git *p)
+{
+	int pos = find_pack_entry_pos(sha1, p);
+	if (pos < 0)
+		return 0;
+	else
+		return nth_packed_object_offset(p, pos);
+}
 int is_pack_valid(struct packed_git *p)
 {
 	/* An already open pack is known to be valid. */
-- 8< --

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/6] introduce a commit metapack
  2013-01-30 15:56         ` Junio C Hamano
@ 2013-01-31 17:03           ` Shawn Pearce
  2013-02-01  9:42             ` Jeff King
  0 siblings, 1 reply; 43+ messages in thread
From: Shawn Pearce @ 2013-01-31 17:03 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git, Duy Nguyen

On Wed, Jan 30, 2013 at 7:56 AM, Junio C Hamano <gitster@pobox.com> wrote:
> Jeff King <peff@peff.net> writes:
>
>>>From this:
>>
>>> Then it will be very natural for the extension data that store the
>>> commit metainfo to name objects in the pack the .idx file describes
>>> by the offset in the SHA-1 table.
>>
>> I guess your argument is that putting it all in the same file makes it
>> more natural for there to be a data dependency.
>
> It is more about the "I am torn on this one" I mentioned earlier.
>
> It would be more "logical" if this weren't tied to a particular
> pack, as the properties of a commit you record in this series do not
> depend on which pack the commit is in, and such a repository-global
> file by definition cannot be inside anybody's .idx.
>
> But if we split the information into separate pieces and store one
> piece per .idx for implementation reasons, it is crazy not to at
> least consider it a longer term goal to put it inside .idx file.
>
> Of course, it is more convenient to store this kind of things in a
> separate file while experimenting and improving the mechanism, but I
> do not think we want to see each packfile in a repository comes with
> 47 auxiliary files with different suffixes 5 years down the road.

Arrrrgggh.

Right now we are in the middle of refactoring the JGit reachability
bitmap implementation to store it into a separate file and get it out
of the .idx file. This work is nearly completed. So this thread is
great timing. :-)

I think Junio is right about not wanting 47 different tiny auxiliary
files for a single pack. We are unlikely to create that many, but
right now there are proposals floating around for at least 2 new
auxiliary files (commit cache and reachability bitmaps). So its not
impossible that another will be discovered in the future.

Junio may be right about the hole in the index file for git-core. I
haven't checked the JGit implementation, but I suspect it does not
have this hole. IIRC JGit consumes the index sections and then expects
the pack trailer SHA-1 to be present immediately after the last table.
This happens because JGit doesn't use mmap() to load the index, it
streams the file into memory and does some reformatting on the tables
to make search faster.

If we are going to change the index to support extension sections and
I have to modify JGit to grok this new format, it needs to be index v3
not index v2. If we are making index v3 we should just put index v3 on
the end of the pack file.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH/RFC 0/6] commit caching
  2013-01-29  9:14 [PATCH/RFC 0/6] commit caching Jeff King
                   ` (6 preceding siblings ...)
  2013-01-30  3:31 ` [PATCH/RFC 0/6] commit caching Duy Nguyen
@ 2013-01-31 17:14 ` Shawn Pearce
  2013-02-01  9:11   ` Jeff King
  7 siblings, 1 reply; 43+ messages in thread
From: Shawn Pearce @ 2013-01-31 17:14 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Duy Nguyen

On Tue, Jan 29, 2013 at 1:14 AM, Jeff King <peff@peff.net> wrote:
> This is the cleaned-up version of the commit caching patches I mentioned
> here:
>
>   http://article.gmane.org/gmane.comp.version-control.git/212329
...
> The short of it is that for an extra 31M of disk
> space (~4%), I get a warm-cache speedup for "git rev-list --all" of
> ~4.2s to ~0.66s.

I have to admit, this is a nice gain. I don't think users often dig
through all commits to the root but I can see how this might improve
git log with a path filter.

> Coupled with using compression level 0 for trees (which do not compress
> well at all, and yield only a 2% increase in size when left
> uncompressed), my "git rev-list --objects --all" time drops from ~40s to
> ~25s.

This uhm.... is nice?

But consider reachability bitmaps. ~40s to ~80ms. :-)

> Perf reveals that we're spending most of the remaining time in
> lookup_object. I've spent a fair bit of time trying to optimize that,
> but with no luck; I think it's fairly close to optimal. The problem is
> just that we call it a very large number of times, since it is the
> mechanism by which we recognize that we have already processed each
> sha1.

Yup. I have also futzed with the one in JGit for quite a while now. I
pull some tricks there like making it a 2 level directory to reduce
the need to find a contiguous array of 8M entries when processing the
Linux kernel, and I try to preallocate the first level table based on
the number of objects in pack-*.idx files. But the bottleneck is
basically the cache lookups and hits, these happen like 100M times on
2M objects, because its every link in nearly every tree.

Reachability bitmaps basically let you skip this. So they go fast. But
I have another that you could try.

If we modified pack-objects' delta compressor for tree objects to only
generate delta instructions at tree record boundaries, a delta-encoded
tree can be processed without inflating the full content of that tree.
Because of the way deltas are created, "most" tree deltas should have
their delta base scanned by the object traversal before the delta is
considered. This means the tree delta just needs to consider the much
smaller records that are inserted into the base. We know these are
different SHA-1s than what was there before, so they are more likely
to be new to the lookup_object table.

So the --objects traversal algorithm can change to get the delta base
SHA-1 and raw tree delta from the pack storage. Perform a
lookup_object on the base to see if it has been scanned. If it has,
just scan the delta insert instructions. If the base has not yet been
scanned, inflate the tree to its normal format and scan the entire
tree.

This is an approximation of what Nico and I were talking about doing
for pack v4. But doesn't require a file format change. :-)

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH/RFC 0/6] commit caching
  2013-01-31 17:14 ` Shawn Pearce
@ 2013-02-01  9:11   ` Jeff King
  2013-02-02 10:04     ` Shawn Pearce
  0 siblings, 1 reply; 43+ messages in thread
From: Jeff King @ 2013-02-01  9:11 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git, Duy Nguyen

On Thu, Jan 31, 2013 at 09:14:26AM -0800, Shawn O. Pearce wrote:

> On Tue, Jan 29, 2013 at 1:14 AM, Jeff King <peff@peff.net> wrote:
> > This is the cleaned-up version of the commit caching patches I mentioned
> > here:
> >
> >   http://article.gmane.org/gmane.comp.version-control.git/212329
> ...
> > The short of it is that for an extra 31M of disk
> > space (~4%), I get a warm-cache speedup for "git rev-list --all" of
> > ~4.2s to ~0.66s.
> 
> I have to admit, this is a nice gain. I don't think users often dig
> through all commits to the root but I can see how this might improve
> git log with a path filter.

It doesn't just help digging to the roots. It should speed up most
traversals. So merge-bases, --contains, etc, would all be better. I
suspect we could also make --topo-order startup a lot faster, too.

It also helps "rev-list --objects --all", though obviously not by as
large a percentage. And since the main use of that is reachability
bitmaps, the improvements aren't as exciting there.

> > Coupled with using compression level 0 for trees (which do not compress
> > well at all, and yield only a 2% increase in size when left
> > uncompressed), my "git rev-list --objects --all" time drops from ~40s to
> > ~25s.
> 
> This uhm.... is nice?
> 
> But consider reachability bitmaps. ~40s to ~80ms. :-)

Yeah, yeah. I'm working my way up to it. :)

I wanted to see first how good we could get with a more generic
approach. I think this work may still have value even with reachability
bitmaps, as it will help regular traversals as well as tree access for
pathspec limiting.

At this point I'm convinced that my 25s is about the best we will do for
reachability analysis with a graph traversal. The repeated hashcmps to
see that we've visited each node are starting to dominate. So the next
obvious step is to try reachability bitmaps. I was hoping to iron out
the "pack metadata goes here" issues with the commit cache stuff,
though, as the actual cache implementation is quite simple (whereas the
bitmap stuff is more on the complex side, but can build on the same
metadata base).

> Yup. I have also futzed with the one in JGit for quite a while now. I
> pull some tricks there like making it a 2 level directory to reduce
> the need to find a contiguous array of 8M entries when processing the
> Linux kernel, and I try to preallocate the first level table based on
> the number of objects in pack-*.idx files. But the bottleneck is
> basically the cache lookups and hits, these happen like 100M times on
> 2M objects, because its every link in nearly every tree.

Right. I tried some multi-level tricks (and even a radix trie), but I
couldn't get anything to beat the simple-and-stupid single hash table
with linear probing.

> If we modified pack-objects' delta compressor for tree objects to only
> generate delta instructions at tree record boundaries, a delta-encoded
> tree can be processed without inflating the full content of that tree.
> Because of the way deltas are created, "most" tree deltas should have
> their delta base scanned by the object traversal before the delta is
> considered. This means the tree delta just needs to consider the much
> smaller records that are inserted into the base. We know these are
> different SHA-1s than what was there before, so they are more likely
> to be new to the lookup_object table.

So sort of a magic shortcut tree diff you get while accessing the
object. Neat idea.

> So the --objects traversal algorithm can change to get the delta base
> SHA-1 and raw tree delta from the pack storage. Perform a
> lookup_object on the base to see if it has been scanned. If it has,
> just scan the delta insert instructions. If the base has not yet been
> scanned, inflate the tree to its normal format and scan the entire
> tree.

This would not perform well if we hit the deltas before the bases. In
general, though, our "use the larger as the base" heuristic should mean
that our traversal hits the bases first.

> This is an approximation of what Nico and I were talking about doing
> for pack v4. But doesn't require a file format change. :-)

Yeah. It just needs to be very careful that the deltas it is looking at
all fall on record boundaries, since we might get deltas generated by
other versions of git. Can we necessarily identify that case for sure,
though?  I imagine a tree delta like that would look something like:

  delete bytes 100-120
  add 20 bytes at offset 100: \x12\x34\x56...

Without looking at the base object, and without knowing whether the
delta was generated by our particular implementation, how can we be sure
this is a sha1 replacement and not the renaming of part of a file? Or
are you proposing some flag in the packfile to indicate "yes, this tree
really was delta'd only at record boundaries"?

It could be a big win, but it does seem quite complex and error-prone.
And it only helps with reachability, not regular traversals, so it's not
very generic. Which makes me think the bitmap route is a much better way
to go.

-Peff

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/6] introduce a commit metapack
  2013-01-30  7:17         ` Junio C Hamano
@ 2013-02-01  9:21           ` Jeff King
  0 siblings, 0 replies; 43+ messages in thread
From: Jeff King @ 2013-02-01  9:21 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Duy Nguyen, Shawn O. Pearce

On Tue, Jan 29, 2013 at 11:17:41PM -0800, Junio C Hamano wrote:

> > True, but it is even less headache if the file is totally separate and
> > optional.
> 
> Once you start thinking about using an offset to some list of SHA-1,
> perhaps?  A section inside the same file can never go out of sync.

Yes, having a data dependency is important. It is unavoidable to have a
dependency on the packfile, though (and that is why the index and the
metapacks embed the sha1 of the packfile). If the offsets used are
packfile offsets, then that is sufficient.

If the offsets are from the index, then yes, putting it in the same file
is one way to keep them tied together. Another way is to do the same
sha1 verification, except to embed the sha1 of the index in the
metapack.

So I certainly consider putting the dependency-handling to be a point in
favor of the same file, but I'd weight it against other points (headache
of bumping index version, performance of both types, etc).

> Also a longer-term advantage is that you can teach index-pack to do
> this.

I think that is roughly the same amount of difficulty to do whether they
are in the same file or not.

-Peff

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/6] introduce a commit metapack
  2013-01-31 17:03           ` Shawn Pearce
@ 2013-02-01  9:42             ` Jeff King
  2013-02-02 17:49               ` Junio C Hamano
  0 siblings, 1 reply; 43+ messages in thread
From: Jeff King @ 2013-02-01  9:42 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Junio C Hamano, git, Duy Nguyen

On Thu, Jan 31, 2013 at 09:03:26AM -0800, Shawn O. Pearce wrote:

> > Of course, it is more convenient to store this kind of things in a
> > separate file while experimenting and improving the mechanism, but I
> > do not think we want to see each packfile in a repository comes with
> > 47 auxiliary files with different suffixes 5 years down the road.
> 
> Arrrrgggh.
> 
> Right now we are in the middle of refactoring the JGit reachability
> bitmap implementation to store it into a separate file and get it out
> of the .idx file. This work is nearly completed. So this thread is
> great timing. :-)
> 
> I think Junio is right about not wanting 47 different tiny auxiliary
> files for a single pack. We are unlikely to create that many, but
> right now there are proposals floating around for at least 2 new
> auxiliary files (commit cache and reachability bitmaps). So its not
> impossible that another will be discovered in the future.

Why don't we want 47 files (or even 3)? If it makes the implementation
more flexible or simple without sacrificing performance, I don't see a
problem. The filesystem is there to organize our data; we do not cram
all of our files into one just to save a few inodes.

We _do_ cram our data into packfiles and packed-refs when we can save
O(n) inodes. But if we are talking about a handful of extra files that
we must readdir() over while preparing the objects/pack directory, I
don't think that is the same thing.

The data dependency issues (can the files get out of sync? How common is
it? Can we detect it?) and performance issues are more interesting to
me. With respect to the latter, here's specifically what I'm thinking
of. Let's imagine we have an extension for reachability bitmaps that
takes up an extra 10MB. When we mmap the .idx file, do we always mmap
the extra bytes? What's the performance impact of the extra mmap? I
recall on some non-Linux systems it can be quite expensive. For most git
commands which are not going to do a reachability analysis, this is a
loss.

I don't know if this is an issue big enough to care about or not. But I
think it's worth benchmarking and including in the decision.

> Junio may be right about the hole in the index file for git-core. I
> haven't checked the JGit implementation, but I suspect it does not
> have this hole. IIRC JGit consumes the index sections and then expects
> the pack trailer SHA-1 to be present immediately after the last table.
> This happens because JGit doesn't use mmap() to load the index, it
> streams the file into memory and does some reformatting on the tables
> to make search faster.

Yeah, looking at the PackIndexV2 implementation, it counts the 64-bit
offsets from the first table, calculates the size of the large offset
table, reads past it, and then expects the pack checksum. So let's
assume we would have to bump to v3 to implement it inside the .idx file.

> If we are going to change the index to support extension sections and
> I have to modify JGit to grok this new format, it needs to be index v3
> not index v2. If we are making index v3 we should just put index v3 on
> the end of the pack file.

I'm not sure what you mean by your last sentence here.

-Peff

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/6] introduce a commit metapack
  2013-01-30 13:56   ` Duy Nguyen
  2013-01-30 14:16     ` Duy Nguyen
@ 2013-02-01 10:00     ` Jeff King
  1 sibling, 0 replies; 43+ messages in thread
From: Jeff King @ 2013-02-01 10:00 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: git, Shawn O. Pearce

On Wed, Jan 30, 2013 at 08:56:07PM +0700, Nguyen Thai Ngoc Duy wrote:

> Another point, but not really important at this stage, I think we have
> memory leak somewhere (lookup_commit??). It used up to 800 MB RES on
> linux-2.6.git while generating the cache.

We generate (and then leak!) the linked list in commit_metapack_write.
That may be the culprit.

-Peff

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/6] introduce a commit metapack
  2013-01-31 11:06       ` Duy Nguyen
@ 2013-02-01 10:15         ` Jeff King
  2013-02-02  9:49           ` Duy Nguyen
  2013-02-01 10:40         ` Jeff King
  2013-03-17 13:21         ` Duy Nguyen
  2 siblings, 1 reply; 43+ messages in thread
From: Jeff King @ 2013-02-01 10:15 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: git, Shawn O. Pearce

On Thu, Jan 31, 2013 at 06:06:56PM +0700, Nguyen Thai Ngoc Duy wrote:

> On Wed, Jan 30, 2013 at 09:16:29PM +0700, Duy Nguyen wrote:
> > Perhaps we could store abbrev sha-1 instead of full sha-1. Nice
> > space/time trade-off.
> 
> Following the on-disk format experiment yesterday, I changed the
> format to:
> 
>  - a list a _short_ SHA-1 of cached commits
>  - a list of cache entries, each (5 uint32_t) consists of:
>    - uint32_t for the index in .idx sha-1 table to get full SHA-1 of
>      the commit
>    - uint32_t for timestamp
>    - uint32_t for tree, 1st and 2nd parents for the index in .idx
>      table

Thanks for working on this, as it was the next step I was going to take. :)

The short-sha1 is a clever idea. Looks like it saves us on the order of
4MB for linux-2.6 (versus the full 20-byte sha1). Not as big as the
savings we get from dropping the other 3 sha1's to uint32_t, but still
not bad.

I guess the next steps in iterating on this would be:

  1. splitting out the refactoring here into separate patches

  2. squashing the cleaned-up bits into my patch 4/6

  3. deciding whether this should go into a separate file or as part of
     index v3. Your offsets depend on the .idx file having a sorted sha1
     list. That is not likely to change, but it would still be nice to
     make sure they cannot get out of sync. I'm still curious what the
     performance impact is for mmap-ing N versus N+8MB.

> The length of SHA-1 is chosen to be able to unambiguously identify any
> cached commits. Full SHA-1 check is done after to catch false
> positives.

Just to be clear, these false positives come because the abbreviation is
unambiguous within the packfile, but we might be looking for a commit
that is not even in our pack, right?

-Peff

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/6] introduce a commit metapack
  2013-01-31 11:06       ` Duy Nguyen
  2013-02-01 10:15         ` Jeff King
@ 2013-02-01 10:40         ` Jeff King
  2013-03-17 13:21         ` Duy Nguyen
  2 siblings, 0 replies; 43+ messages in thread
From: Jeff King @ 2013-02-01 10:40 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: git, Shawn O. Pearce

On Thu, Jan 31, 2013 at 06:06:56PM +0700, Nguyen Thai Ngoc Duy wrote:

> On Wed, Jan 30, 2013 at 09:16:29PM +0700, Duy Nguyen wrote:
> > Perhaps we could store abbrev sha-1 instead of full sha-1. Nice
> > space/time trade-off.
> 
> Following the on-disk format experiment yesterday, I changed the
> format to:
> 
>  - a list a _short_ SHA-1 of cached commits
>  - a list of cache entries, each (5 uint32_t) consists of:
>    - uint32_t for the index in .idx sha-1 table to get full SHA-1 of
>      the commit
>    - uint32_t for timestamp
>    - uint32_t for tree, 1st and 2nd parents for the index in .idx
>      table

BTW, I needed the minor fixups below to silence some warnings from your
patch. Here are the cold and warm cache timings I got, as compared to
stock git and my implementation:

 Pack  |  Cold Revs   |  Warm Revs
-------+--------------+------------
 stock | 12.54        | 4.14
    me |  4.76 (-62%) | 0.66 (-84%)
   duy |  4.36 (-65%) | 0.55 (-86%)

Not surprising; yours is just a bit faster in terms of CPU, and even
gains a little more in the cold cache case. Nice. Of course that is just
gravy on top of the smaller disk usage, too. :)

---
diff --git a/commit-metapack.c b/commit-metapack.c
index c984b8e..78fd961 100644
--- a/commit-metapack.c
+++ b/commit-metapack.c
@@ -106,7 +106,7 @@ int commit_metapack(unsigned char *sha1,
 	for (p = commit_metapacks; p; p = p->next) {
 		struct commit_entry *data;
 		uint32_t p1, p2;
-		unsigned lo, hi, mi;
+		unsigned lo, hi;
 		int pos;
 
 		/* sha1_entry_pos does not work with abbreviated sha-1 */
@@ -161,7 +161,7 @@ static void get_commits(struct metapack_writer *mw,
 	struct write_cb *write_cb = (struct write_cb *)data;
 	enum object_type type = sha1_object_info(sha1, NULL);
 	struct commit *c;
-	int p1, p2;
+	int p1 = -1, p2 = -1;
 
 	if (type != OBJ_COMMIT)
 		return;
diff --git a/commit.c b/commit.c
index b326201..5b776f8 100644
--- a/commit.c
+++ b/commit.c
@@ -309,7 +309,7 @@ static int parse_commit_metapack(struct commit *item)
 
 static int parse_commit_metapack(struct commit *item)
 {
-	unsigned char *tree, *p1, *p2;
+	const unsigned char *tree, *p1, *p2;
 	uint32_t ts;
 
 	if (commit_metapack(item->object.sha1, &ts, &tree, &p1, &p2) < 0)

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/6] introduce a commit metapack
  2013-02-01 10:15         ` Jeff King
@ 2013-02-02  9:49           ` Duy Nguyen
  0 siblings, 0 replies; 43+ messages in thread
From: Duy Nguyen @ 2013-02-02  9:49 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Shawn O. Pearce

On Fri, Feb 1, 2013 at 5:15 PM, Jeff King <peff@peff.net> wrote:
> The short-sha1 is a clever idea. Looks like it saves us on the order of
> 4MB for linux-2.6 (versus the full 20-byte sha1). Not as big as the
> savings we get from dropping the other 3 sha1's to uint32_t, but still
> not bad.

We could save another 4 bytes per commit by using 3 bytes for storing
.idx offsets. linux-2.6 only has 3M objects. It'll take many years for
big projects to reach 16M objects and need the fourth byte in
uint32_t.


> I guess the next steps in iterating on this would be:
>
>   1. splitting out the refactoring here into separate patches
>
>   2. squashing the cleaned-up bits into my patch 4/6
>
>   3. deciding whether this should go into a separate file or as part of
>      index v3. Your offsets depend on the .idx file having a sorted sha1
>      list. That is not likely to change, but it would still be nice to
>      make sure they cannot get out of sync. I'm still curious what the
>      performance impact is for mmap-ing N versus N+8MB.

4. Print some cache statistics in "count-objects -v"


>> The length of SHA-1 is chosen to be able to unambiguously identify any
>> cached commits. Full SHA-1 check is done after to catch false
>> positives.
>
> Just to be clear, these false positives come because the abbreviation is
> unambiguous within the packfile, but we might be looking for a commit
> that is not even in our pack, right?

It may even be ambiguous within the pack, say an octopus (i.e. not
cached) commit that shares the same sha-1 prefix with one of the
cached commits.
-- 
Duy

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH/RFC 0/6] commit caching
  2013-02-01  9:11   ` Jeff King
@ 2013-02-02 10:04     ` Shawn Pearce
  0 siblings, 0 replies; 43+ messages in thread
From: Shawn Pearce @ 2013-02-02 10:04 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Duy Nguyen

On Fri, Feb 1, 2013 at 1:11 AM, Jeff King <peff@peff.net> wrote:
> On Thu, Jan 31, 2013 at 09:14:26AM -0800, Shawn O. Pearce wrote:
>
>> On Tue, Jan 29, 2013 at 1:14 AM, Jeff King <peff@peff.net> wrote:
>> > Coupled with using compression level 0 for trees (which do not compress
>> > well at all, and yield only a 2% increase in size when left
>> > uncompressed), my "git rev-list --objects --all" time drops from ~40s to
>> > ~25s.
>>
>> This uhm.... is nice?
>>
>> But consider reachability bitmaps. ~40s to ~80ms. :-)
>
> Yeah, yeah. I'm working my way up to it. :)

:-)

> At this point I'm convinced that my 25s is about the best we will do for
> reachability analysis with a graph traversal. The repeated hashcmps to
> see that we've visited each node are starting to dominate. So the next
> obvious step is to try reachability bitmaps.

Yea, its hard to make a big N go fast when you still have a big N...

> I was hoping to iron out
> the "pack metadata goes here" issues with the commit cache stuff,
> though, as the actual cache implementation is quite simple (whereas the
> bitmap stuff is more on the complex side, but can build on the same
> metadata base).

Junio and I were talking about putting these in an index v3, below the
current tables where he thought there was a hole in v2. I am inclined
to agree with his comment elsewhere that we don't want 50 auxiliary
files next to a pack in 5 years.

But if we go that route I also suggested we append the index below the
pack file itself, so its a single file, and that we rename the file to
be SHA1(all-bits) not SHA1(sorted-object-list). Both steps make it
much safer to perform git gc on Windows while the repository is being
accessed.

>> Yup. I have also futzed with the one in JGit for quite a while now. I
>> pull some tricks there like making it a 2 level directory to reduce
>> the need to find a contiguous array of 8M entries when processing the
>> Linux kernel, and I try to preallocate the first level table based on
>> the number of objects in pack-*.idx files. But the bottleneck is
>> basically the cache lookups and hits, these happen like 100M times on
>> 2M objects, because its every link in nearly every tree.
>
> Right. I tried some multi-level tricks (and even a radix trie), but I
> couldn't get anything to beat the simple-and-stupid single hash table
> with linear probing.

O(1) lookup is hard for big N. Lets go shopping^Wcoding instead.

>> If we modified pack-objects' delta compressor for tree objects to only
>> generate delta instructions at tree record boundaries, a delta-encoded
>> tree can be processed without inflating the full content of that tree.
>> Because of the way deltas are created, "most" tree deltas should have
>> their delta base scanned by the object traversal before the delta is
>> considered. This means the tree delta just needs to consider the much
>> smaller records that are inserted into the base. We know these are
>> different SHA-1s than what was there before, so they are more likely
>> to be new to the lookup_object table.
>
> So sort of a magic shortcut tree diff you get while accessing the
> object. Neat idea.

Yes, exactly.

>> So the --objects traversal algorithm can change to get the delta base
>> SHA-1 and raw tree delta from the pack storage. Perform a
>> lookup_object on the base to see if it has been scanned. If it has,
>> just scan the delta insert instructions. If the base has not yet been
>> scanned, inflate the tree to its normal format and scan the entire
>> tree.
>
> This would not perform well if we hit the deltas before the bases. In
> general, though, our "use the larger as the base" heuristic should mean
> that our traversal hits the bases first.

It won't perform worse than the current code. And its actually the
time heuristic that should kick in here for trees. Most trees are
roughly the same size, or are only slightly bigger because a new
source file was added. More recent trees should appear earlier in the
delta window and be suitable candidates for older trees. So we should
get a large percentage of trees covered by this trick.

>> This is an approximation of what Nico and I were talking about doing
>> for pack v4. But doesn't require a file format change. :-)
>
> Yeah. It just needs to be very careful that the deltas it is looking at
> all fall on record boundaries, since we might get deltas generated by
> other versions of git. Can we necessarily identify that case for sure,
> though?  I imagine a tree delta like that would look something like:
>
>   delete bytes 100-120
>   add 20 bytes at offset 100: \x12\x34\x56...

Of course we can't know without some flag. I assumed it was obvious we
would need to tag the pack somehow with extra metadata to say "every
tree delta in this pack is on a record boundary". That does make delta
reuse more complex as a tree delta can only be reused if it is coming
from a pack that has the same promise about record boundaries.
Otherwise the delta must be regenerated during packing.

> Without looking at the base object, and without knowing whether the
> delta was generated by our particular implementation, how can we be sure
> this is a sha1 replacement and not the renaming of part of a file? Or
> are you proposing some flag in the packfile to indicate "yes, this tree
> really was delta'd only at record boundaries"?

Yes, some sort of flag would be required on the pack. :-\

> It could be a big win, but it does seem quite complex and error-prone.

It is complex. But it seems so simple on the Internet...

> And it only helps with reachability, not regular traversals, so it's not
> very generic.

It should also help with path filter traversals. If we require the
deltas to be only at tree record boundaries then the mode and path
will be part of the delta, even if it is unmodified from the base
tree. A path filter can avoid scanning the base sections if its
already seen that base tree before. This could be a substantial
reduction in the number of tree records a path filter needs to examine
to get a `git log -- path` or `git blame path` completed.

> Which makes me think the bitmap route is a much better way
> to go.

Bitmaps discard a lot of useful data in favor of being really freaking
small. Colby was toying around with some other allocations of bitmaps
today. I think he managed to bitmap basically every commit in the
Linux kernel history since September... and its only 3M of data to
store on disk / cache in RAM. Very useful for serving fetch requests
to clients that are at varying points in history, but not so good for
doing things like new delta compression or path limited history
traversal.

<thought type="random" why="look at the Date header">

What if we built a bitmap for each path? Linux kernel history is
~41.5k paths. If the packer constructs a bitmap for each path that
sets a commit's bit if the path is impacted in that commit, you can do
some incredibly fast path traversal operations. Naive implementation
would cost around 1.7G of disk, but I wonder how well that set of maps
might compress.

</thought>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/6] introduce a commit metapack
  2013-02-01  9:42             ` Jeff King
@ 2013-02-02 17:49               ` Junio C Hamano
  0 siblings, 0 replies; 43+ messages in thread
From: Junio C Hamano @ 2013-02-02 17:49 UTC (permalink / raw)
  To: Jeff King; +Cc: Shawn Pearce, git, Duy Nguyen

Jeff King <peff@peff.net> writes:

> On Thu, Jan 31, 2013 at 09:03:26AM -0800, Shawn O. Pearce wrote:
> ...
>> If we are going to change the index to support extension sections and
>> I have to modify JGit to grok this new format, it needs to be index v3
>> not index v2. If we are making index v3 we should just put index v3 on
>> the end of the pack file.
>
> I'm not sure what you mean by your last sentence here.

I am not Shawn, but here is a summary of what I think I discussed
with him in person, lest I forget.

You could imagine that a new pack system (from pack-objects,
index-pack down to read_packed_sha1() call) that works with a
packfile that

 * is a single file, whose name is pack-$SHA1.$sfx (where $sfx
   is something other than 'pack', perhaps);

 * has the pack data stream, including the concluding checksum of
   the stream contents at the end, at the beginning of the file; and

 * has the index v3 data blob appended to the pack data stream.

The pack data is streamed over the wire exactly the same way,
interoperating with existing software.  When receiving, the new
index-pack can read such a pack stream and add index at the end.
When re-indexing an existing pack (think: upgrading existing
packfiles from the current system), the index-pack can read from the
packfile and do what it does currently (notably, it knows where the
pack stream ends and can stop reading at that point, so even if you
feed the new pack to it, it will stop at the end of the pack data,
ignoring the index v3 already at the end of the input).

One potential advantage of using a single file, instead of the
primary .pack file with 3 (or 47) auxiliary files, is that it lets
you repack without having to deal with this sequence, which happens
currently when you repack:

 * create a new .pack file and the corresponding auxiliary files
   under temporary filename;

 * move existing pack files that describe the same set of objects
   away;

 * rename these new files, one at a time, to their final name,
   making sure that you rename .idx the last, because that happens
   to be the key to the pack aware programs.

Instead you can rename only one thing (the new one) to the final
name (possibly atomically replacing the existing one).  With the
current system, when you need to replace a pack with a new pack with
the same packname (e.g. you repack everything with a better pack
parameter in a repository that has everything packed into one),
there is a very small window other concurrent users will not find
the object data between the time when you rename the old ones away
and the time when you move the new ones in.  The hairly logic
between "Ok we have prepared all new packfiles" and "End of pack
replacement" can be done with a single rename(2) of the new packfile
(which contains everything) to the final name, which atomically
replaces the old one.

This will become even safer if we picked $SHA1 (the name of the
packfile) to represent the hash of the whole thing, not the hash of
the sorted object names in the pack, as that will let us know there
is no need to even "replace" the files.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/6] introduce a commit metapack
  2013-01-31 11:06       ` Duy Nguyen
  2013-02-01 10:15         ` Jeff King
  2013-02-01 10:40         ` Jeff King
@ 2013-03-17 13:21         ` Duy Nguyen
  2013-03-18 12:20           ` Jeff King
  2 siblings, 1 reply; 43+ messages in thread
From: Duy Nguyen @ 2013-03-17 13:21 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Shawn O. Pearce

On Thu, Jan 31, 2013 at 6:06 PM, Duy Nguyen <pclouds@gmail.com> wrote:
> On Wed, Jan 30, 2013 at 09:16:29PM +0700, Duy Nguyen wrote:
>> Perhaps we could store abbrev sha-1 instead of full sha-1. Nice
>> space/time trade-off.
>
> Following the on-disk format experiment yesterday, I changed the
> format to:
>
>  - a list a _short_ SHA-1 of cached commits
> ..
>
> The length of SHA-1 is chosen to be able to unambiguously identify any
> cached commits. Full SHA-1 check is done after to catch false
> positives. For linux-2.6, SHA-1 length is 6 bytes, git and many
> moderate-sized projects are 4 bytes.

And if we are going to create index v3, the same trick could be used
for the sha-1 table in the index. We use the short sha-1 table for
binary search and put the rest of sha-1 in a following table (just
like file offset table). The advantage is a denser search space, about
1/4-1/3 the size of full sha-1 table.
-- 
Duy

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/6] introduce a commit metapack
  2013-03-17 13:21         ` Duy Nguyen
@ 2013-03-18 12:20           ` Jeff King
  0 siblings, 0 replies; 43+ messages in thread
From: Jeff King @ 2013-03-18 12:20 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: git, Shawn O. Pearce

On Sun, Mar 17, 2013 at 08:21:13PM +0700, Nguyen Thai Ngoc Duy wrote:

> On Thu, Jan 31, 2013 at 6:06 PM, Duy Nguyen <pclouds@gmail.com> wrote:
> > On Wed, Jan 30, 2013 at 09:16:29PM +0700, Duy Nguyen wrote:
> >> Perhaps we could store abbrev sha-1 instead of full sha-1. Nice
> >> space/time trade-off.
> >
> > Following the on-disk format experiment yesterday, I changed the
> > format to:
> >
> >  - a list a _short_ SHA-1 of cached commits
> > ..
> >
> > The length of SHA-1 is chosen to be able to unambiguously identify any
> > cached commits. Full SHA-1 check is done after to catch false
> > positives. For linux-2.6, SHA-1 length is 6 bytes, git and many
> > moderate-sized projects are 4 bytes.
> 
> And if we are going to create index v3, the same trick could be used
> for the sha-1 table in the index. We use the short sha-1 table for
> binary search and put the rest of sha-1 in a following table (just
> like file offset table). The advantage is a denser search space, about
> 1/4-1/3 the size of full sha-1 table.

You can make it even smaller at some (potential) run-time cost.

Keep in mind you are just repeating information that is in the full sha1
list in the index. So you could store a fixed-size offset into that list
(e.g., 32-bit), and then instead of comparing sha1s during a binary
search, you would dereference the offset to the real sha1s and compare
those.

The run-time cost is not any worse in a big-O sense, but your cache
locality is much worse (you hit a second random page for each sha1
comparison), which might be noticeable. You'd have to benchmark to see
how big an impact.

-Peff

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2013-03-18 12:20 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-01-29  9:14 [PATCH/RFC 0/6] commit caching Jeff King
2013-01-29  9:15 ` [PATCH 1/6] csum-file: make sha1write const-correct Jeff King
2013-01-29  9:15 ` [PATCH 2/6] strbuf: add string-chomping functions Jeff King
2013-01-29 10:15   ` Michael Haggerty
2013-01-29 11:10     ` Jeff King
2013-01-30  5:00       ` Michael Haggerty
2013-01-29  9:15 ` [PATCH 3/6] introduce pack metadata cache files Jeff King
2013-01-29 17:35   ` Junio C Hamano
2013-01-30  6:47     ` Jeff King
2013-01-30  1:30   ` Duy Nguyen
2013-01-30  6:50     ` Jeff King
2013-01-29  9:16 ` [PATCH 4/6] introduce a commit metapack Jeff King
2013-01-29 10:24   ` Michael Haggerty
2013-01-29 11:13     ` Jeff King
2013-01-29 17:38   ` Junio C Hamano
2013-01-29 18:08     ` Junio C Hamano
2013-01-30  7:12       ` Jeff King
2013-01-30  7:17         ` Junio C Hamano
2013-02-01  9:21           ` Jeff King
2013-01-30 15:56         ` Junio C Hamano
2013-01-31 17:03           ` Shawn Pearce
2013-02-01  9:42             ` Jeff King
2013-02-02 17:49               ` Junio C Hamano
2013-01-30  7:07     ` Jeff King
2013-01-30  3:36   ` Duy Nguyen
2013-01-30  7:12     ` Jeff King
2013-01-30 13:56   ` Duy Nguyen
2013-01-30 14:16     ` Duy Nguyen
2013-01-31 11:06       ` Duy Nguyen
2013-02-01 10:15         ` Jeff King
2013-02-02  9:49           ` Duy Nguyen
2013-02-01 10:40         ` Jeff King
2013-03-17 13:21         ` Duy Nguyen
2013-03-18 12:20           ` Jeff King
2013-02-01 10:00     ` Jeff King
2013-01-29  9:16 ` [PATCH 5/6] add git-metapack command Jeff King
2013-01-29  9:16 ` [PATCH 6/6] commit: look up commit info in metapack Jeff King
2013-01-30  3:31 ` [PATCH/RFC 0/6] commit caching Duy Nguyen
2013-01-30  7:18   ` Jeff King
2013-01-30  8:32     ` Duy Nguyen
2013-01-31 17:14 ` Shawn Pearce
2013-02-01  9:11   ` Jeff King
2013-02-02 10:04     ` Shawn Pearce

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.