All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/9] Prefix-compress on-disk index entries
@ 2012-04-03 22:53 Junio C Hamano
  2012-04-03 22:53 ` [PATCH 1/9] varint: make it available outside the context of pack Junio C Hamano
                   ` (11 more replies)
  0 siblings, 12 replies; 27+ messages in thread
From: Junio C Hamano @ 2012-04-03 22:53 UTC (permalink / raw)
  To: git

This is still rough, but with this patch I am getting:

    $ ls -l .git/index*
    -rw-r----- 1 jch eng 25586488 2012-04-03 15:27 .git/index
    -rw-r----- 1 jch eng 14654328 2012-04-03 15:38 .git/index-4

in a clone of WebKit repository that has 183175 paths.

With hot-cache with no local modification:

    $ time sh -c 'GIT_INDEX_FILE=.git/index-4 git diff'
    real  0m0.469s
    user  0m0.130s
    sys   0m0.330s

    $ time sh -c 'git diff'
    real  0m0.677s
    user  0m0.290s
    sys   0m0.370s

which is mesuring the time needed to read of the index into in-core
structure and comparing the cached stat information taken from lstat(2).

The updated format is not documented yet, as I didn't intend (and I still
am not committed) to declare a change along this line the official "v4"
format; I was merely being curious to see how much improvements we can get
from a trivial approach like this.

The saving of the on-disk index size comes from two factors:

 - Not padding the on-disk index entries to 8-byte boundary;

 - Not storing the full pathname for each entry in the on-disk format.

Because the entries are sorted by path, adjacent entries in the index tend
to share the leading components of them, and it makes sense to only store
the differences in later entries.  In the v4 on-disk format of the index,
each on-disk cache entry stores the number of bytes to be stripped from
the end of the previous name, and the bytes to append to the result, to
come up with its name.

The "to-remove" count is encoded in the varint format used in the
packfiles, and the "bytes-to-append" is a simple NUL-terminated string.

Junio C Hamano (9):
  varint: make it available outside the context of pack
  cache.h: hide on-disk index details
  read-cache.c: allow unaligned mapping of the index file
  read-cache.c: make create_from_disk() report number of bytes it consumed
  read-cache.c: report the header version we do not understand
  read-cache.c: move code to copy ondisk to incore cache to a helper function
  read-cache.c: move code to copy incore to ondisk cache to a helper function
  read-cache.c: read prefix-compressed names in index on-disk version v4
  read-cache.c: write index v4 format

 Makefile               |    2 +
 builtin/update-index.c |    2 +
 cache.h                |   52 +---------
 config.c               |   11 ++
 environment.c          |    1 +
 read-cache.c           |  259 ++++++++++++++++++++++++++++++++++++++++--------
 varint.c               |   29 ++++++
 varint.h               |    9 ++
 8 files changed, 275 insertions(+), 90 deletions(-)
 create mode 100644 varint.c
 create mode 100644 varint.h

-- 
1.7.10.rc4.54.g1d5dd3

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/9] varint: make it available outside the context of pack
  2012-04-03 22:53 [PATCH 0/9] Prefix-compress on-disk index entries Junio C Hamano
@ 2012-04-03 22:53 ` Junio C Hamano
  2012-04-03 22:53 ` [PATCH 2/9] cache.h: hide on-disk index details Junio C Hamano
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 27+ messages in thread
From: Junio C Hamano @ 2012-04-03 22:53 UTC (permalink / raw)
  To: git

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---

 * This was taken from the bottom of my jc/split-blob topic, which made it
   available across "pack" related machinery, but it is useful outside the
   context of "pack".

 Makefile |    2 ++
 varint.c |   29 +++++++++++++++++++++++++++++
 varint.h |    9 +++++++++
 3 files changed, 40 insertions(+)
 create mode 100644 varint.c
 create mode 100644 varint.h

diff --git a/Makefile b/Makefile
index be1957a..0f26c87 100644
--- a/Makefile
+++ b/Makefile
@@ -627,6 +627,7 @@ LIB_H += tree-walk.h
 LIB_H += unpack-trees.h
 LIB_H += userdiff.h
 LIB_H += utf8.h
+LIB_H += varint.h
 LIB_H += xdiff-interface.h
 LIB_H += xdiff/xdiff.h
 
@@ -752,6 +753,7 @@ LIB_OBJS += url.o
 LIB_OBJS += usage.o
 LIB_OBJS += userdiff.o
 LIB_OBJS += utf8.o
+LIB_OBJS += varint.o
 LIB_OBJS += walker.o
 LIB_OBJS += wrapper.o
 LIB_OBJS += write_or_die.o
diff --git a/varint.c b/varint.c
new file mode 100644
index 0000000..4ed7729
--- /dev/null
+++ b/varint.c
@@ -0,0 +1,29 @@
+#include "varint.h"
+
+uintmax_t decode_varint(const unsigned char **bufp)
+{
+	const unsigned char *buf = *bufp;
+	unsigned char c = *buf++;
+	uintmax_t val = c & 127;
+	while (c & 128) {
+		val += 1;
+		if (!val || MSB(val, 7))
+			return 0; /* overflow */
+		c = *buf++;
+		val = (val << 7) + (c & 127);
+	}
+	*bufp = buf;
+	return val;
+}
+
+int encode_varint(uintmax_t value, unsigned char *buf)
+{
+	unsigned char varint[16];
+	unsigned pos = sizeof(varint) - 1;
+	varint[pos] = value & 127;
+	while (value >>= 7)
+		varint[--pos] = 128 | (--value & 127);
+	if (buf)
+		memcpy(buf, varint + pos, sizeof(varint) - pos);
+	return sizeof(varint) - pos;
+}
diff --git a/varint.h b/varint.h
new file mode 100644
index 0000000..0321195
--- /dev/null
+++ b/varint.h
@@ -0,0 +1,9 @@
+#ifndef VARINT_H
+#define VARINT_H
+
+#include "git-compat-util.h"
+
+extern int encode_varint(uintmax_t, unsigned char *);
+extern uintmax_t decode_varint(const unsigned char **);
+
+#endif /* VARINT_H */
-- 
1.7.10.rc4.54.g1d5dd3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 2/9] cache.h: hide on-disk index details
  2012-04-03 22:53 [PATCH 0/9] Prefix-compress on-disk index entries Junio C Hamano
  2012-04-03 22:53 ` [PATCH 1/9] varint: make it available outside the context of pack Junio C Hamano
@ 2012-04-03 22:53 ` Junio C Hamano
  2012-04-03 22:53 ` [PATCH 3/9] read-cache.c: allow unaligned mapping of the index file Junio C Hamano
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 27+ messages in thread
From: Junio C Hamano @ 2012-04-03 22:53 UTC (permalink / raw)
  To: git

The on-disk format of the index file is a detail whose implementation is
neatly encapsulated in read-cache.c; there is no need to expose it to the
general public that include the cache.h header file.

Also add a prominent mark to read-cache.c to delineate the parts that deal
with the index file I/O routines from the remainder of the file.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 cache.h      |   48 ------------------------------------------------
 read-cache.c |   54 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 54 insertions(+), 48 deletions(-)

diff --git a/cache.h b/cache.h
index e5e1aa4..65a7aba 100644
--- a/cache.h
+++ b/cache.h
@@ -115,48 +115,6 @@ struct cache_time {
 	unsigned int nsec;
 };
 
-/*
- * dev/ino/uid/gid/size are also just tracked to the low 32 bits
- * Again - this is just a (very strong in practice) heuristic that
- * the inode hasn't changed.
- *
- * We save the fields in big-endian order to allow using the
- * index file over NFS transparently.
- */
-struct ondisk_cache_entry {
-	struct cache_time ctime;
-	struct cache_time mtime;
-	unsigned int dev;
-	unsigned int ino;
-	unsigned int mode;
-	unsigned int uid;
-	unsigned int gid;
-	unsigned int size;
-	unsigned char sha1[20];
-	unsigned short flags;
-	char name[FLEX_ARRAY]; /* more */
-};
-
-/*
- * This struct is used when CE_EXTENDED bit is 1
- * The struct must match ondisk_cache_entry exactly from
- * ctime till flags
- */
-struct ondisk_cache_entry_extended {
-	struct cache_time ctime;
-	struct cache_time mtime;
-	unsigned int dev;
-	unsigned int ino;
-	unsigned int mode;
-	unsigned int uid;
-	unsigned int gid;
-	unsigned int size;
-	unsigned char sha1[20];
-	unsigned short flags;
-	unsigned short flags2;
-	char name[FLEX_ARRAY]; /* more */
-};
-
 struct cache_entry {
 	struct cache_time ce_ctime;
 	struct cache_time ce_mtime;
@@ -253,9 +211,6 @@ static inline size_t ce_namelen(const struct cache_entry *ce)
 }
 
 #define ce_size(ce) cache_entry_size(ce_namelen(ce))
-#define ondisk_ce_size(ce) (((ce)->ce_flags & CE_EXTENDED) ? \
-			    ondisk_cache_entry_extended_size(ce_namelen(ce)) : \
-			    ondisk_cache_entry_size(ce_namelen(ce)))
 #define ce_stage(ce) ((CE_STAGEMASK & (ce)->ce_flags) >> CE_STAGESHIFT)
 #define ce_uptodate(ce) ((ce)->ce_flags & CE_UPTODATE)
 #define ce_skip_worktree(ce) ((ce)->ce_flags & CE_SKIP_WORKTREE)
@@ -306,10 +261,7 @@ static inline unsigned int canon_mode(unsigned int mode)
 	return S_IFGITLINK;
 }
 
-#define flexible_size(STRUCT,len) ((offsetof(struct STRUCT,name) + (len) + 8) & ~7)
 #define cache_entry_size(len) (offsetof(struct cache_entry,name) + (len) + 1)
-#define ondisk_cache_entry_size(len) flexible_size(ondisk_cache_entry,len)
-#define ondisk_cache_entry_extended_size(len) flexible_size(ondisk_cache_entry_extended,len)
 
 struct index_state {
 	struct cache_entry **cache;
diff --git a/read-cache.c b/read-cache.c
index 274e54b..fa8aa73 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1189,6 +1189,60 @@ static struct cache_entry *refresh_cache_entry(struct cache_entry *ce, int reall
 	return refresh_cache_ent(&the_index, ce, really, NULL, NULL);
 }
 
+
+/*****************************************************************
+ * Index File I/O
+ *****************************************************************/
+
+/*
+ * dev/ino/uid/gid/size are also just tracked to the low 32 bits
+ * Again - this is just a (very strong in practice) heuristic that
+ * the inode hasn't changed.
+ *
+ * We save the fields in big-endian order to allow using the
+ * index file over NFS transparently.
+ */
+struct ondisk_cache_entry {
+	struct cache_time ctime;
+	struct cache_time mtime;
+	unsigned int dev;
+	unsigned int ino;
+	unsigned int mode;
+	unsigned int uid;
+	unsigned int gid;
+	unsigned int size;
+	unsigned char sha1[20];
+	unsigned short flags;
+	char name[FLEX_ARRAY]; /* more */
+};
+
+/*
+ * This struct is used when CE_EXTENDED bit is 1
+ * The struct must match ondisk_cache_entry exactly from
+ * ctime till flags
+ */
+struct ondisk_cache_entry_extended {
+	struct cache_time ctime;
+	struct cache_time mtime;
+	unsigned int dev;
+	unsigned int ino;
+	unsigned int mode;
+	unsigned int uid;
+	unsigned int gid;
+	unsigned int size;
+	unsigned char sha1[20];
+	unsigned short flags;
+	unsigned short flags2;
+	char name[FLEX_ARRAY]; /* more */
+};
+
+#define align_flex_name(STRUCT,len) ((offsetof(struct STRUCT,name) + (len) + 8) & ~7)
+#define ondisk_cache_entry_size(len) align_flex_name(ondisk_cache_entry,len)
+#define ondisk_cache_entry_extended_size(len) align_flex_name(ondisk_cache_entry_extended,len)
+#define ondisk_ce_size(ce) (((ce)->ce_flags & CE_EXTENDED) ? \
+			    ondisk_cache_entry_extended_size(ce_namelen(ce)) : \
+			    ondisk_cache_entry_size(ce_namelen(ce)))
+
 static int verify_hdr(struct cache_header *hdr, unsigned long size)
 {
 	git_SHA_CTX c;
-- 
1.7.10.rc4.54.g1d5dd3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 3/9] read-cache.c: allow unaligned mapping of the index file
  2012-04-03 22:53 [PATCH 0/9] Prefix-compress on-disk index entries Junio C Hamano
  2012-04-03 22:53 ` [PATCH 1/9] varint: make it available outside the context of pack Junio C Hamano
  2012-04-03 22:53 ` [PATCH 2/9] cache.h: hide on-disk index details Junio C Hamano
@ 2012-04-03 22:53 ` Junio C Hamano
  2012-04-03 22:53 ` [PATCH 4/9] read-cache.c: make create_from_disk() report number of bytes it consumed Junio C Hamano
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 27+ messages in thread
From: Junio C Hamano @ 2012-04-03 22:53 UTC (permalink / raw)
  To: git

Both the on-disk format v2 and v3 pads the "name" field to the multiple of
eight to make sure that various quantities in network long/short type can
be accessed with ntohl/ntohs without having to worry about alignment, but
this forces us to waste disk I/O bandwidth.

Introduce ntoh_s()/ntoh_l() macros that the callers can use as if they were
the regular ntohs()/ntohl() on a field that may not be aligned correctly.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 read-cache.c |   44 ++++++++++++++++++++++++++++++++------------
 1 file changed, 32 insertions(+), 12 deletions(-)

diff --git a/read-cache.c b/read-cache.c
index fa8aa73..d8865f5 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1285,6 +1285,26 @@ int read_index(struct index_state *istate)
 	return read_index_from(istate, get_index_file());
 }
 
+#ifndef NEEDS_ALIGNED_ACCESS
+#define ntoh_s(var) ntohs(var)
+#define ntoh_l(var) ntohl(var)
+#else
+static inline uint16_t ntoh_s_force_align(void *p)
+{
+	uint16_t x;
+	memcpy(&x, p, sizeof(x));
+	return ntohs(x);
+}
+static inline uint32_t ntoh_l_force_align(void *p)
+{
+	uint32_t x;
+	memcpy(&x, p, sizeof(x));
+	return ntohl(x);
+}
+#define ntoh_s(var) ntoh_s_force_align(&(var))
+#define ntoh_l(var) ntoh_l_force_align(&(var))
+#endif
+
 static struct cache_entry *create_from_disk(struct ondisk_cache_entry *ondisk)
 {
 	struct cache_entry *ce;
@@ -1293,14 +1313,14 @@ static struct cache_entry *create_from_disk(struct ondisk_cache_entry *ondisk)
 	unsigned int flags;
 
 	/* On-disk flags are just 16 bits */
-	flags = ntohs(ondisk->flags);
+	flags = ntoh_s(ondisk->flags);
 	len = flags & CE_NAMEMASK;
 
 	if (flags & CE_EXTENDED) {
 		struct ondisk_cache_entry_extended *ondisk2;
 		int extended_flags;
 		ondisk2 = (struct ondisk_cache_entry_extended *)ondisk;
-		extended_flags = ntohs(ondisk2->flags2) << 16;
+		extended_flags = ntoh_s(ondisk2->flags2) << 16;
 		/* We do not yet understand any bit out of CE_EXTENDED_FLAGS */
 		if (extended_flags & ~CE_EXTENDED_FLAGS)
 			die("Unknown index entry format %08x", extended_flags);
@@ -1315,16 +1335,16 @@ static struct cache_entry *create_from_disk(struct ondisk_cache_entry *ondisk)
 
 	ce = xmalloc(cache_entry_size(len));
 
-	ce->ce_ctime.sec = ntohl(ondisk->ctime.sec);
-	ce->ce_mtime.sec = ntohl(ondisk->mtime.sec);
-	ce->ce_ctime.nsec = ntohl(ondisk->ctime.nsec);
-	ce->ce_mtime.nsec = ntohl(ondisk->mtime.nsec);
-	ce->ce_dev   = ntohl(ondisk->dev);
-	ce->ce_ino   = ntohl(ondisk->ino);
-	ce->ce_mode  = ntohl(ondisk->mode);
-	ce->ce_uid   = ntohl(ondisk->uid);
-	ce->ce_gid   = ntohl(ondisk->gid);
-	ce->ce_size  = ntohl(ondisk->size);
+	ce->ce_ctime.sec = ntoh_l(ondisk->ctime.sec);
+	ce->ce_mtime.sec = ntoh_l(ondisk->mtime.sec);
+	ce->ce_ctime.nsec = ntoh_l(ondisk->ctime.nsec);
+	ce->ce_mtime.nsec = ntoh_l(ondisk->mtime.nsec);
+	ce->ce_dev   = ntoh_l(ondisk->dev);
+	ce->ce_ino   = ntoh_l(ondisk->ino);
+	ce->ce_mode  = ntoh_l(ondisk->mode);
+	ce->ce_uid   = ntoh_l(ondisk->uid);
+	ce->ce_gid   = ntoh_l(ondisk->gid);
+	ce->ce_size  = ntoh_l(ondisk->size);
 	ce->ce_flags = flags;
 
 	hashcpy(ce->sha1, ondisk->sha1);
-- 
1.7.10.rc4.54.g1d5dd3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 4/9] read-cache.c: make create_from_disk() report number of bytes it consumed
  2012-04-03 22:53 [PATCH 0/9] Prefix-compress on-disk index entries Junio C Hamano
                   ` (2 preceding siblings ...)
  2012-04-03 22:53 ` [PATCH 3/9] read-cache.c: allow unaligned mapping of the index file Junio C Hamano
@ 2012-04-03 22:53 ` Junio C Hamano
  2012-04-03 22:53 ` [PATCH 5/9] read-cache.c: report the header version we do not understand Junio C Hamano
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 27+ messages in thread
From: Junio C Hamano @ 2012-04-03 22:53 UTC (permalink / raw)
  To: git

The function is the one that is reading from the data stream. It only is
natural to make it responsible for reporting this number, not the caller.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 read-cache.c |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/read-cache.c b/read-cache.c
index d8865f5..58bfb24 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1305,7 +1305,8 @@ static inline uint32_t ntoh_l_force_align(void *p)
 #define ntoh_l(var) ntoh_l_force_align(&(var))
 #endif
 
-static struct cache_entry *create_from_disk(struct ondisk_cache_entry *ondisk)
+static struct cache_entry *create_from_disk(struct ondisk_cache_entry *ondisk,
+					    unsigned long *ent_size)
 {
 	struct cache_entry *ce;
 	size_t len;
@@ -1351,6 +1352,7 @@ static struct cache_entry *create_from_disk(struct ondisk_cache_entry *ondisk)
 
 	memcpy(ce->name, name, len);
 	ce->name[len] = '\0';
+	*ent_size = ondisk_ce_size(ce);
 	return ce;
 }
 
@@ -1404,12 +1406,13 @@ int read_index_from(struct index_state *istate, const char *path)
 	for (i = 0; i < istate->cache_nr; i++) {
 		struct ondisk_cache_entry *disk_ce;
 		struct cache_entry *ce;
+		unsigned long consumed;
 
 		disk_ce = (struct ondisk_cache_entry *)((char *)mmap + src_offset);
-		ce = create_from_disk(disk_ce);
+		ce = create_from_disk(disk_ce, &consumed);
 		set_index_entry(istate, i, ce);
 
-		src_offset += ondisk_ce_size(ce);
+		src_offset += consumed;
 	}
 	istate->timestamp.sec = st.st_mtime;
 	istate->timestamp.nsec = ST_MTIME_NSEC(st);
-- 
1.7.10.rc4.54.g1d5dd3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 5/9] read-cache.c: report the header version we do not understand
  2012-04-03 22:53 [PATCH 0/9] Prefix-compress on-disk index entries Junio C Hamano
                   ` (3 preceding siblings ...)
  2012-04-03 22:53 ` [PATCH 4/9] read-cache.c: make create_from_disk() report number of bytes it consumed Junio C Hamano
@ 2012-04-03 22:53 ` Junio C Hamano
  2012-04-03 22:53 ` [PATCH 6/9] read-cache.c: move code to copy ondisk to incore cache to a helper function Junio C Hamano
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 27+ messages in thread
From: Junio C Hamano @ 2012-04-03 22:53 UTC (permalink / raw)
  To: git

Instead of just saying "bad index version", report the value we read
from the disk.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 read-cache.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/read-cache.c b/read-cache.c
index 58bfb24..2d93826 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1247,11 +1247,13 @@ static int verify_hdr(struct cache_header *hdr, unsigned long size)
 {
 	git_SHA_CTX c;
 	unsigned char sha1[20];
+	int hdr_version;
 
 	if (hdr->hdr_signature != htonl(CACHE_SIGNATURE))
 		return error("bad signature");
-	if (hdr->hdr_version != htonl(2) && hdr->hdr_version != htonl(3))
-		return error("bad index version");
+	hdr_version = ntohl(hdr->hdr_version);
+	if (hdr_version < 2 || 3 < hdr_version)
+		return error("bad index version %d", hdr_version);
 	git_SHA1_Init(&c);
 	git_SHA1_Update(&c, hdr, size - 20);
 	git_SHA1_Final(sha1, &c);
-- 
1.7.10.rc4.54.g1d5dd3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 6/9] read-cache.c: move code to copy ondisk to incore cache to a helper function
  2012-04-03 22:53 [PATCH 0/9] Prefix-compress on-disk index entries Junio C Hamano
                   ` (4 preceding siblings ...)
  2012-04-03 22:53 ` [PATCH 5/9] read-cache.c: report the header version we do not understand Junio C Hamano
@ 2012-04-03 22:53 ` Junio C Hamano
  2012-04-03 22:53 ` [PATCH 7/9] read-cache.c: move code to copy incore to ondisk " Junio C Hamano
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 27+ messages in thread
From: Junio C Hamano @ 2012-04-03 22:53 UTC (permalink / raw)
  To: git

This makes the change in a later patch look less scary.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 read-cache.c |   44 +++++++++++++++++++++++++-------------------
 1 file changed, 25 insertions(+), 19 deletions(-)

diff --git a/read-cache.c b/read-cache.c
index 2d93826..82711c2 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1307,6 +1307,30 @@ static inline uint32_t ntoh_l_force_align(void *p)
 #define ntoh_l(var) ntoh_l_force_align(&(var))
 #endif
 
+static struct cache_entry *cache_entry_from_ondisk(struct ondisk_cache_entry *ondisk,
+						   unsigned int flags,
+						   const char *name,
+						   size_t len)
+{
+	struct cache_entry *ce = xmalloc(cache_entry_size(len));
+
+	ce->ce_ctime.sec = ntoh_l(ondisk->ctime.sec);
+	ce->ce_mtime.sec = ntoh_l(ondisk->mtime.sec);
+	ce->ce_ctime.nsec = ntoh_l(ondisk->ctime.nsec);
+	ce->ce_mtime.nsec = ntoh_l(ondisk->mtime.nsec);
+	ce->ce_dev   = ntoh_l(ondisk->dev);
+	ce->ce_ino   = ntoh_l(ondisk->ino);
+	ce->ce_mode  = ntoh_l(ondisk->mode);
+	ce->ce_uid   = ntoh_l(ondisk->uid);
+	ce->ce_gid   = ntoh_l(ondisk->gid);
+	ce->ce_size  = ntoh_l(ondisk->size);
+	ce->ce_flags = flags;
+	hashcpy(ce->sha1, ondisk->sha1);
+	memcpy(ce->name, name, len);
+	ce->name[len] = '\0';
+	return ce;
+}
+
 static struct cache_entry *create_from_disk(struct ondisk_cache_entry *ondisk,
 					    unsigned long *ent_size)
 {
@@ -1335,25 +1359,7 @@ static struct cache_entry *create_from_disk(struct ondisk_cache_entry *ondisk,
 
 	if (len == CE_NAMEMASK)
 		len = strlen(name);
-
-	ce = xmalloc(cache_entry_size(len));
-
-	ce->ce_ctime.sec = ntoh_l(ondisk->ctime.sec);
-	ce->ce_mtime.sec = ntoh_l(ondisk->mtime.sec);
-	ce->ce_ctime.nsec = ntoh_l(ondisk->ctime.nsec);
-	ce->ce_mtime.nsec = ntoh_l(ondisk->mtime.nsec);
-	ce->ce_dev   = ntoh_l(ondisk->dev);
-	ce->ce_ino   = ntoh_l(ondisk->ino);
-	ce->ce_mode  = ntoh_l(ondisk->mode);
-	ce->ce_uid   = ntoh_l(ondisk->uid);
-	ce->ce_gid   = ntoh_l(ondisk->gid);
-	ce->ce_size  = ntoh_l(ondisk->size);
-	ce->ce_flags = flags;
-
-	hashcpy(ce->sha1, ondisk->sha1);
-
-	memcpy(ce->name, name, len);
-	ce->name[len] = '\0';
+	ce = cache_entry_from_ondisk(ondisk, flags, name, len);
 	*ent_size = ondisk_ce_size(ce);
 	return ce;
 }
-- 
1.7.10.rc4.54.g1d5dd3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 7/9] read-cache.c: move code to copy incore to ondisk cache to a helper function
  2012-04-03 22:53 [PATCH 0/9] Prefix-compress on-disk index entries Junio C Hamano
                   ` (5 preceding siblings ...)
  2012-04-03 22:53 ` [PATCH 6/9] read-cache.c: move code to copy ondisk to incore cache to a helper function Junio C Hamano
@ 2012-04-03 22:53 ` Junio C Hamano
  2012-04-03 22:53 ` [PATCH 8/9] read-cache.c: read prefix-compressed names in index on-disk version v4 Junio C Hamano
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 27+ messages in thread
From: Junio C Hamano @ 2012-04-03 22:53 UTC (permalink / raw)
  To: git

This makes the change in a later patch look less scary.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 read-cache.c |   26 +++++++++++++++++---------
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/read-cache.c b/read-cache.c
index 82711c2..c159351 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1605,13 +1605,10 @@ static void ce_smudge_racily_clean_entry(struct cache_entry *ce)
 	}
 }
 
-static int ce_write_entry(git_SHA_CTX *c, int fd, struct cache_entry *ce)
+/* Copy miscellaneous fields but not the name */
+static char *copy_cache_entry_to_ondisk(struct ondisk_cache_entry *ondisk,
+				       struct cache_entry *ce)
 {
-	int size = ondisk_ce_size(ce);
-	struct ondisk_cache_entry *ondisk = xcalloc(1, size);
-	char *name;
-	int result;
-
 	ondisk->ctime.sec = htonl(ce->ce_ctime.sec);
 	ondisk->mtime.sec = htonl(ce->ce_mtime.sec);
 	ondisk->ctime.nsec = htonl(ce->ce_ctime.nsec);
@@ -1628,10 +1625,21 @@ static int ce_write_entry(git_SHA_CTX *c, int fd, struct cache_entry *ce)
 		struct ondisk_cache_entry_extended *ondisk2;
 		ondisk2 = (struct ondisk_cache_entry_extended *)ondisk;
 		ondisk2->flags2 = htons((ce->ce_flags & CE_EXTENDED_FLAGS) >> 16);
-		name = ondisk2->name;
+		return ondisk2->name;
 	}
-	else
-		name = ondisk->name;
+	else {
+		return ondisk->name;
+	}
+}
+
+static int ce_write_entry(git_SHA_CTX *c, int fd, struct cache_entry *ce)
+{
+	int size = ondisk_ce_size(ce);
+	struct ondisk_cache_entry *ondisk = xcalloc(1, size);
+	char *name;
+	int result;
+
+	name = copy_cache_entry_to_ondisk(ondisk, ce);
 	memcpy(name, ce->name, ce_namelen(ce));
 
 	result = ce_write(c, fd, ondisk, size);
-- 
1.7.10.rc4.54.g1d5dd3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 8/9] read-cache.c: read prefix-compressed names in index on-disk version v4
  2012-04-03 22:53 [PATCH 0/9] Prefix-compress on-disk index entries Junio C Hamano
                   ` (6 preceding siblings ...)
  2012-04-03 22:53 ` [PATCH 7/9] read-cache.c: move code to copy incore to ondisk " Junio C Hamano
@ 2012-04-03 22:53 ` Junio C Hamano
  2012-04-03 22:53 ` [PATCH 9/9] read-cache.c: write index v4 format Junio C Hamano
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 27+ messages in thread
From: Junio C Hamano @ 2012-04-03 22:53 UTC (permalink / raw)
  To: git

Because the entries are sorted by path, adjacent entries in the index tend
to share the leading components of them, and it makes sense to only store
the differences in later entries.  In the v4 on-disk format of the index,
each on-disk cache entry stores the number of bytes to be stripped from
the end of the previous name, and the bytes to append to the result, to
come up with its name.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 read-cache.c |   58 +++++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 51 insertions(+), 7 deletions(-)

diff --git a/read-cache.c b/read-cache.c
index c159351..1c173f7 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -12,6 +12,8 @@
 #include "commit.h"
 #include "blob.h"
 #include "resolve-undo.h"
+#include "strbuf.h"
+#include "varint.h"
 
 static struct cache_entry *refresh_cache_entry(struct cache_entry *ce, int really);
 
@@ -1236,6 +1238,7 @@ struct ondisk_cache_entry_extended {
 	char name[FLEX_ARRAY]; /* more */
 };
 
+/* These are only used for v3 or lower */
 #define align_flex_name(STRUCT,len) ((offsetof(struct STRUCT,name) + (len) + 8) & ~7)
 #define ondisk_cache_entry_size(len) align_flex_name(ondisk_cache_entry,len)
 #define ondisk_cache_entry_extended_size(len) align_flex_name(ondisk_cache_entry_extended,len)
@@ -1252,7 +1255,7 @@ static int verify_hdr(struct cache_header *hdr, unsigned long size)
 	if (hdr->hdr_signature != htonl(CACHE_SIGNATURE))
 		return error("bad signature");
 	hdr_version = ntohl(hdr->hdr_version);
-	if (hdr_version < 2 || 3 < hdr_version)
+	if (hdr_version < 2 || 4 < hdr_version)
 		return error("bad index version %d", hdr_version);
 	git_SHA1_Init(&c);
 	git_SHA1_Update(&c, hdr, size - 20);
@@ -1331,8 +1334,30 @@ static struct cache_entry *cache_entry_from_ondisk(struct ondisk_cache_entry *on
 	return ce;
 }
 
+/*
+ * Adjacent cache entries tend to share the leading paths, so it makes
+ * sense to only store the differences in later entries.  In the v4
+ * on-disk format of the index, each on-disk cache entry stores the
+ * number of bytes to be stripped from the end of the previous name,
+ * and the bytes to append to the result, to come up with its name.
+ */
+static unsigned long expand_name_field(struct strbuf *name, const char *cp_)
+{
+	const unsigned char *ep, *cp = (const unsigned char *)cp_;
+	size_t len = decode_varint(&cp);
+
+	if (name->len < len)
+		die("malformed name field in the index");
+	strbuf_remove(name, name->len - len, len);
+	for (ep = cp; *ep; ep++)
+		; /* find the end */
+	strbuf_add(name, cp, ep - cp);
+	return (const char *)ep + 1 - cp_;
+}
+
 static struct cache_entry *create_from_disk(struct ondisk_cache_entry *ondisk,
-					    unsigned long *ent_size)
+					    unsigned long *ent_size,
+					    struct strbuf *previous_name)
 {
 	struct cache_entry *ce;
 	size_t len;
@@ -1357,10 +1382,22 @@ static struct cache_entry *create_from_disk(struct ondisk_cache_entry *ondisk,
 	else
 		name = ondisk->name;
 
-	if (len == CE_NAMEMASK)
-		len = strlen(name);
-	ce = cache_entry_from_ondisk(ondisk, flags, name, len);
-	*ent_size = ondisk_ce_size(ce);
+	if (!previous_name) {
+		/* v3 and earlier */
+		if (len == CE_NAMEMASK)
+			len = strlen(name);
+		ce = cache_entry_from_ondisk(ondisk, flags, name, len);
+
+		*ent_size = ondisk_ce_size(ce);
+	} else {
+		unsigned long consumed;
+		consumed = expand_name_field(previous_name, name);
+		ce = cache_entry_from_ondisk(ondisk, flags,
+					     previous_name->buf,
+					     previous_name->len);
+
+		*ent_size = (name - ((char *)ondisk)) + consumed;
+	}
 	return ce;
 }
 
@@ -1373,6 +1410,7 @@ int read_index_from(struct index_state *istate, const char *path)
 	struct cache_header *hdr;
 	void *mmap;
 	size_t mmap_size;
+	struct strbuf previous_name_buf = STRBUF_INIT, *previous_name;
 
 	errno = EBUSY;
 	if (istate->initialized)
@@ -1410,6 +1448,11 @@ int read_index_from(struct index_state *istate, const char *path)
 	istate->cache = xcalloc(istate->cache_alloc, sizeof(struct cache_entry *));
 	istate->initialized = 1;
 
+	if (hdr->hdr_version == htonl(4))
+		previous_name = &previous_name_buf;
+	else
+		previous_name = NULL;
+
 	src_offset = sizeof(*hdr);
 	for (i = 0; i < istate->cache_nr; i++) {
 		struct ondisk_cache_entry *disk_ce;
@@ -1417,11 +1460,12 @@ int read_index_from(struct index_state *istate, const char *path)
 		unsigned long consumed;
 
 		disk_ce = (struct ondisk_cache_entry *)((char *)mmap + src_offset);
-		ce = create_from_disk(disk_ce, &consumed);
+		ce = create_from_disk(disk_ce, &consumed, previous_name);
 		set_index_entry(istate, i, ce);
 
 		src_offset += consumed;
 	}
+	strbuf_release(&previous_name_buf);
 	istate->timestamp.sec = st.st_mtime;
 	istate->timestamp.nsec = ST_MTIME_NSEC(st);
 
-- 
1.7.10.rc4.54.g1d5dd3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 9/9] read-cache.c: write index v4 format
  2012-04-03 22:53 [PATCH 0/9] Prefix-compress on-disk index entries Junio C Hamano
                   ` (7 preceding siblings ...)
  2012-04-03 22:53 ` [PATCH 8/9] read-cache.c: read prefix-compressed names in index on-disk version v4 Junio C Hamano
@ 2012-04-03 22:53 ` Junio C Hamano
  2012-04-04  1:44 ` [PATCH 0/9] Prefix-compress on-disk index entries David Barr
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 27+ messages in thread
From: Junio C Hamano @ 2012-04-03 22:53 UTC (permalink / raw)
  To: git

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 builtin/update-index.c |    2 ++
 cache.h                |    4 ++++
 config.c               |   11 ++++++++++
 environment.c          |    1 +
 read-cache.c           |   56 ++++++++++++++++++++++++++++++++++++++++--------
 5 files changed, 65 insertions(+), 9 deletions(-)

diff --git a/builtin/update-index.c b/builtin/update-index.c
index a6a23fa..b663f45 100644
--- a/builtin/update-index.c
+++ b/builtin/update-index.c
@@ -791,6 +791,8 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
 			"(for porcelains) forget saved unresolved conflicts",
 			PARSE_OPT_NOARG | PARSE_OPT_NONEG,
 			resolve_undo_clear_callback},
+		OPT_INTEGER(0, "index-format", &preferred_index_format,
+			    "write index in this format"),
 		OPT_END()
 	};
 
diff --git a/cache.h b/cache.h
index 65a7aba..bdec32c 100644
--- a/cache.h
+++ b/cache.h
@@ -105,6 +105,9 @@ struct cache_header {
 	unsigned int hdr_entries;
 };
 
+#define SUPPORTED_INDEX_FORMAT_LB 2
+#define SUPPORTED_INDEX_FORMAT_UB 4
+
 /*
  * The "cache_time" is just the low 32 bits of the
  * time. It doesn't matter if it overflows - we only
@@ -537,6 +540,7 @@ extern int has_symlinks;
 extern int minimum_abbrev, default_abbrev;
 extern int ignore_case;
 extern int assume_unchanged;
+extern int preferred_index_format;
 extern int prefer_symlink_refs;
 extern int log_all_ref_updates;
 extern int warn_ambiguous_refs;
diff --git a/config.c b/config.c
index 68d3294..0244a85 100644
--- a/config.c
+++ b/config.c
@@ -564,6 +564,17 @@ static int git_default_core_config(const char *var, const char *value)
 		return 0;
 	}
 
+	if (!strcmp(var, "core.indexformat")) {
+		int val = git_config_int(var, value);
+		if (val < SUPPORTED_INDEX_FORMAT_LB ||
+		    SUPPORTED_INDEX_FORMAT_UB < val)
+			return error("%s not in supported range: %d..%d",
+				     var, SUPPORTED_INDEX_FORMAT_LB,
+				     SUPPORTED_INDEX_FORMAT_UB);
+		preferred_index_format = val;
+		return 0;
+	}
+
 	if (!strcmp(var, "core.quotepath")) {
 		quote_path_fully = git_config_bool(var, value);
 		return 0;
diff --git a/environment.c b/environment.c
index c93b8f4..4ecf8e6 100644
--- a/environment.c
+++ b/environment.c
@@ -20,6 +20,7 @@ int has_symlinks = 1;
 int minimum_abbrev = 4, default_abbrev = 7;
 int ignore_case;
 int assume_unchanged;
+int preferred_index_format;
 int prefer_symlink_refs;
 int is_bare_repository_cfg = -1; /* unspecified */
 int log_all_ref_updates = -1; /* unspecified */
diff --git a/read-cache.c b/read-cache.c
index 1c173f7..fdac89a 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1676,15 +1676,45 @@ static char *copy_cache_entry_to_ondisk(struct ondisk_cache_entry *ondisk,
 	}
 }
 
-static int ce_write_entry(git_SHA_CTX *c, int fd, struct cache_entry *ce)
+static int ce_write_entry(git_SHA_CTX *c, int fd, struct cache_entry *ce,
+			  struct strbuf *previous_name)
 {
-	int size = ondisk_ce_size(ce);
-	struct ondisk_cache_entry *ondisk = xcalloc(1, size);
+	int size;
+	struct ondisk_cache_entry *ondisk;
 	char *name;
 	int result;
 
-	name = copy_cache_entry_to_ondisk(ondisk, ce);
-	memcpy(name, ce->name, ce_namelen(ce));
+	if (!previous_name) {
+		size = ondisk_ce_size(ce);
+		ondisk = xcalloc(1, size);
+		name = copy_cache_entry_to_ondisk(ondisk, ce);
+		memcpy(name, ce->name, ce_namelen(ce));
+	} else {
+		int common, to_remove, prefix_size;
+		unsigned char to_remove_vi[16];
+		for (common = 0;
+		     (ce->name[common] &&
+		      common < previous_name->len &&
+		      ce->name[common] == previous_name->buf[common]);
+		     common++)
+			; /* still matching */
+		to_remove = previous_name->len - common;
+		prefix_size = encode_varint(to_remove, to_remove_vi);
+
+		if (ce->ce_flags & CE_EXTENDED)
+			size = offsetof(struct ondisk_cache_entry_extended, name);
+		else
+			size = offsetof(struct ondisk_cache_entry, name);
+		size += prefix_size + (ce_namelen(ce) - common + 1);
+
+		ondisk = xcalloc(1, size);
+		name = copy_cache_entry_to_ondisk(ondisk, ce);
+		memcpy(name, to_remove_vi, prefix_size);
+		memcpy(name + prefix_size, ce->name + common, ce_namelen(ce) - common);
+
+		strbuf_splice(previous_name, common, to_remove,
+			      ce->name + common, ce_namelen(ce) - common);
+	}
 
 	result = ce_write(c, fd, ondisk, size);
 	free(ondisk);
@@ -1720,10 +1750,11 @@ int write_index(struct index_state *istate, int newfd)
 {
 	git_SHA_CTX c;
 	struct cache_header hdr;
-	int i, err, removed, extended;
+	int i, err, removed, extended, hdr_version;
 	struct cache_entry **cache = istate->cache;
 	int entries = istate->cache_nr;
 	struct stat st;
+	struct strbuf previous_name_buf = STRBUF_INIT, *previous_name;
 
 	for (i = removed = extended = 0; i < entries; i++) {
 		if (cache[i]->ce_flags & CE_REMOVE)
@@ -1737,24 +1768,31 @@ int write_index(struct index_state *istate, int newfd)
 		}
 	}
 
+	if (preferred_index_format)
+		hdr_version = preferred_index_format;
+	else
+		hdr_version = extended ? 3 : 2;
+
+
 	hdr.hdr_signature = htonl(CACHE_SIGNATURE);
-	/* for extended format, increase version so older git won't try to read it */
-	hdr.hdr_version = htonl(extended ? 3 : 2);
+	hdr.hdr_version = htonl(hdr_version);
 	hdr.hdr_entries = htonl(entries - removed);
 
 	git_SHA1_Init(&c);
 	if (ce_write(&c, newfd, &hdr, sizeof(hdr)) < 0)
 		return -1;
 
+	previous_name = (hdr_version == 4) ? &previous_name_buf : NULL;
 	for (i = 0; i < entries; i++) {
 		struct cache_entry *ce = cache[i];
 		if (ce->ce_flags & CE_REMOVE)
 			continue;
 		if (!ce_uptodate(ce) && is_racy_timestamp(istate, ce))
 			ce_smudge_racily_clean_entry(ce);
-		if (ce_write_entry(&c, newfd, ce) < 0)
+		if (ce_write_entry(&c, newfd, ce, previous_name) < 0)
 			return -1;
 	}
+	strbuf_release(&previous_name_buf);
 
 	/* Write extension data here */
 	if (istate->cache_tree) {
-- 
1.7.10.rc4.54.g1d5dd3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/9] Prefix-compress on-disk index entries
  2012-04-03 22:53 [PATCH 0/9] Prefix-compress on-disk index entries Junio C Hamano
                   ` (8 preceding siblings ...)
  2012-04-03 22:53 ` [PATCH 9/9] read-cache.c: write index v4 format Junio C Hamano
@ 2012-04-04  1:44 ` David Barr
  2012-04-04 15:33   ` Junio C Hamano
  2012-04-04 12:34 ` [PATCH 0/9] Prefix-compress on-disk index entries Nguyen Thai Ngoc Duy
  2012-04-27 22:58 ` [PATCH 1/2] unpack-trees: preserve the index file version of original Junio C Hamano
  11 siblings, 1 reply; 27+ messages in thread
From: David Barr @ 2012-04-04  1:44 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Wed, Apr 4, 2012 at 8:53 AM, Junio C Hamano <gitster@pobox.com> wrote:
> This is still rough, but with this patch I am getting:
>
>    $ ls -l .git/index*
>    -rw-r----- 1 jch eng 25586488 2012-04-03 15:27 .git/index
>    -rw-r----- 1 jch eng 14654328 2012-04-03 15:38 .git/index-4
>
> in a clone of WebKit repository that has 183175 paths.
>
> With hot-cache with no local modification:
>
>    $ time sh -c 'GIT_INDEX_FILE=.git/index-4 git diff'
>    real  0m0.469s
>    user  0m0.130s
>    sys   0m0.330s
>
>    $ time sh -c 'git diff'
>    real  0m0.677s
>    user  0m0.290s
>    sys   0m0.370s
>
> which is mesuring the time needed to read of the index into in-core
> structure and comparing the cached stat information taken from lstat(2).
>
> The updated format is not documented yet, as I didn't intend (and I still
> am not committed) to declare a change along this line the official "v4"
> format; I was merely being curious to see how much improvements we can get
> from a trivial approach like this.

As I am hacking on WebKit daily, I'll try out this series and give feedback.

--
David Barr

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/9] Prefix-compress on-disk index entries
  2012-04-03 22:53 [PATCH 0/9] Prefix-compress on-disk index entries Junio C Hamano
                   ` (9 preceding siblings ...)
  2012-04-04  1:44 ` [PATCH 0/9] Prefix-compress on-disk index entries David Barr
@ 2012-04-04 12:34 ` Nguyen Thai Ngoc Duy
  2012-04-04 18:44   ` Junio C Hamano
  2012-04-27 22:58 ` [PATCH 1/2] unpack-trees: preserve the index file version of original Junio C Hamano
  11 siblings, 1 reply; 27+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-04-04 12:34 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Wed, Apr 4, 2012 at 5:53 AM, Junio C Hamano <gitster@pobox.com> wrote:
> This is still rough,

but nice cleanups

> but with this patch I am getting:
>
>    $ ls -l .git/index*
>    -rw-r----- 1 jch eng 25586488 2012-04-03 15:27 .git/index
>    -rw-r----- 1 jch eng 14654328 2012-04-03 15:38 .git/index-4
>
> in a clone of WebKit repository that has 183175 paths.
>
> With hot-cache with no local modification:
>
>    $ time sh -c 'GIT_INDEX_FILE=.git/index-4 git diff'
>    real  0m0.469s
>    user  0m0.130s
>    sys   0m0.330s
>
>    $ time sh -c 'git diff'
>    real  0m0.677s
>    user  0m0.290s
>    sys   0m0.370s

I wonder what causes user time drop from .29s to .13s here. I think
the main patch should increase computation, even only slightly, not
less. Or is it noise?

> The updated format is not documented yet, as I didn't intend (and I still
> am not committed) to declare a change along this line the official "v4"
> format; I was merely being curious to see how much improvements we can get
> from a trivial approach like this.

Anything else you have in mind for v4? Any chance we can adopt crc32
instead of sha-1? We could divide the index into many smaller parts
for checksum, for example one crc32 every 100 entries, and one (or
sha-1) for each extension. It should not complicate the code too much.
-- 
Duy

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/9] Prefix-compress on-disk index entries
  2012-04-04  1:44 ` [PATCH 0/9] Prefix-compress on-disk index entries David Barr
@ 2012-04-04 15:33   ` Junio C Hamano
  2012-04-04 16:57     ` Junio C Hamano
  0 siblings, 1 reply; 27+ messages in thread
From: Junio C Hamano @ 2012-04-04 15:33 UTC (permalink / raw)
  To: David Barr; +Cc: git

David Barr <davidbarr@google.com> writes:

> As I am hacking on WebKit daily, I'll try out this series and give feedback.

Thanks; the write-out codepath needs to learn to keep the format of the
index it originally read from when there is no preferred format defined, I
think, as I do not think core.indexformat configuration is particularly a
good idea, by the way.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/9] Prefix-compress on-disk index entries
  2012-04-04 15:33   ` Junio C Hamano
@ 2012-04-04 16:57     ` Junio C Hamano
  2012-04-04 16:58       ` [PATCH 2/2] update-index: upgrade/downgrade on-disk index version Junio C Hamano
  0 siblings, 1 reply; 27+ messages in thread
From: Junio C Hamano @ 2012-04-04 16:57 UTC (permalink / raw)
  To: git; +Cc: David Barr

Junio C Hamano <gitster@pobox.com> writes:

> David Barr <davidbarr@google.com> writes:
>
>> As I am hacking on WebKit daily, I'll try out this series and give feedback.
>
> Thanks; the write-out codepath needs to learn to keep the format of the
> index it originally read from when there is no preferred format defined, I
> think, as I do not think core.indexformat configuration is particularly a
> good idea, by the way.

Here is the first of two patches that should replace 9/9 (write index v4 format)
of the yesterday's 9-patch series.

-- >8 --
Subject: [PATCH 1/2] read-cache.c: write prefix-compressed names in the index

Teach the code to write the index in the v4 on-disk format.

Record the format version of the on-disk index we read from in the
index_state, and use the format when writing the new index out.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 cache.h      |    4 ++++
 read-cache.c |   64 +++++++++++++++++++++++++++++++++++++++++++++++++---------
 2 files changed, 58 insertions(+), 10 deletions(-)

diff --git a/cache.h b/cache.h
index 65a7aba..a3f1279 100644
--- a/cache.h
+++ b/cache.h
@@ -105,6 +105,9 @@ struct cache_header {
 	unsigned int hdr_entries;
 };
 
+#define INDEX_FORMAT_LB 2
+#define INDEX_FORMAT_UB 4
+
 /*
  * The "cache_time" is just the low 32 bits of the
  * time. It doesn't matter if it overflows - we only
@@ -265,6 +268,7 @@ static inline unsigned int canon_mode(unsigned int mode)
 
 struct index_state {
 	struct cache_entry **cache;
+	unsigned int version;
 	unsigned int cache_nr, cache_alloc, cache_changed;
 	struct string_list *resolve_undo;
 	struct cache_tree *cache_tree;
diff --git a/read-cache.c b/read-cache.c
index 1c173f7..adda1da 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1196,6 +1196,8 @@ static struct cache_entry *refresh_cache_entry(struct cache_entry *ce, int reall
  * Index File I/O
  *****************************************************************/
 
+#define INDEX_FORMAT_DEFAULT 3
+
 /*
  * dev/ino/uid/gid/size are also just tracked to the low 32 bits
  * Again - this is just a (very strong in practice) heuristic that
@@ -1443,12 +1445,13 @@ int read_index_from(struct index_state *istate, const char *path)
 	if (verify_hdr(hdr, mmap_size) < 0)
 		goto unmap;
 
+	istate->version = ntohl(hdr->hdr_version);
 	istate->cache_nr = ntohl(hdr->hdr_entries);
 	istate->cache_alloc = alloc_nr(istate->cache_nr);
 	istate->cache = xcalloc(istate->cache_alloc, sizeof(struct cache_entry *));
 	istate->initialized = 1;
 
-	if (hdr->hdr_version == htonl(4))
+	if (istate->version == 4)
 		previous_name = &previous_name_buf;
 	else
 		previous_name = NULL;
@@ -1676,15 +1679,45 @@ static char *copy_cache_entry_to_ondisk(struct ondisk_cache_entry *ondisk,
 	}
 }
 
-static int ce_write_entry(git_SHA_CTX *c, int fd, struct cache_entry *ce)
+static int ce_write_entry(git_SHA_CTX *c, int fd, struct cache_entry *ce,
+			  struct strbuf *previous_name)
 {
-	int size = ondisk_ce_size(ce);
-	struct ondisk_cache_entry *ondisk = xcalloc(1, size);
+	int size;
+	struct ondisk_cache_entry *ondisk;
 	char *name;
 	int result;
 
-	name = copy_cache_entry_to_ondisk(ondisk, ce);
-	memcpy(name, ce->name, ce_namelen(ce));
+	if (!previous_name) {
+		size = ondisk_ce_size(ce);
+		ondisk = xcalloc(1, size);
+		name = copy_cache_entry_to_ondisk(ondisk, ce);
+		memcpy(name, ce->name, ce_namelen(ce));
+	} else {
+		int common, to_remove, prefix_size;
+		unsigned char to_remove_vi[16];
+		for (common = 0;
+		     (ce->name[common] &&
+		      common < previous_name->len &&
+		      ce->name[common] == previous_name->buf[common]);
+		     common++)
+			; /* still matching */
+		to_remove = previous_name->len - common;
+		prefix_size = encode_varint(to_remove, to_remove_vi);
+
+		if (ce->ce_flags & CE_EXTENDED)
+			size = offsetof(struct ondisk_cache_entry_extended, name);
+		else
+			size = offsetof(struct ondisk_cache_entry, name);
+		size += prefix_size + (ce_namelen(ce) - common + 1);
+
+		ondisk = xcalloc(1, size);
+		name = copy_cache_entry_to_ondisk(ondisk, ce);
+		memcpy(name, to_remove_vi, prefix_size);
+		memcpy(name + prefix_size, ce->name + common, ce_namelen(ce) - common);
+
+		strbuf_splice(previous_name, common, to_remove,
+			      ce->name + common, ce_namelen(ce) - common);
+	}
 
 	result = ce_write(c, fd, ondisk, size);
 	free(ondisk);
@@ -1720,10 +1753,11 @@ int write_index(struct index_state *istate, int newfd)
 {
 	git_SHA_CTX c;
 	struct cache_header hdr;
-	int i, err, removed, extended;
+	int i, err, removed, extended, hdr_version;
 	struct cache_entry **cache = istate->cache;
 	int entries = istate->cache_nr;
 	struct stat st;
+	struct strbuf previous_name_buf = STRBUF_INIT, *previous_name;
 
 	for (i = removed = extended = 0; i < entries; i++) {
 		if (cache[i]->ce_flags & CE_REMOVE)
@@ -1737,24 +1771,34 @@ int write_index(struct index_state *istate, int newfd)
 		}
 	}
 
+	if (!istate->version)
+		istate->version = INDEX_FORMAT_DEFAULT;
+
+	/* demote version 3 to version 2 when the latter suffices */
+	if (istate->version == 3 || istate->version == 2)
+		istate->version = extended ? 3 : 2;
+
+	hdr_version = istate->version;
+
 	hdr.hdr_signature = htonl(CACHE_SIGNATURE);
-	/* for extended format, increase version so older git won't try to read it */
-	hdr.hdr_version = htonl(extended ? 3 : 2);
+	hdr.hdr_version = htonl(hdr_version);
 	hdr.hdr_entries = htonl(entries - removed);
 
 	git_SHA1_Init(&c);
 	if (ce_write(&c, newfd, &hdr, sizeof(hdr)) < 0)
 		return -1;
 
+	previous_name = (hdr_version == 4) ? &previous_name_buf : NULL;
 	for (i = 0; i < entries; i++) {
 		struct cache_entry *ce = cache[i];
 		if (ce->ce_flags & CE_REMOVE)
 			continue;
 		if (!ce_uptodate(ce) && is_racy_timestamp(istate, ce))
 			ce_smudge_racily_clean_entry(ce);
-		if (ce_write_entry(&c, newfd, ce) < 0)
+		if (ce_write_entry(&c, newfd, ce, previous_name) < 0)
 			return -1;
 	}
+	strbuf_release(&previous_name_buf);
 
 	/* Write extension data here */
 	if (istate->cache_tree) {
-- 
1.7.10.rc4.54.g1d5dd3

 

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 2/2] update-index: upgrade/downgrade on-disk index version
  2012-04-04 16:57     ` Junio C Hamano
@ 2012-04-04 16:58       ` Junio C Hamano
  0 siblings, 0 replies; 27+ messages in thread
From: Junio C Hamano @ 2012-04-04 16:58 UTC (permalink / raw)
  To: git; +Cc: David Barr

With the "--index-version <n>" parameter, write the index out in the
specified version.  With this, an index file that is written in newer
format (say v4) can be downgraded to be read by older versions of Git.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 * And this is the second of the two-patch series to replace 9/9 from
   yesterday's series.

 Documentation/git-update-index.txt |    6 +++++-
 builtin/update-index.c             |   14 ++++++++++++++
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-update-index.txt b/Documentation/git-update-index.txt
index a3081f4..fd9103b 100644
--- a/Documentation/git-update-index.txt
+++ b/Documentation/git-update-index.txt
@@ -19,7 +19,7 @@ SYNOPSIS
 	     [--ignore-submodules]
 	     [--really-refresh] [--unresolve] [--again | -g]
 	     [--info-only] [--index-info]
-	     [-z] [--stdin]
+	     [-z] [--stdin] [--index-version <n>]
 	     [--verbose]
 	     [--] [<file>...]
 
@@ -143,6 +143,10 @@ you will need to handle the situation manually.
 --verbose::
         Report what is being added and removed from index.
 
+--index-version <n>::
+	Write the resulting index out in the named on-disk format version.
+	The current default version is 2.
+
 -z::
 	Only meaningful with `--stdin` or `--index-info`; paths are
 	separated with NUL character instead of LF.
diff --git a/builtin/update-index.c b/builtin/update-index.c
index a6a23fa..5f038d6 100644
--- a/builtin/update-index.c
+++ b/builtin/update-index.c
@@ -708,6 +708,7 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
 	int newfd, entries, has_errors = 0, line_termination = '\n';
 	int read_from_stdin = 0;
 	int prefix_length = prefix ? strlen(prefix) : 0;
+	int preferred_index_format = 0;
 	char set_executable_bit = 0;
 	struct refresh_params refresh_args = {0, &has_errors};
 	int lock_error = 0;
@@ -791,6 +792,8 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
 			"(for porcelains) forget saved unresolved conflicts",
 			PARSE_OPT_NOARG | PARSE_OPT_NONEG,
 			resolve_undo_clear_callback},
+		OPT_INTEGER(0, "index-version", &preferred_index_format,
+			    "write index in this format"),
 		OPT_END()
 	};
 
@@ -851,6 +854,17 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
 		}
 	}
 	argc = parse_options_end(&ctx);
+	if (preferred_index_format) {
+		if (preferred_index_format < INDEX_FORMAT_LB ||
+		    INDEX_FORMAT_UB < preferred_index_format)
+			die("index-version %d not in range: %d..%d",
+			    preferred_index_format,
+			    INDEX_FORMAT_LB, INDEX_FORMAT_UB);
+
+		if (the_index.version != preferred_index_format)
+			active_cache_changed = 1;
+		the_index.version = preferred_index_format;
+	}
 
 	if (read_from_stdin) {
 		struct strbuf buf = STRBUF_INIT, nbuf = STRBUF_INIT;
-- 
1.7.10.rc4.54.g1d5dd3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/9] Prefix-compress on-disk index entries
  2012-04-04 12:34 ` [PATCH 0/9] Prefix-compress on-disk index entries Nguyen Thai Ngoc Duy
@ 2012-04-04 18:44   ` Junio C Hamano
  2012-04-06  8:41     ` David Barr
  0 siblings, 1 reply; 27+ messages in thread
From: Junio C Hamano @ 2012-04-04 18:44 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: git

Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes:

> On Wed, Apr 4, 2012 at 5:53 AM, Junio C Hamano <gitster@pobox.com> wrote:
> ...
> I wonder what causes user time drop from .29s to .13s here. I think
> the main patch should increase computation, even only slightly, not
> less.

The main patch reduced the amount of the data needs to be sent to the
machinery to checksum and write to disk by about 45%, saving both I/O
and computation.

This is a tangent, but I wonder why we are not using csum-file API to do
this (I know the dircache code came first way before csum-file; I am
wondering why we haven't rewritten the codepath using it later).

> Anything else you have in mind for v4? Any chance we can adopt crc32
> instead of sha-1?

I am not interested in sacrificing integrity over unproven/unmeasured
performance "issues" on SHA-1, so I am not planning to experiment with
such a change myself.  The choice of hashing algorithm from my point of
view is the least interesting part.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/9] Prefix-compress on-disk index entries
  2012-04-04 18:44   ` Junio C Hamano
@ 2012-04-06  8:41     ` David Barr
  2012-05-02  1:58       ` Nguyen Thai Ngoc Duy
  0 siblings, 1 reply; 27+ messages in thread
From: David Barr @ 2012-04-06  8:41 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Nguyen Thai Ngoc Duy, git

On Thu, Apr 5, 2012 at 4:44 AM, Junio C Hamano <gitster@pobox.com> wrote:
> Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes:
>
>> On Wed, Apr 4, 2012 at 5:53 AM, Junio C Hamano <gitster@pobox.com> wrote:
>> ...
>> I wonder what causes user time drop from .29s to .13s here. I think
>> the main patch should increase computation, even only slightly, not
>> less.
>
> The main patch reduced the amount of the data needs to be sent to the
> machinery to checksum and write to disk by about 45%, saving both I/O
> and computation.

I hacked together a quick patch to try predictive coding the other
fields of the index. I got a further 34% improvement in size over
this series. Patches to come. I just used the previous cache entry as
the predictor and reused varint.h together with zigzag encoding[1].

That's a total improvement in size over v2 of 62%.

[1] https://developers.google.com/protocol-buffers/docs/encoding#types

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/2] unpack-trees: preserve the index file version of original
  2012-04-03 22:53 [PATCH 0/9] Prefix-compress on-disk index entries Junio C Hamano
                   ` (10 preceding siblings ...)
  2012-04-04 12:34 ` [PATCH 0/9] Prefix-compress on-disk index entries Nguyen Thai Ngoc Duy
@ 2012-04-27 22:58 ` Junio C Hamano
  2012-04-27 23:02   ` [PATCH 2/2] index-v4: document the entry format Junio C Hamano
  11 siblings, 1 reply; 27+ messages in thread
From: Junio C Hamano @ 2012-04-27 22:58 UTC (permalink / raw)
  To: git

Otherwise "git checkout $other_branch" (or even "git checkout HEAD")
would end up writing the index out in the default format.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 * The first of a two patch series to update the jc/index-v4 series:
   http://thread.gmane.org/gmane.comp.version-control.git/194660

 unpack-trees.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/unpack-trees.c b/unpack-trees.c
index 7c9ecf6..2a037d6 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -1020,6 +1020,7 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
 	o->result.initialized = 1;
 	o->result.timestamp.sec = o->src_index->timestamp.sec;
 	o->result.timestamp.nsec = o->src_index->timestamp.nsec;
+	o->result.version = o->src_index->version;
 	o->merge_size = len;
 	mark_all_ce_unused(o->src_index);
 
-- 
1.7.10.526.gb0571

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 2/2] index-v4: document the entry format
  2012-04-27 22:58 ` [PATCH 1/2] unpack-trees: preserve the index file version of original Junio C Hamano
@ 2012-04-27 23:02   ` Junio C Hamano
  2012-04-30 17:20     ` Thomas Rast
  0 siblings, 1 reply; 27+ messages in thread
From: Junio C Hamano @ 2012-04-27 23:02 UTC (permalink / raw)
  To: git

Document the format so that others can learn from and build on top of
the series.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Documentation/technical/index-format.txt |   13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/Documentation/technical/index-format.txt b/Documentation/technical/index-format.txt
index 8930b3f..9d25b30 100644
--- a/Documentation/technical/index-format.txt
+++ b/Documentation/technical/index-format.txt
@@ -113,9 +113,22 @@ GIT index format
     are encoded in 7-bit ASCII and the encoding cannot contain a NUL
     byte (iow, this is a UNIX pathname).
 
+  (Version 4) In version 4, the entry path name is prefix-compressed
+    relative to the path name for the previous entry (the very first
+    entry is encoded as if the path name for the previous entry is an
+    empty string).  At the beginning of an entry, an integer N in the
+    variable width encoding (the same encoding as the offset is encoded
+    for OFS_DELTA pack entries; see pack-format.txt) is stored, followed
+    by a NUL-terminated string S.  Removing N bytes from the end of the
+    path name for the previous entry, and replacing it with the string S
+    yields the path name for this entry.
+
   1-8 nul bytes as necessary to pad the entry to a multiple of eight bytes
   while keeping the name NUL-terminated.
 
+  (Version 4) In version 4, the padding after the pathname does not
+  exist.
+
 == Extensions
 
 === Cached tree
-- 
1.7.10.526.gb0571

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] index-v4: document the entry format
  2012-04-27 23:02   ` [PATCH 2/2] index-v4: document the entry format Junio C Hamano
@ 2012-04-30 17:20     ` Thomas Rast
  2012-05-01  4:00       ` Junio C Hamano
  0 siblings, 1 reply; 27+ messages in thread
From: Thomas Rast @ 2012-04-30 17:20 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Thomas Gummerer, David Michael Barr

Hi Junio,

I seem to have completely missed the earlier series at

  http://thread.gmane.org/gmane.comp.version-control.git/194660

My bad.

Thomas has been working on a prototype converter over the past few days,
with results similar to (but not quite as good as) your numbers

    $ ls -l .git/index*
    -rw-r----- 1 jch eng 25586488 2012-04-03 15:27 .git/index
    -rw-r----- 1 jch eng 14654328 2012-04-03 15:38 .git/index-4

while taking a different approach with different tradeoffs.

Nevertheless...

> +  (Version 4) In version 4, the entry path name is prefix-compressed
> +    relative to the path name for the previous entry (the very first
> +    entry is encoded as if the path name for the previous entry is an
> +    empty string).  At the beginning of an entry, an integer N in the
> +    variable width encoding (the same encoding as the offset is encoded
> +    for OFS_DELTA pack entries; see pack-format.txt) is stored, followed
> +    by a NUL-terminated string S.  Removing N bytes from the end of the
> +    path name for the previous entry, and replacing it with the string S
> +    yields the path name for this entry.
[..]
> +  (Version 4) In version 4, the padding after the pathname does not
> +  exist.

I think there are actually several separate ideas here:

* The prefix compression.  Thomas is not using this idea; we've been
  toying with making the index bisectable (within each directory) for
  fast single-entry lookups, which inherently conflicts with this.  The
  directory-like layout partially achieves the same (elides common path
  components).

* The varint encoding (or offset encoding, but "varint" is something you
  can google :-).  David suggested using it on stat() data, combined
  with zigzag encoding and delta against the first entry in the
  directory, which gives some good compression results.  Profiling will
  have to say whether the extra decoding effort is worth the space
  savings.

* The lack of variable padding, which is a good idea -- in any case I
  seem to remember Shawn complaining about it.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] index-v4: document the entry format
  2012-04-30 17:20     ` Thomas Rast
@ 2012-05-01  4:00       ` Junio C Hamano
  2012-05-01 21:43         ` Thomas Rast
  2012-05-02 15:12         ` Shawn Pearce
  0 siblings, 2 replies; 27+ messages in thread
From: Junio C Hamano @ 2012-05-01  4:00 UTC (permalink / raw)
  To: Thomas Rast; +Cc: git, Thomas Gummerer, David Michael Barr

Thomas Rast <trast@student.ethz.ch> writes:

> I seem to have completely missed the earlier series at
>
>   http://thread.gmane.org/gmane.comp.version-control.git/194660
>
> My bad.
>
> Thomas has been working on a prototype converter over the past few days,
> with results similar to (but not quite as good as) your numbers

The "entry-shrinkage" v4 itself is an afternoon hack (even though it is a
good hack), and any design that would not come close to its result is not
worth considering.  It is good to hear that the student is making progress
learning.

> I think there are actually several separate ideas here:
>
> * The prefix compression.  Thomas is not using this idea; we've been
>   toying with making the index bisectable (within each directory) for
>   fast single-entry lookups, which inherently conflicts with this.  The
>   directory-like layout partially achieves the same (elides common path
>   components).
>
> * The varint encoding (or offset encoding, but "varint" is something you
>   can google :-).  David suggested using it on stat() data, combined
>   with zigzag encoding and delta against the first entry in the
>   directory, which gives some good compression results.  Profiling will
>   have to say whether the extra decoding effort is worth the space
>   savings.
>
> * The lack of variable padding, which is a good idea -- in any case I
>   seem to remember Shawn complaining about it.

I am planning to merge this series early to 'master', before the GSoC
student really starts working on the code, perhaps by this Wednesday. The
earlier parts of this series refactor code to make things easier to
modify, and the later parts of it demonstrate by example both:

 (1) how the backward compatibility must be handled at the design level
     [*1*]; and

 (2) how such a design can be coded cleanly at the implementation level.

The hope is that this will give a solidified base to build whatever new
work on top of (perhaps call it v5). I do not mind David's further work
built on top of this series, but I think the entry-shrinkage design for v4
is good enough as-is. I am afraid that letting the code slushy again at
this point may make your student's work unnecessarily more cumbersome.

How do you want to proceed?


[Footnote]

*1* Here are the minimum requirements.

 - you can read both old and new formats (obviously);

 - by default you write out in the same version you read the original;

 - have a single simple command to explicitly specify what format to
   write out; and

 - make sure that the new format is something older readers can
   reliably notice is new and beyond the version they support

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] index-v4: document the entry format
  2012-05-01  4:00       ` Junio C Hamano
@ 2012-05-01 21:43         ` Thomas Rast
  2012-05-02 15:12         ` Shawn Pearce
  1 sibling, 0 replies; 27+ messages in thread
From: Thomas Rast @ 2012-05-01 21:43 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Thomas Rast, git, Thomas Gummerer, David Michael Barr

Junio C Hamano <gitster@pobox.com> writes:

> I am planning to merge this series early to 'master', before the GSoC
> student really starts working on the code, perhaps by this Wednesday. The
> earlier parts of this series refactor code to make things easier to
> modify, and the later parts of it demonstrate by example both:
>
>  (1) how the backward compatibility must be handled at the design level
>      [*1*]; and
>
>  (2) how such a design can be coded cleanly at the implementation level.
>
> The hope is that this will give a solidified base to build whatever new
> work on top of (perhaps call it v5).
[...]
> How do you want to proceed?

I was initially a bit reluctant to add this complexity so shortly before
the GSoC starts in earnest.  But the cleanups are really worth it, and
then it's not *that* much code for a quite substantial speedup for
webkit.

So go ahead and merge it.  Thomas can build on top, though I'm still
hoping he'll start before you complete the merge, and learn a bit about
basing work on top of unmerged topics ;-)

> I do not mind David's further work built on top of this series, but I
> think the entry-shrinkage design for v4 is good enough as-is.

My impression was that David just tossed around ideas (very
well-researched and tested ones, but still ideas) to help Thomas.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/9] Prefix-compress on-disk index entries
  2012-04-06  8:41     ` David Barr
@ 2012-05-02  1:58       ` Nguyen Thai Ngoc Duy
  2012-05-02  4:26         ` David Barr
  0 siblings, 1 reply; 27+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-05-02  1:58 UTC (permalink / raw)
  To: David Barr; +Cc: Junio C Hamano, git

On Fri, Apr 6, 2012 at 3:41 PM, David Barr <davidbarr@google.com> wrote:
> On Thu, Apr 5, 2012 at 4:44 AM, Junio C Hamano <gitster@pobox.com> wrote:
>> Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes:
>>
>>> On Wed, Apr 4, 2012 at 5:53 AM, Junio C Hamano <gitster@pobox.com> wrote:
>>> ...
>>> I wonder what causes user time drop from .29s to .13s here. I think
>>> the main patch should increase computation, even only slightly, not
>>> less.
>>
>> The main patch reduced the amount of the data needs to be sent to the
>> machinery to checksum and write to disk by about 45%, saving both I/O
>> and computation.
>
> I hacked together a quick patch to try predictive coding the other
> fields of the index. I got a further 34% improvement in size over
> this series. Patches to come. I just used the previous cache entry as
> the predictor and reused varint.h together with zigzag encoding[1].
>
> That's a total improvement in size over v2 of 62%.

Have you posted (and I missed) the patches? I'm interested in seeing
what changes you made.

> [1] https://developers.google.com/protocol-buffers/docs/encoding#types
-- 
Duy

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/9] Prefix-compress on-disk index entries
  2012-05-02  1:58       ` Nguyen Thai Ngoc Duy
@ 2012-05-02  4:26         ` David Barr
  0 siblings, 0 replies; 27+ messages in thread
From: David Barr @ 2012-05-02  4:26 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: Junio C Hamano, git

On Wed, May 2, 2012 at 11:58 AM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> On Fri, Apr 6, 2012 at 3:41 PM, David Barr <davidbarr@google.com> wrote:
>> On Thu, Apr 5, 2012 at 4:44 AM, Junio C Hamano <gitster@pobox.com> wrote:
>>> Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes:
>>>
>>>> On Wed, Apr 4, 2012 at 5:53 AM, Junio C Hamano <gitster@pobox.com> wrote:
>>>> ...
>>>> I wonder what causes user time drop from .29s to .13s here. I think
>>>> the main patch should increase computation, even only slightly, not
>>>> less.
>>>
>>> The main patch reduced the amount of the data needs to be sent to the
>>> machinery to checksum and write to disk by about 45%, saving both I/O
>>> and computation.
>>
>> I hacked together a quick patch to try predictive coding the other
>> fields of the index. I got a further 34% improvement in size over
>> this series. Patches to come. I just used the previous cache entry as
>> the predictor and reused varint.h together with zigzag encoding[1].
>>
>> That's a total improvement in size over v2 of 62%.
>
> Have you posted (and I missed) the patches? I'm interested in seeing
> what changes you made.

I haven't posted anything - my proof of concept was write-only and slow.

I added a prelude with a bitmask that describes which fields differ
with the previous entry.

For each differing field, I encoded something like:
diff := this - prev;
zigzag := (diff << 1) ^ (diff >> 31)
raw := zigzag - 1 /* zero impossible because of mask */
write_varint(raw)

I also experimented with using unique sha1 prefixes but it was slow
and probably introduces race conditions.

>> [1] https://developers.google.com/protocol-buffers/docs/encoding#types
--
David Barr

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] index-v4: document the entry format
  2012-05-01  4:00       ` Junio C Hamano
  2012-05-01 21:43         ` Thomas Rast
@ 2012-05-02 15:12         ` Shawn Pearce
  2012-05-02 17:04           ` Junio C Hamano
  1 sibling, 1 reply; 27+ messages in thread
From: Shawn Pearce @ 2012-05-02 15:12 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Thomas Rast, git, Thomas Gummerer, David Michael Barr

On Mon, Apr 30, 2012 at 9:00 PM, Junio C Hamano <gitster@pobox.com> wrote:
>> * The prefix compression.  Thomas is not using this idea; we've been
>>   toying with making the index bisectable (within each directory) for
>>   fast single-entry lookups, which inherently conflicts with this.  The
>>   directory-like layout partially achieves the same (elides common path
>>   components).
>>
>> * The varint encoding (or offset encoding, but "varint" is something you
>>   can google :-).  David suggested using it on stat() data, combined
>>   with zigzag encoding and delta against the first entry in the
>>   directory, which gives some good compression results.  Profiling will
>>   have to say whether the extra decoding effort is worth the space
>>   savings.
>>
>> * The lack of variable padding, which is a good idea -- in any case I
>>   seem to remember Shawn complaining about it.

I complain about a lot of things. Here is another...

> I am planning to merge this series early to 'master', before the GSoC
> student really starts working on the code, perhaps by this Wednesday. The
> earlier parts of this series refactor code to make things easier to
> modify, and the later parts of it demonstrate by example both:

I think this is a bad idea.

For sake of argument, lets say the GSoC project goes really well, and
the student creates a great implementation of (what is now) index v5.
Lets say we all agree its a great evolution of the format, the
implementation is sane, and there is no reason not to merge it and
make it the default.

If this v4 thing merges to master and you make a release from master,
we are potentially stuck supporting this new v4 format for the next 2
years, along with v5 which we want to immediately replace it. If any
OS distro picks up a release Git that supports v4 but not v5, and
parks it into their stable tree, the rest of the Git ecosystem (e.g.
libgit2, JGit) will be supporting v4 until that OS distro release dies
and all of its users are able to move to a newer distro with a newer
Git version.

Consider my case at $DAY_JOB where we still have Git 1.7.7.3 as the
standard Git. Upstream has already shipped 1.7.10 and is well on its
way to 1.7.11, but the distro choose to freeze on 1.7.7 rather
arbitrarily because that was the latest stable release version at the
time the distro was freezing its package sets for its own release.
Yay.


IMHO, keep this in next to avoid releasing it until we know the
outcome of the GSoC project. The handful of WebKit developers that use
Git that really benefit from index v4 can use it by building and
installing their own next. If they can't work `make install
prefix=$HOME/git`, they might want to reconsider their career and
hobby activities. And we can be sure it won't show up in a distro
release, thereby avoiding us needing us to support what may turn out
to be a dead-end index v4. The GSoC student can build on this topic
until their own work arrives in your tree.

Its only a few months to wait and see where "v5" goes. If v5 is
successful, v4 will just be a minor footnote in the history of Git,
and other tools won't need to support v4, they can go straight to v5.
If v5 fails and we choose to ship and commit to supporting v4, its
only a few months delay. We have had index v2/v3 for years. We (and
our users) can wait a couple of additional months for a format we can
support.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] index-v4: document the entry format
  2012-05-02 15:12         ` Shawn Pearce
@ 2012-05-02 17:04           ` Junio C Hamano
  2012-05-02 17:13             ` Shawn Pearce
  0 siblings, 1 reply; 27+ messages in thread
From: Junio C Hamano @ 2012-05-02 17:04 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Thomas Rast, git, Thomas Gummerer, David Michael Barr

Shawn Pearce <spearce@spearce.org> writes:

> IMHO, keep this in next to avoid releasing it until we know the
> outcome of the GSoC project. The handful of WebKit developers that use
> Git that really benefit from index v4 can use it by building and
> installing their own next.
> ...
> Its only a few months to wait and see where "v5" goes. If v5 is
> successful, v4 will just be a minor footnote in the history of Git,
> and other tools won't need to support v4, they can go straight to v5.

You may not have noticed this, but there is no practical difference
between keeping it in 'next' and releasing it to 'master' from the
third-party tool's point of view.

There is _only_ one way to end up with v4 version of index: running "git
update-index --index-version 4".  When creating a new index, or working in
a repository, starting from an index written in the current version, you
will get v2 (or v3) index (this gentle handling of backward compatibility
comes from later parts of the series).  It is either running that command
or running 'next' version *and* running that command---either way, the
user deliberately has to ask for it, and if a third-party tool like jgit
chooses to ignore v4, it is not the end of the world.  The user opted-in
can run "git update-index --index-version 2" to revert it before using
such a tool.

For a third-party tool, lack of support of v4 is similar to not supporting
a config file that does not record the core.repositoryformatversion, which
the user can manually add it with the editor, and much less serious than
not supporting the v3 version of the index, which the user cannot do much
about it.

I would say that the cost of not merging the refactoring in the earlier
parts of the series and the gentler handling of backward compatibility in
the later parts of the series is much higher.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] index-v4: document the entry format
  2012-05-02 17:04           ` Junio C Hamano
@ 2012-05-02 17:13             ` Shawn Pearce
  0 siblings, 0 replies; 27+ messages in thread
From: Shawn Pearce @ 2012-05-02 17:13 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Thomas Rast, git, Thomas Gummerer, David Michael Barr

On Wed, May 2, 2012 at 10:04 AM, Junio C Hamano <gitster@pobox.com> wrote:
> Shawn Pearce <spearce@spearce.org> writes:
>
>> IMHO, keep this in next to avoid releasing it until we know the
>> outcome of the GSoC project. The handful of WebKit developers that use
>> Git that really benefit from index v4 can use it by building and
>> installing their own next.
>> ...
>> Its only a few months to wait and see where "v5" goes. If v5 is
>> successful, v4 will just be a minor footnote in the history of Git,
>> and other tools won't need to support v4, they can go straight to v5.
>
> You may not have noticed this, but there is no practical difference
> between keeping it in 'next' and releasing it to 'master' from the
> third-party tool's point of view.
>
> There is _only_ one way to end up with v4 version of index: running "git
> update-index --index-version 4".

Thanks, I did miss this in the series. I hereby retract my complaint. :-)

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2012-05-02 17:20 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-04-03 22:53 [PATCH 0/9] Prefix-compress on-disk index entries Junio C Hamano
2012-04-03 22:53 ` [PATCH 1/9] varint: make it available outside the context of pack Junio C Hamano
2012-04-03 22:53 ` [PATCH 2/9] cache.h: hide on-disk index details Junio C Hamano
2012-04-03 22:53 ` [PATCH 3/9] read-cache.c: allow unaligned mapping of the index file Junio C Hamano
2012-04-03 22:53 ` [PATCH 4/9] read-cache.c: make create_from_disk() report number of bytes it consumed Junio C Hamano
2012-04-03 22:53 ` [PATCH 5/9] read-cache.c: report the header version we do not understand Junio C Hamano
2012-04-03 22:53 ` [PATCH 6/9] read-cache.c: move code to copy ondisk to incore cache to a helper function Junio C Hamano
2012-04-03 22:53 ` [PATCH 7/9] read-cache.c: move code to copy incore to ondisk " Junio C Hamano
2012-04-03 22:53 ` [PATCH 8/9] read-cache.c: read prefix-compressed names in index on-disk version v4 Junio C Hamano
2012-04-03 22:53 ` [PATCH 9/9] read-cache.c: write index v4 format Junio C Hamano
2012-04-04  1:44 ` [PATCH 0/9] Prefix-compress on-disk index entries David Barr
2012-04-04 15:33   ` Junio C Hamano
2012-04-04 16:57     ` Junio C Hamano
2012-04-04 16:58       ` [PATCH 2/2] update-index: upgrade/downgrade on-disk index version Junio C Hamano
2012-04-04 12:34 ` [PATCH 0/9] Prefix-compress on-disk index entries Nguyen Thai Ngoc Duy
2012-04-04 18:44   ` Junio C Hamano
2012-04-06  8:41     ` David Barr
2012-05-02  1:58       ` Nguyen Thai Ngoc Duy
2012-05-02  4:26         ` David Barr
2012-04-27 22:58 ` [PATCH 1/2] unpack-trees: preserve the index file version of original Junio C Hamano
2012-04-27 23:02   ` [PATCH 2/2] index-v4: document the entry format Junio C Hamano
2012-04-30 17:20     ` Thomas Rast
2012-05-01  4:00       ` Junio C Hamano
2012-05-01 21:43         ` Thomas Rast
2012-05-02 15:12         ` Shawn Pearce
2012-05-02 17:04           ` Junio C Hamano
2012-05-02 17:13             ` Shawn Pearce

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.