[RFC/PATCH] define the way new representation types are encoded in the pack

* [RFC/PATCH] define the way new representation types are encoded in the pack
@ 2011-10-28  6:04 Junio C Hamano
  2011-10-28  6:12 ` Junio C Hamano
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Junio C Hamano @ 2011-10-28  6:04 UTC (permalink / raw)
  To: git; +Cc: Nicolas Pitre, Shawn O. Pearce, Jeff King

In addition to four basic types (commit, tree, blob and tag), the pack
stream can encode a few other "representation" types, such as REF_DELTA
and OFS_DELTA. As we allocate 3 bits in the first byte for this purpose,
we do not have much room to add new representation types in place, but we
do have one value reserved for future expansion.

This patch is about defining how that reserved value is used.

The first byte in the pack stream data consists of the following for the
current representation types:

 - Bit 0-3 are used for the low 4-bit of "some" size (not necessarily the
   size of the representation);

 - Bit 4-6 are used for object types 0-7, but we have not used type 5 so
   far and reserved it for future expansion (we could also use type 0
   recorded in the pack stream for future expansion, just like how I
   convert 5 into the real "extended" representation type in this patch);

 - Bit 7 is used to signal if the second byte needs to be read for sizes
   that do not fit in the 4-bit.

When bit 4-6 encodes type 5, the first byte is used this way:

 - Bit 0-3 denotes the real "extended" representation type. Because types
   0-7 can already be encoded without using the extended format, we can
   offset the type by 8 (i.e. if bit 0-3 says 3, it means representation
   type 11 = 3 + 8);

 - Bit 4-6 has the value "5";

 - Bit 7 is used to signal if the _third_ byte needs to be read for larger
   size that cannot be represented with 8-bit.

As it is unlikely for us to pack things that do not need to record any
size, the second byte is always used in full to encode the low 8-bit of
the size.

I haven't started using type=8 and upwards for anything yet, but because
we have only one "future expansion" value left, I want us to be extremely
careful in order to avoid painting us into a corner that we cannot get out
of, so I am sending this out early for a preliminary review.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 cache.h     |    3 ++-
 sha1_file.c |   36 ++++++++++++++++++++++++++++++++----
 2 files changed, 34 insertions(+), 5 deletions(-)

diff --git a/cache.h b/cache.h
index 2e6ad36..b02139b 100644
--- a/cache.h
+++ b/cache.h
@@ -380,9 +380,10 @@ enum object_type {
 	OBJ_TREE = 2,
 	OBJ_BLOB = 3,
 	OBJ_TAG = 4,
-	/* 5 for future expansion */
+	OBJ_EXT = 5, /* 5 for future expansion */
 	OBJ_OFS_DELTA = 6,
 	OBJ_REF_DELTA = 7,
+	OBJ_CAT_TREE = 8,
 	OBJ_ANY,
 	OBJ_MAX
 };
diff --git a/sha1_file.c b/sha1_file.c
index 27f3b9b..4dcd023 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1254,16 +1254,43 @@ static int experimental_loose_object(unsigned char *map)
 }
 
 unsigned long unpack_object_header_buffer(const unsigned char *buf,
-		unsigned long len, enum object_type *type, unsigned long *sizep)
+	unsigned long len, enum object_type *typep, unsigned long *sizep)
 {
 	unsigned shift;
 	unsigned long size, c;
 	unsigned long used = 0;
+	enum object_type type;
 
+	/*
+	 * MSB of the first byte is used to tell if the second byte
+	 * needs to be read for the size, so type field is only 3-bit
+	 * wide.
+	 */
 	c = buf[used++];
-	*type = (c >> 4) & 7;
-	size = c & 15;
-	shift = 4;
+	type = (c >> 4) & 7;
+
+	if (type != OBJ_EXT) {
+		/*
+		 * For basic types of object representations, the low
+		 * 4-bit of the first byte is used for the lowermost
+		 * 4-bit of the size. The MSB of the first byte tells
+		 * if the second byte needs to be read for size.
+		 */
+		size = c & 15;
+		shift = 4;
+	} else {
+		/*
+		 * For extended types, the low 4-bit of the first byte
+		 * is used for the representation type (offset by 8),
+		 * and the size begins at the second byte. The MSB of
+		 * the first byte is still used to indicate the next
+		 * byte (i.e. the third byte) needs to be read for the
+		 * size.
+		 */
+		type = (c & 15) + 8;
+		size = buf[used++];
+		shift = 8;
+	}
 	while (c & 0x80) {
 		if (len <= used || bitsizeof(long) <= shift) {
 			error("bad object header");
@@ -1274,6 +1301,7 @@ unsigned long unpack_object_header_buffer(const unsigned char *buf,
 		shift += 7;
 	}
 	*sizep = size;
+	*typep = type;
 	return used;
 }
 

^ permalink raw reply related	[flat|nested] 12+ messages in thread