All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 00/11] writing out a huge blob to working tree
@ 2011-05-16  0:30 Junio C Hamano
  2011-05-16  0:30 ` [PATCH 01/11] packed_object_info_detail(): do not return a string Junio C Hamano
                   ` (14 more replies)
  0 siblings, 15 replies; 49+ messages in thread
From: Junio C Hamano @ 2011-05-16  0:30 UTC (permalink / raw)
  To: git

Traditionally, git always read the full contents of an object in memory
before performing various operations on it, e.g. comparing for diff,
writing it to the working tree, etc.  A huge blob that you cannot fit
in memory was very cumbersome to handle.

Recently "diff" learned to avoid reading the contents only to say "Binary
files differ" when these large blobs are marked as binary. Also there is a
topic cooking to teach "git add" to stream a large file directly to a
packfile without keeping the whole thing in core.

The "checkout" codepath is to learn the trick next, and this is the series
to attempt to do so.  These would apply cleanly on top of three other
topics still in 'next' or 'pu', namely:

 - jc/convert that cleans up the conversion;
 - jc/replacing that cleans up the object replacement;
 - jc/bigfile that teaches "git add" to handle large files.

Patch 1 and 5 are trivial clean-ups and refactoring. These could be
separated out of the series and applied much earlier, but nothing other
than this series directly benefit from these changes, so they are here in
the series.

Patch 2, 3, and 4 enhances the sha1_file layer.

Patch 6 introduces a new API that takes an object name and gives back a
"handle" you can read from (think: FILE *) the contents of the object.
The implementation at this step is deliberately kept simple: it just calls
read_sha1_file() to read everything in memory.

Patch 7 then uses the new API in the "git checkout" codepath, namely, in
entry.c::write_entry() function.  At this point, any blob that does not
require smudge filters including crlf conversion would pass through this
new codepath and used the 'incore' case of the streaming API, which means
that (1) "hold everything in memory and process" limitation is not lifted
yet, and that (2) breakage detected in here would have meant either the
simple 'incore' implementation of the streaming API is broken (not likely),
or its caller streaming_write_entry() is broken (more likely).

Patch 8 teaches the new write-out codepath to detect and make holes in the
resulting file. This is primarily meant to help testing---when you add a
large test file that weighs 1GB with "git add" (see how it is done in the
test t/t1050-large.sh on jc/bigfile topic) and check it out, you do not
want to end up with 1GB file fully populated with real blocks in your
working tree.

Patch 9 teaches the streaming API how to read a non-delta object directly
from packfile, without holding the entire result in the memory. This is
the representation jc/bigfile topic creates for a huge file, and the
primary interest of this topic.

Patch 10 and 11 teaches the streaming API how to read a loose object,
without holding the entire result in the memory. This is not strictly
necessary for the purpose of handling the output from jc/bigfile, but not
having to hold everything in core by itself may be a plus.

Interested parties may want to measure the performance impact of the last
three patches. The series deliberately ignores core.bigfileThreashold and
let small and large blobs alike go through the streaming_write_entry()
codepath, but it _might_ turn out that we would want to use the new code
only for large-ish blobs.


Junio C Hamano (11):
  packed_object_info_detail(): do not return a string
  sha1_object_info_extended(): expose a bit more info
  sha1_object_info_extended(): hint about objects in delta-base cache
  unpack_object_header(): make it public
  write_entry(): separate two helper functions out
  streaming: a new API to read from the object store
  streaming_write_entry(): use streaming API in write_entry()
  streaming_write_entry(): support files with holes
  streaming: read non-delta incrementally from a pack
  sha1_file.c: expose helpers to read loose objects
  streaming: read loose objects incrementally

 Makefile              |    2 +
 builtin/verify-pack.c |    4 +-
 cache.h               |   36 +++++-
 convert.c             |   23 +++
 entry.c               |  111 ++++++++++++---
 sha1_file.c           |   71 ++++++++--
 streaming.c           |  376 +++++++++++++++++++++++++++++++++++++++++++++++++
 streaming.h           |   12 ++
 8 files changed, 600 insertions(+), 35 deletions(-)
 create mode 100644 streaming.c
 create mode 100644 streaming.h

-- 
1.7.5.1.365.g32b65

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH 01/11] packed_object_info_detail(): do not return a string
  2011-05-16  0:30 [PATCH 00/11] writing out a huge blob to working tree Junio C Hamano
@ 2011-05-16  0:30 ` Junio C Hamano
  2011-05-17  0:45   ` Thiago Farina
  2011-05-16  0:30 ` [PATCH 02/11] sha1_object_info_extended(): expose a bit more info Junio C Hamano
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 49+ messages in thread
From: Junio C Hamano @ 2011-05-16  0:30 UTC (permalink / raw)
  To: git

Instead return enum object_name just like everybody else does.
The caller can turn it into a string with typename() easily.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 builtin/verify-pack.c |    4 ++--
 cache.h               |    2 +-
 sha1_file.c           |    4 ++--
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/builtin/verify-pack.c b/builtin/verify-pack.c
index b6079ae..3a919b1 100644
--- a/builtin/verify-pack.c
+++ b/builtin/verify-pack.c
@@ -33,9 +33,9 @@ static void show_pack_info(struct packed_git *p, unsigned int flags)
 		if (!sha1)
 			die("internal error pack-check nth-packed-object");
 		offset = nth_packed_object_offset(p, i);
-		type = packed_object_info_detail(p, offset, &size, &store_size,
+		type = typename(packed_object_info_detail(p, offset, &size, &store_size,
 						 &delta_chain_length,
-						 base_sha1);
+						 base_sha1));
 		if (!stat_only)
 			printf("%s ", sha1_to_hex(sha1));
 		if (!delta_chain_length) {
diff --git a/cache.h b/cache.h
index b1b5bb5..cdb5112 100644
--- a/cache.h
+++ b/cache.h
@@ -1020,7 +1020,7 @@ extern off_t find_pack_entry_one(const unsigned char *, struct packed_git *);
 extern void *unpack_entry(struct packed_git *, off_t, enum object_type *, unsigned long *);
 extern unsigned long unpack_object_header_buffer(const unsigned char *buf, unsigned long len, enum object_type *type, unsigned long *sizep);
 extern unsigned long get_size_from_delta(struct packed_git *, struct pack_window **, off_t);
-extern const char *packed_object_info_detail(struct packed_git *, off_t, unsigned long *, unsigned long *, unsigned int *, unsigned char *);
+extern int packed_object_info_detail(struct packed_git *, off_t, unsigned long *, unsigned long *, unsigned int *, unsigned char *);
 
 /* Dumb servers support */
 extern int update_server_info(int);
diff --git a/sha1_file.c b/sha1_file.c
index 064a330..4f96eb1 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1549,7 +1549,7 @@ static int unpack_object_header(struct packed_git *p,
 	return type;
 }
 
-const char *packed_object_info_detail(struct packed_git *p,
+int packed_object_info_detail(struct packed_git *p,
 				      off_t obj_offset,
 				      unsigned long *size,
 				      unsigned long *store_size,
@@ -1580,7 +1580,7 @@ const char *packed_object_info_detail(struct packed_git *p,
 		case OBJ_BLOB:
 		case OBJ_TAG:
 			unuse_pack(&w_curs);
-			return typename(type);
+			return type;
 		case OBJ_OFS_DELTA:
 			obj_offset = get_delta_base(p, &w_curs, &curpos, type, obj_offset);
 			if (!obj_offset)
-- 
1.7.5.1.365.g32b65

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 02/11] sha1_object_info_extended(): expose a bit more info
  2011-05-16  0:30 [PATCH 00/11] writing out a huge blob to working tree Junio C Hamano
  2011-05-16  0:30 ` [PATCH 01/11] packed_object_info_detail(): do not return a string Junio C Hamano
@ 2011-05-16  0:30 ` Junio C Hamano
  2011-05-16  0:30 ` [PATCH 03/11] sha1_object_info_extended(): hint about objects in delta-base cache Junio C Hamano
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 49+ messages in thread
From: Junio C Hamano @ 2011-05-16  0:30 UTC (permalink / raw)
  To: git

The original interface for sha1_object_info() takes an object name and
gives back a type and its size (the latter is given only when it was
asked).  The new interface wraps its implementation and exposes a bit
more pieces of information that the interface used to discard, namely:

 - where the object is stored (loose? cached? packed?)
 - if packed, where in which packfile?

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 cache.h     |   29 +++++++++++++++++++++++++++++
 sha1_file.c |   46 +++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 68 insertions(+), 7 deletions(-)

diff --git a/cache.h b/cache.h
index cdb5112..dfd2d61 100644
--- a/cache.h
+++ b/cache.h
@@ -1022,6 +1022,35 @@ extern unsigned long unpack_object_header_buffer(const unsigned char *buf, unsig
 extern unsigned long get_size_from_delta(struct packed_git *, struct pack_window **, off_t);
 extern int packed_object_info_detail(struct packed_git *, off_t, unsigned long *, unsigned long *, unsigned int *, unsigned char *);
 
+struct object_info {
+	/* Request */
+	unsigned long *sizep;
+	int want_deltainfo;
+
+	/* Response */
+	enum {
+		OI_CACHED,
+		OI_LOOSE,
+		OI_PACKED
+	} whence;
+	union {
+		/*
+		 * struct {
+		 * 	... Nothing to expose in this case
+		 * } cached;
+		 * struct {
+		 * 	... Nothing to expose in this case
+		 * } loose;
+		 */
+		struct {
+			off_t offset;
+			unsigned int delta;
+			struct packed_git *pack;
+		} packed;
+	} u;
+};
+extern int sha1_object_info_extended(const unsigned char *, struct object_info *);
+
 /* Dumb servers support */
 extern int update_server_info(int);
 
diff --git a/sha1_file.c b/sha1_file.c
index 4f96eb1..9b16cd8 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2093,7 +2093,8 @@ static int sha1_loose_object_info(const unsigned char *sha1, unsigned long *size
 	return status;
 }
 
-int sha1_object_info(const unsigned char *sha1, unsigned long *sizep)
+/* returns enum object_type or negative */
+int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi)
 {
 	struct cached_object *co;
 	struct pack_entry e;
@@ -2101,16 +2102,19 @@ int sha1_object_info(const unsigned char *sha1, unsigned long *sizep)
 
 	co = find_cached_object(sha1);
 	if (co) {
-		if (sizep)
-			*sizep = co->size;
+		if (oi->sizep)
+			*(oi->sizep) = co->size;
+		oi->whence = OI_CACHED;
 		return co->type;
 	}
 
 	if (!find_pack_entry(sha1, &e)) {
 		/* Most likely it's a loose object. */
-		status = sha1_loose_object_info(sha1, sizep);
-		if (status >= 0)
+		status = sha1_loose_object_info(sha1, oi->sizep);
+		if (status >= 0) {
+			oi->whence = OI_LOOSE;
 			return status;
+		}
 
 		/* Not a loose object; someone else may have just packed it. */
 		reprepare_packed_git();
@@ -2118,15 +2122,43 @@ int sha1_object_info(const unsigned char *sha1, unsigned long *sizep)
 			return status;
 	}
 
-	status = packed_object_info(e.p, e.offset, sizep);
+	if (!oi->want_deltainfo) {
+		status = packed_object_info(e.p, e.offset, oi->sizep);
+	} else {
+		unsigned long size, store_size;
+		unsigned int delta_chain_length;
+		unsigned char base_sha1[20];
+		status = packed_object_info_detail(e.p, e.offset,
+						   &size, &store_size,
+						   &delta_chain_length,
+						   base_sha1);
+		if (0 <= status) {
+			if (oi->sizep)
+				*oi->sizep = size;
+			oi->u.packed.delta = delta_chain_length;
+		}
+	}
 	if (status < 0) {
 		mark_bad_packed_object(e.p, sha1);
-		status = sha1_object_info(sha1, sizep);
+		status = sha1_object_info_extended(sha1, oi);
+	} else {
+		oi->whence = OI_PACKED;
+		oi->u.packed.offset = e.offset;
+		oi->u.packed.pack = e.p;
 	}
 
 	return status;
 }
 
+int sha1_object_info(const unsigned char *sha1, unsigned long *sizep)
+{
+	struct object_info oi;
+
+	oi.sizep = sizep;
+	oi.want_deltainfo = 0;
+	return sha1_object_info_extended(sha1, &oi);
+}
+
 static void *read_packed_sha1(const unsigned char *sha1,
 			      enum object_type *type, unsigned long *size)
 {
-- 
1.7.5.1.365.g32b65

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 03/11] sha1_object_info_extended(): hint about objects in delta-base cache
  2011-05-16  0:30 [PATCH 00/11] writing out a huge blob to working tree Junio C Hamano
  2011-05-16  0:30 ` [PATCH 01/11] packed_object_info_detail(): do not return a string Junio C Hamano
  2011-05-16  0:30 ` [PATCH 02/11] sha1_object_info_extended(): expose a bit more info Junio C Hamano
@ 2011-05-16  0:30 ` Junio C Hamano
  2011-05-16  0:40   ` Shawn Pearce
  2011-05-16  0:30 ` [PATCH 04/11] unpack_object_header(): make it public Junio C Hamano
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 49+ messages in thread
From: Junio C Hamano @ 2011-05-16  0:30 UTC (permalink / raw)
  To: git

An object found in the delta-base cache is not guaranteed to
stay there, but we know it came from a pack and it is likely
to give us a quick access if we read_sha1_file() it right now,
which is a piece of useful information.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 cache.h     |    3 ++-
 sha1_file.c |    9 +++++++++
 2 files changed, 11 insertions(+), 1 deletions(-)

diff --git a/cache.h b/cache.h
index dfd2d61..7511b07 100644
--- a/cache.h
+++ b/cache.h
@@ -1031,7 +1031,8 @@ struct object_info {
 	enum {
 		OI_CACHED,
 		OI_LOOSE,
-		OI_PACKED
+		OI_PACKED,
+		OI_DBCACHED,
 	} whence;
 	union {
 		/*
diff --git a/sha1_file.c b/sha1_file.c
index 9b16cd8..4d1d3ef 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1695,6 +1695,13 @@ static unsigned long pack_entry_hash(struct packed_git *p, off_t base_offset)
 	return hash % MAX_DELTA_CACHE;
 }
 
+static int in_delta_base_cache(struct packed_git *p, off_t base_offset)
+{
+	unsigned long hash = pack_entry_hash(p, base_offset);
+	struct delta_base_cache_entry *ent = delta_base_cache + hash;
+	return (ent->data && ent->p == p && ent->base_offset == base_offset);
+}
+
 static void *cache_or_unpack_entry(struct packed_git *p, off_t base_offset,
 	unsigned long *base_size, enum object_type *type, int keep_cache)
 {
@@ -2141,6 +2148,8 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi)
 	if (status < 0) {
 		mark_bad_packed_object(e.p, sha1);
 		status = sha1_object_info_extended(sha1, oi);
+	} else if (in_delta_base_cache(e.p, e.offset)) {
+		oi->whence = OI_CACHED;
 	} else {
 		oi->whence = OI_PACKED;
 		oi->u.packed.offset = e.offset;
-- 
1.7.5.1.365.g32b65

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 04/11] unpack_object_header(): make it public
  2011-05-16  0:30 [PATCH 00/11] writing out a huge blob to working tree Junio C Hamano
                   ` (2 preceding siblings ...)
  2011-05-16  0:30 ` [PATCH 03/11] sha1_object_info_extended(): hint about objects in delta-base cache Junio C Hamano
@ 2011-05-16  0:30 ` Junio C Hamano
  2011-05-16  0:30 ` [PATCH 05/11] write_entry(): separate two helper functions out Junio C Hamano
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 49+ messages in thread
From: Junio C Hamano @ 2011-05-16  0:30 UTC (permalink / raw)
  To: git

This function is used to read and skip over the per-object header
in a packfile.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 cache.h     |    1 +
 sha1_file.c |    8 ++++----
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/cache.h b/cache.h
index 7511b07..39c09b2 100644
--- a/cache.h
+++ b/cache.h
@@ -1021,6 +1021,7 @@ extern void *unpack_entry(struct packed_git *, off_t, enum object_type *, unsign
 extern unsigned long unpack_object_header_buffer(const unsigned char *buf, unsigned long len, enum object_type *type, unsigned long *sizep);
 extern unsigned long get_size_from_delta(struct packed_git *, struct pack_window **, off_t);
 extern int packed_object_info_detail(struct packed_git *, off_t, unsigned long *, unsigned long *, unsigned int *, unsigned char *);
+extern int unpack_object_header(struct packed_git *, struct pack_window **, off_t *, unsigned long *);
 
 struct object_info {
 	/* Request */
diff --git a/sha1_file.c b/sha1_file.c
index 4d1d3ef..b0d82d9 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1523,10 +1523,10 @@ static int packed_delta_info(struct packed_git *p,
 	return type;
 }
 
-static int unpack_object_header(struct packed_git *p,
-				struct pack_window **w_curs,
-				off_t *curpos,
-				unsigned long *sizep)
+int unpack_object_header(struct packed_git *p,
+			 struct pack_window **w_curs,
+			 off_t *curpos,
+			 unsigned long *sizep)
 {
 	unsigned char *base;
 	unsigned int left;
-- 
1.7.5.1.365.g32b65

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 05/11] write_entry(): separate two helper functions out
  2011-05-16  0:30 [PATCH 00/11] writing out a huge blob to working tree Junio C Hamano
                   ` (3 preceding siblings ...)
  2011-05-16  0:30 ` [PATCH 04/11] unpack_object_header(): make it public Junio C Hamano
@ 2011-05-16  0:30 ` Junio C Hamano
  2011-05-16  0:30 ` [PATCH 06/11] streaming: a new API to read from the object store Junio C Hamano
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 49+ messages in thread
From: Junio C Hamano @ 2011-05-16  0:30 UTC (permalink / raw)
  To: git

In the write-out codepath, a block of code determines what file in the
working tree to write to, and opens an output file descriptor to it.

After writing the contents out to the file, another block of code runs
fstat() on the file descriptor when appropriate.

Separate these blocks out to open_output_fd() and fstat_output()
helper functions.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 entry.c |   43 ++++++++++++++++++++++++++-----------------
 1 files changed, 26 insertions(+), 17 deletions(-)

diff --git a/entry.c b/entry.c
index b017167..cc6502a 100644
--- a/entry.c
+++ b/entry.c
@@ -91,6 +91,29 @@ static void *read_blob_entry(struct cache_entry *ce, unsigned long *size)
 	return NULL;
 }
 
+static int open_output_fd(char *path, struct cache_entry *ce, int to_tempfile)
+{
+	int symlink = (ce->ce_mode & S_IFMT) != S_IFREG;
+	if (to_tempfile) {
+		strcpy(path, symlink
+		       ? ".merge_link_XXXXXX" : ".merge_file_XXXXXX");
+		return mkstemp(path);
+	} else {
+		return create_file(path, !symlink ? ce->ce_mode : 0666);
+	}
+}
+
+static int fstat_output(int fd, const struct checkout *state, struct stat *st)
+{
+	/* use fstat() only when path == ce->name */
+	if (fstat_is_reliable() &&
+	    state->refresh_cache && !state->base_dir_len) {
+		fstat(fd, st);
+		return 1;
+	}
+	return 0;
+}
+
 static int write_entry(struct cache_entry *ce, char *path, const struct checkout *state, int to_tempfile)
 {
 	unsigned int ce_mode_s_ifmt = ce->ce_mode & S_IFMT;
@@ -128,17 +151,7 @@ static int write_entry(struct cache_entry *ce, char *path, const struct checkout
 			size = newsize;
 		}
 
-		if (to_tempfile) {
-			if (ce_mode_s_ifmt == S_IFREG)
-				strcpy(path, ".merge_file_XXXXXX");
-			else
-				strcpy(path, ".merge_link_XXXXXX");
-			fd = mkstemp(path);
-		} else if (ce_mode_s_ifmt == S_IFREG) {
-			fd = create_file(path, ce->ce_mode);
-		} else {
-			fd = create_file(path, 0666);
-		}
+		fd = open_output_fd(path, ce, to_tempfile);
 		if (fd < 0) {
 			free(new);
 			return error("unable to create file %s (%s)",
@@ -146,12 +159,8 @@ static int write_entry(struct cache_entry *ce, char *path, const struct checkout
 		}
 
 		wrote = write_in_full(fd, new, size);
-		/* use fstat() only when path == ce->name */
-		if (fstat_is_reliable() &&
-		    state->refresh_cache && !to_tempfile && !state->base_dir_len) {
-			fstat(fd, &st);
-			fstat_done = 1;
-		}
+		if (!to_tempfile)
+			fstat_done = fstat_output(fd, state, &st);
 		close(fd);
 		free(new);
 		if (wrote != size)
-- 
1.7.5.1.365.g32b65

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 06/11] streaming: a new API to read from the object store
  2011-05-16  0:30 [PATCH 00/11] writing out a huge blob to working tree Junio C Hamano
                   ` (4 preceding siblings ...)
  2011-05-16  0:30 ` [PATCH 05/11] write_entry(): separate two helper functions out Junio C Hamano
@ 2011-05-16  0:30 ` Junio C Hamano
  2011-05-18  8:09   ` Jeff King
  2011-05-16  0:30 ` [PATCH 07/11] streaming_write_entry(): use streaming API in write_entry() Junio C Hamano
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 49+ messages in thread
From: Junio C Hamano @ 2011-05-16  0:30 UTC (permalink / raw)
  To: git

Given an object name, use open_istream() to get a git_istream handle
that you can read_istream() from as if you are using read(2) to read
the contents of the object, and close it with close_istream() when
you are done.

Currently, we do not do anything fancy--it just calls read_sha1_file()
and keeps the contents in memory as a whole, and carve it out as you
request with read_istream().

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Makefile    |    2 +
 streaming.c |  196 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 streaming.h |   12 ++++
 3 files changed, 210 insertions(+), 0 deletions(-)
 create mode 100644 streaming.c
 create mode 100644 streaming.h

diff --git a/Makefile b/Makefile
index 320ccc7..83bd539 100644
--- a/Makefile
+++ b/Makefile
@@ -552,6 +552,7 @@ LIB_H += sha1-lookup.h
 LIB_H += sideband.h
 LIB_H += sigchain.h
 LIB_H += strbuf.h
+LIB_H += streaming.h
 LIB_H += string-list.h
 LIB_H += submodule.h
 LIB_H += tag.h
@@ -657,6 +658,7 @@ LIB_OBJS += shallow.o
 LIB_OBJS += sideband.o
 LIB_OBJS += sigchain.o
 LIB_OBJS += strbuf.o
+LIB_OBJS += streaming.o
 LIB_OBJS += string-list.o
 LIB_OBJS += submodule.o
 LIB_OBJS += symlinks.o
diff --git a/streaming.c b/streaming.c
new file mode 100644
index 0000000..e03a374
--- /dev/null
+++ b/streaming.c
@@ -0,0 +1,196 @@
+#include "cache.h"
+#include "streaming.h"
+
+enum input_source {
+	stream_error = -1,
+	incore = 0,
+	loose = 1,
+	pack_non_delta = 2
+};
+
+typedef int (*open_istream_fn)(struct git_istream *,
+			       struct object_info *, const unsigned char *,
+			       enum object_type *, unsigned long *);
+typedef int (*close_istream_fn)(struct git_istream *);
+typedef ssize_t (*read_istream_fn)(struct git_istream *, char *, size_t);
+
+struct stream_vtbl {
+	close_istream_fn close;
+	read_istream_fn read;
+};
+
+#define open_method_decl(name) \
+	int open_istream_ ##name \
+	(struct git_istream *st, struct object_info *oi, \
+	 const unsigned char *sha1, \
+	 enum object_type *type, unsigned long *sz)
+
+#define close_method_decl(name) \
+	int close_istream_ ##name \
+	(struct git_istream *st)
+
+#define read_method_decl(name) \
+	ssize_t read_istream_ ##name \
+	(struct git_istream *st, char *buf, size_t sz)
+
+/* forward declaration */
+static open_method_decl(incore);
+static open_method_decl(loose);
+static open_method_decl(pack_non_delta);
+
+static open_istream_fn open_istream_tbl[] = {
+	open_istream_incore,
+	open_istream_loose,
+	open_istream_pack_non_delta,
+};
+
+struct git_istream {
+	enum input_source source;
+	const struct stream_vtbl *vtbl;
+	union {
+		struct {
+			char *buf; /* from read_object() */
+			unsigned long sz;
+			unsigned long read_ptr;
+		} incore;
+
+		struct {
+			int fd; /* open for reading */
+			/* NEEDSWORK: what else? */
+		} loose;
+
+		struct {
+			int fd; /* open for reading */
+			/* NEEDSWORK: what else? */
+		} in_pack;
+	} u;
+};
+
+int close_istream(struct git_istream *st)
+{
+	return st->vtbl->close(st);
+}
+
+ssize_t read_istream(struct git_istream *st, char *buf, size_t sz)
+{
+	return st->vtbl->read(st, buf, sz);
+}
+
+static enum input_source istream_source(const unsigned char *sha1,
+					enum object_type *type,
+					struct object_info *oi)
+{
+	unsigned long size;
+	int status;
+
+	oi->sizep = &size;
+	oi->want_deltainfo = 1;
+
+	status = sha1_object_info_extended(sha1, oi);
+	if (status < 0)
+		return stream_error;
+	*type = status;
+
+	switch (oi->whence) {
+	case OI_LOOSE:
+		return loose;
+	case OI_PACKED:
+		if (!oi->u.packed.delta && big_file_threshold <= size)
+			return pack_non_delta;
+		/* fallthru */
+	default:
+		return incore;
+	}
+}
+
+struct git_istream *open_istream(const unsigned char *sha1,
+				 enum object_type *type,
+				 unsigned long *sz)
+{
+	struct git_istream *st;
+	struct object_info oi;
+	const unsigned char *real = lookup_replace_object(sha1);
+	enum input_source src = istream_source(real, type, &oi);
+
+	if (src < 0)
+		return NULL;
+
+	st = xmalloc(sizeof(*st));
+	st->source = src;
+	if (open_istream_tbl[src](st, &oi, real, type, sz)) {
+		if (open_istream_incore(st, &oi, real, type, sz)) {
+			free(st);
+			st = NULL;
+		}
+	}
+	return st;
+}
+
+/*****************************************************************
+ *
+ * Loose object stream
+ *
+ *****************************************************************/
+
+static open_method_decl(loose)
+{
+	return -1; /* for now */
+}
+
+
+/*****************************************************************
+ *
+ * Non-delta packed object stream
+ *
+ *****************************************************************/
+
+static open_method_decl(pack_non_delta)
+{
+	return -1; /* for now */
+}
+
+
+/*****************************************************************
+ *
+ * In-core stream
+ *
+ *****************************************************************/
+
+static close_method_decl(incore)
+{
+	free(st->u.incore.buf);
+	return 0;
+}
+
+static read_method_decl(incore)
+{
+	size_t read_size = sz;
+	size_t remainder = st->u.incore.sz - st->u.incore.read_ptr;
+
+	if (remainder <= read_size)
+		read_size = remainder;
+	if (read_size) {
+		memcpy(buf, st->u.incore.buf + st->u.incore.read_ptr, read_size);
+		st->u.incore.read_ptr += read_size;
+	}
+	return read_size;
+}
+
+static struct stream_vtbl incore_vtbl = {
+	close_istream_incore,
+	read_istream_incore,
+};
+
+static open_method_decl(incore)
+{
+	st->u.incore.buf = read_sha1_file_extended(sha1, type, sz, 0);
+	st->u.incore.read_ptr = 0;
+	st->u.incore.sz = *sz;
+	st->vtbl = &incore_vtbl;
+
+	if (!st->u.incore.buf) {
+		free(st->u.incore.buf);
+		return -1;
+	}
+	return 0;
+}
diff --git a/streaming.h b/streaming.h
new file mode 100644
index 0000000..dff5e19
--- /dev/null
+++ b/streaming.h
@@ -0,0 +1,12 @@
+#ifndef STREAMING_H
+#define STREAMING_H 1
+#include "cache.h"
+
+/* opaque */
+struct git_istream;
+
+extern struct git_istream *open_istream(const unsigned char *, enum object_type *, unsigned long *);
+extern int close_istream(struct git_istream *);
+extern ssize_t read_istream(struct git_istream *, char *, size_t);
+
+#endif /* STREAMING_H */
-- 
1.7.5.1.365.g32b65

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 07/11] streaming_write_entry(): use streaming API in write_entry()
  2011-05-16  0:30 [PATCH 00/11] writing out a huge blob to working tree Junio C Hamano
                   ` (5 preceding siblings ...)
  2011-05-16  0:30 ` [PATCH 06/11] streaming: a new API to read from the object store Junio C Hamano
@ 2011-05-16  0:30 ` Junio C Hamano
  2011-05-16  0:30 ` [PATCH 08/11] streaming_write_entry(): support files with holes Junio C Hamano
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 49+ messages in thread
From: Junio C Hamano @ 2011-05-16  0:30 UTC (permalink / raw)
  To: git

When the output to a path does not have to be converted, we can read from
the object database from the streaming API and write to the file in the
working tree, without having to hold everything in the memory.

The ident, auto- and safe- crlf conversions inherently require you to read
the whole thing before deciding what to do, so while it is technically
possible to support them by using a buffer of an unbound size or rewinding
and reading the stream twice, it is less practical than the traditional
"read the whole thing in core and convert" approach.

Adding streaming filters for the other conversions on top of this should
be doable by tweaking the can_bypass_conversion() function (it should be
renamed to can_filter_stream() when it happens). Then the streaming API
can be extended to wrap the git_istream streaming_write_entry() opens on
the underlying object in another git_istream that reads from it, filters
what is read, and let the streaming_write_entry() read the filtered
result. But that is outside the scope of this series.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 cache.h   |    1 +
 convert.c |   23 +++++++++++++++++++++++
 entry.c   |   52 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 76 insertions(+), 0 deletions(-)

diff --git a/cache.h b/cache.h
index 39c09b2..39e53c8 100644
--- a/cache.h
+++ b/cache.h
@@ -1157,6 +1157,7 @@ extern int convert_to_git(const char *path, const char *src, size_t len,
                           struct strbuf *dst, enum safe_crlf checksafe);
 extern int convert_to_working_tree(const char *path, const char *src, size_t len, struct strbuf *dst);
 extern int renormalize_buffer(const char *path, const char *src, size_t len, struct strbuf *dst);
+extern int can_bypass_conversion(const char *path);
 
 /* add */
 /*
diff --git a/convert.c b/convert.c
index efc7e07..d3c0041 100644
--- a/convert.c
+++ b/convert.c
@@ -813,3 +813,26 @@ int renormalize_buffer(const char *path, const char *src, size_t len, struct str
 	}
 	return ret | convert_to_git(path, src, len, dst, 0);
 }
+
+/*
+ * You would be crazy to set CRLF, smuge/clean or ident to
+ * a large binary blob you would want us not to slurp into
+ * the memory!
+ */
+int can_bypass_conversion(const char *path)
+{
+	struct conv_attrs ca;
+	enum crlf_action crlf_action;
+
+	convert_attrs(&ca, path);
+
+	if (ca.ident ||
+	    (ca.drv && (ca.drv->smudge || ca.drv->clean)))
+		return 0;
+
+	crlf_action = input_crlf_action(ca.crlf_action, ca.eol_attr);
+	if ((crlf_action == CRLF_BINARY) ||
+	    (crlf_action == CRLF_GUESS && auto_crlf == AUTO_CRLF_FALSE))
+		return 1;
+	return 0;
+}
diff --git a/entry.c b/entry.c
index cc6502a..7733a6b 100644
--- a/entry.c
+++ b/entry.c
@@ -1,6 +1,7 @@
 #include "cache.h"
 #include "blob.h"
 #include "dir.h"
+#include "streaming.h"
 
 static void create_directories(const char *path, int path_len,
 			       const struct checkout *state)
@@ -114,6 +115,50 @@ static int fstat_output(int fd, const struct checkout *state, struct stat *st)
 	return 0;
 }
 
+static int streaming_write_entry(struct cache_entry *ce, char *path,
+				 const struct checkout *state, int to_tempfile,
+				 int *fstat_done, struct stat *statbuf)
+{
+	struct git_istream *st;
+	enum object_type type;
+	unsigned long sz;
+	int result = -1;
+	int fd = -1;
+
+	st = open_istream(ce->sha1, &type, &sz);
+	if (!st)
+		return -1;
+	if (type != OBJ_BLOB)
+		goto close_and_exit;
+
+	fd = open_output_fd(path, ce, to_tempfile);
+	if (fd < 0)
+		goto close_and_exit;
+
+	for (;;) {
+		char buf[10240];
+		ssize_t wrote;
+		ssize_t readlen = read_istream(st, buf, sizeof(buf));
+
+		if (!readlen)
+			break;
+
+		wrote = write_in_full(fd, buf, readlen);
+
+		if (wrote != readlen)
+			goto close_and_exit;
+	}
+	*fstat_done = fstat_output(fd, state, statbuf);
+
+close_and_exit:
+	close_istream(st);
+	if (0 <= fd)
+		result |= close(fd);
+	if (result && 0 <= fd)
+		unlink(path);
+	return result;
+}
+
 static int write_entry(struct cache_entry *ce, char *path, const struct checkout *state, int to_tempfile)
 {
 	unsigned int ce_mode_s_ifmt = ce->ce_mode & S_IFMT;
@@ -124,6 +169,12 @@ static int write_entry(struct cache_entry *ce, char *path, const struct checkout
 	size_t wrote, newsize = 0;
 	struct stat st;
 
+	if ((ce_mode_s_ifmt == S_IFREG) &&
+	    can_bypass_conversion(path) &&
+	    !streaming_write_entry(ce, path, state, to_tempfile,
+				   &fstat_done, &st))
+		goto finish;
+
 	switch (ce_mode_s_ifmt) {
 	case S_IFREG:
 	case S_IFLNK:
@@ -176,6 +227,7 @@ static int write_entry(struct cache_entry *ce, char *path, const struct checkout
 		return error("unknown file mode for %s in index", path);
 	}
 
+finish:
 	if (state->refresh_cache) {
 		if (!fstat_done)
 			lstat(ce->name, &st);
-- 
1.7.5.1.365.g32b65

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 08/11] streaming_write_entry(): support files with holes
  2011-05-16  0:30 [PATCH 00/11] writing out a huge blob to working tree Junio C Hamano
                   ` (6 preceding siblings ...)
  2011-05-16  0:30 ` [PATCH 07/11] streaming_write_entry(): use streaming API in write_entry() Junio C Hamano
@ 2011-05-16  0:30 ` Junio C Hamano
  2011-05-16 10:53   ` Nguyen Thai Ngoc Duy
  2011-05-16 13:03   ` Thiago Farina
  2011-05-16  0:30 ` [PATCH 09/11] streaming: read non-delta incrementally from a pack Junio C Hamano
                   ` (6 subsequent siblings)
  14 siblings, 2 replies; 49+ messages in thread
From: Junio C Hamano @ 2011-05-16  0:30 UTC (permalink / raw)
  To: git

One typical use of a large binary file is to hold a sparse on-disk hash
table with a lot of holes. Help preserving the holes with lseek().

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 entry.c |   18 +++++++++++++++++-
 1 files changed, 17 insertions(+), 1 deletions(-)

diff --git a/entry.c b/entry.c
index 7733a6b..d50e388 100644
--- a/entry.c
+++ b/entry.c
@@ -123,6 +123,7 @@ static int streaming_write_entry(struct cache_entry *ce, char *path,
 	enum object_type type;
 	unsigned long sz;
 	int result = -1;
+	ssize_t kept = 0;
 	int fd = -1;
 
 	st = open_istream(ce->sha1, &type, &sz);
@@ -137,17 +138,32 @@ static int streaming_write_entry(struct cache_entry *ce, char *path,
 
 	for (;;) {
 		char buf[10240];
-		ssize_t wrote;
+		ssize_t wrote, holeto;
 		ssize_t readlen = read_istream(st, buf, sizeof(buf));
 
 		if (!readlen)
 			break;
+		for (holeto = 0; holeto < readlen; holeto++)
+			if (buf[holeto])
+				break;
+		if (readlen == holeto) {
+			kept += holeto;
+			continue;
+		}
 
+		if (kept && lseek(fd, kept, SEEK_CUR) == (off_t) -1)
+			goto close_and_exit;
+		else
+			kept = 0;
 		wrote = write_in_full(fd, buf, readlen);
 
 		if (wrote != readlen)
 			goto close_and_exit;
 	}
+	if (kept &&
+	    ((  lseek(fd, kept - 1, SEEK_CUR) == (off_t) -1)
+	     || write(fd, "", 1) != 1))
+		goto close_and_exit;
 	*fstat_done = fstat_output(fd, state, statbuf);
 
 close_and_exit:
-- 
1.7.5.1.365.g32b65

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 09/11] streaming: read non-delta incrementally from a pack
  2011-05-16  0:30 [PATCH 00/11] writing out a huge blob to working tree Junio C Hamano
                   ` (7 preceding siblings ...)
  2011-05-16  0:30 ` [PATCH 08/11] streaming_write_entry(): support files with holes Junio C Hamano
@ 2011-05-16  0:30 ` Junio C Hamano
  2011-05-16  0:58   ` Shawn Pearce
  2011-05-16  0:30 ` [PATCH 10/11] sha1_file.c: expose helpers to read loose objects Junio C Hamano
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 49+ messages in thread
From: Junio C Hamano @ 2011-05-16  0:30 UTC (permalink / raw)
  To: git

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 streaming.c |  107 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 104 insertions(+), 3 deletions(-)

diff --git a/streaming.c b/streaming.c
index e03a374..0f85bd8 100644
--- a/streaming.c
+++ b/streaming.c
@@ -60,10 +60,13 @@ struct git_istream {
 		} loose;
 
 		struct {
-			int fd; /* open for reading */
-			/* NEEDSWORK: what else? */
+			struct packed_git *pack;
+			off_t pos;
+			unsigned long sz;
 		} in_pack;
 	} u;
+	z_stream z;
+	enum { z_unused, z_used, z_done, z_error } z_state;
 };
 
 int close_istream(struct git_istream *st)
@@ -126,6 +129,20 @@ struct git_istream *open_istream(const unsigned char *sha1,
 	return st;
 }
 
+
+/*****************************************************************
+ *
+ * Common helpers
+ *
+ *****************************************************************/
+
+static void close_deflated_stream(struct git_istream *st)
+{
+	if (st->z_state == z_used)
+		git_inflate_end(&st->z);
+}
+
+
 /*****************************************************************
  *
  * Loose object stream
@@ -144,9 +161,93 @@ static open_method_decl(loose)
  *
  *****************************************************************/
 
+static read_method_decl(pack_non_delta)
+{
+	size_t total_read = 0;
+
+	switch (st->z_state) {
+	case z_unused:
+		memset(&st->z, 0, sizeof(st->z));
+		git_inflate_init(&st->z);
+		st->z_state = z_used;
+		break;
+	case z_done:
+		return 0;
+	case z_error:
+		return -1;
+	case z_used:
+		break;
+	}
+
+	while (total_read < sz) {
+		int status;
+		struct pack_window *window = NULL;
+		unsigned char *mapped;
+
+		mapped = use_pack(st->u.in_pack.pack, &window,
+				  st->u.in_pack.pos, &st->z.avail_in);
+
+		st->z.next_out = (unsigned char *)buf + total_read;
+		st->z.avail_out = sz - total_read;
+		st->z.next_in = mapped;
+		status = git_inflate(&st->z, Z_FINISH);
+
+		st->u.in_pack.pos += st->z.next_in - mapped;
+		total_read = st->z.next_out - (unsigned char *)buf;
+		unuse_pack(&window);
+
+		if (status == Z_STREAM_END) {
+			git_inflate_end(&st->z);
+			st->z_state = z_done;
+			break;
+		}
+		if (status != Z_OK && status != Z_BUF_ERROR) {
+			git_inflate_end(&st->z);
+			st->z_state = z_error;
+			return -1;
+		}
+	}
+	return total_read;
+}
+
+static close_method_decl(pack_non_delta)
+{
+	close_deflated_stream(st);
+	return 0;
+}
+
+static struct stream_vtbl pack_non_delta_vtbl = {
+	close_istream_pack_non_delta,
+	read_istream_pack_non_delta,
+};
+
 static open_method_decl(pack_non_delta)
 {
-	return -1; /* for now */
+	struct pack_window *window;
+	enum object_type in_pack_type;
+
+	st->u.in_pack.pack = oi->u.packed.pack;
+	st->u.in_pack.pos = oi->u.packed.offset;
+	window = NULL;
+
+	in_pack_type = unpack_object_header(st->u.in_pack.pack,
+					    &window,
+					    &st->u.in_pack.pos,
+					    &st->u.in_pack.sz);
+	unuse_pack(&window);
+	switch (in_pack_type) {
+	default:
+		return -1; /* we do not do deltas for now */
+	case OBJ_COMMIT:
+	case OBJ_TREE:
+	case OBJ_BLOB:
+	case OBJ_TAG:
+		break;
+	}
+
+	st->z_state = z_unused;
+	st->vtbl = &pack_non_delta_vtbl;
+	return 0;
 }
 
 
-- 
1.7.5.1.365.g32b65

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 10/11] sha1_file.c: expose helpers to read loose objects
  2011-05-16  0:30 [PATCH 00/11] writing out a huge blob to working tree Junio C Hamano
                   ` (8 preceding siblings ...)
  2011-05-16  0:30 ` [PATCH 09/11] streaming: read non-delta incrementally from a pack Junio C Hamano
@ 2011-05-16  0:30 ` Junio C Hamano
  2011-05-16  0:30 ` [PATCH 11/11] streaming: read loose objects incrementally Junio C Hamano
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 49+ messages in thread
From: Junio C Hamano @ 2011-05-16  0:30 UTC (permalink / raw)
  To: git

Make map_sha1_file() and unpack_sha1_header() available to the streaming
read API by exporting them via cache.h header file.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 cache.h     |    2 ++
 sha1_file.c |    4 ++--
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/cache.h b/cache.h
index 39e53c8..3d3d683 100644
--- a/cache.h
+++ b/cache.h
@@ -780,6 +780,8 @@ extern int hash_sha1_file(const void *buf, unsigned long len, const char *type,
 extern int write_sha1_file(const void *buf, unsigned long len, const char *type, unsigned char *return_sha1);
 extern int pretend_sha1_file(void *, unsigned long, enum object_type, unsigned char *);
 extern int force_object_loose(const unsigned char *sha1, time_t mtime);
+extern void *map_sha1_file(const unsigned char *sha1, unsigned long *size);
+extern int unpack_sha1_header(z_stream *stream, unsigned char *map, unsigned long mapsize, void *buffer, unsigned long bufsiz);
 
 /* global flag to enable extra checks when accessing packed objects */
 extern int do_check_packed_object_crc;
diff --git a/sha1_file.c b/sha1_file.c
index b0d82d9..b68a308 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1186,7 +1186,7 @@ static int open_sha1_file(const unsigned char *sha1)
 	return -1;
 }
 
-static void *map_sha1_file(const unsigned char *sha1, unsigned long *size)
+void *map_sha1_file(const unsigned char *sha1, unsigned long *size)
 {
 	void *map;
 	int fd;
@@ -1245,7 +1245,7 @@ unsigned long unpack_object_header_buffer(const unsigned char *buf,
 	return used;
 }
 
-static int unpack_sha1_header(z_stream *stream, unsigned char *map, unsigned long mapsize, void *buffer, unsigned long bufsiz)
+int unpack_sha1_header(z_stream *stream, unsigned char *map, unsigned long mapsize, void *buffer, unsigned long bufsiz)
 {
 	unsigned long size, used;
 	static const char valid_loose_object_type[8] = {
-- 
1.7.5.1.365.g32b65

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 11/11] streaming: read loose objects incrementally
  2011-05-16  0:30 [PATCH 00/11] writing out a huge blob to working tree Junio C Hamano
                   ` (9 preceding siblings ...)
  2011-05-16  0:30 ` [PATCH 10/11] sha1_file.c: expose helpers to read loose objects Junio C Hamano
@ 2011-05-16  0:30 ` Junio C Hamano
  2011-05-16  0:47 ` [PATCH 00/11] writing out a huge blob to working tree Shawn Pearce
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 49+ messages in thread
From: Junio C Hamano @ 2011-05-16  0:30 UTC (permalink / raw)
  To: git

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 streaming.c |   85 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 82 insertions(+), 3 deletions(-)

diff --git a/streaming.c b/streaming.c
index 0f85bd8..e732337 100644
--- a/streaming.c
+++ b/streaming.c
@@ -55,8 +55,11 @@ struct git_istream {
 		} incore;
 
 		struct {
-			int fd; /* open for reading */
-			/* NEEDSWORK: what else? */
+			void *mapped;
+			unsigned long mapsize;
+			char hdr[32];
+			int hdr_avail;
+			int hdr_used;
 		} loose;
 
 		struct {
@@ -149,9 +152,85 @@ static void close_deflated_stream(struct git_istream *st)
  *
  *****************************************************************/
 
+static read_method_decl(loose)
+{
+	size_t total_read = 0;
+
+	switch (st->z_state) {
+	case z_done:
+		return 0;
+	case z_error:
+		return -1;
+	default:
+		break;
+	}
+
+	if (st->u.loose.hdr_used < st->u.loose.hdr_avail) {
+		size_t to_copy = st->u.loose.hdr_avail - st->u.loose.hdr_used;
+		if (sz < to_copy)
+			to_copy = sz;
+		memcpy(buf, st->u.loose.hdr + st->u.loose.hdr_used, to_copy);
+		st->u.loose.hdr_used += to_copy;
+		total_read += to_copy;
+	}
+
+	while (total_read < sz) {
+		int status;
+
+		st->z.next_out = (unsigned char *)buf + total_read;
+		st->z.avail_out = sz - total_read;
+		status = git_inflate(&st->z, Z_FINISH);
+
+		total_read = st->z.next_out - (unsigned char *)buf;
+
+		if (status == Z_STREAM_END) {
+			git_inflate_end(&st->z);
+			st->z_state = z_done;
+			break;
+		}
+		if (status != Z_OK && status != Z_BUF_ERROR) {
+			git_inflate_end(&st->z);
+			st->z_state = z_error;
+			return -1;
+		}
+	}
+	return total_read;
+}
+
+static close_method_decl(loose)
+{
+	close_deflated_stream(st);
+	munmap(st->u.loose.mapped, st->u.loose.mapsize);
+	return 0;
+}
+
+static struct stream_vtbl loose_vtbl = {
+	close_istream_loose,
+	read_istream_loose,
+};
+
 static open_method_decl(loose)
 {
-	return -1; /* for now */
+	st->u.loose.mapped = map_sha1_file(sha1, &st->u.loose.mapsize);
+	if (!st->u.loose.mapped)
+		return -1;
+	if (unpack_sha1_header(&st->z,
+			       st->u.loose.mapped,
+			       st->u.loose.mapsize,
+			       st->u.loose.hdr,
+			       sizeof(st->u.loose.hdr)) < 0) {
+		git_inflate_end(&st->z);
+		munmap(st->u.loose.mapped, st->u.loose.mapsize);
+		return -1;
+	}
+
+	/* bypass parse_sha1_header() as we know it is already valid */
+	st->u.loose.hdr_used = strlen(st->u.loose.hdr) + 1;
+	st->u.loose.hdr_avail = st->z.total_out;
+	st->z_state = z_used;
+
+	st->vtbl = &loose_vtbl;
+	return 0;
 }
 
 
-- 
1.7.5.1.365.g32b65

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH 03/11] sha1_object_info_extended(): hint about objects in delta-base cache
  2011-05-16  0:30 ` [PATCH 03/11] sha1_object_info_extended(): hint about objects in delta-base cache Junio C Hamano
@ 2011-05-16  0:40   ` Shawn Pearce
  0 siblings, 0 replies; 49+ messages in thread
From: Shawn Pearce @ 2011-05-16  0:40 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Sun, May 15, 2011 at 17:30, Junio C Hamano <gitster@pobox.com> wrote:
> +               OI_DBCACHED,
...
> +       } else if (in_delta_base_cache(e.p, e.offset)) {
> +               oi->whence = OI_CACHED;

Did you mean OI_DBCACHED here?

-- 
Shawn.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 00/11] writing out a huge blob to working tree
  2011-05-16  0:30 [PATCH 00/11] writing out a huge blob to working tree Junio C Hamano
                   ` (10 preceding siblings ...)
  2011-05-16  0:30 ` [PATCH 11/11] streaming: read loose objects incrementally Junio C Hamano
@ 2011-05-16  0:47 ` Shawn Pearce
  2011-05-18  6:41 ` Jeff King
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 49+ messages in thread
From: Shawn Pearce @ 2011-05-16  0:47 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Sun, May 15, 2011 at 17:30, Junio C Hamano <gitster@pobox.com> wrote:
> Traditionally, git always read the full contents of an object in memory
> before performing various operations on it, e.g. comparing for diff,
> writing it to the working tree, etc.  A huge blob that you cannot fit
> in memory was very cumbersome to handle.
,,,
> Interested parties may want to measure the performance impact of the last
> three patches. The series deliberately ignores core.bigfileThreashold and
> let small and large blobs alike go through the streaming_write_entry()
> codepath, but it _might_ turn out that we would want to use the new code
> only for large-ish blobs.

FWIW in JGit we control this by looking at the object size and
comparing to the variable core.streamFileThreshold. For any object
below this size we allocate the buffer, unpack into it, and return the
buffer to the caller. Only objects above the size use the streaming
code paths.

There is a performance difference, at least for us in Java. Most of
the overhead seems to be due to running zlib inflate() with a tiny
buffer size rather than the full destination buffer. This probably has
to do with the cost associated with jumping from the Java bytecode
through JNI to the libz library.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 09/11] streaming: read non-delta incrementally from a pack
  2011-05-16  0:30 ` [PATCH 09/11] streaming: read non-delta incrementally from a pack Junio C Hamano
@ 2011-05-16  0:58   ` Shawn Pearce
  2011-05-16  5:00     ` Junio C Hamano
  0 siblings, 1 reply; 49+ messages in thread
From: Shawn Pearce @ 2011-05-16  0:58 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Sun, May 15, 2011 at 17:30, Junio C Hamano <gitster@pobox.com> wrote:
> +static read_method_decl(pack_non_delta)

I am not a huge fan of these decl macros... but I can see how writing
out the same function prototype 3 times is annoying.

> +       switch (in_pack_type) {
> +       default:
> +               return -1; /* we do not do deltas for now */

Haha. Deltas are going to be painful. Very, very painful.

We actually try to stream deltas in JGit. Our implementation isn't
useful. It only works if the base object is only accessed in
sequential order by the delta instruction sequences.

I had plans to add an external delta base cache to
$GIT_DIR/objects/delta-base-cache using a block file format that is
random accessible, but has CRC-32 checksums on each block to still
ensure there isn't silent data corruption on reads. You can read more
about it here http://egit.eclipse.org/r/1724 but the patch is probably
stalled and will get abandoned.

I think the better strategy is to avoid delta compression altogether
for objects that are so big we cannot materialize them as a contiguous
buffer. What a reasonable limit is, I don't know... but its probably
got to be around 25-50 MB. The Android project (as an example) has 6+
MB XML documents in their source code repository that are very delta
compressible.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 09/11] streaming: read non-delta incrementally from a pack
  2011-05-16  0:58   ` Shawn Pearce
@ 2011-05-16  5:00     ` Junio C Hamano
  0 siblings, 0 replies; 49+ messages in thread
From: Junio C Hamano @ 2011-05-16  5:00 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git

Shawn Pearce <spearce@spearce.org> writes:

> On Sun, May 15, 2011 at 17:30, Junio C Hamano <gitster@pobox.com> wrote:
>> +static read_method_decl(pack_non_delta)
>
> I am not a huge fan of these decl macros... but I can see how writing
> out the same function prototype 3 times is annoying.

More importantly, it helped greatly while re-rolling this series literally
several times until I nailed the right set of parameters these methods
should take.

> I think the better strategy is to avoid delta compression altogether
> for objects that are so big we cannot materialize them as a contiguous
> buffer.

Yes, that is what I am aiming for in the recent topics around this area.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 08/11] streaming_write_entry(): support files with holes
  2011-05-16  0:30 ` [PATCH 08/11] streaming_write_entry(): support files with holes Junio C Hamano
@ 2011-05-16 10:53   ` Nguyen Thai Ngoc Duy
  2011-05-16 14:39     ` Junio C Hamano
  2011-05-16 13:03   ` Thiago Farina
  1 sibling, 1 reply; 49+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2011-05-16 10:53 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Mon, May 16, 2011 at 7:30 AM, Junio C Hamano <gitster@pobox.com> wrote:
> One typical use of a large binary file is to hold a sparse on-disk hash
> table with a lot of holes. Help preserving the holes with lseek().

Should that be done only with big enough holes? Random zeros may
increase the number of syscalls unnecessarily.
-- 
Duy

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 08/11] streaming_write_entry(): support files with holes
  2011-05-16  0:30 ` [PATCH 08/11] streaming_write_entry(): support files with holes Junio C Hamano
  2011-05-16 10:53   ` Nguyen Thai Ngoc Duy
@ 2011-05-16 13:03   ` Thiago Farina
  1 sibling, 0 replies; 49+ messages in thread
From: Thiago Farina @ 2011-05-16 13:03 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Sun, May 15, 2011 at 9:30 PM, Junio C Hamano <gitster@pobox.com> wrote:
> One typical use of a large binary file is to hold a sparse on-disk hash
> table with a lot of holes. Help preserving the holes with lseek().
>
> Signed-off-by: Junio C Hamano <gitster@pobox.com>
> ---
>  entry.c |   18 +++++++++++++++++-
>  1 files changed, 17 insertions(+), 1 deletions(-)
>
> diff --git a/entry.c b/entry.c
> index 7733a6b..d50e388 100644
> --- a/entry.c
> +++ b/entry.c
> @@ -123,6 +123,7 @@ static int streaming_write_entry(struct cache_entry *ce, char *path,
>        enum object_type type;
>        unsigned long sz;
>        int result = -1;
> +       ssize_t kept = 0;
>        int fd = -1;
>
>        st = open_istream(ce->sha1, &type, &sz);
> @@ -137,17 +138,32 @@ static int streaming_write_entry(struct cache_entry *ce, char *path,
>
>        for (;;) {
>                char buf[10240];
> -               ssize_t wrote;
> +               ssize_t wrote, holeto;
>                ssize_t readlen = read_istream(st, buf, sizeof(buf));
>
>                if (!readlen)
>                        break;
> +               for (holeto = 0; holeto < readlen; holeto++)
> +                       if (buf[holeto])
> +                               break;
> +               if (readlen == holeto) {
> +                       kept += holeto;
> +                       continue;
> +               }
>
> +               if (kept && lseek(fd, kept, SEEK_CUR) == (off_t) -1)
> +                       goto close_and_exit;
> +               else
> +                       kept = 0;
>                wrote = write_in_full(fd, buf, readlen);
>
>                if (wrote != readlen)
>                        goto close_and_exit;
>        }
> +       if (kept &&
> +           ((  lseek(fd, kept - 1, SEEK_CUR) == (off_t) -1)
There is an extra whitespace after ((

> +            || write(fd, "", 1) != 1))
> +               goto close_and_exit;
>        *fstat_done = fstat_output(fd, state, statbuf);
>
>  close_and_exit:
> --
> 1.7.5.1.365.g32b65
>
> --
> To unsubscribe from this list: send the line "unsubscribe git" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 08/11] streaming_write_entry(): support files with holes
  2011-05-16 10:53   ` Nguyen Thai Ngoc Duy
@ 2011-05-16 14:39     ` Junio C Hamano
  2011-05-17  1:18       ` Nguyen Thai Ngoc Duy
  0 siblings, 1 reply; 49+ messages in thread
From: Junio C Hamano @ 2011-05-16 14:39 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: git

Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes:

> On Mon, May 16, 2011 at 7:30 AM, Junio C Hamano <gitster@pobox.com> wrote:
>> One typical use of a large binary file is to hold a sparse on-disk hash
>> table with a lot of holes. Help preserving the holes with lseek().
>
> Should that be done only with big enough holes? Random zeros may
> increase the number of syscalls unnecessarily.

I think that is a valid concern, but doesn't the code do that already?

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 01/11] packed_object_info_detail(): do not return a string
  2011-05-16  0:30 ` [PATCH 01/11] packed_object_info_detail(): do not return a string Junio C Hamano
@ 2011-05-17  0:45   ` Thiago Farina
  2011-05-17  2:36     ` Junio C Hamano
  0 siblings, 1 reply; 49+ messages in thread
From: Thiago Farina @ 2011-05-17  0:45 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Sun, May 15, 2011 at 9:30 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Instead return enum object_name just like everybody else does.
> The caller can turn it into a string with typename() easily.
>
> Signed-off-by: Junio C Hamano <gitster@pobox.com>
> ---
>  builtin/verify-pack.c |    4 ++--
>  cache.h               |    2 +-
>  sha1_file.c           |    4 ++--
>  3 files changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/builtin/verify-pack.c b/builtin/verify-pack.c
> index b6079ae..3a919b1 100644
> --- a/builtin/verify-pack.c
> +++ b/builtin/verify-pack.c
> @@ -33,9 +33,9 @@ static void show_pack_info(struct packed_git *p, unsigned int flags)
>                if (!sha1)
>                        die("internal error pack-check nth-packed-object");
>                offset = nth_packed_object_offset(p, i);
> -               type = packed_object_info_detail(p, offset, &size, &store_size,
> +               type = typename(packed_object_info_detail(p, offset, &size, &store_size,
>                                                 &delta_chain_length,
> -                                                base_sha1);
> +                                                base_sha1));
>                if (!stat_only)
>                        printf("%s ", sha1_to_hex(sha1));
>                if (!delta_chain_length) {
> diff --git a/cache.h b/cache.h
> index b1b5bb5..cdb5112 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -1020,7 +1020,7 @@ extern off_t find_pack_entry_one(const unsigned char *, struct packed_git *);
>  extern void *unpack_entry(struct packed_git *, off_t, enum object_type *, unsigned long *);
>  extern unsigned long unpack_object_header_buffer(const unsigned char *buf, unsigned long len, enum object_type *type, unsigned long *sizep);
>  extern unsigned long get_size_from_delta(struct packed_git *, struct pack_window **, off_t);
> -extern const char *packed_object_info_detail(struct packed_git *, off_t, unsigned long *, unsigned long *, unsigned int *, unsigned char *);
> +extern int packed_object_info_detail(struct packed_git *, off_t, unsigned long *, unsigned long *, unsigned int *, unsigned char *);

In the commit message you say to return "enum object_name". Maybe
change from int to enum object_name here and below?

Also, |type| below is enum object_type not object_name.

>
>  /* Dumb servers support */
>  extern int update_server_info(int);
> diff --git a/sha1_file.c b/sha1_file.c
> index 064a330..4f96eb1 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -1549,7 +1549,7 @@ static int unpack_object_header(struct packed_git *p,
>        return type;
>  }
>
> -const char *packed_object_info_detail(struct packed_git *p,
> +int packed_object_info_detail(struct packed_git *p,
>                                      off_t obj_offset,
>                                      unsigned long *size,
>                                      unsigned long *store_size,
> @@ -1580,7 +1580,7 @@ const char *packed_object_info_detail(struct packed_git *p,
>                case OBJ_BLOB:
>                case OBJ_TAG:
>                        unuse_pack(&w_curs);
> -                       return typename(type);
> +                       return type;
>                case OBJ_OFS_DELTA:
>                        obj_offset = get_delta_base(p, &w_curs, &curpos, type, obj_offset);
>                        if (!obj_offset)
> --
> 1.7.5.1.365.g32b65
>
> --
> To unsubscribe from this list: send the line "unsubscribe git" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 08/11] streaming_write_entry(): support files with holes
  2011-05-16 14:39     ` Junio C Hamano
@ 2011-05-17  1:18       ` Nguyen Thai Ngoc Duy
  2011-05-17  5:23         ` Junio C Hamano
  0 siblings, 1 reply; 49+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2011-05-17  1:18 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Mon, May 16, 2011 at 9:39 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes:
>
>> On Mon, May 16, 2011 at 7:30 AM, Junio C Hamano <gitster@pobox.com> wrote:
>>> One typical use of a large binary file is to hold a sparse on-disk hash
>>> table with a lot of holes. Help preserving the holes with lseek().
>>
>> Should that be done only with big enough holes? Random zeros may
>> increase the number of syscalls unnecessarily.
>
> I think that is a valid concern, but doesn't the code do that already?

Ahh I see you only increase kept when the the whole buf is zero. I was
looking for an explicit threshold, but it's implicitly the buffer
size. Sorry for the noise.
-- 
Duy

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 01/11] packed_object_info_detail(): do not return a string
  2011-05-17  0:45   ` Thiago Farina
@ 2011-05-17  2:36     ` Junio C Hamano
  0 siblings, 0 replies; 49+ messages in thread
From: Junio C Hamano @ 2011-05-17  2:36 UTC (permalink / raw)
  To: Thiago Farina; +Cc: git

Thiago Farina <tfransosi@gmail.com> writes:

> In the commit message you say to return "enum object_name". Maybe
> change from int to enum object_name here and below?

As other functions that return -1 on error return int as a convention, I
do not think it is a good change.

As you may have noticed, "enum object_name" is not correct. It should be
"enum object_type" I think.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 08/11] streaming_write_entry(): support files with holes
  2011-05-17  1:18       ` Nguyen Thai Ngoc Duy
@ 2011-05-17  5:23         ` Junio C Hamano
  0 siblings, 0 replies; 49+ messages in thread
From: Junio C Hamano @ 2011-05-17  5:23 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: git

Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes:

> On Mon, May 16, 2011 at 9:39 PM, Junio C Hamano <gitster@pobox.com> wrote:
>> Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes:
>>
>>> On Mon, May 16, 2011 at 7:30 AM, Junio C Hamano <gitster@pobox.com> wrote:
>>>> One typical use of a large binary file is to hold a sparse on-disk hash
>>>> table with a lot of holes. Help preserving the holes with lseek().
>>>
>>> Should that be done only with big enough holes? Random zeros may
>>> increase the number of syscalls unnecessarily.
>>
>> I think that is a valid concern, but doesn't the code do that already?
>
> Ahh I see you only increase kept when the the whole buf is zero. I was
> looking for an explicit threshold, but it's implicitly the buffer
> size.

Because the code works on 10k chunks and read_istream() does not give you
a short-read, most of the time "kept" will only grab contiguous stream of
NULs in 10k increment.  At the very end of the file, however, the code can
seek by less than the chunksize, as the check is done by comparing the
holdto with readlen, not with sizeof(buf).

We might want to make sizeof(buf) a multiple of typical file block size
(e.g. 16k) to get a better alignment.

Seeking to 10k and writing 2k on a filesystem with 4k pagesize will only
make two blocks of hole, not two and half, and we would be wasting a seek
in that case.

Also we may want to omit seeking for the last chunk that is smaller than
the chunk size.

Like this...

 entry.c |   16 +++++++++-------
 1 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/entry.c b/entry.c
index e063e72..f751c60 100644
--- a/entry.c
+++ b/entry.c
@@ -137,18 +137,20 @@ static int streaming_write_entry(struct cache_entry *ce, char *path,
 		goto close_and_exit;
 
 	for (;;) {
-		char buf[10240];
+		char buf[1024 * 16];
 		ssize_t wrote, holeto;
 		ssize_t readlen = read_istream(st, buf, sizeof(buf));
 
 		if (!readlen)
 			break;
-		for (holeto = 0; holeto < readlen; holeto++)
-			if (buf[holeto])
-				break;
-		if (readlen == holeto) {
-			kept += holeto;
-			continue;
+		if (sizeof(buf) == readlen) {
+			for (holeto = 0; holeto < readlen; holeto++)
+				if (buf[holeto])
+					break;
+			if (readlen == holeto) {
+				kept += holeto;
+				continue;
+			}
 		}
 
 		if (kept && lseek(fd, kept, SEEK_CUR) == (off_t) -1)

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH 00/11] writing out a huge blob to working tree
  2011-05-16  0:30 [PATCH 00/11] writing out a huge blob to working tree Junio C Hamano
                   ` (11 preceding siblings ...)
  2011-05-16  0:47 ` [PATCH 00/11] writing out a huge blob to working tree Shawn Pearce
@ 2011-05-18  6:41 ` Jeff King
  2011-05-18  7:08   ` Jeff King
  2011-05-18  8:17 ` Jeff King
  2011-05-19 21:33 ` [PATCH v2 " Junio C Hamano
  14 siblings, 1 reply; 49+ messages in thread
From: Jeff King @ 2011-05-18  6:41 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Sun, May 15, 2011 at 05:30:20PM -0700, Junio C Hamano wrote:

> Interested parties may want to measure the performance impact of the last
> three patches. The series deliberately ignores core.bigfileThreashold and
> let small and large blobs alike go through the streaming_write_entry()
> codepath, but it _might_ turn out that we would want to use the new code
> only for large-ish blobs.

Hmm.

  $ cd compile/linux-2.6
  $ rm -rf *
  $ time git.v1.7.5 checkout -f
  real    0m4.405s
  user    0m3.592s
  sys     0m0.804s

  $ rm -rf *
  $ time git.jch.streaming checkout -f
  real    0m7.062s
  user    0m5.188s
  sys     0m1.776s

(Actually those times are best-of-5 in each case). So there is
definitely some slow-down for the non-huge case. Bisection points to
your cd36b7b (streaming_write_entry(): use streaming API in
write_entry()).

According to perf, though, it's not the increased writes; the slowdown
is actually from create_pack_revindex, in this call chain:

 create_pack_revindex
 find_pack_revindex
 packed_object_info_detail
 sha1_object_info_extended
 istream_source
 open_istream
 streaming_write_entry

-Peff

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 00/11] writing out a huge blob to working tree
  2011-05-18  6:41 ` Jeff King
@ 2011-05-18  7:08   ` Jeff King
  2011-05-18  7:50     ` Jeff King
  0 siblings, 1 reply; 49+ messages in thread
From: Jeff King @ 2011-05-18  7:08 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Wed, May 18, 2011 at 02:41:58AM -0400, Jeff King wrote:

> According to perf, though, it's not the increased writes; the slowdown
> is actually from create_pack_revindex, in this call chain:
> 
>  create_pack_revindex
>  find_pack_revindex
>  packed_object_info_detail
>  sha1_object_info_extended
>  istream_source
>  open_istream
>  streaming_write_entry

Part of the problem is that with the current code, all you care about is
"Is it loose, packed non-delta, or packed delta?". And
packed_object_info_detail will tell you not just whether it's deltafied,
but will go to a lot more work to make the revindex. One solution is to
let the cheap packed_object_info() report back on delta status, since
it's free there, and then we don't have to deal with the revindex at
all.

Of course, it may turn out that the extra information is useful if and
when open_istream_* actually gets implemented for delta-fied objects.

The patch below implements the cheap "is_delta" check. But it only
shaves off a half second (dropping us from 7s to 6.5s). Prior to your
patches, we were at 4.5 seconds. So there's still quite a bit of
slowdown to figure out.

diff --git a/cache.h b/cache.h
index 39e53c8..829c3a4 100644
--- a/cache.h
+++ b/cache.h
@@ -1026,6 +1026,7 @@ extern int unpack_object_header(struct packed_git *, struct pack_window **, off_
 struct object_info {
 	/* Request */
 	unsigned long *sizep;
+	int *is_deltap;
 	int want_deltainfo;
 
 	/* Response */
diff --git a/sha1_file.c b/sha1_file.c
index 15f1c05..35be909 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1481,7 +1481,7 @@ static off_t get_delta_base(struct packed_git *p,
 
 /* forward declaration for a mutually recursive function */
 static int packed_object_info(struct packed_git *p, off_t offset,
-			      unsigned long *sizep);
+			      unsigned long *sizep, int *is_delta);
 
 static int packed_delta_info(struct packed_git *p,
 			     struct pack_window **w_curs,
@@ -1495,7 +1495,7 @@ static int packed_delta_info(struct packed_git *p,
 	base_offset = get_delta_base(p, w_curs, &curpos, type, obj_offset);
 	if (!base_offset)
 		return OBJ_BAD;
-	type = packed_object_info(p, base_offset, NULL);
+	type = packed_object_info(p, base_offset, NULL, NULL);
 	if (type <= OBJ_NONE) {
 		struct revindex_entry *revidx;
 		const unsigned char *base_sha1;
@@ -1605,7 +1605,7 @@ int packed_object_info_detail(struct packed_git *p,
 }
 
 static int packed_object_info(struct packed_git *p, off_t obj_offset,
-			      unsigned long *sizep)
+			      unsigned long *sizep, int *is_delta)
 {
 	struct pack_window *w_curs = NULL;
 	unsigned long size;
@@ -1619,6 +1619,8 @@ static int packed_object_info(struct packed_git *p, off_t obj_offset,
 	case OBJ_REF_DELTA:
 		type = packed_delta_info(p, &w_curs, curpos,
 					 type, obj_offset, sizep);
+		if (is_delta)
+			*is_delta = 1;
 		break;
 	case OBJ_COMMIT:
 	case OBJ_TREE:
@@ -1626,11 +1628,15 @@ static int packed_object_info(struct packed_git *p, off_t obj_offset,
 	case OBJ_TAG:
 		if (sizep)
 			*sizep = size;
+		if (is_delta)
+			*is_delta = 0;
 		break;
 	default:
 		error("unknown object type %i at offset %"PRIuMAX" in %s",
 		      type, (uintmax_t)obj_offset, p->pack_name);
 		type = OBJ_BAD;
+		if (is_delta)
+			*is_delta = 0;
 	}
 	unuse_pack(&w_curs);
 	return type;
@@ -2130,7 +2136,8 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi)
 	}
 
 	if (!oi->want_deltainfo) {
-		status = packed_object_info(e.p, e.offset, oi->sizep);
+		status = packed_object_info(e.p, e.offset, oi->sizep,
+					    oi->is_deltap);
 	} else {
 		unsigned long size, store_size;
 		unsigned int delta_chain_length;
diff --git a/streaming.c b/streaming.c
index 03c58b2..a2c0e84 100644
--- a/streaming.c
+++ b/streaming.c
@@ -84,10 +84,12 @@ static enum input_source istream_source(const unsigned char *sha1,
 					struct object_info *oi)
 {
 	unsigned long size;
+	int is_delta;
 	int status;
 
 	oi->sizep = &size;
-	oi->want_deltainfo = 1;
+	oi->is_deltap = &is_delta;
+	oi->want_deltainfo = 0;
 
 	status = sha1_object_info_extended(sha1, oi);
 	if (status < 0)
@@ -98,7 +100,7 @@ static enum input_source istream_source(const unsigned char *sha1,
 	case OI_LOOSE:
 		return loose;
 	case OI_PACKED:
-		if (!oi->u.packed.delta && big_file_threshold <= size)
+		if (!is_delta && big_file_threshold <= size)
 			return pack_non_delta;
 		/* fallthru */
 	default:

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH 00/11] writing out a huge blob to working tree
  2011-05-18  7:08   ` Jeff King
@ 2011-05-18  7:50     ` Jeff King
  2011-05-18 15:12       ` Junio C Hamano
  0 siblings, 1 reply; 49+ messages in thread
From: Jeff King @ 2011-05-18  7:50 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Wed, May 18, 2011 at 03:08:37AM -0400, Jeff King wrote:

> Part of the problem is that with the current code, all you care about is
> "Is it loose, packed non-delta, or packed delta?".
> [...]
> The patch below implements the cheap "is_delta" check.

Hmm, sorry, this patch works well on top of cd36b7b, where I first
detected the slowness, but later in the series we actually do look at
the pack information in the object_info. So my patch breaks that code
path horribly (I still think the concept of avoiding the revindex should
still work in principle, though).

-Peff

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 06/11] streaming: a new API to read from the object store
  2011-05-16  0:30 ` [PATCH 06/11] streaming: a new API to read from the object store Junio C Hamano
@ 2011-05-18  8:09   ` Jeff King
  2011-05-19  1:52     ` Junio C Hamano
  0 siblings, 1 reply; 49+ messages in thread
From: Jeff King @ 2011-05-18  8:09 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Sun, May 15, 2011 at 05:30:26PM -0700, Junio C Hamano wrote:

> +struct git_istream *open_istream(const unsigned char *sha1,
> +				 enum object_type *type,
> +				 unsigned long *sz)
> +{
> +	struct git_istream *st;
> +	struct object_info oi;
> +	const unsigned char *real = lookup_replace_object(sha1);
> +	enum input_source src = istream_source(real, type, &oi);
> +
> +	if (src < 0)
> +		return NULL;
> +
> +	st = xmalloc(sizeof(*st));
> +	st->source = src;
> +	if (open_istream_tbl[src](st, &oi, real, type, sz)) {
> +		if (open_istream_incore(st, &oi, real, type, sz)) {
> +			free(st);
> +			st = NULL;
> +		}
> +	}
> +	return st;
> +}

I assume the "sz" parameter is meant to be an output parameter with the
total size of the object. The open_istream_incore function fills it in
properly. But later, when you add open_istream_loose and
open_istream_pack_non_delta, neither of them actually touches the "sz"
parameter at all. So code like:

  struct git_istream *st;
  enum object_type type;
  unsigned long size;
  st = open_istream(sha1, &type, &size);

may or may not have "size" meaningful at this point, which seems like a
bug.

-Peff

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 00/11] writing out a huge blob to working tree
  2011-05-16  0:30 [PATCH 00/11] writing out a huge blob to working tree Junio C Hamano
                   ` (12 preceding siblings ...)
  2011-05-18  6:41 ` Jeff King
@ 2011-05-18  8:17 ` Jeff King
  2011-05-19 21:33 ` [PATCH v2 " Junio C Hamano
  14 siblings, 0 replies; 49+ messages in thread
From: Jeff King @ 2011-05-18  8:17 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Sun, May 15, 2011 at 05:30:20PM -0700, Junio C Hamano wrote:

> Recently "diff" learned to avoid reading the contents only to say "Binary
> files differ" when these large blobs are marked as binary.

With your series, we should be able to get similar speedups even if the
user didn't explicitly mark a file as binary. We need only peek at the
beginning of a blob to see if it is binary, so we can be conservative
with big files. Something like this (which doesn't work because of the
"size" bug I mentioned elsewhere, but is meant to be illustrative):

diff --git a/diff.c b/diff.c
index ba5f7aa..bfe1b2d 100644
--- a/diff.c
+++ b/diff.c
@@ -15,6 +15,7 @@
 #include "sigchain.h"
 #include "submodule.h"
 #include "ll-merge.h"
+#include "streaming.h"
 
 #ifdef NO_FAST_WORKING_DIRECTORY
 #define FAST_WORKING_DIRECTORY 0
@@ -1931,6 +1932,37 @@ static void diff_filespec_load_driver(struct diff_filespec *one)
 		one->driver = userdiff_find_by_name("default");
 }
 
+static char *populate_or_peek(struct diff_filespec *df,
+			      unsigned long want,
+			      unsigned long *got)
+{
+	struct git_istream *st;
+	enum object_type type;
+	char *buf;
+
+	st = open_istream(df->sha1, &type, &df->size);
+	if (!st) {
+		diff_populate_filespec(df, 0);
+		*got = df->size;
+		return df->data;
+	}
+
+	if (df->size < big_file_threshold) {
+		buf = df->data = xmallocz(df->size);
+		want = df->size;
+		df->should_free = 1;
+	}
+	else
+		buf = xmallocz(want);
+
+	/* looks like it will always read_in_full? */
+	if (read_istream(st, buf, want) != want)
+		die("failed to read object");
+	close_istream(st);
+	*got = want;
+	return buf;
+}
+
 int diff_filespec_is_binary(struct diff_filespec *one)
 {
 	if (one->is_binary == -1) {
@@ -1938,13 +1970,25 @@ int diff_filespec_is_binary(struct diff_filespec *one)
 		if (one->driver->binary != -1)
 			one->is_binary = one->driver->binary;
 		else {
-			if (!one->data && DIFF_FILE_VALID(one))
-				diff_populate_filespec(one, 0);
-			if (one->data)
-				one->is_binary = buffer_is_binary(one->data,
-						one->size);
+			char *buf;
+			unsigned long size;
+
+			if (one->data) {
+				buf = one->data;
+				size = one->size;
+			}
+			else if (DIFF_FILE_VALID(one))
+				buf = populate_or_peek(one, 8192, &size);
+			else
+				buf = NULL;
+
+			if (buf)
+				one->is_binary = buffer_is_binary(buf, size);
 			if (one->is_binary == -1)
 				one->is_binary = 0;
+
+			if (buf != one->data)
+				free(buf);
 		}
 	}
 	return one->is_binary;

I think a "peek" function like this would be a nice addition to the
streaming API. Something like:

  char *peek_sha1(const unsigned char sha1[20], /* which object */
                  enum object_type *type, /* out: type */
                  unsigned long want, /* how much do we need */
                  unsigned long big, /* if less than this, just give us
                                        everything in the name of
                                        efficiency */
                  unsigned long *got, /* out: how much did we peek */
                  unsigned long *size, /* out: how big is the whole thing */
                  );

but maybe diff is the only place where that is useful. I dunno.

-Peff

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH 00/11] writing out a huge blob to working tree
  2011-05-18  7:50     ` Jeff King
@ 2011-05-18 15:12       ` Junio C Hamano
  0 siblings, 0 replies; 49+ messages in thread
From: Junio C Hamano @ 2011-05-18 15:12 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> (I still think the concept of avoiding the revindex should
> still work in principle, though).

Yes, absolutely. The streaming code cares only about non-delta, and in
general applying delta without holding the base image entirely in core
would be horribly complex unless the delta only runs forwards over the
delta-base image. The true general case is not something this topic is
interested in (it is primarily about the large objects), and limiting
ourselves to non-delta case is just as fine.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 06/11] streaming: a new API to read from the object store
  2011-05-18  8:09   ` Jeff King
@ 2011-05-19  1:52     ` Junio C Hamano
  0 siblings, 0 replies; 49+ messages in thread
From: Junio C Hamano @ 2011-05-19  1:52 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> I assume the "sz" parameter is meant to be an output parameter with the
> total size of the object. The open_istream_incore function fills it in
> properly. But later,...
> may or may not have "size" meaningful at this point, which seems like a
> bug.

Thanks for spotting.

I would want to revamp the API implementation a bit further.

The "u.*.sz" fields should probably become the first level field in
"struct git_istream" [*1*].

The open_istream_$backend() methods should lose "sz" parameter, as it
should be the same as what would be stored in st->size after the above
change. The public API open_istream() would fill the caller supplied *sz
word with what is stored in st->sz by the backend method.

I do not think the "read" method implementations (other than "incore")
currently make sure that what comes out of the z_stream really has the
size. A check probably needs to be added somewhere in the codepath.

I haven't looked at the revindex issue yet.


[Footnote]

*1* Actually I am a bit torn on this. As I am envisioning that we would
later (much later) reuse this API to implement "streaming filters" by
adding a new struct "filter" as a member of the st->u union, and such a
filter would not have "size" known in advance, I am a bit reluctant to
change it to have sz in the common part of istream. It can be argued that
it may not be a big deal to move "size" to the generic part and have only
certain backends look at it while allowing other backends to ignore it,
just like "z" and "z_state" are unnecessary for "incore" but they are
shared among three backends already.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH v2 00/11] writing out a huge blob to working tree
  2011-05-16  0:30 [PATCH 00/11] writing out a huge blob to working tree Junio C Hamano
                   ` (13 preceding siblings ...)
  2011-05-18  8:17 ` Jeff King
@ 2011-05-19 21:33 ` Junio C Hamano
  2011-05-19 21:33   ` [PATCH v2 01/11] packed_object_info_detail(): do not return a string Junio C Hamano
                     ` (12 more replies)
  14 siblings, 13 replies; 49+ messages in thread
From: Junio C Hamano @ 2011-05-19 21:33 UTC (permalink / raw)
  To: git

So this is the second round. Peff noticed that the istream_open() did not
return the size of the object correctly for in-pack (non-delta) and loose
representations, and this round fixes it.

Also sha1_object_info_extended() lost the call to the expensive function
packed_object_info_detail(), as the only thing we are interested in is to
see if the first-level object is a non-delta.  As the result, 02/11 would
now be much easier to follow.

Junio C Hamano (11):
  packed_object_info_detail(): do not return a string
  sha1_object_info_extended(): expose a bit more info
  sha1_object_info_extended(): hint about objects in delta-base cache
  unpack_object_header(): make it public
  write_entry(): separate two helper functions out
  streaming: a new API to read from the object store
  streaming_write_entry(): use streaming API in write_entry()
  streaming_write_entry(): support files with holes
  streaming: read non-delta incrementally from a pack
  sha1_file.c: expose helpers to read loose objects
  streaming: read loose objects incrementally

 Makefile              |    2 +
 builtin/verify-pack.c |    4 +-
 cache.h               |   36 +++++-
 convert.c             |   23 +++
 entry.c               |  112 +++++++++++++---
 sha1_file.c           |   69 +++++++---
 streaming.c           |  377 +++++++++++++++++++++++++++++++++++++++++++++++++
 streaming.h           |   15 ++
 8 files changed, 598 insertions(+), 40 deletions(-)
 create mode 100644 streaming.c
 create mode 100644 streaming.h

-- 
1.7.5.1.416.gac10c8

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH v2 01/11] packed_object_info_detail(): do not return a string
  2011-05-19 21:33 ` [PATCH v2 " Junio C Hamano
@ 2011-05-19 21:33   ` Junio C Hamano
  2011-05-19 21:33   ` [PATCH v2 02/11] sha1_object_info_extended(): expose a bit more info Junio C Hamano
                     ` (11 subsequent siblings)
  12 siblings, 0 replies; 49+ messages in thread
From: Junio C Hamano @ 2011-05-19 21:33 UTC (permalink / raw)
  To: git

Instead return an integer that can be given to typename() if
the caller wants a string, just like everybody else does.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 builtin/verify-pack.c |    4 ++--
 cache.h               |    2 +-
 sha1_file.c           |    4 ++--
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/builtin/verify-pack.c b/builtin/verify-pack.c
index b6079ae..3a919b1 100644
--- a/builtin/verify-pack.c
+++ b/builtin/verify-pack.c
@@ -33,9 +33,9 @@ static void show_pack_info(struct packed_git *p, unsigned int flags)
 		if (!sha1)
 			die("internal error pack-check nth-packed-object");
 		offset = nth_packed_object_offset(p, i);
-		type = packed_object_info_detail(p, offset, &size, &store_size,
+		type = typename(packed_object_info_detail(p, offset, &size, &store_size,
 						 &delta_chain_length,
-						 base_sha1);
+						 base_sha1));
 		if (!stat_only)
 			printf("%s ", sha1_to_hex(sha1));
 		if (!delta_chain_length) {
diff --git a/cache.h b/cache.h
index b1b5bb5..cdb5112 100644
--- a/cache.h
+++ b/cache.h
@@ -1020,7 +1020,7 @@ extern off_t find_pack_entry_one(const unsigned char *, struct packed_git *);
 extern void *unpack_entry(struct packed_git *, off_t, enum object_type *, unsigned long *);
 extern unsigned long unpack_object_header_buffer(const unsigned char *buf, unsigned long len, enum object_type *type, unsigned long *sizep);
 extern unsigned long get_size_from_delta(struct packed_git *, struct pack_window **, off_t);
-extern const char *packed_object_info_detail(struct packed_git *, off_t, unsigned long *, unsigned long *, unsigned int *, unsigned char *);
+extern int packed_object_info_detail(struct packed_git *, off_t, unsigned long *, unsigned long *, unsigned int *, unsigned char *);
 
 /* Dumb servers support */
 extern int update_server_info(int);
diff --git a/sha1_file.c b/sha1_file.c
index 064a330..4f96eb1 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1549,7 +1549,7 @@ static int unpack_object_header(struct packed_git *p,
 	return type;
 }
 
-const char *packed_object_info_detail(struct packed_git *p,
+int packed_object_info_detail(struct packed_git *p,
 				      off_t obj_offset,
 				      unsigned long *size,
 				      unsigned long *store_size,
@@ -1580,7 +1580,7 @@ const char *packed_object_info_detail(struct packed_git *p,
 		case OBJ_BLOB:
 		case OBJ_TAG:
 			unuse_pack(&w_curs);
-			return typename(type);
+			return type;
 		case OBJ_OFS_DELTA:
 			obj_offset = get_delta_base(p, &w_curs, &curpos, type, obj_offset);
 			if (!obj_offset)
-- 
1.7.5.1.416.gac10c8

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v2 02/11] sha1_object_info_extended(): expose a bit more info
  2011-05-19 21:33 ` [PATCH v2 " Junio C Hamano
  2011-05-19 21:33   ` [PATCH v2 01/11] packed_object_info_detail(): do not return a string Junio C Hamano
@ 2011-05-19 21:33   ` Junio C Hamano
  2011-05-19 21:33   ` [PATCH v2 03/11] sha1_object_info_extended(): hint about objects in delta-base cache Junio C Hamano
                     ` (10 subsequent siblings)
  12 siblings, 0 replies; 49+ messages in thread
From: Junio C Hamano @ 2011-05-19 21:33 UTC (permalink / raw)
  To: git

The original interface for sha1_object_info() takes an object name and
gives back a type and its size (the latter is given only when it was
asked).  The new interface wraps its implementation and exposes a bit
more pieces of information that the interface used to discard, namely:

 - where the object is stored (loose? cached? packed?)
 - if packed, where in which packfile?

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---

 * In the earlier round, this used u.pack.delta to record the length of
   the delta chain, but the caller is not necessarily interested in the
   length of the delta chain per-se, but may only want to know if it is a
   delta against another object or is stored as a deflated data. Calling
   packed_object_info_detail() involves walking the reverse index chain to
   compute the store size of the object and is unnecessarily expensive.

   We could resurrect the code if a new caller wants to know, but I doubt
   it.
---
 cache.h     |   28 ++++++++++++++++++++++++++++
 sha1_file.c |   42 +++++++++++++++++++++++++++++++-----------
 2 files changed, 59 insertions(+), 11 deletions(-)

diff --git a/cache.h b/cache.h
index cdb5112..9fbc07e 100644
--- a/cache.h
+++ b/cache.h
@@ -1022,6 +1022,34 @@ extern unsigned long unpack_object_header_buffer(const unsigned char *buf, unsig
 extern unsigned long get_size_from_delta(struct packed_git *, struct pack_window **, off_t);
 extern int packed_object_info_detail(struct packed_git *, off_t, unsigned long *, unsigned long *, unsigned int *, unsigned char *);
 
+struct object_info {
+	/* Request */
+	unsigned long *sizep;
+
+	/* Response */
+	enum {
+		OI_CACHED,
+		OI_LOOSE,
+		OI_PACKED
+	} whence;
+	union {
+		/*
+		 * struct {
+		 * 	... Nothing to expose in this case
+		 * } cached;
+		 * struct {
+		 * 	... Nothing to expose in this case
+		 * } loose;
+		 */
+		struct {
+			struct packed_git *pack;
+			off_t offset;
+			unsigned int is_delta;
+		} packed;
+	} u;
+};
+extern int sha1_object_info_extended(const unsigned char *, struct object_info *);
+
 /* Dumb servers support */
 extern int update_server_info(int);
 
diff --git a/sha1_file.c b/sha1_file.c
index 4f96eb1..7eed316 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1481,7 +1481,7 @@ static off_t get_delta_base(struct packed_git *p,
 
 /* forward declaration for a mutually recursive function */
 static int packed_object_info(struct packed_git *p, off_t offset,
-			      unsigned long *sizep);
+			      unsigned long *sizep, int *rtype);
 
 static int packed_delta_info(struct packed_git *p,
 			     struct pack_window **w_curs,
@@ -1495,7 +1495,7 @@ static int packed_delta_info(struct packed_git *p,
 	base_offset = get_delta_base(p, w_curs, &curpos, type, obj_offset);
 	if (!base_offset)
 		return OBJ_BAD;
-	type = packed_object_info(p, base_offset, NULL);
+	type = packed_object_info(p, base_offset, NULL, NULL);
 	if (type <= OBJ_NONE) {
 		struct revindex_entry *revidx;
 		const unsigned char *base_sha1;
@@ -1605,7 +1605,7 @@ int packed_object_info_detail(struct packed_git *p,
 }
 
 static int packed_object_info(struct packed_git *p, off_t obj_offset,
-			      unsigned long *sizep)
+			      unsigned long *sizep, int *rtype)
 {
 	struct pack_window *w_curs = NULL;
 	unsigned long size;
@@ -1613,6 +1613,8 @@ static int packed_object_info(struct packed_git *p, off_t obj_offset,
 	enum object_type type;
 
 	type = unpack_object_header(p, &w_curs, &curpos, &size);
+	if (rtype)
+		*rtype = type; /* representation type */
 
 	switch (type) {
 	case OBJ_OFS_DELTA:
@@ -2093,24 +2095,28 @@ static int sha1_loose_object_info(const unsigned char *sha1, unsigned long *size
 	return status;
 }
 
-int sha1_object_info(const unsigned char *sha1, unsigned long *sizep)
+/* returns enum object_type or negative */
+int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi)
 {
 	struct cached_object *co;
 	struct pack_entry e;
-	int status;
+	int status, rtype;
 
 	co = find_cached_object(sha1);
 	if (co) {
-		if (sizep)
-			*sizep = co->size;
+		if (oi->sizep)
+			*(oi->sizep) = co->size;
+		oi->whence = OI_CACHED;
 		return co->type;
 	}
 
 	if (!find_pack_entry(sha1, &e)) {
 		/* Most likely it's a loose object. */
-		status = sha1_loose_object_info(sha1, sizep);
-		if (status >= 0)
+		status = sha1_loose_object_info(sha1, oi->sizep);
+		if (status >= 0) {
+			oi->whence = OI_LOOSE;
 			return status;
+		}
 
 		/* Not a loose object; someone else may have just packed it. */
 		reprepare_packed_git();
@@ -2118,15 +2124,29 @@ int sha1_object_info(const unsigned char *sha1, unsigned long *sizep)
 			return status;
 	}
 
-	status = packed_object_info(e.p, e.offset, sizep);
+	status = packed_object_info(e.p, e.offset, oi->sizep, &rtype);
 	if (status < 0) {
 		mark_bad_packed_object(e.p, sha1);
-		status = sha1_object_info(sha1, sizep);
+		status = sha1_object_info_extended(sha1, oi);
+	} else {
+		oi->whence = OI_PACKED;
+		oi->u.packed.offset = e.offset;
+		oi->u.packed.pack = e.p;
+		oi->u.packed.is_delta = (rtype == OBJ_REF_DELTA ||
+					 rtype == OBJ_OFS_DELTA);
 	}
 
 	return status;
 }
 
+int sha1_object_info(const unsigned char *sha1, unsigned long *sizep)
+{
+	struct object_info oi;
+
+	oi.sizep = sizep;
+	return sha1_object_info_extended(sha1, &oi);
+}
+
 static void *read_packed_sha1(const unsigned char *sha1,
 			      enum object_type *type, unsigned long *size)
 {
-- 
1.7.5.1.416.gac10c8

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v2 03/11] sha1_object_info_extended(): hint about objects in delta-base cache
  2011-05-19 21:33 ` [PATCH v2 " Junio C Hamano
  2011-05-19 21:33   ` [PATCH v2 01/11] packed_object_info_detail(): do not return a string Junio C Hamano
  2011-05-19 21:33   ` [PATCH v2 02/11] sha1_object_info_extended(): expose a bit more info Junio C Hamano
@ 2011-05-19 21:33   ` Junio C Hamano
  2011-05-20 23:05     ` René Scharfe
  2011-05-19 21:33   ` [PATCH v2 04/11] unpack_object_header(): make it public Junio C Hamano
                     ` (9 subsequent siblings)
  12 siblings, 1 reply; 49+ messages in thread
From: Junio C Hamano @ 2011-05-19 21:33 UTC (permalink / raw)
  To: git

An object found in the delta-base cache is not guaranteed to
stay there, but we know it came from a pack and it is likely
to give us a quick access if we read_sha1_file() it right now,
which is a piece of useful information.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 cache.h     |    3 ++-
 sha1_file.c |    9 +++++++++
 2 files changed, 11 insertions(+), 1 deletions(-)

diff --git a/cache.h b/cache.h
index 9fbc07e..e3f235f 100644
--- a/cache.h
+++ b/cache.h
@@ -1030,7 +1030,8 @@ struct object_info {
 	enum {
 		OI_CACHED,
 		OI_LOOSE,
-		OI_PACKED
+		OI_PACKED,
+		OI_DBCACHED,
 	} whence;
 	union {
 		/*
diff --git a/sha1_file.c b/sha1_file.c
index 7eed316..1d6f93d 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1697,6 +1697,13 @@ static unsigned long pack_entry_hash(struct packed_git *p, off_t base_offset)
 	return hash % MAX_DELTA_CACHE;
 }
 
+static int in_delta_base_cache(struct packed_git *p, off_t base_offset)
+{
+	unsigned long hash = pack_entry_hash(p, base_offset);
+	struct delta_base_cache_entry *ent = delta_base_cache + hash;
+	return (ent->data && ent->p == p && ent->base_offset == base_offset);
+}
+
 static void *cache_or_unpack_entry(struct packed_git *p, off_t base_offset,
 	unsigned long *base_size, enum object_type *type, int keep_cache)
 {
@@ -2128,6 +2135,8 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi)
 	if (status < 0) {
 		mark_bad_packed_object(e.p, sha1);
 		status = sha1_object_info_extended(sha1, oi);
+	} else if (in_delta_base_cache(e.p, e.offset)) {
+		oi->whence = OI_DBCACHED;
 	} else {
 		oi->whence = OI_PACKED;
 		oi->u.packed.offset = e.offset;
-- 
1.7.5.1.416.gac10c8

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v2 04/11] unpack_object_header(): make it public
  2011-05-19 21:33 ` [PATCH v2 " Junio C Hamano
                     ` (2 preceding siblings ...)
  2011-05-19 21:33   ` [PATCH v2 03/11] sha1_object_info_extended(): hint about objects in delta-base cache Junio C Hamano
@ 2011-05-19 21:33   ` Junio C Hamano
  2011-05-19 21:33   ` [PATCH v2 05/11] write_entry(): separate two helper functions out Junio C Hamano
                     ` (8 subsequent siblings)
  12 siblings, 0 replies; 49+ messages in thread
From: Junio C Hamano @ 2011-05-19 21:33 UTC (permalink / raw)
  To: git

This function is used to read and skip over the per-object header
in a packfile.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 cache.h     |    1 +
 sha1_file.c |    8 ++++----
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/cache.h b/cache.h
index e3f235f..38e14b2 100644
--- a/cache.h
+++ b/cache.h
@@ -1021,6 +1021,7 @@ extern void *unpack_entry(struct packed_git *, off_t, enum object_type *, unsign
 extern unsigned long unpack_object_header_buffer(const unsigned char *buf, unsigned long len, enum object_type *type, unsigned long *sizep);
 extern unsigned long get_size_from_delta(struct packed_git *, struct pack_window **, off_t);
 extern int packed_object_info_detail(struct packed_git *, off_t, unsigned long *, unsigned long *, unsigned int *, unsigned char *);
+extern int unpack_object_header(struct packed_git *, struct pack_window **, off_t *, unsigned long *);
 
 struct object_info {
 	/* Request */
diff --git a/sha1_file.c b/sha1_file.c
index 1d6f93d..a28683a 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1523,10 +1523,10 @@ static int packed_delta_info(struct packed_git *p,
 	return type;
 }
 
-static int unpack_object_header(struct packed_git *p,
-				struct pack_window **w_curs,
-				off_t *curpos,
-				unsigned long *sizep)
+int unpack_object_header(struct packed_git *p,
+			 struct pack_window **w_curs,
+			 off_t *curpos,
+			 unsigned long *sizep)
 {
 	unsigned char *base;
 	unsigned int left;
-- 
1.7.5.1.416.gac10c8

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v2 05/11] write_entry(): separate two helper functions out
  2011-05-19 21:33 ` [PATCH v2 " Junio C Hamano
                     ` (3 preceding siblings ...)
  2011-05-19 21:33   ` [PATCH v2 04/11] unpack_object_header(): make it public Junio C Hamano
@ 2011-05-19 21:33   ` Junio C Hamano
  2011-05-19 21:33   ` [PATCH v2 06/11] streaming: a new API to read from the object store Junio C Hamano
                     ` (7 subsequent siblings)
  12 siblings, 0 replies; 49+ messages in thread
From: Junio C Hamano @ 2011-05-19 21:33 UTC (permalink / raw)
  To: git

In the write-out codepath, a block of code determines what file in the
working tree to write to, and opens an output file descriptor to it.

After writing the contents out to the file, another block of code runs
fstat() on the file descriptor when appropriate.

Separate these blocks out to open_output_fd() and fstat_output()
helper functions.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 entry.c |   43 ++++++++++++++++++++++++++-----------------
 1 files changed, 26 insertions(+), 17 deletions(-)

diff --git a/entry.c b/entry.c
index b017167..cc6502a 100644
--- a/entry.c
+++ b/entry.c
@@ -91,6 +91,29 @@ static void *read_blob_entry(struct cache_entry *ce, unsigned long *size)
 	return NULL;
 }
 
+static int open_output_fd(char *path, struct cache_entry *ce, int to_tempfile)
+{
+	int symlink = (ce->ce_mode & S_IFMT) != S_IFREG;
+	if (to_tempfile) {
+		strcpy(path, symlink
+		       ? ".merge_link_XXXXXX" : ".merge_file_XXXXXX");
+		return mkstemp(path);
+	} else {
+		return create_file(path, !symlink ? ce->ce_mode : 0666);
+	}
+}
+
+static int fstat_output(int fd, const struct checkout *state, struct stat *st)
+{
+	/* use fstat() only when path == ce->name */
+	if (fstat_is_reliable() &&
+	    state->refresh_cache && !state->base_dir_len) {
+		fstat(fd, st);
+		return 1;
+	}
+	return 0;
+}
+
 static int write_entry(struct cache_entry *ce, char *path, const struct checkout *state, int to_tempfile)
 {
 	unsigned int ce_mode_s_ifmt = ce->ce_mode & S_IFMT;
@@ -128,17 +151,7 @@ static int write_entry(struct cache_entry *ce, char *path, const struct checkout
 			size = newsize;
 		}
 
-		if (to_tempfile) {
-			if (ce_mode_s_ifmt == S_IFREG)
-				strcpy(path, ".merge_file_XXXXXX");
-			else
-				strcpy(path, ".merge_link_XXXXXX");
-			fd = mkstemp(path);
-		} else if (ce_mode_s_ifmt == S_IFREG) {
-			fd = create_file(path, ce->ce_mode);
-		} else {
-			fd = create_file(path, 0666);
-		}
+		fd = open_output_fd(path, ce, to_tempfile);
 		if (fd < 0) {
 			free(new);
 			return error("unable to create file %s (%s)",
@@ -146,12 +159,8 @@ static int write_entry(struct cache_entry *ce, char *path, const struct checkout
 		}
 
 		wrote = write_in_full(fd, new, size);
-		/* use fstat() only when path == ce->name */
-		if (fstat_is_reliable() &&
-		    state->refresh_cache && !to_tempfile && !state->base_dir_len) {
-			fstat(fd, &st);
-			fstat_done = 1;
-		}
+		if (!to_tempfile)
+			fstat_done = fstat_output(fd, state, &st);
 		close(fd);
 		free(new);
 		if (wrote != size)
-- 
1.7.5.1.416.gac10c8

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v2 06/11] streaming: a new API to read from the object store
  2011-05-19 21:33 ` [PATCH v2 " Junio C Hamano
                     ` (4 preceding siblings ...)
  2011-05-19 21:33   ` [PATCH v2 05/11] write_entry(): separate two helper functions out Junio C Hamano
@ 2011-05-19 21:33   ` Junio C Hamano
  2011-05-20 23:05     ` René Scharfe
  2011-05-19 21:33   ` [PATCH v2 07/11] streaming_write_entry(): use streaming API in write_entry() Junio C Hamano
                     ` (6 subsequent siblings)
  12 siblings, 1 reply; 49+ messages in thread
From: Junio C Hamano @ 2011-05-19 21:33 UTC (permalink / raw)
  To: git

Given an object name, use open_istream() to get a git_istream handle
that you can read_istream() from as if you are using read(2) to read
the contents of the object, and close it with close_istream() when
you are done.

Currently, we do not do anything fancy--it just calls read_sha1_file()
and keeps the contents in memory as a whole, and carve it out as you
request with read_istream().

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Makefile    |    2 +
 streaming.c |  199 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 streaming.h |   15 +++++
 3 files changed, 216 insertions(+), 0 deletions(-)
 create mode 100644 streaming.c
 create mode 100644 streaming.h

diff --git a/Makefile b/Makefile
index 320ccc7..83bd539 100644
--- a/Makefile
+++ b/Makefile
@@ -552,6 +552,7 @@ LIB_H += sha1-lookup.h
 LIB_H += sideband.h
 LIB_H += sigchain.h
 LIB_H += strbuf.h
+LIB_H += streaming.h
 LIB_H += string-list.h
 LIB_H += submodule.h
 LIB_H += tag.h
@@ -657,6 +658,7 @@ LIB_OBJS += shallow.o
 LIB_OBJS += sideband.o
 LIB_OBJS += sigchain.o
 LIB_OBJS += strbuf.o
+LIB_OBJS += streaming.o
 LIB_OBJS += string-list.o
 LIB_OBJS += submodule.o
 LIB_OBJS += symlinks.o
diff --git a/streaming.c b/streaming.c
new file mode 100644
index 0000000..84330b4
--- /dev/null
+++ b/streaming.c
@@ -0,0 +1,199 @@
+/*
+ * Copyright (c) 2011, Google Inc.
+ */
+#include "cache.h"
+#include "streaming.h"
+
+enum input_source {
+	stream_error = -1,
+	incore = 0,
+	loose = 1,
+	pack_non_delta = 2
+};
+
+typedef int (*open_istream_fn)(struct git_istream *,
+			       struct object_info *,
+			       const unsigned char *,
+			       enum object_type *);
+typedef int (*close_istream_fn)(struct git_istream *);
+typedef ssize_t (*read_istream_fn)(struct git_istream *, char *, size_t);
+
+struct stream_vtbl {
+	close_istream_fn close;
+	read_istream_fn read;
+};
+
+#define open_method_decl(name) \
+	int open_istream_ ##name \
+	(struct git_istream *st, struct object_info *oi, \
+	 const unsigned char *sha1, \
+	 enum object_type *type)
+
+#define close_method_decl(name) \
+	int close_istream_ ##name \
+	(struct git_istream *st)
+
+#define read_method_decl(name) \
+	ssize_t read_istream_ ##name \
+	(struct git_istream *st, char *buf, size_t sz)
+
+/* forward declaration */
+static open_method_decl(incore);
+static open_method_decl(loose);
+static open_method_decl(pack_non_delta);
+
+static open_istream_fn open_istream_tbl[] = {
+	open_istream_incore,
+	open_istream_loose,
+	open_istream_pack_non_delta,
+};
+
+struct git_istream {
+	enum input_source source;
+	const struct stream_vtbl *vtbl;
+	unsigned long size; /* inflated size of full object */
+
+	union {
+		struct {
+			char *buf; /* from read_object() */
+			unsigned long read_ptr;
+		} incore;
+
+		struct {
+			int fd; /* open for reading */
+			/* NEEDSWORK: what else? */
+		} loose;
+
+		struct {
+			int fd; /* open for reading */
+			/* NEEDSWORK: what else? */
+		} in_pack;
+	} u;
+};
+
+int close_istream(struct git_istream *st)
+{
+	return st->vtbl->close(st);
+}
+
+ssize_t read_istream(struct git_istream *st, char *buf, size_t sz)
+{
+	return st->vtbl->read(st, buf, sz);
+}
+
+static enum input_source istream_source(const unsigned char *sha1,
+					enum object_type *type,
+					struct object_info *oi)
+{
+	unsigned long size;
+	int status;
+
+	oi->sizep = &size;
+	status = sha1_object_info_extended(sha1, oi);
+	if (status < 0)
+		return stream_error;
+	*type = status;
+
+	switch (oi->whence) {
+	case OI_LOOSE:
+		return loose;
+	case OI_PACKED:
+		if (!oi->u.packed.is_delta && big_file_threshold <= size)
+			return pack_non_delta;
+		/* fallthru */
+	default:
+		return incore;
+	}
+}
+
+struct git_istream *open_istream(const unsigned char *sha1,
+				 enum object_type *type,
+				 unsigned long *size)
+{
+	struct git_istream *st;
+	struct object_info oi;
+	const unsigned char *real = lookup_replace_object(sha1);
+	enum input_source src = istream_source(real, type, &oi);
+
+	if (src < 0)
+		return NULL;
+
+	st = xmalloc(sizeof(*st));
+	st->source = src;
+	if (open_istream_tbl[src](st, &oi, real, type)) {
+		if (open_istream_incore(st, &oi, real, type)) {
+			free(st);
+			st = NULL;
+		}
+	}
+	*size = st->size;
+	return st;
+}
+
+/*****************************************************************
+ *
+ * Loose object stream
+ *
+ *****************************************************************/
+
+static open_method_decl(loose)
+{
+	return -1; /* for now */
+}
+
+
+/*****************************************************************
+ *
+ * Non-delta packed object stream
+ *
+ *****************************************************************/
+
+static open_method_decl(pack_non_delta)
+{
+	return -1; /* for now */
+}
+
+
+/*****************************************************************
+ *
+ * In-core stream
+ *
+ *****************************************************************/
+
+static close_method_decl(incore)
+{
+	free(st->u.incore.buf);
+	return 0;
+}
+
+static read_method_decl(incore)
+{
+	size_t read_size = sz;
+	size_t remainder = st->size - st->u.incore.read_ptr;
+
+	if (remainder <= read_size)
+		read_size = remainder;
+	if (read_size) {
+		memcpy(buf, st->u.incore.buf + st->u.incore.read_ptr, read_size);
+		st->u.incore.read_ptr += read_size;
+	}
+	return read_size;
+}
+
+static struct stream_vtbl incore_vtbl = {
+	close_istream_incore,
+	read_istream_incore,
+};
+
+static open_method_decl(incore)
+{
+	st->u.incore.buf = read_sha1_file_extended(sha1, type, &st->size, 0);
+	st->u.incore.read_ptr = 0;
+	st->vtbl = &incore_vtbl;
+
+	if (!st->u.incore.buf) {
+		free(st->u.incore.buf);
+		return -1;
+	}
+	return 0;
+}
diff --git a/streaming.h b/streaming.h
new file mode 100644
index 0000000..18cbe68
--- /dev/null
+++ b/streaming.h
@@ -0,0 +1,15 @@
+/*
+ * Copyright (c) 2011, Google Inc.
+ */
+#ifndef STREAMING_H
+#define STREAMING_H 1
+#include "cache.h"
+
+/* opaque */
+struct git_istream;
+
+extern struct git_istream *open_istream(const unsigned char *, enum object_type *, unsigned long *);
+extern int close_istream(struct git_istream *);
+extern ssize_t read_istream(struct git_istream *, char *, size_t);
+
+#endif /* STREAMING_H */
-- 
1.7.5.1.416.gac10c8

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v2 07/11] streaming_write_entry(): use streaming API in write_entry()
  2011-05-19 21:33 ` [PATCH v2 " Junio C Hamano
                     ` (5 preceding siblings ...)
  2011-05-19 21:33   ` [PATCH v2 06/11] streaming: a new API to read from the object store Junio C Hamano
@ 2011-05-19 21:33   ` Junio C Hamano
  2011-05-20 22:52     ` Junio C Hamano
  2011-05-19 21:33   ` [PATCH v2 08/11] streaming_write_entry(): support files with holes Junio C Hamano
                     ` (5 subsequent siblings)
  12 siblings, 1 reply; 49+ messages in thread
From: Junio C Hamano @ 2011-05-19 21:33 UTC (permalink / raw)
  To: git

When the output to a path does not have to be converted, we can read from
the object database from the streaming API and write to the file in the
working tree, without having to hold everything in the memory.

The ident, auto- and safe- crlf conversions inherently require you to read
the whole thing before deciding what to do, so while it is technically
possible to support them by using a buffer of an unbound size or rewinding
and reading the stream twice, it is less practical than the traditional
"read the whole thing in core and convert" approach.

Adding streaming filters for the other conversions on top of this should
be doable by tweaking the can_bypass_conversion() function (it should be
renamed to can_filter_stream() when it happens). Then the streaming API
can be extended to wrap the git_istream streaming_write_entry() opens on
the underlying object in another git_istream that reads from it, filters
what is read, and let the streaming_write_entry() read the filtered
result. But that is outside the scope of this series.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 cache.h   |    1 +
 convert.c |   23 +++++++++++++++++++++++
 entry.c   |   52 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 76 insertions(+), 0 deletions(-)

diff --git a/cache.h b/cache.h
index 38e14b2..8c5bb8a 100644
--- a/cache.h
+++ b/cache.h
@@ -1156,6 +1156,7 @@ extern int convert_to_git(const char *path, const char *src, size_t len,
                           struct strbuf *dst, enum safe_crlf checksafe);
 extern int convert_to_working_tree(const char *path, const char *src, size_t len, struct strbuf *dst);
 extern int renormalize_buffer(const char *path, const char *src, size_t len, struct strbuf *dst);
+extern int can_bypass_conversion(const char *path);
 
 /* add */
 /*
diff --git a/convert.c b/convert.c
index efc7e07..d3c0041 100644
--- a/convert.c
+++ b/convert.c
@@ -813,3 +813,26 @@ int renormalize_buffer(const char *path, const char *src, size_t len, struct str
 	}
 	return ret | convert_to_git(path, src, len, dst, 0);
 }
+
+/*
+ * You would be crazy to set CRLF, smuge/clean or ident to
+ * a large binary blob you would want us not to slurp into
+ * the memory!
+ */
+int can_bypass_conversion(const char *path)
+{
+	struct conv_attrs ca;
+	enum crlf_action crlf_action;
+
+	convert_attrs(&ca, path);
+
+	if (ca.ident ||
+	    (ca.drv && (ca.drv->smudge || ca.drv->clean)))
+		return 0;
+
+	crlf_action = input_crlf_action(ca.crlf_action, ca.eol_attr);
+	if ((crlf_action == CRLF_BINARY) ||
+	    (crlf_action == CRLF_GUESS && auto_crlf == AUTO_CRLF_FALSE))
+		return 1;
+	return 0;
+}
diff --git a/entry.c b/entry.c
index cc6502a..7733a6b 100644
--- a/entry.c
+++ b/entry.c
@@ -1,6 +1,7 @@
 #include "cache.h"
 #include "blob.h"
 #include "dir.h"
+#include "streaming.h"
 
 static void create_directories(const char *path, int path_len,
 			       const struct checkout *state)
@@ -114,6 +115,50 @@ static int fstat_output(int fd, const struct checkout *state, struct stat *st)
 	return 0;
 }
 
+static int streaming_write_entry(struct cache_entry *ce, char *path,
+				 const struct checkout *state, int to_tempfile,
+				 int *fstat_done, struct stat *statbuf)
+{
+	struct git_istream *st;
+	enum object_type type;
+	unsigned long sz;
+	int result = -1;
+	int fd = -1;
+
+	st = open_istream(ce->sha1, &type, &sz);
+	if (!st)
+		return -1;
+	if (type != OBJ_BLOB)
+		goto close_and_exit;
+
+	fd = open_output_fd(path, ce, to_tempfile);
+	if (fd < 0)
+		goto close_and_exit;
+
+	for (;;) {
+		char buf[10240];
+		ssize_t wrote;
+		ssize_t readlen = read_istream(st, buf, sizeof(buf));
+
+		if (!readlen)
+			break;
+
+		wrote = write_in_full(fd, buf, readlen);
+
+		if (wrote != readlen)
+			goto close_and_exit;
+	}
+	*fstat_done = fstat_output(fd, state, statbuf);
+
+close_and_exit:
+	close_istream(st);
+	if (0 <= fd)
+		result |= close(fd);
+	if (result && 0 <= fd)
+		unlink(path);
+	return result;
+}
+
 static int write_entry(struct cache_entry *ce, char *path, const struct checkout *state, int to_tempfile)
 {
 	unsigned int ce_mode_s_ifmt = ce->ce_mode & S_IFMT;
@@ -124,6 +169,12 @@ static int write_entry(struct cache_entry *ce, char *path, const struct checkout
 	size_t wrote, newsize = 0;
 	struct stat st;
 
+	if ((ce_mode_s_ifmt == S_IFREG) &&
+	    can_bypass_conversion(path) &&
+	    !streaming_write_entry(ce, path, state, to_tempfile,
+				   &fstat_done, &st))
+		goto finish;
+
 	switch (ce_mode_s_ifmt) {
 	case S_IFREG:
 	case S_IFLNK:
@@ -176,6 +227,7 @@ static int write_entry(struct cache_entry *ce, char *path, const struct checkout
 		return error("unknown file mode for %s in index", path);
 	}
 
+finish:
 	if (state->refresh_cache) {
 		if (!fstat_done)
 			lstat(ce->name, &st);
-- 
1.7.5.1.416.gac10c8

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v2 08/11] streaming_write_entry(): support files with holes
  2011-05-19 21:33 ` [PATCH v2 " Junio C Hamano
                     ` (6 preceding siblings ...)
  2011-05-19 21:33   ` [PATCH v2 07/11] streaming_write_entry(): use streaming API in write_entry() Junio C Hamano
@ 2011-05-19 21:33   ` Junio C Hamano
  2011-05-19 21:33   ` [PATCH v2 09/11] streaming: read non-delta incrementally from a pack Junio C Hamano
                     ` (4 subsequent siblings)
  12 siblings, 0 replies; 49+ messages in thread
From: Junio C Hamano @ 2011-05-19 21:33 UTC (permalink / raw)
  To: git

One typical use of a large binary file is to hold a sparse on-disk hash
table with a lot of holes. Help preserving the holes with lseek().

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 entry.c |   21 +++++++++++++++++++--
 1 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/entry.c b/entry.c
index 7733a6b..f751c60 100644
--- a/entry.c
+++ b/entry.c
@@ -123,6 +123,7 @@ static int streaming_write_entry(struct cache_entry *ce, char *path,
 	enum object_type type;
 	unsigned long sz;
 	int result = -1;
+	ssize_t kept = 0;
 	int fd = -1;
 
 	st = open_istream(ce->sha1, &type, &sz);
@@ -136,18 +137,34 @@ static int streaming_write_entry(struct cache_entry *ce, char *path,
 		goto close_and_exit;
 
 	for (;;) {
-		char buf[10240];
-		ssize_t wrote;
+		char buf[1024 * 16];
+		ssize_t wrote, holeto;
 		ssize_t readlen = read_istream(st, buf, sizeof(buf));
 
 		if (!readlen)
 			break;
+		if (sizeof(buf) == readlen) {
+			for (holeto = 0; holeto < readlen; holeto++)
+				if (buf[holeto])
+					break;
+			if (readlen == holeto) {
+				kept += holeto;
+				continue;
+			}
+		}
 
+		if (kept && lseek(fd, kept, SEEK_CUR) == (off_t) -1)
+			goto close_and_exit;
+		else
+			kept = 0;
 		wrote = write_in_full(fd, buf, readlen);
 
 		if (wrote != readlen)
 			goto close_and_exit;
 	}
+	if (kept && (lseek(fd, kept - 1, SEEK_CUR) == (off_t) -1 ||
+		     write(fd, "", 1) != 1))
+		goto close_and_exit;
 	*fstat_done = fstat_output(fd, state, statbuf);
 
 close_and_exit:
-- 
1.7.5.1.416.gac10c8

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v2 09/11] streaming: read non-delta incrementally from a pack
  2011-05-19 21:33 ` [PATCH v2 " Junio C Hamano
                     ` (7 preceding siblings ...)
  2011-05-19 21:33   ` [PATCH v2 08/11] streaming_write_entry(): support files with holes Junio C Hamano
@ 2011-05-19 21:33   ` Junio C Hamano
  2011-05-19 21:33   ` [PATCH v2 10/11] sha1_file.c: expose helpers to read loose objects Junio C Hamano
                     ` (3 subsequent siblings)
  12 siblings, 0 replies; 49+ messages in thread
From: Junio C Hamano @ 2011-05-19 21:33 UTC (permalink / raw)
  To: git

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 streaming.c |  105 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 102 insertions(+), 3 deletions(-)

diff --git a/streaming.c b/streaming.c
index 84330b4..fbe8eb6 100644
--- a/streaming.c
+++ b/streaming.c
@@ -52,6 +52,8 @@ struct git_istream {
 	enum input_source source;
 	const struct stream_vtbl *vtbl;
 	unsigned long size; /* inflated size of full object */
+	z_stream z;
+	enum { z_unused, z_used, z_done, z_error } z_state;
 
 	union {
 		struct {
@@ -65,8 +67,8 @@ struct git_istream {
 		} loose;
 
 		struct {
-			int fd; /* open for reading */
-			/* NEEDSWORK: what else? */
+			struct packed_git *pack;
+			off_t pos;
 		} in_pack;
 	} u;
 };
@@ -130,6 +132,20 @@ struct git_istream *open_istream(const unsigned char *sha1,
 	return st;
 }
 
+
+/*****************************************************************
+ *
+ * Common helpers
+ *
+ *****************************************************************/
+
+static void close_deflated_stream(struct git_istream *st)
+{
+	if (st->z_state == z_used)
+		git_inflate_end(&st->z);
+}
+
+
 /*****************************************************************
  *
  * Loose object stream
@@ -148,9 +164,92 @@ static open_method_decl(loose)
  *
  *****************************************************************/
 
+static read_method_decl(pack_non_delta)
+{
+	size_t total_read = 0;
+
+	switch (st->z_state) {
+	case z_unused:
+		memset(&st->z, 0, sizeof(st->z));
+		git_inflate_init(&st->z);
+		st->z_state = z_used;
+		break;
+	case z_done:
+		return 0;
+	case z_error:
+		return -1;
+	case z_used:
+		break;
+	}
+
+	while (total_read < sz) {
+		int status;
+		struct pack_window *window = NULL;
+		unsigned char *mapped;
+
+		mapped = use_pack(st->u.in_pack.pack, &window,
+				  st->u.in_pack.pos, &st->z.avail_in);
+
+		st->z.next_out = (unsigned char *)buf + total_read;
+		st->z.avail_out = sz - total_read;
+		st->z.next_in = mapped;
+		status = git_inflate(&st->z, Z_FINISH);
+
+		st->u.in_pack.pos += st->z.next_in - mapped;
+		total_read = st->z.next_out - (unsigned char *)buf;
+		unuse_pack(&window);
+
+		if (status == Z_STREAM_END) {
+			git_inflate_end(&st->z);
+			st->z_state = z_done;
+			break;
+		}
+		if (status != Z_OK && status != Z_BUF_ERROR) {
+			git_inflate_end(&st->z);
+			st->z_state = z_error;
+			return -1;
+		}
+	}
+	return total_read;
+}
+
+static close_method_decl(pack_non_delta)
+{
+	close_deflated_stream(st);
+	return 0;
+}
+
+static struct stream_vtbl pack_non_delta_vtbl = {
+	close_istream_pack_non_delta,
+	read_istream_pack_non_delta,
+};
+
 static open_method_decl(pack_non_delta)
 {
-	return -1; /* for now */
+	struct pack_window *window;
+	enum object_type in_pack_type;
+
+	st->u.in_pack.pack = oi->u.packed.pack;
+	st->u.in_pack.pos = oi->u.packed.offset;
+	window = NULL;
+
+	in_pack_type = unpack_object_header(st->u.in_pack.pack,
+					    &window,
+					    &st->u.in_pack.pos,
+					    &st->size);
+	unuse_pack(&window);
+	switch (in_pack_type) {
+	default:
+		return -1; /* we do not do deltas for now */
+	case OBJ_COMMIT:
+	case OBJ_TREE:
+	case OBJ_BLOB:
+	case OBJ_TAG:
+		break;
+	}
+	st->z_state = z_unused;
+	st->vtbl = &pack_non_delta_vtbl;
+	return 0;
 }
 
 
-- 
1.7.5.1.416.gac10c8

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v2 10/11] sha1_file.c: expose helpers to read loose objects
  2011-05-19 21:33 ` [PATCH v2 " Junio C Hamano
                     ` (8 preceding siblings ...)
  2011-05-19 21:33   ` [PATCH v2 09/11] streaming: read non-delta incrementally from a pack Junio C Hamano
@ 2011-05-19 21:33   ` Junio C Hamano
  2011-05-19 21:33   ` [PATCH v2 11/11] streaming: read loose objects incrementally Junio C Hamano
                     ` (2 subsequent siblings)
  12 siblings, 0 replies; 49+ messages in thread
From: Junio C Hamano @ 2011-05-19 21:33 UTC (permalink / raw)
  To: git

Make map_sha1_file(), parse_sha1_header() and unpack_sha1_header()
available to the streaming read API by exporting them via cache.h header
file.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 cache.h     |    3 +++
 sha1_file.c |    6 +++---
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/cache.h b/cache.h
index 8c5bb8a..52cc4d3 100644
--- a/cache.h
+++ b/cache.h
@@ -780,6 +780,9 @@ extern int hash_sha1_file(const void *buf, unsigned long len, const char *type,
 extern int write_sha1_file(const void *buf, unsigned long len, const char *type, unsigned char *return_sha1);
 extern int pretend_sha1_file(void *, unsigned long, enum object_type, unsigned char *);
 extern int force_object_loose(const unsigned char *sha1, time_t mtime);
+extern void *map_sha1_file(const unsigned char *sha1, unsigned long *size);
+extern int unpack_sha1_header(z_stream *stream, unsigned char *map, unsigned long mapsize, void *buffer, unsigned long bufsiz);
+extern int parse_sha1_header(const char *hdr, unsigned long *sizep);
 
 /* global flag to enable extra checks when accessing packed objects */
 extern int do_check_packed_object_crc;
diff --git a/sha1_file.c b/sha1_file.c
index a28683a..5fc877f 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1186,7 +1186,7 @@ static int open_sha1_file(const unsigned char *sha1)
 	return -1;
 }
 
-static void *map_sha1_file(const unsigned char *sha1, unsigned long *size)
+void *map_sha1_file(const unsigned char *sha1, unsigned long *size)
 {
 	void *map;
 	int fd;
@@ -1245,7 +1245,7 @@ unsigned long unpack_object_header_buffer(const unsigned char *buf,
 	return used;
 }
 
-static int unpack_sha1_header(z_stream *stream, unsigned char *map, unsigned long mapsize, void *buffer, unsigned long bufsiz)
+int unpack_sha1_header(z_stream *stream, unsigned char *map, unsigned long mapsize, void *buffer, unsigned long bufsiz)
 {
 	unsigned long size, used;
 	static const char valid_loose_object_type[8] = {
@@ -1342,7 +1342,7 @@ static void *unpack_sha1_rest(z_stream *stream, void *buffer, unsigned long size
  * too permissive for what we want to check. So do an anal
  * object header parse by hand.
  */
-static int parse_sha1_header(const char *hdr, unsigned long *sizep)
+int parse_sha1_header(const char *hdr, unsigned long *sizep)
 {
 	char type[10];
 	int i;
-- 
1.7.5.1.416.gac10c8

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v2 11/11] streaming: read loose objects incrementally
  2011-05-19 21:33 ` [PATCH v2 " Junio C Hamano
                     ` (9 preceding siblings ...)
  2011-05-19 21:33   ` [PATCH v2 10/11] sha1_file.c: expose helpers to read loose objects Junio C Hamano
@ 2011-05-19 21:33   ` Junio C Hamano
  2011-05-19 21:44   ` [Not A PATCH v2 02/11] interdiff Junio C Hamano
  2011-05-19 22:21   ` [PATCH v2 00/11] writing out a huge blob to working tree Jeff King
  12 siblings, 0 replies; 49+ messages in thread
From: Junio C Hamano @ 2011-05-19 21:33 UTC (permalink / raw)
  To: git

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 streaming.c |   85 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 82 insertions(+), 3 deletions(-)

diff --git a/streaming.c b/streaming.c
index fbe8eb6..76a4f4d 100644
--- a/streaming.c
+++ b/streaming.c
@@ -62,8 +62,11 @@ struct git_istream {
 		} incore;
 
 		struct {
-			int fd; /* open for reading */
-			/* NEEDSWORK: what else? */
+			void *mapped;
+			unsigned long mapsize;
+			char hdr[32];
+			int hdr_avail;
+			int hdr_used;
 		} loose;
 
 		struct {
@@ -152,9 +155,85 @@ static void close_deflated_stream(struct git_istream *st)
  *
  *****************************************************************/
 
+static read_method_decl(loose)
+{
+	size_t total_read = 0;
+
+	switch (st->z_state) {
+	case z_done:
+		return 0;
+	case z_error:
+		return -1;
+	default:
+		break;
+	}
+
+	if (st->u.loose.hdr_used < st->u.loose.hdr_avail) {
+		size_t to_copy = st->u.loose.hdr_avail - st->u.loose.hdr_used;
+		if (sz < to_copy)
+			to_copy = sz;
+		memcpy(buf, st->u.loose.hdr + st->u.loose.hdr_used, to_copy);
+		st->u.loose.hdr_used += to_copy;
+		total_read += to_copy;
+	}
+
+	while (total_read < sz) {
+		int status;
+
+		st->z.next_out = (unsigned char *)buf + total_read;
+		st->z.avail_out = sz - total_read;
+		status = git_inflate(&st->z, Z_FINISH);
+
+		total_read = st->z.next_out - (unsigned char *)buf;
+
+		if (status == Z_STREAM_END) {
+			git_inflate_end(&st->z);
+			st->z_state = z_done;
+			break;
+		}
+		if (status != Z_OK && status != Z_BUF_ERROR) {
+			git_inflate_end(&st->z);
+			st->z_state = z_error;
+			return -1;
+		}
+	}
+	return total_read;
+}
+
+static close_method_decl(loose)
+{
+	close_deflated_stream(st);
+	munmap(st->u.loose.mapped, st->u.loose.mapsize);
+	return 0;
+}
+
+static struct stream_vtbl loose_vtbl = {
+	close_istream_loose,
+	read_istream_loose,
+};
+
 static open_method_decl(loose)
 {
-	return -1; /* for now */
+	st->u.loose.mapped = map_sha1_file(sha1, &st->u.loose.mapsize);
+	if (!st->u.loose.mapped)
+		return -1;
+	if (unpack_sha1_header(&st->z,
+			       st->u.loose.mapped,
+			       st->u.loose.mapsize,
+			       st->u.loose.hdr,
+			       sizeof(st->u.loose.hdr)) < 0) {
+		git_inflate_end(&st->z);
+		munmap(st->u.loose.mapped, st->u.loose.mapsize);
+		return -1;
+	}
+
+	parse_sha1_header(st->u.loose.hdr, &st->size);
+	st->u.loose.hdr_used = strlen(st->u.loose.hdr) + 1;
+	st->u.loose.hdr_avail = st->z.total_out;
+	st->z_state = z_used;
+
+	st->vtbl = &loose_vtbl;
+	return 0;
 }
 
 
-- 
1.7.5.1.416.gac10c8

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [Not A PATCH v2 02/11] interdiff
  2011-05-19 21:33 ` [PATCH v2 " Junio C Hamano
                     ` (10 preceding siblings ...)
  2011-05-19 21:33   ` [PATCH v2 11/11] streaming: read loose objects incrementally Junio C Hamano
@ 2011-05-19 21:44   ` Junio C Hamano
  2011-05-19 22:21   ` [PATCH v2 00/11] writing out a huge blob to working tree Jeff King
  12 siblings, 0 replies; 49+ messages in thread
From: Junio C Hamano @ 2011-05-19 21:44 UTC (permalink / raw)
  To: git

This is an interdiff for 02/11 from the earlier round.

As oi->u.packed.delta is renamed to is_delta, and oi->want_deltainfo is
lost with this change, the later patches have necessary adjustments but
they are all trivial so I wouldn't be spamming the list with the interdiff
for them since the first round.

---
 cache.h     |    5 ++---
 sha1_file.c |   30 +++++++++---------------------
 2 files changed, 11 insertions(+), 24 deletions(-)

diff --git a/cache.h b/cache.h
index dfd2d61..9fbc07e 100644
--- a/cache.h
+++ b/cache.h
@@ -1025,7 +1025,6 @@ extern int packed_object_info_detail(struct packed_git *, off_t, unsigned long *
 struct object_info {
 	/* Request */
 	unsigned long *sizep;
-	int want_deltainfo;
 
 	/* Response */
 	enum {
@@ -1043,9 +1042,9 @@ struct object_info {
 		 * } loose;
 		 */
 		struct {
-			off_t offset;
-			unsigned int delta;
 			struct packed_git *pack;
+			off_t offset;
+			unsigned int is_delta;
 		} packed;
 	} u;
 };
diff --git a/sha1_file.c b/sha1_file.c
index 9b16cd8..7eed316 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1481,7 +1481,7 @@ static off_t get_delta_base(struct packed_git *p,
 
 /* forward declaration for a mutually recursive function */
 static int packed_object_info(struct packed_git *p, off_t offset,
-			      unsigned long *sizep);
+			      unsigned long *sizep, int *rtype);
 
 static int packed_delta_info(struct packed_git *p,
 			     struct pack_window **w_curs,
@@ -1495,7 +1495,7 @@ static int packed_delta_info(struct packed_git *p,
 	base_offset = get_delta_base(p, w_curs, &curpos, type, obj_offset);
 	if (!base_offset)
 		return OBJ_BAD;
-	type = packed_object_info(p, base_offset, NULL);
+	type = packed_object_info(p, base_offset, NULL, NULL);
 	if (type <= OBJ_NONE) {
 		struct revindex_entry *revidx;
 		const unsigned char *base_sha1;
@@ -1605,7 +1605,7 @@ int packed_object_info_detail(struct packed_git *p,
 }
 
 static int packed_object_info(struct packed_git *p, off_t obj_offset,
-			      unsigned long *sizep)
+			      unsigned long *sizep, int *rtype)
 {
 	struct pack_window *w_curs = NULL;
 	unsigned long size;
@@ -1613,6 +1613,8 @@ static int packed_object_info(struct packed_git *p, off_t obj_offset,
 	enum object_type type;
 
 	type = unpack_object_header(p, &w_curs, &curpos, &size);
+	if (rtype)
+		*rtype = type; /* representation type */
 
 	switch (type) {
 	case OBJ_OFS_DELTA:
@@ -2098,7 +2100,7 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi)
 {
 	struct cached_object *co;
 	struct pack_entry e;
-	int status;
+	int status, rtype;
 
 	co = find_cached_object(sha1);
 	if (co) {
@@ -2122,22 +2124,7 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi)
 			return status;
 	}
 
-	if (!oi->want_deltainfo) {
-		status = packed_object_info(e.p, e.offset, oi->sizep);
-	} else {
-		unsigned long size, store_size;
-		unsigned int delta_chain_length;
-		unsigned char base_sha1[20];
-		status = packed_object_info_detail(e.p, e.offset,
-						   &size, &store_size,
-						   &delta_chain_length,
-						   base_sha1);
-		if (0 <= status) {
-			if (oi->sizep)
-				*oi->sizep = size;
-			oi->u.packed.delta = delta_chain_length;
-		}
-	}
+	status = packed_object_info(e.p, e.offset, oi->sizep, &rtype);
 	if (status < 0) {
 		mark_bad_packed_object(e.p, sha1);
 		status = sha1_object_info_extended(sha1, oi);
@@ -2145,6 +2132,8 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi)
 		oi->whence = OI_PACKED;
 		oi->u.packed.offset = e.offset;
 		oi->u.packed.pack = e.p;
+		oi->u.packed.is_delta = (rtype == OBJ_REF_DELTA ||
+					 rtype == OBJ_OFS_DELTA);
 	}
 
 	return status;
@@ -2155,7 +2144,6 @@ int sha1_object_info(const unsigned char *sha1, unsigned long *sizep)
 	struct object_info oi;
 
 	oi.sizep = sizep;
-	oi.want_deltainfo = 0;
 	return sha1_object_info_extended(sha1, &oi);
 }
 
-- 
1.7.5.1.416.gac10c8

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 00/11] writing out a huge blob to working tree
  2011-05-19 21:33 ` [PATCH v2 " Junio C Hamano
                     ` (11 preceding siblings ...)
  2011-05-19 21:44   ` [Not A PATCH v2 02/11] interdiff Junio C Hamano
@ 2011-05-19 22:21   ` Jeff King
  12 siblings, 0 replies; 49+ messages in thread
From: Jeff King @ 2011-05-19 22:21 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Thu, May 19, 2011 at 02:33:35PM -0700, Junio C Hamano wrote:

> Also sha1_object_info_extended() lost the call to the expensive function
> packed_object_info_detail(), as the only thing we are interested in is to
> see if the first-level object is a non-delta.  As the result, 02/11 would
> now be much easier to follow.

Thanks, I can confirm that this clears up the performance issue in my
test. I have no idea why it wasn't totally fixed with the patch I posted
earlier. I must have botched something.

-Peff

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 07/11] streaming_write_entry(): use streaming API in write_entry()
  2011-05-19 21:33   ` [PATCH v2 07/11] streaming_write_entry(): use streaming API in write_entry() Junio C Hamano
@ 2011-05-20 22:52     ` Junio C Hamano
  0 siblings, 0 replies; 49+ messages in thread
From: Junio C Hamano @ 2011-05-20 22:52 UTC (permalink / raw)
  To: git

Junio C Hamano <gitster@pobox.com> writes:

> When the output to a path does not have to be converted, we can read from
> the object database from the streaming API and write to the file in the
> working tree, without having to hold everything in the memory.

There was an embarrassing bug hiding here.  The way I found it was doubly
embarrassing.

> diff --git a/entry.c b/entry.c
> index cc6502a..7733a6b 100644
> --- a/entry.c
> +++ b/entry.c
> @@ -114,6 +115,50 @@ static int fstat_output(int fd, const struct checkout *state, struct stat *st)
>  	return 0;
>  }
>  
> +static int streaming_write_entry(struct cache_entry *ce, char *path,
> +				 const struct checkout *state, int to_tempfile,
> +				 int *fstat_done, struct stat *statbuf)
> +{
> +	struct git_istream *st;
> +	enum object_type type;
> +	unsigned long sz;
> +	int result = -1;

We see result is initialized to -1 here...

> +	int fd = -1;
> +
> +	st = open_istream(ce->sha1, &type, &sz);
> + ...
> +close_and_exit:
> +	close_istream(st);

...but nobody touches it before coming here via various codepaths.  If I
haven't opened the output, fd is -1, but if I have, then the condition of
the next if statement is met, and I close the output.

> +	if (0 <= fd)
> +		result |= close(fd);

Oops. This is the embarrassment. This has to be an assignment, not OR-ing
it in. I am not clearing the result in a successful case.

> +	if (result && 0 <= fd)
> +		unlink(path);
> +	return result;

Hence, even though the codepath fully wrote out, we run unlink(path), and
return a failure.

> @@ -124,6 +169,12 @@ static int write_entry(struct cache_entry *ce, char *path, const struct checkout
>  	size_t wrote, newsize = 0;
>  	struct stat st;
>  
> +	if ((ce_mode_s_ifmt == S_IFREG) &&
> +	    can_bypass_conversion(path) &&
> +	    !streaming_write_entry(ce, path, state, to_tempfile,
> +				   &fstat_done, &st))
> +		goto finish;

Then the caller just thinks that the contents could not be streamed
(e.g. the data is in pack deltified) and takes the fall-back codepath to
fix everything up, hiding the above bug.

>  	switch (ce_mode_s_ifmt) {
>  	case S_IFREG:
>  	case S_IFLNK:
>		... fall-back code follows ...

The other embarrassment is the way how I found the bug.  I was trying to
rewrite this "if regular file and can bypass conversion, try streaming"
logic, and botched the result to jump to "finish" when streaming write
returned failure.

Will re-queue with the obvious fix squashed in.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 03/11] sha1_object_info_extended(): hint about objects in delta-base cache
  2011-05-19 21:33   ` [PATCH v2 03/11] sha1_object_info_extended(): hint about objects in delta-base cache Junio C Hamano
@ 2011-05-20 23:05     ` René Scharfe
  2011-05-21  1:49       ` Junio C Hamano
  0 siblings, 1 reply; 49+ messages in thread
From: René Scharfe @ 2011-05-20 23:05 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Am 19.05.2011 23:33, schrieb Junio C Hamano:
> --- a/cache.h
> +++ b/cache.h
> @@ -1030,7 +1030,8 @@ struct object_info {
>  	enum {
>  		OI_CACHED,
>  		OI_LOOSE,
> -		OI_PACKED
> +		OI_PACKED,
> +		OI_DBCACHED,
>  	} whence;

Some compilers don't like trailing commas in enums.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 06/11] streaming: a new API to read from the object store
  2011-05-19 21:33   ` [PATCH v2 06/11] streaming: a new API to read from the object store Junio C Hamano
@ 2011-05-20 23:05     ` René Scharfe
  2011-05-21  1:49       ` Junio C Hamano
  0 siblings, 1 reply; 49+ messages in thread
From: René Scharfe @ 2011-05-20 23:05 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Am 19.05.2011 23:33, schrieb Junio C Hamano:
> Given an object name, use open_istream() to get a git_istream handle
> that you can read_istream() from as if you are using read(2) to read
> the contents of the object, and close it with close_istream() when
> you are done.
> 
> Currently, we do not do anything fancy--it just calls read_sha1_file()
> and keeps the contents in memory as a whole, and carve it out as you
> request with read_istream().
> 
> Signed-off-by: Junio C Hamano <gitster@pobox.com>
> ---
>  Makefile    |    2 +
>  streaming.c |  199 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  streaming.h |   15 +++++
>  3 files changed, 216 insertions(+), 0 deletions(-)
>  create mode 100644 streaming.c
>  create mode 100644 streaming.h
> 
> diff --git a/Makefile b/Makefile
> index 320ccc7..83bd539 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -552,6 +552,7 @@ LIB_H += sha1-lookup.h
>  LIB_H += sideband.h
>  LIB_H += sigchain.h
>  LIB_H += strbuf.h
> +LIB_H += streaming.h
>  LIB_H += string-list.h
>  LIB_H += submodule.h
>  LIB_H += tag.h
> @@ -657,6 +658,7 @@ LIB_OBJS += shallow.o
>  LIB_OBJS += sideband.o
>  LIB_OBJS += sigchain.o
>  LIB_OBJS += strbuf.o
> +LIB_OBJS += streaming.o
>  LIB_OBJS += string-list.o
>  LIB_OBJS += submodule.o
>  LIB_OBJS += symlinks.o
> diff --git a/streaming.c b/streaming.c
> new file mode 100644
> index 0000000..84330b4
> --- /dev/null
> +++ b/streaming.c
> @@ -0,0 +1,199 @@
> +/*
> + * Copyright (c) 2011, Google Inc.
> + */
> +#include "cache.h"
> +#include "streaming.h"
> +
> +enum input_source {
> +	stream_error = -1,
> +	incore = 0,
> +	loose = 1,
> +	pack_non_delta = 2
> +};
> +
> +typedef int (*open_istream_fn)(struct git_istream *,
> +			       struct object_info *,
> +			       const unsigned char *,
> +			       enum object_type *);
> +typedef int (*close_istream_fn)(struct git_istream *);
> +typedef ssize_t (*read_istream_fn)(struct git_istream *, char *, size_t);
> +
> +struct stream_vtbl {
> +	close_istream_fn close;
> +	read_istream_fn read;
> +};
> +
> +#define open_method_decl(name) \
> +	int open_istream_ ##name \
> +	(struct git_istream *st, struct object_info *oi, \
> +	 const unsigned char *sha1, \
> +	 enum object_type *type)
> +
> +#define close_method_decl(name) \
> +	int close_istream_ ##name \
> +	(struct git_istream *st)
> +
> +#define read_method_decl(name) \
> +	ssize_t read_istream_ ##name \
> +	(struct git_istream *st, char *buf, size_t sz)

It would be nice if those macros could be got rid of once the interface
stabilizes.

> +
> +/* forward declaration */
> +static open_method_decl(incore);
> +static open_method_decl(loose);
> +static open_method_decl(pack_non_delta);
> +
> +static open_istream_fn open_istream_tbl[] = {
> +	open_istream_incore,
> +	open_istream_loose,
> +	open_istream_pack_non_delta,
> +};

These three uses of the macro can be avoided by moving open_istream_tbl
and open_istream() to the end of the file.  It would be just as clear
and clean, albeit not as close to literal programming style.

> +
> +struct git_istream {
> +	enum input_source source;

source seems to be write-only.

> +	const struct stream_vtbl *vtbl;
> +	unsigned long size; /* inflated size of full object */
> +
> +	union {
> +		struct {
> +			char *buf; /* from read_object() */
> +			unsigned long read_ptr;
> +		} incore;
> +
> +		struct {
> +			int fd; /* open for reading */
> +			/* NEEDSWORK: what else? */
> +		} loose;
> +
> +		struct {
> +			int fd; /* open for reading */
> +			/* NEEDSWORK: what else? */
> +		} in_pack;
> +	} u;
> +};
> +
> +int close_istream(struct git_istream *st)
> +{
> +	return st->vtbl->close(st);
> +}
> +
> +ssize_t read_istream(struct git_istream *st, char *buf, size_t sz)
> +{
> +	return st->vtbl->read(st, buf, sz);
> +}
> +
> +static enum input_source istream_source(const unsigned char *sha1,
> +					enum object_type *type,
> +					struct object_info *oi)
> +{
> +	unsigned long size;
> +	int status;
> +
> +	oi->sizep = &size;
> +	status = sha1_object_info_extended(sha1, oi);
> +	if (status < 0)
> +		return stream_error;
> +	*type = status;
> +
> +	switch (oi->whence) {
> +	case OI_LOOSE:
> +		return loose;
> +	case OI_PACKED:
> +		if (!oi->u.packed.is_delta && big_file_threshold <= size)
> +			return pack_non_delta;
> +		/* fallthru */
> +	default:
> +		return incore;
> +	}
> +}
> +
> +struct git_istream *open_istream(const unsigned char *sha1,
> +				 enum object_type *type,
> +				 unsigned long *size)
> +{
> +	struct git_istream *st;
> +	struct object_info oi;
> +	const unsigned char *real = lookup_replace_object(sha1);
> +	enum input_source src = istream_source(real, type, &oi);
> +
> +	if (src < 0)
> +		return NULL;
> +
> +	st = xmalloc(sizeof(*st));
> +	st->source = src;
> +	if (open_istream_tbl[src](st, &oi, real, type)) {
> +		if (open_istream_incore(st, &oi, real, type)) {
> +			free(st);
> +			st = NULL;

			return NULL;
			// Otherwise we get a problem three lines down.

> +		}
> +	}
> +	*size = st->size;
> +	return st;
> +}
> +
> +/*****************************************************************
> + *
> + * Loose object stream
> + *
> + *****************************************************************/
> +
> +static open_method_decl(loose)
> +{
> +	return -1; /* for now */
> +}
> +
> +
> +/*****************************************************************
> + *
> + * Non-delta packed object stream
> + *
> + *****************************************************************/
> +
> +static open_method_decl(pack_non_delta)
> +{
> +	return -1; /* for now */
> +}
> +
> +
> +/*****************************************************************
> + *
> + * In-core stream
> + *
> + *****************************************************************/
> +
> +static close_method_decl(incore)
> +{
> +	free(st->u.incore.buf);
> +	return 0;
> +}
> +
> +static read_method_decl(incore)
> +{
> +	size_t read_size = sz;
> +	size_t remainder = st->size - st->u.incore.read_ptr;
> +
> +	if (remainder <= read_size)
> +		read_size = remainder;
> +	if (read_size) {
> +		memcpy(buf, st->u.incore.buf + st->u.incore.read_ptr, read_size);
> +		st->u.incore.read_ptr += read_size;
> +	}
> +	return read_size;
> +}
> +
> +static struct stream_vtbl incore_vtbl = {
> +	close_istream_incore,
> +	read_istream_incore,
> +};
> +
> +static open_method_decl(incore)
> +{
> +	st->u.incore.buf = read_sha1_file_extended(sha1, type, &st->size, 0);
> +	st->u.incore.read_ptr = 0;
> +	st->vtbl = &incore_vtbl;
> +
> +	if (!st->u.incore.buf) {
> +		free(st->u.incore.buf);

free(NULL) is a noop.

> +		return -1;
> +	}
> +	return 0;
> +}
> diff --git a/streaming.h b/streaming.h
> new file mode 100644
> index 0000000..18cbe68
> --- /dev/null
> +++ b/streaming.h
> @@ -0,0 +1,15 @@
> +/*
> + * Copyright (c) 2011, Google Inc.
> + */
> +#ifndef STREAMING_H
> +#define STREAMING_H 1
> +#include "cache.h"
> +
> +/* opaque */
> +struct git_istream;
> +
> +extern struct git_istream *open_istream(const unsigned char *, enum object_type *, unsigned long *);
> +extern int close_istream(struct git_istream *);
> +extern ssize_t read_istream(struct git_istream *, char *, size_t);
> +
> +#endif /* STREAMING_H */

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 06/11] streaming: a new API to read from the object store
  2011-05-20 23:05     ` René Scharfe
@ 2011-05-21  1:49       ` Junio C Hamano
  0 siblings, 0 replies; 49+ messages in thread
From: Junio C Hamano @ 2011-05-21  1:49 UTC (permalink / raw)
  To: René Scharfe; +Cc: git

René Scharfe <rene.scharfe@lsrfire.ath.cx> writes:

>> +#define read_method_decl(name) \
>> +	ssize_t read_istream_ ##name \
>> +	(struct git_istream *st, char *buf, size_t sz)
>
> It would be nice if those macros could be got rid of once the interface
> stabilizes.

Probably, but not while it is still in flux in 'pu'. I already had to
tweak something to support my unpublished series I was working on today.

>> +struct git_istream {
>> +	enum input_source source;
>
> source seems to be write-only.

Yes, I had this initially but later ended up with a design that makes
everything go through vtbl, so this is only useful for debugging and can
be removed.

Thanks.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 03/11] sha1_object_info_extended(): hint about objects in delta-base cache
  2011-05-20 23:05     ` René Scharfe
@ 2011-05-21  1:49       ` Junio C Hamano
  0 siblings, 0 replies; 49+ messages in thread
From: Junio C Hamano @ 2011-05-21  1:49 UTC (permalink / raw)
  To: René Scharfe; +Cc: git

René Scharfe <rene.scharfe@lsrfire.ath.cx> writes:

> Am 19.05.2011 23:33, schrieb Junio C Hamano:
>> --- a/cache.h
>> +++ b/cache.h
>> @@ -1030,7 +1030,8 @@ struct object_info {
>>  	enum {
>>  		OI_CACHED,
>>  		OI_LOOSE,
>> -		OI_PACKED
>> +		OI_PACKED,
>> +		OI_DBCACHED,
>>  	} whence;
>
> Some compilers don't like trailing commas in enums.

Right; thanks.

^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2011-05-21  1:50 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-05-16  0:30 [PATCH 00/11] writing out a huge blob to working tree Junio C Hamano
2011-05-16  0:30 ` [PATCH 01/11] packed_object_info_detail(): do not return a string Junio C Hamano
2011-05-17  0:45   ` Thiago Farina
2011-05-17  2:36     ` Junio C Hamano
2011-05-16  0:30 ` [PATCH 02/11] sha1_object_info_extended(): expose a bit more info Junio C Hamano
2011-05-16  0:30 ` [PATCH 03/11] sha1_object_info_extended(): hint about objects in delta-base cache Junio C Hamano
2011-05-16  0:40   ` Shawn Pearce
2011-05-16  0:30 ` [PATCH 04/11] unpack_object_header(): make it public Junio C Hamano
2011-05-16  0:30 ` [PATCH 05/11] write_entry(): separate two helper functions out Junio C Hamano
2011-05-16  0:30 ` [PATCH 06/11] streaming: a new API to read from the object store Junio C Hamano
2011-05-18  8:09   ` Jeff King
2011-05-19  1:52     ` Junio C Hamano
2011-05-16  0:30 ` [PATCH 07/11] streaming_write_entry(): use streaming API in write_entry() Junio C Hamano
2011-05-16  0:30 ` [PATCH 08/11] streaming_write_entry(): support files with holes Junio C Hamano
2011-05-16 10:53   ` Nguyen Thai Ngoc Duy
2011-05-16 14:39     ` Junio C Hamano
2011-05-17  1:18       ` Nguyen Thai Ngoc Duy
2011-05-17  5:23         ` Junio C Hamano
2011-05-16 13:03   ` Thiago Farina
2011-05-16  0:30 ` [PATCH 09/11] streaming: read non-delta incrementally from a pack Junio C Hamano
2011-05-16  0:58   ` Shawn Pearce
2011-05-16  5:00     ` Junio C Hamano
2011-05-16  0:30 ` [PATCH 10/11] sha1_file.c: expose helpers to read loose objects Junio C Hamano
2011-05-16  0:30 ` [PATCH 11/11] streaming: read loose objects incrementally Junio C Hamano
2011-05-16  0:47 ` [PATCH 00/11] writing out a huge blob to working tree Shawn Pearce
2011-05-18  6:41 ` Jeff King
2011-05-18  7:08   ` Jeff King
2011-05-18  7:50     ` Jeff King
2011-05-18 15:12       ` Junio C Hamano
2011-05-18  8:17 ` Jeff King
2011-05-19 21:33 ` [PATCH v2 " Junio C Hamano
2011-05-19 21:33   ` [PATCH v2 01/11] packed_object_info_detail(): do not return a string Junio C Hamano
2011-05-19 21:33   ` [PATCH v2 02/11] sha1_object_info_extended(): expose a bit more info Junio C Hamano
2011-05-19 21:33   ` [PATCH v2 03/11] sha1_object_info_extended(): hint about objects in delta-base cache Junio C Hamano
2011-05-20 23:05     ` René Scharfe
2011-05-21  1:49       ` Junio C Hamano
2011-05-19 21:33   ` [PATCH v2 04/11] unpack_object_header(): make it public Junio C Hamano
2011-05-19 21:33   ` [PATCH v2 05/11] write_entry(): separate two helper functions out Junio C Hamano
2011-05-19 21:33   ` [PATCH v2 06/11] streaming: a new API to read from the object store Junio C Hamano
2011-05-20 23:05     ` René Scharfe
2011-05-21  1:49       ` Junio C Hamano
2011-05-19 21:33   ` [PATCH v2 07/11] streaming_write_entry(): use streaming API in write_entry() Junio C Hamano
2011-05-20 22:52     ` Junio C Hamano
2011-05-19 21:33   ` [PATCH v2 08/11] streaming_write_entry(): support files with holes Junio C Hamano
2011-05-19 21:33   ` [PATCH v2 09/11] streaming: read non-delta incrementally from a pack Junio C Hamano
2011-05-19 21:33   ` [PATCH v2 10/11] sha1_file.c: expose helpers to read loose objects Junio C Hamano
2011-05-19 21:33   ` [PATCH v2 11/11] streaming: read loose objects incrementally Junio C Hamano
2011-05-19 21:44   ` [Not A PATCH v2 02/11] interdiff Junio C Hamano
2011-05-19 22:21   ` [PATCH v2 00/11] writing out a huge blob to working tree Jeff King

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.