All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 00/11] Large blob fixes
@ 2012-02-27  7:55 Nguyễn Thái Ngọc Duy
  2012-02-27  7:55 ` [PATCH 01/11] Add more large blob test cases Nguyễn Thái Ngọc Duy
                   ` (12 more replies)
  0 siblings, 13 replies; 48+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2012-02-27  7:55 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy

These patches make sure we avoid keeping whole blob in memory, at
least in common cases. Blob-only streaming code paths are opened to
accomplish that.

I don't quite like having three different implementations for
checking sha-1 signature (one on git_istream, one on packed_git and
the other one in index-pack) but I failed to see how to unify them.

Making archive-zip work with stream can be difficult. But at least tar
format works. Good enough for me.

Nguyễn Thái Ngọc Duy (11):
  Add more large blob test cases
  Factor out and export large blob writing code to arbitrary file
    handle
  cat-file: use streaming interface to print blobs
  parse_object: special code path for blobs to avoid putting whole
    object in memory
  show: use streaming interface for showing blobs
  index-pack --verify: skip sha-1 collision test
  index-pack: split second pass obj handling into own function
  index-pack: reduce memory usage when the pack has large blobs
  pack-check: do not unpack blobs
  archive: support streaming large files to a tar archive
  fsck: use streaming interface for writing lost-found blobs

 archive-tar.c        |   35 +++++++++++++---
 archive-zip.c        |    9 ++--
 archive.c            |   51 ++++++++++++++++--------
 archive.h            |   11 ++++-
 builtin/cat-file.c   |   22 ++++++++++
 builtin/fsck.c       |    8 +---
 builtin/index-pack.c |  108 +++++++++++++++++++++++++++++++++++++------------
 builtin/log.c        |    9 ++++-
 cache.h              |    5 ++-
 entry.c              |   39 ++++++++++++------
 fast-import.c        |    2 +-
 object.c             |   11 +++++
 pack-check.c         |   21 +++++++++-
 sha1_file.c          |   78 +++++++++++++++++++++++++++++++-----
 t/t1050-large.sh     |   59 +++++++++++++++++++++++++++-
 wrapper.c            |   27 +++++++++++-
 16 files changed, 400 insertions(+), 95 deletions(-)

-- 
1.7.3.1.256.g2539c.dirty

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH 01/11] Add more large blob test cases
  2012-02-27  7:55 [PATCH 00/11] Large blob fixes Nguyễn Thái Ngọc Duy
@ 2012-02-27  7:55 ` Nguyễn Thái Ngọc Duy
  2012-02-27 20:18   ` Peter Baumann
  2012-02-27  7:55 ` [PATCH 02/11] Factor out and export large blob writing code to arbitrary file handle Nguyễn Thái Ngọc Duy
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 48+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2012-02-27  7:55 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy

New test cases list commands that should work when memory is
limited. All memory allocation functions (*) learn to reject any
allocation larger than $GIT_ALLOC_LIMIT if set.

(*) Not exactly all. Some places do not use x* functions, but
malloc/calloc directly, notably diff-delta. These could path should
never be run on large blobs.

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 t/t1050-large.sh |   59 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 wrapper.c        |   27 ++++++++++++++++++++++--
 2 files changed, 82 insertions(+), 4 deletions(-)

diff --git a/t/t1050-large.sh b/t/t1050-large.sh
index 29d6024..f245e59 100755
--- a/t/t1050-large.sh
+++ b/t/t1050-large.sh
@@ -10,7 +10,9 @@ test_expect_success setup '
 	echo X | dd of=large1 bs=1k seek=2000 &&
 	echo X | dd of=large2 bs=1k seek=2000 &&
 	echo X | dd of=large3 bs=1k seek=2000 &&
-	echo Y | dd of=huge bs=1k seek=2500
+	echo Y | dd of=huge bs=1k seek=2500 &&
+	GIT_ALLOC_LIMIT=1500 &&
+	export GIT_ALLOC_LIMIT
 '
 
 test_expect_success 'add a large file or two' '
@@ -100,4 +102,59 @@ test_expect_success 'packsize limit' '
 	)
 '
 
+test_expect_success 'diff --raw' '
+	git commit -q -m initial &&
+	echo modified >>large1 &&
+	git add large1 &&
+	git commit -q -m modified &&
+	git diff --raw HEAD^
+'
+
+test_expect_success 'hash-object' '
+	git hash-object large1
+'
+
+test_expect_failure 'cat-file a large file' '
+	git cat-file blob :large1 >/dev/null
+'
+
+test_expect_failure 'git-show a large file' '
+	git show :large1 >/dev/null
+
+'
+
+test_expect_failure 'clone' '
+	git clone -n file://"$PWD"/.git new &&
+	(
+	cd new &&
+	git config core.bigfilethreshold 200k &&
+	git checkout master
+	)
+'
+
+test_expect_failure 'fetch updates' '
+	echo modified >> large1 &&
+	git commit -q -a -m updated &&
+	(
+	cd new &&
+	git fetch --keep # FIXME should not need --keep
+	)
+'
+
+test_expect_failure 'fsck' '
+	git fsck --full
+'
+
+test_expect_success 'repack' '
+	git repack -ad
+'
+
+test_expect_failure 'tar achiving' '
+	git archive --format=tar HEAD >/dev/null
+'
+
+test_expect_failure 'zip achiving' '
+	git archive --format=zip HEAD >/dev/null
+'
+
 test_done
diff --git a/wrapper.c b/wrapper.c
index 85f09df..d4c0972 100644
--- a/wrapper.c
+++ b/wrapper.c
@@ -9,6 +9,18 @@ static void do_nothing(size_t size)
 
 static void (*try_to_free_routine)(size_t size) = do_nothing;
 
+static void memory_limit_check(size_t size)
+{
+	static int limit = -1;
+	if (limit == -1) {
+		const char *env = getenv("GIT_ALLOC_LIMIT");
+		limit = env ? atoi(env) * 1024 : 0;
+	}
+	if (limit && size > limit)
+		die("attempting to allocate %d over limit %d",
+		    size, limit);
+}
+
 try_to_free_t set_try_to_free_routine(try_to_free_t routine)
 {
 	try_to_free_t old = try_to_free_routine;
@@ -32,7 +44,10 @@ char *xstrdup(const char *str)
 
 void *xmalloc(size_t size)
 {
-	void *ret = malloc(size);
+	void *ret;
+
+	memory_limit_check(size);
+	ret = malloc(size);
 	if (!ret && !size)
 		ret = malloc(1);
 	if (!ret) {
@@ -79,7 +94,10 @@ char *xstrndup(const char *str, size_t len)
 
 void *xrealloc(void *ptr, size_t size)
 {
-	void *ret = realloc(ptr, size);
+	void *ret;
+
+	memory_limit_check(size);
+	ret = realloc(ptr, size);
 	if (!ret && !size)
 		ret = realloc(ptr, 1);
 	if (!ret) {
@@ -95,7 +113,10 @@ void *xrealloc(void *ptr, size_t size)
 
 void *xcalloc(size_t nmemb, size_t size)
 {
-	void *ret = calloc(nmemb, size);
+	void *ret;
+
+	memory_limit_check(size * nmemb);
+	ret = calloc(nmemb, size);
 	if (!ret && (!nmemb || !size))
 		ret = calloc(1, 1);
 	if (!ret) {
-- 
1.7.3.1.256.g2539c.dirty

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 02/11] Factor out and export large blob writing code to arbitrary file handle
  2012-02-27  7:55 [PATCH 00/11] Large blob fixes Nguyễn Thái Ngọc Duy
  2012-02-27  7:55 ` [PATCH 01/11] Add more large blob test cases Nguyễn Thái Ngọc Duy
@ 2012-02-27  7:55 ` Nguyễn Thái Ngọc Duy
  2012-02-27 17:29   ` Junio C Hamano
  2012-02-27  7:55 ` [PATCH 03/11] cat-file: use streaming interface to print blobs Nguyễn Thái Ngọc Duy
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 48+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2012-02-27  7:55 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy


Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 cache.h |    3 +++
 entry.c |   39 ++++++++++++++++++++++++++-------------
 2 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/cache.h b/cache.h
index e12b15f..6ce691b 100644
--- a/cache.h
+++ b/cache.h
@@ -937,6 +937,9 @@ struct checkout {
 		 refresh_cache:1;
 };
 
+extern int streaming_write_sha1(int fd, int seekable, const unsigned char *sha1,
+				enum object_type exp_type,
+				struct stream_filter *filter);
 extern int checkout_entry(struct cache_entry *ce, const struct checkout *state, char *topath);
 
 struct cache_def {
diff --git a/entry.c b/entry.c
index 852fea1..dde0d17 100644
--- a/entry.c
+++ b/entry.c
@@ -115,26 +115,20 @@ static int fstat_output(int fd, const struct checkout *state, struct stat *st)
 	return 0;
 }
 
-static int streaming_write_entry(struct cache_entry *ce, char *path,
-				 struct stream_filter *filter,
-				 const struct checkout *state, int to_tempfile,
-				 int *fstat_done, struct stat *statbuf)
+int streaming_write_sha1(int fd, int seekable, const unsigned char *sha1,
+			 enum object_type exp_type,
+			 struct stream_filter *filter)
 {
 	struct git_istream *st;
 	enum object_type type;
 	unsigned long sz;
 	int result = -1;
 	ssize_t kept = 0;
-	int fd = -1;
 
-	st = open_istream(ce->sha1, &type, &sz, filter);
+	st = open_istream(sha1, &type, &sz, filter);
 	if (!st)
 		return -1;
-	if (type != OBJ_BLOB)
-		goto close_and_exit;
-
-	fd = open_output_fd(path, ce, to_tempfile);
-	if (fd < 0)
+	if (exp_type != OBJ_ANY && type != exp_type)
 		goto close_and_exit;
 
 	for (;;) {
@@ -144,7 +138,7 @@ static int streaming_write_entry(struct cache_entry *ce, char *path,
 
 		if (!readlen)
 			break;
-		if (sizeof(buf) == readlen) {
+		if (seekable && sizeof(buf) == readlen) {
 			for (holeto = 0; holeto < readlen; holeto++)
 				if (buf[holeto])
 					break;
@@ -166,10 +160,29 @@ static int streaming_write_entry(struct cache_entry *ce, char *path,
 	if (kept && (lseek(fd, kept - 1, SEEK_CUR) == (off_t) -1 ||
 		     write(fd, "", 1) != 1))
 		goto close_and_exit;
-	*fstat_done = fstat_output(fd, state, statbuf);
+	result = 0;
 
 close_and_exit:
 	close_istream(st);
+	return result;
+}
+
+static int streaming_write_entry(struct cache_entry *ce, char *path,
+				 struct stream_filter *filter,
+				 const struct checkout *state, int to_tempfile,
+				 int *fstat_done, struct stat *statbuf)
+{
+	int result = -1;
+	int fd = open_output_fd(path, ce, to_tempfile);
+	if (fd < 0)
+		goto close_and_exit;
+
+	if (streaming_write_sha1(fd, 1, ce->sha1, OBJ_BLOB, filter))
+		goto close_and_exit;
+
+	*fstat_done = fstat_output(fd, state, statbuf);
+
+close_and_exit:
 	if (0 <= fd)
 		result = close(fd);
 	if (result && 0 <= fd)
-- 
1.7.3.1.256.g2539c.dirty

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 03/11] cat-file: use streaming interface to print blobs
  2012-02-27  7:55 [PATCH 00/11] Large blob fixes Nguyễn Thái Ngọc Duy
  2012-02-27  7:55 ` [PATCH 01/11] Add more large blob test cases Nguyễn Thái Ngọc Duy
  2012-02-27  7:55 ` [PATCH 02/11] Factor out and export large blob writing code to arbitrary file handle Nguyễn Thái Ngọc Duy
@ 2012-02-27  7:55 ` Nguyễn Thái Ngọc Duy
  2012-02-27 17:44   ` Junio C Hamano
  2012-02-27  7:55 ` [PATCH 04/11] parse_object: special code path for blobs to avoid putting whole object in memory Nguyễn Thái Ngọc Duy
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 48+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2012-02-27  7:55 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy


Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 builtin/cat-file.c |   22 ++++++++++++++++++++++
 t/t1050-large.sh   |    2 +-
 2 files changed, 23 insertions(+), 1 deletions(-)

diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 8ed501f..3f3b558 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -82,6 +82,24 @@ static void pprint_tag(const unsigned char *sha1, const char *buf, unsigned long
 		write_or_die(1, cp, endp - cp);
 }
 
+static int write_blob(const unsigned char *sha1)
+{
+	unsigned char new_sha1[20];
+
+	if (sha1_object_info(sha1, NULL) == OBJ_TAG) {
+		enum object_type type;
+		unsigned long size;
+		char *buffer = read_sha1_file(sha1, &type, &size);
+		if (memcmp(buffer, "object ", 7) ||
+		    get_sha1_hex(buffer + 7, new_sha1))
+			die("%s not a valid tag", sha1_to_hex(sha1));
+		sha1 = new_sha1;
+		free(buffer);
+	}
+
+	return streaming_write_sha1(1, 0, sha1, OBJ_BLOB, NULL);
+}
+
 static int cat_one_file(int opt, const char *exp_type, const char *obj_name)
 {
 	unsigned char sha1[20];
@@ -127,6 +145,8 @@ static int cat_one_file(int opt, const char *exp_type, const char *obj_name)
 			return cmd_ls_tree(2, ls_args, NULL);
 		}
 
+		if (type == OBJ_BLOB)
+			return write_blob(sha1);
 		buf = read_sha1_file(sha1, &type, &size);
 		if (!buf)
 			die("Cannot read object %s", obj_name);
@@ -149,6 +169,8 @@ static int cat_one_file(int opt, const char *exp_type, const char *obj_name)
 		break;
 
 	case 0:
+		if (type_from_string(exp_type) == OBJ_BLOB)
+			return write_blob(sha1);
 		buf = read_object_with_reference(sha1, exp_type, &size, NULL);
 		break;
 
diff --git a/t/t1050-large.sh b/t/t1050-large.sh
index f245e59..39a3e77 100755
--- a/t/t1050-large.sh
+++ b/t/t1050-large.sh
@@ -114,7 +114,7 @@ test_expect_success 'hash-object' '
 	git hash-object large1
 '
 
-test_expect_failure 'cat-file a large file' '
+test_expect_success 'cat-file a large file' '
 	git cat-file blob :large1 >/dev/null
 '
 
-- 
1.7.3.1.256.g2539c.dirty

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 04/11] parse_object: special code path for blobs to avoid putting whole object in memory
  2012-02-27  7:55 [PATCH 00/11] Large blob fixes Nguyễn Thái Ngọc Duy
                   ` (2 preceding siblings ...)
  2012-02-27  7:55 ` [PATCH 03/11] cat-file: use streaming interface to print blobs Nguyễn Thái Ngọc Duy
@ 2012-02-27  7:55 ` Nguyễn Thái Ngọc Duy
  2012-02-27  7:55 ` [PATCH 05/11] show: use streaming interface for showing blobs Nguyễn Thái Ngọc Duy
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 48+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2012-02-27  7:55 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy


Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 object.c    |   11 +++++++++++
 sha1_file.c |   33 ++++++++++++++++++++++++++++++++-
 2 files changed, 43 insertions(+), 1 deletions(-)

diff --git a/object.c b/object.c
index 6b06297..0498b18 100644
--- a/object.c
+++ b/object.c
@@ -198,6 +198,17 @@ struct object *parse_object(const unsigned char *sha1)
 	if (obj && obj->parsed)
 		return obj;
 
+	if ((obj && obj->type == OBJ_BLOB) ||
+	    (!obj && has_sha1_file(sha1) &&
+	     sha1_object_info(sha1, NULL) == OBJ_BLOB)) {
+		if (check_sha1_signature(repl, NULL, 0, NULL) < 0) {
+			error("sha1 mismatch %s\n", sha1_to_hex(repl));
+			return NULL;
+		}
+		parse_blob_buffer(lookup_blob(sha1), NULL, 0);
+		return lookup_object(sha1);
+	}
+
 	buffer = read_sha1_file(sha1, &type, &size);
 	if (buffer) {
 		if (check_sha1_signature(repl, buffer, size, typename(type)) < 0) {
diff --git a/sha1_file.c b/sha1_file.c
index f9f8d5e..a77ef0a 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -19,6 +19,7 @@
 #include "pack-revindex.h"
 #include "sha1-lookup.h"
 #include "bulk-checkin.h"
+#include "streaming.h"
 
 #ifndef O_NOATIME
 #if defined(__linux__) && (defined(__i386__) || defined(__PPC__))
@@ -1149,7 +1150,37 @@ static const struct packed_git *has_packed_and_bad(const unsigned char *sha1)
 int check_sha1_signature(const unsigned char *sha1, void *map, unsigned long size, const char *type)
 {
 	unsigned char real_sha1[20];
-	hash_sha1_file(map, size, type, real_sha1);
+	enum object_type obj_type;
+	struct git_istream *st;
+	git_SHA_CTX c;
+	char hdr[32];
+	int hdrlen;
+
+	if (map) {
+		hash_sha1_file(map, size, type, real_sha1);
+		return hashcmp(sha1, real_sha1) ? -1 : 0;
+	}
+
+	st = open_istream(sha1, &obj_type, &size, NULL);
+	if (!st)
+		return -1;
+
+	/* Generate the header */
+	hdrlen = sprintf(hdr, "%s %lu", typename(obj_type), size) + 1;
+
+	/* Sha1.. */
+	git_SHA1_Init(&c);
+	git_SHA1_Update(&c, hdr, hdrlen);
+	for (;;) {
+		char buf[1024 * 16];
+		ssize_t readlen = read_istream(st, buf, sizeof(buf));
+
+		if (!readlen)
+			break;
+		git_SHA1_Update(&c, buf, readlen);
+	}
+	git_SHA1_Final(real_sha1, &c);
+	close_istream(st);
 	return hashcmp(sha1, real_sha1) ? -1 : 0;
 }
 
-- 
1.7.3.1.256.g2539c.dirty

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 05/11] show: use streaming interface for showing blobs
  2012-02-27  7:55 [PATCH 00/11] Large blob fixes Nguyễn Thái Ngọc Duy
                   ` (3 preceding siblings ...)
  2012-02-27  7:55 ` [PATCH 04/11] parse_object: special code path for blobs to avoid putting whole object in memory Nguyễn Thái Ngọc Duy
@ 2012-02-27  7:55 ` Nguyễn Thái Ngọc Duy
  2012-02-27 18:00   ` Junio C Hamano
  2012-02-27  7:55 ` [PATCH 06/11] index-pack --verify: skip sha-1 collision test Nguyễn Thái Ngọc Duy
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 48+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2012-02-27  7:55 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy


Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 builtin/log.c    |    9 ++++++++-
 t/t1050-large.sh |    2 +-
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/builtin/log.c b/builtin/log.c
index 7d1f6f8..4c4b17a 100644
--- a/builtin/log.c
+++ b/builtin/log.c
@@ -386,13 +386,20 @@ static int show_object(const unsigned char *sha1, int show_tag_object,
 {
 	unsigned long size;
 	enum object_type type;
-	char *buf = read_sha1_file(sha1, &type, &size);
+	char *buf;
 	int offset = 0;
 
+	if (!show_tag_object) {
+		fflush(stdout);
+		return streaming_write_sha1(1, 0, sha1, OBJ_ANY, NULL);
+	}
+
+	buf = read_sha1_file(sha1, &type, &size);
 	if (!buf)
 		return error(_("Could not read object %s"), sha1_to_hex(sha1));
 
 	if (show_tag_object)
+		assert(type == OBJ_TAG);
 		while (offset < size && buf[offset] != '\n') {
 			int new_offset = offset + 1;
 			while (new_offset < size && buf[new_offset++] != '\n')
diff --git a/t/t1050-large.sh b/t/t1050-large.sh
index 39a3e77..66acb3b 100755
--- a/t/t1050-large.sh
+++ b/t/t1050-large.sh
@@ -118,7 +118,7 @@ test_expect_success 'cat-file a large file' '
 	git cat-file blob :large1 >/dev/null
 '
 
-test_expect_failure 'git-show a large file' '
+test_expect_success 'git-show a large file' '
 	git show :large1 >/dev/null
 
 '
-- 
1.7.3.1.256.g2539c.dirty

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 06/11] index-pack --verify: skip sha-1 collision test
  2012-02-27  7:55 [PATCH 00/11] Large blob fixes Nguyễn Thái Ngọc Duy
                   ` (4 preceding siblings ...)
  2012-02-27  7:55 ` [PATCH 05/11] show: use streaming interface for showing blobs Nguyễn Thái Ngọc Duy
@ 2012-02-27  7:55 ` Nguyễn Thái Ngọc Duy
  2012-02-27  7:55 ` [PATCH 07/11] index-pack: split second pass obj handling into own function Nguyễn Thái Ngọc Duy
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 48+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2012-02-27  7:55 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy

index-pack --verify (or verify-pack) is about verifying the pack
itself. SHA-1 collision test is about outside (probably malicious)
objects with the same SHA-1 entering current repo.

SHA-1 collision test is currently done unconditionally. Which means if
you verify an in-repo pack, all objects from the pack will be checked
against objects in repo, which are themselves.

Skip this test for --verify, unless --strict is also specified.

linux-2.6 $ ls -sh .git/objects/pack/pack-e7732c98a8d54840add294c3c562840f78764196.pack
401M .git/objects/pack/pack-e7732c98a8d54840add294c3c562840f78764196.pack

Without the patch (and with another patch to cut out second pass in
index-pack):

linux-2.6 $ time ~/w/git/old index-pack -v --verify .git/objects/pack/pack-e7732c98a8d54840add294c3c562840f78764196.pack
Indexing objects: 100% (1944656/1944656), done.
fatal: pack has 1617280 unresolved deltas

real    1m1.223s
user    0m55.028s
sys     0m0.828s

With the patch:

linux-2.6 $ time ~/w/git/git index-pack -v --verify .git/objects/pack/pack-e7732c98a8d54840add294c3c562840f78764196.pack
Indexing objects: 100% (1944656/1944656), done.
fatal: pack has 1617280 unresolved deltas

real    0m41.714s
user    0m40.994s
sys     0m0.550s

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 builtin/index-pack.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index dd1c5c9..cee83b9 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -62,6 +62,7 @@ static int nr_resolved_deltas;
 
 static int from_stdin;
 static int strict;
+static int verify;
 static int verbose;
 
 static struct progress *progress;
@@ -461,7 +462,7 @@ static void sha1_object(const void *data, unsigned long size,
 			enum object_type type, unsigned char *sha1)
 {
 	hash_sha1_file(data, size, typename(type), sha1);
-	if (has_sha1_file(sha1)) {
+	if ((strict || !verify) && has_sha1_file(sha1)) {
 		void *has_data;
 		enum object_type has_type;
 		unsigned long has_size;
@@ -1078,7 +1079,7 @@ static void show_pack_info(int stat_only)
 
 int cmd_index_pack(int argc, const char **argv, const char *prefix)
 {
-	int i, fix_thin_pack = 0, verify = 0, stat_only = 0, stat = 0;
+	int i, fix_thin_pack = 0, stat_only = 0, stat = 0;
 	const char *curr_pack, *curr_index;
 	const char *index_name = NULL, *pack_name = NULL;
 	const char *keep_name = NULL, *keep_msg = NULL;
-- 
1.7.3.1.256.g2539c.dirty

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 07/11] index-pack: split second pass obj handling into own function
  2012-02-27  7:55 [PATCH 00/11] Large blob fixes Nguyễn Thái Ngọc Duy
                   ` (5 preceding siblings ...)
  2012-02-27  7:55 ` [PATCH 06/11] index-pack --verify: skip sha-1 collision test Nguyễn Thái Ngọc Duy
@ 2012-02-27  7:55 ` Nguyễn Thái Ngọc Duy
  2012-02-27  7:55 ` [PATCH 08/11] index-pack: reduce memory usage when the pack has large blobs Nguyễn Thái Ngọc Duy
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 48+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2012-02-27  7:55 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy


Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 builtin/index-pack.c |   31 ++++++++++++++++++-------------
 1 files changed, 18 insertions(+), 13 deletions(-)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index cee83b9..e3cb684 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -683,6 +683,23 @@ static int compare_delta_entry(const void *a, const void *b)
 				   objects[delta_b->obj_no].type);
 }
 
+/*
+ * Second pass:
+ * - for all non-delta objects, look if it is used as a base for
+ *   deltas;
+ * - if used as a base, uncompress the object and apply all deltas,
+ *   recursively checking if the resulting object is used as a base
+ *   for some more deltas.
+ */
+static void second_pass(struct object_entry *obj)
+{
+	struct base_data *base_obj = alloc_base_data();
+	base_obj->obj = obj;
+	base_obj->data = NULL;
+	find_unresolved_deltas(base_obj);
+	display_progress(progress, nr_resolved_deltas);
+}
+
 /* Parse all objects and return the pack content SHA1 hash */
 static void parse_pack_objects(unsigned char *sha1)
 {
@@ -737,26 +754,14 @@ static void parse_pack_objects(unsigned char *sha1)
 	qsort(deltas, nr_deltas, sizeof(struct delta_entry),
 	      compare_delta_entry);
 
-	/*
-	 * Second pass:
-	 * - for all non-delta objects, look if it is used as a base for
-	 *   deltas;
-	 * - if used as a base, uncompress the object and apply all deltas,
-	 *   recursively checking if the resulting object is used as a base
-	 *   for some more deltas.
-	 */
 	if (verbose)
 		progress = start_progress("Resolving deltas", nr_deltas);
 	for (i = 0; i < nr_objects; i++) {
 		struct object_entry *obj = &objects[i];
-		struct base_data *base_obj = alloc_base_data();
 
 		if (is_delta_type(obj->type))
 			continue;
-		base_obj->obj = obj;
-		base_obj->data = NULL;
-		find_unresolved_deltas(base_obj);
-		display_progress(progress, nr_resolved_deltas);
+		second_pass(obj);
 	}
 }
 
-- 
1.7.3.1.256.g2539c.dirty

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 08/11] index-pack: reduce memory usage when the pack has large blobs
  2012-02-27  7:55 [PATCH 00/11] Large blob fixes Nguyễn Thái Ngọc Duy
                   ` (6 preceding siblings ...)
  2012-02-27  7:55 ` [PATCH 07/11] index-pack: split second pass obj handling into own function Nguyễn Thái Ngọc Duy
@ 2012-02-27  7:55 ` Nguyễn Thái Ngọc Duy
  2012-02-27  7:55 ` [PATCH 09/11] pack-check: do not unpack blobs Nguyễn Thái Ngọc Duy
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 48+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2012-02-27  7:55 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy

This command unpacks every non-delta objects in order to:

1. calculate sha-1
2. do byte-to-byte sha-1 collision test if we happen to have objects
   with the same sha-1
3. validate object content in strict mode

All this requires the entire object to stay in memory, a bad news for
giant blobs. This patch lowers memory consumption by not saving the
object in memory whenever possible, calculating SHA-1 while unpacking
the object.

This patch assumes that the collision test is rarely needed. The
collision test will be done later in second pass if necessary, which
puts the entire object back to memory again (We could even do the
collision test without putting the entire object back in memory, by
comparing as we unpack it).

In strict mode, it always keeps non-blob objects in memory for
validation (blobs do not need data validation). "--strict --verify"
also keeps blobs in memory.

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 builtin/index-pack.c |   74 +++++++++++++++++++++++++++++++++++++++++---------
 t/t1050-large.sh     |    4 +-
 2 files changed, 63 insertions(+), 15 deletions(-)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index e3cb684..86de813 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -277,30 +277,60 @@ static void unlink_base_data(struct base_data *c)
 	free_base_data(c);
 }
 
-static void *unpack_entry_data(unsigned long offset, unsigned long size)
+static void *unpack_entry_data(unsigned long offset, unsigned long size,
+			       enum object_type type, unsigned char *sha1)
 {
+	static char fixed_buf[8192];
 	int status;
 	git_zstream stream;
-	void *buf = xmalloc(size);
+	void *buf;
+	git_SHA_CTX c;
+
+	if (sha1) {		/* do hash_sha1_file internally */
+		char hdr[32];
+		int hdrlen = sprintf(hdr, "%s %lu", typename(type), size)+1;
+		git_SHA1_Init(&c);
+		git_SHA1_Update(&c, hdr, hdrlen);
+
+		buf = fixed_buf;
+	} else {
+		buf = xmalloc(size);
+	}
 
 	memset(&stream, 0, sizeof(stream));
 	git_inflate_init(&stream);
 	stream.next_out = buf;
-	stream.avail_out = size;
+	stream.avail_out = buf == fixed_buf ? sizeof(fixed_buf) : size;
 
 	do {
 		stream.next_in = fill(1);
 		stream.avail_in = input_len;
 		status = git_inflate(&stream, 0);
 		use(input_len - stream.avail_in);
+		if (sha1) {
+			git_SHA1_Update(&c, buf, stream.next_out - (unsigned char *)buf);
+			stream.next_out = buf;
+			stream.avail_out = sizeof(fixed_buf);
+		}
 	} while (status == Z_OK);
 	if (stream.total_out != size || status != Z_STREAM_END)
 		bad_object(offset, "inflate returned %d", status);
 	git_inflate_end(&stream);
+	if (sha1) {
+		git_SHA1_Final(sha1, &c);
+		buf = NULL;
+	}
 	return buf;
 }
 
-static void *unpack_raw_entry(struct object_entry *obj, union delta_base *delta_base)
+static int is_delta_type(enum object_type type)
+{
+	return (type == OBJ_REF_DELTA || type == OBJ_OFS_DELTA);
+}
+
+static void *unpack_raw_entry(struct object_entry *obj,
+			      union delta_base *delta_base,
+			      unsigned char *sha1)
 {
 	unsigned char *p;
 	unsigned long size, c;
@@ -360,7 +390,17 @@ static void *unpack_raw_entry(struct object_entry *obj, union delta_base *delta_
 	}
 	obj->hdr_size = consumed_bytes - obj->idx.offset;
 
-	data = unpack_entry_data(obj->idx.offset, obj->size);
+	/*
+	 * --verify --strict: sha1_object() does all collision test
+	 *          --strict: sha1_object() does all except blobs,
+	 *                    blobs tested in second pass
+	 * --verify         : no collision test
+	 *                  : all in second pass
+	 */
+	if (is_delta_type(obj->type) ||
+	    (strict && (verify || obj->type != OBJ_BLOB)))
+		sha1 = NULL;	/* save unpacked object */
+	data = unpack_entry_data(obj->idx.offset, obj->size, obj->type, sha1);
 	obj->idx.crc32 = input_crc32;
 	return data;
 }
@@ -461,8 +501,9 @@ static void find_delta_children(const union delta_base *base,
 static void sha1_object(const void *data, unsigned long size,
 			enum object_type type, unsigned char *sha1)
 {
-	hash_sha1_file(data, size, typename(type), sha1);
-	if ((strict || !verify) && has_sha1_file(sha1)) {
+	if (data)
+		hash_sha1_file(data, size, typename(type), sha1);
+	if (data && (strict || !verify) && has_sha1_file(sha1)) {
 		void *has_data;
 		enum object_type has_type;
 		unsigned long has_size;
@@ -511,11 +552,6 @@ static void sha1_object(const void *data, unsigned long size,
 	}
 }
 
-static int is_delta_type(enum object_type type)
-{
-	return (type == OBJ_REF_DELTA || type == OBJ_OFS_DELTA);
-}
-
 /*
  * This function is part of find_unresolved_deltas(). There are two
  * walkers going in the opposite ways.
@@ -690,10 +726,22 @@ static int compare_delta_entry(const void *a, const void *b)
  * - if used as a base, uncompress the object and apply all deltas,
  *   recursively checking if the resulting object is used as a base
  *   for some more deltas.
+ * - if the same object exists in repository and we're not in strict
+ *   mode, we skipped the sha-1 collision test in the first pass.
+ *   Do it now.
  */
 static void second_pass(struct object_entry *obj)
 {
 	struct base_data *base_obj = alloc_base_data();
+
+	if (((!strict && !verify) ||
+	     (strict && !verify && obj->type == OBJ_BLOB)) &&
+	    has_sha1_file(obj->idx.sha1)) {
+		void *data = get_data_from_pack(obj);
+		sha1_object(data, obj->size, obj->type, obj->idx.sha1);
+		free(data);
+	}
+
 	base_obj->obj = obj;
 	base_obj->data = NULL;
 	find_unresolved_deltas(base_obj);
@@ -719,7 +767,7 @@ static void parse_pack_objects(unsigned char *sha1)
 				nr_objects);
 	for (i = 0; i < nr_objects; i++) {
 		struct object_entry *obj = &objects[i];
-		void *data = unpack_raw_entry(obj, &delta->base);
+		void *data = unpack_raw_entry(obj, &delta->base, obj->idx.sha1);
 		obj->real_type = obj->type;
 		if (is_delta_type(obj->type)) {
 			nr_deltas++;
diff --git a/t/t1050-large.sh b/t/t1050-large.sh
index 66acb3b..7e78c72 100755
--- a/t/t1050-large.sh
+++ b/t/t1050-large.sh
@@ -123,7 +123,7 @@ test_expect_success 'git-show a large file' '
 
 '
 
-test_expect_failure 'clone' '
+test_expect_success 'clone' '
 	git clone -n file://"$PWD"/.git new &&
 	(
 	cd new &&
@@ -132,7 +132,7 @@ test_expect_failure 'clone' '
 	)
 '
 
-test_expect_failure 'fetch updates' '
+test_expect_success 'fetch updates' '
 	echo modified >> large1 &&
 	git commit -q -a -m updated &&
 	(
-- 
1.7.3.1.256.g2539c.dirty

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 09/11] pack-check: do not unpack blobs
  2012-02-27  7:55 [PATCH 00/11] Large blob fixes Nguyễn Thái Ngọc Duy
                   ` (7 preceding siblings ...)
  2012-02-27  7:55 ` [PATCH 08/11] index-pack: reduce memory usage when the pack has large blobs Nguyễn Thái Ngọc Duy
@ 2012-02-27  7:55 ` Nguyễn Thái Ngọc Duy
  2012-02-27  7:55 ` [PATCH 10/11] archive: support streaming large files to a tar archive Nguyễn Thái Ngọc Duy
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 48+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2012-02-27  7:55 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy

blob content is not used by verify_pack caller (currently only fsck),
we only need to make sure blob sha-1 signature matches its
content. unpack_entry() is taught to hash pack entry as it is
unpacked, eliminating the need to keep whole blob in memory.

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 cache.h          |    2 +-
 fast-import.c    |    2 +-
 pack-check.c     |   21 ++++++++++++++++++++-
 sha1_file.c      |   45 +++++++++++++++++++++++++++++++++++----------
 t/t1050-large.sh |    2 +-
 5 files changed, 58 insertions(+), 14 deletions(-)

diff --git a/cache.h b/cache.h
index 6ce691b..33bfb69 100644
--- a/cache.h
+++ b/cache.h
@@ -1065,7 +1065,7 @@ extern const unsigned char *nth_packed_object_sha1(struct packed_git *, uint32_t
 extern off_t nth_packed_object_offset(const struct packed_git *, uint32_t);
 extern off_t find_pack_entry_one(const unsigned char *, struct packed_git *);
 extern int is_pack_valid(struct packed_git *);
-extern void *unpack_entry(struct packed_git *, off_t, enum object_type *, unsigned long *);
+extern void *unpack_entry(struct packed_git *, off_t, enum object_type *, unsigned long *, unsigned char *);
 extern unsigned long unpack_object_header_buffer(const unsigned char *buf, unsigned long len, enum object_type *type, unsigned long *sizep);
 extern unsigned long get_size_from_delta(struct packed_git *, struct pack_window **, off_t);
 extern int unpack_object_header(struct packed_git *, struct pack_window **, off_t *, unsigned long *);
diff --git a/fast-import.c b/fast-import.c
index 6cd19e5..5e94a64 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -1303,7 +1303,7 @@ static void *gfi_unpack_entry(
 		 */
 		p->pack_size = pack_size + 20;
 	}
-	return unpack_entry(p, oe->idx.offset, &type, sizep);
+	return unpack_entry(p, oe->idx.offset, &type, sizep, NULL);
 }
 
 static const char *get_mode(const char *str, uint16_t *modep)
diff --git a/pack-check.c b/pack-check.c
index 63a595c..1920bdb 100644
--- a/pack-check.c
+++ b/pack-check.c
@@ -105,6 +105,7 @@ static int verify_packfile(struct packed_git *p,
 		void *data;
 		enum object_type type;
 		unsigned long size;
+		off_t curpos = entries[i].offset;
 
 		if (p->index_version > 1) {
 			off_t offset = entries[i].offset;
@@ -116,7 +117,25 @@ static int verify_packfile(struct packed_git *p,
 					    sha1_to_hex(entries[i].sha1),
 					    p->pack_name, (uintmax_t)offset);
 		}
-		data = unpack_entry(p, entries[i].offset, &type, &size);
+		type = unpack_object_header(p, w_curs, &curpos, &size);
+		unuse_pack(w_curs);
+		if (type == OBJ_BLOB) {
+			unsigned char sha1[20];
+			data = unpack_entry(p, entries[i].offset, &type, &size, sha1);
+			if (!data) {
+				if (hashcmp(entries[i].sha1, sha1))
+					err = error("packed %s from %s is corrupt",
+						    sha1_to_hex(entries[i].sha1), p->pack_name);
+				else if (fn) {
+					int eaten = 0;
+					fn(entries[i].sha1, type, size, NULL, &eaten);
+				}
+				if (((base_count + i) & 1023) == 0)
+					display_progress(progress, base_count + i);
+				continue;
+			}
+		}
+		data = unpack_entry(p, entries[i].offset, &type, &size, NULL);
 		if (!data)
 			err = error("cannot unpack %s from %s at offset %"PRIuMAX"",
 				    sha1_to_hex(entries[i].sha1), p->pack_name,
diff --git a/sha1_file.c b/sha1_file.c
index a77ef0a..d68a5b0 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1653,28 +1653,51 @@ static int packed_object_info(struct packed_git *p, off_t obj_offset,
 }
 
 static void *unpack_compressed_entry(struct packed_git *p,
-				    struct pack_window **w_curs,
-				    off_t curpos,
-				    unsigned long size)
+				     struct pack_window **w_curs,
+				     off_t curpos,
+				     unsigned long size,
+				     enum object_type type,
+				     unsigned char *sha1)
 {
+	static unsigned char fixed_buf[8192];
 	int st;
 	git_zstream stream;
 	unsigned char *buffer, *in;
+	git_SHA_CTX c;
+
+	if (sha1) {		/* do hash_sha1_file internally */
+		char hdr[32];
+		int hdrlen = sprintf(hdr, "%s %lu", typename(type), size)+1;
+		git_SHA1_Init(&c);
+		git_SHA1_Update(&c, hdr, hdrlen);
+
+		buffer = fixed_buf;
+	} else {
+		buffer = xmallocz(size);
+	}
 
-	buffer = xmallocz(size);
 	memset(&stream, 0, sizeof(stream));
 	stream.next_out = buffer;
-	stream.avail_out = size + 1;
+	stream.avail_out = buffer == fixed_buf ? sizeof(fixed_buf) : size + 1;
 
 	git_inflate_init(&stream);
 	do {
 		in = use_pack(p, w_curs, curpos, &stream.avail_in);
 		stream.next_in = in;
 		st = git_inflate(&stream, Z_FINISH);
-		if (!stream.avail_out)
+		if (sha1) {
+			git_SHA1_Update(&c, buffer, stream.next_out - (unsigned char *)buffer);
+			stream.next_out = buffer;
+			stream.avail_out = sizeof(fixed_buf);
+		}
+		else if (!stream.avail_out)
 			break; /* the payload is larger than it should be */
 		curpos += stream.next_in - in;
 	} while (st == Z_OK || st == Z_BUF_ERROR);
+	if (sha1) {
+		git_SHA1_Final(sha1, &c);
+		buffer = NULL;
+	}
 	git_inflate_end(&stream);
 	if ((st != Z_STREAM_END) || stream.total_out != size) {
 		free(buffer);
@@ -1727,7 +1750,7 @@ static void *cache_or_unpack_entry(struct packed_git *p, off_t base_offset,
 
 	ret = ent->data;
 	if (!ret || ent->p != p || ent->base_offset != base_offset)
-		return unpack_entry(p, base_offset, type, base_size);
+		return unpack_entry(p, base_offset, type, base_size, NULL);
 
 	if (!keep_cache) {
 		ent->data = NULL;
@@ -1844,7 +1867,7 @@ static void *unpack_delta_entry(struct packed_git *p,
 			return NULL;
 	}
 
-	delta_data = unpack_compressed_entry(p, w_curs, curpos, delta_size);
+	delta_data = unpack_compressed_entry(p, w_curs, curpos, delta_size, OBJ_NONE, NULL);
 	if (!delta_data) {
 		error("failed to unpack compressed delta "
 		      "at offset %"PRIuMAX" from %s",
@@ -1883,7 +1906,8 @@ static void write_pack_access_log(struct packed_git *p, off_t obj_offset)
 int do_check_packed_object_crc;
 
 void *unpack_entry(struct packed_git *p, off_t obj_offset,
-		   enum object_type *type, unsigned long *sizep)
+		   enum object_type *type, unsigned long *sizep,
+		   unsigned char *sha1)
 {
 	struct pack_window *w_curs = NULL;
 	off_t curpos = obj_offset;
@@ -1917,7 +1941,8 @@ void *unpack_entry(struct packed_git *p, off_t obj_offset,
 	case OBJ_TREE:
 	case OBJ_BLOB:
 	case OBJ_TAG:
-		data = unpack_compressed_entry(p, &w_curs, curpos, *sizep);
+		data = unpack_compressed_entry(p, &w_curs, curpos,
+					       *sizep, *type, sha1);
 		break;
 	default:
 		data = NULL;
diff --git a/t/t1050-large.sh b/t/t1050-large.sh
index 7e78c72..c749ecb 100755
--- a/t/t1050-large.sh
+++ b/t/t1050-large.sh
@@ -141,7 +141,7 @@ test_expect_success 'fetch updates' '
 	)
 '
 
-test_expect_failure 'fsck' '
+test_expect_success 'fsck' '
 	git fsck --full
 '
 
-- 
1.7.3.1.256.g2539c.dirty

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 10/11] archive: support streaming large files to a tar archive
  2012-02-27  7:55 [PATCH 00/11] Large blob fixes Nguyễn Thái Ngọc Duy
                   ` (8 preceding siblings ...)
  2012-02-27  7:55 ` [PATCH 09/11] pack-check: do not unpack blobs Nguyễn Thái Ngọc Duy
@ 2012-02-27  7:55 ` Nguyễn Thái Ngọc Duy
  2012-02-27  7:55 ` [PATCH 11/11] fsck: use streaming interface for writing lost-found blobs Nguyễn Thái Ngọc Duy
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 48+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2012-02-27  7:55 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy


Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 archive-tar.c    |   35 ++++++++++++++++++++++++++++-------
 archive-zip.c    |    9 +++++----
 archive.c        |   51 ++++++++++++++++++++++++++++++++++-----------------
 archive.h        |   11 +++++++++--
 t/t1050-large.sh |    2 +-
 5 files changed, 77 insertions(+), 31 deletions(-)

diff --git a/archive-tar.c b/archive-tar.c
index 20af005..5bffe49 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -5,6 +5,7 @@
 #include "tar.h"
 #include "archive.h"
 #include "run-command.h"
+#include "streaming.h"
 
 #define RECORDSIZE	(512)
 #define BLOCKSIZE	(RECORDSIZE * 20)
@@ -123,9 +124,29 @@ static size_t get_path_prefix(const char *path, size_t pathlen, size_t maxlen)
 	return i;
 }
 
+static void write_file(struct git_istream *stream, const void *buffer,
+		       unsigned long size)
+{
+	if (!stream) {
+		write_blocked(buffer, size);
+		return;
+	}
+	for (;;) {
+		char buf[1024 * 16];
+		ssize_t readlen;
+
+		readlen = read_istream(stream, buf, sizeof(buf));
+
+		if (!readlen)
+			break;
+		write_blocked(buf, readlen);
+	}
+}
+
 static int write_tar_entry(struct archiver_args *args,
-		const unsigned char *sha1, const char *path, size_t pathlen,
-		unsigned int mode, void *buffer, unsigned long size)
+			   const unsigned char *sha1, const char *path,
+			   size_t pathlen, unsigned int mode, void *buffer,
+			   struct git_istream *stream, unsigned long size)
 {
 	struct ustar_header header;
 	struct strbuf ext_header = STRBUF_INIT;
@@ -200,14 +221,14 @@ static int write_tar_entry(struct archiver_args *args,
 
 	if (ext_header.len > 0) {
 		err = write_tar_entry(args, sha1, NULL, 0, 0, ext_header.buf,
-				ext_header.len);
+				      NULL, ext_header.len);
 		if (err)
 			return err;
 	}
 	strbuf_release(&ext_header);
 	write_blocked(&header, sizeof(header));
-	if (S_ISREG(mode) && buffer && size > 0)
-		write_blocked(buffer, size);
+	if (S_ISREG(mode) && size > 0)
+		write_file(stream, buffer, size);
 	return err;
 }
 
@@ -219,7 +240,7 @@ static int write_global_extended_header(struct archiver_args *args)
 
 	strbuf_append_ext_header(&ext_header, "comment", sha1_to_hex(sha1), 40);
 	err = write_tar_entry(args, NULL, NULL, 0, 0, ext_header.buf,
-			ext_header.len);
+			      NULL, ext_header.len);
 	strbuf_release(&ext_header);
 	return err;
 }
@@ -308,7 +329,7 @@ static int write_tar_archive(const struct archiver *ar,
 	if (args->commit_sha1)
 		err = write_global_extended_header(args);
 	if (!err)
-		err = write_archive_entries(args, write_tar_entry);
+		err = write_archive_entries(args, write_tar_entry, 1);
 	if (!err)
 		write_trailer();
 	return err;
diff --git a/archive-zip.c b/archive-zip.c
index 02d1f37..4a1e917 100644
--- a/archive-zip.c
+++ b/archive-zip.c
@@ -120,9 +120,10 @@ static void *zlib_deflate(void *data, unsigned long size,
 	return buffer;
 }
 
-static int write_zip_entry(struct archiver_args *args,
-		const unsigned char *sha1, const char *path, size_t pathlen,
-		unsigned int mode, void *buffer, unsigned long size)
+int write_zip_entry(struct archiver_args *args,
+			   const unsigned char *sha1, const char *path,
+			   size_t pathlen, unsigned int mode, void *buffer,
+			   struct git_istream *stream, unsigned long size)
 {
 	struct zip_local_header header;
 	struct zip_dir_header dirent;
@@ -271,7 +272,7 @@ static int write_zip_archive(const struct archiver *ar,
 	zip_dir = xmalloc(ZIP_DIRECTORY_MIN_SIZE);
 	zip_dir_size = ZIP_DIRECTORY_MIN_SIZE;
 
-	err = write_archive_entries(args, write_zip_entry);
+	err = write_archive_entries(args, write_zip_entry, 0);
 	if (!err)
 		write_zip_trailer(args->commit_sha1);
 
diff --git a/archive.c b/archive.c
index 1ee837d..257eadf 100644
--- a/archive.c
+++ b/archive.c
@@ -5,6 +5,7 @@
 #include "archive.h"
 #include "parse-options.h"
 #include "unpack-trees.h"
+#include "streaming.h"
 
 static char const * const archive_usage[] = {
 	"git archive [options] <tree-ish> [<path>...]",
@@ -59,26 +60,35 @@ static void format_subst(const struct commit *commit,
 	free(to_free);
 }
 
-static void *sha1_file_to_archive(const char *path, const unsigned char *sha1,
-		unsigned int mode, enum object_type *type,
-		unsigned long *sizep, const struct commit *commit)
+void sha1_file_to_archive(void **buffer, struct git_istream **stream,
+			  const char *path, const unsigned char *sha1,
+			  unsigned int mode, enum object_type *type,
+			  unsigned long *sizep,
+			  const struct commit *commit)
 {
-	void *buffer;
+	if (stream) {
+		struct stream_filter *filter;
+		filter = get_stream_filter(path, sha1);
+		if (!commit && S_ISREG(mode) && is_null_stream_filter(filter)) {
+			*buffer = NULL;
+			*stream = open_istream(sha1, type, sizep, NULL);
+			return;
+		}
+		*stream = NULL;
+	}
 
-	buffer = read_sha1_file(sha1, type, sizep);
-	if (buffer && S_ISREG(mode)) {
+	*buffer = read_sha1_file(sha1, type, sizep);
+	if (*buffer && S_ISREG(mode)) {
 		struct strbuf buf = STRBUF_INIT;
 		size_t size = 0;
 
-		strbuf_attach(&buf, buffer, *sizep, *sizep + 1);
+		strbuf_attach(&buf, *buffer, *sizep, *sizep + 1);
 		convert_to_working_tree(path, buf.buf, buf.len, &buf);
 		if (commit)
 			format_subst(commit, buf.buf, buf.len, &buf);
-		buffer = strbuf_detach(&buf, &size);
+		*buffer = strbuf_detach(&buf, &size);
 		*sizep = size;
 	}
-
-	return buffer;
 }
 
 static void setup_archive_check(struct git_attr_check *check)
@@ -97,6 +107,7 @@ static void setup_archive_check(struct git_attr_check *check)
 struct archiver_context {
 	struct archiver_args *args;
 	write_archive_entry_fn_t write_entry;
+	int stream_ok;
 };
 
 static int write_archive_entry(const unsigned char *sha1, const char *base,
@@ -109,6 +120,7 @@ static int write_archive_entry(const unsigned char *sha1, const char *base,
 	write_archive_entry_fn_t write_entry = c->write_entry;
 	struct git_attr_check check[2];
 	const char *path_without_prefix;
+	struct git_istream *stream = NULL;
 	int convert = 0;
 	int err;
 	enum object_type type;
@@ -133,25 +145,29 @@ static int write_archive_entry(const unsigned char *sha1, const char *base,
 		strbuf_addch(&path, '/');
 		if (args->verbose)
 			fprintf(stderr, "%.*s\n", (int)path.len, path.buf);
-		err = write_entry(args, sha1, path.buf, path.len, mode, NULL, 0);
+		err = write_entry(args, sha1, path.buf, path.len, mode, NULL, NULL, 0);
 		if (err)
 			return err;
 		return (S_ISDIR(mode) ? READ_TREE_RECURSIVE : 0);
 	}
 
-	buffer = sha1_file_to_archive(path_without_prefix, sha1, mode,
-			&type, &size, convert ? args->commit : NULL);
-	if (!buffer)
+	sha1_file_to_archive(&buffer, c->stream_ok ? &stream : NULL,
+			     path_without_prefix, sha1, mode,
+			     &type, &size, convert ? args->commit : NULL);
+	if (!buffer && !stream)
 		return error("cannot read %s", sha1_to_hex(sha1));
 	if (args->verbose)
 		fprintf(stderr, "%.*s\n", (int)path.len, path.buf);
-	err = write_entry(args, sha1, path.buf, path.len, mode, buffer, size);
+	err = write_entry(args, sha1, path.buf, path.len, mode, buffer, stream, size);
+	if (stream)
+		close_istream(stream);
 	free(buffer);
 	return err;
 }
 
 int write_archive_entries(struct archiver_args *args,
-		write_archive_entry_fn_t write_entry)
+			  write_archive_entry_fn_t write_entry,
+			  int stream_ok)
 {
 	struct archiver_context context;
 	struct unpack_trees_options opts;
@@ -167,13 +183,14 @@ int write_archive_entries(struct archiver_args *args,
 		if (args->verbose)
 			fprintf(stderr, "%.*s\n", (int)len, args->base);
 		err = write_entry(args, args->tree->object.sha1, args->base,
-				len, 040777, NULL, 0);
+				  len, 040777, NULL, NULL, 0);
 		if (err)
 			return err;
 	}
 
 	context.args = args;
 	context.write_entry = write_entry;
+	context.stream_ok = stream_ok;
 
 	/*
 	 * Setup index and instruct attr to read index only
diff --git a/archive.h b/archive.h
index 2b0884f..370cca9 100644
--- a/archive.h
+++ b/archive.h
@@ -27,9 +27,16 @@ extern void register_archiver(struct archiver *);
 extern void init_tar_archiver(void);
 extern void init_zip_archiver(void);
 
-typedef int (*write_archive_entry_fn_t)(struct archiver_args *args, const unsigned char *sha1, const char *path, size_t pathlen, unsigned int mode, void *buffer, unsigned long size);
+struct git_istream;
+typedef int (*write_archive_entry_fn_t)(struct archiver_args *args,
+					const unsigned char *sha1,
+					const char *path, size_t pathlen,
+					unsigned int mode,
+					void *buffer,
+					struct git_istream *stream,
+					unsigned long size);
 
-extern int write_archive_entries(struct archiver_args *args, write_archive_entry_fn_t write_entry);
+extern int write_archive_entries(struct archiver_args *args, write_archive_entry_fn_t write_entry, int stream_ok);
 extern int write_archive(int argc, const char **argv, const char *prefix, int setup_prefix, const char *name_hint, int remote);
 
 const char *archive_format_from_filename(const char *filename);
diff --git a/t/t1050-large.sh b/t/t1050-large.sh
index c749ecb..1e64692 100755
--- a/t/t1050-large.sh
+++ b/t/t1050-large.sh
@@ -149,7 +149,7 @@ test_expect_success 'repack' '
 	git repack -ad
 '
 
-test_expect_failure 'tar achiving' '
+test_expect_success 'tar achiving' '
 	git archive --format=tar HEAD >/dev/null
 '
 
-- 
1.7.3.1.256.g2539c.dirty

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 11/11] fsck: use streaming interface for writing lost-found blobs
  2012-02-27  7:55 [PATCH 00/11] Large blob fixes Nguyễn Thái Ngọc Duy
                   ` (9 preceding siblings ...)
  2012-02-27  7:55 ` [PATCH 10/11] archive: support streaming large files to a tar archive Nguyễn Thái Ngọc Duy
@ 2012-02-27  7:55 ` Nguyễn Thái Ngọc Duy
  2012-02-27 18:43 ` [PATCH 00/11] Large blob fixes Junio C Hamano
  2012-03-04 12:59 ` [PATCH v2 00/10] " Nguyễn Thái Ngọc Duy
  12 siblings, 0 replies; 48+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2012-02-27  7:55 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy


Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 builtin/fsck.c |    8 ++------
 1 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index 8c479a7..319b5c7 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -236,13 +236,9 @@ static void check_unreachable_object(struct object *obj)
 			if (!(f = fopen(filename, "w")))
 				die_errno("Could not open '%s'", filename);
 			if (obj->type == OBJ_BLOB) {
-				enum object_type type;
-				unsigned long size;
-				char *buf = read_sha1_file(obj->sha1,
-						&type, &size);
-				if (buf && fwrite(buf, 1, size, f) != size)
+				if (streaming_write_sha1(fileno(f), 1,
+							 obj->sha1, OBJ_BLOB, NULL))
 					die_errno("Could not write '%s'", filename);
-				free(buf);
 			} else
 				fprintf(f, "%s\n", sha1_to_hex(obj->sha1));
 			if (fclose(f))
-- 
1.7.3.1.256.g2539c.dirty

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH 02/11] Factor out and export large blob writing code to arbitrary file handle
  2012-02-27  7:55 ` [PATCH 02/11] Factor out and export large blob writing code to arbitrary file handle Nguyễn Thái Ngọc Duy
@ 2012-02-27 17:29   ` Junio C Hamano
  2012-02-27 21:50     ` Junio C Hamano
  0 siblings, 1 reply; 48+ messages in thread
From: Junio C Hamano @ 2012-02-27 17:29 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy; +Cc: git

Nguyễn Thái Ngọc Duy  <pclouds@gmail.com> writes:

> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
> ---
>  cache.h |    3 +++
>  entry.c |   39 ++++++++++++++++++++++++++-------------
>  2 files changed, 29 insertions(+), 13 deletions(-)

It was the goal of the original streaming output topic to helping more
callers stream the data out directly from the object store in order to
reduce memory pressure, and this series is very much in line with its
spirit.

The static version of streaming_write_entry() in entry.c was very specific
to writing out an index entry out to the working tree, and it made perfect
sense to have the function in that file, but its interface was limited to
the original context the function was used in.

The whole point of your refactoring in this patch is to make it available
for callers outside that original context; e.g. archive that finds blob
SHA-1 from a tree and writes the blob out to its standard output.  They
should not have to work with an API that takes a cache-entry and writes to
a working tree file.  And your result is much more generic.

So I think the external declaration and the definition should move to a
more generic place, namely streaming.[ch].  It does not belong to entry.c
anymore.

Thanks for working on this.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 03/11] cat-file: use streaming interface to print blobs
  2012-02-27  7:55 ` [PATCH 03/11] cat-file: use streaming interface to print blobs Nguyễn Thái Ngọc Duy
@ 2012-02-27 17:44   ` Junio C Hamano
  2012-02-28  1:08     ` Nguyen Thai Ngoc Duy
  0 siblings, 1 reply; 48+ messages in thread
From: Junio C Hamano @ 2012-02-27 17:44 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy; +Cc: git

Nguyễn Thái Ngọc Duy  <pclouds@gmail.com> writes:

> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
> ---
>  builtin/cat-file.c |   22 ++++++++++++++++++++++
>  t/t1050-large.sh   |    2 +-
>  2 files changed, 23 insertions(+), 1 deletions(-)
>
> diff --git a/builtin/cat-file.c b/builtin/cat-file.c
> index 8ed501f..3f3b558 100644
> --- a/builtin/cat-file.c
> +++ b/builtin/cat-file.c
> @@ -82,6 +82,24 @@ static void pprint_tag(const unsigned char *sha1, const char *buf, unsigned long
>  		write_or_die(1, cp, endp - cp);
>  }
>  
> +static int write_blob(const unsigned char *sha1)
> +{
> +	unsigned char new_sha1[20];
> +
> +	if (sha1_object_info(sha1, NULL) == OBJ_TAG) {

This smells bad.  Why in the world could an API be sane if lets a caller
call "write_blob()" with something that can be a tag?

Both of your callsites call this function when (type == OBJ_BLOB), but the
"case 0:" arm in the large switch in cat_one_file() only checks "expected
type" which may not match the real type at all, so it is wrong to switch
on that in the first place.  In addition, that call site alone needs to
deref tag to the requested/expected type.

This block does not belong to this function, but to only one of its
callers among two.

> +		enum object_type type;
> +		unsigned long size;
> +		char *buffer = read_sha1_file(sha1, &type, &size);
> +		if (memcmp(buffer, "object ", 7) ||
> +		    get_sha1_hex(buffer + 7, new_sha1))
> +			die("%s not a valid tag", sha1_to_hex(sha1));
> +		sha1 = new_sha1;
> +		free(buffer);
> +	}
> +
> +	return streaming_write_sha1(1, 0, sha1, OBJ_BLOB, NULL);

I do not think your previous refactoring added a fall-back codepath to the
function you are calling here.  In the original context, the caller of
streaming_write_entry() made sure that the blob is suitable for streaming
write by getting an istream, and called the function only when that is the
case.  Blobs unsuitable for streaming (e.g. an deltified object in a pack)
were handled by the caller that decided not to call
streaming_write_entry() with the conventional "read to core and then write
it out" codepath.

And I do not think your updated caller in cat_one_file() is equipped to do
so at all.

So it looks to me that this patch totally breaks the cat-file.  What am I
missing?

> +}
> +
>  static int cat_one_file(int opt, const char *exp_type, const char *obj_name)
>  {
>  	unsigned char sha1[20];
> @@ -127,6 +145,8 @@ static int cat_one_file(int opt, const char *exp_type, const char *obj_name)
>  			return cmd_ls_tree(2, ls_args, NULL);
>  		}
>  
> +		if (type == OBJ_BLOB)
> +			return write_blob(sha1);
>  		buf = read_sha1_file(sha1, &type, &size);
>  		if (!buf)
>  			die("Cannot read object %s", obj_name);
> @@ -149,6 +169,8 @@ static int cat_one_file(int opt, const char *exp_type, const char *obj_name)
>  		break;
>  
>  	case 0:
> +		if (type_from_string(exp_type) == OBJ_BLOB)
> +			return write_blob(sha1);
>  		buf = read_object_with_reference(sha1, exp_type, &size, NULL);
>  		break;
>  
> diff --git a/t/t1050-large.sh b/t/t1050-large.sh
> index f245e59..39a3e77 100755
> --- a/t/t1050-large.sh
> +++ b/t/t1050-large.sh
> @@ -114,7 +114,7 @@ test_expect_success 'hash-object' '
>  	git hash-object large1
>  '
>  
> -test_expect_failure 'cat-file a large file' '
> +test_expect_success 'cat-file a large file' '
>  	git cat-file blob :large1 >/dev/null
>  '

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 05/11] show: use streaming interface for showing blobs
  2012-02-27  7:55 ` [PATCH 05/11] show: use streaming interface for showing blobs Nguyễn Thái Ngọc Duy
@ 2012-02-27 18:00   ` Junio C Hamano
  0 siblings, 0 replies; 48+ messages in thread
From: Junio C Hamano @ 2012-02-27 18:00 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy; +Cc: git

Nguyễn Thái Ngọc Duy  <pclouds@gmail.com> writes:

> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
> ---
>  builtin/log.c    |    9 ++++++++-
>  t/t1050-large.sh |    2 +-
>  2 files changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/builtin/log.c b/builtin/log.c
> index 7d1f6f8..4c4b17a 100644
> --- a/builtin/log.c
> +++ b/builtin/log.c
> @@ -386,13 +386,20 @@ static int show_object(const unsigned char *sha1, int show_tag_object,
>  {
>  	unsigned long size;
>  	enum object_type type;
> -	char *buf = read_sha1_file(sha1, &type, &size);
> +	char *buf;
>  	int offset = 0;
>  
> +	if (!show_tag_object) {
> +		fflush(stdout);
> +		return streaming_write_sha1(1, 0, sha1, OBJ_ANY, NULL);
> +	}
> +
> +	buf = read_sha1_file(sha1, &type, &size);
>  	if (!buf)
>  		return error(_("Could not read object %s"), sha1_to_hex(sha1));
>  
>  	if (show_tag_object)
> +		assert(type == OBJ_TAG);
>  		while (offset < size && buf[offset] != '\n') {
>  			int new_offset = offset + 1;
>  			while (new_offset < size && buf[new_offset++] != '\n')

Yuck.

The two callsites to this static function are to do BLOB to do TAG.  And
after you start handing all the blob handling to streaming_write_sha1(),
there is no shared code between the two callers for this function.

So why not remove this function, create one show_blob_object() and the
other show_tag_object(), and update the callers to call the appropriate
one?

> diff --git a/t/t1050-large.sh b/t/t1050-large.sh
> index 39a3e77..66acb3b 100755
> --- a/t/t1050-large.sh
> +++ b/t/t1050-large.sh
> @@ -118,7 +118,7 @@ test_expect_success 'cat-file a large file' '
>  	git cat-file blob :large1 >/dev/null
>  '
>  
> -test_expect_failure 'git-show a large file' '
> +test_expect_success 'git-show a large file' '
>  	git show :large1 >/dev/null
>  
>  '

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 00/11] Large blob fixes
  2012-02-27  7:55 [PATCH 00/11] Large blob fixes Nguyễn Thái Ngọc Duy
                   ` (10 preceding siblings ...)
  2012-02-27  7:55 ` [PATCH 11/11] fsck: use streaming interface for writing lost-found blobs Nguyễn Thái Ngọc Duy
@ 2012-02-27 18:43 ` Junio C Hamano
  2012-02-28  1:23   ` Nguyen Thai Ngoc Duy
  2012-03-04 12:59 ` [PATCH v2 00/10] " Nguyễn Thái Ngọc Duy
  12 siblings, 1 reply; 48+ messages in thread
From: Junio C Hamano @ 2012-02-27 18:43 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy; +Cc: git

Nguyễn Thái Ngọc Duy  <pclouds@gmail.com> writes:

> These patches make sure we avoid keeping whole blob in memory, at
> least in common cases. Blob-only streaming code paths are opened to
> accomplish that.

Some in the series seem to be unrelated to the above, namely, the
index-pack ones.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 01/11] Add more large blob test cases
  2012-02-27  7:55 ` [PATCH 01/11] Add more large blob test cases Nguyễn Thái Ngọc Duy
@ 2012-02-27 20:18   ` Peter Baumann
  0 siblings, 0 replies; 48+ messages in thread
From: Peter Baumann @ 2012-02-27 20:18 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy; +Cc: git, Junio C Hamano

A minor spelling error in the text.

On Mon, Feb 27, 2012 at 02:55:05PM +0700, Nguyễn Thái Ngọc Duy wrote:
> New test cases list commands that should work when memory is
> limited. All memory allocation functions (*) learn to reject any
> allocation larger than $GIT_ALLOC_LIMIT if set.
> 
> (*) Not exactly all. Some places do not use x* functions, but
> malloc/calloc directly, notably diff-delta. These could path should
                                                    ^code
> never be run on large blobs.
> 
> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>

-Peter

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 02/11] Factor out and export large blob writing code to arbitrary file handle
  2012-02-27 17:29   ` Junio C Hamano
@ 2012-02-27 21:50     ` Junio C Hamano
  0 siblings, 0 replies; 48+ messages in thread
From: Junio C Hamano @ 2012-02-27 21:50 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy; +Cc: git

Junio C Hamano <gitster@pobox.com> writes:

> So I think the external declaration and the definition should move to a
> more generic place, namely streaming.[ch].  It does not belong to entry.c
> anymore.
>
> Thanks for working on this.

In other words, I think the result should look more like this.

The original logic in entry.c is that the caller should try to get a
filter and call streaming_write_entry(), but either of them is allowed to
return a failure when the blob is not suitable for the streaming codepath
to tell the caller to try their traditional codepath.

We might want to add another helper function for callers to use to decide
if they should use the streaming interface, or the traditional one, before
actually making a call to streaming_write_entry().  With the original (and
current) API, they have to retry even when the streaming codepath truly
failed (e.g. no such blob object), in which case it is very likely that
the traditional codepath in the caller will fail the same way. Retrying is
a wasted effort in such a case.

-- >8 --
Subject: [PATCH] streaming: make streaming-write-entry to be more reusable

The static function in entry.c takes a cache entry and streams its blob
contents to a file in the working tree.  Refactor the logic to a new API
function stream_blob_to_fd() that takes an object name and an open file
descriptor, so that it can be reused by other callers.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 entry.c     |   53 +++++------------------------------------------------
 streaming.c |   55 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 streaming.h |    2 ++
 3 files changed, 62 insertions(+), 48 deletions(-)

diff --git a/entry.c b/entry.c
index 852fea1..17a6bcc 100644
--- a/entry.c
+++ b/entry.c
@@ -120,58 +120,15 @@ static int streaming_write_entry(struct cache_entry *ce, char *path,
 				 const struct checkout *state, int to_tempfile,
 				 int *fstat_done, struct stat *statbuf)
 {
-	struct git_istream *st;
-	enum object_type type;
-	unsigned long sz;
 	int result = -1;
-	ssize_t kept = 0;
-	int fd = -1;
-
-	st = open_istream(ce->sha1, &type, &sz, filter);
-	if (!st)
-		return -1;
-	if (type != OBJ_BLOB)
-		goto close_and_exit;
+	int fd;
 
 	fd = open_output_fd(path, ce, to_tempfile);
-	if (fd < 0)
-		goto close_and_exit;
-
-	for (;;) {
-		char buf[1024 * 16];
-		ssize_t wrote, holeto;
-		ssize_t readlen = read_istream(st, buf, sizeof(buf));
-
-		if (!readlen)
-			break;
-		if (sizeof(buf) == readlen) {
-			for (holeto = 0; holeto < readlen; holeto++)
-				if (buf[holeto])
-					break;
-			if (readlen == holeto) {
-				kept += holeto;
-				continue;
-			}
-		}
-
-		if (kept && lseek(fd, kept, SEEK_CUR) == (off_t) -1)
-			goto close_and_exit;
-		else
-			kept = 0;
-		wrote = write_in_full(fd, buf, readlen);
-
-		if (wrote != readlen)
-			goto close_and_exit;
-	}
-	if (kept && (lseek(fd, kept - 1, SEEK_CUR) == (off_t) -1 ||
-		     write(fd, "", 1) != 1))
-		goto close_and_exit;
-	*fstat_done = fstat_output(fd, state, statbuf);
-
-close_and_exit:
-	close_istream(st);
-	if (0 <= fd)
+	if (0 <= fd) {
+		result = stream_blob_to_fd(fd, ce->sha1, filter, 1);
+		*fstat_done = fstat_output(fd, state, statbuf);
 		result = close(fd);
+	}
 	if (result && 0 <= fd)
 		unlink(path);
 	return result;
diff --git a/streaming.c b/streaming.c
index 71072e1..7e7ee2b 100644
--- a/streaming.c
+++ b/streaming.c
@@ -489,3 +489,58 @@ static open_method_decl(incore)
 
 	return st->u.incore.buf ? 0 : -1;
 }
+
+
+/****************************************************************
+ * Users of streaming interface
+ ****************************************************************/
+
+int stream_blob_to_fd(int fd, unsigned const char *sha1, struct stream_filter *filter,
+		      int can_seek)
+{
+	struct git_istream *st;
+	enum object_type type;
+	unsigned long sz;
+	ssize_t kept = 0;
+	int result = -1;
+
+	st = open_istream(sha1, &type, &sz, filter);
+	if (!st)
+		return result;
+	if (type != OBJ_BLOB)
+		goto close_and_exit;
+	for (;;) {
+		char buf[1024 * 16];
+		ssize_t wrote, holeto;
+		ssize_t readlen = read_istream(st, buf, sizeof(buf));
+
+		if (!readlen)
+			break;
+		if (can_seek && sizeof(buf) == readlen) {
+			for (holeto = 0; holeto < readlen; holeto++)
+				if (buf[holeto])
+					break;
+			if (readlen == holeto) {
+				kept += holeto;
+				continue;
+			}
+		}
+
+		if (kept && lseek(fd, kept, SEEK_CUR) == (off_t) -1)
+			goto close_and_exit;
+		else
+			kept = 0;
+		wrote = write_in_full(fd, buf, readlen);
+
+		if (wrote != readlen)
+			goto close_and_exit;
+	}
+	if (kept && (lseek(fd, kept - 1, SEEK_CUR) == (off_t) -1 ||
+		     write(fd, "", 1) != 1))
+		goto close_and_exit;
+	result = 0;
+
+ close_and_exit:
+	close_istream(st);
+	return result;
+}
diff --git a/streaming.h b/streaming.h
index 589e857..3e82770 100644
--- a/streaming.h
+++ b/streaming.h
@@ -12,4 +12,6 @@ extern struct git_istream *open_istream(const unsigned char *, enum object_type
 extern int close_istream(struct git_istream *);
 extern ssize_t read_istream(struct git_istream *, char *, size_t);
 
+extern int stream_blob_to_fd(int fd, const unsigned char *, struct stream_filter *, int can_seek);
+
 #endif /* STREAMING_H */
-- 
1.7.9.2.312.g1abc3

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH 03/11] cat-file: use streaming interface to print blobs
  2012-02-27 17:44   ` Junio C Hamano
@ 2012-02-28  1:08     ` Nguyen Thai Ngoc Duy
  0 siblings, 0 replies; 48+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-02-28  1:08 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

2012/2/28 Junio C Hamano <gitster@pobox.com>:
>> +             enum object_type type;
>> +             unsigned long size;
>> +             char *buffer = read_sha1_file(sha1, &type, &size);
>> +             if (memcmp(buffer, "object ", 7) ||
>> +                 get_sha1_hex(buffer + 7, new_sha1))
>> +                     die("%s not a valid tag", sha1_to_hex(sha1));
>> +             sha1 = new_sha1;
>> +             free(buffer);
>> +     }
>> +
>> +     return streaming_write_sha1(1, 0, sha1, OBJ_BLOB, NULL);
>
> I do not think your previous refactoring added a fall-back codepath to the
> function you are calling here.  In the original context, the caller of
> streaming_write_entry() made sure that the blob is suitable for streaming
> write by getting an istream, and called the function only when that is the
> case.  Blobs unsuitable for streaming (e.g. an deltified object in a pack)
> were handled by the caller that decided not to call
> streaming_write_entry() with the conventional "read to core and then write
> it out" codepath.
>
> And I do not think your updated caller in cat_one_file() is equipped to do
> so at all.
>
> So it looks to me that this patch totally breaks the cat-file.  What am I
> missing?

I think open_istream can deal with unsuitable for streaming objects
too. There's a fallback "incore" backend that does
read_sha1_file_extended.
-- 
Duy

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 00/11] Large blob fixes
  2012-02-27 18:43 ` [PATCH 00/11] Large blob fixes Junio C Hamano
@ 2012-02-28  1:23   ` Nguyen Thai Ngoc Duy
  0 siblings, 0 replies; 48+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-02-28  1:23 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

2012/2/28 Junio C Hamano <gitster@pobox.com>:
> Nguyễn Thái Ngọc Duy  <pclouds@gmail.com> writes:
>
>> These patches make sure we avoid keeping whole blob in memory, at
>> least in common cases. Blob-only streaming code paths are opened to
>> accomplish that.
>
> Some in the series seem to be unrelated to the above, namely, the
> index-pack ones.

index-pack patches in this series can make "index-pack --verify"
worse, but it's already not so good. Will take the --verify patch out.
I will need better strategy than blindly skipping sha-1 collision test
when --verify is specified.
-- 
Duy

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH v2 00/10] Large blob fixes
  2012-02-27  7:55 [PATCH 00/11] Large blob fixes Nguyễn Thái Ngọc Duy
                   ` (11 preceding siblings ...)
  2012-02-27 18:43 ` [PATCH 00/11] Large blob fixes Junio C Hamano
@ 2012-03-04 12:59 ` Nguyễn Thái Ngọc Duy
  2012-03-04 12:59   ` [PATCH v2 01/10] Add more large blob test cases Nguyễn Thái Ngọc Duy
                     ` (21 more replies)
  12 siblings, 22 replies; 48+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2012-03-04 12:59 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy

These patches make sure we avoid keeping whole blob in memory, at
least in common cases. Blob-only streaming code paths are opened to
accomplish that.

There are a few things I'd like to see addressed, perhaps as part of
GSoC if any student steps up.

 - somehow avoid unpack-objects and keep the pack if it contains large
   blobs. I guess we could just save the pack, then decide to
   unpack-objects later. I've updated GSoC ideas page about this.
 
 - pack-objects still puts large blobs in memory if they are in loose
   format. This should not happen if we fix the above. But if anyone
   has spare energy, (s)he can try to stream large loose blobs in the
   pack too. Not sure how ugly the end result could be.

 - archive-zip with large blobs. I think two phases are required
   because we need to calculate crc32 in advance. I have a feeling
   that we could just stream compressed blobs (either in loose or
   packed format) to the zip file, i.e. no decompressing then
   compresssing, which makes two phases nearly as good as one.

 - not really large blob related, but it'd be great to see
   pack-check.c and index-pack.c share as much pack reading code as
   possible, even bettere if sha1_file.c could join the party.

 - I've been thinking whether we could just drop pack-check.c, which
   is only used by fsck, and make fsck run index-pack instead. The
   pros is we can run index-pack in parallel. The cons is, how to
   return marked object list to fsck efficiently.

Anyway changes from v1:

 - use stream_blob_to_fd() patch from Junio (better factoring)
 - split show_object() in "git show" in two separate functions, one
   for tag and one for blob, as they do not share much in the end
 - get rid of "index-pack --verify" patch. It'll come back separately

Junio C Hamano (1):
  streaming: make streaming-write-entry to be more reusable

Nguyễn Thái Ngọc Duy (9):
  Add more large blob test cases
  cat-file: use streaming interface to print blobs
  parse_object: special code path for blobs to avoid putting whole
    object in memory
  show: use streaming interface for showing blobs
  index-pack: split second pass obj handling into own function
  index-pack: reduce memory usage when the pack has large blobs
  pack-check: do not unpack blobs
  archive: support streaming large files to a tar archive
  fsck: use streaming interface for writing lost-found blobs

 archive-tar.c        |   35 +++++++++++++++----
 archive-zip.c        |    9 +++--
 archive.c            |   51 ++++++++++++++++++---------
 archive.h            |   11 +++++-
 builtin/cat-file.c   |   23 ++++++++++++
 builtin/fsck.c       |    8 +---
 builtin/index-pack.c |   95 ++++++++++++++++++++++++++++++++++++--------------
 builtin/log.c        |   34 ++++++++++-------
 cache.h              |    2 +-
 entry.c              |   53 +++-------------------------
 fast-import.c        |    2 +-
 object.c             |   11 ++++++
 pack-check.c         |   21 ++++++++++-
 sha1_file.c          |   78 +++++++++++++++++++++++++++++++++++------
 streaming.c          |   55 +++++++++++++++++++++++++++++
 streaming.h          |    2 +
 t/t1050-large.sh     |   59 ++++++++++++++++++++++++++++++-
 wrapper.c            |   27 ++++++++++++--
 18 files changed, 434 insertions(+), 142 deletions(-)

-- 
1.7.8.36.g69ee2

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH v2 01/10] Add more large blob test cases
  2012-03-04 12:59 ` [PATCH v2 00/10] " Nguyễn Thái Ngọc Duy
@ 2012-03-04 12:59   ` Nguyễn Thái Ngọc Duy
  2012-03-06  0:59     ` Junio C Hamano
  2012-03-04 12:59   ` [PATCH v2 02/10] streaming: make streaming-write-entry to be more reusable Nguyễn Thái Ngọc Duy
                     ` (20 subsequent siblings)
  21 siblings, 1 reply; 48+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2012-03-04 12:59 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy

New test cases list commands that should work when memory is
limited. All memory allocation functions (*) learn to reject any
allocation larger than $GIT_ALLOC_LIMIT if set.

(*) Not exactly all. Some places do not use x* functions, but
malloc/calloc directly, notably diff-delta. These code path should
never be run on large blobs.

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 t/t1050-large.sh |   59 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 wrapper.c        |   27 ++++++++++++++++++++++--
 2 files changed, 82 insertions(+), 4 deletions(-)

diff --git a/t/t1050-large.sh b/t/t1050-large.sh
index 29d6024..f245e59 100755
--- a/t/t1050-large.sh
+++ b/t/t1050-large.sh
@@ -10,7 +10,9 @@ test_expect_success setup '
 	echo X | dd of=large1 bs=1k seek=2000 &&
 	echo X | dd of=large2 bs=1k seek=2000 &&
 	echo X | dd of=large3 bs=1k seek=2000 &&
-	echo Y | dd of=huge bs=1k seek=2500
+	echo Y | dd of=huge bs=1k seek=2500 &&
+	GIT_ALLOC_LIMIT=1500 &&
+	export GIT_ALLOC_LIMIT
 '
 
 test_expect_success 'add a large file or two' '
@@ -100,4 +102,59 @@ test_expect_success 'packsize limit' '
 	)
 '
 
+test_expect_success 'diff --raw' '
+	git commit -q -m initial &&
+	echo modified >>large1 &&
+	git add large1 &&
+	git commit -q -m modified &&
+	git diff --raw HEAD^
+'
+
+test_expect_success 'hash-object' '
+	git hash-object large1
+'
+
+test_expect_failure 'cat-file a large file' '
+	git cat-file blob :large1 >/dev/null
+'
+
+test_expect_failure 'git-show a large file' '
+	git show :large1 >/dev/null
+
+'
+
+test_expect_failure 'clone' '
+	git clone -n file://"$PWD"/.git new &&
+	(
+	cd new &&
+	git config core.bigfilethreshold 200k &&
+	git checkout master
+	)
+'
+
+test_expect_failure 'fetch updates' '
+	echo modified >> large1 &&
+	git commit -q -a -m updated &&
+	(
+	cd new &&
+	git fetch --keep # FIXME should not need --keep
+	)
+'
+
+test_expect_failure 'fsck' '
+	git fsck --full
+'
+
+test_expect_success 'repack' '
+	git repack -ad
+'
+
+test_expect_failure 'tar achiving' '
+	git archive --format=tar HEAD >/dev/null
+'
+
+test_expect_failure 'zip achiving' '
+	git archive --format=zip HEAD >/dev/null
+'
+
 test_done
diff --git a/wrapper.c b/wrapper.c
index 85f09df..d4c0972 100644
--- a/wrapper.c
+++ b/wrapper.c
@@ -9,6 +9,18 @@ static void do_nothing(size_t size)
 
 static void (*try_to_free_routine)(size_t size) = do_nothing;
 
+static void memory_limit_check(size_t size)
+{
+	static int limit = -1;
+	if (limit == -1) {
+		const char *env = getenv("GIT_ALLOC_LIMIT");
+		limit = env ? atoi(env) * 1024 : 0;
+	}
+	if (limit && size > limit)
+		die("attempting to allocate %d over limit %d",
+		    size, limit);
+}
+
 try_to_free_t set_try_to_free_routine(try_to_free_t routine)
 {
 	try_to_free_t old = try_to_free_routine;
@@ -32,7 +44,10 @@ char *xstrdup(const char *str)
 
 void *xmalloc(size_t size)
 {
-	void *ret = malloc(size);
+	void *ret;
+
+	memory_limit_check(size);
+	ret = malloc(size);
 	if (!ret && !size)
 		ret = malloc(1);
 	if (!ret) {
@@ -79,7 +94,10 @@ char *xstrndup(const char *str, size_t len)
 
 void *xrealloc(void *ptr, size_t size)
 {
-	void *ret = realloc(ptr, size);
+	void *ret;
+
+	memory_limit_check(size);
+	ret = realloc(ptr, size);
 	if (!ret && !size)
 		ret = realloc(ptr, 1);
 	if (!ret) {
@@ -95,7 +113,10 @@ void *xrealloc(void *ptr, size_t size)
 
 void *xcalloc(size_t nmemb, size_t size)
 {
-	void *ret = calloc(nmemb, size);
+	void *ret;
+
+	memory_limit_check(size * nmemb);
+	ret = calloc(nmemb, size);
 	if (!ret && (!nmemb || !size))
 		ret = calloc(1, 1);
 	if (!ret) {
-- 
1.7.8.36.g69ee2

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 02/10] streaming: make streaming-write-entry to be more reusable
  2012-03-04 12:59 ` [PATCH v2 00/10] " Nguyễn Thái Ngọc Duy
  2012-03-04 12:59   ` [PATCH v2 01/10] Add more large blob test cases Nguyễn Thái Ngọc Duy
@ 2012-03-04 12:59   ` Nguyễn Thái Ngọc Duy
  2012-03-04 12:59   ` [PATCH v2 03/10] cat-file: use streaming interface to print blobs Nguyễn Thái Ngọc Duy
                     ` (19 subsequent siblings)
  21 siblings, 0 replies; 48+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2012-03-04 12:59 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy

From: Junio C Hamano <gitster@pobox.com>

The static function in entry.c takes a cache entry and streams its blob
contents to a file in the working tree.  Refactor the logic to a new API
function stream_blob_to_fd() that takes an object name and an open file
descriptor, so that it can be reused by other callers.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 entry.c     |   53 +++++------------------------------------------------
 streaming.c |   55 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 streaming.h |    2 ++
 3 files changed, 62 insertions(+), 48 deletions(-)

diff --git a/entry.c b/entry.c
index 852fea1..17a6bcc 100644
--- a/entry.c
+++ b/entry.c
@@ -120,58 +120,15 @@ static int streaming_write_entry(struct cache_entry *ce, char *path,
 				 const struct checkout *state, int to_tempfile,
 				 int *fstat_done, struct stat *statbuf)
 {
-	struct git_istream *st;
-	enum object_type type;
-	unsigned long sz;
 	int result = -1;
-	ssize_t kept = 0;
-	int fd = -1;
-
-	st = open_istream(ce->sha1, &type, &sz, filter);
-	if (!st)
-		return -1;
-	if (type != OBJ_BLOB)
-		goto close_and_exit;
+	int fd;
 
 	fd = open_output_fd(path, ce, to_tempfile);
-	if (fd < 0)
-		goto close_and_exit;
-
-	for (;;) {
-		char buf[1024 * 16];
-		ssize_t wrote, holeto;
-		ssize_t readlen = read_istream(st, buf, sizeof(buf));
-
-		if (!readlen)
-			break;
-		if (sizeof(buf) == readlen) {
-			for (holeto = 0; holeto < readlen; holeto++)
-				if (buf[holeto])
-					break;
-			if (readlen == holeto) {
-				kept += holeto;
-				continue;
-			}
-		}
-
-		if (kept && lseek(fd, kept, SEEK_CUR) == (off_t) -1)
-			goto close_and_exit;
-		else
-			kept = 0;
-		wrote = write_in_full(fd, buf, readlen);
-
-		if (wrote != readlen)
-			goto close_and_exit;
-	}
-	if (kept && (lseek(fd, kept - 1, SEEK_CUR) == (off_t) -1 ||
-		     write(fd, "", 1) != 1))
-		goto close_and_exit;
-	*fstat_done = fstat_output(fd, state, statbuf);
-
-close_and_exit:
-	close_istream(st);
-	if (0 <= fd)
+	if (0 <= fd) {
+		result = stream_blob_to_fd(fd, ce->sha1, filter, 1);
+		*fstat_done = fstat_output(fd, state, statbuf);
 		result = close(fd);
+	}
 	if (result && 0 <= fd)
 		unlink(path);
 	return result;
diff --git a/streaming.c b/streaming.c
index 71072e1..7e7ee2b 100644
--- a/streaming.c
+++ b/streaming.c
@@ -489,3 +489,58 @@ static open_method_decl(incore)
 
 	return st->u.incore.buf ? 0 : -1;
 }
+
+
+/****************************************************************
+ * Users of streaming interface
+ ****************************************************************/
+
+int stream_blob_to_fd(int fd, unsigned const char *sha1, struct stream_filter *filter,
+		      int can_seek)
+{
+	struct git_istream *st;
+	enum object_type type;
+	unsigned long sz;
+	ssize_t kept = 0;
+	int result = -1;
+
+	st = open_istream(sha1, &type, &sz, filter);
+	if (!st)
+		return result;
+	if (type != OBJ_BLOB)
+		goto close_and_exit;
+	for (;;) {
+		char buf[1024 * 16];
+		ssize_t wrote, holeto;
+		ssize_t readlen = read_istream(st, buf, sizeof(buf));
+
+		if (!readlen)
+			break;
+		if (can_seek && sizeof(buf) == readlen) {
+			for (holeto = 0; holeto < readlen; holeto++)
+				if (buf[holeto])
+					break;
+			if (readlen == holeto) {
+				kept += holeto;
+				continue;
+			}
+		}
+
+		if (kept && lseek(fd, kept, SEEK_CUR) == (off_t) -1)
+			goto close_and_exit;
+		else
+			kept = 0;
+		wrote = write_in_full(fd, buf, readlen);
+
+		if (wrote != readlen)
+			goto close_and_exit;
+	}
+	if (kept && (lseek(fd, kept - 1, SEEK_CUR) == (off_t) -1 ||
+		     write(fd, "", 1) != 1))
+		goto close_and_exit;
+	result = 0;
+
+ close_and_exit:
+	close_istream(st);
+	return result;
+}
diff --git a/streaming.h b/streaming.h
index 589e857..3e82770 100644
--- a/streaming.h
+++ b/streaming.h
@@ -12,4 +12,6 @@ extern struct git_istream *open_istream(const unsigned char *, enum object_type
 extern int close_istream(struct git_istream *);
 extern ssize_t read_istream(struct git_istream *, char *, size_t);
 
+extern int stream_blob_to_fd(int fd, const unsigned char *, struct stream_filter *, int can_seek);
+
 #endif /* STREAMING_H */
-- 
1.7.8.36.g69ee2

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 03/10] cat-file: use streaming interface to print blobs
  2012-03-04 12:59 ` [PATCH v2 00/10] " Nguyễn Thái Ngọc Duy
  2012-03-04 12:59   ` [PATCH v2 01/10] Add more large blob test cases Nguyễn Thái Ngọc Duy
  2012-03-04 12:59   ` [PATCH v2 02/10] streaming: make streaming-write-entry to be more reusable Nguyễn Thái Ngọc Duy
@ 2012-03-04 12:59   ` Nguyễn Thái Ngọc Duy
  2012-03-04 23:12     ` Junio C Hamano
  2012-03-04 12:59   ` [PATCH v2 04/10] parse_object: special code path for blobs to avoid putting whole object in memory Nguyễn Thái Ngọc Duy
                     ` (18 subsequent siblings)
  21 siblings, 1 reply; 48+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2012-03-04 12:59 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy


Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 builtin/cat-file.c |   23 +++++++++++++++++++++++
 t/t1050-large.sh   |    2 +-
 2 files changed, 24 insertions(+), 1 deletions(-)

diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 8ed501f..bc6cc9f 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -11,6 +11,7 @@
 #include "parse-options.h"
 #include "diff.h"
 #include "userdiff.h"
+#include "streaming.h"
 
 #define BATCH 1
 #define BATCH_CHECK 2
@@ -82,6 +83,24 @@ static void pprint_tag(const unsigned char *sha1, const char *buf, unsigned long
 		write_or_die(1, cp, endp - cp);
 }
 
+static int write_blob(const unsigned char *sha1)
+{
+	unsigned char new_sha1[20];
+
+	if (sha1_object_info(sha1, NULL) == OBJ_TAG) {
+		enum object_type type;
+		unsigned long size;
+		char *buffer = read_sha1_file(sha1, &type, &size);
+		if (memcmp(buffer, "object ", 7) ||
+		    get_sha1_hex(buffer + 7, new_sha1))
+			die("%s not a valid tag", sha1_to_hex(sha1));
+		sha1 = new_sha1;
+		free(buffer);
+	}
+
+	return stream_blob_to_fd(1, sha1, NULL, 0);
+}
+
 static int cat_one_file(int opt, const char *exp_type, const char *obj_name)
 {
 	unsigned char sha1[20];
@@ -127,6 +146,8 @@ static int cat_one_file(int opt, const char *exp_type, const char *obj_name)
 			return cmd_ls_tree(2, ls_args, NULL);
 		}
 
+		if (type == OBJ_BLOB)
+			return write_blob(sha1);
 		buf = read_sha1_file(sha1, &type, &size);
 		if (!buf)
 			die("Cannot read object %s", obj_name);
@@ -149,6 +170,8 @@ static int cat_one_file(int opt, const char *exp_type, const char *obj_name)
 		break;
 
 	case 0:
+		if (type_from_string(exp_type) == OBJ_BLOB)
+			return write_blob(sha1);
 		buf = read_object_with_reference(sha1, exp_type, &size, NULL);
 		break;
 
diff --git a/t/t1050-large.sh b/t/t1050-large.sh
index f245e59..39a3e77 100755
--- a/t/t1050-large.sh
+++ b/t/t1050-large.sh
@@ -114,7 +114,7 @@ test_expect_success 'hash-object' '
 	git hash-object large1
 '
 
-test_expect_failure 'cat-file a large file' '
+test_expect_success 'cat-file a large file' '
 	git cat-file blob :large1 >/dev/null
 '
 
-- 
1.7.8.36.g69ee2

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 04/10] parse_object: special code path for blobs to avoid putting whole object in memory
  2012-03-04 12:59 ` [PATCH v2 00/10] " Nguyễn Thái Ngọc Duy
                     ` (2 preceding siblings ...)
  2012-03-04 12:59   ` [PATCH v2 03/10] cat-file: use streaming interface to print blobs Nguyễn Thái Ngọc Duy
@ 2012-03-04 12:59   ` Nguyễn Thái Ngọc Duy
  2012-03-04 12:59   ` [PATCH v2 05/10] show: use streaming interface for showing blobs Nguyễn Thái Ngọc Duy
                     ` (17 subsequent siblings)
  21 siblings, 0 replies; 48+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2012-03-04 12:59 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy


Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 object.c    |   11 +++++++++++
 sha1_file.c |   33 ++++++++++++++++++++++++++++++++-
 2 files changed, 43 insertions(+), 1 deletions(-)

diff --git a/object.c b/object.c
index 6b06297..0498b18 100644
--- a/object.c
+++ b/object.c
@@ -198,6 +198,17 @@ struct object *parse_object(const unsigned char *sha1)
 	if (obj && obj->parsed)
 		return obj;
 
+	if ((obj && obj->type == OBJ_BLOB) ||
+	    (!obj && has_sha1_file(sha1) &&
+	     sha1_object_info(sha1, NULL) == OBJ_BLOB)) {
+		if (check_sha1_signature(repl, NULL, 0, NULL) < 0) {
+			error("sha1 mismatch %s\n", sha1_to_hex(repl));
+			return NULL;
+		}
+		parse_blob_buffer(lookup_blob(sha1), NULL, 0);
+		return lookup_object(sha1);
+	}
+
 	buffer = read_sha1_file(sha1, &type, &size);
 	if (buffer) {
 		if (check_sha1_signature(repl, buffer, size, typename(type)) < 0) {
diff --git a/sha1_file.c b/sha1_file.c
index f9f8d5e..a77ef0a 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -19,6 +19,7 @@
 #include "pack-revindex.h"
 #include "sha1-lookup.h"
 #include "bulk-checkin.h"
+#include "streaming.h"
 
 #ifndef O_NOATIME
 #if defined(__linux__) && (defined(__i386__) || defined(__PPC__))
@@ -1149,7 +1150,37 @@ static const struct packed_git *has_packed_and_bad(const unsigned char *sha1)
 int check_sha1_signature(const unsigned char *sha1, void *map, unsigned long size, const char *type)
 {
 	unsigned char real_sha1[20];
-	hash_sha1_file(map, size, type, real_sha1);
+	enum object_type obj_type;
+	struct git_istream *st;
+	git_SHA_CTX c;
+	char hdr[32];
+	int hdrlen;
+
+	if (map) {
+		hash_sha1_file(map, size, type, real_sha1);
+		return hashcmp(sha1, real_sha1) ? -1 : 0;
+	}
+
+	st = open_istream(sha1, &obj_type, &size, NULL);
+	if (!st)
+		return -1;
+
+	/* Generate the header */
+	hdrlen = sprintf(hdr, "%s %lu", typename(obj_type), size) + 1;
+
+	/* Sha1.. */
+	git_SHA1_Init(&c);
+	git_SHA1_Update(&c, hdr, hdrlen);
+	for (;;) {
+		char buf[1024 * 16];
+		ssize_t readlen = read_istream(st, buf, sizeof(buf));
+
+		if (!readlen)
+			break;
+		git_SHA1_Update(&c, buf, readlen);
+	}
+	git_SHA1_Final(real_sha1, &c);
+	close_istream(st);
 	return hashcmp(sha1, real_sha1) ? -1 : 0;
 }
 
-- 
1.7.8.36.g69ee2

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 05/10] show: use streaming interface for showing blobs
  2012-03-04 12:59 ` [PATCH v2 00/10] " Nguyễn Thái Ngọc Duy
                     ` (3 preceding siblings ...)
  2012-03-04 12:59   ` [PATCH v2 04/10] parse_object: special code path for blobs to avoid putting whole object in memory Nguyễn Thái Ngọc Duy
@ 2012-03-04 12:59   ` Nguyễn Thái Ngọc Duy
  2012-03-04 12:59   ` [PATCH v2 06/10] index-pack: split second pass obj handling into own function Nguyễn Thái Ngọc Duy
                     ` (16 subsequent siblings)
  21 siblings, 0 replies; 48+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2012-03-04 12:59 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy


Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 builtin/log.c    |   34 ++++++++++++++++++++--------------
 t/t1050-large.sh |    2 +-
 2 files changed, 21 insertions(+), 15 deletions(-)

diff --git a/builtin/log.c b/builtin/log.c
index 7d1f6f8..d1702e7 100644
--- a/builtin/log.c
+++ b/builtin/log.c
@@ -20,6 +20,7 @@
 #include "string-list.h"
 #include "parse-options.h"
 #include "branch.h"
+#include "streaming.h"
 
 /* Set a default date-time format for git log ("log.date" config variable) */
 static const char *default_date_mode = NULL;
@@ -381,8 +382,13 @@ static void show_tagger(char *buf, int len, struct rev_info *rev)
 	strbuf_release(&out);
 }
 
-static int show_object(const unsigned char *sha1, int show_tag_object,
-	struct rev_info *rev)
+static int show_blob_object(const unsigned char *sha1, struct rev_info *rev)
+{
+	fflush(stdout);
+	return stream_blob_to_fd(1, sha1, NULL, 0);
+}
+
+static int show_tag_object(const unsigned char *sha1, struct rev_info *rev)
 {
 	unsigned long size;
 	enum object_type type;
@@ -392,16 +398,16 @@ static int show_object(const unsigned char *sha1, int show_tag_object,
 	if (!buf)
 		return error(_("Could not read object %s"), sha1_to_hex(sha1));
 
-	if (show_tag_object)
-		while (offset < size && buf[offset] != '\n') {
-			int new_offset = offset + 1;
-			while (new_offset < size && buf[new_offset++] != '\n')
-				; /* do nothing */
-			if (!prefixcmp(buf + offset, "tagger "))
-				show_tagger(buf + offset + 7,
-					    new_offset - offset - 7, rev);
-			offset = new_offset;
-		}
+	assert(type == OBJ_TAG);
+	while (offset < size && buf[offset] != '\n') {
+		int new_offset = offset + 1;
+		while (new_offset < size && buf[new_offset++] != '\n')
+			; /* do nothing */
+		if (!prefixcmp(buf + offset, "tagger "))
+			show_tagger(buf + offset + 7,
+				    new_offset - offset - 7, rev);
+		offset = new_offset;
+	}
 
 	if (offset < size)
 		fwrite(buf + offset, size - offset, 1, stdout);
@@ -459,7 +465,7 @@ int cmd_show(int argc, const char **argv, const char *prefix)
 		const char *name = objects[i].name;
 		switch (o->type) {
 		case OBJ_BLOB:
-			ret = show_object(o->sha1, 0, NULL);
+			ret = show_blob_object(o->sha1, NULL);
 			break;
 		case OBJ_TAG: {
 			struct tag *t = (struct tag *)o;
@@ -470,7 +476,7 @@ int cmd_show(int argc, const char **argv, const char *prefix)
 					diff_get_color_opt(&rev.diffopt, DIFF_COMMIT),
 					t->tag,
 					diff_get_color_opt(&rev.diffopt, DIFF_RESET));
-			ret = show_object(o->sha1, 1, &rev);
+			ret = show_tag_object(o->sha1, &rev);
 			rev.shown_one = 1;
 			if (ret)
 				break;
diff --git a/t/t1050-large.sh b/t/t1050-large.sh
index 39a3e77..66acb3b 100755
--- a/t/t1050-large.sh
+++ b/t/t1050-large.sh
@@ -118,7 +118,7 @@ test_expect_success 'cat-file a large file' '
 	git cat-file blob :large1 >/dev/null
 '
 
-test_expect_failure 'git-show a large file' '
+test_expect_success 'git-show a large file' '
 	git show :large1 >/dev/null
 
 '
-- 
1.7.8.36.g69ee2

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 06/10] index-pack: split second pass obj handling into own function
  2012-03-04 12:59 ` [PATCH v2 00/10] " Nguyễn Thái Ngọc Duy
                     ` (4 preceding siblings ...)
  2012-03-04 12:59   ` [PATCH v2 05/10] show: use streaming interface for showing blobs Nguyễn Thái Ngọc Duy
@ 2012-03-04 12:59   ` Nguyễn Thái Ngọc Duy
  2012-03-04 12:59   ` [PATCH v2 07/10] index-pack: reduce memory usage when the pack has large blobs Nguyễn Thái Ngọc Duy
                     ` (15 subsequent siblings)
  21 siblings, 0 replies; 48+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2012-03-04 12:59 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy


Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 builtin/index-pack.c |   31 ++++++++++++++++++-------------
 1 files changed, 18 insertions(+), 13 deletions(-)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index dd1c5c9..918684f 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -682,6 +682,23 @@ static int compare_delta_entry(const void *a, const void *b)
 				   objects[delta_b->obj_no].type);
 }
 
+/*
+ * Second pass:
+ * - for all non-delta objects, look if it is used as a base for
+ *   deltas;
+ * - if used as a base, uncompress the object and apply all deltas,
+ *   recursively checking if the resulting object is used as a base
+ *   for some more deltas.
+ */
+static void second_pass(struct object_entry *obj)
+{
+	struct base_data *base_obj = alloc_base_data();
+	base_obj->obj = obj;
+	base_obj->data = NULL;
+	find_unresolved_deltas(base_obj);
+	display_progress(progress, nr_resolved_deltas);
+}
+
 /* Parse all objects and return the pack content SHA1 hash */
 static void parse_pack_objects(unsigned char *sha1)
 {
@@ -736,26 +753,14 @@ static void parse_pack_objects(unsigned char *sha1)
 	qsort(deltas, nr_deltas, sizeof(struct delta_entry),
 	      compare_delta_entry);
 
-	/*
-	 * Second pass:
-	 * - for all non-delta objects, look if it is used as a base for
-	 *   deltas;
-	 * - if used as a base, uncompress the object and apply all deltas,
-	 *   recursively checking if the resulting object is used as a base
-	 *   for some more deltas.
-	 */
 	if (verbose)
 		progress = start_progress("Resolving deltas", nr_deltas);
 	for (i = 0; i < nr_objects; i++) {
 		struct object_entry *obj = &objects[i];
-		struct base_data *base_obj = alloc_base_data();
 
 		if (is_delta_type(obj->type))
 			continue;
-		base_obj->obj = obj;
-		base_obj->data = NULL;
-		find_unresolved_deltas(base_obj);
-		display_progress(progress, nr_resolved_deltas);
+		second_pass(obj);
 	}
 }
 
-- 
1.7.8.36.g69ee2

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 07/10] index-pack: reduce memory usage when the pack has large blobs
  2012-03-04 12:59 ` [PATCH v2 00/10] " Nguyễn Thái Ngọc Duy
                     ` (5 preceding siblings ...)
  2012-03-04 12:59   ` [PATCH v2 06/10] index-pack: split second pass obj handling into own function Nguyễn Thái Ngọc Duy
@ 2012-03-04 12:59   ` Nguyễn Thái Ngọc Duy
  2012-03-04 12:59   ` [PATCH v2 08/10] pack-check: do not unpack blobs Nguyễn Thái Ngọc Duy
                     ` (14 subsequent siblings)
  21 siblings, 0 replies; 48+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2012-03-04 12:59 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy

This command unpacks every non-delta objects in order to:

1. calculate sha-1
2. do byte-to-byte sha-1 collision test if we happen to have objects
   with the same sha-1
3. validate object content in strict mode

All this requires the entire object to stay in memory, a bad news for
giant blobs. This patch lowers memory consumption by not saving the
object in memory whenever possible, calculating SHA-1 while unpacking
the object.

This patch assumes that the collision test is rarely needed. The
collision test will be done later in second pass if necessary, which
puts the entire object back to memory again (We could even do the
collision test without putting the entire object back in memory, by
comparing as we unpack it).

In strict mode, it always keeps non-blob objects in memory for
validation (blobs do not need data validation).

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 builtin/index-pack.c |   64 +++++++++++++++++++++++++++++++++++++++----------
 t/t1050-large.sh     |    4 +-
 2 files changed, 53 insertions(+), 15 deletions(-)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 918684f..db27133 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -276,30 +276,60 @@ static void unlink_base_data(struct base_data *c)
 	free_base_data(c);
 }
 
-static void *unpack_entry_data(unsigned long offset, unsigned long size)
+static void *unpack_entry_data(unsigned long offset, unsigned long size,
+			       enum object_type type, unsigned char *sha1)
 {
+	static char fixed_buf[8192];
 	int status;
 	git_zstream stream;
-	void *buf = xmalloc(size);
+	void *buf;
+	git_SHA_CTX c;
+
+	if (sha1) {		/* do hash_sha1_file internally */
+		char hdr[32];
+		int hdrlen = sprintf(hdr, "%s %lu", typename(type), size)+1;
+		git_SHA1_Init(&c);
+		git_SHA1_Update(&c, hdr, hdrlen);
+
+		buf = fixed_buf;
+	} else {
+		buf = xmalloc(size);
+	}
 
 	memset(&stream, 0, sizeof(stream));
 	git_inflate_init(&stream);
 	stream.next_out = buf;
-	stream.avail_out = size;
+	stream.avail_out = buf == fixed_buf ? sizeof(fixed_buf) : size;
 
 	do {
 		stream.next_in = fill(1);
 		stream.avail_in = input_len;
 		status = git_inflate(&stream, 0);
 		use(input_len - stream.avail_in);
+		if (sha1) {
+			git_SHA1_Update(&c, buf, stream.next_out - (unsigned char *)buf);
+			stream.next_out = buf;
+			stream.avail_out = sizeof(fixed_buf);
+		}
 	} while (status == Z_OK);
 	if (stream.total_out != size || status != Z_STREAM_END)
 		bad_object(offset, "inflate returned %d", status);
 	git_inflate_end(&stream);
+	if (sha1) {
+		git_SHA1_Final(sha1, &c);
+		buf = NULL;
+	}
 	return buf;
 }
 
-static void *unpack_raw_entry(struct object_entry *obj, union delta_base *delta_base)
+static int is_delta_type(enum object_type type)
+{
+	return (type == OBJ_REF_DELTA || type == OBJ_OFS_DELTA);
+}
+
+static void *unpack_raw_entry(struct object_entry *obj,
+			      union delta_base *delta_base,
+			      unsigned char *sha1)
 {
 	unsigned char *p;
 	unsigned long size, c;
@@ -359,7 +389,9 @@ static void *unpack_raw_entry(struct object_entry *obj, union delta_base *delta_
 	}
 	obj->hdr_size = consumed_bytes - obj->idx.offset;
 
-	data = unpack_entry_data(obj->idx.offset, obj->size);
+	if (is_delta_type(obj->type) || strict)
+		sha1 = NULL;	/* save unpacked object */
+	data = unpack_entry_data(obj->idx.offset, obj->size, obj->type, sha1);
 	obj->idx.crc32 = input_crc32;
 	return data;
 }
@@ -460,8 +492,9 @@ static void find_delta_children(const union delta_base *base,
 static void sha1_object(const void *data, unsigned long size,
 			enum object_type type, unsigned char *sha1)
 {
-	hash_sha1_file(data, size, typename(type), sha1);
-	if (has_sha1_file(sha1)) {
+	if (data)
+		hash_sha1_file(data, size, typename(type), sha1);
+	if (data && has_sha1_file(sha1)) {
 		void *has_data;
 		enum object_type has_type;
 		unsigned long has_size;
@@ -510,11 +543,6 @@ static void sha1_object(const void *data, unsigned long size,
 	}
 }
 
-static int is_delta_type(enum object_type type)
-{
-	return (type == OBJ_REF_DELTA || type == OBJ_OFS_DELTA);
-}
-
 /*
  * This function is part of find_unresolved_deltas(). There are two
  * walkers going in the opposite ways.
@@ -689,10 +717,20 @@ static int compare_delta_entry(const void *a, const void *b)
  * - if used as a base, uncompress the object and apply all deltas,
  *   recursively checking if the resulting object is used as a base
  *   for some more deltas.
+ * - if the same object exists in repository and we're not in strict
+ *   mode, we skipped the sha-1 collision test in the first pass.
+ *   Do it now.
  */
 static void second_pass(struct object_entry *obj)
 {
 	struct base_data *base_obj = alloc_base_data();
+
+	if (!strict && has_sha1_file(obj->idx.sha1)) {
+		void *data = get_data_from_pack(obj);
+		sha1_object(data, obj->size, obj->type, obj->idx.sha1);
+		free(data);
+	}
+
 	base_obj->obj = obj;
 	base_obj->data = NULL;
 	find_unresolved_deltas(base_obj);
@@ -718,7 +756,7 @@ static void parse_pack_objects(unsigned char *sha1)
 				nr_objects);
 	for (i = 0; i < nr_objects; i++) {
 		struct object_entry *obj = &objects[i];
-		void *data = unpack_raw_entry(obj, &delta->base);
+		void *data = unpack_raw_entry(obj, &delta->base, obj->idx.sha1);
 		obj->real_type = obj->type;
 		if (is_delta_type(obj->type)) {
 			nr_deltas++;
diff --git a/t/t1050-large.sh b/t/t1050-large.sh
index 66acb3b..7e78c72 100755
--- a/t/t1050-large.sh
+++ b/t/t1050-large.sh
@@ -123,7 +123,7 @@ test_expect_success 'git-show a large file' '
 
 '
 
-test_expect_failure 'clone' '
+test_expect_success 'clone' '
 	git clone -n file://"$PWD"/.git new &&
 	(
 	cd new &&
@@ -132,7 +132,7 @@ test_expect_failure 'clone' '
 	)
 '
 
-test_expect_failure 'fetch updates' '
+test_expect_success 'fetch updates' '
 	echo modified >> large1 &&
 	git commit -q -a -m updated &&
 	(
-- 
1.7.8.36.g69ee2

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 08/10] pack-check: do not unpack blobs
  2012-03-04 12:59 ` [PATCH v2 00/10] " Nguyễn Thái Ngọc Duy
                     ` (6 preceding siblings ...)
  2012-03-04 12:59   ` [PATCH v2 07/10] index-pack: reduce memory usage when the pack has large blobs Nguyễn Thái Ngọc Duy
@ 2012-03-04 12:59   ` Nguyễn Thái Ngọc Duy
  2012-03-04 12:59   ` [PATCH v2 09/10] archive: support streaming large files to a tar archive Nguyễn Thái Ngọc Duy
                     ` (13 subsequent siblings)
  21 siblings, 0 replies; 48+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2012-03-04 12:59 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy

blob content is not used by verify_pack caller (currently only fsck),
we only need to make sure blob sha-1 signature matches its
content. unpack_entry() is taught to hash pack entry as it is
unpacked, eliminating the need to keep whole blob in memory.

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 cache.h          |    2 +-
 fast-import.c    |    2 +-
 pack-check.c     |   21 ++++++++++++++++++++-
 sha1_file.c      |   45 +++++++++++++++++++++++++++++++++++----------
 t/t1050-large.sh |    2 +-
 5 files changed, 58 insertions(+), 14 deletions(-)

diff --git a/cache.h b/cache.h
index e12b15f..3365f89 100644
--- a/cache.h
+++ b/cache.h
@@ -1062,7 +1062,7 @@ extern const unsigned char *nth_packed_object_sha1(struct packed_git *, uint32_t
 extern off_t nth_packed_object_offset(const struct packed_git *, uint32_t);
 extern off_t find_pack_entry_one(const unsigned char *, struct packed_git *);
 extern int is_pack_valid(struct packed_git *);
-extern void *unpack_entry(struct packed_git *, off_t, enum object_type *, unsigned long *);
+extern void *unpack_entry(struct packed_git *, off_t, enum object_type *, unsigned long *, unsigned char *);
 extern unsigned long unpack_object_header_buffer(const unsigned char *buf, unsigned long len, enum object_type *type, unsigned long *sizep);
 extern unsigned long get_size_from_delta(struct packed_git *, struct pack_window **, off_t);
 extern int unpack_object_header(struct packed_git *, struct pack_window **, off_t *, unsigned long *);
diff --git a/fast-import.c b/fast-import.c
index 6cd19e5..5e94a64 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -1303,7 +1303,7 @@ static void *gfi_unpack_entry(
 		 */
 		p->pack_size = pack_size + 20;
 	}
-	return unpack_entry(p, oe->idx.offset, &type, sizep);
+	return unpack_entry(p, oe->idx.offset, &type, sizep, NULL);
 }
 
 static const char *get_mode(const char *str, uint16_t *modep)
diff --git a/pack-check.c b/pack-check.c
index 63a595c..1920bdb 100644
--- a/pack-check.c
+++ b/pack-check.c
@@ -105,6 +105,7 @@ static int verify_packfile(struct packed_git *p,
 		void *data;
 		enum object_type type;
 		unsigned long size;
+		off_t curpos = entries[i].offset;
 
 		if (p->index_version > 1) {
 			off_t offset = entries[i].offset;
@@ -116,7 +117,25 @@ static int verify_packfile(struct packed_git *p,
 					    sha1_to_hex(entries[i].sha1),
 					    p->pack_name, (uintmax_t)offset);
 		}
-		data = unpack_entry(p, entries[i].offset, &type, &size);
+		type = unpack_object_header(p, w_curs, &curpos, &size);
+		unuse_pack(w_curs);
+		if (type == OBJ_BLOB) {
+			unsigned char sha1[20];
+			data = unpack_entry(p, entries[i].offset, &type, &size, sha1);
+			if (!data) {
+				if (hashcmp(entries[i].sha1, sha1))
+					err = error("packed %s from %s is corrupt",
+						    sha1_to_hex(entries[i].sha1), p->pack_name);
+				else if (fn) {
+					int eaten = 0;
+					fn(entries[i].sha1, type, size, NULL, &eaten);
+				}
+				if (((base_count + i) & 1023) == 0)
+					display_progress(progress, base_count + i);
+				continue;
+			}
+		}
+		data = unpack_entry(p, entries[i].offset, &type, &size, NULL);
 		if (!data)
 			err = error("cannot unpack %s from %s at offset %"PRIuMAX"",
 				    sha1_to_hex(entries[i].sha1), p->pack_name,
diff --git a/sha1_file.c b/sha1_file.c
index a77ef0a..d68a5b0 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1653,28 +1653,51 @@ static int packed_object_info(struct packed_git *p, off_t obj_offset,
 }
 
 static void *unpack_compressed_entry(struct packed_git *p,
-				    struct pack_window **w_curs,
-				    off_t curpos,
-				    unsigned long size)
+				     struct pack_window **w_curs,
+				     off_t curpos,
+				     unsigned long size,
+				     enum object_type type,
+				     unsigned char *sha1)
 {
+	static unsigned char fixed_buf[8192];
 	int st;
 	git_zstream stream;
 	unsigned char *buffer, *in;
+	git_SHA_CTX c;
+
+	if (sha1) {		/* do hash_sha1_file internally */
+		char hdr[32];
+		int hdrlen = sprintf(hdr, "%s %lu", typename(type), size)+1;
+		git_SHA1_Init(&c);
+		git_SHA1_Update(&c, hdr, hdrlen);
+
+		buffer = fixed_buf;
+	} else {
+		buffer = xmallocz(size);
+	}
 
-	buffer = xmallocz(size);
 	memset(&stream, 0, sizeof(stream));
 	stream.next_out = buffer;
-	stream.avail_out = size + 1;
+	stream.avail_out = buffer == fixed_buf ? sizeof(fixed_buf) : size + 1;
 
 	git_inflate_init(&stream);
 	do {
 		in = use_pack(p, w_curs, curpos, &stream.avail_in);
 		stream.next_in = in;
 		st = git_inflate(&stream, Z_FINISH);
-		if (!stream.avail_out)
+		if (sha1) {
+			git_SHA1_Update(&c, buffer, stream.next_out - (unsigned char *)buffer);
+			stream.next_out = buffer;
+			stream.avail_out = sizeof(fixed_buf);
+		}
+		else if (!stream.avail_out)
 			break; /* the payload is larger than it should be */
 		curpos += stream.next_in - in;
 	} while (st == Z_OK || st == Z_BUF_ERROR);
+	if (sha1) {
+		git_SHA1_Final(sha1, &c);
+		buffer = NULL;
+	}
 	git_inflate_end(&stream);
 	if ((st != Z_STREAM_END) || stream.total_out != size) {
 		free(buffer);
@@ -1727,7 +1750,7 @@ static void *cache_or_unpack_entry(struct packed_git *p, off_t base_offset,
 
 	ret = ent->data;
 	if (!ret || ent->p != p || ent->base_offset != base_offset)
-		return unpack_entry(p, base_offset, type, base_size);
+		return unpack_entry(p, base_offset, type, base_size, NULL);
 
 	if (!keep_cache) {
 		ent->data = NULL;
@@ -1844,7 +1867,7 @@ static void *unpack_delta_entry(struct packed_git *p,
 			return NULL;
 	}
 
-	delta_data = unpack_compressed_entry(p, w_curs, curpos, delta_size);
+	delta_data = unpack_compressed_entry(p, w_curs, curpos, delta_size, OBJ_NONE, NULL);
 	if (!delta_data) {
 		error("failed to unpack compressed delta "
 		      "at offset %"PRIuMAX" from %s",
@@ -1883,7 +1906,8 @@ static void write_pack_access_log(struct packed_git *p, off_t obj_offset)
 int do_check_packed_object_crc;
 
 void *unpack_entry(struct packed_git *p, off_t obj_offset,
-		   enum object_type *type, unsigned long *sizep)
+		   enum object_type *type, unsigned long *sizep,
+		   unsigned char *sha1)
 {
 	struct pack_window *w_curs = NULL;
 	off_t curpos = obj_offset;
@@ -1917,7 +1941,8 @@ void *unpack_entry(struct packed_git *p, off_t obj_offset,
 	case OBJ_TREE:
 	case OBJ_BLOB:
 	case OBJ_TAG:
-		data = unpack_compressed_entry(p, &w_curs, curpos, *sizep);
+		data = unpack_compressed_entry(p, &w_curs, curpos,
+					       *sizep, *type, sha1);
 		break;
 	default:
 		data = NULL;
diff --git a/t/t1050-large.sh b/t/t1050-large.sh
index 7e78c72..c749ecb 100755
--- a/t/t1050-large.sh
+++ b/t/t1050-large.sh
@@ -141,7 +141,7 @@ test_expect_success 'fetch updates' '
 	)
 '
 
-test_expect_failure 'fsck' '
+test_expect_success 'fsck' '
 	git fsck --full
 '
 
-- 
1.7.8.36.g69ee2

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 09/10] archive: support streaming large files to a tar archive
  2012-03-04 12:59 ` [PATCH v2 00/10] " Nguyễn Thái Ngọc Duy
                     ` (7 preceding siblings ...)
  2012-03-04 12:59   ` [PATCH v2 08/10] pack-check: do not unpack blobs Nguyễn Thái Ngọc Duy
@ 2012-03-04 12:59   ` Nguyễn Thái Ngọc Duy
  2012-03-04 12:59   ` [PATCH v2 10/10] fsck: use streaming interface for writing lost-found blobs Nguyễn Thái Ngọc Duy
                     ` (12 subsequent siblings)
  21 siblings, 0 replies; 48+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2012-03-04 12:59 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy


Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 archive-tar.c    |   35 ++++++++++++++++++++++++++++-------
 archive-zip.c    |    9 +++++----
 archive.c        |   51 ++++++++++++++++++++++++++++++++++-----------------
 archive.h        |   11 +++++++++--
 t/t1050-large.sh |    2 +-
 5 files changed, 77 insertions(+), 31 deletions(-)

diff --git a/archive-tar.c b/archive-tar.c
index 20af005..5bffe49 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -5,6 +5,7 @@
 #include "tar.h"
 #include "archive.h"
 #include "run-command.h"
+#include "streaming.h"
 
 #define RECORDSIZE	(512)
 #define BLOCKSIZE	(RECORDSIZE * 20)
@@ -123,9 +124,29 @@ static size_t get_path_prefix(const char *path, size_t pathlen, size_t maxlen)
 	return i;
 }
 
+static void write_file(struct git_istream *stream, const void *buffer,
+		       unsigned long size)
+{
+	if (!stream) {
+		write_blocked(buffer, size);
+		return;
+	}
+	for (;;) {
+		char buf[1024 * 16];
+		ssize_t readlen;
+
+		readlen = read_istream(stream, buf, sizeof(buf));
+
+		if (!readlen)
+			break;
+		write_blocked(buf, readlen);
+	}
+}
+
 static int write_tar_entry(struct archiver_args *args,
-		const unsigned char *sha1, const char *path, size_t pathlen,
-		unsigned int mode, void *buffer, unsigned long size)
+			   const unsigned char *sha1, const char *path,
+			   size_t pathlen, unsigned int mode, void *buffer,
+			   struct git_istream *stream, unsigned long size)
 {
 	struct ustar_header header;
 	struct strbuf ext_header = STRBUF_INIT;
@@ -200,14 +221,14 @@ static int write_tar_entry(struct archiver_args *args,
 
 	if (ext_header.len > 0) {
 		err = write_tar_entry(args, sha1, NULL, 0, 0, ext_header.buf,
-				ext_header.len);
+				      NULL, ext_header.len);
 		if (err)
 			return err;
 	}
 	strbuf_release(&ext_header);
 	write_blocked(&header, sizeof(header));
-	if (S_ISREG(mode) && buffer && size > 0)
-		write_blocked(buffer, size);
+	if (S_ISREG(mode) && size > 0)
+		write_file(stream, buffer, size);
 	return err;
 }
 
@@ -219,7 +240,7 @@ static int write_global_extended_header(struct archiver_args *args)
 
 	strbuf_append_ext_header(&ext_header, "comment", sha1_to_hex(sha1), 40);
 	err = write_tar_entry(args, NULL, NULL, 0, 0, ext_header.buf,
-			ext_header.len);
+			      NULL, ext_header.len);
 	strbuf_release(&ext_header);
 	return err;
 }
@@ -308,7 +329,7 @@ static int write_tar_archive(const struct archiver *ar,
 	if (args->commit_sha1)
 		err = write_global_extended_header(args);
 	if (!err)
-		err = write_archive_entries(args, write_tar_entry);
+		err = write_archive_entries(args, write_tar_entry, 1);
 	if (!err)
 		write_trailer();
 	return err;
diff --git a/archive-zip.c b/archive-zip.c
index 02d1f37..4a1e917 100644
--- a/archive-zip.c
+++ b/archive-zip.c
@@ -120,9 +120,10 @@ static void *zlib_deflate(void *data, unsigned long size,
 	return buffer;
 }
 
-static int write_zip_entry(struct archiver_args *args,
-		const unsigned char *sha1, const char *path, size_t pathlen,
-		unsigned int mode, void *buffer, unsigned long size)
+int write_zip_entry(struct archiver_args *args,
+			   const unsigned char *sha1, const char *path,
+			   size_t pathlen, unsigned int mode, void *buffer,
+			   struct git_istream *stream, unsigned long size)
 {
 	struct zip_local_header header;
 	struct zip_dir_header dirent;
@@ -271,7 +272,7 @@ static int write_zip_archive(const struct archiver *ar,
 	zip_dir = xmalloc(ZIP_DIRECTORY_MIN_SIZE);
 	zip_dir_size = ZIP_DIRECTORY_MIN_SIZE;
 
-	err = write_archive_entries(args, write_zip_entry);
+	err = write_archive_entries(args, write_zip_entry, 0);
 	if (!err)
 		write_zip_trailer(args->commit_sha1);
 
diff --git a/archive.c b/archive.c
index 1ee837d..257eadf 100644
--- a/archive.c
+++ b/archive.c
@@ -5,6 +5,7 @@
 #include "archive.h"
 #include "parse-options.h"
 #include "unpack-trees.h"
+#include "streaming.h"
 
 static char const * const archive_usage[] = {
 	"git archive [options] <tree-ish> [<path>...]",
@@ -59,26 +60,35 @@ static void format_subst(const struct commit *commit,
 	free(to_free);
 }
 
-static void *sha1_file_to_archive(const char *path, const unsigned char *sha1,
-		unsigned int mode, enum object_type *type,
-		unsigned long *sizep, const struct commit *commit)
+void sha1_file_to_archive(void **buffer, struct git_istream **stream,
+			  const char *path, const unsigned char *sha1,
+			  unsigned int mode, enum object_type *type,
+			  unsigned long *sizep,
+			  const struct commit *commit)
 {
-	void *buffer;
+	if (stream) {
+		struct stream_filter *filter;
+		filter = get_stream_filter(path, sha1);
+		if (!commit && S_ISREG(mode) && is_null_stream_filter(filter)) {
+			*buffer = NULL;
+			*stream = open_istream(sha1, type, sizep, NULL);
+			return;
+		}
+		*stream = NULL;
+	}
 
-	buffer = read_sha1_file(sha1, type, sizep);
-	if (buffer && S_ISREG(mode)) {
+	*buffer = read_sha1_file(sha1, type, sizep);
+	if (*buffer && S_ISREG(mode)) {
 		struct strbuf buf = STRBUF_INIT;
 		size_t size = 0;
 
-		strbuf_attach(&buf, buffer, *sizep, *sizep + 1);
+		strbuf_attach(&buf, *buffer, *sizep, *sizep + 1);
 		convert_to_working_tree(path, buf.buf, buf.len, &buf);
 		if (commit)
 			format_subst(commit, buf.buf, buf.len, &buf);
-		buffer = strbuf_detach(&buf, &size);
+		*buffer = strbuf_detach(&buf, &size);
 		*sizep = size;
 	}
-
-	return buffer;
 }
 
 static void setup_archive_check(struct git_attr_check *check)
@@ -97,6 +107,7 @@ static void setup_archive_check(struct git_attr_check *check)
 struct archiver_context {
 	struct archiver_args *args;
 	write_archive_entry_fn_t write_entry;
+	int stream_ok;
 };
 
 static int write_archive_entry(const unsigned char *sha1, const char *base,
@@ -109,6 +120,7 @@ static int write_archive_entry(const unsigned char *sha1, const char *base,
 	write_archive_entry_fn_t write_entry = c->write_entry;
 	struct git_attr_check check[2];
 	const char *path_without_prefix;
+	struct git_istream *stream = NULL;
 	int convert = 0;
 	int err;
 	enum object_type type;
@@ -133,25 +145,29 @@ static int write_archive_entry(const unsigned char *sha1, const char *base,
 		strbuf_addch(&path, '/');
 		if (args->verbose)
 			fprintf(stderr, "%.*s\n", (int)path.len, path.buf);
-		err = write_entry(args, sha1, path.buf, path.len, mode, NULL, 0);
+		err = write_entry(args, sha1, path.buf, path.len, mode, NULL, NULL, 0);
 		if (err)
 			return err;
 		return (S_ISDIR(mode) ? READ_TREE_RECURSIVE : 0);
 	}
 
-	buffer = sha1_file_to_archive(path_without_prefix, sha1, mode,
-			&type, &size, convert ? args->commit : NULL);
-	if (!buffer)
+	sha1_file_to_archive(&buffer, c->stream_ok ? &stream : NULL,
+			     path_without_prefix, sha1, mode,
+			     &type, &size, convert ? args->commit : NULL);
+	if (!buffer && !stream)
 		return error("cannot read %s", sha1_to_hex(sha1));
 	if (args->verbose)
 		fprintf(stderr, "%.*s\n", (int)path.len, path.buf);
-	err = write_entry(args, sha1, path.buf, path.len, mode, buffer, size);
+	err = write_entry(args, sha1, path.buf, path.len, mode, buffer, stream, size);
+	if (stream)
+		close_istream(stream);
 	free(buffer);
 	return err;
 }
 
 int write_archive_entries(struct archiver_args *args,
-		write_archive_entry_fn_t write_entry)
+			  write_archive_entry_fn_t write_entry,
+			  int stream_ok)
 {
 	struct archiver_context context;
 	struct unpack_trees_options opts;
@@ -167,13 +183,14 @@ int write_archive_entries(struct archiver_args *args,
 		if (args->verbose)
 			fprintf(stderr, "%.*s\n", (int)len, args->base);
 		err = write_entry(args, args->tree->object.sha1, args->base,
-				len, 040777, NULL, 0);
+				  len, 040777, NULL, NULL, 0);
 		if (err)
 			return err;
 	}
 
 	context.args = args;
 	context.write_entry = write_entry;
+	context.stream_ok = stream_ok;
 
 	/*
 	 * Setup index and instruct attr to read index only
diff --git a/archive.h b/archive.h
index 2b0884f..370cca9 100644
--- a/archive.h
+++ b/archive.h
@@ -27,9 +27,16 @@ extern void register_archiver(struct archiver *);
 extern void init_tar_archiver(void);
 extern void init_zip_archiver(void);
 
-typedef int (*write_archive_entry_fn_t)(struct archiver_args *args, const unsigned char *sha1, const char *path, size_t pathlen, unsigned int mode, void *buffer, unsigned long size);
+struct git_istream;
+typedef int (*write_archive_entry_fn_t)(struct archiver_args *args,
+					const unsigned char *sha1,
+					const char *path, size_t pathlen,
+					unsigned int mode,
+					void *buffer,
+					struct git_istream *stream,
+					unsigned long size);
 
-extern int write_archive_entries(struct archiver_args *args, write_archive_entry_fn_t write_entry);
+extern int write_archive_entries(struct archiver_args *args, write_archive_entry_fn_t write_entry, int stream_ok);
 extern int write_archive(int argc, const char **argv, const char *prefix, int setup_prefix, const char *name_hint, int remote);
 
 const char *archive_format_from_filename(const char *filename);
diff --git a/t/t1050-large.sh b/t/t1050-large.sh
index c749ecb..1e64692 100755
--- a/t/t1050-large.sh
+++ b/t/t1050-large.sh
@@ -149,7 +149,7 @@ test_expect_success 'repack' '
 	git repack -ad
 '
 
-test_expect_failure 'tar achiving' '
+test_expect_success 'tar achiving' '
 	git archive --format=tar HEAD >/dev/null
 '
 
-- 
1.7.8.36.g69ee2

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 10/10] fsck: use streaming interface for writing lost-found blobs
  2012-03-04 12:59 ` [PATCH v2 00/10] " Nguyễn Thái Ngọc Duy
                     ` (8 preceding siblings ...)
  2012-03-04 12:59   ` [PATCH v2 09/10] archive: support streaming large files to a tar archive Nguyễn Thái Ngọc Duy
@ 2012-03-04 12:59   ` Nguyễn Thái Ngọc Duy
  2012-03-05  3:43   ` [PATCH v3 00/11] Large blob fixes Nguyễn Thái Ngọc Duy
                     ` (11 subsequent siblings)
  21 siblings, 0 replies; 48+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2012-03-04 12:59 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy


Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 builtin/fsck.c |    8 ++------
 1 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index 8c479a7..7fcb33e 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -12,6 +12,7 @@
 #include "parse-options.h"
 #include "dir.h"
 #include "progress.h"
+#include "streaming.h"
 
 #define REACHABLE 0x0001
 #define SEEN      0x0002
@@ -236,13 +237,8 @@ static void check_unreachable_object(struct object *obj)
 			if (!(f = fopen(filename, "w")))
 				die_errno("Could not open '%s'", filename);
 			if (obj->type == OBJ_BLOB) {
-				enum object_type type;
-				unsigned long size;
-				char *buf = read_sha1_file(obj->sha1,
-						&type, &size);
-				if (buf && fwrite(buf, 1, size, f) != size)
+				if (stream_blob_to_fd(fileno(f), obj->sha1, NULL, 1))
 					die_errno("Could not write '%s'", filename);
-				free(buf);
 			} else
 				fprintf(f, "%s\n", sha1_to_hex(obj->sha1));
 			if (fclose(f))
-- 
1.7.8.36.g69ee2

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 03/10] cat-file: use streaming interface to print blobs
  2012-03-04 12:59   ` [PATCH v2 03/10] cat-file: use streaming interface to print blobs Nguyễn Thái Ngọc Duy
@ 2012-03-04 23:12     ` Junio C Hamano
  2012-03-05  2:42       ` Nguyen Thai Ngoc Duy
  0 siblings, 1 reply; 48+ messages in thread
From: Junio C Hamano @ 2012-03-04 23:12 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy; +Cc: git

Nguyễn Thái Ngọc Duy  <pclouds@gmail.com> writes:

> +static int write_blob(const unsigned char *sha1)
> +{
> +	unsigned char new_sha1[20];
> +
> +	if (sha1_object_info(sha1, NULL) == OBJ_TAG) {

Hrm, didn't I say that it tastes bad for a function write_blob() to have
to worry about OBJ_TAG already?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 03/10] cat-file: use streaming interface to print blobs
  2012-03-04 23:12     ` Junio C Hamano
@ 2012-03-05  2:42       ` Nguyen Thai Ngoc Duy
  0 siblings, 0 replies; 48+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-03-05  2:42 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

2012/3/5 Junio C Hamano <gitster@pobox.com>:
> Nguyễn Thái Ngọc Duy  <pclouds@gmail.com> writes:
>
>> +static int write_blob(const unsigned char *sha1)
>> +{
>> +     unsigned char new_sha1[20];
>> +
>> +     if (sha1_object_info(sha1, NULL) == OBJ_TAG) {
>
> Hrm, didn't I say that it tastes bad for a function write_blob() to have
> to worry about OBJ_TAG already?

My bad. Reworked, added another test case for the dereference case,
and clone exceeded memory limit again due to new test case :( Will
need some more work on this.
-- 
Duy

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH v3 00/11] Large blob fixes
  2012-03-04 12:59 ` [PATCH v2 00/10] " Nguyễn Thái Ngọc Duy
                     ` (9 preceding siblings ...)
  2012-03-04 12:59   ` [PATCH v2 10/10] fsck: use streaming interface for writing lost-found blobs Nguyễn Thái Ngọc Duy
@ 2012-03-05  3:43   ` Nguyễn Thái Ngọc Duy
  2012-03-05  3:43   ` [PATCH v3 01/11] Add more large blob test cases Nguyễn Thái Ngọc Duy
                     ` (10 subsequent siblings)
  21 siblings, 0 replies; 48+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2012-03-05  3:43 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy

Changes from v2:

 - set core.bigfilethreshold globally in t1050 to make git-clone happy
   because there's currently no way to specify this in git-clone (or
   is there?)
 - fix the bad coding taste in builtin/cat-file.c
 - make update-server-info respect core.bigfilethreshold,
   which makes repack pass on repositories that have tags

Junio C Hamano (1):
  streaming: make streaming-write-entry to be more reusable

Nguyễn Thái Ngọc Duy (10):
  Add more large blob test cases
  cat-file: use streaming interface to print blobs
  parse_object: special code path for blobs to avoid putting whole
    object in memory
  show: use streaming interface for showing blobs
  index-pack: split second pass obj handling into own function
  index-pack: reduce memory usage when the pack has large blobs
  pack-check: do not unpack blobs
  archive: support streaming large files to a tar archive
  fsck: use streaming interface for writing lost-found blobs
  update-server-info: respect core.bigfilethreshold

 archive-tar.c                |   35 ++++++++++++---
 archive-zip.c                |    9 ++--
 archive.c                    |   51 +++++++++++++++-------
 archive.h                    |   11 ++++-
 builtin/cat-file.c           |   24 +++++++++++
 builtin/fsck.c               |    8 +---
 builtin/index-pack.c         |   95 ++++++++++++++++++++++++++++++-----------
 builtin/log.c                |   34 +++++++++------
 builtin/update-server-info.c |    1 +
 cache.h                      |    2 +-
 entry.c                      |   53 ++---------------------
 fast-import.c                |    2 +-
 object.c                     |   11 +++++
 pack-check.c                 |   21 +++++++++-
 sha1_file.c                  |   78 +++++++++++++++++++++++++++++-----
 streaming.c                  |   55 ++++++++++++++++++++++++
 streaming.h                  |    2 +
 t/t1050-large.sh             |   63 +++++++++++++++++++++++++++-
 wrapper.c                    |   27 +++++++++++-
 19 files changed, 439 insertions(+), 143 deletions(-)

-- 
1.7.3.1.256.g2539c.dirty

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH v3 01/11] Add more large blob test cases
  2012-03-04 12:59 ` [PATCH v2 00/10] " Nguyễn Thái Ngọc Duy
                     ` (10 preceding siblings ...)
  2012-03-05  3:43   ` [PATCH v3 00/11] Large blob fixes Nguyễn Thái Ngọc Duy
@ 2012-03-05  3:43   ` Nguyễn Thái Ngọc Duy
  2012-03-05  3:43   ` [PATCH v3 02/11] streaming: make streaming-write-entry to be more reusable Nguyễn Thái Ngọc Duy
                     ` (9 subsequent siblings)
  21 siblings, 0 replies; 48+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2012-03-05  3:43 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy

New test cases list commands that should work when memory is
limited. All memory allocation functions (*) learn to reject any
allocation larger than $GIT_ALLOC_LIMIT if set.

(*) Not exactly all. Some places do not use x* functions, but
malloc/calloc directly, notably diff-delta. These code path should
never be run on large blobs.

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 t/t1050-large.sh |   63 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 wrapper.c        |   27 ++++++++++++++++++++--
 2 files changed, 85 insertions(+), 5 deletions(-)

diff --git a/t/t1050-large.sh b/t/t1050-large.sh
index 29d6024..80f157a 100755
--- a/t/t1050-large.sh
+++ b/t/t1050-large.sh
@@ -6,11 +6,15 @@ test_description='adding and checking out large blobs'
 . ./test-lib.sh
 
 test_expect_success setup '
-	git config core.bigfilethreshold 200k &&
+	# clone does not allow us to pass core.bigfilethreshold to
+	# new repos, so set core.bigfilethreshold globally
+	git config --global core.bigfilethreshold 200k &&
 	echo X | dd of=large1 bs=1k seek=2000 &&
 	echo X | dd of=large2 bs=1k seek=2000 &&
 	echo X | dd of=large3 bs=1k seek=2000 &&
-	echo Y | dd of=huge bs=1k seek=2500
+	echo Y | dd of=huge bs=1k seek=2500 &&
+	GIT_ALLOC_LIMIT=1500 &&
+	export GIT_ALLOC_LIMIT
 '
 
 test_expect_success 'add a large file or two' '
@@ -100,4 +104,59 @@ test_expect_success 'packsize limit' '
 	)
 '
 
+test_expect_success 'diff --raw' '
+	git commit -q -m initial &&
+	echo modified >>large1 &&
+	git add large1 &&
+	git commit -q -m modified &&
+	git diff --raw HEAD^
+'
+
+test_expect_success 'hash-object' '
+	git hash-object large1
+'
+
+test_expect_failure 'cat-file a large file' '
+	git cat-file blob :large1 >/dev/null
+'
+
+test_expect_failure 'cat-file a large file from a tag' '
+	git tag -m largefile largefiletag :large1 &&
+	git cat-file blob largefiletag >/dev/null
+'
+
+test_expect_failure 'git-show a large file' '
+	git show :large1 >/dev/null
+
+'
+
+test_expect_failure 'clone' '
+	git clone file://"$PWD"/.git new
+'
+
+test_expect_failure 'fetch updates' '
+	echo modified >> large1 &&
+	git commit -q -a -m updated &&
+	(
+	cd new &&
+	git fetch --keep # FIXME should not need --keep
+	)
+'
+
+test_expect_failure 'fsck' '
+	git fsck --full
+'
+
+test_expect_failure 'repack' '
+	git repack -ad
+'
+
+test_expect_failure 'tar achiving' '
+	git archive --format=tar HEAD >/dev/null
+'
+
+test_expect_failure 'zip achiving' '
+	git archive --format=zip HEAD >/dev/null
+'
+
 test_done
diff --git a/wrapper.c b/wrapper.c
index 85f09df..d4c0972 100644
--- a/wrapper.c
+++ b/wrapper.c
@@ -9,6 +9,18 @@ static void do_nothing(size_t size)
 
 static void (*try_to_free_routine)(size_t size) = do_nothing;
 
+static void memory_limit_check(size_t size)
+{
+	static int limit = -1;
+	if (limit == -1) {
+		const char *env = getenv("GIT_ALLOC_LIMIT");
+		limit = env ? atoi(env) * 1024 : 0;
+	}
+	if (limit && size > limit)
+		die("attempting to allocate %d over limit %d",
+		    size, limit);
+}
+
 try_to_free_t set_try_to_free_routine(try_to_free_t routine)
 {
 	try_to_free_t old = try_to_free_routine;
@@ -32,7 +44,10 @@ char *xstrdup(const char *str)
 
 void *xmalloc(size_t size)
 {
-	void *ret = malloc(size);
+	void *ret;
+
+	memory_limit_check(size);
+	ret = malloc(size);
 	if (!ret && !size)
 		ret = malloc(1);
 	if (!ret) {
@@ -79,7 +94,10 @@ char *xstrndup(const char *str, size_t len)
 
 void *xrealloc(void *ptr, size_t size)
 {
-	void *ret = realloc(ptr, size);
+	void *ret;
+
+	memory_limit_check(size);
+	ret = realloc(ptr, size);
 	if (!ret && !size)
 		ret = realloc(ptr, 1);
 	if (!ret) {
@@ -95,7 +113,10 @@ void *xrealloc(void *ptr, size_t size)
 
 void *xcalloc(size_t nmemb, size_t size)
 {
-	void *ret = calloc(nmemb, size);
+	void *ret;
+
+	memory_limit_check(size * nmemb);
+	ret = calloc(nmemb, size);
 	if (!ret && (!nmemb || !size))
 		ret = calloc(1, 1);
 	if (!ret) {
-- 
1.7.3.1.256.g2539c.dirty

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v3 02/11] streaming: make streaming-write-entry to be more reusable
  2012-03-04 12:59 ` [PATCH v2 00/10] " Nguyễn Thái Ngọc Duy
                     ` (11 preceding siblings ...)
  2012-03-05  3:43   ` [PATCH v3 01/11] Add more large blob test cases Nguyễn Thái Ngọc Duy
@ 2012-03-05  3:43   ` Nguyễn Thái Ngọc Duy
  2012-03-05  3:43   ` [PATCH v3 03/11] cat-file: use streaming interface to print blobs Nguyễn Thái Ngọc Duy
                     ` (8 subsequent siblings)
  21 siblings, 0 replies; 48+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2012-03-05  3:43 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy

From: Junio C Hamano <gitster@pobox.com>

The static function in entry.c takes a cache entry and streams its blob
contents to a file in the working tree.  Refactor the logic to a new API
function stream_blob_to_fd() that takes an object name and an open file
descriptor, so that it can be reused by other callers.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 entry.c     |   53 +++++------------------------------------------------
 streaming.c |   55 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 streaming.h |    2 ++
 3 files changed, 62 insertions(+), 48 deletions(-)

diff --git a/entry.c b/entry.c
index 852fea1..17a6bcc 100644
--- a/entry.c
+++ b/entry.c
@@ -120,58 +120,15 @@ static int streaming_write_entry(struct cache_entry *ce, char *path,
 				 const struct checkout *state, int to_tempfile,
 				 int *fstat_done, struct stat *statbuf)
 {
-	struct git_istream *st;
-	enum object_type type;
-	unsigned long sz;
 	int result = -1;
-	ssize_t kept = 0;
-	int fd = -1;
-
-	st = open_istream(ce->sha1, &type, &sz, filter);
-	if (!st)
-		return -1;
-	if (type != OBJ_BLOB)
-		goto close_and_exit;
+	int fd;
 
 	fd = open_output_fd(path, ce, to_tempfile);
-	if (fd < 0)
-		goto close_and_exit;
-
-	for (;;) {
-		char buf[1024 * 16];
-		ssize_t wrote, holeto;
-		ssize_t readlen = read_istream(st, buf, sizeof(buf));
-
-		if (!readlen)
-			break;
-		if (sizeof(buf) == readlen) {
-			for (holeto = 0; holeto < readlen; holeto++)
-				if (buf[holeto])
-					break;
-			if (readlen == holeto) {
-				kept += holeto;
-				continue;
-			}
-		}
-
-		if (kept && lseek(fd, kept, SEEK_CUR) == (off_t) -1)
-			goto close_and_exit;
-		else
-			kept = 0;
-		wrote = write_in_full(fd, buf, readlen);
-
-		if (wrote != readlen)
-			goto close_and_exit;
-	}
-	if (kept && (lseek(fd, kept - 1, SEEK_CUR) == (off_t) -1 ||
-		     write(fd, "", 1) != 1))
-		goto close_and_exit;
-	*fstat_done = fstat_output(fd, state, statbuf);
-
-close_and_exit:
-	close_istream(st);
-	if (0 <= fd)
+	if (0 <= fd) {
+		result = stream_blob_to_fd(fd, ce->sha1, filter, 1);
+		*fstat_done = fstat_output(fd, state, statbuf);
 		result = close(fd);
+	}
 	if (result && 0 <= fd)
 		unlink(path);
 	return result;
diff --git a/streaming.c b/streaming.c
index 71072e1..7e7ee2b 100644
--- a/streaming.c
+++ b/streaming.c
@@ -489,3 +489,58 @@ static open_method_decl(incore)
 
 	return st->u.incore.buf ? 0 : -1;
 }
+
+
+/****************************************************************
+ * Users of streaming interface
+ ****************************************************************/
+
+int stream_blob_to_fd(int fd, unsigned const char *sha1, struct stream_filter *filter,
+		      int can_seek)
+{
+	struct git_istream *st;
+	enum object_type type;
+	unsigned long sz;
+	ssize_t kept = 0;
+	int result = -1;
+
+	st = open_istream(sha1, &type, &sz, filter);
+	if (!st)
+		return result;
+	if (type != OBJ_BLOB)
+		goto close_and_exit;
+	for (;;) {
+		char buf[1024 * 16];
+		ssize_t wrote, holeto;
+		ssize_t readlen = read_istream(st, buf, sizeof(buf));
+
+		if (!readlen)
+			break;
+		if (can_seek && sizeof(buf) == readlen) {
+			for (holeto = 0; holeto < readlen; holeto++)
+				if (buf[holeto])
+					break;
+			if (readlen == holeto) {
+				kept += holeto;
+				continue;
+			}
+		}
+
+		if (kept && lseek(fd, kept, SEEK_CUR) == (off_t) -1)
+			goto close_and_exit;
+		else
+			kept = 0;
+		wrote = write_in_full(fd, buf, readlen);
+
+		if (wrote != readlen)
+			goto close_and_exit;
+	}
+	if (kept && (lseek(fd, kept - 1, SEEK_CUR) == (off_t) -1 ||
+		     write(fd, "", 1) != 1))
+		goto close_and_exit;
+	result = 0;
+
+ close_and_exit:
+	close_istream(st);
+	return result;
+}
diff --git a/streaming.h b/streaming.h
index 589e857..3e82770 100644
--- a/streaming.h
+++ b/streaming.h
@@ -12,4 +12,6 @@ extern struct git_istream *open_istream(const unsigned char *, enum object_type
 extern int close_istream(struct git_istream *);
 extern ssize_t read_istream(struct git_istream *, char *, size_t);
 
+extern int stream_blob_to_fd(int fd, const unsigned char *, struct stream_filter *, int can_seek);
+
 #endif /* STREAMING_H */
-- 
1.7.3.1.256.g2539c.dirty

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v3 03/11] cat-file: use streaming interface to print blobs
  2012-03-04 12:59 ` [PATCH v2 00/10] " Nguyễn Thái Ngọc Duy
                     ` (12 preceding siblings ...)
  2012-03-05  3:43   ` [PATCH v3 02/11] streaming: make streaming-write-entry to be more reusable Nguyễn Thái Ngọc Duy
@ 2012-03-05  3:43   ` Nguyễn Thái Ngọc Duy
  2012-03-05  3:43   ` [PATCH v3 04/11] parse_object: special code path for blobs to avoid putting whole object in memory Nguyễn Thái Ngọc Duy
                     ` (7 subsequent siblings)
  21 siblings, 0 replies; 48+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2012-03-05  3:43 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy


Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 builtin/cat-file.c |   24 ++++++++++++++++++++++++
 t/t1050-large.sh   |    4 ++--
 2 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 8ed501f..ce68a20 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -11,6 +11,7 @@
 #include "parse-options.h"
 #include "diff.h"
 #include "userdiff.h"
+#include "streaming.h"
 
 #define BATCH 1
 #define BATCH_CHECK 2
@@ -127,6 +128,8 @@ static int cat_one_file(int opt, const char *exp_type, const char *obj_name)
 			return cmd_ls_tree(2, ls_args, NULL);
 		}
 
+		if (type == OBJ_BLOB)
+			return stream_blob_to_fd(1, sha1, NULL, 0);
 		buf = read_sha1_file(sha1, &type, &size);
 		if (!buf)
 			die("Cannot read object %s", obj_name);
@@ -149,6 +152,27 @@ static int cat_one_file(int opt, const char *exp_type, const char *obj_name)
 		break;
 
 	case 0:
+		if (type_from_string(exp_type) == OBJ_BLOB) {
+			unsigned char blob_sha1[20];
+			if (sha1_object_info(sha1, NULL) == OBJ_TAG) {
+				enum object_type type;
+				unsigned long size;
+				char *buffer = read_sha1_file(sha1, &type, &size);
+				if (memcmp(buffer, "object ", 7) ||
+				    get_sha1_hex(buffer + 7, blob_sha1))
+					die("%s not a valid tag", sha1_to_hex(sha1));
+				free(buffer);
+			} else
+				hashcpy(blob_sha1, sha1);
+
+			if (sha1_object_info(blob_sha1, NULL) == OBJ_BLOB)
+				return stream_blob_to_fd(1, blob_sha1, NULL, 0);
+			/* we attempted to dereference a tag to a blob
+			   and failed, perhaps there are new dereference
+			   mechanisms this code is not aware of,
+			   fallthrough and let read_object_with_reference
+			   deal with it */
+		}
 		buf = read_object_with_reference(sha1, exp_type, &size, NULL);
 		break;
 
diff --git a/t/t1050-large.sh b/t/t1050-large.sh
index 80f157a..97ad5b3 100755
--- a/t/t1050-large.sh
+++ b/t/t1050-large.sh
@@ -116,11 +116,11 @@ test_expect_success 'hash-object' '
 	git hash-object large1
 '
 
-test_expect_failure 'cat-file a large file' '
+test_expect_success 'cat-file a large file' '
 	git cat-file blob :large1 >/dev/null
 '
 
-test_expect_failure 'cat-file a large file from a tag' '
+test_expect_success 'cat-file a large file from a tag' '
 	git tag -m largefile largefiletag :large1 &&
 	git cat-file blob largefiletag >/dev/null
 '
-- 
1.7.3.1.256.g2539c.dirty

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v3 04/11] parse_object: special code path for blobs to avoid putting whole object in memory
  2012-03-04 12:59 ` [PATCH v2 00/10] " Nguyễn Thái Ngọc Duy
                     ` (13 preceding siblings ...)
  2012-03-05  3:43   ` [PATCH v3 03/11] cat-file: use streaming interface to print blobs Nguyễn Thái Ngọc Duy
@ 2012-03-05  3:43   ` Nguyễn Thái Ngọc Duy
  2012-03-06  0:57     ` Junio C Hamano
  2012-03-05  3:43   ` [PATCH v3 05/11] show: use streaming interface for showing blobs Nguyễn Thái Ngọc Duy
                     ` (6 subsequent siblings)
  21 siblings, 1 reply; 48+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2012-03-05  3:43 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy


Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 object.c    |   11 +++++++++++
 sha1_file.c |   33 ++++++++++++++++++++++++++++++++-
 2 files changed, 43 insertions(+), 1 deletions(-)

diff --git a/object.c b/object.c
index 6b06297..0498b18 100644
--- a/object.c
+++ b/object.c
@@ -198,6 +198,17 @@ struct object *parse_object(const unsigned char *sha1)
 	if (obj && obj->parsed)
 		return obj;
 
+	if ((obj && obj->type == OBJ_BLOB) ||
+	    (!obj && has_sha1_file(sha1) &&
+	     sha1_object_info(sha1, NULL) == OBJ_BLOB)) {
+		if (check_sha1_signature(repl, NULL, 0, NULL) < 0) {
+			error("sha1 mismatch %s\n", sha1_to_hex(repl));
+			return NULL;
+		}
+		parse_blob_buffer(lookup_blob(sha1), NULL, 0);
+		return lookup_object(sha1);
+	}
+
 	buffer = read_sha1_file(sha1, &type, &size);
 	if (buffer) {
 		if (check_sha1_signature(repl, buffer, size, typename(type)) < 0) {
diff --git a/sha1_file.c b/sha1_file.c
index f9f8d5e..a77ef0a 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -19,6 +19,7 @@
 #include "pack-revindex.h"
 #include "sha1-lookup.h"
 #include "bulk-checkin.h"
+#include "streaming.h"
 
 #ifndef O_NOATIME
 #if defined(__linux__) && (defined(__i386__) || defined(__PPC__))
@@ -1149,7 +1150,37 @@ static const struct packed_git *has_packed_and_bad(const unsigned char *sha1)
 int check_sha1_signature(const unsigned char *sha1, void *map, unsigned long size, const char *type)
 {
 	unsigned char real_sha1[20];
-	hash_sha1_file(map, size, type, real_sha1);
+	enum object_type obj_type;
+	struct git_istream *st;
+	git_SHA_CTX c;
+	char hdr[32];
+	int hdrlen;
+
+	if (map) {
+		hash_sha1_file(map, size, type, real_sha1);
+		return hashcmp(sha1, real_sha1) ? -1 : 0;
+	}
+
+	st = open_istream(sha1, &obj_type, &size, NULL);
+	if (!st)
+		return -1;
+
+	/* Generate the header */
+	hdrlen = sprintf(hdr, "%s %lu", typename(obj_type), size) + 1;
+
+	/* Sha1.. */
+	git_SHA1_Init(&c);
+	git_SHA1_Update(&c, hdr, hdrlen);
+	for (;;) {
+		char buf[1024 * 16];
+		ssize_t readlen = read_istream(st, buf, sizeof(buf));
+
+		if (!readlen)
+			break;
+		git_SHA1_Update(&c, buf, readlen);
+	}
+	git_SHA1_Final(real_sha1, &c);
+	close_istream(st);
 	return hashcmp(sha1, real_sha1) ? -1 : 0;
 }
 
-- 
1.7.3.1.256.g2539c.dirty

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v3 05/11] show: use streaming interface for showing blobs
  2012-03-04 12:59 ` [PATCH v2 00/10] " Nguyễn Thái Ngọc Duy
                     ` (14 preceding siblings ...)
  2012-03-05  3:43   ` [PATCH v3 04/11] parse_object: special code path for blobs to avoid putting whole object in memory Nguyễn Thái Ngọc Duy
@ 2012-03-05  3:43   ` Nguyễn Thái Ngọc Duy
  2012-03-05  3:43   ` [PATCH v3 06/11] index-pack: split second pass obj handling into own function Nguyễn Thái Ngọc Duy
                     ` (5 subsequent siblings)
  21 siblings, 0 replies; 48+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2012-03-05  3:43 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy


Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 builtin/log.c    |   34 ++++++++++++++++++++--------------
 t/t1050-large.sh |    2 +-
 2 files changed, 21 insertions(+), 15 deletions(-)

diff --git a/builtin/log.c b/builtin/log.c
index 7d1f6f8..d1702e7 100644
--- a/builtin/log.c
+++ b/builtin/log.c
@@ -20,6 +20,7 @@
 #include "string-list.h"
 #include "parse-options.h"
 #include "branch.h"
+#include "streaming.h"
 
 /* Set a default date-time format for git log ("log.date" config variable) */
 static const char *default_date_mode = NULL;
@@ -381,8 +382,13 @@ static void show_tagger(char *buf, int len, struct rev_info *rev)
 	strbuf_release(&out);
 }
 
-static int show_object(const unsigned char *sha1, int show_tag_object,
-	struct rev_info *rev)
+static int show_blob_object(const unsigned char *sha1, struct rev_info *rev)
+{
+	fflush(stdout);
+	return stream_blob_to_fd(1, sha1, NULL, 0);
+}
+
+static int show_tag_object(const unsigned char *sha1, struct rev_info *rev)
 {
 	unsigned long size;
 	enum object_type type;
@@ -392,16 +398,16 @@ static int show_object(const unsigned char *sha1, int show_tag_object,
 	if (!buf)
 		return error(_("Could not read object %s"), sha1_to_hex(sha1));
 
-	if (show_tag_object)
-		while (offset < size && buf[offset] != '\n') {
-			int new_offset = offset + 1;
-			while (new_offset < size && buf[new_offset++] != '\n')
-				; /* do nothing */
-			if (!prefixcmp(buf + offset, "tagger "))
-				show_tagger(buf + offset + 7,
-					    new_offset - offset - 7, rev);
-			offset = new_offset;
-		}
+	assert(type == OBJ_TAG);
+	while (offset < size && buf[offset] != '\n') {
+		int new_offset = offset + 1;
+		while (new_offset < size && buf[new_offset++] != '\n')
+			; /* do nothing */
+		if (!prefixcmp(buf + offset, "tagger "))
+			show_tagger(buf + offset + 7,
+				    new_offset - offset - 7, rev);
+		offset = new_offset;
+	}
 
 	if (offset < size)
 		fwrite(buf + offset, size - offset, 1, stdout);
@@ -459,7 +465,7 @@ int cmd_show(int argc, const char **argv, const char *prefix)
 		const char *name = objects[i].name;
 		switch (o->type) {
 		case OBJ_BLOB:
-			ret = show_object(o->sha1, 0, NULL);
+			ret = show_blob_object(o->sha1, NULL);
 			break;
 		case OBJ_TAG: {
 			struct tag *t = (struct tag *)o;
@@ -470,7 +476,7 @@ int cmd_show(int argc, const char **argv, const char *prefix)
 					diff_get_color_opt(&rev.diffopt, DIFF_COMMIT),
 					t->tag,
 					diff_get_color_opt(&rev.diffopt, DIFF_RESET));
-			ret = show_object(o->sha1, 1, &rev);
+			ret = show_tag_object(o->sha1, &rev);
 			rev.shown_one = 1;
 			if (ret)
 				break;
diff --git a/t/t1050-large.sh b/t/t1050-large.sh
index 97ad5b3..4e08e02 100755
--- a/t/t1050-large.sh
+++ b/t/t1050-large.sh
@@ -125,7 +125,7 @@ test_expect_success 'cat-file a large file from a tag' '
 	git cat-file blob largefiletag >/dev/null
 '
 
-test_expect_failure 'git-show a large file' '
+test_expect_success 'git-show a large file' '
 	git show :large1 >/dev/null
 
 '
-- 
1.7.3.1.256.g2539c.dirty

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v3 06/11] index-pack: split second pass obj handling into own function
  2012-03-04 12:59 ` [PATCH v2 00/10] " Nguyễn Thái Ngọc Duy
                     ` (15 preceding siblings ...)
  2012-03-05  3:43   ` [PATCH v3 05/11] show: use streaming interface for showing blobs Nguyễn Thái Ngọc Duy
@ 2012-03-05  3:43   ` Nguyễn Thái Ngọc Duy
  2012-03-05  3:43   ` [PATCH v3 07/11] index-pack: reduce memory usage when the pack has large blobs Nguyễn Thái Ngọc Duy
                     ` (4 subsequent siblings)
  21 siblings, 0 replies; 48+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2012-03-05  3:43 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy


Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 builtin/index-pack.c |   31 ++++++++++++++++++-------------
 1 files changed, 18 insertions(+), 13 deletions(-)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index dd1c5c9..918684f 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -682,6 +682,23 @@ static int compare_delta_entry(const void *a, const void *b)
 				   objects[delta_b->obj_no].type);
 }
 
+/*
+ * Second pass:
+ * - for all non-delta objects, look if it is used as a base for
+ *   deltas;
+ * - if used as a base, uncompress the object and apply all deltas,
+ *   recursively checking if the resulting object is used as a base
+ *   for some more deltas.
+ */
+static void second_pass(struct object_entry *obj)
+{
+	struct base_data *base_obj = alloc_base_data();
+	base_obj->obj = obj;
+	base_obj->data = NULL;
+	find_unresolved_deltas(base_obj);
+	display_progress(progress, nr_resolved_deltas);
+}
+
 /* Parse all objects and return the pack content SHA1 hash */
 static void parse_pack_objects(unsigned char *sha1)
 {
@@ -736,26 +753,14 @@ static void parse_pack_objects(unsigned char *sha1)
 	qsort(deltas, nr_deltas, sizeof(struct delta_entry),
 	      compare_delta_entry);
 
-	/*
-	 * Second pass:
-	 * - for all non-delta objects, look if it is used as a base for
-	 *   deltas;
-	 * - if used as a base, uncompress the object and apply all deltas,
-	 *   recursively checking if the resulting object is used as a base
-	 *   for some more deltas.
-	 */
 	if (verbose)
 		progress = start_progress("Resolving deltas", nr_deltas);
 	for (i = 0; i < nr_objects; i++) {
 		struct object_entry *obj = &objects[i];
-		struct base_data *base_obj = alloc_base_data();
 
 		if (is_delta_type(obj->type))
 			continue;
-		base_obj->obj = obj;
-		base_obj->data = NULL;
-		find_unresolved_deltas(base_obj);
-		display_progress(progress, nr_resolved_deltas);
+		second_pass(obj);
 	}
 }
 
-- 
1.7.3.1.256.g2539c.dirty

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v3 07/11] index-pack: reduce memory usage when the pack has large blobs
  2012-03-04 12:59 ` [PATCH v2 00/10] " Nguyễn Thái Ngọc Duy
                     ` (16 preceding siblings ...)
  2012-03-05  3:43   ` [PATCH v3 06/11] index-pack: split second pass obj handling into own function Nguyễn Thái Ngọc Duy
@ 2012-03-05  3:43   ` Nguyễn Thái Ngọc Duy
  2012-03-05  3:43   ` [PATCH v3 08/11] pack-check: do not unpack blobs Nguyễn Thái Ngọc Duy
                     ` (3 subsequent siblings)
  21 siblings, 0 replies; 48+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2012-03-05  3:43 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy

This command unpacks every non-delta objects in order to:

1. calculate sha-1
2. do byte-to-byte sha-1 collision test if we happen to have objects
   with the same sha-1
3. validate object content in strict mode

All this requires the entire object to stay in memory, a bad news for
giant blobs. This patch lowers memory consumption by not saving the
object in memory whenever possible, calculating SHA-1 while unpacking
the object.

This patch assumes that the collision test is rarely needed. The
collision test will be done later in second pass if necessary, which
puts the entire object back to memory again (We could even do the
collision test without putting the entire object back in memory, by
comparing as we unpack it).

In strict mode, it always keeps non-blob objects in memory for
validation (blobs do not need data validation).

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 builtin/index-pack.c |   64 +++++++++++++++++++++++++++++++++++++++----------
 t/t1050-large.sh     |    4 +-
 2 files changed, 53 insertions(+), 15 deletions(-)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 918684f..db27133 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -276,30 +276,60 @@ static void unlink_base_data(struct base_data *c)
 	free_base_data(c);
 }
 
-static void *unpack_entry_data(unsigned long offset, unsigned long size)
+static void *unpack_entry_data(unsigned long offset, unsigned long size,
+			       enum object_type type, unsigned char *sha1)
 {
+	static char fixed_buf[8192];
 	int status;
 	git_zstream stream;
-	void *buf = xmalloc(size);
+	void *buf;
+	git_SHA_CTX c;
+
+	if (sha1) {		/* do hash_sha1_file internally */
+		char hdr[32];
+		int hdrlen = sprintf(hdr, "%s %lu", typename(type), size)+1;
+		git_SHA1_Init(&c);
+		git_SHA1_Update(&c, hdr, hdrlen);
+
+		buf = fixed_buf;
+	} else {
+		buf = xmalloc(size);
+	}
 
 	memset(&stream, 0, sizeof(stream));
 	git_inflate_init(&stream);
 	stream.next_out = buf;
-	stream.avail_out = size;
+	stream.avail_out = buf == fixed_buf ? sizeof(fixed_buf) : size;
 
 	do {
 		stream.next_in = fill(1);
 		stream.avail_in = input_len;
 		status = git_inflate(&stream, 0);
 		use(input_len - stream.avail_in);
+		if (sha1) {
+			git_SHA1_Update(&c, buf, stream.next_out - (unsigned char *)buf);
+			stream.next_out = buf;
+			stream.avail_out = sizeof(fixed_buf);
+		}
 	} while (status == Z_OK);
 	if (stream.total_out != size || status != Z_STREAM_END)
 		bad_object(offset, "inflate returned %d", status);
 	git_inflate_end(&stream);
+	if (sha1) {
+		git_SHA1_Final(sha1, &c);
+		buf = NULL;
+	}
 	return buf;
 }
 
-static void *unpack_raw_entry(struct object_entry *obj, union delta_base *delta_base)
+static int is_delta_type(enum object_type type)
+{
+	return (type == OBJ_REF_DELTA || type == OBJ_OFS_DELTA);
+}
+
+static void *unpack_raw_entry(struct object_entry *obj,
+			      union delta_base *delta_base,
+			      unsigned char *sha1)
 {
 	unsigned char *p;
 	unsigned long size, c;
@@ -359,7 +389,9 @@ static void *unpack_raw_entry(struct object_entry *obj, union delta_base *delta_
 	}
 	obj->hdr_size = consumed_bytes - obj->idx.offset;
 
-	data = unpack_entry_data(obj->idx.offset, obj->size);
+	if (is_delta_type(obj->type) || strict)
+		sha1 = NULL;	/* save unpacked object */
+	data = unpack_entry_data(obj->idx.offset, obj->size, obj->type, sha1);
 	obj->idx.crc32 = input_crc32;
 	return data;
 }
@@ -460,8 +492,9 @@ static void find_delta_children(const union delta_base *base,
 static void sha1_object(const void *data, unsigned long size,
 			enum object_type type, unsigned char *sha1)
 {
-	hash_sha1_file(data, size, typename(type), sha1);
-	if (has_sha1_file(sha1)) {
+	if (data)
+		hash_sha1_file(data, size, typename(type), sha1);
+	if (data && has_sha1_file(sha1)) {
 		void *has_data;
 		enum object_type has_type;
 		unsigned long has_size;
@@ -510,11 +543,6 @@ static void sha1_object(const void *data, unsigned long size,
 	}
 }
 
-static int is_delta_type(enum object_type type)
-{
-	return (type == OBJ_REF_DELTA || type == OBJ_OFS_DELTA);
-}
-
 /*
  * This function is part of find_unresolved_deltas(). There are two
  * walkers going in the opposite ways.
@@ -689,10 +717,20 @@ static int compare_delta_entry(const void *a, const void *b)
  * - if used as a base, uncompress the object and apply all deltas,
  *   recursively checking if the resulting object is used as a base
  *   for some more deltas.
+ * - if the same object exists in repository and we're not in strict
+ *   mode, we skipped the sha-1 collision test in the first pass.
+ *   Do it now.
  */
 static void second_pass(struct object_entry *obj)
 {
 	struct base_data *base_obj = alloc_base_data();
+
+	if (!strict && has_sha1_file(obj->idx.sha1)) {
+		void *data = get_data_from_pack(obj);
+		sha1_object(data, obj->size, obj->type, obj->idx.sha1);
+		free(data);
+	}
+
 	base_obj->obj = obj;
 	base_obj->data = NULL;
 	find_unresolved_deltas(base_obj);
@@ -718,7 +756,7 @@ static void parse_pack_objects(unsigned char *sha1)
 				nr_objects);
 	for (i = 0; i < nr_objects; i++) {
 		struct object_entry *obj = &objects[i];
-		void *data = unpack_raw_entry(obj, &delta->base);
+		void *data = unpack_raw_entry(obj, &delta->base, obj->idx.sha1);
 		obj->real_type = obj->type;
 		if (is_delta_type(obj->type)) {
 			nr_deltas++;
diff --git a/t/t1050-large.sh b/t/t1050-large.sh
index 4e08e02..e4b77a2 100755
--- a/t/t1050-large.sh
+++ b/t/t1050-large.sh
@@ -130,11 +130,11 @@ test_expect_success 'git-show a large file' '
 
 '
 
-test_expect_failure 'clone' '
+test_expect_success 'clone' '
 	git clone file://"$PWD"/.git new
 '
 
-test_expect_failure 'fetch updates' '
+test_expect_success 'fetch updates' '
 	echo modified >> large1 &&
 	git commit -q -a -m updated &&
 	(
-- 
1.7.3.1.256.g2539c.dirty

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v3 08/11] pack-check: do not unpack blobs
  2012-03-04 12:59 ` [PATCH v2 00/10] " Nguyễn Thái Ngọc Duy
                     ` (17 preceding siblings ...)
  2012-03-05  3:43   ` [PATCH v3 07/11] index-pack: reduce memory usage when the pack has large blobs Nguyễn Thái Ngọc Duy
@ 2012-03-05  3:43   ` Nguyễn Thái Ngọc Duy
  2012-03-05  3:43   ` [PATCH v3 09/11] archive: support streaming large files to a tar archive Nguyễn Thái Ngọc Duy
                     ` (2 subsequent siblings)
  21 siblings, 0 replies; 48+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2012-03-05  3:43 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy

blob content is not used by verify_pack caller (currently only fsck),
we only need to make sure blob sha-1 signature matches its
content. unpack_entry() is taught to hash pack entry as it is
unpacked, eliminating the need to keep whole blob in memory.

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 cache.h          |    2 +-
 fast-import.c    |    2 +-
 pack-check.c     |   21 ++++++++++++++++++++-
 sha1_file.c      |   45 +++++++++++++++++++++++++++++++++++----------
 t/t1050-large.sh |    2 +-
 5 files changed, 58 insertions(+), 14 deletions(-)

diff --git a/cache.h b/cache.h
index e12b15f..3365f89 100644
--- a/cache.h
+++ b/cache.h
@@ -1062,7 +1062,7 @@ extern const unsigned char *nth_packed_object_sha1(struct packed_git *, uint32_t
 extern off_t nth_packed_object_offset(const struct packed_git *, uint32_t);
 extern off_t find_pack_entry_one(const unsigned char *, struct packed_git *);
 extern int is_pack_valid(struct packed_git *);
-extern void *unpack_entry(struct packed_git *, off_t, enum object_type *, unsigned long *);
+extern void *unpack_entry(struct packed_git *, off_t, enum object_type *, unsigned long *, unsigned char *);
 extern unsigned long unpack_object_header_buffer(const unsigned char *buf, unsigned long len, enum object_type *type, unsigned long *sizep);
 extern unsigned long get_size_from_delta(struct packed_git *, struct pack_window **, off_t);
 extern int unpack_object_header(struct packed_git *, struct pack_window **, off_t *, unsigned long *);
diff --git a/fast-import.c b/fast-import.c
index 6cd19e5..5e94a64 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -1303,7 +1303,7 @@ static void *gfi_unpack_entry(
 		 */
 		p->pack_size = pack_size + 20;
 	}
-	return unpack_entry(p, oe->idx.offset, &type, sizep);
+	return unpack_entry(p, oe->idx.offset, &type, sizep, NULL);
 }
 
 static const char *get_mode(const char *str, uint16_t *modep)
diff --git a/pack-check.c b/pack-check.c
index 63a595c..1920bdb 100644
--- a/pack-check.c
+++ b/pack-check.c
@@ -105,6 +105,7 @@ static int verify_packfile(struct packed_git *p,
 		void *data;
 		enum object_type type;
 		unsigned long size;
+		off_t curpos = entries[i].offset;
 
 		if (p->index_version > 1) {
 			off_t offset = entries[i].offset;
@@ -116,7 +117,25 @@ static int verify_packfile(struct packed_git *p,
 					    sha1_to_hex(entries[i].sha1),
 					    p->pack_name, (uintmax_t)offset);
 		}
-		data = unpack_entry(p, entries[i].offset, &type, &size);
+		type = unpack_object_header(p, w_curs, &curpos, &size);
+		unuse_pack(w_curs);
+		if (type == OBJ_BLOB) {
+			unsigned char sha1[20];
+			data = unpack_entry(p, entries[i].offset, &type, &size, sha1);
+			if (!data) {
+				if (hashcmp(entries[i].sha1, sha1))
+					err = error("packed %s from %s is corrupt",
+						    sha1_to_hex(entries[i].sha1), p->pack_name);
+				else if (fn) {
+					int eaten = 0;
+					fn(entries[i].sha1, type, size, NULL, &eaten);
+				}
+				if (((base_count + i) & 1023) == 0)
+					display_progress(progress, base_count + i);
+				continue;
+			}
+		}
+		data = unpack_entry(p, entries[i].offset, &type, &size, NULL);
 		if (!data)
 			err = error("cannot unpack %s from %s at offset %"PRIuMAX"",
 				    sha1_to_hex(entries[i].sha1), p->pack_name,
diff --git a/sha1_file.c b/sha1_file.c
index a77ef0a..d68a5b0 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1653,28 +1653,51 @@ static int packed_object_info(struct packed_git *p, off_t obj_offset,
 }
 
 static void *unpack_compressed_entry(struct packed_git *p,
-				    struct pack_window **w_curs,
-				    off_t curpos,
-				    unsigned long size)
+				     struct pack_window **w_curs,
+				     off_t curpos,
+				     unsigned long size,
+				     enum object_type type,
+				     unsigned char *sha1)
 {
+	static unsigned char fixed_buf[8192];
 	int st;
 	git_zstream stream;
 	unsigned char *buffer, *in;
+	git_SHA_CTX c;
+
+	if (sha1) {		/* do hash_sha1_file internally */
+		char hdr[32];
+		int hdrlen = sprintf(hdr, "%s %lu", typename(type), size)+1;
+		git_SHA1_Init(&c);
+		git_SHA1_Update(&c, hdr, hdrlen);
+
+		buffer = fixed_buf;
+	} else {
+		buffer = xmallocz(size);
+	}
 
-	buffer = xmallocz(size);
 	memset(&stream, 0, sizeof(stream));
 	stream.next_out = buffer;
-	stream.avail_out = size + 1;
+	stream.avail_out = buffer == fixed_buf ? sizeof(fixed_buf) : size + 1;
 
 	git_inflate_init(&stream);
 	do {
 		in = use_pack(p, w_curs, curpos, &stream.avail_in);
 		stream.next_in = in;
 		st = git_inflate(&stream, Z_FINISH);
-		if (!stream.avail_out)
+		if (sha1) {
+			git_SHA1_Update(&c, buffer, stream.next_out - (unsigned char *)buffer);
+			stream.next_out = buffer;
+			stream.avail_out = sizeof(fixed_buf);
+		}
+		else if (!stream.avail_out)
 			break; /* the payload is larger than it should be */
 		curpos += stream.next_in - in;
 	} while (st == Z_OK || st == Z_BUF_ERROR);
+	if (sha1) {
+		git_SHA1_Final(sha1, &c);
+		buffer = NULL;
+	}
 	git_inflate_end(&stream);
 	if ((st != Z_STREAM_END) || stream.total_out != size) {
 		free(buffer);
@@ -1727,7 +1750,7 @@ static void *cache_or_unpack_entry(struct packed_git *p, off_t base_offset,
 
 	ret = ent->data;
 	if (!ret || ent->p != p || ent->base_offset != base_offset)
-		return unpack_entry(p, base_offset, type, base_size);
+		return unpack_entry(p, base_offset, type, base_size, NULL);
 
 	if (!keep_cache) {
 		ent->data = NULL;
@@ -1844,7 +1867,7 @@ static void *unpack_delta_entry(struct packed_git *p,
 			return NULL;
 	}
 
-	delta_data = unpack_compressed_entry(p, w_curs, curpos, delta_size);
+	delta_data = unpack_compressed_entry(p, w_curs, curpos, delta_size, OBJ_NONE, NULL);
 	if (!delta_data) {
 		error("failed to unpack compressed delta "
 		      "at offset %"PRIuMAX" from %s",
@@ -1883,7 +1906,8 @@ static void write_pack_access_log(struct packed_git *p, off_t obj_offset)
 int do_check_packed_object_crc;
 
 void *unpack_entry(struct packed_git *p, off_t obj_offset,
-		   enum object_type *type, unsigned long *sizep)
+		   enum object_type *type, unsigned long *sizep,
+		   unsigned char *sha1)
 {
 	struct pack_window *w_curs = NULL;
 	off_t curpos = obj_offset;
@@ -1917,7 +1941,8 @@ void *unpack_entry(struct packed_git *p, off_t obj_offset,
 	case OBJ_TREE:
 	case OBJ_BLOB:
 	case OBJ_TAG:
-		data = unpack_compressed_entry(p, &w_curs, curpos, *sizep);
+		data = unpack_compressed_entry(p, &w_curs, curpos,
+					       *sizep, *type, sha1);
 		break;
 	default:
 		data = NULL;
diff --git a/t/t1050-large.sh b/t/t1050-large.sh
index e4b77a2..52acae5 100755
--- a/t/t1050-large.sh
+++ b/t/t1050-large.sh
@@ -143,7 +143,7 @@ test_expect_success 'fetch updates' '
 	)
 '
 
-test_expect_failure 'fsck' '
+test_expect_success 'fsck' '
 	git fsck --full
 '
 
-- 
1.7.3.1.256.g2539c.dirty

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v3 09/11] archive: support streaming large files to a tar archive
  2012-03-04 12:59 ` [PATCH v2 00/10] " Nguyễn Thái Ngọc Duy
                     ` (18 preceding siblings ...)
  2012-03-05  3:43   ` [PATCH v3 08/11] pack-check: do not unpack blobs Nguyễn Thái Ngọc Duy
@ 2012-03-05  3:43   ` Nguyễn Thái Ngọc Duy
  2012-03-06  0:57     ` Junio C Hamano
  2012-03-05  3:43   ` [PATCH v3 10/11] fsck: use streaming interface for writing lost-found blobs Nguyễn Thái Ngọc Duy
  2012-03-05  3:43   ` [PATCH v3 11/11] update-server-info: respect core.bigfilethreshold Nguyễn Thái Ngọc Duy
  21 siblings, 1 reply; 48+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2012-03-05  3:43 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy


Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 archive-tar.c    |   35 ++++++++++++++++++++++++++++-------
 archive-zip.c    |    9 +++++----
 archive.c        |   51 ++++++++++++++++++++++++++++++++++-----------------
 archive.h        |   11 +++++++++--
 t/t1050-large.sh |    2 +-
 5 files changed, 77 insertions(+), 31 deletions(-)

diff --git a/archive-tar.c b/archive-tar.c
index 20af005..5bffe49 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -5,6 +5,7 @@
 #include "tar.h"
 #include "archive.h"
 #include "run-command.h"
+#include "streaming.h"
 
 #define RECORDSIZE	(512)
 #define BLOCKSIZE	(RECORDSIZE * 20)
@@ -123,9 +124,29 @@ static size_t get_path_prefix(const char *path, size_t pathlen, size_t maxlen)
 	return i;
 }
 
+static void write_file(struct git_istream *stream, const void *buffer,
+		       unsigned long size)
+{
+	if (!stream) {
+		write_blocked(buffer, size);
+		return;
+	}
+	for (;;) {
+		char buf[1024 * 16];
+		ssize_t readlen;
+
+		readlen = read_istream(stream, buf, sizeof(buf));
+
+		if (!readlen)
+			break;
+		write_blocked(buf, readlen);
+	}
+}
+
 static int write_tar_entry(struct archiver_args *args,
-		const unsigned char *sha1, const char *path, size_t pathlen,
-		unsigned int mode, void *buffer, unsigned long size)
+			   const unsigned char *sha1, const char *path,
+			   size_t pathlen, unsigned int mode, void *buffer,
+			   struct git_istream *stream, unsigned long size)
 {
 	struct ustar_header header;
 	struct strbuf ext_header = STRBUF_INIT;
@@ -200,14 +221,14 @@ static int write_tar_entry(struct archiver_args *args,
 
 	if (ext_header.len > 0) {
 		err = write_tar_entry(args, sha1, NULL, 0, 0, ext_header.buf,
-				ext_header.len);
+				      NULL, ext_header.len);
 		if (err)
 			return err;
 	}
 	strbuf_release(&ext_header);
 	write_blocked(&header, sizeof(header));
-	if (S_ISREG(mode) && buffer && size > 0)
-		write_blocked(buffer, size);
+	if (S_ISREG(mode) && size > 0)
+		write_file(stream, buffer, size);
 	return err;
 }
 
@@ -219,7 +240,7 @@ static int write_global_extended_header(struct archiver_args *args)
 
 	strbuf_append_ext_header(&ext_header, "comment", sha1_to_hex(sha1), 40);
 	err = write_tar_entry(args, NULL, NULL, 0, 0, ext_header.buf,
-			ext_header.len);
+			      NULL, ext_header.len);
 	strbuf_release(&ext_header);
 	return err;
 }
@@ -308,7 +329,7 @@ static int write_tar_archive(const struct archiver *ar,
 	if (args->commit_sha1)
 		err = write_global_extended_header(args);
 	if (!err)
-		err = write_archive_entries(args, write_tar_entry);
+		err = write_archive_entries(args, write_tar_entry, 1);
 	if (!err)
 		write_trailer();
 	return err;
diff --git a/archive-zip.c b/archive-zip.c
index 02d1f37..4a1e917 100644
--- a/archive-zip.c
+++ b/archive-zip.c
@@ -120,9 +120,10 @@ static void *zlib_deflate(void *data, unsigned long size,
 	return buffer;
 }
 
-static int write_zip_entry(struct archiver_args *args,
-		const unsigned char *sha1, const char *path, size_t pathlen,
-		unsigned int mode, void *buffer, unsigned long size)
+int write_zip_entry(struct archiver_args *args,
+			   const unsigned char *sha1, const char *path,
+			   size_t pathlen, unsigned int mode, void *buffer,
+			   struct git_istream *stream, unsigned long size)
 {
 	struct zip_local_header header;
 	struct zip_dir_header dirent;
@@ -271,7 +272,7 @@ static int write_zip_archive(const struct archiver *ar,
 	zip_dir = xmalloc(ZIP_DIRECTORY_MIN_SIZE);
 	zip_dir_size = ZIP_DIRECTORY_MIN_SIZE;
 
-	err = write_archive_entries(args, write_zip_entry);
+	err = write_archive_entries(args, write_zip_entry, 0);
 	if (!err)
 		write_zip_trailer(args->commit_sha1);
 
diff --git a/archive.c b/archive.c
index 1ee837d..257eadf 100644
--- a/archive.c
+++ b/archive.c
@@ -5,6 +5,7 @@
 #include "archive.h"
 #include "parse-options.h"
 #include "unpack-trees.h"
+#include "streaming.h"
 
 static char const * const archive_usage[] = {
 	"git archive [options] <tree-ish> [<path>...]",
@@ -59,26 +60,35 @@ static void format_subst(const struct commit *commit,
 	free(to_free);
 }
 
-static void *sha1_file_to_archive(const char *path, const unsigned char *sha1,
-		unsigned int mode, enum object_type *type,
-		unsigned long *sizep, const struct commit *commit)
+void sha1_file_to_archive(void **buffer, struct git_istream **stream,
+			  const char *path, const unsigned char *sha1,
+			  unsigned int mode, enum object_type *type,
+			  unsigned long *sizep,
+			  const struct commit *commit)
 {
-	void *buffer;
+	if (stream) {
+		struct stream_filter *filter;
+		filter = get_stream_filter(path, sha1);
+		if (!commit && S_ISREG(mode) && is_null_stream_filter(filter)) {
+			*buffer = NULL;
+			*stream = open_istream(sha1, type, sizep, NULL);
+			return;
+		}
+		*stream = NULL;
+	}
 
-	buffer = read_sha1_file(sha1, type, sizep);
-	if (buffer && S_ISREG(mode)) {
+	*buffer = read_sha1_file(sha1, type, sizep);
+	if (*buffer && S_ISREG(mode)) {
 		struct strbuf buf = STRBUF_INIT;
 		size_t size = 0;
 
-		strbuf_attach(&buf, buffer, *sizep, *sizep + 1);
+		strbuf_attach(&buf, *buffer, *sizep, *sizep + 1);
 		convert_to_working_tree(path, buf.buf, buf.len, &buf);
 		if (commit)
 			format_subst(commit, buf.buf, buf.len, &buf);
-		buffer = strbuf_detach(&buf, &size);
+		*buffer = strbuf_detach(&buf, &size);
 		*sizep = size;
 	}
-
-	return buffer;
 }
 
 static void setup_archive_check(struct git_attr_check *check)
@@ -97,6 +107,7 @@ static void setup_archive_check(struct git_attr_check *check)
 struct archiver_context {
 	struct archiver_args *args;
 	write_archive_entry_fn_t write_entry;
+	int stream_ok;
 };
 
 static int write_archive_entry(const unsigned char *sha1, const char *base,
@@ -109,6 +120,7 @@ static int write_archive_entry(const unsigned char *sha1, const char *base,
 	write_archive_entry_fn_t write_entry = c->write_entry;
 	struct git_attr_check check[2];
 	const char *path_without_prefix;
+	struct git_istream *stream = NULL;
 	int convert = 0;
 	int err;
 	enum object_type type;
@@ -133,25 +145,29 @@ static int write_archive_entry(const unsigned char *sha1, const char *base,
 		strbuf_addch(&path, '/');
 		if (args->verbose)
 			fprintf(stderr, "%.*s\n", (int)path.len, path.buf);
-		err = write_entry(args, sha1, path.buf, path.len, mode, NULL, 0);
+		err = write_entry(args, sha1, path.buf, path.len, mode, NULL, NULL, 0);
 		if (err)
 			return err;
 		return (S_ISDIR(mode) ? READ_TREE_RECURSIVE : 0);
 	}
 
-	buffer = sha1_file_to_archive(path_without_prefix, sha1, mode,
-			&type, &size, convert ? args->commit : NULL);
-	if (!buffer)
+	sha1_file_to_archive(&buffer, c->stream_ok ? &stream : NULL,
+			     path_without_prefix, sha1, mode,
+			     &type, &size, convert ? args->commit : NULL);
+	if (!buffer && !stream)
 		return error("cannot read %s", sha1_to_hex(sha1));
 	if (args->verbose)
 		fprintf(stderr, "%.*s\n", (int)path.len, path.buf);
-	err = write_entry(args, sha1, path.buf, path.len, mode, buffer, size);
+	err = write_entry(args, sha1, path.buf, path.len, mode, buffer, stream, size);
+	if (stream)
+		close_istream(stream);
 	free(buffer);
 	return err;
 }
 
 int write_archive_entries(struct archiver_args *args,
-		write_archive_entry_fn_t write_entry)
+			  write_archive_entry_fn_t write_entry,
+			  int stream_ok)
 {
 	struct archiver_context context;
 	struct unpack_trees_options opts;
@@ -167,13 +183,14 @@ int write_archive_entries(struct archiver_args *args,
 		if (args->verbose)
 			fprintf(stderr, "%.*s\n", (int)len, args->base);
 		err = write_entry(args, args->tree->object.sha1, args->base,
-				len, 040777, NULL, 0);
+				  len, 040777, NULL, NULL, 0);
 		if (err)
 			return err;
 	}
 
 	context.args = args;
 	context.write_entry = write_entry;
+	context.stream_ok = stream_ok;
 
 	/*
 	 * Setup index and instruct attr to read index only
diff --git a/archive.h b/archive.h
index 2b0884f..370cca9 100644
--- a/archive.h
+++ b/archive.h
@@ -27,9 +27,16 @@ extern void register_archiver(struct archiver *);
 extern void init_tar_archiver(void);
 extern void init_zip_archiver(void);
 
-typedef int (*write_archive_entry_fn_t)(struct archiver_args *args, const unsigned char *sha1, const char *path, size_t pathlen, unsigned int mode, void *buffer, unsigned long size);
+struct git_istream;
+typedef int (*write_archive_entry_fn_t)(struct archiver_args *args,
+					const unsigned char *sha1,
+					const char *path, size_t pathlen,
+					unsigned int mode,
+					void *buffer,
+					struct git_istream *stream,
+					unsigned long size);
 
-extern int write_archive_entries(struct archiver_args *args, write_archive_entry_fn_t write_entry);
+extern int write_archive_entries(struct archiver_args *args, write_archive_entry_fn_t write_entry, int stream_ok);
 extern int write_archive(int argc, const char **argv, const char *prefix, int setup_prefix, const char *name_hint, int remote);
 
 const char *archive_format_from_filename(const char *filename);
diff --git a/t/t1050-large.sh b/t/t1050-large.sh
index 52acae5..5336eb8 100755
--- a/t/t1050-large.sh
+++ b/t/t1050-large.sh
@@ -151,7 +151,7 @@ test_expect_failure 'repack' '
 	git repack -ad
 '
 
-test_expect_failure 'tar achiving' '
+test_expect_success 'tar achiving' '
 	git archive --format=tar HEAD >/dev/null
 '
 
-- 
1.7.3.1.256.g2539c.dirty

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v3 10/11] fsck: use streaming interface for writing lost-found blobs
  2012-03-04 12:59 ` [PATCH v2 00/10] " Nguyễn Thái Ngọc Duy
                     ` (19 preceding siblings ...)
  2012-03-05  3:43   ` [PATCH v3 09/11] archive: support streaming large files to a tar archive Nguyễn Thái Ngọc Duy
@ 2012-03-05  3:43   ` Nguyễn Thái Ngọc Duy
  2012-03-05  3:43   ` [PATCH v3 11/11] update-server-info: respect core.bigfilethreshold Nguyễn Thái Ngọc Duy
  21 siblings, 0 replies; 48+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2012-03-05  3:43 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy


Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 builtin/fsck.c |    8 ++------
 1 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index 8c479a7..7fcb33e 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -12,6 +12,7 @@
 #include "parse-options.h"
 #include "dir.h"
 #include "progress.h"
+#include "streaming.h"
 
 #define REACHABLE 0x0001
 #define SEEN      0x0002
@@ -236,13 +237,8 @@ static void check_unreachable_object(struct object *obj)
 			if (!(f = fopen(filename, "w")))
 				die_errno("Could not open '%s'", filename);
 			if (obj->type == OBJ_BLOB) {
-				enum object_type type;
-				unsigned long size;
-				char *buf = read_sha1_file(obj->sha1,
-						&type, &size);
-				if (buf && fwrite(buf, 1, size, f) != size)
+				if (stream_blob_to_fd(fileno(f), obj->sha1, NULL, 1))
 					die_errno("Could not write '%s'", filename);
-				free(buf);
 			} else
 				fprintf(f, "%s\n", sha1_to_hex(obj->sha1));
 			if (fclose(f))
-- 
1.7.3.1.256.g2539c.dirty

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v3 11/11] update-server-info: respect core.bigfilethreshold
  2012-03-04 12:59 ` [PATCH v2 00/10] " Nguyễn Thái Ngọc Duy
                     ` (20 preceding siblings ...)
  2012-03-05  3:43   ` [PATCH v3 10/11] fsck: use streaming interface for writing lost-found blobs Nguyễn Thái Ngọc Duy
@ 2012-03-05  3:43   ` Nguyễn Thái Ngọc Duy
  21 siblings, 0 replies; 48+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2012-03-05  3:43 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy

This command indirectly calls check_sha1_signature() (add_info_ref ->
deref_tag -> parse_object -> ..) , which may put whole blob in memory
if the blob's size is under core.bigfilethreshold. As config is not
read, the threshold is always 512MB. Respect user settings here.

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 builtin/update-server-info.c |    1 +
 t/t1050-large.sh             |    2 +-
 2 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/builtin/update-server-info.c b/builtin/update-server-info.c
index b90dce6..0d63c44 100644
--- a/builtin/update-server-info.c
+++ b/builtin/update-server-info.c
@@ -15,6 +15,7 @@ int cmd_update_server_info(int argc, const char **argv, const char *prefix)
 		OPT_END()
 	};
 
+	git_config(git_default_config, NULL);
 	argc = parse_options(argc, argv, prefix, options,
 			     update_server_info_usage, 0);
 	if (argc > 0)
diff --git a/t/t1050-large.sh b/t/t1050-large.sh
index 5336eb8..9197b89 100755
--- a/t/t1050-large.sh
+++ b/t/t1050-large.sh
@@ -147,7 +147,7 @@ test_expect_success 'fsck' '
 	git fsck --full
 '
 
-test_expect_failure 'repack' '
+test_expect_success 'repack' '
 	git repack -ad
 '
 
-- 
1.7.3.1.256.g2539c.dirty

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 04/11] parse_object: special code path for blobs to avoid putting whole object in memory
  2012-03-05  3:43   ` [PATCH v3 04/11] parse_object: special code path for blobs to avoid putting whole object in memory Nguyễn Thái Ngọc Duy
@ 2012-03-06  0:57     ` Junio C Hamano
  0 siblings, 0 replies; 48+ messages in thread
From: Junio C Hamano @ 2012-03-06  0:57 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy; +Cc: git

Nguyễn Thái Ngọc Duy  <pclouds@gmail.com> writes:

> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>

The code looks OK but the updated API into check_sha1_signature()
needs to be explained both in-code comment and the log message.

I'll push out an updated version later on 'pu'.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 09/11] archive: support streaming large files to a tar archive
  2012-03-05  3:43   ` [PATCH v3 09/11] archive: support streaming large files to a tar archive Nguyễn Thái Ngọc Duy
@ 2012-03-06  0:57     ` Junio C Hamano
  0 siblings, 0 replies; 48+ messages in thread
From: Junio C Hamano @ 2012-03-06  0:57 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy; +Cc: git

Nguyễn Thái Ngọc Duy  <pclouds@gmail.com> writes:

> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>

This is *way* *too* underdocumented.

For example, it is totally unclear from the patch what determines
the last parameter to write_archive_entries(), OK_TO_STREAM.  Does
it depend on the nature of the payload?  Does the backend decide it,
in other words, if it is prepared to read from a streaming API or not?

I wanted to first take all the "do not slurp things in core, and
instead read from streaming API" patches from this series, but I
had to stop at this one.

> ---
>  archive-tar.c    |   35 ++++++++++++++++++++++++++++-------
>  archive-zip.c    |    9 +++++----
>  archive.c        |   51 ++++++++++++++++++++++++++++++++++-----------------
>  archive.h        |   11 +++++++++--
>  t/t1050-large.sh |    2 +-
>  5 files changed, 77 insertions(+), 31 deletions(-)
>
> diff --git a/archive-tar.c b/archive-tar.c
> index 20af005..5bffe49 100644
> --- a/archive-tar.c
> +++ b/archive-tar.c
> @@ -5,6 +5,7 @@
>  #include "tar.h"
>  #include "archive.h"
>  #include "run-command.h"
> +#include "streaming.h"
>  
>  #define RECORDSIZE	(512)
>  #define BLOCKSIZE	(RECORDSIZE * 20)
> @@ -123,9 +124,29 @@ static size_t get_path_prefix(const char *path, size_t pathlen, size_t maxlen)
>  	return i;
>  }
>  
> +static void write_file(struct git_istream *stream, const void *buffer,
> +		       unsigned long size)
> +{
> +	if (!stream) {
> +		write_blocked(buffer, size);
> +		return;
> +	}
> +	for (;;) {
> +		char buf[1024 * 16];
> +		ssize_t readlen;
> +
> +		readlen = read_istream(stream, buf, sizeof(buf));
> +
> +		if (!readlen)
> +			break;
> +		write_blocked(buf, readlen);
> +	}
> +}
> +
>  static int write_tar_entry(struct archiver_args *args,
> -		const unsigned char *sha1, const char *path, size_t pathlen,
> -		unsigned int mode, void *buffer, unsigned long size)
> +			   const unsigned char *sha1, const char *path,
> +			   size_t pathlen, unsigned int mode, void *buffer,
> +			   struct git_istream *stream, unsigned long size)
>  {
>  	struct ustar_header header;
>  	struct strbuf ext_header = STRBUF_INIT;
> @@ -200,14 +221,14 @@ static int write_tar_entry(struct archiver_args *args,
>  
>  	if (ext_header.len > 0) {
>  		err = write_tar_entry(args, sha1, NULL, 0, 0, ext_header.buf,
> -				ext_header.len);
> +				      NULL, ext_header.len);
>  		if (err)
>  			return err;
>  	}
>  	strbuf_release(&ext_header);
>  	write_blocked(&header, sizeof(header));
> -	if (S_ISREG(mode) && buffer && size > 0)
> -		write_blocked(buffer, size);
> +	if (S_ISREG(mode) && size > 0)
> +		write_file(stream, buffer, size);
>  	return err;
>  }
>  
> @@ -219,7 +240,7 @@ static int write_global_extended_header(struct archiver_args *args)
>  
>  	strbuf_append_ext_header(&ext_header, "comment", sha1_to_hex(sha1), 40);
>  	err = write_tar_entry(args, NULL, NULL, 0, 0, ext_header.buf,
> -			ext_header.len);
> +			      NULL, ext_header.len);
>  	strbuf_release(&ext_header);
>  	return err;
>  }
> @@ -308,7 +329,7 @@ static int write_tar_archive(const struct archiver *ar,
>  	if (args->commit_sha1)
>  		err = write_global_extended_header(args);
>  	if (!err)
> -		err = write_archive_entries(args, write_tar_entry);
> +		err = write_archive_entries(args, write_tar_entry, 1);
>  	if (!err)
>  		write_trailer();
>  	return err;
> diff --git a/archive-zip.c b/archive-zip.c
> index 02d1f37..4a1e917 100644
> --- a/archive-zip.c
> +++ b/archive-zip.c
> @@ -120,9 +120,10 @@ static void *zlib_deflate(void *data, unsigned long size,
>  	return buffer;
>  }
>  
> -static int write_zip_entry(struct archiver_args *args,
> -		const unsigned char *sha1, const char *path, size_t pathlen,
> -		unsigned int mode, void *buffer, unsigned long size)
> +int write_zip_entry(struct archiver_args *args,
> +			   const unsigned char *sha1, const char *path,
> +			   size_t pathlen, unsigned int mode, void *buffer,
> +			   struct git_istream *stream, unsigned long size)
>  {
>  	struct zip_local_header header;
>  	struct zip_dir_header dirent;
> @@ -271,7 +272,7 @@ static int write_zip_archive(const struct archiver *ar,
>  	zip_dir = xmalloc(ZIP_DIRECTORY_MIN_SIZE);
>  	zip_dir_size = ZIP_DIRECTORY_MIN_SIZE;
>  
> -	err = write_archive_entries(args, write_zip_entry);
> +	err = write_archive_entries(args, write_zip_entry, 0);
>  	if (!err)
>  		write_zip_trailer(args->commit_sha1);
>  
> diff --git a/archive.c b/archive.c
> index 1ee837d..257eadf 100644
> --- a/archive.c
> +++ b/archive.c
> @@ -5,6 +5,7 @@
>  #include "archive.h"
>  #include "parse-options.h"
>  #include "unpack-trees.h"
> +#include "streaming.h"
>  
>  static char const * const archive_usage[] = {
>  	"git archive [options] <tree-ish> [<path>...]",
> @@ -59,26 +60,35 @@ static void format_subst(const struct commit *commit,
>  	free(to_free);
>  }
>  
> -static void *sha1_file_to_archive(const char *path, const unsigned char *sha1,
> -		unsigned int mode, enum object_type *type,
> -		unsigned long *sizep, const struct commit *commit)
> +void sha1_file_to_archive(void **buffer, struct git_istream **stream,
> +			  const char *path, const unsigned char *sha1,
> +			  unsigned int mode, enum object_type *type,
> +			  unsigned long *sizep,
> +			  const struct commit *commit)
>  {
> -	void *buffer;
> +	if (stream) {
> +		struct stream_filter *filter;
> +		filter = get_stream_filter(path, sha1);
> +		if (!commit && S_ISREG(mode) && is_null_stream_filter(filter)) {
> +			*buffer = NULL;
> +			*stream = open_istream(sha1, type, sizep, NULL);
> +			return;
> +		}
> +		*stream = NULL;
> +	}
>  
> -	buffer = read_sha1_file(sha1, type, sizep);
> -	if (buffer && S_ISREG(mode)) {
> +	*buffer = read_sha1_file(sha1, type, sizep);
> +	if (*buffer && S_ISREG(mode)) {
>  		struct strbuf buf = STRBUF_INIT;
>  		size_t size = 0;
>  
> -		strbuf_attach(&buf, buffer, *sizep, *sizep + 1);
> +		strbuf_attach(&buf, *buffer, *sizep, *sizep + 1);
>  		convert_to_working_tree(path, buf.buf, buf.len, &buf);
>  		if (commit)
>  			format_subst(commit, buf.buf, buf.len, &buf);
> -		buffer = strbuf_detach(&buf, &size);
> +		*buffer = strbuf_detach(&buf, &size);
>  		*sizep = size;
>  	}
> -
> -	return buffer;
>  }
>  
>  static void setup_archive_check(struct git_attr_check *check)
> @@ -97,6 +107,7 @@ static void setup_archive_check(struct git_attr_check *check)
>  struct archiver_context {
>  	struct archiver_args *args;
>  	write_archive_entry_fn_t write_entry;
> +	int stream_ok;
>  };
>  
>  static int write_archive_entry(const unsigned char *sha1, const char *base,
> @@ -109,6 +120,7 @@ static int write_archive_entry(const unsigned char *sha1, const char *base,
>  	write_archive_entry_fn_t write_entry = c->write_entry;
>  	struct git_attr_check check[2];
>  	const char *path_without_prefix;
> +	struct git_istream *stream = NULL;
>  	int convert = 0;
>  	int err;
>  	enum object_type type;
> @@ -133,25 +145,29 @@ static int write_archive_entry(const unsigned char *sha1, const char *base,
>  		strbuf_addch(&path, '/');
>  		if (args->verbose)
>  			fprintf(stderr, "%.*s\n", (int)path.len, path.buf);
> -		err = write_entry(args, sha1, path.buf, path.len, mode, NULL, 0);
> +		err = write_entry(args, sha1, path.buf, path.len, mode, NULL, NULL, 0);
>  		if (err)
>  			return err;
>  		return (S_ISDIR(mode) ? READ_TREE_RECURSIVE : 0);
>  	}
>  
> -	buffer = sha1_file_to_archive(path_without_prefix, sha1, mode,
> -			&type, &size, convert ? args->commit : NULL);
> -	if (!buffer)
> +	sha1_file_to_archive(&buffer, c->stream_ok ? &stream : NULL,
> +			     path_without_prefix, sha1, mode,
> +			     &type, &size, convert ? args->commit : NULL);
> +	if (!buffer && !stream)
>  		return error("cannot read %s", sha1_to_hex(sha1));
>  	if (args->verbose)
>  		fprintf(stderr, "%.*s\n", (int)path.len, path.buf);
> -	err = write_entry(args, sha1, path.buf, path.len, mode, buffer, size);
> +	err = write_entry(args, sha1, path.buf, path.len, mode, buffer, stream, size);
> +	if (stream)
> +		close_istream(stream);
>  	free(buffer);
>  	return err;
>  }
>  
>  int write_archive_entries(struct archiver_args *args,
> -		write_archive_entry_fn_t write_entry)
> +			  write_archive_entry_fn_t write_entry,
> +			  int stream_ok)
>  {
>  	struct archiver_context context;
>  	struct unpack_trees_options opts;
> @@ -167,13 +183,14 @@ int write_archive_entries(struct archiver_args *args,
>  		if (args->verbose)
>  			fprintf(stderr, "%.*s\n", (int)len, args->base);
>  		err = write_entry(args, args->tree->object.sha1, args->base,
> -				len, 040777, NULL, 0);
> +				  len, 040777, NULL, NULL, 0);
>  		if (err)
>  			return err;
>  	}
>  
>  	context.args = args;
>  	context.write_entry = write_entry;
> +	context.stream_ok = stream_ok;
>  
>  	/*
>  	 * Setup index and instruct attr to read index only
> diff --git a/archive.h b/archive.h
> index 2b0884f..370cca9 100644
> --- a/archive.h
> +++ b/archive.h
> @@ -27,9 +27,16 @@ extern void register_archiver(struct archiver *);
>  extern void init_tar_archiver(void);
>  extern void init_zip_archiver(void);
>  
> -typedef int (*write_archive_entry_fn_t)(struct archiver_args *args, const unsigned char *sha1, const char *path, size_t pathlen, unsigned int mode, void *buffer, unsigned long size);
> +struct git_istream;
> +typedef int (*write_archive_entry_fn_t)(struct archiver_args *args,
> +					const unsigned char *sha1,
> +					const char *path, size_t pathlen,
> +					unsigned int mode,
> +					void *buffer,
> +					struct git_istream *stream,
> +					unsigned long size);
>  
> -extern int write_archive_entries(struct archiver_args *args, write_archive_entry_fn_t write_entry);
> +extern int write_archive_entries(struct archiver_args *args, write_archive_entry_fn_t write_entry, int stream_ok);
>  extern int write_archive(int argc, const char **argv, const char *prefix, int setup_prefix, const char *name_hint, int remote);
>  
>  const char *archive_format_from_filename(const char *filename);
> diff --git a/t/t1050-large.sh b/t/t1050-large.sh
> index 52acae5..5336eb8 100755
> --- a/t/t1050-large.sh
> +++ b/t/t1050-large.sh
> @@ -151,7 +151,7 @@ test_expect_failure 'repack' '
>  	git repack -ad
>  '
>  
> -test_expect_failure 'tar achiving' '
> +test_expect_success 'tar achiving' '
>  	git archive --format=tar HEAD >/dev/null
>  '

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 01/10] Add more large blob test cases
  2012-03-04 12:59   ` [PATCH v2 01/10] Add more large blob test cases Nguyễn Thái Ngọc Duy
@ 2012-03-06  0:59     ` Junio C Hamano
  0 siblings, 0 replies; 48+ messages in thread
From: Junio C Hamano @ 2012-03-06  0:59 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy; +Cc: git

Nguyễn Thái Ngọc Duy  <pclouds@gmail.com> writes:

> diff --git a/wrapper.c b/wrapper.c
> index 85f09df..d4c0972 100644
> --- a/wrapper.c
> +++ b/wrapper.c
> @@ -9,6 +9,18 @@ static void do_nothing(size_t size)
>  
>  static void (*try_to_free_routine)(size_t size) = do_nothing;
>  
> +static void memory_limit_check(size_t size)
> +{
> +	static int limit = -1;
> +	if (limit == -1) {
> +		const char *env = getenv("GIT_ALLOC_LIMIT");
> +		limit = env ? atoi(env) * 1024 : 0;
> +	}
> +	if (limit && size > limit)
> +		die("attempting to allocate %d over limit %d",
> +		    size, limit);

size is size_t and %d calls for an int.

I'll push out a fixed-up version later to 'pu'.

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2012-03-06  0:59 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-02-27  7:55 [PATCH 00/11] Large blob fixes Nguyễn Thái Ngọc Duy
2012-02-27  7:55 ` [PATCH 01/11] Add more large blob test cases Nguyễn Thái Ngọc Duy
2012-02-27 20:18   ` Peter Baumann
2012-02-27  7:55 ` [PATCH 02/11] Factor out and export large blob writing code to arbitrary file handle Nguyễn Thái Ngọc Duy
2012-02-27 17:29   ` Junio C Hamano
2012-02-27 21:50     ` Junio C Hamano
2012-02-27  7:55 ` [PATCH 03/11] cat-file: use streaming interface to print blobs Nguyễn Thái Ngọc Duy
2012-02-27 17:44   ` Junio C Hamano
2012-02-28  1:08     ` Nguyen Thai Ngoc Duy
2012-02-27  7:55 ` [PATCH 04/11] parse_object: special code path for blobs to avoid putting whole object in memory Nguyễn Thái Ngọc Duy
2012-02-27  7:55 ` [PATCH 05/11] show: use streaming interface for showing blobs Nguyễn Thái Ngọc Duy
2012-02-27 18:00   ` Junio C Hamano
2012-02-27  7:55 ` [PATCH 06/11] index-pack --verify: skip sha-1 collision test Nguyễn Thái Ngọc Duy
2012-02-27  7:55 ` [PATCH 07/11] index-pack: split second pass obj handling into own function Nguyễn Thái Ngọc Duy
2012-02-27  7:55 ` [PATCH 08/11] index-pack: reduce memory usage when the pack has large blobs Nguyễn Thái Ngọc Duy
2012-02-27  7:55 ` [PATCH 09/11] pack-check: do not unpack blobs Nguyễn Thái Ngọc Duy
2012-02-27  7:55 ` [PATCH 10/11] archive: support streaming large files to a tar archive Nguyễn Thái Ngọc Duy
2012-02-27  7:55 ` [PATCH 11/11] fsck: use streaming interface for writing lost-found blobs Nguyễn Thái Ngọc Duy
2012-02-27 18:43 ` [PATCH 00/11] Large blob fixes Junio C Hamano
2012-02-28  1:23   ` Nguyen Thai Ngoc Duy
2012-03-04 12:59 ` [PATCH v2 00/10] " Nguyễn Thái Ngọc Duy
2012-03-04 12:59   ` [PATCH v2 01/10] Add more large blob test cases Nguyễn Thái Ngọc Duy
2012-03-06  0:59     ` Junio C Hamano
2012-03-04 12:59   ` [PATCH v2 02/10] streaming: make streaming-write-entry to be more reusable Nguyễn Thái Ngọc Duy
2012-03-04 12:59   ` [PATCH v2 03/10] cat-file: use streaming interface to print blobs Nguyễn Thái Ngọc Duy
2012-03-04 23:12     ` Junio C Hamano
2012-03-05  2:42       ` Nguyen Thai Ngoc Duy
2012-03-04 12:59   ` [PATCH v2 04/10] parse_object: special code path for blobs to avoid putting whole object in memory Nguyễn Thái Ngọc Duy
2012-03-04 12:59   ` [PATCH v2 05/10] show: use streaming interface for showing blobs Nguyễn Thái Ngọc Duy
2012-03-04 12:59   ` [PATCH v2 06/10] index-pack: split second pass obj handling into own function Nguyễn Thái Ngọc Duy
2012-03-04 12:59   ` [PATCH v2 07/10] index-pack: reduce memory usage when the pack has large blobs Nguyễn Thái Ngọc Duy
2012-03-04 12:59   ` [PATCH v2 08/10] pack-check: do not unpack blobs Nguyễn Thái Ngọc Duy
2012-03-04 12:59   ` [PATCH v2 09/10] archive: support streaming large files to a tar archive Nguyễn Thái Ngọc Duy
2012-03-04 12:59   ` [PATCH v2 10/10] fsck: use streaming interface for writing lost-found blobs Nguyễn Thái Ngọc Duy
2012-03-05  3:43   ` [PATCH v3 00/11] Large blob fixes Nguyễn Thái Ngọc Duy
2012-03-05  3:43   ` [PATCH v3 01/11] Add more large blob test cases Nguyễn Thái Ngọc Duy
2012-03-05  3:43   ` [PATCH v3 02/11] streaming: make streaming-write-entry to be more reusable Nguyễn Thái Ngọc Duy
2012-03-05  3:43   ` [PATCH v3 03/11] cat-file: use streaming interface to print blobs Nguyễn Thái Ngọc Duy
2012-03-05  3:43   ` [PATCH v3 04/11] parse_object: special code path for blobs to avoid putting whole object in memory Nguyễn Thái Ngọc Duy
2012-03-06  0:57     ` Junio C Hamano
2012-03-05  3:43   ` [PATCH v3 05/11] show: use streaming interface for showing blobs Nguyễn Thái Ngọc Duy
2012-03-05  3:43   ` [PATCH v3 06/11] index-pack: split second pass obj handling into own function Nguyễn Thái Ngọc Duy
2012-03-05  3:43   ` [PATCH v3 07/11] index-pack: reduce memory usage when the pack has large blobs Nguyễn Thái Ngọc Duy
2012-03-05  3:43   ` [PATCH v3 08/11] pack-check: do not unpack blobs Nguyễn Thái Ngọc Duy
2012-03-05  3:43   ` [PATCH v3 09/11] archive: support streaming large files to a tar archive Nguyễn Thái Ngọc Duy
2012-03-06  0:57     ` Junio C Hamano
2012-03-05  3:43   ` [PATCH v3 10/11] fsck: use streaming interface for writing lost-found blobs Nguyễn Thái Ngọc Duy
2012-03-05  3:43   ` [PATCH v3 11/11] update-server-info: respect core.bigfilethreshold Nguyễn Thái Ngọc Duy

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.