* [BUG?] bulk checkin does not respect filters
@ 2012-02-24  3:02 Jeff King
  2012-02-24  3:17 ` Junio C Hamano
  0 siblings, 1 reply; 19+ messages in thread
From: Jeff King @ 2012-02-24  3:02 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

If I do this:

  git init repo &&
  cd repo &&
  echo foo >small &&
  cat small small small small >large &&
  echo '* filter=foo2bar' >.gitattributes &&
  git config filter.foo2bar.clean 'sed s/foo/bar/' &&
  git config core.bigfilethreshold 10 &&
  git add . &&
  echo "===> small" && git cat-file blob :small
  echo "===> large" && git cat-file blob :large

the output I get is:

  ===> small
  bar
  ===> large
  foo
  foo
  foo
  foo

I.e., the clean filter is not applied to the bulk checkin file. Nor can
it be easily, because we need to know the size of the file to write the
blob header, and we don't know that until we see all of the filter's
output.
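
(For context: a git blob object is hashed as the header "blob <size>\0" followed by the content, so the very first bytes fed to the hash depend on the total post-filter length. A minimal illustration, assuming a POSIX shell with sha1sum available:)

```shell
# Hash an empty blob the way git does: header first, then content.
# The header embeds the byte count, which is why it cannot be emitted
# until the full (post-filter) size is known.
printf 'blob 0\0' | sha1sum
# -> e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, git's well-known empty-blob id
```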

In practice, I don't know if this is a huge deal, as people aren't going
to be asking to de-CRLF files that actually cross the 512M
bigfilethreshold (OTOH, I seem to recall there are some filters floating
around for normalizing gzip'd files, which could plausibly be gigantic).

But it seems like the right choice when we see this conflict is not
"don't do filters for streaming checkin", but rather "don't do streaming
checkin when filters are in use" (because streaming is an optimization,
and filters are about correctness).

It would be even nicer if filters could play well with bulk checkin, but
I think that would involve streaming to a tempfile, checking the size of
the file, and then streaming that into an object. That is better than
putting the whole thing in memory if doing so would mean swapping, but
probably worse if you can fit the whole thing in memory (because you're
doing a ton of extra I/O for the tempfile).
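
(The tempfile dance described above can be sketched in shell terms, using the sed filter from the repro; a conceptual sketch of the idea, not git's actual code path:)

```shell
# Run the clean filter into a tempfile so the result's size is known,
# then stream header + filtered content into the hash. This costs two
# passes of I/O over the data instead of holding it all in memory.
tmp=$(mktemp)
printf 'foo\n' >input
sed 's/foo/bar/' <input >"$tmp"   # clean filter; size unknown until EOF
size=$(wc -c <"$tmp")             # now the blob header can be written
printf 'blob %d\0' "$size" | cat - "$tmp" | sha1sum
rm -f "$tmp" input
```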

Thoughts? Was this intentional, or just overlooked?

-Peff

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [BUG?] bulk checkin does not respect filters
  2012-02-24  3:02 [BUG?] bulk checkin does not respect filters Jeff King
@ 2012-02-24  3:17 ` Junio C Hamano
  2012-02-24  3:42   ` Junio C Hamano
  2012-02-24  8:28   ` Jeff King
  0 siblings, 2 replies; 19+ messages in thread
From: Junio C Hamano @ 2012-02-24  3:17 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> Thoughts? Was this intentional, or just overlooked?

This is intentional, in the sense that it is not worth worrying about.
(I personally consider large and binary gunk equivalent: nobody should
care about anything more than 1. what exact contents it had at each point
in history and 2. whether it did or did not change between two versions,
but definitely not 3. how different the two versions were.) It does not
deserve the complexity to support filtering and textconv'ing.

If somebody feels strongly about lifting the limitation with a clean
patch that does not harm the common-case codepaths, patches are welcome,
but I am not planning to do it myself ;-)

It is worth documenting this limitation, though.

* Re: [BUG?] bulk checkin does not respect filters
  2012-02-24  3:17 ` Junio C Hamano
@ 2012-02-24  3:42   ` Junio C Hamano
  2012-02-24  7:54     ` Jeff King
  2012-02-24  8:28   ` Jeff King
  1 sibling, 1 reply; 19+ messages in thread
From: Junio C Hamano @ 2012-02-24  3:42 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Junio C Hamano <gitster@pobox.com> writes:

> Jeff King <peff@peff.net> writes:
>
>> Thoughts? Was this intentional, or just overlooked?

It is a bit curious that anything filtered even goes to the streaming
codepath, given this piece of code in write_entry() in entry.c:

	if (ce_mode_s_ifmt == S_IFREG) {
		struct stream_filter *filter = get_stream_filter(path, ce->sha1);
		if (filter &&
		    !streaming_write_entry(ce, path, filter,
					   state, to_tempfile,
					   &fstat_done, &st))
			goto finish;
	}

and get_stream_filter() in convert.c has an explicit exception for this
case at the very beginning:

struct stream_filter *get_stream_filter(const char *path, const unsigned char *sha1)
{
	struct conv_attrs ca;
	enum crlf_action crlf_action;
	struct stream_filter *filter = NULL;

	convert_attrs(&ca, path);

	if (ca.drv && (ca.drv->smudge || ca.drv->clean))
		return filter;

to make sure that it says "No streaming filtering is possible, do not even
attempt to call streaming_write_entry()".

* Re: [BUG?] bulk checkin does not respect filters
  2012-02-24  3:42   ` Junio C Hamano
@ 2012-02-24  7:54     ` Jeff King
  2012-02-24 18:48       ` Junio C Hamano
  0 siblings, 1 reply; 19+ messages in thread
From: Jeff King @ 2012-02-24  7:54 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Thu, Feb 23, 2012 at 07:42:11PM -0800, Junio C Hamano wrote:

> It is a bit curious that anything filtered even goes to the streaming
> codepath, given this piece of code in write_entry() in entry.c:
> 
> 	if (ce_mode_s_ifmt == S_IFREG) {
> 		struct stream_filter *filter = get_stream_filter(path, ce->sha1);
> 		if (filter &&
> 		    !streaming_write_entry(ce, path, filter,
> 					   state, to_tempfile,
> 					   &fstat_done, &st))
> 			goto finish;
> 	}
> 
> and get_stream_filter() in convert.c has an explicit exception for this
> case at the very beginning:

I think it is because we don't follow that code path at all. The stack
trace for "git add" on a large file looks like:

#0  stream_to_pack (...) at bulk-checkin.c:101
#1  deflate_to_pack (...) at bulk-checkin.c:219
#2  index_bulk_checkin (...) at bulk-checkin.c:258
#3  index_stream (...) at sha1_file.c:2712
#4  index_fd (...) at sha1_file.c:2726
#5  index_path (...) at sha1_file.c:2742
#6  add_to_index (...) at read-cache.c:644
#7  add_file_to_index (...) at read-cache.c:673
#8  add_files (...) at builtin/add.c:363
#9  cmd_add (...) at builtin/add.c:474
#10 run_builtin (...) at git.c:324
#11 handle_internal_command (...) at git.c:484
#12 run_argv (...) at git.c:530
#13 main (...) at git.c:605 

Isn't write_entry the _other_ side of things? I.e., checking out, not
checking in? That side works fine (which makes the large-file handling
in git-add even odder: we fail to apply the "clean" filter when adding
the file, but we do apply the smudge filter when we write it out).

-Peff

* Re: [BUG?] bulk checkin does not respect filters
  2012-02-24  3:17 ` Junio C Hamano
  2012-02-24  3:42   ` Junio C Hamano
@ 2012-02-24  8:28   ` Jeff King
  2012-02-24  9:39     ` Jeff King
  1 sibling, 1 reply; 19+ messages in thread
From: Jeff King @ 2012-02-24  8:28 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Thu, Feb 23, 2012 at 07:17:49PM -0800, Junio C Hamano wrote:

> > Thoughts? Was this intentional, or just overlooked?
> 
> This is intentional, in the sense that it is not worth worrying about.
> (I personally consider large and binary gunk equivalent: nobody should
> care about anything more than 1. what exact contents it had at each
> point in history and 2. whether it did or did not change between two
> versions, but definitely not 3. how different the two versions were.)
> It does not deserve the complexity to support filtering and
> textconv'ing.

We're purely in the hypothetical devil's advocate realm here, as I do
not use the filtering feature myself. But what if I had a filter that
canonicalized large binary files in some way (e.g., by re-zipping large
zipfiles using deterministic parameters)? Even though they are large
binary gunk and you would never want to diff two versions, the important
thing is getting an identical sha1 when the underlying data has not
changed. I.e., your point 2 above, but applied to the "clean" repository
versions of files.
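
(Concretely, such a setup might look like the fragment below; "rezip --deterministic" is a hypothetical canonicalizing tool, not something that exists:)

```
# .gitattributes -- route zipfiles through a canonicalizing clean filter
*.zip filter=zipnorm

# .git/config -- "rezip --deterministic" is hypothetical
[filter "zipnorm"]
	clean = rezip --deterministic
```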

> If somebody feels strongly about lifting the limitation with a clean
> patch that does not harm the common-case codepaths, patches are
> welcome, but I am not planning to do it myself ;-)

Perhaps something like this:

---
 convert.c   |   18 ++++++++++++++++++
 convert.h   |    1 +
 sha1_file.c |    3 ++-
 3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/convert.c b/convert.c
index 12868ed..254301a 100644
--- a/convert.c
+++ b/convert.c
@@ -742,6 +742,24 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
 	}
 }
 
+int want_convert_to_git(const char *path)
+{
+	struct conv_attrs ca;
+	convert_attrs(&ca, path);
+
+	if (ca.drv && ca.drv->clean)
+		return 1;
+	if (ca.ident)
+		return 1;
+
+	ca.crlf_action = input_crlf_action(ca.crlf_action, ca.eol_attr);
+	if (ca.crlf_action == CRLF_BINARY)
+		return 0;
+	if (ca.crlf_action == CRLF_GUESS && auto_crlf == AUTO_CRLF_FALSE)
+		return 0;
+	return 1;
+}
+
 int convert_to_git(const char *path, const char *src, size_t len,
                    struct strbuf *dst, enum safe_crlf checksafe)
 {
diff --git a/convert.h b/convert.h
index d799a16..4d5936d 100644
--- a/convert.h
+++ b/convert.h
@@ -36,6 +36,7 @@ extern enum eol core_eol;
 /* returns 1 if *dst was used */
 extern int convert_to_git(const char *path, const char *src, size_t len,
 			  struct strbuf *dst, enum safe_crlf checksafe);
+extern int want_convert_to_git(const char *path);
 extern int convert_to_working_tree(const char *path, const char *src,
 				   size_t len, struct strbuf *dst);
 extern int renormalize_buffer(const char *path, const char *src, size_t len,
diff --git a/sha1_file.c b/sha1_file.c
index f9f8d5e..6c0e05c 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2720,7 +2720,8 @@ int index_fd(unsigned char *sha1, int fd, struct stat *st,
 
 	if (!S_ISREG(st->st_mode))
 		ret = index_pipe(sha1, fd, type, path, flags);
-	else if (size <= big_file_threshold || type != OBJ_BLOB)
+	else if (size <= big_file_threshold || type != OBJ_BLOB ||
+		 want_convert_to_git(path))
 		ret = index_core(sha1, fd, size, type, path, flags);
 	else
 		ret = index_stream(sha1, fd, size, type, path, flags);

There should be no performance impact, as the new code only kicks in for
files that exceed big_file_threshold (and even then, it is just an extra
duplicate attr lookup, which the check_attr code caches anyway).

I don't like repeating all of the convert_to_git policy logic. Perhaps
if you pass a NULL buffer to convert_to_git, it should run through its
usual logic, stopping just short of actually writing anything, and
return a flag indicating whether it _would_ convert (this can't be 100%
accurate, as sometimes conversion depends on looking at the actual
contents of the buffer, but it could at least tell us "yes, I might
convert" versus "no, I will definitely not convert").

-Peff

* Re: [BUG?] bulk checkin does not respect filters
  2012-02-24  8:28   ` Jeff King
@ 2012-02-24  9:39     ` Jeff King
  2012-02-24  9:41       ` [PATCH 1/2] teach convert_to_git a "dry run" mode Jeff King
                         ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Jeff King @ 2012-02-24  9:39 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Fri, Feb 24, 2012 at 03:28:03AM -0500, Jeff King wrote:

> I don't like repeating all of the convert_to_git policy logic. Perhaps
> if you pass a NULL buffer to convert_to_git, it should run through its
> usual logic, stopping just short of actually writing anything, and
> return a flag indicating whether it _would_ convert...

So here's a series that does that.

  [1/2]: teach convert_to_git a "dry run" mode
  [2/2]: do not stream large files to pack when filters are in use

-Peff

* [PATCH 1/2] teach convert_to_git a "dry run" mode
  2012-02-24  9:39     ` Jeff King
@ 2012-02-24  9:41       ` Jeff King
  2012-02-24  9:48       ` [PATCH 2/2] do not stream large files to pack when filters are in use Jeff King
  2012-02-24 20:03       ` [BUG?] bulk checkin does not respect filters Junio C Hamano
  2 siblings, 0 replies; 19+ messages in thread
From: Jeff King @ 2012-02-24  9:41 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Some callers may want to know whether convert_to_git will do
any conversion without actually feeding file data to it
(e.g., if you are using the decision to decide how to
acquire the data that might be converted). Rather than
replicate convert_to_git's logic in a separate function,
this patch lets callers pass a NULL buffer. The function
runs through its usual logic, except that it does not do any
actual conversion; the return value, instead of specifying
whether conversion happened, indicates whether conversion
might occur.

Note the use of the word "might" above. We cannot know for
sure whether conversion will occur without seeing the buffer
itself (because CRLF conversion may decide the file is
binary and do nothing). However, "might" is close enough for
callers that are trying to find out whether filters are in
use; they can be conservative and assume that the filters
will be used in such a case.

Signed-off-by: Jeff King <peff@peff.net>
---
 convert.c |   18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/convert.c b/convert.c
index 12868ed..33aa72f 100644
--- a/convert.c
+++ b/convert.c
@@ -195,9 +195,13 @@ static int crlf_to_git(const char *path, const char *src, size_t len,
 	char *dst;
 
 	if (crlf_action == CRLF_BINARY ||
-	    (crlf_action == CRLF_GUESS && auto_crlf == AUTO_CRLF_FALSE) || !len)
+	    (crlf_action == CRLF_GUESS && auto_crlf == AUTO_CRLF_FALSE) ||
+	    (src && !len))
 		return 0;
 
+	if (!src)
+		return 1;
+
 	gather_stats(src, len, &stats);
 
 	if (crlf_action == CRLF_AUTO || crlf_action == CRLF_GUESS) {
@@ -391,6 +395,9 @@ static int apply_filter(const char *path, const char *src, size_t len,
 	if (!cmd)
 		return 0;
 
+	if (!src)
+		return 1;
+
 	memset(&async, 0, sizeof(async));
 	async.proc = filter_buffer;
 	async.data = &params;
@@ -522,9 +529,12 @@ static int ident_to_git(const char *path, const char *src, size_t len,
 {
 	char *dst, *dollar;
 
-	if (!ident || !count_ident(src, len))
+	if (!ident || (src && !count_ident(src, len)))
 		return 0;
 
+	if (!src)
+		return 1;
+
 	/* only grow if not in place */
 	if (strbuf_avail(buf) + buf->len < len)
 		strbuf_grow(buf, len - buf->len);
@@ -754,13 +764,13 @@ int convert_to_git(const char *path, const char *src, size_t len,
 		filter = ca.drv->clean;
 
 	ret |= apply_filter(path, src, len, dst, filter);
-	if (ret) {
+	if (ret && src) {
 		src = dst->buf;
 		len = dst->len;
 	}
 	ca.crlf_action = input_crlf_action(ca.crlf_action, ca.eol_attr);
 	ret |= crlf_to_git(path, src, len, dst, ca.crlf_action, checksafe);
-	if (ret) {
+	if (ret && src) {
 		src = dst->buf;
 		len = dst->len;
 	}
-- 
1.7.9.9.g04d94

* [PATCH 2/2] do not stream large files to pack when filters are in use
  2012-02-24  9:39     ` Jeff King
  2012-02-24  9:41       ` [PATCH 1/2] teach convert_to_git a "dry run" mode Jeff King
@ 2012-02-24  9:48       ` Jeff King
  2012-02-24 20:03         ` Junio C Hamano
  2012-02-24 20:03       ` [BUG?] bulk checkin does not respect filters Junio C Hamano
  2 siblings, 1 reply; 19+ messages in thread
From: Jeff King @ 2012-02-24  9:48 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Because git's object format requires us to specify the
number of bytes in the object in its header, we must know
the size before streaming a blob into the object database.
This is not a problem when adding a regular file, as we can
get the size from stat(). However, when filters are in use
(such as autocrlf, or the ident, filter, or eol
gitattributes), we have no idea what the ultimate size will
be.

The current code just punts on the whole issue and ignores
filter configuration entirely for files larger than
core.bigfilethreshold. This can generate confusing results
if you use filters for large binary files, as the filter
will suddenly stop working as the file goes over a certain
size.  Rather than try to handle unknown input sizes with
streaming, this patch just turns off the streaming
optimization when filters are in use.

This has a slight performance regression in a very specific
case: if you have autocrlf on, but no gitattributes, a large
binary file will avoid the streaming code path because we
don't know beforehand whether it will need conversion or
not. But if you are handling large binary files, you should
be marking them as such via attributes (or at least not
using autocrlf, and instead marking your text files as
such). And the flip side is that if you have a large
_non_-binary file, there is a correctness improvement:
before this patch, we did not apply the conversion at all.

The first half of the new t1051 script covers these failures
on input. The second half tests the matching output code
paths. These already work correctly, and do not need any
adjustment.

Signed-off-by: Jeff King <peff@peff.net>
---
I'm on the fence about the performance regression above. On the one
hand, if you're telling git to autocrlf your gigantic binary files, you
should fix your attributes, and expect not to use the streaming
optimization until you do. On the other hand, while I can see a remote
possibility of using filter.*.clean on a large binary file, the case
where autocrlf would actually apply seems absurdly unlikely.

So rather than be pessimistic that filters will be used in the "auto"
case, perhaps we should be optimistic that they would not be. Or another
way of thinking about it is: from the CRLF code's perspective, anything
larger than core.bigfilethreshold should be considered binary without
actually looking at it, and we can say for sure (not the "might" from
the last patch) that such large things would not be converted in the
CRLF_GUESS case.

 sha1_file.c                 |    3 +-
 t/t1051-large-conversion.sh |   86 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 88 insertions(+), 1 deletion(-)
 create mode 100755 t/t1051-large-conversion.sh

diff --git a/sha1_file.c b/sha1_file.c
index f9f8d5e..61f597b 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2720,7 +2720,8 @@ int index_fd(unsigned char *sha1, int fd, struct stat *st,
 
 	if (!S_ISREG(st->st_mode))
 		ret = index_pipe(sha1, fd, type, path, flags);
-	else if (size <= big_file_threshold || type != OBJ_BLOB)
+	else if (size <= big_file_threshold || type != OBJ_BLOB ||
+		 convert_to_git(path, NULL, 0, NULL, 0))
 		ret = index_core(sha1, fd, size, type, path, flags);
 	else
 		ret = index_stream(sha1, fd, size, type, path, flags);
diff --git a/t/t1051-large-conversion.sh b/t/t1051-large-conversion.sh
new file mode 100755
index 0000000..8b7640b
--- /dev/null
+++ b/t/t1051-large-conversion.sh
@@ -0,0 +1,86 @@
+#!/bin/sh
+
+test_description='test conversion filters on large files'
+. ./test-lib.sh
+
+set_attr() {
+	test_when_finished 'rm -f .gitattributes' &&
+	echo "* $*" >.gitattributes
+}
+
+check_input() {
+	git read-tree --empty &&
+	git add small large &&
+	git cat-file blob :small >small.index &&
+	git cat-file blob :large | head -n 1 >large.index &&
+	test_cmp small.index large.index
+}
+
+check_output() {
+	rm -f small large &&
+	git checkout small large &&
+	head -n 1 large >large.head &&
+	test_cmp small large.head
+}
+
+test_expect_success 'setup input tests' '
+	printf "\$Id: foo\$\\r\\n" >small &&
+	cat small small >large &&
+	git config core.bigfilethreshold 20 &&
+	git config filter.test.clean "sed s/.*/CLEAN/"
+'
+
+test_expect_success 'autocrlf=true converts on input' '
+	test_config core.autocrlf true &&
+	check_input
+'
+
+test_expect_success 'eol=crlf converts on input' '
+	set_attr eol=crlf &&
+	check_input
+'
+
+test_expect_success 'ident converts on input' '
+	set_attr ident &&
+	check_input
+'
+
+test_expect_success 'user-defined filters convert on input' '
+	set_attr filter=test &&
+	check_input
+'
+
+test_expect_success 'setup output tests' '
+	echo "\$Id\$" >small &&
+	cat small small >large &&
+	git add small large &&
+	git config core.bigfilethreshold 7 &&
+	git config filter.test.smudge "sed s/.*/SMUDGE/"
+'
+
+test_expect_success 'autocrlf=true converts on output' '
+	test_config core.autocrlf true &&
+	check_output
+'
+
+test_expect_success 'eol=crlf converts on output' '
+	set_attr eol=crlf &&
+	check_output
+'
+
+test_expect_success 'user-defined filters convert on output' '
+	set_attr filter=test &&
+	check_output
+'
+
+test_expect_success 'ident converts on output' '
+	set_attr ident &&
+	rm -f small large &&
+	git checkout small large &&
+	sed -n "s/Id: .*/Id: SHA/p" <small >small.clean &&
+	head -n 1 large >large.head &&
+	sed -n "s/Id: .*/Id: SHA/p" <large.head >large.clean &&
+	test_cmp small.clean large.clean
+'
+
+test_done
-- 
1.7.9.9.g04d94

* Re: [BUG?] bulk checkin does not respect filters
  2012-02-24  7:54     ` Jeff King
@ 2012-02-24 18:48       ` Junio C Hamano
  0 siblings, 0 replies; 19+ messages in thread
From: Junio C Hamano @ 2012-02-24 18:48 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git

Jeff King <peff@peff.net> writes:

> Isn't write_entry the _other_ side of things? I.e., checking out, not
> checking in?

Ah, of course.

* Re: [PATCH 2/2] do not stream large files to pack when filters are in use
  2012-02-24  9:48       ` [PATCH 2/2] do not stream large files to pack when filters are in use Jeff King
@ 2012-02-24 20:03         ` Junio C Hamano
  2012-02-24 20:48           ` Jeff King
  0 siblings, 1 reply; 19+ messages in thread
From: Junio C Hamano @ 2012-02-24 20:03 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> I'm on the fence about the performance regression above. On the one
> hand, if you're telling git to autocrlf your gigantic binary files, you
> should fix your attributes, and expect not to use the streaming
> optimization until you do.

That is a sensible angle to view the issue from, I would think.

> diff --git a/sha1_file.c b/sha1_file.c
> index f9f8d5e..61f597b 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -2720,7 +2720,8 @@ int index_fd(unsigned char *sha1, int fd, struct stat *st,
>  
>  	if (!S_ISREG(st->st_mode))
>  		ret = index_pipe(sha1, fd, type, path, flags);
> -	else if (size <= big_file_threshold || type != OBJ_BLOB)
> +	else if (size <= big_file_threshold || type != OBJ_BLOB ||
> +		 convert_to_git(path, NULL, 0, NULL, 0))

Nice.  It would be even nicer to give a readability macro whose name makes
it clear that this is a query (unfortunately we cannot add '?' at the end
of the function name) and not a conversion.  Any name suggestions?

By the way, I tried this with the tip of 'pu' as of last night, and the
result segfaults on me in t1050 (hash-object -w) ;-).

* Re: [BUG?] bulk checkin does not respect filters
  2012-02-24  9:39     ` Jeff King
  2012-02-24  9:41       ` [PATCH 1/2] teach convert_to_git a "dry run" mode Jeff King
  2012-02-24  9:48       ` [PATCH 2/2] do not stream large files to pack when filters are in use Jeff King
@ 2012-02-24 20:03       ` Junio C Hamano
  2 siblings, 0 replies; 19+ messages in thread
From: Junio C Hamano @ 2012-02-24 20:03 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> On Fri, Feb 24, 2012 at 03:28:03AM -0500, Jeff King wrote:
>
>> I don't like repeating all of the convert_to_git policy logic. Perhaps
>> if you pass a NULL buffer to convert_to_git, it should run through its
>> usual logic, stopping just short of actually writing anything, and
>> return a flag indicating whether it _would_ convert...

Makes very good sense.  Thanks.

* Re: [PATCH 2/2] do not stream large files to pack when filters are in use
  2012-02-24 20:03         ` Junio C Hamano
@ 2012-02-24 20:48           ` Jeff King
  2012-02-24 21:01             ` Jeff King
  2012-02-24 21:19             ` Jeff King
  0 siblings, 2 replies; 19+ messages in thread
From: Jeff King @ 2012-02-24 20:48 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Fri, Feb 24, 2012 at 12:03:17PM -0800, Junio C Hamano wrote:

> > -	else if (size <= big_file_threshold || type != OBJ_BLOB)
> > +	else if (size <= big_file_threshold || type != OBJ_BLOB ||
> > +		 convert_to_git(path, NULL, 0, NULL, 0))
> 
> Nice.  It would be even nicer to give a readability macro whose name makes
> it clear that this is a query (unfortunately we cannot add '?' at the end
> of the function name) and not a conversion.  Any name suggestions?

Yeah, I cringed a little at all of the NULLs. How about:

  int would_convert_to_git(const char *path);

?

> By the way, I tried this with the tip of 'pu' as of last night, and the
> result segfaults on me in t1050 (hash-object -w) ;-).

Eek. I ran through "make test" on my finished series without a problem,
but it may be a bad interaction with something in "pu", or maybe a
heisenbug. I'll investigate.

-Peff

* Re: [PATCH 2/2] do not stream large files to pack when filters are in use
  2012-02-24 20:48           ` Jeff King
@ 2012-02-24 21:01             ` Jeff King
  2012-02-24 21:20               ` Junio C Hamano
  2012-02-24 21:19             ` Jeff King
  1 sibling, 1 reply; 19+ messages in thread
From: Jeff King @ 2012-02-24 21:01 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Fri, Feb 24, 2012 at 03:48:10PM -0500, Jeff King wrote:

> On Fri, Feb 24, 2012 at 12:03:17PM -0800, Junio C Hamano wrote:
> 
> > > -	else if (size <= big_file_threshold || type != OBJ_BLOB)
> > > +	else if (size <= big_file_threshold || type != OBJ_BLOB ||
> > > +		 convert_to_git(path, NULL, 0, NULL, 0))
> > 
> > Nice.  It would be even nicer to give a readability macro whose name makes
> > it clear that this is a query (unfortunately we cannot add '?' at the end
> > of the function name) and not a conversion.  Any name suggestions?
> 
> Yeah, I cringed a little at all of the NULLs. How about:
> 
>   int would_convert_to_git(const char *path);
> 
> ?

I started to do this but had a thought: would other callers want more
flexibility? That is, should convert_to_git's dry-run trigger actually
be a NULL _destination_ buffer?

I.e., you could call it like this:

  convert_to_git(path, buf, len, NULL, 0);

and find out the real answer, including inspecting buf, about whether
we would convert. Or you could also use a NULL buffer to get the
pessimistic "we might convert" case.

I don't think it really matters that much, as I am introducing only one
caller. I'm not sure if any other code paths would care about this
speculative "maybe we would convert" question, so perhaps it is simply
over-engineering.

-Peff

* Re: [PATCH 2/2] do not stream large files to pack when filters are in use
  2012-02-24 20:48           ` Jeff King
  2012-02-24 21:01             ` Jeff King
@ 2012-02-24 21:19             ` Jeff King
  2012-02-24 22:02               ` [PATCHv2 1/3] teach convert_to_git a "dry run" mode Jeff King
                                 ` (3 more replies)
  1 sibling, 4 replies; 19+ messages in thread
From: Jeff King @ 2012-02-24 21:19 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Fri, Feb 24, 2012 at 03:48:10PM -0500, Jeff King wrote:

> > By the way, I tried this with the tip of 'pu' as of last night, and the
> > result segfaults on me in t1050 (hash-object -w) ;-).
> 
> Eek. I ran through "make test" on my finished series without a problem,
> but it may be a bad interaction with something in "pu", or maybe a
> heisenbug. I'll investigate.

Or maybe I'm just an idiot, and apparently didn't actually run the test
suite correctly. The segfault is very reproducible, and happens because
hash-object --stdin passes a NULL for the path.

I'll post the fixed series in a minute (with this fix, and the improved
convert_to_git wrapper).

-Peff

* Re: [PATCH 2/2] do not stream large files to pack when filters are in use
  2012-02-24 21:01             ` Jeff King
@ 2012-02-24 21:20               ` Junio C Hamano
  0 siblings, 0 replies; 19+ messages in thread
From: Junio C Hamano @ 2012-02-24 21:20 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> On Fri, Feb 24, 2012 at 03:48:10PM -0500, Jeff King wrote:
>
>> On Fri, Feb 24, 2012 at 12:03:17PM -0800, Junio C Hamano wrote:
>> 
>> > > -	else if (size <= big_file_threshold || type != OBJ_BLOB)
>> > > +	else if (size <= big_file_threshold || type != OBJ_BLOB ||
>> > > +		 convert_to_git(path, NULL, 0, NULL, 0))
>> > 
>> > Nice.  It would be even nicer to give a readability macro whose name makes
>> > it clear that this is a query (unfortunately we cannot add '?' at the end
>> > of the function name) and not a conversion.  Any name suggestions?
>> ...
> I.e., you could call it like this:
>
>   convert_to_git(path, buf, len, NULL, 0);
>
> and find out the real answer, including inspecting buf, about whether
> we would convert. Or you could also use a NULL buffer to get the
> pessimistic "we might convert" case.
>
> I don't think it really matters that much, as I am introducing only one
> caller. I'm not sure if any other code paths would care about this
> speculative "maybe we would convert" question, so perhaps it is simply
> over-engineering.

I agree checking if dst is NULL makes a lot more sense than checking if
the src is NULL.

The reason I said "readability" macro was *NOT* that these NULLs looked
unsightly.  I wanted the function name to tell the readers "We are *NOT*
actually converting anything here at this point".

I am perfectly fine keeping the source side of the parameters in the
readability macro, even if this particular caller may have to spell NULL
and 0 there.

Thanks.

* [PATCHv2 1/3] teach convert_to_git a "dry run" mode
  2012-02-24 21:19             ` Jeff King
@ 2012-02-24 22:02               ` Jeff King
  2012-02-24 22:05               ` [PATCHv2 2/3] teach dry-run convert_to_git not to require a src buffer Jeff King
                                 ` (2 subsequent siblings)
  3 siblings, 0 replies; 19+ messages in thread
From: Jeff King @ 2012-02-24 22:02 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Some callers may want to know whether convert_to_git will
actually do anything before performing the conversion
itself (e.g., to decide whether to stream or handle blobs
in-core). This patch lets callers specify the dry run mode
by passing a NULL destination buffer. The return value,
instead of indicating whether conversion happened, will
indicate whether conversion would occur.

For readability, we also include a wrapper function which
makes it more obvious we are not actually performing the
conversion.

Signed-off-by: Jeff King <peff@peff.net>
---
This splits the 1/2 from the first series into two patches. This part
handles the dry-run aspect. Unlike the previous version, it uses a NULL
destination, rather than a NULL source to indicate the dry-run mode.
Which just makes more sense, and means that the "don't bother doing
source analysis" bit is split off.

No callers introduced in this series actually want to do a dry-run with
a non-NULL source buffer, so that feature is of dubious value. But it's
not any extra code to do it this way, and the resulting commits are much
easier to read, I think.

 convert.c |   17 +++++++++++++++--
 convert.h |    5 +++++
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/convert.c b/convert.c
index 12868ed..65fa9d5 100644
--- a/convert.c
+++ b/convert.c
@@ -231,6 +231,13 @@ static int crlf_to_git(const char *path, const char *src, size_t len,
 	if (!stats.cr)
 		return 0;
 
+	/*
+	 * At this point all of our source analysis is done, and we are sure we
+	 * would convert. If we are in dry-run mode, we can give an answer.
+	 */
+	if (!buf)
+		return 1;
+
 	/* only grow if not in place */
 	if (strbuf_avail(buf) + buf->len < len)
 		strbuf_grow(buf, len - buf->len);
@@ -391,6 +398,9 @@ static int apply_filter(const char *path, const char *src, size_t len,
 	if (!cmd)
 		return 0;
 
+	if (!dst)
+		return 1;
+
 	memset(&async, 0, sizeof(async));
 	async.proc = filter_buffer;
 	async.data = &params;
@@ -525,6 +535,9 @@ static int ident_to_git(const char *path, const char *src, size_t len,
 	if (!ident || !count_ident(src, len))
 		return 0;
 
+	if (!buf)
+		return 1;
+
 	/* only grow if not in place */
 	if (strbuf_avail(buf) + buf->len < len)
 		strbuf_grow(buf, len - buf->len);
@@ -754,13 +767,13 @@ int convert_to_git(const char *path, const char *src, size_t len,
 		filter = ca.drv->clean;
 
 	ret |= apply_filter(path, src, len, dst, filter);
-	if (ret) {
+	if (ret && dst) {
 		src = dst->buf;
 		len = dst->len;
 	}
 	ca.crlf_action = input_crlf_action(ca.crlf_action, ca.eol_attr);
 	ret |= crlf_to_git(path, src, len, dst, ca.crlf_action, checksafe);
-	if (ret) {
+	if (ret && dst) {
 		src = dst->buf;
 		len = dst->len;
 	}
diff --git a/convert.h b/convert.h
index d799a16..ec5fd69 100644
--- a/convert.h
+++ b/convert.h
@@ -40,6 +40,11 @@ extern int convert_to_working_tree(const char *path, const char *src,
 				   size_t len, struct strbuf *dst);
 extern int renormalize_buffer(const char *path, const char *src, size_t len,
 			      struct strbuf *dst);
+static inline int would_convert_to_git(const char *path, const char *src,
+				       size_t len, enum safe_crlf checksafe)
+{
+	return convert_to_git(path, src, len, NULL, checksafe);
+}
 
 /*****************************************************************
  *
-- 
1.7.9.11.gca600

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCHv2 2/3] teach dry-run convert_to_git not to require a src buffer
  2012-02-24 21:19             ` Jeff King
  2012-02-24 22:02               ` [PATCHv2 1/3] teach convert_to_git a "dry run" mode Jeff King
@ 2012-02-24 22:05               ` Jeff King
  2012-02-24 22:10               ` [PATCHv2 3/3] do not stream large files to pack when filters are in use Jeff King
  2012-02-24 22:42               ` [PATCH 2/2] " Junio C Hamano
  3 siblings, 0 replies; 19+ messages in thread
From: Jeff King @ 2012-02-24 22:05 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

When we call convert_to_git in dry-run mode, it may still
want to look at the source buffer, because some CRLF
conversion modes depend on analyzing the source to determine
whether it is in fact convertible CRLF text.

However, the main motivation for convert_to_git's dry-run
mode is to let the caller decide which method to use to
acquire the blob's data (streaming versus in-core).
Requiring this source analysis creates a chicken-and-egg
problem. We are better off simply guessing that anything we
can't analyze will end up needing conversion.

This patch lets a caller specify a NULL src buffer when
using dry-run mode (and only dry-run mode). A non-zero
return value goes from "we would convert" to "we might
convert"; a zero return value remains "we would definitely
not convert".

Signed-off-by: Jeff King <peff@peff.net>
---
And this is the second half of v1's patch 1/2.

By splitting this off, I think it makes it much more obvious if we want
to re-visit the pessimism later (e.g., to optimistically treat missing
src buffers as binary if they are large).

 convert.c |   12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/convert.c b/convert.c
index 65fa9d5..aa7f72d 100644
--- a/convert.c
+++ b/convert.c
@@ -195,9 +195,17 @@ static int crlf_to_git(const char *path, const char *src, size_t len,
 	char *dst;
 
 	if (crlf_action == CRLF_BINARY ||
-	    (crlf_action == CRLF_GUESS && auto_crlf == AUTO_CRLF_FALSE) || !len)
+	    (crlf_action == CRLF_GUESS && auto_crlf == AUTO_CRLF_FALSE) ||
+	    (src && !len))
 		return 0;
 
+	/*
+	 * If we are doing a dry-run and have no source buffer, there is
+	 * nothing to analyze; we must assume we would convert.
+	 */
+	if (!buf && !src)
+		return 1;
+
 	gather_stats(src, len, &stats);
 
 	if (crlf_action == CRLF_AUTO || crlf_action == CRLF_GUESS) {
@@ -532,7 +540,7 @@ static int ident_to_git(const char *path, const char *src, size_t len,
 {
 	char *dst, *dollar;
 
-	if (!ident || !count_ident(src, len))
+	if (!ident || (src && !count_ident(src, len)))
 		return 0;
 
 	if (!buf)
-- 
1.7.9.11.gca600

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCHv2 3/3] do not stream large files to pack when filters are in use
  2012-02-24 21:19             ` Jeff King
  2012-02-24 22:02               ` [PATCHv2 1/3] teach convert_to_git a "dry run" mode Jeff King
  2012-02-24 22:05               ` [PATCHv2 2/3] teach dry-run convert_to_git not to require a src buffer Jeff King
@ 2012-02-24 22:10               ` Jeff King
  2012-02-24 22:42               ` [PATCH 2/2] " Junio C Hamano
  3 siblings, 0 replies; 19+ messages in thread
From: Jeff King @ 2012-02-24 22:10 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Because git's object format requires us to specify the
number of bytes in the object in its header, we must know
the size before streaming a blob into the object database.
This is not a problem when adding a regular file, as we can
get the size from stat(). However, when filters are in use
(such as autocrlf, or the ident, filter, or eol
gitattributes), we have no idea what the ultimate size will
be.

The current code just punts on the whole issue and ignores
filter configuration entirely for files larger than
core.bigfilethreshold. This can generate confusing results
if you use filters for large binary files, as the filter
will suddenly stop working as the file goes over a certain
size.  Rather than try to handle unknown input sizes with
streaming, this patch just turns off the streaming
optimization when filters are in use.

This has a slight performance regression in a very specific
case: if you have autocrlf on, but no gitattributes, a large
binary file will avoid the streaming code path because we
don't know beforehand whether it will need conversion or
not. But if you are handling large binary files, you should
be marking them as such via attributes (or at least not
using autocrlf, and instead marking your text files as
such). And the flip side is that if you have a large
_non_-binary file, there is a correctness improvement:
before, we did not apply the conversion at all.

The first half of the new t1051 script covers these failures
on input. The second half tests the matching output code
paths. These already work correctly, and do not need any
adjustment.

Signed-off-by: Jeff King <peff@peff.net>
---
This replaces patch 2/2 from v1 of the series. The changes are:

  - use the new would_convert_to_git macro

  - only check filters if we have a path, which handles "hash-object
    --stdin". It actually seems to me that one could have autocrlf on,
    and would expect hash-object to do automatic conversion even without
    a path with which to look up attributes. But this is the same test
    that index_mem uses, so we are matching the behavior there.

  - update the comment above index_stream. I have no idea how I
    previously missed this giant comment explaining the exact situation
    I was testing.

 sha1_file.c                 |   14 ++++---
 t/t1051-large-conversion.sh |   86 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 95 insertions(+), 5 deletions(-)
 create mode 100755 t/t1051-large-conversion.sh

diff --git a/sha1_file.c b/sha1_file.c
index f9f8d5e..4f06a0e 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2700,10 +2700,13 @@ static int index_core(unsigned char *sha1, int fd, size_t size,
  * This also bypasses the usual "convert-to-git" dance, and that is on
  * purpose. We could write a streaming version of the converting
  * functions and insert that before feeding the data to fast-import
- * (or equivalent in-core API described above), but the primary
- * motivation for trying to stream from the working tree file and to
- * avoid mmaping it in core is to deal with large binary blobs, and
- * by definition they do _not_ want to get any conversion.
+ * (or equivalent in-core API described above). However, that is
+ * somewhat complicated, as we do not know the size of the filter
+ * result, which we need to know beforehand when writing a git object.
+ * Since the primary motivation for trying to stream from the working
+ * tree file and to avoid mmaping it in core is to deal with large
+ * binary blobs, they generally do not want to get any conversion, and
+ * callers should avoid this code path when filters are requested.
  */
 static int index_stream(unsigned char *sha1, int fd, size_t size,
 			enum object_type type, const char *path,
@@ -2720,7 +2723,8 @@ int index_fd(unsigned char *sha1, int fd, struct stat *st,
 
 	if (!S_ISREG(st->st_mode))
 		ret = index_pipe(sha1, fd, type, path, flags);
-	else if (size <= big_file_threshold || type != OBJ_BLOB)
+	else if (size <= big_file_threshold || type != OBJ_BLOB ||
+		 (path && would_convert_to_git(path, NULL, 0, 0)))
 		ret = index_core(sha1, fd, size, type, path, flags);
 	else
 		ret = index_stream(sha1, fd, size, type, path, flags);
diff --git a/t/t1051-large-conversion.sh b/t/t1051-large-conversion.sh
new file mode 100755
index 0000000..8b7640b
--- /dev/null
+++ b/t/t1051-large-conversion.sh
@@ -0,0 +1,86 @@
+#!/bin/sh
+
+test_description='test conversion filters on large files'
+. ./test-lib.sh
+
+set_attr() {
+	test_when_finished 'rm -f .gitattributes' &&
+	echo "* $*" >.gitattributes
+}
+
+check_input() {
+	git read-tree --empty &&
+	git add small large &&
+	git cat-file blob :small >small.index &&
+	git cat-file blob :large | head -n 1 >large.index &&
+	test_cmp small.index large.index
+}
+
+check_output() {
+	rm -f small large &&
+	git checkout small large &&
+	head -n 1 large >large.head &&
+	test_cmp small large.head
+}
+
+test_expect_success 'setup input tests' '
+	printf "\$Id: foo\$\\r\\n" >small &&
+	cat small small >large &&
+	git config core.bigfilethreshold 20 &&
+	git config filter.test.clean "sed s/.*/CLEAN/"
+'
+
+test_expect_success 'autocrlf=true converts on input' '
+	test_config core.autocrlf true &&
+	check_input
+'
+
+test_expect_success 'eol=crlf converts on input' '
+	set_attr eol=crlf &&
+	check_input
+'
+
+test_expect_success 'ident converts on input' '
+	set_attr ident &&
+	check_input
+'
+
+test_expect_success 'user-defined filters convert on input' '
+	set_attr filter=test &&
+	check_input
+'
+
+test_expect_success 'setup output tests' '
+	echo "\$Id\$" >small &&
+	cat small small >large &&
+	git add small large &&
+	git config core.bigfilethreshold 7 &&
+	git config filter.test.smudge "sed s/.*/SMUDGE/"
+'
+
+test_expect_success 'autocrlf=true converts on output' '
+	test_config core.autocrlf true &&
+	check_output
+'
+
+test_expect_success 'eol=crlf converts on output' '
+	set_attr eol=crlf &&
+	check_output
+'
+
+test_expect_success 'user-defined filters convert on output' '
+	set_attr filter=test &&
+	check_output
+'
+
+test_expect_success 'ident converts on output' '
+	set_attr ident &&
+	rm -f small large &&
+	git checkout small large &&
+	sed -n "s/Id: .*/Id: SHA/p" <small >small.clean &&
+	head -n 1 large >large.head &&
+	sed -n "s/Id: .*/Id: SHA/p" <large.head >large.clean &&
+	test_cmp small.clean large.clean
+'
+
+test_done
-- 
1.7.9.11.gca600

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] do not stream large files to pack when filters are in use
  2012-02-24 21:19             ` Jeff King
                                 ` (2 preceding siblings ...)
  2012-02-24 22:10               ` [PATCHv2 3/3] do not stream large files to pack when filters are in use Jeff King
@ 2012-02-24 22:42               ` Junio C Hamano
  3 siblings, 0 replies; 19+ messages in thread
From: Junio C Hamano @ 2012-02-24 22:42 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> I'll post the fixed series in a minute (with this fix, and the improved
> convert_to_git wrapper).

Thanks.

The exclusion of path==NULL case was something I didn't think much about,
but I think your solution is the right one.

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2012-02-24 22:42 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-02-24  3:02 [BUG?] bulk checkin does not respect filters Jeff King
2012-02-24  3:17 ` Junio C Hamano
2012-02-24  3:42   ` Junio C Hamano
2012-02-24  7:54     ` Jeff King
2012-02-24 18:48       ` Junio C Hamano
2012-02-24  8:28   ` Jeff King
2012-02-24  9:39     ` Jeff King
2012-02-24  9:41       ` [PATCH 1/2] teach convert_to_git a "dry run" mode Jeff King
2012-02-24  9:48       ` [PATCH 2/2] do not stream large files to pack when filters are in use Jeff King
2012-02-24 20:03         ` Junio C Hamano
2012-02-24 20:48           ` Jeff King
2012-02-24 21:01             ` Jeff King
2012-02-24 21:20               ` Junio C Hamano
2012-02-24 21:19             ` Jeff King
2012-02-24 22:02               ` [PATCHv2 1/3] teach convert_to_git a "dry run" mode Jeff King
2012-02-24 22:05               ` [PATCHv2 2/3] teach dry-run convert_to_git not to require a src buffer Jeff King
2012-02-24 22:10               ` [PATCHv2 3/3] do not stream large files to pack when filters are in use Jeff King
2012-02-24 22:42               ` [PATCH 2/2] " Junio C Hamano
2012-02-24 20:03       ` [BUG?] bulk checkin does not respect filters Junio C Hamano
