[RFC] Long running Git clean/smudge filter

* [RFC] Long running Git clean/smudge filter
@ 2016-07-10 11:35 larsxschneider
  2016-07-10 15:10 ` Joey Hess
  0 siblings, 1 reply; 3+ messages in thread
From: larsxschneider @ 2016-07-10 11:35 UTC (permalink / raw)
  To: git; +Cc: joeyh, pclouds, Johannes.Schindelin, gitster, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Git's clean/smudge mechanism invokes an external filter process for every
single file that is affected by the filter. If you have *a lot* of files to
filter then the startup time of the external filter process becomes a problem
as discussed here:
http://thread.gmane.org/gmane.comp.version-control.git/298293

This patch might be a solution to this problem and I am looking for feedback.
Please note that this is a RFC/proof-of-concept patch. If the list agrees with
this approach then I will post a proper patch series with more tests, error
handling, and documentation.


## How does it work?
Git's "convert.c" got a hashmap that maps a filter command to a process. If
we need to filter a file, then we look into the hashmap. If the filter command
is not there then we create and start a new filter process and setup in/out
pipes. If the filter command is there then we just take the existing and running
process.

In case of a clean filter we send the filename of the file to clean via pipe
to the filter process. The filter process reads the file, generates the cleaned
version and sends it back to Git via pipe.

In case of a smudge filter we send the destination filename of the file to
smudge via pipe as well as its content size and content. The filter process
reads all this info, generates the smudged version and stores the smudged
version at the location of the destination filename. Finally, the filter
process sends an "OK" to tell Git that everything worked as expected.

This patch is based on Joey Hess' great "clean-smudge-annex" topic, idea,
and implementation:
https://github.com/gitster/git/tree/jh/clean-smudge-annex

v1: http://thread.gmane.org/gmane.comp.version-control.git/297475
v4 (latest): http://thread.gmane.org/gmane.comp.version-control.git/297994

You can find a branch with with my patch on top of Joey's work here:
https://github.com/larsxschneider/git/commits/clean-smudge-file-long-running/v1


## Questions

(1) Do you see a general problem with this kind of "pipe" protocol and long
running Git filter processes? Other than the test cases I haven't implemented
a real world filter, yet (on my todo list for tomorrow).

(2) Joey's topic, which is the base for my patch, looks stalled for more than
2 weeks:
http://thread.gmane.org/gmane.comp.version-control.git/297994/focus=298006
I would be happy to address Junio's comments and post a reroll. However,
I don't want to interfere with Joey's plans. How does the list usually
address this kind of situation?

@Joey (in case you are reading this):
My patch changes your initial idea quite a bit. However, I believe it is an
improvement that could be beneficial for git-annex, too. Would you prefer to
work with me on the combination of our ideas (file clean/smudge + long running
clean/smudge processes) or would you prefer to keep your interface?

(3) Clean/smudge filter use Git's async command API to create a separate thread
and start the filter process. In my implementation I start the filter process
directly as this eases the pipe communication. I assume I am missing something
obvious here, but what is the advantage of the async command API? Git needs to
wait for the result of the filter anyways and therefore async is not really
necessary, or?

(4) I am not really experienced with process pipes. Is it OK to use the
"strbuf_read_once" function to read from a pipe? If I use "strbuf_read" then
the process waits indefinitely. Is there a way to tell Git to read X bytes from
a pipe (I just found the size "hint")?

(5) As mentioned above "convert.c" got a hashmap that maps a filter command to
a process. At the end of the Git process I would like to close the pipes and
stop the filter processes. What would be the best place in the code to do that?

Thanks,
Lars

---
 convert.c             | 143 +++++++++++++++++++++++++++++++++++++++++++-------
 t/t0021-conversion.sh |  80 +++++++++++++++++++++-------
 2 files changed, 185 insertions(+), 38 deletions(-)

diff --git a/convert.c b/convert.c
index e4db007..590b85a 100644
--- a/convert.c
+++ b/convert.c
@@ -373,9 +373,120 @@ struct filter_params {
 	int fd;
 	const char *cmd;
 	const char *path; /* Path within the git repository */
-	const char *fspath; /* Path to file on disk */
 };

+static int cmd_process_map_init = 0;
+static struct hashmap cmd_process_map;
+
+struct cmd2process {
+	struct hashmap_entry ent; /* must be the first member! */
+	const char *cmd;
+	struct child_process process;
+};
+
+static int cmd2process_cmp(const struct cmd2process *e1,
+							const struct cmd2process *e2,
+							const void *unused)
+{
+	return strcmp(e1->cmd, e2->cmd);
+}
+
+static struct cmd2process *find_entry(const char *cmd)
+{
+	struct cmd2process k;
+	hashmap_entry_init(&k, memhash(&cmd, sizeof(char *)));
+	k.cmd = cmd;
+	return hashmap_get(&cmd_process_map, &k, NULL);
+}
+
+static int apply_file_filter(const char *path, const char *src, size_t len,
+							struct strbuf *dst, const char *cmd)
+{
+	int ret = 1;
+	struct strbuf nbuf = STRBUF_INIT;
+	struct cmd2process *entry = NULL;
+	struct child_process *process = NULL;
+	const char *argv[] = { NULL, NULL };
+
+	if (!cmd || !*cmd)
+		return 0;
+
+	if (!dst)
+		return 1;
+
+	if (!cmd_process_map_init) {
+		cmd_process_map_init = 1;
+		hashmap_init(&cmd_process_map, (hashmap_cmp_fn) cmd2process_cmp, 0);
+	} else {
+		entry = find_entry(cmd);
+	}
+
+	if (entry) {
+		process = &entry->process;
+	} else {
+		entry = malloc(sizeof(struct cmd2process));
+		hashmap_entry_init(entry, memhash(&cmd, sizeof(char *)));
+		entry->cmd = cmd;
+		hashmap_add(&cmd_process_map, entry);
+		process = &entry->process;
+
+		child_process_init(process);
+		argv[0] = cmd;
+		process->argv = argv;
+		process->use_shell = 1;
+		process->in = -1;
+		process->out = -1;
+
+		if (start_command(process)) {
+			error("cannot fork to run external filter %s", cmd);
+			return 0;
+		}
+	}
+
+	fflush(NULL);
+
+	// TODO: Is the signchain_push OK here?
+	sigchain_push(SIGPIPE, SIG_IGN);
+	write_str_in_full(process->in, path);
+	write_str_in_full(process->in, "\n");
+	if (src && len > 0) {
+		// smudge filter processing
+		struct strbuf lenstr = STRBUF_INIT;
+		strbuf_reset(&lenstr);
+		strbuf_addf(&lenstr, "%zu", len);
+		write_str_in_full(process->in, lenstr.buf);
+		write_str_in_full(process->in, "\n");
+		write_in_full(process->in, src, len);
+		strbuf_reset(&nbuf);
+		if (strbuf_read_once(&nbuf, process->out, 2) < 0 ||
+			strcmp(nbuf.buf, "OK") != 0) {
+			error("read from external file filter %s failed", cmd);
+			ret = 0;
+		}
+	} else {
+		// clean filter processing
+		strbuf_reset(&nbuf);
+		// TODO: Should we read the expected size here first?
+		if (strbuf_read_once(&nbuf, process->out, 0) < 0) {
+			error("read from external file filter %s failed", cmd);
+			ret = 0;
+		} else {
+			strbuf_swap(dst, &nbuf);
+		}
+		// TODO: Should we ask for an OK from the filter here, too?
+	}
+	sigchain_pop(SIGPIPE);
+
+	// TODO: We probably should close the pipes and finish all processes
+	// in the hashmap. What would be a good place to do this?
+	// close(process->in)
+	// close(process->out)
+	// finish_command(process);
+
+	strbuf_release(&nbuf);
+	return ret;
+}
+
 static int filter_buffer_or_fd(int in, int out, void *data)
 {
 	/*
@@ -402,15 +513,6 @@ static int filter_buffer_or_fd(int in, int out, void *data)
 	strbuf_expand(&cmd, params->cmd, strbuf_expand_dict_cb, &dict);
 	strbuf_release(&path);

-	/* append fspath to the command if it's set, separated with a space */
-	if (params->fspath) {
-		struct strbuf fspath = STRBUF_INIT;
-		sq_quote_buf(&fspath, params->fspath);
-		strbuf_addstr(&cmd, " ");
-		strbuf_addbuf(&cmd, &fspath);
-		strbuf_release(&fspath);
-	}
-
 	argv[0] = cmd.buf;

 	child_process.argv = argv;
@@ -449,9 +551,8 @@ static int filter_buffer_or_fd(int in, int out, void *data)
 	return (write_err || status);
 }

-static int apply_filter(const char *path, const char *fspath,
-			const char *src, size_t len, int fd,
-                        struct strbuf *dst, const char *cmd)
+static int apply_filter(const char *path, const char *src, size_t len,
+						int fd, struct strbuf *dst, const char *cmd)
 {
 	/*
 	 * Create a pipeline to have the command filter the buffer's
@@ -479,7 +580,6 @@ static int apply_filter(const char *path, const char *fspath,
 	params.fd = fd;
 	params.cmd = cmd;
 	params.path = path;
-	params.fspath = fspath;

 	fflush(NULL);
 	if (start_async(&async))
@@ -610,7 +710,7 @@ static int count_ident(const char *cp, unsigned long size)
 }

 static int ident_to_git(const char *path, const char *src, size_t len,
-                        struct strbuf *buf, int ident)
+						struct strbuf *buf, int ident)
 {
 	char *dst, *dollar;

@@ -860,7 +960,7 @@ int would_convert_to_git_filter_fd(const char *path)
 	if (!ca.drv->required)
 		return 0;

-	return apply_filter(path, NULL, NULL, 0, -1, NULL, ca.drv->clean);
+	return apply_filter(path, NULL, 0, -1, NULL, ca.drv->clean);
 }

 static int can_filter_file(const char *filefilter, const char *filefiltername,
@@ -950,7 +1050,7 @@ int convert_to_git(const char *path, const char *src, size_t len,
 		required = ca.drv->required;
 	}

-	ret |= apply_filter(path, NULL, src, len, -1, dst, filter);
+	ret |= apply_filter(path, src, len, -1, dst, filter);
 	if (!ret && required)
 		die("%s: clean filter '%s' failed", path, ca.drv->name);

@@ -976,7 +1076,7 @@ void convert_to_git_filter_fd(const char *path, int fd, struct strbuf *dst,
 	assert(ca.drv);
 	assert(ca.drv->clean);

-	if (!apply_filter(path, NULL, NULL, 0, fd, dst, ca.drv->clean))
+	if (!apply_filter(path, NULL, 0, fd, dst, ca.drv->clean))
 		die("%s: clean filter '%s' failed", path, ca.drv->name);

 	crlf_to_git(path, dst->buf, dst->len, dst, ca.crlf_action,
@@ -995,7 +1095,7 @@ void convert_to_git_filter_from_file(const char *path, struct strbuf *dst,
 	assert(ca.drv->clean);
 	assert(ca.drv->clean_from_file);

-	if (!apply_filter(path, path, "", 0, -1, dst, ca.drv->clean_from_file))
+	if (!apply_file_filter(path, "", 0, dst, ca.drv->clean_from_file))
 		die("%s: cleanFromFile filter '%s' failed", path, ca.drv->name);

 	crlf_to_git(path, dst->buf, dst->len, dst, ca.crlf_action,
@@ -1040,7 +1140,10 @@ static int convert_to_working_tree_internal(const char *path,
 		}
 	}

-	ret_filter = apply_filter(path, destpath, src, len, -1, dst, filter);
+	if (destpath)
+		ret_filter = apply_file_filter(destpath, src, len, dst, filter);
+	else
+		ret_filter = apply_filter(path, src, len, -1, dst, filter);
 	if (!ret_filter && required)
 		die("%s: %s filter %s failed", path, destpath ? "smudgeToFile" : "smudge", ca.drv->name);

diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
index 2722013..1eafe67 100755
--- a/t/t0021-conversion.sh
+++ b/t/t0021-conversion.sh
@@ -14,19 +14,39 @@ chmod +x rot13.sh

 cat <<EOF >rot13-from-file.sh
 #!$SHELL_PATH
-srcfile="\$1"
-touch rot13-from-file.ran
-cat "\$srcfile" | ./rot13.sh
+while read LINE; do
+   srcfile="\$LINE"
+   echo "CLEAN \$srcfile" >> rot13-from-file.ran
+   cat "\$srcfile" | ./rot13.sh
+done
+
 EOF
 chmod +x rot13-from-file.sh

-cat <<EOF >rot13-to-file.sh
-#!$SHELL_PATH
-destfile="\$1"
-touch rot13-to-file.ran
-./rot13.sh > "\$destfile"
+# TODO: Is there a way to read X bytes from a pipe via shell?
+# I implemented the filter in Python as workaround. If Python
+# is an undesired dependency then I could reimplement it in Perl, too.
+cat <<EOF >rot13-to-file.py
+#!/usr/bin/env python
+import sys
+from subprocess import Popen, PIPE, STDOUT
+
+for data in iter(sys.stdin.readline, ''):
+	filename = data[:-1]
+	content_size = sys.stdin.readline()[:-1]
+	content = sys.stdin.read(int(content_size))
+
+	p = Popen(['./rot13.sh'], stdout=PIPE, stdin=PIPE, stderr=PIPE)
+	with open(filename, 'w') as f:
+		f.write(p.communicate(input=content)[0])
+
+	with open('rot13-to-file.ran', 'a') as f:
+		f.write('SMUDGE {} {}\n'.format(filename, content_size))
+
+	sys.stdout.write('OK')
+	sys.stdout.flush()
 EOF
-chmod +x rot13-to-file.sh
+chmod +x rot13-to-file.py

 cat <<EOF >delete-file-and-fail.sh
 #!$SHELL_PATH
@@ -293,28 +313,52 @@ test_expect_success 'disable filter with empty override' '
 '

 test_expect_success 'cleanFromFile filter is used when adding a file' '
+	{
+	    echo a
+	    echo bb
+	    echo ccc
+	} >test2 &&
+
+
 	test_config filter.rot13.cleanFromFile ./rot13-from-file.sh &&
+	test_config filter.rot13.required true &&

 	echo "*.t filter=rot13" >.gitattributes &&

 	cat test >fstest.t &&
-	git add fstest.t &&
-	test -e rot13-from-file.ran &&
+	cat test2 >fstest2.t &&
+	git add fstest.t fstest2.t &&
+
+	test_line_count = 2 rot13-from-file.ran &&
+	grep "CLEAN fstest.t" rot13-from-file.ran &&
+	grep "CLEAN fstest2.t" rot13-from-file.ran &&
 	rm -f rot13-from-file.ran &&

 	rm -f fstest.t &&
-	git checkout -- fstest.t &&
-	cmp test fstest.t
+	rm -f fstest2.t &&
+
+	git checkout . &&
+	cmp test fstest.t &&
+	cmp test2 fstest2.t
 '

 test_expect_success 'smudgeToFile filter is used when checking out a file' '
-	test_config filter.rot13.smudgeToFile ./rot13-to-file.sh &&
+	test_config filter.rot13.smudgeToFile ./rot13-to-file.py &&
+	test_config filter.rot13.required true &&

 	rm -f fstest.t &&
-	git checkout -- fstest.t &&
+	rm -f fstest2.t &&
+	git checkout . &&
+
+	# TODO: Why is this necessary?
+	sleep 0.1 &&
+
 	cmp test fstest.t &&
+	cmp test2 fstest2.t &&

-	test -e rot13-to-file.ran &&
+	test_line_count = 2 rot13-to-file.ran &&
+	grep "SMUDGE fstest.t 57" rot13-to-file.ran &&
+	grep "SMUDGE fstest2.t 9" rot13-to-file.ran &&
 	rm -f rot13-to-file.ran
 '

@@ -335,7 +379,7 @@ test_expect_success 'recovery from failure of smudgeToFile filter that deletes t
 '

 test_expect_success 'smudgeToFile filter is used in merge' '
-	test_config filter.rot13.smudgeToFile ./rot13-to-file.sh &&
+	test_config filter.rot13.smudgeToFile ./rot13-to-file.py &&

 	git commit -m "added fstest.t" fstest.t &&
 	git checkout -b old &&
@@ -350,7 +394,7 @@ test_expect_success 'smudgeToFile filter is used in merge' '
 '

 test_expect_success 'smudgeToFile filter is used by git am' '
-	test_config filter.rot13.smudgeToFile ./rot13-to-file.sh &&
+	test_config filter.rot13.smudgeToFile ./rot13-to-file.py &&

 	git format-patch HEAD^ --stdout > fstest.patch &&
 	git reset --hard HEAD^ &&
--
2.5.1


^ permalink raw reply related	[flat|nested] 3+ messages in thread