All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC v3] cat-file: add a --stdin-cmd mode
@ 2022-01-28 18:33 John Cai
  2022-01-31 11:44 ` Bagas Sanjaya
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: John Cai @ 2022-01-28 18:33 UTC (permalink / raw)
  To: git; +Cc: avarab, me, phillip.wood123, e, John Cai

This RFC patch proposes a new flag --batch-command that works with
git-cat-file --batch. Similar to git-update-ref --stdin, it will accept
commands and arguments from stdin.

The start of this idea was discussed in [1], where the original
motivation was to be able to control when the buffer was flushed to
stdout in --buffer mode.

However, this can actually be much more useful in situations when
git-cat-file --batch is being used as a long lived backend query
process. At GitLab, we use a pair of cat-file processes. One for
iterating over object metadata with --batch-check, and the other to grab
object contents with --batch. However, if we had --batch-command, we could
get rid of the second --batch-check process, and just have one process
where we can flip between getting object info, and getting object contents.
This can lead to huge savings since on a given server there could be hundreds to
thousands of git cat-file processes at a time.

git cat-file --batch-command

$ <command> [arg1] [arg2] NL

This patch adds three commands: object, info, fflush

$ object <sha1> NL
$ info <sha1> NL
$ fflush NL

These three would be immediately useful in GitLab's context, but one can
imagine this mode to be further extended for other things.

Future improvements:
- a non-trivial part of "cat-file --batch" time is spent
on parsing its argument and seeing if it's a revision, ref etc. So we
could add a command that only accepts a full-length 40
character SHA-1.

This would be the first step in adding such an interface to
git-cat-file.

[1] https://lore.kernel.org/git/pull.1124.git.git.1636149400.gitgitgadget@gmail.com/

Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: John Cai <johncai86@gmail.com>
---

Taylor, I'd be interested in your thoughts on this proposal since you helped
review the previous patch that turned into this RFC. Thanks!

Changes from v2:

- refactored tests to be within run_tests()
- added a test to test --buffer behavior with fflush
- cleaned up cat-file.c: clarified var names, removed unnecessary code
  based on suggestions from Phillip Wood
- removed strvec changes

Changes from v1:

- changed option name to batch-command.
- changed command function interface to receive the whole line after the command
  name to put the onus of parsing arguments to each individual command function.
- pass in whole line to batch_one_object in both parse_cmd_object and
  parse_cmd_info to support spaces in the object reference.
- removed addition of -z to include in a separate patch series
- added documentation.
---
 Documentation/git-cat-file.txt |  15 +++++
 builtin/cat-file.c             | 114 +++++++++++++++++++++++++++++++--
 t/t1006-cat-file.sh            |  83 ++++++++++++++++++++++++
 3 files changed, 205 insertions(+), 7 deletions(-)

diff --git a/Documentation/git-cat-file.txt b/Documentation/git-cat-file.txt
index bef76f4dd0..254e546c79 100644
--- a/Documentation/git-cat-file.txt
+++ b/Documentation/git-cat-file.txt
@@ -96,6 +96,21 @@ OPTIONS
 	need to specify the path, separated by whitespace.  See the
 	section `BATCH OUTPUT` below for details.
 
+--batch-command::
+	Enter a command mode that reads from stdin. May not be combined with any
+	other options or arguments except `--textconv` or `--filters`, in which
+	case the input lines also need to specify the path, separated by
+	whitespace.  See the section `BATCH OUTPUT` below for details.
+
+object <object>::
+	Print object contents for object reference <object>
+
+info <object>::
+	Print object info for object reference <object>
+
+flush::
+	Flush to stdout immediately when used with --buffer
+
 --batch-all-objects::
 	Instead of reading a list of objects on stdin, perform the
 	requested batch operation on all objects in the repository and
diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 7b3f42950e..cc9e47943b 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -24,9 +24,11 @@ struct batch_options {
 	int buffer_output;
 	int all_objects;
 	int unordered;
-	int cmdmode; /* may be 'w' or 'c' for --filters or --textconv */
+	int mode; /* may be 'w' or 'c' for --filters or --textconv */
 	const char *format;
+	int command;
 };
+static char line_termination = '\n';
 
 static const char *force_path;
 
@@ -302,19 +304,19 @@ static void print_object_or_die(struct batch_options *opt, struct expand_data *d
 	if (data->type == OBJ_BLOB) {
 		if (opt->buffer_output)
 			fflush(stdout);
-		if (opt->cmdmode) {
+		if (opt->mode) {
 			char *contents;
 			unsigned long size;
 
 			if (!data->rest)
 				die("missing path for '%s'", oid_to_hex(oid));
 
-			if (opt->cmdmode == 'w') {
+			if (opt->mode == 'w') {
 				if (filter_object(data->rest, 0100644, oid,
 						  &contents, &size))
 					die("could not convert '%s' %s",
 					    oid_to_hex(oid), data->rest);
-			} else if (opt->cmdmode == 'c') {
+			} else if (opt->mode == 'c') {
 				enum object_type type;
 				if (!textconv_object(the_repository,
 						     data->rest, 0100644, oid,
@@ -326,7 +328,7 @@ static void print_object_or_die(struct batch_options *opt, struct expand_data *d
 					die("could not convert '%s' %s",
 					    oid_to_hex(oid), data->rest);
 			} else
-				BUG("invalid cmdmode: %c", opt->cmdmode);
+				BUG("invalid mode: %c", opt->mode);
 			batch_write(opt, contents, size);
 			free(contents);
 		} else {
@@ -508,6 +510,95 @@ static int batch_unordered_packed(const struct object_id *oid,
 				      data);
 }
 
+static void parse_cmd_object(struct batch_options *opt,
+			     const char *line,
+			     struct strbuf *output,
+			     struct expand_data *data)
+{
+	opt->print_contents = 1;
+	batch_one_object(line, output, opt, data);
+}
+
+static void parse_cmd_info(struct batch_options *opt,
+			   const char *line,
+			   struct strbuf *output,
+			   struct expand_data *data)
+{
+	opt->print_contents = 0;
+	batch_one_object(line, output, opt, data);
+}
+
+static void parse_cmd_fflush(struct batch_options *opt,
+			     const char *line,
+			     struct strbuf *output,
+			     struct expand_data *data)
+{
+	fflush(stdout);
+}
+
+typedef void (*parse_cmd_fn_t)(struct batch_options *, const char *,
+			       struct strbuf *, struct expand_data *);
+
+static const struct parse_cmd {
+	const char *prefix;
+	parse_cmd_fn_t fn;
+	unsigned takes_args;
+} commands[] = {
+	{ "object", parse_cmd_object, 1},
+	{ "info", parse_cmd_info, 1},
+	{ "fflush", parse_cmd_fflush, 0},
+};
+
+static void batch_objects_command(struct batch_options *opt,
+				    struct strbuf *output,
+				    struct expand_data *data)
+{
+	struct strbuf input = STRBUF_INIT;
+
+	/* Read each line dispatch its command */
+	while (!strbuf_getwholeline(&input, stdin, line_termination)) {
+		int i;
+		const struct parse_cmd *cmd = NULL;
+		const char *p, *cmd_end;
+
+		if (*input.buf == line_termination)
+			die("empty command in input");
+		else if (isspace(*input.buf))
+			die("whitespace before command: %s", input.buf);
+
+		for (i = 0; i < ARRAY_SIZE(commands); i++) {
+			const char *prefix = commands[i].prefix;
+			char c;
+			if (!skip_prefix(input.buf, prefix, &cmd_end))
+				continue;
+			/*
+			 * If the command has arguments, verify that it's
+			 * followed by a space. Otherwise, it shall be followed
+			 * by a line terminator.
+			 */
+			c = commands[i].takes_args ? ' ' : line_termination;
+			if (input.buf[strlen(prefix)] != c)
+				die("arguments invalid for command: %s", commands[i].prefix);
+
+			cmd = &commands[i];
+			if (cmd->takes_args) {
+				p = cmd_end + 1;
+				// strip newline before handing it to the
+				// handling function
+				input.buf[strcspn(input.buf, "\n")] = '\0';
+			}
+
+			break;
+		}
+
+		if (!cmd)
+			die("unknown command: %s", input.buf);
+
+		cmd->fn(opt, p, output, data);
+	}
+	strbuf_release(&input);
+}
+
 static int batch_objects(struct batch_options *opt)
 {
 	struct strbuf input = STRBUF_INIT;
@@ -515,6 +606,7 @@ static int batch_objects(struct batch_options *opt)
 	struct expand_data data;
 	int save_warning;
 	int retval = 0;
+	const int command = opt->command;
 
 	if (!opt->format)
 		opt->format = "%(objectname) %(objecttype) %(objectsize)";
@@ -529,7 +621,7 @@ static int batch_objects(struct batch_options *opt)
 	strbuf_expand(&output, opt->format, expand_format, &data);
 	data.mark_query = 0;
 	strbuf_release(&output);
-	if (opt->cmdmode)
+	if (opt->mode)
 		data.split_on_whitespace = 1;
 
 	/*
@@ -590,6 +682,9 @@ static int batch_objects(struct batch_options *opt)
 	save_warning = warn_on_object_refname_ambiguity;
 	warn_on_object_refname_ambiguity = 0;
 
+	if (command)
+		batch_objects_command(opt, &output, &data);
+
 	while (strbuf_getline(&input, stdin) != EOF) {
 		if (data.split_on_whitespace) {
 			/*
@@ -636,6 +731,7 @@ static int batch_option_callback(const struct option *opt,
 
 	bo->enabled = 1;
 	bo->print_contents = !strcmp(opt->long_name, "batch");
+	bo->command = !strcmp(opt->long_name, "batch-command");
 	bo->format = arg;
 
 	return 0;
@@ -683,6 +779,10 @@ int cmd_cat_file(int argc, const char **argv, const char *prefix)
 			N_("like --batch, but don't emit <contents>"),
 			PARSE_OPT_OPTARG | PARSE_OPT_NONEG,
 			batch_option_callback),
+		OPT_CALLBACK_F(0, "batch-command", &batch, N_(""),
+			 N_("enters batch mode that accepts commands"),
+			 PARSE_OPT_NOARG | PARSE_OPT_NONEG,
+			 batch_option_callback),
 		OPT_CMDMODE(0, "batch-all-objects", &opt,
 			    N_("with --batch[-check]: ignores stdin, batches all known objects"), 'b'),
 		/* Batch-specific options */
@@ -742,7 +842,7 @@ int cmd_cat_file(int argc, const char **argv, const char *prefix)
 	/* Return early if we're in batch mode? */
 	if (batch.enabled) {
 		if (opt_cw)
-			batch.cmdmode = opt;
+			batch.mode = opt;
 		else if (opt && opt != 'b')
 			usage_msg_optf(_("'-%c' is incompatible with batch mode"),
 				       usage, options, opt);
diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
index 145eee11df..92f0b14a95 100755
--- a/t/t1006-cat-file.sh
+++ b/t/t1006-cat-file.sh
@@ -112,6 +112,46 @@ maybe_remove_timestamp () {
     fi
 }
 
+run_buffer_test () {
+	type=$1
+	sha1=$2
+	size=$3
+	flush=$4
+
+	mkfifo V.input
+	exec 8<>V.input
+	rm V.input
+
+	mkfifo V.output
+	exec 9<>V.output
+	rm V.output
+	
+	(
+		git cat-file --buffer --batch-command <&8 >&9 &
+		echo $! >&9 &&
+		wait $!
+		echo >&9 EARLY_EXIT
+	) &
+	sh_pid=$!
+	read fi_pid <&9
+	test_when_finished "
+		exec 8>&-; exec 9>&-;
+		kill $sh_pid && wait $sh_pid
+		kill $fi_pid && wait $fi_pid
+		true"
+	expect=$(echo "$sha1 $type $size")
+	echo "info $sha1" >&8
+	if test "$flush" = "true"
+	then
+		echo "fflush" >&8
+	else
+		expect="EARLY_EXIT"
+		kill $fi_pid && wait $fi_pid
+	fi
+	read actual <&9
+	test "$actual" = "$expect"
+}
+
 run_tests () {
     type=$1
     sha1=$2
@@ -177,6 +217,18 @@ $content"
 	test_cmp expect actual
     '
 
+    test -z "$content" ||
+    test_expect_success "--batch-command output of $type content is correct" '
+	maybe_remove_timestamp "$batch_output" $no_ts >expect &&
+	maybe_remove_timestamp "$(echo object $sha1 | git cat-file --batch-command)" $no_ts >actual &&
+	test_cmp expect actual
+    '
+
+    test_expect_success "batch-command output of $type info is correct" '
+	echo "$sha1 $type $size" >expect &&
+	echo "info $sha1" | git cat-file --batch-command >actual &&
+	test_cmp expect actual
+'
     test_expect_success "custom --batch-check format" '
 	echo "$type $sha1" >expect &&
 	echo $sha1 | git cat-file --batch-check="%(objecttype) %(objectname)" >actual &&
@@ -232,12 +284,29 @@ test_expect_success '--batch-check without %(rest) considers whole line' '
 	test_cmp expect actual
 '
 
+test_expect_success '--batch-command --buffer with flush is correct for blob' '
+	run_buffer_test 'blob' $hello_sha1 $hello_size true
+'
+
+test_expect_success '--batch-command --buffer without flush is correct for blob' '
+	run_buffer_test 'blob' $hello_sha1 $hello_size false
+'
+
 tree_sha1=$(git write-tree)
+
 tree_size=$(($(test_oid rawsz) + 13))
 tree_pretty_content="100644 blob $hello_sha1	hello"
 
 run_tests 'tree' $tree_sha1 $tree_size "" "$tree_pretty_content"
 
+test_expect_success '--batch-command --buffer with flush is correct for tree' '
+	run_buffer_test 'tree' $tree_sha1 $tree_size true
+'
+
+test_expect_success '--batch-command --buffer without flush is correct for tree' '
+	run_buffer_test 'tree' $tree_sha1 $tree_size false
+'
+
 commit_message="Initial commit"
 commit_sha1=$(echo_without_newline "$commit_message" | git commit-tree $tree_sha1)
 commit_size=$(($(test_oid hexsz) + 137))
@@ -263,6 +332,14 @@ tag_size=$(strlen "$tag_content")
 
 run_tests 'tag' $tag_sha1 $tag_size "$tag_content" "$tag_content" 1
 
+test_expect_success '--batch-command --buffer with flush is correct for tag' '
+	run_buffer_test 'tag' $tag_sha1 $tag_size true
+'
+
+test_expect_success '--batch-command --buffer without flush is correct for tag' '
+	run_buffer_test 'tag' $tag_sha1 $tag_size false
+'
+
 test_expect_success \
     "Reach a blob from a tag pointing to it" \
     "test '$hello_content' = \"\$(git cat-file blob $tag_sha1)\""
@@ -964,4 +1041,10 @@ test_expect_success 'cat-file --batch-all-objects --batch-check ignores replace'
 	test_cmp expect actual
 '
 
+test_expect_success 'batch-command unknown command' '
+	echo unknown_command >cmd &&
+	test_expect_code 128 git cat-file --batch-command < cmd 2>err &&
+	grep -E "^fatal:.*unknown command.*" err
+'
+
 test_done
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [RFC v3] cat-file: add a --stdin-cmd mode
  2022-01-28 18:33 [RFC v3] cat-file: add a --stdin-cmd mode John Cai
@ 2022-01-31 11:44 ` Bagas Sanjaya
  2022-01-31 18:24   ` Junio C Hamano
  2022-02-01  9:39 ` Christian Couder
  2022-02-01 10:43 ` Phillip Wood
  2 siblings, 1 reply; 12+ messages in thread
From: Bagas Sanjaya @ 2022-01-31 11:44 UTC (permalink / raw)
  To: John Cai, git; +Cc: avarab, me, phillip.wood123, e

On 29/01/22 01.33, John Cai wrote:
> Future improvements:
> - a non-trivial part of "cat-file --batch" time is spent
> on parsing its argument and seeing if it's a revision, ref etc. So we
> could add a command that only accepts a full-length 40
> character SHA-1.

I think the full hash is actually revision name.

-- 
An old man doll... just what I always wanted! - Clara

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC v3] cat-file: add a --stdin-cmd mode
  2022-01-31 11:44 ` Bagas Sanjaya
@ 2022-01-31 18:24   ` Junio C Hamano
  2022-02-01  9:48     ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 12+ messages in thread
From: Junio C Hamano @ 2022-01-31 18:24 UTC (permalink / raw)
  To: Bagas Sanjaya; +Cc: John Cai, git, avarab, me, phillip.wood123, e

Bagas Sanjaya <bagasdotme@gmail.com> writes:

> On 29/01/22 01.33, John Cai wrote:
>> Future improvements:
>> - a non-trivial part of "cat-file --batch" time is spent
>> on parsing its argument and seeing if it's a revision, ref etc. So we
>> could add a command that only accepts a full-length 40
>> character SHA-1.
>
> I think the full hash is actually revision name.

There is no entry for "revision name" in Documentation/glossary-content.txt
;-)

But to John, if you have a loop that feedseach line to get_oid(), 

	while (getline(buf)) {
		struct object_id oid;
		if (get_oid(buf, &oid))
			warn and continue;
		use oid;
	}

is it much slower than a mode that can ONLY handle a full object
name input, i.e.

	while (getline(buf)) {
		struct object_id oid;
		if (get_oid_hex(buf, &oid))
			warn and continue;
		use oid;
	}

when your input is restricted to full object names?

get_oid() == repo_get_oid()
-> get_oid_with_context()
   -> get_oid_with_context_1()
      -> get_oid_1()
	 -> peel_onion()
	 -> get_oid_basic()
	    -> get_oid_hex()
	    -> repo_dwim_ref()

it seems that warn_ambiguous_refs and warn_on_object_refname_ambiguity
we would waste time on refname discovery but I see cat-file already
has some provision to disable this check.  So when we do not need to
call repo_dwim_ref(), do we still spend measurable cycles in this
callchain?






^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC v3] cat-file: add a --stdin-cmd mode
  2022-01-28 18:33 [RFC v3] cat-file: add a --stdin-cmd mode John Cai
  2022-01-31 11:44 ` Bagas Sanjaya
@ 2022-02-01  9:39 ` Christian Couder
  2022-02-01 17:52   ` Taylor Blau
  2022-02-02 13:11   ` John Cai
  2022-02-01 10:43 ` Phillip Wood
  2 siblings, 2 replies; 12+ messages in thread
From: Christian Couder @ 2022-02-01  9:39 UTC (permalink / raw)
  To: John Cai
  Cc: git, Ævar Arnfjörð Bjarmason, Taylor Blau,
	Phillip Wood, Eric Wong

On Mon, Jan 31, 2022 at 9:34 PM John Cai <johncai86@gmail.com> wrote:
>
> This RFC patch proposes a new flag --batch-command that works with
> git-cat-file --batch.

The subject is "Re: [RFC v3] cat-file: add a --stdin-cmd mode" and now
you are talking about '--batch-command' instead of '--stdin-cmd'.

"that works with git-cat-file --batch" is not very clear. Maybe you
could find a wording that explains better how --batch-command is
different from --batch.

Also I think at this point this should probably not be an RFC patch
anymore but a regular one.

> Similar to git-update-ref --stdin, it will accept
> commands and arguments from stdin.
>
> The start of this idea was discussed in [1], where the original
> motivation was to be able to control when the buffer was flushed to
> stdout in --buffer mode.

That would be nice in a cover letter but I am not sure a commit
message is the right place for this.

> However, this can actually be much more useful in situations when
> git-cat-file --batch is being used as a long lived backend query
> process. At GitLab, we use a pair of cat-file processes. One for
> iterating over object metadata with --batch-check, and the other to grab
> object contents with --batch. However, if we had --batch-command, we could
> get rid of the second --batch-check process,

Maybe s/second// would make it clear that there are no two
--batch-command processes.

> and just have one process
> where we can flip between getting object info, and getting object contents.
> This can lead to huge savings since on a given server there could be hundreds to
> thousands of git cat-file processes at a time.

It's not clear if all the git cat-file processes you are talking about
are mostly --batch-check processes or --batch processes, or a roughly
equal amount of both. My guess is the latter and that --batch-command
would mean that there would be around two times fewer cat-file
processes.

> git cat-file --batch-command
>
> $ <command> [arg1] [arg2] NL

It's a bit unclear what the 2 above lines mean. Maybe you could add a
small explanation like for example "The new flag can be used like
this:" and "It receives commands from stdin in the format:"

Also not sure why there is a '$' char in front of '<command> [arg1]
[arg2] NL' but not in front of 'git cat-file --batch-command'. It
doesn't look like in the 'git update-ref --stdin' doc that '$' are
used in front of the commands that can be passed through stdin.

> This patch adds three commands: object, info, fflush

Maybe s/three commands/the following first three commands/

> $ object <sha1> NL
> $ info <sha1> NL
> $ fflush NL

Idem about '$'.

> These three would be immediately useful in GitLab's context, but one can
> imagine this mode to be further extended for other things.

Not very clear which "mode" you are talking about. You have been
talking about a mode only in the subject so far. Maybe you should talk
a bit about that above when '<command> [arg1] [arg2] NL' is
introduced.

Also you don't talk about the output format. --batch and --batch-check
accept [=<format>], but it looks like --batch-command doesn't.

> Future improvements:
> - a non-trivial part of "cat-file --batch" time is spent
> on parsing its argument and seeing if it's a revision, ref etc. So we
> could add a command that only accepts a full-length 40
> character SHA-1.

In a cover letter that would be ok, but I am not sure that a commit
message is the best place for this kind of details about future work.

> This would be the first step in adding such an interface to
> git-cat-file.
>
> [1] https://lore.kernel.org/git/pull.1124.git.git.1636149400.gitgitgadget@gmail.com/
>
> Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
> Signed-off-by: John Cai <johncai86@gmail.com>
> ---
>
> Taylor, I'd be interested in your thoughts on this proposal since you helped
> review the previous patch that turned into this RFC. Thanks!
>
> Changes from v2:
>
> - refactored tests to be within run_tests()
> - added a test to test --buffer behavior with fflush
> - cleaned up cat-file.c: clarified var names, removed unnecessary code
>   based on suggestions from Phillip Wood
> - removed strvec changes
>
> Changes from v1:
>
> - changed option name to batch-command.
> - changed command function interface to receive the whole line after the command
>   name to put the onus of parsing arguments to each individual command function.
> - pass in whole line to batch_one_object in both parse_cmd_object and
>   parse_cmd_info to support spaces in the object reference.
> - removed addition of -z to include in a separate patch series
> - added documentation.
> ---
>  Documentation/git-cat-file.txt |  15 +++++
>  builtin/cat-file.c             | 114 +++++++++++++++++++++++++++++++--
>  t/t1006-cat-file.sh            |  83 ++++++++++++++++++++++++
>  3 files changed, 205 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/git-cat-file.txt b/Documentation/git-cat-file.txt
> index bef76f4dd0..254e546c79 100644
> --- a/Documentation/git-cat-file.txt
> +++ b/Documentation/git-cat-file.txt
> @@ -96,6 +96,21 @@ OPTIONS
>         need to specify the path, separated by whitespace.  See the
>         section `BATCH OUTPUT` below for details.
>
> +--batch-command::
> +       Enter a command mode that reads from stdin.

Maybe s/a command mode that reads from stdin/a mode that reads
commands from stdin/

Also I would expect something about the output, like perhaps "...and
ouputs the command results to stdout".

> May not be combined with any
> +       other options or arguments except `--textconv` or `--filters`, in which
> +       case the input lines also need to specify the path, separated by
> +       whitespace.  See the section `BATCH OUTPUT` below for details.

The BATCH OUTPUT section says that a format can be passed but that
doesn't seem to be the case with --batch-command. So you might need to
make some changes to that section too or add a bit more details about
the output here.

> +object <object>::
> +       Print object contents for object reference <object>
> +
> +info <object>::
> +       Print object info for object reference <object>
> +
> +flush::
> +       Flush to stdout immediately when used with --buffer
> +
>  --batch-all-objects::
>         Instead of reading a list of objects on stdin, perform the
>         requested batch operation on all objects in the repository and
> diff --git a/builtin/cat-file.c b/builtin/cat-file.c
> index 7b3f42950e..cc9e47943b 100644
> --- a/builtin/cat-file.c
> +++ b/builtin/cat-file.c
> @@ -24,9 +24,11 @@ struct batch_options {
>         int buffer_output;
>         int all_objects;
>         int unordered;
> -       int cmdmode; /* may be 'w' or 'c' for --filters or --textconv */
> +       int mode; /* may be 'w' or 'c' for --filters or --textconv */
>         const char *format;
> +       int command;
>  };

Maybe add a blank line here.

> +static char line_termination = '\n';
>
>  static const char *force_path;
>
> @@ -302,19 +304,19 @@ static void print_object_or_die(struct batch_options *opt, struct expand_data *d
>         if (data->type == OBJ_BLOB) {
>                 if (opt->buffer_output)
>                         fflush(stdout);
> -               if (opt->cmdmode) {
> +               if (opt->mode) {

The mechanical s/cmdmode/mode/g change could have been made in a
preparatory patch to make this patch a bit smaller and easier to
digest.

> +static void batch_objects_command(struct batch_options *opt,
> +                                   struct strbuf *output,
> +                                   struct expand_data *data)
> +{
> +       struct strbuf input = STRBUF_INIT;
> +
> +       /* Read each line dispatch its command */
> +       while (!strbuf_getwholeline(&input, stdin, line_termination)) {
> +               int i;
> +               const struct parse_cmd *cmd = NULL;
> +               const char *p, *cmd_end;
> +
> +               if (*input.buf == line_termination)
> +                       die("empty command in input");
> +               else if (isspace(*input.buf))
> +                       die("whitespace before command: %s", input.buf);
> +
> +               for (i = 0; i < ARRAY_SIZE(commands); i++) {
> +                       const char *prefix = commands[i].prefix;
> +                       char c;
> +                       if (!skip_prefix(input.buf, prefix, &cmd_end))
> +                               continue;
> +                       /*
> +                        * If the command has arguments, verify that it's
> +                        * followed by a space. Otherwise, it shall be followed
> +                        * by a line terminator.
> +                        */
> +                       c = commands[i].takes_args ? ' ' : line_termination;
> +                       if (input.buf[strlen(prefix)] != c)
> +                               die("arguments invalid for command: %s", commands[i].prefix);
> +
> +                       cmd = &commands[i];
> +                       if (cmd->takes_args) {
> +                               p = cmd_end + 1;
> +                               // strip newline before handing it to the
> +                               // handling function

So above the /* */ comments delimiters are used but here // is used. I
am not sure we support // these days, but if we do, I think it would
be better to avoid mixing comment styles in the same function.

> +                               input.buf[strcspn(input.buf, "\n")] = '\0';
> +                       }
> +
> +                       break;
> +               }
> +
> +               if (!cmd)
> +                       die("unknown command: %s", input.buf);
> +
> +               cmd->fn(opt, p, output, data);
> +       }
> +       strbuf_release(&input);
> +}

> @@ -590,6 +682,9 @@ static int batch_objects(struct batch_options *opt)
>         save_warning = warn_on_object_refname_ambiguity;
>         warn_on_object_refname_ambiguity = 0;
>
> +       if (command)
> +               batch_objects_command(opt, &output, &data);
> +
>         while (strbuf_getline(&input, stdin) != EOF) {

I think batch_objects_command() will consume everything from stdin, so
it doesn't make sense to try to read again from stdin after it. Maybe
the whole while (...) { ... } clause should be inside an else clause
or something.

>                 if (data.split_on_whitespace) {
>                         /*
> @@ -636,6 +731,7 @@ static int batch_option_callback(const struct option *opt,
>
>         bo->enabled = 1;
>         bo->print_contents = !strcmp(opt->long_name, "batch");
> +       bo->command = !strcmp(opt->long_name, "batch-command");
>         bo->format = arg;
>
>         return 0;

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC v3] cat-file: add a --stdin-cmd mode
  2022-01-31 18:24   ` Junio C Hamano
@ 2022-02-01  9:48     ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 12+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-02-01  9:48 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Bagas Sanjaya, John Cai, git, me, phillip.wood123, e


On Mon, Jan 31 2022, Junio C Hamano wrote:

> Bagas Sanjaya <bagasdotme@gmail.com> writes:
>
>> On 29/01/22 01.33, John Cai wrote:
>>> Future improvements:
>>> - a non-trivial part of "cat-file --batch" time is spent
>>> on parsing its argument and seeing if it's a revision, ref etc. So we
>>> could add a command that only accepts a full-length 40
>>> character SHA-1.
>>
>> I think the full hash is actually revision name.
>
> There is no entry for "revision name" in Documentation/glossary-content.txt
> ;-)
>
> But to John, if you have a loop that feedseach line to get_oid(), 
>
> 	while (getline(buf)) {
> 		struct object_id oid;
> 		if (get_oid(buf, &oid))
> 			warn and continue;
> 		use oid;
> 	}
>
> is it much slower than a mode that can ONLY handle a full object
> name input, i.e.
>
> 	while (getline(buf)) {
> 		struct object_id oid;
> 		if (get_oid_hex(buf, &oid))
> 			warn and continue;
> 		use oid;
> 	}
>
> when your input is restricted to full object names?
>
> get_oid() == repo_get_oid()
> -> get_oid_with_context()
>    -> get_oid_with_context_1()
>       -> get_oid_1()
> 	 -> peel_onion()
> 	 -> get_oid_basic()
> 	    -> get_oid_hex()
> 	    -> repo_dwim_ref()
>
> it seems that warn_ambiguous_refs and warn_on_object_refname_ambiguity
> we would waste time on refname discovery but I see cat-file already
> has some provision to disable this check.  So when we do not need to
> call repo_dwim_ref(), do we still spend measurable cycles in this
> callchain?

For what it's worth I think this claim that we spend a non-trivial
amount of time on the difference between these two comes from me
originally. I'd had a chat with John about various things to try out in
such a "cat-file --batch" mode, and this was one of those things.

I tried instrumenting the relevant code in builtin/cat-file.c the other
(but forgot to reply to this thread), and whatever I'd found there at
the time (this was weeks/months ago) I couldn't reproduce.

So there's probably nothing worthwhile to check out here, i.e. the
trivial cost of get_oid_with_context() is probably nothing to worry
about.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC v3] cat-file: add a --stdin-cmd mode
  2022-01-28 18:33 [RFC v3] cat-file: add a --stdin-cmd mode John Cai
  2022-01-31 11:44 ` Bagas Sanjaya
  2022-02-01  9:39 ` Christian Couder
@ 2022-02-01 10:43 ` Phillip Wood
  2022-02-02 15:05   ` John Cai
  2 siblings, 1 reply; 12+ messages in thread
From: Phillip Wood @ 2022-02-01 10:43 UTC (permalink / raw)
  To: John Cai, git; +Cc: avarab, me, e

Hi John

On 28/01/2022 18:33, John Cai wrote:
> This RFC patch proposes a new flag --batch-command that works with
> git-cat-file --batch. Similar to git-update-ref --stdin, it will accept
> commands and arguments from stdin.
> 
> The start of this idea was discussed in [1], where the original
> motivation was to be able to control when the buffer was flushed to
> stdout in --buffer mode.
> 
> However, this can actually be much more useful in situations when
> git-cat-file --batch is being used as a long lived backend query
> process. At GitLab, we use a pair of cat-file processes. One for
> iterating over object metadata with --batch-check, and the other to grab
> object contents with --batch. However, if we had --batch-command, we could
> get rid of the second --batch-check process, and just have one process
> where we can flip between getting object info, and getting object contents.
> This can lead to huge savings since on a given server there could be hundreds to
> thousands of git cat-file processes at a time.
> 
> git cat-file --batch-command
> 
> $ <command> [arg1] [arg2] NL
> 
> This patch adds three commands: object, info, fflush
> 
> $ object <sha1> NL
> $ info <sha1> NL
> $ fflush NL
> 
> These three would be immediately useful in GitLab's context, but one can
> imagine this mode to be further extended for other things.

I think this is moving in the right direction, I've left some comments 
below.

> Future improvements:
> - a non-trivial part of "cat-file --batch" time is spent
> on parsing its argument and seeing if it's a revision, ref etc. So we
> could add a command that only accepts a full-length 40
> character SHA-1.
> 
> This would be the first step in adding such an interface to
> git-cat-file.
> 
> [1] https://lore.kernel.org/git/pull.1124.git.git.1636149400.gitgitgadget@gmail.com/
> 
> Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
> Signed-off-by: John Cai <johncai86@gmail.com>
> ---
> 
> Taylor, I'd be interested in your thoughts on this proposal since you helped
> review the previous patch that turned into this RFC. Thanks!
> 
> Changes from v2:
> 
> - refactored tests to be within run_tests()
> - added a test to test --buffer behavior with fflush
> - cleaned up cat-file.c: clarified var names, removed unnecessary code
>    based on suggestions from Phillip Wood
> - removed strvec changes
> 
> Changes from v1:
> 
> - changed option name to batch-command.
> - changed command function interface to receive the whole line after the command
>    name to put the onus of parsing arguments to each individual command function.
> - pass in whole line to batch_one_object in both parse_cmd_object and
>    parse_cmd_info to support spaces in the object reference.
> - removed addition of -z to include in a separate patch series
> - added documentation.
> ---
>   Documentation/git-cat-file.txt |  15 +++++
>   builtin/cat-file.c             | 114 +++++++++++++++++++++++++++++++--
>   t/t1006-cat-file.sh            |  83 ++++++++++++++++++++++++
>   3 files changed, 205 insertions(+), 7 deletions(-)
> 
> diff --git a/Documentation/git-cat-file.txt b/Documentation/git-cat-file.txt
> index bef76f4dd0..254e546c79 100644
> --- a/Documentation/git-cat-file.txt
> +++ b/Documentation/git-cat-file.txt
> @@ -96,6 +96,21 @@ OPTIONS
>   	need to specify the path, separated by whitespace.  See the
>   	section `BATCH OUTPUT` below for details.
>   
> +--batch-command::
> +	Enter a command mode that reads from stdin. May not be combined with any
> +	other options or arguments except `--textconv` or `--filters`, in which
> +	case the input lines also need to specify the path, separated by
> +	whitespace.  See the section `BATCH OUTPUT` below for details.
> +
> +object <object>::
> +	Print object contents for object reference <object>
> +
> +info <object>::
> +	Print object info for object reference <object>
> +
> +flush::

I think this is a better name than fflush as calling fflush is just an 
implementation detail. I've been thinking about this and have some 
concerns with the current implementation as it allows the caller to 
force a flush but the output may be flushed at other times and so it 
does not allow for deadlock free reading and writing to cat-file from a 
single process. If instead we buffered the input lines and only 
processed then when we received a flush command it would be easy to read 
and write to cat-file from a single process without deadlocks. This 
would mean that callers would have to add flush commands to get any 
output before they closed the pipe writing to cat-file.

> +	Flush to stdout immediately when used with --buffer
> +
>   --batch-all-objects::
>   	Instead of reading a list of objects on stdin, perform the
>   	requested batch operation on all objects in the repository and
> diff --git a/builtin/cat-file.c b/builtin/cat-file.c
> index 7b3f42950e..cc9e47943b 100644
> --- a/builtin/cat-file.c
> +++ b/builtin/cat-file.c
> @@ -24,9 +24,11 @@ struct batch_options {
>   	int buffer_output;
>   	int all_objects;
>   	int unordered;
> -	int cmdmode; /* may be 'w' or 'c' for --filters or --textconv */
> +	int mode; /* may be 'w' or 'c' for --filters or --textconv */

If you really need to rename  this variable then doing that in a 
separate preparatory patch would make this easier to review as it 
separates out two separate sets of changes.

>   	const char *format;
> +	int command;

Can't we just call this batch_command and leave cmdmode alone? I don't 
know maybe cmdmode would be confused with having something to do with 
batch_command then.

>   };
> +static char line_termination = '\n';

This seems unnecessary as it is the only permitted line ending

>   static const char *force_path;
>   
> @@ -302,19 +304,19 @@ static void print_object_or_die(struct batch_options *opt, struct expand_data *d
>   	if (data->type == OBJ_BLOB) {
>   		if (opt->buffer_output)
>   			fflush(stdout);
> -		if (opt->cmdmode) {
> +		if (opt->mode) {
>   			char *contents;
>   			unsigned long size;
>   
>   			if (!data->rest)
>   				die("missing path for '%s'", oid_to_hex(oid));
>   
> -			if (opt->cmdmode == 'w') {
> +			if (opt->mode == 'w') {
>   				if (filter_object(data->rest, 0100644, oid,
>   						  &contents, &size))
>   					die("could not convert '%s' %s",
>   					    oid_to_hex(oid), data->rest);
> -			} else if (opt->cmdmode == 'c') {
> +			} else if (opt->mode == 'c') {
>   				enum object_type type;
>   				if (!textconv_object(the_repository,
>   						     data->rest, 0100644, oid,
> @@ -326,7 +328,7 @@ static void print_object_or_die(struct batch_options *opt, struct expand_data *d
>   					die("could not convert '%s' %s",
>   					    oid_to_hex(oid), data->rest);
>   			} else
> -				BUG("invalid cmdmode: %c", opt->cmdmode);
> +				BUG("invalid mode: %c", opt->mode);
>   			batch_write(opt, contents, size);
>   			free(contents);
>   		} else {
> @@ -508,6 +510,95 @@ static int batch_unordered_packed(const struct object_id *oid,
>   				      data);
>   }
>   
> +static void parse_cmd_object(struct batch_options *opt,
> +			     const char *line,
> +			     struct strbuf *output,
> +			     struct expand_data *data)
> +{
> +	opt->print_contents = 1;
> +	batch_one_object(line, output, opt, data);
> +}
> +
> +static void parse_cmd_info(struct batch_options *opt,
> +			   const char *line,
> +			   struct strbuf *output,
> +			   struct expand_data *data)
> +{
> +	opt->print_contents = 0;
> +	batch_one_object(line, output, opt, data);
> +}
> +
> +static void parse_cmd_fflush(struct batch_options *opt,
> +			     const char *line,
> +			     struct strbuf *output,
> +			     struct expand_data *data)
> +{
> +	fflush(stdout);
> +}
> +
> +typedef void (*parse_cmd_fn_t)(struct batch_options *, const char *,
> +			       struct strbuf *, struct expand_data *);
> +
> +static const struct parse_cmd {
> +	const char *prefix;
> +	parse_cmd_fn_t fn;
> +	unsigned takes_args;
> +} commands[] = {
> +	{ "object", parse_cmd_object, 1},
> +	{ "info", parse_cmd_info, 1},
> +	{ "fflush", parse_cmd_fflush, 0},
> +};
> +
> +static void batch_objects_command(struct batch_options *opt,
> +				    struct strbuf *output,
> +				    struct expand_data *data)
> +{
> +	struct strbuf input = STRBUF_INIT;
> +
> +	/* Read each line dispatch its command */
> +	while (!strbuf_getwholeline(&input, stdin, line_termination)) {
> +		int i;
> +		const struct parse_cmd *cmd = NULL;
> +		const char *p, *cmd_end;
> +
> +		if (*input.buf == line_termination)
> +			die("empty command in input");
> +		else if (isspace(*input.buf))
> +			die("whitespace before command: %s", input.buf);
> +
> +		for (i = 0; i < ARRAY_SIZE(commands); i++) {
> +			const char *prefix = commands[i].prefix;
> +			char c;
> +			if (!skip_prefix(input.buf, prefix, &cmd_end))
> +				continue;
> +			/*
> +			 * If the command has arguments, verify that it's
> +			 * followed by a space. Otherwise, it shall be followed
> +			 * by a line terminator.
> +			 */
> +			c = commands[i].takes_args ? ' ' : line_termination;
> +			if (input.buf[strlen(prefix)] != c)

As I pointed out in the last round you can use cmd_end here rather than 
calling strlen. I've just noticed that this will fail for a command that 
does not take arguments when the last input line does not end with '\n'. 
I think
	if (*cmd_end && *cmd_end != c)
would be safer

> +				die("arguments invalid for command: %s", commands[i].prefix);
> +
> +			cmd = &commands[i];
> +			if (cmd->takes_args) {
> +				p = cmd_end + 1;
> +				// strip newline before handing it to the
> +				// handling function
> +				input.buf[strcspn(input.buf, "\n")] = '\0';

When I suggested replacing strstr() last time I failed to notice that we 
  know that if there is a newline character it is at input.buf[len - 1] 
so we don't need to scan the whole string to find it.

	if (input.buf[input.len - 1] == '\n')
		input.buf[--buf.len] = '\0';

(feel free to change it to avoid the prefix operator)

> +			}
> +
> +			break;
> +		}
> +
> +		if (!cmd)
> +			die("unknown command: %s", input.buf);
> +
> +		cmd->fn(opt, p, output, data);
> +	}
> +	strbuf_release(&input);
> +}
> +
>   static int batch_objects(struct batch_options *opt)
>   {
>   	struct strbuf input = STRBUF_INIT;
> @@ -515,6 +606,7 @@ static int batch_objects(struct batch_options *opt)
>   	struct expand_data data;
>   	int save_warning;
>   	int retval = 0;
> +	const int command = opt->command;
>   
>   	if (!opt->format)
>   		opt->format = "%(objectname) %(objecttype) %(objectsize)";
> @@ -529,7 +621,7 @@ static int batch_objects(struct batch_options *opt)
>   	strbuf_expand(&output, opt->format, expand_format, &data);
>   	data.mark_query = 0;
>   	strbuf_release(&output);
> -	if (opt->cmdmode)
> +	if (opt->mode)
>   		data.split_on_whitespace = 1;
>   
>   	/*
> @@ -590,6 +682,9 @@ static int batch_objects(struct batch_options *opt)
>   	save_warning = warn_on_object_refname_ambiguity;
>   	warn_on_object_refname_ambiguity = 0;
>   
> +	if (command)
> +		batch_objects_command(opt, &output, &data);
> +

Moving this is good as we no longer need to touch the line below

>   	while (strbuf_getline(&input, stdin) != EOF) {
>   		if (data.split_on_whitespace) {
>   			/*
> @@ -636,6 +731,7 @@ static int batch_option_callback(const struct option *opt,
>   
>   	bo->enabled = 1;
>   	bo->print_contents = !strcmp(opt->long_name, "batch");
> +	bo->command = !strcmp(opt->long_name, "batch-command");
>   	bo->format = arg;
>   
>   	return 0;
> @@ -683,6 +779,10 @@ int cmd_cat_file(int argc, const char **argv, const char *prefix)
>   			N_("like --batch, but don't emit <contents>"),
>   			PARSE_OPT_OPTARG | PARSE_OPT_NONEG,
>   			batch_option_callback),
> +		OPT_CALLBACK_F(0, "batch-command", &batch, N_(""),
> +			 N_("enters batch mode that accepts commands"),
> +			 PARSE_OPT_NOARG | PARSE_OPT_NONEG,
> +			 batch_option_callback),
>   		OPT_CMDMODE(0, "batch-all-objects", &opt,
>   			    N_("with --batch[-check]: ignores stdin, batches all known objects"), 'b'),
>   		/* Batch-specific options */
> @@ -742,7 +842,7 @@ int cmd_cat_file(int argc, const char **argv, const char *prefix)
>   	/* Return early if we're in batch mode? */
>   	if (batch.enabled) {
>   		if (opt_cw)
> -			batch.cmdmode = opt;
> +			batch.mode = opt;
>   		else if (opt && opt != 'b')
>   			usage_msg_optf(_("'-%c' is incompatible with batch mode"),
>   				       usage, options, opt);
> diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
> index 145eee11df..92f0b14a95 100755
> --- a/t/t1006-cat-file.sh
> +++ b/t/t1006-cat-file.sh
> @@ -112,6 +112,46 @@ maybe_remove_timestamp () {
>       fi
>   }

It's really good that you've found a way to test fflush

> +run_buffer_test () {
> +	type=$1
> +	sha1=$2
> +	size=$3
> +	flush=$4
> +
> +	mkfifo V.input
> +	exec 8<>V.input
> +	rm V.input
> +
> +	mkfifo V.output
> +	exec 9<>V.output
> +	rm V.output
> +	
> +	(
> +		git cat-file --buffer --batch-command <&8 >&9 &
> +		echo $! >&9 &&
> +		wait $!
> +		echo >&9 EARLY_EXIT
> +	) &
> +	sh_pid=$!
> +	read fi_pid <&9 > +	test_when_finished "
> +		exec 8>&-; exec 9>&-;
> +		kill $sh_pid && wait $sh_pid
> +		kill $fi_pid && wait $fi_pid
> +		true"
> +	expect=$(echo "$sha1 $type $size")
> +	echo "info $sha1" >&8
> +	if test "$flush" = "true"
> +	then
> +		echo "fflush" >&8

If I change "fflush" to "bad-command" then this test still passes. This 
is because die() flushes stdout and the read below will get the output 
it was expecting and not 'EARLY_EXIT'. I have a suggested change below 
to get around that also checks the exit code of "git cat-file" (This 
test effectively puts a git command upstream of a pipe which we try to 
avoid in order to pick up any crashes)

> +	else
> +		expect="EARLY_EXIT"
> +		kill $fi_pid && wait $fi_pid
> +	fi
> +	read actual <&9
> +	test "$actual" = "$expect"
> +}
> +
>   run_tests () {
>       type=$1
>       sha1=$2
> @@ -177,6 +217,18 @@ $content"
>   	test_cmp expect actual
>       '

Here is my suggestion based on your test - I ended up splitting the 
fflush and no-fflush cases

run_buffer_test_flush () {
	type=$1
	sha1=$2
	size=$3

	mkfifo input &&
	test_when_finished 'rm input'
	mkfifo output &&
	exec 9<>output &&
	test_when_finished 'rm output; exec 9<&-'
	(
	    	# TODO - Ideally we'd pipe the output of cat-file
	    	# through "sed s'/$/\\/'" to make sure that that read
	    	# would consume all the available
	    	# output. Unfortunately we cannot do this as we cannot
	    	# control when sed flushes its output. We could write
	    	# a test helper in C that appended a '\' to the end of
		# each line and flushes its output after every line.
		git cat-file --buffer --batch-command <input 2>err &
		echo $! &&
		wait $!
		echo $?
	) >&9 &
	sh_pid=$! &&
	read cat_file_pid <&9 &&
	test_when_finished "kill $cat_file_pid
			    kill $sh_pid; wait $sh_pid; :" &&
	(
		test_write_lines "info $sha1" fflush "info $sha1" &&
		# TODO - consume all available input, not just one
		# line (see above).
		read actual <&9 &&
		echo "$actual" >actual &&
		echo "$sha1 $type $size" >expect &&
		test_cmp expect actual
	) >input &&
	# check output is flushed on exit
	read actual <&9 &&
	echo "$actual" >actual &&
	test_cmp expect actual &&
	test_must_be_empty err &&
	read status <&9 &&
	test "$status" -eq 0
}

run_buffer_test_no_flush () {
	type=$1
	sha1=$2
	size=$3

	mkfifo input &&
	test_when_finished 'rm input'
	mkfifo pid &&
	exec 9<>pid &&
	test_when_finished 'rm pid; exec 9<&-'
	(
		git cat-file --buffer --batch-command <input >output &
		echo $! &&
		wait $!
		echo $?
	) >&9 &
	sh_pid=$! &&
	read cat_file_pid <&9 &&
	test_when_finished "kill $cat_file_pid
			    kill $sh_pid; wait $sh_pid; :" &&
	(
		test_write_lines "info $sha1" "info $sha1" &&
		kill $cat_file_pid &&
		read status <&9 &&
		test "$status" -ne 0 &&
		test_must_be_empty output
	) >input
}

Best Wishes

Phillip

> +    test -z "$content" ||
> +    test_expect_success "--batch-command output of $type content is correct" '
> +	maybe_remove_timestamp "$batch_output" $no_ts >expect &&
> +	maybe_remove_timestamp "$(echo object $sha1 | git cat-file --batch-command)" $no_ts >actual &&
> +	test_cmp expect actual
> +    '
> +
> +    test_expect_success "batch-command output of $type info is correct" '
> +	echo "$sha1 $type $size" >expect &&
> +	echo "info $sha1" | git cat-file --batch-command >actual &&
> +	test_cmp expect actual
> +'
>       test_expect_success "custom --batch-check format" '
>   	echo "$type $sha1" >expect &&
>   	echo $sha1 | git cat-file --batch-check="%(objecttype) %(objectname)" >actual &&
> @@ -232,12 +284,29 @@ test_expect_success '--batch-check without %(rest) considers whole line' '
>   	test_cmp expect actual
>   '
>   
> +test_expect_success '--batch-command --buffer with flush is correct for blob' '
> +	run_buffer_test 'blob' $hello_sha1 $hello_size true
> +'
> +
> +test_expect_success '--batch-command --buffer without flush is correct for blob' '
> +	run_buffer_test 'blob' $hello_sha1 $hello_size false
> +'
> +
>   tree_sha1=$(git write-tree)
> +
>   tree_size=$(($(test_oid rawsz) + 13))
>   tree_pretty_content="100644 blob $hello_sha1	hello"
>   
>   run_tests 'tree' $tree_sha1 $tree_size "" "$tree_pretty_content"
>   
> +test_expect_success '--batch-command --buffer with flush is correct for tree' '
> +	run_buffer_test 'tree' $tree_sha1 $tree_size true
> +'
> +
> +test_expect_success '--batch-command --buffer without flush is correct for tree' '
> +	run_buffer_test 'tree' $tree_sha1 $tree_size false
> +'
> +
>   commit_message="Initial commit"
>   commit_sha1=$(echo_without_newline "$commit_message" | git commit-tree $tree_sha1)
>   commit_size=$(($(test_oid hexsz) + 137))
> @@ -263,6 +332,14 @@ tag_size=$(strlen "$tag_content")
>   
>   run_tests 'tag' $tag_sha1 $tag_size "$tag_content" "$tag_content" 1
>   
> +test_expect_success '--batch-command --buffer with flush is correct for tag' '
> +	run_buffer_test 'tag' $tag_sha1 $tag_size true
> +'
> +
> +test_expect_success '--batch-command --buffer without flush is correct for tag' '
> +	run_buffer_test 'tag' $tag_sha1 $tag_size false
> +'
> +
>   test_expect_success \
>       "Reach a blob from a tag pointing to it" \
>       "test '$hello_content' = \"\$(git cat-file blob $tag_sha1)\""
> @@ -964,4 +1041,10 @@ test_expect_success 'cat-file --batch-all-objects --batch-check ignores replace'
>   	test_cmp expect actual
>   '
>   
> +test_expect_success 'batch-command unknown command' '
> +	echo unknown_command >cmd &&
> +	test_expect_code 128 git cat-file --batch-command < cmd 2>err &&
> +	grep -E "^fatal:.*unknown command.*" err
> +'
> +
>   test_done


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC v3] cat-file: add a --stdin-cmd mode
  2022-02-01  9:39 ` Christian Couder
@ 2022-02-01 17:52   ` Taylor Blau
  2022-02-01 19:27     ` John Cai
  2022-02-02 13:11   ` John Cai
  1 sibling, 1 reply; 12+ messages in thread
From: Taylor Blau @ 2022-02-01 17:52 UTC (permalink / raw)
  To: Christian Couder
  Cc: John Cai, git, Ævar Arnfjörð Bjarmason,
	Taylor Blau, Phillip Wood, Eric Wong

On Tue, Feb 01, 2022 at 10:39:30AM +0100, Christian Couder wrote:
> Also I think at this point this should probably not be an RFC patch
> anymore but a regular one.

I think that this is due to my first "review" in [1], where I tried to
get an understanding of what John's concrete usage plans were, since I
couldn't figure out what they were on my own.

I'm not sure that I've seen a response along the lines of "we need to
control when the output stream is flushed in order to do ..." yet, but I
would be interested to see one before moving too much further ahead of
where we already are.

(Apologies if such a response was written, and I missed it).

Thanks,
Taylor

[1]: https://lore.kernel.org/git/YehomwNiIs0l83W7@nand.local/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC v3] cat-file: add a --stdin-cmd mode
  2022-02-01 17:52   ` Taylor Blau
@ 2022-02-01 19:27     ` John Cai
  2022-02-01 20:14       ` Taylor Blau
  0 siblings, 1 reply; 12+ messages in thread
From: John Cai @ 2022-02-01 19:27 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Christian Couder, git, Ævar Arnfjörð Bjarmason,
	Phillip Wood, Eric Wong

Hi Taylor,

On 1 Feb 2022, at 12:52, Taylor Blau wrote:

> On Tue, Feb 01, 2022 at 10:39:30AM +0100, Christian Couder wrote:
>> Also I think at this point this should probably not be an RFC patch
>> anymore but a regular one.
>
> I think that this is due to my first "review" in [1], where I tried to
> get an understanding of what John's concrete usage plans were, since I
> couldn't figure out what they were on my own.
>
> I'm not sure that I've seen a response along the lines of "we need to
> control when the output stream is flushed in order to do ..." yet, but I
> would be interested to see one before moving too much further ahead of
> where we already are.

This would be useful when there is another process A interacting with a long
running git cat-file process B that is retrieving object information from the odb
interactively but also wants to use --buffer mode.

In this scenario, if A is asked to retrieve a large list of object metadata, it wants to
use --buffer mode to be more efficient but it would need a way to ensure that all of the
contents have been flushed to the output. If we want to keep B running to save startup time
(since otherwise we might need to spawn many git cat-file processes), then having
a flush command where we can guarantee a flush would be very handy.

>
> (Apologies if such a response was written, and I missed it).

Nope, don't think I explained the need for a flush command very clearly.

thanks
>
> Thanks,
> Taylor
>
> [1]: https://lore.kernel.org/git/YehomwNiIs0l83W7@nand.local/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC v3] cat-file: add a --stdin-cmd mode
  2022-02-01 19:27     ` John Cai
@ 2022-02-01 20:14       ` Taylor Blau
       [not found]         ` <3FE1D509-8AD0-4F0E-9298-DFD3552A98EF@gmail.com>
  0 siblings, 1 reply; 12+ messages in thread
From: Taylor Blau @ 2022-02-01 20:14 UTC (permalink / raw)
  To: John Cai
  Cc: Taylor Blau, Christian Couder, git,
	Ævar Arnfjörð Bjarmason, Phillip Wood, Eric Wong

On Tue, Feb 01, 2022 at 02:27:54PM -0500, John Cai wrote:
> On 1 Feb 2022, at 12:52, Taylor Blau wrote:
> > I'm not sure that I've seen a response along the lines of "we need to
> > control when the output stream is flushed in order to do ..." yet, but I
> > would be interested to see one before moving too much further ahead of
> > where we already are.
>
> This would be useful when there is another process A interacting with
> a long running git cat-file process B that is retrieving object
> information from the odb interactively but also wants to use --buffer
> mode.

Let me try and repeat my understanding of what you said to make sure
that I fully grok the use-case you have in mind.

You have a repository and want to have a long-running `git cat-file`
process that can serve multiple requests. Because the processes which
interact with your long-running `cat-file` may ask for many objects, you
don't want to flush the output buffer after each object, and so would
ideally like to use `--buffer`.

But that doesn't quite work, since the `cat-file` process may not have
decided to flush its output buffer even when process A is about to go
away.

I wonder about the viability of accomplishing this via a signal handler,
i.e., that `cat-file` would call fflush(2) whenever it receives e.g.,
SIGUSR1. A couple of possible downsides:

  - SIGUSR1 doesn't exist on Windows AFAIK.

  - There are definitely going to be synchrony issues to contend with.
    What happens if we receive our signal while writing to the output
    stream? I think you would just need to mark a variable that
    indicates we should flush after finishing serving the current
    request, but I haven't thought too hard about it.

So maybe a signal isn't the way to go. But I don't think `--stdin-cmd`
is the simplest approach either. At the very least, I don't totally
understand your plan after implementing a flush command. You mention
that it would be nice to implement other commands, but I'm not totally
convinced by your examples[1].

I wonder if we could strike a middle ground, which might look like `git
cat-file --batch --buffer`, and just feeding it something which we know
for certain isn't an object identifier. In other words, what if we
did something as simple as:

--- >8 ---

diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index d94050e6c1..bae162fc18 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -595,6 +595,11 @@ static int batch_objects(struct batch_options *opt)
 	warn_on_object_refname_ambiguity = 0;

 	while (strbuf_getline(&input, stdin) != EOF) {
+		if (!strcmp("<flush>", input.buf)) {
+			fflush(stdout);
+			continue;
+		}
+
 		if (data.split_on_whitespace) {
 			/*
 			 * Split at first whitespace, tying off the beginning

--- 8< ---

On the other hand, something even hackier than the above is that we
flush stdout whenever we get a request to print an object which could
not be found. So if you feed a single "\n" to your `cat-file` process,
you'll get " missing" on its output, and the buffer will immediately be
flushed.

I'm not sure that I'd recommend relying on that behavior exactly, but if
you're looking for a short-term solution, it might work ;).

Thanks,
Taylor

[1]: One that comes to mind is changing the output format mid-stream.
     But how often does it really make sense to change the output
     format? I can understand wanting to flush at the end asking
     cat-file for a bunch of objects, but I don't see how you would want
     to change the output format often enough that shaving off Git's
     negligible startup cost is worthwhile (or couldn't be accomplished
     by just spawning another cat-file process and using that).

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [RFC v3] cat-file: add a --stdin-cmd mode
       [not found]         ` <3FE1D509-8AD0-4F0E-9298-DFD3552A98EF@gmail.com>
@ 2022-02-02  1:45           ` Taylor Blau
  0 siblings, 0 replies; 12+ messages in thread
From: Taylor Blau @ 2022-02-02  1:45 UTC (permalink / raw)
  To: John Cai
  Cc: Taylor Blau, Christian Couder, git,
	Ævar Arnfjörð Bjarmason, Phillip Wood, Eric Wong

On Tue, Feb 01, 2022 at 08:34:06PM -0500, John Cai wrote:
> > So maybe a signal isn't the way to go. But I don't think `--stdin-cmd`
> > is the simplest approach either. At the very least, I don't totally
> > understand your plan after implementing a flush command. You mention
> > that it would be nice to implement other commands, but I'm not totally
> > convinced by your examples[1].
>
> I agree that if flush were the only use case for a new flag, it might not
> be worth it. But, the flush command is only one of the use cases that a
> --stdin-cmd (now changed to --batch-command per Phillip's suggestion)
>
> There are two other commands in this RFC
>
> object <rev>
> info <rev>
>
> These are described in [1] that are also a motivation for a command
> interface.
> The description in [1] explains why a command interface would be necessary.

This seems like a more realistic and well-motivated proposal, IMHO. I am
a little curious that having the ability to switch between asking for an
object's contents and its metadata would lead to saving thousands of
processes per server.

(Apropos of this series, I am curious: how long does GitLab keep a pair
of cat-file programs running for each repository?)

In any case, I'll take a closer look over the aforementioned version and
let you know what my thoughts are, probably tomorrow.

> 1. https://lore.kernel.org/git/20220128183319.43496-1-johncai86@gmail.com/

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC v3] cat-file: add a --stdin-cmd mode
  2022-02-01  9:39 ` Christian Couder
  2022-02-01 17:52   ` Taylor Blau
@ 2022-02-02 13:11   ` John Cai
  1 sibling, 0 replies; 12+ messages in thread
From: John Cai @ 2022-02-02 13:11 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Ævar Arnfjörð Bjarmason, Taylor Blau,
	Phillip Wood, Eric Wong

Hi Christian

On 1 Feb 2022, at 4:39, Christian Couder wrote:

> On Mon, Jan 31, 2022 at 9:34 PM John Cai <johncai86@gmail.com> wrote:
>>
>> This RFC patch proposes a new flag --batch-command that works with
>> git-cat-file --batch.
>
> The subject is "Re: [RFC v3] cat-file: add a --stdin-cmd mode" and now
> you are talking about '--batch-command' instead of '--stdin-cmd'.
>
> "that works with git-cat-file --batch" is not very clear. Maybe you
> could find a wording that explains better how --batch-command is
> different from --batch.
>
> Also I think at this point this should probably not be an RFC patch
> anymore but a regular one.
>
>> Similar to git-update-ref --stdin, it will accept
>> commands and arguments from stdin.
>>
>> The start of this idea was discussed in [1], where the original
>> motivation was to be able to control when the buffer was flushed to
>> stdout in --buffer mode.
>
> That would be nice in a cover letter but I am not sure a commit
> message is the right place for this.
>
>> However, this can actually be much more useful in situations when
>> git-cat-file --batch is being used as a long lived backend query
>> process. At GitLab, we use a pair of cat-file processes. One for
>> iterating over object metadata with --batch-check, and the other to grab
>> object contents with --batch. However, if we had --batch-command, we could
>> get rid of the second --batch-check process,
>
> Maybe s/second// would make it clear that there are no two
> --batch-command processes.
>
>> and just have one process
>> where we can flip between getting object info, and getting object contents.
>> This can lead to huge savings since on a given server there could be hundreds to
>> thousands of git cat-file processes at a time.
>
> It's not clear if all the git cat-file processes you are talking about
> are mostly --batch-check processes or --batch processes, or a roughly
> equal amount of both. My guess is the latter and that --batch-command
> would mean that there would be around two times fewer cat-file
> processes.
>
>> git cat-file --batch-command
>>
>> $ <command> [arg1] [arg2] NL
>
> It's a bit unclear what the 2 above lines mean. Maybe you could add a
> small explanation like for example "The new flag can be used like
> this:" and "It receives commands from stdin in the format:"
>
> Also not sure why there is a '$' char in front of '<command> [arg1]
> [arg2] NL' but not in front of 'git cat-file --batch-command'. It
> doesn't look like in the 'git update-ref --stdin' doc that '$' are
> used in front of the commands that can be passed through stdin.
>
>> This patch adds three commands: object, info, fflush
>
> Maybe s/three commands/the following first three commands/
>
>> $ object <sha1> NL
>> $ info <sha1> NL
>> $ fflush NL
>
> Idem about '$'.
>
>> These three would be immediately useful in GitLab's context, but one can
>> imagine this mode to be further extended for other things.
>
> Not very clear which "mode" you are talking about. You have been
> talking about a mode only in the subject so far. Maybe you should talk
> a bit about that above when '<command> [arg1] [arg2] NL' is
> introduced.
>
> Also you don't talk about the output format. --batch and --batch-check
> accept [=<format>], but it looks like --batch-command doesn't.
>
>> Future improvements:
>> - a non-trivial part of "cat-file --batch" time is spent
>> on parsing its argument and seeing if it's a revision, ref etc. So we
>> could add a command that only accepts a full-length 40
>> character SHA-1.
>
> In a cover letter that would be ok, but I am not sure that a commit
> message is the best place for this kind of details about future work.
>
>> This would be the first step in adding such an interface to
>> git-cat-file.
>>
>> [1] https://lore.kernel.org/git/pull.1124.git.git.1636149400.gitgitgadget@gmail.com/
>>
>> Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
>> Signed-off-by: John Cai <johncai86@gmail.com>
>> ---
>>
>> Taylor, I'd be interested in your thoughts on this proposal since you helped
>> review the previous patch that turned into this RFC. Thanks!
>>
>> Changes from v2:
>>
>> - refactored tests to be within run_tests()
>> - added a test to test --buffer behavior with fflush
>> - cleaned up cat-file.c: clarified var names, removed unnecessary code
>>   based on suggestions from Phillip Wood
>> - removed strvec changes
>>
>> Changes from v1:
>>
>> - changed option name to batch-command.
>> - changed command function interface to receive the whole line after the command
>>   name to put the onus of parsing arguments to each individual command function.
>> - pass in whole line to batch_one_object in both parse_cmd_object and
>>   parse_cmd_info to support spaces in the object reference.
>> - removed addition of -z to include in a separate patch series
>> - added documentation.
>> ---
>>  Documentation/git-cat-file.txt |  15 +++++
>>  builtin/cat-file.c             | 114 +++++++++++++++++++++++++++++++--
>>  t/t1006-cat-file.sh            |  83 ++++++++++++++++++++++++
>>  3 files changed, 205 insertions(+), 7 deletions(-)
>>
>> diff --git a/Documentation/git-cat-file.txt b/Documentation/git-cat-file.txt
>> index bef76f4dd0..254e546c79 100644
>> --- a/Documentation/git-cat-file.txt
>> +++ b/Documentation/git-cat-file.txt
>> @@ -96,6 +96,21 @@ OPTIONS
>>         need to specify the path, separated by whitespace.  See the
>>         section `BATCH OUTPUT` below for details.
>>
>> +--batch-command::
>> +       Enter a command mode that reads from stdin.
>
> Maybe s/a command mode that reads from stdin/a mode that reads
> commands from stdin/
>
> Also I would expect something about the output, like perhaps "...and
> ouputs the command results to stdout".
>
>> May not be combined with any
>> +       other options or arguments except `--textconv` or `--filters`, in which
>> +       case the input lines also need to specify the path, separated by
>> +       whitespace.  See the section `BATCH OUTPUT` below for details.
>
> The BATCH OUTPUT section says that a format can be passed but that
> doesn't seem to be the case with --batch-command. So you might need to
> make some changes to that section too or add a bit more details about
> the output here.

Thinking about this more, I wonder if it'd be worth it to allow the --batch-command=<format>
for backwards compatibility reasons for users who switch over to using --batch-command
from --batch and --batch-check.

The only thing that makes me hesitant is that the <format> would only be relevant to
the "info" and "object" commands instead of being relevant to all commands in --batch-command
mode.

>
>> +object <object>::
>> +       Print object contents for object reference <object>
>> +
>> +info <object>::
>> +       Print object info for object reference <object>
>> +
>> +flush::
>> +       Flush to stdout immediately when used with --buffer
>> +
>>  --batch-all-objects::
>>         Instead of reading a list of objects on stdin, perform the
>>         requested batch operation on all objects in the repository and
>> diff --git a/builtin/cat-file.c b/builtin/cat-file.c
>> index 7b3f42950e..cc9e47943b 100644
>> --- a/builtin/cat-file.c
>> +++ b/builtin/cat-file.c
>> @@ -24,9 +24,11 @@ struct batch_options {
>>         int buffer_output;
>>         int all_objects;
>>         int unordered;
>> -       int cmdmode; /* may be 'w' or 'c' for --filters or --textconv */
>> +       int mode; /* may be 'w' or 'c' for --filters or --textconv */
>>         const char *format;
>> +       int command;
>>  };
>
> Maybe add a blank line here.
>
>> +static char line_termination = '\n';
>>
>>  static const char *force_path;
>>
>> @@ -302,19 +304,19 @@ static void print_object_or_die(struct batch_options *opt, struct expand_data *d
>>         if (data->type == OBJ_BLOB) {
>>                 if (opt->buffer_output)
>>                         fflush(stdout);
>> -               if (opt->cmdmode) {
>> +               if (opt->mode) {
>
> The mechanical s/cmdmode/mode/g change could have been made in a
> preparatory patch to make this patch a bit smaller and easier to
> digest.
>
>> +static void batch_objects_command(struct batch_options *opt,
>> +                                   struct strbuf *output,
>> +                                   struct expand_data *data)
>> +{
>> +       struct strbuf input = STRBUF_INIT;
>> +
>> +       /* Read each line dispatch its command */
>> +       while (!strbuf_getwholeline(&input, stdin, line_termination)) {
>> +               int i;
>> +               const struct parse_cmd *cmd = NULL;
>> +               const char *p, *cmd_end;
>> +
>> +               if (*input.buf == line_termination)
>> +                       die("empty command in input");
>> +               else if (isspace(*input.buf))
>> +                       die("whitespace before command: %s", input.buf);
>> +
>> +               for (i = 0; i < ARRAY_SIZE(commands); i++) {
>> +                       const char *prefix = commands[i].prefix;
>> +                       char c;
>> +                       if (!skip_prefix(input.buf, prefix, &cmd_end))
>> +                               continue;
>> +                       /*
>> +                        * If the command has arguments, verify that it's
>> +                        * followed by a space. Otherwise, it shall be followed
>> +                        * by a line terminator.
>> +                        */
>> +                       c = commands[i].takes_args ? ' ' : line_termination;
>> +                       if (input.buf[strlen(prefix)] != c)
>> +                               die("arguments invalid for command: %s", commands[i].prefix);
>> +
>> +                       cmd = &commands[i];
>> +                       if (cmd->takes_args) {
>> +                               p = cmd_end + 1;
>> +                               // strip newline before handing it to the
>> +                               // handling function
>
> So above the /* */ comments delimiters are used but here // is used. I
> am not sure we support // these days, but if we do, I think it would
> be better to avoid mixing comment styles in the same function.
>
>> +                               input.buf[strcspn(input.buf, "\n")] = '\0';
>> +                       }
>> +
>> +                       break;
>> +               }
>> +
>> +               if (!cmd)
>> +                       die("unknown command: %s", input.buf);
>> +
>> +               cmd->fn(opt, p, output, data);
>> +       }
>> +       strbuf_release(&input);
>> +}
>
>> @@ -590,6 +682,9 @@ static int batch_objects(struct batch_options *opt)
>>         save_warning = warn_on_object_refname_ambiguity;
>>         warn_on_object_refname_ambiguity = 0;
>>
>> +       if (command)
>> +               batch_objects_command(opt, &output, &data);
>> +
>>         while (strbuf_getline(&input, stdin) != EOF) {
>
> I think batch_objects_command() will consume everything from stdin, so
> it doesn't make sense to try to read again from stdin after it. Maybe
> the whole while (...) { ... } clause should be inside an else clause
> or something.
>
>>                 if (data.split_on_whitespace) {
>>                         /*
>> @@ -636,6 +731,7 @@ static int batch_option_callback(const struct option *opt,
>>
>>         bo->enabled = 1;
>>         bo->print_contents = !strcmp(opt->long_name, "batch");
>> +       bo->command = !strcmp(opt->long_name, "batch-command");
>>         bo->format = arg;
>>
>>         return 0;

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC v3] cat-file: add a --stdin-cmd mode
  2022-02-01 10:43 ` Phillip Wood
@ 2022-02-02 15:05   ` John Cai
  0 siblings, 0 replies; 12+ messages in thread
From: John Cai @ 2022-02-02 15:05 UTC (permalink / raw)
  To: phillip.wood; +Cc: git, avarab, me, e

Hi Phillip

On 1 Feb 2022, at 5:43, Phillip Wood wrote:

> Hi John
>
> On 28/01/2022 18:33, John Cai wrote:
>> This RFC patch proposes a new flag --batch-command that works with
>> git-cat-file --batch. Similar to git-update-ref --stdin, it will accept
>> commands and arguments from stdin.
>>
>> The start of this idea was discussed in [1], where the original
>> motivation was to be able to control when the buffer was flushed to
>> stdout in --buffer mode.
>>
>> However, this can actually be much more useful in situations when
>> git-cat-file --batch is being used as a long lived backend query
>> process. At GitLab, we use a pair of cat-file processes. One for
>> iterating over object metadata with --batch-check, and the other to grab
>> object contents with --batch. However, if we had --batch-command, we could
>> get rid of the second --batch-check process, and just have one process
>> where we can flip between getting object info, and getting object contents.
>> This can lead to huge savings since on a given server there could be hundreds to
>> thousands of git cat-file processes at a time.
>>
>> git cat-file --batch-command
>>
>> $ <command> [arg1] [arg2] NL
>>
>> This patch adds three commands: object, info, fflush
>>
>> $ object <sha1> NL
>> $ info <sha1> NL
>> $ fflush NL
>>
>> These three would be immediately useful in GitLab's context, but one can
>> imagine this mode to be further extended for other things.
>
> I think this is moving in the right direction, I've left some comments below.
>
>> Future improvements:
>> - a non-trivial part of "cat-file --batch" time is spent
>> on parsing its argument and seeing if it's a revision, ref etc. So we
>> could add a command that only accepts a full-length 40
>> character SHA-1.
>>
>> This would be the first step in adding such an interface to
>> git-cat-file.
>>
>> [1] https://lore.kernel.org/git/pull.1124.git.git.1636149400.gitgitgadget@gmail.com/
>>
>> Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
>> Signed-off-by: John Cai <johncai86@gmail.com>
>> ---
>>
>> Taylor, I'd be interested in your thoughts on this proposal since you helped
>> review the previous patch that turned into this RFC. Thanks!
>>
>> Changes from v2:
>>
>> - refactored tests to be within run_tests()
>> - added a test to test --buffer behavior with fflush
>> - cleaned up cat-file.c: clarified var names, removed unnecessary code
>>    based on suggestions from Phillip Wood
>> - removed strvec changes
>>
>> Changes from v1:
>>
>> - changed option name to batch-command.
>> - changed command function interface to receive the whole line after the command
>>    name to put the onus of parsing arguments to each individual command function.
>> - pass in whole line to batch_one_object in both parse_cmd_object and
>>    parse_cmd_info to support spaces in the object reference.
>> - removed addition of -z to include in a separate patch series
>> - added documentation.
>> ---
>>   Documentation/git-cat-file.txt |  15 +++++
>>   builtin/cat-file.c             | 114 +++++++++++++++++++++++++++++++--
>>   t/t1006-cat-file.sh            |  83 ++++++++++++++++++++++++
>>   3 files changed, 205 insertions(+), 7 deletions(-)
>>
>> diff --git a/Documentation/git-cat-file.txt b/Documentation/git-cat-file.txt
>> index bef76f4dd0..254e546c79 100644
>> --- a/Documentation/git-cat-file.txt
>> +++ b/Documentation/git-cat-file.txt
>> @@ -96,6 +96,21 @@ OPTIONS
>>   	need to specify the path, separated by whitespace.  See the
>>   	section `BATCH OUTPUT` below for details.
>>  +--batch-command::
>> +	Enter a command mode that reads from stdin. May not be combined with any
>> +	other options or arguments except `--textconv` or `--filters`, in which
>> +	case the input lines also need to specify the path, separated by
>> +	whitespace.  See the section `BATCH OUTPUT` below for details.
>> +
>> +object <object>::
>> +	Print object contents for object reference <object>
>> +
>> +info <object>::
>> +	Print object info for object reference <object>
>> +
>> +flush::
>
> I think this is a better name than fflush as calling fflush is just an implementation detail. I've been thinking about this and have some concerns with the current implementation as it allows the caller to force a flush but the output may be flushed at other times and so it does not allow for deadlock free reading and writing to cat-file from a single process. If instead we buffered the input lines and only processed then when we received a flush command it would be easy to read and write to cat-file from a single process without deadlocks. This would mean that callers would have to add flush commands to get any output before they closed the pipe writing to cat-file.

To address this concern, I wonder if instead of a flush command what we could do is to have --batch-command support "read transactions" similar to update-ref. eg:

git cat-file --batch-command

begin info
<sha1>
<sha1>
<sha1>
<sha1>
commit

We could queue up these references, and then get them all once "commit" is entered and call fflush().

>
>> +	Flush to stdout immediately when used with --buffer
>> +
>>   --batch-all-objects::
>>   	Instead of reading a list of objects on stdin, perform the
>>   	requested batch operation on all objects in the repository and
>> diff --git a/builtin/cat-file.c b/builtin/cat-file.c
>> index 7b3f42950e..cc9e47943b 100644
>> --- a/builtin/cat-file.c
>> +++ b/builtin/cat-file.c
>> @@ -24,9 +24,11 @@ struct batch_options {
>>   	int buffer_output;
>>   	int all_objects;
>>   	int unordered;
>> -	int cmdmode; /* may be 'w' or 'c' for --filters or --textconv */
>> +	int mode; /* may be 'w' or 'c' for --filters or --textconv */
>
> If you really need to rename  this variable then doing that in a separate preparatory patch would make this easier to review as it separates out two separate sets of changes.
>
>>   	const char *format;
>> +	int command;
>
> Can't we just call this batch_command and leave cmdmode alone? I don't know maybe cmdmode would be confused with having something to do with batch_command then.
>
>>   };
>> +static char line_termination = '\n';
>
> This seems unnecessary as it is the only permitted line ending
>
>>   static const char *force_path;
>>  @@ -302,19 +304,19 @@ static void print_object_or_die(struct batch_options *opt, struct expand_data *d
>>   	if (data->type == OBJ_BLOB) {
>>   		if (opt->buffer_output)
>>   			fflush(stdout);
>> -		if (opt->cmdmode) {
>> +		if (opt->mode) {
>>   			char *contents;
>>   			unsigned long size;
>>    			if (!data->rest)
>>   				die("missing path for '%s'", oid_to_hex(oid));
>>  -			if (opt->cmdmode == 'w') {
>> +			if (opt->mode == 'w') {
>>   				if (filter_object(data->rest, 0100644, oid,
>>   						  &contents, &size))
>>   					die("could not convert '%s' %s",
>>   					    oid_to_hex(oid), data->rest);
>> -			} else if (opt->cmdmode == 'c') {
>> +			} else if (opt->mode == 'c') {
>>   				enum object_type type;
>>   				if (!textconv_object(the_repository,
>>   						     data->rest, 0100644, oid,
>> @@ -326,7 +328,7 @@ static void print_object_or_die(struct batch_options *opt, struct expand_data *d
>>   					die("could not convert '%s' %s",
>>   					    oid_to_hex(oid), data->rest);
>>   			} else
>> -				BUG("invalid cmdmode: %c", opt->cmdmode);
>> +				BUG("invalid mode: %c", opt->mode);
>>   			batch_write(opt, contents, size);
>>   			free(contents);
>>   		} else {
>> @@ -508,6 +510,95 @@ static int batch_unordered_packed(const struct object_id *oid,
>>   				      data);
>>   }
>>  +static void parse_cmd_object(struct batch_options *opt,
>> +			     const char *line,
>> +			     struct strbuf *output,
>> +			     struct expand_data *data)
>> +{
>> +	opt->print_contents = 1;
>> +	batch_one_object(line, output, opt, data);
>> +}
>> +
>> +static void parse_cmd_info(struct batch_options *opt,
>> +			   const char *line,
>> +			   struct strbuf *output,
>> +			   struct expand_data *data)
>> +{
>> +	opt->print_contents = 0;
>> +	batch_one_object(line, output, opt, data);
>> +}
>> +
>> +static void parse_cmd_fflush(struct batch_options *opt,
>> +			     const char *line,
>> +			     struct strbuf *output,
>> +			     struct expand_data *data)
>> +{
>> +	fflush(stdout);
>> +}
>> +
>> +typedef void (*parse_cmd_fn_t)(struct batch_options *, const char *,
>> +			       struct strbuf *, struct expand_data *);
>> +
>> +static const struct parse_cmd {
>> +	const char *prefix;
>> +	parse_cmd_fn_t fn;
>> +	unsigned takes_args;
>> +} commands[] = {
>> +	{ "object", parse_cmd_object, 1},
>> +	{ "info", parse_cmd_info, 1},
>> +	{ "fflush", parse_cmd_fflush, 0},
>> +};
>> +
>> +static void batch_objects_command(struct batch_options *opt,
>> +				    struct strbuf *output,
>> +				    struct expand_data *data)
>> +{
>> +	struct strbuf input = STRBUF_INIT;
>> +
>> +	/* Read each line dispatch its command */
>> +	while (!strbuf_getwholeline(&input, stdin, line_termination)) {
>> +		int i;
>> +		const struct parse_cmd *cmd = NULL;
>> +		const char *p, *cmd_end;
>> +
>> +		if (*input.buf == line_termination)
>> +			die("empty command in input");
>> +		else if (isspace(*input.buf))
>> +			die("whitespace before command: %s", input.buf);
>> +
>> +		for (i = 0; i < ARRAY_SIZE(commands); i++) {
>> +			const char *prefix = commands[i].prefix;
>> +			char c;
>> +			if (!skip_prefix(input.buf, prefix, &cmd_end))
>> +				continue;
>> +			/*
>> +			 * If the command has arguments, verify that it's
>> +			 * followed by a space. Otherwise, it shall be followed
>> +			 * by a line terminator.
>> +			 */
>> +			c = commands[i].takes_args ? ' ' : line_termination;
>> +			if (input.buf[strlen(prefix)] != c)
>
> As I pointed out in the last round you can use cmd_end here rather than calling strlen. I've just noticed that this will fail for a command that does not take arguments when the last input line does not end with '\n'. I think
>     if (*cmd_end && *cmd_end != c)
> would be safer
>
>> +				die("arguments invalid for command: %s", commands[i].prefix);
>> +
>> +			cmd = &commands[i];
>> +			if (cmd->takes_args) {
>> +				p = cmd_end + 1;
>> +				// strip newline before handing it to the
>> +				// handling function
>> +				input.buf[strcspn(input.buf, "\n")] = '\0';
>
> When I suggested replacing strstr() last time I failed to notice that we  know that if there is a newline character it is at input.buf[len - 1] so we don't need to scan the whole string to find it.
>
>     if (input.buf[input.len - 1] == '\n')
>     	input.buf[--buf.len] = '\0';
>
> (feel free to change it to avoid the prefix operator)
>
>> +			}
>> +
>> +			break;
>> +		}
>> +
>> +		if (!cmd)
>> +			die("unknown command: %s", input.buf);
>> +
>> +		cmd->fn(opt, p, output, data);
>> +	}
>> +	strbuf_release(&input);
>> +}
>> +
>>   static int batch_objects(struct batch_options *opt)
>>   {
>>   	struct strbuf input = STRBUF_INIT;
>> @@ -515,6 +606,7 @@ static int batch_objects(struct batch_options *opt)
>>   	struct expand_data data;
>>   	int save_warning;
>>   	int retval = 0;
>> +	const int command = opt->command;
>>    	if (!opt->format)
>>   		opt->format = "%(objectname) %(objecttype) %(objectsize)";
>> @@ -529,7 +621,7 @@ static int batch_objects(struct batch_options *opt)
>>   	strbuf_expand(&output, opt->format, expand_format, &data);
>>   	data.mark_query = 0;
>>   	strbuf_release(&output);
>> -	if (opt->cmdmode)
>> +	if (opt->mode)
>>   		data.split_on_whitespace = 1;
>>    	/*
>> @@ -590,6 +682,9 @@ static int batch_objects(struct batch_options *opt)
>>   	save_warning = warn_on_object_refname_ambiguity;
>>   	warn_on_object_refname_ambiguity = 0;
>>  +	if (command)
>> +		batch_objects_command(opt, &output, &data);
>> +
>
> Moving this is good as we no longer need to touch the line below
>
>>   	while (strbuf_getline(&input, stdin) != EOF) {
>>   		if (data.split_on_whitespace) {
>>   			/*
>> @@ -636,6 +731,7 @@ static int batch_option_callback(const struct option *opt,
>>    	bo->enabled = 1;
>>   	bo->print_contents = !strcmp(opt->long_name, "batch");
>> +	bo->command = !strcmp(opt->long_name, "batch-command");
>>   	bo->format = arg;
>>    	return 0;
>> @@ -683,6 +779,10 @@ int cmd_cat_file(int argc, const char **argv, const char *prefix)
>>   			N_("like --batch, but don't emit <contents>"),
>>   			PARSE_OPT_OPTARG | PARSE_OPT_NONEG,
>>   			batch_option_callback),
>> +		OPT_CALLBACK_F(0, "batch-command", &batch, N_(""),
>> +			 N_("enters batch mode that accepts commands"),
>> +			 PARSE_OPT_NOARG | PARSE_OPT_NONEG,
>> +			 batch_option_callback),
>>   		OPT_CMDMODE(0, "batch-all-objects", &opt,
>>   			    N_("with --batch[-check]: ignores stdin, batches all known objects"), 'b'),
>>   		/* Batch-specific options */
>> @@ -742,7 +842,7 @@ int cmd_cat_file(int argc, const char **argv, const char *prefix)
>>   	/* Return early if we're in batch mode? */
>>   	if (batch.enabled) {
>>   		if (opt_cw)
>> -			batch.cmdmode = opt;
>> +			batch.mode = opt;
>>   		else if (opt && opt != 'b')
>>   			usage_msg_optf(_("'-%c' is incompatible with batch mode"),
>>   				       usage, options, opt);
>> diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
>> index 145eee11df..92f0b14a95 100755
>> --- a/t/t1006-cat-file.sh
>> +++ b/t/t1006-cat-file.sh
>> @@ -112,6 +112,46 @@ maybe_remove_timestamp () {
>>       fi
>>   }
>
> It's really good that you've found a way to test fflush
>
>> +run_buffer_test () {
>> +	type=$1
>> +	sha1=$2
>> +	size=$3
>> +	flush=$4
>> +
>> +	mkfifo V.input
>> +	exec 8<>V.input
>> +	rm V.input
>> +
>> +	mkfifo V.output
>> +	exec 9<>V.output
>> +	rm V.output
>> +	
>> +	(
>> +		git cat-file --buffer --batch-command <&8 >&9 &
>> +		echo $! >&9 &&
>> +		wait $!
>> +		echo >&9 EARLY_EXIT
>> +	) &
>> +	sh_pid=$!
>> +	read fi_pid <&9 > +	test_when_finished "
>> +		exec 8>&-; exec 9>&-;
>> +		kill $sh_pid && wait $sh_pid
>> +		kill $fi_pid && wait $fi_pid
>> +		true"
>> +	expect=$(echo "$sha1 $type $size")
>> +	echo "info $sha1" >&8
>> +	if test "$flush" = "true"
>> +	then
>> +		echo "fflush" >&8
>
> If I change "fflush" to "bad-command" then this test still passes. This is because die() flushes stdout and the read below will get the output it was expecting and not 'EARLY_EXIT'. I have a suggested change below to get around that also checks the exit code of "git cat-file" (This test effectively puts a git command upstream of a pipe which we try to avoid in order to pick up any crashes)
>
>> +	else
>> +		expect="EARLY_EXIT"
>> +		kill $fi_pid && wait $fi_pid
>> +	fi
>> +	read actual <&9
>> +	test "$actual" = "$expect"
>> +}
>> +
>>   run_tests () {
>>       type=$1
>>       sha1=$2
>> @@ -177,6 +217,18 @@ $content"
>>   	test_cmp expect actual
>>       '
>
> Here is my suggestion based on your test - I ended up splitting the fflush and no-fflush cases
>
> run_buffer_test_flush () {
>     type=$1
>     sha1=$2
>     size=$3
>
>     mkfifo input &&
>     test_when_finished 'rm input'
>     mkfifo output &&
>     exec 9<>output &&
>     test_when_finished 'rm output; exec 9<&-'
>     (
>         	# TODO - Ideally we'd pipe the output of cat-file
>         	# through "sed s'/$/\\/'" to make sure that that read
>         	# would consume all the available
>         	# output. Unfortunately we cannot do this as we cannot
>         	# control when sed flushes its output. We could write
>         	# a test helper in C that appended a '\' to the end of
>     	# each line and flushes its output after every line.
>     	git cat-file --buffer --batch-command <input 2>err &
>     	echo $! &&
>     	wait $!
>     	echo $?
>     ) >&9 &
>     sh_pid=$! &&
>     read cat_file_pid <&9 &&
>     test_when_finished "kill $cat_file_pid
>     		    kill $sh_pid; wait $sh_pid; :" &&
>     (
>     	test_write_lines "info $sha1" fflush "info $sha1" &&
>     	# TODO - consume all available input, not just one
>     	# line (see above).
>     	read actual <&9 &&
>     	echo "$actual" >actual &&
>     	echo "$sha1 $type $size" >expect &&
>     	test_cmp expect actual
>     ) >input &&
>     # check output is flushed on exit
>     read actual <&9 &&
>     echo "$actual" >actual &&
>     test_cmp expect actual &&
>     test_must_be_empty err &&
>     read status <&9 &&
>     test "$status" -eq 0
> }
>
> run_buffer_test_no_flush () {
>     type=$1
>     sha1=$2
>     size=$3
>
>     mkfifo input &&
>     test_when_finished 'rm input'
>     mkfifo pid &&
>     exec 9<>pid &&
>     test_when_finished 'rm pid; exec 9<&-'
>     (
>     	git cat-file --buffer --batch-command <input >output &
>     	echo $! &&
>     	wait $!
>     	echo $?
>     ) >&9 &
>     sh_pid=$! &&
>     read cat_file_pid <&9 &&
>     test_when_finished "kill $cat_file_pid
>     		    kill $sh_pid; wait $sh_pid; :" &&
>     (
>     	test_write_lines "info $sha1" "info $sha1" &&
>     	kill $cat_file_pid &&
>     	read status <&9 &&
>     	test "$status" -ne 0 &&
>     	test_must_be_empty output
>     ) >input
> }
>
> Best Wishes
>
> Phillip
>
>> +    test -z "$content" ||
>> +    test_expect_success "--batch-command output of $type content is correct" '
>> +	maybe_remove_timestamp "$batch_output" $no_ts >expect &&
>> +	maybe_remove_timestamp "$(echo object $sha1 | git cat-file --batch-command)" $no_ts >actual &&
>> +	test_cmp expect actual
>> +    '
>> +
>> +    test_expect_success "batch-command output of $type info is correct" '
>> +	echo "$sha1 $type $size" >expect &&
>> +	echo "info $sha1" | git cat-file --batch-command >actual &&
>> +	test_cmp expect actual
>> +'
>>       test_expect_success "custom --batch-check format" '
>>   	echo "$type $sha1" >expect &&
>>   	echo $sha1 | git cat-file --batch-check="%(objecttype) %(objectname)" >actual &&
>> @@ -232,12 +284,29 @@ test_expect_success '--batch-check without %(rest) considers whole line' '
>>   	test_cmp expect actual
>>   '
>>  +test_expect_success '--batch-command --buffer with flush is correct for blob' '
>> +	run_buffer_test 'blob' $hello_sha1 $hello_size true
>> +'
>> +
>> +test_expect_success '--batch-command --buffer without flush is correct for blob' '
>> +	run_buffer_test 'blob' $hello_sha1 $hello_size false
>> +'
>> +
>>   tree_sha1=$(git write-tree)
>> +
>>   tree_size=$(($(test_oid rawsz) + 13))
>>   tree_pretty_content="100644 blob $hello_sha1	hello"
>>    run_tests 'tree' $tree_sha1 $tree_size "" "$tree_pretty_content"
>>  +test_expect_success '--batch-command --buffer with flush is correct for tree' '
>> +	run_buffer_test 'tree' $tree_sha1 $tree_size true
>> +'
>> +
>> +test_expect_success '--batch-command --buffer without flush is correct for tree' '
>> +	run_buffer_test 'tree' $tree_sha1 $tree_size false
>> +'
>> +
>>   commit_message="Initial commit"
>>   commit_sha1=$(echo_without_newline "$commit_message" | git commit-tree $tree_sha1)
>>   commit_size=$(($(test_oid hexsz) + 137))
>> @@ -263,6 +332,14 @@ tag_size=$(strlen "$tag_content")
>>    run_tests 'tag' $tag_sha1 $tag_size "$tag_content" "$tag_content" 1
>>  +test_expect_success '--batch-command --buffer with flush is correct for tag' '
>> +	run_buffer_test 'tag' $tag_sha1 $tag_size true
>> +'
>> +
>> +test_expect_success '--batch-command --buffer without flush is correct for tag' '
>> +	run_buffer_test 'tag' $tag_sha1 $tag_size false
>> +'
>> +
>>   test_expect_success \
>>       "Reach a blob from a tag pointing to it" \
>>       "test '$hello_content' = \"\$(git cat-file blob $tag_sha1)\""
>> @@ -964,4 +1041,10 @@ test_expect_success 'cat-file --batch-all-objects --batch-check ignores replace'
>>   	test_cmp expect actual
>>   '
>>  +test_expect_success 'batch-command unknown command' '
>> +	echo unknown_command >cmd &&
>> +	test_expect_code 128 git cat-file --batch-command < cmd 2>err &&
>> +	grep -E "^fatal:.*unknown command.*" err
>> +'
>> +
>>   test_done

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2022-02-02 17:15 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-28 18:33 [RFC v3] cat-file: add a --stdin-cmd mode John Cai
2022-01-31 11:44 ` Bagas Sanjaya
2022-01-31 18:24   ` Junio C Hamano
2022-02-01  9:48     ` Ævar Arnfjörð Bjarmason
2022-02-01  9:39 ` Christian Couder
2022-02-01 17:52   ` Taylor Blau
2022-02-01 19:27     ` John Cai
2022-02-01 20:14       ` Taylor Blau
     [not found]         ` <3FE1D509-8AD0-4F0E-9298-DFD3552A98EF@gmail.com>
2022-02-02  1:45           ` Taylor Blau
2022-02-02 13:11   ` John Cai
2022-02-01 10:43 ` Phillip Wood
2022-02-02 15:05   ` John Cai

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.