git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3] [GSOC][RFC] ref-filter: add contents:raw atom
@ 2021-05-23  9:53 ZheNing Hu via GitGitGadget
  2021-05-23  9:53 ` [PATCH 1/3] [GSOC] quote: add *.quote_buf_with_size functions ZheNing Hu via GitGitGadget
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: ZheNing Hu via GitGitGadget @ 2021-05-23  9:53 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Christian Couder, Hariom Verma, Karthik Nayak,
	ZheNing Hu

In (a2f3241: [GSOC] ref-filter: add contents:raw atom) I did not notice the
breakage that occurred during the test, In the later remediation, I found a
very serious problem: The output object data of "git cat-file tree foo" or
"git cat-file blob foo" may contain '\0'. However, most of the logic in
ref-filter depends on the atomic output not containing'\0'.

Therefore, we must carry out a series of repairs to ref-filter so that it
can support output of data containing '\0'.

In first patch, I add *.quote_buf_with_size() functions, this can deal with
data with containing'\0'.

In second patch, I add the member s_size in struct atom_value, and protects
the output of the atom from being truncated at '\0', and successfully
supported the %(contents) of blob and tree.

In third patch, I added the%(contents:raw) atom, It can print the original
content of an object.

What needs to be reconsidered:

For a binary object blob, tree,

git for-each-ref --format="%(contents)" --python refs/mytrees/first

will output a string processed by python_quote_buf_with_size(), which
contains'\0'. But the binary files seem to be useless after quoting. Should
we allow these binary files to be output in the default way with
strbuf_add()? If so, we can remove the first patch.

ZheNing Hu (3):
  [GSOC] quote: add *.quote_buf_with_size functions
  [GSOC] ref-filter: support %(contents) for blob, tree
  [GSOC] ref-filter: add contents:raw atom

 Documentation/git-for-each-ref.txt |  19 ++-
 quote.c                            | 116 +++++++++++++++
 quote.h                            |   4 +
 ref-filter.c                       | 229 +++++++++++++++++++++--------
 t/t6300-for-each-ref.sh            | 214 ++++++++++++++++++++++++++-
 5 files changed, 511 insertions(+), 71 deletions(-)


base-commit: 97eea85a0a1ec66d356567808a1e4ca2367e0ce7
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-959%2Fadlternative%2Fref-filter-contents-raw-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-959/adlternative/ref-filter-contents-raw-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/959
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/3] [GSOC] quote: add *.quote_buf_with_size functions
  2021-05-23  9:53 [PATCH 0/3] [GSOC][RFC] ref-filter: add contents:raw atom ZheNing Hu via GitGitGadget
@ 2021-05-23  9:53 ` ZheNing Hu via GitGitGadget
  2021-05-23  9:53 ` [PATCH 2/3] [GSOC] ref-filter: support %(contents) for blob, tree ZheNing Hu via GitGitGadget
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 18+ messages in thread
From: ZheNing Hu via GitGitGadget @ 2021-05-23  9:53 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Christian Couder, Hariom Verma, Karthik Nayak,
	ZheNing Hu, ZheNing Hu

From: ZheNing Hu <adlternative@gmail.com>

With the original functions `*._quote_buf()`, there may be
'\0' in our source string, only the content before the first '\0'
will be processed and added to the buffer.

Add `perl_quote_buf_with_size()`, `python_quote_buf_with_size()`,
`tcl_quote_buf_with_size`,`sq_quote_buf_with_size()` to
`quote.c`, they will process the source string with specified
length characters. With them, the content after '\0' will not
be truncated.

This will help us add binary data containing '\0' to
`quote_formatting()` in `ref-filter.c`.

Signed-off-by: ZheNing Hu <adlternative@gmail.com>
---
 quote.c | 116 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 quote.h |   4 ++
 2 files changed, 120 insertions(+)

diff --git a/quote.c b/quote.c
index 8a3a5e39eb12..9a1d9dde1fdb 100644
--- a/quote.c
+++ b/quote.c
@@ -43,6 +43,39 @@ void sq_quote_buf(struct strbuf *dst, const char *src)
 	free(to_free);
 }
 
+void sq_quote_buf_with_size(struct strbuf *dst, const char *src, size_t size)
+{
+	char *to_free = NULL;
+	size_t cur_size = 0;
+
+	if (dst->buf == src)
+		to_free = strbuf_detach(dst, NULL);
+
+	strbuf_addch(dst, '\'');
+	while (cur_size < size) {
+		size_t len = strcspn(src, "'!");
+		if (!len) {
+			strbuf_add(dst, src, 1);
+			src++;
+			cur_size++;
+		} else {
+			strbuf_add(dst, src, len);
+			src += len;
+			cur_size += len;
+		}
+		if (cur_size >= size)
+			break;
+		while (need_bs_quote(*src)) {
+			strbuf_addstr(dst, "'\\");
+			strbuf_addch(dst, *src++);
+			cur_size++;
+			strbuf_addch(dst, '\'');
+		}
+	}
+	strbuf_addch(dst, '\'');
+	free(to_free);
+}
+
 void sq_quote_buf_pretty(struct strbuf *dst, const char *src)
 {
 	static const char ok_punct[] = "+,-./:=@_^";
@@ -471,6 +504,25 @@ void perl_quote_buf(struct strbuf *sb, const char *src)
 	strbuf_addch(sb, sq);
 }
 
+void perl_quote_buf_with_size(struct strbuf *sb, const char *src, size_t size)
+{
+	const char sq = '\'';
+	const char bq = '\\';
+	char c;
+	size_t cur_size = 0;
+
+	strbuf_addch(sb, sq);
+	while (cur_size < size) {
+		c = *src++;
+		cur_size++;
+
+		if (c == sq || c == bq)
+			strbuf_addch(sb, bq);
+		strbuf_addch(sb, c);
+	}
+	strbuf_addch(sb, sq);
+}
+
 void python_quote_buf(struct strbuf *sb, const char *src)
 {
 	const char sq = '\'';
@@ -492,6 +544,31 @@ void python_quote_buf(struct strbuf *sb, const char *src)
 	strbuf_addch(sb, sq);
 }
 
+void python_quote_buf_with_size(struct strbuf *sb, const char *src, size_t size)
+{
+	const char sq = '\'';
+	const char bq = '\\';
+	const char nl = '\n';
+	char c;
+	size_t cur_size = 0;
+
+	strbuf_addch(sb, sq);
+	while (cur_size < size) {
+		c = *src++;
+		cur_size++;
+
+		if (c == nl) {
+			strbuf_addch(sb, bq);
+			strbuf_addch(sb, 'n');
+			continue;
+		}
+		if (c == sq || c == bq)
+			strbuf_addch(sb, bq);
+		strbuf_addch(sb, c);
+	}
+	strbuf_addch(sb, sq);
+}
+
 void tcl_quote_buf(struct strbuf *sb, const char *src)
 {
 	char c;
@@ -527,6 +604,45 @@ void tcl_quote_buf(struct strbuf *sb, const char *src)
 	strbuf_addch(sb, '"');
 }
 
+void tcl_quote_buf_with_size(struct strbuf *sb, const char *src, size_t size)
+{
+	char c;
+	size_t cur_size = 0;
+
+	strbuf_addch(sb, '"');
+	while (cur_size < size) {
+		c = *src++;
+		cur_size++;
+
+		switch (c) {
+		case '[': case ']':
+		case '{': case '}':
+		case '$': case '\\': case '"':
+			strbuf_addch(sb, '\\');
+			/* fallthrough */
+		default:
+			strbuf_addch(sb, c);
+			break;
+		case '\f':
+			strbuf_addstr(sb, "\\f");
+			break;
+		case '\r':
+			strbuf_addstr(sb, "\\r");
+			break;
+		case '\n':
+			strbuf_addstr(sb, "\\n");
+			break;
+		case '\t':
+			strbuf_addstr(sb, "\\t");
+			break;
+		case '\v':
+			strbuf_addstr(sb, "\\v");
+			break;
+		}
+	}
+	strbuf_addch(sb, '"');
+}
+
 void basic_regex_quote_buf(struct strbuf *sb, const char *src)
 {
 	char c;
diff --git a/quote.h b/quote.h
index 768cc6338e27..e894507329cc 100644
--- a/quote.h
+++ b/quote.h
@@ -30,6 +30,7 @@ struct strbuf;
  */
 
 void sq_quote_buf(struct strbuf *, const char *src);
+void sq_quote_buf_with_size(struct strbuf *, const char *src, size_t size);
 void sq_quote_argv(struct strbuf *, const char **argv);
 void sq_quotef(struct strbuf *, const char *fmt, ...);
 
@@ -94,8 +95,11 @@ char *quote_path(const char *in, const char *prefix, struct strbuf *out, unsigne
 
 /* quoting as a string literal for other languages */
 void perl_quote_buf(struct strbuf *sb, const char *src);
+void perl_quote_buf_with_size(struct strbuf *sb, const char *src, size_t size);
 void python_quote_buf(struct strbuf *sb, const char *src);
+void python_quote_buf_with_size(struct strbuf *sb, const char *src, size_t size);
 void tcl_quote_buf(struct strbuf *sb, const char *src);
+void tcl_quote_buf_with_size(struct strbuf *sb, const char *src, size_t size);
 void basic_regex_quote_buf(struct strbuf *sb, const char *src);
 
 #endif
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 2/3] [GSOC] ref-filter: support %(contents) for blob, tree
  2021-05-23  9:53 [PATCH 0/3] [GSOC][RFC] ref-filter: add contents:raw atom ZheNing Hu via GitGitGadget
  2021-05-23  9:53 ` [PATCH 1/3] [GSOC] quote: add *.quote_buf_with_size functions ZheNing Hu via GitGitGadget
@ 2021-05-23  9:53 ` ZheNing Hu via GitGitGadget
  2021-05-25  5:03   ` Junio C Hamano
  2021-05-23  9:53 ` [PATCH 3/3] [GSOC] ref-filter: add contents:raw atom ZheNing Hu via GitGitGadget
  2021-05-24  1:09 ` [PATCH 0/3] [GSOC][RFC] " Junio C Hamano
  3 siblings, 1 reply; 18+ messages in thread
From: ZheNing Hu via GitGitGadget @ 2021-05-23  9:53 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Christian Couder, Hariom Verma, Karthik Nayak,
	ZheNing Hu, ZheNing Hu

From: ZheNing Hu <adlternative@gmail.com>

In order to let `cat-file --batch` use ref-filter logic,
we need to print the original content of an object. We
can reuse the existing atom `%(contents)` in `ref-filter`,
The original `%(contents)` only supports tag and commit
objects. If we want to support both blob and tree objects,
we must consider the following issues:

The original contents of blob, tree objects may contain '\0',
most of the logic in `ref-filter` depends on the output of
the atom being a string (end with '\0'):

`quote_formatting()` use `strbuf_addstr()` or `*._quote_buf()`
add content to the buffer. E.g. The original content of a tree
object is `100644 one\0...`, only the `100644 one` will be
added to the buffer, which is incorrect.

Therefore, add a new member in `struct atom_value`: `s_size`,
when we record the original content of the blob and tree
objects in `grab_contents()`, use `v->s_size` to record
the size of the objects contents, then in `quote_formatting()`,
if the length of the contents passed in `quote_formatting()` is
not equal to 0, we can use `strbuf_add()` instead of
`strbuf_addstr()`  or `*._quote_buf_with_size()` instead of
`*._quote_buf()` to add contents with a specified length if the
length of the contents is not equal to 0. It will not cause
truncation problems.

Similarly, in `append_atom()`, we use `strbuf_add()` instead
of `strbuf_addstr()`; In `then_atom_handler()`, we use `memcmp()`
instread of `strcmp()`; In `cmp_ref_sorting()`, we use `memcmp()`
and `memcasecmp()` instead of `strcmp()` and `strcasecmp()` when
the `v->s_size` of one of the two atoms is not equals to 0.

Based-on-patch-by: Olga Telezhnaya <olyatelezhnaya@gmail.com>
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
---
 Documentation/git-for-each-ref.txt |  11 +-
 ref-filter.c                       | 220 +++++++++++++++++++++--------
 t/t6300-for-each-ref.sh            | 166 +++++++++++++++++++++-
 3 files changed, 327 insertions(+), 70 deletions(-)

diff --git a/Documentation/git-for-each-ref.txt b/Documentation/git-for-each-ref.txt
index 2ae2478de706..30b93d2e5178 100644
--- a/Documentation/git-for-each-ref.txt
+++ b/Documentation/git-for-each-ref.txt
@@ -235,11 +235,12 @@ and `date` to extract the named component.  For email fields (`authoremail`,
 without angle brackets, and `:localpart` to get the part before the `@` symbol
 out of the trimmed email.
 
-The message in a commit or a tag object is `contents`, from which
-`contents:<part>` can be used to extract various parts out of:
+The data in a object is `contents`, from which `contents:<part>` can be used
+to extract various parts out of:
 
 contents:size::
-	The size in bytes of the commit or tag message.
+	The size in bytes of the commit or tag message, and the raw object size
+	of the blob or tree.
 
 contents:subject::
 	The first paragraph of the message, which typically is a
@@ -257,7 +258,9 @@ contents:signature::
 	The optional GPG signature of the tag.
 
 contents:lines=N::
-	The first `N` lines of the message.
+	The first `N` lines of the message of the commit or tag message.
+
+Note: blob and tree objects only support `%(contents)` and `%(contents:size)`.
 
 Additionally, the trailers as interpreted by linkgit:git-interpret-trailers[1]
 are obtained as `trailers[:options]` (or by using the historical alias
diff --git a/ref-filter.c b/ref-filter.c
index e2eac50d9508..e59907188e79 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -564,6 +564,7 @@ struct ref_formatting_state {
 
 struct atom_value {
 	const char *s;
+	size_t s_size;
 	int (*handler)(struct atom_value *atomv, struct ref_formatting_state *state,
 		       struct strbuf *err);
 	uintmax_t value; /* used for sorting when not FIELD_STR */
@@ -652,23 +653,38 @@ static int parse_ref_filter_atom(const struct ref_format *format,
 	return at;
 }
 
-static void quote_formatting(struct strbuf *s, const char *str, int quote_style)
+static void quote_formatting(struct strbuf *s, const char *str, size_t len, int quote_style)
 {
 	switch (quote_style) {
 	case QUOTE_NONE:
-		strbuf_addstr(s, str);
+		if (len)
+			strbuf_add(s, str, len);
+		else
+			strbuf_addstr(s, str);
 		break;
 	case QUOTE_SHELL:
-		sq_quote_buf(s, str);
+		if (len)
+			sq_quote_buf_with_size(s, str, len);
+		else
+			sq_quote_buf(s, str);
 		break;
 	case QUOTE_PERL:
-		perl_quote_buf(s, str);
+		if (len)
+			perl_quote_buf_with_size(s, str, len);
+		else
+			perl_quote_buf(s, str);
 		break;
 	case QUOTE_PYTHON:
-		python_quote_buf(s, str);
+		if (len)
+			python_quote_buf_with_size(s, str, len);
+		else
+			python_quote_buf(s, str);
 		break;
 	case QUOTE_TCL:
-		tcl_quote_buf(s, str);
+		if (len)
+			tcl_quote_buf_with_size(s, str, len);
+		else
+			tcl_quote_buf(s, str);
 		break;
 	}
 }
@@ -683,9 +699,12 @@ static int append_atom(struct atom_value *v, struct ref_formatting_state *state,
 	 * encountered.
 	 */
 	if (!state->stack->prev)
-		quote_formatting(&state->stack->output, v->s, state->quote_style);
+		quote_formatting(&state->stack->output, v->s, v->s_size, state->quote_style);
 	else
-		strbuf_addstr(&state->stack->output, v->s);
+		if (v->s_size)
+			strbuf_add(&state->stack->output, v->s, v->s_size);
+		else
+			strbuf_addstr(&state->stack->output, v->s);
 	return 0;
 }
 
@@ -785,14 +804,16 @@ static int if_atom_handler(struct atom_value *atomv, struct ref_formatting_state
 	return 0;
 }
 
-static int is_empty(const char *s)
+static int is_empty(struct strbuf *buf)
 {
-	while (*s != '\0') {
-		if (!isspace(*s))
-			return 0;
+	const char *s = buf->buf;
+	size_t cur_len = 0;
+
+	while ((cur_len != buf->len) && (isspace(*s) || *s == '\0')) {
 		s++;
+		cur_len++;
 	}
-	return 1;
+	return cur_len == buf->len;
 }
 
 static int then_atom_handler(struct atom_value *atomv, struct ref_formatting_state *state,
@@ -800,6 +821,7 @@ static int then_atom_handler(struct atom_value *atomv, struct ref_formatting_sta
 {
 	struct ref_formatting_stack *cur = state->stack;
 	struct if_then_else *if_then_else = NULL;
+	size_t str_len = 0;
 
 	if (cur->at_end == if_then_else_handler)
 		if_then_else = (struct if_then_else *)cur->at_end_data;
@@ -810,18 +832,30 @@ static int then_atom_handler(struct atom_value *atomv, struct ref_formatting_sta
 	if (if_then_else->else_atom_seen)
 		return strbuf_addf_ret(err, -1, _("format: %%(then) atom used after %%(else)"));
 	if_then_else->then_atom_seen = 1;
+	if (if_then_else->str)
+		str_len = strlen(if_then_else->str);
 	/*
 	 * If the 'equals' or 'notequals' attribute is used then
 	 * perform the required comparison. If not, only non-empty
 	 * strings satisfy the 'if' condition.
 	 */
 	if (if_then_else->cmp_status == COMPARE_EQUAL) {
-		if (!strcmp(if_then_else->str, cur->output.buf))
+		if (cur->output.len > str_len)
+			str_len = cur->output.len;
+		if (!if_then_else->str)
+			BUG("when if_then_else->cmp_status == COMPARE_EQUAL,"
+			    "if_then_else->str must not be null");
+		if (!memcmp(if_then_else->str, cur->output.buf, str_len))
 			if_then_else->condition_satisfied = 1;
 	} else if (if_then_else->cmp_status == COMPARE_UNEQUAL) {
-		if (strcmp(if_then_else->str, cur->output.buf))
+		if (cur->output.len > str_len)
+			str_len = cur->output.len;
+		if (!if_then_else->str)
+			BUG("when if_then_else->cmp_status == COMPARE_UNEQUAL,"
+			    "if_then_else->str must not be null");
+		if (memcmp(if_then_else->str, cur->output.buf, str_len))
 			if_then_else->condition_satisfied = 1;
-	} else if (cur->output.len && !is_empty(cur->output.buf))
+	} else if (cur->output.len && !is_empty(&cur->output))
 		if_then_else->condition_satisfied = 1;
 	strbuf_reset(&cur->output);
 	return 0;
@@ -867,7 +901,7 @@ static int end_atom_handler(struct atom_value *atomv, struct ref_formatting_stat
 	 * only on the topmost supporting atom.
 	 */
 	if (!current->prev->prev) {
-		quote_formatting(&s, current->output.buf, state->quote_style);
+		quote_formatting(&s, current->output.buf, current->output.len, state->quote_style);
 		strbuf_swap(&current->output, &s);
 	}
 	strbuf_release(&s);
@@ -1292,7 +1326,8 @@ static void append_lines(struct strbuf *out, const char *buf, unsigned long size
 }
 
 /* See grab_values */
-static void grab_sub_body_contents(struct atom_value *val, int deref, void *buf)
+static void grab_contents(struct atom_value *val, int deref, void *buf,
+			  unsigned long buf_size, enum object_type object_type)
 {
 	int i;
 	const char *subpos = NULL, *bodypos = NULL, *sigpos = NULL;
@@ -1312,43 +1347,60 @@ static void grab_sub_body_contents(struct atom_value *val, int deref, void *buf)
 		    !starts_with(name, "trailers") &&
 		    !starts_with(name, "contents"))
 			continue;
-		if (!subpos)
-			find_subpos(buf,
-				    &subpos, &sublen,
-				    &bodypos, &bodylen, &nonsiglen,
-				    &sigpos, &siglen);
-
-		if (atom->u.contents.option == C_SUB)
-			v->s = copy_subject(subpos, sublen);
-		else if (atom->u.contents.option == C_SUB_SANITIZE) {
-			struct strbuf sb = STRBUF_INIT;
-			format_sanitized_subject(&sb, subpos, sublen);
-			v->s = strbuf_detach(&sb, NULL);
-		} else if (atom->u.contents.option == C_BODY_DEP)
-			v->s = xmemdupz(bodypos, bodylen);
-		else if (atom->u.contents.option == C_LENGTH)
-			v->s = xstrfmt("%"PRIuMAX, (uintmax_t)strlen(subpos));
-		else if (atom->u.contents.option == C_BODY)
-			v->s = xmemdupz(bodypos, nonsiglen);
-		else if (atom->u.contents.option == C_SIG)
-			v->s = xmemdupz(sigpos, siglen);
-		else if (atom->u.contents.option == C_LINES) {
-			struct strbuf s = STRBUF_INIT;
-			const char *contents_end = bodypos + nonsiglen;
-
-			/*  Size is the length of the message after removing the signature */
-			append_lines(&s, subpos, contents_end - subpos, atom->u.contents.nlines);
-			v->s = strbuf_detach(&s, NULL);
-		} else if (atom->u.contents.option == C_TRAILERS) {
-			struct strbuf s = STRBUF_INIT;
-
-			/* Format the trailer info according to the trailer_opts given */
-			format_trailers_from_commit(&s, subpos, &atom->u.contents.trailer_opts);
-
-			v->s = strbuf_detach(&s, NULL);
-		} else if (atom->u.contents.option == C_BARE)
-			v->s = xstrdup(subpos);
 
+		switch (object_type) {
+		case OBJ_TAG:
+		case OBJ_COMMIT: {
+			if (!subpos)
+				find_subpos(buf,
+					&subpos, &sublen,
+					&bodypos, &bodylen, &nonsiglen,
+					&sigpos, &siglen);
+
+			if (atom->u.contents.option == C_SUB)
+				v->s = copy_subject(subpos, sublen);
+			else if (atom->u.contents.option == C_SUB_SANITIZE) {
+				struct strbuf sb = STRBUF_INIT;
+				format_sanitized_subject(&sb, subpos, sublen);
+				v->s = strbuf_detach(&sb, NULL);
+			} else if (atom->u.contents.option == C_BODY_DEP)
+				v->s = xmemdupz(bodypos, bodylen);
+			else if (atom->u.contents.option == C_LENGTH)
+				v->s = xstrfmt("%"PRIuMAX, (uintmax_t)strlen(subpos));
+			else if (atom->u.contents.option == C_BODY)
+				v->s = xmemdupz(bodypos, nonsiglen);
+			else if (atom->u.contents.option == C_SIG)
+				v->s = xmemdupz(sigpos, siglen);
+			else if (atom->u.contents.option == C_LINES) {
+				struct strbuf s = STRBUF_INIT;
+				const char *contents_end = bodypos + nonsiglen;
+
+				/*  Size is the length of the message after removing the signature */
+				append_lines(&s, subpos, contents_end - subpos, atom->u.contents.nlines);
+				v->s = strbuf_detach(&s, NULL);
+			} else if (atom->u.contents.option == C_TRAILERS) {
+				struct strbuf s = STRBUF_INIT;
+
+				/* Format the trailer info according to the trailer_opts given */
+				format_trailers_from_commit(&s, subpos, &atom->u.contents.trailer_opts);
+
+				v->s = strbuf_detach(&s, NULL);
+			} else if (atom->u.contents.option == C_BARE)
+				v->s = xstrdup(subpos);
+			break;
+		}
+		case OBJ_BLOB:
+		case OBJ_TREE: {
+			if (atom->u.contents.option == C_BARE) {
+				v->s_size = buf_size;
+				v->s = xmemdupz(buf, buf_size);
+			} else if (atom->u.contents.option == C_LENGTH)
+				v->s = xstrfmt("%"PRIuMAX, buf_size);
+			break;
+		}
+		default:
+			BUG("unknown object type");
+		}
 	}
 	free((void *)sigpos);
 }
@@ -1374,25 +1426,30 @@ static void fill_missing_values(struct atom_value *val)
  * pointed at by the ref itself; otherwise it is the object the
  * ref (which is a tag) refers to.
  */
-static void grab_values(struct atom_value *val, int deref, struct object *obj, void *buf)
+static void grab_values(struct atom_value *val, int deref, struct object *obj, struct expand_data *data)
 {
+	void *buf = data->content;
+	unsigned long buf_size = data->size;
+
 	switch (obj->type) {
 	case OBJ_TAG:
 		grab_tag_values(val, deref, obj);
-		grab_sub_body_contents(val, deref, buf);
+		grab_contents(val, deref, buf, buf_size, obj->type);
 		grab_person("tagger", val, deref, buf);
 		break;
 	case OBJ_COMMIT:
 		grab_commit_values(val, deref, obj);
-		grab_sub_body_contents(val, deref, buf);
+		grab_contents(val, deref, buf, buf_size, obj->type);
 		grab_person("author", val, deref, buf);
 		grab_person("committer", val, deref, buf);
 		break;
 	case OBJ_TREE:
 		/* grab_tree_values(val, deref, obj, buf, sz); */
+		grab_contents(val, deref, buf, buf_size, obj->type);
 		break;
 	case OBJ_BLOB:
 		/* grab_blob_values(val, deref, obj, buf, sz); */
+		grab_contents(val, deref, buf, buf_size, obj->type);
 		break;
 	default:
 		die("Eh?  Object of type %d?", obj->type);
@@ -1614,7 +1671,7 @@ static int get_object(struct ref_array_item *ref, int deref, struct object **obj
 			return strbuf_addf_ret(err, -1, _("parse_object_buffer failed on %s for %s"),
 					       oid_to_hex(&oi->oid), ref->refname);
 		}
-		grab_values(ref->value, deref, *obj, oi->content);
+		grab_values(ref->value, deref, *obj, oi);
 	}
 
 	grab_common_values(ref->value, deref, oi);
@@ -2297,6 +2354,25 @@ static int compare_detached_head(struct ref_array_item *a, struct ref_array_item
 	return 0;
 }
 
+static int memcasecmp(const void *vs1, const void *vs2, size_t n)
+{
+  size_t i;
+  const char *s1 = (const char *)vs1;
+  const char *s2 = (const char *)vs2;
+
+  for (i = 0; i < n; i++) {
+	unsigned char u1 = s1[i];
+	unsigned char u2 = s2[i];
+	int U1 = toupper (u1);
+	int U2 = toupper (u2);
+	int diff = (UCHAR_MAX <= INT_MAX ? U1 - U2
+		    : U1 < U2 ? -1 : U2 < U1);
+	if (diff)
+		return diff;
+	}
+	return 0;
+}
+
 static int cmp_ref_sorting(struct ref_sorting *s, struct ref_array_item *a, struct ref_array_item *b)
 {
 	struct atom_value *va, *vb;
@@ -2304,6 +2380,7 @@ static int cmp_ref_sorting(struct ref_sorting *s, struct ref_array_item *a, stru
 	int cmp_detached_head = 0;
 	cmp_type cmp_type = used_atom[s->atom].type;
 	struct strbuf err = STRBUF_INIT;
+	size_t slen = 0;
 
 	if (get_ref_atom_value(a, s->atom, &va, &err))
 		die("%s", err.buf);
@@ -2317,10 +2394,28 @@ static int cmp_ref_sorting(struct ref_sorting *s, struct ref_array_item *a, stru
 	} else if (s->sort_flags & REF_SORTING_VERSION) {
 		cmp = versioncmp(va->s, vb->s);
 	} else if (cmp_type == FIELD_STR) {
-		int (*cmp_fn)(const char *, const char *);
-		cmp_fn = s->sort_flags & REF_SORTING_ICASE
-			? strcasecmp : strcmp;
-		cmp = cmp_fn(va->s, vb->s);
+		if (!va->s_size && !vb->s_size) {
+			int (*cmp_fn)(const char *, const char *);
+			cmp_fn = s->sort_flags & REF_SORTING_ICASE
+				? strcasecmp : strcmp;
+			cmp = cmp_fn(va->s, vb->s);
+		} else {
+			int (*cmp_fn)(const void *, const void *, size_t);
+			cmp_fn = s->sort_flags & REF_SORTING_ICASE
+				? memcasecmp : memcmp;
+			if (va->s_size && vb->s_size) {
+				cmp = cmp_fn(va->s, vb->s, va->s_size > vb->s_size ?
+					     va->s_size : vb->s_size);
+			} else if (!va->s_size) {
+				slen = strlen(va->s);
+				cmp = cmp_fn(va->s, vb->s, slen > vb->s_size ?
+					     slen : vb->s_size);
+			} else {
+				slen = strlen(vb->s);
+				cmp = cmp_fn(va->s, vb->s, va->s_size > slen ?
+					     va->s_size : slen);
+			}
+		}
 	} else {
 		if (va->value < vb->value)
 			cmp = -1;
@@ -2420,6 +2515,7 @@ int format_ref_array_item(struct ref_array_item *info,
 	}
 	if (format->need_color_reset_at_eol) {
 		struct atom_value resetv;
+		resetv.s_size = 0;
 		resetv.s = GIT_COLOR_RESET;
 		if (append_atom(&resetv, &state, error_buf)) {
 			pop_stack_element(&state.stack);
diff --git a/t/t6300-for-each-ref.sh b/t/t6300-for-each-ref.sh
index 9e0214076b4d..4754ec639797 100755
--- a/t/t6300-for-each-ref.sh
+++ b/t/t6300-for-each-ref.sh
@@ -63,8 +63,10 @@ test_atom() {
 		tag)
 			# We cannot use $3 as it expects sanitize_pgp to run
 			expect=$(git cat-file tag $ref | tail -n +6 | wc -c) ;;
-		tree | blob)
-			expect='' ;;
+		tree)
+			expect=$(git cat-file tree $ref | wc -c) ;;
+		blob)
+			expect=$(git cat-file blob $ref | wc -c) ;;
 		commit)
 			expect=$(printf '%s' "$3" | wc -c) ;;
 		esac
@@ -718,14 +720,170 @@ test_atom refs/mytrees/first contents:subject ""
 test_atom refs/mytrees/first body ""
 test_atom refs/mytrees/first contents:body ""
 test_atom refs/mytrees/first contents:signature ""
-test_atom refs/mytrees/first contents ""
+
+test_expect_success 'basic atom: refs/mytrees/first contents' '
+	git cat-file tree refs/mytrees/first >expected &&
+	cat expected | wc -c >size_expected &&
+	echo "" >>expected &&
+	git for-each-ref --format="%(contents)" refs/mytrees/first >actual &&
+	test_cmp expected actual &&
+	git for-each-ref --format="%(contents:size)" refs/mytrees/first >actual &&
+	test_cmp size_expected actual
+'
+
+test_expect_success 'basic atom: refs/mytrees/first contents with --python' '
+	cat >expected <<-\EOF &&
+	0000000 030447 030060 032066 020064 067157 000145 157155 153143
+	0000020 106210 070754 101352 115504 123726 045150 042451 077455
+	0000040 030061 033060 032064 072040 067567 072056 173400 167431
+	0000060 030324 025725 144317 065126 131103 062753 104126 104323
+	0000100 023561 000012
+	0000103
+	EOF
+	git for-each-ref --python --format="%(contents)" refs/mytrees/first >actual &&
+	od actual >od_actual &&
+	test_cmp expected od_actual
+'
+
+test_expect_success 'basic atom: refs/mytrees/first contents with --tcl' '
+	cat >expected <<-\EOF &&
+	0000000 030442 030060 032066 020064 067157 000145 157155 153143
+	0000020 106210 070754 101352 115504 123726 045150 042451 077455
+	0000040 030061 033060 032064 072040 067567 072056 173400 167431
+	0000060 030324 025725 144317 065126 131103 062753 104126 104323
+	0000100 021161 000012
+	0000103
+	EOF
+	git for-each-ref --tcl --format="%(contents)" refs/mytrees/first >actual &&
+	od actual >od_actual &&
+	test_cmp expected od_actual
+'
+
+test_expect_success 'basic atom: refs/mytrees/first contents with --shell' '
+	cat >expected <<-\EOF &&
+	0000000 030447 030060 032066 020064 067157 000145 157155 153143
+	0000020 106210 070754 101352 115504 123726 045150 042451 077455
+	0000040 030061 033060 032064 072040 067567 072056 173400 167431
+	0000060 030324 025725 144317 065126 131103 062753 104126 104323
+	0000100 023561 000012
+	0000103
+	EOF
+	git for-each-ref --shell --format="%(contents)" refs/mytrees/first >actual &&
+	od actual >od_actual &&
+	test_cmp expected od_actual
+'
+
+test_expect_success 'basic atom: refs/mytrees/first contents with --perl' '
+	cat >expected <<-\EOF &&
+	0000000 030447 030060 032066 020064 067157 000145 157155 153143
+	0000020 106210 070754 101352 115504 123726 045150 042451 077455
+	0000040 030061 033060 032064 072040 067567 072056 173400 167431
+	0000060 030324 025725 144317 065126 131103 062753 104126 104323
+	0000100 023561 000012
+	0000103
+	EOF
+	git for-each-ref --perl --format="%(contents)" refs/mytrees/first >actual &&
+	od actual >od_actual &&
+	test_cmp expected od_actual
+'
 
 test_atom refs/myblobs/first subject ""
 test_atom refs/myblobs/first contents:subject ""
 test_atom refs/myblobs/first body ""
 test_atom refs/myblobs/first contents:body ""
 test_atom refs/myblobs/first contents:signature ""
-test_atom refs/myblobs/first contents ""
+
+test_expect_success 'basic atom: refs/myblobs/first contents' '
+	git cat-file blob refs/myblobs/first >expected &&
+	cat expected | wc -c >size_expected &&
+	echo "" >>expected &&
+	git for-each-ref --format="%(contents)" refs/myblobs/first >actual &&
+	test_cmp expected actual &&
+	git for-each-ref --format="%(contents:size)" refs/myblobs/first >actual &&
+	test_cmp size_expected actual
+'
+
+test_expect_success 'set up refs pointing to binary blob' '
+	printf "%b" "a\0b\0c" >blob1 &&
+	printf "%b" "a\0c\0b" >blob2 &&
+	printf "%b" "\0a\0b\0c" >blob3 &&
+	printf "%b" "abc" >blob4 &&
+	printf "%b" "\0 \0 \0 " >blob5 &&
+	printf "%b" "\0 \0a\0 " >blob6 &&
+	git hash-object blob1 -w | xargs git update-ref refs/myblobs/blob1 &&
+	git hash-object blob2 -w | xargs git update-ref refs/myblobs/blob2 &&
+	git hash-object blob3 -w | xargs git update-ref refs/myblobs/blob3 &&
+	git hash-object blob4 -w | xargs git update-ref refs/myblobs/blob4 &&
+	git hash-object blob5 -w | xargs git update-ref refs/myblobs/blob5 &&
+	git hash-object blob6 -w | xargs git update-ref refs/myblobs/blob6
+'
+
+test_expect_success 'Verify sorts with contents' '
+	cat >expected <<-EOF &&
+	refs/myblobs/blob5
+	refs/myblobs/blob6
+	refs/myblobs/blob3
+	refs/mytrees/first
+	refs/myblobs/first
+	refs/myblobs/blob1
+	refs/myblobs/blob2
+	refs/myblobs/blob4
+	refs/heads/main
+	EOF
+	git for-each-ref --format="%(refname)" --sort=contents \
+		refs/heads/main refs/myblobs/ refs/mytrees/first >actual &&
+	test_cmp expected actual
+'
+
+test_expect_success 'validate contents atom with %(if:equals)' '
+	cat >expect <<-EOF &&
+	not equals
+	not equals
+	not equals
+	not equals
+	not equals
+	not equals
+	refs/myblobs/blob4
+	not equals
+	not equals
+	not equals
+	EOF
+	git for-each-ref --format="%(if:equals=abc)%(contents)%(then)%(refname)%(else)not equals%(end)" \
+		refs/myblobs/ refs/heads/ >actual &&
+	test_cmp expect actual
+'
+test_expect_success 'validate contents atom with %(if:notequals)' '
+	cat >expect <<-EOF &&
+	refs/heads/ambiguous
+	refs/heads/main
+	refs/heads/newtag
+	refs/myblobs/blob1
+	refs/myblobs/blob2
+	refs/myblobs/blob3
+	equals
+	refs/myblobs/blob5
+	refs/myblobs/blob6
+	refs/myblobs/first
+	EOF
+	git for-each-ref --format="%(if:notequals=abc)%(contents)%(then)%(refname)%(else)equals%(end)" \
+		refs/myblobs/ refs/heads/ >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'empty contents refs with %(if)' '
+	cat >expect <<-EOF &&
+	refs/myblobs/blob1 not empty
+	refs/myblobs/blob2 not empty
+	refs/myblobs/blob3 not empty
+	refs/myblobs/blob4 not empty
+	refs/myblobs/blob5 empty
+	refs/myblobs/blob6 not empty
+	refs/myblobs/first not empty
+	EOF
+	git for-each-ref --format="%(refname) %(if)%(contents)%(then)not empty%(else)empty%(end)" \
+	refs/myblobs/ >actual &&
+	test_cmp expect actual
+'
 
 test_expect_success 'set up multiple-sort tags' '
 	for when in 100000 200000
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 3/3] [GSOC] ref-filter: add contents:raw atom
  2021-05-23  9:53 [PATCH 0/3] [GSOC][RFC] ref-filter: add contents:raw atom ZheNing Hu via GitGitGadget
  2021-05-23  9:53 ` [PATCH 1/3] [GSOC] quote: add *.quote_buf_with_size functions ZheNing Hu via GitGitGadget
  2021-05-23  9:53 ` [PATCH 2/3] [GSOC] ref-filter: support %(contents) for blob, tree ZheNing Hu via GitGitGadget
@ 2021-05-23  9:53 ` ZheNing Hu via GitGitGadget
  2021-05-24  1:09 ` [PATCH 0/3] [GSOC][RFC] " Junio C Hamano
  3 siblings, 0 replies; 18+ messages in thread
From: ZheNing Hu via GitGitGadget @ 2021-05-23  9:53 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Christian Couder, Hariom Verma, Karthik Nayak,
	ZheNing Hu, ZheNing Hu

From: ZheNing Hu <adlternative@gmail.com>

Add new formatting option `%(contents:raw)`, which will
print all the object contents without any changes.
For blob and tree, it priont all object data, its output
is equivalent to `%(contents)`; For commit and tag,
it not only print objects’ messages (as `%(contents)` does),
but also print objects headers, like the "tree XXX",
"parent YYY", etc lines in commits, and the "object OOO",
"type TTT", etc lines in tags.

It will help further to migrate all cat-file formatting
logic from cat-file to ref-filter.

E.g.

git for-each-ref --format=%(contents:raw) refs/heads/foo

It will have similar output to:

git rev-parse refs/heads/foo | git cat-file --batch

Based-on-patch-by: Olga Telezhnaya <olyatelezhnaya@gmail.com>
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
---
 Documentation/git-for-each-ref.txt | 10 ++++++-
 ref-filter.c                       | 11 +++++--
 t/t6300-for-each-ref.sh            | 48 ++++++++++++++++++++++++++++++
 3 files changed, 66 insertions(+), 3 deletions(-)

diff --git a/Documentation/git-for-each-ref.txt b/Documentation/git-for-each-ref.txt
index 30b93d2e5178..de77a7454434 100644
--- a/Documentation/git-for-each-ref.txt
+++ b/Documentation/git-for-each-ref.txt
@@ -260,7 +260,15 @@ contents:signature::
 contents:lines=N::
 	The first `N` lines of the message of the commit or tag message.
 
-Note: blob and tree objects only support `%(contents)` and `%(contents:size)`.
+contents:raw::
+	The raw data of the object. For blob and tree, it priont all object data,
+	its output is equivalent to `%(contents)`; For commit and tag, it not only
+	print objects’ messages (as `%(contents)` does), but also print objects
+	headers, like the "tree XXX", "parent YYY", etc lines in commits, and the
+	"object OOO", "type TTT", etc lines in tags.
+
+Note: blob and tree objects only support `%(contents)`, `%(contents:raw)` and
+`%(contents:size)`.
 
 Additionally, the trailers as interpreted by linkgit:git-interpret-trailers[1]
 are obtained as `trailers[:options]` (or by using the historical alias
diff --git a/ref-filter.c b/ref-filter.c
index e59907188e79..17ef39fe7454 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -134,7 +134,7 @@ static struct used_atom {
 		} remote_ref;
 		struct {
 			enum { C_BARE, C_BODY, C_BODY_DEP, C_LENGTH, C_LINES,
-			       C_SIG, C_SUB, C_SUB_SANITIZE, C_TRAILERS } option;
+			       C_RAW, C_SIG, C_SUB, C_SUB_SANITIZE, C_TRAILERS } option;
 			struct process_trailer_options trailer_opts;
 			unsigned int nlines;
 		} contents;
@@ -347,6 +347,8 @@ static int contents_atom_parser(const struct ref_format *format, struct used_ato
 {
 	if (!arg)
 		atom->u.contents.option = C_BARE;
+	else if (!strcmp(arg,"raw"))
+		atom->u.contents.option = C_RAW;
 	else if (!strcmp(arg, "body"))
 		atom->u.contents.option = C_BODY;
 	else if (!strcmp(arg, "size"))
@@ -1387,11 +1389,16 @@ static void grab_contents(struct atom_value *val, int deref, void *buf,
 				v->s = strbuf_detach(&s, NULL);
 			} else if (atom->u.contents.option == C_BARE)
 				v->s = xstrdup(subpos);
+			else if (atom->u.contents.option == C_RAW) {
+				v->s_size = buf_size;
+				v->s = xmemdupz(buf, buf_size);
+			}
 			break;
 		}
 		case OBJ_BLOB:
 		case OBJ_TREE: {
-			if (atom->u.contents.option == C_BARE) {
+			if (atom->u.contents.option == C_BARE ||
+			    atom->u.contents.option == C_RAW) {
 				v->s_size = buf_size;
 				v->s = xmemdupz(buf, buf_size);
 			} else if (atom->u.contents.option == C_LENGTH)
diff --git a/t/t6300-for-each-ref.sh b/t/t6300-for-each-ref.sh
index 4754ec639797..d2a7e1f4384e 100755
--- a/t/t6300-for-each-ref.sh
+++ b/t/t6300-for-each-ref.sh
@@ -159,6 +159,8 @@ test_atom head subject 'Initial'
 test_atom head subject:sanitize 'Initial'
 test_atom head contents:subject 'Initial'
 test_atom head body ''
+test_atom head contents:raw "$(git cat-file commit refs/heads/main)
+"
 test_atom head contents:body ''
 test_atom head contents:signature ''
 test_atom head contents 'Initial
@@ -688,6 +690,18 @@ test_atom refs/tags/signed-empty contents:body ''
 test_atom refs/tags/signed-empty contents:signature "$sig"
 test_atom refs/tags/signed-empty contents "$sig"
 
+test_expect_success 'basic atom: refs/tags/signed-empty contents:raw' '
+	git cat-file tag refs/tags/signed-empty >expected &&
+	git for-each-ref --format="%(contents:raw)" refs/tags/signed-empty >actual &&
+	sanitize_pgp <expected >expected.clean &&
+	sanitize_pgp <actual >actual.clean &&
+	echo "" >>expected.clean &&
+	test_cmp expected.clean actual.clean
+'
+
+test_atom refs/tags/signed-empty '*contents:raw' "$(git cat-file commit HEAD)
+"
+
 test_atom refs/tags/signed-short subject 'subject line'
 test_atom refs/tags/signed-short subject:sanitize 'subject-line'
 test_atom refs/tags/signed-short contents:subject 'subject line'
@@ -697,6 +711,15 @@ test_atom refs/tags/signed-short contents:signature "$sig"
 test_atom refs/tags/signed-short contents "subject line
 $sig"
 
+test_expect_success 'basic atom: refs/tags/signed-short contents:raw' '
+	git cat-file tag refs/tags/signed-short >expected &&
+	git for-each-ref --format="%(contents:raw)" refs/tags/signed-short >actual &&
+	sanitize_pgp <expected >expected.clean &&
+	sanitize_pgp <actual >actual.clean &&
+	echo "" >>expected.clean &&
+	test_cmp expected.clean actual.clean
+'
+
 test_atom refs/tags/signed-long subject 'subject line'
 test_atom refs/tags/signed-long subject:sanitize 'subject-line'
 test_atom refs/tags/signed-long contents:subject 'subject line'
@@ -710,6 +733,15 @@ test_atom refs/tags/signed-long contents "subject line
 body contents
 $sig"
 
+test_expect_success 'basic atom: refs/tags/signed-long contents:raw' '
+	git cat-file tag refs/tags/signed-long >expected &&
+	git for-each-ref --format="%(contents:raw)" refs/tags/signed-long >actual &&
+	sanitize_pgp <expected >expected.clean &&
+	sanitize_pgp <actual >actual.clean &&
+	echo "" >>expected.clean &&
+	test_cmp expected.clean actual.clean
+'
+
 test_expect_success 'set up refs pointing to tree and blob' '
 	git update-ref refs/mytrees/first refs/heads/main^{tree} &&
 	git update-ref refs/myblobs/first refs/heads/main:one
@@ -731,6 +763,14 @@ test_expect_success 'basic atom: refs/mytrees/first contents' '
 	test_cmp size_expected actual
 '
 
+test_expect_success 'basic atom: refs/mytrees/first contents:raw' '
+	git cat-file tree refs/mytrees/first >expected &&
+	cat expected | wc -c >size_expected &&
+	echo "" >>expected &&
+	git for-each-ref --format="%(contents:raw)" refs/mytrees/first >actual &&
+	test_cmp expected actual
+'
+
 test_expect_success 'basic atom: refs/mytrees/first contents with --python' '
 	cat >expected <<-\EOF &&
 	0000000 030447 030060 032066 020064 067157 000145 157155 153143
@@ -803,6 +843,14 @@ test_expect_success 'basic atom: refs/myblobs/first contents' '
 	test_cmp size_expected actual
 '
 
+test_expect_success 'basic atom: refs/myblobs/first contents:raw' '
+	git cat-file blob refs/myblobs/first >expected &&
+	cat expected | wc -c >size_expected &&
+	echo "" >>expected &&
+	git for-each-ref --format="%(contents:raw)" refs/myblobs/first >actual &&
+	test_cmp expected actual
+'
+
 test_expect_success 'set up refs pointing to binary blob' '
 	printf "%b" "a\0b\0c" >blob1 &&
 	printf "%b" "a\0c\0b" >blob2 &&
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/3] [GSOC][RFC] ref-filter: add contents:raw atom
  2021-05-23  9:53 [PATCH 0/3] [GSOC][RFC] ref-filter: add contents:raw atom ZheNing Hu via GitGitGadget
                   ` (2 preceding siblings ...)
  2021-05-23  9:53 ` [PATCH 3/3] [GSOC] ref-filter: add contents:raw atom ZheNing Hu via GitGitGadget
@ 2021-05-24  1:09 ` Junio C Hamano
  2021-05-24  2:41   ` Felipe Contreras
                     ` (2 more replies)
  3 siblings, 3 replies; 18+ messages in thread
From: Junio C Hamano @ 2021-05-24  1:09 UTC (permalink / raw)
  To: ZheNing Hu via GitGitGadget
  Cc: git, Christian Couder, Hariom Verma, Karthik Nayak, ZheNing Hu

"ZheNing Hu via GitGitGadget" <gitgitgadget@gmail.com> writes:

> In (a2f3241: [GSOC] ref-filter: add contents:raw atom) I did not notice the
> ...

Before going into any of these details, remember that the cover
letter is where you first sell the series to the readers.  Why is it
worth their time to read it?  What problem does it solve in the
bigger picture?  Mention that we want to let "cat-file --batch" use
the ref-filter --format logic, if that is the primary reason why we
have these three patches, for example.

I actually do not know if a modified form of %(contents) is a good
match for this feature.  It was invented as a way to extract the
unstructured "free text" part of structured textual objects (namely,
commits and tags) so it is natural that it would not apply to trees
(they are structured and there is no unstractured "free text" part)
nor blobs (they are totally unstructured and does not even have to
be text).

Is there another word to refer to the entire payload unadulterated?

> git for-each-ref --format="%(contents)" --python refs/mytrees/first
>
> will output a string processed by python_quote_buf_with_size(), which
> contains'\0'. But the binary files seem to be useless after quoting. Should
> we allow these binary files to be output in the default way with
> strbuf_add()? If so, we can remove the first patch.

The --language option is designed to be used to write a small script
in the language and used like this:

    git for-each-ref --format='
		name=%(refname)
		var=%(placeholder)
                mkdir -p "$(dirname "$name")"
		printf "%%s" "$var" >"$name"
    ' --shell | /bin/sh

Note that %(refname) and %(placeholder) in the --format string is
not quoted at all; the "--shell" option knows how values are quoted
in the host language (shell) and writes single-quotes around
%(refname).  If %(placeholder) produces something with a single-quote
in it, that will (eh, at least "should") be quoted appropriately.

So It does not make any sense not to quote a value that comes from
%(placeholder), whether it is binary or not, to match the syntax of
the host language you are making the "for-each-ref --format=" to
write such a script in.

So, "binary files seem to be useless after quoting" is a
misunderstanding.  They are useless if you do not quote them.

Thanks.

P.S. I am mostly offline today, and response will be slow.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/3] [GSOC][RFC] ref-filter: add contents:raw atom
  2021-05-24  1:09 ` [PATCH 0/3] [GSOC][RFC] " Junio C Hamano
@ 2021-05-24  2:41   ` Felipe Contreras
  2021-05-24  5:22     ` Bagas Sanjaya
  2021-05-24 13:09   ` ZheNing Hu
  2021-05-26  0:56   ` Junio C Hamano
  2 siblings, 1 reply; 18+ messages in thread
From: Felipe Contreras @ 2021-05-24  2:41 UTC (permalink / raw)
  To: Junio C Hamano, ZheNing Hu via GitGitGadget
  Cc: git, Christian Couder, Hariom Verma, Karthik Nayak, ZheNing Hu

Junio C Hamano wrote:

> Is there another word to refer to the entire payload unadulterated?

Another word better than "raw"? I don't think so: basic, crude, green,
rough.

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/3] [GSOC][RFC] ref-filter: add contents:raw atom
  2021-05-24  2:41   ` Felipe Contreras
@ 2021-05-24  5:22     ` Bagas Sanjaya
  2021-05-24 15:21       ` Junio C Hamano
  0 siblings, 1 reply; 18+ messages in thread
From: Bagas Sanjaya @ 2021-05-24  5:22 UTC (permalink / raw)
  To: Felipe Contreras, Junio C Hamano, ZheNing Hu via GitGitGadget
  Cc: git, Christian Couder, Hariom Verma, Karthik Nayak, ZheNing Hu

On 24/05/21 09.41, Felipe Contreras wrote:
> Junio C Hamano wrote:
> 
>> Is there another word to refer to the entire payload unadulterated?
> 
> Another word better than "raw"? I don't think so: basic, crude, green,
> rough.
> 

I think we should go with "raw", because the payloads we discussed here
are unmodified. It is akin to "raw data" that is processed further to
produce porcelain output, such as templating engine that process raw
HTML into HTML pages that would be served to end user.

-- 
An old man doll... just what I always wanted! - Clara

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/3] [GSOC][RFC] ref-filter: add contents:raw atom
  2021-05-24  1:09 ` [PATCH 0/3] [GSOC][RFC] " Junio C Hamano
  2021-05-24  2:41   ` Felipe Contreras
@ 2021-05-24 13:09   ` ZheNing Hu
  2021-05-26  0:56   ` Junio C Hamano
  2 siblings, 0 replies; 18+ messages in thread
From: ZheNing Hu @ 2021-05-24 13:09 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: ZheNing Hu via GitGitGadget, Git List, Christian Couder,
	Hariom Verma, Karthik Nayak, Felipe Contreras, Bagas Sanjaya

Junio C Hamano <gitster@pobox.com> 于2021年5月24日周一 上午9:09写道:
>
> "ZheNing Hu via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > In (a2f3241: [GSOC] ref-filter: add contents:raw atom) I did not notice the
> > ...
>
> Before going into any of these details, remember that the cover
> letter is where you first sell the series to the readers.  Why is it
> worth their time to read it?  What problem does it solve in the
> bigger picture?  Mention that we want to let "cat-file --batch" use
> the ref-filter --format logic, if that is the primary reason why we
> have these three patches, for example.
>

Thanks for reminding, Indeed, as you said, this cover-letter needs to
explain the patches' main purpose: let "cat-file --batch" use the ref-filter
--format logic.

> I actually do not know if a modified form of %(contents) is a good
> match for this feature.  It was invented as a way to extract the
> unstructured "free text" part of structured textual objects (namely,
> commits and tags) so it is natural that it would not apply to trees
> (they are structured and there is no unstractured "free text" part)
> nor blobs (they are totally unstructured and does not even have to
> be text).
>

This is exactly what I worry about: Does %(contents) have good
semantics to explain things like blob, tree objects which have no
unstractured "free text" part or totally unstructured?

> Is there another word to refer to the entire payload unadulterated?
>

If you specifically mean "raw" in %(contents:raw), I agree with Felipe
Contreras or Bagas Sanjaya, "raw" seems to be enough. But if you
mean we need another atom name, then "%(binary)", "%(text)" may
be possible choices.

There is a good word in `ref-filter.c`: C_BARE, but this enum vale
corresponds to %(contents). Therefore, I cannot use the name
"%(contents:bare)".

> > git for-each-ref --format="%(contents)" --python refs/mytrees/first
> >
> > will output a string processed by python_quote_buf_with_size(), which
> > contains'\0'. But the binary files seem to be useless after quoting. Should
> > we allow these binary files to be output in the default way with
> > strbuf_add()? If so, we can remove the first patch.
>
> The --language option is designed to be used to write a small script
> in the language and used like this:
>
>     git for-each-ref --format='
>                 name=%(refname)
>                 var=%(placeholder)
>                 mkdir -p "$(dirname "$name")"
>                 printf "%%s" "$var" >"$name"
>     ' --shell | /bin/sh
>
> Note that %(refname) and %(placeholder) in the --format string is
> not quoted at all; the "--shell" option knows how values are quoted
> in the host language (shell) and writes single-quotes around
> %(refname).  If %(placeholder) produces something with a single-quote
> in it, that will (eh, at least "should") be quoted appropriately.
>

Well, now I have also noticed that --language can make these atoms be
surrounded by single quotes, special characters in the output will be
escaped e.g. "a'b" will turn to "'a\'b'", In this way, shell, python... can
directly quote the content.

> So It does not make any sense not to quote a value that comes from
> %(placeholder), whether it is binary or not, to match the syntax of
> the host language you are making the "for-each-ref --format=" to
> write such a script in.
>
> So, "binary files seem to be useless after quoting" is a
> misunderstanding.  They are useless if you do not quote them.
>

Now I agree that It seems that the content of the blob or tree is
meaningful after being quoted.

> Thanks.
>
> P.S. I am mostly offline today, and response will be slow.
>

Thanks!
--
ZheNing Hu

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/3] [GSOC][RFC] ref-filter: add contents:raw atom
  2021-05-24  5:22     ` Bagas Sanjaya
@ 2021-05-24 15:21       ` Junio C Hamano
  0 siblings, 0 replies; 18+ messages in thread
From: Junio C Hamano @ 2021-05-24 15:21 UTC (permalink / raw)
  To: Bagas Sanjaya
  Cc: Felipe Contreras, ZheNing Hu via GitGitGadget, git,
	Christian Couder, Hariom Verma, Karthik Nayak, ZheNing Hu

Bagas Sanjaya <bagasdotme@gmail.com> writes:

> I think we should go with "raw", because the payloads we discussed here
> are unmodified. It is akin to "raw data" that is processed further to
> produce porcelain output, such as templating engine that process raw
> HTML into HTML pages that would be served to end user.

Oh, I can certainly live with --format='%(raw)'; I just view that
%(contents:raw) is problematic, especially if we meant to apply it
to trees and blobs, because they are not what %(contents) is about.

Thanks.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/3] [GSOC] ref-filter: support %(contents) for blob, tree
  2021-05-23  9:53 ` [PATCH 2/3] [GSOC] ref-filter: support %(contents) for blob, tree ZheNing Hu via GitGitGadget
@ 2021-05-25  5:03   ` Junio C Hamano
  2021-05-25  5:47     ` Junio C Hamano
  0 siblings, 1 reply; 18+ messages in thread
From: Junio C Hamano @ 2021-05-25  5:03 UTC (permalink / raw)
  To: ZheNing Hu via GitGitGadget
  Cc: git, Christian Couder, Hariom Verma, Karthik Nayak, ZheNing Hu

"ZheNing Hu via GitGitGadget" <gitgitgadget@gmail.com> writes:

> @@ -1292,7 +1326,8 @@ static void append_lines(struct strbuf *out, const char *buf, unsigned long size
>  }
>  
>  /* See grab_values */
> -static void grab_sub_body_contents(struct atom_value *val, int deref, void *buf)


As I already said, I do not think %(contents) mixes well with this
use for trees and blobs where you give the raw bytes, since
%(contents) for commit and tag objects was never about the full byte
sequence of the object.  It was to give the unstructured part meant
for human consumption, stripping the structured "header" part of the
object.

Nevertheless, since queuing this topic breaks the build and gets in
the way to find issues in other proposed regression fixes of higher
importance,...

> +static void grab_contents(struct atom_value *val, int deref, void *buf,
> +			  unsigned long buf_size, enum object_type object_type)
>  {
> ...
> +		switch (object_type) {
> +		case OBJ_TAG:
> +		case OBJ_COMMIT: {
> ...
> +				v->s = xmemdupz(bodypos, bodylen);
> +			else if (atom->u.contents.option == C_LENGTH)
> +				v->s = xstrfmt("%"PRIuMAX, (uintmax_t)strlen(subpos));
> ...

Note that this part inherits from the original and knows how to feed
a value to PRIuMAX with correct cast ...

> +				v->s = xmemdupz(bodypos, nonsiglen);
> +			else if (atom->u.contents.option == C_SIG)
> ...
> +		}
> +		case OBJ_BLOB:
> +		case OBJ_TREE: {
> +			if (atom->u.contents.option == C_BARE) {
> +				v->s_size = buf_size;
> +				v->s = xmemdupz(buf, buf_size);
> +			} else if (atom->u.contents.option == C_LENGTH)
> +				v->s = xstrfmt("%"PRIuMAX, buf_size);

... but this one gets sloppy, and breaks the windows-build of
'seen'.  Fix is simple:

+				v->s = xstrfmt("%"PRIuMAX, (uintmax_t)buf_size);

I'll squash it in before rebuilding 'seen'.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/3] [GSOC] ref-filter: support %(contents) for blob, tree
  2021-05-25  5:03   ` Junio C Hamano
@ 2021-05-25  5:47     ` Junio C Hamano
  2021-05-25  9:28       ` ZheNing Hu
  0 siblings, 1 reply; 18+ messages in thread
From: Junio C Hamano @ 2021-05-25  5:47 UTC (permalink / raw)
  To: ZheNing Hu via GitGitGadget
  Cc: git, Christian Couder, Hariom Verma, Karthik Nayak, ZheNing Hu

Junio C Hamano <gitster@pobox.com> writes:

> As I already said, I do not think %(contents) mixes well with this
> use for trees and blobs where you give the raw bytes, since
> %(contents) for commit and tag objects was never about the full byte
> sequence of the object.  It was to give the unstructured part meant
> for human consumption, stripping the structured "header" part of the
> object.

To extend on this point a bit (even though this is not all that
urgent during the prerelease freeze), conceptually, the %(content)
field is understood in the larger picture like this:

+--- (the whole thing) ----------------------------------------------+
|                                                                    |
| +--- "header" ---------------------------------------------------+ |
| | tree 678c03dca0a26afd746e8c8bb9e4aadc8bf479b1                  | |
| | parent 378c7c6ad48c4ccddf9b534616a0e86f28440bd3                | |
| | author Junio C Hamano <gitster@pobox.com> 1621675665 +0900     | |
| | committer Junio C Hamano <gitster@pobox.com> 1621675741 +0900  | |
| +----------------------------------------------------------------+ |
|                                                                    |
| +--- "contents" -------------------------------------------------+ |
| |                                                                | |
| | +--- "subject" ----------------------------------------------+ | |
| | | Git 2.32-rc1                                               | | |
| | +------------------------------------------------------------+ | |
| |                                                                | |
| | +--- "body" -------------------------------------------------+ | |
| | | Signed-off-by: Junio C Hamano <gitster@pobox.com>          | | |
| | +------------------------------------------------------------+ | |
| |                                                                | |
| +----------------------------------------------------------------+ |
|                                                                    |
+--------------------------------------------------------------------+

Even though %(header), when it is invented, would make perfect sense
for commits and tags, it will never make sense for trees and blobs.
Which means "contents", which is "the whole thing except for the
header part", would not, either.

There is no %(placeholder) to ask for "the whole thing", and that is
what you want to use for cat-file --batch if I am not mistaken, and
adding one would be a good idea.  There is no %(header) yet, either,
but if somebody needs it for their scripts, it is clear where it fits
in the picture.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/3] [GSOC] ref-filter: support %(contents) for blob, tree
  2021-05-25  5:47     ` Junio C Hamano
@ 2021-05-25  9:28       ` ZheNing Hu
  2021-05-25 17:11         ` Junio C Hamano
  0 siblings, 1 reply; 18+ messages in thread
From: ZheNing Hu @ 2021-05-25  9:28 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: ZheNing Hu via GitGitGadget, Git List, Christian Couder,
	Hariom Verma, Karthik Nayak

>
> To extend on this point a bit (even though this is not all that
> urgent during the prerelease freeze), conceptually, the %(content)
> field is understood in the larger picture like this:
>
> +--- (the whole thing) ----------------------------------------------+
> |                                                                    |
> | +--- "header" ---------------------------------------------------+ |
> | | tree 678c03dca0a26afd746e8c8bb9e4aadc8bf479b1                  | |
> | | parent 378c7c6ad48c4ccddf9b534616a0e86f28440bd3                | |
> | | author Junio C Hamano <gitster@pobox.com> 1621675665 +0900     | |
> | | committer Junio C Hamano <gitster@pobox.com> 1621675741 +0900  | |
> | +----------------------------------------------------------------+ |
> |                                                                    |
> | +--- "contents" -------------------------------------------------+ |
> | |                                                                | |
> | | +--- "subject" ----------------------------------------------+ | |
> | | | Git 2.32-rc1                                               | | |
> | | +------------------------------------------------------------+ | |
> | |                                                                | |
> | | +--- "body" -------------------------------------------------+ | |
> | | | Signed-off-by: Junio C Hamano <gitster@pobox.com>          | | |
> | | +------------------------------------------------------------+ | |
> | |                                                                | |
> | +----------------------------------------------------------------+ |
> |                                                                    |
> +--------------------------------------------------------------------+
>

Thank you for providing such a complete view. This also means
that the "raw" of contents and tag will contain two parts: "header"
and "contents". But for blobs and trees, they don’t have these things.

> Even though %(header), when it is invented, would make perfect sense
> for commits and tags, it will never make sense for trees and blobs.
> Which means "contents", which is "the whole thing except for the
> header part", would not, either.
>

Although we don’t have a %(header), but in fact we already have fragments
of "%(numparent)", "%(parent)" %(tree)" (see grab_commit_values()) and
"%(tag)"," %(type)","%(object)" (see grab_tag_values()), but they are not
obtained through the "header" part of the raw object buffer.

> There is no %(placeholder) to ask for "the whole thing", and that is
> what you want to use for cat-file --batch if I am not mistaken, and
> adding one would be a good idea.  There is no %(header) yet, either,
> but if somebody needs it for their scripts, it is clear where it fits
> in the picture.
>

So I don't know if adding %(header) will cause duplication of functions.

Thanks!
--
ZheNing Hu

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/3] [GSOC] ref-filter: support %(contents) for blob, tree
  2021-05-25  9:28       ` ZheNing Hu
@ 2021-05-25 17:11         ` Junio C Hamano
  2021-05-26  7:48           ` ZheNing Hu
  0 siblings, 1 reply; 18+ messages in thread
From: Junio C Hamano @ 2021-05-25 17:11 UTC (permalink / raw)
  To: ZheNing Hu
  Cc: ZheNing Hu via GitGitGadget, Git List, Christian Couder,
	Hariom Verma, Karthik Nayak

ZheNing Hu <adlternative@gmail.com> writes:

> So I don't know if adding %(header) will cause duplication of functions.

I do not think you can duplicate the feature of %(header) with
other existing placeholders.  You may want to dump the headers of
series of commits so that you can find if there is any commit with
unusal header elements, but %(tree), %(parents), etc. would by
definition be the known ones, so any combination of them will not be
a substitute.

But nobody is asking for it right now, and your cat-file --batch
does not need it right away.

What I wanted to say in the message you are responding to was *not*
that you would want to add %(header) for completeness right now.
But thinking beyond your immediate need (i.e. the "whole thing") and
imagining how various pieces, including the ones that do not exist
yet, would fit together, would help you to avoid designing the parts
you immediately need in a way you would regret in the future.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/3] [GSOC][RFC] ref-filter: add contents:raw atom
  2021-05-24  1:09 ` [PATCH 0/3] [GSOC][RFC] " Junio C Hamano
  2021-05-24  2:41   ` Felipe Contreras
  2021-05-24 13:09   ` ZheNing Hu
@ 2021-05-26  0:56   ` Junio C Hamano
  2021-05-26  6:45     ` ZheNing Hu
  2 siblings, 1 reply; 18+ messages in thread
From: Junio C Hamano @ 2021-05-26  0:56 UTC (permalink / raw)
  To: ZheNing Hu via GitGitGadget
  Cc: git, Christian Couder, Hariom Verma, Karthik Nayak, ZheNing Hu

Junio C Hamano <gitster@pobox.com> writes:

>> git for-each-ref --format="%(contents)" --python refs/mytrees/first
>>
>> will output a string processed by python_quote_buf_with_size(), which
>> contains'\0'. But the binary files seem to be useless after quoting. Should
>> we allow these binary files to be output in the default way with
>> strbuf_add()? If so, we can remove the first patch.
>
> The --language option is designed to be used to write a small script
> in the language and used like this:
>
>     git for-each-ref --format='
> 		name=%(refname)
> 		var=%(placeholder)
>                 mkdir -p "$(dirname "$name")"
> 		printf "%%s" "$var" >"$name"
>     ' --shell | /bin/sh
>
> Note that %(refname) and %(placeholder) in the --format string is
> not quoted at all; the "--shell" option knows how values are quoted
> in the host language (shell) and writes single-quotes around
> %(refname).  If %(placeholder) produces something with a single-quote
> in it, that will (eh, at least "should") be quoted appropriately.
>
> So It does not make any sense not to quote a value that comes from
> %(placeholder), whether it is binary or not, to match the syntax of
> the host language you are making the "for-each-ref --format=" to
> write such a script in.
>
> So, "binary files seem to be useless after quoting" is a
> misunderstanding.  They are useless if you do not quote them.

Another thing to keep in mind is that not all host languages may be
capable of expressing a string with NUL in it.  Most notably, shell.
The --shell quoting rule used by for-each-ref would produce an
equivalent of the "script" produced like this:

	$ tr Q '\000' >script <<\EOF
	#!/bin/sh
	varname='varQname'
	echo "$varname"
	EOF

but I do not think it would say 'var' followed by a NUL followed by
'name'.  The NUL is likely lost when assigned to the variable.

So for some host languages, binaries may be useless with or without
quoting.  But for ones that can use strings to hold arbitrary byte
sequence, it should be OK to let for-each-ref to quote the byte
sequence as a string literal for the language (so that the exact
byte sequence will end up being in the variable after assignment).

That reminds me of another thing.  The --python thing was written
back when Python3 was still a distant dream and strings were the
appropriate type for a random sequence of bytes (as opposed to
unicode, which cannot have a random sequence of bytes).  Somebody
needs to check if it needs any update to work with Python3.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/3] [GSOC][RFC] ref-filter: add contents:raw atom
  2021-05-26  0:56   ` Junio C Hamano
@ 2021-05-26  6:45     ` ZheNing Hu
  2021-05-26  7:06       ` Junio C Hamano
  0 siblings, 1 reply; 18+ messages in thread
From: ZheNing Hu @ 2021-05-26  6:45 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: ZheNing Hu via GitGitGadget, Git List, Christian Couder,
	Hariom Verma, Karthik Nayak

>
> Another thing to keep in mind is that not all host languages may be
> capable of expressing a string with NUL in it.  Most notably, shell.
> The --shell quoting rule used by for-each-ref would produce an
> equivalent of the "script" produced like this:
>
>         $ tr Q '\000' >script <<\EOF
>         #!/bin/sh
>         varname='varQname'
>         echo "$varname"
>         EOF
>
> but I do not think it would say 'var' followed by a NUL followed by
> 'name'.  The NUL is likely lost when assigned to the variable.
>

Yes, in the following example you mentioned earlier, I have also
noticed the loss of '\0'.

> >     git for-each-ref --format='
> >               name=%(refname)
> >               var=%(placeholder)
> >                 mkdir -p "$(dirname "$name")"
> >               printf "%%s" "$var" >"$name"
> >     ' --shell | /bin/sh
> >

> So for some host languages, binaries may be useless with or without
> quoting.  But for ones that can use strings to hold arbitrary byte
> sequence, it should be OK to let for-each-ref to quote the byte
> sequence as a string literal for the language (so that the exact
> byte sequence will end up being in the variable after assignment).
>

I agree, and maybe some'\0' can be escaped appropriately to let host
languages recognize it....

> That reminds me of another thing.  The --python thing was written
> back when Python3 was still a distant dream and strings were the
> appropriate type for a random sequence of bytes (as opposed to
> unicode, which cannot have a random sequence of bytes).  Somebody
> needs to check if it needs any update to work with Python3.

$ printf '%b' "name='a\\\0b\\\0c'\nprint(name)" | python2.7 | od -c
0000000   a  \0   b  \0   c  \n
0000006

$ printf '%b' "name='a\\\0b\\\0c'\necho -e \"\$name\"" | sh | od -c
0000000   a  \0   b  \0   c  \n
0000006

In shell or python2/3, we can replace'\0' with "\\0".

In Tcl and perl, they are ok with '\0'.

$ printf '%b' "set name \"a\0b\0c\"\nputs \$name" | tclsh | od -c
0000000   '   a  \0   b  \0   c   '  \n
0000010

$ printf '%b' "\$name = 'a\0b\0c';\n print \"\$name\""  | perl | od -c
0000000   a  \0   b  \0   c
0000005

So I currently think that a possible strategy is to modify
`python_quote_buf_with_size()` and `sq_quote_buf_with_size()` from
'\0' to "\\0".

Thanks!

--
ZheNing Hu

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/3] [GSOC][RFC] ref-filter: add contents:raw atom
  2021-05-26  6:45     ` ZheNing Hu
@ 2021-05-26  7:06       ` Junio C Hamano
  2021-05-26  9:17         ` ZheNing Hu
  0 siblings, 1 reply; 18+ messages in thread
From: Junio C Hamano @ 2021-05-26  7:06 UTC (permalink / raw)
  To: ZheNing Hu
  Cc: ZheNing Hu via GitGitGadget, Git List, Christian Couder,
	Hariom Verma, Karthik Nayak

ZheNing Hu <adlternative@gmail.com> writes:

> $ printf '%b' "name='a\\\0b\\\0c'\necho -e \"\$name\"" | sh | od -c
> 0000000   a  \0   b  \0   c  \n
> 0000006

This is wrong.  In the above, the variable name does not have a NUL
in it.  It insead has 2-byte sequence "\" and "0" in it, and you are
letting "echo -e" to convert it into binary, which is not portable
at all.

I'd suggest instead to declare that some host languages, like shell,
are not binary-clean and either refuse to process atoms whose values
have NUL in them.  Silently truncating strings at NUL or striping
NULs in strings are also acceptable options if clearly documented.
Claiming that we stuff binaries into variables of the host language,
while not doing so and instead assigning a quoted form, is not good.

I have not thought about Python3 very much.  For the purpose of most
%(placeholders), it is vastly more preferrable to use str (i.e. text
sequence type) as opposed to bytes, as you do not have to .decode()
to use the resulting "string", but even for things like %(refname),
it is not technically kosher to assume that the contents are UTF-8
encoded text, as filenames used to represent refnames are merely a
sequence of bytes except NUL, but for consistency with binary gunk,
we might have to emit everything as bytes.  I dunno.

> In shell or python2/3, we can replace'\0' with "\\0".

Not for shell.  We should declare that it is not supported to feed
binary to shell.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/3] [GSOC] ref-filter: support %(contents) for blob, tree
  2021-05-25 17:11         ` Junio C Hamano
@ 2021-05-26  7:48           ` ZheNing Hu
  0 siblings, 0 replies; 18+ messages in thread
From: ZheNing Hu @ 2021-05-26  7:48 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: ZheNing Hu via GitGitGadget, Git List, Christian Couder,
	Hariom Verma, Karthik Nayak

Junio C Hamano <gitster@pobox.com> 于2021年5月26日周三 上午1:11写道:
>
> ZheNing Hu <adlternative@gmail.com> writes:
>
> > So I don't know if adding %(header) will cause duplication of functions.
>
> I do not think you can duplicate the feature of %(header) with
> other existing placeholders.  You may want to dump the headers of
> series of commits so that you can find if there is any commit with
> unusal header elements, but %(tree), %(parents), etc. would by
> definition be the known ones, so any combination of them will not be
> a substitute.
>

Well, "find if there is any commit with unusal header elements"
is indeed one thing that may need to be done (not now).

I tried to build "%(header)" yesterday and found some problems:

1. These atoms are too "flat":
"tree", "parent", "numparent", "object"... they are all part of the
header, why not use "%(header:tree)" like "%(contents:subject)"
does? Similarly, why not use "%(author:name)" instead of "%(authorname)"...

2. Why is there no state machine to save the parsing state of an object?
`parse_tag_buffer()`, `parse_commit_buffer()`, they only parse part of the
content of the object. E.g. tag parsed "object", "type", "tag" and "taggerdate"
in `parse_tag_buffer()`, but it did not save the starting position and length of
the parsed fields, which caused `grab_person()`, `grab_date()` to parse the
object again and again. I think this may be an optimization point.

> But nobody is asking for it right now, and your cat-file --batch
> does not need it right away.
>

Or we only provide the simplest "%(header)" and "%(header:size)"
(it can be easily supported with "%(raw)"), leaving the other feature
to the future.

> What I wanted to say in the message you are responding to was *not*
> that you would want to add %(header) for completeness right now.
> But thinking beyond your immediate need (i.e. the "whole thing") and
> imagining how various pieces, including the ones that do not exist
> yet, would fit together, would help you to avoid designing the parts
> you immediately need in a way you would regret in the future.

As I said before, many atomic designs are not very regular.
Our ultimate goal may be we can simultaneously support:

"%(raw)"
"%(raw:header)"
"%(raw:header:tagger)"
"%(raw:header:tagger:eamil)"
"%(raw:header:taggereamil)"
"%(header:taggereamil)"
"%(header:tagger:eamil)"
"%(taggereamil)"
"%(tagger:eamil)"
...

Each layer is a namespace. Of course, it may take a long
distance to support this.

Thank.
--
ZheNing Hu

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/3] [GSOC][RFC] ref-filter: add contents:raw atom
  2021-05-26  7:06       ` Junio C Hamano
@ 2021-05-26  9:17         ` ZheNing Hu
  0 siblings, 0 replies; 18+ messages in thread
From: ZheNing Hu @ 2021-05-26  9:17 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: ZheNing Hu via GitGitGadget, Git List, Christian Couder,
	Hariom Verma, Karthik Nayak

Junio C Hamano <gitster@pobox.com> 于2021年5月26日周三 下午3:06写道:
>
> ZheNing Hu <adlternative@gmail.com> writes:
>
> > $ printf '%b' "name='a\\\0b\\\0c'\necho -e \"\$name\"" | sh | od -c
> > 0000000   a  \0   b  \0   c  \n
> > 0000006
>
> This is wrong.  In the above, the variable name does not have a NUL
> in it.  It insead has 2-byte sequence "\" and "0" in it, and you are
> letting "echo -e" to convert it into binary, which is not portable
> at all.
>

Indeed I was wrong, the var name does not contain '\0'.

> I'd suggest instead to declare that some host languages, like shell,
> are not binary-clean and either refuse to process atoms whose values
> have NUL in them.  Silently truncating strings at NUL or striping
> NULs in strings are also acceptable options if clearly documented.
> Claiming that we stuff binaries into variables of the host language,
> while not doing so and instead assigning a quoted form, is not good.
>

Makes sense. Either choose to truncate, or choose to reject.

> I have not thought about Python3 very much.  For the purpose of most
> %(placeholders), it is vastly more preferrable to use str (i.e. text
> sequence type) as opposed to bytes, as you do not have to .decode()
> to use the resulting "string", but even for things like %(refname),
> it is not technically kosher to assume that the contents are UTF-8
> encoded text, as filenames used to represent refnames are merely a
> sequence of bytes except NUL, but for consistency with binary gunk,
> we might have to emit everything as bytes.  I dunno.
>
> > In shell or python2/3, we can replace'\0' with "\\0".
>
> Not for shell.  We should declare that it is not supported to feed
> binary to shell.

Eh, it seems that we adopt a "reject" strategy.

$ git hash-object a.out -w | xargs git update-ref refs/myblobs/aoutblob
$ git for-each-ref --format="name=%(raw)" refs/myblobs/aoutblob
--python | python2
  File "<stdin>", line 1
SyntaxError: Non-ASCII character '\x8b' in file <stdin> on line 2, but
no encoding declared;
 see http://python.org/dev/peps/pep-0263/ for details

$ git for-each-ref --format="name=%(raw)" refs/myblobs/aoutblob
--python |python3
SyntaxError: Non-UTF-8 code starting with '\x8b' in file <stdin> on
line 2, but no encoding declared;
 see http://python.org/dev/peps/pep-0263/ for details

It seems that --python also needs to "reject", no matter python2 or python3.
What about tcl and perl?

$ cat a.out | od >1.od
$ git for-each-ref --format="set name %(raw)
puts -nonewline \$name" refs/myblobs/aoutblob --tcl | tclsh | od > 2.od
$ diff 1.od 2.od | head
7,12c7,12
< 0000140 114303 000002 000000 000000 141400 001230 000000 000000
< 0000160 000000 000010 000000 000000 000000 000003 000000 000004
< 0000200 000000 001430 000000 000000 000000 001430 000000 000000
< 0000220 000000 001430 000000 000000 000000 000034 000000 000000
< 0000240 000000 000034 000000 000000 000000 000001 000000 000000
< 0000260 000000 000001 000000 000004 000000 000000 000000 000000
---
> 0000140 001330 000000 000000 000000 001330 000000 000000 000000
> 0000160 000010 000000 000000 000000 000003 000000 000004 000000
> 0000200 001430 000000 000000 000000 001430 000000 000000 000000
> 0000220 001430 000000 000000 000000 000034 000000 000000 000000
> 0000240 000034 000000 000000 000000 000001 000000 000000 000000
> 0000260 000001 000000 000004 000000 000000 000000 000000 000000

It seems that a.out contents passed into tcl and then the output is
very different...

But,

$ cat a.out | od >1.od
$ git for-each-ref --format="\$name= %(raw);
print \"\$name\"" refs/myblobs/aoutblob --perl | perl | od >6.od
$ diff 1.od 2.od

There was no error this time, so for perl, it's ok...
The "binary security" we care about is currently only complied with
by the Perl language.

So I think we better reject them all languages together for normative.
The clear definition of this rejection strategy is that %(raw) and --language
cannot be used at the same time. If our binary data is passed to a variable
in the host language, there may be escape errors for the host language.

Thanks.
--
ZheNing Hu

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2021-05-26  9:18 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-23  9:53 [PATCH 0/3] [GSOC][RFC] ref-filter: add contents:raw atom ZheNing Hu via GitGitGadget
2021-05-23  9:53 ` [PATCH 1/3] [GSOC] quote: add *.quote_buf_with_size functions ZheNing Hu via GitGitGadget
2021-05-23  9:53 ` [PATCH 2/3] [GSOC] ref-filter: support %(contents) for blob, tree ZheNing Hu via GitGitGadget
2021-05-25  5:03   ` Junio C Hamano
2021-05-25  5:47     ` Junio C Hamano
2021-05-25  9:28       ` ZheNing Hu
2021-05-25 17:11         ` Junio C Hamano
2021-05-26  7:48           ` ZheNing Hu
2021-05-23  9:53 ` [PATCH 3/3] [GSOC] ref-filter: add contents:raw atom ZheNing Hu via GitGitGadget
2021-05-24  1:09 ` [PATCH 0/3] [GSOC][RFC] " Junio C Hamano
2021-05-24  2:41   ` Felipe Contreras
2021-05-24  5:22     ` Bagas Sanjaya
2021-05-24 15:21       ` Junio C Hamano
2021-05-24 13:09   ` ZheNing Hu
2021-05-26  0:56   ` Junio C Hamano
2021-05-26  6:45     ` ZheNing Hu
2021-05-26  7:06       ` Junio C Hamano
2021-05-26  9:17         ` ZheNing Hu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).