git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] [GSOC] ref-filter: add %(raw) atom
@ 2021-05-27 14:43 ZheNing Hu via GitGitGadget
  2021-05-27 14:43 ` [PATCH 1/2] " ZheNing Hu via GitGitGadget
                   ` (3 more replies)
  0 siblings, 4 replies; 41+ messages in thread
From: ZheNing Hu via GitGitGadget @ 2021-05-27 14:43 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Christian Couder, Hariom Verma, Karthik Nayak,
	Felipe Contreras, Bagas Sanjaya, Jeff King, ZheNing Hu

In order to make git cat-file --batch use ref-filter logic, I add %(raw)
atom to ref-filter.

Change from last version:

 1. In my discussion with Junio, I came to the conclusion that
    --format="%(raw)" should not be used with --python, --perl, --shell,
    --tcl. Therefore, die if both --format="%(raw)" and

--language are given in parse_ref_filter_atom(). The reason I don't move
this part to raw_atom_parser() is if I move it to raw_atom_parser(), when we
use:

git --format=%raw --sort=raw --python`

Git will continue to run instead of die because parse_sorting_atom() will
use a dummy ref_format and don't remember --language details, next time
format_ref_array_item() will reuse the used_atom entry of sorting atom in
parse_ref_filter_atom(), This will skip the check in raw_atom_parser(). 2.
Give atom_value.s_size a init value ATOM_VALUE_S_SIZE_INIT (-1), which can
help us distinguish an object whose length is 0 and an object whose s_size
has not been modified after initialization. 3. Add %(header) atom.

ZheNing Hu (2):
  [GSOC] ref-filter: add %(raw) atom
  [GSOC] ref-filter: add %(header) atom

 Documentation/git-for-each-ref.txt |  21 +++
 ref-filter.c                       | 182 ++++++++++++++++++----
 t/t6300-for-each-ref.sh            | 236 +++++++++++++++++++++++++++++
 3 files changed, 412 insertions(+), 27 deletions(-)


base-commit: 5d5b1473453400224ebb126bf3947e0a3276bdf5
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-963%2Fadlternative%2Fref-filter-raw-atom-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-963/adlternative/ref-filter-raw-atom-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/963
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 1/2] [GSOC] ref-filter: add %(raw) atom
  2021-05-27 14:43 [PATCH 0/2] [GSOC] ref-filter: add %(raw) atom ZheNing Hu via GitGitGadget
@ 2021-05-27 14:43 ` ZheNing Hu via GitGitGadget
  2021-05-27 16:36   ` Felipe Contreras
  2021-05-28  3:03   ` Junio C Hamano
  2021-05-27 14:43 ` [PATCH 2/2] [GSOC] ref-filter: add %(header) atom ZheNing Hu via GitGitGadget
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 41+ messages in thread
From: ZheNing Hu via GitGitGadget @ 2021-05-27 14:43 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Christian Couder, Hariom Verma, Karthik Nayak,
	Felipe Contreras, Bagas Sanjaya, Jeff King, ZheNing Hu,
	ZheNing Hu

From: ZheNing Hu <adlternative@gmail.com>

Add new formatting option `%(raw)`, which will print the raw
object data without any changes. It will help further to migrate
all cat-file formatting logic from cat-file to ref-filter.

The raw data of blob, tree objects may contain '\0', but most of
the logic in `ref-filter` depands on the output of the atom being
a structured string (end with '\0').

E.g. `quote_formatting()` use `strbuf_addstr()` or `*._quote_buf()`
add the data to the buffer. The raw data of a tree object is
`100644 one\0...`, only the `100644 one` will be added to the buffer,
which is incorrect.

Therefore, add a new member in `struct atom_value`: `s_size`, which
can record raw object size, it can help us add raw object data to
the buffer or compare two buffers which contain raw object data.

Beyond, `--format=%(raw)` should not combine with `--python`, `--shell`,
`--tcl`, `--perl` because if our binary raw data is passed to a variable
in the host language, the host languages may cause escape errors.

Based-on-patch-by: Olga Telezhnaya <olyatelezhnaya@gmail.com>
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
---
 Documentation/git-for-each-ref.txt |  14 +++
 ref-filter.c                       | 156 +++++++++++++++++++----
 t/t6300-for-each-ref.sh            | 191 +++++++++++++++++++++++++++++
 3 files changed, 334 insertions(+), 27 deletions(-)

diff --git a/Documentation/git-for-each-ref.txt b/Documentation/git-for-each-ref.txt
index 2ae2478de706..f6ae751fd256 100644
--- a/Documentation/git-for-each-ref.txt
+++ b/Documentation/git-for-each-ref.txt
@@ -235,6 +235,20 @@ and `date` to extract the named component.  For email fields (`authoremail`,
 without angle brackets, and `:localpart` to get the part before the `@` symbol
 out of the trimmed email.
 
+The raw data in a object is `raw`, For commit and tag objects, `raw` contain
+`header` and `contents` two parts, `header` is structured part of raw data, it
+composed of "tree XXX", "parent YYY", etc lines in commits , or composed of
+"object OOO", "type TTT", etc lines in tags; `contents` is unstructured "free
+text" part of raw object data. For blob and tree objects, their raw data don't
+have `header` and `contents` parts.
+
+raw:size::
+	The raw data size of the object.
+
+Note that `--format=%(raw)` should not combine with `--python`, `--shell`, `--tcl`,
+`--perl` because if our binary raw data is passed to a variable in the host language,
+the host languages may cause escape errors.
+
 The message in a commit or a tag object is `contents`, from which
 `contents:<part>` can be used to extract various parts out of:
 
diff --git a/ref-filter.c b/ref-filter.c
index 97116e12d7c4..c2abf5da7006 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -138,6 +138,9 @@ static struct used_atom {
 			struct process_trailer_options trailer_opts;
 			unsigned int nlines;
 		} contents;
+		struct {
+			enum { RAW_BARE, RAW_LENGTH } option;
+		} raw_data;
 		struct {
 			cmp_status cmp_status;
 			const char *str;
@@ -370,6 +373,18 @@ static int contents_atom_parser(const struct ref_format *format, struct used_ato
 	return 0;
 }
 
+static int raw_atom_parser(const struct ref_format *format, struct used_atom *atom,
+				const char *arg, struct strbuf *err)
+{
+	if (!arg) {
+		atom->u.raw_data.option = RAW_BARE;
+	} else if (!strcmp(arg, "size"))
+		atom->u.raw_data.option = RAW_LENGTH;
+	else
+		return strbuf_addf_ret(err, -1, _("unrecognized %%(raw) argument: %s"), arg);
+	return 0;
+}
+
 static int oid_atom_parser(const struct ref_format *format, struct used_atom *atom,
 			   const char *arg, struct strbuf *err)
 {
@@ -530,6 +545,7 @@ static struct {
 	{ "body", SOURCE_OBJ, FIELD_STR, body_atom_parser },
 	{ "trailers", SOURCE_OBJ, FIELD_STR, trailers_atom_parser },
 	{ "contents", SOURCE_OBJ, FIELD_STR, contents_atom_parser },
+	{ "raw", SOURCE_OBJ, FIELD_STR, raw_atom_parser },
 	{ "upstream", SOURCE_NONE, FIELD_STR, remote_ref_atom_parser },
 	{ "push", SOURCE_NONE, FIELD_STR, remote_ref_atom_parser },
 	{ "symref", SOURCE_NONE, FIELD_STR, refname_atom_parser },
@@ -564,12 +580,15 @@ struct ref_formatting_state {
 
 struct atom_value {
 	const char *s;
+	size_t s_size;
 	int (*handler)(struct atom_value *atomv, struct ref_formatting_state *state,
 		       struct strbuf *err);
 	uintmax_t value; /* used for sorting when not FIELD_STR */
 	struct used_atom *atom;
 };
 
+#define ATOM_VALUE_S_SIZE_INIT (-1)
+
 /*
  * Used to parse format string and sort specifiers
  */
@@ -588,6 +607,10 @@ static int parse_ref_filter_atom(const struct ref_format *format,
 		return strbuf_addf_ret(err, -1, _("malformed field name: %.*s"),
 				       (int)(ep-atom), atom);
 
+	if (format->quote_style && starts_with(sp, "raw"))
+		return strbuf_addf_ret(err, -1, _("--format=%.*s should not combine with"
+				"--python, --shell, --tcl, --perl"), (int)(ep-atom), atom);
+
 	/* Do we have the atom already used elsewhere? */
 	for (i = 0; i < used_atom_cnt; i++) {
 		int len = strlen(used_atom[i].name);
@@ -652,11 +675,14 @@ static int parse_ref_filter_atom(const struct ref_format *format,
 	return at;
 }
 
-static void quote_formatting(struct strbuf *s, const char *str, int quote_style)
+static void quote_formatting(struct strbuf *s, const char *str, size_t len, int quote_style)
 {
 	switch (quote_style) {
 	case QUOTE_NONE:
-		strbuf_addstr(s, str);
+		if (len != ATOM_VALUE_S_SIZE_INIT)
+			strbuf_add(s, str, len);
+		else
+			strbuf_addstr(s, str);
 		break;
 	case QUOTE_SHELL:
 		sq_quote_buf(s, str);
@@ -683,9 +709,12 @@ static int append_atom(struct atom_value *v, struct ref_formatting_state *state,
 	 * encountered.
 	 */
 	if (!state->stack->prev)
-		quote_formatting(&state->stack->output, v->s, state->quote_style);
+		quote_formatting(&state->stack->output, v->s, v->s_size, state->quote_style);
 	else
-		strbuf_addstr(&state->stack->output, v->s);
+		if (v->s_size != ATOM_VALUE_S_SIZE_INIT)
+			strbuf_add(&state->stack->output, v->s, v->s_size);
+		else
+			strbuf_addstr(&state->stack->output, v->s);
 	return 0;
 }
 
@@ -785,14 +814,16 @@ static int if_atom_handler(struct atom_value *atomv, struct ref_formatting_state
 	return 0;
 }
 
-static int is_empty(const char *s)
+static int is_empty(struct strbuf *buf)
 {
-	while (*s != '\0') {
-		if (!isspace(*s))
-			return 0;
+	const char *s = buf->buf;
+	size_t cur_len = 0;
+
+	while ((cur_len != buf->len) && (isspace(*s) || *s == '\0')) {
 		s++;
+		cur_len++;
 	}
-	return 1;
+	return cur_len == buf->len;
 }
 
 static int then_atom_handler(struct atom_value *atomv, struct ref_formatting_state *state,
@@ -800,6 +831,7 @@ static int then_atom_handler(struct atom_value *atomv, struct ref_formatting_sta
 {
 	struct ref_formatting_stack *cur = state->stack;
 	struct if_then_else *if_then_else = NULL;
+	size_t str_len = 0;
 
 	if (cur->at_end == if_then_else_handler)
 		if_then_else = (struct if_then_else *)cur->at_end_data;
@@ -810,18 +842,28 @@ static int then_atom_handler(struct atom_value *atomv, struct ref_formatting_sta
 	if (if_then_else->else_atom_seen)
 		return strbuf_addf_ret(err, -1, _("format: %%(then) atom used after %%(else)"));
 	if_then_else->then_atom_seen = 1;
+	if (if_then_else->str)
+		str_len = strlen(if_then_else->str);
 	/*
 	 * If the 'equals' or 'notequals' attribute is used then
 	 * perform the required comparison. If not, only non-empty
 	 * strings satisfy the 'if' condition.
 	 */
 	if (if_then_else->cmp_status == COMPARE_EQUAL) {
-		if (!strcmp(if_then_else->str, cur->output.buf))
+		if (!if_then_else->str)
+			BUG("when if_then_else->cmp_status == COMPARE_EQUAL,"
+			    "if_then_else->str must not be null");
+		if (str_len == cur->output.len &&
+		    !memcmp(if_then_else->str, cur->output.buf, cur->output.len))
 			if_then_else->condition_satisfied = 1;
 	} else if (if_then_else->cmp_status == COMPARE_UNEQUAL) {
-		if (strcmp(if_then_else->str, cur->output.buf))
+		if (!if_then_else->str)
+			BUG("when if_then_else->cmp_status == COMPARE_UNEQUAL,"
+			    "if_then_else->str must not be null");
+		if (str_len != cur->output.len ||
+		    memcmp(if_then_else->str, cur->output.buf, cur->output.len))
 			if_then_else->condition_satisfied = 1;
-	} else if (cur->output.len && !is_empty(cur->output.buf))
+	} else if (cur->output.len && !is_empty(&cur->output))
 		if_then_else->condition_satisfied = 1;
 	strbuf_reset(&cur->output);
 	return 0;
@@ -867,7 +909,7 @@ static int end_atom_handler(struct atom_value *atomv, struct ref_formatting_stat
 	 * only on the topmost supporting atom.
 	 */
 	if (!current->prev->prev) {
-		quote_formatting(&s, current->output.buf, state->quote_style);
+		quote_formatting(&s, current->output.buf, current->output.len, state->quote_style);
 		strbuf_swap(&current->output, &s);
 	}
 	strbuf_release(&s);
@@ -1292,7 +1334,7 @@ static void append_lines(struct strbuf *out, const char *buf, unsigned long size
 }
 
 /* See grab_values */
-static void grab_sub_body_contents(struct atom_value *val, int deref, void *buf)
+static void grab_raw_data(struct atom_value *val, int deref, void *buf, unsigned long buf_size, struct object *obj)
 {
 	int i;
 	const char *subpos = NULL, *bodypos = NULL, *sigpos = NULL;
@@ -1307,10 +1349,22 @@ static void grab_sub_body_contents(struct atom_value *val, int deref, void *buf)
 			continue;
 		if (deref)
 			name++;
-		if (strcmp(name, "body") &&
-		    !starts_with(name, "subject") &&
-		    !starts_with(name, "trailers") &&
-		    !starts_with(name, "contents"))
+
+		if (starts_with(name, "raw")) {
+			if (atom->u.raw_data.option == RAW_BARE) {
+				v->s = xmemdupz(buf, buf_size);
+				v->s_size = buf_size;
+			} else if (atom->u.raw_data.option == RAW_LENGTH)
+				v->s = xstrfmt("%"PRIuMAX, (uintmax_t)buf_size);
+			continue;
+		}
+
+		if ((obj->type != OBJ_TAG &&
+		     obj->type != OBJ_COMMIT) ||
+		    (strcmp(name, "body") &&
+		     !starts_with(name, "subject") &&
+		     !starts_with(name, "trailers") &&
+		     !starts_with(name, "contents")))
 			continue;
 		if (!subpos)
 			find_subpos(buf,
@@ -1374,25 +1428,30 @@ static void fill_missing_values(struct atom_value *val)
  * pointed at by the ref itself; otherwise it is the object the
  * ref (which is a tag) refers to.
  */
-static void grab_values(struct atom_value *val, int deref, struct object *obj, void *buf)
+static void grab_values(struct atom_value *val, int deref, struct object *obj, struct expand_data *data)
 {
+	void *buf = data->content;
+	unsigned long buf_size = data->size;
+
 	switch (obj->type) {
 	case OBJ_TAG:
 		grab_tag_values(val, deref, obj);
-		grab_sub_body_contents(val, deref, buf);
+		grab_raw_data(val, deref, buf, buf_size, obj);
 		grab_person("tagger", val, deref, buf);
 		break;
 	case OBJ_COMMIT:
 		grab_commit_values(val, deref, obj);
-		grab_sub_body_contents(val, deref, buf);
+		grab_raw_data(val, deref, buf, buf_size, obj);
 		grab_person("author", val, deref, buf);
 		grab_person("committer", val, deref, buf);
 		break;
 	case OBJ_TREE:
 		/* grab_tree_values(val, deref, obj, buf, sz); */
+		grab_raw_data(val, deref, buf, buf_size, obj);
 		break;
 	case OBJ_BLOB:
 		/* grab_blob_values(val, deref, obj, buf, sz); */
+		grab_raw_data(val, deref, buf, buf_size, obj);
 		break;
 	default:
 		die("Eh?  Object of type %d?", obj->type);
@@ -1614,7 +1673,7 @@ static int get_object(struct ref_array_item *ref, int deref, struct object **obj
 			return strbuf_addf_ret(err, -1, _("parse_object_buffer failed on %s for %s"),
 					       oid_to_hex(&oi->oid), ref->refname);
 		}
-		grab_values(ref->value, deref, *obj, oi->content);
+		grab_values(ref->value, deref, *obj, oi);
 	}
 
 	grab_common_values(ref->value, deref, oi);
@@ -1694,7 +1753,7 @@ static int populate_value(struct ref_array_item *ref, struct strbuf *err)
 		int deref = 0;
 		const char *refname;
 		struct branch *branch = NULL;
-
+		v->s_size = ATOM_VALUE_S_SIZE_INIT;
 		v->handler = append_atom;
 		v->atom = atom;
 
@@ -2297,6 +2356,25 @@ static int compare_detached_head(struct ref_array_item *a, struct ref_array_item
 	return 0;
 }
 
+static int memcasecmp(const void *vs1, const void *vs2, size_t n)
+{
+	size_t i;
+	const char *s1 = (const char *)vs1;
+	const char *s2 = (const char *)vs2;
+
+	for (i = 0; i < n; i++) {
+		unsigned char u1 = s1[i];
+		unsigned char u2 = s2[i];
+		int U1 = toupper (u1);
+		int U2 = toupper (u2);
+		int diff = (UCHAR_MAX <= INT_MAX ? U1 - U2
+			: U1 < U2 ? -1 : U2 < U1);
+		if (diff)
+			return diff;
+	}
+	return 0;
+}
+
 static int cmp_ref_sorting(struct ref_sorting *s, struct ref_array_item *a, struct ref_array_item *b)
 {
 	struct atom_value *va, *vb;
@@ -2304,6 +2382,7 @@ static int cmp_ref_sorting(struct ref_sorting *s, struct ref_array_item *a, stru
 	int cmp_detached_head = 0;
 	cmp_type cmp_type = used_atom[s->atom].type;
 	struct strbuf err = STRBUF_INIT;
+	size_t slen = 0;
 
 	if (get_ref_atom_value(a, s->atom, &va, &err))
 		die("%s", err.buf);
@@ -2317,10 +2396,32 @@ static int cmp_ref_sorting(struct ref_sorting *s, struct ref_array_item *a, stru
 	} else if (s->sort_flags & REF_SORTING_VERSION) {
 		cmp = versioncmp(va->s, vb->s);
 	} else if (cmp_type == FIELD_STR) {
-		int (*cmp_fn)(const char *, const char *);
-		cmp_fn = s->sort_flags & REF_SORTING_ICASE
-			? strcasecmp : strcmp;
-		cmp = cmp_fn(va->s, vb->s);
+		if (va->s_size == ATOM_VALUE_S_SIZE_INIT &&
+		    vb->s_size == ATOM_VALUE_S_SIZE_INIT) {
+			int (*cmp_fn)(const char *, const char *);
+			cmp_fn = s->sort_flags & REF_SORTING_ICASE
+				? strcasecmp : strcmp;
+			cmp = cmp_fn(va->s, vb->s);
+		} else {
+			int (*cmp_fn)(const void *, const void *, size_t);
+			cmp_fn = s->sort_flags & REF_SORTING_ICASE
+				? memcasecmp : memcmp;
+
+			if (va->s_size != ATOM_VALUE_S_SIZE_INIT &&
+			    vb->s_size != ATOM_VALUE_S_SIZE_INIT) {
+				cmp = cmp_fn(va->s, vb->s, va->s_size > vb->s_size ?
+				       vb->s_size : va->s_size);
+			} else if (va->s_size == ATOM_VALUE_S_SIZE_INIT) {
+				slen = strlen(va->s);
+				cmp = cmp_fn(va->s, vb->s, slen > vb->s_size ?
+					     vb->s_size : slen);
+			} else {
+				slen = strlen(vb->s);
+				cmp = cmp_fn(va->s, vb->s, slen > va->s_size ?
+					     slen : va->s_size);
+			}
+			cmp = cmp ? cmp : va->s_size - vb->s_size;
+		}
 	} else {
 		if (va->value < vb->value)
 			cmp = -1;
@@ -2420,6 +2521,7 @@ int format_ref_array_item(struct ref_array_item *info,
 	}
 	if (format->need_color_reset_at_eol) {
 		struct atom_value resetv;
+		resetv.s_size = ATOM_VALUE_S_SIZE_INIT;
 		resetv.s = GIT_COLOR_RESET;
 		if (append_atom(&resetv, &state, error_buf)) {
 			pop_stack_element(&state.stack);
diff --git a/t/t6300-for-each-ref.sh b/t/t6300-for-each-ref.sh
index 9e0214076b4d..07de4a84d70b 100755
--- a/t/t6300-for-each-ref.sh
+++ b/t/t6300-for-each-ref.sh
@@ -130,6 +130,8 @@ test_atom head parent:short=10 ''
 test_atom head numparent 0
 test_atom head object ''
 test_atom head type ''
+test_atom head raw "$(git cat-file commit refs/heads/main)
+"
 test_atom head '*objectname' ''
 test_atom head '*objecttype' ''
 test_atom head author 'A U Thor <author@example.com> 1151968724 +0200'
@@ -221,6 +223,15 @@ test_atom tag contents 'Tagging at 1151968727
 '
 test_atom tag HEAD ' '
 
+test_expect_success 'basic atom: refs/tags/testtag *raw' '
+	git cat-file commit refs/tags/testtag^{} >expected &&
+	git for-each-ref --format="%(*raw)" refs/tags/testtag >actual &&
+	sanitize_pgp <expected >expected.clean &&
+	sanitize_pgp <actual >actual.clean &&
+	echo "" >>expected.clean &&
+	test_cmp expected.clean actual.clean
+'
+
 test_expect_success 'Check invalid atoms names are errors' '
 	test_must_fail git for-each-ref --format="%(INVALID)" refs/heads
 '
@@ -686,6 +697,15 @@ test_atom refs/tags/signed-empty contents:body ''
 test_atom refs/tags/signed-empty contents:signature "$sig"
 test_atom refs/tags/signed-empty contents "$sig"
 
+test_expect_success 'basic atom: refs/tags/signed-empty raw' '
+	git cat-file tag refs/tags/signed-empty >expected &&
+	git for-each-ref --format="%(raw)" refs/tags/signed-empty >actual &&
+	sanitize_pgp <expected >expected.clean &&
+	sanitize_pgp <actual >actual.clean &&
+	echo "" >>expected.clean &&
+	test_cmp expected.clean actual.clean
+'
+
 test_atom refs/tags/signed-short subject 'subject line'
 test_atom refs/tags/signed-short subject:sanitize 'subject-line'
 test_atom refs/tags/signed-short contents:subject 'subject line'
@@ -695,6 +715,15 @@ test_atom refs/tags/signed-short contents:signature "$sig"
 test_atom refs/tags/signed-short contents "subject line
 $sig"
 
+test_expect_success 'basic atom: refs/tags/signed-short raw' '
+	git cat-file tag refs/tags/signed-short >expected &&
+	git for-each-ref --format="%(raw)" refs/tags/signed-short >actual &&
+	sanitize_pgp <expected >expected.clean &&
+	sanitize_pgp <actual >actual.clean &&
+	echo "" >>expected.clean &&
+	test_cmp expected.clean actual.clean
+'
+
 test_atom refs/tags/signed-long subject 'subject line'
 test_atom refs/tags/signed-long subject:sanitize 'subject-line'
 test_atom refs/tags/signed-long contents:subject 'subject line'
@@ -708,6 +737,15 @@ test_atom refs/tags/signed-long contents "subject line
 body contents
 $sig"
 
+test_expect_success 'basic atom: refs/tags/signed-long raw' '
+	git cat-file tag refs/tags/signed-long >expected &&
+	git for-each-ref --format="%(raw)" refs/tags/signed-long >actual &&
+	sanitize_pgp <expected >expected.clean &&
+	sanitize_pgp <actual >actual.clean &&
+	echo "" >>expected.clean &&
+	test_cmp expected.clean actual.clean
+'
+
 test_expect_success 'set up refs pointing to tree and blob' '
 	git update-ref refs/mytrees/first refs/heads/main^{tree} &&
 	git update-ref refs/myblobs/first refs/heads/main:one
@@ -720,6 +758,16 @@ test_atom refs/mytrees/first contents:body ""
 test_atom refs/mytrees/first contents:signature ""
 test_atom refs/mytrees/first contents ""
 
+test_expect_success 'basic atom: refs/mytrees/first raw' '
+	git cat-file tree refs/mytrees/first >expected &&
+	echo "" >>expected &&
+	git for-each-ref --format="%(raw)" refs/mytrees/first >actual &&
+	test_cmp expected actual &&
+	git cat-file -s refs/mytrees/first >expected &&
+	git for-each-ref --format="%(raw:size)" refs/mytrees/first >actual &&
+	test_cmp expected actual
+'
+
 test_atom refs/myblobs/first subject ""
 test_atom refs/myblobs/first contents:subject ""
 test_atom refs/myblobs/first body ""
@@ -727,6 +775,149 @@ test_atom refs/myblobs/first contents:body ""
 test_atom refs/myblobs/first contents:signature ""
 test_atom refs/myblobs/first contents ""
 
+test_expect_success 'basic atom: refs/myblobs/first raw' '
+	git cat-file blob refs/myblobs/first >expected &&
+	echo "" >>expected &&
+	git for-each-ref --format="%(raw)" refs/myblobs/first >actual &&
+	test_cmp expected actual &&
+	git cat-file -s refs/myblobs/first >expected &&
+	git for-each-ref --format="%(raw:size)" refs/myblobs/first >actual &&
+	test_cmp expected actual
+'
+
+test_expect_success 'set up refs pointing to binary blob' '
+	printf "%b" "a\0b\0c" >blob1 &&
+	printf "%b" "a\0c\0b" >blob2 &&
+	printf "%b" "\0a\0b\0c" >blob3 &&
+	printf "%b" "abc" >blob4 &&
+	printf "%b" "\0 \0 \0 " >blob5 &&
+	printf "%b" "\0 \0a\0 " >blob6 &&
+	>blob7 &&
+	git hash-object blob1 -w | xargs git update-ref refs/myblobs/blob1 &&
+	git hash-object blob2 -w | xargs git update-ref refs/myblobs/blob2 &&
+	git hash-object blob3 -w | xargs git update-ref refs/myblobs/blob3 &&
+	git hash-object blob4 -w | xargs git update-ref refs/myblobs/blob4 &&
+	git hash-object blob5 -w | xargs git update-ref refs/myblobs/blob5 &&
+	git hash-object blob6 -w | xargs git update-ref refs/myblobs/blob6 &&
+	git hash-object blob7 -w | xargs git update-ref refs/myblobs/blob7
+'
+
+test_expect_success 'Verify sorts with raw' '
+	cat >expected <<-EOF &&
+	refs/myblobs/blob7
+	refs/myblobs/blob5
+	refs/myblobs/blob6
+	refs/myblobs/blob3
+	refs/mytrees/first
+	refs/myblobs/first
+	refs/myblobs/blob1
+	refs/myblobs/blob2
+	refs/myblobs/blob4
+	refs/heads/main
+	EOF
+	git for-each-ref --format="%(refname)" --sort=raw \
+		refs/heads/main refs/myblobs/ refs/mytrees/first >actual &&
+	test_cmp expected actual
+'
+
+test_expect_success 'Verify sorts with raw:size' '
+	cat >expected <<-EOF &&
+	refs/myblobs/blob7
+	refs/myblobs/first
+	refs/heads/main
+	refs/myblobs/blob4
+	refs/myblobs/blob1
+	refs/myblobs/blob2
+	refs/myblobs/blob3
+	refs/myblobs/blob5
+	refs/myblobs/blob6
+	refs/mytrees/first
+	EOF
+	git for-each-ref --format="%(refname)" --sort=raw:size \
+		refs/heads/main refs/myblobs/ refs/mytrees/first >actual &&
+	test_cmp expected actual
+'
+
+test_expect_success 'validate raw atom with %(if:equals)' '
+	cat >expected <<-EOF &&
+	not equals
+	not equals
+	not equals
+	not equals
+	not equals
+	not equals
+	refs/myblobs/blob4
+	not equals
+	not equals
+	not equals
+	not equals
+	EOF
+	git for-each-ref --format="%(if:equals=abc)%(raw)%(then)%(refname)%(else)not equals%(end)" \
+		refs/myblobs/ refs/heads/ >actual &&
+	test_cmp expected actual
+'
+test_expect_success 'validate raw atom with %(if:notequals)' '
+	cat >expected <<-EOF &&
+	refs/heads/ambiguous
+	refs/heads/main
+	refs/heads/newtag
+	refs/myblobs/blob1
+	refs/myblobs/blob2
+	refs/myblobs/blob3
+	equals
+	refs/myblobs/blob5
+	refs/myblobs/blob6
+	refs/myblobs/blob7
+	refs/myblobs/first
+	EOF
+	git for-each-ref --format="%(if:notequals=abc)%(raw)%(then)%(refname)%(else)equals%(end)" \
+		refs/myblobs/ refs/heads/ >actual &&
+	test_cmp expected actual
+'
+
+test_expect_success 'empty raw refs with %(if)' '
+	cat >expected <<-EOF &&
+	refs/myblobs/blob1 not empty
+	refs/myblobs/blob2 not empty
+	refs/myblobs/blob3 not empty
+	refs/myblobs/blob4 not empty
+	refs/myblobs/blob5 empty
+	refs/myblobs/blob6 not empty
+	refs/myblobs/blob7 empty
+	refs/myblobs/first not empty
+	EOF
+	git for-each-ref --format="%(refname) %(if)%(raw)%(then)not empty%(else)empty%(end)" \
+	refs/myblobs/ >actual &&
+	test_cmp expected actual
+'
+
+test_expect_success '%(raw) with --python must failed' '
+	test_must_fail git for-each-ref --format="%(raw)" --python
+'
+
+test_expect_success '%(raw) with --tcl must failed' '
+	test_must_fail git for-each-ref --format="%(raw)" --tcl
+'
+
+test_expect_success '%(raw) with --perl must failed' '
+	test_must_fail git for-each-ref --format="%(raw)" --perl
+'
+
+test_expect_success '%(raw) with --shell must failed' '
+	test_must_fail git for-each-ref --format="%(raw)" --shell
+'
+
+test_expect_success '%(raw) with --shell and --sort=raw must failed' '
+	test_must_fail git for-each-ref --format="%(raw)" --sort=raw --shell
+'
+
+test_expect_success 'for-each-ref --format compare with cat-file --batch' '
+	git rev-parse refs/mytrees/first | git cat-file --batch >expected &&
+	git for-each-ref --format="%(objectname) %(objecttype) %(objectsize)
+%(raw)" refs/mytrees/first >actual &&
+	test_cmp expected actual
+'
+
 test_expect_success 'set up multiple-sort tags' '
 	for when in 100000 200000
 	do
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 2/2] [GSOC] ref-filter: add %(header) atom
  2021-05-27 14:43 [PATCH 0/2] [GSOC] ref-filter: add %(raw) atom ZheNing Hu via GitGitGadget
  2021-05-27 14:43 ` [PATCH 1/2] " ZheNing Hu via GitGitGadget
@ 2021-05-27 14:43 ` ZheNing Hu via GitGitGadget
  2021-05-27 16:37   ` Felipe Contreras
                     ` (2 more replies)
  2021-05-27 15:39 ` [PATCH 0/2] [GSOC] ref-filter: add %(raw) atom Felipe Contreras
  2021-05-30 13:01 ` [PATCH v2 " ZheNing Hu via GitGitGadget
  3 siblings, 3 replies; 41+ messages in thread
From: ZheNing Hu via GitGitGadget @ 2021-05-27 14:43 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Christian Couder, Hariom Verma, Karthik Nayak,
	Felipe Contreras, Bagas Sanjaya, Jeff King, ZheNing Hu,
	ZheNing Hu

From: ZheNing Hu <adlternative@gmail.com>

Add new formatting option `%(header)`, which will print the
the structured header part of the raw object data.

In the storage layout of an object: blob and tree only
contains raw data; commit and tag raw data contains two part:
header and contents. The header of tag contains "object OOO",
"type TTT", "tag AAA", "tagger GGG"; The header of commit
contains "tree RRR", "parent PPP", "author UUU", "committer CCC".

Signed-off-by: ZheNing Hu <adlternative@gmail.com>
---
 Documentation/git-for-each-ref.txt |  7 +++++
 ref-filter.c                       | 26 +++++++++++++++++
 t/t6300-for-each-ref.sh            | 45 ++++++++++++++++++++++++++++++
 3 files changed, 78 insertions(+)

diff --git a/Documentation/git-for-each-ref.txt b/Documentation/git-for-each-ref.txt
index f6ae751fd256..7827e48cde75 100644
--- a/Documentation/git-for-each-ref.txt
+++ b/Documentation/git-for-each-ref.txt
@@ -249,6 +249,13 @@ Note that `--format=%(raw)` should not combine with `--python`, `--shell`, `--tc
 `--perl` because if our binary raw data is passed to a variable in the host language,
 the host languages may cause escape errors.
 
+The structured header part of the raw data in a commit or a tag object is `header`,
+it composed of "tree XXX", "parent YYY", etc lines in commits, or composed of
+"object OOO", "type TTT", etc lines in tags.
+
+header:size::
+	The header size of the object.
+
 The message in a commit or a tag object is `contents`, from which
 `contents:<part>` can be used to extract various parts out of:
 
diff --git a/ref-filter.c b/ref-filter.c
index c2abf5da7006..2f426830f562 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -141,6 +141,9 @@ static struct used_atom {
 		struct {
 			enum { RAW_BARE, RAW_LENGTH } option;
 		} raw_data;
+		struct {
+			enum { H_BARE, H_LENGTH } option;
+		} header;
 		struct {
 			cmp_status cmp_status;
 			const char *str;
@@ -385,6 +388,18 @@ static int raw_atom_parser(const struct ref_format *format, struct used_atom *at
 	return 0;
 }
 
+static int header_atom_parser(const struct ref_format *format, struct used_atom *atom,
+			      const char *arg, struct strbuf *err)
+{
+	if (!arg)
+		atom->u.header.option = H_BARE;
+	else if (!strcmp(arg, "size"))
+		atom->u.header.option = H_LENGTH;
+	else
+		return strbuf_addf_ret(err, -1, _("unrecognized %%(header) argument: %s"), arg);
+	return 0;
+}
+
 static int oid_atom_parser(const struct ref_format *format, struct used_atom *atom,
 			   const char *arg, struct strbuf *err)
 {
@@ -546,6 +561,7 @@ static struct {
 	{ "trailers", SOURCE_OBJ, FIELD_STR, trailers_atom_parser },
 	{ "contents", SOURCE_OBJ, FIELD_STR, contents_atom_parser },
 	{ "raw", SOURCE_OBJ, FIELD_STR, raw_atom_parser },
+	{ "header", SOURCE_OBJ, FIELD_STR, header_atom_parser },
 	{ "upstream", SOURCE_NONE, FIELD_STR, remote_ref_atom_parser },
 	{ "push", SOURCE_NONE, FIELD_STR, remote_ref_atom_parser },
 	{ "symref", SOURCE_NONE, FIELD_STR, refname_atom_parser },
@@ -1362,6 +1378,7 @@ static void grab_raw_data(struct atom_value *val, int deref, void *buf, unsigned
 		if ((obj->type != OBJ_TAG &&
 		     obj->type != OBJ_COMMIT) ||
 		    (strcmp(name, "body") &&
+		     !starts_with(name, "header") &&
 		     !starts_with(name, "subject") &&
 		     !starts_with(name, "trailers") &&
 		     !starts_with(name, "contents")))
@@ -1372,6 +1389,15 @@ static void grab_raw_data(struct atom_value *val, int deref, void *buf, unsigned
 				    &bodypos, &bodylen, &nonsiglen,
 				    &sigpos, &siglen);
 
+		if (starts_with(name, "header")) {
+			size_t header_len = subpos - (const char *)buf - 1;
+			if (atom->u.header.option == H_BARE) {
+				v->s = xmemdupz(buf, header_len);
+			} else if (atom->u.header.option == H_LENGTH)
+				v->s = xstrfmt("%"PRIuMAX, (uintmax_t)header_len);
+			continue;
+		}
+
 		if (atom->u.contents.option == C_SUB)
 			v->s = copy_subject(subpos, sublen);
 		else if (atom->u.contents.option == C_SUB_SANITIZE) {
diff --git a/t/t6300-for-each-ref.sh b/t/t6300-for-each-ref.sh
index 07de4a84d70b..11fc8fc53649 100755
--- a/t/t6300-for-each-ref.sh
+++ b/t/t6300-for-each-ref.sh
@@ -232,6 +232,35 @@ test_expect_success 'basic atom: refs/tags/testtag *raw' '
 	test_cmp expected.clean actual.clean
 '
 
+test_expect_success 'basic atom: refs/tags/testtag header' '
+	cat >expected <<-EOF &&
+	object ea122842f48be4afb2d1fc6a4b96c05885ab7463
+	type commit
+	tag testtag
+	tagger C O Mitter <committer@example.com> 1151968725 +0200
+
+	EOF
+	git for-each-ref --format="%(header)" refs/tags/testtag >actual &&
+	test_cmp expected actual &&
+	echo "131" >expected &&
+	git for-each-ref --format="%(header:size)" refs/tags/testtag >actual &&
+	test_cmp expected actual
+'
+
+test_expect_success 'basic atom: refs/heads/main header' '
+	cat >expected <<-EOF &&
+	tree 8039ce043250c402d62ca312e9596e42ce1c7bb0
+	author A U Thor <author@example.com> 1151968724 +0200
+	committer C O Mitter <committer@example.com> 1151968723 +0200
+
+	EOF
+	git for-each-ref --format="%(header)" refs/heads/main >actual &&
+	test_cmp expected actual &&
+	echo "162" >expected &&
+	git for-each-ref --format="%(header:size)" refs/heads/main >actual &&
+	test_cmp expected actual
+'
+
 test_expect_success 'Check invalid atoms names are errors' '
 	test_must_fail git for-each-ref --format="%(INVALID)" refs/heads
 '
@@ -768,6 +797,14 @@ test_expect_success 'basic atom: refs/mytrees/first raw' '
 	test_cmp expected actual
 '
 
+test_expect_success 'basic atom: refs/mytrees/first header' '
+	echo "" >expected &&
+	git for-each-ref --format="%(header)" refs/mytrees/first >actual &&
+	test_cmp expected actual &&
+	git for-each-ref --format="%(header:size)" refs/mytrees/first >actual &&
+	test_cmp expected actual
+'
+
 test_atom refs/myblobs/first subject ""
 test_atom refs/myblobs/first contents:subject ""
 test_atom refs/myblobs/first body ""
@@ -785,6 +822,14 @@ test_expect_success 'basic atom: refs/myblobs/first raw' '
 	test_cmp expected actual
 '
 
+test_expect_success 'basic atom: refs/myblobs/first header' '
+	echo "" >expected &&
+	git for-each-ref --format="%(header)" refs/myblobs/first >actual &&
+	test_cmp expected actual &&
+	git for-each-ref --format="%(header:size)" refs/myblobs/first >actual &&
+	test_cmp expected actual
+'
+
 test_expect_success 'set up refs pointing to binary blob' '
 	printf "%b" "a\0b\0c" >blob1 &&
 	printf "%b" "a\0c\0b" >blob2 &&
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* RE: [PATCH 0/2] [GSOC] ref-filter: add %(raw) atom
  2021-05-27 14:43 [PATCH 0/2] [GSOC] ref-filter: add %(raw) atom ZheNing Hu via GitGitGadget
  2021-05-27 14:43 ` [PATCH 1/2] " ZheNing Hu via GitGitGadget
  2021-05-27 14:43 ` [PATCH 2/2] [GSOC] ref-filter: add %(header) atom ZheNing Hu via GitGitGadget
@ 2021-05-27 15:39 ` Felipe Contreras
  2021-05-30 13:01 ` [PATCH v2 " ZheNing Hu via GitGitGadget
  3 siblings, 0 replies; 41+ messages in thread
From: Felipe Contreras @ 2021-05-27 15:39 UTC (permalink / raw)
  To: ZheNing Hu via GitGitGadget, git
  Cc: Junio C Hamano, Christian Couder, Hariom Verma, Karthik Nayak,
	Felipe Contreras, Bagas Sanjaya, Jeff King, ZheNing Hu

ZheNing Hu via GitGitGadget wrote:
> In order to make git cat-file --batch use ref-filter logic, I add %(raw)
> atom to ref-filter.
> 
> Change from last version:
> 
>  1. In my discussion with Junio, I came to the conclusion that
>     --format="%(raw)" should not be used with --python, --perl, --shell,
>     --tcl. Therefore, die if both --format="%(raw)" and
> 
> --language are given in parse_ref_filter_atom(). The reason I don't move
> this part to raw_atom_parser() is if I move it to raw_atom_parser(), when we
> use:
> 
> git --format=%raw --sort=raw --python`

Missing the command I presume (and the other backtick).

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH 1/2] [GSOC] ref-filter: add %(raw) atom
  2021-05-27 14:43 ` [PATCH 1/2] " ZheNing Hu via GitGitGadget
@ 2021-05-27 16:36   ` Felipe Contreras
  2021-05-28 13:02     ` ZheNing Hu
  2021-05-29 13:23     ` Phillip Wood
  2021-05-28  3:03   ` Junio C Hamano
  1 sibling, 2 replies; 41+ messages in thread
From: Felipe Contreras @ 2021-05-27 16:36 UTC (permalink / raw)
  To: ZheNing Hu via GitGitGadget, git
  Cc: Junio C Hamano, Christian Couder, Hariom Verma, Karthik Nayak,
	Felipe Contreras, Bagas Sanjaya, Jeff King, ZheNing Hu,
	ZheNing Hu

ZheNing Hu via GitGitGadget wrote:

> +static int raw_atom_parser(const struct ref_format *format, struct used_atom *atom,
> +				const char *arg, struct strbuf *err)
> +{
> +	if (!arg) {
> +		atom->u.raw_data.option = RAW_BARE;
> +	} else if (!strcmp(arg, "size"))

No need for braces.

  if (!arg)
    ...
  else

> @@ -1307,10 +1349,22 @@ static void grab_sub_body_contents(struct atom_value *val, int deref, void *buf)
>  			continue;
>  		if (deref)
>  			name++;
> -		if (strcmp(name, "body") &&
> -		    !starts_with(name, "subject") &&
> -		    !starts_with(name, "trailers") &&
> -		    !starts_with(name, "contents"))
> +
> +		if (starts_with(name, "raw")) {
> +			if (atom->u.raw_data.option == RAW_BARE) {
> +				v->s = xmemdupz(buf, buf_size);
> +				v->s_size = buf_size;
> +			} else if (atom->u.raw_data.option == RAW_LENGTH)
> +				v->s = xstrfmt("%"PRIuMAX, (uintmax_t)buf_size);

I think it's better to be consistent: if you used braces in the if, uses
braces in else.

> +			continue;
> +		}

> +static int memcasecmp(const void *vs1, const void *vs2, size_t n)

Why void *? We can delcare as char *.

> +{
> +	size_t i;
> +	const char *s1 = (const char *)vs1;
> +	const char *s2 = (const char *)vs2;

Then we avoid this extra step.

> +	for (i = 0; i < n; i++) {
> +		unsigned char u1 = s1[i];
> +		unsigned char u2 = s2[i];

There's no need for two entirely new variables...

> +		int U1 = toupper (u1);
> +		int U2 = toupper (u2);

You can do toupper(s1[i]) directly (BTW, there's an extra space: `foo(x)`,
not `foo (x)`).

While we are at it, why keep an extra index from s1, when s1 is never
used again?

We can simply advance both s1 and s2:

  s1++, s2++

> +		int diff = (UCHAR_MAX <= INT_MAX ? U1 - U2
> +			: U1 < U2 ? -1 : U2 < U1);

I don't understand what this is supposed to achieve. Both U1 and U2 are
integers, pretty low integers actually.

If we get rid if that complexity we don't even need U1 or U2, just do:

  diff = toupper(u1) - toupper(u2);

> +		if (diff)
> +			return diff;
> +	}
> +	return 0;
> +}

All we have to do is define the end point, and then we don't need i:

	static int memcasecmp(const char *s1, const char *s2, size_t n)
	{
		const char *end = s1 + n;
		for (; s1 < end; s1++, s2++) {
			int diff = tolower(*s1) - tolower(*s2);
			if (diff)
				return diff;
		}
		return 0;
	}

(and I personally prefer lower to upper)

Check the following resource for a detailed explanation of why my
modified version is considered good taste:

https://github.com/felipec/linked-list-good-taste

>  static int cmp_ref_sorting(struct ref_sorting *s, struct ref_array_item *a, struct ref_array_item *b)
>  {
>  	struct atom_value *va, *vb;
> @@ -2304,6 +2382,7 @@ static int cmp_ref_sorting(struct ref_sorting *s, struct ref_array_item *a, stru
>  	int cmp_detached_head = 0;
>  	cmp_type cmp_type = used_atom[s->atom].type;
>  	struct strbuf err = STRBUF_INIT;
> +	size_t slen = 0;
>  
>  	if (get_ref_atom_value(a, s->atom, &va, &err))
>  		die("%s", err.buf);
> @@ -2317,10 +2396,32 @@ static int cmp_ref_sorting(struct ref_sorting *s, struct ref_array_item *a, stru
>  	} else if (s->sort_flags & REF_SORTING_VERSION) {
>  		cmp = versioncmp(va->s, vb->s);
>  	} else if (cmp_type == FIELD_STR) {
> -		int (*cmp_fn)(const char *, const char *);
> -		cmp_fn = s->sort_flags & REF_SORTING_ICASE
> -			? strcasecmp : strcmp;
> -		cmp = cmp_fn(va->s, vb->s);
> +		if (va->s_size == ATOM_VALUE_S_SIZE_INIT &&
> +		    vb->s_size == ATOM_VALUE_S_SIZE_INIT) {
> +			int (*cmp_fn)(const char *, const char *);
> +			cmp_fn = s->sort_flags & REF_SORTING_ICASE
> +				? strcasecmp : strcmp;
> +			cmp = cmp_fn(va->s, vb->s);
> +		} else {
> +			int (*cmp_fn)(const void *, const void *, size_t);
> +			cmp_fn = s->sort_flags & REF_SORTING_ICASE
> +				? memcasecmp : memcmp;
> +
> +			if (va->s_size != ATOM_VALUE_S_SIZE_INIT &&
> +			    vb->s_size != ATOM_VALUE_S_SIZE_INIT) {
> +				cmp = cmp_fn(va->s, vb->s, va->s_size > vb->s_size ?
> +				       vb->s_size : va->s_size);
> +			} else if (va->s_size == ATOM_VALUE_S_SIZE_INIT) {
> +				slen = strlen(va->s);
> +				cmp = cmp_fn(va->s, vb->s, slen > vb->s_size ?
> +					     vb->s_size : slen);
> +			} else {
> +				slen = strlen(vb->s);
> +				cmp = cmp_fn(va->s, vb->s, slen > va->s_size ?
> +					     slen : va->s_size);
> +			}
> +			cmp = cmp ? cmp : va->s_size - vb->s_size;
> +		}

This hurts my eyes. I think the complexity of this chunk warrants a
separate function. Then the logic would be easer to see.

Cheers.

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH 2/2] [GSOC] ref-filter: add %(header) atom
  2021-05-27 14:43 ` [PATCH 2/2] [GSOC] ref-filter: add %(header) atom ZheNing Hu via GitGitGadget
@ 2021-05-27 16:37   ` Felipe Contreras
  2021-05-28  3:06   ` Junio C Hamano
  2021-05-28  4:36   ` Junio C Hamano
  2 siblings, 0 replies; 41+ messages in thread
From: Felipe Contreras @ 2021-05-27 16:37 UTC (permalink / raw)
  To: ZheNing Hu via GitGitGadget, git
  Cc: Junio C Hamano, Christian Couder, Hariom Verma, Karthik Nayak,
	Felipe Contreras, Bagas Sanjaya, Jeff King, ZheNing Hu,
	ZheNing Hu

ZheNing Hu via GitGitGadget wrote:

> @@ -1372,6 +1389,15 @@ static void grab_raw_data(struct atom_value *val, int deref, void *buf, unsigned
>  				    &bodypos, &bodylen, &nonsiglen,
>  				    &sigpos, &siglen);
>  
> +		if (starts_with(name, "header")) {
> +			size_t header_len = subpos - (const char *)buf - 1;
> +			if (atom->u.header.option == H_BARE) {
> +				v->s = xmemdupz(buf, header_len);
> +			} else if (atom->u.header.option == H_LENGTH)

No need for braces in the if.

> +				v->s = xstrfmt("%"PRIuMAX, (uintmax_t)header_len);
> +			continue;
> +		}
> +
>  		if (atom->u.contents.option == C_SUB)
>  			v->s = copy_subject(subpos, sublen);
>  		else if (atom->u.contents.option == C_SUB_SANITIZE) {

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/2] [GSOC] ref-filter: add %(raw) atom
  2021-05-27 14:43 ` [PATCH 1/2] " ZheNing Hu via GitGitGadget
  2021-05-27 16:36   ` Felipe Contreras
@ 2021-05-28  3:03   ` Junio C Hamano
  2021-05-28 15:04     ` ZheNing Hu
  1 sibling, 1 reply; 41+ messages in thread
From: Junio C Hamano @ 2021-05-28  3:03 UTC (permalink / raw)
  To: ZheNing Hu via GitGitGadget
  Cc: git, Christian Couder, Hariom Verma, Karthik Nayak,
	Felipe Contreras, Bagas Sanjaya, Jeff King, ZheNing Hu

"ZheNing Hu via GitGitGadget" <gitgitgadget@gmail.com> writes:

> The raw data of blob, tree objects may contain '\0', but most of
> the logic in `ref-filter` depands on the output of the atom being
> a structured string (end with '\0').

Text, yes, string is also OK.  But structured?  Probably not.

    ... being text (specifically, no embedded NULs in it).

> E.g. `quote_formatting()` use `strbuf_addstr()` or `*._quote_buf()`
> add the data to the buffer. The raw data of a tree object is
> `100644 one\0...`, only the `100644 one` will be added to the buffer,
> which is incorrect.
>
> Therefore, add a new member in `struct atom_value`: `s_size`, which
> can record raw object size, it can help us add raw object data to
> the buffer or compare two buffers which contain raw object data.

Other than the phrasing issue around "structured", all of the above
is a good description.

> Beyond, `--format=%(raw)` should not combine with `--python`, `--shell`,
> `--tcl`, `--perl` because if our binary raw data is passed to a variable
> in the host language, the host languages may cause escape errors.

OK.  I think at least --perl and possibly --python should be able to
express NULs in the "string" type we use from the host language, but
I am perfectly fine with the decision to leave it to later updates.

After all, the --<lang> feature to write scriptlets via --format and
execute them in the named language interpreter probably is not as
often used as it was originally designed for, so it might be that
nobody will ask for such "later updates".

> @@ -235,6 +235,20 @@ and `date` to extract the named component.  For email fields (`authoremail`,
>  without angle brackets, and `:localpart` to get the part before the `@` symbol
>  out of the trimmed email.
>  
> +The raw data in a object is `raw`, For commit and tag objects, `raw` contain
> +`header` and `contents` two parts, `header` is structured part of raw data, it
> +composed of "tree XXX", "parent YYY", etc lines in commits , or composed of
> +"object OOO", "type TTT", etc lines in tags; `contents` is unstructured "free
> +text" part of raw object data. For blob and tree objects, their raw data don't
> +have `header` and `contents` parts.
> +
> +raw:size::
> +	The raw data size of the object.
> +
> +Note that `--format=%(raw)` should not combine with `--python`, `--shell`, `--tcl`,

"should not combine" -> "cannot be used" would make it read more
naturally (ditto for the phrase used in the proposed log message).

> +`--perl` because if our binary raw data is passed to a variable in the host language,
> +the host languages may cause escape errors.
> +
>  The message in a commit or a tag object is `contents`, from which
>  `contents:<part>` can be used to extract various parts out of:
>  
> diff --git a/ref-filter.c b/ref-filter.c
> index 97116e12d7c4..c2abf5da7006 100644
> --- a/ref-filter.c
> +++ b/ref-filter.c
> @@ -138,6 +138,9 @@ static struct used_atom {
>  			struct process_trailer_options trailer_opts;
>  			unsigned int nlines;
>  		} contents;
> +		struct {
> +			enum { RAW_BARE, RAW_LENGTH } option;
> +		} raw_data;
>  		struct {
>  			cmp_status cmp_status;
>  			const char *str;
> @@ -370,6 +373,18 @@ static int contents_atom_parser(const struct ref_format *format, struct used_ato
>  	return 0;
>  }
>  
> +static int raw_atom_parser(const struct ref_format *format, struct used_atom *atom,
> +				const char *arg, struct strbuf *err)
> +{
> +	if (!arg) {
> +		atom->u.raw_data.option = RAW_BARE;
> +	} else if (!strcmp(arg, "size"))
> +		atom->u.raw_data.option = RAW_LENGTH;
> +	else
> +		return strbuf_addf_ret(err, -1, _("unrecognized %%(raw) argument: %s"), arg);
> +	return 0;
> +}
> +
>  static int oid_atom_parser(const struct ref_format *format, struct used_atom *atom,
>  			   const char *arg, struct strbuf *err)
>  {
> @@ -530,6 +545,7 @@ static struct {
>  	{ "body", SOURCE_OBJ, FIELD_STR, body_atom_parser },
>  	{ "trailers", SOURCE_OBJ, FIELD_STR, trailers_atom_parser },
>  	{ "contents", SOURCE_OBJ, FIELD_STR, contents_atom_parser },
> +	{ "raw", SOURCE_OBJ, FIELD_STR, raw_atom_parser },
>  	{ "upstream", SOURCE_NONE, FIELD_STR, remote_ref_atom_parser },
>  	{ "push", SOURCE_NONE, FIELD_STR, remote_ref_atom_parser },
>  	{ "symref", SOURCE_NONE, FIELD_STR, refname_atom_parser },
> @@ -564,12 +580,15 @@ struct ref_formatting_state {
>  
>  struct atom_value {
>  	const char *s;
> +	size_t s_size;

OK, so everything used to be a C-string that cannot hold NULs in it,
but now it is a counted <ptr, len> string.  Good.

>  	int (*handler)(struct atom_value *atomv, struct ref_formatting_state *state,
>  		       struct strbuf *err);
>  	uintmax_t value; /* used for sorting when not FIELD_STR */
>  	struct used_atom *atom;
>  };
>  
> +#define ATOM_VALUE_S_SIZE_INIT (-1)
> +
>  /*
>   * Used to parse format string and sort specifiers
>   */
> @@ -588,6 +607,10 @@ static int parse_ref_filter_atom(const struct ref_format *format,
>  		return strbuf_addf_ret(err, -1, _("malformed field name: %.*s"),
>  				       (int)(ep-atom), atom);
>  
> +	if (format->quote_style && starts_with(sp, "raw"))
> +		return strbuf_addf_ret(err, -1, _("--format=%.*s should not combine with"
> +				"--python, --shell, --tcl, --perl"), (int)(ep-atom), atom);

Don't we want to allow "raw:size" that would be a plain text?
I am not sure if this check belongs here in the first place.
Shouldn't it be done in raw_atom_parser() instead?

Another idea is to teach a more generic rule to quote_formatting()
to detect NULs in v->s[0..v->s_size] at runtime and barf, i.e. a
plain-text blob object can be used with "--shell --format=%(raw)"
just fine.

> @@ -652,11 +675,14 @@ static int parse_ref_filter_atom(const struct ref_format *format,
>  	return at;
>  }
>  
> -static void quote_formatting(struct strbuf *s, const char *str, int quote_style)
> +static void quote_formatting(struct strbuf *s, const char *str, size_t len, int quote_style)
>  {
>  	switch (quote_style) {
>  	case QUOTE_NONE:
> -		strbuf_addstr(s, str);
> +		if (len != ATOM_VALUE_S_SIZE_INIT)
> +			strbuf_add(s, str, len);
> +		else
> +			strbuf_addstr(s, str);

It probably is a good idea to invent a C preprocessor macro for a
named constant to be used when a structure is initialized, but it
would be easier to read if the rule is "len field is negative when
the value is a C-string", e.g.

		if (len < 0)

> @@ -683,9 +709,12 @@ static int append_atom(struct atom_value *v, struct ref_formatting_state *state,
>  	 * encountered.
>  	 */
>  	if (!state->stack->prev)
> -		quote_formatting(&state->stack->output, v->s, state->quote_style);
> +		quote_formatting(&state->stack->output, v->s, v->s_size, state->quote_style);
>  	else
> -		strbuf_addstr(&state->stack->output, v->s);
> +		if (v->s_size != ATOM_VALUE_S_SIZE_INIT)
> +			strbuf_add(&state->stack->output, v->s, v->s_size);
> +		else
> +			strbuf_addstr(&state->stack->output, v->s);
>  	return 0;
>  }
>  
> @@ -785,14 +814,16 @@ static int if_atom_handler(struct atom_value *atomv, struct ref_formatting_state
>  	return 0;
>  }
>  
> -static int is_empty(const char *s)
> +static int is_empty(struct strbuf *buf)
>  {
> -	while (*s != '\0') {
> -		if (!isspace(*s))
> -			return 0;
> +	const char *s = buf->buf;
> +	size_t cur_len = 0;
> +
> +	while ((cur_len != buf->len) && (isspace(*s) || *s == '\0')) {
>  		s++;
> +		cur_len++;
>  	}
> -	return 1;
> +	return cur_len == buf->len;
>  }

Assuming that we do want to treat NULs the same way as whitespaces,
the updated code works as intended, which is good.  But I have no
reason to support that design decision.  I do not have a strong
reason to support a design decision that goes the opposite way to
treat a NUL just like we treat an 'X', but at least I can understand
it (i.e. "because we have no reason to special case NUL any more
than 'X' when trying to see if a buffer is 'empty', we don't").
This code on the other hand must be supported with "because we need
to special case NUL for such and such reasons for the purpose of
determining if a buffer is 'empty', we treat them the same way as
whitespaces".

> @@ -800,6 +831,7 @@ static int then_atom_handler(struct atom_value *atomv, struct ref_formatting_sta
>  {
>  	struct ref_formatting_stack *cur = state->stack;
>  	struct if_then_else *if_then_else = NULL;
> +	size_t str_len = 0;
>  
>  	if (cur->at_end == if_then_else_handler)
>  		if_then_else = (struct if_then_else *)cur->at_end_data;
> @@ -810,18 +842,28 @@ static int then_atom_handler(struct atom_value *atomv, struct ref_formatting_sta
>  	if (if_then_else->else_atom_seen)
>  		return strbuf_addf_ret(err, -1, _("format: %%(then) atom used after %%(else)"));
>  	if_then_else->then_atom_seen = 1;
> +	if (if_then_else->str)
> +		str_len = strlen(if_then_else->str);
>  	/*
>  	 * If the 'equals' or 'notequals' attribute is used then
>  	 * perform the required comparison. If not, only non-empty
>  	 * strings satisfy the 'if' condition.
>  	 */
>  	if (if_then_else->cmp_status == COMPARE_EQUAL) {
> -		if (!strcmp(if_then_else->str, cur->output.buf))
> +		if (!if_then_else->str)
> +			BUG("when if_then_else->cmp_status == COMPARE_EQUAL,"
> +			    "if_then_else->str must not be null");

Can the change in this commit violate the invariant that
if_then_else->str cannot be NULL, which seems to have been the case
forever as we see an unchecked strcmp() done in the original?

If so, perhaps you can check the condition upfront, where you
compute str_len above, e.g.

	if (!if_then_else->str) {
		if (if_then_else->cmp_status == COMPARE_EQUAL ||
                    if_then_else->cmp_status == COMPARE_UNEQUAL)
			BUG(...);
	} else
		str_len = strlen(...);

If not, then I do not see the point of adding this (and later) check
with BUG to this code.

Or is the invariant that .str must not be NULL could have been
violated without this patch (i.e. the original was buggy in running
strcmp() on .str without checking)?  If so, please make it a separate
preliminary change to add such an assert.

> +		if (str_len == cur->output.len &&
> +		    !memcmp(if_then_else->str, cur->output.buf, cur->output.len))
>  			if_then_else->condition_satisfied = 1;
>  	} else if (if_then_else->cmp_status == COMPARE_UNEQUAL) {
> -		if (strcmp(if_then_else->str, cur->output.buf))
> +		if (!if_then_else->str)
> +			BUG("when if_then_else->cmp_status == COMPARE_UNEQUAL,"
> +			    "if_then_else->str must not be null");



>  /* See grab_values */
> -static void grab_sub_body_contents(struct atom_value *val, int deref, void *buf)
> +static void grab_raw_data(struct atom_value *val, int deref, void *buf, unsigned long buf_size, struct object *obj)
>  {
>  	int i;
>  	const char *subpos = NULL, *bodypos = NULL, *sigpos = NULL;
> @@ -1307,10 +1349,22 @@ static void grab_sub_body_contents(struct atom_value *val, int deref, void *buf)
>  			continue;
>  		if (deref)
>  			name++;
> -		if (strcmp(name, "body") &&
> -		    !starts_with(name, "subject") &&
> -		    !starts_with(name, "trailers") &&
> -		    !starts_with(name, "contents"))
> +
> +		if (starts_with(name, "raw")) {
> +			if (atom->u.raw_data.option == RAW_BARE) {
> +				v->s = xmemdupz(buf, buf_size);
> +				v->s_size = buf_size;
> +			} else if (atom->u.raw_data.option == RAW_LENGTH)
> +				v->s = xstrfmt("%"PRIuMAX, (uintmax_t)buf_size);
> +			continue;
> +		}

I can understand that "raw[:<options>]" handling has been inserted
above the existing "from here on, we only deal with log message
components" check.  But

> +		if ((obj->type != OBJ_TAG &&
> +		     obj->type != OBJ_COMMIT) ||

I do not see why these new conditions are added.  The change is not
justified in the proposed log message, the original did not need
these conditions, and this does not concern the primary point of the
change, which is to start supporting %(raw[:<option>]) placeholder.

If it is needed as a bugfix (e.g. it may be that you consider "if a
blob has contents that looks very similar to 'git cat-file commit
HEAD', %(body) and friends parse these out, even though it is not a
commit" is a bug and the change to add these extra tests is meant as
a fix), that should be done as a preliminary change before adding
the support for a new atom.

> +		    (strcmp(name, "body") &&
> +		     !starts_with(name, "subject") &&
> +		     !starts_with(name, "trailers") &&
> +		     !starts_with(name, "contents")))
>  			continue;
>  		if (!subpos)
>  			find_subpos(buf,
> @@ -1374,25 +1428,30 @@ static void fill_missing_values(struct atom_value *val)
>   * pointed at by the ref itself; otherwise it is the object the
>   * ref (which is a tag) refers to.
>   */
> -static void grab_values(struct atom_value *val, int deref, struct object *obj, void *buf)
> +static void grab_values(struct atom_value *val, int deref, struct object *obj, struct expand_data *data)
>  {
> +	void *buf = data->content;
> +	unsigned long buf_size = data->size;
> +
>  	switch (obj->type) {
>  	case OBJ_TAG:
>  		grab_tag_values(val, deref, obj);
> -		grab_sub_body_contents(val, deref, buf);
> +		grab_raw_data(val, deref, buf, buf_size, obj);

It is very strange that a helper that is named to grab raw data can
still process pieces out of a structured data.  The original name is
still a far better match to what the function does, even after this
patch teaches it to also honor %(raw) placeholder.  It is still
about grabbing various "sub"-pieces out of "body contents", and the
sub-piece the %(raw) grabs just happens to be "the whole thing".

> +static int memcasecmp(const void *vs1, const void *vs2, size_t n)
> +{
> +	size_t i;
> +	const char *s1 = (const char *)vs1;
> +	const char *s2 = (const char *)vs2;
> +
> +	for (i = 0; i < n; i++) {
> +		unsigned char u1 = s1[i];
> +		unsigned char u2 = s2[i];
> +		int U1 = toupper (u1);
> +		int U2 = toupper (u2);

Does toupper('\0') even have a defined meaning?

> +		int diff = (UCHAR_MAX <= INT_MAX ? U1 - U2
> +			: U1 < U2 ? -1 : U2 < U1);

Looks crazy to worry about uchar wider than int.  Such a system is
not even standard compliant, is it?

> +		if (diff)
> +			return diff;
> +	}
> +	return 0;
> +}


> @@ -2304,6 +2382,7 @@ static int cmp_ref_sorting(struct ref_sorting *s, struct ref_array_item *a, stru
>  	int cmp_detached_head = 0;
>  	cmp_type cmp_type = used_atom[s->atom].type;
>  	struct strbuf err = STRBUF_INIT;
> +	size_t slen = 0;
>  
>  	if (get_ref_atom_value(a, s->atom, &va, &err))
>  		die("%s", err.buf);
> @@ -2317,10 +2396,32 @@ static int cmp_ref_sorting(struct ref_sorting *s, struct ref_array_item *a, stru
>  	} else if (s->sort_flags & REF_SORTING_VERSION) {
>  		cmp = versioncmp(va->s, vb->s);
>  	} else if (cmp_type == FIELD_STR) {
> -		int (*cmp_fn)(const char *, const char *);
> -		cmp_fn = s->sort_flags & REF_SORTING_ICASE
> -			? strcasecmp : strcmp;
> -		cmp = cmp_fn(va->s, vb->s);

> +		if (va->s_size == ATOM_VALUE_S_SIZE_INIT &&
> +		    vb->s_size == ATOM_VALUE_S_SIZE_INIT) {
> +			int (*cmp_fn)(const char *, const char *);
> +			cmp_fn = s->sort_flags & REF_SORTING_ICASE
> +				? strcasecmp : strcmp;
> +			cmp = cmp_fn(va->s, vb->s);
> +		} else {
> +			int (*cmp_fn)(const void *, const void *, size_t);
> +			cmp_fn = s->sort_flags & REF_SORTING_ICASE
> +				? memcasecmp : memcmp;

Why not introduce two local temporary variables a_size and b_size
and initialize them upfront like so:

		a_size = va->s_size < 0 ? strlen(va->s) : va->s_size;
		b_size = vb->s_size < 0 ? strlen(vb->s) : vb->s_size;

Wouldn't that allow you to do without the complex "if both are
counted, do this, if A is counted but B is not, do that, ..."
cascade?

I can sort-of see the point of special casing "both are traditional
C strings" case (i.e. the "if" side of the "else" we are discussing
here) and using strcasecmp/strcmp instead of memcasecmp/memcmp, but
I do not see much point in having the if/elseif/else cascade inside
this "else" clause.

> +			if (va->s_size != ATOM_VALUE_S_SIZE_INIT &&
> +			    vb->s_size != ATOM_VALUE_S_SIZE_INIT) {
> +				cmp = cmp_fn(va->s, vb->s, va->s_size > vb->s_size ?
> +				       vb->s_size : va->s_size);
> +			} else if (va->s_size == ATOM_VALUE_S_SIZE_INIT) {
> +				slen = strlen(va->s);
> +				cmp = cmp_fn(va->s, vb->s, slen > vb->s_size ?
> +					     vb->s_size : slen);
> +			} else {
> +				slen = strlen(vb->s);
> +				cmp = cmp_fn(va->s, vb->s, slen > va->s_size ?
> +					     slen : va->s_size);
> +			}
> +			cmp = cmp ? cmp : va->s_size - vb->s_size;
> +		}
>  	} else {
>  		if (va->value < vb->value)
>  			cmp = -1;

Thanks.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/2] [GSOC] ref-filter: add %(header) atom
  2021-05-27 14:43 ` [PATCH 2/2] [GSOC] ref-filter: add %(header) atom ZheNing Hu via GitGitGadget
  2021-05-27 16:37   ` Felipe Contreras
@ 2021-05-28  3:06   ` Junio C Hamano
  2021-05-28  4:36   ` Junio C Hamano
  2 siblings, 0 replies; 41+ messages in thread
From: Junio C Hamano @ 2021-05-28  3:06 UTC (permalink / raw)
  To: ZheNing Hu via GitGitGadget
  Cc: git, Christian Couder, Hariom Verma, Karthik Nayak,
	Felipe Contreras, Bagas Sanjaya, Jeff King, ZheNing Hu

"ZheNing Hu via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: ZheNing Hu <adlternative@gmail.com>
>
> Add new formatting option `%(header)`, which will print the
> the structured header part of the raw object data.
>
> In the storage layout of an object: blob and tree only
> contains raw data; commit and tag raw data contains two part:
> header and contents. The header of tag contains "object OOO",
> "type TTT", "tag AAA", "tagger GGG"; The header of commit
> contains "tree RRR", "parent PPP", "author UUU", "committer CCC".
>
> Signed-off-by: ZheNing Hu <adlternative@gmail.com>
> ---
>  Documentation/git-for-each-ref.txt |  7 +++++
>  ref-filter.c                       | 26 +++++++++++++++++
>  t/t6300-for-each-ref.sh            | 45 ++++++++++++++++++++++++++++++
>  3 files changed, 78 insertions(+)

While having this may not be wrong, I am not sure who needs it.  Is
your "cat-file --batch" topic needs this new atom?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/2] [GSOC] ref-filter: add %(header) atom
  2021-05-27 14:43 ` [PATCH 2/2] [GSOC] ref-filter: add %(header) atom ZheNing Hu via GitGitGadget
  2021-05-27 16:37   ` Felipe Contreras
  2021-05-28  3:06   ` Junio C Hamano
@ 2021-05-28  4:36   ` Junio C Hamano
  2021-05-28 15:19     ` ZheNing Hu
  2 siblings, 1 reply; 41+ messages in thread
From: Junio C Hamano @ 2021-05-28  4:36 UTC (permalink / raw)
  To: ZheNing Hu via GitGitGadget
  Cc: git, Christian Couder, Hariom Verma, Karthik Nayak,
	Felipe Contreras, Bagas Sanjaya, Jeff King, ZheNing Hu

"ZheNing Hu via GitGitGadget" <gitgitgadget@gmail.com> writes:

>  		struct {
>  			enum { RAW_BARE, RAW_LENGTH } option;
>  		} raw_data;
> +		struct {
> +			enum { H_BARE, H_LENGTH } option;
> +		} header;

Raw does not use R_{BARE,LENGTH} and uses raw_data member.  Header
should follow suit unless there is a compelling reason not to, no?

		struct {
			enum { HEADER_BARE, HEADER_LENGTH } option;
		} header_data;

perhaps?

> @@ -1372,6 +1389,15 @@ static void grab_raw_data(struct atom_value *val, int deref, void *buf, unsigned
>  				    &bodypos, &bodylen, &nonsiglen,
>  				    &sigpos, &siglen);
>  
> +		if (starts_with(name, "header")) {
> +			size_t header_len = subpos - (const char *)buf - 1;

Hmph, is this correct?  I would expect that the "header" part of a
commit or a tag object excludes the blank line after the header
fields.  In other words, the "header" would be separated by a blank
line from the "body", and that separating blank line is not part of
"header" or "body".

Otherwise, if there is a user of %(header), it needs to be coded to
ignore the last blank line but has to diagnose it as an error if
there is a blank line before that.

> +			if (atom->u.header.option == H_BARE) {
> +				v->s = xmemdupz(buf, header_len);
> +			} else if (atom->u.header.option == H_LENGTH)
> +				v->s = xstrfmt("%"PRIuMAX, (uintmax_t)header_len);
> +			continue;
> +		}
> +
>  		if (atom->u.contents.option == C_SUB)
>  			v->s = copy_subject(subpos, sublen);
>  		else if (atom->u.contents.option == C_SUB_SANITIZE) {

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/2] [GSOC] ref-filter: add %(raw) atom
  2021-05-27 16:36   ` Felipe Contreras
@ 2021-05-28 13:02     ` ZheNing Hu
  2021-05-28 16:30       ` Felipe Contreras
  2021-05-29 13:23     ` Phillip Wood
  1 sibling, 1 reply; 41+ messages in thread
From: ZheNing Hu @ 2021-05-28 13:02 UTC (permalink / raw)
  To: Felipe Contreras
  Cc: ZheNing Hu via GitGitGadget, Git List, Junio C Hamano,
	Christian Couder, Hariom Verma, Karthik Nayak, Bagas Sanjaya,
	Jeff King

Felipe Contreras <felipe.contreras@gmail.com> 于2021年5月28日周五 上午12:36写道:
>
> ZheNing Hu via GitGitGadget wrote:
>
> > +static int raw_atom_parser(const struct ref_format *format, struct used_atom *atom,
> > +                             const char *arg, struct strbuf *err)
> > +{
> > +     if (!arg) {
> > +             atom->u.raw_data.option = RAW_BARE;
> > +     } else if (!strcmp(arg, "size"))
>
> No need for braces.
>
>   if (!arg)
>     ...
>   else
>

I sometimes forget this detail, I will pay attention.

> > @@ -1307,10 +1349,22 @@ static void grab_sub_body_contents(struct atom_value *val, int deref, void *buf)
> >                       continue;
> >               if (deref)
> >                       name++;
> > -             if (strcmp(name, "body") &&
> > -                 !starts_with(name, "subject") &&
> > -                 !starts_with(name, "trailers") &&
> > -                 !starts_with(name, "contents"))
> > +
> > +             if (starts_with(name, "raw")) {
> > +                     if (atom->u.raw_data.option == RAW_BARE) {
> > +                             v->s = xmemdupz(buf, buf_size);
> > +                             v->s_size = buf_size;
> > +                     } else if (atom->u.raw_data.option == RAW_LENGTH)
> > +                             v->s = xstrfmt("%"PRIuMAX, (uintmax_t)buf_size);
>
> I think it's better to be consistent: if you used braces in the if, uses
> braces in else.
>

OK.

> > +                     continue;
> > +             }
>
> > +static int memcasecmp(const void *vs1, const void *vs2, size_t n)
>
> Why void *? We can delcare as char *.
>
> > +{
> > +     size_t i;
> > +     const char *s1 = (const char *)vs1;
> > +     const char *s2 = (const char *)vs2;
>
> Then we avoid this extra step.
>
> > +     for (i = 0; i < n; i++) {
> > +             unsigned char u1 = s1[i];
> > +             unsigned char u2 = s2[i];
>
> There's no need for two entirely new variables...
>
> > +             int U1 = toupper (u1);
> > +             int U2 = toupper (u2);
>
> You can do toupper(s1[i]) directly (BTW, there's an extra space: `foo(x)`,
> not `foo (x)`).
>
> While we are at it, why keep an extra index from s1, when s1 is never
> used again?
>
> We can simply advance both s1 and s2:
>
>   s1++, s2++
>
> > +             int diff = (UCHAR_MAX <= INT_MAX ? U1 - U2
> > +                     : U1 < U2 ? -1 : U2 < U1);
>
> I don't understand what this is supposed to achieve. Both U1 and U2 are
> integers, pretty low integers actually.
>
> If we get rid if that complexity we don't even need U1 or U2, just do:
>
>   diff = toupper(u1) - toupper(u2);
>
> > +             if (diff)
> > +                     return diff;
> > +     }
> > +     return 0;
> > +}
>
> All we have to do is define the end point, and then we don't need i:
>
>         static int memcasecmp(const char *s1, const char *s2, size_t n)
>         {
>                 const char *end = s1 + n;
>                 for (; s1 < end; s1++, s2++) {
>                         int diff = tolower(*s1) - tolower(*s2);
>                         if (diff)
>                                 return diff;
>                 }
>                 return 0;
>         }
>
> (and I personally prefer lower to upper)
>

Sorry for the weird, unclean `memcasecmp()`, I referred to memcmp()
in glibc before, and then I was afraid that my writing was not standard
enough like "UCHAR_MAX <= INT_MAX", I can't consider such an
extreme situation. So I copied it directly from gnulib:
https://github.com/gagern/gnulib/blob/master/lib/memcasecmp.c

> Check the following resource for a detailed explanation of why my
> modified version is considered good taste:
>
> https://github.com/felipec/linked-list-good-taste
>

OK. I will gradually standardize my code style.

> >  static int cmp_ref_sorting(struct ref_sorting *s, struct ref_array_item *a, struct ref_array_item *b)
> >  {
> >       struct atom_value *va, *vb;
> > @@ -2304,6 +2382,7 @@ static int cmp_ref_sorting(struct ref_sorting *s, struct ref_array_item *a, stru
> >       int cmp_detached_head = 0;
> >       cmp_type cmp_type = used_atom[s->atom].type;
> >       struct strbuf err = STRBUF_INIT;
> > +     size_t slen = 0;
> >
> >       if (get_ref_atom_value(a, s->atom, &va, &err))
> >               die("%s", err.buf);
> > @@ -2317,10 +2396,32 @@ static int cmp_ref_sorting(struct ref_sorting *s, struct ref_array_item *a, stru
> >       } else if (s->sort_flags & REF_SORTING_VERSION) {
> >               cmp = versioncmp(va->s, vb->s);
> >       } else if (cmp_type == FIELD_STR) {
> > -             int (*cmp_fn)(const char *, const char *);
> > -             cmp_fn = s->sort_flags & REF_SORTING_ICASE
> > -                     ? strcasecmp : strcmp;
> > -             cmp = cmp_fn(va->s, vb->s);
> > +             if (va->s_size == ATOM_VALUE_S_SIZE_INIT &&
> > +                 vb->s_size == ATOM_VALUE_S_SIZE_INIT) {
> > +                     int (*cmp_fn)(const char *, const char *);
> > +                     cmp_fn = s->sort_flags & REF_SORTING_ICASE
> > +                             ? strcasecmp : strcmp;
> > +                     cmp = cmp_fn(va->s, vb->s);
> > +             } else {
> > +                     int (*cmp_fn)(const void *, const void *, size_t);
> > +                     cmp_fn = s->sort_flags & REF_SORTING_ICASE
> > +                             ? memcasecmp : memcmp;
> > +
> > +                     if (va->s_size != ATOM_VALUE_S_SIZE_INIT &&
> > +                         vb->s_size != ATOM_VALUE_S_SIZE_INIT) {
> > +                             cmp = cmp_fn(va->s, vb->s, va->s_size > vb->s_size ?
> > +                                    vb->s_size : va->s_size);
> > +                     } else if (va->s_size == ATOM_VALUE_S_SIZE_INIT) {
> > +                             slen = strlen(va->s);
> > +                             cmp = cmp_fn(va->s, vb->s, slen > vb->s_size ?
> > +                                          vb->s_size : slen);
> > +                     } else {
> > +                             slen = strlen(vb->s);
> > +                             cmp = cmp_fn(va->s, vb->s, slen > va->s_size ?
> > +                                          slen : va->s_size);
> > +                     }
> > +                     cmp = cmp ? cmp : va->s_size - vb->s_size;
> > +             }
>
> This hurts my eyes. I think the complexity of this chunk warrants a
> separate function. Then the logic would be easer to see.
>

Fine. This piece of the situation is a bit complicated...

> Cheers.
>
> --
> Felipe Contreras

Thanks.
--
ZheNing Hu

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/2] [GSOC] ref-filter: add %(raw) atom
  2021-05-28  3:03   ` Junio C Hamano
@ 2021-05-28 15:04     ` ZheNing Hu
  2021-05-28 16:38       ` Felipe Contreras
  2021-05-30  8:11       ` ZheNing Hu
  0 siblings, 2 replies; 41+ messages in thread
From: ZheNing Hu @ 2021-05-28 15:04 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: ZheNing Hu via GitGitGadget, Git List, Christian Couder,
	Hariom Verma, Karthik Nayak, Felipe Contreras, Bagas Sanjaya,
	Jeff King

Junio C Hamano <gitster@pobox.com> 于2021年5月28日周五 上午11:04写道:
>
> "ZheNing Hu via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > The raw data of blob, tree objects may contain '\0', but most of
> > the logic in `ref-filter` depands on the output of the atom being
> > a structured string (end with '\0').
>
> Text, yes, string is also OK.  But structured?  Probably not.
>
>     ... being text (specifically, no embedded NULs in it).
>

OK.

> > Beyond, `--format=%(raw)` should not combine with `--python`, `--shell`,
> > `--tcl`, `--perl` because if our binary raw data is passed to a variable
> > in the host language, the host languages may cause escape errors.
>
> OK.  I think at least --perl and possibly --python should be able to
> express NULs in the "string" type we use from the host language, but
> I am perfectly fine with the decision to leave it to later updates.
>

Yes, for the time being, unified processing --<lang> will be easier.

> > +Note that `--format=%(raw)` should not combine with `--python`, `--shell`, `--tcl`,
>
> "should not combine" -> "cannot be used" would make it read more
> naturally (ditto for the phrase used in the proposed log message).
>

OK, so the wording is better.

> >
> >  struct atom_value {
> >       const char *s;
> > +     size_t s_size;
>
> OK, so everything used to be a C-string that cannot hold NULs in it,
> but now it is a counted <ptr, len> string.  Good.
>

This suddenly reminded me of strbuf...
I don't know if it is worth replacing all s, s_size with strbuf.

> > @@ -588,6 +607,10 @@ static int parse_ref_filter_atom(const struct ref_format *format,
> >               return strbuf_addf_ret(err, -1, _("malformed field name: %.*s"),
> >                                      (int)(ep-atom), atom);
> >
> > +     if (format->quote_style && starts_with(sp, "raw"))
> > +             return strbuf_addf_ret(err, -1, _("--format=%.*s should not combine with"
> > +                             "--python, --shell, --tcl, --perl"), (int)(ep-atom), atom);
>
> Don't we want to allow "raw:size" that would be a plain text?
> I am not sure if this check belongs here in the first place.
> Shouldn't it be done in raw_atom_parser() instead?
>

You are right: "raw:size" should be keep, but I can't this check to
raw_atom_parser(), becase "if I move it to raw_atom_parser(), when we
use:

`git ref-filter --format=%raw --sort=raw --python`

Git will continue to run instead of die because parse_sorting_atom() will
use a dummy ref_format and don't remember --language details, next time
format_ref_array_item() will reuse the used_atom entry of sorting atom in
parse_ref_filter_atom(), this will skip the check in raw_atom_parser()."

> Another idea is to teach a more generic rule to quote_formatting()
> to detect NULs in v->s[0..v->s_size] at runtime and barf, i.e. a
> plain-text blob object can be used with "--shell --format=%(raw)"
> just fine.
>

The cost of such a check is not small. Maybe can add an option
such as "--only-text" to do it.

> > @@ -652,11 +675,14 @@ static int parse_ref_filter_atom(const struct ref_format *format,
> >       return at;
> >  }
> >
> > -static void quote_formatting(struct strbuf *s, const char *str, int quote_style)
> > +static void quote_formatting(struct strbuf *s, const char *str, size_t len, int quote_style)
> >  {
> >       switch (quote_style) {
> >       case QUOTE_NONE:
> > -             strbuf_addstr(s, str);
> > +             if (len != ATOM_VALUE_S_SIZE_INIT)
> > +                     strbuf_add(s, str, len);
> > +             else
> > +                     strbuf_addstr(s, str);
>
> It probably is a good idea to invent a C preprocessor macro for a
> named constant to be used when a structure is initialized, but it
> would be easier to read if the rule is "len field is negative when
> the value is a C-string", e.g.
>
>                 if (len < 0)
>

I do not recognize such an approach because we are deal
with "size_t s_size", if (len < 0) will never be established.
I use -1 is because it's equal to 18446744073709551615
and it's impossible to have such a large file in Git.

> > @@ -785,14 +814,16 @@ static int if_atom_handler(struct atom_value *atomv, struct ref_formatting_state
> >       return 0;
> >  }
> >
> > -static int is_empty(const char *s)
> > +static int is_empty(struct strbuf *buf)
> >  {
> > -     while (*s != '\0') {
> > -             if (!isspace(*s))
> > -                     return 0;
> > +     const char *s = buf->buf;
> > +     size_t cur_len = 0;
> > +
> > +     while ((cur_len != buf->len) && (isspace(*s) || *s == '\0')) {
> >               s++;
> > +             cur_len++;
> >       }
> > -     return 1;
> > +     return cur_len == buf->len;
> >  }
>
> Assuming that we do want to treat NULs the same way as whitespaces,
> the updated code works as intended, which is good.  But I have no
> reason to support that design decision.  I do not have a strong
> reason to support a design decision that goes the opposite way to
> treat a NUL just like we treat an 'X', but at least I can understand
> it (i.e. "because we have no reason to special case NUL any more
> than 'X' when trying to see if a buffer is 'empty', we don't").
> This code on the other hand must be supported with "because we need
> to special case NUL for such and such reasons for the purpose of
> determining if a buffer is 'empty', we treat them the same way as
> whitespaces".
>

Something like "\0abc", from the perspective of the string, it is empty;
from the perspective of the memory, it is not empty; I don't know any
absolutely good solutions here.

> Can the change in this commit violate the invariant that
> if_then_else->str cannot be NULL, which seems to have been the case
> forever as we see an unchecked strcmp() done in the original?
>
> If so, perhaps you can check the condition upfront, where you
> compute str_len above, e.g.
>
>         if (!if_then_else->str) {
>                 if (if_then_else->cmp_status == COMPARE_EQUAL ||
>                     if_then_else->cmp_status == COMPARE_UNEQUAL)
>                         BUG(...);
>         } else
>                 str_len = strlen(...);
>
> If not, then I do not see the point of adding this (and later) check
> with BUG to this code.
>
> Or is the invariant that .str must not be NULL could have been
> violated without this patch (i.e. the original was buggy in running
> strcmp() on .str without checking)?  If so, please make it a separate
> preliminary change to add such an assert.
>

The BUG() here actually acts as an "assert()". ".str must not be NULL" is
right, it point to "xxx" in "%(if:equals=xxx)", so it seems that these BUG()
are somewhat redundant, I will remove them.

> >  /* See grab_values */
> > -static void grab_sub_body_contents(struct atom_value *val, int deref, void *buf)
> > +static void grab_raw_data(struct atom_value *val, int deref, void *buf, unsigned long buf_size, struct object *obj)
> >  {
> >       int i;
> >       const char *subpos = NULL, *bodypos = NULL, *sigpos = NULL;
> > @@ -1307,10 +1349,22 @@ static void grab_sub_body_contents(struct atom_value *val, int deref, void *buf)
> >                       continue;
> >               if (deref)
> >                       name++;
> > -             if (strcmp(name, "body") &&
> > -                 !starts_with(name, "subject") &&
> > -                 !starts_with(name, "trailers") &&
> > -                 !starts_with(name, "contents"))
> > +
> > +             if (starts_with(name, "raw")) {
> > +                     if (atom->u.raw_data.option == RAW_BARE) {
> > +                             v->s = xmemdupz(buf, buf_size);
> > +                             v->s_size = buf_size;
> > +                     } else if (atom->u.raw_data.option == RAW_LENGTH)
> > +                             v->s = xstrfmt("%"PRIuMAX, (uintmax_t)buf_size);
> > +                     continue;
> > +             }
>
> I can understand that "raw[:<options>]" handling has been inserted
> above the existing "from here on, we only deal with log message
> components" check.  But
>
> > +             if ((obj->type != OBJ_TAG &&
> > +                  obj->type != OBJ_COMMIT) ||
>
> I do not see why these new conditions are added.  The change is not
> justified in the proposed log message, the original did not need
> these conditions, and this does not concern the primary point of the
> change, which is to start supporting %(raw[:<option>]) placeholder.
>
> If it is needed as a bugfix (e.g. it may be that you consider "if a
> blob has contents that looks very similar to 'git cat-file commit
> HEAD', %(body) and friends parse these out, even though it is not a
> commit" is a bug and the change to add these extra tests is meant as
> a fix), that should be done as a preliminary change before adding
> the support for a new atom.
>

Almost what I means: Make a strong guarantee that blob and tree
will never pass the check so that we can don't worry about incorrect
parsing in find_subpos(). The reason I put it in this patch is that only
commit and tag objects will execute `grab_sub_body_contents()` before,
but in the current patch it has changed.

> > +                 (strcmp(name, "body") &&
> > +                  !starts_with(name, "subject") &&
> > +                  !starts_with(name, "trailers") &&
> > +                  !starts_with(name, "contents")))
> >                       continue;
> >               if (!subpos)
> >                       find_subpos(buf,
> > @@ -1374,25 +1428,30 @@ static void fill_missing_values(struct atom_value *val)
> >   * pointed at by the ref itself; otherwise it is the object the
> >   * ref (which is a tag) refers to.
> >   */
> > -static void grab_values(struct atom_value *val, int deref, struct object *obj, void *buf)
> > +static void grab_values(struct atom_value *val, int deref, struct object *obj, struct expand_data *data)
> >  {
> > +     void *buf = data->content;
> > +     unsigned long buf_size = data->size;
> > +
> >       switch (obj->type) {
> >       case OBJ_TAG:
> >               grab_tag_values(val, deref, obj);
> > -             grab_sub_body_contents(val, deref, buf);
> > +             grab_raw_data(val, deref, buf, buf_size, obj);
>
> It is very strange that a helper that is named to grab raw data can
> still process pieces out of a structured data.  The original name is
> still a far better match to what the function does, even after this
> patch teaches it to also honor %(raw) placeholder.  It is still
> about grabbing various "sub"-pieces out of "body contents", and the
> sub-piece the %(raw) grabs just happens to be "the whole thing".
>

Well, I can't think of a better name, My original idea was grab_raw_data()
can grab itself, header, contents, It is more general than
grab_sub_body_contents(),
raw data is not a part of "subject" or "body" of "contents"...

> > +static int memcasecmp(const void *vs1, const void *vs2, size_t n)
> > +{
> > +     size_t i;
> > +     const char *s1 = (const char *)vs1;
> > +     const char *s2 = (const char *)vs2;
> > +
> > +     for (i = 0; i < n; i++) {
> > +             unsigned char u1 = s1[i];
> > +             unsigned char u2 = s2[i];
> > +             int U1 = toupper (u1);
> > +             int U2 = toupper (u2);
>
> Does toupper('\0') even have a defined meaning?
>
> > +             int diff = (UCHAR_MAX <= INT_MAX ? U1 - U2
> > +                     : U1 < U2 ? -1 : U2 < U1);
>
> Looks crazy to worry about uchar wider than int.  Such a system is
> not even standard compliant, is it?
>

Forget about this inelegant help function. As I said in my reply to Felipe,
this is copied from gunlib...

> > @@ -2304,6 +2382,7 @@ static int cmp_ref_sorting(struct ref_sorting *s, struct ref_array_item *a, stru
> >       int cmp_detached_head = 0;
> >       cmp_type cmp_type = used_atom[s->atom].type;
> >       struct strbuf err = STRBUF_INIT;
> > +     size_t slen = 0;
> >
> >       if (get_ref_atom_value(a, s->atom, &va, &err))
> >               die("%s", err.buf);
> > @@ -2317,10 +2396,32 @@ static int cmp_ref_sorting(struct ref_sorting *s, struct ref_array_item *a, stru
> >       } else if (s->sort_flags & REF_SORTING_VERSION) {
> >               cmp = versioncmp(va->s, vb->s);
> >       } else if (cmp_type == FIELD_STR) {
> > -             int (*cmp_fn)(const char *, const char *);
> > -             cmp_fn = s->sort_flags & REF_SORTING_ICASE
> > -                     ? strcasecmp : strcmp;
> > -             cmp = cmp_fn(va->s, vb->s);
>
> > +             if (va->s_size == ATOM_VALUE_S_SIZE_INIT &&
> > +                 vb->s_size == ATOM_VALUE_S_SIZE_INIT) {
> > +                     int (*cmp_fn)(const char *, const char *);
> > +                     cmp_fn = s->sort_flags & REF_SORTING_ICASE
> > +                             ? strcasecmp : strcmp;
> > +                     cmp = cmp_fn(va->s, vb->s);
> > +             } else {
> > +                     int (*cmp_fn)(const void *, const void *, size_t);
> > +                     cmp_fn = s->sort_flags & REF_SORTING_ICASE
> > +                             ? memcasecmp : memcmp;
>
> Why not introduce two local temporary variables a_size and b_size
> and initialize them upfront like so:
>
>                 a_size = va->s_size < 0 ? strlen(va->s) : va->s_size;
>                 b_size = vb->s_size < 0 ? strlen(vb->s) : vb->s_size;
>
> Wouldn't that allow you to do without the complex "if both are
> counted, do this, if A is counted but B is not, do that, ..."
> cascade?
>

Sorry, such code would be really ugly for reading.

> I can sort-of see the point of special casing "both are traditional
> C strings" case (i.e. the "if" side of the "else" we are discussing
> here) and using strcasecmp/strcmp instead of memcasecmp/memcmp, but
> I do not see much point in having the if/elseif/else cascade inside
> this "else" clause.
>

I will try to modify its logic.

> Thanks.

Your reply is very accurate.

Thanks.
--
ZheNing Hu

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/2] [GSOC] ref-filter: add %(header) atom
  2021-05-28  4:36   ` Junio C Hamano
@ 2021-05-28 15:19     ` ZheNing Hu
  0 siblings, 0 replies; 41+ messages in thread
From: ZheNing Hu @ 2021-05-28 15:19 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: ZheNing Hu via GitGitGadget, Git List, Christian Couder,
	Hariom Verma, Karthik Nayak, Felipe Contreras, Bagas Sanjaya,
	Jeff King

Junio C Hamano <gitster@pobox.com> 于2021年5月28日周五 下午12:36写道:
>
> "ZheNing Hu via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> >               struct {
> >                       enum { RAW_BARE, RAW_LENGTH } option;
> >               } raw_data;
> > +             struct {
> > +                     enum { H_BARE, H_LENGTH } option;
> > +             } header;
>
> Raw does not use R_{BARE,LENGTH} and uses raw_data member.  Header
> should follow suit unless there is a compelling reason not to, no?
>
>                 struct {
>                         enum { HEADER_BARE, HEADER_LENGTH } option;
>                 } header_data;
>
> perhaps?
>

OK.


> > @@ -1372,6 +1389,15 @@ static void grab_raw_data(struct atom_value *val, int deref, void *buf, unsigned
> >                                   &bodypos, &bodylen, &nonsiglen,
> >                                   &sigpos, &siglen);
> >
> > +             if (starts_with(name, "header")) {
> > +                     size_t header_len = subpos - (const char *)buf - 1;
>
> Hmph, is this correct?  I would expect that the "header" part of a
> commit or a tag object excludes the blank line after the header
> fields.  In other words, the "header" would be separated by a blank
> line from the "body", and that separating blank line is not part of
> "header" or "body".
>
> Otherwise, if there is a user of %(header), it needs to be coded to
> ignore the last blank line but has to diagnose it as an error if
> there is a blank line before that.
>

I am a bit confused, Is there any problem with me doing this?

> > +                     size_t header_len = subpos - (const char *)buf - 1;

"header" part starts from "buf" and header_len have minus 1 so that
header part will not touch the blank line. At the same time, "contents"
part starts from subpos, and it also does not touch the blank line.

> While having this may not be wrong, I am not sure who needs it.  Is
> your "cat-file --batch" topic needs this new atom?

Ok, I will remove it from this topic temporarily.

Thanks.
--
ZheNing Hu

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/2] [GSOC] ref-filter: add %(raw) atom
  2021-05-28 13:02     ` ZheNing Hu
@ 2021-05-28 16:30       ` Felipe Contreras
  2021-05-30  5:37         ` ZheNing Hu
  0 siblings, 1 reply; 41+ messages in thread
From: Felipe Contreras @ 2021-05-28 16:30 UTC (permalink / raw)
  To: ZheNing Hu, Felipe Contreras
  Cc: ZheNing Hu via GitGitGadget, Git List, Junio C Hamano,
	Christian Couder, Hariom Verma, Karthik Nayak, Bagas Sanjaya,
	Jeff King

ZheNing Hu wrote:
> Sorry for the weird, unclean `memcasecmp()`, I referred to memcmp()
> in glibc before, and then I was afraid that my writing was not standard
> enough like "UCHAR_MAX <= INT_MAX", I can't consider such an
> extreme situation. So I copied it directly from gnulib:
> https://github.com/gagern/gnulib/blob/master/lib/memcasecmp.c

Yeah, I imagined you copied it from somewhere, but when you do that you
need to transform the code to the style of the project. I've seen GNU
code, and in my opinion it's too verbose and redundant. Not a good
style.

But more importantly: at the header of that file you can see the license
is GPLv3, that's incompatible with the license of this project, which is
GPLv2 only (see the note in COPYING).

You can't just copy code like that. You need to be careful.

And if you do copy code--even if allowed by the license--it's something
that should be mentioned in the commit message, preferably with a link
to the original, that way if there's trouble in the future with that
code, we can follow the link and figure out why it was done that way.

Also, it's just nice to give attribution to the people that wrote the
original code.

> > Check the following resource for a detailed explanation of why my
> > modified version is considered good taste:
> >
> > https://github.com/felipec/linked-list-good-taste
> 
> OK. I will gradually standardize my code style.

It is not a standard, it is my personal opinion, which is shared by
Linus Torvalds, and I presume other members of the Git project.

The style is not something that can be standardized, you get a feeling
of it as you read more code of the project, write, and then receive
feedback on what you write.

It's like learning the slang of a new city; it takes a while.

Cheers.

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/2] [GSOC] ref-filter: add %(raw) atom
  2021-05-28 15:04     ` ZheNing Hu
@ 2021-05-28 16:38       ` Felipe Contreras
  2021-05-30  8:11       ` ZheNing Hu
  1 sibling, 0 replies; 41+ messages in thread
From: Felipe Contreras @ 2021-05-28 16:38 UTC (permalink / raw)
  To: ZheNing Hu, Junio C Hamano
  Cc: ZheNing Hu via GitGitGadget, Git List, Christian Couder,
	Hariom Verma, Karthik Nayak, Felipe Contreras, Bagas Sanjaya,
	Jeff King

ZheNing Hu wrote:
> Junio C Hamano <gitster@pobox.com> 于2021年5月28日周五 上午11:04写道:
> > "ZheNing Hu via GitGitGadget" <gitgitgadget@gmail.com> writes:

> > > +static int memcasecmp(const void *vs1, const void *vs2, size_t n)
> > > +{
> > > +     size_t i;
> > > +     const char *s1 = (const char *)vs1;
> > > +     const char *s2 = (const char *)vs2;
> > > +
> > > +     for (i = 0; i < n; i++) {
> > > +             unsigned char u1 = s1[i];
> > > +             unsigned char u2 = s2[i];
> > > +             int U1 = toupper (u1);
> > > +             int U2 = toupper (u2);
> >
> > Does toupper('\0') even have a defined meaning?

> Forget about this inelegant help function. As I said in my reply to Felipe,
> this is copied from gunlib...

Even if you use my modified version (which hopefully is not so
inelegant), the comment about toupper('\0') still applies.

My reading of `man toupper(3)` is that if c is neither lowercase or
uppercase it is returned as-is. As long as it's unsigned char, which
'\0' is.

So I think the behavior is indeed defined.

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/2] [GSOC] ref-filter: add %(raw) atom
  2021-05-27 16:36   ` Felipe Contreras
  2021-05-28 13:02     ` ZheNing Hu
@ 2021-05-29 13:23     ` Phillip Wood
  2021-05-29 15:24       ` Felipe Contreras
  2021-05-30  6:26       ` ZheNing Hu
  1 sibling, 2 replies; 41+ messages in thread
From: Phillip Wood @ 2021-05-29 13:23 UTC (permalink / raw)
  To: Felipe Contreras, ZheNing Hu via GitGitGadget, git
  Cc: Junio C Hamano, Christian Couder, Hariom Verma, Karthik Nayak,
	Bagas Sanjaya, Jeff King, ZheNing Hu

On 27/05/2021 17:36, Felipe Contreras wrote:
> ZheNing Hu via GitGitGadget wrote:
> [...]
>> +static int memcasecmp(const void *vs1, const void *vs2, size_t n)
> 
> Why void *? We can delcare as char *.

If you look at how this function is used you'll see
	int (*cmp_fn)(const void *, const void *, size_t);
	cmp_fn = s->sort_flags & REF_SORTING_ICASE
			? memcasecmp : memcmp;

So the signature must match memcmp to avoid undefined behavior (a 
ternary expression is undefined unless both sides evaluate to the same 
type and calling a function through a pointer a different type is 
undefined as well)

>> +{
>> +	size_t i;
>> +	const char *s1 = (const char *)vs1;
>> +	const char *s2 = (const char *)vs2;
> 
> Then we avoid this extra step.
> 
>> +	for (i = 0; i < n; i++) {
>> +		unsigned char u1 = s1[i];
>> +		unsigned char u2 = s2[i];
> 
> There's no need for two entirely new variables...
> 
>> +		int U1 = toupper (u1);
>> +		int U2 = toupper (u2);
> 
> You can do toupper(s1[i]) directly (BTW, there's an extra space: `foo(x)`,
> not `foo (x)`).
> 
> While we are at it, why keep an extra index from s1, when s1 is never
> used again?
> 
> We can simply advance both s1 and s2:
> 
>    s1++, s2++
> 
>> +		int diff = (UCHAR_MAX <= INT_MAX ? U1 - U2
>> +			: U1 < U2 ? -1 : U2 < U1);
> 
> I don't understand what this is supposed to achieve. Both U1 and U2 are
> integers, pretty low integers actually.
> 
> If we get rid if that complexity we don't even need U1 or U2, just do:
> 
>    diff = toupper(u1) - toupper(u2);
> 
>> +		if (diff)
>> +			return diff;
>> +	}
>> +	return 0;
>> +}
> 
> All we have to do is define the end point, and then we don't need i:
> 
> 	static int memcasecmp(const char *s1, const char *s2, size_t n)
> 	{
> 		const char *end = s1 + n;
> 		for (; s1 < end; s1++, s2++) {
> 			int diff = tolower(*s1) - tolower(*s2);
> 			if (diff)
> 				return diff;
> 		}
> 		return 0;
> 	}
> 
> (and I personally prefer lower to upper)

We should be using tolower() as that is what POSIX specifies for 
strcasecmp() [1] which we are trying to emulate and there are cases[2] where
	(tolower(c1) == tolower(c2)) != (toupper(c1) == toupper(c2))

Best Wishes

Phillip

[1] https://pubs.opengroup.org/onlinepubs/9699919799/
[2] https://en.wikipedia.org/wiki/Dotted_and_dotless_I#In_computing

> Check the following resource for a detailed explanation of why my
> modified version is considered good taste:
> 
> https://github.com/felipec/linked-list-good-taste
> 
>>   static int cmp_ref_sorting(struct ref_sorting *s, struct ref_array_item *a, struct ref_array_item *b)
>>   {
>>   	struct atom_value *va, *vb;
>> @@ -2304,6 +2382,7 @@ static int cmp_ref_sorting(struct ref_sorting *s, struct ref_array_item *a, stru
>>   	int cmp_detached_head = 0;
>>   	cmp_type cmp_type = used_atom[s->atom].type;
>>   	struct strbuf err = STRBUF_INIT;
>> +	size_t slen = 0;
>>   
>>   	if (get_ref_atom_value(a, s->atom, &va, &err))
>>   		die("%s", err.buf);
>> @@ -2317,10 +2396,32 @@ static int cmp_ref_sorting(struct ref_sorting *s, struct ref_array_item *a, stru
>>   	} else if (s->sort_flags & REF_SORTING_VERSION) {
>>   		cmp = versioncmp(va->s, vb->s);
>>   	} else if (cmp_type == FIELD_STR) {
>> -		int (*cmp_fn)(const char *, const char *);
>> -		cmp_fn = s->sort_flags & REF_SORTING_ICASE
>> -			? strcasecmp : strcmp;
>> -		cmp = cmp_fn(va->s, vb->s);
>> +		if (va->s_size == ATOM_VALUE_S_SIZE_INIT &&
>> +		    vb->s_size == ATOM_VALUE_S_SIZE_INIT) {
>> +			int (*cmp_fn)(const char *, const char *);
>> +			cmp_fn = s->sort_flags & REF_SORTING_ICASE
>> +				? strcasecmp : strcmp;
>> +			cmp = cmp_fn(va->s, vb->s);
>> +		} else {
>> +			int (*cmp_fn)(const void *, const void *, size_t);
>> +			cmp_fn = s->sort_flags & REF_SORTING_ICASE
>> +				? memcasecmp : memcmp;
>> +
>> +			if (va->s_size != ATOM_VALUE_S_SIZE_INIT &&
>> +			    vb->s_size != ATOM_VALUE_S_SIZE_INIT) {
>> +				cmp = cmp_fn(va->s, vb->s, va->s_size > vb->s_size ?
>> +				       vb->s_size : va->s_size);
>> +			} else if (va->s_size == ATOM_VALUE_S_SIZE_INIT) {
>> +				slen = strlen(va->s);
>> +				cmp = cmp_fn(va->s, vb->s, slen > vb->s_size ?
>> +					     vb->s_size : slen);
>> +			} else {
>> +				slen = strlen(vb->s);
>> +				cmp = cmp_fn(va->s, vb->s, slen > va->s_size ?
>> +					     slen : va->s_size);
>> +			}
>> +			cmp = cmp ? cmp : va->s_size - vb->s_size;
>> +		}
> 
> This hurts my eyes. I think the complexity of this chunk warrants a
> separate function. Then the logic would be easer to see.
> 
> Cheers.
> 


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/2] [GSOC] ref-filter: add %(raw) atom
  2021-05-29 13:23     ` Phillip Wood
@ 2021-05-29 15:24       ` Felipe Contreras
  2021-05-29 17:23         ` Phillip Wood
  2021-05-30  6:29         ` ZheNing Hu
  2021-05-30  6:26       ` ZheNing Hu
  1 sibling, 2 replies; 41+ messages in thread
From: Felipe Contreras @ 2021-05-29 15:24 UTC (permalink / raw)
  To: Phillip Wood, Felipe Contreras, ZheNing Hu via GitGitGadget, git
  Cc: Junio C Hamano, Christian Couder, Hariom Verma, Karthik Nayak,
	Bagas Sanjaya, Jeff King, ZheNing Hu

Phillip Wood wrote:
> On 27/05/2021 17:36, Felipe Contreras wrote:
> > ZheNing Hu via GitGitGadget wrote:
> > [...]
> >> +static int memcasecmp(const void *vs1, const void *vs2, size_t n)
> > 
> > Why void *? We can delcare as char *.
> 
> If you look at how this function is used you'll see
> 	int (*cmp_fn)(const void *, const void *, size_t);
> 	cmp_fn = s->sort_flags & REF_SORTING_ICASE
> 			? memcasecmp : memcmp;

Yeah, but why?

We know we are comparing two char *. Presumably the reason is that
memcmp and memcasecmp use void *, but that could be remedied with:

	cmp_fn = (int (*)(const char *, const char *, size_t))memcmp;

That way the same cmp_fn could be used for the two cases.

Either way I don't care particularly much. It also could be possible to
use void * and do the casting in tolower().
 
> > (and I personally prefer lower to upper)
> 
> We should be using tolower() as that is what POSIX specifies for 
> strcasecmp() [1] which we are trying to emulate and there are cases[2] where
> 	(tolower(c1) == tolower(c2)) != (toupper(c1) == toupper(c2))

That's true.

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/2] [GSOC] ref-filter: add %(raw) atom
  2021-05-29 15:24       ` Felipe Contreras
@ 2021-05-29 17:23         ` Phillip Wood
  2021-05-30  6:29         ` ZheNing Hu
  1 sibling, 0 replies; 41+ messages in thread
From: Phillip Wood @ 2021-05-29 17:23 UTC (permalink / raw)
  To: Felipe Contreras, ZheNing Hu via GitGitGadget, git
  Cc: Junio C Hamano, Christian Couder, Hariom Verma, Karthik Nayak,
	Bagas Sanjaya, Jeff King, ZheNing Hu

On 29/05/2021 16:24, Felipe Contreras wrote:
> Phillip Wood wrote:
>> On 27/05/2021 17:36, Felipe Contreras wrote:
>>> ZheNing Hu via GitGitGadget wrote:
>>> [...]
>>>> +static int memcasecmp(const void *vs1, const void *vs2, size_t n)
>>>
>>> Why void *? We can delcare as char *.
>>
>> If you look at how this function is used you'll see
>> 	int (*cmp_fn)(const void *, const void *, size_t);
>> 	cmp_fn = s->sort_flags & REF_SORTING_ICASE
>> 			? memcasecmp : memcmp;
> 
> Yeah, but why?
> 
> We know we are comparing two char *. Presumably the reason is that
> memcmp and memcasecmp use void *, but that could be remedied with:
> 
> 	cmp_fn = (int (*)(const char *, const char *, size_t))memcmp;
> 
> That way the same cmp_fn could be used for the two cases.

But that is still undefined behavior - the ugly cast just silences any 
compiler warning without making the code safe. It calls memcmp using a 
pointer of a different type. The type of cmp_fn and the two functions 
assigned to it must match.

Best Wishes

Phillip

> Either way I don't care particularly much. It also could be possible to
> use void * and do the casting in tolower().
>   
>>> (and I personally prefer lower to upper)
>>
>> We should be using tolower() as that is what POSIX specifies for
>> strcasecmp() [1] which we are trying to emulate and there are cases[2] where
>> 	(tolower(c1) == tolower(c2)) != (toupper(c1) == toupper(c2))
> 
> That's true.
> 


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/2] [GSOC] ref-filter: add %(raw) atom
  2021-05-28 16:30       ` Felipe Contreras
@ 2021-05-30  5:37         ` ZheNing Hu
  0 siblings, 0 replies; 41+ messages in thread
From: ZheNing Hu @ 2021-05-30  5:37 UTC (permalink / raw)
  To: Felipe Contreras
  Cc: ZheNing Hu via GitGitGadget, Git List, Junio C Hamano,
	Christian Couder, Hariom Verma, Karthik Nayak, Bagas Sanjaya,
	Jeff King

Felipe Contreras <felipe.contreras@gmail.com> 于2021年5月29日周六 上午12:30写道:
>
> ZheNing Hu wrote:
> > Sorry for the weird, unclean `memcasecmp()`, I referred to memcmp()
> > in glibc before, and then I was afraid that my writing was not standard
> > enough like "UCHAR_MAX <= INT_MAX", I can't consider such an
> > extreme situation. So I copied it directly from gnulib:
> > https://github.com/gagern/gnulib/blob/master/lib/memcasecmp.c
>
> Yeah, I imagined you copied it from somewhere, but when you do that you
> need to transform the code to the style of the project. I've seen GNU
> code, and in my opinion it's too verbose and redundant. Not a good
> style.
>
> But more importantly: at the header of that file you can see the license
> is GPLv3, that's incompatible with the license of this project, which is
> GPLv2 only (see the note in COPYING).
>
> You can't just copy code like that. You need to be careful.
>

Now I notice the importance of license in open source project.

> And if you do copy code--even if allowed by the license--it's something
> that should be mentioned in the commit message, preferably with a link
> to the original, that way if there's trouble in the future with that
> code, we can follow the link and figure out why it was done that way.
>
> Also, it's just nice to give attribution to the people that wrote the
> original code.
>

Ok.

> > > Check the following resource for a detailed explanation of why my
> > > modified version is considered good taste:
> > >
> > > https://github.com/felipec/linked-list-good-taste
> >
> > OK. I will gradually standardize my code style.
>
> It is not a standard, it is my personal opinion, which is shared by
> Linus Torvalds, and I presume other members of the Git project.
>
> The style is not something that can be standardized, you get a feeling
> of it as you read more code of the project, write, and then receive
> feedback on what you write.
>

Yes it is. Reading and writing Git code has brought me a certain degree
of code style improvement. (this is indeed a kind of edification :) )

> It's like learning the slang of a new city; it takes a while.
>

Thanks for your guidance.

> Cheers.
>
> --
> Felipe Contreras

Thanks!
--
ZheNing Hu

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/2] [GSOC] ref-filter: add %(raw) atom
  2021-05-29 13:23     ` Phillip Wood
  2021-05-29 15:24       ` Felipe Contreras
@ 2021-05-30  6:26       ` ZheNing Hu
  2021-05-30 13:02         ` Phillip Wood
  1 sibling, 1 reply; 41+ messages in thread
From: ZheNing Hu @ 2021-05-30  6:26 UTC (permalink / raw)
  To: Phillip Wood
  Cc: Felipe Contreras, ZheNing Hu via GitGitGadget, Git List,
	Junio C Hamano, Christian Couder, Hariom Verma, Karthik Nayak,
	Bagas Sanjaya, Jeff King

Phillip Wood <phillip.wood123@gmail.com> 于2021年5月29日周六 下午9:23写道:
>
> On 27/05/2021 17:36, Felipe Contreras wrote:
> > ZheNing Hu via GitGitGadget wrote:
> > [...]
> >> +static int memcasecmp(const void *vs1, const void *vs2, size_t n)
> >
> > Why void *? We can delcare as char *.
>
> If you look at how this function is used you'll see
>         int (*cmp_fn)(const void *, const void *, size_t);
>         cmp_fn = s->sort_flags & REF_SORTING_ICASE
>                         ? memcasecmp : memcmp;
>
> So the signature must match memcmp to avoid undefined behavior (a
> ternary expression is undefined unless both sides evaluate to the same
> type and calling a function through a pointer a different type is
> undefined as well)
>

I agree.

> >> +{
> >> +    size_t i;
> >> +    const char *s1 = (const char *)vs1;
> >> +    const char *s2 = (const char *)vs2;
> >
> > Then we avoid this extra step.
> >
> >> +    for (i = 0; i < n; i++) {
> >> +            unsigned char u1 = s1[i];
> >> +            unsigned char u2 = s2[i];
> >
> > There's no need for two entirely new variables...
> >
> >> +            int U1 = toupper (u1);
> >> +            int U2 = toupper (u2);
> >
> > You can do toupper(s1[i]) directly (BTW, there's an extra space: `foo(x)`,
> > not `foo (x)`).
> >
> > While we are at it, why keep an extra index from s1, when s1 is never
> > used again?
> >
> > We can simply advance both s1 and s2:
> >
> >    s1++, s2++
> >
> >> +            int diff = (UCHAR_MAX <= INT_MAX ? U1 - U2
> >> +                    : U1 < U2 ? -1 : U2 < U1);
> >
> > I don't understand what this is supposed to achieve. Both U1 and U2 are
> > integers, pretty low integers actually.
> >
> > If we get rid if that complexity we don't even need U1 or U2, just do:
> >
> >    diff = toupper(u1) - toupper(u2);
> >
> >> +            if (diff)
> >> +                    return diff;
> >> +    }
> >> +    return 0;
> >> +}
> >
> > All we have to do is define the end point, and then we don't need i:
> >
> >       static int memcasecmp(const char *s1, const char *s2, size_t n)
> >       {
> >               const char *end = s1 + n;
> >               for (; s1 < end; s1++, s2++) {
> >                       int diff = tolower(*s1) - tolower(*s2);
> >                       if (diff)
> >                               return diff;
> >               }
> >               return 0;
> >       }
> >
> > (and I personally prefer lower to upper)
>
> We should be using tolower() as that is what POSIX specifies for
> strcasecmp() [1] which we are trying to emulate and there are cases[2] where
>         (tolower(c1) == tolower(c2)) != (toupper(c1) == toupper(c2))
>

I don’t know if we overlooked a fact: This static `memcasecmp()`
is not a POSIX version. `tolower()` or `toupper()` are in git-compat-util.h,
sane_istest('\0', GIT_ALPHA) == false . So in `sane_case()`, whatever
`tolower()`, `toupper()`, they just return '\0' itself.

> Best Wishes
>
> Phillip
>
> [1] https://pubs.opengroup.org/onlinepubs/9699919799/
> [2] https://en.wikipedia.org/wiki/Dotted_and_dotless_I#In_computing
>

Thanks.
--
ZhenNing Hu

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/2] [GSOC] ref-filter: add %(raw) atom
  2021-05-29 15:24       ` Felipe Contreras
  2021-05-29 17:23         ` Phillip Wood
@ 2021-05-30  6:29         ` ZheNing Hu
  2021-05-30 13:05           ` Phillip Wood
  2021-05-31 15:35           ` Felipe Contreras
  1 sibling, 2 replies; 41+ messages in thread
From: ZheNing Hu @ 2021-05-30  6:29 UTC (permalink / raw)
  To: Felipe Contreras
  Cc: Phillip Wood, ZheNing Hu via GitGitGadget, Git List,
	Junio C Hamano, Christian Couder, Hariom Verma, Karthik Nayak,
	Bagas Sanjaya, Jeff King

Felipe Contreras <felipe.contreras@gmail.com> 于2021年5月29日周六 下午11:24写道:
>
> Phillip Wood wrote:
> > On 27/05/2021 17:36, Felipe Contreras wrote:
> > > ZheNing Hu via GitGitGadget wrote:
> > > [...]
> > >> +static int memcasecmp(const void *vs1, const void *vs2, size_t n)
> > >
> > > Why void *? We can delcare as char *.
> >
> > If you look at how this function is used you'll see
> >       int (*cmp_fn)(const void *, const void *, size_t);
> >       cmp_fn = s->sort_flags & REF_SORTING_ICASE
> >                       ? memcasecmp : memcmp;
>
> Yeah, but why?
>
> We know we are comparing two char *. Presumably the reason is that
> memcmp and memcasecmp use void *, but that could be remedied with:
>
>         cmp_fn = (int (*)(const char *, const char *, size_t))memcmp;
>
> That way the same cmp_fn could be used for the two cases.
>
> Either way I don't care particularly much. It also could be possible to
> use void * and do the casting in tolower().
>

I agree with Phillip's point of view here:
It would be better for memcasecmp and memcmp to be consistent.

> > > (and I personally prefer lower to upper)
> >
> > We should be using tolower() as that is what POSIX specifies for
> > strcasecmp() [1] which we are trying to emulate and there are cases[2] where
> >       (tolower(c1) == tolower(c2)) != (toupper(c1) == toupper(c2))
>
> That's true.
>

How about something like this:

 static int memcasecmp(const void *vs1, const void *vs2, size_t n)
 {
-       size_t i;
-       const char *s1 = (const char *)vs1;
-       const char *s2 = (const char *)vs2;
-
-       for (i = 0; i < n; i++) {
-               unsigned char u1 = s1[i];
-               unsigned char u2 = s2[i];
-               int U1 = toupper (u1);
-               int U2 = toupper (u2);
-               int diff = (UCHAR_MAX <= INT_MAX ? U1 - U2
-                       : U1 < U2 ? -1 : U2 < U1);
+       const char *s1 = (const void *)vs1;
+       const char *s2 = (const void *)vs2;
+       const char *end = s1 + n;
+
+       for (; s1 < end; s1++, s2++) {
+               int diff = tolower(*s1) - tolower(*s2);
                if (diff)
                        return diff;
        }
}

> --
> Felipe Contreras

Thanks.
--
ZheNing Hu

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/2] [GSOC] ref-filter: add %(raw) atom
  2021-05-28 15:04     ` ZheNing Hu
  2021-05-28 16:38       ` Felipe Contreras
@ 2021-05-30  8:11       ` ZheNing Hu
  1 sibling, 0 replies; 41+ messages in thread
From: ZheNing Hu @ 2021-05-30  8:11 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: ZheNing Hu via GitGitGadget, Git List, Christian Couder,
	Hariom Verma, Karthik Nayak, Felipe Contreras, Bagas Sanjaya,
	Jeff King

ZheNing Hu <adlternative@gmail.com> 于2021年5月28日周五 下午11:04写道:
>
> > Can the change in this commit violate the invariant that
> > if_then_else->str cannot be NULL, which seems to have been the case
> > forever as we see an unchecked strcmp() done in the original?
> >
> > If so, perhaps you can check the condition upfront, where you
> > compute str_len above, e.g.
> >
> >         if (!if_then_else->str) {
> >                 if (if_then_else->cmp_status == COMPARE_EQUAL ||
> >                     if_then_else->cmp_status == COMPARE_UNEQUAL)
> >                         BUG(...);
> >         } else
> >                 str_len = strlen(...);
> >
> > If not, then I do not see the point of adding this (and later) check
> > with BUG to this code.
> >
> > Or is the invariant that .str must not be NULL could have been
> > violated without this patch (i.e. the original was buggy in running
> > strcmp() on .str without checking)?  If so, please make it a separate
> > preliminary change to add such an assert.
> >
>
> The BUG() here actually acts as an "assert()". ".str must not be NULL" is
> right, it point to "xxx" in "%(if:equals=xxx)", so it seems that these BUG()
> are somewhat redundant, I will remove them.
>

Correct the error: If the atom is "%(if)" instread of
"%(if:equals=xxx)", .str will be NULL.
Without assert() or BUG() is ok, but clang-tidy will give a warning:
"Null pointer
passed to 1st parameter expecting 'nonnull'"

--
ZheNing Hu

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH v2 0/2] [GSOC] ref-filter: add %(raw) atom
  2021-05-27 14:43 [PATCH 0/2] [GSOC] ref-filter: add %(raw) atom ZheNing Hu via GitGitGadget
                   ` (2 preceding siblings ...)
  2021-05-27 15:39 ` [PATCH 0/2] [GSOC] ref-filter: add %(raw) atom Felipe Contreras
@ 2021-05-30 13:01 ` ZheNing Hu via GitGitGadget
  2021-05-30 13:01   ` [PATCH v2 1/2] [GSOC] ref-filter: add obj-type check in grab contents ZheNing Hu via GitGitGadget
  2021-05-30 13:01   ` [PATCH v2 2/2] [GSOC] ref-filter: add %(raw) atom ZheNing Hu via GitGitGadget
  3 siblings, 2 replies; 41+ messages in thread
From: ZheNing Hu via GitGitGadget @ 2021-05-30 13:01 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Christian Couder, Hariom Verma, Karthik Nayak,
	Felipe Contreras, Bagas Sanjaya, Jeff King, Phillip Wood,
	ZheNing Hu

In order to make git cat-file --batch use ref-filter logic, I add %(raw)
atom to ref-filter.

Change from last version:

 1. Use more elegant memcasecmp().
 2. Allow %(raw:size) used with --<lang>.
 3. Remove redundant BUG() in then_atom_handler().
 4. Roll back to origin function name grab_sub_body_contents().
 5. Split the check of object type in grab_sub_body_contents() into the
    previous patch.

ZheNing Hu (2):
  [GSOC] ref-filter: add obj-type check in grab contents
  [GSOC] ref-filter: add %(raw) atom

 Documentation/git-for-each-ref.txt |  14 ++
 ref-filter.c                       | 158 ++++++++++++++++++-----
 t/t6300-for-each-ref.sh            | 200 +++++++++++++++++++++++++++++
 3 files changed, 338 insertions(+), 34 deletions(-)


base-commit: 5d5b1473453400224ebb126bf3947e0a3276bdf5
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-963%2Fadlternative%2Fref-filter-raw-atom-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-963/adlternative/ref-filter-raw-atom-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/963

Range-diff vs v1:

 -:  ------------ > 1:  e6c26d19a3f3 [GSOC] ref-filter: add obj-type check in grab contents
 1:  b3848f24f2d3 ! 2:  e44a2ed0db59 [GSOC] ref-filter: add %(raw) atom
     @@ Commit message
      
          The raw data of blob, tree objects may contain '\0', but most of
          the logic in `ref-filter` depands on the output of the atom being
     -    a structured string (end with '\0').
     +    text (specifically, no embedded NULs in it).
      
          E.g. `quote_formatting()` use `strbuf_addstr()` or `*._quote_buf()`
          add the data to the buffer. The raw data of a tree object is
     @@ Commit message
          can record raw object size, it can help us add raw object data to
          the buffer or compare two buffers which contain raw object data.
      
     -    Beyond, `--format=%(raw)` should not combine with `--python`, `--shell`,
     +    Beyond, `--format=%(raw)` cannot be used with `--python`, `--shell`,
          `--tcl`, `--perl` because if our binary raw data is passed to a variable
          in the host language, the host languages may cause escape errors.
      
     +    Helped-by: Felipe Contreras <felipe.contreras@gmail.com>
     +    Helped-by: Phillip Wood <phillip.wood@dunelm.org.uk>
     +    Helped-by: Junio C Hamano <gitster@pobox.com>
          Based-on-patch-by: Olga Telezhnaya <olyatelezhnaya@gmail.com>
          Signed-off-by: ZheNing Hu <adlternative@gmail.com>
      
     @@ Documentation/git-for-each-ref.txt: and `date` to extract the named component.
      +raw:size::
      +	The raw data size of the object.
      +
     -+Note that `--format=%(raw)` should not combine with `--python`, `--shell`, `--tcl`,
     ++Note that `--format=%(raw)` can not be used with `--python`, `--shell`, `--tcl`,
      +`--perl` because if our binary raw data is passed to a variable in the host language,
      +the host languages may cause escape errors.
      +
     @@ ref-filter.c: static int contents_atom_parser(const struct ref_format *format, s
      +static int raw_atom_parser(const struct ref_format *format, struct used_atom *atom,
      +				const char *arg, struct strbuf *err)
      +{
     -+	if (!arg) {
     ++	if (!arg)
      +		atom->u.raw_data.option = RAW_BARE;
     -+	} else if (!strcmp(arg, "size"))
     ++	else if (!strcmp(arg, "size"))
      +		atom->u.raw_data.option = RAW_LENGTH;
      +	else
      +		return strbuf_addf_ret(err, -1, _("unrecognized %%(raw) argument: %s"), arg);
     @@ ref-filter.c: static int parse_ref_filter_atom(const struct ref_format *format,
       		return strbuf_addf_ret(err, -1, _("malformed field name: %.*s"),
       				       (int)(ep-atom), atom);
       
     -+	if (format->quote_style && starts_with(sp, "raw"))
     -+		return strbuf_addf_ret(err, -1, _("--format=%.*s should not combine with"
     +-	/* Do we have the atom already used elsewhere? */
     +-	for (i = 0; i < used_atom_cnt; i++) {
     +-		int len = strlen(used_atom[i].name);
     +-		if (len == ep - atom && !memcmp(used_atom[i].name, atom, len))
     +-			return i;
     +-	}
     +-
     + 	/*
     + 	 * If the atom name has a colon, strip it and everything after
     + 	 * it off - it specifies the format for this entry, and
     +@@ ref-filter.c: static int parse_ref_filter_atom(const struct ref_format *format,
     + 	arg = memchr(sp, ':', ep - sp);
     + 	atom_len = (arg ? arg : ep) - sp;
     + 
     ++	if (format->quote_style && !strncmp(sp, "raw", 3) && !arg)
     ++		return strbuf_addf_ret(err, -1, _("--format=%.*s cannot be used with"
      +				"--python, --shell, --tcl, --perl"), (int)(ep-atom), atom);
      +
     - 	/* Do we have the atom already used elsewhere? */
     - 	for (i = 0; i < used_atom_cnt; i++) {
     - 		int len = strlen(used_atom[i].name);
     ++	/* Do we have the atom already used elsewhere? */
     ++	for (i = 0; i < used_atom_cnt; i++) {
     ++		int len = strlen(used_atom[i].name);
     ++		if (len == ep - atom && !memcmp(used_atom[i].name, atom, len))
     ++			return i;
     ++	}
     ++
     + 	/* Is the atom a valid one? */
     + 	for (i = 0; i < ARRAY_SIZE(valid_atom); i++) {
     + 		int len = strlen(valid_atom[i].name);
      @@ ref-filter.c: static int parse_ref_filter_atom(const struct ref_format *format,
       	return at;
       }
     @@ ref-filter.c: static int then_atom_handler(struct atom_value *atomv, struct ref_
       	 */
       	if (if_then_else->cmp_status == COMPARE_EQUAL) {
      -		if (!strcmp(if_then_else->str, cur->output.buf))
     -+		if (!if_then_else->str)
     -+			BUG("when if_then_else->cmp_status == COMPARE_EQUAL,"
     -+			    "if_then_else->str must not be null");
      +		if (str_len == cur->output.len &&
      +		    !memcmp(if_then_else->str, cur->output.buf, cur->output.len))
       			if_then_else->condition_satisfied = 1;
       	} else if (if_then_else->cmp_status == COMPARE_UNEQUAL) {
      -		if (strcmp(if_then_else->str, cur->output.buf))
     -+		if (!if_then_else->str)
     -+			BUG("when if_then_else->cmp_status == COMPARE_UNEQUAL,"
     -+			    "if_then_else->str must not be null");
      +		if (str_len != cur->output.len ||
      +		    memcmp(if_then_else->str, cur->output.buf, cur->output.len))
       			if_then_else->condition_satisfied = 1;
     @@ ref-filter.c: static int end_atom_handler(struct atom_value *atomv, struct ref_f
       	}
       	strbuf_release(&s);
      @@ ref-filter.c: static void append_lines(struct strbuf *out, const char *buf, unsigned long size
     - }
       
       /* See grab_values */
     --static void grab_sub_body_contents(struct atom_value *val, int deref, void *buf)
     -+static void grab_raw_data(struct atom_value *val, int deref, void *buf, unsigned long buf_size, struct object *obj)
     + static void grab_sub_body_contents(struct atom_value *val, int deref, void *buf,
     +-				   struct object *obj)
     ++				   unsigned long buf_size, struct object *obj)
       {
       	int i;
       	const char *subpos = NULL, *bodypos = NULL, *sigpos = NULL;
     -@@ ref-filter.c: static void grab_sub_body_contents(struct atom_value *val, int deref, void *buf)
     - 			continue;
     +@@ ref-filter.c: static void grab_sub_body_contents(struct atom_value *val, int deref, void *buf,
       		if (deref)
       			name++;
     --		if (strcmp(name, "body") &&
     --		    !starts_with(name, "subject") &&
     --		    !starts_with(name, "trailers") &&
     --		    !starts_with(name, "contents"))
     -+
     + 
      +		if (starts_with(name, "raw")) {
      +			if (atom->u.raw_data.option == RAW_BARE) {
      +				v->s = xmemdupz(buf, buf_size);
      +				v->s_size = buf_size;
     -+			} else if (atom->u.raw_data.option == RAW_LENGTH)
     ++			} else if (atom->u.raw_data.option == RAW_LENGTH) {
      +				v->s = xstrfmt("%"PRIuMAX, (uintmax_t)buf_size);
     ++			}
      +			continue;
      +		}
      +
     -+		if ((obj->type != OBJ_TAG &&
     -+		     obj->type != OBJ_COMMIT) ||
     -+		    (strcmp(name, "body") &&
     -+		     !starts_with(name, "subject") &&
     -+		     !starts_with(name, "trailers") &&
     -+		     !starts_with(name, "contents")))
     - 			continue;
     - 		if (!subpos)
     - 			find_subpos(buf,
     + 		if ((obj->type != OBJ_TAG &&
     + 		     obj->type != OBJ_COMMIT) ||
     + 		    (strcmp(name, "body") &&
      @@ ref-filter.c: static void fill_missing_values(struct atom_value *val)
        * pointed at by the ref itself; otherwise it is the object the
        * ref (which is a tag) refers to.
     @@ ref-filter.c: static void fill_missing_values(struct atom_value *val)
       	switch (obj->type) {
       	case OBJ_TAG:
       		grab_tag_values(val, deref, obj);
     --		grab_sub_body_contents(val, deref, buf);
     -+		grab_raw_data(val, deref, buf, buf_size, obj);
     +-		grab_sub_body_contents(val, deref, buf, obj);
     ++		grab_sub_body_contents(val, deref, buf, buf_size, obj);
       		grab_person("tagger", val, deref, buf);
       		break;
       	case OBJ_COMMIT:
       		grab_commit_values(val, deref, obj);
     --		grab_sub_body_contents(val, deref, buf);
     -+		grab_raw_data(val, deref, buf, buf_size, obj);
     +-		grab_sub_body_contents(val, deref, buf, obj);
     ++		grab_sub_body_contents(val, deref, buf, buf_size, obj);
       		grab_person("author", val, deref, buf);
       		grab_person("committer", val, deref, buf);
       		break;
       	case OBJ_TREE:
       		/* grab_tree_values(val, deref, obj, buf, sz); */
     -+		grab_raw_data(val, deref, buf, buf_size, obj);
     ++		grab_sub_body_contents(val, deref, buf, buf_size, obj);
       		break;
       	case OBJ_BLOB:
       		/* grab_blob_values(val, deref, obj, buf, sz); */
     -+		grab_raw_data(val, deref, buf, buf_size, obj);
     ++		grab_sub_body_contents(val, deref, buf, buf_size, obj);
       		break;
       	default:
       		die("Eh?  Object of type %d?", obj->type);
     @@ ref-filter.c: static int compare_detached_head(struct ref_array_item *a, struct
       
      +static int memcasecmp(const void *vs1, const void *vs2, size_t n)
      +{
     -+	size_t i;
     -+	const char *s1 = (const char *)vs1;
     -+	const char *s2 = (const char *)vs2;
     ++	const char *s1 = (const void *)vs1;
     ++	const char *s2 = (const void *)vs2;
     ++	const char *end = s1 + n;
      +
     -+	for (i = 0; i < n; i++) {
     -+		unsigned char u1 = s1[i];
     -+		unsigned char u2 = s2[i];
     -+		int U1 = toupper (u1);
     -+		int U2 = toupper (u2);
     -+		int diff = (UCHAR_MAX <= INT_MAX ? U1 - U2
     -+			: U1 < U2 ? -1 : U2 < U1);
     ++	for (; s1 < end; s1++, s2++) {
     ++		int diff = tolower(*s1) - tolower(*s2);
      +		if (diff)
      +			return diff;
      +	}
     @@ ref-filter.c: static int compare_detached_head(struct ref_array_item *a, struct
       static int cmp_ref_sorting(struct ref_sorting *s, struct ref_array_item *a, struct ref_array_item *b)
       {
       	struct atom_value *va, *vb;
     -@@ ref-filter.c: static int cmp_ref_sorting(struct ref_sorting *s, struct ref_array_item *a, stru
     - 	int cmp_detached_head = 0;
     - 	cmp_type cmp_type = used_atom[s->atom].type;
     - 	struct strbuf err = STRBUF_INIT;
     -+	size_t slen = 0;
     - 
     - 	if (get_ref_atom_value(a, s->atom, &va, &err))
     - 		die("%s", err.buf);
      @@ ref-filter.c: static int cmp_ref_sorting(struct ref_sorting *s, struct ref_array_item *a, stru
       	} else if (s->sort_flags & REF_SORTING_VERSION) {
       		cmp = versioncmp(va->s, vb->s);
     @@ ref-filter.c: static int cmp_ref_sorting(struct ref_sorting *s, struct ref_array
      +			int (*cmp_fn)(const void *, const void *, size_t);
      +			cmp_fn = s->sort_flags & REF_SORTING_ICASE
      +				? memcasecmp : memcmp;
     ++			size_t a_size = va->s_size == ATOM_VALUE_S_SIZE_INIT ?
     ++					strlen(va->s) : va->s_size;
     ++			size_t b_size = vb->s_size == ATOM_VALUE_S_SIZE_INIT ?
     ++					strlen(vb->s) : vb->s_size;
      +
     -+			if (va->s_size != ATOM_VALUE_S_SIZE_INIT &&
     -+			    vb->s_size != ATOM_VALUE_S_SIZE_INIT) {
     -+				cmp = cmp_fn(va->s, vb->s, va->s_size > vb->s_size ?
     -+				       vb->s_size : va->s_size);
     -+			} else if (va->s_size == ATOM_VALUE_S_SIZE_INIT) {
     -+				slen = strlen(va->s);
     -+				cmp = cmp_fn(va->s, vb->s, slen > vb->s_size ?
     -+					     vb->s_size : slen);
     -+			} else {
     -+				slen = strlen(vb->s);
     -+				cmp = cmp_fn(va->s, vb->s, slen > va->s_size ?
     -+					     slen : va->s_size);
     ++			cmp = cmp_fn(va->s, vb->s, b_size > a_size ?
     ++				     a_size : b_size);
     ++			if (!cmp) {
     ++				if (a_size > b_size)
     ++					cmp = 1;
     ++				else if (a_size < b_size)
     ++					cmp = -1;
      +			}
     -+			cmp = cmp ? cmp : va->s_size - vb->s_size;
      +		}
       	} else {
       		if (va->value < vb->value)
     @@ t/t6300-for-each-ref.sh: test_atom refs/myblobs/first contents:body ""
      +	refs/myblobs/first not empty
      +	EOF
      +	git for-each-ref --format="%(refname) %(if)%(raw)%(then)not empty%(else)empty%(end)" \
     -+	refs/myblobs/ >actual &&
     ++		refs/myblobs/ >actual &&
      +	test_cmp expected actual
      +'
      +
     @@ t/t6300-for-each-ref.sh: test_atom refs/myblobs/first contents:body ""
      +	test_must_fail git for-each-ref --format="%(raw)" --sort=raw --shell
      +'
      +
     ++test_expect_success '%(raw:size) with --shell' '
     ++	git for-each-ref --format="%(raw:size)" | while read line
     ++	do
     ++		echo "'\''$line'\''" >>expect
     ++	done &&
     ++	git for-each-ref --format="%(raw:size)" --shell >actual &&
     ++	test_cmp expect actual
     ++'
     ++
      +test_expect_success 'for-each-ref --format compare with cat-file --batch' '
      +	git rev-parse refs/mytrees/first | git cat-file --batch >expected &&
      +	git for-each-ref --format="%(objectname) %(objecttype) %(objectsize)
 2:  aa6d73f3e526 < -:  ------------ [GSOC] ref-filter: add %(header) atom

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH v2 1/2] [GSOC] ref-filter: add obj-type check in grab contents
  2021-05-30 13:01 ` [PATCH v2 " ZheNing Hu via GitGitGadget
@ 2021-05-30 13:01   ` ZheNing Hu via GitGitGadget
  2021-05-31  5:34     ` Junio C Hamano
  2021-05-30 13:01   ` [PATCH v2 2/2] [GSOC] ref-filter: add %(raw) atom ZheNing Hu via GitGitGadget
  1 sibling, 1 reply; 41+ messages in thread
From: ZheNing Hu via GitGitGadget @ 2021-05-30 13:01 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Christian Couder, Hariom Verma, Karthik Nayak,
	Felipe Contreras, Bagas Sanjaya, Jeff King, Phillip Wood,
	ZheNing Hu, ZheNing Hu

From: ZheNing Hu <adlternative@gmail.com>

Only tag and commit will use `grab_sub_body_contents()`
to grab object contents in origin implement. If we want
to make blob, tree can also use `grab_sub_body_contents()`
to get objects' raw data, a blob look like commit or tag
will be wrongly regarded as commit, tag by `find_subpos()`.

So we must add a test before `find_subpos()` to reject
blob, tree objects. This will help us add %(raw) atom
which can grab raw data of four type objects.

Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
---
 ref-filter.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/ref-filter.c b/ref-filter.c
index 97116e12d7c4..f6a7b5290d54 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -1292,7 +1292,8 @@ static void append_lines(struct strbuf *out, const char *buf, unsigned long size
 }
 
 /* See grab_values */
-static void grab_sub_body_contents(struct atom_value *val, int deref, void *buf)
+static void grab_sub_body_contents(struct atom_value *val, int deref, void *buf,
+				   struct object *obj)
 {
 	int i;
 	const char *subpos = NULL, *bodypos = NULL, *sigpos = NULL;
@@ -1307,10 +1308,13 @@ static void grab_sub_body_contents(struct atom_value *val, int deref, void *buf)
 			continue;
 		if (deref)
 			name++;
-		if (strcmp(name, "body") &&
-		    !starts_with(name, "subject") &&
-		    !starts_with(name, "trailers") &&
-		    !starts_with(name, "contents"))
+
+		if ((obj->type != OBJ_TAG &&
+		     obj->type != OBJ_COMMIT) ||
+		    (strcmp(name, "body") &&
+		     !starts_with(name, "subject") &&
+		     !starts_with(name, "trailers") &&
+		     !starts_with(name, "contents")))
 			continue;
 		if (!subpos)
 			find_subpos(buf,
@@ -1379,12 +1383,12 @@ static void grab_values(struct atom_value *val, int deref, struct object *obj, v
 	switch (obj->type) {
 	case OBJ_TAG:
 		grab_tag_values(val, deref, obj);
-		grab_sub_body_contents(val, deref, buf);
+		grab_sub_body_contents(val, deref, buf, obj);
 		grab_person("tagger", val, deref, buf);
 		break;
 	case OBJ_COMMIT:
 		grab_commit_values(val, deref, obj);
-		grab_sub_body_contents(val, deref, buf);
+		grab_sub_body_contents(val, deref, buf, obj);
 		grab_person("author", val, deref, buf);
 		grab_person("committer", val, deref, buf);
 		break;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v2 2/2] [GSOC] ref-filter: add %(raw) atom
  2021-05-30 13:01 ` [PATCH v2 " ZheNing Hu via GitGitGadget
  2021-05-30 13:01   ` [PATCH v2 1/2] [GSOC] ref-filter: add obj-type check in grab contents ZheNing Hu via GitGitGadget
@ 2021-05-30 13:01   ` ZheNing Hu via GitGitGadget
  2021-05-31  0:44     ` Junio C Hamano
                       ` (3 more replies)
  1 sibling, 4 replies; 41+ messages in thread
From: ZheNing Hu via GitGitGadget @ 2021-05-30 13:01 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Christian Couder, Hariom Verma, Karthik Nayak,
	Felipe Contreras, Bagas Sanjaya, Jeff King, Phillip Wood,
	ZheNing Hu, ZheNing Hu

From: ZheNing Hu <adlternative@gmail.com>

Add new formatting option `%(raw)`, which will print the raw
object data without any changes. It will help further to migrate
all cat-file formatting logic from cat-file to ref-filter.

The raw data of blob, tree objects may contain '\0', but most of
the logic in `ref-filter` depands on the output of the atom being
text (specifically, no embedded NULs in it).

E.g. `quote_formatting()` use `strbuf_addstr()` or `*._quote_buf()`
add the data to the buffer. The raw data of a tree object is
`100644 one\0...`, only the `100644 one` will be added to the buffer,
which is incorrect.

Therefore, add a new member in `struct atom_value`: `s_size`, which
can record raw object size, it can help us add raw object data to
the buffer or compare two buffers which contain raw object data.

Beyond, `--format=%(raw)` cannot be used with `--python`, `--shell`,
`--tcl`, `--perl` because if our binary raw data is passed to a variable
in the host language, the host languages may cause escape errors.

Helped-by: Felipe Contreras <felipe.contreras@gmail.com>
Helped-by: Phillip Wood <phillip.wood@dunelm.org.uk>
Helped-by: Junio C Hamano <gitster@pobox.com>
Based-on-patch-by: Olga Telezhnaya <olyatelezhnaya@gmail.com>
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
---
 Documentation/git-for-each-ref.txt |  14 ++
 ref-filter.c                       | 146 ++++++++++++++++-----
 t/t6300-for-each-ref.sh            | 200 +++++++++++++++++++++++++++++
 3 files changed, 330 insertions(+), 30 deletions(-)

diff --git a/Documentation/git-for-each-ref.txt b/Documentation/git-for-each-ref.txt
index 2ae2478de706..42cef4c9617d 100644
--- a/Documentation/git-for-each-ref.txt
+++ b/Documentation/git-for-each-ref.txt
@@ -235,6 +235,20 @@ and `date` to extract the named component.  For email fields (`authoremail`,
 without angle brackets, and `:localpart` to get the part before the `@` symbol
 out of the trimmed email.
 
+The raw data in a object is `raw`, For commit and tag objects, `raw` contain
+`header` and `contents` two parts, `header` is structured part of raw data, it
+composed of "tree XXX", "parent YYY", etc lines in commits , or composed of
+"object OOO", "type TTT", etc lines in tags; `contents` is unstructured "free
+text" part of raw object data. For blob and tree objects, their raw data don't
+have `header` and `contents` parts.
+
+raw:size::
+	The raw data size of the object.
+
+Note that `--format=%(raw)` can not be used with `--python`, `--shell`, `--tcl`,
+`--perl` because if our binary raw data is passed to a variable in the host language,
+the host languages may cause escape errors.
+
 The message in a commit or a tag object is `contents`, from which
 `contents:<part>` can be used to extract various parts out of:
 
diff --git a/ref-filter.c b/ref-filter.c
index f6a7b5290d54..5211fe946b70 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -138,6 +138,9 @@ static struct used_atom {
 			struct process_trailer_options trailer_opts;
 			unsigned int nlines;
 		} contents;
+		struct {
+			enum { RAW_BARE, RAW_LENGTH } option;
+		} raw_data;
 		struct {
 			cmp_status cmp_status;
 			const char *str;
@@ -370,6 +373,18 @@ static int contents_atom_parser(const struct ref_format *format, struct used_ato
 	return 0;
 }
 
+static int raw_atom_parser(const struct ref_format *format, struct used_atom *atom,
+				const char *arg, struct strbuf *err)
+{
+	if (!arg)
+		atom->u.raw_data.option = RAW_BARE;
+	else if (!strcmp(arg, "size"))
+		atom->u.raw_data.option = RAW_LENGTH;
+	else
+		return strbuf_addf_ret(err, -1, _("unrecognized %%(raw) argument: %s"), arg);
+	return 0;
+}
+
 static int oid_atom_parser(const struct ref_format *format, struct used_atom *atom,
 			   const char *arg, struct strbuf *err)
 {
@@ -530,6 +545,7 @@ static struct {
 	{ "body", SOURCE_OBJ, FIELD_STR, body_atom_parser },
 	{ "trailers", SOURCE_OBJ, FIELD_STR, trailers_atom_parser },
 	{ "contents", SOURCE_OBJ, FIELD_STR, contents_atom_parser },
+	{ "raw", SOURCE_OBJ, FIELD_STR, raw_atom_parser },
 	{ "upstream", SOURCE_NONE, FIELD_STR, remote_ref_atom_parser },
 	{ "push", SOURCE_NONE, FIELD_STR, remote_ref_atom_parser },
 	{ "symref", SOURCE_NONE, FIELD_STR, refname_atom_parser },
@@ -564,12 +580,15 @@ struct ref_formatting_state {
 
 struct atom_value {
 	const char *s;
+	size_t s_size;
 	int (*handler)(struct atom_value *atomv, struct ref_formatting_state *state,
 		       struct strbuf *err);
 	uintmax_t value; /* used for sorting when not FIELD_STR */
 	struct used_atom *atom;
 };
 
+#define ATOM_VALUE_S_SIZE_INIT (-1)
+
 /*
  * Used to parse format string and sort specifiers
  */
@@ -588,13 +607,6 @@ static int parse_ref_filter_atom(const struct ref_format *format,
 		return strbuf_addf_ret(err, -1, _("malformed field name: %.*s"),
 				       (int)(ep-atom), atom);
 
-	/* Do we have the atom already used elsewhere? */
-	for (i = 0; i < used_atom_cnt; i++) {
-		int len = strlen(used_atom[i].name);
-		if (len == ep - atom && !memcmp(used_atom[i].name, atom, len))
-			return i;
-	}
-
 	/*
 	 * If the atom name has a colon, strip it and everything after
 	 * it off - it specifies the format for this entry, and
@@ -604,6 +616,17 @@ static int parse_ref_filter_atom(const struct ref_format *format,
 	arg = memchr(sp, ':', ep - sp);
 	atom_len = (arg ? arg : ep) - sp;
 
+	if (format->quote_style && !strncmp(sp, "raw", 3) && !arg)
+		return strbuf_addf_ret(err, -1, _("--format=%.*s cannot be used with"
+				"--python, --shell, --tcl, --perl"), (int)(ep-atom), atom);
+
+	/* Do we have the atom already used elsewhere? */
+	for (i = 0; i < used_atom_cnt; i++) {
+		int len = strlen(used_atom[i].name);
+		if (len == ep - atom && !memcmp(used_atom[i].name, atom, len))
+			return i;
+	}
+
 	/* Is the atom a valid one? */
 	for (i = 0; i < ARRAY_SIZE(valid_atom); i++) {
 		int len = strlen(valid_atom[i].name);
@@ -652,11 +675,14 @@ static int parse_ref_filter_atom(const struct ref_format *format,
 	return at;
 }
 
-static void quote_formatting(struct strbuf *s, const char *str, int quote_style)
+static void quote_formatting(struct strbuf *s, const char *str, size_t len, int quote_style)
 {
 	switch (quote_style) {
 	case QUOTE_NONE:
-		strbuf_addstr(s, str);
+		if (len != ATOM_VALUE_S_SIZE_INIT)
+			strbuf_add(s, str, len);
+		else
+			strbuf_addstr(s, str);
 		break;
 	case QUOTE_SHELL:
 		sq_quote_buf(s, str);
@@ -683,9 +709,12 @@ static int append_atom(struct atom_value *v, struct ref_formatting_state *state,
 	 * encountered.
 	 */
 	if (!state->stack->prev)
-		quote_formatting(&state->stack->output, v->s, state->quote_style);
+		quote_formatting(&state->stack->output, v->s, v->s_size, state->quote_style);
 	else
-		strbuf_addstr(&state->stack->output, v->s);
+		if (v->s_size != ATOM_VALUE_S_SIZE_INIT)
+			strbuf_add(&state->stack->output, v->s, v->s_size);
+		else
+			strbuf_addstr(&state->stack->output, v->s);
 	return 0;
 }
 
@@ -785,14 +814,16 @@ static int if_atom_handler(struct atom_value *atomv, struct ref_formatting_state
 	return 0;
 }
 
-static int is_empty(const char *s)
+static int is_empty(struct strbuf *buf)
 {
-	while (*s != '\0') {
-		if (!isspace(*s))
-			return 0;
+	const char *s = buf->buf;
+	size_t cur_len = 0;
+
+	while ((cur_len != buf->len) && (isspace(*s) || *s == '\0')) {
 		s++;
+		cur_len++;
 	}
-	return 1;
+	return cur_len == buf->len;
 }
 
 static int then_atom_handler(struct atom_value *atomv, struct ref_formatting_state *state,
@@ -800,6 +831,7 @@ static int then_atom_handler(struct atom_value *atomv, struct ref_formatting_sta
 {
 	struct ref_formatting_stack *cur = state->stack;
 	struct if_then_else *if_then_else = NULL;
+	size_t str_len = 0;
 
 	if (cur->at_end == if_then_else_handler)
 		if_then_else = (struct if_then_else *)cur->at_end_data;
@@ -810,18 +842,22 @@ static int then_atom_handler(struct atom_value *atomv, struct ref_formatting_sta
 	if (if_then_else->else_atom_seen)
 		return strbuf_addf_ret(err, -1, _("format: %%(then) atom used after %%(else)"));
 	if_then_else->then_atom_seen = 1;
+	if (if_then_else->str)
+		str_len = strlen(if_then_else->str);
 	/*
 	 * If the 'equals' or 'notequals' attribute is used then
 	 * perform the required comparison. If not, only non-empty
 	 * strings satisfy the 'if' condition.
 	 */
 	if (if_then_else->cmp_status == COMPARE_EQUAL) {
-		if (!strcmp(if_then_else->str, cur->output.buf))
+		if (str_len == cur->output.len &&
+		    !memcmp(if_then_else->str, cur->output.buf, cur->output.len))
 			if_then_else->condition_satisfied = 1;
 	} else if (if_then_else->cmp_status == COMPARE_UNEQUAL) {
-		if (strcmp(if_then_else->str, cur->output.buf))
+		if (str_len != cur->output.len ||
+		    memcmp(if_then_else->str, cur->output.buf, cur->output.len))
 			if_then_else->condition_satisfied = 1;
-	} else if (cur->output.len && !is_empty(cur->output.buf))
+	} else if (cur->output.len && !is_empty(&cur->output))
 		if_then_else->condition_satisfied = 1;
 	strbuf_reset(&cur->output);
 	return 0;
@@ -867,7 +903,7 @@ static int end_atom_handler(struct atom_value *atomv, struct ref_formatting_stat
 	 * only on the topmost supporting atom.
 	 */
 	if (!current->prev->prev) {
-		quote_formatting(&s, current->output.buf, state->quote_style);
+		quote_formatting(&s, current->output.buf, current->output.len, state->quote_style);
 		strbuf_swap(&current->output, &s);
 	}
 	strbuf_release(&s);
@@ -1293,7 +1329,7 @@ static void append_lines(struct strbuf *out, const char *buf, unsigned long size
 
 /* See grab_values */
 static void grab_sub_body_contents(struct atom_value *val, int deref, void *buf,
-				   struct object *obj)
+				   unsigned long buf_size, struct object *obj)
 {
 	int i;
 	const char *subpos = NULL, *bodypos = NULL, *sigpos = NULL;
@@ -1309,6 +1345,16 @@ static void grab_sub_body_contents(struct atom_value *val, int deref, void *buf,
 		if (deref)
 			name++;
 
+		if (starts_with(name, "raw")) {
+			if (atom->u.raw_data.option == RAW_BARE) {
+				v->s = xmemdupz(buf, buf_size);
+				v->s_size = buf_size;
+			} else if (atom->u.raw_data.option == RAW_LENGTH) {
+				v->s = xstrfmt("%"PRIuMAX, (uintmax_t)buf_size);
+			}
+			continue;
+		}
+
 		if ((obj->type != OBJ_TAG &&
 		     obj->type != OBJ_COMMIT) ||
 		    (strcmp(name, "body") &&
@@ -1378,25 +1424,30 @@ static void fill_missing_values(struct atom_value *val)
  * pointed at by the ref itself; otherwise it is the object the
  * ref (which is a tag) refers to.
  */
-static void grab_values(struct atom_value *val, int deref, struct object *obj, void *buf)
+static void grab_values(struct atom_value *val, int deref, struct object *obj, struct expand_data *data)
 {
+	void *buf = data->content;
+	unsigned long buf_size = data->size;
+
 	switch (obj->type) {
 	case OBJ_TAG:
 		grab_tag_values(val, deref, obj);
-		grab_sub_body_contents(val, deref, buf, obj);
+		grab_sub_body_contents(val, deref, buf, buf_size, obj);
 		grab_person("tagger", val, deref, buf);
 		break;
 	case OBJ_COMMIT:
 		grab_commit_values(val, deref, obj);
-		grab_sub_body_contents(val, deref, buf, obj);
+		grab_sub_body_contents(val, deref, buf, buf_size, obj);
 		grab_person("author", val, deref, buf);
 		grab_person("committer", val, deref, buf);
 		break;
 	case OBJ_TREE:
 		/* grab_tree_values(val, deref, obj, buf, sz); */
+		grab_sub_body_contents(val, deref, buf, buf_size, obj);
 		break;
 	case OBJ_BLOB:
 		/* grab_blob_values(val, deref, obj, buf, sz); */
+		grab_sub_body_contents(val, deref, buf, buf_size, obj);
 		break;
 	default:
 		die("Eh?  Object of type %d?", obj->type);
@@ -1618,7 +1669,7 @@ static int get_object(struct ref_array_item *ref, int deref, struct object **obj
 			return strbuf_addf_ret(err, -1, _("parse_object_buffer failed on %s for %s"),
 					       oid_to_hex(&oi->oid), ref->refname);
 		}
-		grab_values(ref->value, deref, *obj, oi->content);
+		grab_values(ref->value, deref, *obj, oi);
 	}
 
 	grab_common_values(ref->value, deref, oi);
@@ -1698,7 +1749,7 @@ static int populate_value(struct ref_array_item *ref, struct strbuf *err)
 		int deref = 0;
 		const char *refname;
 		struct branch *branch = NULL;
-
+		v->s_size = ATOM_VALUE_S_SIZE_INIT;
 		v->handler = append_atom;
 		v->atom = atom;
 
@@ -2301,6 +2352,20 @@ static int compare_detached_head(struct ref_array_item *a, struct ref_array_item
 	return 0;
 }
 
+static int memcasecmp(const void *vs1, const void *vs2, size_t n)
+{
+	const char *s1 = (const void *)vs1;
+	const char *s2 = (const void *)vs2;
+	const char *end = s1 + n;
+
+	for (; s1 < end; s1++, s2++) {
+		int diff = tolower(*s1) - tolower(*s2);
+		if (diff)
+			return diff;
+	}
+	return 0;
+}
+
 static int cmp_ref_sorting(struct ref_sorting *s, struct ref_array_item *a, struct ref_array_item *b)
 {
 	struct atom_value *va, *vb;
@@ -2321,10 +2386,30 @@ static int cmp_ref_sorting(struct ref_sorting *s, struct ref_array_item *a, stru
 	} else if (s->sort_flags & REF_SORTING_VERSION) {
 		cmp = versioncmp(va->s, vb->s);
 	} else if (cmp_type == FIELD_STR) {
-		int (*cmp_fn)(const char *, const char *);
-		cmp_fn = s->sort_flags & REF_SORTING_ICASE
-			? strcasecmp : strcmp;
-		cmp = cmp_fn(va->s, vb->s);
+		if (va->s_size == ATOM_VALUE_S_SIZE_INIT &&
+		    vb->s_size == ATOM_VALUE_S_SIZE_INIT) {
+			int (*cmp_fn)(const char *, const char *);
+			cmp_fn = s->sort_flags & REF_SORTING_ICASE
+				? strcasecmp : strcmp;
+			cmp = cmp_fn(va->s, vb->s);
+		} else {
+			int (*cmp_fn)(const void *, const void *, size_t);
+			cmp_fn = s->sort_flags & REF_SORTING_ICASE
+				? memcasecmp : memcmp;
+			size_t a_size = va->s_size == ATOM_VALUE_S_SIZE_INIT ?
+					strlen(va->s) : va->s_size;
+			size_t b_size = vb->s_size == ATOM_VALUE_S_SIZE_INIT ?
+					strlen(vb->s) : vb->s_size;
+
+			cmp = cmp_fn(va->s, vb->s, b_size > a_size ?
+				     a_size : b_size);
+			if (!cmp) {
+				if (a_size > b_size)
+					cmp = 1;
+				else if (a_size < b_size)
+					cmp = -1;
+			}
+		}
 	} else {
 		if (va->value < vb->value)
 			cmp = -1;
@@ -2424,6 +2509,7 @@ int format_ref_array_item(struct ref_array_item *info,
 	}
 	if (format->need_color_reset_at_eol) {
 		struct atom_value resetv;
+		resetv.s_size = ATOM_VALUE_S_SIZE_INIT;
 		resetv.s = GIT_COLOR_RESET;
 		if (append_atom(&resetv, &state, error_buf)) {
 			pop_stack_element(&state.stack);
diff --git a/t/t6300-for-each-ref.sh b/t/t6300-for-each-ref.sh
index 9e0214076b4d..46703d4e512f 100755
--- a/t/t6300-for-each-ref.sh
+++ b/t/t6300-for-each-ref.sh
@@ -130,6 +130,8 @@ test_atom head parent:short=10 ''
 test_atom head numparent 0
 test_atom head object ''
 test_atom head type ''
+test_atom head raw "$(git cat-file commit refs/heads/main)
+"
 test_atom head '*objectname' ''
 test_atom head '*objecttype' ''
 test_atom head author 'A U Thor <author@example.com> 1151968724 +0200'
@@ -221,6 +223,15 @@ test_atom tag contents 'Tagging at 1151968727
 '
 test_atom tag HEAD ' '
 
+test_expect_success 'basic atom: refs/tags/testtag *raw' '
+	git cat-file commit refs/tags/testtag^{} >expected &&
+	git for-each-ref --format="%(*raw)" refs/tags/testtag >actual &&
+	sanitize_pgp <expected >expected.clean &&
+	sanitize_pgp <actual >actual.clean &&
+	echo "" >>expected.clean &&
+	test_cmp expected.clean actual.clean
+'
+
 test_expect_success 'Check invalid atoms names are errors' '
 	test_must_fail git for-each-ref --format="%(INVALID)" refs/heads
 '
@@ -686,6 +697,15 @@ test_atom refs/tags/signed-empty contents:body ''
 test_atom refs/tags/signed-empty contents:signature "$sig"
 test_atom refs/tags/signed-empty contents "$sig"
 
+test_expect_success 'basic atom: refs/tags/signed-empty raw' '
+	git cat-file tag refs/tags/signed-empty >expected &&
+	git for-each-ref --format="%(raw)" refs/tags/signed-empty >actual &&
+	sanitize_pgp <expected >expected.clean &&
+	sanitize_pgp <actual >actual.clean &&
+	echo "" >>expected.clean &&
+	test_cmp expected.clean actual.clean
+'
+
 test_atom refs/tags/signed-short subject 'subject line'
 test_atom refs/tags/signed-short subject:sanitize 'subject-line'
 test_atom refs/tags/signed-short contents:subject 'subject line'
@@ -695,6 +715,15 @@ test_atom refs/tags/signed-short contents:signature "$sig"
 test_atom refs/tags/signed-short contents "subject line
 $sig"
 
+test_expect_success 'basic atom: refs/tags/signed-short raw' '
+	git cat-file tag refs/tags/signed-short >expected &&
+	git for-each-ref --format="%(raw)" refs/tags/signed-short >actual &&
+	sanitize_pgp <expected >expected.clean &&
+	sanitize_pgp <actual >actual.clean &&
+	echo "" >>expected.clean &&
+	test_cmp expected.clean actual.clean
+'
+
 test_atom refs/tags/signed-long subject 'subject line'
 test_atom refs/tags/signed-long subject:sanitize 'subject-line'
 test_atom refs/tags/signed-long contents:subject 'subject line'
@@ -708,6 +737,15 @@ test_atom refs/tags/signed-long contents "subject line
 body contents
 $sig"
 
+test_expect_success 'basic atom: refs/tags/signed-long raw' '
+	git cat-file tag refs/tags/signed-long >expected &&
+	git for-each-ref --format="%(raw)" refs/tags/signed-long >actual &&
+	sanitize_pgp <expected >expected.clean &&
+	sanitize_pgp <actual >actual.clean &&
+	echo "" >>expected.clean &&
+	test_cmp expected.clean actual.clean
+'
+
 test_expect_success 'set up refs pointing to tree and blob' '
 	git update-ref refs/mytrees/first refs/heads/main^{tree} &&
 	git update-ref refs/myblobs/first refs/heads/main:one
@@ -720,6 +758,16 @@ test_atom refs/mytrees/first contents:body ""
 test_atom refs/mytrees/first contents:signature ""
 test_atom refs/mytrees/first contents ""
 
+test_expect_success 'basic atom: refs/mytrees/first raw' '
+	git cat-file tree refs/mytrees/first >expected &&
+	echo "" >>expected &&
+	git for-each-ref --format="%(raw)" refs/mytrees/first >actual &&
+	test_cmp expected actual &&
+	git cat-file -s refs/mytrees/first >expected &&
+	git for-each-ref --format="%(raw:size)" refs/mytrees/first >actual &&
+	test_cmp expected actual
+'
+
 test_atom refs/myblobs/first subject ""
 test_atom refs/myblobs/first contents:subject ""
 test_atom refs/myblobs/first body ""
@@ -727,6 +775,158 @@ test_atom refs/myblobs/first contents:body ""
 test_atom refs/myblobs/first contents:signature ""
 test_atom refs/myblobs/first contents ""
 
+test_expect_success 'basic atom: refs/myblobs/first raw' '
+	git cat-file blob refs/myblobs/first >expected &&
+	echo "" >>expected &&
+	git for-each-ref --format="%(raw)" refs/myblobs/first >actual &&
+	test_cmp expected actual &&
+	git cat-file -s refs/myblobs/first >expected &&
+	git for-each-ref --format="%(raw:size)" refs/myblobs/first >actual &&
+	test_cmp expected actual
+'
+
+test_expect_success 'set up refs pointing to binary blob' '
+	printf "%b" "a\0b\0c" >blob1 &&
+	printf "%b" "a\0c\0b" >blob2 &&
+	printf "%b" "\0a\0b\0c" >blob3 &&
+	printf "%b" "abc" >blob4 &&
+	printf "%b" "\0 \0 \0 " >blob5 &&
+	printf "%b" "\0 \0a\0 " >blob6 &&
+	>blob7 &&
+	git hash-object blob1 -w | xargs git update-ref refs/myblobs/blob1 &&
+	git hash-object blob2 -w | xargs git update-ref refs/myblobs/blob2 &&
+	git hash-object blob3 -w | xargs git update-ref refs/myblobs/blob3 &&
+	git hash-object blob4 -w | xargs git update-ref refs/myblobs/blob4 &&
+	git hash-object blob5 -w | xargs git update-ref refs/myblobs/blob5 &&
+	git hash-object blob6 -w | xargs git update-ref refs/myblobs/blob6 &&
+	git hash-object blob7 -w | xargs git update-ref refs/myblobs/blob7
+'
+
+test_expect_success 'Verify sorts with raw' '
+	cat >expected <<-EOF &&
+	refs/myblobs/blob7
+	refs/myblobs/blob5
+	refs/myblobs/blob6
+	refs/myblobs/blob3
+	refs/mytrees/first
+	refs/myblobs/first
+	refs/myblobs/blob1
+	refs/myblobs/blob2
+	refs/myblobs/blob4
+	refs/heads/main
+	EOF
+	git for-each-ref --format="%(refname)" --sort=raw \
+		refs/heads/main refs/myblobs/ refs/mytrees/first >actual &&
+	test_cmp expected actual
+'
+
+test_expect_success 'Verify sorts with raw:size' '
+	cat >expected <<-EOF &&
+	refs/myblobs/blob7
+	refs/myblobs/first
+	refs/heads/main
+	refs/myblobs/blob4
+	refs/myblobs/blob1
+	refs/myblobs/blob2
+	refs/myblobs/blob3
+	refs/myblobs/blob5
+	refs/myblobs/blob6
+	refs/mytrees/first
+	EOF
+	git for-each-ref --format="%(refname)" --sort=raw:size \
+		refs/heads/main refs/myblobs/ refs/mytrees/first >actual &&
+	test_cmp expected actual
+'
+
+test_expect_success 'validate raw atom with %(if:equals)' '
+	cat >expected <<-EOF &&
+	not equals
+	not equals
+	not equals
+	not equals
+	not equals
+	not equals
+	refs/myblobs/blob4
+	not equals
+	not equals
+	not equals
+	not equals
+	EOF
+	git for-each-ref --format="%(if:equals=abc)%(raw)%(then)%(refname)%(else)not equals%(end)" \
+		refs/myblobs/ refs/heads/ >actual &&
+	test_cmp expected actual
+'
+test_expect_success 'validate raw atom with %(if:notequals)' '
+	cat >expected <<-EOF &&
+	refs/heads/ambiguous
+	refs/heads/main
+	refs/heads/newtag
+	refs/myblobs/blob1
+	refs/myblobs/blob2
+	refs/myblobs/blob3
+	equals
+	refs/myblobs/blob5
+	refs/myblobs/blob6
+	refs/myblobs/blob7
+	refs/myblobs/first
+	EOF
+	git for-each-ref --format="%(if:notequals=abc)%(raw)%(then)%(refname)%(else)equals%(end)" \
+		refs/myblobs/ refs/heads/ >actual &&
+	test_cmp expected actual
+'
+
+test_expect_success 'empty raw refs with %(if)' '
+	cat >expected <<-EOF &&
+	refs/myblobs/blob1 not empty
+	refs/myblobs/blob2 not empty
+	refs/myblobs/blob3 not empty
+	refs/myblobs/blob4 not empty
+	refs/myblobs/blob5 empty
+	refs/myblobs/blob6 not empty
+	refs/myblobs/blob7 empty
+	refs/myblobs/first not empty
+	EOF
+	git for-each-ref --format="%(refname) %(if)%(raw)%(then)not empty%(else)empty%(end)" \
+		refs/myblobs/ >actual &&
+	test_cmp expected actual
+'
+
+test_expect_success '%(raw) with --python must failed' '
+	test_must_fail git for-each-ref --format="%(raw)" --python
+'
+
+test_expect_success '%(raw) with --tcl must failed' '
+	test_must_fail git for-each-ref --format="%(raw)" --tcl
+'
+
+test_expect_success '%(raw) with --perl must failed' '
+	test_must_fail git for-each-ref --format="%(raw)" --perl
+'
+
+test_expect_success '%(raw) with --shell must failed' '
+	test_must_fail git for-each-ref --format="%(raw)" --shell
+'
+
+test_expect_success '%(raw) with --shell and --sort=raw must failed' '
+	test_must_fail git for-each-ref --format="%(raw)" --sort=raw --shell
+'
+
+test_expect_success '%(raw:size) with --shell' '
+	git for-each-ref --format="%(raw:size)" | while read line
+	do
+		echo "'\''$line'\''" >>expect
+	done &&
+	git for-each-ref --format="%(raw:size)" --shell >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'for-each-ref --format compare with cat-file --batch' '
+	git rev-parse refs/mytrees/first | git cat-file --batch >expected &&
+	git for-each-ref --format="%(objectname) %(objecttype) %(objectsize)
+%(raw)" refs/mytrees/first >actual &&
+	test_cmp expected actual
+'
+
 test_expect_success 'set up multiple-sort tags' '
 	for when in 100000 200000
 	do
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/2] [GSOC] ref-filter: add %(raw) atom
  2021-05-30  6:26       ` ZheNing Hu
@ 2021-05-30 13:02         ` Phillip Wood
  0 siblings, 0 replies; 41+ messages in thread
From: Phillip Wood @ 2021-05-30 13:02 UTC (permalink / raw)
  To: ZheNing Hu
  Cc: Felipe Contreras, ZheNing Hu via GitGitGadget, Git List,
	Junio C Hamano, Christian Couder, Hariom Verma, Karthik Nayak,
	Bagas Sanjaya, Jeff King

Hi ZheNing

On 30/05/2021 07:26, ZheNing Hu wrote:
> Phillip Wood <phillip.wood123@gmail.com> 于2021年5月29日周六 下午9:23写道:
>>
>> On 27/05/2021 17:36, Felipe Contreras wrote:
>>> ZheNing Hu via GitGitGadget wrote:
>>> [...]
>>> All we have to do is define the end point, and then we don't need i:
>>>
>>>        static int memcasecmp(const char *s1, const char *s2, size_t n)
>>>        {
>>>                const char *end = s1 + n;
>>>                for (; s1 < end; s1++, s2++) {
>>>                        int diff = tolower(*s1) - tolower(*s2);
>>>                        if (diff)
>>>                                return diff;
>>>                }
>>>                return 0;
>>>        }
>>>
>>> (and I personally prefer lower to upper)
>>
>> We should be using tolower() as that is what POSIX specifies for
>> strcasecmp() [1] which we are trying to emulate and there are cases[2] where
>>          (tolower(c1) == tolower(c2)) != (toupper(c1) == toupper(c2))
>>
> 
> I don’t know if we overlooked a fact: This static `memcasecmp()`
> is not a POSIX version. `tolower()` or `toupper()` are in git-compat-util.h,
> sane_istest('\0', GIT_ALPHA) == false . So in `sane_case()`, whatever
> `tolower()`, `toupper()`, they just return '\0' itself.

Well spotted, thanks for pointing that out. So memcasecmp() and 
strcasecmp() may give different results. I'm not sure if that matters - 
as I understand it the main use for the 'raw' atom is with `git cat-file 
--batch` which does not support sorting. Also although strcasecmp() uses 
the current locale it does a byte-by-byte comparison so it is 
effectively ASCII only for UTF-8 anyway.

Best Wishes

Phillip

>> Best Wishes
>>
>> Phillip
>>
>> [1] https://pubs.opengroup.org/onlinepubs/9699919799/
>> [2] https://en.wikipedia.org/wiki/Dotted_and_dotless_I#In_computing
>>
> 
> Thanks.
> --
> ZhenNing Hu
> 


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/2] [GSOC] ref-filter: add %(raw) atom
  2021-05-30  6:29         ` ZheNing Hu
@ 2021-05-30 13:05           ` Phillip Wood
  2021-05-31 14:15             ` ZheNing Hu
  2021-05-31 15:35           ` Felipe Contreras
  1 sibling, 1 reply; 41+ messages in thread
From: Phillip Wood @ 2021-05-30 13:05 UTC (permalink / raw)
  To: ZheNing Hu, Felipe Contreras
  Cc: ZheNing Hu via GitGitGadget, Git List, Junio C Hamano,
	Christian Couder, Hariom Verma, Karthik Nayak, Bagas Sanjaya,
	Jeff King

Hi ZheNing

On 30/05/2021 07:29, ZheNing Hu wrote:
> Felipe Contreras <felipe.contreras@gmail.com> 于2021年5月29日周六 下午11:24写道:
>>
>> Phillip Wood wrote:
>>> On 27/05/2021 17:36, Felipe Contreras wrote:
>>>> ZheNing Hu via GitGitGadget wrote:
>>>> [...]
>>>>> +static int memcasecmp(const void *vs1, const void *vs2, size_t n)
>>>>
>>>> Why void *? We can delcare as char *.
>>>
>>> If you look at how this function is used you'll see
>>>        int (*cmp_fn)(const void *, const void *, size_t);
>>>        cmp_fn = s->sort_flags & REF_SORTING_ICASE
>>>                        ? memcasecmp : memcmp;
>>
>> Yeah, but why?
>>
>> We know we are comparing two char *. Presumably the reason is that
>> memcmp and memcasecmp use void *, but that could be remedied with:
>>
>>          cmp_fn = (int (*)(const char *, const char *, size_t))memcmp;
>>
>> That way the same cmp_fn could be used for the two cases.
>>
>> Either way I don't care particularly much. It also could be possible to
>> use void * and do the casting in tolower().
>>
> 
> I agree with Phillip's point of view here:
> It would be better for memcasecmp and memcmp to be consistent.
> 
>>>> (and I personally prefer lower to upper)
>>>
>>> We should be using tolower() as that is what POSIX specifies for
>>> strcasecmp() [1] which we are trying to emulate and there are cases[2] where
>>>        (tolower(c1) == tolower(c2)) != (toupper(c1) == toupper(c2))
>>
>> That's true.
>>
> 
> How about something like this:
> 
>   static int memcasecmp(const void *vs1, const void *vs2, size_t n)
>   {
> -       size_t i;
> -       const char *s1 = (const char *)vs1;
> -       const char *s2 = (const char *)vs2;
> -
> -       for (i = 0; i < n; i++) {
> -               unsigned char u1 = s1[i];
> -               unsigned char u2 = s2[i];
> -               int U1 = toupper (u1);
> -               int U2 = toupper (u2);
> -               int diff = (UCHAR_MAX <= INT_MAX ? U1 - U2
> -                       : U1 < U2 ? -1 : U2 < U1);
> +       const char *s1 = (const void *)vs1;
> +       const char *s2 = (const void *)vs2;

I think the new version looks fine apart from these casts. vs1 declared 
as 'const void *' in the function signature so this cast does not do 
anything. You could cast using (const char *) instead if you wanted but 
that is not required as you can assign a 'const void *' to 'const 
whatever *' without a cast.

Best Wishes

Phillip

> +       const char *end = s1 + n;
> +
> +       for (; s1 < end; s1++, s2++) {
> +               int diff = tolower(*s1) - tolower(*s2);
>                  if (diff)
>                          return diff;
>          }
> }
> 
>> --
>> Felipe Contreras
> 
> Thanks.
> --
> ZheNing Hu
> 


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/2] [GSOC] ref-filter: add %(raw) atom
  2021-05-30 13:01   ` [PATCH v2 2/2] [GSOC] ref-filter: add %(raw) atom ZheNing Hu via GitGitGadget
@ 2021-05-31  0:44     ` Junio C Hamano
  2021-05-31 14:35       ` ZheNing Hu
  2021-05-31  4:04     ` Junio C Hamano
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 41+ messages in thread
From: Junio C Hamano @ 2021-05-31  0:44 UTC (permalink / raw)
  To: ZheNing Hu via GitGitGadget
  Cc: git, Christian Couder, Hariom Verma, Karthik Nayak,
	Felipe Contreras, Bagas Sanjaya, Jeff King, Phillip Wood,
	ZheNing Hu

"ZheNing Hu via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Beyond, `--format=%(raw)` cannot be used with `--python`, `--shell`,
> `--tcl`, `--perl` because if our binary raw data is passed to a variable
> in the host language, the host languages may cause escape errors.

"may cause escape errors" just says you are not escaping correctly
in your code (implying that this patch is not good enough and with
more effort we should be able to fix it to allow binaries), but the
problem is the host languages may not support binaries
(specifically, anything with a NUL in it) at all, which is
fundamentally unfixable, in which case, rejecting is the only
sensible choice.

    ... because the host language may not support a NUL in the variables
    of its string type.

> +The raw data in a object is `raw`, For commit and tag objects, `raw` contain

s/contain/contains/, but more importantly, as we are not introducing
%(header), I do not see why we want to talk about its details.  For
commits and tags, just like for trees and blobs, 'raw' is the raw
data in the object, so beyond "The raw data of a object is %(raw)",
I do not think there is anything to talk about.

> +`header` and `contents` two parts, `header` is structured part of raw data, it
> +composed of "tree XXX", "parent YYY", etc lines in commits , or composed of
> +"object OOO", "type TTT", etc lines in tags; `contents` is unstructured "free
> +text" part of raw object data. For blob and tree objects, their raw data don't
> +have `header` and `contents` parts.


> +	const char *s = buf->buf;
> +	size_t cur_len = 0;
> +
> +	while ((cur_len != buf->len) && (isspace(*s) || *s == '\0')) {
>  		s++;
> +		cur_len++;

Is NUL treated the same as a whitespace letter for the purpose of
determining if a line is empty?  WHY?


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/2] [GSOC] ref-filter: add %(raw) atom
  2021-05-30 13:01   ` [PATCH v2 2/2] [GSOC] ref-filter: add %(raw) atom ZheNing Hu via GitGitGadget
  2021-05-31  0:44     ` Junio C Hamano
@ 2021-05-31  4:04     ` Junio C Hamano
  2021-05-31 14:40       ` ZheNing Hu
  2021-05-31  4:10     ` Junio C Hamano
  2021-05-31 15:41     ` Felipe Contreras
  3 siblings, 1 reply; 41+ messages in thread
From: Junio C Hamano @ 2021-05-31  4:04 UTC (permalink / raw)
  To: ZheNing Hu via GitGitGadget
  Cc: git, Christian Couder, Hariom Verma, Karthik Nayak,
	Felipe Contreras, Bagas Sanjaya, Jeff King, Phillip Wood,
	ZheNing Hu

"ZheNing Hu via GitGitGadget" <gitgitgadget@gmail.com> writes:

> @@ -530,6 +545,7 @@ static struct {
>  	{ "body", SOURCE_OBJ, FIELD_STR, body_atom_parser },
>  	{ "trailers", SOURCE_OBJ, FIELD_STR, trailers_atom_parser },
>  	{ "contents", SOURCE_OBJ, FIELD_STR, contents_atom_parser },
> +	{ "raw", SOURCE_OBJ, FIELD_STR, raw_atom_parser },
>  	{ "upstream", SOURCE_NONE, FIELD_STR, remote_ref_atom_parser },
>  	{ "push", SOURCE_NONE, FIELD_STR, remote_ref_atom_parser },
>  	{ "symref", SOURCE_NONE, FIELD_STR, refname_atom_parser },

Doesn't this conflict with your own zh/ref-filter-atom-type topic?
Shouldn't one build on top of the other?

Or did we find something fundamentally broken about the other topic
to make us retract it that I do not remember?

Thanks.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/2] [GSOC] ref-filter: add %(raw) atom
  2021-05-30 13:01   ` [PATCH v2 2/2] [GSOC] ref-filter: add %(raw) atom ZheNing Hu via GitGitGadget
  2021-05-31  0:44     ` Junio C Hamano
  2021-05-31  4:04     ` Junio C Hamano
@ 2021-05-31  4:10     ` Junio C Hamano
  2021-05-31 15:41     ` Felipe Contreras
  3 siblings, 0 replies; 41+ messages in thread
From: Junio C Hamano @ 2021-05-31  4:10 UTC (permalink / raw)
  To: ZheNing Hu via GitGitGadget
  Cc: git, Christian Couder, Hariom Verma, Karthik Nayak,
	Felipe Contreras, Bagas Sanjaya, Jeff King, Phillip Wood,
	ZheNing Hu

"ZheNing Hu via GitGitGadget" <gitgitgadget@gmail.com> writes:

> @@ -530,6 +545,7 @@ static struct {
>  	{ "body", SOURCE_OBJ, FIELD_STR, body_atom_parser },
>  	{ "trailers", SOURCE_OBJ, FIELD_STR, trailers_atom_parser },
>  	{ "contents", SOURCE_OBJ, FIELD_STR, contents_atom_parser },
> +	{ "raw", SOURCE_OBJ, FIELD_STR, raw_atom_parser },
>  	{ "upstream", SOURCE_NONE, FIELD_STR, remote_ref_atom_parser },
>  	{ "push", SOURCE_NONE, FIELD_STR, remote_ref_atom_parser },
>  	{ "symref", SOURCE_NONE, FIELD_STR, refname_atom_parser },

Doesn't this conflict with your own zh/ref-filter-atom-type topic?
Shouldn't one build on top of the other?

Or did we find something fundamentally broken about the other topic
to make us retract it that I do not remember?

Thanks.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 1/2] [GSOC] ref-filter: add obj-type check in grab contents
  2021-05-30 13:01   ` [PATCH v2 1/2] [GSOC] ref-filter: add obj-type check in grab contents ZheNing Hu via GitGitGadget
@ 2021-05-31  5:34     ` Junio C Hamano
  0 siblings, 0 replies; 41+ messages in thread
From: Junio C Hamano @ 2021-05-31  5:34 UTC (permalink / raw)
  To: ZheNing Hu via GitGitGadget
  Cc: git, Christian Couder, Hariom Verma, Karthik Nayak,
	Felipe Contreras, Bagas Sanjaya, Jeff King, Phillip Wood,
	ZheNing Hu

"ZheNing Hu via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: ZheNing Hu <adlternative@gmail.com>
>
> Only tag and commit will use `grab_sub_body_contents()`
> to grab object contents in origin implement. If we want
> to make blob, tree can also use `grab_sub_body_contents()`
> to get objects' raw data, a blob look like commit or tag
> will be wrongly regarded as commit, tag by `find_subpos()`.
>
> So we must add a test before `find_subpos()` to reject
> blob, tree objects. This will help us add %(raw) atom
> which can grab raw data of four type objects.
>
> Helped-by: Junio C Hamano <gitster@pobox.com>
> Signed-off-by: ZheNing Hu <adlternative@gmail.com>
> ---
>  ref-filter.c | 18 +++++++++++-------
>  1 file changed, 11 insertions(+), 7 deletions(-)

Thanks.  I'll rephrase the log message while queuing, but the change
as a separate preliminary step does make sense.

    ref-filter: add obj-type check in grab contents
    
    Only tag and commit objects use `grab_sub_body_contents()` to grab
    object contents in the current codebase.  We want to teach the
    function to also handle blobs and trees to get their raw data,
    without parsing a blob (whose contents looks like a commit or a tag)
    incorrectly as a commit or a tag.
    
    Skip the block of code that is specific to handling commits and tags
    early when the given object is of a wrong type to help later
    addition to handle other types of objects in this function.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/2] [GSOC] ref-filter: add %(raw) atom
  2021-05-30 13:05           ` Phillip Wood
@ 2021-05-31 14:15             ` ZheNing Hu
  0 siblings, 0 replies; 41+ messages in thread
From: ZheNing Hu @ 2021-05-31 14:15 UTC (permalink / raw)
  To: Phillip Wood
  Cc: Felipe Contreras, ZheNing Hu via GitGitGadget, Git List,
	Junio C Hamano, Christian Couder, Hariom Verma, Karthik Nayak,
	Bagas Sanjaya, Jeff King

Phillip Wood <phillip.wood123@gmail.com> 于2021年5月30日周日 下午9:05写道:
>
> >   static int memcasecmp(const void *vs1, const void *vs2, size_t n)
> >   {
> > -       size_t i;
> > -       const char *s1 = (const char *)vs1;
> > -       const char *s2 = (const char *)vs2;
> > -
> > -       for (i = 0; i < n; i++) {
> > -               unsigned char u1 = s1[i];
> > -               unsigned char u2 = s2[i];
> > -               int U1 = toupper (u1);
> > -               int U2 = toupper (u2);
> > -               int diff = (UCHAR_MAX <= INT_MAX ? U1 - U2
> > -                       : U1 < U2 ? -1 : U2 < U1);
> > +       const char *s1 = (const void *)vs1;
> > +       const char *s2 = (const void *)vs2;
>
> I think the new version looks fine apart from these casts. vs1 declared
> as 'const void *' in the function signature so this cast does not do
> anything. You could cast using (const char *) instead if you wanted but
> that is not required as you can assign a 'const void *' to 'const
> whatever *' without a cast.
>

Yes, forced conversion in "const char *s1 = (const char *)vs1;" is somewhat
redundant.

> Best Wishes
>
> Phillip
>

Thanks.
--
ZheNing Hu

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/2] [GSOC] ref-filter: add %(raw) atom
  2021-05-31  0:44     ` Junio C Hamano
@ 2021-05-31 14:35       ` ZheNing Hu
  2021-06-01  9:54         ` Junio C Hamano
  0 siblings, 1 reply; 41+ messages in thread
From: ZheNing Hu @ 2021-05-31 14:35 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: ZheNing Hu via GitGitGadget, Git List, Christian Couder,
	Hariom Verma, Karthik Nayak, Felipe Contreras, Bagas Sanjaya,
	Jeff King, Phillip Wood

Junio C Hamano <gitster@pobox.com> 于2021年5月31日周一 上午8:44写道:
>
> "ZheNing Hu via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > Beyond, `--format=%(raw)` cannot be used with `--python`, `--shell`,
> > `--tcl`, `--perl` because if our binary raw data is passed to a variable
> > in the host language, the host languages may cause escape errors.
>
> "may cause escape errors" just says you are not escaping correctly
> in your code (implying that this patch is not good enough and with
> more effort we should be able to fix it to allow binaries), but the
> problem is the host languages may not support binaries
> (specifically, anything with a NUL in it) at all, which is
> fundamentally unfixable, in which case, rejecting is the only
> sensible choice.
>
>     ... because the host language may not support a NUL in the variables
>     of its string type.
>

I agree. But host language not only support NUL but also some Non-ASCII
character and Non-UTF-8 code:

$ git hash-object a.out -w | xargs git update-ref refs/myblobs/aoutblob
$ git for-each-ref --format="name=%(raw)" refs/myblobs/aoutblob
--python | python2
  File "<stdin>", line 1
SyntaxError: Non-ASCII character '\x8b' in file <stdin> on line 2, but
no encoding declared;
 see http://python.org/dev/peps/pep-0263/ for details

$ git for-each-ref --format="name=%(raw)" refs/myblobs/aoutblob
--python |python3
SyntaxError: Non-UTF-8 code starting with '\x8b' in file <stdin> on
line 2, but no encoding declared;
 see http://python.org/dev/peps/pep-0263/ for details

> > +The raw data in a object is `raw`, For commit and tag objects, `raw` contain
>
> s/contain/contains/, but more importantly, as we are not introducing
> %(header), I do not see why we want to talk about its details.  For
> commits and tags, just like for trees and blobs, 'raw' is the raw
> data in the object, so beyond "The raw data of a object is %(raw)",
> I do not think there is anything to talk about.
>

Ok, I will delete this part.

> > +     const char *s = buf->buf;
> > +     size_t cur_len = 0;
> > +
> > +     while ((cur_len != buf->len) && (isspace(*s) || *s == '\0')) {
> >               s++;
> > +             cur_len++;
>
> Is NUL treated the same as a whitespace letter for the purpose of
> determining if a line is empty?  WHY?
>

Well, there seems to be no correction here. But is it true that memory
like "\0abc" is considered empty?

Thanks.
--
ZheNing Hu

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/2] [GSOC] ref-filter: add %(raw) atom
  2021-05-31  4:04     ` Junio C Hamano
@ 2021-05-31 14:40       ` ZheNing Hu
  2021-06-01  8:54         ` Junio C Hamano
  0 siblings, 1 reply; 41+ messages in thread
From: ZheNing Hu @ 2021-05-31 14:40 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: ZheNing Hu via GitGitGadget, Git List, Christian Couder,
	Hariom Verma, Karthik Nayak, Felipe Contreras, Bagas Sanjaya,
	Jeff King, Phillip Wood

Junio C Hamano <gitster@pobox.com> 于2021年5月31日周一 下午12:04写道:
>
> "ZheNing Hu via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > @@ -530,6 +545,7 @@ static struct {
> >       { "body", SOURCE_OBJ, FIELD_STR, body_atom_parser },
> >       { "trailers", SOURCE_OBJ, FIELD_STR, trailers_atom_parser },
> >       { "contents", SOURCE_OBJ, FIELD_STR, contents_atom_parser },
> > +     { "raw", SOURCE_OBJ, FIELD_STR, raw_atom_parser },
> >       { "upstream", SOURCE_NONE, FIELD_STR, remote_ref_atom_parser },
> >       { "push", SOURCE_NONE, FIELD_STR, remote_ref_atom_parser },
> >       { "symref", SOURCE_NONE, FIELD_STR, refname_atom_parser },
>
> Doesn't this conflict with your own zh/ref-filter-atom-type topic?
> Shouldn't one build on top of the other?
>
> Or did we find something fundamentally broken about the other topic
> to make us retract it that I do not remember?
>
> Thanks.

I am waiting for zh/ref-filter-atom-type to be merged into master. But it
hasn't happened yet. But if I want to base the current topic on
zh/ref-filter-atom-type, GGG will send past patches (zh/ref-filter-atom-type)
repeatedly. If necessary, I will submit the current branch based on
zh/ref-filter-atom-type.

Thanks.
--
ZheNing Hu

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/2] [GSOC] ref-filter: add %(raw) atom
  2021-05-30  6:29         ` ZheNing Hu
  2021-05-30 13:05           ` Phillip Wood
@ 2021-05-31 15:35           ` Felipe Contreras
  1 sibling, 0 replies; 41+ messages in thread
From: Felipe Contreras @ 2021-05-31 15:35 UTC (permalink / raw)
  To: ZheNing Hu, Felipe Contreras
  Cc: Phillip Wood, ZheNing Hu via GitGitGadget, Git List,
	Junio C Hamano, Christian Couder, Hariom Verma, Karthik Nayak,
	Bagas Sanjaya, Jeff King

ZheNing Hu wrote:
> Felipe Contreras <felipe.contreras@gmail.com> 于2021年5月29日周六 下午11:24写道:
> >
> > Phillip Wood wrote:
> > > On 27/05/2021 17:36, Felipe Contreras wrote:
> > > > ZheNing Hu via GitGitGadget wrote:
> > > > [...]
> > > >> +static int memcasecmp(const void *vs1, const void *vs2, size_t n)
> > > >
> > > > Why void *? We can delcare as char *.
> > >
> > > If you look at how this function is used you'll see
> > >       int (*cmp_fn)(const void *, const void *, size_t);
> > >       cmp_fn = s->sort_flags & REF_SORTING_ICASE
> > >                       ? memcasecmp : memcmp;
> >
> > Yeah, but why?
> >
> > We know we are comparing two char *. Presumably the reason is that
> > memcmp and memcasecmp use void *, but that could be remedied with:
> >
> >         cmp_fn = (int (*)(const char *, const char *, size_t))memcmp;
> >
> > That way the same cmp_fn could be used for the two cases.
> >
> > Either way I don't care particularly much. It also could be possible to
> > use void * and do the casting in tolower().
> >
> 
> I agree with Phillip's point of view here:
> It would be better for memcasecmp and memcmp to be consistent.

Fair enough.

>  static int memcasecmp(const void *vs1, const void *vs2, size_t n)
>  {
> -       size_t i;
> -       const char *s1 = (const char *)vs1;
> -       const char *s2 = (const char *)vs2;
> -
> -       for (i = 0; i < n; i++) {
> -               unsigned char u1 = s1[i];
> -               unsigned char u2 = s2[i];
> -               int U1 = toupper (u1);
> -               int U2 = toupper (u2);
> -               int diff = (UCHAR_MAX <= INT_MAX ? U1 - U2
> -                       : U1 < U2 ? -1 : U2 < U1);
> +       const char *s1 = (const void *)vs1;
> +       const char *s2 = (const void *)vs2;

vs1 is already a const void *, and there's not much point in adding
another line:

	const char *s1 = vs1, *s2 = vs2;

Cheers.

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH v2 2/2] [GSOC] ref-filter: add %(raw) atom
  2021-05-30 13:01   ` [PATCH v2 2/2] [GSOC] ref-filter: add %(raw) atom ZheNing Hu via GitGitGadget
                       ` (2 preceding siblings ...)
  2021-05-31  4:10     ` Junio C Hamano
@ 2021-05-31 15:41     ` Felipe Contreras
  2021-06-01 10:37       ` ZheNing Hu
  3 siblings, 1 reply; 41+ messages in thread
From: Felipe Contreras @ 2021-05-31 15:41 UTC (permalink / raw)
  To: ZheNing Hu via GitGitGadget, git
  Cc: Junio C Hamano, Christian Couder, Hariom Verma, Karthik Nayak,
	Felipe Contreras, Bagas Sanjaya, Jeff King, Phillip Wood,
	ZheNing Hu, ZheNing Hu

ZheNing Hu via GitGitGadget wrote:
> From: ZheNing Hu <adlternative@gmail.com>

> +static int memcasecmp(const void *vs1, const void *vs2, size_t n)
> +{
> +	const char *s1 = (const void *)vs1;
> +	const char *s2 = (const void *)vs2;

As I explained in another mail, I think this is better:

	const char *s1 = vs1, *s2 = vs2;

> +	const char *end = s1 + n;
> +
> +	for (; s1 < end; s1++, s2++) {
> +		int diff = tolower(*s1) - tolower(*s2);
> +		if (diff)
> +			return diff;
> +	}
> +	return 0;
> +}

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/2] [GSOC] ref-filter: add %(raw) atom
  2021-05-31 14:40       ` ZheNing Hu
@ 2021-06-01  8:54         ` Junio C Hamano
  2021-06-01 11:00           ` ZheNing Hu
  0 siblings, 1 reply; 41+ messages in thread
From: Junio C Hamano @ 2021-06-01  8:54 UTC (permalink / raw)
  To: ZheNing Hu; +Cc: ZheNing Hu via GitGitGadget, Git List, Johannes Schindelin

ZheNing Hu <adlternative@gmail.com> writes:

>> Doesn't this conflict with your own zh/ref-filter-atom-type topic?
>> Shouldn't one build on top of the other?
>>
>> Or did we find something fundamentally broken about the other topic
>> to make us retract it that I do not remember?
>>
>> Thanks.
>
> I am waiting for zh/ref-filter-atom-type to be merged into master. But it

As you sent this that conflicts with it, clearly you are doing
something else that conflicts with it _without waiting_ ;-).

> hasn't happened yet. But if I want to base the current topic on
> zh/ref-filter-atom-type, GGG will send past patches (zh/ref-filter-atom-type)
> repeatedly.

I thought GGG lets you say "this is based on that other branch, not
on the 'master' branch" to solve that exact issue?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/2] [GSOC] ref-filter: add %(raw) atom
  2021-05-31 14:35       ` ZheNing Hu
@ 2021-06-01  9:54         ` Junio C Hamano
  2021-06-01 11:05           ` ZheNing Hu
  0 siblings, 1 reply; 41+ messages in thread
From: Junio C Hamano @ 2021-06-01  9:54 UTC (permalink / raw)
  To: ZheNing Hu
  Cc: ZheNing Hu via GitGitGadget, Git List, Christian Couder,
	Hariom Verma, Karthik Nayak, Felipe Contreras, Bagas Sanjaya,
	Jeff King, Phillip Wood

ZheNing Hu <adlternative@gmail.com> writes:

>>     ... because the host language may not support a NUL in the variables
>>     of its string type.
>>
>
> I agree. But host language not only support NUL but also some Non-ASCII
> character and Non-UTF-8 code:

Yup, rephase with "a NUL" with "arbitrary binary data" and you got
what I meant.  Thanks.

>> > +     const char *s = buf->buf;
>> > +     size_t cur_len = 0;
>> > +
>> > +     while ((cur_len != buf->len) && (isspace(*s) || *s == '\0')) {
>> >               s++;
>> > +             cur_len++;
>>
>> Is NUL treated the same as a whitespace letter for the purpose of
>> determining if a line is empty?  WHY?
>
> Well, there seems to be no correction here. But is it true that memory
> like "\0abc" is considered empty?

That sample has 'a' or 'b' or 'c' that are clearly not part of an
"empty" string and irrelevant.  After all, a string " abc" is not
treated as empty in the original implementation, either.

You are treating a block of memory with e.g. " \000 " (SP NUL SP) as
an "empty line" just like you do for "   " (SP SP SP), but I think we
should treat it more like " \001 " or " \007 ", i.e. not an empty
string at all.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/2] [GSOC] ref-filter: add %(raw) atom
  2021-05-31 15:41     ` Felipe Contreras
@ 2021-06-01 10:37       ` ZheNing Hu
  0 siblings, 0 replies; 41+ messages in thread
From: ZheNing Hu @ 2021-06-01 10:37 UTC (permalink / raw)
  To: Felipe Contreras
  Cc: ZheNing Hu via GitGitGadget, Git List, Junio C Hamano,
	Christian Couder, Hariom Verma, Karthik Nayak, Bagas Sanjaya,
	Jeff King, Phillip Wood

Felipe Contreras <felipe.contreras@gmail.com> 于2021年5月31日周一 下午11:41写道:
>
> ZheNing Hu via GitGitGadget wrote:
> > From: ZheNing Hu <adlternative@gmail.com>
>
> > +static int memcasecmp(const void *vs1, const void *vs2, size_t n)
> > +{
> > +     const char *s1 = (const void *)vs1;
> > +     const char *s2 = (const void *)vs2;
>
> As I explained in another mail, I think this is better:
>
>         const char *s1 = vs1, *s2 = vs2;
>

OK, I understand it now.

> > +     const char *end = s1 + n;
> > +
> > +     for (; s1 < end; s1++, s2++) {
> > +             int diff = tolower(*s1) - tolower(*s2);
> > +             if (diff)
> > +                     return diff;
> > +     }
> > +     return 0;
> > +}
>
> --
> Felipe Contreras

Thanks.
--
ZheNing Hu

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/2] [GSOC] ref-filter: add %(raw) atom
  2021-06-01  8:54         ` Junio C Hamano
@ 2021-06-01 11:00           ` ZheNing Hu
  2021-06-01 13:48             ` Johannes Schindelin
  0 siblings, 1 reply; 41+ messages in thread
From: ZheNing Hu @ 2021-06-01 11:00 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: ZheNing Hu via GitGitGadget, Git List, Johannes Schindelin

Junio C Hamano <gitster@pobox.com> 于2021年6月1日周二 下午4:54写道:

>
> ZheNing Hu <adlternative@gmail.com> writes:
>
> >> Doesn't this conflict with your own zh/ref-filter-atom-type topic?
> >> Shouldn't one build on top of the other?
> >>
> >> Or did we find something fundamentally broken about the other topic
> >> to make us retract it that I do not remember?
> >>
> >> Thanks.
> >
> > I am waiting for zh/ref-filter-atom-type to be merged into master. But it
>
> As you sent this that conflicts with it, clearly you are doing
> something else that conflicts with it _without waiting_ ;-).
>

OK.

> > hasn't happened yet. But if I want to base the current topic on
> > zh/ref-filter-atom-type, GGG will send past patches (zh/ref-filter-atom-type)
> > repeatedly.
>
> I thought GGG lets you say "this is based on that other branch, not
> on the 'master' branch" to solve that exact issue?

I'm not sure...I will try it after I rebasing this topic to
zh/ref-filter-atom-type.
I just remember that it looked like something went wrong with my base on
your patch last time.[1]

[1] https://lore.kernel.org/git/pull.870.v6.git.1613739235241.gitgitgadget@gmail.com/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/2] [GSOC] ref-filter: add %(raw) atom
  2021-06-01  9:54         ` Junio C Hamano
@ 2021-06-01 11:05           ` ZheNing Hu
  0 siblings, 0 replies; 41+ messages in thread
From: ZheNing Hu @ 2021-06-01 11:05 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: ZheNing Hu via GitGitGadget, Git List, Christian Couder,
	Hariom Verma, Karthik Nayak, Felipe Contreras, Bagas Sanjaya,
	Jeff King, Phillip Wood

Junio C Hamano <gitster@pobox.com> 于2021年6月1日周二 下午5:54写道:
>
> > Well, there seems to be no correction here. But is it true that memory
> > like "\0abc" is considered empty?
>
> That sample has 'a' or 'b' or 'c' that are clearly not part of an
> "empty" string and irrelevant.  After all, a string " abc" is not
> treated as empty in the original implementation, either.
>

In other words, we still need to look at each character of strbuf,
instead of stopping at NUL.

> You are treating a block of memory with e.g. " \000 " (SP NUL SP) as
> an "empty line" just like you do for "   " (SP SP SP), but I think we
> should treat it more like " \001 " or " \007 ", i.e. not an empty
> string at all.

OK. I understand it now: " \001 " is It’s like a block of space, but it’s
not truly "empty", "SP NUL SP" is same too, So the complete definition of
"empty" here should be: All characters are SP which do not contain NUL
or other characters.

Thanks.
--
ZheNing Hu

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/2] [GSOC] ref-filter: add %(raw) atom
  2021-06-01 11:00           ` ZheNing Hu
@ 2021-06-01 13:48             ` Johannes Schindelin
  0 siblings, 0 replies; 41+ messages in thread
From: Johannes Schindelin @ 2021-06-01 13:48 UTC (permalink / raw)
  To: ZheNing Hu; +Cc: Junio C Hamano, ZheNing Hu via GitGitGadget, Git List

[-- Attachment #1: Type: text/plain, Size: 1913 bytes --]

Hi,

On Tue, 1 Jun 2021, ZheNing Hu wrote:

> Junio C Hamano <gitster@pobox.com> 于2021年6月1日周二 下午4:54写道:
>
> > ZheNing Hu <adlternative@gmail.com> writes:
> >
> > > [...] But if I want to base the current topic on
> > > zh/ref-filter-atom-type, GGG will send past patches
> > > (zh/ref-filter-atom-type) repeatedly.
> >
> > I thought GGG lets you say "this is based on that other branch, not on
> > the 'master' branch" to solve that exact issue?
>
> I'm not sure...I will try it after I rebasing this topic to
> zh/ref-filter-atom-type.

Yes, it should be possible to rebase your patch on top of one of the
[a-z][a-z]/* patches Junio publishes at https://github.com/gitster/git
(and which get mirrored automatically to
https://github.com/gitgitgadget/git via a scheduled Azure Pipeline), and
then to change the PR base (simply click the `Edit` button next to the PR
title, as if you wanted to edit the title, and you can also change the
base branch).

> I just remember that it looked like something went wrong with my base on
> your patch last time.[1]
>
> [1] https://lore.kernel.org/git/pull.870.v6.git.1613739235241.gitgitgadget@gmail.com/

https://github.com/gitgitgadget/git/pull/870 seems to be based on
jc/diffcore-rotate all right.

If you are talking about including Junio's patch in v5
(https://lore.kernel.org/git/fb4bfd0f8b162e51e71711fe5503ca684f980d58.1613480198.git.gitgitgadget@gmail.com/#r),
I _think_ that there might have been the simple problem of
jc/diffcore-rotate having been force-pushed just before you sent v5, and
therefore you had a stale branch.

To prevent things like that, it is a good idea to set the upstream of your
local branch accordingly (in this instance, `git branch
--set-upstream-to=gitgitgadget/jc/diffcore-rotate`) and ensure to `git
pull --rebase` before force-pushing and submitting.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2021-06-01 13:48 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-27 14:43 [PATCH 0/2] [GSOC] ref-filter: add %(raw) atom ZheNing Hu via GitGitGadget
2021-05-27 14:43 ` [PATCH 1/2] " ZheNing Hu via GitGitGadget
2021-05-27 16:36   ` Felipe Contreras
2021-05-28 13:02     ` ZheNing Hu
2021-05-28 16:30       ` Felipe Contreras
2021-05-30  5:37         ` ZheNing Hu
2021-05-29 13:23     ` Phillip Wood
2021-05-29 15:24       ` Felipe Contreras
2021-05-29 17:23         ` Phillip Wood
2021-05-30  6:29         ` ZheNing Hu
2021-05-30 13:05           ` Phillip Wood
2021-05-31 14:15             ` ZheNing Hu
2021-05-31 15:35           ` Felipe Contreras
2021-05-30  6:26       ` ZheNing Hu
2021-05-30 13:02         ` Phillip Wood
2021-05-28  3:03   ` Junio C Hamano
2021-05-28 15:04     ` ZheNing Hu
2021-05-28 16:38       ` Felipe Contreras
2021-05-30  8:11       ` ZheNing Hu
2021-05-27 14:43 ` [PATCH 2/2] [GSOC] ref-filter: add %(header) atom ZheNing Hu via GitGitGadget
2021-05-27 16:37   ` Felipe Contreras
2021-05-28  3:06   ` Junio C Hamano
2021-05-28  4:36   ` Junio C Hamano
2021-05-28 15:19     ` ZheNing Hu
2021-05-27 15:39 ` [PATCH 0/2] [GSOC] ref-filter: add %(raw) atom Felipe Contreras
2021-05-30 13:01 ` [PATCH v2 " ZheNing Hu via GitGitGadget
2021-05-30 13:01   ` [PATCH v2 1/2] [GSOC] ref-filter: add obj-type check in grab contents ZheNing Hu via GitGitGadget
2021-05-31  5:34     ` Junio C Hamano
2021-05-30 13:01   ` [PATCH v2 2/2] [GSOC] ref-filter: add %(raw) atom ZheNing Hu via GitGitGadget
2021-05-31  0:44     ` Junio C Hamano
2021-05-31 14:35       ` ZheNing Hu
2021-06-01  9:54         ` Junio C Hamano
2021-06-01 11:05           ` ZheNing Hu
2021-05-31  4:04     ` Junio C Hamano
2021-05-31 14:40       ` ZheNing Hu
2021-06-01  8:54         ` Junio C Hamano
2021-06-01 11:00           ` ZheNing Hu
2021-06-01 13:48             ` Johannes Schindelin
2021-05-31  4:10     ` Junio C Hamano
2021-05-31 15:41     ` Felipe Contreras
2021-06-01 10:37       ` ZheNing Hu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).