All of lore.kernel.org
 help / color / mirror / Atom feed
From: ZheNing Hu <adlternative@gmail.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: ZheNing Hu via GitGitGadget <gitgitgadget@gmail.com>,
	Git List <git@vger.kernel.org>,
	Christian Couder <christian.couder@gmail.com>,
	Hariom Verma <hariom18599@gmail.com>,
	Karthik Nayak <karthik.188@gmail.com>,
	Felipe Contreras <felipe.contreras@gmail.com>,
	Bagas Sanjaya <bagasdotme@gmail.com>, Jeff King <peff@peff.net>
Subject: Re: [PATCH 1/2] [GSOC] ref-filter: add %(raw) atom
Date: Fri, 28 May 2021 23:04:59 +0800	[thread overview]
Message-ID: <CAOLTT8SLLgZnF0SV2FPPBJkB=ybeh8mamTPqc-M6CXQeepooOQ@mail.gmail.com> (raw)
In-Reply-To: <xmqq1r9r8spu.fsf@gitster.g>

Junio C Hamano <gitster@pobox.com> 于2021年5月28日周五 上午11:04写道:
>
> "ZheNing Hu via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > The raw data of blob, tree objects may contain '\0', but most of
> > the logic in `ref-filter` depands on the output of the atom being
> > a structured string (end with '\0').
>
> Text, yes, string is also OK.  But structured?  Probably not.
>
>     ... being text (specifically, no embedded NULs in it).
>

OK.

> > Beyond, `--format=%(raw)` should not combine with `--python`, `--shell`,
> > `--tcl`, `--perl` because if our binary raw data is passed to a variable
> > in the host language, the host languages may cause escape errors.
>
> OK.  I think at least --perl and possibly --python should be able to
> express NULs in the "string" type we use from the host language, but
> I am perfectly fine with the decision to leave it to later updates.
>

Yes, for the time being, unified processing --<lang> will be easier.

> > +Note that `--format=%(raw)` should not combine with `--python`, `--shell`, `--tcl`,
>
> "should not combine" -> "cannot be used" would make it read more
> naturally (ditto for the phrase used in the proposed log message).
>

OK, so the wording is better.

> >
> >  struct atom_value {
> >       const char *s;
> > +     size_t s_size;
>
> OK, so everything used to be a C-string that cannot hold NULs in it,
> but now it is a counted <ptr, len> string.  Good.
>

This suddenly reminded me of strbuf...
I don't know if it is worth replacing all s, s_size with strbuf.

> > @@ -588,6 +607,10 @@ static int parse_ref_filter_atom(const struct ref_format *format,
> >               return strbuf_addf_ret(err, -1, _("malformed field name: %.*s"),
> >                                      (int)(ep-atom), atom);
> >
> > +     if (format->quote_style && starts_with(sp, "raw"))
> > +             return strbuf_addf_ret(err, -1, _("--format=%.*s should not combine with"
> > +                             "--python, --shell, --tcl, --perl"), (int)(ep-atom), atom);
>
> Don't we want to allow "raw:size" that would be a plain text?
> I am not sure if this check belongs here in the first place.
> Shouldn't it be done in raw_atom_parser() instead?
>

You are right: "raw:size" should be keep, but I can't this check to
raw_atom_parser(), becase "if I move it to raw_atom_parser(), when we
use:

`git ref-filter --format=%raw --sort=raw --python`

Git will continue to run instead of die because parse_sorting_atom() will
use a dummy ref_format and don't remember --language details, next time
format_ref_array_item() will reuse the used_atom entry of sorting atom in
parse_ref_filter_atom(), this will skip the check in raw_atom_parser()."

> Another idea is to teach a more generic rule to quote_formatting()
> to detect NULs in v->s[0..v->s_size] at runtime and barf, i.e. a
> plain-text blob object can be used with "--shell --format=%(raw)"
> just fine.
>

The cost of such a check is not small. Maybe can add an option
such as "--only-text" to do it.

> > @@ -652,11 +675,14 @@ static int parse_ref_filter_atom(const struct ref_format *format,
> >       return at;
> >  }
> >
> > -static void quote_formatting(struct strbuf *s, const char *str, int quote_style)
> > +static void quote_formatting(struct strbuf *s, const char *str, size_t len, int quote_style)
> >  {
> >       switch (quote_style) {
> >       case QUOTE_NONE:
> > -             strbuf_addstr(s, str);
> > +             if (len != ATOM_VALUE_S_SIZE_INIT)
> > +                     strbuf_add(s, str, len);
> > +             else
> > +                     strbuf_addstr(s, str);
>
> It probably is a good idea to invent a C preprocessor macro for a
> named constant to be used when a structure is initialized, but it
> would be easier to read if the rule is "len field is negative when
> the value is a C-string", e.g.
>
>                 if (len < 0)
>

I do not recognize such an approach because we are deal
with "size_t s_size", if (len < 0) will never be established.
I use -1 is because it's equal to 18446744073709551615
and it's impossible to have such a large file in Git.

> > @@ -785,14 +814,16 @@ static int if_atom_handler(struct atom_value *atomv, struct ref_formatting_state
> >       return 0;
> >  }
> >
> > -static int is_empty(const char *s)
> > +static int is_empty(struct strbuf *buf)
> >  {
> > -     while (*s != '\0') {
> > -             if (!isspace(*s))
> > -                     return 0;
> > +     const char *s = buf->buf;
> > +     size_t cur_len = 0;
> > +
> > +     while ((cur_len != buf->len) && (isspace(*s) || *s == '\0')) {
> >               s++;
> > +             cur_len++;
> >       }
> > -     return 1;
> > +     return cur_len == buf->len;
> >  }
>
> Assuming that we do want to treat NULs the same way as whitespaces,
> the updated code works as intended, which is good.  But I have no
> reason to support that design decision.  I do not have a strong
> reason to support a design decision that goes the opposite way to
> treat a NUL just like we treat an 'X', but at least I can understand
> it (i.e. "because we have no reason to special case NUL any more
> than 'X' when trying to see if a buffer is 'empty', we don't").
> This code on the other hand must be supported with "because we need
> to special case NUL for such and such reasons for the purpose of
> determining if a buffer is 'empty', we treat them the same way as
> whitespaces".
>

Something like "\0abc", from the perspective of the string, it is empty;
from the perspective of the memory, it is not empty; I don't know any
absolutely good solutions here.

> Can the change in this commit violate the invariant that
> if_then_else->str cannot be NULL, which seems to have been the case
> forever as we see an unchecked strcmp() done in the original?
>
> If so, perhaps you can check the condition upfront, where you
> compute str_len above, e.g.
>
>         if (!if_then_else->str) {
>                 if (if_then_else->cmp_status == COMPARE_EQUAL ||
>                     if_then_else->cmp_status == COMPARE_UNEQUAL)
>                         BUG(...);
>         } else
>                 str_len = strlen(...);
>
> If not, then I do not see the point of adding this (and later) check
> with BUG to this code.
>
> Or is the invariant that .str must not be NULL could have been
> violated without this patch (i.e. the original was buggy in running
> strcmp() on .str without checking)?  If so, please make it a separate
> preliminary change to add such an assert.
>

The BUG() here actually acts as an "assert()". ".str must not be NULL" is
right, it point to "xxx" in "%(if:equals=xxx)", so it seems that these BUG()
are somewhat redundant, I will remove them.

> >  /* See grab_values */
> > -static void grab_sub_body_contents(struct atom_value *val, int deref, void *buf)
> > +static void grab_raw_data(struct atom_value *val, int deref, void *buf, unsigned long buf_size, struct object *obj)
> >  {
> >       int i;
> >       const char *subpos = NULL, *bodypos = NULL, *sigpos = NULL;
> > @@ -1307,10 +1349,22 @@ static void grab_sub_body_contents(struct atom_value *val, int deref, void *buf)
> >                       continue;
> >               if (deref)
> >                       name++;
> > -             if (strcmp(name, "body") &&
> > -                 !starts_with(name, "subject") &&
> > -                 !starts_with(name, "trailers") &&
> > -                 !starts_with(name, "contents"))
> > +
> > +             if (starts_with(name, "raw")) {
> > +                     if (atom->u.raw_data.option == RAW_BARE) {
> > +                             v->s = xmemdupz(buf, buf_size);
> > +                             v->s_size = buf_size;
> > +                     } else if (atom->u.raw_data.option == RAW_LENGTH)
> > +                             v->s = xstrfmt("%"PRIuMAX, (uintmax_t)buf_size);
> > +                     continue;
> > +             }
>
> I can understand that "raw[:<options>]" handling has been inserted
> above the existing "from here on, we only deal with log message
> components" check.  But
>
> > +             if ((obj->type != OBJ_TAG &&
> > +                  obj->type != OBJ_COMMIT) ||
>
> I do not see why these new conditions are added.  The change is not
> justified in the proposed log message, the original did not need
> these conditions, and this does not concern the primary point of the
> change, which is to start supporting %(raw[:<option>]) placeholder.
>
> If it is needed as a bugfix (e.g. it may be that you consider "if a
> blob has contents that looks very similar to 'git cat-file commit
> HEAD', %(body) and friends parse these out, even though it is not a
> commit" is a bug and the change to add these extra tests is meant as
> a fix), that should be done as a preliminary change before adding
> the support for a new atom.
>

Almost what I means: Make a strong guarantee that blob and tree
will never pass the check so that we can don't worry about incorrect
parsing in find_subpos(). The reason I put it in this patch is that only
commit and tag objects will execute `grab_sub_body_contents()` before,
but in the current patch it has changed.

> > +                 (strcmp(name, "body") &&
> > +                  !starts_with(name, "subject") &&
> > +                  !starts_with(name, "trailers") &&
> > +                  !starts_with(name, "contents")))
> >                       continue;
> >               if (!subpos)
> >                       find_subpos(buf,
> > @@ -1374,25 +1428,30 @@ static void fill_missing_values(struct atom_value *val)
> >   * pointed at by the ref itself; otherwise it is the object the
> >   * ref (which is a tag) refers to.
> >   */
> > -static void grab_values(struct atom_value *val, int deref, struct object *obj, void *buf)
> > +static void grab_values(struct atom_value *val, int deref, struct object *obj, struct expand_data *data)
> >  {
> > +     void *buf = data->content;
> > +     unsigned long buf_size = data->size;
> > +
> >       switch (obj->type) {
> >       case OBJ_TAG:
> >               grab_tag_values(val, deref, obj);
> > -             grab_sub_body_contents(val, deref, buf);
> > +             grab_raw_data(val, deref, buf, buf_size, obj);
>
> It is very strange that a helper that is named to grab raw data can
> still process pieces out of a structured data.  The original name is
> still a far better match to what the function does, even after this
> patch teaches it to also honor %(raw) placeholder.  It is still
> about grabbing various "sub"-pieces out of "body contents", and the
> sub-piece the %(raw) grabs just happens to be "the whole thing".
>

Well, I can't think of a better name, My original idea was grab_raw_data()
can grab itself, header, contents, It is more general than
grab_sub_body_contents(),
raw data is not a part of "subject" or "body" of "contents"...

> > +static int memcasecmp(const void *vs1, const void *vs2, size_t n)
> > +{
> > +     size_t i;
> > +     const char *s1 = (const char *)vs1;
> > +     const char *s2 = (const char *)vs2;
> > +
> > +     for (i = 0; i < n; i++) {
> > +             unsigned char u1 = s1[i];
> > +             unsigned char u2 = s2[i];
> > +             int U1 = toupper (u1);
> > +             int U2 = toupper (u2);
>
> Does toupper('\0') even have a defined meaning?
>
> > +             int diff = (UCHAR_MAX <= INT_MAX ? U1 - U2
> > +                     : U1 < U2 ? -1 : U2 < U1);
>
> Looks crazy to worry about uchar wider than int.  Such a system is
> not even standard compliant, is it?
>

Forget about this inelegant help function. As I said in my reply to Felipe,
this is copied from gunlib...

> > @@ -2304,6 +2382,7 @@ static int cmp_ref_sorting(struct ref_sorting *s, struct ref_array_item *a, stru
> >       int cmp_detached_head = 0;
> >       cmp_type cmp_type = used_atom[s->atom].type;
> >       struct strbuf err = STRBUF_INIT;
> > +     size_t slen = 0;
> >
> >       if (get_ref_atom_value(a, s->atom, &va, &err))
> >               die("%s", err.buf);
> > @@ -2317,10 +2396,32 @@ static int cmp_ref_sorting(struct ref_sorting *s, struct ref_array_item *a, stru
> >       } else if (s->sort_flags & REF_SORTING_VERSION) {
> >               cmp = versioncmp(va->s, vb->s);
> >       } else if (cmp_type == FIELD_STR) {
> > -             int (*cmp_fn)(const char *, const char *);
> > -             cmp_fn = s->sort_flags & REF_SORTING_ICASE
> > -                     ? strcasecmp : strcmp;
> > -             cmp = cmp_fn(va->s, vb->s);
>
> > +             if (va->s_size == ATOM_VALUE_S_SIZE_INIT &&
> > +                 vb->s_size == ATOM_VALUE_S_SIZE_INIT) {
> > +                     int (*cmp_fn)(const char *, const char *);
> > +                     cmp_fn = s->sort_flags & REF_SORTING_ICASE
> > +                             ? strcasecmp : strcmp;
> > +                     cmp = cmp_fn(va->s, vb->s);
> > +             } else {
> > +                     int (*cmp_fn)(const void *, const void *, size_t);
> > +                     cmp_fn = s->sort_flags & REF_SORTING_ICASE
> > +                             ? memcasecmp : memcmp;
>
> Why not introduce two local temporary variables a_size and b_size
> and initialize them upfront like so:
>
>                 a_size = va->s_size < 0 ? strlen(va->s) : va->s_size;
>                 b_size = vb->s_size < 0 ? strlen(vb->s) : vb->s_size;
>
> Wouldn't that allow you to do without the complex "if both are
> counted, do this, if A is counted but B is not, do that, ..."
> cascade?
>

Sorry, such code would be really ugly for reading.

> I can sort-of see the point of special casing "both are traditional
> C strings" case (i.e. the "if" side of the "else" we are discussing
> here) and using strcasecmp/strcmp instead of memcasecmp/memcmp, but
> I do not see much point in having the if/elseif/else cascade inside
> this "else" clause.
>

I will try to modify its logic.

> Thanks.

Your reply is very accurate.

Thanks.
--
ZheNing Hu

  reply	other threads:[~2021-05-28 15:05 UTC|newest]

Thread overview: 41+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-05-27 14:43 [PATCH 0/2] [GSOC] ref-filter: add %(raw) atom ZheNing Hu via GitGitGadget
2021-05-27 14:43 ` [PATCH 1/2] " ZheNing Hu via GitGitGadget
2021-05-27 16:36   ` Felipe Contreras
2021-05-28 13:02     ` ZheNing Hu
2021-05-28 16:30       ` Felipe Contreras
2021-05-30  5:37         ` ZheNing Hu
2021-05-29 13:23     ` Phillip Wood
2021-05-29 15:24       ` Felipe Contreras
2021-05-29 17:23         ` Phillip Wood
2021-05-30  6:29         ` ZheNing Hu
2021-05-30 13:05           ` Phillip Wood
2021-05-31 14:15             ` ZheNing Hu
2021-05-31 15:35           ` Felipe Contreras
2021-05-30  6:26       ` ZheNing Hu
2021-05-30 13:02         ` Phillip Wood
2021-05-28  3:03   ` Junio C Hamano
2021-05-28 15:04     ` ZheNing Hu [this message]
2021-05-28 16:38       ` Felipe Contreras
2021-05-30  8:11       ` ZheNing Hu
2021-05-27 14:43 ` [PATCH 2/2] [GSOC] ref-filter: add %(header) atom ZheNing Hu via GitGitGadget
2021-05-27 16:37   ` Felipe Contreras
2021-05-28  3:06   ` Junio C Hamano
2021-05-28  4:36   ` Junio C Hamano
2021-05-28 15:19     ` ZheNing Hu
2021-05-27 15:39 ` [PATCH 0/2] [GSOC] ref-filter: add %(raw) atom Felipe Contreras
2021-05-30 13:01 ` [PATCH v2 " ZheNing Hu via GitGitGadget
2021-05-30 13:01   ` [PATCH v2 1/2] [GSOC] ref-filter: add obj-type check in grab contents ZheNing Hu via GitGitGadget
2021-05-31  5:34     ` Junio C Hamano
2021-05-30 13:01   ` [PATCH v2 2/2] [GSOC] ref-filter: add %(raw) atom ZheNing Hu via GitGitGadget
2021-05-31  0:44     ` Junio C Hamano
2021-05-31 14:35       ` ZheNing Hu
2021-06-01  9:54         ` Junio C Hamano
2021-06-01 11:05           ` ZheNing Hu
2021-05-31  4:04     ` Junio C Hamano
2021-05-31 14:40       ` ZheNing Hu
2021-06-01  8:54         ` Junio C Hamano
2021-06-01 11:00           ` ZheNing Hu
2021-06-01 13:48             ` Johannes Schindelin
2021-05-31  4:10     ` Junio C Hamano
2021-05-31 15:41     ` Felipe Contreras
2021-06-01 10:37       ` ZheNing Hu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAOLTT8SLLgZnF0SV2FPPBJkB=ybeh8mamTPqc-M6CXQeepooOQ@mail.gmail.com' \
    --to=adlternative@gmail.com \
    --cc=bagasdotme@gmail.com \
    --cc=christian.couder@gmail.com \
    --cc=felipe.contreras@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=gitster@pobox.com \
    --cc=hariom18599@gmail.com \
    --cc=karthik.188@gmail.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.