From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE, SPF_PASS autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8FF15C2B9F7 for ; Fri, 28 May 2021 15:05:23 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 62E636103E for ; Fri, 28 May 2021 15:05:23 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S236456AbhE1PGu (ORCPT ); Fri, 28 May 2021 11:06:50 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41286 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235683AbhE1PGu (ORCPT ); Fri, 28 May 2021 11:06:50 -0400 Received: from mail-il1-x12b.google.com (mail-il1-x12b.google.com [IPv6:2607:f8b0:4864:20::12b]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 3082BC061574 for ; Fri, 28 May 2021 08:05:15 -0700 (PDT) Received: by mail-il1-x12b.google.com with SMTP id o10so3537597ilm.13 for ; Fri, 28 May 2021 08:05:15 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc:content-transfer-encoding; bh=v5xkiHrQk5VWPkAvWKvYHPSpsSQrxsyD4tiHrGyMTXY=; b=MT0Qc/8014mV7P93EBf+YEGHACvqOZ9INMGP27mLZW9Z8S3VSC2/3VVWFEOMCiD7IB VygQ5jRXTs77wGiZHSalLulWrdo6R5oXbXMySL3vT+y0l7ZX7KUtIBWboNanWoUZiQKx Zd01OMHh2T9e2cz5lDMw9FWaiGOQ5F4+hV62X6pddsN3qC07HmnVFcHEE5T2SCSE0e8Y VlyEOBfASh2SUR01Zg366YSwfY5d11CRm3VPzcli8syWzfR8MhdR2R2hN4mzosCxVjO2 y4x6MDtMS/x/jxyrtcDKLAASl7ugfjmZU/pu+kgeR9SZJmOxCt+krymVwMPRG2QWMnhO SNlQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc:content-transfer-encoding; bh=v5xkiHrQk5VWPkAvWKvYHPSpsSQrxsyD4tiHrGyMTXY=; b=XO35Q/rE4ANAEqYPBZEKEJcGXC683GkU5TOfdu4HocVFuYbhweO8Qc1KIniUGgWNF1 ZqXUW2OrJvMDpy0uraK4q9rS1HQO0EPWNq5i8F4MsInuzvEYb1NFdANgGj+nNAHNsI8l nGINR86bL65nGeLxUO/OMssAEnMuBYNginLBrxeWNcwMljkaLq8pAo++BaxddWmz4z/q 88fFModIKTBeIJUDv1s5Vwp9/2FImMHCB7q9Xhfv11CufwVSoBevBOc2tJXklXXNk3nV kau2x38/8yYzJ8FZ9cfVjLZLWXztKEvpDy3Wm1h/uiO2EkV6JXkUX+uXmT3JivaGWDDz i2RA== X-Gm-Message-State: AOAM533avnDvht7zCmIa1XgTNBHPNiei8USuz+Hy/8OVJHccZHFP/PYq Q7p9g5Hfzmduzay3blypZK8pgHlQ1X9PxEJXkGg= X-Google-Smtp-Source: ABdhPJxRjIAxAdbEQaUUPjG0ONa2oJmMM6L99APo3AeXNqP3cemnJk36j96MebN3+qsHCSvEl46S20h5vCXpPb0woBs= X-Received: by 2002:a05:6e02:1302:: with SMTP id g2mr7289240ilr.259.1622214314384; Fri, 28 May 2021 08:05:14 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: ZheNing Hu Date: Fri, 28 May 2021 23:04:59 +0800 Message-ID: Subject: Re: [PATCH 1/2] [GSOC] ref-filter: add %(raw) atom To: Junio C Hamano Cc: ZheNing Hu via GitGitGadget , Git List , Christian Couder , Hariom Verma , Karthik Nayak , Felipe Contreras , Bagas Sanjaya , Jeff King Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Junio C Hamano =E4=BA=8E2021=E5=B9=B45=E6=9C=8828=E6=97= =A5=E5=91=A8=E4=BA=94 =E4=B8=8A=E5=8D=8811:04=E5=86=99=E9=81=93=EF=BC=9A > > "ZheNing Hu via GitGitGadget" writes: > > > The raw data of blob, tree objects may contain '\0', but most of > > the logic in `ref-filter` depands on the output of the atom being > > a structured string (end with '\0'). > > Text, yes, string is also OK. But structured? Probably not. > > ... being text (specifically, no embedded NULs in it). > OK. > > Beyond, `--format=3D%(raw)` should not combine with `--python`, `--shel= l`, > > `--tcl`, `--perl` because if our binary raw data is passed to a variabl= e > > in the host language, the host languages may cause escape errors. > > OK. I think at least --perl and possibly --python should be able to > express NULs in the "string" type we use from the host language, but > I am perfectly fine with the decision to leave it to later updates. > Yes, for the time being, unified processing -- will be easier. > > +Note that `--format=3D%(raw)` should not combine with `--python`, `--s= hell`, `--tcl`, > > "should not combine" -> "cannot be used" would make it read more > naturally (ditto for the phrase used in the proposed log message). > OK, so the wording is better. > > > > struct atom_value { > > const char *s; > > + size_t s_size; > > OK, so everything used to be a C-string that cannot hold NULs in it, > but now it is a counted string. Good. > This suddenly reminded me of strbuf... I don't know if it is worth replacing all s, s_size with strbuf. > > @@ -588,6 +607,10 @@ static int parse_ref_filter_atom(const struct ref_= format *format, > > return strbuf_addf_ret(err, -1, _("malformed field name: = %.*s"), > > (int)(ep-atom), atom); > > > > + if (format->quote_style && starts_with(sp, "raw")) > > + return strbuf_addf_ret(err, -1, _("--format=3D%.*s should= not combine with" > > + "--python, --shell, --tcl, --perl"), (int= )(ep-atom), atom); > > Don't we want to allow "raw:size" that would be a plain text? > I am not sure if this check belongs here in the first place. > Shouldn't it be done in raw_atom_parser() instead? > You are right: "raw:size" should be keep, but I can't this check to raw_atom_parser(), becase "if I move it to raw_atom_parser(), when we use: `git ref-filter --format=3D%raw --sort=3Draw --python` Git will continue to run instead of die because parse_sorting_atom() will use a dummy ref_format and don't remember --language details, next time format_ref_array_item() will reuse the used_atom entry of sorting atom in parse_ref_filter_atom(), this will skip the check in raw_atom_parser()." > Another idea is to teach a more generic rule to quote_formatting() > to detect NULs in v->s[0..v->s_size] at runtime and barf, i.e. a > plain-text blob object can be used with "--shell --format=3D%(raw)" > just fine. > The cost of such a check is not small. Maybe can add an option such as "--only-text" to do it. > > @@ -652,11 +675,14 @@ static int parse_ref_filter_atom(const struct ref= _format *format, > > return at; > > } > > > > -static void quote_formatting(struct strbuf *s, const char *str, int qu= ote_style) > > +static void quote_formatting(struct strbuf *s, const char *str, size_t= len, int quote_style) > > { > > switch (quote_style) { > > case QUOTE_NONE: > > - strbuf_addstr(s, str); > > + if (len !=3D ATOM_VALUE_S_SIZE_INIT) > > + strbuf_add(s, str, len); > > + else > > + strbuf_addstr(s, str); > > It probably is a good idea to invent a C preprocessor macro for a > named constant to be used when a structure is initialized, but it > would be easier to read if the rule is "len field is negative when > the value is a C-string", e.g. > > if (len < 0) > I do not recognize such an approach because we are deal with "size_t s_size", if (len < 0) will never be established. I use -1 is because it's equal to 18446744073709551615 and it's impossible to have such a large file in Git. > > @@ -785,14 +814,16 @@ static int if_atom_handler(struct atom_value *ato= mv, struct ref_formatting_state > > return 0; > > } > > > > -static int is_empty(const char *s) > > +static int is_empty(struct strbuf *buf) > > { > > - while (*s !=3D '\0') { > > - if (!isspace(*s)) > > - return 0; > > + const char *s =3D buf->buf; > > + size_t cur_len =3D 0; > > + > > + while ((cur_len !=3D buf->len) && (isspace(*s) || *s =3D=3D '\0')= ) { > > s++; > > + cur_len++; > > } > > - return 1; > > + return cur_len =3D=3D buf->len; > > } > > Assuming that we do want to treat NULs the same way as whitespaces, > the updated code works as intended, which is good. But I have no > reason to support that design decision. I do not have a strong > reason to support a design decision that goes the opposite way to > treat a NUL just like we treat an 'X', but at least I can understand > it (i.e. "because we have no reason to special case NUL any more > than 'X' when trying to see if a buffer is 'empty', we don't"). > This code on the other hand must be supported with "because we need > to special case NUL for such and such reasons for the purpose of > determining if a buffer is 'empty', we treat them the same way as > whitespaces". > Something like "\0abc", from the perspective of the string, it is empty; from the perspective of the memory, it is not empty; I don't know any absolutely good solutions here. > Can the change in this commit violate the invariant that > if_then_else->str cannot be NULL, which seems to have been the case > forever as we see an unchecked strcmp() done in the original? > > If so, perhaps you can check the condition upfront, where you > compute str_len above, e.g. > > if (!if_then_else->str) { > if (if_then_else->cmp_status =3D=3D COMPARE_EQUAL || > if_then_else->cmp_status =3D=3D COMPARE_UNEQUAL) > BUG(...); > } else > str_len =3D strlen(...); > > If not, then I do not see the point of adding this (and later) check > with BUG to this code. > > Or is the invariant that .str must not be NULL could have been > violated without this patch (i.e. the original was buggy in running > strcmp() on .str without checking)? If so, please make it a separate > preliminary change to add such an assert. > The BUG() here actually acts as an "assert()". ".str must not be NULL" is right, it point to "xxx" in "%(if:equals=3Dxxx)", so it seems that these BU= G() are somewhat redundant, I will remove them. > > /* See grab_values */ > > -static void grab_sub_body_contents(struct atom_value *val, int deref, = void *buf) > > +static void grab_raw_data(struct atom_value *val, int deref, void *buf= , unsigned long buf_size, struct object *obj) > > { > > int i; > > const char *subpos =3D NULL, *bodypos =3D NULL, *sigpos =3D NULL; > > @@ -1307,10 +1349,22 @@ static void grab_sub_body_contents(struct atom_= value *val, int deref, void *buf) > > continue; > > if (deref) > > name++; > > - if (strcmp(name, "body") && > > - !starts_with(name, "subject") && > > - !starts_with(name, "trailers") && > > - !starts_with(name, "contents")) > > + > > + if (starts_with(name, "raw")) { > > + if (atom->u.raw_data.option =3D=3D RAW_BARE) { > > + v->s =3D xmemdupz(buf, buf_size); > > + v->s_size =3D buf_size; > > + } else if (atom->u.raw_data.option =3D=3D RAW_LEN= GTH) > > + v->s =3D xstrfmt("%"PRIuMAX, (uintmax_t)b= uf_size); > > + continue; > > + } > > I can understand that "raw[:]" handling has been inserted > above the existing "from here on, we only deal with log message > components" check. But > > > + if ((obj->type !=3D OBJ_TAG && > > + obj->type !=3D OBJ_COMMIT) || > > I do not see why these new conditions are added. The change is not > justified in the proposed log message, the original did not need > these conditions, and this does not concern the primary point of the > change, which is to start supporting %(raw[: