All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] ci: disallow directional formatting
@ 2021-11-02 12:58 Johannes Schindelin via GitGitGadget
  2021-11-02 15:01 ` Ævar Arnfjörð Bjarmason
  2021-11-03 12:23 ` [PATCH v2] " Johannes Schindelin via GitGitGadget
  0 siblings, 2 replies; 18+ messages in thread
From: Johannes Schindelin via GitGitGadget @ 2021-11-02 12:58 UTC (permalink / raw)
  To: git; +Cc: Johannes Schindelin, Johannes Schindelin

From: Johannes Schindelin <johannes.schindelin@gmx.de>

As described in https://trojansource.codes/trojan-source.pdf, it is
possible to abuse directional formatting (a feature of Unicode) to
deceive human readers into interpreting code differently from compilers.

It is highly unlikely that Git's source code wants to contain such
directional formatting in the first place, so let's disallow it.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
    ci: disallow directional formatting
    
    I just stumbled over
    https://siliconangle.com/2021/11/01/trojan-source-technique-can-inject-malware-source-code-without-detection/,
    which details an interesting social-engineering attack: it uses
    directional formatting in source code to pretend to human readers that
    the code does something different than it actually does.
    
    It is highly unlikely that Git's source code wants to contain such
    directional formatting in the first place, so let's disallow it.
    
    Technically, this is not exactly -rc material, but the paper was just
    published, and I want us to be safe.

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1071%2Fdscho%2Fcheck-for-utf-8-directional-formatting-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1071/dscho/check-for-utf-8-directional-formatting-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1071

 .github/workflows/main.yml | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
index 6ed6a9e8076..7b4b4df03c3 100644
--- a/.github/workflows/main.yml
+++ b/.github/workflows/main.yml
@@ -289,6 +289,13 @@ jobs:
     - uses: actions/checkout@v2
     - run: ci/install-dependencies.sh
     - run: ci/run-static-analysis.sh
+    - name: disallow Unicode directional formatting
+      run: |
+        # Use UTF-8-aware `printf` to feed a byte pattern to non-UTF-8-aware `git grep`
+        # (Ubuntu's `git grep` is compiled without support for libpcre, otherwise we
+        # could use `git grep -P` with the `\u` syntax).
+        ! LANG=C git grep -Il "$(LANG=C.UTF-8 printf \
+          '\\(\u202a\\|\u202b\\|\u202c\\|\u202d\\|\u202e\\|\u2066\\|\u2067\\|\u2068\\|\u2069\\)')"
   sparse:
     needs: ci-config
     if: needs.ci-config.outputs.enabled == 'yes'

base-commit: 0cddd84c9f3e9c3d793ec93034ef679335f35e49
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] ci: disallow directional formatting
  2021-11-02 12:58 [PATCH] ci: disallow directional formatting Johannes Schindelin via GitGitGadget
@ 2021-11-02 15:01 ` Ævar Arnfjörð Bjarmason
  2021-11-02 15:48   ` Taylor Blau
  2021-11-03 12:23 ` [PATCH v2] " Johannes Schindelin via GitGitGadget
  1 sibling, 1 reply; 18+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-11-02 15:01 UTC (permalink / raw)
  To: Johannes Schindelin via GitGitGadget; +Cc: git, Johannes Schindelin


On Tue, Nov 02 2021, Johannes Schindelin via GitGitGadget wrote:

> From: Johannes Schindelin <johannes.schindelin@gmx.de>
>
> As described in https://trojansource.codes/trojan-source.pdf, it is
> possible to abuse directional formatting (a feature of Unicode) to
> deceive human readers into interpreting code differently from compilers.
>
> It is highly unlikely that Git's source code wants to contain such
> directional formatting in the first place, so let's disallow it.
>
> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> ---
>     ci: disallow directional formatting
>     
>     I just stumbled over
>     https://siliconangle.com/2021/11/01/trojan-source-technique-can-inject-malware-source-code-without-detection/,
>     which details an interesting social-engineering attack: it uses
>     directional formatting in source code to pretend to human readers that
>     the code does something different than it actually does.
>     
>     It is highly unlikely that Git's source code wants to contain such
>     directional formatting in the first place, so let's disallow it.
>     
>     Technically, this is not exactly -rc material, but the paper was just
>     published, and I want us to be safe.

There's a parallel discussion about doing something to detect this in
"git am", which for the git project seems like a better place to put
this.

> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1071%2Fdscho%2Fcheck-for-utf-8-directional-formatting-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1071/dscho/check-for-utf-8-directional-formatting-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/1071
>
>  .github/workflows/main.yml | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
> index 6ed6a9e8076..7b4b4df03c3 100644
> --- a/.github/workflows/main.yml
> +++ b/.github/workflows/main.yml
> @@ -289,6 +289,13 @@ jobs:
>      - uses: actions/checkout@v2
>      - run: ci/install-dependencies.sh
>      - run: ci/run-static-analysis.sh
> +    - name: disallow Unicode directional formatting
> +      run: |
> +        # Use UTF-8-aware `printf` to feed a byte pattern to non-UTF-8-aware `git grep`
> +        # (Ubuntu's `git grep` is compiled without support for libpcre, otherwise we
> +        # could use `git grep -P` with the `\u` syntax).
> +        ! LANG=C git grep -Il "$(LANG=C.UTF-8 printf \
> +          '\\(\u202a\\|\u202b\\|\u202c\\|\u202d\\|\u202e\\|\u2066\\|\u2067\\|\u2068\\|\u2069\\)')"
>    sparse:
>      needs: ci-config
>      if: needs.ci-config.outputs.enabled == 'yes'
>
> base-commit: 0cddd84c9f3e9c3d793ec93034ef679335f35e49

It would be easier to maintain this if were added to
ci/run-static-analysis.sh instead, where we have some similar tests, and
if it lives there we could do away with the double-escaping, then it can
also be run manually.

Also, can't we just pipe "git ls-files -z" into "perl -0ne" here and use
its unconditional support for e.g. unicode properties in regexes.

How will this change impact RTL languages being added to po/?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] ci: disallow directional formatting
  2021-11-02 15:01 ` Ævar Arnfjörð Bjarmason
@ 2021-11-02 15:48   ` Taylor Blau
  2021-11-02 16:03     ` Ævar Arnfjörð Bjarmason
                       ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Taylor Blau @ 2021-11-02 15:48 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Johannes Schindelin via GitGitGadget, git, Johannes Schindelin

On Tue, Nov 02, 2021 at 04:01:57PM +0100, Ævar Arnfjörð Bjarmason wrote:
> There's a parallel discussion about doing something to detect this in
> "git am", which for the git project seems like a better place to put
> this.

I don't think that one impacts the other necessarily. Having `git am`
guard against this would probably be sufficient to protect Junio
accidentally apply something containing directional formatting to his
tree unknowingly.

But the idea that we rely on the import mechanism to protect against
this doesn't sit well with me. Ultimately, we should be relying on a
static check like below to ensure that directional formatting hasn't
entered the tree by any mechanism (not just 'git am').

> > diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
> > index 6ed6a9e8076..7b4b4df03c3 100644
> > --- a/.github/workflows/main.yml
> > +++ b/.github/workflows/main.yml
> > @@ -289,6 +289,13 @@ jobs:
> >      - uses: actions/checkout@v2
> >      - run: ci/install-dependencies.sh
> >      - run: ci/run-static-analysis.sh
> > +    - name: disallow Unicode directional formatting
> > +      run: |
> > +        # Use UTF-8-aware `printf` to feed a byte pattern to non-UTF-8-aware `git grep`
> > +        # (Ubuntu's `git grep` is compiled without support for libpcre, otherwise we
> > +        # could use `git grep -P` with the `\u` syntax).
> > +        ! LANG=C git grep -Il "$(LANG=C.UTF-8 printf \
> > +          '\\(\u202a\\|\u202b\\|\u202c\\|\u202d\\|\u202e\\|\u2066\\|\u2067\\|\u2068\\|\u2069\\)')"
> >    sparse:
> >      needs: ci-config
> >      if: needs.ci-config.outputs.enabled == 'yes'
> >
> > base-commit: 0cddd84c9f3e9c3d793ec93034ef679335f35e49
>
> It would be easier to maintain this if were added to
> ci/run-static-analysis.sh instead, where we have some similar tests, and
> if it lives there we could do away with the double-escaping, then it can
> also be run manually.
>
> Also, can't we just pipe "git ls-files -z" into "perl -0ne" here and use
> its unconditional support for e.g. unicode properties in regexes.

I agree that the double-escaping is ugly. I think that this would be
easier to maintain if it lived in ci/run-static-analysis.sh or its own
script like ci/check-directional-formatting.sh.

And yes, constructing a byte pattern is a little odd as well, but I
think that it's the best you can do if you're limited to running 'git
grep' without libpcre. I wondered if we could depend on perl being
around during CI, but as far as I know we can since install Perl modules
in ci/install-dependencies.sh and use Perl extensively through the test
suite.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] ci: disallow directional formatting
  2021-11-02 15:48   ` Taylor Blau
@ 2021-11-02 16:03     ` Ævar Arnfjörð Bjarmason
  2021-11-02 16:12     ` Johannes Schindelin
  2021-11-03 17:20     ` Junio C Hamano
  2 siblings, 0 replies; 18+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-11-02 16:03 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Johannes Schindelin via GitGitGadget, git, Johannes Schindelin


On Tue, Nov 02 2021, Taylor Blau wrote:

> On Tue, Nov 02, 2021 at 04:01:57PM +0100, Ævar Arnfjörð Bjarmason wrote:
>> There's a parallel discussion about doing something to detect this in
>> "git am", which for the git project seems like a better place to put
>> this.
>
> I don't think that one impacts the other necessarily. Having `git am`
> guard against this would probably be sufficient to protect Junio
> accidentally apply something containing directional formatting to his
> tree unknowingly.
>
> But the idea that we rely on the import mechanism to protect against
> this doesn't sit well with me. Ultimately, we should be relying on a
> static check like below to ensure that directional formatting hasn't
> entered the tree by any mechanism (not just 'git am').
>
>> > diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
>> > index 6ed6a9e8076..7b4b4df03c3 100644
>> > --- a/.github/workflows/main.yml
>> > +++ b/.github/workflows/main.yml
>> > @@ -289,6 +289,13 @@ jobs:
>> >      - uses: actions/checkout@v2
>> >      - run: ci/install-dependencies.sh
>> >      - run: ci/run-static-analysis.sh
>> > +    - name: disallow Unicode directional formatting
>> > +      run: |
>> > +        # Use UTF-8-aware `printf` to feed a byte pattern to non-UTF-8-aware `git grep`
>> > +        # (Ubuntu's `git grep` is compiled without support for libpcre, otherwise we
>> > +        # could use `git grep -P` with the `\u` syntax).
>> > +        ! LANG=C git grep -Il "$(LANG=C.UTF-8 printf \
>> > +          '\\(\u202a\\|\u202b\\|\u202c\\|\u202d\\|\u202e\\|\u2066\\|\u2067\\|\u2068\\|\u2069\\)')"
>> >    sparse:
>> >      needs: ci-config
>> >      if: needs.ci-config.outputs.enabled == 'yes'
>> >
>> > base-commit: 0cddd84c9f3e9c3d793ec93034ef679335f35e49
>>
>> It would be easier to maintain this if were added to
>> ci/run-static-analysis.sh instead, where we have some similar tests, and
>> if it lives there we could do away with the double-escaping, then it can
>> also be run manually.
>>
>> Also, can't we just pipe "git ls-files -z" into "perl -0ne" here and use
>> its unconditional support for e.g. unicode properties in regexes.
>
> I agree that the double-escaping is ugly. I think that this would be
> easier to maintain if it lived in ci/run-static-analysis.sh or its own
> script like ci/check-directional-formatting.sh.
>
> And yes, constructing a byte pattern is a little odd as well, but I
> think that it's the best you can do if you're limited to running 'git
> grep' without libpcre. I wondered if we could depend on perl being
> around during CI, but as far as I know we can since install Perl modules
> in ci/install-dependencies.sh and use Perl extensively through the test
> suite.

We can hard depend on anything listed in [1][2], in this case there's a
Perl 5.30 available.

We don't need any Perl modules, it's all Perl-native regex features and
a small for-loop.

On the topic at large I wonder how much we need to worry about this at
all, seems like somewher between case of "the anglosphere discovers
scary foreign characters in Unicode again" and "security researcher
creates scary landing page for well-known Unicode edge-case"[3] :)

In this particular case the test cases involved seem extremely contrived
and unlikely to be something we'd end up with in our code in any case,
even in a case where all the reviewers would be fooled by this method in
itself.

I.e. since you can't use these to "fold" lines in the eyes of a human
viewer it's all rather contrived cases where comments and valid C code
lives on the same line, e.g.:

    /*‮ } ⁦if (isAdmin)⁩ ⁦ begin admins only */
        printf("You are an admin.\n");
    /* end admins only ‮ { ⁦*/

Which at least in Emacs is highlighted the right way, which is the first
major clue, and having tested some of these just now on AIX I've got xlc
whining about some cases that gcc/clang will happily pass by, so our
tooling at large will probably catch these anyway...

1. https://github.com/actions/virtual-environments
2. https://github.com/actions/virtual-environments/blob/main/images/linux/Ubuntu2004-README.md
3. https://www.trojansource.codes/

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] ci: disallow directional formatting
  2021-11-02 15:48   ` Taylor Blau
  2021-11-02 16:03     ` Ævar Arnfjörð Bjarmason
@ 2021-11-02 16:12     ` Johannes Schindelin
  2021-11-02 16:38       ` Ævar Arnfjörð Bjarmason
  2021-11-03 17:20     ` Junio C Hamano
  2 siblings, 1 reply; 18+ messages in thread
From: Johannes Schindelin @ 2021-11-02 16:12 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Ævar Arnfjörð Bjarmason,
	Johannes Schindelin via GitGitGadget, git

[-- Attachment #1: Type: text/plain, Size: 3124 bytes --]

Hi Taylor,

On Tue, 2 Nov 2021, Taylor Blau wrote:

> On Tue, Nov 02, 2021 at 04:01:57PM +0100, Ævar Arnfjörð Bjarmason wrote:
> > There's a parallel discussion about doing something to detect this in
> > "git am", which for the git project seems like a better place to put
> > this.
>
> I don't think that one impacts the other necessarily. Having `git am`
> guard against this would probably be sufficient to protect Junio
> accidentally apply something containing directional formatting to his
> tree unknowingly.
>
> But the idea that we rely on the import mechanism to protect against
> this doesn't sit well with me. Ultimately, we should be relying on a
> static check like below to ensure that directional formatting hasn't
> entered the tree by any mechanism (not just 'git am').

Yep, the `git am` change and the CI change are addressing two very
different concerns.

> > > diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
> > > index 6ed6a9e8076..7b4b4df03c3 100644
> > > --- a/.github/workflows/main.yml
> > > +++ b/.github/workflows/main.yml
> > > @@ -289,6 +289,13 @@ jobs:
> > >      - uses: actions/checkout@v2
> > >      - run: ci/install-dependencies.sh
> > >      - run: ci/run-static-analysis.sh
> > > +    - name: disallow Unicode directional formatting
> > > +      run: |
> > > +        # Use UTF-8-aware `printf` to feed a byte pattern to non-UTF-8-aware `git grep`
> > > +        # (Ubuntu's `git grep` is compiled without support for libpcre, otherwise we
> > > +        # could use `git grep -P` with the `\u` syntax).
> > > +        ! LANG=C git grep -Il "$(LANG=C.UTF-8 printf \
> > > +          '\\(\u202a\\|\u202b\\|\u202c\\|\u202d\\|\u202e\\|\u2066\\|\u2067\\|\u2068\\|\u2069\\)')"
> > >    sparse:
> > >      needs: ci-config
> > >      if: needs.ci-config.outputs.enabled == 'yes'
> > >
> > > base-commit: 0cddd84c9f3e9c3d793ec93034ef679335f35e49
> >
> > It would be easier to maintain this if were added to
> > ci/run-static-analysis.sh instead, where we have some similar tests, and
> > if it lives there we could do away with the double-escaping, then it can
> > also be run manually.
> >
> > Also, can't we just pipe "git ls-files -z" into "perl -0ne" here and use
> > its unconditional support for e.g. unicode properties in regexes.
>
> I agree that the double-escaping is ugly. I think that this would be
> easier to maintain if it lived in ci/run-static-analysis.sh or its own
> script like ci/check-directional-formatting.sh.

That's a good idea, I will put it into its own script in ci/.

> And yes, constructing a byte pattern is a little odd as well, but I
> think that it's the best you can do if you're limited to running 'git
> grep' without libpcre. I wondered if we could depend on perl being
> around during CI, but as far as I know we can since install Perl modules
> in ci/install-dependencies.sh and use Perl extensively through the test
> suite.

Perl alone won't fix anything. A crucial part of the `git grep` invocation
is the `-I` option: ignore binary files.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] ci: disallow directional formatting
  2021-11-02 16:12     ` Johannes Schindelin
@ 2021-11-02 16:38       ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 18+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-11-02 16:38 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Taylor Blau, Johannes Schindelin via GitGitGadget, git


On Tue, Nov 02 2021, Johannes Schindelin wrote:

> Hi Taylor,
>
> On Tue, 2 Nov 2021, Taylor Blau wrote:
>
>> On Tue, Nov 02, 2021 at 04:01:57PM +0100, Ævar Arnfjörð Bjarmason wrote:
>> > There's a parallel discussion about doing something to detect this in
>> > "git am", which for the git project seems like a better place to put
>> > this.
>>
>> I don't think that one impacts the other necessarily. Having `git am`
>> guard against this would probably be sufficient to protect Junio
>> accidentally apply something containing directional formatting to his
>> tree unknowingly.
>>
>> But the idea that we rely on the import mechanism to protect against
>> this doesn't sit well with me. Ultimately, we should be relying on a
>> static check like below to ensure that directional formatting hasn't
>> entered the tree by any mechanism (not just 'git am').
>
> Yep, the `git am` change and the CI change are addressing two very
> different concerns.
>
>> > > diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
>> > > index 6ed6a9e8076..7b4b4df03c3 100644
>> > > --- a/.github/workflows/main.yml
>> > > +++ b/.github/workflows/main.yml
>> > > @@ -289,6 +289,13 @@ jobs:
>> > >      - uses: actions/checkout@v2
>> > >      - run: ci/install-dependencies.sh
>> > >      - run: ci/run-static-analysis.sh
>> > > +    - name: disallow Unicode directional formatting
>> > > +      run: |
>> > > +        # Use UTF-8-aware `printf` to feed a byte pattern to non-UTF-8-aware `git grep`
>> > > +        # (Ubuntu's `git grep` is compiled without support for libpcre, otherwise we
>> > > +        # could use `git grep -P` with the `\u` syntax).
>> > > +        ! LANG=C git grep -Il "$(LANG=C.UTF-8 printf \
>> > > +          '\\(\u202a\\|\u202b\\|\u202c\\|\u202d\\|\u202e\\|\u2066\\|\u2067\\|\u2068\\|\u2069\\)')"
>> > >    sparse:
>> > >      needs: ci-config
>> > >      if: needs.ci-config.outputs.enabled == 'yes'
>> > >
>> > > base-commit: 0cddd84c9f3e9c3d793ec93034ef679335f35e49
>> >
>> > It would be easier to maintain this if were added to
>> > ci/run-static-analysis.sh instead, where we have some similar tests, and
>> > if it lives there we could do away with the double-escaping, then it can
>> > also be run manually.
>> >
>> > Also, can't we just pipe "git ls-files -z" into "perl -0ne" here and use
>> > its unconditional support for e.g. unicode properties in regexes.
>>
>> I agree that the double-escaping is ugly. I think that this would be
>> easier to maintain if it lived in ci/run-static-analysis.sh or its own
>> script like ci/check-directional-formatting.sh.
>
> That's a good idea, I will put it into its own script in ci/.
>
>> And yes, constructing a byte pattern is a little odd as well, but I
>> think that it's the best you can do if you're limited to running 'git
>> grep' without libpcre. I wondered if we could depend on perl being
>> around during CI, but as far as I know we can since install Perl modules
>> in ci/install-dependencies.sh and use Perl extensively through the test
>> suite.
>
> Perl alone won't fix anything. A crucial part of the `git grep` invocation
> is the `-I` option: ignore binary files.

As noted in [1] I'm rather skeptical of us needing this at all, but why
is "-I" needed over asking "git grep" or "git ls-files" to exclude
binary files?

    git ls-files ':!(attr:binary)'

I.e. why do ad-hoc binary detection on the fly in git.git when we should
already be marking what files are binary?

If this check shows a false positive due to a binary file isn't that a
good thing (sans LTR issues I mentioned upthread), i.e. we should be
adding that file to .gitattributes, no?

In any case, I meant that the match on the RHS might be easier with
Perl, in such a pipeline you could always farm out the binary detection
to GNU grep on the LHS or whatever. Maybe you don't want to do it with
Perl, but using the -I option seems like a bad idea in either case.

1. https://lore.kernel.org/git/211102.86lf261q2e.gmgdl@evledraar.gmail.com/

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2] ci: disallow directional formatting
  2021-11-02 12:58 [PATCH] ci: disallow directional formatting Johannes Schindelin via GitGitGadget
  2021-11-02 15:01 ` Ævar Arnfjörð Bjarmason
@ 2021-11-03 12:23 ` Johannes Schindelin via GitGitGadget
  2021-11-03 16:36   ` Ævar Arnfjörð Bjarmason
                     ` (3 more replies)
  1 sibling, 4 replies; 18+ messages in thread
From: Johannes Schindelin via GitGitGadget @ 2021-11-03 12:23 UTC (permalink / raw)
  To: git
  Cc: Ævar Arnfjörð Bjarmason, Taylor Blau,
	Johannes Schindelin, Johannes Schindelin

From: Johannes Schindelin <johannes.schindelin@gmx.de>

As described in https://trojansource.codes/trojan-source.pdf, it is
possible to abuse directional formatting (a feature of Unicode) to
deceive human readers into interpreting code differently from compilers.

For example, an "if ()" expression could be enclosed in a comment, but
rendered as if it was outside of that comment. In effect, this could
fool a reviewer into misinterpreting the code flow as benign when it is
not.

It is highly unlikely that Git's source code wants to contain such
directional formatting in the first place, so let's just disallow it.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
    ci: disallow directional formatting
    
    I just stumbled over
    https://siliconangle.com/2021/11/01/trojan-source-technique-can-inject-malware-source-code-without-detection/,
    which details an interesting social-engineering attack: it uses
    directional formatting in source code to pretend to human readers that
    the code does something different than it actually does.
    
    It is highly unlikely that Git's source code wants to contain such
    directional formatting in the first place, so let's disallow it.
    
    Technically, this is not exactly -rc material, but the paper was just
    published, and I want us to be safe.
    
    Changes since v1:
    
     * The code was moved into a script in ci/.
     * We use git ls-files now to exclude files marked as binary in the Git
       attributes.

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1071%2Fdscho%2Fcheck-for-utf-8-directional-formatting-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1071/dscho/check-for-utf-8-directional-formatting-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/1071

Range-diff vs v1:

 1:  6a1661fd887 ! 1:  bbf963695ba ci: disallow directional formatting
     @@ Commit message
          possible to abuse directional formatting (a feature of Unicode) to
          deceive human readers into interpreting code differently from compilers.
      
     +    For example, an "if ()" expression could be enclosed in a comment, but
     +    rendered as if it was outside of that comment. In effect, this could
     +    fool a reviewer into misinterpreting the code flow as benign when it is
     +    not.
     +
          It is highly unlikely that Git's source code wants to contain such
     -    directional formatting in the first place, so let's disallow it.
     +    directional formatting in the first place, so let's just disallow it.
      
          Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
      
     @@ .github/workflows/main.yml: jobs:
           - uses: actions/checkout@v2
           - run: ci/install-dependencies.sh
           - run: ci/run-static-analysis.sh
     -+    - name: disallow Unicode directional formatting
     -+      run: |
     -+        # Use UTF-8-aware `printf` to feed a byte pattern to non-UTF-8-aware `git grep`
     -+        # (Ubuntu's `git grep` is compiled without support for libpcre, otherwise we
     -+        # could use `git grep -P` with the `\u` syntax).
     -+        ! LANG=C git grep -Il "$(LANG=C.UTF-8 printf \
     -+          '\\(\u202a\\|\u202b\\|\u202c\\|\u202d\\|\u202e\\|\u2066\\|\u2067\\|\u2068\\|\u2069\\)')"
     ++    - run: ci/check-directional-formatting.sh
         sparse:
           needs: ci-config
           if: needs.ci-config.outputs.enabled == 'yes'
     +
     + ## ci/check-directional-formatting.sh (new) ##
     +@@
     ++#!/bin/bash
     ++
     ++# This script verifies that the non-binary files tracked in the Git index do
     ++# not contain any Unicode directional formatting: such formatting could be used
     ++# to deceive reviewers into interpreting code differently from the compiler.
     ++# This is intended to run on an Ubuntu agent in a GitHub workflow.
     ++#
     ++# `git grep` as well as GNU grep do not handle `\u` as a way to specify UTF-8.
     ++# A PCRE-enabled `git grep` would handle `\u` as desired, but Ubuntu does
     ++# not build its `git` packages with PCRE support.
     ++#
     ++# To work around that, we use `printf` to produce the pattern as a byte
     ++# sequence, and then feed that to `git grep` as a byte sequence (setting
     ++# `LC_CTYPE` to make sure that the arguments are interpreted as intended).
     ++#
     ++# Note: we need to use Bash here because its `printf` interprets `\uNNNN` as
     ++# UTF-8 code points, as desired. Running this script through Ubuntu's `dash`,
     ++# for example, would use a `printf` that does not understand that syntax.
     ++
     ++# U+202a..U+2a2e: LRE, RLE, PDF, LRO and RLO
     ++# U+2066..U+2069: LRI, RLI, FSI and PDI
     ++regex='(\u202a|\u202b|\u202c|\u202d|\u202e|\u2066|\u2067|\u2068|\u2069)'
     ++
     ++! git ls-files -z ':(attr:!binary)' |
     ++LC_CTYPE=C xargs -0r git grep -Ele "$(LC_CTYPE=C.UTF-8 printf "$regex")" --


 .github/workflows/main.yml         |  1 +
 ci/check-directional-formatting.sh | 25 +++++++++++++++++++++++++
 2 files changed, 26 insertions(+)
 create mode 100755 ci/check-directional-formatting.sh

diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
index 6ed6a9e8076..36b7a0bee38 100644
--- a/.github/workflows/main.yml
+++ b/.github/workflows/main.yml
@@ -289,6 +289,7 @@ jobs:
     - uses: actions/checkout@v2
     - run: ci/install-dependencies.sh
     - run: ci/run-static-analysis.sh
+    - run: ci/check-directional-formatting.sh
   sparse:
     needs: ci-config
     if: needs.ci-config.outputs.enabled == 'yes'
diff --git a/ci/check-directional-formatting.sh b/ci/check-directional-formatting.sh
new file mode 100755
index 00000000000..ab894715cf1
--- /dev/null
+++ b/ci/check-directional-formatting.sh
@@ -0,0 +1,25 @@
+#!/bin/bash
+
+# This script verifies that the non-binary files tracked in the Git index do
+# not contain any Unicode directional formatting: such formatting could be used
+# to deceive reviewers into interpreting code differently from the compiler.
+# This is intended to run on an Ubuntu agent in a GitHub workflow.
+#
+# `git grep` as well as GNU grep do not handle `\u` as a way to specify UTF-8.
+# A PCRE-enabled `git grep` would handle `\u` as desired, but Ubuntu does
+# not build its `git` packages with PCRE support.
+#
+# To work around that, we use `printf` to produce the pattern as a byte
+# sequence, and then feed that to `git grep` as a byte sequence (setting
+# `LC_CTYPE` to make sure that the arguments are interpreted as intended).
+#
+# Note: we need to use Bash here because its `printf` interprets `\uNNNN` as
+# UTF-8 code points, as desired. Running this script through Ubuntu's `dash`,
+# for example, would use a `printf` that does not understand that syntax.
+
+# U+202a..U+2a2e: LRE, RLE, PDF, LRO and RLO
+# U+2066..U+2069: LRI, RLI, FSI and PDI
+regex='(\u202a|\u202b|\u202c|\u202d|\u202e|\u2066|\u2067|\u2068|\u2069)'
+
+! git ls-files -z ':(attr:!binary)' |
+LC_CTYPE=C xargs -0r git grep -Ele "$(LC_CTYPE=C.UTF-8 printf "$regex")" --

base-commit: 0cddd84c9f3e9c3d793ec93034ef679335f35e49
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] ci: disallow directional formatting
  2021-11-03 12:23 ` [PATCH v2] " Johannes Schindelin via GitGitGadget
@ 2021-11-03 16:36   ` Ævar Arnfjörð Bjarmason
  2021-11-03 18:00   ` Junio C Hamano
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 18+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-11-03 16:36 UTC (permalink / raw)
  To: Johannes Schindelin via GitGitGadget
  Cc: git, Taylor Blau, Johannes Schindelin


On Wed, Nov 03 2021, Johannes Schindelin via GitGitGadget wrote:

> From: Johannes Schindelin <johannes.schindelin@gmx.de>
>
> As described in https://trojansource.codes/trojan-source.pdf, it is
> possible to abuse directional formatting (a feature of Unicode) to
> deceive human readers into interpreting code differently from compilers.
>
> For example, an "if ()" expression could be enclosed in a comment, but
> rendered as if it was outside of that comment. In effect, this could
> fool a reviewer into misinterpreting the code flow as benign when it is
> not.
>
> It is highly unlikely that Git's source code wants to contain such
> directional formatting in the first place, so let's just disallow it.
>
> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> ---
>     ci: disallow directional formatting
>     
>     I just stumbled over
>     https://siliconangle.com/2021/11/01/trojan-source-technique-can-inject-malware-source-code-without-detection/,
>     which details an interesting social-engineering attack: it uses
>     directional formatting in source code to pretend to human readers that
>     the code does something different than it actually does.
>     
>     It is highly unlikely that Git's source code wants to contain such
>     directional formatting in the first place, so let's disallow it.
>     
>     Technically, this is not exactly -rc material, but the paper was just
>     published, and I want us to be safe.
>     
>     Changes since v1:
>     
>      * The code was moved into a script in ci/.
>      * We use git ls-files now to exclude files marked as binary in the Git
>        attributes.
>
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1071%2Fdscho%2Fcheck-for-utf-8-directional-formatting-v2
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1071/dscho/check-for-utf-8-directional-formatting-v2
> Pull-Request: https://github.com/gitgitgadget/git/pull/1071
>
> Range-diff vs v1:
>
>  1:  6a1661fd887 ! 1:  bbf963695ba ci: disallow directional formatting
>      @@ Commit message
>           possible to abuse directional formatting (a feature of Unicode) to
>           deceive human readers into interpreting code differently from compilers.
>       
>      +    For example, an "if ()" expression could be enclosed in a comment, but
>      +    rendered as if it was outside of that comment. In effect, this could
>      +    fool a reviewer into misinterpreting the code flow as benign when it is
>      +    not.
>      +
>           It is highly unlikely that Git's source code wants to contain such
>      -    directional formatting in the first place, so let's disallow it.
>      +    directional formatting in the first place, so let's just disallow it.
>       
>           Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
>       
>      @@ .github/workflows/main.yml: jobs:
>            - uses: actions/checkout@v2
>            - run: ci/install-dependencies.sh
>            - run: ci/run-static-analysis.sh
>      -+    - name: disallow Unicode directional formatting
>      -+      run: |
>      -+        # Use UTF-8-aware `printf` to feed a byte pattern to non-UTF-8-aware `git grep`
>      -+        # (Ubuntu's `git grep` is compiled without support for libpcre, otherwise we
>      -+        # could use `git grep -P` with the `\u` syntax).
>      -+        ! LANG=C git grep -Il "$(LANG=C.UTF-8 printf \
>      -+          '\\(\u202a\\|\u202b\\|\u202c\\|\u202d\\|\u202e\\|\u2066\\|\u2067\\|\u2068\\|\u2069\\)')"
>      ++    - run: ci/check-directional-formatting.sh
>          sparse:
>            needs: ci-config
>            if: needs.ci-config.outputs.enabled == 'yes'
>      +
>      + ## ci/check-directional-formatting.sh (new) ##
>      +@@
>      ++#!/bin/bash
>      ++
>      ++# This script verifies that the non-binary files tracked in the Git index do
>      ++# not contain any Unicode directional formatting: such formatting could be used
>      ++# to deceive reviewers into interpreting code differently from the compiler.
>      ++# This is intended to run on an Ubuntu agent in a GitHub workflow.
>      ++#
>      ++# `git grep` as well as GNU grep do not handle `\u` as a way to specify UTF-8.
>      ++# A PCRE-enabled `git grep` would handle `\u` as desired, but Ubuntu does
>      ++# not build its `git` packages with PCRE support.
>      ++#
>      ++# To work around that, we use `printf` to produce the pattern as a byte
>      ++# sequence, and then feed that to `git grep` as a byte sequence (setting
>      ++# `LC_CTYPE` to make sure that the arguments are interpreted as intended).
>      ++#
>      ++# Note: we need to use Bash here because its `printf` interprets `\uNNNN` as
>      ++# UTF-8 code points, as desired. Running this script through Ubuntu's `dash`,
>      ++# for example, would use a `printf` that does not understand that syntax.
>      ++
>      ++# U+202a..U+2a2e: LRE, RLE, PDF, LRO and RLO
>      ++# U+2066..U+2069: LRI, RLI, FSI and PDI
>      ++regex='(\u202a|\u202b|\u202c|\u202d|\u202e|\u2066|\u2067|\u2068|\u2069)'
>      ++
>      ++! git ls-files -z ':(attr:!binary)' |
>      ++LC_CTYPE=C xargs -0r git grep -Ele "$(LC_CTYPE=C.UTF-8 printf "$regex")" --
>
>
>  .github/workflows/main.yml         |  1 +
>  ci/check-directional-formatting.sh | 25 +++++++++++++++++++++++++
>  2 files changed, 26 insertions(+)
>  create mode 100755 ci/check-directional-formatting.sh
>
> diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
> index 6ed6a9e8076..36b7a0bee38 100644
> --- a/.github/workflows/main.yml
> +++ b/.github/workflows/main.yml
> @@ -289,6 +289,7 @@ jobs:
>      - uses: actions/checkout@v2
>      - run: ci/install-dependencies.sh
>      - run: ci/run-static-analysis.sh
> +    - run: ci/check-directional-formatting.sh
>    sparse:
>      needs: ci-config
>      if: needs.ci-config.outputs.enabled == 'yes'
> diff --git a/ci/check-directional-formatting.sh b/ci/check-directional-formatting.sh
> new file mode 100755
> index 00000000000..ab894715cf1
> --- /dev/null
> +++ b/ci/check-directional-formatting.sh
> @@ -0,0 +1,25 @@
> +#!/bin/bash
> +
> +# This script verifies that the non-binary files tracked in the Git index do
> +# not contain any Unicode directional formatting: such formatting could be used
> +# to deceive reviewers into interpreting code differently from the compiler.
> +# This is intended to run on an Ubuntu agent in a GitHub workflow.
> +#
> +# `git grep` as well as GNU grep do not handle `\u` as a way to specify UTF-8.
> +# A PCRE-enabled `git grep` would handle `\u` as desired, but Ubuntu does
> +# not build its `git` packages with PCRE support.

I just took you at your word before, but, no. It does have PCRE. It's
this Ubuntu version:
https://github.com/actions/virtual-environments/blob/main/images/linux/Ubuntu2004-README.md
i.e. "Ubuntu 20.04.3 LTS".

Debian's "git" package builds with libpcre, and Ubuntu's page about it
suggests it does the same: https://packages.ubuntu.com/focal/git

And if you try to push e.g. a grep for "git grep -P 'foo(?=bar)' you'll
find that it works.

What we haven't done in "git grep" is to compile the pattern with
PCRE2_ALT_BSUX, which is probably what you ran into. I.e. it doesn't
grok \u by default, we should probably support that, but you don't need
it.

> +#
> +# To work around that, we use `printf` to produce the pattern as a byte
> +# sequence, and then feed that to `git grep` as a byte sequence (setting
> +# `LC_CTYPE` to make sure that the arguments are interpreted as intended).
> +#
> +# Note: we need to use Bash here because its `printf` interprets `\uNNNN` as
> +# UTF-8 code points, as desired. Running this script through Ubuntu's `dash`,
> +# for example, would use a `printf` that does not understand that syntax.
> +
> +# U+202a..U+2a2e: LRE, RLE, PDF, LRO and RLO
> +# U+2066..U+2069: LRI, RLI, FSI and PDI
> +regex='(\u202a|\u202b|\u202c|\u202d|\u202e|\u2066|\u2067|\u2068|\u2069)'
> +
> +! git ls-files -z ':(attr:!binary)' |
> +LC_CTYPE=C xargs -0r git grep -Ele "$(LC_CTYPE=C.UTF-8 printf "$regex")" --

FWIW with that ls-files suggestion of mine I meant that would make sense
as a way to pipe it into Perl since doing this with a Unicode-aware
regex engine is less painful. I.e. you'd be able to just name the
Unicode range.

But since you didn't go for that surely this whole ls-files -> xargs ->
git grep pipeline isn't needed, and you'd just want:

    ! git -P grep -nE "$(LC_CTYPE=C.UTF-8 printf "$regex")" ':(attr:!binary)'

I added the -n for readability of the output, a matter of taste, and -P
to "git" for ease of ad-hoc testing. A test of that on the researcher's
trojan-source.git repo:
    
    $ ! git -P grep -nE "$(LC_CTYPE=C.UTF-8 printf "$regex")" ':(attr:!binary)'; echo $?
    C#/commenting-out.csx:4:/*‮ } ⁦if (isAdmin)⁩ ⁦ begin admins only */
    C#/commenting-out.csx:6:/* end admins only ‮ { ⁦*/
    C#/stretched-string.csx:4:if (access_level != "user‮ ⁦// Check if admin⁩ ⁦") {
    C++/commenting-out.cpp:5:    /*‮ } ⁦if (isAdmin)⁩ ⁦ begin admins only */
    C++/commenting-out.cpp:7:    /* end admins only ‮ { ⁦*/
    C++/stretched-string.cpp:6:    if (access_level.compare("user‮ ⁦// Check if admin⁩ ⁦")) {
    C/commenting-out.c:6:    /*‮ } ⁦if (isAdmin)⁩ ⁦ begin admins only */
    C/commenting-out.c:8:    /* end admins only ‮ { ⁦*/
    C/early-return.c:4:    /* Say hello; newline⁧ /*/ return 0 ;
    C/stretched-string.c:6:    if (strcmp(access_level, "user‮ ⁦// Check if admin⁩ ⁦")) {
    Go/commenting-out.go:9:    /*‮ } ⁦if (isAdmin)⁩ ⁦ begin admins only */
    Go/commenting-out.go:11:    /* end admins only ‮ { ⁦*/
    Go/stretched-string.go:7:       if accessLevel != "user‮ ⁦// Check if admin⁩ ⁦" {
    Java/CommentingOut.java:5:        /*‮ } ⁦if (isAdmin)⁩ ⁦ begin admins only */
    Java/CommentingOut.java:7:        /* end admins only ‮ { ⁦*/
    Java/StretchedString.java:5:        if (accessLevel != "user‮ ⁦// Check if admin⁩ ⁦") {
    JavaScript/commenting-out.js:4:/*‮ } ⁦if (isAdmin)⁩ ⁦ begin admins only */
    JavaScript/commenting-out.js:6:/* end admins only ‮ { ⁦*/
    JavaScript/stretched-string.js:4:if (accessLevel != "user‮ ⁦// Check if admin⁩ ⁦") {
    Python/commenting-out.py:4:if access_level != 'none‮⁦': # Check if admin ⁩⁦' and access_level != 'user
    Python/early-return.py:5:    ''' Subtract funds from bank account then ⁧''' ;return
    Rust/commenting-out.rs:3:    /*‮ } ⁦if is_admin⁩ ⁦ begin admins only */
    Rust/commenting-out.rs:5:    /* end admins only ‮ { ⁦*/
    Rust/stretched-string.rs:3:    if access_level != "user‮ ⁦// Check if admin⁩ ⁦" {
    Binary file website/public/trojan-source.pdf matches
    Binary file website/src/assets/img/faces/erik-lucatero-2.jpg matches
    1

That "Binary file..." being because they don't maintain an up-to-date
.gitattributes file.

The -e you've got is redundant since you have only one pattern. A relic
of trying to feed a list of these with "-e" to "git grep"?

But anyway, as noted you're not correct that we don't have PCRE with the
"git" in the CI, we just don't have \uxxxx, but you don't need that
optional syntax. Just use \N{U+xxxx}:

    ! git -P grep -nP '[\N{U+202a}-\N{U+202e}\N{U+2066}-\N{U+2069}]' ':(attr:!binary)'

All that being said I still don't see much of a point in doing this, but
if we are let's not make it needlessly complex, but actually, why not
some variant of:
    
    $ ~/g/git/git grep -Pn '[\N{U+2000}-\N{U+206F}]' -- ':!(attr:binary)' ':!/po/' ':!*/po/*' ':!t/t0200/' ':!t/t0204*' ':!t/t5150-request-pull.sh'
    builtin/gc.c:1543: *   * if the input value *cmd isn’t the key of any of the comma-separated list
    builtin/gc.c:1555: *   GIT_TEST_MAINT_SCHEDULER set to “foo:./mock_foo.sh,bar:./mock_bar.sh”
    builtin/submodule--helper.c:3262:                          N_("sets the submodule’s name to the given string "
    contrib/credential/netrc/git-credential-netrc.perl:122:   The credential’s username, if we already have one. (username=X)
    t/helper/test-mergesort.c:364: * Function" by Bentley and McIlroy, Software—Practice and Experience,
    t/helper/test-mergesort.c:365: * Volume 23, Issue 11, 1249–1265 (November 1993).
    Binary file t/t0013/shattered-1.pdf matches

I.e. that notices that the "binary" is missing for that *.pdf, and that
we have some Unicode quoting we probably should get rid of anyway, it
also nicely excludes po/* in the pathspec, so any legitimate use of this
for RTL languages won't be prevented by this check.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] ci: disallow directional formatting
  2021-11-02 15:48   ` Taylor Blau
  2021-11-02 16:03     ` Ævar Arnfjörð Bjarmason
  2021-11-02 16:12     ` Johannes Schindelin
@ 2021-11-03 17:20     ` Junio C Hamano
  2 siblings, 0 replies; 18+ messages in thread
From: Junio C Hamano @ 2021-11-03 17:20 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Ævar Arnfjörð Bjarmason,
	Johannes Schindelin via GitGitGadget, git, Johannes Schindelin

Taylor Blau <me@ttaylorr.com> writes:

> On Tue, Nov 02, 2021 at 04:01:57PM +0100, Ævar Arnfjörð Bjarmason wrote:
>> There's a parallel discussion about doing something to detect this in
>> "git am", which for the git project seems like a better place to put
>> this.
>
> I don't think that one impacts the other necessarily. Having `git am`
> guard against this would probably be sufficient to protect Junio
> accidentally apply something containing directional formatting to his
> tree unknowingly.
>
> But the idea that we rely on the import mechanism to protect against
> this doesn't sit well with me. Ultimately, we should be relying on a
> static check like below to ensure that directional formatting hasn't
> entered the tree by any mechanism (not just 'git am').

Yes.  Quite honestly, such a check shouldn't be in "am" proper at
all.

Rather, for am users who care, they should protect themselves with
something like the pre-applypatch hook, which can perform the same
check as their pre-commit hook to protect their other commits.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] ci: disallow directional formatting
  2021-11-03 12:23 ` [PATCH v2] " Johannes Schindelin via GitGitGadget
  2021-11-03 16:36   ` Ævar Arnfjörð Bjarmason
@ 2021-11-03 18:00   ` Junio C Hamano
  2021-11-03 23:38   ` Junio C Hamano
  2021-11-04 13:13   ` [PATCH v3] " Johannes Schindelin via GitGitGadget
  3 siblings, 0 replies; 18+ messages in thread
From: Junio C Hamano @ 2021-11-03 18:00 UTC (permalink / raw)
  To: Johannes Schindelin via GitGitGadget
  Cc: git, Ævar Arnfjörð Bjarmason, Taylor Blau,
	Johannes Schindelin

"Johannes Schindelin via GitGitGadget" <gitgitgadget@gmail.com>
writes:

> +# Note: we need to use Bash here because its `printf` interprets `\uNNNN` as
> +# UTF-8 code points, as desired. Running this script through Ubuntu's `dash`,
> +# for example, would use a `printf` that does not understand that syntax.

Wow, the situation is ugly and the solution is understandable.

Shouldn't we call this script *.bash instead of *.sh then, though?

Other than that, looks good to me.  Thanks.

> +
> +# U+202a..U+2a2e: LRE, RLE, PDF, LRO and RLO
> +# U+2066..U+2069: LRI, RLI, FSI and PDI
> +regex='(\u202a|\u202b|\u202c|\u202d|\u202e|\u2066|\u2067|\u2068|\u2069)'
> +
> +! git ls-files -z ':(attr:!binary)' |
> +LC_CTYPE=C xargs -0r git grep -Ele "$(LC_CTYPE=C.UTF-8 printf "$regex")" --
>
> base-commit: 0cddd84c9f3e9c3d793ec93034ef679335f35e49

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] ci: disallow directional formatting
  2021-11-03 12:23 ` [PATCH v2] " Johannes Schindelin via GitGitGadget
  2021-11-03 16:36   ` Ævar Arnfjörð Bjarmason
  2021-11-03 18:00   ` Junio C Hamano
@ 2021-11-03 23:38   ` Junio C Hamano
  2021-11-04 10:19     ` Johannes Schindelin
  2021-11-04 13:13   ` [PATCH v3] " Johannes Schindelin via GitGitGadget
  3 siblings, 1 reply; 18+ messages in thread
From: Junio C Hamano @ 2021-11-03 23:38 UTC (permalink / raw)
  To: Johannes Schindelin via GitGitGadget
  Cc: git, Ævar Arnfjörð Bjarmason, Taylor Blau,
	Johannes Schindelin

"Johannes Schindelin via GitGitGadget" <gitgitgadget@gmail.com>
writes:

> +# U+202a..U+2a2e: LRE, RLE, PDF, LRO and RLO
> +# U+2066..U+2069: LRI, RLI, FSI and PDI
> +regex='(\u202a|\u202b|\u202c|\u202d|\u202e|\u2066|\u2067|\u2068|\u2069)'
> +
> +! git ls-files -z ':(attr:!binary)' |
> +LC_CTYPE=C xargs -0r git grep -Ele "$(LC_CTYPE=C.UTF-8 printf "$regex")" --

One thing for the future, and one thing for the present.

 - Do some languages we would add to po/ hierarchy in the future
   possibly want to use these sequences as legitimate contents?

 - Do we need ls-files?  


For the latter, shouldn't the attribute-based pathspec work just
fine with "git grep"?  i.e.

    git grep -l -E -e $pattern -- ':(exclude,attr:binary)'


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] ci: disallow directional formatting
  2021-11-03 23:38   ` Junio C Hamano
@ 2021-11-04 10:19     ` Johannes Schindelin
  0 siblings, 0 replies; 18+ messages in thread
From: Johannes Schindelin @ 2021-11-04 10:19 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Johannes Schindelin via GitGitGadget, git,
	Ævar Arnfjörð Bjarmason, Taylor Blau

Hi Junio,

On Wed, 3 Nov 2021, Junio C Hamano wrote:

> "Johannes Schindelin via GitGitGadget" <gitgitgadget@gmail.com>
> writes:
>
> > +# U+202a..U+2a2e: LRE, RLE, PDF, LRO and RLO
> > +# U+2066..U+2069: LRI, RLI, FSI and PDI
> > +regex='(\u202a|\u202b|\u202c|\u202d|\u202e|\u2066|\u2067|\u2068|\u2069)'
> > +
> > +! git ls-files -z ':(attr:!binary)' |
> > +LC_CTYPE=C xargs -0r git grep -Ele "$(LC_CTYPE=C.UTF-8 printf "$regex")" --
>
> One thing for the future, and one thing for the present.
>
>  - Do some languages we would add to po/ hierarchy in the future
>    possibly want to use these sequences as legitimate contents?

I mulled over that. And I think you're right. If a right-to-left
translation needs to refer to, say, a `git` invocation, the part that
shows the commandline surely would need to be guarded within directional
formatting code points. We currently only have translated messages that
read left-to-right, for example we lack Arabic and Hebrew translations.
Those would be likely to contain such code points on purpose.

I therefore added `:(exclude)*.po` to the command.

>  - Do we need ls-files?
>
>
> For the latter, shouldn't the attribute-based pathspec work just
> fine with "git grep"?  i.e.
>
>     git grep -l -E -e $pattern -- ':(exclude,attr:binary)'

D'oh. You're right! I fixed it.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v3] ci: disallow directional formatting
  2021-11-03 12:23 ` [PATCH v2] " Johannes Schindelin via GitGitGadget
                     ` (2 preceding siblings ...)
  2021-11-03 23:38   ` Junio C Hamano
@ 2021-11-04 13:13   ` Johannes Schindelin via GitGitGadget
  2021-11-04 13:48     ` Ævar Arnfjörð Bjarmason
  3 siblings, 1 reply; 18+ messages in thread
From: Johannes Schindelin via GitGitGadget @ 2021-11-04 13:13 UTC (permalink / raw)
  To: git
  Cc: Ævar Arnfjörð Bjarmason, Taylor Blau,
	Johannes Schindelin, Johannes Schindelin

From: Johannes Schindelin <johannes.schindelin@gmx.de>

As described in https://trojansource.codes/trojan-source.pdf, it is
possible to abuse directional formatting (a feature of Unicode) to
deceive human readers into interpreting code differently from compilers.

For example, an "if ()" expression could be enclosed in a comment, but
rendered as if it was outside of that comment. In effect, this could
fool a reviewer into misinterpreting the code flow as benign when it is
not.

It is highly unlikely that Git's source code wants to contain such
directional formatting in the first place, so let's just disallow it.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
    ci: disallow directional formatting
    
    A couple days ago, I stumbled over
    https://siliconangle.com/2021/11/01/trojan-source-technique-can-inject-malware-source-code-without-detection/,
    which details an interesting social-engineering attack: it uses
    directional formatting in source code to pretend to human readers that
    the code does something different than it actually does.
    
    It is highly unlikely that Git's source code wants to contain such
    directional formatting in the first place, so let's disallow it.
    
    Technically, this is not exactly -rc material, but the paper was just
    published, and I want us to be safe.
    
    Changes since v2:
    
     * The pathspec excluding binary files is now used directly instead of
       doing the ls-files | xargs dance.
     * Corrected a code comment: my custom git grep was not PCRE-enabled,
       but Ubuntu's isn't. But git grep -P still does not understand \uNNNN.
     * Even if the *.po files currently pass the check, the script is now
       future-proof by excluding them.
     * Renamed the script to have the .bash extension, to indicate that it
       requires a Bashism (i.e. a printf that understands the \uNNNN
       syntax).
    
    Changes since v1:
    
     * The code was moved into a script in ci/.
     * We use git ls-files now to exclude files marked as binary in the Git
       attributes.

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1071%2Fdscho%2Fcheck-for-utf-8-directional-formatting-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1071/dscho/check-for-utf-8-directional-formatting-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/1071

Range-diff vs v2:

 1:  bbf963695ba ! 1:  80447819de3 ci: disallow directional formatting
     @@ .github/workflows/main.yml: jobs:
           - uses: actions/checkout@v2
           - run: ci/install-dependencies.sh
           - run: ci/run-static-analysis.sh
     -+    - run: ci/check-directional-formatting.sh
     ++    - run: ci/check-directional-formatting.bash
         sparse:
           needs: ci-config
           if: needs.ci-config.outputs.enabled == 'yes'
      
     - ## ci/check-directional-formatting.sh (new) ##
     + ## ci/check-directional-formatting.bash (new) ##
      @@
      +#!/bin/bash
      +
     @@ ci/check-directional-formatting.sh (new)
      +# to deceive reviewers into interpreting code differently from the compiler.
      +# This is intended to run on an Ubuntu agent in a GitHub workflow.
      +#
     -+# `git grep` as well as GNU grep do not handle `\u` as a way to specify UTF-8.
     -+# A PCRE-enabled `git grep` would handle `\u` as desired, but Ubuntu does
     -+# not build its `git` packages with PCRE support.
     ++# To allow translated messages to introduce such directional formatting in the
     ++# future, we exclude the `.po` files from this validation.
     ++#
     ++# Neither GNU grep nor `git grep` (not even with `-P`) handle `\u` as a way to
     ++# specify UTF-8.
      +#
      +# To work around that, we use `printf` to produce the pattern as a byte
      +# sequence, and then feed that to `git grep` as a byte sequence (setting
     @@ ci/check-directional-formatting.sh (new)
      +# U+2066..U+2069: LRI, RLI, FSI and PDI
      +regex='(\u202a|\u202b|\u202c|\u202d|\u202e|\u2066|\u2067|\u2068|\u2069)'
      +
     -+! git ls-files -z ':(attr:!binary)' |
     -+LC_CTYPE=C xargs -0r git grep -Ele "$(LC_CTYPE=C.UTF-8 printf "$regex")" --
     ++! LC_CTYPE=C git grep -El "$(LC_CTYPE=C.UTF-8 printf "$regex")" \
     ++	-- ':(exclude,attr:binary)' ':(exclude)*.po'


 .github/workflows/main.yml           |  1 +
 ci/check-directional-formatting.bash | 27 +++++++++++++++++++++++++++
 2 files changed, 28 insertions(+)
 create mode 100755 ci/check-directional-formatting.bash

diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
index 6ed6a9e8076..deda12db3a9 100644
--- a/.github/workflows/main.yml
+++ b/.github/workflows/main.yml
@@ -289,6 +289,7 @@ jobs:
     - uses: actions/checkout@v2
     - run: ci/install-dependencies.sh
     - run: ci/run-static-analysis.sh
+    - run: ci/check-directional-formatting.bash
   sparse:
     needs: ci-config
     if: needs.ci-config.outputs.enabled == 'yes'
diff --git a/ci/check-directional-formatting.bash b/ci/check-directional-formatting.bash
new file mode 100755
index 00000000000..e6211b141aa
--- /dev/null
+++ b/ci/check-directional-formatting.bash
@@ -0,0 +1,27 @@
+#!/bin/bash
+
+# This script verifies that the non-binary files tracked in the Git index do
+# not contain any Unicode directional formatting: such formatting could be used
+# to deceive reviewers into interpreting code differently from the compiler.
+# This is intended to run on an Ubuntu agent in a GitHub workflow.
+#
+# To allow translated messages to introduce such directional formatting in the
+# future, we exclude the `.po` files from this validation.
+#
+# Neither GNU grep nor `git grep` (not even with `-P`) handle `\u` as a way to
+# specify UTF-8.
+#
+# To work around that, we use `printf` to produce the pattern as a byte
+# sequence, and then feed that to `git grep` as a byte sequence (setting
+# `LC_CTYPE` to make sure that the arguments are interpreted as intended).
+#
+# Note: we need to use Bash here because its `printf` interprets `\uNNNN` as
+# UTF-8 code points, as desired. Running this script through Ubuntu's `dash`,
+# for example, would use a `printf` that does not understand that syntax.
+
+# U+202a..U+2a2e: LRE, RLE, PDF, LRO and RLO
+# U+2066..U+2069: LRI, RLI, FSI and PDI
+regex='(\u202a|\u202b|\u202c|\u202d|\u202e|\u2066|\u2067|\u2068|\u2069)'
+
+! LC_CTYPE=C git grep -El "$(LC_CTYPE=C.UTF-8 printf "$regex")" \
+	-- ':(exclude,attr:binary)' ':(exclude)*.po'

base-commit: 0cddd84c9f3e9c3d793ec93034ef679335f35e49
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v3] ci: disallow directional formatting
  2021-11-04 13:13   ` [PATCH v3] " Johannes Schindelin via GitGitGadget
@ 2021-11-04 13:48     ` Ævar Arnfjörð Bjarmason
  2021-11-04 17:19       ` Junio C Hamano
  0 siblings, 1 reply; 18+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-11-04 13:48 UTC (permalink / raw)
  To: Johannes Schindelin via GitGitGadget
  Cc: git, Taylor Blau, Johannes Schindelin


On Thu, Nov 04 2021, Johannes Schindelin via GitGitGadget wrote:

> From: Johannes Schindelin <johannes.schindelin@gmx.de>
> [...]
>      * Corrected a code comment: my custom git grep was not PCRE-enabled,
>        but Ubuntu's isn't. But git grep -P still does not understand \uNNNN.
>      * Even if the *.po files currently pass the check, the script is now
>        future-proof by excluding them.
> [...]

> +# This is intended to run on an Ubuntu agent in a GitHub workflow.
> +#
> +# To allow translated messages to introduce such directional formatting in the
> +# future, we exclude the `.po` files from this validation.
> +#
> +# Neither GNU grep nor `git grep` (not even with `-P`) handle `\u` as a way to
> +# specify UTF-8.
> +#
> +# To work around that, we use `printf` to produce the pattern as a byte
> +# sequence, and then feed that to `git grep` as a byte sequence (setting
> +# `LC_CTYPE` to make sure that the arguments are interpreted as intended).
> +#
> +# Note: we need to use Bash here because its `printf` interprets `\uNNNN` as
> +# UTF-8 code points, as desired. Running this script through Ubuntu's `dash`,
> +# for example, would use a `printf` that does not understand that syntax.
> +
> +# U+202a..U+2a2e: LRE, RLE, PDF, LRO and RLO
> +# U+2066..U+2069: LRI, RLI, FSI and PDI
> +regex='(\u202a|\u202b|\u202c|\u202d|\u202e|\u2066|\u2067|\u2068|\u2069)'
> +
> +! LC_CTYPE=C git grep -El "$(LC_CTYPE=C.UTF-8 printf "$regex")" \
> +	-- ':(exclude,attr:binary)' ':(exclude)*.po'
>
> base-commit: 0cddd84c9f3e9c3d793ec93034ef679335f35e49

So this is still using not-PCRE instead of the much simpler PCRE-enabled
one I suggested in[1] because your locally-built git doesn't link to
libpcre?

At the very least that comment is still quite confusing. I.e. it starts
out by saying that it's meant to run in CI where we've got PCRE, but
then goes on to describe an elaborate workaround that's only needed with
a not-PCRE grep.

Would be less confusing to understand why it's so complex as:

   # This is intended to run on an Ubuntu agent in a GitHub
   # workflow. We've got PCRE there, so all of the below could also be:
   #
   #    ! git -P grep -nP '[\N{U+202a}-\N{U+202e}\N{U+2066}-\N{U+2069}]' ':!(attr:binary)' ':!**po/**'
   #
   #
   # But the author wanted to run this locally on a system that didn't
   # have PCRE, So [go on to describe elaborate bash / "git grep -E" /
   # LC_* tweaking workaround....]
   [...]

B.t.w. I think that **po/** exclusion is closer to what you want,
i.e. "a 'po' dir anywhere in our tree". It'll also exclude this if we
e.g. end up using these in language-specific README files there or
whatever.

1. https://lore.kernel.org/git/211103.86zgqlhzvz.gmgdl@evledraar.gmail.com/ or:

   ! git -P grep -nP '[\N{U+202a}-\N{U+202e}\N{U+2066}-\N{U+2069}]' ':!(attr:binary)'

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3] ci: disallow directional formatting
  2021-11-04 13:48     ` Ævar Arnfjörð Bjarmason
@ 2021-11-04 17:19       ` Junio C Hamano
  2021-11-08 18:49         ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 18+ messages in thread
From: Junio C Hamano @ 2021-11-04 17:19 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Johannes Schindelin via GitGitGadget, git, Taylor Blau,
	Johannes Schindelin

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

>> +# U+202a..U+2a2e: LRE, RLE, PDF, LRO and RLO
>> +# U+2066..U+2069: LRI, RLI, FSI and PDI
>> +regex='(\u202a|\u202b|\u202c|\u202d|\u202e|\u2066|\u2067|\u2068|\u2069)'
>> +
>> +! LC_CTYPE=C git grep -El "$(LC_CTYPE=C.UTF-8 printf "$regex")" \
>> +	-- ':(exclude,attr:binary)' ':(exclude)*.po'
> ...
>    ! git -P grep -nP '[\N{U+202a}-\N{U+202e}\N{U+2066}-\N{U+2069}]' ':!(attr:binary)'

So you are comparing

 * requiring bash and C.UTF-8 locale to be available

vs

 * requiring git built with PCRE

assuming that "Dscho says doesn't work with PCRE and you say it
works with PCRE" is resolved?  They seem roughly the same
difficulty to me.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3] ci: disallow directional formatting
  2021-11-04 17:19       ` Junio C Hamano
@ 2021-11-08 18:49         ` Ævar Arnfjörð Bjarmason
  2021-11-08 20:08           ` Junio C Hamano
  0 siblings, 1 reply; 18+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-11-08 18:49 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Johannes Schindelin via GitGitGadget, git, Taylor Blau,
	Johannes Schindelin


On Thu, Nov 04 2021, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>
>>> +# U+202a..U+2a2e: LRE, RLE, PDF, LRO and RLO
>>> +# U+2066..U+2069: LRI, RLI, FSI and PDI
>>> +regex='(\u202a|\u202b|\u202c|\u202d|\u202e|\u2066|\u2067|\u2068|\u2069)'
>>> +
>>> +! LC_CTYPE=C git grep -El "$(LC_CTYPE=C.UTF-8 printf "$regex")" \
>>> +	-- ':(exclude,attr:binary)' ':(exclude)*.po'
>> ...
>>    ! git -P grep -nP '[\N{U+202a}-\N{U+202e}\N{U+2066}-\N{U+2069}]' ':!(attr:binary)'
>
> So you are comparing
>
>  * requiring bash and C.UTF-8 locale to be available
>
> vs
>
>  * requiring git built with PCRE
>
> assuming that "Dscho says doesn't work with PCRE and you say it
> works with PCRE" is resolved?  They seem roughly the same
> difficulty to me.

We can hard depend on a git built with PCRE, since the point of this
thing is to run in GitHub CI, Ubuntu builds git with PCRE, and that's
unlikely to ever change.

The caveats around PCRE that still somewhat persist around PCRE are due
to a misunderstanding, i.e. I think Johannes tried the \uXXXX syntax,
which we don't opt git-grep into, but as shown above you can just use
the other universally supported PCRE syntax for referring to Unicode
codepoints.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3] ci: disallow directional formatting
  2021-11-08 18:49         ` Ævar Arnfjörð Bjarmason
@ 2021-11-08 20:08           ` Junio C Hamano
  2021-11-09 13:34             ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 18+ messages in thread
From: Junio C Hamano @ 2021-11-08 20:08 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Johannes Schindelin via GitGitGadget, git, Taylor Blau,
	Johannes Schindelin

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

>> So you are comparing
>>
>>  * requiring bash and C.UTF-8 locale to be available
>>
>> vs
>>
>>  * requiring git built with PCRE
>>
>> assuming that "Dscho says doesn't work with PCRE and you say it
>> works with PCRE" is resolved?  They seem roughly the same
>> difficulty to me.
>
> We can hard depend on a git built with PCRE, since the point of this
> thing is to run in GitHub CI, Ubuntu builds git with PCRE, and that's
> unlikely to ever change.

Yes, so is the availability of bash and C.UTF-8 for the same reason:
we are talking about controlled environment.  That is what I meant
by "roughly the same difficulty to me".

FWIW, I am OK with either approach, as I find the patch in question
is just as readable as any rewrite that would use "grep -P", so...

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3] ci: disallow directional formatting
  2021-11-08 20:08           ` Junio C Hamano
@ 2021-11-09 13:34             ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 18+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-11-09 13:34 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Johannes Schindelin via GitGitGadget, git, Taylor Blau,
	Johannes Schindelin


On Mon, Nov 08 2021, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>
>>> So you are comparing
>>>
>>>  * requiring bash and C.UTF-8 locale to be available
>>>
>>> vs
>>>
>>>  * requiring git built with PCRE
>>>
>>> assuming that "Dscho says doesn't work with PCRE and you say it
>>> works with PCRE" is resolved?  They seem roughly the same
>>> difficulty to me.
>>
>> We can hard depend on a git built with PCRE, since the point of this
>> thing is to run in GitHub CI, Ubuntu builds git with PCRE, and that's
>> unlikely to ever change.
>
> Yes, so is the availability of bash and C.UTF-8 for the same reason:
> we are talking about controlled environment.  That is what I meant
> by "roughly the same difficulty to me".
>
> FWIW, I am OK with either approach, as I find the patch in question
> is just as readable as any rewrite that would use "grep -P", so...

To each his own I guess :) I do find the simple regex of:

    '[\N{U+202a}-\N{U+202e}\N{U+2066}-\N{U+2069}]'

Much easier to understand than something using printf, shell
interpolation, and needing to switch around LC_CTYPE to two different
values twice on one line.

But anyway, that's a matter of taste.

What isn't is the issue I noted at the bottom of [1], i.e. if we're
relying on '(attr:binary)' we should probably start with an assertion
that our idea of "binary" matches reality, or perhaps go back to the -I
heuristic.

Because now we've got at least one binary non-attribute-marked file, and
if files like that ever get updated they might start matching this
pattern. Maybe not a big deal, but someone updating those might
confusingly trip over this otherwise...

1. https://lore.kernel.org/git/211103.86zgqlhzvz.gmgdl@evledraar.gmail.com/

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2021-11-09 13:39 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-02 12:58 [PATCH] ci: disallow directional formatting Johannes Schindelin via GitGitGadget
2021-11-02 15:01 ` Ævar Arnfjörð Bjarmason
2021-11-02 15:48   ` Taylor Blau
2021-11-02 16:03     ` Ævar Arnfjörð Bjarmason
2021-11-02 16:12     ` Johannes Schindelin
2021-11-02 16:38       ` Ævar Arnfjörð Bjarmason
2021-11-03 17:20     ` Junio C Hamano
2021-11-03 12:23 ` [PATCH v2] " Johannes Schindelin via GitGitGadget
2021-11-03 16:36   ` Ævar Arnfjörð Bjarmason
2021-11-03 18:00   ` Junio C Hamano
2021-11-03 23:38   ` Junio C Hamano
2021-11-04 10:19     ` Johannes Schindelin
2021-11-04 13:13   ` [PATCH v3] " Johannes Schindelin via GitGitGadget
2021-11-04 13:48     ` Ævar Arnfjörð Bjarmason
2021-11-04 17:19       ` Junio C Hamano
2021-11-08 18:49         ` Ævar Arnfjörð Bjarmason
2021-11-08 20:08           ` Junio C Hamano
2021-11-09 13:34             ` Ævar Arnfjörð Bjarmason

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.