* Re: [PATCH v2] ci: disallow directional formatting
2021-11-03 12:23 ` [PATCH v2] " Johannes Schindelin via GitGitGadget
@ 2021-11-03 16:36 ` Ævar Arnfjörð Bjarmason
2021-11-03 18:00 ` Junio C Hamano
` (2 subsequent siblings)
3 siblings, 0 replies; 18+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-11-03 16:36 UTC (permalink / raw)
To: Johannes Schindelin via GitGitGadget
Cc: git, Taylor Blau, Johannes Schindelin
On Wed, Nov 03 2021, Johannes Schindelin via GitGitGadget wrote:
> From: Johannes Schindelin <johannes.schindelin@gmx.de>
>
> As described in https://trojansource.codes/trojan-source.pdf, it is
> possible to abuse directional formatting (a feature of Unicode) to
> deceive human readers into interpreting code differently from compilers.
>
> For example, an "if ()" expression could be enclosed in a comment, but
> rendered as if it was outside of that comment. In effect, this could
> fool a reviewer into misinterpreting the code flow as benign when it is
> not.
>
> It is highly unlikely that Git's source code wants to contain such
> directional formatting in the first place, so let's just disallow it.
>
> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> ---
> ci: disallow directional formatting
>
> I just stumbled over
> https://siliconangle.com/2021/11/01/trojan-source-technique-can-inject-malware-source-code-without-detection/,
> which details an interesting social-engineering attack: it uses
> directional formatting in source code to pretend to human readers that
> the code does something different than it actually does.
>
> It is highly unlikely that Git's source code wants to contain such
> directional formatting in the first place, so let's disallow it.
>
> Technically, this is not exactly -rc material, but the paper was just
> published, and I want us to be safe.
>
> Changes since v1:
>
> * The code was moved into a script in ci/.
> * We use git ls-files now to exclude files marked as binary in the Git
> attributes.
>
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1071%2Fdscho%2Fcheck-for-utf-8-directional-formatting-v2
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1071/dscho/check-for-utf-8-directional-formatting-v2
> Pull-Request: https://github.com/gitgitgadget/git/pull/1071
>
> Range-diff vs v1:
>
> 1: 6a1661fd887 ! 1: bbf963695ba ci: disallow directional formatting
> @@ Commit message
> possible to abuse directional formatting (a feature of Unicode) to
> deceive human readers into interpreting code differently from compilers.
>
> + For example, an "if ()" expression could be enclosed in a comment, but
> + rendered as if it was outside of that comment. In effect, this could
> + fool a reviewer into misinterpreting the code flow as benign when it is
> + not.
> +
> It is highly unlikely that Git's source code wants to contain such
> - directional formatting in the first place, so let's disallow it.
> + directional formatting in the first place, so let's just disallow it.
>
> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
>
> @@ .github/workflows/main.yml: jobs:
> - uses: actions/checkout@v2
> - run: ci/install-dependencies.sh
> - run: ci/run-static-analysis.sh
> -+ - name: disallow Unicode directional formatting
> -+ run: |
> -+ # Use UTF-8-aware `printf` to feed a byte pattern to non-UTF-8-aware `git grep`
> -+ # (Ubuntu's `git grep` is compiled without support for libpcre, otherwise we
> -+ # could use `git grep -P` with the `\u` syntax).
> -+ ! LANG=C git grep -Il "$(LANG=C.UTF-8 printf \
> -+ '\\(\u202a\\|\u202b\\|\u202c\\|\u202d\\|\u202e\\|\u2066\\|\u2067\\|\u2068\\|\u2069\\)')"
> ++ - run: ci/check-directional-formatting.sh
> sparse:
> needs: ci-config
> if: needs.ci-config.outputs.enabled == 'yes'
> +
> + ## ci/check-directional-formatting.sh (new) ##
> +@@
> ++#!/bin/bash
> ++
> ++# This script verifies that the non-binary files tracked in the Git index do
> ++# not contain any Unicode directional formatting: such formatting could be used
> ++# to deceive reviewers into interpreting code differently from the compiler.
> ++# This is intended to run on an Ubuntu agent in a GitHub workflow.
> ++#
> ++# `git grep` as well as GNU grep do not handle `\u` as a way to specify UTF-8.
> ++# A PCRE-enabled `git grep` would handle `\u` as desired, but Ubuntu does
> ++# not build its `git` packages with PCRE support.
> ++#
> ++# To work around that, we use `printf` to produce the pattern as a byte
> ++# sequence, and then feed that to `git grep` as a byte sequence (setting
> ++# `LC_CTYPE` to make sure that the arguments are interpreted as intended).
> ++#
> ++# Note: we need to use Bash here because its `printf` interprets `\uNNNN` as
> ++# UTF-8 code points, as desired. Running this script through Ubuntu's `dash`,
> ++# for example, would use a `printf` that does not understand that syntax.
> ++
> ++# U+202a..U+2a2e: LRE, RLE, PDF, LRO and RLO
> ++# U+2066..U+2069: LRI, RLI, FSI and PDI
> ++regex='(\u202a|\u202b|\u202c|\u202d|\u202e|\u2066|\u2067|\u2068|\u2069)'
> ++
> ++! git ls-files -z ':(attr:!binary)' |
> ++LC_CTYPE=C xargs -0r git grep -Ele "$(LC_CTYPE=C.UTF-8 printf "$regex")" --
>
>
> .github/workflows/main.yml | 1 +
> ci/check-directional-formatting.sh | 25 +++++++++++++++++++++++++
> 2 files changed, 26 insertions(+)
> create mode 100755 ci/check-directional-formatting.sh
>
> diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
> index 6ed6a9e8076..36b7a0bee38 100644
> --- a/.github/workflows/main.yml
> +++ b/.github/workflows/main.yml
> @@ -289,6 +289,7 @@ jobs:
> - uses: actions/checkout@v2
> - run: ci/install-dependencies.sh
> - run: ci/run-static-analysis.sh
> + - run: ci/check-directional-formatting.sh
> sparse:
> needs: ci-config
> if: needs.ci-config.outputs.enabled == 'yes'
> diff --git a/ci/check-directional-formatting.sh b/ci/check-directional-formatting.sh
> new file mode 100755
> index 00000000000..ab894715cf1
> --- /dev/null
> +++ b/ci/check-directional-formatting.sh
> @@ -0,0 +1,25 @@
> +#!/bin/bash
> +
> +# This script verifies that the non-binary files tracked in the Git index do
> +# not contain any Unicode directional formatting: such formatting could be used
> +# to deceive reviewers into interpreting code differently from the compiler.
> +# This is intended to run on an Ubuntu agent in a GitHub workflow.
> +#
> +# `git grep` as well as GNU grep do not handle `\u` as a way to specify UTF-8.
> +# A PCRE-enabled `git grep` would handle `\u` as desired, but Ubuntu does
> +# not build its `git` packages with PCRE support.
I just took you at your word before, but, no. It does have PCRE. It's
this Ubuntu version:
https://github.com/actions/virtual-environments/blob/main/images/linux/Ubuntu2004-README.md
i.e. "Ubuntu 20.04.3 LTS".
Debian's "git" package builds with libpcre, and Ubuntu's page about it
suggests it does the same: https://packages.ubuntu.com/focal/git
And if you try to push e.g. a grep for "git grep -P 'foo(?=bar)' you'll
find that it works.
What we haven't done in "git grep" is to compile the pattern with
PCRE2_ALT_BSUX, which is probably what you ran into. I.e. it doesn't
grok \u by default, we should probably support that, but you don't need
it.
> +#
> +# To work around that, we use `printf` to produce the pattern as a byte
> +# sequence, and then feed that to `git grep` as a byte sequence (setting
> +# `LC_CTYPE` to make sure that the arguments are interpreted as intended).
> +#
> +# Note: we need to use Bash here because its `printf` interprets `\uNNNN` as
> +# UTF-8 code points, as desired. Running this script through Ubuntu's `dash`,
> +# for example, would use a `printf` that does not understand that syntax.
> +
> +# U+202a..U+2a2e: LRE, RLE, PDF, LRO and RLO
> +# U+2066..U+2069: LRI, RLI, FSI and PDI
> +regex='(\u202a|\u202b|\u202c|\u202d|\u202e|\u2066|\u2067|\u2068|\u2069)'
> +
> +! git ls-files -z ':(attr:!binary)' |
> +LC_CTYPE=C xargs -0r git grep -Ele "$(LC_CTYPE=C.UTF-8 printf "$regex")" --
FWIW with that ls-files suggestion of mine I meant that would make sense
as a way to pipe it into Perl since doing this with a Unicode-aware
regex engine is less painful. I.e. you'd be able to just name the
Unicode range.
But since you didn't go for that surely this whole ls-files -> xargs ->
git grep pipeline isn't needed, and you'd just want:
! git -P grep -nE "$(LC_CTYPE=C.UTF-8 printf "$regex")" ':(attr:!binary)'
I added the -n for readability of the output, a matter of taste, and -P
to "git" for ease of ad-hoc testing. A test of that on the researcher's
trojan-source.git repo:
$ ! git -P grep -nE "$(LC_CTYPE=C.UTF-8 printf "$regex")" ':(attr:!binary)'; echo $?
C#/commenting-out.csx:4:/* } if (isAdmin) begin admins only */
C#/commenting-out.csx:6:/* end admins only { */
C#/stretched-string.csx:4:if (access_level != "user // Check if admin ") {
C++/commenting-out.cpp:5: /* } if (isAdmin) begin admins only */
C++/commenting-out.cpp:7: /* end admins only { */
C++/stretched-string.cpp:6: if (access_level.compare("user // Check if admin ")) {
C/commenting-out.c:6: /* } if (isAdmin) begin admins only */
C/commenting-out.c:8: /* end admins only { */
C/early-return.c:4: /* Say hello; newline /*/ return 0 ;
C/stretched-string.c:6: if (strcmp(access_level, "user // Check if admin ")) {
Go/commenting-out.go:9: /* } if (isAdmin) begin admins only */
Go/commenting-out.go:11: /* end admins only { */
Go/stretched-string.go:7: if accessLevel != "user // Check if admin " {
Java/CommentingOut.java:5: /* } if (isAdmin) begin admins only */
Java/CommentingOut.java:7: /* end admins only { */
Java/StretchedString.java:5: if (accessLevel != "user // Check if admin ") {
JavaScript/commenting-out.js:4:/* } if (isAdmin) begin admins only */
JavaScript/commenting-out.js:6:/* end admins only { */
JavaScript/stretched-string.js:4:if (accessLevel != "user // Check if admin ") {
Python/commenting-out.py:4:if access_level != 'none': # Check if admin ' and access_level != 'user
Python/early-return.py:5: ''' Subtract funds from bank account then ''' ;return
Rust/commenting-out.rs:3: /* } if is_admin begin admins only */
Rust/commenting-out.rs:5: /* end admins only { */
Rust/stretched-string.rs:3: if access_level != "user // Check if admin " {
Binary file website/public/trojan-source.pdf matches
Binary file website/src/assets/img/faces/erik-lucatero-2.jpg matches
1
That "Binary file..." being because they don't maintain an up-to-date
.gitattributes file.
The -e you've got is redundant since you have only one pattern. A relic
of trying to feed a list of these with "-e" to "git grep"?
But anyway, as noted you're not correct that we don't have PCRE with the
"git" in the CI, we just don't have \uxxxx, but you don't need that
optional syntax. Just use \N{U+xxxx}:
! git -P grep -nP '[\N{U+202a}-\N{U+202e}\N{U+2066}-\N{U+2069}]' ':(attr:!binary)'
All that being said I still don't see much of a point in doing this, but
if we are let's not make it needlessly complex, but actually, why not
some variant of:
$ ~/g/git/git grep -Pn '[\N{U+2000}-\N{U+206F}]' -- ':!(attr:binary)' ':!/po/' ':!*/po/*' ':!t/t0200/' ':!t/t0204*' ':!t/t5150-request-pull.sh'
builtin/gc.c:1543: * * if the input value *cmd isn’t the key of any of the comma-separated list
builtin/gc.c:1555: * GIT_TEST_MAINT_SCHEDULER set to “foo:./mock_foo.sh,bar:./mock_bar.sh”
builtin/submodule--helper.c:3262: N_("sets the submodule’s name to the given string "
contrib/credential/netrc/git-credential-netrc.perl:122: The credential’s username, if we already have one. (username=X)
t/helper/test-mergesort.c:364: * Function" by Bentley and McIlroy, Software—Practice and Experience,
t/helper/test-mergesort.c:365: * Volume 23, Issue 11, 1249–1265 (November 1993).
Binary file t/t0013/shattered-1.pdf matches
I.e. that notices that the "binary" is missing for that *.pdf, and that
we have some Unicode quoting we probably should get rid of anyway, it
also nicely excludes po/* in the pathspec, so any legitimate use of this
for RTL languages won't be prevented by this check.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2] ci: disallow directional formatting
2021-11-03 12:23 ` [PATCH v2] " Johannes Schindelin via GitGitGadget
2021-11-03 16:36 ` Ævar Arnfjörð Bjarmason
@ 2021-11-03 18:00 ` Junio C Hamano
2021-11-03 23:38 ` Junio C Hamano
2021-11-04 13:13 ` [PATCH v3] " Johannes Schindelin via GitGitGadget
3 siblings, 0 replies; 18+ messages in thread
From: Junio C Hamano @ 2021-11-03 18:00 UTC (permalink / raw)
To: Johannes Schindelin via GitGitGadget
Cc: git, Ævar Arnfjörð Bjarmason, Taylor Blau,
Johannes Schindelin
"Johannes Schindelin via GitGitGadget" <gitgitgadget@gmail.com>
writes:
> +# Note: we need to use Bash here because its `printf` interprets `\uNNNN` as
> +# UTF-8 code points, as desired. Running this script through Ubuntu's `dash`,
> +# for example, would use a `printf` that does not understand that syntax.
Wow, the situation is ugly and the solution is understandable.
Shouldn't we call this script *.bash instead of *.sh then, though?
Other than that, looks good to me. Thanks.
> +
> +# U+202a..U+2a2e: LRE, RLE, PDF, LRO and RLO
> +# U+2066..U+2069: LRI, RLI, FSI and PDI
> +regex='(\u202a|\u202b|\u202c|\u202d|\u202e|\u2066|\u2067|\u2068|\u2069)'
> +
> +! git ls-files -z ':(attr:!binary)' |
> +LC_CTYPE=C xargs -0r git grep -Ele "$(LC_CTYPE=C.UTF-8 printf "$regex")" --
>
> base-commit: 0cddd84c9f3e9c3d793ec93034ef679335f35e49
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2] ci: disallow directional formatting
2021-11-03 12:23 ` [PATCH v2] " Johannes Schindelin via GitGitGadget
2021-11-03 16:36 ` Ævar Arnfjörð Bjarmason
2021-11-03 18:00 ` Junio C Hamano
@ 2021-11-03 23:38 ` Junio C Hamano
2021-11-04 10:19 ` Johannes Schindelin
2021-11-04 13:13 ` [PATCH v3] " Johannes Schindelin via GitGitGadget
3 siblings, 1 reply; 18+ messages in thread
From: Junio C Hamano @ 2021-11-03 23:38 UTC (permalink / raw)
To: Johannes Schindelin via GitGitGadget
Cc: git, Ævar Arnfjörð Bjarmason, Taylor Blau,
Johannes Schindelin
"Johannes Schindelin via GitGitGadget" <gitgitgadget@gmail.com>
writes:
> +# U+202a..U+2a2e: LRE, RLE, PDF, LRO and RLO
> +# U+2066..U+2069: LRI, RLI, FSI and PDI
> +regex='(\u202a|\u202b|\u202c|\u202d|\u202e|\u2066|\u2067|\u2068|\u2069)'
> +
> +! git ls-files -z ':(attr:!binary)' |
> +LC_CTYPE=C xargs -0r git grep -Ele "$(LC_CTYPE=C.UTF-8 printf "$regex")" --
One thing for the future, and one thing for the present.
- Do some languages we would add to po/ hierarchy in the future
possibly want to use these sequences as legitimate contents?
- Do we need ls-files?
For the latter, shouldn't the attribute-based pathspec work just
fine with "git grep"? i.e.
git grep -l -E -e $pattern -- ':(exclude,attr:binary)'
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2] ci: disallow directional formatting
2021-11-03 23:38 ` Junio C Hamano
@ 2021-11-04 10:19 ` Johannes Schindelin
0 siblings, 0 replies; 18+ messages in thread
From: Johannes Schindelin @ 2021-11-04 10:19 UTC (permalink / raw)
To: Junio C Hamano
Cc: Johannes Schindelin via GitGitGadget, git,
Ævar Arnfjörð Bjarmason, Taylor Blau
Hi Junio,
On Wed, 3 Nov 2021, Junio C Hamano wrote:
> "Johannes Schindelin via GitGitGadget" <gitgitgadget@gmail.com>
> writes:
>
> > +# U+202a..U+2a2e: LRE, RLE, PDF, LRO and RLO
> > +# U+2066..U+2069: LRI, RLI, FSI and PDI
> > +regex='(\u202a|\u202b|\u202c|\u202d|\u202e|\u2066|\u2067|\u2068|\u2069)'
> > +
> > +! git ls-files -z ':(attr:!binary)' |
> > +LC_CTYPE=C xargs -0r git grep -Ele "$(LC_CTYPE=C.UTF-8 printf "$regex")" --
>
> One thing for the future, and one thing for the present.
>
> - Do some languages we would add to po/ hierarchy in the future
> possibly want to use these sequences as legitimate contents?
I mulled over that. And I think you're right. If a right-to-left
translation needs to refer to, say, a `git` invocation, the part that
shows the commandline surely would need to be guarded within directional
formatting code points. We currently only have translated messages that
read left-to-right, for example we lack Arabic and Hebrew translations.
Those would be likely to contain such code points on purpose.
I therefore added `:(exclude)*.po` to the command.
> - Do we need ls-files?
>
>
> For the latter, shouldn't the attribute-based pathspec work just
> fine with "git grep"? i.e.
>
> git grep -l -E -e $pattern -- ':(exclude,attr:binary)'
D'oh. You're right! I fixed it.
Ciao,
Dscho
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v3] ci: disallow directional formatting
2021-11-03 12:23 ` [PATCH v2] " Johannes Schindelin via GitGitGadget
` (2 preceding siblings ...)
2021-11-03 23:38 ` Junio C Hamano
@ 2021-11-04 13:13 ` Johannes Schindelin via GitGitGadget
2021-11-04 13:48 ` Ævar Arnfjörð Bjarmason
3 siblings, 1 reply; 18+ messages in thread
From: Johannes Schindelin via GitGitGadget @ 2021-11-04 13:13 UTC (permalink / raw)
To: git
Cc: Ævar Arnfjörð Bjarmason, Taylor Blau,
Johannes Schindelin, Johannes Schindelin
From: Johannes Schindelin <johannes.schindelin@gmx.de>
As described in https://trojansource.codes/trojan-source.pdf, it is
possible to abuse directional formatting (a feature of Unicode) to
deceive human readers into interpreting code differently from compilers.
For example, an "if ()" expression could be enclosed in a comment, but
rendered as if it was outside of that comment. In effect, this could
fool a reviewer into misinterpreting the code flow as benign when it is
not.
It is highly unlikely that Git's source code wants to contain such
directional formatting in the first place, so let's just disallow it.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
ci: disallow directional formatting
A couple days ago, I stumbled over
https://siliconangle.com/2021/11/01/trojan-source-technique-can-inject-malware-source-code-without-detection/,
which details an interesting social-engineering attack: it uses
directional formatting in source code to pretend to human readers that
the code does something different than it actually does.
It is highly unlikely that Git's source code wants to contain such
directional formatting in the first place, so let's disallow it.
Technically, this is not exactly -rc material, but the paper was just
published, and I want us to be safe.
Changes since v2:
* The pathspec excluding binary files is now used directly instead of
doing the ls-files | xargs dance.
* Corrected a code comment: my custom git grep was not PCRE-enabled,
but Ubuntu's isn't. But git grep -P still does not understand \uNNNN.
* Even if the *.po files currently pass the check, the script is now
future-proof by excluding them.
* Renamed the script to have the .bash extension, to indicate that it
requires a Bashism (i.e. a printf that understands the \uNNNN
syntax).
Changes since v1:
* The code was moved into a script in ci/.
* We use git ls-files now to exclude files marked as binary in the Git
attributes.
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1071%2Fdscho%2Fcheck-for-utf-8-directional-formatting-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1071/dscho/check-for-utf-8-directional-formatting-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/1071
Range-diff vs v2:
1: bbf963695ba ! 1: 80447819de3 ci: disallow directional formatting
@@ .github/workflows/main.yml: jobs:
- uses: actions/checkout@v2
- run: ci/install-dependencies.sh
- run: ci/run-static-analysis.sh
-+ - run: ci/check-directional-formatting.sh
++ - run: ci/check-directional-formatting.bash
sparse:
needs: ci-config
if: needs.ci-config.outputs.enabled == 'yes'
- ## ci/check-directional-formatting.sh (new) ##
+ ## ci/check-directional-formatting.bash (new) ##
@@
+#!/bin/bash
+
@@ ci/check-directional-formatting.sh (new)
+# to deceive reviewers into interpreting code differently from the compiler.
+# This is intended to run on an Ubuntu agent in a GitHub workflow.
+#
-+# `git grep` as well as GNU grep do not handle `\u` as a way to specify UTF-8.
-+# A PCRE-enabled `git grep` would handle `\u` as desired, but Ubuntu does
-+# not build its `git` packages with PCRE support.
++# To allow translated messages to introduce such directional formatting in the
++# future, we exclude the `.po` files from this validation.
++#
++# Neither GNU grep nor `git grep` (not even with `-P`) handle `\u` as a way to
++# specify UTF-8.
+#
+# To work around that, we use `printf` to produce the pattern as a byte
+# sequence, and then feed that to `git grep` as a byte sequence (setting
@@ ci/check-directional-formatting.sh (new)
+# U+2066..U+2069: LRI, RLI, FSI and PDI
+regex='(\u202a|\u202b|\u202c|\u202d|\u202e|\u2066|\u2067|\u2068|\u2069)'
+
-+! git ls-files -z ':(attr:!binary)' |
-+LC_CTYPE=C xargs -0r git grep -Ele "$(LC_CTYPE=C.UTF-8 printf "$regex")" --
++! LC_CTYPE=C git grep -El "$(LC_CTYPE=C.UTF-8 printf "$regex")" \
++ -- ':(exclude,attr:binary)' ':(exclude)*.po'
.github/workflows/main.yml | 1 +
ci/check-directional-formatting.bash | 27 +++++++++++++++++++++++++++
2 files changed, 28 insertions(+)
create mode 100755 ci/check-directional-formatting.bash
diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
index 6ed6a9e8076..deda12db3a9 100644
--- a/.github/workflows/main.yml
+++ b/.github/workflows/main.yml
@@ -289,6 +289,7 @@ jobs:
- uses: actions/checkout@v2
- run: ci/install-dependencies.sh
- run: ci/run-static-analysis.sh
+ - run: ci/check-directional-formatting.bash
sparse:
needs: ci-config
if: needs.ci-config.outputs.enabled == 'yes'
diff --git a/ci/check-directional-formatting.bash b/ci/check-directional-formatting.bash
new file mode 100755
index 00000000000..e6211b141aa
--- /dev/null
+++ b/ci/check-directional-formatting.bash
@@ -0,0 +1,27 @@
+#!/bin/bash
+
+# This script verifies that the non-binary files tracked in the Git index do
+# not contain any Unicode directional formatting: such formatting could be used
+# to deceive reviewers into interpreting code differently from the compiler.
+# This is intended to run on an Ubuntu agent in a GitHub workflow.
+#
+# To allow translated messages to introduce such directional formatting in the
+# future, we exclude the `.po` files from this validation.
+#
+# Neither GNU grep nor `git grep` (not even with `-P`) handle `\u` as a way to
+# specify UTF-8.
+#
+# To work around that, we use `printf` to produce the pattern as a byte
+# sequence, and then feed that to `git grep` as a byte sequence (setting
+# `LC_CTYPE` to make sure that the arguments are interpreted as intended).
+#
+# Note: we need to use Bash here because its `printf` interprets `\uNNNN` as
+# UTF-8 code points, as desired. Running this script through Ubuntu's `dash`,
+# for example, would use a `printf` that does not understand that syntax.
+
+# U+202a..U+2a2e: LRE, RLE, PDF, LRO and RLO
+# U+2066..U+2069: LRI, RLI, FSI and PDI
+regex='(\u202a|\u202b|\u202c|\u202d|\u202e|\u2066|\u2067|\u2068|\u2069)'
+
+! LC_CTYPE=C git grep -El "$(LC_CTYPE=C.UTF-8 printf "$regex")" \
+ -- ':(exclude,attr:binary)' ':(exclude)*.po'
base-commit: 0cddd84c9f3e9c3d793ec93034ef679335f35e49
--
gitgitgadget
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH v3] ci: disallow directional formatting
2021-11-04 13:13 ` [PATCH v3] " Johannes Schindelin via GitGitGadget
@ 2021-11-04 13:48 ` Ævar Arnfjörð Bjarmason
2021-11-04 17:19 ` Junio C Hamano
0 siblings, 1 reply; 18+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-11-04 13:48 UTC (permalink / raw)
To: Johannes Schindelin via GitGitGadget
Cc: git, Taylor Blau, Johannes Schindelin
On Thu, Nov 04 2021, Johannes Schindelin via GitGitGadget wrote:
> From: Johannes Schindelin <johannes.schindelin@gmx.de>
> [...]
> * Corrected a code comment: my custom git grep was not PCRE-enabled,
> but Ubuntu's isn't. But git grep -P still does not understand \uNNNN.
> * Even if the *.po files currently pass the check, the script is now
> future-proof by excluding them.
> [...]
> +# This is intended to run on an Ubuntu agent in a GitHub workflow.
> +#
> +# To allow translated messages to introduce such directional formatting in the
> +# future, we exclude the `.po` files from this validation.
> +#
> +# Neither GNU grep nor `git grep` (not even with `-P`) handle `\u` as a way to
> +# specify UTF-8.
> +#
> +# To work around that, we use `printf` to produce the pattern as a byte
> +# sequence, and then feed that to `git grep` as a byte sequence (setting
> +# `LC_CTYPE` to make sure that the arguments are interpreted as intended).
> +#
> +# Note: we need to use Bash here because its `printf` interprets `\uNNNN` as
> +# UTF-8 code points, as desired. Running this script through Ubuntu's `dash`,
> +# for example, would use a `printf` that does not understand that syntax.
> +
> +# U+202a..U+2a2e: LRE, RLE, PDF, LRO and RLO
> +# U+2066..U+2069: LRI, RLI, FSI and PDI
> +regex='(\u202a|\u202b|\u202c|\u202d|\u202e|\u2066|\u2067|\u2068|\u2069)'
> +
> +! LC_CTYPE=C git grep -El "$(LC_CTYPE=C.UTF-8 printf "$regex")" \
> + -- ':(exclude,attr:binary)' ':(exclude)*.po'
>
> base-commit: 0cddd84c9f3e9c3d793ec93034ef679335f35e49
So this is still using not-PCRE instead of the much simpler PCRE-enabled
one I suggested in[1] because your locally-built git doesn't link to
libpcre?
At the very least that comment is still quite confusing. I.e. it starts
out by saying that it's meant to run in CI where we've got PCRE, but
then goes on to describe an elaborate workaround that's only needed with
a not-PCRE grep.
Would be less confusing to understand why it's so complex as:
# This is intended to run on an Ubuntu agent in a GitHub
# workflow. We've got PCRE there, so all of the below could also be:
#
# ! git -P grep -nP '[\N{U+202a}-\N{U+202e}\N{U+2066}-\N{U+2069}]' ':!(attr:binary)' ':!**po/**'
#
#
# But the author wanted to run this locally on a system that didn't
# have PCRE, So [go on to describe elaborate bash / "git grep -E" /
# LC_* tweaking workaround....]
[...]
B.t.w. I think that **po/** exclusion is closer to what you want,
i.e. "a 'po' dir anywhere in our tree". It'll also exclude this if we
e.g. end up using these in language-specific README files there or
whatever.
1. https://lore.kernel.org/git/211103.86zgqlhzvz.gmgdl@evledraar.gmail.com/ or:
! git -P grep -nP '[\N{U+202a}-\N{U+202e}\N{U+2066}-\N{U+2069}]' ':!(attr:binary)'
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3] ci: disallow directional formatting
2021-11-04 13:48 ` Ævar Arnfjörð Bjarmason
@ 2021-11-04 17:19 ` Junio C Hamano
2021-11-08 18:49 ` Ævar Arnfjörð Bjarmason
0 siblings, 1 reply; 18+ messages in thread
From: Junio C Hamano @ 2021-11-04 17:19 UTC (permalink / raw)
To: Ævar Arnfjörð Bjarmason
Cc: Johannes Schindelin via GitGitGadget, git, Taylor Blau,
Johannes Schindelin
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>> +# U+202a..U+2a2e: LRE, RLE, PDF, LRO and RLO
>> +# U+2066..U+2069: LRI, RLI, FSI and PDI
>> +regex='(\u202a|\u202b|\u202c|\u202d|\u202e|\u2066|\u2067|\u2068|\u2069)'
>> +
>> +! LC_CTYPE=C git grep -El "$(LC_CTYPE=C.UTF-8 printf "$regex")" \
>> + -- ':(exclude,attr:binary)' ':(exclude)*.po'
> ...
> ! git -P grep -nP '[\N{U+202a}-\N{U+202e}\N{U+2066}-\N{U+2069}]' ':!(attr:binary)'
So you are comparing
* requiring bash and C.UTF-8 locale to be available
vs
* requiring git built with PCRE
assuming that "Dscho says doesn't work with PCRE and you say it
works with PCRE" is resolved? They seem roughly the same
difficulty to me.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3] ci: disallow directional formatting
2021-11-04 17:19 ` Junio C Hamano
@ 2021-11-08 18:49 ` Ævar Arnfjörð Bjarmason
2021-11-08 20:08 ` Junio C Hamano
0 siblings, 1 reply; 18+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-11-08 18:49 UTC (permalink / raw)
To: Junio C Hamano
Cc: Johannes Schindelin via GitGitGadget, git, Taylor Blau,
Johannes Schindelin
On Thu, Nov 04 2021, Junio C Hamano wrote:
> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>
>>> +# U+202a..U+2a2e: LRE, RLE, PDF, LRO and RLO
>>> +# U+2066..U+2069: LRI, RLI, FSI and PDI
>>> +regex='(\u202a|\u202b|\u202c|\u202d|\u202e|\u2066|\u2067|\u2068|\u2069)'
>>> +
>>> +! LC_CTYPE=C git grep -El "$(LC_CTYPE=C.UTF-8 printf "$regex")" \
>>> + -- ':(exclude,attr:binary)' ':(exclude)*.po'
>> ...
>> ! git -P grep -nP '[\N{U+202a}-\N{U+202e}\N{U+2066}-\N{U+2069}]' ':!(attr:binary)'
>
> So you are comparing
>
> * requiring bash and C.UTF-8 locale to be available
>
> vs
>
> * requiring git built with PCRE
>
> assuming that "Dscho says doesn't work with PCRE and you say it
> works with PCRE" is resolved? They seem roughly the same
> difficulty to me.
We can hard depend on a git built with PCRE, since the point of this
thing is to run in GitHub CI, Ubuntu builds git with PCRE, and that's
unlikely to ever change.
The caveats around PCRE that still somewhat persist around PCRE are due
to a misunderstanding, i.e. I think Johannes tried the \uXXXX syntax,
which we don't opt git-grep into, but as shown above you can just use
the other universally supported PCRE syntax for referring to Unicode
codepoints.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3] ci: disallow directional formatting
2021-11-08 18:49 ` Ævar Arnfjörð Bjarmason
@ 2021-11-08 20:08 ` Junio C Hamano
2021-11-09 13:34 ` Ævar Arnfjörð Bjarmason
0 siblings, 1 reply; 18+ messages in thread
From: Junio C Hamano @ 2021-11-08 20:08 UTC (permalink / raw)
To: Ævar Arnfjörð Bjarmason
Cc: Johannes Schindelin via GitGitGadget, git, Taylor Blau,
Johannes Schindelin
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>> So you are comparing
>>
>> * requiring bash and C.UTF-8 locale to be available
>>
>> vs
>>
>> * requiring git built with PCRE
>>
>> assuming that "Dscho says doesn't work with PCRE and you say it
>> works with PCRE" is resolved? They seem roughly the same
>> difficulty to me.
>
> We can hard depend on a git built with PCRE, since the point of this
> thing is to run in GitHub CI, Ubuntu builds git with PCRE, and that's
> unlikely to ever change.
Yes, so is the availability of bash and C.UTF-8 for the same reason:
we are talking about controlled environment. That is what I meant
by "roughly the same difficulty to me".
FWIW, I am OK with either approach, as I find the patch in question
is just as readable as any rewrite that would use "grep -P", so...
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3] ci: disallow directional formatting
2021-11-08 20:08 ` Junio C Hamano
@ 2021-11-09 13:34 ` Ævar Arnfjörð Bjarmason
0 siblings, 0 replies; 18+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-11-09 13:34 UTC (permalink / raw)
To: Junio C Hamano
Cc: Johannes Schindelin via GitGitGadget, git, Taylor Blau,
Johannes Schindelin
On Mon, Nov 08 2021, Junio C Hamano wrote:
> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>
>>> So you are comparing
>>>
>>> * requiring bash and C.UTF-8 locale to be available
>>>
>>> vs
>>>
>>> * requiring git built with PCRE
>>>
>>> assuming that "Dscho says doesn't work with PCRE and you say it
>>> works with PCRE" is resolved? They seem roughly the same
>>> difficulty to me.
>>
>> We can hard depend on a git built with PCRE, since the point of this
>> thing is to run in GitHub CI, Ubuntu builds git with PCRE, and that's
>> unlikely to ever change.
>
> Yes, so is the availability of bash and C.UTF-8 for the same reason:
> we are talking about controlled environment. That is what I meant
> by "roughly the same difficulty to me".
>
> FWIW, I am OK with either approach, as I find the patch in question
> is just as readable as any rewrite that would use "grep -P", so...
To each his own I guess :) I do find the simple regex of:
'[\N{U+202a}-\N{U+202e}\N{U+2066}-\N{U+2069}]'
Much easier to understand than something using printf, shell
interpolation, and needing to switch around LC_CTYPE to two different
values twice on one line.
But anyway, that's a matter of taste.
What isn't is the issue I noted at the bottom of [1], i.e. if we're
relying on '(attr:binary)' we should probably start with an assertion
that our idea of "binary" matches reality, or perhaps go back to the -I
heuristic.
Because now we've got at least one binary non-attribute-marked file, and
if files like that ever get updated they might start matching this
pattern. Maybe not a big deal, but someone updating those might
confusingly trip over this otherwise...
1. https://lore.kernel.org/git/211103.86zgqlhzvz.gmgdl@evledraar.gmail.com/
^ permalink raw reply [flat|nested] 18+ messages in thread