All of lore.kernel.org
 help / color / mirror / Atom feed
* b4: unicode control characters -- warn or remove?
@ 2021-11-01 17:50 Konstantin Ryabitsev
  2021-11-01 18:10 ` Miguel Ojeda
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Konstantin Ryabitsev @ 2021-11-01 17:50 UTC (permalink / raw)
  To: users, tools

Hi, all:

Per exhibit a, what should we do in the situation where we discover unicode
control characters in an email?

1. Warn and strip these chars out, because they are extremely unlikely to be
   doing anything legitimate in the context of a patch (unless someone is
   sending patches for docs actually written in RTL languages)
2. Warn and error out, refusing to produce an mbox
3. Just warn and produce an mbox anyway

I'd normally do #3, but with many people piping things to git-am, I'm not sure
if it's the safest choice.

Exibit a: https://lwn.net/Articles/874546/

-K

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: b4: unicode control characters -- warn or remove?
  2021-11-01 17:50 b4: unicode control characters -- warn or remove? Konstantin Ryabitsev
@ 2021-11-01 18:10 ` Miguel Ojeda
  2021-11-02  8:38   ` Petr Mladek
  2021-11-01 18:38 ` Florian Weimer
  2021-11-01 19:09 ` Eric Wong
  2 siblings, 1 reply; 12+ messages in thread
From: Miguel Ojeda @ 2021-11-01 18:10 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: users, tools

On Mon, Nov 1, 2021 at 6:50 PM Konstantin Ryabitsev
<konstantin@linuxfoundation.org> wrote:
>
> Per exhibit a, what should we do in the situation where we discover unicode
> control characters in an email?
>
> 1. Warn and strip these chars out, because they are extremely unlikely to be
>    doing anything legitimate in the context of a patch (unless someone is
>    sending patches for docs actually written in RTL languages)
> 2. Warn and error out, refusing to produce an mbox
> 3. Just warn and produce an mbox anyway

I think option #2 would be the best + a flag to override the error (to
be used if the user has verified it is legitimate).

Cheers,
Miguel

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: b4: unicode control characters -- warn or remove?
  2021-11-01 17:50 b4: unicode control characters -- warn or remove? Konstantin Ryabitsev
  2021-11-01 18:10 ` Miguel Ojeda
@ 2021-11-01 18:38 ` Florian Weimer
  2021-11-01 19:09 ` Eric Wong
  2 siblings, 0 replies; 12+ messages in thread
From: Florian Weimer @ 2021-11-01 18:38 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: users, tools

* Konstantin Ryabitsev:

> Per exhibit a, what should we do in the situation where we discover unicode
> control characters in an email?
>
> 1. Warn and strip these chars out, because they are extremely unlikely to be
>    doing anything legitimate in the context of a patch (unless someone is
>    sending patches for docs actually written in RTL languages)
> 2. Warn and error out, refusing to produce an mbox
> 3. Just warn and produce an mbox anyway
>
> I'd normally do #3, but with many people piping things to git-am, I'm
> not sure if it's the safest choice.

It's probably best to coordinate such steps with people who use RTL
script in patches.  In practice, this might mean deferring to whatever
mechanisms the Unicode folks come up with.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: b4: unicode control characters -- warn or remove?
  2021-11-01 17:50 b4: unicode control characters -- warn or remove? Konstantin Ryabitsev
  2021-11-01 18:10 ` Miguel Ojeda
  2021-11-01 18:38 ` Florian Weimer
@ 2021-11-01 19:09 ` Eric Wong
  2021-11-01 19:17   ` Konstantin Ryabitsev
  2021-11-01 20:02   ` Ævar Arnfjörð Bjarmason
  2 siblings, 2 replies; 12+ messages in thread
From: Eric Wong @ 2021-11-01 19:09 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: users, tools, git

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> Hi, all:
> 
> Per exhibit a, what should we do in the situation where we discover unicode
> control characters in an email?
> 
> 1. Warn and strip these chars out, because they are extremely unlikely to be
>    doing anything legitimate in the context of a patch (unless someone is
>    sending patches for docs actually written in RTL languages)
> 2. Warn and error out, refusing to produce an mbox
> 3. Just warn and produce an mbox anyway
> 
> I'd normally do #3, but with many people piping things to git-am, I'm not sure
> if it's the safest choice.
> 
> Exibit a: https://lwn.net/Articles/874546/

+Cc: git@vger

IMHO, defense for this belongs in git-am (which already checks
things like whitespace).

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: b4: unicode control characters -- warn or remove?
  2021-11-01 19:09 ` Eric Wong
@ 2021-11-01 19:17   ` Konstantin Ryabitsev
  2021-11-01 20:02   ` Ævar Arnfjörð Bjarmason
  1 sibling, 0 replies; 12+ messages in thread
From: Konstantin Ryabitsev @ 2021-11-01 19:17 UTC (permalink / raw)
  To: Eric Wong; +Cc: users, tools, git

On Mon, Nov 01, 2021 at 07:09:05PM +0000, Eric Wong wrote:
> > Per exhibit a, what should we do in the situation where we discover unicode
> > control characters in an email?
> > 
> > 1. Warn and strip these chars out, because they are extremely unlikely to be
> >    doing anything legitimate in the context of a patch (unless someone is
> >    sending patches for docs actually written in RTL languages)
> > 2. Warn and error out, refusing to produce an mbox
> > 3. Just warn and produce an mbox anyway
> > 
> > I'd normally do #3, but with many people piping things to git-am, I'm not sure
> > if it's the safest choice.
> > 
> > Exibit a: https://lwn.net/Articles/874546/
> 
> +Cc: git@vger
> 
> IMHO, defense for this belongs in git-am (which already checks
> things like whitespace).

I agree, but even if that is implemented in git, we'll still probably want to
catch this on the b4 side of things until everyone uses the git client where
that's handled natively.

-K

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: b4: unicode control characters -- warn or remove?
  2021-11-01 19:09 ` Eric Wong
  2021-11-01 19:17   ` Konstantin Ryabitsev
@ 2021-11-01 20:02   ` Ævar Arnfjörð Bjarmason
  2021-11-01 20:22     ` Konstantin Ryabitsev
  1 sibling, 1 reply; 12+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-11-01 20:02 UTC (permalink / raw)
  To: Eric Wong; +Cc: Konstantin Ryabitsev, users, tools, git


On Mon, Nov 01 2021, Eric Wong wrote:

> Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
>> Hi, all:
>> 
>> Per exhibit a, what should we do in the situation where we discover unicode
>> control characters in an email?
>> 
>> 1. Warn and strip these chars out, because they are extremely unlikely to be
>>    doing anything legitimate in the context of a patch (unless someone is
>>    sending patches for docs actually written in RTL languages)
>> 2. Warn and error out, refusing to produce an mbox
>> 3. Just warn and produce an mbox anyway
>> 
>> I'd normally do #3, but with many people piping things to git-am, I'm not sure
>> if it's the safest choice.
>> 
>> Exibit a: https://lwn.net/Articles/874546/
>
> +Cc: git@vger
>
> IMHO, defense for this belongs in git-am (which already checks
> things like whitespace).

It checks whitespace because that's something that's commonly a source
of patch corruption. I'm not adverse to adding this to core.whitespace,
but trying to catch malicious injected code seems like a rather big
expansion of its scope, particularly since:

    "[...]sending patches for docs actually written in RTL languages[...]"

Or just code? People write comment and even in their native languages,
and not all projects are as anglo-centric as those hosted on kernel.org.

I haven't checked what the overlap is between solving this issue & i18n
support, but we definitely should not be assuming that git's only using
by kernel.org users & similar, even something as relatively obscure as
git-am.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: b4: unicode control characters -- warn or remove?
  2021-11-01 20:02   ` Ævar Arnfjörð Bjarmason
@ 2021-11-01 20:22     ` Konstantin Ryabitsev
  2021-11-01 20:49       ` Pavel Machek
  2021-11-02 14:09       ` Konstantin Ryabitsev
  0 siblings, 2 replies; 12+ messages in thread
From: Konstantin Ryabitsev @ 2021-11-01 20:22 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Eric Wong, users, tools, git

On Mon, Nov 01, 2021 at 09:02:34PM +0100, Ævar Arnfjörð Bjarmason wrote:
> It checks whitespace because that's something that's commonly a source
> of patch corruption. I'm not adverse to adding this to core.whitespace,
> but trying to catch malicious injected code seems like a rather big
> expansion of its scope, particularly since:
> 
>     "[...]sending patches for docs actually written in RTL languages[...]"
> 
> Or just code? People write comment and even in their native languages,
> and not all projects are as anglo-centric as those hosted on kernel.org.

My comment about docs was purely within the scope of the Linux kernel.

I think the following would be a sane check:

1. are there unicode control characters (CCs) present?
2. are there other characters from RTL languages present in the same line?

if both 1 && 2 are true, this is a legitimate use of Unicode CCs. If only 1 is
true, then it's likely worth a warning.

Maybe even relax #2 to just check for unicode characters above a certain
barrier where RTL languages live. I think everyone will agree that if there
are unicode CCs and no other unicode characters in that same line, it's likely
not a legitimate use of control characters.

-K

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: b4: unicode control characters -- warn or remove?
  2021-11-01 20:22     ` Konstantin Ryabitsev
@ 2021-11-01 20:49       ` Pavel Machek
  2021-11-01 21:02         ` Konstantin Ryabitsev
  2021-11-02 14:09       ` Konstantin Ryabitsev
  1 sibling, 1 reply; 12+ messages in thread
From: Pavel Machek @ 2021-11-01 20:49 UTC (permalink / raw)
  To: Konstantin Ryabitsev
  Cc: Ævar Arnfjörð Bjarmason, Eric Wong, users, tools, git

[-- Attachment #1: Type: text/plain, Size: 1429 bytes --]

Hi!

> > It checks whitespace because that's something that's commonly a source
> > of patch corruption. I'm not adverse to adding this to core.whitespace,
> > but trying to catch malicious injected code seems like a rather big
> > expansion of its scope, particularly since:
> > 
> >     "[...]sending patches for docs actually written in RTL languages[...]"
> > 
> > Or just code? People write comment and even in their native languages,
> > and not all projects are as anglo-centric as those hosted on kernel.org.
> 
> My comment about docs was purely within the scope of the Linux kernel.
> 
> I think the following would be a sane check:
> 
> 1. are there unicode control characters (CCs) present?
> 2. are there other characters from RTL languages present in the same line?
> 
> if both 1 && 2 are true, this is a legitimate use of Unicode CCs. If only 1 is
> true, then it's likely worth a warning.
> 
> Maybe even relax #2 to just check for unicode characters above a certain
> barrier where RTL languages live. I think everyone will agree that if there
> are unicode CCs and no other unicode characters in that same line, it's likely
> not a legitimate use of control characters.

If you are worried about malicious patches, then it should be easy for
attackers to add some RTL characters and escape the check...

Best regards,
								Pavel
-- 
http://www.livejournal.com/~pavelmachek

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: b4: unicode control characters -- warn or remove?
  2021-11-01 20:49       ` Pavel Machek
@ 2021-11-01 21:02         ` Konstantin Ryabitsev
  0 siblings, 0 replies; 12+ messages in thread
From: Konstantin Ryabitsev @ 2021-11-01 21:02 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ævar Arnfjörð Bjarmason, Eric Wong, users, tools, git

[-- Attachment #1: Type: text/plain, Size: 1444 bytes --]

On Mon, Nov 01, 2021 at 09:49:14PM +0100, Pavel Machek wrote:
> > I think the following would be a sane check:
> > 
> > 1. are there unicode control characters (CCs) present?
> > 2. are there other characters from RTL languages present in the same line?
> > 
> > if both 1 && 2 are true, this is a legitimate use of Unicode CCs. If only 1 is
> > true, then it's likely worth a warning.
> > 
> > Maybe even relax #2 to just check for unicode characters above a certain
> > barrier where RTL languages live. I think everyone will agree that if there
> > are unicode CCs and no other unicode characters in that same line, it's likely
> > not a legitimate use of control characters.
> 
> If you are worried about malicious patches, then it should be easy for
> attackers to add some RTL characters and escape the check...

Well, the point of this attack was to trick the reviewer into accepting code
that the compiler would treat differently (e.g. something that looked to be
inside a comment block is actually outside of it).

So, if attackers include some actual RTL text, then the reviewer would no
longer be (as easily) tricked because there would be stuff other than just
invisible characters in the line of code.

This actually similar to how we treat unicode domains. Most browsers only
allow unicode domains when the entire domain name consists of unicode
characters. I suggest we take a similar approach.

-K

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: b4: unicode control characters -- warn or remove?
  2021-11-01 18:10 ` Miguel Ojeda
@ 2021-11-02  8:38   ` Petr Mladek
  2021-11-02 12:17     ` Konstantin Ryabitsev
  0 siblings, 1 reply; 12+ messages in thread
From: Petr Mladek @ 2021-11-02  8:38 UTC (permalink / raw)
  To: Miguel Ojeda; +Cc: Konstantin Ryabitsev, users, tools

On Mon 2021-11-01 19:10:05, Miguel Ojeda wrote:
> On Mon, Nov 1, 2021 at 6:50 PM Konstantin Ryabitsev
> <konstantin@linuxfoundation.org> wrote:
> >
> > Per exhibit a, what should we do in the situation where we discover unicode
> > control characters in an email?
> >
> > 1. Warn and strip these chars out, because they are extremely unlikely to be
> >    doing anything legitimate in the context of a patch (unless someone is
> >    sending patches for docs actually written in RTL languages)
> > 2. Warn and error out, refusing to produce an mbox
> > 3. Just warn and produce an mbox anyway
> 
> I think option #2 would be the best + a flag to override the error (to
> be used if the user has verified it is legitimate).

Yes, please add a --force option to override these kind of errors.
It should be used only when users know what they are doing.
The point is that they will have a chance to check, fix, and
eventually apply the patch.

For example, I was not able to get a patch that was sent as a followup
to a patchset by another user. The original patchset was sent
from @infradead.org. The followup patch from @suse.com.

b4 refused to get it because attestation failed:

$> b4 am -o in YXlKqCPY9suM4mfT@alley "-P "
Looking up https://lore.kernel.org/r/YXlKqCPY9suM4mfT%40alley
Grabbing thread from lore.kernel.org/all/YXlKqCPY9suM4mfT%40alley/t.mbox.gz
Analyzing 16 messages in the thread
Will use the latest revision: v2
You can pick other revisions using the -vN flag
Unknown range value specified: 
Checking attestation on all messages, may take a moment...
---
  ---
  ✗ BADSIG: DKIM/infradead.org
---
Total patches: 0 (cherrypicked:  )
---
Cover: in/v2_20211019_willy_improvements_to_pgp.cover
 Link: https://lore.kernel.org/r/20211019142621.2810043-1-willy@infradead.org
 Base: not specified
       git am in/v2_20211019_willy_improvements_to_pgp.mbx



I think that this has already been discussed somewhere and it is a
rare situation that is not easy to support. Anyway, the patch
was trivial, correct, and --force option would have helped me
to handle it easily.

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: b4: unicode control characters -- warn or remove?
  2021-11-02  8:38   ` Petr Mladek
@ 2021-11-02 12:17     ` Konstantin Ryabitsev
  0 siblings, 0 replies; 12+ messages in thread
From: Konstantin Ryabitsev @ 2021-11-02 12:17 UTC (permalink / raw)
  To: Petr Mladek; +Cc: Miguel Ojeda, users, tools

On Tue, Nov 02, 2021 at 09:38:02AM +0100, Petr Mladek wrote:
> For example, I was not able to get a patch that was sent as a followup
> to a patchset by another user. The original patchset was sent
> from @infradead.org. The followup patch from @suse.com.
> 
> b4 refused to get it because attestation failed:
> 
> $> b4 am -o in YXlKqCPY9suM4mfT@alley "-P "
> Looking up https://lore.kernel.org/r/YXlKqCPY9suM4mfT%40alley
> Grabbing thread from lore.kernel.org/all/YXlKqCPY9suM4mfT%40alley/t.mbox.gz
> Analyzing 16 messages in the thread
> Will use the latest revision: v2
> You can pick other revisions using the -vN flag
> Unknown range value specified: 
> Checking attestation on all messages, may take a moment...
> ---
>   ---
>   ✗ BADSIG: DKIM/infradead.org
> ---
> Total patches: 0 (cherrypicked:  )
> ---
> Cover: in/v2_20211019_willy_improvements_to_pgp.cover
>  Link: https://lore.kernel.org/r/20211019142621.2810043-1-willy@infradead.org
>  Base: not specified
>        git am in/v2_20211019_willy_improvements_to_pgp.mbx

The problem here is that -P is missing a value and not failing in an obvious
way, which is a bug (but unrelated to this discussion). I believe you wanted
to use -P_ here, like this?

	$ b4 am -o/tmp YXlKqCPY9suM4mfT@alley -P _
	Looking up https://lore.kernel.org/r/YXlKqCPY9suM4mfT%40alley
	Analyzing 4 messages in the thread
	Checking attestation on all messages, may take a moment...
	---
	  ✓ [PATCH] vsprintf: Update %pGp documentation about that it prints hex value
		+ Reviewed-by: Yafang Shao <laoar.shao@gmail.com> (✓ DKIM/gmail.com)
	  ---
	  ✓ Signed: DKIM/suse.com
	---
	Total patches: 1 (cherrypicked: <YXlKqCPY9suM4mfT@alley>)
	---
	 Link: https://lore.kernel.org/r/YXlKqCPY9suM4mfT@alley
	 Base: not specified
		   git am /tmp/20211027_pmladek_vsprintf_update_pgp_documentation_about_that_it_prints_hex_value.mbx

I'll add a fix to make incorrect -P values error out more clearly.

-K

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: b4: unicode control characters -- warn or remove?
  2021-11-01 20:22     ` Konstantin Ryabitsev
  2021-11-01 20:49       ` Pavel Machek
@ 2021-11-02 14:09       ` Konstantin Ryabitsev
  1 sibling, 0 replies; 12+ messages in thread
From: Konstantin Ryabitsev @ 2021-11-02 14:09 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Eric Wong, users, tools, git

On Mon, Nov 01, 2021 at 04:22:20PM -0400, Konstantin Ryabitsev wrote:
> I think the following would be a sane check:
> 
> 1. are there unicode control characters (CCs) present?
> 2. are there other characters from RTL languages present in the same line?
> 
> if both 1 && 2 are true, this is a legitimate use of Unicode CCs. If only 1 is
> true, then it's likely worth a warning.

I implemented this solution in b4 master, so it should error out only when it
finds control characters without any "other letter" unicode character category
present in the same line (where Hebrew, Arabic, etc live). There's probably
still a way to take advantage of this, but hopefully it's a lot less trivial
now and less likely to go unnoticed by the reviewer.

The error message will point where it found the problem:

	WARNING: Message contains suspicious unicode control characters!
			 Subject: [PATCH 1/2] SPI: Add SPI driver for Sunplus SP7021
				Line: + /* ‮ } ⁦if (isAdmin)⁩ ⁦ begin admins only */
				-----------^
				Char: RIGHT-TO-LEFT OVERRIDE (0x202e)
			 If you are sure about this, rerun with the right flag to allow.

One can rerun with --allow-unicode-control-chars to override this.

-K

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2021-11-02 14:09 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-01 17:50 b4: unicode control characters -- warn or remove? Konstantin Ryabitsev
2021-11-01 18:10 ` Miguel Ojeda
2021-11-02  8:38   ` Petr Mladek
2021-11-02 12:17     ` Konstantin Ryabitsev
2021-11-01 18:38 ` Florian Weimer
2021-11-01 19:09 ` Eric Wong
2021-11-01 19:17   ` Konstantin Ryabitsev
2021-11-01 20:02   ` Ævar Arnfjörð Bjarmason
2021-11-01 20:22     ` Konstantin Ryabitsev
2021-11-01 20:49       ` Pavel Machek
2021-11-01 21:02         ` Konstantin Ryabitsev
2021-11-02 14:09       ` Konstantin Ryabitsev

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.