Re: [PATCH] docs: license-rules.txt: cover SPDX headers on Python scripts

From: Markus Heiser <markus.heiser@darmarit.de>
To: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>,
	Jonathan Corbet <corbet@lwn.net>
Cc: Linux Media Mailing List <linux-media@vger.kernel.org>,
	Mauro Carvalho Chehab <mchehab@infradead.org>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Joe Perches <joe@perches.com>,
	linux-kernel@vger.kernel.org, Jessica Yu <jeyu@kernel.org>,
	Federico Vaga <federico.vaga@vaga.pv.it>,
	Thomas Gleixner <tglx@linutronix.de>,
	linux-doc@vger.kernel.org
Subject: Re: [PATCH] docs: license-rules.txt: cover SPDX headers on Python scripts
Date: Fri, 6 Sep 2019 17:18:19 +0200
Message-ID: <be6f3670-c5d5-e686-8472-d9d33e6c2a6a@darmarit.de>
In-Reply-To: <20190905170733.3a25dee8@coco.lan>

In practice Python needs the "-*- coding: utf-8 -*-" comment to be in one of
the first two lines.  The SPDX tag in practice has to be in one of the first
15 lines:

     ap.add_argument('-m', '--maxlines', type=int, default=15,
                     help='Maximum number of lines to scan in a file. Default 15')

IMO, all we need to patch is the documentation from:

   """The SPDX license identifier in kernel files shall be added at the first
      possible line in a file which can contain a comment. """

to something like ..

   """The SPDX license identifier in kernel files shall be added within the
      first 15 lines of a file which can contain a comment. """
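
To make the "first 15 lines" rule concrete, here is a minimal sketch of such a
scan.  The pattern SPDX_RE and the helper find_spdx_tag() are hypothetical
names made up for this illustration; this is not the parser that
scripts/spdxcheck.py actually uses, only the line-limit idea:

```python
import re

# Hypothetical pattern and helper, for illustration only.
SPDX_RE = re.compile(r"SPDX-License-Identifier:\s*(\S.*)")

def find_spdx_tag(lines, maxlines=15):
    """Return (line_number, expression) for the first SPDX tag found
    within the first `maxlines` lines, or None if there is none."""
    for lineno, line in enumerate(lines, start=1):
        if lineno > maxlines:
            break  # stop scanning, like the --maxlines limit
        match = SPDX_RE.search(line)
        if match:
            return lineno, match.group(1).strip()
    return None

script = [
    "#!/usr/bin/env python3",
    "# -*- coding: utf-8 -*-",
    "# SPDX-License-Identifier: GPL-2.0",
    "print('hello')",
]
result = find_spdx_tag(script)
print(result)
```

With a shebang on line 1 and the coding comment on line 2, the tag lands on
line 3, which is well inside the limit.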

When it comes to encodings, people often tend to mix things up.
Below you find some comments of mine, in the hope of clarifying the encoding mess:

    TL;DR

On 05.09.19 at 22:07, Mauro Carvalho Chehab wrote:
> On Thu, 5 Sep 2019 13:40:08 -0600
> Jonathan Corbet <corbet@lwn.net> wrote:
> 
>> On Thu, 5 Sep 2019 16:28:10 -0300
>> Mauro Carvalho Chehab <mchehab+samsung@kernel.org> wrote:
>>
>>> I don't think we can count that python 3 uses utf-8 per default.
>>>
>>> I strongly suspect that, if one uses a Python3 version < 3.7, it will
>>> still default to ASCII.
>>>
>>> On a quick look, the new UTF-8 mode was added on PEP-540:
>>>
>>> 	https://www.python.org/dev/peps/pep-0540/
>>>
>>> Such change happened at Python 3.7.
>>
>> That PEP is to override the locale and use utf8 unconditionally.  It
>> says, with regard to the pre-PEP state:
>>
>> 	UTF-8 is also the default encoding of Python scripts, XML and JSON
>> 	file formats.
>>
>> Unicode was the reason for much of the Python 3 pain; it seems unlikely
>> that many installations are defaulting to ASCII anyway...?

Don't mix up Unicode and UTF-8.

What has changed in python 3 compared to python 2 is the internal
representation of the string type (which has nothing to do with the
encoding of files!).

Python2 type str() -- results in a "byte literal"
   https://docs.python.org/2.7/library/functions.html#str

Python2 type unicode() -- results in a "unicode literal"
   https://docs.python.org/2.7/library/functions.html#unicode

Python3 type str() -- results in a unicode literal
   https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str

Python3 unicode() -- this type is not defined, was replaced by str()

Python3 bytes() -- results in a byte literal (what str() was in py2)
   https://docs.python.org/3/library/stdtypes.html#bytes

This is mostly a pain when your source works with byte-streams
and you have to switch from py2 to py3.  But this has nothing to do
with the encoding of source files, with how the encoding is tagged
in a file, or with what the default encoding of a source file is
when such tags are omitted.
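
A small py3 sketch of the str/bytes split described above (the variable names
are only illustrative).  It shows that text and bytes are distinct types and
that py3 refuses to mix them silently, which is where the porting pain comes
from:

```python
text = "Schei\u00dfe"          # str: a sequence of unicode code points
data = text.encode("utf-8")    # bytes: one possible encoded form of it

print(type(text).__name__, len(text))   # 7 code points
print(type(data).__name__, len(data))   # 8 bytes, the sharp s takes two

# py2 would implicitly convert here; py3 raises a TypeError instead.
try:
    mixed = text + data
except TypeError as exc:
    mixed = None
    print("cannot mix str and bytes: %s" % exc)

# The conversion has to be explicit, and you have to name the encoding.
roundtrip = data.decode("utf-8")
```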

First let's have a look at the default encoding of py source:

   Python will default to ASCII as standard encoding if no other
   encoding hints are given.

from  https://www.python.org/dev/peps/pep-0263/#defining-the-encoding

My addition: that is correct for py2.  In py3 the default is UTF-8:

   PEP 3120: The default source encoding is now UTF-8.

from 
https://docs.python.org/3.0/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit
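
If you want to ask py3 itself which source encoding it would assume, the
standard library's tokenize.detect_encoding() applies the PEP 263 rules.  The
wrapper source_encoding() below is a hypothetical helper for this mail, and
the sample byte strings are made up:

```python
import io
import tokenize

def source_encoding(source_bytes):
    """Return the encoding py3 would use to read this source."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines = tokenize.detect_encoding(readline)
    return encoding

# No coding comment at all: py3 falls back to UTF-8 (PEP 3120).
no_tag = source_encoding(b"print('hi')\n")

# Coding comment on line 2: it is honoured.
tagged = source_encoding(
    b"#!/usr/bin/env python3\n"
    b"# -*- coding: latin-1 -*-\n")

# Coding comment on line 3: too late, it is ignored.
too_late = source_encoding(
    b"#!/usr/bin/env python3\n"
    b"# SPDX-License-Identifier: GPL-2.0\n"
    b"# -*- coding: latin-1 -*-\n")

print(no_tag, tagged, too_late)
```

The last case is why the coding comment, when one is needed, has to win the
fight for line 1 or 2 and the SPDX tag has to follow it.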

> 
> Yeah, but I remember that UTF-8 handling changed a few times during python 3
> releases. I didn't really track what happened, as I don't usually program
> in Python. So, I'm actually relying on what I can find about that.
> 
> Looking at Python 3.0 release[1], it says:
> 
> 	"In many cases, but not all, the system default is UTF-8;
> 	 you should never count on this default."
> 
> [1] https://docs.python.org/3.0/whatsnew/3.0.html
> 
> So, at least on early Python 3 releases, the default may not be UTF-8.
> 
> I don't know about you, but, from time to time, people complain about
> UTF-8 chars when I'm handling patches (last time was on a patch series
> for Kernel 5.3 by a core dev in Australia, who was unable to apply a
> patch from me which had some UTF-8 chars).
> 

Don't mix up the encoding of a text file which is read by the standard
open() function in py2/py3 with the source file encoding used by
the interpreter for reading the source file itself.  To complete
your citation:

   In many cases, but not all, the system default is UTF-8; you
   should never count on this default. Any application reading or
   writing more than pure ASCII text should probably have a way to
   override the encoding.

This means your application has to know the encoding of a stream/file.
E.g. we handle the output of the external Perl script scripts/kernel-doc
by decoding the byte streams from the subprocess's stdout and stderr as
UTF-8:

    out, err = codecs.decode(out, 'utf-8'), codecs.decode(err, 'utf-8')

see patch 
https://github.com/torvalds/linux/commit/86c0f046a8b0c23fca65f77333c233a06c25ef9a

Again, this is talking about application development and has
nothing to do with the encoding of the source files.
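
As a self-contained sketch of that pattern: the child process below is a
stand-in (a second python3 writing UTF-8 bytes), not the real kernel-doc call,
and the variable names are assumptions for this example.  The point is only
that a pipe delivers bytes and the application decodes them explicitly:

```python
import codecs
import subprocess
import sys

# Stand-in child process that writes UTF-8 encoded bytes to its stdout.
child = [sys.executable, "-c",
         "import sys; sys.stdout.buffer.write('Schei\\xdfe'.encode('utf-8'))"]

proc = subprocess.Popen(child, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
raw_out, raw_err = proc.communicate()   # both are bytes

# The application knows the child emits UTF-8, so it says so explicitly
# instead of relying on the locale.
out, err = codecs.decode(raw_out, 'utf-8'), codecs.decode(raw_err, 'utf-8')
print(repr(raw_out), "->", repr(out))
```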

> So, I'm pretty sure that some devs don't set the locale to UTF8 even
> these days.

The LANG environment variable influences only the default encoding of streams
like stdout; it does not change the default encoding for source code files.

To clarify by example, create a test123.py file and save it as !! UTF-8 !!.

   import sys, locale
   print("system's default encoding: " +  sys.getdefaultencoding())
   print("sys.stdout's encoding: " + sys.stdout.encoding)
   print("locale's prefered encoding: " +  locale.getpreferredencoding())
   text = "Encoding ist Scheiße"
   try:
       print("::" + text)
   except UnicodeEncodeError as exc:
       print("appl had a UnicodeEncodeError exception: %s" % exc)

Probe encoding of the source file and run some tests:

   $ file test123.py
   test123.py: Python script, UTF-8 Unicode text executable

Let's see how it is stored::

   $ hexdump -C test123.py | grep Schei
   000000e0  6e 67 20 69 73 74 20 53  63 68 65 69 c3 9f 65 22  |ng ist Schei..e"|

The 'ß' is a two-byte char (0xc3 0x9f) in UTF-8 and its unicode code point is
U+00DF.
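
The same numbers can be checked from py3 without a hexdump.  This is just a
sketch; the variable names are illustrative:

```python
char = "\u00df"                           # LATIN SMALL LETTER SHARP S

code_point = ord(char)                    # the unicode code point
utf8_bytes = char.encode("utf-8")         # how UTF-8 stores it
latin1_bytes = char.encode("iso-8859-1")  # how ISO-8859-1 stores it

print("U+%04X" % code_point)
print(utf8_bytes.hex(), latin1_bytes.hex())
```

The code point is one thing (U+00DF); the bytes on disk depend on which
encoding was used to write them.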

Since we encoded the file in UTF-8 and python2 expects ASCII by default,
we get a SyntaxError::

   $ python2 test123.py
   File "test123.py", line 6
   SyntaxError: Non-ASCII character '\xc3' in file test123.py on line 6,
   but no encoding declared; see http://python.org/dev/peps/pep-0263/ for
   details

The fix for py2 is to add a magic comment (# -*- coding: utf-8 -*-) in one of
the first two lines.  Py3 expects UTF-8 encoded source by default, see
PEP-3120 vs PEP-0263.  And we get what we expected:

   $ python3 test123.py
   system's default encoding: utf-8
   sys.stdout's encoding: UTF-8
   locale's prefered encoding: UTF-8
   ::Encoding ist Scheiße

OK, now let's see what happens when locale (LANG) is changed:

   $ LANG=POSIX python3 test123.py
   system's default encoding: utf-8
   sys.stdout's encoding: ANSI_X3.4-1968
   locale's prefered encoding: ANSI_X3.4-1968
   application print throws UnicodeEncodeError: 'ascii' codec can't encode
   character '\xdf' in position 21: ordinal not in range(128)

The interpreter has no problem reading the source file, because the py3
system's default encoding stays UTF-8.  Only the print() call in the
application throws a UnicodeEncodeError, when it writes the internal unicode
string (with a non-ASCII char in it) to a stdout which now encodes to ASCII
(aka ANSI_X3.4-1968).
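
The same effect can be reproduced without touching LANG by building a text
stream with a chosen encoding, which is roughly what sys.stdout is.  The
helper emit() is a hypothetical name for this sketch:

```python
import io

def emit(text, stream_encoding):
    """Write text to an in-memory stream that encodes like a stdout
    with the given encoding, and return the bytes that came out."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=stream_encoding)
    stream.write(text)
    stream.flush()
    return raw.getvalue()

text = "Schei\u00dfe"

as_utf8 = emit(text, "utf-8")          # two bytes for the sharp s
as_latin1 = emit(text, "iso-8859-1")   # one byte for the sharp s

# An ASCII stream cannot represent U+00DF at all.
try:
    emit(text, "ascii")
    ascii_failed = False
except UnicodeEncodeError as exc:
    ascii_failed = True
    print("stream could not encode the text: %s" % exc)
```

The string object is identical in all three calls; only the encoding of the
output stream decides whether and how it can be written.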

Next, let's try ISO-8859-1:

   $ LANG=en_US.ISO-8859-1 python3 test123.py
   system's default encoding: utf-8
   sys.stdout's encoding: ISO-8859-1
   locale's prefered encoding: ISO-8859-1
   ::Encoding ist Schei�e

Again, the interpreter has no problem reading the UTF-8 source file (no
SyntaxError).  Only the application does not work as expected...

The German 'ß' was correctly read by the interpreter and converted to the
internal unicode representation '\u00DF'.  When it is written to stdout (print)
it is correctly encoded from unicode to ISO-8859-1, but why do we get a '�'?
Because my X terminal still uses UTF-8 even when I change LANG.
Let's use luit(1) and everything is fine:

   $ LANG=en_US.ISO-8859-1 luit -encoding ISO-8859-1 python3 test123.py
   system's default encoding: utf-8
   sys.stdout's encoding: ISO-8859-1
   locale's prefered encoding: ISO-8859-1
   ::Encoding ist Scheiße
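
The '�' from the run without luit can be reproduced in py3 alone.  This sketch
plays both roles: the application encoding to ISO-8859-1 and a terminal that
decodes the bytes as UTF-8 (the variable names are illustrative):

```python
text = "Schei\u00dfe"

# What the application writes when stdout encodes to ISO-8859-1.
wire = text.encode("iso-8859-1")

# What a UTF-8 terminal makes of those bytes: the lone 0xdf is not a
# valid UTF-8 sequence, so it is shown as U+FFFD REPLACEMENT CHARACTER.
shown_by_utf8_terminal = wire.decode("utf-8", errors="replace")

# What a terminal that really speaks ISO-8859-1 shows.
shown_by_latin1_terminal = wire.decode("iso-8859-1")

print(repr(wire))
print(shown_by_utf8_terminal, shown_by_latin1_terminal)
```

Both sides did their job correctly; they just disagreed about the encoding.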

There is much more to say about encoding; we haven't talked about all the
pitfalls, e.g. with filename encoding or the encoding used by editors ;)

-- Markus --