Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)

From: Mauro Carvalho Chehab <mchehab@kernel.org>
To: "Michal Suchánek" <msuchanek@suse.de>
Cc: Markus Heiser <markus.heiser@darmarit.de>,
	linux-doc@vger.kernel.org, Jonathan Corbet <corbet@lwn.net>
Subject: Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
Date: Fri, 7 May 2021 10:52:15 +0200
Message-ID: <20210507105215.0902461d@coco.lan>
In-Reply-To: <20210506180625.GI6564@kitsune.suse.cz>

On Thu, 6 May 2021 20:06:25 +0200
Michal Suchánek <msuchanek@suse.de> wrote:

> On Thu, May 06, 2021 at 07:53:25PM +0200, Markus Heiser wrote:

> > Hi Mauro,
> > 
> > it is not comfortable but is it mad? ..
> > 
> > Most often, languages (or applications) do not handle the encoding
> > of strings; they just pipe a binary stream, while Python
> > decodes / encodes strings.
> > 
> > "The Zen of Python" [1] says
> > 
> >    Explicit is better than implicit.

This was taken to an extreme with regard to charsets:

	 "better" should never be translated to "crash" ;-)

> > If a stream can't encode symbols and these symbols should be ignored
> > you have to set the encoding of the stream explicitly to ignore
> > such symbols.
> 
> The problem is this part never happened. Loggers are supposed to tell
> you about the error in your application, not crash it.

It is insane to crash the error log due to a charset issue ;-)
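
As a minimal sketch of a log output that degrades instead of crashing
(this is not what Sphinx does internally; the logger name and the
message are made up for illustration), a stream limited to Latin-1 can
be opened with errors="replace" so that unencodable characters come out
as "?":

```python
import io
import logging

# Simulate a log destination limited to Latin-1, as with a Latin-1 locale.
# errors="replace" turns unencodable characters into "?" instead of
# raising UnicodeEncodeError.
raw = io.BytesIO()
stream = io.TextIOWrapper(raw, encoding="latin-1", errors="replace",
                          newline="\n")

logger = logging.getLogger("charset_demo")  # hypothetical logger name
logger.setLevel(logging.WARNING)
logger.propagate = False

handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
logger.addHandler(handler)

# Three characters that Latin-1 cannot represent.
logger.warning("cannot read file \u65e5\u672c\u8a9e.rst")
stream.flush()
logged = raw.getvalue()
print(logged)
```

On a real console stream, assuming it is an io.TextIOWrapper,
sys.stdout.reconfigure(errors="replace") (Python 3.7+) should have a
similar effect.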

> But the problem with Sphinx may be that the output file is also assumed
> to be in the locale encoding, and the output encoding is never set. It's
> HTML so it could be encoded with entities, too.
> 
> The idea about handling encoding precisely is not mad in itself but then
> everybody working with just ASCII and never testing their software works
> in the cases where explicit handling is needed is the mad part. 

True. The machine's locale shouldn't affect the produced documents
*at all*. See, there's a whole set of non-Latin charset families
supported on Linux:

	https://man7.org/linux/man-pages/man7/charsets.7.html

Nothing prevents someone on a machine whose default encoding is
KOI8-R/BIG-5/GB 2312/JIS X 0208/... from using Sphinx to produce
UTF-8 [1] documents.

[1] or whatever other output encoding
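
To illustrate the point in plain Python (a sketch only; the file name
and the sample text are hypothetical, and this says nothing about how
Sphinx opens its output files): when the encoding is passed explicitly
to open(), the bytes written do not depend on LANG/LC_*:

```python
import os
import tempfile

text = "Such\u00e1nek, \u65e5\u672c\u8a9e, \u041f\u0440\u0438\u0432\u0435\u0442\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.html")  # hypothetical output file
    # encoding= is given explicitly, so the locale's preferred encoding
    # is never consulted for this file.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(text)
    # Read the raw bytes back to check what was actually written.
    with open(path, "rb") as f:
        data = f.read()

print(data)
```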

Ok, the logger may not be able to correctly display certain
chars, but it would be perfectly fine and sane to use //TRANSLIT (or
something similar) in order to do a charset conversion.

Even just printing a <?> for every char that isn't printable at
the logger's output, using the charset set by LANG/LC_*, is
better/saner than crashing.
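
Both fallbacks can be sketched in Python. This is only a rough
approximation in the spirit of iconv's //TRANSLIT, not an equivalent of
it: the handler name "angle_question" and the helper to_charset() are
made up for this example. Accented letters that the target charset
lacks are reduced to their base letter, and anything still unencodable
becomes <?>:

```python
import codecs
import unicodedata


def _angle_question(err):
    # Called by the codec for each run of unencodable characters.
    if isinstance(err, UnicodeEncodeError):
        return ("<?>" * (err.end - err.start), err.end)
    raise err


# Hypothetical handler name, registered for use with errors=...
codecs.register_error("angle_question", _angle_question)


def to_charset(text, charset):
    """Encode text to charset without ever raising UnicodeEncodeError."""
    out = []
    for ch in text:
        try:
            ch.encode(charset)
            out.append(ch)  # representable as-is
            continue
        except UnicodeEncodeError:
            pass
        # Crude transliteration: decompose and drop combining marks.
        decomposed = unicodedata.normalize("NFKD", ch)
        out.append("".join(c for c in decomposed
                           if not unicodedata.combining(c)))
    # Whatever is still unencodable is replaced by "<?>".
    return "".join(out).encode(charset, errors="angle_question")


print(to_charset("Such\u00e1nek", "ascii"))
print(to_charset("\u65e5\u672c\u8a9e.rst", "latin-1"))
```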

Thanks,
Mauro