linux-doc.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
@ 2021-05-06 10:39 Michal Suchánek
  2021-05-06 11:20 ` Mauro Carvalho Chehab
  2021-05-06 15:57 ` Markus Heiser
  0 siblings, 2 replies; 41+ messages in thread
From: Michal Suchánek @ 2021-05-06 10:39 UTC (permalink / raw)
  To: linux-doc; +Cc: Jonathan Corbet, Mauro Carvalho Chehab

When building HTML documentation I get this output:

[  120s] + make O=/home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/html PYTHON=python3 htmldocs
[  120s] make[1]: Entering directory '/home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/html'
[  120s] cat: /etc/os-release: No such file or directory
[  121s]   SPHINX  htmldocs --> file:///home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/html/Documentation/output
[  121s]   PARSE   include/uapi/linux/dvb/audio.h
[  121s]   PARSE   include/uapi/linux/dvb/ca.h
[  121s]   PARSE   include/uapi/linux/dvb/dmx.h
[  121s]   PARSE   include/uapi/linux/dvb/frontend.h
[  122s]   PARSE   include/uapi/linux/dvb/net.h
[  122s]   PARSE   include/uapi/linux/dvb/video.h
[  122s]   PARSE   include/uapi/linux/videodev2.h
[  122s]   PARSE   include/uapi/linux/media.h
[  122s]   PARSE   include/uapi/linux/cec.h
[  122s]   PARSE   include/uapi/linux/lirc.h
[  190s] ../include/linux/dcache.h:318: warning: expecting prototype for dget, dget_dlock(). Prototype was for dget_dlock() instead
[  203s] ../include/linux/regulator/driver.h:388: warning: Function parameter or member 'ramp_reg' not described in 'regulator_desc'
[  203s] ../include/linux/regulator/driver.h:388: warning: Function parameter or member 'ramp_mask' not described in 'regulator_desc'
[  203s] ../include/linux/regulator/driver.h:388: warning: Function parameter or member 'ramp_delay_table' not described in 'regulator_desc'
[  203s] ../include/linux/regulator/driver.h:388: warning: Function parameter or member 'n_ramp_values' not described in 'regulator_desc'
[  203s] ../include/linux/spi/spi.h:671: warning: Function parameter or member 'devm_allocated' not described in 'spi_controller'
[  203s] ../drivers/usb/dwc3/core.h:865: warning: Function parameter or member 'hwparams9' not described in 'dwc3_hwparams'
[  233s] ../drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:2808: warning: Excess function parameter 'vm_context' description in 'amdgpu_vm_init'
[  233s] ../drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:2808: warning: Excess function parameter 'vm_context' description in 'amdgpu_vm_init'
[  233s] ../drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h:426: warning: Function parameter or member 'disable_hpd_irq' not described in 'amdgpu_display_manager'
[  233s] ../drivers/gpu/drm/i915/i915_cmd_parser.c:1420: warning: Excess function parameter 'trampoline' description in 'intel_engine_cmd_parser'
[  233s] ../drivers/gpu/drm/i915/i915_cmd_parser.c:1420: warning: Function parameter or member 'jump_whitelist' not described in 'intel_engine_cmd_parser'
[  233s] ../drivers/gpu/drm/i915/i915_cmd_parser.c:1420: warning: Function parameter or member 'shadow_map' not described in 'intel_engine_cmd_parser'
[  233s] ../drivers/gpu/drm/i915/i915_cmd_parser.c:1420: warning: Function parameter or member 'batch_map' not described in 'intel_engine_cmd_parser'
[  233s] ../drivers/gpu/drm/i915/i915_cmd_parser.c:1420: warning: Excess function parameter 'trampoline' description in 'intel_engine_cmd_parser'
[  233s] ../drivers/gpu/host1x/bus.c:774: warning: Excess function parameter 'key' description in '__host1x_client_register'
[  234s] /home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/Documentation/ABI/testing/sysfs-platform-intel-pmc:2: WARNING: Definition list ends without a blank line; unexpected unindent.
[  234s] /home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/Documentation/driver-api/serial/index.rst:17: WARNING: toctree contains reference to nonexisting document 'driver-api/serial/rocket'
[  234s] /home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/Documentation/gpu/amdgpu:96: ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:323: WARNING: Unexpected indentation.
[  234s] /home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/Documentation/gpu/amdgpu:96: ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:324: WARNING: Block quote ends without a blank line; unexpected unindent.
[  234s] /home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/Documentation/gpu/amdgpu:96: ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:327: WARNING: Definition list ends without a blank line; unexpected unindent.
[  307s] /home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/Documentation/driver-api/usb/writing_usb_driver.rst:129: WARNING: undefined label: usb_header (if the link has no caption the label must precede a section header)
[  412s] 
[  412s] Sphinx parallel build error:
[  412s] UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
[  431s] make[2]: *** [../Documentation/Makefile:91: htmldocs] Error 2
[  431s] make[1]: *** [/home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/Makefile:1784: htmldocs] Error 2
[  431s] make[1]: Leaving directory '/home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/html'
[  431s] make: *** [Makefile:222: __sub-make] Error 2
[  431s] error: Bad exit status from /var/tmp/rpm-tmp.npkyVx (%build)

It does not say which input file contains the offending character so I can't tell which file is broken.

Any idea how to debug?

Thanks

Michal

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-06 10:39 Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256) Michal Suchánek
@ 2021-05-06 11:20 ` Mauro Carvalho Chehab
  2021-05-06 13:32   ` Michal Suchánek
  2021-05-06 15:57 ` Markus Heiser
  1 sibling, 1 reply; 41+ messages in thread
From: Mauro Carvalho Chehab @ 2021-05-06 11:20 UTC (permalink / raw)
  To: Michal Suchánek; +Cc: linux-doc, Jonathan Corbet

Em Thu, 6 May 2021 12:39:13 +0200
Michal Suchánek <msuchanek@suse.de> escreveu:

> When building HTML documentation I get this output:
> 
> [  120s] + make O=/home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/html PYTHON=python3 htmldocs
> [  120s] make[1]: Entering directory '/home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/html'
> [  120s] cat: /etc/os-release: No such file or directory
> [  121s]   SPHINX  htmldocs --> file:///home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/html/Documentation/output
> [  121s]   PARSE   include/uapi/linux/dvb/audio.h
> [  121s]   PARSE   include/uapi/linux/dvb/ca.h
> [  121s]   PARSE   include/uapi/linux/dvb/dmx.h
> [  121s]   PARSE   include/uapi/linux/dvb/frontend.h
> [  122s]   PARSE   include/uapi/linux/dvb/net.h
> [  122s]   PARSE   include/uapi/linux/dvb/video.h
> [  122s]   PARSE   include/uapi/linux/videodev2.h
> [  122s]   PARSE   include/uapi/linux/media.h
> [  122s]   PARSE   include/uapi/linux/cec.h
> [  122s]   PARSE   include/uapi/linux/lirc.h
> [  190s] ../include/linux/dcache.h:318: warning: expecting prototype for dget, dget_dlock(). Prototype was for dget_dlock() instead
> [  203s] ../include/linux/regulator/driver.h:388: warning: Function parameter or member 'ramp_reg' not described in 'regulator_desc'
> [  203s] ../include/linux/regulator/driver.h:388: warning: Function parameter or member 'ramp_mask' not described in 'regulator_desc'
> [  203s] ../include/linux/regulator/driver.h:388: warning: Function parameter or member 'ramp_delay_table' not described in 'regulator_desc'
> [  203s] ../include/linux/regulator/driver.h:388: warning: Function parameter or member 'n_ramp_values' not described in 'regulator_desc'
> [  203s] ../include/linux/spi/spi.h:671: warning: Function parameter or member 'devm_allocated' not described in 'spi_controller'
> [  203s] ../drivers/usb/dwc3/core.h:865: warning: Function parameter or member 'hwparams9' not described in 'dwc3_hwparams'
> [  233s] ../drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:2808: warning: Excess function parameter 'vm_context' description in 'amdgpu_vm_init'
> [  233s] ../drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:2808: warning: Excess function parameter 'vm_context' description in 'amdgpu_vm_init'
> [  233s] ../drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h:426: warning: Function parameter or member 'disable_hpd_irq' not described in 'amdgpu_display_manager'
> [  233s] ../drivers/gpu/drm/i915/i915_cmd_parser.c:1420: warning: Excess function parameter 'trampoline' description in 'intel_engine_cmd_parser'
> [  233s] ../drivers/gpu/drm/i915/i915_cmd_parser.c:1420: warning: Function parameter or member 'jump_whitelist' not described in 'intel_engine_cmd_parser'
> [  233s] ../drivers/gpu/drm/i915/i915_cmd_parser.c:1420: warning: Function parameter or member 'shadow_map' not described in 'intel_engine_cmd_parser'
> [  233s] ../drivers/gpu/drm/i915/i915_cmd_parser.c:1420: warning: Function parameter or member 'batch_map' not described in 'intel_engine_cmd_parser'
> [  233s] ../drivers/gpu/drm/i915/i915_cmd_parser.c:1420: warning: Excess function parameter 'trampoline' description in 'intel_engine_cmd_parser'
> [  233s] ../drivers/gpu/host1x/bus.c:774: warning: Excess function parameter 'key' description in '__host1x_client_register'
> [  234s] /home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/Documentation/ABI/testing/sysfs-platform-intel-pmc:2: WARNING: Definition list ends without a blank line; unexpected unindent.
> [  234s] /home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/Documentation/driver-api/serial/index.rst:17: WARNING: toctree contains reference to nonexisting document 'driver-api/serial/rocket'
> [  234s] /home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/Documentation/gpu/amdgpu:96: ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:323: WARNING: Unexpected indentation.
> [  234s] /home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/Documentation/gpu/amdgpu:96: ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:324: WARNING: Block quote ends without a blank line; unexpected unindent.
> [  234s] /home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/Documentation/gpu/amdgpu:96: ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:327: WARNING: Definition list ends without a blank line; unexpected unindent.
> [  307s] /home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/Documentation/driver-api/usb/writing_usb_driver.rst:129: WARNING: undefined label: usb_header (if the link has no caption the label must precede a section header)
> [  412s] 
> [  412s] Sphinx parallel build error:
> [  412s] UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
> [  431s] make[2]: *** [../Documentation/Makefile:91: htmldocs] Error 2
> [  431s] make[1]: *** [/home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/Makefile:1784: htmldocs] Error 2
> [  431s] make[1]: Leaving directory '/home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/html'
> [  431s] make: *** [Makefile:222: __sub-make] Error 2
> [  431s] error: Bad exit status from /var/tmp/rpm-tmp.npkyVx (%build)
> 
> It does not say which input file contains the offending character so I can't tell which file is broken.
> 
> Any idea how to debug?

Yes. You probably has some weird file under Documentation/ABI.
Some text editors like kate tend to keep temporary files sometimes.

The scripts/get_ABI.pl (currently) doesn't have any logic
to recognize valid ABI files from trash stuff added at
the ABI dirs.

Just doing a git status (or a git clean) and removing such
files should fix the build.

Regards,
Mauro

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-06 11:20 ` Mauro Carvalho Chehab
@ 2021-05-06 13:32   ` Michal Suchánek
  2021-05-06 14:24     ` Mauro Carvalho Chehab
  0 siblings, 1 reply; 41+ messages in thread
From: Michal Suchánek @ 2021-05-06 13:32 UTC (permalink / raw)
  To: Mauro Carvalho Chehab; +Cc: linux-doc, Jonathan Corbet

On Thu, May 06, 2021 at 01:20:06PM +0200, Mauro Carvalho Chehab wrote:
> Em Thu, 6 May 2021 12:39:13 +0200
> Michal Suchánek <msuchanek@suse.de> escreveu:
> 
> > When building HTML documentation I get this output:
> > 
> > [  120s] + make O=/home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/html PYTHON=python3 htmldocs
> > [  120s] make[1]: Entering directory '/home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/html'
> > [  120s] cat: /etc/os-release: No such file or directory
> > [  121s]   SPHINX  htmldocs --> file:///home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/html/Documentation/output
> > [  121s]   PARSE   include/uapi/linux/dvb/audio.h
> > [  121s]   PARSE   include/uapi/linux/dvb/ca.h
> > [  121s]   PARSE   include/uapi/linux/dvb/dmx.h
> > [  121s]   PARSE   include/uapi/linux/dvb/frontend.h
> > [  122s]   PARSE   include/uapi/linux/dvb/net.h
> > [  122s]   PARSE   include/uapi/linux/dvb/video.h
> > [  122s]   PARSE   include/uapi/linux/videodev2.h
> > [  122s]   PARSE   include/uapi/linux/media.h
> > [  122s]   PARSE   include/uapi/linux/cec.h
> > [  122s]   PARSE   include/uapi/linux/lirc.h
> > [  190s] ../include/linux/dcache.h:318: warning: expecting prototype for dget, dget_dlock(). Prototype was for dget_dlock() instead
> > [  203s] ../include/linux/regulator/driver.h:388: warning: Function parameter or member 'ramp_reg' not described in 'regulator_desc'
> > [  203s] ../include/linux/regulator/driver.h:388: warning: Function parameter or member 'ramp_mask' not described in 'regulator_desc'
> > [  203s] ../include/linux/regulator/driver.h:388: warning: Function parameter or member 'ramp_delay_table' not described in 'regulator_desc'
> > [  203s] ../include/linux/regulator/driver.h:388: warning: Function parameter or member 'n_ramp_values' not described in 'regulator_desc'
> > [  203s] ../include/linux/spi/spi.h:671: warning: Function parameter or member 'devm_allocated' not described in 'spi_controller'
> > [  203s] ../drivers/usb/dwc3/core.h:865: warning: Function parameter or member 'hwparams9' not described in 'dwc3_hwparams'
> > [  233s] ../drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:2808: warning: Excess function parameter 'vm_context' description in 'amdgpu_vm_init'
> > [  233s] ../drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:2808: warning: Excess function parameter 'vm_context' description in 'amdgpu_vm_init'
> > [  233s] ../drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h:426: warning: Function parameter or member 'disable_hpd_irq' not described in 'amdgpu_display_manager'
> > [  233s] ../drivers/gpu/drm/i915/i915_cmd_parser.c:1420: warning: Excess function parameter 'trampoline' description in 'intel_engine_cmd_parser'
> > [  233s] ../drivers/gpu/drm/i915/i915_cmd_parser.c:1420: warning: Function parameter or member 'jump_whitelist' not described in 'intel_engine_cmd_parser'
> > [  233s] ../drivers/gpu/drm/i915/i915_cmd_parser.c:1420: warning: Function parameter or member 'shadow_map' not described in 'intel_engine_cmd_parser'
> > [  233s] ../drivers/gpu/drm/i915/i915_cmd_parser.c:1420: warning: Function parameter or member 'batch_map' not described in 'intel_engine_cmd_parser'
> > [  233s] ../drivers/gpu/drm/i915/i915_cmd_parser.c:1420: warning: Excess function parameter 'trampoline' description in 'intel_engine_cmd_parser'
> > [  233s] ../drivers/gpu/host1x/bus.c:774: warning: Excess function parameter 'key' description in '__host1x_client_register'
> > [  234s] /home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/Documentation/ABI/testing/sysfs-platform-intel-pmc:2: WARNING: Definition list ends without a blank line; unexpected unindent.
> > [  234s] /home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/Documentation/driver-api/serial/index.rst:17: WARNING: toctree contains reference to nonexisting document 'driver-api/serial/rocket'
> > [  234s] /home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/Documentation/gpu/amdgpu:96: ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:323: WARNING: Unexpected indentation.
> > [  234s] /home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/Documentation/gpu/amdgpu:96: ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:324: WARNING: Block quote ends without a blank line; unexpected unindent.
> > [  234s] /home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/Documentation/gpu/amdgpu:96: ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:327: WARNING: Definition list ends without a blank line; unexpected unindent.
> > [  307s] /home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/Documentation/driver-api/usb/writing_usb_driver.rst:129: WARNING: undefined label: usb_header (if the link has no caption the label must precede a section header)
> > [  412s] 
> > [  412s] Sphinx parallel build error:
> > [  412s] UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
> > [  431s] make[2]: *** [../Documentation/Makefile:91: htmldocs] Error 2
> > [  431s] make[1]: *** [/home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/Makefile:1784: htmldocs] Error 2
> > [  431s] make[1]: Leaving directory '/home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/html'
> > [  431s] make: *** [Makefile:222: __sub-make] Error 2
> > [  431s] error: Bad exit status from /var/tmp/rpm-tmp.npkyVx (%build)
> > 
> > It does not say which input file contains the offending character so I can't tell which file is broken.
> > 
> > Any idea how to debug?
> 
> Yes. You probably has some weird file under Documentation/ABI.
> Some text editors like kate tend to keep temporary files sometimes.
> 
> The scripts/get_ABI.pl (currently) doesn't have any logic
> to recognize valid ABI files from trash stuff added at
> the ABI dirs.
> 
> Just doing a git status (or a git clean) and removing such
> files should fix the build.

This is clean git-archived tarball uploaded to a build service so the
likehood of some garbage files popping out in Documentation/ABI out of
nowhere is quite small.

Thanks

Michal

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-06 13:32   ` Michal Suchánek
@ 2021-05-06 14:24     ` Mauro Carvalho Chehab
  2021-05-06 14:35       ` Michal Suchánek
  0 siblings, 1 reply; 41+ messages in thread
From: Mauro Carvalho Chehab @ 2021-05-06 14:24 UTC (permalink / raw)
  To: Michal Suchánek; +Cc: linux-doc, Jonathan Corbet

Em Thu, 6 May 2021 15:32:12 +0200
Michal Suchánek <msuchanek@suse.de> escreveu:

> On Thu, May 06, 2021 at 01:20:06PM +0200, Mauro Carvalho Chehab wrote:
> > Em Thu, 6 May 2021 12:39:13 +0200
> > Michal Suchánek <msuchanek@suse.de> escreveu:
> >   
> > > When building HTML documentation I get this output:
> > > 
> > > [  120s] + make O=/home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/html PYTHON=python3 htmldocs
> > > [  120s] make[1]: Entering directory '/home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/html'
> > > [  120s] cat: /etc/os-release: No such file or directory
> > > [  121s]   SPHINX  htmldocs --> file:///home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/html/Documentation/output
> > > [  121s]   PARSE   include/uapi/linux/dvb/audio.h
> > > [  121s]   PARSE   include/uapi/linux/dvb/ca.h
> > > [  121s]   PARSE   include/uapi/linux/dvb/dmx.h
> > > [  121s]   PARSE   include/uapi/linux/dvb/frontend.h
> > > [  122s]   PARSE   include/uapi/linux/dvb/net.h
> > > [  122s]   PARSE   include/uapi/linux/dvb/video.h
> > > [  122s]   PARSE   include/uapi/linux/videodev2.h
> > > [  122s]   PARSE   include/uapi/linux/media.h
> > > [  122s]   PARSE   include/uapi/linux/cec.h
> > > [  122s]   PARSE   include/uapi/linux/lirc.h
> > > [  190s] ../include/linux/dcache.h:318: warning: expecting prototype for dget, dget_dlock(). Prototype was for dget_dlock() instead
> > > [  203s] ../include/linux/regulator/driver.h:388: warning: Function parameter or member 'ramp_reg' not described in 'regulator_desc'
> > > [  203s] ../include/linux/regulator/driver.h:388: warning: Function parameter or member 'ramp_mask' not described in 'regulator_desc'
> > > [  203s] ../include/linux/regulator/driver.h:388: warning: Function parameter or member 'ramp_delay_table' not described in 'regulator_desc'
> > > [  203s] ../include/linux/regulator/driver.h:388: warning: Function parameter or member 'n_ramp_values' not described in 'regulator_desc'
> > > [  203s] ../include/linux/spi/spi.h:671: warning: Function parameter or member 'devm_allocated' not described in 'spi_controller'
> > > [  203s] ../drivers/usb/dwc3/core.h:865: warning: Function parameter or member 'hwparams9' not described in 'dwc3_hwparams'
> > > [  233s] ../drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:2808: warning: Excess function parameter 'vm_context' description in 'amdgpu_vm_init'
> > > [  233s] ../drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:2808: warning: Excess function parameter 'vm_context' description in 'amdgpu_vm_init'
> > > [  233s] ../drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h:426: warning: Function parameter or member 'disable_hpd_irq' not described in 'amdgpu_display_manager'
> > > [  233s] ../drivers/gpu/drm/i915/i915_cmd_parser.c:1420: warning: Excess function parameter 'trampoline' description in 'intel_engine_cmd_parser'
> > > [  233s] ../drivers/gpu/drm/i915/i915_cmd_parser.c:1420: warning: Function parameter or member 'jump_whitelist' not described in 'intel_engine_cmd_parser'
> > > [  233s] ../drivers/gpu/drm/i915/i915_cmd_parser.c:1420: warning: Function parameter or member 'shadow_map' not described in 'intel_engine_cmd_parser'
> > > [  233s] ../drivers/gpu/drm/i915/i915_cmd_parser.c:1420: warning: Function parameter or member 'batch_map' not described in 'intel_engine_cmd_parser'
> > > [  233s] ../drivers/gpu/drm/i915/i915_cmd_parser.c:1420: warning: Excess function parameter 'trampoline' description in 'intel_engine_cmd_parser'
> > > [  233s] ../drivers/gpu/host1x/bus.c:774: warning: Excess function parameter 'key' description in '__host1x_client_register'
> > > [  234s] /home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/Documentation/ABI/testing/sysfs-platform-intel-pmc:2: WARNING: Definition list ends without a blank line; unexpected unindent.
> > > [  234s] /home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/Documentation/driver-api/serial/index.rst:17: WARNING: toctree contains reference to nonexisting document 'driver-api/serial/rocket'
> > > [  234s] /home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/Documentation/gpu/amdgpu:96: ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:323: WARNING: Unexpected indentation.
> > > [  234s] /home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/Documentation/gpu/amdgpu:96: ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:324: WARNING: Block quote ends without a blank line; unexpected unindent.
> > > [  234s] /home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/Documentation/gpu/amdgpu:96: ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:327: WARNING: Definition list ends without a blank line; unexpected unindent.
> > > [  307s] /home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/Documentation/driver-api/usb/writing_usb_driver.rst:129: WARNING: undefined label: usb_header (if the link has no caption the label must precede a section header)
> > > [  412s] 
> > > [  412s] Sphinx parallel build error:
> > > [  412s] UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
> > > [  431s] make[2]: *** [../Documentation/Makefile:91: htmldocs] Error 2
> > > [  431s] make[1]: *** [/home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/Makefile:1784: htmldocs] Error 2
> > > [  431s] make[1]: Leaving directory '/home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/html'
> > > [  431s] make: *** [Makefile:222: __sub-make] Error 2
> > > [  431s] error: Bad exit status from /var/tmp/rpm-tmp.npkyVx (%build)
> > > 
> > > It does not say which input file contains the offending character so I can't tell which file is broken.
> > > 
> > > Any idea how to debug?  
> > 
> > Yes. You probably has some weird file under Documentation/ABI.
> > Some text editors like kate tend to keep temporary files sometimes.
> > 
> > The scripts/get_ABI.pl (currently) doesn't have any logic
> > to recognize valid ABI files from trash stuff added at
> > the ABI dirs.
> > 
> > Just doing a git status (or a git clean) and removing such
> > files should fix the build.  
> 
> This is clean git-archived tarball uploaded to a build service so the
> likehood of some garbage files popping out in Documentation/ABI out of
> nowhere is quite small.

Well, it could be something completely different ;-) 

This crash happens when Sphinx/python finds a character it doesn't 
recognize as valid, like if you run something like:

	$ echo -en "What:\tBROKEN description\ndescription:\n";dd if=/dev/random count=10 ) > Documentation/ABI/testing/foobar
	$ ./scripts/get_abi.pl  rest 2>/dev/null|grep BROK
	Binary file (standard input) matches

On such case, non UTF-8 chars end being inserted, causing python and/or Sphinx
exceptions, causing it to crash:

	WARNING: The kernel documentation build process
	        support for Sphinx v3.0 and above is brand new. Be prepared for
	        possible issues in the generated output.
	enabling CJK for LaTeX builder

	Sphinx parallel build error:
	UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa8 in position 1985806: invalid start byte
	make[1]: *** [Documentation/Makefile:91: htmldocs] Error 2
	make: *** [Makefile:1790: htmldocs] Error 2

Yet, this sounds a weird to me:

	UnicodeEncodeError: 'latin-1' 

It sounds that it is somehow trying to use latin-1 alphabet, instead of utf-8.

This will certainly cause troubles, as there are non-latin-1 characters
at the docs (specially at Japanese and Chinese translations, but there
are also a few utf-8 graphic symbols somewhere else).

I remember there were a change in the past that made utf-8 to be default
for Sphinx, but can't remember the details.

Thanks,
Mauro

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-06 14:24     ` Mauro Carvalho Chehab
@ 2021-05-06 14:35       ` Michal Suchánek
  0 siblings, 0 replies; 41+ messages in thread
From: Michal Suchánek @ 2021-05-06 14:35 UTC (permalink / raw)
  To: Mauro Carvalho Chehab; +Cc: linux-doc, Jonathan Corbet

On Thu, May 06, 2021 at 04:24:42PM +0200, Mauro Carvalho Chehab wrote:
> Em Thu, 6 May 2021 15:32:12 +0200
> Michal Suchánek <msuchanek@suse.de> escreveu:
> 
> > On Thu, May 06, 2021 at 01:20:06PM +0200, Mauro Carvalho Chehab wrote:
> > > Em Thu, 6 May 2021 12:39:13 +0200
> > > Michal Suchánek <msuchanek@suse.de> escreveu:
> > >   
> > > > When building HTML documentation I get this output:
> > > > 
> > > > [  120s] + make O=/home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/html PYTHON=python3 htmldocs
> > > > [  120s] make[1]: Entering directory '/home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/html'
> > > > [  120s] cat: /etc/os-release: No such file or directory
> > > > [  121s]   SPHINX  htmldocs --> file:///home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/html/Documentation/output
> > > > [  121s]   PARSE   include/uapi/linux/dvb/audio.h
> > > > [  121s]   PARSE   include/uapi/linux/dvb/ca.h
> > > > [  121s]   PARSE   include/uapi/linux/dvb/dmx.h
> > > > [  121s]   PARSE   include/uapi/linux/dvb/frontend.h
> > > > [  122s]   PARSE   include/uapi/linux/dvb/net.h
> > > > [  122s]   PARSE   include/uapi/linux/dvb/video.h
> > > > [  122s]   PARSE   include/uapi/linux/videodev2.h
> > > > [  122s]   PARSE   include/uapi/linux/media.h
> > > > [  122s]   PARSE   include/uapi/linux/cec.h
> > > > [  122s]   PARSE   include/uapi/linux/lirc.h
> > > > [  190s] ../include/linux/dcache.h:318: warning: expecting prototype for dget, dget_dlock(). Prototype was for dget_dlock() instead
> > > > [  203s] ../include/linux/regulator/driver.h:388: warning: Function parameter or member 'ramp_reg' not described in 'regulator_desc'
> > > > [  203s] ../include/linux/regulator/driver.h:388: warning: Function parameter or member 'ramp_mask' not described in 'regulator_desc'
> > > > [  203s] ../include/linux/regulator/driver.h:388: warning: Function parameter or member 'ramp_delay_table' not described in 'regulator_desc'
> > > > [  203s] ../include/linux/regulator/driver.h:388: warning: Function parameter or member 'n_ramp_values' not described in 'regulator_desc'
> > > > [  203s] ../include/linux/spi/spi.h:671: warning: Function parameter or member 'devm_allocated' not described in 'spi_controller'
> > > > [  203s] ../drivers/usb/dwc3/core.h:865: warning: Function parameter or member 'hwparams9' not described in 'dwc3_hwparams'
> > > > [  233s] ../drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:2808: warning: Excess function parameter 'vm_context' description in 'amdgpu_vm_init'
> > > > [  233s] ../drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:2808: warning: Excess function parameter 'vm_context' description in 'amdgpu_vm_init'
> > > > [  233s] ../drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h:426: warning: Function parameter or member 'disable_hpd_irq' not described in 'amdgpu_display_manager'
> > > > [  233s] ../drivers/gpu/drm/i915/i915_cmd_parser.c:1420: warning: Excess function parameter 'trampoline' description in 'intel_engine_cmd_parser'
> > > > [  233s] ../drivers/gpu/drm/i915/i915_cmd_parser.c:1420: warning: Function parameter or member 'jump_whitelist' not described in 'intel_engine_cmd_parser'
> > > > [  233s] ../drivers/gpu/drm/i915/i915_cmd_parser.c:1420: warning: Function parameter or member 'shadow_map' not described in 'intel_engine_cmd_parser'
> > > > [  233s] ../drivers/gpu/drm/i915/i915_cmd_parser.c:1420: warning: Function parameter or member 'batch_map' not described in 'intel_engine_cmd_parser'
> > > > [  233s] ../drivers/gpu/drm/i915/i915_cmd_parser.c:1420: warning: Excess function parameter 'trampoline' description in 'intel_engine_cmd_parser'
> > > > [  233s] ../drivers/gpu/host1x/bus.c:774: warning: Excess function parameter 'key' description in '__host1x_client_register'
> > > > [  234s] /home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/Documentation/ABI/testing/sysfs-platform-intel-pmc:2: WARNING: Definition list ends without a blank line; unexpected unindent.
> > > > [  234s] /home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/Documentation/driver-api/serial/index.rst:17: WARNING: toctree contains reference to nonexisting document 'driver-api/serial/rocket'
> > > > [  234s] /home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/Documentation/gpu/amdgpu:96: ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:323: WARNING: Unexpected indentation.
> > > > [  234s] /home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/Documentation/gpu/amdgpu:96: ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:324: WARNING: Block quote ends without a blank line; unexpected unindent.
> > > > [  234s] /home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/Documentation/gpu/amdgpu:96: ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:327: WARNING: Definition list ends without a blank line; unexpected unindent.
> > > > [  307s] /home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/Documentation/driver-api/usb/writing_usb_driver.rst:129: WARNING: undefined label: usb_header (if the link has no caption the label must precede a section header)
> > > > [  412s] 
> > > > [  412s] Sphinx parallel build error:
> > > > [  412s] UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
> > > > [  431s] make[2]: *** [../Documentation/Makefile:91: htmldocs] Error 2
> > > > [  431s] make[1]: *** [/home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/Makefile:1784: htmldocs] Error 2
> > > > [  431s] make[1]: Leaving directory '/home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/html'
> > > > [  431s] make: *** [Makefile:222: __sub-make] Error 2
> > > > [  431s] error: Bad exit status from /var/tmp/rpm-tmp.npkyVx (%build)
> > > > 
> > > > It does not say which input file contains the offending character so I can't tell which file is broken.
> > > > 
> > > > Any idea how to debug?  
> > > 
> > > Yes. You probably has some weird file under Documentation/ABI.
> > > Some text editors like kate tend to keep temporary files sometimes.
> > > 
> > > The scripts/get_ABI.pl (currently) doesn't have any logic
> > > to recognize valid ABI files from trash stuff added at
> > > the ABI dirs.
> > > 
> > > Just doing a git status (or a git clean) and removing such
> > > files should fix the build.  
> > 
> > This is clean git-archived tarball uploaded to a build service so the
> > likehood of some garbage files popping out in Documentation/ABI out of
> > nowhere is quite small.
> 
> Well, it could be something completely different ;-) 
> 
> This crash happens when Sphinx/python finds a character it doesn't 
> recognize as valid, like if you run something like:
> 
> 	$ echo -en "What:\tBROKEN description\ndescription:\n";dd if=/dev/random count=10 ) > Documentation/ABI/testing/foobar
> 	$ ./scripts/get_abi.pl  rest 2>/dev/null|grep BROK
> 	Binary file (standard input) matches
> 
> On such case, non UTF-8 chars end being inserted, causing python and/or Sphinx
> exceptions, causing it to crash:
> 
> 	WARNING: The kernel documentation build process
> 	        support for Sphinx v3.0 and above is brand new. Be prepared for
> 	        possible issues in the generated output.
> 	enabling CJK for LaTeX builder
> 
> 	Sphinx parallel build error:
> 	UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa8 in position 1985806: invalid start byte
> 	make[1]: *** [Documentation/Makefile:91: htmldocs] Error 2
> 	make: *** [Makefile:1790: htmldocs] Error 2
> 
> Yet, this sounds a weird to me:
> 
> 	UnicodeEncodeError: 'latin-1' 
> 
> It sounds that it is somehow trying to use latin-1 alphabet, instead of utf-8.

How does it determine what character set to use?

I suspect the limited build environment does not have locale files.

The build system locale is kind of irrelevant for the documentation
locale but if it creeps in because the input file locale is not
specified anywhere this can cause problems.

Thanks

Michal

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-06 10:39 Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256) Michal Suchánek
  2021-05-06 11:20 ` Mauro Carvalho Chehab
@ 2021-05-06 15:57 ` Markus Heiser
  2021-05-06 16:46   ` Mauro Carvalho Chehab
  1 sibling, 1 reply; 41+ messages in thread
From: Markus Heiser @ 2021-05-06 15:57 UTC (permalink / raw)
  To: Michal Suchánek, linux-doc; +Cc: Jonathan Corbet, Mauro Carvalho Chehab


Am 06.05.21 um 12:39 schrieb Michal Suchánek:
> When building HTML documentation I get this output:
...
> [  412s] UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
...
> 
> It does not say which input file contains the offending character so I can't tell which file is broken.
> 
> Any idea how to debug?

I guess the build host is a very simple container, what does

   echo $LC_ALL
   echo $LANG

prompt?  If it is latin, change it to something using utf-8 (I recommend 
'en_US.utf8').

A UnicodeEncodeError can occour everywhere where characters are
encoded from (internal) unicode to the encoding of the stream.

By example:

A print or log statement which streams to stdout needs to encode
from unicode to stdout's encoding.  If there is one unicode symbol
which can not encoded to stream's encoding a UnicodeEncodeError
is raised.

   -- Markus --

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-06 15:57 ` Markus Heiser
@ 2021-05-06 16:46   ` Mauro Carvalho Chehab
  2021-05-06 17:04     ` Markus Heiser
  0 siblings, 1 reply; 41+ messages in thread
From: Mauro Carvalho Chehab @ 2021-05-06 16:46 UTC (permalink / raw)
  To: Markus Heiser; +Cc: Michal Suchánek, linux-doc, Jonathan Corbet

Em Thu, 6 May 2021 17:57:15 +0200
Markus Heiser <markus.heiser@darmarit.de> escreveu:

> Am 06.05.21 um 12:39 schrieb Michal Suchánek:
> > When building HTML documentation I get this output:  
> ...
> > [  412s] UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)  
> ...
> > 
> > It does not say which input file contains the offending character so I can't tell which file is broken.
> > 
> > Any idea how to debug?  
> 
> I guess the build host is a very simple container, what does
> 
>    echo $LC_ALL
>    echo $LANG
> 
> prompt?  If it is latin, change it to something using utf-8 (I recommend 
> 'en_US.utf8').
> 
> A UnicodeEncodeError can occour everywhere where characters are
> encoded from (internal) unicode to the encoding of the stream.
> 
> By example:
> 
> A print or log statement which streams to stdout needs to encode
> from unicode to stdout's encoding.  If there is one unicode symbol
> which can not encoded to stream's encoding a UnicodeEncodeError
> is raised.

Hi Markus,

It shouldn't matter the builder's locale when building the Kernel
documentation (or any other documents built from other git trees
on other open source projects), as the Kernel's *.rpm document charset
won't change, no matter on what part of the globe it was built.

I vaguely remember about a change we made a couple of years ago
in order to address this issue.

Thanks,
Mauro

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-06 16:46   ` Mauro Carvalho Chehab
@ 2021-05-06 17:04     ` Markus Heiser
  2021-05-06 17:27       ` Mauro Carvalho Chehab
  2021-05-06 17:48       ` Michal Suchánek
  0 siblings, 2 replies; 41+ messages in thread
From: Markus Heiser @ 2021-05-06 17:04 UTC (permalink / raw)
  To: Mauro Carvalho Chehab, Michal Suchánek; +Cc: linux-doc, Jonathan Corbet

Am 06.05.21 um 18:46 schrieb Mauro Carvalho Chehab:
> Em Thu, 6 May 2021 17:57:15 +0200
> Markus Heiser <markus.heiser@darmarit.de> escreveu:
> 
>> Am 06.05.21 um 12:39 schrieb Michal Suchánek:
>>> When building HTML documentation I get this output:
>> ...
>>> [  412s] UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
>> ...
>>>
>>> It does not say which input file contains the offending character so I can't tell which file is broken.
>>>
>>> Any idea how to debug?
>>
>> I guess the build host is a very simple container, what does
>>
>>     echo $LC_ALL
>>     echo $LANG
>>
>> prompt?  If it is latin, change it to something using utf-8 (I recommend
>> 'en_US.utf8').
>>
>> A UnicodeEncodeError can occour everywhere where characters are
>> encoded from (internal) unicode to the encoding of the stream.
>>
>> By example:
>>
>> A print or log statement which streams to stdout needs to encode
>> from unicode to stdout's encoding.  If there is one unicode symbol
>> which can not encoded to stream's encoding a UnicodeEncodeError
>> is raised.
> 
> Hi Markus,
> 
> It shouldn't matter the builder's locale when building the Kernel
> documentation (or any other documents built from other git trees
> on other open source projects), as the Kernel's *.rpm document charset
> won't change, no matter on what part of the globe it was built.
> 
> I vaguely remember about a change we made a couple of years ago
> in order to address this issue.

Hi Mauro :)

sure? .. what if the logger wants to log some symbols from the
chines translated parts to stdout and the encoding of stdout is
latin?

In python the logger will raise a UnicodeEncodeError, this is
what I know .. but I'm often wrong ;)

I remember we had some patches to the chinese translation
these days, may be there is an issue the logger wants to report.

    Anyway I would always recommend to use utf-8.

@Michal would you give it a try?

   -- Markus --


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-06 17:04     ` Markus Heiser
@ 2021-05-06 17:27       ` Mauro Carvalho Chehab
  2021-05-06 17:53         ` Markus Heiser
  2021-05-06 17:57         ` Randy Dunlap
  2021-05-06 17:48       ` Michal Suchánek
  1 sibling, 2 replies; 41+ messages in thread
From: Mauro Carvalho Chehab @ 2021-05-06 17:27 UTC (permalink / raw)
  To: Markus Heiser; +Cc: Michal Suchánek, linux-doc, Jonathan Corbet

Em Thu, 6 May 2021 19:04:44 +0200
Markus Heiser <markus.heiser@darmarit.de> escreveu:

> Am 06.05.21 um 18:46 schrieb Mauro Carvalho Chehab:
> > Em Thu, 6 May 2021 17:57:15 +0200
> > Markus Heiser <markus.heiser@darmarit.de> escreveu:
> >   
> >> Am 06.05.21 um 12:39 schrieb Michal Suchánek:  
> >>> When building HTML documentation I get this output:  
> >> ...  
> >>> [  412s] UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)  
> >> ...  
> >>>
> >>> It does not say which input file contains the offending character so I can't tell which file is broken.
> >>>
> >>> Any idea how to debug?  
> >>
> >> I guess the build host is a very simple container, what does
> >>
> >>     echo $LC_ALL
> >>     echo $LANG
> >>
> >> prompt?  If it is latin, change it to something using utf-8 (I recommend
> >> 'en_US.utf8').
> >>
> >> A UnicodeEncodeError can occour everywhere where characters are
> >> encoded from (internal) unicode to the encoding of the stream.
> >>
> >> By example:
> >>
> >> A print or log statement which streams to stdout needs to encode
> >> from unicode to stdout's encoding.  If there is one unicode symbol
> >> which can not encoded to stream's encoding a UnicodeEncodeError
> >> is raised.  
> > 
> > Hi Markus,
> > 
> > It shouldn't matter the builder's locale when building the Kernel
> > documentation (or any other documents built from other git trees
> > on other open source projects), as the Kernel's *.rpm document charset
> > won't change, no matter on what part of the globe it was built.
> > 
> > I vaguely remember about a change we made a couple of years ago
> > in order to address this issue.  
> 
> Hi Mauro :)
> 
> sure? .. what if the logger wants to log some symbols from the
> chines translated parts to stdout and the encoding of stdout is
> latin?
>
> In python the logger will raise a UnicodeEncodeError, this is
> what I know .. but I'm often wrong ;)

Yeah, Python (and almost all python apps) has a mad behavior when
it finds an unexpected character: instead of ignoring it, it
just crashes. On Sphinx, this is is even worse, as it blames
the parallel building, instead of pinpointing the real culprit.

In this specific case, crashing due to an invalid char sent to
the logger sounds pretty stupid to me, as what really matter
is the built documents.

> I remember we had some patches to the chinese translation
> these days, may be there is an issue the logger wants to report.
> 
>     Anyway I would always recommend to use utf-8.

Well, IMO, for things like logger, python or Sphinx should internally
be ding something similar to:

	$ iconv -t latin-1//IGNORE

e. g. if the system's charset is latin-1, it should ignore all charset
errors at the logger, converting the charset the best way it can
without crashing.

As, unfortunately, this is not happening, conf.py should do something
like hardcoding env{LANG}=<lang>.utf-8 (or something similar),
in order to to eliminate the risk of crashes.

Thanks,
Mauro

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-06 17:04     ` Markus Heiser
  2021-05-06 17:27       ` Mauro Carvalho Chehab
@ 2021-05-06 17:48       ` Michal Suchánek
  2021-05-06 17:59         ` Markus Heiser
  2021-05-12  6:22         ` Mauro Carvalho Chehab
  1 sibling, 2 replies; 41+ messages in thread
From: Michal Suchánek @ 2021-05-06 17:48 UTC (permalink / raw)
  To: Markus Heiser; +Cc: Mauro Carvalho Chehab, linux-doc, Jonathan Corbet

On Thu, May 06, 2021 at 07:04:44PM +0200, Markus Heiser wrote:
> Am 06.05.21 um 18:46 schrieb Mauro Carvalho Chehab:
> > Em Thu, 6 May 2021 17:57:15 +0200
> > Markus Heiser <markus.heiser@darmarit.de> escreveu:
> > 
> > > Am 06.05.21 um 12:39 schrieb Michal Suchánek:
> > > > When building HTML documentation I get this output:
> > > ...
> > > > [  412s] UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
> > > ...
> > > > 
> > > > It does not say which input file contains the offending character so I can't tell which file is broken.
> > > > 
> > > > Any idea how to debug?
> > > 
> > > I guess the build host is a very simple container, what does
> > > 
> > >     echo $LC_ALL
> > >     echo $LANG
It's actually set to en_US just before the build.
> > > 
> > > prompt?  If it is latin, change it to something using utf-8 (I recommend
> > > 'en_US.utf8').
> > > 
> > > A UnicodeEncodeError can occour everywhere where characters are
> > > encoded from (internal) unicode to the encoding of the stream.
> > > 
> > > By example:
> > > 
> > > A print or log statement which streams to stdout needs to encode
> > > from unicode to stdout's encoding.  If there is one unicode symbol
> > > which can not encoded to stream's encoding a UnicodeEncodeError
> > > is raised.
> > 
> > Hi Markus,
> > 
> > It shouldn't matter the builder's locale when building the Kernel
> > documentation (or any other documents built from other git trees
> > on other open source projects), as the Kernel's *.rpm document charset
> > won't change, no matter on what part of the globe it was built.
> > 
> > I vaguely remember about a change we made a couple of years ago
> > in order to address this issue.
> 
> Hi Mauro :)
> 
> sure? .. what if the logger wants to log some symbols from the
> chines translated parts to stdout and the encoding of stdout is
> latin?

[  127s] + cd linux-5.12-next-20210506
[  127s] + export LANG=en_US
[  127s] + LANG=en_US
[  127s] + mkdir -p html
[  127s] + python3 -c 'print("↑ᛏ个")'
[  127s] ↑ᛏ个
[  127s] + echo 'print("↑ᛏ个")'
[  127s] + python3 test.py
[  127s] Traceback (most recent call last):
[  127s]   File "test.py", line 1, in <module>
[  127s]     print("\u2191\u16cf\u4e2a\uf8f9")
[  127s] UnicodeEncodeError: 'latin-1' codec can't encode characters in
position 0-3: ordinal not in range(256)

It certainly does not look like python can print unicode in this
environment. It tells me where the problem is, though.

Thanks

Michal

[  127s] + :
[  127s] + locale
[  128s] LANG=en_US
[  128s] LC_CTYPE="en_US"
[  128s] LC_NUMERIC="en_US"
[  128s] LC_TIME="en_US"
[  128s] LC_COLLATE="en_US"
[  128s] LC_MONETARY="en_US"
[  128s] LC_MESSAGES="en_US"
[  128s] LC_PAPER="en_US"
[  128s] LC_NAME="en_US"
[  128s] LC_ADDRESS="en_US"
[  128s] LC_TELEPHONE="en_US"
[  128s] LC_MEASUREMENT="en_US"
[  128s] LC_IDENTIFICATION="en_US"
[  128s] LC_ALL=
[  128s] + echo LC_ALL=
[  128s] LC_ALL=
[  128s] + echo LANG=en_US
[  128s] LANG=en_US

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-06 17:27       ` Mauro Carvalho Chehab
@ 2021-05-06 17:53         ` Markus Heiser
  2021-05-06 18:06           ` Michal Suchánek
  2021-05-06 17:57         ` Randy Dunlap
  1 sibling, 1 reply; 41+ messages in thread
From: Markus Heiser @ 2021-05-06 17:53 UTC (permalink / raw)
  To: Mauro Carvalho Chehab; +Cc: Michal Suchánek, linux-doc, Jonathan Corbet

Am 06.05.21 um 19:27 schrieb Mauro Carvalho Chehab:
> Em Thu, 6 May 2021 19:04:44 +0200
> Markus Heiser <markus.heiser@darmarit.de> escreveu:
> 
>> Am 06.05.21 um 18:46 schrieb Mauro Carvalho Chehab:
>>> Em Thu, 6 May 2021 17:57:15 +0200
>>> Markus Heiser <markus.heiser@darmarit.de> escreveu:
>>>    
>>>> Am 06.05.21 um 12:39 schrieb Michal Suchánek:
>>>>> When building HTML documentation I get this output:
>>>> ...
>>>>> [  412s] UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
>>>> ...
>>>>>
>>>>> It does not say which input file contains the offending character so I can't tell which file is broken.
>>>>>
>>>>> Any idea how to debug?
>>>>
>>>> I guess the build host is a very simple container, what does
>>>>
>>>>      echo $LC_ALL
>>>>      echo $LANG
>>>>
>>>> prompt?  If it is latin, change it to something using utf-8 (I recommend
>>>> 'en_US.utf8').
>>>>
>>>> A UnicodeEncodeError can occour everywhere where characters are
>>>> encoded from (internal) unicode to the encoding of the stream.
>>>>
>>>> By example:
>>>>
>>>> A print or log statement which streams to stdout needs to encode
>>>> from unicode to stdout's encoding.  If there is one unicode symbol
>>>> which can not encoded to stream's encoding a UnicodeEncodeError
>>>> is raised.
>>>
>>> Hi Markus,
>>>
>>> It shouldn't matter the builder's locale when building the Kernel
>>> documentation (or any other documents built from other git trees
>>> on other open source projects), as the Kernel's *.rpm document charset
>>> won't change, no matter on what part of the globe it was built.
>>>
>>> I vaguely remember about a change we made a couple of years ago
>>> in order to address this issue.
>>
>> Hi Mauro :)
>>
>> sure? .. what if the logger wants to log some symbols from the
>> chines translated parts to stdout and the encoding of stdout is
>> latin?
>>
>> In python the logger will raise a UnicodeEncodeError, this is
>> what I know .. but I'm often wrong ;)
> 
> Yeah, Python (and almost all python apps) has a mad behavior when
> it finds an unexpected character: instead of ignoring it, it

Hi Mauro,

it is not comfortable but is it mad? ..

Most often languages (or applications) do not handle encoding
of strings they just piping a binary stream while python
decode / encodes strings.

"The Zen of Python" [1] says

    Explicit is better than implicit.

If a stream can't encode symbols and these symbols should be ignored
you have to set the encoding of the stream explicit to ignore
such symbols.

I guess this encode discussions will haunt me for the rest of my
life.  My escape strategy is to use UTF-8 wherever possible.

[1] https://www.python.org/dev/peps/pep-0020/

   -- Markus --

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-06 17:27       ` Mauro Carvalho Chehab
  2021-05-06 17:53         ` Markus Heiser
@ 2021-05-06 17:57         ` Randy Dunlap
  2021-05-06 18:08           ` Matthew Wilcox
  1 sibling, 1 reply; 41+ messages in thread
From: Randy Dunlap @ 2021-05-06 17:57 UTC (permalink / raw)
  To: Mauro Carvalho Chehab, Markus Heiser
  Cc: Michal Suchánek, linux-doc, Jonathan Corbet

On 5/6/21 10:27 AM, Mauro Carvalho Chehab wrote:
> Em Thu, 6 May 2021 19:04:44 +0200
> Markus Heiser <markus.heiser@darmarit.de> escreveu:
> 
>> Am 06.05.21 um 18:46 schrieb Mauro Carvalho Chehab:
>>> Em Thu, 6 May 2021 17:57:15 +0200
>>> Markus Heiser <markus.heiser@darmarit.de> escreveu:
>>>   
>>>> Am 06.05.21 um 12:39 schrieb Michal Suchánek:  
>>>>> When building HTML documentation I get this output:  
>>>> ...  
>>>>> [  412s] UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)  
>>>> ...  
>>>>>
>>>>> It does not say which input file contains the offending character so I can't tell which file is broken.
>>>>>
>>>>> Any idea how to debug?  
>>>>
>>>> I guess the build host is a very simple container, what does
>>>>
>>>>     echo $LC_ALL
>>>>     echo $LANG
>>>>
>>>> prompt?  If it is latin, change it to something using utf-8 (I recommend
>>>> 'en_US.utf8').
>>>>
>>>> A UnicodeEncodeError can occour everywhere where characters are
>>>> encoded from (internal) unicode to the encoding of the stream.
>>>>
>>>> By example:
>>>>
>>>> A print or log statement which streams to stdout needs to encode
>>>> from unicode to stdout's encoding.  If there is one unicode symbol
>>>> which can not encoded to stream's encoding a UnicodeEncodeError
>>>> is raised.  
>>>
>>> Hi Markus,
>>>
>>> It shouldn't matter the builder's locale when building the Kernel
>>> documentation (or any other documents built from other git trees
>>> on other open source projects), as the Kernel's *.rpm document charset
>>> won't change, no matter on what part of the globe it was built.
>>>
>>> I vaguely remember about a change we made a couple of years ago
>>> in order to address this issue.  
>>
>> Hi Mauro :)
>>
>> sure? .. what if the logger wants to log some symbols from the
>> chines translated parts to stdout and the encoding of stdout is
>> latin?
>>
>> In python the logger will raise a UnicodeEncodeError, this is
>> what I know .. but I'm often wrong ;)
> 
> Yeah, Python (and almost all python apps) has a mad behavior when
> it finds an unexpected character: instead of ignoring it, it
> just crashes. On Sphinx, this is is even worse, as it blames
> the parallel building, instead of pinpointing the real culprit.

And for error messages such as this problem, it should include
file name and line number along with the position.  Is position
in this case offset from the beginning of file or beginning of line?
What a bad error message.

[ah, I see that Michal has found where the error happens.]


I have been going thru some of the Documentation/ files...

Why do several of the files begin with
(hex) ef bb bf    followed by "=================="
for a heading, instead of just "===================".
See e.g. Documentation/timers/no_hz.rst.

thanks.
-- 
~Randy [resending due to smtp error]


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-06 17:48       ` Michal Suchánek
@ 2021-05-06 17:59         ` Markus Heiser
  2021-05-06 18:16           ` Michal Suchánek
  2021-05-12  6:22         ` Mauro Carvalho Chehab
  1 sibling, 1 reply; 41+ messages in thread
From: Markus Heiser @ 2021-05-06 17:59 UTC (permalink / raw)
  To: Michal Suchánek; +Cc: Mauro Carvalho Chehab, linux-doc, Jonathan Corbet

Am 06.05.21 um 19:48 schrieb Michal Suchánek:
> On Thu, May 06, 2021 at 07:04:44PM +0200, Markus Heiser wrote:
>> Am 06.05.21 um 18:46 schrieb Mauro Carvalho Chehab:
>>> Em Thu, 6 May 2021 17:57:15 +0200
>>> Markus Heiser <markus.heiser@darmarit.de> escreveu:
>>>
>>>> Am 06.05.21 um 12:39 schrieb Michal Suchánek:
>>>>> When building HTML documentation I get this output:
>>>> ...
>>>>> [  412s] UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
>>>> ...
>>>>>
>>>>> It does not say which input file contains the offending character so I can't tell which file is broken.
>>>>>
>>>>> Any idea how to debug?
>>>>
>>>> I guess the build host is a very simple container, what does
>>>>
>>>>      echo $LC_ALL
>>>>      echo $LANG
> It's actually set to en_US just before the build.
>>>>
>>>> prompt?  If it is latin, change it to something using utf-8 (I recommend
>>>> 'en_US.utf8').
>>>>
>>>> A UnicodeEncodeError can occour everywhere where characters are
>>>> encoded from (internal) unicode to the encoding of the stream.
>>>>
>>>> By example:
>>>>
>>>> A print or log statement which streams to stdout needs to encode
>>>> from unicode to stdout's encoding.  If there is one unicode symbol
>>>> which can not encoded to stream's encoding a UnicodeEncodeError
>>>> is raised.
>>>
>>> Hi Markus,
>>>
>>> It shouldn't matter the builder's locale when building the Kernel
>>> documentation (or any other documents built from other git trees
>>> on other open source projects), as the Kernel's *.rpm document charset
>>> won't change, no matter on what part of the globe it was built.
>>>
>>> I vaguely remember about a change we made a couple of years ago
>>> in order to address this issue.
>>
>> Hi Mauro :)
>>
>> sure? .. what if the logger wants to log some symbols from the
>> chines translated parts to stdout and the encoding of stdout is
>> latin?
> 
> [  127s] + cd linux-5.12-next-20210506
> [  127s] + export LANG=en_US
> [  127s] + LANG=en_US
> [  127s] + mkdir -p html
> [  127s] + python3 -c 'print("↑ᛏ个")'
> [  127s] ↑ᛏ个
> [  127s] + echo 'print("↑ᛏ个")'
> [  127s] + python3 test.py
> [  127s] Traceback (most recent call last):
> [  127s]   File "test.py", line 1, in <module>
> [  127s]     print("\u2191\u16cf\u4e2a\uf8f9")
> [  127s] UnicodeEncodeError: 'latin-1' codec can't encode characters in
> position 0-3: ordinal not in range(256)
> 
> It certainly does not look like python can print unicode in this
> environment. It tells me where the problem is, though.

Can't speak for the image of your container, may you need to install
some utf-8 packages / but in most cases

   export LANG=en_US.UTF-8
   export LC_ALL=en_US.UTF-8

should help.

   -- Markus --

> 
> Thanks
> 
> Michal
> 
> [  127s] + :
> [  127s] + locale
> [  128s] LANG=en_US
> [  128s] LC_CTYPE="en_US"
> [  128s] LC_NUMERIC="en_US"
> [  128s] LC_TIME="en_US"
> [  128s] LC_COLLATE="en_US"
> [  128s] LC_MONETARY="en_US"
> [  128s] LC_MESSAGES="en_US"
> [  128s] LC_PAPER="en_US"
> [  128s] LC_NAME="en_US"
> [  128s] LC_ADDRESS="en_US"
> [  128s] LC_TELEPHONE="en_US"
> [  128s] LC_MEASUREMENT="en_US"
> [  128s] LC_IDENTIFICATION="en_US"
> [  128s] LC_ALL=
> [  128s] + echo LC_ALL=
> [  128s] LC_ALL=
> [  128s] + echo LANG=en_US
> [  128s] LANG=en_US
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-06 17:53         ` Markus Heiser
@ 2021-05-06 18:06           ` Michal Suchánek
  2021-05-07  8:52             ` Mauro Carvalho Chehab
  0 siblings, 1 reply; 41+ messages in thread
From: Michal Suchánek @ 2021-05-06 18:06 UTC (permalink / raw)
  To: Markus Heiser; +Cc: Mauro Carvalho Chehab, linux-doc, Jonathan Corbet

On Thu, May 06, 2021 at 07:53:25PM +0200, Markus Heiser wrote:
> Am 06.05.21 um 19:27 schrieb Mauro Carvalho Chehab:
> > Em Thu, 6 May 2021 19:04:44 +0200
> > Markus Heiser <markus.heiser@darmarit.de> escreveu:
> > 
> > > Am 06.05.21 um 18:46 schrieb Mauro Carvalho Chehab:
> > > > Em Thu, 6 May 2021 17:57:15 +0200
> > > > Markus Heiser <markus.heiser@darmarit.de> escreveu:
> > > > > Am 06.05.21 um 12:39 schrieb Michal Suchánek:
> > > > > > When building HTML documentation I get this output:
> > > > > ...
> > > > > > [  412s] UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
> > > > > ...
> > > > > > 
> > > > > > It does not say which input file contains the offending character so I can't tell which file is broken.
> > > > > > 
> > > > > > Any idea how to debug?
> > > > > 
> > > > > I guess the build host is a very simple container, what does
> > > > > 
> > > > >      echo $LC_ALL
> > > > >      echo $LANG
> > > > > 
> > > > > prompt?  If it is latin, change it to something using utf-8 (I recommend
> > > > > 'en_US.utf8').
> > > > > 
> > > > > A UnicodeEncodeError can occour everywhere where characters are
> > > > > encoded from (internal) unicode to the encoding of the stream.
> > > > > 
> > > > > By example:
> > > > > 
> > > > > A print or log statement which streams to stdout needs to encode
> > > > > from unicode to stdout's encoding.  If there is one unicode symbol
> > > > > which can not encoded to stream's encoding a UnicodeEncodeError
> > > > > is raised.
> > > > 
> > > > Hi Markus,
> > > > 
> > > > It shouldn't matter the builder's locale when building the Kernel
> > > > documentation (or any other documents built from other git trees
> > > > on other open source projects), as the Kernel's *.rpm document charset
> > > > won't change, no matter on what part of the globe it was built.
> > > > 
> > > > I vaguely remember about a change we made a couple of years ago
> > > > in order to address this issue.
> > > 
> > > Hi Mauro :)
> > > 
> > > sure? .. what if the logger wants to log some symbols from the
> > > chines translated parts to stdout and the encoding of stdout is
> > > latin?
> > > 
> > > In python the logger will raise a UnicodeEncodeError, this is
> > > what I know .. but I'm often wrong ;)
> > 
> > Yeah, Python (and almost all python apps) has a mad behavior when
> > it finds an unexpected character: instead of ignoring it, it
> 
> Hi Mauro,
> 
> it is not comfortable but is it mad? ..
> 
> Most often languages (or applications) do not handle encoding
> of strings they just piping a binary stream while python
> decode / encodes strings.
> 
> "The Zen of Python" [1] says
> 
>    Explicit is better than implicit.
> 
> If a stream can't encode symbols and these symbols should be ignored
> you have to set the encoding of the stream explicit to ignore
> such symbols.

The problem is this part never happened. Loggers are supposed to tell
you about the error in your application, not crash it.

But the problem with Sphinx may be that the output file is also assumed
to be in the locale encoding, and the output encoding is never set. It's
HTML so it could be encoded with entities, too.

The idea about handlinng encoding precisely is not mad in itself but then
everybody working with just ASCII and never testing their software works
in the cases where explicit handling is needed is the mad part. Too
US-centric culture for getting encodings right I guess.

Thanks

Michal

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-06 17:57         ` Randy Dunlap
@ 2021-05-06 18:08           ` Matthew Wilcox
  2021-05-06 21:21             ` Randy Dunlap
  0 siblings, 1 reply; 41+ messages in thread
From: Matthew Wilcox @ 2021-05-06 18:08 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Mauro Carvalho Chehab, Markus Heiser, Michal Suchánek,
	linux-doc, Jonathan Corbet

On Thu, May 06, 2021 at 10:57:53AM -0700, Randy Dunlap wrote:
> I have been going thru some of the Documentation/ files...
> 
> Why do several of the files begin with
> (hex) ef bb bf    followed by "=================="
> for a heading, instead of just "===================".
> See e.g. Documentation/timers/no_hz.rst.

00000000  ef bb bf 3d 3d 3d 3d 3d  3d 3d 3d 3d 3d 3d 3d 3d  |...=============|

ef bb bf is utf8 for 0b1111'111011'111111 = 0xFEFF which is the
https://en.wikipedia.org/wiki/Byte_order_mark

We should delete it.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-06 17:59         ` Markus Heiser
@ 2021-05-06 18:16           ` Michal Suchánek
  0 siblings, 0 replies; 41+ messages in thread
From: Michal Suchánek @ 2021-05-06 18:16 UTC (permalink / raw)
  To: Markus Heiser; +Cc: Mauro Carvalho Chehab, linux-doc, Jonathan Corbet

On Thu, May 06, 2021 at 07:59:18PM +0200, Markus Heiser wrote:
> Am 06.05.21 um 19:48 schrieb Michal Suchánek:
> > On Thu, May 06, 2021 at 07:04:44PM +0200, Markus Heiser wrote:
> > > Am 06.05.21 um 18:46 schrieb Mauro Carvalho Chehab:
> > > > Em Thu, 6 May 2021 17:57:15 +0200
> > > > Markus Heiser <markus.heiser@darmarit.de> escreveu:
> > > > 
> > > > > Am 06.05.21 um 12:39 schrieb Michal Suchánek:
> > > > > > When building HTML documentation I get this output:
> > > > > ...
> > > > > > [  412s] UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
> > > > > ...
> > > > > > 
> > > > > > It does not say which input file contains the offending character so I can't tell which file is broken.
> > > > > > 
> > > > > > Any idea how to debug?
> > > > > 
> > > > > I guess the build host is a very simple container, what does
> > > > > 
> > > > >      echo $LC_ALL
> > > > >      echo $LANG
> > It's actually set to en_US just before the build.
> > > > > 
> > > > > prompt?  If it is latin, change it to something using utf-8 (I recommend
> > > > > 'en_US.utf8').
> > > > > 
> > > > > A UnicodeEncodeError can occour everywhere where characters are
> > > > > encoded from (internal) unicode to the encoding of the stream.
> > > > > 
> > > > > By example:
> > > > > 
> > > > > A print or log statement which streams to stdout needs to encode
> > > > > from unicode to stdout's encoding.  If there is one unicode symbol
> > > > > which can not encoded to stream's encoding a UnicodeEncodeError
> > > > > is raised.
> > > > 
> > > > Hi Markus,
> > > > 
> > > > It shouldn't matter the builder's locale when building the Kernel
> > > > documentation (or any other documents built from other git trees
> > > > on other open source projects), as the Kernel's *.rpm document charset
> > > > won't change, no matter on what part of the globe it was built.
> > > > 
> > > > I vaguely remember about a change we made a couple of years ago
> > > > in order to address this issue.
> > > 
> > > Hi Mauro :)
> > > 
> > > sure? .. what if the logger wants to log some symbols from the
> > > chines translated parts to stdout and the encoding of stdout is
> > > latin?
> > 
> > [  127s] + cd linux-5.12-next-20210506
> > [  127s] + export LANG=en_US
> > [  127s] + LANG=en_US
> > [  127s] + mkdir -p html
> > [  127s] + python3 -c 'print("↑ᛏ个")'
> > [  127s] ↑ᛏ个
> > [  127s] + echo 'print("↑ᛏ个")'
> > [  127s] + python3 test.py
> > [  127s] Traceback (most recent call last):
> > [  127s]   File "test.py", line 1, in <module>
> > [  127s]     print("\u2191\u16cf\u4e2a\uf8f9")
> > [  127s] UnicodeEncodeError: 'latin-1' codec can't encode characters in
> > position 0-3: ordinal not in range(256)
> > 
> > It certainly does not look like python can print unicode in this
> > environment. It tells me where the problem is, though.
> 
> Can't speak for the image of your container, may you need to install
> some utf-8 packages / but in most cases
> 
>   export LANG=en_US.UTF-8

Yes, in this case export LANG=en_US.utf8 is an easy workaround.

The UTF-8 locale is already included in the build environment by
default.

Thanks

Michal

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-06 18:08           ` Matthew Wilcox
@ 2021-05-06 21:21             ` Randy Dunlap
  2021-05-07  6:39               ` Mauro Carvalho Chehab
  0 siblings, 1 reply; 41+ messages in thread
From: Randy Dunlap @ 2021-05-06 21:21 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Mauro Carvalho Chehab, Markus Heiser, Michal Suchánek,
	linux-doc, Jonathan Corbet

On 5/6/21 11:08 AM, Matthew Wilcox wrote:
> On Thu, May 06, 2021 at 10:57:53AM -0700, Randy Dunlap wrote:
>> I have been going thru some of the Documentation/ files...
>>
>> Why do several of the files begin with
>> (hex) ef bb bf    followed by "=================="
>> for a heading, instead of just "===================".
>> See e.g. Documentation/timers/no_hz.rst.
> 
> 00000000  ef bb bf 3d 3d 3d 3d 3d  3d 3d 3d 3d 3d 3d 3d 3d  |...=============|
> 
> ef bb bf is utf8 for 0b1111'111011'111111 = 0xFEFF which is the
> https://en.wikipedia.org/wiki/Byte_order_mark
> 
> We should delete it.
> 

OK, thanks, I have started on that.


Just another question: ("inquiring minds want to know")

Why is/are some docs using U+2217 '*' instead of ASCII '*'?
E.g., Documentation/block/cdrom-standard.rst.

Maybe some $EDITOR is doing this?

thanks.
-- 
~Randy


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-06 21:21             ` Randy Dunlap
@ 2021-05-07  6:39               ` Mauro Carvalho Chehab
  2021-05-07  6:49                 ` Randy Dunlap
                                   ` (2 more replies)
  0 siblings, 3 replies; 41+ messages in thread
From: Mauro Carvalho Chehab @ 2021-05-07  6:39 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Matthew Wilcox, Markus Heiser, Michal Suchánek, linux-doc,
	Jonathan Corbet

Em Thu, 6 May 2021 14:21:01 -0700
Randy Dunlap <rdunlap@infradead.org> escreveu:

> On 5/6/21 11:08 AM, Matthew Wilcox wrote:
> > On Thu, May 06, 2021 at 10:57:53AM -0700, Randy Dunlap wrote:  
> >> I have been going thru some of the Documentation/ files...
> >>
> >> Why do several of the files begin with
> >> (hex) ef bb bf    followed by "=================="
> >> for a heading, instead of just "===================".
> >> See e.g. Documentation/timers/no_hz.rst.  

No idea! It seems that the text editor I used on that time added
it for whatever reason.

> > 
> > 00000000  ef bb bf 3d 3d 3d 3d 3d  3d 3d 3d 3d 3d 3d 3d 3d  |...=============|
> > 
> > ef bb bf is utf8 for 0b1111'111011'111111 = 0xFEFF which is the
> > https://en.wikipedia.org/wiki/Byte_order_mark
> > 
> > We should delete it.
> >   
> 
> OK, thanks, I have started on that.
> 
> 
> Just another question: ("inquiring minds want to know")
> 
> Why is/are some docs using U+2217 '*' instead of ASCII '*'?
> E.g., Documentation/block/cdrom-standard.rst.

The cdrom doc is a very special case: it was originally written in LaTeX.
I don't remember any other document in LaTeX inside the Kernel docs during
the conversions I made. See:
	e327cfcb2542 ("docs: cdrom-standard.tex: convert from LaTeX to ReST")

In order to convert it to .rst, I used some tool to first turn it
into plain text (probably LaTeX, but I don't remember anymore), and then
I manually reviewed the entire file, adding ReST tags where needed.

I didn't realize that utf-8 chars were used instead of normal ASCII chars,
as both appear the same when editing it[1].

[1] I use Fedora here. Fedora changed the default charset to utf-8 a long
    time ago.

Anyway, we should be able of get rid of weird UTF-8 chars from it with:

	$ iconv -f utf-8 -t ascii//TRANSLIT Documentation/cdrom/cdrom-standard.rst

I'll prepare a patch fixing it. Some care should be taken, however, as
it has two places where UTF-8 chars should be used[2].

[2] There are two German person names that use UTF-8 chars:
    - 'o' + umlat;
    - a LATIN SMALL LETTER SHARP S (Eszett)

Thanks,
Mauro

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-07  6:39               ` Mauro Carvalho Chehab
@ 2021-05-07  6:49                 ` Randy Dunlap
  2021-05-07  8:04                 ` Mauro Carvalho Chehab
  2021-05-08  9:22                 ` Mauro Carvalho Chehab
  2 siblings, 0 replies; 41+ messages in thread
From: Randy Dunlap @ 2021-05-07  6:49 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Matthew Wilcox, Markus Heiser, Michal Suchánek, linux-doc,
	Jonathan Corbet

On 5/6/21 11:39 PM, Mauro Carvalho Chehab wrote:
> Em Thu, 6 May 2021 14:21:01 -0700
> Randy Dunlap <rdunlap@infradead.org> escreveu:
> 

>>
>> Just another question: ("inquiring minds want to know")
>>
>> Why is/are some docs using U+2217 '*' instead of ASCII '*'?
>> E.g., Documentation/block/cdrom-standard.rst.
> 
> The cdrom doc is a very special case: it was originally written in LaTeX.

Yes, I recall that. I even edited it at least once.

> I don't remember any other document in LaTeX inside the Kernel docs during
> the conversions I made. See:
> 	e327cfcb2542 ("docs: cdrom-standard.tex: convert from LaTeX to ReST")
> 
> In order to convert it to .rst, I used some tool to first turn it
> into plain text (probably LaTeX, but I don't remember anymore), and then
> I manually reviewed the entire file, adding ReST tags where needed.
> 
> I didn't realize that utf-8 chars were used instead of normal ASCII chars,
> as both appear the same when editing it[1].
> 
> [1] I use Fedora here. Fedora changed the default charset to utf-8 a long
>     time ago.
> 
> Anyway, we should be able of get rid of weird UTF-8 chars from it with:
> 
> 	$ iconv -f utf-8 -t ascii//TRANSLIT Documentation/cdrom/cdrom-standard.rst
> 
> I'll prepare a patch fixing it. Some care should be taken, however, as
> it has two places where UTF-8 chars should be used[2].

Thanks!

> [2] There are two German person names that use UTF-8 chars:
>     - 'o' + umlat;
>     - a LATIN SMALL LETTER SHARP S (Eszett)

My patch preparation notes say that the cdrom .rst file contains
"fancy '*'" (not ASCII) instead of ASCII '*' in several places.

Also there are several files that contain U+00A0 non-breaking space
where it is not needed AFAICT.


-- 
~Randy


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-07  6:39               ` Mauro Carvalho Chehab
  2021-05-07  6:49                 ` Randy Dunlap
@ 2021-05-07  8:04                 ` Mauro Carvalho Chehab
  2021-05-07  8:35                   ` Michal Suchánek
  2021-05-08  9:22                 ` Mauro Carvalho Chehab
  2 siblings, 1 reply; 41+ messages in thread
From: Mauro Carvalho Chehab @ 2021-05-07  8:04 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Matthew Wilcox, Markus Heiser, Michal Suchánek, linux-doc,
	Jonathan Corbet

Em Fri, 7 May 2021 08:39:24 +0200
Mauro Carvalho Chehab <mchehab@kernel.org> escreveu:

> Em Thu, 6 May 2021 14:21:01 -0700
> Randy Dunlap <rdunlap@infradead.org> escreveu:
> 
> > On 5/6/21 11:08 AM, Matthew Wilcox wrote:  
> > > On Thu, May 06, 2021 at 10:57:53AM -0700, Randy Dunlap wrote:    
> > >> I have been going thru some of the Documentation/ files...
> > >>
> > >> Why do several of the files begin with
> > >> (hex) ef bb bf    followed by "=================="
> > >> for a heading, instead of just "===================".
> > >> See e.g. Documentation/timers/no_hz.rst.    
> 
> No idea! It seems that the text editor I used on that time added
> it for whatever reason.
> 
> > > 
> > > 00000000  ef bb bf 3d 3d 3d 3d 3d  3d 3d 3d 3d 3d 3d 3d 3d  |...=============|
> > > 
> > > ef bb bf is utf8 for 0b1111'111011'111111 = 0xFEFF which is the
> > > https://en.wikipedia.org/wiki/Byte_order_mark
> > > 
> > > We should delete it.
> > >     
> > 
> > OK, thanks, I have started on that.
> > 
> > 
> > Just another question: ("inquiring minds want to know")
> > 
> > Why is/are some docs using U+2217 '*' instead of ASCII '*'?
> > E.g., Documentation/block/cdrom-standard.rst.  
> 
> The cdrom doc is a very special case: it was originally written in LaTeX.
> I don't remember any other document in LaTeX inside the Kernel docs during
> the conversions I made. See:
> 	e327cfcb2542 ("docs: cdrom-standard.tex: convert from LaTeX to ReST")
> 
> In order to convert it to .rst, I used some tool to first turn it
> into plain text (probably LaTeX, but I don't remember anymore), and then
> I manually reviewed the entire file, adding ReST tags where needed.
> 
> I didn't realize that utf-8 chars were used instead of normal ASCII chars,
> as both appear the same when editing it[1].
> 
> [1] I use Fedora here. Fedora changed the default charset to utf-8 a long
>     time ago.
> 
> Anyway, we should be able of get rid of weird UTF-8 chars from it with:
> 
> 	$ iconv -f utf-8 -t ascii//TRANSLIT Documentation/cdrom/cdrom-standard.rst
> 
> I'll prepare a patch fixing it. Some care should be taken, however, as
> it has two places where UTF-8 chars should be used[2].
> 
> [2] There are two German person names that use UTF-8 chars:
>     - 'o' + umlat;
>     - a LATIN SMALL LETTER SHARP S (Eszett)

Btw, I did a quick check here: excluding translations, there are 182
files with UTF-8 chars at next-20210429. It seems that most of them
are on files that got converted from DocBook and html.

Several of them are valid ones: the ones used on names 
(like Günther, Alcôve, ...). 

Those should remain as-is.

Several Docbook/html converted documents contain UTF-8 NO-BREAK SPACE 
and other invisible chars, like the byte order mark (BOM) pointed
by Randy.

Those should be replaced (or removed for non-printable ones).

-

Now, there are other cases where I'm not sure if there's a
consensus:

1. UTF-8 is used where there's an ASCII similar (but with
   a different graph symbol), like:

	- UTF-8 commas;
	- UTF-8 hyphen chars, including the long ones:
	  FIGURE DASH, EN DASH, EM DASH

   IMO, those should also be converted.

2. Some UTF-8 symbols, like:

	- ® 
	- ™
	- ² - used mainly for I²C
	- …
	- ⬍ ↑ ↓   
	- µs - used for microsseconds

   I would keep those.

3. There are couple of places which uses UTF-8 graphic characters, like:

        /sys/devices/system/edac/
        ├── mc
        │   ├── mc0
        │   │   ├── ce_count
        │   │   ├── ce_noinfo_count

   This is the normal output of the "tree" command on machines with UTF-8.
   I would keep it. 

   Yet, iconv converts it into:

        /sys/devices/system/edac/
        +-- mc
        |   +-- mc0
        |   |   +-- ce_count
        |   |   +-- ce_noinfo_count

   which would also be fine. So, replacing those would be no-brain,
   but I probably newer documents will be written using such symbols. 

   So, I would preserve the UTF-8 graphics characters.

I'm preparing a patchset to address the UTF-8 issues on the top of
today's next, but before posting, it seems reasonable to discuss
what to do with the above cases. Comments?

Thanks,
Mauro

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-07  8:04                 ` Mauro Carvalho Chehab
@ 2021-05-07  8:35                   ` Michal Suchánek
  2021-05-07  8:56                     ` Markus Heiser
  2021-05-07  9:02                     ` Mauro Carvalho Chehab
  0 siblings, 2 replies; 41+ messages in thread
From: Michal Suchánek @ 2021-05-07  8:35 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Randy Dunlap, Matthew Wilcox, Markus Heiser, linux-doc, Jonathan Corbet

On Fri, May 07, 2021 at 10:04:35AM +0200, Mauro Carvalho Chehab wrote:
> Em Fri, 7 May 2021 08:39:24 +0200
> Mauro Carvalho Chehab <mchehab@kernel.org> escreveu:
> 
> > Em Thu, 6 May 2021 14:21:01 -0700
> > Randy Dunlap <rdunlap@infradead.org> escreveu:
> > 
> > > On 5/6/21 11:08 AM, Matthew Wilcox wrote:  
> > > > On Thu, May 06, 2021 at 10:57:53AM -0700, Randy Dunlap wrote:    
> > > >> I have been going thru some of the Documentation/ files...
> > > >>
> > > >> Why do several of the files begin with
> > > >> (hex) ef bb bf    followed by "=================="
> > > >> for a heading, instead of just "===================".
> > > >> See e.g. Documentation/timers/no_hz.rst.    
> > 
> > No idea! It seems that the text editor I used on that time added
> > it for whatever reason.
> > 
> > > > 
> > > > 00000000  ef bb bf 3d 3d 3d 3d 3d  3d 3d 3d 3d 3d 3d 3d 3d  |...=============|
> > > > 
> > > > ef bb bf is utf8 for 0b1111'111011'111111 = 0xFEFF which is the
> > > > https://en.wikipedia.org/wiki/Byte_order_mark
> > > > 
> > > > We should delete it.
> > > >     
> > > 
> > > OK, thanks, I have started on that.
> > > 
> > > 
> > > Just another question: ("inquiring minds want to know")
> > > 
> > > Why is/are some docs using U+2217 '*' instead of ASCII '*'?
> > > E.g., Documentation/block/cdrom-standard.rst.  
> > 
> > The cdrom doc is a very special case: it was originally written in LaTeX.
> > I don't remember any other document in LaTeX inside the Kernel docs during
> > the conversions I made. See:
> > 	e327cfcb2542 ("docs: cdrom-standard.tex: convert from LaTeX to ReST")
> > 
> > In order to convert it to .rst, I used some tool to first turn it
> > into plain text (probably LaTeX, but I don't remember anymore), and then
> > I manually reviewed the entire file, adding ReST tags where needed.
> > 
> > I didn't realize that utf-8 chars were used instead of normal ASCII chars,
> > as both appear the same when editing it[1].
> > 
> > [1] I use Fedora here. Fedora changed the default charset to utf-8 a long
> >     time ago.
> > 
> > Anyway, we should be able of get rid of weird UTF-8 chars from it with:
> > 
> > 	$ iconv -f utf-8 -t ascii//TRANSLIT Documentation/cdrom/cdrom-standard.rst
> > 
> > I'll prepare a patch fixing it. Some care should be taken, however, as
> > it has two places where UTF-8 chars should be used[2].
> > 
> > [2] There are two German person names that use UTF-8 chars:
> >     - 'o' + umlat;
> >     - a LATIN SMALL LETTER SHARP S (Eszett)
> 
> Btw, I did a quick check here: excluding translations, there are 182
> files with UTF-8 chars at next-20210429. It seems that most of them
> are on files that got converted from DocBook and html.
> 
> Several of them are valid ones: the ones used on names 
> (like Günther, Alcôve, ...). 

> 2. Some UTF-8 symbols, like:
> 
> 	- ® 
> 	- ™
> 	- ² - used mainly for I²C
> 	- …
> 	- ⬍ ↑ ↓   
> 	- µs - used for microsseconds

> 3. There are couple of places which uses UTF-8 graphic characters, like:
> 
>         /sys/devices/system/edac/
>         ├── mc
>         │   ├── mc0
>         │   │   ├── ce_count
>         │   │   ├── ce_noinfo_count

> I'm preparing a patchset to address the UTF-8 issues on the top of
> today's next, but before posting, it seems reasonable to discuss
> what to do with the above cases. Comments?

So the bottom line is that UTF-8 in the files will stay, and Sphinx
cannot handle UTF-8 when the locale is not UTF-8.

In the long run it might be nice to fix Sphinx to properly set the
encoding of the files it reads and writes. Or maybe there is some
parameter that specifies it?

For the short term I think it is reasonable to run a python test script
that prints fancy unicode characters before running Sphinx and bail if
the test script fails.

eg.
echo 'print("↑ᛏ个")' > test.py
python3 test.py

Thanks

Michal

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-06 18:06           ` Michal Suchánek
@ 2021-05-07  8:52             ` Mauro Carvalho Chehab
  0 siblings, 0 replies; 41+ messages in thread
From: Mauro Carvalho Chehab @ 2021-05-07  8:52 UTC (permalink / raw)
  To: Michal Suchánek; +Cc: Markus Heiser, linux-doc, Jonathan Corbet

Em Thu, 6 May 2021 20:06:25 +0200
Michal Suchánek <msuchanek@suse.de> escreveu:

> On Thu, May 06, 2021 at 07:53:25PM +0200, Markus Heiser wrote:

> > Hi Mauro,
> > 
> > it is not comfortable but is it mad? ..
> > 
> > Most often languages (or applications) do not handle encoding
> > of strings they just piping a binary stream while python
> > decode / encodes strings.
> > 
> > "The Zen of Python" [1] says
> > 
> >    Explicit is better than implicit.

This was taken into an extreme with regards to charsets:

	 "better" should never be translated to "crash" ;-)

> > If a stream can't encode symbols and these symbols should be ignored
> > you have to set the encoding of the stream explicit to ignore
> > such symbols.  
> 
> The problem is this part never happened. Loggers are supposed to tell
> you about the error in your application, not crash it.

It is insane to crash the error log due to a charset issue ;-)

> But the problem with Sphinx may be that the output file is also assumed
> to be in the locale encoding, and the output encoding is never set. It's
> HTML so it could be encoded with entities, too.
> 
> The idea about handlinng encoding precisely is not mad in itself but then
> everybody working with just ASCII and never testing their software works
> in the cases where explicit handling is needed is the mad part. 

True. The machine's locale shouldn't affect *at all* the produced
documents. See, there's a hole set of non-latin family of charsets
supported on Linux:

	https://man7.org/linux/man-pages/man7/charsets.7.html

Nothing prevents that someone using a machine whose default encoding is
KOI8-R/BIG-5/GB 2312/JIS X 0208/... to use Sphinx to produce 
UTF-8 [1] documents.

[1] or whatever other output encoding

Ok, the logger may not be able to correctly display certain
chars, but it it be perfectly fine and sane to use //TRANSLIT (or
something similar) in order to do a charset conversion. 

Even to just print a <?> for all chars that aren't printable at
the logger's output using the charset set by LANG/LC_* is 
better/saner than crashing.

Thanks,
Mauro

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-07  8:35                   ` Michal Suchánek
@ 2021-05-07  8:56                     ` Markus Heiser
  2021-05-07  9:14                       ` Mauro Carvalho Chehab
  2021-05-07  9:02                     ` Mauro Carvalho Chehab
  1 sibling, 1 reply; 41+ messages in thread
From: Markus Heiser @ 2021-05-07  8:56 UTC (permalink / raw)
  To: Michal Suchánek, Mauro Carvalho Chehab
  Cc: Randy Dunlap, Matthew Wilcox, linux-doc, Jonathan Corbet


Am 07.05.21 um 10:35 schrieb Michal Suchánek:
> So the bottom line is that UTF-8 in the files will stay, and Sphinx
> cannot handle UTF-8 when the locale is not UTF-8.
> 
> In the long run it might be nice to fix Sphinx to properly set the
> encoding of the files it reads and writes. Or maybe there is some
> parameter that specifies it?

Let's not mix things up. The Unicode-Error is not related or limited
to log nor to sphinx, it is related to the fact that we (you) try to
run a utf-8 application in an environment which is not full utf-8
functional.

> For the short term I think it is reasonable to run a python test script
> that prints fancy unicode characters before running Sphinx and bail if
> the test script fails.

To be assure, I recommend to set UTF-8 locale environment in the
Makefile.

My experience shows that this is the default with almost all
containers (images), there are only a few where this is not the
case (may be suse?).

   -- Markus --

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-07  8:35                   ` Michal Suchánek
  2021-05-07  8:56                     ` Markus Heiser
@ 2021-05-07  9:02                     ` Mauro Carvalho Chehab
  1 sibling, 0 replies; 41+ messages in thread
From: Mauro Carvalho Chehab @ 2021-05-07  9:02 UTC (permalink / raw)
  To: Michal Suchánek
  Cc: Randy Dunlap, Matthew Wilcox, Markus Heiser, linux-doc, Jonathan Corbet

Em Fri, 7 May 2021 10:35:27 +0200
Michal Suchánek <msuchanek@suse.de> escreveu:

> On Fri, May 07, 2021 at 10:04:35AM +0200, Mauro Carvalho Chehab wrote:
> > Em Fri, 7 May 2021 08:39:24 +0200
> > Mauro Carvalho Chehab <mchehab@kernel.org> escreveu:
> >   
> > > Em Thu, 6 May 2021 14:21:01 -0700
> > > Randy Dunlap <rdunlap@infradead.org> escreveu:
> > >   
> > > > On 5/6/21 11:08 AM, Matthew Wilcox wrote:    
> > > > > On Thu, May 06, 2021 at 10:57:53AM -0700, Randy Dunlap wrote:      
> > > > >> I have been going thru some of the Documentation/ files...
> > > > >>
> > > > >> Why do several of the files begin with
> > > > >> (hex) ef bb bf    followed by "=================="
> > > > >> for a heading, instead of just "===================".
> > > > >> See e.g. Documentation/timers/no_hz.rst.      
> > > 
> > > No idea! It seems that the text editor I used on that time added
> > > it for whatever reason.
> > >   
> > > > > 
> > > > > 00000000  ef bb bf 3d 3d 3d 3d 3d  3d 3d 3d 3d 3d 3d 3d 3d  |...=============|
> > > > > 
> > > > > ef bb bf is utf8 for 0b1111'111011'111111 = 0xFEFF which is the
> > > > > https://en.wikipedia.org/wiki/Byte_order_mark
> > > > > 
> > > > > We should delete it.
> > > > >       
> > > > 
> > > > OK, thanks, I have started on that.
> > > > 
> > > > 
> > > > Just another question: ("inquiring minds want to know")
> > > > 
> > > > Why is/are some docs using U+2217 '*' instead of ASCII '*'?
> > > > E.g., Documentation/block/cdrom-standard.rst.    
> > > 
> > > The cdrom doc is a very special case: it was originally written in LaTeX.
> > > I don't remember any other document in LaTeX inside the Kernel docs during
> > > the conversions I made. See:
> > > 	e327cfcb2542 ("docs: cdrom-standard.tex: convert from LaTeX to ReST")
> > > 
> > > In order to convert it to .rst, I used some tool to first turn it
> > > into plain text (probably LaTeX, but I don't remember anymore), and then
> > > I manually reviewed the entire file, adding ReST tags where needed.
> > > 
> > > I didn't realize that utf-8 chars were used instead of normal ASCII chars,
> > > as both appear the same when editing it[1].
> > > 
> > > [1] I use Fedora here. Fedora changed the default charset to utf-8 a long
> > >     time ago.
> > > 
> > > Anyway, we should be able of get rid of weird UTF-8 chars from it with:
> > > 
> > > 	$ iconv -f utf-8 -t ascii//TRANSLIT Documentation/cdrom/cdrom-standard.rst
> > > 
> > > I'll prepare a patch fixing it. Some care should be taken, however, as
> > > it has two places where UTF-8 chars should be used[2].
> > > 
> > > [2] There are two German person names that use UTF-8 chars:
> > >     - 'o' + umlat;
> > >     - a LATIN SMALL LETTER SHARP S (Eszett)  
> > 
> > Btw, I did a quick check here: excluding translations, there are 182
> > files with UTF-8 chars at next-20210429. It seems that most of them
> > are on files that got converted from DocBook and html.
> > 
> > Several of them are valid ones: the ones used on names 
> > (like Günther, Alcôve, ...).   
> 
> > 2. Some UTF-8 symbols, like:
> > 
> > 	- ® 
> > 	- ™
> > 	- ² - used mainly for I²C
> > 	- …
> > 	- ⬍ ↑ ↓   
> > 	- µs - used for microsseconds  
> 
> > 3. There are couple of places which uses UTF-8 graphic characters, like:
> > 
> >         /sys/devices/system/edac/
> >         ├── mc
> >         │   ├── mc0
> >         │   │   ├── ce_count
> >         │   │   ├── ce_noinfo_count  
> 
> > I'm preparing a patchset to address the UTF-8 issues on the top of
> > today's next, but before posting, it seems reasonable to discuss
> > what to do with the above cases. Comments?  
> 
> So the bottom line is that UTF-8 in the files will stay, and Sphinx
> cannot handle UTF-8 when the locale is not UTF-8.

Yes. We can reduce the number of UTF-8, but some documents need more
chars than ASCII can provide.

Btw, probably (almost?) all files under Documentation/translation use
UTF-8 charsets, due to obvious reasons.

> In the long run it might be nice to fix Sphinx to properly set the
> encoding of the files it reads and writes. 

Agreed.

> Or maybe there is some parameter that specifies it?
> 
> For the short term I think it is reasonable to run a python test script
> that prints fancy unicode characters before running Sphinx and bail if
> the test script fails.
> 
> eg.
> echo 'print("↑ᛏ个")' > test.py
> python3 test.py

Actually, a better workaround could be introduced at conf.py. This 
file is read/parsed by Sphinx on an early stage.

Something could be added there that would detect if the machine's
charset is not UTF-8 and either produce a warning before starts
building or would change the charset used by python to something
that won't crash with utf-8.

Thanks,
Mauro

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-07  8:56                     ` Markus Heiser
@ 2021-05-07  9:14                       ` Mauro Carvalho Chehab
  2021-05-07  9:51                         ` Markus Heiser
  0 siblings, 1 reply; 41+ messages in thread
From: Mauro Carvalho Chehab @ 2021-05-07  9:14 UTC (permalink / raw)
  To: Markus Heiser
  Cc: Michal Suchánek, Randy Dunlap, Matthew Wilcox, linux-doc,
	Jonathan Corbet

Em Fri, 7 May 2021 10:56:39 +0200
Markus Heiser <markus.heiser@darmarit.de> escreveu:

> Am 07.05.21 um 10:35 schrieb Michal Suchánek:
> > So the bottom line is that UTF-8 in the files will stay, and Sphinx
> > cannot handle UTF-8 when the locale is not UTF-8.
> > 
> > In the long run it might be nice to fix Sphinx to properly set the
> > encoding of the files it reads and writes. Or maybe there is some
> > parameter that specifies it?  
> 
> Let's not mix things up. The Unicode-Error is not related or limited
> to log nor to sphinx, it is related to the fact that we (you) try to
> run a utf-8 application in an environment which is not full utf-8
> functional.

No. The application itself is not UTF-8. The application input files are.

The big issue with the way python works with charsets is due to that:
it does a very poor job with regards to that.

I remember that in the past I had to use this quite often
(before UTF-8 being default on the distros I was using on that time):

	LANG=C <some_python_script>

Just to avoid them to crash.

If I'm not mistaken, older Fedora/Mandrake distros had some bugs with
python-written scripts that, if the machine's language were not
English, such scripts crash, as the i18n translated messages were
on a different charset than what the python script would be expecting.

> > For the short term I think it is reasonable to run a python test script
> > that prints fancy unicode characters before running Sphinx and bail if
> > the test script fails.  
> 
> To be assure, I recommend to set UTF-8 locale environment in the
> Makefile.
> 
> My experience shows that this is the default with almost all
> containers (images), there are only a few where this is not the
> case (may be suse?).

That may not be true on certain parts of the globe.

I've no idea what charsets the most-used distributions in Asian
Countries use use ;-)

Thanks,
Mauro

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-07  9:14                       ` Mauro Carvalho Chehab
@ 2021-05-07  9:51                         ` Markus Heiser
  2021-05-07 10:29                           ` Michal Suchánek
  0 siblings, 1 reply; 41+ messages in thread
From: Markus Heiser @ 2021-05-07  9:51 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Michal Suchánek, Randy Dunlap, Matthew Wilcox, linux-doc,
	Jonathan Corbet

Am 07.05.21 um 11:14 schrieb Mauro Carvalho Chehab:
> Em Fri, 7 May 2021 10:56:39 +0200
> Markus Heiser <markus.heiser@darmarit.de> escreveu:
> 
>> Am 07.05.21 um 10:35 schrieb Michal Suchánek:
>>> So the bottom line is that UTF-8 in the files will stay, and Sphinx
>>> cannot handle UTF-8 when the locale is not UTF-8.
>>>
>>> In the long run it might be nice to fix Sphinx to properly set the
>>> encoding of the files it reads and writes. Or maybe there is some
>>> parameter that specifies it?
>>
>> Let's not mix things up. The Unicode-Error is not related or limited
>> to log nor to sphinx, it is related to the fact that we (you) try to
>> run a utf-8 application in an environment which is not full utf-8
>> functional.
> 
> No. The application itself is not UTF-8. The application input files are.

May be we have a different view on this, for me an application which
reads UTF-8 in and spids out UTF-8 is an UTF-8 application.

hint: HTML is just one Sphinx writer, there exist also other writers
e.g. LaTeX.

> The big issue with the way python works with charsets is due to that:
> it does a very poor job with regards to that.

This is your POV, the python developers have a different view on
handling strings.  There are epic discussions around about.

But all this discussions won't help, since we can't change the
principles of python.

Personally I think I can't ignore the principles of a language
and I'm feeling well with setting up an UTF-8 environment.

> I remember that in the past I had to use this quite often
> (before UTF-8 being default on the distros I was using on that time):
> 
> 	LANG=C <some_python_script>
> 
> Just to avoid them to crash.
> 
> If I'm not mistaken, older Fedora/Mandrake distros had some bugs with
> python-written scripts that, if the machine's language were not
> English, such scripts crash, as the i18n translated messages were
> on a different charset than what the python script would be expecting.

For me "i18n translated message" is a good example that I'm not
wrong with my opinions.  This is not true for all devices but
on those device you won't run an applications like Sphinx.

>>> For the short term I think it is reasonable to run a python test script
>>> that prints fancy unicode characters before running Sphinx and bail if
>>> the test script fails.
>>
>> To be assure, I recommend to set UTF-8 locale environment in the
>> Makefile.
>>
>> My experience shows that this is the default with almost all
>> containers (images), there are only a few where this is not the
>> case (may be suse?).
> 
> That may not be true on certain parts of the globe.

Sorry, I have spoken about common LXC images.

> I've no idea what charsets the most-used distributions in Asian
> Countries use use ;-)

I guess these days most often they will use UTF-8 since ASCII
haven't helped in the past 80s ;-)

   -- Markus --

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-07  9:51                         ` Markus Heiser
@ 2021-05-07 10:29                           ` Michal Suchánek
  0 siblings, 0 replies; 41+ messages in thread
From: Michal Suchánek @ 2021-05-07 10:29 UTC (permalink / raw)
  To: Markus Heiser
  Cc: Mauro Carvalho Chehab, Randy Dunlap, Matthew Wilcox, linux-doc,
	Jonathan Corbet

On Fri, May 07, 2021 at 11:51:47AM +0200, Markus Heiser wrote:
> Am 07.05.21 um 11:14 schrieb Mauro Carvalho Chehab:
> > Em Fri, 7 May 2021 10:56:39 +0200
> > Markus Heiser <markus.heiser@darmarit.de> escreveu:
> > 
> > > Am 07.05.21 um 10:35 schrieb Michal Suchánek:
> > > > So the bottom line is that UTF-8 in the files will stay, and Sphinx
> > > > cannot handle UTF-8 when the locale is not UTF-8.
> > > > 
> > > > In the long run it might be nice to fix Sphinx to properly set the
> > > > encoding of the files it reads and writes. Or maybe there is some
> > > > parameter that specifies it?
> > > 
> > > Let's not mix things up. The Unicode-Error is not related or limited
> > > to log nor to sphinx, it is related to the fact that we (you) try to
> > > run a utf-8 application in an environment which is not full utf-8
> > > functional.
> > 
> > No. The application itself is not UTF-8. The application input files are.
> 
> May be we have a different view on this, for me an application which
> reads UTF-8 in and spids out UTF-8 is an UTF-8 application.
> 
> hint: HTML is just one Sphinx writer, there exist also other writers
> e.g. LaTeX.

And same as the browser can display HTML documents in pretty much any
character set independently of your system locale Sphinx should be able
to produce those for your browser to display independent of the system
locale. Same for LaTeX, PDF, or whatver else.

> > The big issue with the way python works with charsets is due to that:
> > it does a very poor job with regards to that.
> 
> This is your POV, the python developers have a different view on
> handling strings.  There are epic discussions around about.
> 
> But all this discussions won't help, since we can't change the
> principles of python.

It has nothing to do with python developer POV on handling strings or
principles of python.

The python support for handling strings is complete in the sense it does
not depend on the system locale and can handle strings in multiple
charcter sets. Sphinx as program written in python could handle
documents in any encoding supported by python independent of system
locale if Sphinx developers bothered to use the python encoding support
correctly. Apparently they did not.

> 
> Personally I think I can't ignore the principles of a language
> and I'm feeling well with setting up an UTF-8 environment.
> 
> > I remember that in the past I had to use this quite often
> > (before UTF-8 being default on the distros I was using on that time):
> > 
> > 	LANG=C <some_python_script>
> > 
> > Just to avoid them to crash.
> > 
> > If I'm not mistaken, older Fedora/Mandrake distros had some bugs with
> > python-written scripts that, if the machine's language were not
> > English, such scripts crash, as the i18n translated messages were
> > on a different charset than what the python script would be expecting.
> 
> For me "i18n translated message" is a good example that I'm not
> wrong with my opinions.  This is not true for all devices but
> on those device you won't run an applications like Sphinx.

Or it's a good example of people never testing the application for the
case where explicit handling is required, and possibly one of the
reasons more requirements for explicit handling of the encoding were
added. In the end it merely led to changing from universal ASCII
encoding to universal UTF-8 encoding with no support for running python
scripts in any locale that does not use the 'universal' encoding.

I think that the idea was to make scripts resilient to encoding errors
and prevent data corruption by raising an exception when mishandling of
encoding is detected but instead of handling the exceptions people just
punted to using the same encoding all the time.

Thanks

Michal

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-07  6:39               ` Mauro Carvalho Chehab
  2021-05-07  6:49                 ` Randy Dunlap
  2021-05-07  8:04                 ` Mauro Carvalho Chehab
@ 2021-05-08  9:22                 ` Mauro Carvalho Chehab
  2021-05-08 10:41                   ` Michal Suchánek
  2 siblings, 1 reply; 41+ messages in thread
From: Mauro Carvalho Chehab @ 2021-05-08  9:22 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Matthew Wilcox, Markus Heiser, Michal Suchánek, linux-doc,
	Jonathan Corbet

Em Fri, 7 May 2021 08:39:24 +0200
Mauro Carvalho Chehab <mchehab@kernel.org> escreveu:

> Em Thu, 6 May 2021 14:21:01 -0700
> Randy Dunlap <rdunlap@infradead.org> escreveu:
> 
> > On 5/6/21 11:08 AM, Matthew Wilcox wrote:  
> > > On Thu, May 06, 2021 at 10:57:53AM -0700, Randy Dunlap wrote:    
> > >> I have been going thru some of the Documentation/ files...
> > >>
> > >> Why do several of the files begin with
> > >> (hex) ef bb bf    followed by "=================="
> > >> for a heading, instead of just "===================".
> > >> See e.g. Documentation/timers/no_hz.rst.    
> 
> No idea! It seems that the text editor I used on that time added
> it for whatever reason.

> I'll prepare a patch fixing it. Some care should be taken, however, as
> it has two places where UTF-8 chars should be used[2].

Ok, I did a small script in order to check what special chars we
currently have (next-20210507) at Documentation/ excluding the
translations.

Based on my script results, we have those groups:

1. Latin accented characters:
	- U+00c7 (LATIN CAPITAL LETTER C WITH CEDILLA) (Ç)
	- U+00df (LATIN SMALL LETTER SHARP S) (ß)
	- U+00e1 (LATIN SMALL LETTER A WITH ACUTE) (á)
	- U+00e4 (LATIN SMALL LETTER A WITH DIAERESIS) (ä)
	- U+00e6 (LATIN SMALL LETTER AE) (æ)
	- U+00e7 (LATIN SMALL LETTER C WITH CEDILLA) (ç)
	- U+00e9 (LATIN SMALL LETTER E WITH ACUTE) (é)
	- U+00ea (LATIN SMALL LETTER E WITH CIRCUMFLEX) (ê)
	- U+00eb (LATIN SMALL LETTER E WITH DIAERESIS) (ë)
	- U+00f3 (LATIN SMALL LETTER O WITH ACUTE) (ó)
	- U+00f4 (LATIN SMALL LETTER O WITH CIRCUMFLEX) (ô)
	- U+00f6 (LATIN SMALL LETTER O WITH DIAERESIS) (ö)
	- U+00f8 (LATIN SMALL LETTER O WITH STROKE) (ø)
	- U+00fc (LATIN SMALL LETTER U WITH DIAERESIS) (ü)
	- U+011f (LATIN SMALL LETTER G WITH BREVE) (ğ)
	- U+0142 (LATIN SMALL LETTER L WITH STROKE) (ł)

2. symbols:
	- U+00a9 (COPYRIGHT SIGN) (©)
	- U+2122 (TRADE MARK SIGN) (™)
	- U+00ae (REGISTERED SIGN) (®)
	- U+00b0 (DEGREE SIGN) (°)
	- U+00b1 (PLUS-MINUS SIGN) (±)
	- U+00b2 (SUPERSCRIPT TWO) (²)
	- U+00b5 (MICRO SIGN) (µ)
	- U+00bd (VULGAR FRACTION ONE HALF) (½)
	- U+2026 (HORIZONTAL ELLIPSIS) (…)

3. arrows:
	- U+2191 (UPWARDS ARROW) (↑)
	- U+2192 (RIGHTWARDS ARROW) (→)
	- U+2193 (DOWNWARDS ARROW) (↓)
	- U+2b0d (UP DOWN BLACK ARROW) (⬍)

4. box drawings:
	- U+2500 (BOX DRAWINGS LIGHT HORIZONTAL) (─)
	- U+2502 (BOX DRAWINGS LIGHT VERTICAL) (│)
	- U+2514 (BOX DRAWINGS LIGHT UP AND RIGHT) (└)
	- U+251c (BOX DRAWINGS LIGHT VERTICAL AND RIGHT) (├)

5. math symbols:
	- U+00b7 (MIDDLE DOT) (·)
	- U+00d7 (MULTIPLICATION SIGN) (×)
	- U+2212 (MINUS SIGN) (−)
	- U+2217 (ASTERISK OPERATOR) (∗)
	- U+223c (TILDE OPERATOR) (∼)
	- U+2264 (LESS-THAN OR EQUAL TO) (≤)
	- U+2265 (GREATER-THAN OR EQUAL TO) (≥)
	- U+27e8 (MATHEMATICAL LEFT ANGLE BRACKET) (⟨)
	- U+27e9 (MATHEMATICAL RIGHT ANGLE BRACKET) (⟩)
	- U+00ac (NOT SIGN) (¬)

6. commas:
	- U+00b4 (ACUTE ACCENT) (´)
	- U+2018 (LEFT SINGLE QUOTATION MARK) (‘)
	- U+2019 (RIGHT SINGLE QUOTATION MARK) (’)
	- U+201c (LEFT DOUBLE QUOTATION MARK) (“)
	- U+201d (RIGHT DOUBLE QUOTATION MARK) (”)

7. spaces:
	- U+00a0 (NO-BREAK SPACE) ( )
	- U+feff (ZERO WIDTH NO-BREAK SPACE) ()

8. dashes and hyphen:
	- U+2010 (HYPHEN) (‐)
	- U+2013 (EN DASH) (–)
	- U+2014 (EM DASH) (—)

I would keep (1) to (5), replacing just:
	- commas;
	- spaces;
	- dashes and hyphen.

Comments?

Thanks,
Mauro

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-08  9:22                 ` Mauro Carvalho Chehab
@ 2021-05-08 10:41                   ` Michal Suchánek
  2021-05-08 14:41                     ` Mauro Carvalho Chehab
  0 siblings, 1 reply; 41+ messages in thread
From: Michal Suchánek @ 2021-05-08 10:41 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Randy Dunlap, Matthew Wilcox, Markus Heiser, linux-doc, Jonathan Corbet

On Sat, May 08, 2021 at 11:22:05AM +0200, Mauro Carvalho Chehab wrote:
> Em Fri, 7 May 2021 08:39:24 +0200
> Mauro Carvalho Chehab <mchehab@kernel.org> escreveu:
> 
> > Em Thu, 6 May 2021 14:21:01 -0700
> > Randy Dunlap <rdunlap@infradead.org> escreveu:
> > 
> > > On 5/6/21 11:08 AM, Matthew Wilcox wrote:  
> > > > On Thu, May 06, 2021 at 10:57:53AM -0700, Randy Dunlap wrote:    
> > > >> I have been going thru some of the Documentation/ files...
> > > >>
> > > >> Why do several of the files begin with
> > > >> (hex) ef bb bf    followed by "=================="
> > > >> for a heading, instead of just "===================".
> > > >> See e.g. Documentation/timers/no_hz.rst.    
> > 
> > No idea! It seems that the text editor I used on that time added
> > it for whatever reason.
> 
> > I'll prepare a patch fixing it. Some care should be taken, however, as
> > it has two places where UTF-8 chars should be used[2].
> 
> Ok, I did a small script in order to check what special chars we
> currently have (next-20210507) at Documentation/ excluding the
> translations.
> 
> Based on my script results, we have those groups:
> 
> 1. Latin accented characters:
> 	- U+00c7 (LATIN CAPITAL LETTER C WITH CEDILLA) (Ç)
> 	- U+00df (LATIN SMALL LETTER SHARP S) (ß)
> 	- U+00e1 (LATIN SMALL LETTER A WITH ACUTE) (á)
> 	- U+00e4 (LATIN SMALL LETTER A WITH DIAERESIS) (ä)
> 	- U+00e6 (LATIN SMALL LETTER AE) (æ)
> 	- U+00e7 (LATIN SMALL LETTER C WITH CEDILLA) (ç)
> 	- U+00e9 (LATIN SMALL LETTER E WITH ACUTE) (é)
> 	- U+00ea (LATIN SMALL LETTER E WITH CIRCUMFLEX) (ê)
> 	- U+00eb (LATIN SMALL LETTER E WITH DIAERESIS) (ë)
> 	- U+00f3 (LATIN SMALL LETTER O WITH ACUTE) (ó)
> 	- U+00f4 (LATIN SMALL LETTER O WITH CIRCUMFLEX) (ô)
> 	- U+00f6 (LATIN SMALL LETTER O WITH DIAERESIS) (ö)
> 	- U+00f8 (LATIN SMALL LETTER O WITH STROKE) (ø)
> 	- U+00fc (LATIN SMALL LETTER U WITH DIAERESIS) (ü)
> 	- U+011f (LATIN SMALL LETTER G WITH BREVE) (ğ)
> 	- U+0142 (LATIN SMALL LETTER L WITH STROKE) (ł)
> 
> 2. symbols:
> 	- U+00a9 (COPYRIGHT SIGN) (©)
> 	- U+2122 (TRADE MARK SIGN) (™)
> 	- U+00ae (REGISTERED SIGN) (®)
> 	- U+00b0 (DEGREE SIGN) (°)
> 	- U+00b1 (PLUS-MINUS SIGN) (±)
> 	- U+00b2 (SUPERSCRIPT TWO) (²)
> 	- U+00b5 (MICRO SIGN) (µ)
> 	- U+00bd (VULGAR FRACTION ONE HALF) (½)
> 	- U+2026 (HORIZONTAL ELLIPSIS) (…)
> 
> 3. arrows:
> 	- U+2191 (UPWARDS ARROW) (↑)
> 	- U+2192 (RIGHTWARDS ARROW) (→)
> 	- U+2193 (DOWNWARDS ARROW) (↓)
> 	- U+2b0d (UP DOWN BLACK ARROW) (⬍)
> 
> 4. box drawings:
> 	- U+2500 (BOX DRAWINGS LIGHT HORIZONTAL) (─)
> 	- U+2502 (BOX DRAWINGS LIGHT VERTICAL) (│)
> 	- U+2514 (BOX DRAWINGS LIGHT UP AND RIGHT) (└)
> 	- U+251c (BOX DRAWINGS LIGHT VERTICAL AND RIGHT) (├)
> 
> 5. math symbols:
> 	- U+00b7 (MIDDLE DOT) (·)
> 	- U+00d7 (MULTIPLICATION SIGN) (×)
> 	- U+2212 (MINUS SIGN) (−)
> 	- U+2217 (ASTERISK OPERATOR) (∗)
> 	- U+223c (TILDE OPERATOR) (∼)
> 	- U+2264 (LESS-THAN OR EQUAL TO) (≤)
> 	- U+2265 (GREATER-THAN OR EQUAL TO) (≥)
> 	- U+27e8 (MATHEMATICAL LEFT ANGLE BRACKET) (⟨)
> 	- U+27e9 (MATHEMATICAL RIGHT ANGLE BRACKET) (⟩)
> 	- U+00ac (NOT SIGN) (¬)

Clearly his is supposed to be ASCII tilde:
Documentation/cdrom/cdrom-standard.rst: if (cdo->capability & ∼cdi->mask & CDC _⟨capability⟩)

Use of ¬ is also very dubious in documentation (in fonts it is understandable):
Documentation/ABI/obsolete/sysfs-kernel-fadump_registered:This ABI is renamed and moved to a new location /sys/kernel/fadump/registered.¬
Documentation/ABI/obsolete/sysfs-kernel-fadump_release_mem:This ABI is renamed and moved to a new location /sys/kernel/fadump/release_mem.¬
Documentation/powerpc/transactional_memory.rst:  if (MSR 29:31 ¬ = 0b010 | SRR1 29:31 ¬ = 0b000) then


The use of − is rare can could be replaed with ASCII hyphen-minus entirely
without making the text harder to understand:

Documentation/devicetree/bindings/iio/adc/adi,ad7124.yaml:          0: REFIN1(+)/REFIN1(−).
Documentation/devicetree/bindings/iio/adc/adi,ad7124.yaml:          1: REFIN2(+)/REFIN2(−).
Documentation/devicetree/bindings/iio/adc/adi,ad7192.yaml:      External reference applied between the P1/REFIN2(+) and P0/REFIN2(−) pins.
Documentation/scheduler/sched-deadline.rst:     ((M − 1) · WCET_max − WCET_min)/(M − (M − 2) · U_max) + WCET_max
drivers/gpu/drm/drm_color_mgmt.c: * - range: [-2^2, 2^2 - 2^−15]
drivers/iio/light/tsl2583.c:                     * sheet (TAOS134 − MARCH 2011):
drivers/staging/iio/adc/ad7280a.c:       *                         (Number of Conversions per Part)) −
sound/soc/codecs/sgtl5000.c: * is the array index and the following formula: 10^((idx−15)/40) * 100

Asterisk operator is clearly meant to be ASCII:
Documentation/cdrom/cdrom-standard.rst:         NULL,                   /∗ lseek ∗/
Documentation/cdrom/cdrom-standard.rst:         block _read ,           /∗ read—general block-dev read ∗/
Documentation/cdrom/cdrom-standard.rst:         block _write,           /∗ write—general block-dev write ∗/
Documentation/cdrom/cdrom-standard.rst:         NULL,                   /∗ readdir ∗/
Documentation/cdrom/cdrom-standard.rst:         NULL,                   /∗ select ∗/
Documentation/cdrom/cdrom-standard.rst:         cdrom_ioctl,            /∗ ioctl ∗/
Documentation/cdrom/cdrom-standard.rst:         NULL,                   /∗ mmap ∗/
Documentation/cdrom/cdrom-standard.rst:         cdrom_open,             /∗ open ∗/
Documentation/cdrom/cdrom-standard.rst:         cdrom_release,          /∗ release ∗/
Documentation/cdrom/cdrom-standard.rst:         NULL,                   /∗ fsync ∗/
Documentation/cdrom/cdrom-standard.rst:         NULL,                   /∗ fasync ∗/
Documentation/cdrom/cdrom-standard.rst:         NULL                    /∗ revalidate ∗/
Documentation/filesystems/ext4/blockgroup.rst:filesystem size to 2^21 ∗ 2^27 = 2^48bytes or 256TiB.

There is only one place where ⟨⟩ is used which is very dubious:
Documentation/cdrom/cdrom-standard.rst: if (cdo->capability & ∼cdi->mask & CDC _⟨capability⟩) ...

The middle dot is mostly used in mathmatical formulas that would be
unintelligible otherwise but there are a few odd uses:
Documentation/ABI/testing/sysfs-module:KernelVersion:»·3.3
Documentation/ABI/testing/sysfs-module:KernelVersion:»·3.3
Documentation/devicetree/bindings/clock/qcom,rpmcc.txt:                 "qcom,rpmcc-msm8992",·"qcom,rpmcc"
Documentation/devicetree/bindings/clock/qcom,rpmcc.txt:                 "qcom,rpmcc-msm8994",·"qcom,rpmcc"
Documentation/translations/zh_CN/kernel-hacking/hacking.rst:    阿列克谢·库兹涅佐夫享用的糟糕伏特加有关。
Documentation/translations/zh_CN/process/howto.rst:   《C程序设计语言(第2版·新版)》(徐宝文 李志 译)[机械工业出版社]
Documentation/translations/zh_CN/process/management-style.rst:.. [#cnf2] 保罗·西蒙演唱了“离开爱人的50种方法”,因为坦率地说,“告诉开发者

The × ≤ and ≥ uses look fine.

Thanks

Michal

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-08 10:41                   ` Michal Suchánek
@ 2021-05-08 14:41                     ` Mauro Carvalho Chehab
  2021-05-08 15:55                       ` Randy Dunlap
  0 siblings, 1 reply; 41+ messages in thread
From: Mauro Carvalho Chehab @ 2021-05-08 14:41 UTC (permalink / raw)
  To: Michal Suchánek
  Cc: Randy Dunlap, Matthew Wilcox, Markus Heiser, linux-doc, Jonathan Corbet

Em Sat, 8 May 2021 12:41:57 +0200
Michal Suchánek <msuchanek@suse.de> escreveu:

> On Sat, May 08, 2021 at 11:22:05AM +0200, Mauro Carvalho Chehab wrote:
> > Em Fri, 7 May 2021 08:39:24 +0200
> > Mauro Carvalho Chehab <mchehab@kernel.org> escreveu:
> >   
> > > Em Thu, 6 May 2021 14:21:01 -0700
> > > Randy Dunlap <rdunlap@infradead.org> escreveu:
> > >   
> > > > On 5/6/21 11:08 AM, Matthew Wilcox wrote:    
> > > > > On Thu, May 06, 2021 at 10:57:53AM -0700, Randy Dunlap wrote:      
> > > > >> I have been going thru some of the Documentation/ files...
> > > > >>
> > > > >> Why do several of the files begin with
> > > > >> (hex) ef bb bf    followed by "=================="
> > > > >> for a heading, instead of just "===================".
> > > > >> See e.g. Documentation/timers/no_hz.rst.      
> > > 
> > > No idea! It seems that the text editor I used on that time added
> > > it for whatever reason.  
> >   
> > > I'll prepare a patch fixing it. Some care should be taken, however, as
> > > it has two places where UTF-8 chars should be used[2].  
> > 
> > Ok, I did a small script in order to check what special chars we
> > currently have (next-20210507) at Documentation/ excluding the
> > translations.
> > 
> > Based on my script results, we have those groups:
> > 
> > 1. Latin accented characters:
> > 	- U+00c7 (LATIN CAPITAL LETTER C WITH CEDILLA) (Ç)
> > 	- U+00df (LATIN SMALL LETTER SHARP S) (ß)
> > 	- U+00e1 (LATIN SMALL LETTER A WITH ACUTE) (á)
> > 	- U+00e4 (LATIN SMALL LETTER A WITH DIAERESIS) (ä)
> > 	- U+00e6 (LATIN SMALL LETTER AE) (æ)
> > 	- U+00e7 (LATIN SMALL LETTER C WITH CEDILLA) (ç)
> > 	- U+00e9 (LATIN SMALL LETTER E WITH ACUTE) (é)
> > 	- U+00ea (LATIN SMALL LETTER E WITH CIRCUMFLEX) (ê)
> > 	- U+00eb (LATIN SMALL LETTER E WITH DIAERESIS) (ë)
> > 	- U+00f3 (LATIN SMALL LETTER O WITH ACUTE) (ó)
> > 	- U+00f4 (LATIN SMALL LETTER O WITH CIRCUMFLEX) (ô)
> > 	- U+00f6 (LATIN SMALL LETTER O WITH DIAERESIS) (ö)
> > 	- U+00f8 (LATIN SMALL LETTER O WITH STROKE) (ø)
> > 	- U+00fc (LATIN SMALL LETTER U WITH DIAERESIS) (ü)
> > 	- U+011f (LATIN SMALL LETTER G WITH BREVE) (ğ)
> > 	- U+0142 (LATIN SMALL LETTER L WITH STROKE) (ł)
> > 
> > 2. symbols:
> > 	- U+00a9 (COPYRIGHT SIGN) (©)
> > 	- U+2122 (TRADE MARK SIGN) (™)
> > 	- U+00ae (REGISTERED SIGN) (®)
> > 	- U+00b0 (DEGREE SIGN) (°)
> > 	- U+00b1 (PLUS-MINUS SIGN) (±)
> > 	- U+00b2 (SUPERSCRIPT TWO) (²)
> > 	- U+00b5 (MICRO SIGN) (µ)
> > 	- U+00bd (VULGAR FRACTION ONE HALF) (½)
> > 	- U+2026 (HORIZONTAL ELLIPSIS) (…)
> > 
> > 3. arrows:
> > 	- U+2191 (UPWARDS ARROW) (↑)
> > 	- U+2192 (RIGHTWARDS ARROW) (→)
> > 	- U+2193 (DOWNWARDS ARROW) (↓)
> > 	- U+2b0d (UP DOWN BLACK ARROW) (⬍)
> > 
> > 4. box drawings:
> > 	- U+2500 (BOX DRAWINGS LIGHT HORIZONTAL) (─)
> > 	- U+2502 (BOX DRAWINGS LIGHT VERTICAL) (│)
> > 	- U+2514 (BOX DRAWINGS LIGHT UP AND RIGHT) (└)
> > 	- U+251c (BOX DRAWINGS LIGHT VERTICAL AND RIGHT) (├)
> > 
> > 5. math symbols:
> > 	- U+00b7 (MIDDLE DOT) (·)
> > 	- U+00d7 (MULTIPLICATION SIGN) (×)
> > 	- U+2212 (MINUS SIGN) (−)
> > 	- U+2217 (ASTERISK OPERATOR) (∗)
> > 	- U+223c (TILDE OPERATOR) (∼)
> > 	- U+2264 (LESS-THAN OR EQUAL TO) (≤)
> > 	- U+2265 (GREATER-THAN OR EQUAL TO) (≥)
> > 	- U+27e8 (MATHEMATICAL LEFT ANGLE BRACKET) (⟨)
> > 	- U+27e9 (MATHEMATICAL RIGHT ANGLE BRACKET) (⟩)
> > 	- U+00ac (NOT SIGN) (¬)  
> 

Hi Michal,

> Clearly his is supposed to be ASCII tilde:
> Documentation/cdrom/cdrom-standard.rst: if (cdo->capability & ∼cdi->mask & CDC _⟨capability⟩)

Yes, for this specific file, iconv //translit should solve everything.

In the case of cdrom-standard, those came from the LaTeX conversion.
> 
> Use of ¬ is also very dubious in documentation (in fonts it is understandable):
> Documentation/ABI/obsolete/sysfs-kernel-fadump_registered:This ABI is renamed and moved to a new location /sys/kernel/fadump/registered.¬
> Documentation/ABI/obsolete/sysfs-kernel-fadump_release_mem:This ABI is renamed and moved to a new location /sys/kernel/fadump/release_mem.¬


> Documentation/powerpc/transactional_memory.rst:  if (MSR 29:31 ¬ = 0b010 | SRR1 29:31 ¬ = 0b000) then

Yeah, this should probably be better written as:

  if (MSR 29:31 == 0b010 | SRR1 29:31 == 0b000) then

> The use of − is rare can could be replaed with ASCII hyphen-minus entirely
> without making the text harder to understand:
> 
> Documentation/devicetree/bindings/iio/adc/adi,ad7124.yaml:          0: REFIN1(+)/REFIN1(−).
> Documentation/devicetree/bindings/iio/adc/adi,ad7124.yaml:          1: REFIN2(+)/REFIN2(−).
> Documentation/devicetree/bindings/iio/adc/adi,ad7192.yaml:      External reference applied between the P1/REFIN2(+) and P0/REFIN2(−) pins.
> Documentation/scheduler/sched-deadline.rst:     ((M − 1) · WCET_max − WCET_min)/(M − (M − 2) · U_max) + WCET_max
> drivers/gpu/drm/drm_color_mgmt.c: * - range: [-2^2, 2^2 - 2^−15]
> drivers/iio/light/tsl2583.c:                     * sheet (TAOS134 − MARCH 2011):
> drivers/staging/iio/adc/ad7280a.c:       *                         (Number of Conversions per Part)) −
> sound/soc/codecs/sgtl5000.c: * is the array index and the following formula: 10^((idx−15)/40) * 100

Agreed.

> Asterisk operator is clearly meant to be ASCII:
> Documentation/cdrom/cdrom-standard.rst:         NULL,                   /∗ lseek ∗/
> Documentation/cdrom/cdrom-standard.rst:         block _read ,           /∗ read—general block-dev read ∗/
> Documentation/cdrom/cdrom-standard.rst:         block _write,           /∗ write—general block-dev write ∗/
> Documentation/cdrom/cdrom-standard.rst:         NULL,                   /∗ readdir ∗/
> Documentation/cdrom/cdrom-standard.rst:         NULL,                   /∗ select ∗/
> Documentation/cdrom/cdrom-standard.rst:         cdrom_ioctl,            /∗ ioctl ∗/
> Documentation/cdrom/cdrom-standard.rst:         NULL,                   /∗ mmap ∗/
> Documentation/cdrom/cdrom-standard.rst:         cdrom_open,             /∗ open ∗/
> Documentation/cdrom/cdrom-standard.rst:         cdrom_release,          /∗ release ∗/
> Documentation/cdrom/cdrom-standard.rst:         NULL,                   /∗ fsync ∗/
> Documentation/cdrom/cdrom-standard.rst:         NULL,                   /∗ fasync ∗/
> Documentation/cdrom/cdrom-standard.rst:         NULL                    /∗ revalidate ∗/
> Documentation/filesystems/ext4/blockgroup.rst:filesystem size to 2^21 ∗ 2^27 = 2^48bytes or 256TiB.
> 
> There is only one place where ⟨⟩ is used which is very dubious:
> Documentation/cdrom/cdrom-standard.rst: if (cdo->capability & ∼cdi->mask & CDC _⟨capability⟩) ...

Yeah. Again, this was due to LaTeX to text conversion.

> The middle dot is mostly used in mathmatical formulas that would be
> unintelligible otherwise but there are a few odd uses:
> Documentation/ABI/testing/sysfs-module:KernelVersion:»·3.3
> Documentation/ABI/testing/sysfs-module:KernelVersion:»·3.3
> Documentation/devicetree/bindings/clock/qcom,rpmcc.txt:                 "qcom,rpmcc-msm8992",·"qcom,rpmcc"
> Documentation/devicetree/bindings/clock/qcom,rpmcc.txt:                 "qcom,rpmcc-msm8994",·"qcom,rpmcc"

Yeah. It sounds that space would be the best replacement there.

> Documentation/translations/zh_CN/kernel-hacking/hacking.rst:    阿列克谢·库兹涅佐夫享用的糟糕伏特加有关。
> Documentation/translations/zh_CN/process/howto.rst:   《C程序设计语言(第2版·新版)》(徐宝文 李志 译)[机械工业出版社]
> Documentation/translations/zh_CN/process/management-style.rst:.. [#cnf2] 保罗·西蒙演唱了“离开爱人的50种方法”,因为坦率地说,“告诉开发者

I wouldn't touch translations.

> The × ≤ and ≥ uses look fine.

Agreed.

Thanks for double-checking those. I'll address them.

In the mean time, I'm already preparing a patch series addressing
the issues inside documentation, using some scripting to avoid
manual mistakes:

	https://git.linuxtv.org/mchehab/experimental.git/log/?h=fix_utf8

(patch series is not 100% yet... some adjustments are still
needed on some places).

Thanks,
Mauro

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-08 14:41                     ` Mauro Carvalho Chehab
@ 2021-05-08 15:55                       ` Randy Dunlap
  2021-05-08 17:09                         ` Michal Suchánek
  2021-05-10  8:17                         ` Mauro Carvalho Chehab
  0 siblings, 2 replies; 41+ messages in thread
From: Randy Dunlap @ 2021-05-08 15:55 UTC (permalink / raw)
  To: Mauro Carvalho Chehab, Michal Suchánek
  Cc: Matthew Wilcox, Markus Heiser, linux-doc, Jonathan Corbet

Hi Mauro,

On 5/8/21 7:41 AM, Mauro Carvalho Chehab wrote:
> Em Sat, 8 May 2021 12:41:57 +0200
> Michal Suchánek <msuchanek@suse.de> escreveu:
> 
>> On Sat, May 08, 2021 at 11:22:05AM +0200, Mauro Carvalho Chehab wrote:
>>> Em Fri, 7 May 2021 08:39:24 +0200
>>> Mauro Carvalho Chehab <mchehab@kernel.org> escreveu:
>>>   
>>>> Em Thu, 6 May 2021 14:21:01 -0700
>>>> Randy Dunlap <rdunlap@infradead.org> escreveu:
>>>>   
>>>   
>>>> I'll prepare a patch fixing it. Some care should be taken, however, as
>>>> it has two places where UTF-8 chars should be used[2].  
>>>
>>> Ok, I did a small script in order to check what special chars we
>>> currently have (next-20210507) at Documentation/ excluding the
>>> translations.
>>>
>>> Based on my script results, we have those groups:
>>>
>>> 1. Latin accented characters:
>>> 	- U+00c7 (LATIN CAPITAL LETTER C WITH CEDILLA) (Ç)
>>> 	- U+00df (LATIN SMALL LETTER SHARP S) (ß)
>>> 	- U+00e1 (LATIN SMALL LETTER A WITH ACUTE) (á)
>>> 	- U+00e4 (LATIN SMALL LETTER A WITH DIAERESIS) (ä)
>>> 	- U+00e6 (LATIN SMALL LETTER AE) (æ)
>>> 	- U+00e7 (LATIN SMALL LETTER C WITH CEDILLA) (ç)
>>> 	- U+00e9 (LATIN SMALL LETTER E WITH ACUTE) (é)
>>> 	- U+00ea (LATIN SMALL LETTER E WITH CIRCUMFLEX) (ê)
>>> 	- U+00eb (LATIN SMALL LETTER E WITH DIAERESIS) (ë)
>>> 	- U+00f3 (LATIN SMALL LETTER O WITH ACUTE) (ó)
>>> 	- U+00f4 (LATIN SMALL LETTER O WITH CIRCUMFLEX) (ô)
>>> 	- U+00f6 (LATIN SMALL LETTER O WITH DIAERESIS) (ö)
>>> 	- U+00f8 (LATIN SMALL LETTER O WITH STROKE) (ø)
>>> 	- U+00fc (LATIN SMALL LETTER U WITH DIAERESIS) (ü)
>>> 	- U+011f (LATIN SMALL LETTER G WITH BREVE) (ğ)
>>> 	- U+0142 (LATIN SMALL LETTER L WITH STROKE) (ł)
>>>
>>> 2. symbols:
>>> 	- U+00a9 (COPYRIGHT SIGN) (©)
>>> 	- U+2122 (TRADE MARK SIGN) (™)
>>> 	- U+00ae (REGISTERED SIGN) (®)
>>> 	- U+00b0 (DEGREE SIGN) (°)
>>> 	- U+00b1 (PLUS-MINUS SIGN) (±)
>>> 	- U+00b2 (SUPERSCRIPT TWO) (²)
>>> 	- U+00b5 (MICRO SIGN) (µ)
>>> 	- U+00bd (VULGAR FRACTION ONE HALF) (½)
>>> 	- U+2026 (HORIZONTAL ELLIPSIS) (…)
>>>
>>> 3. arrows:
>>> 	- U+2191 (UPWARDS ARROW) (↑)
>>> 	- U+2192 (RIGHTWARDS ARROW) (→)
>>> 	- U+2193 (DOWNWARDS ARROW) (↓)
>>> 	- U+2b0d (UP DOWN BLACK ARROW) (⬍)
>>>
>>> 4. box drawings:
>>> 	- U+2500 (BOX DRAWINGS LIGHT HORIZONTAL) (─)
>>> 	- U+2502 (BOX DRAWINGS LIGHT VERTICAL) (│)
>>> 	- U+2514 (BOX DRAWINGS LIGHT UP AND RIGHT) (└)
>>> 	- U+251c (BOX DRAWINGS LIGHT VERTICAL AND RIGHT) (├)
>>>
>>> 5. math symbols:
>>> 	- U+00b7 (MIDDLE DOT) (·)
>>> 	- U+00d7 (MULTIPLICATION SIGN) (×)
>>> 	- U+2212 (MINUS SIGN) (−)
>>> 	- U+2217 (ASTERISK OPERATOR) (∗)
>>> 	- U+223c (TILDE OPERATOR) (∼)
>>> 	- U+2264 (LESS-THAN OR EQUAL TO) (≤)
>>> 	- U+2265 (GREATER-THAN OR EQUAL TO) (≥)
>>> 	- U+27e8 (MATHEMATICAL LEFT ANGLE BRACKET) (⟨)
>>> 	- U+27e9 (MATHEMATICAL RIGHT ANGLE BRACKET) (⟩)
>>> 	- U+00ac (NOT SIGN) (¬)  
>>
> 
>>
>> Use of ¬ is also very dubious in documentation (in fonts it is understandable):
>> Documentation/ABI/obsolete/sysfs-kernel-fadump_registered:This ABI is renamed and moved to a new location /sys/kernel/fadump/registered.¬
>> Documentation/ABI/obsolete/sysfs-kernel-fadump_release_mem:This ABI is renamed and moved to a new location /sys/kernel/fadump/release_mem.¬
> 
> 
>> Documentation/powerpc/transactional_memory.rst:  if (MSR 29:31 ¬ = 0b010 | SRR1 29:31 ¬ = 0b000) then
> 
> Yeah, this should probably be better written as:
> 
>   if (MSR 29:31 == 0b010 | SRR1 29:31 == 0b000) then

If the original with the 'NOT SIGN' was correct, then this
version can't be correct. Or do you suspect that the "original"
was corrupted somehow?


> In the mean time, I'm already preparing a patch series addressing
> the issues inside documentation, using some scripting to avoid
> manual mistakes:
> 
> 	https://git.linuxtv.org/mchehab/experimental.git/log/?h=fix_utf8
> 
> (patch series is not 100% yet... some adjustments are still
> needed on some places).


Thanks for digging into this and providing fixes.

-- 
~Randy


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-08 15:55                       ` Randy Dunlap
@ 2021-05-08 17:09                         ` Michal Suchánek
  2021-05-08 17:46                           ` Randy Dunlap
  2021-05-10  8:17                         ` Mauro Carvalho Chehab
  1 sibling, 1 reply; 41+ messages in thread
From: Michal Suchánek @ 2021-05-08 17:09 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Mauro Carvalho Chehab, Matthew Wilcox, Markus Heiser, linux-doc,
	Jonathan Corbet

On Sat, May 08, 2021 at 08:55:11AM -0700, Randy Dunlap wrote:
> Hi Mauro,
> 
> On 5/8/21 7:41 AM, Mauro Carvalho Chehab wrote:
> > Em Sat, 8 May 2021 12:41:57 +0200
> > Michal Suchánek <msuchanek@suse.de> escreveu:
> > 
> >> On Sat, May 08, 2021 at 11:22:05AM +0200, Mauro Carvalho Chehab wrote:
> >>> Em Fri, 7 May 2021 08:39:24 +0200
> >>> Mauro Carvalho Chehab <mchehab@kernel.org> escreveu:
> >>>   
> >>>> Em Thu, 6 May 2021 14:21:01 -0700
> >>>> Randy Dunlap <rdunlap@infradead.org> escreveu:
> >>>>   
> >>>   
> >>>> I'll prepare a patch fixing it. Some care should be taken, however, as
> >>>> it has two places where UTF-8 chars should be used[2].  
> >>>
> >>> Ok, I did a small script in order to check what special chars we
> >>> currently have (next-20210507) at Documentation/ excluding the
> >>> translations.
> >>>
> >>> Based on my script results, we have those groups:
> >>>
> >>> 1. Latin accented characters:
> >>> 	- U+00c7 (LATIN CAPITAL LETTER C WITH CEDILLA) (Ç)
> >>> 	- U+00df (LATIN SMALL LETTER SHARP S) (ß)
> >>> 	- U+00e1 (LATIN SMALL LETTER A WITH ACUTE) (á)
> >>> 	- U+00e4 (LATIN SMALL LETTER A WITH DIAERESIS) (ä)
> >>> 	- U+00e6 (LATIN SMALL LETTER AE) (æ)
> >>> 	- U+00e7 (LATIN SMALL LETTER C WITH CEDILLA) (ç)
> >>> 	- U+00e9 (LATIN SMALL LETTER E WITH ACUTE) (é)
> >>> 	- U+00ea (LATIN SMALL LETTER E WITH CIRCUMFLEX) (ê)
> >>> 	- U+00eb (LATIN SMALL LETTER E WITH DIAERESIS) (ë)
> >>> 	- U+00f3 (LATIN SMALL LETTER O WITH ACUTE) (ó)
> >>> 	- U+00f4 (LATIN SMALL LETTER O WITH CIRCUMFLEX) (ô)
> >>> 	- U+00f6 (LATIN SMALL LETTER O WITH DIAERESIS) (ö)
> >>> 	- U+00f8 (LATIN SMALL LETTER O WITH STROKE) (ø)
> >>> 	- U+00fc (LATIN SMALL LETTER U WITH DIAERESIS) (ü)
> >>> 	- U+011f (LATIN SMALL LETTER G WITH BREVE) (ğ)
> >>> 	- U+0142 (LATIN SMALL LETTER L WITH STROKE) (ł)
> >>>
> >>> 2. symbols:
> >>> 	- U+00a9 (COPYRIGHT SIGN) (©)
> >>> 	- U+2122 (TRADE MARK SIGN) (™)
> >>> 	- U+00ae (REGISTERED SIGN) (®)
> >>> 	- U+00b0 (DEGREE SIGN) (°)
> >>> 	- U+00b1 (PLUS-MINUS SIGN) (±)
> >>> 	- U+00b2 (SUPERSCRIPT TWO) (²)
> >>> 	- U+00b5 (MICRO SIGN) (µ)
> >>> 	- U+00bd (VULGAR FRACTION ONE HALF) (½)
> >>> 	- U+2026 (HORIZONTAL ELLIPSIS) (…)
> >>>
> >>> 3. arrows:
> >>> 	- U+2191 (UPWARDS ARROW) (↑)
> >>> 	- U+2192 (RIGHTWARDS ARROW) (→)
> >>> 	- U+2193 (DOWNWARDS ARROW) (↓)
> >>> 	- U+2b0d (UP DOWN BLACK ARROW) (⬍)
> >>>
> >>> 4. box drawings:
> >>> 	- U+2500 (BOX DRAWINGS LIGHT HORIZONTAL) (─)
> >>> 	- U+2502 (BOX DRAWINGS LIGHT VERTICAL) (│)
> >>> 	- U+2514 (BOX DRAWINGS LIGHT UP AND RIGHT) (└)
> >>> 	- U+251c (BOX DRAWINGS LIGHT VERTICAL AND RIGHT) (├)
> >>>
> >>> 5. math symbols:
> >>> 	- U+00b7 (MIDDLE DOT) (·)
> >>> 	- U+00d7 (MULTIPLICATION SIGN) (×)
> >>> 	- U+2212 (MINUS SIGN) (−)
> >>> 	- U+2217 (ASTERISK OPERATOR) (∗)
> >>> 	- U+223c (TILDE OPERATOR) (∼)
> >>> 	- U+2264 (LESS-THAN OR EQUAL TO) (≤)
> >>> 	- U+2265 (GREATER-THAN OR EQUAL TO) (≥)
> >>> 	- U+27e8 (MATHEMATICAL LEFT ANGLE BRACKET) (⟨)
> >>> 	- U+27e9 (MATHEMATICAL RIGHT ANGLE BRACKET) (⟩)
> >>> 	- U+00ac (NOT SIGN) (¬)  
> >>
> > 
> >>
> >> Use of ¬ is also very dubious in documentation (in fonts it is understandable):
> >> Documentation/ABI/obsolete/sysfs-kernel-fadump_registered:This ABI is renamed and moved to a new location /sys/kernel/fadump/registered.¬
> >> Documentation/ABI/obsolete/sysfs-kernel-fadump_release_mem:This ABI is renamed and moved to a new location /sys/kernel/fadump/release_mem.¬
> > 
> > 
> >> Documentation/powerpc/transactional_memory.rst:  if (MSR 29:31 ¬ = 0b010 | SRR1 29:31 ¬ = 0b000) then
> > 
> > Yeah, this should probably be better written as:
> > 
> >   if (MSR 29:31 == 0b010 | SRR1 29:31 == 0b000) then
> 
> If the original with the 'NOT SIGN' was correct, then this
> version can't be correct. Or do you suspect that the "original"
> was corrupted somehow?

This does not make sense however you look at it. Using | between logical
expressions ...
It sounds like it is some pseudocode in no language in particular so
it's hard to tell what it actually means and the document does not have
enough context to be able to tell. I suppose there is some comment
somewhere in the kernel code that would clarify this - at least what the
bit patterns mean.

Thanks

Michal

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-08 17:09                         ` Michal Suchánek
@ 2021-05-08 17:46                           ` Randy Dunlap
  2021-05-10  6:22                             ` Mauro Carvalho Chehab
  0 siblings, 1 reply; 41+ messages in thread
From: Randy Dunlap @ 2021-05-08 17:46 UTC (permalink / raw)
  To: Michal Suchánek
  Cc: Mauro Carvalho Chehab, Matthew Wilcox, Markus Heiser, linux-doc,
	Jonathan Corbet

On 5/8/21 10:09 AM, Michal Suchánek wrote:
> On Sat, May 08, 2021 at 08:55:11AM -0700, Randy Dunlap wrote:
>> Hi Mauro,
>>
>> On 5/8/21 7:41 AM, Mauro Carvalho Chehab wrote:
>>> Em Sat, 8 May 2021 12:41:57 +0200
>>> Michal Suchánek <msuchanek@suse.de> escreveu:
>>>
>>>> On Sat, May 08, 2021 at 11:22:05AM +0200, Mauro Carvalho Chehab wrote:
>>>>> Em Fri, 7 May 2021 08:39:24 +0200
>>>>> Mauro Carvalho Chehab <mchehab@kernel.org> escreveu:
>>>>>   
>>>>>> Em Thu, 6 May 2021 14:21:01 -0700
>>>>>> Randy Dunlap <rdunlap@infradead.org> escreveu:
>>>>>>   
>>>>>   
>>>>>> I'll prepare a patch fixing it. Some care should be taken, however, as
>>>>>> it has two places where UTF-8 chars should be used[2].  
>>>>>
>>>>> Ok, I did a small script in order to check what special chars we
>>>>> currently have (next-20210507) at Documentation/ excluding the
>>>>> translations.
>>>>>
>>>>> Based on my script results, we have those groups:
>>>>>
>>>>> 1. Latin accented characters:
>>>>> 	- U+00c7 (LATIN CAPITAL LETTER C WITH CEDILLA) (Ç)
>>>>> 	- U+00df (LATIN SMALL LETTER SHARP S) (ß)
>>>>> 	- U+00e1 (LATIN SMALL LETTER A WITH ACUTE) (á)
>>>>> 	- U+00e4 (LATIN SMALL LETTER A WITH DIAERESIS) (ä)
>>>>> 	- U+00e6 (LATIN SMALL LETTER AE) (æ)
>>>>> 	- U+00e7 (LATIN SMALL LETTER C WITH CEDILLA) (ç)
>>>>> 	- U+00e9 (LATIN SMALL LETTER E WITH ACUTE) (é)
>>>>> 	- U+00ea (LATIN SMALL LETTER E WITH CIRCUMFLEX) (ê)
>>>>> 	- U+00eb (LATIN SMALL LETTER E WITH DIAERESIS) (ë)
>>>>> 	- U+00f3 (LATIN SMALL LETTER O WITH ACUTE) (ó)
>>>>> 	- U+00f4 (LATIN SMALL LETTER O WITH CIRCUMFLEX) (ô)
>>>>> 	- U+00f6 (LATIN SMALL LETTER O WITH DIAERESIS) (ö)
>>>>> 	- U+00f8 (LATIN SMALL LETTER O WITH STROKE) (ø)
>>>>> 	- U+00fc (LATIN SMALL LETTER U WITH DIAERESIS) (ü)
>>>>> 	- U+011f (LATIN SMALL LETTER G WITH BREVE) (ğ)
>>>>> 	- U+0142 (LATIN SMALL LETTER L WITH STROKE) (ł)
>>>>>
>>>>> 2. symbols:
>>>>> 	- U+00a9 (COPYRIGHT SIGN) (©)
>>>>> 	- U+2122 (TRADE MARK SIGN) (™)
>>>>> 	- U+00ae (REGISTERED SIGN) (®)
>>>>> 	- U+00b0 (DEGREE SIGN) (°)
>>>>> 	- U+00b1 (PLUS-MINUS SIGN) (±)
>>>>> 	- U+00b2 (SUPERSCRIPT TWO) (²)
>>>>> 	- U+00b5 (MICRO SIGN) (µ)
>>>>> 	- U+00bd (VULGAR FRACTION ONE HALF) (½)
>>>>> 	- U+2026 (HORIZONTAL ELLIPSIS) (…)
>>>>>
>>>>> 3. arrows:
>>>>> 	- U+2191 (UPWARDS ARROW) (↑)
>>>>> 	- U+2192 (RIGHTWARDS ARROW) (→)
>>>>> 	- U+2193 (DOWNWARDS ARROW) (↓)
>>>>> 	- U+2b0d (UP DOWN BLACK ARROW) (⬍)
>>>>>
>>>>> 4. box drawings:
>>>>> 	- U+2500 (BOX DRAWINGS LIGHT HORIZONTAL) (─)
>>>>> 	- U+2502 (BOX DRAWINGS LIGHT VERTICAL) (│)
>>>>> 	- U+2514 (BOX DRAWINGS LIGHT UP AND RIGHT) (└)
>>>>> 	- U+251c (BOX DRAWINGS LIGHT VERTICAL AND RIGHT) (├)
>>>>>
>>>>> 5. math symbols:
>>>>> 	- U+00b7 (MIDDLE DOT) (·)
>>>>> 	- U+00d7 (MULTIPLICATION SIGN) (×)
>>>>> 	- U+2212 (MINUS SIGN) (−)
>>>>> 	- U+2217 (ASTERISK OPERATOR) (∗)
>>>>> 	- U+223c (TILDE OPERATOR) (∼)
>>>>> 	- U+2264 (LESS-THAN OR EQUAL TO) (≤)
>>>>> 	- U+2265 (GREATER-THAN OR EQUAL TO) (≥)
>>>>> 	- U+27e8 (MATHEMATICAL LEFT ANGLE BRACKET) (⟨)
>>>>> 	- U+27e9 (MATHEMATICAL RIGHT ANGLE BRACKET) (⟩)
>>>>> 	- U+00ac (NOT SIGN) (¬)  
>>>>
>>>
>>>>
>>>> Use of ¬ is also very dubious in documentation (in fonts it is understandable):
>>>> Documentation/ABI/obsolete/sysfs-kernel-fadump_registered:This ABI is renamed and moved to a new location /sys/kernel/fadump/registered.¬
>>>> Documentation/ABI/obsolete/sysfs-kernel-fadump_release_mem:This ABI is renamed and moved to a new location /sys/kernel/fadump/release_mem.¬
>>>
>>>
>>>> Documentation/powerpc/transactional_memory.rst:  if (MSR 29:31 ¬ = 0b010 | SRR1 29:31 ¬ = 0b000) then
>>>
>>> Yeah, this should probably be better written as:
>>>
>>>   if (MSR 29:31 == 0b010 | SRR1 29:31 == 0b000) then
>>
>> If the original with the 'NOT SIGN' was correct, then this
>> version can't be correct. Or do you suspect that the "original"
>> was corrupted somehow?
> 
> This does not make sense however you look at it. Using | between logical
> expressions ...

To my eyes/brain, it looks like classic (IBM) symbolic logic notation.
In that context, I don't see anything wrong with it.

> It sounds like it is some pseudocode in no language in particular so
> it's hard to tell what it actually means and the document does not have
> enough context to be able to tell. I suppose there is some comment
> somewhere in the kernel code that would clarify this - at least what the
> bit patterns mean.

Yeah, I have been looking thru the arch/powerpc/ source code for this,
but I haven't found it yet.

-- 
~Randy


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-08 17:46                           ` Randy Dunlap
@ 2021-05-10  6:22                             ` Mauro Carvalho Chehab
  0 siblings, 0 replies; 41+ messages in thread
From: Mauro Carvalho Chehab @ 2021-05-10  6:22 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Michal Suchánek, Matthew Wilcox, Markus Heiser, linux-doc,
	Jonathan Corbet

Em Sat, 8 May 2021 10:46:46 -0700
Randy Dunlap <rdunlap@infradead.org> escreveu:

> On 5/8/21 10:09 AM, Michal Suchánek wrote:
> > On Sat, May 08, 2021 at 08:55:11AM -0700, Randy Dunlap wrote:  
> >> Hi Mauro,
> >>
> >> On 5/8/21 7:41 AM, Mauro Carvalho Chehab wrote:  
> >>> Em Sat, 8 May 2021 12:41:57 +0200
> >>> Michal Suchánek <msuchanek@suse.de> escreveu:
> >>>  
> >>>> On Sat, May 08, 2021 at 11:22:05AM +0200, Mauro Carvalho Chehab wrote:  
> >>>>> Em Fri, 7 May 2021 08:39:24 +0200
> >>>>> Mauro Carvalho Chehab <mchehab@kernel.org> escreveu:
> >>>>>     
> >>>>>> Em Thu, 6 May 2021 14:21:01 -0700
> >>>>>> Randy Dunlap <rdunlap@infradead.org> escreveu:
> >>>>>>     
> >>>>>     
> >>>>>> I'll prepare a patch fixing it. Some care should be taken, however, as
> >>>>>> it has two places where UTF-8 chars should be used[2].    
> >>>>>
> >>>>> Ok, I did a small script in order to check what special chars we
> >>>>> currently have (next-20210507) at Documentation/ excluding the
> >>>>> translations.
> >>>>>
> >>>>> Based on my script results, we have those groups:
> >>>>>
> >>>>> 1. Latin accented characters:
> >>>>> 	- U+00c7 (LATIN CAPITAL LETTER C WITH CEDILLA) (Ç)
> >>>>> 	- U+00df (LATIN SMALL LETTER SHARP S) (ß)
> >>>>> 	- U+00e1 (LATIN SMALL LETTER A WITH ACUTE) (á)
> >>>>> 	- U+00e4 (LATIN SMALL LETTER A WITH DIAERESIS) (ä)
> >>>>> 	- U+00e6 (LATIN SMALL LETTER AE) (æ)
> >>>>> 	- U+00e7 (LATIN SMALL LETTER C WITH CEDILLA) (ç)
> >>>>> 	- U+00e9 (LATIN SMALL LETTER E WITH ACUTE) (é)
> >>>>> 	- U+00ea (LATIN SMALL LETTER E WITH CIRCUMFLEX) (ê)
> >>>>> 	- U+00eb (LATIN SMALL LETTER E WITH DIAERESIS) (ë)
> >>>>> 	- U+00f3 (LATIN SMALL LETTER O WITH ACUTE) (ó)
> >>>>> 	- U+00f4 (LATIN SMALL LETTER O WITH CIRCUMFLEX) (ô)
> >>>>> 	- U+00f6 (LATIN SMALL LETTER O WITH DIAERESIS) (ö)
> >>>>> 	- U+00f8 (LATIN SMALL LETTER O WITH STROKE) (ø)
> >>>>> 	- U+00fc (LATIN SMALL LETTER U WITH DIAERESIS) (ü)
> >>>>> 	- U+011f (LATIN SMALL LETTER G WITH BREVE) (ğ)
> >>>>> 	- U+0142 (LATIN SMALL LETTER L WITH STROKE) (ł)
> >>>>>
> >>>>> 2. symbols:
> >>>>> 	- U+00a9 (COPYRIGHT SIGN) (©)
> >>>>> 	- U+2122 (TRADE MARK SIGN) (™)
> >>>>> 	- U+00ae (REGISTERED SIGN) (®)
> >>>>> 	- U+00b0 (DEGREE SIGN) (°)
> >>>>> 	- U+00b1 (PLUS-MINUS SIGN) (±)
> >>>>> 	- U+00b2 (SUPERSCRIPT TWO) (²)
> >>>>> 	- U+00b5 (MICRO SIGN) (µ)
> >>>>> 	- U+00bd (VULGAR FRACTION ONE HALF) (½)
> >>>>> 	- U+2026 (HORIZONTAL ELLIPSIS) (…)
> >>>>>
> >>>>> 3. arrows:
> >>>>> 	- U+2191 (UPWARDS ARROW) (↑)
> >>>>> 	- U+2192 (RIGHTWARDS ARROW) (→)
> >>>>> 	- U+2193 (DOWNWARDS ARROW) (↓)
> >>>>> 	- U+2b0d (UP DOWN BLACK ARROW) (⬍)
> >>>>>
> >>>>> 4. box drawings:
> >>>>> 	- U+2500 (BOX DRAWINGS LIGHT HORIZONTAL) (─)
> >>>>> 	- U+2502 (BOX DRAWINGS LIGHT VERTICAL) (│)
> >>>>> 	- U+2514 (BOX DRAWINGS LIGHT UP AND RIGHT) (└)
> >>>>> 	- U+251c (BOX DRAWINGS LIGHT VERTICAL AND RIGHT) (├)
> >>>>>
> >>>>> 5. math symbols:
> >>>>> 	- U+00b7 (MIDDLE DOT) (·)
> >>>>> 	- U+00d7 (MULTIPLICATION SIGN) (×)
> >>>>> 	- U+2212 (MINUS SIGN) (−)
> >>>>> 	- U+2217 (ASTERISK OPERATOR) (∗)
> >>>>> 	- U+223c (TILDE OPERATOR) (∼)
> >>>>> 	- U+2264 (LESS-THAN OR EQUAL TO) (≤)
> >>>>> 	- U+2265 (GREATER-THAN OR EQUAL TO) (≥)
> >>>>> 	- U+27e8 (MATHEMATICAL LEFT ANGLE BRACKET) (⟨)
> >>>>> 	- U+27e9 (MATHEMATICAL RIGHT ANGLE BRACKET) (⟩)
> >>>>> 	- U+00ac (NOT SIGN) (¬)    
> >>>>  
> >>>  
> >>>>
> >>>> Use of ¬ is also very dubious in documentation (in fonts it is understandable):
> >>>> Documentation/ABI/obsolete/sysfs-kernel-fadump_registered:This ABI is renamed and moved to a new location /sys/kernel/fadump/registered.¬
> >>>> Documentation/ABI/obsolete/sysfs-kernel-fadump_release_mem:This ABI is renamed and moved to a new location /sys/kernel/fadump/release_mem.¬  
> >>>
> >>>  
> >>>> Documentation/powerpc/transactional_memory.rst:  if (MSR 29:31 ¬ = 0b010 | SRR1 29:31 ¬ = 0b000) then  
> >>>
> >>> Yeah, this should probably be better written as:
> >>>
> >>>   if (MSR 29:31 == 0b010 | SRR1 29:31 == 0b000) then  
> >>
> >> If the original with the 'NOT SIGN' was correct, then this
> >> version can't be correct. Or do you suspect that the "original"
> >> was corrupted somehow?  

No, I just misread the expression.

> > 
> > This does not make sense however you look at it. Using | between logical
> > expressions ...  
> 
> To my eyes/brain, it looks like classic (IBM) symbolic logic notation.
> In that context, I don't see anything wrong with it.

In this particular case, I would keep it as-is, with the UTF-8 char
on it. I mean, it might be converted to some other symbolic logic
notation, but "MSR 29:31" and "SRR1 29:31" aren't valid names on C.

> Yeah, I have been looking thru the arch/powerpc/ source code for this,
> but I haven't found it yet.

The title of the session says that it is part of "h/rfid mtmsrd quirk".

Searching for rfid:

	$ git grep -l rfid arch/powerpc/

Shows a lot of asm code. I guess that if the above quirk is still at
the Kernel, it is probably somewhere at the assembler part.

So, it sounds to me that converting it into C (or pseudo-C) won't
make it any better.

Thanks,
Mauro

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-08 15:55                       ` Randy Dunlap
  2021-05-08 17:09                         ` Michal Suchánek
@ 2021-05-10  8:17                         ` Mauro Carvalho Chehab
  1 sibling, 0 replies; 41+ messages in thread
From: Mauro Carvalho Chehab @ 2021-05-10  8:17 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Michal Suchánek, Matthew Wilcox, Markus Heiser, linux-doc,
	Jonathan Corbet

Em Sat, 8 May 2021 08:55:11 -0700
Randy Dunlap <rdunlap@infradead.org> escreveu:

> > In the mean time, I'm already preparing a patch series addressing
> > the issues inside documentation, using some scripting to avoid
> > manual mistakes:
> > 
> > 	https://git.linuxtv.org/mchehab/experimental.git/log/?h=fix_utf8
> > 
> > (patch series is not 100% yet... some adjustments are still
> > needed on some places).  
> 
> 
> Thanks for digging into this and providing fixes.

Just pushed a new version there, rebasing the branch:

	https://git.linuxtv.org/mchehab/experimental.git/log/?h=fix_utf8

The first tree patches were manually written, in order to address 
a couple of special cases.

I'll be submitting the patches via e-mail later today.

The remaining ones were generated by a script that seeks for UTF-8
characters only inside Documentation .rst and ABI files, doing this
conversion:

my %char_map = (
	0x2010 => '-',		# HYPHEN
	0xad   => '-',		# SOFT HYPHEN
	0x2013 => '-',		# EN DASH
	0x2014 => '-',		# EM DASH

	0x2018 => "'",		# LEFT SINGLE QUOTATION MARK
	0x2019 => "'",		# RIGHT SINGLE QUOTATION MARK
	0xb4   => "'",		# ACUTE ACCENT

	0x201c => '"',		# LEFT DOUBLE QUOTATION MARK
	0x201d => '"',		# RIGHT DOUBLE QUOTATION MARK

	0x2212 => '-',		# MINUS SIGN
	0x2217 => '*',		# ASTERISK OPERATOR
	0xd7   => 'x',		# MULTIPLICATION SIGN

	0xbb   => '>',		# RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK

	0xa0   => ' ',		# NO-BREAK SPACE
	0xfeff => '',		# ZERO WIDTH NO-BREAK SPACE
);

Basically, after the conversion, those UTF-8 chars will remain
at Documentation/:

	- U+00a9 ('©'): COPYRIGHT SIGN
	- U+00ac ('¬'): NOT SIGN		# only at Documentation/powerpc/transactional_memory.rst
	- U+00ae ('®'): REGISTERED SIGN
	- U+00b0 ('°'): DEGREE SIGN
	- U+00b1 ('±'): PLUS-MINUS SIGN
	- U+00b2 ('²'): SUPERSCRIPT TWO
	- U+00b5 ('µ'): MICRO SIGN
	- U+00b7 ('·'): MIDDLE DOT		# See below
	- U+00bd ('½'): VULGAR FRACTION ONE HALF
	- U+00c7 ('Ç'): LATIN CAPITAL LETTER C WITH CEDILLA
	- U+00df ('ß'): LATIN SMALL LETTER SHARP S
	- U+00e1 ('á'): LATIN SMALL LETTER A WITH ACUTE
	- U+00e4 ('ä'): LATIN SMALL LETTER A WITH DIAERESIS
	- U+00e6 ('æ'): LATIN SMALL LETTER AE
	- U+00e7 ('ç'): LATIN SMALL LETTER C WITH CEDILLA
	- U+00e9 ('é'): LATIN SMALL LETTER E WITH ACUTE
	- U+00ea ('ê'): LATIN SMALL LETTER E WITH CIRCUMFLEX
	- U+00eb ('ë'): LATIN SMALL LETTER E WITH DIAERESIS
	- U+00f3 ('ó'): LATIN SMALL LETTER O WITH ACUTE
	- U+00f4 ('ô'): LATIN SMALL LETTER O WITH CIRCUMFLEX
	- U+00f6 ('ö'): LATIN SMALL LETTER O WITH DIAERESIS
	- U+00f8 ('ø'): LATIN SMALL LETTER O WITH STROKE
	- U+00fa ('ú'): LATIN SMALL LETTER U WITH ACUTE
	- U+00fc ('ü'): LATIN SMALL LETTER U WITH DIAERESIS
	- U+00fd ('ý'): LATIN SMALL LETTER Y WITH ACUTE
	- U+011f ('ğ'): LATIN SMALL LETTER G WITH BREVE
	- U+0142 ('ł'): LATIN SMALL LETTER L WITH STROKE
	- U+03bc ('μ'): GREEK SMALL LETTER MU
	- U+2026 ('…'): HORIZONTAL ELLIPSIS
	- U+2122 ('™'): TRADE MARK SIGN
	- U+2191 ('↑'): UPWARDS ARROW
	- U+2192 ('→'): RIGHTWARDS ARROW
	- U+2193 ('↓'): DOWNWARDS ARROW
	- U+2264 ('≤'): LESS-THAN OR EQUAL TO
	- U+2265 ('≥'): GREATER-THAN OR EQUAL TO
	- U+2500 ('─'): BOX DRAWINGS LIGHT HORIZONTAL
	- U+2502 ('│'): BOX DRAWINGS LIGHT VERTICAL
	- U+2514 ('└'): BOX DRAWINGS LIGHT UP AND RIGHT
	- U+251c ('├'): BOX DRAWINGS LIGHT VERTICAL AND RIGHT
	- U+2b0d ('⬍'): UP DOWN BLACK ARROW

For U+00b7 ('·'): MIDDLE DOT, I opted to keep it on a few places:

- Documentation/devicetree/bindings/clock/qcom,rpmcc.txt

  As this file will be some day converted to yaml, where the 
  MIDDLE DOT will be removed, I guess it is not worth touching it.

- Documentation/scheduler/sched-deadline.rst

  There, it is used on a math expressions. So, better to keep.

- Documentation/devicetree/bindings/media/video-interface-devices.yaml

  There, it part of an ASCII artwork.

- translations/zh_CN

  I prefer not touching it, as it might have some special meaning
  in Simplified Chinese.

Thanks,
Mauro

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-06 17:48       ` Michal Suchánek
  2021-05-06 17:59         ` Markus Heiser
@ 2021-05-12  6:22         ` Mauro Carvalho Chehab
  2021-05-12  7:01           ` Michal Suchánek
  1 sibling, 1 reply; 41+ messages in thread
From: Mauro Carvalho Chehab @ 2021-05-12  6:22 UTC (permalink / raw)
  To: Michal Suchánek; +Cc: Markus Heiser, linux-doc, Jonathan Corbet

Hi Michal,

Em Thu, 6 May 2021 19:48:49 +0200
Michal Suchánek <msuchanek@suse.de> escreveu:

> [  127s] + :
> [  127s] + locale
> [  128s] LANG=en_US
> [  128s] LC_CTYPE="en_US"
> [  128s] LC_NUMERIC="en_US"
> [  128s] LC_TIME="en_US"
> [  128s] LC_COLLATE="en_US"
> [  128s] LC_MONETARY="en_US"
> [  128s] LC_MESSAGES="en_US"
> [  128s] LC_PAPER="en_US"
> [  128s] LC_NAME="en_US"
> [  128s] LC_ADDRESS="en_US"
> [  128s] LC_TELEPHONE="en_US"
> [  128s] LC_MEASUREMENT="en_US"
> [  128s] LC_IDENTIFICATION="en_US"
> [  128s] LC_ALL=
> [  128s] + echo LC_ALL=
> [  128s] LC_ALL=
> [  128s] + echo LANG=en_US
> [  128s] LANG=en_US

Where those the locale settings that you used when the build
failed?

I tried to reproduce the bug here with, disabling the parallel run (as
it masks the real error) with both:

	$ for i in LANG LC_ALL LC_ADDRESS LC_IDENTIFICATION LC_MEASUREMENT LC_MONETARY LC_NAME LC_NUMERIC LC_PAPER LC_TELEPHONE LC_TIME; do echo $i=en_US; done
	$ make cleandocs && make SPHINXOPTS=-j1 htmldocs

(this one caused lots of warnings on Debian, due to the
 settings at /etc/locale.gen)

and:

	$ for i in LANG LC_ALL LC_ADDRESS LC_IDENTIFICATION LC_MEASUREMENT LC_MONETARY LC_NAME LC_NUMERIC LC_PAPER LC_TELEPHONE LC_TIME; do echo $i=en_US.ISO-8859-1; done
	$ make cleandocs && make SPHINXOPTS=-j1 htmldocs

Without any success.

Could you please provide more details about the build VM and the git 
changeset that caused the issue?

Thanks,
Mauro

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-12  6:22         ` Mauro Carvalho Chehab
@ 2021-05-12  7:01           ` Michal Suchánek
  2021-05-12  7:18             ` Markus Heiser
  2021-05-12  7:59             ` Mauro Carvalho Chehab
  0 siblings, 2 replies; 41+ messages in thread
From: Michal Suchánek @ 2021-05-12  7:01 UTC (permalink / raw)
  To: Mauro Carvalho Chehab; +Cc: Markus Heiser, linux-doc, Jonathan Corbet

On Wed, May 12, 2021 at 08:22:38AM +0200, Mauro Carvalho Chehab wrote:
> Hi Michal,
> 
> Em Thu, 6 May 2021 19:48:49 +0200
> Michal Suchánek <msuchanek@suse.de> escreveu:
> 
> > [  127s] + :
> > [  127s] + locale
> > [  128s] LANG=en_US
> > [  128s] LC_CTYPE="en_US"
> > [  128s] LC_NUMERIC="en_US"
> > [  128s] LC_TIME="en_US"
> > [  128s] LC_COLLATE="en_US"
> > [  128s] LC_MONETARY="en_US"
> > [  128s] LC_MESSAGES="en_US"
> > [  128s] LC_PAPER="en_US"
> > [  128s] LC_NAME="en_US"
> > [  128s] LC_ADDRESS="en_US"
> > [  128s] LC_TELEPHONE="en_US"
> > [  128s] LC_MEASUREMENT="en_US"
> > [  128s] LC_IDENTIFICATION="en_US"
> > [  128s] LC_ALL=
> > [  128s] + echo LC_ALL=
> > [  128s] LC_ALL=
> > [  128s] + echo LANG=en_US
> > [  128s] LANG=en_US
> 
> Where those the locale settings that you used when the build
> failed?
> 
> I tried to reproduce the bug here with, disabling the parallel run (as
> it masks the real error) with both:
> 
> 	$ for i in LANG LC_ALL LC_ADDRESS LC_IDENTIFICATION LC_MEASUREMENT LC_MONETARY LC_NAME LC_NUMERIC LC_PAPER LC_TELEPHONE LC_TIME; do echo $i=en_US; done
> 	$ make cleandocs && make SPHINXOPTS=-j1 htmldocs
> 
> (this one caused lots of warnings on Debian, due to the
>  settings at /etc/locale.gen)
> 
> and:
> 
> 	$ for i in LANG LC_ALL LC_ADDRESS LC_IDENTIFICATION LC_MEASUREMENT LC_MONETARY LC_NAME LC_NUMERIC LC_PAPER LC_TELEPHONE LC_TIME; do echo $i=en_US.ISO-8859-1; done
> 	$ make cleandocs && make SPHINXOPTS=-j1 htmldocs
> 
> Without any success.
> 
> Could you please provide more details about the build VM and the git 
> changeset that caused the issue?

It depends on what character set your en_US locale implements.

~> cat test.py 
print("↑ᛏ个")
~> locale
LANG=en_US.utf8
LC_CTYPE="en_US.utf8"
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE="en_US.utf8"
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=
~> python3 test.py 
↑ᛏ个
~> LANG=en_US python3 test.py 
Traceback (most recent call last):
  File "test.py", line 1, in <module>
    print("\u2191\u16cf\u4e2a\uf8f9")
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-3: ordinal not in range(256)
~> LANG=C python3 test.py 
↑ᛏ个

You can easily test if your python version can print UTF-8 in a specific
locale, and if necessary define an ISO-8859-1 locale for testing.
On some systems the situation is reversed - C locale is ASCII only, and
en_US is UTF-8, and it is possible that some systems don't ship an 8bit
locale at all.

Thanks

Michal

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-12  7:01           ` Michal Suchánek
@ 2021-05-12  7:18             ` Markus Heiser
  2021-05-12  7:37               ` Markus Heiser
  2021-05-12  7:59             ` Mauro Carvalho Chehab
  1 sibling, 1 reply; 41+ messages in thread
From: Markus Heiser @ 2021-05-12  7:18 UTC (permalink / raw)
  To: Michal Suchánek, Mauro Carvalho Chehab; +Cc: linux-doc, Jonathan Corbet

Am 12.05.21 um 09:01 schrieb Michal Suchánek:
> On Wed, May 12, 2021 at 08:22:38AM +0200, Mauro Carvalho Chehab wrote:
>> Hi Michal,
>>
>> Em Thu, 6 May 2021 19:48:49 +0200
>> Michal Suchánek <msuchanek@suse.de> escreveu:
>>
>>> [  127s] + :
>>> [  127s] + locale
>>> [  128s] LANG=en_US
>>> [  128s] LC_CTYPE="en_US"
>>> [  128s] LC_NUMERIC="en_US"
>>> [  128s] LC_TIME="en_US"
>>> [  128s] LC_COLLATE="en_US"
>>> [  128s] LC_MONETARY="en_US"
>>> [  128s] LC_MESSAGES="en_US"
>>> [  128s] LC_PAPER="en_US"
>>> [  128s] LC_NAME="en_US"
>>> [  128s] LC_ADDRESS="en_US"
>>> [  128s] LC_TELEPHONE="en_US"
>>> [  128s] LC_MEASUREMENT="en_US"
>>> [  128s] LC_IDENTIFICATION="en_US"
>>> [  128s] LC_ALL=
>>> [  128s] + echo LC_ALL=
>>> [  128s] LC_ALL=
>>> [  128s] + echo LANG=en_US
>>> [  128s] LANG=en_US
>>
>> Where those the locale settings that you used when the build
>> failed?
>>
>> I tried to reproduce the bug here with, disabling the parallel run (as
>> it masks the real error) with both:
>>
>> 	$ for i in LANG LC_ALL LC_ADDRESS LC_IDENTIFICATION LC_MEASUREMENT LC_MONETARY LC_NAME LC_NUMERIC LC_PAPER LC_TELEPHONE LC_TIME; do echo $i=en_US; done
>> 	$ make cleandocs && make SPHINXOPTS=-j1 htmldocs
>>
>> (this one caused lots of warnings on Debian, due to the
>>   settings at /etc/locale.gen)
>>
>> and:
>>
>> 	$ for i in LANG LC_ALL LC_ADDRESS LC_IDENTIFICATION LC_MEASUREMENT LC_MONETARY LC_NAME LC_NUMERIC LC_PAPER LC_TELEPHONE LC_TIME; do echo $i=en_US.ISO-8859-1; done
>> 	$ make cleandocs && make SPHINXOPTS=-j1 htmldocs
>>
>> Without any success.
>>
>> Could you please provide more details about the build VM and the git
>> changeset that caused the issue?
> 
> It depends on what character set your en_US locale implements.
> 
> ~> cat test.py
> print("↑ᛏ个")
> ~> locale
> LANG=en_US.utf8
> LC_CTYPE="en_US.utf8"
> LC_NUMERIC="en_US.utf8"
> LC_TIME="en_US.utf8"
> LC_COLLATE="en_US.utf8"
> LC_MONETARY="en_US.utf8"
> LC_MESSAGES="en_US.utf8"
> LC_PAPER="en_US.utf8"
> LC_NAME="en_US.utf8"
> LC_ADDRESS="en_US.utf8"
> LC_TELEPHONE="en_US.utf8"
> LC_MEASUREMENT="en_US.utf8"
> LC_IDENTIFICATION="en_US.utf8"
> LC_ALL=
> ~> python3 test.py
> ↑ᛏ个
> ~> LANG=en_US python3 test.py
> Traceback (most recent call last):
>    File "test.py", line 1, in <module>
>      print("\u2191\u16cf\u4e2a\uf8f9")
> UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-3: ordinal not in range(256)
> ~> LANG=C python3 test.py
> ↑ᛏ个
> 
> You can easily test if your python version can print UTF-8 in a specific
> locale, and if necessary define an ISO-8859-1 locale for testing.
> On some systems the situation is reversed - C locale is ASCII only, and
> en_US is UTF-8, and it is possible that some systems don't ship an 8bit
> locale at all.

Thats my problem :-) On my system (terminal) I can't reproduce the
issue since stdout always support utf-8, no matter what LANG
environment is set.

$ LANG=en_US.ISO-8859-1 python3
...
 >>> import sys
 >>> print (sys.stdout.encoding)
utf-8

 >>> import locale
 >>> locale.getdefaultlocale()
('en_US', 'UTF-8')

I'm not familar with POSIX's locale [1] in detail and in particular
on my system (gnome terminal), I can't say how I can change to 8 bit
coding to reproduce the issue.

[1] https://docs.python.org/3/library/locale.html

-- Markus --


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-12  7:18             ` Markus Heiser
@ 2021-05-12  7:37               ` Markus Heiser
  0 siblings, 0 replies; 41+ messages in thread
From: Markus Heiser @ 2021-05-12  7:37 UTC (permalink / raw)
  To: Michal Suchánek, Mauro Carvalho Chehab; +Cc: linux-doc, Jonathan Corbet

Am 12.05.21 um 09:18 schrieb Markus Heiser:
> It depends on what character set your en_US locale implements.
> 
> ~> cat test.py
> print("↑ᛏ个")

At least the last character is a not-printable character.

For the non 8bit characters I recommend to use python's
unicode representation (\u):

   >>> print('The currency in EU is \u20AC')
   The currency in EU is €

  -- Markus --

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-12  7:01           ` Michal Suchánek
  2021-05-12  7:18             ` Markus Heiser
@ 2021-05-12  7:59             ` Mauro Carvalho Chehab
  2021-05-17 13:10               ` Michal Suchánek
  1 sibling, 1 reply; 41+ messages in thread
From: Mauro Carvalho Chehab @ 2021-05-12  7:59 UTC (permalink / raw)
  To: Michal Suchánek; +Cc: Markus Heiser, linux-doc, Jonathan Corbet

Em Wed, 12 May 2021 09:01:57 +0200
Michal Suchánek <msuchanek@suse.de> escreveu:

> On Wed, May 12, 2021 at 08:22:38AM +0200, Mauro Carvalho Chehab wrote:
> > Hi Michal,
> > 
> > Em Thu, 6 May 2021 19:48:49 +0200
> > Michal Suchánek <msuchanek@suse.de> escreveu:
> >   
> > > [  127s] + :
> > > [  127s] + locale
> > > [  128s] LANG=en_US
> > > [  128s] LC_CTYPE="en_US"
> > > [  128s] LC_NUMERIC="en_US"
> > > [  128s] LC_TIME="en_US"
> > > [  128s] LC_COLLATE="en_US"
> > > [  128s] LC_MONETARY="en_US"
> > > [  128s] LC_MESSAGES="en_US"
> > > [  128s] LC_PAPER="en_US"
> > > [  128s] LC_NAME="en_US"
> > > [  128s] LC_ADDRESS="en_US"
> > > [  128s] LC_TELEPHONE="en_US"
> > > [  128s] LC_MEASUREMENT="en_US"
> > > [  128s] LC_IDENTIFICATION="en_US"
> > > [  128s] LC_ALL=
> > > [  128s] + echo LC_ALL=
> > > [  128s] LC_ALL=
> > > [  128s] + echo LANG=en_US
> > > [  128s] LANG=en_US  
> > 
> > Where those the locale settings that you used when the build
> > failed?
> > 
> > I tried to reproduce the bug here with, disabling the parallel run (as
> > it masks the real error) with both:
> > 
> > 	$ for i in LANG LC_ALL LC_ADDRESS LC_IDENTIFICATION LC_MEASUREMENT LC_MONETARY LC_NAME LC_NUMERIC LC_PAPER LC_TELEPHONE LC_TIME; do echo $i=en_US; done
> > 	$ make cleandocs && make SPHINXOPTS=-j1 htmldocs
> > 
> > (this one caused lots of warnings on Debian, due to the
> >  settings at /etc/locale.gen)
> > 
> > and:
> > 
> > 	$ for i in LANG LC_ALL LC_ADDRESS LC_IDENTIFICATION LC_MEASUREMENT LC_MONETARY LC_NAME LC_NUMERIC LC_PAPER LC_TELEPHONE LC_TIME; do echo $i=en_US.ISO-8859-1; done
> > 	$ make cleandocs && make SPHINXOPTS=-j1 htmldocs
> > 
> > Without any success.
> > 
> > Could you please provide more details about the build VM and the git 
> > changeset that caused the issue?  
> 
> It depends on what character set your en_US locale implements.
> 
> ~> cat test.py   
> print("↑ᛏ个")
> ~> locale  
> LANG=en_US.utf8
> LC_CTYPE="en_US.utf8"
> LC_NUMERIC="en_US.utf8"
> LC_TIME="en_US.utf8"
> LC_COLLATE="en_US.utf8"
> LC_MONETARY="en_US.utf8"
> LC_MESSAGES="en_US.utf8"
> LC_PAPER="en_US.utf8"
> LC_NAME="en_US.utf8"
> LC_ADDRESS="en_US.utf8"
> LC_TELEPHONE="en_US.utf8"
> LC_MEASUREMENT="en_US.utf8"
> LC_IDENTIFICATION="en_US.utf8"
> LC_ALL=
> ~> python3 test.py   
> ↑ᛏ个
> ~> LANG=en_US python3 test.py   
> Traceback (most recent call last):
>   File "test.py", line 1, in <module>
>     print("\u2191\u16cf\u4e2a\uf8f9")
> UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-3: ordinal not in range(256)
> ~> LANG=C python3 test.py   
> ↑ᛏ个
> 

This is working as expected on my test machine:

	$ LANG=en_US.utf8 python3 test.py
	↑ᛏ个
	$ LANG=en_US python3 test.py
	Traceback (most recent call last):
	  File "test.py", line 1, in <module>
	    print("\u2191\u16cf\u4e2a\uf8f9")
	UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-3: ordinal not in range(256)

Yet, running:

	$ . /devel/v4l/docs/sphinx_3.3.1/bin/activate
	make cleandocs && LANG=en_US make SPHINXOPTS=-j1 htmldocs

Doesn't produce any UnicodeEncodeError errors.

See, here I'm testing it with Sphinx version 3.3.1, on Ubuntu 20.04,
using changeset 9f4ad9e425a1 Linux 5.12. Also, both UTF8 and iso8859-1
are on this machine's locale:

	$ more /etc/locale.gen |grep -v ^#
	de_DE.UTF-8 UTF-8
	en_US ISO-8859-1
	en_US.UTF-8 UTF-8

(On Debian/Ubuntu, python and other tools complain a lot if the used 
 locale is not at /etc/locale.gen)

Maybe you're using a different Sphinx version, or maybe the distro
on your VM is using has different locales installed on it or some
other different packages.

Thanks,
Mauro

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-12  7:59             ` Mauro Carvalho Chehab
@ 2021-05-17 13:10               ` Michal Suchánek
  0 siblings, 0 replies; 41+ messages in thread
From: Michal Suchánek @ 2021-05-17 13:10 UTC (permalink / raw)
  To: Mauro Carvalho Chehab; +Cc: Markus Heiser, linux-doc, Jonathan Corbet

On Wed, May 12, 2021 at 09:59:31AM +0200, Mauro Carvalho Chehab wrote:
> Em Wed, 12 May 2021 09:01:57 +0200
> Michal Suchánek <msuchanek@suse.de> escreveu:
> 
> > On Wed, May 12, 2021 at 08:22:38AM +0200, Mauro Carvalho Chehab wrote:
> > > Hi Michal,
> > > 
> > > Em Thu, 6 May 2021 19:48:49 +0200
> > > Michal Suchánek <msuchanek@suse.de> escreveu:
> > >   
> > > > [  127s] + :
> > > > [  127s] + locale
> > > > [  128s] LANG=en_US
> > > > [  128s] LC_CTYPE="en_US"
> > > > [  128s] LC_NUMERIC="en_US"
> > > > [  128s] LC_TIME="en_US"
> > > > [  128s] LC_COLLATE="en_US"
> > > > [  128s] LC_MONETARY="en_US"
> > > > [  128s] LC_MESSAGES="en_US"
> > > > [  128s] LC_PAPER="en_US"
> > > > [  128s] LC_NAME="en_US"
> > > > [  128s] LC_ADDRESS="en_US"
> > > > [  128s] LC_TELEPHONE="en_US"
> > > > [  128s] LC_MEASUREMENT="en_US"
> > > > [  128s] LC_IDENTIFICATION="en_US"
> > > > [  128s] LC_ALL=
> > > > [  128s] + echo LC_ALL=
> > > > [  128s] LC_ALL=
> > > > [  128s] + echo LANG=en_US
> > > > [  128s] LANG=en_US  
> > > 
> > > Where those the locale settings that you used when the build
> > > failed?
> > > 
> > > I tried to reproduce the bug here with, disabling the parallel run (as
> > > it masks the real error) with both:
> > > 
> > > 	$ for i in LANG LC_ALL LC_ADDRESS LC_IDENTIFICATION LC_MEASUREMENT LC_MONETARY LC_NAME LC_NUMERIC LC_PAPER LC_TELEPHONE LC_TIME; do echo $i=en_US; done
> > > 	$ make cleandocs && make SPHINXOPTS=-j1 htmldocs
> > > 
> > > (this one caused lots of warnings on Debian, due to the
> > >  settings at /etc/locale.gen)
> > > 
> > > and:
> > > 
> > > 	$ for i in LANG LC_ALL LC_ADDRESS LC_IDENTIFICATION LC_MEASUREMENT LC_MONETARY LC_NAME LC_NUMERIC LC_PAPER LC_TELEPHONE LC_TIME; do echo $i=en_US.ISO-8859-1; done
> > > 	$ make cleandocs && make SPHINXOPTS=-j1 htmldocs
> > > 
> > > Without any success.
> > > 
> > > Could you please provide more details about the build VM and the git 
> > > changeset that caused the issue?  
> > 
> > It depends on what character set your en_US locale implements.
> > 
> > ~> cat test.py   
> > print("↑ᛏ个")
> > ~> locale  
> > LANG=en_US.utf8
> > LC_CTYPE="en_US.utf8"
> > LC_NUMERIC="en_US.utf8"
> > LC_TIME="en_US.utf8"
> > LC_COLLATE="en_US.utf8"
> > LC_MONETARY="en_US.utf8"
> > LC_MESSAGES="en_US.utf8"
> > LC_PAPER="en_US.utf8"
> > LC_NAME="en_US.utf8"
> > LC_ADDRESS="en_US.utf8"
> > LC_TELEPHONE="en_US.utf8"
> > LC_MEASUREMENT="en_US.utf8"
> > LC_IDENTIFICATION="en_US.utf8"
> > LC_ALL=
> > ~> python3 test.py   
> > ↑ᛏ个
> > ~> LANG=en_US python3 test.py   
> > Traceback (most recent call last):
> >   File "test.py", line 1, in <module>
> >     print("\u2191\u16cf\u4e2a\uf8f9")
> > UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-3: ordinal not in range(256)
> > ~> LANG=C python3 test.py   
> > ↑ᛏ个
> > 
> 
> This is working as expected on my test machine:
> 
> 	$ LANG=en_US.utf8 python3 test.py
> 	↑ᛏ个
> 	$ LANG=en_US python3 test.py
> 	Traceback (most recent call last):
> 	  File "test.py", line 1, in <module>
> 	    print("\u2191\u16cf\u4e2a\uf8f9")
> 	UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-3: ordinal not in range(256)
> 
> Yet, running:
> 
> 	$ . /devel/v4l/docs/sphinx_3.3.1/bin/activate
> 	make cleandocs && LANG=en_US make SPHINXOPTS=-j1 htmldocs
> 
> Doesn't produce any UnicodeEncodeError errors.
> 
> See, here I'm testing it with Sphinx version 3.3.1, on Ubuntu 20.04,
> using changeset 9f4ad9e425a1 Linux 5.12. Also, both UTF8 and iso8859-1
> are on this machine's locale:
> 
> 	$ more /etc/locale.gen |grep -v ^#
> 	de_DE.UTF-8 UTF-8
> 	en_US ISO-8859-1
> 	en_US.UTF-8 UTF-8
> 
> (On Debian/Ubuntu, python and other tools complain a lot if the used 
>  locale is not at /etc/locale.gen)
> 
> Maybe you're using a different Sphinx version, or maybe the distro
> on your VM is using has different locales installed on it or some
> other different packages.

I am using these:

[   14s] [287/464] cumulate python38-sphinxcontrib-websupport-1.2.4-1.3
[   14s] [323/464] cumulate python38-Sphinx2-2.3.1-4.1
[   14s] [324/464] cumulate python38-sphinx_rtd_theme-0.5.2-1.1
[   14s] [325/464] cumulate python38-sphinxcontrib-applehelp-1.0.2-1.4
[   14s] [326/464] cumulate python38-sphinxcontrib-devhelp-1.0.2-1.4
[   14s] [327/464] cumulate python38-sphinxcontrib-htmlhelp-1.0.3-1.4
[   14s] [328/464] cumulate python38-sphinxcontrib-jsmath-1.0.1-2.5
[   14s] [329/464] cumulate python38-sphinxcontrib-qthelp-1.0.3-1.4
[   14s] [330/464] cumulate python38-sphinxcontrib-serializinghtml-1.1.4-1.4

[  455s] Sphinx parallel build error:
[  455s] UnicodeEncodeError: 'latin-1' codec can't encode characters in
position 18-20: ordinal not in range(256)
[  467s] make[2]: *** [../Documentation/Makefile:91: htmldocs] Error 2
[  467s] make[1]: ***
[/home/abuild/rpmbuild/BUILD/kernel-docs-5.13~rc1.next.20210514/linux-5.13-rc1-next-20210514/Makefile:1784:
htmldocs] Error 2
[  467s] make[1]: Leaving directory
'/home/abuild/rpmbuild/BUILD/kernel-docs-5.13~rc1.next.20210514/linux-5.13-rc1-next-20210514/html'
[  467s] make: *** [Makefile:222: __sub-make] Error 2

Thanks

Michal

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2021-05-17 13:10 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-06 10:39 Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256) Michal Suchánek
2021-05-06 11:20 ` Mauro Carvalho Chehab
2021-05-06 13:32   ` Michal Suchánek
2021-05-06 14:24     ` Mauro Carvalho Chehab
2021-05-06 14:35       ` Michal Suchánek
2021-05-06 15:57 ` Markus Heiser
2021-05-06 16:46   ` Mauro Carvalho Chehab
2021-05-06 17:04     ` Markus Heiser
2021-05-06 17:27       ` Mauro Carvalho Chehab
2021-05-06 17:53         ` Markus Heiser
2021-05-06 18:06           ` Michal Suchánek
2021-05-07  8:52             ` Mauro Carvalho Chehab
2021-05-06 17:57         ` Randy Dunlap
2021-05-06 18:08           ` Matthew Wilcox
2021-05-06 21:21             ` Randy Dunlap
2021-05-07  6:39               ` Mauro Carvalho Chehab
2021-05-07  6:49                 ` Randy Dunlap
2021-05-07  8:04                 ` Mauro Carvalho Chehab
2021-05-07  8:35                   ` Michal Suchánek
2021-05-07  8:56                     ` Markus Heiser
2021-05-07  9:14                       ` Mauro Carvalho Chehab
2021-05-07  9:51                         ` Markus Heiser
2021-05-07 10:29                           ` Michal Suchánek
2021-05-07  9:02                     ` Mauro Carvalho Chehab
2021-05-08  9:22                 ` Mauro Carvalho Chehab
2021-05-08 10:41                   ` Michal Suchánek
2021-05-08 14:41                     ` Mauro Carvalho Chehab
2021-05-08 15:55                       ` Randy Dunlap
2021-05-08 17:09                         ` Michal Suchánek
2021-05-08 17:46                           ` Randy Dunlap
2021-05-10  6:22                             ` Mauro Carvalho Chehab
2021-05-10  8:17                         ` Mauro Carvalho Chehab
2021-05-06 17:48       ` Michal Suchánek
2021-05-06 17:59         ` Markus Heiser
2021-05-06 18:16           ` Michal Suchánek
2021-05-12  6:22         ` Mauro Carvalho Chehab
2021-05-12  7:01           ` Michal Suchánek
2021-05-12  7:18             ` Markus Heiser
2021-05-12  7:37               ` Markus Heiser
2021-05-12  7:59             ` Mauro Carvalho Chehab
2021-05-17 13:10               ` Michal Suchánek

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).