* Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
@ 2021-05-06 10:39 Michal Suchánek
  2021-05-06 11:20 ` Mauro Carvalho Chehab
  2021-05-06 15:57 ` Markus Heiser
  0 siblings, 2 replies; 41+ messages in thread
From: Michal Suchánek @ 2021-05-06 10:39 UTC
  To: linux-doc; +Cc: Jonathan Corbet, Mauro Carvalho Chehab

When building HTML documentation I get this output:

[ 120s] + make O=/home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/html PYTHON=python3 htmldocs
[ 120s] make[1]: Entering directory '/home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/html'
[ 120s] cat: /etc/os-release: No such file or directory
[ 121s] SPHINX htmldocs --> file:///home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/html/Documentation/output
[ 121s] PARSE include/uapi/linux/dvb/audio.h
[ 121s] PARSE include/uapi/linux/dvb/ca.h
[ 121s] PARSE include/uapi/linux/dvb/dmx.h
[ 121s] PARSE include/uapi/linux/dvb/frontend.h
[ 122s] PARSE include/uapi/linux/dvb/net.h
[ 122s] PARSE include/uapi/linux/dvb/video.h
[ 122s] PARSE include/uapi/linux/videodev2.h
[ 122s] PARSE include/uapi/linux/media.h
[ 122s] PARSE include/uapi/linux/cec.h
[ 122s] PARSE include/uapi/linux/lirc.h
[ 190s] ../include/linux/dcache.h:318: warning: expecting prototype for dget, dget_dlock(). Prototype was for dget_dlock() instead
[ 203s] ../include/linux/regulator/driver.h:388: warning: Function parameter or member 'ramp_reg' not described in 'regulator_desc'
[ 203s] ../include/linux/regulator/driver.h:388: warning: Function parameter or member 'ramp_mask' not described in 'regulator_desc'
[ 203s] ../include/linux/regulator/driver.h:388: warning: Function parameter or member 'ramp_delay_table' not described in 'regulator_desc'
[ 203s] ../include/linux/regulator/driver.h:388: warning: Function parameter or member 'n_ramp_values' not described in 'regulator_desc'
[ 203s] ../include/linux/spi/spi.h:671: warning: Function parameter or member 'devm_allocated' not described in 'spi_controller'
[ 203s] ../drivers/usb/dwc3/core.h:865: warning: Function parameter or member 'hwparams9' not described in 'dwc3_hwparams'
[ 233s] ../drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:2808: warning: Excess function parameter 'vm_context' description in 'amdgpu_vm_init'
[ 233s] ../drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:2808: warning: Excess function parameter 'vm_context' description in 'amdgpu_vm_init'
[ 233s] ../drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h:426: warning: Function parameter or member 'disable_hpd_irq' not described in 'amdgpu_display_manager'
[ 233s] ../drivers/gpu/drm/i915/i915_cmd_parser.c:1420: warning: Excess function parameter 'trampoline' description in 'intel_engine_cmd_parser'
[ 233s] ../drivers/gpu/drm/i915/i915_cmd_parser.c:1420: warning: Function parameter or member 'jump_whitelist' not described in 'intel_engine_cmd_parser'
[ 233s] ../drivers/gpu/drm/i915/i915_cmd_parser.c:1420: warning: Function parameter or member 'shadow_map' not described in 'intel_engine_cmd_parser'
[ 233s] ../drivers/gpu/drm/i915/i915_cmd_parser.c:1420: warning: Function parameter or member 'batch_map' not described in 'intel_engine_cmd_parser'
[ 233s] ../drivers/gpu/drm/i915/i915_cmd_parser.c:1420: warning: Excess function parameter 'trampoline' description in 'intel_engine_cmd_parser'
[ 233s] ../drivers/gpu/host1x/bus.c:774: warning: Excess function parameter 'key' description in '__host1x_client_register'
[ 234s] /home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/Documentation/ABI/testing/sysfs-platform-intel-pmc:2: WARNING: Definition list ends without a blank line; unexpected unindent.
[ 234s] /home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/Documentation/driver-api/serial/index.rst:17: WARNING: toctree contains reference to nonexisting document 'driver-api/serial/rocket'
[ 234s] /home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/Documentation/gpu/amdgpu:96: ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:323: WARNING: Unexpected indentation.
[ 234s] /home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/Documentation/gpu/amdgpu:96: ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:324: WARNING: Block quote ends without a blank line; unexpected unindent.
[ 234s] /home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/Documentation/gpu/amdgpu:96: ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:327: WARNING: Definition list ends without a blank line; unexpected unindent.
[ 307s] /home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/Documentation/driver-api/usb/writing_usb_driver.rst:129: WARNING: undefined label: usb_header (if the link has no caption the label must precede a section header)
[ 412s]
[ 412s] Sphinx parallel build error:
[ 412s] UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
[ 431s] make[2]: *** [../Documentation/Makefile:91: htmldocs] Error 2
[ 431s] make[1]: *** [/home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/Makefile:1784: htmldocs] Error 2
[ 431s] make[1]: Leaving directory '/home/abuild/rpmbuild/BUILD/kernel-docs-5.12.0.next.20210506/linux-5.12-next-20210506/html'
[ 431s] make: *** [Makefile:222: __sub-make] Error 2
[ 431s] error: Bad exit status from /var/tmp/rpm-tmp.npkyVx (%build)

It does not say which input file contains the offending character so I can't tell which file is broken.

Any idea how to debug?

Thanks

Michal
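
Since the error names no input file, one way to narrow the search is to scan the tree for characters that latin-1 cannot encode (code points above 255). The following is only a sketch under assumptions: the helper names `first_non_latin1` and `scan_tree` are hypothetical, the sources are assumed to be UTF-8, and nothing in the log shows that the offending string comes from a file's contents rather than from a path or generated text. It lists candidates, not the culprit.

```python
import os


def first_non_latin1(text):
    """Return (index, char) of the first character latin-1 cannot encode, else None."""
    for index, char in enumerate(text):
        if ord(char) > 255:  # latin-1 only covers code points 0..255
            return index, char
    return None


def scan_tree(root):
    """Yield (path, line_number, column, char) for the first hit in each file.

    Files that are not valid UTF-8 are reported as (path, None, None, None).
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8") as handle:
                    for line_number, line in enumerate(handle, start=1):
                        hit = first_non_latin1(line)
                        if hit is not None:
                            yield path, line_number, hit[0], hit[1]
                            break
            except UnicodeDecodeError:
                # Binary or differently encoded file: flag it without a position.
                yield path, None, None, None
            except OSError:
                # Unreadable entry (dangling symlink, permissions): skip it.
                continue
```

Calling `for hit in scan_tree("Documentation"): print(hit)` from the top of the source tree prints one line per candidate file; the directory name is only an example.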
* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-06 10:39 Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256) Michal Suchánek
@ 2021-05-06 11:20 ` Mauro Carvalho Chehab
  2021-05-06 13:32   ` Michal Suchánek
  2021-05-06 15:57 ` Markus Heiser
  1 sibling, 1 reply; 41+ messages in thread
From: Mauro Carvalho Chehab @ 2021-05-06 11:20 UTC
  To: Michal Suchánek; +Cc: linux-doc, Jonathan Corbet

On Thu, 6 May 2021 12:39:13 +0200
Michal Suchánek <msuchanek@suse.de> wrote:

> When building HTML documentation I get this output:
>
> [...]
> [ 412s] Sphinx parallel build error:
> [ 412s] UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
> [...]
>
> It does not say which input file contains the offending character so I can't tell which file is broken.
>
> Any idea how to debug?

Yes. You probably have some stray file under Documentation/ABI.
Some text editors, like kate, sometimes leave temporary files behind.

scripts/get_abi.pl (currently) doesn't have any logic
to tell valid ABI files apart from junk left in
the ABI directories.

Just running git status (or git clean) and removing such
files should fix the build.

Regards,
Mauro
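
The git status check suggested above can be scripted. This is a minimal sketch, assuming the tree is a git checkout and `git` is on the PATH; the function names `parse_porcelain` and `stray_files` are hypothetical. It relies on the `--porcelain` output of `git status`, where untracked entries start with `??` and ignored ones with `!!`.

```python
import subprocess


def parse_porcelain(output):
    """Return the paths that `git status --porcelain` marks untracked or ignored."""
    stray = []
    for line in output.splitlines():
        # Porcelain v1 lines are "XY <path>"; "??" is untracked, "!!" is ignored.
        if line.startswith(("?? ", "!! ")):
            stray.append(line[3:])
    return stray


def stray_files(directory):
    """List untracked and ignored files under `directory` in the current repository."""
    result = subprocess.run(
        ["git", "status", "--porcelain", "--ignored", "--", directory],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_porcelain(result.stdout)
```

For example, `stray_files("Documentation/ABI")` run from the top of the tree would return any editor leftovers there. It cannot help with a tarball that carries no git metadata.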
* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-06 11:20 ` Mauro Carvalho Chehab
@ 2021-05-06 13:32   ` Michal Suchánek
  2021-05-06 14:24     ` Mauro Carvalho Chehab
  0 siblings, 1 reply; 41+ messages in thread
From: Michal Suchánek @ 2021-05-06 13:32 UTC
  To: Mauro Carvalho Chehab; +Cc: linux-doc, Jonathan Corbet

On Thu, May 06, 2021 at 01:20:06PM +0200, Mauro Carvalho Chehab wrote:
> On Thu, 6 May 2021 12:39:13 +0200
> Michal Suchánek <msuchanek@suse.de> wrote:
>
> > When building HTML documentation I get this output:
> >
> > [...]
> > [ 412s] Sphinx parallel build error:
> > [ 412s] UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
> > [...]
> >
> > It does not say which input file contains the offending character so I can't tell which file is broken.
> >
> > Any idea how to debug?
>
> Yes. You probably have some stray file under Documentation/ABI.
> Some text editors, like kate, sometimes leave temporary files behind.
>
> scripts/get_abi.pl (currently) doesn't have any logic
> to tell valid ABI files apart from junk left in
> the ABI directories.
>
> Just running git status (or git clean) and removing such
> files should fix the build.

This is a clean git-archived tarball uploaded to a build service, so the
likelihood of garbage files appearing in Documentation/ABI out of
nowhere is quite small.

Thanks

Michal
* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-06 13:32   ` Michal Suchánek
@ 2021-05-06 14:24     ` Mauro Carvalho Chehab
  2021-05-06 14:35       ` Michal Suchánek
  0 siblings, 1 reply; 41+ messages in thread
From: Mauro Carvalho Chehab @ 2021-05-06 14:24 UTC
  To: Michal Suchánek; +Cc: linux-doc, Jonathan Corbet

On Thu, 6 May 2021 15:32:12 +0200
Michal Suchánek <msuchanek@suse.de> wrote:

> On Thu, May 06, 2021 at 01:20:06PM +0200, Mauro Carvalho Chehab wrote:
> > On Thu, 6 May 2021 12:39:13 +0200
> > Michal Suchánek <msuchanek@suse.de> wrote:
> >
> > > When building HTML documentation I get this output:
> > >
> > > [...]
> > > [ 412s] Sphinx parallel build error:
> > > [ 412s] UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
> > > [...]
> > >
> > > It does not say which input file contains the offending character so I can't tell which file is broken.
> > >
> > > Any idea how to debug?
> >
> > Yes. You probably have some stray file under Documentation/ABI.
> > Some text editors, like kate, sometimes leave temporary files behind.
> >
> > scripts/get_abi.pl (currently) doesn't have any logic
> > to tell valid ABI files apart from junk left in
> > the ABI directories.
> >
> > Just running git status (or git clean) and removing such
> > files should fix the build.
>
> This is a clean git-archived tarball uploaded to a build service, so the
> likelihood of garbage files appearing in Documentation/ABI out of
> nowhere is quite small.

Well, it could be something completely different ;-)

This crash happens when Sphinx/python finds a character it doesn't
recognize as valid, for example if you run something like:

    $ (echo -en "What:\tBROKEN description\ndescription:\n"; dd if=/dev/random count=10) > Documentation/ABI/testing/foobar
    $ ./scripts/get_abi.pl rest 2>/dev/null|grep BROK
    Binary file (standard input) matches

In that case, non-UTF-8 characters end up being inserted, raising python and/or Sphinx
exceptions that crash the build:

    WARNING: The kernel documentation build process
    support for Sphinx v3.0 and above is brand new. Be prepared for
    possible issues in the generated output.
    enabling CJK for LaTeX builder

    Sphinx parallel build error:
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa8 in position 1985806: invalid start byte
    make[1]: *** [Documentation/Makefile:91: htmldocs] Error 2
    make: *** [Makefile:1790: htmldocs] Error 2

Yet, this sounds weird to me:

    UnicodeEncodeError: 'latin-1'

It sounds like it is somehow trying to use the latin-1 encoding instead of utf-8.
This will certainly cause trouble, as there are non-latin-1 characters in
the docs (especially in the Japanese and Chinese translations, but there are
also a few utf-8 graphic symbols elsewhere).

I remember there was a change in the past that made utf-8 the
default for Sphinx, but I can't remember the details.

Thanks,
Mauro
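
The two failure modes discussed above are different Python exceptions, and the difference can be shown in isolation. This is a small sketch, not taken from the build: the helper names `try_encode` and `try_decode` and the sample strings are made up for illustration. A decode error means the input bytes are not valid in the chosen codec; an encode error means the text is fine but the target codec (here latin-1) cannot represent it.

```python
def try_encode(text, codec):
    """Encode `text` with `codec`; return "ok" or the exception as a string."""
    try:
        text.encode(codec)
        return "ok"
    except UnicodeEncodeError as exc:
        return f"{type(exc).__name__}: {exc}"


def try_decode(data, codec):
    """Decode `data` with `codec`; return "ok" or the exception as a string."""
    try:
        data.decode(codec)
        return "ok"
    except UnicodeDecodeError as exc:
        return f"{type(exc).__name__}: {exc}"


# Hypothetical sample: 18 ASCII characters followed by three CJK characters.
sample_text = "translations/zh_CN" + "\u4e2d\u6587\u6863"

print(try_encode(sample_text, "latin-1"))  # fails: code points above 255
print(try_encode(sample_text, "utf-8"))    # ok: UTF-8 covers all of Unicode
print(try_decode(b"What:\xa8", "utf-8"))   # fails: 0xa8 cannot start a sequence
```

The first call fails with the same wording as the report ("characters in position 18-20"), which only shows that three consecutive non-latin-1 characters at that offset are enough to trigger it. It says nothing about which string the build actually tripped over.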
* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-06 14:24     ` Mauro Carvalho Chehab
@ 2021-05-06 14:35       ` Michal Suchánek
  0 siblings, 0 replies; 41+ messages in thread
From: Michal Suchánek @ 2021-05-06 14:35 UTC
  To: Mauro Carvalho Chehab; +Cc: linux-doc, Jonathan Corbet

On Thu, May 06, 2021 at 04:24:42PM +0200, Mauro Carvalho Chehab wrote:
> On Thu, 6 May 2021 15:32:12 +0200
> Michal Suchánek <msuchanek@suse.de> wrote:
>
> > On Thu, May 06, 2021 at 01:20:06PM +0200, Mauro Carvalho Chehab wrote:
> > > On Thu, 6 May 2021 12:39:13 +0200
> > > Michal Suchánek <msuchanek@suse.de> wrote:
> > >
> > > > When building HTML documentation I get this output:
> > > >
> > > > [...]
> > > > [ 412s] Sphinx parallel build error:
> > > > [ 412s] UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
> > > > [...]
> > > >
> > > > It does not say which input file contains the offending character so I can't tell which file is broken.
> > > >
> > > > Any idea how to debug?
> > >
> > > Yes. You probably have some stray file under Documentation/ABI.
> > > Some text editors, like kate, sometimes leave temporary files behind.
> > >
> > > scripts/get_abi.pl (currently) doesn't have any logic
> > > to tell valid ABI files apart from junk left in
> > > the ABI directories.
> > >
> > > Just running git status (or git clean) and removing such
> > > files should fix the build.
> >
> > This is a clean git-archived tarball uploaded to a build service, so the
> > likelihood of garbage files appearing in Documentation/ABI out of
> > nowhere is quite small.
> > Well, it could be something completely different ;-) > > This crash happens when Sphinx/python finds a character it doesn't > recognize as valid, like if you run something like: > > $ (echo -en "What:\tBROKEN description\ndescription:\n"; dd if=/dev/random count=10) > Documentation/ABI/testing/foobar > $ ./scripts/get_abi.pl rest 2>/dev/null|grep BROK > Binary file (standard input) matches > > In such a case, non-UTF-8 chars end up being inserted, causing python and/or Sphinx > exceptions, causing it to crash: > > WARNING: The kernel documentation build process > support for Sphinx v3.0 and above is brand new. Be prepared for > possible issues in the generated output. > enabling CJK for LaTeX builder > > Sphinx parallel build error: > UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa8 in position 1985806: invalid start byte > make[1]: *** [Documentation/Makefile:91: htmldocs] Error 2 > make: *** [Makefile:1790: htmldocs] Error 2 > > Yet, this sounds weird to me: > > UnicodeEncodeError: 'latin-1' > > It sounds like it is somehow trying to use the latin-1 alphabet, instead of utf-8. How does it determine what character set to use? I suspect the limited build environment does not have locale files. The build system locale is kind of irrelevant for the documentation locale, but if it creeps in because the input file locale is not specified anywhere, this can cause problems. Thanks Michal ^ permalink raw reply [flat|nested] 41+ messages in thread
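As a rough aid for the "how does it determine what character set to use?" question, the sketch below prints the encodings a Python process picked up from its environment. It only uses standard-library calls; the helper name `describe_encodings` is made up for illustration, and nothing here is taken from Sphinx or the kernel build scripts.

```python
import locale
import sys


def describe_encodings():
    """Return the encodings this Python process derived from its environment."""
    return {
        # Encoding used by default for text files opened without encoding=...
        "preferred": locale.getpreferredencoding(False),
        # Encodings of the standard streams (may be None if a stream was replaced)
        "stdout": getattr(sys.stdout, "encoding", None),
        "stderr": getattr(sys.stderr, "encoding", None),
        # Encoding used for file names
        "filesystem": sys.getfilesystemencoding(),
    }


if __name__ == "__main__":
    for name, value in describe_encodings().items():
        print(f"{name}: {value}")
```

Running this inside the build container, once as-is and once with `LANG`/`LC_ALL` set to a UTF-8 locale, would show whether the locale is what changes the stream encodings.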
* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256) 2021-05-06 10:39 Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256) Michal Suchánek 2021-05-06 11:20 ` Mauro Carvalho Chehab @ 2021-05-06 15:57 ` Markus Heiser 2021-05-06 16:46 ` Mauro Carvalho Chehab 1 sibling, 1 reply; 41+ messages in thread From: Markus Heiser @ 2021-05-06 15:57 UTC (permalink / raw) To: Michal Suchánek, linux-doc; +Cc: Jonathan Corbet, Mauro Carvalho Chehab On 06.05.21 at 12:39, Michal Suchánek wrote: > When building HTML documentation I get this output: ... > [ 412s] UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256) ... > > It does not say which input file contains the offending character so I can't tell which file is broken. > > Any idea how to debug? I guess the build host is a very simple container, what does echo $LC_ALL echo $LANG print? If it is latin, change it to something using utf-8 (I recommend 'en_US.utf8'). A UnicodeEncodeError can occur anywhere characters are encoded from (internal) unicode to the encoding of the stream. For example: A print or log statement which streams to stdout needs to encode from unicode to stdout's encoding. If there is one unicode symbol which cannot be encoded in the stream's encoding, a UnicodeEncodeError is raised. -- Markus -- ^ permalink raw reply [flat|nested] 41+ messages in thread
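The failure mode Markus describes can be reproduced without Sphinx. The sketch below is only an illustration: the sample message and the helper name `explain_encode_failure` are invented, and the CJK characters merely stand in for whatever text the build tried to write.

```python
def explain_encode_failure(text, encoding="latin-1"):
    """Return the encoded bytes, or a description of why encoding failed."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError as exc:
        # exc.start / exc.end delimit the run of characters the codec rejected
        bad = text[exc.start:exc.end]
        # !a keeps the report ASCII-only so printing it cannot fail in turn
        return (f"{encoding} cannot encode {bad!a} "
                f"at positions {exc.start}-{exc.end - 1}")


# Made-up log message; the escapes are CJK characters (U+5185 U+6838).
message = "reading sources: \u5185\u6838"
print(explain_encode_failure(message))           # latin-1 has no CJK code points
print(explain_encode_failure(message, "utf-8"))  # utf-8 can encode any str
```

Note that the reported positions are offsets into the string being encoded (for example one log line), not into any input file, which is one reason the original message is hard to trace back to a source document.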
* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256) 2021-05-06 15:57 ` Markus Heiser @ 2021-05-06 16:46 ` Mauro Carvalho Chehab 2021-05-06 17:04 ` Markus Heiser 0 siblings, 1 reply; 41+ messages in thread From: Mauro Carvalho Chehab @ 2021-05-06 16:46 UTC (permalink / raw) To: Markus Heiser; +Cc: Michal Suchánek, linux-doc, Jonathan Corbet On Thu, 6 May 2021 17:57:15 +0200 Markus Heiser <markus.heiser@darmarit.de> wrote: > On 06.05.21 at 12:39, Michal Suchánek wrote: > > When building HTML documentation I get this output: > ... > > [ 412s] UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256) > ... > > > > It does not say which input file contains the offending character so I can't tell which file is broken. > > > > Any idea how to debug? > > I guess the build host is a very simple container, what does > > echo $LC_ALL > echo $LANG > > prompt? If it is latin, change it to something using utf-8 (I recommend > 'en_US.utf8'). > > A UnicodeEncodeError can occur everywhere where characters are > encoded from (internal) unicode to the encoding of the stream. > > By example: > > A print or log statement which streams to stdout needs to encode > from unicode to stdout's encoding. If there is one unicode symbol > which can not encoded to stream's encoding a UnicodeEncodeError > is raised. Hi Markus, The builder's locale shouldn't matter when building the Kernel documentation (or any other documents built from other git trees on other open source projects), as the Kernel's *.rpm document charset won't change, no matter in what part of the globe it was built. I vaguely remember a change we made a couple of years ago in order to address this issue. Thanks, Mauro ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256) 2021-05-06 16:46 ` Mauro Carvalho Chehab @ 2021-05-06 17:04 ` Markus Heiser 2021-05-06 17:27 ` Mauro Carvalho Chehab 2021-05-06 17:48 ` Michal Suchánek 0 siblings, 2 replies; 41+ messages in thread From: Markus Heiser @ 2021-05-06 17:04 UTC (permalink / raw) To: Mauro Carvalho Chehab, Michal Suchánek; +Cc: linux-doc, Jonathan Corbet Am 06.05.21 um 18:46 schrieb Mauro Carvalho Chehab: > Em Thu, 6 May 2021 17:57:15 +0200 > Markus Heiser <markus.heiser@darmarit.de> escreveu: > >> Am 06.05.21 um 12:39 schrieb Michal Suchánek: >>> When building HTML documentation I get this output: >> ... >>> [ 412s] UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256) >> ... >>> >>> It does not say which input file contains the offending character so I can't tell which file is broken. >>> >>> Any idea how to debug? >> >> I guess the build host is a very simple container, what does >> >> echo $LC_ALL >> echo $LANG >> >> prompt? If it is latin, change it to something using utf-8 (I recommend >> 'en_US.utf8'). >> >> A UnicodeEncodeError can occour everywhere where characters are >> encoded from (internal) unicode to the encoding of the stream. >> >> By example: >> >> A print or log statement which streams to stdout needs to encode >> from unicode to stdout's encoding. If there is one unicode symbol >> which can not encoded to stream's encoding a UnicodeEncodeError >> is raised. > > Hi Markus, > > It shouldn't matter the builder's locale when building the Kernel > documentation (or any other documents built from other git trees > on other open source projects), as the Kernel's *.rpm document charset > won't change, no matter on what part of the globe it was built. > > I vaguely remember about a change we made a couple of years ago > in order to address this issue. Hi Mauro :) sure? .. 
what if the logger wants to log some symbols from the Chinese-translated parts to stdout and the encoding of stdout is latin? In python the logger will raise a UnicodeEncodeError, this is what I know .. but I'm often wrong ;) I remember we had some patches to the Chinese translation these days, maybe there is an issue the logger wants to report. Anyway, I would always recommend using utf-8. @Michal would you give it a try? -- Markus -- ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256) 2021-05-06 17:04 ` Markus Heiser @ 2021-05-06 17:27 ` Mauro Carvalho Chehab 2021-05-06 17:53 ` Markus Heiser 2021-05-06 17:57 ` Randy Dunlap 2021-05-06 17:48 ` Michal Suchánek 1 sibling, 2 replies; 41+ messages in thread From: Mauro Carvalho Chehab @ 2021-05-06 17:27 UTC (permalink / raw) To: Markus Heiser; +Cc: Michal Suchánek, linux-doc, Jonathan Corbet Em Thu, 6 May 2021 19:04:44 +0200 Markus Heiser <markus.heiser@darmarit.de> escreveu: > Am 06.05.21 um 18:46 schrieb Mauro Carvalho Chehab: > > Em Thu, 6 May 2021 17:57:15 +0200 > > Markus Heiser <markus.heiser@darmarit.de> escreveu: > > > >> Am 06.05.21 um 12:39 schrieb Michal Suchánek: > >>> When building HTML documentation I get this output: > >> ... > >>> [ 412s] UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256) > >> ... > >>> > >>> It does not say which input file contains the offending character so I can't tell which file is broken. > >>> > >>> Any idea how to debug? > >> > >> I guess the build host is a very simple container, what does > >> > >> echo $LC_ALL > >> echo $LANG > >> > >> prompt? If it is latin, change it to something using utf-8 (I recommend > >> 'en_US.utf8'). > >> > >> A UnicodeEncodeError can occour everywhere where characters are > >> encoded from (internal) unicode to the encoding of the stream. > >> > >> By example: > >> > >> A print or log statement which streams to stdout needs to encode > >> from unicode to stdout's encoding. If there is one unicode symbol > >> which can not encoded to stream's encoding a UnicodeEncodeError > >> is raised. 
> > > > Hi Markus, > > > > It shouldn't matter the builder's locale when building the Kernel > > documentation (or any other documents built from other git trees > > on other open source projects), as the Kernel's *.rpm document charset > > won't change, no matter on what part of the globe it was built. > > > > I vaguely remember about a change we made a couple of years ago > > in order to address this issue. > > Hi Mauro :) > > sure? .. what if the logger wants to log some symbols from the > Chinese translated parts to stdout and the encoding of stdout is > latin? > > In python the logger will raise a UnicodeEncodeError, this is > what I know .. but I'm often wrong ;) Yeah, Python (and almost all python apps) has mad behavior when it finds an unexpected character: instead of ignoring it, it just crashes. On Sphinx, this is even worse, as it blames the parallel building, instead of pinpointing the real culprit. In this specific case, crashing due to an invalid char sent to the logger sounds pretty stupid to me, as what really matters is the built documents. > I remember we had some patches to the Chinese translation > these days, may be there is an issue the logger wants to report. > > Anyway I would always recommend to use utf-8. Well, IMO, for things like the logger, python or Sphinx should internally be doing something similar to: $ iconv -t latin-1//IGNORE e. g. if the system's charset is latin-1, it should ignore all charset errors at the logger, converting the charset the best way it can without crashing. As, unfortunately, this is not happening, conf.py should do something like hardcoding env{LANG}=<lang>.utf-8 (or something similar), in order to eliminate the risk of crashes. Thanks, Mauro ^ permalink raw reply [flat|nested] 41+ messages in thread
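For comparison with the `iconv -t latin-1//IGNORE` idea, Python's `str.encode` takes an `errors` argument that gives a similar lossy conversion. This is a minimal sketch of that standard-library feature, not something Sphinx is known to do; `encode_lossy` and the sample string are hypothetical.

```python
def encode_lossy(text, encoding="latin-1"):
    """Encode text with several non-crashing error handlers for comparison."""
    return {
        # Drop characters the codec cannot represent (closest to //IGNORE)
        "ignore": text.encode(encoding, errors="ignore"),
        # Substitute '?' for each unencodable character
        "replace": text.encode(encoding, errors="replace"),
        # Keep the information as an ASCII escape sequence
        "backslashreplace": text.encode(encoding, errors="backslashreplace"),
    }


# Made-up sample: an accented Latin letter (encodable) plus one CJK character.
sample = "Such\u00e1nek \u6587"
for handler, data in encode_lossy(sample).items():
    print(handler, data)
```

Of the three, `backslashreplace` loses the least: the log stays readable in any charset and the original code point can still be recovered from the escape.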
* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256) 2021-05-06 17:27 ` Mauro Carvalho Chehab @ 2021-05-06 17:53 ` Markus Heiser 2021-05-06 18:06 ` Michal Suchánek 2021-05-06 17:57 ` Randy Dunlap 1 sibling, 1 reply; 41+ messages in thread From: Markus Heiser @ 2021-05-06 17:53 UTC (permalink / raw) To: Mauro Carvalho Chehab; +Cc: Michal Suchánek, linux-doc, Jonathan Corbet Am 06.05.21 um 19:27 schrieb Mauro Carvalho Chehab: > Em Thu, 6 May 2021 19:04:44 +0200 > Markus Heiser <markus.heiser@darmarit.de> escreveu: > >> Am 06.05.21 um 18:46 schrieb Mauro Carvalho Chehab: >>> Em Thu, 6 May 2021 17:57:15 +0200 >>> Markus Heiser <markus.heiser@darmarit.de> escreveu: >>> >>>> Am 06.05.21 um 12:39 schrieb Michal Suchánek: >>>>> When building HTML documentation I get this output: >>>> ... >>>>> [ 412s] UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256) >>>> ... >>>>> >>>>> It does not say which input file contains the offending character so I can't tell which file is broken. >>>>> >>>>> Any idea how to debug? >>>> >>>> I guess the build host is a very simple container, what does >>>> >>>> echo $LC_ALL >>>> echo $LANG >>>> >>>> prompt? If it is latin, change it to something using utf-8 (I recommend >>>> 'en_US.utf8'). >>>> >>>> A UnicodeEncodeError can occour everywhere where characters are >>>> encoded from (internal) unicode to the encoding of the stream. >>>> >>>> By example: >>>> >>>> A print or log statement which streams to stdout needs to encode >>>> from unicode to stdout's encoding. If there is one unicode symbol >>>> which can not encoded to stream's encoding a UnicodeEncodeError >>>> is raised. 
>>> >>> Hi Markus, >>> >>> It shouldn't matter the builder's locale when building the Kernel >>> documentation (or any other documents built from other git trees >>> on other open source projects), as the Kernel's *.rpm document charset >>> won't change, no matter on what part of the globe it was built. >>> >>> I vaguely remember about a change we made a couple of years ago >>> in order to address this issue. >> >> Hi Mauro :) >> >> sure? .. what if the logger wants to log some symbols from the >> Chinese translated parts to stdout and the encoding of stdout is >> latin? >> >> In python the logger will raise a UnicodeEncodeError, this is >> what I know .. but I'm often wrong ;) > > Yeah, Python (and almost all python apps) has a mad behavior when > it finds an unexpected character: instead of ignoring it, it Hi Mauro, it is not comfortable but is it mad? .. Most languages (or applications) do not handle the encoding of strings; they just pipe a binary stream, while Python decodes / encodes strings. "The Zen of Python" [1] says Explicit is better than implicit. If a stream can't encode symbols and these symbols should be ignored, you have to explicitly configure the stream's encoding to ignore such symbols. I guess these encoding discussions will haunt me for the rest of my life. My escape strategy is to use UTF-8 wherever possible. [1] https://www.python.org/dev/peps/pep-0020/ -- Markus -- ^ permalink raw reply [flat|nested] 41+ messages in thread
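What "setting the stream explicitly" could look like in Python is sketched below, using an in-memory buffer in place of a real stdout. The helper name `make_tolerant_stream` is an assumption for illustration; `io.TextIOWrapper` and its `errors` parameter are standard library.

```python
import io


def make_tolerant_stream(raw, encoding="latin-1"):
    """Wrap a binary stream so unencodable characters are escaped, not fatal."""
    return io.TextIOWrapper(
        raw,
        encoding=encoding,
        errors="backslashreplace",  # escape instead of raising UnicodeEncodeError
        newline="\n",               # no platform-specific newline translation
    )


buffer = io.BytesIO()               # stands in for a latin-1 stdout
stream = make_tolerant_stream(buffer)
stream.write("warning: \u6587 in title\n")
stream.flush()
print(buffer.getvalue())
```

For the real standard streams, Python 3.7+ offers `sys.stdout.reconfigure(errors="backslashreplace")`, and the `PYTHONIOENCODING` environment variable can set both encoding and error handler from outside the process.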
* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256) 2021-05-06 17:53 ` Markus Heiser @ 2021-05-06 18:06 ` Michal Suchánek 2021-05-07 8:52 ` Mauro Carvalho Chehab 0 siblings, 1 reply; 41+ messages in thread From: Michal Suchánek @ 2021-05-06 18:06 UTC (permalink / raw) To: Markus Heiser; +Cc: Mauro Carvalho Chehab, linux-doc, Jonathan Corbet On Thu, May 06, 2021 at 07:53:25PM +0200, Markus Heiser wrote: > Am 06.05.21 um 19:27 schrieb Mauro Carvalho Chehab: > > Em Thu, 6 May 2021 19:04:44 +0200 > > Markus Heiser <markus.heiser@darmarit.de> escreveu: > > > > > Am 06.05.21 um 18:46 schrieb Mauro Carvalho Chehab: > > > > Em Thu, 6 May 2021 17:57:15 +0200 > > > > Markus Heiser <markus.heiser@darmarit.de> escreveu: > > > > > Am 06.05.21 um 12:39 schrieb Michal Suchánek: > > > > > > When building HTML documentation I get this output: > > > > > ... > > > > > > [ 412s] UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256) > > > > > ... > > > > > > > > > > > > It does not say which input file contains the offending character so I can't tell which file is broken. > > > > > > > > > > > > Any idea how to debug? > > > > > > > > > > I guess the build host is a very simple container, what does > > > > > > > > > > echo $LC_ALL > > > > > echo $LANG > > > > > > > > > > prompt? If it is latin, change it to something using utf-8 (I recommend > > > > > 'en_US.utf8'). > > > > > > > > > > A UnicodeEncodeError can occour everywhere where characters are > > > > > encoded from (internal) unicode to the encoding of the stream. > > > > > > > > > > By example: > > > > > > > > > > A print or log statement which streams to stdout needs to encode > > > > > from unicode to stdout's encoding. If there is one unicode symbol > > > > > which can not encoded to stream's encoding a UnicodeEncodeError > > > > > is raised. 
> > > > > > > > Hi Markus, > > > > > > > > It shouldn't matter the builder's locale when building the Kernel > > > > documentation (or any other documents built from other git trees > > > > on other open source projects), as the Kernel's *.rpm document charset > > > > won't change, no matter on what part of the globe it was built. > > > > > > > > I vaguely remember about a change we made a couple of years ago > > > > in order to address this issue. > > > > > > Hi Mauro :) > > > > > > sure? .. what if the logger wants to log some symbols from the > > > Chinese translated parts to stdout and the encoding of stdout is > > > latin? > > > > > > In python the logger will raise a UnicodeEncodeError, this is > > > what I know .. but I'm often wrong ;) > > > > Yeah, Python (and almost all python apps) has a mad behavior when > > it finds an unexpected character: instead of ignoring it, it > > Hi Mauro, > > it is not comfortable but is it mad? .. > > Most often languages (or applications) do not handle encoding > of strings they just piping a binary stream while python > decode / encodes strings. > > "The Zen of Python" [1] says > > Explicit is better than implicit. > > If a stream can't encode symbols and these symbols should be ignored > you have to set the encoding of the stream explicit to ignore > such symbols. The problem is this part never happened. Loggers are supposed to tell you about the error in your application, not crash it. But the problem with Sphinx may be that the output file is also assumed to be in the locale encoding, and the output encoding is never set. It's HTML so it could be encoded with entities, too. The idea of handling encoding precisely is not mad in itself, but then everybody working with just ASCII and never testing that their software works in the cases where explicit handling is needed is the mad part. Too US-centric culture for getting encodings right I guess. Thanks Michal ^ permalink raw reply [flat|nested] 41+ messages in thread
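The "encoded with entities" option Michal mentions has a direct counterpart in Python's codec error handlers. The sketch below only demonstrates that handler; whether Sphinx's HTML writer uses it is not something this thread establishes, and `to_html_bytes` is a made-up name.

```python
def to_html_bytes(text, encoding="latin-1"):
    """Encode text for HTML, turning unencodable characters into &#NNNN; references."""
    return text.encode(encoding, errors="xmlcharrefreplace")


# U+2217 (ASTERISK OPERATOR) is not in latin-1, so it becomes a character reference.
print(to_html_bytes("a \u2217 b"))
# With plain ASCII as the target, even accented Latin letters are escaped.
print(to_html_bytes("caf\u00e9", "ascii"))
```

A browser renders the numeric references as the original characters, so the output file stays valid whatever the build host's locale is.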
* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256) 2021-05-06 18:06 ` Michal Suchánek @ 2021-05-07 8:52 ` Mauro Carvalho Chehab 0 siblings, 0 replies; 41+ messages in thread From: Mauro Carvalho Chehab @ 2021-05-07 8:52 UTC (permalink / raw) To: Michal Suchánek; +Cc: Markus Heiser, linux-doc, Jonathan Corbet Em Thu, 6 May 2021 20:06:25 +0200 Michal Suchánek <msuchanek@suse.de> escreveu: > On Thu, May 06, 2021 at 07:53:25PM +0200, Markus Heiser wrote: > > Hi Mauro, > > > > it is not comfortable but is it mad? .. > > > > Most often languages (or applications) do not handle encoding > > of strings they just piping a binary stream while python > > decode / encodes strings. > > > > "The Zen of Python" [1] says > > > > Explicit is better than implicit. This was taken into an extreme with regards to charsets: "better" should never be translated to "crash" ;-) > > If a stream can't encode symbols and these symbols should be ignored > > you have to set the encoding of the stream explicit to ignore > > such symbols. > > The problem is this part never happened. Loggers are supposed to tell > you about the error in your application, not crash it. It is insane to crash the error log due to a charset issue ;-) > But the problem with Sphinx may be that the output file is also assumed > to be in the locale encoding, and the output encoding is never set. It's > HTML so it could be encoded with entities, too. > > The idea about handlinng encoding precisely is not mad in itself but then > everybody working with just ASCII and never testing their software works > in the cases where explicit handling is needed is the mad part. True. The machine's locale shouldn't affect *at all* the produced documents. 
See, there's a whole family of non-latin charsets supported on Linux: https://man7.org/linux/man-pages/man7/charsets.7.html Nothing prevents someone using a machine whose default encoding is KOI8-R/BIG-5/GB 2312/JIS X 0208/... from using Sphinx to produce UTF-8 [1] documents. [1] or whatever other output encoding Ok, the logger may not be able to correctly display certain chars, but it would be perfectly fine and sane to use //TRANSLIT (or something similar) in order to do a charset conversion. Even just printing a <?> for all chars that aren't printable at the logger's output using the charset set by LANG/LC_* is better/saner than crashing. Thanks, Mauro ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256) 2021-05-06 17:27 ` Mauro Carvalho Chehab 2021-05-06 17:53 ` Markus Heiser @ 2021-05-06 17:57 ` Randy Dunlap 2021-05-06 18:08 ` Matthew Wilcox 1 sibling, 1 reply; 41+ messages in thread From: Randy Dunlap @ 2021-05-06 17:57 UTC (permalink / raw) To: Mauro Carvalho Chehab, Markus Heiser Cc: Michal Suchánek, linux-doc, Jonathan Corbet On 5/6/21 10:27 AM, Mauro Carvalho Chehab wrote: > Em Thu, 6 May 2021 19:04:44 +0200 > Markus Heiser <markus.heiser@darmarit.de> escreveu: > >> Am 06.05.21 um 18:46 schrieb Mauro Carvalho Chehab: >>> Em Thu, 6 May 2021 17:57:15 +0200 >>> Markus Heiser <markus.heiser@darmarit.de> escreveu: >>> >>>> Am 06.05.21 um 12:39 schrieb Michal Suchánek: >>>>> When building HTML documentation I get this output: >>>> ... >>>>> [ 412s] UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256) >>>> ... >>>>> >>>>> It does not say which input file contains the offending character so I can't tell which file is broken. >>>>> >>>>> Any idea how to debug? >>>> >>>> I guess the build host is a very simple container, what does >>>> >>>> echo $LC_ALL >>>> echo $LANG >>>> >>>> prompt? If it is latin, change it to something using utf-8 (I recommend >>>> 'en_US.utf8'). >>>> >>>> A UnicodeEncodeError can occour everywhere where characters are >>>> encoded from (internal) unicode to the encoding of the stream. >>>> >>>> By example: >>>> >>>> A print or log statement which streams to stdout needs to encode >>>> from unicode to stdout's encoding. If there is one unicode symbol >>>> which can not encoded to stream's encoding a UnicodeEncodeError >>>> is raised. 
>>> >>> Hi Markus, >>> >>> It shouldn't matter the builder's locale when building the Kernel >>> documentation (or any other documents built from other git trees >>> on other open source projects), as the Kernel's *.rpm document charset >>> won't change, no matter on what part of the globe it was built. >>> >>> I vaguely remember about a change we made a couple of years ago >>> in order to address this issue. >> >> Hi Mauro :) >> >> sure? .. what if the logger wants to log some symbols from the >> chines translated parts to stdout and the encoding of stdout is >> latin? >> >> In python the logger will raise a UnicodeEncodeError, this is >> what I know .. but I'm often wrong ;) > > Yeah, Python (and almost all python apps) has a mad behavior when > it finds an unexpected character: instead of ignoring it, it > just crashes. On Sphinx, this is is even worse, as it blames > the parallel building, instead of pinpointing the real culprit. And for error messages such as this problem, it should include file name and line number along with the position. Is position in this case offset from the beginning of file or beginning of line? What a bad error message. [ah, I see that Michal has found where the error happens.] I have been going thru some of the Documentation/ files... Why do several of the files begin with (hex) ef bb bf followed by "==================" for a heading, instead of just "===================". See e.g. Documentation/timers/no_hz.rst. thanks. -- ~Randy [resending due to smtp error] ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256) 2021-05-06 17:57 ` Randy Dunlap @ 2021-05-06 18:08 ` Matthew Wilcox 2021-05-06 21:21 ` Randy Dunlap 0 siblings, 1 reply; 41+ messages in thread From: Matthew Wilcox @ 2021-05-06 18:08 UTC (permalink / raw) To: Randy Dunlap Cc: Mauro Carvalho Chehab, Markus Heiser, Michal Suchánek, linux-doc, Jonathan Corbet On Thu, May 06, 2021 at 10:57:53AM -0700, Randy Dunlap wrote: > I have been going thru some of the Documentation/ files... > > Why do several of the files begin with > (hex) ef bb bf followed by "==================" > for a heading, instead of just "===================". > See e.g. Documentation/timers/no_hz.rst. 00000000 ef bb bf 3d 3d 3d 3d 3d 3d 3d 3d 3d 3d 3d 3d 3d |...=============| ef bb bf is utf8 for 0b1111'111011'111111 = 0xFEFF which is the https://en.wikipedia.org/wiki/Byte_order_mark We should delete it. ^ permalink raw reply [flat|nested] 41+ messages in thread
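The byte order mark identification above can be checked from Python, and the same standard-library pieces can strip it. In this sketch the file contents and the helper name `strip_utf8_bom` are invented for illustration.

```python
import codecs


def strip_utf8_bom(data):
    """Drop a leading UTF-8 byte order mark (ef bb bf), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


raw = b"\xef\xbb\xbf=====\nNO_HZ\n"   # hypothetical file contents

# ef bb bf decodes to the single code point U+FEFF
print(codecs.BOM_UTF8.decode("utf-8") == "\ufeff")
# Byte-level removal
print(strip_utf8_bom(raw))
# The 'utf-8-sig' codec skips a leading BOM while decoding
print(ascii(raw.decode("utf-8-sig")))
```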
* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256) 2021-05-06 18:08 ` Matthew Wilcox @ 2021-05-06 21:21 ` Randy Dunlap 2021-05-07 6:39 ` Mauro Carvalho Chehab 0 siblings, 1 reply; 41+ messages in thread From: Randy Dunlap @ 2021-05-06 21:21 UTC (permalink / raw) To: Matthew Wilcox Cc: Mauro Carvalho Chehab, Markus Heiser, Michal Suchánek, linux-doc, Jonathan Corbet On 5/6/21 11:08 AM, Matthew Wilcox wrote: > On Thu, May 06, 2021 at 10:57:53AM -0700, Randy Dunlap wrote: >> I have been going thru some of the Documentation/ files... >> >> Why do several of the files begin with >> (hex) ef bb bf followed by "==================" >> for a heading, instead of just "===================". >> See e.g. Documentation/timers/no_hz.rst. > > 00000000 ef bb bf 3d 3d 3d 3d 3d 3d 3d 3d 3d 3d 3d 3d 3d |...=============| > > ef bb bf is utf8 for 0b1111'111011'111111 = 0xFEFF which is the > https://en.wikipedia.org/wiki/Byte_order_mark > > We should delete it. > OK, thanks, I have started on that. Just another question: ("inquiring minds want to know") Why is/are some docs using U+2217 '*' instead of ASCII '*'? E.g., Documentation/block/cdrom-standard.rst. Maybe some $EDITOR is doing this? thanks. -- ~Randy ^ permalink raw reply [flat|nested] 41+ messages in thread
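To find characters like U+2217 that look identical to ASCII in an editor, a small scan that reports code points and Unicode names can help. This is a sketch under the assumption that the file has already been read as UTF-8 text; `find_non_ascii` and the sample text are hypothetical.

```python
import unicodedata


def find_non_ascii(text):
    """Yield (line, column, code point, Unicode name) for each non-ASCII character."""
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                # Report the code point as text so the output stays ASCII-only
                yield lineno, col, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")


sample = "plain line\n \u2217 bullet\u00a0item"   # made-up sample text
for hit in find_non_ascii(sample):
    print(hit)
```

Reporting the name alongside the position makes it easy to tell intentional characters (letters in personal names) from conversion leftovers.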
* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256) 2021-05-06 21:21 ` Randy Dunlap @ 2021-05-07 6:39 ` Mauro Carvalho Chehab 2021-05-07 6:49 ` Randy Dunlap ` (2 more replies) 0 siblings, 3 replies; 41+ messages in thread From: Mauro Carvalho Chehab @ 2021-05-07 6:39 UTC (permalink / raw) To: Randy Dunlap Cc: Matthew Wilcox, Markus Heiser, Michal Suchánek, linux-doc, Jonathan Corbet Em Thu, 6 May 2021 14:21:01 -0700 Randy Dunlap <rdunlap@infradead.org> escreveu: > On 5/6/21 11:08 AM, Matthew Wilcox wrote: > > On Thu, May 06, 2021 at 10:57:53AM -0700, Randy Dunlap wrote: > >> I have been going thru some of the Documentation/ files... > >> > >> Why do several of the files begin with > >> (hex) ef bb bf followed by "==================" > >> for a heading, instead of just "===================". > >> See e.g. Documentation/timers/no_hz.rst. No idea! It seems that the text editor I used on that time added it for whatever reason. > > > > 00000000 ef bb bf 3d 3d 3d 3d 3d 3d 3d 3d 3d 3d 3d 3d 3d |...=============| > > > > ef bb bf is utf8 for 0b1111'111011'111111 = 0xFEFF which is the > > https://en.wikipedia.org/wiki/Byte_order_mark > > > > We should delete it. > > > > OK, thanks, I have started on that. > > > Just another question: ("inquiring minds want to know") > > Why is/are some docs using U+2217 '*' instead of ASCII '*'? > E.g., Documentation/block/cdrom-standard.rst. The cdrom doc is a very special case: it was originally written in LaTeX. I don't remember any other document in LaTeX inside the Kernel docs during the conversions I made. See: e327cfcb2542 ("docs: cdrom-standard.tex: convert from LaTeX to ReST") In order to convert it to .rst, I used some tool to first turn it into plain text (probably LaTeX, but I don't remember anymore), and then I manually reviewed the entire file, adding ReST tags where needed. 
I didn't realize that utf-8 chars were used instead of normal ASCII chars, as both appear the same when editing it[1]. [1] I use Fedora here. Fedora changed the default charset to utf-8 a long time ago. Anyway, we should be able to get rid of weird UTF-8 chars from it with: $ iconv -f utf-8 -t ascii//TRANSLIT Documentation/cdrom/cdrom-standard.rst I'll prepare a patch fixing it. Some care should be taken, however, as it has two places where UTF-8 chars should be used[2]. [2] There are two German person names that use UTF-8 chars: - 'o' + umlaut; - a LATIN SMALL LETTER SHARP S (Eszett) Thanks, Mauro ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256) 2021-05-07 6:39 ` Mauro Carvalho Chehab @ 2021-05-07 6:49 ` Randy Dunlap 2021-05-07 8:04 ` Mauro Carvalho Chehab 2021-05-08 9:22 ` Mauro Carvalho Chehab 2 siblings, 0 replies; 41+ messages in thread From: Randy Dunlap @ 2021-05-07 6:49 UTC (permalink / raw) To: Mauro Carvalho Chehab Cc: Matthew Wilcox, Markus Heiser, Michal Suchánek, linux-doc, Jonathan Corbet On 5/6/21 11:39 PM, Mauro Carvalho Chehab wrote: > Em Thu, 6 May 2021 14:21:01 -0700 > Randy Dunlap <rdunlap@infradead.org> escreveu: > >> >> Just another question: ("inquiring minds want to know") >> >> Why is/are some docs using U+2217 '*' instead of ASCII '*'? >> E.g., Documentation/block/cdrom-standard.rst. > > The cdrom doc is a very special case: it was originally written in LaTeX. Yes, I recall that. I even edited it at least once. > I don't remember any other document in LaTeX inside the Kernel docs during > the conversions I made. See: > e327cfcb2542 ("docs: cdrom-standard.tex: convert from LaTeX to ReST") > > In order to convert it to .rst, I used some tool to first turn it > into plain text (probably LaTeX, but I don't remember anymore), and then > I manually reviewed the entire file, adding ReST tags where needed. > > I didn't realize that utf-8 chars were used instead of normal ASCII chars, > as both appear the same when editing it[1]. > > [1] I use Fedora here. Fedora changed the default charset to utf-8 a long > time ago. > > Anyway, we should be able of get rid of weird UTF-8 chars from it with: > > $ iconv -f utf-8 -t ascii//TRANSLIT Documentation/cdrom/cdrom-standard.rst > > I'll prepare a patch fixing it. Some care should be taken, however, as > it has two places where UTF-8 chars should be used[2]. Thanks! 
> [2] There are two German person names that use UTF-8 chars: > - 'o' + umlat; > - a LATIN SMALL LETTER SHARP S (Eszett) My patch preparation notes say that the cdrom .rst file contains "fancy '*'" (not ASCII) instead of ASCII '*' in several places. Also there are several files that contain U+00A0 non-breaking space where it is not needed AFAICT. -- ~Randy ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256) 2021-05-07 6:39 ` Mauro Carvalho Chehab 2021-05-07 6:49 ` Randy Dunlap @ 2021-05-07 8:04 ` Mauro Carvalho Chehab 2021-05-07 8:35 ` Michal Suchánek 2021-05-08 9:22 ` Mauro Carvalho Chehab 2 siblings, 1 reply; 41+ messages in thread From: Mauro Carvalho Chehab @ 2021-05-07 8:04 UTC (permalink / raw) To: Randy Dunlap Cc: Matthew Wilcox, Markus Heiser, Michal Suchánek, linux-doc, Jonathan Corbet Em Fri, 7 May 2021 08:39:24 +0200 Mauro Carvalho Chehab <mchehab@kernel.org> escreveu: > Em Thu, 6 May 2021 14:21:01 -0700 > Randy Dunlap <rdunlap@infradead.org> escreveu: > > > On 5/6/21 11:08 AM, Matthew Wilcox wrote: > > > On Thu, May 06, 2021 at 10:57:53AM -0700, Randy Dunlap wrote: > > >> I have been going thru some of the Documentation/ files... > > >> > > >> Why do several of the files begin with > > >> (hex) ef bb bf followed by "==================" > > >> for a heading, instead of just "===================". > > >> See e.g. Documentation/timers/no_hz.rst. > > No idea! It seems that the text editor I used on that time added > it for whatever reason. > > > > > > > 00000000 ef bb bf 3d 3d 3d 3d 3d 3d 3d 3d 3d 3d 3d 3d 3d |...=============| > > > > > > ef bb bf is utf8 for 0b1111'111011'111111 = 0xFEFF which is the > > > https://en.wikipedia.org/wiki/Byte_order_mark > > > > > > We should delete it. > > > > > > > OK, thanks, I have started on that. > > > > > > Just another question: ("inquiring minds want to know") > > > > Why is/are some docs using U+2217 '*' instead of ASCII '*'? > > E.g., Documentation/block/cdrom-standard.rst. > > The cdrom doc is a very special case: it was originally written in LaTeX. > I don't remember any other document in LaTeX inside the Kernel docs during > the conversions I made. 
See: > e327cfcb2542 ("docs: cdrom-standard.tex: convert from LaTeX to ReST") > > In order to convert it to .rst, I used some tool to first turn it > into plain text (probably LaTeX, but I don't remember anymore), and then > I manually reviewed the entire file, adding ReST tags where needed. > > I didn't realize that utf-8 chars were used instead of normal ASCII chars, > as both appear the same when editing it[1]. > > [1] I use Fedora here. Fedora changed the default charset to utf-8 a long > time ago. > > Anyway, we should be able of get rid of weird UTF-8 chars from it with: > > $ iconv -f utf-8 -t ascii//TRANSLIT Documentation/cdrom/cdrom-standard.rst > > I'll prepare a patch fixing it. Some care should be taken, however, as > it has two places where UTF-8 chars should be used[2]. > > [2] There are two German person names that use UTF-8 chars: > - 'o' + umlat; > - a LATIN SMALL LETTER SHARP S (Eszett) Btw, I did a quick check here: excluding translations, there are 182 files with UTF-8 chars at next-20210429. It seems that most of them are on files that got converted from DocBook and html. Several of them are valid ones: the ones used on names (like Günther, Alcôve, ...). Those should remain as-is. Several Docbook/html converted documents contain UTF-8 NO-BREAK SPACE and other invisible chars, like the byte order mark (BOM) pointed by Randy. Those should be replaced (or removed for non-printable ones). - Now, there are other cases where I'm not sure if there's a consensus: 1. UTF-8 is used where there's an ASCII similar (but with a different graph symbol), like: - UTF-8 commas; - UTF-8 hyphen chars, including the long ones: FIGURE DASH, EN DASH, EM DASH IMO, those should also be converted. 2. Some UTF-8 symbols, like: - ® - ™ - ² - used mainly for I²C - … - ⬍ ↑ ↓ - µs - used for microsseconds I would keep those. 3. 
There are a couple of places which use UTF-8 graphic characters, like:

	/sys/devices/system/edac/
	├── mc
	│   ├── mc0
	│   │   ├── ce_count
	│   │   ├── ce_noinfo_count

This is the normal output of the "tree" command on machines with UTF-8.
I would keep it. Yet, iconv converts it into:

	/sys/devices/system/edac/
	+-- mc
	|   +-- mc0
	|   |   +-- ce_count
	|   |   +-- ce_noinfo_count

which would also be fine. So, replacing those would be a no-brainer, but
newer documents will probably be written using such symbols. So, I would
preserve the UTF-8 graphics characters.

I'm preparing a patchset to address the UTF-8 issues on top of today's
next, but before posting, it seems reasonable to discuss what to do with
the above cases.

Comments?

Thanks,
Mauro

^ permalink raw reply [flat|nested] 41+ messages in thread
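A minimal sketch of the cleanup proposed above for invisible characters, assuming the only goal is to drop the byte order mark and turn NO-BREAK SPACE into a plain space. The function name `clean_invisible` is hypothetical; this is not the script or patchset mentioned in the thread.

```python
BOM = "\ufeff"   # ZERO WIDTH NO-BREAK SPACE, used as byte order mark
NBSP = "\u00a0"  # NO-BREAK SPACE


def clean_invisible(text: str) -> str:
    """Drop byte order marks and replace no-break spaces with plain spaces."""
    return text.replace(BOM, "").replace(NBSP, " ")


# U+FEFF encodes to the three bytes seen in the hexdump: ef bb bf.
print(BOM.encode("utf-8").hex())

sample = "\ufeff=====\nno\u00a0hz\n"
print(repr(clean_invisible(sample)))
```

Names such as Günther are left untouched, because only the two invisible code points are handled.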
* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256) 2021-05-07 8:04 ` Mauro Carvalho Chehab @ 2021-05-07 8:35 ` Michal Suchánek 2021-05-07 8:56 ` Markus Heiser 2021-05-07 9:02 ` Mauro Carvalho Chehab 0 siblings, 2 replies; 41+ messages in thread From: Michal Suchánek @ 2021-05-07 8:35 UTC (permalink / raw) To: Mauro Carvalho Chehab Cc: Randy Dunlap, Matthew Wilcox, Markus Heiser, linux-doc, Jonathan Corbet On Fri, May 07, 2021 at 10:04:35AM +0200, Mauro Carvalho Chehab wrote: > Em Fri, 7 May 2021 08:39:24 +0200 > Mauro Carvalho Chehab <mchehab@kernel.org> escreveu: > > > Em Thu, 6 May 2021 14:21:01 -0700 > > Randy Dunlap <rdunlap@infradead.org> escreveu: > > > > > On 5/6/21 11:08 AM, Matthew Wilcox wrote: > > > > On Thu, May 06, 2021 at 10:57:53AM -0700, Randy Dunlap wrote: > > > >> I have been going thru some of the Documentation/ files... > > > >> > > > >> Why do several of the files begin with > > > >> (hex) ef bb bf followed by "==================" > > > >> for a heading, instead of just "===================". > > > >> See e.g. Documentation/timers/no_hz.rst. > > > > No idea! It seems that the text editor I used on that time added > > it for whatever reason. > > > > > > > > > > 00000000 ef bb bf 3d 3d 3d 3d 3d 3d 3d 3d 3d 3d 3d 3d 3d |...=============| > > > > > > > > ef bb bf is utf8 for 0b1111'111011'111111 = 0xFEFF which is the > > > > https://en.wikipedia.org/wiki/Byte_order_mark > > > > > > > > We should delete it. > > > > > > > > > > OK, thanks, I have started on that. > > > > > > > > > Just another question: ("inquiring minds want to know") > > > > > > Why is/are some docs using U+2217 '*' instead of ASCII '*'? > > > E.g., Documentation/block/cdrom-standard.rst. > > > > The cdrom doc is a very special case: it was originally written in LaTeX. > > I don't remember any other document in LaTeX inside the Kernel docs during > > the conversions I made. 
See: > > e327cfcb2542 ("docs: cdrom-standard.tex: convert from LaTeX to ReST") > > > > In order to convert it to .rst, I used some tool to first turn it > > into plain text (probably LaTeX, but I don't remember anymore), and then > > I manually reviewed the entire file, adding ReST tags where needed. > > > > I didn't realize that utf-8 chars were used instead of normal ASCII chars, > > as both appear the same when editing it[1]. > > > > [1] I use Fedora here. Fedora changed the default charset to utf-8 a long > > time ago. > > > > Anyway, we should be able of get rid of weird UTF-8 chars from it with: > > > > $ iconv -f utf-8 -t ascii//TRANSLIT Documentation/cdrom/cdrom-standard.rst > > > > I'll prepare a patch fixing it. Some care should be taken, however, as > > it has two places where UTF-8 chars should be used[2]. > > > > [2] There are two German person names that use UTF-8 chars: > > - 'o' + umlat; > > - a LATIN SMALL LETTER SHARP S (Eszett) > > Btw, I did a quick check here: excluding translations, there are 182 > files with UTF-8 chars at next-20210429. It seems that most of them > are on files that got converted from DocBook and html. > > Several of them are valid ones: the ones used on names > (like Günther, Alcôve, ...). > 2. Some UTF-8 symbols, like: > > - ® > - ™ > - ² - used mainly for I²C > - … > - ⬍ ↑ ↓ > - µs - used for microsseconds > 3. There are couple of places which uses UTF-8 graphic characters, like: > > /sys/devices/system/edac/ > ├── mc > │ ├── mc0 > │ │ ├── ce_count > │ │ ├── ce_noinfo_count > I'm preparing a patchset to address the UTF-8 issues on the top of > today's next, but before posting, it seems reasonable to discuss > what to do with the above cases. Comments? So the bottom line is that UTF-8 in the files will stay, and Sphinx cannot handle UTF-8 when the locale is not UTF-8. In the long run it might be nice to fix Sphinx to properly set the encoding of the files it reads and writes. 
Or maybe there is some parameter that specifies it?

For the short term I think it is reasonable to run a Python test script
that prints fancy Unicode characters before running Sphinx, and to bail
out if the test script fails.

e.g.
echo 'print("↑ᛏ个")' > test.py
python3 test.py

Thanks

Michal

^ permalink raw reply [flat|nested] 41+ messages in thread
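The failure that such a test script would catch can be reproduced without touching the terminal. A small sketch, assuming the helper name `can_encode` (hypothetical) and using the same three characters as the test script:

```python
import sys

SAMPLE = "↑ᛏ个"  # same characters as in the proposed test script


def can_encode(text: str, encoding: str) -> bool:
    """Return True if `text` can be encoded with `encoding`."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


def stdout_can_print(text: str = SAMPLE) -> bool:
    """Check `text` against the encoding Python selected for stdout."""
    encoding = getattr(sys.stdout, "encoding", None) or "ascii"
    return can_encode(text, encoding)


print(can_encode(SAMPLE, "utf-8"))    # True
print(can_encode(SAMPLE, "latin-1"))  # False: the error from the subject line
```

A build wrapper could call `stdout_can_print()` and stop early with a readable message instead of a traceback from the middle of the Sphinx run.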
* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-07  8:35               ` Michal Suchánek
@ 2021-05-07  8:56                 ` Markus Heiser
  2021-05-07  9:14                   ` Mauro Carvalho Chehab
  2021-05-07  9:02                 ` Mauro Carvalho Chehab
  1 sibling, 1 reply; 41+ messages in thread
From: Markus Heiser @ 2021-05-07 8:56 UTC (permalink / raw)
  To: Michal Suchánek, Mauro Carvalho Chehab
  Cc: Randy Dunlap, Matthew Wilcox, linux-doc, Jonathan Corbet

On 07.05.21 at 10:35, Michal Suchánek wrote:
> So the bottom line is that UTF-8 in the files will stay, and Sphinx
> cannot handle UTF-8 when the locale is not UTF-8.
>
> In the long run it might be nice to fix Sphinx to properly set the
> encoding of the files it reads and writes. Or maybe there is some
> parameter that specifies it?

Let's not mix things up. The Unicode error is neither related nor
limited to the log or to Sphinx; it is related to the fact that we (you)
try to run a UTF-8 application in an environment which is not fully
UTF-8 functional.

> For the short term I think it is reasonable to run a python test script
> that prints fancy unicode characters before running Sphinx and bail if
> the test script fails.

To be sure, I recommend setting a UTF-8 locale environment in the
Makefile.

My experience shows that this is the default with almost all
containers (images); there are only a few where this is not the
case (maybe SUSE?).

-- Markus --

^ permalink raw reply [flat|nested] 41+ messages in thread
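As a rough illustration of how much the environment decides here, the sketch below starts a child interpreter under the C locale with Python's UTF-8 mode enabled (the `PYTHONUTF8` variable, available since Python 3.7). This is an assumption-laden aside: it is a different mechanism from the locale setting recommended above, and it has not been verified against the kernel documentation build.

```python
import os
import subprocess
import sys

# Child environment: a non-UTF-8 locale, plus Python's UTF-8 mode (3.7+).
env = dict(os.environ, LC_ALL="C", PYTHONUTF8="1")
env.pop("PYTHONIOENCODING", None)  # would otherwise override the stdio encoding

child_code = "import sys; print(sys.stdout.encoding)"
result = subprocess.run(
    [sys.executable, "-c", child_code],
    env=env, capture_output=True, text=True, check=True,
)
child_encoding = result.stdout.strip()
print(child_encoding)
```

The child reports a UTF-8 stdout encoding even though its locale is C, which shows that the encoding is a property of how the interpreter was started rather than of the script being run.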
* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-07  8:56                 ` Markus Heiser
@ 2021-05-07  9:14                   ` Mauro Carvalho Chehab
  2021-05-07  9:51                     ` Markus Heiser
  0 siblings, 1 reply; 41+ messages in thread
From: Mauro Carvalho Chehab @ 2021-05-07 9:14 UTC (permalink / raw)
  To: Markus Heiser
  Cc: Michal Suchánek, Randy Dunlap, Matthew Wilcox, linux-doc, Jonathan Corbet

On Fri, 7 May 2021 10:56:39 +0200
Markus Heiser <markus.heiser@darmarit.de> wrote:

> On 07.05.21 at 10:35, Michal Suchánek wrote:
> > So the bottom line is that UTF-8 in the files will stay, and Sphinx
> > cannot handle UTF-8 when the locale is not UTF-8.
> >
> > In the long run it might be nice to fix Sphinx to properly set the
> > encoding of the files it reads and writes. Or maybe there is some
> > parameter that specifies it?
>
> Let's not mix things up. The Unicode-Error is not related or limited
> to log nor to sphinx, it is related to the fact that we (you) try to
> run a utf-8 application in an environment which is not full utf-8
> functional.

No. The application itself is not UTF-8. The application's input files
are.

That is the big issue with the way Python works with charsets: it does
a very poor job in that regard.

I remember that in the past I had to use this quite often (before UTF-8
became the default on the distros I was using at that time):

	LANG=C <some_python_script>

just to keep the scripts from crashing.

If I'm not mistaken, older Fedora/Mandrake distros had some bugs with
scripts written in Python: if the machine's language was not English,
such scripts crashed, as the i18n-translated messages were in a
different charset than the one the Python script expected.

> > For the short term I think it is reasonable to run a python test script
> > that prints fancy unicode characters before running Sphinx and bail if
> > the test script fails.
>
> To be assure, I recommend to set UTF-8 locale environment in the
> Makefile.
>
> My experience shows that this is the default with almost all
> containers (images), there are only a few where this is not the
> case (may be suse?).

That may not be true in certain parts of the globe.

I've no idea what charsets the most-used distributions in Asian
countries use ;-)

Thanks,
Mauro

^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-07  9:14                   ` Mauro Carvalho Chehab
@ 2021-05-07  9:51                     ` Markus Heiser
  2021-05-07 10:29                       ` Michal Suchánek
  0 siblings, 1 reply; 41+ messages in thread
From: Markus Heiser @ 2021-05-07 9:51 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Michal Suchánek, Randy Dunlap, Matthew Wilcox, linux-doc, Jonathan Corbet

On 07.05.21 at 11:14, Mauro Carvalho Chehab wrote:
> On Fri, 7 May 2021 10:56:39 +0200
> Markus Heiser <markus.heiser@darmarit.de> wrote:
>
>> On 07.05.21 at 10:35, Michal Suchánek wrote:
>>> So the bottom line is that UTF-8 in the files will stay, and Sphinx
>>> cannot handle UTF-8 when the locale is not UTF-8.
>>>
>>> In the long run it might be nice to fix Sphinx to properly set the
>>> encoding of the files it reads and writes. Or maybe there is some
>>> parameter that specifies it?
>>
>> Let's not mix things up. The Unicode-Error is not related or limited
>> to log nor to sphinx, it is related to the fact that we (you) try to
>> run a utf-8 application in an environment which is not full utf-8
>> functional.
>
> No. The application itself is not UTF-8. The application input files are.

Maybe we have a different view on this; for me, an application which
reads UTF-8 in and spits UTF-8 out is a UTF-8 application.

Hint: HTML is just one Sphinx writer; other writers exist as well,
e.g. LaTeX.

> The big issue with the way python works with charsets is due to that:
> it does a very poor job with regards to that.

This is your POV; the Python developers have a different view on
handling strings. There are epic discussions about it.

But all these discussions won't help, since we can't change the
principles of Python.

Personally, I think I can't ignore the principles of a language,
and I feel fine with setting up a UTF-8 environment.
> I remember that in the past I had to use this quite often
> (before UTF-8 being default on the distros I was using on that time):
>
> LANG=C <some_python_script>
>
> Just to avoid them to crash.
>
> If I'm not mistaken, older Fedora/Mandrake distros had some bugs with
> python-written scripts that, if the machine's language were not
> English, such scripts crash, as the i18n translated messages were
> on a different charset than what the python script would be expecting.

For me, "i18n translated messages" are a good example that my opinion
is not wrong. This is not true for all devices, but on those devices
you won't run applications like Sphinx.

>>> For the short term I think it is reasonable to run a python test script
>>> that prints fancy unicode characters before running Sphinx and bail if
>>> the test script fails.
>>
>> To be assure, I recommend to set UTF-8 locale environment in the
>> Makefile.
>>
>> My experience shows that this is the default with almost all
>> containers (images), there are only a few where this is not the
>> case (may be suse?).
>
> That may not be true on certain parts of the globe.

Sorry, I was speaking about common LXC images.

> I've no idea what charsets the most-used distributions in Asian
> Countries use use ;-)

I guess these days they will most often use UTF-8, since ASCII
didn't help back in the 80s ;-)

-- Markus --

^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-07  9:51                     ` Markus Heiser
@ 2021-05-07 10:29                       ` Michal Suchánek
  0 siblings, 0 replies; 41+ messages in thread
From: Michal Suchánek @ 2021-05-07 10:29 UTC (permalink / raw)
  To: Markus Heiser
  Cc: Mauro Carvalho Chehab, Randy Dunlap, Matthew Wilcox, linux-doc, Jonathan Corbet

On Fri, May 07, 2021 at 11:51:47AM +0200, Markus Heiser wrote:
> On 07.05.21 at 11:14, Mauro Carvalho Chehab wrote:
> > On Fri, 7 May 2021 10:56:39 +0200
> > Markus Heiser <markus.heiser@darmarit.de> wrote:
> >
> > > On 07.05.21 at 10:35, Michal Suchánek wrote:
> > > > So the bottom line is that UTF-8 in the files will stay, and Sphinx
> > > > cannot handle UTF-8 when the locale is not UTF-8.
> > > >
> > > > In the long run it might be nice to fix Sphinx to properly set the
> > > > encoding of the files it reads and writes. Or maybe there is some
> > > > parameter that specifies it?
> > >
> > > Let's not mix things up. The Unicode-Error is not related or limited
> > > to log nor to sphinx, it is related to the fact that we (you) try to
> > > run a utf-8 application in an environment which is not full utf-8
> > > functional.
> >
> > No. The application itself is not UTF-8. The application input files are.
>
> May be we have a different view on this, for me an application which
> reads UTF-8 in and spids out UTF-8 is an UTF-8 application.
>
> hint: HTML is just one Sphinx writer, there exist also other writers
> e.g. LaTeX.

And just as the browser can display HTML documents in pretty much any
character set independently of your system locale, Sphinx should be able
to produce them for your browser to display independently of the system
locale. Same for LaTeX, PDF, or whatever else.

> > The big issue with the way python works with charsets is due to that:
> > it does a very poor job with regards to that.
>
> This is your POV, the python developers have a different view on
> handling strings. There are epic discussions around about.
>
> But all this discussions won't help, since we can't change the
> principles of python.

It has nothing to do with the Python developers' POV on handling strings
or with the principles of Python. Python's support for handling strings
is complete in the sense that it does not depend on the system locale
and can handle strings in multiple character sets. Sphinx, as a program
written in Python, could handle documents in any encoding supported by
Python, independent of the system locale, if the Sphinx developers
bothered to use Python's encoding support correctly. Apparently they
did not.

>
> Personally I think I can't ignore the principles of a language
> and I'm feeling well with setting up an UTF-8 environment.
>
> > I remember that in the past I had to use this quite often
> > (before UTF-8 being default on the distros I was using on that time):
> >
> > LANG=C <some_python_script>
> >
> > Just to avoid them to crash.
> >
> > If I'm not mistaken, older Fedora/Mandrake distros had some bugs with
> > python-written scripts that, if the machine's language were not
> > English, such scripts crash, as the i18n translated messages were
> > on a different charset than what the python script would be expecting.
>
> For me "i18n translated message" is a good example that I'm not
> wrong with my opinions. This is not true for all devices but
> on those device you won't run an applications like Sphinx.

Or it's a good example of people never testing the application for the
case where explicit handling is required, and possibly one of the
reasons more requirements for explicit handling of the encoding were
added. In the end it merely led to changing from a universal ASCII
encoding to a universal UTF-8 encoding, with no support for running
Python scripts in any locale that does not use the 'universal' encoding.
I think that the idea was to make scripts resilient to encoding errors and prevent data corruption by raising an exception when mishandling of encoding is detected but instead of handling the exceptions people just punted to using the same encoding all the time. Thanks Michal ^ permalink raw reply [flat|nested] 41+ messages in thread
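The explicit handling argued for above can be illustrated with plain file I/O. This is a minimal sketch of the general pattern, not of how Sphinx reads or writes its files; the file name `sample.rst` and the sample text are made up for the example.

```python
import os
import tempfile

text = "I\u00b2C bus, 10 \u00b5s, G\u00fcnther\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.rst")
    # Explicit encoding: the bytes written do not depend on LANG or LC_ALL.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    # Reading back with the same explicit encoding is equally locale-independent.
    with open(path, encoding="utf-8") as f:
        round_trip = f.read()

print(round_trip == text)
```

Without the `encoding` argument, `open()` falls back to the locale's preferred encoding, which is where a non-UTF-8 environment starts to matter.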
* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256) 2021-05-07 8:35 ` Michal Suchánek 2021-05-07 8:56 ` Markus Heiser @ 2021-05-07 9:02 ` Mauro Carvalho Chehab 1 sibling, 0 replies; 41+ messages in thread From: Mauro Carvalho Chehab @ 2021-05-07 9:02 UTC (permalink / raw) To: Michal Suchánek Cc: Randy Dunlap, Matthew Wilcox, Markus Heiser, linux-doc, Jonathan Corbet Em Fri, 7 May 2021 10:35:27 +0200 Michal Suchánek <msuchanek@suse.de> escreveu: > On Fri, May 07, 2021 at 10:04:35AM +0200, Mauro Carvalho Chehab wrote: > > Em Fri, 7 May 2021 08:39:24 +0200 > > Mauro Carvalho Chehab <mchehab@kernel.org> escreveu: > > > > > Em Thu, 6 May 2021 14:21:01 -0700 > > > Randy Dunlap <rdunlap@infradead.org> escreveu: > > > > > > > On 5/6/21 11:08 AM, Matthew Wilcox wrote: > > > > > On Thu, May 06, 2021 at 10:57:53AM -0700, Randy Dunlap wrote: > > > > >> I have been going thru some of the Documentation/ files... > > > > >> > > > > >> Why do several of the files begin with > > > > >> (hex) ef bb bf followed by "==================" > > > > >> for a heading, instead of just "===================". > > > > >> See e.g. Documentation/timers/no_hz.rst. > > > > > > No idea! It seems that the text editor I used on that time added > > > it for whatever reason. > > > > > > > > > > > > > 00000000 ef bb bf 3d 3d 3d 3d 3d 3d 3d 3d 3d 3d 3d 3d 3d |...=============| > > > > > > > > > > ef bb bf is utf8 for 0b1111'111011'111111 = 0xFEFF which is the > > > > > https://en.wikipedia.org/wiki/Byte_order_mark > > > > > > > > > > We should delete it. > > > > > > > > > > > > > OK, thanks, I have started on that. > > > > > > > > > > > > Just another question: ("inquiring minds want to know") > > > > > > > > Why is/are some docs using U+2217 '*' instead of ASCII '*'? > > > > E.g., Documentation/block/cdrom-standard.rst. > > > > > > The cdrom doc is a very special case: it was originally written in LaTeX. 
> > > I don't remember any other document in LaTeX inside the Kernel docs during > > > the conversions I made. See: > > > e327cfcb2542 ("docs: cdrom-standard.tex: convert from LaTeX to ReST") > > > > > > In order to convert it to .rst, I used some tool to first turn it > > > into plain text (probably LaTeX, but I don't remember anymore), and then > > > I manually reviewed the entire file, adding ReST tags where needed. > > > > > > I didn't realize that utf-8 chars were used instead of normal ASCII chars, > > > as both appear the same when editing it[1]. > > > > > > [1] I use Fedora here. Fedora changed the default charset to utf-8 a long > > > time ago. > > > > > > Anyway, we should be able of get rid of weird UTF-8 chars from it with: > > > > > > $ iconv -f utf-8 -t ascii//TRANSLIT Documentation/cdrom/cdrom-standard.rst > > > > > > I'll prepare a patch fixing it. Some care should be taken, however, as > > > it has two places where UTF-8 chars should be used[2]. > > > > > > [2] There are two German person names that use UTF-8 chars: > > > - 'o' + umlat; > > > - a LATIN SMALL LETTER SHARP S (Eszett) > > > > Btw, I did a quick check here: excluding translations, there are 182 > > files with UTF-8 chars at next-20210429. It seems that most of them > > are on files that got converted from DocBook and html. > > > > Several of them are valid ones: the ones used on names > > (like Günther, Alcôve, ...). > > > 2. Some UTF-8 symbols, like: > > > > - ® > > - ™ > > - ² - used mainly for I²C > > - … > > - ⬍ ↑ ↓ > > - µs - used for microsseconds > > > 3. There are couple of places which uses UTF-8 graphic characters, like: > > > > /sys/devices/system/edac/ > > ├── mc > > │ ├── mc0 > > │ │ ├── ce_count > > │ │ ├── ce_noinfo_count > > > I'm preparing a patchset to address the UTF-8 issues on the top of > > today's next, but before posting, it seems reasonable to discuss > > what to do with the above cases. Comments? 
>
> So the bottom line is that UTF-8 in the files will stay, and Sphinx
> cannot handle UTF-8 when the locale is not UTF-8.

Yes. We can reduce the number of UTF-8 chars, but some documents need
more chars than ASCII can provide.

Btw, probably (almost?) all files under Documentation/translations use
UTF-8 charsets, for obvious reasons.

> In the long run it might be nice to fix Sphinx to properly set the
> encoding of the files it reads and writes.

Agreed.

> Or maybe there is some parameter that specifies it?
>
> For the short term I think it is reasonable to run a python test script
> that prints fancy unicode characters before running Sphinx and bail if
> the test script fails.
>
> eg.
> echo 'print("↑ᛏ个")' > test.py
> python3 test.py

Actually, a better workaround could be introduced in conf.py. This file
is read and parsed by Sphinx at an early stage. Something could be added
there that detects whether the machine's charset is not UTF-8 and either
produces a warning before the build starts or changes the charset used
by Python to something that won't crash with UTF-8.

Thanks,
Mauro

^ permalink raw reply [flat|nested] 41+ messages in thread
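The detection half of that conf.py idea might look roughly like the sketch below. The helper names `is_utf8` and `warn_if_not_utf8` are hypothetical, and the code is not taken from the kernel's conf.py; it only shows one way to compare the locale's encoding against UTF-8.

```python
import codecs
import locale
import sys


def is_utf8(encoding_name) -> bool:
    """True if `encoding_name` is an alias of UTF-8 (e.g. 'utf8', 'UTF-8')."""
    try:
        return codecs.lookup(encoding_name).name == "utf-8"
    except (LookupError, TypeError):
        return False


def warn_if_not_utf8() -> bool:
    """Warn on stderr when the locale's preferred encoding is not UTF-8."""
    encoding = locale.getpreferredencoding(False)
    ok = is_utf8(encoding)
    if not ok:
        sys.stderr.write(
            "WARNING: locale encoding is %s, not UTF-8; "
            "the build may hit UnicodeEncodeError\n" % encoding
        )
    return ok
```

Going through `codecs.lookup()` normalizes spellings such as `utf8` and `UTF-8`, so the check does not depend on how a particular distribution names the charset.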
* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256) 2021-05-07 6:39 ` Mauro Carvalho Chehab 2021-05-07 6:49 ` Randy Dunlap 2021-05-07 8:04 ` Mauro Carvalho Chehab @ 2021-05-08 9:22 ` Mauro Carvalho Chehab 2021-05-08 10:41 ` Michal Suchánek 2 siblings, 1 reply; 41+ messages in thread From: Mauro Carvalho Chehab @ 2021-05-08 9:22 UTC (permalink / raw) To: Randy Dunlap Cc: Matthew Wilcox, Markus Heiser, Michal Suchánek, linux-doc, Jonathan Corbet Em Fri, 7 May 2021 08:39:24 +0200 Mauro Carvalho Chehab <mchehab@kernel.org> escreveu: > Em Thu, 6 May 2021 14:21:01 -0700 > Randy Dunlap <rdunlap@infradead.org> escreveu: > > > On 5/6/21 11:08 AM, Matthew Wilcox wrote: > > > On Thu, May 06, 2021 at 10:57:53AM -0700, Randy Dunlap wrote: > > >> I have been going thru some of the Documentation/ files... > > >> > > >> Why do several of the files begin with > > >> (hex) ef bb bf followed by "==================" > > >> for a heading, instead of just "===================". > > >> See e.g. Documentation/timers/no_hz.rst. > > No idea! It seems that the text editor I used on that time added > it for whatever reason. > I'll prepare a patch fixing it. Some care should be taken, however, as > it has two places where UTF-8 chars should be used[2]. Ok, I did a small script in order to check what special chars we currently have (next-20210507) at Documentation/ excluding the translations. Based on my script results, we have those groups: 1. 
Latin accented characters: - U+00c7 (LATIN CAPITAL LETTER C WITH CEDILLA) (Ç) - U+00df (LATIN SMALL LETTER SHARP S) (ß) - U+00e1 (LATIN SMALL LETTER A WITH ACUTE) (á) - U+00e4 (LATIN SMALL LETTER A WITH DIAERESIS) (ä) - U+00e6 (LATIN SMALL LETTER AE) (æ) - U+00e7 (LATIN SMALL LETTER C WITH CEDILLA) (ç) - U+00e9 (LATIN SMALL LETTER E WITH ACUTE) (é) - U+00ea (LATIN SMALL LETTER E WITH CIRCUMFLEX) (ê) - U+00eb (LATIN SMALL LETTER E WITH DIAERESIS) (ë) - U+00f3 (LATIN SMALL LETTER O WITH ACUTE) (ó) - U+00f4 (LATIN SMALL LETTER O WITH CIRCUMFLEX) (ô) - U+00f6 (LATIN SMALL LETTER O WITH DIAERESIS) (ö) - U+00f8 (LATIN SMALL LETTER O WITH STROKE) (ø) - U+00fc (LATIN SMALL LETTER U WITH DIAERESIS) (ü) - U+011f (LATIN SMALL LETTER G WITH BREVE) (ğ) - U+0142 (LATIN SMALL LETTER L WITH STROKE) (ł) 2. symbols: - U+00a9 (COPYRIGHT SIGN) (©) - U+2122 (TRADE MARK SIGN) (™) - U+00ae (REGISTERED SIGN) (®) - U+00b0 (DEGREE SIGN) (°) - U+00b1 (PLUS-MINUS SIGN) (±) - U+00b2 (SUPERSCRIPT TWO) (²) - U+00b5 (MICRO SIGN) (µ) - U+00bd (VULGAR FRACTION ONE HALF) (½) - U+2026 (HORIZONTAL ELLIPSIS) (…) 3. arrows: - U+2191 (UPWARDS ARROW) (↑) - U+2192 (RIGHTWARDS ARROW) (→) - U+2193 (DOWNWARDS ARROW) (↓) - U+2b0d (UP DOWN BLACK ARROW) (⬍) 4. box drawings: - U+2500 (BOX DRAWINGS LIGHT HORIZONTAL) (─) - U+2502 (BOX DRAWINGS LIGHT VERTICAL) (│) - U+2514 (BOX DRAWINGS LIGHT UP AND RIGHT) (└) - U+251c (BOX DRAWINGS LIGHT VERTICAL AND RIGHT) (├) 5. math symbols: - U+00b7 (MIDDLE DOT) (·) - U+00d7 (MULTIPLICATION SIGN) (×) - U+2212 (MINUS SIGN) (−) - U+2217 (ASTERISK OPERATOR) (∗) - U+223c (TILDE OPERATOR) (∼) - U+2264 (LESS-THAN OR EQUAL TO) (≤) - U+2265 (GREATER-THAN OR EQUAL TO) (≥) - U+27e8 (MATHEMATICAL LEFT ANGLE BRACKET) (⟨) - U+27e9 (MATHEMATICAL RIGHT ANGLE BRACKET) (⟩) - U+00ac (NOT SIGN) (¬) 6. 
commas: - U+00b4 (ACUTE ACCENT) (´) - U+2018 (LEFT SINGLE QUOTATION MARK) (‘) - U+2019 (RIGHT SINGLE QUOTATION MARK) (’) - U+201c (LEFT DOUBLE QUOTATION MARK) (“) - U+201d (RIGHT DOUBLE QUOTATION MARK) (”) 7. spaces: - U+00a0 (NO-BREAK SPACE) ( ) - U+feff (ZERO WIDTH NO-BREAK SPACE) () 8. dashes and hyphen: - U+2010 (HYPHEN) (‐) - U+2013 (EN DASH) (–) - U+2014 (EM DASH) (—) I would keep (1) to (5), replacing just: - commas; - spaces; - dashes and hyphen. Comments? Thanks, Mauro ^ permalink raw reply [flat|nested] 41+ messages in thread
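A listing in the same shape as the one above can be produced with the standard `unicodedata` module. This is only a sketch of the idea, not the script used for the survey; the function name `describe_non_ascii` is hypothetical.

```python
import unicodedata


def describe_non_ascii(text: str):
    """Return sorted 'U+xxxx (NAME)' entries for each non-ASCII char in text."""
    chars = sorted({c for c in text if ord(c) > 127})
    return [
        "U+%04x (%s)" % (ord(c), unicodedata.name(c, "UNKNOWN"))
        for c in chars
    ]


# SUPERSCRIPT TWO, EN DASH and MICRO SIGN, written as escapes to be unambiguous.
sample = "I\u00b2C \u2013 10 \u00b5s"
for line in describe_non_ascii(sample):
    print(line)
```

Run over the contents of each file under Documentation/, this kind of output makes it easy to sort characters into the keep and replace groups discussed above.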
* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-08  9:22 ` Mauro Carvalho Chehab
@ 2021-05-08 10:41 ` Michal Suchánek
  2021-05-08 14:41 ` Mauro Carvalho Chehab
  0 siblings, 1 reply; 41+ messages in thread
From: Michal Suchánek @ 2021-05-08 10:41 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Randy Dunlap, Matthew Wilcox, Markus Heiser, linux-doc, Jonathan Corbet

On Sat, May 08, 2021 at 11:22:05AM +0200, Mauro Carvalho Chehab wrote:
> On Fri, 7 May 2021 08:39:24 +0200
> Mauro Carvalho Chehab <mchehab@kernel.org> wrote:
>
> > On Thu, 6 May 2021 14:21:01 -0700
> > Randy Dunlap <rdunlap@infradead.org> wrote:
> >
> > > On 5/6/21 11:08 AM, Matthew Wilcox wrote:
> > > > On Thu, May 06, 2021 at 10:57:53AM -0700, Randy Dunlap wrote:
> > > >> I have been going thru some of the Documentation/ files...
> > > >>
> > > >> Why do several of the files begin with
> > > >> (hex) ef bb bf followed by "=================="
> > > >> for a heading, instead of just "===================".
> > > >> See e.g. Documentation/timers/no_hz.rst.
> >
> > No idea! It seems that the text editor I used on that time added
> > it for whatever reason.
>
> > I'll prepare a patch fixing it. Some care should be taken, however, as
> > it has two places where UTF-8 chars should be used[2].
>
> Ok, I did a small script in order to check what special chars we
> currently have (next-20210507) at Documentation/ excluding the
> translations.
>
> Based on my script results, we have those groups:
>
> 1. Latin accented characters:
>    - U+00c7 (LATIN CAPITAL LETTER C WITH CEDILLA) (Ç)
>    - U+00df (LATIN SMALL LETTER SHARP S) (ß)
>    - U+00e1 (LATIN SMALL LETTER A WITH ACUTE) (á)
>    - U+00e4 (LATIN SMALL LETTER A WITH DIAERESIS) (ä)
>    - U+00e6 (LATIN SMALL LETTER AE) (æ)
>    - U+00e7 (LATIN SMALL LETTER C WITH CEDILLA) (ç)
>    - U+00e9 (LATIN SMALL LETTER E WITH ACUTE) (é)
>    - U+00ea (LATIN SMALL LETTER E WITH CIRCUMFLEX) (ê)
>    - U+00eb (LATIN SMALL LETTER E WITH DIAERESIS) (ë)
>    - U+00f3 (LATIN SMALL LETTER O WITH ACUTE) (ó)
>    - U+00f4 (LATIN SMALL LETTER O WITH CIRCUMFLEX) (ô)
>    - U+00f6 (LATIN SMALL LETTER O WITH DIAERESIS) (ö)
>    - U+00f8 (LATIN SMALL LETTER O WITH STROKE) (ø)
>    - U+00fc (LATIN SMALL LETTER U WITH DIAERESIS) (ü)
>    - U+011f (LATIN SMALL LETTER G WITH BREVE) (ğ)
>    - U+0142 (LATIN SMALL LETTER L WITH STROKE) (ł)
>
> 2. symbols:
>    - U+00a9 (COPYRIGHT SIGN) (©)
>    - U+2122 (TRADE MARK SIGN) (™)
>    - U+00ae (REGISTERED SIGN) (®)
>    - U+00b0 (DEGREE SIGN) (°)
>    - U+00b1 (PLUS-MINUS SIGN) (±)
>    - U+00b2 (SUPERSCRIPT TWO) (²)
>    - U+00b5 (MICRO SIGN) (µ)
>    - U+00bd (VULGAR FRACTION ONE HALF) (½)
>    - U+2026 (HORIZONTAL ELLIPSIS) (…)
>
> 3. arrows:
>    - U+2191 (UPWARDS ARROW) (↑)
>    - U+2192 (RIGHTWARDS ARROW) (→)
>    - U+2193 (DOWNWARDS ARROW) (↓)
>    - U+2b0d (UP DOWN BLACK ARROW) (⬍)
>
> 4. box drawings:
>    - U+2500 (BOX DRAWINGS LIGHT HORIZONTAL) (─)
>    - U+2502 (BOX DRAWINGS LIGHT VERTICAL) (│)
>    - U+2514 (BOX DRAWINGS LIGHT UP AND RIGHT) (└)
>    - U+251c (BOX DRAWINGS LIGHT VERTICAL AND RIGHT) (├)
>
> 5. math symbols:
>    - U+00b7 (MIDDLE DOT) (·)
>    - U+00d7 (MULTIPLICATION SIGN) (×)
>    - U+2212 (MINUS SIGN) (−)
>    - U+2217 (ASTERISK OPERATOR) (∗)
>    - U+223c (TILDE OPERATOR) (∼)
>    - U+2264 (LESS-THAN OR EQUAL TO) (≤)
>    - U+2265 (GREATER-THAN OR EQUAL TO) (≥)
>    - U+27e8 (MATHEMATICAL LEFT ANGLE BRACKET) (⟨)
>    - U+27e9 (MATHEMATICAL RIGHT ANGLE BRACKET) (⟩)
>    - U+00ac (NOT SIGN) (¬)

Clearly this is supposed to be ASCII tilde:
Documentation/cdrom/cdrom-standard.rst: if (cdo->capability & ∼cdi->mask & CDC _⟨capability⟩)

Use of ¬ is also very dubious in documentation (in fonts it is understandable):
Documentation/ABI/obsolete/sysfs-kernel-fadump_registered:This ABI is renamed and moved to a new location /sys/kernel/fadump/registered.¬
Documentation/ABI/obsolete/sysfs-kernel-fadump_release_mem:This ABI is renamed and moved to a new location /sys/kernel/fadump/release_mem.¬
Documentation/powerpc/transactional_memory.rst: if (MSR 29:31 ¬ = 0b010 | SRR1 29:31 ¬ = 0b000) then

The use of − is rare and could be replaced with ASCII hyphen-minus entirely
without making the text harder to understand:

Documentation/devicetree/bindings/iio/adc/adi,ad7124.yaml: 0: REFIN1(+)/REFIN1(−).
Documentation/devicetree/bindings/iio/adc/adi,ad7124.yaml: 1: REFIN2(+)/REFIN2(−).
Documentation/devicetree/bindings/iio/adc/adi,ad7192.yaml: External reference applied between the P1/REFIN2(+) and P0/REFIN2(−) pins.
Documentation/scheduler/sched-deadline.rst: ((M − 1) · WCET_max − WCET_min)/(M − (M − 2) · U_max) + WCET_max
drivers/gpu/drm/drm_color_mgmt.c: * - range: [-2^2, 2^2 - 2^−15]
drivers/iio/light/tsl2583.c: * sheet (TAOS134 − MARCH 2011):
drivers/staging/iio/adc/ad7280a.c: * (Number of Conversions per Part)) −
sound/soc/codecs/sgtl5000.c: * is the array index and the following formula: 10^((idx−15)/40) * 100

Asterisk operator is clearly meant to be ASCII:
Documentation/cdrom/cdrom-standard.rst: NULL, /∗ lseek ∗/
Documentation/cdrom/cdrom-standard.rst: block _read , /∗ read—general block-dev read ∗/
Documentation/cdrom/cdrom-standard.rst: block _write, /∗ write—general block-dev write ∗/
Documentation/cdrom/cdrom-standard.rst: NULL, /∗ readdir ∗/
Documentation/cdrom/cdrom-standard.rst: NULL, /∗ select ∗/
Documentation/cdrom/cdrom-standard.rst: cdrom_ioctl, /∗ ioctl ∗/
Documentation/cdrom/cdrom-standard.rst: NULL, /∗ mmap ∗/
Documentation/cdrom/cdrom-standard.rst: cdrom_open, /∗ open ∗/
Documentation/cdrom/cdrom-standard.rst: cdrom_release, /∗ release ∗/
Documentation/cdrom/cdrom-standard.rst: NULL, /∗ fsync ∗/
Documentation/cdrom/cdrom-standard.rst: NULL, /∗ fasync ∗/
Documentation/cdrom/cdrom-standard.rst: NULL /∗ revalidate ∗/
Documentation/filesystems/ext4/blockgroup.rst:filesystem size to 2^21 ∗ 2^27 = 2^48bytes or 256TiB.

There is only one place where ⟨⟩ is used which is very dubious:
Documentation/cdrom/cdrom-standard.rst: if (cdo->capability & ∼cdi->mask & CDC _⟨capability⟩) ...

The middle dot is mostly used in mathematical formulas that would be
unintelligible otherwise but there are a few odd uses:
Documentation/ABI/testing/sysfs-module:KernelVersion:»·3.3
Documentation/ABI/testing/sysfs-module:KernelVersion:»·3.3
Documentation/devicetree/bindings/clock/qcom,rpmcc.txt: "qcom,rpmcc-msm8992",·"qcom,rpmcc"
Documentation/devicetree/bindings/clock/qcom,rpmcc.txt: "qcom,rpmcc-msm8994",·"qcom,rpmcc"
Documentation/translations/zh_CN/kernel-hacking/hacking.rst: 阿列克谢·库兹涅佐夫享用的糟糕伏特加有关。
Documentation/translations/zh_CN/process/howto.rst: 《C程序设计语言(第2版·新版)》(徐宝文 李志 译)[机械工业出版社]
Documentation/translations/zh_CN/process/management-style.rst:.. [#cnf2] 保罗·西蒙演唱了“离开爱人的50种方法”,因为坦率地说,“告诉开发者

The ×, ≤ and ≥ uses look fine.

Thanks

Michal
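An inventory like the one quoted above can be produced with very little code. What follows is only a minimal sketch, assuming Python 3 and its standard library; it is not the script Mauro used, and the names `non_ascii_inventory` and `describe` are hypothetical. It counts every non-ASCII character in a string and prints each one in the same "U+xxxx (NAME) (char)" form as the list.

```python
import unicodedata
from collections import Counter


def non_ascii_inventory(text):
    """Count each character in text whose code point is above 127."""
    return Counter(ch for ch in text if ord(ch) > 127)


def describe(ch):
    """Format one character as 'U+xxxx (UNICODE NAME) (char)'."""
    name = unicodedata.name(ch, "UNKNOWN")
    return f"U+{ord(ch):04x} ({name}) ({ch})"


if __name__ == "__main__":
    # Sample text modelled on the cdrom-standard.rst lines quoted above.
    sample = "if (cdo->capability & \u223ccdi->mask) /\u2217 lseek \u2217/"
    for ch, count in sorted(non_ascii_inventory(sample).items()):
        print(f"{count:3d}  {describe(ch)}")
```

If files are read with `open(path, encoding="utf-8")`, a leading byte-order mark (the "ef bb bf" bytes mentioned earlier in the thread) shows up as U+FEFF in such an inventory, so the same pass would also flag those files.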
* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-08 10:41 ` Michal Suchánek
@ 2021-05-08 14:41 ` Mauro Carvalho Chehab
  2021-05-08 15:55 ` Randy Dunlap
  0 siblings, 1 reply; 41+ messages in thread
From: Mauro Carvalho Chehab @ 2021-05-08 14:41 UTC (permalink / raw)
  To: Michal Suchánek
  Cc: Randy Dunlap, Matthew Wilcox, Markus Heiser, linux-doc, Jonathan Corbet

On Sat, 8 May 2021 12:41:57 +0200
Michal Suchánek <msuchanek@suse.de> wrote:

> On Sat, May 08, 2021 at 11:22:05AM +0200, Mauro Carvalho Chehab wrote:
> > Ok, I did a small script in order to check what special chars we
> > currently have (next-20210507) at Documentation/ excluding the
> > translations.
> >
> > Based on my script results, we have those groups:
> > [...]

Hi Michal,

> Clearly this is supposed to be ASCII tilde:
> Documentation/cdrom/cdrom-standard.rst: if (cdo->capability & ∼cdi->mask & CDC _⟨capability⟩)

Yes, for this specific file, iconv //translit should solve everything.
In the case of cdrom-standard, those came from the LaTeX conversion.

> Use of ¬ is also very dubious in documentation (in fonts it is understandable):
> Documentation/ABI/obsolete/sysfs-kernel-fadump_registered:This ABI is renamed and moved to a new location /sys/kernel/fadump/registered.¬
> Documentation/ABI/obsolete/sysfs-kernel-fadump_release_mem:This ABI is renamed and moved to a new location /sys/kernel/fadump/release_mem.¬
> Documentation/powerpc/transactional_memory.rst: if (MSR 29:31 ¬ = 0b010 | SRR1 29:31 ¬ = 0b000) then

Yeah, this should probably be better written as:

	if (MSR 29:31 == 0b010 | SRR1 29:31 == 0b000) then

> The use of − is rare and could be replaced with ASCII hyphen-minus entirely
> without making the text harder to understand:
>
> Documentation/devicetree/bindings/iio/adc/adi,ad7124.yaml: 0: REFIN1(+)/REFIN1(−).
> Documentation/devicetree/bindings/iio/adc/adi,ad7124.yaml: 1: REFIN2(+)/REFIN2(−).
> Documentation/devicetree/bindings/iio/adc/adi,ad7192.yaml: External reference applied between the P1/REFIN2(+) and P0/REFIN2(−) pins.
> Documentation/scheduler/sched-deadline.rst: ((M − 1) · WCET_max − WCET_min)/(M − (M − 2) · U_max) + WCET_max
> drivers/gpu/drm/drm_color_mgmt.c: * - range: [-2^2, 2^2 - 2^−15]
> drivers/iio/light/tsl2583.c: * sheet (TAOS134 − MARCH 2011):
> drivers/staging/iio/adc/ad7280a.c: * (Number of Conversions per Part)) −
> sound/soc/codecs/sgtl5000.c: * is the array index and the following formula: 10^((idx−15)/40) * 100

Agreed.

> Asterisk operator is clearly meant to be ASCII:
> Documentation/cdrom/cdrom-standard.rst: NULL, /∗ lseek ∗/
> Documentation/cdrom/cdrom-standard.rst: block _read , /∗ read—general block-dev read ∗/
> Documentation/cdrom/cdrom-standard.rst: block _write, /∗ write—general block-dev write ∗/
> Documentation/cdrom/cdrom-standard.rst: NULL, /∗ readdir ∗/
> Documentation/cdrom/cdrom-standard.rst: NULL, /∗ select ∗/
> Documentation/cdrom/cdrom-standard.rst: cdrom_ioctl, /∗ ioctl ∗/
> Documentation/cdrom/cdrom-standard.rst: NULL, /∗ mmap ∗/
> Documentation/cdrom/cdrom-standard.rst: cdrom_open, /∗ open ∗/
> Documentation/cdrom/cdrom-standard.rst: cdrom_release, /∗ release ∗/
> Documentation/cdrom/cdrom-standard.rst: NULL, /∗ fsync ∗/
> Documentation/cdrom/cdrom-standard.rst: NULL, /∗ fasync ∗/
> Documentation/cdrom/cdrom-standard.rst: NULL /∗ revalidate ∗/
> Documentation/filesystems/ext4/blockgroup.rst:filesystem size to 2^21 ∗ 2^27 = 2^48bytes or 256TiB.
>
> There is only one place where ⟨⟩ is used which is very dubious:
> Documentation/cdrom/cdrom-standard.rst: if (cdo->capability & ∼cdi->mask & CDC _⟨capability⟩) ...

Yeah. Again, this was due to LaTeX to text conversion.

> The middle dot is mostly used in mathematical formulas that would be
> unintelligible otherwise but there are a few odd uses:
> Documentation/ABI/testing/sysfs-module:KernelVersion:»·3.3
> Documentation/ABI/testing/sysfs-module:KernelVersion:»·3.3
> Documentation/devicetree/bindings/clock/qcom,rpmcc.txt: "qcom,rpmcc-msm8992",·"qcom,rpmcc"
> Documentation/devicetree/bindings/clock/qcom,rpmcc.txt: "qcom,rpmcc-msm8994",·"qcom,rpmcc"

Yeah. It sounds like a space would be the best replacement there.

> Documentation/translations/zh_CN/kernel-hacking/hacking.rst: 阿列克谢·库兹涅佐夫享用的糟糕伏特加有关。
> Documentation/translations/zh_CN/process/howto.rst: 《C程序设计语言(第2版·新版)》(徐宝文 李志 译)[机械工业出版社]
> Documentation/translations/zh_CN/process/management-style.rst:.. [#cnf2] 保罗·西蒙演唱了“离开爱人的50种方法”,因为坦率地说,“告诉开发者

I wouldn't touch translations.

> The ×, ≤ and ≥ uses look fine.

Agreed.

Thanks for double-checking those. I'll address them.

In the meantime, I'm already preparing a patch series addressing
the issues inside documentation, using some scripting to avoid
manual mistakes:

	https://git.linuxtv.org/mchehab/experimental.git/log/?h=fix_utf8

(patch series is not 100% yet... some adjustments are still
needed in some places).

Thanks,
Mauro
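The replacements agreed on in this message can be sketched as a small character map. This is an illustration only, assuming Python 3; the names `ASCII_REPLACEMENTS` and `to_ascii` are hypothetical, and the table holds just the three characters both sides agree should become plain ASCII. The middle dot and the angle brackets are left out on purpose, because the thread treats them case by case, and ×, ≤ and ≥ are left alone because their uses "look fine".

```python
# Hypothetical table: code point -> ASCII replacement, limited to the
# characters this message agrees to convert.
ASCII_REPLACEMENTS = {
    0x223C: "~",  # TILDE OPERATOR
    0x2217: "*",  # ASTERISK OPERATOR
    0x2212: "-",  # MINUS SIGN
}


def to_ascii(line):
    """Replace the mapped characters; every other character is kept as-is."""
    return line.translate(ASCII_REPLACEMENTS)


if __name__ == "__main__":
    print(to_ascii("NULL, /\u2217 lseek \u2217/"))
    print(to_ascii("0: REFIN1(+)/REFIN1(\u2212)."))
```

A per-character table like this only touches what it lists, which makes the resulting diff easy to review file by file.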
* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-08 14:41 ` Mauro Carvalho Chehab
@ 2021-05-08 15:55 ` Randy Dunlap
  2021-05-08 17:09 ` Michal Suchánek
  2021-05-10  8:17 ` Mauro Carvalho Chehab
  0 siblings, 2 replies; 41+ messages in thread
From: Randy Dunlap @ 2021-05-08 15:55 UTC (permalink / raw)
  To: Mauro Carvalho Chehab, Michal Suchánek
  Cc: Matthew Wilcox, Markus Heiser, linux-doc, Jonathan Corbet

Hi Mauro,

On 5/8/21 7:41 AM, Mauro Carvalho Chehab wrote:
> On Sat, 8 May 2021 12:41:57 +0200
> Michal Suchánek <msuchanek@suse.de> wrote:
>
>> On Sat, May 08, 2021 at 11:22:05AM +0200, Mauro Carvalho Chehab wrote:
>>> Ok, I did a small script in order to check what special chars we
>>> currently have (next-20210507) at Documentation/ excluding the
>>> translations.
>>>
>>> Based on my script results, we have those groups:
>>> [...]
>>
>> Use of ¬ is also very dubious in documentation (in fonts it is understandable):
>> Documentation/ABI/obsolete/sysfs-kernel-fadump_registered:This ABI is renamed and moved to a new location /sys/kernel/fadump/registered.¬
>> Documentation/ABI/obsolete/sysfs-kernel-fadump_release_mem:This ABI is renamed and moved to a new location /sys/kernel/fadump/release_mem.¬
>
>> Documentation/powerpc/transactional_memory.rst: if (MSR 29:31 ¬ = 0b010 | SRR1 29:31 ¬ = 0b000) then
>
> Yeah, this should probably be better written as:
>
> 	if (MSR 29:31 == 0b010 | SRR1 29:31 == 0b000) then

If the original with the 'NOT SIGN' was correct, then this
version can't be correct. Or do you suspect that the "original"
was corrupted somehow?

> In the meantime, I'm already preparing a patch series addressing
> the issues inside documentation, using some scripting to avoid
> manual mistakes:
>
> 	https://git.linuxtv.org/mchehab/experimental.git/log/?h=fix_utf8
>
> (patch series is not 100% yet... some adjustments are still
> needed in some places).

Thanks for digging into this and providing fixes.

--
~Randy
* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-08 15:55 ` Randy Dunlap
@ 2021-05-08 17:09 ` Michal Suchánek
  2021-05-08 17:46 ` Randy Dunlap
  2021-05-10  8:17 ` Mauro Carvalho Chehab
  1 sibling, 1 reply; 41+ messages in thread
From: Michal Suchánek @ 2021-05-08 17:09 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Mauro Carvalho Chehab, Matthew Wilcox, Markus Heiser, linux-doc, Jonathan Corbet

On Sat, May 08, 2021 at 08:55:11AM -0700, Randy Dunlap wrote:
> Hi Mauro,
>
> On 5/8/21 7:41 AM, Mauro Carvalho Chehab wrote:
> > On Sat, 8 May 2021 12:41:57 +0200
> > Michal Suchánek <msuchanek@suse.de> wrote:
> >
> >> On Sat, May 08, 2021 at 11:22:05AM +0200, Mauro Carvalho Chehab wrote:
> >>> Based on my script results, we have those groups:
> >>> [...]
> >>
> >> Use of ¬ is also very dubious in documentation (in fonts it is understandable):
> >> Documentation/ABI/obsolete/sysfs-kernel-fadump_registered:This ABI is renamed and moved to a new location /sys/kernel/fadump/registered.¬
> >> Documentation/ABI/obsolete/sysfs-kernel-fadump_release_mem:This ABI is renamed and moved to a new location /sys/kernel/fadump/release_mem.¬
> >
> >> Documentation/powerpc/transactional_memory.rst: if (MSR 29:31 ¬ = 0b010 | SRR1 29:31 ¬ = 0b000) then
> >
> > Yeah, this should probably be better written as:
> >
> > 	if (MSR 29:31 == 0b010 | SRR1 29:31 == 0b000) then
>
> If the original with the 'NOT SIGN' was correct, then this
> version can't be correct. Or do you suspect that the "original"
> was corrupted somehow?

This does not make sense however you look at it. Using | between logical
expressions ...

It sounds like it is some pseudocode in no language in particular, so
it's hard to tell what it actually means, and the document does not have
enough context to be able to tell. I suppose there is some comment
somewhere in the kernel code that would clarify this - at least what the
bit patterns mean.

Thanks

Michal
* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-08 17:09 ` Michal Suchánek
@ 2021-05-08 17:46 ` Randy Dunlap
  2021-05-10  6:22 ` Mauro Carvalho Chehab
  0 siblings, 1 reply; 41+ messages in thread
From: Randy Dunlap @ 2021-05-08 17:46 UTC (permalink / raw)
  To: Michal Suchánek
  Cc: Mauro Carvalho Chehab, Matthew Wilcox, Markus Heiser, linux-doc, Jonathan Corbet

On 5/8/21 10:09 AM, Michal Suchánek wrote:
> On Sat, May 08, 2021 at 08:55:11AM -0700, Randy Dunlap wrote:
>> Hi Mauro,
>>
>> On 5/8/21 7:41 AM, Mauro Carvalho Chehab wrote:
>>> On Sat, 8 May 2021 12:41:57 +0200
>>> Michal Suchánek <msuchanek@suse.de> wrote:
>>>
>>>> On Sat, May 08, 2021 at 11:22:05AM +0200, Mauro Carvalho Chehab wrote:
>>>>> Based on my script results, we have those groups:
>>>>> [...]
>>>>
>>>> Use of ¬ is also very dubious in documentation (in fonts it is understandable):
>>>> Documentation/ABI/obsolete/sysfs-kernel-fadump_registered:This ABI is renamed and moved to a new location /sys/kernel/fadump/registered.¬
>>>> Documentation/ABI/obsolete/sysfs-kernel-fadump_release_mem:This ABI is renamed and moved to a new location /sys/kernel/fadump/release_mem.¬
>>>
>>>> Documentation/powerpc/transactional_memory.rst: if (MSR 29:31 ¬ = 0b010 | SRR1 29:31 ¬ = 0b000) then
>>>
>>> Yeah, this should probably be better written as:
>>>
>>> 	if (MSR 29:31 == 0b010 | SRR1 29:31 == 0b000) then
>>
>> If the original with the 'NOT SIGN' was correct, then this
>> version can't be correct. Or do you suspect that the "original"
>> was corrupted somehow?
>
> This does not make sense however you look at it. Using | between logical
> expressions ...

To my eyes/brain, it looks like classic (IBM) symbolic logic notation.
In that context, I don't see anything wrong with it.

> It sounds like it is some pseudocode in no language in particular, so
> it's hard to tell what it actually means, and the document does not have
> enough context to be able to tell. I suppose there is some comment
> somewhere in the kernel code that would clarify this - at least what the
> bit patterns mean.

Yeah, I have been looking thru the arch/powerpc/ source code for this,
but I haven't found it yet.

--
~Randy
* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-08 17:46 ` Randy Dunlap
@ 2021-05-10  6:22 ` Mauro Carvalho Chehab
  0 siblings, 0 replies; 41+ messages in thread
From: Mauro Carvalho Chehab @ 2021-05-10 6:22 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Michal Suchánek, Matthew Wilcox, Markus Heiser, linux-doc, Jonathan Corbet

On Sat, 8 May 2021 10:46:46 -0700
Randy Dunlap <rdunlap@infradead.org> wrote:

> On 5/8/21 10:09 AM, Michal Suchánek wrote:
> > On Sat, May 08, 2021 at 08:55:11AM -0700, Randy Dunlap wrote:
> >> Hi Mauro,
> >>
> >> On 5/8/21 7:41 AM, Mauro Carvalho Chehab wrote:
> >>> On Sat, 8 May 2021 12:41:57 +0200
> >>> Michal Suchánek <msuchanek@suse.de> wrote:
> >>>
> >>>> On Sat, May 08, 2021 at 11:22:05AM +0200, Mauro Carvalho Chehab wrote:
> >>>>> Based on my script results, we have those groups:
> >>>>> [...]
> >>>>
> >>>> Use of ¬ is also very dubious in documentation (in fonts it is understandable):
> >>>> Documentation/ABI/obsolete/sysfs-kernel-fadump_registered:This ABI is renamed and moved to a new location /sys/kernel/fadump/registered.¬
> >>>> Documentation/ABI/obsolete/sysfs-kernel-fadump_release_mem:This ABI is renamed and moved to a new location /sys/kernel/fadump/release_mem.¬
> >>>
> >>>> Documentation/powerpc/transactional_memory.rst: if (MSR 29:31 ¬ = 0b010 | SRR1 29:31 ¬ = 0b000) then
> >>>
> >>> Yeah, this should probably be better written as:
> >>>
> >>> 	if (MSR 29:31 == 0b010 | SRR1 29:31 == 0b000) then
> >>
> >> If the original with the 'NOT SIGN' was correct, then this
> >> version can't be correct. Or do you suspect that the "original"
> >> was corrupted somehow?

No, I just misread the expression.

> > This does not make sense however you look at it. Using | between logical
> > expressions ...
>
> To my eyes/brain, it looks like classic (IBM) symbolic logic notation.
> In that context, I don't see anything wrong with it.

In this particular case, I would keep it as-is, with the UTF-8 char in it.

I mean, it might be converted to some other symbolic logic notation, but
"MSR 29:31" and "SRR1 29:31" aren't valid names in C.

> Yeah, I have been looking thru the arch/powerpc/ source code for this,
> but I haven't found it yet.

The title of the section says that it is part of "h/rfid mtmsrd quirk".
Searching for rfid:

	$ git grep -l rfid arch/powerpc/

shows a lot of asm code. I guess that if the above quirk is still in the
kernel, it is probably somewhere in the assembler part.

So, it sounds to me that converting it into C (or pseudo-C) won't make
it any better.

Thanks,
Mauro
* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256) 2021-05-08 15:55 ` Randy Dunlap 2021-05-08 17:09 ` Michal Suchánek @ 2021-05-10 8:17 ` Mauro Carvalho Chehab 1 sibling, 0 replies; 41+ messages in thread From: Mauro Carvalho Chehab @ 2021-05-10 8:17 UTC (permalink / raw) To: Randy Dunlap Cc: Michal Suchánek, Matthew Wilcox, Markus Heiser, linux-doc, Jonathan Corbet Em Sat, 8 May 2021 08:55:11 -0700 Randy Dunlap <rdunlap@infradead.org> escreveu: > > In the mean time, I'm already preparing a patch series addressing > > the issues inside documentation, using some scripting to avoid > > manual mistakes: > > > > https://git.linuxtv.org/mchehab/experimental.git/log/?h=fix_utf8 > > > > (patch series is not 100% yet... some adjustments are still > > needed on some places). > > > Thanks for digging into this and providing fixes. Just pushed a new version there, rebasing the branch: https://git.linuxtv.org/mchehab/experimental.git/log/?h=fix_utf8 The first tree patches were manually written, in order to address a couple of special cases. I'll be submitting the patches via e-mail later today. 
The remaining ones were generated by a script that searches for
non-ASCII (UTF-8) characters only inside the Documentation .rst and ABI
files and applies this conversion:

    my %char_map = (
        0x2010 => '-',  # HYPHEN
        0xad   => '-',  # SOFT HYPHEN
        0x2013 => '-',  # EN DASH
        0x2014 => '-',  # EM DASH
        0x2018 => "'",  # LEFT SINGLE QUOTATION MARK
        0x2019 => "'",  # RIGHT SINGLE QUOTATION MARK
        0xb4   => "'",  # ACUTE ACCENT
        0x201c => '"',  # LEFT DOUBLE QUOTATION MARK
        0x201d => '"',  # RIGHT DOUBLE QUOTATION MARK
        0x2212 => '-',  # MINUS SIGN
        0x2217 => '*',  # ASTERISK OPERATOR
        0xd7   => 'x',  # MULTIPLICATION SIGN
        0xbb   => '>',  # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
        0xa0   => ' ',  # NO-BREAK SPACE
        0xfeff => '',   # ZERO WIDTH NO-BREAK SPACE
    );

After the conversion, these UTF-8 characters will remain in
Documentation/:

  - U+00a9 ('©'): COPYRIGHT SIGN
  - U+00ac ('¬'): NOT SIGN # only at Documentation/powerpc/transactional_memory.rst
  - U+00ae ('®'): REGISTERED SIGN
  - U+00b0 ('°'): DEGREE SIGN
  - U+00b1 ('±'): PLUS-MINUS SIGN
  - U+00b2 ('²'): SUPERSCRIPT TWO
  - U+00b5 ('µ'): MICRO SIGN
  - U+00b7 ('·'): MIDDLE DOT # See below
  - U+00bd ('½'): VULGAR FRACTION ONE HALF
  - U+00c7 ('Ç'): LATIN CAPITAL LETTER C WITH CEDILLA
  - U+00df ('ß'): LATIN SMALL LETTER SHARP S
  - U+00e1 ('á'): LATIN SMALL LETTER A WITH ACUTE
  - U+00e4 ('ä'): LATIN SMALL LETTER A WITH DIAERESIS
  - U+00e6 ('æ'): LATIN SMALL LETTER AE
  - U+00e7 ('ç'): LATIN SMALL LETTER C WITH CEDILLA
  - U+00e9 ('é'): LATIN SMALL LETTER E WITH ACUTE
  - U+00ea ('ê'): LATIN SMALL LETTER E WITH CIRCUMFLEX
  - U+00eb ('ë'): LATIN SMALL LETTER E WITH DIAERESIS
  - U+00f3 ('ó'): LATIN SMALL LETTER O WITH ACUTE
  - U+00f4 ('ô'): LATIN SMALL LETTER O WITH CIRCUMFLEX
  - U+00f6 ('ö'): LATIN SMALL LETTER O WITH DIAERESIS
  - U+00f8 ('ø'): LATIN SMALL LETTER O WITH STROKE
  - U+00fa ('ú'): LATIN SMALL LETTER U WITH ACUTE
  - U+00fc ('ü'): LATIN SMALL LETTER U WITH DIAERESIS
  - U+00fd ('ý'): LATIN SMALL LETTER Y WITH ACUTE
  - U+011f ('ğ'): LATIN SMALL LETTER G WITH BREVE
  - U+0142 ('ł'): LATIN SMALL LETTER L WITH STROKE
  - U+03bc ('μ'): GREEK SMALL LETTER MU
  - U+2026 ('…'): HORIZONTAL ELLIPSIS
  - U+2122 ('™'): TRADE MARK SIGN
  - U+2191 ('↑'): UPWARDS ARROW
  - U+2192 ('→'): RIGHTWARDS ARROW
  - U+2193 ('↓'): DOWNWARDS ARROW
  - U+2264 ('≤'): LESS-THAN OR EQUAL TO
  - U+2265 ('≥'): GREATER-THAN OR EQUAL TO
  - U+2500 ('─'): BOX DRAWINGS LIGHT HORIZONTAL
  - U+2502 ('│'): BOX DRAWINGS LIGHT VERTICAL
  - U+2514 ('└'): BOX DRAWINGS LIGHT UP AND RIGHT
  - U+251c ('├'): BOX DRAWINGS LIGHT VERTICAL AND RIGHT
  - U+2b0d ('⬍'): UP DOWN BLACK ARROW

For U+00b7 ('·'): MIDDLE DOT, I opted to keep it in a few places:

  - Documentation/devicetree/bindings/clock/qcom,rpmcc.txt

    As this file will some day be converted to yaml, at which point the
    MIDDLE DOT will be removed, I guess it is not worth touching it.

  - Documentation/scheduler/sched-deadline.rst

    There, it is used in a math expression. So, better to keep it.

  - Documentation/devicetree/bindings/media/video-interface-devices.yaml

    There, it is part of an ASCII artwork.

  - translations/zh_CN

    I prefer not to touch it, as it might have some special meaning in
    Simplified Chinese.

Thanks,
Mauro

^ permalink raw reply [flat|nested] 41+ messages in thread
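For readers who want to try the conversion outside the Perl script, the mapping in Mauro's message can be approximated in Python. This is only a sketch under stated assumptions: the table is copied from the `%char_map` quoted in the message, while the function names `to_ascii_punctuation` and `remaining_non_ascii` are hypothetical and do not come from the actual patch series.

```python
# Port of the Perl %char_map from the message: code point -> ASCII replacement.
CHAR_MAP = {
    0x2010: "-",   # HYPHEN
    0x00AD: "-",   # SOFT HYPHEN
    0x2013: "-",   # EN DASH
    0x2014: "-",   # EM DASH
    0x2018: "'",   # LEFT SINGLE QUOTATION MARK
    0x2019: "'",   # RIGHT SINGLE QUOTATION MARK
    0x00B4: "'",   # ACUTE ACCENT
    0x201C: '"',   # LEFT DOUBLE QUOTATION MARK
    0x201D: '"',   # RIGHT DOUBLE QUOTATION MARK
    0x2212: "-",   # MINUS SIGN
    0x2217: "*",   # ASTERISK OPERATOR
    0x00D7: "x",   # MULTIPLICATION SIGN
    0x00BB: ">",   # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
    0x00A0: " ",   # NO-BREAK SPACE
    0xFEFF: "",    # ZERO WIDTH NO-BREAK SPACE (removed)
}


def to_ascii_punctuation(text):
    """Replace the mapped characters; leave every other character untouched."""
    return text.translate(CHAR_MAP)


def remaining_non_ascii(text):
    """Return the sorted, de-duplicated non-ASCII characters left in text, as U+xxxx."""
    return [f"U+{ord(c):04x}" for c in sorted({c for c in text if ord(c) > 0x7F})]


if __name__ == "__main__":
    sample = "3\u00d74 \u2013 \u201cquoted\u201d\u00a0text, 5\u00b5s"
    converted = to_ascii_punctuation(sample)
    print(converted)
    print(remaining_non_ascii(converted))
```

`str.translate` accepts a dictionary keyed by code point, so the table can stay in the same shape as the Perl hash. The second helper mirrors the "characters that will remain" list: anything it reports was deliberately left alone by the mapping.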
* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256) 2021-05-06 17:04 ` Markus Heiser 2021-05-06 17:27 ` Mauro Carvalho Chehab @ 2021-05-06 17:48 ` Michal Suchánek 2021-05-06 17:59 ` Markus Heiser 2021-05-12 6:22 ` Mauro Carvalho Chehab 1 sibling, 2 replies; 41+ messages in thread From: Michal Suchánek @ 2021-05-06 17:48 UTC (permalink / raw) To: Markus Heiser; +Cc: Mauro Carvalho Chehab, linux-doc, Jonathan Corbet On Thu, May 06, 2021 at 07:04:44PM +0200, Markus Heiser wrote: > Am 06.05.21 um 18:46 schrieb Mauro Carvalho Chehab: > > Em Thu, 6 May 2021 17:57:15 +0200 > > Markus Heiser <markus.heiser@darmarit.de> escreveu: > > > > > Am 06.05.21 um 12:39 schrieb Michal Suchánek: > > > > When building HTML documentation I get this output: > > > ... > > > > [ 412s] UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256) > > > ... > > > > > > > > It does not say which input file contains the offending character so I can't tell which file is broken. > > > > > > > > Any idea how to debug? > > > > > > I guess the build host is a very simple container, what does > > > > > > echo $LC_ALL > > > echo $LANG It's actually set to en_US just before the build. > > > > > > prompt? If it is latin, change it to something using utf-8 (I recommend > > > 'en_US.utf8'). > > > > > > A UnicodeEncodeError can occour everywhere where characters are > > > encoded from (internal) unicode to the encoding of the stream. > > > > > > By example: > > > > > > A print or log statement which streams to stdout needs to encode > > > from unicode to stdout's encoding. If there is one unicode symbol > > > which can not encoded to stream's encoding a UnicodeEncodeError > > > is raised. 
> > > > Hi Markus, > > > > It shouldn't matter the builder's locale when building the Kernel > > documentation (or any other documents built from other git trees > > on other open source projects), as the Kernel's *.rpm document charset > > won't change, no matter on what part of the globe it was built. > > > > I vaguely remember about a change we made a couple of years ago > > in order to address this issue. > > Hi Mauro :) > > sure? .. what if the logger wants to log some symbols from the > chines translated parts to stdout and the encoding of stdout is > latin? [ 127s] + cd linux-5.12-next-20210506 [ 127s] + export LANG=en_US [ 127s] + LANG=en_US [ 127s] + mkdir -p html [ 127s] + python3 -c 'print("↑ᛏ个")' [ 127s] ↑ᛏ个 [ 127s] + echo 'print("↑ᛏ个")' [ 127s] + python3 test.py [ 127s] Traceback (most recent call last): [ 127s] File "test.py", line 1, in <module> [ 127s] print("\u2191\u16cf\u4e2a\uf8f9") [ 127s] UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-3: ordinal not in range(256) It certainly does not look like python can print unicode in this environment. It tells me where the problem is, though. Thanks Michal [ 127s] + : [ 127s] + locale [ 128s] LANG=en_US [ 128s] LC_CTYPE="en_US" [ 128s] LC_NUMERIC="en_US" [ 128s] LC_TIME="en_US" [ 128s] LC_COLLATE="en_US" [ 128s] LC_MONETARY="en_US" [ 128s] LC_MESSAGES="en_US" [ 128s] LC_PAPER="en_US" [ 128s] LC_NAME="en_US" [ 128s] LC_ADDRESS="en_US" [ 128s] LC_TELEPHONE="en_US" [ 128s] LC_MEASUREMENT="en_US" [ 128s] LC_IDENTIFICATION="en_US" [ 128s] LC_ALL= [ 128s] + echo LC_ALL= [ 128s] LC_ALL= [ 128s] + echo LANG=en_US [ 128s] LANG=en_US ^ permalink raw reply [flat|nested] 41+ messages in thread
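Markus's explanation (the error is raised wherever text is encoded to the stream's encoding) can be checked without touching the real locale. The following is a minimal sketch, assuming an in-memory stream is an acceptable stand-in for stdout; the helper name `try_write` is hypothetical, and the code points are the first three from the traceback above.

```python
import io


def try_write(text, encoding):
    """Write text to an in-memory text stream that uses the given encoding.

    Returns None on success, or the UnicodeEncodeError that was raised.
    """
    stream = io.TextIOWrapper(io.BytesIO(), encoding=encoding)
    try:
        stream.write(text)
        stream.flush()
    except UnicodeEncodeError as exc:
        return exc
    return None


if __name__ == "__main__":
    text = "\u2191\u16cf\u4e2a"
    # A latin-1 stream cannot represent these characters, a UTF-8 one can.
    print("latin-1:", try_write(text, "latin-1"))
    print("utf-8:  ", try_write(text, "utf-8"))
```

The text is identical in both calls; only the stream's encoding differs, which is the same situation as running the build with `LANG=en_US` versus `LANG=en_US.utf8`.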
* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256) 2021-05-06 17:48 ` Michal Suchánek @ 2021-05-06 17:59 ` Markus Heiser 2021-05-06 18:16 ` Michal Suchánek 2021-05-12 6:22 ` Mauro Carvalho Chehab 1 sibling, 1 reply; 41+ messages in thread From: Markus Heiser @ 2021-05-06 17:59 UTC (permalink / raw) To: Michal Suchánek; +Cc: Mauro Carvalho Chehab, linux-doc, Jonathan Corbet Am 06.05.21 um 19:48 schrieb Michal Suchánek: > On Thu, May 06, 2021 at 07:04:44PM +0200, Markus Heiser wrote: >> Am 06.05.21 um 18:46 schrieb Mauro Carvalho Chehab: >>> Em Thu, 6 May 2021 17:57:15 +0200 >>> Markus Heiser <markus.heiser@darmarit.de> escreveu: >>> >>>> Am 06.05.21 um 12:39 schrieb Michal Suchánek: >>>>> When building HTML documentation I get this output: >>>> ... >>>>> [ 412s] UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256) >>>> ... >>>>> >>>>> It does not say which input file contains the offending character so I can't tell which file is broken. >>>>> >>>>> Any idea how to debug? >>>> >>>> I guess the build host is a very simple container, what does >>>> >>>> echo $LC_ALL >>>> echo $LANG > It's actually set to en_US just before the build. >>>> >>>> prompt? If it is latin, change it to something using utf-8 (I recommend >>>> 'en_US.utf8'). >>>> >>>> A UnicodeEncodeError can occour everywhere where characters are >>>> encoded from (internal) unicode to the encoding of the stream. >>>> >>>> By example: >>>> >>>> A print or log statement which streams to stdout needs to encode >>>> from unicode to stdout's encoding. If there is one unicode symbol >>>> which can not encoded to stream's encoding a UnicodeEncodeError >>>> is raised. 
>>> >>> Hi Markus, >>> >>> It shouldn't matter the builder's locale when building the Kernel >>> documentation (or any other documents built from other git trees >>> on other open source projects), as the Kernel's *.rpm document charset >>> won't change, no matter on what part of the globe it was built. >>> >>> I vaguely remember about a change we made a couple of years ago >>> in order to address this issue. >> >> Hi Mauro :) >> >> sure? .. what if the logger wants to log some symbols from the >> chines translated parts to stdout and the encoding of stdout is >> latin? > > [ 127s] + cd linux-5.12-next-20210506 > [ 127s] + export LANG=en_US > [ 127s] + LANG=en_US > [ 127s] + mkdir -p html > [ 127s] + python3 -c 'print("↑ᛏ个")' > [ 127s] ↑ᛏ个 > [ 127s] + echo 'print("↑ᛏ个")' > [ 127s] + python3 test.py > [ 127s] Traceback (most recent call last): > [ 127s] File "test.py", line 1, in <module> > [ 127s] print("\u2191\u16cf\u4e2a\uf8f9") > [ 127s] UnicodeEncodeError: 'latin-1' codec can't encode characters in > position 0-3: ordinal not in range(256) > > It certainly does not look like python can print unicode in this > environment. It tells me where the problem is, though. Can't speak for the image of your container, may you need to install some utf-8 packages / but in most cases export LANG=en_US.UTF-8 export LC_ALL=en_US.UTF-8 should help. -- Markus -- > > Thanks > > Michal > > [ 127s] + : > [ 127s] + locale > [ 128s] LANG=en_US > [ 128s] LC_CTYPE="en_US" > [ 128s] LC_NUMERIC="en_US" > [ 128s] LC_TIME="en_US" > [ 128s] LC_COLLATE="en_US" > [ 128s] LC_MONETARY="en_US" > [ 128s] LC_MESSAGES="en_US" > [ 128s] LC_PAPER="en_US" > [ 128s] LC_NAME="en_US" > [ 128s] LC_ADDRESS="en_US" > [ 128s] LC_TELEPHONE="en_US" > [ 128s] LC_MEASUREMENT="en_US" > [ 128s] LC_IDENTIFICATION="en_US" > [ 128s] LC_ALL= > [ 128s] + echo LC_ALL= > [ 128s] LC_ALL= > [ 128s] + echo LANG=en_US > [ 128s] LANG=en_US > ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256) 2021-05-06 17:59 ` Markus Heiser @ 2021-05-06 18:16 ` Michal Suchánek 0 siblings, 0 replies; 41+ messages in thread From: Michal Suchánek @ 2021-05-06 18:16 UTC (permalink / raw) To: Markus Heiser; +Cc: Mauro Carvalho Chehab, linux-doc, Jonathan Corbet On Thu, May 06, 2021 at 07:59:18PM +0200, Markus Heiser wrote: > Am 06.05.21 um 19:48 schrieb Michal Suchánek: > > On Thu, May 06, 2021 at 07:04:44PM +0200, Markus Heiser wrote: > > > Am 06.05.21 um 18:46 schrieb Mauro Carvalho Chehab: > > > > Em Thu, 6 May 2021 17:57:15 +0200 > > > > Markus Heiser <markus.heiser@darmarit.de> escreveu: > > > > > > > > > Am 06.05.21 um 12:39 schrieb Michal Suchánek: > > > > > > When building HTML documentation I get this output: > > > > > ... > > > > > > [ 412s] UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256) > > > > > ... > > > > > > > > > > > > It does not say which input file contains the offending character so I can't tell which file is broken. > > > > > > > > > > > > Any idea how to debug? > > > > > > > > > > I guess the build host is a very simple container, what does > > > > > > > > > > echo $LC_ALL > > > > > echo $LANG > > It's actually set to en_US just before the build. > > > > > > > > > > prompt? If it is latin, change it to something using utf-8 (I recommend > > > > > 'en_US.utf8'). > > > > > > > > > > A UnicodeEncodeError can occour everywhere where characters are > > > > > encoded from (internal) unicode to the encoding of the stream. > > > > > > > > > > By example: > > > > > > > > > > A print or log statement which streams to stdout needs to encode > > > > > from unicode to stdout's encoding. If there is one unicode symbol > > > > > which can not encoded to stream's encoding a UnicodeEncodeError > > > > > is raised. 
> > > > > > > > Hi Markus, > > > > > > > > It shouldn't matter the builder's locale when building the Kernel > > > > documentation (or any other documents built from other git trees > > > > on other open source projects), as the Kernel's *.rpm document charset > > > > won't change, no matter on what part of the globe it was built. > > > > > > > > I vaguely remember about a change we made a couple of years ago > > > > in order to address this issue. > > > > > > Hi Mauro :) > > > > > > sure? .. what if the logger wants to log some symbols from the > > > chines translated parts to stdout and the encoding of stdout is > > > latin? > > > > [ 127s] + cd linux-5.12-next-20210506 > > [ 127s] + export LANG=en_US > > [ 127s] + LANG=en_US > > [ 127s] + mkdir -p html > > [ 127s] + python3 -c 'print("↑ᛏ个")' > > [ 127s] ↑ᛏ个 > > [ 127s] + echo 'print("↑ᛏ个")' > > [ 127s] + python3 test.py > > [ 127s] Traceback (most recent call last): > > [ 127s] File "test.py", line 1, in <module> > > [ 127s] print("\u2191\u16cf\u4e2a\uf8f9") > > [ 127s] UnicodeEncodeError: 'latin-1' codec can't encode characters in > > position 0-3: ordinal not in range(256) > > > > It certainly does not look like python can print unicode in this > > environment. It tells me where the problem is, though. > > Can't speak for the image of your container, may you need to install > some utf-8 packages / but in most cases > > export LANG=en_US.UTF-8 Yes, in this case export LANG=en_US.utf8 is an easy workaround. The UTF-8 locale is already included in the build environment by default. Thanks Michal ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256) 2021-05-06 17:48 ` Michal Suchánek 2021-05-06 17:59 ` Markus Heiser @ 2021-05-12 6:22 ` Mauro Carvalho Chehab 2021-05-12 7:01 ` Michal Suchánek 1 sibling, 1 reply; 41+ messages in thread From: Mauro Carvalho Chehab @ 2021-05-12 6:22 UTC (permalink / raw) To: Michal Suchánek; +Cc: Markus Heiser, linux-doc, Jonathan Corbet Hi Michal, Em Thu, 6 May 2021 19:48:49 +0200 Michal Suchánek <msuchanek@suse.de> escreveu: > [ 127s] + : > [ 127s] + locale > [ 128s] LANG=en_US > [ 128s] LC_CTYPE="en_US" > [ 128s] LC_NUMERIC="en_US" > [ 128s] LC_TIME="en_US" > [ 128s] LC_COLLATE="en_US" > [ 128s] LC_MONETARY="en_US" > [ 128s] LC_MESSAGES="en_US" > [ 128s] LC_PAPER="en_US" > [ 128s] LC_NAME="en_US" > [ 128s] LC_ADDRESS="en_US" > [ 128s] LC_TELEPHONE="en_US" > [ 128s] LC_MEASUREMENT="en_US" > [ 128s] LC_IDENTIFICATION="en_US" > [ 128s] LC_ALL= > [ 128s] + echo LC_ALL= > [ 128s] LC_ALL= > [ 128s] + echo LANG=en_US > [ 128s] LANG=en_US Where those the locale settings that you used when the build failed? I tried to reproduce the bug here with, disabling the parallel run (as it masks the real error) with both: $ for i in LANG LC_ALL LC_ADDRESS LC_IDENTIFICATION LC_MEASUREMENT LC_MONETARY LC_NAME LC_NUMERIC LC_PAPER LC_TELEPHONE LC_TIME; do echo $i=en_US; done $ make cleandocs && make SPHINXOPTS=-j1 htmldocs (this one caused lots of warnings on Debian, due to the settings at /etc/locale.gen) and: $ for i in LANG LC_ALL LC_ADDRESS LC_IDENTIFICATION LC_MEASUREMENT LC_MONETARY LC_NAME LC_NUMERIC LC_PAPER LC_TELEPHONE LC_TIME; do echo $i=en_US.ISO-8859-1; done $ make cleandocs && make SPHINXOPTS=-j1 htmldocs Without any success. Could you please provide more details about the build VM and the git changeset that caused the issue? Thanks, Mauro ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256) 2021-05-12 6:22 ` Mauro Carvalho Chehab @ 2021-05-12 7:01 ` Michal Suchánek 2021-05-12 7:18 ` Markus Heiser 2021-05-12 7:59 ` Mauro Carvalho Chehab 0 siblings, 2 replies; 41+ messages in thread From: Michal Suchánek @ 2021-05-12 7:01 UTC (permalink / raw) To: Mauro Carvalho Chehab; +Cc: Markus Heiser, linux-doc, Jonathan Corbet On Wed, May 12, 2021 at 08:22:38AM +0200, Mauro Carvalho Chehab wrote: > Hi Michal, > > Em Thu, 6 May 2021 19:48:49 +0200 > Michal Suchánek <msuchanek@suse.de> escreveu: > > > [ 127s] + : > > [ 127s] + locale > > [ 128s] LANG=en_US > > [ 128s] LC_CTYPE="en_US" > > [ 128s] LC_NUMERIC="en_US" > > [ 128s] LC_TIME="en_US" > > [ 128s] LC_COLLATE="en_US" > > [ 128s] LC_MONETARY="en_US" > > [ 128s] LC_MESSAGES="en_US" > > [ 128s] LC_PAPER="en_US" > > [ 128s] LC_NAME="en_US" > > [ 128s] LC_ADDRESS="en_US" > > [ 128s] LC_TELEPHONE="en_US" > > [ 128s] LC_MEASUREMENT="en_US" > > [ 128s] LC_IDENTIFICATION="en_US" > > [ 128s] LC_ALL= > > [ 128s] + echo LC_ALL= > > [ 128s] LC_ALL= > > [ 128s] + echo LANG=en_US > > [ 128s] LANG=en_US > > Where those the locale settings that you used when the build > failed? > > I tried to reproduce the bug here with, disabling the parallel run (as > it masks the real error) with both: > > $ for i in LANG LC_ALL LC_ADDRESS LC_IDENTIFICATION LC_MEASUREMENT LC_MONETARY LC_NAME LC_NUMERIC LC_PAPER LC_TELEPHONE LC_TIME; do echo $i=en_US; done > $ make cleandocs && make SPHINXOPTS=-j1 htmldocs > > (this one caused lots of warnings on Debian, due to the > settings at /etc/locale.gen) > > and: > > $ for i in LANG LC_ALL LC_ADDRESS LC_IDENTIFICATION LC_MEASUREMENT LC_MONETARY LC_NAME LC_NUMERIC LC_PAPER LC_TELEPHONE LC_TIME; do echo $i=en_US.ISO-8859-1; done > $ make cleandocs && make SPHINXOPTS=-j1 htmldocs > > Without any success. 
> > Could you please provide more details about the build VM and the git > changeset that caused the issue? It depends on what character set your en_US locale implements. ~> cat test.py print("↑ᛏ个") ~> locale LANG=en_US.utf8 LC_CTYPE="en_US.utf8" LC_NUMERIC="en_US.utf8" LC_TIME="en_US.utf8" LC_COLLATE="en_US.utf8" LC_MONETARY="en_US.utf8" LC_MESSAGES="en_US.utf8" LC_PAPER="en_US.utf8" LC_NAME="en_US.utf8" LC_ADDRESS="en_US.utf8" LC_TELEPHONE="en_US.utf8" LC_MEASUREMENT="en_US.utf8" LC_IDENTIFICATION="en_US.utf8" LC_ALL= ~> python3 test.py ↑ᛏ个 ~> LANG=en_US python3 test.py Traceback (most recent call last): File "test.py", line 1, in <module> print("\u2191\u16cf\u4e2a\uf8f9") UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-3: ordinal not in range(256) ~> LANG=C python3 test.py ↑ᛏ个 You can easily test if your python version can print UTF-8 in a specific locale, and if necessary define an ISO-8859-1 locale for testing. On some systems the situation is reversed - C locale is ASCII only, and en_US is UTF-8, and it is possible that some systems don't ship an 8bit locale at all. Thanks Michal ^ permalink raw reply [flat|nested] 41+ messages in thread
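Michal's point that the result depends on what each locale name maps to on a given system can be probed directly. The sketch below is an illustration rather than part of the thread: it starts a child interpreter with a chosen `LANG` and reports the stdout encoding that child selects. The function name `stdout_encoding_for` is hypothetical, and the output varies between systems, so none is shown.

```python
import os
import subprocess
import sys


def stdout_encoding_for(lang):
    """Return the stdout encoding a child Python selects when LANG is set to lang."""
    env = dict(os.environ)
    # Drop variables that would override LANG or force a particular encoding.
    for name in ("LC_ALL", "LC_CTYPE", "PYTHONIOENCODING", "PYTHONUTF8"):
        env.pop(name, None)
    env["LANG"] = lang
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    for lang in ("C", "en_US", "en_US.utf8"):
        print(lang, "->", stdout_encoding_for(lang))
```

On a system where `en_US` is an ISO-8859-1 locale, the middle entry would be expected to report a Latin-1 encoding. The `LANG=C` case printing fine in the session above is likely due to the special handling of the C locale in Python 3.7 and later (PEP 538 locale coercion), though that is an inference and not something stated in the thread.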
* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256) 2021-05-12 7:01 ` Michal Suchánek @ 2021-05-12 7:18 ` Markus Heiser 2021-05-12 7:37 ` Markus Heiser 2021-05-12 7:59 ` Mauro Carvalho Chehab 1 sibling, 1 reply; 41+ messages in thread From: Markus Heiser @ 2021-05-12 7:18 UTC (permalink / raw) To: Michal Suchánek, Mauro Carvalho Chehab; +Cc: linux-doc, Jonathan Corbet Am 12.05.21 um 09:01 schrieb Michal Suchánek: > On Wed, May 12, 2021 at 08:22:38AM +0200, Mauro Carvalho Chehab wrote: >> Hi Michal, >> >> Em Thu, 6 May 2021 19:48:49 +0200 >> Michal Suchánek <msuchanek@suse.de> escreveu: >> >>> [ 127s] + : >>> [ 127s] + locale >>> [ 128s] LANG=en_US >>> [ 128s] LC_CTYPE="en_US" >>> [ 128s] LC_NUMERIC="en_US" >>> [ 128s] LC_TIME="en_US" >>> [ 128s] LC_COLLATE="en_US" >>> [ 128s] LC_MONETARY="en_US" >>> [ 128s] LC_MESSAGES="en_US" >>> [ 128s] LC_PAPER="en_US" >>> [ 128s] LC_NAME="en_US" >>> [ 128s] LC_ADDRESS="en_US" >>> [ 128s] LC_TELEPHONE="en_US" >>> [ 128s] LC_MEASUREMENT="en_US" >>> [ 128s] LC_IDENTIFICATION="en_US" >>> [ 128s] LC_ALL= >>> [ 128s] + echo LC_ALL= >>> [ 128s] LC_ALL= >>> [ 128s] + echo LANG=en_US >>> [ 128s] LANG=en_US >> >> Where those the locale settings that you used when the build >> failed? >> >> I tried to reproduce the bug here with, disabling the parallel run (as >> it masks the real error) with both: >> >> $ for i in LANG LC_ALL LC_ADDRESS LC_IDENTIFICATION LC_MEASUREMENT LC_MONETARY LC_NAME LC_NUMERIC LC_PAPER LC_TELEPHONE LC_TIME; do echo $i=en_US; done >> $ make cleandocs && make SPHINXOPTS=-j1 htmldocs >> >> (this one caused lots of warnings on Debian, due to the >> settings at /etc/locale.gen) >> >> and: >> >> $ for i in LANG LC_ALL LC_ADDRESS LC_IDENTIFICATION LC_MEASUREMENT LC_MONETARY LC_NAME LC_NUMERIC LC_PAPER LC_TELEPHONE LC_TIME; do echo $i=en_US.ISO-8859-1; done >> $ make cleandocs && make SPHINXOPTS=-j1 htmldocs >> >> Without any success. 
>> >> Could you please provide more details about the build VM and the git >> changeset that caused the issue? > > It depends on what character set your en_US locale implements. > > ~> cat test.py > print("↑ᛏ个") > ~> locale > LANG=en_US.utf8 > LC_CTYPE="en_US.utf8" > LC_NUMERIC="en_US.utf8" > LC_TIME="en_US.utf8" > LC_COLLATE="en_US.utf8" > LC_MONETARY="en_US.utf8" > LC_MESSAGES="en_US.utf8" > LC_PAPER="en_US.utf8" > LC_NAME="en_US.utf8" > LC_ADDRESS="en_US.utf8" > LC_TELEPHONE="en_US.utf8" > LC_MEASUREMENT="en_US.utf8" > LC_IDENTIFICATION="en_US.utf8" > LC_ALL= > ~> python3 test.py > ↑ᛏ个 > ~> LANG=en_US python3 test.py > Traceback (most recent call last): > File "test.py", line 1, in <module> > print("\u2191\u16cf\u4e2a\uf8f9") > UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-3: ordinal not in range(256) > ~> LANG=C python3 test.py > ↑ᛏ个 > > You can easily test if your python version can print UTF-8 in a specific > locale, and if necessary define an ISO-8859-1 locale for testing. > On some systems the situation is reversed - C locale is ASCII only, and > en_US is UTF-8, and it is possible that some systems don't ship an 8bit > locale at all. Thats my problem :-) On my system (terminal) I can't reproduce the issue since stdout always support utf-8, no matter what LANG environment is set. $ LANG=en_US.ISO-8859-1 python3 ... >>> import sys >>> print (sys.stdout.encoding) utf-8 >>> import locale >>> locale.getdefaultlocale() ('en_US', 'UTF-8') I'm not familar with POSIX's locale [1] in detail and in particular on my system (gnome terminal), I can't say how I can change to 8 bit coding to reproduce the issue. [1] https://docs.python.org/3/library/locale.html -- Markus -- ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
  2021-05-12 7:18 ` Markus Heiser
@ 2021-05-12 7:37 ` Markus Heiser
  0 siblings, 0 replies; 41+ messages in thread
From: Markus Heiser @ 2021-05-12 7:37 UTC (permalink / raw)
To: Michal Suchánek, Mauro Carvalho Chehab; +Cc: linux-doc, Jonathan Corbet

On 12.05.21 at 09:18, Markus Heiser wrote:
> It depends on what character set your en_US locale implements.
>
> ~> cat test.py
> print("↑ᛏ个")

At least the last character is a non-printable character. For
characters outside the 8-bit range I recommend using Python's unicode
escape notation (\u):

    >>> print('The currency in EU is \u20AC')
    The currency in EU is €

-- Markus --

^ permalink raw reply [flat|nested] 41+ messages in thread
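Markus's remark about a non-printable character can be confirmed from the traceback quoted earlier in the thread, which shows four escapes (`\u2191\u16cf\u4e2a\uf8f9`) although only three glyphs are visible. A minimal sketch using the standard `unicodedata` module lists what each code point is; the helper name `describe` is hypothetical.

```python
import unicodedata


def describe(text):
    """Return (code point, general category, name) for each character in text."""
    return [
        (
            f"U+{ord(ch):04X}",
            unicodedata.category(ch),
            unicodedata.name(ch, "<unnamed>"),
        )
        for ch in text
    ]


if __name__ == "__main__":
    # Code points taken from the traceback in the thread.
    for row in describe("\u2191\u16cf\u4e2a\uf8f9"):
        print(*row)
```

The last entry falls in the general category `Co` (private use) and has no name, which is why nothing is rendered for it in the terminal output. Writing test strings with `\u` escapes, as suggested above, keeps such characters visible in the source.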
* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256) 2021-05-12 7:01 ` Michal Suchánek 2021-05-12 7:18 ` Markus Heiser @ 2021-05-12 7:59 ` Mauro Carvalho Chehab 2021-05-17 13:10 ` Michal Suchánek 1 sibling, 1 reply; 41+ messages in thread From: Mauro Carvalho Chehab @ 2021-05-12 7:59 UTC (permalink / raw) To: Michal Suchánek; +Cc: Markus Heiser, linux-doc, Jonathan Corbet Em Wed, 12 May 2021 09:01:57 +0200 Michal Suchánek <msuchanek@suse.de> escreveu: > On Wed, May 12, 2021 at 08:22:38AM +0200, Mauro Carvalho Chehab wrote: > > Hi Michal, > > > > Em Thu, 6 May 2021 19:48:49 +0200 > > Michal Suchánek <msuchanek@suse.de> escreveu: > > > > > [ 127s] + : > > > [ 127s] + locale > > > [ 128s] LANG=en_US > > > [ 128s] LC_CTYPE="en_US" > > > [ 128s] LC_NUMERIC="en_US" > > > [ 128s] LC_TIME="en_US" > > > [ 128s] LC_COLLATE="en_US" > > > [ 128s] LC_MONETARY="en_US" > > > [ 128s] LC_MESSAGES="en_US" > > > [ 128s] LC_PAPER="en_US" > > > [ 128s] LC_NAME="en_US" > > > [ 128s] LC_ADDRESS="en_US" > > > [ 128s] LC_TELEPHONE="en_US" > > > [ 128s] LC_MEASUREMENT="en_US" > > > [ 128s] LC_IDENTIFICATION="en_US" > > > [ 128s] LC_ALL= > > > [ 128s] + echo LC_ALL= > > > [ 128s] LC_ALL= > > > [ 128s] + echo LANG=en_US > > > [ 128s] LANG=en_US > > > > Where those the locale settings that you used when the build > > failed? 
> > > > I tried to reproduce the bug here with, disabling the parallel run (as > > it masks the real error) with both: > > > > $ for i in LANG LC_ALL LC_ADDRESS LC_IDENTIFICATION LC_MEASUREMENT LC_MONETARY LC_NAME LC_NUMERIC LC_PAPER LC_TELEPHONE LC_TIME; do echo $i=en_US; done > > $ make cleandocs && make SPHINXOPTS=-j1 htmldocs > > > > (this one caused lots of warnings on Debian, due to the > > settings at /etc/locale.gen) > > > > and: > > > > $ for i in LANG LC_ALL LC_ADDRESS LC_IDENTIFICATION LC_MEASUREMENT LC_MONETARY LC_NAME LC_NUMERIC LC_PAPER LC_TELEPHONE LC_TIME; do echo $i=en_US.ISO-8859-1; done > > $ make cleandocs && make SPHINXOPTS=-j1 htmldocs > > > > Without any success. > > > > Could you please provide more details about the build VM and the git > > changeset that caused the issue? > > It depends on what character set your en_US locale implements. > > ~> cat test.py > print("↑ᛏ个") > ~> locale > LANG=en_US.utf8 > LC_CTYPE="en_US.utf8" > LC_NUMERIC="en_US.utf8" > LC_TIME="en_US.utf8" > LC_COLLATE="en_US.utf8" > LC_MONETARY="en_US.utf8" > LC_MESSAGES="en_US.utf8" > LC_PAPER="en_US.utf8" > LC_NAME="en_US.utf8" > LC_ADDRESS="en_US.utf8" > LC_TELEPHONE="en_US.utf8" > LC_MEASUREMENT="en_US.utf8" > LC_IDENTIFICATION="en_US.utf8" > LC_ALL= > ~> python3 test.py > ↑ᛏ个 > ~> LANG=en_US python3 test.py > Traceback (most recent call last): > File "test.py", line 1, in <module> > print("\u2191\u16cf\u4e2a\uf8f9") > UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-3: ordinal not in range(256) > ~> LANG=C python3 test.py > ↑ᛏ个 > This is working as expected on my test machine: $ LANG=en_US.utf8 python3 test.py ↑ᛏ个 $ LANG=en_US python3 test.py Traceback (most recent call last): File "test.py", line 1, in <module> print("\u2191\u16cf\u4e2a\uf8f9") UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-3: ordinal not in range(256) Yet, running: $ . 
/devel/v4l/docs/sphinx_3.3.1/bin/activate make cleandocs && LANG=en_US make SPHINXOPTS=-j1 htmldocs Doesn't produce any UnicodeEncodeError errors. See, here I'm testing it with Sphinx version 3.3.1, on Ubuntu 20.04, using changeset 9f4ad9e425a1 Linux 5.12. Also, both UTF8 and iso8859-1 are on this machine's locale: $ more /etc/locale.gen |grep -v ^# de_DE.UTF-8 UTF-8 en_US ISO-8859-1 en_US.UTF-8 UTF-8 (On Debian/Ubuntu, python and other tools complain a lot if the used locale is not at /etc/locale.gen) Maybe you're using a different Sphinx version, or maybe the distro on your VM is using has different locales installed on it or some other different packages. Thanks, Mauro ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256) 2021-05-12 7:59 ` Mauro Carvalho Chehab @ 2021-05-17 13:10 ` Michal Suchánek 0 siblings, 0 replies; 41+ messages in thread From: Michal Suchánek @ 2021-05-17 13:10 UTC (permalink / raw) To: Mauro Carvalho Chehab; +Cc: Markus Heiser, linux-doc, Jonathan Corbet On Wed, May 12, 2021 at 09:59:31AM +0200, Mauro Carvalho Chehab wrote: > Em Wed, 12 May 2021 09:01:57 +0200 > Michal Suchánek <msuchanek@suse.de> escreveu: > > > On Wed, May 12, 2021 at 08:22:38AM +0200, Mauro Carvalho Chehab wrote: > > > Hi Michal, > > > > > > Em Thu, 6 May 2021 19:48:49 +0200 > > > Michal Suchánek <msuchanek@suse.de> escreveu: > > > > > > > [ 127s] + : > > > > [ 127s] + locale > > > > [ 128s] LANG=en_US > > > > [ 128s] LC_CTYPE="en_US" > > > > [ 128s] LC_NUMERIC="en_US" > > > > [ 128s] LC_TIME="en_US" > > > > [ 128s] LC_COLLATE="en_US" > > > > [ 128s] LC_MONETARY="en_US" > > > > [ 128s] LC_MESSAGES="en_US" > > > > [ 128s] LC_PAPER="en_US" > > > > [ 128s] LC_NAME="en_US" > > > > [ 128s] LC_ADDRESS="en_US" > > > > [ 128s] LC_TELEPHONE="en_US" > > > > [ 128s] LC_MEASUREMENT="en_US" > > > > [ 128s] LC_IDENTIFICATION="en_US" > > > > [ 128s] LC_ALL= > > > > [ 128s] + echo LC_ALL= > > > > [ 128s] LC_ALL= > > > > [ 128s] + echo LANG=en_US > > > > [ 128s] LANG=en_US > > > > > > Where those the locale settings that you used when the build > > > failed? 
> > > > > > I tried to reproduce the bug here with, disabling the parallel run (as > > > it masks the real error) with both: > > > > > > $ for i in LANG LC_ALL LC_ADDRESS LC_IDENTIFICATION LC_MEASUREMENT LC_MONETARY LC_NAME LC_NUMERIC LC_PAPER LC_TELEPHONE LC_TIME; do echo $i=en_US; done > > > $ make cleandocs && make SPHINXOPTS=-j1 htmldocs > > > > > > (this one caused lots of warnings on Debian, due to the > > > settings at /etc/locale.gen) > > > > > > and: > > > > > > $ for i in LANG LC_ALL LC_ADDRESS LC_IDENTIFICATION LC_MEASUREMENT LC_MONETARY LC_NAME LC_NUMERIC LC_PAPER LC_TELEPHONE LC_TIME; do echo $i=en_US.ISO-8859-1; done > > > $ make cleandocs && make SPHINXOPTS=-j1 htmldocs > > > > > > Without any success. > > > > > > Could you please provide more details about the build VM and the git > > > changeset that caused the issue? > > > > It depends on what character set your en_US locale implements. > > > > ~> cat test.py > > print("↑ᛏ个") > > ~> locale > > LANG=en_US.utf8 > > LC_CTYPE="en_US.utf8" > > LC_NUMERIC="en_US.utf8" > > LC_TIME="en_US.utf8" > > LC_COLLATE="en_US.utf8" > > LC_MONETARY="en_US.utf8" > > LC_MESSAGES="en_US.utf8" > > LC_PAPER="en_US.utf8" > > LC_NAME="en_US.utf8" > > LC_ADDRESS="en_US.utf8" > > LC_TELEPHONE="en_US.utf8" > > LC_MEASUREMENT="en_US.utf8" > > LC_IDENTIFICATION="en_US.utf8" > > LC_ALL= > > ~> python3 test.py > > ↑ᛏ个 > > ~> LANG=en_US python3 test.py > > Traceback (most recent call last): > > File "test.py", line 1, in <module> > > print("\u2191\u16cf\u4e2a\uf8f9") > > UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-3: ordinal not in range(256) > > ~> LANG=C python3 test.py > > ↑ᛏ个 > > > > This is working as expected on my test machine: > > $ LANG=en_US.utf8 python3 test.py > ↑ᛏ个 > $ LANG=en_US python3 test.py > Traceback (most recent call last): > File "test.py", line 1, in <module> > print("\u2191\u16cf\u4e2a\uf8f9") > UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-3: 
ordinal not in range(256) > > Yet, running: > > $ . /devel/v4l/docs/sphinx_3.3.1/bin/activate > make cleandocs && LANG=en_US make SPHINXOPTS=-j1 htmldocs > > Doesn't produce any UnicodeEncodeError errors. > > See, here I'm testing it with Sphinx version 3.3.1, on Ubuntu 20.04, > using changeset 9f4ad9e425a1 Linux 5.12. Also, both UTF8 and iso8859-1 > are on this machine's locale: > > $ more /etc/locale.gen |grep -v ^# > de_DE.UTF-8 UTF-8 > en_US ISO-8859-1 > en_US.UTF-8 UTF-8 > > (On Debian/Ubuntu, python and other tools complain a lot if the used > locale is not at /etc/locale.gen) > > Maybe you're using a different Sphinx version, or maybe the distro > on your VM is using has different locales installed on it or some > other different packages. I am using these: [ 14s] [287/464] cumulate python38-sphinxcontrib-websupport-1.2.4-1.3 [ 14s] [323/464] cumulate python38-Sphinx2-2.3.1-4.1 [ 14s] [324/464] cumulate python38-sphinx_rtd_theme-0.5.2-1.1 [ 14s] [325/464] cumulate python38-sphinxcontrib-applehelp-1.0.2-1.4 [ 14s] [326/464] cumulate python38-sphinxcontrib-devhelp-1.0.2-1.4 [ 14s] [327/464] cumulate python38-sphinxcontrib-htmlhelp-1.0.3-1.4 [ 14s] [328/464] cumulate python38-sphinxcontrib-jsmath-1.0.1-2.5 [ 14s] [329/464] cumulate python38-sphinxcontrib-qthelp-1.0.3-1.4 [ 14s] [330/464] cumulate python38-sphinxcontrib-serializinghtml-1.1.4-1.4 [ 455s] Sphinx parallel build error: [ 455s] UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256) [ 467s] make[2]: *** [../Documentation/Makefile:91: htmldocs] Error 2 [ 467s] make[1]: *** [/home/abuild/rpmbuild/BUILD/kernel-docs-5.13~rc1.next.20210514/linux-5.13-rc1-next-20210514/Makefile:1784: htmldocs] Error 2 [ 467s] make[1]: Leaving directory '/home/abuild/rpmbuild/BUILD/kernel-docs-5.13~rc1.next.20210514/linux-5.13-rc1-next-20210514/html' [ 467s] make: *** [Makefile:222: __sub-make] Error 2 Thanks Michal ^ permalink raw reply [flat|nested] 41+ messages in thread