* [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols
@ 2021-05-12 12:50 Mauro Carvalho Chehab
2021-05-12 12:50 ` [PATCH v2 32/40] docs: gpu: " Mauro Carvalho Chehab
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: Mauro Carvalho Chehab @ 2021-05-12 12:50 UTC (permalink / raw)
To: Linux Doc Mailing List
Cc: alsa-devel, kvm, linux-iio, linux-pci, dri-devel, keyrings,
linux-sgx, Jonathan Corbet, Mauro Carvalho Chehab, linux-acpi,
Mali DP Maintainers, linux-input, intel-wired-lan, linux-ext4,
intel-gfx, linux-media, linux-pm, coresight, rcu, mjpeg-users,
linux-arm-kernel, linux-edac, linux-hwmon, netdev, linux-usb,
linux-kernel, linux-f2fs-devel, linux-rdma, linux-integrity
This series is basically a cleanup after all those years of converting
files to ReST.
During the conversion period, several tools like LaTeX, pandoc, DocBook
and some specially-written scripts were used in order to convert
existing documents.
Such conversion tools - plus some text editors like LibreOffice or similar - have
a set of rules that turn some typed ASCII characters into UTF-8 alternatives,
for instance converting straight quotes into curly quotes and adding non-breaking
spaces. All of those are meant to produce better results when the text is
displayed in HTML or PDF formats.
While it is perfectly fine to use UTF-8 characters in Linux, and especially in
the documentation, it is better to stick to the ASCII subset in this
particular case, for a couple of reasons:
1. it makes life easier for tools like grep;
2. the files are easier to edit with some commonly used text/source
code editors.
Also, Sphinx already does such conversion automatically outside
literal blocks, as described at:
https://docutils.sourceforge.io/docs/user/smartquotes.html
In this series, the following UTF-8 symbols are replaced:
- U+00a0 (' '): NO-BREAK SPACE
- U+00ad (''): SOFT HYPHEN
- U+00b4 ('´'): ACUTE ACCENT
- U+00d7 ('×'): MULTIPLICATION SIGN
- U+2010 ('‐'): HYPHEN
- U+2018 ('‘'): LEFT SINGLE QUOTATION MARK
- U+2019 ('’'): RIGHT SINGLE QUOTATION MARK
- U+201c ('“'): LEFT DOUBLE QUOTATION MARK
- U+201d ('”'): RIGHT DOUBLE QUOTATION MARK
- U+2212 ('−'): MINUS SIGN
- U+2217 ('∗'): ASTERISK OPERATOR
- U+feff (''): ZERO WIDTH NO-BREAK SPACE (BOM)
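For anyone who wants to check their own files for the symbols listed above, here is a minimal sketch of a scanner. It is only an illustration and not the script used to produce this series; the names TARGETS and find_symbols are made up for the example.

```python
import unicodedata

# Code points copied from the list above.
TARGETS = [0x00A0, 0x00AD, 0x00B4, 0x00D7, 0x2010, 0x2018,
           0x2019, 0x201C, 0x201D, 0x2212, 0x2217, 0xFEFF]
TARGET_CHARS = {chr(cp) for cp in TARGETS}


def find_symbols(text):
    """Return (line, column, 'U+xxxx', name) for each listed symbol in text.

    Lines are split on '\n' only, so the line numbers agree with grep.
    Line and column are both 1-based.
    """
    hits = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in TARGET_CHARS:
                hits.append((lineno, col, f"U+{ord(ch):04x}",
                             unicodedata.name(ch)))
    return hits
```

To use it on a file, read the file as UTF-8 and pass its contents to find_symbols(). It only reports the symbols; choosing an ASCII replacement for each one is left to whoever edits the file.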
---
v2:
- removed EM/EN DASH conversion from this patchset;
- removed a few fixes, as those were addressed on a separate series.
PS.:
The first version of this series was posted with a different name:
https://lore.kernel.org/lkml/cover.1620641727.git.mchehab+huawei@kernel.org/
I also changed the patch texts, in order to better describe the patches' goals.
Mauro Carvalho Chehab (40):
docs: hwmon: Use ASCII subset instead of UTF-8 alternate symbols
docs: admin-guide: Use ASCII subset instead of UTF-8 alternate symbols
docs: admin-guide: media: ipu3.rst: Use ASCII subset instead of UTF-8
alternate symbols
docs: admin-guide: perf: imx-ddr.rst: Use ASCII subset instead of
UTF-8 alternate symbols
docs: admin-guide: pm: Use ASCII subset instead of UTF-8 alternate
symbols
docs: trace: coresight: coresight-etm4x-reference.rst: Use ASCII
subset instead of UTF-8 alternate symbols
docs: driver-api: ioctl.rst: Use ASCII subset instead of UTF-8
alternate symbols
docs: driver-api: thermal: Use ASCII subset instead of UTF-8 alternate
symbols
docs: driver-api: media: drivers: Use ASCII subset instead of UTF-8
alternate symbols
docs: driver-api: firmware: other_interfaces.rst: Use ASCII subset
instead of UTF-8 alternate symbols
docs: fault-injection: nvme-fault-injection.rst: Use ASCII subset
instead of UTF-8 alternate symbols
docs: usb: Use ASCII subset instead of UTF-8 alternate symbols
docs: process: code-of-conduct.rst: Use ASCII subset instead of UTF-8
alternate symbols
docs: userspace-api: media: fdl-appendix.rst: Use ASCII subset instead
of UTF-8 alternate symbols
docs: userspace-api: media: v4l: Use ASCII subset instead of UTF-8
alternate symbols
docs: userspace-api: media: dvb: Use ASCII subset instead of UTF-8
alternate symbols
docs: vm: zswap.rst: Use ASCII subset instead of UTF-8 alternate
symbols
docs: filesystems: f2fs.rst: Use ASCII subset instead of UTF-8
alternate symbols
docs: filesystems: ext4: Use ASCII subset instead of UTF-8 alternate
symbols
docs: kernel-hacking: Use ASCII subset instead of UTF-8 alternate
symbols
docs: hid: Use ASCII subset instead of UTF-8 alternate symbols
docs: security: tpm: tpm_event_log.rst: Use ASCII subset instead of
UTF-8 alternate symbols
docs: security: keys: trusted-encrypted.rst: Use ASCII subset instead
of UTF-8 alternate symbols
docs: networking: scaling.rst: Use ASCII subset instead of UTF-8
alternate symbols
docs: networking: devlink: devlink-dpipe.rst: Use ASCII subset instead
of UTF-8 alternate symbols
docs: networking: device_drivers: Use ASCII subset instead of UTF-8
alternate symbols
docs: x86: Use ASCII subset instead of UTF-8 alternate symbols
docs: scheduler: sched-deadline.rst: Use ASCII subset instead of UTF-8
alternate symbols
docs: power: powercap: powercap.rst: Use ASCII subset instead of UTF-8
alternate symbols
docs: ABI: Use ASCII subset instead of UTF-8 alternate symbols
docs: PCI: acpi-info.rst: Use ASCII subset instead of UTF-8 alternate
symbols
docs: gpu: Use ASCII subset instead of UTF-8 alternate symbols
docs: sound: kernel-api: writing-an-alsa-driver.rst: Use ASCII subset
instead of UTF-8 alternate symbols
docs: arm64: arm-acpi.rst: Use ASCII subset instead of UTF-8 alternate
symbols
docs: infiniband: tag_matching.rst: Use ASCII subset instead of UTF-8
alternate symbols
docs: misc-devices: ibmvmc.rst: Use ASCII subset instead of UTF-8
alternate symbols
docs: firmware-guide: acpi: lpit.rst: Use ASCII subset instead of
UTF-8 alternate symbols
docs: firmware-guide: acpi: dsd: graph.rst: Use ASCII subset instead
of UTF-8 alternate symbols
docs: virt: kvm: api.rst: Use ASCII subset instead of UTF-8 alternate
symbols
docs: RCU: Use ASCII subset instead of UTF-8 alternate symbols
...sfs-class-chromeos-driver-cros-ec-lightbar | 2 +-
.../ABI/testing/sysfs-devices-platform-ipmi | 2 +-
.../testing/sysfs-devices-platform-trackpoint | 2 +-
Documentation/ABI/testing/sysfs-devices-soc | 4 +-
Documentation/PCI/acpi-info.rst | 22 +-
.../Data-Structures/Data-Structures.rst | 52 ++--
.../Expedited-Grace-Periods.rst | 40 +--
.../Tree-RCU-Memory-Ordering.rst | 10 +-
.../RCU/Design/Requirements/Requirements.rst | 122 ++++-----
Documentation/admin-guide/media/ipu3.rst | 2 +-
Documentation/admin-guide/perf/imx-ddr.rst | 2 +-
Documentation/admin-guide/pm/intel_idle.rst | 4 +-
Documentation/admin-guide/pm/intel_pstate.rst | 4 +-
Documentation/admin-guide/ras.rst | 86 +++---
.../admin-guide/reporting-issues.rst | 2 +-
Documentation/arm64/arm-acpi.rst | 8 +-
.../driver-api/firmware/other_interfaces.rst | 2 +-
Documentation/driver-api/ioctl.rst | 8 +-
.../media/drivers/sh_mobile_ceu_camera.rst | 8 +-
.../driver-api/media/drivers/zoran.rst | 2 +-
.../driver-api/thermal/cpu-idle-cooling.rst | 14 +-
.../driver-api/thermal/intel_powerclamp.rst | 6 +-
.../thermal/x86_pkg_temperature_thermal.rst | 2 +-
.../fault-injection/nvme-fault-injection.rst | 2 +-
Documentation/filesystems/ext4/attributes.rst | 20 +-
Documentation/filesystems/ext4/bigalloc.rst | 6 +-
Documentation/filesystems/ext4/blockgroup.rst | 8 +-
Documentation/filesystems/ext4/blocks.rst | 2 +-
Documentation/filesystems/ext4/directory.rst | 16 +-
Documentation/filesystems/ext4/eainode.rst | 2 +-
Documentation/filesystems/ext4/inlinedata.rst | 6 +-
Documentation/filesystems/ext4/inodes.rst | 6 +-
Documentation/filesystems/ext4/journal.rst | 8 +-
Documentation/filesystems/ext4/mmp.rst | 2 +-
.../filesystems/ext4/special_inodes.rst | 4 +-
Documentation/filesystems/ext4/super.rst | 10 +-
Documentation/filesystems/f2fs.rst | 4 +-
.../firmware-guide/acpi/dsd/graph.rst | 2 +-
Documentation/firmware-guide/acpi/lpit.rst | 2 +-
Documentation/gpu/i915.rst | 2 +-
Documentation/gpu/komeda-kms.rst | 2 +-
Documentation/hid/hid-sensor.rst | 70 ++---
Documentation/hid/intel-ish-hid.rst | 246 +++++++++---------
Documentation/hwmon/ir36021.rst | 2 +-
Documentation/hwmon/ltc2992.rst | 2 +-
Documentation/hwmon/pm6764tr.rst | 2 +-
Documentation/infiniband/tag_matching.rst | 4 +-
Documentation/kernel-hacking/hacking.rst | 2 +-
Documentation/kernel-hacking/locking.rst | 2 +-
Documentation/misc-devices/ibmvmc.rst | 8 +-
.../device_drivers/ethernet/intel/i40e.rst | 8 +-
.../device_drivers/ethernet/intel/iavf.rst | 4 +-
.../device_drivers/ethernet/netronome/nfp.rst | 12 +-
.../networking/devlink/devlink-dpipe.rst | 2 +-
Documentation/networking/scaling.rst | 18 +-
Documentation/power/powercap/powercap.rst | 210 +++++++--------
Documentation/process/code-of-conduct.rst | 2 +-
Documentation/scheduler/sched-deadline.rst | 2 +-
.../security/keys/trusted-encrypted.rst | 4 +-
Documentation/security/tpm/tpm_event_log.rst | 2 +-
.../kernel-api/writing-an-alsa-driver.rst | 68 ++---
.../coresight/coresight-etm4x-reference.rst | 16 +-
Documentation/usb/ehci.rst | 2 +-
Documentation/usb/gadget_printer.rst | 2 +-
Documentation/usb/mass-storage.rst | 36 +--
.../media/dvb/audio-set-bypass-mode.rst | 2 +-
.../userspace-api/media/dvb/audio.rst | 2 +-
.../userspace-api/media/dvb/dmx-fopen.rst | 2 +-
.../userspace-api/media/dvb/dmx-fread.rst | 2 +-
.../media/dvb/dmx-set-filter.rst | 2 +-
.../userspace-api/media/dvb/intro.rst | 6 +-
.../userspace-api/media/dvb/video.rst | 2 +-
.../userspace-api/media/fdl-appendix.rst | 64 ++---
.../userspace-api/media/v4l/crop.rst | 16 +-
.../userspace-api/media/v4l/dev-decoder.rst | 6 +-
.../userspace-api/media/v4l/diff-v4l.rst | 2 +-
.../userspace-api/media/v4l/open.rst | 2 +-
.../media/v4l/vidioc-cropcap.rst | 4 +-
Documentation/virt/kvm/api.rst | 28 +-
Documentation/vm/zswap.rst | 4 +-
Documentation/x86/resctrl.rst | 2 +-
Documentation/x86/sgx.rst | 4 +-
82 files changed, 693 insertions(+), 693 deletions(-)
--
2.30.2
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH v2 32/40] docs: gpu: Use ASCII subset instead of UTF-8 alternate symbols
2021-05-12 12:50 [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols Mauro Carvalho Chehab
@ 2021-05-12 12:50 ` Mauro Carvalho Chehab
2021-05-12 14:14 ` [PATCH v2 00/40] " Theodore Ts'o
2021-05-12 17:07 ` David Woodhouse
2 siblings, 0 replies; 10+ messages in thread
From: Mauro Carvalho Chehab @ 2021-05-12 12:50 UTC (permalink / raw)
To: Linux Doc Mailing List
Cc: Jani Nikula, Thomas Zimmermann, Jonathan Corbet,
Mauro Carvalho Chehab, dri-devel, Liviu Dudau, linux-kernel,
David Airlie, James (Qian) Wang, Rodrigo Vivi,
Mali DP Maintainers, Mihail Atanassov, intel-gfx
The conversion tools used during DocBook/LaTeX/Markdown->ReST conversion
and some automatic rules which exist in certain text editors like
LibreOffice turned ASCII characters into some UTF-8 alternatives that
are better displayed in HTML and PDF.
While it is OK to use UTF-8 characters in Linux, it is better to
use the ASCII subset instead of a UTF-8 equivalent character,
as it makes life easier for tools like grep, and the files are easier to edit
with some commonly used text/source code editors.
Also, Sphinx already does such conversion automatically outside literal blocks:
https://docutils.sourceforge.io/docs/user/smartquotes.html
So, replace the occurrences of the following UTF-8 characters:
- U+2019 ('’'): RIGHT SINGLE QUOTATION MARK
Acked-by: Jani Nikula <jani.nikula@intel.com>
Acked-by: Liviu Dudau <liviu.dudau@arm.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
Documentation/gpu/i915.rst | 2 +-
Documentation/gpu/komeda-kms.rst | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/Documentation/gpu/i915.rst b/Documentation/gpu/i915.rst
index 486c720f3890..2cbf54460b48 100644
--- a/Documentation/gpu/i915.rst
+++ b/Documentation/gpu/i915.rst
@@ -361,7 +361,7 @@ Locking Guidelines
real bad.
#. Do not nest different lru/memory manager locks within each other.
- Take them in turn to update memory allocations, relying on the object’s
+ Take them in turn to update memory allocations, relying on the object's
dma_resv ww_mutex to serialize against other operations.
#. The suggestion for lru/memory managers locks is that they are small
diff --git a/Documentation/gpu/komeda-kms.rst b/Documentation/gpu/komeda-kms.rst
index eb693c857e2d..c2067678e92c 100644
--- a/Documentation/gpu/komeda-kms.rst
+++ b/Documentation/gpu/komeda-kms.rst
@@ -324,7 +324,7 @@ the control-abilites of device.
We have &komeda_dev, &komeda_pipeline, &komeda_component. Now fill devices with
pipelines. Since komeda is not for D71 only but also intended for later products,
-of course we’d better share as much as possible between different products. To
+of course we'd better share as much as possible between different products. To
achieve this, split the komeda device into two layers: CORE and CHIP.
- CORE: for common features and capabilities handling.
--
2.30.2
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols
2021-05-12 12:50 [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols Mauro Carvalho Chehab
2021-05-12 12:50 ` [PATCH v2 32/40] docs: gpu: " Mauro Carvalho Chehab
@ 2021-05-12 14:14 ` Theodore Ts'o
2021-05-12 15:17 ` Mauro Carvalho Chehab
2021-05-12 17:07 ` David Woodhouse
2 siblings, 1 reply; 10+ messages in thread
From: Theodore Ts'o @ 2021-05-12 14:14 UTC (permalink / raw)
To: Mauro Carvalho Chehab
Cc: alsa-devel, kvm, Linux Doc Mailing List, linux-iio, linux-pci,
dri-devel, keyrings, linux-sgx, Jonathan Corbet, linux-rdma,
linux-acpi, Mali DP Maintainers, linux-input, intel-wired-lan,
linux-ext4, intel-gfx, linux-media, linux-pm, coresight, rcu,
mjpeg-users, linux-arm-kernel, linux-edac, linux-hwmon, netdev,
linux-usb, linux-kernel, linux-f2fs-devel, linux-integrity
On Wed, May 12, 2021 at 02:50:04PM +0200, Mauro Carvalho Chehab wrote:
> v2:
> - removed EM/EN DASH conversion from this patchset;
Are you still thinking about doing the
EN DASH --> "--"
EM DASH --> "---"
conversion? That's not going to change what the documentation will
look like in the HTML and PDF output forms, and I think it would make
life easier for people who are reading and editing the Documentation/*
files in text form.
- Ted
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols
2021-05-12 14:14 ` [PATCH v2 00/40] " Theodore Ts'o
@ 2021-05-12 15:17 ` Mauro Carvalho Chehab
2021-05-12 17:12 ` David Woodhouse
0 siblings, 1 reply; 10+ messages in thread
From: Mauro Carvalho Chehab @ 2021-05-12 15:17 UTC (permalink / raw)
To: Theodore Ts'o
Cc: alsa-devel, kvm, Linux Doc Mailing List, linux-iio, linux-pci,
dri-devel, keyrings, linux-sgx, Jonathan Corbet, linux-rdma,
linux-acpi, Mali DP Maintainers, linux-input, intel-wired-lan,
linux-ext4, intel-gfx, linux-media, linux-pm, coresight, rcu,
mjpeg-users, linux-arm-kernel, linux-edac, linux-hwmon, netdev,
linux-usb, linux-kernel, linux-f2fs-devel, linux-integrity
Em Wed, 12 May 2021 10:14:44 -0400
"Theodore Ts'o" <tytso@mit.edu> escreveu:
> On Wed, May 12, 2021 at 02:50:04PM +0200, Mauro Carvalho Chehab wrote:
> > v2:
> > - removed EM/EN DASH conversion from this patchset;
>
> Are you still thinking about doing the
>
> EN DASH --> "--"
> EM DASH --> "---"
>
> conversion?
Yes, but I intend to submit it in a separate patch series, probably after
having this one merged. Let's first clean up the large part of the
conversion-generated UTF-8 char noise ;-)
> That's not going to change what the documentation will
> look like in the HTML and PDF output forms, and I think it would make
> life easier for people are reading and editing the Documentation/*
> files in text form.
Agreed. I'm also considering adding a couple of cases of this char:
- U+2026 ('…'): HORIZONTAL ELLIPSIS
as Sphinx also replaces "..." with HORIZONTAL ELLIPSIS.
-
Anyway, I'm opting to submit those separately because it seems
that at least some maintainers added EM/EN DASH intentionally.
So, it may generate case-by-case discussions.
Also, IMO, at least a couple of EN/EM DASH cases would be better served
by a single hyphen.
Thanks,
Mauro
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols
2021-05-12 12:50 [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols Mauro Carvalho Chehab
2021-05-12 12:50 ` [PATCH v2 32/40] docs: gpu: " Mauro Carvalho Chehab
2021-05-12 14:14 ` [PATCH v2 00/40] " Theodore Ts'o
@ 2021-05-12 17:07 ` David Woodhouse
2021-05-14 8:21 ` Mauro Carvalho Chehab
2 siblings, 1 reply; 10+ messages in thread
From: David Woodhouse @ 2021-05-12 17:07 UTC (permalink / raw)
To: Mauro Carvalho Chehab, Linux Doc Mailing List
Cc: alsa-devel, kvm, linux-iio, linux-pci, dri-devel, keyrings,
linux-sgx, Jonathan Corbet, linux-rdma, linux-acpi,
Mali DP Maintainers, linux-input, intel-wired-lan, linux-ext4,
intel-gfx, linux-media, linux-pm, coresight, rcu, mjpeg-users,
linux-arm-kernel, linux-edac, linux-hwmon, netdev, linux-usb,
linux-kernel, linux-f2fs-devel, linux-integrity
Your title 'Use ASCII subset' is now at least a bit *closer* to
describing what the patches are actually doing, but it's still a bit
misleading because you're only doing it for *some* characters.
And the wording is still indicative of a fundamentally *misguided*
motivation for doing any of this. Your commit comments should be about
fixing a specific thing, nothing to do with "use ASCII subset", which
is pointless in itself.
On Wed, 2021-05-12 at 14:50 +0200, Mauro Carvalho Chehab wrote:
> Such conversion tools - plus some text editor like LibreOffice or similar - have
> a set of rules that turns some typed ASCII characters into UTF-8 alternatives,
> for instance converting commas into curly commas and adding non-breakable
> spaces. All of those are meant to produce better results when the text is
> displayed in HTML or PDF formats.
And don't we render our documentation into HTML or PDF formats? Are
some of those non-breaking spaces not actually *useful* for their
intended purpose?
> While it is perfectly fine to use UTF-8 characters in Linux, and specially at
> the documentation, it is better to stick to the ASCII subset on such
> particular case, due to a couple of reasons:
>
> 1. it makes life easier for tools like grep;
Barely, as noted, because of things like line feeds.
> 2. they easier to edit with the some commonly used text/source
> code editors.
That is nonsense. Any but the most broken and/or anachronistic
environments and editors will be just fine.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols
2021-05-12 15:17 ` Mauro Carvalho Chehab
@ 2021-05-12 17:12 ` David Woodhouse
0 siblings, 0 replies; 10+ messages in thread
From: David Woodhouse @ 2021-05-12 17:12 UTC (permalink / raw)
To: Mauro Carvalho Chehab, Theodore Ts'o
Cc: alsa-devel, kvm, Linux Doc Mailing List, linux-iio, linux-pci,
dri-devel, keyrings, linux-sgx, Jonathan Corbet, linux-rdma,
linux-acpi, Mali DP Maintainers, linux-input, intel-wired-lan,
linux-ext4, intel-gfx, linux-media, linux-pm, coresight, rcu,
mjpeg-users, linux-arm-kernel, linux-edac, linux-hwmon, netdev,
linux-usb, linux-kernel, linux-f2fs-devel, linux-integrity
On Wed, 2021-05-12 at 17:17 +0200, Mauro Carvalho Chehab wrote:
> Em Wed, 12 May 2021 10:14:44 -0400
> "Theodore Ts'o" <tytso@mit.edu> escreveu:
>
> > On Wed, May 12, 2021 at 02:50:04PM +0200, Mauro Carvalho Chehab wrote:
> > > v2:
> > > - removed EM/EN DASH conversion from this patchset;
> >
> > Are you still thinking about doing the
> >
> > EN DASH --> "--"
> > EM DASH --> "---"
> >
> > conversion?
>
> Yes, but I intend to submit it on a separate patch series, probably after
> having this one merged. Let's first cleanup the large part of the
> conversion-generated UTF-8 char noise ;-)
>
> > That's not going to change what the documentation will
> > look like in the HTML and PDF output forms, and I think it would make
> > life easier for people are reading and editing the Documentation/*
> > files in text form.
>
> Agreed. I'm also considering to add a couple of cases of this char:
>
> - U+2026 ('…'): HORIZONTAL ELLIPSIS
>
> As Sphinx also replaces "..." into HORIZONTAL ELLIPSIS.
Er, what?
The *only* part of this whole enterprise that actually seemed to make
even a tiny bit of sense — rather than seeming like a thinly veiled
retrospective excuse for dragging us back in time by 30 years — was the
bit about making it easier to grep.
But if I understand you correctly, you're talking about using something
like C trigraphs to represent the perfectly reasonable text emdash
character ("—") as two hyphen-minuses ("--") in the source code of the
documentation? Isn't that going to achieve precisely the *opposite*? If
I select some text in the HTML output of the docs and then search for
it in the source code, that's going to *stop* it matching my search?
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols
2021-05-12 17:07 ` David Woodhouse
@ 2021-05-14 8:21 ` Mauro Carvalho Chehab
2021-05-14 9:06 ` David Woodhouse
0 siblings, 1 reply; 10+ messages in thread
From: Mauro Carvalho Chehab @ 2021-05-14 8:21 UTC (permalink / raw)
To: David Woodhouse
Cc: alsa-devel, kvm, Linux Doc Mailing List, linux-iio, linux-pci,
dri-devel, keyrings, linux-sgx, Jonathan Corbet, linux-rdma,
linux-acpi, Mali DP Maintainers, linux-input, intel-wired-lan,
linux-ext4, intel-gfx, linux-media, linux-pm, coresight, rcu,
mjpeg-users, linux-arm-kernel, linux-edac, linux-hwmon, netdev,
linux-usb, linux-kernel, linux-f2fs-devel, linux-integrity
Em Wed, 12 May 2021 18:07:04 +0100
David Woodhouse <dwmw2@infradead.org> escreveu:
> On Wed, 2021-05-12 at 14:50 +0200, Mauro Carvalho Chehab wrote:
> > Such conversion tools - plus some text editor like LibreOffice or similar - have
> > a set of rules that turns some typed ASCII characters into UTF-8 alternatives,
> > for instance converting commas into curly commas and adding non-breakable
> > spaces. All of those are meant to produce better results when the text is
> > displayed in HTML or PDF formats.
>
> And don't we render our documentation into HTML or PDF formats?
Yes.
> Are
> some of those non-breaking spaces not actually *useful* for their
> intended purpose?
No.
The thing is: non-breaking space can cause a lot of problems.
We even had to disable Sphinx's use of non-breaking space for
PDF output, as this was causing bad LaTeX/PDF output.
See commit 3b4c963243b1 ("docs: conf.py: adjust the LaTeX document output").
The aforementioned patch disables Sphinx's default behavior of
using NON-BREAKABLE SPACE in literal blocks and strings, using this
special setting: "parsedliteralwraps=true".
When NON-BREAKABLE SPACE was used in PDF output, several parts of
the media uAPI docs were violating the document margins by far,
causing text to be truncated.
So, please **don't add NON-BREAKABLE SPACE**, unless you test
(and keep testing from time to time) that the output in all
formats properly supports it on different Sphinx versions.
-
Also, most of those came from conversion tools, together with other
eccentricities, like the usage of the U+FEFF (BOM) character at the
start of some documents. The remaining ones seem to have come from
cut-and-paste.
For instance, bibliographic references (there are a couple of
those in media) sometimes have NON-BREAKABLE SPACE. I'm pretty
sure that those came from cutting and pasting the document titles
from the original PDF documents or web pages that
are referenced.
> > While it is perfectly fine to use UTF-8 characters in Linux, and specially at
> > the documentation, it is better to stick to the ASCII subset on such
> > particular case, due to a couple of reasons:
> >
> > 1. it makes life easier for tools like grep;
>
> Barely, as noted, because of things like line feeds.
You can use grep with "-z" to search for multi-line strings(*), like:
$ grep -Pzl 'grace period started,\s*then' $(find Documentation/ -type f)
Documentation/RCU/Design/Data-Structures/Data-Structures.rst
(*) Unfortunately, while "git grep" also has a "-z" flag, it
seems that this is (currently?) broken with regard to handling multi-line matches:
$ git grep -Pzl 'grace period started,\s*then'
$
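The same kind of whitespace-tolerant search can also be sketched in a few lines of Python, which makes it easy to report the line where a match starts. This is only an illustration under my own assumptions: find_phrase is a made-up helper, and it requires at least one whitespace character between words (\s+) rather than the \s* used in the grep example above.

```python
import re


def find_phrase(text, phrase):
    """Return the 1-based line numbers where phrase starts in text.

    Each run of spaces in phrase may match any run of whitespace in
    text, including line breaks, so a phrase wrapped across two lines
    in the source file is still found.
    """
    words = phrase.split()
    if not words:
        return []
    pattern = re.compile(r"\s+".join(re.escape(w) for w in words))
    # Count the newlines before each match to get its line number.
    return [text.count("\n", 0, m.start()) + 1
            for m in pattern.finditer(text)]
```

Calling find_phrase() with the contents of a file and the phrase as it appears in the rendered output gives the starting line of every occurrence, which can then be opened in an editor.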
> > 2. they easier to edit with the some commonly used text/source
> > code editors.
>
> That is nonsense. Any but the most broken and/or anachronistic
> environments and editors will be just fine.
Not really.
I do use a lot of UTF-8 here, as I type texts in Portuguese, but I rely
on the US-intl keyboard settings, which allow me to type "'a" for á.
However, there's no shortcut for non-Latin UTF-8 characters, as far as I know.
So, if I needed to type a curly quote in the text editors I normally
use for development (vim, nano, kate), I would need to cut-and-paste
it from somewhere[1].
[1] If I have a table with UTF-8 codes handy, I could type the UTF-8
number manually... However, it seems that this is currently broken
at least on Fedora 33 (with Mate Desktop and US intl keyboard with
dead keys).
Here, <CTRL><SHIFT>U is not working. No idea why. I haven't
tested it for *years*, as I didn't see any reason why I would
need to type UTF-8 characters by number until we started
this thread.
In practice, on the very rare occasions where I needed to write
non-Latin UTF-8 chars (maybe once a year or so, like when I
need to use a Greek letter or some weird symbol), chances
are high that I wouldn't remember its UTF-8 code.
So, if I need to spend time searching for a specific symbol, after
finding it, I just cut-and-paste it.
But even in the best case scenario where I know the UTF-8 code and
<CTRL><SHIFT>U works, if I wanted to use, for instance, curly
quotes, the keystroke sequence would be:
<CTRL><SHIFT>U201csome string<CTRL><SHIFT>U201d
That's a lot harder to type, and has a higher chance of
mistakenly adding a wrong symbol, than just typing:
"some string"
Knowing that both will produce *exactly* the same output, why
should I bother doing it the hard way?
-
Now, I'm not arguing that you can't use whatever UTF-8 symbol you
want in your docs. I'm just saying that, now that the conversion
is over and a lot of documents ended up getting some UTF-8 characters
by accident, it is time for a cleanup.
Thanks,
Mauro
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols
2021-05-14 8:21 ` Mauro Carvalho Chehab
@ 2021-05-14 9:06 ` David Woodhouse
2021-05-14 11:08 ` Edward Cree
0 siblings, 1 reply; 10+ messages in thread
From: David Woodhouse @ 2021-05-14 9:06 UTC (permalink / raw)
To: Mauro Carvalho Chehab
Cc: alsa-devel, kvm, Linux Doc Mailing List, linux-iio, linux-pci,
dri-devel, keyrings, linux-sgx, Jonathan Corbet, linux-rdma,
linux-acpi, Mali DP Maintainers, linux-input, intel-wired-lan,
linux-ext4, intel-gfx, linux-media, linux-pm, coresight, rcu,
mjpeg-users, linux-arm-kernel, linux-edac, linux-hwmon, netdev,
linux-usb, linux-kernel, linux-f2fs-devel, linux-integrity
On Fri, 2021-05-14 at 10:21 +0200, Mauro Carvalho Chehab wrote:
> Em Wed, 12 May 2021 18:07:04 +0100
> David Woodhouse <dwmw2@infradead.org> escreveu:
>
> > On Wed, 2021-05-12 at 14:50 +0200, Mauro Carvalho Chehab wrote:
> > > Such conversion tools - plus some text editor like LibreOffice or similar - have
> > > a set of rules that turns some typed ASCII characters into UTF-8 alternatives,
> > > for instance converting commas into curly commas and adding non-breakable
> > > spaces. All of those are meant to produce better results when the text is
> > > displayed in HTML or PDF formats.
> >
> > And don't we render our documentation into HTML or PDF formats?
>
> Yes.
>
> > Are
> > some of those non-breaking spaces not actually *useful* for their
> > intended purpose?
>
> No.
>
> The thing is: non-breaking space can cause a lot of problems.
>
> We even had to disable Sphinx usage of non-breaking space for
> PDF outputs, as this was causing bad LaTeX/PDF outputs.
>
> See, commit: 3b4c963243b1 ("docs: conf.py: adjust the LaTeX document output")
>
> The afore mentioned patch disables Sphinx default behavior of
> using NON-BREAKABLE SPACE on literal blocks and strings, using this
> special setting: "parsedliteralwraps=true".
>
> When NON-BREAKABLE SPACE were used on PDF outputs, several parts of
> the media uAPI docs were violating the document margins by far,
> causing texts to be truncated.
>
> So, please **don't add NON-BREAKABLE SPACE**, unless you test
> (and keep testing it from time to time) if outputs on all
> formats are properly supporting it on different Sphinx versions.
And there you have a specific change with a specific fix. Nothing to do
with whether NON-BREAKABLE SPACE is ∉ ASCII, and *certainly* nothing to
do with the fact that, like *every* character in every kernel file
except the *binary* files, it's representable in UTF-8.
By all means fix the specific characters which are typographically
wrong or which, like NON-BREAKABLE SPACE, cause problems for rendering
the documentation.
> Also, most of those came from conversion tools, together with other
> eccentricities, like the usage of U+FEFF (BOM) character at the
> start of some documents. The remaining ones seem to came from
> cut-and-paste.
... or which are just entirely redundant and gratuitous, like a BOM in
an environment where all files are UTF-8 and never 16-bit encodings
anyway.
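As a rough illustration of how narrow that particular fix is, dropping a leading BOM from a file's bytes could look like the sketch below. The helper name strip_utf8_bom is made up for the example; codecs.BOM_UTF8 is the standard library constant for the three-byte UTF-8 encoding of U+FEFF.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark, if present.

    Only a BOM at the very start is removed; any U+FEFF elsewhere in
    the file is left alone.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

Nothing else in the file changes, which is what makes this a specific fix with a specific reason.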
> > > While it is perfectly fine to use UTF-8 characters in Linux, and specially at
> > > the documentation, it is better to stick to the ASCII subset on such
> > > particular case, due to a couple of reasons:
> > >
> > > 1. it makes life easier for tools like grep;
> >
> > Barely, as noted, because of things like line feeds.
>
> You can use grep with "-z" to seek for multi-line strings(*), Like:
>
> $ grep -Pzl 'grace period started,\s*then' $(find Documentation/ -type f)
> Documentation/RCU/Design/Data-Structures/Data-Structures.rst
Yeah, right. That works if you don't just use the text that you'll have
seen in the HTML/PDF "grace period started, then", and if you instead
craft a *regex* for it, replacing the spaces with '\s*'. Or is that
[[:space:]]* if you don't want to use the experimental Perl regex
feature?
$ grep -zlr 'grace[[:space:]]\+period[[:space:]]\+started,[[:space:]]\+then' Documentation/RCU
Documentation/RCU/Design/Data-Structures/Data-Structures.rst
And without '-l' it'll obviously just give you the whole file. No '-A5
-B5' to see the surroundings... it's hardly a useful thing, is it?
> (*) Unfortunately, while "git grep" also has a "-z" flag, it
> seems that this is (currently?) broken with regards of handling multilines:
>
> $ git grep -Pzl 'grace period started,\s*then'
> $
Even better. So no, multiline grep isn't really a commonly usable
feature at all.
This is why we prefer to put user-visible strings on one line in C
source code, even if it takes the lines over 80 characters — to allow
for grep to find them.
> > > 2. they easier to edit with the some commonly used text/source
> > > code editors.
> >
> > That is nonsense. Any but the most broken and/or anachronistic
> > environments and editors will be just fine.
>
> Not really.
>
> I do use a lot of UTF-8 here, as I type texts in Portuguese, but I rely
> on the US-intl keyboard settings, which allow me to type "'a" for á.
> However, there's no shortcut for non-Latin UTF-8 codes, as far as I know.
>
> So, if I needed to type a curly comma in the text editors I normally
> use for development (vim, nano, kate), I would need to cut-and-paste
> it from somewhere[1].
That's entirely irrelevant. You don't need to be able to *type* every
character that you see in front of you, as long as your editor will
render it correctly and perhaps let you cut/paste it as you're editing
the document if you're moving things around.
> [1] If I have a table with UTF-8 codes handy, I could type the UTF-8
> number manually... However, it seems that this is currently broken
> at least on Fedora 33 (with Mate Desktop and US intl keyboard with
> dead keys).
>
> Here, <CTRL><SHIFT>U is not working. No idea why. I haven't
> tested it for *years*, as I didn't see any reason why I would
> need to type UTF-8 characters by numbers until we started
> this thread.
Please provide the bug number for this; I'd like to track it.
> But even in the best case scenario where I know the UTF-8 and
> <CTRL><SHIFT>U works, if I wanted to use, for instance, a curly
> comma, the keystroke sequence would be:
>
> <CTRL><SHIFT>U201csome string<CTRL><SHIFT>U201d
>
> That's a lot harder to type, and has a higher chance of
> mistakenly adding a wrong symbol, than just typing:
>
> "some string"
>
> Knowing that both will produce *exactly* the same output, why
> should I bother doing it the hard way?
Nobody's asked you to do it the "hard way". That's completely
irrelevant to the discussion we were having.
> Now, I'm not arguing that you can't use whatever UTF-8 symbol you
> want on your docs. I'm just saying that, now that the conversion
> is over and a lot of documents ended up getting some UTF-8 characters
> by accident, it is time for a cleanup.
All text documents are *full* of UTF-8 characters. If there is a file
in the source code which has *any* non-UTF8, we call that a 'binary
file'.
Again, if you want to make specific fixes like removing non-breaking
spaces and byte order marks, with specific reasons, then those make
sense. But it's got very little to do with UTF-8 and how easy it is to
type them. And the excuse you've put in the commit comment for your
patches is utterly bogus.
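As a rough illustration of both points (ASCII text is already valid UTF-8, and a cleanup can target specific characters), here is a minimal Python sketch. The helper name `non_ascii_chars` is hypothetical and not part of any kernel tooling:

```python
import unicodedata


def non_ascii_chars(data: bytes):
    """List (line, column, codepoint, name) for each non-ASCII character in UTF-8 data."""
    # Raises UnicodeDecodeError if the input is not valid UTF-8.
    text = data.decode("utf-8")
    found = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 0x7F:
                name = unicodedata.name(ch, "<unnamed>")
                found.append((lineno, col, f"U+{ord(ch):04X}", name))
    return found


# Pure ASCII text encodes to the same bytes under ASCII and UTF-8.
assert "plain text".encode("ascii") == "plain text".encode("utf-8")

sample = "a\u00a0b\n\u201cquoted\u201d\n".encode("utf-8")
for entry in non_ascii_chars(sample):
    print(entry)
```

A report like this names the individual characters (NO-BREAK SPACE, the quotation marks, and so on), so each replacement can be justified on its own merits.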
* Re: [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols
2021-05-14 9:06 ` David Woodhouse
@ 2021-05-14 11:08 ` Edward Cree
2021-05-14 14:18 ` Mauro Carvalho Chehab
0 siblings, 1 reply; 10+ messages in thread
From: Edward Cree @ 2021-05-14 11:08 UTC (permalink / raw)
To: David Woodhouse, Mauro Carvalho Chehab
Cc: alsa-devel, kvm, Linux Doc Mailing List, linux-iio, linux-pci,
dri-devel, keyrings, linux-sgx, Jonathan Corbet, linux-rdma,
linux-acpi, Mali DP Maintainers, linux-input, intel-wired-lan,
linux-ext4, intel-gfx, linux-media, linux-pm, coresight, rcu,
mjpeg-users, linux-arm-kernel, linux-edac, linux-hwmon, netdev,
linux-usb, linux-kernel, linux-f2fs-devel, linux-integrity
> On Fri, 2021-05-14 at 10:21 +0200, Mauro Carvalho Chehab wrote:
>> I do use a lot of UTF-8 here, as I type texts in Portuguese, but I rely
>> on the US-intl keyboard settings, which allow me to type "'a" for á.
>> However, there's no shortcut for non-Latin UTF-8 codes, as far as I know.
>>
>> So, if I needed to type a curly comma in the text editors I normally
>> use for development (vim, nano, kate), I would need to cut-and-paste
>> it from somewhere
For anyone who doesn't know about it: X has this wonderful thing called
the Compose key[1]. For instance, type ⎄--- to get —, or ⎄<" for “.
Much more mnemonic than Unicode codepoints; and you can extend it with
user-defined sequences in your ~/.XCompose file.
(I assume Wayland supports all this too, but don't know the details.)
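For reference, a small Python sketch that names the characters those two sequences produce. The `compose_examples` mapping is only an illustration built from the examples above, not a dump of any real Compose table:

```python
import unicodedata

# Keys typed after the Compose key, and the character each sequence yields
# (taken from the two examples in this message).
compose_examples = {"---": "\u2014", '<"': "\u201c"}

for keys, ch in compose_examples.items():
    print(f"Compose {keys} -> {ch}  U+{ord(ch):04X} {unicodedata.name(ch)}")
```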
On 14/05/2021 10:06, David Woodhouse wrote:
> Again, if you want to make specific fixes like removing non-breaking
> spaces and byte order marks, with specific reasons, then those make
> sense. But it's got very little to do with UTF-8 and how easy it is to
> type them. And the excuse you've put in the commit comment for your
> patches is utterly bogus.
+1
-ed
[1] https://en.wikipedia.org/wiki/Compose_key
* Re: [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols
2021-05-14 11:08 ` Edward Cree
@ 2021-05-14 14:18 ` Mauro Carvalho Chehab
0 siblings, 0 replies; 10+ messages in thread
From: Mauro Carvalho Chehab @ 2021-05-14 14:18 UTC (permalink / raw)
To: Edward Cree
Cc: alsa-devel, kvm, Linux Doc Mailing List, linux-iio, linux-pci,
dri-devel, keyrings, linux-sgx, Jonathan Corbet, linux-rdma,
linux-acpi, Mali DP Maintainers, linux-input, intel-wired-lan,
linux-ext4, intel-gfx, linux-media, linux-pm, coresight, rcu,
mjpeg-users, linux-arm-kernel, linux-edac, linux-hwmon, netdev,
linux-usb, linux-kernel, linux-f2fs-devel, linux-integrity,
David Woodhouse
On Fri, 14 May 2021 12:08:36 +0100
Edward Cree <ecree.xilinx@gmail.com> wrote:
> For anyone who doesn't know about it: X has this wonderful thing called
> the Compose key[1]. For instance, type ⎄--- to get —, or ⎄<" for “.
> Much more mnemonic than Unicode codepoints; and you can extend it with
> user-defined sequences in your ~/.XCompose file.
Good tip. I haven't used composite for years, as US-intl with dead keys is
enough for 99.999% of my needs.
Btw, at least on Fedora with Mate, Composite is disabled by default. It has
to be enabled first using the same tool that allows changing the Keyboard
layout[1].
Yet, typing an EN DASH, for example, would be "<composite>--.", which is 4
keystrokes instead of just two ('--'). It means twice the effort ;-)
[1] KDE, GNOME, Mate, ... have different ways to enable it and to
select what key would be considered <composite>:
https://dry.sailingissues.com/us-international-keyboard-layout.html
https://help.ubuntu.com/community/ComposeKey
Thanks,
Mauro
Thread overview: 10+ messages
2021-05-12 12:50 [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols Mauro Carvalho Chehab
2021-05-12 12:50 ` [PATCH v2 32/40] docs: gpu: " Mauro Carvalho Chehab
2021-05-12 14:14 ` [PATCH v2 00/40] " Theodore Ts'o
2021-05-12 15:17 ` Mauro Carvalho Chehab
2021-05-12 17:12 ` David Woodhouse
2021-05-12 17:07 ` David Woodhouse
2021-05-14 8:21 ` Mauro Carvalho Chehab
2021-05-14 9:06 ` David Woodhouse
2021-05-14 11:08 ` Edward Cree
2021-05-14 14:18 ` Mauro Carvalho Chehab