rcu.vger.kernel.org archive mirror
* [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols
@ 2021-05-12 12:50 Mauro Carvalho Chehab
  2021-05-12 12:50 ` [PATCH v2 40/40] docs: RCU: " Mauro Carvalho Chehab
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Mauro Carvalho Chehab @ 2021-05-12 12:50 UTC (permalink / raw)
  To: Linux Doc Mailing List
  Cc: Mauro Carvalho Chehab, linux-kernel, Jonathan Corbet,
	Mali DP Maintainers, alsa-devel, coresight, dri-devel, intel-gfx,
	intel-wired-lan, keyrings, kvm, linux-acpi, linux-arm-kernel,
	linux-edac, linux-ext4, linux-f2fs-devel, linux-hwmon, linux-iio,
	linux-input, linux-integrity, linux-media, linux-pci, linux-pm,
	linux-rdma, linux-sgx, linux-usb, mjpeg-users, netdev, rcu

This series contains basically a cleanup from all those years of converting
files to ReST.

During the conversion period, several tools like LaTeX, pandoc, DocBook
and some specially-written scripts were used in order to convert
existing documents.

Such conversion tools - plus some text editors like LibreOffice or similar - have
a set of rules that turn some typed ASCII characters into UTF-8 alternatives,
for instance converting straight quotes into curly quotes and adding non-breakable
spaces. All of those are meant to produce better results when the text is
displayed in HTML or PDF formats.

While it is perfectly fine to use UTF-8 characters in Linux, and especially in
the documentation, it is better to stick to the ASCII subset in this
particular case, for a couple of reasons:

1. it makes life easier for tools like grep;
2. they are easier to edit with some commonly used text/source
   code editors.
    
Also, Sphinx already does such conversion automatically outside
literal blocks, as described at:

       https://docutils.sourceforge.io/docs/user/smartquotes.html

In this series, the following UTF-8 symbols are replaced:

            - U+00a0 (' '): NO-BREAK SPACE
            - U+00ad ('­'): SOFT HYPHEN
            - U+00b4 ('´'): ACUTE ACCENT
            - U+00d7 ('×'): MULTIPLICATION SIGN
            - U+2010 ('‐'): HYPHEN
            - U+2018 ('‘'): LEFT SINGLE QUOTATION MARK
            - U+2019 ('’'): RIGHT SINGLE QUOTATION MARK
            - U+201c ('“'): LEFT DOUBLE QUOTATION MARK
            - U+201d ('”'): RIGHT DOUBLE QUOTATION MARK
            - U+2212 ('−'): MINUS SIGN
            - U+2217 ('∗'): ASTERISK OPERATOR
            - U+feff (''): ZERO WIDTH NO-BREAK SPACE (BOM)
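
Just to illustrate the idea, a replacement of this kind could be scripted
roughly as in the sketch below. This is not the tooling that was actually
used for this series, and the ASCII choices in the mapping are illustrative
assumptions only:

    #!/usr/bin/env python3
    # Illustrative sketch only: map the UTF-8 alternate symbols listed
    # above back to plain ASCII.  Note that a plain-ASCII pattern such as
    # grep "CPU 0" will not match "CPU<U+00A0>0", which is one of the
    # reasons for preferring the ASCII subset in the documentation.
    import sys

    ASCII_MAP = {
        "\u00a0": " ",   # NO-BREAK SPACE
        "\u00ad": "-",   # SOFT HYPHEN (assumption: plain hyphen)
        "\u00b4": "'",   # ACUTE ACCENT (assumption: apostrophe)
        "\u00d7": "x",   # MULTIPLICATION SIGN
        "\u2010": "-",   # HYPHEN
        "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
        "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
        "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
        "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
        "\u2212": "-",   # MINUS SIGN
        "\u2217": "*",   # ASTERISK OPERATOR
        "\ufeff": "",    # ZERO WIDTH NO-BREAK SPACE (BOM): just drop it
    }

    for fname in sys.argv[1:]:
        with open(fname, encoding="utf-8") as f:
            text = f.read()
        for utf8_char, ascii_char in ASCII_MAP.items():
            text = text.replace(utf8_char, ascii_char)
        with open(fname, "w", encoding="utf-8") as f:
            f.write(text)

Something like "python3 utf8-to-ascii.py Documentation/RCU/*.rst" would then
convert a whole directory (the script name here is made up, of course).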

---

v2:
- removed EM/EN DASH conversion from this patchset;
- removed a few fixes, as those were addressed in a separate series.
 
PS.:
   The first version of this series was posted with a different name:

	https://lore.kernel.org/lkml/cover.1620641727.git.mchehab+huawei@kernel.org/

   I also changed the patch texts, in order to better describe the patches' goals.

Mauro Carvalho Chehab (40):
  docs: hwmon: Use ASCII subset instead of UTF-8 alternate symbols
  docs: admin-guide: Use ASCII subset instead of UTF-8 alternate symbols
  docs: admin-guide: media: ipu3.rst: Use ASCII subset instead of UTF-8
    alternate symbols
  docs: admin-guide: perf: imx-ddr.rst: Use ASCII subset instead of
    UTF-8 alternate symbols
  docs: admin-guide: pm: Use ASCII subset instead of UTF-8 alternate
    symbols
  docs: trace: coresight: coresight-etm4x-reference.rst: Use ASCII
    subset instead of UTF-8 alternate symbols
  docs: driver-api: ioctl.rst: Use ASCII subset instead of UTF-8
    alternate symbols
  docs: driver-api: thermal: Use ASCII subset instead of UTF-8 alternate
    symbols
  docs: driver-api: media: drivers: Use ASCII subset instead of UTF-8
    alternate symbols
  docs: driver-api: firmware: other_interfaces.rst: Use ASCII subset
    instead of UTF-8 alternate symbols
  docs: fault-injection: nvme-fault-injection.rst: Use ASCII subset
    instead of UTF-8 alternate symbols
  docs: usb: Use ASCII subset instead of UTF-8 alternate symbols
  docs: process: code-of-conduct.rst: Use ASCII subset instead of UTF-8
    alternate symbols
  docs: userspace-api: media: fdl-appendix.rst: Use ASCII subset instead
    of UTF-8 alternate symbols
  docs: userspace-api: media: v4l: Use ASCII subset instead of UTF-8
    alternate symbols
  docs: userspace-api: media: dvb: Use ASCII subset instead of UTF-8
    alternate symbols
  docs: vm: zswap.rst: Use ASCII subset instead of UTF-8 alternate
    symbols
  docs: filesystems: f2fs.rst: Use ASCII subset instead of UTF-8
    alternate symbols
  docs: filesystems: ext4: Use ASCII subset instead of UTF-8 alternate
    symbols
  docs: kernel-hacking: Use ASCII subset instead of UTF-8 alternate
    symbols
  docs: hid: Use ASCII subset instead of UTF-8 alternate symbols
  docs: security: tpm: tpm_event_log.rst: Use ASCII subset instead of
    UTF-8 alternate symbols
  docs: security: keys: trusted-encrypted.rst: Use ASCII subset instead
    of UTF-8 alternate symbols
  docs: networking: scaling.rst: Use ASCII subset instead of UTF-8
    alternate symbols
  docs: networking: devlink: devlink-dpipe.rst: Use ASCII subset instead
    of UTF-8 alternate symbols
  docs: networking: device_drivers: Use ASCII subset instead of UTF-8
    alternate symbols
  docs: x86: Use ASCII subset instead of UTF-8 alternate symbols
  docs: scheduler: sched-deadline.rst: Use ASCII subset instead of UTF-8
    alternate symbols
  docs: power: powercap: powercap.rst: Use ASCII subset instead of UTF-8
    alternate symbols
  docs: ABI: Use ASCII subset instead of UTF-8 alternate symbols
  docs: PCI: acpi-info.rst: Use ASCII subset instead of UTF-8 alternate
    symbols
  docs: gpu: Use ASCII subset instead of UTF-8 alternate symbols
  docs: sound: kernel-api: writing-an-alsa-driver.rst: Use ASCII subset
    instead of UTF-8 alternate symbols
  docs: arm64: arm-acpi.rst: Use ASCII subset instead of UTF-8 alternate
    symbols
  docs: infiniband: tag_matching.rst: Use ASCII subset instead of UTF-8
    alternate symbols
  docs: misc-devices: ibmvmc.rst: Use ASCII subset instead of UTF-8
    alternate symbols
  docs: firmware-guide: acpi: lpit.rst: Use ASCII subset instead of
    UTF-8 alternate symbols
  docs: firmware-guide: acpi: dsd: graph.rst: Use ASCII subset instead
    of UTF-8 alternate symbols
  docs: virt: kvm: api.rst: Use ASCII subset instead of UTF-8 alternate
    symbols
  docs: RCU: Use ASCII subset instead of UTF-8 alternate symbols

 ...sfs-class-chromeos-driver-cros-ec-lightbar |   2 +-
 .../ABI/testing/sysfs-devices-platform-ipmi   |   2 +-
 .../testing/sysfs-devices-platform-trackpoint |   2 +-
 Documentation/ABI/testing/sysfs-devices-soc   |   4 +-
 Documentation/PCI/acpi-info.rst               |  22 +-
 .../Data-Structures/Data-Structures.rst       |  52 ++--
 .../Expedited-Grace-Periods.rst               |  40 +--
 .../Tree-RCU-Memory-Ordering.rst              |  10 +-
 .../RCU/Design/Requirements/Requirements.rst  | 122 ++++-----
 Documentation/admin-guide/media/ipu3.rst      |   2 +-
 Documentation/admin-guide/perf/imx-ddr.rst    |   2 +-
 Documentation/admin-guide/pm/intel_idle.rst   |   4 +-
 Documentation/admin-guide/pm/intel_pstate.rst |   4 +-
 Documentation/admin-guide/ras.rst             |  86 +++---
 .../admin-guide/reporting-issues.rst          |   2 +-
 Documentation/arm64/arm-acpi.rst              |   8 +-
 .../driver-api/firmware/other_interfaces.rst  |   2 +-
 Documentation/driver-api/ioctl.rst            |   8 +-
 .../media/drivers/sh_mobile_ceu_camera.rst    |   8 +-
 .../driver-api/media/drivers/zoran.rst        |   2 +-
 .../driver-api/thermal/cpu-idle-cooling.rst   |  14 +-
 .../driver-api/thermal/intel_powerclamp.rst   |   6 +-
 .../thermal/x86_pkg_temperature_thermal.rst   |   2 +-
 .../fault-injection/nvme-fault-injection.rst  |   2 +-
 Documentation/filesystems/ext4/attributes.rst |  20 +-
 Documentation/filesystems/ext4/bigalloc.rst   |   6 +-
 Documentation/filesystems/ext4/blockgroup.rst |   8 +-
 Documentation/filesystems/ext4/blocks.rst     |   2 +-
 Documentation/filesystems/ext4/directory.rst  |  16 +-
 Documentation/filesystems/ext4/eainode.rst    |   2 +-
 Documentation/filesystems/ext4/inlinedata.rst |   6 +-
 Documentation/filesystems/ext4/inodes.rst     |   6 +-
 Documentation/filesystems/ext4/journal.rst    |   8 +-
 Documentation/filesystems/ext4/mmp.rst        |   2 +-
 .../filesystems/ext4/special_inodes.rst       |   4 +-
 Documentation/filesystems/ext4/super.rst      |  10 +-
 Documentation/filesystems/f2fs.rst            |   4 +-
 .../firmware-guide/acpi/dsd/graph.rst         |   2 +-
 Documentation/firmware-guide/acpi/lpit.rst    |   2 +-
 Documentation/gpu/i915.rst                    |   2 +-
 Documentation/gpu/komeda-kms.rst              |   2 +-
 Documentation/hid/hid-sensor.rst              |  70 ++---
 Documentation/hid/intel-ish-hid.rst           | 246 +++++++++---------
 Documentation/hwmon/ir36021.rst               |   2 +-
 Documentation/hwmon/ltc2992.rst               |   2 +-
 Documentation/hwmon/pm6764tr.rst              |   2 +-
 Documentation/infiniband/tag_matching.rst     |   4 +-
 Documentation/kernel-hacking/hacking.rst      |   2 +-
 Documentation/kernel-hacking/locking.rst      |   2 +-
 Documentation/misc-devices/ibmvmc.rst         |   8 +-
 .../device_drivers/ethernet/intel/i40e.rst    |   8 +-
 .../device_drivers/ethernet/intel/iavf.rst    |   4 +-
 .../device_drivers/ethernet/netronome/nfp.rst |  12 +-
 .../networking/devlink/devlink-dpipe.rst      |   2 +-
 Documentation/networking/scaling.rst          |  18 +-
 Documentation/power/powercap/powercap.rst     | 210 +++++++--------
 Documentation/process/code-of-conduct.rst     |   2 +-
 Documentation/scheduler/sched-deadline.rst    |   2 +-
 .../security/keys/trusted-encrypted.rst       |   4 +-
 Documentation/security/tpm/tpm_event_log.rst  |   2 +-
 .../kernel-api/writing-an-alsa-driver.rst     |  68 ++---
 .../coresight/coresight-etm4x-reference.rst   |  16 +-
 Documentation/usb/ehci.rst                    |   2 +-
 Documentation/usb/gadget_printer.rst          |   2 +-
 Documentation/usb/mass-storage.rst            |  36 +--
 .../media/dvb/audio-set-bypass-mode.rst       |   2 +-
 .../userspace-api/media/dvb/audio.rst         |   2 +-
 .../userspace-api/media/dvb/dmx-fopen.rst     |   2 +-
 .../userspace-api/media/dvb/dmx-fread.rst     |   2 +-
 .../media/dvb/dmx-set-filter.rst              |   2 +-
 .../userspace-api/media/dvb/intro.rst         |   6 +-
 .../userspace-api/media/dvb/video.rst         |   2 +-
 .../userspace-api/media/fdl-appendix.rst      |  64 ++---
 .../userspace-api/media/v4l/crop.rst          |  16 +-
 .../userspace-api/media/v4l/dev-decoder.rst   |   6 +-
 .../userspace-api/media/v4l/diff-v4l.rst      |   2 +-
 .../userspace-api/media/v4l/open.rst          |   2 +-
 .../media/v4l/vidioc-cropcap.rst              |   4 +-
 Documentation/virt/kvm/api.rst                |  28 +-
 Documentation/vm/zswap.rst                    |   4 +-
 Documentation/x86/resctrl.rst                 |   2 +-
 Documentation/x86/sgx.rst                     |   4 +-
 82 files changed, 693 insertions(+), 693 deletions(-)

-- 
2.30.2



^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2 40/40] docs: RCU: Use ASCII subset instead of UTF-8 alternate symbols
  2021-05-12 12:50 [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols Mauro Carvalho Chehab
@ 2021-05-12 12:50 ` Mauro Carvalho Chehab
  2021-05-12 14:14 ` [PATCH v2 00/40] " Theodore Ts'o
  2021-05-12 17:07 ` David Woodhouse
  2 siblings, 0 replies; 14+ messages in thread
From: Mauro Carvalho Chehab @ 2021-05-12 12:50 UTC (permalink / raw)
  To: Linux Doc Mailing List
  Cc: Mauro Carvalho Chehab, Jonathan Corbet,
	Nícolas F. R. A. Prado, Paul E. McKenney, Joel Fernandes,
	Josh Triplett, Lai Jiangshan, Mathieu Desnoyers, Paul Gortmaker,
	Randy Dunlap, Sebastian Andrzej Siewior, Steven Rostedt,
	Takashi Iwai, Will Deacon, linux-kernel, rcu

The conversion tools used during the DocBook/LaTeX/Markdown->ReST conversion,
and some automatic rules which exist in certain text editors like
LibreOffice, turned ASCII characters into some UTF-8 alternatives that
are better displayed in HTML and PDF.

While it is OK to use UTF-8 characters in Linux, it is better to
use the ASCII subset instead of a UTF-8 equivalent character, as
it makes life easier for tools like grep, and the files are easier
to edit with some commonly used text/source code editors.

Also, Sphinx already does such conversion automatically outside literal blocks:
   https://docutils.sourceforge.io/docs/user/smartquotes.html

So, replace the occurrences of the following UTF-8 characters:

	- U+00a0 (' '): NO-BREAK SPACE
	- U+201c ('“'): LEFT DOUBLE QUOTATION MARK
	- U+201d ('”'): RIGHT DOUBLE QUOTATION MARK

Acked-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
 .../Data-Structures/Data-Structures.rst       |  52 ++++----
 .../Expedited-Grace-Periods.rst               |  40 +++---
 .../Tree-RCU-Memory-Ordering.rst              |  10 +-
 .../RCU/Design/Requirements/Requirements.rst  | 122 +++++++++---------
 4 files changed, 112 insertions(+), 112 deletions(-)

diff --git a/Documentation/RCU/Design/Data-Structures/Data-Structures.rst b/Documentation/RCU/Design/Data-Structures/Data-Structures.rst
index f4efd6897b09..e95c6c8eeb6a 100644
--- a/Documentation/RCU/Design/Data-Structures/Data-Structures.rst
+++ b/Documentation/RCU/Design/Data-Structures/Data-Structures.rst
@@ -301,7 +301,7 @@ The ``->gp_max`` field tracks the duration of the longest grace period
 in jiffies. It is protected by the root ``rcu_node``'s ``->lock``.
 
 The ``->name`` and ``->abbr`` fields distinguish between preemptible RCU
-(“rcu_preempt” and “p”) and non-preemptible RCU (“rcu_sched” and “s”).
+("rcu_preempt" and "p") and non-preemptible RCU ("rcu_sched" and "s").
 These fields are used for diagnostic and tracing purposes.
 
 The ``rcu_node`` Structure
@@ -456,21 +456,21 @@ expedited grace periods, respectively.
 | Lockless grace-period computation! Such a tantalizing possibility!    |
 | But consider the following sequence of events:                        |
 |                                                                       |
-| #. CPU 0 has been in dyntick-idle mode for quite some time. When it   |
+| #. CPU 0 has been in dyntick-idle mode for quite some time. When it   |
 |    wakes up, it notices that the current RCU grace period needs it to |
 |    report in, so it sets a flag where the scheduling clock interrupt  |
 |    will find it.                                                      |
-| #. Meanwhile, CPU 1 is running ``force_quiescent_state()``, and       |
-|    notices that CPU 0 has been in dyntick idle mode, which qualifies  |
+| #. Meanwhile, CPU 1 is running ``force_quiescent_state()``, and       |
+|    notices that CPU 0 has been in dyntick idle mode, which qualifies  |
 |    as an extended quiescent state.                                    |
-| #. CPU 0's scheduling clock interrupt fires in the middle of an RCU   |
+| #. CPU 0's scheduling clock interrupt fires in the middle of an RCU   |
 |    read-side critical section, and notices that the RCU core needs    |
 |    something, so commences RCU softirq processing.                    |
-| #. CPU 0's softirq handler executes and is just about ready to report |
+| #. CPU 0's softirq handler executes and is just about ready to report |
 |    its quiescent state up the ``rcu_node`` tree.                      |
-| #. But CPU 1 beats it to the punch, completing the current grace      |
+| #. But CPU 1 beats it to the punch, completing the current grace      |
 |    period and starting a new one.                                     |
-| #. CPU 0 now reports its quiescent state for the wrong grace period.  |
+| #. CPU 0 now reports its quiescent state for the wrong grace period.  |
 |    That grace period might now end before the RCU read-side critical  |
 |    section. If that happens, disaster will ensue.                     |
 |                                                                       |
@@ -515,18 +515,18 @@ removes itself from the ``->blkd_tasks`` list, then that task must
 advance the pointer to the next task on the list, or set the pointer to
 ``NULL`` if there are no subsequent tasks on the list.
 
-For example, suppose that tasks T1, T2, and T3 are all hard-affinitied
-to the largest-numbered CPU in the system. Then if task T1 blocked in an
+For example, suppose that tasks T1, T2, and T3 are all hard-affinitied
+to the largest-numbered CPU in the system. Then if task T1 blocked in an
 RCU read-side critical section, then an expedited grace period started,
-then task T2 blocked in an RCU read-side critical section, then a normal
-grace period started, and finally task 3 blocked in an RCU read-side
+then task T2 blocked in an RCU read-side critical section, then a normal
+grace period started, and finally task 3 blocked in an RCU read-side
 critical section, then the state of the last leaf ``rcu_node``
 structure's blocked-task list would be as shown below:
 
 .. kernel-figure:: blkd_task.svg
 
-Task T1 is blocking both grace periods, task T2 is blocking only the
-normal grace period, and task T3 is blocking neither grace period. Note
+Task T1 is blocking both grace periods, task T2 is blocking only the
+normal grace period, and task T3 is blocking neither grace period. Note
 that these tasks will not remove themselves from this list immediately
 upon resuming execution. They will instead remain on the list until they
 execute the outermost ``rcu_read_unlock()`` that ends their RCU
@@ -611,8 +611,8 @@ expressions as follows:
    66 #endif
 
 The maximum number of levels in the ``rcu_node`` structure is currently
-limited to four, as specified by lines 21-24 and the structure of the
-subsequent “if” statement. For 32-bit systems, this allows
+limited to four, as specified by lines 21-24 and the structure of the
+subsequent "if" statement. For 32-bit systems, this allows
 16*32*32*32=524,288 CPUs, which should be sufficient for the next few
 years at least. For 64-bit systems, 16*64*64*64=4,194,304 CPUs is
 allowed, which should see us through the next decade or so. This
@@ -638,9 +638,9 @@ fields. The number of CPUs per leaf ``rcu_node`` structure is therefore
 limited to 16 given the default value of ``CONFIG_RCU_FANOUT_LEAF``. If
 ``CONFIG_RCU_FANOUT_LEAF`` is unspecified, the value selected is based
 on the word size of the system, just as for ``CONFIG_RCU_FANOUT``.
-Lines 11-19 perform this computation.
+Lines 11-19 perform this computation.
 
-Lines 21-24 compute the maximum number of CPUs supported by a
+Lines 21-24 compute the maximum number of CPUs supported by a
 single-level (which contains a single ``rcu_node`` structure),
 two-level, three-level, and four-level ``rcu_node`` tree, respectively,
 given the fanout specified by ``RCU_FANOUT`` and ``RCU_FANOUT_LEAF``.
@@ -649,18 +649,18 @@ These numbers of CPUs are retained in the ``RCU_FANOUT_1``,
 variables, respectively.
 
 These variables are used to control the C-preprocessor ``#if`` statement
-spanning lines 26-66 that computes the number of ``rcu_node`` structures
+spanning lines 26-66 that computes the number of ``rcu_node`` structures
 required for each level of the tree, as well as the number of levels
 required. The number of levels is placed in the ``NUM_RCU_LVLS``
-C-preprocessor variable by lines 27, 35, 44, and 54. The number of
+C-preprocessor variable by lines 27, 35, 44, and 54. The number of
 ``rcu_node`` structures for the topmost level of the tree is always
 exactly one, and this value is unconditionally placed into
-``NUM_RCU_LVL_0`` by lines 28, 36, 45, and 55. The rest of the levels
+``NUM_RCU_LVL_0`` by lines 28, 36, 45, and 55. The rest of the levels
 (if any) of the ``rcu_node`` tree are computed by dividing the maximum
 number of CPUs by the fanout supported by the number of levels from the
 current level down, rounding up. This computation is performed by
-lines 37, 46-47, and 56-58. Lines 31-33, 40-42, 50-52, and 62-63 create
-initializers for lockdep lock-class names. Finally, lines 64-66 produce
+lines 37, 46-47, and 56-58. Lines 31-33, 40-42, 50-52, and 62-63 create
+initializers for lockdep lock-class names. Finally, lines 64-66 produce
 an error if the maximum number of CPUs is too large for the specified
 fanout.
 
@@ -716,13 +716,13 @@ In this figure, the ``->head`` pointer references the first RCU callback
 in the list. The ``->tails[RCU_DONE_TAIL]`` array element references the
 ``->head`` pointer itself, indicating that none of the callbacks is
 ready to invoke. The ``->tails[RCU_WAIT_TAIL]`` array element references
-callback CB 2's ``->next`` pointer, which indicates that CB 1 and CB 2
+callback CB 2's ``->next`` pointer, which indicates that CB 1 and CB 2
 are both waiting on the current grace period, give or take possible
 disagreements about exactly which grace period is the current one. The
 ``->tails[RCU_NEXT_READY_TAIL]`` array element references the same RCU
 callback that ``->tails[RCU_WAIT_TAIL]`` does, which indicates that
 there are no callbacks waiting on the next RCU grace period. The
-``->tails[RCU_NEXT_TAIL]`` array element references CB 4's ``->next``
+``->tails[RCU_NEXT_TAIL]`` array element references CB 4's ``->next``
 pointer, indicating that all the remaining RCU callbacks have not yet
 been assigned to an RCU grace period. Note that the
 ``->tails[RCU_NEXT_TAIL]`` array element always references the last RCU
@@ -1031,7 +1031,7 @@ field to record the offset of the ``rcu_head`` structure within the
 enclosing RCU-protected data structure.
 
 Both of these fields are used internally by RCU. From the viewpoint of
-RCU users, this structure is an opaque “cookie”.
+RCU users, this structure is an opaque "cookie".
 
 +-----------------------------------------------------------------------+
 | **Quick Quiz**:                                                       |
diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst b/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst
index 6f89cf1e567d..742921a7532b 100644
--- a/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst
+++ b/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst
@@ -304,8 +304,8 @@ representing the elements of the ``->exp_wq[]`` array.
 
 .. kernel-figure:: Funnel0.svg
 
-The next diagram shows the situation after the arrival of Task A and
-Task B at the leftmost and rightmost leaf ``rcu_node`` structures,
+The next diagram shows the situation after the arrival of Task A and
+Task B at the leftmost and rightmost leaf ``rcu_node`` structures,
 respectively. The current value of the ``rcu_state`` structure's
 ``->expedited_sequence`` field is zero, so adding three and clearing the
 bottom bit results in the value two, which both tasks record in the
@@ -313,13 +313,13 @@ bottom bit results in the value two, which both tasks record in the
 
 .. kernel-figure:: Funnel1.svg
 
-Each of Tasks A and B will move up to the root ``rcu_node`` structure.
-Suppose that Task A wins, recording its desired grace-period sequence
+Each of Tasks A and B will move up to the root ``rcu_node`` structure.
+Suppose that Task A wins, recording its desired grace-period sequence
 number and resulting in the state shown below:
 
 .. kernel-figure:: Funnel2.svg
 
-Task A now advances to initiate a new grace period, while Task B moves
+Task A now advances to initiate a new grace period, while Task B moves
 up to the root ``rcu_node`` structure, and, seeing that its desired
 sequence number is already recorded, blocks on ``->exp_wq[1]``.
 
@@ -340,7 +340,7 @@ sequence number is already recorded, blocks on ``->exp_wq[1]``.
 | ``->exp_wq[1]``.                                                      |
 +-----------------------------------------------------------------------+
 
-If Tasks C and D also arrive at this point, they will compute the same
+If Tasks C and D also arrive at this point, they will compute the same
 desired grace-period sequence number, and see that both leaf
 ``rcu_node`` structures already have that value recorded. They will
 therefore block on their respective ``rcu_node`` structures'
@@ -348,52 +348,52 @@ therefore block on their respective ``rcu_node`` structures'
 
 .. kernel-figure:: Funnel3.svg
 
-Task A now acquires the ``rcu_state`` structure's ``->exp_mutex`` and
+Task A now acquires the ``rcu_state`` structure's ``->exp_mutex`` and
 initiates the grace period, which increments ``->expedited_sequence``.
-Therefore, if Tasks E and F arrive, they will compute a desired sequence
+Therefore, if Tasks E and F arrive, they will compute a desired sequence
 number of 4 and will record this value as shown below:
 
 .. kernel-figure:: Funnel4.svg
 
-Tasks E and F will propagate up the ``rcu_node`` combining tree, with
-Task F blocking on the root ``rcu_node`` structure and Task E wait for
-Task A to finish so that it can start the next grace period. The
+Tasks E and F will propagate up the ``rcu_node`` combining tree, with
+Task F blocking on the root ``rcu_node`` structure and Task E wait for
+Task A to finish so that it can start the next grace period. The
 resulting state is as shown below:
 
 .. kernel-figure:: Funnel5.svg
 
-Once the grace period completes, Task A starts waking up the tasks
+Once the grace period completes, Task A starts waking up the tasks
 waiting for this grace period to complete, increments the
 ``->expedited_sequence``, acquires the ``->exp_wake_mutex`` and then
 releases the ``->exp_mutex``. This results in the following state:
 
 .. kernel-figure:: Funnel6.svg
 
-Task E can then acquire ``->exp_mutex`` and increment
-``->expedited_sequence`` to the value three. If new tasks G and H arrive
+Task E can then acquire ``->exp_mutex`` and increment
+``->expedited_sequence`` to the value three. If new tasks G and H arrive
 and moves up the combining tree at the same time, the state will be as
 follows:
 
 .. kernel-figure:: Funnel7.svg
 
 Note that three of the root ``rcu_node`` structure's waitqueues are now
-occupied. However, at some point, Task A will wake up the tasks blocked
+occupied. However, at some point, Task A will wake up the tasks blocked
 on the ``->exp_wq`` waitqueues, resulting in the following state:
 
 .. kernel-figure:: Funnel8.svg
 
-Execution will continue with Tasks E and H completing their grace
+Execution will continue with Tasks E and H completing their grace
 periods and carrying out their wakeups.
 
 +-----------------------------------------------------------------------+
 | **Quick Quiz**:                                                       |
 +-----------------------------------------------------------------------+
-| What happens if Task A takes so long to do its wakeups that Task E's  |
+| What happens if Task A takes so long to do its wakeups that Task E's  |
 | grace period completes?                                               |
 +-----------------------------------------------------------------------+
 | **Answer**:                                                           |
 +-----------------------------------------------------------------------+
-| Then Task E will block on the ``->exp_wake_mutex``, which will also   |
+| Then Task E will block on the ``->exp_wake_mutex``, which will also   |
 | prevent it from releasing ``->exp_mutex``, which in turn will prevent |
 | the next grace period from starting. This last is important in        |
 | preventing overflow of the ``->exp_wq[]`` array.                      |
@@ -464,8 +464,8 @@ code need not worry about POSIX signals. Unfortunately, it has the
 corresponding disadvantage that workqueues cannot be used until they are
 initialized, which does not happen until some time after the scheduler
 spawns the first task. Given that there are parts of the kernel that
-really do want to execute grace periods during this mid-boot “dead
-zone”, expedited grace periods must do something else during thie time.
+really do want to execute grace periods during this mid-boot "dead
+zone", expedited grace periods must do something else during thie time.
 
 What they do is to fall back to the old practice of requiring that the
 requesting task drive the expedited grace period, as was the case before
diff --git a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst
index a648b423ba0e..a131d6cd41cc 100644
--- a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst
+++ b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst
@@ -215,7 +215,7 @@ newly arrived RCU callbacks against future grace periods:
    43 }
 
 But the only part of ``rcu_prepare_for_idle()`` that really matters for
-this discussion are lines 37–39. We will therefore abbreviate this
+this discussion are lines 37–39. We will therefore abbreviate this
 function as follows:
 
 .. kernel-figure:: rcu_node-lock.svg
@@ -418,7 +418,7 @@ wait on.
 | It is indeed not necessary for the grace period to wait on such a     |
 | critical section. However, it is permissible to wait on it. And it is |
 | furthermore important to wait on it, as this lazy approach is far     |
-| more scalable than a “big bang” all-at-once grace-period start could  |
+| more scalable than a "big bang" all-at-once grace-period start could  |
 | possibly be.                                                          |
 +-----------------------------------------------------------------------+
 
@@ -448,7 +448,7 @@ proceeds upwards from that point, and the ``rcu_node`` ``->lock``
 guarantees that the first CPU's quiescent state happens before the
 remainder of the second CPU's traversal. Applying this line of thought
 repeatedly shows that all CPUs' quiescent states happen before the last
-CPU traverses through the root ``rcu_node`` structure, the “last CPU”
+CPU traverses through the root ``rcu_node`` structure, the "last CPU"
 being the one that clears the last bit in the root ``rcu_node``
 structure's ``->qsmask`` field.
 
@@ -501,8 +501,8 @@ Forcing Quiescent States
 
 As noted above, idle and offline CPUs cannot report their own quiescent
 states, and therefore the grace-period kernel thread must do the
-reporting on their behalf. This process is called “forcing quiescent
-states”, it is repeated every few jiffies, and its ordering effects are
+reporting on their behalf. This process is called "forcing quiescent
+states", it is repeated every few jiffies, and its ordering effects are
 shown below:
 
 .. kernel-figure:: TreeRCU-gp-fqs.svg
diff --git a/Documentation/RCU/Design/Requirements/Requirements.rst b/Documentation/RCU/Design/Requirements/Requirements.rst
index 38a39476fc24..673369024129 100644
--- a/Documentation/RCU/Design/Requirements/Requirements.rst
+++ b/Documentation/RCU/Design/Requirements/Requirements.rst
@@ -4,7 +4,7 @@ A Tour Through RCU's Requirements
 
 Copyright IBM Corporation, 2015
 
-Author: Paul E. McKenney
+Author: Paul E. McKenney
 
 The initial version of this document appeared in the
 `LWN <https://lwn.net/>`_ on those articles:
@@ -66,7 +66,7 @@ Grace-Period Guarantee
 
 RCU's grace-period guarantee is unusual in being premeditated: Jack
 Slingwine and I had this guarantee firmly in mind when we started work
-on RCU (then called “rclock”) in the early 1990s. That said, the past
+on RCU (then called "rclock") in the early 1990s. That said, the past
 two decades of experience with RCU have produced a much more detailed
 understanding of this guarantee.
 
@@ -102,7 +102,7 @@ overhead to readers, for example:
       15   WRITE_ONCE(y, 1);
       16 }
 
-Because the synchronize_rcu() on line 14 waits for all pre-existing
+Because the synchronize_rcu() on line 14 waits for all pre-existing
 readers, any instance of thread0() that loads a value of zero from
 ``x`` must complete before thread1() stores to ``y``, so that
 instance must also load a value of zero from ``y``. Similarly, any
@@ -178,7 +178,7 @@ little or no synchronization overhead in do_something_dlm().
 +-----------------------------------------------------------------------+
 | **Quick Quiz**:                                                       |
 +-----------------------------------------------------------------------+
-| Why is the synchronize_rcu() on line 28 needed?                       |
+| Why is the synchronize_rcu() on line 28 needed?                       |
 +-----------------------------------------------------------------------+
 | **Answer**:                                                           |
 +-----------------------------------------------------------------------+
@@ -244,7 +244,7 @@ their rights to reorder this code as follows:
       16 }
 
 If an RCU reader fetches ``gp`` just after ``add_gp_buggy_optimized``
-executes line 11, it will see garbage in the ``->a`` and ``->b`` fields.
+executes line 11, it will see garbage in the ``->a`` and ``->b`` fields.
 And this is but one of many ways in which compiler and hardware
 optimizations could cause trouble. Therefore, we clearly need some way
 to prevent the compiler and the CPU from reordering in this manner,
@@ -279,11 +279,11 @@ shows an example of insertion:
       15   return true;
       16 }
 
-The rcu_assign_pointer() on line 13 is conceptually equivalent to a
+The rcu_assign_pointer() on line 13 is conceptually equivalent to a
 simple assignment statement, but also guarantees that its assignment
-will happen after the two assignments in lines 11 and 12, similar to the
+will happen after the two assignments in lines 11 and 12, similar to the
 C11 ``memory_order_release`` store operation. It also prevents any
-number of “interesting” compiler optimizations, for example, the use of
+number of "interesting" compiler optimizations, for example, the use of
 ``gp`` as a scratch location immediately preceding the assignment.
 
 +-----------------------------------------------------------------------+
@@ -410,11 +410,11 @@ This process is implemented by remove_gp_synchronous():
       15   return true;
       16 }
 
-This function is straightforward, with line 13 waiting for a grace
-period before line 14 frees the old data element. This waiting ensures
-that readers will reach line 7 of do_something_gp() before the data
+This function is straightforward, with line 13 waiting for a grace
+period before line 14 frees the old data element. This waiting ensures
+that readers will reach line 7 of do_something_gp() before the data
 element referenced by ``p`` is freed. The rcu_access_pointer() on
-line 6 is similar to rcu_dereference(), except that:
+line 6 is similar to rcu_dereference(), except that:
 
 #. The value returned by rcu_access_pointer() cannot be
    dereferenced. If you want to access the value pointed to as well as
@@ -488,25 +488,25 @@ systems with more than one CPU:
    section ends and the time that synchronize_rcu() returns. Without
    this guarantee, a pre-existing RCU read-side critical section might
    hold a reference to the newly removed ``struct foo`` after the
-   kfree() on line 14 of remove_gp_synchronous().
+   kfree() on line 14 of remove_gp_synchronous().
 #. Each CPU that has an RCU read-side critical section that ends after
    synchronize_rcu() returns is guaranteed to execute a full memory
    barrier between the time that synchronize_rcu() begins and the
    time that the RCU read-side critical section begins. Without this
    guarantee, a later RCU read-side critical section running after the
-   kfree() on line 14 of remove_gp_synchronous() might later run
+   kfree() on line 14 of remove_gp_synchronous() might later run
    do_something_gp() and find the newly deleted ``struct foo``.
 #. If the task invoking synchronize_rcu() remains on a given CPU,
    then that CPU is guaranteed to execute a full memory barrier sometime
    during the execution of synchronize_rcu(). This guarantee ensures
-   that the kfree() on line 14 of remove_gp_synchronous() really
-   does execute after the removal on line 11.
+   that the kfree() on line 14 of remove_gp_synchronous() really
+   does execute after the removal on line 11.
 #. If the task invoking synchronize_rcu() migrates among a group of
    CPUs during that invocation, then each of the CPUs in that group is
    guaranteed to execute a full memory barrier sometime during the
    execution of synchronize_rcu(). This guarantee also ensures that
-   the kfree() on line 14 of remove_gp_synchronous() really does
-   execute after the removal on line 11, but also in the case where the
+   the kfree() on line 14 of remove_gp_synchronous() really does
+   execute after the removal on line 11, but also in the case where the
    thread executing the synchronize_rcu() migrates in the meantime.
 
 +-----------------------------------------------------------------------+
@@ -525,8 +525,8 @@ systems with more than one CPU:
 | In other words, a given instance of synchronize_rcu() can avoid       |
 | waiting on a given RCU read-side critical section only if it can      |
 | prove that synchronize_rcu() started first.                           |
-| A related question is “When rcu_read_lock() doesn't generate any      |
-| code, why does it matter how it relates to a grace period?” The       |
+| A related question is "When rcu_read_lock() doesn't generate any      |
+| code, why does it matter how it relates to a grace period?" The       |
 | answer is that it is not the relationship of rcu_read_lock()          |
 | itself that is important, but rather the relationship of the code     |
 | within the enclosed RCU read-side critical section to the code        |
@@ -538,7 +538,7 @@ systems with more than one CPU:
 | of any access following the grace period.                             |
 |                                                                       |
 | As of late 2016, mathematical models of RCU take this viewpoint, for  |
-| example, see slides 62 and 63 of the `2016 LinuxCon                   |
+| example, see slides 62 and 63 of the `2016 LinuxCon                   |
 | EU <http://www2.rdrop.com/users/paulmck/scalability/paper/LinuxMM.201 |
 | 6.10.04c.LCE.pdf>`__                                                  |
 | presentation.                                                         |
@@ -584,9 +584,9 @@ systems with more than one CPU:
 |                                                                       |
 | And similarly, without a memory barrier between the beginning of the  |
 | grace period and the beginning of the RCU read-side critical section, |
-| CPU 1 might end up accessing the freelist.                            |
+| CPU 1 might end up accessing the freelist.                            |
 |                                                                       |
-| The “as if” rule of course applies, so that any implementation that   |
+| The "as if" rule of course applies, so that any implementation that   |
 | acts as if the appropriate memory barriers were in place is a correct |
 | implementation. That said, it is much easier to fool yourself into    |
 | believing that you have adhered to the as-if rule than it is to       |
@@ -1002,7 +1002,7 @@ RCU implementation must abide by them. They therefore bear repeating:
    ECC errors, NMIs, and other hardware events. Although a delay of more
    than about 20 seconds can result in splats, the RCU implementation is
    obligated to use algorithms that can tolerate extremely long delays,
-   but where “extremely long” is not long enough to allow wrap-around
+   but where "extremely long" is not long enough to allow wrap-around
    when incrementing a 64-bit counter.
 #. Both the compiler and the CPU can reorder memory accesses. Where it
    matters, RCU must use compiler directives and memory-barrier
@@ -1169,7 +1169,7 @@ Energy efficiency is a critical component of performance today, and
 Linux-kernel RCU implementations must therefore avoid unnecessarily
 awakening idle CPUs. I cannot claim that this requirement was
 premeditated. In fact, I learned of it during a telephone conversation
-in which I was given “frank and open” feedback on the importance of
+in which I was given "frank and open" feedback on the importance of
 energy efficiency in battery-powered systems and on specific
 energy-efficiency shortcomings of the Linux-kernel RCU implementation.
 In my experience, the battery-powered embedded community will consider
@@ -1234,7 +1234,7 @@ requirements: A storm of synchronize_rcu_expedited() invocations on
 4096 CPUs should at least make reasonable forward progress. In return
 for its shorter latencies, synchronize_rcu_expedited() is permitted
 to impose modest degradation of real-time latency on non-idle online
-CPUs. Here, “modest” means roughly the same latency degradation as a
+CPUs. Here, "modest" means roughly the same latency degradation as a
 scheduling-clock interrupt.
 
 There are a number of situations where even
@@ -1274,8 +1274,8 @@ be used in place of synchronize_rcu() as follows:
       28 }
 
 A definition of ``struct foo`` is finally needed, and appears on
-lines 1-5. The function remove_gp_cb() is passed to call_rcu()
-on line 25, and will be invoked after the end of a subsequent grace
+lines 1-5. The function remove_gp_cb() is passed to call_rcu()
+on line 25, and will be invoked after the end of a subsequent grace
 period. This gets the same effect as remove_gp_synchronous(), but
 without forcing the updater to wait for a grace period to elapse. The
 call_rcu() function may be used in a number of situations where
@@ -1294,23 +1294,23 @@ threads or (in the Linux kernel) workqueues.
 +-----------------------------------------------------------------------+
 | **Quick Quiz**:                                                       |
 +-----------------------------------------------------------------------+
-| Why does line 19 use rcu_access_pointer()? After all,                 |
-| call_rcu() on line 25 stores into the structure, which would          |
+| Why does line 19 use rcu_access_pointer()? After all,                 |
+| call_rcu() on line 25 stores into the structure, which would          |
 | interact badly with concurrent insertions. Doesn't this mean that     |
 | rcu_dereference() is required?                                        |
 +-----------------------------------------------------------------------+
 | **Answer**:                                                           |
 +-----------------------------------------------------------------------+
-| Presumably the ``->gp_lock`` acquired on line 18 excludes any         |
+| Presumably the ``->gp_lock`` acquired on line 18 excludes any         |
 | changes, including any insertions that rcu_dereference() would        |
 | protect against. Therefore, any insertions will be delayed until      |
-| after ``->gp_lock`` is released on line 25, which in turn means that  |
+| after ``->gp_lock`` is released on line 25, which in turn means that  |
 | rcu_access_pointer() suffices.                                        |
 +-----------------------------------------------------------------------+
 
 However, all that remove_gp_cb() is doing is invoking kfree() on
 the data element. This is a common idiom, and is supported by
-kfree_rcu(), which allows “fire and forget” operation as shown
+kfree_rcu(), which allows "fire and forget" operation as shown
 below:
 
    ::
@@ -1396,8 +1396,8 @@ may be used for this purpose, as shown below:
       18   return true;
       19 }
 
-On line 14, get_state_synchronize_rcu() obtains a “cookie” from RCU,
-then line 15 carries out other tasks, and finally, line 16 returns
+On line 14, get_state_synchronize_rcu() obtains a "cookie" from RCU,
+then line 15 carries out other tasks, and finally, line 16 returns
 immediately if a grace period has elapsed in the meantime, but otherwise
 waits as required. The need for ``get_state_synchronize_rcu`` and
 cond_synchronize_rcu() has appeared quite recently, so it is too
@@ -1420,9 +1420,9 @@ example, an infinite loop in an RCU read-side critical section must by
 definition prevent later grace periods from ever completing. For a more
 involved example, consider a 64-CPU system built with
 ``CONFIG_RCU_NOCB_CPU=y`` and booted with ``rcu_nocbs=1-63``, where
-CPUs 1 through 63 spin in tight loops that invoke call_rcu(). Even
+CPUs 1 through 63 spin in tight loops that invoke call_rcu(). Even
 if these tight loops also contain calls to cond_resched() (thus
-allowing grace periods to complete), CPU 0 simply will not be able to
+allowing grace periods to complete), CPU 0 simply will not be able to
 invoke callbacks as fast as the other 63 CPUs can register them, at
 least not until the system runs out of memory. In both of these
 examples, the Spiderman principle applies: With great power comes great
@@ -1433,7 +1433,7 @@ callbacks.
 RCU takes the following steps to encourage timely completion of grace
 periods:
 
-#. If a grace period fails to complete within 100 milliseconds, RCU
+#. If a grace period fails to complete within 100 milliseconds, RCU
    causes future invocations of cond_resched() on the holdout CPUs
    to provide an RCU quiescent state. RCU also causes those CPUs'
    need_resched() invocations to return ``true``, but only after the
@@ -1442,12 +1442,12 @@ periods:
    indefinitely in the kernel without scheduling-clock interrupts, which
    defeats the above need_resched() strategem. RCU will therefore
    invoke resched_cpu() on any ``nohz_full`` CPUs still holding out
-   after 109 milliseconds.
+   after 109 milliseconds.
 #. In kernels built with ``CONFIG_RCU_BOOST=y``, if a given task that
    has been preempted within an RCU read-side critical section is
-   holding out for more than 500 milliseconds, RCU will resort to
+   holding out for more than 500 milliseconds, RCU will resort to
    priority boosting.
-#. If a CPU is still holding out 10 seconds into the grace period, RCU
+#. If a CPU is still holding out 10 seconds into the grace period, RCU
    will invoke resched_cpu() on it regardless of its ``nohz_full``
    state.
 
@@ -1579,7 +1579,7 @@ period.
 Software-Engineering Requirements
 ---------------------------------
 
-Between Murphy's Law and “To err is human”, it is necessary to guard
+Between Murphy's Law and "To err is human", it is necessary to guard
 against mishaps and misuse:
 
 #. It is all too easy to forget to use rcu_read_lock() everywhere
@@ -1626,7 +1626,7 @@ against mishaps and misuse:
    `patch <https://lore.kernel.org/r/20100319013024.GA28456@Krystal>`__.
 #. An infinite loop in an RCU read-side critical section will eventually
    trigger an RCU CPU stall warning splat, with the duration of
-   “eventually” being controlled by the ``RCU_CPU_STALL_TIMEOUT``
+   "eventually" being controlled by the ``RCU_CPU_STALL_TIMEOUT``
    ``Kconfig`` option, or, alternatively, by the
    ``rcupdate.rcu_cpu_stall_timeout`` boot/sysfs parameter. However, RCU
    is not obligated to produce this splat unless there is a grace period
@@ -1704,7 +1704,7 @@ Configuration
 
 RCU's goal is automatic configuration, so that almost nobody needs to
 worry about RCU's ``Kconfig`` options. And for almost all users, RCU
-does in fact work well “out of the box.”
+does in fact work well "out of the box."
 
 However, there are specialized use cases that are handled by kernel boot
 parameters and ``Kconfig`` options. Unfortunately, the ``Kconfig``
@@ -1733,7 +1733,7 @@ listings.
 
 RCU must therefore wait for a given CPU to actually come online before
 it can allow itself to believe that the CPU actually exists. The
-resulting “ghost CPUs” (which are never going to come online) cause a
+resulting "ghost CPUs" (which are never going to come online) cause a
 number of `interesting
 complications <https://paulmck.livejournal.com/37494.html>`__.
 
@@ -1789,7 +1789,7 @@ normally.
 | **Answer**:                                                           |
 +-----------------------------------------------------------------------+
 | Very carefully!                                                       |
-| During the “dead zone” between the time that the scheduler spawns the |
+| During the "dead zone" between the time that the scheduler spawns the |
 | first task and the time that all of RCU's kthreads have been spawned, |
 | all synchronous grace periods are handled by the expedited            |
 | grace-period mechanism. At runtime, this expedited mechanism relies   |
@@ -1824,7 +1824,7 @@ Some Linux-kernel architectures can enter an interrupt handler from
 non-idle process context, and then just never leave it, instead
 stealthily transitioning back to process context. This trick is
 sometimes used to invoke system calls from inside the kernel. These
-“half-interrupts” mean that RCU has to be very careful about how it
+"half-interrupts" mean that RCU has to be very careful about how it
 counts interrupt nesting levels. I learned of this requirement the hard
 way during a rewrite of RCU's dyntick-idle code.
 
@@ -1921,7 +1921,7 @@ and go. It is of course illegal to use any RCU API member from an
 offline CPU, with the exception of `SRCU <Sleepable RCU_>`__ read-side
 critical sections. This requirement was present from day one in
 DYNIX/ptx, but on the other hand, the Linux kernel's CPU-hotplug
-implementation is “interesting.”
+implementation is "interesting."
 
 The Linux-kernel CPU-hotplug implementation has notifiers that are used
 to allow the various kernel subsystems (including RCU) to respond
@@ -2268,7 +2268,7 @@ remain zero during all phases of grace-period processing, and that bit
 happens to map to the bottom bit of the ``rcu_head`` structure's
 ``->next`` field. RCU makes this guarantee as long as call_rcu() is
 used to post the callback, as opposed to kfree_rcu() or some future
-“lazy” variant of call_rcu() that might one day be created for
+"lazy" variant of call_rcu() that might one day be created for
 energy-efficiency purposes.
 
 That said, there are limits. RCU requires that the ``rcu_head``
@@ -2281,7 +2281,7 @@ architecture provides only two-byte alignment, and thus acts as
 alignment's least common denominator.
 
 The reason for reserving the bottom bit of pointers to ``rcu_head``
-structures is to leave the door open to “lazy” callbacks whose
+structures is to leave the door open to "lazy" callbacks whose
 invocations can safely be deferred. Deferring invocation could
 potentially have energy-efficiency benefits, but only if the rate of
 non-lazy callbacks decreases significantly for some important workload.
@@ -2399,7 +2399,7 @@ single flavor. The read-side API remains, and continues to disable
 softirq and to be accounted for by lockdep. Much of the material in this
 section is therefore strictly historical in nature.
 
-The softirq-disable (AKA “bottom-half”, hence the “_bh” abbreviations)
+The softirq-disable (AKA "bottom-half", hence the "_bh" abbreviations)
 flavor of RCU, or *RCU-bh*, was developed by Dipankar Sarma to provide a
 flavor of RCU that could withstand the network-based denial-of-service
 attacks researched by Robert Olsson. These attacks placed so much
@@ -2458,7 +2458,7 @@ effect of also waiting for all pre-existing interrupt and NMI handlers.
 However, there are legitimate preemptible-RCU implementations that do
 not have this property, given that any point in the code outside of an
 RCU read-side critical section can be a quiescent state. Therefore,
-*RCU-sched* was created, which follows “classic” RCU in that an
+*RCU-sched* was created, which follows "classic" RCU in that an
 RCU-sched grace period waits for pre-existing interrupt and NMI
 handlers. In kernels built with ``CONFIG_PREEMPTION=n``, the RCU and
 RCU-sched APIs have identical implementations, while kernels built with
@@ -2490,8 +2490,8 @@ and local_irq_restore(), and so on.
 Sleepable RCU
 ~~~~~~~~~~~~~
 
-For well over a decade, someone saying “I need to block within an RCU
-read-side critical section” was a reliable indication that this someone
+For well over a decade, someone saying "I need to block within an RCU
+read-side critical section" was a reliable indication that this someone
 did not understand RCU. After all, if you are always blocking in an RCU
 read-side critical section, you can probably afford to use a
 higher-overhead synchronization mechanism. However, that changed with
@@ -2507,7 +2507,7 @@ this structure must be passed in to each SRCU function, for example,
 structure. The key benefit of these domains is that a slow SRCU reader
 in one domain does not delay an SRCU grace period in some other domain.
 That said, one consequence of these domains is that read-side code must
-pass a “cookie” from srcu_read_lock() to srcu_read_unlock(), for
+pass a "cookie" from srcu_read_lock() to srcu_read_unlock(), for
 example, as follows:
 
    ::
@@ -2536,9 +2536,9 @@ period to elapse. For example, this results in a self-deadlock:
        5 synchronize_srcu(&ss);
        6 srcu_read_unlock(&ss, idx);
 
-However, if line 5 acquired a mutex that was held across a
+However, if line 5 acquired a mutex that was held across a
 synchronize_srcu() for domain ``ss``, deadlock would still be
-possible. Furthermore, if line 5 acquired a mutex that was held across a
+possible. Furthermore, if line 5 acquired a mutex that was held across a
 synchronize_srcu() for some other domain ``ss1``, and if an
 ``ss1``-domain SRCU read-side critical section acquired another mutex
 that was held across as ``ss``-domain synchronize_srcu(), deadlock
@@ -2557,7 +2557,7 @@ memory barrier.
 Also unlike other RCU flavors, synchronize_srcu() may **not** be
 invoked from CPU-hotplug notifiers, due to the fact that SRCU grace
 periods make use of timers and the possibility of timers being
-temporarily “stranded” on the outgoing CPU. This stranding of timers
+temporarily "stranded" on the outgoing CPU. This stranding of timers
 means that timers posted to the outgoing CPU will not fire until late in
 the CPU-hotplug process. The problem is that if a notifier is waiting on
 an SRCU grace period, that grace period is waiting on a timer, and that
@@ -2573,7 +2573,7 @@ period has the side effect of expediting all prior grace periods that
 have not yet completed. (But please note that this is a property of the
 current implementation, not necessarily of future implementations.) In
 addition, if SRCU has been idle for longer than the interval specified
-by the ``srcutree.exp_holdoff`` kernel boot parameter (25 microseconds
+by the ``srcutree.exp_holdoff`` kernel boot parameter (25 microseconds
 by default), and if a synchronize_srcu() invocation ends this idle
 period, that invocation will be automatically expedited.
 
@@ -2619,7 +2619,7 @@ from the cache, an SRCU grace period will be very likely to have elapsed.
 Tasks RCU
 ~~~~~~~~~
 
-Some forms of tracing use “trampolines” to handle the binary rewriting
+Some forms of tracing use "trampolines" to handle the binary rewriting
 required to install different types of probes. It would be good to be
 able to free old trampolines, which sounds like a job for some form of
 RCU. However, because it is necessary to be able to install a trace
@@ -2687,7 +2687,7 @@ your architecture should also benefit from the
 number of CPUs in a socket, NUMA node, or whatever. If the number of
 CPUs is too large, use a fraction of the number of CPUs. If the number
 of CPUs is a large prime number, well, that certainly is an
-“interesting” architectural choice! More flexible arrangements might be
+"interesting" architectural choice! More flexible arrangements might be
 considered, but only if ``rcutree.rcu_fanout_leaf`` has proven
 inadequate, and only if the inadequacy has been demonstrated by a
 carefully run and realistic system-level workload.
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols
  2021-05-12 12:50 [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols Mauro Carvalho Chehab
  2021-05-12 12:50 ` [PATCH v2 40/40] docs: RCU: " Mauro Carvalho Chehab
@ 2021-05-12 14:14 ` Theodore Ts'o
  2021-05-12 15:17   ` Mauro Carvalho Chehab
  2021-05-12 17:07 ` David Woodhouse
  2 siblings, 1 reply; 14+ messages in thread
From: Theodore Ts'o @ 2021-05-12 14:14 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Linux Doc Mailing List, linux-kernel, Jonathan Corbet,
	Mali DP Maintainers, alsa-devel, coresight, dri-devel, intel-gfx,
	intel-wired-lan, keyrings, kvm, linux-acpi, linux-arm-kernel,
	linux-edac, linux-ext4, linux-f2fs-devel, linux-hwmon, linux-iio,
	linux-input, linux-integrity, linux-media, linux-pci, linux-pm,
	linux-rdma, linux-sgx, linux-usb, mjpeg-users, netdev, rcu

On Wed, May 12, 2021 at 02:50:04PM +0200, Mauro Carvalho Chehab wrote:
> v2:
> - removed EM/EN DASH conversion from this patchset;

Are you still thinking about doing the

EN DASH --> "--"
EM DASH --> "---"

conversion?  That's not going to change what the documentation will
look like in the HTML and PDF output forms, and I think it would make
life easier for people who are reading and editing the Documentation/*
files in text form.

				- Ted

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols
  2021-05-12 14:14 ` [PATCH v2 00/40] " Theodore Ts'o
@ 2021-05-12 15:17   ` Mauro Carvalho Chehab
  2021-05-12 17:12     ` David Woodhouse
  0 siblings, 1 reply; 14+ messages in thread
From: Mauro Carvalho Chehab @ 2021-05-12 15:17 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Linux Doc Mailing List, linux-kernel, Jonathan Corbet,
	Mali DP Maintainers, alsa-devel, coresight, dri-devel, intel-gfx,
	intel-wired-lan, keyrings, kvm, linux-acpi, linux-arm-kernel,
	linux-edac, linux-ext4, linux-f2fs-devel, linux-hwmon, linux-iio,
	linux-input, linux-integrity, linux-media, linux-pci, linux-pm,
	linux-rdma, linux-sgx, linux-usb, mjpeg-users, netdev, rcu

Em Wed, 12 May 2021 10:14:44 -0400
"Theodore Ts'o" <tytso@mit.edu> escreveu:

> On Wed, May 12, 2021 at 02:50:04PM +0200, Mauro Carvalho Chehab wrote:
> > v2:
> > - removed EM/EN DASH conversion from this patchset;  
> 
> Are you still thinking about doing the
> 
> EN DASH --> "--"
> EM DASH --> "---"
> 
> conversion?  

Yes, but I intend to submit it on a separate patch series, probably after
having this one merged. Let's first clean up the bulk of the
conversion-generated UTF-8 char noise ;-)

> That's not going to change what the documentation will
> look like in the HTML and PDF output forms, and I think it would make
> life easier for people are reading and editing the Documentation/*
> files in text form.

Agreed. I'm also considering adding a couple of cases of this char:

	- U+2026 ('…'): HORIZONTAL ELLIPSIS

As Sphinx also replaces "..." with HORIZONTAL ELLIPSIS.

-

Anyway, I'm opting to submit those separately because it seems
that at least some maintainers added EM/EN DASH intentionally.

So, it may generate case-by-case discussions.

Also, IMO, at least a couple of EN/EM DASH cases would be better served 
with a single hyphen.

Thanks,
Mauro

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols
  2021-05-12 12:50 [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols Mauro Carvalho Chehab
  2021-05-12 12:50 ` [PATCH v2 40/40] docs: RCU: " Mauro Carvalho Chehab
  2021-05-12 14:14 ` [PATCH v2 00/40] " Theodore Ts'o
@ 2021-05-12 17:07 ` David Woodhouse
  2021-05-14  8:21   ` Mauro Carvalho Chehab
  2 siblings, 1 reply; 14+ messages in thread
From: David Woodhouse @ 2021-05-12 17:07 UTC (permalink / raw)
  To: Mauro Carvalho Chehab, Linux Doc Mailing List
  Cc: linux-kernel, Jonathan Corbet, Mali DP Maintainers, alsa-devel,
	coresight, dri-devel, intel-gfx, intel-wired-lan, keyrings, kvm,
	linux-acpi, linux-arm-kernel, linux-edac, linux-ext4,
	linux-f2fs-devel, linux-hwmon, linux-iio, linux-input,
	linux-integrity, linux-media, linux-pci, linux-pm, linux-rdma,
	linux-sgx, linux-usb, mjpeg-users, netdev, rcu

[-- Attachment #1: Type: text/plain, Size: 1534 bytes --]

Your title 'Use ASCII subset' is now at least a bit *closer* to
describing what the patches are actually doing, but it's still a bit
misleading because you're only doing it for *some* characters.

And the wording is still indicative of a fundamentally *misguided*
motivation for doing any of this. Your commit comments should be about
fixing a specific thing, nothing to do with "use ASCII subset", which
is pointless in itself.

On Wed, 2021-05-12 at 14:50 +0200, Mauro Carvalho Chehab wrote:
> Such conversion tools - plus some text editor like LibreOffice  or similar  - have
> a set of rules that turns some typed ASCII characters into UTF-8 alternatives,
> for instance converting commas into curly commas and adding non-breakable
> spaces. All of those are meant to produce better results when the text is
> displayed in HTML or PDF formats.

And don't we render our documentation into HTML or PDF formats? Are
some of those non-breaking spaces not actually *useful* for their
intended purpose?

> While it is perfectly fine to use UTF-8 characters in Linux, and specially at
> the documentation,  it is better to  stick to the ASCII subset  on such
> particular case,  due to a couple of reasons:
> 
> 1. it makes life easier for tools like grep;

Barely, as noted, because of things like line feeds.

> 2. they easier to edit with the some commonly used text/source
>    code editors.

That is nonsense. Any but the most broken and/or anachronistic
environments and editors will be just fine.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5174 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols
  2021-05-12 15:17   ` Mauro Carvalho Chehab
@ 2021-05-12 17:12     ` David Woodhouse
  0 siblings, 0 replies; 14+ messages in thread
From: David Woodhouse @ 2021-05-12 17:12 UTC (permalink / raw)
  To: Mauro Carvalho Chehab, Theodore Ts'o
  Cc: Linux Doc Mailing List, linux-kernel, Jonathan Corbet,
	Mali DP Maintainers, alsa-devel, coresight, dri-devel, intel-gfx,
	intel-wired-lan, keyrings, kvm, linux-acpi, linux-arm-kernel,
	linux-edac, linux-ext4, linux-f2fs-devel, linux-hwmon, linux-iio,
	linux-input, linux-integrity, linux-media, linux-pci, linux-pm,
	linux-rdma, linux-sgx, linux-usb, mjpeg-users, netdev, rcu

[-- Attachment #1: Type: text/plain, Size: 1744 bytes --]

On Wed, 2021-05-12 at 17:17 +0200, Mauro Carvalho Chehab wrote:
> Em Wed, 12 May 2021 10:14:44 -0400
> "Theodore Ts'o" <tytso@mit.edu> escreveu:
> 
> > On Wed, May 12, 2021 at 02:50:04PM +0200, Mauro Carvalho Chehab wrote:
> > > v2:
> > > - removed EM/EN DASH conversion from this patchset;  
> > 
> > Are you still thinking about doing the
> > 
> > EN DASH --> "--"
> > EM DASH --> "---"
> > 
> > conversion?  
> 
> Yes, but I intend to submit it on a separate patch series, probably after
> having this one merged. Let's first cleanup the large part of the 
> conversion-generated UTF-8 char noise ;-)
> 
> > That's not going to change what the documentation will
> > look like in the HTML and PDF output forms, and I think it would make
> > life easier for people are reading and editing the Documentation/*
> > files in text form.
> 
> Agreed. I'm also considering to add a couple of cases of this char:
> 
> 	- U+2026 ('…'): HORIZONTAL ELLIPSIS
> 
> As Sphinx also replaces "..." into HORIZONTAL ELLIPSIS.

Er, what?

The *only* part of this whole enterprise that actually seemed to make
even a tiny bit of sense — rather than seeming like a thinly veiled
retrospective excuse for dragging us back in time by 30 years — was the
bit about making it easier to grep.

But if I understand you correctly, you're talking about using something
like C trigraphs to represent the perfectly reasonable text emdash
character ("—") as two hyphen-minuses ("--") in the source code of the
documentation? Isn't that going to achieve precisely the *opposite*? If
I select some text in the HTML output of the docs and then search for
it in the source code, that's going to *stop* it matching my search?


[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5174 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols
  2021-05-12 17:07 ` David Woodhouse
@ 2021-05-14  8:21   ` Mauro Carvalho Chehab
  2021-05-14  9:06     ` David Woodhouse
  0 siblings, 1 reply; 14+ messages in thread
From: Mauro Carvalho Chehab @ 2021-05-14  8:21 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Linux Doc Mailing List, linux-kernel, Jonathan Corbet,
	Mali DP Maintainers, alsa-devel, coresight, dri-devel, intel-gfx,
	intel-wired-lan, keyrings, kvm, linux-acpi, linux-arm-kernel,
	linux-edac, linux-ext4, linux-f2fs-devel, linux-hwmon, linux-iio,
	linux-input, linux-integrity, linux-media, linux-pci, linux-pm,
	linux-rdma, linux-sgx, linux-usb, mjpeg-users, netdev, rcu

Em Wed, 12 May 2021 18:07:04 +0100
David Woodhouse <dwmw2@infradead.org> escreveu:

> On Wed, 2021-05-12 at 14:50 +0200, Mauro Carvalho Chehab wrote:
> > Such conversion tools - plus some text editor like LibreOffice  or similar  - have
> > a set of rules that turns some typed ASCII characters into UTF-8 alternatives,
> > for instance converting commas into curly commas and adding non-breakable
> > spaces. All of those are meant to produce better results when the text is
> > displayed in HTML or PDF formats.  
> 
> And don't we render our documentation into HTML or PDF formats? 

Yes.

> Are
> some of those non-breaking spaces not actually *useful* for their
> intended purpose?

No.

The thing is: non-breaking space can cause a lot of problems.

We even had to disable Sphinx's use of non-breaking spaces for
PDF output, as it was producing bad LaTeX/PDF output.

See, commit: 3b4c963243b1 ("docs: conf.py: adjust the LaTeX document output")

The aforementioned patch disables Sphinx's default behavior of
using NON-BREAKABLE SPACE on literal blocks and strings, via this
special setting: "parsedliteralwraps=true".
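
Just to make the mechanism concrete, the setting is passed to the Sphinx
LaTeX builder. A hedged sketch (not the verbatim kernel conf.py; check the
commit above for what is really done there):

    # Hedged illustration only -- see commit 3b4c963243b1 for what the
    # kernel's Documentation/conf.py actually does.  'sphinxsetup' is the
    # Sphinx LaTeX option string; parsedliteralwraps=true makes parsed-literal
    # blocks wrap long lines instead of relying on no-break spaces.
    latex_elements = {
        'sphinxsetup': 'parsedliteralwraps=true',
    }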

When NON-BREAKABLE SPACE was used in the PDF output, several parts of
the media uAPI docs overflowed the document margins by far,
causing text to be truncated.

So, please **don't add NON-BREAKABLE SPACE**, unless you test
(and keep testing, from time to time) whether the output in all
formats properly supports it across different Sphinx versions.

-

Also, most of those came from conversion tools, together with other
eccentricities, like the usage of the U+FEFF (BOM) character at the
start of some documents. The remaining ones seem to have come from
cut-and-paste.

For instance, bibliographic references (there are a couple of
those in the media docs) sometimes contain a NON-BREAKABLE SPACE. I'm
pretty sure those came from cut-and-pasting the document titles
from the original PDF documents or web pages that are
referenced.

> > While it is perfectly fine to use UTF-8 characters in Linux, and specially at
> > the documentation,  it is better to  stick to the ASCII subset  on such
> > particular case,  due to a couple of reasons:
> > 
> > 1. it makes life easier for tools like grep;  
> 
> Barely, as noted, because of things like line feeds.

You can use grep with "-z" to search for multi-line strings(*), like:

	$ grep -Pzl 'grace period started,\s*then' $(find Documentation/ -type f)
	Documentation/RCU/Design/Data-Structures/Data-Structures.rst

(*) Unfortunately, while "git grep" also has a "-z" flag, it
    seems that this is (currently?) broken with regard to handling
    multi-line matches:

	$ git grep -Pzl 'grace period started,\s*then'
	$
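
If you prefer to avoid GNU grep extensions altogether, the same multi-line
search can be done with a few lines of Python. This is a hypothetical
helper, shown only to illustrate the point; it is not part of this series:

    # Hypothetical stdlib-only equivalent of the "grep -Pzl" example above:
    # match a phrase even when it is wrapped across lines in the sources.
    import pathlib
    import re

    pattern = re.compile(r'grace\s+period\s+started,\s*then')

    for path in pathlib.Path('Documentation').rglob('*.rst'):
        text = path.read_text(encoding='utf-8', errors='replace')
        if pattern.search(text):
            print(path)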

> > 2. they easier to edit with the some commonly used text/source
> >    code editors.  
> 
> That is nonsense. Any but the most broken and/or anachronistic
> environments and editors will be just fine.

Not really.

I do use a lot of UTF-8 here, as I type texts in Portuguese, but I rely
on the US-intl keyboard settings, which allow me to type "'a" to get á.
However, there's no shortcut for non-Latin UTF-8 code points, as far as I know.

So, if I needed to type a curly quote in the text editors I normally
use for development (vim, nano, kate), I would need to cut-and-paste
it from somewhere[1].

[1] If I have a table with UTF-8 codes handy, I could type the UTF-8 
    number manually... However, it seems that this is currently broken 
    at least on Fedora 33 (with Mate Desktop and US intl keyboard with 
    dead keys).

    Here, <CTRL><SHIFT>U is not working. No idea why. I haven't 
    tested it for *years*, as I didn't see any reason why I would
    need to type UTF-8 characters by numbers until we started
    this thread.
 
In practice, in the very rare cases where I need to write
non-Latin UTF-8 chars (maybe once a year or so, like when I
need to use a Greek letter or some weird symbol), the chances
are high that I won't remember its UTF-8 code.

So, if I need to spend time searching for a specific symbol, after
finding it, I just cut-and-paste it.

But even in the best-case scenario, where I know the UTF-8 code and
<CTRL><SHIFT>U works, if I wanted to use, for instance, curly
quotes, the keystroke sequence would be:

	<CTRL><SHIFT>U201csome string<CTRL><SHIFT>U201d

That's a lot harder to type, and has a higher chance of
mistakenly adding a wrong symbol, than just typing:

	"some string"

Knowing that both will produce *exactly* the same output, why
should I bother doing it the hard way?

-

Now, I'm not arguing that you can't use whatever UTF-8 symbol you
want on your docs. I'm just saying that, now that the conversion 
is over and a lot of documents ended up with some UTF-8 characters
by accident, it is time for a cleanup.

Thanks,
Mauro

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols
  2021-05-14  8:21   ` Mauro Carvalho Chehab
@ 2021-05-14  9:06     ` David Woodhouse
  2021-05-14 11:08       ` Edward Cree
  2021-05-15  8:22       ` Mauro Carvalho Chehab
  0 siblings, 2 replies; 14+ messages in thread
From: David Woodhouse @ 2021-05-14  9:06 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Linux Doc Mailing List, linux-kernel, Jonathan Corbet,
	Mali DP Maintainers, alsa-devel, coresight, dri-devel, intel-gfx,
	intel-wired-lan, keyrings, kvm, linux-acpi, linux-arm-kernel,
	linux-edac, linux-ext4, linux-f2fs-devel, linux-hwmon, linux-iio,
	linux-input, linux-integrity, linux-media, linux-pci, linux-pm,
	linux-rdma, linux-sgx, linux-usb, mjpeg-users, netdev, rcu

[-- Attachment #1: Type: text/plain, Size: 6843 bytes --]

On Fri, 2021-05-14 at 10:21 +0200, Mauro Carvalho Chehab wrote:
> Em Wed, 12 May 2021 18:07:04 +0100
> David Woodhouse <dwmw2@infradead.org> escreveu:
> 
> > On Wed, 2021-05-12 at 14:50 +0200, Mauro Carvalho Chehab wrote:
> > > Such conversion tools - plus some text editor like LibreOffice  or similar  - have
> > > a set of rules that turns some typed ASCII characters into UTF-8 alternatives,
> > > for instance converting commas into curly commas and adding non-breakable
> > > spaces. All of those are meant to produce better results when the text is
> > > displayed in HTML or PDF formats.  
> > 
> > And don't we render our documentation into HTML or PDF formats? 
> 
> Yes.
> 
> > Are
> > some of those non-breaking spaces not actually *useful* for their
> > intended purpose?
> 
> No.
> 
> The thing is: non-breaking space can cause a lot of problems.
> 
> We even had to disable Sphinx usage of non-breaking space for
> PDF outputs, as this was causing bad LaTeX/PDF outputs.
> 
> See, commit: 3b4c963243b1 ("docs: conf.py: adjust the LaTeX document output")
> 
> The afore mentioned patch disables Sphinx default behavior of
> using NON-BREAKABLE SPACE on literal blocks and strings, using this
> special setting: "parsedliteralwraps=true".
> 
> When NON-BREAKABLE SPACE were used on PDF outputs, several parts of 
> the media uAPI docs were violating the document margins by far,
> causing texts to be truncated.
> 
> So, please **don't add NON-BREAKABLE SPACE**, unless you test
> (and keep testing it from time to time) if outputs on all
> formats are properly supporting it on different Sphinx versions.

And there you have a specific change with a specific fix. Nothing to do
with whether NON-BREAKABLE SPACE is ∉ ASCII, and *certainly* nothing to
do with the fact that, like *every* character in every kernel file
except the *binary* files, it's representable in UTF-8.

By all means fix the specific characters which are typographically
wrong or which, like NON-BREAKABLE SPACE, cause problems for rendering
the documentation.


> Also, most of those came from conversion tools, together with other
> eccentricities, like the usage of U+FEFF (BOM) character at the
> start of some documents. The remaining ones seem to came from 
> cut-and-paste.

... or which are just entirely redundant and gratuitous, like a BOM in
an environment where all files are UTF-8 and never 16-bit encodings
anyway.

> > > While it is perfectly fine to use UTF-8 characters in Linux, and specially at
> > > the documentation,  it is better to  stick to the ASCII subset  on such
> > > particular case,  due to a couple of reasons:
> > > 
> > > 1. it makes life easier for tools like grep;  
> > 
> > Barely, as noted, because of things like line feeds.
> 
> You can use grep with "-z" to seek for multi-line strings(*), Like:
> 
> 	$ grep -Pzl 'grace period started,\s*then' $(find Documentation/ -type f)
> 	Documentation/RCU/Design/Data-Structures/Data-Structures.rst

Yeah, right. That works if you don't just use the text that you'll have
seen in the HTML/PDF "grace period started, then", and if you instead
craft a *regex* for it, replacing the spaces with '\s*'. Or is that
[[:space:]]* if you don't want to use the experimental Perl regex
feature?

 $ grep -zlr 'grace[[:space:]]\+period[[:space:]]\+started,[[:space:]]\+then' Documentation/RCU
Documentation/RCU/Design/Data-Structures/Data-Structures.rst

And without '-l' it'll obviously just give you the whole file. No '-A5
-B5' to see the surroundings... it's hardly a useful thing, is it?

> (*) Unfortunately, while "git grep" also has a "-z" flag, it
>     seems that this is (currently?) broken with regards of handling multilines:
> 
> 	$ git grep -Pzl 'grace period started,\s*then'
> 	$

Even better. So no, multiline grep isn't really a commonly usable
feature at all.

This is why we prefer to put user-visible strings on one line in C
source code, even if it takes the lines over 80 characters — to allow
for grep to find them.

> > > 2. they easier to edit with the some commonly used text/source
> > >    code editors.  
> > 
> > That is nonsense. Any but the most broken and/or anachronistic
> > environments and editors will be just fine.
> 
> Not really.
> 
> I do use a lot of UTF-8 here, as I type texts in Portuguese, but I rely
> on the US-intl keyboard settings, that allow me to type as "'a" for á.
> However, there's no shortcut for non-Latin UTF-codes, as far as I know.
> 
> So, if would need to type a curly comma on the text editors I normally 
> use for development (vim, nano, kate), I would need to cut-and-paste
> it from somewhere[1].

That's entirely irrelevant. You don't need to be able to *type* every
character that you see in front of you, as long as your editor will
render it correctly and perhaps let you cut/paste it as you're editing
the document if you're moving things around.

> [1] If I have a table with UTF-8 codes handy, I could type the UTF-8 
>     number manually... However, it seems that this is currently broken 
>     at least on Fedora 33 (with Mate Desktop and US intl keyboard with 
>     dead keys).
> 
>     Here, <CTRL><SHIFT>U is not working. No idea why. I haven't 
>     test it for *years*, as I din't see any reason why I would
>     need to type UTF-8 characters by numbers until we started
>     this thread.

Please provide the bug number for this; I'd like to track it.

> But even in the best case scenario where I know the UTF-8 and
> <CTRL><SHIFT>U works, if I wanted to use, for instance, a curly
> comma, the keystroke sequence would be:
> 
> 	<CTRL><SHIFT>U201csome string<CTRL><SHIFT>U201d
> 
> That's a lot harder than typing and has a higher chances of
> mistakenly add a wrong symbol than just typing:
> 
> 	"some string"
> 
> Knowing that both will produce *exactly* the same output, why
> should I bother doing it the hard way?

Nobody's asked you to do it the "hard way". That's completely
irrelevant to the discussion we were having.

> Now, I'm not arguing that you can't use whatever UTF-8 symbol you
> want on your docs. I'm just saying that, now that the conversion 
> is over and a lot of documents ended getting some UTF-8 characters
> by accident, it is time for a cleanup.

All text documents are *full* of UTF-8 characters. If there is a file
in the source code which has *any* non-UTF8, we call that a 'binary
file'.

Again, if you want to make specific fixes like removing non-breaking
spaces and byte order marks, with specific reasons, then those make
sense. But it's got very little to do with UTF-8 and how easy it is to
type them. And the excuse you've put in the commit comment for your
patches is utterly bogus.


[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5174 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols
  2021-05-14  9:06     ` David Woodhouse
@ 2021-05-14 11:08       ` Edward Cree
  2021-05-14 14:18         ` Mauro Carvalho Chehab
  2021-05-15  8:22       ` Mauro Carvalho Chehab
  1 sibling, 1 reply; 14+ messages in thread
From: Edward Cree @ 2021-05-14 11:08 UTC (permalink / raw)
  To: David Woodhouse, Mauro Carvalho Chehab
  Cc: Linux Doc Mailing List, linux-kernel, Jonathan Corbet,
	Mali DP Maintainers, alsa-devel, coresight, dri-devel, intel-gfx,
	intel-wired-lan, keyrings, kvm, linux-acpi, linux-arm-kernel,
	linux-edac, linux-ext4, linux-f2fs-devel, linux-hwmon, linux-iio,
	linux-input, linux-integrity, linux-media, linux-pci, linux-pm,
	linux-rdma, linux-sgx, linux-usb, mjpeg-users, netdev, rcu

> On Fri, 2021-05-14 at 10:21 +0200, Mauro Carvalho Chehab wrote:
>> I do use a lot of UTF-8 here, as I type texts in Portuguese, but I rely
>> on the US-intl keyboard settings, that allow me to type as "'a" for á.
>> However, there's no shortcut for non-Latin UTF-codes, as far as I know.
>>
>> So, if would need to type a curly comma on the text editors I normally 
>> use for development (vim, nano, kate), I would need to cut-and-paste
>> it from somewhere

For anyone who doesn't know about it: X has this wonderful thing called
 the Compose key[1].  For instance, type ⎄--- to get —, or ⎄<" for “.
Much more mnemonic than Unicode codepoints; and you can extend it with
 user-defined sequences in your ~/.XCompose file.
(I assume Wayland supports all this too, but don't know the details.)

On 14/05/2021 10:06, David Woodhouse wrote:
> Again, if you want to make specific fixes like removing non-breaking
> spaces and byte order marks, with specific reasons, then those make
> sense. But it's got very little to do with UTF-8 and how easy it is to
> type them. And the excuse you've put in the commit comment for your
> patches is utterly bogus.

+1

-ed

[1] https://en.wikipedia.org/wiki/Compose_key

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols
  2021-05-14 11:08       ` Edward Cree
@ 2021-05-14 14:18         ` Mauro Carvalho Chehab
  0 siblings, 0 replies; 14+ messages in thread
From: Mauro Carvalho Chehab @ 2021-05-14 14:18 UTC (permalink / raw)
  To: Edward Cree
  Cc: David Woodhouse, Linux Doc Mailing List, linux-kernel,
	Jonathan Corbet, Mali DP Maintainers, alsa-devel, coresight,
	dri-devel, intel-gfx, intel-wired-lan, keyrings, kvm, linux-acpi,
	linux-arm-kernel, linux-edac, linux-ext4, linux-f2fs-devel,
	linux-hwmon, linux-iio, linux-input, linux-integrity,
	linux-media, linux-pci, linux-pm, linux-rdma, linux-sgx,
	linux-usb, mjpeg-users, netdev, rcu

Em Fri, 14 May 2021 12:08:36 +0100
Edward Cree <ecree.xilinx@gmail.com> escreveu:

> For anyone who doesn't know about it: X has this wonderful thing called
>  the Compose key[1].  For instance, type ⎄--- to get —, or ⎄<" for “.
> Much more mnemonic than Unicode codepoints; and you can extend it with
>  user-defined sequences in your ~/.XCompose file.

Good tip. I haven't used Compose for years, as US-intl with dead keys is
enough for 99.999% of my needs. 

Btw, at least on Fedora with Mate, Compose is disabled by default. It has
to be enabled first, using the same tool that allows changing the keyboard
layout[1].

Yet, typing an EN DASH, for example, would be "<compose>--.", which is 4
keystrokes instead of just two ('--'). It means twice the effort ;-)

[1] KDE, GNOME, Mate, ... have different ways to enable it and to
    select which key is considered <compose>:

	https://dry.sailingissues.com/us-international-keyboard-layout.html
	https://help.ubuntu.com/community/ComposeKey

Thanks,
Mauro

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols
  2021-05-14  9:06     ` David Woodhouse
  2021-05-14 11:08       ` Edward Cree
@ 2021-05-15  8:22       ` Mauro Carvalho Chehab
  2021-05-15  9:24         ` David Woodhouse
  1 sibling, 1 reply; 14+ messages in thread
From: Mauro Carvalho Chehab @ 2021-05-15  8:22 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Linux Doc Mailing List, linux-kernel, Jonathan Corbet,
	Mali DP Maintainers, alsa-devel, coresight, intel-gfx,
	intel-wired-lan, keyrings, kvm, linux-acpi, linux-arm-kernel,
	linux-edac, linux-ext4, linux-f2fs-devel, linux-hwmon, linux-iio,
	linux-input, linux-integrity, linux-media, linux-pci, linux-pm,
	linux-rdma, linux-sgx, linux-usb, mjpeg-users, netdev, rcu

Em Fri, 14 May 2021 10:06:01 +0100
David Woodhouse <dwmw2@infradead.org> escreveu:

> On Fri, 2021-05-14 at 10:21 +0200, Mauro Carvalho Chehab wrote:
> > Em Wed, 12 May 2021 18:07:04 +0100
> > David Woodhouse <dwmw2@infradead.org> escreveu:
> >   
> > > On Wed, 2021-05-12 at 14:50 +0200, Mauro Carvalho Chehab wrote:  
> > > > Such conversion tools - plus some text editor like LibreOffice  or similar  - have
> > > > a set of rules that turns some typed ASCII characters into UTF-8 alternatives,
> > > > for instance converting commas into curly commas and adding non-breakable
> > > > spaces. All of those are meant to produce better results when the text is
> > > > displayed in HTML or PDF formats.    
> > > 
> > > And don't we render our documentation into HTML or PDF formats?   
> > 
> > Yes.
> >   
> > > Are
> > > some of those non-breaking spaces not actually *useful* for their
> > > intended purpose?  
> > 
> > No.
> > 
> > The thing is: non-breaking space can cause a lot of problems.
> > 
> > We even had to disable Sphinx usage of non-breaking space for
> > PDF outputs, as this was causing bad LaTeX/PDF outputs.
> > 
> > See, commit: 3b4c963243b1 ("docs: conf.py: adjust the LaTeX document output")
> > 
> > The afore mentioned patch disables Sphinx default behavior of
> > using NON-BREAKABLE SPACE on literal blocks and strings, using this
> > special setting: "parsedliteralwraps=true".
> > 
> > When NON-BREAKABLE SPACE were used on PDF outputs, several parts of 
> > the media uAPI docs were violating the document margins by far,
> > causing texts to be truncated.
> > 
> > So, please **don't add NON-BREAKABLE SPACE**, unless you test
> > (and keep testing it from time to time) if outputs on all
> > formats are properly supporting it on different Sphinx versions.  
> 
> And there you have a specific change with a specific fix. Nothing to do
> with whether NON-BREAKABLE SPACE is ∉ ASCII, and *certainly* nothing to
> do with the fact that, like *every* character in every kernel file
> except the *binary* files, it's representable in UTF-8.
> 
> By all means fix the specific characters which are typographically
> wrong or which, like NON-BREAKABLE SPACE, cause problems for rendering
> the documentation.
> 
> 
> > Also, most of those came from conversion tools, together with other
> > eccentricities, like the usage of U+FEFF (BOM) character at the
> > start of some documents. The remaining ones seem to came from 
> > cut-and-paste.  
> 
> ... or which are just entirely redundant and gratuitous, like a BOM in
> an environment where all files are UTF-8 and never 16-bit encodings
> anyway.

Agreed.

> 
> > > > While it is perfectly fine to use UTF-8 characters in Linux, and specially at
> > > > the documentation,  it is better to  stick to the ASCII subset  on such
> > > > particular case,  due to a couple of reasons:
> > > > 
> > > > 1. it makes life easier for tools like grep;    
> > > 
> > > Barely, as noted, because of things like line feeds.  
> > 
> > You can use grep with "-z" to seek for multi-line strings(*), Like:
> > 
> > 	$ grep -Pzl 'grace period started,\s*then' $(find Documentation/ -type f)
> > 	Documentation/RCU/Design/Data-Structures/Data-Structures.rst  
> 
> Yeah, right. That works if you don't just use the text that you'll have
> seen in the HTML/PDF "grace period started, then", and if you instead
> craft a *regex* for it, replacing the spaces with '\s*'. Or is that
> [[:space:]]* if you don't want to use the experimental Perl regex
> feature?
> 
>  $ grep -zlr 'grace[[:space:]]\+period[[:space:]]\+started,[[:space:]]\+then' Documentation/RCU
> Documentation/RCU/Design/Data-Structures/Data-Structures.rst
> 
> And without '-l' it'll obviously just give you the whole file. No '-A5
> -B5' to see the surroundings... it's hardly a useful thing, is it?
> 
> > (*) Unfortunately, while "git grep" also has a "-z" flag, it
> >     seems that this is (currently?) broken with regards of handling multilines:
> > 
> > 	$ git grep -Pzl 'grace period started,\s*then'
> > 	$  
> 
> Even better. So no, multiline grep isn't really a commonly usable
> feature at all.
> 
> This is why we prefer to put user-visible strings on one line in C
> source code, even if it takes the lines over 80 characters — to allow
> for grep to find them.

Makes sense, but in the case of documentation, this is a little more
complex than that. 

Btw, the theme used by default when building html[1] has a search
box (written in JavaScript) that can find multi-line
patterns, working somewhat similarly to "git grep foo -a bar".

[1] https://github.com/readthedocs/sphinx_rtd_theme

> > [1] If I have a table with UTF-8 codes handy, I could type the UTF-8 
> >     number manually... However, it seems that this is currently broken 
> >     at least on Fedora 33 (with Mate Desktop and US intl keyboard with 
> >     dead keys).
> > 
> >     Here, <CTRL><SHIFT>U is not working. No idea why. I haven't 
> >     test it for *years*, as I din't see any reason why I would
> >     need to type UTF-8 characters by numbers until we started
> >     this thread.  
> 
> Please provide the bug number for this; I'd like to track it.

Just opened a BZ and added you on Cc.

> > Now, I'm not arguing that you can't use whatever UTF-8 symbol you
> > want on your docs. I'm just saying that, now that the conversion 
> > is over and a lot of documents ended getting some UTF-8 characters
> > by accident, it is time for a cleanup.  
> 
> All text documents are *full* of UTF-8 characters. If there is a file
> in the source code which has *any* non-UTF8, we call that a 'binary
> file'.
> 
> Again, if you want to make specific fixes like removing non-breaking
> spaces and byte order marks, with specific reasons, then those make
> sense. But it's got very little to do with UTF-8 and how easy it is to
> type them. And the excuse you've put in the commit comment for your
> patches is utterly bogus.

Let's take one step back, in order to return to the intent of this
UTF-8 cleanup, as the discussions here are not centered on the patches,
but instead on what to do and why.

-

This discussion started originally at linux-doc ML.

While discussing an issue seen when a machine's locale was not set
to UTF-8 on a build VM, we discovered that some converted docs ended up
with BOM characters. Those specific changes were introduced by some
of my conversion patches, probably converted via pandoc.

So, I went ahead in order to check what other possible weird things
were introduced by the conversion, where several scripts and tools
were used on files that already had a different markup.

I actually checked the current UTF-8 issues, and asked people at
linux-doc to comment on which of those are valid use cases, and which
should be replaced by plain ASCII.
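
(In case anyone wants to re-run that audit locally, a rough sketch of the
kind of scan involved is below. This is a hypothetical re-creation for
illustration only, not the exact script I used.)

    # Hypothetical re-creation of the audit: count a few suspect code points
    # in the ReST sources, skipping translations.
    import collections
    import pathlib
    import unicodedata

    SUSPECTS = '\u00a0\ufeff\u2018\u2019\u201c\u201d\u2010\u2013\u2014\u2026'
    counts = collections.Counter()

    for path in pathlib.Path('Documentation').rglob('*.rst'):
        if 'translations' in path.parts:
            continue
        for ch in path.read_text(encoding='utf-8'):
            if ch in SUSPECTS:
                counts[ch] += 1

    for ch, n in counts.most_common():
        print(f'U+{ord(ch):04x} ({unicodedata.name(ch)}): {n}')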

Basically, the current situation (at docs/docs-next) for the
ReST files under Documentation/, excluding translations, is:

1. Spaces and BOM

	- U+00a0 (' '): NO-BREAK SPACE
	- U+feff (''): ZERO WIDTH NO-BREAK SPACE (BOM)

Based on the discussions there and on this thread, those should be
dropped, as BOM is useless and NO-BREAK SPACE can cause problems
in the html/pdf output;

2. Symbols

	- U+00a9 ('©'): COPYRIGHT SIGN
	- U+00ac ('¬'): NOT SIGN
	- U+00ae ('®'): REGISTERED SIGN
	- U+00b0 ('°'): DEGREE SIGN
	- U+00b1 ('±'): PLUS-MINUS SIGN
	- U+00b2 ('²'): SUPERSCRIPT TWO
	- U+00b5 ('µ'): MICRO SIGN
	- U+03bc ('μ'): GREEK SMALL LETTER MU
	- U+00b7 ('·'): MIDDLE DOT
	- U+00bd ('½'): VULGAR FRACTION ONE HALF
	- U+2122 ('™'): TRADE MARK SIGN
	- U+2264 ('≤'): LESS-THAN OR EQUAL TO
	- U+2265 ('≥'): GREATER-THAN OR EQUAL TO
	- U+2b0d ('⬍'): UP DOWN BLACK ARROW

Those seem OK on my eyes.

On a side note, both MICRO SIGN and GREEK SMALL LETTER MU are
used in several docs to represent microseconds, microvolts and
microamperes. If we write an orientation document, it probably
makes sense to recommend using MICRO SIGN in such cases.

3. Latin

	- U+00c7 ('Ç'): LATIN CAPITAL LETTER C WITH CEDILLA
	- U+00df ('ß'): LATIN SMALL LETTER SHARP S
	- U+00e1 ('á'): LATIN SMALL LETTER A WITH ACUTE
	- U+00e4 ('ä'): LATIN SMALL LETTER A WITH DIAERESIS
	- U+00e6 ('æ'): LATIN SMALL LETTER AE
	- U+00e7 ('ç'): LATIN SMALL LETTER C WITH CEDILLA
	- U+00e9 ('é'): LATIN SMALL LETTER E WITH ACUTE
	- U+00ea ('ê'): LATIN SMALL LETTER E WITH CIRCUMFLEX
	- U+00eb ('ë'): LATIN SMALL LETTER E WITH DIAERESIS
	- U+00f3 ('ó'): LATIN SMALL LETTER O WITH ACUTE
	- U+00f4 ('ô'): LATIN SMALL LETTER O WITH CIRCUMFLEX
	- U+00f6 ('ö'): LATIN SMALL LETTER O WITH DIAERESIS
	- U+00f8 ('ø'): LATIN SMALL LETTER O WITH STROKE
	- U+00fa ('ú'): LATIN SMALL LETTER U WITH ACUTE
	- U+00fc ('ü'): LATIN SMALL LETTER U WITH DIAERESIS
	- U+00fd ('ý'): LATIN SMALL LETTER Y WITH ACUTE
	- U+011f ('ğ'): LATIN SMALL LETTER G WITH BREVE
	- U+0142 ('ł'): LATIN SMALL LETTER L WITH STROKE

Those should be kept as well, as they're used for non-English names.

4. arrows and box drawing symbols:
	- U+2191 ('↑'): UPWARDS ARROW
	- U+2192 ('→'): RIGHTWARDS ARROW
	- U+2193 ('↓'): DOWNWARDS ARROW

	- U+2500 ('─'): BOX DRAWINGS LIGHT HORIZONTAL
	- U+2502 ('│'): BOX DRAWINGS LIGHT VERTICAL
	- U+2514 ('└'): BOX DRAWINGS LIGHT UP AND RIGHT
	- U+251c ('├'): BOX DRAWINGS LIGHT VERTICAL AND RIGHT

Also should be kept.

In summary, based on the discussions we have had so far, I suspect that
there's not much to be discussed for the above cases.

So, I'll post a v3 of this series, changing only:

	- U+00a0 (' '): NO-BREAK SPACE
	- U+feff (''): ZERO WIDTH NO-BREAK SPACE (BOM)

---

Now, this specific patch series also addresses this extra case:

5. curly quotes:

	- U+2018 ('‘'): LEFT SINGLE QUOTATION MARK
	- U+2019 ('’'): RIGHT SINGLE QUOTATION MARK
	- U+201c ('“'): LEFT DOUBLE QUOTATION MARK
	- U+201d ('”'): RIGHT DOUBLE QUOTATION MARK

IMO, those should be replaced by ASCII quotes: ' and ".

The rationale is simple: 

- most were introduced during the conversion from DocBook,
  markdown and LaTeX;
- they don't add any extra value, as using "foo" or “foo” means
  the same thing;
- Sphinx already uses "fancy" quotes in the output anyway (see the
  conf.py sketch below).
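
Just as an illustration of which Sphinx knobs drive that replacement
(these are the documented defaults, shown here for reference only;
nothing in the kernel needs to change them):

    # Illustrative conf.py defaults only -- not a proposed change.  docutils'
    # SmartQuotes transform is what turns "foo", --, --- and ... into curly
    # quotes, en/em dashes and an ellipsis in the generated output.
    smartquotes = True           # enable the transform (the default)
    smartquotes_action = 'qDe'   # q: quotes, D: en/em dashes, e: ellipses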

I guess I will put this on a separate series, as this is not a bug
fix, but just a cleanup from the conversion work.

I'll re-post those cleanups on a separate series, for patch-by-patch
review.

---

The remaining cases are future work, outside the scope of this v2:

6. Hyphen/Dashes and ellipsis

	- U+2212 ('−'): MINUS SIGN
	- U+00ad ('­'): SOFT HYPHEN
	- U+2010 ('‐'): HYPHEN

	    Those three are used in places where a normal ASCII hyphen/minus
	    should be used instead. There are even a couple of C files which
	    use them instead of '-' in comments.

	    IMO, those are fixes/cleanups from conversions and bad cut-and-paste.

	- U+2013 ('–'): EN DASH
	- U+2014 ('—'): EM DASH
	- U+2026 ('…'): HORIZONTAL ELLIPSIS

	    Those are auto-replaced by Sphinx from "--", "---" and "...",
	    respectively.

	    I guess those are a matter of personal preference about
	    whether to use ASCII or UTF-8.

            My personal preference (and Ted seems to have a similar
	    opinion) is to let Sphinx do the conversion.

	    For those, I intend to post a separate series, to be
	    reviewed patch by patch, as this is really a matter
	    of personal taste. We'll hardly reach a consensus here.

7. math symbols:

	- U+00d7 ('×'): MULTIPLICATION SIGN

	   This one is mostly used to describe video resolutions, but this is
	   a smaller changeset than the ones that use the letter "x".

	- U+2217 ('∗'): ASTERISK OPERATOR

	   This is used only here:
		Documentation/filesystems/ext4/blockgroup.rst:filesystem size to 2^21 ∗ 2^27 = 2^48bytes or 256TiB.

	   Probably added by some conversion tool. IMO, this one should
	   also be replaced by an ASCII asterisk.

I guess I'll post a patch for the ASTERISK OPERATOR.

Thanks,
Mauro

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols
  2021-05-15  8:22       ` Mauro Carvalho Chehab
@ 2021-05-15  9:24         ` David Woodhouse
  2021-05-15 11:23           ` Mauro Carvalho Chehab
  0 siblings, 1 reply; 14+ messages in thread
From: David Woodhouse @ 2021-05-15  9:24 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Linux Doc Mailing List, linux-kernel, Jonathan Corbet,
	Mali DP Maintainers, alsa-devel, coresight, intel-gfx,
	intel-wired-lan, keyrings, kvm, linux-acpi, linux-arm-kernel,
	linux-edac, linux-ext4, linux-f2fs-devel, linux-hwmon, linux-iio,
	linux-input, linux-integrity, linux-media, linux-pci, linux-pm,
	linux-rdma, linux-sgx, linux-usb, mjpeg-users, netdev, rcu

[-- Attachment #1: Type: text/plain, Size: 9347 bytes --]

On Sat, 2021-05-15 at 10:22 +0200, Mauro Carvalho Chehab wrote:
> > >      Here, <CTRL><SHIFT>U is not working. No idea why. I haven't 
> > >      test it for *years*, as I din't see any reason why I would
> > >      need to type UTF-8 characters by numbers until we started
> > >      this thread.  
> > 
> > Please provide the bug number for this; I'd like to track it.
> 
> Just opened a BZ and added you as c/c.

Thanks.

> Let's take one step back, in order to return to the intents of this
> UTF-8, as the discussions here are not centered into the patches, but
> instead, on what to do and why.
> 
> -
> 
> This discussion started originally at linux-doc ML.
> 
> While discussing about an issue when machine's locale was not set
> to UTF-8 on a build VM, 

Stop. Stop *right* there before you go any further.

The machine's locale should have *nothing* to do with anything.

When you view this email, it comes with a Content-Type: header which
explicitly tells you the character set that the message is encoded in, 
which I think I've set to UTF-7.

When showing you the mail, your system has to interpret the bytes of
the content using *that* character set encoding. Anything else is just
fundamentally broken. Your system locale has *nothing* to do with it.

If your local system is running EBCDIC that doesn't *matter*.

Now, the character set encoding of the kernel source and documentation
text files is UTF-8. It isn't EBCDIC, it isn't ISO8859-15 or any of the
legacy crap. It isn't system locale either, unless your system locale
*happens* to be UTF-8.

UTF-8 *happens* to be compatible with ASCII for the limited subset of
characters which ASCII contains, sure — just as *many*, but not all, of
the legacy 8-bit character sets are also a superset of ASCII's 7 bits.

But if the docs contain *any* characters which aren't ASCII, and you
build them with a broken build system which assumes ASCII, you are
going to produce wrong output. There is *no* substitute for fixing the
*actual* bug which started all this, and ensuring your build system (or
whatever) uses the *actual* encoding of the text files it's processing,
instead of making stupid and bogus assumptions based on a system
default.

You concede keeping U+00a9 © COPYRIGHT SIGN. And that's encoded in UTF-
8 as two bytes 0xC2 0xA9. If some broken build system *assumes* those
bytes are ISO8859-15 it'll take them to mean two separate characters

    U+00C2 Â LATIN CAPITAL LETTER A WITH CIRCUMFLEX
    U+00A9 © COPYRIGHT SIGN
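
A tiny illustration of that failure mode, in plain Python (nothing
kernel-specific about it):

    # The same two bytes, decoded with the right and the wrong character set.
    raw = '\u00a9'.encode('utf-8')      # b'\xc2\xa9', i.e. © encoded as UTF-8

    print(raw.decode('utf-8'))          # '©'  -- correct
    print(raw.decode('iso8859_15'))     # 'Â©' -- what the broken tooling shows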

Your broken build system that started all this is never going to be
*anything* other than broken. You can only paper over the cracks and
make it slightly less likely that people will notice in the common
case, perhaps? That's all you do by *reducing* the use of non-ASCII,
unless you're going to drag us all the way back to the 1980s and
strictly limit us to pure ASCII, using the equivalent of trigraphs for
*anything* outside the 0-127 character ranges.

And even if you did that, systems which use EBCDIC as their local
encoding would *still* be broken, if they have the same bug you started
from. Because EBCDIC isn't compatible with ASCII *even* for the first 7
bits.


> we discovered that some converted docs ended
> with BOM characters. Those specific changes were introduced by some
> of my convert patches, probably converted via pandoc.
> 
> So, I went ahead in order to check what other possible weird things
> were introduced by the conversion, where several scripts and tools
> were used on files that had already a different markup.
> 
> I actually checked the current UTF-8 issues, and asked people at
> linux-doc to comment what of those are valid usecases, and what
> should be replaced by plain ASCII.

No, these aren't "UTF-8 issues". Those are *conversion* issues, and
would still be there if the output of the conversion had been UTF-7,
UCS-16, etc. Or *even* if the output of the conversion had been
trigraph-like stuff like '--' for emdash. It's *nothing* to do with the
encoding that we happen to be using.

Fixing the conversion issues makes a lot of sense. Try to do it without
making *any* mention of UTF-8 at all.

> In summary, based on the discussions we have so far, I suspect that
> there's not much to be discussed for the above cases.
> 
> So, I'll post a v3 of this series, changing only:
> 
>         - U+00a0 (' '): NO-BREAK SPACE
>         - U+feff (''): ZERO WIDTH NO-BREAK SPACE (BOM)

Ack, as long as those make *no* mention of UTF-8. Except perhaps to
note that BOM is redundant because UTF-8 doesn't have a byteorder.

> ---
> 
> Now, this specific patch series address also this extra case:
> 
> 5. curly commas:
> 
>         - U+2018 ('‘'): LEFT SINGLE QUOTATION MARK
>         - U+2019 ('’'): RIGHT SINGLE QUOTATION MARK
>         - U+201c ('“'): LEFT DOUBLE QUOTATION MARK
>         - U+201d ('”'): RIGHT DOUBLE QUOTATION MARK
> 
> IMO, those should be replaced by ASCII commas: ' and ".
> 
> The rationale is simple: 
> 
> - most were introduced during the conversion from Docbook,
>   markdown and LaTex;
> - they don't add any extra value, as using "foo" of “foo” means
>   the same thing;
> - Sphinx already use "fancy" commas at the output. 
> 
> I guess I will put this on a separate series, as this is not a bug
> fix, but just a cleanup from the conversion work.
> 
> I'll re-post those cleanups on a separate series, for patch per patch
> review.

Makes sense. 

The left/right quotation marks exist to make human-readable text much
easier to read, but the key point here is that they are redundant
because the tooling already emits them in the *output* so they don't
need to be in the source, yes?

As long as the tooling gets it *right* and uses them where it should,
that seems sane enough.

However, it *does* break 'grep', because if I cut/paste a snippet from
the documentation and try to grep for it, it'll no longer match.

Consistency is good, but perhaps we should actually be consistent the
other way round and always use the left/right versions in the source
*instead* of relying on the tooling, to make searches work better?
You claimed to care about that, right?

> The remaining cases are future work, outside the scope of this v2:
> 
> 6. Hyphen/Dashes and ellipsis
> 
>         - U+2212 ('−'): MINUS SIGN
>         - U+00ad ('­'): SOFT HYPHEN
>         - U+2010 ('‐'): HYPHEN
> 
>             Those three are used on places where a normal ASCII hyphen/minus
>             should be used instead. There are even a couple of C files which
>             use them instead of '-' on comments.
> 
>             IMO are fixes/cleanups from conversions and bad cut-and-paste.

That seems to make sense.

>         - U+2013 ('–'): EN DASH
>         - U+2014 ('—'): EM DASH
>         - U+2026 ('…'): HORIZONTAL ELLIPSIS
> 
>             Those are auto-replaced by Sphinx from "--", "---" and "...",
>             respectively.
> 
>             I guess those are a matter of personal preference about
>             weather using ASCII or UTF-8.
> 
>             My personal preference (and Ted seems to have a similar
>             opinion) is to let Sphinx do the conversion.
> 
>             For those, I intend to post a separate series, to be
>             reviewed patch per patch, as this is really a matter
>             of personal taste. Hardly we'll reach a consensus here.
> 

Again using the trigraph-like '--' and '...' instead of just using the
plain text '—' and '…' breaks searching, because what's in the output
doesn't match the input. Again consistency is good, but perhaps we
should standardise on just putting these in their plain text form
instead of the trigraphs?

> 7. math symbols:
> 
>         - U+00d7 ('×'): MULTIPLICATION SIGN
> 
>            This one is used mostly do describe video resolutions, but this is
>            on a smaller changeset than the ones that use "x" letter.

I think standardising on × for video resolutions in documentation would
make it look better and be easier to read.

> 
>         - U+2217 ('∗'): ASTERISK OPERATOR
> 
>            This is used only here:
>                 Documentation/filesystems/ext4/blockgroup.rst:filesystem size to 2^21 ∗ 2^27 = 2^48bytes or 256TiB.
> 
>            Probably added by some conversion tool. IMO, this one should
>            also be replaced by an ASCII asterisk.
> 
> I guess I'll post a patch for the ASTERISK OPERATOR.

That makes sense.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5174 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols
  2021-05-15  9:24         ` David Woodhouse
@ 2021-05-15 11:23           ` Mauro Carvalho Chehab
  2021-05-15 12:02             ` David Woodhouse
  0 siblings, 1 reply; 14+ messages in thread
From: Mauro Carvalho Chehab @ 2021-05-15 11:23 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Linux Doc Mailing List, linux-kernel, Jonathan Corbet,
	Mali DP Maintainers, alsa-devel, coresight, intel-gfx,
	intel-wired-lan, keyrings, kvm, linux-acpi, linux-arm-kernel,
	linux-edac, linux-ext4, linux-f2fs-devel, linux-hwmon, linux-iio,
	linux-input, linux-integrity, linux-media, linux-pci, linux-pm,
	linux-rdma, linux-sgx, linux-usb, mjpeg-users, netdev, rcu

Em Sat, 15 May 2021 10:24:28 +0100
David Woodhouse <dwmw2@infradead.org> escreveu:

> On Sat, 2021-05-15 at 10:22 +0200, Mauro Carvalho Chehab wrote:
> > > >      Here, <CTRL><SHIFT>U is not working. No idea why. I haven't 
> > > >      test it for *years*, as I din't see any reason why I would
> > > >      need to type UTF-8 characters by numbers until we started
> > > >      this thread.    
> > > 
> > > Please provide the bug number for this; I'd like to track it.  
> > 
> > Just opened a BZ and added you as c/c.  
> 
> Thanks.
> 
> > Let's take one step back, in order to return to the intents of this
> > UTF-8, as the discussions here are not centered into the patches, but
> > instead, on what to do and why.
> > 
> > -
> > 
> > This discussion started originally at linux-doc ML.
> > 
> > While discussing about an issue when machine's locale was not set
> > to UTF-8 on a build VM,   
> 
> Stop. Stop *right* there before you go any further.
> 
> The machine's locale should have *nothing* to do with anything.
> 
> When you view this email, it comes with a Content-Type: header which
> explicitly tells you the character set that the message is encoded in, 
> which I think I've set to UTF-7.
> 
> When showing you the mail, your system has to interpret the bytes of
> the content using *that* character set encoding. Anything else is just
> fundamentally broken. Your system locale has *nothing* to do with it.
> 
> If your local system is running EBCDIC that doesn't *matter*.
> 
> Now, the character set encoding of the kernel source and documentation
> text files is UTF-8. It isn't EBCDIC, it isn't ISO8859-15 or any of the
> legacy crap. It isn't system locale either, unless your system locale
> *happens* to be UTF-8.
> 
> UTF-8 *happens* to be compatible with ASCII for the limited subset of
> characters which ASCII contains, sure — just as *many*, but not all, of
> the legacy 8-bit character sets are also a superset of ASCII's 7 bits.
> 
> But if the docs contain *any* characters which aren't ASCII, and you
> build them with a broken build system which assumes ASCII, you are
> going to produce wrong output. There is *no* substitute for fixing the
> *actual* bug which started all this, and ensuring your build system (or
> whatever) uses the *actual* encoding of the text files it's processing,
> instead of making stupid and bogus assumptions based on a system
> default.
> 
> You concede keeping U+00a9 © COPYRIGHT SIGN. And that's encoded in UTF-
> 8 as two bytes 0xC2 0xA9. If some broken build system *assumes* those
> bytes are ISO8859-15 it'll take them to mean two separate characters
> 
>     U+00C2 Â LATIN CAPITAL LETTER A WITH CIRCUMFLEX
>     U+00A9 © COPYRIGHT SIGN
> 
> Your broken build system that started all this is never going to be
> *anything* other than broken. You can only paper over the cracks and
> make it slightly less likely that people will notice in the common
> case, perhaps? That's all you do by *reducing* the use of non-ASCII,
> unless you're going to drag us all the way back to the 1980s and
> strictly limit us to pure ASCII, using the equivalent of trigraphs for
> *anything* outside the 0-127 character ranges.
> 
> And even if you did that, systems which use EBCDIC as their local
> encoding would *still* be broken, if they have the same bug you started
> from. Because EBCDIC isn't compatible with ASCII *even* for the first 7
> bits.

Now, you're making a lot of wrong assumptions here ;-)

1. I didn't report the bug. Another person reported it at linux-doc;
2. I fully agree with you that the build system should work fine
   whatever locale the machine has;
3. Sphinx's supported charset for the ReST input and its output is UTF-8.

Despite that, it seems that there are some issues in the build
tool set, at least under certain circumstances. One of the hypotheses
mentioned there is that the Sphinx logger crashes when it
tries to print a UTF-8 message while the machine's locale is not UTF-8.
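
(Purely to illustrate that hypothesis - this is not the actual Sphinx
code path - the kind of crash would look like this:)

    # Encoding a warning that contains curly quotes for an ASCII-only locale
    # raises UnicodeEncodeError, which is the sort of crash hypothesized above.
    try:
        'WARNING: “literal block” needs attention'.encode('ascii')
    except UnicodeEncodeError as exc:
        print('the logger would die with:', exc)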

That said, I tried forcing a non-UTF-8 locale on some tests I did to try
to reproduce it, but the build went fine.

So, I was not able to reproduce the issue.

This series doesn't address the issue. It is just a side effect of the
discussions, where, while trying to understand the bug, we noticed
several UTF-8 characters introduced during the conversion that weren't
the original author's intent.

So, with regard to the original bug report, if I find a way to
reproduce it and to address it, I'll post a separate series.

If you want to discuss this issue further, let's not discuss here, but
instead, at the linux-doc thread:

	https://lore.kernel.org/linux-doc/20210506103913.GE6564@kitsune.suse.cz/

> 
> 
> > we discovered that some converted docs ended
> > with BOM characters. Those specific changes were introduced by some
> > of my convert patches, probably converted via pandoc.
> > 
> > So, I went ahead in order to check what other possible weird things
> > were introduced by the conversion, where several scripts and tools
> > were used on files that had already a different markup.
> > 
> > I actually checked the current UTF-8 issues, and asked people at
> > linux-doc to comment what of those are valid usecases, and what
> > should be replaced by plain ASCII.  
> 
> No, these aren't "UTF-8 issues". Those are *conversion* issues, and
> would still be there if the output of the conversion had been UTF-7,
> UCS-16, etc. Or *even* if the output of the conversion had been
> trigraph-like stuff like '--' for emdash. It's *nothing* to do with the
> encoding that we happen to be using.

Yes. That's what I said.

> 
> Fixing the conversion issues makes a lot of sense. Try to do it without
> making *any* mention of UTF-8 at all.
> 
> > In summary, based on the discussions we have so far, I suspect that
> > there's not much to be discussed for the above cases.
> > 
> > So, I'll post a v3 of this series, changing only:
> > 
> >         - U+00a0 (' '): NO-BREAK SPACE
> >         - U+feff (''): ZERO WIDTH NO-BREAK SPACE (BOM)  
> 
> Ack, as long as those make *no* mention of UTF-8. Except perhaps to
> note that BOM is redundant because UTF-8 doesn't have a byteorder.

I need to say which UTF-8 code points are replaced, as otherwise the patch
wouldn't make much sense to reviewers, since both U+00a0 and ordinary
spaces are displayed the same way, and BOM is invisible.

> 
> > ---
> > 
> > Now, this specific patch series address also this extra case:
> > 
> > 5. curly commas:
> > 
> >         - U+2018 ('‘'): LEFT SINGLE QUOTATION MARK
> >         - U+2019 ('’'): RIGHT SINGLE QUOTATION MARK
> >         - U+201c ('“'): LEFT DOUBLE QUOTATION MARK
> >         - U+201d ('”'): RIGHT DOUBLE QUOTATION MARK
> > 
> > IMO, those should be replaced by ASCII commas: ' and ".
> > 
> > The rationale is simple: 
> > 
> > - most were introduced during the conversion from Docbook,
> >   markdown and LaTex;
> > - they don't add any extra value, as using "foo" of “foo” means
> >   the same thing;
> > - Sphinx already use "fancy" commas at the output. 
> > 
> > I guess I will put this on a separate series, as this is not a bug
> > fix, but just a cleanup from the conversion work.
> > 
> > I'll re-post those cleanups on a separate series, for patch per patch
> > review.  
> 
> Makes sense. 
> 
> The left/right quotation marks exists to make human-readable text much
> easier to read, but the key point here is that they are redundant
> because the tooling already emits them in the *output* so they don't
> need to be in the source, yes?

Yes.

> As long as the tooling gets it *right* and uses them where it should,
> that seems sane enough.
> 
> However, it *does* break 'grep', because if I cut/paste a snippet from
> the documentation and try to grep for it, it'll no longer match.

> 
> Consistency is good, but perhaps we should actually be consistent the
> other way round and always use the left/right versions in the source
> *instead* of relying on the tooling, to make searches work better?
> You claimed to care about that, right?

That's indeed a good point. It would be interesting to have more
opinions on that matter.

There are a couple of things to consider:

1. It is (usually) trivial to discover what document produced a
   certain page in the documentation.

   For instance, if you want to know where the text on this
   page came from, or to grep for some text from it:

	https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html

   You can click on the "View page source" button at the top of the page.
   It will show the .rst file used to produce it:

	https://www.kernel.org/doc/html/latest/_sources/admin-guide/cgroup-v2.rst.txt

2. If all you want is to search for some text inside the docs,
   you can use the "Search docs" box, which is part of the
   Read the Docs theme.

3. The kernel has several extensions for Sphinx, in order to make life
   easier for kernel developers:

	Documentation/sphinx/automarkup.py
	Documentation/sphinx/cdomain.py
	Documentation/sphinx/kernel_abi.py
	Documentation/sphinx/kernel_feat.py
	Documentation/sphinx/kernel_include.py
	Documentation/sphinx/kerneldoc.py
	Documentation/sphinx/kernellog.py
	Documentation/sphinx/kfigure.py
	Documentation/sphinx/load_config.py
	Documentation/sphinx/maintainers_include.py
	Documentation/sphinx/rstFlatTable.py

4. Those extensions (in particular automarkup and kerneldoc) also
   dynamically change things during the ReST conversion, which may
   cause grep to not find a match.

5. Some PDF tools like evince will match curly quotes if you
   type an ASCII quote in their search boxes.

6. Some developers prefer to only deal with the files inside the
   kernel tree. Those are very unlikely to grep for curly quotes.

My opinion on that matter is that we should make life easier for
developers who grep the text files, as the ones using the web interface
are already served by the HTML search box or by tools like evince.

So, my vote here is to keep quotation marks as plain ASCII.
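
Just to make the cut-and-paste mismatch concrete (a made-up example,
not taken from any real document):

    # Made-up example: text copied from the rendered HTML carries curly
    # quotes, so it no longer matches the plain-ASCII .rst source.
    rendered = 'pass the \u201cquiet\u201d option'   # what the HTML shows
    source = 'pass the "quiet" option'               # what the .rst contains

    print(rendered in source)   # False - a literal grep finds nothing
    print(rendered.replace('\u201c', '"')
                  .replace('\u201d', '"') in source) # True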

> 
> > The remaining cases are future work, outside the scope of this v2:
> > 
> > 6. Hyphen/Dashes and ellipsis
> > 
> >         - U+2212 ('−'): MINUS SIGN
> >         - U+00ad ('­'): SOFT HYPHEN
> >         - U+2010 ('‐'): HYPHEN
> > 
> >             Those three are used in places where a normal ASCII hyphen/minus
> >             should be used instead. There are even a couple of C files which
> >             use them instead of '-' in comments.
> > 
> >             IMO these are fixes/cleanups from conversions and bad cut-and-paste.
> 
> That seems to make sense.
> 
> >         - U+2013 ('–'): EN DASH
> >         - U+2014 ('—'): EM DASH
> >         - U+2026 ('…'): HORIZONTAL ELLIPSIS
> > 
> >             Those are auto-replaced by Sphinx from "--", "---" and "...",
> >             respectively.
> > 
> >             I guess those are a matter of personal preference about
> >             whether to use ASCII or UTF-8.
> > 
> >             My personal preference (and Ted seems to have a similar
> >             opinion) is to let Sphinx do the conversion.
> > 
> >             For those, I intend to post a separate series, to be
> >             reviewed patch per patch, as this is really a matter
> >             of personal taste. We'll hardly reach a consensus here.
> >   
> 
> Again using the trigraph-like '--' and '...' instead of just using the
> plain text '—' and '…' breaks searching, because what's in the output
> doesn't match the input. Again consistency is good, but perhaps we
> should standardise on just putting these in their plain text form
> instead of the trigraphs?

Good point.

While I don't have any strong preference here, there's something that
annoys me with regard to EM/EN DASH:

With the monospaced fonts I'm using here - both in my mail client and
on my terminals - EM and EN DASH look *exactly* the same.
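
For what it's worth, a snippet like this (just an illustration) can
tell them apart when the glyphs render identically:

    # Illustration only: name the characters when the glyphs look alike.
    import unicodedata

    for ch in '\u2013\u2014-':
        print(f'U+{ord(ch):04X} {unicodedata.name(ch)}')
    # U+2013 EN DASH
    # U+2014 EM DASH
    # U+002D HYPHEN-MINUS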

> 
> > 7. math symbols:
> > 
> >         - U+00d7 ('×'): MULTIPLICATION SIGN
> > 
> >            This one is used mostly to describe video resolutions, but this is
> >            a smaller changeset than the ones that use the letter "x".
> 
> I think standardising on × for video resolutions in documentation would
> make it look better and be easier to read.
> 
> > 
> >         - U+2217 ('∗'): ASTERISK OPERATOR
> > 
> >            This is used only here:
> >                 Documentation/filesystems/ext4/blockgroup.rst:filesystem size to 2^21 ∗ 2^27 = 2^48bytes or 256TiB.
> > 
> >            Probably added by some conversion tool. IMO, this one should
> >            also be replaced by an ASCII asterisk.
> > 
> > I guess I'll post a patch for the ASTERISK OPERATOR.  
> 
> That makes sense.



Thanks,
Mauro

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols
  2021-05-15 11:23           ` Mauro Carvalho Chehab
@ 2021-05-15 12:02             ` David Woodhouse
  0 siblings, 0 replies; 14+ messages in thread
From: David Woodhouse @ 2021-05-15 12:02 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Linux Doc Mailing List, linux-kernel, Jonathan Corbet,
	Mali DP Maintainers, alsa-devel, coresight, intel-gfx,
	intel-wired-lan, keyrings, kvm, linux-acpi, linux-arm-kernel,
	linux-edac, linux-ext4, linux-f2fs-devel, linux-hwmon, linux-iio,
	linux-input, linux-integrity, linux-media, linux-pci, linux-pm,
	linux-rdma, linux-sgx, linux-usb, mjpeg-users, netdev, rcu

[-- Attachment #1: Type: text/plain, Size: 9607 bytes --]

On Sat, 2021-05-15 at 13:23 +0200, Mauro Carvalho Chehab wrote:
> Em Sat, 15 May 2021 10:24:28 +0100
> David Woodhouse <dwmw2@infradead.org> escreveu:
> > > Let's take one step back, in order to return to the intent of this
> > > UTF-8 work, as the discussions here are not centered on the patches,
> > > but instead on what to do and why.
> > > 
> > > This discussion started originally at linux-doc ML.
> > > 
> > > While discussing an issue where a machine's locale was not set
> > > to UTF-8 on a build VM,
> > 
> > Stop. Stop *right* there before you go any further.
> > 
> > The machine's locale should have *nothing* to do with anything.
>
> Now, you're making a lot of wrong assumptions here ;-)
> 
> 1. I didn't report the bug. Another person reported it at linux-doc;
> 2. I fully agree with you that the build system should work fine
>    whatever locale the machine has;
> 3. The charset Sphinx supports for ReST input, and the one it emits at
>    its output, is UTF-8.

OK, fine. So that's an unrelated issue really, and just happened to be
what historically triggered the discussion. Let's set it aside.

> > > I actually checked the current UTF-8 issues … 
> > 
> > No, these aren't "UTF-8 issues". Those are *conversion* issues, and 
> > … *nothing* to do with the encoding that we happen to be using.
> 
> Yes. That's what I said.

Er… I'm fairly sure you *did* call them "UTF-8 issues". Whatever.




> > 
> > Fixing the conversion issues makes a lot of sense. Try to do it without
> > making *any* mention of UTF-8 at all.
> > 
> > > In summary, based on the discussions we have so far, I suspect that
> > > there's not much to be discussed for the above cases.
> > > 
> > > So, I'll post a v3 of this series, changing only:
> > > 
> > >         - U+00a0 (' '): NO-BREAK SPACE
> > >         - U+feff (''): ZERO WIDTH NO-BREAK SPACE (BOM)  
> > 
> > Ack, as long as those make *no* mention of UTF-8. Except perhaps to
> > note that BOM is redundant because UTF-8 doesn't have a byteorder.
> 
> I need to say which UTF-8 code points are being replaced, as otherwise
> the patch wouldn't make much sense to reviewers: both U+00a0 and a
> regular space are displayed the same way, and the BOM is invisible.
> 

No. Again, this is *nothing* to do with UTF-8. The encoding we choose
to map between bytes in the file and characters is *utterly* irrelevant
here. If we were using UTF-7, UTF-16, or even (in the case of non-
breaking space) one of the legacy 8-bit charsets that includes it like
ISO8859-1, the issue would be precisely the same. 

It's about the *character* U+00A0 NO-BREAK SPACE; nothing to do with
UTF-8 at all. Don't mention UTF-8. It's *irrelevant* and just shows
that you haven't actually bothered to stop and do any critical thinking
about the matter at all.

As I said, the only time that it makes sense to mention UTF-8 in this
context is when talking about *why* the BOM is not needed. And even
then, you could say "because we *aren't* using an encoding where
endianness matters, such as UTF-16", instead of actually mentioning
UTF-8. Try it ☺
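
A quick sketch of that point (illustrative only):

    # In UTF-8 the BOM always encodes to the same fixed byte sequence,
    # so it carries no byte-order information; in UTF-16 the byte order
    # of the BOM is exactly what distinguishes the two variants.
    print('\ufeff'.encode('utf-8'))      # b'\xef\xbb\xbf'
    print('\ufeff'.encode('utf-16-be'))  # b'\xfe\xff'
    print('\ufeff'.encode('utf-16-le'))  # b'\xff\xfe'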

> > 
> > > ---
> > > 
> > > Now, this specific patch series also addresses this extra case:
> > > 
> > > 5. curly quotes:
> > > 
> > >         - U+2018 ('‘'): LEFT SINGLE QUOTATION MARK
> > >         - U+2019 ('’'): RIGHT SINGLE QUOTATION MARK
> > >         - U+201c ('“'): LEFT DOUBLE QUOTATION MARK
> > >         - U+201d ('”'): RIGHT DOUBLE QUOTATION MARK
> > > 
> > > IMO, those should be replaced by ASCII quotes: ' and ".
> > > 
> > > The rationale is simple: 
> > > 
> > > - most were introduced during the conversion from DocBook,
> > >   markdown and LaTeX;
> > > - they don't add any extra value, as using "foo" or “foo” means
> > >   the same thing;
> > > - Sphinx already uses "fancy" quotes in the output.
> > > 
> > > I guess I will put this on a separate series, as this is not a bug
> > > fix, but just a cleanup from the conversion work.
> > > 
> > > I'll re-post those cleanups on a separate series, for patch per patch
> > > review.  
> > 
> > Makes sense. 
> > 
> > The left/right quotation marks exist to make human-readable text much
> > easier to read, but the key point here is that they are redundant
> > because the tooling already emits them in the *output* so they don't
> > need to be in the source, yes?
> 
> Yes.
> 
> > As long as the tooling gets it *right* and uses them where it should,
> > that seems sane enough.
> > 
> > However, it *does* break 'grep', because if I cut/paste a snippet from
> > the documentation and try to grep for it, it'll no longer match.
> > 
> > Consistency is good, but perhaps we should actually be consistent the
> > other way round and always use the left/right versions in the source
> > *instead* of relying on the tooling, to make searches work better?
> > You claimed to care about that, right?
> 
> That's indeed a good point. It would be interesting to have more
> opinions on that matter.
> 
> There are a few things to consider:
> 
> 1. It is (usually) trivial to discover which document produced a
>    certain page of the documentation.
> 
>    For instance, if you want to know where the text of this page
>    came from, or to grep for some text in it:
> 
> 	https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html
> 
>    You can click on the "View page source" button on the first line.
>    It will show the .rst file used to produce it:
> 
> 	https://www.kernel.org/doc/html/latest/_sources/admin-guide/cgroup-v2.rst.txt
> 
> 2. If all you want is to search for some text inside the docs,
>    you can use the "Search docs" box, which is part of the
>    Read the Docs theme.
> 
> 3. The kernel has several Sphinx extensions, in order to make life
>    easier for kernel developers:
> 
> 	Documentation/sphinx/automarkup.py
> 	Documentation/sphinx/cdomain.py
> 	Documentation/sphinx/kernel_abi.py
> 	Documentation/sphinx/kernel_feat.py
> 	Documentation/sphinx/kernel_include.py
> 	Documentation/sphinx/kerneldoc.py
> 	Documentation/sphinx/kernellog.py
> 	Documentation/sphinx/kfigure.py
> 	Documentation/sphinx/load_config.py
> 	Documentation/sphinx/maintainers_include.py
> 	Documentation/sphinx/rstFlatTable.py
> 
> 4. Those extensions (in particular automarkup and kerneldoc) also
>    dynamically change things during the ReST conversion, which may
>    cause grep to not find a match.
> 
> 5. Some PDF tools like evince will match curly quotes if you
>    type an ASCII quote in their search boxes.
> 
> 6. Some developers prefer to only deal with the files inside the
>    kernel tree. Those are very unlikely to grep for curly quotes.
> 
> My opinion on that matter is that we should make life easier for
> developers who grep the text files, as the ones using the web interface
> are already served by the HTML search box or by tools like evince.
> 
> So, my vote here is to keep quotation marks as plain ASCII.

OK, but all your reasoning is about the *character* used, not the
encoding. So try to do it without mentioning ASCII, and especially
without mentioning UTF-8.

Your point is that the *character* is the one easily reachable on
standard keyboard layouts, and the one which people are most likely to
enter manually. It has *nothing* to do with charset encodings, so don't
conflate it with talking about charset encodings.
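
To put it another way (a tiny sketch, nothing more): the choice under
discussion is between two different *characters*; the encoding only
decides how each of them ends up as bytes:

    # Two distinct characters; the encoding merely decides which bytes
    # represent each of them.
    import unicodedata

    for ch in ('"', '\u201c'):
        print(f'U+{ord(ch):04X} {unicodedata.name(ch)}:',
              ch.encode('utf-8'), ch.encode('utf-16-le'))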

> 
> > 
> > > The remaining cases are future work, outside the scope of this v2:
> > > 
> > > 6. Hyphen/Dashes and ellipsis
> > > 
> > >         - U+2212 ('−'): MINUS SIGN
> > >         - U+00ad ('­'): SOFT HYPHEN
> > >         - U+2010 ('‐'): HYPHEN
> > > 
> > >             Those three are used in places where a normal ASCII hyphen/minus
> > >             should be used instead. There are even a couple of C files which
> > >             use them instead of '-' in comments.
> > > 
> > >             IMO these are fixes/cleanups from conversions and bad cut-and-paste.
> > 
> > That seems to make sense.
> > 
> > >         - U+2013 ('–'): EN DASH
> > >         - U+2014 ('—'): EM DASH
> > >         - U+2026 ('…'): HORIZONTAL ELLIPSIS
> > > 
> > >             Those are auto-replaced by Sphinx from "--", "---" and "...",
> > >             respectively.
> > > 
> > >             I guess those are a matter of personal preference about
> > >             whether to use ASCII or UTF-8.
> > > 
> > >             My personal preference (and Ted seems to have a similar
> > >             opinion) is to let Sphinx do the conversion.
> > > 
> > >             For those, I intend to post a separate series, to be
> > >             reviewed patch per patch, as this is really a matter
> > >             of personal taste. We'll hardly reach a consensus here.
> > >   
> > 
> > Again using the trigraph-like '--' and '...' instead of just using the
> > plain text '—' and '…' breaks searching, because what's in the output
> > doesn't match the input. Again consistency is good, but perhaps we
> > should standardise on just putting these in their plain text form
> > instead of the trigraphs?
> 
> Good point.
> 
> While I don't have any strong preference here, there's something that
> annoys me with regard to EM/EN DASH:
> 
> With the monospaced fonts I'm using here - both in my mail client and
> on my terminals - EM and EN DASH look *exactly* the same.

Interesting. They definitely show differently in my terminal, and in
the monospaced font in email.


[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5174 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2021-05-15 12:02 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-12 12:50 [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols Mauro Carvalho Chehab
2021-05-12 12:50 ` [PATCH v2 40/40] docs: RCU: " Mauro Carvalho Chehab
2021-05-12 14:14 ` [PATCH v2 00/40] " Theodore Ts'o
2021-05-12 15:17   ` Mauro Carvalho Chehab
2021-05-12 17:12     ` David Woodhouse
2021-05-12 17:07 ` David Woodhouse
2021-05-14  8:21   ` Mauro Carvalho Chehab
2021-05-14  9:06     ` David Woodhouse
2021-05-14 11:08       ` Edward Cree
2021-05-14 14:18         ` Mauro Carvalho Chehab
2021-05-15  8:22       ` Mauro Carvalho Chehab
2021-05-15  9:24         ` David Woodhouse
2021-05-15 11:23           ` Mauro Carvalho Chehab
2021-05-15 12:02             ` David Woodhouse

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).