kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
To: Edward Cree <ecree.xilinx@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>,
	David Woodhouse <dwmw2@infradead.org>,
	Linux Doc Mailing List <linux-doc@vger.kernel.org>,
	linux-kernel@vger.kernel.org, Jonathan Corbet <corbet@lwn.net>,
	alsa-devel@alsa-project.org, coresight@lists.linaro.org,
	dri-devel@lists.freedesktop.org, intel-gfx@lists.freedesktop.org,
	intel-wired-lan@lists.osuosl.org, keyrings@vger.kernel.org,
	kvm@vger.kernel.org, linux-acpi@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, linux-edac@vger.kernel.org,
	linux-ext4@vger.kernel.org,
	linux-f2fs-devel@lists.sourceforge.net,
	linux-fpga@vger.kernel.org, linux-hwmon@vger.kernel.org,
	linux-iio@vger.kernel.org, linux-input@vger.kernel.org,
	linux-integrity@vger.kernel.org, linux-media@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-pm@vger.kernel.org,
	linux-rdma@vger.kernel.org, linux-riscv@lists.infradead.org,
	linux-sgx@vger.kernel.org, linux-usb@vger.kernel.org,
	mjpeg-users@lists.sourceforge.net, netdev@vger.kernel.org,
	rcu@vger.kernel.org, x86@kernel.org
Subject: Re: [PATCH 00/53] Get rid of UTF-8 chars that can be mapped as ASCII
Date: Tue, 11 May 2021 11:00:02 +0200	[thread overview]
Message-ID: <20210511110002.2f187f01@coco.lan> (raw)
In-Reply-To: <ed65025c-1087-9672-7451-6d28e7ab8f92@gmail.com>

Em Mon, 10 May 2021 15:33:47 +0100
Edward Cree <ecree.xilinx@gmail.com> escreveu:

> On 10/05/2021 14:59, Matthew Wilcox wrote:
> > Most of these
> > UTF-8 characters come from latex conversions and really aren't
> > necessary (and are being used incorrectly).  
> I fully agree with fixing those.
> The cover-letter, however, gave the impression that that was not the
>  main purpose of this series; just, perhaps, a happy side-effect.

Sorry for the mess. The main reason why I wrote this series is because
there are lots of UTF-8 left-over chars from the ReST conversion.
See:
  - https://lore.kernel.org/linux-doc/20210507100435.3095f924@coco.lan/

A large set of the UTF-8 letf-over chars were due to my conversion work,
so I feel personally responsible to fix those ;-)

Yet, this series has two positive side effects:

 - it helps people needing to touch the documents using non-utf8 locales[1];
 - it makes easier to grep for a text;

[1] There are still some widely used distros nowadays (LTS ones?) that
    don't set UTF-8 as default. Last time I installed a Debian machine
    I had to explicitly set UTF-8 charset after install as the default
    were using ASCII encoding (can't remember if it was Debian 10 or an
    older version).

Unintentionally, I ended by giving emphasis to the non-utf8 instead of
giving emphasis to the conversion left-overs.

FYI, this patch series originated from a discussion at linux-doc,
reporting that Sphinx breaks when LANG is not set to utf-8[2]. That's
why I probably ended giving the wrong emphasis at the cover letter.

[2] See https://lore.kernel.org/linux-doc/20210506103913.GE6564@kitsune.suse.cz/
    for the original report. I strongly suspect that the VM set by Michal 
    to build the docs was using a distro that doesn't set UTF-8 as default.

    PS.: 
      I intend to prepare afterwards a separate fix to avoid Sphinx
      logger to crash during Kernel doc builds when the locale charset
      is not UTF-8, but I'm not too fluent in python. So, I need some
      time to check if are there a way to just avoid python log crashes
      without touching Sphinx code and without needing to trick it to 
      think that the machine's locale is UTF-8.

See: while there was just a single document originally stored at the
Kernel tree as a LaTeX document during the time we did the conversion
(cdrom-standard.tex), there are several other documents stored as 
text that seemed to be generated by some tool like LaTeX, whose the
original version were not preserved. 

Also, there were other documents using different markdown dialects 
that were converted via pandoc (and/or other similar tools). That's 
not to mention the ones that were converted from DocBook. Such
tools tend to use some logic to use "neat" versions of some ASCII
characters, like what this tool does:

	https://daringfireball.net/projects/smartypants/

(Sphinx itself seemed to use this tool on its early versions)

All tool-converted documents can carry UTF-8 on unexpected places. See,
on this series, a large amount of patches deal with U+A0 (NO-BREAK SPACE)
chars. I can't see why someone writing a plain text document (or a ReST
one) would type a NO-BREAK SPACE instead of a normal white space.

The same applies, up to some sort, to curly commas: usually people just 
write ASCII "commas" on their documents, and use some tool like LaTeX
or a text editor like libreoffice in order to convert them into
 “utf-8 curly commas”[3].

[3] Sphinx will do such things at the produced output, doing something 
    similar to what smartypants does, nowadays using this:

	https://docutils.sourceforge.io/docs/user/smartquotes.html

    E. g.:
      - Straight quotes (" and ') turned into "curly" quote characters;
      - dashes (-- and ---) turned into en- and em-dash entities;
      - three consecutive dots (... or . . .) turned into an ellipsis char.

> > You seem quite knowedgeable about the various differences.  Perhaps
> > you'd be willing to write a document for Documentation/doc-guide/
> > that provides guidance for when to use which kinds of horizontal
> > line?
> I have Opinions about the proper usage of punctuation, but I also know  
>  that other people have differing opinions.  For instance, I place
>  spaces around an em dash, which is nonstandard according to most
>  style guides.  Really this is an individual enough thing that I'm not
>  sure we could have a "kernel style guide" that would be more useful
>  than general-purpose guidance like the page you linked.

> Moreover, such a guide could make non-native speakers needlessly self-
>  conscious about their writing and discourage them from contributing
>  documentation at all.

I don't think so. In a matter of fact, as a non-native speaker, I guess
this can actually help people willing to write documents.

>  I'm not advocating here for trying to push
>  kernel developers towards an eats-shoots-and-leaves level of
>  linguistic pedantry; rather, I merely think that existing correct
>  usages should be left intact (and therefore, excising incorrect usage
>  should only be attempted by someone with both the expertise and time
>  to check each case).
> 
> But if you really want such a doc I wouldn't mind contributing to it.

IMO, a document like that can be helpful. I can help reviewing it.

Thanks,
Mauro

  reply	other threads:[~2021-05-11  9:00 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-05-10 10:26 [PATCH 00/53] Get rid of UTF-8 chars that can be mapped as ASCII Mauro Carvalho Chehab
2021-05-10 10:27 ` [PATCH 52/53] docs: virt: kvm: avoid using UTF-8 chars Mauro Carvalho Chehab
2021-05-10 10:52 ` [PATCH 00/53] Get rid of UTF-8 chars that can be mapped as ASCII Thorsten Leemhuis
2021-05-10 11:19   ` Mauro Carvalho Chehab
2021-05-10 12:27     ` Mauro Carvalho Chehab
2021-05-10 10:54 ` David Woodhouse
2021-05-10 11:55   ` Mauro Carvalho Chehab
2021-05-10 13:16     ` Edward Cree
2021-05-10 13:38       ` Mauro Carvalho Chehab
2021-05-10 13:58         ` Edward Cree
2021-05-10 13:59       ` Matthew Wilcox
2021-05-10 14:33         ` Edward Cree
2021-05-11  9:00           ` Mauro Carvalho Chehab [this message]
2021-05-11  9:19             ` David Woodhouse
2021-05-10 13:49     ` David Woodhouse
2021-05-10 19:22       ` Theodore Ts'o
2021-05-11  9:37         ` Mauro Carvalho Chehab
2021-05-11  9:25       ` Mauro Carvalho Chehab
2021-05-10 14:00     ` Ben Boeckel
2021-05-10 21:57 ` Adam Borowski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210511110002.2f187f01@coco.lan \
    --to=mchehab+huawei@kernel.org \
    --cc=alsa-devel@alsa-project.org \
    --cc=corbet@lwn.net \
    --cc=coresight@lists.linaro.org \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=dwmw2@infradead.org \
    --cc=ecree.xilinx@gmail.com \
    --cc=intel-gfx@lists.freedesktop.org \
    --cc=intel-wired-lan@lists.osuosl.org \
    --cc=keyrings@vger.kernel.org \
    --cc=kvm@vger.kernel.org \
    --cc=linux-acpi@vger.kernel.org \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-edac@vger.kernel.org \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-f2fs-devel@lists.sourceforge.net \
    --cc=linux-fpga@vger.kernel.org \
    --cc=linux-hwmon@vger.kernel.org \
    --cc=linux-iio@vger.kernel.org \
    --cc=linux-input@vger.kernel.org \
    --cc=linux-integrity@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-media@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=linux-pm@vger.kernel.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=linux-riscv@lists.infradead.org \
    --cc=linux-sgx@vger.kernel.org \
    --cc=linux-usb@vger.kernel.org \
    --cc=mjpeg-users@lists.sourceforge.net \
    --cc=netdev@vger.kernel.org \
    --cc=rcu@vger.kernel.org \
    --cc=willy@infradead.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).