From: Christoph Anton Mitterer <calestyo@scientia.org>
To: Herbert Xu <herbert@gondor.apana.org.au>,
	DASH Mailing List <dash@vger.kernel.org>
Subject: Re: [PATCH 0/8] Add multi-byte support
Date: Sat, 27 Apr 2024 23:31:43 +0200
Message-ID: <eb927eeff3d1da7406c998d0ce816d3b6a061fc7.camel@scientia.org>
In-Reply-To: <cover.1714215826.git.herbert@gondor.apana.org.au>

Hey.


On Sat, 2024-04-27 at 19:03 +0800, Herbert Xu wrote:
> This patch series adds multi-byte support to dash.  For now only
> fnmatch is supported as the native pmatch function has not been
> modified to support multi-byte characters.

Nothing against the functionality per se, but I think that for all
scripts that assumed dash's (and thus, on many systems, /bin/sh's)
current behaviour of being C locale only, even without explicitly
setting LC_ALL=C, this may cause quite a few subtle issues.


AFAIU, in the C locale every byte is a character, and thus in
particular pattern matching notation is defined for every possible
result of a command substitution and for every possible content of a
variable (which, in every(!) locale, may be any byte other than NUL).


For example:
************
A while ago I asked on the Austin Group mailing list for a portable
way to get command substitution without stripping of trailing newlines.

Long story short:
The recommended way was to add a sentinel character '.' at the end of
the output within the command substitution and strip that off later
with parameter expansion.
But despite the very special properties[0] of '.', it's apparently
still required to set LC_ALL=C when stripping the sentinel, because the
pattern matching notation in ${foo%.} is defined only on strings of
characters, not on strings of bytes.
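
A minimal sketch of that sentinel approach, assuming a POSIX sh. The
printf command and the variable name `out` are only placeholders, and
note that the plain LC_ALL=C assignment stays in effect afterwards:

```shell
# Append a '.' sentinel inside the command substitution so that the
# trailing newlines are no longer at the very end and survive $( ).
out=$(printf 'line\n\n\n'; printf '.')

# Strip the sentinel again.  LC_ALL=C makes the pattern match on
# bytes; this assignment is NOT undone here (a side effect).
LC_ALL=C
out=${out%.}

# "line" plus three newlines, counted in the C locale.
printf '%s\n' "${#out}"   # prints 7
```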

Back then, Harald van Dijk had some ideas how that might be resolved
for good, but IIRC none of the shell implementors seemed to really have
interest.

My goal was to make a portable function like
   command_subst_with_newlines "eval-ed-command-string" "target-variable-name"
which, given the requirement of setting LC_ALL, proved more or less
impossible if the function is to have no side effects (like leaving
LC_ALL overridden, or possibly overwriting some existing variable like
OLD_LC_ALL).
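
To illustrate the problem, here is a rough sketch of such a function,
assuming a POSIX sh. The `_csn_*` helper variables are made-up names,
and they show exactly the kind of side effect meant above: any
variables of the same name in the caller get overwritten.

```shell
# Hypothetical sketch, not a side-effect-free solution: the _csn_*
# names are invented here and clobber same-named caller variables.
command_subst_with_newlines() {
	# $1: command string to eval, $2: name of the target variable
	_csn_out=$(eval "$1"; printf '.')

	# Remember whether LC_ALL was set, and its previous value.
	_csn_was_set=${LC_ALL+y}
	_csn_old=${LC_ALL-}

	# Strip the sentinel with byte-wise pattern matching.
	LC_ALL=C
	_csn_out=${_csn_out%.}

	# Restore the previous state of LC_ALL.
	if [ -n "$_csn_was_set" ]; then
		LC_ALL=$_csn_old
	else
		unset LC_ALL
	fi

	# Assign the result to the variable named by $2.
	eval "$2=\$_csn_out"
}
```

It would be called as, for example,
   command_subst_with_newlines 'printf "a\n\n"' result
after which "$result" still ends in both newlines.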


Anyway... I could imagine that if dash becomes multi-byte aware, there
might be more or less subtle surprises.


Cheers,
Chris.


[0] https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap06.html
"The encoded values associated with <period>, <slash>, <newline>, and
<carriage-return> shall be invariant across all locales supported by
the implementation."

Thread overview: 17+ messages
2024-04-27 11:03 [PATCH 0/8] Add multi-byte support Herbert Xu
2024-04-16 10:03 ` [PATCH 1/8] shell: Call setlocale Herbert Xu
2024-04-16 10:38 ` [PATCH 2/8] shell: Use strcoll instead of strcmp where applicable Herbert Xu
2024-04-16 23:13 ` [PATCH 3/8] expand: Count multi-byte characters for VSLENGTH Herbert Xu
2024-04-18  8:59 ` [PATCH 4/8] expand: Process multi-byte characters in subevalvar Herbert Xu
2024-04-20 13:46 ` [PATCH 5/8] expand: Process multi-byte characters in expmeta Herbert Xu
2024-04-23 11:17 ` [PATCH 6/8] expand: Support multi-byte characters during field splitting Herbert Xu
2024-04-27  8:15 ` [PATCH 7/8] input: Allow MB_LEN_MAX calls to pungetc Herbert Xu
2024-04-27  8:41 ` [PATCH 8/8] parser: Add support for multi-byte characters Herbert Xu
2024-04-27 21:31 ` Christoph Anton Mitterer [this message]
2024-04-28  0:49   ` [PATCH 0/8] Add multi-byte support Herbert Xu
2024-04-28  1:19     ` Christoph Anton Mitterer
2024-04-28  1:35       ` Lawrence Velázquez
2024-04-28  1:50         ` Christoph Anton Mitterer
2024-04-28  2:03       ` Christoph Anton Mitterer
2024-04-28 14:50     ` Harald van Dijk
2024-04-29 13:12       ` Herbert Xu