linux-fsdevel.vger.kernel.org archive mirror
From: "Theodore Y. Ts'o" <tytso@mit.edu>
To: lampahome <pahome.chen@mirlab.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: why do we need utf8 normalization when compare name?
Date: Tue, 3 Mar 2020 12:22:09 -0500
Message-ID: <20200303172209.GB61444@mit.edu>
In-Reply-To: <CAB3eZfu1=-FwJTnnH=sfg=J2gkeF0bgMs43V5tSkxdqP+m+R9A@mail.gmail.com>

On Tue, Mar 03, 2020 at 06:13:56PM +0800, lampahome wrote:
> 
> > And yes, once the strings are normalised and encoded as UTF-8 you then
> > do a byte-by-byte comparison (if the comparison is case-insensitive then
> > fs/unicode/... will case-fold the Unicode symbols during normalisation).
> 
> What I'm confused is why encoded as utf-8 after normalize finished?
> From above, turn "ñ" (U+00F1) and "n◌̃" (U+006E U+0303) into the same
> Unicode string. Then why should we just compare bytes from normalized.

For the same reason that we don't upcase or downcase all of the file
names in a directory with case-folding enabled.  The term for this is
"case-preserving, case-insensitive" matching.  So that means that if
you save a file as "Makefile", ls will return "Makefile", and not
"MAKEFILE" or "makefile".

Of course, if you delete or truncate "makefile", it will affect the
file stored in the directory as "Makefile", and the file system will
not allow a directory with case-folding enabled to contain "makefile"
and "Makefile" at the same time.
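
As a rough illustration of that behaviour (a toy sketch only, not how
any file system implements it), the hypothetical CaseFoldingDir class
below keys its entries on a folded copy of the name while keeping the
name exactly as it was first written.  Python's str.casefold() stands
in for the kernel's own case-folding tables, which is an assumption
made for the example:

```python
class CaseFoldingDir:
    """Toy directory: case-preserving, case-insensitive names."""

    def __init__(self):
        # folded lookup key -> (name as originally stored, contents)
        self._entries = {}

    def create(self, name, data=b""):
        key = name.casefold()
        if key in self._entries:
            # "makefile" and "Makefile" may not coexist.
            raise FileExistsError(self._entries[key][0])
        self._entries[key] = (name, data)

    def unlink(self, name):
        # Any spelling that folds to the same key hits the same entry.
        key = name.casefold()
        if key not in self._entries:
            raise FileNotFoundError(name)
        del self._entries[key]

    def listdir(self):
        # Names come back exactly as they were saved.
        return [stored for stored, _ in self._entries.values()]


d = CaseFoldingDir()
d.create("Makefile")
print(d.listdir())       # the stored spelling, not the folded one
d.unlink("makefile")     # removes the entry stored as "Makefile"
print(d.listdir())
```

The folded key exists only for lookups; nothing ever rewrites the
stored name.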

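Similarly, with normalization, we preserve the existing UTF-8 form
(both the composed and decomposed forms are valid UTF-8), but we
compare without taking the composition form into account.  In other
words, the normalized string is used only for the comparison; the
name stored in the directory is left exactly as it was supplied.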
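
To make that concrete, here is a small sketch using the "ñ" example
from earlier in the thread.  It assumes Python's
unicodedata.normalize() with the NFD form as a stand-in for what
fs/unicode does; the names_match() helper is hypothetical and exists
only for this illustration:

```python
import unicodedata


def names_match(a: str, b: str) -> bool:
    """Compare two names on their normalized forms only."""
    na = unicodedata.normalize("NFD", a).encode("utf-8")
    nb = unicodedata.normalize("NFD", b).encode("utf-8")
    # After normalization a plain byte-by-byte comparison is enough.
    return na == nb


composed = "ma\u00f1ana"      # "ñ" as the single code point U+00F1
decomposed = "man\u0303ana"   # "n" followed by combining tilde U+0303

# The two spellings are different byte sequences, and both are valid UTF-8.
assert composed.encode("utf-8") != decomposed.encode("utf-8")

# They still name the same file once normalization is applied.
assert names_match(composed, decomposed)

# The inputs themselves are untouched; whichever form was stored stays stored.
stored_name = composed
assert stored_name == "ma\u00f1ana"
```

Only the temporary copies made inside names_match() are normalized,
which is why a directory listing can still return the original bytes.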

Cheers,

					- Ted

P.S.  Some people may hate this, but if the goal is interoperability
with how Windows and MacOS do things, this is basically what they do
as well.  (Well, mostly; MacOS is a little weird for historical
reasons.)

P.P.S.  And before you comment on it, as one Internationalization
expert once said, I18N *is* complicated.  It truly would be easier to
teach all of the world to speak a single language and use it as the
"Federation Standard" language, à la Star Trek.  For better or for
worse, that's not happening, and so we deal with the world as it is,
not as we would like it to be.  :-)

