From: Al Viro <viro@zeniv.linux.org.uk>
To: Matthew Wilcox <willy@infradead.org>
Cc: lampahome <pahome.chen@mirlab.org>, linux-fsdevel@vger.kernel.org
Subject: Re: why do we need utf8 normalization when compare name?
Date: Mon, 2 Mar 2020 15:28:18 +0000
Message-ID: <20200302152818.GN23230@ZenIV.linux.org.uk>
In-Reply-To: <20200302125432.GP29971@bombadil.infradead.org>

On Mon, Mar 02, 2020 at 04:54:32AM -0800, Matthew Wilcox wrote:
> On Mon, Mar 02, 2020 at 05:00:24PM +0800, lampahome wrote:
> > According to case insensitive since kernel 5.2, d_compare will
> > transform string into normalized form and then compare.
> > 
> > But why do we need this normalization function? Could we just compare
> > by utf8 string?
> 
> Have you read https://en.wikipedia.org/wiki/Unicode_equivalence ?
> 
> We need to decide whether a user with a case-insensitive filesystem
> who looks up a file with the name U+00E5 (lower case "a" with ring)
> should find a file which is named U+00C5 (upper case "A" with ring)
> or U+212B (Angstrom sign).
> 
> Then there's the question of whether e-acute is stored as U+00E9
> or U+0065 followed by U+0301, and both of those will need to be found
> by a user search for U+00C9 or a user searching for U+0045 U+0301.
> 
> So yes, normalisation needs to be done.
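
The equivalences listed in the quoted text can be checked with a short
Python sketch. This is not the kernel's code; fold_key() and same_name()
are hypothetical helpers that use the standard unicodedata module, with
NFD plus str.casefold() standing in for whatever normalisation and
case-folding rules a given filesystem actually applies.

```python
import unicodedata


def fold_key(name: str) -> str:
    """Hypothetical comparison key: canonical decomposition plus case folding.

    Assumption: NFD + str.casefold() is used purely for illustration; a real
    filesystem may use different normalisation and folding tables.
    """
    decomposed = unicodedata.normalize("NFD", name)
    # Case folding can produce non-normalised output, so normalise again.
    return unicodedata.normalize("NFD", decomposed.casefold())


def same_name(a: str, b: str) -> bool:
    """True if the two names compare equal under fold_key()."""
    return fold_key(a) == fold_key(b)


if __name__ == "__main__":
    # U+00E5, U+00C5 and U+212B (Angstrom sign) all share one key.
    print(same_name("\u00e5", "\u00c5"), same_name("\u00e5", "\u212b"))
    # Precomposed and decomposed e-acute, in either case, share one key.
    print(same_name("\u00e9", "e\u0301"), same_name("\u00c9", "E\u0301"))
    # A plain string comparison treats the two e-acute spellings as different.
    print("\u00e9" == "e\u0301")
```

Without the normalisation step, the precomposed and decomposed spellings
are simply different code point sequences, which is why comparing the raw
UTF-8 strings is not enough for this kind of lookup.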

Why the hell do we need case-insensitive filesystems in the first place?
I have only heard two explanations:
	1) because the layout (including name equivalences) is fixed by
some OS that happens to be authoritative for that filesystem.  In that
case we need to match the rules of that OS, whatever they are.  Unicode
equivalence may be an interesting part of _their_ background reasons
for setting those rules, but the only thing that really matters is what
rules they have set.
	2) early Android used to include a memory card with VFAT on
it; the card is long gone, but crapplications came to rely upon having
that shit.  And rather than giving them a file on the normal filesystem
with VFAT image on it and /dev/loop set up and mounted, somebody wants
to use parts of the normal (ext4) filesystem for it.  However, the
same crapplications have come to rely upon the case-insensitive (sensu
VFAT) behaviour there, so we must duplicate that vomit-inducing pile
of hacks on ext4.  Ideally - with that vomit-induc{ing,ed} pile
reclassified as a generic feature; those look more respectable.

(1) is reasonable enough, but belongs in specific weird filesystems.
(2) is, IMO, a bad joke.

Does anybody know of any other reasons?
