linux-fsdevel.vger.kernel.org archive mirror
* why do we need utf8 normalization when compare name?
@ 2020-03-02  9:00 lampahome
  2020-03-02 10:37 ` Aleksa Sarai
  2020-03-02 12:54 ` Matthew Wilcox
  0 siblings, 2 replies; 10+ messages in thread
From: lampahome @ 2020-03-02  9:00 UTC (permalink / raw)
  To: linux-fsdevel

With the case-insensitive support added in kernel 5.2, d_compare
transforms the names into a normalized form and then compares them.

But why do we need this normalization step? Could we just compare
the UTF-8 strings directly?

* Re: why do we need utf8 normalization when compare name?
  2020-03-02  9:00 why do we need utf8 normalization when compare name? lampahome
@ 2020-03-02 10:37 ` Aleksa Sarai
  2020-03-02 10:47   ` Aleksa Sarai
  2020-03-02 12:54 ` Matthew Wilcox
  1 sibling, 1 reply; 10+ messages in thread
From: Aleksa Sarai @ 2020-03-02 10:37 UTC (permalink / raw)
  To: lampahome; +Cc: linux-fsdevel, linux-kernel

On 2020-03-02, lampahome <pahome.chen@mirlab.org> wrote:
> According to case insensitive since kernel 5.2, d_compare will
> transform string into normalized form and then compare.
>
> But why do we need this normalization function? Could we just compare
> by utf8 string?

The problem is that there are multiple ways to represent the same glyph
in Unicode -- for instance, you can represent Å (the symbol for
angstrom) as both U+212B and U+0041 U+030A (the latin letter "A"
followed by the combining ring above "◌̊"). Different software may choose to
represent the same glyphs in different Unicode forms, hence the need for
normalisation.

[1] is the Wikipedia article that describes this problem and what the
different kinds of Unicode normalisation are.

[1]: https://en.wikipedia.org/wiki/Unicode_equivalence

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>


* Re: why do we need utf8 normalization when compare name?
  2020-03-02 10:37 ` Aleksa Sarai
@ 2020-03-02 10:47   ` Aleksa Sarai
  2020-03-03  1:48     ` lampahome
  0 siblings, 1 reply; 10+ messages in thread
From: Aleksa Sarai @ 2020-03-02 10:47 UTC (permalink / raw)
  To: lampahome; +Cc: linux-fsdevel, linux-kernel

On 2020-03-02, Aleksa Sarai <cyphar@cyphar.com> wrote:
> On 2020-03-02, lampahome <pahome.chen@mirlab.org> wrote:
> > According to case insensitive since kernel 5.2, d_compare will
> > transform string into normalized form and then compare.
> >
> > But why do we need this normalization function? Could we just compare
> > by utf8 string?
> 
> The problem is that there are multiple ways to represent the same glyph
> in Unicode -- for instance, you can represent Å (the symbol for
> angstrom) as both U+212B and U+0041 U+030A (the latin letter "A"
> followed by the ring-above symbol "°"). Different software may choose to
> represent the same glyphs in different Unicode forms, hence the need for
> normalisation.

Sorry, a better example would've been "ñ" (U+00F1). You can also
represent it as "n" (U+006E) followed by "◌̃" (U+0303 -- "combining
tilde"). Both forms are defined by Unicode to be canonically equivalent
so it would be incorrect to treat the two Unicode strings differently
(that isn't quite the case for "Å").
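
As a rough userspace illustration (this uses Python's standard unicodedata
module, not the kernel's fs/unicode code), the two spellings of "ñ" differ
as raw UTF-8 bytes but become identical once both are normalised to the
same form:

```python
import unicodedata

precomposed = "\u00f1"   # "ñ" as a single code point
decomposed = "n\u0303"   # "n" followed by U+0303 COMBINING TILDE

# The raw UTF-8 encodings differ, so a plain byte comparison fails.
raw_equal = precomposed.encode("utf-8") == decomposed.encode("utf-8")

# After normalising both sides to the same form, the strings are identical.
nfd_equal = (unicodedata.normalize("NFD", precomposed)
             == unicodedata.normalize("NFD", decomposed))
nfc_equal = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", decomposed))

print(raw_equal, nfd_equal, nfc_equal)  # False True True
```

Which form is used does not matter for the comparison, as long as both
names go through the same one.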

> [1] is the Wikipedia article that describes this problem and what the
> different kinds of Unicode normalisation are.
> 
> [1]: https://en.wikipedia.org/wiki/Unicode_equivalence

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>


* Re: why do we need utf8 normalization when compare name?
  2020-03-02  9:00 why do we need utf8 normalization when compare name? lampahome
  2020-03-02 10:37 ` Aleksa Sarai
@ 2020-03-02 12:54 ` Matthew Wilcox
  2020-03-02 15:28   ` Al Viro
  1 sibling, 1 reply; 10+ messages in thread
From: Matthew Wilcox @ 2020-03-02 12:54 UTC (permalink / raw)
  To: lampahome; +Cc: linux-fsdevel

On Mon, Mar 02, 2020 at 05:00:24PM +0800, lampahome wrote:
> According to case insensitive since kernel 5.2, d_compare will
> transform string into normalized form and then compare.
> 
> But why do we need this normalization function? Could we just compare
> by utf8 string?

Have you read https://en.wikipedia.org/wiki/Unicode_equivalence ?

We need to decide whether a user with a case-insensitive filesystem
who looks up a file with the name U+00E5 (lower case "a" with ring)
should find a file which is named U+00C5 (upper case "A" with ring)
or U+212B (Angstrom sign).

Then there's the question of whether e-acute is stored as U+00E9
or U+0065 followed by U+0301, and both of those will need to be found
by a user search for U+00C9 or a user searching for U+0045 U+0301.

So yes, normalisation needs to be done.
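
A minimal sketch of that decision, assuming the comparison key is
NFD(casefold(NFD(name))) as computed by Python's unicodedata module and
str.casefold(). The helper name fold_key is made up for illustration, and
the kernel ships its own tables, so this only approximates what the
filesystem does:

```python
import unicodedata

def fold_key(name: str) -> str:
    """Hypothetical comparison key: decompose, case-fold, decompose again."""
    nfd = unicodedata.normalize("NFD", name)
    return unicodedata.normalize("NFD", nfd.casefold())

# Lower-case a-ring, upper-case A-ring, and the Angstrom sign.
a_ring = ["\u00e5", "\u00c5", "\u212b"]
# Precomposed and decomposed e-acute, in both cases.
e_acute = ["\u00e9", "e\u0301", "\u00c9", "E\u0301"]

a_keys = {fold_key(s) for s in a_ring}
e_keys = {fold_key(s) for s in e_acute}

# Each group collapses to a single key, so any spelling finds the file.
print(len(a_keys), len(e_keys))  # 1 1
```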

* Re: why do we need utf8 normalization when compare name?
  2020-03-02 12:54 ` Matthew Wilcox
@ 2020-03-02 15:28   ` Al Viro
  2020-03-02 17:14     ` Matthew Wilcox
  2020-03-02 18:12     ` Theodore Y. Ts'o
  0 siblings, 2 replies; 10+ messages in thread
From: Al Viro @ 2020-03-02 15:28 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: lampahome, linux-fsdevel

On Mon, Mar 02, 2020 at 04:54:32AM -0800, Matthew Wilcox wrote:
> On Mon, Mar 02, 2020 at 05:00:24PM +0800, lampahome wrote:
> > According to case insensitive since kernel 5.2, d_compare will
> > transform string into normalized form and then compare.
> > 
> > But why do we need this normalization function? Could we just compare
> > by utf8 string?
> 
> Have you read https://en.wikipedia.org/wiki/Unicode_equivalence ?
> 
> We need to decide whether a user with a case-insensitive filesystem
> who looks up a file with the name U+00E5 (lower case "a" with ring)
> should find a file which is named U+00C5 (upper case "A" with ring)
> or U+212B (Angstrom sign).
> 
> Then there's the question of whether e-acute is stored as U+00E9
> or U+0065 followed by U+0301, and both of those will need to be found
> by a user search for U+00C9 or a user searching for U+0045 U+0301.
> 
> So yes, normalisation needs to be done.

Why the hell do we need case-insensitive filesystems in the first place?
I have only heard two explanations:
	1) because the layout (including name equivalences) is fixed by
some OS that happens to be authoritative for that filesystem.  In that
case we need to match the rules of that OS, whatever they are.  Unicode
equivalence may be an interesting part of _their_ background reasons
for setting those rules, but the only thing that really matters is what
rules they have set.
	2) early Android used to include a memory card with VFAT on
it; the card is long gone, but crapplications came to rely upon having
that shit.  And rather than giving them a file on the normal filesystem
with VFAT image on it and /dev/loop set up and mounted, somebody wants
to use parts of the normal (ext4) filesystem for it.  However, the
same crapplications have come to rely upon the case-insensitive (sensu
VFAT) behaviour there, so we must duplicate that vomit-inducing pile
of hacks on ext4.  Ideally - with that vomit-induc{ing,ed} pile
reclassified as a generic feature; those look more respectable.

(1) is reasonable enough, but belongs in specific weird filesystems.
(2) is, IMO, a bad joke.

Does anybody know of any other reasons?

* Re: why do we need utf8 normalization when compare name?
  2020-03-02 15:28   ` Al Viro
@ 2020-03-02 17:14     ` Matthew Wilcox
  2020-03-02 18:12     ` Theodore Y. Ts'o
  1 sibling, 0 replies; 10+ messages in thread
From: Matthew Wilcox @ 2020-03-02 17:14 UTC (permalink / raw)
  To: Al Viro; +Cc: lampahome, linux-fsdevel

On Mon, Mar 02, 2020 at 03:28:18PM +0000, Al Viro wrote:
> Why the hell do we need case-insensitive filesystems in the first place?
> I have only heard two explanations:
> 	1) because the layout (including name equivalences) is fixed by
> some OS that happens to be authoritative for that filesystem.  In that
> case we need to match the rules of that OS, whatever they are.  Unicode
> equivalence may be an interesting part of _their_ background reasons
> for setting those rules, but the only thing that really matters is what
> rules have they set.
> 	2) early Android used to include a memory card with VFAT on
> it; the card is long gone, but crapplications came to rely upon having
> that shit.  And rather than giving them a file on the normal filesystem
> with VFAT image on it and /dev/loop set up and mounted, somebody wants
> to use parts of the normal (ext4) filesystem for it.  However, the
> same crapplications have come to rely upon the case-insensitive (sensu
> VFAT) behaviour there, so we must duplicate that vomit-inducing pile
> of hacks on ext4.  Ideally - with that vomit-induc{ing,ed} pile
> reclassified as a generic feature; those look more respectable.
> 
> (1) is reasonable enough, but belongs in specific weird filesystems.
> (2) is, IMO, a bad joke.
> 
> Does anybody know of any other reasons?

I've heard it was primarily developed to help port an ecosystem known
for prioritising shipping-on-time over quality-of-code from Windows to
Linux.  I'm not sure why a variant of #2 wasn't a solution they used.
I'm not a fan of having case-insensitive unicode tables in the kernel.

* Re: why do we need utf8 normalization when compare name?
  2020-03-02 15:28   ` Al Viro
  2020-03-02 17:14     ` Matthew Wilcox
@ 2020-03-02 18:12     ` Theodore Y. Ts'o
  1 sibling, 0 replies; 10+ messages in thread
From: Theodore Y. Ts'o @ 2020-03-02 18:12 UTC (permalink / raw)
  To: Al Viro; +Cc: Matthew Wilcox, lampahome, linux-fsdevel

On Mon, Mar 02, 2020 at 03:28:18PM +0000, Al Viro wrote:
> Why the hell do we need case-insensitive filesystems in the first place?
> I have only heard two explanations:
> 	1) because the layout (including name equivalences) is fixed by
> some OS that happens to be authoritative for that filesystem.  In that
> case we need to match the rules of that OS, whatever they are.  Unicode
> equivalence may be an interesting part of _their_ background reasons
> for setting those rules, but the only thing that really matters is what
> rules have they set.

It significantly helps porting applications that were originally
written for Windows and/or MacOS.  In particular, the work to add
Unicode comparison support to ext4 was funded to enable the ability to
run Windows gaming applications on Linux for Steam.


> 	2) early Android used to include a memory card with VFAT on
> it; the card is long gone, but crapplications came to rely upon having
> that shit.  And rather than giving them a file on the normal filesystem
> with VFAT image on it and /dev/loop set up and mounted, somebody wants
> to use parts of the normal (ext4) filesystem for it.  However, the
> same crapplications have come to rely upon the case-insensitive (sensu
> VFAT) behaviour there, so we must duplicate that vomit-inducing pile
> of hacks on ext4.  Ideally - with that vomit-induc{ing,ed} pile
> reclassified as a generic feature; those look more respectable.

There are a number of reasons why a loop device is not sufficient;
there is a requirement to have selective sharing of application data
between other applications, which is done using user/group ownership.

For Android, previously, this was done using sdcardfs which was based
off of wrapfs.  Wrapfs was the same base as unionfs, and as you may
recall, it had even more horrendous race issues, and more than once I
was asked to help debug crashes when fsstress was run on
different views of sdcardfs.  I've been strongly encouraging the push
to something more sane, but a requirement for this is case-folded
directory.

There is a third reason why case folding is necessary, which is for
file serving applications such as Samba which require case-folding,
and without file system level support, trying to simulate this in
userspace requires searching via readdir to do a case-insensitive
lookup (at least in some cases).  So finding a more performant way to
support case-folding is a big help for applications like Samba.
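
To make the cost concrete, here is a sketch of the kind of userspace scan
described above. It is not Samba's actual code; lookup_ci and fold_key are
hypothetical names, and the folding rule is an assumption. Every lookup has
to read the whole directory and fold every entry:

```python
import os
import tempfile
import unicodedata
from typing import Optional

def fold_key(name: str) -> str:
    """Hypothetical comparison key: decompose, case-fold, decompose again."""
    nfd = unicodedata.normalize("NFD", name)
    return unicodedata.normalize("NFD", nfd.casefold())

def lookup_ci(directory: str, wanted: str) -> Optional[str]:
    """Return the stored name that matches `wanted`, or None.

    Cost is linear in the number of directory entries for every lookup.
    """
    key = fold_key(wanted)
    for entry in os.listdir(directory):
        if fold_key(entry) == key:
            return entry
    return None

with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "Makefile"), "w").close()
    found = lookup_ci(d, "MAKEFILE")
    missing = lookup_ci(d, "README")

print(found, missing)  # Makefile None
```

With support in the filesystem, the same lookup can go through the
directory's normal index instead of a full scan.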

Cheers,

						- Ted

* Re: why do we need utf8 normalization when compare name?
  2020-03-02 10:47   ` Aleksa Sarai
@ 2020-03-03  1:48     ` lampahome
       [not found]       ` <20200303070928.aawxoyeq77wnc3ts@yavin>
  0 siblings, 1 reply; 10+ messages in thread
From: lampahome @ 2020-03-03  1:48 UTC (permalink / raw)
  To: Aleksa Sarai; +Cc: linux-fsdevel, linux-kernel

> Sorry, a better example would've been "ñ" (U+00F1). You can also
> represent it as "n" (U+006E) followed by "◌̃" (U+0303 -- "combining
> tilde"). Both forms are defined by Unicode to be canonically equivalent
> so it would be incorrect to treat the two Unicode strings differently
> (that isn't quite the case for "Å").

So UTF-8 normalization will convert both "ñ" (U+00F1) and "n" (U+006E)
followed by "◌̃" into the same UTF-8 byte sequence, right?
And then the two are compared byte by byte.

* Re: why do we need utf8 normalization when compare name?
       [not found]       ` <20200303070928.aawxoyeq77wnc3ts@yavin>
@ 2020-03-03 10:13         ` lampahome
  2020-03-03 17:22           ` Theodore Y. Ts'o
  0 siblings, 1 reply; 10+ messages in thread
From: lampahome @ 2020-03-03 10:13 UTC (permalink / raw)
  To: Aleksa Sarai; +Cc: linux-fsdevel, linux-kernel

> Unicode normalisation will take the strings "ñ" (U+00F1) and "n◌̃"
> (U+006E U+0303) and turn them into the same Unicode string. Note that
> there are four kinds of Unicode normalisation (NFD, NFC, NFKD, NFKC), so
> what precise string you end up with depends on which form you're using.
> Linux uses NFD, I believe.


> And yes, once the strings are normalised and encoded as UTF-8 you then
> do a byte-by-byte comparison (if the comparison is case-insensitive then
> fs/unicode/... will case-fold the Unicode symbols during normalisation).
>

What confuses me is why the name is encoded as UTF-8 after normalization
has finished. From the above, "ñ" (U+00F1) and "n◌̃" (U+006E U+0303) are
turned into the same Unicode string. Then why should we just compare the
bytes of the normalized form?

* Re: why do we need utf8 normalization when compare name?
  2020-03-03 10:13         ` lampahome
@ 2020-03-03 17:22           ` Theodore Y. Ts'o
  0 siblings, 0 replies; 10+ messages in thread
From: Theodore Y. Ts'o @ 2020-03-03 17:22 UTC (permalink / raw)
  To: lampahome; +Cc: Aleksa Sarai, linux-fsdevel, linux-kernel

On Tue, Mar 03, 2020 at 06:13:56PM +0800, lampahome wrote:
> 
> > And yes, once the strings are normalised and encoded as UTF-8 you then
> > do a byte-by-byte comparison (if the comparison is case-insensitive then
> > fs/unicode/... will case-fold the Unicode symbols during normalisation).
> 
> What I'm confused is why encoded as utf-8 after normalize finished?
> From above, turn "ñ" (U+00F1) and "n◌̃" (U+006E U+0303) into the same
> Unicode string. Then why should we just compare bytes from normalized.

For the same reason why we don't upcase or downcase all of the letters
in a directory with case-folding.  The term for this is
"case-preserving, case-insensitive" matching.  So that means that if
you save a file as "Makefile", ls will return "Makefile", and not
"MAKEFILE" or "makefile".

Of course, if you delete or truncate "makefile", it will affect the
file stored in the directory as "Makefile", and the file system will
not allow a directory with case-folding enabled to contain "makefile"
and "Makefile" at the same time.

Similarly, with normalization, we preserve the existing utf-8 form
(both the composed and decomposed forms are valid utf-8), but we
compare without taking the composition form into account.
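
A toy in-memory model of that behaviour, purely for illustration. The
FoldedDir class and fold_key helper are made up, and the folding rule is
an assumption rather than the kernel's tables. The directory is indexed by
the folded key, but the name is kept exactly as it was created:

```python
import unicodedata

def fold_key(name: str) -> str:
    """Hypothetical comparison key: decompose, case-fold, decompose again."""
    nfd = unicodedata.normalize("NFD", name)
    return unicodedata.normalize("NFD", nfd.casefold())

class FoldedDir:
    """Case-preserving, case-insensitive directory (toy model)."""

    def __init__(self):
        self._entries = {}  # comparison key -> name exactly as created

    def create(self, name: str) -> None:
        key = fold_key(name)
        if key in self._entries:
            # "makefile" and "Makefile" cannot coexist.
            raise FileExistsError(self._entries[key])
        self._entries[key] = name

    def lookup(self, name: str):
        # Any equivalent spelling finds the entry; the stored form is returned.
        return self._entries.get(fold_key(name))

    def listing(self):
        return list(self._entries.values())

d = FoldedDir()
d.create("Makefile")
d.create("n\u0303.txt")              # stored in decomposed form

print(d.lookup("makefile"))           # Makefile
print(d.lookup("\u00f1.TXT") == "n\u0303.txt")  # True
print(d.listing() == ["Makefile", "n\u0303.txt"])  # True
```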

Cheers,

					- Ted

P.S.  Some people may hate this, but if the goal is interoperability
with how Windows and MacOS do things, this is basically what they do
as well.  (Well, mostly; MacOS is a little weird for historical
reasons.)

P.P.S.  And before you comment on it, as one Internationalization
expert once said, I18N *is* complicated.  It truly would be easier to
teach all of the world to speak a single language and use it as the
"Federation Standard" language, ala Star Trek.  For better or for
worse, that's not happening, and so we deal with the world as it is,
not as we would like it to be.  :-)

