* why do we need utf8 normalization when compare name?
From: lampahome @ 2020-03-02 9:00 UTC
To: linux-fsdevel

With case insensitivity (since kernel 5.2), d_compare will transform
the string into normalized form and then compare.

But why do we need this normalization function? Could we just compare
the utf8 strings directly?
* Re: why do we need utf8 normalization when compare name?
From: Aleksa Sarai @ 2020-03-02 10:37 UTC
To: lampahome; +Cc: linux-fsdevel, linux-kernel

On 2020-03-02, lampahome <pahome.chen@mirlab.org> wrote:
> With case insensitivity (since kernel 5.2), d_compare will transform
> the string into normalized form and then compare.
>
> But why do we need this normalization function? Could we just compare
> the utf8 strings directly?

The problem is that there are multiple ways to represent the same glyph
in Unicode -- for instance, you can represent Å (the symbol for
angstrom) as both U+212B and U+0041 U+030A (the latin letter "A"
followed by the ring-above symbol "°"). Different software may choose to
represent the same glyphs in different Unicode forms, hence the need for
normalisation.

[1] is the Wikipedia article that describes this problem and what the
different kinds of Unicode normalisation are.

[1]: https://en.wikipedia.org/wiki/Unicode_equivalence

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
* Re: why do we need utf8 normalization when compare name?
From: Aleksa Sarai @ 2020-03-02 10:47 UTC
To: lampahome; +Cc: linux-fsdevel, linux-kernel

On 2020-03-02, Aleksa Sarai <cyphar@cyphar.com> wrote:
> On 2020-03-02, lampahome <pahome.chen@mirlab.org> wrote:
> > With case insensitivity (since kernel 5.2), d_compare will transform
> > the string into normalized form and then compare.
> >
> > But why do we need this normalization function? Could we just compare
> > the utf8 strings directly?
>
> The problem is that there are multiple ways to represent the same glyph
> in Unicode -- for instance, you can represent Å (the symbol for
> angstrom) as both U+212B and U+0041 U+030A (the latin letter "A"
> followed by the ring-above symbol "°"). Different software may choose to
> represent the same glyphs in different Unicode forms, hence the need for
> normalisation.

Sorry, a better example would've been "ñ" (U+00F1). You can also
represent it as "n" (U+006E) followed by "◌̃" (U+0303 -- "combining
tilde"). Both forms are defined by Unicode to be canonically equivalent
so it would be incorrect to treat the two Unicode strings differently
(that isn't quite the case for "Å").

> [1] is the Wikipedia article that describes this problem and what the
> different kinds of Unicode normalisation are.
>
> [1]: https://en.wikipedia.org/wiki/Unicode_equivalence

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
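To make the "ñ" example concrete, here is a minimal userspace sketch in
Python. It is only an illustration and not the kernel's fs/unicode code:
it uses the standard unicodedata module, and the helper name nfd_bytes is
made up for this example. It shows that the two spellings differ as raw
UTF-8 bytes but match once both are brought to NFD. With Python's Unicode
tables the angstrom sign (U+212B) also ends up at the same NFD form as
"A" followed by the combining ring (U+030A).

```python
import unicodedata

composed = "\u00f1"      # "ñ" as the single code point U+00F1
decomposed = "n\u0303"   # "n" (U+006E) followed by combining tilde (U+0303)
angstrom = "\u212b"      # ANGSTROM SIGN
a_ring = "A\u030a"       # "A" (U+0041) followed by combining ring above (U+030A)


def nfd_bytes(name: str) -> bytes:
    """Hypothetical helper: normalise to NFD, then encode as UTF-8."""
    return unicodedata.normalize("NFD", name).encode("utf-8")


# A plain byte comparison treats the two spellings as different names.
raw_equal = composed.encode("utf-8") == decomposed.encode("utf-8")

# After normalisation both spellings yield the same byte sequence.
normalized_equal = nfd_bytes(composed) == nfd_bytes(decomposed)

print(raw_equal, normalized_equal)
```

The choice of NFD here only mirrors the form mentioned in this thread;
any single normalisation form applied to both sides would make the
comparison consistent.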
* Re: why do we need utf8 normalization when compare name?
From: lampahome @ 2020-03-03 1:48 UTC
To: Aleksa Sarai; +Cc: linux-fsdevel, linux-kernel

> Sorry, a better example would've been "ñ" (U+00F1). You can also
> represent it as "n" (U+006E) followed by "◌̃" (U+0303 -- "combining
> tilde"). Both forms are defined by Unicode to be canonically equivalent
> so it would be incorrect to treat the two Unicode strings differently
> (that isn't quite the case for "Å").

So utf8 normalization will convert both "ñ" (U+00F1) and "n" (U+006E)
followed by "◌̃" (U+0303) to the same utf8 byte sequence, right? And
then they are compared byte by byte?
[parent not found: <20200303070928.aawxoyeq77wnc3ts@yavin>]
* Re: why do we need utf8 normalization when compare name?
From: lampahome @ 2020-03-03 10:13 UTC
To: Aleksa Sarai; +Cc: linux-fsdevel, linux-kernel

> Unicode normalisation will take the strings "ñ" (U+00F1) and "n◌̃"
> (U+006E U+0303) and turn them into the same Unicode string. Note that
> there are four kinds of Unicode normalisation (NFD, NFC, NFKD, NFKC), so
> what precise string you end up with depends on which form you're using.
> Linux uses NFD, I believe.
>
> And yes, once the strings are normalised and encoded as UTF-8 you then
> do a byte-by-byte comparison (if the comparison is case-insensitive then
> fs/unicode/... will case-fold the Unicode symbols during normalisation).
>

What I'm confused about is why it is encoded as utf-8 after
normalization has finished. From the above, normalization turns "ñ"
(U+00F1) and "n◌̃" (U+006E U+0303) into the same Unicode string. Then
why should we just compare the bytes of the normalized form?
* Re: why do we need utf8 normalization when compare name?
From: Theodore Y. Ts'o @ 2020-03-03 17:22 UTC
To: lampahome; +Cc: Aleksa Sarai, linux-fsdevel, linux-kernel

On Tue, Mar 03, 2020 at 06:13:56PM +0800, lampahome wrote:
> > And yes, once the strings are normalised and encoded as UTF-8 you then
> > do a byte-by-byte comparison (if the comparison is case-insensitive then
> > fs/unicode/... will case-fold the Unicode symbols during normalisation).
>
> What I'm confused about is why it is encoded as utf-8 after
> normalization has finished. From the above, normalization turns "ñ"
> (U+00F1) and "n◌̃" (U+006E U+0303) into the same Unicode string. Then
> why should we just compare the bytes of the normalized form?

For the same reason why we don't upcase or downcase all of the letters
in a directory with case-folding. The term for this is
"case-preserving, case-insensitive" matching. So that means that if
you save a file as "Makefile", ls will return "Makefile", and not
"MAKEFILE" or "makefile". Of course, if you delete or truncate
"makefile", it will affect the file stored in the directory as
"Makefile", and the file system will not allow a directory with
case-folding enabled to contain "makefile" and "Makefile" at the same
time.

Similarly, with normalization, we preserve the existing utf-8 form
(both the composed and decomposed forms are valid utf-8), but we
compare without taking the composition form into account.

Cheers,

					- Ted

P.S. Some people may hate this, but if the goal is interoperability
with how Windows and MacOS do things, this is basically what they do
as well. (Well, mostly; MacOS is a little weird for historical
reasons.)

P.P.S. And before you comment on it, as one Internationalization
expert once said, I18N *is* complicated. It truly would be easier to
teach all of the world to speak a single language and use it as the
"Federation Standard" language, ala Star Trek. For better or for
worse, that's not happening, and so we deal with the world as it is,
not as we would like it to be. :-)
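The "case-preserving, case-insensitive" behaviour described above can be
sketched as a toy in-memory directory. This is a rough Python model under
stated assumptions, not how ext4 or fs/unicode is implemented: the class
FoldingDirectory and the helper lookup_key are hypothetical names, and
the key uses NFD plus Python's str.casefold() as a stand-in for the
kernel's own folding tables.

```python
import unicodedata


def lookup_key(name: str) -> str:
    """Hypothetical comparison key: decompose, case-fold, decompose again."""
    decomposed = unicodedata.normalize("NFD", name)
    return unicodedata.normalize("NFD", decomposed.casefold())


class FoldingDirectory:
    """Toy directory: names are stored as given, matched by folded key."""

    def __init__(self) -> None:
        self._entries = {}  # folded key -> name exactly as it was created

    def create(self, name: str) -> None:
        key = lookup_key(name)
        if key in self._entries:
            # "makefile" and "Makefile" may not coexist.
            raise FileExistsError(self._entries[key])
        self._entries[key] = name

    def lookup(self, name: str):
        """Return the stored spelling of a matching entry, or None."""
        return self._entries.get(lookup_key(name))

    def listdir(self):
        # Listing shows the preserved spelling, never the folded key.
        return sorted(self._entries.values())
```

For example, after create("Makefile"), lookup("makefile") returns
"Makefile" and listdir() still shows "Makefile". The same holds for
composition: a name created with "n" plus U+0303 is found by a lookup
that uses U+00F1, and is listed in the form it was created with.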
* Re: why do we need utf8 normalization when compare name?
From: Matthew Wilcox @ 2020-03-02 12:54 UTC
To: lampahome; +Cc: linux-fsdevel

On Mon, Mar 02, 2020 at 05:00:24PM +0800, lampahome wrote:
> With case insensitivity (since kernel 5.2), d_compare will transform
> the string into normalized form and then compare.
>
> But why do we need this normalization function? Could we just compare
> the utf8 strings directly?

Have you read https://en.wikipedia.org/wiki/Unicode_equivalence ?

We need to decide whether a user with a case-insensitive filesystem
who looks up a file with the name U+00E5 (lower case "a" with ring)
should find a file which is named U+00C5 (upper case "A" with ring)
or U+212B (Angstrom sign).

Then there's the question of whether e-acute is stored as U+00E9
or U+0065 followed by U+0301, and both of those will need to be found
by a user search for U+00C9 or a user searching for U+0045 U+0301.

So yes, normalisation needs to be done.
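The code points listed above can be checked with a short Python sketch.
It assumes one particular policy (NFD combined with Python's
str.casefold()); the kernel uses its own tables, so treat the result as
an illustration of the question rather than a statement of what any
filesystem does. The names comparison_key and same_name are invented for
this example.

```python
import unicodedata


def comparison_key(name: str) -> str:
    """Assumed policy: NFD, then case-fold, then NFD again."""
    decomposed = unicodedata.normalize("NFD", name)
    return unicodedata.normalize("NFD", decomposed.casefold())


def same_name(a: str, b: str) -> bool:
    """True if the two names match under the assumed policy."""
    return comparison_key(a) == comparison_key(b)


# The three "a with ring" spellings from the message above.
a_ring_names = ["\u00e5", "\u00c5", "\u212b"]

# The four e-acute spellings: composed and decomposed, lower and upper case.
e_acute_names = ["\u00e9", "e\u0301", "\u00c9", "E\u0301"]

a_ring_all_match = all(same_name(a_ring_names[0], n) for n in a_ring_names)
e_acute_all_match = all(same_name(e_acute_names[0], n) for n in e_acute_names)
```

Under this policy every name within a group matches every other name in
that group, while a plain "a" still does not match "a with ring".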
* Re: why do we need utf8 normalization when compare name?
From: Al Viro @ 2020-03-02 15:28 UTC
To: Matthew Wilcox; +Cc: lampahome, linux-fsdevel

On Mon, Mar 02, 2020 at 04:54:32AM -0800, Matthew Wilcox wrote:
> On Mon, Mar 02, 2020 at 05:00:24PM +0800, lampahome wrote:
> > With case insensitivity (since kernel 5.2), d_compare will transform
> > the string into normalized form and then compare.
> >
> > But why do we need this normalization function? Could we just compare
> > the utf8 strings directly?
>
> Have you read https://en.wikipedia.org/wiki/Unicode_equivalence ?
>
> We need to decide whether a user with a case-insensitive filesystem
> who looks up a file with the name U+00E5 (lower case "a" with ring)
> should find a file which is named U+00C5 (upper case "A" with ring)
> or U+212B (Angstrom sign).
>
> Then there's the question of whether e-acute is stored as U+00E9
> or U+0065 followed by U+0301, and both of those will need to be found
> by a user search for U+00C9 or a user searching for U+0045 U+0301.
>
> So yes, normalisation needs to be done.

Why the hell do we need case-insensitive filesystems in the first place?
I have only heard two explanations:
	1) because the layout (including name equivalences) is fixed by
some OS that happens to be authoritative for that filesystem. In that
case we need to match the rules of that OS, whatever they are. Unicode
equivalence may be an interesting part of _their_ background reasons
for setting those rules, but the only thing that really matters is what
rules have they set.
	2) early Android used to include a memory card with VFAT on
it; the card is long gone, but crapplications came to rely upon having
that shit. And rather than giving them a file on the normal filesystem
with VFAT image on it and /dev/loop set up and mounted, somebody wants
to use parts of the normal (ext4) filesystem for it. However, the
same crapplications have come to rely upon the case-insensitive (sensu
VFAT) behaviour there, so we must duplicate that vomit-inducing pile
of hacks on ext4. Ideally - with that vomit-induc{ing,ed} pile
reclassified as a generic feature; those look more respectable.

(1) is reasonable enough, but belongs in specific weird filesystems.
(2) is, IMO, a bad joke.

Does anybody know of any other reasons?
* Re: why do we need utf8 normalization when compare name?
From: Matthew Wilcox @ 2020-03-02 17:14 UTC
To: Al Viro; +Cc: lampahome, linux-fsdevel

On Mon, Mar 02, 2020 at 03:28:18PM +0000, Al Viro wrote:
> Why the hell do we need case-insensitive filesystems in the first place?
> I have only heard two explanations:
> 	1) because the layout (including name equivalences) is fixed by
> some OS that happens to be authoritative for that filesystem. In that
> case we need to match the rules of that OS, whatever they are. Unicode
> equivalence may be an interesting part of _their_ background reasons
> for setting those rules, but the only thing that really matters is what
> rules have they set.
> 	2) early Android used to include a memory card with VFAT on
> it; the card is long gone, but crapplications came to rely upon having
> that shit. And rather than giving them a file on the normal filesystem
> with VFAT image on it and /dev/loop set up and mounted, somebody wants
> to use parts of the normal (ext4) filesystem for it. However, the
> same crapplications have come to rely upon the case-insensitive (sensu
> VFAT) behaviour there, so we must duplicate that vomit-inducing pile
> of hacks on ext4. Ideally - with that vomit-induc{ing,ed} pile
> reclassified as a generic feature; those look more respectable.
>
> (1) is reasonable enough, but belongs in specific weird filesystems.
> (2) is, IMO, a bad joke.
>
> Does anybody know of any other reasons?

I've heard it was primarily developed to help port an ecosystem known
for prioritising shipping-on-time over quality-of-code from Windows to
Linux. I'm not sure why a variant of #2 wasn't a solution they used.

I'm not a fan of having case-insensitive unicode tables in the kernel.
* Re: why do we need utf8 normalization when compare name?
From: Theodore Y. Ts'o @ 2020-03-02 18:12 UTC
To: Al Viro; +Cc: Matthew Wilcox, lampahome, linux-fsdevel

On Mon, Mar 02, 2020 at 03:28:18PM +0000, Al Viro wrote:
> Why the hell do we need case-insensitive filesystems in the first place?
> I have only heard two explanations:
> 	1) because the layout (including name equivalences) is fixed by
> some OS that happens to be authoritative for that filesystem. In that
> case we need to match the rules of that OS, whatever they are. Unicode
> equivalence may be an interesting part of _their_ background reasons
> for setting those rules, but the only thing that really matters is what
> rules have they set.

It significantly helps porting applications that were originally
written for Windows and/or MacOS. In particular, the work to add
Unicode comparison support to ext4 was funded to enable the ability to
run Windows gaming applications on Linux for Steam.

> 	2) early Android used to include a memory card with VFAT on
> it; the card is long gone, but crapplications came to rely upon having
> that shit. And rather than giving them a file on the normal filesystem
> with VFAT image on it and /dev/loop set up and mounted, somebody wants
> to use parts of the normal (ext4) filesystem for it. However, the
> same crapplications have come to rely upon the case-insensitive (sensu
> VFAT) behaviour there, so we must duplicate that vomit-inducing pile
> of hacks on ext4. Ideally - with that vomit-induc{ing,ed} pile
> reclassified as a generic feature; those look more respectable.

There are a number of reasons why a loop device is not sufficient;
there is a requirement to have selective sharing of application data
between other applications, which is done using user/group ownership.
For Android, previously, this was done using sdcardfs which was based
off of wrapfs. Wrapfs was the same base as unionfs, and as you may
recall, it had even more horrendous race issues, and more than once I
was asked to help to debug crashes when fsstress was run on different
views of sdcardfs. I've been strongly encouraging the push to
something more sane, but a requirement for this is case-folded
directories.

There is a third reason why case folding is necessary, which is for
file serving applications such as Samba which require case-folding,
and without file system level support, trying to simulate this in
userspace requires searching via readdir to do a case-insensitive
lookup (at least in some cases). So finding a more performant way to
support case-folding is a big help for applications like Samba.

Cheers,

					- Ted
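The userspace fallback mentioned above (scanning the directory to do a
case-insensitive lookup) looks roughly like the following Python sketch.
This is not Samba's code; find_entry, fold and demo are hypothetical
names, and the folding policy (NFD plus str.casefold()) is an assumption
made for the example. The point is only that every miss or mismatch in
spelling costs a walk over the directory entries.

```python
import os
import tempfile
import unicodedata


def fold(name: str) -> str:
    """Assumed folding policy for the example: NFD plus case-folding."""
    decomposed = unicodedata.normalize("NFD", name)
    return unicodedata.normalize("NFD", decomposed.casefold())


def find_entry(directory: str, wanted: str):
    """Scan the directory and return the on-disk name matching wanted."""
    target = fold(wanted)
    with os.scandir(directory) as entries:
        for entry in entries:
            # Each candidate has to be folded and compared in userspace.
            if fold(entry.name) == target:
                return entry.name
    return None


def demo():
    """Create "Makefile" in a scratch directory and look up "MAKEFILE"."""
    with tempfile.TemporaryDirectory() as tmp:
        open(os.path.join(tmp, "Makefile"), "w").close()
        return find_entry(tmp, "MAKEFILE")
```

The scan is linear in the number of directory entries, which is the cost
that file-system-level case folding is meant to avoid.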
Thread overview: 10+ messages
2020-03-02  9:00 why do we need utf8 normalization when compare name? lampahome
2020-03-02 10:37 ` Aleksa Sarai
2020-03-02 10:47   ` Aleksa Sarai
2020-03-03  1:48     ` lampahome
     [not found]       ` <20200303070928.aawxoyeq77wnc3ts@yavin>
2020-03-03 10:13         ` lampahome
2020-03-03 17:22           ` Theodore Y. Ts'o
2020-03-02 12:54 ` Matthew Wilcox
2020-03-02 15:28   ` Al Viro
2020-03-02 17:14     ` Matthew Wilcox
2020-03-02 18:12     ` Theodore Y. Ts'o