All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Pali Rohár" <pali.rohar@gmail.com>
To: Gabriel Krisman Bertazi <krisman@collabora.com>
Cc: tytso@mit.edu, linux-fsdevel@vger.kernel.org,
	linux-ext4@vger.kernel.org, sfrench@samba.org,
	darrick.wong@oracle.com, samba-technical@lists.samba.org,
	jlayton@kernel.org, bfields@fieldses.org, paulus@samba.org
Subject: Re: [PATCH RFC v5 00/11] Ext4 Encoding and Case-insensitive support
Date: Wed, 6 Feb 2019 09:47:52 +0100	[thread overview]
Message-ID: <20190206084752.nwjkeiixjks34vao@pali> (raw)
In-Reply-To: <8736p2jbov.fsf@collabora.com>

On Tuesday 05 February 2019 14:08:00 Gabriel Krisman Bertazi wrote:
> Pali Rohár <pali.rohar@gmail.com> writes:
> 
> > On Monday 28 January 2019 16:32:12 Gabriel Krisman Bertazi wrote:
> >> The main change presented here is a proposal to migrate the
> >> normalization method from NFKD to NFD.  After our discussions, and
> >> reviewing other operating systems and languages aspects, I am more
> >> convinced that canonical decomposition is more viable solution than
> >> compatibility decomposition, because it doesn't ignore eliminate any
> >> semantic meaning, like the definitive case of superscript numbers.  NFD
> >> is also the documented method used by HFS+ and APFS, so there is
> >> precedent. Notice however, that as far as my research goes, APFS doesn't
> >> completely follows NFD, and in some cases, like <compat> flags, it
> >> actually does NFKD, but not in others (<fraction>), where it applies the
> >> canonical form.  We take a more consistent approach and always do plain NFD.
> >> 
> >> This RFC, therefore, aims to resume/start conversation with some
> >> stalkeholders that may have something to say regarding the normalization
> >> method used.  I added people from SMB, NFS and FS development who
> >> might be interested on this.
> >
> > Hello! I think that choice of NFD normalization is not right decision.
> > Some reasons:
> >
> > 1) NFD is not widely used. Even Apple does not use it (as you wrote
> >    Apple has own normalization form).
> 
> To be exact, Apple claims to use NFD in their specification [1] .

Interesting...

> What I
> observed is that they don't ignore some types of compatibility
> characters correctly as they should. For instance, the ff ligature is
> decomposed into f + f.

I'm sure that Apple does not do NFD, but their own invented normal form.
Some graphemes are decomposed, and some not.

> > 2) All filesystems which I known either do not use any normalization or
> >    use NFC.
> > 3) Lot of existing Linux application generate file names in NFC.
> >
> 
> Most do use NFC.  But this is an internal representation for ext4 and it
> is name preserving.

Ok. I was in impression that it does not preserve original names, just
like implementation in Apple's system, where char* passed to creat()
does not appear in readdir().

> We only use the normalization when comparing if names
> matches and to calculate dcache and dx hashes.  The unicode standard
> recomends the D forms for internal representation.

Ok, this should be less destructive and less visible to userspace.

> > 4) Linux GUI libraries like Qt and Gtk generate strings from key strokes
> >    in NFC. So if user type file name in Qt/Gtk box it would be in NFC.
> >
> > So why to use NFD in ext4 filesystem if Linux userspace ecosystem
> > already uses NFC?
> 
> NFC is costlier to calculate, usually requiring an intermediate NFD
> step.  Whether it is prohibitively expensive to do in the dcache path, I
> don't know, but since it is a critical path, any gain matters.
> 
> > NFD here just makes another layer of problems, unexpected things and
> > make it somehow different.
> 
> Is there any case where
>    NFC(x) == NFC(y) && NFD(x) != NFD(y)   , or
>    NFC(x) != NFC(y) && NFD(x) == NFD(y)

This is good question. And I think we should get definite answer for it
prior inclusion of normalization into kernel.

> I am having a hard time thinking of an example.  This is the main
> (only?) scenario where choosing C or D form for an internal
> representation would affect userspace.

For decision between normal format, probably yes.

> >
> > Why not rather choose NFS? It would be more compatible with Linux GUI
> > applications and also with Microsoft Windows systems, which uses NFC
> > too.
> >
> > Please, really consider to not use NFD. Most Linux applications really
> > do not do any normalization or do NFC. And usage of decomposition form
> > for application which do not implement full Unicode grapheme algorithms
> > just make for them another problems.
> 
> > Yes, there are still lot of legacy application which expect that one
> > code point = one visible symbol (therefore one Unicode grapheme). And
> > because GUI in most cases generates NFC strings, also existing file
> > names are in NFC, these application works in most cases without problem.
> > Force usage of NFD filenames just break them.
> 
> As I said, this shouldn't be a problem because what the application
> creates and retrieves is the exact name that was used before, we'd
> only use this format for internal metadata on the disk (hashes) and for
> in-kernel comparisons.

There is another problem for userspace applications:

Currently ext4 accepts as file name any sequence of bytes which do not
contain nul byte and '/'. So having Latin1 file name is perfectly
correct.

What would happen if userspace application want to create following two
file names? "\xDF" and "\F0"? First one is sharp S second one is eth (in
Latin1). But file names are invalid UTF-8 sequences. Is it disallowed to
create such file names? Or both file names are internally converted to
"U+FFFD" (replacement character) and because NFD(first U+FFFD) ==
NFD(second U+FFFD) only first file would be created?

And what happen in general with invalid UTF-8 sequences? Because there
are many different types of invalid UTF-8 sequences, like non-shortest
sequence for valid code point, valid sequence for invalid code points
(either surrogate pairs code points, or code points above U+10FFFF,
...), incorrect byte which should start new code point, incorrect byte
when decoding of code point started, ...

Different (userspace) application handles these invalid UTF-8 sequences
differently, some of them accept some kind of "incorrectness" (e.g.
non-shortest form of code point representation), some not. Some
applications replace invalid parts of UTF-8 sequence by sequence of
UTF-8 replacement character, some not. Also it can be observed that some
applications use just one replacement characters and some other replace
invalid UTF-8 sequence by more replacement characters.

So trying to "recover" from invalid UTF-8 sequence to valid one is done
in more ways... And usage of any existing way could cause problems...
E.g. not possible to create two files "\xDF\xF0" and "\xF0\xDF"...

> > (PS: I think that only 2 programming languages implements Unicode
> > grapheme algorithms correctly: Elixir and Perl 6; which is not so
> > much)
> 
> [1] https://developer.apple.com/support/apple-file-system/Apple-File-System-Reference.pdf
> 

-- 
Pali Rohár
pali.rohar@gmail.com

  reply	other threads:[~2019-02-06  8:47 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-01-28 21:32 [PATCH RFC v5 00/11] Ext4 Encoding and Case-insensitive support Gabriel Krisman Bertazi
2019-01-28 21:32 ` [PATCH RFC v5 01/11] unicode: Add unicode character database files Gabriel Krisman Bertazi
2019-01-28 21:32 ` [PATCH RFC v5 02/11] scripts: add trie generator for UTF-8 Gabriel Krisman Bertazi
2019-01-28 21:32 ` [PATCH RFC v5 03/11] unicode: Introduce code for UTF-8 normalization Gabriel Krisman Bertazi
2019-01-28 21:32 ` [PATCH RFC v5 04/11] unicode: reduce the size of utf8data[] Gabriel Krisman Bertazi
2019-01-28 21:32 ` [PATCH RFC v5 05/11] unicode: Implement higher level API for string handling Gabriel Krisman Bertazi
2019-01-28 21:32 ` [PATCH RFC v5 06/11] unicode: Introduce test module for normalized utf8 implementation Gabriel Krisman Bertazi
2019-01-28 21:32 ` [PATCH RFC v5 07/11] MAINTAINERS: Add Unicode subsystem entry Gabriel Krisman Bertazi
2019-01-28 21:32 ` [PATCH RFC v5 08/11] ext4: Include encoding information in the superblock Gabriel Krisman Bertazi
2019-01-28 21:32 ` [PATCH RFC v5 09/11] ext4: Support encoding-aware file name lookups Gabriel Krisman Bertazi
2019-01-28 21:32 ` [PATCH RFC v5 10/11] ext4: Implement EXT4_CASEFOLD_FL flag Gabriel Krisman Bertazi
2019-01-28 21:32 ` [PATCH RFC v5 11/11] docs: ext4.rst: Document encoding and case-insensitive Gabriel Krisman Bertazi
2019-01-29 16:54 ` [PATCH RFC v5 00/11] Ext4 Encoding and Case-insensitive support J. Bruce Fields
2019-02-05 18:10 ` Pali Rohár
2019-02-05 19:08   ` Gabriel Krisman Bertazi
2019-02-06  8:47     ` Pali Rohár [this message]
2019-02-06 16:04       ` Gabriel Krisman Bertazi
2019-02-06 16:43         ` Pali Rohár
2019-02-19 19:04 ` Gabriel Krisman Bertazi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190206084752.nwjkeiixjks34vao@pali \
    --to=pali.rohar@gmail.com \
    --cc=bfields@fieldses.org \
    --cc=darrick.wong@oracle.com \
    --cc=jlayton@kernel.org \
    --cc=krisman@collabora.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=paulus@samba.org \
    --cc=samba-technical@lists.samba.org \
    --cc=sfrench@samba.org \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.