From: Linus Torvalds <torvalds@linux-foundation.org>
To: "Theodore Ts'o" <tytso@mit.edu>
Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	kernel@collabora.com, linux-ext4@vger.kernel.org,
	krisman@collabora.com
Subject: Re: [PATCH v4 00/23] Ext4 Encoding and Case-insensitive support
Date: Mon, 10 Dec 2018 11:35:17 -0800
Message-ID: <CAHk-=wiFtZL5rK3T-HQPm0oG4vekDJEKS47P8BbzHSXt_6SHuA@mail.gmail.com>
In-Reply-To: <20181210000822.GD1840@mit.edu>

On Sun, Dec 9, 2018 at 4:08 PM Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> So things are much better in recent years.  In the past it was kind of
> a disaster, but the world is converging enough that the latest
> versions of Mac OS X's APFS and Windows NTFS behave pretty much the same
> way.  They are both case-insensitive, case-preserving and
> normalization-preserving, normalization-insensitive with respect to
> filenames.

Oh, so APFS at least fixed *that* horrific problem with their
filesystem. Oh how I despised the exposure of NFD (which should at
most be used as an internal representation, not externally visible).
Turning basic letters (coming from Finland, åäö) into character
combinations was an absolute abomination.

> In the bad old-days, MacOS X's HFS+ was not normalization-preserving.

Oh, I'm very aware.

It's not even that it wasn't normalization-preserving, it picked the
*wrong* normalization to use.

> Now, both file systems basically say, "we don't care whether you pass
> in U+212B or U+0041,U+030A; on the screen it looks identical, Å, so we
> will treat it as the same filename; but readdir(2) will return what
> you gave us."

Actually, the "on the screen it will look identical" is a horribly
incorrect thing to do too.

There are lots of things that look identical on the screen without
being at all the same thing. Sometimes it depends on font, sometimes
it's just how it is. A nonbreaking space is *not* the same as a
regular space, even if they may look identical on the screen.

I suspect (and sincerely _hope_) neither filesystem actually does
anything as stupid as taking "glyph equivalence" into account.

I'm hoping it's just "convert to NFx, then lower-case, then compare
for equality". Where the 'x' doesn't much matter as long as it is
never _exposed_ in any way outside of the comparison (ie NFD is a fine
and probably simpler model for the lower-casing, the HFS+ mistake was
to then expose the corrupted form of the filename).
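
A minimal userspace sketch of that "normalize, lower-case, compare"
model, under loud assumptions: the decomposition table is a toy that
covers only the code points mentioned in this thread, the lower-casing
is ASCII-only, and the canonical reordering of combining marks that real
NFD requires is left out. All names (toy_decomp, toy_fold,
toy_names_equal) are made up for illustration. The point it shows is
that the folded form lives only in a private buffer and the caller's
name is never rewritten.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy canonical decomposition table (assumption: three entries only).
 * A real table is generated from the Unicode Character Database. */
struct decomp {
	uint32_t cp;
	uint32_t out[2];
	size_t n;
};

static const struct decomp toy_decomp[] = {
	{ 0x212B, { 0x0041, 0x030A }, 2 },	/* ANGSTROM SIGN */
	{ 0x00C5, { 0x0041, 0x030A }, 2 },	/* A WITH RING ABOVE */
	{ 0x00E5, { 0x0061, 0x030A }, 2 },	/* a WITH RING ABOVE */
};

/* Toy lower-casing: ASCII letters only. */
static uint32_t toy_tolower(uint32_t cp)
{
	return (cp >= 'A' && cp <= 'Z') ? cp + ('a' - 'A') : cp;
}

/* Decompose and lower-case src into the private buffer dst.
 * Returns the folded length, or (size_t)-1 if dst is too small. */
static size_t toy_fold(const uint32_t *src, size_t len,
		       uint32_t *dst, size_t cap)
{
	size_t n = 0;

	for (size_t i = 0; i < len; i++) {
		const struct decomp *d = NULL;

		for (size_t j = 0; j < sizeof(toy_decomp) / sizeof(toy_decomp[0]); j++)
			if (toy_decomp[j].cp == src[i])
				d = &toy_decomp[j];

		if (d) {
			for (size_t k = 0; k < d->n; k++) {
				if (n == cap)
					return (size_t)-1;
				dst[n++] = toy_tolower(d->out[k]);
			}
		} else {
			if (n == cap)
				return (size_t)-1;
			dst[n++] = toy_tolower(src[i]);
		}
	}
	return n;
}

/* Compare two names (as code point arrays) for folded equality.
 * The inputs are const: the folded form never escapes this function. */
static bool toy_names_equal(const uint32_t *a, size_t alen,
			    const uint32_t *b, size_t blen)
{
	uint32_t fa[64], fb[64];
	size_t na = toy_fold(a, alen, fa, 64);
	size_t nb = toy_fold(b, blen, fb, 64);

	if (na == (size_t)-1 || nb == (size_t)-1 || na != nb)
		return false;
	for (size_t i = 0; i < na; i++)
		if (fa[i] != fb[i])
			return false;
	return true;
}
```

With this toy table, U+212B, the sequence U+0041,U+030A and U+00E5 all
compare equal to each other, while a plain U+0041 does not match any of
them.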

> It's been a *long* time since Unicode has changed case folding rules
> for pre-existing characters.  The tables have only changed with
> respect to new character sets that have been added.

But new characters _have_ been added, and some of them do have
lower-case forms, so the folding tables have changed.

Happily, maybe that is over. As long as the Unicode people continue to
mainly play with their Emoji list, I guess we can consider it done.

> So how about this?  We'll put the unicode handling functions in a new
> directory, fs/unicode, just to make it really clear that this will not
> be changing any of the legacy fs/nls functions which other file
> systems will use.  By putting it in a separate directory, it will be
> easier for other file systems to use it, whether it's for better Samba
> or NFSv4 support.

Ok, that sounds fine.

Some of the unicode translation functions from the NLS code could well
move into that, and NLS itself could be relegated to the sad
historical thing.

And please try to make the *interfaces* sane.

For example, the interface for "let's compare with folded case" should
*not* be about "convert to NFKD and lower case into a temp buffer,
then compare the results".

You can do a lot of "let's handle the simple cases" faster even if the
"oh, I hit a complex character" case might then become one of those
"convert to a temp buffer" cases.

And it shouldn't be about C strings, since we very much have cases
where it's not a C string but a {ptr,len} tuple. Maybe even use the
"struct qstr", which is a not-horrible way to pass those around.

Even if you have a C string, you can always just do

        struct qstr str = QSTR_INIT(name, strlen(name));

and then pass that qstr pointer around.
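
To illustrate why a {ptr,len} tuple matters, here is a small userspace
sketch. "struct lenstr" and its helpers are hypothetical stand-ins, not
the kernel's struct qstr (which also carries a hash). It shows a path
component being referenced in place, without copying it and without
needing a NUL terminator after it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace stand-in for a {ptr,len} name. */
struct lenstr {
	const char *name;
	size_t len;
};

/* Wrap an ordinary C string; the only place strlen() is needed. */
static struct lenstr lenstr_from_cstr(const char *s)
{
	struct lenstr r = { .name = s, .len = strlen(s) };
	return r;
}

/* Reference the first component of a path in place. The result points
 * into 'path' and is followed by '/' (or the end), not by a NUL. */
static struct lenstr first_component(const char *path, size_t len)
{
	size_t end = 0;

	while (end < len && path[end] != '/')
		end++;
	return (struct lenstr){ .name = path, .len = end };
}

/* Exact byte-wise equality on two {ptr,len} names. */
static bool lenstr_eq(const struct lenstr *a, const struct lenstr *b)
{
	return a->len == b->len && memcmp(a->name, b->name, a->len) == 0;
}
```

Here first_component("README/rest", 11) yields a 6-byte name that
compares equal to lenstr_from_cstr("README"), even though the byte after
it is '/' rather than a terminator.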

Finally, don't do the NLS thing with "descriptors" that you register
and look up. The indirection kills you. Particularly the crazy "one
character at a time" model.

Just let people explicitly say "utf8_icasecmp(qstr, qstr)" or
something like that.  With the interface at least allowing for the
common simple cases (ie everything is in the ASCII subset) to be
handled basically as a specialized thing.
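
A rough userspace sketch of what such an interface could look like,
with the ASCII case handled inline. Everything here is an assumption
made for illustration: "struct lenstr" stands in for struct qstr,
utf8_icasecmp_sketch() is not an existing kernel function, and
slow_icasecmp() is only a placeholder that compares bytes exactly where
a real implementation would normalize and case-fold.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace stand-in for a {ptr,len} name. */
struct lenstr {
	const char *name;
	size_t len;
};

static int ascii_lower(unsigned char c)
{
	return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

/* PLACEHOLDER slow path: exact byte comparison only. A real version
 * would normalize and fold both names here. Returns 0 on a match. */
static int slow_icasecmp(const struct lenstr *a, const struct lenstr *b)
{
	return !(a->len == b->len && memcmp(a->name, b->name, a->len) == 0);
}

/* Returns 0 if the names match ignoring case, nonzero otherwise.
 * Pure-ASCII names never leave this function and never use a temp
 * buffer; the first non-ASCII byte hands off to the slow path. */
static int utf8_icasecmp_sketch(const struct lenstr *a, const struct lenstr *b)
{
	size_t n = a->len < b->len ? a->len : b->len;
	const struct lenstr *longer = a->len > b->len ? a : b;

	for (size_t i = 0; i < n; i++) {
		unsigned char ca = (unsigned char)a->name[i];
		unsigned char cb = (unsigned char)b->name[i];

		if ((ca | cb) & 0x80)
			return slow_icasecmp(a, b);
		if (ascii_lower(ca) != ascii_lower(cb))
			return 1;
	}

	/* Common prefix matched; a non-ASCII tail still needs the slow path. */
	for (size_t i = n; i < longer->len; i++)
		if ((unsigned char)longer->name[i] & 0x80)
			return slow_icasecmp(a, b);

	return a->len != b->len;
}
```

With this sketch "README.txt" and "readme.TXT" compare equal without
touching the slow path, and no descriptor lookup or per-character
indirect call is involved.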

                    Linus
