linux-fsdevel.vger.kernel.org archive mirror
From: "Pali Rohár" <pali.rohar@gmail.com>
To: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	"Theodore Y. Ts'o" <tytso@mit.edu>,
	OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>,
	Namjae Jeon <linkinjeon@gmail.com>,
	Gabriel Krisman Bertazi <krisman@collabora.com>
Subject: vfat: Broken case-insensitive support for UTF-8
Date: Sun, 19 Jan 2020 23:14:55 +0100
Message-ID: <20200119221455.bac7dc55g56q2l4r@pali>

Hello!

I have looked more deeply at how the vfat kernel code handles UTF-8
encoding, and I found that case-insensitivity is broken, or rather not
implemented at all.

The fat_fill_super() function already has a FIXME comment about this
problem:

	/* FIXME: utf8 is using iocharset for upper/lower conversion */
	if (sbi->options.isvfat) {
		sbi->nls_io = load_nls(sbi->options.iocharset);

Basically, vfat always loads an NLS table, which is used for the
strnicmp and tolower functions. When no iocharset is specified, the
default (iso8859-1) is used. This also applies when the utf8=1 mount
option is specified. Also note that the kernel's utf8 NLS table does not
implement the toupper/tolower functions (the kernel's NLS API does not
support tolower/toupper for non-fixed-8bit encodings like UTF-8).

So when UTF-8 on VFS is enabled for VFAT, the VFS <--> VFAT conversion
uses the utf16s_to_utf8s() and utf8s_to_utf16s() functions. But the
fat_name_match(), vfat_hashi() and vfat_cmpi() functions use the NLS
table (default iso8859-1) with nls_strnicmp() and nls_tolower().

Which means that fat_name_match(), vfat_hashi() and vfat_cmpi() are
broken for vfat in UTF-8 mode.

I was thinking about how to fix it, and the only possible way is to
write a uni_tolower() function which takes one Unicode code point and
returns the lowercase form of that code point. We cannot do any Unicode
normalization, as the VFAT specification does not say anything about it
and the MS reference implementation, fastfat.sys, does not do it either.

So, what would be the best option for implementing that function?

  unicode_t uni_tolower(unicode_t u);

Could the new fs/unicode code help with this? Or is it too tied to NFD
normalization and therefore cannot easily be used or extended?

The new exfat code, which is under review and will hopefully be merged,
contains its own Unicode upcase table (as defined by the exfat
specification). Since exfat is similar to FAT32, maybe reusing it would
be a better option?
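
To make the question about uni_tolower() more concrete, here is a minimal
userspace sketch of its shape. It is only an illustration: the unicode_t
typedef is a stand-in for the kernel type, and the handled ranges (ASCII,
Latin-1 Supplement, the start of Latin Extended-A) are my assumption of a
tiny subset. A real implementation would need a complete table, for example
one derived from an upcase table.

```c
#include <stdint.h>

/* Stand-in for the kernel's unicode_t so the sketch builds in userspace. */
typedef uint32_t unicode_t;

/*
 * Illustrative only: maps one code point to lowercase for a small,
 * hand-picked subset of Unicode. No normalization is performed.
 */
unicode_t uni_tolower(unicode_t u)
{
	/* ASCII 'A'..'Z' */
	if (u >= 0x41 && u <= 0x5a)
		return u + 0x20;
	/* Latin-1 U+00C0..U+00DE, except U+00D7 MULTIPLICATION SIGN */
	if (u >= 0xc0 && u <= 0xde && u != 0xd7)
		return u + 0x20;
	/* Latin Extended-A U+0100..U+012F: even = upper, odd = lower */
	if (u >= 0x100 && u <= 0x12f && !(u & 1))
		return u + 1;
	/* Everything else is returned unchanged in this sketch. */
	return u;
}
```

With such a function, 'Č' (U+010C) and 'č' (U+010D) both map to U+010D, so
a comparison done on code points rather than on bytes would treat them as
the same name.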


========================================================================

Proof that vfat in UTF-8 mode is broken and must be fixed:

$ mount | grep /mnt/fat
/tmp/fat2 on /mnt/fat type vfat
(rw,relatime,uid=1000,gid=1000,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,utf8,errors=remount-ro)
$ ll /mnt/fat/
total 1
drwxr-xr-x 2 pali pali 512 Jan 19 22:50 ./
drwxrwxrwt 4 root root  80 Jan 19 22:45 ../
$ touch /mnt/fat/č
$ ll /mnt/fat/
total 1
drwxr-xr-x 2 pali pali 512 Jan 19 22:50 ./
drwxrwxrwt 4 root root  80 Jan 19 22:45 ../
-rwxr-xr-x 1 pali pali   0 Jan 19 22:50 č*
$ touch /mnt/fat/Č
$ ll /mnt/fat/
total 1
drwxr-xr-x 2 pali pali 512 Jan 19 22:50 ./
drwxrwxrwt 4 root root  80 Jan 19 22:45 ../
-rwxr-xr-x 1 pali pali   0 Jan 19 22:50 Č*
-rwxr-xr-x 1 pali pali   0 Jan 19 22:50 č*
$ touch /mnt/fat/d
$ ll /mnt/fat/
total 1
drwxr-xr-x 2 pali pali 512 Jan 19 22:50 ./
drwxrwxrwt 4 root root  80 Jan 19 22:45 ../
-rwxr-xr-x 1 pali pali   0 Jan 19 22:50 d*
-rwxr-xr-x 1 pali pali   0 Jan 19 22:50 Č*
-rwxr-xr-x 1 pali pali   0 Jan 19 22:50 č*
$ touch /mnt/fat/D
$ ll /mnt/fat/
total 1
drwxr-xr-x 2 pali pali 512 Jan 19 22:50 ./
drwxrwxrwt 4 root root  80 Jan 19 22:45 ../
-rwxr-xr-x 1 pali pali   0 Jan 19 22:51 d*
-rwxr-xr-x 1 pali pali   0 Jan 19 22:50 Č*
-rwxr-xr-x 1 pali pali   0 Jan 19 22:50 č*

As you can see, lowercase 'd' and uppercase 'D' are treated as the same
name, but lowercase 'č' and uppercase 'Č' are not. This is because 'č'
is the two-byte sequence 0xc4 0x8d in UTF-8 ('Č' is 0xc4 0x8c) and the
comparison is done with the Latin1 table. 0xc4 is 'Ä' in Latin1, which
is already uppercase, and 0x8d is a control character, so it is not
changed by the tolower/toupper functions. The second bytes therefore
never compare equal.
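
The effect can be reproduced in userspace with a small sketch. The
functions latin1_tolower() and latin1_strnicmp() below are hypothetical
stand-ins that I wrote for illustration. They only imitate a byte-wise
comparison through a single-byte Latin1 table and are not the kernel's NLS
code.

```c
#include <stddef.h>

/* Latin1 (ISO 8859-1) lowercase of one byte, as a single-byte table would do. */
unsigned char latin1_tolower(unsigned char c)
{
	if (c >= 'A' && c <= 'Z')
		return c + 0x20;
	/* 0xC0..0xDE are uppercase letters, except 0xD7 (multiplication sign) */
	if (c >= 0xc0 && c <= 0xde && c != 0xd7)
		return c + 0x20;
	return c;
}

/* Byte-wise case-insensitive compare; returns 0 on match, 1 on mismatch. */
int latin1_strnicmp(const unsigned char *a, const unsigned char *b, size_t len)
{
	while (len--) {
		if (latin1_tolower(*a++) != latin1_tolower(*b++))
			return 1;
	}
	return 0;
}

/* UTF-8 encodings of U+010D (lowercase) and U+010C (uppercase) */
const unsigned char c_caron_lower[] = { 0xc4, 0x8d };
const unsigned char c_caron_upper[] = { 0xc4, 0x8c };
```

Comparing "d" with "D" through latin1_strnicmp() reports a match, while
comparing c_caron_lower with c_caron_upper does not, because 0x8d and 0x8c
are left untouched by the Latin1 table.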

A bigger problem can occur with the U+39FF code point. In UTF-8 it is
encoded as the bytes 0xe3 0xa7 0xbf (ã§¿ when read as Latin1). If you
convert it with the Latin1 uppercase table you get Ã§¿ (bytes 0xc3 0xa7
0xbf). The first two bytes are the valid UTF-8 sequence for the
character ç = U+00E7.

Therefore U+39FF and U+00E7 may in some cases be treated as the same
character differing only in case (when comparing just prefixes), which
is completely wrong.
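
This prefix collision can also be sketched in userspace. The bytes 0xe3
0xa7 0xbf decode to U+39FF, and 0xc3 0xa7 is the UTF-8 encoding of U+00E7.
The helpers latin1_toupper() and latin1_prefix_equal_nocase() are
hypothetical names used only for this illustration of a byte-wise Latin1
comparison.

```c
#include <stddef.h>

/* Latin1 (ISO 8859-1) uppercase of one byte. */
unsigned char latin1_toupper(unsigned char c)
{
	if (c >= 'a' && c <= 'z')
		return c - 0x20;
	/* 0xE0..0xFE are lowercase letters, except 0xF7 (division sign) */
	if (c >= 0xe0 && c <= 0xfe && c != 0xf7)
		return c - 0x20;
	return c;
}

/* Returns 1 if the first len bytes match after Latin1 uppercasing, else 0. */
int latin1_prefix_equal_nocase(const unsigned char *a,
			       const unsigned char *b, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		if (latin1_toupper(a[i]) != latin1_toupper(b[i]))
			return 0;
	}
	return 1;
}

/* UTF-8 bytes of U+39FF and of U+00E7 */
const unsigned char u39ff[] = { 0xe3, 0xa7, 0xbf };
const unsigned char u00e7[] = { 0xc3, 0xa7 };
```

Here latin1_prefix_equal_nocase(u39ff, u00e7, 2) reports a match, although
the two byte strings belong to completely unrelated characters.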

-- 
Pali Rohár
pali.rohar@gmail.com
