linux-fsdevel.vger.kernel.org archive mirror
From: "Pali Rohár" <pali.rohar@gmail.com>
To: David Laight <David.Laight@ACULAB.COM>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"Theodore Y. Ts'o" <tytso@mit.edu>,
	Namjae Jeon <linkinjeon@gmail.com>,
	Gabriel Krisman Bertazi <krisman@collabora.com>
Subject: Re: vfat: Broken case-insensitive support for UTF-8
Date: Mon, 20 Jan 2020 17:56:14 +0100
Message-ID: <20200120165614.yp3pukpj3ilq6nxp@pali>
In-Reply-To: <b42888a01c8847e48116873ebbbbb261@AcuMS.aculab.com>

On Monday 20 January 2020 16:43:21 David Laight wrote:
> From: Pali Rohár
> > Sent: 20 January 2020 16:27
> ...
> > > Unfortunately there is neither a 1:1 mapping of all possible byte sequences
> > > to wchar_t (or unicode code points),
> > 
> > I was talking about valid UTF-8 sequence (invalid, illformed is out of
> > game and for sure would always cause problems).
> 
> Except that they are always likely to happen.

As I wrote before, the Linux kernel does not allow such sequences, so
userspace gets an error when it tries to store garbage.

> I've been pissed off by programs crashing because they assume that
> a input string (eg an email) is UTF-8 but happens to contain a single
> 0xa3 byte in the otherwise 7-bit data.
> 
> The standard ought to have defined a translation for such sequences
> and just a 'warning' from the function(s) that unexpected bytes were
> processed.

The standard has an informative part describing how to replace the
invalid part of a sequence with the Unicode code point U+FFFD. So if you
need to "process any byte sequence as UTF-8", there is a standardized
way to convert it into one exact sequence of Unicode code points. This
is what email programs should do, and the non-broken ones already do.
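
A minimal sketch of that replacement, using Python's built-in UTF-8
decoder as a stand-in (this is userspace illustration, not kernel code,
and the sample string is made up):

```python
# 7-bit text containing a single stray 0xa3 byte, as in the example above.
raw = b"Total: \xa3 5"

# errors="replace" substitutes U+FFFD for the ill-formed byte instead of
# raising an exception, so every byte string maps to one exact string.
text = raw.decode("utf-8", errors="replace")

print(repr(text))
```

The same input always gives the same sequence of code points, so the
program neither crashes nor has to guess.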

> > > nor a 1:1 mapping of all possible wchar_t values to UTF-8.
> > 
This is not true. There is exactly one way to convert a sequence of
Unicode code points to UTF-8. UTF stands for Unicode Transformation
Format, and it defines exactly how Unicode is transformed.
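
As a small illustration (Python's codec as a stand-in, sample code
points chosen arbitrarily): encoding gives one byte sequence, decoding
gives the original code points back, and a non-shortest ("overlong")
form of the same code point is rejected as ill-formed.

```python
code_points = [0x41, 0xE9, 0x20AC]          # 'A', 'é', '€'

# Code points -> UTF-8: the encoder has no choice about the output.
encoded = "".join(chr(cp) for cp in code_points).encode("utf-8")

# UTF-8 -> code points: the round trip restores the input exactly.
decoded = [ord(ch) for ch in encoded.decode("utf-8")]

# 0xC0 0xAF is an overlong encoding of '/' (U+002F); a strict decoder
# must refuse it, so '/' keeps exactly one representation.
try:
    b"\xc0\xaf".decode("utf-8")
    overlong_accepted = True
except UnicodeDecodeError:
    overlong_accepted = False
```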
> 
> But a wchar_t can hold lots of values that aren't Unicode code points.
> Prior to the 2003 changes half of the 2^32 values could be converted.
> Afterwards only a small fraction.

wchar_t in the kernel can hold only a subset of Unicode code points, up
to U+FFFF (2^16-1).

Halves of surrogate pairs are not valid Unicode characters on their own,
but as stated they are used in MS FAT.

So anything which can be put into the kernel's wchar_t is valid for FAT.
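
To make the distinction concrete, here is a sketch that models a 16-bit
unit as a plain integer (the helper names are hypothetical, and Python's
strict UTF-8 encoder stands in for "valid Unicode"):

```python
def fits_in_u16(value):
    """True if the value can be stored in one 16-bit unit."""
    return 0 <= value <= 0xFFFF

def is_surrogate(value):
    """True for halves of surrogate pairs (U+D800..U+DFFF)."""
    return 0xD800 <= value <= 0xDFFF

def strict_utf8_ok(value):
    """True if a strict UTF-8 encoder accepts this code point."""
    try:
        chr(value).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False

lone_half = 0xD83D   # high half of a surrogate pair, on its own
```

`lone_half` fits in a 16-bit unit, so it can appear in a FAT name, but a
strict UTF-8 encoder refuses it.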

> 
> > If you have valid UTF-8 sequence then it describe one exact sequence of
> > Unicode code points. And if you have sequence (ordinals) of Unicode code
> > points there is exactly one and only one its representation in UTF-8.
> > 
> > I would suggest you to read Unicode standard, section 2.5 Encoding Forms.
> 
> That all assumes everyone is playing the correct game

And why should we not play the correct game? On input we have UTF and
internally we work with Unicode. Unicode code points do not leak from
the kernel, so we can play the correct game and assume that our code in
the kernel is correct (and if not, we can fix it). Plus, when
communicating with the outside world, we just check that the input data
are valid (which we already do for UTF-8 user input).

So I do not see any problem there.
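
The boundary check described above could look roughly like this. It is
only a sketch: `to_internal` is a hypothetical helper, the errno value
is an assumption for illustration, and Python's decoder stands in for
the kernel's own validation.

```python
import errno

def to_internal(name):
    """Validate a name at the boundary.

    Returns (0, list of code points) for well-formed UTF-8,
    or (-EINVAL, None) when the bytes are not valid UTF-8.
    """
    try:
        return 0, [ord(ch) for ch in name.decode("utf-8")]
    except UnicodeDecodeError:
        return -errno.EINVAL, None
```

Everything past this point works on code points only, so ill-formed
input never reaches the internal code.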

> > > Really both need to be defined - even for otherwise 'invalid' sequences.
> > >
> > > Even the 16-bit values above 0xd000 can appear on their own in
> > > windows filesystems (according to wikipedia).
> > 
> > If you are talking about UTF-16 (which is _not_ 16-bit as you wrote),
> > look at my previous email:
> 
> UTF-16 is a sequence of 16-bit values....

No, this is not true. In UTF-16 each code point is encoded either as one
16-bit code unit or as a pair of 16-bit code units (32 bits), with
additional restrictions. UTF-16 is a variable-length encoding.
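
A short sketch of that rule (`utf16_units` is a hypothetical helper
written for this mail, not an existing API):

```python
def utf16_units(cp):
    """Return the list of 16-bit code units encoding one code point."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp < 0x10000:
        return [cp]                      # one 16-bit unit
    v = cp - 0x10000                     # 20 bits, split 10 + 10
    return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]

one_unit = utf16_units(0x0041)           # 'A'
two_units = utf16_units(0x1F600)         # outside the 16-bit range
```

A code point up to U+FFFF takes one unit, anything above takes two, and
a lone value from the surrogate range is rejected.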

> It can contain 0xd000 to 0xffff (usually in pairs) but they aren't UTF-8 codepoints.
> 
> 	David
> 
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)

-- 
Pali Rohár
pali.rohar@gmail.com
