From: "Pali Rohár" <>
To: "Theodore Y. Ts'o" <>
Cc: OGAWA Hirofumi <>,,,
	Namjae Jeon <>,
	Gabriel Krisman Bertazi <>
Subject: Re: vfat: Broken case-insensitive support for UTF-8
Date: Mon, 20 Jan 2020 18:56:10 +0100	[thread overview]
Message-ID: <20200120175610.md2nu7f7qe2ekgly@pali> (raw)
In-Reply-To: <>

On Monday 20 January 2020 12:32:15 Theodore Y. Ts'o wrote:
> On Mon, Jan 20, 2020 at 01:04:42PM +0900, OGAWA Hirofumi wrote:
> > 
> > To be perfect, the table would have to emulate what Windows uses. It
> > can be the Unicode standard, or something else. And other filesystems
> > can use something different from what Windows uses.
> The big question is *which* version of Windows.  vfat has been in use
> for over two decades, and vfat predates Windows starting to use Unicode
> in 2001.  Before that, vfat would have been using whatever code page
> its local Windows installation was set to use; and I'm not sure if
> there was space in the FAT headers to indicate the codepage in use.

VFAT is an extension to FAT which stores file names in UTF-16. In the
original FAT without the VFAT extension (in all variants: FAT12, FAT16
and FAT32) the file name is stored "according to current 8bit OEM code
page". A VFAT-aware FAT implementation knows whether a particular
filename is really VFAT (UTF-16) or non-VFAT (8bit OEM code page). There
are flags in FAT which indicate whether an entry is VFAT (UTF-16).
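
To illustrate the distinction, here is a minimal Python sketch of how a
reader could tell a VFAT long-name slot from a plain 8.3 entry. It assumes
the 32-byte directory entry layout from the published FAT specification
(attribute byte at offset 11, value 0x0F for long-name slots). All helper
names and the sample entries are hypothetical and built only for
illustration; this is not how vfat.ko is written.

```python
import struct

ATTR_LONG_NAME = 0x0F        # read-only | hidden | system | volume-id
ATTR_LONG_NAME_MASK = 0x3F   # all six defined attribute bits


def is_vfat_entry(entry: bytes) -> bool:
    """True if this 32-byte directory entry is a VFAT long-name slot."""
    return (entry[11] & ATTR_LONG_NAME_MASK) == ATTR_LONG_NAME


def lfn_units(entry: bytes) -> list:
    """Return the 13 UTF-16 code units stored in one long-name slot."""
    raw = entry[1:11] + entry[14:26] + entry[28:32]
    return list(struct.unpack("<13H", raw))


def lfn_name(entry: bytes) -> str:
    """Decode one slot, stopping at the 0x0000 terminator if present."""
    units = lfn_units(entry)
    if 0 in units:
        units = units[:units.index(0)]
    raw = struct.pack(f"<{len(units)}H", *units)
    # surrogatepass: the on-disk data is not guaranteed to be valid UTF-16
    return raw.decode("utf-16-le", "surrogatepass")


def short_name(entry: bytes, oem_codepage: str) -> str:
    """Decode an 8.3 entry; the code page is NOT recorded on disk."""
    base = entry[0:8].decode(oem_codepage).rstrip(" ")
    ext = entry[8:11].decode(oem_codepage).rstrip(" ")
    return f"{base}.{ext}" if ext else base


def make_lfn_slot(name: str) -> bytes:
    """Build a single sample slot (hypothetical helper, checksum left 0)."""
    units = [ord(c) for c in name] + [0x0000]
    units += [0xFFFF] * (13 - len(units))
    return (bytes([0x41]) + struct.pack("<5H", *units[0:5])
            + bytes([ATTR_LONG_NAME, 0, 0])
            + struct.pack("<6H", *units[5:11]) + b"\x00\x00"
            + struct.pack("<2H", *units[11:13]))


# Sample data: one long-name slot and one 8.3 entry with a high-bit byte.
sample_lfn = make_lfn_slot("é.txt")
sample_short = b"\x82" + b" " * 7 + b"TXT" + bytes([0x20]) + bytes(20)

for e in (sample_lfn, sample_short):
    if is_vfat_entry(e):
        print("VFAT (UTF-16):", lfn_name(e))
    else:
        print("8bit OEM, assuming cp437:", short_name(e, "cp437"))
```

The long-name slot decodes the same way everywhere, while the 8.3 entry
needs a code page supplied from outside the filesystem.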

And no, there are no bits in the FAT header which specify the OEM code
page. So if you use "mode con" or "chcp" (or whatever those MS-DOS
commands for changing the OEM codepage were), all non-VFAT filenames
would change after the next read of the FAT directory.

But because every OEM code page is full 8bit, you always get valid data.
You would just see that your file name is different :D
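
A small Python sketch of that effect, using Python's built-in cp437 and
cp850 codecs as stand-ins for two OEM code pages. The byte values and the
function name are chosen only for illustration:

```python
# Two bytes with the upper bit set, as they might sit in an 8.3 name.
RAW_NAME = bytes([0x9B, 0x9D])


def decode_under(raw: bytes, codepages: list) -> dict:
    """Decode the same raw bytes under each OEM code page."""
    return {cp: raw.decode(cp) for cp in codepages}


views = decode_under(RAW_NAME, ["cp437", "cp850"])
for cp, text in views.items():
    print(cp, repr(text))

# Every one of the 256 byte values decodes under both code pages,
# so reading with the "wrong" one never fails, it just shows other
# characters.
ALL_BYTES = bytes(range(256))
always_valid = all(len(ALL_BYTES.decode(cp)) == 256
                   for cp in ("cp437", "cp850"))
print("always decodable:", always_valid)
```

Nothing on disk changes between the two views; only the interpretation
of the bytes does.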

> It would be entertaining for someone with ancient versions of Windows
> 9x to create some floppy images using codepage 437 and 450, and then
> see what a modern Windows system does with those VFAT images --- would

Hehe :-) I did it as part of my investigation into how the FAT volume
label is stored and how different tools read it. The FAT label is *not*
stored as UTF-16 but only in that OEM code page, like old filenames on
MS-DOS.

And what do recent Windows versions do? They decode such filenames (and
therefore also the volume label) via the OEM codepage which belongs to
the current system Language settings. You cannot change the OEM codepage
on recent Windows. You can only change the Regional Language (which then
changes the OEM codepage which belongs to it).

The mapping table between Windows Regional Language and OEM codepage is
in the (still unreleased) fatlabel(8) manpage, section DOS CODEPAGES, here:

> it break horribly when it tries to interpret them as UTF-16?  Or would

As Windows knows that the filename is stored as 8bit and not UTF-16,
nothing is broken. Only for characters with the upper bit set you will
probably not see the filenames as you saw them in MS-DOS.

But if you remember which OEM code page you used on MS-DOS, you can
change Windows Language to one which uses your OEM code page and then
you can read that old FAT fs without any broken file names.

> it figure it out?  And if so, how?  Inquiring minds want to know....
> Bonus points if the lack of forwards compatibility causes older
> versions of Windows to Blue Screen.  :-)

I have not got any Blue Screens while reading these older FAT
filesystems created and used by MS-DOS.

On Linux it is easier: just specify the -o codepage= mount option and
vfat.ko translates it correctly.

>       	     	   	  		   	- Ted
> P.S.  And of course, then there's the question of how do older
> versions of Windows handle versions of Unicode which postdate the
> release date of that particular version of Windows?  After all,

This is not a problem. Windows allows you to store an arbitrary sequence
of uint16[] in a filename (except disallowed MS-DOS chars like
:?<>...). And when doing a read directory operation you need to expect
that it will return an arbitrary sequence of uint16[].

Windows does not care about valid/invalid/assigned/unassigned code
points. It does not even care about halves of surrogate pairs. So it can
also store one half of an (unpaired) surrogate pair (one uint16).
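
A short Python sketch of what this means for anyone converting such
names to UTF-8. The sample sequence and the function names are
hypothetical; the point is only that a strict UTF-16 decoder or UTF-8
encoder rejects a name containing one half of a surrogate pair:

```python
import struct


def units_to_bytes(units: list) -> bytes:
    """Pack uint16 code units as little-endian bytes, as stored on disk."""
    return struct.pack(f"<{len(units)}H", *units)


def is_well_formed_utf16(raw: bytes) -> bool:
    """True if a strict UTF-16LE decoder accepts the bytes."""
    try:
        raw.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False


def can_encode_utf8(units: list) -> bool:
    """True if the name can be represented in strict UTF-8."""
    # surrogatepass keeps lone surrogates instead of raising on decode
    text = units_to_bytes(units).decode("utf-16-le", "surrogatepass")
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# "A", an unpaired high surrogate, "B": fine as uint16[], not as Unicode.
lone_surrogate_name = [0x0041, 0xD800, 0x0042]
# A proper surrogate pair (one supplementary-plane character).
paired_name = [0xD83D, 0xDE00]

for units in (lone_surrogate_name, paired_name):
    raw = units_to_bytes(units)
    print([hex(u) for u in units],
          "well-formed UTF-16:", is_well_formed_utf16(raw),
          "UTF-8 encodable:", can_encode_utf8(units))
```

Code that reads directories therefore has to decide what to do with such
names instead of assuming every uint16[] sequence is valid UTF-16.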

> Unicode adds new code points with potential revisions to the case
> folding table every 6-12 months.  (The most recent version of Unicode
> was released in April 2019 to accommodate the new Japanese kanji
> character "Rei" for the current era name with the elevation of the new
> current reigning emperor of Japan.)

Pali Rohár

