On Tuesday 21 January 2020 21:34:05 Pali Rohár wrote:
> On Tuesday 21 January 2020 00:07:01 Al Viro wrote:
> > On Tue, Jan 21, 2020 at 12:57:45AM +0100, Pali Rohár wrote:
> > > On Monday 20 January 2020 22:46:25 Al Viro wrote:
> > > > On Mon, Jan 20, 2020 at 10:40:46PM +0100, Pali Rohár wrote:
> > > >
> > > > > Ok, I did some research. It took me it longer as I thought as lot of
> > > > > stuff is undocumented and hard to find all relevant information.
> > > > >
> > > > > So... fastfat.sys is using ntos function RtlUpcaseUnicodeString() which
> > > > > takes UTF-16 string and returns upper case UTF-16 string. There is no
> > > > > mapping table in fastfat.sys driver itself.
> > > >
> > > > Er... Surely it's OK to just tabulate that function on 65536 values
> > > > and see how could that be packed into something more compact?
> > >
> > > It is OK, but too complicated. That function is in nt kernel. So you
> > > need to build a new kernel module and also decide where to put output of
> > > that function. It is a long time since I did some nt kernel hacking and
> > > nowadays you need to download 10GB+ of Visual Studio code, then addons
> > > for building kernel modules, figure out how to write and compile simple
> > > kernel module via Visual Studio, write ini install file, try to load it
> > > and then you even fail as recent Windows kernels refuse to load kernel
> > > modules which are not signed...
> >
> > Wait a sec... From NT userland, on a mounted VFAT:
> >     for all s in single-codepoint strings
> >         open s for append
> >         if failed
> >             print s on stderr, along with error value
> >         write s to the opened file, adding to its tail
> >         close the file
> > the for each equivalence class you'll get a single file, with all
> > members of that class written to it. In addition you'll get the
> > list of prohibited codepoints.
> >
> > Why bother with any kind of kernel modules? IDGI...
>
> This is a great idea to get FAT equivalence classes. Thank you!
>
> Now I quickly tried it... and it failed. FAT has restriction for number
> of files in a directory, so I would have to do it in more clever way,
> e.g prepare N directories and then try to create/open file for each
> single-point string in every directory until it success or fail in every
> one.

Now I have done the test with more directories and it finally passed. I
ran it on WinXP with different configurations, and the results are
interesting...

First important thing: the DOS OEM codepage is implicitly configured by
the option "Language for non-Unicode programs" found in "Regional and
Language Options" on the "Advanced" tab (run: intl.cpl). It is *not*
affected by the "Standards and formats" language and also *not* by the
"Location" language. The description of "Language for non-Unicode
programs" says: "It does not affect Unicode programs", which is clearly
not true, as it affects all Unicode programs which store data on a FAT
fs.

Second thing: equivalence classes depend on the OEM codepage, and they
differ between codepages. Note that some languages share one codepage.
CP850 (languages: English UK, Afrikaans, ...) has 614 non-trivial (*)
equivalence classes, CP852 (Slavic languages) has 619 and CP437 (English
USA) has only 586.

The biggest equivalence class is the one for 'U', and it has the
following elements:

CP437:
0x0055 0x0075 0x00d9 0x00da 0x00db 0x00f9 0x00fa 0x00fb
0x0168 0x0169 0x016a 0x016b 0x016c 0x016d 0x016e 0x016f
0x0170 0x0171 0x0172 0x0173 0x01af 0x01b0 0x01d3 0x01d4
0x01d5 0x01d6 0x01d7 0x01d8 0x01d9 0x01da 0x01db 0x01dc
0xff35 0xff55

CP852:
0x0055 0x0075 0x00b5 0x00d9 0x00db 0x00f9 0x00fb 0x0168
0x0169 0x016a 0x016b 0x016c 0x016d 0x0172 0x0173 0x01af
0x01b0 0x01d3 0x01d4 0x01d5 0x01d6 0x01d7 0x01d8 0x01d9
0x01da 0x01db 0x01dc 0x03bc 0xff35 0xff55

CP850:
0x0055 0x0075 0x0168 0x0169 0x016a 0x016b 0x016c 0x016d
0x016e 0x016f 0x0170 0x0171 0x0172 0x0173 0x01af 0x01b0
0x01d3 0x01d4 0x01d5 0x01d6 0x01d7 0x01d8 0x01d9 0x01da
0x01db 0x01dc 0xff35 0xff55

Note that the elements are Unicode code points.
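To make the comparison between codepages easier to see, here is a small
Python sketch that takes the three 'U' classes exactly as listed above
and prints what they have in common and what each codepage adds on top.
The names parse() and U_CLASSES are made up for this illustration; this
is not part of my testing application.

```python
def parse(text):
    """Turn a whitespace-separated list of 0x.... values into a set of ints."""
    return {int(tok, 16) for tok in text.split()}

# 'U' equivalence classes copied from the measured results above.
U_CLASSES = {
    "CP437": parse("""
        0x0055 0x0075 0x00d9 0x00da 0x00db 0x00f9 0x00fa 0x00fb
        0x0168 0x0169 0x016a 0x016b 0x016c 0x016d 0x016e 0x016f
        0x0170 0x0171 0x0172 0x0173 0x01af 0x01b0 0x01d3 0x01d4
        0x01d5 0x01d6 0x01d7 0x01d8 0x01d9 0x01da 0x01db 0x01dc
        0xff35 0xff55"""),
    "CP852": parse("""
        0x0055 0x0075 0x00b5 0x00d9 0x00db 0x00f9 0x00fb 0x0168
        0x0169 0x016a 0x016b 0x016c 0x016d 0x0172 0x0173 0x01af
        0x01b0 0x01d3 0x01d4 0x01d5 0x01d6 0x01d7 0x01d8 0x01d9
        0x01da 0x01db 0x01dc 0x03bc 0xff35 0xff55"""),
    "CP850": parse("""
        0x0055 0x0075 0x0168 0x0169 0x016a 0x016b 0x016c 0x016d
        0x016e 0x016f 0x0170 0x0171 0x0172 0x0173 0x01af 0x01b0
        0x01d3 0x01d4 0x01d5 0x01d6 0x01d7 0x01d8 0x01d9 0x01da
        0x01db 0x01dc 0xff35 0xff55"""),
}

# Code points that are equivalent to 'U' under every tested codepage.
common = set.intersection(*U_CLASSES.values())

def fmt(codepoints):
    return " ".join(f"0x{cp:04x}" for cp in sorted(codepoints))

print("common:", fmt(common))
for name, cls in U_CLASSES.items():
    # Members of this codepage's class that are not shared by all three.
    print(f"{name} only-here-or-partial:", fmt(cls - common))
```

Only these three codepages are covered, so the "common" set here says
nothing about codepages I did not test.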

It is interesting that for English USA (CP437) "U" and "Ù" are in the
same equivalence class, but for English UK (CP850) "U" and "Ù" are in
different classes. CP850 has "Ù" in a two-member class: 0x00d9 0x00f9

Are there any cultural, regional or linguistic reasons why the English
USA and English UK languages/regions should treat "Ù" differently?

So, the third thing: how should we handle this complicated situation in
our VFAT implementation in the Linux kernel when using UTF-8 encoding
for userspace?

For fixing case-insensitivity for UTF-8 I see the following options:

Option 1) Create the intersection of the equivalence classes from all
codepages and use it for the Linux VFAT uppercase function. This would
ensure that whatever codepage/language Windows uses, Linux VFAT does not
create files which are inaccessible for Windows (see PPS).

Option 2) As equivalence classes depend on the codepage and VFAT already
needs to know the codepage when mounting/accessing shortnames, we can
calculate a "common" uppercase table (which would be the same for all
codepages, ideally the one from option 1)) and then the differences
between the "common" uppercase table and the equivalence classes of each
codepage. The kernel already has uppercase tables for NLS codepages, so
we can store these "differences" there. In this case VFAT would know the
uppercase function for the specified codepage (which is already passed
as a mount param).

Option 3) Ignore this MS shit nonsense (see PPS for how it is broken)
and define the uppercase table from the Unicode standard. This would be
the most expected behavior for userspace, but incompatible with the MS
FAT32 implementation.

Option 4) Use the uppercase table from the Unicode standard (as in
option 3), but also add the definitions from option 1). This would
ensure that all files created by VFAT would be accessible on any Windows
system (see PPS), plus there would be uppercase definitions from the
Unicode standard (but only those which do not break the definitions from
1) with respect to PPS).

Option 5) Create an API for kernel <---> userspace which would allow
userspace to define the mapping table (or equivalence classes), and so
move this problem from the kernel to userspace. But as we already
discussed, this is hard, plus without proper configuration from
userspace the kernel's VFAT driver could modify the FS in a way that MS
would not be able to use it.

Or do you have a better idea how to handle this problem?

(*) - with more than one element

PS: If somebody is interested, I can share my whole results and the
source code of the testing application.

PPS: If you create two files "U" and "Ù" on English UK (you can do that
as these code points are in different equivalence classes) and then
connect this FAT32 fs on English USA, you would not be able to access
the "Ù" file. Windows English USA lists both files "U" and "Ù", but
whichever you open, Windows always gives you the content of file "U".
"Ù" is therefore inaccessible until you change the language to English
UK.

-- 
Pali Rohár
pali.rohar@gmail.com
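
For reference, the userland probing procedure quoted at the top of this
mail, together with the "N directories" workaround, can be sketched
roughly like this in Python. This is only an illustration and not the
testing application mentioned in the PS. The function names, the
two-pass order (look for an existing equivalent file in every directory
first, create only afterwards) and the hex output format are my
assumptions. It also assumes that a full directory and a prohibited name
both show up as an error from open.

```python
import os

def _open(path, flags):
    """Return a file descriptor, or None if the OS refuses this name."""
    try:
        return os.open(path, flags, 0o644)
    except (OSError, ValueError):
        return None

def _first_fd(paths, flags):
    """Try each path in order and return the first descriptor that opens."""
    for path in paths:
        fd = _open(path, flags)
        if fd is not None:
            return fd
    return None

def probe(codepoints, dirs):
    """Append every code point to the file its one-character name opens.

    On a case-insensitive filesystem all members of one equivalence
    class end up in the same file. Returns the list of code points that
    could not be opened or created in any of the directories.
    """
    rejected = []
    for cp in codepoints:
        paths = [d + os.sep + chr(cp) for d in dirs]
        # Pass 1: an equivalent name may already exist in some directory.
        fd = _first_fd(paths, os.O_WRONLY | os.O_APPEND)
        if fd is None:
            # Pass 2: create it in the first directory that accepts it.
            fd = _first_fd(paths, os.O_WRONLY | os.O_APPEND
                                  | os.O_CREAT | os.O_EXCL)
        if fd is None:
            rejected.append(cp)
            continue
        with os.fdopen(fd, "ab") as f:
            f.write(b"0x%04x\n" % cp)
    return rejected
```

It would be called with the prepared directories on the mounted FAT
volume, e.g. probe(range(0x10000), dirs), and afterwards every file in
those directories holds one equivalence class, one code point per line.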