* [RFC] Unicode/UTF-8 support for XFS
@ 2014-09-11 20:37 Ben Myers
  2014-09-11 20:40 ` [PATCH 1/9] xfs: return the first match during case-insensitive lookup Ben Myers
                   ` (22 more replies)
  0 siblings, 23 replies; 33+ messages in thread
From: Ben Myers @ 2014-09-11 20:37 UTC (permalink / raw)
  To: xfs; +Cc: tinguely, olaf

Hi,

I'm posting this RFC on Olaf's behalf, as he is busy with other projects.

First is a series of kernel patches, then a series of patches for
xfsprogs, and then a test.

Note that I have removed the unicode database files prior to posting due
to their large size.  There are instructions on how to download them in
the relevant commit headers.

Thanks,
	Ben

Here are some notes of introduction from Olaf:

-----------------------------------------------------------------------------
Unicode/UTF-8 support for XFS

So we had a customer request proper unicode support...


Design notes.

XFS uses byte strings for filenames, so UTF-8 is the expected format for
unicode filenames. This does raise the question of what criteria a byte string
must meet to be UTF-8. We settled on the following:
  - Valid unicode code points are 0..0x10FFFF, except that
  - The surrogates 0xD800..0xDFFF are not valid code points, and
  - Valid UTF-8 must be a shortest encoding of a valid unicode code point.

In addition, U+0 (ASCII NUL, '\0') is used to terminate byte strings (and
is itself not part of the string). Moreover strings may be length-limited
in addition to being NUL-terminated (there is no such thing as an embedded
NUL in a length-limited string).
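
As an illustration only (this is not code from the patches, and the
function name is hypothetical), the three criteria above can be sketched
in Python as a check on a length-limited byte string. The sketch assumes
the terminating NUL, if any, has already been stripped:

```python
def is_valid_utf8(name: bytes) -> bool:
    """Return True if name meets the three criteria listed above."""
    i, n = 0, len(name)
    while i < n:
        lead = name[i]
        if lead < 0x80:                      # single byte, U+0000..U+007F
            i += 1
            continue
        if 0xC0 <= lead <= 0xDF:             # lead byte of a 2-byte sequence
            extra, cp, smallest = 1, lead & 0x1F, 0x80
        elif 0xE0 <= lead <= 0xEF:           # lead byte of a 3-byte sequence
            extra, cp, smallest = 2, lead & 0x0F, 0x800
        elif 0xF0 <= lead <= 0xF7:           # lead byte of a 4-byte sequence
            extra, cp, smallest = 3, lead & 0x07, 0x10000
        else:
            return False                     # stray continuation or bad lead
        if i + extra >= n:
            return False                     # sequence is truncated
        for byte in name[i + 1:i + 1 + extra]:
            if (byte & 0xC0) != 0x80:
                return False                 # not a continuation byte
            cp = (cp << 6) | (byte & 0x3F)
        if cp < smallest:
            return False                     # not the shortest encoding
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            return False                     # out of range, or a surrogate
        i += extra + 1
    return True
```

For example, b"\xc0\xaf" (an overlong encoding of '/') and
b"\xed\xa0\x80" (the surrogate 0xD800) are both rejected by this check.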

Based on feedback on the earlier patches for unicode/UTF-8 support, we
decided that a filename that does not match the above criteria should be
treated as a binary blob, as opposed to being rejected. To stress: if any
part of the string isn't valid UTF-8, then the entire string is treated
as a binary blob. This matters once normalization is considered.

When comparing unicode strings for equality, normalization comes into play:
we must compare the normalized forms of strings, not just the raw sequences
of bytes. There are a number of defined normalization forms for unicode.
We decided on a variant of NFKD we call NFKDI. NFD was chosen over NFC,
because calculating NFC requires calculating NFD first, followed by an
additional step. NFKD was chosen over NFD because this makes filenames
that ought to be equal compare as equal. My favorite example is the ways
"office" can be spelled, when "fi" or "ffi" ligatures are used. NFKDI adds
one more step to NFKD, in that it eliminates the code points that have the
Default_Ignorable_Code_Point property from the comparison. These code
points are as a rule invisible, but might (or might not) be pulled in when
you copy/paste a string to be used as a filename. An example of these is
U+00AD SOFT HYPHEN, a code point that only shows up if a word is split
across lines.

If a filename is considered to be a binary blob, comparison is based on a
simple binary match. Normalization does not apply to any part of a blob.
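
A rough userspace sketch of these comparison rules, in Python, might look
as follows. It is not the kernel implementation: the names nfkdi_key and
same_name are made up for illustration, IGNORABLE is only a small subset
of the Default_Ignorable_Code_Point property, and Python's unicodedata
module follows whatever unicode version the interpreter ships with.

```python
import unicodedata

# Illustrative subset only; the real property covers many more code points.
IGNORABLE = {0x00AD, 0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF}

def nfkdi_key(name: bytes) -> bytes:
    """Return the byte string used for comparison (hypothetical helper)."""
    try:
        text = name.decode("utf-8")          # strict: rejects ill-formed input
    except UnicodeDecodeError:
        return name                          # binary blob: compare raw bytes
    decomposed = unicodedata.normalize("NFKD", text)
    kept = "".join(ch for ch in decomposed if ord(ch) not in IGNORABLE)
    return kept.encode("utf-8")

def same_name(a: bytes, b: bytes) -> bool:
    """Two names match if their comparison keys are byte-for-byte equal."""
    return nfkdi_key(a) == nfkdi_key(b)
```

With this sketch, "office" spelled with the ffi ligature (U+FB03) or with
an embedded soft hyphen compares equal to the plain ASCII spelling, while
a Latin-1 encoded name such as b"caf\xe9" is a blob and only matches
itself.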

The code uses ("leverages", in corp-speak) the existing infrastructure for
case-insensitive filenames. Like the CI code, the name used to create a
file is stored on disk, and returned in a lookup. When comparing filenames
the normalized forms of the names being compared are generated on the fly
from the non-normalized forms stored on disk.

If the borgbit (the bit enabling legacy ASCII-based CI) is set in the
superblock, then case folding is added into the mix. This normalization
form we call NFKDICF. It allows for the creation of case-insensitive
filesystems with UTF-8 support.

-----------------------------------------------------------------------------
Implementation notes.

Strings are normalized using a trie that stores the relevant information.
The trie itself is part of the XFS module, and about 250kB in size. The
trie is not checked in: instead we add the source files from the Unicode
Character Database and a program that creates the header containing the
trie.

The key for a lookup in the trie is a UTF-8 sequence. Each valid UTF-8
sequence leads to a leaf. No invalid sequence does. This means that trie
lookups can be used to validate UTF-8 sequences, which is why there is no
specialized code for the same purpose.
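
The idea of validating by lookup can be shown with a toy trie in Python.
This is only a sketch under simplifying assumptions: the node layout
(nested dicts with a "leaf" key), the function names, and the leaf
contents are invented here, and the toy trie knows just a handful of code
points, whereas the real trie has a leaf for every valid UTF-8 sequence.

```python
def build_trie(code_points):
    """Build a byte-keyed trie from a {code point: leaf data} mapping."""
    root = {}
    for cp, data in code_points.items():
        node = root
        for byte in chr(cp).encode("utf-8"):
            node = node.setdefault(byte, {})
        node["leaf"] = data
    return root

def lookup(trie, data: bytes, pos: int):
    """Return (leaf data, sequence length) for the sequence at pos, or None."""
    node = trie
    for i in range(pos, len(data)):
        node = node.get(data[i])
        if node is None:
            return None                      # no such path: not a known sequence
        if "leaf" in node:
            return node["leaf"], i - pos + 1
    return None                              # ran out of bytes before a leaf

def validate(trie, data: bytes) -> bool:
    """A string is accepted only if every sequence in it reaches a leaf."""
    pos = 0
    while pos < len(data):
        hit = lookup(trie, data, pos)
        if hit is None:
            return False
        pos += hit[1]
    return True
```

Because UTF-8 is prefix-free, the first leaf reached is the only possible
match, so no backtracking is needed in the lookup.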

The trie contains information for the version of unicode in which each
code point was defined. This matters because non-normalized strings are
stored on disk, and newer versions of unicode may introduce new normalized
forms. Ideally, the version of unicode used by the filesystem is stored in
the filesystem.

The trie also accounts for corrections made in the past to normalizations.
This has little value today, because any newly created filesystem would be
using unicode version 7.0.0. It is included in order to show, not tell,
that such corrections can be handled if they are added in future revisions.

The algorithm used to calculate the sequences of bytes for the normalized
form of a UTF-8 string is tricky. The core is found in utf8byte(), with an
explanation in the preceding comment.

The non-XFS-specific supporting code is in separate source files, and can be
put in some other location in the Linux kernel source tree, if desired.
These functions have the prefix 'utf8n' if they handle length-limited
strings, and 'utf8' if they handle NUL-terminated strings.
-----------------------------------------------------------------------------

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

* [RFC v2] Unicode/UTF-8 support for XFS
@ 2014-09-18 19:56 Ben Myers
  2014-09-18 20:31 ` [PATCH 00/13] xfsprogs: " Ben Myers
  0 siblings, 1 reply; 33+ messages in thread
From: Ben Myers @ 2014-09-18 19:56 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: tinguely, olaf, xfs

Hi,

I'm posting this RFC for Unicode support in XFS on Olaf's behalf, as he
is busy with other projects.  This is the second revision of the series.
The first is available here:

http://oss.sgi.com/archives/xfs/2014-09/msg00169.html

In response to the initial feedback, the changes in version 2 include:

* linux-fsdevel in the To: line,
* Updated design notes,
* Separation of the fs-independent trie and support code into utf8norm.ko,
* A mechanism for loading the normalization module only when necessary.

I'll post the whole series for completeness' sake.  Many on -fsdevel will
not be interested in the xfs-specific bits, but it may be helpful to
have the full series as an example and for testing purposes.

First there is a set of kernel bits, then some libxfs/xfsprogs stuff,
and finally a test.  (Note: I am not posting the unicode database files
due to their large size.  There are scripts to download them from
unicode.org in the relevant commit headers.)

TODO: Store the unicode version number of the filesystem on disk in the
super block.

Thanks,
Ben

Here are Olaf's design notes:

-----------------------------------------------------------------------------
Unicode/UTF-8 support for XFS

So we had a customer request proper unicode support...


* What does "supporting unicode" actually mean?

From a text processing point of view, what a filesystem does with
filenames is simple: it stores and retrieves them, and compares them
for equality. It may reject certain byte sequences as invalid
filenames (for example, no filename can contain an ASCII NUL).

I've been taking it as a given that when a file is created with a
certain byte sequence as its name, then a subsequent directory listing
will contain that same byte sequence among the names listed.

This leaves comparing names for equality, and in my view this is what
"supporting unicode" revolves around.

The present state of affairs is that different byte sequences are
different filenames. This amounts to tolerating unicode without
actually supporting it.

To support unicode we have to interpret filenames. What happens when
(part of) a filename cannot be interpreted? We can reject the
filename, interpret the parts we can, or punt and accept it as an
uninterpreted blob.

Rejecting ill-formed filenames was my first choice, but I came around
on the issue: there are too many ways in which you can end up with
having to deal with ill-formed filenames that would leave a user with
no recourse but to move whatever they're doing to a different
filesystem. Unpacking a tarball with filenames in a different encoding
is an example.

Partial interpretation of an ill-formed filename just strikes me as
the kind of bad idea that most half-houses are. I admit that I have no
stronger objection to this than the fact that it makes the code even
more complicated and fragile.

Which leaves "blob" as the preferred option by default for coping with
ill-formed filenames.

When comparing well-formed filenames, the question now becomes which
byte sequences are considered to be alternative spellings of the same
filename. This is where normalization forms come into play, and the
unicode standard has quite a bit to say about the subject.

If all you're doing is comparison, then choosing NFD over NFC is easy,
because the former is easier to calculate than the latter.

If you want various spellings of "office" to compare equal, then
picking NFKD over NFD for comparison is also an obvious
choice. (Hand-picking individual compatibility forms is truly a bad
idea.) Ways to spell "office": "o_f_f_i_c_e", "o_f_fi_c_e", and
"o_ffi_c_e", using no ligatures, the fi ligature, or the ffi
ligature. (Some fool thought it a good idea to add these ligatures to
unicode; all we get to decide is how to cope.)

The most contentious part is (should be) ignoring the codepoints with
the Default_Ignorable_Code_Point property. I've included the list
below. My argument, such as it is, is that these code points either
have no visible rendering, or in cases like the soft hyphen, are only
conditionally visible. The problem with these (as I see it) is that on
seeing a filename that might contain them you cannot tell whether they
are present. So I propose to ignore them for the purpose of comparing
filenames for equality.

Finally, case folding. First of all, it is optional. Then the issue is
that you either go the language-specific route, or simplify the task
by "just" doing a full casefold (C+F, in unicode parlance). Looking
around the net I tend to find that if you're going to do casefolding
at all, then a language-independent full casefold is preferred because
it is the most predictable option. See
http://www.w3.org/TR/charmod-norm/ for an example of that kind of
reasoning.

An additional question is whether case folding should be a fixed
(mkfs-time) property of a filesystem or can be enabled and disabled on
the fly. When mixing these modes, preferring exact matches is easy.
But after case-sensitive creates of files named "README" and "readme",
which of these two files will be found by case-insensitive lookups of
"Readme" and "ReadMe"? Does the answer differ if the order in which
the files were created is reversed? I do not have good answers to
those questions, and absent such answers the behavior of a filesystem
becomes hard to predict. This may not be a bug according to the
design, but it will be experienced as a bug by users. This is why in
these patches case folding is a property set at mkfs time.
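
To make the README/readme question concrete, here is a small Python
sketch (not part of the patches; the function name is hypothetical). It
uses str.casefold(), which performs a language-independent full casefold,
to group names that a case-insensitive lookup could not tell apart:

```python
from collections import defaultdict

def collisions(names):
    """Group names by casefolded key; return only keys with several names."""
    groups = defaultdict(list)
    for name in names:
        groups[name.casefold()].append(name)
    return {key: found for key, found in groups.items() if len(found) > 1}

# Files created while the filesystem was case-sensitive.
existing = ["README", "readme", "Makefile"]
ambiguous = collisions(existing)
```

Here the key "readme" maps to both "README" and "readme", so a later
case-insensitive lookup of "Readme" has two candidates and no obvious
rule for choosing between them. Fixing the mode at mkfs time means such a
directory can never be created in the first place.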

All of these choices can be argued with, but I do believe that the
particular combination of choices I made is a defensible one.

The code refers to these normalization forms as nfkdi and nfkdicf.


* XFS-specific design notes.

XFS uses byte strings for filenames, so UTF-8 is the expected format for
unicode filenames. This does raise the question of what criteria a byte string
must meet to be UTF-8. We settled on the following:
 - Valid unicode code points are 0..0x10FFFF, except that
 - The surrogates 0xD800..0xDFFF are not valid code points, and
 - Valid UTF-8 must be a shortest encoding of a valid unicode code point.

In addition, U+0 (ASCII NUL, '\0') is used to terminate byte strings (and
is itself not part of the string). Moreover strings may be length-limited
in addition to being NUL-terminated (there is no such thing as an embedded
NUL in a length-limited string).

The code uses ("leverages", in corp-speak) the existing XFS
infrastructure for case-insensitive filenames. Like the CI code, the
name used to create a file is stored on disk, and returned in a
lookup. When comparing filenames the normalized forms of the names
being compared are generated on the fly from the non-normalized forms
stored on disk.

If the borgbit (the bit enabling legacy ASCII-based CI in XFS) is set
in the superblock, then case folding is added into the mix. This is
the nfkdicf normalization form mentioned above. It allows for the
creation of case-insensitive filesystems with UTF-8 support.


* Implementation notes.

Strings are normalized using a trie that stores the relevant
information.  The trie itself is about 250kB in size, and lives in a
separate module. The trie is not checked in: instead we add the source
files from the Unicode Character Database and a program that creates
the header containing the trie.

The key for a lookup in the trie is a UTF-8 sequence. Each valid UTF-8
sequence leads to a leaf. No invalid sequence does. This means that trie
lookups can be used to validate UTF-8 sequences, which is why there is no
specialized code for the same purpose.

The trie contains information for the version of unicode in which each
code point was defined. This matters because non-normalized strings are
stored on disk, and newer versions of unicode may introduce new normalized
forms. Ideally, the version of unicode used by the filesystem is stored in
the filesystem.

The trie also accounts for corrections made in the past to normalizations.
This has little value today, because any newly created filesystem would be
using unicode version 7.0.0. It is included in order to show, not tell,
that such corrections can be handled if they are added in future revisions.

The algorithm used to calculate the sequences of bytes for the normalized
form of a UTF-8 string is tricky. The core is found in utf8byte(), with an
explanation in the preceding comment.

The functions in the non-XFS-specific supporting code have the prefix
'utf8n' if they handle length-limited strings, and 'utf8' if they handle
NUL-terminated strings.

----
# Derived Property: Default_Ignorable_Code_Point
#  Generated from
#    Other_Default_Ignorable_Code_Point
#  + Cf (Format characters)
#  + Variation_Selector
#  - White_Space
#  - FFF9..FFFB (Annotation Characters)
#  - 0600..0605, 06DD, 070F, 110BD (exceptional Cf characters that should be visible)

00AD          ; Default_Ignorable_Code_Point # Cf       SOFT HYPHEN
034F          ; Default_Ignorable_Code_Point # Mn       COMBINING GRAPHEME JOINER
061C          ; Default_Ignorable_Code_Point # Cf       ARABIC LETTER MARK
115F..1160    ; Default_Ignorable_Code_Point # Lo   [2] HANGUL CHOSEONG FILLER..HANGUL JUNGSEONG FILLER
17B4..17B5    ; Default_Ignorable_Code_Point # Mn   [2] KHMER VOWEL INHERENT AQ..KHMER VOWEL INHERENT AA
180B..180D    ; Default_Ignorable_Code_Point # Mn   [3] MONGOLIAN FREE VARIATION SELECTOR ONE..MONGOLIAN FREE VARIATION SELECTOR THREE
180E          ; Default_Ignorable_Code_Point # Cf       MONGOLIAN VOWEL SEPARATOR
200B..200F    ; Default_Ignorable_Code_Point # Cf   [5] ZERO WIDTH SPACE..RIGHT-TO-LEFT MARK
202A..202E    ; Default_Ignorable_Code_Point # Cf   [5] LEFT-TO-RIGHT EMBEDDING..RIGHT-TO-LEFT OVERRIDE
2060..2064    ; Default_Ignorable_Code_Point # Cf   [5] WORD JOINER..INVISIBLE PLUS
2065          ; Default_Ignorable_Code_Point # Cn       <reserved-2065>
2066..206F    ; Default_Ignorable_Code_Point # Cf  [10] LEFT-TO-RIGHT ISOLATE..NOMINAL DIGIT SHAPES
3164          ; Default_Ignorable_Code_Point # Lo       HANGUL FILLER
FE00..FE0F    ; Default_Ignorable_Code_Point # Mn  [16] VARIATION SELECTOR-1..VARIATION SELECTOR-16
FEFF          ; Default_Ignorable_Code_Point # Cf       ZERO WIDTH NO-BREAK SPACE
FFA0          ; Default_Ignorable_Code_Point # Lo       HALFWIDTH HANGUL FILLER
FFF0..FFF8    ; Default_Ignorable_Code_Point # Cn   [9] <reserved-FFF0>..<reserved-FFF8>
1BCA0..1BCA3  ; Default_Ignorable_Code_Point # Cf   [4] SHORTHAND FORMAT LETTER OVERLAP..SHORTHAND FORMAT UP STEP
1D173..1D17A  ; Default_Ignorable_Code_Point # Cf   [8] MUSICAL SYMBOL BEGIN BEAM..MUSICAL SYMBOL END PHRASE
E0000         ; Default_Ignorable_Code_Point # Cn       <reserved-E0000>
E0001         ; Default_Ignorable_Code_Point # Cf       LANGUAGE TAG
E0002..E001F  ; Default_Ignorable_Code_Point # Cn  [30] <reserved-E0002>..<reserved-E001F>
E0020..E007F  ; Default_Ignorable_Code_Point # Cf  [96] TAG SPACE..CANCEL TAG
E0080..E00FF  ; Default_Ignorable_Code_Point # Cn [128] <reserved-E0080>..<reserved-E00FF>
E0100..E01EF  ; Default_Ignorable_Code_Point # Mn [240] VARIATION SELECTOR-17..VARIATION SELECTOR-256
E01F0..E0FFF  ; Default_Ignorable_Code_Point # Cn [3600] <reserved-E01F0>..<reserved-E0FFF>

# Total code points: 4173
----
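
For readers who want to experiment with the list above, the following
Python sketch parses lines in this format into code point ranges. It is
an illustration, not the trie generator from the patches; the function
names are hypothetical and SAMPLE contains just three of the lines shown
above.

```python
def parse_ranges(text):
    """Parse 'XXXX[..YYYY] ; Property # comment' lines into (lo, hi) pairs."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()     # drop comments
        if not line:
            continue                             # blank or comment-only line
        field = line.split(";", 1)[0].strip()    # the code point or range
        first, _, last = field.partition("..")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo       # single code point
        ranges.append((lo, hi))
    return ranges

def is_ignorable(cp, ranges):
    """True if code point cp falls inside any parsed range."""
    return any(lo <= cp <= hi for lo, hi in ranges)

SAMPLE = """\
# Derived Property: Default_Ignorable_Code_Point
00AD          ; Default_Ignorable_Code_Point # Cf       SOFT HYPHEN
200B..200F    ; Default_Ignorable_Code_Point # Cf   [5] ZERO WIDTH SPACE..RIGHT-TO-LEFT MARK
FE00..FE0F    ; Default_Ignorable_Code_Point # Mn  [16] VARIATION SELECTOR-1..VARIATION SELECTOR-16
"""

sample_ranges = parse_ranges(SAMPLE)
sample_total = sum(hi - lo + 1 for lo, hi in sample_ranges)
```

Summing hi - lo + 1 over every range in the full list is a quick way to
cross-check the "Total code points" line.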


end of thread, other threads:[~2014-09-18 20:33 UTC | newest]

Thread overview: 33+ messages
2014-09-11 20:37 [RFC] Unicode/UTF-8 support for XFS Ben Myers
2014-09-11 20:40 ` [PATCH 1/9] xfs: return the first match during case-insensitive lookup Ben Myers
2014-09-11 20:41 ` [PATCH 2/9] xfs: rename XFS_CMP_CASE to XFS_CMP_MATCH Ben Myers
2014-09-11 20:42 ` [PATCH 3/9] xfs: add xfs_nameops.normhash Ben Myers
2014-09-11 20:43 ` [PATCH 4/9] xfs: change interface of xfs_nameops.normhash Ben Myers
2014-09-11 20:46 ` [PATCH 5/9] xfs: add a superblock feature bit to indicate UTF-8 support Ben Myers
2014-09-11 20:47 ` [PATCH 6/9] xfs: add unicode character database files Ben Myers
2014-09-11 20:48 ` [PATCH 7/9] xfs: add trie generator and supporting code for UTF-8 Ben Myers
2014-09-11 20:49 ` [PATCH 8/9] xfs: add xfs_nameops for utf8 and utf8+casefold Ben Myers
2014-09-11 20:50 ` [PATCH 9/9] xfs: apply utf-8 normalization rules to user extended attribute names Ben Myers
2014-09-11 20:51 ` [PATCH 01/13] libxfs: return the first match during case-insensitive lookup Ben Myers
2014-09-11 20:52 ` [PATCH 02/13] libxfs: rename XFS_CMP_CASE to XFS_CMP_MATCH Ben Myers
2014-09-11 20:53 ` [PATCH 03/13] libxfs: add xfs_nameops.normhash Ben Myers
2014-09-11 20:55 ` [PATCH 04/13] libxfs: change interface of xfs_nameops.normhash Ben Myers
2014-09-11 20:56 ` [PATCH 05/13] libxfs: add a superblock feature bit to indicate UTF-8 support Ben Myers
2014-09-11 20:57 ` [PATCH 06/13] xfsprogs: add unicode character database files Ben Myers
2014-09-11 20:59 ` [PATCH 07/13] libxfs: add trie generator and supporting code for UTF-8 Ben Myers
2014-09-11 21:00 ` [PATCH 08/13] libxfs: add xfs_nameops for utf8 and utf8+casefold Ben Myers
2014-09-11 21:01 ` [PATCH 09/13] libxfs: apply utf-8 normalization rules to user extended attribute names Ben Myers
2014-09-11 21:02 ` [PATCH 10/13] xfsprogs: add utf8 support to growfs Ben Myers
2014-09-11 21:03 ` [PATCH 11/13] xfsprogs: add utf8 support to mkfs.xfs Ben Myers
2014-09-11 21:04 ` [PATCH 12/13] xfsprogs: add utf8 support to xfs_repair Ben Myers
2014-09-11 21:06 ` [PATCH 13/13] xfsprogs: add a preliminary test for utf8 support Ben Myers
2014-09-12 10:02 ` [RFC] Unicode/UTF-8 support for XFS Dave Chinner
2014-09-12 11:55   ` Olaf Weber
2014-09-12 20:55     ` Christoph Hellwig
2014-09-15  7:16       ` Olaf Weber
2014-09-16 20:54         ` Dave Chinner
2014-09-16 21:02           ` Christoph Hellwig
2014-09-16 21:42             ` Ben Myers
2014-09-12 17:45   ` Josef 'Jeff' Sipek
2014-09-12 20:53   ` Christoph Hellwig
2014-09-18 19:56 [RFC v2] " Ben Myers
2014-09-18 20:31 ` [PATCH 00/13] xfsprogs: " Ben Myers
2014-09-18 20:33   ` [PATCH 01/13] libxfs: return the first match during case-insensitive lookup Ben Myers
