* [RFC v2] Unicode/UTF-8 support for XFS
@ 2014-09-18 19:56 Ben Myers
From: Ben Myers @ 2014-09-18 19:56 UTC (permalink / raw)
To: linux-fsdevel; +Cc: tinguely, olaf, xfs
Hi,
I'm posting this RFC for Unicode support in XFS on Olaf's behalf, as he
is busy with other projects. This is the second revision of the series.
The first is available here:
http://oss.sgi.com/archives/xfs/2014-09/msg00169.html
In response to the initial feedback, the changes in version 2 include:
* linux-fsdevel in the To: line,
* Updated design notes,
* Separation of the fs-independent trie and support code into utf8norm.ko,
* A mechanism for loading the normalization module only when necessary.
I'll post the whole series for completeness' sake. Many on -fsdevel will
not be interested in the xfs-specific bits, but it may be helpful to
have the full series as an example and for testing purposes.
First there is a set of kernel bits, then some libxfs/xfsprogs stuff,
and finally a test. (Note: I am not posting the unicode database files
due to their large size. There are scripts to download them from
unicode.org in the relevant commit headers.)
TODO: Store the unicode version number of the filesystem on disk in the
super block.
Thanks,
Ben
Here are Olaf's design notes:
-----------------------------------------------------------------------------
Unicode/UTF-8 support for XFS
So we had a customer request proper unicode support...
* What does "supporting unicode" actually mean?
From a text processing point of view, what a filesystem does with
filenames is simple: it stores and retrieves them, and compares them
for equality. It may reject certain byte sequences as invalid
filenames (for example, no filename can contain an ASCII NUL).
I've been taking it as a given that when a file is created with a
certain byte sequence as its name, then a subsequent directory listing
will contain that same byte sequence among the names listed.
This leaves comparing names for equality, and in my view this is what
"supporting unicode" revolves around.
The present state of affairs is that different byte sequences are
different filenames. This amounts to tolerating unicode without
actually supporting it.
To support unicode we have to interpret filenames. What happens when
(part of) a filename cannot be interpreted? We can reject the
filename, interpret the parts we can, or punt and accept it as an
uninterpreted blob.
Rejecting ill-formed filenames was my first choice, but I came around
on the issue: there are too many ways in which you can end up with
having to deal with ill-formed filenames that would leave a user with
no recourse but to move whatever they're doing to a different
filesystem. Unpacking a tarball with filenames in a different encoding
is an example.
Partial interpretation of an ill-formed filename just strikes me as
the kind of bad idea that most halfway houses are. I admit that I have no
stronger objection to this than the fact that it makes the code even
more complicated and fragile.
Which leaves "blob" as the preferred option by default for coping with
ill-formed filenames.
When comparing well-formed filenames, the question now becomes which
byte sequences are considered to be alternative spellings of the same
filename. This is where normalization forms come into play, and the
unicode standard has quite a bit to say about the subject.
If all you're doing is comparison, then choosing NFD over NFC is easy,
because the former is easier to calculate than the latter.
If you want various spellings of "office" to compare equal, then
picking NFKD over NFD for comparison is also an obvious
choice. (Hand-picking individual compatibility forms is truly a bad
idea.) Ways to spell "office": "o_f_f_i_c_e", "o_f_fi_c_e", and
"o_ffi_c_e", using no ligatures, the fi ligature, or the ffi
ligature. (Some fool thought it a good idea to add these ligatures to
unicode; all we get to decide is how to cope.)
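As an illustration of why a compatibility decomposition makes the three
spellings of "office" compare equal, here is a minimal sketch in C. It is
not the code from this series: it hand-codes only three ligature mappings
(U+FB00, U+FB01, U+FB03), which is exactly the hand-picking warned against
above, and it does no canonical reordering. The names toy_nfkd() and
toy_names_equal() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One entry of a toy decomposition table (hypothetical, illustration only). */
struct toy_map {
	const char *from;	/* UTF-8 encoding of the ligature */
	const char *to;		/* its compatibility decomposition */
};

static const struct toy_map toy_table[] = {
	{ "\xEF\xAC\x80", "ff"  },	/* U+FB00 LATIN SMALL LIGATURE FF */
	{ "\xEF\xAC\x81", "fi"  },	/* U+FB01 LATIN SMALL LIGATURE FI */
	{ "\xEF\xAC\x83", "ffi" },	/* U+FB03 LATIN SMALL LIGATURE FFI */
};

/*
 * Copy src to dst, replacing the three known ligatures by their
 * decompositions. Returns 0 on success, -1 if dst is too small.
 */
static int toy_nfkd(const char *src, char *dst, size_t dstlen)
{
	size_t out = 0;

	assert(src != NULL && dst != NULL);
	while (*src) {
		const char *rep = NULL;
		size_t skip = 1, replen = 1, i;

		for (i = 0; i < sizeof(toy_table) / sizeof(toy_table[0]); i++) {
			size_t n = strlen(toy_table[i].from);

			if (strncmp(src, toy_table[i].from, n) == 0) {
				rep = toy_table[i].to;
				skip = n;
				replen = strlen(rep);
				break;
			}
		}
		if (out + replen >= dstlen)	/* keep room for the NUL */
			return -1;
		if (rep)
			memcpy(dst + out, rep, replen);
		else
			dst[out] = *src;
		out += replen;
		src += skip;
	}
	dst[out] = '\0';
	return 0;
}

/* Compare two names by their (toy) normalized forms. */
static int toy_names_equal(const char *a, const char *b)
{
	char na[256], nb[256];

	if (toy_nfkd(a, na, sizeof(na)) || toy_nfkd(b, nb, sizeof(nb)))
		return strcmp(a, b) == 0;	/* too long: byte comparison */
	return strcmp(na, nb) == 0;
}
```

With this, "office" spelled with the fi ligature ("of" + U+FB01 + "ce") or
the ffi ligature ("o" + U+FB03 + "ce") compares equal to the plain ASCII
spelling, while the bytes stored for each name stay untouched.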
The most contentious part is (should be) ignoring the codepoints with
the Default_Ignorable_Code_Point property. I've included the list
below. My argument, such as it is, is that these code points either
have no visible rendering, or in cases like the soft hyphen, are only
conditionally visible. The problem with these (as I see it) is that on
seeing a filename that might contain them you cannot tell whether they
are present. So I propose to ignore them for the purpose of comparing
filenames for equality.
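A sketch of what "ignoring" means in practice, assuming names have already
been decoded into arrays of code points. The range table is copied from the
Default_Ignorable_Code_Point list included at the end of these notes; the
function names are hypothetical, and the code in this series uses a trie
rather than a range table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct cp_range {
	uint32_t first, last;	/* inclusive range of code points */
};

/* Sorted, non-overlapping; taken from the list at the end of the notes. */
static const struct cp_range ignorable[] = {
	{ 0x00AD, 0x00AD }, { 0x034F, 0x034F }, { 0x061C, 0x061C },
	{ 0x115F, 0x1160 }, { 0x17B4, 0x17B5 }, { 0x180B, 0x180D },
	{ 0x180E, 0x180E }, { 0x200B, 0x200F }, { 0x202A, 0x202E },
	{ 0x2060, 0x2064 }, { 0x2065, 0x2065 }, { 0x2066, 0x206F },
	{ 0x3164, 0x3164 }, { 0xFE00, 0xFE0F }, { 0xFEFF, 0xFEFF },
	{ 0xFFA0, 0xFFA0 }, { 0xFFF0, 0xFFF8 }, { 0x1BCA0, 0x1BCA3 },
	{ 0x1D173, 0x1D17A }, { 0xE0000, 0xE0000 }, { 0xE0001, 0xE0001 },
	{ 0xE0002, 0xE001F }, { 0xE0020, 0xE007F }, { 0xE0080, 0xE00FF },
	{ 0xE0100, 0xE01EF }, { 0xE01F0, 0xE0FFF },
};
#define N_IGNORABLE (sizeof(ignorable) / sizeof(ignorable[0]))

/* Binary search over the sorted range table. */
static int is_default_ignorable(uint32_t cp)
{
	size_t lo = 0, hi = N_IGNORABLE;

	assert(cp <= 0x10FFFF);
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (cp < ignorable[mid].first)
			hi = mid;
		else if (cp > ignorable[mid].last)
			lo = mid + 1;
		else
			return 1;
	}
	return 0;
}

/* Number of code points covered by the table. */
static uint32_t ignorable_total(void)
{
	uint32_t total = 0;
	size_t i;

	for (i = 0; i < N_IGNORABLE; i++)
		total += ignorable[i].last - ignorable[i].first + 1;
	return total;
}

/* Compare two code point sequences, skipping ignorable code points. */
static int names_equal_ignoring(const uint32_t *a, size_t alen,
				const uint32_t *b, size_t blen)
{
	size_t i = 0, j = 0;

	for (;;) {
		while (i < alen && is_default_ignorable(a[i]))
			i++;
		while (j < blen && is_default_ignorable(b[j]))
			j++;
		if (i == alen || j == blen)
			return i == alen && j == blen;
		if (a[i++] != b[j++])
			return 0;
	}
}

/* Usage: "co<SOFT HYPHEN>op" versus "coop". */
static int soft_hyphen_demo(void)
{
	static const uint32_t with_shy[] = { 'c', 'o', 0x00AD, 'o', 'p' };
	static const uint32_t plain[] = { 'c', 'o', 'o', 'p' };

	return names_equal_ignoring(with_shy, 5, plain, 4);
}
```

The two names in soft_hyphen_demo() render identically in most contexts, and
under this rule they also compare equal.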
Finally, case folding. First of all, it is optional. Then the issue is
that you either go the language-specific route, or simplify the task
by "just" doing a full casefold (C+F, in unicode parlance). Looking
around the net I tend to find that if you're going to do casefolding
at all, then a language-independent full casefold is preferred because
it is the most predictable option. See
http://www.w3.org/TR/charmod-norm/ for an example of that kind of
reasoning.
An additional question is whether case folding should be a fixed
(mkfs-time) property of a filesystem or can be enabled and disabled on
the fly. When mixing these modes, preferring exact matches is easy.
But after case-sensitive creates of files named "README" and "readme",
which of these two files will be found by case-insensitive lookups of
"Readme", and "ReadMe"? Does the answer differ if the order in which
the files were created is reversed? I do not have good answers to
those questions, and absent such answers the behavior of a filesystem
becomes hard to predict. This may not be a bug according to the
design, but it will be experienced as a bug by users. This is why in
these patches case folding is a property set at mkfs time.
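The README/readme problem can be made concrete with a toy model. This is
only a sketch under a strong assumption: the "directory" is a plain array
searched in creation order, which a real XFS directory is not. The lookup
prefers an exact match and otherwise returns the first case-insensitive
match. All names here are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* ASCII-only case-insensitive equality. */
static int ascii_ci_equal(const char *a, const char *b)
{
	for (; *a && *b; a++, b++)
		if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
			return 0;
	return *a == *b;	/* equal only if both ended together */
}

/*
 * Return the exact match if there is one, otherwise the first
 * case-insensitive match in directory order, otherwise NULL.
 */
static const char *toy_lookup(const char *const *dir, size_t n,
			      const char *name)
{
	const char *ci = NULL;
	size_t i;

	assert(name != NULL);
	for (i = 0; i < n; i++) {
		if (strcmp(dir[i], name) == 0)
			return dir[i];
		if (!ci && ascii_ci_equal(dir[i], name))
			ci = dir[i];
	}
	return ci;
}

/* The same two files, created case-sensitively in either order. */
static const char *const created_upper_first[] = { "README", "readme" };
static const char *const created_lower_first[] = { "readme", "README" };
```

Looking up "Readme" returns "README" in the first directory and "readme" in
the second, so the result depends on history the user cannot see. Fixing the
mode at mkfs time means two such entries cannot coexist in normal use.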
All of these choices can be argued with, but I do believe that the
particular combination of choices I made is a defensible one.
The code refers to these normalization forms as nfkdi and nfkdicf.
* XFS-specific design notes.
XFS uses byte strings for filenames, so UTF-8 is the expected format for
unicode filenames. This does raise the question of what criteria a byte string
must meet to be UTF-8. We settled on the following:
- Valid unicode code points are 0..0x10FFFF, except that
- The surrogates 0xD800..0xDFFF are not valid code points, and
- Valid UTF-8 must be a shortest encoding of a valid unicode code point.
In addition, U+0 (ASCII NUL, '\0') is used to terminate byte strings (and
is itself not part of the string). Moreover, strings may be length-limited
in addition to being NUL-terminated (there is no such thing as an embedded
NUL in a length-limited string).
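The three criteria above can be sketched as a stand-alone validator. This is
an illustration only: the series itself validates by walking the
normalization trie, not with code like this, and the function names are
hypothetical. Following the rule about NUL, the length-limited check stops
at the first NUL byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one UTF-8 sequence from s (at most len bytes). Returns the
 * number of bytes consumed and stores the code point in *cp, or
 * returns 0 if the sequence is ill-formed.
 */
static size_t sketch_utf8_decode(const unsigned char *s, size_t len,
				 uint32_t *cp)
{
	uint32_t c, min;
	size_t need, i;

	if (len == 0)
		return 0;
	c = s[0];
	if (c < 0x80) {
		need = 1; min = 0;
	} else if ((c & 0xE0) == 0xC0) {
		need = 2; min = 0x80; c &= 0x1F;
	} else if ((c & 0xF0) == 0xE0) {
		need = 3; min = 0x800; c &= 0x0F;
	} else if ((c & 0xF8) == 0xF0) {
		need = 4; min = 0x10000; c &= 0x07;
	} else {
		return 0;	/* stray continuation or invalid lead byte */
	}
	if (need > len)
		return 0;	/* truncated sequence */
	for (i = 1; i < need; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return 0;	/* not a continuation byte */
		c = (c << 6) | (s[i] & 0x3F);
	}
	if (c < min)
		return 0;	/* not the shortest encoding */
	if (c > 0x10FFFF)
		return 0;	/* beyond the unicode range */
	if (c >= 0xD800 && c <= 0xDFFF)
		return 0;	/* surrogate */
	*cp = c;
	return need;
}

/* Check a length-limited string; a NUL byte ends the string early. */
static int sketch_utf8n_valid(const char *s, size_t len)
{
	const unsigned char *p = (const unsigned char *)s;
	uint32_t cp;

	assert(s != NULL);
	while (len && *p) {
		size_t n = sketch_utf8_decode(p, len, &cp);

		if (n == 0)
			return 0;
		p += n;
		len -= n;
	}
	return 1;
}
```

Under the "blob" policy described earlier, a name that fails such a check
would not be rejected; it would simply be compared byte for byte.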
The code uses ("leverages", in corp-speak) the existing XFS
infrastructure for case-insensitive filenames. Like the CI code, the
name used to create a file is stored on disk, and returned in a
lookup. When comparing filenames the normalized forms of the names
being compared are generated on the fly from the non-normalized forms
stored on disk.
If the borgbit (the bit enabling legacy ASCII-based CI in XFS) is set
in the superblock, then case folding is added into the mix. This is
the nfkdicf normalization form mentioned above. It allows for the
creation of case-insensitive filesystems with UTF-8 support.
* Implementation notes.
Strings are normalized using a trie that stores the relevant
information. The trie itself is about 250kB in size, and lives in a
separate module. The trie is not checked in: instead we add the source
files from the Unicode Character Database and a program that creates
the header containing the trie.
The key for a lookup in the trie is a UTF-8 sequence. Each valid UTF-8
sequence leads to a leaf. No invalid sequence does. This means that trie
lookups can be used to validate UTF-8 sequences, which is why there is no
specialized code for the same purpose.
The trie contains information for the version of unicode in which each
code point was defined. This matters because non-normalized strings are
stored on disk, and newer versions of unicode may introduce new normalized
forms. Ideally, the version of unicode used by the filesystem is stored in
the filesystem.
The trie also accounts for corrections made in the past to normalizations.
This has little value today, because any newly created filesystem would be
using unicode version 7.0.0. It is included in order to show, not tell,
that such corrections can be handled if they are added in future revisions.
The algorithm used to calculate the sequences of bytes for the normalized
form of a UTF-8 string is tricky. The core is found in utf8byte(), with an
explanation in the preceding comment.
The non-XFS-specific supporting code functions have the prefix 'utf8n'
if they handle length-limited strings, and 'utf8' if they handle
NUL-terminated strings.
----
# Derived Property: Default_Ignorable_Code_Point
# Generated from
# Other_Default_Ignorable_Code_Point
# + Cf (Format characters)
# + Variation_Selector
# - White_Space
# - FFF9..FFFB (Annotation Characters)
# - 0600..0605, 06DD, 070F, 110BD (exceptional Cf characters that should be visible)
00AD ; Default_Ignorable_Code_Point # Cf SOFT HYPHEN
034F ; Default_Ignorable_Code_Point # Mn COMBINING GRAPHEME JOINER
061C ; Default_Ignorable_Code_Point # Cf ARABIC LETTER MARK
115F..1160 ; Default_Ignorable_Code_Point # Lo [2] HANGUL CHOSEONG FILLER..HANGUL JUNGSEONG FILLER
17B4..17B5 ; Default_Ignorable_Code_Point # Mn [2] KHMER VOWEL INHERENT AQ..KHMER VOWEL INHERENT AA
180B..180D ; Default_Ignorable_Code_Point # Mn [3] MONGOLIAN FREE VARIATION SELECTOR ONE..MONGOLIAN FREE VARIATION SELECTOR THREE
180E ; Default_Ignorable_Code_Point # Cf MONGOLIAN VOWEL SEPARATOR
200B..200F ; Default_Ignorable_Code_Point # Cf [5] ZERO WIDTH SPACE..RIGHT-TO-LEFT MARK
202A..202E ; Default_Ignorable_Code_Point # Cf [5] LEFT-TO-RIGHT EMBEDDING..RIGHT-TO-LEFT OVERRIDE
2060..2064 ; Default_Ignorable_Code_Point # Cf [5] WORD JOINER..INVISIBLE PLUS
2065 ; Default_Ignorable_Code_Point # Cn <reserved-2065>
2066..206F ; Default_Ignorable_Code_Point # Cf [10] LEFT-TO-RIGHT ISOLATE..NOMINAL DIGIT SHAPES
3164 ; Default_Ignorable_Code_Point # Lo HANGUL FILLER
FE00..FE0F ; Default_Ignorable_Code_Point # Mn [16] VARIATION SELECTOR-1..VARIATION SELECTOR-16
FEFF ; Default_Ignorable_Code_Point # Cf ZERO WIDTH NO-BREAK SPACE
FFA0 ; Default_Ignorable_Code_Point # Lo HALFWIDTH HANGUL FILLER
FFF0..FFF8 ; Default_Ignorable_Code_Point # Cn [9] <reserved-FFF0>..<reserved-FFF8>
1BCA0..1BCA3 ; Default_Ignorable_Code_Point # Cf [4] SHORTHAND FORMAT LETTER OVERLAP..SHORTHAND FORMAT UP STEP
1D173..1D17A ; Default_Ignorable_Code_Point # Cf [8] MUSICAL SYMBOL BEGIN BEAM..MUSICAL SYMBOL END PHRASE
E0000 ; Default_Ignorable_Code_Point # Cn <reserved-E0000>
E0001 ; Default_Ignorable_Code_Point # Cf LANGUAGE TAG
E0002..E001F ; Default_Ignorable_Code_Point # Cn [30] <reserved-E0002>..<reserved-E001F>
E0020..E007F ; Default_Ignorable_Code_Point # Cf [96] TAG SPACE..CANCEL TAG
E0080..E00FF ; Default_Ignorable_Code_Point # Cn [128] <reserved-E0080>..<reserved-E00FF>
E0100..E01EF ; Default_Ignorable_Code_Point # Mn [240] VARIATION SELECTOR-17..VARIATION SELECTOR-256
E01F0..E0FFF ; Default_Ignorable_Code_Point # Cn [3600] <reserved-E01F0>..<reserved-E0FFF>
# Total code points: 4173
----
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* [PATCH 01/10] xfs: return the first match during case-insensitive lookup.
@ 2014-09-18 20:08 Ben Myers
From: Ben Myers @ 2014-09-18 20:08 UTC (permalink / raw)
To: linux-fsdevel; +Cc: tinguely, olaf, xfs
From: Olaf Weber <olaf@sgi.com>
Change the XFS case-insensitive lookup code to return the first match
found, even if it is not an exact match. Whether a filesystem uses
case-insensitive lookups is determined by a superblock bit set during
filesystem creation. This means that normal use cannot create two files
that both match the same filename.
Signed-off-by: Olaf Weber <olaf@sgi.com>
---
 fs/xfs/libxfs/xfs_dir2_block.c | 17 +++------
 fs/xfs/libxfs/xfs_dir2_leaf.c  | 37 ++++----------------
 fs/xfs/libxfs/xfs_dir2_node.c  | 79 ++++++++++++++++--------------------------
 fs/xfs/libxfs/xfs_dir2_sf.c    |  8 ++---
 4 files changed, 45 insertions(+), 96 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_dir2_block.c b/fs/xfs/libxfs/xfs_dir2_block.c
index 9628cec..990bf0c 100644
--- a/fs/xfs/libxfs/xfs_dir2_block.c
+++ b/fs/xfs/libxfs/xfs_dir2_block.c
@@ -725,28 +725,21 @@ xfs_dir2_block_lookup_int(
 		dep = (xfs_dir2_data_entry_t *)
 			((char *)hdr + xfs_dir2_dataptr_to_off(args->geo, addr));
 		/*
-		 * Compare name and if it's an exact match, return the index
-		 * and buffer. If it's the first case-insensitive match, store
-		 * the index and buffer and continue looking for an exact match.
+		 * Compare name and if it's a match, return the
+		 * index and buffer.
 		 */
 		cmp = mp->m_dirnameops->compname(args, dep->name, dep->namelen);
-		if (cmp != XFS_CMP_DIFFERENT && cmp != args->cmpresult) {
+		if (cmp != XFS_CMP_DIFFERENT) {
 			args->cmpresult = cmp;
 			*bpp = bp;
 			*entno = mid;
-			if (cmp == XFS_CMP_EXACT)
-				return 0;
+			return 0;
 		}
 	} while (++mid < be32_to_cpu(btp->count) &&
 			be32_to_cpu(blp[mid].hashval) == hash);

 	ASSERT(args->op_flags & XFS_DA_OP_OKNOENT);
-	/*
-	 * Here, we can only be doing a lookup (not a rename or replace).
-	 * If a case-insensitive match was found earlier, return success.
-	 */
-	if (args->cmpresult == XFS_CMP_CASE)
-		return 0;
+	ASSERT(args->cmpresult == XFS_CMP_DIFFERENT);
 	/*
 	 * No match, release the buffer and return ENOENT.
 	 */
diff --git a/fs/xfs/libxfs/xfs_dir2_leaf.c b/fs/xfs/libxfs/xfs_dir2_leaf.c
index a19174e..3d572ee 100644
--- a/fs/xfs/libxfs/xfs_dir2_leaf.c
+++ b/fs/xfs/libxfs/xfs_dir2_leaf.c
@@ -1226,7 +1226,6 @@ xfs_dir2_leaf_lookup_int(
 	xfs_mount_t		*mp;		/* filesystem mount point */
 	xfs_dir2_db_t		newdb;		/* new data block number */
 	xfs_trans_t		*tp;		/* transaction pointer */
-	xfs_dir2_db_t		cidb = -1;	/* case match data block no. */
 	enum xfs_dacmp		cmp;		/* name compare result */
 	struct xfs_dir2_leaf_entry *ents;
 	struct xfs_dir3_icleaf_hdr leafhdr;
@@ -1290,46 +1289,22 @@ xfs_dir2_leaf_lookup_int(
 			be32_to_cpu(lep->address)));
 		/*
 		 * Compare name and if it's an exact match, return the index
-		 * and buffer. If it's the first case-insensitive match, store
-		 * the index and buffer and continue looking for an exact match.
+		 * and buffer
 		 */
 		cmp = mp->m_dirnameops->compname(args, dep->name, dep->namelen);
-		if (cmp != XFS_CMP_DIFFERENT && cmp != args->cmpresult) {
+		if (cmp != XFS_CMP_DIFFERENT) {
 			args->cmpresult = cmp;
 			*indexp = index;
-			/* case exact match: return the current buffer. */
-			if (cmp == XFS_CMP_EXACT) {
-				*dbpp = dbp;
-				return 0;
-			}
-			cidb = curdb;
+			*dbpp = dbp;
+			return 0;
 		}
 	}
 	ASSERT(args->op_flags & XFS_DA_OP_OKNOENT);
-	/*
-	 * Here, we can only be doing a lookup (not a rename or remove).
-	 * If a case-insensitive match was found earlier, re-read the
-	 * appropriate data block if required and return it.
-	 */
-	if (args->cmpresult == XFS_CMP_CASE) {
-		ASSERT(cidb != -1);
-		if (cidb != curdb) {
-			xfs_trans_brelse(tp, dbp);
-			error = xfs_dir3_data_read(tp, dp,
-					xfs_dir2_db_to_da(args->geo, cidb),
-					-1, &dbp);
-			if (error) {
-				xfs_trans_brelse(tp, lbp);
-				return error;
-			}
-		}
-		*dbpp = dbp;
-		return 0;
-	}
+	ASSERT(args->cmpresult == XFS_CMP_DIFFERENT);
+
 	/*
 	 * No match found, return -ENOENT.
 	 */
-	ASSERT(cidb == -1);
 	if (dbp)
 		xfs_trans_brelse(tp, dbp);
 	xfs_trans_brelse(tp, lbp);
diff --git a/fs/xfs/libxfs/xfs_dir2_node.c b/fs/xfs/libxfs/xfs_dir2_node.c
index 2ae6ac2..1778c40 100644
--- a/fs/xfs/libxfs/xfs_dir2_node.c
+++ b/fs/xfs/libxfs/xfs_dir2_node.c
@@ -679,6 +679,7 @@ xfs_dir2_leafn_lookup_for_entry(
 	xfs_dir2_data_entry_t	*dep;		/* data block entry */
 	xfs_inode_t		*dp;		/* incore directory inode */
 	int			error;		/* error return value */
+	int			di = -1;	/* data entry index */
 	int			index;		/* leaf entry index */
 	xfs_dir2_leaf_t		*leaf;		/* leaf structure */
 	xfs_dir2_leaf_entry_t	*lep;		/* leaf entry */
@@ -709,6 +710,7 @@ xfs_dir2_leafn_lookup_for_entry(
 	if (state->extravalid) {
 		curbp = state->extrablk.bp;
 		curdb = state->extrablk.blkno;
+		di = state->extrablk.index;
 	}
 	/*
 	 * Loop over leaf entries with the right hash value.
@@ -734,28 +736,20 @@ xfs_dir2_leafn_lookup_for_entry(
 		 */
 		if (newdb != curdb) {
 			/*
-			 * If we had a block before that we aren't saving
-			 * for a CI name, drop it
+			 * If we had a block, drop it
 			 */
-			if (curbp && (args->cmpresult == XFS_CMP_DIFFERENT ||
-						curdb != state->extrablk.blkno))
+			if (curbp) {
 				xfs_trans_brelse(tp, curbp);
+				di = -1;
+			}
 			/*
-			 * If needing the block that is saved with a CI match,
-			 * use it otherwise read in the new data block.
+			 * Read in the new data block.
 			 */
-			if (args->cmpresult != XFS_CMP_DIFFERENT &&
-					newdb == state->extrablk.blkno) {
-				ASSERT(state->extravalid);
-				curbp = state->extrablk.bp;
-			} else {
-				error = xfs_dir3_data_read(tp, dp,
-						xfs_dir2_db_to_da(args->geo,
-								  newdb),
+			error = xfs_dir3_data_read(tp, dp,
+					xfs_dir2_db_to_da(args->geo, newdb),
 						-1, &curbp);
-				if (error)
-					return error;
-			}
+			if (error)
+				return error;
 			xfs_dir3_data_check(dp, curbp);
 			curdb = newdb;
 		}
@@ -766,53 +760,40 @@ xfs_dir2_leafn_lookup_for_entry(
 			xfs_dir2_dataptr_to_off(args->geo,
 						be32_to_cpu(lep->address)));
 		/*
-		 * Compare the entry and if it's an exact match, return
-		 * EEXIST immediately. If it's the first case-insensitive
-		 * match, store the block & inode number and continue looking.
+		 * Compare the entry and if it's a match, return
+		 * EEXIST immediately.
 		 */
 		cmp = mp->m_dirnameops->compname(args, dep->name, dep->namelen);
-		if (cmp != XFS_CMP_DIFFERENT && cmp != args->cmpresult) {
-			/* If there is a CI match block, drop it */
-			if (args->cmpresult != XFS_CMP_DIFFERENT &&
-						curdb != state->extrablk.blkno)
-				xfs_trans_brelse(tp, state->extrablk.bp);
+		if (cmp != XFS_CMP_DIFFERENT) {
 			args->cmpresult = cmp;
 			args->inumber = be64_to_cpu(dep->inumber);
 			args->filetype = dp->d_ops->data_get_ftype(dep);
-			*indexp = index;
-			state->extravalid = 1;
-			state->extrablk.bp = curbp;
-			state->extrablk.blkno = curdb;
-			state->extrablk.index = (int)((char *)dep -
-							(char *)curbp->b_addr);
-			state->extrablk.magic = XFS_DIR2_DATA_MAGIC;
 			curbp->b_ops = &xfs_dir3_data_buf_ops;
 			xfs_trans_buf_set_type(tp, curbp, XFS_BLFT_DIR_DATA_BUF);
-			if (cmp == XFS_CMP_EXACT)
-				return -EEXIST;
+			di = (int)((char *)dep - (char *)curbp->b_addr);
+			error = -EEXIST;
+			goto out;
+		}
 	}
+	/* Didn't find a match */
+	error = -ENOENT;
 	ASSERT(index == leafhdr.count || (args->op_flags & XFS_DA_OP_OKNOENT));
+out:
 	if (curbp) {
-		if (args->cmpresult == XFS_CMP_DIFFERENT) {
-			/* Giving back last used data block. */
-			state->extravalid = 1;
-			state->extrablk.bp = curbp;
-			state->extrablk.index = -1;
-			state->extrablk.blkno = curdb;
-			state->extrablk.magic = XFS_DIR2_DATA_MAGIC;
-			curbp->b_ops = &xfs_dir3_data_buf_ops;
-			xfs_trans_buf_set_type(tp, curbp, XFS_BLFT_DIR_DATA_BUF);
-		} else {
-			/* If the curbp is not the CI match block, drop it */
-			if (state->extrablk.bp != curbp)
-				xfs_trans_brelse(tp, curbp);
-		}
+		/* Giving back last used data block. */
+		state->extravalid = 1;
+		state->extrablk.bp = curbp;
+		state->extrablk.index = di;
+		state->extrablk.blkno = curdb;
+		state->extrablk.magic = XFS_DIR2_DATA_MAGIC;
+		curbp->b_ops = &xfs_dir3_data_buf_ops;
+		xfs_trans_buf_set_type(tp, curbp, XFS_BLFT_DIR_DATA_BUF);
 	} else {
 		state->extravalid = 0;
 	}
 	*indexp = index;
-	return -ENOENT;
+	return error;
 }

 /*
diff --git a/fs/xfs/libxfs/xfs_dir2_sf.c b/fs/xfs/libxfs/xfs_dir2_sf.c
index 5079e05..e69fdb7 100644
--- a/fs/xfs/libxfs/xfs_dir2_sf.c
+++ b/fs/xfs/libxfs/xfs_dir2_sf.c
@@ -757,19 +757,19 @@ xfs_dir2_sf_lookup(
 	for (i = 0, sfep = xfs_dir2_sf_firstentry(sfp); i < sfp->count;
 	     i++, sfep = dp->d_ops->sf_nextentry(sfp, sfep)) {
 		/*
-		 * Compare name and if it's an exact match, return the inode
-		 * number. If it's the first case-insensitive match, store the
-		 * inode number and continue looking for an exact match.
+		 * Compare name and if it's a match, return the inode
+		 * number.
 		 */
 		cmp = dp->i_mount->m_dirnameops->compname(args, sfep->name,
 								sfep->namelen);
-		if (cmp != XFS_CMP_DIFFERENT && cmp != args->cmpresult) {
+		if (cmp != XFS_CMP_DIFFERENT) {
 			args->cmpresult = cmp;
 			args->inumber = dp->d_ops->sf_get_ino(sfp, sfep);
 			args->filetype = dp->d_ops->sf_get_ftype(sfep);
 			if (cmp == XFS_CMP_EXACT)
 				return -EEXIST;
 			ci_sfep = sfep;
+			break;
 		}
 	}
 	ASSERT(args->op_flags & XFS_DA_OP_OKNOENT);
-- 
1.7.12.4
* [PATCH 02/10] xfs: rename XFS_CMP_CASE to XFS_CMP_MATCH
@ 2014-09-18 20:09 Ben Myers
From: Ben Myers @ 2014-09-18 20:09 UTC (permalink / raw)
To: linux-fsdevel; +Cc: tinguely, olaf, xfs
From: Olaf Weber <olaf@sgi.com>
Rename XFS_CMP_CASE to XFS_CMP_MATCH. With unicode filenames and
normalization, different strings will match on other criteria than
case insensitivity.
Signed-off-by: Olaf Weber <olaf@sgi.com>
---
 fs/xfs/libxfs/xfs_da_btree.h  | 2 +-
 fs/xfs/libxfs/xfs_dir2.c      | 9 ++++++---
 fs/xfs/libxfs/xfs_dir2_node.c | 2 +-
 3 files changed, 8 insertions(+), 5 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_da_btree.h b/fs/xfs/libxfs/xfs_da_btree.h
index 6e153e3..9ebcc23 100644
--- a/fs/xfs/libxfs/xfs_da_btree.h
+++ b/fs/xfs/libxfs/xfs_da_btree.h
@@ -52,7 +52,7 @@ struct xfs_da_geometry {
 enum xfs_dacmp {
 	XFS_CMP_DIFFERENT,	/* names are completely different */
 	XFS_CMP_EXACT,		/* names are exactly the same */
-	XFS_CMP_CASE		/* names are same but differ in case */
+	XFS_CMP_MATCH		/* names are same but differ in encoding */
 };

 /*
diff --git a/fs/xfs/libxfs/xfs_dir2.c b/fs/xfs/libxfs/xfs_dir2.c
index 6cef221..32e769b 100644
--- a/fs/xfs/libxfs/xfs_dir2.c
+++ b/fs/xfs/libxfs/xfs_dir2.c
@@ -74,7 +74,7 @@ xfs_ascii_ci_compname(
 			continue;
 		if (tolower(args->name[i]) != tolower(name[i]))
 			return XFS_CMP_DIFFERENT;
-		result = XFS_CMP_CASE;
+		result = XFS_CMP_MATCH;
 	}

 	return result;
@@ -315,8 +315,11 @@ xfs_dir_cilookup_result(
 {
 	if (args->cmpresult == XFS_CMP_DIFFERENT)
 		return -ENOENT;
-	if (args->cmpresult != XFS_CMP_CASE ||
-					!(args->op_flags & XFS_DA_OP_CILOOKUP))
+	if (args->cmpresult == XFS_CMP_EXACT)
+		return -EEXIST;
+	ASSERT(args->cmpresult == XFS_CMP_MATCH);
+	/* Only dup the found name if XFS_DA_OP_CILOOKUP is set. */
+	if (!(args->op_flags & XFS_DA_OP_CILOOKUP))
 		return -EEXIST;

 	args->value = kmem_alloc(len, KM_NOFS | KM_MAYFAIL);
diff --git a/fs/xfs/libxfs/xfs_dir2_node.c b/fs/xfs/libxfs/xfs_dir2_node.c
index 1778c40..9d46e8d 100644
--- a/fs/xfs/libxfs/xfs_dir2_node.c
+++ b/fs/xfs/libxfs/xfs_dir2_node.c
@@ -2023,7 +2023,7 @@ xfs_dir2_node_lookup(
 	error = xfs_da3_node_lookup_int(state, &rval);
 	if (error)
 		rval = error;
-	else if (rval == -ENOENT && args->cmpresult == XFS_CMP_CASE) {
+	else if (rval == -ENOENT && args->cmpresult == XFS_CMP_MATCH) {
 		/* If a CI match, dup the actual name and return -EEXIST */
 		xfs_dir2_data_entry_t	*dep;

-- 
1.7.12.4
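The three-way comparison result and the reworked xfs_dir_cilookup_result()
logic in patch 02 can be modelled in user space as follows. This is a sketch
only: the TOY_ names are hypothetical stand-ins for the XFS_CMP_ values, and
it assumes that the part of the function not shown in the hunk copies the
on-disk name and then also returns -EEXIST.

```c
#include <assert.h>
#include <errno.h>

/* Stand-ins for XFS_CMP_DIFFERENT, XFS_CMP_EXACT and XFS_CMP_MATCH. */
enum toy_dacmp { TOY_CMP_DIFFERENT, TOY_CMP_EXACT, TOY_CMP_MATCH };

/* Status reported for a lookup: no entry, or an entry exists. */
static int toy_lookup_status(enum toy_dacmp cmp)
{
	return cmp == TOY_CMP_DIFFERENT ? -ENOENT : -EEXIST;
}

/*
 * Should the caller be handed a copy of the name as stored on disk?
 * Only when the names matched without being byte-identical, and the
 * caller asked for it (the XFS_DA_OP_CILOOKUP flag in the patch).
 */
static int toy_should_dup_name(enum toy_dacmp cmp, int cilookup)
{
	assert(cmp == TOY_CMP_DIFFERENT || cmp == TOY_CMP_EXACT ||
	       cmp == TOY_CMP_MATCH);
	return cmp == TOY_CMP_MATCH && cilookup;
}
```

The rename from CASE to MATCH matters here because, with normalization, two
names can land in the third state without differing in case at all.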
* [PATCH 03/13] libxfs: add xfs_nameops.normhash
@ 2014-09-18 20:09 Ben Myers
From: Ben Myers @ 2014-09-18 20:09 UTC (permalink / raw)
To: linux-fsdevel; +Cc: tinguely, olaf, xfs
From: Olaf Weber <olaf@sgi.com>
Add a normhash callout to the xfs_nameops. This callout takes an
xfs_da_args structure as its argument, and calculates a hash value over
the name. It may in the process create a normalized form of the name,
and assign that to the norm/normlen fields in the xfs_da_args structure.
Changes:
The pointer in kmem_free() was type converted to suppress compiler
warnings.
Signed-off-by: Olaf Weber <olaf@sgi.com>
---
 include/xfs_da_btree.h |  5 ++++-
 libxfs/xfs_da_btree.c  |  9 ++++++++
 libxfs/xfs_dir2.c      | 56 +++++++++++++++++++++++++++++++++++++++-----------
 3 files changed, 57 insertions(+), 13 deletions(-)
diff --git a/include/xfs_da_btree.h b/include/xfs_da_btree.h
index 3d9f9dd..06b50bf 100644
--- a/include/xfs_da_btree.h
+++ b/include/xfs_da_btree.h
@@ -42,7 +42,9 @@ enum xfs_dacmp {
  */
 typedef struct xfs_da_args {
 	const __uint8_t	*name;		/* string (maybe not NULL terminated) */
-	int		namelen;	/* length of string (maybe no NULL) */
+	const __uint8_t	*norm;		/* normalized name (may be NULL) */
+	int		namelen;	/* length of string (maybe no NULL) */
+	int		normlen;	/* length of normalized name */
 	__uint8_t	filetype;	/* filetype of inode for directories */
 	__uint8_t	*value;		/* set of bytes (maybe contain NULLs) */
 	int		valuelen;	/* length of value */
@@ -131,6 +133,7 @@ typedef struct xfs_da_state {
  */
 struct xfs_nameops {
 	xfs_dahash_t	(*hashname)(struct xfs_name *);
+	int		(*normhash)(struct xfs_da_args *);
 	enum xfs_dacmp	(*compname)(struct xfs_da_args *,
 					const unsigned char *, int);
 };
diff --git a/libxfs/xfs_da_btree.c b/libxfs/xfs_da_btree.c
index b731b54..eb97317 100644
--- a/libxfs/xfs_da_btree.c
+++ b/libxfs/xfs_da_btree.c
@@ -2000,8 +2000,17 @@ xfs_default_hashname(
 	return xfs_da_hashname(name->name, name->len);
 }

+STATIC int
+xfs_da_normhash(
+	struct xfs_da_args *args)
+{
+	args->hashval = xfs_da_hashname(args->name, args->namelen);
+	return 0;
+}
+
 const struct xfs_nameops xfs_default_nameops = {
 	.hashname	= xfs_default_hashname,
+	.normhash	= xfs_da_normhash,
 	.compname	= xfs_da_compname
 };

diff --git a/libxfs/xfs_dir2.c b/libxfs/xfs_dir2.c
index 57e98a3..e52d082 100644
--- a/libxfs/xfs_dir2.c
+++ b/libxfs/xfs_dir2.c
@@ -54,6 +54,21 @@ xfs_ascii_ci_hashname(
 	return hash;
 }

+STATIC int
+xfs_ascii_ci_normhash(
+	struct xfs_da_args *args)
+{
+	xfs_dahash_t	hash;
+	int		i;
+
+	for (i = 0, hash = 0; i < args->namelen; i++)
+		hash = tolower(args->name[i]) ^ rol32(hash, 7);
+
+	args->hashval = hash;
+	return 0;
+}
+
+
 STATIC enum xfs_dacmp
 xfs_ascii_ci_compname(
 	struct xfs_da_args *args,
@@ -80,6 +95,7 @@ xfs_ascii_ci_compname(

 static struct xfs_nameops xfs_ascii_ci_nameops = {
 	.hashname	= xfs_ascii_ci_hashname,
+	.normhash	= xfs_ascii_ci_normhash,
 	.compname	= xfs_ascii_ci_compname,
 };

@@ -211,7 +227,6 @@ xfs_dir_createname(
 	args.name = name->name;
 	args.namelen = name->len;
 	args.filetype = name->type;
-	args.hashval = dp->i_mount->m_dirnameops->hashname(name);
 	args.inumber = inum;
 	args.dp = dp;
 	args.firstblock = first;
@@ -220,19 +235,24 @@ xfs_dir_createname(
 	args.whichfork = XFS_DATA_FORK;
 	args.trans = tp;
 	args.op_flags = XFS_DA_OP_ADDNAME | XFS_DA_OP_OKNOENT;
+	if ((rval = dp->i_mount->m_dirnameops->normhash(&args)))
+		return rval;

 	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL)
 		rval = xfs_dir2_sf_addname(&args);
 	else if ((rval = xfs_dir2_isblock(tp, dp, &v)))
-		return rval;
+		goto out_free;
 	else if (v)
 		rval = xfs_dir2_block_addname(&args);
 	else if ((rval = xfs_dir2_isleaf(tp, dp, &v)))
-		return rval;
+		goto out_free;
 	else if (v)
 		rval = xfs_dir2_leaf_addname(&args);
 	else
 		rval = xfs_dir2_node_addname(&args);
+out_free:
+	if (args.norm)
+		kmem_free((void *)args.norm);
 	return rval;
 }

@@ -289,22 +309,23 @@ xfs_dir_lookup(
 	args.name = name->name;
 	args.namelen = name->len;
 	args.filetype = name->type;
-	args.hashval = dp->i_mount->m_dirnameops->hashname(name);
 	args.dp = dp;
 	args.whichfork = XFS_DATA_FORK;
 	args.trans = tp;
 	args.op_flags = XFS_DA_OP_OKNOENT;
 	if (ci_name)
 		args.op_flags |= XFS_DA_OP_CILOOKUP;
+	if ((rval = dp->i_mount->m_dirnameops->normhash(&args)))
+		return rval;

 	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL)
 		rval = xfs_dir2_sf_lookup(&args);
 	else if ((rval = xfs_dir2_isblock(tp, dp, &v)))
-		return rval;
+		goto out_free;
 	else if (v)
 		rval = xfs_dir2_block_lookup(&args);
 	else if ((rval = xfs_dir2_isleaf(tp, dp, &v)))
-		return rval;
+		goto out_free;
 	else if (v)
 		rval = xfs_dir2_leaf_lookup(&args);
 	else
@@ -318,6 +339,9 @@ xfs_dir_lookup(
 			ci_name->len = args.valuelen;
 		}
 	}
+out_free:
+	if (args.norm)
+		kmem_free((void *)args.norm);
 	return rval;
 }

@@ -345,7 +369,6 @@ xfs_dir_removename(
 	args.name = name->name;
 	args.namelen = name->len;
 	args.filetype = name->type;
-	args.hashval = dp->i_mount->m_dirnameops->hashname(name);
 	args.inumber = ino;
 	args.dp = dp;
 	args.firstblock = first;
@@ -353,19 +376,24 @@ xfs_dir_removename(
 	args.total = total;
 	args.whichfork = XFS_DATA_FORK;
 	args.trans = tp;
+	if ((rval = dp->i_mount->m_dirnameops->normhash(&args)))
+		return rval;

 	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL)
 		rval = xfs_dir2_sf_removename(&args);
 	else if ((rval = xfs_dir2_isblock(tp, dp, &v)))
-		return rval;
+		goto out_free;
 	else if (v)
 		rval = xfs_dir2_block_removename(&args);
 	else if ((rval = xfs_dir2_isleaf(tp, dp, &v)))
-		return rval;
+		goto out_free;
 	else if (v)
 		rval = xfs_dir2_leaf_removename(&args);
 	else
 		rval = xfs_dir2_node_removename(&args);
+out_free:
+	if (args.norm)
+		kmem_free((void *)args.norm);
 	return rval;
 }

@@ -395,7 +423,6 @@ xfs_dir_replace(
 	args.name = name->name;
 	args.namelen = name->len;
 	args.filetype = name->type;
-	args.hashval = dp->i_mount->m_dirnameops->hashname(name);
 	args.inumber = inum;
 	args.dp = dp;
 	args.firstblock = first;
@@ -403,19 +430,24 @@ xfs_dir_replace(
 	args.total = total;
 	args.whichfork = XFS_DATA_FORK;
 	args.trans = tp;
+	if ((rval = dp->i_mount->m_dirnameops->normhash(&args)))
+		return rval;

 	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL)
 		rval = xfs_dir2_sf_replace(&args);
 	else if ((rval = xfs_dir2_isblock(tp, dp, &v)))
-		return rval;
+		goto out_free;
 	else if (v)
 		rval = xfs_dir2_block_replace(&args);
 	else if ((rval = xfs_dir2_isleaf(tp, dp, &v)))
-		return rval;
+		goto out_free;
 	else if (v)
 		rval = xfs_dir2_leaf_replace(&args);
 	else
 		rval = xfs_dir2_node_replace(&args);
+out_free:
+	if (args.norm)
+		kmem_free((void *)args.norm);
 	return rval;
 }

-- 
1.7.12.4
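The normhash callout in patch 03 can be exercised outside the kernel with a
small model. This is a sketch, not the libxfs code: the toy_ structures are
hypothetical, cut-down stand-ins for xfs_da_args and xfs_nameops, and
toy_rol32() is a local reimplementation of the 32-bit rotate. The hash loop
itself follows xfs_ascii_ci_normhash() from the patch.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Rotate a 32-bit word left by s bits (0 < s < 32). */
static uint32_t toy_rol32(uint32_t w, unsigned int s)
{
	return (w << s) | (w >> (32 - s));
}

/* Cut-down stand-in for struct xfs_da_args. */
struct toy_args {
	const unsigned char *name;	/* name as given by the caller */
	int namelen;
	const unsigned char *norm;	/* normalized name, may stay NULL */
	int normlen;
	uint32_t hashval;		/* filled in by normhash */
};

/* Cut-down stand-in for struct xfs_nameops. */
struct toy_nameops {
	int (*normhash)(struct toy_args *);
};

/* ASCII case-insensitive hash; needs no separate normalized name. */
static int toy_ascii_ci_normhash(struct toy_args *args)
{
	uint32_t hash = 0;
	int i;

	assert(args->namelen >= 0);
	for (i = 0; i < args->namelen; i++)
		hash = (uint32_t)tolower(args->name[i]) ^ toy_rol32(hash, 7);
	args->hashval = hash;
	return 0;
}

static const struct toy_nameops toy_ci_ops = {
	.normhash = toy_ascii_ci_normhash,
};

/* Convenience wrapper: hash a NUL-terminated name through the callout. */
static uint32_t toy_hash_of(const struct toy_nameops *ops, const char *name)
{
	struct toy_args args = {
		(const unsigned char *)name, (int)strlen(name), NULL, 0, 0
	};

	if (ops->normhash(&args))
		return 0;
	return args.hashval;
}
```

Names that differ only in ASCII case hash to the same value, so they land in
the same hash chain and the compname callout gets a chance to see them.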
* [PATCH 04/10] xfs: change interface of xfs_nameops.normhash 2014-09-18 19:56 [RFC v2] Unicode/UTF-8 support for XFS Ben Myers @ 2014-09-18 20:10 ` Ben Myers 2014-09-18 20:09 ` [PATCH 02/10] xfs: rename XFS_CMP_CASE to XFS_CMP_MATCH Ben Myers ` (15 subsequent siblings) 16 siblings, 0 replies; 84+ messages in thread From: Ben Myers @ 2014-09-18 20:10 UTC (permalink / raw) To: linux-fsdevel; +Cc: xfs, olaf, tinguely From: Olaf Weber <olaf@sgi.com> With the introduction of the xfs_nameops.normhash callout, all uses of the hashname callout now occur in places where an xfs_name structure must be explicitly created just to match the parameter passing convention of this callout. Change the arguments to a const unsigned char * and int instead. Signed-off-by: Olaf Weber <olaf@sgi.com> --- fs/xfs/libxfs/xfs_da_btree.c | 9 +-------- fs/xfs/libxfs/xfs_da_btree.h | 2 +- fs/xfs/libxfs/xfs_dir2.c | 7 ++++--- fs/xfs/libxfs/xfs_dir2_block.c | 2 +- fs/xfs/libxfs/xfs_dir2_data.c | 3 ++- 5 files changed, 9 insertions(+), 14 deletions(-) diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c index 07a3acf..a0608ca 100644 --- a/fs/xfs/libxfs/xfs_da_btree.c +++ b/fs/xfs/libxfs/xfs_da_btree.c @@ -1983,13 +1983,6 @@ xfs_da_compname( XFS_CMP_EXACT : XFS_CMP_DIFFERENT; } -static xfs_dahash_t -xfs_default_hashname( - struct xfs_name *name) -{ - return xfs_da_hashname(name->name, name->len); -} - STATIC int xfs_da_normhash( struct xfs_da_args *args) @@ -1999,7 +1992,7 @@ xfs_da_normhash( } const struct xfs_nameops xfs_default_nameops = { - .hashname = xfs_default_hashname, + .hashname = xfs_da_hashname, .normhash = xfs_da_normhash, .compname = xfs_da_compname }; diff --git a/fs/xfs/libxfs/xfs_da_btree.h b/fs/xfs/libxfs/xfs_da_btree.h index 6cdafee..4d6b36f 100644 --- a/fs/xfs/libxfs/xfs_da_btree.h +++ b/fs/xfs/libxfs/xfs_da_btree.h @@ -151,7 +151,7 @@ typedef struct xfs_da_state { * Name ops for directory and/or attr name operations */ struct xfs_nameops { - xfs_dahash_t 
(*hashname)(struct xfs_name *); + xfs_dahash_t (*hashname)(const unsigned char *, int); int (*normhash)(struct xfs_da_args *); enum xfs_dacmp (*compname)(struct xfs_da_args *, const unsigned char *, int); diff --git a/fs/xfs/libxfs/xfs_dir2.c b/fs/xfs/libxfs/xfs_dir2.c index 55733a6..84e5ca9 100644 --- a/fs/xfs/libxfs/xfs_dir2.c +++ b/fs/xfs/libxfs/xfs_dir2.c @@ -45,13 +45,14 @@ struct xfs_name xfs_name_dotdot = { (unsigned char *)"..", 2, XFS_DIR3_FT_DIR }; */ STATIC xfs_dahash_t xfs_ascii_ci_hashname( - struct xfs_name *name) + const unsigned char *name, + int len) { xfs_dahash_t hash; int i; - for (i = 0, hash = 0; i < name->len; i++) - hash = tolower(name->name[i]) ^ rol32(hash, 7); + for (i = 0, hash = 0; i < len; i++) + hash = tolower(name[i]) ^ rol32(hash, 7); return hash; } diff --git a/fs/xfs/libxfs/xfs_dir2_block.c b/fs/xfs/libxfs/xfs_dir2_block.c index 990bf0c..f93c141 100644 --- a/fs/xfs/libxfs/xfs_dir2_block.c +++ b/fs/xfs/libxfs/xfs_dir2_block.c @@ -1231,7 +1231,7 @@ xfs_dir2_sf_to_block( name.name = sfep->name; name.len = sfep->namelen; blp[2 + i].hashval = cpu_to_be32(mp->m_dirnameops-> - hashname(&name)); + hashname(sfep->name, sfep->namelen)); blp[2 + i].address = cpu_to_be32(xfs_dir2_byte_to_dataptr( (char *)dep - (char *)hdr)); offset = (int)((char *)(tagp + 1) - (char *)hdr); diff --git a/fs/xfs/libxfs/xfs_dir2_data.c b/fs/xfs/libxfs/xfs_dir2_data.c index fdd803f..28c35cf 100644 --- a/fs/xfs/libxfs/xfs_dir2_data.c +++ b/fs/xfs/libxfs/xfs_dir2_data.c @@ -179,7 +179,8 @@ __xfs_dir3_data_check( ((char *)dep - (char *)hdr)); name.name = dep->name; name.len = dep->namelen; - hash = mp->m_dirnameops->hashname(&name); + hash = mp->m_dirnameops->hashname(dep->name, + dep->namelen); for (i = 0; i < be32_to_cpu(btp->count); i++) { if (be32_to_cpu(lep[i].address) == addr && be32_to_cpu(lep[i].hashval) == hash) -- 1.7.12.4 ^ permalink raw reply related [flat|nested] 84+ messages in thread
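To illustrate the interface change in user space, here is a minimal sketch of a hashname callout that takes a byte pointer and a length. The hash loop mirrors xfs_ascii_ci_hashname() from the patch. The names dahash_t, struct nameops, struct dirent_like and hash_entry() are hypothetical stand-ins rather than kernel definitions, and rol32() is reimplemented locally.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>

typedef uint32_t dahash_t;	/* stand-in for xfs_dahash_t */

/* Rotate a 32-bit word left, like the kernel's rol32(). */
static uint32_t rol32(uint32_t word, unsigned int shift)
{
	return (word << (shift & 31)) | (word >> ((-shift) & 31));
}

/* Same loop as xfs_ascii_ci_hashname() after the patch. */
static dahash_t ascii_ci_hashname(const unsigned char *name, int len)
{
	dahash_t hash = 0;
	int i;

	assert(len >= 0);
	for (i = 0; i < len; i++)
		hash = tolower(name[i]) ^ rol32(hash, 7);
	return hash;
}

/* Hypothetical reduced nameops table: only the hashname callout. */
struct nameops {
	dahash_t (*hashname)(const unsigned char *, int);
};

static const struct nameops ascii_ci_nameops = {
	.hashname = ascii_ci_hashname,
};

/* Hypothetical stand-in for an on-disk directory entry. */
struct dirent_like {
	int namelen;
	unsigned char name[16];
};

/* The caller passes the entry's fields directly, with no temporary name struct. */
static dahash_t hash_entry(const struct nameops *ops,
			   const struct dirent_like *ep)
{
	return ops->hashname(ep->name, ep->namelen);
}
```

With this shape, callers such as xfs_dir2_sf_to_block() can hash straight from the entry's name and namelen fields.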
* [PATCH 05/10] xfs: add a superblock feature bit to indicate UTF-8 support. 2014-09-18 19:56 [RFC v2] Unicode/UTF-8 support for XFS Ben Myers ` (3 preceding siblings ...) 2014-09-18 20:10 ` Ben Myers @ 2014-09-18 20:11 ` Ben Myers 2014-09-18 20:13 ` [PATCH 03/10] xfs: add xfs_nameops.normhash Ben Myers ` (11 subsequent siblings) 16 siblings, 0 replies; 84+ messages in thread From: Ben Myers @ 2014-09-18 20:11 UTC (permalink / raw) To: linux-fsdevel; +Cc: tinguely, olaf, xfs From: Olaf Weber <olaf@sgi.com> When UTF-8 support is enabled, the xfs_dir_ci_inode_operations must be installed. Add xfs_sb_version_hasci(), which tests both the borgbit and the utf8bit, and returns true if at least one of them is set. Replace calls to xfs_sb_version_hasasciici() as needed. Signed-off-by: Olaf Weber <olaf@sgi.com> --- fs/xfs/libxfs/xfs_sb.h | 24 +++++++++++++++++++++++- fs/xfs/xfs_fs.h | 1 + fs/xfs/xfs_fsops.c | 4 +++- fs/xfs/xfs_iops.c | 4 ++-- 4 files changed, 29 insertions(+), 4 deletions(-) diff --git a/fs/xfs/libxfs/xfs_sb.h b/fs/xfs/libxfs/xfs_sb.h index 2e73970..525eacb 100644 --- a/fs/xfs/libxfs/xfs_sb.h +++ b/fs/xfs/libxfs/xfs_sb.h @@ -70,6 +70,7 @@ struct xfs_trans; #define XFS_SB_VERSION2_RESERVED4BIT 0x00000004 #define XFS_SB_VERSION2_ATTR2BIT 0x00000008 /* Inline attr rework */ #define XFS_SB_VERSION2_PARENTBIT 0x00000010 /* parent pointers */ +#define XFS_SB_VERSION2_UTF8BIT 0x00000020 /* utf8 names */ #define XFS_SB_VERSION2_PROJID32BIT 0x00000080 /* 32 bit project id */ #define XFS_SB_VERSION2_CRCBIT 0x00000100 /* metadata CRCs */ #define XFS_SB_VERSION2_FTYPE 0x00000200 /* inode type in dir */ @@ -77,6 +78,7 @@ struct xfs_trans; #define XFS_SB_VERSION2_OKBITS \ (XFS_SB_VERSION2_LAZYSBCOUNTBIT | \ XFS_SB_VERSION2_ATTR2BIT | \ + XFS_SB_VERSION2_UTF8BIT | \ XFS_SB_VERSION2_PROJID32BIT | \ XFS_SB_VERSION2_FTYPE) @@ -509,8 +511,10 @@ xfs_sb_has_ro_compat_feature( } #define XFS_SB_FEAT_INCOMPAT_FTYPE (1 << 0) /* filetype in dirent */ +#define 
XFS_SB_FEAT_INCOMPAT_UTF8 (1 << 1) /* utf-8 name support */ #define XFS_SB_FEAT_INCOMPAT_ALL \ - (XFS_SB_FEAT_INCOMPAT_FTYPE) + (XFS_SB_FEAT_INCOMPAT_FTYPE | \ + XFS_SB_FEAT_INCOMPAT_UTF8) #define XFS_SB_FEAT_INCOMPAT_UNKNOWN ~XFS_SB_FEAT_INCOMPAT_ALL static inline bool @@ -558,6 +562,24 @@ static inline int xfs_sb_version_hasfinobt(xfs_sb_t *sbp) (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_FINOBT); } +static inline int xfs_sb_version_hasutf8(xfs_sb_t *sbp) +{ + return (XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5 && + xfs_sb_has_incompat_feature(sbp, XFS_SB_FEAT_INCOMPAT_UTF8)) || + (xfs_sb_version_hasmorebits(sbp) && + (sbp->sb_features2 & XFS_SB_VERSION2_UTF8BIT)); +} + +/* + * Special case: there are a number of places where we need to test + * both the borgbit and the utf8bit, and take the same action if + * either of those is set. + */ +static inline int xfs_sb_version_hasci(xfs_sb_t *sbp) +{ + return xfs_sb_version_hasasciici(sbp) || xfs_sb_version_hasutf8(sbp); +} + /* * end of superblock version macros */ diff --git a/fs/xfs/xfs_fs.h b/fs/xfs/xfs_fs.h index 18dc721..e845d75 100644 --- a/fs/xfs/xfs_fs.h +++ b/fs/xfs/xfs_fs.h @@ -239,6 +239,7 @@ typedef struct xfs_fsop_resblks { #define XFS_FSOP_GEOM_FLAGS_V5SB 0x8000 /* version 5 superblock */ #define XFS_FSOP_GEOM_FLAGS_FTYPE 0x10000 /* inode directory types */ #define XFS_FSOP_GEOM_FLAGS_FINOBT 0x20000 /* free inode btree */ +#define XFS_FSOP_GEOM_FLAGS_UTF8 0x40000 /* utf8 filenames */ /* * Minimum and maximum sizes need for growth checks. diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c index f91de1e..1a83eef 100644 --- a/fs/xfs/xfs_fsops.c +++ b/fs/xfs/xfs_fsops.c @@ -103,7 +103,9 @@ xfs_fs_geometry( (xfs_sb_version_hasftype(&mp->m_sb) ? XFS_FSOP_GEOM_FLAGS_FTYPE : 0) | (xfs_sb_version_hasfinobt(&mp->m_sb) ? - XFS_FSOP_GEOM_FLAGS_FINOBT : 0); + XFS_FSOP_GEOM_FLAGS_FINOBT : 0) | + (xfs_sb_version_hasutf8(&mp->m_sb) ? 
+ XFS_FSOP_GEOM_FLAGS_UTF8 : 0); geo->logsectsize = xfs_sb_version_hassector(&mp->m_sb) ? mp->m_sb.sb_logsectsize : BBSIZE; geo->rtsectsize = mp->m_sb.sb_blocksize; diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c index 7212949..cea3d64 100644 --- a/fs/xfs/xfs_iops.c +++ b/fs/xfs/xfs_iops.c @@ -335,9 +335,9 @@ xfs_vn_unlink( /* * With unlink, the VFS makes the dentry "negative": no inode, * but still hashed. This is incompatible with case-insensitive - * mode, so invalidate (unhash) the dentry in CI-mode. + * or utf8 mode, so invalidate (unhash) the dentry in CI-mode. */ - if (xfs_sb_version_hasasciici(&XFS_M(dir->i_sb)->m_sb)) + if (xfs_sb_version_hasci(&XFS_M(dir->i_sb)->m_sb)) d_invalidate(dentry); return 0; } -- 1.7.12.4 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 84+ messages in thread
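The feature test added by this patch can be sketched outside the kernel as follows. The two bit values are copied from the patch. struct sb and its fields are a hypothetical, heavily reduced superblock: in particular, the morebits and asciici booleans stand in for the real xfs_sb_version_hasmorebits() and xfs_sb_version_hasasciici() checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bit values as defined in the patch. */
#define SB_VERSION2_UTF8BIT	0x00000020
#define SB_FEAT_INCOMPAT_UTF8	(1 << 1)

/* Hypothetical reduced superblock; field names are illustrative only. */
struct sb {
	int	 version;		/* 4 or 5 */
	bool	 morebits;		/* features2 field is valid */
	bool	 asciici;		/* ASCII case-insensitive ("borgbit") */
	uint32_t features2;
	uint32_t features_incompat;
};

/* UTF-8 is on if the v5 incompat bit or the v4 features2 bit is set. */
static bool sb_hasutf8(const struct sb *sbp)
{
	assert(sbp != NULL);
	return (sbp->version == 5 &&
		(sbp->features_incompat & SB_FEAT_INCOMPAT_UTF8)) ||
	       (sbp->morebits &&
		(sbp->features2 & SB_VERSION2_UTF8BIT));
}

/* True if either the ASCII-CI mode or the UTF-8 mode is enabled. */
static bool sb_hasci(const struct sb *sbp)
{
	return sbp->asciici || sb_hasutf8(sbp);
}
```

Callers that only care whether dentries need the case-insensitive treatment, such as the unlink path, would use the combined test.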
* [PATCH 03/10] xfs: add xfs_nameops.normhash 2014-09-18 19:56 [RFC v2] Unicode/UTF-8 support for XFS Ben Myers ` (4 preceding siblings ...) 2014-09-18 20:11 ` [PATCH 05/10] xfs: add a superblock feature bit to indicate UTF-8 support Ben Myers @ 2014-09-18 20:13 ` Ben Myers 2014-09-18 20:14 ` Ben Myers ` (10 subsequent siblings) 16 siblings, 0 replies; 84+ messages in thread From: Ben Myers @ 2014-09-18 20:13 UTC (permalink / raw) To: linux-fsdevel; +Cc: tinguely, olaf, xfs From: Olaf Weber <olaf@sgi.com> Add a normhash callout to the xfs_nameops. This callout takes an xfs_da_args structure as its argument, and calculates a hash value over the name. It may in the process create a normalized form of the name, and assign that to the norm/normlen fields in the xfs_da_args structure. Signed-off-by: Olaf Weber <olaf@sgi.com> --- fs/xfs/libxfs/xfs_da_btree.c | 9 +++++++++ fs/xfs/libxfs/xfs_da_btree.h | 3 +++ fs/xfs/libxfs/xfs_dir2.c | 42 +++++++++++++++++++++++++++++++++++++----- 3 files changed, 49 insertions(+), 5 deletions(-) diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c index 2c42ae2..07a3acf 100644 --- a/fs/xfs/libxfs/xfs_da_btree.c +++ b/fs/xfs/libxfs/xfs_da_btree.c @@ -1990,8 +1990,17 @@ xfs_default_hashname( return xfs_da_hashname(name->name, name->len); } +STATIC int +xfs_da_normhash( + struct xfs_da_args *args) +{ + args->hashval = xfs_da_hashname(args->name, args->namelen); + return 0; +} + const struct xfs_nameops xfs_default_nameops = { .hashname = xfs_default_hashname, + .normhash = xfs_da_normhash, .compname = xfs_da_compname }; diff --git a/fs/xfs/libxfs/xfs_da_btree.h b/fs/xfs/libxfs/xfs_da_btree.h index 9ebcc23..6cdafee 100644 --- a/fs/xfs/libxfs/xfs_da_btree.h +++ b/fs/xfs/libxfs/xfs_da_btree.h @@ -61,7 +61,9 @@ enum xfs_dacmp { typedef struct xfs_da_args { struct xfs_da_geometry *geo; /* da block geometry */ const __uint8_t *name; /* string (maybe not NULL terminated) */ + const __uint8_t *norm; /* normalized name (may be 
NULL) */ int namelen; /* length of string (maybe no NULL) */ + int normlen; /* length of normalized name */ __uint8_t filetype; /* filetype of inode for directories */ __uint8_t *value; /* set of bytes (maybe contain NULLs) */ int valuelen; /* length of value */ @@ -150,6 +152,7 @@ typedef struct xfs_da_state { */ struct xfs_nameops { xfs_dahash_t (*hashname)(struct xfs_name *); + int (*normhash)(struct xfs_da_args *); enum xfs_dacmp (*compname)(struct xfs_da_args *, const unsigned char *, int); }; diff --git a/fs/xfs/libxfs/xfs_dir2.c b/fs/xfs/libxfs/xfs_dir2.c index 32e769b..55733a6 100644 --- a/fs/xfs/libxfs/xfs_dir2.c +++ b/fs/xfs/libxfs/xfs_dir2.c @@ -56,6 +56,21 @@ xfs_ascii_ci_hashname( return hash; } +STATIC int +xfs_ascii_ci_normhash( + struct xfs_da_args *args) +{ + xfs_dahash_t hash; + int i; + + for (i = 0, hash = 0; i < args->namelen; i++) + hash = tolower(args->name[i]) ^ rol32(hash, 7); + + args->hashval = hash; + return 0; +} + + STATIC enum xfs_dacmp xfs_ascii_ci_compname( struct xfs_da_args *args, @@ -82,6 +97,7 @@ xfs_ascii_ci_compname( static struct xfs_nameops xfs_ascii_ci_nameops = { .hashname = xfs_ascii_ci_hashname, + .normhash = xfs_ascii_ci_normhash, .compname = xfs_ascii_ci_compname, }; @@ -267,7 +283,6 @@ xfs_dir_createname( args->name = name->name; args->namelen = name->len; args->filetype = name->type; - args->hashval = dp->i_mount->m_dirnameops->hashname(name); args->inumber = inum; args->dp = dp; args->firstblock = first; @@ -276,6 +291,8 @@ xfs_dir_createname( args->whichfork = XFS_DATA_FORK; args->trans = tp; args->op_flags = XFS_DA_OP_ADDNAME | XFS_DA_OP_OKNOENT; + if ((rval = dp->i_mount->m_dirnameops->normhash(args))) + goto out_free; if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) { rval = xfs_dir2_sf_addname(args); @@ -299,6 +316,8 @@ xfs_dir_createname( rval = xfs_dir2_node_addname(args); out_free: + if (args->norm) + kmem_free(args->norm); kmem_free(args); return rval; } @@ -365,13 +384,14 @@ xfs_dir_lookup( args->name = 
name->name; args->namelen = name->len; args->filetype = name->type; - args->hashval = dp->i_mount->m_dirnameops->hashname(name); args->dp = dp; args->whichfork = XFS_DATA_FORK; args->trans = tp; args->op_flags = XFS_DA_OP_OKNOENT; if (ci_name) args->op_flags |= XFS_DA_OP_CILOOKUP; + if ((rval = dp->i_mount->m_dirnameops->normhash(args))) + goto out_free; if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) { rval = xfs_dir2_sf_lookup(args); @@ -405,6 +425,9 @@ out_check_rval: } } out_free: + if (args->norm) + kmem_free(args->norm); + kmem_free(args); return rval; } @@ -437,7 +460,6 @@ xfs_dir_removename( args->name = name->name; args->namelen = name->len; args->filetype = name->type; - args->hashval = dp->i_mount->m_dirnameops->hashname(name); args->inumber = ino; args->dp = dp; args->firstblock = first; @@ -445,6 +467,8 @@ xfs_dir_removename( args->total = total; args->whichfork = XFS_DATA_FORK; args->trans = tp; + if ((rval = dp->i_mount->m_dirnameops->normhash(args))) + goto out_free; if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) { rval = xfs_dir2_sf_removename(args); @@ -467,6 +491,8 @@ xfs_dir_removename( else rval = xfs_dir2_node_removename(args); out_free: + if (args->norm) + kmem_free(args->norm); kmem_free(args); return rval; } @@ -502,7 +528,6 @@ xfs_dir_replace( args->name = name->name; args->namelen = name->len; args->filetype = name->type; - args->hashval = dp->i_mount->m_dirnameops->hashname(name); args->inumber = inum; args->dp = dp; args->firstblock = first; @@ -510,6 +535,8 @@ xfs_dir_replace( args->total = total; args->whichfork = XFS_DATA_FORK; args->trans = tp; + if ((rval = dp->i_mount->m_dirnameops->normhash(args))) + goto out_free; if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) { rval = xfs_dir2_sf_replace(args); @@ -532,6 +559,8 @@ xfs_dir_replace( else rval = xfs_dir2_node_replace(args); out_free: + if (args->norm) + kmem_free(args->norm); kmem_free(args); return rval; } @@ -564,12 +593,13 @@ xfs_dir_canenter( args->name = name->name; 
args->namelen = name->len; args->filetype = name->type; - args->hashval = dp->i_mount->m_dirnameops->hashname(name); args->dp = dp; args->whichfork = XFS_DATA_FORK; args->trans = tp; args->op_flags = XFS_DA_OP_JUSTCHECK | XFS_DA_OP_ADDNAME | XFS_DA_OP_OKNOENT; + if ((rval = dp->i_mount->m_dirnameops->normhash(args))) + goto out_free; if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) { rval = xfs_dir2_sf_addname(args); @@ -592,6 +622,8 @@ xfs_dir_canenter( else rval = xfs_dir2_node_addname(args); out_free: + if (args->norm) + kmem_free(args->norm); kmem_free(args); return rval; } -- 1.7.12.4 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 84+ messages in thread
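The calling pattern this patch introduces (fill in the args, call normhash, run the operation, free args->norm on the way out) can be sketched in user space as below. This is an illustration under stated assumptions, not the kernel code: struct da_args is reduced to the fields involved, simple_hash() is a placeholder rather than xfs_da_hashname(), malloc/free stand in for the kmem helpers, and lower_normhash() uses ASCII lowercasing as a hypothetical substitute for real Unicode normalization.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Reduced stand-in for struct xfs_da_args. */
struct da_args {
	const unsigned char *name;	/* name as supplied by the caller */
	const unsigned char *norm;	/* normalized name, may be NULL */
	int namelen;
	int normlen;
	uint32_t hashval;
};

static uint32_t rol32(uint32_t word, unsigned int shift)
{
	return (word << (shift & 31)) | (word >> ((-shift) & 31));
}

/* Placeholder hash, not the real xfs_da_hashname(). */
static uint32_t simple_hash(const unsigned char *s, int len)
{
	uint32_t hash = 0;
	int i;

	for (i = 0; i < len; i++)
		hash = s[i] ^ rol32(hash, 7);
	return hash;
}

/* Default-style callout: hash the name as is, create no normalized copy. */
static int default_normhash(struct da_args *args)
{
	assert(args != NULL);
	args->hashval = simple_hash(args->name, args->namelen);
	return 0;
}

/* Hypothetical normalizing callout: allocate norm, then hash it. */
static int lower_normhash(struct da_args *args)
{
	unsigned char *norm;
	int i;

	norm = malloc(args->namelen > 0 ? (size_t)args->namelen : 1);
	if (!norm)
		return -ENOMEM;
	for (i = 0; i < args->namelen; i++)
		norm[i] = (unsigned char)tolower(args->name[i]);
	args->norm = norm;
	args->normlen = args->namelen;
	args->hashval = simple_hash(norm, args->normlen);
	return 0;
}

/* Caller pattern: every exit after normhash goes through out_free. */
static int hash_name(int (*normhash)(struct da_args *),
		     const unsigned char *name, int len, uint32_t *hashval)
{
	struct da_args args = { .name = name, .namelen = len };
	int rval;

	rval = normhash(&args);
	if (rval)
		goto out_free;
	*hashval = args.hashval;
	/* ... the directory operation would run here ... */
out_free:
	free((void *)args.norm);	/* free(NULL) is a no-op */
	return rval;
}
```

The point of the single out_free label is that a callout which allocates a normalized name never leaks it, whichever directory format branch fails.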
* [PATCH 06/10] xfs: add unicode character database files 2014-09-18 19:56 [RFC v2] Unicode/UTF-8 support for XFS Ben Myers @ 2014-09-18 20:14 ` Ben Myers 2014-09-18 20:09 ` [PATCH 02/10] xfs: rename XFS_CMP_CASE to XFS_CMP_MATCH Ben Myers ` (15 subsequent siblings) 16 siblings, 0 replies; 84+ messages in thread From: Ben Myers @ 2014-09-18 20:14 UTC (permalink / raw) To: linux-fsdevel; +Cc: xfs, olaf, tinguely From: Olaf Weber <olaf@sgi.com> Add files from the Unicode Character Database, version 7.0.0, to the source. A helper program that generates a trie used for normalization from these files is part of a separate commit. Signed-off-by: Olaf Weber <olaf@sgi.com> --- [v2: Removed large unicode files prior to posting. Get them as below. -bpm] [v3: Moved files to ucd8norm directory. -bpm] cd fs/xfs/utf8norm/ucd wget http://www.unicode.org/Public/7.0.0/ucd/CaseFolding.txt wget http://www.unicode.org/Public/7.0.0/ucd/DerivedAge.txt wget http://www.unicode.org/Public/7.0.0/ucd/extracted/DerivedCombiningClass.txt wget http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt wget http://www.unicode.org/Public/7.0.0/ucd/NormalizationCorrections.txt wget http://www.unicode.org/Public/7.0.0/ucd/NormalizationTest.txt wget http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt for e in *.txt do base=`basename $e .txt` mv $e $base-7.0.0.txt done --- fs/xfs/utf8norm/ucd/README | 33 +++++++++++++++++++++++++++++++++ 1 file changed, 33 insertions(+) create mode 100644 fs/xfs/utf8norm/ucd/README diff --git a/fs/xfs/utf8norm/ucd/README b/fs/xfs/utf8norm/ucd/README new file mode 100644 index 0000000..d713e66 --- /dev/null +++ b/fs/xfs/utf8norm/ucd/README @@ -0,0 +1,33 @@ +The files in this directory are part of the Unicode Character Database +for version 7.0.0 of the Unicode standard. 
+ +The full set of files can be found here: + + http://www.unicode.org/Public/7.0.0/ucd/ + +The latest released version of the UCD can be found here: + + http://www.unicode.org/Public/UCD/latest/ + +The files in this directory are identical, except that they have been +renamed with a suffix indicating the unicode version. + +Individual source links: + + http://www.unicode.org/Public/7.0.0/ucd/CaseFolding.txt + http://www.unicode.org/Public/7.0.0/ucd/DerivedAge.txt + http://www.unicode.org/Public/7.0.0/ucd/extracted/DerivedCombiningClass.txt + http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt + http://www.unicode.org/Public/7.0.0/ucd/NormalizationCorrections.txt + http://www.unicode.org/Public/7.0.0/ucd/NormalizationTest.txt + http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt + +md5sums + + 9a92b2bfe56c6719def926bab524fefd CaseFolding-7.0.0.txt + 07b8b1027eb824cf0835314e94f23d2e DerivedAge-7.0.0.txt + 90c3340b16821e2f2153acdbe6fc6180 DerivedCombiningClass-7.0.0.txt + c41c0601f808116f623de47110ed4f93 DerivedCoreProperties-7.0.0.txt + 522720ddfc150d8e63a2518634829bce NormalizationCorrections-7.0.0.txt + 1f35175eba4a2ad795db489f789ae352 NormalizationTest-7.0.0.txt + c8355655731d75e6a3de8c20d7e601ba UnicodeData-7.0.0.txt -- 1.7.12.4 ^ permalink raw reply related [flat|nested] 84+ messages in thread
* Re: [PATCH 06/10] xfs: add unicode character database files
2014-09-18 20:14 ` Ben Myers
@ 2014-09-22 20:54 ` Dave Chinner
-1 siblings, 0 replies; 84+ messages in thread
From: Dave Chinner @ 2014-09-22 20:54 UTC (permalink / raw)
To: Ben Myers; +Cc: linux-fsdevel, tinguely, olaf, xfs
On Thu, Sep 18, 2014 at 03:14:40PM -0500, Ben Myers wrote:
> From: Olaf Weber <olaf@sgi.com>
>
> Add files from the Unicode Character Database, version 7.0.0, to the source.
> A helper program that generates a trie used for normalization from these
> files is part of a separate commit.
>
> Signed-off-by: Olaf Weber <olaf@sgi.com>
> ---
> [v2: Removed large unicode files prior to posting. Get them as below. -bpm]
> [v3: Moved files to ucd8norm directory. -bpm]
>
> cd fs/xfs/utf8norm/ucd
> wget http://www.unicode.org/Public/7.0.0/ucd/CaseFolding.txt
> wget http://www.unicode.org/Public/7.0.0/ucd/DerivedAge.txt
> wget http://www.unicode.org/Public/7.0.0/ucd/extracted/DerivedCombiningClass.txt
> wget http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt
> wget http://www.unicode.org/Public/7.0.0/ucd/NormalizationCorrections.txt
> wget http://www.unicode.org/Public/7.0.0/ucd/NormalizationTest.txt
> wget http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt
> for e in *.txt
> do
> base=`basename $e .txt`
> mv $e $base-7.0.0.txt
> done
> ---
> fs/xfs/utf8norm/ucd/README | 33 +++++++++++++++++++++++++++++++++
This probably needs to live somewhere under lib/. There's nothing
XFS specific in it and the translations should be the same for
anything that wants to parse unicode.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH 06/10] xfs: add unicode character database files
2014-09-22 20:54 ` Dave Chinner
(?)
@ 2014-09-26 17:09 ` Christoph Hellwig
-1 siblings, 0 replies; 84+ messages in thread
From: Christoph Hellwig @ 2014-09-26 17:09 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, Ben Myers, tinguely, olaf, xfs
On Tue, Sep 23, 2014 at 06:54:38AM +1000, Dave Chinner wrote:
> This probably needs to live somewhere under lib/. There's nothing
> XFS specific in it and the translations should be the same for
> anything that wants to parse unicode.
Agreed.
^ permalink raw reply [flat|nested] 84+ messages in thread
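The description of patch 07/10, which follows, defines a string as valid UTF-8 when the encoded values lie in 0x1..0x10FFFF, no surrogate code point (0xD800..0xDFFF) is encoded, and every value uses its shortest encoding. As a rough illustration of those three rules, here is a standalone check on a length-limited string. It is a sketch only: utf8n_is_valid() is a hypothetical name, and the series itself walks a generated trie in utf8norm.c rather than decoding code points this way.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return 1 if the first len bytes of s are valid UTF-8 under the three
 * rules above, 0 otherwise. Hypothetical helper, not part of the series.
 */
static int utf8n_is_valid(const unsigned char *s, size_t len)
{
	size_t i = 0;

	assert(s != NULL || len == 0);
	while (i < len) {
		unsigned int c = s[i];
		unsigned int cp, min;
		size_t n, k;

		if (c < 0x80) {
			cp = c; n = 1; min = 0x1;	/* NUL is excluded */
		} else if ((c & 0xE0) == 0xC0) {
			cp = c & 0x1F; n = 2; min = 0x80;
		} else if ((c & 0xF0) == 0xE0) {
			cp = c & 0x0F; n = 3; min = 0x800;
		} else if ((c & 0xF8) == 0xF0) {
			cp = c & 0x07; n = 4; min = 0x10000;
		} else {
			return 0;	/* stray continuation or bad lead byte */
		}
		if (n > len - i)
			return 0;	/* sequence runs past the end */
		for (k = 1; k < n; k++) {
			if ((s[i + k] & 0xC0) != 0x80)
				return 0;	/* not a continuation byte */
			cp = (cp << 6) | (s[i + k] & 0x3F);
		}
		if (cp < min || cp > 0x10FFFF)
			return 0;	/* overlong encoding or out of range */
		if (cp >= 0xD800 && cp <= 0xDFFF)
			return 0;	/* surrogate code point */
		i += n;
	}
	return 1;
}
```

The min value per sequence length is what enforces the shortest-encoding rule: a two-byte sequence that decodes below 0x80, for instance, is rejected as overlong.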
* [PATCH 07/10] xfs: add trie generator and supporting code for UTF-8.
2014-09-18 19:56 [RFC v2] Unicode/UTF-8 support for XFS Ben Myers
` (6 preceding siblings ...)
2014-09-18 20:14 ` Ben Myers
@ 2014-09-18 20:15 ` Ben Myers
2014-09-22 20:57 ` Dave Chinner
2014-09-18 20:16 ` Ben Myers
` (8 subsequent siblings)
16 siblings, 1 reply; 84+ messages in thread
From: Ben Myers @ 2014-09-18 20:15 UTC (permalink / raw)
To: linux-fsdevel; +Cc: tinguely, olaf, xfs
From: Olaf Weber <olaf@sgi.com>
mkutf8data.c is the source for a program that generates utf8data.h, which
contains the trie that utf8norm.c uses. The trie is generated from the
Unicode 7.0.0 data files. The format of the utf8data[] table is described
in utf8norm.c.
Supporting functions for UTF-8 normalization are in utf8norm.c with the
header utf8norm.h. Two normalization forms are supported: nfkdi and
nfkdicf.
nfkdi:
- Apply unicode normalization form NFKD.
- Remove any Default_Ignorable_Code_Point.
nfkdicf:
- Apply unicode normalization form NFKD.
- Remove any Default_Ignorable_Code_Point.
- Apply a full casefold (C + F).
For the purposes of the code, a string is valid UTF-8 if:
- The values encoded are 0x1..0x10FFFF.
- The surrogate codepoints 0xD800..0xDFFF are not encoded.
- The shortest possible encoding is used for all values.
The supporting functions work on null-terminated strings (utf8 prefix)
and on length-limited strings (utf8n prefix).
Signed-off-by: Olaf Weber <olaf@sgi.com>
---
[v2: the trie is now separated into utf8norm.ko; utf8version is now a
function and exported; introduced CONFIG_XFS_UTF8.
-bpm] --- fs/xfs/Kconfig | 8 + fs/xfs/Makefile | 2 +- fs/xfs/utf8norm/Makefile | 37 + fs/xfs/utf8norm/mkutf8data.c | 3239 ++++++++++++++++++++++++++++++++++++++++++ fs/xfs/utf8norm/utf8norm.c | 649 +++++++++ fs/xfs/utf8norm/utf8norm.h | 116 ++ 6 files changed, 4050 insertions(+), 1 deletion(-) create mode 100644 fs/xfs/utf8norm/Makefile create mode 100644 fs/xfs/utf8norm/mkutf8data.c create mode 100644 fs/xfs/utf8norm/utf8norm.c create mode 100644 fs/xfs/utf8norm/utf8norm.h diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig index 5d47b4d..a847857 100644 --- a/fs/xfs/Kconfig +++ b/fs/xfs/Kconfig @@ -95,3 +95,11 @@ config XFS_DEBUG not useful unless you are debugging a particular problem. Say N unless you are an XFS developer, or you play one on TV. + +config XFS_UTF8 + bool "XFS UTF-8 support" + depends on XFS_FS + help + Say Y here to enable utf8 normalization support in XFS. You + will be able to mount and use filesystems created with the + utf8 mkfs.xfs option. diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile index d617999..6d000d3 100644 --- a/fs/xfs/Makefile +++ b/fs/xfs/Makefile @@ -21,7 +21,7 @@ ccflags-y += -I$(src)/libxfs ccflags-$(CONFIG_XFS_DEBUG) += -g -obj-$(CONFIG_XFS_FS) += xfs.o +obj-$(CONFIG_XFS_FS) += xfs.o utf8norm/ # this one should be compiled first, as the tracing macros can easily blow up xfs-y += xfs_trace.o diff --git a/fs/xfs/utf8norm/Makefile b/fs/xfs/utf8norm/Makefile new file mode 100644 index 0000000..f83f9b9 --- /dev/null +++ b/fs/xfs/utf8norm/Makefile @@ -0,0 +1,37 @@ +# +# Copyright (c) 2014 SGI. +# All rights reserved. +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License as +# published by the Free Software Foundation. +# +# This program is distributed in the hope that it would be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. 
+# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write the Free Software Foundation, +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +# + +ifeq ($(CONFIG_XFS_UTF8),y) +obj-m += utf8norm.o +endif + +hostprogs-y := mkutf8data +$(obj)/utf8norm.o: $(obj)/utf8data.h +$(obj)/utf8data.h: $(src)/ucd/*.txt +$(obj)/utf8data.h: $(obj)/mkutf8data FORCE + $(call if_changed,mkutf8data) +quiet_cmd_mkutf8data = MKUTF8DATA $@ + cmd_mkutf8data = $(obj)/mkutf8data \ + -a $(src)/ucd/DerivedAge-7.0.0.txt \ + -c $(src)/ucd/DerivedCombiningClass-7.0.0.txt \ + -p $(src)/ucd/DerivedCoreProperties-7.0.0.txt \ + -d $(src)/ucd/UnicodeData-7.0.0.txt \ + -f $(src)/ucd/CaseFolding-7.0.0.txt \ + -n $(src)/ucd/NormalizationCorrections-7.0.0.txt \ + -t $(src)/ucd/NormalizationTest-7.0.0.txt \ + -o $@ diff --git a/fs/xfs/utf8norm/mkutf8data.c b/fs/xfs/utf8norm/mkutf8data.c new file mode 100644 index 0000000..1d6ec02 --- /dev/null +++ b/fs/xfs/utf8norm/mkutf8data.c @@ -0,0 +1,3239 @@ +/* + * Copyright (c) 2014 SGI. + * All rights reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write the Free Software Foundation, + * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + */ + +/* Generator for a compact trie for unicode normalization */ + +#include <sys/types.h> +#include <stddef.h> +#include <stdlib.h> +#include <stdio.h> +#include <assert.h> +#include <string.h> +#include <unistd.h> +#include <errno.h> + +/* Default names of the in- and output files. */ + +#define AGE_NAME "DerivedAge.txt" +#define CCC_NAME "DerivedCombiningClass.txt" +#define PROP_NAME "DerivedCoreProperties.txt" +#define DATA_NAME "UnicodeData.txt" +#define FOLD_NAME "CaseFolding.txt" +#define NORM_NAME "NormalizationCorrections.txt" +#define TEST_NAME "NormalizationTest.txt" +#define UTF8_NAME "utf8data.h" + +const char *age_name = AGE_NAME; +const char *ccc_name = CCC_NAME; +const char *prop_name = PROP_NAME; +const char *data_name = DATA_NAME; +const char *fold_name = FOLD_NAME; +const char *norm_name = NORM_NAME; +const char *test_name = TEST_NAME; +const char *utf8_name = UTF8_NAME; + +int verbose = 0; + +/* An arbitrary line size limit on input lines. */ + +#define LINESIZE 1024 +char line[LINESIZE]; +char buf0[LINESIZE]; +char buf1[LINESIZE]; +char buf2[LINESIZE]; +char buf3[LINESIZE]; + +const char *argv0; + +/* ------------------------------------------------------------------ */ + +/* + * Unicode version numbers consist of three parts: major, minor, and a + * revision. These numbers are packed into an unsigned int to obtain + * a single version number. + * + * To save space in the generated trie, the unicode version is not + * stored directly, instead we calculate a generation number from the + * unicode versions seen in the DerivedAge file, and use that as an + * index into a table of unicode versions. 
+ */
+#define UNICODE_MAJ_SHIFT	(16)
+#define UNICODE_MIN_SHIFT	(8)
+
+#define UNICODE_MAJ_MAX		((unsigned short)-1)
+#define UNICODE_MIN_MAX		((unsigned char)-1)
+#define UNICODE_REV_MAX		((unsigned char)-1)
+
+#define UNICODE_AGE(MAJ,MIN,REV)			\
+	(((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) |	\
+	 ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) |	\
+	 ((unsigned int)(REV)))
+
+unsigned int *ages;
+int ages_count;
+
+unsigned int unicode_maxage;
+
+static int
+age_valid(unsigned int major, unsigned int minor, unsigned int revision)
+{
+	if (major > UNICODE_MAJ_MAX)
+		return 0;
+	if (minor > UNICODE_MIN_MAX)
+		return 0;
+	if (revision > UNICODE_REV_MAX)
+		return 0;
+	return 1;
+}
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * utf8trie_t
+ *
+ * A compact binary tree, used to decode UTF-8 characters.
+ *
+ * Internal nodes are one byte for the node itself, and up to three
+ * bytes for an offset into the tree.  The first byte contains the
+ * following information:
+ *  NEXTBYTE  - flag        - advance to next byte if set
+ *  BITNUM    - 3 bit field - the bit number to be tested
+ *  OFFLEN    - 2 bit field - number of bytes in the offset
+ * if offlen == 0 (non-branching node)
+ *  RIGHTPATH - 1 bit field - set if the following node is for the
+ *                            right-hand path (tested bit is set)
+ *  TRIENODE  - 1 bit field - set if the following node is an internal
+ *                            node, otherwise it is a leaf node
+ * if offlen != 0 (branching node)
+ *  LEFTNODE  - 1 bit field - set if the left-hand node is internal
+ *  RIGHTNODE - 1 bit field - set if the right-hand node is internal
+ *
+ * Due to the way utf8 works, there cannot be branching nodes with
+ * NEXTBYTE set, and moreover those nodes always have a righthand
+ * descendant.
+ */
+typedef unsigned char utf8trie_t;
+#define BITNUM		0x07
+#define NEXTBYTE	0x08
+#define OFFLEN		0x30
+#define OFFLEN_SHIFT	4
+#define RIGHTPATH	0x40
+#define TRIENODE	0x80
+#define RIGHTNODE	0x40
+#define LEFTNODE	0x80
+
+/*
+ * utf8leaf_t
+ *
+ * The leaves of the trie are embedded in the trie, and so use the same
+ * underlying datatype, unsigned char.
+ *
+ * leaf[0]: The unicode version, stored as a generation number that is
+ *          an index into utf8agetab[].  With this we can filter code
+ *          points based on the unicode version in which they were
+ *          defined.  The CCC of a non-defined code point is 0.
+ * leaf[1]: Canonical Combining Class.  During normalization, we need
+ *          to do a stable sort into ascending order of all characters
+ *          with a non-zero CCC that occur between two characters with
+ *          a CCC of 0, or at the beginning or end of a string.
+ *          The unicode standard guarantees that all CCC values are
+ *          between 0 and 254 inclusive, which leaves 255 available as
+ *          a special value.
+ *          Code points with CCC 0 are known as stoppers.
+ * leaf[2]: Decomposition.  If leaf[1] == 255, then leaf[2] is the
+ *          start of a NUL-terminated string that is the decomposition
+ *          of the character.
+ *          The CCC of a decomposable character is the same as the CCC
+ *          of the first character of its decomposition.
+ *          Some characters decompose as the empty string: these are
+ *          characters with the Default_Ignorable_Code_Point property.
+ *          These do affect normalization, as they all have CCC 0.
+ *
+ * The decompositions in the trie have been fully expanded.
+ *
+ * Casefolding, if applicable, is also done using decompositions.
+ */ +typedef unsigned char utf8leaf_t; + +#define LEAF_GEN(LEAF) ((LEAF)[0]) +#define LEAF_CCC(LEAF) ((LEAF)[1]) +#define LEAF_STR(LEAF) ((const char*)((LEAF) + 2)) + +#define MAXGEN (255) + +#define MINCCC (0) +#define MAXCCC (254) +#define STOPPER (0) +#define DECOMPOSE (255) + +struct tree; +static utf8leaf_t *utf8nlookup(struct tree *, const char *, size_t); +static utf8leaf_t *utf8lookup(struct tree *, const char *); + +unsigned char *utf8data; +size_t utf8data_size; + +utf8trie_t *nfkdi; +utf8trie_t *nfkdicf; + +/* ------------------------------------------------------------------ */ + +/* + * UTF8 valid ranges. + * + * The UTF-8 encoding spreads the bits of a 32bit word over several + * bytes. This table gives the ranges that can be held and how they'd + * be represented. + * + * 0x00000000 0x0000007F: 0xxxxxxx + * 0x00000000 0x000007FF: 110xxxxx 10xxxxxx + * 0x00000000 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx + * 0x00000000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx + * 0x00000000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + * 0x00000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + * + * There is an additional requirement on UTF-8, in that only the + * shortest representation of a 32bit value is to be used. A decoder + * must not decode sequences that do not satisfy this requirement. + * Thus the allowed ranges have a lower bound. + * + * 0x00000000 0x0000007F: 0xxxxxxx + * 0x00000080 0x000007FF: 110xxxxx 10xxxxxx + * 0x00000800 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx + * 0x00010000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx + * 0x00200000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + * 0x04000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + * + * Actual unicode characters are limited to the range 0x0 - 0x10FFFF, + * 17 planes of 65536 values. This limits the sequences actually seen + * even more, to just the following. 
+ *
+ *        0 -     0x7f: 0                     0x7f
+ *     0x80 -    0x7ff: 0xc2 0x80             0xdf 0xbf
+ *    0x800 -   0xffff: 0xe0 0xa0 0x80        0xef 0xbf 0xbf
+ *  0x10000 - 0x10ffff: 0xf0 0x90 0x80 0x80   0xf4 0x8f 0xbf 0xbf
+ *
+ * Even within those ranges not all values are allowed: the surrogates
+ * 0xd800 - 0xdfff should never be seen.
+ *
+ * Note that the longest sequence seen with valid usage is 4 bytes,
+ * the same as a single UTF-32 character.  This makes the UTF-8
+ * representation of Unicode never larger than UTF-32.
+ *
+ * The shortest sequence requirement was introduced by:
+ *    Corrigendum #1: UTF-8 Shortest Form
+ * It can be found here:
+ *    http://www.unicode.org/versions/corrigendum1.html
+ *
+ */
+
+#define UTF8_2_BITS	0xC0
+#define UTF8_3_BITS	0xE0
+#define UTF8_4_BITS	0xF0
+#define UTF8_N_BITS	0x80
+#define UTF8_2_MASK	0xE0
+#define UTF8_3_MASK	0xF0
+#define UTF8_4_MASK	0xF8
+#define UTF8_N_MASK	0xC0
+#define UTF8_V_MASK	0x3F
+#define UTF8_V_SHIFT	6
+
+static int
+utf8key(unsigned int key, char keyval[])
+{
+	int keylen;
+
+	if (key < 0x80) {
+		keyval[0] = key;
+		keylen = 1;
+	} else if (key < 0x800) {
+		keyval[1] = key & UTF8_V_MASK;
+		keyval[1] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[0] = key;
+		keyval[0] |= UTF8_2_BITS;
+		keylen = 2;
+	} else if (key < 0x10000) {
+		keyval[2] = key & UTF8_V_MASK;
+		keyval[2] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[1] = key & UTF8_V_MASK;
+		keyval[1] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[0] = key;
+		keyval[0] |= UTF8_3_BITS;
+		keylen = 3;
+	} else if (key < 0x110000) {
+		keyval[3] = key & UTF8_V_MASK;
+		keyval[3] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[2] = key & UTF8_V_MASK;
+		keyval[2] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[1] = key & UTF8_V_MASK;
+		keyval[1] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[0] = key;
+		keyval[0] |= UTF8_4_BITS;
+		keylen = 4;
+	} else {
+		printf("%#x: illegal key\n", key);
+		keylen = 0;
+	}
+	return keylen;
+}
+
+static unsigned int
+utf8code(const char *str)
+{
+	const
unsigned char *s = (const unsigned char*)str; + unsigned int unichar = 0; + + if (*s < 0x80) { + unichar = *s; + } else if (*s < UTF8_3_BITS) { + unichar = *s++ & 0x1F; + unichar <<= UTF8_V_SHIFT; + unichar |= *s & 0x3F; + } else if (*s < UTF8_4_BITS) { + unichar = *s++ & 0x0F; + unichar <<= UTF8_V_SHIFT; + unichar |= *s++ & 0x3F; + unichar <<= UTF8_V_SHIFT; + unichar |= *s & 0x3F; + } else { + unichar = *s++ & 0x0F; + unichar <<= UTF8_V_SHIFT; + unichar |= *s++ & 0x3F; + unichar <<= UTF8_V_SHIFT; + unichar |= *s++ & 0x3F; + unichar <<= UTF8_V_SHIFT; + unichar |= *s & 0x3F; + } + return unichar; +} + +static int +utf32valid(unsigned int unichar) +{ + return unichar < 0x110000; +} + +#define NODE 1 +#define LEAF 0 + +struct tree { + void *root; + int childnode; + const char *type; + unsigned int maxage; + struct tree *next; + int (*leaf_equal)(void *, void *); + void (*leaf_print)(void *, int); + int (*leaf_mark)(void *); + int (*leaf_size)(void *); + int *(*leaf_index)(struct tree *, void *); + unsigned char *(*leaf_emit)(void *, unsigned char *); + int leafindex[0x110000]; + int index; +}; + +struct node { + int index; + int offset; + int mark; + int size; + struct node *parent; + void *left; + void *right; + unsigned char bitnum; + unsigned char nextbyte; + unsigned char leftnode; + unsigned char rightnode; + unsigned int keybits; + unsigned int keymask; +}; + +/* + * Example lookup function for a tree. 
+ */ +static void * +lookup(struct tree *tree, const char *key) +{ + struct node *node; + void *leaf = NULL; + + node = tree->root; + while (!leaf && node) { + if (node->nextbyte) + key++; + if (*key & (1 << (node->bitnum & 7))) { + /* Right leg */ + if (node->rightnode == NODE) { + node = node->right; + } else if (node->rightnode == LEAF) { + leaf = node->right; + } else { + node = NULL; + } + } else { + /* Left leg */ + if (node->leftnode == NODE) { + node = node->left; + } else if (node->leftnode == LEAF) { + leaf = node->left; + } else { + node = NULL; + } + } + } + + return leaf; +} + +/* + * A simple non-recursive tree walker: keep track of visits to the + * left and right branches in the leftmask and rightmask. + */ +static void +tree_walk(struct tree *tree) +{ + struct node *node; + unsigned int leftmask; + unsigned int rightmask; + unsigned int bitmask; + int indent = 1; + int nodes, singletons, leaves; + + nodes = singletons = leaves = 0; + + printf("%s_%x root %p\n", tree->type, tree->maxage, tree->root); + if (tree->childnode == LEAF) { + assert(tree->root); + tree->leaf_print(tree->root, indent); + leaves = 1; + } else { + assert(tree->childnode == NODE); + node = tree->root; + leftmask = rightmask = 0; + while (node) { + printf("%*snode @ %p bitnum %d nextbyte %d" + " left %p right %p mask %x bits %x\n", + indent, "", node, + node->bitnum, node->nextbyte, + node->left, node->right, + node->keymask, node->keybits); + nodes += 1; + if (!(node->left && node->right)) + singletons += 1; + + while (node) { + bitmask = 1 << node->bitnum; + if ((leftmask & bitmask) == 0) { + leftmask |= bitmask; + if (node->leftnode == LEAF) { + assert(node->left); + tree->leaf_print(node->left, + indent+1); + leaves += 1; + } else if (node->left) { + assert(node->leftnode == NODE); + indent += 1; + node = node->left; + break; + } + } + if ((rightmask & bitmask) == 0) { + rightmask |= bitmask; + if (node->rightnode == LEAF) { + assert(node->right); + 
tree->leaf_print(node->right,
+							indent+1);
+						leaves += 1;
+					} else if (node->right) {
+						assert(node->rightnode==NODE);
+						indent += 1;
+						node = node->right;
+						break;
+					}
+				}
+				leftmask &= ~bitmask;
+				rightmask &= ~bitmask;
+				node = node->parent;
+				indent -= 1;
+			}
+		}
+	}
+	printf("nodes %d leaves %d singletons %d\n",
+	       nodes, leaves, singletons);
+}
+
+/*
+ * Allocate and initialize a new internal node.
+ */
+static struct node *
+alloc_node(struct node *parent)
+{
+	struct node *node;
+	int bitnum;
+
+	node = malloc(sizeof(*node));
+	node->left = node->right = NULL;
+	node->parent = parent;
+	node->leftnode = NODE;
+	node->rightnode = NODE;
+	node->keybits = 0;
+	node->keymask = 0;
+	node->mark = 0;
+	node->index = 0;
+	node->offset = -1;
+	node->size = 4;
+
+	if (node->parent) {
+		bitnum = parent->bitnum;
+		if ((bitnum & 7) == 0) {
+			node->bitnum = bitnum + 7 + 8;
+			node->nextbyte = 1;
+		} else {
+			node->bitnum = bitnum - 1;
+			node->nextbyte = 0;
+		}
+	} else {
+		node->bitnum = 7;
+		node->nextbyte = 0;
+	}
+
+	return node;
+}
+
+/*
+ * Insert a new leaf into the tree, and collapse any subtrees that are
+ * fully populated and end in identical leaves.  A nextbyte tagged
+ * internal node will not be removed to preserve the tree's integrity.
+ * Note that due to the structure of utf8, no nextbyte tagged node
+ * will be a candidate for removal.
+ */
+static int
+insert(struct tree *tree, char *key, int keylen, void *leaf)
+{
+	struct node *node;
+	struct node *parent;
+	void **cursor;
+	int keybits;
+
+	assert(keylen >= 1 && keylen <= 4);
+
+	node = NULL;
+	cursor = &tree->root;
+	keybits = 8 * keylen;
+
+	/* Insert, creating path along the way. */
+	while (keybits) {
+		if (!*cursor)
+			*cursor = alloc_node(node);
+		node = *cursor;
+		if (node->nextbyte)
+			key++;
+		if (*key & (1 << (node->bitnum & 7)))
+			cursor = &node->right;
+		else
+			cursor = &node->left;
+		keybits--;
+	}
+	*cursor = leaf;
+
+	/* Merge subtrees if possible.
*/ + while (node) { + if (*key & (1 << (node->bitnum & 7))) + node->rightnode = LEAF; + else + node->leftnode = LEAF; + if (node->nextbyte) + break; + if (node->leftnode == NODE || node->rightnode == NODE) + break; + assert(node->left); + assert(node->right); + /* Compare */ + if (! tree->leaf_equal(node->left, node->right)) + break; + /* Keep left, drop right leaf. */ + leaf = node->left; + /* Check in parent */ + parent = node->parent; + if (!parent) { + /* root of tree! */ + tree->root = leaf; + tree->childnode = LEAF; + } else if (parent->left == node) { + parent->left = leaf; + parent->leftnode = LEAF; + if (parent->right) { + parent->keymask = 0; + parent->keybits = 0; + } else { + parent->keymask |= (1 << node->bitnum); + } + } else if (parent->right == node) { + parent->right = leaf; + parent->rightnode = LEAF; + if (parent->left) { + parent->keymask = 0; + parent->keybits = 0; + } else { + parent->keymask |= (1 << node->bitnum); + parent->keybits |= (1 << node->bitnum); + } + } else { + /* internal tree error */ + assert(0); + } + free(node); + node = parent; + } + + /* Propagate keymasks up along singleton chains. */ + while (node) { + parent = node->parent; + if (!parent) + break; + /* Nix the mask for parents with two children. */ + if (node->keymask == 0) { + parent->keymask = 0; + parent->keybits = 0; + } else if (parent->left && parent->right) { + parent->keymask = 0; + parent->keybits = 0; + } else { + assert((parent->keymask & node->keymask) == 0); + parent->keymask |= node->keymask; + parent->keymask |= (1 << parent->bitnum); + parent->keybits |= node->keybits; + if (parent->right) + parent->keybits |= (1 << parent->bitnum); + } + node = parent; + } + + return 0; +} + +/* + * Prune internal nodes. + * + * Fully populated subtrees that end at the same leaf have already + * been collapsed. 
There are still internal nodes that have for both
+ * their left and right branches a sequence of singletons that make
+ * identical choices and end in identical leaves.  The keymask and
+ * keybits collected in the nodes describe the choices made in these
+ * singleton chains.  When they are identical for the left and right
+ * branch of a node, and the two leaves compare identical, the node in
+ * question can be removed.
+ *
+ * Note that nodes with the nextbyte tag set will not be removed by
+ * this to ensure tree integrity.  Note as well that the structure of
+ * utf8 ensures that these nodes would not have been candidates for
+ * removal in any case.
+ */
+static void
+prune(struct tree *tree)
+{
+	struct node *node;
+	struct node *left;
+	struct node *right;
+	struct node *parent;
+	void *leftleaf;
+	void *rightleaf;
+	unsigned int leftmask;
+	unsigned int rightmask;
+	unsigned int bitmask;
+	int count;
+
+	if (verbose > 0)
+		printf("Pruning %s_%x\n", tree->type, tree->maxage);
+
+	count = 0;
+	if (tree->childnode == LEAF)
+		return;
+	if (!tree->root)
+		return;
+
+	leftmask = rightmask = 0;
+	node = tree->root;
+	while (node) {
+		if (node->nextbyte)
+			goto advance;
+		if (node->leftnode == LEAF)
+			goto advance;
+		if (node->rightnode == LEAF)
+			goto advance;
+		if (!node->left)
+			goto advance;
+		if (!node->right)
+			goto advance;
+		left = node->left;
+		right = node->right;
+		if (left->keymask == 0)
+			goto advance;
+		if (right->keymask == 0)
+			goto advance;
+		if (left->keymask != right->keymask)
+			goto advance;
+		if (left->keybits != right->keybits)
+			goto advance;
+		leftleaf = NULL;
+		while (!leftleaf) {
+			assert(left->left || left->right);
+			if (left->leftnode == LEAF)
+				leftleaf = left->left;
+			else if (left->rightnode == LEAF)
+				leftleaf = left->right;
+			else if (left->left)
+				left = left->left;
+			else if (left->right)
+				left = left->right;
+			else
+				assert(0);
+		}
+		rightleaf = NULL;
+		while (!rightleaf) {
+			assert(right->left || right->right);
+			if
(right->leftnode == LEAF) + rightleaf = right->left; + else if (right->rightnode == LEAF) + rightleaf = right->right; + else if (right->left) + right = right->left; + else if (right->right) + right = right->right; + else + assert(0); + } + if (! tree->leaf_equal(leftleaf, rightleaf)) + goto advance; + /* + * This node has identical singleton-only subtrees. + * Remove it. + */ + parent = node->parent; + left = node->left; + right = node->right; + if (parent->left == node) + parent->left = left; + else if (parent->right == node) + parent->right = left; + else + assert(0); + left->parent = parent; + left->keymask |= (1 << node->bitnum); + node->left = NULL; + while (node) { + bitmask = 1 << node->bitnum; + leftmask &= ~bitmask; + rightmask &= ~bitmask; + if (node->leftnode == NODE && node->left) { + left = node->left; + free(node); + count++; + node = left; + } else if (node->rightnode == NODE && node->right) { + right = node->right; + free(node); + count++; + node = right; + } else { + node = NULL; + } + } + /* Propagate keymasks up along singleton chains. 
*/ + node = parent; + /* Force re-check */ + bitmask = 1 << node->bitnum; + leftmask &= ~bitmask; + rightmask &= ~bitmask; + for (;;) { + if (node->left && node->right) + break; + if (node->left) { + left = node->left; + node->keymask |= left->keymask; + node->keybits |= left->keybits; + } + if (node->right) { + right = node->right; + node->keymask |= right->keymask; + node->keybits |= right->keybits; + } + node->keymask |= (1 << node->bitnum); + node = node->parent; + /* Force re-check */ + bitmask = 1 << node->bitnum; + leftmask &= ~bitmask; + rightmask &= ~bitmask; + } + advance: + bitmask = 1 << node->bitnum; + if ((leftmask & bitmask) == 0 && + node->leftnode == NODE && + node->left) { + leftmask |= bitmask; + node = node->left; + } else if ((rightmask & bitmask) == 0 && + node->rightnode == NODE && + node->right) { + rightmask |= bitmask; + node = node->right; + } else { + leftmask &= ~bitmask; + rightmask &= ~bitmask; + node = node->parent; + } + } + if (verbose > 0) + printf("Pruned %d nodes\n", count); +} + +/* + * Mark the nodes in the tree that lead to leaves that must be + * emitted. 
+ */ +static void +mark_nodes(struct tree *tree) +{ + struct node *node; + struct node *n; + unsigned int leftmask; + unsigned int rightmask; + unsigned int bitmask; + int marked; + + marked = 0; + if (verbose > 0) + printf("Marking %s_%x\n", tree->type, tree->maxage); + if (tree->childnode == LEAF) + goto done; + + assert(tree->childnode == NODE); + node = tree->root; + leftmask = rightmask = 0; + while (node) { + bitmask = 1 << node->bitnum; + if ((leftmask & bitmask) == 0) { + leftmask |= bitmask; + if (node->leftnode == LEAF) { + assert(node->left); + if (tree->leaf_mark(node->left)) { + n = node; + while (n && !n->mark) { + marked++; + n->mark = 1; + n = n->parent; + } + } + } else if (node->left) { + assert(node->leftnode == NODE); + node = node->left; + continue; + } + } + if ((rightmask & bitmask) == 0) { + rightmask |= bitmask; + if (node->rightnode == LEAF) { + assert(node->right); + if (tree->leaf_mark(node->right)) { + n = node; + while (n && !n->mark) { + marked++; + n->mark = 1; + n = n->parent; + } + } + } else if (node->right) { + assert(node->rightnode==NODE); + node = node->right; + continue; + } + } + leftmask &= ~bitmask; + rightmask &= ~bitmask; + node = node->parent; + } + + /* second pass: left siblings and singletons */ + + assert(tree->childnode == NODE); + node = tree->root; + leftmask = rightmask = 0; + while (node) { + bitmask = 1 << node->bitnum; + if ((leftmask & bitmask) == 0) { + leftmask |= bitmask; + if (node->leftnode == LEAF) { + assert(node->left); + if (tree->leaf_mark(node->left)) { + n = node; + while (n && !n->mark) { + marked++; + n->mark = 1; + n = n->parent; + } + } + } else if (node->left) { + assert(node->leftnode == NODE); + node = node->left; + if (!node->mark && node->parent->mark) { + marked++; + node->mark = 1; + } + continue; + } + } + if ((rightmask & bitmask) == 0) { + rightmask |= bitmask; + if (node->rightnode == LEAF) { + assert(node->right); + if (tree->leaf_mark(node->right)) { + n = node; + while (n && 
!n->mark) {
+						marked++;
+						n->mark = 1;
+						n = n->parent;
+					}
+				}
+			} else if (node->right) {
+				assert(node->rightnode==NODE);
+				node = node->right;
+				if (!node->mark && node->parent->mark &&
+				    !node->parent->left) {
+					marked++;
+					node->mark = 1;
+				}
+				continue;
+			}
+		}
+		leftmask &= ~bitmask;
+		rightmask &= ~bitmask;
+		node = node->parent;
+	}
+done:
+	if (verbose > 0)
+		printf("Marked %d nodes\n", marked);
+}
+
+/*
+ * Compute the index of each node and leaf, which is the offset in the
+ * emitted trie.  These values must be pre-computed because relative
+ * offsets between nodes are used to navigate the tree.
+ */
+static int
+index_nodes(struct tree *tree, int index)
+{
+	struct node *node;
+	unsigned int leftmask;
+	unsigned int rightmask;
+	unsigned int bitmask;
+	int count;
+	int indent;
+
+	/* Align to a cache line (or half a cache line?). */
+	while (index % 64)
+		index++;
+	tree->index = index;
+	indent = 1;
+	count = 0;
+
+	if (verbose > 0)
+		printf("Indexing %s_%x: %d", tree->type, tree->maxage, index);
+	if (tree->childnode == LEAF) {
+		index += tree->leaf_size(tree->root);
+		goto done;
+	}
+
+	assert(tree->childnode == NODE);
+	node = tree->root;
+	leftmask = rightmask = 0;
+	while (node) {
+		if (!node->mark)
+			goto skip;
+		count++;
+		if (node->index != index)
+			node->index = index;
+		index += node->size;
+skip:
+		while (node) {
+			bitmask = 1 << node->bitnum;
+			if (node->mark && (leftmask & bitmask) == 0) {
+				leftmask |= bitmask;
+				if (node->leftnode == LEAF) {
+					assert(node->left);
+					*tree->leaf_index(tree, node->left) =
+									index;
+					index += tree->leaf_size(node->left);
+					count++;
+				} else if (node->left) {
+					assert(node->leftnode == NODE);
+					indent += 1;
+					node = node->left;
+					break;
+				}
+			}
+			if (node->mark && (rightmask & bitmask) == 0) {
+				rightmask |= bitmask;
+				if (node->rightnode == LEAF) {
+					assert(node->right);
+					*tree->leaf_index(tree, node->right) = index;
+					index += tree->leaf_size(node->right);
+					count++;
+				} else if (node->right) {
+
assert(node->rightnode==NODE); + indent += 1; + node = node->right; + break; + } + } + leftmask &= ~bitmask; + rightmask &= ~bitmask; + node = node->parent; + indent -= 1; + } + } +done: + /* Round up to a multiple of 16 */ + while (index % 16) + index++; + if (verbose > 0) + printf("Final index %d\n", index); + return index; +} + +/* + * Compute the size of nodes and leaves. We start by assuming that + * each node needs to store a three-byte offset. The indexes of the + * nodes are calculated based on that, and then this function is + * called to see if the sizes of some nodes can be reduced. This is + * repeated until no more changes are seen. + */ +static int +size_nodes(struct tree *tree) +{ + struct tree *next; + struct node *node; + struct node *right; + struct node *n; + unsigned int leftmask; + unsigned int rightmask; + unsigned int bitmask; + unsigned int pathbits; + unsigned int pathmask; + int changed; + int offset; + int size; + int indent; + + indent = 1; + changed = 0; + size = 0; + + if (verbose > 0) + printf("Sizing %s_%x", tree->type, tree->maxage); + if (tree->childnode == LEAF) + goto done; + + assert(tree->childnode == NODE); + pathbits = 0; + pathmask = 0; + node = tree->root; + leftmask = rightmask = 0; + while (node) { + if (!node->mark) + goto skip; + offset = 0; + if (!node->left || !node->right) { + size = 1; + } else { + if (node->rightnode == NODE) { + right = node->right; + next = tree->next; + while (!right->mark) { + assert(next); + n = next->root; + while (n->bitnum != node->bitnum) { + if (pathbits & (1<<n->bitnum)) + n = n->right; + else + n = n->left; + } + n = n->right; + assert(right->bitnum == n->bitnum); + right = n; + next = next->next; + } + offset = right->index - node->index; + } else { + offset = *tree->leaf_index(tree, node->right); + offset -= node->index; + } + assert(offset >= 0); + assert(offset <= 0xffffff); + if (offset <= 0xff) { + size = 2; + } else if (offset <= 0xffff) { + size = 3; + } else { /* offset <= 
0xffffff */ + size = 4; + } + } + if (node->size != size || node->offset != offset) { + node->size = size; + node->offset = offset; + changed++; + } +skip: + while (node) { + bitmask = 1 << node->bitnum; + pathmask |= bitmask; + if (node->mark && (leftmask & bitmask) == 0) { + leftmask |= bitmask; + if (node->leftnode == LEAF) { + assert(node->left); + } else if (node->left) { + assert(node->leftnode == NODE); + indent += 1; + node = node->left; + break; + } + } + if (node->mark && (rightmask & bitmask) == 0) { + rightmask |= bitmask; + pathbits |= bitmask; + if (node->rightnode == LEAF) { + assert(node->right); + } else if (node->right) { + assert(node->rightnode==NODE); + indent += 1; + node = node->right; + break; + } + } + leftmask &= ~bitmask; + rightmask &= ~bitmask; + pathmask &= ~bitmask; + pathbits &= ~bitmask; + node = node->parent; + indent -= 1; + } + } +done: + if (verbose > 0) + printf("Found %d changes\n", changed); + return changed; +} + +/* + * Emit a trie for the given tree into the data array. 
+ */ +static void +emit(struct tree *tree, unsigned char *data) +{ + struct node *node; + unsigned int leftmask; + unsigned int rightmask; + unsigned int bitmask; + int offlen; + int offset; + int index; + int indent; + unsigned char byte; + + index = tree->index; + data += index; + indent = 1; + if (verbose > 0) + printf("Emitting %s_%x\n", tree->type, tree->maxage); + if (tree->childnode == LEAF) { + assert(tree->root); + tree->leaf_emit(tree->root, data); + return; + } + + assert(tree->childnode == NODE); + node = tree->root; + leftmask = rightmask = 0; + while (node) { + if (!node->mark) + goto skip; + assert(node->offset != -1); + assert(node->index == index); + + byte = 0; + if (node->nextbyte) + byte |= NEXTBYTE; + byte |= (node->bitnum & BITNUM); + if (node->left && node->right) { + if (node->leftnode == NODE) + byte |= LEFTNODE; + if (node->rightnode == NODE) + byte |= RIGHTNODE; + if (node->offset <= 0xff) + offlen = 1; + else if (node->offset <= 0xffff) + offlen = 2; + else + offlen = 3; + offset = node->offset; + byte |= offlen << OFFLEN_SHIFT; + *data++ = byte; + index++; + while (offlen--) { + *data++ = offset & 0xff; + index++; + offset >>= 8; + } + } else if (node->left) { + if (node->leftnode == NODE) + byte |= TRIENODE; + *data++ = byte; + index++; + } else if (node->right) { + byte |= RIGHTNODE; + if (node->rightnode == NODE) + byte |= TRIENODE; + *data++ = byte; + index++; + } else { + assert(0); + } +skip: + while (node) { + bitmask = 1 << node->bitnum; + if (node->mark && (leftmask & bitmask) == 0) { + leftmask |= bitmask; + if (node->leftnode == LEAF) { + assert(node->left); + data = tree->leaf_emit(node->left, + data); + index += tree->leaf_size(node->left); + } else if (node->left) { + assert(node->leftnode == NODE); + indent += 1; + node = node->left; + break; + } + } + if (node->mark && (rightmask & bitmask) == 0) { + rightmask |= bitmask; + if (node->rightnode == LEAF) { + assert(node->right); + data = tree->leaf_emit(node->right, + 
data); + index += tree->leaf_size(node->right); + } else if (node->right) { + assert(node->rightnode==NODE); + indent += 1; + node = node->right; + break; + } + } + leftmask &= ~bitmask; + rightmask &= ~bitmask; + node = node->parent; + indent -= 1; + } + } +} + +/* ------------------------------------------------------------------ */ + +/* + * Unicode data. + * + * We need to keep track of the Canonical Combining Class, the Age, + * and decompositions for a code point. + * + * For the Age, we store the index into the ages table. Effectively + * this is a generation number that the table maps to a unicode + * version. + * + * The correction field is used to indicate that this entry is in the + * corrections array, which contains decompositions that were + * corrected in later revisions. The value of the correction field is + * the Unicode version in which the mapping was corrected. + */ +struct unicode_data { + unsigned int code; + int ccc; + int gen; + int correction; + unsigned int *utf32nfkdi; + unsigned int *utf32nfkdicf; + char *utf8nfkdi; + char *utf8nfkdicf; +}; + +struct unicode_data unicode_data[0x110000]; +struct unicode_data *corrections; +int corrections_count; + +struct tree *nfkdi_tree; +struct tree *nfkdicf_tree; + +struct tree *trees; +int trees_count; + +/* + * Check the corrections array to see if this entry was corrected at + * some point. 
+ */ +static struct unicode_data * +corrections_lookup(struct unicode_data *u) +{ + int i; + + for (i = 0; i != corrections_count; i++) + if (u->code == corrections[i].code) + return &corrections[i]; + return u; +} + +static int +nfkdi_equal(void *l, void *r) +{ + struct unicode_data *left = l; + struct unicode_data *right = r; + + if (left->gen != right->gen) + return 0; + if (left->ccc != right->ccc) + return 0; + if (left->utf8nfkdi && right->utf8nfkdi && + strcmp(left->utf8nfkdi, right->utf8nfkdi) == 0) + return 1; + if (left->utf8nfkdi || right->utf8nfkdi) + return 0; + return 1; +} + +static int +nfkdicf_equal(void *l, void *r) +{ + struct unicode_data *left = l; + struct unicode_data *right = r; + + if (left->gen != right->gen) + return 0; + if (left->ccc != right->ccc) + return 0; + if (left->utf8nfkdicf && right->utf8nfkdicf && + strcmp(left->utf8nfkdicf, right->utf8nfkdicf) == 0) + return 1; + if (left->utf8nfkdicf && right->utf8nfkdicf) + return 0; + if (left->utf8nfkdicf || right->utf8nfkdicf) + return 0; + if (left->utf8nfkdi && right->utf8nfkdi && + strcmp(left->utf8nfkdi, right->utf8nfkdi) == 0) + return 1; + if (left->utf8nfkdi || right->utf8nfkdi) + return 0; + return 1; +} + +static void +nfkdi_print(void *l, int indent) +{ + struct unicode_data *leaf = l; + + printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf, + leaf->code, leaf->ccc, leaf->gen); + if (leaf->utf8nfkdi) + printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi); + printf("\n"); +} + +static void +nfkdicf_print(void *l, int indent) +{ + struct unicode_data *leaf = l; + + printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf, + leaf->code, leaf->ccc, leaf->gen); + if (leaf->utf8nfkdicf) + printf(" nfkdicf \"%s\"", (const char*)leaf->utf8nfkdicf); + else if (leaf->utf8nfkdi) + printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi); + printf("\n"); +} + +static int +nfkdi_mark(void *l) +{ + return 1; +} + +static int +nfkdicf_mark(void *l) +{ + struct unicode_data *leaf 
= l; + + if (leaf->utf8nfkdicf) + return 1; + return 0; +} + +static int +correction_mark(void *l) +{ + struct unicode_data *leaf = l; + + return leaf->correction; +} + +static int +nfkdi_size(void *l) +{ + struct unicode_data *leaf = l; + + int size = 2; + if (leaf->utf8nfkdi) + size += strlen(leaf->utf8nfkdi) + 1; + return size; +} + +static int +nfkdicf_size(void *l) +{ + struct unicode_data *leaf = l; + + int size = 2; + if (leaf->utf8nfkdicf) + size += strlen(leaf->utf8nfkdicf) + 1; + else if (leaf->utf8nfkdi) + size += strlen(leaf->utf8nfkdi) + 1; + return size; +} + +static int * +nfkdi_index(struct tree *tree, void *l) +{ + struct unicode_data *leaf = l; + + return &tree->leafindex[leaf->code]; +} + +static int * +nfkdicf_index(struct tree *tree, void *l) +{ + struct unicode_data *leaf = l; + + return &tree->leafindex[leaf->code]; +} + +static unsigned char * +nfkdi_emit(void *l, unsigned char *data) +{ + struct unicode_data *leaf = l; + unsigned char *s; + + *data++ = leaf->gen; + if (leaf->utf8nfkdi) { + *data++ = DECOMPOSE; + s = (unsigned char*)leaf->utf8nfkdi; + while ((*data++ = *s++) != 0) + ; + } else { + *data++ = leaf->ccc; + } + return data; +} + +static unsigned char * +nfkdicf_emit(void *l, unsigned char *data) +{ + struct unicode_data *leaf = l; + unsigned char *s; + + *data++ = leaf->gen; + if (leaf->utf8nfkdicf) { + *data++ = DECOMPOSE; + s = (unsigned char*)leaf->utf8nfkdicf; + while ((*data++ = *s++) != 0) + ; + } else if (leaf->utf8nfkdi) { + *data++ = DECOMPOSE; + s = (unsigned char*)leaf->utf8nfkdi; + while ((*data++ = *s++) != 0) + ; + } else { + *data++ = leaf->ccc; + } + return data; +} + +static void +utf8_create(struct unicode_data *data) +{ + char utf[18*4+1]; + char *u; + unsigned int *um; + int i; + + u = utf; + um = data->utf32nfkdi; + if (um) { + for (i = 0; um[i]; i++) + u += utf8key(um[i], u); + *u = '\0'; + data->utf8nfkdi = strdup((char*)utf); + } + u = utf; + um = data->utf32nfkdicf; + if (um) { + for (i = 0; um[i]; i++) 
+ u += utf8key(um[i], u); + *u = '\0'; + if (!data->utf8nfkdi || strcmp(data->utf8nfkdi, (char*)utf)) + data->utf8nfkdicf = strdup((char*)utf); + } +} + +static void +utf8_init(void) +{ + unsigned int unichar; + int i; + + for (unichar = 0; unichar != 0x110000; unichar++) + utf8_create(&unicode_data[unichar]); + + for (i = 0; i != corrections_count; i++) + utf8_create(&corrections[i]); +} + +static void +trees_init(void) +{ + struct unicode_data *data; + unsigned int maxage; + unsigned int nextage; + int count; + int i; + int j; + + /* Count the number of different ages. */ + count = 0; + nextage = (unsigned int)-1; + do { + maxage = nextage; + nextage = 0; + for (i = 0; i != corrections_count; i++) { + data = &corrections[i]; + if (nextage < data->correction && + data->correction < maxage) + nextage = data->correction; + } + count++; + } while (nextage); + + /* Two trees per age: nfkdi and nfkdicf */ + trees_count = count * 2; + trees = calloc(trees_count, sizeof(struct tree)); + + /* Assign ages to the trees. */ + count = trees_count; + nextage = (unsigned int)-1; + do { + maxage = nextage; + trees[--count].maxage = maxage; + trees[--count].maxage = maxage; + nextage = 0; + for (i = 0; i != corrections_count; i++) { + data = &corrections[i]; + if (nextage < data->correction && + data->correction < maxage) + nextage = data->correction; + } + } while (nextage); + + /* The ages assigned above are off by one. */ + for (i = 0; i != trees_count; i++) { + j = 0; + while (ages[j] < trees[i].maxage) + j++; + trees[i].maxage = ages[j-1]; + } + + /* Set up the forwarding between trees. */ + trees[trees_count-2].next = &trees[trees_count-1]; + trees[trees_count-1].leaf_mark = nfkdi_mark; + trees[trees_count-2].leaf_mark = nfkdicf_mark; + for (i = 0; i != trees_count-2; i += 2) { + trees[i].next = &trees[trees_count-2]; + trees[i].leaf_mark = correction_mark; + trees[i+1].next = &trees[trees_count-1]; + trees[i+1].leaf_mark = correction_mark; + } + + /* Assign the callouts. 
*/ + for (i = 0; i != trees_count; i += 2) { + trees[i].type = "nfkdicf"; + trees[i].leaf_equal = nfkdicf_equal; + trees[i].leaf_print = nfkdicf_print; + trees[i].leaf_size = nfkdicf_size; + trees[i].leaf_index = nfkdicf_index; + trees[i].leaf_emit = nfkdicf_emit; + + trees[i+1].type = "nfkdi"; + trees[i+1].leaf_equal = nfkdi_equal; + trees[i+1].leaf_print = nfkdi_print; + trees[i+1].leaf_size = nfkdi_size; + trees[i+1].leaf_index = nfkdi_index; + trees[i+1].leaf_emit = nfkdi_emit; + } + + /* Finish init. */ + for (i = 0; i != trees_count; i++) + trees[i].childnode = NODE; +} + +static void +trees_populate(void) +{ + struct unicode_data *data; + unsigned int unichar; + char keyval[4]; + int keylen; + int i; + + for (i = 0; i != trees_count; i++) { + if (verbose > 0) { + printf("Populating %s_%x\n", + trees[i].type, trees[i].maxage); + } + for (unichar = 0; unichar != 0x110000; unichar++) { + if (unicode_data[unichar].gen < 0) + continue; + keylen = utf8key(unichar, keyval); + data = corrections_lookup(&unicode_data[unichar]); + if (data->correction <= trees[i].maxage) + data = &unicode_data[unichar]; + insert(&trees[i], keyval, keylen, data); + } + } +} + +static void +trees_reduce(void) +{ + int i; + int size; + int changed; + + for (i = 0; i != trees_count; i++) + prune(&trees[i]); + for (i = 0; i != trees_count; i++) + mark_nodes(&trees[i]); + do { + size = 0; + for (i = 0; i != trees_count; i++) + size = index_nodes(&trees[i], size); + changed = 0; + for (i = 0; i != trees_count; i++) + changed += size_nodes(&trees[i]); + } while (changed); + + utf8data = calloc(size, 1); + utf8data_size = size; + for (i = 0; i != trees_count; i++) + emit(&trees[i], utf8data); + + if (verbose > 0) { + for (i = 0; i != trees_count; i++) { + printf("%s_%x idx %d\n", + trees[i].type, trees[i].maxage, trees[i].index); + } + } + + nfkdi = utf8data + trees[trees_count-1].index; + nfkdicf = utf8data + trees[trees_count-2].index; + + nfkdi_tree = &trees[trees_count-1]; + nfkdicf_tree = 
&trees[trees_count-2]; +} + +static void +verify(struct tree *tree) +{ + struct unicode_data *data; + utf8leaf_t *leaf; + unsigned int unichar; + char key[4]; + int report; + int nocf; + + if (verbose > 0) + printf("Verifying %s_%x\n", tree->type, tree->maxage); + nocf = strcmp(tree->type, "nfkdicf"); + + for (unichar = 0; unichar != 0x110000; unichar++) { + report = 0; + data = corrections_lookup(&unicode_data[unichar]); + if (data->correction <= tree->maxage) + data = &unicode_data[unichar]; + utf8key(unichar, key); + leaf = utf8lookup(tree, key); + if (!leaf) { + if (data->gen != -1) + report++; + if (unichar < 0xd800 || unichar > 0xdfff) + report++; + } else { + if (unichar >= 0xd800 && unichar <= 0xdfff) + report++; + if (data->gen == -1) + report++; + if (data->gen != LEAF_GEN(leaf)) + report++; + if (LEAF_CCC(leaf) == DECOMPOSE) { + if (nocf) { + if (!data->utf8nfkdi) { + report++; + } else if (strcmp(data->utf8nfkdi, + LEAF_STR(leaf))) { + report++; + } + } else { + if (!data->utf8nfkdicf && + !data->utf8nfkdi) { + report++; + } else if (data->utf8nfkdicf) { + if (strcmp(data->utf8nfkdicf, + LEAF_STR(leaf))) + report++; + } else if (strcmp(data->utf8nfkdi, + LEAF_STR(leaf))) { + report++; + } + } + } else if (data->ccc != LEAF_CCC(leaf)) { + report++; + } + } + if (report) { + printf("%X code %X gen %d ccc %d" + " nfkdi -> \"%s\"", + unichar, data->code, data->gen, + data->ccc, + data->utf8nfkdi); + if (leaf) { + printf(" age %d ccc %d" + " nfkdi -> \"%s\"\n", + LEAF_GEN(leaf), + LEAF_CCC(leaf), + LEAF_CCC(leaf) == DECOMPOSE ? 
+ LEAF_STR(leaf) : ""); + } + printf("\n"); + } + } +} + +static void +trees_verify(void) +{ + int i; + + for (i = 0; i != trees_count; i++) + verify(&trees[i]); +} + +/* ------------------------------------------------------------------ */ + +static void +help(void) +{ + printf("Usage: %s [options]\n", argv0); + printf("\n"); + printf("This program creates a data trie used for parsing and\n"); + printf("normalization of UTF-8 strings. The trie is derived from\n"); + printf("a set of input files from the Unicode character database\n"); + printf("found at: http://www.unicode.org/Public/UCD/latest/ucd/\n"); + printf("\n"); + printf("The generated tree supports two normalization forms:\n"); + printf("\n"); + printf("\tnfkdi:\n"); + printf("\t- Apply unicode normalization form NFKD.\n"); + printf("\t- Remove any Default_Ignorable_Code_Point.\n"); + printf("\n"); + printf("\tnfkdicf:\n"); + printf("\t- Apply unicode normalization form NFKD.\n"); + printf("\t- Remove any Default_Ignorable_Code_Point.\n"); + printf("\t- Apply a full casefold (C + F).\n"); + printf("\n"); + printf("These forms were chosen as being most useful when dealing\n"); + printf("with file names: NFKD catches most cases where characters\n"); + printf("should be considered equivalent. 
The ignorables are mostly\n"); + printf("invisible, making names hard to type.\n"); + printf("\n"); + printf("The options to specify the files to be used are listed\n"); + printf("below with their default values, which are the names used\n"); + printf("by version 7.0.0 of the Unicode Character Database.\n"); + printf("\n"); + printf("The input files:\n"); + printf("\t-a %s\n", AGE_NAME); + printf("\t-c %s\n", CCC_NAME); + printf("\t-p %s\n", PROP_NAME); + printf("\t-d %s\n", DATA_NAME); + printf("\t-f %s\n", FOLD_NAME); + printf("\t-n %s\n", NORM_NAME); + printf("\n"); + printf("Additionally, the generated tables are tested using:\n"); + printf("\t-t %s\n", TEST_NAME); + printf("\n"); + printf("Finally, the output file:\n"); + printf("\t-o %s\n", UTF8_NAME); + printf("\n"); +} + +static void +usage(void) +{ + help(); + exit(1); +} + +static void +open_fail(const char *name, int error) +{ + printf("Error %d opening %s: %s\n", error, name, strerror(error)); + exit(1); +} + +static void +file_fail(const char *filename) +{ + printf("Error parsing %s\n", filename); + exit(1); +} + +static void +line_fail(const char *filename, const char *line) +{ + printf("Error parsing %s:%s\n", filename, line); + exit(1); +} + +/* ------------------------------------------------------------------ */ + +static void +print_utf32(unsigned int *utf32str) +{ + int i; + + for (i = 0; utf32str[i]; i++) + printf(" %X", utf32str[i]); +} + +static void +print_utf32nfkdi(unsigned int unichar) +{ + printf(" %X ->", unichar); + print_utf32(unicode_data[unichar].utf32nfkdi); + printf("\n"); +} + +static void +print_utf32nfkdicf(unsigned int unichar) +{ + printf(" %X ->", unichar); + print_utf32(unicode_data[unichar].utf32nfkdicf); + printf("\n"); +} + +/* ------------------------------------------------------------------ */ + +static void +age_init(void) +{ + FILE *file; + unsigned int first; + unsigned int last; + unsigned int unichar; + unsigned int major; + unsigned int minor; + unsigned int 
revision; + int gen; + int count; + int ret; + + if (verbose > 0) + printf("Parsing %s\n", age_name); + + file = fopen(age_name, "r"); + if (!file) + open_fail(age_name, errno); + count = 0; + + gen = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "# Age=V%d_%d_%d", + &major, &minor, &revision); + if (ret == 3) { + ages_count++; + if (verbose > 1) + printf(" Age V%d_%d_%d\n", + major, minor, revision); + if (!age_valid(major, minor, revision)) + line_fail(age_name, line); + continue; + } + ret = sscanf(line, "# Age=V%d_%d", &major, &minor); + if (ret == 2) { + ages_count++; + if (verbose > 1) + printf(" Age V%d_%d\n", major, minor); + if (!age_valid(major, minor, 0)) + line_fail(age_name, line); + continue; + } + } + + /* We must have found something above. */ + if (verbose > 1) + printf("%d age entries\n", ages_count); + if (ages_count == 0 || ages_count > MAXGEN) + file_fail(age_name); + + /* There is a 0 entry. */ + ages_count++; + ages = calloc(ages_count + 1, sizeof(*ages)); + /* And a guard entry. 
*/ + ages[ages_count] = (unsigned int)-1; + + rewind(file); + count = 0; + gen = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "# Age=V%d_%d_%d", + &major, &minor, &revision); + if (ret == 3) { + ages[++gen] = + UNICODE_AGE(major, minor, revision); + if (verbose > 1) + printf(" Age V%d_%d_%d = gen %d\n", + major, minor, revision, gen); + if (!age_valid(major, minor, revision)) + line_fail(age_name, line); + continue; + } + ret = sscanf(line, "# Age=V%d_%d", &major, &minor); + if (ret == 2) { + ages[++gen] = UNICODE_AGE(major, minor, 0); + if (verbose > 1) + printf(" Age V%d_%d = %d\n", + major, minor, gen); + if (!age_valid(major, minor, 0)) + line_fail(age_name, line); + continue; + } + ret = sscanf(line, "%X..%X ; %d.%d #", + &first, &last, &major, &minor); + if (ret == 4) { + for (unichar = first; unichar <= last; unichar++) + unicode_data[unichar].gen = gen; + count += 1 + last - first; + if (verbose > 1) + printf(" %X..%X gen %d\n", first, last, gen); + if (!utf32valid(first) || !utf32valid(last)) + line_fail(age_name, line); + continue; + } + ret = sscanf(line, "%X ; %d.%d #", &unichar, &major, &minor); + if (ret == 3) { + unicode_data[unichar].gen = gen; + count++; + if (verbose > 1) + printf(" %X gen %d\n", unichar, gen); + if (!utf32valid(unichar)) + line_fail(age_name, line); + continue; + } + } + unicode_maxage = ages[gen]; + fclose(file); + + /* Nix surrogate block */ + if (verbose > 1) + printf(" Removing surrogate block D800..DFFF\n"); + for (unichar = 0xd800; unichar <= 0xdfff; unichar++) + unicode_data[unichar].gen = -1; + + if (verbose > 0) + printf("Found %d entries\n", count); + if (count == 0) + file_fail(age_name); +} + +static void +ccc_init(void) +{ + FILE *file; + unsigned int first; + unsigned int last; + unsigned int unichar; + unsigned int value; + int count; + int ret; + + if (verbose > 0) + printf("Parsing %s\n", ccc_name); + + file = fopen(ccc_name, "r"); + if (!file) + open_fail(ccc_name, errno); + + count = 0; + 
while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "%X..%X ; %d #", &first, &last, &value); + if (ret == 3) { + for (unichar = first; unichar <= last; unichar++) { + unicode_data[unichar].ccc = value; + count++; + } + if (verbose > 1) + printf(" %X..%X ccc %d\n", first, last, value); + if (!utf32valid(first) || !utf32valid(last)) + line_fail(ccc_name, line); + continue; + } + ret = sscanf(line, "%X ; %d #", &unichar, &value); + if (ret == 2) { + unicode_data[unichar].ccc = value; + count++; + if (verbose > 1) + printf(" %X ccc %d\n", unichar, value); + if (!utf32valid(unichar)) + line_fail(ccc_name, line); + continue; + } + } + fclose(file); + + if (verbose > 0) + printf("Found %d entries\n", count); + if (count == 0) + file_fail(ccc_name); +} + +static void +nfkdi_init(void) +{ + FILE *file; + unsigned int unichar; + unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */ + char *s; + unsigned int *um; + int count; + int i; + int ret; + + if (verbose > 0) + printf("Parsing %s\n", data_name); + file = fopen(data_name, "r"); + if (!file) + open_fail(data_name, errno); + + count = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "%X;%*[^;];%*[^;];%*[^;];%*[^;];%[^;];", + &unichar, buf0); + if (ret != 2) + continue; + if (!utf32valid(unichar)) + line_fail(data_name, line); + + s = buf0; + /* skip over <tag> */ + if (*s == '<') + while (*s++ != ' ') + ; + /* decode the decomposition into UTF-32 */ + i = 0; + while (*s) { + mapping[i] = strtoul(s, &s, 16); + if (!utf32valid(mapping[i])) + line_fail(data_name, line); + i++; + } + mapping[i++] = 0; + + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + unicode_data[unichar].utf32nfkdi = um; + + if (verbose > 1) + print_utf32nfkdi(unichar); + count++; + } + fclose(file); + if (verbose > 0) + printf("Found %d entries\n", count); + if (count == 0) + file_fail(data_name); +} + +static void +nfkdicf_init(void) +{ + FILE *file; + unsigned int unichar; 
+ unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */ + char status; + char *s; + unsigned int *um; + int i; + int count; + int ret; + + if (verbose > 0) + printf("Parsing %s\n", fold_name); + file = fopen(fold_name, "r"); + if (!file) + open_fail(fold_name, errno); + + count = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "%X; %c; %[^;];", &unichar, &status, buf0); + if (ret != 3) + continue; + if (!utf32valid(unichar)) + line_fail(fold_name, line); + /* Use the C+F casefold. */ + if (status != 'C' && status != 'F') + continue; + s = buf0; + if (*s == '<') + while (*s++ != ' ') + ; + i = 0; + while (*s) { + mapping[i] = strtoul(s, &s, 16); + if (!utf32valid(mapping[i])) + line_fail(fold_name, line); + i++; + } + mapping[i++] = 0; + + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + unicode_data[unichar].utf32nfkdicf = um; + + if (verbose > 1) + print_utf32nfkdicf(unichar); + count++; + } + fclose(file); + if (verbose > 0) + printf("Found %d entries\n", count); + if (count == 0) + file_fail(fold_name); +} + +static void +ignore_init(void) +{ + FILE *file; + unsigned int unichar; + unsigned int first; + unsigned int last; + unsigned int *um; + int count; + int ret; + + if (verbose > 0) + printf("Parsing %s\n", prop_name); + file = fopen(prop_name, "r"); + if (!file) + open_fail(prop_name, errno); + assert(file); + count = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "%X..%X ; %s # ", &first, &last, buf0); + if (ret == 3) { + if (strcmp(buf0, "Default_Ignorable_Code_Point")) + continue; + if (!utf32valid(first) || !utf32valid(last)) + line_fail(prop_name, line); + for (unichar = first; unichar <= last; unichar++) { + free(unicode_data[unichar].utf32nfkdi); + um = malloc(sizeof(unsigned int)); + *um = 0; + unicode_data[unichar].utf32nfkdi = um; + free(unicode_data[unichar].utf32nfkdicf); + um = malloc(sizeof(unsigned int)); + *um = 0; + unicode_data[unichar].utf32nfkdicf = 
um; + count++; + } + if (verbose > 1) + printf(" %X..%X Default_Ignorable_Code_Point\n", + first, last); + continue; + } + ret = sscanf(line, "%X ; %s # ", &unichar, buf0); + if (ret == 2) { + if (strcmp(buf0, "Default_Ignorable_Code_Point")) + continue; + if (!utf32valid(unichar)) + line_fail(prop_name, line); + free(unicode_data[unichar].utf32nfkdi); + um = malloc(sizeof(unsigned int)); + *um = 0; + unicode_data[unichar].utf32nfkdi = um; + free(unicode_data[unichar].utf32nfkdicf); + um = malloc(sizeof(unsigned int)); + *um = 0; + unicode_data[unichar].utf32nfkdicf = um; + if (verbose > 1) + printf(" %X Default_Ignorable_Code_Point\n", + unichar); + count++; + continue; + } + } + fclose(file); + + if (verbose > 0) + printf("Found %d entries\n", count); + if (count == 0) + file_fail(prop_name); +} + +static void +corrections_init(void) +{ + FILE *file; + unsigned int unichar; + unsigned int major; + unsigned int minor; + unsigned int revision; + unsigned int age; + unsigned int *um; + unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. 
*/ + char *s; + int i; + int count; + int ret; + + if (verbose > 0) + printf("Parsing %s\n", norm_name); + file = fopen(norm_name, "r"); + if (!file) + open_fail(norm_name, errno); + + count = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "%X;%[^;];%[^;];%d.%d.%d #", + &unichar, buf0, buf1, + &major, &minor, &revision); + if (ret != 6) + continue; + if (!utf32valid(unichar) || !age_valid(major, minor, revision)) + line_fail(norm_name, line); + count++; + } + corrections = calloc(count, sizeof(struct unicode_data)); + corrections_count = count; + rewind(file); + + count = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "%X;%[^;];%[^;];%d.%d.%d #", + &unichar, buf0, buf1, + &major, &minor, &revision); + if (ret != 6) + continue; + if (!utf32valid(unichar) || !age_valid(major, minor, revision)) + line_fail(norm_name, line); + corrections[count] = unicode_data[unichar]; + assert(corrections[count].code == unichar); + age = UNICODE_AGE(major, minor, revision); + corrections[count].correction = age; + + i = 0; + s = buf0; + while (*s) { + mapping[i] = strtoul(s, &s, 16); + if (!utf32valid(mapping[i])) + line_fail(norm_name, line); + i++; + } + mapping[i++] = 0; + + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + corrections[count].utf32nfkdi = um; + + if (verbose > 1) + printf(" %X -> %s -> %s V%d_%d_%d\n", + unichar, buf0, buf1, major, minor, revision); + count++; + } + fclose(file); + + if (verbose > 0) + printf("Found %d entries\n", count); + if (count == 0) + file_fail(norm_name); +} + +/* ------------------------------------------------------------------ */ + +/* + * Hangul decomposition (algorithm from Section 3.12 of Unicode 6.3.0) + * + * AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;; + * D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;; + * + * SBase = 0xAC00 + * LBase = 0x1100 + * VBase = 0x1161 + * TBase = 0x11A7 + * LCount = 19 + * VCount = 21 + * TCount = 28 + * NCount = 588 (VCount * 
TCount) + * SCount = 11172 (LCount * NCount) + * + * Decomposition: + * SIndex = s - SBase + * + * LV (Canonical/Full) + * LIndex = SIndex / NCount + * VIndex = (SIndex % NCount) / TCount + * LPart = LBase + LIndex + * VPart = VBase + VIndex + * + * LVT (Canonical) + * LVIndex = (SIndex / TCount) * TCount + * TIndex = SIndex % TCount + * LVPart = SBase + LVIndex + * TPart = TBase + TIndex + * + * LVT (Full) + * LIndex = SIndex / NCount + * VIndex = (SIndex % NCount) / TCount + * TIndex = SIndex % TCount + * LPart = LBase + LIndex + * VPart = VBase + VIndex + * if (TIndex == 0) { + * d = <LPart, VPart> + * } else { + * TPart = TBase + TIndex + * d = <LPart, VPart, TPart> + * } + * + */ + +static void +hangul_decompose(void) +{ + unsigned int sb = 0xAC00; + unsigned int lb = 0x1100; + unsigned int vb = 0x1161; + unsigned int tb = 0x11a7; + /* unsigned int lc = 19; */ + unsigned int vc = 21; + unsigned int tc = 28; + unsigned int nc = (vc * tc); + /* unsigned int sc = (lc * nc); */ + unsigned int unichar; + unsigned int mapping[4]; + unsigned int *um; + int count; + int i; + + if (verbose > 0) + printf("Decomposing hangul\n"); + /* Hangul */ + count = 0; + for (unichar = 0xAC00; unichar <= 0xD7A3; unichar++) { + unsigned int si = unichar - sb; + unsigned int li = si / nc; + unsigned int vi = (si % nc) / tc; + unsigned int ti = si % tc; + + i = 0; + mapping[i++] = lb + li; + mapping[i++] = vb + vi; + if (ti) + mapping[i++] = tb + ti; + mapping[i++] = 0; + + assert(!unicode_data[unichar].utf32nfkdi); + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + unicode_data[unichar].utf32nfkdi = um; + + assert(!unicode_data[unichar].utf32nfkdicf); + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + unicode_data[unichar].utf32nfkdicf = um; + + if (verbose > 1) + print_utf32nfkdi(unichar); + + count++; + } + if (verbose > 0) + printf("Created %d entries\n", count); +} + +static void 
+nfkdi_decompose(void) +{ + unsigned int unichar; + unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */ + unsigned int *um; + unsigned int *dc; + int count; + int i; + int j; + int ret; + + if (verbose > 0) + printf("Decomposing nfkdi\n"); + + count = 0; + for (unichar = 0; unichar != 0x110000; unichar++) { + if (!unicode_data[unichar].utf32nfkdi) + continue; + for (;;) { + ret = 1; + i = 0; + um = unicode_data[unichar].utf32nfkdi; + while (*um) { + dc = unicode_data[*um].utf32nfkdi; + if (dc) { + for (j = 0; dc[j]; j++) + mapping[i++] = dc[j]; + ret = 0; + } else { + mapping[i++] = *um; + } + um++; + } + mapping[i++] = 0; + if (ret) + break; + free(unicode_data[unichar].utf32nfkdi); + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + unicode_data[unichar].utf32nfkdi = um; + } + /* Add this decomposition to nfkdicf if there is no entry. */ + if (!unicode_data[unichar].utf32nfkdicf) { + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + unicode_data[unichar].utf32nfkdicf = um; + } + if (verbose > 1) + print_utf32nfkdi(unichar); + count++; + } + if (verbose > 0) + printf("Processed %d entries\n", count); +} + +static void +nfkdicf_decompose(void) +{ + unsigned int unichar; + unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. 
*/ + unsigned int *um; + unsigned int *dc; + int count; + int i; + int j; + int ret; + + if (verbose > 0) + printf("Decomposing nfkdicf\n"); + count = 0; + for (unichar = 0; unichar != 0x110000; unichar++) { + if (!unicode_data[unichar].utf32nfkdicf) + continue; + for (;;) { + ret = 1; + i = 0; + um = unicode_data[unichar].utf32nfkdicf; + while (*um) { + dc = unicode_data[*um].utf32nfkdicf; + if (dc) { + for (j = 0; dc[j]; j++) + mapping[i++] = dc[j]; + ret = 0; + } else { + mapping[i++] = *um; + } + um++; + } + mapping[i++] = 0; + if (ret) + break; + free(unicode_data[unichar].utf32nfkdicf); + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + unicode_data[unichar].utf32nfkdicf = um; + } + if (verbose > 1) + print_utf32nfkdicf(unichar); + count++; + } + if (verbose > 0) + printf("Processed %d entries\n", count); +} + +/* ------------------------------------------------------------------ */ + +int utf8agemax(struct tree *, const char *); +int utf8nagemax(struct tree *, const char *, size_t); +int utf8agemin(struct tree *, const char *); +int utf8nagemin(struct tree *, const char *, size_t); +ssize_t utf8len(struct tree *, const char *); +ssize_t utf8nlen(struct tree *, const char *, size_t); +struct utf8cursor; +int utf8cursor(struct utf8cursor *, struct tree *, const char *); +int utf8ncursor(struct utf8cursor *, struct tree *, const char *, size_t); +int utf8byte(struct utf8cursor *); + +/* + * Use trie to scan s, touching at most len bytes. + * Returns the leaf if one exists, NULL otherwise. + * + * A non-NULL return guarantees that the UTF-8 sequence starting at s + * is well-formed and corresponds to a known unicode code point. The + * shorthand for this will be "is valid UTF-8 unicode". 
+ */ +static utf8leaf_t * +utf8nlookup(struct tree *tree, const char *s, size_t len) +{ + utf8trie_t *trie; + int offlen; + int offset; + int mask; + int node; + + if (!tree) + return NULL; + if (len == 0) + return NULL; + trie = utf8data + tree->index; + node = 1; + while (node) { + offlen = (*trie & OFFLEN) >> OFFLEN_SHIFT; + if (*trie & NEXTBYTE) { + if (--len == 0) + return NULL; + s++; + } + mask = 1 << (*trie & BITNUM); + if (*s & mask) { + /* Right leg */ + if (offlen) { + /* Right node at offset of trie */ + node = (*trie & RIGHTNODE); + offset = trie[offlen]; + while (--offlen) { + offset <<= 8; + offset |= trie[offlen]; + } + trie += offset; + } else if (*trie & RIGHTPATH) { + /* Right node after this node */ + node = (*trie & TRIENODE); + trie++; + } else { + /* No right node. */ + node = 0; + trie = NULL; + } + } else { + /* Left leg */ + if (offlen) { + /* Left node after this node. */ + node = (*trie & LEFTNODE); + trie += offlen + 1; + } else if (*trie & RIGHTPATH) { + /* No left node. */ + node = 0; + trie = NULL; + } else { + /* Left node after this node */ + node = (*trie & TRIENODE); + trie++; + } + } + } + return trie; +} + +/* + * Use trie to scan s. + * Returns the leaf if one exists, NULL otherwise. + * + * Forwards to utf8nlookup(). + */ +static utf8leaf_t * +utf8lookup(struct tree *tree, const char *s) +{ + return utf8nlookup(tree, s, (size_t)-1); +} + +/* + * Return the number of bytes used by the current UTF-8 sequence. + * Assumes the input points to the first byte of a valid UTF-8 + * sequence. + */ +static inline int +utf8clen(const char *s) +{ + unsigned char c = *s; + return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0); +} + +/* + * Maximum age of any character in s. + * Return -1 if s is not valid UTF-8 unicode. + * Return 0 if only non-assigned code points are used. 
+ */ +int +utf8agemax(struct tree *tree, const char *s) +{ + utf8leaf_t *leaf; + int age = 0; + int leaf_age; + + if (!tree) + return -1; + while (*s) { + if (!(leaf = utf8lookup(tree, s))) + return -1; + leaf_age = ages[LEAF_GEN(leaf)]; + if (leaf_age <= tree->maxage && leaf_age > age) + age = leaf_age; + s += utf8clen(s); + } + return age; +} + +/* + * Minimum age of any character in s. + * Return -1 if s is not valid UTF-8 unicode. + * Return 0 if non-assigned code points are used. + */ +int +utf8agemin(struct tree *tree, const char *s) +{ + utf8leaf_t *leaf; + int age; + int leaf_age; + + if (!tree) + return -1; + age = tree->maxage; + while (*s) { + if (!(leaf = utf8lookup(tree, s))) + return -1; + leaf_age = ages[LEAF_GEN(leaf)]; + if (leaf_age <= tree->maxage && leaf_age < age) + age = leaf_age; + s += utf8clen(s); + } + return age; +} + +/* + * Maximum age of any character in s, touch at most len bytes. + * Return -1 if s is not valid UTF-8 unicode. + */ +int +utf8nagemax(struct tree *tree, const char *s, size_t len) +{ + utf8leaf_t *leaf; + int age = 0; + int leaf_age; + + if (!tree) + return -1; + while (len && *s) { + if (!(leaf = utf8nlookup(tree, s, len))) + return -1; + leaf_age = ages[LEAF_GEN(leaf)]; + if (leaf_age <= tree->maxage && leaf_age > age) + age = leaf_age; + len -= utf8clen(s); + s += utf8clen(s); + } + return age; +} + +/* + * Minimum age of any character in s, touch at most len bytes. + * Return -1 if s is not valid UTF-8 unicode. + */ +int +utf8nagemin(struct tree *tree, const char *s, size_t len) +{ + utf8leaf_t *leaf; + int leaf_age; + int age; + + if (!tree) + return -1; + age = tree->maxage; + while (len && *s) { + if (!(leaf = utf8nlookup(tree, s, len))) + return -1; + leaf_age = ages[LEAF_GEN(leaf)]; + if (leaf_age <= tree->maxage && leaf_age < age) + age = leaf_age; + len -= utf8clen(s); + s += utf8clen(s); + } + return age; +} + +/* + * Length of the normalization of s. + * Return -1 if s is not valid UTF-8 unicode. 
+ * + * A string of Default_Ignorable_Code_Point has length 0. + */ +ssize_t +utf8len(struct tree *tree, const char *s) +{ + utf8leaf_t *leaf; + size_t ret = 0; + + if (!tree) + return -1; + while (*s) { + if (!(leaf = utf8lookup(tree, s))) + return -1; + if (ages[LEAF_GEN(leaf)] > tree->maxage) + ret += utf8clen(s); + else if (LEAF_CCC(leaf) == DECOMPOSE) + ret += strlen(LEAF_STR(leaf)); + else + ret += utf8clen(s); + s += utf8clen(s); + } + return ret; +} + +/* + * Length of the normalization of s, touch at most len bytes. + * Return -1 if s is not valid UTF-8 unicode. + */ +ssize_t +utf8nlen(struct tree *tree, const char *s, size_t len) +{ + utf8leaf_t *leaf; + size_t ret = 0; + + if (!tree) + return -1; + while (len && *s) { + if (!(leaf = utf8nlookup(tree, s, len))) + return -1; + if (ages[LEAF_GEN(leaf)] > tree->maxage) + ret += utf8clen(s); + else if (LEAF_CCC(leaf) == DECOMPOSE) + ret += strlen(LEAF_STR(leaf)); + else + ret += utf8clen(s); + len -= utf8clen(s); + s += utf8clen(s); + } + return ret; +} + +/* + * Cursor structure used by the normalizer. + */ +struct utf8cursor { + struct tree *tree; + const char *s; + const char *p; + const char *ss; + const char *sp; + unsigned int len; + unsigned int slen; + short int ccc; + short int nccc; + unsigned int unichar; +}; + +/* + * Set up an utf8cursor for use by utf8byte(). + * + * s : string. + * len : length of s. + * u8c : pointer to cursor. + * trie : utf8trie_t to use for normalization. + * + * Returns -1 on error, 0 on success. + */ +int +utf8ncursor( + struct utf8cursor *u8c, + struct tree *tree, + const char *s, + size_t len) +{ + if (!tree) + return -1; + if (!s) + return -1; + u8c->tree = tree; + u8c->s = s; + u8c->p = NULL; + u8c->ss = NULL; + u8c->sp = NULL; + u8c->len = len; + u8c->slen = 0; + u8c->ccc = STOPPER; + u8c->nccc = STOPPER; + u8c->unichar = 0; + /* Check we didn't clobber the maximum length. 
*/ + if (u8c->len != len) + return -1; + /* The first byte of s may not be an utf8 continuation. */ + if (len > 0 && (*s & 0xC0) == 0x80) + return -1; + return 0; +} + +/* + * Set up an utf8cursor for use by utf8byte(). + * + * s : NUL-terminated string. + * u8c : pointer to cursor. + * trie : utf8trie_t to use for normalization. + * + * Returns -1 on error, 0 on success. + */ +int +utf8cursor( + struct utf8cursor *u8c, + struct tree *tree, + const char *s) +{ + return utf8ncursor(u8c, tree, s, (unsigned int)-1); +} + +/* + * Get one byte from the normalized form of the string described by u8c. + * + * Returns the byte cast to an unsigned char on success, and -1 on failure. + * + * The cursor keeps track of the location in the string in u8c->s. + * When a character is decomposed, the current location is stored in + * u8c->p, and u8c->s is set to the start of the decomposition. Note + * that bytes from a decomposition do not count against u8c->len. + * + * Characters are emitted if they match the current CCC in u8c->ccc. + * Hitting end-of-string while u8c->ccc == STOPPER means we're done, + * and the function returns 0 in that case. + * + * Sorting by CCC is done by repeatedly scanning the string. The + * values of u8c->s and u8c->p are stored in u8c->ss and u8c->sp at + * the start of the scan. The first pass finds the lowest CCC to be + * emitted and stores it in u8c->nccc, the second pass emits the + * characters with this CCC and finds the next lowest CCC. This limits + * the number of passes to 1 + the number of different CCCs in the + * sequence being scanned. + * + * Therefore: + * u8c->p != NULL -> a decomposition is being scanned. + * u8c->ss != NULL -> this is a repeating scan. + * u8c->ccc == -1 -> this is the first scan of a repeating scan. + */ +int +utf8byte(struct utf8cursor *u8c) +{ + utf8leaf_t *leaf; + int ccc; + + for (;;) { + /* Check for the end of a decomposed character. 
*/ + if (u8c->p && *u8c->s == '\0') { + u8c->s = u8c->p; + u8c->p = NULL; + } + + /* Check for end-of-string. */ + if (!u8c->p && (u8c->len == 0 || *u8c->s == '\0')) { + /* There is no next byte. */ + if (u8c->ccc == STOPPER) + return 0; + /* End-of-string during a scan counts as a stopper. */ + ccc = STOPPER; + goto ccc_mismatch; + } else if ((*u8c->s & 0xC0) == 0x80) { + /* This is a continuation of the current character. */ + if (!u8c->p) + u8c->len--; + return (unsigned char)*u8c->s++; + } + + /* Look up the data for the current character. */ + if (u8c->p) + leaf = utf8lookup(u8c->tree, u8c->s); + else + leaf = utf8nlookup(u8c->tree, u8c->s, u8c->len); + + /* No leaf found implies that the input is a binary blob. */ + if (!leaf) + return -1; + + /* Characters that are too new have CCC 0. */ + if (ages[LEAF_GEN(leaf)] > u8c->tree->maxage) { + ccc = STOPPER; + } else if ((ccc = LEAF_CCC(leaf)) == DECOMPOSE) { + u8c->len -= utf8clen(u8c->s); + u8c->p = u8c->s + utf8clen(u8c->s); + u8c->s = LEAF_STR(leaf); + /* Empty decomposition implies CCC 0. */ + if (*u8c->s == '\0') { + if (u8c->ccc == STOPPER) + continue; + ccc = STOPPER; + goto ccc_mismatch; + } + leaf = utf8lookup(u8c->tree, u8c->s); + ccc = LEAF_CCC(leaf); + } + u8c->unichar = utf8code(u8c->s); + + /* + * If this is not a stopper, then see if it updates + * the next canonical class to be emitted. + */ + if (ccc != STOPPER && u8c->ccc < ccc && ccc < u8c->nccc) + u8c->nccc = ccc; + + /* + * Return the current byte if this is the current + * combining class. + */ + if (ccc == u8c->ccc) { + if (!u8c->p) + u8c->len--; + return (unsigned char)*u8c->s++; + } + + /* Current combining class mismatch. */ + ccc_mismatch: + if (u8c->nccc == STOPPER) { + /* + * Scan forward for the first canonical class + * to be emitted. Save the position from + * which to restart. 
+ */ + assert(u8c->ccc == STOPPER); + u8c->ccc = MINCCC - 1; + u8c->nccc = ccc; + u8c->sp = u8c->p; + u8c->ss = u8c->s; + u8c->slen = u8c->len; + if (!u8c->p) + u8c->len -= utf8clen(u8c->s); + u8c->s += utf8clen(u8c->s); + } else if (ccc != STOPPER) { + /* Not a stopper, and not the ccc we're emitting. */ + if (!u8c->p) + u8c->len -= utf8clen(u8c->s); + u8c->s += utf8clen(u8c->s); + } else if (u8c->nccc != MAXCCC + 1) { + /* At a stopper, restart for next ccc. */ + u8c->ccc = u8c->nccc; + u8c->nccc = MAXCCC + 1; + u8c->s = u8c->ss; + u8c->p = u8c->sp; + u8c->len = u8c->slen; + } else { + /* All done, proceed from here. */ + u8c->ccc = STOPPER; + u8c->nccc = STOPPER; + u8c->sp = NULL; + u8c->ss = NULL; + u8c->slen = 0; + } + } +} + +/* ------------------------------------------------------------------ */ + +static int +normalize_line(struct tree *tree) +{ + char *s; + char *t; + int c; + struct utf8cursor u8c; + + /* First test: null-terminated string. */ + s = buf2; + t = buf3; + if (utf8cursor(&u8c, tree, s)) + return -1; + while ((c = utf8byte(&u8c)) > 0) + if (c != (unsigned char)*t++) + return -1; + if (c < 0) + return -1; + if (*t != 0) + return -1; + + /* Second test: length-limited string. */ + s = buf2; + /* Replace NUL with a value that will cause an error if seen. */ + s[strlen(s) + 1] = -1; + t = buf3; + if (utf8cursor(&u8c, tree, s)) + return -1; + while ((c = utf8byte(&u8c)) > 0) + if (c != (unsigned char)*t++) + return -1; + if (c < 0) + return -1; + if (*t != 0) + return -1; + + return 0; +} + +static void +normalization_test(void) +{ + FILE *file; + unsigned int unichar; + struct unicode_data *data; + char *s; + char *t; + int ret; + int ignorables; + int tests = 0; + int failures = 0; + + if (verbose > 0) + printf("Parsing %s\n", test_name); + /* Step one, read data from file. 
*/ + file = fopen(test_name, "r"); + if (!file) + open_fail(test_name, errno); + + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "%[^;];%*[^;];%*[^;];%*[^;];%[^;];", + buf0, buf1); + if (ret != 2 || *line == '#') + continue; + s = buf0; + t = buf2; + while (*s) { + unichar = strtoul(s, &s, 16); + t += utf8key(unichar, t); + } + *t = '\0'; + + ignorables = 0; + s = buf1; + t = buf3; + while (*s) { + unichar = strtoul(s, &s, 16); + data = &unicode_data[unichar]; + if (data->utf8nfkdi && !*data->utf8nfkdi) + ignorables = 1; + else + t += utf8key(unichar, t); + } + *t = '\0'; + + tests++; + if (normalize_line(nfkdi_tree) < 0) { + printf("\nline %s -> %s", buf0, buf1); + if (ignorables) + printf(" (ignorables removed)"); + printf(" failure\n"); + failures++; + } + } + fclose(file); + if (verbose > 0) + printf("Ran %d tests with %d failures\n", tests, failures); + if (failures) + file_fail(test_name); +} + +/* ------------------------------------------------------------------ */ + +static void +write_file(void) +{ + FILE *file; + int i; + int j; + int t; + int gen; + + if (verbose > 0) + printf("Writing %s\n", utf8_name); + file = fopen(utf8_name, "w"); + if (!file) + open_fail(utf8_name, errno); + + fprintf(file, "/* This file is generated code, do not edit. */\n"); + fprintf(file, "#ifndef __INCLUDED_FROM_UTF8NORM_C__\n"); + fprintf(file, "#error Only xfs_utf8.c may include this file.\n"); + fprintf(file, "#endif\n"); + fprintf(file, "\n"); + fprintf(file, "static const unsigned int utf8vers = %#x;\n", + unicode_maxage); + fprintf(file, "\n"); + fprintf(file, "static const unsigned int utf8agetab[] = {\n"); + for (i = 0; i != ages_count; i++) + fprintf(file, "\t%#x%s\n", ages[i], + ages[i] == unicode_maxage ? 
"" : ","); + fprintf(file, "};\n"); + fprintf(file, "\n"); + fprintf(file, "static const struct utf8data utf8nfkdicfdata[] = {\n"); + t = 0; + for (gen = 0; gen < ages_count; gen++) { + fprintf(file, "\t{ %#x, %d }%s\n", + ages[gen], trees[t].index, + ages[gen] == unicode_maxage ? "" : ","); + if (trees[t].maxage == ages[gen]) + t += 2; + } + fprintf(file, "};\n"); + fprintf(file, "\n"); + fprintf(file, "static const struct utf8data utf8nfkdidata[] = {\n"); + t = 1; + for (gen = 0; gen < ages_count; gen++) { + fprintf(file, "\t{ %#x, %d }%s\n", + ages[gen], trees[t].index, + ages[gen] == unicode_maxage ? "" : ","); + if (trees[t].maxage == ages[gen]) + t += 2; + } + fprintf(file, "};\n"); + fprintf(file, "\n"); + fprintf(file, "static const unsigned char utf8data[%zd] = {\n", + utf8data_size); + t = 0; + for (i = 0; i != utf8data_size; i += 16) { + if (i == trees[t].index) { + fprintf(file, "\t/* %s_%x */\n", + trees[t].type, trees[t].maxage); + if (t < trees_count-1) + t++; + } + fprintf(file, "\t"); + for (j = i; j != i + 16; j++) + fprintf(file, "0x%.2x%s", utf8data[j], + (j < utf8data_size -1 ? 
"," : "")); + fprintf(file, "\n"); + } + fprintf(file, "};\n"); + fclose(file); +} + +/* ------------------------------------------------------------------ */ + +int +main(int argc, char *argv[]) +{ + unsigned int unichar; + int opt; + + argv0 = argv[0]; + + while ((opt = getopt(argc, argv, "a:c:d:f:hn:o:p:t:v")) != -1) { + switch (opt) { + case 'a': + age_name = optarg; + break; + case 'c': + ccc_name = optarg; + break; + case 'd': + data_name = optarg; + break; + case 'f': + fold_name = optarg; + break; + case 'n': + norm_name = optarg; + break; + case 'o': + utf8_name = optarg; + break; + case 'p': + prop_name = optarg; + break; + case 't': + test_name = optarg; + break; + case 'v': + verbose++; + break; + case 'h': + help(); + exit(0); + default: + usage(); + } + } + + if (verbose > 1) + help(); + for (unichar = 0; unichar != 0x110000; unichar++) + unicode_data[unichar].code = unichar; + age_init(); + ccc_init(); + nfkdi_init(); + nfkdicf_init(); + ignore_init(); + corrections_init(); + hangul_decompose(); + nfkdi_decompose(); + nfkdicf_decompose(); + utf8_init(); + trees_init(); + trees_populate(); + trees_reduce(); + trees_verify(); + /* Prevent "unused function" warning. */ + (void)lookup(nfkdi_tree, " "); + if (verbose > 2) + tree_walk(nfkdi_tree); + if (verbose > 2) + tree_walk(nfkdicf_tree); + normalization_test(); + write_file(); + + return 0; +} diff --git a/fs/xfs/utf8norm/utf8norm.c b/fs/xfs/utf8norm/utf8norm.c new file mode 100644 index 0000000..995c4df --- /dev/null +++ b/fs/xfs/utf8norm/utf8norm.c @@ -0,0 +1,649 @@ +/* + * Copyright (c) 2014 SGI. + * All rights reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write the Free Software Foundation, + * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include "utf8norm.h" + +struct utf8data { + unsigned int maxage; + unsigned int offset; +}; + +#define __INCLUDED_FROM_UTF8NORM_C__ +#include "utf8data.h" +#undef __INCLUDED_FROM_UTF8NORM_C__ + +const unsigned int utf8version(void) +{ + return utf8vers; +} +EXPORT_SYMBOL(utf8version); + +/* + * UTF-8 valid ranges. + * + * The UTF-8 encoding spreads the bits of a 32bit word over several + * bytes. This table gives the ranges that can be held and how they'd + * be represented. + * + * 0x00000000 0x0000007F: 0xxxxxxx + * 0x00000000 0x000007FF: 110xxxxx 10xxxxxx + * 0x00000000 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx + * 0x00000000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx + * 0x00000000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + * 0x00000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + * + * There is an additional requirement on UTF-8, in that only the + * shortest representation of a 32bit value is to be used. A decoder + * must not decode sequences that do not satisfy this requirement. + * Thus the allowed ranges have a lower bound. + * + * 0x00000000 0x0000007F: 0xxxxxxx + * 0x00000080 0x000007FF: 110xxxxx 10xxxxxx + * 0x00000800 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx + * 0x00010000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx + * 0x00200000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + * 0x04000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + * + * Actual unicode characters are limited to the range 0x0 - 0x10FFFF, + * 17 planes of 65536 values. This limits the sequences actually seen + * even more, to just the following. 
+ *
+ *          0 -     0x7F: 0                   - 0x7F
+ *       0x80 -    0x7FF: 0xC2 0x80           - 0xDF 0xBF
+ *      0x800 -   0xFFFF: 0xE0 0xA0 0x80      - 0xEF 0xBF 0xBF
+ *    0x10000 - 0x10FFFF: 0xF0 0x90 0x80 0x80 - 0xF4 0x8F 0xBF 0xBF
+ *
+ * Within those ranges the surrogates 0xD800 - 0xDFFF are not allowed.
+ *
+ * Note that the longest sequence seen with valid usage is 4 bytes,
+ * the same as a single UTF-32 character.  This makes the UTF-8
+ * representation of Unicode never larger than UTF-32.
+ *
+ * The shortest sequence requirement was introduced by:
+ *    Corrigendum #1: UTF-8 Shortest Form
+ * It can be found here:
+ *    http://www.unicode.org/versions/corrigendum1.html
+ *
+ */
+
+/*
+ * Return the number of bytes used by the current UTF-8 sequence.
+ * Assumes the input points to the first byte of a valid UTF-8
+ * sequence.
+ */
+static inline int
+utf8clen(const char *s)
+{
+	unsigned char c = *s;
+	return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0);
+}
+
+/*
+ * utf8trie_t
+ *
+ * A compact binary tree, used to decode UTF-8 characters.
+ *
+ * Internal nodes are one byte for the node itself, and up to three
+ * bytes for an offset into the tree.  The first byte contains the
+ * following information:
+ *  NEXTBYTE  - flag        - advance to next byte if set
+ *  BITNUM    - 3 bit field - the bit number to be tested
+ *  OFFLEN    - 2 bit field - number of bytes in the offset
+ * if offlen == 0 (non-branching node)
+ *  RIGHTPATH - 1 bit field - set if the following node is for the
+ *                            right-hand path (tested bit is set)
+ *  TRIENODE  - 1 bit field - set if the following node is an internal
+ *                            node, otherwise it is a leaf node
+ * if offlen != 0 (branching node)
+ *  LEFTNODE  - 1 bit field - set if the left-hand node is internal
+ *  RIGHTNODE - 1 bit field - set if the right-hand node is internal
+ *
+ * Due to the way utf8 works, there cannot be branching nodes with
+ * NEXTBYTE set, and moreover those nodes always have a righthand
+ * descendant.
+ */
+typedef const unsigned char utf8trie_t;
+#define BITNUM		0x07
+#define NEXTBYTE	0x08
+#define OFFLEN		0x30
+#define OFFLEN_SHIFT	4
+#define RIGHTPATH	0x40
+#define TRIENODE	0x80
+#define RIGHTNODE	0x40
+#define LEFTNODE	0x80
+
+/*
+ * utf8leaf_t
+ *
+ * The leaves of the trie are embedded in the trie, and so share the
+ * same underlying datatype: unsigned char.
+ *
+ * leaf[0]: The unicode version, stored as a generation number that is
+ *          an index into utf8agetab[].  With this we can filter code
+ *          points based on the unicode version in which they were
+ *          defined.  The CCC of a non-defined code point is 0.
+ * leaf[1]: Canonical Combining Class. During normalization, we need
+ *          to do a stable sort into ascending order of all characters
+ *          with a non-zero CCC that occur between two characters with
+ *          a CCC of 0, or at the beginning or end of a string.
+ *          The unicode standard guarantees that all CCC values are
+ *          between 0 and 254 inclusive, which leaves 255 available as
+ *          a special value.
+ *          Code points with CCC 0 are known as stoppers.
+ * leaf[2]: Decomposition. If leaf[1] == 255, then leaf[2] is the
+ *          start of a NUL-terminated string that is the decomposition
+ *          of the character.
+ *          The CCC of a decomposable character is the same as the CCC
+ *          of the first character of its decomposition.
+ *          Some characters decompose as the empty string: these are
+ *          characters with the Default_Ignorable_Code_Point property.
+ *          These do affect normalization, as they all have CCC 0.
+ *
+ * The decompositions in the trie have been fully expanded.
+ *
+ * Casefolding, if applicable, is also done using decompositions.
+ *
+ * The trie is constructed in such a way that leaves exist for all
+ * UTF-8 sequences that match the criteria from the "UTF-8 valid
+ * ranges" comment above, and only for those sequences.  Therefore a
+ * lookup in the trie can be used to validate the UTF-8 input.
+ */
+typedef const unsigned char utf8leaf_t;
+
+#define LEAF_GEN(LEAF)	((LEAF)[0])
+#define LEAF_CCC(LEAF)	((LEAF)[1])
+#define LEAF_STR(LEAF)	((const char*)((LEAF) + 2))
+
+#define MINCCC		(0)
+#define MAXCCC		(254)
+#define STOPPER		(0)
+#define	DECOMPOSE	(255)
+
+/*
+ * Use trie to scan s, touching at most len bytes.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * A non-NULL return guarantees that the UTF-8 sequence starting at s
+ * is well-formed and corresponds to a known unicode code point.  The
+ * shorthand for this will be "is valid UTF-8 unicode".
+ */
+static utf8leaf_t *
+utf8nlookup(utf8data_t data, const char *s, size_t len)
+{
+	utf8trie_t	*trie = data ? utf8data + data->offset : NULL;
+	int		offlen;
+	int		offset;
+	int		mask;
+	int		node;
+
+	if (!data)
+		return NULL;
+	if (len == 0)
+		return NULL;
+	node = 1;
+	while (node) {
+		offlen = (*trie & OFFLEN) >> OFFLEN_SHIFT;
+		if (*trie & NEXTBYTE) {
+			if (--len == 0)
+				return NULL;
+			s++;
+		}
+		mask = 1 << (*trie & BITNUM);
+		if (*s & mask) {
+			/* Right leg */
+			if (offlen) {
+				/* Right node at offset of trie */
+				node = (*trie & RIGHTNODE);
+				offset = trie[offlen];
+				while (--offlen) {
+					offset <<= 8;
+					offset |= trie[offlen];
+				}
+				trie += offset;
+			} else if (*trie & RIGHTPATH) {
+				/* Right node after this node */
+				node = (*trie & TRIENODE);
+				trie++;
+			} else {
+				/* No right node. */
+				node = 0;
+				trie = NULL;
+			}
+		} else {
+			/* Left leg */
+			if (offlen) {
+				/* Left node after this node. */
+				node = (*trie & LEFTNODE);
+				trie += offlen + 1;
+			} else if (*trie & RIGHTPATH) {
+				/* No left node. */
+				node = 0;
+				trie = NULL;
+			} else {
+				/* Left node after this node */
+				node = (*trie & TRIENODE);
+				trie++;
+			}
+		}
+	}
+	return trie;
+}
+
+/*
+ * Use trie to scan s.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * Forwards to utf8nlookup().
+ */
+static utf8leaf_t *
+utf8lookup(utf8data_t data, const char *s)
+{
+	return utf8nlookup(data, s, (size_t)-1);
+}
+
+/*
+ * Maximum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if only non-assigned code points are used.
+ */
+int
+utf8agemax(utf8data_t data, const char *s)
+{
+	utf8leaf_t	*leaf;
+	int		age = 0;
+	int		leaf_age;
+
+	if (!data)
+		return -1;
+	while (*s) {
+		if (!(leaf = utf8lookup(data, s)))
+			return -1;
+		leaf_age = utf8agetab[LEAF_GEN(leaf)];
+		if (leaf_age <= data->maxage && leaf_age > age)
+			age = leaf_age;
+		s += utf8clen(s);
+	}
+	return age;
+}
+EXPORT_SYMBOL(utf8agemax);
+
+/*
+ * Minimum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if non-assigned code points are used.
+ */
+int
+utf8agemin(utf8data_t data, const char *s)
+{
+	utf8leaf_t	*leaf;
+	int		age;
+	int		leaf_age;
+
+	if (!data)
+		return -1;
+	age = data->maxage;
+	while (*s) {
+		if (!(leaf = utf8lookup(data, s)))
+			return -1;
+		leaf_age = utf8agetab[LEAF_GEN(leaf)];
+		if (leaf_age <= data->maxage && leaf_age < age)
+			age = leaf_age;
+		s += utf8clen(s);
+	}
+	return age;
+}
+EXPORT_SYMBOL(utf8agemin);
+
+/*
+ * Maximum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemax(utf8data_t data, const char *s, size_t len)
+{
+	utf8leaf_t	*leaf;
+	int		age = 0;
+	int		leaf_age;
+
+	if (!data)
+		return -1;
+	while (len && *s) {
+		if (!(leaf = utf8nlookup(data, s, len)))
+			return -1;
+		leaf_age = utf8agetab[LEAF_GEN(leaf)];
+		if (leaf_age <= data->maxage && leaf_age > age)
+			age = leaf_age;
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return age;
+}
+EXPORT_SYMBOL(utf8nagemax);
+
+/*
+ * Minimum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */ +int +utf8nagemin(utf8data_t data, const char *s, size_t len) +{ + utf8leaf_t *leaf; + int leaf_age; + int age; + + if (!data) + return -1; + age = data->maxage; + while (len && *s) { + if (!(leaf = utf8nlookup(data, s, len))) + return -1; + leaf_age = utf8agetab[LEAF_GEN(leaf)]; + if (leaf_age <= data->maxage && leaf_age < age) + age = leaf_age; + len -= utf8clen(s); + s += utf8clen(s); + } + return age; +} +EXPORT_SYMBOL(utf8nagemin); + +/* + * Length of the normalization of s. + * Return -1 if s is not valid UTF-8 unicode. + * + * A string of Default_Ignorable_Code_Point has length 0. + */ +ssize_t +utf8len(utf8data_t data, const char *s) +{ + utf8leaf_t *leaf; + size_t ret = 0; + + if (!data) + return -1; + while (*s) { + if (!(leaf = utf8lookup(data, s))) + return -1; + if (utf8agetab[LEAF_GEN(leaf)] > data->maxage) + ret += utf8clen(s); + else if (LEAF_CCC(leaf) == DECOMPOSE) + ret += strlen(LEAF_STR(leaf)); + else + ret += utf8clen(s); + s += utf8clen(s); + } + return ret; +} +EXPORT_SYMBOL(utf8len); + +/* + * Length of the normalization of s, touch at most len bytes. + * Return -1 if s is not valid UTF-8 unicode. + */ +ssize_t +utf8nlen(utf8data_t data, const char *s, size_t len) +{ + utf8leaf_t *leaf; + size_t ret = 0; + + if (!data) + return -1; + while (len && *s) { + if (!(leaf = utf8nlookup(data, s, len))) + return -1; + if (utf8agetab[LEAF_GEN(leaf)] > data->maxage) + ret += utf8clen(s); + else if (LEAF_CCC(leaf) == DECOMPOSE) + ret += strlen(LEAF_STR(leaf)); + else + ret += utf8clen(s); + len -= utf8clen(s); + s += utf8clen(s); + } + return ret; +} +EXPORT_SYMBOL(utf8nlen); + +/* + * Set up an utf8cursor for use by utf8byte(). + * + * u8c : pointer to cursor. + * data : utf8data_t to use for normalization. + * s : string. + * len : length of s. + * + * Returns -1 on error, 0 on success. 
+ */ +int +utf8ncursor( + struct utf8cursor *u8c, + utf8data_t data, + const char *s, + size_t len) +{ + if (!data) + return -1; + if (!s) + return -1; + u8c->data = data; + u8c->s = s; + u8c->p = NULL; + u8c->ss = NULL; + u8c->sp = NULL; + u8c->len = len; + u8c->slen = 0; + u8c->ccc = STOPPER; + u8c->nccc = STOPPER; + /* Check we didn't clobber the maximum length. */ + if (u8c->len != len) + return -1; + /* The first byte of s may not be an utf8 continuation. */ + if (len > 0 && (*s & 0xC0) == 0x80) + return -1; + return 0; +} +EXPORT_SYMBOL(utf8ncursor); + +/* + * Set up an utf8cursor for use by utf8byte(). + * + * u8c : pointer to cursor. + * data : utf8data_t to use for normalization. + * s : NUL-terminated string. + * + * Returns -1 on error, 0 on success. + */ +int +utf8cursor( + struct utf8cursor *u8c, + utf8data_t data, + const char *s) +{ + return utf8ncursor(u8c, data, s, (unsigned int)-1); +} +EXPORT_SYMBOL(utf8cursor); + +/* + * Get one byte from the normalized form of the string described by u8c. + * + * Returns the byte cast to an unsigned char on succes, and -1 on failure. + * + * The cursor keeps track of the location in the string in u8c->s. + * When a character is decomposed, the current location is stored in + * u8c->p, and u8c->s is set to the start of the decomposition. Note + * that bytes from a decomposition do not count against u8c->len. + * + * Characters are emitted if they match the current CCC in u8c->ccc. + * Hitting end-of-string while u8c->ccc == STOPPER means we're done, + * and the function returns 0 in that case. + * + * Sorting by CCC is done by repeatedly scanning the string. The + * values of u8c->s and u8c->p are stored in u8c->ss and u8c->sp at + * the start of the scan. The first pass finds the lowest CCC to be + * emitted and stores it in u8c->nccc, the second pass emits the + * characters with this CCC and finds the next lowest CCC. 
This limits + * the number of passes to 1 + the number of different CCCs in the + * sequence being scanned. + * + * Therefore: + * u8c->p != NULL -> a decomposition is being scanned. + * u8c->ss != NULL -> this is a repeating scan. + * u8c->ccc == -1 -> this is the first scan of a repeating scan. + */ +int +utf8byte(struct utf8cursor *u8c) +{ + utf8leaf_t *leaf; + int ccc; + + for (;;) { + /* Check for the end of a decomposed character. */ + if (u8c->p && *u8c->s == '\0') { + u8c->s = u8c->p; + u8c->p = NULL; + } + + /* Check for end-of-string. */ + if (!u8c->p && (u8c->len == 0 || *u8c->s == '\0')) { + /* There is no next byte. */ + if (u8c->ccc == STOPPER) + return 0; + /* End-of-string during a scan counts as a stopper. */ + ccc = STOPPER; + goto ccc_mismatch; + } else if ((*u8c->s & 0xC0) == 0x80) { + /* This is a continuation of the current character. */ + if (!u8c->p) + u8c->len--; + return (unsigned char)*u8c->s++; + } + + /* Look up the data for the current character. */ + if (u8c->p) + leaf = utf8lookup(u8c->data, u8c->s); + else + leaf = utf8nlookup(u8c->data, u8c->s, u8c->len); + + /* No leaf found implies that the input is a binary blob. */ + if (!leaf) + return -1; + + /* Characters that are too new have CCC 0. */ + if (utf8agetab[LEAF_GEN(leaf)] > u8c->data->maxage) { + ccc = STOPPER; + } else if ((ccc = LEAF_CCC(leaf)) == DECOMPOSE) { + u8c->len -= utf8clen(u8c->s); + u8c->p = u8c->s + utf8clen(u8c->s); + u8c->s = LEAF_STR(leaf); + /* Empty decomposition implies CCC 0. */ + if (*u8c->s == '\0') { + if (u8c->ccc == STOPPER) + continue; + ccc = STOPPER; + goto ccc_mismatch; + } + leaf = utf8lookup(u8c->data, u8c->s); + ccc = LEAF_CCC(leaf); + } + + /* + * If this is not a stopper, then see if it updates + * the next canonical class to be emitted. + */ + if (ccc != STOPPER && u8c->ccc < ccc && ccc < u8c->nccc) + u8c->nccc = ccc; + + /* + * Return the current byte if this is the current + * combining class. 
+ */ + if (ccc == u8c->ccc) { + if (!u8c->p) + u8c->len--; + return (unsigned char)*u8c->s++; + } + + /* Current combining class mismatch. */ + ccc_mismatch: + if (u8c->nccc == STOPPER) { + /* + * Scan forward for the first canonical class + * to be emitted. Save the position from + * which to restart. + */ + u8c->ccc = MINCCC - 1; + u8c->nccc = ccc; + u8c->sp = u8c->p; + u8c->ss = u8c->s; + u8c->slen = u8c->len; + if (!u8c->p) + u8c->len -= utf8clen(u8c->s); + u8c->s += utf8clen(u8c->s); + } else if (ccc != STOPPER) { + /* Not a stopper, and not the ccc we're emitting. */ + if (!u8c->p) + u8c->len -= utf8clen(u8c->s); + u8c->s += utf8clen(u8c->s); + } else if (u8c->nccc != MAXCCC + 1) { + /* At a stopper, restart for next ccc. */ + u8c->ccc = u8c->nccc; + u8c->nccc = MAXCCC + 1; + u8c->s = u8c->ss; + u8c->p = u8c->sp; + u8c->len = u8c->slen; + } else { + /* All done, proceed from here. */ + u8c->ccc = STOPPER; + u8c->nccc = STOPPER; + u8c->sp = NULL; + u8c->ss = NULL; + u8c->slen = 0; + } + } +} +EXPORT_SYMBOL(utf8byte); + +const struct utf8data * +utf8nfkdi(unsigned int maxage) +{ + int i = sizeof(utf8nfkdidata)/sizeof(utf8nfkdidata[0]) - 1; + + while (maxage < utf8nfkdidata[i].maxage) + i--; + if (maxage > utf8nfkdidata[i].maxage) + return NULL; + return &utf8nfkdidata[i]; +} +EXPORT_SYMBOL(utf8nfkdi); + +const struct utf8data * +utf8nfkdicf(unsigned int maxage) +{ + int i = sizeof(utf8nfkdicfdata)/sizeof(utf8nfkdicfdata[0]) - 1; + + while (maxage < utf8nfkdicfdata[i].maxage) + i--; + if (maxage > utf8nfkdicfdata[i].maxage) + return NULL; + return &utf8nfkdicfdata[i]; +} +EXPORT_SYMBOL(utf8nfkdicf); + +MODULE_AUTHOR("SGI"); +MODULE_DESCRIPTION("utf8 normalization"); +MODULE_LICENSE("GPL"); diff --git a/fs/xfs/utf8norm/utf8norm.h b/fs/xfs/utf8norm/utf8norm.h new file mode 100644 index 0000000..44a9e53 --- /dev/null +++ b/fs/xfs/utf8norm/utf8norm.h @@ -0,0 +1,116 @@ +/* + * Copyright (c) 2014 SGI. + * All rights reserved. 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifndef UTF8NORM_H
+#define UTF8NORM_H
+
+#include <linux/types.h>
+#include <linux/export.h>
+#include <linux/string.h>
+#include <linux/module.h>
+
+/* An opaque type used to determine the normalization in use. */
+typedef const struct utf8data *utf8data_t;
+
+/* Encoding a unicode version number as a single unsigned int. */
+#define UNICODE_MAJ_SHIFT		(16)
+#define UNICODE_MIN_SHIFT		(8)
+
+#define UNICODE_AGE(MAJ,MIN,REV)			\
+	(((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) |	\
+	 ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) |	\
+	 ((unsigned int)(REV)))
+
+/* Highest unicode version supported by the data tables. */
+extern const unsigned int utf8version(void);
+
+/*
+ * Look for the correct utf8data_t for a unicode version.
+ * Returns NULL if the version requested is too new.
+ *
+ * Two normalization forms are supported: nfkdi and nfkdicf.
+ *
+ * nfkdi:
+ *  - Apply unicode normalization form NFKD.
+ *  - Remove any Default_Ignorable_Code_Point.
+ *
+ * nfkdicf:
+ *  - Apply unicode normalization form NFKD.
+ *  - Remove any Default_Ignorable_Code_Point.
+ *  - Apply a full casefold (C + F).
+ */
+extern utf8data_t utf8nfkdi(unsigned int);
+extern utf8data_t utf8nfkdicf(unsigned int);
+
+/*
+ * Determine the maximum age of any unicode character in the string.
+ * Returns 0 if only unassigned code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern int utf8agemax(utf8data_t, const char *);
+extern int utf8nagemax(utf8data_t, const char *, size_t);
+
+/*
+ * Determine the minimum age of any unicode character in the string.
+ * Returns 0 if any unassigned code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern int utf8agemin(utf8data_t, const char *);
+extern int utf8nagemin(utf8data_t, const char *, size_t);
+
+/*
+ * Determine the length of the normalized form of the string,
+ * excluding any terminating NUL byte.
+ * Returns 0 if only ignorable code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern ssize_t utf8len(utf8data_t, const char *);
+extern ssize_t utf8nlen(utf8data_t, const char *, size_t);
+
+/*
+ * Cursor structure used by the normalizer.
+ */
+struct utf8cursor {
+	utf8data_t	data;
+	const char	*s;
+	const char	*p;
+	const char	*ss;
+	const char	*sp;
+	unsigned int	len;
+	unsigned int	slen;
+	short int	ccc;
+	short int	nccc;
+};
+
+/*
+ * Initialize a utf8cursor to normalize a string.
+ * Returns 0 on success.
+ * Returns -1 on failure.
+ */
+extern int utf8cursor(struct utf8cursor *, utf8data_t, const char *);
+extern int utf8ncursor(struct utf8cursor *, utf8data_t, const char *, size_t);
+
+/*
+ * Get the next byte in the normalization.
+ * Returns a value > 0 && < 256 on success.
+ * Returns 0 when the end of the normalization is reached.
+ * Returns -1 if the string being normalized is not valid UTF-8.
+ */
+extern int utf8byte(struct utf8cursor *);
+
+#endif /* UTF8NORM_H */
--
1.7.12.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related [flat|nested] 84+ messages in thread
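As a standalone illustration of two small pieces of the patch above, the sketch below copies utf8clen() and the UNICODE_AGE() macro into userspace C. The helper count_code_points() is hypothetical and not part of the patch; it only shows how a caller can hop from lead byte to lead byte, and like utf8clen() it assumes the input is already valid UTF-8.

```c
#include <stddef.h>

/*
 * Copied from the patch: the sequence length is derived from the lead
 * byte alone, so s must point at the first byte of a valid sequence.
 */
static int utf8clen(const char *s)
{
	unsigned char c = *s;
	return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0);
}

/* Copied from utf8norm.h: pack major/minor/revision into one integer. */
#define UNICODE_MAJ_SHIFT	(16)
#define UNICODE_MIN_SHIFT	(8)
#define UNICODE_AGE(MAJ, MIN, REV)				\
	(((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) |		\
	 ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) |		\
	 ((unsigned int)(REV)))

/*
 * Hypothetical helper, not in the patch: count the code points in a
 * NUL-terminated string that is assumed to be valid UTF-8.
 */
static size_t count_code_points(const char *s)
{
	size_t n = 0;

	while (*s) {
		s += utf8clen(s);
		n++;
	}
	return n;
}
```

Because the version fields are packed most-significant first, two ages compare correctly as plain unsigned integers, which is what the maxage comparisons in the patch rely on.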
* Re: [PATCH 07/10] xfs: add trie generator and supporting code for UTF-8.
2014-09-18 20:15 ` [PATCH 07/10] xfs: add trie generator and supporting code for UTF-8 Ben Myers
@ 2014-09-22 20:57 ` Dave Chinner
0 siblings, 0 replies; 84+ messages in thread
From: Dave Chinner @ 2014-09-22 20:57 UTC (permalink / raw)
To: Ben Myers; +Cc: linux-fsdevel, tinguely, olaf, xfs
On Thu, Sep 18, 2014 at 03:15:19PM -0500, Ben Myers wrote:
> From: Olaf Weber <olaf@sgi.com>
>
> mkutf8data.c is the source for a program that generates utf8data.h, which
> contains the trie that utf8norm.c uses. The trie is generated from the
> Unicode 7.0.0 data files. The format of the utf8data[] table is described
> in utf8norm.c.
>
> Supporting functions for UTF-8 normalization are in utf8norm.c with the
> header utf8norm.h. Two normalization forms are supported: nfkdi and nfkdicf.
>
> nfkdi:
> - Apply unicode normalization form NFKD.
> - Remove any Default_Ignorable_Code_Point.
>
> nfkdicf:
> - Apply unicode normalization form NFKD.
> - Remove any Default_Ignorable_Code_Point.
> - Apply a full casefold (C + F).
>
> For the purposes of the code, a string is valid UTF-8 if:
>
> - The values encoded are 0x1..0x10FFFF.
> - The surrogate codepoints 0xD800..0xDFFF are not encoded.
> - The shortest possible encoding is used for all values.
>
> The supporting functions work on null-terminated strings (utf8 prefix) and
> on length-limited strings (utf8n prefix).
>
> Signed-off-by: Olaf Weber <olaf@sgi.com>
>
> ---
> [v2: the trie is now separated into utf8norm.ko;
> utf8version is now a function and exported;
> introduced CONFIG_XFS_UTF8. -bpm]
> ---
> fs/xfs/Kconfig               |    8 +
> fs/xfs/Makefile              |    2 +-
> fs/xfs/utf8norm/Makefile     |   37 +
> fs/xfs/utf8norm/mkutf8data.c | 3239 ++++++++++++++++++++++++++++++++++++++++++
> fs/xfs/utf8norm/utf8norm.c   |  649 +++++++++
> fs/xfs/utf8norm/utf8norm.h   |  116 ++

Again, nothing XFS specific here. It's being built as a separate
module and the only thing that XFS uses are exported functions, so
it really should be generic library code....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply [flat|nested] 84+ messages in thread
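The three validity criteria quoted in the commit message (value range, no surrogates, shortest form) can be sketched in a few lines of standalone C. This is a hypothetical decoder-based check, not the trie lookup the patch uses; the name utf8_valid_seq() is invented for illustration. It covers only the encoding rules, so unlike the trie it does not check whether the code point is assigned in a given Unicode version.

```c
#include <stddef.h>

/*
 * Hypothetical validator, not the trie-based code from the patch.
 * Checks one UTF-8 sequence starting at s, touching at most len bytes.
 * Returns the sequence length (1..4) if it is valid, 0 otherwise.
 */
static int utf8_valid_seq(const unsigned char *s, size_t len)
{
	/* Smallest value that needs n bytes; index 1 also excludes NUL. */
	static const unsigned int min_value[5] = {
		0, 0x1, 0x80, 0x800, 0x10000
	};
	unsigned int cp;
	int n;
	int i;

	if (len == 0)
		return 0;
	if (s[0] < 0x80) {
		n = 1;
		cp = s[0];
	} else if ((s[0] & 0xE0) == 0xC0) {
		n = 2;
		cp = s[0] & 0x1F;
	} else if ((s[0] & 0xF0) == 0xE0) {
		n = 3;
		cp = s[0] & 0x0F;
	} else if ((s[0] & 0xF8) == 0xF0) {
		n = 4;
		cp = s[0] & 0x07;
	} else {
		/* Stray continuation byte, or a 5/6-byte lead byte. */
		return 0;
	}
	if ((size_t)n > len)
		return 0;
	for (i = 1; i < n; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return 0;
		cp = (cp << 6) | (s[i] & 0x3F);
	}
	if (cp < min_value[n])
		return 0;	/* not the shortest encoding, or NUL */
	if (cp > 0x10FFFF)
		return 0;	/* beyond the 17 Unicode planes */
	if (cp >= 0xD800 && cp <= 0xDFFF)
		return 0;	/* surrogate */
	return n;
}
```

The trie in the patch folds all of these checks into the lookup itself, since leaves exist only for sequences that satisfy them.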
* Re: [PATCH 07/10] xfs: add trie generator and supporting code for UTF-8.
2014-09-22 20:57 ` Dave Chinner
(?)
@ 2014-09-23 18:57 ` Ben Myers
2014-09-26 17:10 ` Christoph Hellwig
-1 siblings, 1 reply; 84+ messages in thread
From: Ben Myers @ 2014-09-23 18:57 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, tinguely, olaf, xfs
On Tue, Sep 23, 2014 at 06:57:14AM +1000, Dave Chinner wrote:
> On Thu, Sep 18, 2014 at 03:15:19PM -0500, Ben Myers wrote:
> > From: Olaf Weber <olaf@sgi.com>
> >
> > mkutf8data.c is the source for a program that generates utf8data.h, which
> > contains the trie that utf8norm.c uses. The trie is generated from the
> > Unicode 7.0.0 data files. The format of the utf8data[] table is described
> > in utf8norm.c.
> >
> > Supporting functions for UTF-8 normalization are in utf8norm.c with the
> > header utf8norm.h. Two normalization forms are supported: nfkdi and nfkdicf.
> >
> > nfkdi:
> > - Apply unicode normalization form NFKD.
> > - Remove any Default_Ignorable_Code_Point.
> >
> > nfkdicf:
> > - Apply unicode normalization form NFKD.
> > - Remove any Default_Ignorable_Code_Point.
> > - Apply a full casefold (C + F).
> >
> > For the purposes of the code, a string is valid UTF-8 if:
> >
> > - The values encoded are 0x1..0x10FFFF.
> > - The surrogate codepoints 0xD800..0xDFFF are not encoded.
> > - The shortest possible encoding is used for all values.
> >
> > The supporting functions work on null-terminated strings (utf8 prefix) and
> > on length-limited strings (utf8n prefix).
> >
> > Signed-off-by: Olaf Weber <olaf@sgi.com>
> >
> > ---
> > [v2: the trie is now separated into utf8norm.ko;
> > utf8version is now a function and exported;
> > introduced CONFIG_XFS_UTF8. -bpm]
> > ---
> > fs/xfs/Kconfig | 8 +
> > fs/xfs/Makefile | 2 +-
> > fs/xfs/utf8norm/Makefile | 37 +
> > fs/xfs/utf8norm/mkutf8data.c | 3239 ++++++++++++++++++++++++++++++++++++++++++
> > fs/xfs/utf8norm/utf8norm.c | 649 +++++++++
> > fs/xfs/utf8norm/utf8norm.h | 116 ++
>
> Again, nothing XFS specific here. It's being built as a separate
> module and the only thing that XFS uses are exported functions, so
> it really should be generic library code....
I'll get this moved to lib/ as you suggested elsewhere in the thread.
Thanks,
Ben
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH 07/10] xfs: add trie generator and supporting code for UTF-8.
2014-09-23 18:57 ` Ben Myers
@ 2014-09-26 17:10 ` Christoph Hellwig
0 siblings, 0 replies; 84+ messages in thread
From: Christoph Hellwig @ 2014-09-26 17:10 UTC (permalink / raw)
To: Ben Myers; +Cc: linux-fsdevel, tinguely, olaf, xfs
On Tue, Sep 23, 2014 at 01:57:21PM -0500, Ben Myers wrote:
> I'll get this moved to lib/ as you suggested elsewhere in the
> thread.
Given that this is a host side tool it should be under scripts/
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 84+ messages in thread
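Note on the validity rules quoted in this subthread: the three conditions (values 0x1..0x10FFFF, no surrogates, shortest encoding only) can be checked byte by byte. The following is a minimal userspace sketch under stated assumptions, not the trie-based code from utf8norm.c. The name utf8_is_valid is made up for this note, it works on length-limited input in the spirit of the utf8n functions, and it takes the surrogate range to be 0xD800..0xDFFF as defined by Unicode.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of the posted series: returns 1 if the
 * first len bytes of s are valid UTF-8 under the three quoted rules,
 * 0 otherwise.
 */
static int utf8_is_valid(const unsigned char *s, size_t len)
{
	size_t i = 0;

	assert(s != NULL || len == 0);
	while (i < len) {
		unsigned int c = s[i];
		unsigned int cp, min;
		size_t n, k;

		/* Decode the lead byte: payload bits, length, smallest legal value. */
		if (c < 0x80) {
			cp = c; n = 1; min = 0x1;	/* 0x0 is excluded by rule 1 */
		} else if ((c & 0xE0) == 0xC0) {
			cp = c & 0x1F; n = 2; min = 0x80;
		} else if ((c & 0xF0) == 0xE0) {
			cp = c & 0x0F; n = 3; min = 0x800;
		} else if ((c & 0xF8) == 0xF0) {
			cp = c & 0x07; n = 4; min = 0x10000;
		} else {
			return 0;	/* stray continuation or invalid lead byte */
		}
		if (len - i < n)
			return 0;	/* sequence is truncated */
		for (k = 1; k < n; k++) {
			if ((s[i + k] & 0xC0) != 0x80)
				return 0;	/* not a continuation byte */
			cp = (cp << 6) | (s[i + k] & 0x3F);
		}
		if (cp < min)
			return 0;	/* not the shortest encoding (or NUL) */
		if (cp > 0x10FFFF)
			return 0;	/* beyond the Unicode range */
		if (cp >= 0xD800 && cp <= 0xDFFF)
			return 0;	/* surrogate code point */
		i += n;
	}
	return 1;
}
```

In the series itself, a name that fails these checks is not rejected; it is handled as an opaque byte string, so a checker like this only decides which comparison path a name takes.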
* [PATCH 08/10] xfs: add xfs_nameops for utf8 and utf8+casefold. 2014-09-18 19:56 [RFC v2] Unicode/UTF-8 support for XFS Ben Myers @ 2014-09-18 20:16 ` Ben Myers 2014-09-18 20:09 ` [PATCH 02/10] xfs: rename XFS_CMP_CASE to XFS_CMP_MATCH Ben Myers ` (15 subsequent siblings) 16 siblings, 0 replies; 84+ messages in thread From: Ben Myers @ 2014-09-18 20:16 UTC (permalink / raw) To: linux-fsdevel; +Cc: xfs, olaf, tinguely From: Olaf Weber <olaf@sgi.com> The xfs_utf8_nameops use the nfkdi normalization when comparing filenames, and are installed if the utf8bit is set in the super block. The xfs_utf8_ci_nameops use the nfkdicf normalization when comparing filenames, and are installed if both the utf8bit and the borgbit are set in the superblock. Normalized filenames are not stored on disk. Normalization will fail if a filename is not valid UTF-8, in which case the filename is treated as an opaque blob. Signed-off-by: Olaf Weber <olaf@sgi.com> --- [v2: updated to use utf8norm.ko module; compiled conditionally on CONFIG_XFS_UTF8=y; utf8version is now a function; move xfs_utf8.[ch] into libxfs. 
--bpm] --- fs/xfs/Makefile | 2 + fs/xfs/libxfs/xfs_dir2.c | 24 ++++- fs/xfs/libxfs/xfs_utf8.c | 242 +++++++++++++++++++++++++++++++++++++++++++++++ fs/xfs/libxfs/xfs_utf8.h | 25 +++++ fs/xfs/xfs_iops.c | 2 +- 5 files changed, 290 insertions(+), 5 deletions(-) create mode 100644 fs/xfs/libxfs/xfs_utf8.c create mode 100644 fs/xfs/libxfs/xfs_utf8.h diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile index 6d000d3..5a4dfa0 100644 --- a/fs/xfs/Makefile +++ b/fs/xfs/Makefile @@ -114,6 +114,8 @@ xfs-$(CONFIG_XFS_QUOTA) += xfs_dquot.o \ xfs_qm.o \ xfs_quotaops.o +xfs-$(CONFIG_XFS_UTF8) += libxfs/xfs_utf8.o + # xfs_rtbitmap is shared with libxfs xfs-$(CONFIG_XFS_RT) += xfs_rtalloc.o diff --git a/fs/xfs/libxfs/xfs_dir2.c b/fs/xfs/libxfs/xfs_dir2.c index 84e5ca9..e28736b 100644 --- a/fs/xfs/libxfs/xfs_dir2.c +++ b/fs/xfs/libxfs/xfs_dir2.c @@ -35,6 +35,7 @@ #include "xfs_error.h" #include "xfs_trace.h" #include "xfs_dinode.h" +#include "xfs_utf8.h" struct xfs_name xfs_name_dotdot = { (unsigned char *)"..", 2, XFS_DIR3_FT_DIR }; @@ -156,10 +157,25 @@ xfs_da_mount( (uint)sizeof(xfs_da_node_entry_t); dageo->magicpct = (dageo->blksize * 37) / 100; - if (xfs_sb_version_hasasciici(&mp->m_sb)) - mp->m_dirnameops = &xfs_ascii_ci_nameops; - else - mp->m_dirnameops = &xfs_default_nameops; + if (xfs_sb_version_hasutf8(&mp->m_sb)) { +#ifdef CONFIG_XFS_UTF8 + if (xfs_sb_version_hasasciici(&mp->m_sb)) + mp->m_dirnameops = &xfs_utf8_ci_nameops; + else + mp->m_dirnameops = &xfs_utf8_nameops; +#else + xfs_warn(mp, + "Recompile XFS with CONFIG_XFS_UTF8 to mount this filesystem"); + kmem_free(mp->m_dir_geo); + kmem_free(mp->m_attr_geo); + return -ENOSYS; +#endif + } else { + if (xfs_sb_version_hasasciici(&mp->m_sb)) + mp->m_dirnameops = &xfs_ascii_ci_nameops; + else + mp->m_dirnameops = &xfs_default_nameops; + } return 0; } diff --git a/fs/xfs/libxfs/xfs_utf8.c b/fs/xfs/libxfs/xfs_utf8.c new file mode 100644 index 0000000..1e64c44 --- /dev/null +++ b/fs/xfs/libxfs/xfs_utf8.c @@ -0,0 +1,242 @@ +/* + * 
Copyright (c) 2014 SGI. + * All rights reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write the Free Software Foundation, + * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include "xfs.h" +#include "xfs_fs.h" +#include "xfs_types.h" +#include "xfs_bit.h" +#include "xfs_log_format.h" +#include "xfs_inum.h" +#include "xfs_trans.h" +#include "xfs_trans_resv.h" +#include "xfs_sb.h" +#include "xfs_ag.h" +#include "xfs_da_format.h" +#include "xfs_da_btree.h" +#include "xfs_dir2.h" +#include "xfs_mount.h" +#include "xfs_da_btree.h" +#include "xfs_format.h" +#include "xfs_bmap_btree.h" +#include "xfs_alloc_btree.h" +#include "xfs_dinode.h" +#include "xfs_inode.h" +#include "xfs_inode_item.h" +#include "xfs_bmap.h" +#include "xfs_error.h" +#include "xfs_trace.h" +#include "xfs_utf8.h" +#include <utf8norm/utf8norm.h> + +/* + * xfs nameops using nfkdi + */ + +static xfs_dahash_t +xfs_utf8_hashname( + const unsigned char *name, + int len) +{ + utf8data_t nfkdi; + struct utf8cursor u8c; + xfs_dahash_t hash; + int val; + + nfkdi = utf8nfkdi(utf8version()); + hash = 0; + if (utf8ncursor(&u8c, nfkdi, name, len) < 0) + goto blob; + while ((val = utf8byte(&u8c)) > 0) + hash = val ^ rol32(hash, 7); + /* In case of error treat the name as a binary blob. 
*/ + if (val == 0) + return hash; +blob: + return xfs_da_hashname(name, len); +} + +static int +xfs_utf8_normhash( + struct xfs_da_args *args) +{ + utf8data_t nfkdi; + struct utf8cursor u8c; + unsigned char *norm; + ssize_t normlen; + int c; + + nfkdi = utf8nfkdi(utf8version()); + /* Failure to normalize is treated as a blob. */ + if ((normlen = utf8nlen(nfkdi, args->name, args->namelen)) < 0) + goto blob; + if (utf8ncursor(&u8c, nfkdi, args->name, args->namelen) < 0) + goto blob; + if (!(norm = kmem_alloc(normlen + 1, KM_NOFS|KM_MAYFAIL))) + return -ENOMEM; + args->norm = norm; + args->normlen = normlen; + while ((c = utf8byte(&u8c)) > 0) + *norm++ = c; + if (c == 0) { + *norm = '\0'; + args->hashval = xfs_da_hashname(args->norm, args->normlen); + return 0; + } + kmem_free(args->norm); +blob: + args->norm = NULL; + args->normlen = -1; + args->hashval = xfs_da_hashname(args->name, args->namelen); + return 0; +} + +static enum xfs_dacmp +xfs_utf8_compname( + struct xfs_da_args *args, + const unsigned char *name, + int len) +{ + utf8data_t nfkdi; + struct utf8cursor u8c; + const unsigned char *norm; + int c; + + ASSERT(args->norm || args->normlen == -1); + + /* Check for an exact match first. 
*/ + if (args->namelen == len && memcmp(args->name, name, len) == 0) + return XFS_CMP_EXACT; + /* xfs_utf8_normhash() set args->normlen to -1 for a blob */ + if (args->normlen < 0) + return XFS_CMP_DIFFERENT; + nfkdi = utf8nfkdi(utf8version()); + if (utf8ncursor(&u8c, nfkdi, name, len) < 0) + return XFS_CMP_DIFFERENT; + norm = args->norm; + while ((c = utf8byte(&u8c)) > 0) + if (c != *norm++) + return XFS_CMP_DIFFERENT; + if (c < 0 || *norm != '\0') + return XFS_CMP_DIFFERENT; + return XFS_CMP_MATCH; +} + +struct xfs_nameops xfs_utf8_nameops = { + .hashname = xfs_utf8_hashname, + .normhash = xfs_utf8_normhash, + .compname = xfs_utf8_compname, +}; + +/* + * xfs nameops using nfkdicf + */ + +static xfs_dahash_t +xfs_utf8_ci_hashname( + const unsigned char *name, + int len) +{ + utf8data_t nfkdicf; + struct utf8cursor u8c; + xfs_dahash_t hash; + int val; + + nfkdicf = utf8nfkdicf(utf8version()); + hash = 0; + if (utf8ncursor(&u8c, nfkdicf, name, len) < 0) + goto blob; + while ((val = utf8byte(&u8c)) > 0) + hash = val ^ rol32(hash, 7); + /* In case of error treat the name as a binary blob. */ + if (val == 0) + return hash; +blob: + return xfs_da_hashname(name, len); +} + +static int +xfs_utf8_ci_normhash( + struct xfs_da_args *args) +{ + utf8data_t nfkdicf; + struct utf8cursor u8c; + unsigned char *norm; + ssize_t normlen; + int c; + + nfkdicf = utf8nfkdicf(utf8version()); + /* Failure to normalize is treated as a blob. 
*/ + if ((normlen = utf8nlen(nfkdicf, args->name, args->namelen)) < 0) + goto blob; + if (utf8ncursor(&u8c, nfkdicf, args->name, args->namelen) < 0) + goto blob; + if (!(norm = kmem_alloc(normlen + 1, KM_NOFS|KM_MAYFAIL))) + return -ENOMEM; + args->norm = norm; + args->normlen = normlen; + while ((c = utf8byte(&u8c)) > 0) + *norm++ = c; + if (c == 0) { + *norm = '\0'; + args->hashval = xfs_da_hashname(args->norm, args->normlen); + return 0; + } + kmem_free(args->norm); +blob: + args->norm = NULL; + args->normlen = -1; + args->hashval = xfs_da_hashname(args->name, args->namelen); + return 0; +} + +static enum xfs_dacmp +xfs_utf8_ci_compname( + struct xfs_da_args *args, + const unsigned char *name, + int len) +{ + utf8data_t nfkdicf; + struct utf8cursor u8c; + const unsigned char *norm; + int c; + + ASSERT(args->norm || args->normlen == -1); + + /* Check for an exact match first. */ + if (args->namelen == len && memcmp(args->name, name, len) == 0) + return XFS_CMP_EXACT; + /* xfs_utf8_ci_normhash() set args->normlen to -1 for a blob */ + if (args->normlen < 0) + return XFS_CMP_DIFFERENT; + nfkdicf = utf8nfkdicf(utf8version()); + if (utf8ncursor(&u8c, nfkdicf, name, len) < 0) + return XFS_CMP_DIFFERENT; + norm = args->norm; + while ((c = utf8byte(&u8c)) > 0) + if (c != *norm++) + return XFS_CMP_DIFFERENT; + if (c < 0 || *norm != '\0') + return XFS_CMP_DIFFERENT; + return XFS_CMP_MATCH; +} + +struct xfs_nameops xfs_utf8_ci_nameops = { + .hashname = xfs_utf8_ci_hashname, + .normhash = xfs_utf8_ci_normhash, + .compname = xfs_utf8_ci_compname, +}; diff --git a/fs/xfs/libxfs/xfs_utf8.h b/fs/xfs/libxfs/xfs_utf8.h new file mode 100644 index 0000000..97b6a91 --- /dev/null +++ b/fs/xfs/libxfs/xfs_utf8.h @@ -0,0 +1,25 @@ +/* + * Copyright (c) 2014 SGI. + * All rights reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation. 
+ * + * This program is distributed in the hope that it would be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write the Free Software Foundation, + * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#ifndef XFS_UTF8_H +#define XFS_UTF8_H + +extern struct xfs_nameops xfs_utf8_nameops; +extern struct xfs_nameops xfs_utf8_ci_nameops; + +#endif /* XFS_UTF8_H */ diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c index cea3d64..fbfb1bb 100644 --- a/fs/xfs/xfs_iops.c +++ b/fs/xfs/xfs_iops.c @@ -1257,7 +1257,7 @@ xfs_setup_inode( break; case S_IFDIR: lockdep_set_class(&ip->i_lock.mr_lock, &xfs_dir_ilock_class); - if (xfs_sb_version_hasasciici(&XFS_M(inode->i_sb)->m_sb)) + if (xfs_sb_version_hasci(&XFS_M(inode->i_sb)->m_sb)) inode->i_op = &xfs_dir_ci_inode_operations; else inode->i_op = &xfs_dir_inode_operations; -- 1.7.12.4 ^ permalink raw reply related [flat|nested] 84+ messages in thread
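Note on the nameops in this patch: the hash is computed over the normalized byte stream, and the compare routine tries a byte-exact match before falling back to a normalized comparison. The following is a minimal userspace sketch of that shape only. It assumes a toy normalization (ASCII case folding via toy_fold) in place of utf8byte() and the nfkdi/nfkdicf tables, all toy_ names and the CMP_ enum are made up for this note, and the blob fallback for invalid UTF-8 is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum cmp_result { CMP_DIFFERENT, CMP_EXACT, CMP_MATCH };

/* Stand-in for the real normalization: folds ASCII upper case only. */
static unsigned char toy_fold(unsigned char c)
{
	return (c >= 'A' && c <= 'Z') ? (unsigned char)(c - 'A' + 'a') : c;
}

/* Rotate left; only called with a shift of 7 here. */
static uint32_t toy_rol32(uint32_t w, unsigned int s)
{
	return (w << s) | (w >> (32 - s));
}

/* Same mixing step as the patch (val ^ rol32(hash, 7)), fed normalized bytes. */
static uint32_t toy_hashname(const unsigned char *name, size_t len)
{
	uint32_t hash = 0;
	size_t i;

	assert(name != NULL || len == 0);
	for (i = 0; i < len; i++)
		hash = toy_fold(name[i]) ^ toy_rol32(hash, 7);
	return hash;
}

static enum cmp_result toy_compname(const unsigned char *a, size_t alen,
				    const unsigned char *b, size_t blen)
{
	size_t i;

	assert(a != NULL && b != NULL);
	/* Check for a byte-exact match first. */
	if (alen == blen && memcmp(a, b, alen) == 0)
		return CMP_EXACT;
	/* The toy folding preserves length; real normalization does not. */
	if (alen != blen)
		return CMP_DIFFERENT;
	for (i = 0; i < alen; i++)
		if (toy_fold(a[i]) != toy_fold(b[i]))
			return CMP_DIFFERENT;
	return CMP_MATCH;
}
```

The property that matters is that two names which compare as CMP_MATCH also hash to the same value, since the directory lookup finds candidate entries by hash before it compares names.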
* [PATCH 09/10] xfs: apply utf-8 normalization rules to user extended attribute names 2014-09-18 19:56 [RFC v2] Unicode/UTF-8 support for XFS Ben Myers ` (8 preceding siblings ...) 2014-09-18 20:16 ` Ben Myers @ 2014-09-18 20:17 ` Ben Myers 2014-09-18 20:18 ` [PATCH 10/10] xfs: implement demand load of utf8norm.ko Ben Myers ` (6 subsequent siblings) 16 siblings, 0 replies; 84+ messages in thread From: Ben Myers @ 2014-09-18 20:17 UTC (permalink / raw) To: linux-fsdevel; +Cc: tinguely, olaf, xfs From: Olaf Weber <olaf@sgi.com> Apply the same rules for UTF-8 normalization to the names of user-defined extended attributes. System attributes are excluded because they are not user-visible in the first place, and the kernel is expected to know what it is doing when naming them. Signed-off-by: Olaf Weber <olaf@sgi.com> --- fs/xfs/libxfs/xfs_attr.c | 56 ++++++++++++++++++++++++++++++++++++------- fs/xfs/libxfs/xfs_attr_leaf.c | 11 +++++++-- fs/xfs/libxfs/xfs_utf8.c | 7 ++++++ fs/xfs/xfs_attr_list.c | 11 ++++++++- 4 files changed, 74 insertions(+), 11 deletions(-) diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c index 353fb42..68e7ce3 100644 --- a/fs/xfs/libxfs/xfs_attr.c +++ b/fs/xfs/libxfs/xfs_attr.c @@ -83,12 +83,14 @@ xfs_attr_args_init( const unsigned char *name, int flags) { + struct xfs_mount *mp = dp->i_mount; + int error; if (!name) return -EINVAL; memset(args, 0, sizeof(*args)); - args->geo = dp->i_mount->m_attr_geo; + args->geo = mp->m_attr_geo; args->whichfork = XFS_ATTR_FORK; args->dp = dp; args->flags = flags; @@ -97,7 +99,11 @@ xfs_attr_args_init( if (args->namelen >= MAXNAMELEN) return -EFAULT; /* match IRIX behaviour */ - args->hashval = xfs_da_hashname(args->name, args->namelen); + if (!xfs_sb_version_hasutf8(&mp->m_sb)) + args->hashval = xfs_da_hashname(args->name, args->namelen); + else if ((error = mp->m_dirnameops->normhash(args)) != 0) + return error; + return 0; } @@ -154,6 +160,9 @@ xfs_attr_get( error = xfs_attr_node_get(&args); 
xfs_iunlock(ip, lock_mode); + if (args.norm) + kmem_free(args.norm); + *valuelenp = args.valuelen; return error == -EEXIST ? 0 : error; } @@ -216,8 +225,11 @@ xfs_attr_set( return -EIO; error = xfs_attr_args_init(&args, dp, name, flags); - if (error) + if (error) { + if (args.norm) + kmem_free(args.norm); return error; + } args.value = value; args.valuelen = valuelen; @@ -227,8 +239,11 @@ xfs_attr_set( args.total = xfs_attr_calc_size(&args, &local); error = xfs_qm_dqattach(dp, 0); - if (error) + if (error) { + if (args.norm) + kmem_free(args.norm); return error; + } /* * If the inode doesn't have an attribute fork, add one. @@ -239,8 +254,11 @@ xfs_attr_set( XFS_ATTR_SF_ENTSIZE_BYNAME(args.namelen, valuelen); error = xfs_bmap_add_attrfork(dp, sf_size, rsvd); - if (error) + if (error) { + if (args.norm) + kmem_free(args.norm); return error; + } } /* @@ -270,6 +288,8 @@ xfs_attr_set( error = xfs_trans_reserve(args.trans, &tres, args.total, 0); if (error) { xfs_trans_cancel(args.trans, 0); + if (args.norm) + kmem_free(args.norm); return error; } xfs_ilock(dp, XFS_ILOCK_EXCL); @@ -280,6 +300,8 @@ xfs_attr_set( if (error) { xfs_iunlock(dp, XFS_ILOCK_EXCL); xfs_trans_cancel(args.trans, XFS_TRANS_RELEASE_LOG_RES); + if (args.norm) + kmem_free(args.norm); return error; } @@ -327,6 +349,8 @@ xfs_attr_set( XFS_TRANS_RELEASE_LOG_RES); xfs_iunlock(dp, XFS_ILOCK_EXCL); + if (args.norm) + kmem_free(args.norm); return error ? 
error : err2; } @@ -388,7 +412,8 @@ xfs_attr_set( xfs_trans_log_inode(args.trans, dp, XFS_ILOG_CORE); error = xfs_trans_commit(args.trans, XFS_TRANS_RELEASE_LOG_RES); xfs_iunlock(dp, XFS_ILOCK_EXCL); - + if (args.norm) + kmem_free(args.norm); return error; out: @@ -397,6 +422,8 @@ out: XFS_TRANS_RELEASE_LOG_RES|XFS_TRANS_ABORT); } xfs_iunlock(dp, XFS_ILOCK_EXCL); + if (args.norm) + kmem_free(args.norm); return error; } @@ -425,8 +452,11 @@ xfs_attr_remove( return -ENOATTR; error = xfs_attr_args_init(&args, dp, name, flags); - if (error) + if (error) { + if (args.norm) + kmem_free(args.norm); return error; + } args.firstblock = &firstblock; args.flist = &flist; @@ -439,8 +469,11 @@ xfs_attr_remove( args.op_flags = XFS_DA_OP_OKNOENT; error = xfs_qm_dqattach(dp, 0); - if (error) + if (error) { + if (args.norm) + kmem_free(args.norm); return error; + } /* * Start our first transaction of the day. @@ -466,6 +499,8 @@ xfs_attr_remove( XFS_ATTRRM_SPACE_RES(mp), 0); if (error) { xfs_trans_cancel(args.trans, 0); + if (args.norm) + kmem_free(args.norm); return error; } @@ -506,6 +541,8 @@ xfs_attr_remove( xfs_trans_log_inode(args.trans, dp, XFS_ILOG_CORE); error = xfs_trans_commit(args.trans, XFS_TRANS_RELEASE_LOG_RES); xfs_iunlock(dp, XFS_ILOCK_EXCL); + if (args.norm) + kmem_free(args.norm); return error; @@ -515,6 +552,9 @@ out: XFS_TRANS_RELEASE_LOG_RES|XFS_TRANS_ABORT); } xfs_iunlock(dp, XFS_ILOCK_EXCL); + if (args.norm) + kmem_free(args.norm); + return error; } diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c index b1f73db..c991a88 100644 --- a/fs/xfs/libxfs/xfs_attr_leaf.c +++ b/fs/xfs/libxfs/xfs_attr_leaf.c @@ -661,6 +661,7 @@ int xfs_attr_shortform_to_leaf(xfs_da_args_t *args) { xfs_inode_t *dp; + struct xfs_mount *mp; xfs_attr_shortform_t *sf; xfs_attr_sf_entry_t *sfe; xfs_da_args_t nargs; @@ -673,6 +674,7 @@ xfs_attr_shortform_to_leaf(xfs_da_args_t *args) trace_xfs_attr_sf_to_leaf(args); dp = args->dp; + mp = dp->i_mount; ifp = dp->i_afp; 
sf = (xfs_attr_shortform_t *)ifp->if_u1.if_data; size = be16_to_cpu(sf->hdr.totsize); @@ -726,13 +728,18 @@ xfs_attr_shortform_to_leaf(xfs_da_args_t *args) nargs.namelen = sfe->namelen; nargs.value = &sfe->nameval[nargs.namelen]; nargs.valuelen = sfe->valuelen; - nargs.hashval = xfs_da_hashname(sfe->nameval, - sfe->namelen); nargs.flags = XFS_ATTR_NSP_ONDISK_TO_ARGS(sfe->flags); + if (!xfs_sb_version_hasutf8(&mp->m_sb)) + nargs.hashval = xfs_da_hashname(sfe->nameval, + sfe->namelen); + else if ((error = mp->m_dirnameops->normhash(&nargs)) != 0) + goto out; error = xfs_attr3_leaf_lookup_int(bp, &nargs); /* set a->index */ ASSERT(error == -ENOATTR); error = xfs_attr3_leaf_add(bp, &nargs); ASSERT(error != -ENOSPC); + if (nargs.norm) + kmem_free(nargs.norm); if (error) goto out; sfe = XFS_ATTR_SF_NEXTENTRY(sfe); diff --git a/fs/xfs/libxfs/xfs_utf8.c b/fs/xfs/libxfs/xfs_utf8.c index 1e64c44..75f2b3a 100644 --- a/fs/xfs/libxfs/xfs_utf8.c +++ b/fs/xfs/libxfs/xfs_utf8.c @@ -38,6 +38,7 @@ #include "xfs_inode.h" #include "xfs_inode_item.h" #include "xfs_bmap.h" +#include "xfs_attr.h" #include "xfs_error.h" #include "xfs_trace.h" #include "xfs_utf8.h" @@ -80,6 +81,9 @@ xfs_utf8_normhash( ssize_t normlen; int c; + /* Don't normalize system attribute names. */ + if (args->flags & (ATTR_ROOT|ATTR_SECURE)) + goto blob; nfkdi = utf8nfkdi(utf8version()); /* Failure to normalize is treated as a blob. */ if ((normlen = utf8nlen(nfkdi, args->name, args->namelen)) < 0) @@ -179,6 +183,9 @@ xfs_utf8_ci_normhash( ssize_t normlen; int c; + /* Don't normalize system attribute names. */ + if (args->flags & (ATTR_ROOT|ATTR_SECURE)) + goto blob; nfkdicf = utf8nfkdicf(utf8version()); /* Failure to normalize is treated as a blob. 
*/ if ((normlen = utf8nlen(nfkdicf, args->name, args->namelen)) < 0) diff --git a/fs/xfs/xfs_attr_list.c b/fs/xfs/xfs_attr_list.c index 62db83a..4075d54 100644 --- a/fs/xfs/xfs_attr_list.c +++ b/fs/xfs/xfs_attr_list.c @@ -76,12 +76,14 @@ xfs_attr_shortform_list(xfs_attr_list_context_t *context) xfs_attr_shortform_t *sf; xfs_attr_sf_entry_t *sfe; xfs_inode_t *dp; + struct xfs_mount *mp; int sbsize, nsbuf, count, i; int error; ASSERT(context != NULL); dp = context->dp; ASSERT(dp != NULL); + mp = dp->i_mount; ASSERT(dp->i_afp != NULL); sf = (xfs_attr_shortform_t *)dp->i_afp->if_u1.if_data; ASSERT(sf != NULL); @@ -154,7 +156,14 @@ xfs_attr_shortform_list(xfs_attr_list_context_t *context) } sbp->entno = i; - sbp->hash = xfs_da_hashname(sfe->nameval, sfe->namelen); + /* ATTR_ROOT and ATTR_SECURE are never normalized. */ + if (!xfs_sb_version_hasutf8(&mp->m_sb) || + (sfe->flags & (ATTR_ROOT|ATTR_SECURE))) { + sbp->hash = xfs_da_hashname(sfe->nameval, sfe->namelen); + } else { + sbp->hash = mp->m_dirnameops->hashname(sfe->nameval, + sfe->namelen); + } sbp->name = sfe->nameval; sbp->namelen = sfe->namelen; /* These are bytes, and both on-disk, don't endian-flip */ -- 1.7.12.4 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 84+ messages in thread
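Note on the attribute rule in this patch: normalization applies only to user attribute names on a filesystem with the utf8 feature, while names in the root and secure namespaces are hashed as raw bytes. The following is a minimal sketch of that decision. The function name and the SKETCH_ flag values are assumptions made for this note and are not copied from the XFS headers.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative namespace flags; the values are assumptions for this sketch. */
#define SKETCH_ATTR_ROOT	0x0002u
#define SKETCH_ATTR_SECURE	0x0008u
#define SKETCH_ATTR_KNOWN	(SKETCH_ATTR_ROOT | SKETCH_ATTR_SECURE)

/* Returns true if an attribute name with these flags would be normalized. */
static bool attr_name_is_normalized(bool fs_has_utf8, unsigned int flags)
{
	/* This sketch models only the two namespace flags. */
	assert((flags & ~SKETCH_ATTR_KNOWN) == 0);

	if (!fs_has_utf8)
		return false;	/* plain byte hashing, as before the series */
	/* System namespaces are never normalized. */
	return (flags & (SKETCH_ATTR_ROOT | SKETCH_ATTR_SECURE)) == 0;
}
```

The patch applies the same test in two places, xfs_utf8_normhash() and xfs_attr_shortform_list(), so that the hash used when listing attributes agrees with the hash used when they were stored.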
* [PATCH 10/10] xfs: implement demand load of utf8norm.ko
2014-09-18 19:56 [RFC v2] Unicode/UTF-8 support for XFS Ben Myers
` (9 preceding siblings ...)
2014-09-18 20:17 ` [PATCH 09/10] xfs: apply utf-8 normalization rules to user extended attribute names Ben Myers
@ 2014-09-18 20:18 ` Ben Myers
2014-09-18 20:31 ` [PATCH 00/13] xfsprogs: Unicode/UTF-8 support for XFS Ben Myers
` (5 subsequent siblings)
16 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-18 20:18 UTC (permalink / raw)
To: linux-fsdevel; +Cc: tinguely, olaf, xfs
From: Ben Myers <bpm@sgi.com>
The utf8 normalization module is large and there is no need to have it
loaded unless an xfs filesystem with utf8 enabled has been mounted.
This loads utf8norm.ko at mount time for filesystems that need it.
This is optional on CONFIG_XFS_UTF8_DEMAND_LOAD.
Signed-off-by: Ben Myers <bpm@sgi.com>
---
 fs/xfs/Kconfig           |  10 +++++
 fs/xfs/libxfs/xfs_dir2.c |   9 ++++
 fs/xfs/libxfs/xfs_utf8.c | 104 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_utf8.h |   5 +++
 fs/xfs/xfs_super.c       |   6 +++
 5 files changed, 134 insertions(+)
diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index a847857..69efd85c 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -103,3 +103,13 @@ config XFS_UTF8
 	  Say Y here to enable utf8 normalization support in XFS.  You will be
 	  able to mount and use filesystems created with the utf8 mkfs.xfs
 	  option.
+
+config XFS_UTF8_DEMAND_LOAD
+	bool "XFS loads UTF-8 normalization module on demand"
+	depends on XFS_FS
+	depends on XFS_UTF8
+	help
+	  Say Y here to enable on demand loading of the utf8
+	  normalization module.  This enables the large normalization
+	  module to remain unloaded until a filesystem with utf8 support
+	  is mounted.
diff --git a/fs/xfs/libxfs/xfs_dir2.c b/fs/xfs/libxfs/xfs_dir2.c index e28736b..436738d 100644 --- a/fs/xfs/libxfs/xfs_dir2.c +++ b/fs/xfs/libxfs/xfs_dir2.c @@ -35,7 +35,9 @@ #include "xfs_error.h" #include "xfs_trace.h" #include "xfs_dinode.h" +#ifdef CONFIG_XFS_UTF8 #include "xfs_utf8.h" +#endif struct xfs_name xfs_name_dotdot = { (unsigned char *)"..", 2, XFS_DIR3_FT_DIR }; @@ -159,6 +161,13 @@ xfs_da_mount( if (xfs_sb_version_hasutf8(&mp->m_sb)) { #ifdef CONFIG_XFS_UTF8 +#ifdef CONFIG_XFS_UTF8_DEMAND_LOAD + if (xfs_init_utf8_module(mp)) { + kmem_free(mp->m_dir_geo); + kmem_free(mp->m_attr_geo); + return -ENOSYS; + } +#endif /* CONFIG_XFS_UTF8_DEMAND_LOAD */ if (xfs_sb_version_hasasciici(&mp->m_sb)) mp->m_dirnameops = &xfs_utf8_ci_nameops; else diff --git a/fs/xfs/libxfs/xfs_utf8.c b/fs/xfs/libxfs/xfs_utf8.c index 75f2b3a..71978c8 100644 --- a/fs/xfs/libxfs/xfs_utf8.c +++ b/fs/xfs/libxfs/xfs_utf8.c @@ -44,6 +44,110 @@ #include "xfs_utf8.h" #include <utf8norm/utf8norm.h> +#ifdef CONFIG_XFS_UTF8_DEMAND_LOAD +#include <linux/kmod.h> + +static DEFINE_SPINLOCK(utf8norm_lock); +static int utf8norm_initialized; + +static const unsigned int (*utf8version_func)(void); +static utf8data_t (*utf8nfkdi_func)(unsigned int); +static utf8data_t (*utf8nfkdicf_func)(unsigned int); +static ssize_t (*utf8nlen_func)(utf8data_t, const char *, size_t); +static int (*utf8ncursor_func)(struct utf8cursor *, utf8data_t, + const char *, size_t); +static int (*utf8byte_func)(struct utf8cursor *); + +static void +xfs_put_utf8_module_locked(void) +{ + if (utf8version_func) + symbol_put(utf8version); + + if (utf8nfkdi_func) + symbol_put(utf8nfkdi); + + if (utf8nfkdicf_func) + symbol_put(utf8nfkdicf); + + if (utf8nlen_func) + symbol_put(utf8nlen); + + if (utf8ncursor_func) + symbol_put(utf8ncursor); + + if (utf8byte_func) + symbol_put(utf8byte); +} + +void +xfs_put_utf8_module(void) +{ + spin_lock(&utf8norm_lock); + if (!utf8norm_initialized) { + spin_unlock(&utf8norm_lock); + return; + } + 
xfs_put_utf8_module_locked(); + spin_unlock(&utf8norm_lock); +} + +int +xfs_init_utf8_module(struct xfs_mount *mp) +{ + request_module("utf8norm"); + + spin_lock(&utf8norm_lock); + if (utf8norm_initialized) { + spin_unlock(&utf8norm_lock); + return 0; + } + + utf8version_func = symbol_get(utf8version); + if (!utf8version_func) + goto error; + + utf8nfkdi_func = symbol_get(utf8nfkdi); + if (!utf8nfkdi_func) + goto error; + + utf8nfkdicf_func = symbol_get(utf8nfkdicf); + if (!utf8nfkdicf_func) + goto error; + + utf8nlen_func = symbol_get(utf8nlen); + if (!utf8nlen_func) + goto error; + + utf8ncursor_func = symbol_get(utf8ncursor); + if (!utf8ncursor_func) + goto error; + + utf8byte_func = symbol_get(utf8byte); + if (!utf8byte_func) + goto error; + + utf8norm_initialized = 1; + spin_unlock(&utf8norm_lock); + return 0; +error: + xfs_put_utf8_module_locked(); + spin_unlock(&utf8norm_lock); + xfs_warn(mp, + "Failed to load utf8norm.ko which is required to " + "mount a filesystem with utf8 support."); + return -ENOSYS; +} + +#define utf8version (*utf8version_func) +#define utf8nfkdi (*utf8nfkdi_func) +#define utf8nfkdicf (*utf8nfkdicf_func) +#define utf8nlen (*utf8nlen_func) +#define utf8ncursor (*utf8ncursor_func) +#define utf8byte (*utf8byte_func) + +#endif /* CONFIG_XFS_UTF8_DEMAND_LOAD */ + /* * xfs nameops using nfkdi */ diff --git a/fs/xfs/libxfs/xfs_utf8.h b/fs/xfs/libxfs/xfs_utf8.h index 97b6a91..9d1125a 100644 --- a/fs/xfs/libxfs/xfs_utf8.h +++ b/fs/xfs/libxfs/xfs_utf8.h @@ -22,4 +22,9 @@ extern struct xfs_nameops xfs_utf8_nameops; extern struct xfs_nameops xfs_utf8_ci_nameops; +#ifdef CONFIG_XFS_UTF8_DEMAND_LOAD +extern int xfs_init_utf8_module(struct xfs_mount *); +extern void xfs_put_utf8_module(void); +#endif + #endif /* XFS_UTF8_H */ diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index b194652..050a949 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -47,6 +47,9 @@ #include "xfs_dinode.h" #include "xfs_filestream.h" #include "xfs_quota.h" 
+#ifdef CONFIG_XFS_UTF8_DEMAND_LOAD +#include "xfs_utf8.h" +#endif #include <linux/namei.h> #include <linux/init.h> @@ -1809,6 +1812,9 @@ exit_xfs_fs(void) xfs_mru_cache_uninit(); xfs_destroy_workqueues(); xfs_destroy_zones(); +#ifdef CONFIG_XFS_UTF8_DEMAND_LOAD + xfs_put_utf8_module(); +#endif } module_init(init_xfs_fs); -- 1.7.12.4 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 84+ messages in thread
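Editorial sketch, not part of the posted series: the demand-load patch above resolves a fixed set of entry points once, keeps them in function pointers, and releases everything it acquired if any one lookup fails. The user-space fragment below mimics that all-or-nothing shape with a small in-memory symbol table. All `sk_` names are hypothetical, the table merely stands in for a loadable module, and locking is left out, so this does not show how `request_module()`, `symbol_get()` or the spinlock behave in the kernel. Unlike the posted code, the release helper here also clears the saved pointers and the initialized flag, which is a design choice of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void (*sk_fn)(void);

/* One exported entry point plus a reference count, as a toy module. */
struct sk_symbol {
	const char *name;
	sk_fn fn;
	int refs;
};

static void sk_dummy(void) { }

static struct sk_symbol sk_module[] = {
	{ "version", sk_dummy, 0 },
	{ "nlen",    sk_dummy, 0 },
	{ "byte",    sk_dummy, 0 },
};
#define SK_NSYMS (sizeof(sk_module) / sizeof(sk_module[0]))
#define SK_MAX   8

static struct sk_symbol *sk_resolved[SK_MAX];
static int sk_initialized;

/* Take a reference on a named symbol, or return NULL if it is absent. */
static struct sk_symbol *sk_symbol_get(const char *name)
{
	for (size_t i = 0; i < SK_NSYMS; i++) {
		if (strcmp(sk_module[i].name, name) == 0) {
			sk_module[i].refs++;
			return &sk_module[i];
		}
	}
	return NULL;
}

static void sk_symbol_put(struct sk_symbol *sym)
{
	assert(sym->refs > 0);
	sym->refs--;
}

/* Drop every reference taken so far and forget the pointers. */
static void sk_put_all(void)
{
	for (size_t i = 0; i < SK_MAX; i++) {
		if (sk_resolved[i]) {
			sk_symbol_put(sk_resolved[i]);
			sk_resolved[i] = NULL;
		}
	}
	sk_initialized = 0;
}

/* Resolve all wanted symbols, or none: returns 0 on success, -1 on failure. */
static int sk_init(const char *const *wanted, size_t n)
{
	assert(n <= SK_MAX);
	if (sk_initialized)
		return 0;
	for (size_t i = 0; i < n; i++) {
		sk_resolved[i] = sk_symbol_get(wanted[i]);
		if (!sk_resolved[i]) {
			sk_put_all();
			return -1;
		}
	}
	sk_initialized = 1;
	return 0;
}
```

A second call to `sk_init()` after success is a no-op, which corresponds to the early return on `utf8norm_initialized` in the patch.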
* [PATCH 00/13] xfsprogs: Unicode/UTF-8 support for XFS
2014-09-18 19:56 [RFC v2] Unicode/UTF-8 support for XFS Ben Myers
` (10 preceding siblings ...)
2014-09-18 20:18 ` [PATCH 10/10] xfs: implement demand load of utf8norm.ko Ben Myers
@ 2014-09-18 20:31 ` Ben Myers
2014-09-18 20:33 ` [PATCH 01/13] libxfs: return the first match during case-insensitive lookup Ben Myers
` (14 more replies)
2014-09-18 21:10 ` [RFC v2] Unicode/UTF-8 support for XFS Ben Myers
` (4 subsequent siblings)
16 siblings, 15 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-18 20:31 UTC (permalink / raw)
To: linux-fsdevel; +Cc: tinguely, olaf, xfs
Hi,
Here is the xfsprogs portion of the Unicode/UTF-8 support. A number of
the patches in libxfs correspond with kernel patches previously posted,
and then there are patches to add support to mkfs, xfs_info, xfs_repair,
and a test. (Note that the Unicode character database files have also
been removed here due to their size.)
Thanks,
Ben
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 84+ messages in thread
* [PATCH 01/13] libxfs: return the first match during case-insensitive lookup 2014-09-18 20:31 ` [PATCH 00/13] xfsprogs: Unicode/UTF-8 support for XFS Ben Myers @ 2014-09-18 20:33 ` Ben Myers 2014-09-18 20:33 ` [PATCH 02/13] libxfs: rename XFS_CMP_CASE to XFS_CMP_MATCH Ben Myers ` (13 subsequent siblings) 14 siblings, 0 replies; 84+ messages in thread From: Ben Myers @ 2014-09-18 20:33 UTC (permalink / raw) To: linux-fsdevel; +Cc: tinguely, olaf, xfs From: Olaf Weber <olaf@sgi.com> Change the XFS case-insensitive lookup code to return the first match found, even if it is not an exact match. Whether a filesystem uses case-insensitive lookups is determined by a superblock bit set during filesystem creation. This means that normal use cannot create two files that both match the same filename. Signed-off-by: Olaf Weber <olaf@sgi.com> --- libxfs/xfs_dir2_block.c | 17 ++++------- libxfs/xfs_dir2_leaf.c | 38 ++++------------------- libxfs/xfs_dir2_node.c | 80 ++++++++++++++++++------------------------------- libxfs/xfs_dir2_sf.c | 8 ++--- 4 files changed, 44 insertions(+), 99 deletions(-) diff --git a/libxfs/xfs_dir2_block.c b/libxfs/xfs_dir2_block.c index cede01f..2880431 100644 --- a/libxfs/xfs_dir2_block.c +++ b/libxfs/xfs_dir2_block.c @@ -705,28 +705,21 @@ xfs_dir2_block_lookup_int( dep = (xfs_dir2_data_entry_t *) ((char *)hdr + xfs_dir2_dataptr_to_off(mp, addr)); /* - * Compare name and if it's an exact match, return the index - * and buffer. If it's the first case-insensitive match, store - * the index and buffer and continue looking for an exact match. + * Compare name and if it's a match, return the + * index and buffer. 
*/ cmp = mp->m_dirnameops->compname(args, dep->name, dep->namelen); - if (cmp != XFS_CMP_DIFFERENT && cmp != args->cmpresult) { + if (cmp != XFS_CMP_DIFFERENT) { args->cmpresult = cmp; *bpp = bp; *entno = mid; - if (cmp == XFS_CMP_EXACT) - return 0; + return 0; } } while (++mid < be32_to_cpu(btp->count) && be32_to_cpu(blp[mid].hashval) == hash); ASSERT(args->op_flags & XFS_DA_OP_OKNOENT); - /* - * Here, we can only be doing a lookup (not a rename or replace). - * If a case-insensitive match was found earlier, return success. - */ - if (args->cmpresult == XFS_CMP_CASE) - return 0; + ASSERT(args->cmpresult == XFS_CMP_DIFFERENT); /* * No match, release the buffer and return ENOENT. */ diff --git a/libxfs/xfs_dir2_leaf.c b/libxfs/xfs_dir2_leaf.c index 8e0cbc9..b1901d3 100644 --- a/libxfs/xfs_dir2_leaf.c +++ b/libxfs/xfs_dir2_leaf.c @@ -1246,7 +1246,6 @@ xfs_dir2_leaf_lookup_int( xfs_mount_t *mp; /* filesystem mount point */ xfs_dir2_db_t newdb; /* new data block number */ xfs_trans_t *tp; /* transaction pointer */ - xfs_dir2_db_t cidb = -1; /* case match data block no. */ enum xfs_dacmp cmp; /* name compare result */ struct xfs_dir2_leaf_entry *ents; struct xfs_dir3_icleaf_hdr leafhdr; @@ -1307,47 +1306,22 @@ xfs_dir2_leaf_lookup_int( dep = (xfs_dir2_data_entry_t *)((char *)dbp->b_addr + xfs_dir2_dataptr_to_off(mp, be32_to_cpu(lep->address))); /* - * Compare name and if it's an exact match, return the index - * and buffer. If it's the first case-insensitive match, store - * the index and buffer and continue looking for an exact match. + * Compare name and if it's a match, return the index + * and buffer. */ cmp = mp->m_dirnameops->compname(args, dep->name, dep->namelen); - if (cmp != XFS_CMP_DIFFERENT && cmp != args->cmpresult) { + if (cmp != XFS_CMP_DIFFERENT) { args->cmpresult = cmp; *indexp = index; - /* case exact match: return the current buffer. 
*/ - if (cmp == XFS_CMP_EXACT) { - *dbpp = dbp; - return 0; - } - cidb = curdb; + *dbpp = dbp; + return 0; } } ASSERT(args->op_flags & XFS_DA_OP_OKNOENT); - /* - * Here, we can only be doing a lookup (not a rename or remove). - * If a case-insensitive match was found earlier, re-read the - * appropriate data block if required and return it. - */ - if (args->cmpresult == XFS_CMP_CASE) { - ASSERT(cidb != -1); - if (cidb != curdb) { - xfs_trans_brelse(tp, dbp); - error = xfs_dir3_data_read(tp, dp, - xfs_dir2_db_to_da(mp, cidb), - -1, &dbp); - if (error) { - xfs_trans_brelse(tp, lbp); - return error; - } - } - *dbpp = dbp; - return 0; - } + ASSERT(args->cmpresult == XFS_CMP_DIFFERENT); /* * No match found, return ENOENT. */ - ASSERT(cidb == -1); if (dbp) xfs_trans_brelse(tp, dbp); xfs_trans_brelse(tp, lbp); diff --git a/libxfs/xfs_dir2_node.c b/libxfs/xfs_dir2_node.c index 3737e4e..fb27506 100644 --- a/libxfs/xfs_dir2_node.c +++ b/libxfs/xfs_dir2_node.c @@ -702,6 +702,7 @@ xfs_dir2_leafn_lookup_for_entry( xfs_dir2_db_t curdb = -1; /* current data block number */ xfs_dir2_data_entry_t *dep; /* data block entry */ xfs_inode_t *dp; /* incore directory inode */ + int di = -1; /* data entry index */ int error; /* error return value */ int index; /* leaf entry index */ xfs_dir2_leaf_t *leaf; /* leaf structure */ @@ -733,6 +734,7 @@ xfs_dir2_leafn_lookup_for_entry( if (state->extravalid) { curbp = state->extrablk.bp; curdb = state->extrablk.blkno; + di = state->extrablk.index; } /* * Loop over leaf entries with the right hash value. 
@@ -757,27 +759,20 @@ xfs_dir2_leafn_lookup_for_entry( */ if (newdb != curdb) { /* - * If we had a block before that we aren't saving - * for a CI name, drop it + * If we had a block, drop it */ - if (curbp && (args->cmpresult == XFS_CMP_DIFFERENT || - curdb != state->extrablk.blkno)) + if (curbp) { xfs_trans_brelse(tp, curbp); + di = -1; + } /* - * If needing the block that is saved with a CI match, - * use it otherwise read in the new data block. + * Read in the new data block. */ - if (args->cmpresult != XFS_CMP_DIFFERENT && - newdb == state->extrablk.blkno) { - ASSERT(state->extravalid); - curbp = state->extrablk.bp; - } else { - error = xfs_dir3_data_read(tp, dp, - xfs_dir2_db_to_da(mp, newdb), - -1, &curbp); - if (error) - return error; - } + error = xfs_dir3_data_read(tp, dp, + xfs_dir2_db_to_da(mp, newdb), + -1, &curbp); + if (error) + return error; xfs_dir3_data_check(dp, curbp); curdb = newdb; } @@ -787,53 +782,36 @@ xfs_dir2_leafn_lookup_for_entry( dep = (xfs_dir2_data_entry_t *)((char *)curbp->b_addr + xfs_dir2_dataptr_to_off(mp, be32_to_cpu(lep->address))); /* - * Compare the entry and if it's an exact match, return - * EEXIST immediately. If it's the first case-insensitive - * match, store the block & inode number and continue looking. + * Compare the entry and if it's a match, return + * EEXIST immediately. 
*/ cmp = mp->m_dirnameops->compname(args, dep->name, dep->namelen); - if (cmp != XFS_CMP_DIFFERENT && cmp != args->cmpresult) { - /* If there is a CI match block, drop it */ - if (args->cmpresult != XFS_CMP_DIFFERENT && - curdb != state->extrablk.blkno) - xfs_trans_brelse(tp, state->extrablk.bp); + if (cmp != XFS_CMP_DIFFERENT) { args->cmpresult = cmp; args->inumber = be64_to_cpu(dep->inumber); args->filetype = xfs_dir3_dirent_get_ftype(mp, dep); - *indexp = index; - state->extravalid = 1; - state->extrablk.bp = curbp; - state->extrablk.blkno = curdb; - state->extrablk.index = (int)((char *)dep - - (char *)curbp->b_addr); - state->extrablk.magic = XFS_DIR2_DATA_MAGIC; - curbp->b_ops = &xfs_dir3_data_buf_ops; - xfs_trans_buf_set_type(tp, curbp, XFS_BLFT_DIR_DATA_BUF); - if (cmp == XFS_CMP_EXACT) - return XFS_ERROR(EEXIST); + error = EEXIST; + goto out; } } + /* Didn't find a match */ + error = ENOENT; ASSERT(index == leafhdr.count || (args->op_flags & XFS_DA_OP_OKNOENT)); +out: if (curbp) { - if (args->cmpresult == XFS_CMP_DIFFERENT) { - /* Giving back last used data block. */ - state->extravalid = 1; - state->extrablk.bp = curbp; - state->extrablk.index = -1; - state->extrablk.blkno = curdb; - state->extrablk.magic = XFS_DIR2_DATA_MAGIC; - curbp->b_ops = &xfs_dir3_data_buf_ops; - xfs_trans_buf_set_type(tp, curbp, XFS_BLFT_DIR_DATA_BUF); - } else { - /* If the curbp is not the CI match block, drop it */ - if (state->extrablk.bp != curbp) - xfs_trans_brelse(tp, curbp); - } + /* Giving back last used data block. 
*/ + state->extravalid = 1; + state->extrablk.bp = curbp; + state->extrablk.index = di; + state->extrablk.blkno = curdb; + state->extrablk.magic = XFS_DIR2_DATA_MAGIC; + curbp->b_ops = &xfs_dir3_data_buf_ops; + xfs_trans_buf_set_type(tp, curbp, XFS_BLFT_DIR_DATA_BUF); } else { state->extravalid = 0; } *indexp = index; - return XFS_ERROR(ENOENT); + return XFS_ERROR(error); } /* diff --git a/libxfs/xfs_dir2_sf.c b/libxfs/xfs_dir2_sf.c index 7580333..7b01d43 100644 --- a/libxfs/xfs_dir2_sf.c +++ b/libxfs/xfs_dir2_sf.c @@ -833,13 +833,12 @@ xfs_dir2_sf_lookup( for (i = 0, sfep = xfs_dir2_sf_firstentry(sfp); i < sfp->count; i++, sfep = xfs_dir3_sf_nextentry(dp->i_mount, sfp, sfep)) { /* - * Compare name and if it's an exact match, return the inode - * number. If it's the first case-insensitive match, store the - * inode number and continue looking for an exact match. + * Compare name and if it's a match, return the inode + * number. */ cmp = dp->i_mount->m_dirnameops->compname(args, sfep->name, sfep->namelen); - if (cmp != XFS_CMP_DIFFERENT && cmp != args->cmpresult) { + if (cmp != XFS_CMP_DIFFERENT) { args->cmpresult = cmp; args->inumber = xfs_dir3_sfe_get_ino(dp->i_mount, sfp, sfep); @@ -848,6 +847,7 @@ xfs_dir2_sf_lookup( if (cmp == XFS_CMP_EXACT) return XFS_ERROR(EEXIST); ci_sfep = sfep; + break; } } ASSERT(args->op_flags & XFS_DA_OP_OKNOENT); -- 1.7.12.4 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 84+ messages in thread
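Editorial sketch, not part of the posted series: the patch above changes the lookup loops so that the first entry whose comparison is anything other than "different" ends the search, instead of remembering a case-insensitive hit and continuing to look for an exact one. The fragment below shows that behaviour over a flat array of names. The `sk_` names are hypothetical, and the comparison only folds ASCII case, in the spirit of `xfs_ascii_ci_compname`.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum sk_cmp { SK_CMP_DIFFERENT, SK_CMP_EXACT, SK_CMP_CASE };

/* Compare two names, reporting whether they match exactly or only by case. */
static enum sk_cmp sk_ci_compname(const char *want, const char *have)
{
	enum sk_cmp result = SK_CMP_EXACT;
	size_t len = strlen(want);

	if (strlen(have) != len)
		return SK_CMP_DIFFERENT;
	for (size_t i = 0; i < len; i++) {
		if (want[i] == have[i])
			continue;
		if (tolower((unsigned char)want[i]) !=
		    tolower((unsigned char)have[i]))
			return SK_CMP_DIFFERENT;
		result = SK_CMP_CASE;
	}
	return result;
}

/*
 * Return the index of the first entry that matches at all, or -1.
 * *cmp receives the kind of match that was found.
 */
static int sk_lookup_first(const char *const *entries, size_t count,
			   const char *want, enum sk_cmp *cmp)
{
	assert(cmp != NULL);
	*cmp = SK_CMP_DIFFERENT;
	for (size_t i = 0; i < count; i++) {
		enum sk_cmp c = sk_ci_compname(want, entries[i]);

		if (c != SK_CMP_DIFFERENT) {
			*cmp = c;
			return (int)i;
		}
	}
	return -1;
}
```

A directory holding both "Makefile" and "makefile" would make the two strategies differ: looking up "makefile" now returns the earlier "Makefile" entry. The commit message argues that normal use of a case-insensitive filesystem cannot create such a pair, which is why stopping at the first match is considered safe.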
* [PATCH 02/13] libxfs: rename XFS_CMP_CASE to XFS_CMP_MATCH 2014-09-18 20:31 ` [PATCH 00/13] xfsprogs: Unicode/UTF-8 support for XFS Ben Myers 2014-09-18 20:33 ` [PATCH 01/13] libxfs: return the first match during case-insensitive lookup Ben Myers @ 2014-09-18 20:33 ` Ben Myers 2014-09-18 20:34 ` [PATCH 03/13] libxfs: add xfs_nameops.normhash Ben Myers ` (12 subsequent siblings) 14 siblings, 0 replies; 84+ messages in thread From: Ben Myers @ 2014-09-18 20:33 UTC (permalink / raw) To: linux-fsdevel; +Cc: tinguely, olaf, xfs From: Olaf Weber <olaf@sgi.com> Rename XFS_CMP_CASE to XFS_CMP_MATCH. With unicode filenames and normalization, different strings will match on other criteria than case insensitivity. Signed-off-by: Olaf Weber <olaf@sgi.com> --- include/xfs_da_btree.h | 2 +- libxfs/xfs_dir2.c | 9 ++++++--- libxfs/xfs_dir2_node.c | 2 +- 3 files changed, 8 insertions(+), 5 deletions(-) diff --git a/include/xfs_da_btree.h b/include/xfs_da_btree.h index e492dca..3d9f9dd 100644 --- a/include/xfs_da_btree.h +++ b/include/xfs_da_btree.h @@ -34,7 +34,7 @@ struct zone; enum xfs_dacmp { XFS_CMP_DIFFERENT, /* names are completely different */ XFS_CMP_EXACT, /* names are exactly the same */ - XFS_CMP_CASE /* names are same but differ in case */ + XFS_CMP_MATCH /* names are same but differ in encoding */ }; /* diff --git a/libxfs/xfs_dir2.c b/libxfs/xfs_dir2.c index 4c8c836..57e98a3 100644 --- a/libxfs/xfs_dir2.c +++ b/libxfs/xfs_dir2.c @@ -72,7 +72,7 @@ xfs_ascii_ci_compname( continue; if (tolower(args->name[i]) != tolower(name[i])) return XFS_CMP_DIFFERENT; - result = XFS_CMP_CASE; + result = XFS_CMP_MATCH; } return result; @@ -248,8 +248,11 @@ xfs_dir_cilookup_result( { if (args->cmpresult == XFS_CMP_DIFFERENT) return ENOENT; - if (args->cmpresult != XFS_CMP_CASE || - !(args->op_flags & XFS_DA_OP_CILOOKUP)) + if (args->cmpresult == XFS_CMP_EXACT) + return EEXIST; + ASSERT(args->cmpresult == XFS_CMP_MATCH); + /* Only dup the found name if XFS_DA_OP_CILOOKUP is set. 
*/ + if (!(args->op_flags & XFS_DA_OP_CILOOKUP)) return EEXIST; args->value = kmem_alloc(len, KM_NOFS | KM_MAYFAIL); diff --git a/libxfs/xfs_dir2_node.c b/libxfs/xfs_dir2_node.c index fb27506..550ca99 100644 --- a/libxfs/xfs_dir2_node.c +++ b/libxfs/xfs_dir2_node.c @@ -2034,7 +2034,7 @@ xfs_dir2_node_lookup( error = xfs_da3_node_lookup_int(state, &rval); if (error) rval = error; - else if (rval == ENOENT && args->cmpresult == XFS_CMP_CASE) { + else if (rval == ENOENT && args->cmpresult == XFS_CMP_MATCH) { /* If a CI match, dup the actual name and return EEXIST */ xfs_dir2_data_entry_t *dep; -- 1.7.12.4 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 84+ messages in thread
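Editorial sketch, not part of the posted series: after the rename, `xfs_dir_cilookup_result` distinguishes three outcomes, and only an inexact match with `XFS_DA_OP_CILOOKUP` set causes the name found on disk to be copied for the caller. The fragment below restates that decision in user space. The `sk_` names are hypothetical, `malloc` stands in for `kmem_alloc`, and returning `ENOMEM` on allocation failure is an assumption, since that part of the function is not shown in the hunk.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

enum sk_cmp { SK_CMP_DIFFERENT, SK_CMP_EXACT, SK_CMP_MATCH };

/*
 * Map a comparison result to a (positive) errno-style lookup result.
 * On an inexact match with want_name set, *out receives a freshly
 * allocated, NUL-terminated copy of the name that was found.
 */
static int sk_cilookup_result(enum sk_cmp cmp, int want_name,
			      const char *found, size_t len, char **out)
{
	assert(out != NULL);
	*out = NULL;
	if (cmp == SK_CMP_DIFFERENT)
		return ENOENT;
	if (cmp == SK_CMP_EXACT)
		return EEXIST;
	assert(cmp == SK_CMP_MATCH);
	/* Only duplicate the found name if the caller asked for it. */
	if (!want_name)
		return EEXIST;
	*out = malloc(len + 1);
	if (!*out)
		return ENOMEM;	/* assumption: mirrors a failed allocation */
	memcpy(*out, found, len);
	(*out)[len] = '\0';
	return EEXIST;
}
```

Splitting the exact case out first makes the remaining branch unambiguous, which is what lets the patched code assert that the result is `XFS_CMP_MATCH`.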
* [PATCH 03/13] libxfs: add xfs_nameops.normhash 2014-09-18 20:31 ` [PATCH 00/13] xfsprogs: Unicode/UTF-8 support for XFS Ben Myers 2014-09-18 20:33 ` [PATCH 01/13] libxfs: return the first match during case-insensitive lookup Ben Myers 2014-09-18 20:33 ` [PATCH 02/13] libxfs: rename XFS_CMP_CASE to XFS_CMP_MATCH Ben Myers @ 2014-09-18 20:34 ` Ben Myers 2014-09-18 20:35 ` Ben Myers ` (11 subsequent siblings) 14 siblings, 0 replies; 84+ messages in thread From: Ben Myers @ 2014-09-18 20:34 UTC (permalink / raw) To: linux-fsdevel; +Cc: tinguely, olaf, xfs From: Olaf Weber <olaf@sgi.com> Add a normhash callout to the xfs_nameops. This callout takes an xfs_da_args structure as its argument, and calculates a hash value over the name. It may in the process create a normalized form of the name, and assign that to the norm/normlen fields in the xfs_da_args structure. Changes: The pointer in kmem_free() was type converted to suppress compiler warnings. Signed-off-by: Olaf Weber <olaf@sgi.com> --- include/xfs_da_btree.h | 5 ++++- libxfs/xfs_da_btree.c | 9 ++++++++ libxfs/xfs_dir2.c | 56 +++++++++++++++++++++++++++++++++++++++----------- 3 files changed, 57 insertions(+), 13 deletions(-) diff --git a/include/xfs_da_btree.h b/include/xfs_da_btree.h index 3d9f9dd..06b50bf 100644 --- a/include/xfs_da_btree.h +++ b/include/xfs_da_btree.h @@ -42,7 +42,9 @@ enum xfs_dacmp { */ typedef struct xfs_da_args { const __uint8_t *name; /* string (maybe not NULL terminated) */ - int namelen; /* length of string (maybe no NULL) */ + const __uint8_t *norm; /* normalized name (may be NULL) */ + int namelen; /* length of string (maybe no NULL) */ + int normlen; /* length of normalized name */ __uint8_t filetype; /* filetype of inode for directories */ __uint8_t *value; /* set of bytes (maybe contain NULLs) */ int valuelen; /* length of value */ @@ -131,6 +133,7 @@ typedef struct xfs_da_state { */ struct xfs_nameops { xfs_dahash_t (*hashname)(struct xfs_name *); + int (*normhash)(struct 
xfs_da_args *); enum xfs_dacmp (*compname)(struct xfs_da_args *, const unsigned char *, int); }; diff --git a/libxfs/xfs_da_btree.c b/libxfs/xfs_da_btree.c index b731b54..eb97317 100644 --- a/libxfs/xfs_da_btree.c +++ b/libxfs/xfs_da_btree.c @@ -2000,8 +2000,17 @@ xfs_default_hashname( return xfs_da_hashname(name->name, name->len); } +STATIC int +xfs_da_normhash( + struct xfs_da_args *args) +{ + args->hashval = xfs_da_hashname(args->name, args->namelen); + return 0; +} + const struct xfs_nameops xfs_default_nameops = { .hashname = xfs_default_hashname, + .normhash = xfs_da_normhash, .compname = xfs_da_compname }; diff --git a/libxfs/xfs_dir2.c b/libxfs/xfs_dir2.c index 57e98a3..e52d082 100644 --- a/libxfs/xfs_dir2.c +++ b/libxfs/xfs_dir2.c @@ -54,6 +54,21 @@ xfs_ascii_ci_hashname( return hash; } +STATIC int +xfs_ascii_ci_normhash( + struct xfs_da_args *args) +{ + xfs_dahash_t hash; + int i; + + for (i = 0, hash = 0; i < args->namelen; i++) + hash = tolower(args->name[i]) ^ rol32(hash, 7); + + args->hashval = hash; + return 0; +} + + STATIC enum xfs_dacmp xfs_ascii_ci_compname( struct xfs_da_args *args, @@ -80,6 +95,7 @@ xfs_ascii_ci_compname( static struct xfs_nameops xfs_ascii_ci_nameops = { .hashname = xfs_ascii_ci_hashname, + .normhash = xfs_ascii_ci_normhash, .compname = xfs_ascii_ci_compname, }; @@ -211,7 +227,6 @@ xfs_dir_createname( args.name = name->name; args.namelen = name->len; args.filetype = name->type; - args.hashval = dp->i_mount->m_dirnameops->hashname(name); args.inumber = inum; args.dp = dp; args.firstblock = first; @@ -220,19 +235,24 @@ xfs_dir_createname( args.whichfork = XFS_DATA_FORK; args.trans = tp; args.op_flags = XFS_DA_OP_ADDNAME | XFS_DA_OP_OKNOENT; + if ((rval = dp->i_mount->m_dirnameops->normhash(&args))) + return rval; if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) rval = xfs_dir2_sf_addname(&args); else if ((rval = xfs_dir2_isblock(tp, dp, &v))) - return rval; + goto out_free; else if (v) rval = xfs_dir2_block_addname(&args); else if 
((rval = xfs_dir2_isleaf(tp, dp, &v))) - return rval; + goto out_free; else if (v) rval = xfs_dir2_leaf_addname(&args); else rval = xfs_dir2_node_addname(&args); +out_free: + if (args.norm) + kmem_free((void *)args.norm); return rval; } @@ -289,22 +309,23 @@ xfs_dir_lookup( args.name = name->name; args.namelen = name->len; args.filetype = name->type; - args.hashval = dp->i_mount->m_dirnameops->hashname(name); args.dp = dp; args.whichfork = XFS_DATA_FORK; args.trans = tp; args.op_flags = XFS_DA_OP_OKNOENT; if (ci_name) args.op_flags |= XFS_DA_OP_CILOOKUP; + if ((rval = dp->i_mount->m_dirnameops->normhash(&args))) + return rval; if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) rval = xfs_dir2_sf_lookup(&args); else if ((rval = xfs_dir2_isblock(tp, dp, &v))) - return rval; + goto out_free; else if (v) rval = xfs_dir2_block_lookup(&args); else if ((rval = xfs_dir2_isleaf(tp, dp, &v))) - return rval; + goto out_free; else if (v) rval = xfs_dir2_leaf_lookup(&args); else @@ -318,6 +339,9 @@ xfs_dir_lookup( ci_name->len = args.valuelen; } } +out_free: + if (args.norm) + kmem_free((void *)args.norm); return rval; } @@ -345,7 +369,6 @@ xfs_dir_removename( args.name = name->name; args.namelen = name->len; args.filetype = name->type; - args.hashval = dp->i_mount->m_dirnameops->hashname(name); args.inumber = ino; args.dp = dp; args.firstblock = first; @@ -353,19 +376,24 @@ xfs_dir_removename( args.total = total; args.whichfork = XFS_DATA_FORK; args.trans = tp; + if ((rval = dp->i_mount->m_dirnameops->normhash(&args))) + return rval; if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) rval = xfs_dir2_sf_removename(&args); else if ((rval = xfs_dir2_isblock(tp, dp, &v))) - return rval; + goto out_free; else if (v) rval = xfs_dir2_block_removename(&args); else if ((rval = xfs_dir2_isleaf(tp, dp, &v))) - return rval; + goto out_free; else if (v) rval = xfs_dir2_leaf_removename(&args); else rval = xfs_dir2_node_removename(&args); +out_free: + if (args.norm) + kmem_free((void *)args.norm); 
return rval; } @@ -395,7 +423,6 @@ xfs_dir_replace( args.name = name->name; args.namelen = name->len; args.filetype = name->type; - args.hashval = dp->i_mount->m_dirnameops->hashname(name); args.inumber = inum; args.dp = dp; args.firstblock = first; @@ -403,19 +430,24 @@ xfs_dir_replace( args.total = total; args.whichfork = XFS_DATA_FORK; args.trans = tp; + if ((rval = dp->i_mount->m_dirnameops->normhash(&args))) + return rval; if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) rval = xfs_dir2_sf_replace(&args); else if ((rval = xfs_dir2_isblock(tp, dp, &v))) - return rval; + goto out_free; else if (v) rval = xfs_dir2_block_replace(&args); else if ((rval = xfs_dir2_isleaf(tp, dp, &v))) - return rval; + goto out_free; else if (v) rval = xfs_dir2_leaf_replace(&args); else rval = xfs_dir2_node_replace(&args); +out_free: + if (args.norm) + kmem_free((void *)args.norm); return rval; } -- 1.7.12.4 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 84+ messages in thread
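Editorial sketch, not part of the posted series: the normhash callout introduced above fills in the hash on the argument structure and may attach an allocated normalized copy of the name, which the caller then has to free on every exit path. The fragment below shows that contract with two interchangeable callouts. All `sk_` names are hypothetical, and the folding variant uses ASCII lower-casing as a stand-in for Unicode normalization; the ASCII case-insensitive callout in the patch itself does not allocate anything.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct sk_args {
	const unsigned char *name;	/* name as supplied by the caller */
	size_t namelen;
	unsigned char *norm;		/* normalized copy, may stay NULL */
	size_t normlen;
	uint32_t hashval;
};

struct sk_nameops {
	int (*normhash)(struct sk_args *);
};

static uint32_t sk_hash(const unsigned char *s, size_t len)
{
	uint32_t hash = 0;

	for (size_t i = 0; i < len; i++)
		hash = s[i] ^ ((hash << 7) | (hash >> 25));
	return hash;
}

/* Default callout: hash the raw bytes, never allocate. */
static int sk_default_normhash(struct sk_args *args)
{
	args->hashval = sk_hash(args->name, args->namelen);
	return 0;
}

/* Folding callout: build a lower-cased copy and hash that instead. */
static int sk_folding_normhash(struct sk_args *args)
{
	unsigned char *norm = malloc(args->namelen ? args->namelen : 1);

	if (!norm)
		return ENOMEM;
	for (size_t i = 0; i < args->namelen; i++)
		norm[i] = (unsigned char)tolower(args->name[i]);
	args->norm = norm;
	args->normlen = args->namelen;
	args->hashval = sk_hash(norm, args->normlen);
	return 0;
}

/* Caller side: run the callout, use the hash, then release norm. */
static int sk_lookup_hash(const struct sk_nameops *ops,
			  const unsigned char *name, size_t len,
			  uint32_t *hash_out)
{
	struct sk_args args = { 0 };
	int rval;

	assert(ops != NULL && hash_out != NULL);
	args.name = name;
	args.namelen = len;
	rval = ops->normhash(&args);
	if (rval)
		return rval;
	*hash_out = args.hashval;
	free(args.norm);	/* free(NULL) is fine when nothing was attached */
	return 0;
}
```

This is why the patch replaces the early `return rval;` statements with `goto out_free;`: once the callout has run, every later exit has to pass through the cleanup.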
* [PATCH 04/13] libxfs: change interface of xfs_nameops.normhash 2014-09-18 20:31 ` [PATCH 00/13] xfsprogs: Unicode/UTF-8 support for XFS Ben Myers @ 2014-09-18 20:35 ` Ben Myers 2014-09-18 20:33 ` [PATCH 02/13] libxfs: rename XFS_CMP_CASE to XFS_CMP_MATCH Ben Myers ` (13 subsequent siblings) 14 siblings, 0 replies; 84+ messages in thread From: Ben Myers @ 2014-09-18 20:35 UTC (permalink / raw) To: linux-fsdevel; +Cc: xfs, olaf, tinguely From: Olaf Weber <olaf@sgi.com> With the introduction of the xfs_nameops.normhash callout, all uses of the hashname callout now occur in places where an xfs_name structure must be explicitly created just to match the parameter passing convention of this callout. Change the arguments to a const unsigned char * and int instead. Signed-off-by: Olaf Weber <olaf@sgi.com> --- db/check.c | 6 ++---- include/xfs_da_btree.h | 2 +- libxfs/xfs_da_btree.c | 9 +-------- libxfs/xfs_dir2.c | 10 ++++++---- libxfs/xfs_dir2_block.c | 5 +---- libxfs/xfs_dir2_data.c | 6 ++---- repair/phase6.c | 2 +- 7 files changed, 14 insertions(+), 26 deletions(-) diff --git a/db/check.c b/db/check.c index 4fd9fd0..49359d7 100644 --- a/db/check.c +++ b/db/check.c @@ -2212,7 +2212,6 @@ process_data_dir_v2( int stale = 0; int tag_err; __be16 *tagp; - struct xfs_name xname; data = iocur_top->data; block = iocur_top->data; @@ -2323,9 +2322,8 @@ process_data_dir_v2( tag_err += be16_to_cpu(*tagp) != (char *)dep - (char *)data; addr = xfs_dir2_db_off_to_dataptr(mp, db, (char *)dep - (char *)data); - xname.name = dep->name; - xname.len = dep->namelen; - dir_hash_add(mp->m_dirnameops->hashname(&xname), addr); + dir_hash_add(mp->m_dirnameops->hashname(dep->name, + dep->namelen), addr); ptr += xfs_dir3_data_entsize(mp, dep->namelen); count++; lastfree = 0; diff --git a/include/xfs_da_btree.h b/include/xfs_da_btree.h index 06b50bf..9674bed 100644 --- a/include/xfs_da_btree.h +++ b/include/xfs_da_btree.h @@ -132,7 +132,7 @@ typedef struct xfs_da_state { * Name ops for directory 
and/or attr name operations */ struct xfs_nameops { - xfs_dahash_t (*hashname)(struct xfs_name *); + xfs_dahash_t (*hashname)(const unsigned char *, int); int (*normhash)(struct xfs_da_args *); enum xfs_dacmp (*compname)(struct xfs_da_args *, const unsigned char *, int); diff --git a/libxfs/xfs_da_btree.c b/libxfs/xfs_da_btree.c index eb97317..7be5eaf 100644 --- a/libxfs/xfs_da_btree.c +++ b/libxfs/xfs_da_btree.c @@ -1993,13 +1993,6 @@ xfs_da_compname( XFS_CMP_EXACT : XFS_CMP_DIFFERENT; } -static xfs_dahash_t -xfs_default_hashname( - struct xfs_name *name) -{ - return xfs_da_hashname(name->name, name->len); -} - STATIC int xfs_da_normhash( struct xfs_da_args *args) @@ -2009,7 +2002,7 @@ xfs_da_normhash( } const struct xfs_nameops xfs_default_nameops = { - .hashname = xfs_default_hashname, + .hashname = xfs_da_hashname, .normhash = xfs_da_normhash, .compname = xfs_da_compname }; diff --git a/libxfs/xfs_dir2.c b/libxfs/xfs_dir2.c index e52d082..1893931 100644 --- a/libxfs/xfs_dir2.c +++ b/libxfs/xfs_dir2.c @@ -43,13 +43,14 @@ const unsigned char xfs_mode_to_ftype[S_IFMT >> S_SHIFT] = { */ STATIC xfs_dahash_t xfs_ascii_ci_hashname( - struct xfs_name *name) + const unsigned char *name, + int len) { xfs_dahash_t hash; int i; - for (i = 0, hash = 0; i < name->len; i++) - hash = tolower(name->name[i]) ^ rol32(hash, 7); + for (i = 0, hash = 0; i < len; i++) + hash = tolower(name[i]) ^ rol32(hash, 7); return hash; } @@ -475,7 +476,8 @@ xfs_dir_canenter( args.name = name->name; args.namelen = name->len; args.filetype = name->type; - args.hashval = dp->i_mount->m_dirnameops->hashname(name); + args.hashval = dp->i_mount->m_dirnameops->hashname(name->name, + name->len); args.dp = dp; args.whichfork = XFS_DATA_FORK; args.trans = tp; diff --git a/libxfs/xfs_dir2_block.c b/libxfs/xfs_dir2_block.c index 2880431..1a8b5f5 100644 --- a/libxfs/xfs_dir2_block.c +++ b/libxfs/xfs_dir2_block.c @@ -1047,7 +1047,6 @@ xfs_dir2_sf_to_block( xfs_dir2_sf_hdr_t *sfp; /* shortform header */ __be16 
*tagp; /* end of data entry */ xfs_trans_t *tp; /* transaction pointer */ - struct xfs_name name; struct xfs_ifork *ifp; trace_xfs_dir2_sf_to_block(args); @@ -1205,10 +1204,8 @@ xfs_dir2_sf_to_block( tagp = xfs_dir3_data_entry_tag_p(mp, dep); *tagp = cpu_to_be16((char *)dep - (char *)hdr); xfs_dir2_data_log_entry(tp, bp, dep); - name.name = sfep->name; - name.len = sfep->namelen; blp[2 + i].hashval = cpu_to_be32(mp->m_dirnameops-> - hashname(&name)); + hashname(sfep->name, sfep->namelen)); blp[2 + i].address = cpu_to_be32(xfs_dir2_byte_to_dataptr(mp, (char *)dep - (char *)hdr)); offset = (int)((char *)(tagp + 1) - (char *)hdr); diff --git a/libxfs/xfs_dir2_data.c b/libxfs/xfs_dir2_data.c index dc9df4d..9b3f750 100644 --- a/libxfs/xfs_dir2_data.c +++ b/libxfs/xfs_dir2_data.c @@ -46,7 +46,6 @@ __xfs_dir3_data_check( xfs_mount_t *mp; /* filesystem mount point */ char *p; /* current data position */ int stale; /* count of stale leaves */ - struct xfs_name name; mp = bp->b_target->bt_mount; hdr = bp->b_addr; @@ -142,9 +141,8 @@ __xfs_dir3_data_check( addr = xfs_dir2_db_off_to_dataptr(mp, mp->m_dirdatablk, (xfs_dir2_data_aoff_t) ((char *)dep - (char *)hdr)); - name.name = dep->name; - name.len = dep->namelen; - hash = mp->m_dirnameops->hashname(&name); + hash = mp->m_dirnameops-> + hashname(dep->name, dep->namelen); for (i = 0; i < be32_to_cpu(btp->count); i++) { if (be32_to_cpu(lep[i].address) == addr && be32_to_cpu(lep[i].hashval) == hash) diff --git a/repair/phase6.c b/repair/phase6.c index f13069f..f374fd0 100644 --- a/repair/phase6.c +++ b/repair/phase6.c @@ -195,7 +195,7 @@ dir_hash_add( dup = 0; if (!junk) { - hash = mp->m_dirnameops->hashname(&xname); + hash = mp->m_dirnameops->hashname(name, namelen); byhash = DIR_HASH_FUNC(hashtab, hash); /* -- 1.7.12.4 ^ permalink raw reply related [flat|nested] 84+ messages in thread
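Editorial sketch, not part of the posted series: the interface change above lets callers that already hold a pointer and a length call the hashname callout directly, without filling in a temporary `struct xfs_name` first. The fragment below shows the new calling convention with a hash written in the style of `xfs_ascii_ci_hashname`. The `sk_` names, including the directory entry layout, are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* The callout now takes the bytes and their length directly. */
typedef uint32_t (*sk_hashname_fn)(const unsigned char *name, int len);

struct sk_nameops {
	sk_hashname_fn hashname;
};

static uint32_t sk_rol32(uint32_t w, unsigned int s)
{
	return (w << s) | (w >> (32 - s));
}

/* ASCII case-insensitive rolling hash over (name, len). */
static uint32_t sk_ci_hashname(const unsigned char *name, int len)
{
	uint32_t hash = 0;

	assert(name != NULL || len == 0);
	for (int i = 0; i < len; i++)
		hash = (uint32_t)tolower(name[i]) ^ sk_rol32(hash, 7);
	return hash;
}

/* Toy directory entry: a length byte followed by the name bytes. */
struct sk_dirent {
	unsigned char namelen;
	unsigned char name[16];
};

/* Call site: no intermediate name structure has to be built. */
static uint32_t sk_entry_hash(const struct sk_nameops *ops,
			      const struct sk_dirent *dep)
{
	return ops->hashname(dep->name, dep->namelen);
}
```

The same signature is what allows the patch to drop the `xfs_default_hashname` wrapper and install `xfs_da_hashname` in the default nameops directly.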
* [PATCH 04/13] libxfs: change interface of xfs_nameops.normhash @ 2014-09-18 20:35 ` Ben Myers 0 siblings, 0 replies; 84+ messages in thread From: Ben Myers @ 2014-09-18 20:35 UTC (permalink / raw) To: linux-fsdevel; +Cc: tinguely, olaf, xfs From: Olaf Weber <olaf@sgi.com> With the introduction of the xfs_nameops.normhash callout, all uses of the hashname callout now occur in places where an xfs_name structure must be explicitly created just to match the parameter passing convention of this callout. Change the arguments to a const unsigned char * and int instead. Signed-off-by: Olaf Weber <olaf@sgi.com> --- db/check.c | 6 ++---- include/xfs_da_btree.h | 2 +- libxfs/xfs_da_btree.c | 9 +-------- libxfs/xfs_dir2.c | 10 ++++++---- libxfs/xfs_dir2_block.c | 5 +---- libxfs/xfs_dir2_data.c | 6 ++---- repair/phase6.c | 2 +- 7 files changed, 14 insertions(+), 26 deletions(-) diff --git a/db/check.c b/db/check.c index 4fd9fd0..49359d7 100644 --- a/db/check.c +++ b/db/check.c @@ -2212,7 +2212,6 @@ process_data_dir_v2( int stale = 0; int tag_err; __be16 *tagp; - struct xfs_name xname; data = iocur_top->data; block = iocur_top->data; @@ -2323,9 +2322,8 @@ process_data_dir_v2( tag_err += be16_to_cpu(*tagp) != (char *)dep - (char *)data; addr = xfs_dir2_db_off_to_dataptr(mp, db, (char *)dep - (char *)data); - xname.name = dep->name; - xname.len = dep->namelen; - dir_hash_add(mp->m_dirnameops->hashname(&xname), addr); + dir_hash_add(mp->m_dirnameops->hashname(dep->name, + dep->namelen), addr); ptr += xfs_dir3_data_entsize(mp, dep->namelen); count++; lastfree = 0; diff --git a/include/xfs_da_btree.h b/include/xfs_da_btree.h index 06b50bf..9674bed 100644 --- a/include/xfs_da_btree.h +++ b/include/xfs_da_btree.h @@ -132,7 +132,7 @@ typedef struct xfs_da_state { * Name ops for directory and/or attr name operations */ struct xfs_nameops { - xfs_dahash_t (*hashname)(struct xfs_name *); + xfs_dahash_t (*hashname)(const unsigned char *, int); int (*normhash)(struct xfs_da_args *); enum 
xfs_dacmp (*compname)(struct xfs_da_args *, const unsigned char *, int); diff --git a/libxfs/xfs_da_btree.c b/libxfs/xfs_da_btree.c index eb97317..7be5eaf 100644 --- a/libxfs/xfs_da_btree.c +++ b/libxfs/xfs_da_btree.c @@ -1993,13 +1993,6 @@ xfs_da_compname( XFS_CMP_EXACT : XFS_CMP_DIFFERENT; } -static xfs_dahash_t -xfs_default_hashname( - struct xfs_name *name) -{ - return xfs_da_hashname(name->name, name->len); -} - STATIC int xfs_da_normhash( struct xfs_da_args *args) @@ -2009,7 +2002,7 @@ xfs_da_normhash( } const struct xfs_nameops xfs_default_nameops = { - .hashname = xfs_default_hashname, + .hashname = xfs_da_hashname, .normhash = xfs_da_normhash, .compname = xfs_da_compname }; diff --git a/libxfs/xfs_dir2.c b/libxfs/xfs_dir2.c index e52d082..1893931 100644 --- a/libxfs/xfs_dir2.c +++ b/libxfs/xfs_dir2.c @@ -43,13 +43,14 @@ const unsigned char xfs_mode_to_ftype[S_IFMT >> S_SHIFT] = { */ STATIC xfs_dahash_t xfs_ascii_ci_hashname( - struct xfs_name *name) + const unsigned char *name, + int len) { xfs_dahash_t hash; int i; - for (i = 0, hash = 0; i < name->len; i++) - hash = tolower(name->name[i]) ^ rol32(hash, 7); + for (i = 0, hash = 0; i < len; i++) + hash = tolower(name[i]) ^ rol32(hash, 7); return hash; } @@ -475,7 +476,8 @@ xfs_dir_canenter( args.name = name->name; args.namelen = name->len; args.filetype = name->type; - args.hashval = dp->i_mount->m_dirnameops->hashname(name); + args.hashval = dp->i_mount->m_dirnameops->hashname(name->name, + name->len); args.dp = dp; args.whichfork = XFS_DATA_FORK; args.trans = tp; diff --git a/libxfs/xfs_dir2_block.c b/libxfs/xfs_dir2_block.c index 2880431..1a8b5f5 100644 --- a/libxfs/xfs_dir2_block.c +++ b/libxfs/xfs_dir2_block.c @@ -1047,7 +1047,6 @@ xfs_dir2_sf_to_block( xfs_dir2_sf_hdr_t *sfp; /* shortform header */ __be16 *tagp; /* end of data entry */ xfs_trans_t *tp; /* transaction pointer */ - struct xfs_name name; struct xfs_ifork *ifp; trace_xfs_dir2_sf_to_block(args); @@ -1205,10 +1204,8 @@ 
xfs_dir2_sf_to_block( tagp = xfs_dir3_data_entry_tag_p(mp, dep); *tagp = cpu_to_be16((char *)dep - (char *)hdr); xfs_dir2_data_log_entry(tp, bp, dep); - name.name = sfep->name; - name.len = sfep->namelen; blp[2 + i].hashval = cpu_to_be32(mp->m_dirnameops-> - hashname(&name)); + hashname(sfep->name, sfep->namelen)); blp[2 + i].address = cpu_to_be32(xfs_dir2_byte_to_dataptr(mp, (char *)dep - (char *)hdr)); offset = (int)((char *)(tagp + 1) - (char *)hdr); diff --git a/libxfs/xfs_dir2_data.c b/libxfs/xfs_dir2_data.c index dc9df4d..9b3f750 100644 --- a/libxfs/xfs_dir2_data.c +++ b/libxfs/xfs_dir2_data.c @@ -46,7 +46,6 @@ __xfs_dir3_data_check( xfs_mount_t *mp; /* filesystem mount point */ char *p; /* current data position */ int stale; /* count of stale leaves */ - struct xfs_name name; mp = bp->b_target->bt_mount; hdr = bp->b_addr; @@ -142,9 +141,8 @@ __xfs_dir3_data_check( addr = xfs_dir2_db_off_to_dataptr(mp, mp->m_dirdatablk, (xfs_dir2_data_aoff_t) ((char *)dep - (char *)hdr)); - name.name = dep->name; - name.len = dep->namelen; - hash = mp->m_dirnameops->hashname(&name); + hash = mp->m_dirnameops-> + hashname(dep->name, dep->namelen); for (i = 0; i < be32_to_cpu(btp->count); i++) { if (be32_to_cpu(lep[i].address) == addr && be32_to_cpu(lep[i].hashval) == hash) diff --git a/repair/phase6.c b/repair/phase6.c index f13069f..f374fd0 100644 --- a/repair/phase6.c +++ b/repair/phase6.c @@ -195,7 +195,7 @@ dir_hash_add( dup = 0; if (!junk) { - hash = mp->m_dirnameops->hashname(&xname); + hash = mp->m_dirnameops->hashname(name, namelen); byhash = DIR_HASH_FUNC(hashtab, hash); /* -- 1.7.12.4 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 84+ messages in thread
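Editorial illustration, not part of the posted series: the calling convention introduced by patch 04/13 can be sketched in isolation as below. All names ending in _sketch, and the dahash_t typedef, are hypothetical stand-ins (for rol32(), struct xfs_nameops and xfs_dahash_t respectively) so that the fragment builds outside xfsprogs. Only the hash loop is taken from xfs_ascii_ci_hashname() in the diff above.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>

typedef uint32_t dahash_t;	/* stand-in for xfs_dahash_t (assumption) */

/* Rotate left by s bits (0 < s < 32); stand-in for the kernel's rol32(). */
static inline uint32_t rol32_sketch(uint32_t w, unsigned int s)
{
	return (w << s) | (w >> (32 - s));
}

/*
 * Callout table in the new style: hashname takes the name bytes and
 * their length directly, so callers no longer build a struct xfs_name.
 */
struct nameops_sketch {
	dahash_t (*hashname)(const unsigned char *name, int len);
};

/* Same loop as xfs_ascii_ci_hashname() after the patch. */
static dahash_t ascii_ci_hashname_sketch(const unsigned char *name, int len)
{
	dahash_t hash;
	int i;

	for (i = 0, hash = 0; i < len; i++)
		hash = tolower(name[i]) ^ rol32_sketch(hash, 7);
	return hash;
}

static const struct nameops_sketch ci_nameops_sketch = {
	.hashname = ascii_ci_hashname_sketch,
};
```

A caller that already holds a directory entry can pass its name pointer and length straight through, for example ci_nameops_sketch.hashname(name, len), which is the simplification the hunks in db/check.c and libxfs/xfs_dir2_data.c rely on.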
* [PATCH 05/13] libxfs: add a superblock feature bit to indicate UTF-8 support. 2014-09-18 20:31 ` [PATCH 00/13] xfsprogs: Unicode/UTF-8 support for XFS Ben Myers ` (3 preceding siblings ...) 2014-09-18 20:35 ` Ben Myers @ 2014-09-18 20:36 ` Ben Myers 2014-09-18 20:37 ` [PATCH 06/13] xfsprogs: add unicode character database files Ben Myers ` (9 subsequent siblings) 14 siblings, 0 replies; 84+ messages in thread From: Ben Myers @ 2014-09-18 20:36 UTC (permalink / raw) To: linux-fsdevel; +Cc: tinguely, olaf, xfs From: Olaf Weber <olaf@sgi.com> When UTF-8 support is enabled, the xfs_dir_ci_inode_operations must be installed. Add xfs_sb_version_hasci(), which tests both the borgbit and the utf8bit, and returns true if at least one of them is set. Replace calls to xfs_sb_version_hasasciici() as needed. Signed-off-by: Olaf Weber <olaf@sgi.com> --- include/xfs_fs.h | 2 +- include/xfs_sb.h | 25 ++++++++++++++++++++++++- 2 files changed, 25 insertions(+), 2 deletions(-) diff --git a/include/xfs_fs.h b/include/xfs_fs.h index 59c40fc..1be539d 100644 --- a/include/xfs_fs.h +++ b/include/xfs_fs.h @@ -239,7 +239,7 @@ typedef struct xfs_fsop_resblks { #define XFS_FSOP_GEOM_FLAGS_V5SB 0x8000 /* version 5 superblock */ #define XFS_FSOP_GEOM_FLAGS_FTYPE 0x10000 /* inode directory types */ #define XFS_FSOP_GEOM_FLAGS_FINOBT 0x20000 /* free inode btree */ - +#define XFS_FSOP_GEOM_FLAGS_UTF8 0x40000 /* utf8 filenames */ /* * Minimum and maximum sizes need for growth checks. 
diff --git a/include/xfs_sb.h b/include/xfs_sb.h index 950d1ea..5ac7f06 100644 --- a/include/xfs_sb.h +++ b/include/xfs_sb.h @@ -82,6 +82,8 @@ struct xfs_trans; #define XFS_SB_VERSION2_RESERVED4BIT 0x00000004 #define XFS_SB_VERSION2_ATTR2BIT 0x00000008 /* Inline attr rework */ #define XFS_SB_VERSION2_PARENTBIT 0x00000010 /* parent pointers */ +#define XFS_SB_VERSION2_PARENTBIT 0x00000010 /* parent pointers */ +#define XFS_SB_VERSION2_UTF8BIT 0x00000020 /* utf8 names */ #define XFS_SB_VERSION2_PROJID32BIT 0x00000080 /* 32 bit project id */ #define XFS_SB_VERSION2_CRCBIT 0x00000100 /* metadata CRCs */ #define XFS_SB_VERSION2_FTYPE 0x00000200 /* inode type in dir */ @@ -89,6 +91,7 @@ struct xfs_trans; #define XFS_SB_VERSION2_OKREALFBITS \ (XFS_SB_VERSION2_LAZYSBCOUNTBIT | \ XFS_SB_VERSION2_ATTR2BIT | \ + XFS_SB_VERSION2_UTF8BIT | \ XFS_SB_VERSION2_PROJID32BIT | \ XFS_SB_VERSION2_FTYPE) #define XFS_SB_VERSION2_OKSASHFBITS \ @@ -600,8 +603,10 @@ xfs_sb_has_ro_compat_feature( } #define XFS_SB_FEAT_INCOMPAT_FTYPE (1 << 0) /* filetype in dirent */ +#define XFS_SB_FEAT_INCOMPAT_UTF8 (1 << 1) /* utf-8 name support */ #define XFS_SB_FEAT_INCOMPAT_ALL \ - (XFS_SB_FEAT_INCOMPAT_FTYPE) + (XFS_SB_FEAT_INCOMPAT_FTYPE | \ + XFS_SB_FEAT_INCOMPAT_UTF8) #define XFS_SB_FEAT_INCOMPAT_UNKNOWN ~XFS_SB_FEAT_INCOMPAT_ALL static inline bool @@ -649,6 +654,24 @@ static inline int xfs_sb_version_hasfinobt(xfs_sb_t *sbp) (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_FINOBT); } +static inline int xfs_sb_version_hasutf8(xfs_sb_t *sbp) +{ + return (XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5 && + xfs_sb_has_incompat_feature(sbp, XFS_SB_FEAT_INCOMPAT_UTF8)) || + (xfs_sb_version_hasmorebits(sbp) && + (sbp->sb_features2 & XFS_SB_VERSION2_UTF8BIT)); +} + +/* + * Special case: there are a number of places where we need to test + * both the borgbit and the utf8bit, and take the same action if + * either of those is set. 
+ */ +static inline int xfs_sb_version_hasci(xfs_sb_t *sbp) +{ + return xfs_sb_version_hasasciici(sbp) || xfs_sb_version_hasutf8(sbp); +} + /* * end of superblock version macros */ -- 1.7.12.4 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 84+ messages in thread
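Editorial illustration, not part of the posted series: the "either bit" test that xfs_sb_version_hasci() performs can be sketched on a reduced superblock as below. struct sb_sketch and the sketch_* helpers are hypothetical. The UTF8BIT value is the one added by this patch, the BORGBIT value is an illustrative assumption, and the version 5 incompat-feature path and the morebits check from the real helper are deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* Reduced stand-in for xfs_sb_t; only the fields the sketch needs. */
struct sb_sketch {
	uint16_t versionnum;	/* stand-in for sb_versionnum */
	uint32_t features2;	/* stand-in for sb_features2 */
};

#define SKETCH_BORGBIT	0x4000u		/* ASCII-only CI bit; illustrative value */
#define SKETCH_UTF8BIT	0x00000020u	/* XFS_SB_VERSION2_UTF8BIT from the patch */

/* ASCII case-insensitive names enabled? */
static inline int sketch_hasasciici(const struct sb_sketch *sbp)
{
	return (sbp->versionnum & SKETCH_BORGBIT) != 0;
}

/* UTF-8 names enabled? (features2 path only; the v5 path is omitted) */
static inline int sketch_hasutf8(const struct sb_sketch *sbp)
{
	return (sbp->features2 & SKETCH_UTF8BIT) != 0;
}

/* True if either feature requires the case-insensitive directory ops. */
static inline int sketch_hasci(const struct sb_sketch *sbp)
{
	return sketch_hasasciici(sbp) || sketch_hasutf8(sbp);
}
```

Call sites that previously tested only the ASCII bit would switch to the combined helper, so that a filesystem with UTF-8 names but without ASCII case-insensitivity still takes the same code path.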
* [PATCH 06/13] xfsprogs: add unicode character database files 2014-09-18 20:31 ` [PATCH 00/13] xfsprogs: Unicode/UTF-8 support for XFS Ben Myers ` (4 preceding siblings ...) 2014-09-18 20:36 ` [PATCH 05/13] libxfs: add a superblock feature bit to indicate UTF-8 support Ben Myers @ 2014-09-18 20:37 ` Ben Myers 2014-09-18 20:38 ` [PATCH 07/13] libxfs: add trie generator and supporting code for UTF-8 Ben Myers ` (8 subsequent siblings) 14 siblings, 0 replies; 84+ messages in thread From: Ben Myers @ 2014-09-18 20:37 UTC (permalink / raw) To: linux-fsdevel; +Cc: tinguely, olaf, xfs From: Olaf Weber <olaf@sgi.com> Add files from the Unicode Character Database, version 7.0.0, to the source. A helper program that generates a trie used for normalization from these files is part of a separate commit. Signed-off-by: Olaf Weber <olaf@sgi.com> --- [v2: removed large unicode files. download them as below. -bpm] cd support/ucd-7.0.0 wget http://www.unicode.org/Public/7.0.0/ucd/CaseFolding.txt wget http://www.unicode.org/Public/7.0.0/ucd/DerivedAge.txt wget http://www.unicode.org/Public/7.0.0/ucd/extracted/DerivedCombiningClass.txt wget http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt wget http://www.unicode.org/Public/7.0.0/ucd/NormalizationCorrections.txt wget http://www.unicode.org/Public/7.0.0/ucd/NormalizationTest.txt wget http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt --- support/ucd-7.0.0/README | 33 +++++++++++++++++++++++++++++++++ 1 file changed, 33 insertions(+) create mode 100644 support/ucd-7.0.0/README diff --git a/support/ucd-7.0.0/README b/support/ucd-7.0.0/README new file mode 100644 index 0000000..d713e66 --- /dev/null +++ b/support/ucd-7.0.0/README @@ -0,0 +1,33 @@ +The files in this directory are part of the Unicode Character Database +for version 7.0.0 of the Unicode standard. 
+ +The full set of files can be found here: + + http://www.unicode.org/Public/7.0.0/ucd/ + +The latest released version of the UCD can be found here: + + http://www.unicode.org/Public/UCD/latest/ + +The files in this directory are identical, except that they have been +renamed with a suffix indicating the unicode version. + +Individual source links: + + http://www.unicode.org/Public/7.0.0/ucd/CaseFolding.txt + http://www.unicode.org/Public/7.0.0/ucd/DerivedAge.txt + http://www.unicode.org/Public/7.0.0/ucd/extracted/DerivedCombiningClass.txt + http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt + http://www.unicode.org/Public/7.0.0/ucd/NormalizationCorrections.txt + http://www.unicode.org/Public/7.0.0/ucd/NormalizationTest.txt + http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt + +md5sums + + 9a92b2bfe56c6719def926bab524fefd CaseFolding-7.0.0.txt + 07b8b1027eb824cf0835314e94f23d2e DerivedAge-7.0.0.txt + 90c3340b16821e2f2153acdbe6fc6180 DerivedCombiningClass-7.0.0.txt + c41c0601f808116f623de47110ed4f93 DerivedCoreProperties-7.0.0.txt + 522720ddfc150d8e63a2518634829bce NormalizationCorrections-7.0.0.txt + 1f35175eba4a2ad795db489f789ae352 NormalizationTest-7.0.0.txt + c8355655731d75e6a3de8c20d7e601ba UnicodeData-7.0.0.txt -- 1.7.12.4 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 84+ messages in thread
* [PATCH 07/13] libxfs: add trie generator and supporting code for UTF-8.
2014-09-18 20:31 ` [PATCH 00/13] xfsprogs: Unicode/UTF-8 support for XFS Ben Myers
` (5 preceding siblings ...)
2014-09-18 20:37 ` [PATCH 06/13] xfsprogs: add unicode character database files Ben Myers
@ 2014-09-18 20:38 ` Ben Myers
2014-09-18 20:38 ` [PATCH 08/13] libxfs: add xfs_nameops for utf8 and utf8+casefold Ben Myers
` (7 subsequent siblings)
14 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-18 20:38 UTC (permalink / raw)
To: linux-fsdevel; +Cc: tinguely, olaf, xfs
From: Olaf Weber <olaf@sgi.com>

mkutf8data.c is the source for a program that generates utf8data.h,
which contains the trie that utf8norm.c uses. The trie is generated
from the Unicode 7.0.0 data files. The format of the utf8data[] table
is described in utf8norm.c.

Supporting functions for UTF-8 normalization are in utf8norm.c with the
header utf8norm.h. Two normalization forms are supported: nfkdi and
nfkdicf.

nfkdi:
- Apply unicode normalization form NFKD.
- Remove any Default_Ignorable_Code_Point.

nfkdicf:
- Apply unicode normalization form NFKD.
- Remove any Default_Ignorable_Code_Point.
- Apply a full casefold (C + F).

For the purposes of the code, a string is valid UTF-8 if:
- The values encoded are 0x1..0x10FFFF.
- The surrogate codepoints 0xD800..0xDFFF are not encoded.
- The shortest possible encoding is used for all values.

The supporting functions work on NUL-terminated strings (utf8 prefix)
and on length-limited strings (utf8n prefix).
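
Editorial illustration, not part of the posted patch: the three validity criteria listed above can be sketched as an explicit check on a single encoded sequence. The function name utf8_seq_valid_sketch() is hypothetical. Note that the patch itself does not validate this way: according to the comments in utf8norm.c, a successful trie lookup is what establishes validity.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the length in bytes (1..4) of the UTF-8 sequence at s if it is
 * valid by the three criteria in the commit message, or 0 if it is not.
 * At most len bytes are examined.
 */
static int utf8_seq_valid_sketch(const unsigned char *s, size_t len)
{
	unsigned int cp;
	int n, i;

	if (len == 0)
		return 0;

	/* Classify the lead byte; 0x80..0xC1 and 0xF5..0xFF never lead. */
	if (s[0] < 0x80) {
		n = 1;
		cp = s[0];
	} else if (s[0] >= 0xC2 && s[0] <= 0xDF) {
		n = 2;
		cp = s[0] & 0x1F;
	} else if ((s[0] & 0xF0) == 0xE0) {
		n = 3;
		cp = s[0] & 0x0F;
	} else if (s[0] >= 0xF0 && s[0] <= 0xF4) {
		n = 4;
		cp = s[0] & 0x07;
	} else {
		return 0;
	}
	if ((size_t)n > len)
		return 0;

	/* Every following byte must be a continuation byte (10xxxxxx). */
	for (i = 1; i < n; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return 0;
		cp = (cp << 6) | (s[i] & 0x3F);
	}

	/* Criterion 1: encoded values are 0x1..0x10FFFF. */
	if (cp < 0x1 || cp > 0x10FFFF)
		return 0;
	/* Criterion 2: surrogates 0xD800..0xDFFF are not encoded. */
	if (cp >= 0xD800 && cp <= 0xDFFF)
		return 0;
	/* Criterion 3: shortest form (2-byte case is covered by the lead check). */
	if (n == 3 && cp < 0x800)
		return 0;
	if (n == 4 && cp < 0x10000)
		return 0;
	return n;
}
```

Under these rules the overlong pair 0xC0 0x80, the encoded surrogate 0xED 0xA0 0x80, and the out-of-range sequence 0xF4 0x90 0x80 0x80 are all rejected, while ordinary sequences return their byte length.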
Signed-off-by: Olaf Weber <olaf@sgi.com> --- include/utf8norm.h | 111 ++ libxfs/utf8norm.c | 628 ++++++++++ support/mkutf8data.c | 3232 ++++++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 3971 insertions(+) create mode 100644 include/utf8norm.h create mode 100644 libxfs/utf8norm.c create mode 100644 support/mkutf8data.c diff --git a/include/utf8norm.h b/include/utf8norm.h new file mode 100644 index 0000000..6aa3391 --- /dev/null +++ b/include/utf8norm.h @@ -0,0 +1,111 @@ +/* + * Copyright (c) 2014 SGI. + * All rights reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write the Free Software Foundation, + * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#ifndef UTF8NORM_H +#define UTF8NORM_H + +/* An opaque type used to determine the normalization in use. */ +typedef const struct utf8data *utf8data_t; + +/* Encoding a unicode version number as a single unsigned int. */ +#define UNICODE_MAJ_SHIFT (16) +#define UNICODE_MIN_SHIFT (8) + +#define UNICODE_AGE(MAJ,MIN,REV) \ + (((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) | \ + ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) | \ + ((unsigned int)(REV))) + +/* Highest unicode version supported by the data tables. */ +extern const unsigned int utf8version; + +/* + * Look for the correct utf8data_t for a unicode version. + * Returns NULL if the version requested is too new. + * + * Two normalization forms are supported: nfkdi and nfkdicf. + * + * nfkdi: + * - Apply unicode normalization form NFKD. 
+ *   - Remove any Default_Ignorable_Code_Point.
+ *
+ * nfkdicf:
+ *   - Apply unicode normalization form NFKD.
+ *   - Remove any Default_Ignorable_Code_Point.
+ *   - Apply a full casefold (C + F).
+ */
+extern utf8data_t utf8nfkdi(unsigned int);
+extern utf8data_t utf8nfkdicf(unsigned int);
+
+/*
+ * Determine the maximum age of any unicode character in the string.
+ * Returns 0 if only unassigned code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern int utf8agemax(utf8data_t, const char *);
+extern int utf8nagemax(utf8data_t, const char *, size_t);
+
+/*
+ * Determine the minimum age of any unicode character in the string.
+ * Returns 0 if any unassigned code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern int utf8agemin(utf8data_t, const char *);
+extern int utf8nagemin(utf8data_t, const char *, size_t);
+
+/*
+ * Determine the length of the normalized form of the string,
+ * excluding any terminating NUL byte.
+ * Returns 0 if only ignorable code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern ssize_t utf8len(utf8data_t, const char *);
+extern ssize_t utf8nlen(utf8data_t, const char *, size_t);
+
+/*
+ * Cursor structure used by the normalizer.
+ */
+struct utf8cursor {
+	utf8data_t	data;
+	const char	*s;
+	const char	*p;
+	const char	*ss;
+	const char	*sp;
+	unsigned int	len;
+	unsigned int	slen;
+	short int	ccc;
+	short int	nccc;
+};
+
+/*
+ * Initialize a utf8cursor to normalize a string.
+ * Returns 0 on success.
+ * Returns -1 on failure.
+ */
+extern int utf8cursor(struct utf8cursor *, utf8data_t, const char *);
+extern int utf8ncursor(struct utf8cursor *, utf8data_t, const char *, size_t);
+
+/*
+ * Get the next byte in the normalization.
+ * Returns a value > 0 && < 256 on success.
+ * Returns 0 when the end of the normalization is reached.
+ * Returns -1 if the string being normalized is not valid UTF-8.
+ */ +extern int utf8byte(struct utf8cursor *); + +#endif /* UTF8NORM_H */ diff --git a/libxfs/utf8norm.c b/libxfs/utf8norm.c new file mode 100644 index 0000000..6232d1a --- /dev/null +++ b/libxfs/utf8norm.c @@ -0,0 +1,628 @@ +/* + * Copyright (c) 2014 SGI. + * All rights reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write the Free Software Foundation, + * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include "xfs.h" +#include "xfs_types.h" +#include <utf8norm.h> + +struct utf8data { + unsigned int maxage; + unsigned int offset; +}; + +#define __INCLUDED_FROM_UTF8NORM_C__ +#include <utf8data.h> +#undef __INCLUDED_FROM_UTF8NORM_C__ + +/* + * UTF-8 valid ranges. + * + * The UTF-8 encoding spreads the bits of a 32bit word over several + * bytes. This table gives the ranges that can be held and how they'd + * be represented. + * + * 0x00000000 0x0000007F: 0xxxxxxx + * 0x00000000 0x000007FF: 110xxxxx 10xxxxxx + * 0x00000000 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx + * 0x00000000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx + * 0x00000000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + * 0x00000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + * + * There is an additional requirement on UTF-8, in that only the + * shortest representation of a 32bit value is to be used. A decoder + * must not decode sequences that do not satisfy this requirement. + * Thus the allowed ranges have a lower bound. 
+ *
+ *   0x00000000 0x0000007F: 0xxxxxxx
+ *   0x00000080 0x000007FF: 110xxxxx 10xxxxxx
+ *   0x00000800 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ *   0x00010000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *   0x00200000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *   0x04000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * Actual unicode characters are limited to the range 0x0 - 0x10FFFF,
+ * 17 planes of 65536 values. This limits the sequences actually seen
+ * even more, to just the following.
+ *
+ *   0       - 0x7F:     0                   - 0x7F
+ *   0x80    - 0x7FF:    0xC2 0x80           - 0xDF 0xBF
+ *   0x800   - 0xFFFF:   0xE0 0xA0 0x80      - 0xEF 0xBF 0xBF
+ *   0x10000 - 0x10FFFF: 0xF0 0x90 0x80 0x80 - 0xF4 0x8F 0xBF 0xBF
+ *
+ * Within those ranges the surrogates 0xD800 - 0xDFFF are not allowed.
+ *
+ * Note that the longest sequence seen with valid usage is 4 bytes,
+ * the same as a single UTF-32 character. This makes the UTF-8
+ * representation of Unicode never larger than UTF-32.
+ *
+ * The shortest sequence requirement was introduced by:
+ *    Corrigendum #1: UTF-8 Shortest Form
+ * It can be found here:
+ *    http://www.unicode.org/versions/corrigendum1.html
+ *
+ */
+
+/*
+ * Return the number of bytes used by the current UTF-8 sequence.
+ * Assumes the input points to the first byte of a valid UTF-8
+ * sequence.
+ */
+static inline int
+utf8clen(const char *s)
+{
+	unsigned char c = *s;
+	return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0);
+}
+
+/*
+ * utf8trie_t
+ *
+ * A compact binary tree, used to decode UTF-8 characters.
+ *
+ * Internal nodes are one byte for the node itself, and up to three
+ * bytes for an offset into the tree.
The first byte contains the
+ * following information:
+ *  NEXTBYTE  - flag        - advance to next byte if set
+ *  BITNUM    - 3 bit field - the bit number to be tested
+ *  OFFLEN    - 2 bit field - number of bytes in the offset
+ * if offlen == 0 (non-branching node)
+ *  RIGHTPATH - 1 bit field - set if the following node is for the
+ *                            right-hand path (tested bit is set)
+ *  TRIENODE  - 1 bit field - set if the following node is an internal
+ *                            node, otherwise it is a leaf node
+ * if offlen != 0 (branching node)
+ *  LEFTNODE  - 1 bit field - set if the left-hand node is internal
+ *  RIGHTNODE - 1 bit field - set if the right-hand node is internal
+ *
+ * Due to the way utf8 works, there cannot be branching nodes with
+ * NEXTBYTE set, and moreover those nodes always have a righthand
+ * descendant.
+ */
+typedef const unsigned char utf8trie_t;
+#define BITNUM		0x07
+#define NEXTBYTE	0x08
+#define OFFLEN		0x30
+#define OFFLEN_SHIFT	4
+#define RIGHTPATH	0x40
+#define TRIENODE	0x80
+#define RIGHTNODE	0x40
+#define LEFTNODE	0x80
+
+/*
+ * utf8leaf_t
+ *
+ * The leaves of the trie are embedded in the trie, and so have the
+ * same underlying datatype: unsigned char.
+ *
+ * leaf[0]: The unicode version, stored as a generation number that is
+ *          an index into utf8agetab[]. With this we can filter code
+ *          points based on the unicode version in which they were
+ *          defined. The CCC of a non-defined code point is 0.
+ * leaf[1]: Canonical Combining Class. During normalization, we need
+ *          to do a stable sort into ascending order of all characters
+ *          with a non-zero CCC that occur between two characters with
+ *          a CCC of 0, or at the beginning or end of a string.
+ *          The unicode standard guarantees that all CCC values are
+ *          between 0 and 254 inclusive, which leaves 255 available as
+ *          a special value.
+ *          Code points with CCC 0 are known as stoppers.
+ * leaf[2]: Decomposition.
If leaf[1] == 255, then leaf[2] is the + * start of a NUL-terminated string that is the decomposition + * of the character. + * The CCC of a decomposable character is the same as the CCC + * of the first character of its decomposition. + * Some characters decompose as the empty string: these are + * characters with the Default_Ignorable_Code_Point property. + * These do affect normalization, as they all have CCC 0. + * + * The decompositions in the trie have been fully expanded. + * + * Casefolding, if applicable, is also done using decompositions. + * + * The trie is constructed in such a way that leaves exist for all + * UTF-8 sequences that match the criteria from the "UTF-8 valid + * ranges" comment above, and only for those sequences. Therefore a + * lookup in the trie can be used to validate the UTF-8 input. + */ +typedef const unsigned char utf8leaf_t; + +#define LEAF_GEN(LEAF) ((LEAF)[0]) +#define LEAF_CCC(LEAF) ((LEAF)[1]) +#define LEAF_STR(LEAF) ((const char*)((LEAF) + 2)) + +#define MINCCC (0) +#define MAXCCC (254) +#define STOPPER (0) +#define DECOMPOSE (255) + +/* + * Use trie to scan s, touching at most len bytes. + * Returns the leaf if one exists, NULL otherwise. + * + * A non-NULL return guarantees that the UTF-8 sequence starting at s + * is well-formed and corresponds to a known unicode code point. The + * shorthand for this will be "is valid UTF-8 unicode". 
+ */
+static utf8leaf_t *
+utf8nlookup(utf8data_t data, const char *s, size_t len)
+{
+	utf8trie_t	*trie = data ? utf8data + data->offset : NULL;
+	int		offlen;
+	int		offset;
+	int		mask;
+	int		node;
+
+	if (!data)
+		return NULL;
+	if (len == 0)
+		return NULL;
+	node = 1;
+	while (node) {
+		offlen = (*trie & OFFLEN) >> OFFLEN_SHIFT;
+		if (*trie & NEXTBYTE) {
+			if (--len == 0)
+				return NULL;
+			s++;
+		}
+		mask = 1 << (*trie & BITNUM);
+		if (*s & mask) {
+			/* Right leg */
+			if (offlen) {
+				/* Right node at offset of trie */
+				node = (*trie & RIGHTNODE);
+				offset = trie[offlen];
+				while (--offlen) {
+					offset <<= 8;
+					offset |= trie[offlen];
+				}
+				trie += offset;
+			} else if (*trie & RIGHTPATH) {
+				/* Right node after this node */
+				node = (*trie & TRIENODE);
+				trie++;
+			} else {
+				/* No right node. */
+				node = 0;
+				trie = NULL;
+			}
+		} else {
+			/* Left leg */
+			if (offlen) {
+				/* Left node after this node. */
+				node = (*trie & LEFTNODE);
+				trie += offlen + 1;
+			} else if (*trie & RIGHTPATH) {
+				/* No left node. */
+				node = 0;
+				trie = NULL;
+			} else {
+				/* Left node after this node */
+				node = (*trie & TRIENODE);
+				trie++;
+			}
+		}
+	}
+	return trie;
+}
+
+/*
+ * Use trie to scan s.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * Forwards to utf8nlookup().
+ */
+static utf8leaf_t *
+utf8lookup(utf8data_t data, const char *s)
+{
+	return utf8nlookup(data, s, (size_t)-1);
+}
+
+/*
+ * Maximum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if only non-assigned code points are used.
+ */
+int
+utf8agemax(utf8data_t data, const char *s)
+{
+	utf8leaf_t	*leaf;
+	int		age = 0;
+	int		leaf_age;
+
+	if (!data)
+		return -1;
+	while (*s) {
+		if (!(leaf = utf8lookup(data, s)))
+			return -1;
+		leaf_age = utf8agetab[LEAF_GEN(leaf)];
+		if (leaf_age <= data->maxage && leaf_age > age)
+			age = leaf_age;
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Minimum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if non-assigned code points are used.
+ */
+int
+utf8agemin(utf8data_t data, const char *s)
+{
+	utf8leaf_t	*leaf;
+	int		age = data ? data->maxage : 0;
+	int		leaf_age;
+
+	if (!data)
+		return -1;
+	while (*s) {
+		if (!(leaf = utf8lookup(data, s)))
+			return -1;
+		leaf_age = utf8agetab[LEAF_GEN(leaf)];
+		if (leaf_age <= data->maxage && leaf_age < age)
+			age = leaf_age;
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Maximum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemax(utf8data_t data, const char *s, size_t len)
+{
+	utf8leaf_t	*leaf;
+	int		age = 0;
+	int		leaf_age;
+
+	if (!data)
+		return -1;
+	while (len && *s) {
+		if (!(leaf = utf8nlookup(data, s, len)))
+			return -1;
+		leaf_age = utf8agetab[LEAF_GEN(leaf)];
+		if (leaf_age <= data->maxage && leaf_age > age)
+			age = leaf_age;
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Minimum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemin(utf8data_t data, const char *s, size_t len)
+{
+	utf8leaf_t	*leaf;
+	int		leaf_age;
+	int		age = data ? data->maxage : 0;
+
+	if (!data)
+		return -1;
+	while (len && *s) {
+		if (!(leaf = utf8nlookup(data, s, len)))
+			return -1;
+		leaf_age = utf8agetab[LEAF_GEN(leaf)];
+		if (leaf_age <= data->maxage && leaf_age < age)
+			age = leaf_age;
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Length of the normalization of s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ *
+ * A string of Default_Ignorable_Code_Point has length 0.
+ */ +ssize_t +utf8len(utf8data_t data, const char *s) +{ + utf8leaf_t *leaf; + size_t ret = 0; + + if (!data) + return -1; + while (*s) { + if (!(leaf = utf8lookup(data, s))) + return -1; + if (utf8agetab[LEAF_GEN(leaf)] > data->maxage) + ret += utf8clen(s); + else if (LEAF_CCC(leaf) == DECOMPOSE) + ret += strlen(LEAF_STR(leaf)); + else + ret += utf8clen(s); + s += utf8clen(s); + } + return ret; +} + +/* + * Length of the normalization of s, touch at most len bytes. + * Return -1 if s is not valid UTF-8 unicode. + */ +ssize_t +utf8nlen(utf8data_t data, const char *s, size_t len) +{ + utf8leaf_t *leaf; + size_t ret = 0; + + if (!data) + return -1; + while (len && *s) { + if (!(leaf = utf8nlookup(data, s, len))) + return -1; + if (utf8agetab[LEAF_GEN(leaf)] > data->maxage) + ret += utf8clen(s); + else if (LEAF_CCC(leaf) == DECOMPOSE) + ret += strlen(LEAF_STR(leaf)); + else + ret += utf8clen(s); + len -= utf8clen(s); + s += utf8clen(s); + } + return ret; +} + +/* + * Set up an utf8cursor for use by utf8byte(). + * + * u8c : pointer to cursor. + * data : utf8data_t to use for normalization. + * s : string. + * len : length of s. + * + * Returns -1 on error, 0 on success. + */ +int +utf8ncursor( + struct utf8cursor *u8c, + utf8data_t data, + const char *s, + size_t len) +{ + if (!data) + return -1; + if (!s) + return -1; + u8c->data = data; + u8c->s = s; + u8c->p = NULL; + u8c->ss = NULL; + u8c->sp = NULL; + u8c->len = len; + u8c->slen = 0; + u8c->ccc = STOPPER; + u8c->nccc = STOPPER; + /* Check we didn't clobber the maximum length. */ + if (u8c->len != len) + return -1; + /* The first byte of s may not be an utf8 continuation. */ + if (len > 0 && (*s & 0xC0) == 0x80) + return -1; + return 0; +} + +/* + * Set up an utf8cursor for use by utf8byte(). + * + * u8c : pointer to cursor. + * data : utf8data_t to use for normalization. + * s : NUL-terminated string. + * + * Returns -1 on error, 0 on success. 
+ */
+int
+utf8cursor(
+	struct utf8cursor *u8c,
+	utf8data_t	data,
+	const char	*s)
+{
+	return utf8ncursor(u8c, data, s, (unsigned int)-1);
+}
+
+/*
+ * Get one byte from the normalized form of the string described by u8c.
+ *
+ * Returns the byte cast to an unsigned char on success, and -1 on failure.
+ *
+ * The cursor keeps track of the location in the string in u8c->s.
+ * When a character is decomposed, the current location is stored in
+ * u8c->p, and u8c->s is set to the start of the decomposition. Note
+ * that bytes from a decomposition do not count against u8c->len.
+ *
+ * Characters are emitted if they match the current CCC in u8c->ccc.
+ * Hitting end-of-string while u8c->ccc == STOPPER means we're done,
+ * and the function returns 0 in that case.
+ *
+ * Sorting by CCC is done by repeatedly scanning the string. The
+ * values of u8c->s and u8c->p are stored in u8c->ss and u8c->sp at
+ * the start of the scan. The first pass finds the lowest CCC to be
+ * emitted and stores it in u8c->nccc, the second pass emits the
+ * characters with this CCC and finds the next lowest CCC. This limits
+ * the number of passes to 1 + the number of different CCCs in the
+ * sequence being scanned.
+ *
+ * Therefore:
+ *  u8c->p  != NULL -> a decomposition is being scanned.
+ *  u8c->ss != NULL -> this is a repeating scan.
+ *  u8c->ccc == -1  -> this is the first scan of a repeating scan.
+ */
+int
+utf8byte(struct utf8cursor *u8c)
+{
+	utf8leaf_t *leaf;
+	int ccc;
+
+	for (;;) {
+		/* Check for the end of a decomposed character. */
+		if (u8c->p && *u8c->s == '\0') {
+			u8c->s = u8c->p;
+			u8c->p = NULL;
+		}
+
+		/* Check for end-of-string. */
+		if (!u8c->p && (u8c->len == 0 || *u8c->s == '\0')) {
+			/* There is no next byte. */
+			if (u8c->ccc == STOPPER)
+				return 0;
+			/* End-of-string during a scan counts as a stopper. */
+			ccc = STOPPER;
+			goto ccc_mismatch;
+		} else if ((*u8c->s & 0xC0) == 0x80) {
+			/* This is a continuation of the current character.
*/ + if (!u8c->p) + u8c->len--; + return (unsigned char)*u8c->s++; + } + + /* Look up the data for the current character. */ + if (u8c->p) + leaf = utf8lookup(u8c->data, u8c->s); + else + leaf = utf8nlookup(u8c->data, u8c->s, u8c->len); + + /* No leaf found implies that the input is a binary blob. */ + if (!leaf) + return -1; + + /* Characters that are too new have CCC 0. */ + if (utf8agetab[LEAF_GEN(leaf)] > u8c->data->maxage) { + ccc = STOPPER; + } else if ((ccc = LEAF_CCC(leaf)) == DECOMPOSE) { + u8c->len -= utf8clen(u8c->s); + u8c->p = u8c->s + utf8clen(u8c->s); + u8c->s = LEAF_STR(leaf); + /* Empty decomposition implies CCC 0. */ + if (*u8c->s == '\0') { + if (u8c->ccc == STOPPER) + continue; + ccc = STOPPER; + goto ccc_mismatch; + } + leaf = utf8lookup(u8c->data, u8c->s); + ccc = LEAF_CCC(leaf); + } + + /* + * If this is not a stopper, then see if it updates + * the next canonical class to be emitted. + */ + if (ccc != STOPPER && u8c->ccc < ccc && ccc < u8c->nccc) + u8c->nccc = ccc; + + /* + * Return the current byte if this is the current + * combining class. + */ + if (ccc == u8c->ccc) { + if (!u8c->p) + u8c->len--; + return (unsigned char)*u8c->s++; + } + + /* Current combining class mismatch. */ + ccc_mismatch: + if (u8c->nccc == STOPPER) { + /* + * Scan forward for the first canonical class + * to be emitted. Save the position from + * which to restart. + */ + u8c->ccc = MINCCC - 1; + u8c->nccc = ccc; + u8c->sp = u8c->p; + u8c->ss = u8c->s; + u8c->slen = u8c->len; + if (!u8c->p) + u8c->len -= utf8clen(u8c->s); + u8c->s += utf8clen(u8c->s); + } else if (ccc != STOPPER) { + /* Not a stopper, and not the ccc we're emitting. */ + if (!u8c->p) + u8c->len -= utf8clen(u8c->s); + u8c->s += utf8clen(u8c->s); + } else if (u8c->nccc != MAXCCC + 1) { + /* At a stopper, restart for next ccc. */ + u8c->ccc = u8c->nccc; + u8c->nccc = MAXCCC + 1; + u8c->s = u8c->ss; + u8c->p = u8c->sp; + u8c->len = u8c->slen; + } else { + /* All done, proceed from here. 
*/ + u8c->ccc = STOPPER; + u8c->nccc = STOPPER; + u8c->sp = NULL; + u8c->ss = NULL; + u8c->slen = 0; + } + } +} + +const struct utf8data * +utf8nfkdi(unsigned int maxage) +{ + int i = sizeof(utf8nfkdidata)/sizeof(utf8nfkdidata[0]) - 1; + + while (maxage < utf8nfkdidata[i].maxage) + i--; + if (maxage > utf8nfkdidata[i].maxage) + return NULL; + return &utf8nfkdidata[i]; +} + +const struct utf8data * +utf8nfkdicf(unsigned int maxage) +{ + int i = sizeof(utf8nfkdicfdata)/sizeof(utf8nfkdicfdata[0]) - 1; + + while (maxage < utf8nfkdicfdata[i].maxage) + i--; + if (maxage > utf8nfkdicfdata[i].maxage) + return NULL; + return &utf8nfkdicfdata[i]; +} diff --git a/support/mkutf8data.c b/support/mkutf8data.c new file mode 100644 index 0000000..e5c3507 --- /dev/null +++ b/support/mkutf8data.c @@ -0,0 +1,3232 @@ +/* + * Copyright (c) 2014 SGI. + * All rights reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write the Free Software Foundation, + * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + */ + +/* Generator for a compact trie for unicode normalization */ + +#include <sys/types.h> +#include <stddef.h> +#include <stdlib.h> +#include <stdio.h> +#include <assert.h> +#include <string.h> +#include <unistd.h> +#include <errno.h> + +/* Default names of the in- and output files. 
*/ + +#define AGE_NAME "DerivedAge.txt" +#define CCC_NAME "DerivedCombiningClass.txt" +#define PROP_NAME "DerivedCoreProperties.txt" +#define DATA_NAME "UnicodeData.txt" +#define FOLD_NAME "CaseFolding.txt" +#define NORM_NAME "NormalizationCorrections.txt" +#define TEST_NAME "NormalizationTest.txt" +#define UTF8_NAME "utf8data.h" + +const char *age_name = AGE_NAME; +const char *ccc_name = CCC_NAME; +const char *prop_name = PROP_NAME; +const char *data_name = DATA_NAME; +const char *fold_name = FOLD_NAME; +const char *norm_name = NORM_NAME; +const char *test_name = TEST_NAME; +const char *utf8_name = UTF8_NAME; + +int verbose = 0; + +/* An arbitrary line size limit on input lines. */ + +#define LINESIZE 1024 +char line[LINESIZE]; +char buf0[LINESIZE]; +char buf1[LINESIZE]; +char buf2[LINESIZE]; +char buf3[LINESIZE]; + +const char *argv0; + +/* ------------------------------------------------------------------ */ + +/* + * Unicode version numbers consist of three parts: major, minor, and a + * revision. These numbers are packed into an unsigned int to obtain + * a single version number. + * + * To save space in the generated trie, the unicode version is not + * stored directly, instead we calculate a generation number from the + * unicode versions seen in the DerivedAge file, and use that as an + * index into a table of unicode versions. 
+ */ +#define UNICODE_MAJ_SHIFT (16) +#define UNICODE_MIN_SHIFT (8) + +#define UNICODE_MAJ_MAX ((unsigned short)-1) +#define UNICODE_MIN_MAX ((unsigned char)-1) +#define UNICODE_REV_MAX ((unsigned char)-1) + +#define UNICODE_AGE(MAJ,MIN,REV) \ + (((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) | \ + ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) | \ + ((unsigned int)(REV))) + +unsigned int *ages; +int ages_count; + +unsigned int unicode_maxage; + +static int +age_valid(unsigned int major, unsigned int minor, unsigned int revision) +{ + if (major > UNICODE_MAJ_MAX) + return 0; + if (minor > UNICODE_MIN_MAX) + return 0; + if (revision > UNICODE_REV_MAX) + return 0; + return 1; +} + +/* ------------------------------------------------------------------ */ + +/* + * utf8trie_t + * + * A compact binary tree, used to decode UTF-8 characters. + * + * Internal nodes are one byte for the node itself, and up to three + * bytes for an offset into the tree. The first byte contains the + * following information: + * NEXTBYTE - flag - advance to next byte if set + * BITNUM - 3 bit field - the bit number to be tested + * OFFLEN - 2 bit field - number of bytes in the offset + * if offlen == 0 (non-branching node) + * RIGHTPATH - 1 bit field - set if the following node is for the + * right-hand path (tested bit is set) + * TRIENODE - 1 bit field - set if the following node is an internal + * node, otherwise it is a leaf node + * if offlen != 0 (branching node) + * LEFTNODE - 1 bit field - set if the left-hand node is internal + * RIGHTNODE - 1 bit field - set if the right-hand node is internal + * + * Due to the way utf8 works, there cannot be branching nodes with + * NEXTBYTE set, and moreover those nodes always have a righthand + * descendant. 
+ */ +typedef unsigned char utf8trie_t; +#define BITNUM 0x07 +#define NEXTBYTE 0x08 +#define OFFLEN 0x30 +#define OFFLEN_SHIFT 4 +#define RIGHTPATH 0x40 +#define TRIENODE 0x80 +#define RIGHTNODE 0x40 +#define LEFTNODE 0x80 + +/* + * utf8leaf_t + * + * The leaves of the trie are embedded in the trie, and so the same + * underlying datatype, unsigned char. + * + * leaf[0]: The unicode version, stored as a generation number that is + * an index into utf8agetab[]. With this we can filter code + * points based on the unicode version in which they were + * defined. The CCC of a non-defined code point is 0. + * leaf[1]: Canonical Combining Class. During normalization, we need + * to do a stable sort into ascending order of all characters + * with a non-zero CCC that occur between two characters with + * a CCC of 0, or at the begin or end of a string. + * The unicode standard guarantees that all CCC values are + * between 0 and 254 inclusive, which leaves 255 available as + * a special value. + * Code points with CCC 0 are known as stoppers. + * leaf[2]: Decomposition. If leaf[1] == 255, then leaf[2] is the + * start of a NUL-terminated string that is the decomposition + * of the character. + * The CCC of a decomposable character is the same as the CCC + * of the first character of its decomposition. + * Some characters decompose as the empty string: these are + * characters with the Default_Ignorable_Code_Point property. + * These do affect normalization, as they all have CCC 0. + * + * The decompositions in the trie have been fully expanded. + * + * Casefolding, if applicable, is also done using decompositions. 
+ */ +typedef unsigned char utf8leaf_t; + +#define LEAF_GEN(LEAF) ((LEAF)[0]) +#define LEAF_CCC(LEAF) ((LEAF)[1]) +#define LEAF_STR(LEAF) ((const char*)((LEAF) + 2)) + +#define MAXGEN (255) + +#define MINCCC (0) +#define MAXCCC (254) +#define STOPPER (0) +#define DECOMPOSE (255) + +struct tree; +static utf8leaf_t *utf8nlookup(struct tree *, const char *, size_t); +static utf8leaf_t *utf8lookup(struct tree *, const char *); + +unsigned char *utf8data; +size_t utf8data_size; + +utf8trie_t *nfkdi; +utf8trie_t *nfkdicf; + +/* ------------------------------------------------------------------ */ + +/* + * UTF8 valid ranges. + * + * The UTF-8 encoding spreads the bits of a 32bit word over several + * bytes. This table gives the ranges that can be held and how they'd + * be represented. + * + * 0x00000000 0x0000007F: 0xxxxxxx + * 0x00000000 0x000007FF: 110xxxxx 10xxxxxx + * 0x00000000 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx + * 0x00000000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx + * 0x00000000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + * 0x00000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + * + * There is an additional requirement on UTF-8, in that only the + * shortest representation of a 32bit value is to be used. A decoder + * must not decode sequences that do not satisfy this requirement. + * Thus the allowed ranges have a lower bound. + * + * 0x00000000 0x0000007F: 0xxxxxxx + * 0x00000080 0x000007FF: 110xxxxx 10xxxxxx + * 0x00000800 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx + * 0x00010000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx + * 0x00200000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + * 0x04000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + * + * Actual unicode characters are limited to the range 0x0 - 0x10FFFF, + * 17 planes of 65536 values. This limits the sequences actually seen + * even more, to just the following. 
+ * + * 0 - 0x7f: 0 0x7f + * 0x80 - 0x7ff: 0xc2 0x80 0xdf 0xbf + * 0x800 - 0xffff: 0xe0 0xa0 0x80 0xef 0xbf 0xbf + * 0x10000 - 0x10ffff: 0xf0 0x90 0x80 0x80 0xf4 0x8f 0xbf 0xbf + * + * Even within those ranges not all values are allowed: the surrogates + * 0xd800 - 0xdfff should never be seen. + * + * Note that the longest sequence seen with valid usage is 4 bytes, + * the same as a single UTF-32 character. This makes the UTF-8 + * representation of Unicode strictly smaller than UTF-32. + * + * The shortest sequence requirement was introduced by: + * Corrigendum #1: UTF-8 Shortest Form + * It can be found here: + * http://www.unicode.org/versions/corrigendum1.html + * + */ + +#define UTF8_2_BITS 0xC0 +#define UTF8_3_BITS 0xE0 +#define UTF8_4_BITS 0xF0 +#define UTF8_N_BITS 0x80 +#define UTF8_2_MASK 0xE0 +#define UTF8_3_MASK 0xF0 +#define UTF8_4_MASK 0xF8 +#define UTF8_N_MASK 0xC0 +#define UTF8_V_MASK 0x3F +#define UTF8_V_SHIFT 6 + +static int +utf8key(unsigned int key, char keyval[]) +{ + int keylen; + + if (key < 0x80) { + keyval[0] = key; + keylen = 1; + } else if (key < 0x800) { + keyval[1] = key & UTF8_V_MASK; + keyval[1] |= UTF8_N_BITS; + key >>= UTF8_V_SHIFT; + keyval[0] = key; + keyval[0] |= UTF8_2_BITS; + keylen = 2; + } else if (key < 0x10000) { + keyval[2] = key & UTF8_V_MASK; + keyval[2] |= UTF8_N_BITS; + key >>= UTF8_V_SHIFT; + keyval[1] = key & UTF8_V_MASK; + keyval[1] |= UTF8_N_BITS; + key >>= UTF8_V_SHIFT; + keyval[0] = key; + keyval[0] |= UTF8_3_BITS; + keylen = 3; + } else if (key < 0x110000) { + keyval[3] = key & UTF8_V_MASK; + keyval[3] |= UTF8_N_BITS; + key >>= UTF8_V_SHIFT; + keyval[2] = key & UTF8_V_MASK; + keyval[2] |= UTF8_N_BITS; + key >>= UTF8_V_SHIFT; + keyval[1] = key & UTF8_V_MASK; + keyval[1] |= UTF8_N_BITS; + key >>= UTF8_V_SHIFT; + keyval[0] = key; + keyval[0] |= UTF8_4_BITS; + keylen = 4; + } else { + printf("%#x: illegal key\n", key); + keylen = 0; + } + return keylen; +} + +static unsigned int +utf8code(const char *str) +{ + const 
unsigned char *s = (const unsigned char*)str; + unsigned int unichar = 0; + + if (*s < 0x80) { + unichar = *s; + } else if (*s < UTF8_3_BITS) { + unichar = *s++ & 0x1F; + unichar <<= UTF8_V_SHIFT; + unichar |= *s & 0x3F; + } else if (*s < UTF8_4_BITS) { + unichar = *s++ & 0x0F; + unichar <<= UTF8_V_SHIFT; + unichar |= *s++ & 0x3F; + unichar <<= UTF8_V_SHIFT; + unichar |= *s & 0x3F; + } else { + unichar = *s++ & 0x0F; + unichar <<= UTF8_V_SHIFT; + unichar |= *s++ & 0x3F; + unichar <<= UTF8_V_SHIFT; + unichar |= *s++ & 0x3F; + unichar <<= UTF8_V_SHIFT; + unichar |= *s & 0x3F; + } + return unichar; +} + +static int +utf32valid(unsigned int unichar) +{ + return unichar < 0x110000; +} + +#define NODE 1 +#define LEAF 0 + +struct tree { + void *root; + int childnode; + const char *type; + unsigned int maxage; + struct tree *next; + int (*leaf_equal)(void *, void *); + void (*leaf_print)(void *, int); + int (*leaf_mark)(void *); + int (*leaf_size)(void *); + int *(*leaf_index)(struct tree *, void *); + unsigned char *(*leaf_emit)(void *, unsigned char *); + int leafindex[0x110000]; + int index; +}; + +struct node { + int index; + int offset; + int mark; + int size; + struct node *parent; + void *left; + void *right; + unsigned char bitnum; + unsigned char nextbyte; + unsigned char leftnode; + unsigned char rightnode; + unsigned int keybits; + unsigned int keymask; +}; + +/* + * Example lookup function for a tree. 
+ */ +static void * +lookup(struct tree *tree, const char *key) +{ + struct node *node; + void *leaf = NULL; + + node = tree->root; + while (!leaf && node) { + if (node->nextbyte) + key++; + if (*key & (1 << (node->bitnum & 7))) { + /* Right leg */ + if (node->rightnode == NODE) { + node = node->right; + } else if (node->rightnode == LEAF) { + leaf = node->right; + } else { + node = NULL; + } + } else { + /* Left leg */ + if (node->leftnode == NODE) { + node = node->left; + } else if (node->leftnode == LEAF) { + leaf = node->left; + } else { + node = NULL; + } + } + } + + return leaf; +} + +/* + * A simple non-recursive tree walker: keep track of visits to the + * left and right branches in the leftmask and rightmask. + */ +static void +tree_walk(struct tree *tree) +{ + struct node *node; + unsigned int leftmask; + unsigned int rightmask; + unsigned int bitmask; + int indent = 1; + int nodes, singletons, leaves; + + nodes = singletons = leaves = 0; + + printf("%s_%x root %p\n", tree->type, tree->maxage, tree->root); + if (tree->childnode == LEAF) { + assert(tree->root); + tree->leaf_print(tree->root, indent); + leaves = 1; + } else { + assert(tree->childnode == NODE); + node = tree->root; + leftmask = rightmask = 0; + while (node) { + printf("%*snode @ %p bitnum %d nextbyte %d" + " left %p right %p mask %x bits %x\n", + indent, "", node, + node->bitnum, node->nextbyte, + node->left, node->right, + node->keymask, node->keybits); + nodes += 1; + if (!(node->left && node->right)) + singletons += 1; + + while (node) { + bitmask = 1 << node->bitnum; + if ((leftmask & bitmask) == 0) { + leftmask |= bitmask; + if (node->leftnode == LEAF) { + assert(node->left); + tree->leaf_print(node->left, + indent+1); + leaves += 1; + } else if (node->left) { + assert(node->leftnode == NODE); + indent += 1; + node = node->left; + break; + } + } + if ((rightmask & bitmask) == 0) { + rightmask |= bitmask; + if (node->rightnode == LEAF) { + assert(node->right); + 
tree->leaf_print(node->right, + indent+1); + leaves += 1; + } else if (node->right) { + assert(node->rightnode==NODE); + indent += 1; + node = node->right; + break; + } + } + leftmask &= ~bitmask; + rightmask &= ~bitmask; + node = node->parent; + indent -= 1; + } + } + } + printf("nodes %d leaves %d singletons %d\n", + nodes, leaves, singletons); +} + +/* + * Allocate and initialize a new internal node. + */ +static struct node * +alloc_node(struct node *parent) +{ + struct node *node; + int bitnum; + + node = malloc(sizeof(*node)); + node->left = node->right = NULL; + node->parent = parent; + node->leftnode = NODE; + node->rightnode = NODE; + node->keybits = 0; + node->keymask = 0; + node->mark = 0; + node->index = 0; + node->offset = -1; + node->size = 4; + + if (node->parent) { + bitnum = parent->bitnum; + if ((bitnum & 7) == 0) { + node->bitnum = bitnum + 7 + 8; + node->nextbyte = 1; + } else { + node->bitnum = bitnum - 1; + node->nextbyte = 0; + } + } else { + node->bitnum = 7; + node->nextbyte = 0; + } + + return node; +} + +/* + * Insert a new leaf into the tree, and collapse any subtrees that are + * fully populated and end in identical leaves. A nextbyte tagged + * internal node will not be removed to preserve the tree's integrity. + * Note that due to the structure of utf8, no nextbyte tagged node + * will be a candidate for removal. + */ +static int +insert(struct tree *tree, char *key, int keylen, void *leaf) +{ + struct node *node; + struct node *parent; + void **cursor; + int keybits; + + assert(keylen >= 1 && keylen <= 4); + + node = NULL; + cursor = &tree->root; + keybits = 8 * keylen; + + /* Insert, creating path along the way. */ + while (keybits) { + if (!*cursor) + *cursor = alloc_node(node); + node = *cursor; + if (node->nextbyte) + key++; + if (*key & (1 << (node->bitnum & 7))) + cursor = &node->right; + else + cursor = &node->left; + keybits--; + } + *cursor = leaf; + + /* Merge subtrees if possible. 
*/ + while (node) { + if (*key & (1 << (node->bitnum & 7))) + node->rightnode = LEAF; + else + node->leftnode = LEAF; + if (node->nextbyte) + break; + if (node->leftnode == NODE || node->rightnode == NODE) + break; + assert(node->left); + assert(node->right); + /* Compare */ + if (! tree->leaf_equal(node->left, node->right)) + break; + /* Keep left, drop right leaf. */ + leaf = node->left; + /* Check in parent */ + parent = node->parent; + if (!parent) { + /* root of tree! */ + tree->root = leaf; + tree->childnode = LEAF; + } else if (parent->left == node) { + parent->left = leaf; + parent->leftnode = LEAF; + if (parent->right) { + parent->keymask = 0; + parent->keybits = 0; + } else { + parent->keymask |= (1 << node->bitnum); + } + } else if (parent->right == node) { + parent->right = leaf; + parent->rightnode = LEAF; + if (parent->left) { + parent->keymask = 0; + parent->keybits = 0; + } else { + parent->keymask |= (1 << node->bitnum); + parent->keybits |= (1 << node->bitnum); + } + } else { + /* internal tree error */ + assert(0); + } + free(node); + node = parent; + } + + /* Propagate keymasks up along singleton chains. */ + while (node) { + parent = node->parent; + if (!parent) + break; + /* Nix the mask for parents with two children. */ + if (node->keymask == 0) { + parent->keymask = 0; + parent->keybits = 0; + } else if (parent->left && parent->right) { + parent->keymask = 0; + parent->keybits = 0; + } else { + assert((parent->keymask & node->keymask) == 0); + parent->keymask |= node->keymask; + parent->keymask |= (1 << parent->bitnum); + parent->keybits |= node->keybits; + if (parent->right) + parent->keybits |= (1 << parent->bitnum); + } + node = parent; + } + + return 0; +} + +/* + * Prune internal nodes. + * + * Fully populated subtrees that end at the same leaf have already + * been collapsed. 
There are still internal nodes that have for both + * their left and right branches a sequence of singletons that make + * identical choices and end in identical leaves. The keymask and + * keybits collected in the nodes describe the choices made in these + * singleton chains. When they are identical for the left and right + * branch of a node, and the two leaves compare identical, the node in + * question can be removed. + * + * Note that nodes with the nextbyte tag set will not be removed by + * this to ensure tree integrity. Note as well that the structure of + * utf8 ensures that these nodes would not have been candidates for + * removal in any case. + */ +static void +prune(struct tree *tree) +{ + struct node *node; + struct node *left; + struct node *right; + struct node *parent; + void *leftleaf; + void *rightleaf; + unsigned int leftmask; + unsigned int rightmask; + unsigned int bitmask; + int count; + + if (verbose > 0) + printf("Pruning %s_%x\n", tree->type, tree->maxage); + + count = 0; + if (tree->childnode == LEAF) + return; + if (!tree->root) + return; + + leftmask = rightmask = 0; + node = tree->root; + while (node) { + if (node->nextbyte) + goto advance; + if (node->leftnode == LEAF) + goto advance; + if (node->rightnode == LEAF) + goto advance; + if (!node->left) + goto advance; + if (!node->right) + goto advance; + left = node->left; + right = node->right; + if (left->keymask == 0) + goto advance; + if (right->keymask == 0) + goto advance; + if (left->keymask != right->keymask) + goto advance; + if (left->keybits != right->keybits) + goto advance; + leftleaf = NULL; + while (!leftleaf) { + assert(left->left || left->right); + if (left->leftnode == LEAF) + leftleaf = left->left; + else if (left->rightnode == LEAF) + leftleaf = left->right; + else if (left->left) + left = left->left; + else if (left->right) + left = left->right; + else + assert(0); + } + rightleaf = NULL; + while (!rightleaf) { + assert(right->left || right->right); + if 
(right->leftnode == LEAF) + rightleaf = right->left; + else if (right->rightnode == LEAF) + rightleaf = right->right; + else if (right->left) + right = right->left; + else if (right->right) + right = right->right; + else + assert(0); + } + if (! tree->leaf_equal(leftleaf, rightleaf)) + goto advance; + /* + * This node has identical singleton-only subtrees. + * Remove it. + */ + parent = node->parent; + left = node->left; + right = node->right; + if (parent->left == node) + parent->left = left; + else if (parent->right == node) + parent->right = left; + else + assert(0); + left->parent = parent; + left->keymask |= (1 << node->bitnum); + node->left = NULL; + while (node) { + bitmask = 1 << node->bitnum; + leftmask &= ~bitmask; + rightmask &= ~bitmask; + if (node->leftnode == NODE && node->left) { + left = node->left; + free(node); + count++; + node = left; + } else if (node->rightnode == NODE && node->right) { + right = node->right; + free(node); + count++; + node = right; + } else { + node = NULL; + } + } + /* Propagate keymasks up along singleton chains. 
*/ + node = parent; + /* Force re-check */ + bitmask = 1 << node->bitnum; + leftmask &= ~bitmask; + rightmask &= ~bitmask; + for (;;) { + if (node->left && node->right) + break; + if (node->left) { + left = node->left; + node->keymask |= left->keymask; + node->keybits |= left->keybits; + } + if (node->right) { + right = node->right; + node->keymask |= right->keymask; + node->keybits |= right->keybits; + } + node->keymask |= (1 << node->bitnum); + node = node->parent; + /* Force re-check */ + bitmask = 1 << node->bitnum; + leftmask &= ~bitmask; + rightmask &= ~bitmask; + } + advance: + bitmask = 1 << node->bitnum; + if ((leftmask & bitmask) == 0 && + node->leftnode == NODE && + node->left) { + leftmask |= bitmask; + node = node->left; + } else if ((rightmask & bitmask) == 0 && + node->rightnode == NODE && + node->right) { + rightmask |= bitmask; + node = node->right; + } else { + leftmask &= ~bitmask; + rightmask &= ~bitmask; + node = node->parent; + } + } + if (verbose > 0) + printf("Pruned %d nodes\n", count); +} + +/* + * Mark the nodes in the tree that lead to leaves that must be + * emitted. 
+ */ +static void +mark_nodes(struct tree *tree) +{ + struct node *node; + struct node *n; + unsigned int leftmask; + unsigned int rightmask; + unsigned int bitmask; + int marked; + + marked = 0; + if (verbose > 0) + printf("Marking %s_%x\n", tree->type, tree->maxage); + if (tree->childnode == LEAF) + goto done; + + assert(tree->childnode == NODE); + node = tree->root; + leftmask = rightmask = 0; + while (node) { + bitmask = 1 << node->bitnum; + if ((leftmask & bitmask) == 0) { + leftmask |= bitmask; + if (node->leftnode == LEAF) { + assert(node->left); + if (tree->leaf_mark(node->left)) { + n = node; + while (n && !n->mark) { + marked++; + n->mark = 1; + n = n->parent; + } + } + } else if (node->left) { + assert(node->leftnode == NODE); + node = node->left; + continue; + } + } + if ((rightmask & bitmask) == 0) { + rightmask |= bitmask; + if (node->rightnode == LEAF) { + assert(node->right); + if (tree->leaf_mark(node->right)) { + n = node; + while (n && !n->mark) { + marked++; + n->mark = 1; + n = n->parent; + } + } + } else if (node->right) { + assert(node->rightnode==NODE); + node = node->right; + continue; + } + } + leftmask &= ~bitmask; + rightmask &= ~bitmask; + node = node->parent; + } + + /* second pass: left siblings and singletons */ + + assert(tree->childnode == NODE); + node = tree->root; + leftmask = rightmask = 0; + while (node) { + bitmask = 1 << node->bitnum; + if ((leftmask & bitmask) == 0) { + leftmask |= bitmask; + if (node->leftnode == LEAF) { + assert(node->left); + if (tree->leaf_mark(node->left)) { + n = node; + while (n && !n->mark) { + marked++; + n->mark = 1; + n = n->parent; + } + } + } else if (node->left) { + assert(node->leftnode == NODE); + node = node->left; + if (!node->mark && node->parent->mark) { + marked++; + node->mark = 1; + } + continue; + } + } + if ((rightmask & bitmask) == 0) { + rightmask |= bitmask; + if (node->rightnode == LEAF) { + assert(node->right); + if (tree->leaf_mark(node->right)) { + n = node; + while (n && 
!n->mark) { + marked++; + n->mark = 1; + n = n->parent; + } + } + } else if (node->right) { + assert(node->rightnode==NODE); + node = node->right; + if (!node->mark && node->parent->mark && + !node->parent->left) { + marked++; + node->mark = 1; + } + continue; + } + } + leftmask &= ~bitmask; + rightmask &= ~bitmask; + node = node->parent; + } +done: + if (verbose > 0) + printf("Marked %d nodes\n", marked); +} + +/* + * Compute the index of each node and leaf, which is the offset in the + * emitted trie. These values must be pre-computed because relative + * offsets between nodes are used to navigate the tree. + */ +static int +index_nodes(struct tree *tree, int index) +{ + struct node *node; + unsigned int leftmask; + unsigned int rightmask; + unsigned int bitmask; + int count; + int indent; + + /* Align to a cache line (or half a cache line?). */ + while (index % 64) + index++; + tree->index = index; + indent = 1; + count = 0; + + if (verbose > 0) + printf("Indexing %s_%x: %d", tree->type, tree->maxage, index); + if (tree->childnode == LEAF) { + index += tree->leaf_size(tree->root); + goto done; + } + + assert(tree->childnode == NODE); + node = tree->root; + leftmask = rightmask = 0; + while (node) { + if (!node->mark) + goto skip; + count++; + if (node->index != index) + node->index = index; + index += node->size; +skip: + while (node) { + bitmask = 1 << node->bitnum; + if (node->mark && (leftmask & bitmask) == 0) { + leftmask |= bitmask; + if (node->leftnode == LEAF) { + assert(node->left); + *tree->leaf_index(tree, node->left) = + index; + index += tree->leaf_size(node->left); + count++; + } else if (node->left) { + assert(node->leftnode == NODE); + indent += 1; + node = node->left; + break; + } + } + if (node->mark && (rightmask & bitmask) == 0) { + rightmask |= bitmask; + if (node->rightnode == LEAF) { + assert(node->right); + *tree->leaf_index(tree, node->right) = index; + index += tree->leaf_size(node->right); + count++; + } else if (node->right) { + 
assert(node->rightnode==NODE); + indent += 1; + node = node->right; + break; + } + } + leftmask &= ~bitmask; + rightmask &= ~bitmask; + node = node->parent; + indent -= 1; + } + } +done: + /* Round up to a multiple of 16 */ + while (index % 16) + index++; + if (verbose > 0) + printf("Final index %d\n", index); + return index; +} + +/* + * Compute the size of nodes and leaves. We start by assuming that + * each node needs to store a three-byte offset. The indexes of the + * nodes are calculated based on that, and then this function is + * called to see if the sizes of some nodes can be reduced. This is + * repeated until no more changes are seen. + */ +static int +size_nodes(struct tree *tree) +{ + struct tree *next; + struct node *node; + struct node *right; + struct node *n; + unsigned int leftmask; + unsigned int rightmask; + unsigned int bitmask; + unsigned int pathbits; + unsigned int pathmask; + int changed; + int offset; + int size; + int indent; + + indent = 1; + changed = 0; + size = 0; + + if (verbose > 0) + printf("Sizing %s_%x", tree->type, tree->maxage); + if (tree->childnode == LEAF) + goto done; + + assert(tree->childnode == NODE); + pathbits = 0; + pathmask = 0; + node = tree->root; + leftmask = rightmask = 0; + while (node) { + if (!node->mark) + goto skip; + offset = 0; + if (!node->left || !node->right) { + size = 1; + } else { + if (node->rightnode == NODE) { + right = node->right; + next = tree->next; + while (!right->mark) { + assert(next); + n = next->root; + while (n->bitnum != node->bitnum) { + if (pathbits & (1<<n->bitnum)) + n = n->right; + else + n = n->left; + } + n = n->right; + assert(right->bitnum == n->bitnum); + right = n; + next = next->next; + } + offset = right->index - node->index; + } else { + offset = *tree->leaf_index(tree, node->right); + offset -= node->index; + } + assert(offset >= 0); + assert(offset <= 0xffffff); + if (offset <= 0xff) { + size = 2; + } else if (offset <= 0xffff) { + size = 3; + } else { /* offset <= 
0xffffff */ + size = 4; + } + } + if (node->size != size || node->offset != offset) { + node->size = size; + node->offset = offset; + changed++; + } +skip: + while (node) { + bitmask = 1 << node->bitnum; + pathmask |= bitmask; + if (node->mark && (leftmask & bitmask) == 0) { + leftmask |= bitmask; + if (node->leftnode == LEAF) { + assert(node->left); + } else if (node->left) { + assert(node->leftnode == NODE); + indent += 1; + node = node->left; + break; + } + } + if (node->mark && (rightmask & bitmask) == 0) { + rightmask |= bitmask; + pathbits |= bitmask; + if (node->rightnode == LEAF) { + assert(node->right); + } else if (node->right) { + assert(node->rightnode==NODE); + indent += 1; + node = node->right; + break; + } + } + leftmask &= ~bitmask; + rightmask &= ~bitmask; + pathmask &= ~bitmask; + pathbits &= ~bitmask; + node = node->parent; + indent -= 1; + } + } +done: + if (verbose > 0) + printf("Found %d changes\n", changed); + return changed; +} + +/* + * Emit a trie for the given tree into the data array. 
+ */ +static void +emit(struct tree *tree, unsigned char *data) +{ + struct node *node; + unsigned int leftmask; + unsigned int rightmask; + unsigned int bitmask; + int offlen; + int offset; + int index; + int indent; + unsigned char byte; + + index = tree->index; + data += index; + indent = 1; + if (verbose > 0) + printf("Emitting %s_%x\n", tree->type, tree->maxage); + if (tree->childnode == LEAF) { + assert(tree->root); + tree->leaf_emit(tree->root, data); + return; + } + + assert(tree->childnode == NODE); + node = tree->root; + leftmask = rightmask = 0; + while (node) { + if (!node->mark) + goto skip; + assert(node->offset != -1); + assert(node->index == index); + + byte = 0; + if (node->nextbyte) + byte |= NEXTBYTE; + byte |= (node->bitnum & BITNUM); + if (node->left && node->right) { + if (node->leftnode == NODE) + byte |= LEFTNODE; + if (node->rightnode == NODE) + byte |= RIGHTNODE; + if (node->offset <= 0xff) + offlen = 1; + else if (node->offset <= 0xffff) + offlen = 2; + else + offlen = 3; + offset = node->offset; + byte |= offlen << OFFLEN_SHIFT; + *data++ = byte; + index++; + while (offlen--) { + *data++ = offset & 0xff; + index++; + offset >>= 8; + } + } else if (node->left) { + if (node->leftnode == NODE) + byte |= TRIENODE; + *data++ = byte; + index++; + } else if (node->right) { + byte |= RIGHTNODE; + if (node->rightnode == NODE) + byte |= TRIENODE; + *data++ = byte; + index++; + } else { + assert(0); + } +skip: + while (node) { + bitmask = 1 << node->bitnum; + if (node->mark && (leftmask & bitmask) == 0) { + leftmask |= bitmask; + if (node->leftnode == LEAF) { + assert(node->left); + data = tree->leaf_emit(node->left, + data); + index += tree->leaf_size(node->left); + } else if (node->left) { + assert(node->leftnode == NODE); + indent += 1; + node = node->left; + break; + } + } + if (node->mark && (rightmask & bitmask) == 0) { + rightmask |= bitmask; + if (node->rightnode == LEAF) { + assert(node->right); + data = tree->leaf_emit(node->right, + 
data); + index += tree->leaf_size(node->right); + } else if (node->right) { + assert(node->rightnode==NODE); + indent += 1; + node = node->right; + break; + } + } + leftmask &= ~bitmask; + rightmask &= ~bitmask; + node = node->parent; + indent -= 1; + } + } +} + +/* ------------------------------------------------------------------ */ + +/* + * Unicode data. + * + * We need to keep track of the Canonical Combining Class, the Age, + * and decompositions for a code point. + * + * For the Age, we store the index into the ages table. Effectively + * this is a generation number that the table maps to a unicode + * version. + * + * The correction field is used to indicate that this entry is in the + * corrections array, which contains decompositions that were + * corrected in later revisions. The value of the correction field is + * the Unicode version in which the mapping was corrected. + */ +struct unicode_data { + unsigned int code; + int ccc; + int gen; + int correction; + unsigned int *utf32nfkdi; + unsigned int *utf32nfkdicf; + char *utf8nfkdi; + char *utf8nfkdicf; +}; + +struct unicode_data unicode_data[0x110000]; +struct unicode_data *corrections; +int corrections_count; + +struct tree *nfkdi_tree; +struct tree *nfkdicf_tree; + +struct tree *trees; +int trees_count; + +/* + * Check the corrections array to see if this entry was corrected at + * some point. 
+ */ +static struct unicode_data * +corrections_lookup(struct unicode_data *u) +{ + int i; + + for (i = 0; i != corrections_count; i++) + if (u->code == corrections[i].code) + return &corrections[i]; + return u; +} + +static int +nfkdi_equal(void *l, void *r) +{ + struct unicode_data *left = l; + struct unicode_data *right = r; + + if (left->gen != right->gen) + return 0; + if (left->ccc != right->ccc) + return 0; + if (left->utf8nfkdi && right->utf8nfkdi && + strcmp(left->utf8nfkdi, right->utf8nfkdi) == 0) + return 1; + if (left->utf8nfkdi || right->utf8nfkdi) + return 0; + return 1; +} + +static int +nfkdicf_equal(void *l, void *r) +{ + struct unicode_data *left = l; + struct unicode_data *right = r; + + if (left->gen != right->gen) + return 0; + if (left->ccc != right->ccc) + return 0; + if (left->utf8nfkdicf && right->utf8nfkdicf && + strcmp(left->utf8nfkdicf, right->utf8nfkdicf) == 0) + return 1; + if (left->utf8nfkdicf && right->utf8nfkdicf) + return 0; + if (left->utf8nfkdicf || right->utf8nfkdicf) + return 0; + if (left->utf8nfkdi && right->utf8nfkdi && + strcmp(left->utf8nfkdi, right->utf8nfkdi) == 0) + return 1; + if (left->utf8nfkdi || right->utf8nfkdi) + return 0; + return 1; +} + +static void +nfkdi_print(void *l, int indent) +{ + struct unicode_data *leaf = l; + + printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf, + leaf->code, leaf->ccc, leaf->gen); + if (leaf->utf8nfkdi) + printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi); + printf("\n"); +} + +static void +nfkdicf_print(void *l, int indent) +{ + struct unicode_data *leaf = l; + + printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf, + leaf->code, leaf->ccc, leaf->gen); + if (leaf->utf8nfkdicf) + printf(" nfkdicf \"%s\"", (const char*)leaf->utf8nfkdicf); + else if (leaf->utf8nfkdi) + printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi); + printf("\n"); +} + +static int +nfkdi_mark(void *l) +{ + return 1; +} + +static int +nfkdicf_mark(void *l) +{ + struct unicode_data *leaf 
= l; + if (leaf->utf8nfkdicf) + return 1; + return 0; +} + +static int +correction_mark(void *l) +{ + struct unicode_data *leaf = l; + return leaf->correction; +} + +static int +nfkdi_size(void *l) +{ + struct unicode_data *leaf = l; + int size = 2; + if (leaf->utf8nfkdi) + size += strlen(leaf->utf8nfkdi) + 1; + return size; +} + +static int +nfkdicf_size(void *l) +{ + struct unicode_data *leaf = l; + int size = 2; + if (leaf->utf8nfkdicf) + size += strlen(leaf->utf8nfkdicf) + 1; + else if (leaf->utf8nfkdi) + size += strlen(leaf->utf8nfkdi) + 1; + return size; +} + +static int * +nfkdi_index(struct tree *tree, void *l) +{ + struct unicode_data *leaf = l; + return &tree->leafindex[leaf->code]; +} + +static int * +nfkdicf_index(struct tree *tree, void *l) +{ + struct unicode_data *leaf = l; + return &tree->leafindex[leaf->code]; +} + +static unsigned char * +nfkdi_emit(void *l, unsigned char *data) +{ + struct unicode_data *leaf = l; + unsigned char *s; + + *data++ = leaf->gen; + if (leaf->utf8nfkdi) { + *data++ = DECOMPOSE; + s = (unsigned char*)leaf->utf8nfkdi; + while ((*data++ = *s++) != 0) + ; + } else { + *data++ = leaf->ccc; + } + return data; +} + +static unsigned char * +nfkdicf_emit(void *l, unsigned char *data) +{ + struct unicode_data *leaf = l; + unsigned char *s; + + *data++ = leaf->gen; + if (leaf->utf8nfkdicf) { + *data++ = DECOMPOSE; + s = (unsigned char*)leaf->utf8nfkdicf; + while ((*data++ = *s++) != 0) + ; + } else if (leaf->utf8nfkdi) { + *data++ = DECOMPOSE; + s = (unsigned char*)leaf->utf8nfkdi; + while ((*data++ = *s++) != 0) + ; + } else { + *data++ = leaf->ccc; + } + return data; +} + +static void +utf8_create(struct unicode_data *data) +{ + char utf[18*4+1]; + char *u; + unsigned int *um; + int i; + + u = utf; + um = data->utf32nfkdi; + if (um) { + for (i = 0; um[i]; i++) + u += utf8key(um[i], u); + *u = '\0'; + data->utf8nfkdi = strdup((char*)utf); + } + u = utf; + um = data->utf32nfkdicf; + if (um) { + for (i = 0; um[i]; i++) + u += 
utf8key(um[i], u); + *u = '\0'; + if (!data->utf8nfkdi || strcmp(data->utf8nfkdi, (char*)utf)) + data->utf8nfkdicf = strdup((char*)utf); + } +} + +static void +utf8_init(void) +{ + unsigned int unichar; + int i; + + for (unichar = 0; unichar != 0x110000; unichar++) + utf8_create(&unicode_data[unichar]); + + for (i = 0; i != corrections_count; i++) + utf8_create(&corrections[i]); +} + +static void +trees_init(void) +{ + struct unicode_data *data; + unsigned int maxage; + unsigned int nextage; + int count; + int i; + int j; + + /* Count the number of different ages. */ + count = 0; + nextage = (unsigned int)-1; + do { + maxage = nextage; + nextage = 0; + for (i = 0; i < corrections_count; i++) { + data = &corrections[i]; + if (nextage < data->correction && + data->correction < maxage) + nextage = data->correction; + } + count++; + } while (nextage); + + /* Two trees per age: nfkdi and nfkdicf */ + trees_count = count * 2; + trees = calloc(trees_count, sizeof(struct tree)); + + /* Assign ages to the trees. */ + count = trees_count; + nextage = (unsigned int)-1; + do { + maxage = nextage; + trees[--count].maxage = maxage; + trees[--count].maxage = maxage; + nextage = 0; + for (i = 0; i < corrections_count; i++) { + data = &corrections[i]; + if (nextage < data->correction && + data->correction < maxage) + nextage = data->correction; + } + } while (nextage); + + /* The ages assigned above are off by one. */ + for (i = 0; i != trees_count; i++) { + j = 0; + while (ages[j] < trees[i].maxage) + j++; + trees[i].maxage = ages[j-1]; + } + + /* Set up the forwarding between trees. */ + trees[trees_count-2].next = &trees[trees_count-1]; + trees[trees_count-1].leaf_mark = nfkdi_mark; + trees[trees_count-2].leaf_mark = nfkdicf_mark; + for (i = 0; i != trees_count-2; i += 2) { + trees[i].next = &trees[trees_count-2]; + trees[i].leaf_mark = correction_mark; + trees[i+1].next = &trees[trees_count-1]; + trees[i+1].leaf_mark = correction_mark; + } + + /* Assign the callouts.
*/ + for (i = 0; i != trees_count; i += 2) { + trees[i].type = "nfkdicf"; + trees[i].leaf_equal = nfkdicf_equal; + trees[i].leaf_print = nfkdicf_print; + trees[i].leaf_size = nfkdicf_size; + trees[i].leaf_index = nfkdicf_index; + trees[i].leaf_emit = nfkdicf_emit; + + trees[i+1].type = "nfkdi"; + trees[i+1].leaf_equal = nfkdi_equal; + trees[i+1].leaf_print = nfkdi_print; + trees[i+1].leaf_size = nfkdi_size; + trees[i+1].leaf_index = nfkdi_index; + trees[i+1].leaf_emit = nfkdi_emit; + } + + /* Finish init. */ + for (i = 0; i != trees_count; i++) + trees[i].childnode = NODE; +} + +static void +trees_populate(void) +{ + struct unicode_data *data; + unsigned int unichar; + char keyval[4]; + int keylen; + int i; + + for (i = 0; i != trees_count; i++) { + if (verbose > 0) { + printf("Populating %s_%x\n", + trees[i].type, trees[i].maxage); + } + for (unichar = 0; unichar != 0x110000; unichar++) { + if (unicode_data[unichar].gen < 0) + continue; + keylen = utf8key(unichar, keyval); + data = corrections_lookup(&unicode_data[unichar]); + if (data->correction <= trees[i].maxage) + data = &unicode_data[unichar]; + insert(&trees[i], keyval, keylen, data); + } + } +} + +static void +trees_reduce(void) +{ + int i; + int size; + int changed; + + for (i = 0; i != trees_count; i++) + prune(&trees[i]); + for (i = 0; i != trees_count; i++) + mark_nodes(&trees[i]); + do { + size = 0; + for (i = 0; i != trees_count; i++) + size = index_nodes(&trees[i], size); + changed = 0; + for (i = 0; i != trees_count; i++) + changed += size_nodes(&trees[i]); + } while (changed); + + utf8data = calloc(size, 1); + utf8data_size = size; + for (i = 0; i != trees_count; i++) + emit(&trees[i], utf8data); + + if (verbose > 0) { + for (i = 0; i != trees_count; i++) { + printf("%s_%x idx %d\n", + trees[i].type, trees[i].maxage, trees[i].index); + } + } + + nfkdi = utf8data + trees[trees_count-1].index; + nfkdicf = utf8data + trees[trees_count-2].index; + + nfkdi_tree = &trees[trees_count-1]; + nfkdicf_tree = 
&trees[trees_count-2]; +} + +static void +verify(struct tree *tree) +{ + struct unicode_data *data; + utf8leaf_t *leaf; + unsigned int unichar; + char key[4]; + int report; + int nocf; + + if (verbose > 0) + printf("Verifying %s_%x\n", tree->type, tree->maxage); + nocf = strcmp(tree->type, "nfkdicf"); + + for (unichar = 0; unichar != 0x110000; unichar++) { + report = 0; + data = corrections_lookup(&unicode_data[unichar]); + if (data->correction <= tree->maxage) + data = &unicode_data[unichar]; + utf8key(unichar, key); + leaf = utf8lookup(tree, key); + if (!leaf) { + if (data->gen != -1) + report++; + if (unichar < 0xd800 || unichar > 0xdfff) + report++; + } else { + if (unichar >= 0xd800 && unichar <= 0xdfff) + report++; + if (data->gen == -1) + report++; + if (data->gen != LEAF_GEN(leaf)) + report++; + if (LEAF_CCC(leaf) == DECOMPOSE) { + if (nocf) { + if (!data->utf8nfkdi) { + report++; + } else if (strcmp(data->utf8nfkdi, + LEAF_STR(leaf))) { + report++; + } + } else { + if (!data->utf8nfkdicf && + !data->utf8nfkdi) { + report++; + } else if (data->utf8nfkdicf) { + if (strcmp(data->utf8nfkdicf, + LEAF_STR(leaf))) + report++; + } else if (strcmp(data->utf8nfkdi, + LEAF_STR(leaf))) { + report++; + } + } + } else if (data->ccc != LEAF_CCC(leaf)) { + report++; + } + } + if (report) { + printf("%X code %X gen %d ccc %d" + " nfkdi -> \"%s\"", + unichar, data->code, data->gen, + data->ccc, + data->utf8nfkdi); + if (leaf) { + printf(" age %d ccc %d" + " nfkdi -> \"%s\"", + LEAF_GEN(leaf), + LEAF_CCC(leaf), + LEAF_CCC(leaf) == DECOMPOSE ?
+ LEAF_STR(leaf) : ""); + } + printf("\n"); + } + } +} + +static void +trees_verify(void) +{ + int i; + + for (i = 0; i != trees_count; i++) + verify(&trees[i]); +} + +/* ------------------------------------------------------------------ */ + +static void +help(void) +{ + printf("Usage: %s [options]\n", argv0); + printf("\n"); + printf("This program creates a data trie used for parsing and\n"); + printf("normalization of UTF-8 strings. The trie is derived from\n"); + printf("a set of input files from the Unicode character database\n"); + printf("found at: http://www.unicode.org/Public/UCD/latest/ucd/\n"); + printf("\n"); + printf("The generated tree supports two normalization forms:\n"); + printf("\n"); + printf("\tnfkdi:\n"); + printf("\t- Apply unicode normalization form NFKD.\n"); + printf("\t- Remove any Default_Ignorable_Code_Point.\n"); + printf("\n"); + printf("\tnfkdicf:\n"); + printf("\t- Apply unicode normalization form NFKD.\n"); + printf("\t- Remove any Default_Ignorable_Code_Point.\n"); + printf("\t- Apply a full casefold (C + F).\n"); + printf("\n"); + printf("These forms were chosen as being most useful when dealing\n"); + printf("with file names: NFKD catches most cases where characters\n"); + printf("should be considered equivalent.
The ignorables are mostly\n"); + printf("invisible, making names hard to type.\n"); + printf("\n"); + printf("The options to specify the files to be used are listed\n"); + printf("below with their default values, which are the names used\n"); + printf("by version 7.0.0 of the Unicode Character Database.\n"); + printf("\n"); + printf("The input files:\n"); + printf("\t-a %s\n", AGE_NAME); + printf("\t-c %s\n", CCC_NAME); + printf("\t-p %s\n", PROP_NAME); + printf("\t-d %s\n", DATA_NAME); + printf("\t-f %s\n", FOLD_NAME); + printf("\t-n %s\n", NORM_NAME); + printf("\n"); + printf("Additionally, the generated tables are tested using:\n"); + printf("\t-t %s\n", TEST_NAME); + printf("\n"); + printf("Finally, the output file:\n"); + printf("\t-o %s\n", UTF8_NAME); + printf("\n"); +} + +static void +usage(void) +{ + help(); + exit(1); +} + +static void +open_fail(const char *name, int error) +{ + printf("Error %d opening %s: %s\n", error, name, strerror(error)); + exit(1); +} + +static void +file_fail(const char *filename) +{ + printf("Error parsing %s\n", filename); + exit(1); +} + +static void +line_fail(const char *filename, const char *line) +{ + printf("Error parsing %s:%s\n", filename, line); + exit(1); +} + +/* ------------------------------------------------------------------ */ + +static void +print_utf32(unsigned int *utf32str) +{ + int i; + for (i = 0; utf32str[i]; i++) + printf(" %X", utf32str[i]); +} + +static void +print_utf32nfkdi(unsigned int unichar) +{ + printf(" %X ->", unichar); + print_utf32(unicode_data[unichar].utf32nfkdi); + printf("\n"); +} + +static void +print_utf32nfkdicf(unsigned int unichar) +{ + printf(" %X ->", unichar); + print_utf32(unicode_data[unichar].utf32nfkdicf); + printf("\n"); +} + +/* ------------------------------------------------------------------ */ + +static void +age_init(void) +{ + FILE *file; + unsigned int first; + unsigned int last; + unsigned int unichar; + unsigned int major; + unsigned int minor; + unsigned int 
revision; + int gen; + int count; + int ret; + + if (verbose > 0) + printf("Parsing %s\n", age_name); + + file = fopen(age_name, "r"); + if (!file) + open_fail(age_name, errno); + count = 0; + + gen = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "# Age=V%d_%d_%d", + &major, &minor, &revision); + if (ret == 3) { + ages_count++; + if (verbose > 1) + printf(" Age V%d_%d_%d\n", + major, minor, revision); + if (!age_valid(major, minor, revision)) + line_fail(age_name, line); + continue; + } + ret = sscanf(line, "# Age=V%d_%d", &major, &minor); + if (ret == 2) { + ages_count++; + if (verbose > 1) + printf(" Age V%d_%d\n", major, minor); + if (!age_valid(major, minor, 0)) + line_fail(age_name, line); + continue; + } + } + + /* We must have found something above. */ + if (verbose > 1) + printf("%d age entries\n", ages_count); + if (ages_count == 0 || ages_count > MAXGEN) + file_fail(age_name); + + /* There is a 0 entry. */ + ages_count++; + ages = calloc(ages_count + 1, sizeof(*ages)); + /* And a guard entry. 
*/ + ages[ages_count] = (unsigned int)-1; + + rewind(file); + count = 0; + gen = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "# Age=V%d_%d_%d", + &major, &minor, &revision); + if (ret == 3) { + ages[++gen] = + UNICODE_AGE(major, minor, revision); + if (verbose > 1) + printf(" Age V%d_%d_%d = gen %d\n", + major, minor, revision, gen); + if (!age_valid(major, minor, revision)) + line_fail(age_name, line); + continue; + } + ret = sscanf(line, "# Age=V%d_%d", &major, &minor); + if (ret == 2) { + ages[++gen] = UNICODE_AGE(major, minor, 0); + if (verbose > 1) + printf(" Age V%d_%d = %d\n", + major, minor, gen); + if (!age_valid(major, minor, 0)) + line_fail(age_name, line); + continue; + } + ret = sscanf(line, "%X..%X ; %d.%d #", + &first, &last, &major, &minor); + if (ret == 4) { + for (unichar = first; unichar <= last; unichar++) + unicode_data[unichar].gen = gen; + count += 1 + last - first; + if (verbose > 1) + printf(" %X..%X gen %d\n", first, last, gen); + if (!utf32valid(first) || !utf32valid(last)) + line_fail(age_name, line); + continue; + } + ret = sscanf(line, "%X ; %d.%d #", &unichar, &major, &minor); + if (ret == 3) { + unicode_data[unichar].gen = gen; + count++; + if (verbose > 1) + printf(" %X gen %d\n", unichar, gen); + if (!utf32valid(unichar)) + line_fail(age_name, line); + continue; + } + } + unicode_maxage = ages[gen]; + fclose(file); + + /* Nix surrogate block */ + if (verbose > 1) + printf(" Removing surrogate block D800..DFFF\n"); + for (unichar = 0xd800; unichar <= 0xdfff; unichar++) + unicode_data[unichar].gen = -1; + + if (verbose > 0) + printf("Found %d entries\n", count); + if (count == 0) + file_fail(age_name); +} + +static void +ccc_init(void) +{ + FILE *file; + unsigned int first; + unsigned int last; + unsigned int unichar; + unsigned int value; + int count; + int ret; + + if (verbose > 0) + printf("Parsing %s\n", ccc_name); + + file = fopen(ccc_name, "r"); + if (!file) + open_fail(ccc_name, errno); + + count = 0; + 
while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "%X..%X ; %d #", &first, &last, &value); + if (ret == 3) { + for (unichar = first; unichar <= last; unichar++) { + unicode_data[unichar].ccc = value; + count++; + } + if (verbose > 1) + printf(" %X..%X ccc %d\n", first, last, value); + if (!utf32valid(first) || !utf32valid(last)) + line_fail(ccc_name, line); + continue; + } + ret = sscanf(line, "%X ; %d #", &unichar, &value); + if (ret == 2) { + unicode_data[unichar].ccc = value; + count++; + if (verbose > 1) + printf(" %X ccc %d\n", unichar, value); + if (!utf32valid(unichar)) + line_fail(ccc_name, line); + continue; + } + } + fclose(file); + + if (verbose > 0) + printf("Found %d entries\n", count); + if (count == 0) + file_fail(ccc_name); +} + +static void +nfkdi_init(void) +{ + FILE *file; + unsigned int unichar; + unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */ + char *s; + unsigned int *um; + int count; + int i; + int ret; + + if (verbose > 0) + printf("Parsing %s\n", data_name); + file = fopen(data_name, "r"); + if (!file) + open_fail(data_name, errno); + + count = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "%X;%*[^;];%*[^;];%*[^;];%*[^;];%[^;];", + &unichar, buf0); + if (ret != 2) + continue; + if (!utf32valid(unichar)) + line_fail(data_name, line); + + s = buf0; + /* skip over <tag> */ + if (*s == '<') + while (*s++ != ' ') + ; + /* decode the decomposition into UTF-32 */ + i = 0; + while (*s) { + mapping[i] = strtoul(s, &s, 16); + if (!utf32valid(mapping[i])) + line_fail(data_name, line); + i++; + } + mapping[i++] = 0; + + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + unicode_data[unichar].utf32nfkdi = um; + + if (verbose > 1) + print_utf32nfkdi(unichar); + count++; + } + fclose(file); + if (verbose > 0) + printf("Found %d entries\n", count); + if (count == 0) + file_fail(data_name); +} + +static void +nfkdicf_init(void) +{ + FILE *file; + unsigned int unichar; 
+ unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */ + char status; + char *s; + unsigned int *um; + int i; + int count; + int ret; + + if (verbose > 0) + printf("Parsing %s\n", fold_name); + file = fopen(fold_name, "r"); + if (!file) + open_fail(fold_name, errno); + + count = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "%X; %c; %[^;];", &unichar, &status, buf0); + if (ret != 3) + continue; + if (!utf32valid(unichar)) + line_fail(fold_name, line); + /* Use the C+F casefold. */ + if (status != 'C' && status != 'F') + continue; + s = buf0; + if (*s == '<') + while (*s++ != ' ') + ; + i = 0; + while (*s) { + mapping[i] = strtoul(s, &s, 16); + if (!utf32valid(mapping[i])) + line_fail(fold_name, line); + i++; + } + mapping[i++] = 0; + + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + unicode_data[unichar].utf32nfkdicf = um; + + if (verbose > 1) + print_utf32nfkdicf(unichar); + count++; + } + fclose(file); + if (verbose > 0) + printf("Found %d entries\n", count); + if (count == 0) + file_fail(fold_name); +} + +static void +ignore_init(void) +{ + FILE *file; + unsigned int unichar; + unsigned int first; + unsigned int last; + unsigned int *um; + int count; + int ret; + + if (verbose > 0) + printf("Parsing %s\n", prop_name); + file = fopen(prop_name, "r"); + if (!file) + open_fail(prop_name, errno); + assert(file); + count = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "%X..%X ; %s # ", &first, &last, buf0); + if (ret == 3) { + if (strcmp(buf0, "Default_Ignorable_Code_Point")) + continue; + if (!utf32valid(first) || !utf32valid(last)) + line_fail(prop_name, line); + for (unichar = first; unichar <= last; unichar++) { + free(unicode_data[unichar].utf32nfkdi); + um = malloc(sizeof(unsigned int)); + *um = 0; + unicode_data[unichar].utf32nfkdi = um; + free(unicode_data[unichar].utf32nfkdicf); + um = malloc(sizeof(unsigned int)); + *um = 0; + unicode_data[unichar].utf32nfkdicf = 
um; + count++; + } + if (verbose > 1) + printf(" %X..%X Default_Ignorable_Code_Point\n", + first, last); + continue; + } + ret = sscanf(line, "%X ; %s # ", &unichar, buf0); + if (ret == 2) { + if (strcmp(buf0, "Default_Ignorable_Code_Point")) + continue; + if (!utf32valid(unichar)) + line_fail(prop_name, line); + free(unicode_data[unichar].utf32nfkdi); + um = malloc(sizeof(unsigned int)); + *um = 0; + unicode_data[unichar].utf32nfkdi = um; + free(unicode_data[unichar].utf32nfkdicf); + um = malloc(sizeof(unsigned int)); + *um = 0; + unicode_data[unichar].utf32nfkdicf = um; + if (verbose > 1) + printf(" %X Default_Ignorable_Code_Point\n", + unichar); + count++; + continue; + } + } + fclose(file); + + if (verbose > 0) + printf("Found %d entries\n", count); + if (count == 0) + file_fail(prop_name); +} + +static void +corrections_init(void) +{ + FILE *file; + unsigned int unichar; + unsigned int major; + unsigned int minor; + unsigned int revision; + unsigned int age; + unsigned int *um; + unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. 
*/ + char *s; + int i; + int count; + int ret; + + if (verbose > 0) + printf("Parsing %s\n", norm_name); + file = fopen(norm_name, "r"); + if (!file) + open_fail(norm_name, errno); + + count = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "%X;%[^;];%[^;];%d.%d.%d #", + &unichar, buf0, buf1, + &major, &minor, &revision); + if (ret != 6) + continue; + if (!utf32valid(unichar) || !age_valid(major, minor, revision)) + line_fail(norm_name, line); + count++; + } + corrections = calloc(count, sizeof(struct unicode_data)); + corrections_count = count; + rewind(file); + + count = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "%X;%[^;];%[^;];%d.%d.%d #", + &unichar, buf0, buf1, + &major, &minor, &revision); + if (ret != 6) + continue; + if (!utf32valid(unichar) || !age_valid(major, minor, revision)) + line_fail(norm_name, line); + corrections[count] = unicode_data[unichar]; + assert(corrections[count].code == unichar); + age = UNICODE_AGE(major, minor, revision); + corrections[count].correction = age; + + i = 0; + s = buf0; + while (*s) { + mapping[i] = strtoul(s, &s, 16); + if (!utf32valid(mapping[i])) + line_fail(norm_name, line); + i++; + } + mapping[i++] = 0; + + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + corrections[count].utf32nfkdi = um; + + if (verbose > 1) + printf(" %X -> %s -> %s V%d_%d_%d\n", + unichar, buf0, buf1, major, minor, revision); + count++; + } + fclose(file); + + if (verbose > 0) + printf("Found %d entries\n", count); + if (count == 0) + file_fail(norm_name); +} + +/* ------------------------------------------------------------------ */ + +/* + * Hangul decomposition (algorithm from Section 3.12 of Unicode 6.3.0) + * + * AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;; + * D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;; + * + * SBase = 0xAC00 + * LBase = 0x1100 + * VBase = 0x1161 + * TBase = 0x11A7 + * LCount = 19 + * VCount = 21 + * TCount = 28 + * NCount = 588 (VCount * 
TCount) + * SCount = 11172 (LCount * NCount) + * + * Decomposition: + * SIndex = s - SBase + * + * LV (Canonical/Full) + * LIndex = SIndex / NCount + * VIndex = (SIndex % NCount) / TCount + * LPart = LBase + LIndex + * VPart = VBase + VIndex + * + * LVT (Canonical) + * LVIndex = (SIndex / TCount) * TCount + * TIndex = SIndex % TCount + * LVPart = SBase + LVIndex + * TPart = TBase + TIndex + * + * LVT (Full) + * LIndex = SIndex / NCount + * VIndex = (SIndex % NCount) / TCount + * TIndex = SIndex % TCount + * LPart = LBase + LIndex + * VPart = VBase + VIndex + * if (TIndex == 0) { + * d = <LPart, VPart> + * } else { + * TPart = TBase + TIndex + * d = <LPart, VPart, TPart> + * } + * + */ + +static void +hangul_decompose(void) +{ + unsigned int sb = 0xAC00; + unsigned int lb = 0x1100; + unsigned int vb = 0x1161; + unsigned int tb = 0x11a7; + /* unsigned int lc = 19; */ + unsigned int vc = 21; + unsigned int tc = 28; + unsigned int nc = (vc * tc); + /* unsigned int sc = (lc * nc); */ + unsigned int unichar; + unsigned int mapping[4]; + unsigned int *um; + int count; + int i; + + if (verbose > 0) + printf("Decomposing hangul\n"); + /* Hangul */ + count = 0; + for (unichar = 0xAC00; unichar <= 0xD7A3; unichar++) { + unsigned int si = unichar - sb; + unsigned int li = si / nc; + unsigned int vi = (si % nc) / tc; + unsigned int ti = si % tc; + + i = 0; + mapping[i++] = lb + li; + mapping[i++] = vb + vi; + if (ti) + mapping[i++] = tb + ti; + mapping[i++] = 0; + + assert(!unicode_data[unichar].utf32nfkdi); + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + unicode_data[unichar].utf32nfkdi = um; + + assert(!unicode_data[unichar].utf32nfkdicf); + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + unicode_data[unichar].utf32nfkdicf = um; + + if (verbose > 1) + print_utf32nfkdi(unichar); + + count++; + } + if (verbose > 0) + printf("Created %d entries\n", count); +} + +static void
+nfkdi_decompose(void) +{ + unsigned int unichar; + unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */ + unsigned int *um; + unsigned int *dc; + int count; + int i; + int j; + int ret; + + if (verbose > 0) + printf("Decomposing nfkdi\n"); + + count = 0; + for (unichar = 0; unichar != 0x110000; unichar++) { + if (!unicode_data[unichar].utf32nfkdi) + continue; + for (;;) { + ret = 1; + i = 0; + um = unicode_data[unichar].utf32nfkdi; + while (*um) { + dc = unicode_data[*um].utf32nfkdi; + if (dc) { + for (j = 0; dc[j]; j++) + mapping[i++] = dc[j]; + ret = 0; + } else { + mapping[i++] = *um; + } + um++; + } + mapping[i++] = 0; + if (ret) + break; + free(unicode_data[unichar].utf32nfkdi); + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + unicode_data[unichar].utf32nfkdi = um; + } + /* Add this decomposition to nfkdicf if there is no entry. */ + if (!unicode_data[unichar].utf32nfkdicf) { + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + unicode_data[unichar].utf32nfkdicf = um; + } + if (verbose > 1) + print_utf32nfkdi(unichar); + count++; + } + if (verbose > 0) + printf("Processed %d entries\n", count); +} + +static void +nfkdicf_decompose(void) +{ + unsigned int unichar; + unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. 
*/ + unsigned int *um; + unsigned int *dc; + int count; + int i; + int j; + int ret; + + if (verbose > 0) + printf("Decomposing nfkdicf\n"); + count = 0; + for (unichar = 0; unichar != 0x110000; unichar++) { + if (!unicode_data[unichar].utf32nfkdicf) + continue; + for (;;) { + ret = 1; + i = 0; + um = unicode_data[unichar].utf32nfkdicf; + while (*um) { + dc = unicode_data[*um].utf32nfkdicf; + if (dc) { + for (j = 0; dc[j]; j++) + mapping[i++] = dc[j]; + ret = 0; + } else { + mapping[i++] = *um; + } + um++; + } + mapping[i++] = 0; + if (ret) + break; + free(unicode_data[unichar].utf32nfkdicf); + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + unicode_data[unichar].utf32nfkdicf = um; + } + if (verbose > 1) + print_utf32nfkdicf(unichar); + count++; + } + if (verbose > 0) + printf("Processed %d entries\n", count); +} + +/* ------------------------------------------------------------------ */ + +int utf8agemax(struct tree *, const char *); +int utf8nagemax(struct tree *, const char *, size_t); +int utf8agemin(struct tree *, const char *); +int utf8nagemin(struct tree *, const char *, size_t); +ssize_t utf8len(struct tree *, const char *); +ssize_t utf8nlen(struct tree *, const char *, size_t); +struct utf8cursor; +int utf8cursor(struct utf8cursor *, struct tree *, const char *); +int utf8ncursor(struct utf8cursor *, struct tree *, const char *, size_t); +int utf8byte(struct utf8cursor *); + +/* + * Use trie to scan s, touching at most len bytes. + * Returns the leaf if one exists, NULL otherwise. + * + * A non-NULL return guarantees that the UTF-8 sequence starting at s + * is well-formed and corresponds to a known unicode code point. The + * shorthand for this will be "is valid UTF-8 unicode". 
+ */ +static utf8leaf_t * +utf8nlookup(struct tree *tree, const char *s, size_t len) +{ + utf8trie_t *trie; + int offlen; + int offset; + int mask; + int node; + + if (!tree) + return NULL; + if (len == 0) + return NULL; + node = 1; + trie = utf8data + tree->index; + while (node) { + offlen = (*trie & OFFLEN) >> OFFLEN_SHIFT; + if (*trie & NEXTBYTE) { + if (--len == 0) + return NULL; + s++; + } + mask = 1 << (*trie & BITNUM); + if (*s & mask) { + /* Right leg */ + if (offlen) { + /* Right node at offset of trie */ + node = (*trie & RIGHTNODE); + offset = trie[offlen]; + while (--offlen) { + offset <<= 8; + offset |= trie[offlen]; + } + trie += offset; + } else if (*trie & RIGHTPATH) { + /* Right node after this node */ + node = (*trie & TRIENODE); + trie++; + } else { + /* No right node. */ + node = 0; + trie = NULL; + } + } else { + /* Left leg */ + if (offlen) { + /* Left node after this node. */ + node = (*trie & LEFTNODE); + trie += offlen + 1; + } else if (*trie & RIGHTPATH) { + /* No left node. */ + node = 0; + trie = NULL; + } else { + /* Left node after this node */ + node = (*trie & TRIENODE); + trie++; + } + } + } + return trie; +} + +/* + * Use trie to scan s. + * Returns the leaf if one exists, NULL otherwise. + * + * Forwards to utf8nlookup(). + */ +static utf8leaf_t * +utf8lookup(struct tree *tree, const char *s) +{ + return utf8nlookup(tree, s, (size_t)-1); +} + +/* + * Return the number of bytes used by the current UTF-8 sequence. + * Assumes the input points to the first byte of a valid UTF-8 + * sequence. + */ +static inline int +utf8clen(const char *s) +{ + unsigned char c = *s; + return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0); +} + +/* + * Maximum age of any character in s. + * Return -1 if s is not valid UTF-8 unicode. + * Return 0 if only non-assigned code points are used.
+ */ +int +utf8agemax(struct tree *tree, const char *s) +{ + utf8leaf_t *leaf; + int age = 0; + int leaf_age; + + if (!tree) + return -1; + while (*s) { + if (!(leaf = utf8lookup(tree, s))) + return -1; + leaf_age = ages[LEAF_GEN(leaf)]; + if (leaf_age <= tree->maxage && leaf_age > age) + age = leaf_age; + s += utf8clen(s); + } + return age; +} + +/* + * Minimum age of any character in s. + * Return -1 if s is not valid UTF-8 unicode. + * Return 0 if non-assigned code points are used. + */ +int +utf8agemin(struct tree *tree, const char *s) +{ + utf8leaf_t *leaf; + int age; + int leaf_age; + + if (!tree) + return -1; + age = tree->maxage; + while (*s) { + if (!(leaf = utf8lookup(tree, s))) + return -1; + leaf_age = ages[LEAF_GEN(leaf)]; + if (leaf_age <= tree->maxage && leaf_age < age) + age = leaf_age; + s += utf8clen(s); + } + return age; +} + +/* + * Maximum age of any character in s, touch at most len bytes. + * Return -1 if s is not valid UTF-8 unicode. + */ +int +utf8nagemax(struct tree *tree, const char *s, size_t len) +{ + utf8leaf_t *leaf; + int age = 0; + int leaf_age; + + if (!tree) + return -1; + while (len && *s) { + if (!(leaf = utf8nlookup(tree, s, len))) + return -1; + leaf_age = ages[LEAF_GEN(leaf)]; + if (leaf_age <= tree->maxage && leaf_age > age) + age = leaf_age; + len -= utf8clen(s); + s += utf8clen(s); + } + return age; +} + +/* + * Minimum age of any character in s, touch at most len bytes. + * Return -1 if s is not valid UTF-8 unicode. + */ +int +utf8nagemin(struct tree *tree, const char *s, size_t len) +{ + utf8leaf_t *leaf; + int leaf_age; + int age; + + if (!tree) + return -1; + age = tree->maxage; + while (len && *s) { + if (!(leaf = utf8nlookup(tree, s, len))) + return -1; + leaf_age = ages[LEAF_GEN(leaf)]; + if (leaf_age <= tree->maxage && leaf_age < age) + age = leaf_age; + len -= utf8clen(s); + s += utf8clen(s); + } + return age; +} + +/* + * Length of the normalization of s. + * Return -1 if s is not valid UTF-8 unicode.
+ * + * A string of Default_Ignorable_Code_Point has length 0. + */ +ssize_t +utf8len(struct tree *tree, const char *s) +{ + utf8leaf_t *leaf; + size_t ret = 0; + + if (!tree) + return -1; + while (*s) { + if (!(leaf = utf8lookup(tree, s))) + return -1; + if (ages[LEAF_GEN(leaf)] > tree->maxage) + ret += utf8clen(s); + else if (LEAF_CCC(leaf) == DECOMPOSE) + ret += strlen(LEAF_STR(leaf)); + else + ret += utf8clen(s); + s += utf8clen(s); + } + return ret; +} + +/* + * Length of the normalization of s, touch at most len bytes. + * Return -1 if s is not valid UTF-8 unicode. + */ +ssize_t +utf8nlen(struct tree *tree, const char *s, size_t len) +{ + utf8leaf_t *leaf; + size_t ret = 0; + + if (!tree) + return -1; + while (len && *s) { + if (!(leaf = utf8nlookup(tree, s, len))) + return -1; + if (ages[LEAF_GEN(leaf)] > tree->maxage) + ret += utf8clen(s); + else if (LEAF_CCC(leaf) == DECOMPOSE) + ret += strlen(LEAF_STR(leaf)); + else + ret += utf8clen(s); + len -= utf8clen(s); + s += utf8clen(s); + } + return ret; +} + +/* + * Cursor structure used by the normalizer. + */ +struct utf8cursor { + struct tree *tree; + const char *s; + const char *p; + const char *ss; + const char *sp; + unsigned int len; + unsigned int slen; + short int ccc; + short int nccc; + unsigned int unichar; +}; + +/* + * Set up an utf8cursor for use by utf8byte(). + * + * s : string. + * len : length of s. + * u8c : pointer to cursor. + * trie : utf8trie_t to use for normalization. + * + * Returns -1 on error, 0 on success. + */ +int +utf8ncursor( + struct utf8cursor *u8c, + struct tree *tree, + const char *s, + size_t len) +{ + if (!tree) + return -1; + if (!s) + return -1; + u8c->tree = tree; + u8c->s = s; + u8c->p = NULL; + u8c->ss = NULL; + u8c->sp = NULL; + u8c->len = len; + u8c->slen = 0; + u8c->ccc = STOPPER; + u8c->nccc = STOPPER; + u8c->unichar = 0; + /* Check we didn't clobber the maximum length. 
*/
+	if (u8c->len != len)
+		return -1;
+	/* The first byte of s may not be an utf8 continuation. */
+	if (len > 0 && (*s & 0xC0) == 0x80)
+		return -1;
+	return 0;
+}
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ * s : NUL-terminated string.
+ * u8c : pointer to cursor.
+ * trie : utf8trie_t to use for normalization.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int
+utf8cursor(
+	struct utf8cursor *u8c,
+	struct tree *tree,
+	const char *s)
+{
+	return utf8ncursor(u8c, tree, s, (unsigned int)-1);
+}
+
+/*
+ * Get one byte from the normalized form of the string described by u8c.
+ *
+ * Returns the byte cast to an unsigned char on success, and -1 on failure.
+ *
+ * The cursor keeps track of the location in the string in u8c->s.
+ * When a character is decomposed, the current location is stored in
+ * u8c->p, and u8c->s is set to the start of the decomposition. Note
+ * that bytes from a decomposition do not count against u8c->len.
+ *
+ * Characters are emitted if they match the current CCC in u8c->ccc.
+ * Hitting end-of-string while u8c->ccc == STOPPER means we're done,
+ * and the function returns 0 in that case.
+ *
+ * Sorting by CCC is done by repeatedly scanning the string. The
+ * values of u8c->s and u8c->p are stored in u8c->ss and u8c->sp at
+ * the start of the scan. The first pass finds the lowest CCC to be
+ * emitted and stores it in u8c->nccc, the second pass emits the
+ * characters with this CCC and finds the next lowest CCC. This limits
+ * the number of passes to 1 + the number of different CCCs in the
+ * sequence being scanned.
+ *
+ * Therefore:
+ * u8c->p != NULL -> a decomposition is being scanned.
+ * u8c->ss != NULL -> this is a repeating scan.
+ * u8c->ccc == -1 -> this is the first scan of a repeating scan.
+ */
+int
+utf8byte(struct utf8cursor *u8c)
+{
+	utf8leaf_t *leaf;
+	int ccc;
+
+	for (;;) {
+		/* Check for the end of a decomposed character.
*/ + if (u8c->p && *u8c->s == '\0') { + u8c->s = u8c->p; + u8c->p = NULL; + } + + /* Check for end-of-string. */ + if (!u8c->p && (u8c->len == 0 || *u8c->s == '\0')) { + /* There is no next byte. */ + if (u8c->ccc == STOPPER) + return 0; + /* End-of-string during a scan counts as a stopper. */ + ccc = STOPPER; + goto ccc_mismatch; + } else if ((*u8c->s & 0xC0) == 0x80) { + /* This is a continuation of the current character. */ + if (!u8c->p) + u8c->len--; + return (unsigned char)*u8c->s++; + } + + /* Look up the data for the current character. */ + if (u8c->p) + leaf = utf8lookup(u8c->tree, u8c->s); + else + leaf = utf8nlookup(u8c->tree, u8c->s, u8c->len); + + /* No leaf found implies that the input is a binary blob. */ + if (!leaf) + return -1; + + /* Characters that are too new have CCC 0. */ + if (ages[LEAF_GEN(leaf)] > u8c->tree->maxage) { + ccc = STOPPER; + } else if ((ccc = LEAF_CCC(leaf)) == DECOMPOSE) { + u8c->len -= utf8clen(u8c->s); + u8c->p = u8c->s + utf8clen(u8c->s); + u8c->s = LEAF_STR(leaf); + /* Empty decomposition implies CCC 0. */ + if (*u8c->s == '\0') { + if (u8c->ccc == STOPPER) + continue; + ccc = STOPPER; + goto ccc_mismatch; + } + leaf = utf8lookup(u8c->tree, u8c->s); + ccc = LEAF_CCC(leaf); + } + u8c->unichar = utf8code(u8c->s); + + /* + * If this is not a stopper, then see if it updates + * the next canonical class to be emitted. + */ + if (ccc != STOPPER && u8c->ccc < ccc && ccc < u8c->nccc) + u8c->nccc = ccc; + + /* + * Return the current byte if this is the current + * combining class. + */ + if (ccc == u8c->ccc) { + if (!u8c->p) + u8c->len--; + return (unsigned char)*u8c->s++; + } + + /* Current combining class mismatch. */ + ccc_mismatch: + if (u8c->nccc == STOPPER) { + /* + * Scan forward for the first canonical class + * to be emitted. Save the position from + * which to restart. 
+ */ + assert(u8c->ccc == STOPPER); + u8c->ccc = MINCCC - 1; + u8c->nccc = ccc; + u8c->sp = u8c->p; + u8c->ss = u8c->s; + u8c->slen = u8c->len; + if (!u8c->p) + u8c->len -= utf8clen(u8c->s); + u8c->s += utf8clen(u8c->s); + } else if (ccc != STOPPER) { + /* Not a stopper, and not the ccc we're emitting. */ + if (!u8c->p) + u8c->len -= utf8clen(u8c->s); + u8c->s += utf8clen(u8c->s); + } else if (u8c->nccc != MAXCCC + 1) { + /* At a stopper, restart for next ccc. */ + u8c->ccc = u8c->nccc; + u8c->nccc = MAXCCC + 1; + u8c->s = u8c->ss; + u8c->p = u8c->sp; + u8c->len = u8c->slen; + } else { + /* All done, proceed from here. */ + u8c->ccc = STOPPER; + u8c->nccc = STOPPER; + u8c->sp = NULL; + u8c->ss = NULL; + u8c->slen = 0; + } + } +} + +/* ------------------------------------------------------------------ */ + +static int +normalize_line(struct tree *tree) +{ + char *s; + char *t; + int c; + struct utf8cursor u8c; + + /* First test: null-terminated string. */ + s = buf2; + t = buf3; + if (utf8cursor(&u8c, tree, s)) + return -1; + while ((c = utf8byte(&u8c)) > 0) + if (c != (unsigned char)*t++) + return -1; + if (c < 0) + return -1; + if (*t != 0) + return -1; + + /* Second test: length-limited string. */ + s = buf2; + /* Replace NUL with a value that will cause an error if seen. */ + s[strlen(s) + 1] = -1; + t = buf3; + if (utf8cursor(&u8c, tree, s)) + return -1; + while ((c = utf8byte(&u8c)) > 0) + if (c != (unsigned char)*t++) + return -1; + if (c < 0) + return -1; + if (*t != 0) + return -1; + + return 0; +} + +static void +normalization_test(void) +{ + FILE *file; + unsigned int unichar; + struct unicode_data *data; + char *s; + char *t; + int ret; + int ignorables; + int tests = 0; + int failures = 0; + + if (verbose > 0) + printf("Parsing %s\n", test_name); + /* Step one, read data from file. 
*/ + file = fopen(test_name, "r"); + if (!file) + open_fail(test_name, errno); + + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "%[^;];%*[^;];%*[^;];%*[^;];%[^;];", + buf0, buf1); + if (ret != 2 || *line == '#') + continue; + s = buf0; + t = buf2; + while (*s) { + unichar = strtoul(s, &s, 16); + t += utf8key(unichar, t); + } + *t = '\0'; + + ignorables = 0; + s = buf1; + t = buf3; + while (*s) { + unichar = strtoul(s, &s, 16); + data = &unicode_data[unichar]; + if (data->utf8nfkdi && !*data->utf8nfkdi) + ignorables = 1; + else + t += utf8key(unichar, t); + } + *t = '\0'; + + tests++; + if (normalize_line(nfkdi_tree) < 0) { + printf("\nline %s -> %s", buf0, buf1); + if (ignorables) + printf(" (ignorables removed)"); + printf(" failure\n"); + failures++; + } + } + fclose(file); + if (verbose > 0) + printf("Ran %d tests with %d failures\n", tests, failures); + if (failures) + file_fail(test_name); +} + +/* ------------------------------------------------------------------ */ + +static void +write_file(void) +{ + FILE *file; + int i; + int j; + int t; + int gen; + + if (verbose > 0) + printf("Writing %s\n", utf8_name); + file = fopen(utf8_name, "w"); + if (!file) + open_fail(utf8_name, errno); + + fprintf(file, "/* This file is generated code, do not edit. */\n"); + fprintf(file, "#ifndef __INCLUDED_FROM_UTF8NORM_C__\n"); + fprintf(file, "#error Only xfs_utf8.c may include this file.\n"); + fprintf(file, "#endif\n"); + fprintf(file, "\n"); + fprintf(file, "const unsigned int utf8version = %#x;\n", + unicode_maxage); + fprintf(file, "\n"); + fprintf(file, "static const unsigned int utf8agetab[] = {\n"); + for (i = 0; i != ages_count; i++) + fprintf(file, "\t%#x%s\n", ages[i], + ages[i] == unicode_maxage ? 
"" : ","); + fprintf(file, "};\n"); + fprintf(file, "\n"); + fprintf(file, "static const struct utf8data utf8nfkdicfdata[] = {\n"); + t = 0; + for (gen = 0; gen < ages_count; gen++) { + fprintf(file, "\t{ %#x, %d }%s\n", + ages[gen], trees[t].index, + ages[gen] == unicode_maxage ? "" : ","); + if (trees[t].maxage == ages[gen]) + t += 2; + } + fprintf(file, "};\n"); + fprintf(file, "\n"); + fprintf(file, "static const struct utf8data utf8nfkdidata[] = {\n"); + t = 1; + for (gen = 0; gen < ages_count; gen++) { + fprintf(file, "\t{ %#x, %d }%s\n", + ages[gen], trees[t].index, + ages[gen] == unicode_maxage ? "" : ","); + if (trees[t].maxage == ages[gen]) + t += 2; + } + fprintf(file, "};\n"); + fprintf(file, "\n"); + fprintf(file, "static const unsigned char utf8data[%zd] = {\n", + utf8data_size); + t = 0; + for (i = 0; i != utf8data_size; i += 16) { + if (i == trees[t].index) { + fprintf(file, "\t/* %s_%x */\n", + trees[t].type, trees[t].maxage); + if (t < trees_count-1) + t++; + } + fprintf(file, "\t"); + for (j = i; j != i + 16; j++) + fprintf(file, "0x%.2x%s", utf8data[j], + (j < utf8data_size -1 ? 
"," : "")); + fprintf(file, "\n"); + } + fprintf(file, "};\n"); + fclose(file); +} + +/* ------------------------------------------------------------------ */ + +int +main(int argc, char *argv[]) +{ + unsigned int unichar; + int opt; + + argv0 = argv[0]; + + while ((opt = getopt(argc, argv, "a:c:d:f:hn:o:p:t:v")) != -1) { + switch (opt) { + case 'a': + age_name = optarg; + break; + case 'c': + ccc_name = optarg; + break; + case 'd': + data_name = optarg; + break; + case 'f': + fold_name = optarg; + break; + case 'n': + norm_name = optarg; + break; + case 'o': + utf8_name = optarg; + break; + case 'p': + prop_name = optarg; + break; + case 't': + test_name = optarg; + break; + case 'v': + verbose++; + break; + case 'h': + help(); + exit(0); + default: + usage(); + } + } + + if (verbose > 1) + help(); + for (unichar = 0; unichar != 0x110000; unichar++) + unicode_data[unichar].code = unichar; + age_init(); + ccc_init(); + nfkdi_init(); + nfkdicf_init(); + ignore_init(); + corrections_init(); + hangul_decompose(); + nfkdi_decompose(); + nfkdicf_decompose(); + utf8_init(); + trees_init(); + trees_populate(); + trees_reduce(); + trees_verify(); + /* Prevent "unused function" warning. */ + (void)lookup(nfkdi_tree, " "); + if (verbose > 2) + tree_walk(nfkdi_tree); + if (verbose > 2) + tree_walk(nfkdicf_tree); + normalization_test(); + write_file(); + + return 0; +} -- 1.7.12.4 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 84+ messages in thread
* [PATCH 08/13] libxfs: add xfs_nameops for utf8 and utf8+casefold.
2014-09-18 20:31 ` [PATCH 00/13] xfsprogs: Unicode/UTF-8 support for XFS Ben Myers
` (6 preceding siblings ...)
2014-09-18 20:38 ` [PATCH 07/13] libxfs: add trie generator and supporting code for UTF-8 Ben Myers
@ 2014-09-18 20:38 ` Ben Myers
2014-09-18 20:39 ` [PATCH 09/13] libxfs: apply utf-8 normalization rules to user extended attribute names Ben Myers
` (6 subsequent siblings)
14 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-18 20:38 UTC (permalink / raw)
To: linux-fsdevel; +Cc: tinguely, olaf, xfs
From: Olaf Weber <olaf@sgi.com>
The xfs_utf8_nameops use the nfkdi normalization when comparing filenames,
and are installed if the utf8bit is set in the super block.
The xfs_utf8_ci_nameops use the nfkdicf normalization when comparing
filenames, and are installed if both the utf8bit and the borgbit are set
in the superblock.
Normalized filenames are not stored on disk. Normalization will fail if a
filename is not valid UTF-8, in which case the filename is treated as an
opaque blob.
Changes:
Type conversion to "(const char *)" added to utf8ncursor() and utf8nlen()
calls.
Signed-off-by: Olaf Weber <olaf@sgi.com> --- Makefile | 2 +- include/libxfs.h | 1 + include/xfs_utf8.h | 25 ++++++ libxfs/Makefile | 4 +- libxfs/xfs_dir2.c | 15 +++- libxfs/xfs_utf8.c | 238 +++++++++++++++++++++++++++++++++++++++++++++++++++++ support/Makefile | 24 ++++++ 7 files changed, 303 insertions(+), 6 deletions(-) create mode 100644 include/xfs_utf8.h create mode 100644 libxfs/xfs_utf8.c create mode 100644 support/Makefile diff --git a/Makefile b/Makefile index f56aebd..c442da6 100644 --- a/Makefile +++ b/Makefile @@ -40,7 +40,7 @@ LDIRDIRT = $(SRCDIR) LDIRT += $(SRCTAR) endif -LIB_SUBDIRS = libxfs libxlog libxcmd libhandle libdisk +LIB_SUBDIRS = support libxfs libxlog libxcmd libhandle libdisk TOOL_SUBDIRS = copy db estimate fsck fsr growfs io logprint mkfs quota \ mdrestore repair rtcp m4 man doc po debian diff --git a/include/libxfs.h b/include/libxfs.h index 45a924f..99cb3d9 100644 --- a/include/libxfs.h +++ b/include/libxfs.h @@ -59,6 +59,7 @@ #include <xfs/xfs_btree_trace.h> #include <xfs/xfs_bmap.h> #include <xfs/xfs_trace.h> +#include <xfs_utf8.h> #ifndef ARRAY_SIZE diff --git a/include/xfs_utf8.h b/include/xfs_utf8.h new file mode 100644 index 0000000..97b6a91 --- /dev/null +++ b/include/xfs_utf8.h @@ -0,0 +1,25 @@ +/* + * Copyright (c) 2014 SGI. + * All rights reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write the Free Software Foundation, + * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#ifndef XFS_UTF8_H +#define XFS_UTF8_H + +extern struct xfs_nameops xfs_utf8_nameops; +extern struct xfs_nameops xfs_utf8_ci_nameops; + +#endif /* XFS_UTF8_H */ diff --git a/libxfs/Makefile b/libxfs/Makefile index ae15a5d..d836027 100644 --- a/libxfs/Makefile +++ b/libxfs/Makefile @@ -14,6 +14,7 @@ HFILES = xfs.h init.h xfs_dir2_priv.h crc32defs.h crc32table.h CFILES = cache.c \ crc32.c \ init.c kmem.c logitem.c radix-tree.c rdwr.c trans.c util.c \ + utf8norm.c \ xfs_alloc.c \ xfs_alloc_btree.c \ xfs_attr.c \ @@ -38,7 +39,8 @@ CFILES = cache.c \ xfs_rtbitmap.c \ xfs_sb.c \ xfs_symlink_remote.c \ - xfs_trans_resv.c + xfs_trans_resv.c \ + xfs_utf8.c CFILES += $(PKG_PLATFORM).c PCFILES = darwin.c freebsd.c irix.c linux.c diff --git a/libxfs/xfs_dir2.c b/libxfs/xfs_dir2.c index 1893931..6872844 100644 --- a/libxfs/xfs_dir2.c +++ b/libxfs/xfs_dir2.c @@ -123,10 +123,17 @@ xfs_dir_mount( (uint)sizeof(xfs_da_node_entry_t); mp->m_dir_magicpct = (mp->m_dirblksize * 37) / 100; - if (xfs_sb_version_hasasciici(&mp->m_sb)) - mp->m_dirnameops = &xfs_ascii_ci_nameops; - else - mp->m_dirnameops = &xfs_default_nameops; + if (xfs_sb_version_hasutf8(&mp->m_sb)) { + if (xfs_sb_version_hasasciici(&mp->m_sb)) + mp->m_dirnameops = &xfs_utf8_ci_nameops; + else + mp->m_dirnameops = &xfs_utf8_nameops; + } else { + if (xfs_sb_version_hasasciici(&mp->m_sb)) + mp->m_dirnameops = &xfs_ascii_ci_nameops; + else + mp->m_dirnameops = &xfs_default_nameops; + } } /* diff --git a/libxfs/xfs_utf8.c b/libxfs/xfs_utf8.c new file mode 100644 index 0000000..f5cc231 --- /dev/null +++ b/libxfs/xfs_utf8.c @@ -0,0 +1,238 @@ +/* + * Copyright (c) 2014 SGI. + * All rights reserved. 
+ * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write the Free Software Foundation, + * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include "xfs.h" +#include "xfs_fs.h" +#include "xfs_types.h" +#include "xfs_bit.h" +#include "xfs_inum.h" +#include "xfs_sb.h" +#include "xfs_ag.h" +#include "xfs_dir2.h" +#include "xfs_da_btree.h" +#include "xfs_bmap_btree.h" +#include "xfs_alloc_btree.h" +#include "xfs_dinode.h" +#include "xfs_inode_fork.h" +#include "xfs_bmap.h" +#include "xfs_dir2.h" +#include "xfs_trace.h" +#include "xfs_utf8.h" +#include "utf8norm.h" + +/* + * xfs nameops using nfkdi + */ + +static xfs_dahash_t +xfs_utf8_hashname( + const unsigned char *name, + int len) +{ + utf8data_t nfkdi; + struct utf8cursor u8c; + xfs_dahash_t hash; + int val; + + nfkdi = utf8nfkdi(utf8version); + hash = 0; + if (utf8ncursor(&u8c, nfkdi, (const char *)name, len) < 0) + goto blob; + while ((val = utf8byte(&u8c)) > 0) + hash = val ^ rol32(hash, 7); + /* In case of error treat the name as a binary blob. */ + if (val == 0) + return hash; +blob: + return xfs_da_hashname(name, len); +} + +static int +xfs_utf8_normhash( + struct xfs_da_args *args) +{ + utf8data_t nfkdi; + struct utf8cursor u8c; + unsigned char *norm; + ssize_t normlen; + int c; + + nfkdi = utf8nfkdi(utf8version); + /* Failure to normalize is treated as a blob. 
*/
+	if ((normlen = utf8nlen(nfkdi, (const char *)args->name,
+				 args->namelen)) < 0)
+		goto blob;
+	if (utf8ncursor(&u8c, nfkdi, (const char *)args->name,
+			args->namelen) < 0)
+		goto blob;
+	if (!(norm = kmem_alloc(normlen + 1, KM_NOFS|KM_MAYFAIL)))
+		return ENOMEM;
+	args->norm = norm;
+	args->normlen = normlen;
+	while ((c = utf8byte(&u8c)) > 0)
+		*norm++ = c;
+	if (c == 0) {
+		*norm = '\0';
+		args->hashval = xfs_da_hashname(args->norm, args->normlen);
+		return 0;
+	}
+	kmem_free((void *)args->norm);
+blob:
+	args->norm = NULL;
+	args->normlen = -1;
+	args->hashval = xfs_da_hashname(args->name, args->namelen);
+	return 0;
+}
+
+static enum xfs_dacmp
+xfs_utf8_compname(
+	struct xfs_da_args *args,
+	const unsigned char *name,
+	int len)
+{
+	utf8data_t nfkdi;
+	struct utf8cursor u8c;
+	/*
+	 * Compare as unsigned char, as xfs_utf8_ci_compname() does:
+	 * utf8byte() returns 1..255, which a signed char can never
+	 * equal for bytes >= 0x80.
+	 */
+	const unsigned char *norm;
+	int c;
+
+	ASSERT(args->norm || args->normlen == -1);
+
+	/* Check for an exact match first. */
+	if (args->namelen == len && memcmp(args->name, name, len) == 0)
+		return XFS_CMP_EXACT;
+	/* xfs_utf8_normhash() set args->normlen to -1 for a blob */
+	if (args->normlen < 0)
+		return XFS_CMP_DIFFERENT;
+	nfkdi = utf8nfkdi(utf8version);
+	if (utf8ncursor(&u8c, nfkdi, (const char *)name, len) < 0)
+		return XFS_CMP_DIFFERENT;
+	norm = args->norm;
+	while ((c = utf8byte(&u8c)) > 0)
+		if (c != *norm++)
+			return XFS_CMP_DIFFERENT;
+	if (c < 0 || *norm != '\0')
+		return XFS_CMP_DIFFERENT;
+	return XFS_CMP_MATCH;
+}
+
+struct xfs_nameops xfs_utf8_nameops = {
+	.hashname = xfs_utf8_hashname,
+	.normhash = xfs_utf8_normhash,
+	.compname = xfs_utf8_compname,
+};
+
+/*
+ * xfs nameops using nfkdicf
+ */
+
+static xfs_dahash_t
+xfs_utf8_ci_hashname(
+	const unsigned char *name,
+	int len)
+{
+	utf8data_t nfkdicf;
+	struct utf8cursor u8c;
+	xfs_dahash_t hash;
+	int val;
+
+	nfkdicf = utf8nfkdicf(utf8version);
+	hash = 0;
+	if (utf8ncursor(&u8c, nfkdicf, (const char *)name, len) < 0)
+		goto blob;
+	while ((val = utf8byte(&u8c)) > 0)
+		hash = val ^ rol32(hash, 7);
+
/* In case of error treat the name as a binary blob. */ + if (val == 0) + return hash; +blob: + return xfs_da_hashname(name, len); +} + +static int +xfs_utf8_ci_normhash( + struct xfs_da_args *args) +{ + utf8data_t nfkdicf; + struct utf8cursor u8c; + unsigned char *norm; + ssize_t normlen; + int c; + + nfkdicf = utf8nfkdicf(utf8version); + /* Failure to normalize is treated as a blob. */ + if ((normlen = utf8nlen(nfkdicf, (const char *)args->name, + args->namelen)) < 0) + goto blob; + if (utf8ncursor(&u8c, nfkdicf, (const char *)args->name, + args->namelen) < 0) + goto blob; + if (!(norm = kmem_alloc(normlen + 1, KM_NOFS|KM_MAYFAIL))) + return ENOMEM; + args->norm = norm; + args->normlen = normlen; + while ((c = utf8byte(&u8c)) > 0) + *norm++ = c; + if (c == 0) { + *norm = '\0'; + args->hashval = xfs_da_hashname(args->norm, args->normlen); + return 0; + } + kmem_free((void *)args->norm); +blob: + args->norm = NULL; + args->normlen = -1; + args->hashval = xfs_da_hashname(args->name, args->namelen); + return 0; +} + +static enum xfs_dacmp +xfs_utf8_ci_compname( + struct xfs_da_args *args, + const unsigned char *name, + int len) +{ + utf8data_t nfkdicf; + struct utf8cursor u8c; + const unsigned char *norm; + int c; + + ASSERT(args->norm || args->normlen == -1); + + /* Check for an exact match first. 
*/ + if (args->namelen == len && memcmp(args->name, name, len) == 0) + return XFS_CMP_EXACT; + /* xfs_utf8_ci_normhash() set args->normlen to -1 for a blob */ + if (args->normlen < 0) + return XFS_CMP_DIFFERENT; + nfkdicf = utf8nfkdicf(utf8version); + if (utf8ncursor(&u8c, nfkdicf, (const char *)name, len) < 0) + return XFS_CMP_DIFFERENT; + norm = args->norm; + while ((c = utf8byte(&u8c)) > 0) + if (c != *norm++) + return XFS_CMP_DIFFERENT; + if (c < 0 || *norm != '\0') + return XFS_CMP_DIFFERENT; + return XFS_CMP_MATCH; +} + +struct xfs_nameops xfs_utf8_ci_nameops = { + .hashname = xfs_utf8_ci_hashname, + .normhash = xfs_utf8_ci_normhash, + .compname = xfs_utf8_ci_compname, +}; diff --git a/support/Makefile b/support/Makefile new file mode 100644 index 0000000..cade5fe --- /dev/null +++ b/support/Makefile @@ -0,0 +1,24 @@ +# +# Copyright (c) 2014 SGI. All Rights Reserved. +# + +TOPDIR = .. +include $(TOPDIR)/include/builddefs + +default = ../include/utf8data.h + +../include/utf8data.h: mkutf8data.c + cc -o mkutf8data mkutf8data.c + cd ucd-7.0.0 ; ../mkutf8data + mv ucd-7.0.0/utf8data.h ../include + +default clean: + rm -f mkutf8data ../include/utf8data.h + +default install: + +default install-dev: + +default install-qa: + +-include .ltdep -- 1.7.12.4
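The fallback rule in this patch's description (a name that fails to normalize is hashed as an opaque blob) is easiest to see with the normalizer stubbed out. The sketch below is not the patch's code: struct cursor, next_byte() and blob_hash() are hypothetical stand-ins for struct utf8cursor, utf8byte() and xfs_da_hashname(), and the stub does no real normalization. Only the control flow of norm_hash() is meant to mirror xfs_utf8_hashname().

```c
#include <stddef.h>
#include <stdint.h>

/* Rotate left, like the kernel's rol32(); shift must be 1..31. */
static uint32_t rol32(uint32_t word, unsigned int shift)
{
	return (word << shift) | (word >> (32 - shift));
}

/* Hypothetical stand-in for struct utf8cursor. */
struct cursor {
	const unsigned char *s;
	size_t len;
};

/*
 * Hypothetical stand-in for utf8byte(): returns the next byte (> 0),
 * 0 at end of input, or -1 on invalid input. The byte 0xFF is
 * rejected because it never occurs in UTF-8.
 */
static int next_byte(struct cursor *c)
{
	if (c->len == 0 || *c->s == '\0')
		return 0;
	if (*c->s == 0xFF)
		return -1;
	c->len--;
	return *c->s++;
}

/* Hypothetical stand-in for xfs_da_hashname(): hashes the raw bytes. */
static uint32_t blob_hash(const unsigned char *name, size_t len)
{
	uint32_t hash = 0;

	while (len--)
		hash = *name++ ^ rol32(hash, 3);
	return hash;
}

/* Hash the normalized bytes; on any error hash the raw name instead. */
static uint32_t norm_hash(const unsigned char *name, size_t len)
{
	struct cursor c = { name, len };
	uint32_t hash = 0;
	int val;

	while ((val = next_byte(&c)) > 0)
		hash = val ^ rol32(hash, 7);
	if (val == 0)
		return hash;		/* normalized cleanly */
	return blob_hash(name, len);	/* invalid input: treat as a blob */
}
```

Note that a partially accumulated hash is discarded when an error shows up in the middle of a name, so the result never depends on how far normalization got before failing.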
* [PATCH 09/13] libxfs: apply utf-8 normalization rules to user extended attribute names 2014-09-18 20:31 ` [PATCH 00/13] xfsprogs: Unicode/UTF-8 support for XFS Ben Myers ` (7 preceding siblings ...) 2014-09-18 20:38 ` [PATCH 08/13] libxfs: add xfs_nameops for utf8 and utf8+casefold Ben Myers @ 2014-09-18 20:39 ` Ben Myers 2014-09-18 20:40 ` [PATCH 10/13] xfsprogs: add utf8 support to growfs Ben Myers ` (5 subsequent siblings) 14 siblings, 0 replies; 84+ messages in thread From: Ben Myers @ 2014-09-18 20:39 UTC (permalink / raw) To: linux-fsdevel; +Cc: tinguely, olaf, xfs From: Olaf Weber <olaf@sgi.com> Apply the same rules for UTF-8 normalization to the names of user-defined extended attributes. System attributes are excluded because they are not user-visible in the first place, and the kernel is expected to know what it is doing when naming them. Signed-off-by: Olaf Weber <olaf@sgi.com> --- libxfs/xfs_attr.c | 49 +++++++++++++++++++++++++++++++++++++++++-------- libxfs/xfs_attr_leaf.c | 11 +++++++++-- libxfs/xfs_utf8.c | 7 +++++++ 3 files changed, 57 insertions(+), 10 deletions(-) diff --git a/libxfs/xfs_attr.c b/libxfs/xfs_attr.c index 17519d3..c30703b 100644 --- a/libxfs/xfs_attr.c +++ b/libxfs/xfs_attr.c @@ -88,8 +88,9 @@ xfs_attr_get_int( int *valuelenp, int flags) { - xfs_da_args_t args; - int error; + xfs_da_args_t args; + struct xfs_mount *mp = ip->i_mount; + int error; if (!xfs_inode_hasattr(ip)) return ENOATTR; @@ -103,9 +104,12 @@ xfs_attr_get_int( args.value = value; args.valuelen = *valuelenp; args.flags = flags; - args.hashval = xfs_da_hashname(args.name, args.namelen); args.dp = ip; args.whichfork = XFS_ATTR_FORK; + if (! xfs_sb_version_hasutf8(&mp->m_sb)) + args.hashval = xfs_da_hashname(args.name, args.namelen); + else if ((error = mp->m_dirnameops->normhash(&args)) != 0) + return error; /* * Decide on what work routines to call based on the inode size. 
@@ -118,6 +122,9 @@ xfs_attr_get_int( error = xfs_attr_node_get(&args); } + if (args.norm) + kmem_free((void *)args.norm); + /* * Return the number of bytes in the value to the caller. */ @@ -239,12 +246,15 @@ xfs_attr_set_int( args.value = value; args.valuelen = valuelen; args.flags = flags; - args.hashval = xfs_da_hashname(args.name, args.namelen); args.dp = dp; args.firstblock = &firstblock; args.flist = &flist; args.whichfork = XFS_ATTR_FORK; args.op_flags = XFS_DA_OP_ADDNAME | XFS_DA_OP_OKNOENT; + if (! xfs_sb_version_hasutf8(&mp->m_sb)) + args.hashval = xfs_da_hashname(args.name, args.namelen); + else if ((error = mp->m_dirnameops->normhash(&args)) != 0) + return error; /* Size is now blocks for attribute data */ args.total = xfs_attr_calc_size(dp, name->len, valuelen, &local); @@ -276,6 +286,8 @@ xfs_attr_set_int( error = xfs_trans_reserve(args.trans, &tres, args.total, 0); if (error) { xfs_trans_cancel(args.trans, 0); + if (args.norm) + kmem_free((void *)args.norm); return(error); } xfs_ilock(dp, XFS_ILOCK_EXCL); @@ -286,6 +298,8 @@ xfs_attr_set_int( if (error) { xfs_iunlock(dp, XFS_ILOCK_EXCL); xfs_trans_cancel(args.trans, XFS_TRANS_RELEASE_LOG_RES); + if (args.norm) + kmem_free((void *)args.norm); return (error); } @@ -333,7 +347,8 @@ xfs_attr_set_int( err2 = xfs_trans_commit(args.trans, XFS_TRANS_RELEASE_LOG_RES); xfs_iunlock(dp, XFS_ILOCK_EXCL); - + if (args.norm) + kmem_free((void *)args.norm); return(error == 0 ? 
err2 : error); } @@ -398,6 +413,8 @@ xfs_attr_set_int( xfs_trans_log_inode(args.trans, dp, XFS_ILOG_CORE); error = xfs_trans_commit(args.trans, XFS_TRANS_RELEASE_LOG_RES); xfs_iunlock(dp, XFS_ILOCK_EXCL); + if (args.norm) + kmem_free((void *)args.norm); return(error); @@ -406,6 +423,9 @@ out: xfs_trans_cancel(args.trans, XFS_TRANS_RELEASE_LOG_RES|XFS_TRANS_ABORT); xfs_iunlock(dp, XFS_ILOCK_EXCL); + if (args.norm) + kmem_free((void *)args.norm); + return(error); } @@ -452,12 +472,15 @@ xfs_attr_remove_int(xfs_inode_t *dp, struct xfs_name *name, int flags) args.name = name->name; args.namelen = name->len; args.flags = flags; - args.hashval = xfs_da_hashname(args.name, args.namelen); args.dp = dp; args.firstblock = &firstblock; args.flist = &flist; args.total = 0; args.whichfork = XFS_ATTR_FORK; + if (! xfs_sb_version_hasutf8(&mp->m_sb)) + args.hashval = xfs_da_hashname(args.name, args.namelen); + else if ((error = mp->m_dirnameops->normhash(&args)) != 0) + return error; /* * we have no control over the attribute names that userspace passes us @@ -470,8 +493,11 @@ xfs_attr_remove_int(xfs_inode_t *dp, struct xfs_name *name, int flags) * Attach the dquots to the inode. */ error = xfs_qm_dqattach(dp, 0); - if (error) - return error; + if (error) { + if (args.norm) + kmem_free((void *)args.norm); + return error; + } /* * Start our first transaction of the day. 
@@ -497,6 +523,8 @@ xfs_attr_remove_int(xfs_inode_t *dp, struct xfs_name *name, int flags) XFS_ATTRRM_SPACE_RES(mp), 0); if (error) { xfs_trans_cancel(args.trans, 0); + if (args.norm) + kmem_free((void *)args.norm); return(error); } @@ -546,6 +574,8 @@ xfs_attr_remove_int(xfs_inode_t *dp, struct xfs_name *name, int flags) xfs_trans_log_inode(args.trans, dp, XFS_ILOG_CORE); error = xfs_trans_commit(args.trans, XFS_TRANS_RELEASE_LOG_RES); xfs_iunlock(dp, XFS_ILOCK_EXCL); + if (args.norm) + kmem_free((void *)args.norm); return(error); @@ -554,6 +584,9 @@ out: xfs_trans_cancel(args.trans, XFS_TRANS_RELEASE_LOG_RES|XFS_TRANS_ABORT); xfs_iunlock(dp, XFS_ILOCK_EXCL); + if (args.norm) + kmem_free((void *)args.norm); + return(error); } diff --git a/libxfs/xfs_attr_leaf.c b/libxfs/xfs_attr_leaf.c index f7f02ae..052a6a1 100644 --- a/libxfs/xfs_attr_leaf.c +++ b/libxfs/xfs_attr_leaf.c @@ -634,6 +634,7 @@ int xfs_attr_shortform_to_leaf(xfs_da_args_t *args) { xfs_inode_t *dp; + struct xfs_mount *mp; xfs_attr_shortform_t *sf; xfs_attr_sf_entry_t *sfe; xfs_da_args_t nargs; @@ -646,6 +647,7 @@ xfs_attr_shortform_to_leaf(xfs_da_args_t *args) trace_xfs_attr_sf_to_leaf(args); dp = args->dp; + mp = dp->i_mount; ifp = dp->i_afp; sf = (xfs_attr_shortform_t *)ifp->if_u1.if_data; size = be16_to_cpu(sf->hdr.totsize); @@ -698,13 +700,18 @@ xfs_attr_shortform_to_leaf(xfs_da_args_t *args) nargs.namelen = sfe->namelen; nargs.value = &sfe->nameval[nargs.namelen]; nargs.valuelen = sfe->valuelen; - nargs.hashval = xfs_da_hashname(sfe->nameval, - sfe->namelen); nargs.flags = XFS_ATTR_NSP_ONDISK_TO_ARGS(sfe->flags); + if (! 
xfs_sb_version_hasutf8(&mp->m_sb)) + nargs.hashval = xfs_da_hashname(sfe->nameval, + sfe->namelen); + else if ((error = mp->m_dirnameops->normhash(&nargs)) != 0) + goto out; error = xfs_attr3_leaf_lookup_int(bp, &nargs); /* set a->index */ ASSERT(error == ENOATTR); error = xfs_attr3_leaf_add(bp, &nargs); ASSERT(error != ENOSPC); + if (nargs.norm) + kmem_free((void *)nargs.norm); if (error) goto out; sfe = XFS_ATTR_SF_NEXTENTRY(sfe); diff --git a/libxfs/xfs_utf8.c b/libxfs/xfs_utf8.c index f5cc231..5c69591 100644 --- a/libxfs/xfs_utf8.c +++ b/libxfs/xfs_utf8.c @@ -31,6 +31,7 @@ #include "xfs_inode_fork.h" #include "xfs_bmap.h" #include "xfs_dir2.h" +#include "xfs_attr_leaf.h" #include "xfs_trace.h" #include "xfs_utf8.h" #include "utf8norm.h" @@ -72,6 +73,9 @@ xfs_utf8_normhash( ssize_t normlen; int c; + /* Don't normalize system attribute names. */ + if (args->flags & (ATTR_ROOT|ATTR_SECURE)) + goto blob; nfkdi = utf8nfkdi(utf8version); /* Failure to normalize is treated as a blob. */ if ((normlen = utf8nlen(nfkdi, (const char *)args->name, @@ -173,6 +177,9 @@ xfs_utf8_ci_normhash( ssize_t normlen; int c; + /* Don't normalize system attribute names. */ + if (args->flags & (ATTR_ROOT|ATTR_SECURE)) + goto blob; nfkdicf = utf8nfkdicf(utf8version); /* Failure to normalize is treated as a blob. */ if ((normlen = utf8nlen(nfkdicf, (const char *)args->name, -- 1.7.12.4
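The rule this patch adds to both normhash functions (names in the root and secure namespaces are left alone) comes down to one mask test. A minimal sketch follows; attr_name_is_normalized() is a hypothetical helper, and the two flag values are placeholders for illustration, since the real ATTR_ROOT and ATTR_SECURE definitions come from the XFS headers.

```c
#include <stdbool.h>

/* Placeholder values for illustration; the XFS headers define the real ones. */
#define ATTR_ROOT	0x0002
#define ATTR_SECURE	0x0008

/*
 * Hypothetical helper: true when an extended attribute name is in the
 * user namespace and so is subject to UTF-8 normalization.
 */
static bool attr_name_is_normalized(int flags)
{
	return (flags & (ATTR_ROOT | ATTR_SECURE)) == 0;
}
```

Because the test sits inside normhash itself, directory lookups, which do not set either flag, are unaffected and the callers in xfs_attr.c need no namespace check of their own.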
* [PATCH 10/13] xfsprogs: add utf8 support to growfs 2014-09-18 20:31 ` [PATCH 00/13] xfsprogs: Unicode/UTF-8 support for XFS Ben Myers ` (8 preceding siblings ...) 2014-09-18 20:39 ` [PATCH 09/13] libxfs: apply utf-8 normalization rules to user extended attribute names Ben Myers @ 2014-09-18 20:40 ` Ben Myers 2014-09-18 20:41 ` [PATCH 11/13] xfsprogs: add utf8 support to mkfs.xfs Ben Myers ` (4 subsequent siblings) 14 siblings, 0 replies; 84+ messages in thread From: Ben Myers @ 2014-09-18 20:40 UTC (permalink / raw) To: linux-fsdevel; +Cc: tinguely, olaf, xfs From: Mark Tinguely <tinguely@sgi.com> Add reporting of the utf-8 mkfs options to xfs_growfs and xfs_info. Signed-off-by: Mark Tinguely <tinguely@sgi.com> --- growfs/xfs_growfs.c | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/growfs/xfs_growfs.c b/growfs/xfs_growfs.c index 8e611b6..6c41803 100644 --- a/growfs/xfs_growfs.c +++ b/growfs/xfs_growfs.c @@ -57,7 +57,8 @@ report_info( int crcs_enabled, int cimode, int ftype_enabled, - int finobt_enabled) + int finobt_enabled, + int utf8) { printf(_( "meta-data=%-22s isize=%-6u agcount=%u, agsize=%u blks\n" @@ -65,7 +66,7 @@ report_info( " =%-22s crc=%-8u finobt=%u\n" "data =%-22s bsize=%-6u blocks=%llu, imaxpct=%u\n" " =%-22s sunit=%-6u swidth=%u blks\n" - "naming =version %-14u bsize=%-6u ascii-ci=%d ftype=%d\n" + "naming =version %-14u bsize=%-6u ascii-ci=%d ftype=%d utf8=%d\n" "log =%-22s bsize=%-6u blocks=%u, version=%u\n" " =%-22s sectsz=%-5u sunit=%u blks, lazy-count=%u\n" "realtime =%-22s extsz=%-6u blocks=%llu, rtextents=%llu\n"), @@ -76,7 +77,7 @@ report_info( "", geo.blocksize, (unsigned long long)geo.datablocks, geo.imaxpct, "", geo.sunit, geo.swidth, - dirversion, geo.dirblocksize, cimode, ftype_enabled, + dirversion, geo.dirblocksize, cimode, ftype_enabled, utf8, isint ? _("internal") : logname ? 
logname : _("external"), geo.blocksize, geo.logblocks, logversion, "", geo.logsectsize, geo.logsunit / geo.blocksize, lazycount, @@ -114,6 +115,7 @@ main(int argc, char **argv) long long rsize; /* new rt size in fs blocks */ int ci; /* ASCII case-insensitive fs */ int lazycount; /* lazy superblock counters */ + int utf8; /* Unicode chars supported */ int xflag; /* -x flag */ char *fname; /* mount point name */ char *datadev; /* data device name */ @@ -247,11 +249,12 @@ main(int argc, char **argv) crcs_enabled = geo.flags & XFS_FSOP_GEOM_FLAGS_V5SB ? 1 : 0; ftype_enabled = geo.flags & XFS_FSOP_GEOM_FLAGS_FTYPE ? 1 : 0; finobt_enabled = geo.flags & XFS_FSOP_GEOM_FLAGS_FINOBT ? 1 : 0; + utf8 = geo.flags & XFS_FSOP_GEOM_FLAGS_UTF8 ? 1 : 0; if (nflag) { report_info(geo, datadev, isint, logdev, rtdev, lazycount, dirversion, logversion, attrversion, projid32bit, crcs_enabled, ci, - ftype_enabled, finobt_enabled); + ftype_enabled, finobt_enabled, utf8); exit(0); } @@ -289,7 +292,7 @@ main(int argc, char **argv) report_info(geo, datadev, isint, logdev, rtdev, lazycount, dirversion, logversion, attrversion, projid32bit, crcs_enabled, ci, ftype_enabled, - finobt_enabled); + finobt_enabled, utf8); ddsize = xi.dsize; dlsize = ( xi.logBBsize? xi.logBBsize : -- 1.7.12.4 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 84+ messages in thread
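The growfs change reduces one geometry flag bit to a 0/1 value for printing. A minimal sketch of that reduction follows. The flag value and helper name are hypothetical; the real XFS_FSOP_GEOM_FLAGS_UTF8 definition is not shown in this patch.

```c
#include <assert.h>

/* Hypothetical flag bit, chosen only for this sketch. */
#define DEMO_GEOM_FLAGS_UTF8	(1u << 12)

/*
 * Reduce a single flag bit to 0 or 1. In C, '&' binds tighter than
 * '?:', so the unparenthesized form used in the patch evaluates the
 * same way; the parentheses here are for the reader.
 */
static int
demo_flag_to_int(unsigned int flags, unsigned int bit)
{
	return (flags & bit) ? 1 : 0;
}
```

The result can be passed straight to a "%d" conversion such as the utf8=%d field in the report.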
* [PATCH 11/13] xfsprogs: add utf8 support to mkfs.xfs 2014-09-18 20:31 ` [PATCH 00/13] xfsprogs: Unicode/UTF-8 support for XFS Ben Myers ` (9 preceding siblings ...) 2014-09-18 20:40 ` [PATCH 10/13] xfsprogs: add utf8 support to growfs Ben Myers @ 2014-09-18 20:41 ` Ben Myers 2014-09-18 20:42 ` Ben Myers ` (3 subsequent siblings) 14 siblings, 0 replies; 84+ messages in thread From: Ben Myers @ 2014-09-18 20:41 UTC (permalink / raw) To: linux-fsdevel; +Cc: tinguely, olaf, xfs From: Mark Tinguely <tinguely@sgi.com> Set the utf-8 feature bit. Signed-off-by: Mark Tinguely <tinguely@sgi.com> --- man/man8/mkfs.xfs.8 | 9 ++++++++- mkfs/xfs_mkfs.c | 27 ++++++++++++++++++++++----- mkfs/xfs_mkfs.h | 3 ++- 3 files changed, 32 insertions(+), 7 deletions(-) diff --git a/man/man8/mkfs.xfs.8 b/man/man8/mkfs.xfs.8 index ad9ff3d..aa43cf5 100644 --- a/man/man8/mkfs.xfs.8 +++ b/man/man8/mkfs.xfs.8 @@ -558,7 +558,7 @@ any power of 2 size from the filesystem block size up to 65536. .IP The .B version=ci -option enables ASCII only case-insensitive filename lookup and version +option enables ASCII or UTF-8 case-insensitive filename lookup and version 2 directories. Filenames are case-preserving, that is, the names are stored in directories using the case they were created with. .IP @@ -582,6 +582,13 @@ When CRCs are enabled via the ftype functionality is always enabled. This feature can not be turned off for such filesystem configurations. .IP +.TP +.BI utf8[= value ] +This is used to enable the UTF-8 character set support. The +.I value +is either 0 or 1, with 1 signifying that UTF-8 character support is to be +enabled. If the value is omitted, 1 is assumed. 
+.IP .RE .TP .BI \-p " protofile" diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c index c85258a..1829e51 100644 --- a/mkfs/xfs_mkfs.c +++ b/mkfs/xfs_mkfs.c @@ -149,6 +149,8 @@ char *nopts[] = { "version", #define N_FTYPE 3 "ftype", +#define N_UTF8 4 + "utf8", NULL, }; @@ -958,6 +960,7 @@ main( int nsflag; int nvflag; int nci; + int utf8; int Nflag; int discard = 1; char *p; @@ -1004,6 +1007,7 @@ main( logagno = logblocks = rtblocks = rtextblocks = 0; Nflag = nlflag = nsflag = nvflag = nci = 0; nftype = dirftype = 0; /* inode type information in the dir */ + utf8 = 0; /* utf-8 support */ dirblocklog = dirblocksize = 0; dirversion = XFS_DFL_DIR_VERSION; qflag = 0; @@ -1565,7 +1569,8 @@ _("cannot specify both crc and ftype\n")); if (nvflag) respec('n', nopts, N_VERSION); if (!strcasecmp(value, "ci")) { - nci = 1; /* ASCII CI mode */ + /* ASCII or UTF-8 CI mode */ + nci = 1; } else { dirversion = atoi(value); if (dirversion != 2) @@ -1587,6 +1592,14 @@ _("cannot specify both crc and ftype\n")); } nftype = 1; break; + case N_UTF8: + if (!value || *value == '\0') + value = "1"; + c = atoi(value); + if (c < 0 || c > 1) + illegal(value, "n utf8"); + utf8 = c; + break; default: unknown('n', value); } @@ -2460,7 +2473,8 @@ _("size %s specified for log subvolume is too large, maximum is %lld blocks\n"), */ sbp->sb_features2 = XFS_SB_VERSION2_MKFS(crcs_enabled, lazy_sb_counters, attrversion == 2, !projid16bit, 0, - (!crcs_enabled && dirftype)); + (!crcs_enabled && dirftype), + (!crcs_enabled && utf8)); sbp->sb_versionnum = XFS_SB_VERSION_MKFS(crcs_enabled, iaflag, dsunit != 0, logversion == 2, attrversion == 1, @@ -2534,6 +2548,9 @@ _("size %s specified for log subvolume is too large, maximum is %lld blocks\n"), if (crcs_enabled) { sbp->sb_features_incompat = XFS_SB_FEAT_INCOMPAT_FTYPE; dirftype = 1; + /* turn on the utf-8 support */ + if (utf8) + sbp->sb_features_incompat |= XFS_SB_FEAT_INCOMPAT_UTF8; } if (!qflag || Nflag) { @@ -2543,7 +2560,7 @@ _("size %s specified for log 
subvolume is too large, maximum is %lld blocks\n"), " =%-22s crc=%-8u finobt=%u\n" "data =%-22s bsize=%-6u blocks=%llu, imaxpct=%u\n" " =%-22s sunit=%-6u swidth=%u blks\n" - "naming =version %-14u bsize=%-6u ascii-ci=%d ftype=%d\n" + "naming =version %-14u bsize=%-6u ascii-ci=%d ftype=%d utf8=%d\n" "log =%-22s bsize=%-6d blocks=%lld, version=%d\n" " =%-22s sectsz=%-5u sunit=%d blks, lazy-count=%d\n" "realtime =%-22s extsz=%-6d blocks=%lld, rtextents=%lld\n"), @@ -2552,7 +2569,7 @@ _("size %s specified for log subvolume is too large, maximum is %lld blocks\n"), "", crcs_enabled, finobt, "", blocksize, (long long)dblocks, imaxpct, "", dsunit, dswidth, - dirversion, dirblocksize, nci, dirftype, + dirversion, dirblocksize, nci, dirftype, utf8, logfile, 1 << blocklog, (long long)logblocks, logversion, "", lsectorsize, lsunit, lazy_sb_counters, rtfile, rtextblocks << blocklog, @@ -3171,7 +3188,7 @@ usage( void ) sunit=value|su=num,sectlog=n|sectsize=num,\n\ lazy-count=0|1]\n\ /* label */ [-L label (maximum 12 characters)]\n\ -/* naming */ [-n log=n|size=num,version=2|ci,ftype=0|1]\n\ +/* naming */ [-n log=n|size=num,version=2|ci,ftype=0|1,utf8=0|1]\n\ /* no-op info only */ [-N]\n\ /* prototype file */ [-p fname]\n\ /* quiet */ [-q]\n\ diff --git a/mkfs/xfs_mkfs.h b/mkfs/xfs_mkfs.h index 9df5f37..f40b284 100644 --- a/mkfs/xfs_mkfs.h +++ b/mkfs/xfs_mkfs.h @@ -37,13 +37,14 @@ 0 ) : XFS_SB_VERSION_1 ) #define XFS_SB_VERSION2_MKFS(crc, lazycount, attr2, projid32bit, parent, \ - ftype) (\ + ftype, utf8) (\ ((lazycount) ? XFS_SB_VERSION2_LAZYSBCOUNTBIT : 0) | \ ((attr2) ? XFS_SB_VERSION2_ATTR2BIT : 0) | \ ((projid32bit) ? XFS_SB_VERSION2_PROJID32BIT : 0) | \ ((parent) ? XFS_SB_VERSION2_PARENTBIT : 0) | \ ((crc) ? XFS_SB_VERSION2_CRCBIT : 0) | \ ((ftype) ? XFS_SB_VERSION2_FTYPE : 0) | \ + ((utf8) ? 
XFS_SB_VERSION2_UTF8BIT : 0) | \ 0 ) #define XFS_DFL_BLOCKSIZE_LOG 12 /* 4096 byte blocks */ -- 1.7.12.4
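The N_UTF8 case above treats a missing value as 1 and then relies on atoi(), which silently maps non-numeric input such as "abc" to 0. A stricter variant could look like the sketch below. The helper name is hypothetical and this is not the code in the patch; it only shows the same defaulting rule with strtol() based validation.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Parse an optional 0|1 suboption value.
 * Returns 0 or 1 on success and -1 if the value is not acceptable.
 * A NULL or empty value means "enabled", as in the mkfs patch.
 */
static int
demo_parse_bool_opt(const char *value)
{
	char	*end;
	long	v;

	if (!value || *value == '\0')
		return 1;

	errno = 0;
	v = strtol(value, &end, 10);
	if (errno || end == value || *end != '\0')
		return -1;	/* not a number, or trailing junk */
	if (v < 0 || v > 1)
		return -1;	/* out of range */
	return (int)v;
}
```

A caller would map the -1 return to the existing illegal() error path.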
* [PATCH 12/13] xfsprogs: add utf8 support to xfs_repair 2014-09-18 20:31 ` [PATCH 00/13] xfsprogs: Unicode/UTF-8 support for XFS Ben Myers @ 2014-09-18 20:42 ` Ben Myers 2014-09-18 20:33 ` [PATCH 02/13] libxfs: rename XFS_CMP_CASE to XFS_CMP_MATCH Ben Myers ` (13 subsequent siblings) 14 siblings, 0 replies; 84+ messages in thread From: Ben Myers @ 2014-09-18 20:42 UTC (permalink / raw) To: linux-fsdevel; +Cc: xfs, olaf, tinguely From: Mark Tinguely <tinguely@sgi.com> Fix the duplicate filename detection to use the utf-8 normalization routines. Signed-off-by: Mark Tinguely <tinguely@sgi.com> --- repair/phase6.c | 35 +++++++++++++++++++++++++---------- 1 file changed, 25 insertions(+), 10 deletions(-) diff --git a/repair/phase6.c b/repair/phase6.c index f374fd0..eb3ea35 100644 --- a/repair/phase6.c +++ b/repair/phase6.c @@ -176,13 +176,15 @@ dir_hash_add( unsigned char *name, __uint8_t ftype) { - xfs_dahash_t hash = 0; int byaddr; int byhash = 0; dir_hash_ent_t *p; int dup; short junk; struct xfs_name xname; + xfs_da_args_t args; + + memset(&args, 0, sizeof(xfs_da_args_t)); ASSERT(!hashtab->names_duped); @@ -195,19 +197,30 @@ dir_hash_add( dup = 0; if (!junk) { - hash = mp->m_dirnameops->hashname(name, namelen); - byhash = DIR_HASH_FUNC(hashtab, hash); + int error; + + args.name = name; + args.namelen = namelen; + args.inumber = inum; + args.whichfork = XFS_DATA_FORK; + + error = mp->m_dirnameops->normhash(&args); + if (error) + do_error(_("normalize has failed %d)\n"), error); + + byhash = DIR_HASH_FUNC(hashtab, args.hashval); /* * search hash bucket for existing name. 
*/ for (p = hashtab->byhash[byhash]; p; p = p->nextbyhash) { - if (p->hashval == hash && p->name.len == namelen) { - if (memcmp(p->name.name, name, namelen) == 0) { - dup = 1; - junk = 1; - break; - } + if (p->hashval == args.hashval && + mp->m_dirnameops->compname(&args, p->name.name, + p->name.len) != + XFS_CMP_DIFFERENT) { + dup = 1; + junk = 1; + break; } } } @@ -226,7 +239,7 @@ dir_hash_add( hashtab->last = p; if (!(p->junkit = junk)) { - p->hashval = hash; + p->hashval = args.hashval; p->nextbyhash = hashtab->byhash[byhash]; hashtab->byhash[byhash] = p; } @@ -235,6 +248,8 @@ dir_hash_add( p->seen = 0; p->name = xname; + if (args.norm) + kmem_free((void *) args.norm); return !dup; } -- 1.7.12.4 ^ permalink raw reply related [flat|nested] 84+ messages in thread
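The idea behind the phase6 change is that two names are duplicates when their normalized forms hash to the same value and also compare equal, instead of when their raw bytes match. The sketch below shows that two-step check. It makes two loud assumptions: ASCII lowercasing stands in for Unicode normalization, and FNV-1a stands in for the XFS directory hash. Neither is what libxfs uses, and unlike real normalization the stand-in never changes the length of a name.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for normalization: ASCII lowercase only, NOT Unicode NFKD. */
static unsigned char
demo_norm_byte(unsigned char c)
{
	return (unsigned char)tolower(c);
}

/* Hash the normalized bytes (FNV-1a, used here only for brevity). */
static uint32_t
demo_normhash(const char *name, size_t len)
{
	uint32_t	h = 2166136261u;
	size_t		i;

	for (i = 0; i < len; i++) {
		h ^= demo_norm_byte((unsigned char)name[i]);
		h *= 16777619u;
	}
	return h;
}

/* Compare two names by their normalized bytes. */
static int
demo_same_name(const char *a, size_t alen, const char *b, size_t blen)
{
	size_t	i;

	if (alen != blen)
		return 0;
	for (i = 0; i < alen; i++) {
		if (demo_norm_byte((unsigned char)a[i]) !=
		    demo_norm_byte((unsigned char)b[i]))
			return 0;
	}
	return 1;
}

/* Duplicate check: cheap hash comparison first, full compare second. */
static int
demo_is_dup(const char *a, size_t alen, const char *b, size_t blen)
{
	return demo_normhash(a, alen) == demo_normhash(b, blen) &&
	       demo_same_name(a, alen, b, blen);
}
```

Comparing hashes before names mirrors the bucket walk in dir_hash_add(), where the stored hashval filters out most entries before the more expensive name comparison runs.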
* [PATCH 13/13] xfsprogs: add a preliminary test for utf8 support 2014-09-18 20:31 ` [PATCH 00/13] xfsprogs: Unicode/UTF-8 support for XFS Ben Myers ` (11 preceding siblings ...) 2014-09-18 20:42 ` Ben Myers @ 2014-09-18 20:43 ` Ben Myers 2014-09-19 16:06 ` [PATCH 07a/13] xfsprogs: add trie generator for UTF-8 Ben Myers 2014-09-19 16:07 ` [PATCH 07b/13] libxfs: add supporting code " Ben Myers 14 siblings, 0 replies; 84+ messages in thread From: Ben Myers @ 2014-09-18 20:43 UTC (permalink / raw) To: linux-fsdevel; +Cc: tinguely, olaf, xfs From: Ben Myers <bpm@sgi.com> Here's a preliminary test for utf8 support in xfs. It is based on code that also does some testing in the trie generator. Here too we are using the NormalizationTest.txt file from the unicode distribution. We check that the normalization in libxfs is working and then run checks on a filesystem. Note that there are some 'blacklisted' unichars which normalize to reserved characters. FIXME: For convenience of build this patch is against xfsprogs access to libxfs. Handling of ignorables and case fold is also not implemented here. --- Makefile | 2 +- chkutf8data/Makefile | 21 +++ chkutf8data/chkutf8data.c | 430 ++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 452 insertions(+), 1 deletion(-) create mode 100644 chkutf8data/Makefile create mode 100644 chkutf8data/chkutf8data.c diff --git a/Makefile b/Makefile index c442da6..d4c0a23 100644 --- a/Makefile +++ b/Makefile @@ -42,7 +42,7 @@ endif LIB_SUBDIRS = support libxfs libxlog libxcmd libhandle libdisk TOOL_SUBDIRS = copy db estimate fsck fsr growfs io logprint mkfs quota \ - mdrestore repair rtcp m4 man doc po debian + mdrestore repair rtcp m4 man doc po debian chkutf8data SUBDIRS = include $(LIB_SUBDIRS) $(TOOL_SUBDIRS) diff --git a/chkutf8data/Makefile b/chkutf8data/Makefile new file mode 100644 index 0000000..6ce5706 --- /dev/null +++ b/chkutf8data/Makefile @@ -0,0 +1,21 @@ +# +# Copyright (c) 2014 SGI. All Rights Reserved. 
+# + +TOPDIR = .. +include $(TOPDIR)/include/builddefs + +LTCOMMAND = chkutf8data +CFILES = chkutf8data.c + +LLDLIBS = $(LIBXFS) +LTDEPENDENCIES = $(LIBXFS) +LLDFLAGS = -static + +default: depend $(LTCOMMAND) + +include $(BUILDRULES) + +install: default + +-include .ltdep diff --git a/chkutf8data/chkutf8data.c b/chkutf8data/chkutf8data.c new file mode 100644 index 0000000..487cf1e --- /dev/null +++ b/chkutf8data/chkutf8data.c @@ -0,0 +1,430 @@ +/* + * Copyright (c) 2014 SGI. + * All rights reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write the Free Software Foundation, + * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + */ +#include <sys/types.h> +#include <stddef.h> +#include <stdlib.h> +#include <stdio.h> +#include <assert.h> +#include <string.h> +#include <unistd.h> +#include <errno.h> +#include <fcntl.h> +#include "utf8norm.h" + +#define FOLD_NAME "CaseFolding.txt" +#define TEST_NAME "NormalizationTest.txt" + +const char *fold_name = FOLD_NAME; +const char *test_name = TEST_NAME; + +/* An arbitrary line size limit on input lines. 
*/ + +#define LINESIZE 1024 +char line[LINESIZE]; +char buf0[LINESIZE]; +char buf1[LINESIZE]; +char buf2[LINESIZE]; +char buf3[LINESIZE]; +char buf4[LINESIZE]; +char buf5[LINESIZE]; + +const char *mtpt; +int verbose = 0; + +/* ------------------------------------------------------------------ */ + +static void +help(void) +{ + printf("The input files:\n"); + printf("\t-f %s\n", FOLD_NAME); + printf("\t-t %s\n", TEST_NAME); + printf("\n\n"); + printf("\t-m mtpt\n"); + printf("\t-v (verbose)\n"); + printf("\t-h (help)\n"); + printf("\n"); +} + +static void +usage(void) +{ + help(); + exit(1); +} + +static void +open_fail(const char *name, int error) +{ + printf("Error %d opening %s: %s\n", error, name, strerror(error)); + exit(1); +} + +static void +file_fail(const char *filename) +{ + printf("Error parsing %s\n", filename); + exit(1); +} + +/* ------------------------------------------------------------------ */ + +/* + * UTF8 valid ranges. + * + * The UTF-8 encoding spreads the bits of a 32bit word over several + * bytes. This table gives the ranges that can be held and how they'd + * be represented. + * + * 0x00000000 0x0000007F: 0xxxxxxx + * 0x00000000 0x000007FF: 110xxxxx 10xxxxxx + * 0x00000000 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx + * 0x00000000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx + * 0x00000000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + * 0x00000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + * + * There is an additional requirement on UTF-8, in that only the + * shortest representation of a 32bit value is to be used. A decoder + * must not decode sequences that do not satisfy this requirement. + * Thus the allowed ranges have a lower bound. 
+ * + * 0x00000000 0x0000007F: 0xxxxxxx + * 0x00000080 0x000007FF: 110xxxxx 10xxxxxx + * 0x00000800 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx + * 0x00010000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx + * 0x00200000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + * 0x04000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + * + * Actual unicode characters are limited to the range 0x0 - 0x10FFFF, + * 17 planes of 65536 values. This limits the sequences actually seen + * even more, to just the following. + * + * 0 - 0x7f: 0 0x7f + * 0x80 - 0x7ff: 0xc2 0x80 0xdf 0xbf + * 0x800 - 0xffff: 0xe0 0xa0 0x80 0xef 0xbf 0xbf + * 0x10000 - 0x10ffff: 0xf0 0x90 0x80 0x80 0xf4 0x8f 0xbf 0xbf + * + * Even within those ranges not all values are allowed: the surrogates + * 0xd800 - 0xdfff should never be seen. + * + * Note that the longest sequence seen with valid usage is 4 bytes, + * the same a single UTF-32 character. This makes the UTF-8 + * representation of Unicode strictly smaller than UTF-32. 
+ * + * The shortest sequence requirement was introduced by: + * Corrigendum #1: UTF-8 Shortest Form + * It can be found here: + * http://www.unicode.org/versions/corrigendum1.html + * + */ + +#define UTF8_2_BITS 0xC0 +#define UTF8_3_BITS 0xE0 +#define UTF8_4_BITS 0xF0 +#define UTF8_N_BITS 0x80 +#define UTF8_2_MASK 0xE0 +#define UTF8_3_MASK 0xF0 +#define UTF8_4_MASK 0xF8 +#define UTF8_N_MASK 0xC0 +#define UTF8_V_MASK 0x3F +#define UTF8_V_SHIFT 6 + +static int +utf8key(unsigned int key, char keyval[]) +{ + int keylen; + + if (key < 0x80) { + keyval[0] = key; + keylen = 1; + } else if (key < 0x800) { + keyval[1] = key & UTF8_V_MASK; + keyval[1] |= UTF8_N_BITS; + key >>= UTF8_V_SHIFT; + keyval[0] = key; + keyval[0] |= UTF8_2_BITS; + keylen = 2; + } else if (key < 0x10000) { + keyval[2] = key & UTF8_V_MASK; + keyval[2] |= UTF8_N_BITS; + key >>= UTF8_V_SHIFT; + keyval[1] = key & UTF8_V_MASK; + keyval[1] |= UTF8_N_BITS; + key >>= UTF8_V_SHIFT; + keyval[0] = key; + keyval[0] |= UTF8_3_BITS; + keylen = 3; + } else if (key < 0x110000) { + keyval[3] = key & UTF8_V_MASK; + keyval[3] |= UTF8_N_BITS; + key >>= UTF8_V_SHIFT; + keyval[2] = key & UTF8_V_MASK; + keyval[2] |= UTF8_N_BITS; + key >>= UTF8_V_SHIFT; + keyval[1] = key & UTF8_V_MASK; + keyval[1] |= UTF8_N_BITS; + key >>= UTF8_V_SHIFT; + keyval[0] = key; + keyval[0] |= UTF8_4_BITS; + keylen = 4; + } else { + printf("%#x: illegal key\n", key); + keylen = 0; + } + return keylen; +} + +static int +normalize_line(utf8data_t tree, char *s, char *t) +{ + struct utf8cursor u8c; + int c; + + if (utf8cursor(&u8c, tree, s)) { + printf("%s return utf8cursor failed\n", __func__); + return -1; + } + + /* Keep the result in an int: bytes >= 0x80 are negative in a signed char. */ + while ((c = utf8byte(&u8c)) > 0) + *t++ = c; + + if (c < 0) { + printf("%s return error %d\n", __func__, c); + return -1; + } + *t = '\0'; + + return 0; +} + +static void +test_key(char *source, + char *NFC, + char *NFD, + char *NFKC, + char *NFKD) +{ + int fd; + int error; + + if 
(verbose) + printf("Testing %s -> %s\n", source, NFKD); + + error = chdir(mtpt); /* XXX hardcoded mount point */ + if (error) { + perror(mtpt); + exit(-1); + } + + /* the initial create should succeed */ + if (verbose) + printf("Initial create %s... ", source); + fd = open(source, O_CREAT|O_EXCL, 0); + if (fd < 0) { + printf("Failed to create %s XXX\n", source); + perror(source); + exit(-1); + } + close(fd); + if (verbose) + printf("Success\n"); + + /* a second create should fail */ + if (verbose) + printf("Second create %s (should return EEXIST)... ", NFKD); + fd = open(NFKD, O_CREAT|O_EXCL, 0); + if (fd >= 0) { + /* 0 is a valid descriptor, so test for >= 0 */ + printf("Test Failed. Was able to create %s XXX\n", NFKD); + close(fd); + exit(-1); + } + if (verbose) + printf("EEXIST\n"); + + error = unlink(NFKD); + if (error) { + printf("Unlink failed\n"); + perror(NFKD); + exit(-1); + } +} + +int +blacklisted(unsigned int unichar) +{ + /* these unichars normalize to characters we don't allow */ + unsigned int list[] = { 0x2024 /* . */, + 0x2025 /* .. */, + 0x2100 /* a/c */, + 0x2101 /* a/s */, + 0x2105 /* c/o */, + 0x2106 /* c/u */, + 0xFE30 /* .. */, + 0xFE52 /* . */, + 0xFF0E /* . */, + 0xFF0F /* / */}; + int i; + + for (i=0; i < (sizeof(list) / sizeof(list[0])); i++) { + if (list[i] == unichar) + return 1; + } + return 0; +} + +static void +normalization_test(void) +{ + FILE *file; + unsigned int unichar; + char *s; + char *t; + int ret; + int tests = 0; + int failures = 0; + char source[LINESIZE]; + char NFKD[LINESIZE]; + int skip; + utf8data_t nfkdi = utf8nfkdi(utf8version); + + printf("Parsing %s\n", test_name); + /* Step one, read data from file. 
*/ + file = fopen(test_name, "r"); + if (!file) + open_fail(test_name, errno); + + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "%[^;];%*[^;];%*[^;];%*[^;];%[^;];", + source, NFKD); + //NFC, NFD, NFKC, NFKD); + if (ret != 2 || *line == '#') + continue; + + s = source; + t = buf2; + skip = 0; + while (*s) { + unichar = strtoul(s, &s, 16); + if (blacklisted(unichar)) + skip++; + t += utf8key(unichar, t); + } + *t = '\0'; + + if (skip) + continue; + + s = NFKD; + t = buf3; + while (*s) { + unichar = strtoul(s, &s, 16); + t += utf8key(unichar, t); + } + *t = '\0'; + + /* normalize source */ + if (normalize_line(nfkdi, buf2, buf4) < 0) { + printf("normalize_line for unichar %s Failed\n", buf2); + exit(1); + } + if (verbose) + printf("(%s) %s normalized to %s... ", + source, buf2, buf4); + + /* does it match NFKD? Compare whole strings, not just a prefix. */ + tests++; + if (strcmp(buf4, buf3)) { + if (verbose) + printf("Fail!\n"); + failures++; + } else { + if (verbose) + printf("Correct!\n"); + } + + /* normalize NFKD */ + if (normalize_line(nfkdi, buf3, buf5) < 0) { + printf("normalize_line for unichar %s Failed\n", + buf3); + exit(1); + } + if (verbose) + printf("(%s) %s normalized to %s... ", + NFKD, buf3, buf5); + + /* does it normalize to itself? */ + tests++; + if (strcmp(buf5, buf3)) { + if (verbose) + printf("Fail!\n"); + failures++; + } else { + if (verbose) + printf("Correct!\n"); + } + + /* XXX ignorables need to be taken into account? 
*/ + test_key(buf2, NULL, NULL, NULL, buf3); + } + fclose(file); + printf("Ran %d tests with %d failures\n", tests, failures); + if (failures) + file_fail(test_name); +} + +int +main(int argc, char *argv[]) +{ + int opt; + + while ((opt = getopt(argc, argv, "f:t:m:vh")) != -1) { + switch (opt) { + case 'f': + fold_name = optarg; + break; + case 't': + test_name = optarg; + break; + case 'm': + mtpt = optarg; + break; + case 'v': + verbose++; + break; + case 'h': + help(); + exit(0); + default: + usage(); + } + } + + if (!test_name || !mtpt) { + usage(); + exit(-1); + } + + normalization_test(); + + return 0; +} -- 1.7.12.4
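The "UTF8 valid ranges" comment in the test program describes how code points map onto one to four bytes. As a compact illustration, the encoder below follows the same table. It is a sketch, not the utf8key() from the patch: the function name is hypothetical, and unlike utf8key() it also rejects the surrogate range 0xd800 - 0xdfff that the comment says should never be seen.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Encode one code point as UTF-8 (shortest form).
 * Returns the number of bytes written to out[], or 0 if cp is a
 * surrogate or lies above 0x10FFFF.
 */
static size_t
demo_utf8_encode(unsigned int cp, unsigned char out[4])
{
	if (cp >= 0xD800 && cp <= 0xDFFF)
		return 0;			/* surrogates are not characters */
	if (cp < 0x80) {
		out[0] = (unsigned char)cp;	/* 0xxxxxxx */
		return 1;
	}
	if (cp < 0x800) {
		out[0] = (unsigned char)(0xC0 | (cp >> 6));		/* 110xxxxx */
		out[1] = (unsigned char)(0x80 | (cp & 0x3F));		/* 10xxxxxx */
		return 2;
	}
	if (cp < 0x10000) {
		out[0] = (unsigned char)(0xE0 | (cp >> 12));		/* 1110xxxx */
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	}
	if (cp < 0x110000) {
		out[0] = (unsigned char)(0xF0 | (cp >> 18));		/* 11110xxx */
		out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
		out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[3] = (unsigned char)(0x80 | (cp & 0x3F));
		return 4;
	}
	return 0;				/* beyond the 17 planes */
}
```

For example, U+00E9 encodes as c3 a9, U+20AC as e2 82 ac, and U+1F600 as f0 9f 98 80, which fall inside the byte ranges listed in the comment.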
* [PATCH 07a/13] xfsprogs: add trie generator for UTF-8. 2014-09-18 20:31 ` [PATCH 00/13] xfsprogs: Unicode/UTF-8 support for XFS Ben Myers ` (12 preceding siblings ...) 2014-09-18 20:43 ` [PATCH 13/13] xfsprogs: add a preliminary test for utf8 support Ben Myers @ 2014-09-19 16:06 ` Ben Myers 2014-09-23 18:34 ` Roger Willcocks 2014-09-19 16:07 ` [PATCH 07b/13] libxfs: add supporting code " Ben Myers 14 siblings, 1 reply; 84+ messages in thread From: Ben Myers @ 2014-09-19 16:06 UTC (permalink / raw) To: linux-fsdevel; +Cc: tinguely, olaf, xfs From: Olaf Weber <olaf@sgi.com> mkutf8data.c is the source for a program that generates utf8data.h, which contains the trie that utf8norm.c uses. The trie is generated from the Unicode 7.0.0 data files. The format of the utf8data[] table is described in utf8norm.c, which is added in the next patch. Signed-off-by: Olaf Weber <olaf@sgi.com> --- Makefile | 2 +- support/Makefile | 24 + support/mkutf8data.c | 3232 ++++++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 3257 insertions(+), 1 deletion(-) create mode 100644 support/Makefile create mode 100644 support/mkutf8data.c diff --git a/Makefile b/Makefile index f56aebd..c442da6 100644 --- a/Makefile +++ b/Makefile @@ -40,7 +40,7 @@ LDIRDIRT = $(SRCDIR) LDIRT += $(SRCTAR) endif -LIB_SUBDIRS = libxfs libxlog libxcmd libhandle libdisk +LIB_SUBDIRS = support libxfs libxlog libxcmd libhandle libdisk TOOL_SUBDIRS = copy db estimate fsck fsr growfs io logprint mkfs quota \ mdrestore repair rtcp m4 man doc po debian diff --git a/support/Makefile b/support/Makefile new file mode 100644 index 0000000..cade5fe --- /dev/null +++ b/support/Makefile @@ -0,0 +1,24 @@ +# +# Copyright (c) 2014 SGI. All Rights Reserved. +# + +TOPDIR = .. 
+include $(TOPDIR)/include/builddefs + +default = ../include/utf8data.h + +../include/utf8data.h: mkutf8data.c + cc -o mkutf8data mkutf8data.c + cd ucd-7.0.0 ; ../mkutf8data + mv ucd-7.0.0/utf8data.h ../include + +default clean: + rm -f mkutf8data ../include/utf8data.h + +default install: + +default install-dev: + +default install-qa: + +-include .ltdep diff --git a/support/mkutf8data.c b/support/mkutf8data.c new file mode 100644 index 0000000..e5c3507 --- /dev/null +++ b/support/mkutf8data.c @@ -0,0 +1,3232 @@ +/* + * Copyright (c) 2014 SGI. + * All rights reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write the Free Software Foundation, + * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + */ + +/* Generator for a compact trie for unicode normalization */ + +#include <sys/types.h> +#include <stddef.h> +#include <stdlib.h> +#include <stdio.h> +#include <assert.h> +#include <string.h> +#include <unistd.h> +#include <errno.h> + +/* Default names of the in- and output files. 
*/ + +#define AGE_NAME "DerivedAge.txt" +#define CCC_NAME "DerivedCombiningClass.txt" +#define PROP_NAME "DerivedCoreProperties.txt" +#define DATA_NAME "UnicodeData.txt" +#define FOLD_NAME "CaseFolding.txt" +#define NORM_NAME "NormalizationCorrections.txt" +#define TEST_NAME "NormalizationTest.txt" +#define UTF8_NAME "utf8data.h" + +const char *age_name = AGE_NAME; +const char *ccc_name = CCC_NAME; +const char *prop_name = PROP_NAME; +const char *data_name = DATA_NAME; +const char *fold_name = FOLD_NAME; +const char *norm_name = NORM_NAME; +const char *test_name = TEST_NAME; +const char *utf8_name = UTF8_NAME; + +int verbose = 0; + +/* An arbitrary line size limit on input lines. */ + +#define LINESIZE 1024 +char line[LINESIZE]; +char buf0[LINESIZE]; +char buf1[LINESIZE]; +char buf2[LINESIZE]; +char buf3[LINESIZE]; + +const char *argv0; + +/* ------------------------------------------------------------------ */ + +/* + * Unicode version numbers consist of three parts: major, minor, and a + * revision. These numbers are packed into an unsigned int to obtain + * a single version number. + * + * To save space in the generated trie, the unicode version is not + * stored directly, instead we calculate a generation number from the + * unicode versions seen in the DerivedAge file, and use that as an + * index into a table of unicode versions. 
+ */ +#define UNICODE_MAJ_SHIFT (16) +#define UNICODE_MIN_SHIFT (8) + +#define UNICODE_MAJ_MAX ((unsigned short)-1) +#define UNICODE_MIN_MAX ((unsigned char)-1) +#define UNICODE_REV_MAX ((unsigned char)-1) + +#define UNICODE_AGE(MAJ,MIN,REV) \ + (((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) | \ + ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) | \ + ((unsigned int)(REV))) + +unsigned int *ages; +int ages_count; + +unsigned int unicode_maxage; + +static int +age_valid(unsigned int major, unsigned int minor, unsigned int revision) +{ + if (major > UNICODE_MAJ_MAX) + return 0; + if (minor > UNICODE_MIN_MAX) + return 0; + if (revision > UNICODE_REV_MAX) + return 0; + return 1; +} + +/* ------------------------------------------------------------------ */ + +/* + * utf8trie_t + * + * A compact binary tree, used to decode UTF-8 characters. + * + * Internal nodes are one byte for the node itself, and up to three + * bytes for an offset into the tree. The first byte contains the + * following information: + * NEXTBYTE - flag - advance to next byte if set + * BITNUM - 3 bit field - the bit number to tested + * OFFLEN - 2 bit field - number of bytes in the offset + * if offlen == 0 (non-branching node) + * RIGHTPATH - 1 bit field - set if the following node is for the + * right-hand path (tested bit is set) + * TRIENODE - 1 bit field - set if the following node is an internal + * node, otherwise it is a leaf node + * if offlen != 0 (branching node) + * LEFTNODE - 1 bit field - set if the left-hand node is internal + * RIGHTNODE - 1 bit field - set if the right-hand node is internal + * + * Due to the way utf8 works, there cannot be branching nodes with + * NEXTBYTE set, and moreover those nodes always have a righthand + * descendant. 
+ */ +typedef unsigned char utf8trie_t; +#define BITNUM 0x07 +#define NEXTBYTE 0x08 +#define OFFLEN 0x30 +#define OFFLEN_SHIFT 4 +#define RIGHTPATH 0x40 +#define TRIENODE 0x80 +#define RIGHTNODE 0x40 +#define LEFTNODE 0x80 + +/* + * utf8leaf_t + * + * The leaves of the trie are embedded in the trie, and so the same + * underlying datatype, unsigned char. + * + * leaf[0]: The unicode version, stored as a generation number that is + * an index into utf8agetab[]. With this we can filter code + * points based on the unicode version in which they were + * defined. The CCC of a non-defined code point is 0. + * leaf[1]: Canonical Combining Class. During normalization, we need + * to do a stable sort into ascending order of all characters + * with a non-zero CCC that occur between two characters with + * a CCC of 0, or at the begin or end of a string. + * The unicode standard guarantees that all CCC values are + * between 0 and 254 inclusive, which leaves 255 available as + * a special value. + * Code points with CCC 0 are known as stoppers. + * leaf[2]: Decomposition. If leaf[1] == 255, then leaf[2] is the + * start of a NUL-terminated string that is the decomposition + * of the character. + * The CCC of a decomposable character is the same as the CCC + * of the first character of its decomposition. + * Some characters decompose as the empty string: these are + * characters with the Default_Ignorable_Code_Point property. + * These do affect normalization, as they all have CCC 0. + * + * The decompositions in the trie have been fully expanded. + * + * Casefolding, if applicable, is also done using decompositions. 
+ */ +typedef unsigned char utf8leaf_t; + +#define LEAF_GEN(LEAF) ((LEAF)[0]) +#define LEAF_CCC(LEAF) ((LEAF)[1]) +#define LEAF_STR(LEAF) ((const char*)((LEAF) + 2)) + +#define MAXGEN (255) + +#define MINCCC (0) +#define MAXCCC (254) +#define STOPPER (0) +#define DECOMPOSE (255) + +struct tree; +static utf8leaf_t *utf8nlookup(struct tree *, const char *, size_t); +static utf8leaf_t *utf8lookup(struct tree *, const char *); + +unsigned char *utf8data; +size_t utf8data_size; + +utf8trie_t *nfkdi; +utf8trie_t *nfkdicf; + +/* ------------------------------------------------------------------ */ + +/* + * UTF8 valid ranges. + * + * The UTF-8 encoding spreads the bits of a 32bit word over several + * bytes. This table gives the ranges that can be held and how they'd + * be represented. + * + * 0x00000000 0x0000007F: 0xxxxxxx + * 0x00000000 0x000007FF: 110xxxxx 10xxxxxx + * 0x00000000 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx + * 0x00000000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx + * 0x00000000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + * 0x00000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + * + * There is an additional requirement on UTF-8, in that only the + * shortest representation of a 32bit value is to be used. A decoder + * must not decode sequences that do not satisfy this requirement. + * Thus the allowed ranges have a lower bound. + * + * 0x00000000 0x0000007F: 0xxxxxxx + * 0x00000080 0x000007FF: 110xxxxx 10xxxxxx + * 0x00000800 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx + * 0x00010000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx + * 0x00200000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + * 0x04000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + * + * Actual unicode characters are limited to the range 0x0 - 0x10FFFF, + * 17 planes of 65536 values. This limits the sequences actually seen + * even more, to just the following. 
+ * + * 0 - 0x7f: 0 0x7f + * 0x80 - 0x7ff: 0xc2 0x80 0xdf 0xbf + * 0x800 - 0xffff: 0xe0 0xa0 0x80 0xef 0xbf 0xbf + * 0x10000 - 0x10ffff: 0xf0 0x90 0x80 0x80 0xf4 0x8f 0xbf 0xbf + * + * Even within those ranges not all values are allowed: the surrogates + * 0xd800 - 0xdfff should never be seen. + * + * Note that the longest sequence seen with valid usage is 4 bytes, + * the same as a single UTF-32 character. This makes the UTF-8 + * representation of Unicode never larger than UTF-32. + * + * The shortest sequence requirement was introduced by: + * Corrigendum #1: UTF-8 Shortest Form + * It can be found here: + * http://www.unicode.org/versions/corrigendum1.html + * + */ + +#define UTF8_2_BITS 0xC0 +#define UTF8_3_BITS 0xE0 +#define UTF8_4_BITS 0xF0 +#define UTF8_N_BITS 0x80 +#define UTF8_2_MASK 0xE0 +#define UTF8_3_MASK 0xF0 +#define UTF8_4_MASK 0xF8 +#define UTF8_N_MASK 0xC0 +#define UTF8_V_MASK 0x3F +#define UTF8_V_SHIFT 6 + +static int +utf8key(unsigned int key, char keyval[]) +{ + int keylen; + + if (key < 0x80) { + keyval[0] = key; + keylen = 1; + } else if (key < 0x800) { + keyval[1] = key & UTF8_V_MASK; + keyval[1] |= UTF8_N_BITS; + key >>= UTF8_V_SHIFT; + keyval[0] = key; + keyval[0] |= UTF8_2_BITS; + keylen = 2; + } else if (key < 0x10000) { + keyval[2] = key & UTF8_V_MASK; + keyval[2] |= UTF8_N_BITS; + key >>= UTF8_V_SHIFT; + keyval[1] = key & UTF8_V_MASK; + keyval[1] |= UTF8_N_BITS; + key >>= UTF8_V_SHIFT; + keyval[0] = key; + keyval[0] |= UTF8_3_BITS; + keylen = 3; + } else if (key < 0x110000) { + keyval[3] = key & UTF8_V_MASK; + keyval[3] |= UTF8_N_BITS; + key >>= UTF8_V_SHIFT; + keyval[2] = key & UTF8_V_MASK; + keyval[2] |= UTF8_N_BITS; + key >>= UTF8_V_SHIFT; + keyval[1] = key & UTF8_V_MASK; + keyval[1] |= UTF8_N_BITS; + key >>= UTF8_V_SHIFT; + keyval[0] = key; + keyval[0] |= UTF8_4_BITS; + keylen = 4; + } else { + printf("%#x: illegal key\n", key); + keylen = 0; + } + return keylen; +} + +static unsigned int +utf8code(const char *str) +{ + const 
unsigned char *s = (const unsigned char*)str; + unsigned int unichar = 0; + + if (*s < 0x80) { + unichar = *s; + } else if (*s < UTF8_3_BITS) { + unichar = *s++ & 0x1F; + unichar <<= UTF8_V_SHIFT; + unichar |= *s & 0x3F; + } else if (*s < UTF8_4_BITS) { + unichar = *s++ & 0x0F; + unichar <<= UTF8_V_SHIFT; + unichar |= *s++ & 0x3F; + unichar <<= UTF8_V_SHIFT; + unichar |= *s & 0x3F; + } else { + unichar = *s++ & 0x0F; + unichar <<= UTF8_V_SHIFT; + unichar |= *s++ & 0x3F; + unichar <<= UTF8_V_SHIFT; + unichar |= *s++ & 0x3F; + unichar <<= UTF8_V_SHIFT; + unichar |= *s & 0x3F; + } + return unichar; +} + +static int +utf32valid(unsigned int unichar) +{ + return unichar < 0x110000; +} + +#define NODE 1 +#define LEAF 0 + +struct tree { + void *root; + int childnode; + const char *type; + unsigned int maxage; + struct tree *next; + int (*leaf_equal)(void *, void *); + void (*leaf_print)(void *, int); + int (*leaf_mark)(void *); + int (*leaf_size)(void *); + int *(*leaf_index)(struct tree *, void *); + unsigned char *(*leaf_emit)(void *, unsigned char *); + int leafindex[0x110000]; + int index; +}; + +struct node { + int index; + int offset; + int mark; + int size; + struct node *parent; + void *left; + void *right; + unsigned char bitnum; + unsigned char nextbyte; + unsigned char leftnode; + unsigned char rightnode; + unsigned int keybits; + unsigned int keymask; +}; + +/* + * Example lookup function for a tree. 
+ */ +static void * +lookup(struct tree *tree, const char *key) +{ + struct node *node; + void *leaf = NULL; + + node = tree->root; + while (!leaf && node) { + if (node->nextbyte) + key++; + if (*key & (1 << (node->bitnum & 7))) { + /* Right leg */ + if (node->rightnode == NODE) { + node = node->right; + } else if (node->rightnode == LEAF) { + leaf = node->right; + } else { + node = NULL; + } + } else { + /* Left leg */ + if (node->leftnode == NODE) { + node = node->left; + } else if (node->leftnode == LEAF) { + leaf = node->left; + } else { + node = NULL; + } + } + } + + return leaf; +} + +/* + * A simple non-recursive tree walker: keep track of visits to the + * left and right branches in the leftmask and rightmask. + */ +static void +tree_walk(struct tree *tree) +{ + struct node *node; + unsigned int leftmask; + unsigned int rightmask; + unsigned int bitmask; + int indent = 1; + int nodes, singletons, leaves; + + nodes = singletons = leaves = 0; + + printf("%s_%x root %p\n", tree->type, tree->maxage, tree->root); + if (tree->childnode == LEAF) { + assert(tree->root); + tree->leaf_print(tree->root, indent); + leaves = 1; + } else { + assert(tree->childnode == NODE); + node = tree->root; + leftmask = rightmask = 0; + while (node) { + printf("%*snode @ %p bitnum %d nextbyte %d" + " left %p right %p mask %x bits %x\n", + indent, "", node, + node->bitnum, node->nextbyte, + node->left, node->right, + node->keymask, node->keybits); + nodes += 1; + if (!(node->left && node->right)) + singletons += 1; + + while (node) { + bitmask = 1 << node->bitnum; + if ((leftmask & bitmask) == 0) { + leftmask |= bitmask; + if (node->leftnode == LEAF) { + assert(node->left); + tree->leaf_print(node->left, + indent+1); + leaves += 1; + } else if (node->left) { + assert(node->leftnode == NODE); + indent += 1; + node = node->left; + break; + } + } + if ((rightmask & bitmask) == 0) { + rightmask |= bitmask; + if (node->rightnode == LEAF) { + assert(node->right); + 
tree->leaf_print(node->right, + indent+1); + leaves += 1; + } else if (node->right) { + assert(node->rightnode==NODE); + indent += 1; + node = node->right; + break; + } + } + leftmask &= ~bitmask; + rightmask &= ~bitmask; + node = node->parent; + indent -= 1; + } + } + } + printf("nodes %d leaves %d singletons %d\n", + nodes, leaves, singletons); +} + +/* + * Allocate and initialize a new internal node. + */ +static struct node * +alloc_node(struct node *parent) +{ + struct node *node; + int bitnum; + + node = malloc(sizeof(*node)); + node->left = node->right = NULL; + node->parent = parent; + node->leftnode = NODE; + node->rightnode = NODE; + node->keybits = 0; + node->keymask = 0; + node->mark = 0; + node->index = 0; + node->offset = -1; + node->size = 4; + + if (node->parent) { + bitnum = parent->bitnum; + if ((bitnum & 7) == 0) { + node->bitnum = bitnum + 7 + 8; + node->nextbyte = 1; + } else { + node->bitnum = bitnum - 1; + node->nextbyte = 0; + } + } else { + node->bitnum = 7; + node->nextbyte = 0; + } + + return node; +} + +/* + * Insert a new leaf into the tree, and collapse any subtrees that are + * fully populated and end in identical leaves. A nextbyte tagged + * internal node will not be removed to preserve the tree's integrity. + * Note that due to the structure of utf8, no nextbyte tagged node + * will be a candidate for removal. + */ +static int +insert(struct tree *tree, char *key, int keylen, void *leaf) +{ + struct node *node; + struct node *parent; + void **cursor; + int keybits; + + assert(keylen >= 1 && keylen <= 4); + + node = NULL; + cursor = &tree->root; + keybits = 8 * keylen; + + /* Insert, creating path along the way. */ + while (keybits) { + if (!*cursor) + *cursor = alloc_node(node); + node = *cursor; + if (node->nextbyte) + key++; + if (*key & (1 << (node->bitnum & 7))) + cursor = &node->right; + else + cursor = &node->left; + keybits--; + } + *cursor = leaf; + + /* Merge subtrees if possible. 
*/ + while (node) { + if (*key & (1 << (node->bitnum & 7))) + node->rightnode = LEAF; + else + node->leftnode = LEAF; + if (node->nextbyte) + break; + if (node->leftnode == NODE || node->rightnode == NODE) + break; + assert(node->left); + assert(node->right); + /* Compare */ + if (! tree->leaf_equal(node->left, node->right)) + break; + /* Keep left, drop right leaf. */ + leaf = node->left; + /* Check in parent */ + parent = node->parent; + if (!parent) { + /* root of tree! */ + tree->root = leaf; + tree->childnode = LEAF; + } else if (parent->left == node) { + parent->left = leaf; + parent->leftnode = LEAF; + if (parent->right) { + parent->keymask = 0; + parent->keybits = 0; + } else { + parent->keymask |= (1 << node->bitnum); + } + } else if (parent->right == node) { + parent->right = leaf; + parent->rightnode = LEAF; + if (parent->left) { + parent->keymask = 0; + parent->keybits = 0; + } else { + parent->keymask |= (1 << node->bitnum); + parent->keybits |= (1 << node->bitnum); + } + } else { + /* internal tree error */ + assert(0); + } + free(node); + node = parent; + } + + /* Propagate keymasks up along singleton chains. */ + while (node) { + parent = node->parent; + if (!parent) + break; + /* Nix the mask for parents with two children. */ + if (node->keymask == 0) { + parent->keymask = 0; + parent->keybits = 0; + } else if (parent->left && parent->right) { + parent->keymask = 0; + parent->keybits = 0; + } else { + assert((parent->keymask & node->keymask) == 0); + parent->keymask |= node->keymask; + parent->keymask |= (1 << parent->bitnum); + parent->keybits |= node->keybits; + if (parent->right) + parent->keybits |= (1 << parent->bitnum); + } + node = parent; + } + + return 0; +} + +/* + * Prune internal nodes. + * + * Fully populated subtrees that end at the same leaf have already + * been collapsed. 
There are still internal nodes that have for both + * their left and right branches a sequence of singletons that make + * identical choices and end in identical leaves. The keymask and + * keybits collected in the nodes describe the choices made in these + * singleton chains. When they are identical for the left and right + * branch of a node, and the two leaves compare identical, the node in + * question can be removed. + * + * Note that nodes with the nextbyte tag set will not be removed by + * this to ensure tree integrity. Note as well that the structure of + * utf8 ensures that these nodes would not have been candidates for + * removal in any case. + */ +static void +prune(struct tree *tree) +{ + struct node *node; + struct node *left; + struct node *right; + struct node *parent; + void *leftleaf; + void *rightleaf; + unsigned int leftmask; + unsigned int rightmask; + unsigned int bitmask; + int count; + + if (verbose > 0) + printf("Pruning %s_%x\n", tree->type, tree->maxage); + + count = 0; + if (tree->childnode == LEAF) + return; + if (!tree->root) + return; + + leftmask = rightmask = 0; + node = tree->root; + while (node) { + if (node->nextbyte) + goto advance; + if (node->leftnode == LEAF) + goto advance; + if (node->rightnode == LEAF) + goto advance; + if (!node->left) + goto advance; + if (!node->right) + goto advance; + left = node->left; + right = node->right; + if (left->keymask == 0) + goto advance; + if (right->keymask == 0) + goto advance; + if (left->keymask != right->keymask) + goto advance; + if (left->keybits != right->keybits) + goto advance; + leftleaf = NULL; + while (!leftleaf) { + assert(left->left || left->right); + if (left->leftnode == LEAF) + leftleaf = left->left; + else if (left->rightnode == LEAF) + leftleaf = left->right; + else if (left->left) + left = left->left; + else if (left->right) + left = left->right; + else + assert(0); + } + rightleaf = NULL; + while (!rightleaf) { + assert(right->left || right->right); + if 
(right->leftnode == LEAF) + rightleaf = right->left; + else if (right->rightnode == LEAF) + rightleaf = right->right; + else if (right->left) + right = right->left; + else if (right->right) + right = right->right; + else + assert(0); + } + if (! tree->leaf_equal(leftleaf, rightleaf)) + goto advance; + /* + * This node has identical singleton-only subtrees. + * Remove it. + */ + parent = node->parent; + left = node->left; + right = node->right; + if (parent->left == node) + parent->left = left; + else if (parent->right == node) + parent->right = left; + else + assert(0); + left->parent = parent; + left->keymask |= (1 << node->bitnum); + node->left = NULL; + while (node) { + bitmask = 1 << node->bitnum; + leftmask &= ~bitmask; + rightmask &= ~bitmask; + if (node->leftnode == NODE && node->left) { + left = node->left; + free(node); + count++; + node = left; + } else if (node->rightnode == NODE && node->right) { + right = node->right; + free(node); + count++; + node = right; + } else { + node = NULL; + } + } + /* Propagate keymasks up along singleton chains. 
*/ + node = parent; + /* Force re-check */ + bitmask = 1 << node->bitnum; + leftmask &= ~bitmask; + rightmask &= ~bitmask; + for (;;) { + if (node->left && node->right) + break; + if (node->left) { + left = node->left; + node->keymask |= left->keymask; + node->keybits |= left->keybits; + } + if (node->right) { + right = node->right; + node->keymask |= right->keymask; + node->keybits |= right->keybits; + } + node->keymask |= (1 << node->bitnum); + node = node->parent; + /* Force re-check */ + bitmask = 1 << node->bitnum; + leftmask &= ~bitmask; + rightmask &= ~bitmask; + } + advance: + bitmask = 1 << node->bitnum; + if ((leftmask & bitmask) == 0 && + node->leftnode == NODE && + node->left) { + leftmask |= bitmask; + node = node->left; + } else if ((rightmask & bitmask) == 0 && + node->rightnode == NODE && + node->right) { + rightmask |= bitmask; + node = node->right; + } else { + leftmask &= ~bitmask; + rightmask &= ~bitmask; + node = node->parent; + } + } + if (verbose > 0) + printf("Pruned %d nodes\n", count); +} + +/* + * Mark the nodes in the tree that lead to leaves that must be + * emitted. 
+ */ +static void +mark_nodes(struct tree *tree) +{ + struct node *node; + struct node *n; + unsigned int leftmask; + unsigned int rightmask; + unsigned int bitmask; + int marked; + + marked = 0; + if (verbose > 0) + printf("Marking %s_%x\n", tree->type, tree->maxage); + if (tree->childnode == LEAF) + goto done; + + assert(tree->childnode == NODE); + node = tree->root; + leftmask = rightmask = 0; + while (node) { + bitmask = 1 << node->bitnum; + if ((leftmask & bitmask) == 0) { + leftmask |= bitmask; + if (node->leftnode == LEAF) { + assert(node->left); + if (tree->leaf_mark(node->left)) { + n = node; + while (n && !n->mark) { + marked++; + n->mark = 1; + n = n->parent; + } + } + } else if (node->left) { + assert(node->leftnode == NODE); + node = node->left; + continue; + } + } + if ((rightmask & bitmask) == 0) { + rightmask |= bitmask; + if (node->rightnode == LEAF) { + assert(node->right); + if (tree->leaf_mark(node->right)) { + n = node; + while (n && !n->mark) { + marked++; + n->mark = 1; + n = n->parent; + } + } + } else if (node->right) { + assert(node->rightnode==NODE); + node = node->right; + continue; + } + } + leftmask &= ~bitmask; + rightmask &= ~bitmask; + node = node->parent; + } + + /* second pass: left siblings and singletons */ + + assert(tree->childnode == NODE); + node = tree->root; + leftmask = rightmask = 0; + while (node) { + bitmask = 1 << node->bitnum; + if ((leftmask & bitmask) == 0) { + leftmask |= bitmask; + if (node->leftnode == LEAF) { + assert(node->left); + if (tree->leaf_mark(node->left)) { + n = node; + while (n && !n->mark) { + marked++; + n->mark = 1; + n = n->parent; + } + } + } else if (node->left) { + assert(node->leftnode == NODE); + node = node->left; + if (!node->mark && node->parent->mark) { + marked++; + node->mark = 1; + } + continue; + } + } + if ((rightmask & bitmask) == 0) { + rightmask |= bitmask; + if (node->rightnode == LEAF) { + assert(node->right); + if (tree->leaf_mark(node->right)) { + n = node; + while (n && 
!n->mark) { + marked++; + n->mark = 1; + n = n->parent; + } + } + } else if (node->right) { + assert(node->rightnode==NODE); + node = node->right; + if (!node->mark && node->parent->mark && + !node->parent->left) { + marked++; + node->mark = 1; + } + continue; + } + } + leftmask &= ~bitmask; + rightmask &= ~bitmask; + node = node->parent; + } +done: + if (verbose > 0) + printf("Marked %d nodes\n", marked); +} + +/* + * Compute the index of each node and leaf, which is the offset in the + * emitted trie. These values must be pre-computed because relative + * offsets between nodes are used to navigate the tree. + */ +static int +index_nodes(struct tree *tree, int index) +{ + struct node *node; + unsigned int leftmask; + unsigned int rightmask; + unsigned int bitmask; + int count; + int indent; + + /* Align to a cache line (or half a cache line?). */ + while (index % 64) + index++; + tree->index = index; + indent = 1; + count = 0; + + if (verbose > 0) + printf("Indexing %s_%x: %d", tree->type, tree->maxage, index); + if (tree->childnode == LEAF) { + index += tree->leaf_size(tree->root); + goto done; + } + + assert(tree->childnode == NODE); + node = tree->root; + leftmask = rightmask = 0; + while (node) { + if (!node->mark) + goto skip; + count++; + if (node->index != index) + node->index = index; + index += node->size; +skip: + while (node) { + bitmask = 1 << node->bitnum; + if (node->mark && (leftmask & bitmask) == 0) { + leftmask |= bitmask; + if (node->leftnode == LEAF) { + assert(node->left); + *tree->leaf_index(tree, node->left) = + index; + index += tree->leaf_size(node->left); + count++; + } else if (node->left) { + assert(node->leftnode == NODE); + indent += 1; + node = node->left; + break; + } + } + if (node->mark && (rightmask & bitmask) == 0) { + rightmask |= bitmask; + if (node->rightnode == LEAF) { + assert(node->right); + *tree->leaf_index(tree, node->right) = index; + index += tree->leaf_size(node->right); + count++; + } else if (node->right) { + 
assert(node->rightnode==NODE); + indent += 1; + node = node->right; + break; + } + } + leftmask &= ~bitmask; + rightmask &= ~bitmask; + node = node->parent; + indent -= 1; + } + } +done: + /* Round up to a multiple of 16 */ + while (index % 16) + index++; + if (verbose > 0) + printf("Final index %d\n", index); + return index; +} + +/* + * Compute the size of nodes and leaves. We start by assuming that + * each node needs to store a three-byte offset. The indexes of the + * nodes are calculated based on that, and then this function is + * called to see if the sizes of some nodes can be reduced. This is + * repeated until no more changes are seen. + */ +static int +size_nodes(struct tree *tree) +{ + struct tree *next; + struct node *node; + struct node *right; + struct node *n; + unsigned int leftmask; + unsigned int rightmask; + unsigned int bitmask; + unsigned int pathbits; + unsigned int pathmask; + int changed; + int offset; + int size; + int indent; + + indent = 1; + changed = 0; + size = 0; + + if (verbose > 0) + printf("Sizing %s_%x", tree->type, tree->maxage); + if (tree->childnode == LEAF) + goto done; + + assert(tree->childnode == NODE); + pathbits = 0; + pathmask = 0; + node = tree->root; + leftmask = rightmask = 0; + while (node) { + if (!node->mark) + goto skip; + offset = 0; + if (!node->left || !node->right) { + size = 1; + } else { + if (node->rightnode == NODE) { + right = node->right; + next = tree->next; + while (!right->mark) { + assert(next); + n = next->root; + while (n->bitnum != node->bitnum) { + if (pathbits & (1<<n->bitnum)) + n = n->right; + else + n = n->left; + } + n = n->right; + assert(right->bitnum == n->bitnum); + right = n; + next = next->next; + } + offset = right->index - node->index; + } else { + offset = *tree->leaf_index(tree, node->right); + offset -= node->index; + } + assert(offset >= 0); + assert(offset <= 0xffffff); + if (offset <= 0xff) { + size = 2; + } else if (offset <= 0xffff) { + size = 3; + } else { /* offset <= 
0xffffff */ + size = 4; + } + } + if (node->size != size || node->offset != offset) { + node->size = size; + node->offset = offset; + changed++; + } +skip: + while (node) { + bitmask = 1 << node->bitnum; + pathmask |= bitmask; + if (node->mark && (leftmask & bitmask) == 0) { + leftmask |= bitmask; + if (node->leftnode == LEAF) { + assert(node->left); + } else if (node->left) { + assert(node->leftnode == NODE); + indent += 1; + node = node->left; + break; + } + } + if (node->mark && (rightmask & bitmask) == 0) { + rightmask |= bitmask; + pathbits |= bitmask; + if (node->rightnode == LEAF) { + assert(node->right); + } else if (node->right) { + assert(node->rightnode==NODE); + indent += 1; + node = node->right; + break; + } + } + leftmask &= ~bitmask; + rightmask &= ~bitmask; + pathmask &= ~bitmask; + pathbits &= ~bitmask; + node = node->parent; + indent -= 1; + } + } +done: + if (verbose > 0) + printf("Found %d changes\n", changed); + return changed; +} + +/* + * Emit a trie for the given tree into the data array. 
+ */ +static void +emit(struct tree *tree, unsigned char *data) +{ + struct node *node; + unsigned int leftmask; + unsigned int rightmask; + unsigned int bitmask; + int offlen; + int offset; + int index; + int indent; + unsigned char byte; + + index = tree->index; + data += index; + indent = 1; + if (verbose > 0) + printf("Emitting %s_%x\n", tree->type, tree->maxage); + if (tree->childnode == LEAF) { + assert(tree->root); + tree->leaf_emit(tree->root, data); + return; + } + + assert(tree->childnode == NODE); + node = tree->root; + leftmask = rightmask = 0; + while (node) { + if (!node->mark) + goto skip; + assert(node->offset != -1); + assert(node->index == index); + + byte = 0; + if (node->nextbyte) + byte |= NEXTBYTE; + byte |= (node->bitnum & BITNUM); + if (node->left && node->right) { + if (node->leftnode == NODE) + byte |= LEFTNODE; + if (node->rightnode == NODE) + byte |= RIGHTNODE; + if (node->offset <= 0xff) + offlen = 1; + else if (node->offset <= 0xffff) + offlen = 2; + else + offlen = 3; + offset = node->offset; + byte |= offlen << OFFLEN_SHIFT; + *data++ = byte; + index++; + while (offlen--) { + *data++ = offset & 0xff; + index++; + offset >>= 8; + } + } else if (node->left) { + if (node->leftnode == NODE) + byte |= TRIENODE; + *data++ = byte; + index++; + } else if (node->right) { + byte |= RIGHTNODE; + if (node->rightnode == NODE) + byte |= TRIENODE; + *data++ = byte; + index++; + } else { + assert(0); + } +skip: + while (node) { + bitmask = 1 << node->bitnum; + if (node->mark && (leftmask & bitmask) == 0) { + leftmask |= bitmask; + if (node->leftnode == LEAF) { + assert(node->left); + data = tree->leaf_emit(node->left, + data); + index += tree->leaf_size(node->left); + } else if (node->left) { + assert(node->leftnode == NODE); + indent += 1; + node = node->left; + break; + } + } + if (node->mark && (rightmask & bitmask) == 0) { + rightmask |= bitmask; + if (node->rightnode == LEAF) { + assert(node->right); + data = tree->leaf_emit(node->right, + 
data); + index += tree->leaf_size(node->right); + } else if (node->right) { + assert(node->rightnode==NODE); + indent += 1; + node = node->right; + break; + } + } + leftmask &= ~bitmask; + rightmask &= ~bitmask; + node = node->parent; + indent -= 1; + } + } +} + +/* ------------------------------------------------------------------ */ + +/* + * Unicode data. + * + * We need to keep track of the Canonical Combining Class, the Age, + * and decompositions for a code point. + * + * For the Age, we store the index into the ages table. Effectively + * this is a generation number that the table maps to a unicode + * version. + * + * The correction field is used to indicate that this entry is in the + * corrections array, which contains decompositions that were + * corrected in later revisions. The value of the correction field is + * the Unicode version in which the mapping was corrected. + */ +struct unicode_data { + unsigned int code; + int ccc; + int gen; + int correction; + unsigned int *utf32nfkdi; + unsigned int *utf32nfkdicf; + char *utf8nfkdi; + char *utf8nfkdicf; +}; + +struct unicode_data unicode_data[0x110000]; +struct unicode_data *corrections; +int corrections_count; + +struct tree *nfkdi_tree; +struct tree *nfkdicf_tree; + +struct tree *trees; +int trees_count; + +/* + * Check the corrections array to see if this entry was corrected at + * some point. 
+ */ +static struct unicode_data * +corrections_lookup(struct unicode_data *u) +{ + int i; + + for (i = 0; i != corrections_count; i++) + if (u->code == corrections[i].code) + return &corrections[i]; + return u; +} + +static int +nfkdi_equal(void *l, void *r) +{ + struct unicode_data *left = l; + struct unicode_data *right = r; + + if (left->gen != right->gen) + return 0; + if (left->ccc != right->ccc) + return 0; + if (left->utf8nfkdi && right->utf8nfkdi && + strcmp(left->utf8nfkdi, right->utf8nfkdi) == 0) + return 1; + if (left->utf8nfkdi || right->utf8nfkdi) + return 0; + return 1; +} + +static int +nfkdicf_equal(void *l, void *r) +{ + struct unicode_data *left = l; + struct unicode_data *right = r; + + if (left->gen != right->gen) + return 0; + if (left->ccc != right->ccc) + return 0; + if (left->utf8nfkdicf && right->utf8nfkdicf && + strcmp(left->utf8nfkdicf, right->utf8nfkdicf) == 0) + return 1; + if (left->utf8nfkdicf && right->utf8nfkdicf) + return 0; + if (left->utf8nfkdicf || right->utf8nfkdicf) + return 0; + if (left->utf8nfkdi && right->utf8nfkdi && + strcmp(left->utf8nfkdi, right->utf8nfkdi) == 0) + return 1; + if (left->utf8nfkdi || right->utf8nfkdi) + return 0; + return 1; +} + +static void +nfkdi_print(void *l, int indent) +{ + struct unicode_data *leaf = l; + + printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf, + leaf->code, leaf->ccc, leaf->gen); + if (leaf->utf8nfkdi) + printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi); + printf("\n"); +} + +static void +nfkdicf_print(void *l, int indent) +{ + struct unicode_data *leaf = l; + + printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf, + leaf->code, leaf->ccc, leaf->gen); + if (leaf->utf8nfkdicf) + printf(" nfkdicf \"%s\"", (const char*)leaf->utf8nfkdicf); + else if (leaf->utf8nfkdi) + printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi); + printf("\n"); +} + +static int +nfkdi_mark(void *l) +{ + return 1; +} + +static int +nfkdicf_mark(void *l) +{ + struct unicode_data *leaf 
= l; + if (leaf->utf8nfkdicf) + return 1; + return 0; +} + +static int +correction_mark(void *l) +{ + struct unicode_data *leaf = l; + return leaf->correction; +} + +static int +nfkdi_size(void *l) +{ + struct unicode_data *leaf = l; + int size = 2; + if (leaf->utf8nfkdi) + size += strlen(leaf->utf8nfkdi) + 1; + return size; +} + +static int +nfkdicf_size(void *l) +{ + struct unicode_data *leaf = l; + int size = 2; + if (leaf->utf8nfkdicf) + size += strlen(leaf->utf8nfkdicf) + 1; + else if (leaf->utf8nfkdi) + size += strlen(leaf->utf8nfkdi) + 1; + return size; +} + +static int * +nfkdi_index(struct tree *tree, void *l) +{ + struct unicode_data *leaf = l; + return &tree->leafindex[leaf->code]; +} + +static int * +nfkdicf_index(struct tree *tree, void *l) +{ + struct unicode_data *leaf = l; + return &tree->leafindex[leaf->code]; +} + +static unsigned char * +nfkdi_emit(void *l, unsigned char *data) +{ + struct unicode_data *leaf = l; + unsigned char *s; + + *data++ = leaf->gen; + if (leaf->utf8nfkdi) { + *data++ = DECOMPOSE; + s = (unsigned char*)leaf->utf8nfkdi; + while ((*data++ = *s++) != 0) + ; + } else { + *data++ = leaf->ccc; + } + return data; +} + +static unsigned char * +nfkdicf_emit(void *l, unsigned char *data) +{ + struct unicode_data *leaf = l; + unsigned char *s; + + *data++ = leaf->gen; + if (leaf->utf8nfkdicf) { + *data++ = DECOMPOSE; + s = (unsigned char*)leaf->utf8nfkdicf; + while ((*data++ = *s++) != 0) + ; + } else if (leaf->utf8nfkdi) { + *data++ = DECOMPOSE; + s = (unsigned char*)leaf->utf8nfkdi; + while ((*data++ = *s++) != 0) + ; + } else { + *data++ = leaf->ccc; + } + return data; +} + +static void +utf8_create(struct unicode_data *data) +{ + char utf[18*4+1]; + char *u; + unsigned int *um; + int i; + + u = utf; + um = data->utf32nfkdi; + if (um) { + for (i = 0; um[i]; i++) + u += utf8key(um[i], u); + *u = '\0'; + data->utf8nfkdi = strdup((char*)utf); + } + u = utf; + um = data->utf32nfkdicf; + if (um) { + for (i = 0; um[i]; i++) + u += 
utf8key(um[i], u); + *u = '\0'; + if (!data->utf8nfkdi || strcmp(data->utf8nfkdi, (char*)utf)) + data->utf8nfkdicf = strdup((char*)utf); + } +} + +static void +utf8_init(void) +{ + unsigned int unichar; + int i; + + for (unichar = 0; unichar != 0x110000; unichar++) + utf8_create(&unicode_data[unichar]); + + for (i = 0; i != corrections_count; i++) + utf8_create(&corrections[i]); +} + +static void +trees_init(void) +{ + struct unicode_data *data; + unsigned int maxage; + unsigned int nextage; + int count; + int i; + int j; + + /* Count the number of different ages. */ + count = 0; + nextage = (unsigned int)-1; + do { + maxage = nextage; + nextage = 0; + for (i = 0; i <= corrections_count; i++) { + data = &corrections[i]; + if (nextage < data->correction && + data->correction < maxage) + nextage = data->correction; + } + count++; + } while (nextage); + + /* Two trees per age: nfkdi and nfkdicf */ + trees_count = count * 2; + trees = calloc(trees_count, sizeof(struct tree)); + + /* Assign ages to the trees. */ + count = trees_count; + nextage = (unsigned int)-1; + do { + maxage = nextage; + trees[--count].maxage = maxage; + trees[--count].maxage = maxage; + nextage = 0; + for (i = 0; i <= corrections_count; i++) { + data = &corrections[i]; + if (nextage < data->correction && + data->correction < maxage) + nextage = data->correction; + } + } while (nextage); + + /* The ages assigned above are off by one. */ + for (i = 0; i != trees_count; i++) { + j = 0; + while (ages[j] < trees[i].maxage) + j++; + trees[i].maxage = ages[j-1]; + } + + /* Set up the forwarding between trees. */ + trees[trees_count-2].next = &trees[trees_count-1]; + trees[trees_count-1].leaf_mark = nfkdi_mark; + trees[trees_count-2].leaf_mark = nfkdicf_mark; + for (i = 0; i != trees_count-2; i += 2) { + trees[i].next = &trees[trees_count-2]; + trees[i].leaf_mark = correction_mark; + trees[i+1].next = &trees[trees_count-1]; + trees[i+1].leaf_mark = correction_mark; + } + + /* Assign the callouts. 
*/ + for (i = 0; i != trees_count; i += 2) { + trees[i].type = "nfkdicf"; + trees[i].leaf_equal = nfkdicf_equal; + trees[i].leaf_print = nfkdicf_print; + trees[i].leaf_size = nfkdicf_size; + trees[i].leaf_index = nfkdicf_index; + trees[i].leaf_emit = nfkdicf_emit; + + trees[i+1].type = "nfkdi"; + trees[i+1].leaf_equal = nfkdi_equal; + trees[i+1].leaf_print = nfkdi_print; + trees[i+1].leaf_size = nfkdi_size; + trees[i+1].leaf_index = nfkdi_index; + trees[i+1].leaf_emit = nfkdi_emit; + } + + /* Finish init. */ + for (i = 0; i != trees_count; i++) + trees[i].childnode = NODE; +} + +static void +trees_populate(void) +{ + struct unicode_data *data; + unsigned int unichar; + char keyval[4]; + int keylen; + int i; + + for (i = 0; i != trees_count; i++) { + if (verbose > 0) { + printf("Populating %s_%x\n", + trees[i].type, trees[i].maxage); + } + for (unichar = 0; unichar != 0x110000; unichar++) { + if (unicode_data[unichar].gen < 0) + continue; + keylen = utf8key(unichar, keyval); + data = corrections_lookup(&unicode_data[unichar]); + if (data->correction <= trees[i].maxage) + data = &unicode_data[unichar]; + insert(&trees[i], keyval, keylen, data); + } + } +} + +static void +trees_reduce(void) +{ + int i; + int size; + int changed; + + for (i = 0; i != trees_count; i++) + prune(&trees[i]); + for (i = 0; i != trees_count; i++) + mark_nodes(&trees[i]); + do { + size = 0; + for (i = 0; i != trees_count; i++) + size = index_nodes(&trees[i], size); + changed = 0; + for (i = 0; i != trees_count; i++) + changed += size_nodes(&trees[i]); + } while (changed); + + utf8data = calloc(size, 1); + utf8data_size = size; + for (i = 0; i != trees_count; i++) + emit(&trees[i], utf8data); + + if (verbose > 0) { + for (i = 0; i != trees_count; i++) { + printf("%s_%x idx %d\n", + trees[i].type, trees[i].maxage, trees[i].index); + } + } + + nfkdi = utf8data + trees[trees_count-1].index; + nfkdicf = utf8data + trees[trees_count-2].index; + + nfkdi_tree = &trees[trees_count-1]; + nfkdicf_tree = 
&trees[trees_count-2];
+}
+
+static void
+verify(struct tree *tree)
+{
+	struct unicode_data *data;
+	utf8leaf_t *leaf;
+	unsigned int unichar;
+	char key[4];
+	int report;
+	int nocf;
+
+	if (verbose > 0)
+		printf("Verifying %s_%x\n", tree->type, tree->maxage);
+	nocf = strcmp(tree->type, "nfkdicf");
+
+	for (unichar = 0; unichar != 0x110000; unichar++) {
+		report = 0;
+		data = corrections_lookup(&unicode_data[unichar]);
+		if (data->correction <= tree->maxage)
+			data = &unicode_data[unichar];
+		utf8key(unichar, key);
+		leaf = utf8lookup(tree, key);
+		if (!leaf) {
+			if (data->gen != -1)
+				report++;
+			if (unichar < 0xd800 || unichar > 0xdfff)
+				report++;
+		} else {
+			if (unichar >= 0xd800 && unichar <= 0xdfff)
+				report++;
+			if (data->gen == -1)
+				report++;
+			if (data->gen != LEAF_GEN(leaf))
+				report++;
+			if (LEAF_CCC(leaf) == DECOMPOSE) {
+				if (nocf) {
+					if (!data->utf8nfkdi) {
+						report++;
+					} else if (strcmp(data->utf8nfkdi,
+							  LEAF_STR(leaf))) {
+						report++;
+					}
+				} else {
+					if (!data->utf8nfkdicf &&
+					    !data->utf8nfkdi) {
+						report++;
+					} else if (data->utf8nfkdicf) {
+						if (strcmp(data->utf8nfkdicf,
+							   LEAF_STR(leaf)))
+							report++;
+					} else if (strcmp(data->utf8nfkdi,
+							  LEAF_STR(leaf))) {
+						report++;
+					}
+				}
+			} else if (data->ccc != LEAF_CCC(leaf)) {
+				report++;
+			}
+		}
+		if (report) {
+			printf("%X code %X gen %d ccc %d"
+				" nfkdi -> \"%s\"",
+				unichar, data->code, data->gen,
+				data->ccc,
+				data->utf8nfkdi);
+			if (leaf) {
+				printf(" age %d ccc %d"
+					" nfkdi -> \"%s\"\n",
+					LEAF_GEN(leaf),
+					LEAF_CCC(leaf),
+					LEAF_CCC(leaf) == DECOMPOSE ?
+					LEAF_STR(leaf) : "");
+			}
+			printf("\n");
+		}
+	}
+}
+
+static void
+trees_verify(void)
+{
+	int i;
+
+	for (i = 0; i != trees_count; i++)
+		verify(&trees[i]);
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+help(void)
+{
+	printf("Usage: %s [options]\n", argv0);
+	printf("\n");
+	printf("This program creates a data trie used for parsing and\n");
+	printf("normalization of UTF-8 strings. The trie is derived from\n");
+	printf("a set of input files from the Unicode character database\n");
+	printf("found at: http://www.unicode.org/Public/UCD/latest/ucd/\n");
+	printf("\n");
+	printf("The generated tree supports two normalization forms:\n");
+	printf("\n");
+	printf("\tnfkdi:\n");
+	printf("\t- Apply unicode normalization form NFKD.\n");
+	printf("\t- Remove any Default_Ignorable_Code_Point.\n");
+	printf("\n");
+	printf("\tnfkdicf:\n");
+	printf("\t- Apply unicode normalization form NFKD.\n");
+	printf("\t- Remove any Default_Ignorable_Code_Point.\n");
+	printf("\t- Apply a full casefold (C + F).\n");
+	printf("\n");
+	printf("These forms were chosen as being most useful when dealing\n");
+	printf("with file names: NFKD catches most cases where characters\n");
+	printf("should be considered equivalent. 
The ignorables are mostly\n"); + printf("invisible, making names hard to type.\n"); + printf("\n"); + printf("The options to specify the files to be used are listed\n"); + printf("below with their default values, which are the names used\n"); + printf("by version 7.0.0 of the Unicode Character Database.\n"); + printf("\n"); + printf("The input files:\n"); + printf("\t-a %s\n", AGE_NAME); + printf("\t-c %s\n", CCC_NAME); + printf("\t-p %s\n", PROP_NAME); + printf("\t-d %s\n", DATA_NAME); + printf("\t-f %s\n", FOLD_NAME); + printf("\t-n %s\n", NORM_NAME); + printf("\n"); + printf("Additionally, the generated tables are tested using:\n"); + printf("\t-t %s\n", TEST_NAME); + printf("\n"); + printf("Finally, the output file:\n"); + printf("\t-o %s\n", UTF8_NAME); + printf("\n"); +} + +static void +usage(void) +{ + help(); + exit(1); +} + +static void +open_fail(const char *name, int error) +{ + printf("Error %d opening %s: %s\n", error, name, strerror(error)); + exit(1); +} + +static void +file_fail(const char *filename) +{ + printf("Error parsing %s\n", filename); + exit(1); +} + +static void +line_fail(const char *filename, const char *line) +{ + printf("Error parsing %s:%s\n", filename, line); + exit(1); +} + +/* ------------------------------------------------------------------ */ + +static void +print_utf32(unsigned int *utf32str) +{ + int i; + for (i = 0; utf32str[i]; i++) + printf(" %X", utf32str[i]); +} + +static void +print_utf32nfkdi(unsigned int unichar) +{ + printf(" %X ->", unichar); + print_utf32(unicode_data[unichar].utf32nfkdi); + printf("\n"); +} + +static void +print_utf32nfkdicf(unsigned int unichar) +{ + printf(" %X ->", unichar); + print_utf32(unicode_data[unichar].utf32nfkdicf); + printf("\n"); +} + +/* ------------------------------------------------------------------ */ + +static void +age_init(void) +{ + FILE *file; + unsigned int first; + unsigned int last; + unsigned int unichar; + unsigned int major; + unsigned int minor; + unsigned int 
revision; + int gen; + int count; + int ret; + + if (verbose > 0) + printf("Parsing %s\n", age_name); + + file = fopen(age_name, "r"); + if (!file) + open_fail(age_name, errno); + count = 0; + + gen = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "# Age=V%d_%d_%d", + &major, &minor, &revision); + if (ret == 3) { + ages_count++; + if (verbose > 1) + printf(" Age V%d_%d_%d\n", + major, minor, revision); + if (!age_valid(major, minor, revision)) + line_fail(age_name, line); + continue; + } + ret = sscanf(line, "# Age=V%d_%d", &major, &minor); + if (ret == 2) { + ages_count++; + if (verbose > 1) + printf(" Age V%d_%d\n", major, minor); + if (!age_valid(major, minor, 0)) + line_fail(age_name, line); + continue; + } + } + + /* We must have found something above. */ + if (verbose > 1) + printf("%d age entries\n", ages_count); + if (ages_count == 0 || ages_count > MAXGEN) + file_fail(age_name); + + /* There is a 0 entry. */ + ages_count++; + ages = calloc(ages_count + 1, sizeof(*ages)); + /* And a guard entry. 
*/ + ages[ages_count] = (unsigned int)-1; + + rewind(file); + count = 0; + gen = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "# Age=V%d_%d_%d", + &major, &minor, &revision); + if (ret == 3) { + ages[++gen] = + UNICODE_AGE(major, minor, revision); + if (verbose > 1) + printf(" Age V%d_%d_%d = gen %d\n", + major, minor, revision, gen); + if (!age_valid(major, minor, revision)) + line_fail(age_name, line); + continue; + } + ret = sscanf(line, "# Age=V%d_%d", &major, &minor); + if (ret == 2) { + ages[++gen] = UNICODE_AGE(major, minor, 0); + if (verbose > 1) + printf(" Age V%d_%d = %d\n", + major, minor, gen); + if (!age_valid(major, minor, 0)) + line_fail(age_name, line); + continue; + } + ret = sscanf(line, "%X..%X ; %d.%d #", + &first, &last, &major, &minor); + if (ret == 4) { + for (unichar = first; unichar <= last; unichar++) + unicode_data[unichar].gen = gen; + count += 1 + last - first; + if (verbose > 1) + printf(" %X..%X gen %d\n", first, last, gen); + if (!utf32valid(first) || !utf32valid(last)) + line_fail(age_name, line); + continue; + } + ret = sscanf(line, "%X ; %d.%d #", &unichar, &major, &minor); + if (ret == 3) { + unicode_data[unichar].gen = gen; + count++; + if (verbose > 1) + printf(" %X gen %d\n", unichar, gen); + if (!utf32valid(unichar)) + line_fail(age_name, line); + continue; + } + } + unicode_maxage = ages[gen]; + fclose(file); + + /* Nix surrogate block */ + if (verbose > 1) + printf(" Removing surrogate block D800..DFFF\n"); + for (unichar = 0xd800; unichar <= 0xdfff; unichar++) + unicode_data[unichar].gen = -1; + + if (verbose > 0) + printf("Found %d entries\n", count); + if (count == 0) + file_fail(age_name); +} + +static void +ccc_init(void) +{ + FILE *file; + unsigned int first; + unsigned int last; + unsigned int unichar; + unsigned int value; + int count; + int ret; + + if (verbose > 0) + printf("Parsing %s\n", ccc_name); + + file = fopen(ccc_name, "r"); + if (!file) + open_fail(ccc_name, errno); + + count = 0; + 
while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "%X..%X ; %d #", &first, &last, &value); + if (ret == 3) { + for (unichar = first; unichar <= last; unichar++) { + unicode_data[unichar].ccc = value; + count++; + } + if (verbose > 1) + printf(" %X..%X ccc %d\n", first, last, value); + if (!utf32valid(first) || !utf32valid(last)) + line_fail(ccc_name, line); + continue; + } + ret = sscanf(line, "%X ; %d #", &unichar, &value); + if (ret == 2) { + unicode_data[unichar].ccc = value; + count++; + if (verbose > 1) + printf(" %X ccc %d\n", unichar, value); + if (!utf32valid(unichar)) + line_fail(ccc_name, line); + continue; + } + } + fclose(file); + + if (verbose > 0) + printf("Found %d entries\n", count); + if (count == 0) + file_fail(ccc_name); +} + +static void +nfkdi_init(void) +{ + FILE *file; + unsigned int unichar; + unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */ + char *s; + unsigned int *um; + int count; + int i; + int ret; + + if (verbose > 0) + printf("Parsing %s\n", data_name); + file = fopen(data_name, "r"); + if (!file) + open_fail(data_name, errno); + + count = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "%X;%*[^;];%*[^;];%*[^;];%*[^;];%[^;];", + &unichar, buf0); + if (ret != 2) + continue; + if (!utf32valid(unichar)) + line_fail(data_name, line); + + s = buf0; + /* skip over <tag> */ + if (*s == '<') + while (*s++ != ' ') + ; + /* decode the decomposition into UTF-32 */ + i = 0; + while (*s) { + mapping[i] = strtoul(s, &s, 16); + if (!utf32valid(mapping[i])) + line_fail(data_name, line); + i++; + } + mapping[i++] = 0; + + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + unicode_data[unichar].utf32nfkdi = um; + + if (verbose > 1) + print_utf32nfkdi(unichar); + count++; + } + fclose(file); + if (verbose > 0) + printf("Found %d entries\n", count); + if (count == 0) + file_fail(data_name); +} + +static void +nfkdicf_init(void) +{ + FILE *file; + unsigned int unichar; 
+ unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */ + char status; + char *s; + unsigned int *um; + int i; + int count; + int ret; + + if (verbose > 0) + printf("Parsing %s\n", fold_name); + file = fopen(fold_name, "r"); + if (!file) + open_fail(fold_name, errno); + + count = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "%X; %c; %[^;];", &unichar, &status, buf0); + if (ret != 3) + continue; + if (!utf32valid(unichar)) + line_fail(fold_name, line); + /* Use the C+F casefold. */ + if (status != 'C' && status != 'F') + continue; + s = buf0; + if (*s == '<') + while (*s++ != ' ') + ; + i = 0; + while (*s) { + mapping[i] = strtoul(s, &s, 16); + if (!utf32valid(mapping[i])) + line_fail(fold_name, line); + i++; + } + mapping[i++] = 0; + + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + unicode_data[unichar].utf32nfkdicf = um; + + if (verbose > 1) + print_utf32nfkdicf(unichar); + count++; + } + fclose(file); + if (verbose > 0) + printf("Found %d entries\n", count); + if (count == 0) + file_fail(fold_name); +} + +static void +ignore_init(void) +{ + FILE *file; + unsigned int unichar; + unsigned int first; + unsigned int last; + unsigned int *um; + int count; + int ret; + + if (verbose > 0) + printf("Parsing %s\n", prop_name); + file = fopen(prop_name, "r"); + if (!file) + open_fail(prop_name, errno); + assert(file); + count = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "%X..%X ; %s # ", &first, &last, buf0); + if (ret == 3) { + if (strcmp(buf0, "Default_Ignorable_Code_Point")) + continue; + if (!utf32valid(first) || !utf32valid(last)) + line_fail(prop_name, line); + for (unichar = first; unichar <= last; unichar++) { + free(unicode_data[unichar].utf32nfkdi); + um = malloc(sizeof(unsigned int)); + *um = 0; + unicode_data[unichar].utf32nfkdi = um; + free(unicode_data[unichar].utf32nfkdicf); + um = malloc(sizeof(unsigned int)); + *um = 0; + unicode_data[unichar].utf32nfkdicf = 
um; + count++; + } + if (verbose > 1) + printf(" %X..%X Default_Ignorable_Code_Point\n", + first, last); + continue; + } + ret = sscanf(line, "%X ; %s # ", &unichar, buf0); + if (ret == 2) { + if (strcmp(buf0, "Default_Ignorable_Code_Point")) + continue; + if (!utf32valid(unichar)) + line_fail(prop_name, line); + free(unicode_data[unichar].utf32nfkdi); + um = malloc(sizeof(unsigned int)); + *um = 0; + unicode_data[unichar].utf32nfkdi = um; + free(unicode_data[unichar].utf32nfkdicf); + um = malloc(sizeof(unsigned int)); + *um = 0; + unicode_data[unichar].utf32nfkdicf = um; + if (verbose > 1) + printf(" %X Default_Ignorable_Code_Point\n", + unichar); + count++; + continue; + } + } + fclose(file); + + if (verbose > 0) + printf("Found %d entries\n", count); + if (count == 0) + file_fail(prop_name); +} + +static void +corrections_init(void) +{ + FILE *file; + unsigned int unichar; + unsigned int major; + unsigned int minor; + unsigned int revision; + unsigned int age; + unsigned int *um; + unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. 
*/ + char *s; + int i; + int count; + int ret; + + if (verbose > 0) + printf("Parsing %s\n", norm_name); + file = fopen(norm_name, "r"); + if (!file) + open_fail(norm_name, errno); + + count = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "%X;%[^;];%[^;];%d.%d.%d #", + &unichar, buf0, buf1, + &major, &minor, &revision); + if (ret != 6) + continue; + if (!utf32valid(unichar) || !age_valid(major, minor, revision)) + line_fail(norm_name, line); + count++; + } + corrections = calloc(count, sizeof(struct unicode_data)); + corrections_count = count; + rewind(file); + + count = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "%X;%[^;];%[^;];%d.%d.%d #", + &unichar, buf0, buf1, + &major, &minor, &revision); + if (ret != 6) + continue; + if (!utf32valid(unichar) || !age_valid(major, minor, revision)) + line_fail(norm_name, line); + corrections[count] = unicode_data[unichar]; + assert(corrections[count].code == unichar); + age = UNICODE_AGE(major, minor, revision); + corrections[count].correction = age; + + i = 0; + s = buf0; + while (*s) { + mapping[i] = strtoul(s, &s, 16); + if (!utf32valid(mapping[i])) + line_fail(norm_name, line); + i++; + } + mapping[i++] = 0; + + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + corrections[count].utf32nfkdi = um; + + if (verbose > 1) + printf(" %X -> %s -> %s V%d_%d_%d\n", + unichar, buf0, buf1, major, minor, revision); + count++; + } + fclose(file); + + if (verbose > 0) + printf("Found %d entries\n", count); + if (count == 0) + file_fail(norm_name); +} + +/* ------------------------------------------------------------------ */ + +/* + * Hangul decomposition (algorithm from Section 3.12 of Unicode 6.3.0) + * + * AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;; + * D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;; + * + * SBase = 0xAC00 + * LBase = 0x1100 + * VBase = 0x1161 + * TBase = 0x11A7 + * LCount = 19 + * VCount = 21 + * TCount = 28 + * NCount = 588 (VCount * 
TCount)
+ *   SCount = 11172 (LCount * NCount)
+ *
+ * Decomposition:
+ *   SIndex = s - SBase
+ *
+ * LV (Canonical/Full)
+ *   LIndex = SIndex / NCount
+ *   VIndex = (SIndex % NCount) / TCount
+ *   LPart = LBase + LIndex
+ *   VPart = VBase + VIndex
+ *
+ * LVT (Canonical)
+ *   LVIndex = (SIndex / TCount) * TCount
+ *   TIndex = (SIndex % TCount)
+ *   LVPart = SBase + LVIndex
+ *   TPart = TBase + TIndex
+ *
+ * LVT (Full)
+ *   LIndex = SIndex / NCount
+ *   VIndex = (SIndex % NCount) / TCount
+ *   TIndex = (SIndex % TCount)
+ *   LPart = LBase + LIndex
+ *   VPart = VBase + VIndex
+ *   if (TIndex == 0) {
+ *          d = <LPart, VPart>
+ *   } else {
+ *          TPart = TBase + TIndex
+ *          d = <LPart, VPart, TPart>
+ *   }
+ *
+ */
+
+static void
+hangul_decompose(void)
+{
+	unsigned int sb = 0xAC00;
+	unsigned int lb = 0x1100;
+	unsigned int vb = 0x1161;
+	unsigned int tb = 0x11a7;
+	/* unsigned int lc = 19; */
+	unsigned int vc = 21;
+	unsigned int tc = 28;
+	unsigned int nc = (vc * tc);
+	/* unsigned int sc = (lc * nc); */
+	unsigned int unichar;
+	unsigned int mapping[4];
+	unsigned int *um;
+	int count;
+	int i;
+
+	if (verbose > 0)
+		printf("Decomposing hangul\n");
+	/* Hangul */
+	count = 0;
+	for (unichar = 0xAC00; unichar <= 0xD7A3; unichar++) {
+		unsigned int si = unichar - sb;
+		unsigned int li = si / nc;
+		unsigned int vi = (si % nc) / tc;
+		unsigned int ti = si % tc;
+
+		i = 0;
+		mapping[i++] = lb + li;
+		mapping[i++] = vb + vi;
+		if (ti)
+			mapping[i++] = tb + ti;
+		mapping[i++] = 0;
+
+		assert(!unicode_data[unichar].utf32nfkdi);
+		um = malloc(i * sizeof(unsigned int));
+		memcpy(um, mapping, i * sizeof(unsigned int));
+		unicode_data[unichar].utf32nfkdi = um;
+
+		assert(!unicode_data[unichar].utf32nfkdicf);
+		um = malloc(i * sizeof(unsigned int));
+		memcpy(um, mapping, i * sizeof(unsigned int));
+		unicode_data[unichar].utf32nfkdicf = um;
+
+		if (verbose > 1)
+			print_utf32nfkdi(unichar);
+
+		count++;
+	}
+	if (verbose > 0)
+		printf("Created %d entries\n", count);
+}
+
+static void
+nfkdi_decompose(void) +{ + unsigned int unichar; + unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */ + unsigned int *um; + unsigned int *dc; + int count; + int i; + int j; + int ret; + + if (verbose > 0) + printf("Decomposing nfkdi\n"); + + count = 0; + for (unichar = 0; unichar != 0x110000; unichar++) { + if (!unicode_data[unichar].utf32nfkdi) + continue; + for (;;) { + ret = 1; + i = 0; + um = unicode_data[unichar].utf32nfkdi; + while (*um) { + dc = unicode_data[*um].utf32nfkdi; + if (dc) { + for (j = 0; dc[j]; j++) + mapping[i++] = dc[j]; + ret = 0; + } else { + mapping[i++] = *um; + } + um++; + } + mapping[i++] = 0; + if (ret) + break; + free(unicode_data[unichar].utf32nfkdi); + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + unicode_data[unichar].utf32nfkdi = um; + } + /* Add this decomposition to nfkdicf if there is no entry. */ + if (!unicode_data[unichar].utf32nfkdicf) { + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + unicode_data[unichar].utf32nfkdicf = um; + } + if (verbose > 1) + print_utf32nfkdi(unichar); + count++; + } + if (verbose > 0) + printf("Processed %d entries\n", count); +} + +static void +nfkdicf_decompose(void) +{ + unsigned int unichar; + unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. 
*/ + unsigned int *um; + unsigned int *dc; + int count; + int i; + int j; + int ret; + + if (verbose > 0) + printf("Decomposing nfkdicf\n"); + count = 0; + for (unichar = 0; unichar != 0x110000; unichar++) { + if (!unicode_data[unichar].utf32nfkdicf) + continue; + for (;;) { + ret = 1; + i = 0; + um = unicode_data[unichar].utf32nfkdicf; + while (*um) { + dc = unicode_data[*um].utf32nfkdicf; + if (dc) { + for (j = 0; dc[j]; j++) + mapping[i++] = dc[j]; + ret = 0; + } else { + mapping[i++] = *um; + } + um++; + } + mapping[i++] = 0; + if (ret) + break; + free(unicode_data[unichar].utf32nfkdicf); + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + unicode_data[unichar].utf32nfkdicf = um; + } + if (verbose > 1) + print_utf32nfkdicf(unichar); + count++; + } + if (verbose > 0) + printf("Processed %d entries\n", count); +} + +/* ------------------------------------------------------------------ */ + +int utf8agemax(struct tree *, const char *); +int utf8nagemax(struct tree *, const char *, size_t); +int utf8agemin(struct tree *, const char *); +int utf8nagemin(struct tree *, const char *, size_t); +ssize_t utf8len(struct tree *, const char *); +ssize_t utf8nlen(struct tree *, const char *, size_t); +struct utf8cursor; +int utf8cursor(struct utf8cursor *, struct tree *, const char *); +int utf8ncursor(struct utf8cursor *, struct tree *, const char *, size_t); +int utf8byte(struct utf8cursor *); + +/* + * Use trie to scan s, touching at most len bytes. + * Returns the leaf if one exists, NULL otherwise. + * + * A non-NULL return guarantees that the UTF-8 sequence starting at s + * is well-formed and corresponds to a known unicode code point. The + * shorthand for this will be "is valid UTF-8 unicode". 
+ */
+static utf8leaf_t *
+utf8nlookup(struct tree *tree, const char *s, size_t len)
+{
+	utf8trie_t *trie;
+	int offlen;
+	int offset;
+	int mask;
+	int node;
+
+	if (!tree)
+		return NULL;
+	if (len == 0)
+		return NULL;
+	/* Only dereference tree after the NULL check above. */
+	trie = utf8data + tree->index;
+	node = 1;
+	while (node) {
+		offlen = (*trie & OFFLEN) >> OFFLEN_SHIFT;
+		if (*trie & NEXTBYTE) {
+			if (--len == 0)
+				return NULL;
+			s++;
+		}
+		mask = 1 << (*trie & BITNUM);
+		if (*s & mask) {
+			/* Right leg */
+			if (offlen) {
+				/* Right node at offset of trie */
+				node = (*trie & RIGHTNODE);
+				offset = trie[offlen];
+				while (--offlen) {
+					offset <<= 8;
+					offset |= trie[offlen];
+				}
+				trie += offset;
+			} else if (*trie & RIGHTPATH) {
+				/* Right node after this node */
+				node = (*trie & TRIENODE);
+				trie++;
+			} else {
+				/* No right node. */
+				node = 0;
+				trie = NULL;
+			}
+		} else {
+			/* Left leg */
+			if (offlen) {
+				/* Left node after this node. */
+				node = (*trie & LEFTNODE);
+				trie += offlen + 1;
+			} else if (*trie & RIGHTPATH) {
+				/* No left node. */
+				node = 0;
+				trie = NULL;
+			} else {
+				/* Left node after this node */
+				node = (*trie & TRIENODE);
+				trie++;
+			}
+		}
+	}
+	return trie;
+}
+
+/*
+ * Use trie to scan s.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * Forwards to utf8nlookup().
+ */
+static utf8leaf_t *
+utf8lookup(struct tree *tree, const char *s)
+{
+	return utf8nlookup(tree, s, (size_t)-1);
+}
+
+/*
+ * Return the number of bytes used by the current UTF-8 sequence.
+ * Assumes the input points to the first byte of a valid UTF-8
+ * sequence.
+ */
+static inline int
+utf8clen(const char *s)
+{
+	unsigned char c = *s;
+	return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0);
+}
+
+/*
+ * Maximum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if only non-assigned code points are used.
+ */
+int
+utf8agemax(struct tree *tree, const char *s)
+{
+	utf8leaf_t *leaf;
+	int age = 0;
+	int leaf_age;
+
+	if (!tree)
+		return -1;
+	while (*s) {
+		if (!(leaf = utf8lookup(tree, s)))
+			return -1;
+		leaf_age = ages[LEAF_GEN(leaf)];
+		if (leaf_age <= tree->maxage && leaf_age > age)
+			age = leaf_age;
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Minimum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if non-assigned code points are used.
+ */
+int
+utf8agemin(struct tree *tree, const char *s)
+{
+	utf8leaf_t *leaf;
+	int age;
+	int leaf_age;
+
+	if (!tree)
+		return -1;
+	/* Only dereference tree after the NULL check above. */
+	age = tree->maxage;
+	while (*s) {
+		if (!(leaf = utf8lookup(tree, s)))
+			return -1;
+		leaf_age = ages[LEAF_GEN(leaf)];
+		if (leaf_age <= tree->maxage && leaf_age < age)
+			age = leaf_age;
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Maximum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemax(struct tree *tree, const char *s, size_t len)
+{
+	utf8leaf_t *leaf;
+	int age = 0;
+	int leaf_age;
+
+	if (!tree)
+		return -1;
+	while (len && *s) {
+		if (!(leaf = utf8nlookup(tree, s, len)))
+			return -1;
+		leaf_age = ages[LEAF_GEN(leaf)];
+		if (leaf_age <= tree->maxage && leaf_age > age)
+			age = leaf_age;
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Minimum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemin(struct tree *tree, const char *s, size_t len)
+{
+	utf8leaf_t *leaf;
+	int leaf_age;
+	int age;
+
+	if (!tree)
+		return -1;
+	/* Only dereference tree after the NULL check above. */
+	age = tree->maxage;
+	while (len && *s) {
+		if (!(leaf = utf8nlookup(tree, s, len)))
+			return -1;
+		leaf_age = ages[LEAF_GEN(leaf)];
+		if (leaf_age <= tree->maxage && leaf_age < age)
+			age = leaf_age;
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Length of the normalization of s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * + * A string of Default_Ignorable_Code_Point has length 0. + */ +ssize_t +utf8len(struct tree *tree, const char *s) +{ + utf8leaf_t *leaf; + size_t ret = 0; + + if (!tree) + return -1; + while (*s) { + if (!(leaf = utf8lookup(tree, s))) + return -1; + if (ages[LEAF_GEN(leaf)] > tree->maxage) + ret += utf8clen(s); + else if (LEAF_CCC(leaf) == DECOMPOSE) + ret += strlen(LEAF_STR(leaf)); + else + ret += utf8clen(s); + s += utf8clen(s); + } + return ret; +} + +/* + * Length of the normalization of s, touch at most len bytes. + * Return -1 if s is not valid UTF-8 unicode. + */ +ssize_t +utf8nlen(struct tree *tree, const char *s, size_t len) +{ + utf8leaf_t *leaf; + size_t ret = 0; + + if (!tree) + return -1; + while (len && *s) { + if (!(leaf = utf8nlookup(tree, s, len))) + return -1; + if (ages[LEAF_GEN(leaf)] > tree->maxage) + ret += utf8clen(s); + else if (LEAF_CCC(leaf) == DECOMPOSE) + ret += strlen(LEAF_STR(leaf)); + else + ret += utf8clen(s); + len -= utf8clen(s); + s += utf8clen(s); + } + return ret; +} + +/* + * Cursor structure used by the normalizer. + */ +struct utf8cursor { + struct tree *tree; + const char *s; + const char *p; + const char *ss; + const char *sp; + unsigned int len; + unsigned int slen; + short int ccc; + short int nccc; + unsigned int unichar; +}; + +/* + * Set up an utf8cursor for use by utf8byte(). + * + * s : string. + * len : length of s. + * u8c : pointer to cursor. + * trie : utf8trie_t to use for normalization. + * + * Returns -1 on error, 0 on success. + */ +int +utf8ncursor( + struct utf8cursor *u8c, + struct tree *tree, + const char *s, + size_t len) +{ + if (!tree) + return -1; + if (!s) + return -1; + u8c->tree = tree; + u8c->s = s; + u8c->p = NULL; + u8c->ss = NULL; + u8c->sp = NULL; + u8c->len = len; + u8c->slen = 0; + u8c->ccc = STOPPER; + u8c->nccc = STOPPER; + u8c->unichar = 0; + /* Check we didn't clobber the maximum length. 
*/
+	if (u8c->len != len)
+		return -1;
+	/* The first byte of s may not be an utf8 continuation. */
+	if (len > 0 && (*s & 0xC0) == 0x80)
+		return -1;
+	return 0;
+}
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ *   s      : NUL-terminated string.
+ *   u8c    : pointer to cursor.
+ *   tree   : struct tree to use for normalization.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int
+utf8cursor(
+	struct utf8cursor *u8c,
+	struct tree *tree,
+	const char *s)
+{
+	return utf8ncursor(u8c, tree, s, (unsigned int)-1);
+}
+
+/*
+ * Get one byte from the normalized form of the string described by u8c.
+ *
+ * Returns the byte cast to an unsigned char on success, and -1 on failure.
+ *
+ * The cursor keeps track of the location in the string in u8c->s.
+ * When a character is decomposed, the current location is stored in
+ * u8c->p, and u8c->s is set to the start of the decomposition. Note
+ * that bytes from a decomposition do not count against u8c->len.
+ *
+ * Characters are emitted if they match the current CCC in u8c->ccc.
+ * Hitting end-of-string while u8c->ccc == STOPPER means we're done,
+ * and the function returns 0 in that case.
+ *
+ * Sorting by CCC is done by repeatedly scanning the string. The
+ * values of u8c->s and u8c->p are stored in u8c->ss and u8c->sp at
+ * the start of the scan. The first pass finds the lowest CCC to be
+ * emitted and stores it in u8c->nccc, the second pass emits the
+ * characters with this CCC and finds the next lowest CCC. This limits
+ * the number of passes to 1 + the number of different CCCs in the
+ * sequence being scanned.
+ *
+ * Therefore:
+ *   u8c->p  != NULL -> a decomposition is being scanned.
+ *   u8c->ss != NULL -> this is a repeating scan.
+ *   u8c->ccc == -1  -> this is the first scan of a repeating scan.
+ */
+int
+utf8byte(struct utf8cursor *u8c)
+{
+	utf8leaf_t *leaf;
+	int ccc;
+
+	for (;;) {
+		/* Check for the end of a decomposed character. 
*/ + if (u8c->p && *u8c->s == '\0') { + u8c->s = u8c->p; + u8c->p = NULL; + } + + /* Check for end-of-string. */ + if (!u8c->p && (u8c->len == 0 || *u8c->s == '\0')) { + /* There is no next byte. */ + if (u8c->ccc == STOPPER) + return 0; + /* End-of-string during a scan counts as a stopper. */ + ccc = STOPPER; + goto ccc_mismatch; + } else if ((*u8c->s & 0xC0) == 0x80) { + /* This is a continuation of the current character. */ + if (!u8c->p) + u8c->len--; + return (unsigned char)*u8c->s++; + } + + /* Look up the data for the current character. */ + if (u8c->p) + leaf = utf8lookup(u8c->tree, u8c->s); + else + leaf = utf8nlookup(u8c->tree, u8c->s, u8c->len); + + /* No leaf found implies that the input is a binary blob. */ + if (!leaf) + return -1; + + /* Characters that are too new have CCC 0. */ + if (ages[LEAF_GEN(leaf)] > u8c->tree->maxage) { + ccc = STOPPER; + } else if ((ccc = LEAF_CCC(leaf)) == DECOMPOSE) { + u8c->len -= utf8clen(u8c->s); + u8c->p = u8c->s + utf8clen(u8c->s); + u8c->s = LEAF_STR(leaf); + /* Empty decomposition implies CCC 0. */ + if (*u8c->s == '\0') { + if (u8c->ccc == STOPPER) + continue; + ccc = STOPPER; + goto ccc_mismatch; + } + leaf = utf8lookup(u8c->tree, u8c->s); + ccc = LEAF_CCC(leaf); + } + u8c->unichar = utf8code(u8c->s); + + /* + * If this is not a stopper, then see if it updates + * the next canonical class to be emitted. + */ + if (ccc != STOPPER && u8c->ccc < ccc && ccc < u8c->nccc) + u8c->nccc = ccc; + + /* + * Return the current byte if this is the current + * combining class. + */ + if (ccc == u8c->ccc) { + if (!u8c->p) + u8c->len--; + return (unsigned char)*u8c->s++; + } + + /* Current combining class mismatch. */ + ccc_mismatch: + if (u8c->nccc == STOPPER) { + /* + * Scan forward for the first canonical class + * to be emitted. Save the position from + * which to restart. 
+			 */
+			assert(u8c->ccc == STOPPER);
+			u8c->ccc = MINCCC - 1;
+			u8c->nccc = ccc;
+			u8c->sp = u8c->p;
+			u8c->ss = u8c->s;
+			u8c->slen = u8c->len;
+			if (!u8c->p)
+				u8c->len -= utf8clen(u8c->s);
+			u8c->s += utf8clen(u8c->s);
+		} else if (ccc != STOPPER) {
+			/* Not a stopper, and not the ccc we're emitting. */
+			if (!u8c->p)
+				u8c->len -= utf8clen(u8c->s);
+			u8c->s += utf8clen(u8c->s);
+		} else if (u8c->nccc != MAXCCC + 1) {
+			/* At a stopper, restart for next ccc. */
+			u8c->ccc = u8c->nccc;
+			u8c->nccc = MAXCCC + 1;
+			u8c->s = u8c->ss;
+			u8c->p = u8c->sp;
+			u8c->len = u8c->slen;
+		} else {
+			/* All done, proceed from here. */
+			u8c->ccc = STOPPER;
+			u8c->nccc = STOPPER;
+			u8c->sp = NULL;
+			u8c->ss = NULL;
+			u8c->slen = 0;
+		}
+	}
+}
+
+/* ------------------------------------------------------------------ */
+
+static int
+normalize_line(struct tree *tree)
+{
+	char *s;
+	char *t;
+	int c;
+	struct utf8cursor u8c;
+
+	/* First test: null-terminated string. */
+	s = buf2;
+	t = buf3;
+	if (utf8cursor(&u8c, tree, s))
+		return -1;
+	while ((c = utf8byte(&u8c)) > 0)
+		if (c != (unsigned char)*t++)
+			return -1;
+	if (c < 0)
+		return -1;
+	if (*t != 0)
+		return -1;
+
+	/* Second test: length-limited string. */
+	s = buf2;
+	/* Replace NUL with a value that will cause an error if seen. */
+	s[strlen(s) + 1] = -1;
+	t = buf3;
+	if (utf8cursor(&u8c, tree, s))
+		return -1;
+	while ((c = utf8byte(&u8c)) > 0)
+		if (c != (unsigned char)*t++)
+			return -1;
+	if (c < 0)
+		return -1;
+	if (*t != 0)
+		return -1;
+
+	return 0;
+}
+
+static void
+normalization_test(void)
+{
+	FILE *file;
+	unsigned int unichar;
+	struct unicode_data *data;
+	char *s;
+	char *t;
+	int ret;
+	int ignorables;
+	int tests = 0;
+	int failures = 0;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", test_name);
+	/* Step one, read data from file. */
+	file = fopen(test_name, "r");
+	if (!file)
+		open_fail(test_name, errno);
+
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%[^;];%*[^;];%*[^;];%*[^;];%[^;];",
+				buf0, buf1);
+		if (ret != 2 || *line == '#')
+			continue;
+		s = buf0;
+		t = buf2;
+		while (*s) {
+			unichar = strtoul(s, &s, 16);
+			t += utf8key(unichar, t);
+		}
+		*t = '\0';
+
+		ignorables = 0;
+		s = buf1;
+		t = buf3;
+		while (*s) {
+			unichar = strtoul(s, &s, 16);
+			data = &unicode_data[unichar];
+			if (data->utf8nfkdi && !*data->utf8nfkdi)
+				ignorables = 1;
+			else
+				t += utf8key(unichar, t);
+		}
+		*t = '\0';
+
+		tests++;
+		if (normalize_line(nfkdi_tree) < 0) {
+			printf("\nline %s -> %s", buf0, buf1);
+			if (ignorables)
+				printf(" (ignorables removed)");
+			printf(" failure\n");
+			failures++;
+		}
+	}
+	fclose(file);
+	if (verbose > 0)
+		printf("Ran %d tests with %d failures\n", tests, failures);
+	if (failures)
+		file_fail(test_name);
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+write_file(void)
+{
+	FILE *file;
+	int i;
+	int j;
+	int t;
+	int gen;
+
+	if (verbose > 0)
+		printf("Writing %s\n", utf8_name);
+	file = fopen(utf8_name, "w");
+	if (!file)
+		open_fail(utf8_name, errno);
+
+	fprintf(file, "/* This file is generated code, do not edit. */\n");
+	fprintf(file, "#ifndef __INCLUDED_FROM_UTF8NORM_C__\n");
+	fprintf(file, "#error Only utf8norm.c may include this file.\n");
+	fprintf(file, "#endif\n");
+	fprintf(file, "\n");
+	fprintf(file, "const unsigned int utf8version = %#x;\n",
+		unicode_maxage);
+	fprintf(file, "\n");
+	fprintf(file, "static const unsigned int utf8agetab[] = {\n");
+	for (i = 0; i != ages_count; i++)
+		fprintf(file, "\t%#x%s\n", ages[i],
+			ages[i] == unicode_maxage ? "" : ",");
+	fprintf(file, "};\n");
+	fprintf(file, "\n");
+	fprintf(file, "static const struct utf8data utf8nfkdicfdata[] = {\n");
+	t = 0;
+	for (gen = 0; gen < ages_count; gen++) {
+		fprintf(file, "\t{ %#x, %d }%s\n",
+			ages[gen], trees[t].index,
+			ages[gen] == unicode_maxage ? "" : ",");
+		if (trees[t].maxage == ages[gen])
+			t += 2;
+	}
+	fprintf(file, "};\n");
+	fprintf(file, "\n");
+	fprintf(file, "static const struct utf8data utf8nfkdidata[] = {\n");
+	t = 1;
+	for (gen = 0; gen < ages_count; gen++) {
+		fprintf(file, "\t{ %#x, %d }%s\n",
+			ages[gen], trees[t].index,
+			ages[gen] == unicode_maxage ? "" : ",");
+		if (trees[t].maxage == ages[gen])
+			t += 2;
+	}
+	fprintf(file, "};\n");
+	fprintf(file, "\n");
+	fprintf(file, "static const unsigned char utf8data[%zd] = {\n",
+		utf8data_size);
+	t = 0;
+	for (i = 0; i != utf8data_size; i += 16) {
+		if (i == trees[t].index) {
+			fprintf(file, "\t/* %s_%x */\n",
+				trees[t].type, trees[t].maxage);
+			if (t < trees_count-1)
+				t++;
+		}
+		fprintf(file, "\t");
+		for (j = i; j != i + 16; j++)
+			fprintf(file, "0x%.2x%s", utf8data[j],
+				(j < utf8data_size -1 ? "," : ""));
+		fprintf(file, "\n");
+	}
+	fprintf(file, "};\n");
+	fclose(file);
+}
+
+/* ------------------------------------------------------------------ */
+
+int
+main(int argc, char *argv[])
+{
+	unsigned int unichar;
+	int opt;
+
+	argv0 = argv[0];
+
+	while ((opt = getopt(argc, argv, "a:c:d:f:hn:o:p:t:v")) != -1) {
+		switch (opt) {
+		case 'a':
+			age_name = optarg;
+			break;
+		case 'c':
+			ccc_name = optarg;
+			break;
+		case 'd':
+			data_name = optarg;
+			break;
+		case 'f':
+			fold_name = optarg;
+			break;
+		case 'n':
+			norm_name = optarg;
+			break;
+		case 'o':
+			utf8_name = optarg;
+			break;
+		case 'p':
+			prop_name = optarg;
+			break;
+		case 't':
+			test_name = optarg;
+			break;
+		case 'v':
+			verbose++;
+			break;
+		case 'h':
+			help();
+			exit(0);
+		default:
+			usage();
+		}
+	}
+
+	if (verbose > 1)
+		help();
+	for (unichar = 0; unichar != 0x110000; unichar++)
+		unicode_data[unichar].code = unichar;
+	age_init();
+	ccc_init();
+	nfkdi_init();
+	nfkdicf_init();
+	ignore_init();
+	corrections_init();
+	hangul_decompose();
+	nfkdi_decompose();
+	nfkdicf_decompose();
+	utf8_init();
+	trees_init();
+	trees_populate();
+	trees_reduce();
+	trees_verify();
+	/* Prevent "unused function" warning. */
+	(void)lookup(nfkdi_tree, " ");
+	if (verbose > 2)
+		tree_walk(nfkdi_tree);
+	if (verbose > 2)
+		tree_walk(nfkdicf_tree);
+	normalization_test();
+	write_file();
+
+	return 0;
+}
--
1.7.12.4
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: [PATCH 07a/13] xfsprogs: add trie generator for UTF-8.
2014-09-19 16:06 ` [PATCH 07a/13] xfsprogs: add trie generator for UTF-8 Ben Myers
@ 2014-09-23 18:34 ` Roger Willcocks
2014-09-24 23:11 ` Ben Myers
0 siblings, 1 reply; 84+ messages in thread
From: Roger Willcocks @ 2014-09-23 18:34 UTC (permalink / raw)
To: Ben Myers; +Cc: linux-fsdevel, tinguely, olaf, xfs
On Fri, 2014-09-19 at 11:06 -0500, Ben Myers wrote:
> +#define AGE_NAME "DerivedAge.txt"
> +#define CCC_NAME "DerivedCombiningClass.txt"
> +#define PROP_NAME "DerivedCoreProperties.txt"
> +#define DATA_NAME "UnicodeData.txt"
> +#define FOLD_NAME "CaseFolding.txt"
> +#define NORM_NAME "NormalizationCorrections.txt"
> +#define TEST_NAME "NormalizationTest.txt"
Is there a reason why you're using multiple text-based data files (and
hand-parsing them) when there's an xml formatted flat file available ?
http://www.unicode.org/Public/UCD/latest/ucdxml/
And a 2nd question - why does the trie need to encode "the the unicode
version in which the codepoint was assigned an interpretation" ?
--
Roger Willcocks <roger@filmlight.ltd.uk>
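Editorial sketch, not part of the thread: to give a feel for the hand-parsing discussed above, the fragment below converts one field of a UCD text file (a space-separated list of hexadecimal code points, as in NormalizationTest.txt) into UTF-8. It is modelled loosely on the strtoul() loop in normalization_test(). The names encode_utf8() and hex_field_to_utf8() are hypothetical and are not the patch's utf8key(), whose body is not shown in this part of the thread.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper: encode one code point as UTF-8 into out[].
 * Returns the number of bytes written (1..4), or 0 for values that
 * are not encodable (surrogates, or anything above 0x10FFFF).
 */
static size_t
encode_utf8(unsigned int cp, char *out)
{
	if (cp < 0x80) {
		out[0] = (char)cp;
		return 1;
	}
	if (cp < 0x800) {
		out[0] = (char)(0xC0 | (cp >> 6));
		out[1] = (char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp >= 0xD800 && cp <= 0xDFFF)
		return 0;
	if (cp < 0x10000) {
		out[0] = (char)(0xE0 | (cp >> 12));
		out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (char)(0x80 | (cp & 0x3F));
		return 3;
	}
	if (cp <= 0x10FFFF) {
		out[0] = (char)(0xF0 | (cp >> 18));
		out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
		out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
		out[3] = (char)(0x80 | (cp & 0x3F));
		return 4;
	}
	return 0;
}

/*
 * Hypothetical helper: convert a field such as "0041 0300" into a
 * NUL-terminated UTF-8 string. out needs room for 4 bytes per code
 * point plus the terminator. Returns the byte length, or -1 if the
 * field holds something that is not an encodable code point.
 */
static long
hex_field_to_utf8(const char *field, char *out)
{
	char *t = out;
	char *end;

	while (*field) {
		unsigned long cp = strtoul(field, &end, 16);
		size_t n;

		if (end == field)
			return -1;	/* no hex digits consumed */
		if (cp > 0x10FFFF)
			return -1;
		n = encode_utf8((unsigned int)cp, t);
		if (n == 0)
			return -1;
		t += n;
		field = end;
		while (*field == ' ')
			field++;
	}
	*t = '\0';
	return (long)(t - out);
}
```

Under these assumptions, hex_field_to_utf8("0041 0300", buf) stores the three bytes 0x41 0xCC 0x80 in buf, and a surrogate such as "D800" is rejected.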
* Re: [PATCH 07a/13] xfsprogs: add trie generator for UTF-8.
2014-09-23 18:34 ` Roger Willcocks
@ 2014-09-24 23:11 ` Ben Myers
0 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-24 23:11 UTC (permalink / raw)
To: Roger Willcocks; +Cc: linux-fsdevel, tinguely, olaf, xfs
Hi Roger,
On Tue, Sep 23, 2014 at 07:34:19PM +0100, Roger Willcocks wrote:
> On Fri, 2014-09-19 at 11:06 -0500, Ben Myers wrote:
> > +#define AGE_NAME "DerivedAge.txt"
> > +#define CCC_NAME "DerivedCombiningClass.txt"
> > +#define PROP_NAME "DerivedCoreProperties.txt"
> > +#define DATA_NAME "UnicodeData.txt"
> > +#define FOLD_NAME "CaseFolding.txt"
> > +#define NORM_NAME "NormalizationCorrections.txt"
> > +#define TEST_NAME "NormalizationTest.txt"
>
> Is there a reason why you're using multiple text-based data files (and
> hand-parsing them) when there's an xml formatted flat file available ?
>
> http://www.unicode.org/Public/UCD/latest/ucdxml/
The UCD files being parsed are the authoritative source. Check out
ucdxml.readme.txt.
> And a 2nd question - why does the trie need to encode "the the unicode
> version in which the codepoint was assigned an interpretation" ?
You need to know whether a given code point is assigned in the version
of Unicode you're normalizing for. Unicode 8 is supposed to release
June/July 2015 (see http://www.unicode.org/versions/), but filesystems
you created this year will still need the version 7 normalization.
There is still some plumbing to do to pass the version along with the
string for normalization. I think you bring up a good point, but we'll
need to support multiple versions in the long run.
Regards,
Ben
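Editorial sketch, not part of the thread: the version handling described above can be illustrated with the UNICODE_AGE() packing from utf8norm.h and a lookup shaped like utf8nfkdi() in patch 07b. The struct name table_entry, the function find_table() and the table contents are made up for illustration; the real tables are generated by mkutf8data.

```c
#include <stddef.h>

/* Version packing as defined in utf8norm.h in this series. */
#define UNICODE_MAJ_SHIFT	(16)
#define UNICODE_MIN_SHIFT	(8)
#define UNICODE_AGE(MAJ, MIN, REV)			\
	(((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) |	\
	 ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) |	\
	 ((unsigned int)(REV)))

/* Hypothetical stand-in for struct utf8data. */
struct table_entry {
	unsigned int maxage;	/* Unicode version this entry covers */
	unsigned int offset;	/* start of the trie for that version */
};

/*
 * Illustrative values only. The first entry has maxage 0 so that the
 * downward search in find_table() always terminates.
 */
static const struct table_entry tables[] = {
	{ 0, 0 },
	{ UNICODE_AGE(6, 3, 0), 100 },
	{ UNICODE_AGE(7, 0, 0), 200 },
};

/*
 * Same search shape as utf8nfkdi(): walk down from the newest entry
 * and return NULL unless the requested version has an entry.
 */
static const struct table_entry *
find_table(unsigned int maxage)
{
	int i = sizeof(tables) / sizeof(tables[0]) - 1;

	while (maxage < tables[i].maxage)
		i--;
	if (maxage > tables[i].maxage)
		return NULL;
	return &tables[i];
}
```

With this layout a filesystem created for 7.0.0 keeps asking for UNICODE_AGE(7, 0, 0) and keeps getting the same trie even after newer entries are appended, while a request for a version the tables do not carry, such as 8.0.0 here, yields NULL.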
* [PATCH 07b/13] libxfs: add supporting code for UTF-8.
2014-09-18 20:31 ` [PATCH 00/13] xfsprogs: Unicode/UTF-8 support for XFS Ben Myers
` (13 preceding siblings ...)
2014-09-19 16:06 ` [PATCH 07a/13] xfsprogs: add trie generator for UTF-8 Ben Myers
@ 2014-09-19 16:07 ` Ben Myers
14 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-19 16:07 UTC (permalink / raw)
To: linux-fsdevel; +Cc: tinguely, olaf, xfs
From: Olaf Weber <olaf@sgi.com>
Supporting functions for UTF-8 normalization are in utf8norm.c with the
header utf8norm.h. Two normalization forms are supported: nfkdi and
nfkdicf.
nfkdi:
- Apply unicode normalization form NFKD.
- Remove any Default_Ignorable_Code_Point.
nfkdicf:
- Apply unicode normalization form NFKD.
- Remove any Default_Ignorable_Code_Point.
- Apply a full casefold (C + F).
For the purposes of the code, a string is valid UTF-8 if:
- The values encoded are 0x1..0x10FFFF.
- The surrogate codepoints 0xD800..0xDFFF are not encoded.
- The shortest possible encoding is used for all values.
The supporting functions work on null-terminated strings (utf8 prefix)
and on length-limited strings (utf8n prefix).
Signed-off-by: Olaf Weber <olaf@sgi.com>
---
 include/utf8norm.h | 111 ++++++++++
 libxfs/Makefile | 1 +
 libxfs/utf8norm.c | 628 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 740 insertions(+)
 create mode 100644 include/utf8norm.h
 create mode 100644 libxfs/utf8norm.c
diff --git a/include/utf8norm.h b/include/utf8norm.h
new file mode 100644
index 0000000..6aa3391
--- /dev/null
+++ b/include/utf8norm.h
@@ -0,0 +1,111 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifndef UTF8NORM_H
+#define UTF8NORM_H
+
+/* An opaque type used to determine the normalization in use. */
+typedef const struct utf8data *utf8data_t;
+
+/* Encoding a unicode version number as a single unsigned int. */
+#define UNICODE_MAJ_SHIFT (16)
+#define UNICODE_MIN_SHIFT (8)
+
+#define UNICODE_AGE(MAJ,MIN,REV) \
+	(((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) | \
+	 ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) | \
+	 ((unsigned int)(REV)))
+
+/* Highest unicode version supported by the data tables. */
+extern const unsigned int utf8version;
+
+/*
+ * Look for the correct utf8data_t for a unicode version.
+ * Returns NULL if the version requested is too new.
+ *
+ * Two normalization forms are supported: nfkdi and nfkdicf.
+ *
+ * nfkdi:
+ *  - Apply unicode normalization form NFKD.
+ *  - Remove any Default_Ignorable_Code_Point.
+ *
+ * nfkdicf:
+ *  - Apply unicode normalization form NFKD.
+ *  - Remove any Default_Ignorable_Code_Point.
+ *  - Apply a full casefold (C + F).
+ */
+extern utf8data_t utf8nfkdi(unsigned int);
+extern utf8data_t utf8nfkdicf(unsigned int);
+
+/*
+ * Determine the maximum age of any unicode character in the string.
+ * Returns 0 if only unassigned code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern int utf8agemax(utf8data_t, const char *);
+extern int utf8nagemax(utf8data_t, const char *, size_t);
+
+/*
+ * Determine the minimum age of any unicode character in the string.
+ * Returns 0 if any unassigned code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern int utf8agemin(utf8data_t, const char *);
+extern int utf8nagemin(utf8data_t, const char *, size_t);
+
+/*
+ * Determine the length of the normalized form of the string,
+ * excluding any terminating NUL byte.
+ * Returns 0 if only ignorable code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern ssize_t utf8len(utf8data_t, const char *);
+extern ssize_t utf8nlen(utf8data_t, const char *, size_t);
+
+/*
+ * Cursor structure used by the normalizer.
+ */
+struct utf8cursor {
+	utf8data_t data;
+	const char *s;
+	const char *p;
+	const char *ss;
+	const char *sp;
+	unsigned int len;
+	unsigned int slen;
+	short int ccc;
+	short int nccc;
+};
+
+/*
+ * Initialize a utf8cursor to normalize a string.
+ * Returns 0 on success.
+ * Returns -1 on failure.
+ */
+extern int utf8cursor(struct utf8cursor *, utf8data_t, const char *);
+extern int utf8ncursor(struct utf8cursor *, utf8data_t, const char *, size_t);
+
+/*
+ * Get the next byte in the normalization.
+ * Returns a value > 0 && < 256 on success.
+ * Returns 0 when the end of the normalization is reached.
+ * Returns -1 if the string being normalized is not valid UTF-8.
+ */
+extern int utf8byte(struct utf8cursor *);
+
+#endif /* UTF8NORM_H */
diff --git a/libxfs/Makefile b/libxfs/Makefile
index ae15a5d..a1e85ef 100644
--- a/libxfs/Makefile
+++ b/libxfs/Makefile
@@ -14,6 +14,7 @@ HFILES = xfs.h init.h xfs_dir2_priv.h crc32defs.h crc32table.h
 CFILES = cache.c \
 	crc32.c \
 	init.c kmem.c logitem.c radix-tree.c rdwr.c trans.c util.c \
+	utf8norm.c \
 	xfs_alloc.c \
 	xfs_alloc_btree.c \
 	xfs_attr.c \
diff --git a/libxfs/utf8norm.c b/libxfs/utf8norm.c
new file mode 100644
index 0000000..6232d1a
--- /dev/null
+++ b/libxfs/utf8norm.c
@@ -0,0 +1,628 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "xfs.h"
+#include "xfs_types.h"
+#include <utf8norm.h>
+
+struct utf8data {
+	unsigned int maxage;
+	unsigned int offset;
+};
+
+#define __INCLUDED_FROM_UTF8NORM_C__
+#include <utf8data.h>
+#undef __INCLUDED_FROM_UTF8NORM_C__
+
+/*
+ * UTF-8 valid ranges.
+ *
+ * The UTF-8 encoding spreads the bits of a 32bit word over several
+ * bytes. This table gives the ranges that can be held and how they'd
+ * be represented.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000000 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000000 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * There is an additional requirement on UTF-8, in that only the
+ * shortest representation of a 32bit value is to be used. A decoder
+ * must not decode sequences that do not satisfy this requirement.
+ * Thus the allowed ranges have a lower bound.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000080 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000800 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00010000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00200000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x04000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * Actual unicode characters are limited to the range 0x0 - 0x10FFFF,
+ * 17 planes of 65536 values. This limits the sequences actually seen
+ * even more, to just the following.
+ *
+ * 0 - 0x7F: 0 - 0x7F
+ * 0x80 - 0x7FF: 0xC2 0x80 - 0xDF 0xBF
+ * 0x800 - 0xFFFF: 0xE0 0xA0 0x80 - 0xEF 0xBF 0xBF
+ * 0x10000 - 0x10FFFF: 0xF0 0x90 0x80 0x80 - 0xF4 0x8F 0xBF 0xBF
+ *
+ * Within those ranges the surrogates 0xD800 - 0xDFFF are not allowed.
+ *
+ * Note that the longest sequence seen with valid usage is 4 bytes,
+ * the same as a single UTF-32 character. This makes the UTF-8
+ * representation of Unicode strictly smaller than UTF-32.
+ *
+ * The shortest sequence requirement was introduced by:
+ * Corrigendum #1: UTF-8 Shortest Form
+ * It can be found here:
+ * http://www.unicode.org/versions/corrigendum1.html
+ *
+ */
+
+/*
+ * Return the number of bytes used by the current UTF-8 sequence.
+ * Assumes the input points to the first byte of a valid UTF-8
+ * sequence.
+ */
+static inline int
+utf8clen(const char *s)
+{
+	unsigned char c = *s;
+	return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0);
+}
+
+/*
+ * utf8trie_t
+ *
+ * A compact binary tree, used to decode UTF-8 characters.
+ *
+ * Internal nodes are one byte for the node itself, and up to three
+ * bytes for an offset into the tree. The first byte contains the
+ * following information:
+ *  NEXTBYTE - flag - advance to next byte if set
+ *  BITNUM - 3 bit field - the bit number to be tested
+ *  OFFLEN - 2 bit field - number of bytes in the offset
+ * if offlen == 0 (non-branching node)
+ *  RIGHTPATH - 1 bit field - set if the following node is for the
+ *              right-hand path (tested bit is set)
+ *  TRIENODE - 1 bit field - set if the following node is an internal
+ *             node, otherwise it is a leaf node
+ * if offlen != 0 (branching node)
+ *  LEFTNODE - 1 bit field - set if the left-hand node is internal
+ *  RIGHTNODE - 1 bit field - set if the right-hand node is internal
+ *
+ * Due to the way utf8 works, there cannot be branching nodes with
+ * NEXTBYTE set, and moreover those nodes always have a righthand
+ * descendant.
+ */
+typedef const unsigned char utf8trie_t;
+#define BITNUM 0x07
+#define NEXTBYTE 0x08
+#define OFFLEN 0x30
+#define OFFLEN_SHIFT 4
+#define RIGHTPATH 0x40
+#define TRIENODE 0x80
+#define RIGHTNODE 0x40
+#define LEFTNODE 0x80
+
+/*
+ * utf8leaf_t
+ *
+ * The leaves of the trie are embedded in the trie, and so use the same
+ * underlying datatype: unsigned char.
+ *
+ * leaf[0]: The unicode version, stored as a generation number that is
+ *          an index into utf8agetab[]. With this we can filter code
+ *          points based on the unicode version in which they were
+ *          defined. The CCC of a non-defined code point is 0.
+ * leaf[1]: Canonical Combining Class. During normalization, we need
+ *          to do a stable sort into ascending order of all characters
+ *          with a non-zero CCC that occur between two characters with
+ *          a CCC of 0, or at the beginning or end of a string.
+ *          The unicode standard guarantees that all CCC values are
+ *          between 0 and 254 inclusive, which leaves 255 available as
+ *          a special value.
+ *          Code points with CCC 0 are known as stoppers.
+ * leaf[2]: Decomposition. If leaf[1] == 255, then leaf[2] is the
+ *          start of a NUL-terminated string that is the decomposition
+ *          of the character.
+ *          The CCC of a decomposable character is the same as the CCC
+ *          of the first character of its decomposition.
+ *          Some characters decompose as the empty string: these are
+ *          characters with the Default_Ignorable_Code_Point property.
+ *          These do affect normalization, as they all have CCC 0.
+ *
+ * The decompositions in the trie have been fully expanded.
+ *
+ * Casefolding, if applicable, is also done using decompositions.
+ *
+ * The trie is constructed in such a way that leaves exist for all
+ * UTF-8 sequences that match the criteria from the "UTF-8 valid
+ * ranges" comment above, and only for those sequences. Therefore a
+ * lookup in the trie can be used to validate the UTF-8 input.
+ */
+typedef const unsigned char utf8leaf_t;
+
+#define LEAF_GEN(LEAF) ((LEAF)[0])
+#define LEAF_CCC(LEAF) ((LEAF)[1])
+#define LEAF_STR(LEAF) ((const char*)((LEAF) + 2))
+
+#define MINCCC (0)
+#define MAXCCC (254)
+#define STOPPER (0)
+#define DECOMPOSE (255)
+
+/*
+ * Use trie to scan s, touching at most len bytes.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * A non-NULL return guarantees that the UTF-8 sequence starting at s
+ * is well-formed and corresponds to a known unicode code point. The
+ * shorthand for this will be "is valid UTF-8 unicode".
+ */
+static utf8leaf_t *
+utf8nlookup(utf8data_t data, const char *s, size_t len)
+{
+	utf8trie_t *trie = data ? utf8data + data->offset : NULL;
+	int offlen;
+	int offset;
+	int mask;
+	int node;
+
+	if (!data)
+		return NULL;
+	if (len == 0)
+		return NULL;
+	node = 1;
+	while (node) {
+		offlen = (*trie & OFFLEN) >> OFFLEN_SHIFT;
+		if (*trie & NEXTBYTE) {
+			if (--len == 0)
+				return NULL;
+			s++;
+		}
+		mask = 1 << (*trie & BITNUM);
+		if (*s & mask) {
+			/* Right leg */
+			if (offlen) {
+				/* Right node at offset of trie */
+				node = (*trie & RIGHTNODE);
+				offset = trie[offlen];
+				while (--offlen) {
+					offset <<= 8;
+					offset |= trie[offlen];
+				}
+				trie += offset;
+			} else if (*trie & RIGHTPATH) {
+				/* Right node after this node */
+				node = (*trie & TRIENODE);
+				trie++;
+			} else {
+				/* No right node. */
+				node = 0;
+				trie = NULL;
+			}
+		} else {
+			/* Left leg */
+			if (offlen) {
+				/* Left node after this node. */
+				node = (*trie & LEFTNODE);
+				trie += offlen + 1;
+			} else if (*trie & RIGHTPATH) {
+				/* No left node. */
+				node = 0;
+				trie = NULL;
+			} else {
+				/* Left node after this node */
+				node = (*trie & TRIENODE);
+				trie++;
+			}
+		}
+	}
+	return trie;
+}
+
+/*
+ * Use trie to scan s.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * Forwards to utf8nlookup().
+ */
+static utf8leaf_t *
+utf8lookup(utf8data_t data, const char *s)
+{
+	return utf8nlookup(data, s, (size_t)-1);
+}
+
+/*
+ * Maximum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if only non-assigned code points are used.
+ */
+int
+utf8agemax(utf8data_t data, const char *s)
+{
+	utf8leaf_t *leaf;
+	int age = 0;
+	int leaf_age;
+
+	if (!data)
+		return -1;
+	while (*s) {
+		if (!(leaf = utf8lookup(data, s)))
+			return -1;
+		leaf_age = utf8agetab[LEAF_GEN(leaf)];
+		if (leaf_age <= data->maxage && leaf_age > age)
+			age = leaf_age;
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Minimum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if non-assigned code points are used.
+ */
+int
+utf8agemin(utf8data_t data, const char *s)
+{
+	utf8leaf_t *leaf;
+	int age = data ? data->maxage : 0;
+	int leaf_age;
+
+	if (!data)
+		return -1;
+	while (*s) {
+		if (!(leaf = utf8lookup(data, s)))
+			return -1;
+		leaf_age = utf8agetab[LEAF_GEN(leaf)];
+		if (leaf_age <= data->maxage && leaf_age < age)
+			age = leaf_age;
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Maximum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemax(utf8data_t data, const char *s, size_t len)
+{
+	utf8leaf_t *leaf;
+	int age = 0;
+	int leaf_age;
+
+	if (!data)
+		return -1;
+	while (len && *s) {
+		if (!(leaf = utf8nlookup(data, s, len)))
+			return -1;
+		leaf_age = utf8agetab[LEAF_GEN(leaf)];
+		if (leaf_age <= data->maxage && leaf_age > age)
+			age = leaf_age;
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Minimum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemin(utf8data_t data, const char *s, size_t len)
+{
+	utf8leaf_t *leaf;
+	int leaf_age;
+	int age = data ? data->maxage : 0;
+
+	if (!data)
+		return -1;
+	while (len && *s) {
+		if (!(leaf = utf8nlookup(data, s, len)))
+			return -1;
+		leaf_age = utf8agetab[LEAF_GEN(leaf)];
+		if (leaf_age <= data->maxage && leaf_age < age)
+			age = leaf_age;
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Length of the normalization of s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ *
+ * A string of Default_Ignorable_Code_Point has length 0.
+ */
+ssize_t
+utf8len(utf8data_t data, const char *s)
+{
+	utf8leaf_t *leaf;
+	size_t ret = 0;
+
+	if (!data)
+		return -1;
+	while (*s) {
+		if (!(leaf = utf8lookup(data, s)))
+			return -1;
+		if (utf8agetab[LEAF_GEN(leaf)] > data->maxage)
+			ret += utf8clen(s);
+		else if (LEAF_CCC(leaf) == DECOMPOSE)
+			ret += strlen(LEAF_STR(leaf));
+		else
+			ret += utf8clen(s);
+		s += utf8clen(s);
+	}
+	return ret;
+}
+
+/*
+ * Length of the normalization of s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+ssize_t
+utf8nlen(utf8data_t data, const char *s, size_t len)
+{
+	utf8leaf_t *leaf;
+	size_t ret = 0;
+
+	if (!data)
+		return -1;
+	while (len && *s) {
+		if (!(leaf = utf8nlookup(data, s, len)))
+			return -1;
+		if (utf8agetab[LEAF_GEN(leaf)] > data->maxage)
+			ret += utf8clen(s);
+		else if (LEAF_CCC(leaf) == DECOMPOSE)
+			ret += strlen(LEAF_STR(leaf));
+		else
+			ret += utf8clen(s);
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return ret;
+}
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ *   u8c  : pointer to cursor.
+ *   data : utf8data_t to use for normalization.
+ *   s    : string.
+ *   len  : length of s.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int
+utf8ncursor(
+	struct utf8cursor *u8c,
+	utf8data_t data,
+	const char *s,
+	size_t len)
+{
+	if (!data)
+		return -1;
+	if (!s)
+		return -1;
+	u8c->data = data;
+	u8c->s = s;
+	u8c->p = NULL;
+	u8c->ss = NULL;
+	u8c->sp = NULL;
+	u8c->len = len;
+	u8c->slen = 0;
+	u8c->ccc = STOPPER;
+	u8c->nccc = STOPPER;
+	/* Check we didn't clobber the maximum length. */
+	if (u8c->len != len)
+		return -1;
+	/* The first byte of s may not be an utf8 continuation. */
+	if (len > 0 && (*s & 0xC0) == 0x80)
+		return -1;
+	return 0;
+}
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ *   u8c  : pointer to cursor.
+ *   data : utf8data_t to use for normalization.
+ *   s    : NUL-terminated string.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int
+utf8cursor(
+	struct utf8cursor *u8c,
+	utf8data_t data,
+	const char *s)
+{
+	return utf8ncursor(u8c, data, s, (unsigned int)-1);
+}
+
+/*
+ * Get one byte from the normalized form of the string described by u8c.
+ *
+ * Returns the byte cast to an unsigned char on success, and -1 on failure.
+ *
+ * The cursor keeps track of the location in the string in u8c->s.
+ * When a character is decomposed, the current location is stored in
+ * u8c->p, and u8c->s is set to the start of the decomposition. Note
+ * that bytes from a decomposition do not count against u8c->len.
+ *
+ * Characters are emitted if they match the current CCC in u8c->ccc.
+ * Hitting end-of-string while u8c->ccc == STOPPER means we're done,
+ * and the function returns 0 in that case.
+ *
+ * Sorting by CCC is done by repeatedly scanning the string. The
+ * values of u8c->s and u8c->p are stored in u8c->ss and u8c->sp at
+ * the start of the scan. The first pass finds the lowest CCC to be
+ * emitted and stores it in u8c->nccc, the second pass emits the
+ * characters with this CCC and finds the next lowest CCC. This limits
+ * the number of passes to 1 + the number of different CCCs in the
+ * sequence being scanned.
+ *
+ * Therefore:
+ *  u8c->p != NULL -> a decomposition is being scanned.
+ *  u8c->ss != NULL -> this is a repeating scan.
+ *  u8c->ccc == -1 -> this is the first scan of a repeating scan.
+ */
+int
+utf8byte(struct utf8cursor *u8c)
+{
+	utf8leaf_t *leaf;
+	int ccc;
+
+	for (;;) {
+		/* Check for the end of a decomposed character. */
+		if (u8c->p && *u8c->s == '\0') {
+			u8c->s = u8c->p;
+			u8c->p = NULL;
+		}
+
+		/* Check for end-of-string. */
+		if (!u8c->p && (u8c->len == 0 || *u8c->s == '\0')) {
+			/* There is no next byte. */
+			if (u8c->ccc == STOPPER)
+				return 0;
+			/* End-of-string during a scan counts as a stopper. */
+			ccc = STOPPER;
+			goto ccc_mismatch;
+		} else if ((*u8c->s & 0xC0) == 0x80) {
+			/* This is a continuation of the current character. */
+			if (!u8c->p)
+				u8c->len--;
+			return (unsigned char)*u8c->s++;
+		}
+
+		/* Look up the data for the current character. */
+		if (u8c->p)
+			leaf = utf8lookup(u8c->data, u8c->s);
+		else
+			leaf = utf8nlookup(u8c->data, u8c->s, u8c->len);
+
+		/* No leaf found implies that the input is a binary blob. */
+		if (!leaf)
+			return -1;
+
+		/* Characters that are too new have CCC 0. */
+		if (utf8agetab[LEAF_GEN(leaf)] > u8c->data->maxage) {
+			ccc = STOPPER;
+		} else if ((ccc = LEAF_CCC(leaf)) == DECOMPOSE) {
+			u8c->len -= utf8clen(u8c->s);
+			u8c->p = u8c->s + utf8clen(u8c->s);
+			u8c->s = LEAF_STR(leaf);
+			/* Empty decomposition implies CCC 0. */
+			if (*u8c->s == '\0') {
+				if (u8c->ccc == STOPPER)
+					continue;
+				ccc = STOPPER;
+				goto ccc_mismatch;
+			}
+			leaf = utf8lookup(u8c->data, u8c->s);
+			ccc = LEAF_CCC(leaf);
+		}
+
+		/*
+		 * If this is not a stopper, then see if it updates
+		 * the next canonical class to be emitted.
+		 */
+		if (ccc != STOPPER && u8c->ccc < ccc && ccc < u8c->nccc)
+			u8c->nccc = ccc;
+
+		/*
+		 * Return the current byte if this is the current
+		 * combining class.
+		 */
+		if (ccc == u8c->ccc) {
+			if (!u8c->p)
+				u8c->len--;
+			return (unsigned char)*u8c->s++;
+		}
+
+		/* Current combining class mismatch. */
+	ccc_mismatch:
+		if (u8c->nccc == STOPPER) {
+			/*
+			 * Scan forward for the first canonical class
+			 * to be emitted. Save the position from
+			 * which to restart.
+			 */
+			u8c->ccc = MINCCC - 1;
+			u8c->nccc = ccc;
+			u8c->sp = u8c->p;
+			u8c->ss = u8c->s;
+			u8c->slen = u8c->len;
+			if (!u8c->p)
+				u8c->len -= utf8clen(u8c->s);
+			u8c->s += utf8clen(u8c->s);
+		} else if (ccc != STOPPER) {
+			/* Not a stopper, and not the ccc we're emitting. */
+			if (!u8c->p)
+				u8c->len -= utf8clen(u8c->s);
+			u8c->s += utf8clen(u8c->s);
+		} else if (u8c->nccc != MAXCCC + 1) {
+			/* At a stopper, restart for next ccc. */
+			u8c->ccc = u8c->nccc;
+			u8c->nccc = MAXCCC + 1;
+			u8c->s = u8c->ss;
+			u8c->p = u8c->sp;
+			u8c->len = u8c->slen;
+		} else {
+			/* All done, proceed from here. */
+			u8c->ccc = STOPPER;
+			u8c->nccc = STOPPER;
+			u8c->sp = NULL;
+			u8c->ss = NULL;
+			u8c->slen = 0;
+		}
+	}
+}
+
+const struct utf8data *
+utf8nfkdi(unsigned int maxage)
+{
+	int i = sizeof(utf8nfkdidata)/sizeof(utf8nfkdidata[0]) - 1;
+
+	while (maxage < utf8nfkdidata[i].maxage)
+		i--;
+	if (maxage > utf8nfkdidata[i].maxage)
+		return NULL;
+	return &utf8nfkdidata[i];
+}
+
+const struct utf8data *
+utf8nfkdicf(unsigned int maxage)
+{
+	int i = sizeof(utf8nfkdicfdata)/sizeof(utf8nfkdicfdata[0]) - 1;
+
+	while (maxage < utf8nfkdicfdata[i].maxage)
+		i--;
+	if (maxage > utf8nfkdicfdata[i].maxage)
+		return NULL;
+	return &utf8nfkdicfdata[i];
+}
--
1.7.12.4
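Editorial sketch, not part of the patch: the repeated-scan ordering described in the utf8byte() comment can be tried out in isolation. The fragment below is a simplified model under stated assumptions: it works on whole characters tagged with a CCC instead of UTF-8 bytes, it does no decomposition, and it locates the end of each run up front rather than discovering the stopper during the scan. The names struct item and canonical_order() are hypothetical.

```c
#include <stddef.h>

#define MAXCCC	254

struct item {
	char ch;	/* stand-in for a whole character */
	int ccc;	/* canonical combining class, 0 is a stopper */
};

/*
 * Write the characters of in[0..n) to out[] in canonical order and
 * return the number written. No buffer is sorted: each run of
 * non-stoppers is scanned once to find its lowest CCC, then once
 * more per distinct CCC value to emit it.
 */
static size_t
canonical_order(const struct item *in, size_t n, char *out)
{
	size_t start = 0;
	size_t w = 0;

	while (start < n) {
		size_t end, i;
		int cur, next;

		if (in[start].ccc == 0) {
			/* Stoppers are emitted where they stand. */
			out[w++] = in[start++].ch;
			continue;
		}

		/* Find the end of this run of non-stoppers. */
		end = start;
		while (end < n && in[end].ccc != 0)
			end++;

		/* First pass: find the lowest CCC in the run. */
		next = MAXCCC + 1;
		for (i = start; i < end; i++)
			if (in[i].ccc < next)
				next = in[i].ccc;

		/* Later passes: emit one CCC, find the next one up. */
		while (next != MAXCCC + 1) {
			cur = next;
			next = MAXCCC + 1;
			for (i = start; i < end; i++) {
				if (in[i].ccc == cur)
					out[w++] = in[i].ch;
				else if (in[i].ccc > cur && in[i].ccc < next)
					next = in[i].ccc;
			}
		}
		start = end;
	}
	return w;
}
```

Because each pass walks the run from left to right, characters with equal CCC keep their original relative order, which is the stable sort the leaf[1] comment asks for. The trade-off is extra passes over the input in exchange for not needing any scratch memory.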
* Re: [RFC v2] Unicode/UTF-8 support for XFS
2014-09-18 19:56 [RFC v2] Unicode/UTF-8 support for XFS Ben Myers
` (11 preceding siblings ...)
2014-09-18 20:31 ` [PATCH 00/13] xfsprogs: Unicode/UTF-8 support for XFS Ben Myers
@ 2014-09-18 21:10 ` Ben Myers
2014-09-18 21:24 ` Zach Brown
2014-09-19 16:03 ` [PATCH 07a/10] xfs: add trie generator for UTF-8 Ben Myers
` (3 subsequent siblings)
16 siblings, 1 reply; 84+ messages in thread
From: Ben Myers @ 2014-09-18 21:10 UTC (permalink / raw)
To: linux-fsdevel; +Cc: tinguely, olaf, xfs
Hi,
On Thu, Sep 18, 2014 at 02:56:50PM -0500, Ben Myers wrote:
> I'm posting this RFC for Unicode support in XFS on Olaf's behalf, as he
> is busy with other projects.
Patch 7 of each series is the trie generator and utf8 normalization
module. Either it's just a little slow or it's over the message size
limit on vger at a little over 100k. In case it's the latter I've
placed them here:
ftp://oss.sgi.com/projects/xfs/tmp/18-Sep-14/0007-xfs-add-trie-generator-and-supporting-code-for-UTF-8.patch
and
ftp://oss.sgi.com/projects/xfs/tmp/18-Sep-14/0007-libxfs-add-trie-generator-and-supporting-code-for-UT.patch
I'll give it a while and then see about splitting up the patch.
Thanks,
Ben
* Re: [RFC v2] Unicode/UTF-8 support for XFS
2014-09-18 21:10 ` [RFC v2] Unicode/UTF-8 support for XFS Ben Myers
@ 2014-09-18 21:24 ` Zach Brown
0 siblings, 0 replies; 84+ messages in thread
From: Zach Brown @ 2014-09-18 21:24 UTC (permalink / raw)
To: Ben Myers; +Cc: linux-fsdevel, xfs, olaf, tinguely
On Thu, Sep 18, 2014 at 04:10:10PM -0500, Ben Myers wrote:
> Hi,
>
> On Thu, Sep 18, 2014 at 02:56:50PM -0500, Ben Myers wrote:
> > I'm posting this RFC for Unicode support in XFS on Olaf's behalf, as he
> > is busy with other projects.
>
> Patch 7 of each series is the trie generator and utf8 normalization
> module. Either it's just a little slow or it's over the message size
> limit on vger at a little over 100k.
Probably, yeah:
http://vger.kernel.org/majordomo-info.html
"
* Message size exceeding 100 000 characters causes blocking.
"
- z
* Re: [RFC v2] Unicode/UTF-8 support for XFS
2014-09-18 21:24 ` Zach Brown
(?)
@ 2014-09-18 22:23 ` Ben Myers
-1 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-18 22:23 UTC (permalink / raw)
To: Zach Brown; +Cc: linux-fsdevel, tinguely, olaf, xfs
On Thu, Sep 18, 2014 at 02:24:17PM -0700, Zach Brown wrote:
> On Thu, Sep 18, 2014 at 04:10:10PM -0500, Ben Myers wrote:
> > On Thu, Sep 18, 2014 at 02:56:50PM -0500, Ben Myers wrote:
> > > I'm posting this RFC for Unicode support in XFS on Olaf's behalf, as he
> > > is busy with other projects.
> >
> > Patch 7 of each series is the trie generator and utf8 normalization
> > module. Either it's just a little slow or it's over the message size
> > limit on vger at a little over 100k.
>
> Probably, yeah:
>
> http://vger.kernel.org/majordomo-info.html
>
> "
> * Message size exceeding 100 000 characters causes blocking.
> "
D'oh! I'll split it up.
Thanks,
Ben
* [PATCH 07a/10] xfs: add trie generator for UTF-8.
2014-09-18 19:56 [RFC v2] Unicode/UTF-8 support for XFS Ben Myers
` (12 preceding siblings ...)
2014-09-18 21:10 ` [RFC v2] Unicode/UTF-8 support for XFS Ben Myers
@ 2014-09-19 16:03 ` Ben Myers
2014-09-19 16:04 ` [PATCH 07b/10] xfs: add supporting code " Ben Myers
` (2 subsequent siblings)
16 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-19 16:03 UTC (permalink / raw)
To: linux-fsdevel; +Cc: tinguely, olaf, xfs
From: Olaf Weber <olaf@sgi.com>
mkutf8data.c is the source for a program that generates utf8data.h, which
contains the trie that utf8norm.c uses. The trie is generated from the
Unicode 7.0.0 data files. The format of the utf8data[] table is described
in utf8norm.c, which is added in the next patch.
Signed-off-by: Olaf Weber <olaf@sgi.com>
---
[v2: the trie is now separated into utf8norm.ko;
utf8version is now a function and exported;
introduced CONFIG_XFS_UTF8;
removed supporting code due to vger size constraint.
--bpm]
---
 fs/xfs/Kconfig               |    8 +
 fs/xfs/Makefile              |    2 +-
 fs/xfs/utf8norm/Makefile     |   33 +
 fs/xfs/utf8norm/mkutf8data.c | 3239 ++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 3281 insertions(+), 1 deletion(-)
 create mode 100644 fs/xfs/utf8norm/Makefile
 create mode 100644 fs/xfs/utf8norm/mkutf8data.c
diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index 5d47b4d..a847857 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -95,3 +95,11 @@ config XFS_DEBUG
 	  not useful unless you are debugging a particular problem.
 
 	  Say N unless you are an XFS developer, or you play one on TV.
+
+config XFS_UTF8
+	bool "XFS UTF-8 support"
+	depends on XFS_FS
+	help
+	  Say Y here to enable utf8 normalization support in XFS. You
+	  will be able to mount and use filesystems created with the
+	  utf8 mkfs.xfs option.
diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index d617999..6d000d3 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -21,7 +21,7 @@ ccflags-y += -I$(src)/libxfs
 
 ccflags-$(CONFIG_XFS_DEBUG) += -g
 
-obj-$(CONFIG_XFS_FS) += xfs.o
+obj-$(CONFIG_XFS_FS) += xfs.o utf8norm/
 
 # this one should be compiled first, as the tracing macros can easily blow up
 xfs-y += xfs_trace.o
diff --git a/fs/xfs/utf8norm/Makefile b/fs/xfs/utf8norm/Makefile
new file mode 100644
index 0000000..9b2efa9
--- /dev/null
+++ b/fs/xfs/utf8norm/Makefile
@@ -0,0 +1,33 @@
+#
+# Copyright (c) 2014 SGI.
+# All rights reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write the Free Software Foundation, +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +# + +hostprogs-y := mkutf8data +$(obj)/utf8norm.o: $(obj)/utf8data.h +$(obj)/utf8data.h: $(src)/ucd/*.txt +$(obj)/utf8data.h: $(obj)/mkutf8data FORCE + $(call if_changed,mkutf8data) +quiet_cmd_mkutf8data = MKUTF8DATA $@ + cmd_mkutf8data = $(obj)/mkutf8data \ + -a $(src)/ucd/DerivedAge-7.0.0.txt \ + -c $(src)/ucd/DerivedCombiningClass-7.0.0.txt \ + -p $(src)/ucd/DerivedCoreProperties-7.0.0.txt \ + -d $(src)/ucd/UnicodeData-7.0.0.txt \ + -f $(src)/ucd/CaseFolding-7.0.0.txt \ + -n $(src)/ucd/NormalizationCorrections-7.0.0.txt \ + -t $(src)/ucd/NormalizationTest-7.0.0.txt \ + -o $@ diff --git a/fs/xfs/utf8norm/mkutf8data.c b/fs/xfs/utf8norm/mkutf8data.c new file mode 100644 index 0000000..1d6ec02 --- /dev/null +++ b/fs/xfs/utf8norm/mkutf8data.c @@ -0,0 +1,3239 @@ +/* + * Copyright (c) 2014 SGI. + * All rights reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write the Free Software Foundation, + * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + */ + +/* Generator for a compact trie for unicode normalization */ + +#include <sys/types.h> +#include <stddef.h> +#include <stdlib.h> +#include <stdio.h> +#include <assert.h> +#include <string.h> +#include <unistd.h> +#include <errno.h> + +/* Default names of the in- and output files. 
*/ + +#define AGE_NAME "DerivedAge.txt" +#define CCC_NAME "DerivedCombiningClass.txt" +#define PROP_NAME "DerivedCoreProperties.txt" +#define DATA_NAME "UnicodeData.txt" +#define FOLD_NAME "CaseFolding.txt" +#define NORM_NAME "NormalizationCorrections.txt" +#define TEST_NAME "NormalizationTest.txt" +#define UTF8_NAME "utf8data.h" + +const char *age_name = AGE_NAME; +const char *ccc_name = CCC_NAME; +const char *prop_name = PROP_NAME; +const char *data_name = DATA_NAME; +const char *fold_name = FOLD_NAME; +const char *norm_name = NORM_NAME; +const char *test_name = TEST_NAME; +const char *utf8_name = UTF8_NAME; + +int verbose = 0; + +/* An arbitrary line size limit on input lines. */ + +#define LINESIZE 1024 +char line[LINESIZE]; +char buf0[LINESIZE]; +char buf1[LINESIZE]; +char buf2[LINESIZE]; +char buf3[LINESIZE]; + +const char *argv0; + +/* ------------------------------------------------------------------ */ + +/* + * Unicode version numbers consist of three parts: major, minor, and a + * revision. These numbers are packed into an unsigned int to obtain + * a single version number. + * + * To save space in the generated trie, the unicode version is not + * stored directly, instead we calculate a generation number from the + * unicode versions seen in the DerivedAge file, and use that as an + * index into a table of unicode versions. 
+ */ +#define UNICODE_MAJ_SHIFT (16) +#define UNICODE_MIN_SHIFT (8) + +#define UNICODE_MAJ_MAX ((unsigned short)-1) +#define UNICODE_MIN_MAX ((unsigned char)-1) +#define UNICODE_REV_MAX ((unsigned char)-1) + +#define UNICODE_AGE(MAJ,MIN,REV) \ + (((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) | \ + ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) | \ + ((unsigned int)(REV))) + +unsigned int *ages; +int ages_count; + +unsigned int unicode_maxage; + +static int +age_valid(unsigned int major, unsigned int minor, unsigned int revision) +{ + if (major > UNICODE_MAJ_MAX) + return 0; + if (minor > UNICODE_MIN_MAX) + return 0; + if (revision > UNICODE_REV_MAX) + return 0; + return 1; +} + +/* ------------------------------------------------------------------ */ + +/* + * utf8trie_t + * + * A compact binary tree, used to decode UTF-8 characters. + * + * Internal nodes are one byte for the node itself, and up to three + * bytes for an offset into the tree. The first byte contains the + * following information: + * NEXTBYTE - flag - advance to next byte if set + * BITNUM - 3 bit field - the bit number to be tested + * OFFLEN - 2 bit field - number of bytes in the offset + * if offlen == 0 (non-branching node) + * RIGHTPATH - 1 bit field - set if the following node is for the + * right-hand path (tested bit is set) + * TRIENODE - 1 bit field - set if the following node is an internal + * node, otherwise it is a leaf node + * if offlen != 0 (branching node) + * LEFTNODE - 1 bit field - set if the left-hand node is internal + * RIGHTNODE - 1 bit field - set if the right-hand node is internal + * + * Due to the way utf8 works, there cannot be branching nodes with + * NEXTBYTE set, and moreover those nodes always have a righthand + * descendant.
+ */ +typedef unsigned char utf8trie_t; +#define BITNUM 0x07 +#define NEXTBYTE 0x08 +#define OFFLEN 0x30 +#define OFFLEN_SHIFT 4 +#define RIGHTPATH 0x40 +#define TRIENODE 0x80 +#define RIGHTNODE 0x40 +#define LEFTNODE 0x80 + +/* + * utf8leaf_t + * + * The leaves of the trie are embedded in the trie, and so the same + * underlying datatype, unsigned char. + * + * leaf[0]: The unicode version, stored as a generation number that is + * an index into utf8agetab[]. With this we can filter code + * points based on the unicode version in which they were + * defined. The CCC of a non-defined code point is 0. + * leaf[1]: Canonical Combining Class. During normalization, we need + * to do a stable sort into ascending order of all characters + * with a non-zero CCC that occur between two characters with + * a CCC of 0, or at the begin or end of a string. + * The unicode standard guarantees that all CCC values are + * between 0 and 254 inclusive, which leaves 255 available as + * a special value. + * Code points with CCC 0 are known as stoppers. + * leaf[2]: Decomposition. If leaf[1] == 255, then leaf[2] is the + * start of a NUL-terminated string that is the decomposition + * of the character. + * The CCC of a decomposable character is the same as the CCC + * of the first character of its decomposition. + * Some characters decompose as the empty string: these are + * characters with the Default_Ignorable_Code_Point property. + * These do affect normalization, as they all have CCC 0. + * + * The decompositions in the trie have been fully expanded. + * + * Casefolding, if applicable, is also done using decompositions. 
+ */ +typedef unsigned char utf8leaf_t; + +#define LEAF_GEN(LEAF) ((LEAF)[0]) +#define LEAF_CCC(LEAF) ((LEAF)[1]) +#define LEAF_STR(LEAF) ((const char*)((LEAF) + 2)) + +#define MAXGEN (255) + +#define MINCCC (0) +#define MAXCCC (254) +#define STOPPER (0) +#define DECOMPOSE (255) + +struct tree; +static utf8leaf_t *utf8nlookup(struct tree *, const char *, size_t); +static utf8leaf_t *utf8lookup(struct tree *, const char *); + +unsigned char *utf8data; +size_t utf8data_size; + +utf8trie_t *nfkdi; +utf8trie_t *nfkdicf; + +/* ------------------------------------------------------------------ */ + +/* + * UTF8 valid ranges. + * + * The UTF-8 encoding spreads the bits of a 32bit word over several + * bytes. This table gives the ranges that can be held and how they'd + * be represented. + * + * 0x00000000 0x0000007F: 0xxxxxxx + * 0x00000000 0x000007FF: 110xxxxx 10xxxxxx + * 0x00000000 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx + * 0x00000000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx + * 0x00000000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + * 0x00000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + * + * There is an additional requirement on UTF-8, in that only the + * shortest representation of a 32bit value is to be used. A decoder + * must not decode sequences that do not satisfy this requirement. + * Thus the allowed ranges have a lower bound. + * + * 0x00000000 0x0000007F: 0xxxxxxx + * 0x00000080 0x000007FF: 110xxxxx 10xxxxxx + * 0x00000800 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx + * 0x00010000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx + * 0x00200000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + * 0x04000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + * + * Actual unicode characters are limited to the range 0x0 - 0x10FFFF, + * 17 planes of 65536 values. This limits the sequences actually seen + * even more, to just the following. 
+ * + * 0 - 0x7f: 0 0x7f + * 0x80 - 0x7ff: 0xc2 0x80 0xdf 0xbf + * 0x800 - 0xffff: 0xe0 0xa0 0x80 0xef 0xbf 0xbf + * 0x10000 - 0x10ffff: 0xf0 0x90 0x80 0x80 0xf4 0x8f 0xbf 0xbf + * + * Even within those ranges not all values are allowed: the surrogates + * 0xd800 - 0xdfff should never be seen. + * + * Note that the longest sequence seen with valid usage is 4 bytes, + * the same as a single UTF-32 character. This makes the UTF-8 + * representation of Unicode no larger than UTF-32. + * + * The shortest sequence requirement was introduced by: + * Corrigendum #1: UTF-8 Shortest Form + * It can be found here: + * http://www.unicode.org/versions/corrigendum1.html + * + */ + +#define UTF8_2_BITS 0xC0 +#define UTF8_3_BITS 0xE0 +#define UTF8_4_BITS 0xF0 +#define UTF8_N_BITS 0x80 +#define UTF8_2_MASK 0xE0 +#define UTF8_3_MASK 0xF0 +#define UTF8_4_MASK 0xF8 +#define UTF8_N_MASK 0xC0 +#define UTF8_V_MASK 0x3F +#define UTF8_V_SHIFT 6 + +static int +utf8key(unsigned int key, char keyval[]) +{ + int keylen; + + if (key < 0x80) { + keyval[0] = key; + keylen = 1; + } else if (key < 0x800) { + keyval[1] = key & UTF8_V_MASK; + keyval[1] |= UTF8_N_BITS; + key >>= UTF8_V_SHIFT; + keyval[0] = key; + keyval[0] |= UTF8_2_BITS; + keylen = 2; + } else if (key < 0x10000) { + keyval[2] = key & UTF8_V_MASK; + keyval[2] |= UTF8_N_BITS; + key >>= UTF8_V_SHIFT; + keyval[1] = key & UTF8_V_MASK; + keyval[1] |= UTF8_N_BITS; + key >>= UTF8_V_SHIFT; + keyval[0] = key; + keyval[0] |= UTF8_3_BITS; + keylen = 3; + } else if (key < 0x110000) { + keyval[3] = key & UTF8_V_MASK; + keyval[3] |= UTF8_N_BITS; + key >>= UTF8_V_SHIFT; + keyval[2] = key & UTF8_V_MASK; + keyval[2] |= UTF8_N_BITS; + key >>= UTF8_V_SHIFT; + keyval[1] = key & UTF8_V_MASK; + keyval[1] |= UTF8_N_BITS; + key >>= UTF8_V_SHIFT; + keyval[0] = key; + keyval[0] |= UTF8_4_BITS; + keylen = 4; + } else { + printf("%#x: illegal key\n", key); + keylen = 0; + } + return keylen; +} + +static unsigned int +utf8code(const char *str) +{ + const
unsigned char *s = (const unsigned char*)str; + unsigned int unichar = 0; + + if (*s < 0x80) { + unichar = *s; + } else if (*s < UTF8_3_BITS) { + unichar = *s++ & 0x1F; + unichar <<= UTF8_V_SHIFT; + unichar |= *s & 0x3F; + } else if (*s < UTF8_4_BITS) { + unichar = *s++ & 0x0F; + unichar <<= UTF8_V_SHIFT; + unichar |= *s++ & 0x3F; + unichar <<= UTF8_V_SHIFT; + unichar |= *s & 0x3F; + } else { + unichar = *s++ & 0x0F; + unichar <<= UTF8_V_SHIFT; + unichar |= *s++ & 0x3F; + unichar <<= UTF8_V_SHIFT; + unichar |= *s++ & 0x3F; + unichar <<= UTF8_V_SHIFT; + unichar |= *s & 0x3F; + } + return unichar; +} + +static int +utf32valid(unsigned int unichar) +{ + return unichar < 0x110000; +} + +#define NODE 1 +#define LEAF 0 + +struct tree { + void *root; + int childnode; + const char *type; + unsigned int maxage; + struct tree *next; + int (*leaf_equal)(void *, void *); + void (*leaf_print)(void *, int); + int (*leaf_mark)(void *); + int (*leaf_size)(void *); + int *(*leaf_index)(struct tree *, void *); + unsigned char *(*leaf_emit)(void *, unsigned char *); + int leafindex[0x110000]; + int index; +}; + +struct node { + int index; + int offset; + int mark; + int size; + struct node *parent; + void *left; + void *right; + unsigned char bitnum; + unsigned char nextbyte; + unsigned char leftnode; + unsigned char rightnode; + unsigned int keybits; + unsigned int keymask; +}; + +/* + * Example lookup function for a tree. 
+ */ +static void * +lookup(struct tree *tree, const char *key) +{ + struct node *node; + void *leaf = NULL; + + node = tree->root; + while (!leaf && node) { + if (node->nextbyte) + key++; + if (*key & (1 << (node->bitnum & 7))) { + /* Right leg */ + if (node->rightnode == NODE) { + node = node->right; + } else if (node->rightnode == LEAF) { + leaf = node->right; + } else { + node = NULL; + } + } else { + /* Left leg */ + if (node->leftnode == NODE) { + node = node->left; + } else if (node->leftnode == LEAF) { + leaf = node->left; + } else { + node = NULL; + } + } + } + + return leaf; +} + +/* + * A simple non-recursive tree walker: keep track of visits to the + * left and right branches in the leftmask and rightmask. + */ +static void +tree_walk(struct tree *tree) +{ + struct node *node; + unsigned int leftmask; + unsigned int rightmask; + unsigned int bitmask; + int indent = 1; + int nodes, singletons, leaves; + + nodes = singletons = leaves = 0; + + printf("%s_%x root %p\n", tree->type, tree->maxage, tree->root); + if (tree->childnode == LEAF) { + assert(tree->root); + tree->leaf_print(tree->root, indent); + leaves = 1; + } else { + assert(tree->childnode == NODE); + node = tree->root; + leftmask = rightmask = 0; + while (node) { + printf("%*snode @ %p bitnum %d nextbyte %d" + " left %p right %p mask %x bits %x\n", + indent, "", node, + node->bitnum, node->nextbyte, + node->left, node->right, + node->keymask, node->keybits); + nodes += 1; + if (!(node->left && node->right)) + singletons += 1; + + while (node) { + bitmask = 1 << node->bitnum; + if ((leftmask & bitmask) == 0) { + leftmask |= bitmask; + if (node->leftnode == LEAF) { + assert(node->left); + tree->leaf_print(node->left, + indent+1); + leaves += 1; + } else if (node->left) { + assert(node->leftnode == NODE); + indent += 1; + node = node->left; + break; + } + } + if ((rightmask & bitmask) == 0) { + rightmask |= bitmask; + if (node->rightnode == LEAF) { + assert(node->right); + 
tree->leaf_print(node->right, + indent+1); + leaves += 1; + } else if (node->right) { + assert(node->rightnode==NODE); + indent += 1; + node = node->right; + break; + } + } + leftmask &= ~bitmask; + rightmask &= ~bitmask; + node = node->parent; + indent -= 1; + } + } + } + printf("nodes %d leaves %d singletons %d\n", + nodes, leaves, singletons); +} + +/* + * Allocate and initialize a new internal node. + */ +static struct node * +alloc_node(struct node *parent) +{ + struct node *node; + int bitnum; + + node = malloc(sizeof(*node)); + node->left = node->right = NULL; + node->parent = parent; + node->leftnode = NODE; + node->rightnode = NODE; + node->keybits = 0; + node->keymask = 0; + node->mark = 0; + node->index = 0; + node->offset = -1; + node->size = 4; + + if (node->parent) { + bitnum = parent->bitnum; + if ((bitnum & 7) == 0) { + node->bitnum = bitnum + 7 + 8; + node->nextbyte = 1; + } else { + node->bitnum = bitnum - 1; + node->nextbyte = 0; + } + } else { + node->bitnum = 7; + node->nextbyte = 0; + } + + return node; +} + +/* + * Insert a new leaf into the tree, and collapse any subtrees that are + * fully populated and end in identical leaves. A nextbyte tagged + * internal node will not be removed to preserve the tree's integrity. + * Note that due to the structure of utf8, no nextbyte tagged node + * will be a candidate for removal. + */ +static int +insert(struct tree *tree, char *key, int keylen, void *leaf) +{ + struct node *node; + struct node *parent; + void **cursor; + int keybits; + + assert(keylen >= 1 && keylen <= 4); + + node = NULL; + cursor = &tree->root; + keybits = 8 * keylen; + + /* Insert, creating path along the way. */ + while (keybits) { + if (!*cursor) + *cursor = alloc_node(node); + node = *cursor; + if (node->nextbyte) + key++; + if (*key & (1 << (node->bitnum & 7))) + cursor = &node->right; + else + cursor = &node->left; + keybits--; + } + *cursor = leaf; + + /* Merge subtrees if possible.
*/ + while (node) { + if (*key & (1 << (node->bitnum & 7))) + node->rightnode = LEAF; + else + node->leftnode = LEAF; + if (node->nextbyte) + break; + if (node->leftnode == NODE || node->rightnode == NODE) + break; + assert(node->left); + assert(node->right); + /* Compare */ + if (! tree->leaf_equal(node->left, node->right)) + break; + /* Keep left, drop right leaf. */ + leaf = node->left; + /* Check in parent */ + parent = node->parent; + if (!parent) { + /* root of tree! */ + tree->root = leaf; + tree->childnode = LEAF; + } else if (parent->left == node) { + parent->left = leaf; + parent->leftnode = LEAF; + if (parent->right) { + parent->keymask = 0; + parent->keybits = 0; + } else { + parent->keymask |= (1 << node->bitnum); + } + } else if (parent->right == node) { + parent->right = leaf; + parent->rightnode = LEAF; + if (parent->left) { + parent->keymask = 0; + parent->keybits = 0; + } else { + parent->keymask |= (1 << node->bitnum); + parent->keybits |= (1 << node->bitnum); + } + } else { + /* internal tree error */ + assert(0); + } + free(node); + node = parent; + } + + /* Propagate keymasks up along singleton chains. */ + while (node) { + parent = node->parent; + if (!parent) + break; + /* Nix the mask for parents with two children. */ + if (node->keymask == 0) { + parent->keymask = 0; + parent->keybits = 0; + } else if (parent->left && parent->right) { + parent->keymask = 0; + parent->keybits = 0; + } else { + assert((parent->keymask & node->keymask) == 0); + parent->keymask |= node->keymask; + parent->keymask |= (1 << parent->bitnum); + parent->keybits |= node->keybits; + if (parent->right) + parent->keybits |= (1 << parent->bitnum); + } + node = parent; + } + + return 0; +} + +/* + * Prune internal nodes. + * + * Fully populated subtrees that end at the same leaf have already + * been collapsed. 
There are still internal nodes that have for both + * their left and right branches a sequence of singletons that make + * identical choices and end in identical leaves. The keymask and + * keybits collected in the nodes describe the choices made in these + * singleton chains. When they are identical for the left and right + * branch of a node, and the two leaves compare identical, the node in + * question can be removed. + * + * Note that nodes with the nextbyte tag set will not be removed by + * this to ensure tree integrity. Note as well that the structure of + * utf8 ensures that these nodes would not have been candidates for + * removal in any case. + */ +static void +prune(struct tree *tree) +{ + struct node *node; + struct node *left; + struct node *right; + struct node *parent; + void *leftleaf; + void *rightleaf; + unsigned int leftmask; + unsigned int rightmask; + unsigned int bitmask; + int count; + + if (verbose > 0) + printf("Pruning %s_%x\n", tree->type, tree->maxage); + + count = 0; + if (tree->childnode == LEAF) + return; + if (!tree->root) + return; + + leftmask = rightmask = 0; + node = tree->root; + while (node) { + if (node->nextbyte) + goto advance; + if (node->leftnode == LEAF) + goto advance; + if (node->rightnode == LEAF) + goto advance; + if (!node->left) + goto advance; + if (!node->right) + goto advance; + left = node->left; + right = node->right; + if (left->keymask == 0) + goto advance; + if (right->keymask == 0) + goto advance; + if (left->keymask != right->keymask) + goto advance; + if (left->keybits != right->keybits) + goto advance; + leftleaf = NULL; + while (!leftleaf) { + assert(left->left || left->right); + if (left->leftnode == LEAF) + leftleaf = left->left; + else if (left->rightnode == LEAF) + leftleaf = left->right; + else if (left->left) + left = left->left; + else if (left->right) + left = left->right; + else + assert(0); + } + rightleaf = NULL; + while (!rightleaf) { + assert(right->left || right->right); + if
(right->leftnode == LEAF) + rightleaf = right->left; + else if (right->rightnode == LEAF) + rightleaf = right->right; + else if (right->left) + right = right->left; + else if (right->right) + right = right->right; + else + assert(0); + } + if (! tree->leaf_equal(leftleaf, rightleaf)) + goto advance; + /* + * This node has identical singleton-only subtrees. + * Remove it. + */ + parent = node->parent; + left = node->left; + right = node->right; + if (parent->left == node) + parent->left = left; + else if (parent->right == node) + parent->right = left; + else + assert(0); + left->parent = parent; + left->keymask |= (1 << node->bitnum); + node->left = NULL; + while (node) { + bitmask = 1 << node->bitnum; + leftmask &= ~bitmask; + rightmask &= ~bitmask; + if (node->leftnode == NODE && node->left) { + left = node->left; + free(node); + count++; + node = left; + } else if (node->rightnode == NODE && node->right) { + right = node->right; + free(node); + count++; + node = right; + } else { + node = NULL; + } + } + /* Propagate keymasks up along singleton chains. 
*/ + node = parent; + /* Force re-check */ + bitmask = 1 << node->bitnum; + leftmask &= ~bitmask; + rightmask &= ~bitmask; + for (;;) { + if (node->left && node->right) + break; + if (node->left) { + left = node->left; + node->keymask |= left->keymask; + node->keybits |= left->keybits; + } + if (node->right) { + right = node->right; + node->keymask |= right->keymask; + node->keybits |= right->keybits; + } + node->keymask |= (1 << node->bitnum); + node = node->parent; + /* Force re-check */ + bitmask = 1 << node->bitnum; + leftmask &= ~bitmask; + rightmask &= ~bitmask; + } + advance: + bitmask = 1 << node->bitnum; + if ((leftmask & bitmask) == 0 && + node->leftnode == NODE && + node->left) { + leftmask |= bitmask; + node = node->left; + } else if ((rightmask & bitmask) == 0 && + node->rightnode == NODE && + node->right) { + rightmask |= bitmask; + node = node->right; + } else { + leftmask &= ~bitmask; + rightmask &= ~bitmask; + node = node->parent; + } + } + if (verbose > 0) + printf("Pruned %d nodes\n", count); +} + +/* + * Mark the nodes in the tree that lead to leaves that must be + * emitted. 
+ */ +static void +mark_nodes(struct tree *tree) +{ + struct node *node; + struct node *n; + unsigned int leftmask; + unsigned int rightmask; + unsigned int bitmask; + int marked; + + marked = 0; + if (verbose > 0) + printf("Marking %s_%x\n", tree->type, tree->maxage); + if (tree->childnode == LEAF) + goto done; + + assert(tree->childnode == NODE); + node = tree->root; + leftmask = rightmask = 0; + while (node) { + bitmask = 1 << node->bitnum; + if ((leftmask & bitmask) == 0) { + leftmask |= bitmask; + if (node->leftnode == LEAF) { + assert(node->left); + if (tree->leaf_mark(node->left)) { + n = node; + while (n && !n->mark) { + marked++; + n->mark = 1; + n = n->parent; + } + } + } else if (node->left) { + assert(node->leftnode == NODE); + node = node->left; + continue; + } + } + if ((rightmask & bitmask) == 0) { + rightmask |= bitmask; + if (node->rightnode == LEAF) { + assert(node->right); + if (tree->leaf_mark(node->right)) { + n = node; + while (n && !n->mark) { + marked++; + n->mark = 1; + n = n->parent; + } + } + } else if (node->right) { + assert(node->rightnode==NODE); + node = node->right; + continue; + } + } + leftmask &= ~bitmask; + rightmask &= ~bitmask; + node = node->parent; + } + + /* second pass: left siblings and singletons */ + + assert(tree->childnode == NODE); + node = tree->root; + leftmask = rightmask = 0; + while (node) { + bitmask = 1 << node->bitnum; + if ((leftmask & bitmask) == 0) { + leftmask |= bitmask; + if (node->leftnode == LEAF) { + assert(node->left); + if (tree->leaf_mark(node->left)) { + n = node; + while (n && !n->mark) { + marked++; + n->mark = 1; + n = n->parent; + } + } + } else if (node->left) { + assert(node->leftnode == NODE); + node = node->left; + if (!node->mark && node->parent->mark) { + marked++; + node->mark = 1; + } + continue; + } + } + if ((rightmask & bitmask) == 0) { + rightmask |= bitmask; + if (node->rightnode == LEAF) { + assert(node->right); + if (tree->leaf_mark(node->right)) { + n = node; + while (n && 
!n->mark) { + marked++; + n->mark = 1; + n = n->parent; + } + } + } else if (node->right) { + assert(node->rightnode==NODE); + node = node->right; + if (!node->mark && node->parent->mark && + !node->parent->left) { + marked++; + node->mark = 1; + } + continue; + } + } + leftmask &= ~bitmask; + rightmask &= ~bitmask; + node = node->parent; + } +done: + if (verbose > 0) + printf("Marked %d nodes\n", marked); +} + +/* + * Compute the index of each node and leaf, which is the offset in the + * emitted trie. These values must be pre-computed because relative + * offsets between nodes are used to navigate the tree. + */ +static int +index_nodes(struct tree *tree, int index) +{ + struct node *node; + unsigned int leftmask; + unsigned int rightmask; + unsigned int bitmask; + int count; + int indent; + + /* Align to a cache line (or half a cache line?). */ + while (index % 64) + index++; + tree->index = index; + indent = 1; + count = 0; + + if (verbose > 0) + printf("Indexing %s_%x: %d", tree->type, tree->maxage, index); + if (tree->childnode == LEAF) { + index += tree->leaf_size(tree->root); + goto done; + } + + assert(tree->childnode == NODE); + node = tree->root; + leftmask = rightmask = 0; + while (node) { + if (!node->mark) + goto skip; + count++; + if (node->index != index) + node->index = index; + index += node->size; +skip: + while (node) { + bitmask = 1 << node->bitnum; + if (node->mark && (leftmask & bitmask) == 0) { + leftmask |= bitmask; + if (node->leftnode == LEAF) { + assert(node->left); + *tree->leaf_index(tree, node->left) = + index; + index += tree->leaf_size(node->left); + count++; + } else if (node->left) { + assert(node->leftnode == NODE); + indent += 1; + node = node->left; + break; + } + } + if (node->mark && (rightmask & bitmask) == 0) { + rightmask |= bitmask; + if (node->rightnode == LEAF) { + assert(node->right); + *tree->leaf_index(tree, node->right) = index; + index += tree->leaf_size(node->right); + count++; + } else if (node->right) { +
assert(node->rightnode==NODE); + indent += 1; + node = node->right; + break; + } + } + leftmask &= ~bitmask; + rightmask &= ~bitmask; + node = node->parent; + indent -= 1; + } + } +done: + /* Round up to a multiple of 16 */ + while (index % 16) + index++; + if (verbose > 0) + printf("Final index %d\n", index); + return index; +} + +/* + * Compute the size of nodes and leaves. We start by assuming that + * each node needs to store a three-byte offset. The indexes of the + * nodes are calculated based on that, and then this function is + * called to see if the sizes of some nodes can be reduced. This is + * repeated until no more changes are seen. + */ +static int +size_nodes(struct tree *tree) +{ + struct tree *next; + struct node *node; + struct node *right; + struct node *n; + unsigned int leftmask; + unsigned int rightmask; + unsigned int bitmask; + unsigned int pathbits; + unsigned int pathmask; + int changed; + int offset; + int size; + int indent; + + indent = 1; + changed = 0; + size = 0; + + if (verbose > 0) + printf("Sizing %s_%x", tree->type, tree->maxage); + if (tree->childnode == LEAF) + goto done; + + assert(tree->childnode == NODE); + pathbits = 0; + pathmask = 0; + node = tree->root; + leftmask = rightmask = 0; + while (node) { + if (!node->mark) + goto skip; + offset = 0; + if (!node->left || !node->right) { + size = 1; + } else { + if (node->rightnode == NODE) { + right = node->right; + next = tree->next; + while (!right->mark) { + assert(next); + n = next->root; + while (n->bitnum != node->bitnum) { + if (pathbits & (1<<n->bitnum)) + n = n->right; + else + n = n->left; + } + n = n->right; + assert(right->bitnum == n->bitnum); + right = n; + next = next->next; + } + offset = right->index - node->index; + } else { + offset = *tree->leaf_index(tree, node->right); + offset -= node->index; + } + assert(offset >= 0); + assert(offset <= 0xffffff); + if (offset <= 0xff) { + size = 2; + } else if (offset <= 0xffff) { + size = 3; + } else { /* offset <= 
0xffffff */ + size = 4; + } + } + if (node->size != size || node->offset != offset) { + node->size = size; + node->offset = offset; + changed++; + } +skip: + while (node) { + bitmask = 1 << node->bitnum; + pathmask |= bitmask; + if (node->mark && (leftmask & bitmask) == 0) { + leftmask |= bitmask; + if (node->leftnode == LEAF) { + assert(node->left); + } else if (node->left) { + assert(node->leftnode == NODE); + indent += 1; + node = node->left; + break; + } + } + if (node->mark && (rightmask & bitmask) == 0) { + rightmask |= bitmask; + pathbits |= bitmask; + if (node->rightnode == LEAF) { + assert(node->right); + } else if (node->right) { + assert(node->rightnode==NODE); + indent += 1; + node = node->right; + break; + } + } + leftmask &= ~bitmask; + rightmask &= ~bitmask; + pathmask &= ~bitmask; + pathbits &= ~bitmask; + node = node->parent; + indent -= 1; + } + } +done: + if (verbose > 0) + printf("Found %d changes\n", changed); + return changed; +} + +/* + * Emit a trie for the given tree into the data array. 
+ */ +static void +emit(struct tree *tree, unsigned char *data) +{ + struct node *node; + unsigned int leftmask; + unsigned int rightmask; + unsigned int bitmask; + int offlen; + int offset; + int index; + int indent; + unsigned char byte; + + index = tree->index; + data += index; + indent = 1; + if (verbose > 0) + printf("Emitting %s_%x\n", tree->type, tree->maxage); + if (tree->childnode == LEAF) { + assert(tree->root); + tree->leaf_emit(tree->root, data); + return; + } + + assert(tree->childnode == NODE); + node = tree->root; + leftmask = rightmask = 0; + while (node) { + if (!node->mark) + goto skip; + assert(node->offset != -1); + assert(node->index == index); + + byte = 0; + if (node->nextbyte) + byte |= NEXTBYTE; + byte |= (node->bitnum & BITNUM); + if (node->left && node->right) { + if (node->leftnode == NODE) + byte |= LEFTNODE; + if (node->rightnode == NODE) + byte |= RIGHTNODE; + if (node->offset <= 0xff) + offlen = 1; + else if (node->offset <= 0xffff) + offlen = 2; + else + offlen = 3; + offset = node->offset; + byte |= offlen << OFFLEN_SHIFT; + *data++ = byte; + index++; + while (offlen--) { + *data++ = offset & 0xff; + index++; + offset >>= 8; + } + } else if (node->left) { + if (node->leftnode == NODE) + byte |= TRIENODE; + *data++ = byte; + index++; + } else if (node->right) { + byte |= RIGHTNODE; + if (node->rightnode == NODE) + byte |= TRIENODE; + *data++ = byte; + index++; + } else { + assert(0); + } +skip: + while (node) { + bitmask = 1 << node->bitnum; + if (node->mark && (leftmask & bitmask) == 0) { + leftmask |= bitmask; + if (node->leftnode == LEAF) { + assert(node->left); + data = tree->leaf_emit(node->left, + data); + index += tree->leaf_size(node->left); + } else if (node->left) { + assert(node->leftnode == NODE); + indent += 1; + node = node->left; + break; + } + } + if (node->mark && (rightmask & bitmask) == 0) { + rightmask |= bitmask; + if (node->rightnode == LEAF) { + assert(node->right); + data = tree->leaf_emit(node->right, + 
data); + index += tree->leaf_size(node->right); + } else if (node->right) { + assert(node->rightnode==NODE); + indent += 1; + node = node->right; + break; + } + } + leftmask &= ~bitmask; + rightmask &= ~bitmask; + node = node->parent; + indent -= 1; + } + } +} + +/* ------------------------------------------------------------------ */ + +/* + * Unicode data. + * + * We need to keep track of the Canonical Combining Class, the Age, + * and decompositions for a code point. + * + * For the Age, we store the index into the ages table. Effectively + * this is a generation number that the table maps to a unicode + * version. + * + * The correction field is used to indicate that this entry is in the + * corrections array, which contains decompositions that were + * corrected in later revisions. The value of the correction field is + * the Unicode version in which the mapping was corrected. + */ +struct unicode_data { + unsigned int code; + int ccc; + int gen; + int correction; + unsigned int *utf32nfkdi; + unsigned int *utf32nfkdicf; + char *utf8nfkdi; + char *utf8nfkdicf; +}; + +struct unicode_data unicode_data[0x110000]; +struct unicode_data *corrections; +int corrections_count; + +struct tree *nfkdi_tree; +struct tree *nfkdicf_tree; + +struct tree *trees; +int trees_count; + +/* + * Check the corrections array to see if this entry was corrected at + * some point. 
+ */ +static struct unicode_data * +corrections_lookup(struct unicode_data *u) +{ + int i; + + for (i = 0; i != corrections_count; i++) + if (u->code == corrections[i].code) + return &corrections[i]; + return u; +} + +static int +nfkdi_equal(void *l, void *r) +{ + struct unicode_data *left = l; + struct unicode_data *right = r; + + if (left->gen != right->gen) + return 0; + if (left->ccc != right->ccc) + return 0; + if (left->utf8nfkdi && right->utf8nfkdi && + strcmp(left->utf8nfkdi, right->utf8nfkdi) == 0) + return 1; + if (left->utf8nfkdi || right->utf8nfkdi) + return 0; + return 1; +} + +static int +nfkdicf_equal(void *l, void *r) +{ + struct unicode_data *left = l; + struct unicode_data *right = r; + + if (left->gen != right->gen) + return 0; + if (left->ccc != right->ccc) + return 0; + if (left->utf8nfkdicf && right->utf8nfkdicf && + strcmp(left->utf8nfkdicf, right->utf8nfkdicf) == 0) + return 1; + if (left->utf8nfkdicf && right->utf8nfkdicf) + return 0; + if (left->utf8nfkdicf || right->utf8nfkdicf) + return 0; + if (left->utf8nfkdi && right->utf8nfkdi && + strcmp(left->utf8nfkdi, right->utf8nfkdi) == 0) + return 1; + if (left->utf8nfkdi || right->utf8nfkdi) + return 0; + return 1; +} + +static void +nfkdi_print(void *l, int indent) +{ + struct unicode_data *leaf = l; + + printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf, + leaf->code, leaf->ccc, leaf->gen); + if (leaf->utf8nfkdi) + printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi); + printf("\n"); +} + +static void +nfkdicf_print(void *l, int indent) +{ + struct unicode_data *leaf = l; + + printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf, + leaf->code, leaf->ccc, leaf->gen); + if (leaf->utf8nfkdicf) + printf(" nfkdicf \"%s\"", (const char*)leaf->utf8nfkdicf); + else if (leaf->utf8nfkdi) + printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi); + printf("\n"); +} + +static int +nfkdi_mark(void *l) +{ + return 1; +} + +static int +nfkdicf_mark(void *l) +{ + struct unicode_data *leaf 
= l; + + if (leaf->utf8nfkdicf) + return 1; + return 0; +} + +static int +correction_mark(void *l) +{ + struct unicode_data *leaf = l; + + return leaf->correction; +} + +static int +nfkdi_size(void *l) +{ + struct unicode_data *leaf = l; + + int size = 2; + if (leaf->utf8nfkdi) + size += strlen(leaf->utf8nfkdi) + 1; + return size; +} + +static int +nfkdicf_size(void *l) +{ + struct unicode_data *leaf = l; + + int size = 2; + if (leaf->utf8nfkdicf) + size += strlen(leaf->utf8nfkdicf) + 1; + else if (leaf->utf8nfkdi) + size += strlen(leaf->utf8nfkdi) + 1; + return size; +} + +static int * +nfkdi_index(struct tree *tree, void *l) +{ + struct unicode_data *leaf = l; + + return &tree->leafindex[leaf->code]; +} + +static int * +nfkdicf_index(struct tree *tree, void *l) +{ + struct unicode_data *leaf = l; + + return &tree->leafindex[leaf->code]; +} + +static unsigned char * +nfkdi_emit(void *l, unsigned char *data) +{ + struct unicode_data *leaf = l; + unsigned char *s; + + *data++ = leaf->gen; + if (leaf->utf8nfkdi) { + *data++ = DECOMPOSE; + s = (unsigned char*)leaf->utf8nfkdi; + while ((*data++ = *s++) != 0) + ; + } else { + *data++ = leaf->ccc; + } + return data; +} + +static unsigned char * +nfkdicf_emit(void *l, unsigned char *data) +{ + struct unicode_data *leaf = l; + unsigned char *s; + + *data++ = leaf->gen; + if (leaf->utf8nfkdicf) { + *data++ = DECOMPOSE; + s = (unsigned char*)leaf->utf8nfkdicf; + while ((*data++ = *s++) != 0) + ; + } else if (leaf->utf8nfkdi) { + *data++ = DECOMPOSE; + s = (unsigned char*)leaf->utf8nfkdi; + while ((*data++ = *s++) != 0) + ; + } else { + *data++ = leaf->ccc; + } + return data; +} + +static void +utf8_create(struct unicode_data *data) +{ + char utf[18*4+1]; + char *u; + unsigned int *um; + int i; + + u = utf; + um = data->utf32nfkdi; + if (um) { + for (i = 0; um[i]; i++) + u += utf8key(um[i], u); + *u = '\0'; + data->utf8nfkdi = strdup((char*)utf); + } + u = utf; + um = data->utf32nfkdicf; + if (um) { + for (i = 0; um[i]; i++) 
+			u += utf8key(um[i], u);
+		*u = '\0';
+		if (!data->utf8nfkdi || strcmp(data->utf8nfkdi, (char*)utf))
+			data->utf8nfkdicf = strdup((char*)utf);
+	}
+}
+
+static void
+utf8_init(void)
+{
+	unsigned int unichar;
+	int i;
+
+	for (unichar = 0; unichar != 0x110000; unichar++)
+		utf8_create(&unicode_data[unichar]);
+
+	for (i = 0; i != corrections_count; i++)
+		utf8_create(&corrections[i]);
+}
+
+static void
+trees_init(void)
+{
+	struct unicode_data *data;
+	unsigned int maxage;
+	unsigned int nextage;
+	int count;
+	int i;
+	int j;
+
+	/* Count the number of different ages. */
+	count = 0;
+	nextage = (unsigned int)-1;
+	do {
+		maxage = nextage;
+		nextage = 0;
+		/* corrections[] holds corrections_count entries. */
+		for (i = 0; i != corrections_count; i++) {
+			data = &corrections[i];
+			if (nextage < data->correction &&
+			    data->correction < maxage)
+				nextage = data->correction;
+		}
+		count++;
+	} while (nextage);
+
+	/* Two trees per age: nfkdi and nfkdicf */
+	trees_count = count * 2;
+	trees = calloc(trees_count, sizeof(struct tree));
+
+	/* Assign ages to the trees. */
+	count = trees_count;
+	nextage = (unsigned int)-1;
+	do {
+		maxage = nextage;
+		trees[--count].maxage = maxage;
+		trees[--count].maxage = maxage;
+		nextage = 0;
+		for (i = 0; i != corrections_count; i++) {
+			data = &corrections[i];
+			if (nextage < data->correction &&
+			    data->correction < maxage)
+				nextage = data->correction;
+		}
+	} while (nextage);
+
+	/* The ages assigned above are off by one. */
+	for (i = 0; i != trees_count; i++) {
+		j = 0;
+		while (ages[j] < trees[i].maxage)
+			j++;
+		trees[i].maxage = ages[j-1];
+	}
+
+	/* Set up the forwarding between trees. */
+	trees[trees_count-2].next = &trees[trees_count-1];
+	trees[trees_count-1].leaf_mark = nfkdi_mark;
+	trees[trees_count-2].leaf_mark = nfkdicf_mark;
+	for (i = 0; i != trees_count-2; i += 2) {
+		trees[i].next = &trees[trees_count-2];
+		trees[i].leaf_mark = correction_mark;
+		trees[i+1].next = &trees[trees_count-1];
+		trees[i+1].leaf_mark = correction_mark;
+	}
+
+	/* Assign the callouts.
*/ + for (i = 0; i != trees_count; i += 2) { + trees[i].type = "nfkdicf"; + trees[i].leaf_equal = nfkdicf_equal; + trees[i].leaf_print = nfkdicf_print; + trees[i].leaf_size = nfkdicf_size; + trees[i].leaf_index = nfkdicf_index; + trees[i].leaf_emit = nfkdicf_emit; + + trees[i+1].type = "nfkdi"; + trees[i+1].leaf_equal = nfkdi_equal; + trees[i+1].leaf_print = nfkdi_print; + trees[i+1].leaf_size = nfkdi_size; + trees[i+1].leaf_index = nfkdi_index; + trees[i+1].leaf_emit = nfkdi_emit; + } + + /* Finish init. */ + for (i = 0; i != trees_count; i++) + trees[i].childnode = NODE; +} + +static void +trees_populate(void) +{ + struct unicode_data *data; + unsigned int unichar; + char keyval[4]; + int keylen; + int i; + + for (i = 0; i != trees_count; i++) { + if (verbose > 0) { + printf("Populating %s_%x\n", + trees[i].type, trees[i].maxage); + } + for (unichar = 0; unichar != 0x110000; unichar++) { + if (unicode_data[unichar].gen < 0) + continue; + keylen = utf8key(unichar, keyval); + data = corrections_lookup(&unicode_data[unichar]); + if (data->correction <= trees[i].maxage) + data = &unicode_data[unichar]; + insert(&trees[i], keyval, keylen, data); + } + } +} + +static void +trees_reduce(void) +{ + int i; + int size; + int changed; + + for (i = 0; i != trees_count; i++) + prune(&trees[i]); + for (i = 0; i != trees_count; i++) + mark_nodes(&trees[i]); + do { + size = 0; + for (i = 0; i != trees_count; i++) + size = index_nodes(&trees[i], size); + changed = 0; + for (i = 0; i != trees_count; i++) + changed += size_nodes(&trees[i]); + } while (changed); + + utf8data = calloc(size, 1); + utf8data_size = size; + for (i = 0; i != trees_count; i++) + emit(&trees[i], utf8data); + + if (verbose > 0) { + for (i = 0; i != trees_count; i++) { + printf("%s_%x idx %d\n", + trees[i].type, trees[i].maxage, trees[i].index); + } + } + + nfkdi = utf8data + trees[trees_count-1].index; + nfkdicf = utf8data + trees[trees_count-2].index; + + nfkdi_tree = &trees[trees_count-1]; + nfkdicf_tree = 
&trees[trees_count-2];
+}
+
+static void
+verify(struct tree *tree)
+{
+	struct unicode_data *data;
+	utf8leaf_t *leaf;
+	unsigned int unichar;
+	char key[4];
+	int report;
+	int nocf;
+
+	if (verbose > 0)
+		printf("Verifying %s_%x\n", tree->type, tree->maxage);
+	nocf = strcmp(tree->type, "nfkdicf");
+
+	for (unichar = 0; unichar != 0x110000; unichar++) {
+		report = 0;
+		data = corrections_lookup(&unicode_data[unichar]);
+		if (data->correction <= tree->maxage)
+			data = &unicode_data[unichar];
+		utf8key(unichar, key);
+		leaf = utf8lookup(tree, key);
+		if (!leaf) {
+			if (data->gen != -1)
+				report++;
+			if (unichar < 0xd800 || unichar > 0xdfff)
+				report++;
+		} else {
+			if (unichar >= 0xd800 && unichar <= 0xdfff)
+				report++;
+			if (data->gen == -1)
+				report++;
+			if (data->gen != LEAF_GEN(leaf))
+				report++;
+			if (LEAF_CCC(leaf) == DECOMPOSE) {
+				if (nocf) {
+					if (!data->utf8nfkdi) {
+						report++;
+					} else if (strcmp(data->utf8nfkdi,
+							  LEAF_STR(leaf))) {
+						report++;
+					}
+				} else {
+					if (!data->utf8nfkdicf &&
+					    !data->utf8nfkdi) {
+						report++;
+					} else if (data->utf8nfkdicf) {
+						if (strcmp(data->utf8nfkdicf,
+							   LEAF_STR(leaf)))
+							report++;
+					} else if (strcmp(data->utf8nfkdi,
+							  LEAF_STR(leaf))) {
+						report++;
+					}
+				}
+			} else if (data->ccc != LEAF_CCC(leaf)) {
+				report++;
+			}
+		}
+		if (report) {
+			printf("%X code %X gen %d ccc %d"
+				" nfkdi -> \"%s\"",
+				unichar, data->code, data->gen,
+				data->ccc,
+				data->utf8nfkdi);
+			if (leaf) {
+				printf(" age %d ccc %d"
+					" nfkdi -> \"%s\"\n",
+					LEAF_GEN(leaf),
+					LEAF_CCC(leaf),
+					LEAF_CCC(leaf) == DECOMPOSE ?
+					LEAF_STR(leaf) : "");
+			}
+			printf("\n");
+		}
+	}
+}
+
+static void
+trees_verify(void)
+{
+	int i;
+
+	for (i = 0; i != trees_count; i++)
+		verify(&trees[i]);
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+help(void)
+{
+	printf("Usage: %s [options]\n", argv0);
+	printf("\n");
+	printf("This program creates a data trie used for parsing and\n");
+	printf("normalization of UTF-8 strings. The trie is derived from\n");
+	printf("a set of input files from the Unicode character database\n");
+	printf("found at: http://www.unicode.org/Public/UCD/latest/ucd/\n");
+	printf("\n");
+	printf("The generated tree supports two normalization forms:\n");
+	printf("\n");
+	printf("\tnfkdi:\n");
+	printf("\t- Apply unicode normalization form NFKD.\n");
+	printf("\t- Remove any Default_Ignorable_Code_Point.\n");
+	printf("\n");
+	printf("\tnfkdicf:\n");
+	printf("\t- Apply unicode normalization form NFKD.\n");
+	printf("\t- Remove any Default_Ignorable_Code_Point.\n");
+	printf("\t- Apply a full casefold (C + F).\n");
+	printf("\n");
+	printf("These forms were chosen as being most useful when dealing\n");
+	printf("with file names: NFKD catches most cases where characters\n");
+	printf("should be considered equivalent. 
The ignorables are mostly\n"); + printf("invisible, making names hard to type.\n"); + printf("\n"); + printf("The options to specify the files to be used are listed\n"); + printf("below with their default values, which are the names used\n"); + printf("by version 7.0.0 of the Unicode Character Database.\n"); + printf("\n"); + printf("The input files:\n"); + printf("\t-a %s\n", AGE_NAME); + printf("\t-c %s\n", CCC_NAME); + printf("\t-p %s\n", PROP_NAME); + printf("\t-d %s\n", DATA_NAME); + printf("\t-f %s\n", FOLD_NAME); + printf("\t-n %s\n", NORM_NAME); + printf("\n"); + printf("Additionally, the generated tables are tested using:\n"); + printf("\t-t %s\n", TEST_NAME); + printf("\n"); + printf("Finally, the output file:\n"); + printf("\t-o %s\n", UTF8_NAME); + printf("\n"); +} + +static void +usage(void) +{ + help(); + exit(1); +} + +static void +open_fail(const char *name, int error) +{ + printf("Error %d opening %s: %s\n", error, name, strerror(error)); + exit(1); +} + +static void +file_fail(const char *filename) +{ + printf("Error parsing %s\n", filename); + exit(1); +} + +static void +line_fail(const char *filename, const char *line) +{ + printf("Error parsing %s:%s\n", filename, line); + exit(1); +} + +/* ------------------------------------------------------------------ */ + +static void +print_utf32(unsigned int *utf32str) +{ + int i; + + for (i = 0; utf32str[i]; i++) + printf(" %X", utf32str[i]); +} + +static void +print_utf32nfkdi(unsigned int unichar) +{ + printf(" %X ->", unichar); + print_utf32(unicode_data[unichar].utf32nfkdi); + printf("\n"); +} + +static void +print_utf32nfkdicf(unsigned int unichar) +{ + printf(" %X ->", unichar); + print_utf32(unicode_data[unichar].utf32nfkdicf); + printf("\n"); +} + +/* ------------------------------------------------------------------ */ + +static void +age_init(void) +{ + FILE *file; + unsigned int first; + unsigned int last; + unsigned int unichar; + unsigned int major; + unsigned int minor; + unsigned int 
revision; + int gen; + int count; + int ret; + + if (verbose > 0) + printf("Parsing %s\n", age_name); + + file = fopen(age_name, "r"); + if (!file) + open_fail(age_name, errno); + count = 0; + + gen = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "# Age=V%d_%d_%d", + &major, &minor, &revision); + if (ret == 3) { + ages_count++; + if (verbose > 1) + printf(" Age V%d_%d_%d\n", + major, minor, revision); + if (!age_valid(major, minor, revision)) + line_fail(age_name, line); + continue; + } + ret = sscanf(line, "# Age=V%d_%d", &major, &minor); + if (ret == 2) { + ages_count++; + if (verbose > 1) + printf(" Age V%d_%d\n", major, minor); + if (!age_valid(major, minor, 0)) + line_fail(age_name, line); + continue; + } + } + + /* We must have found something above. */ + if (verbose > 1) + printf("%d age entries\n", ages_count); + if (ages_count == 0 || ages_count > MAXGEN) + file_fail(age_name); + + /* There is a 0 entry. */ + ages_count++; + ages = calloc(ages_count + 1, sizeof(*ages)); + /* And a guard entry. 
*/ + ages[ages_count] = (unsigned int)-1; + + rewind(file); + count = 0; + gen = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "# Age=V%d_%d_%d", + &major, &minor, &revision); + if (ret == 3) { + ages[++gen] = + UNICODE_AGE(major, minor, revision); + if (verbose > 1) + printf(" Age V%d_%d_%d = gen %d\n", + major, minor, revision, gen); + if (!age_valid(major, minor, revision)) + line_fail(age_name, line); + continue; + } + ret = sscanf(line, "# Age=V%d_%d", &major, &minor); + if (ret == 2) { + ages[++gen] = UNICODE_AGE(major, minor, 0); + if (verbose > 1) + printf(" Age V%d_%d = %d\n", + major, minor, gen); + if (!age_valid(major, minor, 0)) + line_fail(age_name, line); + continue; + } + ret = sscanf(line, "%X..%X ; %d.%d #", + &first, &last, &major, &minor); + if (ret == 4) { + for (unichar = first; unichar <= last; unichar++) + unicode_data[unichar].gen = gen; + count += 1 + last - first; + if (verbose > 1) + printf(" %X..%X gen %d\n", first, last, gen); + if (!utf32valid(first) || !utf32valid(last)) + line_fail(age_name, line); + continue; + } + ret = sscanf(line, "%X ; %d.%d #", &unichar, &major, &minor); + if (ret == 3) { + unicode_data[unichar].gen = gen; + count++; + if (verbose > 1) + printf(" %X gen %d\n", unichar, gen); + if (!utf32valid(unichar)) + line_fail(age_name, line); + continue; + } + } + unicode_maxage = ages[gen]; + fclose(file); + + /* Nix surrogate block */ + if (verbose > 1) + printf(" Removing surrogate block D800..DFFF\n"); + for (unichar = 0xd800; unichar <= 0xdfff; unichar++) + unicode_data[unichar].gen = -1; + + if (verbose > 0) + printf("Found %d entries\n", count); + if (count == 0) + file_fail(age_name); +} + +static void +ccc_init(void) +{ + FILE *file; + unsigned int first; + unsigned int last; + unsigned int unichar; + unsigned int value; + int count; + int ret; + + if (verbose > 0) + printf("Parsing %s\n", ccc_name); + + file = fopen(ccc_name, "r"); + if (!file) + open_fail(ccc_name, errno); + + count = 0; + 
while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "%X..%X ; %d #", &first, &last, &value); + if (ret == 3) { + for (unichar = first; unichar <= last; unichar++) { + unicode_data[unichar].ccc = value; + count++; + } + if (verbose > 1) + printf(" %X..%X ccc %d\n", first, last, value); + if (!utf32valid(first) || !utf32valid(last)) + line_fail(ccc_name, line); + continue; + } + ret = sscanf(line, "%X ; %d #", &unichar, &value); + if (ret == 2) { + unicode_data[unichar].ccc = value; + count++; + if (verbose > 1) + printf(" %X ccc %d\n", unichar, value); + if (!utf32valid(unichar)) + line_fail(ccc_name, line); + continue; + } + } + fclose(file); + + if (verbose > 0) + printf("Found %d entries\n", count); + if (count == 0) + file_fail(ccc_name); +} + +static void +nfkdi_init(void) +{ + FILE *file; + unsigned int unichar; + unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */ + char *s; + unsigned int *um; + int count; + int i; + int ret; + + if (verbose > 0) + printf("Parsing %s\n", data_name); + file = fopen(data_name, "r"); + if (!file) + open_fail(data_name, errno); + + count = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "%X;%*[^;];%*[^;];%*[^;];%*[^;];%[^;];", + &unichar, buf0); + if (ret != 2) + continue; + if (!utf32valid(unichar)) + line_fail(data_name, line); + + s = buf0; + /* skip over <tag> */ + if (*s == '<') + while (*s++ != ' ') + ; + /* decode the decomposition into UTF-32 */ + i = 0; + while (*s) { + mapping[i] = strtoul(s, &s, 16); + if (!utf32valid(mapping[i])) + line_fail(data_name, line); + i++; + } + mapping[i++] = 0; + + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + unicode_data[unichar].utf32nfkdi = um; + + if (verbose > 1) + print_utf32nfkdi(unichar); + count++; + } + fclose(file); + if (verbose > 0) + printf("Found %d entries\n", count); + if (count == 0) + file_fail(data_name); +} + +static void +nfkdicf_init(void) +{ + FILE *file; + unsigned int unichar; 
+ unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */ + char status; + char *s; + unsigned int *um; + int i; + int count; + int ret; + + if (verbose > 0) + printf("Parsing %s\n", fold_name); + file = fopen(fold_name, "r"); + if (!file) + open_fail(fold_name, errno); + + count = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "%X; %c; %[^;];", &unichar, &status, buf0); + if (ret != 3) + continue; + if (!utf32valid(unichar)) + line_fail(fold_name, line); + /* Use the C+F casefold. */ + if (status != 'C' && status != 'F') + continue; + s = buf0; + if (*s == '<') + while (*s++ != ' ') + ; + i = 0; + while (*s) { + mapping[i] = strtoul(s, &s, 16); + if (!utf32valid(mapping[i])) + line_fail(fold_name, line); + i++; + } + mapping[i++] = 0; + + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + unicode_data[unichar].utf32nfkdicf = um; + + if (verbose > 1) + print_utf32nfkdicf(unichar); + count++; + } + fclose(file); + if (verbose > 0) + printf("Found %d entries\n", count); + if (count == 0) + file_fail(fold_name); +} + +static void +ignore_init(void) +{ + FILE *file; + unsigned int unichar; + unsigned int first; + unsigned int last; + unsigned int *um; + int count; + int ret; + + if (verbose > 0) + printf("Parsing %s\n", prop_name); + file = fopen(prop_name, "r"); + if (!file) + open_fail(prop_name, errno); + assert(file); + count = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "%X..%X ; %s # ", &first, &last, buf0); + if (ret == 3) { + if (strcmp(buf0, "Default_Ignorable_Code_Point")) + continue; + if (!utf32valid(first) || !utf32valid(last)) + line_fail(prop_name, line); + for (unichar = first; unichar <= last; unichar++) { + free(unicode_data[unichar].utf32nfkdi); + um = malloc(sizeof(unsigned int)); + *um = 0; + unicode_data[unichar].utf32nfkdi = um; + free(unicode_data[unichar].utf32nfkdicf); + um = malloc(sizeof(unsigned int)); + *um = 0; + unicode_data[unichar].utf32nfkdicf = 
um; + count++; + } + if (verbose > 1) + printf(" %X..%X Default_Ignorable_Code_Point\n", + first, last); + continue; + } + ret = sscanf(line, "%X ; %s # ", &unichar, buf0); + if (ret == 2) { + if (strcmp(buf0, "Default_Ignorable_Code_Point")) + continue; + if (!utf32valid(unichar)) + line_fail(prop_name, line); + free(unicode_data[unichar].utf32nfkdi); + um = malloc(sizeof(unsigned int)); + *um = 0; + unicode_data[unichar].utf32nfkdi = um; + free(unicode_data[unichar].utf32nfkdicf); + um = malloc(sizeof(unsigned int)); + *um = 0; + unicode_data[unichar].utf32nfkdicf = um; + if (verbose > 1) + printf(" %X Default_Ignorable_Code_Point\n", + unichar); + count++; + continue; + } + } + fclose(file); + + if (verbose > 0) + printf("Found %d entries\n", count); + if (count == 0) + file_fail(prop_name); +} + +static void +corrections_init(void) +{ + FILE *file; + unsigned int unichar; + unsigned int major; + unsigned int minor; + unsigned int revision; + unsigned int age; + unsigned int *um; + unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. 
*/ + char *s; + int i; + int count; + int ret; + + if (verbose > 0) + printf("Parsing %s\n", norm_name); + file = fopen(norm_name, "r"); + if (!file) + open_fail(norm_name, errno); + + count = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "%X;%[^;];%[^;];%d.%d.%d #", + &unichar, buf0, buf1, + &major, &minor, &revision); + if (ret != 6) + continue; + if (!utf32valid(unichar) || !age_valid(major, minor, revision)) + line_fail(norm_name, line); + count++; + } + corrections = calloc(count, sizeof(struct unicode_data)); + corrections_count = count; + rewind(file); + + count = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "%X;%[^;];%[^;];%d.%d.%d #", + &unichar, buf0, buf1, + &major, &minor, &revision); + if (ret != 6) + continue; + if (!utf32valid(unichar) || !age_valid(major, minor, revision)) + line_fail(norm_name, line); + corrections[count] = unicode_data[unichar]; + assert(corrections[count].code == unichar); + age = UNICODE_AGE(major, minor, revision); + corrections[count].correction = age; + + i = 0; + s = buf0; + while (*s) { + mapping[i] = strtoul(s, &s, 16); + if (!utf32valid(mapping[i])) + line_fail(norm_name, line); + i++; + } + mapping[i++] = 0; + + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + corrections[count].utf32nfkdi = um; + + if (verbose > 1) + printf(" %X -> %s -> %s V%d_%d_%d\n", + unichar, buf0, buf1, major, minor, revision); + count++; + } + fclose(file); + + if (verbose > 0) + printf("Found %d entries\n", count); + if (count == 0) + file_fail(norm_name); +} + +/* ------------------------------------------------------------------ */ + +/* + * Hangul decomposition (algorithm from Section 3.12 of Unicode 6.3.0) + * + * AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;; + * D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;; + * + * SBase = 0xAC00 + * LBase = 0x1100 + * VBase = 0x1161 + * TBase = 0x11A7 + * LCount = 19 + * VCount = 21 + * TCount = 28 + * NCount = 588 (VCount * 
TCount)
+ * SCount = 11172 (LCount * NCount)
+ *
+ * Decomposition:
+ *   SIndex = s - SBase
+ *
+ * LV (Canonical/Full)
+ *   LIndex = SIndex / NCount
+ *   VIndex = (SIndex % NCount) / TCount
+ *   LPart = LBase + LIndex
+ *   VPart = VBase + VIndex
+ *
+ * LVT (Canonical)
+ *   LVIndex = (SIndex / TCount) * TCount
+ *   TIndex = (SIndex % TCount)
+ *   LVPart = SBase + LVIndex
+ *   TPart = TBase + TIndex
+ *
+ * LVT (Full)
+ *   LIndex = SIndex / NCount
+ *   VIndex = (SIndex % NCount) / TCount
+ *   TIndex = (SIndex % TCount)
+ *   LPart = LBase + LIndex
+ *   VPart = VBase + VIndex
+ *   if (TIndex == 0) {
+ *          d = <LPart, VPart>
+ *   } else {
+ *          TPart = TBase + TIndex
+ *          d = <LPart, VPart, TPart>
+ *   }
+ *
+ */
+
+static void
+hangul_decompose(void)
+{
+	unsigned int sb = 0xAC00;
+	unsigned int lb = 0x1100;
+	unsigned int vb = 0x1161;
+	unsigned int tb = 0x11a7;
+	/* unsigned int lc = 19; */
+	unsigned int vc = 21;
+	unsigned int tc = 28;
+	unsigned int nc = (vc * tc);
+	/* unsigned int sc = (lc * nc); */
+	unsigned int unichar;
+	unsigned int mapping[4];
+	unsigned int *um;
+	int count;
+	int i;
+
+	if (verbose > 0)
+		printf("Decomposing hangul\n");
+	/* Hangul */
+	count = 0;
+	for (unichar = 0xAC00; unichar <= 0xD7A3; unichar++) {
+		unsigned int si = unichar - sb;
+		unsigned int li = si / nc;
+		unsigned int vi = (si % nc) / tc;
+		unsigned int ti = si % tc;
+
+		/* Full decomposition: <L, V> or <L, V, T>. */
+		i = 0;
+		mapping[i++] = lb + li;
+		mapping[i++] = vb + vi;
+		if (ti)
+			mapping[i++] = tb + ti;
+		mapping[i++] = 0;
+
+		assert(!unicode_data[unichar].utf32nfkdi);
+		um = malloc(i * sizeof(unsigned int));
+		memcpy(um, mapping, i * sizeof(unsigned int));
+		unicode_data[unichar].utf32nfkdi = um;
+
+		assert(!unicode_data[unichar].utf32nfkdicf);
+		um = malloc(i * sizeof(unsigned int));
+		memcpy(um, mapping, i * sizeof(unsigned int));
+		unicode_data[unichar].utf32nfkdicf = um;
+
+		if (verbose > 1)
+			print_utf32nfkdi(unichar);
+
+		count++;
+	}
+	if (verbose > 0)
+		printf("Created %d entries\n", count);
+}
+
+static void
+nfkdi_decompose(void) +{ + unsigned int unichar; + unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */ + unsigned int *um; + unsigned int *dc; + int count; + int i; + int j; + int ret; + + if (verbose > 0) + printf("Decomposing nfkdi\n"); + + count = 0; + for (unichar = 0; unichar != 0x110000; unichar++) { + if (!unicode_data[unichar].utf32nfkdi) + continue; + for (;;) { + ret = 1; + i = 0; + um = unicode_data[unichar].utf32nfkdi; + while (*um) { + dc = unicode_data[*um].utf32nfkdi; + if (dc) { + for (j = 0; dc[j]; j++) + mapping[i++] = dc[j]; + ret = 0; + } else { + mapping[i++] = *um; + } + um++; + } + mapping[i++] = 0; + if (ret) + break; + free(unicode_data[unichar].utf32nfkdi); + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + unicode_data[unichar].utf32nfkdi = um; + } + /* Add this decomposition to nfkdicf if there is no entry. */ + if (!unicode_data[unichar].utf32nfkdicf) { + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + unicode_data[unichar].utf32nfkdicf = um; + } + if (verbose > 1) + print_utf32nfkdi(unichar); + count++; + } + if (verbose > 0) + printf("Processed %d entries\n", count); +} + +static void +nfkdicf_decompose(void) +{ + unsigned int unichar; + unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. 
*/ + unsigned int *um; + unsigned int *dc; + int count; + int i; + int j; + int ret; + + if (verbose > 0) + printf("Decomposing nfkdicf\n"); + count = 0; + for (unichar = 0; unichar != 0x110000; unichar++) { + if (!unicode_data[unichar].utf32nfkdicf) + continue; + for (;;) { + ret = 1; + i = 0; + um = unicode_data[unichar].utf32nfkdicf; + while (*um) { + dc = unicode_data[*um].utf32nfkdicf; + if (dc) { + for (j = 0; dc[j]; j++) + mapping[i++] = dc[j]; + ret = 0; + } else { + mapping[i++] = *um; + } + um++; + } + mapping[i++] = 0; + if (ret) + break; + free(unicode_data[unichar].utf32nfkdicf); + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + unicode_data[unichar].utf32nfkdicf = um; + } + if (verbose > 1) + print_utf32nfkdicf(unichar); + count++; + } + if (verbose > 0) + printf("Processed %d entries\n", count); +} + +/* ------------------------------------------------------------------ */ + +int utf8agemax(struct tree *, const char *); +int utf8nagemax(struct tree *, const char *, size_t); +int utf8agemin(struct tree *, const char *); +int utf8nagemin(struct tree *, const char *, size_t); +ssize_t utf8len(struct tree *, const char *); +ssize_t utf8nlen(struct tree *, const char *, size_t); +struct utf8cursor; +int utf8cursor(struct utf8cursor *, struct tree *, const char *); +int utf8ncursor(struct utf8cursor *, struct tree *, const char *, size_t); +int utf8byte(struct utf8cursor *); + +/* + * Use trie to scan s, touching at most len bytes. + * Returns the leaf if one exists, NULL otherwise. + * + * A non-NULL return guarantees that the UTF-8 sequence starting at s + * is well-formed and corresponds to a known unicode code point. The + * shorthand for this will be "is valid UTF-8 unicode". 
+ */
+static utf8leaf_t *
+utf8nlookup(struct tree *tree, const char *s, size_t len)
+{
+	utf8trie_t *trie;
+	int offlen;
+	int offset;
+	int mask;
+	int node;
+
+	if (!tree)
+		return NULL;
+	if (len == 0)
+		return NULL;
+	/* Only dereference tree after the NULL check above. */
+	trie = utf8data + tree->index;
+	node = 1;
+	while (node) {
+		offlen = (*trie & OFFLEN) >> OFFLEN_SHIFT;
+		if (*trie & NEXTBYTE) {
+			if (--len == 0)
+				return NULL;
+			s++;
+		}
+		mask = 1 << (*trie & BITNUM);
+		if (*s & mask) {
+			/* Right leg */
+			if (offlen) {
+				/* Right node at offset of trie */
+				node = (*trie & RIGHTNODE);
+				offset = trie[offlen];
+				while (--offlen) {
+					offset <<= 8;
+					offset |= trie[offlen];
+				}
+				trie += offset;
+			} else if (*trie & RIGHTPATH) {
+				/* Right node after this node */
+				node = (*trie & TRIENODE);
+				trie++;
+			} else {
+				/* No right node. */
+				node = 0;
+				trie = NULL;
+			}
+		} else {
+			/* Left leg */
+			if (offlen) {
+				/* Left node after this node. */
+				node = (*trie & LEFTNODE);
+				trie += offlen + 1;
+			} else if (*trie & RIGHTPATH) {
+				/* No left node. */
+				node = 0;
+				trie = NULL;
+			} else {
+				/* Left node after this node */
+				node = (*trie & TRIENODE);
+				trie++;
+			}
+		}
+	}
+	return trie;
+}
+
+/*
+ * Use trie to scan s.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * Forwards to utf8nlookup().
+ */
+static utf8leaf_t *
+utf8lookup(struct tree *tree, const char *s)
+{
+	return utf8nlookup(tree, s, (size_t)-1);
+}
+
+/*
+ * Return the number of bytes used by the current UTF-8 sequence.
+ * Assumes the input points to the first byte of a valid UTF-8
+ * sequence.
+ */
+static inline int
+utf8clen(const char *s)
+{
+	unsigned char c = *s;
+	return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0);
+}
+
+/*
+ * Maximum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if only non-assigned code points are used.
+ */
+int
+utf8agemax(struct tree *tree, const char *s)
+{
+	utf8leaf_t *leaf;
+	int age = 0;
+	int leaf_age;
+
+	if (!tree)
+		return -1;
+	while (*s) {
+		if (!(leaf = utf8lookup(tree, s)))
+			return -1;
+		leaf_age = ages[LEAF_GEN(leaf)];
+		if (leaf_age <= tree->maxage && leaf_age > age)
+			age = leaf_age;
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Minimum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if non-assigned code points are used.
+ */
+int
+utf8agemin(struct tree *tree, const char *s)
+{
+	utf8leaf_t *leaf;
+	int age;
+	int leaf_age;
+
+	if (!tree)
+		return -1;
+	/* Only dereference tree after the NULL check above. */
+	age = tree->maxage;
+	while (*s) {
+		if (!(leaf = utf8lookup(tree, s)))
+			return -1;
+		leaf_age = ages[LEAF_GEN(leaf)];
+		if (leaf_age <= tree->maxage && leaf_age < age)
+			age = leaf_age;
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Maximum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemax(struct tree *tree, const char *s, size_t len)
+{
+	utf8leaf_t *leaf;
+	int age = 0;
+	int leaf_age;
+
+	if (!tree)
+		return -1;
+	while (len && *s) {
+		if (!(leaf = utf8nlookup(tree, s, len)))
+			return -1;
+		leaf_age = ages[LEAF_GEN(leaf)];
+		if (leaf_age <= tree->maxage && leaf_age > age)
+			age = leaf_age;
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Minimum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemin(struct tree *tree, const char *s, size_t len)
+{
+	utf8leaf_t *leaf;
+	int leaf_age;
+	int age;
+
+	if (!tree)
+		return -1;
+	/* Only dereference tree after the NULL check above. */
+	age = tree->maxage;
+	while (len && *s) {
+		if (!(leaf = utf8nlookup(tree, s, len)))
+			return -1;
+		leaf_age = ages[LEAF_GEN(leaf)];
+		if (leaf_age <= tree->maxage && leaf_age < age)
+			age = leaf_age;
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Length of the normalization of s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * + * A string of Default_Ignorable_Code_Point has length 0. + */ +ssize_t +utf8len(struct tree *tree, const char *s) +{ + utf8leaf_t *leaf; + size_t ret = 0; + + if (!tree) + return -1; + while (*s) { + if (!(leaf = utf8lookup(tree, s))) + return -1; + if (ages[LEAF_GEN(leaf)] > tree->maxage) + ret += utf8clen(s); + else if (LEAF_CCC(leaf) == DECOMPOSE) + ret += strlen(LEAF_STR(leaf)); + else + ret += utf8clen(s); + s += utf8clen(s); + } + return ret; +} + +/* + * Length of the normalization of s, touch at most len bytes. + * Return -1 if s is not valid UTF-8 unicode. + */ +ssize_t +utf8nlen(struct tree *tree, const char *s, size_t len) +{ + utf8leaf_t *leaf; + size_t ret = 0; + + if (!tree) + return -1; + while (len && *s) { + if (!(leaf = utf8nlookup(tree, s, len))) + return -1; + if (ages[LEAF_GEN(leaf)] > tree->maxage) + ret += utf8clen(s); + else if (LEAF_CCC(leaf) == DECOMPOSE) + ret += strlen(LEAF_STR(leaf)); + else + ret += utf8clen(s); + len -= utf8clen(s); + s += utf8clen(s); + } + return ret; +} + +/* + * Cursor structure used by the normalizer. + */ +struct utf8cursor { + struct tree *tree; + const char *s; + const char *p; + const char *ss; + const char *sp; + unsigned int len; + unsigned int slen; + short int ccc; + short int nccc; + unsigned int unichar; +}; + +/* + * Set up an utf8cursor for use by utf8byte(). + * + * s : string. + * len : length of s. + * u8c : pointer to cursor. + * trie : utf8trie_t to use for normalization. + * + * Returns -1 on error, 0 on success. + */ +int +utf8ncursor( + struct utf8cursor *u8c, + struct tree *tree, + const char *s, + size_t len) +{ + if (!tree) + return -1; + if (!s) + return -1; + u8c->tree = tree; + u8c->s = s; + u8c->p = NULL; + u8c->ss = NULL; + u8c->sp = NULL; + u8c->len = len; + u8c->slen = 0; + u8c->ccc = STOPPER; + u8c->nccc = STOPPER; + u8c->unichar = 0; + /* Check we didn't clobber the maximum length. 
*/
+	if (u8c->len != len)
+		return -1;
+	/* The first byte of s may not be an utf8 continuation. */
+	if (len > 0 && (*s & 0xC0) == 0x80)
+		return -1;
+	return 0;
+}
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ * s : NUL-terminated string.
+ * u8c : pointer to cursor.
+ * tree : struct tree to use for normalization.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int
+utf8cursor(
+	struct utf8cursor *u8c,
+	struct tree *tree,
+	const char *s)
+{
+	return utf8ncursor(u8c, tree, s, (unsigned int)-1);
+}
+
+/*
+ * Get one byte from the normalized form of the string described by u8c.
+ *
+ * Returns the byte cast to an unsigned char on success, and -1 on failure.
+ *
+ * The cursor keeps track of the location in the string in u8c->s.
+ * When a character is decomposed, the current location is stored in
+ * u8c->p, and u8c->s is set to the start of the decomposition. Note
+ * that bytes from a decomposition do not count against u8c->len.
+ *
+ * Characters are emitted if they match the current CCC in u8c->ccc.
+ * Hitting end-of-string while u8c->ccc == STOPPER means we're done,
+ * and the function returns 0 in that case.
+ *
+ * Sorting by CCC is done by repeatedly scanning the string. The
+ * values of u8c->s and u8c->p are stored in u8c->ss and u8c->sp at
+ * the start of the scan. The first pass finds the lowest CCC to be
+ * emitted and stores it in u8c->nccc, the second pass emits the
+ * characters with this CCC and finds the next lowest CCC. This limits
+ * the number of passes to 1 + the number of different CCCs in the
+ * sequence being scanned.
+ *
+ * Therefore:
+ *  u8c->p != NULL -> a decomposition is being scanned.
+ *  u8c->ss != NULL -> this is a repeating scan.
+ *  u8c->ccc == -1 -> this is the first scan of a repeating scan.
+ */
+int
+utf8byte(struct utf8cursor *u8c)
+{
+	utf8leaf_t *leaf;
+	int ccc;
+
+	for (;;) {
+		/* Check for the end of a decomposed character. 
*/ + if (u8c->p && *u8c->s == '\0') { + u8c->s = u8c->p; + u8c->p = NULL; + } + + /* Check for end-of-string. */ + if (!u8c->p && (u8c->len == 0 || *u8c->s == '\0')) { + /* There is no next byte. */ + if (u8c->ccc == STOPPER) + return 0; + /* End-of-string during a scan counts as a stopper. */ + ccc = STOPPER; + goto ccc_mismatch; + } else if ((*u8c->s & 0xC0) == 0x80) { + /* This is a continuation of the current character. */ + if (!u8c->p) + u8c->len--; + return (unsigned char)*u8c->s++; + } + + /* Look up the data for the current character. */ + if (u8c->p) + leaf = utf8lookup(u8c->tree, u8c->s); + else + leaf = utf8nlookup(u8c->tree, u8c->s, u8c->len); + + /* No leaf found implies that the input is a binary blob. */ + if (!leaf) + return -1; + + /* Characters that are too new have CCC 0. */ + if (ages[LEAF_GEN(leaf)] > u8c->tree->maxage) { + ccc = STOPPER; + } else if ((ccc = LEAF_CCC(leaf)) == DECOMPOSE) { + u8c->len -= utf8clen(u8c->s); + u8c->p = u8c->s + utf8clen(u8c->s); + u8c->s = LEAF_STR(leaf); + /* Empty decomposition implies CCC 0. */ + if (*u8c->s == '\0') { + if (u8c->ccc == STOPPER) + continue; + ccc = STOPPER; + goto ccc_mismatch; + } + leaf = utf8lookup(u8c->tree, u8c->s); + ccc = LEAF_CCC(leaf); + } + u8c->unichar = utf8code(u8c->s); + + /* + * If this is not a stopper, then see if it updates + * the next canonical class to be emitted. + */ + if (ccc != STOPPER && u8c->ccc < ccc && ccc < u8c->nccc) + u8c->nccc = ccc; + + /* + * Return the current byte if this is the current + * combining class. + */ + if (ccc == u8c->ccc) { + if (!u8c->p) + u8c->len--; + return (unsigned char)*u8c->s++; + } + + /* Current combining class mismatch. */ + ccc_mismatch: + if (u8c->nccc == STOPPER) { + /* + * Scan forward for the first canonical class + * to be emitted. Save the position from + * which to restart. 
+ */ + assert(u8c->ccc == STOPPER); + u8c->ccc = MINCCC - 1; + u8c->nccc = ccc; + u8c->sp = u8c->p; + u8c->ss = u8c->s; + u8c->slen = u8c->len; + if (!u8c->p) + u8c->len -= utf8clen(u8c->s); + u8c->s += utf8clen(u8c->s); + } else if (ccc != STOPPER) { + /* Not a stopper, and not the ccc we're emitting. */ + if (!u8c->p) + u8c->len -= utf8clen(u8c->s); + u8c->s += utf8clen(u8c->s); + } else if (u8c->nccc != MAXCCC + 1) { + /* At a stopper, restart for next ccc. */ + u8c->ccc = u8c->nccc; + u8c->nccc = MAXCCC + 1; + u8c->s = u8c->ss; + u8c->p = u8c->sp; + u8c->len = u8c->slen; + } else { + /* All done, proceed from here. */ + u8c->ccc = STOPPER; + u8c->nccc = STOPPER; + u8c->sp = NULL; + u8c->ss = NULL; + u8c->slen = 0; + } + } +} + +/* ------------------------------------------------------------------ */ + +static int +normalize_line(struct tree *tree) +{ + char *s; + char *t; + int c; + struct utf8cursor u8c; + + /* First test: null-terminated string. */ + s = buf2; + t = buf3; + if (utf8cursor(&u8c, tree, s)) + return -1; + while ((c = utf8byte(&u8c)) > 0) + if (c != (unsigned char)*t++) + return -1; + if (c < 0) + return -1; + if (*t != 0) + return -1; + + /* Second test: length-limited string. */ + s = buf2; + /* Replace NUL with a value that will cause an error if seen. */ + s[strlen(s) + 1] = -1; + t = buf3; + if (utf8cursor(&u8c, tree, s)) + return -1; + while ((c = utf8byte(&u8c)) > 0) + if (c != (unsigned char)*t++) + return -1; + if (c < 0) + return -1; + if (*t != 0) + return -1; + + return 0; +} + +static void +normalization_test(void) +{ + FILE *file; + unsigned int unichar; + struct unicode_data *data; + char *s; + char *t; + int ret; + int ignorables; + int tests = 0; + int failures = 0; + + if (verbose > 0) + printf("Parsing %s\n", test_name); + /* Step one, read data from file. 
*/ + file = fopen(test_name, "r"); + if (!file) + open_fail(test_name, errno); + + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "%[^;];%*[^;];%*[^;];%*[^;];%[^;];", + buf0, buf1); + if (ret != 2 || *line == '#') + continue; + s = buf0; + t = buf2; + while (*s) { + unichar = strtoul(s, &s, 16); + t += utf8key(unichar, t); + } + *t = '\0'; + + ignorables = 0; + s = buf1; + t = buf3; + while (*s) { + unichar = strtoul(s, &s, 16); + data = &unicode_data[unichar]; + if (data->utf8nfkdi && !*data->utf8nfkdi) + ignorables = 1; + else + t += utf8key(unichar, t); + } + *t = '\0'; + + tests++; + if (normalize_line(nfkdi_tree) < 0) { + printf("\nline %s -> %s", buf0, buf1); + if (ignorables) + printf(" (ignorables removed)"); + printf(" failure\n"); + failures++; + } + } + fclose(file); + if (verbose > 0) + printf("Ran %d tests with %d failures\n", tests, failures); + if (failures) + file_fail(test_name); +} + +/* ------------------------------------------------------------------ */ + +static void +write_file(void) +{ + FILE *file; + int i; + int j; + int t; + int gen; + + if (verbose > 0) + printf("Writing %s\n", utf8_name); + file = fopen(utf8_name, "w"); + if (!file) + open_fail(utf8_name, errno); + + fprintf(file, "/* This file is generated code, do not edit. */\n"); + fprintf(file, "#ifndef __INCLUDED_FROM_UTF8NORM_C__\n"); + fprintf(file, "#error Only xfs_utf8.c may include this file.\n"); + fprintf(file, "#endif\n"); + fprintf(file, "\n"); + fprintf(file, "static const unsigned int utf8vers = %#x;\n", + unicode_maxage); + fprintf(file, "\n"); + fprintf(file, "static const unsigned int utf8agetab[] = {\n"); + for (i = 0; i != ages_count; i++) + fprintf(file, "\t%#x%s\n", ages[i], + ages[i] == unicode_maxage ? 
"" : ","); + fprintf(file, "};\n"); + fprintf(file, "\n"); + fprintf(file, "static const struct utf8data utf8nfkdicfdata[] = {\n"); + t = 0; + for (gen = 0; gen < ages_count; gen++) { + fprintf(file, "\t{ %#x, %d }%s\n", + ages[gen], trees[t].index, + ages[gen] == unicode_maxage ? "" : ","); + if (trees[t].maxage == ages[gen]) + t += 2; + } + fprintf(file, "};\n"); + fprintf(file, "\n"); + fprintf(file, "static const struct utf8data utf8nfkdidata[] = {\n"); + t = 1; + for (gen = 0; gen < ages_count; gen++) { + fprintf(file, "\t{ %#x, %d }%s\n", + ages[gen], trees[t].index, + ages[gen] == unicode_maxage ? "" : ","); + if (trees[t].maxage == ages[gen]) + t += 2; + } + fprintf(file, "};\n"); + fprintf(file, "\n"); + fprintf(file, "static const unsigned char utf8data[%zd] = {\n", + utf8data_size); + t = 0; + for (i = 0; i != utf8data_size; i += 16) { + if (i == trees[t].index) { + fprintf(file, "\t/* %s_%x */\n", + trees[t].type, trees[t].maxage); + if (t < trees_count-1) + t++; + } + fprintf(file, "\t"); + for (j = i; j != i + 16; j++) + fprintf(file, "0x%.2x%s", utf8data[j], + (j < utf8data_size -1 ? 
"," : "")); + fprintf(file, "\n"); + } + fprintf(file, "};\n"); + fclose(file); +} + +/* ------------------------------------------------------------------ */ + +int +main(int argc, char *argv[]) +{ + unsigned int unichar; + int opt; + + argv0 = argv[0]; + + while ((opt = getopt(argc, argv, "a:c:d:f:hn:o:p:t:v")) != -1) { + switch (opt) { + case 'a': + age_name = optarg; + break; + case 'c': + ccc_name = optarg; + break; + case 'd': + data_name = optarg; + break; + case 'f': + fold_name = optarg; + break; + case 'n': + norm_name = optarg; + break; + case 'o': + utf8_name = optarg; + break; + case 'p': + prop_name = optarg; + break; + case 't': + test_name = optarg; + break; + case 'v': + verbose++; + break; + case 'h': + help(); + exit(0); + default: + usage(); + } + } + + if (verbose > 1) + help(); + for (unichar = 0; unichar != 0x110000; unichar++) + unicode_data[unichar].code = unichar; + age_init(); + ccc_init(); + nfkdi_init(); + nfkdicf_init(); + ignore_init(); + corrections_init(); + hangul_decompose(); + nfkdi_decompose(); + nfkdicf_decompose(); + utf8_init(); + trees_init(); + trees_populate(); + trees_reduce(); + trees_verify(); + /* Prevent "unused function" warning. */ + (void)lookup(nfkdi_tree, " "); + if (verbose > 2) + tree_walk(nfkdi_tree); + if (verbose > 2) + tree_walk(nfkdicf_tree); + normalization_test(); + write_file(); + + return 0; +} -- 1.7.12.4 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 84+ messages in thread
* [PATCH 07b/10] xfs: add supporting code for UTF-8.
2014-09-18 19:56 [RFC v2] Unicode/UTF-8 support for XFS Ben Myers
` (13 preceding siblings ...)
2014-09-19 16:03 ` [PATCH 07a/10] xfs: add trie generator for UTF-8 Ben Myers
@ 2014-09-19 16:04 ` Ben Myers
2014-09-22 14:55 ` Andi Kleen
2014-09-22 22:26 ` Dave Chinner
16 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-19 16:04 UTC (permalink / raw)
To: linux-fsdevel; +Cc: tinguely, olaf, xfs
From: Olaf Weber <olaf@sgi.com>
Supporting functions for UTF-8 normalization are in utf8norm.c with the
header utf8norm.h. Two normalization forms are supported: nfkdi and
nfkdicf.
nfkdi:
- Apply unicode normalization form NFKD.
- Remove any Default_Ignorable_Code_Point.
nfkdicf:
- Apply unicode normalization form NFKD.
- Remove any Default_Ignorable_Code_Point.
- Apply a full casefold (C + F).
For the purposes of the code, a string is valid UTF-8 if:
- The values encoded are 0x1..0x10FFFF.
- The surrogate codepoints 0xD800..0xDFFF are not encoded.
- The shortest possible encoding is used for all values.
The supporting functions work on NUL-terminated strings (utf8 prefix)
and on length-limited strings (utf8n prefix).
Signed-off-by: Olaf Weber <olaf@sgi.com>
---
[v2: the trie is now separated into utf8norm.ko;
utf8version is now a function and exported;
introduced CONFIG_XFS_UTF8;
removed trie generator due to vger size constraint.
--bpm] --- fs/xfs/utf8norm/Makefile | 4 + fs/xfs/utf8norm/utf8norm.c | 649 +++++++++++++++++++++++++++++++++++++++++++++ fs/xfs/utf8norm/utf8norm.h | 116 ++++++++ 3 files changed, 769 insertions(+) create mode 100644 fs/xfs/utf8norm/utf8norm.c create mode 100644 fs/xfs/utf8norm/utf8norm.h diff --git a/fs/xfs/utf8norm/Makefile b/fs/xfs/utf8norm/Makefile index 9b2efa9..f83f9b9 100644 --- a/fs/xfs/utf8norm/Makefile +++ b/fs/xfs/utf8norm/Makefile @@ -16,6 +16,10 @@ # Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA # +ifeq ($(CONFIG_XFS_UTF8),y) +obj-m += utf8norm.o +endif + hostprogs-y := mkutf8data $(obj)/utf8norm.o: $(obj)/utf8data.h $(obj)/utf8data.h: $(src)/ucd/*.txt diff --git a/fs/xfs/utf8norm/utf8norm.c b/fs/xfs/utf8norm/utf8norm.c new file mode 100644 index 0000000..995c4df --- /dev/null +++ b/fs/xfs/utf8norm/utf8norm.c @@ -0,0 +1,649 @@ +/* + * Copyright (c) 2014 SGI. + * All rights reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write the Free Software Foundation, + * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include "utf8norm.h" + +struct utf8data { + unsigned int maxage; + unsigned int offset; +}; + +#define __INCLUDED_FROM_UTF8NORM_C__ +#include "utf8data.h" +#undef __INCLUDED_FROM_UTF8NORM_C__ + +const unsigned int utf8version(void) +{ + return utf8vers; +} +EXPORT_SYMBOL(utf8version); + +/* + * UTF-8 valid ranges. + * + * The UTF-8 encoding spreads the bits of a 32bit word over several + * bytes. 
This table gives the ranges that can be held and how they'd
+ * be represented.
+ *
+ *   0x00000000 0x0000007F: 0xxxxxxx
+ *   0x00000000 0x000007FF: 110xxxxx 10xxxxxx
+ *   0x00000000 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ *   0x00000000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *   0x00000000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *   0x00000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * There is an additional requirement on UTF-8, in that only the
+ * shortest representation of a 32bit value is to be used. A decoder
+ * must not decode sequences that do not satisfy this requirement.
+ * Thus the allowed ranges have a lower bound.
+ *
+ *   0x00000000 0x0000007F: 0xxxxxxx
+ *   0x00000080 0x000007FF: 110xxxxx 10xxxxxx
+ *   0x00000800 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ *   0x00010000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *   0x00200000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *   0x04000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * Actual unicode characters are limited to the range 0x0 - 0x10FFFF,
+ * 17 planes of 65536 values. This limits the sequences actually seen
+ * even more, to just the following.
+ *
+ *          0 -     0x7F: 0                   - 0x7F
+ *       0x80 -    0x7FF: 0xC2 0x80           - 0xDF 0xBF
+ *      0x800 -   0xFFFF: 0xE0 0xA0 0x80      - 0xEF 0xBF 0xBF
+ *    0x10000 - 0x10FFFF: 0xF0 0x90 0x80 0x80 - 0xF4 0x8F 0xBF 0xBF
+ *
+ * Within those ranges the surrogates 0xD800 - 0xDFFF are not allowed.
+ *
+ * Note that the longest sequence seen with valid usage is 4 bytes,
+ * the same as a single UTF-32 character. This makes the UTF-8
+ * representation of Unicode never larger than UTF-32.
+ *
+ * The shortest sequence requirement was introduced by:
+ *    Corrigendum #1: UTF-8 Shortest Form
+ * It can be found here:
+ *    http://www.unicode.org/versions/corrigendum1.html
+ *
+ */
+
+/*
+ * Return the number of bytes used by the current UTF-8 sequence.
+ * Assumes the input points to the first byte of a valid UTF-8
+ * sequence.
+ */
+static inline int
+utf8clen(const char *s)
+{
+	unsigned char c = *s;
+	return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0);
+}
+
+/*
+ * utf8trie_t
+ *
+ * A compact binary tree, used to decode UTF-8 characters.
+ *
+ * Internal nodes are one byte for the node itself, and up to three
+ * bytes for an offset into the tree. The first byte contains the
+ * following information:
+ *  NEXTBYTE  - flag        - advance to next byte if set
+ *  BITNUM    - 3 bit field - the bit number to be tested
+ *  OFFLEN    - 2 bit field - number of bytes in the offset
+ * if offlen == 0 (non-branching node)
+ *  RIGHTPATH - 1 bit field - set if the following node is for the
+ *                            right-hand path (tested bit is set)
+ *  TRIENODE  - 1 bit field - set if the following node is an internal
+ *                            node, otherwise it is a leaf node
+ * if offlen != 0 (branching node)
+ *  LEFTNODE  - 1 bit field - set if the left-hand node is internal
+ *  RIGHTNODE - 1 bit field - set if the right-hand node is internal
+ *
+ * Due to the way utf8 works, there cannot be branching nodes with
+ * NEXTBYTE set, and moreover those nodes always have a righthand
+ * descendant.
+ */
+typedef const unsigned char utf8trie_t;
+#define BITNUM		0x07
+#define NEXTBYTE	0x08
+#define OFFLEN		0x30
+#define OFFLEN_SHIFT	4
+#define RIGHTPATH	0x40
+#define TRIENODE	0x80
+#define RIGHTNODE	0x40
+#define LEFTNODE	0x80
+
+/*
+ * utf8leaf_t
+ *
+ * The leaves of the trie are embedded in the trie, and so have the
+ * same underlying datatype: unsigned char.
+ *
+ * leaf[0]: The unicode version, stored as a generation number that is
+ *          an index into utf8agetab[]. With this we can filter code
+ *          points based on the unicode version in which they were
+ *          defined. The CCC of a non-defined code point is 0.
+ * leaf[1]: Canonical Combining Class. 
During normalization, we need + * to do a stable sort into ascending order of all characters + * with a non-zero CCC that occur between two characters with + * a CCC of 0, or at the begin or end of a string. + * The unicode standard guarantees that all CCC values are + * between 0 and 254 inclusive, which leaves 255 available as + * a special value. + * Code points with CCC 0 are known as stoppers. + * leaf[2]: Decomposition. If leaf[1] == 255, then leaf[2] is the + * start of a NUL-terminated string that is the decomposition + * of the character. + * The CCC of a decomposable character is the same as the CCC + * of the first character of its decomposition. + * Some characters decompose as the empty string: these are + * characters with the Default_Ignorable_Code_Point property. + * These do affect normalization, as they all have CCC 0. + * + * The decompositions in the trie have been fully expanded. + * + * Casefolding, if applicable, is also done using decompositions. + * + * The trie is constructed in such a way that leaves exist for all + * UTF-8 sequences that match the criteria from the "UTF-8 valid + * ranges" comment above, and only for those sequences. Therefore a + * lookup in the trie can be used to validate the UTF-8 input. + */ +typedef const unsigned char utf8leaf_t; + +#define LEAF_GEN(LEAF) ((LEAF)[0]) +#define LEAF_CCC(LEAF) ((LEAF)[1]) +#define LEAF_STR(LEAF) ((const char*)((LEAF) + 2)) + +#define MINCCC (0) +#define MAXCCC (254) +#define STOPPER (0) +#define DECOMPOSE (255) + +/* + * Use trie to scan s, touching at most len bytes. + * Returns the leaf if one exists, NULL otherwise. + * + * A non-NULL return guarantees that the UTF-8 sequence starting at s + * is well-formed and corresponds to a known unicode code point. The + * shorthand for this will be "is valid UTF-8 unicode". 
+ */
+static utf8leaf_t *
+utf8nlookup(utf8data_t data, const char *s, size_t len)
+{
+	utf8trie_t *trie = data ? utf8data + data->offset : NULL;
+	int offlen;
+	int offset;
+	int mask;
+	int node;
+
+	if (!data)
+		return NULL;
+	if (len == 0)
+		return NULL;
+	node = 1;
+	while (node) {
+		offlen = (*trie & OFFLEN) >> OFFLEN_SHIFT;
+		if (*trie & NEXTBYTE) {
+			if (--len == 0)
+				return NULL;
+			s++;
+		}
+		mask = 1 << (*trie & BITNUM);
+		if (*s & mask) {
+			/* Right leg */
+			if (offlen) {
+				/* Right node at offset of trie */
+				node = (*trie & RIGHTNODE);
+				offset = trie[offlen];
+				while (--offlen) {
+					offset <<= 8;
+					offset |= trie[offlen];
+				}
+				trie += offset;
+			} else if (*trie & RIGHTPATH) {
+				/* Right node after this node */
+				node = (*trie & TRIENODE);
+				trie++;
+			} else {
+				/* No right node. */
+				node = 0;
+				trie = NULL;
+			}
+		} else {
+			/* Left leg */
+			if (offlen) {
+				/* Left node after this node. */
+				node = (*trie & LEFTNODE);
+				trie += offlen + 1;
+			} else if (*trie & RIGHTPATH) {
+				/* No left node. */
+				node = 0;
+				trie = NULL;
+			} else {
+				/* Left node after this node */
+				node = (*trie & TRIENODE);
+				trie++;
+			}
+		}
+	}
+	return trie;
+}
+
+/*
+ * Use trie to scan s.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * Forwards to utf8nlookup().
+ */
+static utf8leaf_t *
+utf8lookup(utf8data_t data, const char *s)
+{
+	return utf8nlookup(data, s, (size_t)-1);
+}
+
+/*
+ * Maximum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if only non-assigned code points are used.
+ */
+int
+utf8agemax(utf8data_t data, const char *s)
+{
+	utf8leaf_t *leaf;
+	int age = 0;
+	int leaf_age;
+
+	if (!data)
+		return -1;
+	while (*s) {
+		if (!(leaf = utf8lookup(data, s)))
+			return -1;
+		leaf_age = utf8agetab[LEAF_GEN(leaf)];
+		if (leaf_age <= data->maxage && leaf_age > age)
+			age = leaf_age;
+		s += utf8clen(s);
+	}
+	return age;
+}
+EXPORT_SYMBOL(utf8agemax);
+
+/*
+ * Minimum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if non-assigned code points are used.
+ */
+int
+utf8agemin(utf8data_t data, const char *s)
+{
+	utf8leaf_t *leaf;
+	int age;
+	int leaf_age;
+
+	if (!data)
+		return -1;
+	age = data->maxage;
+	while (*s) {
+		if (!(leaf = utf8lookup(data, s)))
+			return -1;
+		leaf_age = utf8agetab[LEAF_GEN(leaf)];
+		if (leaf_age <= data->maxage && leaf_age < age)
+			age = leaf_age;
+		s += utf8clen(s);
+	}
+	return age;
+}
+EXPORT_SYMBOL(utf8agemin);
+
+/*
+ * Maximum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemax(utf8data_t data, const char *s, size_t len)
+{
+	utf8leaf_t *leaf;
+	int age = 0;
+	int leaf_age;
+
+	if (!data)
+		return -1;
+	while (len && *s) {
+		if (!(leaf = utf8nlookup(data, s, len)))
+			return -1;
+		leaf_age = utf8agetab[LEAF_GEN(leaf)];
+		if (leaf_age <= data->maxage && leaf_age > age)
+			age = leaf_age;
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return age;
+}
+EXPORT_SYMBOL(utf8nagemax);
+
+/*
+ * Minimum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemin(utf8data_t data, const char *s, size_t len)
+{
+	utf8leaf_t *leaf;
+	int leaf_age;
+	int age;
+
+	if (!data)
+		return -1;
+	age = data->maxage;
+	while (len && *s) {
+		if (!(leaf = utf8nlookup(data, s, len)))
+			return -1;
+		leaf_age = utf8agetab[LEAF_GEN(leaf)];
+		if (leaf_age <= data->maxage && leaf_age < age)
+			age = leaf_age;
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return age;
+}
+EXPORT_SYMBOL(utf8nagemin);
+
+/*
+ * Length of the normalization of s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ *
+ * A string of Default_Ignorable_Code_Point has length 0.
+ */ +ssize_t +utf8len(utf8data_t data, const char *s) +{ + utf8leaf_t *leaf; + size_t ret = 0; + + if (!data) + return -1; + while (*s) { + if (!(leaf = utf8lookup(data, s))) + return -1; + if (utf8agetab[LEAF_GEN(leaf)] > data->maxage) + ret += utf8clen(s); + else if (LEAF_CCC(leaf) == DECOMPOSE) + ret += strlen(LEAF_STR(leaf)); + else + ret += utf8clen(s); + s += utf8clen(s); + } + return ret; +} +EXPORT_SYMBOL(utf8len); + +/* + * Length of the normalization of s, touch at most len bytes. + * Return -1 if s is not valid UTF-8 unicode. + */ +ssize_t +utf8nlen(utf8data_t data, const char *s, size_t len) +{ + utf8leaf_t *leaf; + size_t ret = 0; + + if (!data) + return -1; + while (len && *s) { + if (!(leaf = utf8nlookup(data, s, len))) + return -1; + if (utf8agetab[LEAF_GEN(leaf)] > data->maxage) + ret += utf8clen(s); + else if (LEAF_CCC(leaf) == DECOMPOSE) + ret += strlen(LEAF_STR(leaf)); + else + ret += utf8clen(s); + len -= utf8clen(s); + s += utf8clen(s); + } + return ret; +} +EXPORT_SYMBOL(utf8nlen); + +/* + * Set up an utf8cursor for use by utf8byte(). + * + * u8c : pointer to cursor. + * data : utf8data_t to use for normalization. + * s : string. + * len : length of s. + * + * Returns -1 on error, 0 on success. + */ +int +utf8ncursor( + struct utf8cursor *u8c, + utf8data_t data, + const char *s, + size_t len) +{ + if (!data) + return -1; + if (!s) + return -1; + u8c->data = data; + u8c->s = s; + u8c->p = NULL; + u8c->ss = NULL; + u8c->sp = NULL; + u8c->len = len; + u8c->slen = 0; + u8c->ccc = STOPPER; + u8c->nccc = STOPPER; + /* Check we didn't clobber the maximum length. */ + if (u8c->len != len) + return -1; + /* The first byte of s may not be an utf8 continuation. */ + if (len > 0 && (*s & 0xC0) == 0x80) + return -1; + return 0; +} +EXPORT_SYMBOL(utf8ncursor); + +/* + * Set up an utf8cursor for use by utf8byte(). + * + * u8c : pointer to cursor. + * data : utf8data_t to use for normalization. + * s : NUL-terminated string. 
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int
+utf8cursor(
+	struct utf8cursor *u8c,
+	utf8data_t data,
+	const char *s)
+{
+	return utf8ncursor(u8c, data, s, (unsigned int)-1);
+}
+EXPORT_SYMBOL(utf8cursor);
+
+/*
+ * Get one byte from the normalized form of the string described by u8c.
+ *
+ * Returns the byte cast to an unsigned char on success, and -1 on failure.
+ *
+ * The cursor keeps track of the location in the string in u8c->s.
+ * When a character is decomposed, the current location is stored in
+ * u8c->p, and u8c->s is set to the start of the decomposition. Note
+ * that bytes from a decomposition do not count against u8c->len.
+ *
+ * Characters are emitted if they match the current CCC in u8c->ccc.
+ * Hitting end-of-string while u8c->ccc == STOPPER means we're done,
+ * and the function returns 0 in that case.
+ *
+ * Sorting by CCC is done by repeatedly scanning the string. The
+ * values of u8c->s and u8c->p are stored in u8c->ss and u8c->sp at
+ * the start of the scan. The first pass finds the lowest CCC to be
+ * emitted and stores it in u8c->nccc, the second pass emits the
+ * characters with this CCC and finds the next lowest CCC. This limits
+ * the number of passes to 1 + the number of different CCCs in the
+ * sequence being scanned.
+ *
+ * Therefore:
+ *  u8c->p != NULL -> a decomposition is being scanned.
+ *  u8c->ss != NULL -> this is a repeating scan.
+ *  u8c->ccc == -1 -> this is the first scan of a repeating scan.
+ */
+int
+utf8byte(struct utf8cursor *u8c)
+{
+	utf8leaf_t *leaf;
+	int ccc;
+
+	for (;;) {
+		/* Check for the end of a decomposed character. */
+		if (u8c->p && *u8c->s == '\0') {
+			u8c->s = u8c->p;
+			u8c->p = NULL;
+		}
+
+		/* Check for end-of-string. */
+		if (!u8c->p && (u8c->len == 0 || *u8c->s == '\0')) {
+			/* There is no next byte. */
+			if (u8c->ccc == STOPPER)
+				return 0;
+			/* End-of-string during a scan counts as a stopper. 
*/ + ccc = STOPPER; + goto ccc_mismatch; + } else if ((*u8c->s & 0xC0) == 0x80) { + /* This is a continuation of the current character. */ + if (!u8c->p) + u8c->len--; + return (unsigned char)*u8c->s++; + } + + /* Look up the data for the current character. */ + if (u8c->p) + leaf = utf8lookup(u8c->data, u8c->s); + else + leaf = utf8nlookup(u8c->data, u8c->s, u8c->len); + + /* No leaf found implies that the input is a binary blob. */ + if (!leaf) + return -1; + + /* Characters that are too new have CCC 0. */ + if (utf8agetab[LEAF_GEN(leaf)] > u8c->data->maxage) { + ccc = STOPPER; + } else if ((ccc = LEAF_CCC(leaf)) == DECOMPOSE) { + u8c->len -= utf8clen(u8c->s); + u8c->p = u8c->s + utf8clen(u8c->s); + u8c->s = LEAF_STR(leaf); + /* Empty decomposition implies CCC 0. */ + if (*u8c->s == '\0') { + if (u8c->ccc == STOPPER) + continue; + ccc = STOPPER; + goto ccc_mismatch; + } + leaf = utf8lookup(u8c->data, u8c->s); + ccc = LEAF_CCC(leaf); + } + + /* + * If this is not a stopper, then see if it updates + * the next canonical class to be emitted. + */ + if (ccc != STOPPER && u8c->ccc < ccc && ccc < u8c->nccc) + u8c->nccc = ccc; + + /* + * Return the current byte if this is the current + * combining class. + */ + if (ccc == u8c->ccc) { + if (!u8c->p) + u8c->len--; + return (unsigned char)*u8c->s++; + } + + /* Current combining class mismatch. */ + ccc_mismatch: + if (u8c->nccc == STOPPER) { + /* + * Scan forward for the first canonical class + * to be emitted. Save the position from + * which to restart. + */ + u8c->ccc = MINCCC - 1; + u8c->nccc = ccc; + u8c->sp = u8c->p; + u8c->ss = u8c->s; + u8c->slen = u8c->len; + if (!u8c->p) + u8c->len -= utf8clen(u8c->s); + u8c->s += utf8clen(u8c->s); + } else if (ccc != STOPPER) { + /* Not a stopper, and not the ccc we're emitting. */ + if (!u8c->p) + u8c->len -= utf8clen(u8c->s); + u8c->s += utf8clen(u8c->s); + } else if (u8c->nccc != MAXCCC + 1) { + /* At a stopper, restart for next ccc. 
*/ + u8c->ccc = u8c->nccc; + u8c->nccc = MAXCCC + 1; + u8c->s = u8c->ss; + u8c->p = u8c->sp; + u8c->len = u8c->slen; + } else { + /* All done, proceed from here. */ + u8c->ccc = STOPPER; + u8c->nccc = STOPPER; + u8c->sp = NULL; + u8c->ss = NULL; + u8c->slen = 0; + } + } +} +EXPORT_SYMBOL(utf8byte); + +const struct utf8data * +utf8nfkdi(unsigned int maxage) +{ + int i = sizeof(utf8nfkdidata)/sizeof(utf8nfkdidata[0]) - 1; + + while (maxage < utf8nfkdidata[i].maxage) + i--; + if (maxage > utf8nfkdidata[i].maxage) + return NULL; + return &utf8nfkdidata[i]; +} +EXPORT_SYMBOL(utf8nfkdi); + +const struct utf8data * +utf8nfkdicf(unsigned int maxage) +{ + int i = sizeof(utf8nfkdicfdata)/sizeof(utf8nfkdicfdata[0]) - 1; + + while (maxage < utf8nfkdicfdata[i].maxage) + i--; + if (maxage > utf8nfkdicfdata[i].maxage) + return NULL; + return &utf8nfkdicfdata[i]; +} +EXPORT_SYMBOL(utf8nfkdicf); + +MODULE_AUTHOR("SGI"); +MODULE_DESCRIPTION("utf8 normalization"); +MODULE_LICENSE("GPL"); diff --git a/fs/xfs/utf8norm/utf8norm.h b/fs/xfs/utf8norm/utf8norm.h new file mode 100644 index 0000000..44a9e53 --- /dev/null +++ b/fs/xfs/utf8norm/utf8norm.h @@ -0,0 +1,116 @@ +/* + * Copyright (c) 2014 SGI. + * All rights reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write the Free Software Foundation, + * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#ifndef UTF8NORM_H +#define UTF8NORM_H + +#include <linux/types.h> +#include <linux/export.h> +#include <linux/string.h> +#include <linux/module.h> + +/* An opaque type used to determine the normalization in use. */ +typedef const struct utf8data *utf8data_t; + +/* Encoding a unicode version number as a single unsigned int. */ +#define UNICODE_MAJ_SHIFT (16) +#define UNICODE_MIN_SHIFT (8) + +#define UNICODE_AGE(MAJ,MIN,REV) \ + (((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) | \ + ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) | \ + ((unsigned int)(REV))) + +/* Highest unicode version supported by the data tables. */ +extern const unsigned int utf8version(void); + +/* + * Look for the correct utf8data_t for a unicode version. + * Returns NULL if the version requested is too new. + * + * Two normalization forms are supported: nfkdi and nfkdicf. + * + * nfkdi: + * - Apply unicode normalization form NFKD. + * - Remove any Default_Ignorable_Code_Point. + * + * nfkdicf: + * - Apply unicode normalization form NFKD. + * - Remove any Default_Ignorable_Code_Point. + * - Apply a full casefold (C + F). + */ +extern utf8data_t utf8nfkdi(unsigned int); +extern utf8data_t utf8nfkdicf(unsigned int); + +/* + * Determine the maximum age of any unicode character in the string. + * Returns 0 if only unassigned code points are present. + * Returns -1 if the input is not valid UTF-8. + */ +extern int utf8agemax(utf8data_t, const char *); +extern int utf8nagemax(utf8data_t, const char *, size_t); + +/* + * Determine the minimum age of any unicode character in the string. + * Returns 0 if any unassigned code points are present. + * Returns -1 if the input is not valid UTF-8. 
+ */
+extern int utf8agemin(utf8data_t, const char *);
+extern int utf8nagemin(utf8data_t, const char *, size_t);
+
+/*
+ * Determine the length of the normalized form of the string,
+ * excluding any terminating NUL byte.
+ * Returns 0 if only ignorable code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern ssize_t utf8len(utf8data_t, const char *);
+extern ssize_t utf8nlen(utf8data_t, const char *, size_t);
+
+/*
+ * Cursor structure used by the normalizer.
+ */
+struct utf8cursor {
+	utf8data_t	data;
+	const char	*s;
+	const char	*p;
+	const char	*ss;
+	const char	*sp;
+	unsigned int	len;
+	unsigned int	slen;
+	short int	ccc;
+	short int	nccc;
+};
+
+/*
+ * Initialize a utf8cursor to normalize a string.
+ * Returns 0 on success.
+ * Returns -1 on failure.
+ */
+extern int utf8cursor(struct utf8cursor *, utf8data_t, const char *);
+extern int utf8ncursor(struct utf8cursor *, utf8data_t, const char *, size_t);
+
+/*
+ * Get the next byte in the normalization.
+ * Returns a value > 0 && < 256 on success.
+ * Returns 0 when the end of the normalization is reached.
+ * Returns -1 if the string being normalized is not valid UTF-8.
+ */
+extern int utf8byte(struct utf8cursor *);
+
+#endif /* UTF8NORM_H */
-- 
1.7.12.4
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 84+ messages in thread
* Re: [RFC v2] Unicode/UTF-8 support for XFS 2014-09-18 19:56 [RFC v2] Unicode/UTF-8 support for XFS Ben Myers @ 2014-09-22 14:55 ` Andi Kleen 2014-09-18 20:09 ` [PATCH 02/10] xfs: rename XFS_CMP_CASE to XFS_CMP_MATCH Ben Myers ` (15 subsequent siblings) 16 siblings, 0 replies; 84+ messages in thread From: Andi Kleen @ 2014-09-22 14:55 UTC (permalink / raw) To: Ben Myers; +Cc: linux-fsdevel, tinguely, olaf, xfs Ben Myers <bpm@sgi.com> writes: > > Strings are normalized using a trie that stores the relevant > information. The trie itself is about 250kB in size, and lives in a > separate module. So 250kB bloat -- and what does this fix exactly? Someone putting random ligatures into their file names and expecting the file to be the same as before. Can't they just not do that? -Andi ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [RFC v2] Unicode/UTF-8 support for XFS 2014-09-22 14:55 ` Andi Kleen (?) @ 2014-09-22 18:41 ` Ben Myers 2014-09-22 19:29 ` Andi Kleen -1 siblings, 1 reply; 84+ messages in thread From: Ben Myers @ 2014-09-22 18:41 UTC (permalink / raw) To: Andi Kleen; +Cc: linux-fsdevel, tinguely, olaf, xfs Hey Andi, On Mon, Sep 22, 2014 at 07:55:59AM -0700, Andi Kleen wrote: > Ben Myers <bpm@sgi.com> writes: > > > > Strings are normalized using a trie that stores the relevant > > information. The trie itself is about 250kB in size, and lives in a > > separate module. > > So 250kB bloat -- and what does this fix exactly? We're trying to address the size issue by only loading the module when it's needed, but yeah it's big. Open to suggestions on how best to deal with that. I understand the sticker shock. > Someone putting random ligatures into their file names and expecting > the file to be the same as before. Can't they just not do that? The ligature example that Olaf gave might seem kind of trivial, but for other characters and languages could it be more significant? As far as telling the customer "don't do that", my guess is that they would just go elsewhere. There are several other options for filesystems that support unicode. Thanks, Ben _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [RFC v2] Unicode/UTF-8 support for XFS
2014-09-22 18:41 ` Ben Myers
@ 2014-09-22 19:29 ` Andi Kleen
0 siblings, 0 replies; 84+ messages in thread
From: Andi Kleen @ 2014-09-22 19:29 UTC (permalink / raw)
To: Ben Myers; +Cc: Andi Kleen, linux-fsdevel, tinguely, olaf, xfs

> > So 250kB bloat -- and what does this fix exactly?
>
> We're trying to address the size issue by only loading the module when

I'm not sure this is really addressing it.

> it's needed, but yeah it's big. Open to suggestions on how best to deal
> with that. I understand the sticker shock.

I don't even understand why you need the whole table.

You want to not compare some special symbols, and a few other symbols
are equivalent to others. But most symbols are only identical to themselves.

Couldn't you have a much smaller table that only expresses
the exceptions?

> As far as telling the customer "don't do that", my guess is that they
> would just go elsewhere. There are several other options for
> filesystems that support unicode.

They could put some code into their user app that generates
a unique representation.

-Andi
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [RFC v2] Unicode/UTF-8 support for XFS 2014-09-22 19:29 ` Andi Kleen (?) @ 2014-09-23 16:13 ` Olaf Weber 2014-09-23 20:15 ` Andi Kleen -1 siblings, 1 reply; 84+ messages in thread From: Olaf Weber @ 2014-09-23 16:13 UTC (permalink / raw) To: Andi Kleen, Ben Myers; +Cc: linux-fsdevel, tinguely, xfs On 22-09-14 21:29, Andi Kleen wrote: >>> So 250kB bloat -- and what does this fix exactly? >> >> We're trying to address the size issue by only loading the module when > > I'm not sure this is really addressing it. You only pay the space cost if you use it, similar to the nls tables. >> it's needed, but yeah it's big. Open to suggestions on how best to deal >> with that. I understand the sticker shock. > > I don't even understand why you need the whole table. > > You want to not compare some special symbols, and a few other symbols > are equivalent to others. But most symbols are only identical to themselves. > > Couldn't you have a much smaller table that only expresses > the exceptions? The trie tells you whether a given sequence of bytes is a UTF-8 encoded unicode codepoint, and if so, it gives the unicode version in which the codepoint was assigned an interpretation (if any), the canonical combining class (required for normalization), and the decomposition and case fold (if any). A big part of the table does decompositions for Korean: eliminating the Hangul decompositions removes 156320 bytes, leaving 89936 bytes. Hangul decomposition uses two or three unicode code points and a terminating NUL byte in a UTF-8 string. The code points each require a three-byte UTF-8 sequence, so the total is 7 bytes per 2-part decomposition, and 10 bytes per 3-part decomposition. 
With that in mind, the 156320 additional bytes spent on Hangul are
accounted for as follows:

 22344 bytes : 11172 leaves * 2 byte leaf size
  2793 bytes : 399 2-part decompositions at 7 bytes each
107730 bytes : 10773 3-part decompositions at 10 bytes each

This adds up to 132867 bytes of data, with the remainder, 23453 bytes,
spent on additional internal trie nodes.

>> As far as telling the customer "don't do that", my guess is that they
>> would just go elsewhere. There are several other options for
>> filesystems that support unicode.
>
> They could put some code into their user app that generates
> a unique representation.

This assumes a single app, and that they control the source of that app.

Olaf

-- 
Olaf Weber                 SGI               Phone:  +31(0)30-6696796
                           Veldzigt 2b       Fax:    +31(0)30-6696799
Technical Lead             3454 PW de Meern  Vnet:   955-6796
Storage Software           The Netherlands   Email:  olaf@sgi.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [RFC v2] Unicode/UTF-8 support for XFS 2014-09-23 16:13 ` Olaf Weber @ 2014-09-23 20:15 ` Andi Kleen 2014-09-23 20:45 ` Ben Myers 2014-09-24 11:07 ` Olaf Weber 0 siblings, 2 replies; 84+ messages in thread From: Andi Kleen @ 2014-09-23 20:15 UTC (permalink / raw) To: Olaf Weber; +Cc: linux-fsdevel, Ben Myers, Andi Kleen, tinguely, xfs > You only pay the space cost if you use it, similar to the nls tables. The way Linux module loading works these things are loaded by default, not when someone needs it. So no, you (or rather every unfortunate Linux XFS user) would pay it always, unless you black list the module or rebuild the kernel. > A big part of the table does decompositions for Korean: eliminating > the Hangul decompositions removes 156320 bytes, leaving 89936 bytes. Are there regular ranges or other redundancies in the Korean encoding that could be used to compress paths? Doing some basic research other people already answered this: Please use the ICU or google tables referenced below. Apparently smaller is possible too, but 40-50k seems more reasonable. I'm just gonna make the claim that whatever performance you get from a larger table is dwarfed by the cache miss overhead. ---- http://macchiato.com/unicode/normalization_footprint.htm http://www.macchiato.com/unicode/nfc-faq NFC normalization requires large tables, right? Like many other cases, there is a tradeoff between size and performance. You can use very small tables, at some cost in performance. (Even there, the actual performance cost depends on how often normalization needs to be invoked, as discussed above.) To see an analysis of the situation, see Normalization Footprint. It is a bit out of date, but gives a sense of the magnitude. For comparison, ICU's optimized tables for NFC take 44 kB (UTF-16) and Google's optimized tables for NFC take 46 kB (UTF-8). 
-Andi _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [RFC v2] Unicode/UTF-8 support for XFS 2014-09-23 20:15 ` Andi Kleen @ 2014-09-23 20:45 ` Ben Myers 2014-09-24 11:07 ` Olaf Weber 1 sibling, 0 replies; 84+ messages in thread From: Ben Myers @ 2014-09-23 20:45 UTC (permalink / raw) To: Andi Kleen; +Cc: Olaf Weber, linux-fsdevel, tinguely, xfs Hi Andi, On Tue, Sep 23, 2014 at 10:15:40PM +0200, Andi Kleen wrote: > > You only pay the space cost if you use it, similar to the nls tables. > > The way Linux module loading works these things are loaded by default, > not when someone needs it. > > So no, you (or rather every unfortunate Linux XFS user) would pay it always, > unless you black list the module or rebuild the kernel. The suggestion to use symbol_get was made in response to the initial post of the series. In my testing the normalization module did not load until I attempted to mount a filesystem which required it. Seems to work ok. See patch 10, "xfs: implement demand load of utf8norm.ko". Thanks, Ben ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [RFC v2] Unicode/UTF-8 support for XFS
2014-09-23 20:15 ` Andi Kleen
2014-09-23 20:45 ` Ben Myers
@ 2014-09-24 11:07 ` Olaf Weber
2014-09-26 14:06 ` Olaf Weber
1 sibling, 1 reply; 84+ messages in thread
From: Olaf Weber @ 2014-09-24 11:07 UTC (permalink / raw)
To: Andi Kleen; +Cc: linux-fsdevel, Ben Myers, tinguely, xfs

On 23-09-14 22:15, Andi Kleen wrote:

>> A big part of the table does decompositions for Korean: eliminating
>> the Hangul decompositions removes 156320 bytes, leaving 89936 bytes.
>
> Are there regular ranges or other redundancies in the Korean encoding
> that could be used to compress paths?

Yes, though at the expense of more complicated code and interfaces. In
particular, lookups that want a normalized string would need to provide
a 10-byte buffer to store it in.

> Doing some basic research other people already answered this:
>
> Please use the ICU or google tables referenced below. Apparently
> smaller is possible too, but 40-50k seems more reasonable.

Riffing off the http://macchiato.com/unicode/normalization_footprint.htm
link you provided, let's look at the NFKD case.

For Unicode 3.0.0 that link gives 3483 NFKD normalizations (excluding
Hangul), and gives 26,918 bytes as the size of a simple lookup table
(key/offset pairs, with the offset pointing into a string table).

In Unicode 7.0.0 I count 5721 NFKD normalizations (again excluding
Hangul). As NUL-terminated UTF-8 strings these take 23390 bytes. Using
a key-offset table I need 3 bytes for the key (code points are 21 bits)
and 2 bytes for an offset. Total is 5721 * (3 + 2) + 23390 = 51995
bytes. Stealing 4 bits from the key field and 1 from the offset to
store the size of the normalized string I can remove the NUL bytes from
the string table and reduce total size to 46274 bytes.

The trie implementation used here would use 66283 bytes to store the
same information, but it also provides unicode version and canonical
combining class for all codepoints.
There are 10268 leaves in this case, with the size of each leaf being 1 byte for version, one for ccc, plus the size of the decomposition, if any. So a quick estimate on the space used just for the NFKD data is some 45747 bytes. I'm pretty certain that a trie that only stores the NFKD would be smaller than 45747 bytes as it would need fewer internal nodes, but didn't do the experiment. But as you can see, for just the NFKD part and excluding Hangul, the size of the trie is within the ballpark of the numbers you gave. Case folding adds a partial trie that forwards to the "main" trie for parts that are identical. This adds 2672 extra leaves and 20171 extra bytes. The data for the normalization corrections adds another 3328 bytes. With a bit of rounding, total size comes to 89840 bytes. > I'm just gonna make the claim that whatever performance you > get from a larger table is dwarfed by the cache miss overhead. That's possible, though it seems plausible you'd only suffer from this doing actual Hangul decomposition: all the data related to this (trie nodes, trie leaves, and strings) sits in one contiguous block in memory. Olaf -- Olaf Weber SGI Phone: +31(0)30-6696796 Veldzigt 2b Fax: +31(0)30-6696799 Technical Lead 3454 PW de Meern Vnet: 955-6796 Storage Software The Netherlands Email: olaf@sgi.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [RFC v2] Unicode/UTF-8 support for XFS
2014-09-24 11:07 ` Olaf Weber
@ 2014-09-26 14:06 ` Olaf Weber
0 siblings, 0 replies; 84+ messages in thread
From: Olaf Weber @ 2014-09-26 14:06 UTC (permalink / raw)
To: Andi Kleen; +Cc: linux-fsdevel, Ben Myers, tinguely, xfs

On 24-09-14 13:07, Olaf Weber wrote:
> On 23-09-14 22:15, Andi Kleen wrote:
>
>>> A big part of the table does decompositions for Korean: eliminating
>>> the Hangul decompositions removes 156320 bytes, leaving 89936 bytes.
>>
>> Are there regular ranges or other redundancies in the Korean encoding
>> that could be used to compress paths?
>
> Yes, though at the expense of more complicated code and interfaces. In
> particular, lookups that want a normalized string would need to provide a
> 10-byte buffer to store it in.

I spent some time working on this, and the effect on the lookup code
isn't as bad as I'd thought. The updated code should be posted early
next week.

With this change, the table size for the full trie becomes 89952 bytes.
Of this, 66400 bytes are spent on the NFKD + Ignorables, an additional
20992 bytes on NFKD + Ignorables + Case Fold. The remainder, 2560 bytes,
are additional info for older unicode versions.

Note that the NFKD + Ignorables + Case Fold trie forwards to the NFKD +
Ignorables where they overlap. A stand-alone version would be 71750
bytes.

As noted before these tables also contain the Canonical Combining Class
and unicode version information for the code points. The latter allows
for supporting multiple unicode versions using a single combined table.

Olaf

-- 
Olaf Weber                 SGI               Phone:  +31(0)30-6696796
                           Veldzigt 2b       Fax:    +31(0)30-6696799
Technical Lead             3454 PW de Meern  Vnet:   955-6796
Storage Software           The Netherlands   Email:  olaf@sgi.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [RFC v2] Unicode/UTF-8 support for XFS 2014-09-22 14:55 ` Andi Kleen (?) (?) @ 2014-09-23 13:01 ` Olaf Weber 2014-09-23 20:02 ` Andi Kleen -1 siblings, 1 reply; 84+ messages in thread From: Olaf Weber @ 2014-09-23 13:01 UTC (permalink / raw) To: Andi Kleen, Ben Myers; +Cc: linux-fsdevel, tinguely, xfs On 22-09-14 16:55, Andi Kleen wrote: > Ben Myers <bpm@sgi.com> writes: >> >> Strings are normalized using a trie that stores the relevant >> information. The trie itself is about 250kB in size, and lives in a >> separate module. > > So 250kB bloat -- and what does this fix exactly? > > Someone putting random ligatures into their file names and expecting > the file to be the same as before. Can't they just not do that? I like the 'office' example because it is applicable to English and easy to explain. Once you move away from English examples are much easier to come by. Take a Dutch name like 'Renée Soutendijk'. These two forms both spell Renée in UTF-8: 0x52 0x65 0x6E 0xC3 0xA9 0x65 0x52 0x65 0x6E 0x65 0xCC 0x81 0x65 The difference is LATIN SMALL LETTER E WITH ACUTE (U+00E9) LATIN SMALL LETTER E (U+0065) COMBINING ACUTE ACCENT (U+0301) and corresponds to the difference between NFC and NFD. These two forms both spell Soutendijk in UTF-8: 0x53 0x6F 0x75 0x74 0x65 0x6E 0x64 0x69 0x6A 0x6B 0x53 0x6F 0x75 0x74 0x65 0x6E 0x64 0xC4 0xB3 0x6B The difference is LATIN SMALL LETTER I (U+0069) LATIN SMALL LETTER J (U+006A) LATIN SMALL LIGATURE IJ (U+0133) and the former is the compatibility decomposition of the latter, the 'K' in NFKC/NFKD. Do accented letters count as random ligatures that people should just not use? The bulk of the table deals with Korean. 
Olaf -- Olaf Weber SGI Phone: +31(0)30-6696796 Veldzigt 2b Fax: +31(0)30-6696799 Technical Lead 3454 PW de Meern Vnet: 955-6796 Storage Software The Netherlands Email: olaf@sgi.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [RFC v2] Unicode/UTF-8 support for XFS 2014-09-23 13:01 ` Olaf Weber @ 2014-09-23 20:02 ` Andi Kleen 0 siblings, 0 replies; 84+ messages in thread From: Andi Kleen @ 2014-09-23 20:02 UTC (permalink / raw) To: Olaf Weber; +Cc: linux-fsdevel, Ben Myers, Andi Kleen, tinguely, xfs > The bulk of the table deals with Korean. Is a table of Korean exceptions 250k too? If not please find some other way to compress this. I'm sure there is redundancy somewhere. -Andi _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [RFC v2] Unicode/UTF-8 support for XFS 2014-09-18 19:56 [RFC v2] Unicode/UTF-8 support for XFS Ben Myers @ 2014-09-22 22:26 ` Dave Chinner 2014-09-18 20:09 ` [PATCH 02/10] xfs: rename XFS_CMP_CASE to XFS_CMP_MATCH Ben Myers ` (15 subsequent siblings) 16 siblings, 0 replies; 84+ messages in thread From: Dave Chinner @ 2014-09-22 22:26 UTC (permalink / raw) To: Ben Myers; +Cc: linux-fsdevel, tinguely, olaf, xfs On Thu, Sep 18, 2014 at 02:56:50PM -0500, Ben Myers wrote: > Hi, > > I'm posting this RFC for Unicode support in XFS on Olaf's behalf, as he > is busy with other projects. This is the second revision of the series. > The first is available here: > > http://oss.sgi.com/archives/xfs/2014-09/msg00169.html > > In response to the initial feedback, the changes in version 2 include: > > * linux-fsdevel in the To: line, > * Updated design notes, > * Separation of the fs-independent trie and support code into utf8norm.ko, > * A mechanism for loading the normalization module only when necessary. > > I'll post the whole series for completeness sake. Many on -fsdevel will > not be interested in the xfs-specific bits, but it may be helpful to > have the full series as an example and for testing purposes. > > First there is a set of kernel bits, then some libxfs/xfsprogs stuff, > and finally a test. (Note: I am not posting the unicode database files > due to their large size. There are scripts to download them from > unicode.org in the relevant commit headers.) > > TODO: Store the unicode version number of the filesystem on disk in the > super block. So, if the filesystem has to store the specific unicode version it was created with so that we know what version to put in trie lookups, again I'll ask: why are we loading the trie as a generic kernel module and not as metadata in the filesystem that is demand paged and cached? i.e. put the entire trie on disk, look up the specific conversion required for the name being compared, and then cache that conversion in memory. 
This makes repeated lookups much faster because the trie only contains conversions that are in use, the memory footprint is way lower and the conversions are guaranteed to be consistent for the life of the filesystem.... > Here are Olaf's design notes: > > ----------------------------------------------------------------------------- > Unicode/UTF-8 support for XFS > > So we had a customer request proper unicode support... > > > * What does "supporting unicode" actually mean? > > From a text processing point of view, what a filesystem does with > filenames is simple: it stores and retrieves them, and compares them > for equality. It may reject certain byte sequences as invalid > filenames (for example, no filename can contain an ASCII NUL). > > I've been taking it as a given that when a file is created with a > certain byte sequence as its name, then a subsequent directory listing > will contain that same byte sequence among the names listed. > > This leaves comparing names for equality, and in my view this is what > "supporting unicode" revolves about. > > The present state of affairs is that different byte sequences are > different filenames. This amounts to tolerating unicode without > actually supporting it. That's somewhat circular - using your own definition of "supported" to argue that your own definition is the right one.... > To support unicode we have to interpret filenames. What happens when > (part of) a filename cannot be interpreted? We can reject the > filename, interpret the parts we can, or punt and accept it as an > uninterpreted blob. > > Rejecting ill-formed filenames was my first choice, but I came around > on the issue: there are too many ways in which you can end up with > having to deal with ill-formed filenames that would leave a user with > no recourse but to move whatever they're doing to a different > filesystem. Unpacking a tarball with filenames in a different encoding > is an example. 
You still haven't addressed this: | So we accept invalid unicode in filenames, but only after failing to | parse them? Isn't this a potential vector for exploiting weaknesses | in application filename handling? i.e. unprivileged user writes | specially crafted invalid unicode filename to disk, setuid program | tries to parse it, invalid sequence triggers a buffer overflow bug | in setuid parser? apart from handwaving that userspace has to be able to handle invalid utf-8 already. Why should we let filesystems say "we fully understand and support utf8" and then allow them to accept and propagate invalid utf8 sequences and leave everyone else to have to clean up the mess? > Partial interpretation of an ill-formed filename just strikes me as > the kind of bad idea that most half-houses are. I admit that I have no > stronger objection to this than the fact that it makes the code even > more complicated and fragile. > > Which leaves "blob" as the preferred option by default for coping with > ill-formed filenames. And so can't be case-folded, leading to inconsistent behaviour of case-insensitive filename comparisons. I don't blindly subscribe to the robustness principle of "be liberal with what you accept". Being liberal means accepting malformed junk and then trying to make good. It's a fool's game - we've learnt time and time again that if we don't fully validate string inputs that we have to interpret then someone will find an exploit that utilises malformed strings. I don't think we should expose core kernel code to such structural weaknesses... > When comparing well-formed filenames, the question now becomes which > byte sequences are considered to be alternative spellings of the same > filename. This is where normalization forms come into play, and the > unicode standard has quite a bit to say about the subject. > > If all you're doing is comparison, then choosing NFD over NFC is easy, > because the former is easier to calculate than the latter. 
>
> If you want various spellings of "office" to compare equal, then
> picking NFKD over NFD for comparison is also an obvious
> choice. (Hand-picking individual compatibility forms is truly a bad
> idea.) Ways to spell "office": "o_f_f_i_c_e", "o_f_fi_c_e", and
> "o_ffi_c_e", using no ligatures, the fi ligature, or the ffi
> ligature. (Some fool thought it a good idea to add these ligatures to
> unicode, all we get to decide is how to cope.)

Yet normalised strings are only stable and hence comparable
if there are no unassigned code points in them. What happens when
userspace is not using the same version of unicode as the
filesystem and is using newer code points in its strings?
Normalisation fails, right?

And as an extension of using normalisation for case-folded
comparisons, how do we make case folding work with blobs that can't
be normalised? It seems to me that this just leads to the nasty
situation where some filenames are case sensitive and some aren't
based on what the filesystem thinks is valid utf-8. The worst part
is that userspace has no idea that the filesystem is making such
distinctions and so behaviour is not at all predictable or expected.

This is another point in favour of rejecting invalid utf-8 strings
and for keeping the translation tables stable within the
filesystem...

> The most contentious part is (should be) ignoring the codepoints with
> the Default_Ignorable_Code_Point property. I've included the list
> below. My argument, such as it is, is that these code points either
> have no visible rendering, or in cases like the soft hyphen, are only
> conditionally visible. The problem with these (as I see it) is that on
> seeing a filename that might contain them you cannot tell whether they
> are present. So I propose to ignore them for the purpose of comparing
> filenames for equality.

Which introduces a non-standard "visibility criteria" for
determining what should be or shouldn't be part of the normalised
string for comparison.
I don't see any real justification for stepping outside the standard unicode normalisation here - just because the user cannot see a character in a specific context does not mean that it is not significant to the application that created it. > Finally, case folding. First of all, it is optional. Then the issue is > that you either go the language-specific route, or simplify the task > by "just" doing a full casefold (C+F, in unicode parlance). Looking > around the net I tend to find that if you're going to do casefolding > at all, then a language-independent full casefold is preferred because > it is the most predictable option. See > http://www.w3.org/TR/charmod-norm/ for an example of that kind of > reasoning. Which says in section 2.4: "Some languages need case-folding to be tailored to meet specific linguistic needs". That implies that the case folding needs to be language aware and hence needs to be tied into the NLS subsystem for handling specific quirks like Turkic. I also note that it says in several places that C+F can result in a folded string of a different length. What happens when that folded string is longer than 255 bytes and hence longer than NAME_MAX? That's a bit of a nasty landmine for pathname string handling functions - developers are going to assume that pathname components are not longer than NAME_MAX, and if we are passing normalised strings around that is not a valid assumption.... > * XFS-specific design notes. ... > If the borgbit (the bit enabling legacy ASCII-based CI in XFS) is set > in the superblock, then case folding is added into the mix. This is > the nfkdicf normalization form mentioned above. It allows for the > creation of case-insensitive filesystems with UTF-8 support. Please don't overload existing superblock feature bits with multiple meanings. ASCII-CI is a stand-alone feature and is not in any way compatible with Unicode: Unicode-CI is a superset of Unicode support. 
So it really needs two new feature bits for Unicode and Unicode-CI, not just one for unicode. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 84+ messages in thread
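Editorial illustration (not part of the original message): the two
technical points above, that NFKD rather than NFD makes the ligature
spellings of "office" compare equal, and that a full casefold can push a
name past NAME_MAX, can be checked in userspace. The following is a
minimal sketch using Python's standard unicodedata module rather than
the kernel trie from the patch series. The helper name forms() and the
sample name are made up for the example, and the results depend on the
Unicode version shipped with the interpreter.

```python
import unicodedata

NAME_MAX = 255  # bytes; the usual Linux limit for one pathname component

# "office" spelled with no ligature, the fi ligature (U+FB01),
# and the ffi ligature (U+FB03).
SPELLINGS = ["office", "of\ufb01ce", "o\ufb03ce"]


def forms(names, form):
    """Return the set of distinct strings left after normalizing to `form`."""
    return {unicodedata.normalize(form, n) for n in names}


# NFD only applies canonical decompositions, so the ligatures survive.
nfd_forms = forms(SPELLINGS, "NFD")
# NFKD also applies compatibility decompositions, which expand the ligatures.
nfkd_forms = forms(SPELLINGS, "NFKD")

# A name that fits in NAME_MAX as stored but not after a full casefold:
# U+0390 takes 2 bytes in UTF-8 and folds to a sequence of three code points.
name = "\u0390" * 127
stored_len = len(name.encode("utf-8"))
folded_len = len(name.casefold().encode("utf-8"))

print("distinct names under NFD: ", len(nfd_forms))
print("distinct names under NFKD:", len(nfkd_forms))
print("stored fits in NAME_MAX:  ", stored_len <= NAME_MAX)
print("folded fits in NAME_MAX:  ", folded_len <= NAME_MAX)
```

The second half is why it matters whether the folded string is ever
stored or handed to code that assumes a NAME_MAX-sized buffer.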
* Re: [RFC v2] Unicode/UTF-8 support for XFS
2014-09-22 22:26 ` Dave Chinner
@ 2014-09-24 13:21 ` Olaf Weber
-1 siblings, 0 replies; 84+ messages in thread
From: Olaf Weber @ 2014-09-24 13:21 UTC (permalink / raw)
To: Dave Chinner, Ben Myers; +Cc: linux-fsdevel, tinguely, xfs
On 23-09-14 00:26, Dave Chinner wrote:
> On Thu, Sep 18, 2014 at 02:56:50PM -0500, Ben Myers wrote:

[...]

>> TODO: Store the unicode version number of the filesystem on disk in the
>> super block.
>
> So, if the filesystem has to store the specific unicode version it
> was created with so that we know what version to put in trie
> lookups, again I'll ask: why are we loading the trie as a generic
> kernel module and not as metadata in the filesystem that is demand
> paged and cached?

This way the trie can be shared, and the code using it is not
entangled with the XFS code.

> i.e. put the entire trie on disk, look up the specific conversion
> required for the name being compared, and then cache that conversion
> in memory. This makes repeated lookups much faster because the trie
> only contains conversions that are in use, the memory footprint is
> way lower and the conversions are guaranteed to be consistent for
> the life of the filesystem....

Above you mention demand paging parts of the trie, but here you seem
to suggest creating an in-core conversion table on the fly from data
read from disk. The former seems a lot easier to do than the latter.

>> Here are Olaf's design notes:
>>
>> -----------------------------------------------------------------------------
>> Unicode/UTF-8 support for XFS
>>
>> So we had a customer request proper unicode support...
>>
>>
>> * What does "supporting unicode" actually mean?
>>
>> From a text processing point of view, what a filesystem does with
>> filenames is simple: it stores and retrieves them, and compares them
>> for equality. It may reject certain byte sequences as invalid
>> filenames (for example, no filename can contain an ASCII NUL).
>>
>> I've been taking it as a given that when a file is created with a
>> certain byte sequence as its name, then a subsequent directory listing
>> will contain that same byte sequence among the names listed.
>>
>> This leaves comparing names for equality, and in my view this is what
>> "supporting unicode" revolves about.
>>
>> The present state of affairs is that different byte sequences are
>> different filenames. This amounts to tolerating unicode without
>> actually supporting it.
>
> That's somewhat circular - using your own definition of "supported"
> to argue that your own definition is the right one....
>
>> To support unicode we have to interpret filenames. What happens when
>> (part of) a filename cannot be interpreted? We can reject the
>> filename, interpret the parts we can, or punt and accept it as an
>> uninterpreted blob.
>>
>> Rejecting ill-formed filenames was my first choice, but I came around
>> on the issue: there are too many ways in which you can end up with
>> having to deal with ill-formed filenames that would leave a user with
>> no recourse but to move whatever they're doing to a different
>> filesystem. Unpacking a tarball with filenames in a different encoding
>> is an example.
>
> You still haven't addressed this:
>
> | So we accept invalid unicode in filenames, but only after failing to
> | parse them? Isn't this a potential vector for exploiting weaknesses
> | in application filename handling? i.e. unprivileged user writes
> | specially crafted invalid unicode filename to disk, setuid program
> | tries to parse it, invalid sequence triggers a buffer overflow bug
> | in setuid parser?
>
> apart from handwaving that userspace has to be able to handle
> invalid utf-8 already. Why should we let filesystems say "we fully
> understand and support utf8" and then allow them to accept and
> propagate invalid utf8 sequences and leave everyone else to have to
> clean up the mess?

Because the alternative amounts in my opinion to a demand that every
bit of userspace that may be involved in generating filenames generate
only clean UTF-8. I do not believe that this is a realistic demand at
this point in time.

>> Partial interpretation of an ill-formed filename just strikes me as
>> the kind of bad idea that most half-houses are. I admit that I have no
>> stronger objection to this than the fact that it makes the code even
>> more complicated and fragile.
>>
>> Which leaves "blob" as the preferred option by default for coping with
>> ill-formed filenames.
>
> And so can't be case-folded, leading to inconsistent behaviour of
> case-insensitive filename comparisons.
>
> I don't blindly subscribe to the robustness principle of "be liberal
> with what you accept". Being liberal means accepting malformed junk
> and then trying to make good. It's a fool's game - we've learnt time
> and time again that if we don't fully validate string inputs that we
> have to interpret then someone will find an exploit that utilises
> malformed strings. I don't think we should expose core kernel code
> to such structural weaknesses...

This is why I prefer not to interpret strings that are not UTF-8. I
just don't think we can afford to outright reject them.

>> When comparing well-formed filenames, the question now becomes which
>> byte sequences are considered to be alternative spellings of the same
>> filename. This is where normalization forms come into play, and the
>> unicode standard has quite a bit to say about the subject.
>>
>> If all you're doing is comparison, then choosing NFD over NFC is easy,
>> because the former is easier to calculate than the latter.
>>
>> If you want various spellings of "office" to compare equal, then
>> picking NFKD over NFD for comparison is also an obvious
>> choice. (Hand-picking individual compatibility forms is truly a bad
>> idea.) Ways to spell "office": "o_f_f_i_c_e", "o_f_fi_c_e", and
>> "o_ffi_c_e", using no ligatures, the fi ligature, or the ffi
>> ligature. (Some fool thought it a good idea to add these ligatures to
>> unicode, all we get to decide is how to cope.)
>
> Yet normalised strings are only stable and hence comparable
> if there are no unassigned code points in them. What happens when
> userspace is not using the same version of unicode as the
> filesystem and is using newer code points in its strings?
> Normalisation fails, right?

For the newer code points, yes. This is not treated as a failure to
normalize the string as a whole, as there are clear guidelines in
unicode on how unassigned code points interact with normalization:
they have canonical combining class 0 and no decomposition.

> And as an extension of using normalisation for case-folded
> comparisons, how do we make case folding work with blobs that can't
> be normalised? It seems to me that this just leads to the nasty
> situation where some filenames are case sensitive and some aren't
> based on what the filesystem thinks is valid utf-8. The worst part
> is that userspace has no idea that the filesystem is making such
> distinctions and so behaviour is not at all predictable or expected.

Making case-folding work on a blob that cannot be normalized is (in my
opinion) akin to doing an ASCII-based casefold on a Shift-JIS string:
the result is neither pretty nor useful.

> This is another point in favour of rejecting invalid utf-8 strings
> and for keeping the translation tables stable within the
> filesystem...

Bear in mind that this means not just rejecting invalid UTF-8 strings,
but also rejecting valid UTF-8 strings that encode unassigned code
points. This should be easy to implement if it is decided that we want
to do this.

>> The most contentious part is (should be) ignoring the codepoints with
>> the Default_Ignorable_Code_Point property. I've included the list
>> below. My argument, such as it is, is that these code points either
>> have no visible rendering, or in cases like the soft hyphen, are only
>> conditionally visible. The problem with these (as I see it) is that on
>> seeing a filename that might contain them you cannot tell whether they
>> are present. So I propose to ignore them for the purpose of comparing
>> filenames for equality.
>
> Which introduces a non-standard "visibility criteria" for
> determining what should be or shouldn't be part of the normalised
> string for comparison. I don't see any real justification for
> stepping outside the standard unicode normalisation here - just
> because the user cannot see a character in a specific context does
> not mean that it is not significant to the application that created
> it.

I agree these characters may be significant to the application. I'm
just not convinced that they should be significant in a file name.

>> Finally, case folding. First of all, it is optional. Then the issue is
>> that you either go the language-specific route, or simplify the task
>> by "just" doing a full casefold (C+F, in unicode parlance). Looking
>> around the net I tend to find that if you're going to do casefolding
>> at all, then a language-independent full casefold is preferred because
>> it is the most predictable option. See
>> http://www.w3.org/TR/charmod-norm/ for an example of that kind of
>> reasoning.
>
> Which says in section 2.4: "Some languages need case-folding to be
> tailored to meet specific linguistic needs". That implies that the
> case folding needs to be language aware and hence needs to be tied
> into the NLS subsystem for handling specific quirks like Turkic.

It also recommends just doing a full case fold for cases where you are
ignorant of the language actually in use. In section 3.1 they say:
"However, language-sensitive case-sensitive matching in document
formats and protocols is NOT RECOMMENDED because language information
can be hard to obtain, verify, or manage and the resulting operations
can produce results that frustrate users." This doesn't exactly
address the case of filesystems, but as far as I know there is no
defined interface that allows kernel code to query the locale settings
that currently apply to a userspace process.

> I also note that it says in several places that C+F can result in a
> folded string of a different length. What happens when that folded
> string is longer than 255 bytes and hence longer than NAME_MAX?
> That's a bit of a nasty landmine for pathname string handling
> functions - developers are going to assume that pathname components
> are not longer than NAME_MAX, and if we are passing normalised
> strings around that is not a valid assumption....

This is not just true for case folding: normalization may also change
string length, and NFD or NFKD will typically increase the length.
That is among the reasons why normalized and case folded strings are
not stored on disk, and are not passed up to other parts of the
kernel. The code posted will generate a normalized version of the
user-provided string used to look up data as a way to cache that
normalization and to reduce stack pressure a bit, but this string is
ephemeral and discarded once lookup is complete.

>> * XFS-specific design notes.
> ...
>> If the borgbit (the bit enabling legacy ASCII-based CI in XFS) is set
>> in the superblock, then case folding is added into the mix. This is
>> the nfkdicf normalization form mentioned above. It allows for the
>> creation of case-insensitive filesystems with UTF-8 support.
>
> Please don't overload existing superblock feature bits with multiple
> meanings. ASCII-CI is a stand-alone feature and is not in any way
> compatible with Unicode: Unicode-CI is a superset of Unicode
> support. So it really needs two new feature bits for Unicode and
> Unicode-CI, not just one for unicode.

It seemed an obvious extension of the meaning of that bit.

Olaf

-- 
Olaf Weber                 SGI               Phone:  +31(0)30-6696796
                           Veldzigt 2b       Fax:    +31(0)30-6696799
Technical Lead             3454 PW de Meern  Vnet:   955-6796
Storage Software           The Netherlands   Email:  olaf@sgi.com
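Editorial illustration (not part of the original message): the
comparison policy argued over in this exchange can be sketched in
userspace as a single key function. This is only a sketch under stated
assumptions. It uses Python's unicodedata module instead of the kernel
trie, compare_key() is a hypothetical name, IGNORABLE is a tiny
hand-picked subset of the Default_Ignorable_Code_Point list, and
applying NFKD again after the casefold is a simplification that is not
claimed to match the nfkdicf form in the patches.

```python
import unicodedata

# Hypothetical subset of Default_Ignorable_Code_Point, for illustration only:
# soft hyphen, zero width space/non-joiner/joiner, zero width no-break space.
IGNORABLE = {"\u00ad", "\u200b", "\u200c", "\u200d", "\ufeff"}


def compare_key(raw: bytes, casefold: bool = False) -> bytes:
    """Return the byte string a lookup would compare for this filename."""
    try:
        text = raw.decode("utf-8")  # strict: rejects ill-formed sequences
    except UnicodeDecodeError:
        # The "blob" option: not interpreted at all, compared byte for byte.
        return raw
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if ch not in IGNORABLE)
    if casefold:
        # Full casefold, then NFKD again because folding can denormalize.
        text = unicodedata.normalize("NFKD", text.casefold())
    return text.encode("utf-8")


# Ligature and soft-hyphen spellings collapse onto the plain spelling.
same_name = (compare_key("of\ufb01ce".encode())
             == compare_key("off\u00adice".encode())
             == compare_key(b"office"))

# U+0378 is assumed to be unassigned in the interpreter's Unicode version;
# it has no decomposition, so normalization leaves it alone.
unassigned = "a\u0378".encode()
passes_through = compare_key(unassigned) == unassigned

# Latin-1 bytes are not valid UTF-8, so these names stay blobs and remain
# case sensitive even when casefolding is requested.
blob_case_sensitive = (compare_key(b"CAF\xe9", casefold=True)
                       != compare_key(b"caf\xe9", casefold=True))

print(same_name, passes_through, blob_case_sensitive)
```

The last check is the inconsistency Dave objects to: whether a name is
case sensitive depends on whether it happens to decode as UTF-8. Because
the key is computed per lookup and never stored, its length is not
bounded by NAME_MAX, which matches the "ephemeral" string Olaf describes.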
* Re: [RFC v2] Unicode/UTF-8 support for XFS 2014-09-24 13:21 ` Olaf Weber @ 2014-09-24 23:10 ` Dave Chinner -1 siblings, 0 replies; 84+ messages in thread From: Dave Chinner @ 2014-09-24 23:10 UTC (permalink / raw) To: Olaf Weber; +Cc: Ben Myers, linux-fsdevel, tinguely, xfs On Wed, Sep 24, 2014 at 03:21:04PM +0200, Olaf Weber wrote: > On 23-09-14 00:26, Dave Chinner wrote: > >On Thu, Sep 18, 2014 at 02:56:50PM -0500, Ben Myers wrote: > > [...] > > >>TODO: Store the unicode version number of the filesystem on disk in the > >>super block. > > > >So, if the filesystem has to store the specific unicode version it > >was created with so that we know what version to put in trie > >lookups, again I'll ask: why are we loading the trie as a generic > >kernel module and not as metadata in the filesystem that is demand > >paged and cached? > > This way the trie can be shared, and the code using it is not > entangled with the XFS code. The trie parsing code can still be common - just the location and contents of the data is determined by the end-user. > > >i.e. put the entire trie on disk, look up the specific conversion > >required for the name being compared, and then cache that conversion > >in memory. This makes repeated lookups much faster because the trie > >only contains conversions that are in use, the memory footprint is > >way lower and the conversions are guaranteed to be consistent for > >the life of the filesystem.... > > Above you mention demand paging parts of the trie, but here you seem > to suggest creating an in-core conversion table on the fly from data > read from disk. The former seems a lot easier to do than the latter. Right - it's a question of what needs optimising. If people are only concerned about memory footprint, then demand paging solves that problem. If people are concerned about performance and memory footprint, then demand paging plus a lookaside cache will address both of those aspects. 
We can't do demand paging if the trie data is built into the kernel. We can still do a lookaside cache to avoid performance issues with repeated trie lookups... [...] > >>To support unicode we have to interpret filenames. What happens when > >>(part of) a filename cannot be interpreted? We can reject the > >>filename, interpret the parts we can, or punt and accept it as an > >>uninterpreted blob. > >> > >>Rejecting ill-formed filenames was my first choice, but I came around > >>on the issue: there are too many ways in which you can end up with > >>having to deal with ill-formed filenames that would leave a user with > >>no recourse but to move whatever they're doing to a different > >>filesystem. Unpacking a tarball with filenames in a different encoding > >>is an example. > > > >You still haven't addressed this: > > > >| So we accept invalid unicode in filenames, but only after failing to > >| parse them? Isn't this a potential vector for exploiting weaknesses > >| in application filename handling? i.e. unprivileged user writes > >| specially crafted invalid unicode filename to disk, setuid program > >| tries to parse it, invalid sequence triggers a buffer overflow bug > >| in setuid parser? > > > >apart from handwaving that userspace has to be able to handle > >invalid utf-8 already. Why should we let filesystems say "we fully > >understand and support utf8" and then allow them to accept and > >propagate invalid utf8 sequences and leave everyone else to have to > >clean up the mess? > > Because the alternative amounts in my opinion to a demand that every > bit of userspace that may be involved in generating filenames > generate only clean UTF-8. I do not believe that this is a realistic > demand at this point in time. It's a chicken and egg situation. I'd much prefer we enforce clean utf8 from the start, because if we don't we'll never be able to do that. And other filesystems (e.g. ZFS) allow you to reject anything that is not clean utf8.... [...]
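The two policies being argued over here (enforce clean UTF-8 at create time, or accept ill-formed names as uninterpreted blobs) can be sketched in a few lines of Python. This is only an illustration: `is_clean_utf8` and `check_name` are hypothetical names, not functions from the patch series, and Python's strict UTF-8 decoder stands in for whatever validation the kernel code would do.

```python
def is_clean_utf8(name: bytes) -> bool:
    """Return True if `name` is well-formed UTF-8.

    Python's strict decoder rejects truncated sequences, overlong
    encodings and encoded surrogates.
    """
    try:
        name.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def check_name(name: bytes, reject_invalid: bool) -> bytes:
    """Hypothetical create-time check.

    reject_invalid=True  models the "enforce clean utf8" policy.
    reject_invalid=False models the "accept it as a blob" policy.
    """
    if reject_invalid and not is_clean_utf8(name):
        raise ValueError("filename is not well-formed UTF-8")
    return name


if __name__ == "__main__":
    latin1_name = "café".encode("latin-1")   # e.g. from an old tarball
    print(is_clean_utf8("café".encode("utf-8")))
    print(is_clean_utf8(latin1_name))
    print(check_name(latin1_name, reject_invalid=False))
```

Under the strict policy the Latin-1 name from the tarball example is refused outright; under the permissive one it is stored byte-for-byte. Rejecting valid UTF-8 that encodes unassigned code points would additionally need the tables for one fixed Unicode version, which this sketch does not attempt.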
> >>When comparing well-formed filenames, the question now becomes which > >>byte sequences are considered to be alternative spellings of the same > >>filename. This is where normalization forms come into play, and the > >>unicode standard has quite a bit to say about the subject. > >> > >>If all you're doing is comparison, then choosing NFD over NFC is easy, > >>because the former is easier to calculate than the latter. > >> > >>If you want various spellings of "office" to compare equal, then > >>picking NFKD over NFD for comparison is also an obvious > >>choice. (Hand-picking individual compatibility forms is truly a bad > >>idea.) Ways to spell "office": "o_f_f_i_c_e", "o_f_fi_c_e", and > >>"o_ffi_c_e", using no ligatures, the fi ligature, or the ffi > >>ligature. (Some fool thought it a good idea to add these ligatures to > >>unicode, all we get to decide is how to cope.) > > > >Yet normalised strings are only stable and hence comparable > >if there are no unassigned code points in them. What happens when > >userspace is not using the same version of unicode as the > >filesystem and is using newer code points in it's strings? > >Normalisation fails, right? > > For the newer code points, yes. This is not treated as a failure to > normalize the string as a whole, as there are clear guidelines in > unicode on how unassigned code points interact with normalization: > they have canonical combining class 0 and no decomposition. And so effectively are not stable. Which is something we absolutely have to avoid for information stored on disk. i.e. you're using the normalised form to build the hash values in the lookup index in the directory structure, and so having unstable normalisation forms is just wrong. Hence we'd need to reject anything with unassigned code points.... > >And as an extension of using normalisation for case-folded > >comparisons, how do we make case folding work with blobs that can't > >be normalised? 
It seems to me that this just leads to the nasty > >situation where some filenames are case sensitive and some aren't > >based on what the filesystem thinks is valid utf-8. The worst part > >is that userspace has no idea that the filesystem is making such > >distinctions and so behaviour is not at all predictable or expected. > > Making case-folding work on a blob that cannot be normalized is (in > my opinion) akin to doing an ASCII-based casefold on a Shift-JIS > string: the result is neither pretty nor useful. Yes, that's exactly my point. > >This is another point in favour of rejecting invalid utf-8 strings > >and for keeping the translation tables stable within the > >filesystem... > > Bear in mind that this means not just rejecting invalid UTF-8 > strings, but also rejecting valid UTF-8 strings that encode > unassigned code points. And that's precisely what I'm suggesting: If we can't normalise the filename to a stable form then it cannot be used for hashing or case folding. That means it needs to be rejected, not treated as an opaque blob. The moment we start parsing filenames they are no longer opaque blobs and so all existing "filenames are opaque blobs" handling rules go out the window. They are now either valid so we can use them, or they are invalid and need to be rejected to avoid unpredictable and/or undesirable behaviour. > This should be easy to implement if it is decided that we want to do this. > > >>The most contentious part is (should be) ignoring the codepoints with > >>the Default_Ignorable_Code_Point property. I've included the list > >>below. My argument, such as it is, is that these code points either > >>have no visible rendering, or in cases like the soft hyphen, are only > >>conditionally visible. The problem with these (as I see it) is that on > >>seeing a filename that might contain them you cannot tell whether they > >>are present. So I propose to ignore them for the purpose of comparing > >>filenames for equality.
> > > >Which introduces a non-standard "visibility criterial" for > >determining what should be or shouldn't be part of the normalised > >string for comparison. I don't see any real justification for > >stepping outside the standard unicode normalisation here - just > >because the user cannot see a character in a specific context does > >not mean that it is not significant to the application that created > >it. > > I agree these characters may be significant to the application. I'm > just not convinced that they should be significant in a file name. They are significant to the case folding result, right? And therefore would be significant in a filename... > > >>Finally, case folding. First of all, it is optional. Then the issue is > >>that you either go the language-specific route, or simplify the task > >>by "just" doing a full casefold (C+F, in unicode parlance). Looking > >>around the net I tend to find that if you're going to do casefolding > >>at all, then a language-independent full casefold is preferred because > >>it is the most predictable option. See > >>http://www.w3.org/TR/charmod-norm/ for an example of that kind of > >>reasoning. > > > >Which says in section 2.4: "Some languages need case-folding to be > >tailored to meet specific linguistic needs". That implies that the > >case folding needs to be language aware and hence needs to be tied > >into the NLS subsystem for handling specific quirks like Turkic. > > It also recommends just doing a full case fold for cases where you > are ignorant of the language actually in use. In section 3.1 they > say: "However, language-sensitive case-sensitive matching in > document formats and protocols is NOT RECOMMENDED because language > information can be hard to obtain, verify, or manage and the > resulting operations can produce results that frustrate users." 
This > doesn't exactly address the case of filesystems, but as far as I > know there is no defined interface that allows kernel code to query > the locale settings that currently apply to a userspace process. Hence my comments about NLS integration. The NLS subsystem already has utf8 support with language-dependent case folding tables. All the current filesystems that deal with unicode (including case folding) use the NLS subsystem for conversions. Hmmm - looking at all the NLS code that does different utf format conversions first: what happens if an application is using UTF16 or UTF32 for its filename encoding rather than utf8? > >>* XFS-specific design notes. > >... > >>If the borgbit (the bit enabling legacy ASCII-based CI in XFS) is set > >>in the superblock, then case folding is added into the mix. This is > >>the nfkdicf normalization form mentioned above. It allows for the > >>creation of case-insensitive filesystems with UTF-8 support. > > > >Please don't overload existing superblock feature bits with multiple > >meanings. ASCII-CI is a stand-alone feature and is not in any way > >compatible with Unicode: Unicode-CI is a superset of Unicode > >support. So it really needs two new feature bits for Unicode and > >Unicode-CI, not just one for unicode. > > It seemed an obvious extension of the meaning of that bit. Feature bits refer to a specific on-disk format feature. If that bit is set, then that feature is present. In this case, it means the filesystem is using ascii-ci. If that bit is passed out to userspace via the geometry ioctl, then *existing applications* expect it to mean ascii-ci behaviour from the filesystem. If existing utilities read the flag field from disk (e.g. repair, metadump, db, etc.), they all expect it to mean ascii-ci, and will do stuff based on that specific meaning. We cannot redefine the meaning of a feature bit after the fact - we have lots of feature bits so there's no need to overload an existing one for this.
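The feature-bit argument can be made concrete with a toy model. Everything in this sketch is an assumption made for illustration: the `Feature` values are invented and are not the real XFS superblock bits, and `legacy_app_names_match` is a hypothetical stand-in for an existing application that knows about ASCII-CI and nothing else.

```python
from enum import IntFlag


class Feature(IntFlag):
    # Invented bit values for this sketch only; not the XFS on-disk values.
    ASCII_CI = 0x1
    UNICODE = 0x2
    UNICODE_CI = 0x4


def legacy_app_names_match(bits: int, a: bytes, b: bytes) -> bool:
    """How an existing application, written before Unicode support,
    might decide whether two names refer to the same directory entry."""
    if bits & Feature.ASCII_CI:
        # bytes.lower() only changes ASCII letters.
        return a.lower() == b.lower()
    return a == b


# RFC scheme: reuse the ASCII-CI bit to mean "Unicode case folding too".
overloaded = Feature.ASCII_CI | Feature.UNICODE
# Suggested scheme: two new bits, ASCII-CI left untouched.
separate = Feature.UNICODE | Feature.UNICODE_CI

upper = "É".encode("utf-8")
lower = "é".encode("utf-8")

if __name__ == "__main__":
    print(legacy_app_names_match(overloaded, b"README", b"readme"))
    print(legacy_app_names_match(overloaded, upper, lower))
    print(legacy_app_names_match(separate, b"README", b"readme"))
```

With the overloaded bit the legacy application confidently applies ASCII rules and concludes that "É" and "é" are different names, while a Unicode case-folding filesystem would treat them as the same entry. With separate bits the old meaning of ASCII-CI is never asserted, so the application at least does not act on a promise the filesystem is not making.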
Hmmm - another interesting question just popped into my head about metadump: file name obfuscation. What do unicode and utf8 mean for the hash collision calculation algorithm? Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 84+ messages in thread
* RE: [RFC v2] Unicode/UTF-8 support for XFS 2014-09-24 23:10 ` Dave Chinner (?) @ 2014-09-25 13:33 ` Zuckerman, Boris -1 siblings, 0 replies; 84+ messages in thread From: Zuckerman, Boris @ 2014-09-25 13:33 UTC (permalink / raw) To: Dave Chinner, Olaf Weber; +Cc: linux-fsdevel, Ben Myers, tinguely, xfs > -----Original Message----- > From: linux-fsdevel-owner@vger.kernel.org [mailto:linux-fsdevel- > owner@vger.kernel.org] On Behalf Of Dave Chinner > Sent: Wednesday, September 24, 2014 7:10 PM > To: Olaf Weber > Cc: Ben Myers; linux-fsdevel@vger.kernel.org; tinguely@sgi.com; xfs@oss.sgi.com > Subject: Re: [RFC v2] Unicode/UTF-8 support for XFS > > On Wed, Sep 24, 2014 at 03:21:04PM +0200, Olaf Weber wrote: > > On 23-09-14 00:26, Dave Chinner wrote: > > >On Thu, Sep 18, 2014 at 02:56:50PM -0500, Ben Myers wrote: > > > > [...] > > > > >>TODO: Store the unicode version number of the filesystem on disk in > > >>the super block. > > > > > >So, if the filesystem has to store the specific unicode version it > > >was created with so that we know what version to put in trie lookups, > > >again I'll ask: why are we loading the trie as a generic kernel > > >module and not as metadata in the filesystem that is demand paged and > > >cached? > > > > This way the trie can be shared, and the code using it is not > > entangled with the XFS code. > > The trie parsing code can still be common - just the location and contents of the data is > determined by the end-user. > Both of these approaches can co-exist (as I recall, this was done in Windows). A system may have "a default trie" and a caller (name space) can provide its own... _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [RFC v2] Unicode/UTF-8 support for XFS 2014-09-24 23:10 ` Dave Chinner (?) (?) @ 2014-09-26 14:50 ` Olaf Weber 2014-09-26 16:56 ` Christoph Hellwig -1 siblings, 1 reply; 84+ messages in thread From: Olaf Weber @ 2014-09-26 14:50 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-fsdevel, Ben Myers, tinguely, xfs On 25-09-14 01:10, Dave Chinner wrote: > On Wed, Sep 24, 2014 at 03:21:04PM +0200, Olaf Weber wrote: >> On 23-09-14 00:26, Dave Chinner wrote: >>> On Thu, Sep 18, 2014 at 02:56:50PM -0500, Ben Myers wrote: >> >> [...] >> >>>> TODO: Store the unicode version number of the filesystem on disk in the >>>> super block. >>> >>> So, if the filesystem has to store the specific unicode version it >>> was created with so that we know what version to put in trie >>> lookups, again I'll ask: why are we loading the trie as a generic >>> kernel module and not as metadata in the filesystem that is demand >>> paged and cached? >> >> This way the trie can be shared, and the code using it is not >> entangled with the XFS code. > > The trie parsing code can still be common - just the location and > contents of the data is determined by the end-user. I'm not sure how common the parsing code can be if it needs to be capable of retrieving data from a filesystem. Note that, given your and Andi Kleen's feedback on the trie size, I've switched to doing algorithmic decomposition for Hangul. This reduces the size of the trie to 89952 bytes. In addition, if you store the trie in the filesystem, then the only part that needs storing is the version for that particular filesystem, e.g. no compatibility info for different unicode versions would be required. This would reduce the trie size to about 50kB for case-sensitive filesystems, and about 55kB on case-folding filesystems. [...] >>> [...]
Why should we let filesystems say "we fully >>> understand and support utf8" and then allow them to accept and >>> propagate invalid utf8 sequences and leave everyone else to have to >>> clean up the mess? >> >> Because the alternative amounts in my opinion to a demand that every >> bit of userspace that may be involved in generating filenames >> generate only clean UTF-8. I do not believe that this is a realistic >> demand at this point in time. > > It's a chicken and egg situation. I'd much prefer we enforce clean > utf8 from the start, because if we don't we'll never be able to do > that. And other filesystems (e.g. ZFS) allow you to do reject > anything that is not clean utf8.... As I understand it, this is optional in ZFS. I wonder what people's experiences are with this. [...] >>> Yet normalised strings are only stable and hence comparable >>> if there are no unassigned code points in them. What happens when >>> userspace is not using the same version of unicode as the >>> filesystem and is using newer code points in it's strings? >>> Normalisation fails, right? >> >> For the newer code points, yes. This is not treated as a failure to >> normalize the string as a whole, as there are clear guidelines in >> unicode on how unassigned code points interact with normalization: >> they have canonical combining class 0 and no decomposition. > > And so effectively are not stable. Which is something we absolutely > have to avoid for information stored on disk. i.e. you're using the > normalised form to build the hash values in the lookup index in the > directory structure, and so having unstable normalisation forms is > just wrong. Hence we'd need to reject anything with unassigned code > points.... On a particular filesystem, the calculated normalization would be stable. >>> And as an extension of using normalisation for case-folded >>> comparisons, how do we make case folding work with blobs that can't >>> be normalised? 
It seems to me that this just leads to the nasty >>> situation where some filenames are case sensitive and some aren't >>> based on what the filesystem thinks is valid utf-8. The worst part >>> is that userspace has no idea that the filesystem is making such >>> distinctions and so behaviour is not at all predictable or expected. >> >> Making case-folding work on a blob that cannot be normalized is (in >> my opinion) akin to doing an ASCII-based casefold on a Shift-JIS >> string: the result is neither pretty nor useful. > > Yes, that's exactly my point. But apparently we draw different conclusions from it. >>> This is another point in favour of rejecting invalid utf-8 strings >>> and for keeping the translation tables stable within the >>> filesystem... >> >> Bear in mind that this means not just rejecting invalid UTF-8 >> strings, but also rejecting valid UTF-8 strings that encode >> unassigned code points. > > And that's precisely what I'm suggesting: If we can't normalise the > filename to a stable form then it cannot be used for hashing or case > folding. That means it needs to be rejected, not treated as an > opaque blob. > > The moment we start parsing filenames they are no longer opaque > blobs and so all existing "filename are opaque blobs" handling rules > go out the window. They are now either valid so we can use them, or > they are invalid and need to be rejected to avoid unpredictable > and/or undesirable behaviour. At this point I'd really like other people to weigh in on this and get a sense of how sentiment is spread on the question. - Forbid non-UTF-8 filenames - Allow non-UTF-8 filenames - Make it a mount option - Make it a mkfs option [...] >>>> The most contentious part is (should be) ignoring the codepoints with >>>> the Default_Ignorable_Code_Point property. I've included the list >>>> below. 
My argument, such as it is, is that these code points either >>>> have no visible rendering, or in cases like the soft hyphen, are only >>>> conditionally visible. The problem with these (as I see it) is that on >>>> seeing a filename that might contain them you cannot tell whether they >>>> are present. So I propose to ignore them for the purpose of comparing >>>> filenames for equality. >>> >>> Which introduces a non-standard "visibility criterial" for >>> determining what should be or shouldn't be part of the normalised >>> string for comparison. I don't see any real justification for >>> stepping outside the standard unicode normalisation here - just >>> because the user cannot see a character in a specific context does >>> not mean that it is not significant to the application that created >>> it. >> >> I agree these characters may be significant to the application. I'm >> just not convinced that they should be significant in a file name. > > They are significant to the case folding result, right? And > therefore would be significant in a filename... Case Folding doesn't affect the ignorables, so in that sense at least they're not significant to the case folding result, even if you do not ignore them. [...] > Hence my comments about NLS integration. The NLS subsystem already > has utf8 support with language dependent case folding tables. All the > current filesystems that deal with unicode (including case folding) > use the NLS subsystem for conversions. Looking at the NLS subsystem I see support for translating a number of different encodings ("code pages") to unicode and back. There is support for uppercase/lowercase translation for a number of those encodings. Which is not the same as language dependent case folding. As for a unicode case fold, I see no support at all. In nls_utf8.c the uppercase/lowercase mappings are set to the identity maps. I see no support for unicode normalization forms either. 
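The comparison rules discussed so far (NFKD, optional full case folding, ignoring default-ignorable code points) can be tried out from userspace. The following is only a sketch under stated assumptions: `compare_key` is a hypothetical helper, `IGNORABLE_SUBSET` is a small hand-picked subset of the Default_Ignorable_Code_Point list rather than the full property, and the result follows whatever Unicode version the Python interpreter ships, not the tables in the posted series. It makes no claim to reproduce the exact nfkdi/nfkdicf forms of the patches.

```python
import unicodedata

# Hand-picked subset of Default_Ignorable_Code_Point, for illustration only:
# soft hyphen, combining grapheme joiner, zero width space/non-joiner/joiner,
# word joiner, zero width no-break space.
IGNORABLE_SUBSET = {
    "\u00ad", "\u034f", "\u200b", "\u200c", "\u200d", "\u2060", "\ufeff",
}


def compare_key(name: str, casefold: bool = False) -> str:
    """Return a string that is equal for names considered equivalent."""
    name = unicodedata.normalize("NFKD", name)
    if casefold:
        # str.casefold() performs a language-independent full case fold;
        # normalize again because folding can undo normalization.
        name = unicodedata.normalize("NFKD", name.casefold())
    return "".join(ch for ch in name if ch not in IGNORABLE_SUBSET)


if __name__ == "__main__":
    spellings = ["office", "of\ufb01ce", "o\ufb03ce", "of\u00adfice"]
    print({compare_key(s) for s in spellings})
    print(compare_key("OFFICE", casefold=True) == compare_key("o\ufb03ce", casefold=True))
    # Case folding leaves the ignorables in this subset untouched.
    print(all(ch.casefold() == ch for ch in IGNORABLE_SUBSET))
```

The three "office" spellings from the design notes, plus one containing a soft hyphen, all map to the same key. Case only matters when `casefold` is off.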
> Hmmm - looking at all the NLS code that does different utf format > conversions first: what happens if an application is using UTF16 or > UTF32 for it's filename encoding rather than utf8? Since UTF-16 and UTF-32 strings contain embedded 0 bytes, those encodings cannot be used to pass a filename across the kernel/userspace interface. >>>> * XFS-specific design notes. >>> ... >>>> If the borgbit (the bit enabling legacy ASCII-based CI in XFS) is set >>>> in the superblock, then case folding is added into the mix. This is >>>> the nfkdicf normalization form mentioned above. It allows for the >>>> creation of case-insensitive filesystems with UTF-8 support. >>> >>> Please don't overload existing superblock feature bits with multiple >>> meanings. ASCII-CI is a stand-alone feature and is not in any way >>> compatible with Unicode: Unicode-CI is a superset of Unicode >>> support. So it really needs two new feature bits for Unicode and >>> Unicode-CI, not just one for unicode. >> >> It seemed an obvious extension of the meaning of that bit. > > Feature bits refer to a specific on disk format feature. If that bit > is set, then that feature is present. In this case, it means the > filesystem is using ascii-ci. If that bit is passed out to > userspace via the geometry ioctl, then *existing applications* > expect it to mean ascii-ci behaviour from the filesystem. If an > existing utility reads the flag field from disk (e.g. repair, > metadump, db, etc) they all expect it to mean ascii-ci, and will do > stuff based on that specific meaning. We cannot redefine the meaning > of a feature bit after the fact - we have lots of feature bits so > there's no need to overload an existing one for this. Good point. > Hmmm - another interesting question just popped into my head about > metadump: file name obfuscation. What does unicode and utf8 mean > for the hash collision calculation algorithm? Good question. 
Olaf -- Olaf Weber SGI Phone: +31(0)30-6696796 Veldzigt 2b Fax: +31(0)30-6696799 Technical Lead 3454 PW de Meern Vnet: 955-6796 Storage Software The Netherlands Email: olaf@sgi.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 84+ messages in thread
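
The remark above about UTF-16 and UTF-32 can be checked in a few lines of C. This is only a sketch: the byte arrays are written out by hand for the two-letter name "ab", and the helper name visible_len is hypothetical. It relies on nothing more than the fact that a pathname crosses the kernel/userspace boundary as a NUL-terminated byte string.

```c
#include <stddef.h>
#include <string.h>

/* "ab" encoded by hand as UTF-16LE: each ASCII letter is followed by 0x00. */
static const char ab_utf16le[] = { 'a', 0x00, 'b', 0x00, 0x00, 0x00 };

/* The same name in UTF-8: no zero byte before the terminator. */
static const char ab_utf8[] = "ab";

/*
 * Length of a name as seen by any interface that takes a NUL-terminated
 * C string. Everything after the first zero byte is invisible to it.
 */
static size_t visible_len(const char *name)
{
	return strlen(name);
}
```

Called on ab_utf16le, visible_len() stops at the zero byte that follows 'a', so only one byte of the name would ever reach the filesystem; called on ab_utf8 it sees both letters.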
* Re: [RFC v2] Unicode/UTF-8 support for XFS
2014-09-26 14:50 ` Olaf Weber
@ 2014-09-26 16:56 ` Christoph Hellwig
0 siblings, 0 replies; 84+ messages in thread
From: Christoph Hellwig @ 2014-09-26 16:56 UTC (permalink / raw)
To: Olaf Weber; +Cc: Dave Chinner, Ben Myers, linux-fsdevel, tinguely, xfs

On Fri, Sep 26, 2014 at 04:50:39PM +0200, Olaf Weber wrote:
> I'm not sure how common the parsing code can be if it needs to be capable of
> retrieving data from a filesystem.
>
> Note given your and Andi Kleen's feedback on the trie size I've switched to
> doing algorithmic decomposition for Hangul. This reduces the size of the
> trie to 89952 bytes.
>
> In addition, if you store the trie in the filesystem, then the only part
> that needs storing is the version for that particular filesystem, e.g. no
> compatibility info for different unicode versions would be required. This
> would reduce the trie size to about 50kB for case-sensitive filesystems, and
> about 55kB on case-folding filesystems.

Honestly I wouldn't worry about demand loading it too much. This is
fairly special-case code for NAS servers, and should not affect normal
uses now that we use symbol_get. Let's get back to the fundamentals.

> >It's a chicken and egg situation. I'd much prefer we enforce clean
> >utf8 from the start, because if we don't we'll never be able to do
> >that. And other filesystems (e.g. ZFS) allow you to reject
> >anything that is not clean utf8....
>
> As I understand it, this is optional in ZFS. I wonder what people's
> experiences are with this.

It is as optional as your utf8 support for XFS is. But they do
enforce valid utf8 if they use utf8 normalization for file name
comparisons, be that case sensitive or insensitive. Take a look at the
zfs(8) man page.

> - Forbid non-UTF-8 filenames
> - Allow non-UTF-8 filenames
> - Make it a mount option
> - Make it a mkfs option

My take on this is:

 - I think we'll have to prevent non-utf8 file names for any cases where
   we use utf8 normalization. If you do not use utf8 normalization
   it's plain old Unix: everything is allowed.

 - I think utf8 normalization vs not should be a mkfs option, to make sure
   everyone including kernel and repair knows what sort of filesystem
   they are dealing with.

 - case insensitive matching for utf8 normalized filesystems should be
   a runtime decision. mount time for now, but Samba people would be
   extremely happy to allow per-operation or per-process CI matching.
   But that is another totally different discussion I'd like to keep
   separate, I just want to make sure the disk format allows for it for
   now.
* Re: [RFC v2] Unicode/UTF-8 support for XFS
2014-09-26 16:56 ` Christoph Hellwig
(?)
@ 2014-09-26 17:04 ` Jeremy Allison
2014-09-26 17:06 ` Christoph Hellwig
2014-09-26 19:37 ` Olaf Weber
-1 siblings, 2 replies; 84+ messages in thread
From: Jeremy Allison @ 2014-09-26 17:04 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: tinguely, xfs, Ben Myers, Olaf Weber, linux-fsdevel

On Fri, Sep 26, 2014 at 09:56:05AM -0700, Christoph Hellwig wrote:
>
> My take on this is:
>
>  - I think we'll have to prevent non-utf8 file names for any cases where
>    we use utf8 normalization. If you do not use utf8 normalization
>    it's plain old Unix: everything is allowed.
>
>  - I think utf8 normalization vs not should be a mkfs option, to make sure
>    everyone including kernel and repair knows what sort of filesystem
>    they are dealing with.
>
>  - case insensitive matching for utf8 normalized filesystems should be
>    a runtime decision. mount time for now, but Samba people would be
>    extremely happy to allow per-operation or per-process CI matching.
>    But that is another totally different discussion I'd like to keep
>    separate, I just want to make sure the disk format allows for it for
>    now.

Actually, I'm so eager for case-insensitive matching I'd
take "at format time", as with ZFS :-) :-).

Having CI matching can speed up Samba operations by a
factor of 10 on large directories (warning, number made
up, depending on the number of entries per dir :-).

Jeremy.
* Re: [RFC v2] Unicode/UTF-8 support for XFS
2014-09-26 17:04 ` Jeremy Allison
@ 2014-09-26 17:06 ` Christoph Hellwig
2014-09-26 17:13 ` Jeremy Allison
2014-09-26 19:37 ` Olaf Weber
1 sibling, 1 reply; 84+ messages in thread
From: Christoph Hellwig @ 2014-09-26 17:06 UTC (permalink / raw)
To: Jeremy Allison; +Cc: tinguely, xfs, Ben Myers, Olaf Weber, linux-fsdevel

On Fri, Sep 26, 2014 at 10:04:07AM -0700, Jeremy Allison wrote:
> Actually, I'm so eager for case-insensitive matching I'd
> take "at format time", as with ZFS :-) :-).

You already get this with XFS as long as you limit yourself to
7-bit ASCII :)

And utf-8 with Olaf's patches as-is.

Maybe time to give them some testing?
* Re: [RFC v2] Unicode/UTF-8 support for XFS
2014-09-26 17:06 ` Christoph Hellwig
@ 2014-09-26 17:13 ` Jeremy Allison
0 siblings, 0 replies; 84+ messages in thread
From: Jeremy Allison @ 2014-09-26 17:13 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jeremy Allison, Olaf Weber, Dave Chinner, Ben Myers, linux-fsdevel, tinguely, xfs

On Fri, Sep 26, 2014 at 10:06:04AM -0700, Christoph Hellwig wrote:
> On Fri, Sep 26, 2014 at 10:04:07AM -0700, Jeremy Allison wrote:
> > Actually, I'm so eager for case-insensitive matching I'd
> > take "at format time", as with ZFS :-) :-).
>
> You already get this with XFS as long as you limit yourself to
> 7-bit ASCII :)

Thank you for playing, here's a $10 gift token...

No, that won't do I'm afraid :-).

> And utf-8 with Olaf's patches as-is.
>
> Maybe time to give them some testing?

I might do that, once I've finished rebuilding my
home server (remember kids, RAID5 is *NOT* a backup :-) :-).
* Re: [RFC v2] Unicode/UTF-8 support for XFS
2014-09-26 17:04 ` Jeremy Allison
2014-09-26 17:06 ` Christoph Hellwig
@ 2014-09-26 19:37 ` Olaf Weber
2014-09-26 19:46 ` Jeremy Allison
2014-09-29 11:06 ` Christoph Hellwig
1 sibling, 2 replies; 84+ messages in thread
From: Olaf Weber @ 2014-09-26 19:37 UTC (permalink / raw)
To: Jeremy Allison, Christoph Hellwig; +Cc: linux-fsdevel, Ben Myers, tinguely, xfs

On 26-09-14 19:04, Jeremy Allison wrote:
> On Fri, Sep 26, 2014 at 09:56:05AM -0700, Christoph Hellwig wrote:
>>
>> My take on this is:
>>
>>  - I think we'll have to prevent non-utf8 file names for any cases where
>>    we use utf8 normalization. If you do not use utf8 normalization
>>    it's plain old Unix: everything is allowed.
>>
>>  - I think utf8 normalization vs not should be a mkfs option, to make sure
>>    everyone including kernel and repair knows what sort of filesystem
>>    they are dealing with.
>>
>>  - case insensitive matching for utf8 normalized filesystems should be
>>    a runtime decision. mount time for now, but Samba people would be
>>    extremely happy to allow per-operation or per-process CI matching.
>>    But that is another totally different discussion I'd like to keep
>>    separate, I just want to make sure the disk format allows for it for
>>    now.
>
> Actually, I'm so eager for case-insensitive matching I'd
> take "at format time", as with ZFS :-) :-).

My argument against "mount time case-insensitivity" and for "mkfs time
case-insensitivity" is related to switching from the case-sensitive
domain to the case-insensitive one.

For case-sensitive, from "README" to "readme" there are 64 different
possible filenames. Let's say you create 63 out of these 64. Now remount
the filesystem case-insensitive, and try to open by the 64th version of
"readme". It is not an exact match for any of the 63 candidate files,
and a case-insensitive match to all 63 candidate files. Which of these
63 files should be opened, and why that one in particular?

> Having CI matching can speed up Samba operations by a
> factor of 10 on large directories (warning, number made
> up, depending on the number of entries per dir :-).

I really want that to be true, but the proof of the pudding...

Olaf

-- 
Olaf Weber                 SGI               Phone:  +31(0)30-6696796
                           Veldzigt 2b       Fax:    +31(0)30-6696799
Technical Lead             3454 PW de Meern  Vnet:   955-6796
Storage Software           The Netherlands   Email:  olaf@sgi.com
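
The "README" counting argument in the message above can be replayed in plain C. The following is a minimal sketch restricted to ASCII; every function name in it is hypothetical and none of it comes from the patch series. It enumerates the 2^6 = 64 case variants of a six-letter name, pretends all but the last exist on disk, and counts how many of them match the last one exactly and how many match it case-insensitively.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Number of distinct ASCII case variants of name: 2^(number of letters). */
static unsigned long case_variant_count(const char *name)
{
	unsigned long n = 1;

	for (; *name; name++)
		if (isalpha((unsigned char)*name))
			n *= 2;
	return n;
}

/*
 * Write variant number idx of name into out, which must have room for
 * strlen(name) + 1 bytes. Bit i of idx selects upper case for the i-th
 * letter; non-letters are copied unchanged.
 */
static void case_variant(const char *name, unsigned long idx, char *out)
{
	for (; *name; name++, out++) {
		unsigned char c = (unsigned char)*name;

		if (isalpha(c)) {
			*out = (idx & 1) ? (char)toupper(c) : (char)tolower(c);
			idx >>= 1;
		} else {
			*out = (char)c;
		}
	}
	*out = '\0';
}

/* ASCII-only case-insensitive equality. */
static int ci_equal(const char *a, const char *b)
{
	for (; *a && *b; a++, b++)
		if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
			return 0;
	return *a == *b;
}

/*
 * Pretend variants 0..count-2 of name exist and look up the last variant.
 * Reports how many existing names match it exactly and how many match it
 * case-insensitively. Names of 64 bytes or more are not handled and
 * leave both counters at zero.
 */
static void lookup_last_variant(const char *name, unsigned long *exact,
				unsigned long *ci)
{
	unsigned long count = case_variant_count(name);
	char target[64], candidate[64];
	unsigned long i;

	*exact = *ci = 0;
	if (strlen(name) >= sizeof(target))
		return;
	case_variant(name, count - 1, target);
	for (i = 0; i + 1 < count; i++) {
		case_variant(name, i, candidate);
		*exact += strcmp(candidate, target) == 0;
		*ci += ci_equal(candidate, target);
	}
}
```

For "readme", lookup_last_variant() reports no exact match and 63 case-insensitive ones, which is the ambiguity described above: after a remount nothing on disk says which of the 63 files the lookup should return.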
* Re: [RFC v2] Unicode/UTF-8 support for XFS
2014-09-26 19:37 ` Olaf Weber
@ 2014-09-26 19:46 ` Jeremy Allison
2014-09-26 20:03 ` Olaf Weber
2014-09-29 11:06 ` Christoph Hellwig
1 sibling, 1 reply; 84+ messages in thread
From: Jeremy Allison @ 2014-09-26 19:46 UTC (permalink / raw)
To: Olaf Weber
Cc: tinguely, xfs, Christoph Hellwig, Ben Myers, linux-fsdevel, Jeremy Allison

On Fri, Sep 26, 2014 at 09:37:11PM +0200, Olaf Weber wrote:
>
> My argument against "mount time case-insensitivity" and for "mkfs
> time case-insensitivity" is related to switching from the
> case-sensitive domain to the case-insensitive one.
>
> For case-sensitive, from "README" to "readme" there are 64 different
> possible filenames. Let's say you create 63 out of these 64. Now
> remount the filesystem case-insensitive, and try to open by the 64th
> version of "readme". It is not an exact match for any of the 63
> candidate files, and a case-insensitive match to all 63 candidate
> files. Which of these 63 files should be opened, and why that one in
> particular?

I'm ok with "mkfs time case-insensitivity" - really !
Most of my OEMs would set that and claim victory (few
of them care much about NFS semantics :-).

> >Having CI matching can speed up Samba operations by a
> >factor of 10 on large directories (warning, number made
> >up, depending on the number of entries per dir :-).
>
> I really want that to be true, but the proof of the pudding...

No it really *is* true. The reason I can't give
exact numbers is it depends on the number of entries.

Remember, for every cache *miss*, we have to scan
the entire directory.

So a user asks for README, and we attempt that
and it fails. So now we have to enumerate the
entire directory to see if READMe (or any other
case variant) exists.

Now do that in a directory with 10, 100, 1000,
.... 10000000 existing files (don't laugh, I've
seen an application for Music files that did
*exactly* that). On a case insensitive filesystem
you just request README and you're done.

Certain vendors who shall remain nameless :-)
created test cases of just this example to
show how much storage on Linux sucks. Not
a happy camper about that - and telling them
to use ZFS on FreeBSD or Solaris just doesn't
feel right :-).

Jeremy.
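
The cost Jeremy describes can be illustrated with a toy model. This is a sketch under stated assumptions, not Samba or XFS code: the directory is a bytewise-sorted array, a binary search stands in for whatever index the filesystem uses for exact lookups, and the names toy_dir, lookup_exact and lookup_ci_by_scan are hypothetical. The compares counter records how many name comparisons each lookup costs.

```c
#include <stddef.h>
#include <string.h>
#include <strings.h>	/* strcasecmp (POSIX) */

/* A toy directory: names sorted bytewise so an exact lookup can bisect. */
struct toy_dir {
	const char **names;
	size_t count;
	size_t compares;	/* name comparisons performed so far */
};

/* Exact-match lookup: binary search stands in for the filesystem's index. */
static const char *lookup_exact(struct toy_dir *d, const char *name)
{
	size_t lo = 0, hi = d->count;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		int c = strcmp(name, d->names[mid]);

		d->compares++;
		if (c == 0)
			return d->names[mid];
		if (c < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return NULL;
}

/*
 * What a userspace server has to do on top of a case-sensitive
 * filesystem: try the exact name first and, on a miss, walk every
 * entry looking for a case-insensitive match.
 */
static const char *lookup_ci_by_scan(struct toy_dir *d, const char *name)
{
	const char *hit = lookup_exact(d, name);
	size_t i;

	if (hit)
		return hit;
	for (i = 0; i < d->count; i++) {
		d->compares++;
		if (strcasecmp(name, d->names[i]) == 0)
			return d->names[i];
	}
	return NULL;
}
```

In this model a lookup for a name that does not exist in any case costs the failed exact lookup plus one comparison per directory entry, so the miss path grows linearly with directory size, while a filesystem that matches case-insensitively itself would answer from its index alone.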
* Re: [RFC v2] Unicode/UTF-8 support for XFS
2014-09-26 19:46 ` Jeremy Allison
@ 2014-09-26 20:03 ` Olaf Weber
2014-09-29 20:16 ` J. Bruce Fields
0 siblings, 1 reply; 84+ messages in thread
From: Olaf Weber @ 2014-09-26 20:03 UTC (permalink / raw)
To: Jeremy Allison; +Cc: tinguely, xfs, Christoph Hellwig, Ben Myers, linux-fsdevel

On 26-09-14 21:46, Jeremy Allison wrote:
> On Fri, Sep 26, 2014 at 09:37:11PM +0200, Olaf Weber wrote:
>>
>> My argument against "mount time case-insensitivity" and for "mkfs
>> time case-insensitivity" is related to switching from the
>> case-sensitive domain to the case-insensitive one.
>>
>> For case-sensitive, from "README" to "readme" there are 64 different
>> possible filenames. Let's say you create 63 out of these 64. Now
>> remount the filesystem case-insensitive, and try to open by the 64th
>> version of "readme". It is not an exact match for any of the 63
>> candidate files, and a case-insensitive match to all 63 candidate
>> files. Which of these 63 files should be opened, and why that one in
>> particular?
>
> I'm ok with "mkfs time case-insensitivity" - really !
> Most of my OEMs would set that and claim victory (few
> of them care much about NFS semantics :-).

I'd say you can have CIFS-style case-insensitive semantics or NFS-style
case-sensitive semantics, but not both. And in particular, that a
customer should not actually want to have both.

>>> Having CI matching can speed up Samba operations by a
>>> factor of 10 on large directories (warning, number made
>>> up, depending on the number of entries per dir :-).
>>
>> I really want that to be true, but the proof of the pudding...
>
> No it really *is* true. The reason I can't give
> exact numbers is it depends on the number of entries.
>
> Remember, for every cache *miss*, we have to scan
> the entire directory.
>
> So a user asks for README, and we attempt that
> and it fails. So now we have to enumerate the
> entire directory to see if READMe (or any other
> case variant) exists.
>
> Now do that in a directory with 10, 100, 1000,
> .... 10000000 existing files (don't laugh, I've
> seen an application for Music files that did
> *exactly* that). On a case insensitive filesystem
> you just request README and you're done.
>
> Certain vendors who shall remain nameless :-)
> created test cases of just this example to
> show how much storage on Linux sucks. Not
> a happy camper about that - and telling them
> to use ZFS on FreeBSD or Solaris just doesn't
> feel right :-).

Here's the thing to bear in mind: what I did is a straightforward
extension of the existing XFS ASCII-based case-insensitive code. If that
gets you the desired performance improvement, then my code should extend
that to more general usage. If it doesn't, then there are places in XFS
that I haven't touched that need modification to have these cases work
well.

Olaf

-- 
Olaf Weber                 SGI               Phone:  +31(0)30-6696796
                           Veldzigt 2b       Fax:    +31(0)30-6696799
Technical Lead             3454 PW de Meern  Vnet:   955-6796
Storage Software           The Netherlands   Email:  olaf@sgi.com
* Re: [RFC v2] Unicode/UTF-8 support for XFS
2014-09-26 20:03 ` Olaf Weber
@ 2014-09-29 20:16 ` J. Bruce Fields
0 siblings, 0 replies; 84+ messages in thread
From: J. Bruce Fields @ 2014-09-29 20:16 UTC (permalink / raw)
To: Olaf Weber
Cc: Jeremy Allison, Christoph Hellwig, Dave Chinner, Ben Myers, linux-fsdevel, tinguely, xfs

On Fri, Sep 26, 2014 at 10:03:50PM +0200, Olaf Weber wrote:
> On 26-09-14 21:46, Jeremy Allison wrote:
> >On Fri, Sep 26, 2014 at 09:37:11PM +0200, Olaf Weber wrote:
> >>
> >>My argument against "mount time case-insensitivity" and for "mkfs
> >>time case-insensitivity" is related to switching from the
> >>case-sensitive domain to the case-insensitive one.
> >>
> >>For case-sensitive, from "README" to "readme" there are 64 different
> >>possible filenames. Let's say you create 63 out of these 64. Now
> >>remount the filesystem case-insensitive, and try to open by the 64th
> >>version of "readme". It is not an exact match for any of the 63
> >>candidate files, and a case-insensitive match to all 63 candidate
> >>files. Which of these 63 files should be opened, and why that one in
> >>particular?
> >
> >I'm ok with "mkfs time case-insensitivity" - really !
> >Most of my OEMs would set that and claim victory (few
> >of them care much about NFS semantics :-).
>
> I'd say you can have CIFS-style case-insensitive semantics or
> NFS-style case-sensitive semantics, but not both.

Note the NFSv4 specs do claim to allow case insensitivity. No idea how
well clients deal with it. I think rfc3530bis has the most up to date
language on NFSv4 internationalization issues:

	http://tools.ietf.org/html/draft-ietf-nfsv4-rfc3530bis-33

(One nit in the current knfsd: the server doesn't correctly report the
case_insensitive attribute. If it had some flag it could check in the
filesystem's superblock then it could do that right instead of just
assuming 0 as it currently does (see FATTR4_WORD0_CASE_INSENSITIVE in
fs/nfsd/nfs4xdr.c:nfsd4_encode_fattr).)

--b.
* Re: [RFC v2] Unicode/UTF-8 support for XFS
2014-09-26 19:37 ` Olaf Weber
@ 2014-09-29 11:06 ` Christoph Hellwig
2014-09-29 11:06 ` Christoph Hellwig
1 sibling, 0 replies; 84+ messages in thread
From: Christoph Hellwig @ 2014-09-29 11:06 UTC (permalink / raw)
To: Olaf Weber
Cc: Jeremy Allison, Christoph Hellwig, Dave Chinner, Ben Myers, linux-fsdevel, tinguely, xfs

On Fri, Sep 26, 2014 at 09:37:11PM +0200, Olaf Weber wrote:
> My argument against "mount time case-insensitivity" and for "mkfs time
> case-insensitivity" is related to switching from the case-sensitive domain
> to the case-insensitive one.
>
> For case-sensitive, from "README" to "readme" there are 64 different
> possible filenames. Let's say you create 63 out of these 64. Now remount
> the filesystem case-insensitive, and try to open by the 64th version of
> "readme". It is not an exact match for any of the 63 candidate files, and a
> case-insensitive match to all 63 candidate files. Which of these 63 files
> should be opened, and why that one in particular?

Well, the point is not that we use the CI-capable hash all the time.
I fully expect the current XFS behavior to remain the default for normal
systems forever. I just want to make sure that the CI implementation
you chose can also allow mixed lookups if we desire.
* Re: [RFC v2] Unicode/UTF-8 support for XFS
2014-09-26 16:56 ` Christoph Hellwig
(?)
(?)
@ 2014-09-26 17:30 ` Ben Myers
-1 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-26 17:30 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-fsdevel, tinguely, Olaf Weber, xfs

Hey Christoph,

On Fri, Sep 26, 2014 at 09:56:05AM -0700, Christoph Hellwig wrote:
> On Fri, Sep 26, 2014 at 04:50:39PM +0200, Olaf Weber wrote:
> > >It's a chicken and egg situation. I'd much prefer we enforce clean
> > >utf8 from the start, because if we don't we'll never be able to do
> > >that. And other filesystems (e.g. ZFS) allow you to reject
> > >anything that is not clean utf8....
> >
> > As I understand it, this is optional in ZFS. I wonder what people's
> > experiences are with this.
>
> It is as optional as your utf8 support for XFS is. But they do
> enforce valid utf8 if they use utf8 normalization for file name
> comparisons, be that case sensitive or insensitive. Take a look at the
> zfs(8) man page.

The way I'm reading that man page, it seems like with ZFS you have one
option to choose whether to use normalization:
'normalization = "none|FormD|FormKCf"'

And a separate option to choose whether to accept non-utf8 filenames:
'utf8only = "on|off"'. The default setting appears to be that ZFS does
allow non-utf8 filenames.

Whereas with Olaf's series you have one option that turns normalization
on or off, and he is not giving you a choice of whether non-utf8
filenames will be accepted (they will be accepted).

So IIUC there is a distinction: The utf8 support for ZFS is "more
optional" than the utf8 support for XFS.

Regards,
Ben
* [RFC] Unicode/UTF-8 support for XFS
@ 2014-09-11 20:37 Ben Myers
2014-09-11 20:55 ` [PATCH 04/13] libxfs: change interface of xfs_nameops.normhash Ben Myers
0 siblings, 1 reply; 84+ messages in thread
From: Ben Myers @ 2014-09-11 20:37 UTC (permalink / raw)
To: xfs; +Cc: tinguely, olaf

Hi,

I'm posting this RFC on Olaf's behalf, as he is busy with other projects.
First is a series of kernel patches, then a series of patches for
xfsprogs, and then a test.

Note that I have removed the unicode database files prior to posting due
to their large size. There are instructions on how to download them in
the relevant commit headers.

Thanks,
Ben

Here are some notes of introduction from Olaf:

-----------------------------------------------------------------------------
Unicode/UTF-8 support for XFS

So we had a customer request proper unicode support...

Design notes.

XFS uses byte strings for filenames, so UTF-8 is the expected format for
unicode filenames. This does raise the question what criteria a byte
string must meet to be UTF-8. We settled on the following:
- Valid unicode code points are 0..0x10FFFF, except that
- The surrogates 0xD800..0xDFFF are not valid code points, and
- Valid UTF-8 must be a shortest encoding of a valid unicode code point.

In addition, U+0 (ASCII NUL, '\0') is used to terminate byte strings (and
is itself not part of the string). Moreover strings may be length-limited
in addition to being NUL-terminated (there is no such thing as an embedded
NUL in a length-limited string).

Based on feedback on the earlier patches for unicode/UTF-8 support, we
decided that a filename that does not match the above criteria should be
treated as a binary blob, as opposed to being rejected. To stress: if any
part of the string isn't valid UTF-8, then the entire string is treated
as a binary blob. This matters once normalization is considered.
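
The three criteria in the design notes above are mechanical enough to write down directly. What follows is a minimal standalone sketch, not the code from the series: per the implementation notes the patches validate UTF-8 through trie lookups and have no dedicated validator. The function name utf8_valid is hypothetical, and the caller is assumed to have already cut the string at the first NUL, so a zero byte inside len is simply treated as U+0.

```c
#include <stddef.h>

/*
 * Return 1 if the len bytes at s are valid UTF-8 under the three rules
 * in the design notes (range 0..0x10FFFF, no surrogates, shortest
 * encoding only), 0 otherwise.
 */
static int utf8_valid(const unsigned char *s, size_t len)
{
	size_t i = 0;

	while (i < len) {
		unsigned char b = s[i];
		unsigned long cp, min;
		size_t n, k;

		if (b < 0x80) {			/* 1 byte: U+0000..U+007F */
			i++;
			continue;
		} else if ((b & 0xE0) == 0xC0) {	/* 2-byte lead */
			n = 2;
			cp = b & 0x1F;
			min = 0x80;
		} else if ((b & 0xF0) == 0xE0) {	/* 3-byte lead */
			n = 3;
			cp = b & 0x0F;
			min = 0x800;
		} else if ((b & 0xF8) == 0xF0) {	/* 4-byte lead */
			n = 4;
			cp = b & 0x07;
			min = 0x10000;
		} else {
			return 0;	/* stray continuation or 0xF8..0xFF */
		}
		if (len - i < n)
			return 0;	/* sequence is truncated */
		for (k = 1; k < n; k++) {
			if ((s[i + k] & 0xC0) != 0x80)
				return 0;	/* not a continuation byte */
			cp = (cp << 6) | (s[i + k] & 0x3F);
		}
		if (cp < min)
			return 0;	/* not the shortest encoding */
		if (cp > 0x10FFFF)
			return 0;	/* beyond the unicode range */
		if (cp >= 0xD800 && cp <= 0xDFFF)
			return 0;	/* surrogate */
		i += n;
	}
	return 1;
}
```

Under the blob rule described above, a name for which such a check fails would not be rejected; it would be compared byte for byte with no normalization applied to any part of it.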

When comparing unicode strings for equality, normalization comes into
play: we must compare the normalized forms of strings, not just the raw
sequences of bytes. There are a number of defined normalization forms for
unicode. We decided on a variant of NFKD we call NFKDI. NFD was chosen
over NFC, because calculating NFC requires calculating NFD first, followed
by an additional step. NFKD was chosen over NFD because this makes
filenames that ought to be equal compare as equal. My favorite example is
the ways "office" can be spelled, when "fi" or "ffi" ligatures are used.

NFKDI adds one more step to NFKD, in that it eliminates the code points
that have the Default_Ignorable_Code_Point property from the comparison.
These code points are as a rule invisible, but might (or might not) be
pulled in when you copy/paste a string to be used as a filename. An
example of these is U+00AD SOFT HYPHEN, a code point that only shows up if
a word is split across lines.

If a filename is considered to be a binary blob, comparison is based on a
simple binary match. Normalization does not apply to any part of a blob.

The code uses ("leverages", in corp-speak) the existing infrastructure for
case-insensitive filenames. Like the CI code, the name used to create a
file is stored on disk, and returned in a lookup. When comparing filenames
the normalized forms of the names being compared are generated on the fly
from the non-normalized forms stored on disk.

If the borgbit (the bit enabling legacy ASCII-based CI) is set in the
superblock, then case folding is added into the mix. This normalization
form we call NFKDICF. It allows for the creation of case-insensitive
filesystems with UTF-8 support.

-----------------------------------------------------------------------------
Implementation notes.

Strings are normalized using a trie that stores the relevant information.
The trie itself is part of the XFS module, and about 250kB in size.
The trie is not checked in: instead we add the source files from the
Unicode Character Database and a program that creates the header
containing the trie.

The key for a lookup in the trie is a UTF-8 sequence. Each valid UTF-8
sequence leads to a leaf. No invalid sequence does. This means that trie
lookups can be used to validate UTF-8 sequences, which is why there is
no specialized code for the same purpose.

The trie contains information for the version of unicode in which each
code point was defined. This matters because non-normalized strings are
stored on disk, and newer versions of unicode may introduce new
normalized forms. Ideally, the version of unicode used by the filesystem
is stored in the filesystem.

The trie also accounts for corrections made in the past to
normalizations. This has little value today, because any newly created
filesystem would be using unicode version 7.0.0. It is included in order
to show, not tell, that such corrections can be handled if they are
added in future revisions.

The algorithm used to calculate the sequences of bytes for the
normalized form of a UTF-8 string is tricky. The core is found in
utf8byte(), with an explanation in the preceding comment.

The non-XFS-specific supporting code is in separate source files, and
can be put in some other location in the Linux kernel source tree, if
desired. These functions have the prefix 'utf8n' if they handle
length-limited strings, and 'utf8' if they handle NUL-terminated
strings.

-----------------------------------------------------------------------------

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

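The comparison rule from the design notes (compare normalized forms, drop
ignorable code points) can be illustrated with a deliberately tiny toy.
Everything here is an assumption made for illustration: the names
toy_table, toy_normalize() and toy_names_equal() are hypothetical, the
table has only three hand-written entries instead of a generated trie,
and the sketch assumes both names have already passed UTF-8 validation.
Real NFKD also needs full decomposition data and canonical ordering,
neither of which is attempted here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* One toy mapping: UTF-8 bytes of a code point and their replacement. */
struct toy_map {
	const char *from;	/* encoded source code point */
	const char *to;		/* replacement; "" means "ignore" */
};

static const struct toy_map toy_table[] = {
	{ "\xEF\xAC\x81", "fi"  },	/* U+FB01 LATIN SMALL LIGATURE FI */
	{ "\xEF\xAC\x83", "ffi" },	/* U+FB03 LATIN SMALL LIGATURE FFI */
	{ "\xC2\xAD",     ""    },	/* U+00AD SOFT HYPHEN */
};

/*
 * Write the toy-normalized form of in[0..len) to out. Returns the
 * output length, or (size_t)-1 if out is too small.
 */
static size_t toy_normalize(const char *in, size_t len,
			    char *out, size_t outsz)
{
	size_t i = 0, o = 0, k;

	while (i < len) {
		const char *rep = in + i;	/* default: copy one byte */
		size_t fromlen = 1, replen = 1;

		for (k = 0; k < sizeof(toy_table) / sizeof(toy_table[0]); k++) {
			size_t fl = strlen(toy_table[k].from);

			if (len - i >= fl &&
			    memcmp(in + i, toy_table[k].from, fl) == 0) {
				rep = toy_table[k].to;
				fromlen = fl;
				replen = strlen(rep);
				break;
			}
		}
		if (o + replen > outsz)
			return (size_t)-1;
		memcpy(out + o, rep, replen);
		o += replen;
		i += fromlen;
	}
	return o;
}

/* Two names match if their toy-normalized forms are byte-identical. */
static bool toy_names_equal(const char *a, size_t alen,
			    const char *b, size_t blen)
{
	char na[256], nb[256];
	size_t la = toy_normalize(a, alen, na, sizeof(na));
	size_t lb = toy_normalize(b, blen, nb, sizeof(nb));

	if (la == (size_t)-1 || lb == (size_t)-1)
		return false;
	return la == lb && memcmp(na, nb, la) == 0;
}
```

In this toy, "office" matches the spelling that uses the "ffi" ligature,
the spelling that uses the "fi" ligature, and a copy with a soft hyphen
pasted into the middle. "Office" does not match, since no case folding
is applied here.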
* [PATCH 04/13] libxfs: change interface of xfs_nameops.normhash
2014-09-11 20:37 [RFC] " Ben Myers
@ 2014-09-11 20:55 ` Ben Myers
0 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-11 20:55 UTC
To: xfs; +Cc: tinguely, olaf

From: Olaf Weber <olaf@sgi.com>

With the introduction of the xfs_nameops.normhash callout, all uses of
the hashname callout now occur in places where an xfs_name structure
must be explicitly created just to match the parameter passing
convention of this callout. Change the arguments to a const unsigned
char * and int instead.

Signed-off-by: Olaf Weber <olaf@sgi.com>
---
 db/check.c              |  6 ++----
 include/xfs_da_btree.h  |  2 +-
 libxfs/xfs_da_btree.c   |  9 +--------
 libxfs/xfs_dir2.c       | 10 ++++++----
 libxfs/xfs_dir2_block.c |  5 +----
 libxfs/xfs_dir2_data.c  |  6 ++----
 repair/phase6.c         |  2 +-
 7 files changed, 14 insertions(+), 26 deletions(-)

diff --git a/db/check.c b/db/check.c
index 4fd9fd0..49359d7 100644
--- a/db/check.c
+++ b/db/check.c
@@ -2212,7 +2212,6 @@ process_data_dir_v2(
 	int			stale = 0;
 	int			tag_err;
 	__be16			*tagp;
-	struct xfs_name		xname;

 	data = iocur_top->data;
 	block = iocur_top->data;
@@ -2323,9 +2322,8 @@ process_data_dir_v2(
 			tag_err += be16_to_cpu(*tagp) != (char *)dep - (char *)data;
 			addr = xfs_dir2_db_off_to_dataptr(mp, db,
 				(char *)dep - (char *)data);
-			xname.name = dep->name;
-			xname.len = dep->namelen;
-			dir_hash_add(mp->m_dirnameops->hashname(&xname), addr);
+			dir_hash_add(mp->m_dirnameops->hashname(dep->name,
+						dep->namelen), addr);
 			ptr += xfs_dir3_data_entsize(mp, dep->namelen);
 			count++;
 			lastfree = 0;
diff --git a/include/xfs_da_btree.h b/include/xfs_da_btree.h
index 06b50bf..9674bed 100644
--- a/include/xfs_da_btree.h
+++ b/include/xfs_da_btree.h
@@ -132,7 +132,7 @@ typedef struct xfs_da_state {
  * Name ops for directory and/or attr name operations
  */
 struct xfs_nameops {
-	xfs_dahash_t	(*hashname)(struct xfs_name *);
+	xfs_dahash_t	(*hashname)(const unsigned char *, int);
 	int		(*normhash)(struct xfs_da_args *);
 	enum xfs_dacmp	(*compname)(struct xfs_da_args *,
 					const unsigned char *, int);
diff --git a/libxfs/xfs_da_btree.c b/libxfs/xfs_da_btree.c
index eb97317..7be5eaf 100644
--- a/libxfs/xfs_da_btree.c
+++ b/libxfs/xfs_da_btree.c
@@ -1993,13 +1993,6 @@ xfs_da_compname(
 					XFS_CMP_EXACT : XFS_CMP_DIFFERENT;
 }

-static xfs_dahash_t
-xfs_default_hashname(
-	struct xfs_name	*name)
-{
-	return xfs_da_hashname(name->name, name->len);
-}
-
 STATIC int
 xfs_da_normhash(
 	struct xfs_da_args *args)
@@ -2009,7 +2002,7 @@ xfs_da_normhash(
 }

 const struct xfs_nameops xfs_default_nameops = {
-	.hashname	= xfs_default_hashname,
+	.hashname	= xfs_da_hashname,
 	.normhash	= xfs_da_normhash,
 	.compname	= xfs_da_compname
 };
diff --git a/libxfs/xfs_dir2.c b/libxfs/xfs_dir2.c
index e52d082..1893931 100644
--- a/libxfs/xfs_dir2.c
+++ b/libxfs/xfs_dir2.c
@@ -43,13 +43,14 @@ const unsigned char xfs_mode_to_ftype[S_IFMT >> S_SHIFT] = {
  */
 STATIC xfs_dahash_t
 xfs_ascii_ci_hashname(
-	struct xfs_name	*name)
+	const unsigned char	*name,
+	int			len)
 {
 	xfs_dahash_t	hash;
 	int		i;

-	for (i = 0, hash = 0; i < name->len; i++)
-		hash = tolower(name->name[i]) ^ rol32(hash, 7);
+	for (i = 0, hash = 0; i < len; i++)
+		hash = tolower(name[i]) ^ rol32(hash, 7);

 	return hash;
 }
@@ -475,7 +476,8 @@ xfs_dir_canenter(
 	args.name = name->name;
 	args.namelen = name->len;
 	args.filetype = name->type;
-	args.hashval = dp->i_mount->m_dirnameops->hashname(name);
+	args.hashval = dp->i_mount->m_dirnameops->hashname(name->name,
+								name->len);
 	args.dp = dp;
 	args.whichfork = XFS_DATA_FORK;
 	args.trans = tp;
diff --git a/libxfs/xfs_dir2_block.c b/libxfs/xfs_dir2_block.c
index 2880431..1a8b5f5 100644
--- a/libxfs/xfs_dir2_block.c
+++ b/libxfs/xfs_dir2_block.c
@@ -1047,7 +1047,6 @@ xfs_dir2_sf_to_block(
 	xfs_dir2_sf_hdr_t	*sfp;		/* shortform header */
 	__be16			*tagp;		/* end of data entry */
 	xfs_trans_t		*tp;		/* transaction pointer */
-	struct xfs_name		name;
 	struct xfs_ifork	*ifp;

 	trace_xfs_dir2_sf_to_block(args);
@@ -1205,10 +1204,8 @@ xfs_dir2_sf_to_block(
 		tagp = xfs_dir3_data_entry_tag_p(mp, dep);
 		*tagp = cpu_to_be16((char *)dep - (char *)hdr);
 		xfs_dir2_data_log_entry(tp, bp, dep);
-		name.name = sfep->name;
-		name.len = sfep->namelen;
 		blp[2 + i].hashval = cpu_to_be32(mp->m_dirnameops->
-							hashname(&name));
+				hashname(sfep->name, sfep->namelen));
 		blp[2 + i].address = cpu_to_be32(xfs_dir2_byte_to_dataptr(mp,
 						(char *)dep - (char *)hdr));
 		offset = (int)((char *)(tagp + 1) - (char *)hdr);
diff --git a/libxfs/xfs_dir2_data.c b/libxfs/xfs_dir2_data.c
index dc9df4d..9b3f750 100644
--- a/libxfs/xfs_dir2_data.c
+++ b/libxfs/xfs_dir2_data.c
@@ -46,7 +46,6 @@ __xfs_dir3_data_check(
 	xfs_mount_t		*mp;		/* filesystem mount point */
 	char			*p;		/* current data position */
 	int			stale;		/* count of stale leaves */
-	struct xfs_name		name;

 	mp = bp->b_target->bt_mount;
 	hdr = bp->b_addr;
@@ -142,9 +141,8 @@ __xfs_dir3_data_check(
 			addr = xfs_dir2_db_off_to_dataptr(mp, mp->m_dirdatablk,
 				(xfs_dir2_data_aoff_t)
 				((char *)dep - (char *)hdr));
-			name.name = dep->name;
-			name.len = dep->namelen;
-			hash = mp->m_dirnameops->hashname(&name);
+			hash = mp->m_dirnameops->
+					hashname(dep->name, dep->namelen);
 			for (i = 0; i < be32_to_cpu(btp->count); i++) {
 				if (be32_to_cpu(lep[i].address) == addr &&
 				    be32_to_cpu(lep[i].hashval) == hash)
diff --git a/repair/phase6.c b/repair/phase6.c
index f13069f..f374fd0 100644
--- a/repair/phase6.c
+++ b/repair/phase6.c
@@ -195,7 +195,7 @@ dir_hash_add(
 	dup = 0;

 	if (!junk) {
-		hash = mp->m_dirnameops->hashname(&xname);
+		hash = mp->m_dirnameops->hashname(name, namelen);
 		byhash = DIR_HASH_FUNC(hashtab, hash);

 		/*
-- 
1.7.12.4

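The effect of the interface change can be tried outside the tree with a
minimal userspace sketch. The hash loop mirrors xfs_ascii_ci_hashname()
as it appears in the patch. Everything else is a stand-in assumed for
illustration: the xfs_dahash_t typedef, the local rol32(), and the
hypothetical names ascii_ci_hashname, struct nameops and
ascii_ci_nameops, which are cut-down substitutes for the real
structures.

```c
#include <ctype.h>
#include <stdint.h>

/* Stand-in for the XFS hash type; assumed to be 32 bits wide here. */
typedef uint32_t xfs_dahash_t;

/* Local rotate-left, so the sketch builds outside the kernel tree. */
static inline uint32_t rol32(uint32_t word, unsigned int shift)
{
	return (word << (shift & 31)) | (word >> ((-shift) & 31));
}

/* Same loop as the patched xfs_ascii_ci_hashname(): (name, len) in. */
static xfs_dahash_t ascii_ci_hashname(const unsigned char *name, int len)
{
	xfs_dahash_t hash;
	int i;

	for (i = 0, hash = 0; i < len; i++)
		hash = tolower(name[i]) ^ rol32(hash, 7);

	return hash;
}

/* Cut-down stand-in for struct xfs_nameops with the new signature. */
struct nameops {
	xfs_dahash_t (*hashname)(const unsigned char *, int);
};

static const struct nameops ascii_ci_nameops = {
	.hashname = ascii_ci_hashname,
};
```

A caller that holds a raw directory entry can now pass its name pointer
and length straight to ->hashname(), with no temporary xfs_name to fill
in first. Because the hash lowercases each byte, "README" and "readme"
hash to the same value in the default C locale.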