All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC v2] Unicode/UTF-8 support for XFS
@ 2014-09-18 19:56 Ben Myers
  2014-09-18 20:08 ` [PATCH 01/10] xfs: return the first match during case-insensitive lookup Ben Myers
                   ` (16 more replies)
  0 siblings, 17 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-18 19:56 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: tinguely, olaf, xfs

Hi,

I'm posting this RFC for Unicode support in XFS on Olaf's behalf, as he
is busy with other projects.  This is the second revision of the series.
The first is available here:

http://oss.sgi.com/archives/xfs/2014-09/msg00169.html

In response to the initial feedback, the changes in version 2 include:

* linux-fsdevel in the To: line,
* Updated design notes,
* Separation of the fs-independent trie and support code into utf8norm.ko,
* A mechanism for loading the normalization module only when necessary.

I'll post the whole series for completeness sake.  Many on -fsdevel will
not be interested in the xfs-specific bits, but it may be helpful to
have the full series as an example and for testing purposes.

First there is a set of kernel bits, then some libxfs/xfsprogs stuff,
and finally a test.  (Note: I am not posting the unicode database files
due to their large size.  There are scripts to download them from
unicode.org in the relevant commit headers.)

TODO: Store the unicode version number of the filesystem on disk in the
super block.

Thanks,
Ben

Here are Olaf's design notes:

-----------------------------------------------------------------------------
Unicode/UTF-8 support for XFS

So we had a customer request proper unicode support...


* What does "supporting unicode" actually mean?

>From a text processing point of view, what a filesystem does with
filenames is simple: it stores and retrieves them, and compares them
for equality. It may reject certain byte sequences as invalid
filenames (for example, no filename can contain an ASCII NUL).

I've been taking it as a given that when a file is created with a
certain byte sequence as its name, then a subsequent directory listing
will contain that same byte sequence among the names listed.

This leaves comparing names for equality, and in my view this is what
"supporting unicode" revolves about.

The present state of affairs is that different byte sequences are
different filenames. This amounts to tolerating unicode without
actually supporting it.

To support unicode we have to interpret filenames. What happens when
(part of) a filename cannot be interpreted? We can reject the
filename, interpret the parts we can, or punt and accept it as an
uninterpreted blob.

Rejecting ill-formed filenames was my first choice, but I came around
on the issue: there are too many ways in which you can end up with
having to deal with ill-formed filenames that would leave a user with
no recourse but to move whatever they're doing to a different
filesystem. Unpacking a tarball with filenames in a different encoding
is an example.

Partial interpretation of an ill-formed filename just strikes me as
the kind of bad idea that most half-houses are. I admit that I have no
stronger objection to this than the fact that it makes the code even
more complicated and fragile.

Which leaves "blob" as the preferred option by default for coping with
ill-formed filenames.

When comparing well-formed filenames, the question now becomes which
byte sequences are considered to be alternative spellings of the same
filename. This is where normalization forms come into play, and the
unicode standard has quite a bit to say about the subject.

If all you're doing is comparison, then choosing NFD over NFC is easy,
because the former is easier to calculate than the latter.

If you want various spellings of "office" to compare equal, then
picking NFKD over NFD for comparison is also an obvious
choice. (Hand-picking individual compatibility forms is truly a bad
idea.) Ways to spell "office": "o_f_f_i_c_e", "o_f_fi_c_e", and
"o_ffi_c_e", using no ligatures, the fi ligature, or the ffi
ligature. (Some fool thought it a good idea to add these ligatures to
unicode, all we get to decide is how to cope.)

The most contentious part is (should be) ignoring the codepoints with
the Default_Ignorable_Code_Point property. I've included the list
below. My argument, such as it is, is that these code points either
have no visible rendering, or in cases like the soft hyphen, are only
conditionally visible. The problem with these (as I see it) is that on
seeing a filename that might contain them you cannot tell whether they
are present. So I propose to ignore them for the purpose of comparing
filenames for equality.

Finally, case folding. First of all, it is optional. Then the issue is
that you either go the language-specific route, or simplify the task
by "just" doing a full casefold (C+F, in unicode parlance). Looking
around the net I tend to find that if you're going to do casefolding
at all, then a language-independent full casefold is preferred because
it is the most predictable option. See
http://www.w3.org/TR/charmod-norm/ for an example of that kind of
reasoning.

An additional question is whether case folding should be a fixed
(mkfs-time) property of a filesystem or can be enabled and disabled on
the fly. When mixing these modes, preferring exact matches is easy.
But after case-sensitive creates of files named "README" and "readme",
which of these two files will be found by case-insensitive lookups of
"Readme", and "ReadMe"? Does the answer differ if the order in which
the files were created is reversed? I do not have good answers to
those questions, and absent such answers the behavior of a filesystem
becomes hard to predict. This may not be a bug according to the
design, but it will be experienced as a bug by users. This is why in
these patches case folding is a property set at mkfs time.

All of these choices can be argued with, but I do believe that the
particular combination of choices I made is a defensible one.

The code refers to these normalization forms as nfkdi and nfkdicf.


* XFS-specific design notes.

XFS uses byte strings for filenames, so UTF-8 is the expected format for
unicode filenames. This does raise the question what criteria a byte string
must meet to be UTF-8. We settled on the following:
 - Valid unicode code points are 0..0x10FFFF, except that
 - The surrogates 0xD800..0xDFFF are not valid code points, and
 - Valid UTF-8 must be a shortest encoding of a valid unicode code point.

In addition, U+0 (ASCII NUL, '\0') is used to terminate byte strings (and
is itself not part of the string). Moreover strings may be length-limited
in addition to being NUL-terminated (there is no such thing as an embedded
NUL in a length-limited string).

The code uses ("leverages", in corp-speak) the existing XFS
infrastructure for case-insensitive filenames. Like the CI code, the
name used to create a file is stored on disk, and returned in a
lookup. When comparing filenames the normalized forms of the names
being compared are generated on the fly from the non-normalized forms
stored on disk.

If the borgbit (the bit enabling legacy ASCII-based CI in XFS) is set
in the superblock, then case folding is added into the mix. This is
the nfkdicf normalization form mentioned above. It allows for the
creation of case-insensitive filesystems with UTF-8 support.


* Implementation notes.

Strings are normalized using a trie that stores the relevant
information.  The trie itself is about 250kB in size, and lives in a
separate module. The trie is not checked in: instead we add the source
files from the Unicode Character Database and a program that creates
the header containing the trie.

The key for a lookup in the trie is a UTF-8 sequence. Each valid UTF-8
sequence leads to a leaf. No invalid sequence does. This means that trie
lookups can be used to validate UTF-8 sequences, which why there is no
specialized code for the same purpose.

The trie contains information for the version of unicode in which each
code point was defined. This matters because non-normalized strings are
stored on disk, and newer versions of unicode may introduce new normalized
forms. Ideally, the version of unicode used by the filesystem is stored in
the filesystem.

The trie also accounts for corrections made in the past to normalizations.
This has little value today, because any newly created filesystem would be
using unicode version 7.0.0. It is included in order to show, not tell,
that such corrections can be handled if they are added in future revisions.

The algorithm used to calculate the sequences of bytes for the normalized
form of a UTF-8 string is tricky. The core is found in utf8byte(), with an
explanation in the preceeding comment.

The non-XFS-specific supporting code functions have the prefix 'utf8n'
if they handle length-limited strings, and 'utf8' if they handle
NUL-terminated strings.

----
# Derived Property: Default_Ignorable_Code_Point
#  Generated from
#    Other_Default_Ignorable_Code_Point
#  + Cf (Format characters)
#  + Variation_Selector
#  - White_Space
#  - FFF9..FFFB (Annotation Characters)
#  - 0600..0605, 06DD, 070F, 110BD (exceptional Cf characters that should be visible)

00AD          ; Default_Ignorable_Code_Point # Cf       SOFT HYPHEN
034F          ; Default_Ignorable_Code_Point # Mn       COMBINING GRAPHEME JOINER
061C          ; Default_Ignorable_Code_Point # Cf       ARABIC LETTER MARK
115F..1160    ; Default_Ignorable_Code_Point # Lo   [2] HANGUL CHOSEONG FILLER..HANGUL JUNGSEONG FILLER
17B4..17B5    ; Default_Ignorable_Code_Point # Mn   [2] KHMER VOWEL INHERENT AQ..KHMER VOWEL INHERENT AA
180B..180D    ; Default_Ignorable_Code_Point # Mn   [3] MONGOLIAN FREE VARIATION SELECTOR ONE..MONGOLIAN FREE VARIATION SELECTOR THREE
180E          ; Default_Ignorable_Code_Point # Cf       MONGOLIAN VOWEL SEPARATOR
200B..200F    ; Default_Ignorable_Code_Point # Cf   [5] ZERO WIDTH SPACE..RIGHT-TO-LEFT MARK
202A..202E    ; Default_Ignorable_Code_Point # Cf   [5] LEFT-TO-RIGHT EMBEDDING..RIGHT-TO-LEFT OVERRIDE
2060..2064    ; Default_Ignorable_Code_Point # Cf   [5] WORD JOINER..INVISIBLE PLUS
2065          ; Default_Ignorable_Code_Point # Cn       <reserved-2065>
2066..206F    ; Default_Ignorable_Code_Point # Cf  [10] LEFT-TO-RIGHT ISOLATE..NOMINAL DIGIT SHAPES
3164          ; Default_Ignorable_Code_Point # Lo       HANGUL FILLER
FE00..FE0F    ; Default_Ignorable_Code_Point # Mn  [16] VARIATION SELECTOR-1..VARIATION SELECTOR-16
FEFF          ; Default_Ignorable_Code_Point # Cf       ZERO WIDTH NO-BREAK SPACE
FFA0          ; Default_Ignorable_Code_Point # Lo       HALFWIDTH HANGUL FILLER
FFF0..FFF8    ; Default_Ignorable_Code_Point # Cn   [9] <reserved-FFF0>..<reserved-FFF8>
1BCA0..1BCA3  ; Default_Ignorable_Code_Point # Cf   [4] SHORTHAND FORMAT LETTER OVERLAP..SHORTHAND FORMAT UP STEP
1D173..1D17A  ; Default_Ignorable_Code_Point # Cf   [8] MUSICAL SYMBOL BEGIN BEAM..MUSICAL SYMBOL END PHRASE
E0000         ; Default_Ignorable_Code_Point # Cn       <reserved-E0000>
E0001         ; Default_Ignorable_Code_Point # Cf       LANGUAGE TAG
E0002..E001F  ; Default_Ignorable_Code_Point # Cn  [30] <reserved-E0002>..<reserved-E001F>
E0020..E007F  ; Default_Ignorable_Code_Point # Cf  [96] TAG SPACE..CANCEL TAG
E0080..E00FF  ; Default_Ignorable_Code_Point # Cn [128] <reserved-E0080>..<reserved-E00FF>
E0100..E01EF  ; Default_Ignorable_Code_Point # Mn [240] VARIATION SELECTOR-17..VARIATION SELECTOR-256
E01F0..E0FFF  ; Default_Ignorable_Code_Point # Cn [3600] <reserved-E01F0>..<reserved-E0FFF>

# Total code points: 4173
----

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH 01/10] xfs: return the first match during case-insensitive lookup.
  2014-09-18 19:56 [RFC v2] Unicode/UTF-8 support for XFS Ben Myers
@ 2014-09-18 20:08 ` Ben Myers
  2014-09-18 20:09 ` [PATCH 02/10] xfs: rename XFS_CMP_CASE to XFS_CMP_MATCH Ben Myers
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-18 20:08 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: tinguely, olaf, xfs

From: Olaf Weber <olaf@sgi.com>

Change the XFS case-insensitive lookup code to return the first match
found, even if it is not an exact match. Whether a filesystem uses
case-insensitive lookups is determined by a superblock bit set during
filesystem creation.  This means that normal use cannot create two files
that both match the same filename.

Signed-off-by: Olaf Weber <olaf@sgi.com>
---
 fs/xfs/libxfs/xfs_dir2_block.c | 17 +++------
 fs/xfs/libxfs/xfs_dir2_leaf.c  | 37 ++++----------------
 fs/xfs/libxfs/xfs_dir2_node.c  | 79 ++++++++++++++++--------------------------
 fs/xfs/libxfs/xfs_dir2_sf.c    |  8 ++---
 4 files changed, 45 insertions(+), 96 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_dir2_block.c b/fs/xfs/libxfs/xfs_dir2_block.c
index 9628cec..990bf0c 100644
--- a/fs/xfs/libxfs/xfs_dir2_block.c
+++ b/fs/xfs/libxfs/xfs_dir2_block.c
@@ -725,28 +725,21 @@ xfs_dir2_block_lookup_int(
 		dep = (xfs_dir2_data_entry_t *)
 			((char *)hdr + xfs_dir2_dataptr_to_off(args->geo, addr));
 		/*
-		 * Compare name and if it's an exact match, return the index
-		 * and buffer. If it's the first case-insensitive match, store
-		 * the index and buffer and continue looking for an exact match.
+		 * Compare name and if it's a match, return the
+		 * index and buffer.
 		 */
 		cmp = mp->m_dirnameops->compname(args, dep->name, dep->namelen);
-		if (cmp != XFS_CMP_DIFFERENT && cmp != args->cmpresult) {
+		if (cmp != XFS_CMP_DIFFERENT) {
 			args->cmpresult = cmp;
 			*bpp = bp;
 			*entno = mid;
-			if (cmp == XFS_CMP_EXACT)
-				return 0;
+			return 0;
 		}
 	} while (++mid < be32_to_cpu(btp->count) &&
 			be32_to_cpu(blp[mid].hashval) == hash);
 
 	ASSERT(args->op_flags & XFS_DA_OP_OKNOENT);
-	/*
-	 * Here, we can only be doing a lookup (not a rename or replace).
-	 * If a case-insensitive match was found earlier, return success.
-	 */
-	if (args->cmpresult == XFS_CMP_CASE)
-		return 0;
+	ASSERT(args->cmpresult == XFS_CMP_DIFFERENT);
 	/*
 	 * No match, release the buffer and return ENOENT.
 	 */
diff --git a/fs/xfs/libxfs/xfs_dir2_leaf.c b/fs/xfs/libxfs/xfs_dir2_leaf.c
index a19174e..3d572ee 100644
--- a/fs/xfs/libxfs/xfs_dir2_leaf.c
+++ b/fs/xfs/libxfs/xfs_dir2_leaf.c
@@ -1226,7 +1226,6 @@ xfs_dir2_leaf_lookup_int(
 	xfs_mount_t		*mp;		/* filesystem mount point */
 	xfs_dir2_db_t		newdb;		/* new data block number */
 	xfs_trans_t		*tp;		/* transaction pointer */
-	xfs_dir2_db_t		cidb = -1;	/* case match data block no. */
 	enum xfs_dacmp		cmp;		/* name compare result */
 	struct xfs_dir2_leaf_entry *ents;
 	struct xfs_dir3_icleaf_hdr leafhdr;
@@ -1290,46 +1289,22 @@ xfs_dir2_leaf_lookup_int(
 						be32_to_cpu(lep->address)));
 		/*
 		 * Compare name and if it's an exact match, return the index
-		 * and buffer. If it's the first case-insensitive match, store
-		 * the index and buffer and continue looking for an exact match.
+		 * and buffer
 		 */
 		cmp = mp->m_dirnameops->compname(args, dep->name, dep->namelen);
-		if (cmp != XFS_CMP_DIFFERENT && cmp != args->cmpresult) {
+		if (cmp != XFS_CMP_DIFFERENT) {
 			args->cmpresult = cmp;
 			*indexp = index;
-			/* case exact match: return the current buffer. */
-			if (cmp == XFS_CMP_EXACT) {
-				*dbpp = dbp;
-				return 0;
-			}
-			cidb = curdb;
+			*dbpp = dbp;
+			return 0;
 		}
 	}
 	ASSERT(args->op_flags & XFS_DA_OP_OKNOENT);
-	/*
-	 * Here, we can only be doing a lookup (not a rename or remove).
-	 * If a case-insensitive match was found earlier, re-read the
-	 * appropriate data block if required and return it.
-	 */
-	if (args->cmpresult == XFS_CMP_CASE) {
-		ASSERT(cidb != -1);
-		if (cidb != curdb) {
-			xfs_trans_brelse(tp, dbp);
-			error = xfs_dir3_data_read(tp, dp,
-					   xfs_dir2_db_to_da(args->geo, cidb),
-					   -1, &dbp);
-			if (error) {
-				xfs_trans_brelse(tp, lbp);
-				return error;
-			}
-		}
-		*dbpp = dbp;
-		return 0;
-	}
+	ASSERT(args->cmpresult == XFS_CMP_DIFFERENT);
+
 	/*
 	 * No match found, return -ENOENT.
 	 */
-	ASSERT(cidb == -1);
 	if (dbp)
 		xfs_trans_brelse(tp, dbp);
 	xfs_trans_brelse(tp, lbp);
diff --git a/fs/xfs/libxfs/xfs_dir2_node.c b/fs/xfs/libxfs/xfs_dir2_node.c
index 2ae6ac2..1778c40 100644
--- a/fs/xfs/libxfs/xfs_dir2_node.c
+++ b/fs/xfs/libxfs/xfs_dir2_node.c
@@ -679,6 +679,7 @@ xfs_dir2_leafn_lookup_for_entry(
 	xfs_dir2_data_entry_t	*dep;		/* data block entry */
 	xfs_inode_t		*dp;		/* incore directory inode */
 	int			error;		/* error return value */
+	int			di = -1;	/* data entry index */
 	int			index;		/* leaf entry index */
 	xfs_dir2_leaf_t		*leaf;		/* leaf structure */
 	xfs_dir2_leaf_entry_t	*lep;		/* leaf entry */
@@ -709,6 +710,7 @@ xfs_dir2_leafn_lookup_for_entry(
 	if (state->extravalid) {
 		curbp = state->extrablk.bp;
 		curdb = state->extrablk.blkno;
+		di = state->extrablk.index;
 	}
 	/*
 	 * Loop over leaf entries with the right hash value.
@@ -734,28 +736,20 @@ xfs_dir2_leafn_lookup_for_entry(
 		 */
 		if (newdb != curdb) {
 			/*
-			 * If we had a block before that we aren't saving
-			 * for a CI name, drop it
+			 * If we had a block, drop it
 			 */
-			if (curbp && (args->cmpresult == XFS_CMP_DIFFERENT ||
-						curdb != state->extrablk.blkno))
+			if (curbp) {
 				xfs_trans_brelse(tp, curbp);
+				di = -1;
+			}
 			/*
-			 * If needing the block that is saved with a CI match,
-			 * use it otherwise read in the new data block.
+			 * Read in the new data block.
 			 */
-			if (args->cmpresult != XFS_CMP_DIFFERENT &&
-					newdb == state->extrablk.blkno) {
-				ASSERT(state->extravalid);
-				curbp = state->extrablk.bp;
-			} else {
-				error = xfs_dir3_data_read(tp, dp,
-						xfs_dir2_db_to_da(args->geo,
-								  newdb),
+			error = xfs_dir3_data_read(tp, dp,
+					xfs_dir2_db_to_da(args->geo, newdb),
 						-1, &curbp);
-				if (error)
-					return error;
-			}
+			if (error)
+				return error;
 			xfs_dir3_data_check(dp, curbp);
 			curdb = newdb;
 		}
@@ -766,53 +760,40 @@ xfs_dir2_leafn_lookup_for_entry(
 			xfs_dir2_dataptr_to_off(args->geo,
 						be32_to_cpu(lep->address)));
 		/*
-		 * Compare the entry and if it's an exact match, return
-		 * EEXIST immediately. If it's the first case-insensitive
-		 * match, store the block & inode number and continue looking.
+		 * Compare the entry and if it's a match, return
+		 * EEXIST immediately.
 		 */
 		cmp = mp->m_dirnameops->compname(args, dep->name, dep->namelen);
-		if (cmp != XFS_CMP_DIFFERENT && cmp != args->cmpresult) {
-			/* If there is a CI match block, drop it */
-			if (args->cmpresult != XFS_CMP_DIFFERENT &&
-						curdb != state->extrablk.blkno)
-				xfs_trans_brelse(tp, state->extrablk.bp);
+		if (cmp != XFS_CMP_DIFFERENT) {
 			args->cmpresult = cmp;
 			args->inumber = be64_to_cpu(dep->inumber);
 			args->filetype = dp->d_ops->data_get_ftype(dep);
-			*indexp = index;
-			state->extravalid = 1;
-			state->extrablk.bp = curbp;
-			state->extrablk.blkno = curdb;
-			state->extrablk.index = (int)((char *)dep -
-							(char *)curbp->b_addr);
-			state->extrablk.magic = XFS_DIR2_DATA_MAGIC;
 			curbp->b_ops = &xfs_dir3_data_buf_ops;
 			xfs_trans_buf_set_type(tp, curbp, XFS_BLFT_DIR_DATA_BUF);
-			if (cmp == XFS_CMP_EXACT)
-				return -EEXIST;
+			di = (int)((char *)dep - (char *)curbp->b_addr);
+			error = -EEXIST;
+			goto out;
+
 		}
 	}
+	/* Didn't find a match */
+	error = -ENOENT;
 	ASSERT(index == leafhdr.count || (args->op_flags & XFS_DA_OP_OKNOENT));
+out:
 	if (curbp) {
-		if (args->cmpresult == XFS_CMP_DIFFERENT) {
-			/* Giving back last used data block. */
-			state->extravalid = 1;
-			state->extrablk.bp = curbp;
-			state->extrablk.index = -1;
-			state->extrablk.blkno = curdb;
-			state->extrablk.magic = XFS_DIR2_DATA_MAGIC;
-			curbp->b_ops = &xfs_dir3_data_buf_ops;
-			xfs_trans_buf_set_type(tp, curbp, XFS_BLFT_DIR_DATA_BUF);
-		} else {
-			/* If the curbp is not the CI match block, drop it */
-			if (state->extrablk.bp != curbp)
-				xfs_trans_brelse(tp, curbp);
-		}
+		/* Giving back last used data block. */
+		state->extravalid = 1;
+		state->extrablk.bp = curbp;
+		state->extrablk.index = di;
+		state->extrablk.blkno = curdb;
+		state->extrablk.magic = XFS_DIR2_DATA_MAGIC;
+		curbp->b_ops = &xfs_dir3_data_buf_ops;
+		xfs_trans_buf_set_type(tp, curbp, XFS_BLFT_DIR_DATA_BUF);
 	} else {
 		state->extravalid = 0;
 	}
 	*indexp = index;
-	return -ENOENT;
+	return error;
 }
 
 /*
diff --git a/fs/xfs/libxfs/xfs_dir2_sf.c b/fs/xfs/libxfs/xfs_dir2_sf.c
index 5079e05..e69fdb7 100644
--- a/fs/xfs/libxfs/xfs_dir2_sf.c
+++ b/fs/xfs/libxfs/xfs_dir2_sf.c
@@ -757,19 +757,19 @@ xfs_dir2_sf_lookup(
 	for (i = 0, sfep = xfs_dir2_sf_firstentry(sfp); i < sfp->count;
 	     i++, sfep = dp->d_ops->sf_nextentry(sfp, sfep)) {
 		/*
-		 * Compare name and if it's an exact match, return the inode
-		 * number. If it's the first case-insensitive match, store the
-		 * inode number and continue looking for an exact match.
+		 * Compare name and if it's a match, return the inode
+		 * number.
 		 */
 		cmp = dp->i_mount->m_dirnameops->compname(args, sfep->name,
 								sfep->namelen);
-		if (cmp != XFS_CMP_DIFFERENT && cmp != args->cmpresult) {
+		if (cmp != XFS_CMP_DIFFERENT) {
 			args->cmpresult = cmp;
 			args->inumber = dp->d_ops->sf_get_ino(sfp, sfep);
 			args->filetype = dp->d_ops->sf_get_ftype(sfep);
 			if (cmp == XFS_CMP_EXACT)
 				return -EEXIST;
 			ci_sfep = sfep;
+			break;
 		}
 	}
 	ASSERT(args->op_flags & XFS_DA_OP_OKNOENT);
-- 
1.7.12.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 02/10] xfs: rename XFS_CMP_CASE to XFS_CMP_MATCH
  2014-09-18 19:56 [RFC v2] Unicode/UTF-8 support for XFS Ben Myers
  2014-09-18 20:08 ` [PATCH 01/10] xfs: return the first match during case-insensitive lookup Ben Myers
@ 2014-09-18 20:09 ` Ben Myers
  2014-09-18 20:09 ` [PATCH 03/13] libxfs: add xfs_nameops.normhash Ben Myers
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-18 20:09 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: tinguely, olaf, xfs

From: Olaf Weber <olaf@sgi.com>

Rename XFS_CMP_CASE to XFS_CMP_MATCH. With unicode filenames and
normalization, different strings will match on other criteria than
case insensitivity.

Signed-off-by: Olaf Weber <olaf@sgi.com>
---
 fs/xfs/libxfs/xfs_da_btree.h  | 2 +-
 fs/xfs/libxfs/xfs_dir2.c      | 9 ++++++---
 fs/xfs/libxfs/xfs_dir2_node.c | 2 +-
 3 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_da_btree.h b/fs/xfs/libxfs/xfs_da_btree.h
index 6e153e3..9ebcc23 100644
--- a/fs/xfs/libxfs/xfs_da_btree.h
+++ b/fs/xfs/libxfs/xfs_da_btree.h
@@ -52,7 +52,7 @@ struct xfs_da_geometry {
 enum xfs_dacmp {
 	XFS_CMP_DIFFERENT,	/* names are completely different */
 	XFS_CMP_EXACT,		/* names are exactly the same */
-	XFS_CMP_CASE		/* names are same but differ in case */
+	XFS_CMP_MATCH		/* names are same but differ in encoding */
 };
 
 /*
diff --git a/fs/xfs/libxfs/xfs_dir2.c b/fs/xfs/libxfs/xfs_dir2.c
index 6cef221..32e769b 100644
--- a/fs/xfs/libxfs/xfs_dir2.c
+++ b/fs/xfs/libxfs/xfs_dir2.c
@@ -74,7 +74,7 @@ xfs_ascii_ci_compname(
 			continue;
 		if (tolower(args->name[i]) != tolower(name[i]))
 			return XFS_CMP_DIFFERENT;
-		result = XFS_CMP_CASE;
+		result = XFS_CMP_MATCH;
 	}
 
 	return result;
@@ -315,8 +315,11 @@ xfs_dir_cilookup_result(
 {
 	if (args->cmpresult == XFS_CMP_DIFFERENT)
 		return -ENOENT;
-	if (args->cmpresult != XFS_CMP_CASE ||
-					!(args->op_flags & XFS_DA_OP_CILOOKUP))
+	if (args->cmpresult == XFS_CMP_EXACT)
+		return -EEXIST;
+	ASSERT(args->cmpresult == XFS_CMP_MATCH);
+	/* Only dup the found name if XFS_DA_OP_CILOOKUP is set. */
+	if (!(args->op_flags & XFS_DA_OP_CILOOKUP))
 		return -EEXIST;
 
 	args->value = kmem_alloc(len, KM_NOFS | KM_MAYFAIL);
diff --git a/fs/xfs/libxfs/xfs_dir2_node.c b/fs/xfs/libxfs/xfs_dir2_node.c
index 1778c40..9d46e8d 100644
--- a/fs/xfs/libxfs/xfs_dir2_node.c
+++ b/fs/xfs/libxfs/xfs_dir2_node.c
@@ -2023,7 +2023,7 @@ xfs_dir2_node_lookup(
 	error = xfs_da3_node_lookup_int(state, &rval);
 	if (error)
 		rval = error;
-	else if (rval == -ENOENT && args->cmpresult == XFS_CMP_CASE) {
+	else if (rval == -ENOENT && args->cmpresult == XFS_CMP_MATCH) {
 		/* If a CI match, dup the actual name and return -EEXIST */
 		xfs_dir2_data_entry_t	*dep;
 
-- 
1.7.12.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 03/13] libxfs: add xfs_nameops.normhash
  2014-09-18 19:56 [RFC v2] Unicode/UTF-8 support for XFS Ben Myers
  2014-09-18 20:08 ` [PATCH 01/10] xfs: return the first match during case-insensitive lookup Ben Myers
  2014-09-18 20:09 ` [PATCH 02/10] xfs: rename XFS_CMP_CASE to XFS_CMP_MATCH Ben Myers
@ 2014-09-18 20:09 ` Ben Myers
  2014-09-18 20:10   ` Ben Myers
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-18 20:09 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: tinguely, olaf, xfs

From: Olaf Weber <olaf@sgi.com>

Add a normhash callout to the xfs_nameops. This callout takes an xfs_da_args
structure as its argument, and calculates a hash value over the name. It may
in the process create a normalized form of the name, and assign that to the
norm/normlen fields in the xfs_da_args structure.

Changes:
 The pointer in kmem_free() was type converted to suppress compiler
 warnings.

Signed-off-by: Olaf Weber <olaf@sgi.com>
---
 include/xfs_da_btree.h |  5 ++++-
 libxfs/xfs_da_btree.c  |  9 ++++++++
 libxfs/xfs_dir2.c      | 56 +++++++++++++++++++++++++++++++++++++++-----------
 3 files changed, 57 insertions(+), 13 deletions(-)

diff --git a/include/xfs_da_btree.h b/include/xfs_da_btree.h
index 3d9f9dd..06b50bf 100644
--- a/include/xfs_da_btree.h
+++ b/include/xfs_da_btree.h
@@ -42,7 +42,9 @@ enum xfs_dacmp {
  */
 typedef struct xfs_da_args {
 	const __uint8_t	*name;		/* string (maybe not NULL terminated) */
-	int		namelen;	/* length of string (maybe no NULL) */
+	const __uint8_t	*norm;		/* normalized name (may be NULL) */
+ 	int		namelen;	/* length of string (maybe no NULL) */
+	int		normlen;	/* length of normalized name */
 	__uint8_t	filetype;	/* filetype of inode for directories */
 	__uint8_t	*value;		/* set of bytes (maybe contain NULLs) */
 	int		valuelen;	/* length of value */
@@ -131,6 +133,7 @@ typedef struct xfs_da_state {
  */
 struct xfs_nameops {
 	xfs_dahash_t	(*hashname)(struct xfs_name *);
+	int		(*normhash)(struct xfs_da_args *);
 	enum xfs_dacmp	(*compname)(struct xfs_da_args *,
 					const unsigned char *, int);
 };
diff --git a/libxfs/xfs_da_btree.c b/libxfs/xfs_da_btree.c
index b731b54..eb97317 100644
--- a/libxfs/xfs_da_btree.c
+++ b/libxfs/xfs_da_btree.c
@@ -2000,8 +2000,17 @@ xfs_default_hashname(
 	return xfs_da_hashname(name->name, name->len);
 }
 
+STATIC int
+xfs_da_normhash(
+	struct xfs_da_args *args)
+{
+	args->hashval = xfs_da_hashname(args->name, args->namelen);
+	return 0;
+}
+
 const struct xfs_nameops xfs_default_nameops = {
 	.hashname	= xfs_default_hashname,
+	.normhash	= xfs_da_normhash,
 	.compname	= xfs_da_compname
 };
 
diff --git a/libxfs/xfs_dir2.c b/libxfs/xfs_dir2.c
index 57e98a3..e52d082 100644
--- a/libxfs/xfs_dir2.c
+++ b/libxfs/xfs_dir2.c
@@ -54,6 +54,21 @@ xfs_ascii_ci_hashname(
 	return hash;
 }
 
+STATIC int
+xfs_ascii_ci_normhash(
+	struct xfs_da_args *args)
+{
+	xfs_dahash_t	hash;
+	int		i;
+
+	for (i = 0, hash = 0; i < args->namelen; i++)
+		hash = tolower(args->name[i]) ^ rol32(hash, 7);
+
+	args->hashval = hash;
+	return 0;
+}
+
+
 STATIC enum xfs_dacmp
 xfs_ascii_ci_compname(
 	struct xfs_da_args *args,
@@ -80,6 +95,7 @@ xfs_ascii_ci_compname(
 
 static struct xfs_nameops xfs_ascii_ci_nameops = {
 	.hashname	= xfs_ascii_ci_hashname,
+	.normhash	= xfs_ascii_ci_normhash,
 	.compname	= xfs_ascii_ci_compname,
 };
 
@@ -211,7 +227,6 @@ xfs_dir_createname(
 	args.name = name->name;
 	args.namelen = name->len;
 	args.filetype = name->type;
-	args.hashval = dp->i_mount->m_dirnameops->hashname(name);
 	args.inumber = inum;
 	args.dp = dp;
 	args.firstblock = first;
@@ -220,19 +235,24 @@ xfs_dir_createname(
 	args.whichfork = XFS_DATA_FORK;
 	args.trans = tp;
 	args.op_flags = XFS_DA_OP_ADDNAME | XFS_DA_OP_OKNOENT;
+	if ((rval = dp->i_mount->m_dirnameops->normhash(&args)))
+		return rval;
 
 	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL)
 		rval = xfs_dir2_sf_addname(&args);
 	else if ((rval = xfs_dir2_isblock(tp, dp, &v)))
-		return rval;
+		goto out_free;
 	else if (v)
 		rval = xfs_dir2_block_addname(&args);
 	else if ((rval = xfs_dir2_isleaf(tp, dp, &v)))
-		return rval;
+		goto out_free;
 	else if (v)
 		rval = xfs_dir2_leaf_addname(&args);
 	else
 		rval = xfs_dir2_node_addname(&args);
+out_free:
+	if (args.norm)
+		kmem_free((void *)args.norm);
 	return rval;
 }
 
@@ -289,22 +309,23 @@ xfs_dir_lookup(
 	args.name = name->name;
 	args.namelen = name->len;
 	args.filetype = name->type;
-	args.hashval = dp->i_mount->m_dirnameops->hashname(name);
 	args.dp = dp;
 	args.whichfork = XFS_DATA_FORK;
 	args.trans = tp;
 	args.op_flags = XFS_DA_OP_OKNOENT;
 	if (ci_name)
 		args.op_flags |= XFS_DA_OP_CILOOKUP;
+	if ((rval = dp->i_mount->m_dirnameops->normhash(&args)))
+		return rval;
 
 	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL)
 		rval = xfs_dir2_sf_lookup(&args);
 	else if ((rval = xfs_dir2_isblock(tp, dp, &v)))
-		return rval;
+		goto out_free;
 	else if (v)
 		rval = xfs_dir2_block_lookup(&args);
 	else if ((rval = xfs_dir2_isleaf(tp, dp, &v)))
-		return rval;
+		goto out_free;
 	else if (v)
 		rval = xfs_dir2_leaf_lookup(&args);
 	else
@@ -318,6 +339,9 @@ xfs_dir_lookup(
 			ci_name->len = args.valuelen;
 		}
 	}
+out_free:
+	if (args.norm)
+		kmem_free((void *)args.norm);
 	return rval;
 }
 
@@ -345,7 +369,6 @@ xfs_dir_removename(
 	args.name = name->name;
 	args.namelen = name->len;
 	args.filetype = name->type;
-	args.hashval = dp->i_mount->m_dirnameops->hashname(name);
 	args.inumber = ino;
 	args.dp = dp;
 	args.firstblock = first;
@@ -353,19 +376,24 @@ xfs_dir_removename(
 	args.total = total;
 	args.whichfork = XFS_DATA_FORK;
 	args.trans = tp;
+	if ((rval = dp->i_mount->m_dirnameops->normhash(&args)))
+		return rval;
 
 	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL)
 		rval = xfs_dir2_sf_removename(&args);
 	else if ((rval = xfs_dir2_isblock(tp, dp, &v)))
-		return rval;
+		goto out_free;
 	else if (v)
 		rval = xfs_dir2_block_removename(&args);
 	else if ((rval = xfs_dir2_isleaf(tp, dp, &v)))
-		return rval;
+		goto out_free;
 	else if (v)
 		rval = xfs_dir2_leaf_removename(&args);
 	else
 		rval = xfs_dir2_node_removename(&args);
+out_free:
+	if (args.norm)
+		kmem_free((void *)args.norm);
 	return rval;
 }
 
@@ -395,7 +423,6 @@ xfs_dir_replace(
 	args.name = name->name;
 	args.namelen = name->len;
 	args.filetype = name->type;
-	args.hashval = dp->i_mount->m_dirnameops->hashname(name);
 	args.inumber = inum;
 	args.dp = dp;
 	args.firstblock = first;
@@ -403,19 +430,24 @@ xfs_dir_replace(
 	args.total = total;
 	args.whichfork = XFS_DATA_FORK;
 	args.trans = tp;
+	if ((rval = dp->i_mount->m_dirnameops->normhash(&args)))
+		return rval;
 
 	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL)
 		rval = xfs_dir2_sf_replace(&args);
 	else if ((rval = xfs_dir2_isblock(tp, dp, &v)))
-		return rval;
+		goto out_free;
 	else if (v)
 		rval = xfs_dir2_block_replace(&args);
 	else if ((rval = xfs_dir2_isleaf(tp, dp, &v)))
-		return rval;
+		goto out_free;
 	else if (v)
 		rval = xfs_dir2_leaf_replace(&args);
 	else
 		rval = xfs_dir2_node_replace(&args);
+out_free:
+	if (args.norm)
+		kmem_free((void *)args.norm);
 	return rval;
 }
 
-- 
1.7.12.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 04/10] xfs: change interface of xfs_nameops.normhash
  2014-09-18 19:56 [RFC v2] Unicode/UTF-8 support for XFS Ben Myers
@ 2014-09-18 20:10   ` Ben Myers
  2014-09-18 20:09 ` [PATCH 02/10] xfs: rename XFS_CMP_CASE to XFS_CMP_MATCH Ben Myers
                     ` (15 subsequent siblings)
  16 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-18 20:10 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: xfs, olaf, tinguely

From: Olaf Weber <olaf@sgi.com>

With the introduction of the xfs_nameops.normhash callout, all uses of the
hashname callout now occur in places where an xfs_name structure must be
explicitly created just to match the parameter passing convention of this
callout. Change the arguments to a const unsigned char * and int instead.

Signed-off-by: Olaf Weber <olaf@sgi.com>
---
 fs/xfs/libxfs/xfs_da_btree.c   | 9 +--------
 fs/xfs/libxfs/xfs_da_btree.h   | 2 +-
 fs/xfs/libxfs/xfs_dir2.c       | 7 ++++---
 fs/xfs/libxfs/xfs_dir2_block.c | 2 +-
 fs/xfs/libxfs/xfs_dir2_data.c  | 3 ++-
 5 files changed, 9 insertions(+), 14 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
index 07a3acf..a0608ca 100644
--- a/fs/xfs/libxfs/xfs_da_btree.c
+++ b/fs/xfs/libxfs/xfs_da_btree.c
@@ -1983,13 +1983,6 @@ xfs_da_compname(
 					XFS_CMP_EXACT : XFS_CMP_DIFFERENT;
 }
 
-static xfs_dahash_t
-xfs_default_hashname(
-	struct xfs_name	*name)
-{
-	return xfs_da_hashname(name->name, name->len);
-}
-
 STATIC int
 xfs_da_normhash(
 	struct xfs_da_args *args)
@@ -1999,7 +1992,7 @@ xfs_da_normhash(
 }
 
 const struct xfs_nameops xfs_default_nameops = {
-	.hashname	= xfs_default_hashname,
+	.hashname	= xfs_da_hashname,
 	.normhash	= xfs_da_normhash,
 	.compname	= xfs_da_compname
 };
diff --git a/fs/xfs/libxfs/xfs_da_btree.h b/fs/xfs/libxfs/xfs_da_btree.h
index 6cdafee..4d6b36f 100644
--- a/fs/xfs/libxfs/xfs_da_btree.h
+++ b/fs/xfs/libxfs/xfs_da_btree.h
@@ -151,7 +151,7 @@ typedef struct xfs_da_state {
  * Name ops for directory and/or attr name operations
  */
 struct xfs_nameops {
-	xfs_dahash_t	(*hashname)(struct xfs_name *);
+	xfs_dahash_t	(*hashname)(const unsigned char *, int);
 	int		(*normhash)(struct xfs_da_args *);
 	enum xfs_dacmp	(*compname)(struct xfs_da_args *,
 					const unsigned char *, int);
diff --git a/fs/xfs/libxfs/xfs_dir2.c b/fs/xfs/libxfs/xfs_dir2.c
index 55733a6..84e5ca9 100644
--- a/fs/xfs/libxfs/xfs_dir2.c
+++ b/fs/xfs/libxfs/xfs_dir2.c
@@ -45,13 +45,14 @@ struct xfs_name xfs_name_dotdot = { (unsigned char *)"..", 2, XFS_DIR3_FT_DIR };
  */
 STATIC xfs_dahash_t
 xfs_ascii_ci_hashname(
-	struct xfs_name	*name)
+	const unsigned char *name,
+	int len)
 {
 	xfs_dahash_t	hash;
 	int		i;
 
-	for (i = 0, hash = 0; i < name->len; i++)
-		hash = tolower(name->name[i]) ^ rol32(hash, 7);
+	for (i = 0, hash = 0; i < len; i++)
+		hash = tolower(name[i]) ^ rol32(hash, 7);
 
 	return hash;
 }
diff --git a/fs/xfs/libxfs/xfs_dir2_block.c b/fs/xfs/libxfs/xfs_dir2_block.c
index 990bf0c..f93c141 100644
--- a/fs/xfs/libxfs/xfs_dir2_block.c
+++ b/fs/xfs/libxfs/xfs_dir2_block.c
@@ -1231,7 +1231,7 @@ xfs_dir2_sf_to_block(
 		name.name = sfep->name;
 		name.len = sfep->namelen;
 		blp[2 + i].hashval = cpu_to_be32(mp->m_dirnameops->
-							hashname(&name));
+					hashname(sfep->name, sfep->namelen));
 		blp[2 + i].address = cpu_to_be32(xfs_dir2_byte_to_dataptr(
 						 (char *)dep - (char *)hdr));
 		offset = (int)((char *)(tagp + 1) - (char *)hdr);
diff --git a/fs/xfs/libxfs/xfs_dir2_data.c b/fs/xfs/libxfs/xfs_dir2_data.c
index fdd803f..28c35cf 100644
--- a/fs/xfs/libxfs/xfs_dir2_data.c
+++ b/fs/xfs/libxfs/xfs_dir2_data.c
@@ -179,7 +179,8 @@ __xfs_dir3_data_check(
 						((char *)dep - (char *)hdr));
 			name.name = dep->name;
 			name.len = dep->namelen;
-			hash = mp->m_dirnameops->hashname(&name);
+			hash = mp->m_dirnameops->hashname(dep->name,
+					dep->namelen);
 			for (i = 0; i < be32_to_cpu(btp->count); i++) {
 				if (be32_to_cpu(lep[i].address) == addr &&
 				    be32_to_cpu(lep[i].hashval) == hash)
-- 
1.7.12.4


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 04/10] xfs: change interface of xfs_nameops.normhash
@ 2014-09-18 20:10   ` Ben Myers
  0 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-18 20:10 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: tinguely, olaf, xfs

From: Olaf Weber <olaf@sgi.com>

With the introduction of the xfs_nameops.normhash callout, all uses of the
hashname callout now occur in places where an xfs_name structure must be
explicitly created just to match the parameter passing convention of this
callout. Change the arguments to a const unsigned char * and int instead.

Signed-off-by: Olaf Weber <olaf@sgi.com>
---
 fs/xfs/libxfs/xfs_da_btree.c   | 9 +--------
 fs/xfs/libxfs/xfs_da_btree.h   | 2 +-
 fs/xfs/libxfs/xfs_dir2.c       | 7 ++++---
 fs/xfs/libxfs/xfs_dir2_block.c | 2 +-
 fs/xfs/libxfs/xfs_dir2_data.c  | 3 ++-
 5 files changed, 9 insertions(+), 14 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
index 07a3acf..a0608ca 100644
--- a/fs/xfs/libxfs/xfs_da_btree.c
+++ b/fs/xfs/libxfs/xfs_da_btree.c
@@ -1983,13 +1983,6 @@ xfs_da_compname(
 					XFS_CMP_EXACT : XFS_CMP_DIFFERENT;
 }
 
-static xfs_dahash_t
-xfs_default_hashname(
-	struct xfs_name	*name)
-{
-	return xfs_da_hashname(name->name, name->len);
-}
-
 STATIC int
 xfs_da_normhash(
 	struct xfs_da_args *args)
@@ -1999,7 +1992,7 @@ xfs_da_normhash(
 }
 
 const struct xfs_nameops xfs_default_nameops = {
-	.hashname	= xfs_default_hashname,
+	.hashname	= xfs_da_hashname,
 	.normhash	= xfs_da_normhash,
 	.compname	= xfs_da_compname
 };
diff --git a/fs/xfs/libxfs/xfs_da_btree.h b/fs/xfs/libxfs/xfs_da_btree.h
index 6cdafee..4d6b36f 100644
--- a/fs/xfs/libxfs/xfs_da_btree.h
+++ b/fs/xfs/libxfs/xfs_da_btree.h
@@ -151,7 +151,7 @@ typedef struct xfs_da_state {
  * Name ops for directory and/or attr name operations
  */
 struct xfs_nameops {
-	xfs_dahash_t	(*hashname)(struct xfs_name *);
+	xfs_dahash_t	(*hashname)(const unsigned char *, int);
 	int		(*normhash)(struct xfs_da_args *);
 	enum xfs_dacmp	(*compname)(struct xfs_da_args *,
 					const unsigned char *, int);
diff --git a/fs/xfs/libxfs/xfs_dir2.c b/fs/xfs/libxfs/xfs_dir2.c
index 55733a6..84e5ca9 100644
--- a/fs/xfs/libxfs/xfs_dir2.c
+++ b/fs/xfs/libxfs/xfs_dir2.c
@@ -45,13 +45,14 @@ struct xfs_name xfs_name_dotdot = { (unsigned char *)"..", 2, XFS_DIR3_FT_DIR };
  */
 STATIC xfs_dahash_t
 xfs_ascii_ci_hashname(
-	struct xfs_name	*name)
+	const unsigned char *name,
+	int len)
 {
 	xfs_dahash_t	hash;
 	int		i;
 
-	for (i = 0, hash = 0; i < name->len; i++)
-		hash = tolower(name->name[i]) ^ rol32(hash, 7);
+	for (i = 0, hash = 0; i < len; i++)
+		hash = tolower(name[i]) ^ rol32(hash, 7);
 
 	return hash;
 }
diff --git a/fs/xfs/libxfs/xfs_dir2_block.c b/fs/xfs/libxfs/xfs_dir2_block.c
index 990bf0c..f93c141 100644
--- a/fs/xfs/libxfs/xfs_dir2_block.c
+++ b/fs/xfs/libxfs/xfs_dir2_block.c
@@ -1231,7 +1231,7 @@ xfs_dir2_sf_to_block(
 		name.name = sfep->name;
 		name.len = sfep->namelen;
 		blp[2 + i].hashval = cpu_to_be32(mp->m_dirnameops->
-							hashname(&name));
+					hashname(sfep->name, sfep->namelen));
 		blp[2 + i].address = cpu_to_be32(xfs_dir2_byte_to_dataptr(
 						 (char *)dep - (char *)hdr));
 		offset = (int)((char *)(tagp + 1) - (char *)hdr);
diff --git a/fs/xfs/libxfs/xfs_dir2_data.c b/fs/xfs/libxfs/xfs_dir2_data.c
index fdd803f..28c35cf 100644
--- a/fs/xfs/libxfs/xfs_dir2_data.c
+++ b/fs/xfs/libxfs/xfs_dir2_data.c
@@ -179,7 +179,8 @@ __xfs_dir3_data_check(
 						((char *)dep - (char *)hdr));
 			name.name = dep->name;
 			name.len = dep->namelen;
-			hash = mp->m_dirnameops->hashname(&name);
+			hash = mp->m_dirnameops->hashname(dep->name,
+					dep->namelen);
 			for (i = 0; i < be32_to_cpu(btp->count); i++) {
 				if (be32_to_cpu(lep[i].address) == addr &&
 				    be32_to_cpu(lep[i].hashval) == hash)
-- 
1.7.12.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 05/10] xfs: add a superblock feature bit to indicate UTF-8 support.
  2014-09-18 19:56 [RFC v2] Unicode/UTF-8 support for XFS Ben Myers
                   ` (3 preceding siblings ...)
  2014-09-18 20:10   ` Ben Myers
@ 2014-09-18 20:11 ` Ben Myers
  2014-09-18 20:13 ` [PATCH 03/10] xfs: add xfs_nameops.normhash Ben Myers
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-18 20:11 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: tinguely, olaf, xfs

From: Olaf Weber <olaf@sgi.com>

When UTF-8 support is enabled, the xfs_dir_ci_inode_operations must be
installed. Add xfs_sb_version_hasci(), which tests both the borgbit and
the utf8bit, and returns true if at least one of them is set. Replace
calls to xfs_sb_version_hasasciici() as needed.

Signed-off-by: Olaf Weber <olaf@sgi.com>
---
 fs/xfs/libxfs/xfs_sb.h | 24 +++++++++++++++++++++++-
 fs/xfs/xfs_fs.h        |  1 +
 fs/xfs/xfs_fsops.c     |  4 +++-
 fs/xfs/xfs_iops.c      |  4 ++--
 4 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_sb.h b/fs/xfs/libxfs/xfs_sb.h
index 2e73970..525eacb 100644
--- a/fs/xfs/libxfs/xfs_sb.h
+++ b/fs/xfs/libxfs/xfs_sb.h
@@ -70,6 +70,7 @@ struct xfs_trans;
 #define XFS_SB_VERSION2_RESERVED4BIT	0x00000004
 #define XFS_SB_VERSION2_ATTR2BIT	0x00000008	/* Inline attr rework */
 #define XFS_SB_VERSION2_PARENTBIT	0x00000010	/* parent pointers */
+#define XFS_SB_VERSION2_UTF8BIT		0x00000020      /* utf8 names */
 #define XFS_SB_VERSION2_PROJID32BIT	0x00000080	/* 32 bit project id */
 #define XFS_SB_VERSION2_CRCBIT		0x00000100	/* metadata CRCs */
 #define XFS_SB_VERSION2_FTYPE		0x00000200	/* inode type in dir */
@@ -77,6 +78,7 @@ struct xfs_trans;
 #define	XFS_SB_VERSION2_OKBITS		\
 	(XFS_SB_VERSION2_LAZYSBCOUNTBIT	| \
 	 XFS_SB_VERSION2_ATTR2BIT	| \
+	 XFS_SB_VERSION2_UTF8BIT	| \
 	 XFS_SB_VERSION2_PROJID32BIT	| \
 	 XFS_SB_VERSION2_FTYPE)
 
@@ -509,8 +511,10 @@ xfs_sb_has_ro_compat_feature(
 }
 
 #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
+#define XFS_SB_FEAT_INCOMPAT_UTF8	(1 << 1)	/* utf-8 name support */
 #define XFS_SB_FEAT_INCOMPAT_ALL \
-		(XFS_SB_FEAT_INCOMPAT_FTYPE)
+		(XFS_SB_FEAT_INCOMPAT_FTYPE | \
+		 XFS_SB_FEAT_INCOMPAT_UTF8)
 
 #define XFS_SB_FEAT_INCOMPAT_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_ALL
 static inline bool
@@ -558,6 +562,24 @@ static inline int xfs_sb_version_hasfinobt(xfs_sb_t *sbp)
 		(sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_FINOBT);
 }
 
+static inline int xfs_sb_version_hasutf8(xfs_sb_t *sbp)
+{
+	return (XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5 &&
+		xfs_sb_has_incompat_feature(sbp, XFS_SB_FEAT_INCOMPAT_UTF8)) ||
+		(xfs_sb_version_hasmorebits(sbp) &&
+		(sbp->sb_features2 & XFS_SB_VERSION2_UTF8BIT));
+}
+
+/*
+ * Special case: there are a number of places where we need to test
+ * both the borgbit and the utf8bit, and take the same action if
+ * either of those is set.
+ */
+static inline int xfs_sb_version_hasci(xfs_sb_t *sbp)
+{
+	return xfs_sb_version_hasasciici(sbp) || xfs_sb_version_hasutf8(sbp);
+}
+
 /*
  * end of superblock version macros
  */
diff --git a/fs/xfs/xfs_fs.h b/fs/xfs/xfs_fs.h
index 18dc721..e845d75 100644
--- a/fs/xfs/xfs_fs.h
+++ b/fs/xfs/xfs_fs.h
@@ -239,6 +239,7 @@ typedef struct xfs_fsop_resblks {
 #define XFS_FSOP_GEOM_FLAGS_V5SB	0x8000	/* version 5 superblock */
 #define XFS_FSOP_GEOM_FLAGS_FTYPE	0x10000	/* inode directory types */
 #define XFS_FSOP_GEOM_FLAGS_FINOBT	0x20000	/* free inode btree */
+#define XFS_FSOP_GEOM_FLAGS_UTF8	0x40000	/* utf8 filenames */
 
 /*
  * Minimum and maximum sizes need for growth checks.
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index f91de1e..1a83eef 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -103,7 +103,9 @@ xfs_fs_geometry(
 			(xfs_sb_version_hasftype(&mp->m_sb) ?
 				XFS_FSOP_GEOM_FLAGS_FTYPE : 0) |
 			(xfs_sb_version_hasfinobt(&mp->m_sb) ?
-				XFS_FSOP_GEOM_FLAGS_FINOBT : 0);
+				XFS_FSOP_GEOM_FLAGS_FINOBT : 0) |
+			(xfs_sb_version_hasutf8(&mp->m_sb) ?
+				XFS_FSOP_GEOM_FLAGS_UTF8 : 0);
 		geo->logsectsize = xfs_sb_version_hassector(&mp->m_sb) ?
 				mp->m_sb.sb_logsectsize : BBSIZE;
 		geo->rtsectsize = mp->m_sb.sb_blocksize;
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 7212949..cea3d64 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -335,9 +335,9 @@ xfs_vn_unlink(
 	/*
 	 * With unlink, the VFS makes the dentry "negative": no inode,
 	 * but still hashed. This is incompatible with case-insensitive
-	 * mode, so invalidate (unhash) the dentry in CI-mode.
+	 * or utf8 mode, so invalidate (unhash) the dentry in CI-mode.
 	 */
-	if (xfs_sb_version_hasasciici(&XFS_M(dir->i_sb)->m_sb))
+	if (xfs_sb_version_hasci(&XFS_M(dir->i_sb)->m_sb))
 		d_invalidate(dentry);
 	return 0;
 }
-- 
1.7.12.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 03/10] xfs: add xfs_nameops.normhash
  2014-09-18 19:56 [RFC v2] Unicode/UTF-8 support for XFS Ben Myers
                   ` (4 preceding siblings ...)
  2014-09-18 20:11 ` [PATCH 05/10] xfs: add a superblock feature bit to indicate UTF-8 support Ben Myers
@ 2014-09-18 20:13 ` Ben Myers
  2014-09-18 20:14   ` Ben Myers
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-18 20:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: tinguely, olaf, xfs

From: Olaf Weber <olaf@sgi.com>

Add a normhash callout to the xfs_nameops. This callout takes an xfs_da_args
structure as its argument, and calculates a hash value over the name. It may
in the process create a normalized form of the name, and assign that to the
norm/normlen fields in the xfs_da_args structure.

Signed-off-by: Olaf Weber <olaf@sgi.com>
---
 fs/xfs/libxfs/xfs_da_btree.c |  9 +++++++++
 fs/xfs/libxfs/xfs_da_btree.h |  3 +++
 fs/xfs/libxfs/xfs_dir2.c     | 42 +++++++++++++++++++++++++++++++++++++-----
 3 files changed, 49 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
index 2c42ae2..07a3acf 100644
--- a/fs/xfs/libxfs/xfs_da_btree.c
+++ b/fs/xfs/libxfs/xfs_da_btree.c
@@ -1990,8 +1990,17 @@ xfs_default_hashname(
 	return xfs_da_hashname(name->name, name->len);
 }
 
+STATIC int
+xfs_da_normhash(
+	struct xfs_da_args *args)
+{
+	args->hashval = xfs_da_hashname(args->name, args->namelen);
+	return 0;
+}
+
 const struct xfs_nameops xfs_default_nameops = {
 	.hashname	= xfs_default_hashname,
+	.normhash	= xfs_da_normhash,
 	.compname	= xfs_da_compname
 };
 
diff --git a/fs/xfs/libxfs/xfs_da_btree.h b/fs/xfs/libxfs/xfs_da_btree.h
index 9ebcc23..6cdafee 100644
--- a/fs/xfs/libxfs/xfs_da_btree.h
+++ b/fs/xfs/libxfs/xfs_da_btree.h
@@ -61,7 +61,9 @@ enum xfs_dacmp {
 typedef struct xfs_da_args {
 	struct xfs_da_geometry *geo;	/* da block geometry */
 	const __uint8_t	*name;		/* string (maybe not NULL terminated) */
+	const __uint8_t	*norm;		/* normalized name (may be NULL) */
 	int		namelen;	/* length of string (maybe no NULL) */
+	int		normlen;	/* length of normalized name */
 	__uint8_t	filetype;	/* filetype of inode for directories */
 	__uint8_t	*value;		/* set of bytes (maybe contain NULLs) */
 	int		valuelen;	/* length of value */
@@ -150,6 +152,7 @@ typedef struct xfs_da_state {
  */
 struct xfs_nameops {
 	xfs_dahash_t	(*hashname)(struct xfs_name *);
+	int		(*normhash)(struct xfs_da_args *);
 	enum xfs_dacmp	(*compname)(struct xfs_da_args *,
 					const unsigned char *, int);
 };
diff --git a/fs/xfs/libxfs/xfs_dir2.c b/fs/xfs/libxfs/xfs_dir2.c
index 32e769b..55733a6 100644
--- a/fs/xfs/libxfs/xfs_dir2.c
+++ b/fs/xfs/libxfs/xfs_dir2.c
@@ -56,6 +56,21 @@ xfs_ascii_ci_hashname(
 	return hash;
 }
 
+STATIC int
+xfs_ascii_ci_normhash(
+	struct xfs_da_args *args)
+{
+	xfs_dahash_t	hash;
+	int		i;
+
+	for (i = 0, hash = 0; i < args->namelen; i++)
+		hash = tolower(args->name[i]) ^ rol32(hash, 7);
+
+	args->hashval = hash;
+	return 0;
+}
+
+
 STATIC enum xfs_dacmp
 xfs_ascii_ci_compname(
 	struct xfs_da_args *args,
@@ -82,6 +97,7 @@ xfs_ascii_ci_compname(
 
 static struct xfs_nameops xfs_ascii_ci_nameops = {
 	.hashname	= xfs_ascii_ci_hashname,
+	.normhash	= xfs_ascii_ci_normhash,
 	.compname	= xfs_ascii_ci_compname,
 };
 
@@ -267,7 +283,6 @@ xfs_dir_createname(
 	args->name = name->name;
 	args->namelen = name->len;
 	args->filetype = name->type;
-	args->hashval = dp->i_mount->m_dirnameops->hashname(name);
 	args->inumber = inum;
 	args->dp = dp;
 	args->firstblock = first;
@@ -276,6 +291,8 @@ xfs_dir_createname(
 	args->whichfork = XFS_DATA_FORK;
 	args->trans = tp;
 	args->op_flags = XFS_DA_OP_ADDNAME | XFS_DA_OP_OKNOENT;
+	if ((rval = dp->i_mount->m_dirnameops->normhash(args)))
+		goto out_free;
 
 	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) {
 		rval = xfs_dir2_sf_addname(args);
@@ -299,6 +316,8 @@ xfs_dir_createname(
 		rval = xfs_dir2_node_addname(args);
 
 out_free:
+	if (args->norm)
+		kmem_free(args->norm);
 	kmem_free(args);
 	return rval;
 }
@@ -365,13 +384,14 @@ xfs_dir_lookup(
 	args->name = name->name;
 	args->namelen = name->len;
 	args->filetype = name->type;
-	args->hashval = dp->i_mount->m_dirnameops->hashname(name);
 	args->dp = dp;
 	args->whichfork = XFS_DATA_FORK;
 	args->trans = tp;
 	args->op_flags = XFS_DA_OP_OKNOENT;
 	if (ci_name)
 		args->op_flags |= XFS_DA_OP_CILOOKUP;
+	if ((rval = dp->i_mount->m_dirnameops->normhash(args)))
+		goto out_free;
 
 	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) {
 		rval = xfs_dir2_sf_lookup(args);
@@ -405,6 +425,9 @@ out_check_rval:
 		}
 	}
 out_free:
+	if (args->norm)
+		kmem_free(args->norm);
+
 	kmem_free(args);
 	return rval;
 }
@@ -437,7 +460,6 @@ xfs_dir_removename(
 	args->name = name->name;
 	args->namelen = name->len;
 	args->filetype = name->type;
-	args->hashval = dp->i_mount->m_dirnameops->hashname(name);
 	args->inumber = ino;
 	args->dp = dp;
 	args->firstblock = first;
@@ -445,6 +467,8 @@ xfs_dir_removename(
 	args->total = total;
 	args->whichfork = XFS_DATA_FORK;
 	args->trans = tp;
+	if ((rval = dp->i_mount->m_dirnameops->normhash(args)))
+		goto out_free;
 
 	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) {
 		rval = xfs_dir2_sf_removename(args);
@@ -467,6 +491,8 @@ xfs_dir_removename(
 	else
 		rval = xfs_dir2_node_removename(args);
 out_free:
+	if (args->norm)
+		kmem_free(args->norm);
 	kmem_free(args);
 	return rval;
 }
@@ -502,7 +528,6 @@ xfs_dir_replace(
 	args->name = name->name;
 	args->namelen = name->len;
 	args->filetype = name->type;
-	args->hashval = dp->i_mount->m_dirnameops->hashname(name);
 	args->inumber = inum;
 	args->dp = dp;
 	args->firstblock = first;
@@ -510,6 +535,8 @@ xfs_dir_replace(
 	args->total = total;
 	args->whichfork = XFS_DATA_FORK;
 	args->trans = tp;
+	if ((rval = dp->i_mount->m_dirnameops->normhash(args)))
+		goto out_free;
 
 	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) {
 		rval = xfs_dir2_sf_replace(args);
@@ -532,6 +559,8 @@ xfs_dir_replace(
 	else
 		rval = xfs_dir2_node_replace(args);
 out_free:
+	if (args->norm)
+		kmem_free(args->norm);
 	kmem_free(args);
 	return rval;
 }
@@ -564,12 +593,13 @@ xfs_dir_canenter(
 	args->name = name->name;
 	args->namelen = name->len;
 	args->filetype = name->type;
-	args->hashval = dp->i_mount->m_dirnameops->hashname(name);
 	args->dp = dp;
 	args->whichfork = XFS_DATA_FORK;
 	args->trans = tp;
 	args->op_flags = XFS_DA_OP_JUSTCHECK | XFS_DA_OP_ADDNAME |
 							XFS_DA_OP_OKNOENT;
+	if ((rval = dp->i_mount->m_dirnameops->normhash(args)))
+		goto out_free;
 
 	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) {
 		rval = xfs_dir2_sf_addname(args);
@@ -592,6 +622,8 @@ xfs_dir_canenter(
 	else
 		rval = xfs_dir2_node_addname(args);
 out_free:
+	if (args->norm)
+		kmem_free(args->norm);
 	kmem_free(args);
 	return rval;
 }
-- 
1.7.12.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 06/10] xfs: add unicode character database files
  2014-09-18 19:56 [RFC v2] Unicode/UTF-8 support for XFS Ben Myers
@ 2014-09-18 20:14   ` Ben Myers
  2014-09-18 20:09 ` [PATCH 02/10] xfs: rename XFS_CMP_CASE to XFS_CMP_MATCH Ben Myers
                     ` (15 subsequent siblings)
  16 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-18 20:14 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: xfs, olaf, tinguely

From: Olaf Weber <olaf@sgi.com>

Add files from the Unicode Character Database, version 7.0.0, to the source.
A helper program that generates a trie used for normalization from these
files is part of a separate commit.

Signed-off-by: Olaf Weber <olaf@sgi.com>
---
[v2: Removed large unicode files prior to posting.  Get them as below. -bpm]
[v3: Moved files to ucd8norm directory. -bpm]

cd fs/xfs/utf8norm/ucd
wget http://www.unicode.org/Public/7.0.0/ucd/CaseFolding.txt
wget http://www.unicode.org/Public/7.0.0/ucd/DerivedAge.txt
wget http://www.unicode.org/Public/7.0.0/ucd/extracted/DerivedCombiningClass.txt
wget http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt
wget http://www.unicode.org/Public/7.0.0/ucd/NormalizationCorrections.txt
wget http://www.unicode.org/Public/7.0.0/ucd/NormalizationTest.txt
wget http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt
for e in *.txt
do
	base=`basename $e .txt`
	mv $e $base-7.0.0.txt
done
---
 fs/xfs/utf8norm/ucd/README | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)
 create mode 100644 fs/xfs/utf8norm/ucd/README

diff --git a/fs/xfs/utf8norm/ucd/README b/fs/xfs/utf8norm/ucd/README
new file mode 100644
index 0000000..d713e66
--- /dev/null
+++ b/fs/xfs/utf8norm/ucd/README
@@ -0,0 +1,33 @@
+The files in this directory are part of the Unicode Character Database
+for version 7.0.0 of the Unicode standard.
+
+The full set of files can be found here:
+
+  http://www.unicode.org/Public/7.0.0/ucd/
+
+The latest released version of the UCD can be found here:
+
+  http://www.unicode.org/Public/UCD/latest/
+
+The files in this directory are identical, except that they have been
+renamed with a suffix indicating the unicode version.
+
+Individual source links:
+
+  http://www.unicode.org/Public/7.0.0/ucd/CaseFolding.txt
+  http://www.unicode.org/Public/7.0.0/ucd/DerivedAge.txt
+  http://www.unicode.org/Public/7.0.0/ucd/extracted/DerivedCombiningClass.txt
+  http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt
+  http://www.unicode.org/Public/7.0.0/ucd/NormalizationCorrections.txt
+  http://www.unicode.org/Public/7.0.0/ucd/NormalizationTest.txt
+  http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt
+
+md5sums
+
+  9a92b2bfe56c6719def926bab524fefd  CaseFolding-7.0.0.txt
+  07b8b1027eb824cf0835314e94f23d2e  DerivedAge-7.0.0.txt
+  90c3340b16821e2f2153acdbe6fc6180  DerivedCombiningClass-7.0.0.txt
+  c41c0601f808116f623de47110ed4f93  DerivedCoreProperties-7.0.0.txt
+  522720ddfc150d8e63a2518634829bce  NormalizationCorrections-7.0.0.txt
+  1f35175eba4a2ad795db489f789ae352  NormalizationTest-7.0.0.txt
+  c8355655731d75e6a3de8c20d7e601ba  UnicodeData-7.0.0.txt
-- 
1.7.12.4


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 06/10] xfs: add unicode character database files
@ 2014-09-18 20:14   ` Ben Myers
  0 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-18 20:14 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: tinguely, olaf, xfs

From: Olaf Weber <olaf@sgi.com>

Add files from the Unicode Character Database, version 7.0.0, to the source.
A helper program that generates a trie used for normalization from these
files is part of a separate commit.

Signed-off-by: Olaf Weber <olaf@sgi.com>
---
[v2: Removed large unicode files prior to posting.  Get them as below. -bpm]
[v3: Moved files to ucd8norm directory. -bpm]

cd fs/xfs/utf8norm/ucd
wget http://www.unicode.org/Public/7.0.0/ucd/CaseFolding.txt
wget http://www.unicode.org/Public/7.0.0/ucd/DerivedAge.txt
wget http://www.unicode.org/Public/7.0.0/ucd/extracted/DerivedCombiningClass.txt
wget http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt
wget http://www.unicode.org/Public/7.0.0/ucd/NormalizationCorrections.txt
wget http://www.unicode.org/Public/7.0.0/ucd/NormalizationTest.txt
wget http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt
for e in *.txt
do
	base=`basename $e .txt`
	mv $e $base-7.0.0.txt
done
---
 fs/xfs/utf8norm/ucd/README | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)
 create mode 100644 fs/xfs/utf8norm/ucd/README

diff --git a/fs/xfs/utf8norm/ucd/README b/fs/xfs/utf8norm/ucd/README
new file mode 100644
index 0000000..d713e66
--- /dev/null
+++ b/fs/xfs/utf8norm/ucd/README
@@ -0,0 +1,33 @@
+The files in this directory are part of the Unicode Character Database
+for version 7.0.0 of the Unicode standard.
+
+The full set of files can be found here:
+
+  http://www.unicode.org/Public/7.0.0/ucd/
+
+The latest released version of the UCD can be found here:
+
+  http://www.unicode.org/Public/UCD/latest/
+
+The files in this directory are identical, except that they have been
+renamed with a suffix indicating the unicode version.
+
+Individual source links:
+
+  http://www.unicode.org/Public/7.0.0/ucd/CaseFolding.txt
+  http://www.unicode.org/Public/7.0.0/ucd/DerivedAge.txt
+  http://www.unicode.org/Public/7.0.0/ucd/extracted/DerivedCombiningClass.txt
+  http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt
+  http://www.unicode.org/Public/7.0.0/ucd/NormalizationCorrections.txt
+  http://www.unicode.org/Public/7.0.0/ucd/NormalizationTest.txt
+  http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt
+
+md5sums
+
+  9a92b2bfe56c6719def926bab524fefd  CaseFolding-7.0.0.txt
+  07b8b1027eb824cf0835314e94f23d2e  DerivedAge-7.0.0.txt
+  90c3340b16821e2f2153acdbe6fc6180  DerivedCombiningClass-7.0.0.txt
+  c41c0601f808116f623de47110ed4f93  DerivedCoreProperties-7.0.0.txt
+  522720ddfc150d8e63a2518634829bce  NormalizationCorrections-7.0.0.txt
+  1f35175eba4a2ad795db489f789ae352  NormalizationTest-7.0.0.txt
+  c8355655731d75e6a3de8c20d7e601ba  UnicodeData-7.0.0.txt
-- 
1.7.12.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 07/10] xfs: add trie generator and supporting code for UTF-8.
  2014-09-18 19:56 [RFC v2] Unicode/UTF-8 support for XFS Ben Myers
                   ` (6 preceding siblings ...)
  2014-09-18 20:14   ` Ben Myers
@ 2014-09-18 20:15 ` Ben Myers
  2014-09-22 20:57     ` Dave Chinner
  2014-09-18 20:16   ` Ben Myers
                   ` (8 subsequent siblings)
  16 siblings, 1 reply; 84+ messages in thread
From: Ben Myers @ 2014-09-18 20:15 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: tinguely, olaf, xfs

From: Olaf Weber <olaf@sgi.com>

mkutf8data.c is the source for a program that generates utf8data.h, which
contains the trie that utf8norm.c uses. The trie is generated from the
Unicode 7.0.0 data files. The format of the utf8data[] table is described
in utf8norm.c.

Supporting functions for UTF-8 normalization are in utf8norm.c with the
header utf8norm.h. Two normalization forms are supported: nfkdi and nfkdicf.

  nfkdi:
   - Apply unicode normalization form NFKD.
   - Remove any Default_Ignorable_Code_Point.

  nfkdicf:
   - Apply unicode normalization form NFKD.
   - Remove any Default_Ignorable_Code_Point.
   - Apply a full casefold (C + F).

For the purposes of the code, a string is valid UTF-8 if:

 - The values encoded are 0x1..0x10FFFF.
 - The surrogate codepoints 0xD800..0xDFFFF are not encoded.
 - The shortest possible encoding is used for all values.

The supporting functions work on null-terminated strings (utf8 prefix) and
on length-limited strings (utf8n prefix).

Signed-off-by: Olaf Weber <olaf@sgi.com>

---
[v2: the trie is now separated into utf8norm.ko;
     utf8version is now a function and exported;
     introduced CONFIG_XFS_UTF8. -bpm]
---
 fs/xfs/Kconfig               |    8 +
 fs/xfs/Makefile              |    2 +-
 fs/xfs/utf8norm/Makefile     |   37 +
 fs/xfs/utf8norm/mkutf8data.c | 3239 ++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/utf8norm/utf8norm.c   |  649 +++++++++
 fs/xfs/utf8norm/utf8norm.h   |  116 ++
 6 files changed, 4050 insertions(+), 1 deletion(-)
 create mode 100644 fs/xfs/utf8norm/Makefile
 create mode 100644 fs/xfs/utf8norm/mkutf8data.c
 create mode 100644 fs/xfs/utf8norm/utf8norm.c
 create mode 100644 fs/xfs/utf8norm/utf8norm.h

diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index 5d47b4d..a847857 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -95,3 +95,11 @@ config XFS_DEBUG
 	  not useful unless you are debugging a particular problem.
 
 	  Say N unless you are an XFS developer, or you play one on TV.
+
+config XFS_UTF8
+	bool "XFS UTF-8 support"
+	depends on XFS_FS
+	help
+	  Say Y here to enable utf8 normalization support in XFS.  You
+	  will be able to mount and use filesystems created with the
+	  utf8 mkfs.xfs option.
diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index d617999..6d000d3 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -21,7 +21,7 @@ ccflags-y += -I$(src)/libxfs
 
 ccflags-$(CONFIG_XFS_DEBUG) += -g
 
-obj-$(CONFIG_XFS_FS)		+= xfs.o
+obj-$(CONFIG_XFS_FS)		+= xfs.o utf8norm/
 
 # this one should be compiled first, as the tracing macros can easily blow up
 xfs-y				+= xfs_trace.o
diff --git a/fs/xfs/utf8norm/Makefile b/fs/xfs/utf8norm/Makefile
new file mode 100644
index 0000000..f83f9b9
--- /dev/null
+++ b/fs/xfs/utf8norm/Makefile
@@ -0,0 +1,37 @@
+#
+# Copyright (c) 2014 SGI.
+# All rights reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#
+
+ifeq ($(CONFIG_XFS_UTF8),y)
+obj-m		 		+= utf8norm.o
+endif
+
+hostprogs-y			:= mkutf8data
+$(obj)/utf8norm.o: $(obj)/utf8data.h
+$(obj)/utf8data.h: $(src)/ucd/*.txt
+$(obj)/utf8data.h: $(obj)/mkutf8data FORCE
+	$(call if_changed,mkutf8data)
+quiet_cmd_mkutf8data = MKUTF8DATA $@
+      cmd_mkutf8data = $(obj)/mkutf8data \
+		-a $(src)/ucd/DerivedAge-7.0.0.txt \
+		-c $(src)/ucd/DerivedCombiningClass-7.0.0.txt \
+		-p $(src)/ucd/DerivedCoreProperties-7.0.0.txt \
+		-d $(src)/ucd/UnicodeData-7.0.0.txt \
+		-f $(src)/ucd/CaseFolding-7.0.0.txt \
+		-n $(src)/ucd/NormalizationCorrections-7.0.0.txt \
+		-t $(src)/ucd/NormalizationTest-7.0.0.txt \
+		-o $@
diff --git a/fs/xfs/utf8norm/mkutf8data.c b/fs/xfs/utf8norm/mkutf8data.c
new file mode 100644
index 0000000..1d6ec02
--- /dev/null
+++ b/fs/xfs/utf8norm/mkutf8data.c
@@ -0,0 +1,3239 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+/* Generator for a compact trie for unicode normalization */
+
+#include <sys/types.h>
+#include <stddef.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <assert.h>
+#include <string.h>
+#include <unistd.h>
+#include <errno.h>
+
+/* Default names of the in- and output files. */
+
+#define AGE_NAME	"DerivedAge.txt"
+#define CCC_NAME	"DerivedCombiningClass.txt"
+#define PROP_NAME	"DerivedCoreProperties.txt"
+#define DATA_NAME	"UnicodeData.txt"
+#define FOLD_NAME	"CaseFolding.txt"
+#define NORM_NAME	"NormalizationCorrections.txt"
+#define TEST_NAME	"NormalizationTest.txt"
+#define UTF8_NAME	"utf8data.h"
+
+const char	*age_name  = AGE_NAME;
+const char	*ccc_name  = CCC_NAME;
+const char	*prop_name = PROP_NAME;
+const char	*data_name = DATA_NAME;
+const char	*fold_name = FOLD_NAME;
+const char	*norm_name = NORM_NAME;
+const char	*test_name = TEST_NAME;
+const char	*utf8_name = UTF8_NAME;
+
+int verbose = 0;
+
+/* An arbitrary line size limit on input lines. */
+
+#define LINESIZE	1024
+char line[LINESIZE];
+char buf0[LINESIZE];
+char buf1[LINESIZE];
+char buf2[LINESIZE];
+char buf3[LINESIZE];
+
+const char *argv0;
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * Unicode version numbers consist of three parts: major, minor, and a
+ * revision.  These numbers are packed into an unsigned int to obtain
+ * a single version number.
+ *
+ * To save space in the generated trie, the unicode version is not
+ * stored directly, instead we calculate a generation number from the
+ * unicode versions seen in the DerivedAge file, and use that as an
+ * index into a table of unicode versions.
+ */
+#define UNICODE_MAJ_SHIFT		(16)
+#define UNICODE_MIN_SHIFT		(8)
+
+#define UNICODE_MAJ_MAX			((unsigned short)-1)
+#define UNICODE_MIN_MAX			((unsigned char)-1)
+#define UNICODE_REV_MAX			((unsigned char)-1)
+
+#define UNICODE_AGE(MAJ,MIN,REV)			\
+	(((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) |	\
+	 ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) |	\
+	 ((unsigned int)(REV)))
+
+unsigned int *ages;
+int ages_count;
+
+unsigned int unicode_maxage;
+
+static int
+age_valid(unsigned int major, unsigned int minor, unsigned int revision)
+{
+	if (major > UNICODE_MAJ_MAX)
+		return 0;
+	if (minor > UNICODE_MIN_MAX)
+		return 0;
+	if (revision > UNICODE_REV_MAX)
+		return 0;
+	return 1;
+}
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * utf8trie_t
+ *
+ * A compact binary tree, used to decode UTF-8 characters.
+ *
+ * Internal nodes are one byte for the node itself, and up to three
+ * bytes for an offset into the tree.  The first byte contains the
+ * following information:
+ *  NEXTBYTE  - flag        - advance to next byte if set
+ *  BITNUM    - 3 bit field - the bit number to tested
+ *  OFFLEN    - 2 bit field - number of bytes in the offset
+ * if offlen == 0 (non-branching node)
+ *  RIGHTPATH - 1 bit field - set if the following node is for the
+ *                            right-hand path (tested bit is set)
+ *  TRIENODE  - 1 bit field - set if the following node is an internal
+ *                            node, otherwise it is a leaf node
+ * if offlen != 0 (branching node)
+ *  LEFTNODE  - 1 bit field - set if the left-hand node is internal
+ *  RIGHTNODE - 1 bit field - set if the right-hand node is internal
+ *
+ * Due to the way utf8 works, there cannot be branching nodes with
+ * NEXTBYTE set, and moreover those nodes always have a righthand
+ * descendant.
+ */
+typedef unsigned char utf8trie_t;
+#define BITNUM		0x07
+#define NEXTBYTE	0x08
+#define OFFLEN		0x30
+#define OFFLEN_SHIFT	4
+#define RIGHTPATH	0x40
+#define TRIENODE	0x80
+#define RIGHTNODE	0x40
+#define LEFTNODE	0x80
+
+/*
+ * utf8leaf_t
+ *
+ * The leaves of the trie are embedded in the trie, and so the same
+ * underlying datatype, unsigned char.
+ *
+ * leaf[0]: The unicode version, stored as a generation number that is
+ *          an index into utf8agetab[].  With this we can filter code
+ *          points based on the unicode version in which they were
+ *          defined.  The CCC of a non-defined code point is 0.
+ * leaf[1]: Canonical Combining Class. During normalization, we need
+ *          to do a stable sort into ascending order of all characters
+ *          with a non-zero CCC that occur between two characters with
+ *          a CCC of 0, or at the begin or end of a string.
+ *          The unicode standard guarantees that all CCC values are
+ *          between 0 and 254 inclusive, which leaves 255 available as
+ *          a special value.
+ *          Code points with CCC 0 are known as stoppers.
+ * leaf[2]: Decomposition. If leaf[1] == 255, then leaf[2] is the
+ *          start of a NUL-terminated string that is the decomposition
+ *          of the character.
+ *          The CCC of a decomposable character is the same as the CCC
+ *          of the first character of its decomposition.
+ *          Some characters decompose as the empty string: these are
+ *          characters with the Default_Ignorable_Code_Point property.
+ *          These do affect normalization, as they all have CCC 0.
+ *
+ * The decompositions in the trie have been fully expanded.
+ *
+ * Casefolding, if applicable, is also done using decompositions.
+ */
+typedef unsigned char utf8leaf_t;
+
+#define LEAF_GEN(LEAF)	((LEAF)[0])
+#define LEAF_CCC(LEAF)	((LEAF)[1])
+#define LEAF_STR(LEAF)	((const char*)((LEAF) + 2))
+
+#define MAXGEN		(255)
+
+#define MINCCC		(0)
+#define MAXCCC		(254)
+#define STOPPER		(0)
+#define	DECOMPOSE	(255)
+
+struct tree;
+static utf8leaf_t *utf8nlookup(struct tree *, const char *, size_t);
+static utf8leaf_t *utf8lookup(struct tree *, const char *);
+
+unsigned char *utf8data;
+size_t utf8data_size;
+
+utf8trie_t *nfkdi;
+utf8trie_t *nfkdicf;
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * UTF8 valid ranges.
+ *
+ * The UTF-8 encoding spreads the bits of a 32bit word over several
+ * bytes. This table gives the ranges that can be held and how they'd
+ * be represented.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000000 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000000 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * There is an additional requirement on UTF-8, in that only the
+ * shortest representation of a 32bit value is to be used.  A decoder
+ * must not decode sequences that do not satisfy this requirement.
+ * Thus the allowed ranges have a lower bound.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000080 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000800 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00010000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00200000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x04000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * Actual unicode characters are limited to the range 0x0 - 0x10FFFF,
+ * 17 planes of 65536 values.  This limits the sequences actually seen
+ * even more, to just the following.
+ *
+ *          0 -     0x7f: 0                     0x7f
+ *       0x80 -    0x7ff: 0xc2 0x80             0xdf 0xbf
+ *      0x800 -   0xffff: 0xe0 0xa0 0x80        0xef 0xbf 0xbf
+ *    0x10000 - 0x10ffff: 0xf0 0x90 0x80 0x80   0xf4 0x8f 0xbf 0xbf
+ *
+ * Even within those ranges not all values are allowed: the surrogates
+ * 0xd800 - 0xdfff should never be seen.
+ *
+ * Note that the longest sequence seen with valid usage is 4 bytes,
+ * the same a single UTF-32 character.  This makes the UTF-8
+ * representation of Unicode strictly smaller than UTF-32.
+ *
+ * The shortest sequence requirement was introduced by:
+ *    Corrigendum #1: UTF-8 Shortest Form
+ * It can be found here:
+ *    http://www.unicode.org/versions/corrigendum1.html
+ *
+ */
+
+#define UTF8_2_BITS     0xC0
+#define UTF8_3_BITS     0xE0
+#define UTF8_4_BITS     0xF0
+#define UTF8_N_BITS     0x80
+#define UTF8_2_MASK     0xE0
+#define UTF8_3_MASK     0xF0
+#define UTF8_4_MASK     0xF8
+#define UTF8_N_MASK     0xC0
+#define UTF8_V_MASK     0x3F
+#define UTF8_V_SHIFT    6
+
+static int
+utf8key(unsigned int key, char keyval[])
+{
+	int keylen;
+
+	if (key < 0x80) {
+		keyval[0] = key;
+		keylen = 1;
+	} else if (key < 0x800) {
+		keyval[1] = key & UTF8_V_MASK;
+		keyval[1] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[0] = key;
+		keyval[0] |= UTF8_2_BITS;
+		keylen = 2;
+	} else if (key < 0x10000) {
+		keyval[2] = key & UTF8_V_MASK;
+		keyval[2] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[1] = key & UTF8_V_MASK;
+		keyval[1] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[0] = key;
+		keyval[0] |= UTF8_3_BITS;
+		keylen = 3;
+	} else if (key < 0x110000) {
+		keyval[3] = key & UTF8_V_MASK;
+		keyval[3] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[2] = key & UTF8_V_MASK;
+		keyval[2] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[1] = key & UTF8_V_MASK;
+		keyval[1] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[0] = key;
+		keyval[0] |= UTF8_4_BITS;
+		keylen = 4;
+	} else {
+		printf("%#x: illegal key\n", key);
+		keylen = 0;
+	}
+	return keylen;
+}
+
+static unsigned int
+utf8code(const char *str)
+{
+	const unsigned char *s = (const unsigned char*)str;
+	unsigned int unichar = 0;
+
+	if (*s < 0x80) {
+		unichar = *s;
+	} else if (*s < UTF8_3_BITS) {
+		unichar = *s++ & 0x1F;
+		unichar <<= UTF8_V_SHIFT;
+		unichar |= *s & 0x3F;
+	} else if (*s < UTF8_4_BITS) {
+		unichar = *s++ & 0x0F;
+		unichar <<= UTF8_V_SHIFT;
+		unichar |= *s++ & 0x3F;
+		unichar <<= UTF8_V_SHIFT;
+		unichar |= *s & 0x3F;
+	} else {
+		unichar = *s++ & 0x0F;
+		unichar <<= UTF8_V_SHIFT;
+		unichar |= *s++ & 0x3F;
+		unichar <<= UTF8_V_SHIFT;
+		unichar |= *s++ & 0x3F;
+		unichar <<= UTF8_V_SHIFT;
+		unichar |= *s & 0x3F;
+	}
+	return unichar;
+}
+
+static int
+utf32valid(unsigned int unichar)
+{
+	return unichar < 0x110000;
+}
+
+#define NODE 1
+#define LEAF 0
+
+struct tree {
+	void *root;
+	int childnode;
+	const char *type;
+	unsigned int maxage;
+	struct tree *next;
+	int (*leaf_equal)(void *, void *);
+	void (*leaf_print)(void *, int);
+	int (*leaf_mark)(void *);
+	int (*leaf_size)(void *);
+	int *(*leaf_index)(struct tree *, void *);
+	unsigned char *(*leaf_emit)(void *, unsigned char *);
+	int leafindex[0x110000];
+	int index;
+};
+
+struct node {
+	int index;
+	int offset;
+	int mark;
+	int size;
+	struct node *parent;
+	void *left;
+	void *right;
+	unsigned char bitnum;
+	unsigned char nextbyte;
+	unsigned char leftnode;
+	unsigned char rightnode;
+	unsigned int keybits;
+	unsigned int keymask;
+};
+
+/*
+ * Example lookup function for a tree.
+ */
+static void *
+lookup(struct tree *tree, const char *key)
+{
+	struct node *node;
+	void *leaf = NULL;
+
+	node = tree->root;
+	while (!leaf && node) {
+		if (node->nextbyte)
+			key++;
+		if (*key & (1 << (node->bitnum & 7))) {
+			/* Right leg */
+			if (node->rightnode == NODE) {
+				node = node->right;
+			} else if (node->rightnode == LEAF) {
+				leaf = node->right;
+			} else {
+				node = NULL;
+			}
+		} else {
+			/* Left leg */
+			if (node->leftnode == NODE) {
+				node = node->left;
+			} else if (node->leftnode == LEAF) {
+				leaf = node->left;
+			} else {
+				node = NULL;
+			}
+		}
+	}
+
+	return leaf;
+}
+
+/*
+ * A simple non-recursive tree walker: keep track of visits to the
+ * left and right branches in the leftmask and rightmask.
+ */
+static void
+tree_walk(struct tree *tree)
+{
+	struct node *node;
+	unsigned int leftmask;
+	unsigned int rightmask;
+	unsigned int bitmask;
+	int indent = 1;
+	int nodes, singletons, leaves;
+
+	nodes = singletons = leaves = 0;
+
+	printf("%s_%x root %p\n", tree->type, tree->maxage, tree->root);
+	if (tree->childnode == LEAF) {
+		assert(tree->root);
+		tree->leaf_print(tree->root, indent);
+		leaves = 1;
+	} else {
+		assert(tree->childnode == NODE);
+		node = tree->root;
+		leftmask = rightmask = 0;
+		while (node) {
+			printf("%*snode @ %p bitnum %d nextbyte %d"
+			       " left %p right %p mask %x bits %x\n",
+				indent, "", node,
+				node->bitnum, node->nextbyte,
+				node->left, node->right,
+				node->keymask, node->keybits);
+			nodes += 1;
+			if (!(node->left && node->right))
+				singletons += 1;
+
+			while (node) {
+				bitmask = 1 << node->bitnum;
+				if ((leftmask & bitmask) == 0) {
+					leftmask |= bitmask;
+					if (node->leftnode == LEAF) {
+						assert(node->left);
+						tree->leaf_print(node->left,
+								 indent+1);
+						leaves += 1;
+					} else if (node->left) {
+						assert(node->leftnode == NODE);
+						indent += 1;
+						node = node->left;
+						break;
+					}
+				}
+				if ((rightmask & bitmask) == 0) {
+					rightmask |= bitmask;
+					if (node->rightnode == LEAF) {
+						assert(node->right);
+						tree->leaf_print(node->right,
+								 indent+1);
+						leaves += 1;
+					} else if (node->right) {
+						assert(node->rightnode==NODE);
+						indent += 1;
+						node = node->right;
+						break;
+					}
+				}
+				leftmask &= ~bitmask;
+				rightmask &= ~bitmask;
+				node = node->parent;
+				indent -= 1;
+			}
+		}
+	}
+	printf("nodes %d leaves %d singletons %d\n",
+	       nodes, leaves, singletons);
+}
+
+/*
+ * Allocate an initialize a new internal node.
+ */
+static struct node *
+alloc_node(struct node *parent)
+{
+	struct node *node;
+	int bitnum;
+
+	node = malloc(sizeof(*node));
+	node->left = node->right = NULL;
+	node->parent = parent;
+	node->leftnode = NODE;
+	node->rightnode = NODE;
+	node->keybits = 0;
+	node->keymask = 0;
+	node->mark = 0;
+	node->index = 0;
+	node->offset = -1;
+	node->size = 4;
+
+	if (node->parent) {
+		bitnum = parent->bitnum;
+		if ((bitnum & 7) == 0) {
+			node->bitnum = bitnum + 7 + 8;
+			node->nextbyte = 1;
+		} else {
+			node->bitnum = bitnum - 1;
+			node->nextbyte = 0;
+		}
+	} else {
+		node->bitnum = 7;
+		node->nextbyte = 0;
+	}
+
+	return node;
+}
+
+/*
+ * Insert a new leaf into the tree, and collapse any subtrees that are
+ * fully populated and end in identical leaves. A nextbyte tagged
+ * internal node will not be removed to preserve the tree's integrity.
+ * Note that due to the structure of utf8, no nextbyte tagged node
+ * will be a candidate for removal.
+ */
+static int
+insert(struct tree *tree, char *key, int keylen, void *leaf)
+{
+	struct node *node;
+	struct node *parent;
+	void **cursor;
+	int keybits;
+
+	assert(keylen >= 1 && keylen <= 4);
+
+	node = NULL;
+	cursor = &tree->root;
+	keybits = 8 * keylen;
+
+	/* Insert, creating path along the way. */
+	while (keybits) {
+		if (!*cursor)
+			*cursor = alloc_node(node);
+		node = *cursor;
+		if (node->nextbyte)
+			key++;
+		if (*key & (1 << (node->bitnum & 7)))
+			cursor = &node->right;
+		else
+			cursor = &node->left;
+		keybits--;
+	}
+	*cursor = leaf;
+
+	/* Merge subtrees if possible. */
+	while (node) {
+		if (*key & (1 << (node->bitnum & 7)))
+			node->rightnode = LEAF;
+		else
+			node->leftnode = LEAF;
+		if (node->nextbyte)
+			break;
+		if (node->leftnode == NODE || node->rightnode == NODE)
+			break;
+		assert(node->left);
+		assert(node->right);
+		/* Compare */
+		if (! tree->leaf_equal(node->left, node->right))
+			break;
+		/* Keep left, drop right leaf. */
+		leaf = node->left;
+		/* Check in parent */
+		parent = node->parent;
+		if (!parent) {
+			/* root of tree! */
+			tree->root = leaf;
+			tree->childnode = LEAF;
+		} else if (parent->left == node) {
+			parent->left = leaf;
+			parent->leftnode = LEAF;
+			if (parent->right) {
+				parent->keymask = 0;
+				parent->keybits = 0;
+			} else {
+				parent->keymask |= (1 << node->bitnum);
+			}
+		} else if (parent->right == node) {
+			parent->right = leaf;
+			parent->rightnode = LEAF;
+			if (parent->left) {
+				parent->keymask = 0;
+				parent->keybits = 0;
+			} else {
+				parent->keymask |= (1 << node->bitnum);
+				parent->keybits |= (1 << node->bitnum);
+			}
+		} else {
+			/* internal tree error */
+			assert(0);
+		}
+		free(node);
+		node = parent;
+	}
+
+	/* Propagate keymasks up along singleton chains. */
+	while (node) {
+		parent = node->parent;
+		if (!parent)
+			break;
+		/* Nix the mask for parents with two children. */
+		if (node->keymask == 0) {
+			parent->keymask = 0;
+			parent->keybits = 0;
+		} else if (parent->left && parent->right) {
+			parent->keymask = 0;
+			parent->keybits = 0;
+		} else {
+			assert((parent->keymask & node->keymask) == 0);
+			parent->keymask |= node->keymask;
+			parent->keymask |= (1 << parent->bitnum);
+			parent->keybits |= node->keybits;
+			if (parent->right)
+				parent->keybits |= (1 << parent->bitnum);
+		}
+		node = parent;
+	}
+
+	return 0;
+}
+
+/*
+ * Prune internal nodes.
+ *
+ * Fully populated subtrees that end at the same leaf have already
+ * been collapsed.  There are still internal nodes that have for both
+ * their left and right branches a sequence of singletons that make
+ * identical choices and end in identical leaves.  The keymask and
+ * keybits collected in the nodes describe the choices made in these
+ * singleton chains.  When they are identical for the left and right
+ * branch of a node, and the two leaves comare identical, the node in
+ * question can be removed.
+ *
+ * Note that nodes with the nextbyte tag set will not be removed by
+ * this to ensure tree integrity.  Note as well that the structure of
+ * utf8 ensures that these nodes would not have been candidates for
+ * removal in any case.
+ */
+static void
+prune(struct tree *tree)
+{
+	struct node *node;
+	struct node *left;
+	struct node *right;
+	struct node *parent;
+	void *leftleaf;
+	void *rightleaf;
+	unsigned int leftmask;
+	unsigned int rightmask;
+	unsigned int bitmask;
+	int count;
+
+	if (verbose > 0)
+		printf("Pruning %s_%x\n", tree->type, tree->maxage);
+
+	count = 0;
+	if (tree->childnode == LEAF)
+		return;
+	if (!tree->root)
+		return;
+
+	leftmask = rightmask = 0;
+	node = tree->root;
+	while (node) {
+		if (node->nextbyte)
+			goto advance;
+		if (node->leftnode == LEAF)
+			goto advance;
+		if (node->rightnode == LEAF)
+			goto advance;
+		if (!node->left)
+			goto advance;
+		if (!node->right)
+			goto advance;
+		left = node->left;
+		right = node->right;
+		if (left->keymask == 0)
+			goto advance;
+		if (right->keymask == 0)
+			goto advance;
+		if (left->keymask != right->keymask)
+			goto advance;
+		if (left->keybits != right->keybits)
+			goto advance;
+		leftleaf = NULL;
+		while (!leftleaf) {
+			assert(left->left || left->right);
+			if (left->leftnode == LEAF)
+				leftleaf = left->left;
+			else if (left->rightnode == LEAF)
+				leftleaf = left->right;
+			else if (left->left)
+				left = left->left;
+			else if (left->right)
+				left = left->right;
+			else
+				assert(0);
+		}
+		rightleaf = NULL;
+		while (!rightleaf) {
+			assert(right->left || right->right);
+			if (right->leftnode == LEAF)
+				rightleaf = right->left;
+			else if (right->rightnode == LEAF)
+				rightleaf = right->right;
+			else if (right->left)
+				right = right->left;
+			else if (right->right)
+				right = right->right;
+			else
+				assert(0);
+		}
+		if (! tree->leaf_equal(leftleaf, rightleaf))
+			goto advance;
+		/*
+		 * This node has identical singleton-only subtrees.
+		 * Remove it.
+		 */
+		parent = node->parent;
+		left = node->left;
+		right = node->right;
+		if (parent->left == node)
+			parent->left = left;
+		else if (parent->right == node)
+			parent->right = left;
+		else
+			assert(0);
+		left->parent = parent;
+		left->keymask |= (1 << node->bitnum);
+		node->left = NULL;
+		while (node) {
+			bitmask = 1 << node->bitnum;
+			leftmask &= ~bitmask;
+			rightmask &= ~bitmask;
+			if (node->leftnode == NODE && node->left) {
+				left = node->left;
+				free(node);
+				count++;
+				node = left;
+			} else if (node->rightnode == NODE && node->right) {
+				right = node->right;
+				free(node);
+				count++;
+				node = right;
+			} else {
+				node = NULL;
+			}
+		}
+		/* Propagate keymasks up along singleton chains. */
+		node = parent;
+		/* Force re-check */
+		bitmask = 1 << node->bitnum;
+		leftmask &= ~bitmask;
+		rightmask &= ~bitmask;
+		for (;;) {
+			if (node->left && node->right)
+				break;
+			if (node->left) {
+				left = node->left;
+				node->keymask |= left->keymask;
+				node->keybits |= left->keybits;
+			}
+			if (node->right) {
+				right = node->right;
+				node->keymask |= right->keymask;
+				node->keybits |= right->keybits;
+			}
+			node->keymask |= (1 << node->bitnum);
+			node = node->parent;
+			/* Force re-check */
+			bitmask = 1 << node->bitnum;
+			leftmask &= ~bitmask;
+			rightmask &= ~bitmask;
+		}
+	advance:
+		bitmask = 1 << node->bitnum;
+		if ((leftmask & bitmask) == 0 &&
+		    node->leftnode == NODE &&
+		    node->left) {
+			leftmask |= bitmask;
+			node = node->left;
+		} else if ((rightmask & bitmask) == 0 &&
+			   node->rightnode == NODE &&
+			   node->right) {
+			rightmask |= bitmask;
+			node = node->right;
+		} else {
+			leftmask &= ~bitmask;
+			rightmask &= ~bitmask;
+			node = node->parent;
+		}
+	}
+	if (verbose > 0)
+		printf("Pruned %d nodes\n", count);
+}
+
+/*
+ * Mark the nodes in the tree that lead to leaves that must be
+ * emitted.
+ */
+static void
+mark_nodes(struct tree *tree)
+{
+	struct node *node;
+	struct node *n;
+	unsigned int leftmask;
+	unsigned int rightmask;
+	unsigned int bitmask;
+	int marked;
+
+	marked = 0;
+	if (verbose > 0)
+		printf("Marking %s_%x\n", tree->type, tree->maxage);
+	if (tree->childnode == LEAF)
+		goto done;
+
+	assert(tree->childnode == NODE);
+	node = tree->root;
+	leftmask = rightmask = 0;
+	while (node) {
+		bitmask = 1 << node->bitnum;
+		if ((leftmask & bitmask) == 0) {
+			leftmask |= bitmask;
+			if (node->leftnode == LEAF) {
+				assert(node->left);
+				if (tree->leaf_mark(node->left)) {
+					n = node;
+					while (n && !n->mark) {
+						marked++;
+						n->mark = 1;
+						n = n->parent;
+					}
+				}
+			} else if (node->left) {
+				assert(node->leftnode == NODE);
+				node = node->left;
+				continue;
+			}
+		}
+		if ((rightmask & bitmask) == 0) {
+			rightmask |= bitmask;
+			if (node->rightnode == LEAF) {
+				assert(node->right);
+				if (tree->leaf_mark(node->right)) {
+					n = node;
+					while (n && !n->mark) {
+						marked++;
+						n->mark = 1;
+						n = n->parent;
+					}
+				}
+			} else if (node->right) {
+				assert(node->rightnode==NODE);
+				node = node->right;
+				continue;
+			}
+		}
+		leftmask &= ~bitmask;
+		rightmask &= ~bitmask;
+		node = node->parent;
+	}
+
+	/* second pass: left siblings and singletons */
+
+	assert(tree->childnode == NODE);
+	node = tree->root;
+	leftmask = rightmask = 0;
+	while (node) {
+		bitmask = 1 << node->bitnum;
+		if ((leftmask & bitmask) == 0) {
+			leftmask |= bitmask;
+			if (node->leftnode == LEAF) {
+				assert(node->left);
+				if (tree->leaf_mark(node->left)) {
+					n = node;
+					while (n && !n->mark) {
+						marked++;
+						n->mark = 1;
+						n = n->parent;
+					}
+				}
+			} else if (node->left) {
+				assert(node->leftnode == NODE);
+				node = node->left;
+				if (!node->mark && node->parent->mark) {
+					marked++;
+					node->mark = 1;
+				}
+				continue;
+			}
+		}
+		if ((rightmask & bitmask) == 0) {
+			rightmask |= bitmask;
+			if (node->rightnode == LEAF) {
+				assert(node->right);
+				if (tree->leaf_mark(node->right)) {
+					n = node;
+					while (n && !n->mark) {
+						marked++;
+						n->mark = 1;
+						n = n->parent;
+					}
+				}
+			} else if (node->right) {
+				assert(node->rightnode==NODE);
+				node = node->right;
+				if (!node->mark && node->parent->mark &&
+				    !node->parent->left) {
+					marked++;
+					node->mark = 1;
+				}
+				continue;
+			}
+		}
+		leftmask &= ~bitmask;
+		rightmask &= ~bitmask;
+		node = node->parent;
+	}
+done:
+	if (verbose > 0)
+		printf("Marked %d nodes\n", marked);
+}
+
+/*
+ * Compute the index of each node and leaf, which is the offset in the
+ * emitted trie.  These value must be pre-computed because relative
+ * offsets between nodes are used to navigate the tree.
+ */
+static int
+index_nodes(struct tree *tree, int index)
+{
+	struct node *node;
+	unsigned int leftmask;
+	unsigned int rightmask;
+	unsigned int bitmask;
+	int count;
+	int indent;
+
+	/* Align to a cache line (or half a cache line?). */
+	while (index % 64)
+		index++;
+	tree->index = index;
+	indent = 1;
+	count = 0;
+
+	if (verbose > 0)
+		printf("Indexing %s_%x: %d", tree->type, tree->maxage, index);
+	if (tree->childnode == LEAF) {
+		index += tree->leaf_size(tree->root);
+		goto done;
+	}
+
+	assert(tree->childnode == NODE);
+	node = tree->root;
+	leftmask = rightmask = 0;
+	while (node) {
+		if (!node->mark)
+			goto skip;
+		count++;
+		if (node->index != index)
+			node->index = index;
+		index += node->size;
+skip:
+		while (node) {
+			bitmask = 1 << node->bitnum;
+			if (node->mark && (leftmask & bitmask) == 0) {
+				leftmask |= bitmask;
+				if (node->leftnode == LEAF) {
+					assert(node->left);
+					*tree->leaf_index(tree, node->left) =
+									index;
+					index += tree->leaf_size(node->left);
+					count++;
+				} else if (node->left) {
+					assert(node->leftnode == NODE);
+					indent += 1;
+					node = node->left;
+					break;
+				}
+			}
+			if (node->mark && (rightmask & bitmask) == 0) {
+				rightmask |= bitmask;
+				if (node->rightnode == LEAF) {
+					assert(node->right);
+					*tree->leaf_index(tree, node->right) = index;
+					index += tree->leaf_size(node->right);
+					count++;
+				} else if (node->right) {
+					assert(node->rightnode==NODE);
+					indent += 1;
+					node = node->right;
+					break;
+				}
+			}
+			leftmask &= ~bitmask;
+			rightmask &= ~bitmask;
+			node = node->parent;
+			indent -= 1;
+		}
+	}
+done:
+	/* Round up to a multiple of 16 */
+	while (index % 16)
+		index++;
+	if (verbose > 0)
+		printf("Final index %d\n", index);
+	return index;
+}
+
+/*
+ * Compute the size of nodes and leaves. We start by assuming that
+ * each node needs to store a three-byte offset. The indexes of the
+ * nodes are calculated based on that, and then this function is
+ * called to see if the sizes of some nodes can be reduced.  This is
+ * repeated until no more changes are seen.
+ */
+static int
+size_nodes(struct tree *tree)
+{
+	struct tree *next;
+	struct node *node;
+	struct node *right;
+	struct node *n;
+	unsigned int leftmask;
+	unsigned int rightmask;
+	unsigned int bitmask;
+	unsigned int pathbits;
+	unsigned int pathmask;
+	int changed;
+	int offset;
+	int size;
+	int indent;
+
+	indent = 1;
+	changed = 0;
+	size = 0;
+
+	if (verbose > 0)
+		printf("Sizing %s_%x", tree->type, tree->maxage);
+	if (tree->childnode == LEAF)
+		goto done;
+
+	assert(tree->childnode == NODE);
+	pathbits = 0;
+	pathmask = 0;
+	node = tree->root;
+	leftmask = rightmask = 0;
+	while (node) {
+		if (!node->mark)
+			goto skip;
+		offset = 0;
+		if (!node->left || !node->right) {
+			size = 1;
+		} else {
+			if (node->rightnode == NODE) {
+				right = node->right;
+				next = tree->next;
+				while (!right->mark) {
+					assert(next);
+					n = next->root;
+					while (n->bitnum != node->bitnum) {
+						if (pathbits & (1<<n->bitnum))
+							n = n->right;
+						else
+							n = n->left;
+					}
+					n = n->right;
+					assert(right->bitnum == n->bitnum);
+					right = n;
+					next = next->next;
+				}
+				offset = right->index - node->index;
+			} else {
+				offset = *tree->leaf_index(tree, node->right);
+				offset -= node->index;
+			}
+			assert(offset >= 0);
+			assert(offset <= 0xffffff);
+			if (offset <= 0xff) {
+				size = 2;
+			} else if (offset <= 0xffff) {
+				size = 3;
+			} else { /* offset <= 0xffffff */
+				size = 4;
+			}
+		}
+		if (node->size != size || node->offset != offset) {
+			node->size = size;
+			node->offset = offset;
+			changed++;
+		}
+skip:
+		while (node) {
+			bitmask = 1 << node->bitnum;
+			pathmask |= bitmask;
+			if (node->mark && (leftmask & bitmask) == 0) {
+				leftmask |= bitmask;
+				if (node->leftnode == LEAF) {
+					assert(node->left);
+				} else if (node->left) {
+					assert(node->leftnode == NODE);
+					indent += 1;
+					node = node->left;
+					break;
+				}
+			}
+			if (node->mark && (rightmask & bitmask) == 0) {
+				rightmask |= bitmask;
+				pathbits |= bitmask;
+				if (node->rightnode == LEAF) {
+					assert(node->right);
+				} else if (node->right) {
+					assert(node->rightnode==NODE);
+					indent += 1;
+					node = node->right;
+					break;
+				}
+			}
+			leftmask &= ~bitmask;
+			rightmask &= ~bitmask;
+			pathmask &= ~bitmask;
+			pathbits &= ~bitmask;
+			node = node->parent;
+			indent -= 1;
+		}
+	}
+done:
+	if (verbose > 0)
+		printf("Found %d changes\n", changed);
+	return changed;
+}
+
+/*
+ * Emit a trie for the given tree into the data array.
+ */
+static void
+emit(struct tree *tree, unsigned char *data)
+{
+	struct node *node;
+	unsigned int leftmask;
+	unsigned int rightmask;
+	unsigned int bitmask;
+	int offlen;
+	int offset;
+	int index;
+	int indent;
+	unsigned char byte;
+
+	index = tree->index;
+	data += index;
+	indent = 1;
+	if (verbose > 0)
+		printf("Emitting %s_%x\n", tree->type, tree->maxage);
+	if (tree->childnode == LEAF) {
+		assert(tree->root);
+		tree->leaf_emit(tree->root, data);
+		return;
+	}
+
+	assert(tree->childnode == NODE);
+	node = tree->root;
+	leftmask = rightmask = 0;
+	while (node) {
+		if (!node->mark)
+			goto skip;
+		assert(node->offset != -1);
+		assert(node->index == index);
+
+		byte = 0;
+		if (node->nextbyte)
+			byte |= NEXTBYTE;
+		byte |= (node->bitnum & BITNUM);
+		if (node->left && node->right) {
+			if (node->leftnode == NODE)
+				byte |= LEFTNODE;
+			if (node->rightnode == NODE)
+				byte |= RIGHTNODE;
+			if (node->offset <= 0xff)
+				offlen = 1;
+			else if (node->offset <= 0xffff)
+				offlen = 2;
+			else
+				offlen = 3;
+			offset = node->offset;
+			byte |= offlen << OFFLEN_SHIFT;
+			*data++ = byte;
+			index++;
+			while (offlen--) {
+				*data++ = offset & 0xff;
+				index++;
+				offset >>= 8;
+			}
+		} else if (node->left) {
+			if (node->leftnode == NODE)
+				byte |= TRIENODE;
+			*data++ = byte;
+			index++;
+		} else if (node->right) {
+			byte |= RIGHTNODE;
+			if (node->rightnode == NODE)
+				byte |= TRIENODE;
+			*data++ = byte;
+			index++;
+		} else {
+			assert(0);
+		}
+skip:
+		while (node) {
+			bitmask = 1 << node->bitnum;
+			if (node->mark && (leftmask & bitmask) == 0) {
+				leftmask |= bitmask;
+				if (node->leftnode == LEAF) {
+					assert(node->left);
+					data = tree->leaf_emit(node->left,
+							       data);
+					index += tree->leaf_size(node->left);
+				} else if (node->left) {
+					assert(node->leftnode == NODE);
+					indent += 1;
+					node = node->left;
+					break;
+				}
+			}
+			if (node->mark && (rightmask & bitmask) == 0) {
+				rightmask |= bitmask;
+				if (node->rightnode == LEAF) {
+					assert(node->right);
+					data = tree->leaf_emit(node->right,
+							       data);
+					index += tree->leaf_size(node->right);
+				} else if (node->right) {
+					assert(node->rightnode==NODE);
+					indent += 1;
+					node = node->right;
+					break;
+				}
+			}
+			leftmask &= ~bitmask;
+			rightmask &= ~bitmask;
+			node = node->parent;
+			indent -= 1;
+		}
+	}
+}
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * Unicode data.
+ *
+ * We need to keep track of the Canonical Combining Class, the Age,
+ * and decompositions for a code point.
+ *
+ * For the Age, we store the index into the ages table.  Effectively
+ * this is a generation number that the table maps to a unicode
+ * version.
+ *
+ * The correction field is used to indicate that this entry is in the
+ * corrections array, which contains decompositions that were
+ * corrected in later revisions.  The value of the correction field is
+ * the Unicode version in which the mapping was corrected.
+ */
+struct unicode_data {
+	unsigned int code;
+	int ccc;
+	int gen;
+	int correction;
+	unsigned int *utf32nfkdi;
+	unsigned int *utf32nfkdicf;
+	char *utf8nfkdi;
+	char *utf8nfkdicf;
+};
+
+struct unicode_data unicode_data[0x110000];
+struct unicode_data *corrections;
+int    corrections_count;
+
+struct tree *nfkdi_tree;
+struct tree *nfkdicf_tree;
+
+struct tree *trees;
+int          trees_count;
+
+/*
+ * Check the corrections array to see if this entry was corrected at
+ * some point.
+ */
+static struct unicode_data *
+corrections_lookup(struct unicode_data *u)
+{
+	int i;
+
+	for (i = 0; i != corrections_count; i++)
+		if (u->code == corrections[i].code)
+			return &corrections[i];
+	return u;
+}
+
+static int
+nfkdi_equal(void *l, void *r)
+{
+	struct unicode_data *left = l;
+	struct unicode_data *right = r;
+
+	if (left->gen != right->gen)
+		return 0;
+	if (left->ccc != right->ccc)
+		return 0;
+	if (left->utf8nfkdi && right->utf8nfkdi &&
+	    strcmp(left->utf8nfkdi, right->utf8nfkdi) == 0)
+		return 1;
+	if (left->utf8nfkdi || right->utf8nfkdi)
+		return 0;
+	return 1;
+}
+
+static int
+nfkdicf_equal(void *l, void *r)
+{
+	struct unicode_data *left = l;
+	struct unicode_data *right = r;
+
+	if (left->gen != right->gen)
+		return 0;
+	if (left->ccc != right->ccc)
+		return 0;
+	if (left->utf8nfkdicf && right->utf8nfkdicf &&
+	    strcmp(left->utf8nfkdicf, right->utf8nfkdicf) == 0)
+		return 1;
+	if (left->utf8nfkdicf && right->utf8nfkdicf)
+		return 0;
+	if (left->utf8nfkdicf || right->utf8nfkdicf)
+		return 0;
+	if (left->utf8nfkdi && right->utf8nfkdi &&
+	    strcmp(left->utf8nfkdi, right->utf8nfkdi) == 0)
+		return 1;
+	if (left->utf8nfkdi || right->utf8nfkdi)
+		return 0;
+	return 1;
+}
+
+static void
+nfkdi_print(void *l, int indent)
+{
+	struct unicode_data *leaf = l;
+
+	printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf,
+		leaf->code, leaf->ccc, leaf->gen);
+	if (leaf->utf8nfkdi)
+		printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi);
+	printf("\n");
+}
+
+static void
+nfkdicf_print(void *l, int indent)
+{
+	struct unicode_data *leaf = l;
+
+	printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf,
+		leaf->code, leaf->ccc, leaf->gen);
+	if (leaf->utf8nfkdicf)
+		printf(" nfkdicf \"%s\"", (const char*)leaf->utf8nfkdicf);
+	else if (leaf->utf8nfkdi)
+		printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi);
+	printf("\n");
+}
+
+static int
+nfkdi_mark(void *l)
+{
+	return 1;
+}
+
+static int
+nfkdicf_mark(void *l)
+{
+	struct unicode_data *leaf = l;
+
+	if (leaf->utf8nfkdicf)
+		return 1;
+	return 0;
+}
+
+static int
+correction_mark(void *l)
+{
+	struct unicode_data *leaf = l;
+
+	return leaf->correction;
+}
+
+static int
+nfkdi_size(void *l)
+{
+	struct unicode_data *leaf = l;
+
+	int size = 2;
+	if (leaf->utf8nfkdi)
+		size += strlen(leaf->utf8nfkdi) + 1;
+	return size;
+}
+
+static int
+nfkdicf_size(void *l)
+{
+	struct unicode_data *leaf = l;
+
+	int size = 2;
+	if (leaf->utf8nfkdicf)
+		size += strlen(leaf->utf8nfkdicf) + 1;
+	else if (leaf->utf8nfkdi)
+		size += strlen(leaf->utf8nfkdi) + 1;
+	return size;
+}
+
+static int *
+nfkdi_index(struct tree *tree, void *l)
+{
+	struct unicode_data *leaf = l;
+
+	return &tree->leafindex[leaf->code];
+}
+
+static int *
+nfkdicf_index(struct tree *tree, void *l)
+{
+	struct unicode_data *leaf = l;
+
+	return &tree->leafindex[leaf->code];
+}
+
+static unsigned char *
+nfkdi_emit(void *l, unsigned char *data)
+{
+	struct unicode_data *leaf = l;
+	unsigned char *s;
+
+	*data++ = leaf->gen;
+	if (leaf->utf8nfkdi) {
+		*data++ = DECOMPOSE;
+		s = (unsigned char*)leaf->utf8nfkdi;
+		while ((*data++ = *s++) != 0)
+			;
+	} else {
+		*data++ = leaf->ccc;
+	}
+	return data;
+}
+
+static unsigned char *
+nfkdicf_emit(void *l, unsigned char *data)
+{
+	struct unicode_data *leaf = l;
+	unsigned char *s;
+
+	*data++ = leaf->gen;
+	if (leaf->utf8nfkdicf) {
+		*data++ = DECOMPOSE;
+		s = (unsigned char*)leaf->utf8nfkdicf;
+		while ((*data++ = *s++) != 0)
+			;
+	} else if (leaf->utf8nfkdi) {
+		*data++ = DECOMPOSE;
+		s = (unsigned char*)leaf->utf8nfkdi;
+		while ((*data++ = *s++) != 0)
+			;
+	} else {
+		*data++ = leaf->ccc;
+	}
+	return data;
+}
+
+static void
+utf8_create(struct unicode_data *data)
+{
+	char utf[18*4+1];
+	char *u;
+	unsigned int *um;
+	int i;
+
+	u = utf;
+	um = data->utf32nfkdi;
+	if (um) {
+		for (i = 0; um[i]; i++)
+			u += utf8key(um[i], u);
+		*u = '\0';
+		data->utf8nfkdi = strdup((char*)utf);
+	}
+	u = utf;
+	um = data->utf32nfkdicf;
+	if (um) {
+		for (i = 0; um[i]; i++)
+			u += utf8key(um[i], u);
+		*u = '\0';
+		if (!data->utf8nfkdi || strcmp(data->utf8nfkdi, (char*)utf))
+			data->utf8nfkdicf = strdup((char*)utf);
+	}
+}
+
+static void
+utf8_init(void)
+{
+	unsigned int unichar;
+	int i;
+
+	for (unichar = 0; unichar != 0x110000; unichar++)
+		utf8_create(&unicode_data[unichar]);
+
+	for (i = 0; i != corrections_count; i++)
+		utf8_create(&corrections[i]);
+}
+
+static void
+trees_init(void)
+{
+	struct unicode_data *data;
+	unsigned int maxage;
+	unsigned int nextage;
+	int count;
+	int i;
+	int j;
+
+	/* Count the number of different ages. */
+	count = 0;
+	nextage = (unsigned int)-1;
+	do {
+		maxage = nextage;
+		nextage = 0;
+		for (i = 0; i <= corrections_count; i++) {
+			data = &corrections[i];
+			if (nextage < data->correction &&
+			    data->correction < maxage)
+				nextage = data->correction;
+		}
+		count++;
+	} while (nextage);
+
+	/* Two trees per age: nfkdi and nfkdicf */
+	trees_count = count * 2;
+	trees = calloc(trees_count, sizeof(struct tree));
+
+	/* Assign ages to the trees. */
+	count = trees_count;
+	nextage = (unsigned int)-1;
+	do {
+		maxage = nextage;
+		trees[--count].maxage = maxage;
+		trees[--count].maxage = maxage;
+		nextage = 0;
+		for (i = 0; i <= corrections_count; i++) {
+			data = &corrections[i];
+			if (nextage < data->correction &&
+			    data->correction < maxage)
+				nextage = data->correction;
+		}
+	} while (nextage);
+
+	/* The ages assigned above are off by one. */
+	for (i = 0; i != trees_count; i++) {
+		j = 0;
+		while (ages[j] < trees[i].maxage)
+			j++;
+		trees[i].maxage = ages[j-1];
+	}
+
+	/* Set up the forwarding between trees. */
+	trees[trees_count-2].next = &trees[trees_count-1];
+	trees[trees_count-1].leaf_mark = nfkdi_mark;
+	trees[trees_count-2].leaf_mark = nfkdicf_mark;
+	for (i = 0; i != trees_count-2; i += 2) {
+		trees[i].next = &trees[trees_count-2];
+		trees[i].leaf_mark = correction_mark;
+		trees[i+1].next = &trees[trees_count-1];
+		trees[i+1].leaf_mark = correction_mark;
+	}
+
+	/* Assign the callouts. */
+	for (i = 0; i != trees_count; i += 2) {
+		trees[i].type = "nfkdicf";
+		trees[i].leaf_equal = nfkdicf_equal;
+		trees[i].leaf_print = nfkdicf_print;
+		trees[i].leaf_size = nfkdicf_size;
+		trees[i].leaf_index = nfkdicf_index;
+		trees[i].leaf_emit = nfkdicf_emit;
+
+		trees[i+1].type = "nfkdi";
+		trees[i+1].leaf_equal = nfkdi_equal;
+		trees[i+1].leaf_print = nfkdi_print;
+		trees[i+1].leaf_size = nfkdi_size;
+		trees[i+1].leaf_index = nfkdi_index;
+		trees[i+1].leaf_emit = nfkdi_emit;
+	}
+
+	/* Finish init. */
+	for (i = 0; i != trees_count; i++)
+		trees[i].childnode = NODE;
+}
+
+static void
+trees_populate(void)
+{
+	struct unicode_data *data;
+	unsigned int unichar;
+	char keyval[4];
+	int keylen;
+	int i;
+
+	for (i = 0; i != trees_count; i++) {
+		if (verbose > 0) {
+			printf("Populating %s_%x\n",
+				trees[i].type, trees[i].maxage);
+		}
+		for (unichar = 0; unichar != 0x110000; unichar++) {
+			if (unicode_data[unichar].gen < 0)
+				continue;
+			keylen = utf8key(unichar, keyval);
+			data = corrections_lookup(&unicode_data[unichar]);
+			if (data->correction <= trees[i].maxage)
+				data = &unicode_data[unichar];
+			insert(&trees[i], keyval, keylen, data);
+		}
+	}
+}
+
+static void
+trees_reduce(void)
+{
+	int i;
+	int size;
+	int changed;
+
+	for (i = 0; i != trees_count; i++)
+		prune(&trees[i]);
+	for (i = 0; i != trees_count; i++)
+		mark_nodes(&trees[i]);
+	do {
+		size = 0;
+		for (i = 0; i != trees_count; i++)
+			size = index_nodes(&trees[i], size);
+		changed = 0;
+		for (i = 0; i != trees_count; i++)
+			changed += size_nodes(&trees[i]);
+	} while (changed);
+
+	utf8data = calloc(size, 1);
+	utf8data_size = size;
+	for (i = 0; i != trees_count; i++)
+		emit(&trees[i], utf8data);
+
+	if (verbose > 0) {
+		for (i = 0; i != trees_count; i++) {
+			printf("%s_%x idx %d\n",
+				trees[i].type, trees[i].maxage, trees[i].index);
+		}
+	}
+
+	nfkdi = utf8data + trees[trees_count-1].index;
+	nfkdicf = utf8data + trees[trees_count-2].index;
+
+	nfkdi_tree = &trees[trees_count-1];
+	nfkdicf_tree = &trees[trees_count-2];
+}
+
+static void
+verify(struct tree *tree)
+{
+	struct unicode_data *data;
+	utf8leaf_t	*leaf;
+	unsigned int	unichar;
+	char		key[4];
+	int		report;
+	int		nocf;
+
+	if (verbose > 0)
+		printf("Verifying %s_%x\n", tree->type, tree->maxage);
+	nocf = strcmp(tree->type, "nfkdicf");
+
+	for (unichar = 0; unichar != 0x110000; unichar++) {
+		report = 0;
+		data = corrections_lookup(&unicode_data[unichar]);
+		if (data->correction <= tree->maxage)
+			data = &unicode_data[unichar];
+		utf8key(unichar, key);
+		leaf = utf8lookup(tree, key);
+		if (!leaf) {
+			if (data->gen != -1)
+				report++;
+			if (unichar < 0xd800 || unichar > 0xdfff)
+				report++;
+		} else {
+			if (unichar >= 0xd800 && unichar <= 0xdfff)
+				report++;
+			if (data->gen == -1)
+				report++;
+			if (data->gen != LEAF_GEN(leaf))
+				report++;
+			if (LEAF_CCC(leaf) == DECOMPOSE) {
+				if (nocf) {
+					if (!data->utf8nfkdi) {
+						report++;
+					} else if (strcmp(data->utf8nfkdi,
+							LEAF_STR(leaf))) {
+						report++;
+					}
+				} else {
+					if (!data->utf8nfkdicf &&
+					    !data->utf8nfkdi) {
+						report++;
+					} else if (data->utf8nfkdicf) {
+						if (strcmp(data->utf8nfkdicf,
+							   LEAF_STR(leaf)))
+							report++;
+					} else if (strcmp(data->utf8nfkdi,
+							  LEAF_STR(leaf))) {
+						report++;
+					}
+				}
+			} else if (data->ccc != LEAF_CCC(leaf)) {
+				report++;
+			}
+		}
+		if (report) {
+			printf("%X code %X gen %d ccc %d"
+				" nfdki -> \"%s\"",
+				unichar, data->code, data->gen,
+				data->ccc,
+				data->utf8nfkdi);
+			if (leaf) {
+				printf(" age %d ccc %d"
+					" nfdki -> \"%s\"\n",
+					LEAF_GEN(leaf),
+					LEAF_CCC(leaf),
+					LEAF_CCC(leaf) == DECOMPOSE ?
+						LEAF_STR(leaf) : "");
+			}
+			printf("\n");
+		}
+	}
+}
+
+static void
+trees_verify(void)
+{
+	int i;
+
+	for (i = 0; i != trees_count; i++)
+		verify(&trees[i]);
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+help(void)
+{
+	printf("Usage: %s [options]\n", argv0);
+	printf("\n");
+	printf("This program creates an a data trie used for parsing and\n");
+	printf("normalization of UTF-8 strings. The trie is derived from\n");
+	printf("a set of input files from the Unicode character database\n");
+	printf("found at: http://www.unicode.org/Public/UCD/latest/ucd/\n");
+	printf("\n");
+	printf("The generated tree supports two normalization forms:\n");
+	printf("\n");
+	printf("\tnfkdi:\n");
+	printf("\t- Apply unicode normalization form NFKD.\n");
+	printf("\t- Remove any Default_Ignorable_Code_Point.\n");
+	printf("\n");
+	printf("\tnfkdicf:\n");
+	printf("\t- Apply unicode normalization form NFKD.\n");
+	printf("\t- Remove any Default_Ignorable_Code_Point.\n");
+	printf("\t- Apply a full casefold (C + F).\n");
+	printf("\n");
+	printf("These forms were chosen as being most useful when dealing\n");
+	printf("with file names: NFKD catches most cases where characters\n");
+	printf("should be considered equivalent. The ignorables are mostly\n");
+	printf("invisible, making names hard to type.\n");
+	printf("\n");
+	printf("The options to specify the files to be used are listed\n");
+	printf("below with their default values, which are the names used\n");
+	printf("by version 7.0.0 of the Unicode Character Database.\n");
+	printf("\n");
+	printf("The input files:\n");
+	printf("\t-a %s\n", AGE_NAME);
+	printf("\t-c %s\n", CCC_NAME);
+	printf("\t-p %s\n", PROP_NAME);
+	printf("\t-d %s\n", DATA_NAME);
+	printf("\t-f %s\n", FOLD_NAME);
+	printf("\t-n %s\n", NORM_NAME);
+	printf("\n");
+	printf("Additionally, the generated tables are tested using:\n");
+	printf("\t-t %s\n", TEST_NAME);
+	printf("\n");
+	printf("Finally, the output file:\n");
+	printf("\t-o %s\n", UTF8_NAME);
+	printf("\n");
+}
+
+static void
+usage(void)
+{
+	help();
+	exit(1);
+}
+
+static void
+open_fail(const char *name, int error)
+{
+	printf("Error %d opening %s: %s\n", error, name, strerror(error));
+	exit(1);
+}
+
+static void
+file_fail(const char *filename)
+{
+	printf("Error parsing %s\n", filename);
+	exit(1);
+}
+
+static void
+line_fail(const char *filename, const char *line)
+{
+	printf("Error parsing %s:%s\n", filename, line);
+	exit(1);
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+print_utf32(unsigned int *utf32str)
+{
+	int	i;
+
+	for (i = 0; utf32str[i]; i++)
+		printf(" %X", utf32str[i]);
+}
+
+static void
+print_utf32nfkdi(unsigned int unichar)
+{
+	printf(" %X ->", unichar);
+	print_utf32(unicode_data[unichar].utf32nfkdi);
+	printf("\n");
+}
+
+static void
+print_utf32nfkdicf(unsigned int unichar)
+{
+	printf(" %X ->", unichar);
+	print_utf32(unicode_data[unichar].utf32nfkdicf);
+	printf("\n");
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+age_init(void)
+{
+	FILE *file;
+	unsigned int first;
+	unsigned int last;
+	unsigned int unichar;
+	unsigned int major;
+	unsigned int minor;
+	unsigned int revision;
+	int gen;
+	int count;
+	int ret;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", age_name);
+
+	file = fopen(age_name, "r");
+	if (!file)
+		open_fail(age_name, errno);
+	count = 0;
+
+	gen = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "# Age=V%d_%d_%d",
+				&major, &minor, &revision);
+		if (ret == 3) {
+			ages_count++;
+			if (verbose > 1)
+				printf(" Age V%d_%d_%d\n",
+					major, minor, revision);
+			if (!age_valid(major, minor, revision))
+				line_fail(age_name, line);
+			continue;
+		}
+		ret = sscanf(line, "# Age=V%d_%d", &major, &minor);
+		if (ret == 2) {
+			ages_count++;
+			if (verbose > 1)
+				printf(" Age V%d_%d\n", major, minor);
+			if (!age_valid(major, minor, 0))
+				line_fail(age_name, line);
+			continue;
+		}
+	}
+
+	/* We must have found something above. */
+	if (verbose > 1)
+		printf("%d age entries\n", ages_count);
+	if (ages_count == 0 || ages_count > MAXGEN)
+		file_fail(age_name);
+
+	/* There is a 0 entry. */
+	ages_count++;
+	ages = calloc(ages_count + 1, sizeof(*ages));
+	/* And a guard entry. */
+	ages[ages_count] = (unsigned int)-1;
+
+	rewind(file);
+	count = 0;
+	gen = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "# Age=V%d_%d_%d",
+				&major, &minor, &revision);
+		if (ret == 3) {
+			ages[++gen] =
+				UNICODE_AGE(major, minor, revision);
+			if (verbose > 1)
+				printf(" Age V%d_%d_%d = gen %d\n",
+					major, minor, revision, gen);
+			if (!age_valid(major, minor, revision))
+				line_fail(age_name, line);
+			continue;
+		}
+		ret = sscanf(line, "# Age=V%d_%d", &major, &minor);
+		if (ret == 2) {
+			ages[++gen] = UNICODE_AGE(major, minor, 0);
+			if (verbose > 1)
+				printf(" Age V%d_%d = %d\n",
+					major, minor, gen);
+			if (!age_valid(major, minor, 0))
+				line_fail(age_name, line);
+			continue;
+		}
+		ret = sscanf(line, "%X..%X ; %d.%d #",
+			     &first, &last, &major, &minor);
+		if (ret == 4) {
+			for (unichar = first; unichar <= last; unichar++)
+				unicode_data[unichar].gen = gen;
+			count += 1 + last - first;
+			if (verbose > 1)
+				printf("  %X..%X gen %d\n", first, last, gen);
+			if (!utf32valid(first) || !utf32valid(last))
+				line_fail(age_name, line);
+			continue;
+		}
+		ret = sscanf(line, "%X ; %d.%d #", &unichar, &major, &minor);
+		if (ret == 3) {
+			unicode_data[unichar].gen = gen;
+			count++;
+			if (verbose > 1)
+				printf("  %X gen %d\n", unichar, gen);
+			if (!utf32valid(unichar))
+				line_fail(age_name, line);
+			continue;
+		}
+	}
+	unicode_maxage = ages[gen];
+	fclose(file);
+
+	/* Nix surrogate block */
+	if (verbose > 1)
+		printf(" Removing surrogate block D800..DFFF\n");
+	for (unichar = 0xd800; unichar <= 0xdfff; unichar++)
+		unicode_data[unichar].gen = -1;
+
+	if (verbose > 0)
+	        printf("Found %d entries\n", count);
+	if (count == 0)
+		file_fail(age_name);
+}
+
+static void
+ccc_init(void)
+{
+	FILE *file;
+	unsigned int first;
+	unsigned int last;
+	unsigned int unichar;
+	unsigned int value;
+	int count;
+	int ret;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", ccc_name);
+
+	file = fopen(ccc_name, "r");
+	if (!file)
+		open_fail(ccc_name, errno);
+
+	count = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%X..%X ; %d #", &first, &last, &value);
+		if (ret == 3) {
+			for (unichar = first; unichar <= last; unichar++) {
+				unicode_data[unichar].ccc = value;
+                                count++;
+			}
+			if (verbose > 1)
+				printf(" %X..%X ccc %d\n", first, last, value);
+			if (!utf32valid(first) || !utf32valid(last))
+				line_fail(ccc_name, line);
+			continue;
+		}
+		ret = sscanf(line, "%X ; %d #", &unichar, &value);
+		if (ret == 2) {
+			unicode_data[unichar].ccc = value;
+                        count++;
+			if (verbose > 1)
+				printf(" %X ccc %d\n", unichar, value);
+			if (!utf32valid(unichar))
+				line_fail(ccc_name, line);
+			continue;
+		}
+	}
+	fclose(file);
+
+	if (verbose > 0)
+		printf("Found %d entries\n", count);
+	if (count == 0)
+		file_fail(ccc_name);
+}
+
+static void
+nfkdi_init(void)
+{
+	FILE *file;
+	unsigned int unichar;
+	unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+	char *s;
+	unsigned int *um;
+	int count;
+	int i;
+	int ret;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", data_name);
+	file = fopen(data_name, "r");
+	if (!file)
+		open_fail(data_name, errno);
+
+	count = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%X;%*[^;];%*[^;];%*[^;];%*[^;];%[^;];",
+			     &unichar, buf0);
+		if (ret != 2)
+			continue;
+		if (!utf32valid(unichar))
+			line_fail(data_name, line);
+
+		s = buf0;
+		/* skip over <tag> */
+		if (*s == '<')
+			while (*s++ != ' ')
+				;
+		/* decode the decomposition into UTF-32 */
+		i = 0;
+		while (*s) {
+			mapping[i] = strtoul(s, &s, 16);
+			if (!utf32valid(mapping[i]))
+				line_fail(data_name, line);
+			i++;
+		}
+		mapping[i++] = 0;
+
+		um = malloc(i * sizeof(unsigned int));
+		memcpy(um, mapping, i * sizeof(unsigned int));
+		unicode_data[unichar].utf32nfkdi = um;
+
+		if (verbose > 1)
+			print_utf32nfkdi(unichar);
+		count++;
+	}
+	fclose(file);
+	if (verbose > 0)
+		printf("Found %d entries\n", count);
+	if (count == 0)
+		file_fail(data_name);
+}
+
+static void
+nfkdicf_init(void)
+{
+	FILE *file;
+	unsigned int unichar;
+	unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+	char status;
+	char *s;
+	unsigned int *um;
+	int i;
+	int count;
+	int ret;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", fold_name);
+	file = fopen(fold_name, "r");
+	if (!file)
+		open_fail(fold_name, errno);
+
+	count = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%X; %c; %[^;];", &unichar, &status, buf0);
+		if (ret != 3)
+			continue;
+		if (!utf32valid(unichar))
+			line_fail(fold_name, line);
+		/* Use the C+F casefold. */
+		if (status != 'C' && status != 'F')
+			continue;
+		s = buf0;
+		if (*s == '<')
+			while (*s++ != ' ')
+				;
+		i = 0;
+		while (*s) {
+			mapping[i] = strtoul(s, &s, 16);
+			if (!utf32valid(mapping[i]))
+				line_fail(fold_name, line);
+			i++;
+		}
+		mapping[i++] = 0;
+
+		um = malloc(i * sizeof(unsigned int));
+		memcpy(um, mapping, i * sizeof(unsigned int));
+		unicode_data[unichar].utf32nfkdicf = um;
+
+		if (verbose > 1)
+			print_utf32nfkdicf(unichar);
+		count++;
+	}
+	fclose(file);
+	if (verbose > 0)
+		printf("Found %d entries\n", count);
+	if (count == 0)
+		file_fail(fold_name);
+}
+
+static void
+ignore_init(void)
+{
+	FILE *file;
+	unsigned int unichar;
+	unsigned int first;
+	unsigned int last;
+	unsigned int *um;
+	int count;
+	int ret;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", prop_name);
+	file = fopen(prop_name, "r");
+	if (!file)
+		open_fail(prop_name, errno);
+	assert(file);
+	count = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%X..%X ; %s # ", &first, &last, buf0);
+		if (ret == 3) {
+			if (strcmp(buf0, "Default_Ignorable_Code_Point"))
+				continue;
+			if (!utf32valid(first) || !utf32valid(last))
+				line_fail(prop_name, line);
+			for (unichar = first; unichar <= last; unichar++) {
+				free(unicode_data[unichar].utf32nfkdi);
+				um = malloc(sizeof(unsigned int));
+				*um = 0;
+				unicode_data[unichar].utf32nfkdi = um;
+				free(unicode_data[unichar].utf32nfkdicf);
+				um = malloc(sizeof(unsigned int));
+				*um = 0;
+				unicode_data[unichar].utf32nfkdicf = um;
+				count++;
+			}
+			if (verbose > 1)
+				printf(" %X..%X Default_Ignorable_Code_Point\n",
+					first, last);
+			continue;
+		}
+		ret = sscanf(line, "%X ; %s # ", &unichar, buf0);
+		if (ret == 2) {
+			if (strcmp(buf0, "Default_Ignorable_Code_Point"))
+				continue;
+			if (!utf32valid(unichar))
+				line_fail(prop_name, line);
+			free(unicode_data[unichar].utf32nfkdi);
+			um = malloc(sizeof(unsigned int));
+			*um = 0;
+			unicode_data[unichar].utf32nfkdi = um;
+			free(unicode_data[unichar].utf32nfkdicf);
+			um = malloc(sizeof(unsigned int));
+			*um = 0;
+			unicode_data[unichar].utf32nfkdicf = um;
+			if (verbose > 1)
+				printf(" %X Default_Ignorable_Code_Point\n",
+					unichar);
+			count++;
+			continue;
+		}
+	}
+	fclose(file);
+
+	if (verbose > 0)
+		printf("Found %d entries\n", count);
+	if (count == 0)
+		file_fail(prop_name);
+}
+
+static void
+corrections_init(void)
+{
+	FILE *file;
+	unsigned int unichar;
+	unsigned int major;
+	unsigned int minor;
+	unsigned int revision;
+	unsigned int age;
+	unsigned int *um;
+	unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+	char *s;
+	int i;
+	int count;
+	int ret;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", norm_name);
+	file = fopen(norm_name, "r");
+	if (!file)
+		open_fail(norm_name, errno);
+
+	count = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%X;%[^;];%[^;];%d.%d.%d #",
+				&unichar, buf0, buf1,
+				&major, &minor, &revision);
+		if (ret != 6)
+			continue;
+		if (!utf32valid(unichar) || !age_valid(major, minor, revision))
+			line_fail(norm_name, line);
+		count++;
+	}
+	corrections = calloc(count, sizeof(struct unicode_data));
+	corrections_count = count;
+	rewind(file);
+
+	count = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%X;%[^;];%[^;];%d.%d.%d #",
+				&unichar, buf0, buf1,
+				&major, &minor, &revision);
+		if (ret != 6)
+			continue;
+		if (!utf32valid(unichar) || !age_valid(major, minor, revision))
+			line_fail(norm_name, line);
+		corrections[count] = unicode_data[unichar];
+		assert(corrections[count].code == unichar);
+		age = UNICODE_AGE(major, minor, revision);
+		corrections[count].correction = age;
+
+		i = 0;
+		s = buf0;
+		while (*s) {
+			mapping[i] = strtoul(s, &s, 16);
+			if (!utf32valid(mapping[i]))
+				line_fail(norm_name, line);
+			i++;
+		}
+		mapping[i++] = 0;
+
+		um = malloc(i * sizeof(unsigned int));
+		memcpy(um, mapping, i * sizeof(unsigned int));
+		corrections[count].utf32nfkdi = um;
+
+		if (verbose > 1)
+			printf(" %X -> %s -> %s V%d_%d_%d\n",
+				unichar, buf0, buf1, major, minor, revision);
+		count++;
+	}
+	fclose(file);
+
+	if (verbose > 0)
+	        printf("Found %d entries\n", count);
+	if (count == 0)
+		file_fail(norm_name);
+}
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * Hangul decomposition (algorithm from Section 3.12 of Unicode 6.3.0)
+ *
+ * AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
+ * D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
+ *
+ * SBase = 0xAC00
+ * LBase = 0x1100
+ * VBase = 0x1161
+ * TBase = 0x11A7
+ * LCount = 19
+ * VCount = 21
+ * TCount = 28
+ * NCount = 588 (VCount * TCount)
+ * SCount = 11172 (LCount * NCount)
+ *
+ * Decomposition:
+ *   SIndex = s - SBase
+ *
+ * LV (Canonical/Full)
+ *   LIndex = SIndex / NCount
+ *   VIndex = (Sindex % NCount) / TCount
+ *   LPart = LBase + LIndex
+ *   VPart = VBase + VIndex
+ *
+ * LVT (Canonical)
+ *   LVIndex = (SIndex / TCount) * TCount
+ *   TIndex = (Sindex % TCount
+ *   LVPart = LBase + LVIndex
+ *   TPart = TBase + TIndex
+ *
+ * LVT (Full)
+ *   LIndex = SIndex / NCount
+ *   VIndex = (Sindex % NCount) / TCount
+ *   TIndex = (Sindex % TCount
+ *   LPart = LBase + LIndex
+ *   VPart = VBase + VIndex
+ *   if (TIndex == 0) {
+ *          d = <LPart, VPart>
+ *   } else {
+ *          TPart = TBase + TIndex
+ *          d = <LPart, TPart, VPart>
+ *   }
+ *
+ */
+
+static void
+hangul_decompose(void)
+{
+	unsigned int sb = 0xAC00;
+	unsigned int lb = 0x1100;
+	unsigned int vb = 0x1161;
+	unsigned int tb = 0x11a7;
+	/* unsigned int lc = 19; */
+	unsigned int vc = 21;
+	unsigned int tc = 28;
+	unsigned int nc = (vc * tc);
+	/* unsigned int sc = (lc * nc); */
+	unsigned int unichar;
+	unsigned int mapping[4];
+	unsigned int *um;
+        int count;
+	int i;
+
+	if (verbose > 0)
+		printf("Decomposing hangul\n");
+	/* Hangul */
+	count = 0;
+	for (unichar = 0xAC00; unichar <= 0xD7A3; unichar++) {
+		unsigned int si = unichar - sb;
+		unsigned int li = si / nc;
+		unsigned int vi = (si % nc) / tc;
+		unsigned int ti = si % tc;
+
+		i = 0;
+		mapping[i++] = lb + li;
+		mapping[i++] = vb + vi;
+		if (ti)
+			mapping[i++] = tb + ti;
+		mapping[i++] = 0;
+
+		assert(!unicode_data[unichar].utf32nfkdi);
+		um = malloc(i * sizeof(unsigned int));
+		memcpy(um, mapping, i * sizeof(unsigned int));
+		unicode_data[unichar].utf32nfkdi = um;
+
+		assert(!unicode_data[unichar].utf32nfkdicf);
+		um = malloc(i * sizeof(unsigned int));
+		memcpy(um, mapping, i * sizeof(unsigned int));
+		unicode_data[unichar].utf32nfkdicf = um;
+
+		if (verbose > 1)
+			print_utf32nfkdi(unichar);
+
+		count++;
+	}
+	if (verbose > 0)
+		printf("Created %d entries\n", count);
+}
+
+static void
+nfkdi_decompose(void)
+{
+	unsigned int unichar;
+	unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+	unsigned int *um;
+	unsigned int *dc;
+	int count;
+	int i;
+	int j;
+	int ret;
+
+	if (verbose > 0)
+		printf("Decomposing nfkdi\n");
+
+	count = 0;
+	for (unichar = 0; unichar != 0x110000; unichar++) {
+		if (!unicode_data[unichar].utf32nfkdi)
+			continue;
+		for (;;) {
+			ret = 1;
+			i = 0;
+			um = unicode_data[unichar].utf32nfkdi;
+			while (*um) {
+				dc = unicode_data[*um].utf32nfkdi;
+				if (dc) {
+					for (j = 0; dc[j]; j++)
+						mapping[i++] = dc[j];
+					ret = 0;
+				} else {
+					mapping[i++] = *um;
+				}
+				um++;
+			}
+			mapping[i++] = 0;
+			if (ret)
+				break;
+			free(unicode_data[unichar].utf32nfkdi);
+			um = malloc(i * sizeof(unsigned int));
+			memcpy(um, mapping, i * sizeof(unsigned int));
+			unicode_data[unichar].utf32nfkdi = um;
+		}
+		/* Add this decomposition to nfkdicf if there is no entry. */
+		if (!unicode_data[unichar].utf32nfkdicf) {
+			um = malloc(i * sizeof(unsigned int));
+			memcpy(um, mapping, i * sizeof(unsigned int));
+			unicode_data[unichar].utf32nfkdicf = um;
+		}
+		if (verbose > 1)
+			print_utf32nfkdi(unichar);
+		count++;
+	}
+	if (verbose > 0)
+		printf("Processed %d entries\n", count);
+}
+
+static void
+nfkdicf_decompose(void)
+{
+	unsigned int unichar;
+	unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+	unsigned int *um;
+	unsigned int *dc;
+	int count;
+	int i;
+	int j;
+	int ret;
+
+	if (verbose > 0)
+		printf("Decomposing nfkdicf\n");
+	count = 0;
+	for (unichar = 0; unichar != 0x110000; unichar++) {
+		if (!unicode_data[unichar].utf32nfkdicf)
+			continue;
+		for (;;) {
+			ret = 1;
+			i = 0;
+			um = unicode_data[unichar].utf32nfkdicf;
+			while (*um) {
+				dc = unicode_data[*um].utf32nfkdicf;
+				if (dc) {
+					for (j = 0; dc[j]; j++)
+						mapping[i++] = dc[j];
+					ret = 0;
+				} else {
+					mapping[i++] = *um;
+				}
+				um++;
+			}
+			mapping[i++] = 0;
+			if (ret)
+				break;
+			free(unicode_data[unichar].utf32nfkdicf);
+			um = malloc(i * sizeof(unsigned int));
+			memcpy(um, mapping, i * sizeof(unsigned int));
+			unicode_data[unichar].utf32nfkdicf = um;
+		}
+		if (verbose > 1)
+			print_utf32nfkdicf(unichar);
+		count++;
+	}
+	if (verbose > 0)
+		printf("Processed %d entries\n", count);
+}
+
+/* ------------------------------------------------------------------ */
+
+int utf8agemax(struct tree *, const char *);
+int utf8nagemax(struct tree *, const char *, size_t);
+int utf8agemin(struct tree *, const char *);
+int utf8nagemin(struct tree *, const char *, size_t);
+ssize_t utf8len(struct tree *, const char *);
+ssize_t utf8nlen(struct tree *, const char *, size_t);
+struct utf8cursor;
+int utf8cursor(struct utf8cursor *, struct tree *, const char *);
+int utf8ncursor(struct utf8cursor *, struct tree *, const char *, size_t);
+int utf8byte(struct utf8cursor *);
+
+/*
+ * Use trie to scan s, touching at most len bytes.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * A non-NULL return guarantees that the UTF-8 sequence starting at s
+ * is well-formed and corresponds to a known unicode code point.  The
+ * shorthand for this will be "is valid UTF-8 unicode".
+ */
+static utf8leaf_t *
+utf8nlookup(struct tree *tree, const char *s, size_t len)
+{
+	utf8trie_t	*trie = utf8data + tree->index;
+	int		offlen;
+	int		offset;
+	int		mask;
+	int		node;
+
+	if (!tree)
+		return NULL;
+	if (len == 0)
+		return NULL;
+	node = 1;
+	while (node) {
+		offlen = (*trie & OFFLEN) >> OFFLEN_SHIFT;
+		if (*trie & NEXTBYTE) {
+			if (--len == 0)
+				return NULL;
+			s++;
+		}
+		mask = 1 << (*trie & BITNUM);
+		if (*s & mask) {
+			/* Right leg */
+			if (offlen) {
+				/* Right node at offset of trie */
+				node = (*trie & RIGHTNODE);
+				offset = trie[offlen];
+				while (--offlen) {
+					offset <<= 8;
+					offset |= trie[offlen];
+				}
+				trie += offset;
+			} else if (*trie & RIGHTPATH) {
+				/* Right node after this node */
+				node = (*trie & TRIENODE);
+				trie++;
+			} else {
+				/* No right node. */
+				node = 0;
+				trie = NULL;
+			}
+		} else {
+			/* Left leg */
+			if (offlen) {
+				/* Left node after this node. */
+				node = (*trie & LEFTNODE);
+				trie += offlen + 1;
+			} else if (*trie & RIGHTPATH) {
+				/* No left node. */
+				node = 0;
+				trie = NULL;
+			} else {
+				/* Left node after this node */
+				node = (*trie & TRIENODE);
+				trie++;
+			}
+		}
+	}
+	return trie;
+}
+
+/*
+ * Use trie to scan s.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * Forwards to trie_nlookup().
+ */
+static utf8leaf_t *
+utf8lookup(struct tree *tree, const char *s)
+{
+	return utf8nlookup(tree, s, (size_t)-1);
+}
+
+/*
+ * Return the number of bytes used by the current UTF-8 sequence.
+ * Assumes the input points to the first byte of a valid UTF-8
+ * sequence.
+ */
+static inline int
+utf8clen(const char *s)
+{
+	unsigned char c = *s;
+	return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0);
+}
+
+/*
+ * Maximum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if only non-assigned code points are used.
+ */
+int
+utf8agemax(struct tree *tree, const char *s)
+{
+	utf8leaf_t	*leaf;
+	int		age = 0;
+	int		leaf_age;
+
+	if (!tree)
+		return -1;
+	while (*s) {
+		if (!(leaf = utf8lookup(tree, s)))
+			return -1;
+		leaf_age = ages[LEAF_GEN(leaf)];
+		if (leaf_age <= tree->maxage && leaf_age > age)
+			age = leaf_age;
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Minimum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if non-assigned code points are used.
+ */
+int
+utf8agemin(struct tree *tree, const char *s)
+{
+	utf8leaf_t	*leaf;
+	int		age = tree->maxage;
+	int		leaf_age;
+
+	if (!tree)
+		return -1;
+	while (*s) {
+		if (!(leaf = utf8lookup(tree, s)))
+			return -1;
+		leaf_age = ages[LEAF_GEN(leaf)];
+		if (leaf_age <= tree->maxage && leaf_age < age)
+			age = leaf_age;
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Maximum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemax(struct tree *tree, const char *s, size_t len)
+{
+	utf8leaf_t	*leaf;
+	int		age = 0;
+	int		leaf_age;
+
+	if (!tree)
+		return -1;
+        while (len && *s) {
+		if (!(leaf = utf8nlookup(tree, s, len)))
+			return -1;
+		leaf_age = ages[LEAF_GEN(leaf)];
+		if (leaf_age <= tree->maxage && leaf_age > age)
+			age = leaf_age;
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Maximum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemin(struct tree *tree, const char *s, size_t len)
+{
+	utf8leaf_t	*leaf;
+	int		leaf_age;
+	int		age = tree->maxage;
+
+	if (!tree)
+		return -1;
+        while (len && *s) {
+		if (!(leaf = utf8nlookup(tree, s, len)))
+			return -1;
+		leaf_age = ages[LEAF_GEN(leaf)];
+		if (leaf_age <= tree->maxage && leaf_age < age)
+			age = leaf_age;
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Length of the normalization of s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ *
+ * A string of Default_Ignorable_Code_Point has length 0.
+ */
+ssize_t
+utf8len(struct tree *tree, const char *s)
+{
+	utf8leaf_t	*leaf;
+	size_t		ret = 0;
+
+	if (!tree)
+		return -1;
+	while (*s) {
+		if (!(leaf = utf8lookup(tree, s)))
+			return -1;
+		if (ages[LEAF_GEN(leaf)] > tree->maxage)
+			ret += utf8clen(s);
+		else if (LEAF_CCC(leaf) == DECOMPOSE)
+			ret += strlen(LEAF_STR(leaf));
+		else
+			ret += utf8clen(s);
+		s += utf8clen(s);
+	}
+	return ret;
+}
+
+/*
+ * Length of the normalization of s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+ssize_t
+utf8nlen(struct tree *tree, const char *s, size_t len)
+{
+	utf8leaf_t	*leaf;
+	size_t		ret = 0;
+
+	if (!tree)
+		return -1;
+	while (len && *s) {
+		if (!(leaf = utf8nlookup(tree, s, len)))
+			return -1;
+		if (ages[LEAF_GEN(leaf)] > tree->maxage)
+			ret += utf8clen(s);
+		else if (LEAF_CCC(leaf) == DECOMPOSE)
+			ret += strlen(LEAF_STR(leaf));
+		else
+			ret += utf8clen(s);
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return ret;
+}
+
+/*
+ * Cursor structure used by the normalizer.
+ */
+struct utf8cursor {
+	struct tree	*tree;
+	const char	*s;
+	const char	*p;
+	const char	*ss;
+	const char	*sp;
+	unsigned int	len;
+	unsigned int	slen;
+	short int	ccc;
+	short int	nccc;
+	unsigned int	unichar;
+};
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ *   s      : string.
+ *   len    : length of s.
+ *   u8c    : pointer to cursor.
+ *   trie   : utf8trie_t to use for normalization.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int
+utf8ncursor(
+	struct utf8cursor *u8c,
+	struct tree	*tree,
+	const char	*s,
+	size_t		len)
+{
+	if (!tree)
+		return -1;
+	if (!s)
+		return -1;
+	u8c->tree = tree;
+	u8c->s = s;
+	u8c->p = NULL;
+	u8c->ss = NULL;
+	u8c->sp = NULL;
+	u8c->len = len;
+	u8c->slen = 0;
+	u8c->ccc = STOPPER;
+	u8c->nccc = STOPPER;
+	u8c->unichar = 0;
+	/* Check we didn't clobber the maximum length. */
+	if (u8c->len != len)
+		return -1;
+	/* The first byte of s may not be an utf8 continuation. */
+	if (len > 0 && (*s & 0xC0) == 0x80)
+		return -1;
+	return 0;
+}
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ *   s      : NUL-terminated string.
+ *   u8c    : pointer to cursor.
+ *   trie   : utf8trie_t to use for normalization.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int
+utf8cursor(
+	struct utf8cursor *u8c,
+	struct tree	*tree,
+	const char	*s)
+{
+	return utf8ncursor(u8c, tree, s, (unsigned int)-1);
+}
+
+/*
+ * Get one byte from the normalized form of the string described by u8c.
+ *
+ * Returns the byte cast to an unsigned char on succes, and -1 on failure.
+ *
+ * The cursor keeps track of the location in the string in u8c->s.
+ * When a character is decomposed, the current location is stored in
+ * u8c->p, and u8c->s is set to the start of the decomposition. Note
+ * that bytes from a decomposition do not count against u8c->len.
+ *
+ * Characters are emitted if they match the current CCC in u8c->ccc.
+ * Hitting end-of-string while u8c->ccc == STOPPER means we're done,
+ * and the function returns 0 in that case.
+ *
+ * Sorting by CCC is done by repeatedly scanning the string.  The
+ * values of u8c->s and u8c->p are stored in u8c->ss and u8c->sp at
+ * the start of the scan.  The first pass finds the lowest CCC to be
+ * emitted and stores it in u8c->nccc, the second pass emits the
+ * characters with this CCC and finds the next lowest CCC. This limits
+ * the number of passes to 1 + the number of different CCCs in the
+ * sequence being scanned.
+ *
+ * Therefore:
+ *  u8c->p  != NULL -> a decomposition is being scanned.
+ *  u8c->ss != NULL -> this is a repeating scan.
+ *  u8c->ccc == -1  -> this is the first scan of a repeating scan.
+ */
+int
+utf8byte(struct utf8cursor *u8c)
+{
+	utf8leaf_t *leaf;
+	int ccc;
+
+	for (;;) {
+		/* Check for the end of a decomposed character. */
+		if (u8c->p && *u8c->s == '\0') {
+			u8c->s = u8c->p;
+			u8c->p = NULL;
+		}
+
+		/* Check for end-of-string. */
+		if (!u8c->p && (u8c->len == 0 || *u8c->s == '\0')) {
+			/* There is no next byte. */
+			if (u8c->ccc == STOPPER)
+				return 0;
+			/* End-of-string during a scan counts as a stopper. */
+			ccc = STOPPER;
+			goto ccc_mismatch;
+		} else if ((*u8c->s & 0xC0) == 0x80) {
+			/* This is a continuation of the current character. */
+			if (!u8c->p)
+				u8c->len--;
+			return (unsigned char)*u8c->s++;
+		}
+
+		/* Look up the data for the current character. */
+		if (u8c->p)
+			leaf = utf8lookup(u8c->tree, u8c->s);
+		else
+			leaf = utf8nlookup(u8c->tree, u8c->s, u8c->len);
+
+		/* No leaf found implies that the input is a binary blob. */
+		if (!leaf)
+			return -1;
+
+		/* Characters that are too new have CCC 0. */
+		if (ages[LEAF_GEN(leaf)] > u8c->tree->maxage) {
+			ccc = STOPPER;
+		} else if ((ccc = LEAF_CCC(leaf)) == DECOMPOSE) {
+			u8c->len -= utf8clen(u8c->s);
+			u8c->p = u8c->s + utf8clen(u8c->s);
+			u8c->s = LEAF_STR(leaf);
+			/* Empty decomposition implies CCC 0. */
+			if (*u8c->s == '\0') {
+				if (u8c->ccc == STOPPER)
+					continue;
+				ccc = STOPPER;
+				goto ccc_mismatch;
+			}
+			leaf = utf8lookup(u8c->tree, u8c->s);
+			ccc = LEAF_CCC(leaf);
+		}
+		u8c->unichar = utf8code(u8c->s);
+
+		/*
+		 * If this is not a stopper, then see if it updates
+		 * the next canonical class to be emitted.
+		 */
+		if (ccc != STOPPER && u8c->ccc < ccc && ccc < u8c->nccc)
+			u8c->nccc = ccc;
+
+		/*
+		 * Return the current byte if this is the current
+		 * combining class.
+		 */
+		if (ccc == u8c->ccc) {
+			if (!u8c->p)
+				u8c->len--;
+			return (unsigned char)*u8c->s++;
+		}
+
+		/* Current combining class mismatch. */
+	ccc_mismatch:
+		if (u8c->nccc == STOPPER) {
+			/*
+			 * Scan forward for the first canonical class
+			 * to be emitted.  Save the position from
+			 * which to restart.
+			 */
+			assert(u8c->ccc == STOPPER);
+			u8c->ccc = MINCCC - 1;
+			u8c->nccc = ccc;
+			u8c->sp = u8c->p;
+			u8c->ss = u8c->s;
+			u8c->slen = u8c->len;
+			if (!u8c->p)
+				u8c->len -= utf8clen(u8c->s);
+			u8c->s += utf8clen(u8c->s);
+		} else if (ccc != STOPPER) {
+			/* Not a stopper, and not the ccc we're emitting. */
+			if (!u8c->p)
+				u8c->len -= utf8clen(u8c->s);
+			u8c->s += utf8clen(u8c->s);
+		} else if (u8c->nccc != MAXCCC + 1) {
+			/* At a stopper, restart for next ccc. */
+			u8c->ccc = u8c->nccc;
+			u8c->nccc = MAXCCC + 1;
+			u8c->s = u8c->ss;
+			u8c->p = u8c->sp;
+			u8c->len = u8c->slen;
+		} else {
+			/* All done, proceed from here. */
+			u8c->ccc = STOPPER;
+			u8c->nccc = STOPPER;
+			u8c->sp = NULL;
+			u8c->ss = NULL;
+			u8c->slen = 0;
+		}
+	}
+}
+
+/* ------------------------------------------------------------------ */
+
+static int
+normalize_line(struct tree *tree)
+{
+	char *s;
+	char *t;
+	int c;
+	struct utf8cursor u8c;
+
+	/* First test: null-terminated string. */
+	s = buf2;
+	t = buf3;
+	if (utf8cursor(&u8c, tree, s))
+		return -1;
+	while ((c = utf8byte(&u8c)) > 0)
+		if (c != (unsigned char)*t++)
+			return -1;
+	if (c < 0)
+		return -1;
+	if (*t != 0)
+		return -1;
+
+	/* Second test: length-limited string. */
+	s = buf2;
+	/* Replace NUL with a value that will cause an error if seen. */
+	s[strlen(s) + 1] = -1;
+	t = buf3;
+	if (utf8cursor(&u8c, tree, s))
+		return -1;
+	while ((c = utf8byte(&u8c)) > 0)
+		if (c != (unsigned char)*t++)
+			return -1;
+	if (c < 0)
+		return -1;
+	if (*t != 0)
+		return -1;
+
+	return 0;
+}
+
+static void
+normalization_test(void)
+{
+	FILE *file;
+	unsigned int unichar;
+	struct unicode_data *data;
+	char *s;
+	char *t;
+	int ret;
+	int ignorables;
+	int tests = 0;
+	int failures = 0;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", test_name);
+	/* Step one, read data from file. */
+	file = fopen(test_name, "r");
+	if (!file)
+		open_fail(test_name, errno);
+
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%[^;];%*[^;];%*[^;];%*[^;];%[^;];",
+			     buf0, buf1);
+		if (ret != 2 || *line == '#')
+			continue;
+		s = buf0;
+		t = buf2;
+		while (*s) {
+			unichar = strtoul(s, &s, 16);
+			t += utf8key(unichar, t);
+		}
+		*t = '\0';
+
+		ignorables = 0;
+		s = buf1;
+		t = buf3;
+		while (*s) {
+			unichar = strtoul(s, &s, 16);
+			data = &unicode_data[unichar];
+			if (data->utf8nfkdi && !*data->utf8nfkdi)
+				ignorables = 1;
+			else
+				t += utf8key(unichar, t);
+		}
+		*t = '\0';
+
+		tests++;
+		if (normalize_line(nfkdi_tree) < 0) {
+			printf("\nline %s -> %s", buf0, buf1);
+			if (ignorables)
+				printf(" (ignorables removed)");
+			printf(" failure\n");
+			failures++;
+		}
+	}
+	fclose(file);
+	if (verbose > 0)
+		printf("Ran %d tests with %d failures\n", tests, failures);
+	if (failures)
+		file_fail(test_name);
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+write_file(void)
+{
+	FILE *file;
+	int i;
+	int j;
+	int t;
+	int gen;
+
+	if (verbose > 0)
+		printf("Writing %s\n", utf8_name);
+	file = fopen(utf8_name, "w");
+	if (!file)
+		open_fail(utf8_name, errno);
+
+	fprintf(file, "/* This file is generated code, do not edit. */\n");
+	fprintf(file, "#ifndef __INCLUDED_FROM_UTF8NORM_C__\n");
+	fprintf(file, "#error Only xfs_utf8.c may include this file.\n");
+	fprintf(file, "#endif\n");
+	fprintf(file, "\n");
+	fprintf(file, "static const unsigned int utf8vers = %#x;\n",
+		unicode_maxage);
+	fprintf(file, "\n");
+	fprintf(file, "static const unsigned int utf8agetab[] = {\n");
+	for (i = 0; i != ages_count; i++)
+		fprintf(file, "\t%#x%s\n", ages[i],
+			ages[i] == unicode_maxage ? "" : ",");
+	fprintf(file, "};\n");
+	fprintf(file, "\n");
+	fprintf(file, "static const struct utf8data utf8nfkdicfdata[] = {\n");
+	t = 0;
+	for (gen = 0; gen < ages_count; gen++) {
+		fprintf(file, "\t{ %#x, %d }%s\n",
+			ages[gen], trees[t].index,
+			ages[gen] == unicode_maxage ? "" : ",");
+		if (trees[t].maxage == ages[gen])
+			t += 2;
+	}
+	fprintf(file, "};\n");
+	fprintf(file, "\n");
+	fprintf(file, "static const struct utf8data utf8nfkdidata[] = {\n");
+	t = 1;
+	for (gen = 0; gen < ages_count; gen++) {
+		fprintf(file, "\t{ %#x, %d }%s\n",
+			ages[gen], trees[t].index,
+			ages[gen] == unicode_maxage ? "" : ",");
+		if (trees[t].maxage == ages[gen])
+			t += 2;
+	}
+	fprintf(file, "};\n");
+	fprintf(file, "\n");
+	fprintf(file, "static const unsigned char utf8data[%zd] = {\n",
+		utf8data_size);
+	t = 0;
+	for (i = 0; i != utf8data_size; i += 16) {
+		if (i == trees[t].index) {
+			fprintf(file, "\t/* %s_%x */\n",
+				trees[t].type, trees[t].maxage);
+			if (t < trees_count-1)
+				t++;
+		}
+		fprintf(file, "\t");
+		for (j = i; j != i + 16; j++)
+			fprintf(file, "0x%.2x%s", utf8data[j],
+				(j < utf8data_size -1 ? "," : ""));
+		fprintf(file, "\n");
+	}
+	fprintf(file, "};\n");
+	fclose(file);
+}
+
+/* ------------------------------------------------------------------ */
+
+int
+main(int argc, char *argv[])
+{
+	unsigned int unichar;
+	int opt;
+
+	argv0 = argv[0];
+
+	while ((opt = getopt(argc, argv, "a:c:d:f:hn:o:p:t:v")) != -1) {
+		switch (opt) {
+		case 'a':
+			age_name = optarg;
+			break;
+		case 'c':
+			ccc_name = optarg;
+			break;
+		case 'd':
+			data_name = optarg;
+			break;
+		case 'f':
+			fold_name = optarg;
+			break;
+		case 'n':
+			norm_name = optarg;
+			break;
+		case 'o':
+			utf8_name = optarg;
+			break;
+		case 'p':
+			prop_name = optarg;
+			break;
+		case 't':
+			test_name = optarg;
+			break;
+		case 'v':
+			verbose++;
+			break;
+		case 'h':
+			help();
+			exit(0);
+		default:
+			usage();
+		}
+	}
+
+	if (verbose > 1)
+		help();
+	for (unichar = 0; unichar != 0x110000; unichar++)
+		unicode_data[unichar].code = unichar;
+	age_init();
+	ccc_init();
+	nfkdi_init();
+	nfkdicf_init();
+	ignore_init();
+	corrections_init();
+	hangul_decompose();
+	nfkdi_decompose();
+	nfkdicf_decompose();
+	utf8_init();
+	trees_init();
+	trees_populate();
+	trees_reduce();
+	trees_verify();
+	/* Prevent "unused function" warning. */
+	(void)lookup(nfkdi_tree, " ");
+	if (verbose > 2)
+		tree_walk(nfkdi_tree);
+	if (verbose > 2)
+		tree_walk(nfkdicf_tree);
+	normalization_test();
+	write_file();
+
+	return 0;
+}
diff --git a/fs/xfs/utf8norm/utf8norm.c b/fs/xfs/utf8norm/utf8norm.c
new file mode 100644
index 0000000..995c4df
--- /dev/null
+++ b/fs/xfs/utf8norm/utf8norm.c
@@ -0,0 +1,649 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#include "utf8norm.h"
+
+struct utf8data {
+	unsigned int maxage;
+	unsigned int offset;
+};
+
+#define __INCLUDED_FROM_UTF8NORM_C__
+#include "utf8data.h"
+#undef __INCLUDED_FROM_UTF8NORM_C__
+
+const unsigned int utf8version(void)
+{
+	return utf8vers;
+}
+EXPORT_SYMBOL(utf8version);
+
+/*
+ * UTF-8 valid ranges.
+ *
+ * The UTF-8 encoding spreads the bits of a 32bit word over several
+ * bytes. This table gives the ranges that can be held and how they'd
+ * be represented.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000000 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000000 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * There is an additional requirement on UTF-8, in that only the
+ * shortest representation of a 32bit value is to be used.  A decoder
+ * must not decode sequences that do not satisfy this requirement.
+ * Thus the allowed ranges have a lower bound.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000080 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000800 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00010000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00200000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x04000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * Actual unicode characters are limited to the range 0x0 - 0x10FFFF,
+ * 17 planes of 65536 values.  This limits the sequences actually seen
+ * even more, to just the following.
+ *
+ *          0 -     0x7F: 0                   - 0x7F
+ *       0x80 -    0x7FF: 0xC2 0x80           - 0xDF 0xBF
+ *      0x800 -   0xFFFF: 0xE0 0xA0 0x80      - 0xEF 0xBF 0xBF
+ *    0x10000 - 0x10FFFF: 0xF0 0x90 0x80 0x80 - 0xF4 0x8F 0xBF 0xBF
+ *
+ * Within those ranges the surrogates 0xD800 - 0xDFFF are not allowed.
+ *
+ * Note that the longest sequence seen with valid usage is 4 bytes,
+ * the same a single UTF-32 character.  This makes the UTF-8
+ * representation of Unicode strictly smaller than UTF-32.
+ *
+ * The shortest sequence requirement was introduced by:
+ *    Corrigendum #1: UTF-8 Shortest Form
+ * It can be found here:
+ *    http://www.unicode.org/versions/corrigendum1.html
+ *
+ */
+
+/*
+ * Return the number of bytes used by the current UTF-8 sequence.
+ * Assumes the input points to the first byte of a valid UTF-8
+ * sequence.
+ */
+static inline int
+utf8clen(const char *s)
+{
+	unsigned char c = *s;
+	return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0);
+}
+
+/*
+ * utf8trie_t
+ *
+ * A compact binary tree, used to decode UTF-8 characters.
+ *
+ * Internal nodes are one byte for the node itself, and up to three
+ * bytes for an offset into the tree.  The first byte contains the
+ * following information:
+ *  NEXTBYTE  - flag        - advance to next byte if set
+ *  BITNUM    - 3 bit field - the bit number to tested
+ *  OFFLEN    - 2 bit field - number of bytes in the offset
+ * if offlen == 0 (non-branching node)
+ *  RIGHTPATH - 1 bit field - set if the following node is for the
+ *                            right-hand path (tested bit is set)
+ *  TRIENODE  - 1 bit field - set if the following node is an internal
+ *                            node, otherwise it is a leaf node
+ * if offlen != 0 (branching node)
+ *  LEFTNODE  - 1 bit field - set if the left-hand node is internal
+ *  RIGHTNODE - 1 bit field - set if the right-hand node is internal
+ *
+ * Due to the way utf8 works, there cannot be branching nodes with
+ * NEXTBYTE set, and moreover those nodes always have a righthand
+ * descendant.
+ */
+typedef const unsigned char utf8trie_t;
+#define BITNUM		0x07
+#define NEXTBYTE	0x08
+#define OFFLEN		0x30
+#define OFFLEN_SHIFT	4
+#define RIGHTPATH	0x40
+#define TRIENODE	0x80
+#define RIGHTNODE	0x40
+#define LEFTNODE	0x80
+
+/*
+ * utf8leaf_t
+ *
+ * The leaves of the trie are embedded in the trie, and so the same
+ * underlying datatype: unsigned char.
+ *
+ * leaf[0]: The unicode version, stored as a generation number that is
+ *          an index into utf8agetab[].  With this we can filter code
+ *          points based on the unicode version in which they were
+ *          defined.  The CCC of a non-defined code point is 0.
+ * leaf[1]: Canonical Combining Class. During normalization, we need
+ *          to do a stable sort into ascending order of all characters
+ *          with a non-zero CCC that occur between two characters with
+ *          a CCC of 0, or at the begin or end of a string.
+ *          The unicode standard guarantees that all CCC values are
+ *          between 0 and 254 inclusive, which leaves 255 available as
+ *          a special value.
+ *          Code points with CCC 0 are known as stoppers.
+ * leaf[2]: Decomposition. If leaf[1] == 255, then leaf[2] is the
+ *          start of a NUL-terminated string that is the decomposition
+ *          of the character.
+ *          The CCC of a decomposable character is the same as the CCC
+ *          of the first character of its decomposition.
+ *          Some characters decompose as the empty string: these are
+ *          characters with the Default_Ignorable_Code_Point property.
+ *          These do affect normalization, as they all have CCC 0.
+ *
+ * The decompositions in the trie have been fully expanded.
+ *
+ * Casefolding, if applicable, is also done using decompositions.
+ *
+ * The trie is constructed in such a way that leaves exist for all
+ * UTF-8 sequences that match the criteria from the "UTF-8 valid
+ * ranges" comment above, and only for those sequences.  Therefore a
+ * lookup in the trie can be used to validate the UTF-8 input.
+ */
+typedef const unsigned char utf8leaf_t;
+
+#define LEAF_GEN(LEAF)	((LEAF)[0])
+#define LEAF_CCC(LEAF)	((LEAF)[1])
+#define LEAF_STR(LEAF)	((const char*)((LEAF) + 2))
+
+#define MINCCC		(0)
+#define MAXCCC		(254)
+#define STOPPER		(0)
+#define	DECOMPOSE	(255)
+
+/*
+ * Use trie to scan s, touching at most len bytes.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * A non-NULL return guarantees that the UTF-8 sequence starting at s
+ * is well-formed and corresponds to a known unicode code point.  The
+ * shorthand for this will be "is valid UTF-8 unicode".
+ */
+static utf8leaf_t *
+utf8nlookup(utf8data_t data, const char *s, size_t len)
+{
+	utf8trie_t	*trie = utf8data + data->offset;
+	int		offlen;
+	int		offset;
+	int		mask;
+	int		node;
+
+	if (!data)
+		return NULL;
+	if (len == 0)
+		return NULL;
+	node = 1;
+	while (node) {
+		offlen = (*trie & OFFLEN) >> OFFLEN_SHIFT;
+		if (*trie & NEXTBYTE) {
+			if (--len == 0)
+				return NULL;
+			s++;
+		}
+		mask = 1 << (*trie & BITNUM);
+		if (*s & mask) {
+			/* Right leg */
+			if (offlen) {
+				/* Right node at offset of trie */
+				node = (*trie & RIGHTNODE);
+				offset = trie[offlen];
+				while (--offlen) {
+					offset <<= 8;
+					offset |= trie[offlen];
+				}
+				trie += offset;
+			} else if (*trie & RIGHTPATH) {
+				/* Right node after this node */
+				node = (*trie & TRIENODE);
+				trie++;
+			} else {
+				/* No right node. */
+				node = 0;
+				trie = NULL;
+			}
+		} else {
+			/* Left leg */
+			if (offlen) {
+				/* Left node after this node. */
+				node = (*trie & LEFTNODE);
+				trie += offlen + 1;
+			} else if (*trie & RIGHTPATH) {
+				/* No left node. */
+				node = 0;
+				trie = NULL;
+			} else {
+				/* Left node after this node */
+				node = (*trie & TRIENODE);
+				trie++;
+			}
+		}
+	}
+	return trie;
+}
+
+/*
+ * Use trie to scan s.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * Forwards to utf8nlookup().
+ */
+static utf8leaf_t *
+utf8lookup(utf8data_t data, const char *s)
+{
+	return utf8nlookup(data, s, (size_t)-1);
+}
+
+/*
+ * Maximum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if only non-assigned code points are used.
+ */
+int
+utf8agemax(utf8data_t data, const char *s)
+{
+	utf8leaf_t	*leaf;
+	int		age = 0;
+	int		leaf_age;
+
+	if (!data)
+		return -1;
+	while (*s) {
+		if (!(leaf = utf8lookup(data, s)))
+			return -1;
+		leaf_age = utf8agetab[LEAF_GEN(leaf)];
+		if (leaf_age <= data->maxage && leaf_age > age)
+			age = leaf_age;
+		s += utf8clen(s);
+	}
+	return age;
+}
+EXPORT_SYMBOL(utf8agemax);
+
+/*
+ * Minimum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if non-assigned code points are used.
+ */
+int
+utf8agemin(utf8data_t data, const char *s)
+{
+	utf8leaf_t	*leaf;
+	int		age;
+	int		leaf_age;
+
+	if (!data)
+		return -1;
+	age = data->maxage;
+	while (*s) {
+		if (!(leaf = utf8lookup(data, s)))
+			return -1;
+		leaf_age = utf8agetab[LEAF_GEN(leaf)];
+		if (leaf_age <= data->maxage && leaf_age < age)
+			age = leaf_age;
+		s += utf8clen(s);
+	}
+	return age;
+}
+EXPORT_SYMBOL(utf8agemin);
+
+/*
+ * Maximum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemax(utf8data_t data, const char *s, size_t len)
+{
+	utf8leaf_t	*leaf;
+	int		age = 0;
+	int		leaf_age;
+
+	if (!data)
+		return -1;
+        while (len && *s) {
+		if (!(leaf = utf8nlookup(data, s, len)))
+			return -1;
+		leaf_age = utf8agetab[LEAF_GEN(leaf)];
+		if (leaf_age <= data->maxage && leaf_age > age)
+			age = leaf_age;
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return age;
+}
+EXPORT_SYMBOL(utf8nagemax);
+
+/*
+ * Maximum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemin(utf8data_t data, const char *s, size_t len)
+{
+	utf8leaf_t	*leaf;
+	int		leaf_age;
+	int		age;
+
+	if (!data)
+		return -1;
+	age = data->maxage;
+	while (len && *s) {
+		if (!(leaf = utf8nlookup(data, s, len)))
+			return -1;
+		leaf_age = utf8agetab[LEAF_GEN(leaf)];
+		if (leaf_age <= data->maxage && leaf_age < age)
+			age = leaf_age;
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return age;
+}
+EXPORT_SYMBOL(utf8nagemin);
+
+/*
+ * Length of the normalization of s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ *
+ * A string of Default_Ignorable_Code_Point has length 0.
+ */
+ssize_t
+utf8len(utf8data_t data, const char *s)
+{
+	utf8leaf_t	*leaf;
+	size_t		ret = 0;
+
+	if (!data)
+		return -1;
+	while (*s) {
+		if (!(leaf = utf8lookup(data, s)))
+			return -1;
+		if (utf8agetab[LEAF_GEN(leaf)] > data->maxage)
+			ret += utf8clen(s);
+		else if (LEAF_CCC(leaf) == DECOMPOSE)
+			ret += strlen(LEAF_STR(leaf));
+		else
+			ret += utf8clen(s);
+		s += utf8clen(s);
+	}
+	return ret;
+}
+EXPORT_SYMBOL(utf8len);
+
+/*
+ * Length of the normalization of s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+ssize_t
+utf8nlen(utf8data_t data, const char *s, size_t len)
+{
+	utf8leaf_t	*leaf;
+	size_t		ret = 0;
+
+	if (!data)
+		return -1;
+	while (len && *s) {
+		if (!(leaf = utf8nlookup(data, s, len)))
+			return -1;
+		if (utf8agetab[LEAF_GEN(leaf)] > data->maxage)
+			ret += utf8clen(s);
+		else if (LEAF_CCC(leaf) == DECOMPOSE)
+			ret += strlen(LEAF_STR(leaf));
+		else
+			ret += utf8clen(s);
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return ret;
+}
+EXPORT_SYMBOL(utf8nlen);
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ *   u8c    : pointer to cursor.
+ *   data   : utf8data_t to use for normalization.
+ *   s      : string.
+ *   len    : length of s.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int
+utf8ncursor(
+	struct utf8cursor *u8c,
+	utf8data_t	data,
+	const char	*s,
+	size_t		len)
+{
+	if (!data)
+		return -1;
+	if (!s)
+		return -1;
+	u8c->data = data;
+	u8c->s = s;
+	u8c->p = NULL;
+	u8c->ss = NULL;
+	u8c->sp = NULL;
+	u8c->len = len;
+	u8c->slen = 0;
+	u8c->ccc = STOPPER;
+	u8c->nccc = STOPPER;
+	/* Check we didn't clobber the maximum length. */
+	if (u8c->len != len)
+		return -1;
+	/* The first byte of s may not be an utf8 continuation. */
+	if (len > 0 && (*s & 0xC0) == 0x80)
+		return -1;
+	return 0;
+}
+EXPORT_SYMBOL(utf8ncursor);
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ *   u8c    : pointer to cursor.
+ *   data   : utf8data_t to use for normalization.
+ *   s      : NUL-terminated string.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int
+utf8cursor(
+	struct utf8cursor *u8c,
+	utf8data_t	data,
+	const char	*s)
+{
+	return utf8ncursor(u8c, data, s, (unsigned int)-1);
+}
+EXPORT_SYMBOL(utf8cursor);
+
+/*
+ * Get one byte from the normalized form of the string described by u8c.
+ *
+ * Returns the byte cast to an unsigned char on succes, and -1 on failure.
+ *
+ * The cursor keeps track of the location in the string in u8c->s.
+ * When a character is decomposed, the current location is stored in
+ * u8c->p, and u8c->s is set to the start of the decomposition. Note
+ * that bytes from a decomposition do not count against u8c->len.
+ *
+ * Characters are emitted if they match the current CCC in u8c->ccc.
+ * Hitting end-of-string while u8c->ccc == STOPPER means we're done,
+ * and the function returns 0 in that case.
+ *
+ * Sorting by CCC is done by repeatedly scanning the string.  The
+ * values of u8c->s and u8c->p are stored in u8c->ss and u8c->sp at
+ * the start of the scan.  The first pass finds the lowest CCC to be
+ * emitted and stores it in u8c->nccc, the second pass emits the
+ * characters with this CCC and finds the next lowest CCC. This limits
+ * the number of passes to 1 + the number of different CCCs in the
+ * sequence being scanned.
+ *
+ * Therefore:
+ *  u8c->p  != NULL -> a decomposition is being scanned.
+ *  u8c->ss != NULL -> this is a repeating scan.
+ *  u8c->ccc == -1   -> this is the first scan of a repeating scan.
+ */
+int
+utf8byte(struct utf8cursor *u8c)
+{
+	utf8leaf_t *leaf;
+	int ccc;
+
+	for (;;) {
+		/* Check for the end of a decomposed character. */
+		if (u8c->p && *u8c->s == '\0') {
+			u8c->s = u8c->p;
+			u8c->p = NULL;
+		}
+
+		/* Check for end-of-string. */
+		if (!u8c->p && (u8c->len == 0 || *u8c->s == '\0')) {
+			/* There is no next byte. */
+			if (u8c->ccc == STOPPER)
+				return 0;
+			/* End-of-string during a scan counts as a stopper. */
+			ccc = STOPPER;
+			goto ccc_mismatch;
+		} else if ((*u8c->s & 0xC0) == 0x80) {
+			/* This is a continuation of the current character. */
+			if (!u8c->p)
+				u8c->len--;
+			return (unsigned char)*u8c->s++;
+		}
+
+		/* Look up the data for the current character. */
+		if (u8c->p)
+			leaf = utf8lookup(u8c->data, u8c->s);
+		else
+			leaf = utf8nlookup(u8c->data, u8c->s, u8c->len);
+
+		/* No leaf found implies that the input is a binary blob. */
+		if (!leaf)
+			return -1;
+
+		/* Characters that are too new have CCC 0. */
+		if (utf8agetab[LEAF_GEN(leaf)] > u8c->data->maxage) {
+			ccc = STOPPER;
+		} else if ((ccc = LEAF_CCC(leaf)) == DECOMPOSE) {
+			u8c->len -= utf8clen(u8c->s);
+			u8c->p = u8c->s + utf8clen(u8c->s);
+			u8c->s = LEAF_STR(leaf);
+			/* Empty decomposition implies CCC 0. */
+			if (*u8c->s == '\0') {
+				if (u8c->ccc == STOPPER)
+					continue;
+				ccc = STOPPER;
+				goto ccc_mismatch;
+			}
+			leaf = utf8lookup(u8c->data, u8c->s);
+			ccc = LEAF_CCC(leaf);
+		}
+
+		/*
+		 * If this is not a stopper, then see if it updates
+		 * the next canonical class to be emitted.
+		 */
+		if (ccc != STOPPER && u8c->ccc < ccc && ccc < u8c->nccc)
+			u8c->nccc = ccc;
+
+		/*
+		 * Return the current byte if this is the current
+		 * combining class.
+		 */
+		if (ccc == u8c->ccc) {
+			if (!u8c->p)
+				u8c->len--;
+			return (unsigned char)*u8c->s++;
+		}
+
+		/* Current combining class mismatch. */
+	ccc_mismatch:
+		if (u8c->nccc == STOPPER) {
+			/*
+			 * Scan forward for the first canonical class
+			 * to be emitted.  Save the position from
+			 * which to restart.
+			 */
+			u8c->ccc = MINCCC - 1;
+			u8c->nccc = ccc;
+			u8c->sp = u8c->p;
+			u8c->ss = u8c->s;
+			u8c->slen = u8c->len;
+			if (!u8c->p)
+				u8c->len -= utf8clen(u8c->s);
+			u8c->s += utf8clen(u8c->s);
+		} else if (ccc != STOPPER) {
+			/* Not a stopper, and not the ccc we're emitting. */
+			if (!u8c->p)
+				u8c->len -= utf8clen(u8c->s);
+			u8c->s += utf8clen(u8c->s);
+		} else if (u8c->nccc != MAXCCC + 1) {
+			/* At a stopper, restart for next ccc. */
+			u8c->ccc = u8c->nccc;
+			u8c->nccc = MAXCCC + 1;
+			u8c->s = u8c->ss;
+			u8c->p = u8c->sp;
+			u8c->len = u8c->slen;
+		} else {
+			/* All done, proceed from here. */
+			u8c->ccc = STOPPER;
+			u8c->nccc = STOPPER;
+			u8c->sp = NULL;
+			u8c->ss = NULL;
+			u8c->slen = 0;
+		}
+	}
+}
+EXPORT_SYMBOL(utf8byte);
+
+const struct utf8data *
+utf8nfkdi(unsigned int maxage)
+{
+	int i = sizeof(utf8nfkdidata)/sizeof(utf8nfkdidata[0]) - 1;
+
+	while (maxage < utf8nfkdidata[i].maxage)
+		i--;
+	if (maxage > utf8nfkdidata[i].maxage)
+		return NULL;
+	return &utf8nfkdidata[i];
+}
+EXPORT_SYMBOL(utf8nfkdi);
+
+const struct utf8data *
+utf8nfkdicf(unsigned int maxage)
+{
+	int i = sizeof(utf8nfkdicfdata)/sizeof(utf8nfkdicfdata[0]) - 1;
+
+	while (maxage < utf8nfkdicfdata[i].maxage)
+		i--;
+	if (maxage > utf8nfkdicfdata[i].maxage)
+		return NULL;
+	return &utf8nfkdicfdata[i];
+}
+EXPORT_SYMBOL(utf8nfkdicf);
+
+MODULE_AUTHOR("SGI");
+MODULE_DESCRIPTION("utf8 normalization");
+MODULE_LICENSE("GPL");
diff --git a/fs/xfs/utf8norm/utf8norm.h b/fs/xfs/utf8norm/utf8norm.h
new file mode 100644
index 0000000..44a9e53
--- /dev/null
+++ b/fs/xfs/utf8norm/utf8norm.h
@@ -0,0 +1,116 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#ifndef UTF8NORM_H
+#define UTF8NORM_H
+
+#include <linux/types.h>
+#include <linux/export.h>
+#include <linux/string.h>
+#include <linux/module.h>
+
+/* An opaque type used to determine the normalization in use. */
+typedef const struct utf8data *utf8data_t;
+
+/* Encoding a unicode version number as a single unsigned int. */
+#define UNICODE_MAJ_SHIFT		(16)
+#define UNICODE_MIN_SHIFT		(8)
+
+#define UNICODE_AGE(MAJ,MIN,REV)			\
+	(((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) |	\
+	 ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) |	\
+	 ((unsigned int)(REV)))
+
+/* Highest unicode version supported by the data tables. */
+extern const unsigned int utf8version(void);
+
+/*
+ * Look for the correct utf8data_t for a unicode version.
+ * Returns NULL if the version requested is too new.
+ *
+ * Two normalization forms are supported: nfkdi and nfkdicf.
+ *
+ * nfkdi:
+ *  - Apply unicode normalization form NFKD.
+ *  - Remove any Default_Ignorable_Code_Point.
+ *
+ * nfkdicf:
+ *  - Apply unicode normalization form NFKD.
+ *  - Remove any Default_Ignorable_Code_Point.
+ *  - Apply a full casefold (C + F).
+ */
+extern utf8data_t utf8nfkdi(unsigned int);
+extern utf8data_t utf8nfkdicf(unsigned int);
+
+/*
+ * Determine the maximum age of any unicode character in the string.
+ * Returns 0 if only unassigned code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern int utf8agemax(utf8data_t, const char *);
+extern int utf8nagemax(utf8data_t, const char *, size_t);
+
+/*
+ * Determine the minimum age of any unicode character in the string.
+ * Returns 0 if any unassigned code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern int utf8agemin(utf8data_t, const char *);
+extern int utf8nagemin(utf8data_t, const char *, size_t);
+
+/*
+ * Determine the length of the normalized from of the string,
+ * excluding any terminating NULL byte.
+ * Returns 0 if only ignorable code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern ssize_t utf8len(utf8data_t, const char *);
+extern ssize_t utf8nlen(utf8data_t, const char *, size_t);
+
+/*
+ * Cursor structure used by the normalizer.
+ */
+struct utf8cursor {
+	utf8data_t	data;
+	const char	*s;
+	const char	*p;
+	const char	*ss;
+	const char	*sp;
+	unsigned int	len;
+	unsigned int	slen;
+	short int	ccc;
+	short int	nccc;
+};
+
+/*
+ * Initialize a utf8cursor to normalize a string.
+ * Returns 0 on success.
+ * Returns -1 on failure.
+ */
+extern int utf8cursor(struct utf8cursor *, utf8data_t, const char *);
+extern int utf8ncursor(struct utf8cursor *, utf8data_t, const char *, size_t);
+
+/*
+ * Get the next byte in the normalization.
+ * Returns a value > 0 && < 256 on success.
+ * Returns 0 when the end of the normalization is reached.
+ * Returns -1 if the string being normalized is not valid UTF-8.
+ */
+extern int utf8byte(struct utf8cursor *);
+
+#endif /* UTF8NORM_H */
-- 
1.7.12.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 08/10] xfs: add xfs_nameops for utf8 and utf8+casefold.
  2014-09-18 19:56 [RFC v2] Unicode/UTF-8 support for XFS Ben Myers
@ 2014-09-18 20:16   ` Ben Myers
  2014-09-18 20:09 ` [PATCH 02/10] xfs: rename XFS_CMP_CASE to XFS_CMP_MATCH Ben Myers
                     ` (15 subsequent siblings)
  16 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-18 20:16 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: xfs, olaf, tinguely

From: Olaf Weber <olaf@sgi.com>

The xfs_utf8_nameops use the nfkdi normalization when comparing filenames,
and are installed if the utf8bit is set in the super block.

The xfs_utf8_ci_nameops use the nfkdicf normalization when comparing
filenames, and are installed if both the utf8bit and the borgbit are set
in the superblock.

Normalized filenames are not stored on disk. Normalization will fail if a
filename is not valid UTF-8, in which case the filename is treated as an
opaque blob.

Signed-off-by: Olaf Weber <olaf@sgi.com>

---
[v2: updated to use utf8norm.ko module;
     compiled conditionally on CONFIG_XFS_UTF8=y;
     utf8version is now a function;
     move xfs_utf8.[ch] into libxfs. --bpm]
---
 fs/xfs/Makefile          |   2 +
 fs/xfs/libxfs/xfs_dir2.c |  24 ++++-
 fs/xfs/libxfs/xfs_utf8.c | 242 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_utf8.h |  25 +++++
 fs/xfs/xfs_iops.c        |   2 +-
 5 files changed, 290 insertions(+), 5 deletions(-)
 create mode 100644 fs/xfs/libxfs/xfs_utf8.c
 create mode 100644 fs/xfs/libxfs/xfs_utf8.h

diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 6d000d3..5a4dfa0 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -114,6 +114,8 @@ xfs-$(CONFIG_XFS_QUOTA)		+= xfs_dquot.o \
 				   xfs_qm.o \
 				   xfs_quotaops.o
 
+xfs-$(CONFIG_XFS_UTF8)		+= libxfs/xfs_utf8.o
+
 # xfs_rtbitmap is shared with libxfs
 xfs-$(CONFIG_XFS_RT)		+= xfs_rtalloc.o
 
diff --git a/fs/xfs/libxfs/xfs_dir2.c b/fs/xfs/libxfs/xfs_dir2.c
index 84e5ca9..e28736b 100644
--- a/fs/xfs/libxfs/xfs_dir2.c
+++ b/fs/xfs/libxfs/xfs_dir2.c
@@ -35,6 +35,7 @@
 #include "xfs_error.h"
 #include "xfs_trace.h"
 #include "xfs_dinode.h"
+#include "xfs_utf8.h"
 
 struct xfs_name xfs_name_dotdot = { (unsigned char *)"..", 2, XFS_DIR3_FT_DIR };
 
@@ -156,10 +157,25 @@ xfs_da_mount(
 				(uint)sizeof(xfs_da_node_entry_t);
 	dageo->magicpct = (dageo->blksize * 37) / 100;
 
-	if (xfs_sb_version_hasasciici(&mp->m_sb))
-		mp->m_dirnameops = &xfs_ascii_ci_nameops;
-	else
-		mp->m_dirnameops = &xfs_default_nameops;
+	if (xfs_sb_version_hasutf8(&mp->m_sb)) {
+#ifdef CONFIG_XFS_UTF8
+		if (xfs_sb_version_hasasciici(&mp->m_sb))
+			mp->m_dirnameops = &xfs_utf8_ci_nameops;
+		else
+			mp->m_dirnameops = &xfs_utf8_nameops;
+#else
+		xfs_warn(mp,
+		"Recompile XFS with CONFIG_XFS_UTF8 to mount this filesystem");
+		kmem_free(mp->m_dir_geo);
+		kmem_free(mp->m_attr_geo);
+		return -ENOSYS;
+#endif	
+	} else {
+		if (xfs_sb_version_hasasciici(&mp->m_sb))
+			mp->m_dirnameops = &xfs_ascii_ci_nameops;
+		else
+			mp->m_dirnameops = &xfs_default_nameops;
+	}
 
 	return 0;
 }
diff --git a/fs/xfs/libxfs/xfs_utf8.c b/fs/xfs/libxfs/xfs_utf8.c
new file mode 100644
index 0000000..1e64c44
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_utf8.c
@@ -0,0 +1,242 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_types.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_inum.h"
+#include "xfs_trans.h"
+#include "xfs_trans_resv.h"
+#include "xfs_sb.h"
+#include "xfs_ag.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include "xfs_dir2.h"
+#include "xfs_mount.h"
+#include "xfs_da_btree.h"
+#include "xfs_format.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_alloc_btree.h"
+#include "xfs_dinode.h"
+#include "xfs_inode.h"
+#include "xfs_inode_item.h"
+#include "xfs_bmap.h"
+#include "xfs_error.h"
+#include "xfs_trace.h"
+#include "xfs_utf8.h"
+#include <utf8norm/utf8norm.h>
+
+/*
+ * xfs nameops using nfkdi
+ */
+
+static xfs_dahash_t
+xfs_utf8_hashname(
+	const unsigned char *name,
+	int len)
+{
+	utf8data_t	nfkdi;
+	struct utf8cursor u8c;
+	xfs_dahash_t	hash;
+	int		val;
+
+	nfkdi = utf8nfkdi(utf8version());
+	hash = 0;
+	if (utf8ncursor(&u8c, nfkdi, name, len) < 0)
+		goto blob;
+	while ((val = utf8byte(&u8c)) > 0)
+		hash = val ^ rol32(hash, 7);
+	/* In case of error treat the name as a binary blob. */
+	if (val == 0)
+		return hash;
+blob:
+	return xfs_da_hashname(name, len);
+}
+
+static int
+xfs_utf8_normhash(
+	struct xfs_da_args *args)
+{
+	utf8data_t	nfkdi;
+	struct utf8cursor u8c;
+	unsigned char	*norm;
+	ssize_t		normlen;
+	int		c;
+
+	nfkdi = utf8nfkdi(utf8version());
+	/* Failure to normalize is treated as a blob. */
+	if ((normlen = utf8nlen(nfkdi, args->name, args->namelen)) < 0)
+		goto blob;
+	if (utf8ncursor(&u8c, nfkdi, args->name, args->namelen) < 0)
+		goto blob;
+	if (!(norm = kmem_alloc(normlen + 1, KM_NOFS|KM_MAYFAIL)))
+		return -ENOMEM;
+	args->norm = norm;
+	args->normlen = normlen;
+	while ((c = utf8byte(&u8c)) > 0)
+		*norm++ = c;
+	if (c == 0) {
+		*norm = '\0';
+		args->hashval = xfs_da_hashname(args->norm, args->normlen);
+		return 0;
+	}
+	kmem_free(args->norm);
+blob:
+	args->norm = NULL;
+	args->normlen = -1;
+	args->hashval = xfs_da_hashname(args->name, args->namelen);
+	return 0;
+}
+
+static enum xfs_dacmp
+xfs_utf8_compname(
+	struct xfs_da_args *args,
+	const unsigned char *name,
+	int		len)
+{
+	utf8data_t	nfkdi;
+	struct utf8cursor u8c;
+	const unsigned char *norm;
+	int		c;
+
+	ASSERT(args->norm || args->normlen == -1);
+
+	/* Check for an exact match first. */
+	if (args->namelen == len && memcmp(args->name, name, len) == 0)
+		return XFS_CMP_EXACT;
+	/* xfs_utf8_normhash() set args->normlen to -1 for a blob */
+	if (args->normlen < 0)
+		return XFS_CMP_DIFFERENT;
+	nfkdi = utf8nfkdi(utf8version());
+	if (utf8ncursor(&u8c, nfkdi, name, len) < 0)
+		return XFS_CMP_DIFFERENT;
+	norm = args->norm;
+	while ((c = utf8byte(&u8c)) > 0)
+		if (c != *norm++)
+			return XFS_CMP_DIFFERENT;
+	if (c < 0 || *norm != '\0')
+		return XFS_CMP_DIFFERENT;
+	return XFS_CMP_MATCH;
+}
+
+struct xfs_nameops xfs_utf8_nameops = {
+	.hashname = xfs_utf8_hashname,
+	.normhash = xfs_utf8_normhash,
+	.compname = xfs_utf8_compname,
+};
+
+/*
+ * xfs nameops using nfkdicf
+ */
+
+static xfs_dahash_t
+xfs_utf8_ci_hashname(
+	const unsigned char *name,
+	int len)
+{
+	utf8data_t	nfkdicf;
+	struct utf8cursor u8c;
+	xfs_dahash_t	hash;
+	int		val;
+
+	nfkdicf = utf8nfkdicf(utf8version());
+	hash = 0;
+	if (utf8ncursor(&u8c, nfkdicf, name, len) < 0)
+		goto blob;
+	while ((val = utf8byte(&u8c)) > 0)
+		hash = val ^ rol32(hash, 7);
+	/* In case of error treat the name as a binary blob. */
+	if (val == 0)
+		return hash;
+blob:
+	return xfs_da_hashname(name, len);
+}
+
+static int
+xfs_utf8_ci_normhash(
+	struct xfs_da_args *args)
+{
+	utf8data_t	nfkdicf;
+	struct utf8cursor u8c;
+	unsigned char	*norm;
+	ssize_t		normlen;
+	int		c;
+
+	nfkdicf = utf8nfkdicf(utf8version());
+	/* Failure to normalize is treated as a blob. */
+	if ((normlen = utf8nlen(nfkdicf, args->name, args->namelen)) < 0)
+		goto blob;
+	if (utf8ncursor(&u8c, nfkdicf, args->name, args->namelen) < 0)
+		goto blob;
+	if (!(norm = kmem_alloc(normlen + 1, KM_NOFS|KM_MAYFAIL)))
+		return -ENOMEM;
+	args->norm = norm;
+	args->normlen = normlen;
+	while ((c = utf8byte(&u8c)) > 0)
+		*norm++ = c;
+	if (c == 0) {
+		*norm = '\0';
+		args->hashval = xfs_da_hashname(args->norm, args->normlen);
+		return 0;
+	}
+	kmem_free(args->norm);
+blob:
+	args->norm = NULL;
+	args->normlen = -1;
+	args->hashval = xfs_da_hashname(args->name, args->namelen);
+	return 0;
+}
+
+static enum xfs_dacmp
+xfs_utf8_ci_compname(
+	struct xfs_da_args *args,
+	const unsigned char *name,
+	int		len)
+{
+	utf8data_t	nfkdicf;
+	struct utf8cursor u8c;
+	const unsigned char *norm;
+	int		c;
+
+	ASSERT(args->norm || args->normlen == -1);
+
+	/* Check for an exact match first. */
+	if (args->namelen == len && memcmp(args->name, name, len) == 0)
+		return XFS_CMP_EXACT;
+	/* xfs_utf8_ci_normhash() set args->normlen to -1 for a blob */
+	if (args->normlen < 0)
+		return XFS_CMP_DIFFERENT;
+	nfkdicf = utf8nfkdicf(utf8version());
+	if (utf8ncursor(&u8c, nfkdicf, name, len) < 0)
+		return XFS_CMP_DIFFERENT;
+	norm = args->norm;
+	while ((c = utf8byte(&u8c)) > 0)
+		if (c != *norm++)
+			return XFS_CMP_DIFFERENT;
+	if (c < 0 || *norm != '\0')
+		return XFS_CMP_DIFFERENT;
+	return XFS_CMP_MATCH;
+}
+
+struct xfs_nameops xfs_utf8_ci_nameops = {
+	.hashname = xfs_utf8_ci_hashname,
+	.normhash = xfs_utf8_ci_normhash,
+	.compname = xfs_utf8_ci_compname,
+};
diff --git a/fs/xfs/libxfs/xfs_utf8.h b/fs/xfs/libxfs/xfs_utf8.h
new file mode 100644
index 0000000..97b6a91
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_utf8.h
@@ -0,0 +1,25 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#ifndef XFS_UTF8_H
+#define XFS_UTF8_H
+
+extern struct xfs_nameops xfs_utf8_nameops;
+extern struct xfs_nameops xfs_utf8_ci_nameops;
+
+#endif /* XFS_UTF8_H */
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index cea3d64..fbfb1bb 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1257,7 +1257,7 @@ xfs_setup_inode(
 		break;
 	case S_IFDIR:
 		lockdep_set_class(&ip->i_lock.mr_lock, &xfs_dir_ilock_class);
-		if (xfs_sb_version_hasasciici(&XFS_M(inode->i_sb)->m_sb))
+		if (xfs_sb_version_hasci(&XFS_M(inode->i_sb)->m_sb))
 			inode->i_op = &xfs_dir_ci_inode_operations;
 		else
 			inode->i_op = &xfs_dir_inode_operations;
-- 
1.7.12.4


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 08/10] xfs: add xfs_nameops for utf8 and utf8+casefold.
@ 2014-09-18 20:16   ` Ben Myers
  0 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-18 20:16 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: tinguely, olaf, xfs

From: Olaf Weber <olaf@sgi.com>

The xfs_utf8_nameops use the nfkdi normalization when comparing filenames,
and are installed if the utf8bit is set in the super block.

The xfs_utf8_ci_nameops use the nfkdicf normalization when comparing
filenames, and are installed if both the utf8bit and the borgbit are set
in the superblock.

Normalized filenames are not stored on disk. Normalization will fail if a
filename is not valid UTF-8, in which case the filename is treated as an
opaque blob.

Signed-off-by: Olaf Weber <olaf@sgi.com>

---
[v2: updated to use utf8norm.ko module;
     compiled conditionally on CONFIG_XFS_UTF8=y;
     utf8version is now a function;
     move xfs_utf8.[ch] into libxfs. --bpm]
---
 fs/xfs/Makefile          |   2 +
 fs/xfs/libxfs/xfs_dir2.c |  24 ++++-
 fs/xfs/libxfs/xfs_utf8.c | 242 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_utf8.h |  25 +++++
 fs/xfs/xfs_iops.c        |   2 +-
 5 files changed, 290 insertions(+), 5 deletions(-)
 create mode 100644 fs/xfs/libxfs/xfs_utf8.c
 create mode 100644 fs/xfs/libxfs/xfs_utf8.h

diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 6d000d3..5a4dfa0 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -114,6 +114,8 @@ xfs-$(CONFIG_XFS_QUOTA)		+= xfs_dquot.o \
 				   xfs_qm.o \
 				   xfs_quotaops.o
 
+xfs-$(CONFIG_XFS_UTF8)		+= libxfs/xfs_utf8.o
+
 # xfs_rtbitmap is shared with libxfs
 xfs-$(CONFIG_XFS_RT)		+= xfs_rtalloc.o
 
diff --git a/fs/xfs/libxfs/xfs_dir2.c b/fs/xfs/libxfs/xfs_dir2.c
index 84e5ca9..e28736b 100644
--- a/fs/xfs/libxfs/xfs_dir2.c
+++ b/fs/xfs/libxfs/xfs_dir2.c
@@ -35,6 +35,7 @@
 #include "xfs_error.h"
 #include "xfs_trace.h"
 #include "xfs_dinode.h"
+#include "xfs_utf8.h"
 
 struct xfs_name xfs_name_dotdot = { (unsigned char *)"..", 2, XFS_DIR3_FT_DIR };
 
@@ -156,10 +157,25 @@ xfs_da_mount(
 				(uint)sizeof(xfs_da_node_entry_t);
 	dageo->magicpct = (dageo->blksize * 37) / 100;
 
-	if (xfs_sb_version_hasasciici(&mp->m_sb))
-		mp->m_dirnameops = &xfs_ascii_ci_nameops;
-	else
-		mp->m_dirnameops = &xfs_default_nameops;
+	if (xfs_sb_version_hasutf8(&mp->m_sb)) {
+#ifdef CONFIG_XFS_UTF8
+		if (xfs_sb_version_hasasciici(&mp->m_sb))
+			mp->m_dirnameops = &xfs_utf8_ci_nameops;
+		else
+			mp->m_dirnameops = &xfs_utf8_nameops;
+#else
+		xfs_warn(mp,
+		"Recompile XFS with CONFIG_XFS_UTF8 to mount this filesystem");
+		kmem_free(mp->m_dir_geo);
+		kmem_free(mp->m_attr_geo);
+		return -ENOSYS;
+#endif	
+	} else {
+		if (xfs_sb_version_hasasciici(&mp->m_sb))
+			mp->m_dirnameops = &xfs_ascii_ci_nameops;
+		else
+			mp->m_dirnameops = &xfs_default_nameops;
+	}
 
 	return 0;
 }
diff --git a/fs/xfs/libxfs/xfs_utf8.c b/fs/xfs/libxfs/xfs_utf8.c
new file mode 100644
index 0000000..1e64c44
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_utf8.c
@@ -0,0 +1,242 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_types.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_inum.h"
+#include "xfs_trans.h"
+#include "xfs_trans_resv.h"
+#include "xfs_sb.h"
+#include "xfs_ag.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include "xfs_dir2.h"
+#include "xfs_mount.h"
+#include "xfs_da_btree.h"
+#include "xfs_format.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_alloc_btree.h"
+#include "xfs_dinode.h"
+#include "xfs_inode.h"
+#include "xfs_inode_item.h"
+#include "xfs_bmap.h"
+#include "xfs_error.h"
+#include "xfs_trace.h"
+#include "xfs_utf8.h"
+#include <utf8norm/utf8norm.h>
+
+/*
+ * xfs nameops using nfkdi
+ */
+
+static xfs_dahash_t
+xfs_utf8_hashname(
+	const unsigned char *name,
+	int len)
+{
+	utf8data_t	nfkdi;
+	struct utf8cursor u8c;
+	xfs_dahash_t	hash;
+	int		val;
+
+	nfkdi = utf8nfkdi(utf8version());
+	hash = 0;
+	if (utf8ncursor(&u8c, nfkdi, name, len) < 0)
+		goto blob;
+	while ((val = utf8byte(&u8c)) > 0)
+		hash = val ^ rol32(hash, 7);
+	/* In case of error treat the name as a binary blob. */
+	if (val == 0)
+		return hash;
+blob:
+	return xfs_da_hashname(name, len);
+}
+
+static int
+xfs_utf8_normhash(
+	struct xfs_da_args *args)
+{
+	utf8data_t	nfkdi;
+	struct utf8cursor u8c;
+	unsigned char	*norm;
+	ssize_t		normlen;
+	int		c;
+
+	nfkdi = utf8nfkdi(utf8version());
+	/* Failure to normalize is treated as a blob. */
+	if ((normlen = utf8nlen(nfkdi, args->name, args->namelen)) < 0)
+		goto blob;
+	if (utf8ncursor(&u8c, nfkdi, args->name, args->namelen) < 0)
+		goto blob;
+	if (!(norm = kmem_alloc(normlen + 1, KM_NOFS|KM_MAYFAIL)))
+		return -ENOMEM;
+	args->norm = norm;
+	args->normlen = normlen;
+	while ((c = utf8byte(&u8c)) > 0)
+		*norm++ = c;
+	if (c == 0) {
+		*norm = '\0';
+		args->hashval = xfs_da_hashname(args->norm, args->normlen);
+		return 0;
+	}
+	kmem_free(args->norm);
+blob:
+	args->norm = NULL;
+	args->normlen = -1;
+	args->hashval = xfs_da_hashname(args->name, args->namelen);
+	return 0;
+}
+
+static enum xfs_dacmp
+xfs_utf8_compname(
+	struct xfs_da_args *args,
+	const unsigned char *name,
+	int		len)
+{
+	utf8data_t	nfkdi;
+	struct utf8cursor u8c;
+	const unsigned char *norm;
+	int		c;
+
+	ASSERT(args->norm || args->normlen == -1);
+
+	/* Check for an exact match first. */
+	if (args->namelen == len && memcmp(args->name, name, len) == 0)
+		return XFS_CMP_EXACT;
+	/* xfs_utf8_normhash() set args->normlen to -1 for a blob */
+	if (args->normlen < 0)
+		return XFS_CMP_DIFFERENT;
+	nfkdi = utf8nfkdi(utf8version());
+	if (utf8ncursor(&u8c, nfkdi, name, len) < 0)
+		return XFS_CMP_DIFFERENT;
+	norm = args->norm;
+	while ((c = utf8byte(&u8c)) > 0)
+		if (c != *norm++)
+			return XFS_CMP_DIFFERENT;
+	if (c < 0 || *norm != '\0')
+		return XFS_CMP_DIFFERENT;
+	return XFS_CMP_MATCH;
+}
+
+struct xfs_nameops xfs_utf8_nameops = {
+	.hashname = xfs_utf8_hashname,
+	.normhash = xfs_utf8_normhash,
+	.compname = xfs_utf8_compname,
+};
+
+/*
+ * xfs nameops using nfkdicf
+ */
+
+static xfs_dahash_t
+xfs_utf8_ci_hashname(
+	const unsigned char *name,
+	int len)
+{
+	utf8data_t	nfkdicf;
+	struct utf8cursor u8c;
+	xfs_dahash_t	hash;
+	int		val;
+
+	nfkdicf = utf8nfkdicf(utf8version());
+	hash = 0;
+	if (utf8ncursor(&u8c, nfkdicf, name, len) < 0)
+		goto blob;
+	while ((val = utf8byte(&u8c)) > 0)
+		hash = val ^ rol32(hash, 7);
+	/* In case of error treat the name as a binary blob. */
+	if (val == 0)
+		return hash;
+blob:
+	return xfs_da_hashname(name, len);
+}
+
+static int
+xfs_utf8_ci_normhash(
+	struct xfs_da_args *args)
+{
+	utf8data_t	nfkdicf;
+	struct utf8cursor u8c;
+	unsigned char	*norm;
+	ssize_t		normlen;
+	int		c;
+
+	nfkdicf = utf8nfkdicf(utf8version());
+	/* Failure to normalize is treated as a blob. */
+	if ((normlen = utf8nlen(nfkdicf, args->name, args->namelen)) < 0)
+		goto blob;
+	if (utf8ncursor(&u8c, nfkdicf, args->name, args->namelen) < 0)
+		goto blob;
+	if (!(norm = kmem_alloc(normlen + 1, KM_NOFS|KM_MAYFAIL)))
+		return -ENOMEM;
+	args->norm = norm;
+	args->normlen = normlen;
+	while ((c = utf8byte(&u8c)) > 0)
+		*norm++ = c;
+	if (c == 0) {
+		*norm = '\0';
+		args->hashval = xfs_da_hashname(args->norm, args->normlen);
+		return 0;
+	}
+	kmem_free(args->norm);
+blob:
+	args->norm = NULL;
+	args->normlen = -1;
+	args->hashval = xfs_da_hashname(args->name, args->namelen);
+	return 0;
+}
+
+static enum xfs_dacmp
+xfs_utf8_ci_compname(
+	struct xfs_da_args *args,
+	const unsigned char *name,
+	int		len)
+{
+	utf8data_t	nfkdicf;
+	struct utf8cursor u8c;
+	const unsigned char *norm;
+	int		c;
+
+	ASSERT(args->norm || args->normlen == -1);
+
+	/* Check for an exact match first. */
+	if (args->namelen == len && memcmp(args->name, name, len) == 0)
+		return XFS_CMP_EXACT;
+	/* xfs_utf8_ci_normhash() set args->normlen to -1 for a blob */
+	if (args->normlen < 0)
+		return XFS_CMP_DIFFERENT;
+	nfkdicf = utf8nfkdicf(utf8version());
+	if (utf8ncursor(&u8c, nfkdicf, name, len) < 0)
+		return XFS_CMP_DIFFERENT;
+	norm = args->norm;
+	while ((c = utf8byte(&u8c)) > 0)
+		if (c != *norm++)
+			return XFS_CMP_DIFFERENT;
+	if (c < 0 || *norm != '\0')
+		return XFS_CMP_DIFFERENT;
+	return XFS_CMP_MATCH;
+}
+
+struct xfs_nameops xfs_utf8_ci_nameops = {
+	.hashname = xfs_utf8_ci_hashname,
+	.normhash = xfs_utf8_ci_normhash,
+	.compname = xfs_utf8_ci_compname,
+};
diff --git a/fs/xfs/libxfs/xfs_utf8.h b/fs/xfs/libxfs/xfs_utf8.h
new file mode 100644
index 0000000..97b6a91
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_utf8.h
@@ -0,0 +1,25 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#ifndef XFS_UTF8_H
+#define XFS_UTF8_H
+
+extern struct xfs_nameops xfs_utf8_nameops;
+extern struct xfs_nameops xfs_utf8_ci_nameops;
+
+#endif /* XFS_UTF8_H */
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index cea3d64..fbfb1bb 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1257,7 +1257,7 @@ xfs_setup_inode(
 		break;
 	case S_IFDIR:
 		lockdep_set_class(&ip->i_lock.mr_lock, &xfs_dir_ilock_class);
-		if (xfs_sb_version_hasasciici(&XFS_M(inode->i_sb)->m_sb))
+		if (xfs_sb_version_hasci(&XFS_M(inode->i_sb)->m_sb))
 			inode->i_op = &xfs_dir_ci_inode_operations;
 		else
 			inode->i_op = &xfs_dir_inode_operations;
-- 
1.7.12.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 09/10] xfs: apply utf-8 normalization rules to user extended attribute names
  2014-09-18 19:56 [RFC v2] Unicode/UTF-8 support for XFS Ben Myers
                   ` (8 preceding siblings ...)
  2014-09-18 20:16   ` Ben Myers
@ 2014-09-18 20:17 ` Ben Myers
  2014-09-18 20:18 ` [PATCH 10/10] xfs: implement demand load of utf8norm.ko Ben Myers
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-18 20:17 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: tinguely, olaf, xfs

From: Olaf Weber <olaf@sgi.com>

Apply the same rules for UTF-8 normalization to the names of user-defined
extended attributes. System attributes are excluded because they are not
user-visible in the first place, and the kernel is expected to know what
it is doing when naming them.

Signed-off-by: Olaf Weber <olaf@sgi.com>
---
 fs/xfs/libxfs/xfs_attr.c      | 56 ++++++++++++++++++++++++++++++++++++-------
 fs/xfs/libxfs/xfs_attr_leaf.c | 11 +++++++--
 fs/xfs/libxfs/xfs_utf8.c      |  7 ++++++
 fs/xfs/xfs_attr_list.c        | 11 ++++++++-
 4 files changed, 74 insertions(+), 11 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
index 353fb42..68e7ce3 100644
--- a/fs/xfs/libxfs/xfs_attr.c
+++ b/fs/xfs/libxfs/xfs_attr.c
@@ -83,12 +83,14 @@ xfs_attr_args_init(
 	const unsigned char	*name,
 	int			flags)
 {
+	struct xfs_mount	*mp = dp->i_mount;
+	int			error;
 
 	if (!name)
 		return -EINVAL;
 
 	memset(args, 0, sizeof(*args));
-	args->geo = dp->i_mount->m_attr_geo;
+	args->geo = mp->m_attr_geo;
 	args->whichfork = XFS_ATTR_FORK;
 	args->dp = dp;
 	args->flags = flags;
@@ -97,7 +99,11 @@ xfs_attr_args_init(
 	if (args->namelen >= MAXNAMELEN)
 		return -EFAULT;		/* match IRIX behaviour */
 
-	args->hashval = xfs_da_hashname(args->name, args->namelen);
+	if (!xfs_sb_version_hasutf8(&mp->m_sb))
+		args->hashval = xfs_da_hashname(args->name, args->namelen);
+	else if ((error = mp->m_dirnameops->normhash(args)) != 0)
+		return error;
+
 	return 0;
 }
 
@@ -154,6 +160,9 @@ xfs_attr_get(
 		error = xfs_attr_node_get(&args);
 	xfs_iunlock(ip, lock_mode);
 
+	if (args.norm)
+		kmem_free(args.norm);
+
 	*valuelenp = args.valuelen;
 	return error == -EEXIST ? 0 : error;
 }
@@ -216,8 +225,11 @@ xfs_attr_set(
 		return -EIO;
 
 	error = xfs_attr_args_init(&args, dp, name, flags);
-	if (error)
+	if (error) {
+		if (args.norm)
+			kmem_free(args.norm);
 		return error;
+	}
 
 	args.value = value;
 	args.valuelen = valuelen;
@@ -227,8 +239,11 @@ xfs_attr_set(
 	args.total = xfs_attr_calc_size(&args, &local);
 
 	error = xfs_qm_dqattach(dp, 0);
-	if (error)
+	if (error) {
+		if (args.norm)
+			kmem_free(args.norm);
 		return error;
+	}
 
 	/*
 	 * If the inode doesn't have an attribute fork, add one.
@@ -239,8 +254,11 @@ xfs_attr_set(
 			XFS_ATTR_SF_ENTSIZE_BYNAME(args.namelen, valuelen);
 
 		error = xfs_bmap_add_attrfork(dp, sf_size, rsvd);
-		if (error)
+		if (error) {
+			if (args.norm)
+				kmem_free(args.norm);
 			return error;
+		}
 	}
 
 	/*
@@ -270,6 +288,8 @@ xfs_attr_set(
 	error = xfs_trans_reserve(args.trans, &tres, args.total, 0);
 	if (error) {
 		xfs_trans_cancel(args.trans, 0);
+		if (args.norm)
+			kmem_free(args.norm);
 		return error;
 	}
 	xfs_ilock(dp, XFS_ILOCK_EXCL);
@@ -280,6 +300,8 @@ xfs_attr_set(
 	if (error) {
 		xfs_iunlock(dp, XFS_ILOCK_EXCL);
 		xfs_trans_cancel(args.trans, XFS_TRANS_RELEASE_LOG_RES);
+		if (args.norm)
+			kmem_free(args.norm);
 		return error;
 	}
 
@@ -327,6 +349,8 @@ xfs_attr_set(
 						 XFS_TRANS_RELEASE_LOG_RES);
 			xfs_iunlock(dp, XFS_ILOCK_EXCL);
 
+			if (args.norm)
+				kmem_free(args.norm);
 			return error ? error : err2;
 		}
 
@@ -388,7 +412,8 @@ xfs_attr_set(
 	xfs_trans_log_inode(args.trans, dp, XFS_ILOG_CORE);
 	error = xfs_trans_commit(args.trans, XFS_TRANS_RELEASE_LOG_RES);
 	xfs_iunlock(dp, XFS_ILOCK_EXCL);
-
+	if (args.norm)
+		kmem_free(args.norm);
 	return error;
 
 out:
@@ -397,6 +422,8 @@ out:
 			XFS_TRANS_RELEASE_LOG_RES|XFS_TRANS_ABORT);
 	}
 	xfs_iunlock(dp, XFS_ILOCK_EXCL);
+	if (args.norm)
+		kmem_free(args.norm);
 	return error;
 }
 
@@ -425,8 +452,11 @@ xfs_attr_remove(
 		return -ENOATTR;
 
 	error = xfs_attr_args_init(&args, dp, name, flags);
-	if (error)
+	if (error) {
+		if (args.norm)
+			kmem_free(args.norm);
 		return error;
+	}
 
 	args.firstblock = &firstblock;
 	args.flist = &flist;
@@ -439,8 +469,11 @@ xfs_attr_remove(
 	args.op_flags = XFS_DA_OP_OKNOENT;
 
 	error = xfs_qm_dqattach(dp, 0);
-	if (error)
+	if (error) {
+		if (args.norm)
+			kmem_free(args.norm);
 		return error;
+	}
 
 	/*
 	 * Start our first transaction of the day.
@@ -466,6 +499,8 @@ xfs_attr_remove(
 				  XFS_ATTRRM_SPACE_RES(mp), 0);
 	if (error) {
 		xfs_trans_cancel(args.trans, 0);
+		if (args.norm)
+			kmem_free(args.norm);
 		return error;
 	}
 
@@ -506,6 +541,8 @@ xfs_attr_remove(
 	xfs_trans_log_inode(args.trans, dp, XFS_ILOG_CORE);
 	error = xfs_trans_commit(args.trans, XFS_TRANS_RELEASE_LOG_RES);
 	xfs_iunlock(dp, XFS_ILOCK_EXCL);
+	if (args.norm)
+		kmem_free(args.norm);
 
 	return error;
 
@@ -515,6 +552,9 @@ out:
 			XFS_TRANS_RELEASE_LOG_RES|XFS_TRANS_ABORT);
 	}
 	xfs_iunlock(dp, XFS_ILOCK_EXCL);
+	if (args.norm)
+		kmem_free(args.norm);
+
 	return error;
 }
 
diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
index b1f73db..c991a88 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.c
+++ b/fs/xfs/libxfs/xfs_attr_leaf.c
@@ -661,6 +661,7 @@ int
 xfs_attr_shortform_to_leaf(xfs_da_args_t *args)
 {
 	xfs_inode_t *dp;
+	struct xfs_mount *mp;
 	xfs_attr_shortform_t *sf;
 	xfs_attr_sf_entry_t *sfe;
 	xfs_da_args_t nargs;
@@ -673,6 +674,7 @@ xfs_attr_shortform_to_leaf(xfs_da_args_t *args)
 	trace_xfs_attr_sf_to_leaf(args);
 
 	dp = args->dp;
+	mp = dp->i_mount;
 	ifp = dp->i_afp;
 	sf = (xfs_attr_shortform_t *)ifp->if_u1.if_data;
 	size = be16_to_cpu(sf->hdr.totsize);
@@ -726,13 +728,18 @@ xfs_attr_shortform_to_leaf(xfs_da_args_t *args)
 		nargs.namelen = sfe->namelen;
 		nargs.value = &sfe->nameval[nargs.namelen];
 		nargs.valuelen = sfe->valuelen;
-		nargs.hashval = xfs_da_hashname(sfe->nameval,
-						sfe->namelen);
 		nargs.flags = XFS_ATTR_NSP_ONDISK_TO_ARGS(sfe->flags);
+		if (!xfs_sb_version_hasutf8(&mp->m_sb))
+			nargs.hashval = xfs_da_hashname(sfe->nameval,
+							sfe->namelen);
+		else if ((error = mp->m_dirnameops->normhash(&nargs)) != 0)
+			goto out;
 		error = xfs_attr3_leaf_lookup_int(bp, &nargs); /* set a->index */
 		ASSERT(error == -ENOATTR);
 		error = xfs_attr3_leaf_add(bp, &nargs);
 		ASSERT(error != -ENOSPC);
+		if (nargs.norm)
+			kmem_free(nargs.norm);
 		if (error)
 			goto out;
 		sfe = XFS_ATTR_SF_NEXTENTRY(sfe);
diff --git a/fs/xfs/libxfs/xfs_utf8.c b/fs/xfs/libxfs/xfs_utf8.c
index 1e64c44..75f2b3a 100644
--- a/fs/xfs/libxfs/xfs_utf8.c
+++ b/fs/xfs/libxfs/xfs_utf8.c
@@ -38,6 +38,7 @@
 #include "xfs_inode.h"
 #include "xfs_inode_item.h"
 #include "xfs_bmap.h"
+#include "xfs_attr.h"
 #include "xfs_error.h"
 #include "xfs_trace.h"
 #include "xfs_utf8.h"
@@ -80,6 +81,9 @@ xfs_utf8_normhash(
 	ssize_t		normlen;
 	int		c;
 
+	/* Don't normalize system attribute names. */
+	if (args->flags & (ATTR_ROOT|ATTR_SECURE))
+		goto blob;
 	nfkdi = utf8nfkdi(utf8version());
 	/* Failure to normalize is treated as a blob. */
 	if ((normlen = utf8nlen(nfkdi, args->name, args->namelen)) < 0)
@@ -179,6 +183,9 @@ xfs_utf8_ci_normhash(
 	ssize_t		normlen;
 	int		c;
 
+	/* Don't normalize system attribute names. */
+	if (args->flags & (ATTR_ROOT|ATTR_SECURE))
+		goto blob;
 	nfkdicf = utf8nfkdicf(utf8version());
 	/* Failure to normalize is treated as a blob. */
 	if ((normlen = utf8nlen(nfkdicf, args->name, args->namelen)) < 0)
diff --git a/fs/xfs/xfs_attr_list.c b/fs/xfs/xfs_attr_list.c
index 62db83a..4075d54 100644
--- a/fs/xfs/xfs_attr_list.c
+++ b/fs/xfs/xfs_attr_list.c
@@ -76,12 +76,14 @@ xfs_attr_shortform_list(xfs_attr_list_context_t *context)
 	xfs_attr_shortform_t *sf;
 	xfs_attr_sf_entry_t *sfe;
 	xfs_inode_t *dp;
+	struct xfs_mount *mp;
 	int sbsize, nsbuf, count, i;
 	int error;
 
 	ASSERT(context != NULL);
 	dp = context->dp;
 	ASSERT(dp != NULL);
+	mp = dp->i_mount;
 	ASSERT(dp->i_afp != NULL);
 	sf = (xfs_attr_shortform_t *)dp->i_afp->if_u1.if_data;
 	ASSERT(sf != NULL);
@@ -154,7 +156,14 @@ xfs_attr_shortform_list(xfs_attr_list_context_t *context)
 		}
 
 		sbp->entno = i;
-		sbp->hash = xfs_da_hashname(sfe->nameval, sfe->namelen);
+		/* ATTR_ROOT and ATTR_SECURE are never normalized. */
+		if (!xfs_sb_version_hasutf8(&mp->m_sb) ||
+		    (sfe->flags & (ATTR_ROOT|ATTR_SECURE))) {
+			sbp->hash = xfs_da_hashname(sfe->nameval, sfe->namelen);
+		} else {
+			sbp->hash = mp->m_dirnameops->hashname(sfe->nameval,
+							       sfe->namelen);
+		}
 		sbp->name = sfe->nameval;
 		sbp->namelen = sfe->namelen;
 		/* These are bytes, and both on-disk, don't endian-flip */
-- 
1.7.12.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 10/10] xfs: implement demand load of utf8norm.ko
  2014-09-18 19:56 [RFC v2] Unicode/UTF-8 support for XFS Ben Myers
                   ` (9 preceding siblings ...)
  2014-09-18 20:17 ` [PATCH 09/10] xfs: apply utf-8 normalization rules to user extended attribute names Ben Myers
@ 2014-09-18 20:18 ` Ben Myers
  2014-09-18 20:31 ` [PATCH 00/13] xfsprogs: Unicode/UTF-8 support for XFS Ben Myers
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-18 20:18 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: tinguely, olaf, xfs

From: Ben Myers <bpm@sgi.com>

The utf8 normalization module is large and there is no need to have it
loaded unless an xfs filesystem with utf8 enabled has been mounted.
This loads utf8norm.ko at mount time for filesystems that need
it.  This is optional on CONFIG_XFS_UTF8_DEMAND_LOAD.

Signed-off-by: Ben Myers <bpm@sgi.com>
---
 fs/xfs/Kconfig           |  10 +++++
 fs/xfs/libxfs/xfs_dir2.c |   9 ++++
 fs/xfs/libxfs/xfs_utf8.c | 104 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_utf8.h |   5 +++
 fs/xfs/xfs_super.c       |   6 +++
 5 files changed, 134 insertions(+)

diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index a847857..69efd85c 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -103,3 +103,13 @@ config XFS_UTF8
 	  Say Y here to enable utf8 normalization support in XFS.  You
 	  will be able to mount and use filesystems created with the
 	  utf8 mkfs.xfs option.
+
+config XFS_UTF8_DEMAND_LOAD
+	bool "XFS loads UTF-8 normalization module on demand"
+	depends on XFS_FS
+	depends on XFS_UTF8
+	help
+	  Say Y here to enable on demand loading of the utf8
+	  normalization module.  This enables the large nomalization
+	  module to remain unloaded until a filesystem with utf8 support
+	  is mounted.
diff --git a/fs/xfs/libxfs/xfs_dir2.c b/fs/xfs/libxfs/xfs_dir2.c
index e28736b..436738d 100644
--- a/fs/xfs/libxfs/xfs_dir2.c
+++ b/fs/xfs/libxfs/xfs_dir2.c
@@ -35,7 +35,9 @@
 #include "xfs_error.h"
 #include "xfs_trace.h"
 #include "xfs_dinode.h"
+#ifdef CONFIG_XFS_UTF8
 #include "xfs_utf8.h"
+#endif
 
 struct xfs_name xfs_name_dotdot = { (unsigned char *)"..", 2, XFS_DIR3_FT_DIR };
 
@@ -159,6 +161,13 @@ xfs_da_mount(
 
 	if (xfs_sb_version_hasutf8(&mp->m_sb)) {
 #ifdef CONFIG_XFS_UTF8
+#ifdef CONFIG_XFS_UTF8_DEMAND_LOAD
+		if (xfs_init_utf8_module(mp)) {
+			kmem_free(mp->m_dir_geo);
+			kmem_free(mp->m_attr_geo);
+			return -ENOSYS;
+		}
+#endif /* CONFIG_XFS_UTF8_DEMAND_LOAD */
 		if (xfs_sb_version_hasasciici(&mp->m_sb))
 			mp->m_dirnameops = &xfs_utf8_ci_nameops;
 		else
diff --git a/fs/xfs/libxfs/xfs_utf8.c b/fs/xfs/libxfs/xfs_utf8.c
index 75f2b3a..71978c8 100644
--- a/fs/xfs/libxfs/xfs_utf8.c
+++ b/fs/xfs/libxfs/xfs_utf8.c
@@ -44,6 +44,110 @@
 #include "xfs_utf8.h"
 #include <utf8norm/utf8norm.h>
 
+#ifdef CONFIG_XFS_UTF8_DEMAND_LOAD
+#include <linux/kmod.h>
+
+static DEFINE_SPINLOCK(utf8norm_lock);
+static int utf8norm_initialized;
+
+static const unsigned int (*utf8version_func)(void);
+static utf8data_t (*utf8nfkdi_func)(unsigned int);
+static utf8data_t (*utf8nfkdicf_func)(unsigned int);
+static ssize_t (*utf8nlen_func)(utf8data_t, const char *, size_t);
+static int (*utf8ncursor_func)(struct utf8cursor *, utf8data_t,
+		const char *, size_t);
+static int (*utf8byte_func)(struct utf8cursor *);
+
+static void
+xfs_put_utf8_module_locked(void)
+{
+	if (utf8version_func)
+		symbol_put(utf8version);
+
+	if (utf8nfkdi_func)
+		symbol_put(utf8nfkdi);
+
+	if (utf8nfkdicf_func)
+		symbol_put(utf8nfkdicf);
+
+	if (utf8nlen_func)
+		symbol_put(utf8nlen);
+
+	if (utf8ncursor_func)
+		symbol_put(utf8ncursor);
+
+	if (utf8byte_func)
+		symbol_put(utf8byte);
+}
+
+void
+xfs_put_utf8_module(void)
+{
+	spin_lock(&utf8norm_lock);
+	if (!utf8norm_initialized) {
+		spin_unlock(&utf8norm_lock);
+		return;
+	}
+	xfs_put_utf8_module_locked();
+	spin_unlock(&utf8norm_lock);
+}
+
+int
+xfs_init_utf8_module(struct xfs_mount	*mp)
+{
+	request_module("utf8norm");
+
+	spin_lock(&utf8norm_lock);
+	if (utf8norm_initialized) {
+		spin_unlock(&utf8norm_lock);
+		return 0;
+	}
+
+	utf8version_func = symbol_get(utf8version);
+	if (!utf8version_func)
+		goto error;
+
+	utf8nfkdi_func = symbol_get(utf8nfkdi);
+	if (!utf8nfkdi_func)
+		goto error;
+
+	utf8nfkdicf_func = symbol_get(utf8nfkdicf);
+	if (!utf8nfkdicf_func)
+		goto error;
+
+	utf8nlen_func = symbol_get(utf8nlen);
+	if (!utf8nlen_func) 
+		goto error;
+
+	utf8ncursor_func = symbol_get(utf8ncursor);
+	if (!utf8ncursor_func)
+		goto error;
+
+	utf8byte_func = symbol_get(utf8byte);
+	if (!utf8byte_func)
+		goto error;
+
+	utf8norm_initialized = 1;	
+	spin_unlock(&utf8norm_lock);
+	return 0;
+error:
+	xfs_put_utf8_module_locked();
+	spin_unlock(&utf8norm_lock);
+	xfs_warn(mp,
+		"Failed to load utf8norm.ko which is required to "
+		"mount a filesystem with utf8 support.");
+	return -ENOSYS;
+}
+
+#define utf8version (*utf8version_func)
+#define utf8nfkdi (*utf8nfkdi_func)
+#define utf8nfkdicf (*utf8nfkdicf_func)
+#define utf8nlen (*utf8nlen_func)
+#define utf8ncursor (*utf8ncursor_func)
+#define utf8byte (*utf8byte_func)
+
+#endif /* CONFIG_XFS_UTF8_DEMAND_LOAD */
+
 /*
  * xfs nameops using nfkdi
  */
diff --git a/fs/xfs/libxfs/xfs_utf8.h b/fs/xfs/libxfs/xfs_utf8.h
index 97b6a91..9d1125a 100644
--- a/fs/xfs/libxfs/xfs_utf8.h
+++ b/fs/xfs/libxfs/xfs_utf8.h
@@ -22,4 +22,9 @@
 extern struct xfs_nameops xfs_utf8_nameops;
 extern struct xfs_nameops xfs_utf8_ci_nameops;
 
+#ifdef CONFIG_XFS_UTF8_DEMAND_LOAD
+extern int xfs_init_utf8_module(struct xfs_mount *);
+extern void xfs_put_utf8_module(void);
+#endif
+
 #endif /* XFS_UTF8_H */
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index b194652..050a949 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -47,6 +47,9 @@
 #include "xfs_dinode.h"
 #include "xfs_filestream.h"
 #include "xfs_quota.h"
+#ifdef CONFIG_XFS_UTF8_DEMAND_LOAD
+#include "xfs_utf8.h"
+#endif
 
 #include <linux/namei.h>
 #include <linux/init.h>
@@ -1809,6 +1812,9 @@ exit_xfs_fs(void)
 	xfs_mru_cache_uninit();
 	xfs_destroy_workqueues();
 	xfs_destroy_zones();
+#ifdef CONFIG_XFS_UTF8_DEMAND_LOAD
+	xfs_put_utf8_module();
+#endif
 }
 
 module_init(init_xfs_fs);
-- 
1.7.12.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 00/13] xfsprogs: Unicode/UTF-8 support for XFS
  2014-09-18 19:56 [RFC v2] Unicode/UTF-8 support for XFS Ben Myers
                   ` (10 preceding siblings ...)
  2014-09-18 20:18 ` [PATCH 10/10] xfs: implement demand load of utf8norm.ko Ben Myers
@ 2014-09-18 20:31 ` Ben Myers
  2014-09-18 20:33   ` [PATCH 01/13] libxfs: return the first match during case-insensitive lookup Ben Myers
                     ` (14 more replies)
  2014-09-18 21:10 ` [RFC v2] Unicode/UTF-8 support for XFS Ben Myers
                   ` (4 subsequent siblings)
  16 siblings, 15 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-18 20:31 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: tinguely, olaf, xfs

Hi,

Here is the xfsprogs portion of the Unicode/UTF-8 support.  A number of
the patches in libxfs correspond with kernel patches previously posted,
and then there are patches to add support to mkfs, xfs_info, xfs_repair,
and a test.

(Note that the Unicode character database files have also been removed
here due to their size.)

Thanks,
Ben

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH 01/13] libxfs: return the first match during case-insensitive lookup
  2014-09-18 20:31 ` [PATCH 00/13] xfsprogs: Unicode/UTF-8 support for XFS Ben Myers
@ 2014-09-18 20:33   ` Ben Myers
  2014-09-18 20:33   ` [PATCH 02/13] libxfs: rename XFS_CMP_CASE to XFS_CMP_MATCH Ben Myers
                     ` (13 subsequent siblings)
  14 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-18 20:33 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: tinguely, olaf, xfs

From: Olaf Weber <olaf@sgi.com>

Change the XFS case-insensitive lookup code to return the first match found,
even if it is not an exact match. Whether a filesystem uses case-insensitive
lookups is determined by a superblock bit set during filesystem creation.
This means that normal use cannot create two files that both match the same
filename.

Signed-off-by: Olaf Weber <olaf@sgi.com>
---
 libxfs/xfs_dir2_block.c | 17 ++++-------
 libxfs/xfs_dir2_leaf.c  | 38 ++++-------------------
 libxfs/xfs_dir2_node.c  | 80 ++++++++++++++++++-------------------------------
 libxfs/xfs_dir2_sf.c    |  8 ++---
 4 files changed, 44 insertions(+), 99 deletions(-)

diff --git a/libxfs/xfs_dir2_block.c b/libxfs/xfs_dir2_block.c
index cede01f..2880431 100644
--- a/libxfs/xfs_dir2_block.c
+++ b/libxfs/xfs_dir2_block.c
@@ -705,28 +705,21 @@ xfs_dir2_block_lookup_int(
 		dep = (xfs_dir2_data_entry_t *)
 			((char *)hdr + xfs_dir2_dataptr_to_off(mp, addr));
 		/*
-		 * Compare name and if it's an exact match, return the index
-		 * and buffer. If it's the first case-insensitive match, store
-		 * the index and buffer and continue looking for an exact match.
+		 * Compare name and if it's a match, return the
+		 * index and buffer.
 		 */
 		cmp = mp->m_dirnameops->compname(args, dep->name, dep->namelen);
-		if (cmp != XFS_CMP_DIFFERENT && cmp != args->cmpresult) {
+		if (cmp != XFS_CMP_DIFFERENT) {
 			args->cmpresult = cmp;
 			*bpp = bp;
 			*entno = mid;
-			if (cmp == XFS_CMP_EXACT)
-				return 0;
+			return 0;
 		}
 	} while (++mid < be32_to_cpu(btp->count) &&
 			be32_to_cpu(blp[mid].hashval) == hash);
 
 	ASSERT(args->op_flags & XFS_DA_OP_OKNOENT);
-	/*
-	 * Here, we can only be doing a lookup (not a rename or replace).
-	 * If a case-insensitive match was found earlier, return success.
-	 */
-	if (args->cmpresult == XFS_CMP_CASE)
-		return 0;
+	ASSERT(args->cmpresult == XFS_CMP_DIFFERENT);
 	/*
 	 * No match, release the buffer and return ENOENT.
 	 */
diff --git a/libxfs/xfs_dir2_leaf.c b/libxfs/xfs_dir2_leaf.c
index 8e0cbc9..b1901d3 100644
--- a/libxfs/xfs_dir2_leaf.c
+++ b/libxfs/xfs_dir2_leaf.c
@@ -1246,7 +1246,6 @@ xfs_dir2_leaf_lookup_int(
 	xfs_mount_t		*mp;		/* filesystem mount point */
 	xfs_dir2_db_t		newdb;		/* new data block number */
 	xfs_trans_t		*tp;		/* transaction pointer */
-	xfs_dir2_db_t		cidb = -1;	/* case match data block no. */
 	enum xfs_dacmp		cmp;		/* name compare result */
 	struct xfs_dir2_leaf_entry *ents;
 	struct xfs_dir3_icleaf_hdr leafhdr;
@@ -1307,47 +1306,22 @@ xfs_dir2_leaf_lookup_int(
 		dep = (xfs_dir2_data_entry_t *)((char *)dbp->b_addr +
 			xfs_dir2_dataptr_to_off(mp, be32_to_cpu(lep->address)));
 		/*
-		 * Compare name and if it's an exact match, return the index
-		 * and buffer. If it's the first case-insensitive match, store
-		 * the index and buffer and continue looking for an exact match.
+		 * Compare name and if it's a match, return the index
+		 * and buffer.
 		 */
 		cmp = mp->m_dirnameops->compname(args, dep->name, dep->namelen);
-		if (cmp != XFS_CMP_DIFFERENT && cmp != args->cmpresult) {
+		if (cmp != XFS_CMP_DIFFERENT) {
 			args->cmpresult = cmp;
 			*indexp = index;
-			/* case exact match: return the current buffer. */
-			if (cmp == XFS_CMP_EXACT) {
-				*dbpp = dbp;
-				return 0;
-			}
-			cidb = curdb;
+			*dbpp = dbp;
+			return 0;
 		}
 	}
 	ASSERT(args->op_flags & XFS_DA_OP_OKNOENT);
-	/*
-	 * Here, we can only be doing a lookup (not a rename or remove).
-	 * If a case-insensitive match was found earlier, re-read the
-	 * appropriate data block if required and return it.
-	 */
-	if (args->cmpresult == XFS_CMP_CASE) {
-		ASSERT(cidb != -1);
-		if (cidb != curdb) {
-			xfs_trans_brelse(tp, dbp);
-			error = xfs_dir3_data_read(tp, dp,
-						   xfs_dir2_db_to_da(mp, cidb),
-						   -1, &dbp);
-			if (error) {
-				xfs_trans_brelse(tp, lbp);
-				return error;
-			}
-		}
-		*dbpp = dbp;
-		return 0;
-	}
+	ASSERT(args->cmpresult == XFS_CMP_DIFFERENT);
 	/*
 	 * No match found, return ENOENT.
 	 */
-	ASSERT(cidb == -1);
 	if (dbp)
 		xfs_trans_brelse(tp, dbp);
 	xfs_trans_brelse(tp, lbp);
diff --git a/libxfs/xfs_dir2_node.c b/libxfs/xfs_dir2_node.c
index 3737e4e..fb27506 100644
--- a/libxfs/xfs_dir2_node.c
+++ b/libxfs/xfs_dir2_node.c
@@ -702,6 +702,7 @@ xfs_dir2_leafn_lookup_for_entry(
 	xfs_dir2_db_t		curdb = -1;	/* current data block number */
 	xfs_dir2_data_entry_t	*dep;		/* data block entry */
 	xfs_inode_t		*dp;		/* incore directory inode */
+	int			di = -1;	/* data entry index */
 	int			error;		/* error return value */
 	int			index;		/* leaf entry index */
 	xfs_dir2_leaf_t		*leaf;		/* leaf structure */
@@ -733,6 +734,7 @@ xfs_dir2_leafn_lookup_for_entry(
 	if (state->extravalid) {
 		curbp = state->extrablk.bp;
 		curdb = state->extrablk.blkno;
+		di = state->extrablk.index;
 	}
 	/*
 	 * Loop over leaf entries with the right hash value.
@@ -757,27 +759,20 @@ xfs_dir2_leafn_lookup_for_entry(
 		 */
 		if (newdb != curdb) {
 			/*
-			 * If we had a block before that we aren't saving
-			 * for a CI name, drop it
+			 * If we had a block, drop it
 			 */
-			if (curbp && (args->cmpresult == XFS_CMP_DIFFERENT ||
-						curdb != state->extrablk.blkno))
+			if (curbp) {
 				xfs_trans_brelse(tp, curbp);
+				di = -1;
+			}
 			/*
-			 * If needing the block that is saved with a CI match,
-			 * use it otherwise read in the new data block.
+			 * Read in the new data block.
 			 */
-			if (args->cmpresult != XFS_CMP_DIFFERENT &&
-					newdb == state->extrablk.blkno) {
-				ASSERT(state->extravalid);
-				curbp = state->extrablk.bp;
-			} else {
-				error = xfs_dir3_data_read(tp, dp,
-						xfs_dir2_db_to_da(mp, newdb),
-						-1, &curbp);
-				if (error)
-					return error;
-			}
+			error = xfs_dir3_data_read(tp, dp,
+					xfs_dir2_db_to_da(mp, newdb),
+					-1, &curbp);
+			if (error)
+				return error;
 			xfs_dir3_data_check(dp, curbp);
 			curdb = newdb;
 		}
@@ -787,53 +782,36 @@ xfs_dir2_leafn_lookup_for_entry(
 		dep = (xfs_dir2_data_entry_t *)((char *)curbp->b_addr +
 			xfs_dir2_dataptr_to_off(mp, be32_to_cpu(lep->address)));
 		/*
-		 * Compare the entry and if it's an exact match, return
-		 * EEXIST immediately. If it's the first case-insensitive
-		 * match, store the block & inode number and continue looking.
+		 * Compare the entry and if it's a match, return
+		 * EEXIST immediately.
 		 */
 		cmp = mp->m_dirnameops->compname(args, dep->name, dep->namelen);
-		if (cmp != XFS_CMP_DIFFERENT && cmp != args->cmpresult) {
-			/* If there is a CI match block, drop it */
-			if (args->cmpresult != XFS_CMP_DIFFERENT &&
-						curdb != state->extrablk.blkno)
-				xfs_trans_brelse(tp, state->extrablk.bp);
+		if (cmp != XFS_CMP_DIFFERENT) {
 			args->cmpresult = cmp;
 			args->inumber = be64_to_cpu(dep->inumber);
 			args->filetype = xfs_dir3_dirent_get_ftype(mp, dep);
-			*indexp = index;
-			state->extravalid = 1;
-			state->extrablk.bp = curbp;
-			state->extrablk.blkno = curdb;
-			state->extrablk.index = (int)((char *)dep -
-							(char *)curbp->b_addr);
-			state->extrablk.magic = XFS_DIR2_DATA_MAGIC;
-			curbp->b_ops = &xfs_dir3_data_buf_ops;
-			xfs_trans_buf_set_type(tp, curbp, XFS_BLFT_DIR_DATA_BUF);
-			if (cmp == XFS_CMP_EXACT)
-				return XFS_ERROR(EEXIST);
+			error = EEXIST;
+			goto out;
 		}
 	}
+	/* Didn't find a match */
+	error = ENOENT;
 	ASSERT(index == leafhdr.count || (args->op_flags & XFS_DA_OP_OKNOENT));
+out:
 	if (curbp) {
-		if (args->cmpresult == XFS_CMP_DIFFERENT) {
-			/* Giving back last used data block. */
-			state->extravalid = 1;
-			state->extrablk.bp = curbp;
-			state->extrablk.index = -1;
-			state->extrablk.blkno = curdb;
-			state->extrablk.magic = XFS_DIR2_DATA_MAGIC;
-			curbp->b_ops = &xfs_dir3_data_buf_ops;
-			xfs_trans_buf_set_type(tp, curbp, XFS_BLFT_DIR_DATA_BUF);
-		} else {
-			/* If the curbp is not the CI match block, drop it */
-			if (state->extrablk.bp != curbp)
-				xfs_trans_brelse(tp, curbp);
-		}
+		/* Giving back last used data block. */
+		state->extravalid = 1;
+		state->extrablk.bp = curbp;
+		state->extrablk.index = di;
+		state->extrablk.blkno = curdb;
+		state->extrablk.magic = XFS_DIR2_DATA_MAGIC;
+		curbp->b_ops = &xfs_dir3_data_buf_ops;
+		xfs_trans_buf_set_type(tp, curbp, XFS_BLFT_DIR_DATA_BUF);
 	} else {
 		state->extravalid = 0;
 	}
 	*indexp = index;
-	return XFS_ERROR(ENOENT);
+	return XFS_ERROR(error);
 }
 
 /*
diff --git a/libxfs/xfs_dir2_sf.c b/libxfs/xfs_dir2_sf.c
index 7580333..7b01d43 100644
--- a/libxfs/xfs_dir2_sf.c
+++ b/libxfs/xfs_dir2_sf.c
@@ -833,13 +833,12 @@ xfs_dir2_sf_lookup(
 	for (i = 0, sfep = xfs_dir2_sf_firstentry(sfp); i < sfp->count;
 	     i++, sfep = xfs_dir3_sf_nextentry(dp->i_mount, sfp, sfep)) {
 		/*
-		 * Compare name and if it's an exact match, return the inode
-		 * number. If it's the first case-insensitive match, store the
-		 * inode number and continue looking for an exact match.
+		 * Compare name and if it's a match, return the inode
+		 * number.
 		 */
 		cmp = dp->i_mount->m_dirnameops->compname(args, sfep->name,
 								sfep->namelen);
-		if (cmp != XFS_CMP_DIFFERENT && cmp != args->cmpresult) {
+		if (cmp != XFS_CMP_DIFFERENT) {
 			args->cmpresult = cmp;
 			args->inumber = xfs_dir3_sfe_get_ino(dp->i_mount,
 							     sfp, sfep);
@@ -848,6 +847,7 @@ xfs_dir2_sf_lookup(
 			if (cmp == XFS_CMP_EXACT)
 				return XFS_ERROR(EEXIST);
 			ci_sfep = sfep;
+			break;
 		}
 	}
 	ASSERT(args->op_flags & XFS_DA_OP_OKNOENT);
-- 
1.7.12.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 02/13] libxfs: rename XFS_CMP_CASE to XFS_CMP_MATCH
  2014-09-18 20:31 ` [PATCH 00/13] xfsprogs: Unicode/UTF-8 support for XFS Ben Myers
  2014-09-18 20:33   ` [PATCH 01/13] libxfs: return the first match during case-insensitive lookup Ben Myers
@ 2014-09-18 20:33   ` Ben Myers
  2014-09-18 20:34   ` [PATCH 03/13] libxfs: add xfs_nameops.normhash Ben Myers
                     ` (12 subsequent siblings)
  14 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-18 20:33 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: tinguely, olaf, xfs

From: Olaf Weber <olaf@sgi.com>

Rename XFS_CMP_CASE to XFS_CMP_MATCH. With unicode filenames and
normalization, different strings will match on other criteria than
case insensitivity.

Signed-off-by: Olaf Weber <olaf@sgi.com>
---
 include/xfs_da_btree.h | 2 +-
 libxfs/xfs_dir2.c      | 9 ++++++---
 libxfs/xfs_dir2_node.c | 2 +-
 3 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/include/xfs_da_btree.h b/include/xfs_da_btree.h
index e492dca..3d9f9dd 100644
--- a/include/xfs_da_btree.h
+++ b/include/xfs_da_btree.h
@@ -34,7 +34,7 @@ struct zone;
 enum xfs_dacmp {
 	XFS_CMP_DIFFERENT,	/* names are completely different */
 	XFS_CMP_EXACT,		/* names are exactly the same */
-	XFS_CMP_CASE		/* names are same but differ in case */
+	XFS_CMP_MATCH		/* names are same but differ in encoding */
 };
 
 /*
diff --git a/libxfs/xfs_dir2.c b/libxfs/xfs_dir2.c
index 4c8c836..57e98a3 100644
--- a/libxfs/xfs_dir2.c
+++ b/libxfs/xfs_dir2.c
@@ -72,7 +72,7 @@ xfs_ascii_ci_compname(
 			continue;
 		if (tolower(args->name[i]) != tolower(name[i]))
 			return XFS_CMP_DIFFERENT;
-		result = XFS_CMP_CASE;
+		result = XFS_CMP_MATCH;
 	}
 
 	return result;
@@ -248,8 +248,11 @@ xfs_dir_cilookup_result(
 {
 	if (args->cmpresult == XFS_CMP_DIFFERENT)
 		return ENOENT;
-	if (args->cmpresult != XFS_CMP_CASE ||
-					!(args->op_flags & XFS_DA_OP_CILOOKUP))
+	if (args->cmpresult == XFS_CMP_EXACT)
+		return EEXIST;
+	ASSERT(args->cmpresult == XFS_CMP_MATCH);
+	/* Only dup the found name if XFS_DA_OP_CILOOKUP is set. */
+	if (!(args->op_flags & XFS_DA_OP_CILOOKUP))
 		return EEXIST;
 
 	args->value = kmem_alloc(len, KM_NOFS | KM_MAYFAIL);
diff --git a/libxfs/xfs_dir2_node.c b/libxfs/xfs_dir2_node.c
index fb27506..550ca99 100644
--- a/libxfs/xfs_dir2_node.c
+++ b/libxfs/xfs_dir2_node.c
@@ -2034,7 +2034,7 @@ xfs_dir2_node_lookup(
 	error = xfs_da3_node_lookup_int(state, &rval);
 	if (error)
 		rval = error;
-	else if (rval == ENOENT && args->cmpresult == XFS_CMP_CASE) {
+	else if (rval == ENOENT && args->cmpresult == XFS_CMP_MATCH) {
 		/* If a CI match, dup the actual name and return EEXIST */
 		xfs_dir2_data_entry_t	*dep;
 
-- 
1.7.12.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 03/13] libxfs: add xfs_nameops.normhash
  2014-09-18 20:31 ` [PATCH 00/13] xfsprogs: Unicode/UTF-8 support for XFS Ben Myers
  2014-09-18 20:33   ` [PATCH 01/13] libxfs: return the first match during case-insensitive lookup Ben Myers
  2014-09-18 20:33   ` [PATCH 02/13] libxfs: rename XFS_CMP_CASE to XFS_CMP_MATCH Ben Myers
@ 2014-09-18 20:34   ` Ben Myers
  2014-09-18 20:35     ` Ben Myers
                     ` (11 subsequent siblings)
  14 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-18 20:34 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: tinguely, olaf, xfs

From: Olaf Weber <olaf@sgi.com>

Add a normhash callout to the xfs_nameops. This callout takes an xfs_da_args
structure as its argument, and calculates a hash value over the name. It may
in the process create a normalized form of the name, and assign that to the
norm/normlen fields in the xfs_da_args structure.

Changes:
 The pointer in kmem_free() was type converted to suppress compiler
 warnings.

Signed-off-by: Olaf Weber <olaf@sgi.com>
---
 include/xfs_da_btree.h |  5 ++++-
 libxfs/xfs_da_btree.c  |  9 ++++++++
 libxfs/xfs_dir2.c      | 56 +++++++++++++++++++++++++++++++++++++++-----------
 3 files changed, 57 insertions(+), 13 deletions(-)

diff --git a/include/xfs_da_btree.h b/include/xfs_da_btree.h
index 3d9f9dd..06b50bf 100644
--- a/include/xfs_da_btree.h
+++ b/include/xfs_da_btree.h
@@ -42,7 +42,9 @@ enum xfs_dacmp {
  */
 typedef struct xfs_da_args {
 	const __uint8_t	*name;		/* string (maybe not NULL terminated) */
-	int		namelen;	/* length of string (maybe no NULL) */
+	const __uint8_t	*norm;		/* normalized name (may be NULL) */
+ 	int		namelen;	/* length of string (maybe no NULL) */
+	int		normlen;	/* length of normalized name */
 	__uint8_t	filetype;	/* filetype of inode for directories */
 	__uint8_t	*value;		/* set of bytes (maybe contain NULLs) */
 	int		valuelen;	/* length of value */
@@ -131,6 +133,7 @@ typedef struct xfs_da_state {
  */
 struct xfs_nameops {
 	xfs_dahash_t	(*hashname)(struct xfs_name *);
+	int		(*normhash)(struct xfs_da_args *);
 	enum xfs_dacmp	(*compname)(struct xfs_da_args *,
 					const unsigned char *, int);
 };
diff --git a/libxfs/xfs_da_btree.c b/libxfs/xfs_da_btree.c
index b731b54..eb97317 100644
--- a/libxfs/xfs_da_btree.c
+++ b/libxfs/xfs_da_btree.c
@@ -2000,8 +2000,17 @@ xfs_default_hashname(
 	return xfs_da_hashname(name->name, name->len);
 }
 
+STATIC int
+xfs_da_normhash(
+	struct xfs_da_args *args)
+{
+	args->hashval = xfs_da_hashname(args->name, args->namelen);
+	return 0;
+}
+
 const struct xfs_nameops xfs_default_nameops = {
 	.hashname	= xfs_default_hashname,
+	.normhash	= xfs_da_normhash,
 	.compname	= xfs_da_compname
 };
 
diff --git a/libxfs/xfs_dir2.c b/libxfs/xfs_dir2.c
index 57e98a3..e52d082 100644
--- a/libxfs/xfs_dir2.c
+++ b/libxfs/xfs_dir2.c
@@ -54,6 +54,21 @@ xfs_ascii_ci_hashname(
 	return hash;
 }
 
+STATIC int
+xfs_ascii_ci_normhash(
+	struct xfs_da_args *args)
+{
+	xfs_dahash_t	hash;
+	int		i;
+
+	for (i = 0, hash = 0; i < args->namelen; i++)
+		hash = tolower(args->name[i]) ^ rol32(hash, 7);
+
+	args->hashval = hash;
+	return 0;
+}
+
+
 STATIC enum xfs_dacmp
 xfs_ascii_ci_compname(
 	struct xfs_da_args *args,
@@ -80,6 +95,7 @@ xfs_ascii_ci_compname(
 
 static struct xfs_nameops xfs_ascii_ci_nameops = {
 	.hashname	= xfs_ascii_ci_hashname,
+	.normhash	= xfs_ascii_ci_normhash,
 	.compname	= xfs_ascii_ci_compname,
 };
 
@@ -211,7 +227,6 @@ xfs_dir_createname(
 	args.name = name->name;
 	args.namelen = name->len;
 	args.filetype = name->type;
-	args.hashval = dp->i_mount->m_dirnameops->hashname(name);
 	args.inumber = inum;
 	args.dp = dp;
 	args.firstblock = first;
@@ -220,19 +235,24 @@ xfs_dir_createname(
 	args.whichfork = XFS_DATA_FORK;
 	args.trans = tp;
 	args.op_flags = XFS_DA_OP_ADDNAME | XFS_DA_OP_OKNOENT;
+	if ((rval = dp->i_mount->m_dirnameops->normhash(&args)))
+		return rval;
 
 	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL)
 		rval = xfs_dir2_sf_addname(&args);
 	else if ((rval = xfs_dir2_isblock(tp, dp, &v)))
-		return rval;
+		goto out_free;
 	else if (v)
 		rval = xfs_dir2_block_addname(&args);
 	else if ((rval = xfs_dir2_isleaf(tp, dp, &v)))
-		return rval;
+		goto out_free;
 	else if (v)
 		rval = xfs_dir2_leaf_addname(&args);
 	else
 		rval = xfs_dir2_node_addname(&args);
+out_free:
+	if (args.norm)
+		kmem_free((void *)args.norm);
 	return rval;
 }
 
@@ -289,22 +309,23 @@ xfs_dir_lookup(
 	args.name = name->name;
 	args.namelen = name->len;
 	args.filetype = name->type;
-	args.hashval = dp->i_mount->m_dirnameops->hashname(name);
 	args.dp = dp;
 	args.whichfork = XFS_DATA_FORK;
 	args.trans = tp;
 	args.op_flags = XFS_DA_OP_OKNOENT;
 	if (ci_name)
 		args.op_flags |= XFS_DA_OP_CILOOKUP;
+	if ((rval = dp->i_mount->m_dirnameops->normhash(&args)))
+		return rval;
 
 	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL)
 		rval = xfs_dir2_sf_lookup(&args);
 	else if ((rval = xfs_dir2_isblock(tp, dp, &v)))
-		return rval;
+		goto out_free;
 	else if (v)
 		rval = xfs_dir2_block_lookup(&args);
 	else if ((rval = xfs_dir2_isleaf(tp, dp, &v)))
-		return rval;
+		goto out_free;
 	else if (v)
 		rval = xfs_dir2_leaf_lookup(&args);
 	else
@@ -318,6 +339,9 @@ xfs_dir_lookup(
 			ci_name->len = args.valuelen;
 		}
 	}
+out_free:
+	if (args.norm)
+		kmem_free((void *)args.norm);
 	return rval;
 }
 
@@ -345,7 +369,6 @@ xfs_dir_removename(
 	args.name = name->name;
 	args.namelen = name->len;
 	args.filetype = name->type;
-	args.hashval = dp->i_mount->m_dirnameops->hashname(name);
 	args.inumber = ino;
 	args.dp = dp;
 	args.firstblock = first;
@@ -353,19 +376,24 @@ xfs_dir_removename(
 	args.total = total;
 	args.whichfork = XFS_DATA_FORK;
 	args.trans = tp;
+	if ((rval = dp->i_mount->m_dirnameops->normhash(&args)))
+		return rval;
 
 	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL)
 		rval = xfs_dir2_sf_removename(&args);
 	else if ((rval = xfs_dir2_isblock(tp, dp, &v)))
-		return rval;
+		goto out_free;
 	else if (v)
 		rval = xfs_dir2_block_removename(&args);
 	else if ((rval = xfs_dir2_isleaf(tp, dp, &v)))
-		return rval;
+		goto out_free;
 	else if (v)
 		rval = xfs_dir2_leaf_removename(&args);
 	else
 		rval = xfs_dir2_node_removename(&args);
+out_free:
+	if (args.norm)
+		kmem_free((void *)args.norm);
 	return rval;
 }
 
@@ -395,7 +423,6 @@ xfs_dir_replace(
 	args.name = name->name;
 	args.namelen = name->len;
 	args.filetype = name->type;
-	args.hashval = dp->i_mount->m_dirnameops->hashname(name);
 	args.inumber = inum;
 	args.dp = dp;
 	args.firstblock = first;
@@ -403,19 +430,24 @@ xfs_dir_replace(
 	args.total = total;
 	args.whichfork = XFS_DATA_FORK;
 	args.trans = tp;
+	if ((rval = dp->i_mount->m_dirnameops->normhash(&args)))
+		return rval;
 
 	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL)
 		rval = xfs_dir2_sf_replace(&args);
 	else if ((rval = xfs_dir2_isblock(tp, dp, &v)))
-		return rval;
+		goto out_free;
 	else if (v)
 		rval = xfs_dir2_block_replace(&args);
 	else if ((rval = xfs_dir2_isleaf(tp, dp, &v)))
-		return rval;
+		goto out_free;
 	else if (v)
 		rval = xfs_dir2_leaf_replace(&args);
 	else
 		rval = xfs_dir2_node_replace(&args);
+out_free:
+	if (args.norm)
+		kmem_free((void *)args.norm);
 	return rval;
 }
 
-- 
1.7.12.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 04/13] libxfs: change interface of xfs_nameops.normhash
  2014-09-18 20:31 ` [PATCH 00/13] xfsprogs: Unicode/UTF-8 support for XFS Ben Myers
@ 2014-09-18 20:35     ` Ben Myers
  2014-09-18 20:33   ` [PATCH 02/13] libxfs: rename XFS_CMP_CASE to XFS_CMP_MATCH Ben Myers
                       ` (13 subsequent siblings)
  14 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-18 20:35 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: xfs, olaf, tinguely

From: Olaf Weber <olaf@sgi.com>

With the introduction of the xfs_nameops.normhash callout, all uses of the
hashname callout now occur in places where an xfs_name structure must be
explicitly created just to match the parameter passing convention of this
callout. Change the arguments to a const unsigned char * and int instead.

Signed-off-by: Olaf Weber <olaf@sgi.com>
---
 db/check.c              |  6 ++----
 include/xfs_da_btree.h  |  2 +-
 libxfs/xfs_da_btree.c   |  9 +--------
 libxfs/xfs_dir2.c       | 10 ++++++----
 libxfs/xfs_dir2_block.c |  5 +----
 libxfs/xfs_dir2_data.c  |  6 ++----
 repair/phase6.c         |  2 +-
 7 files changed, 14 insertions(+), 26 deletions(-)

diff --git a/db/check.c b/db/check.c
index 4fd9fd0..49359d7 100644
--- a/db/check.c
+++ b/db/check.c
@@ -2212,7 +2212,6 @@ process_data_dir_v2(
 	int			stale = 0;
 	int			tag_err;
 	__be16			*tagp;
-	struct xfs_name		xname;
 
 	data = iocur_top->data;
 	block = iocur_top->data;
@@ -2323,9 +2322,8 @@ process_data_dir_v2(
 		tag_err += be16_to_cpu(*tagp) != (char *)dep - (char *)data;
 		addr = xfs_dir2_db_off_to_dataptr(mp, db,
 			(char *)dep - (char *)data);
-		xname.name = dep->name;
-		xname.len = dep->namelen;
-		dir_hash_add(mp->m_dirnameops->hashname(&xname), addr);
+		dir_hash_add(mp->m_dirnameops->hashname(dep->name,
+							dep->namelen), addr);
 		ptr += xfs_dir3_data_entsize(mp, dep->namelen);
 		count++;
 		lastfree = 0;
diff --git a/include/xfs_da_btree.h b/include/xfs_da_btree.h
index 06b50bf..9674bed 100644
--- a/include/xfs_da_btree.h
+++ b/include/xfs_da_btree.h
@@ -132,7 +132,7 @@ typedef struct xfs_da_state {
  * Name ops for directory and/or attr name operations
  */
 struct xfs_nameops {
-	xfs_dahash_t	(*hashname)(struct xfs_name *);
+	xfs_dahash_t	(*hashname)(const unsigned char *, int);
 	int		(*normhash)(struct xfs_da_args *);
 	enum xfs_dacmp	(*compname)(struct xfs_da_args *,
 					const unsigned char *, int);
diff --git a/libxfs/xfs_da_btree.c b/libxfs/xfs_da_btree.c
index eb97317..7be5eaf 100644
--- a/libxfs/xfs_da_btree.c
+++ b/libxfs/xfs_da_btree.c
@@ -1993,13 +1993,6 @@ xfs_da_compname(
 					XFS_CMP_EXACT : XFS_CMP_DIFFERENT;
 }
 
-static xfs_dahash_t
-xfs_default_hashname(
-	struct xfs_name	*name)
-{
-	return xfs_da_hashname(name->name, name->len);
-}
-
 STATIC int
 xfs_da_normhash(
 	struct xfs_da_args *args)
@@ -2009,7 +2002,7 @@ xfs_da_normhash(
 }
 
 const struct xfs_nameops xfs_default_nameops = {
-	.hashname	= xfs_default_hashname,
+	.hashname	= xfs_da_hashname,
 	.normhash	= xfs_da_normhash,
 	.compname	= xfs_da_compname
 };
diff --git a/libxfs/xfs_dir2.c b/libxfs/xfs_dir2.c
index e52d082..1893931 100644
--- a/libxfs/xfs_dir2.c
+++ b/libxfs/xfs_dir2.c
@@ -43,13 +43,14 @@ const unsigned char xfs_mode_to_ftype[S_IFMT >> S_SHIFT] = {
  */
 STATIC xfs_dahash_t
 xfs_ascii_ci_hashname(
-	struct xfs_name	*name)
+	const unsigned char *name,
+	int len)
 {
 	xfs_dahash_t	hash;
 	int		i;
 
-	for (i = 0, hash = 0; i < name->len; i++)
-		hash = tolower(name->name[i]) ^ rol32(hash, 7);
+	for (i = 0, hash = 0; i < len; i++)
+		hash = tolower(name[i]) ^ rol32(hash, 7);
 
 	return hash;
 }
@@ -475,7 +476,8 @@ xfs_dir_canenter(
 	args.name = name->name;
 	args.namelen = name->len;
 	args.filetype = name->type;
-	args.hashval = dp->i_mount->m_dirnameops->hashname(name);
+	args.hashval = dp->i_mount->m_dirnameops->hashname(name->name,
+							   name->len);
 	args.dp = dp;
 	args.whichfork = XFS_DATA_FORK;
 	args.trans = tp;
diff --git a/libxfs/xfs_dir2_block.c b/libxfs/xfs_dir2_block.c
index 2880431..1a8b5f5 100644
--- a/libxfs/xfs_dir2_block.c
+++ b/libxfs/xfs_dir2_block.c
@@ -1047,7 +1047,6 @@ xfs_dir2_sf_to_block(
 	xfs_dir2_sf_hdr_t	*sfp;		/* shortform header  */
 	__be16			*tagp;		/* end of data entry */
 	xfs_trans_t		*tp;		/* transaction pointer */
-	struct xfs_name		name;
 	struct xfs_ifork	*ifp;
 
 	trace_xfs_dir2_sf_to_block(args);
@@ -1205,10 +1204,8 @@ xfs_dir2_sf_to_block(
 		tagp = xfs_dir3_data_entry_tag_p(mp, dep);
 		*tagp = cpu_to_be16((char *)dep - (char *)hdr);
 		xfs_dir2_data_log_entry(tp, bp, dep);
-		name.name = sfep->name;
-		name.len = sfep->namelen;
 		blp[2 + i].hashval = cpu_to_be32(mp->m_dirnameops->
-							hashname(&name));
+					hashname(sfep->name, sfep->namelen));
 		blp[2 + i].address = cpu_to_be32(xfs_dir2_byte_to_dataptr(mp,
 						 (char *)dep - (char *)hdr));
 		offset = (int)((char *)(tagp + 1) - (char *)hdr);
diff --git a/libxfs/xfs_dir2_data.c b/libxfs/xfs_dir2_data.c
index dc9df4d..9b3f750 100644
--- a/libxfs/xfs_dir2_data.c
+++ b/libxfs/xfs_dir2_data.c
@@ -46,7 +46,6 @@ __xfs_dir3_data_check(
 	xfs_mount_t		*mp;		/* filesystem mount point */
 	char			*p;		/* current data position */
 	int			stale;		/* count of stale leaves */
-	struct xfs_name		name;
 
 	mp = bp->b_target->bt_mount;
 	hdr = bp->b_addr;
@@ -142,9 +141,8 @@ __xfs_dir3_data_check(
 			addr = xfs_dir2_db_off_to_dataptr(mp, mp->m_dirdatablk,
 				(xfs_dir2_data_aoff_t)
 				((char *)dep - (char *)hdr));
-			name.name = dep->name;
-			name.len = dep->namelen;
-			hash = mp->m_dirnameops->hashname(&name);
+			hash = mp->m_dirnameops->
+					hashname(dep->name, dep->namelen);
 			for (i = 0; i < be32_to_cpu(btp->count); i++) {
 				if (be32_to_cpu(lep[i].address) == addr &&
 				    be32_to_cpu(lep[i].hashval) == hash)
diff --git a/repair/phase6.c b/repair/phase6.c
index f13069f..f374fd0 100644
--- a/repair/phase6.c
+++ b/repair/phase6.c
@@ -195,7 +195,7 @@ dir_hash_add(
 	dup = 0;
 
 	if (!junk) {
-		hash = mp->m_dirnameops->hashname(&xname);
+		hash = mp->m_dirnameops->hashname(name, namelen);
 		byhash = DIR_HASH_FUNC(hashtab, hash);
 
 		/*
-- 
1.7.12.4


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 04/13] libxfs: change interface of xfs_nameops.normhash
@ 2014-09-18 20:35     ` Ben Myers
  0 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-18 20:35 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: tinguely, olaf, xfs

From: Olaf Weber <olaf@sgi.com>

With the introduction of the xfs_nameops.normhash callout, all uses of the
hashname callout now occur in places where an xfs_name structure must be
explicitly created just to match the parameter passing convention of this
callout. Change the arguments to a const unsigned char * and int instead.

Signed-off-by: Olaf Weber <olaf@sgi.com>
---
 db/check.c              |  6 ++----
 include/xfs_da_btree.h  |  2 +-
 libxfs/xfs_da_btree.c   |  9 +--------
 libxfs/xfs_dir2.c       | 10 ++++++----
 libxfs/xfs_dir2_block.c |  5 +----
 libxfs/xfs_dir2_data.c  |  6 ++----
 repair/phase6.c         |  2 +-
 7 files changed, 14 insertions(+), 26 deletions(-)

diff --git a/db/check.c b/db/check.c
index 4fd9fd0..49359d7 100644
--- a/db/check.c
+++ b/db/check.c
@@ -2212,7 +2212,6 @@ process_data_dir_v2(
 	int			stale = 0;
 	int			tag_err;
 	__be16			*tagp;
-	struct xfs_name		xname;
 
 	data = iocur_top->data;
 	block = iocur_top->data;
@@ -2323,9 +2322,8 @@ process_data_dir_v2(
 		tag_err += be16_to_cpu(*tagp) != (char *)dep - (char *)data;
 		addr = xfs_dir2_db_off_to_dataptr(mp, db,
 			(char *)dep - (char *)data);
-		xname.name = dep->name;
-		xname.len = dep->namelen;
-		dir_hash_add(mp->m_dirnameops->hashname(&xname), addr);
+		dir_hash_add(mp->m_dirnameops->hashname(dep->name,
+							dep->namelen), addr);
 		ptr += xfs_dir3_data_entsize(mp, dep->namelen);
 		count++;
 		lastfree = 0;
diff --git a/include/xfs_da_btree.h b/include/xfs_da_btree.h
index 06b50bf..9674bed 100644
--- a/include/xfs_da_btree.h
+++ b/include/xfs_da_btree.h
@@ -132,7 +132,7 @@ typedef struct xfs_da_state {
  * Name ops for directory and/or attr name operations
  */
 struct xfs_nameops {
-	xfs_dahash_t	(*hashname)(struct xfs_name *);
+	xfs_dahash_t	(*hashname)(const unsigned char *, int);
 	int		(*normhash)(struct xfs_da_args *);
 	enum xfs_dacmp	(*compname)(struct xfs_da_args *,
 					const unsigned char *, int);
diff --git a/libxfs/xfs_da_btree.c b/libxfs/xfs_da_btree.c
index eb97317..7be5eaf 100644
--- a/libxfs/xfs_da_btree.c
+++ b/libxfs/xfs_da_btree.c
@@ -1993,13 +1993,6 @@ xfs_da_compname(
 					XFS_CMP_EXACT : XFS_CMP_DIFFERENT;
 }
 
-static xfs_dahash_t
-xfs_default_hashname(
-	struct xfs_name	*name)
-{
-	return xfs_da_hashname(name->name, name->len);
-}
-
 STATIC int
 xfs_da_normhash(
 	struct xfs_da_args *args)
@@ -2009,7 +2002,7 @@ xfs_da_normhash(
 }
 
 const struct xfs_nameops xfs_default_nameops = {
-	.hashname	= xfs_default_hashname,
+	.hashname	= xfs_da_hashname,
 	.normhash	= xfs_da_normhash,
 	.compname	= xfs_da_compname
 };
diff --git a/libxfs/xfs_dir2.c b/libxfs/xfs_dir2.c
index e52d082..1893931 100644
--- a/libxfs/xfs_dir2.c
+++ b/libxfs/xfs_dir2.c
@@ -43,13 +43,14 @@ const unsigned char xfs_mode_to_ftype[S_IFMT >> S_SHIFT] = {
  */
 STATIC xfs_dahash_t
 xfs_ascii_ci_hashname(
-	struct xfs_name	*name)
+	const unsigned char *name,
+	int len)
 {
 	xfs_dahash_t	hash;
 	int		i;
 
-	for (i = 0, hash = 0; i < name->len; i++)
-		hash = tolower(name->name[i]) ^ rol32(hash, 7);
+	for (i = 0, hash = 0; i < len; i++)
+		hash = tolower(name[i]) ^ rol32(hash, 7);
 
 	return hash;
 }
@@ -475,7 +476,8 @@ xfs_dir_canenter(
 	args.name = name->name;
 	args.namelen = name->len;
 	args.filetype = name->type;
-	args.hashval = dp->i_mount->m_dirnameops->hashname(name);
+	args.hashval = dp->i_mount->m_dirnameops->hashname(name->name,
+							   name->len);
 	args.dp = dp;
 	args.whichfork = XFS_DATA_FORK;
 	args.trans = tp;
diff --git a/libxfs/xfs_dir2_block.c b/libxfs/xfs_dir2_block.c
index 2880431..1a8b5f5 100644
--- a/libxfs/xfs_dir2_block.c
+++ b/libxfs/xfs_dir2_block.c
@@ -1047,7 +1047,6 @@ xfs_dir2_sf_to_block(
 	xfs_dir2_sf_hdr_t	*sfp;		/* shortform header  */
 	__be16			*tagp;		/* end of data entry */
 	xfs_trans_t		*tp;		/* transaction pointer */
-	struct xfs_name		name;
 	struct xfs_ifork	*ifp;
 
 	trace_xfs_dir2_sf_to_block(args);
@@ -1205,10 +1204,8 @@ xfs_dir2_sf_to_block(
 		tagp = xfs_dir3_data_entry_tag_p(mp, dep);
 		*tagp = cpu_to_be16((char *)dep - (char *)hdr);
 		xfs_dir2_data_log_entry(tp, bp, dep);
-		name.name = sfep->name;
-		name.len = sfep->namelen;
 		blp[2 + i].hashval = cpu_to_be32(mp->m_dirnameops->
-							hashname(&name));
+					hashname(sfep->name, sfep->namelen));
 		blp[2 + i].address = cpu_to_be32(xfs_dir2_byte_to_dataptr(mp,
 						 (char *)dep - (char *)hdr));
 		offset = (int)((char *)(tagp + 1) - (char *)hdr);
diff --git a/libxfs/xfs_dir2_data.c b/libxfs/xfs_dir2_data.c
index dc9df4d..9b3f750 100644
--- a/libxfs/xfs_dir2_data.c
+++ b/libxfs/xfs_dir2_data.c
@@ -46,7 +46,6 @@ __xfs_dir3_data_check(
 	xfs_mount_t		*mp;		/* filesystem mount point */
 	char			*p;		/* current data position */
 	int			stale;		/* count of stale leaves */
-	struct xfs_name		name;
 
 	mp = bp->b_target->bt_mount;
 	hdr = bp->b_addr;
@@ -142,9 +141,8 @@ __xfs_dir3_data_check(
 			addr = xfs_dir2_db_off_to_dataptr(mp, mp->m_dirdatablk,
 				(xfs_dir2_data_aoff_t)
 				((char *)dep - (char *)hdr));
-			name.name = dep->name;
-			name.len = dep->namelen;
-			hash = mp->m_dirnameops->hashname(&name);
+			hash = mp->m_dirnameops->
+					hashname(dep->name, dep->namelen);
 			for (i = 0; i < be32_to_cpu(btp->count); i++) {
 				if (be32_to_cpu(lep[i].address) == addr &&
 				    be32_to_cpu(lep[i].hashval) == hash)
diff --git a/repair/phase6.c b/repair/phase6.c
index f13069f..f374fd0 100644
--- a/repair/phase6.c
+++ b/repair/phase6.c
@@ -195,7 +195,7 @@ dir_hash_add(
 	dup = 0;
 
 	if (!junk) {
-		hash = mp->m_dirnameops->hashname(&xname);
+		hash = mp->m_dirnameops->hashname(name, namelen);
 		byhash = DIR_HASH_FUNC(hashtab, hash);
 
 		/*
-- 
1.7.12.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 05/13] libxfs: add a superblock feature bit to indicate UTF-8 support.
  2014-09-18 20:31 ` [PATCH 00/13] xfsprogs: Unicode/UTF-8 support for XFS Ben Myers
                     ` (3 preceding siblings ...)
  2014-09-18 20:35     ` Ben Myers
@ 2014-09-18 20:36   ` Ben Myers
  2014-09-18 20:37   ` [PATCH 06/13] xfsprogs: add unicode character database files Ben Myers
                     ` (9 subsequent siblings)
  14 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-18 20:36 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: tinguely, olaf, xfs

From: Olaf Weber <olaf@sgi.com>

When UTF-8 support is enabled, the xfs_dir_ci_inode_operations must be
installed. Add xfs_sb_version_hasci(), which tests both the borgbit and
the utf8bit, and returns true if at least one of them is set. Replace
calls to xfs_sb_version_hasasciici() as needed.

Signed-off-by: Olaf Weber <olaf@sgi.com>
---
 include/xfs_fs.h |  2 +-
 include/xfs_sb.h | 25 ++++++++++++++++++++++++-
 2 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/include/xfs_fs.h b/include/xfs_fs.h
index 59c40fc..1be539d 100644
--- a/include/xfs_fs.h
+++ b/include/xfs_fs.h
@@ -239,7 +239,7 @@ typedef struct xfs_fsop_resblks {
 #define XFS_FSOP_GEOM_FLAGS_V5SB	0x8000	/* version 5 superblock */
 #define XFS_FSOP_GEOM_FLAGS_FTYPE	0x10000	/* inode directory types */
 #define XFS_FSOP_GEOM_FLAGS_FINOBT	0x20000	/* free inode btree */
-
+#define XFS_FSOP_GEOM_FLAGS_UTF8	0x40000	/* utf8 filenames */
 
 /*
  * Minimum and maximum sizes need for growth checks.
diff --git a/include/xfs_sb.h b/include/xfs_sb.h
index 950d1ea..5ac7f06 100644
--- a/include/xfs_sb.h
+++ b/include/xfs_sb.h
@@ -82,6 +82,8 @@ struct xfs_trans;
 #define XFS_SB_VERSION2_RESERVED4BIT	0x00000004
 #define XFS_SB_VERSION2_ATTR2BIT	0x00000008	/* Inline attr rework */
 #define XFS_SB_VERSION2_PARENTBIT	0x00000010	/* parent pointers */
+#define XFS_SB_VERSION2_PARENTBIT	0x00000010	/* parent pointers */
+#define XFS_SB_VERSION2_UTF8BIT		0x00000020	/* utf8 names */
 #define XFS_SB_VERSION2_PROJID32BIT	0x00000080	/* 32 bit project id */
 #define XFS_SB_VERSION2_CRCBIT		0x00000100	/* metadata CRCs */
 #define XFS_SB_VERSION2_FTYPE		0x00000200	/* inode type in dir */
@@ -89,6 +91,7 @@ struct xfs_trans;
 #define	XFS_SB_VERSION2_OKREALFBITS	\
 	(XFS_SB_VERSION2_LAZYSBCOUNTBIT	| \
 	 XFS_SB_VERSION2_ATTR2BIT	| \
+	 XFS_SB_VERSION2_UTF8BIT	| \
 	 XFS_SB_VERSION2_PROJID32BIT	| \
 	 XFS_SB_VERSION2_FTYPE)
 #define	XFS_SB_VERSION2_OKSASHFBITS	\
@@ -600,8 +603,10 @@ xfs_sb_has_ro_compat_feature(
 }
 
 #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
+#define XFS_SB_FEAT_INCOMPAT_UTF8	(1 << 1)	/* utf-8 name support */
 #define XFS_SB_FEAT_INCOMPAT_ALL \
-		(XFS_SB_FEAT_INCOMPAT_FTYPE)
+		(XFS_SB_FEAT_INCOMPAT_FTYPE | \
+		 XFS_SB_FEAT_INCOMPAT_UTF8)
 
 #define XFS_SB_FEAT_INCOMPAT_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_ALL
 static inline bool
@@ -649,6 +654,24 @@ static inline int xfs_sb_version_hasfinobt(xfs_sb_t *sbp)
 		(sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_FINOBT);
 }
 
+static inline int xfs_sb_version_hasutf8(xfs_sb_t *sbp)
+{
+	return (XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5 &&
+		xfs_sb_has_incompat_feature(sbp, XFS_SB_FEAT_INCOMPAT_UTF8)) ||
+	       (xfs_sb_version_hasmorebits(sbp) &&
+		 (sbp->sb_features2 & XFS_SB_VERSION2_UTF8BIT));
+}
+
+/*
+ * Special case: there are a number of places where we need to test
+ * both the borgbit and the utf8bit, and take the same action if
+ * either of those is set.
+ */
+static inline int xfs_sb_version_hasci(xfs_sb_t *sbp)
+{
+	return xfs_sb_version_hasasciici(sbp) || xfs_sb_version_hasutf8(sbp);
+}
+
 /*
  * end of superblock version macros
  */
-- 
1.7.12.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 06/13] xfsprogs: add unicode character database files
  2014-09-18 20:31 ` [PATCH 00/13] xfsprogs: Unicode/UTF-8 support for XFS Ben Myers
                     ` (4 preceding siblings ...)
  2014-09-18 20:36   ` [PATCH 05/13] libxfs: add a superblock feature bit to indicate UTF-8 support Ben Myers
@ 2014-09-18 20:37   ` Ben Myers
  2014-09-18 20:38   ` [PATCH 07/13] libxfs: add trie generator and supporting code for UTF-8 Ben Myers
                     ` (8 subsequent siblings)
  14 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-18 20:37 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: tinguely, olaf, xfs

From: Olaf Weber <olaf@sgi.com>

Add files from the Unicode Character Database, version 7.0.0, to the source.
A helper program that generates a trie used for normalization from these
files is part of a separate commit.

Signed-off-by: Olaf Weber <olaf@sgi.com>

---
[v2: removed large unicode files.  download them as below.  -bpm]

cd support/ucd-7.0.0
wget http://www.unicode.org/Public/7.0.0/ucd/CaseFolding.txt
wget http://www.unicode.org/Public/7.0.0/ucd/DerivedAge.txt
wget http://www.unicode.org/Public/7.0.0/ucd/extracted/DerivedCombiningClass.txt
wget http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt
wget http://www.unicode.org/Public/7.0.0/ucd/NormalizationCorrections.txt
wget http://www.unicode.org/Public/7.0.0/ucd/NormalizationTest.txt
wget http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt
---
 support/ucd-7.0.0/README | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)
 create mode 100644 support/ucd-7.0.0/README

diff --git a/support/ucd-7.0.0/README b/support/ucd-7.0.0/README
new file mode 100644
index 0000000..d713e66
--- /dev/null
+++ b/support/ucd-7.0.0/README
@@ -0,0 +1,33 @@
+The files in this directory are part of the Unicode Character Database
+for version 7.0.0 of the Unicode standard.
+
+The full set of files can be found here:
+
+  http://www.unicode.org/Public/7.0.0/ucd/
+
+The latest released version of the UCD can be found here:
+
+  http://www.unicode.org/Public/UCD/latest/
+
+The files in this directory are identical, except that they have been
+renamed with a suffix indicating the unicode version.
+
+Individual source links:
+
+  http://www.unicode.org/Public/7.0.0/ucd/CaseFolding.txt
+  http://www.unicode.org/Public/7.0.0/ucd/DerivedAge.txt
+  http://www.unicode.org/Public/7.0.0/ucd/extracted/DerivedCombiningClass.txt
+  http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt
+  http://www.unicode.org/Public/7.0.0/ucd/NormalizationCorrections.txt
+  http://www.unicode.org/Public/7.0.0/ucd/NormalizationTest.txt
+  http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt
+
+md5sums
+
+  9a92b2bfe56c6719def926bab524fefd  CaseFolding-7.0.0.txt
+  07b8b1027eb824cf0835314e94f23d2e  DerivedAge-7.0.0.txt
+  90c3340b16821e2f2153acdbe6fc6180  DerivedCombiningClass-7.0.0.txt
+  c41c0601f808116f623de47110ed4f93  DerivedCoreProperties-7.0.0.txt
+  522720ddfc150d8e63a2518634829bce  NormalizationCorrections-7.0.0.txt
+  1f35175eba4a2ad795db489f789ae352  NormalizationTest-7.0.0.txt
+  c8355655731d75e6a3de8c20d7e601ba  UnicodeData-7.0.0.txt
-- 
1.7.12.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 07/13] libxfs: add trie generator and supporting code for UTF-8.
  2014-09-18 20:31 ` [PATCH 00/13] xfsprogs: Unicode/UTF-8 support for XFS Ben Myers
                     ` (5 preceding siblings ...)
  2014-09-18 20:37   ` [PATCH 06/13] xfsprogs: add unicode character database files Ben Myers
@ 2014-09-18 20:38   ` Ben Myers
  2014-09-18 20:38   ` [PATCH 08/13] libxfs: add xfs_nameops for utf8 and utf8+casefold Ben Myers
                     ` (7 subsequent siblings)
  14 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-18 20:38 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: tinguely, olaf, xfs

From: Olaf Weber <olaf@sgi.com>

mkutf8data.c is the source for a program that generates utf8data.h, which
contains the trie that utf8norm.c uses. The trie is generated from the
Unicode 7.0.0 data files. The format of the utf8data[] table is described
in utf8norm.c.

Supporting functions for UTF-8 normalization are in utf8norm.c with the
header utf8norm.h. Two normalization forms are supported: nfkdi and nfkdicf.

  nfkdi:
   - Apply unicode normalization form NFKD.
   - Remove any Default_Ignorable_Code_Point.

  nfkdicf:
   - Apply unicode normalization form NFKD.
   - Remove any Default_Ignorable_Code_Point.
   - Apply a full casefold (C + F).

For the purposes of the code, a string is valid UTF-8 if:

 - The values encoded are 0x1..0x10FFFF.
 - The surrogate codepoints 0xD800..0xDFFFF are not encoded.
 - The shortest possible encoding is used for all values.

The supporting functions work on null-terminated strings (utf8 prefix) and
on length-limited strings (utf8n prefix).

Signed-off-by: Olaf Weber <olaf@sgi.com>
---
 include/utf8norm.h   |  111 ++
 libxfs/utf8norm.c    |  628 ++++++++++
 support/mkutf8data.c | 3232 ++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 3971 insertions(+)
 create mode 100644 include/utf8norm.h
 create mode 100644 libxfs/utf8norm.c
 create mode 100644 support/mkutf8data.c

diff --git a/include/utf8norm.h b/include/utf8norm.h
new file mode 100644
index 0000000..6aa3391
--- /dev/null
+++ b/include/utf8norm.h
@@ -0,0 +1,111 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#ifndef UTF8NORM_H
+#define UTF8NORM_H
+
+/* An opaque type used to determine the normalization in use. */
+typedef const struct utf8data *utf8data_t;
+
+/* Encoding a unicode version number as a single unsigned int. */
+#define UNICODE_MAJ_SHIFT		(16)
+#define UNICODE_MIN_SHIFT		(8)
+
+#define UNICODE_AGE(MAJ,MIN,REV)			\
+	(((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) |	\
+	 ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) |	\
+	 ((unsigned int)(REV)))
+
+/* Highest unicode version supported by the data tables. */
+extern const unsigned int utf8version;
+
+/*
+ * Look for the correct utf8data_t for a unicode version.
+ * Returns NULL if the version requested is too new.
+ *
+ * Two normalization forms are supported: nfkdi and nfkdicf.
+ *
+ * nfkdi:
+ *  - Apply unicode normalization form NFKD.
+ *  - Remove any Default_Ignorable_Code_Point.
+ *
+ * nfkdicf:
+ *  - Apply unicode normalization form NFKD.
+ *  - Remove any Default_Ignorable_Code_Point.
+ *  - Apply a full casefold (C + F).
+ */
+extern utf8data_t utf8nfkdi(unsigned int);
+extern utf8data_t utf8nfkdicf(unsigned int);
+
+/*
+ * Determine the maximum age of any unicode character in the string.
+ * Returns 0 if only unassigned code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern int utf8agemax(utf8data_t, const char *);
+extern int utf8nagemax(utf8data_t, const char *, size_t);
+
+/*
+ * Determine the minimum age of any unicode character in the string.
+ * Returns 0 if any unassigned code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern int utf8agemin(utf8data_t, const char *);
+extern int utf8nagemin(utf8data_t, const char *, size_t);
+
+/*
+ * Determine the length of the normalized from of the string,
+ * excluding any terminating NULL byte.
+ * Returns 0 if only ignorable code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern ssize_t utf8len(utf8data_t, const char *);
+extern ssize_t utf8nlen(utf8data_t, const char *, size_t);
+
+/*
+ * Cursor structure used by the normalizer.
+ */
+struct utf8cursor {
+	utf8data_t	data;
+	const char	*s;
+	const char	*p;
+	const char	*ss;
+	const char	*sp;
+	unsigned int	len;
+	unsigned int	slen;
+	short int	ccc;
+	short int	nccc;
+};
+
+/*
+ * Initialize a utf8cursor to normalize a string.
+ * Returns 0 on success.
+ * Returns -1 on failure.
+ */
+extern int utf8cursor(struct utf8cursor *, utf8data_t, const char *);
+extern int utf8ncursor(struct utf8cursor *, utf8data_t, const char *, size_t);
+
+/*
+ * Get the next byte in the normalization.
+ * Returns a value > 0 && < 256 on success.
+ * Returns 0 when the end of the normalization is reached.
+ * Returns -1 if the string being normalized is not valid UTF-8.
+ */
+extern int utf8byte(struct utf8cursor *);
+
+#endif /* UTF8NORM_H */
diff --git a/libxfs/utf8norm.c b/libxfs/utf8norm.c
new file mode 100644
index 0000000..6232d1a
--- /dev/null
+++ b/libxfs/utf8norm.c
@@ -0,0 +1,628 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#include "xfs.h"
+#include "xfs_types.h"
+#include <utf8norm.h>
+
+struct utf8data {
+	unsigned int maxage;
+	unsigned int offset;
+};
+
+#define __INCLUDED_FROM_UTF8NORM_C__
+#include <utf8data.h>
+#undef __INCLUDED_FROM_UTF8NORM_C__
+
+/*
+ * UTF-8 valid ranges.
+ *
+ * The UTF-8 encoding spreads the bits of a 32bit word over several
+ * bytes. This table gives the ranges that can be held and how they'd
+ * be represented.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000000 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000000 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * There is an additional requirement on UTF-8, in that only the
+ * shortest representation of a 32bit value is to be used.  A decoder
+ * must not decode sequences that do not satisfy this requirement.
+ * Thus the allowed ranges have a lower bound.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000080 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000800 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00010000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00200000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x04000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * Actual unicode characters are limited to the range 0x0 - 0x10FFFF,
+ * 17 planes of 65536 values.  This limits the sequences actually seen
+ * even more, to just the following.
+ *
+ *          0 -     0x7F: 0                   - 0x7F
+ *       0x80 -    0x7FF: 0xC2 0x80           - 0xDF 0xBF
+ *      0x800 -   0xFFFF: 0xE0 0xA0 0x80      - 0xEF 0xBF 0xBF
+ *    0x10000 - 0x10FFFF: 0xF0 0x90 0x80 0x80 - 0xF4 0x8F 0xBF 0xBF
+ *
+ * Within those ranges the surrogates 0xD800 - 0xDFFF are not allowed.
+ *
+ * Note that the longest sequence seen with valid usage is 4 bytes,
+ * the same a single UTF-32 character.  This makes the UTF-8
+ * representation of Unicode strictly smaller than UTF-32.
+ *
+ * The shortest sequence requirement was introduced by:
+ *    Corrigendum #1: UTF-8 Shortest Form
+ * It can be found here:
+ *    http://www.unicode.org/versions/corrigendum1.html
+ *
+ */
+
+/*
+ * Return the number of bytes used by the current UTF-8 sequence.
+ * Assumes the input points to the first byte of a valid UTF-8
+ * sequence.
+ */
+static inline int
+utf8clen(const char *s)
+{
+	unsigned char c = *s;
+	return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0);
+}
+
+/*
+ * utf8trie_t
+ *
+ * A compact binary tree, used to decode UTF-8 characters.
+ *
+ * Internal nodes are one byte for the node itself, and up to three
+ * bytes for an offset into the tree.  The first byte contains the
+ * following information:
+ *  NEXTBYTE  - flag        - advance to next byte if set
+ *  BITNUM    - 3 bit field - the bit number to tested
+ *  OFFLEN    - 2 bit field - number of bytes in the offset
+ * if offlen == 0 (non-branching node)
+ *  RIGHTPATH - 1 bit field - set if the following node is for the
+ *                            right-hand path (tested bit is set)
+ *  TRIENODE  - 1 bit field - set if the following node is an internal
+ *                            node, otherwise it is a leaf node
+ * if offlen != 0 (branching node)
+ *  LEFTNODE  - 1 bit field - set if the left-hand node is internal
+ *  RIGHTNODE - 1 bit field - set if the right-hand node is internal
+ *
+ * Due to the way utf8 works, there cannot be branching nodes with
+ * NEXTBYTE set, and moreover those nodes always have a righthand
+ * descendant.
+ */
+typedef const unsigned char utf8trie_t;
+#define BITNUM		0x07
+#define NEXTBYTE	0x08
+#define OFFLEN		0x30
+#define OFFLEN_SHIFT	4
+#define RIGHTPATH	0x40
+#define TRIENODE	0x80
+#define RIGHTNODE	0x40
+#define LEFTNODE	0x80
+
+/*
+ * utf8leaf_t
+ *
+ * The leaves of the trie are embedded in the trie, and so the same
+ * underlying datatype: unsigned char.
+ *
+ * leaf[0]: The unicode version, stored as a generation number that is
+ *          an index into utf8agetab[].  With this we can filter code
+ *          points based on the unicode version in which they were
+ *          defined.  The CCC of a non-defined code point is 0.
+ * leaf[1]: Canonical Combining Class. During normalization, we need
+ *          to do a stable sort into ascending order of all characters
+ *          with a non-zero CCC that occur between two characters with
+ *          a CCC of 0, or at the begin or end of a string.
+ *          The unicode standard guarantees that all CCC values are
+ *          between 0 and 254 inclusive, which leaves 255 available as
+ *          a special value.
+ *          Code points with CCC 0 are known as stoppers.
+ * leaf[2]: Decomposition. If leaf[1] == 255, then leaf[2] is the
+ *          start of a NUL-terminated string that is the decomposition
+ *          of the character.
+ *          The CCC of a decomposable character is the same as the CCC
+ *          of the first character of its decomposition.
+ *          Some characters decompose as the empty string: these are
+ *          characters with the Default_Ignorable_Code_Point property.
+ *          These do affect normalization, as they all have CCC 0.
+ *
+ * The decompositions in the trie have been fully expanded.
+ *
+ * Casefolding, if applicable, is also done using decompositions.
+ *
+ * The trie is constructed in such a way that leaves exist for all
+ * UTF-8 sequences that match the criteria from the "UTF-8 valid
+ * ranges" comment above, and only for those sequences.  Therefore a
+ * lookup in the trie can be used to validate the UTF-8 input.
+ */
+typedef const unsigned char utf8leaf_t;
+
+#define LEAF_GEN(LEAF)	((LEAF)[0])
+#define LEAF_CCC(LEAF)	((LEAF)[1])
+#define LEAF_STR(LEAF)	((const char*)((LEAF) + 2))
+
+#define MINCCC		(0)
+#define MAXCCC		(254)
+#define STOPPER		(0)
+#define	DECOMPOSE	(255)
+
+/*
+ * Use trie to scan s, touching at most len bytes.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * A non-NULL return guarantees that the UTF-8 sequence starting at s
+ * is well-formed and corresponds to a known unicode code point.  The
+ * shorthand for this will be "is valid UTF-8 unicode".
+ */
+static utf8leaf_t *
+utf8nlookup(utf8data_t data, const char *s, size_t len)
+{
+	utf8trie_t	*trie = utf8data + data->offset;
+	int		offlen;
+	int		offset;
+	int		mask;
+	int		node;
+
+	if (!data)
+		return NULL;
+	if (len == 0)
+		return NULL;
+	node = 1;
+	while (node) {
+		offlen = (*trie & OFFLEN) >> OFFLEN_SHIFT;
+		if (*trie & NEXTBYTE) {
+			if (--len == 0)
+				return NULL;
+			s++;
+		}
+		mask = 1 << (*trie & BITNUM);
+		if (*s & mask) {
+			/* Right leg */
+			if (offlen) {
+				/* Right node at offset of trie */
+				node = (*trie & RIGHTNODE);
+				offset = trie[offlen];
+				while (--offlen) {
+					offset <<= 8;
+					offset |= trie[offlen];
+				}
+				trie += offset;
+			} else if (*trie & RIGHTPATH) {
+				/* Right node after this node */
+				node = (*trie & TRIENODE);
+				trie++;
+			} else {
+				/* No right node. */
+				node = 0;
+				trie = NULL;
+			}
+		} else {
+			/* Left leg */
+			if (offlen) {
+				/* Left node after this node. */
+				node = (*trie & LEFTNODE);
+				trie += offlen + 1;
+			} else if (*trie & RIGHTPATH) {
+				/* No left node. */
+				node = 0;
+				trie = NULL;
+			} else {
+				/* Left node after this node */
+				node = (*trie & TRIENODE);
+				trie++;
+			}
+		}
+	}
+	return trie;
+}
+
+/*
+ * Use trie to scan s.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * Forwards to utf8nlookup().
+ */
+static utf8leaf_t *
+utf8lookup(utf8data_t data, const char *s)
+{
+	return utf8nlookup(data, s, (size_t)-1);
+}
+
+/*
+ * Maximum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if only non-assigned code points are used.
+ */
+int
+utf8agemax(utf8data_t data, const char *s)
+{
+	utf8leaf_t	*leaf;
+	int		age = 0;
+	int		leaf_age;
+
+	if (!data)
+		return -1;
+	while (*s) {
+		if (!(leaf = utf8lookup(data, s)))
+			return -1;
+		leaf_age = utf8agetab[LEAF_GEN(leaf)];
+		if (leaf_age <= data->maxage && leaf_age > age)
+			age = leaf_age;
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Minimum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if non-assigned code points are used.
+ */
+int
+utf8agemin(utf8data_t data, const char *s)
+{
+	utf8leaf_t	*leaf;
+	int		age = data->maxage;
+	int		leaf_age;
+
+	if (!data)
+		return -1;
+	while (*s) {
+		if (!(leaf = utf8lookup(data, s)))
+			return -1;
+		leaf_age = utf8agetab[LEAF_GEN(leaf)];
+		if (leaf_age <= data->maxage && leaf_age < age)
+			age = leaf_age;
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Maximum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemax(utf8data_t data, const char *s, size_t len)
+{
+	utf8leaf_t	*leaf;
+	int		age = 0;
+	int		leaf_age;
+
+	if (!data)
+		return -1;
+        while (len && *s) {
+		if (!(leaf = utf8nlookup(data, s, len)))
+			return -1;
+		leaf_age = utf8agetab[LEAF_GEN(leaf)];
+		if (leaf_age <= data->maxage && leaf_age > age)
+			age = leaf_age;
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Maximum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemin(utf8data_t data, const char *s, size_t len)
+{
+	utf8leaf_t	*leaf;
+	int		leaf_age;
+	int		age = data->maxage;
+
+	if (!data)
+		return -1;
+        while (len && *s) {
+		if (!(leaf = utf8nlookup(data, s, len)))
+			return -1;
+		leaf_age = utf8agetab[LEAF_GEN(leaf)];
+		if (leaf_age <= data->maxage && leaf_age < age)
+			age = leaf_age;
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Length of the normalization of s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ *
+ * A string of Default_Ignorable_Code_Point has length 0.
+ */
+ssize_t
+utf8len(utf8data_t data, const char *s)
+{
+	utf8leaf_t	*leaf;
+	size_t		ret = 0;
+
+	if (!data)
+		return -1;
+	while (*s) {
+		if (!(leaf = utf8lookup(data, s)))
+			return -1;
+		if (utf8agetab[LEAF_GEN(leaf)] > data->maxage)
+			ret += utf8clen(s);
+		else if (LEAF_CCC(leaf) == DECOMPOSE)
+			ret += strlen(LEAF_STR(leaf));
+		else
+			ret += utf8clen(s);
+		s += utf8clen(s);
+	}
+	return ret;
+}
+
+/*
+ * Length of the normalization of s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+ssize_t
+utf8nlen(utf8data_t data, const char *s, size_t len)
+{
+	utf8leaf_t	*leaf;
+	size_t		ret = 0;
+
+	if (!data)
+		return -1;
+	while (len && *s) {
+		if (!(leaf = utf8nlookup(data, s, len)))
+			return -1;
+		if (utf8agetab[LEAF_GEN(leaf)] > data->maxage)
+			ret += utf8clen(s);
+		else if (LEAF_CCC(leaf) == DECOMPOSE)
+			ret += strlen(LEAF_STR(leaf));
+		else
+			ret += utf8clen(s);
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return ret;
+}
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ *   u8c    : pointer to cursor.
+ *   data   : utf8data_t to use for normalization.
+ *   s      : string.
+ *   len    : length of s.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int
+utf8ncursor(
+	struct utf8cursor *u8c,
+	utf8data_t	data,
+	const char	*s,
+	size_t		len)
+{
+	if (!data)
+		return -1;
+	if (!s)
+		return -1;
+	u8c->data = data;
+	u8c->s = s;
+	u8c->p = NULL;
+	u8c->ss = NULL;
+	u8c->sp = NULL;
+	u8c->len = len;
+	u8c->slen = 0;
+	u8c->ccc = STOPPER;
+	u8c->nccc = STOPPER;
+	/* Check we didn't clobber the maximum length. */
+	if (u8c->len != len)
+		return -1;
+	/* The first byte of s may not be an utf8 continuation. */
+	if (len > 0 && (*s & 0xC0) == 0x80)
+		return -1;
+	return 0;
+}
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ *   u8c    : pointer to cursor.
+ *   data   : utf8data_t to use for normalization.
+ *   s      : NUL-terminated string.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int
+utf8cursor(
+	struct utf8cursor *u8c,
+	utf8data_t	data,
+	const char	*s)
+{
+	return utf8ncursor(u8c, data, s, (unsigned int)-1);
+}
+
+/*
+ * Get one byte from the normalized form of the string described by u8c.
+ *
+ * Returns the byte cast to an unsigned char on succes, and -1 on failure.
+ *
+ * The cursor keeps track of the location in the string in u8c->s.
+ * When a character is decomposed, the current location is stored in
+ * u8c->p, and u8c->s is set to the start of the decomposition. Note
+ * that bytes from a decomposition do not count against u8c->len.
+ *
+ * Characters are emitted if they match the current CCC in u8c->ccc.
+ * Hitting end-of-string while u8c->ccc == STOPPER means we're done,
+ * and the function returns 0 in that case.
+ *
+ * Sorting by CCC is done by repeatedly scanning the string.  The
+ * values of u8c->s and u8c->p are stored in u8c->ss and u8c->sp at
+ * the start of the scan.  The first pass finds the lowest CCC to be
+ * emitted and stores it in u8c->nccc, the second pass emits the
+ * characters with this CCC and finds the next lowest CCC. This limits
+ * the number of passes to 1 + the number of different CCCs in the
+ * sequence being scanned.
+ *
+ * Therefore:
+ *  u8c->p  != NULL -> a decomposition is being scanned.
+ *  u8c->ss != NULL -> this is a repeating scan.
+ *  u8c->ccc == -1   -> this is the first scan of a repeating scan.
+ */
+int
+utf8byte(struct utf8cursor *u8c)
+{
+	utf8leaf_t *leaf;
+	int ccc;
+
+	for (;;) {
+		/* Check for the end of a decomposed character. */
+		if (u8c->p && *u8c->s == '\0') {
+			u8c->s = u8c->p;
+			u8c->p = NULL;
+		}
+
+		/* Check for end-of-string. */
+		if (!u8c->p && (u8c->len == 0 || *u8c->s == '\0')) {
+			/* There is no next byte. */
+			if (u8c->ccc == STOPPER)
+				return 0;
+			/* End-of-string during a scan counts as a stopper. */
+			ccc = STOPPER;
+			goto ccc_mismatch;
+		} else if ((*u8c->s & 0xC0) == 0x80) {
+			/* This is a continuation of the current character. */
+			if (!u8c->p)
+				u8c->len--;
+			return (unsigned char)*u8c->s++;
+		}
+
+		/* Look up the data for the current character. */
+		if (u8c->p)
+			leaf = utf8lookup(u8c->data, u8c->s);
+		else
+			leaf = utf8nlookup(u8c->data, u8c->s, u8c->len);
+
+		/* No leaf found implies that the input is a binary blob. */
+		if (!leaf)
+			return -1;
+
+		/* Characters that are too new have CCC 0. */
+		if (utf8agetab[LEAF_GEN(leaf)] > u8c->data->maxage) {
+			ccc = STOPPER;
+		} else if ((ccc = LEAF_CCC(leaf)) == DECOMPOSE) {
+			u8c->len -= utf8clen(u8c->s);
+			u8c->p = u8c->s + utf8clen(u8c->s);
+			u8c->s = LEAF_STR(leaf);
+			/* Empty decomposition implies CCC 0. */
+			if (*u8c->s == '\0') {
+				if (u8c->ccc == STOPPER)
+					continue;
+				ccc = STOPPER;
+				goto ccc_mismatch;
+			}
+			leaf = utf8lookup(u8c->data, u8c->s);
+			ccc = LEAF_CCC(leaf);
+		}
+
+		/*
+		 * If this is not a stopper, then see if it updates
+		 * the next canonical class to be emitted.
+		 */
+		if (ccc != STOPPER && u8c->ccc < ccc && ccc < u8c->nccc)
+			u8c->nccc = ccc;
+
+		/*
+		 * Return the current byte if this is the current
+		 * combining class.
+		 */
+		if (ccc == u8c->ccc) {
+			if (!u8c->p)
+				u8c->len--;
+			return (unsigned char)*u8c->s++;
+		}
+
+		/* Current combining class mismatch. */
+	ccc_mismatch:
+		if (u8c->nccc == STOPPER) {
+			/*
+			 * Scan forward for the first canonical class
+			 * to be emitted.  Save the position from
+			 * which to restart.
+			 */
+			u8c->ccc = MINCCC - 1;
+			u8c->nccc = ccc;
+			u8c->sp = u8c->p;
+			u8c->ss = u8c->s;
+			u8c->slen = u8c->len;
+			if (!u8c->p)
+				u8c->len -= utf8clen(u8c->s);
+			u8c->s += utf8clen(u8c->s);
+		} else if (ccc != STOPPER) {
+			/* Not a stopper, and not the ccc we're emitting. */
+			if (!u8c->p)
+				u8c->len -= utf8clen(u8c->s);
+			u8c->s += utf8clen(u8c->s);
+		} else if (u8c->nccc != MAXCCC + 1) {
+			/* At a stopper, restart for next ccc. */
+			u8c->ccc = u8c->nccc;
+			u8c->nccc = MAXCCC + 1;
+			u8c->s = u8c->ss;
+			u8c->p = u8c->sp;
+			u8c->len = u8c->slen;
+		} else {
+			/* All done, proceed from here. */
+			u8c->ccc = STOPPER;
+			u8c->nccc = STOPPER;
+			u8c->sp = NULL;
+			u8c->ss = NULL;
+			u8c->slen = 0;
+		}
+	}
+}
+
+const struct utf8data *
+utf8nfkdi(unsigned int maxage)
+{
+	int i = sizeof(utf8nfkdidata)/sizeof(utf8nfkdidata[0]) - 1;
+
+	while (maxage < utf8nfkdidata[i].maxage)
+		i--;
+	if (maxage > utf8nfkdidata[i].maxage)
+		return NULL;
+	return &utf8nfkdidata[i];
+}
+
+const struct utf8data *
+utf8nfkdicf(unsigned int maxage)
+{
+	int i = sizeof(utf8nfkdicfdata)/sizeof(utf8nfkdicfdata[0]) - 1;
+
+	while (maxage < utf8nfkdicfdata[i].maxage)
+		i--;
+	if (maxage > utf8nfkdicfdata[i].maxage)
+		return NULL;
+	return &utf8nfkdicfdata[i];
+}
diff --git a/support/mkutf8data.c b/support/mkutf8data.c
new file mode 100644
index 0000000..e5c3507
--- /dev/null
+++ b/support/mkutf8data.c
@@ -0,0 +1,3232 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+/* Generator for a compact trie for unicode normalization */
+
+#include <sys/types.h>
+#include <stddef.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <assert.h>
+#include <string.h>
+#include <unistd.h>
+#include <errno.h>
+
+/* Default names of the in- and output files. */
+
+#define AGE_NAME	"DerivedAge.txt"
+#define CCC_NAME	"DerivedCombiningClass.txt"
+#define PROP_NAME	"DerivedCoreProperties.txt"
+#define DATA_NAME	"UnicodeData.txt"
+#define FOLD_NAME	"CaseFolding.txt"
+#define NORM_NAME	"NormalizationCorrections.txt"
+#define TEST_NAME	"NormalizationTest.txt"
+#define UTF8_NAME	"utf8data.h"
+
+const char	*age_name  = AGE_NAME;
+const char	*ccc_name  = CCC_NAME;
+const char	*prop_name = PROP_NAME;
+const char	*data_name = DATA_NAME;
+const char	*fold_name = FOLD_NAME;
+const char	*norm_name = NORM_NAME;
+const char	*test_name = TEST_NAME;
+const char	*utf8_name = UTF8_NAME;
+
+int verbose = 0;
+
+/* An arbitrary line size limit on input lines. */
+
+#define LINESIZE	1024
+char line[LINESIZE];
+char buf0[LINESIZE];
+char buf1[LINESIZE];
+char buf2[LINESIZE];
+char buf3[LINESIZE];
+
+const char *argv0;
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * Unicode version numbers consist of three parts: major, minor, and a
+ * revision.  These numbers are packed into an unsigned int to obtain
+ * a single version number.
+ *
+ * To save space in the generated trie, the unicode version is not
+ * stored directly, instead we calculate a generation number from the
+ * unicode versions seen in the DerivedAge file, and use that as an
+ * index into a table of unicode versions.
+ */
+#define UNICODE_MAJ_SHIFT		(16)
+#define UNICODE_MIN_SHIFT		(8)
+
+#define UNICODE_MAJ_MAX			((unsigned short)-1)
+#define UNICODE_MIN_MAX			((unsigned char)-1)
+#define UNICODE_REV_MAX			((unsigned char)-1)
+
+#define UNICODE_AGE(MAJ,MIN,REV)			\
+	(((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) |	\
+	 ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) |	\
+	 ((unsigned int)(REV)))
+
+unsigned int *ages;
+int ages_count;
+
+unsigned int unicode_maxage;
+
+static int
+age_valid(unsigned int major, unsigned int minor, unsigned int revision)
+{
+	if (major > UNICODE_MAJ_MAX)
+		return 0;
+	if (minor > UNICODE_MIN_MAX)
+		return 0;
+	if (revision > UNICODE_REV_MAX)
+		return 0;
+	return 1;
+}
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * utf8trie_t
+ *
+ * A compact binary tree, used to decode UTF-8 characters.
+ *
+ * Internal nodes are one byte for the node itself, and up to three
+ * bytes for an offset into the tree.  The first byte contains the
+ * following information:
+ *  NEXTBYTE  - flag        - advance to next byte if set
+ *  BITNUM    - 3 bit field - the bit number to tested
+ *  OFFLEN    - 2 bit field - number of bytes in the offset
+ * if offlen == 0 (non-branching node)
+ *  RIGHTPATH - 1 bit field - set if the following node is for the
+ *                            right-hand path (tested bit is set)
+ *  TRIENODE  - 1 bit field - set if the following node is an internal
+ *                            node, otherwise it is a leaf node
+ * if offlen != 0 (branching node)
+ *  LEFTNODE  - 1 bit field - set if the left-hand node is internal
+ *  RIGHTNODE - 1 bit field - set if the right-hand node is internal
+ *
+ * Due to the way utf8 works, there cannot be branching nodes with
+ * NEXTBYTE set, and moreover those nodes always have a righthand
+ * descendant.
+ */
+typedef unsigned char utf8trie_t;
+#define BITNUM		0x07
+#define NEXTBYTE	0x08
+#define OFFLEN		0x30
+#define OFFLEN_SHIFT	4
+#define RIGHTPATH	0x40
+#define TRIENODE	0x80
+#define RIGHTNODE	0x40
+#define LEFTNODE	0x80
+
+/*
+ * utf8leaf_t
+ *
+ * The leaves of the trie are embedded in the trie, and so the same
+ * underlying datatype, unsigned char.
+ *
+ * leaf[0]: The unicode version, stored as a generation number that is
+ *          an index into utf8agetab[].  With this we can filter code
+ *          points based on the unicode version in which they were
+ *          defined.  The CCC of a non-defined code point is 0.
+ * leaf[1]: Canonical Combining Class. During normalization, we need
+ *          to do a stable sort into ascending order of all characters
+ *          with a non-zero CCC that occur between two characters with
+ *          a CCC of 0, or at the begin or end of a string.
+ *          The unicode standard guarantees that all CCC values are
+ *          between 0 and 254 inclusive, which leaves 255 available as
+ *          a special value.
+ *          Code points with CCC 0 are known as stoppers.
+ * leaf[2]: Decomposition. If leaf[1] == 255, then leaf[2] is the
+ *          start of a NUL-terminated string that is the decomposition
+ *          of the character.
+ *          The CCC of a decomposable character is the same as the CCC
+ *          of the first character of its decomposition.
+ *          Some characters decompose as the empty string: these are
+ *          characters with the Default_Ignorable_Code_Point property.
+ *          These do affect normalization, as they all have CCC 0.
+ *
+ * The decompositions in the trie have been fully expanded.
+ *
+ * Casefolding, if applicable, is also done using decompositions.
+ */
+typedef unsigned char utf8leaf_t;
+
+#define LEAF_GEN(LEAF)	((LEAF)[0])
+#define LEAF_CCC(LEAF)	((LEAF)[1])
+#define LEAF_STR(LEAF)	((const char*)((LEAF) + 2))
+
+#define MAXGEN		(255)
+
+#define MINCCC		(0)
+#define MAXCCC		(254)
+#define STOPPER		(0)
+#define	DECOMPOSE	(255)
+
+struct tree;
+static utf8leaf_t *utf8nlookup(struct tree *, const char *, size_t);
+static utf8leaf_t *utf8lookup(struct tree *, const char *);
+
+unsigned char *utf8data;
+size_t utf8data_size;
+
+utf8trie_t *nfkdi;
+utf8trie_t *nfkdicf;
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * UTF8 valid ranges.
+ *
+ * The UTF-8 encoding spreads the bits of a 32bit word over several
+ * bytes. This table gives the ranges that can be held and how they'd
+ * be represented.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000000 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000000 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * There is an additional requirement on UTF-8, in that only the
+ * shortest representation of a 32bit value is to be used.  A decoder
+ * must not decode sequences that do not satisfy this requirement.
+ * Thus the allowed ranges have a lower bound.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000080 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000800 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00010000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00200000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x04000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * Actual unicode characters are limited to the range 0x0 - 0x10FFFF,
+ * 17 planes of 65536 values.  This limits the sequences actually seen
+ * even more, to just the following.
+ *
+ *          0 -     0x7f: 0                     0x7f
+ *       0x80 -    0x7ff: 0xc2 0x80             0xdf 0xbf
+ *      0x800 -   0xffff: 0xe0 0xa0 0x80        0xef 0xbf 0xbf
+ *    0x10000 - 0x10ffff: 0xf0 0x90 0x80 0x80   0xf4 0x8f 0xbf 0xbf
+ *
+ * Even within those ranges not all values are allowed: the surrogates
+ * 0xd800 - 0xdfff should never be seen.
+ *
+ * Note that the longest sequence seen with valid usage is 4 bytes,
+ * the same a single UTF-32 character.  This makes the UTF-8
+ * representation of Unicode strictly smaller than UTF-32.
+ *
+ * The shortest sequence requirement was introduced by:
+ *    Corrigendum #1: UTF-8 Shortest Form
+ * It can be found here:
+ *    http://www.unicode.org/versions/corrigendum1.html
+ *
+ */
+
+#define UTF8_2_BITS     0xC0
+#define UTF8_3_BITS     0xE0
+#define UTF8_4_BITS     0xF0
+#define UTF8_N_BITS     0x80
+#define UTF8_2_MASK     0xE0
+#define UTF8_3_MASK     0xF0
+#define UTF8_4_MASK     0xF8
+#define UTF8_N_MASK     0xC0
+#define UTF8_V_MASK     0x3F
+#define UTF8_V_SHIFT    6
+
+static int
+utf8key(unsigned int key, char keyval[])
+{
+	int keylen;
+
+	if (key < 0x80) {
+		keyval[0] = key;
+		keylen = 1;
+	} else if (key < 0x800) {
+		keyval[1] = key & UTF8_V_MASK;
+		keyval[1] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[0] = key;
+		keyval[0] |= UTF8_2_BITS;
+		keylen = 2;
+	} else if (key < 0x10000) {
+		keyval[2] = key & UTF8_V_MASK;
+		keyval[2] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[1] = key & UTF8_V_MASK;
+		keyval[1] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[0] = key;
+		keyval[0] |= UTF8_3_BITS;
+		keylen = 3;
+	} else if (key < 0x110000) {
+		keyval[3] = key & UTF8_V_MASK;
+		keyval[3] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[2] = key & UTF8_V_MASK;
+		keyval[2] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[1] = key & UTF8_V_MASK;
+		keyval[1] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[0] = key;
+		keyval[0] |= UTF8_4_BITS;
+		keylen = 4;
+	} else {
+		printf("%#x: illegal key\n", key);
+		keylen = 0;
+	}
+	return keylen;
+}
+
+static unsigned int
+utf8code(const char *str)
+{
+	const unsigned char *s = (const unsigned char*)str;
+	unsigned int unichar = 0;
+
+	if (*s < 0x80) {
+		unichar = *s;
+	} else if (*s < UTF8_3_BITS) {
+		unichar = *s++ & 0x1F;
+		unichar <<= UTF8_V_SHIFT;
+		unichar |= *s & 0x3F;
+	} else if (*s < UTF8_4_BITS) {
+		unichar = *s++ & 0x0F;
+		unichar <<= UTF8_V_SHIFT;
+		unichar |= *s++ & 0x3F;
+		unichar <<= UTF8_V_SHIFT;
+		unichar |= *s & 0x3F;
+	} else {
+		unichar = *s++ & 0x0F;
+		unichar <<= UTF8_V_SHIFT;
+		unichar |= *s++ & 0x3F;
+		unichar <<= UTF8_V_SHIFT;
+		unichar |= *s++ & 0x3F;
+		unichar <<= UTF8_V_SHIFT;
+		unichar |= *s & 0x3F;
+	}
+	return unichar;
+}
+
+static int
+utf32valid(unsigned int unichar)
+{
+	return unichar < 0x110000;
+}
+
+#define NODE 1
+#define LEAF 0
+
+struct tree {
+	void *root;
+	int childnode;
+	const char *type;
+	unsigned int maxage;
+	struct tree *next;
+	int (*leaf_equal)(void *, void *);
+	void (*leaf_print)(void *, int);
+	int (*leaf_mark)(void *);
+	int (*leaf_size)(void *);
+	int *(*leaf_index)(struct tree *, void *);
+	unsigned char *(*leaf_emit)(void *, unsigned char *);
+	int leafindex[0x110000];
+	int index;
+};
+
+struct node {
+	int index;
+	int offset;
+	int mark;
+	int size;
+	struct node *parent;
+	void *left;
+	void *right;
+	unsigned char bitnum;
+	unsigned char nextbyte;
+	unsigned char leftnode;
+	unsigned char rightnode;
+	unsigned int keybits;
+	unsigned int keymask;
+};
+
+/*
+ * Example lookup function for a tree.
+ */
+static void *
+lookup(struct tree *tree, const char *key)
+{
+	struct node *node;
+	void *leaf = NULL;
+
+	node = tree->root;
+	while (!leaf && node) {
+		if (node->nextbyte)
+			key++;
+		if (*key & (1 << (node->bitnum & 7))) {
+			/* Right leg */
+			if (node->rightnode == NODE) {
+				node = node->right;
+			} else if (node->rightnode == LEAF) {
+				leaf = node->right;
+			} else {
+				node = NULL;
+			}
+		} else {
+			/* Left leg */
+			if (node->leftnode == NODE) {
+				node = node->left;
+			} else if (node->leftnode == LEAF) {
+				leaf = node->left;
+			} else {
+				node = NULL;
+			}
+		}
+	}
+
+	return leaf;
+}
+
+/*
+ * A simple non-recursive tree walker: keep track of visits to the
+ * left and right branches in the leftmask and rightmask.
+ */
+static void
+tree_walk(struct tree *tree)
+{
+	struct node *node;
+	unsigned int leftmask;
+	unsigned int rightmask;
+	unsigned int bitmask;
+	int indent = 1;
+	int nodes, singletons, leaves;
+
+	nodes = singletons = leaves = 0;
+
+	printf("%s_%x root %p\n", tree->type, tree->maxage, tree->root);
+	if (tree->childnode == LEAF) {
+		assert(tree->root);
+		tree->leaf_print(tree->root, indent);
+		leaves = 1;
+	} else {
+		assert(tree->childnode == NODE);
+		node = tree->root;
+		leftmask = rightmask = 0;
+		while (node) {
+			printf("%*snode @ %p bitnum %d nextbyte %d"
+			       " left %p right %p mask %x bits %x\n",
+				indent, "", node,
+				node->bitnum, node->nextbyte,
+				node->left, node->right,
+				node->keymask, node->keybits);
+			nodes += 1;
+			if (!(node->left && node->right))
+				singletons += 1;
+
+			while (node) {
+				bitmask = 1 << node->bitnum;
+				if ((leftmask & bitmask) == 0) {
+					leftmask |= bitmask;
+					if (node->leftnode == LEAF) {
+						assert(node->left);
+						tree->leaf_print(node->left,
+								 indent+1);
+						leaves += 1;
+					} else if (node->left) {
+						assert(node->leftnode == NODE);
+						indent += 1;
+						node = node->left;
+						break;
+					}
+				}
+				if ((rightmask & bitmask) == 0) {
+					rightmask |= bitmask;
+					if (node->rightnode == LEAF) {
+						assert(node->right);
+						tree->leaf_print(node->right,
+								 indent+1);
+						leaves += 1;
+					} else if (node->right) {
+						assert(node->rightnode==NODE);
+						indent += 1;
+						node = node->right;
+						break;
+					}
+				}
+				leftmask &= ~bitmask;
+				rightmask &= ~bitmask;
+				node = node->parent;
+				indent -= 1;
+			}
+		}
+	}
+	printf("nodes %d leaves %d singletons %d\n",
+	       nodes, leaves, singletons);
+}
+
+/*
+ * Allocate an initialize a new internal node.
+ */
+static struct node *
+alloc_node(struct node *parent)
+{
+	struct node *node;
+	int bitnum;
+
+	node = malloc(sizeof(*node));
+	node->left = node->right = NULL;
+	node->parent = parent;
+	node->leftnode = NODE;
+	node->rightnode = NODE;
+	node->keybits = 0;
+	node->keymask = 0;
+	node->mark = 0;
+	node->index = 0;
+	node->offset = -1;
+	node->size = 4;
+
+	if (node->parent) {
+		bitnum = parent->bitnum;
+		if ((bitnum & 7) == 0) {
+			node->bitnum = bitnum + 7 + 8;
+			node->nextbyte = 1;
+		} else {
+			node->bitnum = bitnum - 1;
+			node->nextbyte = 0;
+		}
+	} else {
+		node->bitnum = 7;
+		node->nextbyte = 0;
+	}
+
+        return node;
+}
+
+/*
+ * Insert a new leaf into the tree, and collapse any subtrees that are
+ * fully populated and end in identical leaves. A nextbyte tagged
+ * internal node will not be removed to preserve the tree's integrity.
+ * Note that due to the structure of utf8, no nextbyte tagged node
+ * will be a candidate for removal.
+ */
+static int
+insert(struct tree *tree, char *key, int keylen, void *leaf)
+{
+	struct node *node;
+	struct node *parent;
+	void **cursor;
+	int keybits;
+
+	assert(keylen >= 1 && keylen <= 4);
+
+	node = NULL;
+	cursor = &tree->root;
+	keybits = 8 * keylen;
+
+	/* Insert, creating path along the way. */
+	while (keybits) {
+		if (!*cursor)
+			*cursor = alloc_node(node);
+		node = *cursor;
+		if (node->nextbyte)
+			key++;
+		if (*key & (1 << (node->bitnum & 7)))
+			cursor = &node->right;
+		else
+			cursor = &node->left;
+		keybits--;
+	}
+	*cursor = leaf;
+
+	/* Merge subtrees if possible. */
+	while (node) {
+		if (*key & (1 << (node->bitnum & 7)))
+			node->rightnode = LEAF;
+		else
+			node->leftnode = LEAF;
+		if (node->nextbyte)
+			break;
+		if (node->leftnode == NODE || node->rightnode == NODE)
+			break;
+		assert(node->left);
+		assert(node->right);
+		/* Compare */
+		if (! tree->leaf_equal(node->left, node->right))
+			break;
+		/* Keep left, drop right leaf. */
+		leaf = node->left;
+		/* Check in parent */
+		parent = node->parent;
+		if (!parent) {
+			/* root of tree! */
+			tree->root = leaf;
+			tree->childnode = LEAF;
+		} else if (parent->left == node) {
+			parent->left = leaf;
+			parent->leftnode = LEAF;
+			if (parent->right) {
+				parent->keymask = 0;
+				parent->keybits = 0;
+			} else {
+				parent->keymask |= (1 << node->bitnum);
+			}
+		} else if (parent->right == node) {
+			parent->right = leaf;
+			parent->rightnode = LEAF;
+			if (parent->left) {
+				parent->keymask = 0;
+				parent->keybits = 0;
+			} else {
+				parent->keymask |= (1 << node->bitnum);
+				parent->keybits |= (1 << node->bitnum);
+			}
+		} else {
+			/* internal tree error */
+			assert(0);
+		}
+		free(node);
+		node = parent;
+	}
+
+	/* Propagate keymasks up along singleton chains. */
+	while (node) {
+		parent = node->parent;
+		if (!parent)
+			break;
+		/* Nix the mask for parents with two children. */
+		if (node->keymask == 0) {
+			parent->keymask = 0;
+			parent->keybits = 0;
+		} else if (parent->left && parent->right) {
+			parent->keymask = 0;
+			parent->keybits = 0;
+		} else {
+			assert((parent->keymask & node->keymask) == 0);
+			parent->keymask |= node->keymask;
+			parent->keymask |= (1 << parent->bitnum);
+			parent->keybits |= node->keybits;
+			if (parent->right)
+				parent->keybits |= (1 << parent->bitnum);
+		}
+		node = parent;
+	}
+
+	return 0;
+}
+
+/*
+ * Prune internal nodes.
+ *
+ * Fully populated subtrees that end at the same leaf have already
+ * been collapsed.  There are still internal nodes that have for both
+ * their left and right branches a sequence of singletons that make
+ * identical choices and end in identical leaves.  The keymask and
+ * keybits collected in the nodes describe the choices made in these
+ * singleton chains.  When they are identical for the left and right
+ * branch of a node, and the two leaves comare identical, the node in
+ * question can be removed.
+ *
+ * Note that nodes with the nextbyte tag set will not be removed by
+ * this to ensure tree integrity.  Note as well that the structure of
+ * utf8 ensures that these nodes would not have been candidates for
+ * removal in any case.
+ */
+static void
+prune(struct tree *tree)
+{
+	struct node *node;
+	struct node *left;
+	struct node *right;
+	struct node *parent;
+	void *leftleaf;
+	void *rightleaf;
+	unsigned int leftmask;
+	unsigned int rightmask;
+	unsigned int bitmask;
+	int count;
+
+	if (verbose > 0)
+		printf("Pruning %s_%x\n", tree->type, tree->maxage);
+
+	count = 0;
+	if (tree->childnode == LEAF)
+		return;
+	if (!tree->root)
+		return;
+
+	leftmask = rightmask = 0;
+	node = tree->root;
+	while (node) {
+		if (node->nextbyte)
+			goto advance;
+		if (node->leftnode == LEAF)
+			goto advance;
+		if (node->rightnode == LEAF)
+			goto advance;
+		if (!node->left)
+			goto advance;
+		if (!node->right)
+			goto advance;
+		left = node->left;
+		right = node->right;
+		if (left->keymask == 0)
+			goto advance;
+		if (right->keymask == 0)
+			goto advance;
+		if (left->keymask != right->keymask)
+			goto advance;
+		if (left->keybits != right->keybits)
+			goto advance;
+		leftleaf = NULL;
+		while (!leftleaf) {
+			assert(left->left || left->right);
+			if (left->leftnode == LEAF)
+				leftleaf = left->left;
+			else if (left->rightnode == LEAF)
+				leftleaf = left->right;
+			else if (left->left)
+				left = left->left;
+			else if (left->right)
+				left = left->right;
+			else
+				assert(0);
+		}
+		rightleaf = NULL;
+		while (!rightleaf) {
+			assert(right->left || right->right);
+			if (right->leftnode == LEAF)
+				rightleaf = right->left;
+			else if (right->rightnode == LEAF)
+				rightleaf = right->right;
+			else if (right->left)
+				right = right->left;
+			else if (right->right)
+				right = right->right;
+			else
+				assert(0);
+		}
+		if (! tree->leaf_equal(leftleaf, rightleaf))
+			goto advance;
+		/*
+		 * This node has identical singleton-only subtrees.
+		 * Remove it.
+		 */
+		parent = node->parent;
+		left = node->left;
+		right = node->right;
+		if (parent->left == node)
+			parent->left = left;
+		else if (parent->right == node)
+			parent->right = left;
+		else
+			assert(0);
+		left->parent = parent;
+		left->keymask |= (1 << node->bitnum);
+		node->left = NULL;
+		while (node) {
+			bitmask = 1 << node->bitnum;
+			leftmask &= ~bitmask;
+			rightmask &= ~bitmask;
+			if (node->leftnode == NODE && node->left) {
+				left = node->left;
+				free(node);
+				count++;
+				node = left;
+			} else if (node->rightnode == NODE && node->right) {
+				right = node->right;
+				free(node);
+				count++;
+				node = right;
+			} else {
+				node = NULL;
+			}
+		}
+		/* Propagate keymasks up along singleton chains. */
+		node = parent;
+		/* Force re-check */
+		bitmask = 1 << node->bitnum;
+		leftmask &= ~bitmask;
+		rightmask &= ~bitmask;
+		for (;;) {
+			if (node->left && node->right)
+				break;
+			if (node->left) {
+				left = node->left;
+				node->keymask |= left->keymask;
+				node->keybits |= left->keybits;
+			}
+			if (node->right) {
+				right = node->right;
+				node->keymask |= right->keymask;
+				node->keybits |= right->keybits;
+			}
+			node->keymask |= (1 << node->bitnum);
+			node = node->parent;
+			/* Force re-check */
+			bitmask = 1 << node->bitnum;
+			leftmask &= ~bitmask;
+			rightmask &= ~bitmask;
+		}
+	advance:
+		bitmask = 1 << node->bitnum;
+		if ((leftmask & bitmask) == 0 &&
+		    node->leftnode == NODE &&
+		    node->left) {
+			leftmask |= bitmask;
+			node = node->left;
+		} else if ((rightmask & bitmask) == 0 &&
+			   node->rightnode == NODE &&
+			   node->right) {
+			rightmask |= bitmask;
+			node = node->right;
+		} else {
+			leftmask &= ~bitmask;
+			rightmask &= ~bitmask;
+			node = node->parent;
+		}
+	}
+	if (verbose > 0)
+		printf("Pruned %d nodes\n", count);
+}
+
+/*
+ * Mark the nodes in the tree that lead to leaves that must be
+ * emitted.
+ */
+static void
+mark_nodes(struct tree *tree)
+{
+	struct node *node;
+	struct node *n;
+	unsigned int leftmask;
+	unsigned int rightmask;
+	unsigned int bitmask;
+	int marked;
+
+	marked = 0;
+	if (verbose > 0)
+		printf("Marking %s_%x\n", tree->type, tree->maxage);
+	if (tree->childnode == LEAF)
+		goto done;
+
+	assert(tree->childnode == NODE);
+	node = tree->root;
+	leftmask = rightmask = 0;
+	while (node) {
+		bitmask = 1 << node->bitnum;
+		if ((leftmask & bitmask) == 0) {
+			leftmask |= bitmask;
+			if (node->leftnode == LEAF) {
+				assert(node->left);
+				if (tree->leaf_mark(node->left)) {
+					n = node;
+					while (n && !n->mark) {
+						marked++;
+						n->mark = 1;
+						n = n->parent;
+					}
+				}
+			} else if (node->left) {
+				assert(node->leftnode == NODE);
+				node = node->left;
+				continue;
+			}
+		}
+		if ((rightmask & bitmask) == 0) {
+			rightmask |= bitmask;
+			if (node->rightnode == LEAF) {
+				assert(node->right);
+				if (tree->leaf_mark(node->right)) {
+					n = node;
+					while (n && !n->mark) {
+						marked++;
+						n->mark = 1;
+						n = n->parent;
+					}
+				}
+			} else if (node->right) {
+				assert(node->rightnode==NODE);
+				node = node->right;
+				continue;
+			}
+		}
+		leftmask &= ~bitmask;
+		rightmask &= ~bitmask;
+		node = node->parent;
+	}
+
+	/* second pass: left siblings and singletons */
+
+	assert(tree->childnode == NODE);
+	node = tree->root;
+	leftmask = rightmask = 0;
+	while (node) {
+		bitmask = 1 << node->bitnum;
+		if ((leftmask & bitmask) == 0) {
+			leftmask |= bitmask;
+			if (node->leftnode == LEAF) {
+				assert(node->left);
+				if (tree->leaf_mark(node->left)) {
+					n = node;
+					while (n && !n->mark) {
+						marked++;
+						n->mark = 1;
+						n = n->parent;
+					}
+				}
+			} else if (node->left) {
+				assert(node->leftnode == NODE);
+				node = node->left;
+				if (!node->mark && node->parent->mark) {
+					marked++;
+					node->mark = 1;
+				}
+				continue;
+			}
+		}
+		if ((rightmask & bitmask) == 0) {
+			rightmask |= bitmask;
+			if (node->rightnode == LEAF) {
+				assert(node->right);
+				if (tree->leaf_mark(node->right)) {
+					n = node;
+					while (n && !n->mark) {
+						marked++;
+						n->mark = 1;
+						n = n->parent;
+					}
+				}
+			} else if (node->right) {
+				assert(node->rightnode==NODE);
+				node = node->right;
+				if (!node->mark && node->parent->mark &&
+				    !node->parent->left) {
+					marked++;
+					node->mark = 1;
+				}
+				continue;
+			}
+		}
+		leftmask &= ~bitmask;
+		rightmask &= ~bitmask;
+		node = node->parent;
+	}
+done:
+	if (verbose > 0)
+		printf("Marked %d nodes\n", marked);
+}
+
+/*
+ * Compute the index of each node and leaf, which is the offset in the
+ * emitted trie.  These value must be pre-computed because relative
+ * offsets between nodes are used to navigate the tree.
+ */
+static int
+index_nodes(struct tree *tree, int index)
+{
+	struct node *node;
+	unsigned int leftmask;
+	unsigned int rightmask;
+	unsigned int bitmask;
+	int count;
+	int indent;
+
+	/* Align to a cache line (or half a cache line?). */
+	while (index % 64)
+		index++;
+	tree->index = index;
+	indent = 1;
+	count = 0;
+
+	if (verbose > 0)
+		printf("Indexing %s_%x: %d", tree->type, tree->maxage, index);
+	if (tree->childnode == LEAF) {
+		index += tree->leaf_size(tree->root);
+		goto done;
+	}
+
+	assert(tree->childnode == NODE);
+	node = tree->root;
+	leftmask = rightmask = 0;
+	while (node) {
+		if (!node->mark)
+			goto skip;
+		count++;
+		if (node->index != index)
+			node->index = index;
+		index += node->size;
+skip:
+		while (node) {
+			bitmask = 1 << node->bitnum;
+			if (node->mark && (leftmask & bitmask) == 0) {
+				leftmask |= bitmask;
+				if (node->leftnode == LEAF) {
+					assert(node->left);
+					*tree->leaf_index(tree, node->left) =
+									index;
+					index += tree->leaf_size(node->left);
+					count++;
+				} else if (node->left) {
+					assert(node->leftnode == NODE);
+					indent += 1;
+					node = node->left;
+					break;
+				}
+			}
+			if (node->mark && (rightmask & bitmask) == 0) {
+				rightmask |= bitmask;
+				if (node->rightnode == LEAF) {
+					assert(node->right);
+					*tree->leaf_index(tree, node->right) = index;
+					index += tree->leaf_size(node->right);
+					count++;
+				} else if (node->right) {
+					assert(node->rightnode==NODE);
+					indent += 1;
+					node = node->right;
+					break;
+				}
+			}
+			leftmask &= ~bitmask;
+			rightmask &= ~bitmask;
+			node = node->parent;
+			indent -= 1;
+		}
+	}
+done:
+	/* Round up to a multiple of 16 */
+	while (index % 16)
+		index++;
+	if (verbose > 0)
+		printf("Final index %d\n", index);
+	return index;
+}
+
+/*
+ * Compute the size of nodes and leaves. We start by assuming that
+ * each node needs to store a three-byte offset. The indexes of the
+ * nodes are calculated based on that, and then this function is
+ * called to see if the sizes of some nodes can be reduced.  This is
+ * repeated until no more changes are seen.
+ */
+static int
+size_nodes(struct tree *tree)
+{
+	struct tree *next;
+	struct node *node;
+	struct node *right;
+	struct node *n;
+	unsigned int leftmask;
+	unsigned int rightmask;
+	unsigned int bitmask;
+	unsigned int pathbits;
+	unsigned int pathmask;
+	int changed;
+	int offset;
+	int size;
+	int indent;
+
+	indent = 1;
+	changed = 0;
+	size = 0;
+
+	if (verbose > 0)
+		printf("Sizing %s_%x", tree->type, tree->maxage);
+	if (tree->childnode == LEAF)
+		goto done;
+
+	assert(tree->childnode == NODE);
+	pathbits = 0;
+	pathmask = 0;
+	node = tree->root;
+	leftmask = rightmask = 0;
+	while (node) {
+		if (!node->mark)
+			goto skip;
+		offset = 0;
+		if (!node->left || !node->right) {
+			size = 1;
+		} else {
+			if (node->rightnode == NODE) {
+				right = node->right;
+				next = tree->next;
+				while (!right->mark) {
+					assert(next);
+					n = next->root;
+					while (n->bitnum != node->bitnum) {
+						if (pathbits & (1<<n->bitnum))
+							n = n->right;
+						else
+							n = n->left;
+					}
+					n = n->right;
+					assert(right->bitnum == n->bitnum);
+					right = n;
+					next = next->next;
+				}
+				offset = right->index - node->index;
+			} else {
+				offset = *tree->leaf_index(tree, node->right);
+				offset -= node->index;
+			}
+			assert(offset >= 0);
+			assert(offset <= 0xffffff);
+			if (offset <= 0xff) {
+				size = 2;
+			} else if (offset <= 0xffff) {
+				size = 3;
+			} else { /* offset <= 0xffffff */
+				size = 4;
+			}
+		}
+		if (node->size != size || node->offset != offset) {
+			node->size = size;
+			node->offset = offset;
+			changed++;
+		}
+skip:
+		while (node) {
+			bitmask = 1 << node->bitnum;
+			pathmask |= bitmask;
+			if (node->mark && (leftmask & bitmask) == 0) {
+				leftmask |= bitmask;
+				if (node->leftnode == LEAF) {
+					assert(node->left);
+				} else if (node->left) {
+					assert(node->leftnode == NODE);
+					indent += 1;
+					node = node->left;
+					break;
+				}
+			}
+			if (node->mark && (rightmask & bitmask) == 0) {
+				rightmask |= bitmask;
+				pathbits |= bitmask;
+				if (node->rightnode == LEAF) {
+					assert(node->right);
+				} else if (node->right) {
+					assert(node->rightnode==NODE);
+					indent += 1;
+					node = node->right;
+					break;
+				}
+			}
+			leftmask &= ~bitmask;
+			rightmask &= ~bitmask;
+			pathmask &= ~bitmask;
+			pathbits &= ~bitmask;
+			node = node->parent;
+			indent -= 1;
+		}
+	}
+done:
+	if (verbose > 0)
+		printf("Found %d changes\n", changed);
+	return changed;
+}
+
+/*
+ * Emit a trie for the given tree into the data array.
+ */
+static void
+emit(struct tree *tree, unsigned char *data)
+{
+	struct node *node;
+	unsigned int leftmask;
+	unsigned int rightmask;
+	unsigned int bitmask;
+	int offlen;
+	int offset;
+	int index;
+	int indent;
+	unsigned char byte;
+
+	index = tree->index;
+	data += index;
+	indent = 1;
+	if (verbose > 0)
+		printf("Emitting %s_%x\n", tree->type, tree->maxage);
+	if (tree->childnode == LEAF) {
+		assert(tree->root);
+		tree->leaf_emit(tree->root, data);
+		return;
+	}
+
+	assert(tree->childnode == NODE);
+	node = tree->root;
+	leftmask = rightmask = 0;
+	while (node) {
+		if (!node->mark)
+			goto skip;
+		assert(node->offset != -1);
+		assert(node->index == index);
+
+		byte = 0;
+		if (node->nextbyte)
+			byte |= NEXTBYTE;
+		byte |= (node->bitnum & BITNUM);
+		if (node->left && node->right) {
+			if (node->leftnode == NODE)
+				byte |= LEFTNODE;
+			if (node->rightnode == NODE)
+				byte |= RIGHTNODE;
+			if (node->offset <= 0xff)
+				offlen = 1;
+			else if (node->offset <= 0xffff)
+				offlen = 2;
+			else
+				offlen = 3;
+			offset = node->offset;
+			byte |= offlen << OFFLEN_SHIFT;
+			*data++ = byte;
+			index++;
+			while (offlen--) {
+				*data++ = offset & 0xff;
+				index++;
+				offset >>= 8;
+			}
+		} else if (node->left) {
+			if (node->leftnode == NODE)
+				byte |= TRIENODE;
+			*data++ = byte;
+			index++;
+		} else if (node->right) {
+			byte |= RIGHTNODE;
+			if (node->rightnode == NODE)
+				byte |= TRIENODE;
+			*data++ = byte;
+			index++;
+		} else {
+			assert(0);
+		}
+skip:
+		while (node) {
+			bitmask = 1 << node->bitnum;
+			if (node->mark && (leftmask & bitmask) == 0) {
+				leftmask |= bitmask;
+				if (node->leftnode == LEAF) {
+					assert(node->left);
+					data = tree->leaf_emit(node->left,
+							       data);
+					index += tree->leaf_size(node->left);
+				} else if (node->left) {
+					assert(node->leftnode == NODE);
+					indent += 1;
+					node = node->left;
+					break;
+				}
+			}
+			if (node->mark && (rightmask & bitmask) == 0) {
+				rightmask |= bitmask;
+				if (node->rightnode == LEAF) {
+					assert(node->right);
+					data = tree->leaf_emit(node->right,
+							       data);
+					index += tree->leaf_size(node->right);
+				} else if (node->right) {
+					assert(node->rightnode==NODE);
+					indent += 1;
+					node = node->right;
+					break;
+				}
+			}
+			leftmask &= ~bitmask;
+			rightmask &= ~bitmask;
+			node = node->parent;
+			indent -= 1;
+		}
+	}
+}
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * Unicode data.
+ *
+ * We need to keep track of the Canonical Combining Class, the Age,
+ * and decompositions for a code point.
+ *
+ * For the Age, we store the index into the ages table.  Effectively
+ * this is a generation number that the table maps to a unicode
+ * version.
+ *
+ * The correction field is used to indicate that this entry is in the
+ * corrections array, which contains decompositions that were
+ * corrected in later revisions.  The value of the correction field is
+ * the Unicode version in which the mapping was corrected.
+ */
+struct unicode_data {
+	unsigned int code;
+	int ccc;
+	int gen;
+	int correction;
+	unsigned int *utf32nfkdi;
+	unsigned int *utf32nfkdicf;
+	char *utf8nfkdi;
+	char *utf8nfkdicf;
+};
+
+struct unicode_data unicode_data[0x110000];
+struct unicode_data *corrections;
+int    corrections_count;
+
+struct tree *nfkdi_tree;
+struct tree *nfkdicf_tree;
+
+struct tree *trees;
+int          trees_count;
+
+/*
+ * Check the corrections array to see if this entry was corrected at
+ * some point.
+ */
+static struct unicode_data *
+corrections_lookup(struct unicode_data *u)
+{
+	int i;
+
+	for (i = 0; i != corrections_count; i++)
+		if (u->code == corrections[i].code)
+			return &corrections[i];
+	return u;
+}
+
+static int
+nfkdi_equal(void *l, void *r)
+{
+	struct unicode_data *left = l;
+	struct unicode_data *right = r;
+
+	if (left->gen != right->gen)
+		return 0;
+	if (left->ccc != right->ccc)
+		return 0;
+	if (left->utf8nfkdi && right->utf8nfkdi &&
+	    strcmp(left->utf8nfkdi, right->utf8nfkdi) == 0)
+		return 1;
+	if (left->utf8nfkdi || right->utf8nfkdi)
+		return 0;
+	return 1;
+}
+
+static int
+nfkdicf_equal(void *l, void *r)
+{
+	struct unicode_data *left = l;
+	struct unicode_data *right = r;
+
+	if (left->gen != right->gen)
+		return 0;
+	if (left->ccc != right->ccc)
+		return 0;
+	if (left->utf8nfkdicf && right->utf8nfkdicf &&
+	    strcmp(left->utf8nfkdicf, right->utf8nfkdicf) == 0)
+		return 1;
+	if (left->utf8nfkdicf && right->utf8nfkdicf)
+		return 0;
+	if (left->utf8nfkdicf || right->utf8nfkdicf)
+		return 0;
+	if (left->utf8nfkdi && right->utf8nfkdi &&
+	    strcmp(left->utf8nfkdi, right->utf8nfkdi) == 0)
+		return 1;
+	if (left->utf8nfkdi || right->utf8nfkdi)
+		return 0;
+	return 1;
+}
+
+static void
+nfkdi_print(void *l, int indent)
+{
+	struct unicode_data *leaf = l;
+
+	printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf,
+               leaf->code, leaf->ccc, leaf->gen);
+	if (leaf->utf8nfkdi)
+		printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi);
+	printf("\n");
+}
+
+static void
+nfkdicf_print(void *l, int indent)
+{
+	struct unicode_data *leaf = l;
+
+	printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf,
+               leaf->code, leaf->ccc, leaf->gen);
+	if (leaf->utf8nfkdicf)
+		printf(" nfkdicf \"%s\"", (const char*)leaf->utf8nfkdicf);
+	else if (leaf->utf8nfkdi)
+		printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi);
+	printf("\n");
+}
+
+static int
+nfkdi_mark(void *l)
+{
+	return 1;
+}
+
+static int
+nfkdicf_mark(void *l)
+{
+	struct unicode_data *leaf = l;
+	if (leaf->utf8nfkdicf)
+		return 1;
+	return 0;
+}
+
+static int
+correction_mark(void *l)
+{
+	struct unicode_data *leaf = l;
+	return leaf->correction;
+}
+
+static int
+nfkdi_size(void *l)
+{
+	struct unicode_data *leaf = l;
+	int size = 2;
+	if (leaf->utf8nfkdi)
+		size += strlen(leaf->utf8nfkdi) + 1;
+	return size;
+}
+
+static int
+nfkdicf_size(void *l)
+{
+	struct unicode_data *leaf = l;
+	int size = 2;
+	if (leaf->utf8nfkdicf)
+		size += strlen(leaf->utf8nfkdicf) + 1;
+	else if (leaf->utf8nfkdi)
+		size += strlen(leaf->utf8nfkdi) + 1;
+	return size;
+}
+
+static int *
+nfkdi_index(struct tree *tree, void *l)
+{
+	struct unicode_data *leaf = l;
+	return &tree->leafindex[leaf->code];
+}
+
+static int *
+nfkdicf_index(struct tree *tree, void *l)
+{
+	struct unicode_data *leaf = l;
+	return &tree->leafindex[leaf->code];
+}
+
+static unsigned char *
+nfkdi_emit(void *l, unsigned char *data)
+{
+	struct unicode_data *leaf = l;
+	unsigned char *s;
+
+	*data++ = leaf->gen;
+	if (leaf->utf8nfkdi) {
+		*data++ = DECOMPOSE;
+		s = (unsigned char*)leaf->utf8nfkdi;
+		while ((*data++ = *s++) != 0)
+			;
+	} else {
+		*data++ = leaf->ccc;
+	}
+	return data;
+}
+
+static unsigned char *
+nfkdicf_emit(void *l, unsigned char *data)
+{
+	struct unicode_data *leaf = l;
+	unsigned char *s;
+
+	*data++ = leaf->gen;
+	if (leaf->utf8nfkdicf) {
+		*data++ = DECOMPOSE;
+		s = (unsigned char*)leaf->utf8nfkdicf;
+		while ((*data++ = *s++) != 0)
+			;
+	} else if (leaf->utf8nfkdi) {
+		*data++ = DECOMPOSE;
+		s = (unsigned char*)leaf->utf8nfkdi;
+		while ((*data++ = *s++) != 0)
+			;
+	} else {
+		*data++ = leaf->ccc;
+	}
+	return data;
+}
+
+static void
+utf8_create(struct unicode_data *data)
+{
+	char utf[18*4+1];
+	char *u;
+	unsigned int *um;
+	int i;
+
+	u = utf;
+	um = data->utf32nfkdi;
+	if (um) {
+		for (i = 0; um[i]; i++)
+			u += utf8key(um[i], u);
+		*u = '\0';
+		data->utf8nfkdi = strdup((char*)utf);
+	}
+	u = utf;
+	um = data->utf32nfkdicf;
+	if (um) {
+		for (i = 0; um[i]; i++)
+			u += utf8key(um[i], u);
+		*u = '\0';
+		if (!data->utf8nfkdi || strcmp(data->utf8nfkdi, (char*)utf))
+			data->utf8nfkdicf = strdup((char*)utf);
+	}
+}
+
+static void
+utf8_init(void)
+{
+	unsigned int unichar;
+	int i;
+
+	for (unichar = 0; unichar != 0x110000; unichar++)
+		utf8_create(&unicode_data[unichar]);
+
+	for (i = 0; i != corrections_count; i++)
+		utf8_create(&corrections[i]);
+}
+
+static void
+trees_init(void)
+{
+	struct unicode_data *data;
+	unsigned int maxage;
+	unsigned int nextage;
+	int count;
+	int i;
+	int j;
+
+	/* Count the number of different ages. */
+	count = 0;
+	nextage = (unsigned int)-1;
+	do {
+		maxage = nextage;
+		nextage = 0;
+		for (i = 0; i <= corrections_count; i++) {
+			data = &corrections[i];
+			if (nextage < data->correction &&
+			    data->correction < maxage)
+				nextage = data->correction;
+		}
+		count++;
+	} while (nextage);
+
+	/* Two trees per age: nfkdi and nfkdicf */
+	trees_count = count * 2;
+	trees = calloc(trees_count, sizeof(struct tree));
+
+	/* Assign ages to the trees. */
+	count = trees_count;
+	nextage = (unsigned int)-1;
+	do {
+		maxage = nextage;
+		trees[--count].maxage = maxage;
+		trees[--count].maxage = maxage;
+		nextage = 0;
+		for (i = 0; i <= corrections_count; i++) {
+			data = &corrections[i];
+			if (nextage < data->correction &&
+			    data->correction < maxage)
+				nextage = data->correction;
+		}
+	} while (nextage);
+
+	/* The ages assigned above are off by one. */
+	for (i = 0; i != trees_count; i++) {
+		j = 0;
+		while (ages[j] < trees[i].maxage)
+			j++;
+		trees[i].maxage = ages[j-1];
+	}
+
+	/* Set up the forwarding between trees. */
+	trees[trees_count-2].next = &trees[trees_count-1];
+	trees[trees_count-1].leaf_mark = nfkdi_mark;
+	trees[trees_count-2].leaf_mark = nfkdicf_mark;
+	for (i = 0; i != trees_count-2; i += 2) {
+		trees[i].next = &trees[trees_count-2];
+		trees[i].leaf_mark = correction_mark;
+		trees[i+1].next = &trees[trees_count-1];
+		trees[i+1].leaf_mark = correction_mark;
+	}
+
+	/* Assign the callouts. */
+	for (i = 0; i != trees_count; i += 2) {
+		trees[i].type = "nfkdicf";
+		trees[i].leaf_equal = nfkdicf_equal;
+		trees[i].leaf_print = nfkdicf_print;
+		trees[i].leaf_size = nfkdicf_size;
+		trees[i].leaf_index = nfkdicf_index;
+		trees[i].leaf_emit = nfkdicf_emit;
+
+		trees[i+1].type = "nfkdi";
+		trees[i+1].leaf_equal = nfkdi_equal;
+		trees[i+1].leaf_print = nfkdi_print;
+		trees[i+1].leaf_size = nfkdi_size;
+		trees[i+1].leaf_index = nfkdi_index;
+		trees[i+1].leaf_emit = nfkdi_emit;
+	}
+
+	/* Finish init. */
+	for (i = 0; i != trees_count; i++)
+		trees[i].childnode = NODE;
+}
+
+static void
+trees_populate(void)
+{
+	struct unicode_data *data;
+	unsigned int unichar;
+	char keyval[4];
+	int keylen;
+	int i;
+
+	for (i = 0; i != trees_count; i++) {
+		if (verbose > 0) {
+			printf("Populating %s_%x\n",
+				trees[i].type, trees[i].maxage);
+		}
+		for (unichar = 0; unichar != 0x110000; unichar++) {
+			if (unicode_data[unichar].gen < 0)
+				continue;
+			keylen = utf8key(unichar, keyval);
+			data = corrections_lookup(&unicode_data[unichar]);
+			if (data->correction <= trees[i].maxage)
+				data = &unicode_data[unichar];
+			insert(&trees[i], keyval, keylen, data);
+		}
+	}
+}
+
+static void
+trees_reduce(void)
+{
+	int i;
+	int size;
+	int changed;
+
+	for (i = 0; i != trees_count; i++)
+		prune(&trees[i]);
+	for (i = 0; i != trees_count; i++)
+		mark_nodes(&trees[i]);
+	do {
+		size = 0;
+		for (i = 0; i != trees_count; i++)
+			size = index_nodes(&trees[i], size);
+		changed = 0;
+		for (i = 0; i != trees_count; i++)
+			changed += size_nodes(&trees[i]);
+	} while (changed);
+
+	utf8data = calloc(size, 1);
+	utf8data_size = size;
+	for (i = 0; i != trees_count; i++)
+		emit(&trees[i], utf8data);
+
+	if (verbose > 0) {
+		for (i = 0; i != trees_count; i++) {
+			printf("%s_%x idx %d\n",
+				trees[i].type, trees[i].maxage, trees[i].index);
+		}
+	}
+
+	nfkdi = utf8data + trees[trees_count-1].index;
+	nfkdicf = utf8data + trees[trees_count-2].index;
+
+	nfkdi_tree = &trees[trees_count-1];
+	nfkdicf_tree = &trees[trees_count-2];
+}
+
+static void
+verify(struct tree *tree)
+{
+	struct unicode_data *data;
+	utf8leaf_t	*leaf;
+	unsigned int	unichar;
+	char		key[4];
+	int		report;
+	int		nocf;
+
+	if (verbose > 0)
+		printf("Verifying %s_%x\n", tree->type, tree->maxage);
+	nocf = strcmp(tree->type, "nfkdicf");
+
+	for (unichar = 0; unichar != 0x110000; unichar++) {
+		report = 0;
+		data = corrections_lookup(&unicode_data[unichar]);
+		if (data->correction <= tree->maxage)
+			data = &unicode_data[unichar];
+		utf8key(unichar, key);
+		leaf = utf8lookup(tree, key);
+		if (!leaf) {
+			if (data->gen != -1)
+				report++;
+			if (unichar < 0xd800 || unichar > 0xdfff)
+				report++;
+		} else {
+			if (unichar >= 0xd800 && unichar <= 0xdfff)
+				report++;
+			if (data->gen == -1)
+				report++;
+			if (data->gen != LEAF_GEN(leaf))
+				report++;
+			if (LEAF_CCC(leaf) == DECOMPOSE) {
+				if (nocf) {
+					if (!data->utf8nfkdi) {
+						report++;
+					} else if (strcmp(data->utf8nfkdi,
+							LEAF_STR(leaf))) {
+						report++;
+					}
+				} else {
+					if (!data->utf8nfkdicf &&
+					    !data->utf8nfkdi) {
+						report++;
+					} else if (data->utf8nfkdicf) {
+						if (strcmp(data->utf8nfkdicf,
+							   LEAF_STR(leaf)))
+							report++;
+					} else if (strcmp(data->utf8nfkdi,
+							  LEAF_STR(leaf))) {
+						report++;
+					}
+				}
+			} else if (data->ccc != LEAF_CCC(leaf)) {
+				report++;
+			}
+		}
+		if (report) {
+			printf("%X code %X gen %d ccc %d"
+				" nfdki -> \"%s\"",
+				unichar, data->code, data->gen,
+				data->ccc,
+				data->utf8nfkdi);
+			if (leaf) {
+				printf(" age %d ccc %d"
+					" nfdki -> \"%s\"\n",
+					LEAF_GEN(leaf),
+					LEAF_CCC(leaf),
+					LEAF_CCC(leaf) == DECOMPOSE ?
+						LEAF_STR(leaf) : "");
+			}
+			printf("\n");
+		}
+	}
+}
+
+static void
+trees_verify(void)
+{
+	int i;
+
+	for (i = 0; i != trees_count; i++)
+		verify(&trees[i]);
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+help(void)
+{
+	printf("Usage: %s [options]\n", argv0);
+	printf("\n");
+	printf("This program creates an a data trie used for parsing and\n");
+	printf("normalization of UTF-8 strings. The trie is derived from\n");
+	printf("a set of input files from the Unicode character database\n");
+	printf("found at: http://www.unicode.org/Public/UCD/latest/ucd/\n");
+	printf("\n");
+	printf("The generated tree supports two normalization forms:\n");
+	printf("\n");
+	printf("\tnfkdi:\n");
+	printf("\t- Apply unicode normalization form NFKD.\n");
+	printf("\t- Remove any Default_Ignorable_Code_Point.\n");
+	printf("\n");
+	printf("\tnfkdicf:\n");
+	printf("\t- Apply unicode normalization form NFKD.\n");
+	printf("\t- Remove any Default_Ignorable_Code_Point.\n");
+	printf("\t- Apply a full casefold (C + F).\n");
+	printf("\n");
+	printf("These forms were chosen as being most useful when dealing\n");
+	printf("with file names: NFKD catches most cases where characters\n");
+	printf("should be considered equivalent. The ignorables are mostly\n");
+	printf("invisible, making names hard to type.\n");
+	printf("\n");
+	printf("The options to specify the files to be used are listed\n");
+	printf("below with their default values, which are the names used\n");
+	printf("by version 7.0.0 of the Unicode Character Database.\n");
+	printf("\n");
+	printf("The input files:\n");
+	printf("\t-a %s\n", AGE_NAME);
+	printf("\t-c %s\n", CCC_NAME);
+	printf("\t-p %s\n", PROP_NAME);
+	printf("\t-d %s\n", DATA_NAME);
+	printf("\t-f %s\n", FOLD_NAME);
+	printf("\t-n %s\n", NORM_NAME);
+	printf("\n");
+	printf("Additionally, the generated tables are tested using:\n");
+	printf("\t-t %s\n", TEST_NAME);
+	printf("\n");
+	printf("Finally, the output file:\n");
+	printf("\t-o %s\n", UTF8_NAME);
+	printf("\n");
+}
+
+static void
+usage(void)
+{
+	help();
+	exit(1);
+}
+
+static void
+open_fail(const char *name, int error)
+{
+	printf("Error %d opening %s: %s\n", error, name, strerror(error));
+	exit(1);
+}
+
+static void
+file_fail(const char *filename)
+{
+	printf("Error parsing %s\n", filename);
+	exit(1);
+}
+
+static void
+line_fail(const char *filename, const char *line)
+{
+	printf("Error parsing %s:%s\n", filename, line);
+	exit(1);
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+print_utf32(unsigned int *utf32str)
+{
+	int	i;
+	for (i = 0; utf32str[i]; i++)
+		printf(" %X", utf32str[i]);
+}
+
+static void
+print_utf32nfkdi(unsigned int unichar)
+{
+	printf(" %X ->", unichar);
+	print_utf32(unicode_data[unichar].utf32nfkdi);
+	printf("\n");
+}
+
+static void
+print_utf32nfkdicf(unsigned int unichar)
+{
+	printf(" %X ->", unichar);
+	print_utf32(unicode_data[unichar].utf32nfkdicf);
+	printf("\n");
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+age_init(void)
+{
+	FILE *file;
+	unsigned int first;
+	unsigned int last;
+	unsigned int unichar;
+	unsigned int major;
+	unsigned int minor;
+	unsigned int revision;
+	int gen;
+        int count;
+	int ret;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", age_name);
+
+	file = fopen(age_name, "r");
+	if (!file)
+		open_fail(age_name, errno);
+        count = 0;
+
+        gen = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "# Age=V%d_%d_%d",
+				&major, &minor, &revision);
+		if (ret == 3) {
+			ages_count++;
+			if (verbose > 1)
+				printf(" Age V%d_%d_%d\n",
+					major, minor, revision);
+			if (!age_valid(major, minor, revision))
+				line_fail(age_name, line);
+			continue;
+		}
+		ret = sscanf(line, "# Age=V%d_%d", &major, &minor);
+		if (ret == 2) {
+			ages_count++;
+			if (verbose > 1)
+				printf(" Age V%d_%d\n", major, minor);
+			if (!age_valid(major, minor, 0))
+				line_fail(age_name, line);
+			continue;
+		}
+	}
+
+	/* We must have found something above. */
+	if (verbose > 1)
+		printf("%d age entries\n", ages_count);
+	if (ages_count == 0 || ages_count > MAXGEN)
+		file_fail(age_name);
+
+	/* There is a 0 entry. */
+	ages_count++;
+	ages = calloc(ages_count + 1, sizeof(*ages));
+	/* And a guard entry. */
+	ages[ages_count] = (unsigned int)-1;
+
+	rewind(file);
+        count = 0;
+	gen = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "# Age=V%d_%d_%d",
+				&major, &minor, &revision);
+		if (ret == 3) {
+			ages[++gen] =
+				UNICODE_AGE(major, minor, revision);
+			if (verbose > 1)
+				printf(" Age V%d_%d_%d = gen %d\n",
+					major, minor, revision, gen);
+			if (!age_valid(major, minor, revision))
+				line_fail(age_name, line);
+			continue;
+		}
+		ret = sscanf(line, "# Age=V%d_%d", &major, &minor);
+		if (ret == 2) {
+			ages[++gen] = UNICODE_AGE(major, minor, 0);
+			if (verbose > 1)
+				printf(" Age V%d_%d = %d\n",
+					major, minor, gen);
+			if (!age_valid(major, minor, 0))
+				line_fail(age_name, line);
+			continue;
+		}
+		ret = sscanf(line, "%X..%X ; %d.%d #",
+			     &first, &last, &major, &minor);
+		if (ret == 4) {
+			for (unichar = first; unichar <= last; unichar++)
+				unicode_data[unichar].gen = gen;
+                        count += 1 + last - first;
+			if (verbose > 1)
+				printf("  %X..%X gen %d\n", first, last, gen);
+			if (!utf32valid(first) || !utf32valid(last))
+				line_fail(age_name, line);
+			continue;
+		}
+		ret = sscanf(line, "%X ; %d.%d #", &unichar, &major, &minor);
+		if (ret == 3) {
+			unicode_data[unichar].gen = gen;
+                        count++;
+			if (verbose > 1)
+				printf("  %X gen %d\n", unichar, gen);
+			if (!utf32valid(unichar))
+				line_fail(age_name, line);
+			continue;
+		}
+	}
+	unicode_maxage = ages[gen];
+	fclose(file);
+
+	/* Nix surrogate block */
+	if (verbose > 1)
+		printf(" Removing surrogate block D800..DFFF\n");
+	for (unichar = 0xd800; unichar <= 0xdfff; unichar++)
+		unicode_data[unichar].gen = -1;
+
+	if (verbose > 0)
+	        printf("Found %d entries\n", count);
+	if (count == 0)
+		file_fail(age_name);
+}
+
+static void
+ccc_init(void)
+{
+	FILE *file;
+	unsigned int first;
+	unsigned int last;
+	unsigned int unichar;
+	unsigned int value;
+        int count;
+	int ret;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", ccc_name);
+
+	file = fopen(ccc_name, "r");
+	if (!file)
+		open_fail(ccc_name, errno);
+
+        count = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%X..%X ; %d #", &first, &last, &value);
+		if (ret == 3) {
+			for (unichar = first; unichar <= last; unichar++) {
+				unicode_data[unichar].ccc = value;
+                                count++;
+			}
+			if (verbose > 1)
+				printf(" %X..%X ccc %d\n", first, last, value);
+			if (!utf32valid(first) || !utf32valid(last))
+				line_fail(ccc_name, line);
+			continue;
+		}
+		ret = sscanf(line, "%X ; %d #", &unichar, &value);
+		if (ret == 2) {
+			unicode_data[unichar].ccc = value;
+                        count++;
+			if (verbose > 1)
+				printf(" %X ccc %d\n", unichar, value);
+			if (!utf32valid(unichar))
+				line_fail(ccc_name, line);
+			continue;
+		}
+	}
+	fclose(file);
+
+	if (verbose > 0)
+            printf("Found %d entries\n", count);
+	if (count == 0)
+		file_fail(ccc_name);
+}
+
+static void
+nfkdi_init(void)
+{
+	FILE *file;
+	unsigned int unichar;
+	unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+	char *s;
+	unsigned int *um;
+	int count;
+	int i;
+	int ret;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", data_name);
+	file = fopen(data_name, "r");
+	if (!file)
+		open_fail(data_name, errno);
+
+	count = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%X;%*[^;];%*[^;];%*[^;];%*[^;];%[^;];",
+			     &unichar, buf0);
+		if (ret != 2)
+			continue;
+		if (!utf32valid(unichar))
+			line_fail(data_name, line);
+
+		s = buf0;
+		/* skip over <tag> */
+		if (*s == '<')
+			while (*s++ != ' ')
+				;
+		/* decode the decomposition into UTF-32 */
+		i = 0;
+		while (*s) {
+			mapping[i] = strtoul(s, &s, 16);
+			if (!utf32valid(mapping[i]))
+				line_fail(data_name, line);
+			i++;
+		}
+		mapping[i++] = 0;
+
+		um = malloc(i * sizeof(unsigned int));
+		memcpy(um, mapping, i * sizeof(unsigned int));
+		unicode_data[unichar].utf32nfkdi = um;
+
+		if (verbose > 1)
+			print_utf32nfkdi(unichar);
+		count++;
+	}
+	fclose(file);
+	if (verbose > 0)
+		printf("Found %d entries\n", count);
+	if (count == 0)
+		file_fail(data_name);
+}
+
+static void
+nfkdicf_init(void)
+{
+	FILE *file;
+	unsigned int unichar;
+	unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+	char status;
+	char *s;
+	unsigned int *um;
+	int i;
+	int count;
+	int ret;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", fold_name);
+	file = fopen(fold_name, "r");
+	if (!file)
+		open_fail(fold_name, errno);
+
+	count = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%X; %c; %[^;];", &unichar, &status, buf0);
+		if (ret != 3)
+			continue;
+		if (!utf32valid(unichar))
+			line_fail(fold_name, line);
+		/* Use the C+F casefold. */
+		if (status != 'C' && status != 'F')
+			continue;
+		s = buf0;
+		if (*s == '<')
+			while (*s++ != ' ')
+				;
+		i = 0;
+		while (*s) {
+			mapping[i] = strtoul(s, &s, 16);
+			if (!utf32valid(mapping[i]))
+				line_fail(fold_name, line);
+			i++;
+		}
+		mapping[i++] = 0;
+
+		um = malloc(i * sizeof(unsigned int));
+		memcpy(um, mapping, i * sizeof(unsigned int));
+		unicode_data[unichar].utf32nfkdicf = um;
+
+		if (verbose > 1)
+			print_utf32nfkdicf(unichar);
+		count++;
+	}
+	fclose(file);
+	if (verbose > 0)
+		printf("Found %d entries\n", count);
+	if (count == 0)
+		file_fail(fold_name);
+}
+
+static void
+ignore_init(void)
+{
+	FILE *file;
+	unsigned int unichar;
+	unsigned int first;
+	unsigned int last;
+	unsigned int *um;
+	int count;
+	int ret;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", prop_name);
+	file = fopen(prop_name, "r");
+	if (!file)
+		open_fail(prop_name, errno);
+	assert(file);
+	count = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%X..%X ; %s # ", &first, &last, buf0);
+		if (ret == 3) {
+			if (strcmp(buf0, "Default_Ignorable_Code_Point"))
+				continue;
+			if (!utf32valid(first) || !utf32valid(last))
+				line_fail(prop_name, line);
+			for (unichar = first; unichar <= last; unichar++) {
+				free(unicode_data[unichar].utf32nfkdi);
+				um = malloc(sizeof(unsigned int));
+				*um = 0;
+				unicode_data[unichar].utf32nfkdi = um;
+				free(unicode_data[unichar].utf32nfkdicf);
+				um = malloc(sizeof(unsigned int));
+				*um = 0;
+				unicode_data[unichar].utf32nfkdicf = um;
+				count++;
+			}
+			if (verbose > 1)
+				printf(" %X..%X Default_Ignorable_Code_Point\n",
+					first, last);
+			continue;
+		}
+		ret = sscanf(line, "%X ; %s # ", &unichar, buf0);
+		if (ret == 2) {
+			if (strcmp(buf0, "Default_Ignorable_Code_Point"))
+				continue;
+			if (!utf32valid(unichar))
+				line_fail(prop_name, line);
+			free(unicode_data[unichar].utf32nfkdi);
+			um = malloc(sizeof(unsigned int));
+			*um = 0;
+			unicode_data[unichar].utf32nfkdi = um;
+			free(unicode_data[unichar].utf32nfkdicf);
+			um = malloc(sizeof(unsigned int));
+			*um = 0;
+			unicode_data[unichar].utf32nfkdicf = um;
+			if (verbose > 1)
+				printf(" %X Default_Ignorable_Code_Point\n",
+					unichar);
+			count++;
+			continue;
+		}
+	}
+	fclose(file);
+
+	if (verbose > 0)
+		printf("Found %d entries\n", count);
+	if (count == 0)
+		file_fail(prop_name);
+}
+
+static void
+corrections_init(void)
+{
+	FILE *file;
+	unsigned int unichar;
+	unsigned int major;
+	unsigned int minor;
+	unsigned int revision;
+	unsigned int age;
+	unsigned int *um;
+	unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+	char *s;
+	int i;
+	int count;
+	int ret;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", norm_name);
+	file = fopen(norm_name, "r");
+	if (!file)
+		open_fail(norm_name, errno);
+
+	count = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%X;%[^;];%[^;];%d.%d.%d #",
+				&unichar, buf0, buf1,
+				&major, &minor, &revision);
+		if (ret != 6)
+			continue;
+		if (!utf32valid(unichar) || !age_valid(major, minor, revision))
+			line_fail(norm_name, line);
+		count++;
+	}
+	corrections = calloc(count, sizeof(struct unicode_data));
+	corrections_count = count;
+	rewind(file);
+
+	count = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%X;%[^;];%[^;];%d.%d.%d #",
+				&unichar, buf0, buf1,
+				&major, &minor, &revision);
+		if (ret != 6)
+			continue;
+		if (!utf32valid(unichar) || !age_valid(major, minor, revision))
+			line_fail(norm_name, line);
+		corrections[count] = unicode_data[unichar];
+		assert(corrections[count].code == unichar);
+		age = UNICODE_AGE(major, minor, revision);
+		corrections[count].correction = age;
+
+		i = 0;
+		s = buf0;
+		while (*s) {
+			mapping[i] = strtoul(s, &s, 16);
+			if (!utf32valid(mapping[i]))
+				line_fail(norm_name, line);
+			i++;
+		}
+		mapping[i++] = 0;
+
+		um = malloc(i * sizeof(unsigned int));
+		memcpy(um, mapping, i * sizeof(unsigned int));
+		corrections[count].utf32nfkdi = um;
+
+		if (verbose > 1)
+			printf(" %X -> %s -> %s V%d_%d_%d\n",
+				unichar, buf0, buf1, major, minor, revision);
+		count++;
+	}
+	fclose(file);
+
+	if (verbose > 0)
+	        printf("Found %d entries\n", count);
+	if (count == 0)
+		file_fail(norm_name);
+}
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * Hangul decomposition (algorithm from Section 3.12 of Unicode 6.3.0)
+ *
+ * AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
+ * D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
+ *
+ * SBase = 0xAC00
+ * LBase = 0x1100
+ * VBase = 0x1161
+ * TBase = 0x11A7
+ * LCount = 19
+ * VCount = 21
+ * TCount = 28
+ * NCount = 588 (VCount * TCount)
+ * SCount = 11172 (LCount * NCount)
+ *
+ * Decomposition:
+ *   SIndex = s - SBase
+ *
+ * LV (Canonical/Full)
+ *   LIndex = SIndex / NCount
+ *   VIndex = (Sindex % NCount) / TCount
+ *   LPart = LBase + LIndex
+ *   VPart = VBase + VIndex
+ *
+ * LVT (Canonical)
+ *   LVIndex = (SIndex / TCount) * TCount
+ *   TIndex = (Sindex % TCount
+ *   LVPart = LBase + LVIndex
+ *   TPart = TBase + TIndex
+ *
+ * LVT (Full)
+ *   LIndex = SIndex / NCount
+ *   VIndex = (Sindex % NCount) / TCount
+ *   TIndex = (Sindex % TCount
+ *   LPart = LBase + LIndex
+ *   VPart = VBase + VIndex
+ *   if (TIndex == 0) {
+ *          d = <LPart, VPart>
+ *   } else {
+ *          TPart = TBase + TIndex
+ *          d = <LPart, TPart, VPart>
+ *   }
+ *
+ */
+
+static void
+hangul_decompose(void)
+{
+	unsigned int sb = 0xAC00;
+	unsigned int lb = 0x1100;
+	unsigned int vb = 0x1161;
+	unsigned int tb = 0x11a7;
+	/* unsigned int lc = 19; */
+	unsigned int vc = 21;
+	unsigned int tc = 28;
+	unsigned int nc = (vc * tc);
+	/* unsigned int sc = (lc * nc); */
+	unsigned int unichar;
+	unsigned int mapping[4];
+	unsigned int *um;
+        int count;
+	int i;
+
+	if (verbose > 0)
+		printf("Decomposing hangul\n");
+	/* Hangul */
+	count = 0;
+	for (unichar = 0xAC00; unichar <= 0xD7A3; unichar++) {
+		unsigned int si = unichar - sb;
+		unsigned int li = si / nc;
+		unsigned int vi = (si % nc) / tc;
+		unsigned int ti = si % tc;
+
+		i = 0;
+		mapping[i++] = lb + li;
+		mapping[i++] = vb + vi;
+		if (ti)
+			mapping[i++] = tb + ti;
+		mapping[i++] = 0;
+
+		assert(!unicode_data[unichar].utf32nfkdi);
+		um = malloc(i * sizeof(unsigned int));
+		memcpy(um, mapping, i * sizeof(unsigned int));
+		unicode_data[unichar].utf32nfkdi = um;
+
+		assert(!unicode_data[unichar].utf32nfkdicf);
+		um = malloc(i * sizeof(unsigned int));
+		memcpy(um, mapping, i * sizeof(unsigned int));
+		unicode_data[unichar].utf32nfkdicf = um;
+
+		if (verbose > 1)
+			print_utf32nfkdi(unichar);
+
+		count++;
+	}
+	if (verbose > 0)
+		printf("Created %d entries\n", count);
+}
+
+static void
+nfkdi_decompose(void)
+{
+	unsigned int unichar;
+	unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+	unsigned int *um;
+	unsigned int *dc;
+	int count;
+	int i;
+	int j;
+	int ret;
+
+	if (verbose > 0)
+		printf("Decomposing nfkdi\n");
+
+	count = 0;
+	for (unichar = 0; unichar != 0x110000; unichar++) {
+		if (!unicode_data[unichar].utf32nfkdi)
+			continue;
+		for (;;) {
+			ret = 1;
+			i = 0;
+			um = unicode_data[unichar].utf32nfkdi;
+			while (*um) {
+				dc = unicode_data[*um].utf32nfkdi;
+				if (dc) {
+					for (j = 0; dc[j]; j++)
+						mapping[i++] = dc[j];
+					ret = 0;
+				} else {
+					mapping[i++] = *um;
+				}
+				um++;
+			}
+			mapping[i++] = 0;
+			if (ret)
+				break;
+			free(unicode_data[unichar].utf32nfkdi);
+			um = malloc(i * sizeof(unsigned int));
+			memcpy(um, mapping, i * sizeof(unsigned int));
+			unicode_data[unichar].utf32nfkdi = um;
+		}
+		/* Add this decomposition to nfkdicf if there is no entry. */
+		if (!unicode_data[unichar].utf32nfkdicf) {
+			um = malloc(i * sizeof(unsigned int));
+			memcpy(um, mapping, i * sizeof(unsigned int));
+			unicode_data[unichar].utf32nfkdicf = um;
+		}
+		if (verbose > 1)
+			print_utf32nfkdi(unichar);
+		count++;
+	}
+	if (verbose > 0)
+		printf("Processed %d entries\n", count);
+}
+
+static void
+nfkdicf_decompose(void)
+{
+	unsigned int unichar;
+	unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+	unsigned int *um;
+	unsigned int *dc;
+	int count;
+	int i;
+	int j;
+	int ret;
+
+	if (verbose > 0)
+		printf("Decomposing nfkdicf\n");
+	count = 0;
+	for (unichar = 0; unichar != 0x110000; unichar++) {
+		if (!unicode_data[unichar].utf32nfkdicf)
+			continue;
+		for (;;) {
+			ret = 1;
+			i = 0;
+			um = unicode_data[unichar].utf32nfkdicf;
+			while (*um) {
+				dc = unicode_data[*um].utf32nfkdicf;
+				if (dc) {
+					for (j = 0; dc[j]; j++)
+						mapping[i++] = dc[j];
+					ret = 0;
+				} else {
+					mapping[i++] = *um;
+				}
+				um++;
+			}
+			mapping[i++] = 0;
+			if (ret)
+				break;
+			free(unicode_data[unichar].utf32nfkdicf);
+			um = malloc(i * sizeof(unsigned int));
+			memcpy(um, mapping, i * sizeof(unsigned int));
+			unicode_data[unichar].utf32nfkdicf = um;
+		}
+		if (verbose > 1)
+			print_utf32nfkdicf(unichar);
+		count++;
+	}
+	if (verbose > 0)
+		printf("Processed %d entries\n", count);
+}
+
+/* ------------------------------------------------------------------ */
+
+int utf8agemax(struct tree *, const char *);
+int utf8nagemax(struct tree *, const char *, size_t);
+int utf8agemin(struct tree *, const char *);
+int utf8nagemin(struct tree *, const char *, size_t);
+ssize_t utf8len(struct tree *, const char *);
+ssize_t utf8nlen(struct tree *, const char *, size_t);
+struct utf8cursor;
+int utf8cursor(struct utf8cursor *, struct tree *, const char *);
+int utf8ncursor(struct utf8cursor *, struct tree *, const char *, size_t);
+int utf8byte(struct utf8cursor *);
+
+/*
+ * Use trie to scan s, touching at most len bytes.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * A non-NULL return guarantees that the UTF-8 sequence starting at s
+ * is well-formed and corresponds to a known unicode code point.  The
+ * shorthand for this will be "is valid UTF-8 unicode".
+ */
+static utf8leaf_t *
+utf8nlookup(struct tree *tree, const char *s, size_t len)
+{
+	utf8trie_t	*trie = utf8data + tree->index;
+	int		offlen;
+	int		offset;
+	int		mask;
+	int		node;
+
+	if (!tree)
+		return NULL;
+	if (len == 0)
+		return NULL;
+	node = 1;
+	while (node) {
+		offlen = (*trie & OFFLEN) >> OFFLEN_SHIFT;
+		if (*trie & NEXTBYTE) {
+			if (--len == 0)
+				return NULL;
+			s++;
+		}
+		mask = 1 << (*trie & BITNUM);
+		if (*s & mask) {
+			/* Right leg */
+			if (offlen) {
+				/* Right node at offset of trie */
+				node = (*trie & RIGHTNODE);
+				offset = trie[offlen];
+				while (--offlen) {
+					offset <<= 8;
+					offset |= trie[offlen];
+				}
+				trie += offset;
+			} else if (*trie & RIGHTPATH) {
+				/* Right node after this node */
+				node = (*trie & TRIENODE);
+				trie++;
+			} else {
+				/* No right node. */
+				node = 0;
+				trie = NULL;
+			}
+		} else {
+			/* Left leg */
+			if (offlen) {
+				/* Left node after this node. */
+				node = (*trie & LEFTNODE);
+				trie += offlen + 1;
+			} else if (*trie & RIGHTPATH) {
+				/* No left node. */
+				node = 0;
+				trie = NULL;
+			} else {
+				/* Left node after this node */
+				node = (*trie & TRIENODE);
+				trie++;
+			}
+		}
+	}
+	return trie;
+}
+
+/*
+ * Use trie to scan s.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * Forwards to trie_nlookup().
+ */
+static utf8leaf_t *
+utf8lookup(struct tree *tree, const char *s)
+{
+	return utf8nlookup(tree, s, (size_t)-1);
+}
+
+/*
+ * Return the number of bytes used by the current UTF-8 sequence.
+ * Assumes the input points to the first byte of a valid UTF-8
+ * sequence.
+ */
+static inline int
+utf8clen(const char *s)
+{
+	unsigned char c = *s;
+	return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0);
+}
+
+/*
+ * Maximum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if only non-assigned code points are used.
+ */
+int
+utf8agemax(struct tree *tree, const char *s)
+{
+	utf8leaf_t	*leaf;
+	int		age = 0;
+	int		leaf_age;
+
+	if (!tree)
+		return -1;
+	while (*s) {
+		if (!(leaf = utf8lookup(tree, s)))
+			return -1;
+		leaf_age = ages[LEAF_GEN(leaf)];
+		if (leaf_age <= tree->maxage && leaf_age > age)
+			age = leaf_age;
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Minimum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if non-assigned code points are used.
+ */
+int
+utf8agemin(struct tree *tree, const char *s)
+{
+	utf8leaf_t	*leaf;
+	int		age = tree->maxage;
+	int		leaf_age;
+
+	if (!tree)
+		return -1;
+	while (*s) {
+		if (!(leaf = utf8lookup(tree, s)))
+			return -1;
+		leaf_age = ages[LEAF_GEN(leaf)];
+		if (leaf_age <= tree->maxage && leaf_age < age)
+			age = leaf_age;
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Maximum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemax(struct tree *tree, const char *s, size_t len)
+{
+	utf8leaf_t	*leaf;
+	int		age = 0;
+	int		leaf_age;
+
+	if (!tree)
+		return -1;
+        while (len && *s) {
+		if (!(leaf = utf8nlookup(tree, s, len)))
+			return -1;
+		leaf_age = ages[LEAF_GEN(leaf)];
+		if (leaf_age <= tree->maxage && leaf_age > age)
+			age = leaf_age;
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Maximum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemin(struct tree *tree, const char *s, size_t len)
+{
+	utf8leaf_t	*leaf;
+	int		leaf_age;
+	int		age = tree->maxage;
+
+	if (!tree)
+		return -1;
+        while (len && *s) {
+		if (!(leaf = utf8nlookup(tree, s, len)))
+			return -1;
+		leaf_age = ages[LEAF_GEN(leaf)];
+		if (leaf_age <= tree->maxage && leaf_age < age)
+			age = leaf_age;
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Length of the normalization of s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ *
+ * A string of Default_Ignorable_Code_Point has length 0.
+ */
+ssize_t
+utf8len(struct tree *tree, const char *s)
+{
+	utf8leaf_t	*leaf;
+	size_t		ret = 0;
+
+	if (!tree)
+		return -1;
+	while (*s) {
+		if (!(leaf = utf8lookup(tree, s)))
+			return -1;
+		if (ages[LEAF_GEN(leaf)] > tree->maxage)
+			ret += utf8clen(s);
+		else if (LEAF_CCC(leaf) == DECOMPOSE)
+			ret += strlen(LEAF_STR(leaf));
+		else
+			ret += utf8clen(s);
+		s += utf8clen(s);
+	}
+	return ret;
+}
+
+/*
+ * Length of the normalization of s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+ssize_t
+utf8nlen(struct tree *tree, const char *s, size_t len)
+{
+	utf8leaf_t	*leaf;
+	size_t		ret = 0;
+
+	if (!tree)
+		return -1;
+	while (len && *s) {
+		if (!(leaf = utf8nlookup(tree, s, len)))
+			return -1;
+		if (ages[LEAF_GEN(leaf)] > tree->maxage)
+			ret += utf8clen(s);
+		else if (LEAF_CCC(leaf) == DECOMPOSE)
+			ret += strlen(LEAF_STR(leaf));
+		else
+			ret += utf8clen(s);
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return ret;
+}
+
+/*
+ * Cursor structure used by the normalizer.
+ */
+struct utf8cursor {
+	struct tree	*tree;
+	const char	*s;
+	const char	*p;
+	const char	*ss;
+	const char	*sp;
+	unsigned int	len;
+	unsigned int	slen;
+	short int	ccc;
+	short int	nccc;
+	unsigned int	unichar;
+};
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ *   s      : string.
+ *   len    : length of s.
+ *   u8c    : pointer to cursor.
+ *   trie   : utf8trie_t to use for normalization.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int
+utf8ncursor(
+	struct utf8cursor *u8c,
+	struct tree	*tree,
+	const char	*s,
+	size_t		len)
+{
+	if (!tree)
+		return -1;
+	if (!s)
+		return -1;
+	u8c->tree = tree;
+	u8c->s = s;
+	u8c->p = NULL;
+	u8c->ss = NULL;
+	u8c->sp = NULL;
+	u8c->len = len;
+	u8c->slen = 0;
+	u8c->ccc = STOPPER;
+	u8c->nccc = STOPPER;
+	u8c->unichar = 0;
+	/* Check we didn't clobber the maximum length. */
+	if (u8c->len != len)
+		return -1;
+	/* The first byte of s may not be an utf8 continuation. */
+	if (len > 0 && (*s & 0xC0) == 0x80)
+		return -1;
+	return 0;
+}
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ *   s      : NUL-terminated string.
+ *   u8c    : pointer to cursor.
+ *   trie   : utf8trie_t to use for normalization.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int
+utf8cursor(
+	struct utf8cursor *u8c,
+	struct tree	*tree,
+	const char	*s)
+{
+	return utf8ncursor(u8c, tree, s, (unsigned int)-1);
+}
+
+/*
+ * Get one byte from the normalized form of the string described by u8c.
+ *
+ * Returns the byte cast to an unsigned char on succes, and -1 on failure.
+ *
+ * The cursor keeps track of the location in the string in u8c->s.
+ * When a character is decomposed, the current location is stored in
+ * u8c->p, and u8c->s is set to the start of the decomposition. Note
+ * that bytes from a decomposition do not count against u8c->len.
+ *
+ * Characters are emitted if they match the current CCC in u8c->ccc.
+ * Hitting end-of-string while u8c->ccc == STOPPER means we're done,
+ * and the function returns 0 in that case.
+ *
+ * Sorting by CCC is done by repeatedly scanning the string.  The
+ * values of u8c->s and u8c->p are stored in u8c->ss and u8c->sp at
+ * the start of the scan.  The first pass finds the lowest CCC to be
+ * emitted and stores it in u8c->nccc, the second pass emits the
+ * characters with this CCC and finds the next lowest CCC. This limits
+ * the number of passes to 1 + the number of different CCCs in the
+ * sequence being scanned.
+ *
+ * Therefore:
+ *  u8c->p  != NULL -> a decomposition is being scanned.
+ *  u8c->ss != NULL -> this is a repeating scan.
+ *  u8c->ccc == -1  -> this is the first scan of a repeating scan.
+ */
+int
+utf8byte(struct utf8cursor *u8c)
+{
+	utf8leaf_t *leaf;
+	int ccc;
+
+	for (;;) {
+		/* Check for the end of a decomposed character. */
+		if (u8c->p && *u8c->s == '\0') {
+			u8c->s = u8c->p;
+			u8c->p = NULL;
+		}
+
+		/* Check for end-of-string. */
+		if (!u8c->p && (u8c->len == 0 || *u8c->s == '\0')) {
+			/* There is no next byte. */
+			if (u8c->ccc == STOPPER)
+				return 0;
+			/* End-of-string during a scan counts as a stopper. */
+			ccc = STOPPER;
+			goto ccc_mismatch;
+		} else if ((*u8c->s & 0xC0) == 0x80) {
+			/* This is a continuation of the current character. */
+			if (!u8c->p)
+				u8c->len--;
+			return (unsigned char)*u8c->s++;
+		}
+
+		/* Look up the data for the current character. */
+		if (u8c->p)
+			leaf = utf8lookup(u8c->tree, u8c->s);
+		else
+			leaf = utf8nlookup(u8c->tree, u8c->s, u8c->len);
+
+		/* No leaf found implies that the input is a binary blob. */
+		if (!leaf)
+			return -1;
+
+		/* Characters that are too new have CCC 0. */
+		if (ages[LEAF_GEN(leaf)] > u8c->tree->maxage) {
+			ccc = STOPPER;
+		} else if ((ccc = LEAF_CCC(leaf)) == DECOMPOSE) {
+			u8c->len -= utf8clen(u8c->s);
+			u8c->p = u8c->s + utf8clen(u8c->s);
+			u8c->s = LEAF_STR(leaf);
+			/* Empty decomposition implies CCC 0. */
+			if (*u8c->s == '\0') {
+				if (u8c->ccc == STOPPER)
+					continue;
+				ccc = STOPPER;
+				goto ccc_mismatch;
+			}
+			leaf = utf8lookup(u8c->tree, u8c->s);
+			ccc = LEAF_CCC(leaf);
+		}
+		u8c->unichar = utf8code(u8c->s);
+
+		/*
+		 * If this is not a stopper, then see if it updates
+		 * the next canonical class to be emitted.
+		 */
+		if (ccc != STOPPER && u8c->ccc < ccc && ccc < u8c->nccc)
+			u8c->nccc = ccc;
+
+		/*
+		 * Return the current byte if this is the current
+		 * combining class.
+		 */
+		if (ccc == u8c->ccc) {
+			if (!u8c->p)
+				u8c->len--;
+			return (unsigned char)*u8c->s++;
+		}
+
+		/* Current combining class mismatch. */
+	ccc_mismatch:
+		if (u8c->nccc == STOPPER) {
+			/*
+			 * Scan forward for the first canonical class
+			 * to be emitted.  Save the position from
+			 * which to restart.
+			 */
+			assert(u8c->ccc == STOPPER);
+			u8c->ccc = MINCCC - 1;
+			u8c->nccc = ccc;
+			u8c->sp = u8c->p;
+			u8c->ss = u8c->s;
+			u8c->slen = u8c->len;
+			if (!u8c->p)
+				u8c->len -= utf8clen(u8c->s);
+			u8c->s += utf8clen(u8c->s);
+		} else if (ccc != STOPPER) {
+			/* Not a stopper, and not the ccc we're emitting. */
+			if (!u8c->p)
+				u8c->len -= utf8clen(u8c->s);
+			u8c->s += utf8clen(u8c->s);
+		} else if (u8c->nccc != MAXCCC + 1) {
+			/* At a stopper, restart for next ccc. */
+			u8c->ccc = u8c->nccc;
+			u8c->nccc = MAXCCC + 1;
+			u8c->s = u8c->ss;
+			u8c->p = u8c->sp;
+			u8c->len = u8c->slen;
+		} else {
+			/* All done, proceed from here. */
+			u8c->ccc = STOPPER;
+			u8c->nccc = STOPPER;
+			u8c->sp = NULL;
+			u8c->ss = NULL;
+			u8c->slen = 0;
+		}
+	}
+}
+
+/* ------------------------------------------------------------------ */
+
+static int
+normalize_line(struct tree *tree)
+{
+	char *s;
+	char *t;
+	int c;
+	struct utf8cursor u8c;
+
+	/* First test: null-terminated string. */
+	s = buf2;
+	t = buf3;
+	if (utf8cursor(&u8c, tree, s))
+		return -1;
+	while ((c = utf8byte(&u8c)) > 0)
+		if (c != (unsigned char)*t++)
+			return -1;
+	if (c < 0)
+		return -1;
+	if (*t != 0)
+		return -1;
+
+	/* Second test: length-limited string. */
+	s = buf2;
+	/* Replace NUL with a value that will cause an error if seen. */
+	s[strlen(s) + 1] = -1;
+	t = buf3;
+	if (utf8cursor(&u8c, tree, s))
+		return -1;
+	while ((c = utf8byte(&u8c)) > 0)
+		if (c != (unsigned char)*t++)
+			return -1;
+	if (c < 0)
+		return -1;
+	if (*t != 0)
+		return -1;
+
+	return 0;
+}
+
+static void
+normalization_test(void)
+{
+	FILE *file;
+	unsigned int unichar;
+	struct unicode_data *data;
+	char *s;
+	char *t;
+	int ret;
+	int ignorables;
+	int tests = 0;
+	int failures = 0;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", test_name);
+	/* Step one, read data from file. */
+	file = fopen(test_name, "r");
+	if (!file)
+		open_fail(test_name, errno);
+
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%[^;];%*[^;];%*[^;];%*[^;];%[^;];",
+			     buf0, buf1);
+		if (ret != 2 || *line == '#')
+			continue;
+		s = buf0;
+		t = buf2;
+		while (*s) {
+			unichar = strtoul(s, &s, 16);
+			t += utf8key(unichar, t);
+		}
+		*t = '\0';
+
+		ignorables = 0;
+		s = buf1;
+		t = buf3;
+		while (*s) {
+			unichar = strtoul(s, &s, 16);
+			data = &unicode_data[unichar];
+			if (data->utf8nfkdi && !*data->utf8nfkdi)
+				ignorables = 1;
+			else
+				t += utf8key(unichar, t);
+		}
+		*t = '\0';
+
+		tests++;
+		if (normalize_line(nfkdi_tree) < 0) {
+			printf("\nline %s -> %s", buf0, buf1);
+			if (ignorables)
+				printf(" (ignorables removed)");
+			printf(" failure\n");
+			failures++;
+		}
+	}
+	fclose(file);
+	if (verbose > 0)
+		printf("Ran %d tests with %d failures\n", tests, failures);
+	if (failures)
+		file_fail(test_name);
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+write_file(void)
+{
+	FILE *file;
+	int i;
+	int j;
+	int t;
+	int gen;
+
+	if (verbose > 0)
+		printf("Writing %s\n", utf8_name);
+	file = fopen(utf8_name, "w");
+	if (!file)
+		open_fail(utf8_name, errno);
+
+	fprintf(file, "/* This file is generated code, do not edit. */\n");
+	fprintf(file, "#ifndef __INCLUDED_FROM_UTF8NORM_C__\n");
+	fprintf(file, "#error Only xfs_utf8.c may include this file.\n");
+	fprintf(file, "#endif\n");
+	fprintf(file, "\n");
+	fprintf(file, "const unsigned int utf8version = %#x;\n",
+		unicode_maxage);
+	fprintf(file, "\n");
+	fprintf(file, "static const unsigned int utf8agetab[] = {\n");
+	for (i = 0; i != ages_count; i++)
+		fprintf(file, "\t%#x%s\n", ages[i],
+			ages[i] == unicode_maxage ? "" : ",");
+	fprintf(file, "};\n");
+	fprintf(file, "\n");
+	fprintf(file, "static const struct utf8data utf8nfkdicfdata[] = {\n");
+	t = 0;
+	for (gen = 0; gen < ages_count; gen++) {
+		fprintf(file, "\t{ %#x, %d }%s\n",
+			ages[gen], trees[t].index,
+			ages[gen] == unicode_maxage ? "" : ",");
+		if (trees[t].maxage == ages[gen])
+			t += 2;
+	}
+	fprintf(file, "};\n");
+	fprintf(file, "\n");
+	fprintf(file, "static const struct utf8data utf8nfkdidata[] = {\n");
+	t = 1;
+	for (gen = 0; gen < ages_count; gen++) {
+		fprintf(file, "\t{ %#x, %d }%s\n",
+			ages[gen], trees[t].index,
+			ages[gen] == unicode_maxage ? "" : ",");
+		if (trees[t].maxage == ages[gen])
+			t += 2;
+	}
+	fprintf(file, "};\n");
+	fprintf(file, "\n");
+	fprintf(file, "static const unsigned char utf8data[%zd] = {\n",
+		utf8data_size);
+	t = 0;
+	for (i = 0; i != utf8data_size; i += 16) {
+		if (i == trees[t].index) {
+			fprintf(file, "\t/* %s_%x */\n",
+				trees[t].type, trees[t].maxage);
+			if (t < trees_count-1)
+				t++;
+		}
+		fprintf(file, "\t");
+		for (j = i; j != i + 16; j++)
+			fprintf(file, "0x%.2x%s", utf8data[j],
+				(j < utf8data_size -1 ? "," : ""));
+		fprintf(file, "\n");
+	}
+	fprintf(file, "};\n");
+	fclose(file);
+}
+
+/* ------------------------------------------------------------------ */
+
+int
+main(int argc, char *argv[])
+{
+	unsigned int unichar;
+	int opt;
+
+	argv0 = argv[0];
+
+	while ((opt = getopt(argc, argv, "a:c:d:f:hn:o:p:t:v")) != -1) {
+		switch (opt) {
+		case 'a':
+			age_name = optarg;
+			break;
+		case 'c':
+			ccc_name = optarg;
+			break;
+		case 'd':
+			data_name = optarg;
+			break;
+		case 'f':
+			fold_name = optarg;
+			break;
+		case 'n':
+			norm_name = optarg;
+			break;
+		case 'o':
+			utf8_name = optarg;
+			break;
+		case 'p':
+			prop_name = optarg;
+			break;
+		case 't':
+			test_name = optarg;
+			break;
+		case 'v':
+			verbose++;
+			break;
+		case 'h':
+			help();
+			exit(0);
+		default:
+			usage();
+		}
+	}
+
+	if (verbose > 1)
+		help();
+	for (unichar = 0; unichar != 0x110000; unichar++)
+		unicode_data[unichar].code = unichar;
+	age_init();
+	ccc_init();
+	nfkdi_init();
+	nfkdicf_init();
+	ignore_init();
+	corrections_init();
+	hangul_decompose();
+	nfkdi_decompose();
+	nfkdicf_decompose();
+	utf8_init();
+	trees_init();
+	trees_populate();
+	trees_reduce();
+	trees_verify();
+	/* Prevent "unused function" warning. */
+	(void)lookup(nfkdi_tree, " ");
+	if (verbose > 2)
+		tree_walk(nfkdi_tree);
+	if (verbose > 2)
+		tree_walk(nfkdicf_tree);
+	normalization_test();
+	write_file();
+
+	return 0;
+}
-- 
1.7.12.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 08/13] libxfs: add xfs_nameops for utf8 and utf8+casefold.
  2014-09-18 20:31 ` [PATCH 00/13] xfsprogs: Unicode/UTF-8 support for XFS Ben Myers
                     ` (6 preceding siblings ...)
  2014-09-18 20:38   ` [PATCH 07/13] libxfs: add trie generator and supporting code for UTF-8 Ben Myers
@ 2014-09-18 20:38   ` Ben Myers
  2014-09-18 20:39   ` [PATCH 09/13] libxfs: apply utf-8 normalization rules to user extended attribute names Ben Myers
                     ` (6 subsequent siblings)
  14 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-18 20:38 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: tinguely, olaf, xfs

From: Olaf Weber <olaf@sgi.com>

The xfs_utf8_nameops use the nfkdi normalization when comparing filenames,
and are installed if the utf8bit is set in the super block.

The xfs_utf8_ci_nameops use the nfkdicf normalization when comparing
filenames, and are installed if both the utf8bit and the borgbit are set
in the superblock.

Normalized filenames are not stored on disk. Normalization will fail if a
filename is not valid UTF-8, in which case the filename is treated as an
opaque blob.

Changes:
 Type conversion to "(const char *)" added to utf8ncursor() and utf8nlen()
 calls.

Signed-off-by: Olaf Weber <olaf@sgi.com>
---
 Makefile           |   2 +-
 include/libxfs.h   |   1 +
 include/xfs_utf8.h |  25 ++++++
 libxfs/Makefile    |   4 +-
 libxfs/xfs_dir2.c  |  15 +++-
 libxfs/xfs_utf8.c  | 238 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 support/Makefile   |  24 ++++++
 7 files changed, 303 insertions(+), 6 deletions(-)
 create mode 100644 include/xfs_utf8.h
 create mode 100644 libxfs/xfs_utf8.c
 create mode 100644 support/Makefile

diff --git a/Makefile b/Makefile
index f56aebd..c442da6 100644
--- a/Makefile
+++ b/Makefile
@@ -40,7 +40,7 @@ LDIRDIRT = $(SRCDIR)
 LDIRT += $(SRCTAR)
 endif
 
-LIB_SUBDIRS = libxfs libxlog libxcmd libhandle libdisk
+LIB_SUBDIRS = support libxfs libxlog libxcmd libhandle libdisk
 TOOL_SUBDIRS = copy db estimate fsck fsr growfs io logprint mkfs quota \
 		mdrestore repair rtcp m4 man doc po debian
 
diff --git a/include/libxfs.h b/include/libxfs.h
index 45a924f..99cb3d9 100644
--- a/include/libxfs.h
+++ b/include/libxfs.h
@@ -59,6 +59,7 @@
 #include <xfs/xfs_btree_trace.h>
 #include <xfs/xfs_bmap.h>
 #include <xfs/xfs_trace.h>
+#include <xfs_utf8.h>
 
 
 #ifndef ARRAY_SIZE
diff --git a/include/xfs_utf8.h b/include/xfs_utf8.h
new file mode 100644
index 0000000..97b6a91
--- /dev/null
+++ b/include/xfs_utf8.h
@@ -0,0 +1,25 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#ifndef XFS_UTF8_H
+#define XFS_UTF8_H
+
+extern struct xfs_nameops xfs_utf8_nameops;
+extern struct xfs_nameops xfs_utf8_ci_nameops;
+
+#endif /* XFS_UTF8_H */
diff --git a/libxfs/Makefile b/libxfs/Makefile
index ae15a5d..d836027 100644
--- a/libxfs/Makefile
+++ b/libxfs/Makefile
@@ -14,6 +14,7 @@ HFILES = xfs.h init.h xfs_dir2_priv.h crc32defs.h crc32table.h
 CFILES = cache.c \
 	crc32.c \
 	init.c kmem.c logitem.c radix-tree.c rdwr.c trans.c util.c \
+	utf8norm.c \
 	xfs_alloc.c \
 	xfs_alloc_btree.c \
 	xfs_attr.c \
@@ -38,7 +39,8 @@ CFILES = cache.c \
 	xfs_rtbitmap.c \
 	xfs_sb.c \
 	xfs_symlink_remote.c \
-	xfs_trans_resv.c
+	xfs_trans_resv.c \
+	xfs_utf8.c
 
 CFILES += $(PKG_PLATFORM).c
 PCFILES = darwin.c freebsd.c irix.c linux.c
diff --git a/libxfs/xfs_dir2.c b/libxfs/xfs_dir2.c
index 1893931..6872844 100644
--- a/libxfs/xfs_dir2.c
+++ b/libxfs/xfs_dir2.c
@@ -123,10 +123,17 @@ xfs_dir_mount(
 				(uint)sizeof(xfs_da_node_entry_t);
 
 	mp->m_dir_magicpct = (mp->m_dirblksize * 37) / 100;
-	if (xfs_sb_version_hasasciici(&mp->m_sb))
-		mp->m_dirnameops = &xfs_ascii_ci_nameops;
-	else
-		mp->m_dirnameops = &xfs_default_nameops;
+	if (xfs_sb_version_hasutf8(&mp->m_sb)) {
+		if (xfs_sb_version_hasasciici(&mp->m_sb))
+			mp->m_dirnameops = &xfs_utf8_ci_nameops;
+		else
+			mp->m_dirnameops = &xfs_utf8_nameops;
+	} else {
+		if (xfs_sb_version_hasasciici(&mp->m_sb))
+			mp->m_dirnameops = &xfs_ascii_ci_nameops;
+		else
+			mp->m_dirnameops = &xfs_default_nameops;
+	}
 }
 
 /*
diff --git a/libxfs/xfs_utf8.c b/libxfs/xfs_utf8.c
new file mode 100644
index 0000000..f5cc231
--- /dev/null
+++ b/libxfs/xfs_utf8.c
@@ -0,0 +1,238 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_types.h"
+#include "xfs_bit.h"
+#include "xfs_inum.h"
+#include "xfs_sb.h"
+#include "xfs_ag.h"
+#include "xfs_dir2.h"
+#include "xfs_da_btree.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_alloc_btree.h"
+#include "xfs_dinode.h"
+#include "xfs_inode_fork.h"
+#include "xfs_bmap.h"
+#include "xfs_dir2.h"
+#include "xfs_trace.h"
+#include "xfs_utf8.h"
+#include "utf8norm.h"
+
+/*
+ * xfs nameops using nfkdi
+ */
+
+static xfs_dahash_t
+xfs_utf8_hashname(
+	const unsigned char *name,
+	int len)
+{
+	utf8data_t	nfkdi;
+	struct utf8cursor u8c;
+	xfs_dahash_t	hash;
+	int		val;
+
+	nfkdi = utf8nfkdi(utf8version);
+	hash = 0;
+	if (utf8ncursor(&u8c, nfkdi, (const char *)name, len) < 0)
+		goto blob;
+	while ((val = utf8byte(&u8c)) > 0)
+		hash = val ^ rol32(hash, 7);
+	/* In case of error treat the name as a binary blob. */
+	if (val == 0)
+		return hash;
+blob:
+	return xfs_da_hashname(name, len);
+}
+
+static int
+xfs_utf8_normhash(
+	struct xfs_da_args *args)
+{
+	utf8data_t	nfkdi;
+	struct utf8cursor u8c;
+	unsigned char	*norm;
+	ssize_t		normlen;
+	int		c;
+
+	nfkdi = utf8nfkdi(utf8version);
+	/* Failure to normalize is treated as a blob. */
+	if ((normlen = utf8nlen(nfkdi, (const char *)args->name,
+				 args->namelen)) < 0)
+		goto blob;
+	if (utf8ncursor(&u8c, nfkdi, (const char *)args->name,
+			args->namelen) < 0)
+		goto blob;
+	if (!(norm = kmem_alloc(normlen + 1, KM_NOFS|KM_MAYFAIL)))
+		return ENOMEM;
+	args->norm = norm;
+	args->normlen = normlen;
+	while ((c = utf8byte(&u8c)) > 0)
+		*norm++ = c;
+	if (c == 0) {
+		*norm = '\0';
+		args->hashval = xfs_da_hashname(args->norm, args->normlen);
+		return 0;
+	}
+	kmem_free((void *)args->norm);
+blob:
+	args->norm = NULL;
+	args->normlen = -1;
+	args->hashval = xfs_da_hashname(args->name, args->namelen);
+	return 0;
+}
+
+static enum xfs_dacmp
+xfs_utf8_compname(
+	struct xfs_da_args *args,
+	const unsigned char *name,
+	int		len)
+{
+	utf8data_t	nfkdi;
+	struct utf8cursor u8c;
+	const char	*norm;
+	int		c;
+
+	ASSERT(args->norm || args->normlen == -1);
+
+	/* Check for an exact match first. */
+	if (args->namelen == len && memcmp(args->name, name, len) == 0)
+		return XFS_CMP_EXACT;
+	/* xfs_utf8_normhash() set args->normlen to -1 for a blob */
+	if (args->normlen < 0)
+		return XFS_CMP_DIFFERENT;
+	nfkdi = utf8nfkdi(utf8version);
+	if (utf8ncursor(&u8c, nfkdi, (const char *)name, len) < 0)
+		return XFS_CMP_DIFFERENT;
+	norm = (const char *)args->norm;
+	while ((c = utf8byte(&u8c)) > 0)
+		if (c != *norm++)
+			return XFS_CMP_DIFFERENT;
+	if (c < 0 || *norm != '\0')
+		return XFS_CMP_DIFFERENT;
+	return XFS_CMP_MATCH;
+}
+
+struct xfs_nameops xfs_utf8_nameops = {
+	.hashname = xfs_utf8_hashname,
+	.normhash = xfs_utf8_normhash,
+	.compname = xfs_utf8_compname,
+};
+
+/*
+ * xfs nameops using nfkdicf
+ */
+
+static xfs_dahash_t
+xfs_utf8_ci_hashname(
+	const unsigned char *name,
+	int len)
+{
+	utf8data_t	nfkdicf;
+	struct utf8cursor u8c;
+	xfs_dahash_t	hash;
+	int		val;
+
+	nfkdicf = utf8nfkdicf(utf8version);
+	hash = 0;
+	if (utf8ncursor(&u8c, nfkdicf, (const char *)name, len) < 0)
+		goto blob;
+	while ((val = utf8byte(&u8c)) > 0)
+		hash = val ^ rol32(hash, 7);
+	/* In case of error treat the name as a binary blob. */
+	if (val == 0)
+		return hash;
+blob:
+	return xfs_da_hashname(name, len);
+}
+
+static int
+xfs_utf8_ci_normhash(
+	struct xfs_da_args *args)
+{
+	utf8data_t	nfkdicf;
+	struct utf8cursor u8c;
+	unsigned char	*norm;
+	ssize_t		normlen;
+	int		c;
+
+	nfkdicf = utf8nfkdicf(utf8version);
+	/* Failure to normalize is treated as a blob. */
+	if ((normlen = utf8nlen(nfkdicf, (const char *)args->name,
+				args->namelen)) < 0)
+		goto blob;
+	if (utf8ncursor(&u8c, nfkdicf, (const char *)args->name,
+			args->namelen) < 0)
+		goto blob;
+	if (!(norm = kmem_alloc(normlen + 1, KM_NOFS|KM_MAYFAIL)))
+		return ENOMEM;
+	args->norm = norm;
+	args->normlen = normlen;
+	while ((c = utf8byte(&u8c)) > 0)
+		*norm++ = c;
+	if (c == 0) {
+		*norm = '\0';
+		args->hashval = xfs_da_hashname(args->norm, args->normlen);
+		return 0;
+	}
+	kmem_free((void *)args->norm);
+blob:
+	args->norm = NULL;
+	args->normlen = -1;
+	args->hashval = xfs_da_hashname(args->name, args->namelen);
+	return 0;
+}
+
+static enum xfs_dacmp
+xfs_utf8_ci_compname(
+	struct xfs_da_args *args,
+	const unsigned char *name,
+	int		len)
+{
+	utf8data_t	nfkdicf;
+	struct utf8cursor u8c;
+	const unsigned char *norm;
+	int		c;
+
+	ASSERT(args->norm || args->normlen == -1);
+
+	/* Check for an exact match first. */
+	if (args->namelen == len && memcmp(args->name, name, len) == 0)
+		return XFS_CMP_EXACT;
+	/* xfs_utf8_ci_normhash() set args->normlen to -1 for a blob */
+	if (args->normlen < 0)
+		return XFS_CMP_DIFFERENT;
+	nfkdicf = utf8nfkdicf(utf8version);
+	if (utf8ncursor(&u8c, nfkdicf, (const char *)name, len) < 0)
+		return XFS_CMP_DIFFERENT;
+	norm = args->norm;
+	while ((c = utf8byte(&u8c)) > 0)
+		if (c != *norm++)
+			return XFS_CMP_DIFFERENT;
+	if (c < 0 || *norm != '\0')
+		return XFS_CMP_DIFFERENT;
+	return XFS_CMP_MATCH;
+}
+
+struct xfs_nameops xfs_utf8_ci_nameops = {
+	.hashname = xfs_utf8_ci_hashname,
+	.normhash = xfs_utf8_ci_normhash,
+	.compname = xfs_utf8_ci_compname,
+};
diff --git a/support/Makefile b/support/Makefile
new file mode 100644
index 0000000..cade5fe
--- /dev/null
+++ b/support/Makefile
@@ -0,0 +1,24 @@
+#
+# Copyright (c) 2014 SGI. All Rights Reserved.
+#
+
+TOPDIR = ..
+include $(TOPDIR)/include/builddefs
+
+default = ../include/utf8data.h
+
+../include/utf8data.h:	mkutf8data.c
+	cc -o mkutf8data mkutf8data.c
+	cd ucd-7.0.0 ; ../mkutf8data
+	mv ucd-7.0.0/utf8data.h ../include
+
+default clean:
+	rm -f mkutf8data ../include/utf8data.h
+
+default install:
+
+default install-dev:
+
+default install-qa:
+
+-include .ltdep
-- 
1.7.12.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 09/13] libxfs: apply utf-8 normalization rules to user extended attribute names
  2014-09-18 20:31 ` [PATCH 00/13] xfsprogs: Unicode/UTF-8 support for XFS Ben Myers
                     ` (7 preceding siblings ...)
  2014-09-18 20:38   ` [PATCH 08/13] libxfs: add xfs_nameops for utf8 and utf8+casefold Ben Myers
@ 2014-09-18 20:39   ` Ben Myers
  2014-09-18 20:40   ` [PATCH 10/13] xfsprogs: add utf8 support to growfs Ben Myers
                     ` (5 subsequent siblings)
  14 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-18 20:39 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: tinguely, olaf, xfs

From: Olaf Weber <olaf@sgi.com>

Apply the same rules for UTF-8 normalization to the names of user-defined
extended attributes. System attributes are excluded because they are not
user-visible in the first place, and the kernel is expected to know what
it is doing when naming them.

Signed-off-by: Olaf Weber <olaf@sgi.com>
---
 libxfs/xfs_attr.c      | 49 +++++++++++++++++++++++++++++++++++++++++--------
 libxfs/xfs_attr_leaf.c | 11 +++++++++--
 libxfs/xfs_utf8.c      |  7 +++++++
 3 files changed, 57 insertions(+), 10 deletions(-)

diff --git a/libxfs/xfs_attr.c b/libxfs/xfs_attr.c
index 17519d3..c30703b 100644
--- a/libxfs/xfs_attr.c
+++ b/libxfs/xfs_attr.c
@@ -88,8 +88,9 @@ xfs_attr_get_int(
 	int			*valuelenp,
 	int			flags)
 {
-	xfs_da_args_t   args;
-	int             error;
+	xfs_da_args_t   	args;
+	struct xfs_mount	*mp = ip->i_mount;
+	int             	error;
 
 	if (!xfs_inode_hasattr(ip))
 		return ENOATTR;
@@ -103,9 +104,12 @@ xfs_attr_get_int(
 	args.value = value;
 	args.valuelen = *valuelenp;
 	args.flags = flags;
-	args.hashval = xfs_da_hashname(args.name, args.namelen);
 	args.dp = ip;
 	args.whichfork = XFS_ATTR_FORK;
+	if (! xfs_sb_version_hasutf8(&mp->m_sb))
+		args.hashval = xfs_da_hashname(args.name, args.namelen);
+	else if ((error = mp->m_dirnameops->normhash(&args)) != 0)
+		return error;
 
 	/*
 	 * Decide on what work routines to call based on the inode size.
@@ -118,6 +122,9 @@ xfs_attr_get_int(
 		error = xfs_attr_node_get(&args);
 	}
 
+	if (args.norm)
+		kmem_free((void *)args.norm);
+
 	/*
 	 * Return the number of bytes in the value to the caller.
 	 */
@@ -239,12 +246,15 @@ xfs_attr_set_int(
 	args.value = value;
 	args.valuelen = valuelen;
 	args.flags = flags;
-	args.hashval = xfs_da_hashname(args.name, args.namelen);
 	args.dp = dp;
 	args.firstblock = &firstblock;
 	args.flist = &flist;
 	args.whichfork = XFS_ATTR_FORK;
 	args.op_flags = XFS_DA_OP_ADDNAME | XFS_DA_OP_OKNOENT;
+	if (! xfs_sb_version_hasutf8(&mp->m_sb))
+		args.hashval = xfs_da_hashname(args.name, args.namelen);
+	else if ((error = mp->m_dirnameops->normhash(&args)) != 0)
+		return error;
 
 	/* Size is now blocks for attribute data */
 	args.total = xfs_attr_calc_size(dp, name->len, valuelen, &local);
@@ -276,6 +286,8 @@ xfs_attr_set_int(
 	error = xfs_trans_reserve(args.trans, &tres, args.total, 0);
 	if (error) {
 		xfs_trans_cancel(args.trans, 0);
+		if (args.norm)
+			kmem_free((void *)args.norm);
 		return(error);
 	}
 	xfs_ilock(dp, XFS_ILOCK_EXCL);
@@ -286,6 +298,8 @@ xfs_attr_set_int(
 	if (error) {
 		xfs_iunlock(dp, XFS_ILOCK_EXCL);
 		xfs_trans_cancel(args.trans, XFS_TRANS_RELEASE_LOG_RES);
+		if (args.norm)
+			kmem_free((void *)args.norm);
 		return (error);
 	}
 
@@ -333,7 +347,8 @@ xfs_attr_set_int(
 			err2 = xfs_trans_commit(args.trans,
 						 XFS_TRANS_RELEASE_LOG_RES);
 			xfs_iunlock(dp, XFS_ILOCK_EXCL);
-
+			if (args.norm)
+				kmem_free((void *)args.norm);
 			return(error == 0 ? err2 : error);
 		}
 
@@ -398,6 +413,8 @@ xfs_attr_set_int(
 	xfs_trans_log_inode(args.trans, dp, XFS_ILOG_CORE);
 	error = xfs_trans_commit(args.trans, XFS_TRANS_RELEASE_LOG_RES);
 	xfs_iunlock(dp, XFS_ILOCK_EXCL);
+	if (args.norm)
+		kmem_free((void *)args.norm);
 
 	return(error);
 
@@ -406,6 +423,9 @@ out:
 		xfs_trans_cancel(args.trans,
 			XFS_TRANS_RELEASE_LOG_RES|XFS_TRANS_ABORT);
 	xfs_iunlock(dp, XFS_ILOCK_EXCL);
+	if (args.norm)
+		kmem_free((void *)args.norm);
+
 	return(error);
 }
 
@@ -452,12 +472,15 @@ xfs_attr_remove_int(xfs_inode_t *dp, struct xfs_name *name, int flags)
 	args.name = name->name;
 	args.namelen = name->len;
 	args.flags = flags;
-	args.hashval = xfs_da_hashname(args.name, args.namelen);
 	args.dp = dp;
 	args.firstblock = &firstblock;
 	args.flist = &flist;
 	args.total = 0;
 	args.whichfork = XFS_ATTR_FORK;
+	if (! xfs_sb_version_hasutf8(&mp->m_sb))
+		args.hashval = xfs_da_hashname(args.name, args.namelen);
+	else if ((error = mp->m_dirnameops->normhash(&args)) != 0)
+		return error;
 
 	/*
 	 * we have no control over the attribute names that userspace passes us
@@ -470,8 +493,11 @@ xfs_attr_remove_int(xfs_inode_t *dp, struct xfs_name *name, int flags)
 	 * Attach the dquots to the inode.
 	 */
 	error = xfs_qm_dqattach(dp, 0);
-	if (error)
-		return error;
+	if (error) {
+		if (args.norm)
+			kmem_free((void *)args.norm);
+			return error;
+	}
 
 	/*
 	 * Start our first transaction of the day.
@@ -497,6 +523,8 @@ xfs_attr_remove_int(xfs_inode_t *dp, struct xfs_name *name, int flags)
 				  XFS_ATTRRM_SPACE_RES(mp), 0);
 	if (error) {
 		xfs_trans_cancel(args.trans, 0);
+		if (args.norm)
+			kmem_free((void *)args.norm);
 		return(error);
 	}
 
@@ -546,6 +574,8 @@ xfs_attr_remove_int(xfs_inode_t *dp, struct xfs_name *name, int flags)
 	xfs_trans_log_inode(args.trans, dp, XFS_ILOG_CORE);
 	error = xfs_trans_commit(args.trans, XFS_TRANS_RELEASE_LOG_RES);
 	xfs_iunlock(dp, XFS_ILOCK_EXCL);
+	if (args.norm)
+		kmem_free((void *)args.norm);
 
 	return(error);
 
@@ -554,6 +584,9 @@ out:
 		xfs_trans_cancel(args.trans,
 			XFS_TRANS_RELEASE_LOG_RES|XFS_TRANS_ABORT);
 	xfs_iunlock(dp, XFS_ILOCK_EXCL);
+	if (args.norm)
+		kmem_free((void *)args.norm);
+
 	return(error);
 }
 
diff --git a/libxfs/xfs_attr_leaf.c b/libxfs/xfs_attr_leaf.c
index f7f02ae..052a6a1 100644
--- a/libxfs/xfs_attr_leaf.c
+++ b/libxfs/xfs_attr_leaf.c
@@ -634,6 +634,7 @@ int
 xfs_attr_shortform_to_leaf(xfs_da_args_t *args)
 {
 	xfs_inode_t *dp;
+	struct xfs_mount *mp;
 	xfs_attr_shortform_t *sf;
 	xfs_attr_sf_entry_t *sfe;
 	xfs_da_args_t nargs;
@@ -646,6 +647,7 @@ xfs_attr_shortform_to_leaf(xfs_da_args_t *args)
 	trace_xfs_attr_sf_to_leaf(args);
 
 	dp = args->dp;
+	mp = dp->i_mount;
 	ifp = dp->i_afp;
 	sf = (xfs_attr_shortform_t *)ifp->if_u1.if_data;
 	size = be16_to_cpu(sf->hdr.totsize);
@@ -698,13 +700,18 @@ xfs_attr_shortform_to_leaf(xfs_da_args_t *args)
 		nargs.namelen = sfe->namelen;
 		nargs.value = &sfe->nameval[nargs.namelen];
 		nargs.valuelen = sfe->valuelen;
-		nargs.hashval = xfs_da_hashname(sfe->nameval,
-						sfe->namelen);
 		nargs.flags = XFS_ATTR_NSP_ONDISK_TO_ARGS(sfe->flags);
+		if (! xfs_sb_version_hasutf8(&mp->m_sb))
+			nargs.hashval = xfs_da_hashname(sfe->nameval,
+							sfe->namelen);
+		else if ((error = mp->m_dirnameops->normhash(&nargs)) != 0)
+			goto out;
 		error = xfs_attr3_leaf_lookup_int(bp, &nargs); /* set a->index */
 		ASSERT(error == ENOATTR);
 		error = xfs_attr3_leaf_add(bp, &nargs);
 		ASSERT(error != ENOSPC);
+		if (nargs.norm)
+			 kmem_free((void *)nargs.norm);
 		if (error)
 			goto out;
 		sfe = XFS_ATTR_SF_NEXTENTRY(sfe);
diff --git a/libxfs/xfs_utf8.c b/libxfs/xfs_utf8.c
index f5cc231..5c69591 100644
--- a/libxfs/xfs_utf8.c
+++ b/libxfs/xfs_utf8.c
@@ -31,6 +31,7 @@
 #include "xfs_inode_fork.h"
 #include "xfs_bmap.h"
 #include "xfs_dir2.h"
+#include "xfs_attr_leaf.h"
 #include "xfs_trace.h"
 #include "xfs_utf8.h"
 #include "utf8norm.h"
@@ -72,6 +73,9 @@ xfs_utf8_normhash(
 	ssize_t		normlen;
 	int		c;
 
+	/* Don't normalize system attribute names. */
+	if (args->flags & (ATTR_ROOT|ATTR_SECURE))
+		goto blob;
 	nfkdi = utf8nfkdi(utf8version);
 	/* Failure to normalize is treated as a blob. */
 	if ((normlen = utf8nlen(nfkdi, (const char *)args->name,
@@ -173,6 +177,9 @@ xfs_utf8_ci_normhash(
 	ssize_t		normlen;
 	int		c;
 
+	/* Don't normalize system attribute names. */
+	if (args->flags & (ATTR_ROOT|ATTR_SECURE))
+		goto blob;
 	nfkdicf = utf8nfkdicf(utf8version);
 	/* Failure to normalize is treated as a blob. */
 	if ((normlen = utf8nlen(nfkdicf, (const char *)args->name,
-- 
1.7.12.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 10/13] xfsprogs: add utf8 support to growfs
  2014-09-18 20:31 ` [PATCH 00/13] xfsprogs: Unicode/UTF-8 support for XFS Ben Myers
                     ` (8 preceding siblings ...)
  2014-09-18 20:39   ` [PATCH 09/13] libxfs: apply utf-8 normalization rules to user extended attribute names Ben Myers
@ 2014-09-18 20:40   ` Ben Myers
  2014-09-18 20:41   ` [PATCH 11/13] xfsprogs: add utf8 support to mkfs.xfs Ben Myers
                     ` (4 subsequent siblings)
  14 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-18 20:40 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: tinguely, olaf, xfs

From: Mark Tinguely <tinguely@sgi.com>

Add reporting of the utf-8 mkfs options to xfs_growfs and xfs_info.

Signed-off-by: Mark Tinguely <tinguely@sgi.com>
---
 growfs/xfs_growfs.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/growfs/xfs_growfs.c b/growfs/xfs_growfs.c
index 8e611b6..6c41803 100644
--- a/growfs/xfs_growfs.c
+++ b/growfs/xfs_growfs.c
@@ -57,7 +57,8 @@ report_info(
 	int		crcs_enabled,
 	int		cimode,
 	int		ftype_enabled,
-	int		finobt_enabled)
+	int		finobt_enabled,
+	int		utf8)
 {
 	printf(_(
 	    "meta-data=%-22s isize=%-6u agcount=%u, agsize=%u blks\n"
@@ -65,7 +66,7 @@ report_info(
 	    "         =%-22s crc=%-8u finobt=%u\n"
 	    "data     =%-22s bsize=%-6u blocks=%llu, imaxpct=%u\n"
 	    "         =%-22s sunit=%-6u swidth=%u blks\n"
-	    "naming   =version %-14u bsize=%-6u ascii-ci=%d ftype=%d\n"
+	    "naming   =version %-14u bsize=%-6u ascii-ci=%d ftype=%d utf8=%d\n"
 	    "log      =%-22s bsize=%-6u blocks=%u, version=%u\n"
 	    "         =%-22s sectsz=%-5u sunit=%u blks, lazy-count=%u\n"
 	    "realtime =%-22s extsz=%-6u blocks=%llu, rtextents=%llu\n"),
@@ -76,7 +77,7 @@ report_info(
 		"", geo.blocksize, (unsigned long long)geo.datablocks,
 			geo.imaxpct,
 		"", geo.sunit, geo.swidth,
-  		dirversion, geo.dirblocksize, cimode, ftype_enabled,
+  		dirversion, geo.dirblocksize, cimode, ftype_enabled, utf8,
 		isint ? _("internal") : logname ? logname : _("external"),
 			geo.blocksize, geo.logblocks, logversion,
 		"", geo.logsectsize, geo.logsunit / geo.blocksize, lazycount,
@@ -114,6 +115,7 @@ main(int argc, char **argv)
 	long long		rsize;	/* new rt size in fs blocks */
 	int			ci;	/* ASCII case-insensitive fs */
 	int			lazycount; /* lazy superblock counters */
+	int			utf8;	/* Unicode chars supported */
 	int			xflag;	/* -x flag */
 	char			*fname;	/* mount point name */
 	char			*datadev; /* data device name */
@@ -247,11 +249,12 @@ main(int argc, char **argv)
 	crcs_enabled = geo.flags & XFS_FSOP_GEOM_FLAGS_V5SB ? 1 : 0;
 	ftype_enabled = geo.flags & XFS_FSOP_GEOM_FLAGS_FTYPE ? 1 : 0;
 	finobt_enabled = geo.flags & XFS_FSOP_GEOM_FLAGS_FINOBT ? 1 : 0;
+	utf8 = geo.flags & XFS_FSOP_GEOM_FLAGS_UTF8 ? 1 : 0;
 	if (nflag) {
 		report_info(geo, datadev, isint, logdev, rtdev,
 				lazycount, dirversion, logversion,
 				attrversion, projid32bit, crcs_enabled, ci,
-				ftype_enabled, finobt_enabled);
+				ftype_enabled, finobt_enabled, utf8);
 		exit(0);
 	}
 
@@ -289,7 +292,7 @@ main(int argc, char **argv)
 	report_info(geo, datadev, isint, logdev, rtdev,
 			lazycount, dirversion, logversion,
 			attrversion, projid32bit, crcs_enabled, ci, ftype_enabled,
-			finobt_enabled);
+			finobt_enabled, utf8);
 
 	ddsize = xi.dsize;
 	dlsize = ( xi.logBBsize? xi.logBBsize :
-- 
1.7.12.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 11/13] xfsprogs: add utf8 support to mkfs.xfs
  2014-09-18 20:31 ` [PATCH 00/13] xfsprogs: Unicode/UTF-8 support for XFS Ben Myers
                     ` (9 preceding siblings ...)
  2014-09-18 20:40   ` [PATCH 10/13] xfsprogs: add utf8 support to growfs Ben Myers
@ 2014-09-18 20:41   ` Ben Myers
  2014-09-18 20:42     ` Ben Myers
                     ` (3 subsequent siblings)
  14 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-18 20:41 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: tinguely, olaf, xfs

From: Mark Tinguely <tinguely@sgi.com>

Set the utf-8 feature bit.

Signed-off-by: Mark Tinguely <tinguely@sgi.com>
---
 man/man8/mkfs.xfs.8 |  9 ++++++++-
 mkfs/xfs_mkfs.c     | 27 ++++++++++++++++++++++-----
 mkfs/xfs_mkfs.h     |  3 ++-
 3 files changed, 32 insertions(+), 7 deletions(-)

diff --git a/man/man8/mkfs.xfs.8 b/man/man8/mkfs.xfs.8
index ad9ff3d..aa43cf5 100644
--- a/man/man8/mkfs.xfs.8
+++ b/man/man8/mkfs.xfs.8
@@ -558,7 +558,7 @@ any power of 2 size from the filesystem block size up to 65536.
 .IP
 The
 .B version=ci
-option enables ASCII only case-insensitive filename lookup and version
+option enables ASCII or UTF-8 case-insensitive filename lookup and version
 2 directories. Filenames are case-preserving, that is, the names
 are stored in directories using the case they were created with.
 .IP
@@ -582,6 +582,13 @@ When CRCs are enabled via
 the ftype functionality is always enabled. This feature can not be turned
 off for such filesystem configurations.
 .IP
+.TP
+.BI utf8[= value ]
+This is used to enable the UTF-8 character set support. The
+.I value
+is either 0 or 1, with 1 signifying that UTF-8 character support is to be
+enabled. If the value is omitted, 1 is assumed.
+.IP
 .RE
 .TP
 .BI \-p " protofile"
diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index c85258a..1829e51 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -149,6 +149,8 @@ char	*nopts[] = {
 	"version",
 #define	N_FTYPE		3
 	"ftype",
+#define	N_UTF8		4
+	"utf8",
 	NULL,
 };
 
@@ -958,6 +960,7 @@ main(
 	int			nsflag;
 	int			nvflag;
 	int			nci;
+	int			utf8;
 	int			Nflag;
 	int			discard = 1;
 	char			*p;
@@ -1004,6 +1007,7 @@ main(
 	logagno = logblocks = rtblocks = rtextblocks = 0;
 	Nflag = nlflag = nsflag = nvflag = nci = 0;
 	nftype = dirftype = 0;		/* inode type information in the dir */
+	utf8 = 0;			/* utf-8 support */
 	dirblocklog = dirblocksize = 0;
 	dirversion = XFS_DFL_DIR_VERSION;
 	qflag = 0;
@@ -1565,7 +1569,8 @@ _("cannot specify both crc and ftype\n"));
 					if (nvflag)
 						respec('n', nopts, N_VERSION);
 					if (!strcasecmp(value, "ci")) {
-						nci = 1; /* ASCII CI mode */
+						/* ASCII or UTF-8 CI mode */
+						nci = 1;
 					} else {
 						dirversion = atoi(value);
 						if (dirversion != 2)
@@ -1587,6 +1592,14 @@ _("cannot specify both crc and ftype\n"));
 					}
 					nftype = 1;
 					break;
+				case N_UTF8:
+					if (!value || *value == '\0')
+						value = "1";
+					c = atoi(value);
+					if (c < 0 || c > 1)
+						illegal(value, "n utf8");
+					utf8 = c;
+					break;
 				default:
 					unknown('n', value);
 				}
@@ -2460,7 +2473,8 @@ _("size %s specified for log subvolume is too large, maximum is %lld blocks\n"),
 	 */
 	sbp->sb_features2 = XFS_SB_VERSION2_MKFS(crcs_enabled, lazy_sb_counters,
 					attrversion == 2, !projid16bit, 0,
-					(!crcs_enabled && dirftype));
+					(!crcs_enabled && dirftype),
+					(!crcs_enabled && utf8));
 	sbp->sb_versionnum = XFS_SB_VERSION_MKFS(crcs_enabled, iaflag,
 					dsunit != 0,
 					logversion == 2, attrversion == 1,
@@ -2534,6 +2548,9 @@ _("size %s specified for log subvolume is too large, maximum is %lld blocks\n"),
 	if (crcs_enabled) {
 		sbp->sb_features_incompat = XFS_SB_FEAT_INCOMPAT_FTYPE;
 		dirftype = 1;
+		/* turn on the utf-8 support */
+		if (utf8)
+			sbp->sb_features_incompat |= XFS_SB_FEAT_INCOMPAT_UTF8;
 	}
 
 	if (!qflag || Nflag) {
@@ -2543,7 +2560,7 @@ _("size %s specified for log subvolume is too large, maximum is %lld blocks\n"),
 		   "         =%-22s crc=%-8u finobt=%u\n"
 		   "data     =%-22s bsize=%-6u blocks=%llu, imaxpct=%u\n"
 		   "         =%-22s sunit=%-6u swidth=%u blks\n"
-		   "naming   =version %-14u bsize=%-6u ascii-ci=%d ftype=%d\n"
+		   "naming   =version %-14u bsize=%-6u ascii-ci=%d ftype=%d utf8=%d\n"
 		   "log      =%-22s bsize=%-6d blocks=%lld, version=%d\n"
 		   "         =%-22s sectsz=%-5u sunit=%d blks, lazy-count=%d\n"
 		   "realtime =%-22s extsz=%-6d blocks=%lld, rtextents=%lld\n"),
@@ -2552,7 +2569,7 @@ _("size %s specified for log subvolume is too large, maximum is %lld blocks\n"),
 			"", crcs_enabled, finobt,
 			"", blocksize, (long long)dblocks, imaxpct,
 			"", dsunit, dswidth,
-			dirversion, dirblocksize, nci, dirftype,
+			dirversion, dirblocksize, nci, dirftype, utf8,
 			logfile, 1 << blocklog, (long long)logblocks,
 			logversion, "", lsectorsize, lsunit, lazy_sb_counters,
 			rtfile, rtextblocks << blocklog,
@@ -3171,7 +3188,7 @@ usage( void )
 			    sunit=value|su=num,sectlog=n|sectsize=num,\n\
 			    lazy-count=0|1]\n\
 /* label */		[-L label (maximum 12 characters)]\n\
-/* naming */		[-n log=n|size=num,version=2|ci,ftype=0|1]\n\
+/* naming */		[-n log=n|size=num,version=2|ci,ftype=0|1,utf8=0|1]\n\
 /* no-op info only */	[-N]\n\
 /* prototype file */	[-p fname]\n\
 /* quiet */		[-q]\n\
diff --git a/mkfs/xfs_mkfs.h b/mkfs/xfs_mkfs.h
index 9df5f37..f40b284 100644
--- a/mkfs/xfs_mkfs.h
+++ b/mkfs/xfs_mkfs.h
@@ -37,13 +37,14 @@
 	0 ) : XFS_SB_VERSION_1 )
 
 #define XFS_SB_VERSION2_MKFS(crc, lazycount, attr2, projid32bit, parent, \
-			     ftype) (\
+			     ftype, utf8) (\
 	((lazycount) ? XFS_SB_VERSION2_LAZYSBCOUNTBIT : 0) |		\
 	((attr2) ? XFS_SB_VERSION2_ATTR2BIT : 0) |			\
 	((projid32bit) ? XFS_SB_VERSION2_PROJID32BIT : 0) |		\
 	((parent) ? XFS_SB_VERSION2_PARENTBIT : 0) |			\
 	((crc) ? XFS_SB_VERSION2_CRCBIT : 0) |				\
 	((ftype) ? XFS_SB_VERSION2_FTYPE : 0) |				\
+	((utf8) ? XFS_SB_VERSION2_UTF8BIT : 0) |			\
 	0 )
 
 #define	XFS_DFL_BLOCKSIZE_LOG	12		/* 4096 byte blocks */
-- 
1.7.12.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 12/13] xfsprogs: add utf8 support to xfs_repair
  2014-09-18 20:31 ` [PATCH 00/13] xfsprogs: Unicode/UTF-8 support for XFS Ben Myers
@ 2014-09-18 20:42     ` Ben Myers
  2014-09-18 20:33   ` [PATCH 02/13] libxfs: rename XFS_CMP_CASE to XFS_CMP_MATCH Ben Myers
                       ` (13 subsequent siblings)
  14 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-18 20:42 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: xfs, olaf, tinguely

From: Mark Tinguely <tinguely@sgi.com>

Fix the duplicate filename detection to use the utf-8 normalization
routines.

Signed-off-by: Mark Tinguely <tinguely@sgi.com>
---
 repair/phase6.c | 35 +++++++++++++++++++++++++----------
 1 file changed, 25 insertions(+), 10 deletions(-)

diff --git a/repair/phase6.c b/repair/phase6.c
index f374fd0..eb3ea35 100644
--- a/repair/phase6.c
+++ b/repair/phase6.c
@@ -176,13 +176,15 @@ dir_hash_add(
 	unsigned char		*name,
 	__uint8_t		ftype)
 {
-	xfs_dahash_t		hash = 0;
 	int			byaddr;
 	int			byhash = 0;
 	dir_hash_ent_t		*p;
 	int			dup;
 	short			junk;
 	struct xfs_name		xname;
+	xfs_da_args_t		args;
+
+	memset(&args, 0, sizeof(xfs_da_args_t));
 
 	ASSERT(!hashtab->names_duped);
 
@@ -195,19 +197,30 @@ dir_hash_add(
 	dup = 0;
 
 	if (!junk) {
-		hash = mp->m_dirnameops->hashname(name, namelen);
-		byhash = DIR_HASH_FUNC(hashtab, hash);
+		int error;
+
+		args.name = name;
+		args.namelen = namelen;
+		args.inumber = inum;
+		args.whichfork = XFS_DATA_FORK;
+
+		error = mp->m_dirnameops->normhash(&args);
+		if (error)
+			do_error(_("normalize has failed %d)\n"), error);
+
+		byhash = DIR_HASH_FUNC(hashtab, args.hashval);
 
 		/*
 		 * search hash bucket for existing name.
 		 */
 		for (p = hashtab->byhash[byhash]; p; p = p->nextbyhash) {
-			if (p->hashval == hash && p->name.len == namelen) {
-				if (memcmp(p->name.name, name, namelen) == 0) {
-					dup = 1;
-					junk = 1;
-					break;
-				}
+			if (p->hashval == args.hashval &&
+			    mp->m_dirnameops->compname(&args, p->name.name,
+						       p->name.len) !=
+							 XFS_CMP_DIFFERENT) {
+				dup = 1;
+				junk = 1;
+				break;
 			}
 		}
 	}
@@ -226,7 +239,7 @@ dir_hash_add(
 	hashtab->last = p;
 
 	if (!(p->junkit = junk)) {
-		p->hashval = hash;
+		p->hashval = args.hashval;
 		p->nextbyhash = hashtab->byhash[byhash];
 		hashtab->byhash[byhash] = p;
 	}
@@ -235,6 +248,8 @@ dir_hash_add(
 	p->seen = 0;
 	p->name = xname;
 
+	if (args.norm)
+		kmem_free((void *) args.norm);
 	return !dup;
 }
 
-- 
1.7.12.4


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 12/13] xfsprogs: add utf8 support to xfs_repair
@ 2014-09-18 20:42     ` Ben Myers
  0 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-18 20:42 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: tinguely, olaf, xfs

From: Mark Tinguely <tinguely@sgi.com>

Fix the duplicate filename detection to use the utf-8 normalization
routines.

Signed-off-by: Mark Tinguely <tinguely@sgi.com>
---
 repair/phase6.c | 35 +++++++++++++++++++++++++----------
 1 file changed, 25 insertions(+), 10 deletions(-)

diff --git a/repair/phase6.c b/repair/phase6.c
index f374fd0..eb3ea35 100644
--- a/repair/phase6.c
+++ b/repair/phase6.c
@@ -176,13 +176,15 @@ dir_hash_add(
 	unsigned char		*name,
 	__uint8_t		ftype)
 {
-	xfs_dahash_t		hash = 0;
 	int			byaddr;
 	int			byhash = 0;
 	dir_hash_ent_t		*p;
 	int			dup;
 	short			junk;
 	struct xfs_name		xname;
+	xfs_da_args_t		args;
+
+	memset(&args, 0, sizeof(xfs_da_args_t));
 
 	ASSERT(!hashtab->names_duped);
 
@@ -195,19 +197,30 @@ dir_hash_add(
 	dup = 0;
 
 	if (!junk) {
-		hash = mp->m_dirnameops->hashname(name, namelen);
-		byhash = DIR_HASH_FUNC(hashtab, hash);
+		int error;
+
+		args.name = name;
+		args.namelen = namelen;
+		args.inumber = inum;
+		args.whichfork = XFS_DATA_FORK;
+
+		error = mp->m_dirnameops->normhash(&args);
+		if (error)
+			do_error(_("normalize has failed %d)\n"), error);
+
+		byhash = DIR_HASH_FUNC(hashtab, args.hashval);
 
 		/*
 		 * search hash bucket for existing name.
 		 */
 		for (p = hashtab->byhash[byhash]; p; p = p->nextbyhash) {
-			if (p->hashval == hash && p->name.len == namelen) {
-				if (memcmp(p->name.name, name, namelen) == 0) {
-					dup = 1;
-					junk = 1;
-					break;
-				}
+			if (p->hashval == args.hashval &&
+			    mp->m_dirnameops->compname(&args, p->name.name,
+						       p->name.len) !=
+							 XFS_CMP_DIFFERENT) {
+				dup = 1;
+				junk = 1;
+				break;
 			}
 		}
 	}
@@ -226,7 +239,7 @@ dir_hash_add(
 	hashtab->last = p;
 
 	if (!(p->junkit = junk)) {
-		p->hashval = hash;
+		p->hashval = args.hashval;
 		p->nextbyhash = hashtab->byhash[byhash];
 		hashtab->byhash[byhash] = p;
 	}
@@ -235,6 +248,8 @@ dir_hash_add(
 	p->seen = 0;
 	p->name = xname;
 
+	if (args.norm)
+		kmem_free((void *) args.norm);
 	return !dup;
 }
 
-- 
1.7.12.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 13/13] xfsprogs: add a preliminary test for utf8 support
  2014-09-18 20:31 ` [PATCH 00/13] xfsprogs: Unicode/UTF-8 support for XFS Ben Myers
                     ` (11 preceding siblings ...)
  2014-09-18 20:42     ` Ben Myers
@ 2014-09-18 20:43   ` Ben Myers
  2014-09-19 16:06   ` [PATCH 07a/13] xfsprogs: add trie generator for UTF-8 Ben Myers
  2014-09-19 16:07   ` [PATCH 07b/13] libxfs: add supporting code " Ben Myers
  14 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-18 20:43 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: tinguely, olaf, xfs

From: Ben Myers <bpm@sgi.com>

Here's a preliminary test for utf8 support in xfs.  It is based on code
that also does some testing in the trie generator.  Here too we are
using the NormalizationTest.txt file from the unicode distribution.  We
check that the normalization in libxfs is working and then run checks on
a filesystem.  Note that there are some 'blacklisted' unichars which
normalize to reserved characters.

FIXME:

For convenience of build this patch is against xfsprogs access to
libxfs.  Handling of ignorables and case fold is also not implemented
here.
---
 Makefile                  |   2 +-
 chkutf8data/Makefile      |  21 +++
 chkutf8data/chkutf8data.c | 430 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 452 insertions(+), 1 deletion(-)
 create mode 100644 chkutf8data/Makefile
 create mode 100644 chkutf8data/chkutf8data.c

diff --git a/Makefile b/Makefile
index c442da6..d4c0a23 100644
--- a/Makefile
+++ b/Makefile
@@ -42,7 +42,7 @@ endif
 
 LIB_SUBDIRS = support libxfs libxlog libxcmd libhandle libdisk
 TOOL_SUBDIRS = copy db estimate fsck fsr growfs io logprint mkfs quota \
-		mdrestore repair rtcp m4 man doc po debian
+		mdrestore repair rtcp m4 man doc po debian chkutf8data
 
 SUBDIRS = include $(LIB_SUBDIRS) $(TOOL_SUBDIRS)
 
diff --git a/chkutf8data/Makefile b/chkutf8data/Makefile
new file mode 100644
index 0000000..6ce5706
--- /dev/null
+++ b/chkutf8data/Makefile
@@ -0,0 +1,21 @@
+#
+# Copyright (c) 2014 SGI. All Rights Reserved.
+#
+
+TOPDIR = ..
+include $(TOPDIR)/include/builddefs
+
+LTCOMMAND = chkutf8data
+CFILES = chkutf8data.c
+
+LLDLIBS = $(LIBXFS)
+LTDEPENDENCIES = $(LIBXFS)
+LLDFLAGS = -static
+
+default: depend $(LTCOMMAND)
+
+include $(BUILDRULES)
+
+install: default
+
+-include .ltdep
diff --git a/chkutf8data/chkutf8data.c b/chkutf8data/chkutf8data.c
new file mode 100644
index 0000000..487cf1e
--- /dev/null
+++ b/chkutf8data/chkutf8data.c
@@ -0,0 +1,430 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+#include <sys/types.h>
+#include <stddef.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <assert.h>
+#include <string.h>
+#include <unistd.h>
+#include <errno.h>
+#include <fcntl.h>
+#include "utf8norm.h"
+
+#define FOLD_NAME	"CaseFolding.txt"
+#define TEST_NAME	"NormalizationTest.txt"
+
+const char	*fold_name = FOLD_NAME;
+const char	*test_name = TEST_NAME;
+
+/* An arbitrary line size limit on input lines. */
+
+#define LINESIZE	1024
+char line[LINESIZE];
+char buf0[LINESIZE];
+char buf1[LINESIZE];
+char buf2[LINESIZE];
+char buf3[LINESIZE];
+char buf4[LINESIZE];
+char buf5[LINESIZE];
+
+const char *mtpt;
+int verbose = 0;
+
+/* ------------------------------------------------------------------ */
+
+static void
+help(void)
+{
+	printf("The input files:\n");
+	printf("\t-f %s\n", FOLD_NAME);
+	printf("\t-t %s\n", TEST_NAME);
+	printf("\n\n");
+	printf("\t-m mtpt\n");
+	printf("\t-v (verbose)\n");
+	printf("\t-h (help)\n");
+	printf("\n");
+}
+
+static void
+usage(void)
+{
+	help();
+	exit(1);
+}
+
+static void
+open_fail(const char *name, int error)
+{
+	printf("Error %d opening %s: %s\n", error, name, strerror(error));
+	exit(1);
+}
+
+static void
+file_fail(const char *filename)
+{
+	printf("Error parsing %s\n", filename);
+	exit(1);
+}
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * UTF8 valid ranges.
+ *
+ * The UTF-8 encoding spreads the bits of a 32bit word over several
+ * bytes. This table gives the ranges that can be held and how they'd
+ * be represented.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000000 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000000 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * There is an additional requirement on UTF-8, in that only the
+ * shortest representation of a 32bit value is to be used.  A decoder
+ * must not decode sequences that do not satisfy this requirement.
+ * Thus the allowed ranges have a lower bound.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000080 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000800 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00010000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00200000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x04000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * Actual unicode characters are limited to the range 0x0 - 0x10FFFF,
+ * 17 planes of 65536 values.  This limits the sequences actually seen
+ * even more, to just the following.
+ *
+ *          0 -     0x7f: 0                     0x7f
+ *       0x80 -    0x7ff: 0xc2 0x80             0xdf 0xbf
+ *      0x800 -   0xffff: 0xe0 0xa0 0x80        0xef 0xbf 0xbf
+ *    0x10000 - 0x10ffff: 0xf0 0x90 0x80 0x80   0xf4 0x8f 0xbf 0xbf
+ *
+ * Even within those ranges not all values are allowed: the surrogates
+ * 0xd800 - 0xdfff should never be seen.
+ *
+ * Note that the longest sequence seen with valid usage is 4 bytes,
+ * the same a single UTF-32 character.  This makes the UTF-8
+ * representation of Unicode strictly smaller than UTF-32.
+ *
+ * The shortest sequence requirement was introduced by:
+ *    Corrigendum #1: UTF-8 Shortest Form
+ * It can be found here:
+ *    http://www.unicode.org/versions/corrigendum1.html
+ *
+ */
+
+#define UTF8_2_BITS     0xC0
+#define UTF8_3_BITS     0xE0
+#define UTF8_4_BITS     0xF0
+#define UTF8_N_BITS     0x80
+#define UTF8_2_MASK     0xE0
+#define UTF8_3_MASK     0xF0
+#define UTF8_4_MASK     0xF8
+#define UTF8_N_MASK     0xC0
+#define UTF8_V_MASK     0x3F
+#define UTF8_V_SHIFT    6
+
+static int
+utf8key(unsigned int key, char keyval[])
+{
+	int keylen;
+
+	if (key < 0x80) {
+		keyval[0] = key;
+		keylen = 1;
+	} else if (key < 0x800) {
+		keyval[1] = key & UTF8_V_MASK;
+		keyval[1] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[0] = key;
+		keyval[0] |= UTF8_2_BITS;
+		keylen = 2;
+	} else if (key < 0x10000) {
+		keyval[2] = key & UTF8_V_MASK;
+		keyval[2] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[1] = key & UTF8_V_MASK;
+		keyval[1] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[0] = key;
+		keyval[0] |= UTF8_3_BITS;
+		keylen = 3;
+	} else if (key < 0x110000) {
+		keyval[3] = key & UTF8_V_MASK;
+		keyval[3] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[2] = key & UTF8_V_MASK;
+		keyval[2] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[1] = key & UTF8_V_MASK;
+		keyval[1] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[0] = key;
+		keyval[0] |= UTF8_4_BITS;
+		keylen = 4;
+	} else {
+		printf("%#x: illegal key\n", key);
+		keylen = 0;
+	}
+	return keylen;
+}
+
+static int
+normalize_line(utf8data_t tree, char *s, char *t)
+{
+	struct utf8cursor u8c;
+
+	if (utf8cursor(&u8c, tree, s)) {
+		printf("%s return utf8cursor failed\n", __func__);
+		return -1;
+	}
+
+	while ((*t = utf8byte(&u8c)) > 0)
+		t++;
+
+	if (*t < 0) {
+		printf("%s return error %d\r", __func__, *t);
+		return -1;
+	}
+	if (*t != 0) {
+		printf("%s return t not 0\n", __func__);
+		return -1;
+	}
+
+        return 0;
+}
+
+static void
+test_key(char	*source,
+	 char	*NFC,
+	 char	*NFD,
+	 char	*NFKC,
+	 char	*NFKD)
+{
+	int	fd;
+	int	error;
+
+	if (verbose)
+		printf("Testing %s -> %s\n", source, NFKD);
+
+	error = chdir(mtpt);	/* XXX hardcoded mount point */
+	if (error) {
+		perror(mtpt);
+		exit(-1);
+	}
+
+	/* the initial create should succeed */
+	if (verbose)
+		printf("Initial create %s... ", source);
+	fd = open(source, O_CREAT|O_EXCL, 0);
+	if (fd < 0) {
+		printf("Failed to create %s XXX\n", source);
+		perror(source);
+		close(fd);
+		exit(-1);
+	}
+	close(fd);
+	if (verbose)
+		printf("Success\n");
+
+	/* a second create should fail */
+	if (verbose)
+		printf("Second create %s (should return EEXIST)... ", NFKD);
+	fd = open(NFKD, O_CREAT|O_EXCL, 0);
+	if (fd >= 1) {
+		printf("Test Failed.  Was able to create %s XXX\n", NFKD);
+		perror(NFKD);
+		close(fd);
+		exit(-1);
+	}
+	close(fd);
+	if (verbose)
+		printf("EEXIST\n");
+
+       	error = unlink(NFKD);
+	if (error) {
+		printf("Unlink failed\n"); 
+		perror(NFKD);
+		exit(-1);
+	}
+}
+
+int
+blacklisted(unsigned int unichar)
+{
+	/* these unichars normalize to characters we don't allow */
+	unsigned int list[] = {	0x2024 /* . */,
+				0x2025 /* .. */,
+       				0x2100 /* a/c */,
+				0x2101 /* a/s */,
+				0x2105 /* c/o */,
+				0x2106 /* c/u */,
+				0xFE30 /* .. */,
+				0xFE52 /* . */,
+				0xFF0E /* . */,
+				0xFF0F /* / */};
+	int i;
+
+	for (i=0; i < (sizeof(list) / sizeof(unichar)); i++) {
+		if (list[i] == unichar)
+			return 1;
+	}
+	return 0;
+}
+
+static void
+normalization_test(void)
+{
+	FILE *file;
+	unsigned int unichar;
+	char *s;
+	char *t;
+	int ret;
+	int tests = 0;
+	int failures = 0;
+	char	source[LINESIZE];
+	char	NFKD[LINESIZE];
+	int	skip;
+	utf8data_t	nfkdi = utf8nfkdi(utf8version);
+
+	printf("Parsing %s\n", test_name);
+	/* Step one, read data from file. */
+	file = fopen(test_name, "r");
+	if (!file)
+		open_fail(test_name, errno);
+
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%[^;];%*[^;];%*[^;];%*[^;];%[^;];",
+				source, NFKD);
+			//NFC, NFD, NFKC, NFKD);
+		if (ret != 2 || *line == '#')
+			continue;
+
+		s = source;
+		t = buf2;
+		skip = 0;
+		while (*s) {
+			unichar = strtoul(s, &s, 16);
+			if (blacklisted(unichar))
+				skip++;
+			t += utf8key(unichar, t);
+		}
+		*t = '\0';
+
+		if (skip)
+			continue;
+
+		s = NFKD;
+		t = buf3;
+		while (*s) {
+			unichar = strtoul(s, &s, 16);
+			t += utf8key(unichar, t);
+		}
+		*t = '\0';
+
+		/* normalize source */
+		if (normalize_line(nfkdi, buf2, buf4) < 0) {
+			printf("normalize_line for unichar %s Failed\n", buf0);
+			exit(1);
+		}
+		if (verbose)
+			printf("(%s) %s normalized to %s... ",
+					source, buf2, buf4);
+
+		/* does it match NFKD? */
+		tests++;
+		if (memcmp(buf4, buf3, strlen(buf3))) {
+			if (verbose)
+				printf("Fail!\n");
+			failures++;
+		} else { 
+			if (verbose)
+				printf("Correct!\n");
+		}
+
+		/* normalize NFKD */
+		if (normalize_line(nfkdi, buf3, buf5) < 0) {
+			printf("normalize_line for unichar %s Failed\n",
+					buf3);
+			exit(1);
+		}
+		if (verbose)
+			printf("(%s) %s normalized to %s... ",
+					NFKD, buf3, buf5);
+
+		/* does it normalize to itself? */
+		tests++;
+		if (memcmp(buf5, buf3, strlen(buf3))) {
+			if (verbose)
+				printf("Fail!\n");
+			failures++;
+		} else {
+			if (verbose)
+				printf("Correct!\n");
+		}
+
+		/* XXX ignorables need to be taken into account? */
+		test_key(buf2, NULL, NULL, NULL, buf3);
+	}
+	fclose(file);
+	printf("Ran %d tests with %d failures\n", tests, failures);
+	if (failures)
+		file_fail(test_name);
+}
+
+int
+main(int argc, char *argv[])
+{
+	int opt;
+
+	while ((opt = getopt(argc, argv, "f:t:m:vh")) != -1) {
+		switch (opt) {
+		case 'f':
+			fold_name = optarg;
+			break;
+		case 't':
+			test_name = optarg;
+			break;
+		case 'm':
+			mtpt = optarg;
+			break;
+		case 'v':
+			verbose++;
+			break;
+		case 'h':
+			help();
+			exit(0);
+		default:
+			usage();
+		}
+	}
+
+	if (!test_name || !mtpt) {
+		usage();
+		exit(-1);
+	}
+
+	normalization_test();
+
+	return 0;
+}
-- 
1.7.12.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [RFC v2] Unicode/UTF-8 support for XFS
  2014-09-18 19:56 [RFC v2] Unicode/UTF-8 support for XFS Ben Myers
                   ` (11 preceding siblings ...)
  2014-09-18 20:31 ` [PATCH 00/13] xfsprogs: Unicode/UTF-8 support for XFS Ben Myers
@ 2014-09-18 21:10 ` Ben Myers
  2014-09-18 21:24     ` Zach Brown
  2014-09-19 16:03 ` [PATCH 07a/10] xfs: add trie generator for UTF-8 Ben Myers
                   ` (3 subsequent siblings)
  16 siblings, 1 reply; 84+ messages in thread
From: Ben Myers @ 2014-09-18 21:10 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: tinguely, olaf, xfs

Hi,

On Thu, Sep 18, 2014 at 02:56:50PM -0500, Ben Myers wrote:
> I'm posting this RFC for Unicode support in XFS on Olaf's behalf, as he
> is busy with other projects.

Patch 7 of each series is the trie generator and utf8 normalization
module.  Either it's just a little slow or it's over the message size
limit on vger at a little over 100k.  In case it's the latter I've
placed them here:

ftp://oss.sgi.com/projects/xfs/tmp/18-Sep-14/0007-xfs-add-trie-generator-and-supporting-code-for-UTF-8.patch
and 
ftp://oss.sgi.com/projects/xfs/tmp/18-Sep-14/0007-libxfs-add-trie-generator-and-supporting-code-for-UT.patch

I'll give it a while and then see about splitting up the patch.

Thanks,
	Ben

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC v2] Unicode/UTF-8 support for XFS
  2014-09-18 21:10 ` [RFC v2] Unicode/UTF-8 support for XFS Ben Myers
@ 2014-09-18 21:24     ` Zach Brown
  0 siblings, 0 replies; 84+ messages in thread
From: Zach Brown @ 2014-09-18 21:24 UTC (permalink / raw)
  To: Ben Myers; +Cc: linux-fsdevel, xfs, olaf, tinguely

On Thu, Sep 18, 2014 at 04:10:10PM -0500, Ben Myers wrote:
> Hi,
> 
> On Thu, Sep 18, 2014 at 02:56:50PM -0500, Ben Myers wrote:
> > I'm posting this RFC for Unicode support in XFS on Olaf's behalf, as he
> > is busy with other projects.
> 
> Patch 7 of each series is the trie generator and utf8 normalization
> module.  Either it's just a little slow or it's over the message size
> limit on vger at a little over 100k.

Probably, yeah:

http://vger.kernel.org/majordomo-info.html

"
  * Message size exceeding 100 000 characters causes blocking.
"

- z

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC v2] Unicode/UTF-8 support for XFS
@ 2014-09-18 21:24     ` Zach Brown
  0 siblings, 0 replies; 84+ messages in thread
From: Zach Brown @ 2014-09-18 21:24 UTC (permalink / raw)
  To: Ben Myers; +Cc: linux-fsdevel, tinguely, olaf, xfs

On Thu, Sep 18, 2014 at 04:10:10PM -0500, Ben Myers wrote:
> Hi,
> 
> On Thu, Sep 18, 2014 at 02:56:50PM -0500, Ben Myers wrote:
> > I'm posting this RFC for Unicode support in XFS on Olaf's behalf, as he
> > is busy with other projects.
> 
> Patch 7 of each series is the trie generator and utf8 normalization
> module.  Either it's just a little slow or it's over the message size
> limit on vger at a little over 100k.

Probably, yeah:

http://vger.kernel.org/majordomo-info.html

"
  * Message size exceeding 100 000 characters causes blocking.
"

- z

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC v2] Unicode/UTF-8 support for XFS
  2014-09-18 21:24     ` Zach Brown
  (?)
@ 2014-09-18 22:23     ` Ben Myers
  -1 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-18 22:23 UTC (permalink / raw)
  To: Zach Brown; +Cc: linux-fsdevel, tinguely, olaf, xfs

On Thu, Sep 18, 2014 at 02:24:17PM -0700, Zach Brown wrote:
> On Thu, Sep 18, 2014 at 04:10:10PM -0500, Ben Myers wrote:
> > On Thu, Sep 18, 2014 at 02:56:50PM -0500, Ben Myers wrote:
> > > I'm posting this RFC for Unicode support in XFS on Olaf's behalf, as he
> > > is busy with other projects.
> > 
> > Patch 7 of each series is the trie generator and utf8 normalization
> > module.  Either it's just a little slow or it's over the message size
> > limit on vger at a little over 100k.
> 
> Probably, yeah:
> 
> http://vger.kernel.org/majordomo-info.html
> 
> "
>   * Message size exceeding 100 000 characters causes blocking.
> "

D'oh!  I'll split it up.

Thanks,
Ben

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH 07a/10] xfs: add trie generator for UTF-8.
  2014-09-18 19:56 [RFC v2] Unicode/UTF-8 support for XFS Ben Myers
                   ` (12 preceding siblings ...)
  2014-09-18 21:10 ` [RFC v2] Unicode/UTF-8 support for XFS Ben Myers
@ 2014-09-19 16:03 ` Ben Myers
  2014-09-19 16:04 ` [PATCH 07b/10] xfs: add supporting code " Ben Myers
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-19 16:03 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: tinguely, olaf, xfs

From: Olaf Weber <olaf@sgi.com>

mkutf8data.c is the source for a program that generates utf8data.h, which
contains the trie that utf8norm.c uses. The trie is generated from the
Unicode 7.0.0 data files. The format of the utf8data[] table is described
in utf8norm.c, which is added in the next patch.

Signed-off-by: Olaf Weber <olaf@sgi.com>

---
[v2: the trie is now separated into utf8norm.ko;
     utf8version is now a function and exported;
     introduced CONFIG_XFS_UTF8;
     removed supporting code due to vger size constraint.  --bpm]
---
 fs/xfs/Kconfig               |    8 +
 fs/xfs/Makefile              |    2 +-
 fs/xfs/utf8norm/Makefile     |   33 +
 fs/xfs/utf8norm/mkutf8data.c | 3239 ++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 3281 insertions(+), 1 deletion(-)
 create mode 100644 fs/xfs/utf8norm/Makefile
 create mode 100644 fs/xfs/utf8norm/mkutf8data.c

diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index 5d47b4d..a847857 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -95,3 +95,11 @@ config XFS_DEBUG
 	  not useful unless you are debugging a particular problem.
 
 	  Say N unless you are an XFS developer, or you play one on TV.
+
+config XFS_UTF8
+	bool "XFS UTF-8 support"
+	depends on XFS_FS
+	help
+	  Say Y here to enable utf8 normalization support in XFS.  You
+	  will be able to mount and use filesystems created with the
+	  utf8 mkfs.xfs option.
diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index d617999..6d000d3 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -21,7 +21,7 @@ ccflags-y += -I$(src)/libxfs
 
 ccflags-$(CONFIG_XFS_DEBUG) += -g
 
-obj-$(CONFIG_XFS_FS)		+= xfs.o
+obj-$(CONFIG_XFS_FS)		+= xfs.o utf8norm/
 
 # this one should be compiled first, as the tracing macros can easily blow up
 xfs-y				+= xfs_trace.o
diff --git a/fs/xfs/utf8norm/Makefile b/fs/xfs/utf8norm/Makefile
new file mode 100644
index 0000000..9b2efa9
--- /dev/null
+++ b/fs/xfs/utf8norm/Makefile
@@ -0,0 +1,33 @@
+#
+# Copyright (c) 2014 SGI.
+# All rights reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#
+
+hostprogs-y			:= mkutf8data
+$(obj)/utf8norm.o: $(obj)/utf8data.h
+$(obj)/utf8data.h: $(src)/ucd/*.txt
+$(obj)/utf8data.h: $(obj)/mkutf8data FORCE
+	$(call if_changed,mkutf8data)
+quiet_cmd_mkutf8data = MKUTF8DATA $@
+      cmd_mkutf8data = $(obj)/mkutf8data \
+		-a $(src)/ucd/DerivedAge-7.0.0.txt \
+		-c $(src)/ucd/DerivedCombiningClass-7.0.0.txt \
+		-p $(src)/ucd/DerivedCoreProperties-7.0.0.txt \
+		-d $(src)/ucd/UnicodeData-7.0.0.txt \
+		-f $(src)/ucd/CaseFolding-7.0.0.txt \
+		-n $(src)/ucd/NormalizationCorrections-7.0.0.txt \
+		-t $(src)/ucd/NormalizationTest-7.0.0.txt \
+		-o $@
diff --git a/fs/xfs/utf8norm/mkutf8data.c b/fs/xfs/utf8norm/mkutf8data.c
new file mode 100644
index 0000000..1d6ec02
--- /dev/null
+++ b/fs/xfs/utf8norm/mkutf8data.c
@@ -0,0 +1,3239 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+/* Generator for a compact trie for unicode normalization */
+
+#include <sys/types.h>
+#include <stddef.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <assert.h>
+#include <string.h>
+#include <unistd.h>
+#include <errno.h>
+
+/* Default names of the in- and output files. */
+
+#define AGE_NAME	"DerivedAge.txt"
+#define CCC_NAME	"DerivedCombiningClass.txt"
+#define PROP_NAME	"DerivedCoreProperties.txt"
+#define DATA_NAME	"UnicodeData.txt"
+#define FOLD_NAME	"CaseFolding.txt"
+#define NORM_NAME	"NormalizationCorrections.txt"
+#define TEST_NAME	"NormalizationTest.txt"
+#define UTF8_NAME	"utf8data.h"
+
+const char	*age_name  = AGE_NAME;
+const char	*ccc_name  = CCC_NAME;
+const char	*prop_name = PROP_NAME;
+const char	*data_name = DATA_NAME;
+const char	*fold_name = FOLD_NAME;
+const char	*norm_name = NORM_NAME;
+const char	*test_name = TEST_NAME;
+const char	*utf8_name = UTF8_NAME;
+
+int verbose = 0;
+
+/* An arbitrary line size limit on input lines. */
+
+#define LINESIZE	1024
+char line[LINESIZE];
+char buf0[LINESIZE];
+char buf1[LINESIZE];
+char buf2[LINESIZE];
+char buf3[LINESIZE];
+
+const char *argv0;
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * Unicode version numbers consist of three parts: major, minor, and a
+ * revision.  These numbers are packed into an unsigned int to obtain
+ * a single version number.
+ *
+ * To save space in the generated trie, the unicode version is not
+ * stored directly, instead we calculate a generation number from the
+ * unicode versions seen in the DerivedAge file, and use that as an
+ * index into a table of unicode versions.
+ */
+#define UNICODE_MAJ_SHIFT		(16)
+#define UNICODE_MIN_SHIFT		(8)
+
+#define UNICODE_MAJ_MAX			((unsigned short)-1)
+#define UNICODE_MIN_MAX			((unsigned char)-1)
+#define UNICODE_REV_MAX			((unsigned char)-1)
+
+#define UNICODE_AGE(MAJ,MIN,REV)			\
+	(((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) |	\
+	 ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) |	\
+	 ((unsigned int)(REV)))
+
+unsigned int *ages;
+int ages_count;
+
+unsigned int unicode_maxage;
+
+static int
+age_valid(unsigned int major, unsigned int minor, unsigned int revision)
+{
+	if (major > UNICODE_MAJ_MAX)
+		return 0;
+	if (minor > UNICODE_MIN_MAX)
+		return 0;
+	if (revision > UNICODE_REV_MAX)
+		return 0;
+	return 1;
+}
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * utf8trie_t
+ *
+ * A compact binary tree, used to decode UTF-8 characters.
+ *
+ * Internal nodes are one byte for the node itself, and up to three
+ * bytes for an offset into the tree.  The first byte contains the
+ * following information:
+ *  NEXTBYTE  - flag        - advance to next byte if set
+ *  BITNUM    - 3 bit field - the bit number to tested
+ *  OFFLEN    - 2 bit field - number of bytes in the offset
+ * if offlen == 0 (non-branching node)
+ *  RIGHTPATH - 1 bit field - set if the following node is for the
+ *                            right-hand path (tested bit is set)
+ *  TRIENODE  - 1 bit field - set if the following node is an internal
+ *                            node, otherwise it is a leaf node
+ * if offlen != 0 (branching node)
+ *  LEFTNODE  - 1 bit field - set if the left-hand node is internal
+ *  RIGHTNODE - 1 bit field - set if the right-hand node is internal
+ *
+ * Due to the way utf8 works, there cannot be branching nodes with
+ * NEXTBYTE set, and moreover those nodes always have a righthand
+ * descendant.
+ */
+typedef unsigned char utf8trie_t;
+#define BITNUM		0x07
+#define NEXTBYTE	0x08
+#define OFFLEN		0x30
+#define OFFLEN_SHIFT	4
+#define RIGHTPATH	0x40
+#define TRIENODE	0x80
+#define RIGHTNODE	0x40
+#define LEFTNODE	0x80
+
+/*
+ * utf8leaf_t
+ *
+ * The leaves of the trie are embedded in the trie, and so the same
+ * underlying datatype, unsigned char.
+ *
+ * leaf[0]: The unicode version, stored as a generation number that is
+ *          an index into utf8agetab[].  With this we can filter code
+ *          points based on the unicode version in which they were
+ *          defined.  The CCC of a non-defined code point is 0.
+ * leaf[1]: Canonical Combining Class. During normalization, we need
+ *          to do a stable sort into ascending order of all characters
+ *          with a non-zero CCC that occur between two characters with
+ *          a CCC of 0, or at the begin or end of a string.
+ *          The unicode standard guarantees that all CCC values are
+ *          between 0 and 254 inclusive, which leaves 255 available as
+ *          a special value.
+ *          Code points with CCC 0 are known as stoppers.
+ * leaf[2]: Decomposition. If leaf[1] == 255, then leaf[2] is the
+ *          start of a NUL-terminated string that is the decomposition
+ *          of the character.
+ *          The CCC of a decomposable character is the same as the CCC
+ *          of the first character of its decomposition.
+ *          Some characters decompose as the empty string: these are
+ *          characters with the Default_Ignorable_Code_Point property.
+ *          These do affect normalization, as they all have CCC 0.
+ *
+ * The decompositions in the trie have been fully expanded.
+ *
+ * Casefolding, if applicable, is also done using decompositions.
+ */
+typedef unsigned char utf8leaf_t;
+
+#define LEAF_GEN(LEAF)	((LEAF)[0])
+#define LEAF_CCC(LEAF)	((LEAF)[1])
+#define LEAF_STR(LEAF)	((const char*)((LEAF) + 2))
+
+#define MAXGEN		(255)
+
+#define MINCCC		(0)
+#define MAXCCC		(254)
+#define STOPPER		(0)
+#define	DECOMPOSE	(255)
+
+struct tree;
+static utf8leaf_t *utf8nlookup(struct tree *, const char *, size_t);
+static utf8leaf_t *utf8lookup(struct tree *, const char *);
+
+unsigned char *utf8data;
+size_t utf8data_size;
+
+utf8trie_t *nfkdi;
+utf8trie_t *nfkdicf;
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * UTF8 valid ranges.
+ *
+ * The UTF-8 encoding spreads the bits of a 32bit word over several
+ * bytes. This table gives the ranges that can be held and how they'd
+ * be represented.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000000 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000000 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * There is an additional requirement on UTF-8, in that only the
+ * shortest representation of a 32bit value is to be used.  A decoder
+ * must not decode sequences that do not satisfy this requirement.
+ * Thus the allowed ranges have a lower bound.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000080 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000800 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00010000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00200000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x04000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * Actual unicode characters are limited to the range 0x0 - 0x10FFFF,
+ * 17 planes of 65536 values.  This limits the sequences actually seen
+ * even more, to just the following.
+ *
+ *          0 -     0x7f: 0                     0x7f
+ *       0x80 -    0x7ff: 0xc2 0x80             0xdf 0xbf
+ *      0x800 -   0xffff: 0xe0 0xa0 0x80        0xef 0xbf 0xbf
+ *    0x10000 - 0x10ffff: 0xf0 0x90 0x80 0x80   0xf4 0x8f 0xbf 0xbf
+ *
+ * Even within those ranges not all values are allowed: the surrogates
+ * 0xd800 - 0xdfff should never be seen.
+ *
+ * Note that the longest sequence seen with valid usage is 4 bytes,
+ * the same a single UTF-32 character.  This makes the UTF-8
+ * representation of Unicode strictly smaller than UTF-32.
+ *
+ * The shortest sequence requirement was introduced by:
+ *    Corrigendum #1: UTF-8 Shortest Form
+ * It can be found here:
+ *    http://www.unicode.org/versions/corrigendum1.html
+ *
+ */
+
+#define UTF8_2_BITS     0xC0
+#define UTF8_3_BITS     0xE0
+#define UTF8_4_BITS     0xF0
+#define UTF8_N_BITS     0x80
+#define UTF8_2_MASK     0xE0
+#define UTF8_3_MASK     0xF0
+#define UTF8_4_MASK     0xF8
+#define UTF8_N_MASK     0xC0
+#define UTF8_V_MASK     0x3F
+#define UTF8_V_SHIFT    6
+
+static int
+utf8key(unsigned int key, char keyval[])
+{
+	int keylen;
+
+	if (key < 0x80) {
+		keyval[0] = key;
+		keylen = 1;
+	} else if (key < 0x800) {
+		keyval[1] = key & UTF8_V_MASK;
+		keyval[1] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[0] = key;
+		keyval[0] |= UTF8_2_BITS;
+		keylen = 2;
+	} else if (key < 0x10000) {
+		keyval[2] = key & UTF8_V_MASK;
+		keyval[2] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[1] = key & UTF8_V_MASK;
+		keyval[1] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[0] = key;
+		keyval[0] |= UTF8_3_BITS;
+		keylen = 3;
+	} else if (key < 0x110000) {
+		keyval[3] = key & UTF8_V_MASK;
+		keyval[3] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[2] = key & UTF8_V_MASK;
+		keyval[2] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[1] = key & UTF8_V_MASK;
+		keyval[1] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[0] = key;
+		keyval[0] |= UTF8_4_BITS;
+		keylen = 4;
+	} else {
+		printf("%#x: illegal key\n", key);
+		keylen = 0;
+	}
+	return keylen;
+}
+
+static unsigned int
+utf8code(const char *str)
+{
+	const unsigned char *s = (const unsigned char*)str;
+	unsigned int unichar = 0;
+
+	if (*s < 0x80) {
+		unichar = *s;
+	} else if (*s < UTF8_3_BITS) {
+		unichar = *s++ & 0x1F;
+		unichar <<= UTF8_V_SHIFT;
+		unichar |= *s & 0x3F;
+	} else if (*s < UTF8_4_BITS) {
+		unichar = *s++ & 0x0F;
+		unichar <<= UTF8_V_SHIFT;
+		unichar |= *s++ & 0x3F;
+		unichar <<= UTF8_V_SHIFT;
+		unichar |= *s & 0x3F;
+	} else {
+		unichar = *s++ & 0x0F;
+		unichar <<= UTF8_V_SHIFT;
+		unichar |= *s++ & 0x3F;
+		unichar <<= UTF8_V_SHIFT;
+		unichar |= *s++ & 0x3F;
+		unichar <<= UTF8_V_SHIFT;
+		unichar |= *s & 0x3F;
+	}
+	return unichar;
+}
+
+static int
+utf32valid(unsigned int unichar)
+{
+	return unichar < 0x110000;
+}
+
+#define NODE 1
+#define LEAF 0
+
+struct tree {
+	void *root;
+	int childnode;
+	const char *type;
+	unsigned int maxage;
+	struct tree *next;
+	int (*leaf_equal)(void *, void *);
+	void (*leaf_print)(void *, int);
+	int (*leaf_mark)(void *);
+	int (*leaf_size)(void *);
+	int *(*leaf_index)(struct tree *, void *);
+	unsigned char *(*leaf_emit)(void *, unsigned char *);
+	int leafindex[0x110000];
+	int index;
+};
+
+struct node {
+	int index;
+	int offset;
+	int mark;
+	int size;
+	struct node *parent;
+	void *left;
+	void *right;
+	unsigned char bitnum;
+	unsigned char nextbyte;
+	unsigned char leftnode;
+	unsigned char rightnode;
+	unsigned int keybits;
+	unsigned int keymask;
+};
+
+/*
+ * Example lookup function for a tree.
+ */
+static void *
+lookup(struct tree *tree, const char *key)
+{
+	struct node *node;
+	void *leaf = NULL;
+
+	node = tree->root;
+	while (!leaf && node) {
+		if (node->nextbyte)
+			key++;
+		if (*key & (1 << (node->bitnum & 7))) {
+			/* Right leg */
+			if (node->rightnode == NODE) {
+				node = node->right;
+			} else if (node->rightnode == LEAF) {
+				leaf = node->right;
+			} else {
+				node = NULL;
+			}
+		} else {
+			/* Left leg */
+			if (node->leftnode == NODE) {
+				node = node->left;
+			} else if (node->leftnode == LEAF) {
+				leaf = node->left;
+			} else {
+				node = NULL;
+			}
+		}
+	}
+
+	return leaf;
+}
+
+/*
+ * A simple non-recursive tree walker: keep track of visits to the
+ * left and right branches in the leftmask and rightmask.
+ */
+static void
+tree_walk(struct tree *tree)
+{
+	struct node *node;
+	unsigned int leftmask;
+	unsigned int rightmask;
+	unsigned int bitmask;
+	int indent = 1;
+	int nodes, singletons, leaves;
+
+	nodes = singletons = leaves = 0;
+
+	printf("%s_%x root %p\n", tree->type, tree->maxage, tree->root);
+	if (tree->childnode == LEAF) {
+		assert(tree->root);
+		tree->leaf_print(tree->root, indent);
+		leaves = 1;
+	} else {
+		assert(tree->childnode == NODE);
+		node = tree->root;
+		leftmask = rightmask = 0;
+		while (node) {
+			printf("%*snode @ %p bitnum %d nextbyte %d"
+			       " left %p right %p mask %x bits %x\n",
+				indent, "", node,
+				node->bitnum, node->nextbyte,
+				node->left, node->right,
+				node->keymask, node->keybits);
+			nodes += 1;
+			if (!(node->left && node->right))
+				singletons += 1;
+
+			while (node) {
+				bitmask = 1 << node->bitnum;
+				if ((leftmask & bitmask) == 0) {
+					leftmask |= bitmask;
+					if (node->leftnode == LEAF) {
+						assert(node->left);
+						tree->leaf_print(node->left,
+								 indent+1);
+						leaves += 1;
+					} else if (node->left) {
+						assert(node->leftnode == NODE);
+						indent += 1;
+						node = node->left;
+						break;
+					}
+				}
+				if ((rightmask & bitmask) == 0) {
+					rightmask |= bitmask;
+					if (node->rightnode == LEAF) {
+						assert(node->right);
+						tree->leaf_print(node->right,
+								 indent+1);
+						leaves += 1;
+					} else if (node->right) {
+						assert(node->rightnode==NODE);
+						indent += 1;
+						node = node->right;
+						break;
+					}
+				}
+				leftmask &= ~bitmask;
+				rightmask &= ~bitmask;
+				node = node->parent;
+				indent -= 1;
+			}
+		}
+	}
+	printf("nodes %d leaves %d singletons %d\n",
+	       nodes, leaves, singletons);
+}
+
+/*
+ * Allocate an initialize a new internal node.
+ */
+static struct node *
+alloc_node(struct node *parent)
+{
+	struct node *node;
+	int bitnum;
+
+	node = malloc(sizeof(*node));
+	node->left = node->right = NULL;
+	node->parent = parent;
+	node->leftnode = NODE;
+	node->rightnode = NODE;
+	node->keybits = 0;
+	node->keymask = 0;
+	node->mark = 0;
+	node->index = 0;
+	node->offset = -1;
+	node->size = 4;
+
+	if (node->parent) {
+		bitnum = parent->bitnum;
+		if ((bitnum & 7) == 0) {
+			node->bitnum = bitnum + 7 + 8;
+			node->nextbyte = 1;
+		} else {
+			node->bitnum = bitnum - 1;
+			node->nextbyte = 0;
+		}
+	} else {
+		node->bitnum = 7;
+		node->nextbyte = 0;
+	}
+
+	return node;
+}
+
+/*
+ * Insert a new leaf into the tree, and collapse any subtrees that are
+ * fully populated and end in identical leaves. A nextbyte tagged
+ * internal node will not be removed to preserve the tree's integrity.
+ * Note that due to the structure of utf8, no nextbyte tagged node
+ * will be a candidate for removal.
+ */
+static int
+insert(struct tree *tree, char *key, int keylen, void *leaf)
+{
+	struct node *node;
+	struct node *parent;
+	void **cursor;
+	int keybits;
+
+	assert(keylen >= 1 && keylen <= 4);
+
+	node = NULL;
+	cursor = &tree->root;
+	keybits = 8 * keylen;
+
+	/* Insert, creating path along the way. */
+	while (keybits) {
+		if (!*cursor)
+			*cursor = alloc_node(node);
+		node = *cursor;
+		if (node->nextbyte)
+			key++;
+		if (*key & (1 << (node->bitnum & 7)))
+			cursor = &node->right;
+		else
+			cursor = &node->left;
+		keybits--;
+	}
+	*cursor = leaf;
+
+	/* Merge subtrees if possible. */
+	while (node) {
+		if (*key & (1 << (node->bitnum & 7)))
+			node->rightnode = LEAF;
+		else
+			node->leftnode = LEAF;
+		if (node->nextbyte)
+			break;
+		if (node->leftnode == NODE || node->rightnode == NODE)
+			break;
+		assert(node->left);
+		assert(node->right);
+		/* Compare */
+		if (! tree->leaf_equal(node->left, node->right))
+			break;
+		/* Keep left, drop right leaf. */
+		leaf = node->left;
+		/* Check in parent */
+		parent = node->parent;
+		if (!parent) {
+			/* root of tree! */
+			tree->root = leaf;
+			tree->childnode = LEAF;
+		} else if (parent->left == node) {
+			parent->left = leaf;
+			parent->leftnode = LEAF;
+			if (parent->right) {
+				parent->keymask = 0;
+				parent->keybits = 0;
+			} else {
+				parent->keymask |= (1 << node->bitnum);
+			}
+		} else if (parent->right == node) {
+			parent->right = leaf;
+			parent->rightnode = LEAF;
+			if (parent->left) {
+				parent->keymask = 0;
+				parent->keybits = 0;
+			} else {
+				parent->keymask |= (1 << node->bitnum);
+				parent->keybits |= (1 << node->bitnum);
+			}
+		} else {
+			/* internal tree error */
+			assert(0);
+		}
+		free(node);
+		node = parent;
+	}
+
+	/* Propagate keymasks up along singleton chains. */
+	while (node) {
+		parent = node->parent;
+		if (!parent)
+			break;
+		/* Nix the mask for parents with two children. */
+		if (node->keymask == 0) {
+			parent->keymask = 0;
+			parent->keybits = 0;
+		} else if (parent->left && parent->right) {
+			parent->keymask = 0;
+			parent->keybits = 0;
+		} else {
+			assert((parent->keymask & node->keymask) == 0);
+			parent->keymask |= node->keymask;
+			parent->keymask |= (1 << parent->bitnum);
+			parent->keybits |= node->keybits;
+			if (parent->right)
+				parent->keybits |= (1 << parent->bitnum);
+		}
+		node = parent;
+	}
+
+	return 0;
+}
+
+/*
+ * Prune internal nodes.
+ *
+ * Fully populated subtrees that end at the same leaf have already
+ * been collapsed.  There are still internal nodes that have for both
+ * their left and right branches a sequence of singletons that make
+ * identical choices and end in identical leaves.  The keymask and
+ * keybits collected in the nodes describe the choices made in these
+ * singleton chains.  When they are identical for the left and right
+ * branch of a node, and the two leaves comare identical, the node in
+ * question can be removed.
+ *
+ * Note that nodes with the nextbyte tag set will not be removed by
+ * this to ensure tree integrity.  Note as well that the structure of
+ * utf8 ensures that these nodes would not have been candidates for
+ * removal in any case.
+ */
+static void
+prune(struct tree *tree)
+{
+	struct node *node;
+	struct node *left;
+	struct node *right;
+	struct node *parent;
+	void *leftleaf;
+	void *rightleaf;
+	unsigned int leftmask;
+	unsigned int rightmask;
+	unsigned int bitmask;
+	int count;
+
+	if (verbose > 0)
+		printf("Pruning %s_%x\n", tree->type, tree->maxage);
+
+	count = 0;
+	if (tree->childnode == LEAF)
+		return;
+	if (!tree->root)
+		return;
+
+	leftmask = rightmask = 0;
+	node = tree->root;
+	while (node) {
+		if (node->nextbyte)
+			goto advance;
+		if (node->leftnode == LEAF)
+			goto advance;
+		if (node->rightnode == LEAF)
+			goto advance;
+		if (!node->left)
+			goto advance;
+		if (!node->right)
+			goto advance;
+		left = node->left;
+		right = node->right;
+		if (left->keymask == 0)
+			goto advance;
+		if (right->keymask == 0)
+			goto advance;
+		if (left->keymask != right->keymask)
+			goto advance;
+		if (left->keybits != right->keybits)
+			goto advance;
+		leftleaf = NULL;
+		while (!leftleaf) {
+			assert(left->left || left->right);
+			if (left->leftnode == LEAF)
+				leftleaf = left->left;
+			else if (left->rightnode == LEAF)
+				leftleaf = left->right;
+			else if (left->left)
+				left = left->left;
+			else if (left->right)
+				left = left->right;
+			else
+				assert(0);
+		}
+		rightleaf = NULL;
+		while (!rightleaf) {
+			assert(right->left || right->right);
+			if (right->leftnode == LEAF)
+				rightleaf = right->left;
+			else if (right->rightnode == LEAF)
+				rightleaf = right->right;
+			else if (right->left)
+				right = right->left;
+			else if (right->right)
+				right = right->right;
+			else
+				assert(0);
+		}
+		if (! tree->leaf_equal(leftleaf, rightleaf))
+			goto advance;
+		/*
+		 * This node has identical singleton-only subtrees.
+		 * Remove it.
+		 */
+		parent = node->parent;
+		left = node->left;
+		right = node->right;
+		if (parent->left == node)
+			parent->left = left;
+		else if (parent->right == node)
+			parent->right = left;
+		else
+			assert(0);
+		left->parent = parent;
+		left->keymask |= (1 << node->bitnum);
+		node->left = NULL;
+		while (node) {
+			bitmask = 1 << node->bitnum;
+			leftmask &= ~bitmask;
+			rightmask &= ~bitmask;
+			if (node->leftnode == NODE && node->left) {
+				left = node->left;
+				free(node);
+				count++;
+				node = left;
+			} else if (node->rightnode == NODE && node->right) {
+				right = node->right;
+				free(node);
+				count++;
+				node = right;
+			} else {
+				node = NULL;
+			}
+		}
+		/* Propagate keymasks up along singleton chains. */
+		node = parent;
+		/* Force re-check */
+		bitmask = 1 << node->bitnum;
+		leftmask &= ~bitmask;
+		rightmask &= ~bitmask;
+		for (;;) {
+			if (node->left && node->right)
+				break;
+			if (node->left) {
+				left = node->left;
+				node->keymask |= left->keymask;
+				node->keybits |= left->keybits;
+			}
+			if (node->right) {
+				right = node->right;
+				node->keymask |= right->keymask;
+				node->keybits |= right->keybits;
+			}
+			node->keymask |= (1 << node->bitnum);
+			node = node->parent;
+			/* Force re-check */
+			bitmask = 1 << node->bitnum;
+			leftmask &= ~bitmask;
+			rightmask &= ~bitmask;
+		}
+	advance:
+		bitmask = 1 << node->bitnum;
+		if ((leftmask & bitmask) == 0 &&
+		    node->leftnode == NODE &&
+		    node->left) {
+			leftmask |= bitmask;
+			node = node->left;
+		} else if ((rightmask & bitmask) == 0 &&
+			   node->rightnode == NODE &&
+			   node->right) {
+			rightmask |= bitmask;
+			node = node->right;
+		} else {
+			leftmask &= ~bitmask;
+			rightmask &= ~bitmask;
+			node = node->parent;
+		}
+	}
+	if (verbose > 0)
+		printf("Pruned %d nodes\n", count);
+}
+
+/*
+ * Mark the nodes in the tree that lead to leaves that must be
+ * emitted.
+ */
+static void
+mark_nodes(struct tree *tree)
+{
+	struct node *node;
+	struct node *n;
+	unsigned int leftmask;
+	unsigned int rightmask;
+	unsigned int bitmask;
+	int marked;
+
+	marked = 0;
+	if (verbose > 0)
+		printf("Marking %s_%x\n", tree->type, tree->maxage);
+	if (tree->childnode == LEAF)
+		goto done;
+
+	assert(tree->childnode == NODE);
+	node = tree->root;
+	leftmask = rightmask = 0;
+	while (node) {
+		bitmask = 1 << node->bitnum;
+		if ((leftmask & bitmask) == 0) {
+			leftmask |= bitmask;
+			if (node->leftnode == LEAF) {
+				assert(node->left);
+				if (tree->leaf_mark(node->left)) {
+					n = node;
+					while (n && !n->mark) {
+						marked++;
+						n->mark = 1;
+						n = n->parent;
+					}
+				}
+			} else if (node->left) {
+				assert(node->leftnode == NODE);
+				node = node->left;
+				continue;
+			}
+		}
+		if ((rightmask & bitmask) == 0) {
+			rightmask |= bitmask;
+			if (node->rightnode == LEAF) {
+				assert(node->right);
+				if (tree->leaf_mark(node->right)) {
+					n = node;
+					while (n && !n->mark) {
+						marked++;
+						n->mark = 1;
+						n = n->parent;
+					}
+				}
+			} else if (node->right) {
+				assert(node->rightnode==NODE);
+				node = node->right;
+				continue;
+			}
+		}
+		leftmask &= ~bitmask;
+		rightmask &= ~bitmask;
+		node = node->parent;
+	}
+
+	/* second pass: left siblings and singletons */
+
+	assert(tree->childnode == NODE);
+	node = tree->root;
+	leftmask = rightmask = 0;
+	while (node) {
+		bitmask = 1 << node->bitnum;
+		if ((leftmask & bitmask) == 0) {
+			leftmask |= bitmask;
+			if (node->leftnode == LEAF) {
+				assert(node->left);
+				if (tree->leaf_mark(node->left)) {
+					n = node;
+					while (n && !n->mark) {
+						marked++;
+						n->mark = 1;
+						n = n->parent;
+					}
+				}
+			} else if (node->left) {
+				assert(node->leftnode == NODE);
+				node = node->left;
+				if (!node->mark && node->parent->mark) {
+					marked++;
+					node->mark = 1;
+				}
+				continue;
+			}
+		}
+		if ((rightmask & bitmask) == 0) {
+			rightmask |= bitmask;
+			if (node->rightnode == LEAF) {
+				assert(node->right);
+				if (tree->leaf_mark(node->right)) {
+					n = node;
+					while (n && !n->mark) {
+						marked++;
+						n->mark = 1;
+						n = n->parent;
+					}
+				}
+			} else if (node->right) {
+				assert(node->rightnode==NODE);
+				node = node->right;
+				if (!node->mark && node->parent->mark &&
+				    !node->parent->left) {
+					marked++;
+					node->mark = 1;
+				}
+				continue;
+			}
+		}
+		leftmask &= ~bitmask;
+		rightmask &= ~bitmask;
+		node = node->parent;
+	}
+done:
+	if (verbose > 0)
+		printf("Marked %d nodes\n", marked);
+}
+
+/*
+ * Compute the index of each node and leaf, which is the offset in the
+ * emitted trie.  These value must be pre-computed because relative
+ * offsets between nodes are used to navigate the tree.
+ */
+static int
+index_nodes(struct tree *tree, int index)
+{
+	struct node *node;
+	unsigned int leftmask;
+	unsigned int rightmask;
+	unsigned int bitmask;
+	int count;
+	int indent;
+
+	/* Align to a cache line (or half a cache line?). */
+	while (index % 64)
+		index++;
+	tree->index = index;
+	indent = 1;
+	count = 0;
+
+	if (verbose > 0)
+		printf("Indexing %s_%x: %d", tree->type, tree->maxage, index);
+	if (tree->childnode == LEAF) {
+		index += tree->leaf_size(tree->root);
+		goto done;
+	}
+
+	assert(tree->childnode == NODE);
+	node = tree->root;
+	leftmask = rightmask = 0;
+	while (node) {
+		if (!node->mark)
+			goto skip;
+		count++;
+		if (node->index != index)
+			node->index = index;
+		index += node->size;
+skip:
+		while (node) {
+			bitmask = 1 << node->bitnum;
+			if (node->mark && (leftmask & bitmask) == 0) {
+				leftmask |= bitmask;
+				if (node->leftnode == LEAF) {
+					assert(node->left);
+					*tree->leaf_index(tree, node->left) =
+									index;
+					index += tree->leaf_size(node->left);
+					count++;
+				} else if (node->left) {
+					assert(node->leftnode == NODE);
+					indent += 1;
+					node = node->left;
+					break;
+				}
+			}
+			if (node->mark && (rightmask & bitmask) == 0) {
+				rightmask |= bitmask;
+				if (node->rightnode == LEAF) {
+					assert(node->right);
+					*tree->leaf_index(tree, node->right) = index;
+					index += tree->leaf_size(node->right);
+					count++;
+				} else if (node->right) {
+					assert(node->rightnode==NODE);
+					indent += 1;
+					node = node->right;
+					break;
+				}
+			}
+			leftmask &= ~bitmask;
+			rightmask &= ~bitmask;
+			node = node->parent;
+			indent -= 1;
+		}
+	}
+done:
+	/* Round up to a multiple of 16 */
+	while (index % 16)
+		index++;
+	if (verbose > 0)
+		printf("Final index %d\n", index);
+	return index;
+}
+
+/*
+ * Compute the size of nodes and leaves. We start by assuming that
+ * each node needs to store a three-byte offset. The indexes of the
+ * nodes are calculated based on that, and then this function is
+ * called to see if the sizes of some nodes can be reduced.  This is
+ * repeated until no more changes are seen.
+ */
+static int
+size_nodes(struct tree *tree)
+{
+	struct tree *next;
+	struct node *node;
+	struct node *right;
+	struct node *n;
+	unsigned int leftmask;
+	unsigned int rightmask;
+	unsigned int bitmask;
+	unsigned int pathbits;
+	unsigned int pathmask;
+	int changed;
+	int offset;
+	int size;
+	int indent;
+
+	indent = 1;
+	changed = 0;
+	size = 0;
+
+	if (verbose > 0)
+		printf("Sizing %s_%x", tree->type, tree->maxage);
+	if (tree->childnode == LEAF)
+		goto done;
+
+	assert(tree->childnode == NODE);
+	pathbits = 0;
+	pathmask = 0;
+	node = tree->root;
+	leftmask = rightmask = 0;
+	while (node) {
+		if (!node->mark)
+			goto skip;
+		offset = 0;
+		if (!node->left || !node->right) {
+			size = 1;
+		} else {
+			if (node->rightnode == NODE) {
+				right = node->right;
+				next = tree->next;
+				while (!right->mark) {
+					assert(next);
+					n = next->root;
+					while (n->bitnum != node->bitnum) {
+						if (pathbits & (1<<n->bitnum))
+							n = n->right;
+						else
+							n = n->left;
+					}
+					n = n->right;
+					assert(right->bitnum == n->bitnum);
+					right = n;
+					next = next->next;
+				}
+				offset = right->index - node->index;
+			} else {
+				offset = *tree->leaf_index(tree, node->right);
+				offset -= node->index;
+			}
+			assert(offset >= 0);
+			assert(offset <= 0xffffff);
+			if (offset <= 0xff) {
+				size = 2;
+			} else if (offset <= 0xffff) {
+				size = 3;
+			} else { /* offset <= 0xffffff */
+				size = 4;
+			}
+		}
+		if (node->size != size || node->offset != offset) {
+			node->size = size;
+			node->offset = offset;
+			changed++;
+		}
+skip:
+		while (node) {
+			bitmask = 1 << node->bitnum;
+			pathmask |= bitmask;
+			if (node->mark && (leftmask & bitmask) == 0) {
+				leftmask |= bitmask;
+				if (node->leftnode == LEAF) {
+					assert(node->left);
+				} else if (node->left) {
+					assert(node->leftnode == NODE);
+					indent += 1;
+					node = node->left;
+					break;
+				}
+			}
+			if (node->mark && (rightmask & bitmask) == 0) {
+				rightmask |= bitmask;
+				pathbits |= bitmask;
+				if (node->rightnode == LEAF) {
+					assert(node->right);
+				} else if (node->right) {
+					assert(node->rightnode==NODE);
+					indent += 1;
+					node = node->right;
+					break;
+				}
+			}
+			leftmask &= ~bitmask;
+			rightmask &= ~bitmask;
+			pathmask &= ~bitmask;
+			pathbits &= ~bitmask;
+			node = node->parent;
+			indent -= 1;
+		}
+	}
+done:
+	if (verbose > 0)
+		printf("Found %d changes\n", changed);
+	return changed;
+}
+
+/*
+ * Emit a trie for the given tree into the data array.
+ */
+static void
+emit(struct tree *tree, unsigned char *data)
+{
+	struct node *node;
+	unsigned int leftmask;
+	unsigned int rightmask;
+	unsigned int bitmask;
+	int offlen;
+	int offset;
+	int index;
+	int indent;
+	unsigned char byte;
+
+	index = tree->index;
+	data += index;
+	indent = 1;
+	if (verbose > 0)
+		printf("Emitting %s_%x\n", tree->type, tree->maxage);
+	if (tree->childnode == LEAF) {
+		assert(tree->root);
+		tree->leaf_emit(tree->root, data);
+		return;
+	}
+
+	assert(tree->childnode == NODE);
+	node = tree->root;
+	leftmask = rightmask = 0;
+	while (node) {
+		if (!node->mark)
+			goto skip;
+		assert(node->offset != -1);
+		assert(node->index == index);
+
+		byte = 0;
+		if (node->nextbyte)
+			byte |= NEXTBYTE;
+		byte |= (node->bitnum & BITNUM);
+		if (node->left && node->right) {
+			if (node->leftnode == NODE)
+				byte |= LEFTNODE;
+			if (node->rightnode == NODE)
+				byte |= RIGHTNODE;
+			if (node->offset <= 0xff)
+				offlen = 1;
+			else if (node->offset <= 0xffff)
+				offlen = 2;
+			else
+				offlen = 3;
+			offset = node->offset;
+			byte |= offlen << OFFLEN_SHIFT;
+			*data++ = byte;
+			index++;
+			while (offlen--) {
+				*data++ = offset & 0xff;
+				index++;
+				offset >>= 8;
+			}
+		} else if (node->left) {
+			if (node->leftnode == NODE)
+				byte |= TRIENODE;
+			*data++ = byte;
+			index++;
+		} else if (node->right) {
+			byte |= RIGHTNODE;
+			if (node->rightnode == NODE)
+				byte |= TRIENODE;
+			*data++ = byte;
+			index++;
+		} else {
+			assert(0);
+		}
+skip:
+		while (node) {
+			bitmask = 1 << node->bitnum;
+			if (node->mark && (leftmask & bitmask) == 0) {
+				leftmask |= bitmask;
+				if (node->leftnode == LEAF) {
+					assert(node->left);
+					data = tree->leaf_emit(node->left,
+							       data);
+					index += tree->leaf_size(node->left);
+				} else if (node->left) {
+					assert(node->leftnode == NODE);
+					indent += 1;
+					node = node->left;
+					break;
+				}
+			}
+			if (node->mark && (rightmask & bitmask) == 0) {
+				rightmask |= bitmask;
+				if (node->rightnode == LEAF) {
+					assert(node->right);
+					data = tree->leaf_emit(node->right,
+							       data);
+					index += tree->leaf_size(node->right);
+				} else if (node->right) {
+					assert(node->rightnode==NODE);
+					indent += 1;
+					node = node->right;
+					break;
+				}
+			}
+			leftmask &= ~bitmask;
+			rightmask &= ~bitmask;
+			node = node->parent;
+			indent -= 1;
+		}
+	}
+}
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * Unicode data.
+ *
+ * We need to keep track of the Canonical Combining Class, the Age,
+ * and decompositions for a code point.
+ *
+ * For the Age, we store the index into the ages table.  Effectively
+ * this is a generation number that the table maps to a unicode
+ * version.
+ *
+ * The correction field is used to indicate that this entry is in the
+ * corrections array, which contains decompositions that were
+ * corrected in later revisions.  The value of the correction field is
+ * the Unicode version in which the mapping was corrected.
+ */
+struct unicode_data {
+	unsigned int code;
+	int ccc;
+	int gen;
+	int correction;
+	unsigned int *utf32nfkdi;
+	unsigned int *utf32nfkdicf;
+	char *utf8nfkdi;
+	char *utf8nfkdicf;
+};
+
+struct unicode_data unicode_data[0x110000];
+struct unicode_data *corrections;
+int    corrections_count;
+
+struct tree *nfkdi_tree;
+struct tree *nfkdicf_tree;
+
+struct tree *trees;
+int          trees_count;
+
+/*
+ * Check the corrections array to see if this entry was corrected at
+ * some point.
+ */
+static struct unicode_data *
+corrections_lookup(struct unicode_data *u)
+{
+	int i;
+
+	for (i = 0; i != corrections_count; i++)
+		if (u->code == corrections[i].code)
+			return &corrections[i];
+	return u;
+}
+
+static int
+nfkdi_equal(void *l, void *r)
+{
+	struct unicode_data *left = l;
+	struct unicode_data *right = r;
+
+	if (left->gen != right->gen)
+		return 0;
+	if (left->ccc != right->ccc)
+		return 0;
+	if (left->utf8nfkdi && right->utf8nfkdi &&
+	    strcmp(left->utf8nfkdi, right->utf8nfkdi) == 0)
+		return 1;
+	if (left->utf8nfkdi || right->utf8nfkdi)
+		return 0;
+	return 1;
+}
+
+static int
+nfkdicf_equal(void *l, void *r)
+{
+	struct unicode_data *left = l;
+	struct unicode_data *right = r;
+
+	if (left->gen != right->gen)
+		return 0;
+	if (left->ccc != right->ccc)
+		return 0;
+	if (left->utf8nfkdicf && right->utf8nfkdicf &&
+	    strcmp(left->utf8nfkdicf, right->utf8nfkdicf) == 0)
+		return 1;
+	if (left->utf8nfkdicf && right->utf8nfkdicf)
+		return 0;
+	if (left->utf8nfkdicf || right->utf8nfkdicf)
+		return 0;
+	if (left->utf8nfkdi && right->utf8nfkdi &&
+	    strcmp(left->utf8nfkdi, right->utf8nfkdi) == 0)
+		return 1;
+	if (left->utf8nfkdi || right->utf8nfkdi)
+		return 0;
+	return 1;
+}
+
+static void
+nfkdi_print(void *l, int indent)
+{
+	struct unicode_data *leaf = l;
+
+	printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf,
+		leaf->code, leaf->ccc, leaf->gen);
+	if (leaf->utf8nfkdi)
+		printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi);
+	printf("\n");
+}
+
+static void
+nfkdicf_print(void *l, int indent)
+{
+	struct unicode_data *leaf = l;
+
+	printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf,
+		leaf->code, leaf->ccc, leaf->gen);
+	if (leaf->utf8nfkdicf)
+		printf(" nfkdicf \"%s\"", (const char*)leaf->utf8nfkdicf);
+	else if (leaf->utf8nfkdi)
+		printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi);
+	printf("\n");
+}
+
+static int
+nfkdi_mark(void *l)
+{
+	return 1;
+}
+
+static int
+nfkdicf_mark(void *l)
+{
+	struct unicode_data *leaf = l;
+
+	if (leaf->utf8nfkdicf)
+		return 1;
+	return 0;
+}
+
+static int
+correction_mark(void *l)
+{
+	struct unicode_data *leaf = l;
+
+	return leaf->correction;
+}
+
+static int
+nfkdi_size(void *l)
+{
+	struct unicode_data *leaf = l;
+
+	int size = 2;
+	if (leaf->utf8nfkdi)
+		size += strlen(leaf->utf8nfkdi) + 1;
+	return size;
+}
+
+static int
+nfkdicf_size(void *l)
+{
+	struct unicode_data *leaf = l;
+
+	int size = 2;
+	if (leaf->utf8nfkdicf)
+		size += strlen(leaf->utf8nfkdicf) + 1;
+	else if (leaf->utf8nfkdi)
+		size += strlen(leaf->utf8nfkdi) + 1;
+	return size;
+}
+
+static int *
+nfkdi_index(struct tree *tree, void *l)
+{
+	struct unicode_data *leaf = l;
+
+	return &tree->leafindex[leaf->code];
+}
+
+static int *
+nfkdicf_index(struct tree *tree, void *l)
+{
+	struct unicode_data *leaf = l;
+
+	return &tree->leafindex[leaf->code];
+}
+
+static unsigned char *
+nfkdi_emit(void *l, unsigned char *data)
+{
+	struct unicode_data *leaf = l;
+	unsigned char *s;
+
+	*data++ = leaf->gen;
+	if (leaf->utf8nfkdi) {
+		*data++ = DECOMPOSE;
+		s = (unsigned char*)leaf->utf8nfkdi;
+		while ((*data++ = *s++) != 0)
+			;
+	} else {
+		*data++ = leaf->ccc;
+	}
+	return data;
+}
+
+static unsigned char *
+nfkdicf_emit(void *l, unsigned char *data)
+{
+	struct unicode_data *leaf = l;
+	unsigned char *s;
+
+	*data++ = leaf->gen;
+	if (leaf->utf8nfkdicf) {
+		*data++ = DECOMPOSE;
+		s = (unsigned char*)leaf->utf8nfkdicf;
+		while ((*data++ = *s++) != 0)
+			;
+	} else if (leaf->utf8nfkdi) {
+		*data++ = DECOMPOSE;
+		s = (unsigned char*)leaf->utf8nfkdi;
+		while ((*data++ = *s++) != 0)
+			;
+	} else {
+		*data++ = leaf->ccc;
+	}
+	return data;
+}
+
+static void
+utf8_create(struct unicode_data *data)
+{
+	char utf[18*4+1];
+	char *u;
+	unsigned int *um;
+	int i;
+
+	u = utf;
+	um = data->utf32nfkdi;
+	if (um) {
+		for (i = 0; um[i]; i++)
+			u += utf8key(um[i], u);
+		*u = '\0';
+		data->utf8nfkdi = strdup((char*)utf);
+	}
+	u = utf;
+	um = data->utf32nfkdicf;
+	if (um) {
+		for (i = 0; um[i]; i++)
+			u += utf8key(um[i], u);
+		*u = '\0';
+		if (!data->utf8nfkdi || strcmp(data->utf8nfkdi, (char*)utf))
+			data->utf8nfkdicf = strdup((char*)utf);
+	}
+}
+
+static void
+utf8_init(void)
+{
+	unsigned int unichar;
+	int i;
+
+	for (unichar = 0; unichar != 0x110000; unichar++)
+		utf8_create(&unicode_data[unichar]);
+
+	for (i = 0; i != corrections_count; i++)
+		utf8_create(&corrections[i]);
+}
+
+static void
+trees_init(void)
+{
+	struct unicode_data *data;
+	unsigned int maxage;
+	unsigned int nextage;
+	int count;
+	int i;
+	int j;
+
+	/* Count the number of different ages. */
+	count = 0;
+	nextage = (unsigned int)-1;
+	do {
+		maxage = nextage;
+		nextage = 0;
+		for (i = 0; i <= corrections_count; i++) {
+			data = &corrections[i];
+			if (nextage < data->correction &&
+			    data->correction < maxage)
+				nextage = data->correction;
+		}
+		count++;
+	} while (nextage);
+
+	/* Two trees per age: nfkdi and nfkdicf */
+	trees_count = count * 2;
+	trees = calloc(trees_count, sizeof(struct tree));
+
+	/* Assign ages to the trees. */
+	count = trees_count;
+	nextage = (unsigned int)-1;
+	do {
+		maxage = nextage;
+		trees[--count].maxage = maxage;
+		trees[--count].maxage = maxage;
+		nextage = 0;
+		for (i = 0; i <= corrections_count; i++) {
+			data = &corrections[i];
+			if (nextage < data->correction &&
+			    data->correction < maxage)
+				nextage = data->correction;
+		}
+	} while (nextage);
+
+	/* The ages assigned above are off by one. */
+	for (i = 0; i != trees_count; i++) {
+		j = 0;
+		while (ages[j] < trees[i].maxage)
+			j++;
+		trees[i].maxage = ages[j-1];
+	}
+
+	/* Set up the forwarding between trees. */
+	trees[trees_count-2].next = &trees[trees_count-1];
+	trees[trees_count-1].leaf_mark = nfkdi_mark;
+	trees[trees_count-2].leaf_mark = nfkdicf_mark;
+	for (i = 0; i != trees_count-2; i += 2) {
+		trees[i].next = &trees[trees_count-2];
+		trees[i].leaf_mark = correction_mark;
+		trees[i+1].next = &trees[trees_count-1];
+		trees[i+1].leaf_mark = correction_mark;
+	}
+
+	/* Assign the callouts. */
+	for (i = 0; i != trees_count; i += 2) {
+		trees[i].type = "nfkdicf";
+		trees[i].leaf_equal = nfkdicf_equal;
+		trees[i].leaf_print = nfkdicf_print;
+		trees[i].leaf_size = nfkdicf_size;
+		trees[i].leaf_index = nfkdicf_index;
+		trees[i].leaf_emit = nfkdicf_emit;
+
+		trees[i+1].type = "nfkdi";
+		trees[i+1].leaf_equal = nfkdi_equal;
+		trees[i+1].leaf_print = nfkdi_print;
+		trees[i+1].leaf_size = nfkdi_size;
+		trees[i+1].leaf_index = nfkdi_index;
+		trees[i+1].leaf_emit = nfkdi_emit;
+	}
+
+	/* Finish init. */
+	for (i = 0; i != trees_count; i++)
+		trees[i].childnode = NODE;
+}
+
+static void
+trees_populate(void)
+{
+	struct unicode_data *data;
+	unsigned int unichar;
+	char keyval[4];
+	int keylen;
+	int i;
+
+	for (i = 0; i != trees_count; i++) {
+		if (verbose > 0) {
+			printf("Populating %s_%x\n",
+				trees[i].type, trees[i].maxage);
+		}
+		for (unichar = 0; unichar != 0x110000; unichar++) {
+			if (unicode_data[unichar].gen < 0)
+				continue;
+			keylen = utf8key(unichar, keyval);
+			data = corrections_lookup(&unicode_data[unichar]);
+			if (data->correction <= trees[i].maxage)
+				data = &unicode_data[unichar];
+			insert(&trees[i], keyval, keylen, data);
+		}
+	}
+}
+
+static void
+trees_reduce(void)
+{
+	int i;
+	int size;
+	int changed;
+
+	for (i = 0; i != trees_count; i++)
+		prune(&trees[i]);
+	for (i = 0; i != trees_count; i++)
+		mark_nodes(&trees[i]);
+	do {
+		size = 0;
+		for (i = 0; i != trees_count; i++)
+			size = index_nodes(&trees[i], size);
+		changed = 0;
+		for (i = 0; i != trees_count; i++)
+			changed += size_nodes(&trees[i]);
+	} while (changed);
+
+	utf8data = calloc(size, 1);
+	utf8data_size = size;
+	for (i = 0; i != trees_count; i++)
+		emit(&trees[i], utf8data);
+
+	if (verbose > 0) {
+		for (i = 0; i != trees_count; i++) {
+			printf("%s_%x idx %d\n",
+				trees[i].type, trees[i].maxage, trees[i].index);
+		}
+	}
+
+	nfkdi = utf8data + trees[trees_count-1].index;
+	nfkdicf = utf8data + trees[trees_count-2].index;
+
+	nfkdi_tree = &trees[trees_count-1];
+	nfkdicf_tree = &trees[trees_count-2];
+}
+
+static void
+verify(struct tree *tree)
+{
+	struct unicode_data *data;
+	utf8leaf_t	*leaf;
+	unsigned int	unichar;
+	char		key[4];
+	int		report;
+	int		nocf;
+
+	if (verbose > 0)
+		printf("Verifying %s_%x\n", tree->type, tree->maxage);
+	nocf = strcmp(tree->type, "nfkdicf");
+
+	for (unichar = 0; unichar != 0x110000; unichar++) {
+		report = 0;
+		data = corrections_lookup(&unicode_data[unichar]);
+		if (data->correction <= tree->maxage)
+			data = &unicode_data[unichar];
+		utf8key(unichar, key);
+		leaf = utf8lookup(tree, key);
+		if (!leaf) {
+			if (data->gen != -1)
+				report++;
+			if (unichar < 0xd800 || unichar > 0xdfff)
+				report++;
+		} else {
+			if (unichar >= 0xd800 && unichar <= 0xdfff)
+				report++;
+			if (data->gen == -1)
+				report++;
+			if (data->gen != LEAF_GEN(leaf))
+				report++;
+			if (LEAF_CCC(leaf) == DECOMPOSE) {
+				if (nocf) {
+					if (!data->utf8nfkdi) {
+						report++;
+					} else if (strcmp(data->utf8nfkdi,
+							LEAF_STR(leaf))) {
+						report++;
+					}
+				} else {
+					if (!data->utf8nfkdicf &&
+					    !data->utf8nfkdi) {
+						report++;
+					} else if (data->utf8nfkdicf) {
+						if (strcmp(data->utf8nfkdicf,
+							   LEAF_STR(leaf)))
+							report++;
+					} else if (strcmp(data->utf8nfkdi,
+							  LEAF_STR(leaf))) {
+						report++;
+					}
+				}
+			} else if (data->ccc != LEAF_CCC(leaf)) {
+				report++;
+			}
+		}
+		if (report) {
+			printf("%X code %X gen %d ccc %d"
+				" nfdki -> \"%s\"",
+				unichar, data->code, data->gen,
+				data->ccc,
+				data->utf8nfkdi);
+			if (leaf) {
+				printf(" age %d ccc %d"
+					" nfdki -> \"%s\"\n",
+					LEAF_GEN(leaf),
+					LEAF_CCC(leaf),
+					LEAF_CCC(leaf) == DECOMPOSE ?
+						LEAF_STR(leaf) : "");
+			}
+			printf("\n");
+		}
+	}
+}
+
+static void
+trees_verify(void)
+{
+	int i;
+
+	for (i = 0; i != trees_count; i++)
+		verify(&trees[i]);
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+help(void)
+{
+	printf("Usage: %s [options]\n", argv0);
+	printf("\n");
+	printf("This program creates an a data trie used for parsing and\n");
+	printf("normalization of UTF-8 strings. The trie is derived from\n");
+	printf("a set of input files from the Unicode character database\n");
+	printf("found at: http://www.unicode.org/Public/UCD/latest/ucd/\n");
+	printf("\n");
+	printf("The generated tree supports two normalization forms:\n");
+	printf("\n");
+	printf("\tnfkdi:\n");
+	printf("\t- Apply unicode normalization form NFKD.\n");
+	printf("\t- Remove any Default_Ignorable_Code_Point.\n");
+	printf("\n");
+	printf("\tnfkdicf:\n");
+	printf("\t- Apply unicode normalization form NFKD.\n");
+	printf("\t- Remove any Default_Ignorable_Code_Point.\n");
+	printf("\t- Apply a full casefold (C + F).\n");
+	printf("\n");
+	printf("These forms were chosen as being most useful when dealing\n");
+	printf("with file names: NFKD catches most cases where characters\n");
+	printf("should be considered equivalent. The ignorables are mostly\n");
+	printf("invisible, making names hard to type.\n");
+	printf("\n");
+	printf("The options to specify the files to be used are listed\n");
+	printf("below with their default values, which are the names used\n");
+	printf("by version 7.0.0 of the Unicode Character Database.\n");
+	printf("\n");
+	printf("The input files:\n");
+	printf("\t-a %s\n", AGE_NAME);
+	printf("\t-c %s\n", CCC_NAME);
+	printf("\t-p %s\n", PROP_NAME);
+	printf("\t-d %s\n", DATA_NAME);
+	printf("\t-f %s\n", FOLD_NAME);
+	printf("\t-n %s\n", NORM_NAME);
+	printf("\n");
+	printf("Additionally, the generated tables are tested using:\n");
+	printf("\t-t %s\n", TEST_NAME);
+	printf("\n");
+	printf("Finally, the output file:\n");
+	printf("\t-o %s\n", UTF8_NAME);
+	printf("\n");
+}
+
+static void
+usage(void)
+{
+	help();
+	exit(1);
+}
+
+static void
+open_fail(const char *name, int error)
+{
+	printf("Error %d opening %s: %s\n", error, name, strerror(error));
+	exit(1);
+}
+
+static void
+file_fail(const char *filename)
+{
+	printf("Error parsing %s\n", filename);
+	exit(1);
+}
+
+static void
+line_fail(const char *filename, const char *line)
+{
+	printf("Error parsing %s:%s\n", filename, line);
+	exit(1);
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+print_utf32(unsigned int *utf32str)
+{
+	int	i;
+
+	for (i = 0; utf32str[i]; i++)
+		printf(" %X", utf32str[i]);
+}
+
+static void
+print_utf32nfkdi(unsigned int unichar)
+{
+	printf(" %X ->", unichar);
+	print_utf32(unicode_data[unichar].utf32nfkdi);
+	printf("\n");
+}
+
+static void
+print_utf32nfkdicf(unsigned int unichar)
+{
+	printf(" %X ->", unichar);
+	print_utf32(unicode_data[unichar].utf32nfkdicf);
+	printf("\n");
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+age_init(void)
+{
+	FILE *file;
+	unsigned int first;
+	unsigned int last;
+	unsigned int unichar;
+	unsigned int major;
+	unsigned int minor;
+	unsigned int revision;
+	int gen;
+	int count;
+	int ret;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", age_name);
+
+	file = fopen(age_name, "r");
+	if (!file)
+		open_fail(age_name, errno);
+	count = 0;
+
+	gen = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "# Age=V%d_%d_%d",
+				&major, &minor, &revision);
+		if (ret == 3) {
+			ages_count++;
+			if (verbose > 1)
+				printf(" Age V%d_%d_%d\n",
+					major, minor, revision);
+			if (!age_valid(major, minor, revision))
+				line_fail(age_name, line);
+			continue;
+		}
+		ret = sscanf(line, "# Age=V%d_%d", &major, &minor);
+		if (ret == 2) {
+			ages_count++;
+			if (verbose > 1)
+				printf(" Age V%d_%d\n", major, minor);
+			if (!age_valid(major, minor, 0))
+				line_fail(age_name, line);
+			continue;
+		}
+	}
+
+	/* We must have found something above. */
+	if (verbose > 1)
+		printf("%d age entries\n", ages_count);
+	if (ages_count == 0 || ages_count > MAXGEN)
+		file_fail(age_name);
+
+	/* There is a 0 entry. */
+	ages_count++;
+	ages = calloc(ages_count + 1, sizeof(*ages));
+	/* And a guard entry. */
+	ages[ages_count] = (unsigned int)-1;
+
+	rewind(file);
+	count = 0;
+	gen = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "# Age=V%d_%d_%d",
+				&major, &minor, &revision);
+		if (ret == 3) {
+			ages[++gen] =
+				UNICODE_AGE(major, minor, revision);
+			if (verbose > 1)
+				printf(" Age V%d_%d_%d = gen %d\n",
+					major, minor, revision, gen);
+			if (!age_valid(major, minor, revision))
+				line_fail(age_name, line);
+			continue;
+		}
+		ret = sscanf(line, "# Age=V%d_%d", &major, &minor);
+		if (ret == 2) {
+			ages[++gen] = UNICODE_AGE(major, minor, 0);
+			if (verbose > 1)
+				printf(" Age V%d_%d = %d\n",
+					major, minor, gen);
+			if (!age_valid(major, minor, 0))
+				line_fail(age_name, line);
+			continue;
+		}
+		ret = sscanf(line, "%X..%X ; %d.%d #",
+			     &first, &last, &major, &minor);
+		if (ret == 4) {
+			for (unichar = first; unichar <= last; unichar++)
+				unicode_data[unichar].gen = gen;
+			count += 1 + last - first;
+			if (verbose > 1)
+				printf("  %X..%X gen %d\n", first, last, gen);
+			if (!utf32valid(first) || !utf32valid(last))
+				line_fail(age_name, line);
+			continue;
+		}
+		ret = sscanf(line, "%X ; %d.%d #", &unichar, &major, &minor);
+		if (ret == 3) {
+			unicode_data[unichar].gen = gen;
+			count++;
+			if (verbose > 1)
+				printf("  %X gen %d\n", unichar, gen);
+			if (!utf32valid(unichar))
+				line_fail(age_name, line);
+			continue;
+		}
+	}
+	unicode_maxage = ages[gen];
+	fclose(file);
+
+	/* Nix surrogate block */
+	if (verbose > 1)
+		printf(" Removing surrogate block D800..DFFF\n");
+	for (unichar = 0xd800; unichar <= 0xdfff; unichar++)
+		unicode_data[unichar].gen = -1;
+
+	if (verbose > 0)
+	        printf("Found %d entries\n", count);
+	if (count == 0)
+		file_fail(age_name);
+}
+
+static void
+ccc_init(void)
+{
+	FILE *file;
+	unsigned int first;
+	unsigned int last;
+	unsigned int unichar;
+	unsigned int value;
+	int count;
+	int ret;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", ccc_name);
+
+	file = fopen(ccc_name, "r");
+	if (!file)
+		open_fail(ccc_name, errno);
+
+	count = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%X..%X ; %d #", &first, &last, &value);
+		if (ret == 3) {
+			for (unichar = first; unichar <= last; unichar++) {
+				unicode_data[unichar].ccc = value;
+                                count++;
+			}
+			if (verbose > 1)
+				printf(" %X..%X ccc %d\n", first, last, value);
+			if (!utf32valid(first) || !utf32valid(last))
+				line_fail(ccc_name, line);
+			continue;
+		}
+		ret = sscanf(line, "%X ; %d #", &unichar, &value);
+		if (ret == 2) {
+			unicode_data[unichar].ccc = value;
+                        count++;
+			if (verbose > 1)
+				printf(" %X ccc %d\n", unichar, value);
+			if (!utf32valid(unichar))
+				line_fail(ccc_name, line);
+			continue;
+		}
+	}
+	fclose(file);
+
+	if (verbose > 0)
+		printf("Found %d entries\n", count);
+	if (count == 0)
+		file_fail(ccc_name);
+}
+
+static void
+nfkdi_init(void)
+{
+	FILE *file;
+	unsigned int unichar;
+	unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+	char *s;
+	unsigned int *um;
+	int count;
+	int i;
+	int ret;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", data_name);
+	file = fopen(data_name, "r");
+	if (!file)
+		open_fail(data_name, errno);
+
+	count = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%X;%*[^;];%*[^;];%*[^;];%*[^;];%[^;];",
+			     &unichar, buf0);
+		if (ret != 2)
+			continue;
+		if (!utf32valid(unichar))
+			line_fail(data_name, line);
+
+		s = buf0;
+		/* skip over <tag> */
+		if (*s == '<')
+			while (*s++ != ' ')
+				;
+		/* decode the decomposition into UTF-32 */
+		i = 0;
+		while (*s) {
+			mapping[i] = strtoul(s, &s, 16);
+			if (!utf32valid(mapping[i]))
+				line_fail(data_name, line);
+			i++;
+		}
+		mapping[i++] = 0;
+
+		um = malloc(i * sizeof(unsigned int));
+		memcpy(um, mapping, i * sizeof(unsigned int));
+		unicode_data[unichar].utf32nfkdi = um;
+
+		if (verbose > 1)
+			print_utf32nfkdi(unichar);
+		count++;
+	}
+	fclose(file);
+	if (verbose > 0)
+		printf("Found %d entries\n", count);
+	if (count == 0)
+		file_fail(data_name);
+}
+
+static void
+nfkdicf_init(void)
+{
+	FILE *file;
+	unsigned int unichar;
+	unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+	char status;
+	char *s;
+	unsigned int *um;
+	int i;
+	int count;
+	int ret;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", fold_name);
+	file = fopen(fold_name, "r");
+	if (!file)
+		open_fail(fold_name, errno);
+
+	count = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%X; %c; %[^;];", &unichar, &status, buf0);
+		if (ret != 3)
+			continue;
+		if (!utf32valid(unichar))
+			line_fail(fold_name, line);
+		/* Use the C+F casefold. */
+		if (status != 'C' && status != 'F')
+			continue;
+		s = buf0;
+		if (*s == '<')
+			while (*s++ != ' ')
+				;
+		i = 0;
+		while (*s) {
+			mapping[i] = strtoul(s, &s, 16);
+			if (!utf32valid(mapping[i]))
+				line_fail(fold_name, line);
+			i++;
+		}
+		mapping[i++] = 0;
+
+		um = malloc(i * sizeof(unsigned int));
+		memcpy(um, mapping, i * sizeof(unsigned int));
+		unicode_data[unichar].utf32nfkdicf = um;
+
+		if (verbose > 1)
+			print_utf32nfkdicf(unichar);
+		count++;
+	}
+	fclose(file);
+	if (verbose > 0)
+		printf("Found %d entries\n", count);
+	if (count == 0)
+		file_fail(fold_name);
+}
+
+static void
+ignore_init(void)
+{
+	FILE *file;
+	unsigned int unichar;
+	unsigned int first;
+	unsigned int last;
+	unsigned int *um;
+	int count;
+	int ret;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", prop_name);
+	file = fopen(prop_name, "r");
+	if (!file)
+		open_fail(prop_name, errno);
+	assert(file);
+	count = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%X..%X ; %s # ", &first, &last, buf0);
+		if (ret == 3) {
+			if (strcmp(buf0, "Default_Ignorable_Code_Point"))
+				continue;
+			if (!utf32valid(first) || !utf32valid(last))
+				line_fail(prop_name, line);
+			for (unichar = first; unichar <= last; unichar++) {
+				free(unicode_data[unichar].utf32nfkdi);
+				um = malloc(sizeof(unsigned int));
+				*um = 0;
+				unicode_data[unichar].utf32nfkdi = um;
+				free(unicode_data[unichar].utf32nfkdicf);
+				um = malloc(sizeof(unsigned int));
+				*um = 0;
+				unicode_data[unichar].utf32nfkdicf = um;
+				count++;
+			}
+			if (verbose > 1)
+				printf(" %X..%X Default_Ignorable_Code_Point\n",
+					first, last);
+			continue;
+		}
+		ret = sscanf(line, "%X ; %s # ", &unichar, buf0);
+		if (ret == 2) {
+			if (strcmp(buf0, "Default_Ignorable_Code_Point"))
+				continue;
+			if (!utf32valid(unichar))
+				line_fail(prop_name, line);
+			free(unicode_data[unichar].utf32nfkdi);
+			um = malloc(sizeof(unsigned int));
+			*um = 0;
+			unicode_data[unichar].utf32nfkdi = um;
+			free(unicode_data[unichar].utf32nfkdicf);
+			um = malloc(sizeof(unsigned int));
+			*um = 0;
+			unicode_data[unichar].utf32nfkdicf = um;
+			if (verbose > 1)
+				printf(" %X Default_Ignorable_Code_Point\n",
+					unichar);
+			count++;
+			continue;
+		}
+	}
+	fclose(file);
+
+	if (verbose > 0)
+		printf("Found %d entries\n", count);
+	if (count == 0)
+		file_fail(prop_name);
+}
+
+static void
+corrections_init(void)
+{
+	FILE *file;
+	unsigned int unichar;
+	unsigned int major;
+	unsigned int minor;
+	unsigned int revision;
+	unsigned int age;
+	unsigned int *um;
+	unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+	char *s;
+	int i;
+	int count;
+	int ret;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", norm_name);
+	file = fopen(norm_name, "r");
+	if (!file)
+		open_fail(norm_name, errno);
+
+	count = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%X;%[^;];%[^;];%d.%d.%d #",
+				&unichar, buf0, buf1,
+				&major, &minor, &revision);
+		if (ret != 6)
+			continue;
+		if (!utf32valid(unichar) || !age_valid(major, minor, revision))
+			line_fail(norm_name, line);
+		count++;
+	}
+	corrections = calloc(count, sizeof(struct unicode_data));
+	corrections_count = count;
+	rewind(file);
+
+	count = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%X;%[^;];%[^;];%d.%d.%d #",
+				&unichar, buf0, buf1,
+				&major, &minor, &revision);
+		if (ret != 6)
+			continue;
+		if (!utf32valid(unichar) || !age_valid(major, minor, revision))
+			line_fail(norm_name, line);
+		corrections[count] = unicode_data[unichar];
+		assert(corrections[count].code == unichar);
+		age = UNICODE_AGE(major, minor, revision);
+		corrections[count].correction = age;
+
+		i = 0;
+		s = buf0;
+		while (*s) {
+			mapping[i] = strtoul(s, &s, 16);
+			if (!utf32valid(mapping[i]))
+				line_fail(norm_name, line);
+			i++;
+		}
+		mapping[i++] = 0;
+
+		um = malloc(i * sizeof(unsigned int));
+		memcpy(um, mapping, i * sizeof(unsigned int));
+		corrections[count].utf32nfkdi = um;
+
+		if (verbose > 1)
+			printf(" %X -> %s -> %s V%d_%d_%d\n",
+				unichar, buf0, buf1, major, minor, revision);
+		count++;
+	}
+	fclose(file);
+
+	if (verbose > 0)
+	        printf("Found %d entries\n", count);
+	if (count == 0)
+		file_fail(norm_name);
+}
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * Hangul decomposition (algorithm from Section 3.12 of Unicode 6.3.0)
+ *
+ * AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
+ * D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
+ *
+ * SBase = 0xAC00
+ * LBase = 0x1100
+ * VBase = 0x1161
+ * TBase = 0x11A7
+ * LCount = 19
+ * VCount = 21
+ * TCount = 28
+ * NCount = 588 (VCount * TCount)
+ * SCount = 11172 (LCount * NCount)
+ *
+ * Decomposition:
+ *   SIndex = s - SBase
+ *
+ * LV (Canonical/Full)
+ *   LIndex = SIndex / NCount
+ *   VIndex = (Sindex % NCount) / TCount
+ *   LPart = LBase + LIndex
+ *   VPart = VBase + VIndex
+ *
+ * LVT (Canonical)
+ *   LVIndex = (SIndex / TCount) * TCount
+ *   TIndex = (Sindex % TCount
+ *   LVPart = LBase + LVIndex
+ *   TPart = TBase + TIndex
+ *
+ * LVT (Full)
+ *   LIndex = SIndex / NCount
+ *   VIndex = (Sindex % NCount) / TCount
+ *   TIndex = (Sindex % TCount
+ *   LPart = LBase + LIndex
+ *   VPart = VBase + VIndex
+ *   if (TIndex == 0) {
+ *          d = <LPart, VPart>
+ *   } else {
+ *          TPart = TBase + TIndex
+ *          d = <LPart, TPart, VPart>
+ *   }
+ *
+ */
+
+static void
+hangul_decompose(void)
+{
+	unsigned int sb = 0xAC00;
+	unsigned int lb = 0x1100;
+	unsigned int vb = 0x1161;
+	unsigned int tb = 0x11a7;
+	/* unsigned int lc = 19; */
+	unsigned int vc = 21;
+	unsigned int tc = 28;
+	unsigned int nc = (vc * tc);
+	/* unsigned int sc = (lc * nc); */
+	unsigned int unichar;
+	unsigned int mapping[4];
+	unsigned int *um;
+        int count;
+	int i;
+
+	if (verbose > 0)
+		printf("Decomposing hangul\n");
+	/* Hangul */
+	count = 0;
+	for (unichar = 0xAC00; unichar <= 0xD7A3; unichar++) {
+		unsigned int si = unichar - sb;
+		unsigned int li = si / nc;
+		unsigned int vi = (si % nc) / tc;
+		unsigned int ti = si % tc;
+
+		i = 0;
+		mapping[i++] = lb + li;
+		mapping[i++] = vb + vi;
+		if (ti)
+			mapping[i++] = tb + ti;
+		mapping[i++] = 0;
+
+		assert(!unicode_data[unichar].utf32nfkdi);
+		um = malloc(i * sizeof(unsigned int));
+		memcpy(um, mapping, i * sizeof(unsigned int));
+		unicode_data[unichar].utf32nfkdi = um;
+
+		assert(!unicode_data[unichar].utf32nfkdicf);
+		um = malloc(i * sizeof(unsigned int));
+		memcpy(um, mapping, i * sizeof(unsigned int));
+		unicode_data[unichar].utf32nfkdicf = um;
+
+		if (verbose > 1)
+			print_utf32nfkdi(unichar);
+
+		count++;
+	}
+	if (verbose > 0)
+		printf("Created %d entries\n", count);
+}
+
+static void
+nfkdi_decompose(void)
+{
+	unsigned int unichar;
+	unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+	unsigned int *um;
+	unsigned int *dc;
+	int count;
+	int i;
+	int j;
+	int ret;
+
+	if (verbose > 0)
+		printf("Decomposing nfkdi\n");
+
+	count = 0;
+	for (unichar = 0; unichar != 0x110000; unichar++) {
+		if (!unicode_data[unichar].utf32nfkdi)
+			continue;
+		for (;;) {
+			ret = 1;
+			i = 0;
+			um = unicode_data[unichar].utf32nfkdi;
+			while (*um) {
+				dc = unicode_data[*um].utf32nfkdi;
+				if (dc) {
+					for (j = 0; dc[j]; j++)
+						mapping[i++] = dc[j];
+					ret = 0;
+				} else {
+					mapping[i++] = *um;
+				}
+				um++;
+			}
+			mapping[i++] = 0;
+			if (ret)
+				break;
+			free(unicode_data[unichar].utf32nfkdi);
+			um = malloc(i * sizeof(unsigned int));
+			memcpy(um, mapping, i * sizeof(unsigned int));
+			unicode_data[unichar].utf32nfkdi = um;
+		}
+		/* Add this decomposition to nfkdicf if there is no entry. */
+		if (!unicode_data[unichar].utf32nfkdicf) {
+			um = malloc(i * sizeof(unsigned int));
+			memcpy(um, mapping, i * sizeof(unsigned int));
+			unicode_data[unichar].utf32nfkdicf = um;
+		}
+		if (verbose > 1)
+			print_utf32nfkdi(unichar);
+		count++;
+	}
+	if (verbose > 0)
+		printf("Processed %d entries\n", count);
+}
+
+static void
+nfkdicf_decompose(void)
+{
+	unsigned int unichar;
+	unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+	unsigned int *um;
+	unsigned int *dc;
+	int count;
+	int i;
+	int j;
+	int ret;
+
+	if (verbose > 0)
+		printf("Decomposing nfkdicf\n");
+	count = 0;
+	for (unichar = 0; unichar != 0x110000; unichar++) {
+		if (!unicode_data[unichar].utf32nfkdicf)
+			continue;
+		for (;;) {
+			ret = 1;
+			i = 0;
+			um = unicode_data[unichar].utf32nfkdicf;
+			while (*um) {
+				dc = unicode_data[*um].utf32nfkdicf;
+				if (dc) {
+					for (j = 0; dc[j]; j++)
+						mapping[i++] = dc[j];
+					ret = 0;
+				} else {
+					mapping[i++] = *um;
+				}
+				um++;
+			}
+			mapping[i++] = 0;
+			if (ret)
+				break;
+			free(unicode_data[unichar].utf32nfkdicf);
+			um = malloc(i * sizeof(unsigned int));
+			memcpy(um, mapping, i * sizeof(unsigned int));
+			unicode_data[unichar].utf32nfkdicf = um;
+		}
+		if (verbose > 1)
+			print_utf32nfkdicf(unichar);
+		count++;
+	}
+	if (verbose > 0)
+		printf("Processed %d entries\n", count);
+}
+
+/* ------------------------------------------------------------------ */
+
+int utf8agemax(struct tree *, const char *);
+int utf8nagemax(struct tree *, const char *, size_t);
+int utf8agemin(struct tree *, const char *);
+int utf8nagemin(struct tree *, const char *, size_t);
+ssize_t utf8len(struct tree *, const char *);
+ssize_t utf8nlen(struct tree *, const char *, size_t);
+struct utf8cursor;
+int utf8cursor(struct utf8cursor *, struct tree *, const char *);
+int utf8ncursor(struct utf8cursor *, struct tree *, const char *, size_t);
+int utf8byte(struct utf8cursor *);
+
+/*
+ * Use trie to scan s, touching at most len bytes.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * A non-NULL return guarantees that the UTF-8 sequence starting at s
+ * is well-formed and corresponds to a known unicode code point.  The
+ * shorthand for this will be "is valid UTF-8 unicode".
+ */
+static utf8leaf_t *
+utf8nlookup(struct tree *tree, const char *s, size_t len)
+{
+	utf8trie_t	*trie = utf8data + tree->index;
+	int		offlen;
+	int		offset;
+	int		mask;
+	int		node;
+
+	if (!tree)
+		return NULL;
+	if (len == 0)
+		return NULL;
+	node = 1;
+	while (node) {
+		offlen = (*trie & OFFLEN) >> OFFLEN_SHIFT;
+		if (*trie & NEXTBYTE) {
+			if (--len == 0)
+				return NULL;
+			s++;
+		}
+		mask = 1 << (*trie & BITNUM);
+		if (*s & mask) {
+			/* Right leg */
+			if (offlen) {
+				/* Right node at offset of trie */
+				node = (*trie & RIGHTNODE);
+				offset = trie[offlen];
+				while (--offlen) {
+					offset <<= 8;
+					offset |= trie[offlen];
+				}
+				trie += offset;
+			} else if (*trie & RIGHTPATH) {
+				/* Right node after this node */
+				node = (*trie & TRIENODE);
+				trie++;
+			} else {
+				/* No right node. */
+				node = 0;
+				trie = NULL;
+			}
+		} else {
+			/* Left leg */
+			if (offlen) {
+				/* Left node after this node. */
+				node = (*trie & LEFTNODE);
+				trie += offlen + 1;
+			} else if (*trie & RIGHTPATH) {
+				/* No left node. */
+				node = 0;
+				trie = NULL;
+			} else {
+				/* Left node after this node */
+				node = (*trie & TRIENODE);
+				trie++;
+			}
+		}
+	}
+	return trie;
+}
+
+/*
+ * Use trie to scan s.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * Forwards to trie_nlookup().
+ */
+static utf8leaf_t *
+utf8lookup(struct tree *tree, const char *s)
+{
+	return utf8nlookup(tree, s, (size_t)-1);
+}
+
+/*
+ * Return the number of bytes used by the current UTF-8 sequence.
+ * Assumes the input points to the first byte of a valid UTF-8
+ * sequence.
+ */
+static inline int
+utf8clen(const char *s)
+{
+	unsigned char c = *s;
+	return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0);
+}
+
+/*
+ * Maximum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if only non-assigned code points are used.
+ */
+int
+utf8agemax(struct tree *tree, const char *s)
+{
+	utf8leaf_t	*leaf;
+	int		age = 0;
+	int		leaf_age;
+
+	if (!tree)
+		return -1;
+	while (*s) {
+		if (!(leaf = utf8lookup(tree, s)))
+			return -1;
+		leaf_age = ages[LEAF_GEN(leaf)];
+		if (leaf_age <= tree->maxage && leaf_age > age)
+			age = leaf_age;
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Minimum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if non-assigned code points are used.
+ */
+int
+utf8agemin(struct tree *tree, const char *s)
+{
+	utf8leaf_t	*leaf;
+	int		age = tree->maxage;
+	int		leaf_age;
+
+	if (!tree)
+		return -1;
+	while (*s) {
+		if (!(leaf = utf8lookup(tree, s)))
+			return -1;
+		leaf_age = ages[LEAF_GEN(leaf)];
+		if (leaf_age <= tree->maxage && leaf_age < age)
+			age = leaf_age;
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Maximum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemax(struct tree *tree, const char *s, size_t len)
+{
+	utf8leaf_t	*leaf;
+	int		age = 0;
+	int		leaf_age;
+
+	if (!tree)
+		return -1;
+        while (len && *s) {
+		if (!(leaf = utf8nlookup(tree, s, len)))
+			return -1;
+		leaf_age = ages[LEAF_GEN(leaf)];
+		if (leaf_age <= tree->maxage && leaf_age > age)
+			age = leaf_age;
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Maximum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemin(struct tree *tree, const char *s, size_t len)
+{
+	utf8leaf_t	*leaf;
+	int		leaf_age;
+	int		age = tree->maxage;
+
+	if (!tree)
+		return -1;
+        while (len && *s) {
+		if (!(leaf = utf8nlookup(tree, s, len)))
+			return -1;
+		leaf_age = ages[LEAF_GEN(leaf)];
+		if (leaf_age <= tree->maxage && leaf_age < age)
+			age = leaf_age;
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Length of the normalization of s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ *
+ * A string of Default_Ignorable_Code_Point has length 0.
+ */
+ssize_t
+utf8len(struct tree *tree, const char *s)
+{
+	utf8leaf_t	*leaf;
+	size_t		ret = 0;
+
+	if (!tree)
+		return -1;
+	while (*s) {
+		if (!(leaf = utf8lookup(tree, s)))
+			return -1;
+		if (ages[LEAF_GEN(leaf)] > tree->maxage)
+			ret += utf8clen(s);
+		else if (LEAF_CCC(leaf) == DECOMPOSE)
+			ret += strlen(LEAF_STR(leaf));
+		else
+			ret += utf8clen(s);
+		s += utf8clen(s);
+	}
+	return ret;
+}
+
+/*
+ * Length of the normalization of s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+ssize_t
+utf8nlen(struct tree *tree, const char *s, size_t len)
+{
+	utf8leaf_t	*leaf;
+	size_t		ret = 0;
+
+	if (!tree)
+		return -1;
+	while (len && *s) {
+		if (!(leaf = utf8nlookup(tree, s, len)))
+			return -1;
+		if (ages[LEAF_GEN(leaf)] > tree->maxage)
+			ret += utf8clen(s);
+		else if (LEAF_CCC(leaf) == DECOMPOSE)
+			ret += strlen(LEAF_STR(leaf));
+		else
+			ret += utf8clen(s);
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return ret;
+}
+
+/*
+ * Cursor structure used by the normalizer.
+ */
+struct utf8cursor {
+	struct tree	*tree;
+	const char	*s;
+	const char	*p;
+	const char	*ss;
+	const char	*sp;
+	unsigned int	len;
+	unsigned int	slen;
+	short int	ccc;
+	short int	nccc;
+	unsigned int	unichar;
+};
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ *   s      : string.
+ *   len    : length of s.
+ *   u8c    : pointer to cursor.
+ *   trie   : utf8trie_t to use for normalization.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int
+utf8ncursor(
+	struct utf8cursor *u8c,
+	struct tree	*tree,
+	const char	*s,
+	size_t		len)
+{
+	if (!tree)
+		return -1;
+	if (!s)
+		return -1;
+	u8c->tree = tree;
+	u8c->s = s;
+	u8c->p = NULL;
+	u8c->ss = NULL;
+	u8c->sp = NULL;
+	u8c->len = len;
+	u8c->slen = 0;
+	u8c->ccc = STOPPER;
+	u8c->nccc = STOPPER;
+	u8c->unichar = 0;
+	/* Check we didn't clobber the maximum length. */
+	if (u8c->len != len)
+		return -1;
+	/* The first byte of s may not be an utf8 continuation. */
+	if (len > 0 && (*s & 0xC0) == 0x80)
+		return -1;
+	return 0;
+}
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ *   s      : NUL-terminated string.
+ *   u8c    : pointer to cursor.
+ *   trie   : utf8trie_t to use for normalization.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int
+utf8cursor(
+	struct utf8cursor *u8c,
+	struct tree	*tree,
+	const char	*s)
+{
+	return utf8ncursor(u8c, tree, s, (unsigned int)-1);
+}
+
+/*
+ * Get one byte from the normalized form of the string described by u8c.
+ *
+ * Returns the byte cast to an unsigned char on succes, and -1 on failure.
+ *
+ * The cursor keeps track of the location in the string in u8c->s.
+ * When a character is decomposed, the current location is stored in
+ * u8c->p, and u8c->s is set to the start of the decomposition. Note
+ * that bytes from a decomposition do not count against u8c->len.
+ *
+ * Characters are emitted if they match the current CCC in u8c->ccc.
+ * Hitting end-of-string while u8c->ccc == STOPPER means we're done,
+ * and the function returns 0 in that case.
+ *
+ * Sorting by CCC is done by repeatedly scanning the string.  The
+ * values of u8c->s and u8c->p are stored in u8c->ss and u8c->sp at
+ * the start of the scan.  The first pass finds the lowest CCC to be
+ * emitted and stores it in u8c->nccc, the second pass emits the
+ * characters with this CCC and finds the next lowest CCC. This limits
+ * the number of passes to 1 + the number of different CCCs in the
+ * sequence being scanned.
+ *
+ * Therefore:
+ *  u8c->p  != NULL -> a decomposition is being scanned.
+ *  u8c->ss != NULL -> this is a repeating scan.
+ *  u8c->ccc == -1  -> this is the first scan of a repeating scan.
+ */
+int
+utf8byte(struct utf8cursor *u8c)
+{
+	utf8leaf_t *leaf;
+	int ccc;
+
+	for (;;) {
+		/* Check for the end of a decomposed character. */
+		if (u8c->p && *u8c->s == '\0') {
+			u8c->s = u8c->p;
+			u8c->p = NULL;
+		}
+
+		/* Check for end-of-string. */
+		if (!u8c->p && (u8c->len == 0 || *u8c->s == '\0')) {
+			/* There is no next byte. */
+			if (u8c->ccc == STOPPER)
+				return 0;
+			/* End-of-string during a scan counts as a stopper. */
+			ccc = STOPPER;
+			goto ccc_mismatch;
+		} else if ((*u8c->s & 0xC0) == 0x80) {
+			/* This is a continuation of the current character. */
+			if (!u8c->p)
+				u8c->len--;
+			return (unsigned char)*u8c->s++;
+		}
+
+		/* Look up the data for the current character. */
+		if (u8c->p)
+			leaf = utf8lookup(u8c->tree, u8c->s);
+		else
+			leaf = utf8nlookup(u8c->tree, u8c->s, u8c->len);
+
+		/* No leaf found implies that the input is a binary blob. */
+		if (!leaf)
+			return -1;
+
+		/* Characters that are too new have CCC 0. */
+		if (ages[LEAF_GEN(leaf)] > u8c->tree->maxage) {
+			ccc = STOPPER;
+		} else if ((ccc = LEAF_CCC(leaf)) == DECOMPOSE) {
+			u8c->len -= utf8clen(u8c->s);
+			u8c->p = u8c->s + utf8clen(u8c->s);
+			u8c->s = LEAF_STR(leaf);
+			/* Empty decomposition implies CCC 0. */
+			if (*u8c->s == '\0') {
+				if (u8c->ccc == STOPPER)
+					continue;
+				ccc = STOPPER;
+				goto ccc_mismatch;
+			}
+			leaf = utf8lookup(u8c->tree, u8c->s);
+			ccc = LEAF_CCC(leaf);
+		}
+		u8c->unichar = utf8code(u8c->s);
+
+		/*
+		 * If this is not a stopper, then see if it updates
+		 * the next canonical class to be emitted.
+		 */
+		if (ccc != STOPPER && u8c->ccc < ccc && ccc < u8c->nccc)
+			u8c->nccc = ccc;
+
+		/*
+		 * Return the current byte if this is the current
+		 * combining class.
+		 */
+		if (ccc == u8c->ccc) {
+			if (!u8c->p)
+				u8c->len--;
+			return (unsigned char)*u8c->s++;
+		}
+
+		/* Current combining class mismatch. */
+	ccc_mismatch:
+		if (u8c->nccc == STOPPER) {
+			/*
+			 * Scan forward for the first canonical class
+			 * to be emitted.  Save the position from
+			 * which to restart.
+			 */
+			assert(u8c->ccc == STOPPER);
+			u8c->ccc = MINCCC - 1;
+			u8c->nccc = ccc;
+			u8c->sp = u8c->p;
+			u8c->ss = u8c->s;
+			u8c->slen = u8c->len;
+			if (!u8c->p)
+				u8c->len -= utf8clen(u8c->s);
+			u8c->s += utf8clen(u8c->s);
+		} else if (ccc != STOPPER) {
+			/* Not a stopper, and not the ccc we're emitting. */
+			if (!u8c->p)
+				u8c->len -= utf8clen(u8c->s);
+			u8c->s += utf8clen(u8c->s);
+		} else if (u8c->nccc != MAXCCC + 1) {
+			/* At a stopper, restart for next ccc. */
+			u8c->ccc = u8c->nccc;
+			u8c->nccc = MAXCCC + 1;
+			u8c->s = u8c->ss;
+			u8c->p = u8c->sp;
+			u8c->len = u8c->slen;
+		} else {
+			/* All done, proceed from here. */
+			u8c->ccc = STOPPER;
+			u8c->nccc = STOPPER;
+			u8c->sp = NULL;
+			u8c->ss = NULL;
+			u8c->slen = 0;
+		}
+	}
+}
+
+/* ------------------------------------------------------------------ */
+
+static int
+normalize_line(struct tree *tree)
+{
+	char *s;
+	char *t;
+	int c;
+	struct utf8cursor u8c;
+
+	/* First test: null-terminated string. */
+	s = buf2;
+	t = buf3;
+	if (utf8cursor(&u8c, tree, s))
+		return -1;
+	while ((c = utf8byte(&u8c)) > 0)
+		if (c != (unsigned char)*t++)
+			return -1;
+	if (c < 0)
+		return -1;
+	if (*t != 0)
+		return -1;
+
+	/* Second test: length-limited string. */
+	s = buf2;
+	/* Replace NUL with a value that will cause an error if seen. */
+	s[strlen(s) + 1] = -1;
+	t = buf3;
+	if (utf8cursor(&u8c, tree, s))
+		return -1;
+	while ((c = utf8byte(&u8c)) > 0)
+		if (c != (unsigned char)*t++)
+			return -1;
+	if (c < 0)
+		return -1;
+	if (*t != 0)
+		return -1;
+
+	return 0;
+}
+
+static void
+normalization_test(void)
+{
+	FILE *file;
+	unsigned int unichar;
+	struct unicode_data *data;
+	char *s;
+	char *t;
+	int ret;
+	int ignorables;
+	int tests = 0;
+	int failures = 0;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", test_name);
+	/* Step one, read data from file. */
+	file = fopen(test_name, "r");
+	if (!file)
+		open_fail(test_name, errno);
+
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%[^;];%*[^;];%*[^;];%*[^;];%[^;];",
+			     buf0, buf1);
+		if (ret != 2 || *line == '#')
+			continue;
+		s = buf0;
+		t = buf2;
+		while (*s) {
+			unichar = strtoul(s, &s, 16);
+			t += utf8key(unichar, t);
+		}
+		*t = '\0';
+
+		ignorables = 0;
+		s = buf1;
+		t = buf3;
+		while (*s) {
+			unichar = strtoul(s, &s, 16);
+			data = &unicode_data[unichar];
+			if (data->utf8nfkdi && !*data->utf8nfkdi)
+				ignorables = 1;
+			else
+				t += utf8key(unichar, t);
+		}
+		*t = '\0';
+
+		tests++;
+		if (normalize_line(nfkdi_tree) < 0) {
+			printf("\nline %s -> %s", buf0, buf1);
+			if (ignorables)
+				printf(" (ignorables removed)");
+			printf(" failure\n");
+			failures++;
+		}
+	}
+	fclose(file);
+	if (verbose > 0)
+		printf("Ran %d tests with %d failures\n", tests, failures);
+	if (failures)
+		file_fail(test_name);
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+write_file(void)
+{
+	FILE *file;
+	int i;
+	int j;
+	int t;
+	int gen;
+
+	if (verbose > 0)
+		printf("Writing %s\n", utf8_name);
+	file = fopen(utf8_name, "w");
+	if (!file)
+		open_fail(utf8_name, errno);
+
+	fprintf(file, "/* This file is generated code, do not edit. */\n");
+	fprintf(file, "#ifndef __INCLUDED_FROM_UTF8NORM_C__\n");
+	fprintf(file, "#error Only xfs_utf8.c may include this file.\n");
+	fprintf(file, "#endif\n");
+	fprintf(file, "\n");
+	fprintf(file, "static const unsigned int utf8vers = %#x;\n",
+		unicode_maxage);
+	fprintf(file, "\n");
+	fprintf(file, "static const unsigned int utf8agetab[] = {\n");
+	for (i = 0; i != ages_count; i++)
+		fprintf(file, "\t%#x%s\n", ages[i],
+			ages[i] == unicode_maxage ? "" : ",");
+	fprintf(file, "};\n");
+	fprintf(file, "\n");
+	fprintf(file, "static const struct utf8data utf8nfkdicfdata[] = {\n");
+	t = 0;
+	for (gen = 0; gen < ages_count; gen++) {
+		fprintf(file, "\t{ %#x, %d }%s\n",
+			ages[gen], trees[t].index,
+			ages[gen] == unicode_maxage ? "" : ",");
+		if (trees[t].maxage == ages[gen])
+			t += 2;
+	}
+	fprintf(file, "};\n");
+	fprintf(file, "\n");
+	fprintf(file, "static const struct utf8data utf8nfkdidata[] = {\n");
+	t = 1;
+	for (gen = 0; gen < ages_count; gen++) {
+		fprintf(file, "\t{ %#x, %d }%s\n",
+			ages[gen], trees[t].index,
+			ages[gen] == unicode_maxage ? "" : ",");
+		if (trees[t].maxage == ages[gen])
+			t += 2;
+	}
+	fprintf(file, "};\n");
+	fprintf(file, "\n");
+	fprintf(file, "static const unsigned char utf8data[%zd] = {\n",
+		utf8data_size);
+	t = 0;
+	for (i = 0; i != utf8data_size; i += 16) {
+		if (i == trees[t].index) {
+			fprintf(file, "\t/* %s_%x */\n",
+				trees[t].type, trees[t].maxage);
+			if (t < trees_count-1)
+				t++;
+		}
+		fprintf(file, "\t");
+		for (j = i; j != i + 16; j++)
+			fprintf(file, "0x%.2x%s", utf8data[j],
+				(j < utf8data_size -1 ? "," : ""));
+		fprintf(file, "\n");
+	}
+	fprintf(file, "};\n");
+	fclose(file);
+}
+
+/* ------------------------------------------------------------------ */
+
+int
+main(int argc, char *argv[])
+{
+	unsigned int unichar;
+	int opt;
+
+	argv0 = argv[0];
+
+	while ((opt = getopt(argc, argv, "a:c:d:f:hn:o:p:t:v")) != -1) {
+		switch (opt) {
+		case 'a':
+			age_name = optarg;
+			break;
+		case 'c':
+			ccc_name = optarg;
+			break;
+		case 'd':
+			data_name = optarg;
+			break;
+		case 'f':
+			fold_name = optarg;
+			break;
+		case 'n':
+			norm_name = optarg;
+			break;
+		case 'o':
+			utf8_name = optarg;
+			break;
+		case 'p':
+			prop_name = optarg;
+			break;
+		case 't':
+			test_name = optarg;
+			break;
+		case 'v':
+			verbose++;
+			break;
+		case 'h':
+			help();
+			exit(0);
+		default:
+			usage();
+		}
+	}
+
+	if (verbose > 1)
+		help();
+	for (unichar = 0; unichar != 0x110000; unichar++)
+		unicode_data[unichar].code = unichar;
+	age_init();
+	ccc_init();
+	nfkdi_init();
+	nfkdicf_init();
+	ignore_init();
+	corrections_init();
+	hangul_decompose();
+	nfkdi_decompose();
+	nfkdicf_decompose();
+	utf8_init();
+	trees_init();
+	trees_populate();
+	trees_reduce();
+	trees_verify();
+	/* Prevent "unused function" warning. */
+	(void)lookup(nfkdi_tree, " ");
+	if (verbose > 2)
+		tree_walk(nfkdi_tree);
+	if (verbose > 2)
+		tree_walk(nfkdicf_tree);
+	normalization_test();
+	write_file();
+
+	return 0;
+}
-- 
1.7.12.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 07b/10] xfs: add supporting code for UTF-8.
  2014-09-18 19:56 [RFC v2] Unicode/UTF-8 support for XFS Ben Myers
                   ` (13 preceding siblings ...)
  2014-09-19 16:03 ` [PATCH 07a/10] xfs: add trie generator for UTF-8 Ben Myers
@ 2014-09-19 16:04 ` Ben Myers
  2014-09-22 14:55   ` Andi Kleen
  2014-09-22 22:26   ` Dave Chinner
  16 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-19 16:04 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: tinguely, olaf, xfs

From: Olaf Weber <olaf@sgi.com>

Supporting functions for UTF-8 normalization are in utf8norm.c with the
header utf8norm.h. Two normalization forms are supported: nfkdi and nfkdicf.

  nfkdi:
   - Apply unicode normalization form NFKD.
   - Remove any Default_Ignorable_Code_Point.

  nfkdicf:
   - Apply unicode normalization form NFKD.
   - Remove any Default_Ignorable_Code_Point.
   - Apply a full casefold (C + F).

For the purposes of the code, a string is valid UTF-8 if:

 - The values encoded are 0x1..0x10FFFF.
 - The surrogate codepoints 0xD800..0xDFFFF are not encoded.
 - The shortest possible encoding is used for all values.

The supporting functions work on null-terminated strings (utf8 prefix) and
on length-limited strings (utf8n prefix).

Signed-off-by: Olaf Weber <olaf@sgi.com>

---
[v2: the trie is now separated into utf8norm.ko;
     utf8version is now a function and exported;
     introduced CONFIG_XFS_UTF8;
     removed trie generator due to vger size constraint.  --bpm]
---
 fs/xfs/utf8norm/Makefile   |   4 +
 fs/xfs/utf8norm/utf8norm.c | 649 +++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/utf8norm/utf8norm.h | 116 ++++++++
 3 files changed, 769 insertions(+)
 create mode 100644 fs/xfs/utf8norm/utf8norm.c
 create mode 100644 fs/xfs/utf8norm/utf8norm.h

diff --git a/fs/xfs/utf8norm/Makefile b/fs/xfs/utf8norm/Makefile
index 9b2efa9..f83f9b9 100644
--- a/fs/xfs/utf8norm/Makefile
+++ b/fs/xfs/utf8norm/Makefile
@@ -16,6 +16,10 @@
 # Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
 #
 
+ifeq ($(CONFIG_XFS_UTF8),y)
+obj-m		 		+= utf8norm.o
+endif
+
 hostprogs-y			:= mkutf8data
 $(obj)/utf8norm.o: $(obj)/utf8data.h
 $(obj)/utf8data.h: $(src)/ucd/*.txt
diff --git a/fs/xfs/utf8norm/utf8norm.c b/fs/xfs/utf8norm/utf8norm.c
new file mode 100644
index 0000000..995c4df
--- /dev/null
+++ b/fs/xfs/utf8norm/utf8norm.c
@@ -0,0 +1,649 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#include "utf8norm.h"
+
+struct utf8data {
+	unsigned int maxage;
+	unsigned int offset;
+};
+
+#define __INCLUDED_FROM_UTF8NORM_C__
+#include "utf8data.h"
+#undef __INCLUDED_FROM_UTF8NORM_C__
+
+const unsigned int utf8version(void)
+{
+	return utf8vers;
+}
+EXPORT_SYMBOL(utf8version);
+
+/*
+ * UTF-8 valid ranges.
+ *
+ * The UTF-8 encoding spreads the bits of a 32bit word over several
+ * bytes. This table gives the ranges that can be held and how they'd
+ * be represented.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000000 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000000 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * There is an additional requirement on UTF-8, in that only the
+ * shortest representation of a 32bit value is to be used.  A decoder
+ * must not decode sequences that do not satisfy this requirement.
+ * Thus the allowed ranges have a lower bound.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000080 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000800 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00010000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00200000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x04000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * Actual unicode characters are limited to the range 0x0 - 0x10FFFF,
+ * 17 planes of 65536 values.  This limits the sequences actually seen
+ * even more, to just the following.
+ *
+ *          0 -     0x7F: 0                   - 0x7F
+ *       0x80 -    0x7FF: 0xC2 0x80           - 0xDF 0xBF
+ *      0x800 -   0xFFFF: 0xE0 0xA0 0x80      - 0xEF 0xBF 0xBF
+ *    0x10000 - 0x10FFFF: 0xF0 0x90 0x80 0x80 - 0xF4 0x8F 0xBF 0xBF
+ *
+ * Within those ranges the surrogates 0xD800 - 0xDFFF are not allowed.
+ *
+ * Note that the longest sequence seen with valid usage is 4 bytes,
+ * the same a single UTF-32 character.  This makes the UTF-8
+ * representation of Unicode strictly smaller than UTF-32.
+ *
+ * The shortest sequence requirement was introduced by:
+ *    Corrigendum #1: UTF-8 Shortest Form
+ * It can be found here:
+ *    http://www.unicode.org/versions/corrigendum1.html
+ *
+ */
+
+/*
+ * Return the number of bytes used by the current UTF-8 sequence.
+ * Assumes the input points to the first byte of a valid UTF-8
+ * sequence.
+ */
+static inline int
+utf8clen(const char *s)
+{
+	unsigned char c = *s;
+	return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0);
+}
+
+/*
+ * utf8trie_t
+ *
+ * A compact binary tree, used to decode UTF-8 characters.
+ *
+ * Internal nodes are one byte for the node itself, and up to three
+ * bytes for an offset into the tree.  The first byte contains the
+ * following information:
+ *  NEXTBYTE  - flag        - advance to next byte if set
+ *  BITNUM    - 3 bit field - the bit number to tested
+ *  OFFLEN    - 2 bit field - number of bytes in the offset
+ * if offlen == 0 (non-branching node)
+ *  RIGHTPATH - 1 bit field - set if the following node is for the
+ *                            right-hand path (tested bit is set)
+ *  TRIENODE  - 1 bit field - set if the following node is an internal
+ *                            node, otherwise it is a leaf node
+ * if offlen != 0 (branching node)
+ *  LEFTNODE  - 1 bit field - set if the left-hand node is internal
+ *  RIGHTNODE - 1 bit field - set if the right-hand node is internal
+ *
+ * Due to the way utf8 works, there cannot be branching nodes with
+ * NEXTBYTE set, and moreover those nodes always have a righthand
+ * descendant.
+ */
+typedef const unsigned char utf8trie_t;
+#define BITNUM		0x07
+#define NEXTBYTE	0x08
+#define OFFLEN		0x30
+#define OFFLEN_SHIFT	4
+#define RIGHTPATH	0x40
+#define TRIENODE	0x80
+#define RIGHTNODE	0x40
+#define LEFTNODE	0x80
+
+/*
+ * utf8leaf_t
+ *
+ * The leaves of the trie are embedded in the trie, and so the same
+ * underlying datatype: unsigned char.
+ *
+ * leaf[0]: The unicode version, stored as a generation number that is
+ *          an index into utf8agetab[].  With this we can filter code
+ *          points based on the unicode version in which they were
+ *          defined.  The CCC of a non-defined code point is 0.
+ * leaf[1]: Canonical Combining Class. During normalization, we need
+ *          to do a stable sort into ascending order of all characters
+ *          with a non-zero CCC that occur between two characters with
+ *          a CCC of 0, or at the begin or end of a string.
+ *          The unicode standard guarantees that all CCC values are
+ *          between 0 and 254 inclusive, which leaves 255 available as
+ *          a special value.
+ *          Code points with CCC 0 are known as stoppers.
+ * leaf[2]: Decomposition. If leaf[1] == 255, then leaf[2] is the
+ *          start of a NUL-terminated string that is the decomposition
+ *          of the character.
+ *          The CCC of a decomposable character is the same as the CCC
+ *          of the first character of its decomposition.
+ *          Some characters decompose as the empty string: these are
+ *          characters with the Default_Ignorable_Code_Point property.
+ *          These do affect normalization, as they all have CCC 0.
+ *
+ * The decompositions in the trie have been fully expanded.
+ *
+ * Casefolding, if applicable, is also done using decompositions.
+ *
+ * The trie is constructed in such a way that leaves exist for all
+ * UTF-8 sequences that match the criteria from the "UTF-8 valid
+ * ranges" comment above, and only for those sequences.  Therefore a
+ * lookup in the trie can be used to validate the UTF-8 input.
+ */
+typedef const unsigned char utf8leaf_t;
+
+#define LEAF_GEN(LEAF)	((LEAF)[0])
+#define LEAF_CCC(LEAF)	((LEAF)[1])
+#define LEAF_STR(LEAF)	((const char*)((LEAF) + 2))
+
+#define MINCCC		(0)
+#define MAXCCC		(254)
+#define STOPPER		(0)
+#define	DECOMPOSE	(255)
+
+/*
+ * Use trie to scan s, touching at most len bytes.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * A non-NULL return guarantees that the UTF-8 sequence starting at s
+ * is well-formed and corresponds to a known unicode code point.  The
+ * shorthand for this will be "is valid UTF-8 unicode".
+ */
+static utf8leaf_t *
+utf8nlookup(utf8data_t data, const char *s, size_t len)
+{
+	utf8trie_t	*trie = utf8data + data->offset;
+	int		offlen;
+	int		offset;
+	int		mask;
+	int		node;
+
+	if (!data)
+		return NULL;
+	if (len == 0)
+		return NULL;
+	node = 1;
+	while (node) {
+		offlen = (*trie & OFFLEN) >> OFFLEN_SHIFT;
+		if (*trie & NEXTBYTE) {
+			if (--len == 0)
+				return NULL;
+			s++;
+		}
+		mask = 1 << (*trie & BITNUM);
+		if (*s & mask) {
+			/* Right leg */
+			if (offlen) {
+				/* Right node at offset of trie */
+				node = (*trie & RIGHTNODE);
+				offset = trie[offlen];
+				while (--offlen) {
+					offset <<= 8;
+					offset |= trie[offlen];
+				}
+				trie += offset;
+			} else if (*trie & RIGHTPATH) {
+				/* Right node after this node */
+				node = (*trie & TRIENODE);
+				trie++;
+			} else {
+				/* No right node. */
+				node = 0;
+				trie = NULL;
+			}
+		} else {
+			/* Left leg */
+			if (offlen) {
+				/* Left node after this node. */
+				node = (*trie & LEFTNODE);
+				trie += offlen + 1;
+			} else if (*trie & RIGHTPATH) {
+				/* No left node. */
+				node = 0;
+				trie = NULL;
+			} else {
+				/* Left node after this node */
+				node = (*trie & TRIENODE);
+				trie++;
+			}
+		}
+	}
+	return trie;
+}
+
+/*
+ * Use trie to scan s.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * Forwards to utf8nlookup().
+ */
+static utf8leaf_t *
+utf8lookup(utf8data_t data, const char *s)
+{
+	return utf8nlookup(data, s, (size_t)-1);
+}
+
+/*
+ * Maximum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if only non-assigned code points are used.
+ */
+int
+utf8agemax(utf8data_t data, const char *s)
+{
+	utf8leaf_t	*leaf;
+	int		age = 0;
+	int		leaf_age;
+
+	if (!data)
+		return -1;
+	while (*s) {
+		if (!(leaf = utf8lookup(data, s)))
+			return -1;
+		leaf_age = utf8agetab[LEAF_GEN(leaf)];
+		if (leaf_age <= data->maxage && leaf_age > age)
+			age = leaf_age;
+		s += utf8clen(s);
+	}
+	return age;
+}
+EXPORT_SYMBOL(utf8agemax);
+
+/*
+ * Minimum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if non-assigned code points are used.
+ */
+int
+utf8agemin(utf8data_t data, const char *s)
+{
+	utf8leaf_t	*leaf;
+	int		age;
+	int		leaf_age;
+
+	if (!data)
+		return -1;
+	age = data->maxage;
+	while (*s) {
+		if (!(leaf = utf8lookup(data, s)))
+			return -1;
+		leaf_age = utf8agetab[LEAF_GEN(leaf)];
+		if (leaf_age <= data->maxage && leaf_age < age)
+			age = leaf_age;
+		s += utf8clen(s);
+	}
+	return age;
+}
+EXPORT_SYMBOL(utf8agemin);
+
+/*
+ * Maximum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemax(utf8data_t data, const char *s, size_t len)
+{
+	utf8leaf_t	*leaf;
+	int		age = 0;
+	int		leaf_age;
+
+	if (!data)
+		return -1;
+        while (len && *s) {
+		if (!(leaf = utf8nlookup(data, s, len)))
+			return -1;
+		leaf_age = utf8agetab[LEAF_GEN(leaf)];
+		if (leaf_age <= data->maxage && leaf_age > age)
+			age = leaf_age;
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return age;
+}
+EXPORT_SYMBOL(utf8nagemax);
+
+/*
+ * Maximum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemin(utf8data_t data, const char *s, size_t len)
+{
+	utf8leaf_t	*leaf;
+	int		leaf_age;
+	int		age;
+
+	if (!data)
+		return -1;
+	age = data->maxage;
+	while (len && *s) {
+		if (!(leaf = utf8nlookup(data, s, len)))
+			return -1;
+		leaf_age = utf8agetab[LEAF_GEN(leaf)];
+		if (leaf_age <= data->maxage && leaf_age < age)
+			age = leaf_age;
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return age;
+}
+EXPORT_SYMBOL(utf8nagemin);
+
+/*
+ * Length of the normalization of s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ *
+ * A string of Default_Ignorable_Code_Point has length 0.
+ */
+ssize_t
+utf8len(utf8data_t data, const char *s)
+{
+	utf8leaf_t	*leaf;
+	size_t		ret = 0;
+
+	if (!data)
+		return -1;
+	while (*s) {
+		if (!(leaf = utf8lookup(data, s)))
+			return -1;
+		if (utf8agetab[LEAF_GEN(leaf)] > data->maxage)
+			ret += utf8clen(s);
+		else if (LEAF_CCC(leaf) == DECOMPOSE)
+			ret += strlen(LEAF_STR(leaf));
+		else
+			ret += utf8clen(s);
+		s += utf8clen(s);
+	}
+	return ret;
+}
+EXPORT_SYMBOL(utf8len);
+
+/*
+ * Length of the normalization of s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+ssize_t
+utf8nlen(utf8data_t data, const char *s, size_t len)
+{
+	utf8leaf_t	*leaf;
+	size_t		ret = 0;
+
+	if (!data)
+		return -1;
+	while (len && *s) {
+		if (!(leaf = utf8nlookup(data, s, len)))
+			return -1;
+		if (utf8agetab[LEAF_GEN(leaf)] > data->maxage)
+			ret += utf8clen(s);
+		else if (LEAF_CCC(leaf) == DECOMPOSE)
+			ret += strlen(LEAF_STR(leaf));
+		else
+			ret += utf8clen(s);
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return ret;
+}
+EXPORT_SYMBOL(utf8nlen);
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ *   u8c    : pointer to cursor.
+ *   data   : utf8data_t to use for normalization.
+ *   s      : string.
+ *   len    : length of s.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int
+utf8ncursor(
+	struct utf8cursor *u8c,
+	utf8data_t	data,
+	const char	*s,
+	size_t		len)
+{
+	if (!data)
+		return -1;
+	if (!s)
+		return -1;
+	u8c->data = data;
+	u8c->s = s;
+	u8c->p = NULL;
+	u8c->ss = NULL;
+	u8c->sp = NULL;
+	u8c->len = len;
+	u8c->slen = 0;
+	u8c->ccc = STOPPER;
+	u8c->nccc = STOPPER;
+	/* Check we didn't clobber the maximum length. */
+	if (u8c->len != len)
+		return -1;
+	/* The first byte of s may not be an utf8 continuation. */
+	if (len > 0 && (*s & 0xC0) == 0x80)
+		return -1;
+	return 0;
+}
+EXPORT_SYMBOL(utf8ncursor);
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ *   u8c    : pointer to cursor.
+ *   data   : utf8data_t to use for normalization.
+ *   s      : NUL-terminated string.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int
+utf8cursor(
+	struct utf8cursor *u8c,
+	utf8data_t	data,
+	const char	*s)
+{
+	return utf8ncursor(u8c, data, s, (unsigned int)-1);
+}
+EXPORT_SYMBOL(utf8cursor);
+
+/*
+ * Get one byte from the normalized form of the string described by u8c.
+ *
+ * Returns the byte cast to an unsigned char on succes, and -1 on failure.
+ *
+ * The cursor keeps track of the location in the string in u8c->s.
+ * When a character is decomposed, the current location is stored in
+ * u8c->p, and u8c->s is set to the start of the decomposition. Note
+ * that bytes from a decomposition do not count against u8c->len.
+ *
+ * Characters are emitted if they match the current CCC in u8c->ccc.
+ * Hitting end-of-string while u8c->ccc == STOPPER means we're done,
+ * and the function returns 0 in that case.
+ *
+ * Sorting by CCC is done by repeatedly scanning the string.  The
+ * values of u8c->s and u8c->p are stored in u8c->ss and u8c->sp at
+ * the start of the scan.  The first pass finds the lowest CCC to be
+ * emitted and stores it in u8c->nccc, the second pass emits the
+ * characters with this CCC and finds the next lowest CCC. This limits
+ * the number of passes to 1 + the number of different CCCs in the
+ * sequence being scanned.
+ *
+ * Therefore:
+ *  u8c->p  != NULL -> a decomposition is being scanned.
+ *  u8c->ss != NULL -> this is a repeating scan.
+ *  u8c->ccc == -1   -> this is the first scan of a repeating scan.
+ */
+int
+utf8byte(struct utf8cursor *u8c)
+{
+	utf8leaf_t *leaf;
+	int ccc;
+
+	for (;;) {
+		/* Check for the end of a decomposed character. */
+		if (u8c->p && *u8c->s == '\0') {
+			u8c->s = u8c->p;
+			u8c->p = NULL;
+		}
+
+		/* Check for end-of-string. */
+		if (!u8c->p && (u8c->len == 0 || *u8c->s == '\0')) {
+			/* There is no next byte. */
+			if (u8c->ccc == STOPPER)
+				return 0;
+			/* End-of-string during a scan counts as a stopper. */
+			ccc = STOPPER;
+			goto ccc_mismatch;
+		} else if ((*u8c->s & 0xC0) == 0x80) {
+			/* This is a continuation of the current character. */
+			if (!u8c->p)
+				u8c->len--;
+			return (unsigned char)*u8c->s++;
+		}
+
+		/* Look up the data for the current character. */
+		if (u8c->p)
+			leaf = utf8lookup(u8c->data, u8c->s);
+		else
+			leaf = utf8nlookup(u8c->data, u8c->s, u8c->len);
+
+		/* No leaf found implies that the input is a binary blob. */
+		if (!leaf)
+			return -1;
+
+		/* Characters that are too new have CCC 0. */
+		if (utf8agetab[LEAF_GEN(leaf)] > u8c->data->maxage) {
+			ccc = STOPPER;
+		} else if ((ccc = LEAF_CCC(leaf)) == DECOMPOSE) {
+			u8c->len -= utf8clen(u8c->s);
+			u8c->p = u8c->s + utf8clen(u8c->s);
+			u8c->s = LEAF_STR(leaf);
+			/* Empty decomposition implies CCC 0. */
+			if (*u8c->s == '\0') {
+				if (u8c->ccc == STOPPER)
+					continue;
+				ccc = STOPPER;
+				goto ccc_mismatch;
+			}
+			leaf = utf8lookup(u8c->data, u8c->s);
+			ccc = LEAF_CCC(leaf);
+		}
+
+		/*
+		 * If this is not a stopper, then see if it updates
+		 * the next canonical class to be emitted.
+		 */
+		if (ccc != STOPPER && u8c->ccc < ccc && ccc < u8c->nccc)
+			u8c->nccc = ccc;
+
+		/*
+		 * Return the current byte if this is the current
+		 * combining class.
+		 */
+		if (ccc == u8c->ccc) {
+			if (!u8c->p)
+				u8c->len--;
+			return (unsigned char)*u8c->s++;
+		}
+
+		/* Current combining class mismatch. */
+	ccc_mismatch:
+		if (u8c->nccc == STOPPER) {
+			/*
+			 * Scan forward for the first canonical class
+			 * to be emitted.  Save the position from
+			 * which to restart.
+			 */
+			u8c->ccc = MINCCC - 1;
+			u8c->nccc = ccc;
+			u8c->sp = u8c->p;
+			u8c->ss = u8c->s;
+			u8c->slen = u8c->len;
+			if (!u8c->p)
+				u8c->len -= utf8clen(u8c->s);
+			u8c->s += utf8clen(u8c->s);
+		} else if (ccc != STOPPER) {
+			/* Not a stopper, and not the ccc we're emitting. */
+			if (!u8c->p)
+				u8c->len -= utf8clen(u8c->s);
+			u8c->s += utf8clen(u8c->s);
+		} else if (u8c->nccc != MAXCCC + 1) {
+			/* At a stopper, restart for next ccc. */
+			u8c->ccc = u8c->nccc;
+			u8c->nccc = MAXCCC + 1;
+			u8c->s = u8c->ss;
+			u8c->p = u8c->sp;
+			u8c->len = u8c->slen;
+		} else {
+			/* All done, proceed from here. */
+			u8c->ccc = STOPPER;
+			u8c->nccc = STOPPER;
+			u8c->sp = NULL;
+			u8c->ss = NULL;
+			u8c->slen = 0;
+		}
+	}
+}
+EXPORT_SYMBOL(utf8byte);
+
+const struct utf8data *
+utf8nfkdi(unsigned int maxage)
+{
+	int i = sizeof(utf8nfkdidata)/sizeof(utf8nfkdidata[0]) - 1;
+
+	while (maxage < utf8nfkdidata[i].maxage)
+		i--;
+	if (maxage > utf8nfkdidata[i].maxage)
+		return NULL;
+	return &utf8nfkdidata[i];
+}
+EXPORT_SYMBOL(utf8nfkdi);
+
+const struct utf8data *
+utf8nfkdicf(unsigned int maxage)
+{
+	int i = sizeof(utf8nfkdicfdata)/sizeof(utf8nfkdicfdata[0]) - 1;
+
+	while (maxage < utf8nfkdicfdata[i].maxage)
+		i--;
+	if (maxage > utf8nfkdicfdata[i].maxage)
+		return NULL;
+	return &utf8nfkdicfdata[i];
+}
+EXPORT_SYMBOL(utf8nfkdicf);
+
+MODULE_AUTHOR("SGI");
+MODULE_DESCRIPTION("utf8 normalization");
+MODULE_LICENSE("GPL");
diff --git a/fs/xfs/utf8norm/utf8norm.h b/fs/xfs/utf8norm/utf8norm.h
new file mode 100644
index 0000000..44a9e53
--- /dev/null
+++ b/fs/xfs/utf8norm/utf8norm.h
@@ -0,0 +1,116 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#ifndef UTF8NORM_H
+#define UTF8NORM_H
+
+#include <linux/types.h>
+#include <linux/export.h>
+#include <linux/string.h>
+#include <linux/module.h>
+
+/* An opaque type used to determine the normalization in use. */
+typedef const struct utf8data *utf8data_t;
+
+/* Encoding a unicode version number as a single unsigned int. */
+#define UNICODE_MAJ_SHIFT		(16)
+#define UNICODE_MIN_SHIFT		(8)
+
+#define UNICODE_AGE(MAJ,MIN,REV)			\
+	(((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) |	\
+	 ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) |	\
+	 ((unsigned int)(REV)))
+
+/* Highest unicode version supported by the data tables. */
+extern const unsigned int utf8version(void);
+
+/*
+ * Look for the correct utf8data_t for a unicode version.
+ * Returns NULL if the version requested is too new.
+ *
+ * Two normalization forms are supported: nfkdi and nfkdicf.
+ *
+ * nfkdi:
+ *  - Apply unicode normalization form NFKD.
+ *  - Remove any Default_Ignorable_Code_Point.
+ *
+ * nfkdicf:
+ *  - Apply unicode normalization form NFKD.
+ *  - Remove any Default_Ignorable_Code_Point.
+ *  - Apply a full casefold (C + F).
+ */
+extern utf8data_t utf8nfkdi(unsigned int);
+extern utf8data_t utf8nfkdicf(unsigned int);
+
+/*
+ * Determine the maximum age of any unicode character in the string.
+ * Returns 0 if only unassigned code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern int utf8agemax(utf8data_t, const char *);
+extern int utf8nagemax(utf8data_t, const char *, size_t);
+
+/*
+ * Determine the minimum age of any unicode character in the string.
+ * Returns 0 if any unassigned code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern int utf8agemin(utf8data_t, const char *);
+extern int utf8nagemin(utf8data_t, const char *, size_t);
+
+/*
+ * Determine the length of the normalized from of the string,
+ * excluding any terminating NULL byte.
+ * Returns 0 if only ignorable code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern ssize_t utf8len(utf8data_t, const char *);
+extern ssize_t utf8nlen(utf8data_t, const char *, size_t);
+
+/*
+ * Cursor structure used by the normalizer.
+ */
+struct utf8cursor {
+	utf8data_t	data;
+	const char	*s;
+	const char	*p;
+	const char	*ss;
+	const char	*sp;
+	unsigned int	len;
+	unsigned int	slen;
+	short int	ccc;
+	short int	nccc;
+};
+
+/*
+ * Initialize a utf8cursor to normalize a string.
+ * Returns 0 on success.
+ * Returns -1 on failure.
+ */
+extern int utf8cursor(struct utf8cursor *, utf8data_t, const char *);
+extern int utf8ncursor(struct utf8cursor *, utf8data_t, const char *, size_t);
+
+/*
+ * Get the next byte in the normalization.
+ * Returns a value > 0 && < 256 on success.
+ * Returns 0 when the end of the normalization is reached.
+ * Returns -1 if the string being normalized is not valid UTF-8.
+ */
+extern int utf8byte(struct utf8cursor *);
+
+#endif /* UTF8NORM_H */
-- 
1.7.12.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 07a/13] xfsprogs: add trie generator for UTF-8.
  2014-09-18 20:31 ` [PATCH 00/13] xfsprogs: Unicode/UTF-8 support for XFS Ben Myers
                     ` (12 preceding siblings ...)
  2014-09-18 20:43   ` [PATCH 13/13] xfsprogs: add a preliminary test for utf8 support Ben Myers
@ 2014-09-19 16:06   ` Ben Myers
  2014-09-23 18:34     ` Roger Willcocks
  2014-09-19 16:07   ` [PATCH 07b/13] libxfs: add supporting code " Ben Myers
  14 siblings, 1 reply; 84+ messages in thread
From: Ben Myers @ 2014-09-19 16:06 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: tinguely, olaf, xfs

From: Olaf Weber <olaf@sgi.com>

mkutf8data.c is the source for a program that generates utf8data.h,
which contains the trie that utf8norm.c uses. The trie is generated from
the Unicode 7.0.0 data files. The format of the utf8data[] table is
described in utf8norm.c, which is added in the next patch.

Signed-off-by: Olaf Weber <olaf@sgi.com>
---
 Makefile             |    2 +-
 support/Makefile     |   24 +
 support/mkutf8data.c | 3232 ++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 3257 insertions(+), 1 deletion(-)
 create mode 100644 support/Makefile
 create mode 100644 support/mkutf8data.c

diff --git a/Makefile b/Makefile
index f56aebd..c442da6 100644
--- a/Makefile
+++ b/Makefile
@@ -40,7 +40,7 @@ LDIRDIRT = $(SRCDIR)
 LDIRT += $(SRCTAR)
 endif
 
-LIB_SUBDIRS = libxfs libxlog libxcmd libhandle libdisk
+LIB_SUBDIRS = support libxfs libxlog libxcmd libhandle libdisk
 TOOL_SUBDIRS = copy db estimate fsck fsr growfs io logprint mkfs quota \
 		mdrestore repair rtcp m4 man doc po debian
 
diff --git a/support/Makefile b/support/Makefile
new file mode 100644
index 0000000..cade5fe
--- /dev/null
+++ b/support/Makefile
@@ -0,0 +1,24 @@
+#
+# Copyright (c) 2014 SGI. All Rights Reserved.
+#
+
+TOPDIR = ..
+include $(TOPDIR)/include/builddefs
+
+default = ../include/utf8data.h
+
+../include/utf8data.h:	mkutf8data.c
+	cc -o mkutf8data mkutf8data.c
+	cd ucd-7.0.0 ; ../mkutf8data
+	mv ucd-7.0.0/utf8data.h ../include
+
+default clean:
+	rm -f mkutf8data ../include/utf8data.h
+
+default install:
+
+default install-dev:
+
+default install-qa:
+
+-include .ltdep
diff --git a/support/mkutf8data.c b/support/mkutf8data.c
new file mode 100644
index 0000000..e5c3507
--- /dev/null
+++ b/support/mkutf8data.c
@@ -0,0 +1,3232 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+/* Generator for a compact trie for unicode normalization */
+
+#include <sys/types.h>
+#include <stddef.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <assert.h>
+#include <string.h>
+#include <unistd.h>
+#include <errno.h>
+
+/* Default names of the in- and output files. */
+
+#define AGE_NAME	"DerivedAge.txt"
+#define CCC_NAME	"DerivedCombiningClass.txt"
+#define PROP_NAME	"DerivedCoreProperties.txt"
+#define DATA_NAME	"UnicodeData.txt"
+#define FOLD_NAME	"CaseFolding.txt"
+#define NORM_NAME	"NormalizationCorrections.txt"
+#define TEST_NAME	"NormalizationTest.txt"
+#define UTF8_NAME	"utf8data.h"
+
+const char	*age_name  = AGE_NAME;
+const char	*ccc_name  = CCC_NAME;
+const char	*prop_name = PROP_NAME;
+const char	*data_name = DATA_NAME;
+const char	*fold_name = FOLD_NAME;
+const char	*norm_name = NORM_NAME;
+const char	*test_name = TEST_NAME;
+const char	*utf8_name = UTF8_NAME;
+
+int verbose = 0;
+
+/* An arbitrary line size limit on input lines. */
+
+#define LINESIZE	1024
+char line[LINESIZE];
+char buf0[LINESIZE];
+char buf1[LINESIZE];
+char buf2[LINESIZE];
+char buf3[LINESIZE];
+
+const char *argv0;
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * Unicode version numbers consist of three parts: major, minor, and a
+ * revision.  These numbers are packed into an unsigned int to obtain
+ * a single version number.
+ *
+ * To save space in the generated trie, the unicode version is not
+ * stored directly, instead we calculate a generation number from the
+ * unicode versions seen in the DerivedAge file, and use that as an
+ * index into a table of unicode versions.
+ */
+#define UNICODE_MAJ_SHIFT		(16)
+#define UNICODE_MIN_SHIFT		(8)
+
+#define UNICODE_MAJ_MAX			((unsigned short)-1)
+#define UNICODE_MIN_MAX			((unsigned char)-1)
+#define UNICODE_REV_MAX			((unsigned char)-1)
+
+#define UNICODE_AGE(MAJ,MIN,REV)			\
+	(((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) |	\
+	 ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) |	\
+	 ((unsigned int)(REV)))
+
+unsigned int *ages;
+int ages_count;
+
+unsigned int unicode_maxage;
+
+static int
+age_valid(unsigned int major, unsigned int minor, unsigned int revision)
+{
+	if (major > UNICODE_MAJ_MAX)
+		return 0;
+	if (minor > UNICODE_MIN_MAX)
+		return 0;
+	if (revision > UNICODE_REV_MAX)
+		return 0;
+	return 1;
+}
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * utf8trie_t
+ *
+ * A compact binary tree, used to decode UTF-8 characters.
+ *
+ * Internal nodes are one byte for the node itself, and up to three
+ * bytes for an offset into the tree.  The first byte contains the
+ * following information:
+ *  NEXTBYTE  - flag        - advance to next byte if set
+ *  BITNUM    - 3 bit field - the bit number to tested
+ *  OFFLEN    - 2 bit field - number of bytes in the offset
+ * if offlen == 0 (non-branching node)
+ *  RIGHTPATH - 1 bit field - set if the following node is for the
+ *                            right-hand path (tested bit is set)
+ *  TRIENODE  - 1 bit field - set if the following node is an internal
+ *                            node, otherwise it is a leaf node
+ * if offlen != 0 (branching node)
+ *  LEFTNODE  - 1 bit field - set if the left-hand node is internal
+ *  RIGHTNODE - 1 bit field - set if the right-hand node is internal
+ *
+ * Due to the way utf8 works, there cannot be branching nodes with
+ * NEXTBYTE set, and moreover those nodes always have a righthand
+ * descendant.
+ */
+typedef unsigned char utf8trie_t;
+#define BITNUM		0x07
+#define NEXTBYTE	0x08
+#define OFFLEN		0x30
+#define OFFLEN_SHIFT	4
+#define RIGHTPATH	0x40
+#define TRIENODE	0x80
+#define RIGHTNODE	0x40
+#define LEFTNODE	0x80
+
+/*
+ * utf8leaf_t
+ *
+ * The leaves of the trie are embedded in the trie, and so the same
+ * underlying datatype, unsigned char.
+ *
+ * leaf[0]: The unicode version, stored as a generation number that is
+ *          an index into utf8agetab[].  With this we can filter code
+ *          points based on the unicode version in which they were
+ *          defined.  The CCC of a non-defined code point is 0.
+ * leaf[1]: Canonical Combining Class. During normalization, we need
+ *          to do a stable sort into ascending order of all characters
+ *          with a non-zero CCC that occur between two characters with
+ *          a CCC of 0, or at the begin or end of a string.
+ *          The unicode standard guarantees that all CCC values are
+ *          between 0 and 254 inclusive, which leaves 255 available as
+ *          a special value.
+ *          Code points with CCC 0 are known as stoppers.
+ * leaf[2]: Decomposition. If leaf[1] == 255, then leaf[2] is the
+ *          start of a NUL-terminated string that is the decomposition
+ *          of the character.
+ *          The CCC of a decomposable character is the same as the CCC
+ *          of the first character of its decomposition.
+ *          Some characters decompose as the empty string: these are
+ *          characters with the Default_Ignorable_Code_Point property.
+ *          These do affect normalization, as they all have CCC 0.
+ *
+ * The decompositions in the trie have been fully expanded.
+ *
+ * Casefolding, if applicable, is also done using decompositions.
+ */
+typedef unsigned char utf8leaf_t;
+
+#define LEAF_GEN(LEAF)	((LEAF)[0])
+#define LEAF_CCC(LEAF)	((LEAF)[1])
+#define LEAF_STR(LEAF)	((const char*)((LEAF) + 2))
+
+#define MAXGEN		(255)
+
+#define MINCCC		(0)
+#define MAXCCC		(254)
+#define STOPPER		(0)
+#define	DECOMPOSE	(255)
+
+struct tree;
+static utf8leaf_t *utf8nlookup(struct tree *, const char *, size_t);
+static utf8leaf_t *utf8lookup(struct tree *, const char *);
+
+unsigned char *utf8data;
+size_t utf8data_size;
+
+utf8trie_t *nfkdi;
+utf8trie_t *nfkdicf;
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * UTF8 valid ranges.
+ *
+ * The UTF-8 encoding spreads the bits of a 32bit word over several
+ * bytes. This table gives the ranges that can be held and how they'd
+ * be represented.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000000 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000000 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * There is an additional requirement on UTF-8, in that only the
+ * shortest representation of a 32bit value is to be used.  A decoder
+ * must not decode sequences that do not satisfy this requirement.
+ * Thus the allowed ranges have a lower bound.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000080 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000800 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00010000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00200000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x04000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * Actual unicode characters are limited to the range 0x0 - 0x10FFFF,
+ * 17 planes of 65536 values.  This limits the sequences actually seen
+ * even more, to just the following.
+ *
+ *          0 -     0x7f: 0                     0x7f
+ *       0x80 -    0x7ff: 0xc2 0x80             0xdf 0xbf
+ *      0x800 -   0xffff: 0xe0 0xa0 0x80        0xef 0xbf 0xbf
+ *    0x10000 - 0x10ffff: 0xf0 0x90 0x80 0x80   0xf4 0x8f 0xbf 0xbf
+ *
+ * Even within those ranges not all values are allowed: the surrogates
+ * 0xd800 - 0xdfff should never be seen.
+ *
+ * Note that the longest sequence seen with valid usage is 4 bytes,
+ * the same a single UTF-32 character.  This makes the UTF-8
+ * representation of Unicode strictly smaller than UTF-32.
+ *
+ * The shortest sequence requirement was introduced by:
+ *    Corrigendum #1: UTF-8 Shortest Form
+ * It can be found here:
+ *    http://www.unicode.org/versions/corrigendum1.html
+ *
+ */
+
+#define UTF8_2_BITS     0xC0
+#define UTF8_3_BITS     0xE0
+#define UTF8_4_BITS     0xF0
+#define UTF8_N_BITS     0x80
+#define UTF8_2_MASK     0xE0
+#define UTF8_3_MASK     0xF0
+#define UTF8_4_MASK     0xF8
+#define UTF8_N_MASK     0xC0
+#define UTF8_V_MASK     0x3F
+#define UTF8_V_SHIFT    6
+
+static int
+utf8key(unsigned int key, char keyval[])
+{
+	int keylen;
+
+	if (key < 0x80) {
+		keyval[0] = key;
+		keylen = 1;
+	} else if (key < 0x800) {
+		keyval[1] = key & UTF8_V_MASK;
+		keyval[1] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[0] = key;
+		keyval[0] |= UTF8_2_BITS;
+		keylen = 2;
+	} else if (key < 0x10000) {
+		keyval[2] = key & UTF8_V_MASK;
+		keyval[2] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[1] = key & UTF8_V_MASK;
+		keyval[1] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[0] = key;
+		keyval[0] |= UTF8_3_BITS;
+		keylen = 3;
+	} else if (key < 0x110000) {
+		keyval[3] = key & UTF8_V_MASK;
+		keyval[3] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[2] = key & UTF8_V_MASK;
+		keyval[2] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[1] = key & UTF8_V_MASK;
+		keyval[1] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[0] = key;
+		keyval[0] |= UTF8_4_BITS;
+		keylen = 4;
+	} else {
+		printf("%#x: illegal key\n", key);
+		keylen = 0;
+	}
+	return keylen;
+}
+
+static unsigned int
+utf8code(const char *str)
+{
+	const unsigned char *s = (const unsigned char*)str;
+	unsigned int unichar = 0;
+
+	if (*s < 0x80) {
+		unichar = *s;
+	} else if (*s < UTF8_3_BITS) {
+		unichar = *s++ & 0x1F;
+		unichar <<= UTF8_V_SHIFT;
+		unichar |= *s & 0x3F;
+	} else if (*s < UTF8_4_BITS) {
+		unichar = *s++ & 0x0F;
+		unichar <<= UTF8_V_SHIFT;
+		unichar |= *s++ & 0x3F;
+		unichar <<= UTF8_V_SHIFT;
+		unichar |= *s & 0x3F;
+	} else {
+		unichar = *s++ & 0x0F;
+		unichar <<= UTF8_V_SHIFT;
+		unichar |= *s++ & 0x3F;
+		unichar <<= UTF8_V_SHIFT;
+		unichar |= *s++ & 0x3F;
+		unichar <<= UTF8_V_SHIFT;
+		unichar |= *s & 0x3F;
+	}
+	return unichar;
+}
+
+static int
+utf32valid(unsigned int unichar)
+{
+	return unichar < 0x110000;
+}
+
+#define NODE 1
+#define LEAF 0
+
+struct tree {
+	void *root;
+	int childnode;
+	const char *type;
+	unsigned int maxage;
+	struct tree *next;
+	int (*leaf_equal)(void *, void *);
+	void (*leaf_print)(void *, int);
+	int (*leaf_mark)(void *);
+	int (*leaf_size)(void *);
+	int *(*leaf_index)(struct tree *, void *);
+	unsigned char *(*leaf_emit)(void *, unsigned char *);
+	int leafindex[0x110000];
+	int index;
+};
+
+struct node {
+	int index;
+	int offset;
+	int mark;
+	int size;
+	struct node *parent;
+	void *left;
+	void *right;
+	unsigned char bitnum;
+	unsigned char nextbyte;
+	unsigned char leftnode;
+	unsigned char rightnode;
+	unsigned int keybits;
+	unsigned int keymask;
+};
+
+/*
+ * Example lookup function for a tree.
+ */
+static void *
+lookup(struct tree *tree, const char *key)
+{
+	struct node *node;
+	void *leaf = NULL;
+
+	node = tree->root;
+	while (!leaf && node) {
+		if (node->nextbyte)
+			key++;
+		if (*key & (1 << (node->bitnum & 7))) {
+			/* Right leg */
+			if (node->rightnode == NODE) {
+				node = node->right;
+			} else if (node->rightnode == LEAF) {
+				leaf = node->right;
+			} else {
+				node = NULL;
+			}
+		} else {
+			/* Left leg */
+			if (node->leftnode == NODE) {
+				node = node->left;
+			} else if (node->leftnode == LEAF) {
+				leaf = node->left;
+			} else {
+				node = NULL;
+			}
+		}
+	}
+
+	return leaf;
+}
+
+/*
+ * A simple non-recursive tree walker: keep track of visits to the
+ * left and right branches in the leftmask and rightmask.
+ */
+static void
+tree_walk(struct tree *tree)
+{
+	struct node *node;
+	unsigned int leftmask;
+	unsigned int rightmask;
+	unsigned int bitmask;
+	int indent = 1;
+	int nodes, singletons, leaves;
+
+	nodes = singletons = leaves = 0;
+
+	printf("%s_%x root %p\n", tree->type, tree->maxage, tree->root);
+	if (tree->childnode == LEAF) {
+		assert(tree->root);
+		tree->leaf_print(tree->root, indent);
+		leaves = 1;
+	} else {
+		assert(tree->childnode == NODE);
+		node = tree->root;
+		leftmask = rightmask = 0;
+		while (node) {
+			printf("%*snode @ %p bitnum %d nextbyte %d"
+			       " left %p right %p mask %x bits %x\n",
+				indent, "", node,
+				node->bitnum, node->nextbyte,
+				node->left, node->right,
+				node->keymask, node->keybits);
+			nodes += 1;
+			if (!(node->left && node->right))
+				singletons += 1;
+
+			while (node) {
+				bitmask = 1 << node->bitnum;
+				if ((leftmask & bitmask) == 0) {
+					leftmask |= bitmask;
+					if (node->leftnode == LEAF) {
+						assert(node->left);
+						tree->leaf_print(node->left,
+								 indent+1);
+						leaves += 1;
+					} else if (node->left) {
+						assert(node->leftnode == NODE);
+						indent += 1;
+						node = node->left;
+						break;
+					}
+				}
+				if ((rightmask & bitmask) == 0) {
+					rightmask |= bitmask;
+					if (node->rightnode == LEAF) {
+						assert(node->right);
+						tree->leaf_print(node->right,
+								 indent+1);
+						leaves += 1;
+					} else if (node->right) {
+						assert(node->rightnode==NODE);
+						indent += 1;
+						node = node->right;
+						break;
+					}
+				}
+				leftmask &= ~bitmask;
+				rightmask &= ~bitmask;
+				node = node->parent;
+				indent -= 1;
+			}
+		}
+	}
+	printf("nodes %d leaves %d singletons %d\n",
+	       nodes, leaves, singletons);
+}
+
+/*
+ * Allocate an initialize a new internal node.
+ */
+static struct node *
+alloc_node(struct node *parent)
+{
+	struct node *node;
+	int bitnum;
+
+	node = malloc(sizeof(*node));
+	node->left = node->right = NULL;
+	node->parent = parent;
+	node->leftnode = NODE;
+	node->rightnode = NODE;
+	node->keybits = 0;
+	node->keymask = 0;
+	node->mark = 0;
+	node->index = 0;
+	node->offset = -1;
+	node->size = 4;
+
+	if (node->parent) {
+		bitnum = parent->bitnum;
+		if ((bitnum & 7) == 0) {
+			node->bitnum = bitnum + 7 + 8;
+			node->nextbyte = 1;
+		} else {
+			node->bitnum = bitnum - 1;
+			node->nextbyte = 0;
+		}
+	} else {
+		node->bitnum = 7;
+		node->nextbyte = 0;
+	}
+
+        return node;
+}
+
+/*
+ * Insert a new leaf into the tree, and collapse any subtrees that are
+ * fully populated and end in identical leaves. A nextbyte tagged
+ * internal node will not be removed to preserve the tree's integrity.
+ * Note that due to the structure of utf8, no nextbyte tagged node
+ * will be a candidate for removal.
+ */
+static int
+insert(struct tree *tree, char *key, int keylen, void *leaf)
+{
+	struct node *node;
+	struct node *parent;
+	void **cursor;
+	int keybits;
+
+	assert(keylen >= 1 && keylen <= 4);
+
+	node = NULL;
+	cursor = &tree->root;
+	keybits = 8 * keylen;
+
+	/* Insert, creating path along the way. */
+	while (keybits) {
+		if (!*cursor)
+			*cursor = alloc_node(node);
+		node = *cursor;
+		if (node->nextbyte)
+			key++;
+		if (*key & (1 << (node->bitnum & 7)))
+			cursor = &node->right;
+		else
+			cursor = &node->left;
+		keybits--;
+	}
+	*cursor = leaf;
+
+	/* Merge subtrees if possible. */
+	while (node) {
+		if (*key & (1 << (node->bitnum & 7)))
+			node->rightnode = LEAF;
+		else
+			node->leftnode = LEAF;
+		if (node->nextbyte)
+			break;
+		if (node->leftnode == NODE || node->rightnode == NODE)
+			break;
+		assert(node->left);
+		assert(node->right);
+		/* Compare */
+		if (! tree->leaf_equal(node->left, node->right))
+			break;
+		/* Keep left, drop right leaf. */
+		leaf = node->left;
+		/* Check in parent */
+		parent = node->parent;
+		if (!parent) {
+			/* root of tree! */
+			tree->root = leaf;
+			tree->childnode = LEAF;
+		} else if (parent->left == node) {
+			parent->left = leaf;
+			parent->leftnode = LEAF;
+			if (parent->right) {
+				parent->keymask = 0;
+				parent->keybits = 0;
+			} else {
+				parent->keymask |= (1 << node->bitnum);
+			}
+		} else if (parent->right == node) {
+			parent->right = leaf;
+			parent->rightnode = LEAF;
+			if (parent->left) {
+				parent->keymask = 0;
+				parent->keybits = 0;
+			} else {
+				parent->keymask |= (1 << node->bitnum);
+				parent->keybits |= (1 << node->bitnum);
+			}
+		} else {
+			/* internal tree error */
+			assert(0);
+		}
+		free(node);
+		node = parent;
+	}
+
+	/* Propagate keymasks up along singleton chains. */
+	while (node) {
+		parent = node->parent;
+		if (!parent)
+			break;
+		/* Nix the mask for parents with two children. */
+		if (node->keymask == 0) {
+			parent->keymask = 0;
+			parent->keybits = 0;
+		} else if (parent->left && parent->right) {
+			parent->keymask = 0;
+			parent->keybits = 0;
+		} else {
+			assert((parent->keymask & node->keymask) == 0);
+			parent->keymask |= node->keymask;
+			parent->keymask |= (1 << parent->bitnum);
+			parent->keybits |= node->keybits;
+			if (parent->right)
+				parent->keybits |= (1 << parent->bitnum);
+		}
+		node = parent;
+	}
+
+	return 0;
+}
+
+/*
+ * Prune internal nodes.
+ *
+ * Fully populated subtrees that end at the same leaf have already
+ * been collapsed.  There are still internal nodes that have for both
+ * their left and right branches a sequence of singletons that make
+ * identical choices and end in identical leaves.  The keymask and
+ * keybits collected in the nodes describe the choices made in these
+ * singleton chains.  When they are identical for the left and right
+ * branch of a node, and the two leaves comare identical, the node in
+ * question can be removed.
+ *
+ * Note that nodes with the nextbyte tag set will not be removed by
+ * this to ensure tree integrity.  Note as well that the structure of
+ * utf8 ensures that these nodes would not have been candidates for
+ * removal in any case.
+ */
+static void
+prune(struct tree *tree)
+{
+	struct node *node;
+	struct node *left;
+	struct node *right;
+	struct node *parent;
+	void *leftleaf;
+	void *rightleaf;
+	unsigned int leftmask;
+	unsigned int rightmask;
+	unsigned int bitmask;
+	int count;
+
+	if (verbose > 0)
+		printf("Pruning %s_%x\n", tree->type, tree->maxage);
+
+	count = 0;
+	if (tree->childnode == LEAF)
+		return;
+	if (!tree->root)
+		return;
+
+	leftmask = rightmask = 0;
+	node = tree->root;
+	while (node) {
+		if (node->nextbyte)
+			goto advance;
+		if (node->leftnode == LEAF)
+			goto advance;
+		if (node->rightnode == LEAF)
+			goto advance;
+		if (!node->left)
+			goto advance;
+		if (!node->right)
+			goto advance;
+		left = node->left;
+		right = node->right;
+		if (left->keymask == 0)
+			goto advance;
+		if (right->keymask == 0)
+			goto advance;
+		if (left->keymask != right->keymask)
+			goto advance;
+		if (left->keybits != right->keybits)
+			goto advance;
+		leftleaf = NULL;
+		while (!leftleaf) {
+			assert(left->left || left->right);
+			if (left->leftnode == LEAF)
+				leftleaf = left->left;
+			else if (left->rightnode == LEAF)
+				leftleaf = left->right;
+			else if (left->left)
+				left = left->left;
+			else if (left->right)
+				left = left->right;
+			else
+				assert(0);
+		}
+		rightleaf = NULL;
+		while (!rightleaf) {
+			assert(right->left || right->right);
+			if (right->leftnode == LEAF)
+				rightleaf = right->left;
+			else if (right->rightnode == LEAF)
+				rightleaf = right->right;
+			else if (right->left)
+				right = right->left;
+			else if (right->right)
+				right = right->right;
+			else
+				assert(0);
+		}
+		if (! tree->leaf_equal(leftleaf, rightleaf))
+			goto advance;
+		/*
+		 * This node has identical singleton-only subtrees.
+		 * Remove it.
+		 */
+		parent = node->parent;
+		left = node->left;
+		right = node->right;
+		if (parent->left == node)
+			parent->left = left;
+		else if (parent->right == node)
+			parent->right = left;
+		else
+			assert(0);
+		left->parent = parent;
+		left->keymask |= (1 << node->bitnum);
+		node->left = NULL;
+		while (node) {
+			bitmask = 1 << node->bitnum;
+			leftmask &= ~bitmask;
+			rightmask &= ~bitmask;
+			if (node->leftnode == NODE && node->left) {
+				left = node->left;
+				free(node);
+				count++;
+				node = left;
+			} else if (node->rightnode == NODE && node->right) {
+				right = node->right;
+				free(node);
+				count++;
+				node = right;
+			} else {
+				node = NULL;
+			}
+		}
+		/* Propagate keymasks up along singleton chains. */
+		node = parent;
+		/* Force re-check */
+		bitmask = 1 << node->bitnum;
+		leftmask &= ~bitmask;
+		rightmask &= ~bitmask;
+		for (;;) {
+			if (node->left && node->right)
+				break;
+			if (node->left) {
+				left = node->left;
+				node->keymask |= left->keymask;
+				node->keybits |= left->keybits;
+			}
+			if (node->right) {
+				right = node->right;
+				node->keymask |= right->keymask;
+				node->keybits |= right->keybits;
+			}
+			node->keymask |= (1 << node->bitnum);
+			node = node->parent;
+			/* Force re-check */
+			bitmask = 1 << node->bitnum;
+			leftmask &= ~bitmask;
+			rightmask &= ~bitmask;
+		}
+	advance:
+		bitmask = 1 << node->bitnum;
+		if ((leftmask & bitmask) == 0 &&
+		    node->leftnode == NODE &&
+		    node->left) {
+			leftmask |= bitmask;
+			node = node->left;
+		} else if ((rightmask & bitmask) == 0 &&
+			   node->rightnode == NODE &&
+			   node->right) {
+			rightmask |= bitmask;
+			node = node->right;
+		} else {
+			leftmask &= ~bitmask;
+			rightmask &= ~bitmask;
+			node = node->parent;
+		}
+	}
+	if (verbose > 0)
+		printf("Pruned %d nodes\n", count);
+}
+
+/*
+ * Mark the nodes in the tree that lead to leaves that must be
+ * emitted.
+ */
+static void
+mark_nodes(struct tree *tree)
+{
+	struct node *node;
+	struct node *n;
+	unsigned int leftmask;
+	unsigned int rightmask;
+	unsigned int bitmask;
+	int marked;
+
+	marked = 0;
+	if (verbose > 0)
+		printf("Marking %s_%x\n", tree->type, tree->maxage);
+	if (tree->childnode == LEAF)
+		goto done;
+
+	assert(tree->childnode == NODE);
+	node = tree->root;
+	leftmask = rightmask = 0;
+	while (node) {
+		bitmask = 1 << node->bitnum;
+		if ((leftmask & bitmask) == 0) {
+			leftmask |= bitmask;
+			if (node->leftnode == LEAF) {
+				assert(node->left);
+				if (tree->leaf_mark(node->left)) {
+					n = node;
+					while (n && !n->mark) {
+						marked++;
+						n->mark = 1;
+						n = n->parent;
+					}
+				}
+			} else if (node->left) {
+				assert(node->leftnode == NODE);
+				node = node->left;
+				continue;
+			}
+		}
+		if ((rightmask & bitmask) == 0) {
+			rightmask |= bitmask;
+			if (node->rightnode == LEAF) {
+				assert(node->right);
+				if (tree->leaf_mark(node->right)) {
+					n = node;
+					while (n && !n->mark) {
+						marked++;
+						n->mark = 1;
+						n = n->parent;
+					}
+				}
+			} else if (node->right) {
+				assert(node->rightnode==NODE);
+				node = node->right;
+				continue;
+			}
+		}
+		leftmask &= ~bitmask;
+		rightmask &= ~bitmask;
+		node = node->parent;
+	}
+
+	/* second pass: left siblings and singletons */
+
+	assert(tree->childnode == NODE);
+	node = tree->root;
+	leftmask = rightmask = 0;
+	while (node) {
+		bitmask = 1 << node->bitnum;
+		if ((leftmask & bitmask) == 0) {
+			leftmask |= bitmask;
+			if (node->leftnode == LEAF) {
+				assert(node->left);
+				if (tree->leaf_mark(node->left)) {
+					n = node;
+					while (n && !n->mark) {
+						marked++;
+						n->mark = 1;
+						n = n->parent;
+					}
+				}
+			} else if (node->left) {
+				assert(node->leftnode == NODE);
+				node = node->left;
+				if (!node->mark && node->parent->mark) {
+					marked++;
+					node->mark = 1;
+				}
+				continue;
+			}
+		}
+		if ((rightmask & bitmask) == 0) {
+			rightmask |= bitmask;
+			if (node->rightnode == LEAF) {
+				assert(node->right);
+				if (tree->leaf_mark(node->right)) {
+					n = node;
+					while (n && !n->mark) {
+						marked++;
+						n->mark = 1;
+						n = n->parent;
+					}
+				}
+			} else if (node->right) {
+				assert(node->rightnode==NODE);
+				node = node->right;
+				if (!node->mark && node->parent->mark &&
+				    !node->parent->left) {
+					marked++;
+					node->mark = 1;
+				}
+				continue;
+			}
+		}
+		leftmask &= ~bitmask;
+		rightmask &= ~bitmask;
+		node = node->parent;
+	}
+done:
+	if (verbose > 0)
+		printf("Marked %d nodes\n", marked);
+}
+
+/*
+ * Compute the index of each node and leaf, which is the offset in the
+ * emitted trie.  These value must be pre-computed because relative
+ * offsets between nodes are used to navigate the tree.
+ */
+static int
+index_nodes(struct tree *tree, int index)
+{
+	struct node *node;
+	unsigned int leftmask;
+	unsigned int rightmask;
+	unsigned int bitmask;
+	int count;
+	int indent;
+
+	/* Align to a cache line (or half a cache line?). */
+	while (index % 64)
+		index++;
+	tree->index = index;
+	indent = 1;
+	count = 0;
+
+	if (verbose > 0)
+		printf("Indexing %s_%x: %d", tree->type, tree->maxage, index);
+	if (tree->childnode == LEAF) {
+		index += tree->leaf_size(tree->root);
+		goto done;
+	}
+
+	assert(tree->childnode == NODE);
+	node = tree->root;
+	leftmask = rightmask = 0;
+	while (node) {
+		if (!node->mark)
+			goto skip;
+		count++;
+		if (node->index != index)
+			node->index = index;
+		index += node->size;
+skip:
+		while (node) {
+			bitmask = 1 << node->bitnum;
+			if (node->mark && (leftmask & bitmask) == 0) {
+				leftmask |= bitmask;
+				if (node->leftnode == LEAF) {
+					assert(node->left);
+					*tree->leaf_index(tree, node->left) =
+									index;
+					index += tree->leaf_size(node->left);
+					count++;
+				} else if (node->left) {
+					assert(node->leftnode == NODE);
+					indent += 1;
+					node = node->left;
+					break;
+				}
+			}
+			if (node->mark && (rightmask & bitmask) == 0) {
+				rightmask |= bitmask;
+				if (node->rightnode == LEAF) {
+					assert(node->right);
+					*tree->leaf_index(tree, node->right) = index;
+					index += tree->leaf_size(node->right);
+					count++;
+				} else if (node->right) {
+					assert(node->rightnode==NODE);
+					indent += 1;
+					node = node->right;
+					break;
+				}
+			}
+			leftmask &= ~bitmask;
+			rightmask &= ~bitmask;
+			node = node->parent;
+			indent -= 1;
+		}
+	}
+done:
+	/* Round up to a multiple of 16 */
+	while (index % 16)
+		index++;
+	if (verbose > 0)
+		printf("Final index %d\n", index);
+	return index;
+}
+
+/*
+ * Compute the size of nodes and leaves. We start by assuming that
+ * each node needs to store a three-byte offset. The indexes of the
+ * nodes are calculated based on that, and then this function is
+ * called to see if the sizes of some nodes can be reduced.  This is
+ * repeated until no more changes are seen.
+ */
+static int
+size_nodes(struct tree *tree)
+{
+	struct tree *next;
+	struct node *node;
+	struct node *right;
+	struct node *n;
+	unsigned int leftmask;
+	unsigned int rightmask;
+	unsigned int bitmask;
+	unsigned int pathbits;
+	unsigned int pathmask;
+	int changed;
+	int offset;
+	int size;
+	int indent;
+
+	indent = 1;
+	changed = 0;
+	size = 0;
+
+	if (verbose > 0)
+		printf("Sizing %s_%x", tree->type, tree->maxage);
+	if (tree->childnode == LEAF)
+		goto done;
+
+	assert(tree->childnode == NODE);
+	pathbits = 0;
+	pathmask = 0;
+	node = tree->root;
+	leftmask = rightmask = 0;
+	while (node) {
+		if (!node->mark)
+			goto skip;
+		offset = 0;
+		if (!node->left || !node->right) {
+			size = 1;
+		} else {
+			if (node->rightnode == NODE) {
+				right = node->right;
+				next = tree->next;
+				while (!right->mark) {
+					assert(next);
+					n = next->root;
+					while (n->bitnum != node->bitnum) {
+						if (pathbits & (1<<n->bitnum))
+							n = n->right;
+						else
+							n = n->left;
+					}
+					n = n->right;
+					assert(right->bitnum == n->bitnum);
+					right = n;
+					next = next->next;
+				}
+				offset = right->index - node->index;
+			} else {
+				offset = *tree->leaf_index(tree, node->right);
+				offset -= node->index;
+			}
+			assert(offset >= 0);
+			assert(offset <= 0xffffff);
+			if (offset <= 0xff) {
+				size = 2;
+			} else if (offset <= 0xffff) {
+				size = 3;
+			} else { /* offset <= 0xffffff */
+				size = 4;
+			}
+		}
+		if (node->size != size || node->offset != offset) {
+			node->size = size;
+			node->offset = offset;
+			changed++;
+		}
+skip:
+		while (node) {
+			bitmask = 1 << node->bitnum;
+			pathmask |= bitmask;
+			if (node->mark && (leftmask & bitmask) == 0) {
+				leftmask |= bitmask;
+				if (node->leftnode == LEAF) {
+					assert(node->left);
+				} else if (node->left) {
+					assert(node->leftnode == NODE);
+					indent += 1;
+					node = node->left;
+					break;
+				}
+			}
+			if (node->mark && (rightmask & bitmask) == 0) {
+				rightmask |= bitmask;
+				pathbits |= bitmask;
+				if (node->rightnode == LEAF) {
+					assert(node->right);
+				} else if (node->right) {
+					assert(node->rightnode==NODE);
+					indent += 1;
+					node = node->right;
+					break;
+				}
+			}
+			leftmask &= ~bitmask;
+			rightmask &= ~bitmask;
+			pathmask &= ~bitmask;
+			pathbits &= ~bitmask;
+			node = node->parent;
+			indent -= 1;
+		}
+	}
+done:
+	if (verbose > 0)
+		printf("Found %d changes\n", changed);
+	return changed;
+}
+
+/*
+ * Emit a trie for the given tree into the data array.
+ */
+static void
+emit(struct tree *tree, unsigned char *data)
+{
+	struct node *node;
+	unsigned int leftmask;
+	unsigned int rightmask;
+	unsigned int bitmask;
+	int offlen;
+	int offset;
+	int index;
+	int indent;
+	unsigned char byte;
+
+	index = tree->index;
+	data += index;
+	indent = 1;
+	if (verbose > 0)
+		printf("Emitting %s_%x\n", tree->type, tree->maxage);
+	if (tree->childnode == LEAF) {
+		assert(tree->root);
+		tree->leaf_emit(tree->root, data);
+		return;
+	}
+
+	assert(tree->childnode == NODE);
+	node = tree->root;
+	leftmask = rightmask = 0;
+	while (node) {
+		if (!node->mark)
+			goto skip;
+		assert(node->offset != -1);
+		assert(node->index == index);
+
+		byte = 0;
+		if (node->nextbyte)
+			byte |= NEXTBYTE;
+		byte |= (node->bitnum & BITNUM);
+		if (node->left && node->right) {
+			if (node->leftnode == NODE)
+				byte |= LEFTNODE;
+			if (node->rightnode == NODE)
+				byte |= RIGHTNODE;
+			if (node->offset <= 0xff)
+				offlen = 1;
+			else if (node->offset <= 0xffff)
+				offlen = 2;
+			else
+				offlen = 3;
+			offset = node->offset;
+			byte |= offlen << OFFLEN_SHIFT;
+			*data++ = byte;
+			index++;
+			while (offlen--) {
+				*data++ = offset & 0xff;
+				index++;
+				offset >>= 8;
+			}
+		} else if (node->left) {
+			if (node->leftnode == NODE)
+				byte |= TRIENODE;
+			*data++ = byte;
+			index++;
+		} else if (node->right) {
+			byte |= RIGHTNODE;
+			if (node->rightnode == NODE)
+				byte |= TRIENODE;
+			*data++ = byte;
+			index++;
+		} else {
+			assert(0);
+		}
+skip:
+		while (node) {
+			bitmask = 1 << node->bitnum;
+			if (node->mark && (leftmask & bitmask) == 0) {
+				leftmask |= bitmask;
+				if (node->leftnode == LEAF) {
+					assert(node->left);
+					data = tree->leaf_emit(node->left,
+							       data);
+					index += tree->leaf_size(node->left);
+				} else if (node->left) {
+					assert(node->leftnode == NODE);
+					indent += 1;
+					node = node->left;
+					break;
+				}
+			}
+			if (node->mark && (rightmask & bitmask) == 0) {
+				rightmask |= bitmask;
+				if (node->rightnode == LEAF) {
+					assert(node->right);
+					data = tree->leaf_emit(node->right,
+							       data);
+					index += tree->leaf_size(node->right);
+				} else if (node->right) {
+					assert(node->rightnode==NODE);
+					indent += 1;
+					node = node->right;
+					break;
+				}
+			}
+			leftmask &= ~bitmask;
+			rightmask &= ~bitmask;
+			node = node->parent;
+			indent -= 1;
+		}
+	}
+}
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * Unicode data.
+ *
+ * We need to keep track of the Canonical Combining Class, the Age,
+ * and decompositions for a code point.
+ *
+ * For the Age, we store the index into the ages table.  Effectively
+ * this is a generation number that the table maps to a unicode
+ * version.
+ *
+ * The correction field is used to indicate that this entry is in the
+ * corrections array, which contains decompositions that were
+ * corrected in later revisions.  The value of the correction field is
+ * the Unicode version in which the mapping was corrected.
+ */
+struct unicode_data {
+	unsigned int code;
+	int ccc;
+	int gen;
+	int correction;
+	unsigned int *utf32nfkdi;
+	unsigned int *utf32nfkdicf;
+	char *utf8nfkdi;
+	char *utf8nfkdicf;
+};
+
+struct unicode_data unicode_data[0x110000];
+struct unicode_data *corrections;
+int    corrections_count;
+
+struct tree *nfkdi_tree;
+struct tree *nfkdicf_tree;
+
+struct tree *trees;
+int          trees_count;
+
+/*
+ * Check the corrections array to see if this entry was corrected at
+ * some point.
+ */
+static struct unicode_data *
+corrections_lookup(struct unicode_data *u)
+{
+	int i;
+
+	for (i = 0; i != corrections_count; i++)
+		if (u->code == corrections[i].code)
+			return &corrections[i];
+	return u;
+}
+
+static int
+nfkdi_equal(void *l, void *r)
+{
+	struct unicode_data *left = l;
+	struct unicode_data *right = r;
+
+	if (left->gen != right->gen)
+		return 0;
+	if (left->ccc != right->ccc)
+		return 0;
+	if (left->utf8nfkdi && right->utf8nfkdi &&
+	    strcmp(left->utf8nfkdi, right->utf8nfkdi) == 0)
+		return 1;
+	if (left->utf8nfkdi || right->utf8nfkdi)
+		return 0;
+	return 1;
+}
+
+static int
+nfkdicf_equal(void *l, void *r)
+{
+	struct unicode_data *left = l;
+	struct unicode_data *right = r;
+
+	if (left->gen != right->gen)
+		return 0;
+	if (left->ccc != right->ccc)
+		return 0;
+	if (left->utf8nfkdicf && right->utf8nfkdicf &&
+	    strcmp(left->utf8nfkdicf, right->utf8nfkdicf) == 0)
+		return 1;
+	if (left->utf8nfkdicf && right->utf8nfkdicf)
+		return 0;
+	if (left->utf8nfkdicf || right->utf8nfkdicf)
+		return 0;
+	if (left->utf8nfkdi && right->utf8nfkdi &&
+	    strcmp(left->utf8nfkdi, right->utf8nfkdi) == 0)
+		return 1;
+	if (left->utf8nfkdi || right->utf8nfkdi)
+		return 0;
+	return 1;
+}
+
+static void
+nfkdi_print(void *l, int indent)
+{
+	struct unicode_data *leaf = l;
+
+	printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf,
+               leaf->code, leaf->ccc, leaf->gen);
+	if (leaf->utf8nfkdi)
+		printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi);
+	printf("\n");
+}
+
+static void
+nfkdicf_print(void *l, int indent)
+{
+	struct unicode_data *leaf = l;
+
+	printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf,
+               leaf->code, leaf->ccc, leaf->gen);
+	if (leaf->utf8nfkdicf)
+		printf(" nfkdicf \"%s\"", (const char*)leaf->utf8nfkdicf);
+	else if (leaf->utf8nfkdi)
+		printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi);
+	printf("\n");
+}
+
+static int
+nfkdi_mark(void *l)
+{
+	return 1;
+}
+
+static int
+nfkdicf_mark(void *l)
+{
+	struct unicode_data *leaf = l;
+	if (leaf->utf8nfkdicf)
+		return 1;
+	return 0;
+}
+
+static int
+correction_mark(void *l)
+{
+	struct unicode_data *leaf = l;
+	return leaf->correction;
+}
+
+static int
+nfkdi_size(void *l)
+{
+	struct unicode_data *leaf = l;
+	int size = 2;
+	if (leaf->utf8nfkdi)
+		size += strlen(leaf->utf8nfkdi) + 1;
+	return size;
+}
+
+static int
+nfkdicf_size(void *l)
+{
+	struct unicode_data *leaf = l;
+	int size = 2;
+	if (leaf->utf8nfkdicf)
+		size += strlen(leaf->utf8nfkdicf) + 1;
+	else if (leaf->utf8nfkdi)
+		size += strlen(leaf->utf8nfkdi) + 1;
+	return size;
+}
+
+static int *
+nfkdi_index(struct tree *tree, void *l)
+{
+	struct unicode_data *leaf = l;
+	return &tree->leafindex[leaf->code];
+}
+
+static int *
+nfkdicf_index(struct tree *tree, void *l)
+{
+	struct unicode_data *leaf = l;
+	return &tree->leafindex[leaf->code];
+}
+
+static unsigned char *
+nfkdi_emit(void *l, unsigned char *data)
+{
+	struct unicode_data *leaf = l;
+	unsigned char *s;
+
+	*data++ = leaf->gen;
+	if (leaf->utf8nfkdi) {
+		*data++ = DECOMPOSE;
+		s = (unsigned char*)leaf->utf8nfkdi;
+		while ((*data++ = *s++) != 0)
+			;
+	} else {
+		*data++ = leaf->ccc;
+	}
+	return data;
+}
+
+static unsigned char *
+nfkdicf_emit(void *l, unsigned char *data)
+{
+	struct unicode_data *leaf = l;
+	unsigned char *s;
+
+	*data++ = leaf->gen;
+	if (leaf->utf8nfkdicf) {
+		*data++ = DECOMPOSE;
+		s = (unsigned char*)leaf->utf8nfkdicf;
+		while ((*data++ = *s++) != 0)
+			;
+	} else if (leaf->utf8nfkdi) {
+		*data++ = DECOMPOSE;
+		s = (unsigned char*)leaf->utf8nfkdi;
+		while ((*data++ = *s++) != 0)
+			;
+	} else {
+		*data++ = leaf->ccc;
+	}
+	return data;
+}
+
+static void
+utf8_create(struct unicode_data *data)
+{
+	char utf[18*4+1];
+	char *u;
+	unsigned int *um;
+	int i;
+
+	u = utf;
+	um = data->utf32nfkdi;
+	if (um) {
+		for (i = 0; um[i]; i++)
+			u += utf8key(um[i], u);
+		*u = '\0';
+		data->utf8nfkdi = strdup((char*)utf);
+	}
+	u = utf;
+	um = data->utf32nfkdicf;
+	if (um) {
+		for (i = 0; um[i]; i++)
+			u += utf8key(um[i], u);
+		*u = '\0';
+		if (!data->utf8nfkdi || strcmp(data->utf8nfkdi, (char*)utf))
+			data->utf8nfkdicf = strdup((char*)utf);
+	}
+}
+
+static void
+utf8_init(void)
+{
+	unsigned int unichar;
+	int i;
+
+	for (unichar = 0; unichar != 0x110000; unichar++)
+		utf8_create(&unicode_data[unichar]);
+
+	for (i = 0; i != corrections_count; i++)
+		utf8_create(&corrections[i]);
+}
+
+static void
+trees_init(void)
+{
+	struct unicode_data *data;
+	unsigned int maxage;
+	unsigned int nextage;
+	int count;
+	int i;
+	int j;
+
+	/* Count the number of different ages. */
+	count = 0;
+	nextage = (unsigned int)-1;
+	do {
+		maxage = nextage;
+		nextage = 0;
+		for (i = 0; i <= corrections_count; i++) {
+			data = &corrections[i];
+			if (nextage < data->correction &&
+			    data->correction < maxage)
+				nextage = data->correction;
+		}
+		count++;
+	} while (nextage);
+
+	/* Two trees per age: nfkdi and nfkdicf */
+	trees_count = count * 2;
+	trees = calloc(trees_count, sizeof(struct tree));
+
+	/* Assign ages to the trees. */
+	count = trees_count;
+	nextage = (unsigned int)-1;
+	do {
+		maxage = nextage;
+		trees[--count].maxage = maxage;
+		trees[--count].maxage = maxage;
+		nextage = 0;
+		for (i = 0; i <= corrections_count; i++) {
+			data = &corrections[i];
+			if (nextage < data->correction &&
+			    data->correction < maxage)
+				nextage = data->correction;
+		}
+	} while (nextage);
+
+	/* The ages assigned above are off by one. */
+	for (i = 0; i != trees_count; i++) {
+		j = 0;
+		while (ages[j] < trees[i].maxage)
+			j++;
+		trees[i].maxage = ages[j-1];
+	}
+
+	/* Set up the forwarding between trees. */
+	trees[trees_count-2].next = &trees[trees_count-1];
+	trees[trees_count-1].leaf_mark = nfkdi_mark;
+	trees[trees_count-2].leaf_mark = nfkdicf_mark;
+	for (i = 0; i != trees_count-2; i += 2) {
+		trees[i].next = &trees[trees_count-2];
+		trees[i].leaf_mark = correction_mark;
+		trees[i+1].next = &trees[trees_count-1];
+		trees[i+1].leaf_mark = correction_mark;
+	}
+
+	/* Assign the callouts. */
+	for (i = 0; i != trees_count; i += 2) {
+		trees[i].type = "nfkdicf";
+		trees[i].leaf_equal = nfkdicf_equal;
+		trees[i].leaf_print = nfkdicf_print;
+		trees[i].leaf_size = nfkdicf_size;
+		trees[i].leaf_index = nfkdicf_index;
+		trees[i].leaf_emit = nfkdicf_emit;
+
+		trees[i+1].type = "nfkdi";
+		trees[i+1].leaf_equal = nfkdi_equal;
+		trees[i+1].leaf_print = nfkdi_print;
+		trees[i+1].leaf_size = nfkdi_size;
+		trees[i+1].leaf_index = nfkdi_index;
+		trees[i+1].leaf_emit = nfkdi_emit;
+	}
+
+	/* Finish init. */
+	for (i = 0; i != trees_count; i++)
+		trees[i].childnode = NODE;
+}
+
+static void
+trees_populate(void)
+{
+	struct unicode_data *data;
+	unsigned int unichar;
+	char keyval[4];
+	int keylen;
+	int i;
+
+	for (i = 0; i != trees_count; i++) {
+		if (verbose > 0) {
+			printf("Populating %s_%x\n",
+				trees[i].type, trees[i].maxage);
+		}
+		for (unichar = 0; unichar != 0x110000; unichar++) {
+			if (unicode_data[unichar].gen < 0)
+				continue;
+			keylen = utf8key(unichar, keyval);
+			data = corrections_lookup(&unicode_data[unichar]);
+			if (data->correction <= trees[i].maxage)
+				data = &unicode_data[unichar];
+			insert(&trees[i], keyval, keylen, data);
+		}
+	}
+}
+
+static void
+trees_reduce(void)
+{
+	int i;
+	int size;
+	int changed;
+
+	for (i = 0; i != trees_count; i++)
+		prune(&trees[i]);
+	for (i = 0; i != trees_count; i++)
+		mark_nodes(&trees[i]);
+	do {
+		size = 0;
+		for (i = 0; i != trees_count; i++)
+			size = index_nodes(&trees[i], size);
+		changed = 0;
+		for (i = 0; i != trees_count; i++)
+			changed += size_nodes(&trees[i]);
+	} while (changed);
+
+	utf8data = calloc(size, 1);
+	utf8data_size = size;
+	for (i = 0; i != trees_count; i++)
+		emit(&trees[i], utf8data);
+
+	if (verbose > 0) {
+		for (i = 0; i != trees_count; i++) {
+			printf("%s_%x idx %d\n",
+				trees[i].type, trees[i].maxage, trees[i].index);
+		}
+	}
+
+	nfkdi = utf8data + trees[trees_count-1].index;
+	nfkdicf = utf8data + trees[trees_count-2].index;
+
+	nfkdi_tree = &trees[trees_count-1];
+	nfkdicf_tree = &trees[trees_count-2];
+}
+
+static void
+verify(struct tree *tree)
+{
+	struct unicode_data *data;
+	utf8leaf_t	*leaf;
+	unsigned int	unichar;
+	char		key[4];
+	int		report;
+	int		nocf;
+
+	if (verbose > 0)
+		printf("Verifying %s_%x\n", tree->type, tree->maxage);
+	nocf = strcmp(tree->type, "nfkdicf");
+
+	for (unichar = 0; unichar != 0x110000; unichar++) {
+		report = 0;
+		data = corrections_lookup(&unicode_data[unichar]);
+		if (data->correction <= tree->maxage)
+			data = &unicode_data[unichar];
+		utf8key(unichar, key);
+		leaf = utf8lookup(tree, key);
+		if (!leaf) {
+			if (data->gen != -1)
+				report++;
+			if (unichar < 0xd800 || unichar > 0xdfff)
+				report++;
+		} else {
+			if (unichar >= 0xd800 && unichar <= 0xdfff)
+				report++;
+			if (data->gen == -1)
+				report++;
+			if (data->gen != LEAF_GEN(leaf))
+				report++;
+			if (LEAF_CCC(leaf) == DECOMPOSE) {
+				if (nocf) {
+					if (!data->utf8nfkdi) {
+						report++;
+					} else if (strcmp(data->utf8nfkdi,
+							LEAF_STR(leaf))) {
+						report++;
+					}
+				} else {
+					if (!data->utf8nfkdicf &&
+					    !data->utf8nfkdi) {
+						report++;
+					} else if (data->utf8nfkdicf) {
+						if (strcmp(data->utf8nfkdicf,
+							   LEAF_STR(leaf)))
+							report++;
+					} else if (strcmp(data->utf8nfkdi,
+							  LEAF_STR(leaf))) {
+						report++;
+					}
+				}
+			} else if (data->ccc != LEAF_CCC(leaf)) {
+				report++;
+			}
+		}
+		if (report) {
+			printf("%X code %X gen %d ccc %d"
+				" nfdki -> \"%s\"",
+				unichar, data->code, data->gen,
+				data->ccc,
+				data->utf8nfkdi);
+			if (leaf) {
+				printf(" age %d ccc %d"
+					" nfdki -> \"%s\"\n",
+					LEAF_GEN(leaf),
+					LEAF_CCC(leaf),
+					LEAF_CCC(leaf) == DECOMPOSE ?
+						LEAF_STR(leaf) : "");
+			}
+			printf("\n");
+		}
+	}
+}
+
+static void
+trees_verify(void)
+{
+	int i;
+
+	for (i = 0; i != trees_count; i++)
+		verify(&trees[i]);
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+help(void)
+{
+	printf("Usage: %s [options]\n", argv0);
+	printf("\n");
+	printf("This program creates an a data trie used for parsing and\n");
+	printf("normalization of UTF-8 strings. The trie is derived from\n");
+	printf("a set of input files from the Unicode character database\n");
+	printf("found at: http://www.unicode.org/Public/UCD/latest/ucd/\n");
+	printf("\n");
+	printf("The generated tree supports two normalization forms:\n");
+	printf("\n");
+	printf("\tnfkdi:\n");
+	printf("\t- Apply unicode normalization form NFKD.\n");
+	printf("\t- Remove any Default_Ignorable_Code_Point.\n");
+	printf("\n");
+	printf("\tnfkdicf:\n");
+	printf("\t- Apply unicode normalization form NFKD.\n");
+	printf("\t- Remove any Default_Ignorable_Code_Point.\n");
+	printf("\t- Apply a full casefold (C + F).\n");
+	printf("\n");
+	printf("These forms were chosen as being most useful when dealing\n");
+	printf("with file names: NFKD catches most cases where characters\n");
+	printf("should be considered equivalent. The ignorables are mostly\n");
+	printf("invisible, making names hard to type.\n");
+	printf("\n");
+	printf("The options to specify the files to be used are listed\n");
+	printf("below with their default values, which are the names used\n");
+	printf("by version 7.0.0 of the Unicode Character Database.\n");
+	printf("\n");
+	printf("The input files:\n");
+	printf("\t-a %s\n", AGE_NAME);
+	printf("\t-c %s\n", CCC_NAME);
+	printf("\t-p %s\n", PROP_NAME);
+	printf("\t-d %s\n", DATA_NAME);
+	printf("\t-f %s\n", FOLD_NAME);
+	printf("\t-n %s\n", NORM_NAME);
+	printf("\n");
+	printf("Additionally, the generated tables are tested using:\n");
+	printf("\t-t %s\n", TEST_NAME);
+	printf("\n");
+	printf("Finally, the output file:\n");
+	printf("\t-o %s\n", UTF8_NAME);
+	printf("\n");
+}
+
+static void
+usage(void)
+{
+	help();
+	exit(1);
+}
+
+static void
+open_fail(const char *name, int error)
+{
+	printf("Error %d opening %s: %s\n", error, name, strerror(error));
+	exit(1);
+}
+
+static void
+file_fail(const char *filename)
+{
+	printf("Error parsing %s\n", filename);
+	exit(1);
+}
+
+static void
+line_fail(const char *filename, const char *line)
+{
+	printf("Error parsing %s:%s\n", filename, line);
+	exit(1);
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+print_utf32(unsigned int *utf32str)
+{
+	int	i;
+	for (i = 0; utf32str[i]; i++)
+		printf(" %X", utf32str[i]);
+}
+
+static void
+print_utf32nfkdi(unsigned int unichar)
+{
+	printf(" %X ->", unichar);
+	print_utf32(unicode_data[unichar].utf32nfkdi);
+	printf("\n");
+}
+
+static void
+print_utf32nfkdicf(unsigned int unichar)
+{
+	printf(" %X ->", unichar);
+	print_utf32(unicode_data[unichar].utf32nfkdicf);
+	printf("\n");
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+age_init(void)
+{
+	FILE *file;
+	unsigned int first;
+	unsigned int last;
+	unsigned int unichar;
+	unsigned int major;
+	unsigned int minor;
+	unsigned int revision;
+	int gen;
+        int count;
+	int ret;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", age_name);
+
+	file = fopen(age_name, "r");
+	if (!file)
+		open_fail(age_name, errno);
+        count = 0;
+
+        gen = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "# Age=V%d_%d_%d",
+				&major, &minor, &revision);
+		if (ret == 3) {
+			ages_count++;
+			if (verbose > 1)
+				printf(" Age V%d_%d_%d\n",
+					major, minor, revision);
+			if (!age_valid(major, minor, revision))
+				line_fail(age_name, line);
+			continue;
+		}
+		ret = sscanf(line, "# Age=V%d_%d", &major, &minor);
+		if (ret == 2) {
+			ages_count++;
+			if (verbose > 1)
+				printf(" Age V%d_%d\n", major, minor);
+			if (!age_valid(major, minor, 0))
+				line_fail(age_name, line);
+			continue;
+		}
+	}
+
+	/* We must have found something above. */
+	if (verbose > 1)
+		printf("%d age entries\n", ages_count);
+	if (ages_count == 0 || ages_count > MAXGEN)
+		file_fail(age_name);
+
+	/* There is a 0 entry. */
+	ages_count++;
+	ages = calloc(ages_count + 1, sizeof(*ages));
+	/* And a guard entry. */
+	ages[ages_count] = (unsigned int)-1;
+
+	rewind(file);
+        count = 0;
+	gen = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "# Age=V%d_%d_%d",
+				&major, &minor, &revision);
+		if (ret == 3) {
+			ages[++gen] =
+				UNICODE_AGE(major, minor, revision);
+			if (verbose > 1)
+				printf(" Age V%d_%d_%d = gen %d\n",
+					major, minor, revision, gen);
+			if (!age_valid(major, minor, revision))
+				line_fail(age_name, line);
+			continue;
+		}
+		ret = sscanf(line, "# Age=V%d_%d", &major, &minor);
+		if (ret == 2) {
+			ages[++gen] = UNICODE_AGE(major, minor, 0);
+			if (verbose > 1)
+				printf(" Age V%d_%d = %d\n",
+					major, minor, gen);
+			if (!age_valid(major, minor, 0))
+				line_fail(age_name, line);
+			continue;
+		}
+		ret = sscanf(line, "%X..%X ; %d.%d #",
+			     &first, &last, &major, &minor);
+		if (ret == 4) {
+			for (unichar = first; unichar <= last; unichar++)
+				unicode_data[unichar].gen = gen;
+                        count += 1 + last - first;
+			if (verbose > 1)
+				printf("  %X..%X gen %d\n", first, last, gen);
+			if (!utf32valid(first) || !utf32valid(last))
+				line_fail(age_name, line);
+			continue;
+		}
+		ret = sscanf(line, "%X ; %d.%d #", &unichar, &major, &minor);
+		if (ret == 3) {
+			unicode_data[unichar].gen = gen;
+                        count++;
+			if (verbose > 1)
+				printf("  %X gen %d\n", unichar, gen);
+			if (!utf32valid(unichar))
+				line_fail(age_name, line);
+			continue;
+		}
+	}
+	unicode_maxage = ages[gen];
+	fclose(file);
+
+	/* Nix surrogate block */
+	if (verbose > 1)
+		printf(" Removing surrogate block D800..DFFF\n");
+	for (unichar = 0xd800; unichar <= 0xdfff; unichar++)
+		unicode_data[unichar].gen = -1;
+
+	if (verbose > 0)
+	        printf("Found %d entries\n", count);
+	if (count == 0)
+		file_fail(age_name);
+}
+
+static void
+ccc_init(void)
+{
+	FILE *file;
+	unsigned int first;
+	unsigned int last;
+	unsigned int unichar;
+	unsigned int value;
+        int count;
+	int ret;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", ccc_name);
+
+	file = fopen(ccc_name, "r");
+	if (!file)
+		open_fail(ccc_name, errno);
+
+        count = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%X..%X ; %d #", &first, &last, &value);
+		if (ret == 3) {
+			for (unichar = first; unichar <= last; unichar++) {
+				unicode_data[unichar].ccc = value;
+                                count++;
+			}
+			if (verbose > 1)
+				printf(" %X..%X ccc %d\n", first, last, value);
+			if (!utf32valid(first) || !utf32valid(last))
+				line_fail(ccc_name, line);
+			continue;
+		}
+		ret = sscanf(line, "%X ; %d #", &unichar, &value);
+		if (ret == 2) {
+			unicode_data[unichar].ccc = value;
+                        count++;
+			if (verbose > 1)
+				printf(" %X ccc %d\n", unichar, value);
+			if (!utf32valid(unichar))
+				line_fail(ccc_name, line);
+			continue;
+		}
+	}
+	fclose(file);
+
+	if (verbose > 0)
+            printf("Found %d entries\n", count);
+	if (count == 0)
+		file_fail(ccc_name);
+}
+
+static void
+nfkdi_init(void)
+{
+	FILE *file;
+	unsigned int unichar;
+	unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+	char *s;
+	unsigned int *um;
+	int count;
+	int i;
+	int ret;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", data_name);
+	file = fopen(data_name, "r");
+	if (!file)
+		open_fail(data_name, errno);
+
+	count = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%X;%*[^;];%*[^;];%*[^;];%*[^;];%[^;];",
+			     &unichar, buf0);
+		if (ret != 2)
+			continue;
+		if (!utf32valid(unichar))
+			line_fail(data_name, line);
+
+		s = buf0;
+		/* skip over <tag> */
+		if (*s == '<')
+			while (*s++ != ' ')
+				;
+		/* decode the decomposition into UTF-32 */
+		i = 0;
+		while (*s) {
+			mapping[i] = strtoul(s, &s, 16);
+			if (!utf32valid(mapping[i]))
+				line_fail(data_name, line);
+			i++;
+		}
+		mapping[i++] = 0;
+
+		um = malloc(i * sizeof(unsigned int));
+		memcpy(um, mapping, i * sizeof(unsigned int));
+		unicode_data[unichar].utf32nfkdi = um;
+
+		if (verbose > 1)
+			print_utf32nfkdi(unichar);
+		count++;
+	}
+	fclose(file);
+	if (verbose > 0)
+		printf("Found %d entries\n", count);
+	if (count == 0)
+		file_fail(data_name);
+}
+
+static void
+nfkdicf_init(void)
+{
+	FILE *file;
+	unsigned int unichar;
+	unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+	char status;
+	char *s;
+	unsigned int *um;
+	int i;
+	int count;
+	int ret;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", fold_name);
+	file = fopen(fold_name, "r");
+	if (!file)
+		open_fail(fold_name, errno);
+
+	count = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%X; %c; %[^;];", &unichar, &status, buf0);
+		if (ret != 3)
+			continue;
+		if (!utf32valid(unichar))
+			line_fail(fold_name, line);
+		/* Use the C+F casefold. */
+		if (status != 'C' && status != 'F')
+			continue;
+		s = buf0;
+		if (*s == '<')
+			while (*s++ != ' ')
+				;
+		i = 0;
+		while (*s) {
+			mapping[i] = strtoul(s, &s, 16);
+			if (!utf32valid(mapping[i]))
+				line_fail(fold_name, line);
+			i++;
+		}
+		mapping[i++] = 0;
+
+		um = malloc(i * sizeof(unsigned int));
+		memcpy(um, mapping, i * sizeof(unsigned int));
+		unicode_data[unichar].utf32nfkdicf = um;
+
+		if (verbose > 1)
+			print_utf32nfkdicf(unichar);
+		count++;
+	}
+	fclose(file);
+	if (verbose > 0)
+		printf("Found %d entries\n", count);
+	if (count == 0)
+		file_fail(fold_name);
+}
+
+static void
+ignore_init(void)
+{
+	FILE *file;
+	unsigned int unichar;
+	unsigned int first;
+	unsigned int last;
+	unsigned int *um;
+	int count;
+	int ret;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", prop_name);
+	file = fopen(prop_name, "r");
+	if (!file)
+		open_fail(prop_name, errno);
+	assert(file);
+	count = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%X..%X ; %s # ", &first, &last, buf0);
+		if (ret == 3) {
+			if (strcmp(buf0, "Default_Ignorable_Code_Point"))
+				continue;
+			if (!utf32valid(first) || !utf32valid(last))
+				line_fail(prop_name, line);
+			for (unichar = first; unichar <= last; unichar++) {
+				free(unicode_data[unichar].utf32nfkdi);
+				um = malloc(sizeof(unsigned int));
+				*um = 0;
+				unicode_data[unichar].utf32nfkdi = um;
+				free(unicode_data[unichar].utf32nfkdicf);
+				um = malloc(sizeof(unsigned int));
+				*um = 0;
+				unicode_data[unichar].utf32nfkdicf = um;
+				count++;
+			}
+			if (verbose > 1)
+				printf(" %X..%X Default_Ignorable_Code_Point\n",
+					first, last);
+			continue;
+		}
+		ret = sscanf(line, "%X ; %s # ", &unichar, buf0);
+		if (ret == 2) {
+			if (strcmp(buf0, "Default_Ignorable_Code_Point"))
+				continue;
+			if (!utf32valid(unichar))
+				line_fail(prop_name, line);
+			free(unicode_data[unichar].utf32nfkdi);
+			um = malloc(sizeof(unsigned int));
+			*um = 0;
+			unicode_data[unichar].utf32nfkdi = um;
+			free(unicode_data[unichar].utf32nfkdicf);
+			um = malloc(sizeof(unsigned int));
+			*um = 0;
+			unicode_data[unichar].utf32nfkdicf = um;
+			if (verbose > 1)
+				printf(" %X Default_Ignorable_Code_Point\n",
+					unichar);
+			count++;
+			continue;
+		}
+	}
+	fclose(file);
+
+	if (verbose > 0)
+		printf("Found %d entries\n", count);
+	if (count == 0)
+		file_fail(prop_name);
+}
+
+static void
+corrections_init(void)
+{
+	FILE *file;
+	unsigned int unichar;
+	unsigned int major;
+	unsigned int minor;
+	unsigned int revision;
+	unsigned int age;
+	unsigned int *um;
+	unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+	char *s;
+	int i;
+	int count;
+	int ret;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", norm_name);
+	file = fopen(norm_name, "r");
+	if (!file)
+		open_fail(norm_name, errno);
+
+	count = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%X;%[^;];%[^;];%d.%d.%d #",
+				&unichar, buf0, buf1,
+				&major, &minor, &revision);
+		if (ret != 6)
+			continue;
+		if (!utf32valid(unichar) || !age_valid(major, minor, revision))
+			line_fail(norm_name, line);
+		count++;
+	}
+	corrections = calloc(count, sizeof(struct unicode_data));
+	corrections_count = count;
+	rewind(file);
+
+	count = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%X;%[^;];%[^;];%d.%d.%d #",
+				&unichar, buf0, buf1,
+				&major, &minor, &revision);
+		if (ret != 6)
+			continue;
+		if (!utf32valid(unichar) || !age_valid(major, minor, revision))
+			line_fail(norm_name, line);
+		corrections[count] = unicode_data[unichar];
+		assert(corrections[count].code == unichar);
+		age = UNICODE_AGE(major, minor, revision);
+		corrections[count].correction = age;
+
+		i = 0;
+		s = buf0;
+		while (*s) {
+			mapping[i] = strtoul(s, &s, 16);
+			if (!utf32valid(mapping[i]))
+				line_fail(norm_name, line);
+			i++;
+		}
+		mapping[i++] = 0;
+
+		um = malloc(i * sizeof(unsigned int));
+		memcpy(um, mapping, i * sizeof(unsigned int));
+		corrections[count].utf32nfkdi = um;
+
+		if (verbose > 1)
+			printf(" %X -> %s -> %s V%d_%d_%d\n",
+				unichar, buf0, buf1, major, minor, revision);
+		count++;
+	}
+	fclose(file);
+
+	if (verbose > 0)
+	        printf("Found %d entries\n", count);
+	if (count == 0)
+		file_fail(norm_name);
+}
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * Hangul decomposition (algorithm from Section 3.12 of Unicode 6.3.0)
+ *
+ * AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
+ * D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
+ *
+ * SBase = 0xAC00
+ * LBase = 0x1100
+ * VBase = 0x1161
+ * TBase = 0x11A7
+ * LCount = 19
+ * VCount = 21
+ * TCount = 28
+ * NCount = 588 (VCount * TCount)
+ * SCount = 11172 (LCount * NCount)
+ *
+ * Decomposition:
+ *   SIndex = s - SBase
+ *
+ * LV (Canonical/Full)
+ *   LIndex = SIndex / NCount
+ *   VIndex = (Sindex % NCount) / TCount
+ *   LPart = LBase + LIndex
+ *   VPart = VBase + VIndex
+ *
+ * LVT (Canonical)
+ *   LVIndex = (SIndex / TCount) * TCount
+ *   TIndex = (Sindex % TCount
+ *   LVPart = LBase + LVIndex
+ *   TPart = TBase + TIndex
+ *
+ * LVT (Full)
+ *   LIndex = SIndex / NCount
+ *   VIndex = (Sindex % NCount) / TCount
+ *   TIndex = (Sindex % TCount
+ *   LPart = LBase + LIndex
+ *   VPart = VBase + VIndex
+ *   if (TIndex == 0) {
+ *          d = <LPart, VPart>
+ *   } else {
+ *          TPart = TBase + TIndex
+ *          d = <LPart, TPart, VPart>
+ *   }
+ *
+ */
+
+static void
+hangul_decompose(void)
+{
+	unsigned int sb = 0xAC00;
+	unsigned int lb = 0x1100;
+	unsigned int vb = 0x1161;
+	unsigned int tb = 0x11a7;
+	/* unsigned int lc = 19; */
+	unsigned int vc = 21;
+	unsigned int tc = 28;
+	unsigned int nc = (vc * tc);
+	/* unsigned int sc = (lc * nc); */
+	unsigned int unichar;
+	unsigned int mapping[4];
+	unsigned int *um;
+        int count;
+	int i;
+
+	if (verbose > 0)
+		printf("Decomposing hangul\n");
+	/* Hangul */
+	count = 0;
+	for (unichar = 0xAC00; unichar <= 0xD7A3; unichar++) {
+		unsigned int si = unichar - sb;
+		unsigned int li = si / nc;
+		unsigned int vi = (si % nc) / tc;
+		unsigned int ti = si % tc;
+
+		i = 0;
+		mapping[i++] = lb + li;
+		mapping[i++] = vb + vi;
+		if (ti)
+			mapping[i++] = tb + ti;
+		mapping[i++] = 0;
+
+		assert(!unicode_data[unichar].utf32nfkdi);
+		um = malloc(i * sizeof(unsigned int));
+		memcpy(um, mapping, i * sizeof(unsigned int));
+		unicode_data[unichar].utf32nfkdi = um;
+
+		assert(!unicode_data[unichar].utf32nfkdicf);
+		um = malloc(i * sizeof(unsigned int));
+		memcpy(um, mapping, i * sizeof(unsigned int));
+		unicode_data[unichar].utf32nfkdicf = um;
+
+		if (verbose > 1)
+			print_utf32nfkdi(unichar);
+
+		count++;
+	}
+	if (verbose > 0)
+		printf("Created %d entries\n", count);
+}
+
+static void
+nfkdi_decompose(void)
+{
+	unsigned int unichar;
+	unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+	unsigned int *um;
+	unsigned int *dc;
+	int count;
+	int i;
+	int j;
+	int ret;
+
+	if (verbose > 0)
+		printf("Decomposing nfkdi\n");
+
+	count = 0;
+	for (unichar = 0; unichar != 0x110000; unichar++) {
+		if (!unicode_data[unichar].utf32nfkdi)
+			continue;
+		for (;;) {
+			ret = 1;
+			i = 0;
+			um = unicode_data[unichar].utf32nfkdi;
+			while (*um) {
+				dc = unicode_data[*um].utf32nfkdi;
+				if (dc) {
+					for (j = 0; dc[j]; j++)
+						mapping[i++] = dc[j];
+					ret = 0;
+				} else {
+					mapping[i++] = *um;
+				}
+				um++;
+			}
+			mapping[i++] = 0;
+			if (ret)
+				break;
+			free(unicode_data[unichar].utf32nfkdi);
+			um = malloc(i * sizeof(unsigned int));
+			memcpy(um, mapping, i * sizeof(unsigned int));
+			unicode_data[unichar].utf32nfkdi = um;
+		}
+		/* Add this decomposition to nfkdicf if there is no entry. */
+		if (!unicode_data[unichar].utf32nfkdicf) {
+			um = malloc(i * sizeof(unsigned int));
+			memcpy(um, mapping, i * sizeof(unsigned int));
+			unicode_data[unichar].utf32nfkdicf = um;
+		}
+		if (verbose > 1)
+			print_utf32nfkdi(unichar);
+		count++;
+	}
+	if (verbose > 0)
+		printf("Processed %d entries\n", count);
+}
+
+static void
+nfkdicf_decompose(void)
+{
+	unsigned int unichar;
+	unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+	unsigned int *um;
+	unsigned int *dc;
+	int count;
+	int i;
+	int j;
+	int ret;
+
+	if (verbose > 0)
+		printf("Decomposing nfkdicf\n");
+	count = 0;
+	for (unichar = 0; unichar != 0x110000; unichar++) {
+		if (!unicode_data[unichar].utf32nfkdicf)
+			continue;
+		for (;;) {
+			ret = 1;
+			i = 0;
+			um = unicode_data[unichar].utf32nfkdicf;
+			while (*um) {
+				dc = unicode_data[*um].utf32nfkdicf;
+				if (dc) {
+					for (j = 0; dc[j]; j++)
+						mapping[i++] = dc[j];
+					ret = 0;
+				} else {
+					mapping[i++] = *um;
+				}
+				um++;
+			}
+			mapping[i++] = 0;
+			if (ret)
+				break;
+			free(unicode_data[unichar].utf32nfkdicf);
+			um = malloc(i * sizeof(unsigned int));
+			memcpy(um, mapping, i * sizeof(unsigned int));
+			unicode_data[unichar].utf32nfkdicf = um;
+		}
+		if (verbose > 1)
+			print_utf32nfkdicf(unichar);
+		count++;
+	}
+	if (verbose > 0)
+		printf("Processed %d entries\n", count);
+}
+
+/* ------------------------------------------------------------------ */
+
+int utf8agemax(struct tree *, const char *);
+int utf8nagemax(struct tree *, const char *, size_t);
+int utf8agemin(struct tree *, const char *);
+int utf8nagemin(struct tree *, const char *, size_t);
+ssize_t utf8len(struct tree *, const char *);
+ssize_t utf8nlen(struct tree *, const char *, size_t);
+struct utf8cursor;
+int utf8cursor(struct utf8cursor *, struct tree *, const char *);
+int utf8ncursor(struct utf8cursor *, struct tree *, const char *, size_t);
+int utf8byte(struct utf8cursor *);
+
+/*
+ * Use trie to scan s, touching at most len bytes.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * A non-NULL return guarantees that the UTF-8 sequence starting at s
+ * is well-formed and corresponds to a known unicode code point.  The
+ * shorthand for this will be "is valid UTF-8 unicode".
+ */
+static utf8leaf_t *
+utf8nlookup(struct tree *tree, const char *s, size_t len)
+{
+	utf8trie_t	*trie = utf8data + tree->index;
+	int		offlen;
+	int		offset;
+	int		mask;
+	int		node;
+
+	if (!tree)
+		return NULL;
+	if (len == 0)
+		return NULL;
+	node = 1;
+	while (node) {
+		offlen = (*trie & OFFLEN) >> OFFLEN_SHIFT;
+		if (*trie & NEXTBYTE) {
+			if (--len == 0)
+				return NULL;
+			s++;
+		}
+		mask = 1 << (*trie & BITNUM);
+		if (*s & mask) {
+			/* Right leg */
+			if (offlen) {
+				/* Right node at offset of trie */
+				node = (*trie & RIGHTNODE);
+				offset = trie[offlen];
+				while (--offlen) {
+					offset <<= 8;
+					offset |= trie[offlen];
+				}
+				trie += offset;
+			} else if (*trie & RIGHTPATH) {
+				/* Right node after this node */
+				node = (*trie & TRIENODE);
+				trie++;
+			} else {
+				/* No right node. */
+				node = 0;
+				trie = NULL;
+			}
+		} else {
+			/* Left leg */
+			if (offlen) {
+				/* Left node after this node. */
+				node = (*trie & LEFTNODE);
+				trie += offlen + 1;
+			} else if (*trie & RIGHTPATH) {
+				/* No left node. */
+				node = 0;
+				trie = NULL;
+			} else {
+				/* Left node after this node */
+				node = (*trie & TRIENODE);
+				trie++;
+			}
+		}
+	}
+	return trie;
+}
+
+/*
+ * Use trie to scan s.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * Forwards to trie_nlookup().
+ */
+static utf8leaf_t *
+utf8lookup(struct tree *tree, const char *s)
+{
+	return utf8nlookup(tree, s, (size_t)-1);
+}
+
+/*
+ * Return the number of bytes used by the current UTF-8 sequence.
+ * Assumes the input points to the first byte of a valid UTF-8
+ * sequence.
+ */
+static inline int
+utf8clen(const char *s)
+{
+	unsigned char c = *s;
+	return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0);
+}
+
+/*
+ * Maximum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if only non-assigned code points are used.
+ */
+int
+utf8agemax(struct tree *tree, const char *s)
+{
+	utf8leaf_t	*leaf;
+	int		age = 0;
+	int		leaf_age;
+
+	if (!tree)
+		return -1;
+	while (*s) {
+		if (!(leaf = utf8lookup(tree, s)))
+			return -1;
+		leaf_age = ages[LEAF_GEN(leaf)];
+		if (leaf_age <= tree->maxage && leaf_age > age)
+			age = leaf_age;
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Minimum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if non-assigned code points are used.
+ */
+int
+utf8agemin(struct tree *tree, const char *s)
+{
+	utf8leaf_t	*leaf;
+	int		age = tree->maxage;
+	int		leaf_age;
+
+	if (!tree)
+		return -1;
+	while (*s) {
+		if (!(leaf = utf8lookup(tree, s)))
+			return -1;
+		leaf_age = ages[LEAF_GEN(leaf)];
+		if (leaf_age <= tree->maxage && leaf_age < age)
+			age = leaf_age;
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Maximum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemax(struct tree *tree, const char *s, size_t len)
+{
+	utf8leaf_t	*leaf;
+	int		age = 0;
+	int		leaf_age;
+
+	if (!tree)
+		return -1;
+        while (len && *s) {
+		if (!(leaf = utf8nlookup(tree, s, len)))
+			return -1;
+		leaf_age = ages[LEAF_GEN(leaf)];
+		if (leaf_age <= tree->maxage && leaf_age > age)
+			age = leaf_age;
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Maximum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemin(struct tree *tree, const char *s, size_t len)
+{
+	utf8leaf_t	*leaf;
+	int		leaf_age;
+	int		age = tree->maxage;
+
+	if (!tree)
+		return -1;
+        while (len && *s) {
+		if (!(leaf = utf8nlookup(tree, s, len)))
+			return -1;
+		leaf_age = ages[LEAF_GEN(leaf)];
+		if (leaf_age <= tree->maxage && leaf_age < age)
+			age = leaf_age;
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Length of the normalization of s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ *
+ * A string of Default_Ignorable_Code_Point has length 0.
+ */
+ssize_t
+utf8len(struct tree *tree, const char *s)
+{
+	utf8leaf_t	*leaf;
+	size_t		ret = 0;
+
+	if (!tree)
+		return -1;
+	while (*s) {
+		if (!(leaf = utf8lookup(tree, s)))
+			return -1;
+		if (ages[LEAF_GEN(leaf)] > tree->maxage)
+			ret += utf8clen(s);
+		else if (LEAF_CCC(leaf) == DECOMPOSE)
+			ret += strlen(LEAF_STR(leaf));
+		else
+			ret += utf8clen(s);
+		s += utf8clen(s);
+	}
+	return ret;
+}
+
+/*
+ * Length of the normalization of s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+ssize_t
+utf8nlen(struct tree *tree, const char *s, size_t len)
+{
+	utf8leaf_t	*leaf;
+	size_t		ret = 0;
+
+	if (!tree)
+		return -1;
+	while (len && *s) {
+		if (!(leaf = utf8nlookup(tree, s, len)))
+			return -1;
+		if (ages[LEAF_GEN(leaf)] > tree->maxage)
+			ret += utf8clen(s);
+		else if (LEAF_CCC(leaf) == DECOMPOSE)
+			ret += strlen(LEAF_STR(leaf));
+		else
+			ret += utf8clen(s);
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return ret;
+}
+
+/*
+ * Cursor structure used by the normalizer.
+ */
+struct utf8cursor {
+	struct tree	*tree;
+	const char	*s;
+	const char	*p;
+	const char	*ss;
+	const char	*sp;
+	unsigned int	len;
+	unsigned int	slen;
+	short int	ccc;
+	short int	nccc;
+	unsigned int	unichar;
+};
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ *   s      : string.
+ *   len    : length of s.
+ *   u8c    : pointer to cursor.
+ *   trie   : utf8trie_t to use for normalization.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int
+utf8ncursor(
+	struct utf8cursor *u8c,
+	struct tree	*tree,
+	const char	*s,
+	size_t		len)
+{
+	if (!tree)
+		return -1;
+	if (!s)
+		return -1;
+	u8c->tree = tree;
+	u8c->s = s;
+	u8c->p = NULL;
+	u8c->ss = NULL;
+	u8c->sp = NULL;
+	u8c->len = len;
+	u8c->slen = 0;
+	u8c->ccc = STOPPER;
+	u8c->nccc = STOPPER;
+	u8c->unichar = 0;
+	/* Check we didn't clobber the maximum length. */
+	if (u8c->len != len)
+		return -1;
+	/* The first byte of s may not be an utf8 continuation. */
+	if (len > 0 && (*s & 0xC0) == 0x80)
+		return -1;
+	return 0;
+}
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ *   s      : NUL-terminated string.
+ *   u8c    : pointer to cursor.
+ *   trie   : utf8trie_t to use for normalization.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int
+utf8cursor(
+	struct utf8cursor *u8c,
+	struct tree	*tree,
+	const char	*s)
+{
+	return utf8ncursor(u8c, tree, s, (unsigned int)-1);
+}
+
+/*
+ * Get one byte from the normalized form of the string described by u8c.
+ *
+ * Returns the byte cast to an unsigned char on succes, and -1 on failure.
+ *
+ * The cursor keeps track of the location in the string in u8c->s.
+ * When a character is decomposed, the current location is stored in
+ * u8c->p, and u8c->s is set to the start of the decomposition. Note
+ * that bytes from a decomposition do not count against u8c->len.
+ *
+ * Characters are emitted if they match the current CCC in u8c->ccc.
+ * Hitting end-of-string while u8c->ccc == STOPPER means we're done,
+ * and the function returns 0 in that case.
+ *
+ * Sorting by CCC is done by repeatedly scanning the string.  The
+ * values of u8c->s and u8c->p are stored in u8c->ss and u8c->sp at
+ * the start of the scan.  The first pass finds the lowest CCC to be
+ * emitted and stores it in u8c->nccc, the second pass emits the
+ * characters with this CCC and finds the next lowest CCC. This limits
+ * the number of passes to 1 + the number of different CCCs in the
+ * sequence being scanned.
+ *
+ * Therefore:
+ *  u8c->p  != NULL -> a decomposition is being scanned.
+ *  u8c->ss != NULL -> this is a repeating scan.
+ *  u8c->ccc == -1  -> this is the first scan of a repeating scan.
+ */
+int
+utf8byte(struct utf8cursor *u8c)
+{
+	utf8leaf_t *leaf;
+	int ccc;
+
+	for (;;) {
+		/* Check for the end of a decomposed character. */
+		if (u8c->p && *u8c->s == '\0') {
+			u8c->s = u8c->p;
+			u8c->p = NULL;
+		}
+
+		/* Check for end-of-string. */
+		if (!u8c->p && (u8c->len == 0 || *u8c->s == '\0')) {
+			/* There is no next byte. */
+			if (u8c->ccc == STOPPER)
+				return 0;
+			/* End-of-string during a scan counts as a stopper. */
+			ccc = STOPPER;
+			goto ccc_mismatch;
+		} else if ((*u8c->s & 0xC0) == 0x80) {
+			/* This is a continuation of the current character. */
+			if (!u8c->p)
+				u8c->len--;
+			return (unsigned char)*u8c->s++;
+		}
+
+		/* Look up the data for the current character. */
+		if (u8c->p)
+			leaf = utf8lookup(u8c->tree, u8c->s);
+		else
+			leaf = utf8nlookup(u8c->tree, u8c->s, u8c->len);
+
+		/* No leaf found implies that the input is a binary blob. */
+		if (!leaf)
+			return -1;
+
+		/* Characters that are too new have CCC 0. */
+		if (ages[LEAF_GEN(leaf)] > u8c->tree->maxage) {
+			ccc = STOPPER;
+		} else if ((ccc = LEAF_CCC(leaf)) == DECOMPOSE) {
+			u8c->len -= utf8clen(u8c->s);
+			u8c->p = u8c->s + utf8clen(u8c->s);
+			u8c->s = LEAF_STR(leaf);
+			/* Empty decomposition implies CCC 0. */
+			if (*u8c->s == '\0') {
+				if (u8c->ccc == STOPPER)
+					continue;
+				ccc = STOPPER;
+				goto ccc_mismatch;
+			}
+			leaf = utf8lookup(u8c->tree, u8c->s);
+			ccc = LEAF_CCC(leaf);
+		}
+		u8c->unichar = utf8code(u8c->s);
+
+		/*
+		 * If this is not a stopper, then see if it updates
+		 * the next canonical class to be emitted.
+		 */
+		if (ccc != STOPPER && u8c->ccc < ccc && ccc < u8c->nccc)
+			u8c->nccc = ccc;
+
+		/*
+		 * Return the current byte if this is the current
+		 * combining class.
+		 */
+		if (ccc == u8c->ccc) {
+			if (!u8c->p)
+				u8c->len--;
+			return (unsigned char)*u8c->s++;
+		}
+
+		/* Current combining class mismatch. */
+	ccc_mismatch:
+		if (u8c->nccc == STOPPER) {
+			/*
+			 * Scan forward for the first canonical class
+			 * to be emitted.  Save the position from
+			 * which to restart.
+			 */
+			assert(u8c->ccc == STOPPER);
+			u8c->ccc = MINCCC - 1;
+			u8c->nccc = ccc;
+			u8c->sp = u8c->p;
+			u8c->ss = u8c->s;
+			u8c->slen = u8c->len;
+			if (!u8c->p)
+				u8c->len -= utf8clen(u8c->s);
+			u8c->s += utf8clen(u8c->s);
+		} else if (ccc != STOPPER) {
+			/* Not a stopper, and not the ccc we're emitting. */
+			if (!u8c->p)
+				u8c->len -= utf8clen(u8c->s);
+			u8c->s += utf8clen(u8c->s);
+		} else if (u8c->nccc != MAXCCC + 1) {
+			/* At a stopper, restart for next ccc. */
+			u8c->ccc = u8c->nccc;
+			u8c->nccc = MAXCCC + 1;
+			u8c->s = u8c->ss;
+			u8c->p = u8c->sp;
+			u8c->len = u8c->slen;
+		} else {
+			/* All done, proceed from here. */
+			u8c->ccc = STOPPER;
+			u8c->nccc = STOPPER;
+			u8c->sp = NULL;
+			u8c->ss = NULL;
+			u8c->slen = 0;
+		}
+	}
+}
+
+/* ------------------------------------------------------------------ */
+
+static int
+normalize_line(struct tree *tree)
+{
+	char *s;
+	char *t;
+	int c;
+	struct utf8cursor u8c;
+
+	/* First test: null-terminated string. */
+	s = buf2;
+	t = buf3;
+	if (utf8cursor(&u8c, tree, s))
+		return -1;
+	while ((c = utf8byte(&u8c)) > 0)
+		if (c != (unsigned char)*t++)
+			return -1;
+	if (c < 0)
+		return -1;
+	if (*t != 0)
+		return -1;
+
+	/* Second test: length-limited string. */
+	s = buf2;
+	/* Replace NUL with a value that will cause an error if seen. */
+	s[strlen(s) + 1] = -1;
+	t = buf3;
+	if (utf8cursor(&u8c, tree, s))
+		return -1;
+	while ((c = utf8byte(&u8c)) > 0)
+		if (c != (unsigned char)*t++)
+			return -1;
+	if (c < 0)
+		return -1;
+	if (*t != 0)
+		return -1;
+
+	return 0;
+}
+
+static void
+normalization_test(void)
+{
+	FILE *file;
+	unsigned int unichar;
+	struct unicode_data *data;
+	char *s;
+	char *t;
+	int ret;
+	int ignorables;
+	int tests = 0;
+	int failures = 0;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", test_name);
+	/* Step one, read data from file. */
+	file = fopen(test_name, "r");
+	if (!file)
+		open_fail(test_name, errno);
+
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%[^;];%*[^;];%*[^;];%*[^;];%[^;];",
+			     buf0, buf1);
+		if (ret != 2 || *line == '#')
+			continue;
+		s = buf0;
+		t = buf2;
+		while (*s) {
+			unichar = strtoul(s, &s, 16);
+			t += utf8key(unichar, t);
+		}
+		*t = '\0';
+
+		ignorables = 0;
+		s = buf1;
+		t = buf3;
+		while (*s) {
+			unichar = strtoul(s, &s, 16);
+			data = &unicode_data[unichar];
+			if (data->utf8nfkdi && !*data->utf8nfkdi)
+				ignorables = 1;
+			else
+				t += utf8key(unichar, t);
+		}
+		*t = '\0';
+
+		tests++;
+		if (normalize_line(nfkdi_tree) < 0) {
+			printf("\nline %s -> %s", buf0, buf1);
+			if (ignorables)
+				printf(" (ignorables removed)");
+			printf(" failure\n");
+			failures++;
+		}
+	}
+	fclose(file);
+	if (verbose > 0)
+		printf("Ran %d tests with %d failures\n", tests, failures);
+	if (failures)
+		file_fail(test_name);
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+write_file(void)
+{
+	FILE *file;
+	int i;
+	int j;
+	int t;
+	int gen;
+
+	if (verbose > 0)
+		printf("Writing %s\n", utf8_name);
+	file = fopen(utf8_name, "w");
+	if (!file)
+		open_fail(utf8_name, errno);
+
+	fprintf(file, "/* This file is generated code, do not edit. */\n");
+	fprintf(file, "#ifndef __INCLUDED_FROM_UTF8NORM_C__\n");
+	fprintf(file, "#error Only xfs_utf8.c may include this file.\n");
+	fprintf(file, "#endif\n");
+	fprintf(file, "\n");
+	fprintf(file, "const unsigned int utf8version = %#x;\n",
+		unicode_maxage);
+	fprintf(file, "\n");
+	fprintf(file, "static const unsigned int utf8agetab[] = {\n");
+	for (i = 0; i != ages_count; i++)
+		fprintf(file, "\t%#x%s\n", ages[i],
+			ages[i] == unicode_maxage ? "" : ",");
+	fprintf(file, "};\n");
+	fprintf(file, "\n");
+	fprintf(file, "static const struct utf8data utf8nfkdicfdata[] = {\n");
+	t = 0;
+	for (gen = 0; gen < ages_count; gen++) {
+		fprintf(file, "\t{ %#x, %d }%s\n",
+			ages[gen], trees[t].index,
+			ages[gen] == unicode_maxage ? "" : ",");
+		if (trees[t].maxage == ages[gen])
+			t += 2;
+	}
+	fprintf(file, "};\n");
+	fprintf(file, "\n");
+	fprintf(file, "static const struct utf8data utf8nfkdidata[] = {\n");
+	t = 1;
+	for (gen = 0; gen < ages_count; gen++) {
+		fprintf(file, "\t{ %#x, %d }%s\n",
+			ages[gen], trees[t].index,
+			ages[gen] == unicode_maxage ? "" : ",");
+		if (trees[t].maxage == ages[gen])
+			t += 2;
+	}
+	fprintf(file, "};\n");
+	fprintf(file, "\n");
+	fprintf(file, "static const unsigned char utf8data[%zd] = {\n",
+		utf8data_size);
+	t = 0;
+	for (i = 0; i != utf8data_size; i += 16) {
+		if (i == trees[t].index) {
+			fprintf(file, "\t/* %s_%x */\n",
+				trees[t].type, trees[t].maxage);
+			if (t < trees_count-1)
+				t++;
+		}
+		fprintf(file, "\t");
+		for (j = i; j != i + 16; j++)
+			fprintf(file, "0x%.2x%s", utf8data[j],
+				(j < utf8data_size -1 ? "," : ""));
+		fprintf(file, "\n");
+	}
+	fprintf(file, "};\n");
+	fclose(file);
+}
+
+/* ------------------------------------------------------------------ */
+
+int
+main(int argc, char *argv[])
+{
+	unsigned int unichar;
+	int opt;
+
+	argv0 = argv[0];
+
+	while ((opt = getopt(argc, argv, "a:c:d:f:hn:o:p:t:v")) != -1) {
+		switch (opt) {
+		case 'a':
+			age_name = optarg;
+			break;
+		case 'c':
+			ccc_name = optarg;
+			break;
+		case 'd':
+			data_name = optarg;
+			break;
+		case 'f':
+			fold_name = optarg;
+			break;
+		case 'n':
+			norm_name = optarg;
+			break;
+		case 'o':
+			utf8_name = optarg;
+			break;
+		case 'p':
+			prop_name = optarg;
+			break;
+		case 't':
+			test_name = optarg;
+			break;
+		case 'v':
+			verbose++;
+			break;
+		case 'h':
+			help();
+			exit(0);
+		default:
+			usage();
+		}
+	}
+
+	if (verbose > 1)
+		help();
+	for (unichar = 0; unichar != 0x110000; unichar++)
+		unicode_data[unichar].code = unichar;
+	age_init();
+	ccc_init();
+	nfkdi_init();
+	nfkdicf_init();
+	ignore_init();
+	corrections_init();
+	hangul_decompose();
+	nfkdi_decompose();
+	nfkdicf_decompose();
+	utf8_init();
+	trees_init();
+	trees_populate();
+	trees_reduce();
+	trees_verify();
+	/* Prevent "unused function" warning. */
+	(void)lookup(nfkdi_tree, " ");
+	if (verbose > 2)
+		tree_walk(nfkdi_tree);
+	if (verbose > 2)
+		tree_walk(nfkdicf_tree);
+	normalization_test();
+	write_file();
+
+	return 0;
+}
-- 
1.7.12.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 07b/13] libxfs: add supporting code for UTF-8.
  2014-09-18 20:31 ` [PATCH 00/13] xfsprogs: Unicode/UTF-8 support for XFS Ben Myers
                     ` (13 preceding siblings ...)
  2014-09-19 16:06   ` [PATCH 07a/13] xfsprogs: add trie generator for UTF-8 Ben Myers
@ 2014-09-19 16:07   ` Ben Myers
  14 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-19 16:07 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: tinguely, olaf, xfs

From: Olaf Weber <olaf@sgi.com>

Supporting functions for UTF-8 normalization are in utf8norm.c with the
header utf8norm.h. Two normalization forms are supported: nfkdi and
nfkdicf.

  nfkdi:
   - Apply unicode normalization form NFKD.
   - Remove any Default_Ignorable_Code_Point.

  nfkdicf:
   - Apply unicode normalization form NFKD.
   - Remove any Default_Ignorable_Code_Point.
   - Apply a full casefold (C + F).

For the purposes of the code, a string is valid UTF-8 if:

 - The values encoded are 0x1..0x10FFFF.
 - The surrogate codepoints 0xD800..0xDFFFF are not encoded.
 - The shortest possible encoding is used for all values.

The supporting functions work on null-terminated strings (utf8 prefix)
and on length-limited strings (utf8n prefix).

Signed-off-by: Olaf Weber <olaf@sgi.com>
---
 include/utf8norm.h | 111 ++++++++++
 libxfs/Makefile    |   1 +
 libxfs/utf8norm.c  | 628 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 740 insertions(+)
 create mode 100644 include/utf8norm.h
 create mode 100644 libxfs/utf8norm.c

diff --git a/include/utf8norm.h b/include/utf8norm.h
new file mode 100644
index 0000000..6aa3391
--- /dev/null
+++ b/include/utf8norm.h
@@ -0,0 +1,111 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#ifndef UTF8NORM_H
+#define UTF8NORM_H
+
+/* An opaque type used to determine the normalization in use. */
+typedef const struct utf8data *utf8data_t;
+
+/* Encoding a unicode version number as a single unsigned int. */
+#define UNICODE_MAJ_SHIFT		(16)
+#define UNICODE_MIN_SHIFT		(8)
+
+#define UNICODE_AGE(MAJ,MIN,REV)			\
+	(((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) |	\
+	 ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) |	\
+	 ((unsigned int)(REV)))
+
+/* Highest unicode version supported by the data tables. */
+extern const unsigned int utf8version;
+
+/*
+ * Look for the correct utf8data_t for a unicode version.
+ * Returns NULL if the version requested is too new.
+ *
+ * Two normalization forms are supported: nfkdi and nfkdicf.
+ *
+ * nfkdi:
+ *  - Apply unicode normalization form NFKD.
+ *  - Remove any Default_Ignorable_Code_Point.
+ *
+ * nfkdicf:
+ *  - Apply unicode normalization form NFKD.
+ *  - Remove any Default_Ignorable_Code_Point.
+ *  - Apply a full casefold (C + F).
+ */
+extern utf8data_t utf8nfkdi(unsigned int);
+extern utf8data_t utf8nfkdicf(unsigned int);
+
+/*
+ * Determine the maximum age of any unicode character in the string.
+ * Returns 0 if only unassigned code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern int utf8agemax(utf8data_t, const char *);
+extern int utf8nagemax(utf8data_t, const char *, size_t);
+
+/*
+ * Determine the minimum age of any unicode character in the string.
+ * Returns 0 if any unassigned code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern int utf8agemin(utf8data_t, const char *);
+extern int utf8nagemin(utf8data_t, const char *, size_t);
+
+/*
+ * Determine the length of the normalized from of the string,
+ * excluding any terminating NULL byte.
+ * Returns 0 if only ignorable code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern ssize_t utf8len(utf8data_t, const char *);
+extern ssize_t utf8nlen(utf8data_t, const char *, size_t);
+
+/*
+ * Cursor structure used by the normalizer.
+ */
+struct utf8cursor {
+	utf8data_t	data;
+	const char	*s;
+	const char	*p;
+	const char	*ss;
+	const char	*sp;
+	unsigned int	len;
+	unsigned int	slen;
+	short int	ccc;
+	short int	nccc;
+};
+
+/*
+ * Initialize a utf8cursor to normalize a string.
+ * Returns 0 on success.
+ * Returns -1 on failure.
+ */
+extern int utf8cursor(struct utf8cursor *, utf8data_t, const char *);
+extern int utf8ncursor(struct utf8cursor *, utf8data_t, const char *, size_t);
+
+/*
+ * Get the next byte in the normalization.
+ * Returns a value > 0 && < 256 on success.
+ * Returns 0 when the end of the normalization is reached.
+ * Returns -1 if the string being normalized is not valid UTF-8.
+ */
+extern int utf8byte(struct utf8cursor *);
+
+#endif /* UTF8NORM_H */
diff --git a/libxfs/Makefile b/libxfs/Makefile
index ae15a5d..a1e85ef 100644
--- a/libxfs/Makefile
+++ b/libxfs/Makefile
@@ -14,6 +14,7 @@ HFILES = xfs.h init.h xfs_dir2_priv.h crc32defs.h crc32table.h
 CFILES = cache.c \
 	crc32.c \
 	init.c kmem.c logitem.c radix-tree.c rdwr.c trans.c util.c \
+	utf8norm.c \
 	xfs_alloc.c \
 	xfs_alloc_btree.c \
 	xfs_attr.c \
diff --git a/libxfs/utf8norm.c b/libxfs/utf8norm.c
new file mode 100644
index 0000000..6232d1a
--- /dev/null
+++ b/libxfs/utf8norm.c
@@ -0,0 +1,628 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#include "xfs.h"
+#include "xfs_types.h"
+#include <utf8norm.h>
+
+struct utf8data {
+	unsigned int maxage;
+	unsigned int offset;
+};
+
+#define __INCLUDED_FROM_UTF8NORM_C__
+#include <utf8data.h>
+#undef __INCLUDED_FROM_UTF8NORM_C__
+
+/*
+ * UTF-8 valid ranges.
+ *
+ * The UTF-8 encoding spreads the bits of a 32bit word over several
+ * bytes. This table gives the ranges that can be held and how they'd
+ * be represented.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000000 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000000 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * There is an additional requirement on UTF-8, in that only the
+ * shortest representation of a 32bit value is to be used.  A decoder
+ * must not decode sequences that do not satisfy this requirement.
+ * Thus the allowed ranges have a lower bound.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000080 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000800 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00010000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00200000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x04000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * Actual unicode characters are limited to the range 0x0 - 0x10FFFF,
+ * 17 planes of 65536 values.  This limits the sequences actually seen
+ * even more, to just the following.
+ *
+ *          0 -     0x7F: 0                   - 0x7F
+ *       0x80 -    0x7FF: 0xC2 0x80           - 0xDF 0xBF
+ *      0x800 -   0xFFFF: 0xE0 0xA0 0x80      - 0xEF 0xBF 0xBF
+ *    0x10000 - 0x10FFFF: 0xF0 0x90 0x80 0x80 - 0xF4 0x8F 0xBF 0xBF
+ *
+ * Within those ranges the surrogates 0xD800 - 0xDFFF are not allowed.
+ *
+ * Note that the longest sequence seen with valid usage is 4 bytes,
+ * the same a single UTF-32 character.  This makes the UTF-8
+ * representation of Unicode strictly smaller than UTF-32.
+ *
+ * The shortest sequence requirement was introduced by:
+ *    Corrigendum #1: UTF-8 Shortest Form
+ * It can be found here:
+ *    http://www.unicode.org/versions/corrigendum1.html
+ *
+ */
+
+/*
+ * Return the number of bytes used by the current UTF-8 sequence.
+ * Assumes the input points to the first byte of a valid UTF-8
+ * sequence.
+ */
+static inline int
+utf8clen(const char *s)
+{
+	unsigned char c = *s;
+	return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0);
+}
+
+/*
+ * utf8trie_t
+ *
+ * A compact binary tree, used to decode UTF-8 characters.
+ *
+ * Internal nodes are one byte for the node itself, and up to three
+ * bytes for an offset into the tree.  The first byte contains the
+ * following information:
+ *  NEXTBYTE  - flag        - advance to next byte if set
+ *  BITNUM    - 3 bit field - the bit number to tested
+ *  OFFLEN    - 2 bit field - number of bytes in the offset
+ * if offlen == 0 (non-branching node)
+ *  RIGHTPATH - 1 bit field - set if the following node is for the
+ *                            right-hand path (tested bit is set)
+ *  TRIENODE  - 1 bit field - set if the following node is an internal
+ *                            node, otherwise it is a leaf node
+ * if offlen != 0 (branching node)
+ *  LEFTNODE  - 1 bit field - set if the left-hand node is internal
+ *  RIGHTNODE - 1 bit field - set if the right-hand node is internal
+ *
+ * Due to the way utf8 works, there cannot be branching nodes with
+ * NEXTBYTE set, and moreover those nodes always have a righthand
+ * descendant.
+ */
+typedef const unsigned char utf8trie_t;
+#define BITNUM		0x07
+#define NEXTBYTE	0x08
+#define OFFLEN		0x30
+#define OFFLEN_SHIFT	4
+#define RIGHTPATH	0x40
+#define TRIENODE	0x80
+#define RIGHTNODE	0x40
+#define LEFTNODE	0x80
+
+/*
+ * utf8leaf_t
+ *
+ * The leaves of the trie are embedded in the trie, and so the same
+ * underlying datatype: unsigned char.
+ *
+ * leaf[0]: The unicode version, stored as a generation number that is
+ *          an index into utf8agetab[].  With this we can filter code
+ *          points based on the unicode version in which they were
+ *          defined.  The CCC of a non-defined code point is 0.
+ * leaf[1]: Canonical Combining Class. During normalization, we need
+ *          to do a stable sort into ascending order of all characters
+ *          with a non-zero CCC that occur between two characters with
+ *          a CCC of 0, or at the begin or end of a string.
+ *          The unicode standard guarantees that all CCC values are
+ *          between 0 and 254 inclusive, which leaves 255 available as
+ *          a special value.
+ *          Code points with CCC 0 are known as stoppers.
+ * leaf[2]: Decomposition. If leaf[1] == 255, then leaf[2] is the
+ *          start of a NUL-terminated string that is the decomposition
+ *          of the character.
+ *          The CCC of a decomposable character is the same as the CCC
+ *          of the first character of its decomposition.
+ *          Some characters decompose as the empty string: these are
+ *          characters with the Default_Ignorable_Code_Point property.
+ *          These do affect normalization, as they all have CCC 0.
+ *
+ * The decompositions in the trie have been fully expanded.
+ *
+ * Casefolding, if applicable, is also done using decompositions.
+ *
+ * The trie is constructed in such a way that leaves exist for all
+ * UTF-8 sequences that match the criteria from the "UTF-8 valid
+ * ranges" comment above, and only for those sequences.  Therefore a
+ * lookup in the trie can be used to validate the UTF-8 input.
+ */
+typedef const unsigned char utf8leaf_t;
+
+#define LEAF_GEN(LEAF)	((LEAF)[0])
+#define LEAF_CCC(LEAF)	((LEAF)[1])
+#define LEAF_STR(LEAF)	((const char*)((LEAF) + 2))
+
+#define MINCCC		(0)
+#define MAXCCC		(254)
+#define STOPPER		(0)
+#define	DECOMPOSE	(255)
+
+/*
+ * Use trie to scan s, touching at most len bytes.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * A non-NULL return guarantees that the UTF-8 sequence starting at s
+ * is well-formed and corresponds to a known unicode code point.  The
+ * shorthand for this will be "is valid UTF-8 unicode".
+ */
+static utf8leaf_t *
+utf8nlookup(utf8data_t data, const char *s, size_t len)
+{
+	utf8trie_t	*trie = utf8data + data->offset;
+	int		offlen;
+	int		offset;
+	int		mask;
+	int		node;
+
+	if (!data)
+		return NULL;
+	if (len == 0)
+		return NULL;
+	node = 1;
+	while (node) {
+		offlen = (*trie & OFFLEN) >> OFFLEN_SHIFT;
+		if (*trie & NEXTBYTE) {
+			if (--len == 0)
+				return NULL;
+			s++;
+		}
+		mask = 1 << (*trie & BITNUM);
+		if (*s & mask) {
+			/* Right leg */
+			if (offlen) {
+				/* Right node at offset of trie */
+				node = (*trie & RIGHTNODE);
+				offset = trie[offlen];
+				while (--offlen) {
+					offset <<= 8;
+					offset |= trie[offlen];
+				}
+				trie += offset;
+			} else if (*trie & RIGHTPATH) {
+				/* Right node after this node */
+				node = (*trie & TRIENODE);
+				trie++;
+			} else {
+				/* No right node. */
+				node = 0;
+				trie = NULL;
+			}
+		} else {
+			/* Left leg */
+			if (offlen) {
+				/* Left node after this node. */
+				node = (*trie & LEFTNODE);
+				trie += offlen + 1;
+			} else if (*trie & RIGHTPATH) {
+				/* No left node. */
+				node = 0;
+				trie = NULL;
+			} else {
+				/* Left node after this node */
+				node = (*trie & TRIENODE);
+				trie++;
+			}
+		}
+	}
+	return trie;
+}
+
+/*
+ * Use trie to scan s.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * Forwards to utf8nlookup().
+ */
+static utf8leaf_t *
+utf8lookup(utf8data_t data, const char *s)
+{
+	return utf8nlookup(data, s, (size_t)-1);
+}
+
+/*
+ * Maximum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if only non-assigned code points are used.
+ */
+int
+utf8agemax(utf8data_t data, const char *s)
+{
+	utf8leaf_t	*leaf;
+	int		age = 0;
+	int		leaf_age;
+
+	if (!data)
+		return -1;
+	while (*s) {
+		if (!(leaf = utf8lookup(data, s)))
+			return -1;
+		leaf_age = utf8agetab[LEAF_GEN(leaf)];
+		if (leaf_age <= data->maxage && leaf_age > age)
+			age = leaf_age;
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Minimum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if non-assigned code points are used.
+ */
+int
+utf8agemin(utf8data_t data, const char *s)
+{
+	utf8leaf_t	*leaf;
+	int		age = data->maxage;
+	int		leaf_age;
+
+	if (!data)
+		return -1;
+	while (*s) {
+		if (!(leaf = utf8lookup(data, s)))
+			return -1;
+		leaf_age = utf8agetab[LEAF_GEN(leaf)];
+		if (leaf_age <= data->maxage && leaf_age < age)
+			age = leaf_age;
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Maximum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemax(utf8data_t data, const char *s, size_t len)
+{
+	utf8leaf_t	*leaf;
+	int		age = 0;
+	int		leaf_age;
+
+	if (!data)
+		return -1;
+        while (len && *s) {
+		if (!(leaf = utf8nlookup(data, s, len)))
+			return -1;
+		leaf_age = utf8agetab[LEAF_GEN(leaf)];
+		if (leaf_age <= data->maxage && leaf_age > age)
+			age = leaf_age;
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Maximum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemin(utf8data_t data, const char *s, size_t len)
+{
+	utf8leaf_t	*leaf;
+	int		leaf_age;
+	int		age = data->maxage;
+
+	if (!data)
+		return -1;
+        while (len && *s) {
+		if (!(leaf = utf8nlookup(data, s, len)))
+			return -1;
+		leaf_age = utf8agetab[LEAF_GEN(leaf)];
+		if (leaf_age <= data->maxage && leaf_age < age)
+			age = leaf_age;
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Length of the normalization of s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ *
+ * A string of Default_Ignorable_Code_Point has length 0.
+ */
+ssize_t
+utf8len(utf8data_t data, const char *s)
+{
+	utf8leaf_t	*leaf;
+	size_t		ret = 0;
+
+	if (!data)
+		return -1;
+	while (*s) {
+		if (!(leaf = utf8lookup(data, s)))
+			return -1;
+		if (utf8agetab[LEAF_GEN(leaf)] > data->maxage)
+			ret += utf8clen(s);
+		else if (LEAF_CCC(leaf) == DECOMPOSE)
+			ret += strlen(LEAF_STR(leaf));
+		else
+			ret += utf8clen(s);
+		s += utf8clen(s);
+	}
+	return ret;
+}
+
+/*
+ * Length of the normalization of s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+ssize_t
+utf8nlen(utf8data_t data, const char *s, size_t len)
+{
+	utf8leaf_t	*leaf;
+	size_t		ret = 0;
+
+	if (!data)
+		return -1;
+	while (len && *s) {
+		if (!(leaf = utf8nlookup(data, s, len)))
+			return -1;
+		if (utf8agetab[LEAF_GEN(leaf)] > data->maxage)
+			ret += utf8clen(s);
+		else if (LEAF_CCC(leaf) == DECOMPOSE)
+			ret += strlen(LEAF_STR(leaf));
+		else
+			ret += utf8clen(s);
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return ret;
+}
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ *   u8c    : pointer to cursor.
+ *   data   : utf8data_t to use for normalization.
+ *   s      : string.
+ *   len    : length of s.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int
+utf8ncursor(
+	struct utf8cursor *u8c,
+	utf8data_t	data,
+	const char	*s,
+	size_t		len)
+{
+	if (!data)
+		return -1;
+	if (!s)
+		return -1;
+	u8c->data = data;
+	u8c->s = s;
+	u8c->p = NULL;
+	u8c->ss = NULL;
+	u8c->sp = NULL;
+	u8c->len = len;
+	u8c->slen = 0;
+	u8c->ccc = STOPPER;
+	u8c->nccc = STOPPER;
+	/* Check we didn't clobber the maximum length. */
+	if (u8c->len != len)
+		return -1;
+	/* The first byte of s may not be an utf8 continuation. */
+	if (len > 0 && (*s & 0xC0) == 0x80)
+		return -1;
+	return 0;
+}
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ *   u8c    : pointer to cursor.
+ *   data   : utf8data_t to use for normalization.
+ *   s      : NUL-terminated string.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int
+utf8cursor(
+	struct utf8cursor *u8c,
+	utf8data_t	data,
+	const char	*s)
+{
+	return utf8ncursor(u8c, data, s, (unsigned int)-1);
+}
+
+/*
+ * Get one byte from the normalized form of the string described by u8c.
+ *
+ * Returns the byte cast to an unsigned char on succes, and -1 on failure.
+ *
+ * The cursor keeps track of the location in the string in u8c->s.
+ * When a character is decomposed, the current location is stored in
+ * u8c->p, and u8c->s is set to the start of the decomposition. Note
+ * that bytes from a decomposition do not count against u8c->len.
+ *
+ * Characters are emitted if they match the current CCC in u8c->ccc.
+ * Hitting end-of-string while u8c->ccc == STOPPER means we're done,
+ * and the function returns 0 in that case.
+ *
+ * Sorting by CCC is done by repeatedly scanning the string.  The
+ * values of u8c->s and u8c->p are stored in u8c->ss and u8c->sp at
+ * the start of the scan.  The first pass finds the lowest CCC to be
+ * emitted and stores it in u8c->nccc, the second pass emits the
+ * characters with this CCC and finds the next lowest CCC. This limits
+ * the number of passes to 1 + the number of different CCCs in the
+ * sequence being scanned.
+ *
+ * Therefore:
+ *  u8c->p  != NULL -> a decomposition is being scanned.
+ *  u8c->ss != NULL -> this is a repeating scan.
+ *  u8c->ccc == -1   -> this is the first scan of a repeating scan.
+ */
+int
+utf8byte(struct utf8cursor *u8c)
+{
+	utf8leaf_t *leaf;
+	int ccc;
+
+	for (;;) {
+		/* Check for the end of a decomposed character. */
+		if (u8c->p && *u8c->s == '\0') {
+			u8c->s = u8c->p;
+			u8c->p = NULL;
+		}
+
+		/* Check for end-of-string. */
+		if (!u8c->p && (u8c->len == 0 || *u8c->s == '\0')) {
+			/* There is no next byte. */
+			if (u8c->ccc == STOPPER)
+				return 0;
+			/* End-of-string during a scan counts as a stopper. */
+			ccc = STOPPER;
+			goto ccc_mismatch;
+		} else if ((*u8c->s & 0xC0) == 0x80) {
+			/* This is a continuation of the current character. */
+			if (!u8c->p)
+				u8c->len--;
+			return (unsigned char)*u8c->s++;
+		}
+
+		/* Look up the data for the current character. */
+		if (u8c->p)
+			leaf = utf8lookup(u8c->data, u8c->s);
+		else
+			leaf = utf8nlookup(u8c->data, u8c->s, u8c->len);
+
+		/* No leaf found implies that the input is a binary blob. */
+		if (!leaf)
+			return -1;
+
+		/* Characters that are too new have CCC 0. */
+		if (utf8agetab[LEAF_GEN(leaf)] > u8c->data->maxage) {
+			ccc = STOPPER;
+		} else if ((ccc = LEAF_CCC(leaf)) == DECOMPOSE) {
+			u8c->len -= utf8clen(u8c->s);
+			u8c->p = u8c->s + utf8clen(u8c->s);
+			u8c->s = LEAF_STR(leaf);
+			/* Empty decomposition implies CCC 0. */
+			if (*u8c->s == '\0') {
+				if (u8c->ccc == STOPPER)
+					continue;
+				ccc = STOPPER;
+				goto ccc_mismatch;
+			}
+			leaf = utf8lookup(u8c->data, u8c->s);
+			ccc = LEAF_CCC(leaf);
+		}
+
+		/*
+		 * If this is not a stopper, then see if it updates
+		 * the next canonical class to be emitted.
+		 */
+		if (ccc != STOPPER && u8c->ccc < ccc && ccc < u8c->nccc)
+			u8c->nccc = ccc;
+
+		/*
+		 * Return the current byte if this is the current
+		 * combining class.
+		 */
+		if (ccc == u8c->ccc) {
+			if (!u8c->p)
+				u8c->len--;
+			return (unsigned char)*u8c->s++;
+		}
+
+		/* Current combining class mismatch. */
+	ccc_mismatch:
+		if (u8c->nccc == STOPPER) {
+			/*
+			 * Scan forward for the first canonical class
+			 * to be emitted.  Save the position from
+			 * which to restart.
+			 */
+			u8c->ccc = MINCCC - 1;
+			u8c->nccc = ccc;
+			u8c->sp = u8c->p;
+			u8c->ss = u8c->s;
+			u8c->slen = u8c->len;
+			if (!u8c->p)
+				u8c->len -= utf8clen(u8c->s);
+			u8c->s += utf8clen(u8c->s);
+		} else if (ccc != STOPPER) {
+			/* Not a stopper, and not the ccc we're emitting. */
+			if (!u8c->p)
+				u8c->len -= utf8clen(u8c->s);
+			u8c->s += utf8clen(u8c->s);
+		} else if (u8c->nccc != MAXCCC + 1) {
+			/* At a stopper, restart for next ccc. */
+			u8c->ccc = u8c->nccc;
+			u8c->nccc = MAXCCC + 1;
+			u8c->s = u8c->ss;
+			u8c->p = u8c->sp;
+			u8c->len = u8c->slen;
+		} else {
+			/* All done, proceed from here. */
+			u8c->ccc = STOPPER;
+			u8c->nccc = STOPPER;
+			u8c->sp = NULL;
+			u8c->ss = NULL;
+			u8c->slen = 0;
+		}
+	}
+}
+
+const struct utf8data *
+utf8nfkdi(unsigned int maxage)
+{
+	int i = sizeof(utf8nfkdidata)/sizeof(utf8nfkdidata[0]) - 1;
+
+	while (maxage < utf8nfkdidata[i].maxage)
+		i--;
+	if (maxage > utf8nfkdidata[i].maxage)
+		return NULL;
+	return &utf8nfkdidata[i];
+}
+
+const struct utf8data *
+utf8nfkdicf(unsigned int maxage)
+{
+	int i = sizeof(utf8nfkdicfdata)/sizeof(utf8nfkdicfdata[0]) - 1;
+
+	while (maxage < utf8nfkdicfdata[i].maxage)
+		i--;
+	if (maxage > utf8nfkdicfdata[i].maxage)
+		return NULL;
+	return &utf8nfkdicfdata[i];
+}
-- 
1.7.12.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [RFC v2] Unicode/UTF-8 support for XFS
  2014-09-18 19:56 [RFC v2] Unicode/UTF-8 support for XFS Ben Myers
@ 2014-09-22 14:55   ` Andi Kleen
  2014-09-18 20:09 ` [PATCH 02/10] xfs: rename XFS_CMP_CASE to XFS_CMP_MATCH Ben Myers
                     ` (15 subsequent siblings)
  16 siblings, 0 replies; 84+ messages in thread
From: Andi Kleen @ 2014-09-22 14:55 UTC (permalink / raw)
  To: Ben Myers; +Cc: linux-fsdevel, tinguely, olaf, xfs

Ben Myers <bpm@sgi.com> writes:
>
> Strings are normalized using a trie that stores the relevant
> information.  The trie itself is about 250kB in size, and lives in a
> separate module.

So 250kB bloat -- and what does this fix exactly?

Someone putting random ligatures into their file names and expecting
the file to be the same as before. Can't they just not do that?

-Andi

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC v2] Unicode/UTF-8 support for XFS
@ 2014-09-22 14:55   ` Andi Kleen
  0 siblings, 0 replies; 84+ messages in thread
From: Andi Kleen @ 2014-09-22 14:55 UTC (permalink / raw)
  To: Ben Myers; +Cc: linux-fsdevel, tinguely, olaf, xfs

Ben Myers <bpm@sgi.com> writes:
>
> Strings are normalized using a trie that stores the relevant
> information.  The trie itself is about 250kB in size, and lives in a
> separate module.

So 250kB bloat -- and what does this fix exactly?

Someone putting random ligatures into their file names and expecting
the file to be the same as before. Can't they just not do that?

-Andi

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC v2] Unicode/UTF-8 support for XFS
  2014-09-22 14:55   ` Andi Kleen
  (?)
@ 2014-09-22 18:41   ` Ben Myers
  2014-09-22 19:29       ` Andi Kleen
  -1 siblings, 1 reply; 84+ messages in thread
From: Ben Myers @ 2014-09-22 18:41 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-fsdevel, tinguely, olaf, xfs

Hey Andi,

On Mon, Sep 22, 2014 at 07:55:59AM -0700, Andi Kleen wrote:
> Ben Myers <bpm@sgi.com> writes:
> >
> > Strings are normalized using a trie that stores the relevant
> > information.  The trie itself is about 250kB in size, and lives in a
> > separate module.
> 
> So 250kB bloat -- and what does this fix exactly?

We're trying to address the size issue by only loading the module when
it's needed, but yeah it's big.  Open to suggestions on how best to deal
with that.  I understand the sticker shock.
 
> Someone putting random ligatures into their file names and expecting
> the file to be the same as before. Can't they just not do that?

The ligature example that Olaf gave might seem kind of trivial, but for
other characters and languages could it be more significant?

As far as telling the customer "don't do that", my guess is that they
would just go elsewhere.  There are several other options for
filesystems that support unicode.

Thanks,
	Ben

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC v2] Unicode/UTF-8 support for XFS
  2014-09-22 18:41   ` Ben Myers
@ 2014-09-22 19:29       ` Andi Kleen
  0 siblings, 0 replies; 84+ messages in thread
From: Andi Kleen @ 2014-09-22 19:29 UTC (permalink / raw)
  To: Ben Myers; +Cc: Andi Kleen, linux-fsdevel, tinguely, olaf, xfs

> > So 250kB bloat -- and what does this fix exactly?
> 
> We're trying to address the size issue by only loading the module when

I'm not sure this is really addressing it.

> it's needed, but yeah it's big.  Open to suggestions on how best to deal
> with that.  I understand the sticker shock.

I don't even understand why you need the whole table.

You want to not compare some special symbols, and a few other symbols
are equivalent to others.  But most symbols are only identical to themselves.

Couldn't you have a much smaller table that only expresses
the exceptions?

> As far as telling the customer "don't do that", my guess is that they
> would just go elsewhere.  There are several other options for
> filesystems that support unicode.

They could put some code into their user app that generates
an unique representation.

-Andi

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC v2] Unicode/UTF-8 support for XFS
@ 2014-09-22 19:29       ` Andi Kleen
  0 siblings, 0 replies; 84+ messages in thread
From: Andi Kleen @ 2014-09-22 19:29 UTC (permalink / raw)
  To: Ben Myers; +Cc: linux-fsdevel, Andi Kleen, tinguely, olaf, xfs

> > So 250kB bloat -- and what does this fix exactly?
> 
> We're trying to address the size issue by only loading the module when

I'm not sure this is really addressing it.

> it's needed, but yeah it's big.  Open to suggestions on how best to deal
> with that.  I understand the sticker shock.

I don't even understand why you need the whole table.

You want to not compare some special symbols, and a few other symbols
are equivalent to others.  But most symbols are only identical to themselves.

Couldn't you have a much smaller table that only expresses
the exceptions?

> As far as telling the customer "don't do that", my guess is that they
> would just go elsewhere.  There are several other options for
> filesystems that support unicode.

They could put some code into their user app that generates
an unique representation.

-Andi

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 06/10] xfs: add unicode character database files
  2014-09-18 20:14   ` Ben Myers
@ 2014-09-22 20:54     ` Dave Chinner
  -1 siblings, 0 replies; 84+ messages in thread
From: Dave Chinner @ 2014-09-22 20:54 UTC (permalink / raw)
  To: Ben Myers; +Cc: linux-fsdevel, tinguely, olaf, xfs

On Thu, Sep 18, 2014 at 03:14:40PM -0500, Ben Myers wrote:
> From: Olaf Weber <olaf@sgi.com>
> 
> Add files from the Unicode Character Database, version 7.0.0, to the source.
> A helper program that generates a trie used for normalization from these
> files is part of a separate commit.
> 
> Signed-off-by: Olaf Weber <olaf@sgi.com>
> ---
> [v2: Removed large unicode files prior to posting.  Get them as below. -bpm]
> [v3: Moved files to ucd8norm directory. -bpm]
> 
> cd fs/xfs/utf8norm/ucd
> wget http://www.unicode.org/Public/7.0.0/ucd/CaseFolding.txt
> wget http://www.unicode.org/Public/7.0.0/ucd/DerivedAge.txt
> wget http://www.unicode.org/Public/7.0.0/ucd/extracted/DerivedCombiningClass.txt
> wget http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt
> wget http://www.unicode.org/Public/7.0.0/ucd/NormalizationCorrections.txt
> wget http://www.unicode.org/Public/7.0.0/ucd/NormalizationTest.txt
> wget http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt
> for e in *.txt
> do
> 	base=`basename $e .txt`
> 	mv $e $base-7.0.0.txt
> done
> ---
>  fs/xfs/utf8norm/ucd/README | 33 +++++++++++++++++++++++++++++++++

This probably needs to live somewhere under lib/.  There's nothing
XFS specific in it and the translations should be the same for
anything that wants to parse unicode.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 06/10] xfs: add unicode character database files
@ 2014-09-22 20:54     ` Dave Chinner
  0 siblings, 0 replies; 84+ messages in thread
From: Dave Chinner @ 2014-09-22 20:54 UTC (permalink / raw)
  To: Ben Myers; +Cc: linux-fsdevel, tinguely, olaf, xfs

On Thu, Sep 18, 2014 at 03:14:40PM -0500, Ben Myers wrote:
> From: Olaf Weber <olaf@sgi.com>
> 
> Add files from the Unicode Character Database, version 7.0.0, to the source.
> A helper program that generates a trie used for normalization from these
> files is part of a separate commit.
> 
> Signed-off-by: Olaf Weber <olaf@sgi.com>
> ---
> [v2: Removed large unicode files prior to posting.  Get them as below. -bpm]
> [v3: Moved files to ucd8norm directory. -bpm]
> 
> cd fs/xfs/utf8norm/ucd
> wget http://www.unicode.org/Public/7.0.0/ucd/CaseFolding.txt
> wget http://www.unicode.org/Public/7.0.0/ucd/DerivedAge.txt
> wget http://www.unicode.org/Public/7.0.0/ucd/extracted/DerivedCombiningClass.txt
> wget http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt
> wget http://www.unicode.org/Public/7.0.0/ucd/NormalizationCorrections.txt
> wget http://www.unicode.org/Public/7.0.0/ucd/NormalizationTest.txt
> wget http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt
> for e in *.txt
> do
> 	base=`basename $e .txt`
> 	mv $e $base-7.0.0.txt
> done
> ---
>  fs/xfs/utf8norm/ucd/README | 33 +++++++++++++++++++++++++++++++++

This probably needs to live somewhere under lib/.  There's nothing
XFS specific in it and the translations should be the same for
anything that wants to parse unicode.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 07/10] xfs: add trie generator and supporting code for UTF-8.
  2014-09-18 20:15 ` [PATCH 07/10] xfs: add trie generator and supporting code for UTF-8 Ben Myers
@ 2014-09-22 20:57     ` Dave Chinner
  0 siblings, 0 replies; 84+ messages in thread
From: Dave Chinner @ 2014-09-22 20:57 UTC (permalink / raw)
  To: Ben Myers; +Cc: linux-fsdevel, tinguely, olaf, xfs

On Thu, Sep 18, 2014 at 03:15:19PM -0500, Ben Myers wrote:
> From: Olaf Weber <olaf@sgi.com>
> 
> mkutf8data.c is the source for a program that generates utf8data.h, which
> contains the trie that utf8norm.c uses. The trie is generated from the
> Unicode 7.0.0 data files. The format of the utf8data[] table is described
> in utf8norm.c.
> 
> Supporting functions for UTF-8 normalization are in utf8norm.c with the
> header utf8norm.h. Two normalization forms are supported: nfkdi and nfkdicf.
> 
>   nfkdi:
>    - Apply unicode normalization form NFKD.
>    - Remove any Default_Ignorable_Code_Point.
> 
>   nfkdicf:
>    - Apply unicode normalization form NFKD.
>    - Remove any Default_Ignorable_Code_Point.
>    - Apply a full casefold (C + F).
> 
> For the purposes of the code, a string is valid UTF-8 if:
> 
>  - The values encoded are 0x1..0x10FFFF.
>  - The surrogate codepoints 0xD800..0xDFFFF are not encoded.
>  - The shortest possible encoding is used for all values.
> 
> The supporting functions work on null-terminated strings (utf8 prefix) and
> on length-limited strings (utf8n prefix).
> 
> Signed-off-by: Olaf Weber <olaf@sgi.com>
> 
> ---
> [v2: the trie is now separated into utf8norm.ko;
>      utf8version is now a function and exported;
>      introduced CONFIG_XFS_UTF8. -bpm]
> ---
>  fs/xfs/Kconfig               |    8 +
>  fs/xfs/Makefile              |    2 +-
>  fs/xfs/utf8norm/Makefile     |   37 +
>  fs/xfs/utf8norm/mkutf8data.c | 3239 ++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/utf8norm/utf8norm.c   |  649 +++++++++
>  fs/xfs/utf8norm/utf8norm.h   |  116 ++

Again, nothing XFS specific here. It's being built as a separate
module and the only thing that XFS uses are exported functions, so
it really should be generic library code....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 07/10] xfs: add trie generator and supporting code for UTF-8.
@ 2014-09-22 20:57     ` Dave Chinner
  0 siblings, 0 replies; 84+ messages in thread
From: Dave Chinner @ 2014-09-22 20:57 UTC (permalink / raw)
  To: Ben Myers; +Cc: linux-fsdevel, tinguely, olaf, xfs

On Thu, Sep 18, 2014 at 03:15:19PM -0500, Ben Myers wrote:
> From: Olaf Weber <olaf@sgi.com>
> 
> mkutf8data.c is the source for a program that generates utf8data.h, which
> contains the trie that utf8norm.c uses. The trie is generated from the
> Unicode 7.0.0 data files. The format of the utf8data[] table is described
> in utf8norm.c.
> 
> Supporting functions for UTF-8 normalization are in utf8norm.c with the
> header utf8norm.h. Two normalization forms are supported: nfkdi and nfkdicf.
> 
>   nfkdi:
>    - Apply unicode normalization form NFKD.
>    - Remove any Default_Ignorable_Code_Point.
> 
>   nfkdicf:
>    - Apply unicode normalization form NFKD.
>    - Remove any Default_Ignorable_Code_Point.
>    - Apply a full casefold (C + F).
> 
> For the purposes of the code, a string is valid UTF-8 if:
> 
>  - The values encoded are 0x1..0x10FFFF.
>  - The surrogate codepoints 0xD800..0xDFFFF are not encoded.
>  - The shortest possible encoding is used for all values.
> 
> The supporting functions work on null-terminated strings (utf8 prefix) and
> on length-limited strings (utf8n prefix).
> 
> Signed-off-by: Olaf Weber <olaf@sgi.com>
> 
> ---
> [v2: the trie is now separated into utf8norm.ko;
>      utf8version is now a function and exported;
>      introduced CONFIG_XFS_UTF8. -bpm]
> ---
>  fs/xfs/Kconfig               |    8 +
>  fs/xfs/Makefile              |    2 +-
>  fs/xfs/utf8norm/Makefile     |   37 +
>  fs/xfs/utf8norm/mkutf8data.c | 3239 ++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/utf8norm/utf8norm.c   |  649 +++++++++
>  fs/xfs/utf8norm/utf8norm.h   |  116 ++

Again, nothing XFS specific here. It's being built as a separate
module and the only thing that XFS uses are exported functions, so
it really should be generic library code....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC v2] Unicode/UTF-8 support for XFS
  2014-09-18 19:56 [RFC v2] Unicode/UTF-8 support for XFS Ben Myers
@ 2014-09-22 22:26   ` Dave Chinner
  2014-09-18 20:09 ` [PATCH 02/10] xfs: rename XFS_CMP_CASE to XFS_CMP_MATCH Ben Myers
                     ` (15 subsequent siblings)
  16 siblings, 0 replies; 84+ messages in thread
From: Dave Chinner @ 2014-09-22 22:26 UTC (permalink / raw)
  To: Ben Myers; +Cc: linux-fsdevel, tinguely, olaf, xfs

On Thu, Sep 18, 2014 at 02:56:50PM -0500, Ben Myers wrote:
> Hi,
> 
> I'm posting this RFC for Unicode support in XFS on Olaf's behalf, as he
> is busy with other projects.  This is the second revision of the series.
> The first is available here:
> 
> http://oss.sgi.com/archives/xfs/2014-09/msg00169.html
> 
> In response to the initial feedback, the changes in version 2 include:
> 
> * linux-fsdevel in the To: line,
> * Updated design notes,
> * Separation of the fs-independent trie and support code into utf8norm.ko,
> * A mechanism for loading the normalization module only when necessary.
> 
> I'll post the whole series for completeness sake.  Many on -fsdevel will
> not be interested in the xfs-specific bits, but it may be helpful to
> have the full series as an example and for testing purposes.
> 
> First there is a set of kernel bits, then some libxfs/xfsprogs stuff,
> and finally a test.  (Note: I am not posting the unicode database files
> due to their large size.  There are scripts to download them from
> unicode.org in the relevant commit headers.)
> 
> TODO: Store the unicode version number of the filesystem on disk in the
> super block.

So, if the filesystem has to store the specific unicode version it
was created with so that we know what version to put in trie
lookups, again I'll ask: why are we loading the trie as a generic
kernel module and not as metadata in the filesystem that is demand
paged and cached?

i.e. put the entire trie on disk, look up the specific conversion
required for the name being compared, and then cache that conversion
in memory. This makes repeated lookups much faster because the trie
only contains conversions that are in use, the memory footprint is
way lower and the conversions are guaranteed to be consistent for
the life of the filesystem....

> Here are Olaf's design notes:
> 
> -----------------------------------------------------------------------------
> Unicode/UTF-8 support for XFS
> 
> So we had a customer request proper unicode support...
> 
> 
> * What does "supporting unicode" actually mean?
> 
> From a text processing point of view, what a filesystem does with
> filenames is simple: it stores and retrieves them, and compares them
> for equality. It may reject certain byte sequences as invalid
> filenames (for example, no filename can contain an ASCII NUL).
> 
> I've been taking it as a given that when a file is created with a
> certain byte sequence as its name, then a subsequent directory listing
> will contain that same byte sequence among the names listed.
> 
> This leaves comparing names for equality, and in my view this is what
> "supporting unicode" revolves about.
> 
> The present state of affairs is that different byte sequences are
> different filenames. This amounts to tolerating unicode without
> actually supporting it.

That's somewhat circular - using your own definition of "supported"
to argue that your own definition is the right one....

> To support unicode we have to interpret filenames. What happens when
> (part of) a filename cannot be interpreted? We can reject the
> filename, interpret the parts we can, or punt and accept it as an
> uninterpreted blob.
> 
> Rejecting ill-formed filenames was my first choice, but I came around
> on the issue: there are too many ways in which you can end up with
> having to deal with ill-formed filenames that would leave a user with
> no recourse but to move whatever they're doing to a different
> filesystem. Unpacking a tarball with filenames in a different encoding
> is an example.

You still haven't addressed this:

| So we accept invalid unicode in filenames, but only after failing to
| parse them? Isn't this a potential vector for exploiting weaknesses
| in application filename handling? i.e.  unprivileged user writes
| specially crafted invalid unicode filename to disk, setuid program
| tries to parse it, invalid sequence triggers a buffer overflow bug
| in setuid parser?

apart from handwaving that userspace has to be able to handle
invalid utf-8 already. Why should we let filesystems say "we fully
understand and support utf8" and then allow them to accept and
propagate invalid utf8 sequences and leave everyone else to have to
clean up the mess?

> Partial interpretation of an ill-formed filename just strikes me as
> the kind of bad idea that most half-houses are. I admit that I have no
> stronger objection to this than the fact that it makes the code even
> more complicated and fragile.
> 
> Which leaves "blob" as the preferred option by default for coping with
> ill-formed filenames.

And so can't be case-folded, leading to inconsistent behaviour of
case-insensitive filename comparisons.

I don't blindly subscribe to the robustness principle of "be liberal
with what you accept".  Being liberal means accepting malformed junk
and then trying to make good. It's a fool's game - we've learnt time
and time again that if we don't fully validate string inputs that we
have to interpret then someone will find an exploit that utilises
malformed strings. I don't think we should expose core kernel code
to such structural weaknesses...

> When comparing well-formed filenames, the question now becomes which
> byte sequences are considered to be alternative spellings of the same
> filename. This is where normalization forms come into play, and the
> unicode standard has quite a bit to say about the subject.
> 
> If all you're doing is comparison, then choosing NFD over NFC is easy,
> because the former is easier to calculate than the latter.
> 
> If you want various spellings of "office" to compare equal, then
> picking NFKD over NFD for comparison is also an obvious
> choice. (Hand-picking individual compatibility forms is truly a bad
> idea.) Ways to spell "office": "o_f_f_i_c_e", "o_f_fi_c_e", and
> "o_ffi_c_e", using no ligatures, the fi ligature, or the ffi
> ligature. (Some fool thought it a good idea to add these ligatures to
> unicode, all we get to decide is how to cope.)

Yet normalised strings are only stable and hence comparable
if there are no unassigned code points in them.  What happens when
userspace is not using the same version of unicode as the
filesystem and is using newer code points in it's strings?
Normalisation fails, right?

And as an extension of using normalisation for case-folded
comparisons, how do we make case folding work with blobs that can't
be normalised? It seems to me that this just leads to the nasty
situation where some filenames are case sensitive and some aren't
based on what the filesystem thinks is valid utf-8. The worst part
is that userspace has no idea that the filesystem is making such
distinctions and so behaviour is not at all predictable or expected.

This is another point in favour of rejecting invalid utf-8 strings
and for keeping the translation tables stable within the
filesystem...

> The most contentious part is (should be) ignoring the codepoints with
> the Default_Ignorable_Code_Point property. I've included the list
> below. My argument, such as it is, is that these code points either
> have no visible rendering, or in cases like the soft hyphen, are only
> conditionally visible. The problem with these (as I see it) is that on
> seeing a filename that might contain them you cannot tell whether they
> are present. So I propose to ignore them for the purpose of comparing
> filenames for equality.

Which introduces a non-standard "visibility criterial" for
determining what should be or shouldn't be part of the normalised
string for comparison. I don't see any real justification for
stepping outside the standard unicode normalisation here - just
because the user cannot see a character in a specific context does
not mean that it is not significant to the application that created
it.

> Finally, case folding. First of all, it is optional. Then the issue is
> that you either go the language-specific route, or simplify the task
> by "just" doing a full casefold (C+F, in unicode parlance). Looking
> around the net I tend to find that if you're going to do casefolding
> at all, then a language-independent full casefold is preferred because
> it is the most predictable option. See
> http://www.w3.org/TR/charmod-norm/ for an example of that kind of
> reasoning.

Which says in section 2.4: "Some languages need case-folding to be
tailored to meet specific linguistic needs". That implies that the
case folding needs to be language aware and hence needs to be tied
into the NLS subsystem for handling specific quirks like Turkic.

I also note that it says in several places that C+F can result in a
folded string of a different length. What happens when that folded
string is longer than 255 bytes and hence longer than NAME_MAX?
That's a bit of a nasty landmine for pathname string handling
functions - developers are going to assume that pathname components
are not longer than NAME_MAX, and if we are passing normalised
strings around that is not a valid assumption....

> * XFS-specific design notes.
...
> If the borgbit (the bit enabling legacy ASCII-based CI in XFS) is set
> in the superblock, then case folding is added into the mix. This is
> the nfkdicf normalization form mentioned above. It allows for the
> creation of case-insensitive filesystems with UTF-8 support.

Please don't overload existing superblock feature bits with multiple
meanings. ASCII-CI is a stand-alone feature and is not in any way
compatible with Unicode: Unicode-CI is a superset of Unicode
support. So it really needs two new feature bits for Unicode and
Unicode-CI, not just one for unicode.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC v2] Unicode/UTF-8 support for XFS
@ 2014-09-22 22:26   ` Dave Chinner
  0 siblings, 0 replies; 84+ messages in thread
From: Dave Chinner @ 2014-09-22 22:26 UTC (permalink / raw)
  To: Ben Myers; +Cc: linux-fsdevel, tinguely, olaf, xfs

On Thu, Sep 18, 2014 at 02:56:50PM -0500, Ben Myers wrote:
> Hi,
> 
> I'm posting this RFC for Unicode support in XFS on Olaf's behalf, as he
> is busy with other projects.  This is the second revision of the series.
> The first is available here:
> 
> http://oss.sgi.com/archives/xfs/2014-09/msg00169.html
> 
> In response to the initial feedback, the changes in version 2 include:
> 
> * linux-fsdevel in the To: line,
> * Updated design notes,
> * Separation of the fs-independent trie and support code into utf8norm.ko,
> * A mechanism for loading the normalization module only when necessary.
> 
> I'll post the whole series for completeness sake.  Many on -fsdevel will
> not be interested in the xfs-specific bits, but it may be helpful to
> have the full series as an example and for testing purposes.
> 
> First there is a set of kernel bits, then some libxfs/xfsprogs stuff,
> and finally a test.  (Note: I am not posting the unicode database files
> due to their large size.  There are scripts to download them from
> unicode.org in the relevant commit headers.)
> 
> TODO: Store the unicode version number of the filesystem on disk in the
> super block.

So, if the filesystem has to store the specific unicode version it
was created with so that we know what version to put in trie
lookups, again I'll ask: why are we loading the trie as a generic
kernel module and not as metadata in the filesystem that is demand
paged and cached?

i.e. put the entire trie on disk, look up the specific conversion
required for the name being compared, and then cache that conversion
in memory. This makes repeated lookups much faster because the trie
only contains conversions that are in use, the memory footprint is
way lower and the conversions are guaranteed to be consistent for
the life of the filesystem....

> Here are Olaf's design notes:
> 
> -----------------------------------------------------------------------------
> Unicode/UTF-8 support for XFS
> 
> So we had a customer request proper unicode support...
> 
> 
> * What does "supporting unicode" actually mean?
> 
> From a text processing point of view, what a filesystem does with
> filenames is simple: it stores and retrieves them, and compares them
> for equality. It may reject certain byte sequences as invalid
> filenames (for example, no filename can contain an ASCII NUL).
> 
> I've been taking it as a given that when a file is created with a
> certain byte sequence as its name, then a subsequent directory listing
> will contain that same byte sequence among the names listed.
> 
> This leaves comparing names for equality, and in my view this is what
> "supporting unicode" revolves about.
> 
> The present state of affairs is that different byte sequences are
> different filenames. This amounts to tolerating unicode without
> actually supporting it.

That's somewhat circular - using your own definition of "supported"
to argue that your own definition is the right one....

> To support unicode we have to interpret filenames. What happens when
> (part of) a filename cannot be interpreted? We can reject the
> filename, interpret the parts we can, or punt and accept it as an
> uninterpreted blob.
> 
> Rejecting ill-formed filenames was my first choice, but I came around
> on the issue: there are too many ways in which you can end up with
> having to deal with ill-formed filenames that would leave a user with
> no recourse but to move whatever they're doing to a different
> filesystem. Unpacking a tarball with filenames in a different encoding
> is an example.

You still haven't addressed this:

| So we accept invalid unicode in filenames, but only after failing to
| parse them? Isn't this a potential vector for exploiting weaknesses
| in application filename handling? i.e.  unprivileged user writes
| specially crafted invalid unicode filename to disk, setuid program
| tries to parse it, invalid sequence triggers a buffer overflow bug
| in setuid parser?

apart from handwaving that userspace has to be able to handle
invalid utf-8 already. Why should we let filesystems say "we fully
understand and support utf8" and then allow them to accept and
propagate invalid utf8 sequences and leave everyone else to have to
clean up the mess?

> Partial interpretation of an ill-formed filename just strikes me as
> the kind of bad idea that most half-houses are. I admit that I have no
> stronger objection to this than the fact that it makes the code even
> more complicated and fragile.
> 
> Which leaves "blob" as the preferred option by default for coping with
> ill-formed filenames.

And so can't be case-folded, leading to inconsistent behaviour of
case-insensitive filename comparisons.

I don't blindly subscribe to the robustness principle of "be liberal
with what you accept".  Being liberal means accepting malformed junk
and then trying to make good. It's a fool's game - we've learnt time
and time again that if we don't fully validate string inputs that we
have to interpret then someone will find an exploit that utilises
malformed strings. I don't think we should expose core kernel code
to such structural weaknesses...

> When comparing well-formed filenames, the question now becomes which
> byte sequences are considered to be alternative spellings of the same
> filename. This is where normalization forms come into play, and the
> unicode standard has quite a bit to say about the subject.
> 
> If all you're doing is comparison, then choosing NFD over NFC is easy,
> because the former is easier to calculate than the latter.
> 
> If you want various spellings of "office" to compare equal, then
> picking NFKD over NFD for comparison is also an obvious
> choice. (Hand-picking individual compatibility forms is truly a bad
> idea.) Ways to spell "office": "o_f_f_i_c_e", "o_f_fi_c_e", and
> "o_ffi_c_e", using no ligatures, the fi ligature, or the ffi
> ligature. (Some fool thought it a good idea to add these ligatures to
> unicode, all we get to decide is how to cope.)

Yet normalised strings are only stable and hence comparable
if there are no unassigned code points in them.  What happens when
userspace is not using the same version of unicode as the
filesystem and is using newer code points in it's strings?
Normalisation fails, right?

And as an extension of using normalisation for case-folded
comparisons, how do we make case folding work with blobs that can't
be normalised? It seems to me that this just leads to the nasty
situation where some filenames are case sensitive and some aren't
based on what the filesystem thinks is valid utf-8. The worst part
is that userspace has no idea that the filesystem is making such
distinctions and so behaviour is not at all predictable or expected.

This is another point in favour of rejecting invalid utf-8 strings
and for keeping the translation tables stable within the
filesystem...

> The most contentious part is (should be) ignoring the codepoints with
> the Default_Ignorable_Code_Point property. I've included the list
> below. My argument, such as it is, is that these code points either
> have no visible rendering, or in cases like the soft hyphen, are only
> conditionally visible. The problem with these (as I see it) is that on
> seeing a filename that might contain them you cannot tell whether they
> are present. So I propose to ignore them for the purpose of comparing
> filenames for equality.

Which introduces a non-standard "visibility criterial" for
determining what should be or shouldn't be part of the normalised
string for comparison. I don't see any real justification for
stepping outside the standard unicode normalisation here - just
because the user cannot see a character in a specific context does
not mean that it is not significant to the application that created
it.

> Finally, case folding. First of all, it is optional. Then the issue is
> that you either go the language-specific route, or simplify the task
> by "just" doing a full casefold (C+F, in unicode parlance). Looking
> around the net I tend to find that if you're going to do casefolding
> at all, then a language-independent full casefold is preferred because
> it is the most predictable option. See
> http://www.w3.org/TR/charmod-norm/ for an example of that kind of
> reasoning.

Which says in section 2.4: "Some languages need case-folding to be
tailored to meet specific linguistic needs". That implies that the
case folding needs to be language aware and hence needs to be tied
into the NLS subsystem for handling specific quirks like Turkic.

I also note that it says in several places that C+F can result in a
folded string of a different length. What happens when that folded
string is longer than 255 bytes and hence longer than NAME_MAX?
That's a bit of a nasty landmine for pathname string handling
functions - developers are going to assume that pathname components
are not longer than NAME_MAX, and if we are passing normalised
strings around that is not a valid assumption....

> * XFS-specific design notes.
...
> If the borgbit (the bit enabling legacy ASCII-based CI in XFS) is set
> in the superblock, then case folding is added into the mix. This is
> the nfkdicf normalization form mentioned above. It allows for the
> creation of case-insensitive filesystems with UTF-8 support.

Please don't overload existing superblock feature bits with multiple
meanings. ASCII-CI is a stand-alone feature and is not in any way
compatible with Unicode: Unicode-CI is a superset of Unicode
support. So it really needs two new feature bits for Unicode and
Unicode-CI, not just one for unicode.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC v2] Unicode/UTF-8 support for XFS
  2014-09-22 14:55   ` Andi Kleen
  (?)
  (?)
@ 2014-09-23 13:01   ` Olaf Weber
  2014-09-23 20:02     ` Andi Kleen
  -1 siblings, 1 reply; 84+ messages in thread
From: Olaf Weber @ 2014-09-23 13:01 UTC (permalink / raw)
  To: Andi Kleen, Ben Myers; +Cc: linux-fsdevel, tinguely, xfs

On 22-09-14 16:55, Andi Kleen wrote:
> Ben Myers <bpm@sgi.com> writes:
>>
>> Strings are normalized using a trie that stores the relevant
>> information.  The trie itself is about 250kB in size, and lives in a
>> separate module.
>
> So 250kB bloat -- and what does this fix exactly?
>
> Someone putting random ligatures into their file names and expecting
> the file to be the same as before. Can't they just not do that?

I like the 'office' example because it is applicable to English and easy to 
explain. Once you move away from English examples are much easier to come 
by. Take a Dutch name like 'Renée Soutendijk'.

These two forms both spell Renée in UTF-8:
   0x52 0x65 0x6E 0xC3 0xA9 0x65
   0x52 0x65 0x6E 0x65 0xCC 0x81 0x65
The difference is
  LATIN SMALL LETTER E WITH ACUTE (U+00E9)
  LATIN SMALL LETTER E (U+0065) COMBINING ACUTE ACCENT (U+0301)
and corresponds to the difference between NFC and NFD.

These two forms both spell Soutendijk in UTF-8:
   0x53 0x6F 0x75 0x74 0x65 0x6E 0x64 0x69 0x6A 0x6B
   0x53 0x6F 0x75 0x74 0x65 0x6E 0x64 0xC4 0xB3 0x6B
The difference is
   LATIN SMALL LETTER I (U+0069) LATIN SMALL LETTER J (U+006A)
   LATIN SMALL LIGATURE IJ (U+0133)
and the former is the compatibility decomposition of the latter, the 'K' in 
NFKC/NFKD.

Do accented letters count as random ligatures that people should just not use?

The bulk of the table deals with Korean.

Olaf

-- 
Olaf Weber                 SGI               Phone:  +31(0)30-6696796
                            Veldzigt 2b       Fax:    +31(0)30-6696799
Technical Lead             3454 PW de Meern  Vnet:   955-6796
Storage Software           The Netherlands   Email:  olaf@sgi.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC v2] Unicode/UTF-8 support for XFS
  2014-09-22 19:29       ` Andi Kleen
  (?)
@ 2014-09-23 16:13       ` Olaf Weber
  2014-09-23 20:15         ` Andi Kleen
  -1 siblings, 1 reply; 84+ messages in thread
From: Olaf Weber @ 2014-09-23 16:13 UTC (permalink / raw)
  To: Andi Kleen, Ben Myers; +Cc: linux-fsdevel, tinguely, xfs

On 22-09-14 21:29, Andi Kleen wrote:
>>> So 250kB bloat -- and what does this fix exactly?
>>
>> We're trying to address the size issue by only loading the module when
>
> I'm not sure this is really addressing it.

You only pay the space cost if you use it, similar to the nls tables.

>> it's needed, but yeah it's big.  Open to suggestions on how best to deal
>> with that.  I understand the sticker shock.
>
> I don't even understand why you need the whole table.
>
> You want to not compare some special symbols, and a few other symbols
> are equivalent to others.  But most symbols are only identical to themselves.
>
> Couldn't you have a much smaller table that only expresses
> the exceptions?

The trie tells you whether a given sequence of bytes is a UTF-8 encoded 
unicode codepoint, and if so, it gives the unicode version in which the 
codepoint was assigned an interpretation (if any), the canonical combining 
class (required for normalization), and the decomposition and case fold (if 
any).

A big part of the table does decompositions for Korean: eliminating the 
Hangul decompositions removes 156320 bytes, leaving 89936 bytes.

Hangul decomposition uses two or three unicode code points and a terminating 
NUL byte in a UTF-8 string. The code points each require a three-byte UTF-8 
sequence, so the total is 7 bytes per 2-part decomposition, and 10 bytes per 
3-part decomposition.

With that in mind, the 156320 additional bytes spent on Hangul are accounted 
for as follows:

  22344 bytes : 11172 leaves * 2 byte leaf size
   2793 bytes :   399 2-part decompositions at 7 bytes each
107730 bytes : 10773 3-part decompositions at 10 bytes each

This adds up to 132867 bytes of data, with the remainder, 23453 bytes, spent 
on additional internal trie nodes.

>> As far as telling the customer "don't do that", my guess is that they
>> would just go elsewhere.  There are several other options for
>> filesystems that support unicode.
>
> They could put some code into their user app that generates
> an unique representation.

This assumes a single app, and that they control the source of that app.

Olaf

-- 
Olaf Weber                 SGI               Phone:  +31(0)30-6696796
                            Veldzigt 2b       Fax:    +31(0)30-6696799
Technical Lead             3454 PW de Meern  Vnet:   955-6796
Storage Software           The Netherlands   Email:  olaf@sgi.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 07a/13] xfsprogs: add trie generator for UTF-8.
  2014-09-19 16:06   ` [PATCH 07a/13] xfsprogs: add trie generator for UTF-8 Ben Myers
@ 2014-09-23 18:34     ` Roger Willcocks
  2014-09-24 23:11       ` Ben Myers
  0 siblings, 1 reply; 84+ messages in thread
From: Roger Willcocks @ 2014-09-23 18:34 UTC (permalink / raw)
  To: Ben Myers; +Cc: linux-fsdevel, tinguely, olaf, xfs


On Fri, 2014-09-19 at 11:06 -0500, Ben Myers wrote:
> +#define AGE_NAME       "DerivedAge.txt"
> +#define CCC_NAME       "DerivedCombiningClass.txt"
> +#define PROP_NAME      "DerivedCoreProperties.txt"
> +#define DATA_NAME      "UnicodeData.txt"
> +#define FOLD_NAME      "CaseFolding.txt"
> +#define NORM_NAME      "NormalizationCorrections.txt"
> +#define TEST_NAME      "NormalizationTest.txt"

Is there a reason why you're using multiple text-based data files (and
hand-parsing them) when there's an xml formatted flat file available ?

http://www.unicode.org/Public/UCD/latest/ucdxml/


And a 2nd question - why does the trie need to encode "the the unicode
version in which the codepoint was assigned an interpretation" ?


-- 
Roger Willcocks <roger@filmlight.ltd.uk>

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 07/10] xfs: add trie generator and supporting code for UTF-8.
  2014-09-22 20:57     ` Dave Chinner
  (?)
@ 2014-09-23 18:57     ` Ben Myers
  2014-09-26 17:10       ` Christoph Hellwig
  -1 siblings, 1 reply; 84+ messages in thread
From: Ben Myers @ 2014-09-23 18:57 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, tinguely, olaf, xfs

On Tue, Sep 23, 2014 at 06:57:14AM +1000, Dave Chinner wrote:
> On Thu, Sep 18, 2014 at 03:15:19PM -0500, Ben Myers wrote:
> > From: Olaf Weber <olaf@sgi.com>
> > 
> > mkutf8data.c is the source for a program that generates utf8data.h, which
> > contains the trie that utf8norm.c uses. The trie is generated from the
> > Unicode 7.0.0 data files. The format of the utf8data[] table is described
> > in utf8norm.c.
> > 
> > Supporting functions for UTF-8 normalization are in utf8norm.c with the
> > header utf8norm.h. Two normalization forms are supported: nfkdi and nfkdicf.
> > 
> >   nfkdi:
> >    - Apply unicode normalization form NFKD.
> >    - Remove any Default_Ignorable_Code_Point.
> > 
> >   nfkdicf:
> >    - Apply unicode normalization form NFKD.
> >    - Remove any Default_Ignorable_Code_Point.
> >    - Apply a full casefold (C + F).
> > 
> > For the purposes of the code, a string is valid UTF-8 if:
> > 
> >  - The values encoded are 0x1..0x10FFFF.
> >  - The surrogate codepoints 0xD800..0xDFFFF are not encoded.
> >  - The shortest possible encoding is used for all values.
> > 
> > The supporting functions work on null-terminated strings (utf8 prefix) and
> > on length-limited strings (utf8n prefix).
> > 
> > Signed-off-by: Olaf Weber <olaf@sgi.com>
> > 
> > ---
> > [v2: the trie is now separated into utf8norm.ko;
> >      utf8version is now a function and exported;
> >      introduced CONFIG_XFS_UTF8. -bpm]
> > ---
> >  fs/xfs/Kconfig               |    8 +
> >  fs/xfs/Makefile              |    2 +-
> >  fs/xfs/utf8norm/Makefile     |   37 +
> >  fs/xfs/utf8norm/mkutf8data.c | 3239 ++++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/utf8norm/utf8norm.c   |  649 +++++++++
> >  fs/xfs/utf8norm/utf8norm.h   |  116 ++
> 
> Again, nothing XFS specific here. It's being built as a separate
> module and the only thing that XFS uses are exported functions, so
> it really should be generic library code....

I'll get this moved to lib/ as you suggested elsewhere in the
thread.

Thanks,
	Ben

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC v2] Unicode/UTF-8 support for XFS
  2014-09-23 13:01   ` Olaf Weber
@ 2014-09-23 20:02     ` Andi Kleen
  0 siblings, 0 replies; 84+ messages in thread
From: Andi Kleen @ 2014-09-23 20:02 UTC (permalink / raw)
  To: Olaf Weber; +Cc: linux-fsdevel, Ben Myers, Andi Kleen, tinguely, xfs


> The bulk of the table deals with Korean.

Is a table of Korean exceptions 250k too?

If not please find some other way to compress this.

I'm sure there is redundancy somewhere.

-Andi

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC v2] Unicode/UTF-8 support for XFS
  2014-09-23 16:13       ` Olaf Weber
@ 2014-09-23 20:15         ` Andi Kleen
  2014-09-23 20:45             ` Ben Myers
  2014-09-24 11:07           ` Olaf Weber
  0 siblings, 2 replies; 84+ messages in thread
From: Andi Kleen @ 2014-09-23 20:15 UTC (permalink / raw)
  To: Olaf Weber; +Cc: linux-fsdevel, Ben Myers, Andi Kleen, tinguely, xfs

> You only pay the space cost if you use it, similar to the nls tables.

The way Linux module loading works these things are loaded by default,
not when someone needs it.

So no, you (or rather every unfortunate Linux XFS user) would pay it always,
unless you black list the module or rebuild the kernel.

> A big part of the table does decompositions for Korean: eliminating
> the Hangul decompositions removes 156320 bytes, leaving 89936 bytes.

Are there regular ranges or other redundancies in the Korean encoding
that could be used to compress paths?

Doing some basic research other people already answered this:

Please use the ICU or google tables referenced below. Apparently
smaller is possible too, but 40-50k seems more reasonable.

I'm just gonna make the claim that whatever performance you
get from a larger table is dwarfed by the cache miss overhead.

----

http://macchiato.com/unicode/normalization_footprint.htm

http://www.macchiato.com/unicode/nfc-faq

NFC normalization requires large tables, right?
Like many other cases, there is a tradeoff between size and performance.
You can use very small tables, at some cost in performance. (Even there,
the actual performance cost depends on how often normalization needs to
be invoked, as discussed above.)

To see an analysis of the situation, see Normalization Footprint. It is
a bit out of date, but gives a sense of the magnitude. For comparison,
ICU's optimized tables for NFC take 44 kB (UTF-16) and Google's
optimized tables for NFC take 46 kB (UTF-8).


-Andi

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC v2] Unicode/UTF-8 support for XFS
  2014-09-23 20:15         ` Andi Kleen
@ 2014-09-23 20:45             ` Ben Myers
  2014-09-24 11:07           ` Olaf Weber
  1 sibling, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-23 20:45 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Olaf Weber, linux-fsdevel, tinguely, xfs

Hi Andi,

On Tue, Sep 23, 2014 at 10:15:40PM +0200, Andi Kleen wrote:
> > You only pay the space cost if you use it, similar to the nls tables.
> 
> The way Linux module loading works these things are loaded by default,
> not when someone needs it.
> 
> So no, you (or rather every unfortunate Linux XFS user) would pay it always,
> unless you black list the module or rebuild the kernel.

The suggestion to use symbol_get was made in response to the initial
post of the series.  In my testing the normalization module did not load
until I attempted to mount a filesystem which required it.  Seems to
work ok.

See patch 10, "xfs: implement demand load of utf8norm.ko". 

Thanks,
	Ben

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC v2] Unicode/UTF-8 support for XFS
@ 2014-09-23 20:45             ` Ben Myers
  0 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-23 20:45 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-fsdevel, tinguely, Olaf Weber, xfs

Hi Andi,

On Tue, Sep 23, 2014 at 10:15:40PM +0200, Andi Kleen wrote:
> > You only pay the space cost if you use it, similar to the nls tables.
> 
> The way Linux module loading works these things are loaded by default,
> not when someone needs it.
> 
> So no, you (or rather every unfortunate Linux XFS user) would pay it always,
> unless you black list the module or rebuild the kernel.

The suggestion to use symbol_get was made in response to the initial
post of the series.  In my testing the normalization module did not load
until I attempted to mount a filesystem which required it.  Seems to
work ok.

See patch 10, "xfs: implement demand load of utf8norm.ko". 

Thanks,
	Ben

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC v2] Unicode/UTF-8 support for XFS
  2014-09-23 20:15         ` Andi Kleen
  2014-09-23 20:45             ` Ben Myers
@ 2014-09-24 11:07           ` Olaf Weber
  2014-09-26 14:06             ` Olaf Weber
  1 sibling, 1 reply; 84+ messages in thread
From: Olaf Weber @ 2014-09-24 11:07 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-fsdevel, Ben Myers, tinguely, xfs

On 23-09-14 22:15, Andi Kleen wrote:

>> A big part of the table does decompositions for Korean: eliminating
>> the Hangul decompositions removes 156320 bytes, leaving 89936 bytes.
>
> Are there regular ranges or other redundancies in the Korean encoding
> that could be used to compress paths?

Yes, though at the expense of more complicated code and interfaces. in 
particular, lookups that want a normalized string would need to provide a 
10-byte buffer to store it in.

> Doing some basic research other people already answered this:
>
> Please use the ICU or google tables referenced below. Apparently
> smaller is possible too, but 40-50k seems more reasonable.

Riffing off the http://macchiato.com/unicode/normalization_footprint.htm 
link you provided, looking at the NFKD case. For Unicode 3.0.0 that link 
gives 3483 NFKD normalizations (exlcuding Hangul), and gives 26,918 bytes as 
the size of a simple lookup table (key/offset pairs, with the offset 
pointing into a string table).

In Unicode 7.0.0 I count 5721 NFKD normalizations (again excluding Hangul). 
  As NUL-terminated UTF-8 strings these take 23390 bytes.  Using a 
key-offset table I need 3 bytes for the key (code points are 21 bits) and 2 
bytes for an offset. Total is 5721 * (3 + 2) + 23390 = 51995 bytes. Stealing 
4 bits from the key field and 1 from the offset to store the size of the 
normalized string I can remove the NUL bytes from the string table and 
reduce total size to 46274 bytes.

The trie implementation used here would use 66283 bytes to store the same 
information, but it also provides unicode version and canonical combining 
class for all codepoints. There are 10268 leaves in this case, with the size 
of each leaf being 1 byte for version, one for ccc, plus the size of the 
decomposition, if any. So a quick estimate on the space used just for the 
NFKD data is some 45747 bytes.

I'm pretty certain that a trie that only stores the NFKD would be smaller 
than 45747 bytes as it would need fewer internal nodes, but didn't do the 
experiment. But as you can see, for just the NFKD part and excluding Hangul, 
the size of the trie is within the ballpark of the numbers you gave.

Case folding adds a partial trie that forwards to the "main" trie for parts 
that are identical. This adds 2672 extra leaves and 20171 extra bytes. The 
data for the normalization corrections adds another 3328 bytes. With a bit 
of rounding, total size comes to 89840 bytes.

 > I'm just gonna make the claim that whatever performance you
 > get from a larger table is dwarfed by the cache miss overhead.

That's possible, though it seems plausible you'd only suffer from this doing 
actual Hangul decomposition: all the data related to this (trie nodes, trie 
leaves, and strings) sits in one contiguous block in memory.

Olaf

-- 
Olaf Weber                 SGI               Phone:  +31(0)30-6696796
                            Veldzigt 2b       Fax:    +31(0)30-6696799
Technical Lead             3454 PW de Meern  Vnet:   955-6796
Storage Software           The Netherlands   Email:  olaf@sgi.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC v2] Unicode/UTF-8 support for XFS
  2014-09-22 22:26   ` Dave Chinner
@ 2014-09-24 13:21     ` Olaf Weber
  -1 siblings, 0 replies; 84+ messages in thread
From: Olaf Weber @ 2014-09-24 13:21 UTC (permalink / raw)
  To: Dave Chinner, Ben Myers; +Cc: linux-fsdevel, tinguely, xfs

On 23-09-14 00:26, Dave Chinner wrote:
> On Thu, Sep 18, 2014 at 02:56:50PM -0500, Ben Myers wrote:

[...]

>> TODO: Store the unicode version number of the filesystem on disk in the
>> super block.
>
> So, if the filesystem has to store the specific unicode version it
> was created with so that we know what version to put in trie
> lookups, again I'll ask: why are we loading the trie as a generic
> kernel module and not as metadata in the filesystem that is demand
> paged and cached?

This way the trie can be shared, and the code using it is not entangled with 
the XFS code.

> i.e. put the entire trie on disk, look up the specific conversion
> required for the name being compared, and then cache that conversion
> in memory. This makes repeated lookups much faster because the trie
> only contains conversions that are in use, the memory footprint is
> way lower and the conversions are guaranteed to be consistent for
> the life of the filesystem....

Above you mention demand paging parts of the trie, but here you seem to 
suggest creating an in-core conversion table on the fly from data read from 
disk. The former seems a lot easier to do than the latter.

>> Here are Olaf's design notes:
>>
>> -----------------------------------------------------------------------------
>> Unicode/UTF-8 support for XFS
>>
>> So we had a customer request proper unicode support...
>>
>>
>> * What does "supporting unicode" actually mean?
>>
>>  From a text processing point of view, what a filesystem does with
>> filenames is simple: it stores and retrieves them, and compares them
>> for equality. It may reject certain byte sequences as invalid
>> filenames (for example, no filename can contain an ASCII NUL).
>>
>> I've been taking it as a given that when a file is created with a
>> certain byte sequence as its name, then a subsequent directory listing
>> will contain that same byte sequence among the names listed.
>>
>> This leaves comparing names for equality, and in my view this is what
>> "supporting unicode" revolves about.
>>
>> The present state of affairs is that different byte sequences are
>> different filenames. This amounts to tolerating unicode without
>> actually supporting it.
>
> That's somewhat circular - using your own definition of "supported"
> to argue that your own definition is the right one....
>
>> To support unicode we have to interpret filenames. What happens when
>> (part of) a filename cannot be interpreted? We can reject the
>> filename, interpret the parts we can, or punt and accept it as an
>> uninterpreted blob.
>>
>> Rejecting ill-formed filenames was my first choice, but I came around
>> on the issue: there are too many ways in which you can end up with
>> having to deal with ill-formed filenames that would leave a user with
>> no recourse but to move whatever they're doing to a different
>> filesystem. Unpacking a tarball with filenames in a different encoding
>> is an example.
>
> You still haven't addressed this:
>
> | So we accept invalid unicode in filenames, but only after failing to
> | parse them? Isn't this a potential vector for exploiting weaknesses
> | in application filename handling? i.e.  unprivileged user writes
> | specially crafted invalid unicode filename to disk, setuid program
> | tries to parse it, invalid sequence triggers a buffer overflow bug
> | in setuid parser?
>
> apart from handwaving that userspace has to be able to handle
> invalid utf-8 already. Why should we let filesystems say "we fully
> understand and support utf8" and then allow them to accept and
> propagate invalid utf8 sequences and leave everyone else to have to
> clean up the mess?

Because the alternative amounts in my opinion to a demand that every bit of 
userspace that may be involved in generating filenames generate only clean 
UTF-8. I do not believe that this is a realistic demand at this point in time.

>> Partial interpretation of an ill-formed filename just strikes me as
>> the kind of bad idea that most half-houses are. I admit that I have no
>> stronger objection to this than the fact that it makes the code even
>> more complicated and fragile.
>>
>> Which leaves "blob" as the preferred option by default for coping with
>> ill-formed filenames.
>
> And so can't be case-folded, leading to inconsistent behaviour of
> case-insensitive filename comparisons.
>
> I don't blindly subscribe to the robustness principle of "be liberal
> with what you accept".  Being liberal means accepting malformed junk
> and then trying to make good. It's a fool's game - we've learnt time
> and time again that if we don't fully validate string inputs that we
> have to interpret then someone will find an exploit that utilises
> malformed strings. I don't think we should expose core kernel code
> to such structural weaknesses...

This is why I prefer not to interpret strings that are not UTF-8. I just 
don't think we can afford to outright reject them.

>> When comparing well-formed filenames, the question now becomes which
>> byte sequences are considered to be alternative spellings of the same
>> filename. This is where normalization forms come into play, and the
>> unicode standard has quite a bit to say about the subject.
>>
>> If all you're doing is comparison, then choosing NFD over NFC is easy,
>> because the former is easier to calculate than the latter.
>>
>> If you want various spellings of "office" to compare equal, then
>> picking NFKD over NFD for comparison is also an obvious
>> choice. (Hand-picking individual compatibility forms is truly a bad
>> idea.) Ways to spell "office": "o_f_f_i_c_e", "o_f_fi_c_e", and
>> "o_ffi_c_e", using no ligatures, the fi ligature, or the ffi
>> ligature. (Some fool thought it a good idea to add these ligatures to
>> unicode, all we get to decide is how to cope.)
>
> Yet normalised strings are only stable and hence comparable
> if there are no unassigned code points in them.  What happens when
> userspace is not using the same version of unicode as the
> filesystem and is using newer code points in it's strings?
> Normalisation fails, right?

For the newer code points, yes. This is not treated as a failure to 
normalize the string as a whole, as there are clear guidelines in unicode on 
how unassigned code points interact with normalization: they have canonical 
combining class 0 and no decomposition.

> And as an extension of using normalisation for case-folded
> comparisons, how do we make case folding work with blobs that can't
> be normalised? It seems to me that this just leads to the nasty
> situation where some filenames are case sensitive and some aren't
> based on what the filesystem thinks is valid utf-8. The worst part
> is that userspace has no idea that the filesystem is making such
> distinctions and so behaviour is not at all predictable or expected.

Making case-folding work on a blob that cannot be normalized is (in my 
opinion) akin to doing an ASCII-based casefold on a Shift-JIS string: the 
result is neither pretty nor useful.

> This is another point in favour of rejecting invalid utf-8 strings
> and for keeping the translation tables stable within the
> filesystem...

Bear in mind that this means not just rejecting invalid UTF-8 strings, but 
also rejecting valid UTF-8 strings that encode unassigned code points.

This should be easy to implement if it is decided that we want to do this.

>> The most contentious part is (should be) ignoring the codepoints with
>> the Default_Ignorable_Code_Point property. I've included the list
>> below. My argument, such as it is, is that these code points either
>> have no visible rendering, or in cases like the soft hyphen, are only
>> conditionally visible. The problem with these (as I see it) is that on
>> seeing a filename that might contain them you cannot tell whether they
>> are present. So I propose to ignore them for the purpose of comparing
>> filenames for equality.
>
> Which introduces a non-standard "visibility criterial" for
> determining what should be or shouldn't be part of the normalised
> string for comparison. I don't see any real justification for
> stepping outside the standard unicode normalisation here - just
> because the user cannot see a character in a specific context does
> not mean that it is not significant to the application that created
> it.

I agree these characters may be significant to the application. I'm just not 
convinced that they should be significant in a file name.

>> Finally, case folding. First of all, it is optional. Then the issue is
>> that you either go the language-specific route, or simplify the task
>> by "just" doing a full casefold (C+F, in unicode parlance). Looking
>> around the net I tend to find that if you're going to do casefolding
>> at all, then a language-independent full casefold is preferred because
>> it is the most predictable option. See
>> http://www.w3.org/TR/charmod-norm/ for an example of that kind of
>> reasoning.
>
> Which says in section 2.4: "Some languages need case-folding to be
> tailored to meet specific linguistic needs". That implies that the
> case folding needs to be language aware and hence needs to be tied
> into the NLS subsystem for handling specific quirks like Turkic.

It also recommends just doing a full case fold for cases where you are 
ignorant of the language actually in use.  In section 3.1 they say: 
"However, language-sensitive case-sensitive matching in document formats and 
protocols is NOT RECOMMENDED because language information can be hard to 
obtain, verify, or manage and the resulting operations can produce results 
that frustrate users." This doesn't exactly address the case of filesystems, 
but as far as I know there is no defined interface that allows kernel code 
to query the locale settings that currently apply to a userspace process.

> I also note that it says in several places that C+F can result in a
> folded string of a different length. What happens when that folded
> string is longer than 255 bytes and hence longer than NAME_MAX?
> That's a bit of a nasty landmine for pathname string handling
> functions - developers are going to assume that pathname components
> are not longer than NAME_MAX, and if we are passing normalised
> strings around that is not a valid assumption....

This is not just true for case folding: normalization may also change string 
length, and NFD or NFKD will typically increase the length.

That is among the reasons why normalized and case folded strings are not 
stored on disk, and are not passed up to other parts of the kernel. The code 
posted will generate a normalized version of the user-provided string used 
to look up data as a way to cache that normalization and to reduce stack 
pressure a bit, but this string is ephemeral and discarded once lookup is 
complete.

>> * XFS-specific design notes.
> ...
>> If the borgbit (the bit enabling legacy ASCII-based CI in XFS) is set
>> in the superblock, then case folding is added into the mix. This is
>> the nfkdicf normalization form mentioned above. It allows for the
>> creation of case-insensitive filesystems with UTF-8 support.
>
> Please don't overload existing superblock feature bits with multiple
> meanings. ASCII-CI is a stand-alone feature and is not in any way
> compatible with Unicode: Unicode-CI is a superset of Unicode
> support. So it really needs two new feature bits for Unicode and
> Unicode-CI, not just one for unicode.

It seemed an obvious extension of the meaning of that bit.

Olaf

-- 
Olaf Weber                 SGI               Phone:  +31(0)30-6696796
                            Veldzigt 2b       Fax:    +31(0)30-6696799
Technical Lead             3454 PW de Meern  Vnet:   955-6796
Storage Software           The Netherlands   Email:  olaf@sgi.com

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC v2] Unicode/UTF-8 support for XFS
@ 2014-09-24 13:21     ` Olaf Weber
  0 siblings, 0 replies; 84+ messages in thread
From: Olaf Weber @ 2014-09-24 13:21 UTC (permalink / raw)
  To: Dave Chinner, Ben Myers; +Cc: linux-fsdevel, tinguely, xfs

On 23-09-14 00:26, Dave Chinner wrote:
> On Thu, Sep 18, 2014 at 02:56:50PM -0500, Ben Myers wrote:

[...]

>> TODO: Store the unicode version number of the filesystem on disk in the
>> super block.
>
> So, if the filesystem has to store the specific unicode version it
> was created with so that we know what version to put in trie
> lookups, again I'll ask: why are we loading the trie as a generic
> kernel module and not as metadata in the filesystem that is demand
> paged and cached?

This way the trie can be shared, and the code using it is not entangled with 
the XFS code.

> i.e. put the entire trie on disk, look up the specific conversion
> required for the name being compared, and then cache that conversion
> in memory. This makes repeated lookups much faster because the trie
> only contains conversions that are in use, the memory footprint is
> way lower and the conversions are guaranteed to be consistent for
> the life of the filesystem....

Above you mention demand paging parts of the trie, but here you seem to 
suggest creating an in-core conversion table on the fly from data read from 
disk. The former seems a lot easier to do than the latter.

>> Here are Olaf's design notes:
>>
>> -----------------------------------------------------------------------------
>> Unicode/UTF-8 support for XFS
>>
>> So we had a customer request proper unicode support...
>>
>>
>> * What does "supporting unicode" actually mean?
>>
>>  From a text processing point of view, what a filesystem does with
>> filenames is simple: it stores and retrieves them, and compares them
>> for equality. It may reject certain byte sequences as invalid
>> filenames (for example, no filename can contain an ASCII NUL).
>>
>> I've been taking it as a given that when a file is created with a
>> certain byte sequence as its name, then a subsequent directory listing
>> will contain that same byte sequence among the names listed.
>>
>> This leaves comparing names for equality, and in my view this is what
>> "supporting unicode" revolves about.
>>
>> The present state of affairs is that different byte sequences are
>> different filenames. This amounts to tolerating unicode without
>> actually supporting it.
>
> That's somewhat circular - using your own definition of "supported"
> to argue that your own definition is the right one....
>
>> To support unicode we have to interpret filenames. What happens when
>> (part of) a filename cannot be interpreted? We can reject the
>> filename, interpret the parts we can, or punt and accept it as an
>> uninterpreted blob.
>>
>> Rejecting ill-formed filenames was my first choice, but I came around
>> on the issue: there are too many ways in which you can end up with
>> having to deal with ill-formed filenames that would leave a user with
>> no recourse but to move whatever they're doing to a different
>> filesystem. Unpacking a tarball with filenames in a different encoding
>> is an example.
>
> You still haven't addressed this:
>
> | So we accept invalid unicode in filenames, but only after failing to
> | parse them? Isn't this a potential vector for exploiting weaknesses
> | in application filename handling? i.e.  unprivileged user writes
> | specially crafted invalid unicode filename to disk, setuid program
> | tries to parse it, invalid sequence triggers a buffer overflow bug
> | in setuid parser?
>
> apart from handwaving that userspace has to be able to handle
> invalid utf-8 already. Why should we let filesystems say "we fully
> understand and support utf8" and then allow them to accept and
> propagate invalid utf8 sequences and leave everyone else to have to
> clean up the mess?

Because the alternative amounts in my opinion to a demand that every bit of 
userspace that may be involved in generating filenames generate only clean 
UTF-8. I do not believe that this is a realistic demand at this point in time.

>> Partial interpretation of an ill-formed filename just strikes me as
>> the kind of bad idea that most half-houses are. I admit that I have no
>> stronger objection to this than the fact that it makes the code even
>> more complicated and fragile.
>>
>> Which leaves "blob" as the preferred option by default for coping with
>> ill-formed filenames.
>
> And so can't be case-folded, leading to inconsistent behaviour of
> case-insensitive filename comparisons.
>
> I don't blindly subscribe to the robustness principle of "be liberal
> with what you accept".  Being liberal means accepting malformed junk
> and then trying to make good. It's a fool's game - we've learnt time
> and time again that if we don't fully validate string inputs that we
> have to interpret then someone will find an exploit that utilises
> malformed strings. I don't think we should expose core kernel code
> to such structural weaknesses...

This is why I prefer not to interpret strings that are not UTF-8. I just 
don't think we can afford to outright reject them.

>> When comparing well-formed filenames, the question now becomes which
>> byte sequences are considered to be alternative spellings of the same
>> filename. This is where normalization forms come into play, and the
>> unicode standard has quite a bit to say about the subject.
>>
>> If all you're doing is comparison, then choosing NFD over NFC is easy,
>> because the former is easier to calculate than the latter.
>>
>> If you want various spellings of "office" to compare equal, then
>> picking NFKD over NFD for comparison is also an obvious
>> choice. (Hand-picking individual compatibility forms is truly a bad
>> idea.) Ways to spell "office": "o_f_f_i_c_e", "o_f_fi_c_e", and
>> "o_ffi_c_e", using no ligatures, the fi ligature, or the ffi
>> ligature. (Some fool thought it a good idea to add these ligatures to
>> unicode, all we get to decide is how to cope.)
>
> Yet normalised strings are only stable and hence comparable
> if there are no unassigned code points in them.  What happens when
> userspace is not using the same version of unicode as the
> filesystem and is using newer code points in it's strings?
> Normalisation fails, right?

For the newer code points, yes. This is not treated as a failure to 
normalize the string as a whole, as there are clear guidelines in unicode on 
how unassigned code points interact with normalization: they have canonical 
combining class 0 and no decomposition.

> And as an extension of using normalisation for case-folded
> comparisons, how do we make case folding work with blobs that can't
> be normalised? It seems to me that this just leads to the nasty
> situation where some filenames are case sensitive and some aren't
> based on what the filesystem thinks is valid utf-8. The worst part
> is that userspace has no idea that the filesystem is making such
> distinctions and so behaviour is not at all predictable or expected.

Making case-folding work on a blob that cannot be normalized is (in my 
opinion) akin to doing an ASCII-based casefold on a Shift-JIS string: the 
result is neither pretty nor useful.

> This is another point in favour of rejecting invalid utf-8 strings
> and for keeping the translation tables stable within the
> filesystem...

Bear in mind that this means not just rejecting invalid UTF-8 strings, but 
also rejecting valid UTF-8 strings that encode unassigned code points.

This should be easy to implement if it is decided that we want to do this.

>> The most contentious part is (should be) ignoring the codepoints with
>> the Default_Ignorable_Code_Point property. I've included the list
>> below. My argument, such as it is, is that these code points either
>> have no visible rendering, or in cases like the soft hyphen, are only
>> conditionally visible. The problem with these (as I see it) is that on
>> seeing a filename that might contain them you cannot tell whether they
>> are present. So I propose to ignore them for the purpose of comparing
>> filenames for equality.
>
> Which introduces a non-standard "visibility criterial" for
> determining what should be or shouldn't be part of the normalised
> string for comparison. I don't see any real justification for
> stepping outside the standard unicode normalisation here - just
> because the user cannot see a character in a specific context does
> not mean that it is not significant to the application that created
> it.

I agree these characters may be significant to the application. I'm just not 
convinced that they should be significant in a file name.

>> Finally, case folding. First of all, it is optional. Then the issue is
>> that you either go the language-specific route, or simplify the task
>> by "just" doing a full casefold (C+F, in unicode parlance). Looking
>> around the net I tend to find that if you're going to do casefolding
>> at all, then a language-independent full casefold is preferred because
>> it is the most predictable option. See
>> http://www.w3.org/TR/charmod-norm/ for an example of that kind of
>> reasoning.
>
> Which says in section 2.4: "Some languages need case-folding to be
> tailored to meet specific linguistic needs". That implies that the
> case folding needs to be language aware and hence needs to be tied
> into the NLS subsystem for handling specific quirks like Turkic.

It also recommends just doing a full case fold for cases where you are 
ignorant of the language actually in use.  In section 3.1 they say: 
"However, language-sensitive case-sensitive matching in document formats and 
protocols is NOT RECOMMENDED because language information can be hard to 
obtain, verify, or manage and the resulting operations can produce results 
that frustrate users." This doesn't exactly address the case of filesystems, 
but as far as I know there is no defined interface that allows kernel code 
to query the locale settings that currently apply to a userspace process.

> I also note that it says in several places that C+F can result in a
> folded string of a different length. What happens when that folded
> string is longer than 255 bytes and hence longer than NAME_MAX?
> That's a bit of a nasty landmine for pathname string handling
> functions - developers are going to assume that pathname components
> are not longer than NAME_MAX, and if we are passing normalised
> strings around that is not a valid assumption....

This is not just true for case folding: normalization may also change string 
length, and NFD or NFKD will typically increase the length.

That is among the reasons why normalized and case folded strings are not 
stored on disk, and are not passed up to other parts of the kernel. The code 
posted will generate a normalized version of the user-provided string used 
to look up data as a way to cache that normalization and to reduce stack 
pressure a bit, but this string is ephemeral and discarded once lookup is 
complete.

>> * XFS-specific design notes.
> ...
>> If the borgbit (the bit enabling legacy ASCII-based CI in XFS) is set
>> in the superblock, then case folding is added into the mix. This is
>> the nfkdicf normalization form mentioned above. It allows for the
>> creation of case-insensitive filesystems with UTF-8 support.
>
> Please don't overload existing superblock feature bits with multiple
> meanings. ASCII-CI is a stand-alone feature and is not in any way
> compatible with Unicode: Unicode-CI is a superset of Unicode
> support. So it really needs two new feature bits for Unicode and
> Unicode-CI, not just one for unicode.

It seemed an obvious extension of the meaning of that bit.

Olaf

-- 
Olaf Weber                 SGI               Phone:  +31(0)30-6696796
                            Veldzigt 2b       Fax:    +31(0)30-6696799
Technical Lead             3454 PW de Meern  Vnet:   955-6796
Storage Software           The Netherlands   Email:  olaf@sgi.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC v2] Unicode/UTF-8 support for XFS
  2014-09-24 13:21     ` Olaf Weber
@ 2014-09-24 23:10       ` Dave Chinner
  -1 siblings, 0 replies; 84+ messages in thread
From: Dave Chinner @ 2014-09-24 23:10 UTC (permalink / raw)
  To: Olaf Weber; +Cc: Ben Myers, linux-fsdevel, tinguely, xfs

On Wed, Sep 24, 2014 at 03:21:04PM +0200, Olaf Weber wrote:
> On 23-09-14 00:26, Dave Chinner wrote:
> >On Thu, Sep 18, 2014 at 02:56:50PM -0500, Ben Myers wrote:
> 
> [...]
> 
> >>TODO: Store the unicode version number of the filesystem on disk in the
> >>super block.
> >
> >So, if the filesystem has to store the specific unicode version it
> >was created with so that we know what version to put in trie
> >lookups, again I'll ask: why are we loading the trie as a generic
> >kernel module and not as metadata in the filesystem that is demand
> >paged and cached?
> 
> This way the trie can be shared, and the code using it is not
> entangled with the XFS code.

The trie parsing code can still be common - just the location and
contents of the data is determined by the end-user.

> 
> >i.e. put the entire trie on disk, look up the specific conversion
> >required for the name being compared, and then cache that conversion
> >in memory. This makes repeated lookups much faster because the trie
> >only contains conversions that are in use, the memory footprint is
> >way lower and the conversions are guaranteed to be consistent for
> >the life of the filesystem....
> 
> Above you mention demand paging parts of the trie, but here you seem
> to suggest creating an in-core conversion table on the fly from data
> read from disk. The former seems a lot easier to do than the latter.

Right - it's a question of what needs optimising. If people are only
concerned about memory footprint, then demand paging solves that
problem. If people are concerned about performance and memory
footprint, then demand paging plus a lookaside cache
will address both of those aspects.

We can't do demand paging if the trie data is built into the kernel.
We can still do a lookaside cache to avoid performane issues with
repeated trie lookups...

[...]

> >>To support unicode we have to interpret filenames. What happens when
> >>(part of) a filename cannot be interpreted? We can reject the
> >>filename, interpret the parts we can, or punt and accept it as an
> >>uninterpreted blob.
> >>
> >>Rejecting ill-formed filenames was my first choice, but I came around
> >>on the issue: there are too many ways in which you can end up with
> >>having to deal with ill-formed filenames that would leave a user with
> >>no recourse but to move whatever they're doing to a different
> >>filesystem. Unpacking a tarball with filenames in a different encoding
> >>is an example.
> >
> >You still haven't addressed this:
> >
> >| So we accept invalid unicode in filenames, but only after failing to
> >| parse them? Isn't this a potential vector for exploiting weaknesses
> >| in application filename handling? i.e.  unprivileged user writes
> >| specially crafted invalid unicode filename to disk, setuid program
> >| tries to parse it, invalid sequence triggers a buffer overflow bug
> >| in setuid parser?
> >
> >apart from handwaving that userspace has to be able to handle
> >invalid utf-8 already. Why should we let filesystems say "we fully
> >understand and support utf8" and then allow them to accept and
> >propagate invalid utf8 sequences and leave everyone else to have to
> >clean up the mess?
> 
> Because the alternative amounts in my opinion to a demand that every
> bit of userspace that may be involved in generating filenames
> generate only clean UTF-8. I do not believe that this is a realistic
> demand at this point in time.

It's a chicken and egg situation. I'd much prefer we enforce clean
utf8 from the start, because if we don't we'll never be able to do
that. And other filesystems (e.g. ZFS) allow you to do reject
anything that is not clean utf8....

[...]

> >>When comparing well-formed filenames, the question now becomes which
> >>byte sequences are considered to be alternative spellings of the same
> >>filename. This is where normalization forms come into play, and the
> >>unicode standard has quite a bit to say about the subject.
> >>
> >>If all you're doing is comparison, then choosing NFD over NFC is easy,
> >>because the former is easier to calculate than the latter.
> >>
> >>If you want various spellings of "office" to compare equal, then
> >>picking NFKD over NFD for comparison is also an obvious
> >>choice. (Hand-picking individual compatibility forms is truly a bad
> >>idea.) Ways to spell "office": "o_f_f_i_c_e", "o_f_fi_c_e", and
> >>"o_ffi_c_e", using no ligatures, the fi ligature, or the ffi
> >>ligature. (Some fool thought it a good idea to add these ligatures to
> >>unicode, all we get to decide is how to cope.)
> >
> >Yet normalised strings are only stable and hence comparable
> >if there are no unassigned code points in them.  What happens when
> >userspace is not using the same version of unicode as the
> >filesystem and is using newer code points in it's strings?
> >Normalisation fails, right?
> 
> For the newer code points, yes. This is not treated as a failure to
> normalize the string as a whole, as there are clear guidelines in
> unicode on how unassigned code points interact with normalization:
> they have canonical combining class 0 and no decomposition.

And so effectively are not stable. Which is something we absolutely
have to avoid for information stored on disk. i.e. you're using the
normalised form to build the hash values in the lookup index in the
directory structure, and so having unstable normalisation forms is
just wrong. Hence we'd need to reject anything with unassigned code
points....

> >And as an extension of using normalisation for case-folded
> >comparisons, how do we make case folding work with blobs that can't
> >be normalised? It seems to me that this just leads to the nasty
> >situation where some filenames are case sensitive and some aren't
> >based on what the filesystem thinks is valid utf-8. The worst part
> >is that userspace has no idea that the filesystem is making such
> >distinctions and so behaviour is not at all predictable or expected.
> 
> Making case-folding work on a blob that cannot be normalized is (in
> my opinion) akin to doing an ASCII-based casefold on a Shift-JIS
> string: the result is neither pretty nor useful.

Yes, that's exactly my point.

> >This is another point in favour of rejecting invalid utf-8 strings
> >and for keeping the translation tables stable within the
> >filesystem...
> 
> Bear in mind that this means not just rejecting invalid UTF-8
> strings, but also rejecting valid UTF-8 strings that encode
> unassigned code points.

And that's precisely what I'm suggesting: If we can't normalise the
filename to a stable form then it cannot be used for hashing or case
folding. That means it needs to be rejected, not treated as an
opaque blob.

The moment we start parsing filenames they are no longer opaque
blobs and so all existing "filename are opaque blobs" handling rules
go out the window. They are now either valid so we can use them, or
they are invalid and need to be rejected to avoid unpredictable
and/or undesirable behaviour.

> This should be easy to implement if it is decided that we want to do this.
> 
> >>The most contentious part is (should be) ignoring the codepoints with
> >>the Default_Ignorable_Code_Point property. I've included the list
> >>below. My argument, such as it is, is that these code points either
> >>have no visible rendering, or in cases like the soft hyphen, are only
> >>conditionally visible. The problem with these (as I see it) is that on
> >>seeing a filename that might contain them you cannot tell whether they
> >>are present. So I propose to ignore them for the purpose of comparing
> >>filenames for equality.
> >
> >Which introduces a non-standard "visibility criterial" for
> >determining what should be or shouldn't be part of the normalised
> >string for comparison. I don't see any real justification for
> >stepping outside the standard unicode normalisation here - just
> >because the user cannot see a character in a specific context does
> >not mean that it is not significant to the application that created
> >it.
> 
> I agree these characters may be significant to the application. I'm
> just not convinced that they should be significant in a file name.

They are significant to the case folding result, right? And
therefore would be significant in a filename...
> 
> >>Finally, case folding. First of all, it is optional. Then the issue is
> >>that you either go the language-specific route, or simplify the task
> >>by "just" doing a full casefold (C+F, in unicode parlance). Looking
> >>around the net I tend to find that if you're going to do casefolding
> >>at all, then a language-independent full casefold is preferred because
> >>it is the most predictable option. See
> >>http://www.w3.org/TR/charmod-norm/ for an example of that kind of
> >>reasoning.
> >
> >Which says in section 2.4: "Some languages need case-folding to be
> >tailored to meet specific linguistic needs". That implies that the
> >case folding needs to be language aware and hence needs to be tied
> >into the NLS subsystem for handling specific quirks like Turkic.
> 
> It also recommends just doing a full case fold for cases where you
> are ignorant of the language actually in use.  In section 3.1 they
> say: "However, language-sensitive case-sensitive matching in
> document formats and protocols is NOT RECOMMENDED because language
> information can be hard to obtain, verify, or manage and the
> resulting operations can produce results that frustrate users." This
> doesn't exactly address the case of filesystems, but as far as I
> know there is no defined interface that allows kernel code to query
> the locale settings that currently apply to a userspace process.

Hence my comments about NLS integration. The NLS subsystem already
has utf8 support with language dependent case folding tables. All the
current filesystems that deal with unicode (including case folding)
use the NLS subsystem for conversions.

Hmmm - looking at all the NLS code that does different utf format
conversions first: what happens if an application is using UTF16 or
UTF32 for it's filename encoding rather than utf8?

> >>* XFS-specific design notes.
> >...
> >>If the borgbit (the bit enabling legacy ASCII-based CI in XFS) is set
> >>in the superblock, then case folding is added into the mix. This is
> >>the nfkdicf normalization form mentioned above. It allows for the
> >>creation of case-insensitive filesystems with UTF-8 support.
> >
> >Please don't overload existing superblock feature bits with multiple
> >meanings. ASCII-CI is a stand-alone feature and is not in any way
> >compatible with Unicode: Unicode-CI is a superset of Unicode
> >support. So it really needs two new feature bits for Unicode and
> >Unicode-CI, not just one for unicode.
> 
> It seemed an obvious extension of the meaning of that bit.

Feature bits refer to a specific on disk format feature. If that bit
is set, then that feature is present. In this case, it means the
filesystem is using ascii-ci. If that bit is passed out to
userspace via the geometry ioctl, then *existing applications*
expect it to mean ascii-ci behaviour from the filesystem. If an
existing utility reads the flag field from disk (e.g. repair,
metadump, db, etc) they all expect it to mean ascii-ci, and will do
stuff based on that specific meaning. We cannot redefine the meaning
of a feature bit after the fact - we have lots of feature bits so
there's no need to overload an existing one for this.

Hmmm - another interesting question just popped into my head about
metadump: file name obfuscation.  What does unicode and utf8 mean
for the hash collision calculation algorithm?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC v2] Unicode/UTF-8 support for XFS
@ 2014-09-24 23:10       ` Dave Chinner
  0 siblings, 0 replies; 84+ messages in thread
From: Dave Chinner @ 2014-09-24 23:10 UTC (permalink / raw)
  To: Olaf Weber; +Cc: linux-fsdevel, Ben Myers, tinguely, xfs

On Wed, Sep 24, 2014 at 03:21:04PM +0200, Olaf Weber wrote:
> On 23-09-14 00:26, Dave Chinner wrote:
> >On Thu, Sep 18, 2014 at 02:56:50PM -0500, Ben Myers wrote:
> 
> [...]
> 
> >>TODO: Store the unicode version number of the filesystem on disk in the
> >>super block.
> >
> >So, if the filesystem has to store the specific unicode version it
> >was created with so that we know what version to put in trie
> >lookups, again I'll ask: why are we loading the trie as a generic
> >kernel module and not as metadata in the filesystem that is demand
> >paged and cached?
> 
> This way the trie can be shared, and the code using it is not
> entangled with the XFS code.

The trie parsing code can still be common - just the location and
contents of the data is determined by the end-user.

> 
> >i.e. put the entire trie on disk, look up the specific conversion
> >required for the name being compared, and then cache that conversion
> >in memory. This makes repeated lookups much faster because the trie
> >only contains conversions that are in use, the memory footprint is
> >way lower and the conversions are guaranteed to be consistent for
> >the life of the filesystem....
> 
> Above you mention demand paging parts of the trie, but here you seem
> to suggest creating an in-core conversion table on the fly from data
> read from disk. The former seems a lot easier to do than the latter.

Right - it's a question of what needs optimising. If people are only
concerned about memory footprint, then demand paging solves that
problem. If people are concerned about performance and memory
footprint, then demand paging plus a lookaside cache
will address both of those aspects.

We can't do demand paging if the trie data is built into the kernel.
We can still do a lookaside cache to avoid performane issues with
repeated trie lookups...

[...]

> >>To support unicode we have to interpret filenames. What happens when
> >>(part of) a filename cannot be interpreted? We can reject the
> >>filename, interpret the parts we can, or punt and accept it as an
> >>uninterpreted blob.
> >>
> >>Rejecting ill-formed filenames was my first choice, but I came around
> >>on the issue: there are too many ways in which you can end up with
> >>having to deal with ill-formed filenames that would leave a user with
> >>no recourse but to move whatever they're doing to a different
> >>filesystem. Unpacking a tarball with filenames in a different encoding
> >>is an example.
> >
> >You still haven't addressed this:
> >
> >| So we accept invalid unicode in filenames, but only after failing to
> >| parse them? Isn't this a potential vector for exploiting weaknesses
> >| in application filename handling? i.e.  unprivileged user writes
> >| specially crafted invalid unicode filename to disk, setuid program
> >| tries to parse it, invalid sequence triggers a buffer overflow bug
> >| in setuid parser?
> >
> >apart from handwaving that userspace has to be able to handle
> >invalid utf-8 already. Why should we let filesystems say "we fully
> >understand and support utf8" and then allow them to accept and
> >propagate invalid utf8 sequences and leave everyone else to have to
> >clean up the mess?
> 
> Because the alternative amounts in my opinion to a demand that every
> bit of userspace that may be involved in generating filenames
> generate only clean UTF-8. I do not believe that this is a realistic
> demand at this point in time.

It's a chicken and egg situation. I'd much prefer we enforce clean
utf8 from the start, because if we don't we'll never be able to do
that. And other filesystems (e.g. ZFS) allow you to do reject
anything that is not clean utf8....

[...]

> >>When comparing well-formed filenames, the question now becomes which
> >>byte sequences are considered to be alternative spellings of the same
> >>filename. This is where normalization forms come into play, and the
> >>unicode standard has quite a bit to say about the subject.
> >>
> >>If all you're doing is comparison, then choosing NFD over NFC is easy,
> >>because the former is easier to calculate than the latter.
> >>
> >>If you want various spellings of "office" to compare equal, then
> >>picking NFKD over NFD for comparison is also an obvious
> >>choice. (Hand-picking individual compatibility forms is truly a bad
> >>idea.) Ways to spell "office": "o_f_f_i_c_e", "o_f_fi_c_e", and
> >>"o_ffi_c_e", using no ligatures, the fi ligature, or the ffi
> >>ligature. (Some fool thought it a good idea to add these ligatures to
> >>unicode, all we get to decide is how to cope.)
> >
> >Yet normalised strings are only stable and hence comparable
> >if there are no unassigned code points in them.  What happens when
> >userspace is not using the same version of unicode as the
> >filesystem and is using newer code points in it's strings?
> >Normalisation fails, right?
> 
> For the newer code points, yes. This is not treated as a failure to
> normalize the string as a whole, as there are clear guidelines in
> unicode on how unassigned code points interact with normalization:
> they have canonical combining class 0 and no decomposition.

And so effectively are not stable. Which is something we absolutely
have to avoid for information stored on disk. i.e. you're using the
normalised form to build the hash values in the lookup index in the
directory structure, and so having unstable normalisation forms is
just wrong. Hence we'd need to reject anything with unassigned code
points....

> >And as an extension of using normalisation for case-folded
> >comparisons, how do we make case folding work with blobs that can't
> >be normalised? It seems to me that this just leads to the nasty
> >situation where some filenames are case sensitive and some aren't
> >based on what the filesystem thinks is valid utf-8. The worst part
> >is that userspace has no idea that the filesystem is making such
> >distinctions and so behaviour is not at all predictable or expected.
> 
> Making case-folding work on a blob that cannot be normalized is (in
> my opinion) akin to doing an ASCII-based casefold on a Shift-JIS
> string: the result is neither pretty nor useful.

Yes, that's exactly my point.

> >This is another point in favour of rejecting invalid utf-8 strings
> >and for keeping the translation tables stable within the
> >filesystem...
> 
> Bear in mind that this means not just rejecting invalid UTF-8
> strings, but also rejecting valid UTF-8 strings that encode
> unassigned code points.

And that's precisely what I'm suggesting: If we can't normalise the
filename to a stable form then it cannot be used for hashing or case
folding. That means it needs to be rejected, not treated as an
opaque blob.

The moment we start parsing filenames they are no longer opaque
blobs and so all existing "filename are opaque blobs" handling rules
go out the window. They are now either valid so we can use them, or
they are invalid and need to be rejected to avoid unpredictable
and/or undesirable behaviour.

> This should be easy to implement if it is decided that we want to do this.
> 
> >>The most contentious part is (should be) ignoring the codepoints with
> >>the Default_Ignorable_Code_Point property. I've included the list
> >>below. My argument, such as it is, is that these code points either
> >>have no visible rendering, or in cases like the soft hyphen, are only
> >>conditionally visible. The problem with these (as I see it) is that on
> >>seeing a filename that might contain them you cannot tell whether they
> >>are present. So I propose to ignore them for the purpose of comparing
> >>filenames for equality.
> >
> >Which introduces a non-standard "visibility criterial" for
> >determining what should be or shouldn't be part of the normalised
> >string for comparison. I don't see any real justification for
> >stepping outside the standard unicode normalisation here - just
> >because the user cannot see a character in a specific context does
> >not mean that it is not significant to the application that created
> >it.
> 
> I agree these characters may be significant to the application. I'm
> just not convinced that they should be significant in a file name.

They are significant to the case folding result, right? And
therefore would be significant in a filename...
> 
> >>Finally, case folding. First of all, it is optional. Then the issue is
> >>that you either go the language-specific route, or simplify the task
> >>by "just" doing a full casefold (C+F, in unicode parlance). Looking
> >>around the net I tend to find that if you're going to do casefolding
> >>at all, then a language-independent full casefold is preferred because
> >>it is the most predictable option. See
> >>http://www.w3.org/TR/charmod-norm/ for an example of that kind of
> >>reasoning.
> >
> >Which says in section 2.4: "Some languages need case-folding to be
> >tailored to meet specific linguistic needs". That implies that the
> >case folding needs to be language aware and hence needs to be tied
> >into the NLS subsystem for handling specific quirks like Turkic.
> 
> It also recommends just doing a full case fold for cases where you
> are ignorant of the language actually in use.  In section 3.1 they
> say: "However, language-sensitive case-sensitive matching in
> document formats and protocols is NOT RECOMMENDED because language
> information can be hard to obtain, verify, or manage and the
> resulting operations can produce results that frustrate users." This
> doesn't exactly address the case of filesystems, but as far as I
> know there is no defined interface that allows kernel code to query
> the locale settings that currently apply to a userspace process.

Hence my comments about NLS integration. The NLS subsystem already
has utf8 support with language dependent case folding tables. All the
current filesystems that deal with unicode (including case folding)
use the NLS subsystem for conversions.

Hmmm - looking at all the NLS code that does different utf format
conversions first: what happens if an application is using UTF16 or
UTF32 for it's filename encoding rather than utf8?

> >>* XFS-specific design notes.
> >...
> >>If the borgbit (the bit enabling legacy ASCII-based CI in XFS) is set
> >>in the superblock, then case folding is added into the mix. This is
> >>the nfkdicf normalization form mentioned above. It allows for the
> >>creation of case-insensitive filesystems with UTF-8 support.
> >
> >Please don't overload existing superblock feature bits with multiple
> >meanings. ASCII-CI is a stand-alone feature and is not in any way
> >compatible with Unicode: Unicode-CI is a superset of Unicode
> >support. So it really needs two new feature bits for Unicode and
> >Unicode-CI, not just one for unicode.
> 
> It seemed an obvious extension of the meaning of that bit.

Feature bits refer to a specific on disk format feature. If that bit
is set, then that feature is present. In this case, it means the
filesystem is using ascii-ci. If that bit is passed out to
userspace via the geometry ioctl, then *existing applications*
expect it to mean ascii-ci behaviour from the filesystem. If an
existing utility reads the flag field from disk (e.g. repair,
metadump, db, etc) they all expect it to mean ascii-ci, and will do
stuff based on that specific meaning. We cannot redefine the meaning
of a feature bit after the fact - we have lots of feature bits so
there's no need to overload an existing one for this.

Hmmm - another interesting question just popped into my head about
metadump: file name obfuscation.  What does unicode and utf8 mean
for the hash collision calculation algorithm?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 07a/13] xfsprogs: add trie generator for UTF-8.
  2014-09-23 18:34     ` Roger Willcocks
@ 2014-09-24 23:11       ` Ben Myers
  0 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-24 23:11 UTC (permalink / raw)
  To: Roger Willcocks; +Cc: linux-fsdevel, tinguely, olaf, xfs

Hi Roger,

On Tue, Sep 23, 2014 at 07:34:19PM +0100, Roger Willcocks wrote:
> On Fri, 2014-09-19 at 11:06 -0500, Ben Myers wrote:
> > +#define AGE_NAME       "DerivedAge.txt"
> > +#define CCC_NAME       "DerivedCombiningClass.txt"
> > +#define PROP_NAME      "DerivedCoreProperties.txt"
> > +#define DATA_NAME      "UnicodeData.txt"
> > +#define FOLD_NAME      "CaseFolding.txt"
> > +#define NORM_NAME      "NormalizationCorrections.txt"
> > +#define TEST_NAME      "NormalizationTest.txt"
> 
> Is there a reason why you're using multiple text-based data files (and
> hand-parsing them) when there's an xml formatted flat file available ?
> 
> http://www.unicode.org/Public/UCD/latest/ucdxml/

The UCD files being parsed are the authoritative source.  Check out
ucdxml.readme.txt.

> And a 2nd question - why does the trie need to encode "the the unicode
> version in which the codepoint was assigned an interpretation" ?

You need to know whether a given code point is assigned in the version
of Unicode you're normalizing for.  Unicode 8 is supposed to release
June/July 2015 (see http://www.unicode.org/versions/), but filesystems
you created this year will still need the version 7 normalization.
There is still some plumbing to do to pass the version along with the
string for normalization.

I think you bring up a good point, but we'll need to support multiple
versions in the long run.

Regards,
Ben

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: [RFC v2] Unicode/UTF-8 support for XFS
  2014-09-24 23:10       ` Dave Chinner
  (?)
@ 2014-09-25 13:33       ` Zuckerman, Boris
  -1 siblings, 0 replies; 84+ messages in thread
From: Zuckerman, Boris @ 2014-09-25 13:33 UTC (permalink / raw)
  To: Dave Chinner, Olaf Weber; +Cc: linux-fsdevel, Ben Myers, tinguely, xfs



> -----Original Message-----
> From: linux-fsdevel-owner@vger.kernel.org [mailto:linux-fsdevel-
> owner@vger.kernel.org] On Behalf Of Dave Chinner
> Sent: Wednesday, September 24, 2014 7:10 PM
> To: Olaf Weber
> Cc: Ben Myers; linux-fsdevel@vger.kernel.org; tinguely@sgi.com; xfs@oss.sgi.com
> Subject: Re: [RFC v2] Unicode/UTF-8 support for XFS
> 
> On Wed, Sep 24, 2014 at 03:21:04PM +0200, Olaf Weber wrote:
> > On 23-09-14 00:26, Dave Chinner wrote:
> > >On Thu, Sep 18, 2014 at 02:56:50PM -0500, Ben Myers wrote:
> >
> > [...]
> >
> > >>TODO: Store the unicode version number of the filesystem on disk in
> > >>the super block.
> > >
> > >So, if the filesystem has to store the specific unicode version it
> > >was created with so that we know what version to put in trie lookups,
> > >again I'll ask: why are we loading the trie as a generic kernel
> > >module and not as metadata in the filesystem that is demand paged and
> > >cached?
> >
> > This way the trie can be shared, and the code using it is not
> > entangled with the XFS code.
> 
> The trie parsing code can still be common - just the location and contents of the data is
> determined by the end-user.
> 

Both these approaches can co-exists (as I recall was done in Windows). A system may have "a default trie" and a caller (name space) can provide its own...

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC v2] Unicode/UTF-8 support for XFS
  2014-09-24 11:07           ` Olaf Weber
@ 2014-09-26 14:06             ` Olaf Weber
  0 siblings, 0 replies; 84+ messages in thread
From: Olaf Weber @ 2014-09-26 14:06 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-fsdevel, Ben Myers, tinguely, xfs

On 24-09-14 13:07, Olaf Weber wrote:
> On 23-09-14 22:15, Andi Kleen wrote:
>
>>> A big part of the table does decompositions for Korean: eliminating
>>> the Hangul decompositions removes 156320 bytes, leaving 89936 bytes.
>>
>> Are there regular ranges or other redundancies in the Korean encoding
>> that could be used to compress paths?
>
> Yes, though at the expense of more complicated code and interfaces. in
> particular, lookups that want a normalized string would need to provide a
> 10-byte buffer to store it in.

I spent some time working on this, and the effect on the lookup code isn't 
as bad as I'd thought. The updated code should be posted early next week.

With this change, the table size for the full trie becomes 89952 bytes. Of 
this, 66400 bytes are spent on the NFKD + Ignorables, an additional 20992 
bytes on NFDK + Ignorables + Case Fold. The remainder, 2560 bytes, are 
additional info for older unicode versions.

Note that the NFDK + Ignorables + Case Fold trie forwards to the NFKD + 
Ignorables where they overlap. A stand-alone version would be 71750 bytes.

As noted before these tables also contain the Canonical Combining Class and 
unicode version information for the code points. The latter allows for 
supporting multiple unicode versions using a single combined table.

Olaf

-- 
Olaf Weber                 SGI               Phone:  +31(0)30-6696796
                            Veldzigt 2b       Fax:    +31(0)30-6696799
Technical Lead             3454 PW de Meern  Vnet:   955-6796
Storage Software           The Netherlands   Email:  olaf@sgi.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC v2] Unicode/UTF-8 support for XFS
  2014-09-24 23:10       ` Dave Chinner
  (?)
  (?)
@ 2014-09-26 14:50       ` Olaf Weber
  2014-09-26 16:56           ` Christoph Hellwig
  -1 siblings, 1 reply; 84+ messages in thread
From: Olaf Weber @ 2014-09-26 14:50 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, Ben Myers, tinguely, xfs

On 25-09-14 01:10, Dave Chinner wrote:
> On Wed, Sep 24, 2014 at 03:21:04PM +0200, Olaf Weber wrote:
>> On 23-09-14 00:26, Dave Chinner wrote:
>>> On Thu, Sep 18, 2014 at 02:56:50PM -0500, Ben Myers wrote:
>>
>> [...]
>>
>>>> TODO: Store the unicode version number of the filesystem on disk in the
>>>> super block.
>>>
>>> So, if the filesystem has to store the specific unicode version it
>>> was created with so that we know what version to put in trie
>>> lookups, again I'll ask: why are we loading the trie as a generic
>>> kernel module and not as metadata in the filesystem that is demand
>>> paged and cached?
>>
>> This way the trie can be shared, and the code using it is not
>> entangled with the XFS code.
>
> The trie parsing code can still be common - just the location and
> contents of the data is determined by the end-user.

I'm not sure how common the parsing code can be if needs to be capable of 
retrieving data from a filesystem.

Note given your and Andi Kleen's feedback on the trie size I've switched to 
doing algorithmic decomposition for Hangul. This reduces the size of the 
trie to 89952 bytes.

In addition, if you store the trie in the filesystem, then the only part 
that needs storing is the version for that particular filesystem, e.g no 
compatibility info for different unicode versions would be required.  This 
would reduce the trie size to about 50kB for case-sensitive filesystems, and 
about 55kB on case-folding filesystems.

[...]

>>> [...] Why should we let filesystems say "we fully
>>> understand and support utf8" and then allow them to accept and
>>> propagate invalid utf8 sequences and leave everyone else to have to
>>> clean up the mess?
>>
>> Because the alternative amounts in my opinion to a demand that every
>> bit of userspace that may be involved in generating filenames
>> generate only clean UTF-8. I do not believe that this is a realistic
>> demand at this point in time.
>
> It's a chicken and egg situation. I'd much prefer we enforce clean
> utf8 from the start, because if we don't we'll never be able to do
> that. And other filesystems (e.g. ZFS) allow you to do reject
> anything that is not clean utf8....

As I understand it, this is optional in ZFS. I wonder what people's 
experiences are with this.

[...]

>>> Yet normalised strings are only stable and hence comparable
>>> if there are no unassigned code points in them.  What happens when
>>> userspace is not using the same version of unicode as the
>>> filesystem and is using newer code points in it's strings?
>>> Normalisation fails, right?
>>
>> For the newer code points, yes. This is not treated as a failure to
>> normalize the string as a whole, as there are clear guidelines in
>> unicode on how unassigned code points interact with normalization:
>> they have canonical combining class 0 and no decomposition.
>
> And so effectively are not stable. Which is something we absolutely
> have to avoid for information stored on disk. i.e. you're using the
> normalised form to build the hash values in the lookup index in the
> directory structure, and so having unstable normalisation forms is
> just wrong. Hence we'd need to reject anything with unassigned code
> points....

On a particular filesystem, the calculated normalization would be stable.

>>> And as an extension of using normalisation for case-folded
>>> comparisons, how do we make case folding work with blobs that can't
>>> be normalised? It seems to me that this just leads to the nasty
>>> situation where some filenames are case sensitive and some aren't
>>> based on what the filesystem thinks is valid utf-8. The worst part
>>> is that userspace has no idea that the filesystem is making such
>>> distinctions and so behaviour is not at all predictable or expected.
>>
>> Making case-folding work on a blob that cannot be normalized is (in
>> my opinion) akin to doing an ASCII-based casefold on a Shift-JIS
>> string: the result is neither pretty nor useful.
>
> Yes, that's exactly my point.

But apparently we draw different conclusions from it.

>>> This is another point in favour of rejecting invalid utf-8 strings
>>> and for keeping the translation tables stable within the
>>> filesystem...
>>
>> Bear in mind that this means not just rejecting invalid UTF-8
>> strings, but also rejecting valid UTF-8 strings that encode
>> unassigned code points.
>
> And that's precisely what I'm suggesting: If we can't normalise the
> filename to a stable form then it cannot be used for hashing or case
> folding. That means it needs to be rejected, not treated as an
> opaque blob.
>
> The moment we start parsing filenames they are no longer opaque
> blobs and so all existing "filename are opaque blobs" handling rules
> go out the window. They are now either valid so we can use them, or
> they are invalid and need to be rejected to avoid unpredictable
> and/or undesirable behaviour.

At this point I'd really like other people to weigh in on this and get a 
sense of how sentiment is spread on the question.

- Forbid non-UTF-8 filenames
- Allow non-UTF-8 filenames
- Make it a mount option
- Make it a mkfs option

[...]

>>>> The most contentious part is (should be) ignoring the codepoints with
>>>> the Default_Ignorable_Code_Point property. I've included the list
>>>> below. My argument, such as it is, is that these code points either
>>>> have no visible rendering, or in cases like the soft hyphen, are only
>>>> conditionally visible. The problem with these (as I see it) is that on
>>>> seeing a filename that might contain them you cannot tell whether they
>>>> are present. So I propose to ignore them for the purpose of comparing
>>>> filenames for equality.
>>>
>>> Which introduces a non-standard "visibility criterial" for
>>> determining what should be or shouldn't be part of the normalised
>>> string for comparison. I don't see any real justification for
>>> stepping outside the standard unicode normalisation here - just
>>> because the user cannot see a character in a specific context does
>>> not mean that it is not significant to the application that created
>>> it.
>>
>> I agree these characters may be significant to the application. I'm
>> just not convinced that they should be significant in a file name.
>
> They are significant to the case folding result, right? And
> therefore would be significant in a filename...

Case Folding doesn't affect the ignorables, so in that sense at least 
they're not significant to the case folding result, even if you do not 
ignore them.

[...]

> Hence my comments about NLS integration. The NLS subsystem already
> has utf8 support with language dependent case folding tables.  All the
 > current filesystems that deal with unicode (including case folding)
 > use the NLS subsystem for conversions.

Looking at the NLS subsystem I see support for translating a number of 
different encodings ("code pages") to unicode and back.

There is support for uppercase/lowercase translation for a number of those 
encodings. Which is not the same as language dependent case folding.

As for a unicode case fold, I see no support at all. In nls_utf8.c the 
uppercase/lowercase mappings are set to the identity maps.

I see no support for unicode normalization forms either.

> Hmmm - looking at all the NLS code that does different utf format
> conversions first: what happens if an application is using UTF16 or
> UTF32 for it's filename encoding rather than utf8?

Since UTF-16 and UTF-32 strings contain embedded 0 bytes, those encodings 
cannot be used to pass a filename across the kernel/userspace interface.


>>>> * XFS-specific design notes.
>>> ...
>>>> If the borgbit (the bit enabling legacy ASCII-based CI in XFS) is set
>>>> in the superblock, then case folding is added into the mix. This is
>>>> the nfkdicf normalization form mentioned above. It allows for the
>>>> creation of case-insensitive filesystems with UTF-8 support.
>>>
>>> Please don't overload existing superblock feature bits with multiple
>>> meanings. ASCII-CI is a stand-alone feature and is not in any way
>>> compatible with Unicode: Unicode-CI is a superset of Unicode
>>> support. So it really needs two new feature bits for Unicode and
>>> Unicode-CI, not just one for unicode.
>>
>> It seemed an obvious extension of the meaning of that bit.
>
> Feature bits refer to a specific on disk format feature. If that bit
> is set, then that feature is present. In this case, it means the
> filesystem is using ascii-ci. If that bit is passed out to
> userspace via the geometry ioctl, then *existing applications*
> expect it to mean ascii-ci behaviour from the filesystem. If an
> existing utility reads the flag field from disk (e.g. repair,
> metadump, db, etc) they all expect it to mean ascii-ci, and will do
> stuff based on that specific meaning. We cannot redefine the meaning
> of a feature bit after the fact - we have lots of feature bits so
> there's no need to overload an existing one for this.

Good point.

> Hmmm - another interesting question just popped into my head about
> metadump: file name obfuscation.  What does unicode and utf8 mean
> for the hash collision calculation algorithm?

Good question.

Olaf

-- 
Olaf Weber                 SGI               Phone:  +31(0)30-6696796
                            Veldzigt 2b       Fax:    +31(0)30-6696799
Technical Lead             3454 PW de Meern  Vnet:   955-6796
Storage Software           The Netherlands   Email:  olaf@sgi.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC v2] Unicode/UTF-8 support for XFS
  2014-09-26 14:50       ` Olaf Weber
@ 2014-09-26 16:56           ` Christoph Hellwig
  0 siblings, 0 replies; 84+ messages in thread
From: Christoph Hellwig @ 2014-09-26 16:56 UTC (permalink / raw)
  To: Olaf Weber; +Cc: Dave Chinner, Ben Myers, linux-fsdevel, tinguely, xfs

On Fri, Sep 26, 2014 at 04:50:39PM +0200, Olaf Weber wrote:
> I'm not sure how common the parsing code can be if needs to be capable of
> retrieving data from a filesystem.
> 
> Note given your and Andi Kleen's feedback on the trie size I've switched to
> doing algorithmic decomposition for Hangul. This reduces the size of the
> trie to 89952 bytes.
> 
> In addition, if you store the trie in the filesystem, then the only part
> that needs storing is the version for that particular filesystem, e.g no
> compatibility info for different unicode versions would be required.  This
> would reduce the trie size to about 50kB for case-sensitive filesystems, and
> about 55kB on case-folding filesystems.

Honestly I wouldn't worry about demand loading it too much.  This is a
fairly special case code for NAS servers, and should not affect normal
uses now that we use symbol_get.  Let's get back to the fundamentals.

> >It's a chicken and egg situation. I'd much prefer we enforce clean
> >utf8 from the start, because if we don't we'll never be able to do
> >that. And other filesystems (e.g. ZFS) allow you to do reject
> >anything that is not clean utf8....
> 
> As I understand it, this is optional in ZFS. I wonder what people's
> experiences are with this.

It is as optional as your utf8 support for XFS is.  But they do
enforce valid utf8 if they use utf8 normalization for file name
comparisms, be that case sensitive or insensitive.  Take a look at the
zfs(8) man page.

> - Forbid non-UTF-8 filenames
> - Allow non-UTF-8 filenames
> - Make it a mount option
> - Make it a mkfs option

My take on this is:

 - I think we'll have to prevent non-utf8 file names for any cases where
   we use utf8 normalization.  If you do not use utf8 normalization
   it's plain old Unix everything is allowed.
 
 - I think utf8 normalization vs not should be mkfs option, to make sure
   everyone including kernel and repair knows what sort of filesystem
   deal with.

 - case insensitive matching for utf8 normalized filesystems should be
   a runtime decision.  mount time for now, but Samba people would be
   extremly happy to allow per-operation or per-process CI matching.
   But that is another totally different discusion I'd like to keep
   separate, I just want to make sure the disk format allows for it for
   now.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC v2] Unicode/UTF-8 support for XFS
@ 2014-09-26 16:56           ` Christoph Hellwig
  0 siblings, 0 replies; 84+ messages in thread
From: Christoph Hellwig @ 2014-09-26 16:56 UTC (permalink / raw)
  To: Olaf Weber; +Cc: linux-fsdevel, Ben Myers, tinguely, xfs

On Fri, Sep 26, 2014 at 04:50:39PM +0200, Olaf Weber wrote:
> I'm not sure how common the parsing code can be if needs to be capable of
> retrieving data from a filesystem.
> 
> Note given your and Andi Kleen's feedback on the trie size I've switched to
> doing algorithmic decomposition for Hangul. This reduces the size of the
> trie to 89952 bytes.
> 
> In addition, if you store the trie in the filesystem, then the only part
> that needs storing is the version for that particular filesystem, e.g no
> compatibility info for different unicode versions would be required.  This
> would reduce the trie size to about 50kB for case-sensitive filesystems, and
> about 55kB on case-folding filesystems.

Honestly I wouldn't worry about demand loading it too much.  This is a
fairly special case code for NAS servers, and should not affect normal
uses now that we use symbol_get.  Let's get back to the fundamentals.

> >It's a chicken and egg situation. I'd much prefer we enforce clean
> >utf8 from the start, because if we don't we'll never be able to do
> >that. And other filesystems (e.g. ZFS) allow you to do reject
> >anything that is not clean utf8....
> 
> As I understand it, this is optional in ZFS. I wonder what people's
> experiences are with this.

It is as optional as your utf8 support for XFS is.  But they do
enforce valid utf8 if they use utf8 normalization for file name
comparisms, be that case sensitive or insensitive.  Take a look at the
zfs(8) man page.

> - Forbid non-UTF-8 filenames
> - Allow non-UTF-8 filenames
> - Make it a mount option
> - Make it a mkfs option

My take on this is:

 - I think we'll have to prevent non-utf8 file names for any cases where
   we use utf8 normalization.  If you do not use utf8 normalization
   it's plain old Unix everything is allowed.
 
 - I think utf8 normalization vs not should be mkfs option, to make sure
   everyone including kernel and repair knows what sort of filesystem
   deal with.

 - case insensitive matching for utf8 normalized filesystems should be
   a runtime decision.  mount time for now, but Samba people would be
   extremly happy to allow per-operation or per-process CI matching.
   But that is another totally different discusion I'd like to keep
   separate, I just want to make sure the disk format allows for it for
   now.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC v2] Unicode/UTF-8 support for XFS
  2014-09-26 16:56           ` Christoph Hellwig
  (?)
@ 2014-09-26 17:04           ` Jeremy Allison
  2014-09-26 17:06             ` Christoph Hellwig
  2014-09-26 19:37             ` Olaf Weber
  -1 siblings, 2 replies; 84+ messages in thread
From: Jeremy Allison @ 2014-09-26 17:04 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: tinguely, xfs, Ben Myers, Olaf Weber, linux-fsdevel

On Fri, Sep 26, 2014 at 09:56:05AM -0700, Christoph Hellwig wrote:
> 
> My take on this is:
> 
>  - I think we'll have to prevent non-utf8 file names for any cases where
>    we use utf8 normalization.  If you do not use utf8 normalization
>    it's plain old Unix everything is allowed.
>  
>  - I think utf8 normalization vs not should be mkfs option, to make sure
>    everyone including kernel and repair knows what sort of filesystem
>    deal with.
> 
>  - case insensitive matching for utf8 normalized filesystems should be
>    a runtime decision.  mount time for now, but Samba people would be
>    extremly happy to allow per-operation or per-process CI matching.
>    But that is another totally different discusion I'd like to keep
>    separate, I just want to make sure the disk format allows for it for
>    now.

Actually, I'm so eager for case-insensitive matching I'd
take "at format time", as with ZFS :-) :-).

Having CI matching can speed up Samba operations by a
factor of 10 on large directories (warning, number made
up, depending on the number of entries per dir :-).

Jeremy.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC v2] Unicode/UTF-8 support for XFS
  2014-09-26 17:04           ` Jeremy Allison
@ 2014-09-26 17:06             ` Christoph Hellwig
  2014-09-26 17:13                 ` Jeremy Allison
  2014-09-26 19:37             ` Olaf Weber
  1 sibling, 1 reply; 84+ messages in thread
From: Christoph Hellwig @ 2014-09-26 17:06 UTC (permalink / raw)
  To: Jeremy Allison; +Cc: tinguely, xfs, Ben Myers, Olaf Weber, linux-fsdevel

On Fri, Sep 26, 2014 at 10:04:07AM -0700, Jeremy Allison wrote:
> Actually, I'm so eager for case-insensitive matching I'd
> take "at format time", as with ZFS :-) :-).

You already get this with XFS as long as you limit yourself to
7-bit ASCII :)  And utf-8 with Olaf's patches as-is.

Maybe time to give them some testing?

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 06/10] xfs: add unicode character database files
  2014-09-22 20:54     ` Dave Chinner
  (?)
@ 2014-09-26 17:09     ` Christoph Hellwig
  -1 siblings, 0 replies; 84+ messages in thread
From: Christoph Hellwig @ 2014-09-26 17:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, Ben Myers, tinguely, olaf, xfs

On Tue, Sep 23, 2014 at 06:54:38AM +1000, Dave Chinner wrote:
> This probably needs to live somewhere under lib/.  There's nothing
> XFS specific in it and the translations should be the same for
> anything that wants to parse unicode.

Agreed.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 07/10] xfs: add trie generator and supporting code for UTF-8.
  2014-09-23 18:57     ` Ben Myers
@ 2014-09-26 17:10       ` Christoph Hellwig
  0 siblings, 0 replies; 84+ messages in thread
From: Christoph Hellwig @ 2014-09-26 17:10 UTC (permalink / raw)
  To: Ben Myers; +Cc: linux-fsdevel, tinguely, olaf, xfs

On Tue, Sep 23, 2014 at 01:57:21PM -0500, Ben Myers wrote:
> I'll get this moved to lib/ as you suggested elsewhere in the
> thread.

Given that this is a host side tool it should be under scripts/

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC v2] Unicode/UTF-8 support for XFS
  2014-09-26 17:06             ` Christoph Hellwig
@ 2014-09-26 17:13                 ` Jeremy Allison
  0 siblings, 0 replies; 84+ messages in thread
From: Jeremy Allison @ 2014-09-26 17:13 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jeremy Allison, Olaf Weber, Dave Chinner, Ben Myers,
	linux-fsdevel, tinguely, xfs

On Fri, Sep 26, 2014 at 10:06:04AM -0700, Christoph Hellwig wrote:
> On Fri, Sep 26, 2014 at 10:04:07AM -0700, Jeremy Allison wrote:
> > Actually, I'm so eager for case-insensitive matching I'd
> > take "at format time", as with ZFS :-) :-).
> 
> You already get this with XFS as long as you limit yourself to
> 7-bit ASCII :)

Thankyou for playing, here's a $10 gift token... No, that
won't do I'm afraid :-).

> And utf-8 with Olaf's patches as-is.
> 
> Maybe time to give them some testing?

I might do that, once I've finished rebuilding
my home server (remember kids, RAID5 is *NOT* a
backup :-) :-).

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC v2] Unicode/UTF-8 support for XFS
@ 2014-09-26 17:13                 ` Jeremy Allison
  0 siblings, 0 replies; 84+ messages in thread
From: Jeremy Allison @ 2014-09-26 17:13 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: tinguely, xfs, Ben Myers, Olaf Weber, linux-fsdevel, Jeremy Allison

On Fri, Sep 26, 2014 at 10:06:04AM -0700, Christoph Hellwig wrote:
> On Fri, Sep 26, 2014 at 10:04:07AM -0700, Jeremy Allison wrote:
> > Actually, I'm so eager for case-insensitive matching I'd
> > take "at format time", as with ZFS :-) :-).
> 
> You already get this with XFS as long as you limit yourself to
> 7-bit ASCII :)

Thankyou for playing, here's a $10 gift token... No, that
won't do I'm afraid :-).

> And utf-8 with Olaf's patches as-is.
> 
> Maybe time to give them some testing?

I might do that, once I've finished rebuilding
my home server (remember kids, RAID5 is *NOT* a
backup :-) :-).

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC v2] Unicode/UTF-8 support for XFS
  2014-09-26 16:56           ` Christoph Hellwig
  (?)
  (?)
@ 2014-09-26 17:30           ` Ben Myers
  -1 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-26 17:30 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, tinguely, Olaf Weber, xfs

Hey Christoph,

On Fri, Sep 26, 2014 at 09:56:05AM -0700, Christoph Hellwig wrote:
> On Fri, Sep 26, 2014 at 04:50:39PM +0200, Olaf Weber wrote:
> > >It's a chicken and egg situation. I'd much prefer we enforce clean
> > >utf8 from the start, because if we don't we'll never be able to do
> > >that. And other filesystems (e.g. ZFS) allow you to do reject
> > >anything that is not clean utf8....
> > 
> > As I understand it, this is optional in ZFS. I wonder what people's
> > experiences are with this.
> 
> It is as optional as your utf8 support for XFS is.  But they do
> enforce valid utf8 if they use utf8 normalization for file name
> comparisms, be that case sensitive or insensitive.  Take a look at the
> zfs(8) man page.

The way I'm reading that man page, it seems like with ZFS you have one
option to choose whether to use normalization:

'nomalization = "none|FormD|FormKCf"'

And a separate option to choose whether to accept non-utf8 filenames:

'utf8only = "on|off".

The default setting appears to be that ZFS does allow non-utf8
filenames.

Whereas with Olaf's series you have one option that turns normalization
on or off, and he is not giving you a choice of whether non-utf8
filenames will be accepted (they will be accepted).

So IIUC there is a distinction:  The utf8 support for ZFS is "more
optional" than the utf8 support for XFS.

Regards,
	Ben

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC v2] Unicode/UTF-8 support for XFS
  2014-09-26 17:04           ` Jeremy Allison
  2014-09-26 17:06             ` Christoph Hellwig
@ 2014-09-26 19:37             ` Olaf Weber
  2014-09-26 19:46               ` Jeremy Allison
  2014-09-29 11:06                 ` Christoph Hellwig
  1 sibling, 2 replies; 84+ messages in thread
From: Olaf Weber @ 2014-09-26 19:37 UTC (permalink / raw)
  To: Jeremy Allison, Christoph Hellwig; +Cc: linux-fsdevel, Ben Myers, tinguely, xfs

On 26-09-14 19:04, Jeremy Allison wrote:
> On Fri, Sep 26, 2014 at 09:56:05AM -0700, Christoph Hellwig wrote:
>>
>> My take on this is:
>>
>>   - I think we'll have to prevent non-utf8 file names for any cases where
>>     we use utf8 normalization.  If you do not use utf8 normalization
>>     it's plain old Unix everything is allowed.
>>
>>   - I think utf8 normalization vs not should be mkfs option, to make sure
>>     everyone including kernel and repair knows what sort of filesystem
>>     deal with.
>>
>>   - case insensitive matching for utf8 normalized filesystems should be
>>     a runtime decision.  mount time for now, but Samba people would be
>>     extremly happy to allow per-operation or per-process CI matching.
>>     But that is another totally different discusion I'd like to keep
>>     separate, I just want to make sure the disk format allows for it for
>>     now.
>
> Actually, I'm so eager for case-insensitive matching I'd
> take "at format time", as with ZFS :-) :-).

My argument against "mount time case-insensitivity" and for "mkfs time 
case-insensitivity" is related to switching from the case-sensitive domain 
to the case-insensitive one.

For case-sensitive, from "README" to "readme" there are 64 different 
possible filenames.  Let's say you create 63 out of these 64. Now remount 
the filesystem case-insensitive, and try to open by the 64th version of 
"readme". It is not an exact match for any of the 63 candidate files, and a 
case-insensitive match to all 63 candidate files. Which of these 63 files 
should be opened, and why that one in particular?

> Having CI matching can speed up Samba operations by a
> factor of 10 on large directories (warning, number made
> up, depending on the number of entries per dir :-).

I really want that to be true, but the proof of the pudding...

Olaf

-- 
Olaf Weber                 SGI               Phone:  +31(0)30-6696796
                            Veldzigt 2b       Fax:    +31(0)30-6696799
Technical Lead             3454 PW de Meern  Vnet:   955-6796
Storage Software           The Netherlands   Email:  olaf@sgi.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC v2] Unicode/UTF-8 support for XFS
  2014-09-26 19:37             ` Olaf Weber
@ 2014-09-26 19:46               ` Jeremy Allison
  2014-09-26 20:03                 ` Olaf Weber
  2014-09-29 11:06                 ` Christoph Hellwig
  1 sibling, 1 reply; 84+ messages in thread
From: Jeremy Allison @ 2014-09-26 19:46 UTC (permalink / raw)
  To: Olaf Weber
  Cc: tinguely, xfs, Christoph Hellwig, Ben Myers, linux-fsdevel,
	Jeremy Allison

On Fri, Sep 26, 2014 at 09:37:11PM +0200, Olaf Weber wrote:
> 
> My argument against "mount time case-insensitivity" and for "mkfs
> time case-insensitivity" is related to switching from the
> case-sensitive domain to the case-insensitive one.
> 
> For case-sensitive, from "README" to "readme" there are 64 different
> possible filenames.  Let's say you create 63 out of these 64. Now
> remount the filesystem case-insensitive, and try to open by the 64th
> version of "readme". It is not an exact match for any of the 63
> candidate files, and a case-insensitive match to all 63 candidate
> files. Which of these 63 files should be opened, and why that one in
> particular?

I'm ok with "mkfs time case-insensitivity" - really !
Most of my OEMs would set that and claim victory (few
of them care much about NFS semantics :-).

> >Having CI matching can speed up Samba operations by a
> >factor of 10 on large directories (warning, number made
> >up, depending on the number of entries per dir :-).
> 
> I really want that to be true, but the proof of the pudding...

No it really *is* true. The reason I can't give
exact numbers is it depends on the number of entries.

Remember, for every cache *miss*, we have to scan
the entire directory.

So a user asks for README, and we attempt that
and it fails. So now we have to enumerate the
entire directory to see if READMe (or any other
case varient) exists.

Now do that in a directory with 10, 100, 1000,
.... 10000000 existing files (don't laugh, I've
seen an application for Music files that did
*exactly* that). On a case insensitive filesystem
you just request README and you're done.

Certain vendors who shall remain nameless :-)
created test cases of just this example to
show how much storage on Linux sucks. Not
a happy camper about that - and telling them
to use ZFS on FreeBSD or Solaris just doesn't
feel right :-).

Jeremy.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC v2] Unicode/UTF-8 support for XFS
  2014-09-26 19:46               ` Jeremy Allison
@ 2014-09-26 20:03                 ` Olaf Weber
  2014-09-29 20:16                     ` J. Bruce Fields
  0 siblings, 1 reply; 84+ messages in thread
From: Olaf Weber @ 2014-09-26 20:03 UTC (permalink / raw)
  To: Jeremy Allison; +Cc: tinguely, xfs, Christoph Hellwig, Ben Myers, linux-fsdevel

On 26-09-14 21:46, Jeremy Allison wrote:
> On Fri, Sep 26, 2014 at 09:37:11PM +0200, Olaf Weber wrote:
>>
>> My argument against "mount time case-insensitivity" and for "mkfs
>> time case-insensitivity" is related to switching from the
>> case-sensitive domain to the case-insensitive one.
>>
>> For case-sensitive, from "README" to "readme" there are 64 different
>> possible filenames.  Let's say you create 63 out of these 64. Now
>> remount the filesystem case-insensitive, and try to open by the 64th
>> version of "readme". It is not an exact match for any of the 63
>> candidate files, and a case-insensitive match to all 63 candidate
>> files. Which of these 63 files should be opened, and why that one in
>> particular?
>
> I'm ok with "mkfs time case-insensitivity" - really !
> Most of my OEMs would set that and claim victory (few
> of them care much about NFS semantics :-).

I'd say you can have CIFS-style case-insensitive semantics or NFS-style 
case-sensitive semantics, but not both. And in particular, that a customer 
should not actually want to have both.

>>> Having CI matching can speed up Samba operations by a
>>> factor of 10 on large directories (warning, number made
>>> up, depending on the number of entries per dir :-).
>>
>> I really want that to be true, but the proof of the pudding...
>
> No it really *is* true. The reason I can't give
> exact numbers is it depends on the number of entries.
>
> Remember, for every cache *miss*, we have to scan
> the entire directory.
>
> So a user asks for README, and we attempt that
> and it fails. So now we have to enumerate the
> entire directory to see if READMe (or any other
> case varient) exists.
>
> Now do that in a directory with 10, 100, 1000,
> .... 10000000 existing files (don't laugh, I've
> seen an application for Music files that did
> *exactly* that). On a case insensitive filesystem
> you just request README and you're done.
>
> Certain vendors who shall remain nameless :-)
> created test cases of just this example to
> show how much storage on Linux sucks. Not
> a happy camper about that - and telling them
> to use ZFS on FreeBSD or Solaris just doesn't
> feel right :-).

Here's the thing to bear in mind: what I did is a straightforward extension 
of the existing XFS ASCII-based case-insensitive code. If that gets you the 
desired performance improvement, then my code should extend that to more 
general usage. If it doesn't, then there are places in XFS that I haven't 
touched that need modification to have these cases work well.

Olaf

-- 
Olaf Weber                 SGI               Phone:  +31(0)30-6696796
                            Veldzigt 2b       Fax:    +31(0)30-6696799
Technical Lead             3454 PW de Meern  Vnet:   955-6796
Storage Software           The Netherlands   Email:  olaf@sgi.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC v2] Unicode/UTF-8 support for XFS
  2014-09-26 19:37             ` Olaf Weber
@ 2014-09-29 11:06                 ` Christoph Hellwig
  2014-09-29 11:06                 ` Christoph Hellwig
  1 sibling, 0 replies; 84+ messages in thread
From: Christoph Hellwig @ 2014-09-29 11:06 UTC (permalink / raw)
  To: Olaf Weber
  Cc: Jeremy Allison, Christoph Hellwig, Dave Chinner, Ben Myers,
	linux-fsdevel, tinguely, xfs

On Fri, Sep 26, 2014 at 09:37:11PM +0200, Olaf Weber wrote:
> My argument against "mount time case-insensitivity" and for "mkfs time
> case-insensitivity" is related to switching from the case-sensitive domain
> to the case-insensitive one.
> 
> For case-sensitive, from "README" to "readme" there are 64 different
> possible filenames.  Let's say you create 63 out of these 64. Now remount
> the filesystem case-insensitive, and try to open by the 64th version of
> "readme". It is not an exact match for any of the 63 candidate files, and a
> case-insensitive match to all 63 candidate files. Which of these 63 files
> should be opened, and why that one in particular?

Well, the point is not that we use the CI-capable hash all the time.  I
fully expect the current XFS behavior to remain the default for normal
systems forever.  I just want to make sure that the CI implementation
you chose can also allow mixed lookups if we desire.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC v2] Unicode/UTF-8 support for XFS
@ 2014-09-29 11:06                 ` Christoph Hellwig
  0 siblings, 0 replies; 84+ messages in thread
From: Christoph Hellwig @ 2014-09-29 11:06 UTC (permalink / raw)
  To: Olaf Weber
  Cc: tinguely, xfs, Christoph Hellwig, Ben Myers, linux-fsdevel,
	Jeremy Allison

On Fri, Sep 26, 2014 at 09:37:11PM +0200, Olaf Weber wrote:
> My argument against "mount time case-insensitivity" and for "mkfs time
> case-insensitivity" is related to switching from the case-sensitive domain
> to the case-insensitive one.
> 
> For case-sensitive, from "README" to "readme" there are 64 different
> possible filenames.  Let's say you create 63 out of these 64. Now remount
> the filesystem case-insensitive, and try to open by the 64th version of
> "readme". It is not an exact match for any of the 63 candidate files, and a
> case-insensitive match to all 63 candidate files. Which of these 63 files
> should be opened, and why that one in particular?

Well, the point is not that we use the CI-capable hash all the time.  I
fully expect the current XFS behavior to remain the default for normal
systems forever.  I just want to make sure that the CI implementation
you chose can also allow mixed lookups if we desire.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC v2] Unicode/UTF-8 support for XFS
  2014-09-26 20:03                 ` Olaf Weber
@ 2014-09-29 20:16                     ` J. Bruce Fields
  0 siblings, 0 replies; 84+ messages in thread
From: J. Bruce Fields @ 2014-09-29 20:16 UTC (permalink / raw)
  To: Olaf Weber
  Cc: Jeremy Allison, Christoph Hellwig, Dave Chinner, Ben Myers,
	linux-fsdevel, tinguely, xfs

On Fri, Sep 26, 2014 at 10:03:50PM +0200, Olaf Weber wrote:
> On 26-09-14 21:46, Jeremy Allison wrote:
> >On Fri, Sep 26, 2014 at 09:37:11PM +0200, Olaf Weber wrote:
> >>
> >>My argument against "mount time case-insensitivity" and for "mkfs
> >>time case-insensitivity" is related to switching from the
> >>case-sensitive domain to the case-insensitive one.
> >>
> >>For case-sensitive, from "README" to "readme" there are 64 different
> >>possible filenames.  Let's say you create 63 out of these 64. Now
> >>remount the filesystem case-insensitive, and try to open by the 64th
> >>version of "readme". It is not an exact match for any of the 63
> >>candidate files, and a case-insensitive match to all 63 candidate
> >>files. Which of these 63 files should be opened, and why that one in
> >>particular?
> >
> >I'm ok with "mkfs time case-insensitivity" - really !
> >Most of my OEMs would set that and claim victory (few
> >of them care much about NFS semantics :-).
> 
> I'd say you can have CIFS-style case-insensitive semantics or
> NFS-style case-sensitive semantics, but not both.

Note the NFSv4 specs do claim to allow case insensitivity.  No idea how
well clients deal with it.

I think rfc3530bis has the most up to date language on NFSv4
internationalization issues:

	http://tools.ietf.org/html/draft-ietf-nfsv4-rfc3530bis-33

(One nit in the current knfsd: the server doesn't correctly report the
case_insensitive attribute.  If it had some flag it could check in the
filesystem's superblock then it could do that right instead of just
assuming 0 as it currently does (see FATTR4_WORD0_CASE_INSENSITIVE in
fs/nfsd/nfs4xdr.c:nfsd4_encode_fattr).)

--b.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC v2] Unicode/UTF-8 support for XFS
@ 2014-09-29 20:16                     ` J. Bruce Fields
  0 siblings, 0 replies; 84+ messages in thread
From: J. Bruce Fields @ 2014-09-29 20:16 UTC (permalink / raw)
  To: Olaf Weber
  Cc: tinguely, xfs, Christoph Hellwig, Ben Myers, linux-fsdevel,
	Jeremy Allison

On Fri, Sep 26, 2014 at 10:03:50PM +0200, Olaf Weber wrote:
> On 26-09-14 21:46, Jeremy Allison wrote:
> >On Fri, Sep 26, 2014 at 09:37:11PM +0200, Olaf Weber wrote:
> >>
> >>My argument against "mount time case-insensitivity" and for "mkfs
> >>time case-insensitivity" is related to switching from the
> >>case-sensitive domain to the case-insensitive one.
> >>
> >>For case-sensitive, from "README" to "readme" there are 64 different
> >>possible filenames.  Let's say you create 63 out of these 64. Now
> >>remount the filesystem case-insensitive, and try to open by the 64th
> >>version of "readme". It is not an exact match for any of the 63
> >>candidate files, and a case-insensitive match to all 63 candidate
> >>files. Which of these 63 files should be opened, and why that one in
> >>particular?
> >
> >I'm ok with "mkfs time case-insensitivity" - really !
> >Most of my OEMs would set that and claim victory (few
> >of them care much about NFS semantics :-).
> 
> I'd say you can have CIFS-style case-insensitive semantics or
> NFS-style case-sensitive semantics, but not both.

Note the NFSv4 specs do claim to allow case insensitivity.  No idea how
well clients deal with it.

I think rfc3530bis has the most up to date language on NFSv4
internationalization issues:

	http://tools.ietf.org/html/draft-ietf-nfsv4-rfc3530bis-33

(One nit in the current knfsd: the server doesn't correctly report the
case_insensitive attribute.  If it had some flag it could check in the
filesystem's superblock then it could do that right instead of just
assuming 0 as it currently does (see FATTR4_WORD0_CASE_INSENSITIVE in
fs/nfsd/nfs4xdr.c:nfsd4_encode_fattr).)

--b.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH 09/13] libxfs: apply utf-8 normalization rules to user extended attribute names
  2014-09-11 20:37 [RFC] " Ben Myers
@ 2014-09-11 21:01 ` Ben Myers
  0 siblings, 0 replies; 84+ messages in thread
From: Ben Myers @ 2014-09-11 21:01 UTC (permalink / raw)
  To: xfs; +Cc: tinguely, olaf

From: Olaf Weber <olaf@sgi.com>

Apply the same rules for UTF-8 normalization to the names of user-defined
extended attributes. System attributes are excluded because they are not
user-visible in the first place, and the kernel is expected to know what
it is doing when naming them.

Signed-off-by: Olaf Weber <olaf@sgi.com>
---
 libxfs/xfs_attr.c      | 49 +++++++++++++++++++++++++++++++++++++++++--------
 libxfs/xfs_attr_leaf.c | 11 +++++++++--
 libxfs/xfs_utf8.c      |  7 +++++++
 3 files changed, 57 insertions(+), 10 deletions(-)

diff --git a/libxfs/xfs_attr.c b/libxfs/xfs_attr.c
index 17519d3..c30703b 100644
--- a/libxfs/xfs_attr.c
+++ b/libxfs/xfs_attr.c
@@ -88,8 +88,9 @@ xfs_attr_get_int(
 	int			*valuelenp,
 	int			flags)
 {
-	xfs_da_args_t   args;
-	int             error;
+	xfs_da_args_t   	args;
+	struct xfs_mount	*mp = ip->i_mount;
+	int             	error;
 
 	if (!xfs_inode_hasattr(ip))
 		return ENOATTR;
@@ -103,9 +104,12 @@ xfs_attr_get_int(
 	args.value = value;
 	args.valuelen = *valuelenp;
 	args.flags = flags;
-	args.hashval = xfs_da_hashname(args.name, args.namelen);
 	args.dp = ip;
 	args.whichfork = XFS_ATTR_FORK;
+	if (! xfs_sb_version_hasutf8(&mp->m_sb))
+		args.hashval = xfs_da_hashname(args.name, args.namelen);
+	else if ((error = mp->m_dirnameops->normhash(&args)) != 0)
+		return error;
 
 	/*
 	 * Decide on what work routines to call based on the inode size.
@@ -118,6 +122,9 @@ xfs_attr_get_int(
 		error = xfs_attr_node_get(&args);
 	}
 
+	if (args.norm)
+		kmem_free((void *)args.norm);
+
 	/*
 	 * Return the number of bytes in the value to the caller.
 	 */
@@ -239,12 +246,15 @@ xfs_attr_set_int(
 	args.value = value;
 	args.valuelen = valuelen;
 	args.flags = flags;
-	args.hashval = xfs_da_hashname(args.name, args.namelen);
 	args.dp = dp;
 	args.firstblock = &firstblock;
 	args.flist = &flist;
 	args.whichfork = XFS_ATTR_FORK;
 	args.op_flags = XFS_DA_OP_ADDNAME | XFS_DA_OP_OKNOENT;
+	if (! xfs_sb_version_hasutf8(&mp->m_sb))
+		args.hashval = xfs_da_hashname(args.name, args.namelen);
+	else if ((error = mp->m_dirnameops->normhash(&args)) != 0)
+		return error;
 
 	/* Size is now blocks for attribute data */
 	args.total = xfs_attr_calc_size(dp, name->len, valuelen, &local);
@@ -276,6 +286,8 @@ xfs_attr_set_int(
 	error = xfs_trans_reserve(args.trans, &tres, args.total, 0);
 	if (error) {
 		xfs_trans_cancel(args.trans, 0);
+		if (args.norm)
+			kmem_free((void *)args.norm);
 		return(error);
 	}
 	xfs_ilock(dp, XFS_ILOCK_EXCL);
@@ -286,6 +298,8 @@ xfs_attr_set_int(
 	if (error) {
 		xfs_iunlock(dp, XFS_ILOCK_EXCL);
 		xfs_trans_cancel(args.trans, XFS_TRANS_RELEASE_LOG_RES);
+		if (args.norm)
+			kmem_free((void *)args.norm);
 		return (error);
 	}
 
@@ -333,7 +347,8 @@ xfs_attr_set_int(
 			err2 = xfs_trans_commit(args.trans,
 						 XFS_TRANS_RELEASE_LOG_RES);
 			xfs_iunlock(dp, XFS_ILOCK_EXCL);
-
+			if (args.norm)
+				kmem_free((void *)args.norm);
 			return(error == 0 ? err2 : error);
 		}
 
@@ -398,6 +413,8 @@ xfs_attr_set_int(
 	xfs_trans_log_inode(args.trans, dp, XFS_ILOG_CORE);
 	error = xfs_trans_commit(args.trans, XFS_TRANS_RELEASE_LOG_RES);
 	xfs_iunlock(dp, XFS_ILOCK_EXCL);
+	if (args.norm)
+		kmem_free((void *)args.norm);
 
 	return(error);
 
@@ -406,6 +423,9 @@ out:
 		xfs_trans_cancel(args.trans,
 			XFS_TRANS_RELEASE_LOG_RES|XFS_TRANS_ABORT);
 	xfs_iunlock(dp, XFS_ILOCK_EXCL);
+	if (args.norm)
+		kmem_free((void *)args.norm);
+
 	return(error);
 }
 
@@ -452,12 +472,15 @@ xfs_attr_remove_int(xfs_inode_t *dp, struct xfs_name *name, int flags)
 	args.name = name->name;
 	args.namelen = name->len;
 	args.flags = flags;
-	args.hashval = xfs_da_hashname(args.name, args.namelen);
 	args.dp = dp;
 	args.firstblock = &firstblock;
 	args.flist = &flist;
 	args.total = 0;
 	args.whichfork = XFS_ATTR_FORK;
+	if (! xfs_sb_version_hasutf8(&mp->m_sb))
+		args.hashval = xfs_da_hashname(args.name, args.namelen);
+	else if ((error = mp->m_dirnameops->normhash(&args)) != 0)
+		return error;
 
 	/*
 	 * we have no control over the attribute names that userspace passes us
@@ -470,8 +493,11 @@ xfs_attr_remove_int(xfs_inode_t *dp, struct xfs_name *name, int flags)
 	 * Attach the dquots to the inode.
 	 */
 	error = xfs_qm_dqattach(dp, 0);
-	if (error)
-		return error;
+	if (error) {
+		if (args.norm)
+			kmem_free((void *)args.norm);
+			return error;
+	}
 
 	/*
 	 * Start our first transaction of the day.
@@ -497,6 +523,8 @@ xfs_attr_remove_int(xfs_inode_t *dp, struct xfs_name *name, int flags)
 				  XFS_ATTRRM_SPACE_RES(mp), 0);
 	if (error) {
 		xfs_trans_cancel(args.trans, 0);
+		if (args.norm)
+			kmem_free((void *)args.norm);
 		return(error);
 	}
 
@@ -546,6 +574,8 @@ xfs_attr_remove_int(xfs_inode_t *dp, struct xfs_name *name, int flags)
 	xfs_trans_log_inode(args.trans, dp, XFS_ILOG_CORE);
 	error = xfs_trans_commit(args.trans, XFS_TRANS_RELEASE_LOG_RES);
 	xfs_iunlock(dp, XFS_ILOCK_EXCL);
+	if (args.norm)
+		kmem_free((void *)args.norm);
 
 	return(error);
 
@@ -554,6 +584,9 @@ out:
 		xfs_trans_cancel(args.trans,
 			XFS_TRANS_RELEASE_LOG_RES|XFS_TRANS_ABORT);
 	xfs_iunlock(dp, XFS_ILOCK_EXCL);
+	if (args.norm)
+		kmem_free((void *)args.norm);
+
 	return(error);
 }
 
diff --git a/libxfs/xfs_attr_leaf.c b/libxfs/xfs_attr_leaf.c
index f7f02ae..052a6a1 100644
--- a/libxfs/xfs_attr_leaf.c
+++ b/libxfs/xfs_attr_leaf.c
@@ -634,6 +634,7 @@ int
 xfs_attr_shortform_to_leaf(xfs_da_args_t *args)
 {
 	xfs_inode_t *dp;
+	struct xfs_mount *mp;
 	xfs_attr_shortform_t *sf;
 	xfs_attr_sf_entry_t *sfe;
 	xfs_da_args_t nargs;
@@ -646,6 +647,7 @@ xfs_attr_shortform_to_leaf(xfs_da_args_t *args)
 	trace_xfs_attr_sf_to_leaf(args);
 
 	dp = args->dp;
+	mp = dp->i_mount;
 	ifp = dp->i_afp;
 	sf = (xfs_attr_shortform_t *)ifp->if_u1.if_data;
 	size = be16_to_cpu(sf->hdr.totsize);
@@ -698,13 +700,18 @@ xfs_attr_shortform_to_leaf(xfs_da_args_t *args)
 		nargs.namelen = sfe->namelen;
 		nargs.value = &sfe->nameval[nargs.namelen];
 		nargs.valuelen = sfe->valuelen;
-		nargs.hashval = xfs_da_hashname(sfe->nameval,
-						sfe->namelen);
 		nargs.flags = XFS_ATTR_NSP_ONDISK_TO_ARGS(sfe->flags);
+		if (! xfs_sb_version_hasutf8(&mp->m_sb))
+			nargs.hashval = xfs_da_hashname(sfe->nameval,
+							sfe->namelen);
+		else if ((error = mp->m_dirnameops->normhash(&nargs)) != 0)
+			goto out;
 		error = xfs_attr3_leaf_lookup_int(bp, &nargs); /* set a->index */
 		ASSERT(error == ENOATTR);
 		error = xfs_attr3_leaf_add(bp, &nargs);
 		ASSERT(error != ENOSPC);
+		if (nargs.norm)
+			 kmem_free((void *)nargs.norm);
 		if (error)
 			goto out;
 		sfe = XFS_ATTR_SF_NEXTENTRY(sfe);
diff --git a/libxfs/xfs_utf8.c b/libxfs/xfs_utf8.c
index f5cc231..5c69591 100644
--- a/libxfs/xfs_utf8.c
+++ b/libxfs/xfs_utf8.c
@@ -31,6 +31,7 @@
 #include "xfs_inode_fork.h"
 #include "xfs_bmap.h"
 #include "xfs_dir2.h"
+#include "xfs_attr_leaf.h"
 #include "xfs_trace.h"
 #include "xfs_utf8.h"
 #include "utf8norm.h"
@@ -72,6 +73,9 @@ xfs_utf8_normhash(
 	ssize_t		normlen;
 	int		c;
 
+	/* Don't normalize system attribute names. */
+	if (args->flags & (ATTR_ROOT|ATTR_SECURE))
+		goto blob;
 	nfkdi = utf8nfkdi(utf8version);
 	/* Failure to normalize is treated as a blob. */
 	if ((normlen = utf8nlen(nfkdi, (const char *)args->name,
@@ -173,6 +177,9 @@ xfs_utf8_ci_normhash(
 	ssize_t		normlen;
 	int		c;
 
+	/* Don't normalize system attribute names. */
+	if (args->flags & (ATTR_ROOT|ATTR_SECURE))
+		goto blob;
 	nfkdicf = utf8nfkdicf(utf8version);
 	/* Failure to normalize is treated as a blob. */
 	if ((normlen = utf8nlen(nfkdicf, (const char *)args->name,
-- 
1.7.12.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 84+ messages in thread

end of thread, other threads:[~2014-09-29 20:17 UTC | newest]

Thread overview: 84+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-18 19:56 [RFC v2] Unicode/UTF-8 support for XFS Ben Myers
2014-09-18 20:08 ` [PATCH 01/10] xfs: return the first match during case-insensitive lookup Ben Myers
2014-09-18 20:09 ` [PATCH 02/10] xfs: rename XFS_CMP_CASE to XFS_CMP_MATCH Ben Myers
2014-09-18 20:09 ` [PATCH 03/13] libxfs: add xfs_nameops.normhash Ben Myers
2014-09-18 20:10 ` [PATCH 04/10] xfs: change interface of xfs_nameops.normhash Ben Myers
2014-09-18 20:10   ` Ben Myers
2014-09-18 20:11 ` [PATCH 05/10] xfs: add a superblock feature bit to indicate UTF-8 support Ben Myers
2014-09-18 20:13 ` [PATCH 03/10] xfs: add xfs_nameops.normhash Ben Myers
2014-09-18 20:14 ` [PATCH 06/10] xfs: add unicode character database files Ben Myers
2014-09-18 20:14   ` Ben Myers
2014-09-22 20:54   ` Dave Chinner
2014-09-22 20:54     ` Dave Chinner
2014-09-26 17:09     ` Christoph Hellwig
2014-09-18 20:15 ` [PATCH 07/10] xfs: add trie generator and supporting code for UTF-8 Ben Myers
2014-09-22 20:57   ` Dave Chinner
2014-09-22 20:57     ` Dave Chinner
2014-09-23 18:57     ` Ben Myers
2014-09-26 17:10       ` Christoph Hellwig
2014-09-18 20:16 ` [PATCH 08/10] xfs: add xfs_nameops for utf8 and utf8+casefold Ben Myers
2014-09-18 20:16   ` Ben Myers
2014-09-18 20:17 ` [PATCH 09/10] xfs: apply utf-8 normalization rules to user extended attribute names Ben Myers
2014-09-18 20:18 ` [PATCH 10/10] xfs: implement demand load of utf8norm.ko Ben Myers
2014-09-18 20:31 ` [PATCH 00/13] xfsprogs: Unicode/UTF-8 support for XFS Ben Myers
2014-09-18 20:33   ` [PATCH 01/13] libxfs: return the first match during case-insensitive lookup Ben Myers
2014-09-18 20:33   ` [PATCH 02/13] libxfs: rename XFS_CMP_CASE to XFS_CMP_MATCH Ben Myers
2014-09-18 20:34   ` [PATCH 03/13] libxfs: add xfs_nameops.normhash Ben Myers
2014-09-18 20:35   ` [PATCH 04/13] libxfs: change interface of xfs_nameops.normhash Ben Myers
2014-09-18 20:35     ` Ben Myers
2014-09-18 20:36   ` [PATCH 05/13] libxfs: add a superblock feature bit to indicate UTF-8 support Ben Myers
2014-09-18 20:37   ` [PATCH 06/13] xfsprogs: add unicode character database files Ben Myers
2014-09-18 20:38   ` [PATCH 07/13] libxfs: add trie generator and supporting code for UTF-8 Ben Myers
2014-09-18 20:38   ` [PATCH 08/13] libxfs: add xfs_nameops for utf8 and utf8+casefold Ben Myers
2014-09-18 20:39   ` [PATCH 09/13] libxfs: apply utf-8 normalization rules to user extended attribute names Ben Myers
2014-09-18 20:40   ` [PATCH 10/13] xfsprogs: add utf8 support to growfs Ben Myers
2014-09-18 20:41   ` [PATCH 11/13] xfsprogs: add utf8 support to mkfs.xfs Ben Myers
2014-09-18 20:42   ` [PATCH 12/13] xfsprogs: add utf8 support to xfs_repair Ben Myers
2014-09-18 20:42     ` Ben Myers
2014-09-18 20:43   ` [PATCH 13/13] xfsprogs: add a preliminary test for utf8 support Ben Myers
2014-09-19 16:06   ` [PATCH 07a/13] xfsprogs: add trie generator for UTF-8 Ben Myers
2014-09-23 18:34     ` Roger Willcocks
2014-09-24 23:11       ` Ben Myers
2014-09-19 16:07   ` [PATCH 07b/13] libxfs: add supporting code " Ben Myers
2014-09-18 21:10 ` [RFC v2] Unicode/UTF-8 support for XFS Ben Myers
2014-09-18 21:24   ` Zach Brown
2014-09-18 21:24     ` Zach Brown
2014-09-18 22:23     ` Ben Myers
2014-09-19 16:03 ` [PATCH 07a/10] xfs: add trie generator for UTF-8 Ben Myers
2014-09-19 16:04 ` [PATCH 07b/10] xfs: add supporting code " Ben Myers
2014-09-22 14:55 ` [RFC v2] Unicode/UTF-8 support for XFS Andi Kleen
2014-09-22 14:55   ` Andi Kleen
2014-09-22 18:41   ` Ben Myers
2014-09-22 19:29     ` Andi Kleen
2014-09-22 19:29       ` Andi Kleen
2014-09-23 16:13       ` Olaf Weber
2014-09-23 20:15         ` Andi Kleen
2014-09-23 20:45           ` Ben Myers
2014-09-23 20:45             ` Ben Myers
2014-09-24 11:07           ` Olaf Weber
2014-09-26 14:06             ` Olaf Weber
2014-09-23 13:01   ` Olaf Weber
2014-09-23 20:02     ` Andi Kleen
2014-09-22 22:26 ` Dave Chinner
2014-09-22 22:26   ` Dave Chinner
2014-09-24 13:21   ` Olaf Weber
2014-09-24 13:21     ` Olaf Weber
2014-09-24 23:10     ` Dave Chinner
2014-09-24 23:10       ` Dave Chinner
2014-09-25 13:33       ` Zuckerman, Boris
2014-09-26 14:50       ` Olaf Weber
2014-09-26 16:56         ` Christoph Hellwig
2014-09-26 16:56           ` Christoph Hellwig
2014-09-26 17:04           ` Jeremy Allison
2014-09-26 17:06             ` Christoph Hellwig
2014-09-26 17:13               ` Jeremy Allison
2014-09-26 17:13                 ` Jeremy Allison
2014-09-26 19:37             ` Olaf Weber
2014-09-26 19:46               ` Jeremy Allison
2014-09-26 20:03                 ` Olaf Weber
2014-09-29 20:16                   ` J. Bruce Fields
2014-09-29 20:16                     ` J. Bruce Fields
2014-09-29 11:06               ` Christoph Hellwig
2014-09-29 11:06                 ` Christoph Hellwig
2014-09-26 17:30           ` Ben Myers
  -- strict thread matches above, loose matches on Subject: below --
2014-09-11 20:37 [RFC] " Ben Myers
2014-09-11 21:01 ` [PATCH 09/13] libxfs: apply utf-8 normalization rules to user extended attribute names Ben Myers

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.