* [RFC v3] Unicode/UTF-8 support for XFS
@ 2014-10-03 21:47 Ben Myers
  2014-10-03 21:50 ` [PATCH 01/16] lib: add unicode character database files Ben Myers
                   ` (34 more replies)
  0 siblings, 35 replies; 63+ messages in thread
From: Ben Myers @ 2014-10-03 21:47 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: bpm, olaf, xfs

Hi,

I'm posting this RFC for Unicode support in XFS on Olaf's behalf, as he
is busy with other projects.  This is the third revision of the series.
The others are available here:

http://oss.sgi.com/archives/xfs/2014-09/msg00260.html
http://oss.sgi.com/archives/xfs/2014-09/msg00169.html

In response to feedback for v2, the changes in v3 include:

* trie size is reduced from 245kB to 90kB by using algorithmic
   decomposition
* robustness fixes to the trie generator mkutf8data.c
* mkutf8data.c moved to scripts/, 
  utf8 normalization module moved to lib/, and
  compiled under CONFIG_UTF8_NORMALIZATION
* removed CONFIG_XFS_UTF8_DEMAND_LOAD, now it is unconditional
* the unicode version is stored in the superblock, checked at mount
  time, and passed through to the normalization module.
* add a versioned fsgeometry ioctl, and xfs_info bits to print the
  unicode version
* TODO don't overload asciici feature bit
* TODO fix patch klunkiness in xfs_da_mount

In response to the initial feedback, the changes in version 2 include:

* linux-fsdevel in the To: line,
* Updated design notes,
* Separation of the fs-independent trie and support code into utf8norm.ko,
* A mechanism for loading the normalization module only when necessary.

I'll post the whole series for completeness' sake.  Many on -fsdevel will
not be interested in the xfs-specific bits, but it may be helpful to
have the full series as an example and for testing purposes.

First there is a set of kernel bits, then some libxfs/xfsprogs stuff,
and finally a test.  (Note: I am not posting the unicode database files
due to their large size.  There are scripts to download them from
unicode.org in the relevant commit headers.)

Thanks,
Ben

Here are Olaf's design notes:

-----------------------------------------------------------------------------
Unicode/UTF-8 support for XFS

So we had a customer request proper unicode support...


* What does "supporting unicode" actually mean?

From a text processing point of view, what a filesystem does with
filenames is simple: it stores and retrieves them, and compares them
for equality. It may reject certain byte sequences as invalid
filenames (for example, no filename can contain an ASCII NUL).

I've been taking it as a given that when a file is created with a
certain byte sequence as its name, then a subsequent directory listing
will contain that same byte sequence among the names listed. That
things ought to work like that is either obvious to the point of being
axiomatic, or -- I suppose -- not. The reader gets one (1) guess as to
which camp I'm in.

This leaves comparing names for equality, and in my view this is what
"supporting unicode" revolves around.

The present state of affairs is that different byte sequences are
different filenames. This amounts to tolerating unicode without
actually supporting it.

To support unicode we have to interpret filenames. What happens when
(part of) a filename cannot be interpreted? We can reject the
filename, interpret the parts we can, or punt and accept it as an
uninterpreted blob.

Rejecting ill-formed filenames was my first choice, but I came around
on the issue: there are too many ways in which you can end up with
having to deal with ill-formed filenames that would leave a user with
no recourse but to move whatever they're doing to a different
filesystem. Unpacking a tarball with filenames in a different encoding
is an example.

Partial interpretation of an ill-formed filename just strikes me as
the kind of bad idea that most halfway houses are. I admit that I have no
stronger objection to this than the fact that it makes the code even
more complicated and fragile.

Which leaves "blob" as the preferred option by default for coping with
ill-formed filenames.

When comparing well-formed filenames, the question now becomes which
byte sequences are considered to be alternative spellings of the same
filename. This is where normalization forms come into play, and the
unicode standard has quite a bit to say about the subject.

If all you're doing is comparison, then choosing NFD over NFC is easy,
because the former is easier to calculate than the latter.

If you want various spellings of "office" to compare equal, then
picking NFKD over NFD for comparison is also an obvious
choice. (Hand-picking individual compatibility forms is truly a bad
idea.) Ways to spell "office": "o_f_f_i_c_e", "o_f_fi_c_e", and
"o_ffi_c_e", using no ligatures, the fi ligature, or the ffi
ligature. (Some fool thought it a good idea to add these ligatures to
unicode, all we get to decide is how to cope.)
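
To make the "office" example concrete, here is a small sketch of why
byte-wise comparison treats the three spellings as different names. The
byte values are the UTF-8 encodings of U+FB01 (fi ligature) and U+FB03
(ffi ligature). The helpers toy_expand() and toy_same_name() are
hypothetical and exist only for this illustration: they hand-pick two
compatibility forms, which is exactly what a real implementation should
not do. The series itself derives the full NFKD mapping from the Unicode
data files.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Three spellings of "office" as UTF-8 byte strings. */
static const char plain[]    = "office";
static const char with_fi[]  = "of\xEF\xAC\x81" "ce";	/* U+FB01 */
static const char with_ffi[] = "o\xEF\xAC\x83" "ce";	/* U+FB03 */

/*
 * Hypothetical helper: expand only U+FB01 and U+FB03 into plain
 * letters. Output is truncated if it does not fit in outsz bytes.
 */
static void toy_expand(const char *in, char *out, size_t outsz)
{
	size_t o = 0;

	assert(outsz > 0);
	/* Keep room for the longest expansion (3 bytes) plus the NUL. */
	while (*in && o + 3 < outsz) {
		if (!strncmp(in, "\xEF\xAC\x81", 3)) {
			memcpy(out + o, "fi", 2);
			o += 2;
			in += 3;
		} else if (!strncmp(in, "\xEF\xAC\x83", 3)) {
			memcpy(out + o, "ffi", 3);
			o += 3;
			in += 3;
		} else {
			out[o++] = *in++;
		}
	}
	out[o] = '\0';
}

/* Hypothetical helper: compare two names after the toy expansion. */
static int toy_same_name(const char *a, const char *b)
{
	char na[64], nb[64];

	toy_expand(a, na, sizeof(na));
	toy_expand(b, nb, sizeof(nb));
	return strcmp(na, nb) == 0;
}
```

With plain strcmp() the three spellings are three distinct filenames;
after expansion they all compare equal.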

The most contentious part is (should be) ignoring the codepoints with
the Default_Ignorable_Code_Point property. I've included the list
below. My argument, such as it is, is that these code points either
have no visible rendering, or in cases like the soft hyphen, are only
conditionally visible. The problem with these (as I see it) is that on
seeing a filename that might contain them you cannot tell whether they
are present. So I propose to ignore them for the purpose of comparing
filenames for equality.

Finally, case folding. First of all, it is optional. Then the issue is
that you either go the language-specific route, or simplify the task
by "just" doing a full casefold (C+F, in unicode parlance). Looking
around the net I tend to find that if you're going to do casefolding
at all, then a language-independent full casefold is preferred because
it is the most predictable option. See
http://www.w3.org/TR/charmod-norm/ for an example of that kind of
reasoning.

All of these choices can be argued with, but I do believe that the
particular combination of choices I made is a defensible one.

The code refers to these normalization forms as nfkdi and nfkdicf.


* XFS-specific design notes.

XFS uses byte strings for filenames, so UTF-8 is the expected format for
unicode filenames. This does raise the question of what criteria a byte string
must meet to be UTF-8. We settled on the following:
 - Valid unicode code points are 0..0x10FFFF, except that
 - The surrogates 0xD800..0xDFFF are not valid code points, and
 - Valid UTF-8 must be a shortest encoding of a valid unicode code point.
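
The three criteria above can be sketched as a stand-alone check. This is
only an illustration: the series validates through trie lookups rather
than dedicated code, and the name utf8_seq_len() is made up for this
example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the length in bytes (1..4) of the well-formed UTF-8 sequence
 * at the start of s[0..len), or 0 if there is none. A NUL byte counts
 * as a valid one-byte sequence here; callers that treat NUL as a
 * terminator are expected to stop before it.
 */
static size_t utf8_seq_len(const unsigned char *s, size_t len)
{
	unsigned int cp, min;
	size_t n, i;

	assert(s != NULL);
	if (len == 0)
		return 0;

	if (s[0] < 0x80) {
		return 1;
	} else if ((s[0] & 0xE0) == 0xC0) {
		n = 2; cp = s[0] & 0x1F; min = 0x80;
	} else if ((s[0] & 0xF0) == 0xE0) {
		n = 3; cp = s[0] & 0x0F; min = 0x800;
	} else if ((s[0] & 0xF8) == 0xF0) {
		n = 4; cp = s[0] & 0x07; min = 0x10000;
	} else {
		return 0;	/* stray continuation byte, or 0xF8..0xFF */
	}

	if (len < n)
		return 0;	/* truncated sequence */

	for (i = 1; i < n; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return 0;	/* not a continuation byte */
		cp = (cp << 6) | (s[i] & 0x3F);
	}

	if (cp < min)
		return 0;	/* not the shortest encoding */
	if (cp > 0x10FFFF)
		return 0;	/* beyond the unicode range */
	if (cp >= 0xD800 && cp <= 0xDFFF)
		return 0;	/* surrogate */
	return n;
}
```

For example, the two-byte sequence 0xC0 0x80 is rejected as an overlong
encoding of U+0, and 0xED 0xA0 0x80 is rejected because it decodes to
the surrogate 0xD800.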

In addition, U+0 (ASCII NUL, '\0') is used to terminate byte strings (and
is itself not part of the string). Moreover strings may be length-limited
in addition to being NUL-terminated (there is no such thing as an embedded
NUL in a length-limited string).

The code uses ("leverages", in corp-speak) the existing XFS
infrastructure for case-insensitive filenames. Like the CI code, the
name used to create a file is stored on disk, and returned in a
lookup. When comparing filenames the normalized forms of the names
being compared are generated on the fly from the non-normalized forms
stored on disk.

If the borgbit (the bit enabling legacy ASCII-based CI in XFS) is set
in the superblock, then case folding is added into the mix. This is
the nfkdicf normalization form mentioned above. It allows for the
creation of case-insensitive filesystems with UTF-8 support.


* Implementation notes.

Strings are normalized using a trie that stores the relevant
information.  The trie itself is about 90kB in size as of v3 (down from
about 245kB, thanks to algorithmic decomposition), and lives in a
separate module. The trie is not checked in: instead we add the source
files from the Unicode Character Database and a program that creates
the header containing the trie.

The key for a lookup in the trie is a UTF-8 sequence. Each valid UTF-8
sequence leads to a leaf. No invalid sequence does. This means that trie
lookups can be used to validate UTF-8 sequences, which is why there is no
specialized code for the same purpose.

The trie contains information for the version of unicode in which each
code point was defined. This matters because non-normalized strings are
stored on disk, and newer versions of unicode may introduce new normalized
forms. Ideally, the version of unicode used by the filesystem is stored in
the filesystem.
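
As a rough sketch of how a version check can work: the generator in
scripts/mkutf8data.c packs major, minor and revision into one unsigned
int with the UNICODE_AGE() macro reproduced below, so versions compare
as plain integers. The helpers unicode_age() and age_visible() are
hypothetical names used only for this illustration, and the rule that a
code point is usable when its age does not exceed the filesystem's
version is my reading of the paragraph above, not a quote from the code.

```c
#include <assert.h>

/* Packing as defined in scripts/mkutf8data.c. */
#define UNICODE_MAJ_SHIFT		(16)
#define UNICODE_MIN_SHIFT		(8)

#define UNICODE_AGE(MAJ,MIN,REV)			\
	(((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) |	\
	 ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) |	\
	 ((unsigned int)(REV)))

/* Hypothetical wrapper with the same range limits as age_valid(). */
static unsigned int
unicode_age(unsigned int maj, unsigned int min, unsigned int rev)
{
	assert(maj <= 0xFFFF && min <= 0xFF && rev <= 0xFF);
	return UNICODE_AGE(maj, min, rev);
}

/*
 * Hypothetical check: a code point introduced in version cp_age is
 * interpreted by a filesystem created for version fs_age only if it
 * already existed in that version.
 */
static int
age_visible(unsigned int cp_age, unsigned int fs_age)
{
	return cp_age <= fs_age;
}
```

Under this sketch a filesystem stamped with 6.3.0 would not interpret a
code point first defined in 7.0.0, while a filesystem stamped with 7.0.0
would interpret everything defined up to and including 7.0.0.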

The trie also accounts for corrections made in the past to normalizations.
This has little value today, because any newly created filesystem would be
using unicode version 7.0.0. It is included in order to show, not tell,
that such corrections can be handled if they are added in future revisions.

The algorithm used to calculate the sequences of bytes for the normalized
form of a UTF-8 string is tricky. The core is found in utf8byte(), with an
explanation in the preceding comment.

The non-XFS-specific supporting code functions have the prefix 'utf8n'
if they handle length-limited strings, and 'utf8' if they handle
NUL-terminated strings.

----
# Derived Property: Default_Ignorable_Code_Point
#  Generated from
#    Other_Default_Ignorable_Code_Point
#  + Cf (Format characters)
#  + Variation_Selector
#  - White_Space
#  - FFF9..FFFB (Annotation Characters)
#  - 0600..0605, 06DD, 070F, 110BD (exceptional Cf characters that should be visible)

00AD          ; Default_Ignorable_Code_Point # Cf       SOFT HYPHEN
034F          ; Default_Ignorable_Code_Point # Mn       COMBINING GRAPHEME JOINER
061C          ; Default_Ignorable_Code_Point # Cf       ARABIC LETTER MARK
115F..1160    ; Default_Ignorable_Code_Point # Lo   [2] HANGUL CHOSEONG FILLER..HANGUL JUNGSEONG FILLER
17B4..17B5    ; Default_Ignorable_Code_Point # Mn   [2] KHMER VOWEL INHERENT AQ..KHMER VOWEL INHERENT AA
180B..180D    ; Default_Ignorable_Code_Point # Mn   [3] MONGOLIAN FREE VARIATION SELECTOR ONE..MONGOLIAN FREE VARIATION SELECTOR THREE
180E          ; Default_Ignorable_Code_Point # Cf       MONGOLIAN VOWEL SEPARATOR
200B..200F    ; Default_Ignorable_Code_Point # Cf   [5] ZERO WIDTH SPACE..RIGHT-TO-LEFT MARK
202A..202E    ; Default_Ignorable_Code_Point # Cf   [5] LEFT-TO-RIGHT EMBEDDING..RIGHT-TO-LEFT OVERRIDE
2060..2064    ; Default_Ignorable_Code_Point # Cf   [5] WORD JOINER..INVISIBLE PLUS
2065          ; Default_Ignorable_Code_Point # Cn       <reserved-2065>
2066..206F    ; Default_Ignorable_Code_Point # Cf  [10] LEFT-TO-RIGHT ISOLATE..NOMINAL DIGIT SHAPES
3164          ; Default_Ignorable_Code_Point # Lo       HANGUL FILLER
FE00..FE0F    ; Default_Ignorable_Code_Point # Mn  [16] VARIATION SELECTOR-1..VARIATION SELECTOR-16
FEFF          ; Default_Ignorable_Code_Point # Cf       ZERO WIDTH NO-BREAK SPACE
FFA0          ; Default_Ignorable_Code_Point # Lo       HALFWIDTH HANGUL FILLER
FFF0..FFF8    ; Default_Ignorable_Code_Point # Cn   [9] <reserved-FFF0>..<reserved-FFF8>
1BCA0..1BCA3  ; Default_Ignorable_Code_Point # Cf   [4] SHORTHAND FORMAT LETTER OVERLAP..SHORTHAND FORMAT UP STEP
1D173..1D17A  ; Default_Ignorable_Code_Point # Cf   [8] MUSICAL SYMBOL BEGIN BEAM..MUSICAL SYMBOL END PHRASE
E0000         ; Default_Ignorable_Code_Point # Cn       <reserved-E0000>
E0001         ; Default_Ignorable_Code_Point # Cf       LANGUAGE TAG
E0002..E001F  ; Default_Ignorable_Code_Point # Cn  [30] <reserved-E0002>..<reserved-E001F>
E0020..E007F  ; Default_Ignorable_Code_Point # Cf  [96] TAG SPACE..CANCEL TAG
E0080..E00FF  ; Default_Ignorable_Code_Point # Cn [128] <reserved-E0080>..<reserved-E00FF>
E0100..E01EF  ; Default_Ignorable_Code_Point # Mn [240] VARIATION SELECTOR-17..VARIATION SELECTOR-256
E01F0..E0FFF  ; Default_Ignorable_Code_Point # Cn [3600] <reserved-E01F0>..<reserved-E0FFF>

# Total code points: 4173
----
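
For illustration only, the list above can be turned into a range table
and a membership test. The series does not do it this way (ignorable
code points are represented in the trie as characters that decompose to
the empty string); the names cp_range, is_default_ignorable() and
count_default_ignorable() are made up for this sketch. The ranges are
copied row by row from the listing.

```c
#include <assert.h>
#include <stddef.h>

struct cp_range {
	unsigned int first;
	unsigned int last;
};

/* Sorted, non-overlapping ranges, one per row of the listing. */
static const struct cp_range default_ignorable[] = {
	{ 0x00AD,  0x00AD  }, { 0x034F,  0x034F  }, { 0x061C,  0x061C  },
	{ 0x115F,  0x1160  }, { 0x17B4,  0x17B5  }, { 0x180B,  0x180D  },
	{ 0x180E,  0x180E  }, { 0x200B,  0x200F  }, { 0x202A,  0x202E  },
	{ 0x2060,  0x2064  }, { 0x2065,  0x2065  }, { 0x2066,  0x206F  },
	{ 0x3164,  0x3164  }, { 0xFE00,  0xFE0F  }, { 0xFEFF,  0xFEFF  },
	{ 0xFFA0,  0xFFA0  }, { 0xFFF0,  0xFFF8  }, { 0x1BCA0, 0x1BCA3 },
	{ 0x1D173, 0x1D17A }, { 0xE0000, 0xE0000 }, { 0xE0001, 0xE0001 },
	{ 0xE0002, 0xE001F }, { 0xE0020, 0xE007F }, { 0xE0080, 0xE00FF },
	{ 0xE0100, 0xE01EF }, { 0xE01F0, 0xE0FFF },
};

#define NRANGES	(sizeof(default_ignorable) / sizeof(default_ignorable[0]))

/* Binary search over the range table. */
static int is_default_ignorable(unsigned int cp)
{
	size_t lo = 0, hi = NRANGES;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (cp < default_ignorable[mid].first)
			hi = mid;
		else if (cp > default_ignorable[mid].last)
			lo = mid + 1;
		else
			return 1;
	}
	return 0;
}

/* Sum the range sizes; this should match the total in the listing. */
static unsigned int count_default_ignorable(void)
{
	unsigned int total = 0;
	size_t i;

	for (i = 0; i < NRANGES; i++) {
		const struct cp_range *r = &default_ignorable[i];

		assert(r->first <= r->last);
		total += r->last - r->first + 1;
	}
	return total;
}
```

Summing the ranges gives 4173, which agrees with the "Total code points"
line of the listing and is a cheap sanity check when the table is
regenerated for a newer unicode version.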

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* [PATCH 01/16] lib: add unicode character database files
  2014-10-03 21:47 [RFC v3] Unicode/UTF-8 support for XFS Ben Myers
@ 2014-10-03 21:50 ` Ben Myers
  2014-10-03 21:51 ` [PATCH 02/16] scripts: add trie generator for UTF-8 Ben Myers
                   ` (33 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Ben Myers @ 2014-10-03 21:50 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: olaf, xfs

From: Olaf Weber <olaf@sgi.com>

Add files from the Unicode Character Database, version 7.0.0, to the source.
A helper program that generates a trie used for normalization from these
files is part of a separate commit.

Signed-off-by: Olaf Weber <olaf@sgi.com>
---
[v2: Removed large unicode files prior to posting.  Get them as below. -bpm]
[v3: Moved files to ucd8norm directory. -bpm]
[v4: Moved to lib/ucd. -bpm]

cd lib/ucd
wget http://www.unicode.org/Public/7.0.0/ucd/CaseFolding.txt
wget http://www.unicode.org/Public/7.0.0/ucd/DerivedAge.txt
wget http://www.unicode.org/Public/7.0.0/ucd/extracted/DerivedCombiningClass.txt
wget http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt
wget http://www.unicode.org/Public/7.0.0/ucd/NormalizationCorrections.txt
wget http://www.unicode.org/Public/7.0.0/ucd/NormalizationTest.txt
wget http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt
for e in *.txt
do
	base=`basename $e .txt`
	mv $e $base-7.0.0.txt
done
---
 lib/ucd/README | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)
 create mode 100644 lib/ucd/README

diff --git a/lib/ucd/README b/lib/ucd/README
new file mode 100644
index 0000000..d713e66
--- /dev/null
+++ b/lib/ucd/README
@@ -0,0 +1,33 @@
+The files in this directory are part of the Unicode Character Database
+for version 7.0.0 of the Unicode standard.
+
+The full set of files can be found here:
+
+  http://www.unicode.org/Public/7.0.0/ucd/
+
+The latest released version of the UCD can be found here:
+
+  http://www.unicode.org/Public/UCD/latest/
+
+The files in this directory are identical, except that they have been
+renamed with a suffix indicating the unicode version.
+
+Individual source links:
+
+  http://www.unicode.org/Public/7.0.0/ucd/CaseFolding.txt
+  http://www.unicode.org/Public/7.0.0/ucd/DerivedAge.txt
+  http://www.unicode.org/Public/7.0.0/ucd/extracted/DerivedCombiningClass.txt
+  http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt
+  http://www.unicode.org/Public/7.0.0/ucd/NormalizationCorrections.txt
+  http://www.unicode.org/Public/7.0.0/ucd/NormalizationTest.txt
+  http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt
+
+md5sums
+
+  9a92b2bfe56c6719def926bab524fefd  CaseFolding-7.0.0.txt
+  07b8b1027eb824cf0835314e94f23d2e  DerivedAge-7.0.0.txt
+  90c3340b16821e2f2153acdbe6fc6180  DerivedCombiningClass-7.0.0.txt
+  c41c0601f808116f623de47110ed4f93  DerivedCoreProperties-7.0.0.txt
+  522720ddfc150d8e63a2518634829bce  NormalizationCorrections-7.0.0.txt
+  1f35175eba4a2ad795db489f789ae352  NormalizationTest-7.0.0.txt
+  c8355655731d75e6a3de8c20d7e601ba  UnicodeData-7.0.0.txt
-- 
1.7.12.4


* [PATCH 02/16] scripts: add trie generator for UTF-8.
  2014-10-03 21:47 [RFC v3] Unicode/UTF-8 support for XFS Ben Myers
  2014-10-03 21:50 ` [PATCH 01/16] lib: add unicode character database files Ben Myers
@ 2014-10-03 21:51 ` Ben Myers
  2014-10-03 21:54 ` [PATCH 03/16] lib: add supporting code " Ben Myers
                   ` (32 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Ben Myers @ 2014-10-03 21:51 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: olaf, xfs

From: Olaf Weber <olaf@sgi.com>

mkutf8data.c is the source for a program that generates utf8data.h, which
contains the trie that utf8norm.c uses. The trie is generated from the
Unicode 7.0.0 data files. The format of the utf8data[] table is described
in utf8norm.c, which is added in the next patch.

Signed-off-by: Olaf Weber <olaf@sgi.com>

---
[v2: the trie is now separated into utf8norm.ko;
     utf8version is now a function and exported;
     introduced CONFIG_XFS_UTF8;
     removed supporting code due to vger size constraint.  --bpm]
[v3: moved trie generator to scripts/;
     introduced CONFIG_UTF8_NORMALIZATION.  --bpm]
---
 lib/Kconfig          |    8 +
 lib/Makefile         |   13 +
 scripts/Makefile     |    1 +
 scripts/mkutf8data.c | 3239 ++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 3261 insertions(+)
 create mode 100644 scripts/mkutf8data.c

diff --git a/lib/Kconfig b/lib/Kconfig
index a5ce0c7..c92dfd8 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -515,4 +515,12 @@ source "lib/fonts/Kconfig"
 config ARCH_HAS_SG_CHAIN
 	def_bool n
 
+#
+# utf8 normalization module
+#
+config UTF8_NORMALIZATION
+	tristate "UTF-8 normalization support"
+	help
+	  Say Y here to enable utf8 normalization support.
+
 endmenu
diff --git a/lib/Makefile b/lib/Makefile
index d6b4bc4..b0b0d57 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -196,3 +196,16 @@ quiet_cmd_build_OID_registry = GEN     $@
 clean-files	+= oid_registry_data.c
 
 obj-$(CONFIG_UCS2_STRING) += ucs2_string.o
+
+$(obj)/utf8data.h: $(src)/ucd/*.txt $(objtree)/scripts/mkutf8data FORCE
+	$(call cmd,mkutf8data)
+quiet_cmd_mkutf8data = MKUTF8DATA $@
+      cmd_mkutf8data = $(objtree)/scripts/mkutf8data \
+		-a $(src)/ucd/DerivedAge-7.0.0.txt \
+		-c $(src)/ucd/DerivedCombiningClass-7.0.0.txt \
+		-p $(src)/ucd/DerivedCoreProperties-7.0.0.txt \
+		-d $(src)/ucd/UnicodeData-7.0.0.txt \
+		-f $(src)/ucd/CaseFolding-7.0.0.txt \
+		-n $(src)/ucd/NormalizationCorrections-7.0.0.txt \
+		-t $(src)/ucd/NormalizationTest-7.0.0.txt \
+		-o $@
diff --git a/scripts/Makefile b/scripts/Makefile
index 72902b5..80fcf43 100644
--- a/scripts/Makefile
+++ b/scripts/Makefile
@@ -16,6 +16,7 @@ hostprogs-$(CONFIG_VT)           += conmakehash
 hostprogs-$(BUILD_C_RECORDMCOUNT) += recordmcount
 hostprogs-$(CONFIG_BUILDTIME_EXTABLE_SORT) += sortextable
 hostprogs-$(CONFIG_ASN1)	 += asn1_compiler
+hostprogs-$(CONFIG_UTF8_NORMALIZATION)	+= mkutf8data
 
 HOSTCFLAGS_sortextable.o = -I$(srctree)/tools/include
 HOSTCFLAGS_asn1_compiler.o = -I$(srctree)/include
diff --git a/scripts/mkutf8data.c b/scripts/mkutf8data.c
new file mode 100644
index 0000000..1d6ec02
--- /dev/null
+++ b/scripts/mkutf8data.c
@@ -0,0 +1,3239 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+/* Generator for a compact trie for unicode normalization */
+
+#include <sys/types.h>
+#include <stddef.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <assert.h>
+#include <string.h>
+#include <unistd.h>
+#include <errno.h>
+
+/* Default names of the in- and output files. */
+
+#define AGE_NAME	"DerivedAge.txt"
+#define CCC_NAME	"DerivedCombiningClass.txt"
+#define PROP_NAME	"DerivedCoreProperties.txt"
+#define DATA_NAME	"UnicodeData.txt"
+#define FOLD_NAME	"CaseFolding.txt"
+#define NORM_NAME	"NormalizationCorrections.txt"
+#define TEST_NAME	"NormalizationTest.txt"
+#define UTF8_NAME	"utf8data.h"
+
+const char	*age_name  = AGE_NAME;
+const char	*ccc_name  = CCC_NAME;
+const char	*prop_name = PROP_NAME;
+const char	*data_name = DATA_NAME;
+const char	*fold_name = FOLD_NAME;
+const char	*norm_name = NORM_NAME;
+const char	*test_name = TEST_NAME;
+const char	*utf8_name = UTF8_NAME;
+
+int verbose = 0;
+
+/* An arbitrary line size limit on input lines. */
+
+#define LINESIZE	1024
+char line[LINESIZE];
+char buf0[LINESIZE];
+char buf1[LINESIZE];
+char buf2[LINESIZE];
+char buf3[LINESIZE];
+
+const char *argv0;
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * Unicode version numbers consist of three parts: major, minor, and a
+ * revision.  These numbers are packed into an unsigned int to obtain
+ * a single version number.
+ *
+ * To save space in the generated trie, the unicode version is not
+ * stored directly, instead we calculate a generation number from the
+ * unicode versions seen in the DerivedAge file, and use that as an
+ * index into a table of unicode versions.
+ */
+#define UNICODE_MAJ_SHIFT		(16)
+#define UNICODE_MIN_SHIFT		(8)
+
+#define UNICODE_MAJ_MAX			((unsigned short)-1)
+#define UNICODE_MIN_MAX			((unsigned char)-1)
+#define UNICODE_REV_MAX			((unsigned char)-1)
+
+#define UNICODE_AGE(MAJ,MIN,REV)			\
+	(((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) |	\
+	 ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) |	\
+	 ((unsigned int)(REV)))
+
+unsigned int *ages;
+int ages_count;
+
+unsigned int unicode_maxage;
+
+static int
+age_valid(unsigned int major, unsigned int minor, unsigned int revision)
+{
+	if (major > UNICODE_MAJ_MAX)
+		return 0;
+	if (minor > UNICODE_MIN_MAX)
+		return 0;
+	if (revision > UNICODE_REV_MAX)
+		return 0;
+	return 1;
+}
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * utf8trie_t
+ *
+ * A compact binary tree, used to decode UTF-8 characters.
+ *
+ * Internal nodes are one byte for the node itself, and up to three
+ * bytes for an offset into the tree.  The first byte contains the
+ * following information:
+ *  NEXTBYTE  - flag        - advance to next byte if set
+ *  BITNUM    - 3 bit field - the bit number to be tested
+ *  OFFLEN    - 2 bit field - number of bytes in the offset
+ * if offlen == 0 (non-branching node)
+ *  RIGHTPATH - 1 bit field - set if the following node is for the
+ *                            right-hand path (tested bit is set)
+ *  TRIENODE  - 1 bit field - set if the following node is an internal
+ *                            node, otherwise it is a leaf node
+ * if offlen != 0 (branching node)
+ *  LEFTNODE  - 1 bit field - set if the left-hand node is internal
+ *  RIGHTNODE - 1 bit field - set if the right-hand node is internal
+ *
+ * Due to the way utf8 works, there cannot be branching nodes with
+ * NEXTBYTE set, and moreover those nodes always have a righthand
+ * descendant.
+ */
+typedef unsigned char utf8trie_t;
+#define BITNUM		0x07
+#define NEXTBYTE	0x08
+#define OFFLEN		0x30
+#define OFFLEN_SHIFT	4
+#define RIGHTPATH	0x40
+#define TRIENODE	0x80
+#define RIGHTNODE	0x40
+#define LEFTNODE	0x80
+
+/*
+ * utf8leaf_t
+ *
+ * The leaves of the trie are embedded in the trie, and so the same
+ * underlying datatype, unsigned char.
+ *
+ * leaf[0]: The unicode version, stored as a generation number that is
+ *          an index into utf8agetab[].  With this we can filter code
+ *          points based on the unicode version in which they were
+ *          defined.  The CCC of a non-defined code point is 0.
+ * leaf[1]: Canonical Combining Class. During normalization, we need
+ *          to do a stable sort into ascending order of all characters
+ *          with a non-zero CCC that occur between two characters with
+ *          a CCC of 0, or at the beginning or end of a string.
+ *          The unicode standard guarantees that all CCC values are
+ *          between 0 and 254 inclusive, which leaves 255 available as
+ *          a special value.
+ *          Code points with CCC 0 are known as stoppers.
+ * leaf[2]: Decomposition. If leaf[1] == 255, then leaf[2] is the
+ *          start of a NUL-terminated string that is the decomposition
+ *          of the character.
+ *          The CCC of a decomposable character is the same as the CCC
+ *          of the first character of its decomposition.
+ *          Some characters decompose as the empty string: these are
+ *          characters with the Default_Ignorable_Code_Point property.
+ *          These do affect normalization, as they all have CCC 0.
+ *
+ * The decompositions in the trie have been fully expanded.
+ *
+ * Casefolding, if applicable, is also done using decompositions.
+ */
+typedef unsigned char utf8leaf_t;
+
+#define LEAF_GEN(LEAF)	((LEAF)[0])
+#define LEAF_CCC(LEAF)	((LEAF)[1])
+#define LEAF_STR(LEAF)	((const char*)((LEAF) + 2))
+
+#define MAXGEN		(255)
+
+#define MINCCC		(0)
+#define MAXCCC		(254)
+#define STOPPER		(0)
+#define	DECOMPOSE	(255)
+
+struct tree;
+static utf8leaf_t *utf8nlookup(struct tree *, const char *, size_t);
+static utf8leaf_t *utf8lookup(struct tree *, const char *);
+
+unsigned char *utf8data;
+size_t utf8data_size;
+
+utf8trie_t *nfkdi;
+utf8trie_t *nfkdicf;
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * UTF8 valid ranges.
+ *
+ * The UTF-8 encoding spreads the bits of a 32bit word over several
+ * bytes. This table gives the ranges that can be held and how they'd
+ * be represented.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000000 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000000 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * There is an additional requirement on UTF-8, in that only the
+ * shortest representation of a 32bit value is to be used.  A decoder
+ * must not decode sequences that do not satisfy this requirement.
+ * Thus the allowed ranges have a lower bound.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000080 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000800 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00010000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00200000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x04000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * Actual unicode characters are limited to the range 0x0 - 0x10FFFF,
+ * 17 planes of 65536 values.  This limits the sequences actually seen
+ * even more, to just the following.
+ *
+ *          0 -     0x7f: 0                     0x7f
+ *       0x80 -    0x7ff: 0xc2 0x80             0xdf 0xbf
+ *      0x800 -   0xffff: 0xe0 0xa0 0x80        0xef 0xbf 0xbf
+ *    0x10000 - 0x10ffff: 0xf0 0x90 0x80 0x80   0xf4 0x8f 0xbf 0xbf
+ *
+ * Even within those ranges not all values are allowed: the surrogates
+ * 0xd800 - 0xdfff should never be seen.
+ *
+ * Note that the longest sequence seen with valid usage is 4 bytes,
+ * the same as a single UTF-32 character.  This makes the UTF-8
+ * representation of Unicode never larger than UTF-32.
+ *
+ * The shortest sequence requirement was introduced by:
+ *    Corrigendum #1: UTF-8 Shortest Form
+ * It can be found here:
+ *    http://www.unicode.org/versions/corrigendum1.html
+ *
+ */
+
+#define UTF8_2_BITS     0xC0
+#define UTF8_3_BITS     0xE0
+#define UTF8_4_BITS     0xF0
+#define UTF8_N_BITS     0x80
+#define UTF8_2_MASK     0xE0
+#define UTF8_3_MASK     0xF0
+#define UTF8_4_MASK     0xF8
+#define UTF8_N_MASK     0xC0
+#define UTF8_V_MASK     0x3F
+#define UTF8_V_SHIFT    6
+
+static int
+utf8key(unsigned int key, char keyval[])
+{
+	int keylen;
+
+	if (key < 0x80) {
+		keyval[0] = key;
+		keylen = 1;
+	} else if (key < 0x800) {
+		keyval[1] = key & UTF8_V_MASK;
+		keyval[1] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[0] = key;
+		keyval[0] |= UTF8_2_BITS;
+		keylen = 2;
+	} else if (key < 0x10000) {
+		keyval[2] = key & UTF8_V_MASK;
+		keyval[2] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[1] = key & UTF8_V_MASK;
+		keyval[1] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[0] = key;
+		keyval[0] |= UTF8_3_BITS;
+		keylen = 3;
+	} else if (key < 0x110000) {
+		keyval[3] = key & UTF8_V_MASK;
+		keyval[3] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[2] = key & UTF8_V_MASK;
+		keyval[2] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[1] = key & UTF8_V_MASK;
+		keyval[1] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[0] = key;
+		keyval[0] |= UTF8_4_BITS;
+		keylen = 4;
+	} else {
+		printf("%#x: illegal key\n", key);
+		keylen = 0;
+	}
+	return keylen;
+}
+
+static unsigned int
+utf8code(const char *str)
+{
+	const unsigned char *s = (const unsigned char*)str;
+	unsigned int unichar = 0;
+
+	if (*s < 0x80) {
+		unichar = *s;
+	} else if (*s < UTF8_3_BITS) {
+		unichar = *s++ & 0x1F;
+		unichar <<= UTF8_V_SHIFT;
+		unichar |= *s & 0x3F;
+	} else if (*s < UTF8_4_BITS) {
+		unichar = *s++ & 0x0F;
+		unichar <<= UTF8_V_SHIFT;
+		unichar |= *s++ & 0x3F;
+		unichar <<= UTF8_V_SHIFT;
+		unichar |= *s & 0x3F;
+	} else {
+		unichar = *s++ & 0x0F;
+		unichar <<= UTF8_V_SHIFT;
+		unichar |= *s++ & 0x3F;
+		unichar <<= UTF8_V_SHIFT;
+		unichar |= *s++ & 0x3F;
+		unichar <<= UTF8_V_SHIFT;
+		unichar |= *s & 0x3F;
+	}
+	return unichar;
+}
+
+static int
+utf32valid(unsigned int unichar)
+{
+	return unichar < 0x110000;
+}
+
+#define NODE 1
+#define LEAF 0
+
+struct tree {
+	void *root;
+	int childnode;
+	const char *type;
+	unsigned int maxage;
+	struct tree *next;
+	int (*leaf_equal)(void *, void *);
+	void (*leaf_print)(void *, int);
+	int (*leaf_mark)(void *);
+	int (*leaf_size)(void *);
+	int *(*leaf_index)(struct tree *, void *);
+	unsigned char *(*leaf_emit)(void *, unsigned char *);
+	int leafindex[0x110000];
+	int index;
+};
+
+struct node {
+	int index;
+	int offset;
+	int mark;
+	int size;
+	struct node *parent;
+	void *left;
+	void *right;
+	unsigned char bitnum;
+	unsigned char nextbyte;
+	unsigned char leftnode;
+	unsigned char rightnode;
+	unsigned int keybits;
+	unsigned int keymask;
+};
+
+/*
+ * Example lookup function for a tree.
+ */
+static void *
+lookup(struct tree *tree, const char *key)
+{
+	struct node *node;
+	void *leaf = NULL;
+
+	node = tree->root;
+	while (!leaf && node) {
+		if (node->nextbyte)
+			key++;
+		if (*key & (1 << (node->bitnum & 7))) {
+			/* Right leg */
+			if (node->rightnode == NODE) {
+				node = node->right;
+			} else if (node->rightnode == LEAF) {
+				leaf = node->right;
+			} else {
+				node = NULL;
+			}
+		} else {
+			/* Left leg */
+			if (node->leftnode == NODE) {
+				node = node->left;
+			} else if (node->leftnode == LEAF) {
+				leaf = node->left;
+			} else {
+				node = NULL;
+			}
+		}
+	}
+
+	return leaf;
+}
+
+/*
+ * A simple non-recursive tree walker: keep track of visits to the
+ * left and right branches in the leftmask and rightmask.
+ */
+static void
+tree_walk(struct tree *tree)
+{
+	struct node *node;
+	unsigned int leftmask;
+	unsigned int rightmask;
+	unsigned int bitmask;
+	int indent = 1;
+	int nodes, singletons, leaves;
+
+	nodes = singletons = leaves = 0;
+
+	printf("%s_%x root %p\n", tree->type, tree->maxage, tree->root);
+	if (tree->childnode == LEAF) {
+		assert(tree->root);
+		tree->leaf_print(tree->root, indent);
+		leaves = 1;
+	} else {
+		assert(tree->childnode == NODE);
+		node = tree->root;
+		leftmask = rightmask = 0;
+		while (node) {
+			printf("%*snode @ %p bitnum %d nextbyte %d"
+			       " left %p right %p mask %x bits %x\n",
+				indent, "", node,
+				node->bitnum, node->nextbyte,
+				node->left, node->right,
+				node->keymask, node->keybits);
+			nodes += 1;
+			if (!(node->left && node->right))
+				singletons += 1;
+
+			while (node) {
+				bitmask = 1 << node->bitnum;
+				if ((leftmask & bitmask) == 0) {
+					leftmask |= bitmask;
+					if (node->leftnode == LEAF) {
+						assert(node->left);
+						tree->leaf_print(node->left,
+								 indent+1);
+						leaves += 1;
+					} else if (node->left) {
+						assert(node->leftnode == NODE);
+						indent += 1;
+						node = node->left;
+						break;
+					}
+				}
+				if ((rightmask & bitmask) == 0) {
+					rightmask |= bitmask;
+					if (node->rightnode == LEAF) {
+						assert(node->right);
+						tree->leaf_print(node->right,
+								 indent+1);
+						leaves += 1;
+					} else if (node->right) {
+						assert(node->rightnode==NODE);
+						indent += 1;
+						node = node->right;
+						break;
+					}
+				}
+				leftmask &= ~bitmask;
+				rightmask &= ~bitmask;
+				node = node->parent;
+				indent -= 1;
+			}
+		}
+	}
+	printf("nodes %d leaves %d singletons %d\n",
+	       nodes, leaves, singletons);
+}
+
+/*
+ * Allocate and initialize a new internal node.
+ */
+static struct node *
+alloc_node(struct node *parent)
+{
+	struct node *node;
+	int bitnum;
+
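+	node = malloc(sizeof(*node));
+	if (!node) {
+		/* Out of memory: nothing sensible to do but bail out. */
+		perror("malloc");
+		exit(1);
+	}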
+	node->left = node->right = NULL;
+	node->parent = parent;
+	node->leftnode = NODE;
+	node->rightnode = NODE;
+	node->keybits = 0;
+	node->keymask = 0;
+	node->mark = 0;
+	node->index = 0;
+	node->offset = -1;
+	node->size = 4;
+
+	if (node->parent) {
+		bitnum = parent->bitnum;
+		if ((bitnum & 7) == 0) {
+			node->bitnum = bitnum + 7 + 8;
+			node->nextbyte = 1;
+		} else {
+			node->bitnum = bitnum - 1;
+			node->nextbyte = 0;
+		}
+	} else {
+		node->bitnum = 7;
+		node->nextbyte = 0;
+	}
+
+	return node;
+}
+
+/*
+ * Insert a new leaf into the tree, and collapse any subtrees that are
+ * fully populated and end in identical leaves. A nextbyte tagged
+ * internal node will not be removed to preserve the tree's integrity.
+ * Note that due to the structure of utf8, no nextbyte tagged node
+ * will be a candidate for removal.
+ */
+static int
+insert(struct tree *tree, char *key, int keylen, void *leaf)
+{
+	struct node *node;
+	struct node *parent;
+	void **cursor;
+	int keybits;
+
+	assert(keylen >= 1 && keylen <= 4);
+
+	node = NULL;
+	cursor = &tree->root;
+	keybits = 8 * keylen;
+
+	/* Insert, creating path along the way. */
+	while (keybits) {
+		if (!*cursor)
+			*cursor = alloc_node(node);
+		node = *cursor;
+		if (node->nextbyte)
+			key++;
+		if (*key & (1 << (node->bitnum & 7)))
+			cursor = &node->right;
+		else
+			cursor = &node->left;
+		keybits--;
+	}
+	*cursor = leaf;
+
+	/* Merge subtrees if possible. */
+	while (node) {
+		if (*key & (1 << (node->bitnum & 7)))
+			node->rightnode = LEAF;
+		else
+			node->leftnode = LEAF;
+		if (node->nextbyte)
+			break;
+		if (node->leftnode == NODE || node->rightnode == NODE)
+			break;
+		assert(node->left);
+		assert(node->right);
+		/* Compare */
+		if (!tree->leaf_equal(node->left, node->right))
+			break;
+		/* Keep left, drop right leaf. */
+		leaf = node->left;
+		/* Check in parent */
+		parent = node->parent;
+		if (!parent) {
+			/* root of tree! */
+			tree->root = leaf;
+			tree->childnode = LEAF;
+		} else if (parent->left == node) {
+			parent->left = leaf;
+			parent->leftnode = LEAF;
+			if (parent->right) {
+				parent->keymask = 0;
+				parent->keybits = 0;
+			} else {
+				parent->keymask |= (1 << node->bitnum);
+			}
+		} else if (parent->right == node) {
+			parent->right = leaf;
+			parent->rightnode = LEAF;
+			if (parent->left) {
+				parent->keymask = 0;
+				parent->keybits = 0;
+			} else {
+				parent->keymask |= (1 << node->bitnum);
+				parent->keybits |= (1 << node->bitnum);
+			}
+		} else {
+			/* internal tree error */
+			assert(0);
+		}
+		free(node);
+		node = parent;
+	}
+
+	/* Propagate keymasks up along singleton chains. */
+	while (node) {
+		parent = node->parent;
+		if (!parent)
+			break;
+		/* Nix the mask for parents with two children. */
+		if (node->keymask == 0) {
+			parent->keymask = 0;
+			parent->keybits = 0;
+		} else if (parent->left && parent->right) {
+			parent->keymask = 0;
+			parent->keybits = 0;
+		} else {
+			assert((parent->keymask & node->keymask) == 0);
+			parent->keymask |= node->keymask;
+			parent->keymask |= (1 << parent->bitnum);
+			parent->keybits |= node->keybits;
+			if (parent->right)
+				parent->keybits |= (1 << parent->bitnum);
+		}
+		node = parent;
+	}
+
+	return 0;
+}
+
+/*
+ * Prune internal nodes.
+ *
+ * Fully populated subtrees that end at the same leaf have already
+ * been collapsed.  There are still internal nodes that have for both
+ * their left and right branches a sequence of singletons that make
+ * identical choices and end in identical leaves.  The keymask and
+ * keybits collected in the nodes describe the choices made in these
+ * singleton chains.  When they are identical for the left and right
+ * branch of a node, and the two leaves compare identical, the node in
+ * question can be removed.
+ *
+ * Note that nodes with the nextbyte tag set will not be removed by
+ * this to ensure tree integrity.  Note as well that the structure of
+ * utf8 ensures that these nodes would not have been candidates for
+ * removal in any case.
+ */
+static void
+prune(struct tree *tree)
+{
+	struct node *node;
+	struct node *left;
+	struct node *right;
+	struct node *parent;
+	void *leftleaf;
+	void *rightleaf;
+	unsigned int leftmask;
+	unsigned int rightmask;
+	unsigned int bitmask;
+	int count;
+
+	if (verbose > 0)
+		printf("Pruning %s_%x\n", tree->type, tree->maxage);
+
+	count = 0;
+	if (tree->childnode == LEAF)
+		return;
+	if (!tree->root)
+		return;
+
+	leftmask = rightmask = 0;
+	node = tree->root;
+	while (node) {
+		if (node->nextbyte)
+			goto advance;
+		if (node->leftnode == LEAF)
+			goto advance;
+		if (node->rightnode == LEAF)
+			goto advance;
+		if (!node->left)
+			goto advance;
+		if (!node->right)
+			goto advance;
+		left = node->left;
+		right = node->right;
+		if (left->keymask == 0)
+			goto advance;
+		if (right->keymask == 0)
+			goto advance;
+		if (left->keymask != right->keymask)
+			goto advance;
+		if (left->keybits != right->keybits)
+			goto advance;
+		leftleaf = NULL;
+		while (!leftleaf) {
+			assert(left->left || left->right);
+			if (left->leftnode == LEAF)
+				leftleaf = left->left;
+			else if (left->rightnode == LEAF)
+				leftleaf = left->right;
+			else if (left->left)
+				left = left->left;
+			else if (left->right)
+				left = left->right;
+			else
+				assert(0);
+		}
+		rightleaf = NULL;
+		while (!rightleaf) {
+			assert(right->left || right->right);
+			if (right->leftnode == LEAF)
+				rightleaf = right->left;
+			else if (right->rightnode == LEAF)
+				rightleaf = right->right;
+			else if (right->left)
+				right = right->left;
+			else if (right->right)
+				right = right->right;
+			else
+				assert(0);
+		}
+		if (!tree->leaf_equal(leftleaf, rightleaf))
+			goto advance;
+		/*
+		 * This node has identical singleton-only subtrees.
+		 * Remove it.
+		 */
+		parent = node->parent;
+		left = node->left;
+		right = node->right;
+		if (parent->left == node)
+			parent->left = left;
+		else if (parent->right == node)
+			parent->right = left;
+		else
+			assert(0);
+		left->parent = parent;
+		left->keymask |= (1 << node->bitnum);
+		node->left = NULL;
+		while (node) {
+			bitmask = 1 << node->bitnum;
+			leftmask &= ~bitmask;
+			rightmask &= ~bitmask;
+			if (node->leftnode == NODE && node->left) {
+				left = node->left;
+				free(node);
+				count++;
+				node = left;
+			} else if (node->rightnode == NODE && node->right) {
+				right = node->right;
+				free(node);
+				count++;
+				node = right;
+			} else {
+				node = NULL;
+			}
+		}
+		/* Propagate keymasks up along singleton chains. */
+		node = parent;
+		/* Force re-check */
+		bitmask = 1 << node->bitnum;
+		leftmask &= ~bitmask;
+		rightmask &= ~bitmask;
+		for (;;) {
+			if (node->left && node->right)
+				break;
+			if (node->left) {
+				left = node->left;
+				node->keymask |= left->keymask;
+				node->keybits |= left->keybits;
+			}
+			if (node->right) {
+				right = node->right;
+				node->keymask |= right->keymask;
+				node->keybits |= right->keybits;
+			}
+			node->keymask |= (1 << node->bitnum);
+			node = node->parent;
+			/* Force re-check */
+			bitmask = 1 << node->bitnum;
+			leftmask &= ~bitmask;
+			rightmask &= ~bitmask;
+		}
+	advance:
+		bitmask = 1 << node->bitnum;
+		if ((leftmask & bitmask) == 0 &&
+		    node->leftnode == NODE &&
+		    node->left) {
+			leftmask |= bitmask;
+			node = node->left;
+		} else if ((rightmask & bitmask) == 0 &&
+			   node->rightnode == NODE &&
+			   node->right) {
+			rightmask |= bitmask;
+			node = node->right;
+		} else {
+			leftmask &= ~bitmask;
+			rightmask &= ~bitmask;
+			node = node->parent;
+		}
+	}
+	if (verbose > 0)
+		printf("Pruned %d nodes\n", count);
+}
+
+/*
+ * Mark the nodes in the tree that lead to leaves that must be
+ * emitted.
+ */
+static void
+mark_nodes(struct tree *tree)
+{
+	struct node *node;
+	struct node *n;
+	unsigned int leftmask;
+	unsigned int rightmask;
+	unsigned int bitmask;
+	int marked;
+
+	marked = 0;
+	if (verbose > 0)
+		printf("Marking %s_%x\n", tree->type, tree->maxage);
+	if (tree->childnode == LEAF)
+		goto done;
+
+	assert(tree->childnode == NODE);
+	node = tree->root;
+	leftmask = rightmask = 0;
+	while (node) {
+		bitmask = 1 << node->bitnum;
+		if ((leftmask & bitmask) == 0) {
+			leftmask |= bitmask;
+			if (node->leftnode == LEAF) {
+				assert(node->left);
+				if (tree->leaf_mark(node->left)) {
+					n = node;
+					while (n && !n->mark) {
+						marked++;
+						n->mark = 1;
+						n = n->parent;
+					}
+				}
+			} else if (node->left) {
+				assert(node->leftnode == NODE);
+				node = node->left;
+				continue;
+			}
+		}
+		if ((rightmask & bitmask) == 0) {
+			rightmask |= bitmask;
+			if (node->rightnode == LEAF) {
+				assert(node->right);
+				if (tree->leaf_mark(node->right)) {
+					n = node;
+					while (n && !n->mark) {
+						marked++;
+						n->mark = 1;
+						n = n->parent;
+					}
+				}
+			} else if (node->right) {
+				assert(node->rightnode==NODE);
+				node = node->right;
+				continue;
+			}
+		}
+		leftmask &= ~bitmask;
+		rightmask &= ~bitmask;
+		node = node->parent;
+	}
+
+	/* second pass: left siblings and singletons */
+
+	assert(tree->childnode == NODE);
+	node = tree->root;
+	leftmask = rightmask = 0;
+	while (node) {
+		bitmask = 1 << node->bitnum;
+		if ((leftmask & bitmask) == 0) {
+			leftmask |= bitmask;
+			if (node->leftnode == LEAF) {
+				assert(node->left);
+				if (tree->leaf_mark(node->left)) {
+					n = node;
+					while (n && !n->mark) {
+						marked++;
+						n->mark = 1;
+						n = n->parent;
+					}
+				}
+			} else if (node->left) {
+				assert(node->leftnode == NODE);
+				node = node->left;
+				if (!node->mark && node->parent->mark) {
+					marked++;
+					node->mark = 1;
+				}
+				continue;
+			}
+		}
+		if ((rightmask & bitmask) == 0) {
+			rightmask |= bitmask;
+			if (node->rightnode == LEAF) {
+				assert(node->right);
+				if (tree->leaf_mark(node->right)) {
+					n = node;
+					while (n && !n->mark) {
+						marked++;
+						n->mark = 1;
+						n = n->parent;
+					}
+				}
+			} else if (node->right) {
+				assert(node->rightnode==NODE);
+				node = node->right;
+				if (!node->mark && node->parent->mark &&
+				    !node->parent->left) {
+					marked++;
+					node->mark = 1;
+				}
+				continue;
+			}
+		}
+		leftmask &= ~bitmask;
+		rightmask &= ~bitmask;
+		node = node->parent;
+	}
+done:
+	if (verbose > 0)
+		printf("Marked %d nodes\n", marked);
+}
+
+/*
+ * Compute the index of each node and leaf, which is the offset in the
+ * emitted trie.  These values must be pre-computed because relative
+ * offsets between nodes are used to navigate the tree.
+ */
+static int
+index_nodes(struct tree *tree, int index)
+{
+	struct node *node;
+	unsigned int leftmask;
+	unsigned int rightmask;
+	unsigned int bitmask;
+	int count;
+	int indent;
+
+	/* Align to a cache line (or half a cache line?). */
+	while (index % 64)
+		index++;
+	tree->index = index;
+	indent = 1;
+	count = 0;
+
+	if (verbose > 0)
+		printf("Indexing %s_%x: %d", tree->type, tree->maxage, index);
+	if (tree->childnode == LEAF) {
+		index += tree->leaf_size(tree->root);
+		goto done;
+	}
+
+	assert(tree->childnode == NODE);
+	node = tree->root;
+	leftmask = rightmask = 0;
+	while (node) {
+		if (!node->mark)
+			goto skip;
+		count++;
+		if (node->index != index)
+			node->index = index;
+		index += node->size;
+skip:
+		while (node) {
+			bitmask = 1 << node->bitnum;
+			if (node->mark && (leftmask & bitmask) == 0) {
+				leftmask |= bitmask;
+				if (node->leftnode == LEAF) {
+					assert(node->left);
+					*tree->leaf_index(tree, node->left) =
+									index;
+					index += tree->leaf_size(node->left);
+					count++;
+				} else if (node->left) {
+					assert(node->leftnode == NODE);
+					indent += 1;
+					node = node->left;
+					break;
+				}
+			}
+			if (node->mark && (rightmask & bitmask) == 0) {
+				rightmask |= bitmask;
+				if (node->rightnode == LEAF) {
+					assert(node->right);
+					*tree->leaf_index(tree, node->right) = index;
+					index += tree->leaf_size(node->right);
+					count++;
+				} else if (node->right) {
+					assert(node->rightnode==NODE);
+					indent += 1;
+					node = node->right;
+					break;
+				}
+			}
+			leftmask &= ~bitmask;
+			rightmask &= ~bitmask;
+			node = node->parent;
+			indent -= 1;
+		}
+	}
+done:
+	/* Round up to a multiple of 16 */
+	while (index % 16)
+		index++;
+	if (verbose > 0)
+		printf("Final index %d\n", index);
+	return index;
+}
+
+/*
+ * Compute the size of nodes and leaves. We start by assuming that
+ * each node needs to store a three-byte offset, for a total size of
+ * four bytes including the header byte. The indexes of the nodes are
+ * calculated based on that, and then this function is called to see
+ * if the sizes of some nodes can be reduced.  This is repeated until
+ * no more changes are seen.
+ */
+static int
+size_nodes(struct tree *tree)
+{
+	struct tree *next;
+	struct node *node;
+	struct node *right;
+	struct node *n;
+	unsigned int leftmask;
+	unsigned int rightmask;
+	unsigned int bitmask;
+	unsigned int pathbits;
+	unsigned int pathmask;
+	int changed;
+	int offset;
+	int size;
+	int indent;
+
+	indent = 1;
+	changed = 0;
+	size = 0;
+
+	if (verbose > 0)
+		printf("Sizing %s_%x", tree->type, tree->maxage);
+	if (tree->childnode == LEAF)
+		goto done;
+
+	assert(tree->childnode == NODE);
+	pathbits = 0;
+	pathmask = 0;
+	node = tree->root;
+	leftmask = rightmask = 0;
+	while (node) {
+		if (!node->mark)
+			goto skip;
+		offset = 0;
+		if (!node->left || !node->right) {
+			size = 1;
+		} else {
+			if (node->rightnode == NODE) {
+				right = node->right;
+				next = tree->next;
+				while (!right->mark) {
+					assert(next);
+					n = next->root;
+					while (n->bitnum != node->bitnum) {
+						if (pathbits & (1<<n->bitnum))
+							n = n->right;
+						else
+							n = n->left;
+					}
+					n = n->right;
+					assert(right->bitnum == n->bitnum);
+					right = n;
+					next = next->next;
+				}
+				offset = right->index - node->index;
+			} else {
+				offset = *tree->leaf_index(tree, node->right);
+				offset -= node->index;
+			}
+			assert(offset >= 0);
+			assert(offset <= 0xffffff);
+			if (offset <= 0xff) {
+				size = 2;
+			} else if (offset <= 0xffff) {
+				size = 3;
+			} else { /* offset <= 0xffffff */
+				size = 4;
+			}
+		}
+		if (node->size != size || node->offset != offset) {
+			node->size = size;
+			node->offset = offset;
+			changed++;
+		}
+skip:
+		while (node) {
+			bitmask = 1 << node->bitnum;
+			pathmask |= bitmask;
+			if (node->mark && (leftmask & bitmask) == 0) {
+				leftmask |= bitmask;
+				if (node->leftnode == LEAF) {
+					assert(node->left);
+				} else if (node->left) {
+					assert(node->leftnode == NODE);
+					indent += 1;
+					node = node->left;
+					break;
+				}
+			}
+			if (node->mark && (rightmask & bitmask) == 0) {
+				rightmask |= bitmask;
+				pathbits |= bitmask;
+				if (node->rightnode == LEAF) {
+					assert(node->right);
+				} else if (node->right) {
+					assert(node->rightnode==NODE);
+					indent += 1;
+					node = node->right;
+					break;
+				}
+			}
+			leftmask &= ~bitmask;
+			rightmask &= ~bitmask;
+			pathmask &= ~bitmask;
+			pathbits &= ~bitmask;
+			node = node->parent;
+			indent -= 1;
+		}
+	}
+done:
+	if (verbose > 0)
+		printf("Found %d changes\n", changed);
+	return changed;
+}
+
+/*
+ * Emit a trie for the given tree into the data array.
+ */
+static void
+emit(struct tree *tree, unsigned char *data)
+{
+	struct node *node;
+	unsigned int leftmask;
+	unsigned int rightmask;
+	unsigned int bitmask;
+	int offlen;
+	int offset;
+	int index;
+	int indent;
+	unsigned char byte;
+
+	index = tree->index;
+	data += index;
+	indent = 1;
+	if (verbose > 0)
+		printf("Emitting %s_%x\n", tree->type, tree->maxage);
+	if (tree->childnode == LEAF) {
+		assert(tree->root);
+		tree->leaf_emit(tree->root, data);
+		return;
+	}
+
+	assert(tree->childnode == NODE);
+	node = tree->root;
+	leftmask = rightmask = 0;
+	while (node) {
+		if (!node->mark)
+			goto skip;
+		assert(node->offset != -1);
+		assert(node->index == index);
+
+		byte = 0;
+		if (node->nextbyte)
+			byte |= NEXTBYTE;
+		byte |= (node->bitnum & BITNUM);
+		if (node->left && node->right) {
+			if (node->leftnode == NODE)
+				byte |= LEFTNODE;
+			if (node->rightnode == NODE)
+				byte |= RIGHTNODE;
+			if (node->offset <= 0xff)
+				offlen = 1;
+			else if (node->offset <= 0xffff)
+				offlen = 2;
+			else
+				offlen = 3;
+			offset = node->offset;
+			byte |= offlen << OFFLEN_SHIFT;
+			*data++ = byte;
+			index++;
+			while (offlen--) {
+				*data++ = offset & 0xff;
+				index++;
+				offset >>= 8;
+			}
+		} else if (node->left) {
+			if (node->leftnode == NODE)
+				byte |= TRIENODE;
+			*data++ = byte;
+			index++;
+		} else if (node->right) {
+			byte |= RIGHTNODE;
+			if (node->rightnode == NODE)
+				byte |= TRIENODE;
+			*data++ = byte;
+			index++;
+		} else {
+			assert(0);
+		}
+skip:
+		while (node) {
+			bitmask = 1 << node->bitnum;
+			if (node->mark && (leftmask & bitmask) == 0) {
+				leftmask |= bitmask;
+				if (node->leftnode == LEAF) {
+					assert(node->left);
+					data = tree->leaf_emit(node->left,
+							       data);
+					index += tree->leaf_size(node->left);
+				} else if (node->left) {
+					assert(node->leftnode == NODE);
+					indent += 1;
+					node = node->left;
+					break;
+				}
+			}
+			if (node->mark && (rightmask & bitmask) == 0) {
+				rightmask |= bitmask;
+				if (node->rightnode == LEAF) {
+					assert(node->right);
+					data = tree->leaf_emit(node->right,
+							       data);
+					index += tree->leaf_size(node->right);
+				} else if (node->right) {
+					assert(node->rightnode==NODE);
+					indent += 1;
+					node = node->right;
+					break;
+				}
+			}
+			leftmask &= ~bitmask;
+			rightmask &= ~bitmask;
+			node = node->parent;
+			indent -= 1;
+		}
+	}
+}
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * Unicode data.
+ *
+ * We need to keep track of the Canonical Combining Class, the Age,
+ * and decompositions for a code point.
+ *
+ * For the Age, we store the index into the ages table.  Effectively
+ * this is a generation number that the table maps to a unicode
+ * version.
+ *
+ * The correction field is used to indicate that this entry is in the
+ * corrections array, which contains decompositions that were
+ * corrected in later revisions.  The value of the correction field is
+ * the Unicode version in which the mapping was corrected.
+ */
+struct unicode_data {
+	unsigned int code;
+	int ccc;
+	int gen;
+	int correction;
+	unsigned int *utf32nfkdi;
+	unsigned int *utf32nfkdicf;
+	char *utf8nfkdi;
+	char *utf8nfkdicf;
+};
+
+struct unicode_data unicode_data[0x110000];
+struct unicode_data *corrections;
+int    corrections_count;
+
+struct tree *nfkdi_tree;
+struct tree *nfkdicf_tree;
+
+struct tree *trees;
+int          trees_count;
+
+/*
+ * Check the corrections array to see if this entry was corrected at
+ * some point.
+ */
+static struct unicode_data *
+corrections_lookup(struct unicode_data *u)
+{
+	int i;
+
+	for (i = 0; i != corrections_count; i++)
+		if (u->code == corrections[i].code)
+			return &corrections[i];
+	return u;
+}
+
+static int
+nfkdi_equal(void *l, void *r)
+{
+	struct unicode_data *left = l;
+	struct unicode_data *right = r;
+
+	if (left->gen != right->gen)
+		return 0;
+	if (left->ccc != right->ccc)
+		return 0;
+	if (left->utf8nfkdi && right->utf8nfkdi &&
+	    strcmp(left->utf8nfkdi, right->utf8nfkdi) == 0)
+		return 1;
+	if (left->utf8nfkdi || right->utf8nfkdi)
+		return 0;
+	return 1;
+}
+
+static int
+nfkdicf_equal(void *l, void *r)
+{
+	struct unicode_data *left = l;
+	struct unicode_data *right = r;
+
+	if (left->gen != right->gen)
+		return 0;
+	if (left->ccc != right->ccc)
+		return 0;
+	if (left->utf8nfkdicf && right->utf8nfkdicf &&
+	    strcmp(left->utf8nfkdicf, right->utf8nfkdicf) == 0)
+		return 1;
+	if (left->utf8nfkdicf || right->utf8nfkdicf)
+		return 0;
+	if (left->utf8nfkdi && right->utf8nfkdi &&
+	    strcmp(left->utf8nfkdi, right->utf8nfkdi) == 0)
+		return 1;
+	if (left->utf8nfkdi || right->utf8nfkdi)
+		return 0;
+	return 1;
+}
+
+static void
+nfkdi_print(void *l, int indent)
+{
+	struct unicode_data *leaf = l;
+
+	printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf,
+		leaf->code, leaf->ccc, leaf->gen);
+	if (leaf->utf8nfkdi)
+		printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi);
+	printf("\n");
+}
+
+static void
+nfkdicf_print(void *l, int indent)
+{
+	struct unicode_data *leaf = l;
+
+	printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf,
+		leaf->code, leaf->ccc, leaf->gen);
+	if (leaf->utf8nfkdicf)
+		printf(" nfkdicf \"%s\"", (const char*)leaf->utf8nfkdicf);
+	else if (leaf->utf8nfkdi)
+		printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi);
+	printf("\n");
+}
+
+static int
+nfkdi_mark(void *l)
+{
+	return 1;
+}
+
+static int
+nfkdicf_mark(void *l)
+{
+	struct unicode_data *leaf = l;
+
+	if (leaf->utf8nfkdicf)
+		return 1;
+	return 0;
+}
+
+static int
+correction_mark(void *l)
+{
+	struct unicode_data *leaf = l;
+
+	return leaf->correction;
+}
+
+static int
+nfkdi_size(void *l)
+{
+	struct unicode_data *leaf = l;
+
+	int size = 2;
+	if (leaf->utf8nfkdi)
+		size += strlen(leaf->utf8nfkdi) + 1;
+	return size;
+}
+
+static int
+nfkdicf_size(void *l)
+{
+	struct unicode_data *leaf = l;
+
+	int size = 2;
+	if (leaf->utf8nfkdicf)
+		size += strlen(leaf->utf8nfkdicf) + 1;
+	else if (leaf->utf8nfkdi)
+		size += strlen(leaf->utf8nfkdi) + 1;
+	return size;
+}
+
+static int *
+nfkdi_index(struct tree *tree, void *l)
+{
+	struct unicode_data *leaf = l;
+
+	return &tree->leafindex[leaf->code];
+}
+
+static int *
+nfkdicf_index(struct tree *tree, void *l)
+{
+	struct unicode_data *leaf = l;
+
+	return &tree->leafindex[leaf->code];
+}
+
+static unsigned char *
+nfkdi_emit(void *l, unsigned char *data)
+{
+	struct unicode_data *leaf = l;
+	unsigned char *s;
+
+	*data++ = leaf->gen;
+	if (leaf->utf8nfkdi) {
+		*data++ = DECOMPOSE;
+		s = (unsigned char*)leaf->utf8nfkdi;
+		while ((*data++ = *s++) != 0)
+			;
+	} else {
+		*data++ = leaf->ccc;
+	}
+	return data;
+}
+
+static unsigned char *
+nfkdicf_emit(void *l, unsigned char *data)
+{
+	struct unicode_data *leaf = l;
+	unsigned char *s;
+
+	*data++ = leaf->gen;
+	if (leaf->utf8nfkdicf) {
+		*data++ = DECOMPOSE;
+		s = (unsigned char*)leaf->utf8nfkdicf;
+		while ((*data++ = *s++) != 0)
+			;
+	} else if (leaf->utf8nfkdi) {
+		*data++ = DECOMPOSE;
+		s = (unsigned char*)leaf->utf8nfkdi;
+		while ((*data++ = *s++) != 0)
+			;
+	} else {
+		*data++ = leaf->ccc;
+	}
+	return data;
+}
+
+static void
+utf8_create(struct unicode_data *data)
+{
+	char utf[18*4+1];
+	char *u;
+	unsigned int *um;
+	int i;
+
+	u = utf;
+	um = data->utf32nfkdi;
+	if (um) {
+		for (i = 0; um[i]; i++)
+			u += utf8key(um[i], u);
+		*u = '\0';
+		data->utf8nfkdi = strdup((char*)utf);
+	}
+	u = utf;
+	um = data->utf32nfkdicf;
+	if (um) {
+		for (i = 0; um[i]; i++)
+			u += utf8key(um[i], u);
+		*u = '\0';
+		if (!data->utf8nfkdi || strcmp(data->utf8nfkdi, (char*)utf))
+			data->utf8nfkdicf = strdup((char*)utf);
+	}
+}
+
+static void
+utf8_init(void)
+{
+	unsigned int unichar;
+	int i;
+
+	for (unichar = 0; unichar != 0x110000; unichar++)
+		utf8_create(&unicode_data[unichar]);
+
+	for (i = 0; i != corrections_count; i++)
+		utf8_create(&corrections[i]);
+}
+
+static void
+trees_init(void)
+{
+	struct unicode_data *data;
+	unsigned int maxage;
+	unsigned int nextage;
+	int count;
+	int i;
+	int j;
+
+	/* Count the number of different ages. */
+	count = 0;
+	nextage = (unsigned int)-1;
+	do {
+		maxage = nextage;
+		nextage = 0;
+		for (i = 0; i < corrections_count; i++) {
+			data = &corrections[i];
+			if (nextage < data->correction &&
+			    data->correction < maxage)
+				nextage = data->correction;
+		}
+		count++;
+	} while (nextage);
+
+	/* Two trees per age: nfkdi and nfkdicf */
+	trees_count = count * 2;
+	trees = calloc(trees_count, sizeof(struct tree));
+
+	/* Assign ages to the trees. */
+	count = trees_count;
+	nextage = (unsigned int)-1;
+	do {
+		maxage = nextage;
+		trees[--count].maxage = maxage;
+		trees[--count].maxage = maxage;
+		nextage = 0;
+		for (i = 0; i < corrections_count; i++) {
+			data = &corrections[i];
+			if (nextage < data->correction &&
+			    data->correction < maxage)
+				nextage = data->correction;
+		}
+	} while (nextage);
+
+	/* The ages assigned above are off by one. */
+	for (i = 0; i != trees_count; i++) {
+		j = 0;
+		while (ages[j] < trees[i].maxage)
+			j++;
+		trees[i].maxage = ages[j-1];
+	}
+
+	/* Set up the forwarding between trees. */
+	trees[trees_count-2].next = &trees[trees_count-1];
+	trees[trees_count-1].leaf_mark = nfkdi_mark;
+	trees[trees_count-2].leaf_mark = nfkdicf_mark;
+	for (i = 0; i != trees_count-2; i += 2) {
+		trees[i].next = &trees[trees_count-2];
+		trees[i].leaf_mark = correction_mark;
+		trees[i+1].next = &trees[trees_count-1];
+		trees[i+1].leaf_mark = correction_mark;
+	}
+
+	/* Assign the callouts. */
+	for (i = 0; i != trees_count; i += 2) {
+		trees[i].type = "nfkdicf";
+		trees[i].leaf_equal = nfkdicf_equal;
+		trees[i].leaf_print = nfkdicf_print;
+		trees[i].leaf_size = nfkdicf_size;
+		trees[i].leaf_index = nfkdicf_index;
+		trees[i].leaf_emit = nfkdicf_emit;
+
+		trees[i+1].type = "nfkdi";
+		trees[i+1].leaf_equal = nfkdi_equal;
+		trees[i+1].leaf_print = nfkdi_print;
+		trees[i+1].leaf_size = nfkdi_size;
+		trees[i+1].leaf_index = nfkdi_index;
+		trees[i+1].leaf_emit = nfkdi_emit;
+	}
+
+	/* Finish init. */
+	for (i = 0; i != trees_count; i++)
+		trees[i].childnode = NODE;
+}
+
+static void
+trees_populate(void)
+{
+	struct unicode_data *data;
+	unsigned int unichar;
+	char keyval[4];
+	int keylen;
+	int i;
+
+	for (i = 0; i != trees_count; i++) {
+		if (verbose > 0) {
+			printf("Populating %s_%x\n",
+				trees[i].type, trees[i].maxage);
+		}
+		for (unichar = 0; unichar != 0x110000; unichar++) {
+			if (unicode_data[unichar].gen < 0)
+				continue;
+			keylen = utf8key(unichar, keyval);
+			data = corrections_lookup(&unicode_data[unichar]);
+			if (data->correction <= trees[i].maxage)
+				data = &unicode_data[unichar];
+			insert(&trees[i], keyval, keylen, data);
+		}
+	}
+}
+
+static void
+trees_reduce(void)
+{
+	int i;
+	int size;
+	int changed;
+
+	for (i = 0; i != trees_count; i++)
+		prune(&trees[i]);
+	for (i = 0; i != trees_count; i++)
+		mark_nodes(&trees[i]);
+	do {
+		size = 0;
+		for (i = 0; i != trees_count; i++)
+			size = index_nodes(&trees[i], size);
+		changed = 0;
+		for (i = 0; i != trees_count; i++)
+			changed += size_nodes(&trees[i]);
+	} while (changed);
+
+	utf8data = calloc(size, 1);
+	utf8data_size = size;
+	for (i = 0; i != trees_count; i++)
+		emit(&trees[i], utf8data);
+
+	if (verbose > 0) {
+		for (i = 0; i != trees_count; i++) {
+			printf("%s_%x idx %d\n",
+				trees[i].type, trees[i].maxage, trees[i].index);
+		}
+	}
+
+	nfkdi = utf8data + trees[trees_count-1].index;
+	nfkdicf = utf8data + trees[trees_count-2].index;
+
+	nfkdi_tree = &trees[trees_count-1];
+	nfkdicf_tree = &trees[trees_count-2];
+}
+
+static void
+verify(struct tree *tree)
+{
+	struct unicode_data *data;
+	utf8leaf_t	*leaf;
+	unsigned int	unichar;
+	char		key[4];
+	int		report;
+	int		nocf;
+
+	if (verbose > 0)
+		printf("Verifying %s_%x\n", tree->type, tree->maxage);
+	nocf = strcmp(tree->type, "nfkdicf");
+
+	for (unichar = 0; unichar != 0x110000; unichar++) {
+		report = 0;
+		data = corrections_lookup(&unicode_data[unichar]);
+		if (data->correction <= tree->maxage)
+			data = &unicode_data[unichar];
+		utf8key(unichar, key);
+		leaf = utf8lookup(tree, key);
+		if (!leaf) {
+			if (data->gen != -1)
+				report++;
+			if (unichar < 0xd800 || unichar > 0xdfff)
+				report++;
+		} else {
+			if (unichar >= 0xd800 && unichar <= 0xdfff)
+				report++;
+			if (data->gen == -1)
+				report++;
+			if (data->gen != LEAF_GEN(leaf))
+				report++;
+			if (LEAF_CCC(leaf) == DECOMPOSE) {
+				if (nocf) {
+					if (!data->utf8nfkdi) {
+						report++;
+					} else if (strcmp(data->utf8nfkdi,
+							LEAF_STR(leaf))) {
+						report++;
+					}
+				} else {
+					if (!data->utf8nfkdicf &&
+					    !data->utf8nfkdi) {
+						report++;
+					} else if (data->utf8nfkdicf) {
+						if (strcmp(data->utf8nfkdicf,
+							   LEAF_STR(leaf)))
+							report++;
+					} else if (strcmp(data->utf8nfkdi,
+							  LEAF_STR(leaf))) {
+						report++;
+					}
+				}
+			} else if (data->ccc != LEAF_CCC(leaf)) {
+				report++;
+			}
+		}
+		if (report) {
+			printf("%X code %X gen %d ccc %d"
+				" nfkdi -> \"%s\"",
+				unichar, data->code, data->gen,
+				data->ccc,
+				data->utf8nfkdi ? data->utf8nfkdi : "");
+			if (leaf) {
+				printf(" age %d ccc %d"
+					" nfkdi -> \"%s\"",
+					LEAF_GEN(leaf),
+					LEAF_CCC(leaf),
+					LEAF_CCC(leaf) == DECOMPOSE ?
+						LEAF_STR(leaf) : "");
+			}
+			printf("\n");
+		}
+	}
+}
+
+static void
+trees_verify(void)
+{
+	int i;
+
+	for (i = 0; i != trees_count; i++)
+		verify(&trees[i]);
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+help(void)
+{
+	printf("Usage: %s [options]\n", argv0);
+	printf("\n");
+	printf("This program creates a data trie used for parsing and\n");
+	printf("normalization of UTF-8 strings. The trie is derived from\n");
+	printf("a set of input files from the Unicode character database\n");
+	printf("found at: http://www.unicode.org/Public/UCD/latest/ucd/\n");
+	printf("\n");
+	printf("The generated tree supports two normalization forms:\n");
+	printf("\n");
+	printf("\tnfkdi:\n");
+	printf("\t- Apply unicode normalization form NFKD.\n");
+	printf("\t- Remove any Default_Ignorable_Code_Point.\n");
+	printf("\n");
+	printf("\tnfkdicf:\n");
+	printf("\t- Apply unicode normalization form NFKD.\n");
+	printf("\t- Remove any Default_Ignorable_Code_Point.\n");
+	printf("\t- Apply a full casefold (C + F).\n");
+	printf("\n");
+	printf("These forms were chosen as being most useful when dealing\n");
+	printf("with file names: NFKD catches most cases where characters\n");
+	printf("should be considered equivalent. The ignorables are mostly\n");
+	printf("invisible, making names hard to type.\n");
+	printf("\n");
+	printf("The options to specify the files to be used are listed\n");
+	printf("below with their default values, which are the names used\n");
+	printf("by version 7.0.0 of the Unicode Character Database.\n");
+	printf("\n");
+	printf("The input files:\n");
+	printf("\t-a %s\n", AGE_NAME);
+	printf("\t-c %s\n", CCC_NAME);
+	printf("\t-p %s\n", PROP_NAME);
+	printf("\t-d %s\n", DATA_NAME);
+	printf("\t-f %s\n", FOLD_NAME);
+	printf("\t-n %s\n", NORM_NAME);
+	printf("\n");
+	printf("Additionally, the generated tables are tested using:\n");
+	printf("\t-t %s\n", TEST_NAME);
+	printf("\n");
+	printf("Finally, the output file:\n");
+	printf("\t-o %s\n", UTF8_NAME);
+	printf("\n");
+}
+
+static void
+usage(void)
+{
+	help();
+	exit(1);
+}
+
+static void
+open_fail(const char *name, int error)
+{
+	printf("Error %d opening %s: %s\n", error, name, strerror(error));
+	exit(1);
+}
+
+static void
+file_fail(const char *filename)
+{
+	printf("Error parsing %s\n", filename);
+	exit(1);
+}
+
+static void
+line_fail(const char *filename, const char *line)
+{
+	printf("Error parsing %s:%s\n", filename, line);
+	exit(1);
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+print_utf32(unsigned int *utf32str)
+{
+	int	i;
+
+	for (i = 0; utf32str[i]; i++)
+		printf(" %X", utf32str[i]);
+}
+
+static void
+print_utf32nfkdi(unsigned int unichar)
+{
+	printf(" %X ->", unichar);
+	print_utf32(unicode_data[unichar].utf32nfkdi);
+	printf("\n");
+}
+
+static void
+print_utf32nfkdicf(unsigned int unichar)
+{
+	printf(" %X ->", unichar);
+	print_utf32(unicode_data[unichar].utf32nfkdicf);
+	printf("\n");
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+age_init(void)
+{
+	FILE *file;
+	unsigned int first;
+	unsigned int last;
+	unsigned int unichar;
+	unsigned int major;
+	unsigned int minor;
+	unsigned int revision;
+	int gen;
+	int count;
+	int ret;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", age_name);
+
+	file = fopen(age_name, "r");
+	if (!file)
+		open_fail(age_name, errno);
+	count = 0;
+
+	gen = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "# Age=V%d_%d_%d",
+				&major, &minor, &revision);
+		if (ret == 3) {
+			ages_count++;
+			if (verbose > 1)
+				printf(" Age V%d_%d_%d\n",
+					major, minor, revision);
+			if (!age_valid(major, minor, revision))
+				line_fail(age_name, line);
+			continue;
+		}
+		ret = sscanf(line, "# Age=V%d_%d", &major, &minor);
+		if (ret == 2) {
+			ages_count++;
+			if (verbose > 1)
+				printf(" Age V%d_%d\n", major, minor);
+			if (!age_valid(major, minor, 0))
+				line_fail(age_name, line);
+			continue;
+		}
+	}
+
+	/* We must have found something above. */
+	if (verbose > 1)
+		printf("%d age entries\n", ages_count);
+	if (ages_count == 0 || ages_count > MAXGEN)
+		file_fail(age_name);
+
+	/* There is a 0 entry. */
+	ages_count++;
+	ages = calloc(ages_count + 1, sizeof(*ages));
+	/* And a guard entry. */
+	ages[ages_count] = (unsigned int)-1;
+
+	rewind(file);
+	count = 0;
+	gen = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "# Age=V%d_%d_%d",
+				&major, &minor, &revision);
+		if (ret == 3) {
+			ages[++gen] =
+				UNICODE_AGE(major, minor, revision);
+			if (verbose > 1)
+				printf(" Age V%d_%d_%d = gen %d\n",
+					major, minor, revision, gen);
+			if (!age_valid(major, minor, revision))
+				line_fail(age_name, line);
+			continue;
+		}
+		ret = sscanf(line, "# Age=V%d_%d", &major, &minor);
+		if (ret == 2) {
+			ages[++gen] = UNICODE_AGE(major, minor, 0);
+			if (verbose > 1)
+				printf(" Age V%d_%d = gen %d\n",
+					major, minor, gen);
+			if (!age_valid(major, minor, 0))
+				line_fail(age_name, line);
+			continue;
+		}
+		ret = sscanf(line, "%X..%X ; %d.%d #",
+			     &first, &last, &major, &minor);
+		if (ret == 4) {
+			for (unichar = first; unichar <= last; unichar++)
+				unicode_data[unichar].gen = gen;
+			count += 1 + last - first;
+			if (verbose > 1)
+				printf("  %X..%X gen %d\n", first, last, gen);
+			if (!utf32valid(first) || !utf32valid(last))
+				line_fail(age_name, line);
+			continue;
+		}
+		ret = sscanf(line, "%X ; %d.%d #", &unichar, &major, &minor);
+		if (ret == 3) {
+			unicode_data[unichar].gen = gen;
+			count++;
+			if (verbose > 1)
+				printf("  %X gen %d\n", unichar, gen);
+			if (!utf32valid(unichar))
+				line_fail(age_name, line);
+			continue;
+		}
+	}
+	unicode_maxage = ages[gen];
+	fclose(file);
+
+	/* Nix surrogate block */
+	if (verbose > 1)
+		printf(" Removing surrogate block D800..DFFF\n");
+	for (unichar = 0xd800; unichar <= 0xdfff; unichar++)
+		unicode_data[unichar].gen = -1;
+
+	if (verbose > 0)
+		printf("Found %d entries\n", count);
+	if (count == 0)
+		file_fail(age_name);
+}
+
+static void
+ccc_init(void)
+{
+	FILE *file;
+	unsigned int first;
+	unsigned int last;
+	unsigned int unichar;
+	unsigned int value;
+	int count;
+	int ret;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", ccc_name);
+
+	file = fopen(ccc_name, "r");
+	if (!file)
+		open_fail(ccc_name, errno);
+
+	count = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%X..%X ; %d #", &first, &last, &value);
+		if (ret == 3) {
+			for (unichar = first; unichar <= last; unichar++) {
+				unicode_data[unichar].ccc = value;
+				count++;
+			}
+			if (verbose > 1)
+				printf(" %X..%X ccc %d\n", first, last, value);
+			if (!utf32valid(first) || !utf32valid(last))
+				line_fail(ccc_name, line);
+			continue;
+		}
+		ret = sscanf(line, "%X ; %d #", &unichar, &value);
+		if (ret == 2) {
+			unicode_data[unichar].ccc = value;
+			count++;
+			if (verbose > 1)
+				printf(" %X ccc %d\n", unichar, value);
+			if (!utf32valid(unichar))
+				line_fail(ccc_name, line);
+			continue;
+		}
+	}
+	fclose(file);
+
+	if (verbose > 0)
+		printf("Found %d entries\n", count);
+	if (count == 0)
+		file_fail(ccc_name);
+}
+
+static void
+nfkdi_init(void)
+{
+	FILE *file;
+	unsigned int unichar;
+	unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+	char *s;
+	unsigned int *um;
+	int count;
+	int i;
+	int ret;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", data_name);
+	file = fopen(data_name, "r");
+	if (!file)
+		open_fail(data_name, errno);
+
+	count = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%X;%*[^;];%*[^;];%*[^;];%*[^;];%[^;];",
+			     &unichar, buf0);
+		if (ret != 2)
+			continue;
+		if (!utf32valid(unichar))
+			line_fail(data_name, line);
+
+		s = buf0;
+		/* skip over <tag> */
+		if (*s == '<')
+			while (*s++ != ' ')
+				;
+		/* decode the decomposition into UTF-32 */
+		i = 0;
+		while (*s) {
+			mapping[i] = strtoul(s, &s, 16);
+			if (!utf32valid(mapping[i]))
+				line_fail(data_name, line);
+			i++;
+		}
+		mapping[i++] = 0;
+
+		um = malloc(i * sizeof(unsigned int));
+		memcpy(um, mapping, i * sizeof(unsigned int));
+		unicode_data[unichar].utf32nfkdi = um;
+
+		if (verbose > 1)
+			print_utf32nfkdi(unichar);
+		count++;
+	}
+	fclose(file);
+	if (verbose > 0)
+		printf("Found %d entries\n", count);
+	if (count == 0)
+		file_fail(data_name);
+}
+
+static void
+nfkdicf_init(void)
+{
+	FILE *file;
+	unsigned int unichar;
+	unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+	char status;
+	char *s;
+	unsigned int *um;
+	int i;
+	int count;
+	int ret;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", fold_name);
+	file = fopen(fold_name, "r");
+	if (!file)
+		open_fail(fold_name, errno);
+
+	count = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%X; %c; %[^;];", &unichar, &status, buf0);
+		if (ret != 3)
+			continue;
+		if (!utf32valid(unichar))
+			line_fail(fold_name, line);
+		/* Use the C+F casefold. */
+		if (status != 'C' && status != 'F')
+			continue;
+		s = buf0;
+		if (*s == '<')
+			while (*s++ != ' ')
+				;
+		i = 0;
+		while (*s) {
+			mapping[i] = strtoul(s, &s, 16);
+			if (!utf32valid(mapping[i]))
+				line_fail(fold_name, line);
+			i++;
+		}
+		mapping[i++] = 0;
+
+		um = malloc(i * sizeof(unsigned int));
+		memcpy(um, mapping, i * sizeof(unsigned int));
+		unicode_data[unichar].utf32nfkdicf = um;
+
+		if (verbose > 1)
+			print_utf32nfkdicf(unichar);
+		count++;
+	}
+	fclose(file);
+	if (verbose > 0)
+		printf("Found %d entries\n", count);
+	if (count == 0)
+		file_fail(fold_name);
+}
+
+static void
+ignore_init(void)
+{
+	FILE *file;
+	unsigned int unichar;
+	unsigned int first;
+	unsigned int last;
+	unsigned int *um;
+	int count;
+	int ret;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", prop_name);
+	file = fopen(prop_name, "r");
+	if (!file)
+		open_fail(prop_name, errno);
+	assert(file);
+	count = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%X..%X ; %s # ", &first, &last, buf0);
+		if (ret == 3) {
+			if (strcmp(buf0, "Default_Ignorable_Code_Point"))
+				continue;
+			if (!utf32valid(first) || !utf32valid(last))
+				line_fail(prop_name, line);
+			for (unichar = first; unichar <= last; unichar++) {
+				free(unicode_data[unichar].utf32nfkdi);
+				um = malloc(sizeof(unsigned int));
+				*um = 0;
+				unicode_data[unichar].utf32nfkdi = um;
+				free(unicode_data[unichar].utf32nfkdicf);
+				um = malloc(sizeof(unsigned int));
+				*um = 0;
+				unicode_data[unichar].utf32nfkdicf = um;
+				count++;
+			}
+			if (verbose > 1)
+				printf(" %X..%X Default_Ignorable_Code_Point\n",
+					first, last);
+			continue;
+		}
+		ret = sscanf(line, "%X ; %s # ", &unichar, buf0);
+		if (ret == 2) {
+			if (strcmp(buf0, "Default_Ignorable_Code_Point"))
+				continue;
+			if (!utf32valid(unichar))
+				line_fail(prop_name, line);
+			free(unicode_data[unichar].utf32nfkdi);
+			um = malloc(sizeof(unsigned int));
+			*um = 0;
+			unicode_data[unichar].utf32nfkdi = um;
+			free(unicode_data[unichar].utf32nfkdicf);
+			um = malloc(sizeof(unsigned int));
+			*um = 0;
+			unicode_data[unichar].utf32nfkdicf = um;
+			if (verbose > 1)
+				printf(" %X Default_Ignorable_Code_Point\n",
+					unichar);
+			count++;
+			continue;
+		}
+	}
+	fclose(file);
+
+	if (verbose > 0)
+		printf("Found %d entries\n", count);
+	if (count == 0)
+		file_fail(prop_name);
+}
+
+static void
+corrections_init(void)
+{
+	FILE *file;
+	unsigned int unichar;
+	unsigned int major;
+	unsigned int minor;
+	unsigned int revision;
+	unsigned int age;
+	unsigned int *um;
+	unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+	char *s;
+	int i;
+	int count;
+	int ret;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", norm_name);
+	file = fopen(norm_name, "r");
+	if (!file)
+		open_fail(norm_name, errno);
+
+	count = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%X;%[^;];%[^;];%d.%d.%d #",
+				&unichar, buf0, buf1,
+				&major, &minor, &revision);
+		if (ret != 6)
+			continue;
+		if (!utf32valid(unichar) || !age_valid(major, minor, revision))
+			line_fail(norm_name, line);
+		count++;
+	}
+	corrections = calloc(count, sizeof(struct unicode_data));
+	corrections_count = count;
+	rewind(file);
+
+	count = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%X;%[^;];%[^;];%d.%d.%d #",
+				&unichar, buf0, buf1,
+				&major, &minor, &revision);
+		if (ret != 6)
+			continue;
+		if (!utf32valid(unichar) || !age_valid(major, minor, revision))
+			line_fail(norm_name, line);
+		corrections[count] = unicode_data[unichar];
+		assert(corrections[count].code == unichar);
+		age = UNICODE_AGE(major, minor, revision);
+		corrections[count].correction = age;
+
+		i = 0;
+		s = buf0;
+		while (*s) {
+			mapping[i] = strtoul(s, &s, 16);
+			if (!utf32valid(mapping[i]))
+				line_fail(norm_name, line);
+			i++;
+		}
+		mapping[i++] = 0;
+
+		um = malloc(i * sizeof(unsigned int));
+		memcpy(um, mapping, i * sizeof(unsigned int));
+		corrections[count].utf32nfkdi = um;
+
+		if (verbose > 1)
+			printf(" %X -> %s -> %s V%d_%d_%d\n",
+				unichar, buf0, buf1, major, minor, revision);
+		count++;
+	}
+	fclose(file);
+
+	if (verbose > 0)
+		printf("Found %d entries\n", count);
+	if (count == 0)
+		file_fail(norm_name);
+}
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * Hangul decomposition (algorithm from Section 3.12 of Unicode 6.3.0)
+ *
+ * AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
+ * D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
+ *
+ * SBase = 0xAC00
+ * LBase = 0x1100
+ * VBase = 0x1161
+ * TBase = 0x11A7
+ * LCount = 19
+ * VCount = 21
+ * TCount = 28
+ * NCount = 588 (VCount * TCount)
+ * SCount = 11172 (LCount * NCount)
+ *
+ * Decomposition:
+ *   SIndex = s - SBase
+ *
+ * LV (Canonical/Full)
+ *   LIndex = SIndex / NCount
+ *   VIndex = (SIndex % NCount) / TCount
+ *   LPart = LBase + LIndex
+ *   VPart = VBase + VIndex
+ *
+ * LVT (Canonical)
+ *   LVIndex = (SIndex / TCount) * TCount
+ *   TIndex = SIndex % TCount
+ *   LVPart = SBase + LVIndex
+ *   TPart = TBase + TIndex
+ *
+ * LVT (Full)
+ *   LIndex = SIndex / NCount
+ *   VIndex = (SIndex % NCount) / TCount
+ *   TIndex = SIndex % TCount
+ *   LPart = LBase + LIndex
+ *   VPart = VBase + VIndex
+ *   if (TIndex == 0) {
+ *          d = <LPart, VPart>
+ *   } else {
+ *          TPart = TBase + TIndex
+ *          d = <LPart, VPart, TPart>
+ *   }
+ *
+ */
+
+static void
+hangul_decompose(void)
+{
+	unsigned int sb = 0xAC00;
+	unsigned int lb = 0x1100;
+	unsigned int vb = 0x1161;
+	unsigned int tb = 0x11a7;
+	/* unsigned int lc = 19; */
+	unsigned int vc = 21;
+	unsigned int tc = 28;
+	unsigned int nc = (vc * tc);
+	/* unsigned int sc = (lc * nc); */
+	unsigned int unichar;
+	unsigned int mapping[4];
+	unsigned int *um;
+	int count;
+	int i;
+
+	if (verbose > 0)
+		printf("Decomposing hangul\n");
+	/* Hangul */
+	count = 0;
+	for (unichar = 0xAC00; unichar <= 0xD7A3; unichar++) {
+		unsigned int si = unichar - sb;
+		unsigned int li = si / nc;
+		unsigned int vi = (si % nc) / tc;
+		unsigned int ti = si % tc;
+
+		i = 0;
+		mapping[i++] = lb + li;
+		mapping[i++] = vb + vi;
+		if (ti)
+			mapping[i++] = tb + ti;
+		mapping[i++] = 0;
+
+		assert(!unicode_data[unichar].utf32nfkdi);
+		um = malloc(i * sizeof(unsigned int));
+		memcpy(um, mapping, i * sizeof(unsigned int));
+		unicode_data[unichar].utf32nfkdi = um;
+
+		assert(!unicode_data[unichar].utf32nfkdicf);
+		um = malloc(i * sizeof(unsigned int));
+		memcpy(um, mapping, i * sizeof(unsigned int));
+		unicode_data[unichar].utf32nfkdicf = um;
+
+		if (verbose > 1)
+			print_utf32nfkdi(unichar);
+
+		count++;
+	}
+	if (verbose > 0)
+		printf("Created %d entries\n", count);
+}
+
+static void
+nfkdi_decompose(void)
+{
+	unsigned int unichar;
+	unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+	unsigned int *um;
+	unsigned int *dc;
+	int count;
+	int i;
+	int j;
+	int ret;
+
+	if (verbose > 0)
+		printf("Decomposing nfkdi\n");
+
+	count = 0;
+	for (unichar = 0; unichar != 0x110000; unichar++) {
+		if (!unicode_data[unichar].utf32nfkdi)
+			continue;
+		for (;;) {
+			ret = 1;
+			i = 0;
+			um = unicode_data[unichar].utf32nfkdi;
+			while (*um) {
+				dc = unicode_data[*um].utf32nfkdi;
+				if (dc) {
+					for (j = 0; dc[j]; j++)
+						mapping[i++] = dc[j];
+					ret = 0;
+				} else {
+					mapping[i++] = *um;
+				}
+				um++;
+			}
+			mapping[i++] = 0;
+			if (ret)
+				break;
+			free(unicode_data[unichar].utf32nfkdi);
+			um = malloc(i * sizeof(unsigned int));
+			memcpy(um, mapping, i * sizeof(unsigned int));
+			unicode_data[unichar].utf32nfkdi = um;
+		}
+		/* Add this decomposition to nfkdicf if there is no entry. */
+		if (!unicode_data[unichar].utf32nfkdicf) {
+			um = malloc(i * sizeof(unsigned int));
+			memcpy(um, mapping, i * sizeof(unsigned int));
+			unicode_data[unichar].utf32nfkdicf = um;
+		}
+		if (verbose > 1)
+			print_utf32nfkdi(unichar);
+		count++;
+	}
+	if (verbose > 0)
+		printf("Processed %d entries\n", count);
+}
+
+static void
+nfkdicf_decompose(void)
+{
+	unsigned int unichar;
+	unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+	unsigned int *um;
+	unsigned int *dc;
+	int count;
+	int i;
+	int j;
+	int ret;
+
+	if (verbose > 0)
+		printf("Decomposing nfkdicf\n");
+	count = 0;
+	for (unichar = 0; unichar != 0x110000; unichar++) {
+		if (!unicode_data[unichar].utf32nfkdicf)
+			continue;
+		for (;;) {
+			ret = 1;
+			i = 0;
+			um = unicode_data[unichar].utf32nfkdicf;
+			while (*um) {
+				dc = unicode_data[*um].utf32nfkdicf;
+				if (dc) {
+					for (j = 0; dc[j]; j++)
+						mapping[i++] = dc[j];
+					ret = 0;
+				} else {
+					mapping[i++] = *um;
+				}
+				um++;
+			}
+			mapping[i++] = 0;
+			if (ret)
+				break;
+			free(unicode_data[unichar].utf32nfkdicf);
+			um = malloc(i * sizeof(unsigned int));
+			memcpy(um, mapping, i * sizeof(unsigned int));
+			unicode_data[unichar].utf32nfkdicf = um;
+		}
+		if (verbose > 1)
+			print_utf32nfkdicf(unichar);
+		count++;
+	}
+	if (verbose > 0)
+		printf("Processed %d entries\n", count);
+}
+
+/* ------------------------------------------------------------------ */
+
+int utf8agemax(struct tree *, const char *);
+int utf8nagemax(struct tree *, const char *, size_t);
+int utf8agemin(struct tree *, const char *);
+int utf8nagemin(struct tree *, const char *, size_t);
+ssize_t utf8len(struct tree *, const char *);
+ssize_t utf8nlen(struct tree *, const char *, size_t);
+struct utf8cursor;
+int utf8cursor(struct utf8cursor *, struct tree *, const char *);
+int utf8ncursor(struct utf8cursor *, struct tree *, const char *, size_t);
+int utf8byte(struct utf8cursor *);
+
+/*
+ * Use trie to scan s, touching at most len bytes.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * A non-NULL return guarantees that the UTF-8 sequence starting at s
+ * is well-formed and corresponds to a known unicode code point.  The
+ * shorthand for this will be "is valid UTF-8 unicode".
+ */
+static utf8leaf_t *
+utf8nlookup(struct tree *tree, const char *s, size_t len)
+{
+	utf8trie_t	*trie;
+	int		offlen;
+	int		offset;
+	int		mask;
+	int		node;
+
+	if (!tree || len == 0)
+		return NULL;
+	/* Only dereference tree after the NULL check above. */
+	trie = utf8data + tree->index;
+	node = 1;
+	while (node) {
+		offlen = (*trie & OFFLEN) >> OFFLEN_SHIFT;
+		if (*trie & NEXTBYTE) {
+			if (--len == 0)
+				return NULL;
+			s++;
+		}
+		mask = 1 << (*trie & BITNUM);
+		if (*s & mask) {
+			/* Right leg */
+			if (offlen) {
+				/* Right node at offset of trie */
+				node = (*trie & RIGHTNODE);
+				offset = trie[offlen];
+				while (--offlen) {
+					offset <<= 8;
+					offset |= trie[offlen];
+				}
+				trie += offset;
+			} else if (*trie & RIGHTPATH) {
+				/* Right node after this node */
+				node = (*trie & TRIENODE);
+				trie++;
+			} else {
+				/* No right node. */
+				node = 0;
+				trie = NULL;
+			}
+		} else {
+			/* Left leg */
+			if (offlen) {
+				/* Left node after this node. */
+				node = (*trie & LEFTNODE);
+				trie += offlen + 1;
+			} else if (*trie & RIGHTPATH) {
+				/* No left node. */
+				node = 0;
+				trie = NULL;
+			} else {
+				/* Left node after this node */
+				node = (*trie & TRIENODE);
+				trie++;
+			}
+		}
+	}
+	return trie;
+}
+
+/*
+ * Use trie to scan s.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * Forwards to utf8nlookup().
+ */
+static utf8leaf_t *
+utf8lookup(struct tree *tree, const char *s)
+{
+	return utf8nlookup(tree, s, (size_t)-1);
+}
+
+/*
+ * Return the number of bytes used by the current UTF-8 sequence.
+ * Assumes the input points to the first byte of a valid UTF-8
+ * sequence.
+ */
+static inline int
+utf8clen(const char *s)
+{
+	unsigned char c = *s;
+	return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0);
+}
+
+/*
+ * Maximum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if only non-assigned code points are used.
+ */
+int
+utf8agemax(struct tree *tree, const char *s)
+{
+	utf8leaf_t	*leaf;
+	int		age = 0;
+	int		leaf_age;
+
+	if (!tree)
+		return -1;
+	while (*s) {
+		if (!(leaf = utf8lookup(tree, s)))
+			return -1;
+		leaf_age = ages[LEAF_GEN(leaf)];
+		if (leaf_age <= tree->maxage && leaf_age > age)
+			age = leaf_age;
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Minimum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if non-assigned code points are used.
+ */
+int
+utf8agemin(struct tree *tree, const char *s)
+{
+	utf8leaf_t	*leaf;
+	int		age;
+	int		leaf_age;
+
+	if (!tree)
+		return -1;
+	age = tree->maxage;
+	while (*s) {
+		if (!(leaf = utf8lookup(tree, s)))
+			return -1;
+		leaf_age = ages[LEAF_GEN(leaf)];
+		if (leaf_age <= tree->maxage && leaf_age < age)
+			age = leaf_age;
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Maximum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemax(struct tree *tree, const char *s, size_t len)
+{
+	utf8leaf_t	*leaf;
+	int		age = 0;
+	int		leaf_age;
+
+	if (!tree)
+		return -1;
+	while (len && *s) {
+		if (!(leaf = utf8nlookup(tree, s, len)))
+			return -1;
+		leaf_age = ages[LEAF_GEN(leaf)];
+		if (leaf_age <= tree->maxage && leaf_age > age)
+			age = leaf_age;
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Minimum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemin(struct tree *tree, const char *s, size_t len)
+{
+	utf8leaf_t	*leaf;
+	int		leaf_age;
+	int		age;
+
+	if (!tree)
+		return -1;
+	age = tree->maxage;
+	while (len && *s) {
+		if (!(leaf = utf8nlookup(tree, s, len)))
+			return -1;
+		leaf_age = ages[LEAF_GEN(leaf)];
+		if (leaf_age <= tree->maxage && leaf_age < age)
+			age = leaf_age;
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Length of the normalization of s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ *
+ * A string of Default_Ignorable_Code_Point has length 0.
+ */
+ssize_t
+utf8len(struct tree *tree, const char *s)
+{
+	utf8leaf_t	*leaf;
+	size_t		ret = 0;
+
+	if (!tree)
+		return -1;
+	while (*s) {
+		if (!(leaf = utf8lookup(tree, s)))
+			return -1;
+		if (ages[LEAF_GEN(leaf)] > tree->maxage)
+			ret += utf8clen(s);
+		else if (LEAF_CCC(leaf) == DECOMPOSE)
+			ret += strlen(LEAF_STR(leaf));
+		else
+			ret += utf8clen(s);
+		s += utf8clen(s);
+	}
+	return ret;
+}
+
+/*
+ * Length of the normalization of s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+ssize_t
+utf8nlen(struct tree *tree, const char *s, size_t len)
+{
+	utf8leaf_t	*leaf;
+	size_t		ret = 0;
+
+	if (!tree)
+		return -1;
+	while (len && *s) {
+		if (!(leaf = utf8nlookup(tree, s, len)))
+			return -1;
+		if (ages[LEAF_GEN(leaf)] > tree->maxage)
+			ret += utf8clen(s);
+		else if (LEAF_CCC(leaf) == DECOMPOSE)
+			ret += strlen(LEAF_STR(leaf));
+		else
+			ret += utf8clen(s);
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return ret;
+}
+
+/*
+ * Cursor structure used by the normalizer.
+ */
+struct utf8cursor {
+	struct tree	*tree;
+	const char	*s;
+	const char	*p;
+	const char	*ss;
+	const char	*sp;
+	unsigned int	len;
+	unsigned int	slen;
+	short int	ccc;
+	short int	nccc;
+	unsigned int	unichar;
+};
+
+/*
+ * Set up a utf8cursor for use by utf8byte().
+ *
+ *   s      : string.
+ *   len    : length of s.
+ *   u8c    : pointer to cursor.
+ *   tree   : struct tree to use for normalization.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int
+utf8ncursor(
+	struct utf8cursor *u8c,
+	struct tree	*tree,
+	const char	*s,
+	size_t		len)
+{
+	if (!tree)
+		return -1;
+	if (!s)
+		return -1;
+	u8c->tree = tree;
+	u8c->s = s;
+	u8c->p = NULL;
+	u8c->ss = NULL;
+	u8c->sp = NULL;
+	u8c->len = len;
+	u8c->slen = 0;
+	u8c->ccc = STOPPER;
+	u8c->nccc = STOPPER;
+	u8c->unichar = 0;
+	/* Check we didn't clobber the maximum length. */
+	if (u8c->len != len)
+		return -1;
+	/* The first byte of s may not be a UTF-8 continuation byte. */
+	if (len > 0 && (*s & 0xC0) == 0x80)
+		return -1;
+	return 0;
+}
+
+/*
+ * Set up a utf8cursor for use by utf8byte().
+ *
+ *   s      : NUL-terminated string.
+ *   u8c    : pointer to cursor.
+ *   tree   : struct tree to use for normalization.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int
+utf8cursor(
+	struct utf8cursor *u8c,
+	struct tree	*tree,
+	const char	*s)
+{
+	return utf8ncursor(u8c, tree, s, (unsigned int)-1);
+}
+
+/*
+ * Get one byte from the normalized form of the string described by u8c.
+ *
+ * Returns the byte cast to an unsigned char on success, and -1 on failure.
+ *
+ * The cursor keeps track of the location in the string in u8c->s.
+ * When a character is decomposed, the current location is stored in
+ * u8c->p, and u8c->s is set to the start of the decomposition. Note
+ * that bytes from a decomposition do not count against u8c->len.
+ *
+ * Characters are emitted if they match the current CCC in u8c->ccc.
+ * Hitting end-of-string while u8c->ccc == STOPPER means we're done,
+ * and the function returns 0 in that case.
+ *
+ * Sorting by CCC is done by repeatedly scanning the string.  The
+ * values of u8c->s and u8c->p are stored in u8c->ss and u8c->sp at
+ * the start of the scan.  The first pass finds the lowest CCC to be
+ * emitted and stores it in u8c->nccc, the second pass emits the
+ * characters with this CCC and finds the next lowest CCC. This limits
+ * the number of passes to 1 + the number of different CCCs in the
+ * sequence being scanned.
+ *
+ * Therefore:
+ *  u8c->p  != NULL -> a decomposition is being scanned.
+ *  u8c->ss != NULL -> this is a repeating scan.
+ *  u8c->ccc == -1  -> this is the first scan of a repeating scan.
+ */
+int
+utf8byte(struct utf8cursor *u8c)
+{
+	utf8leaf_t *leaf;
+	int ccc;
+
+	for (;;) {
+		/* Check for the end of a decomposed character. */
+		if (u8c->p && *u8c->s == '\0') {
+			u8c->s = u8c->p;
+			u8c->p = NULL;
+		}
+
+		/* Check for end-of-string. */
+		if (!u8c->p && (u8c->len == 0 || *u8c->s == '\0')) {
+			/* There is no next byte. */
+			if (u8c->ccc == STOPPER)
+				return 0;
+			/* End-of-string during a scan counts as a stopper. */
+			ccc = STOPPER;
+			goto ccc_mismatch;
+		} else if ((*u8c->s & 0xC0) == 0x80) {
+			/* This is a continuation of the current character. */
+			if (!u8c->p)
+				u8c->len--;
+			return (unsigned char)*u8c->s++;
+		}
+
+		/* Look up the data for the current character. */
+		if (u8c->p)
+			leaf = utf8lookup(u8c->tree, u8c->s);
+		else
+			leaf = utf8nlookup(u8c->tree, u8c->s, u8c->len);
+
+		/* No leaf found implies that the input is a binary blob. */
+		if (!leaf)
+			return -1;
+
+		/* Characters that are too new have CCC 0. */
+		if (ages[LEAF_GEN(leaf)] > u8c->tree->maxage) {
+			ccc = STOPPER;
+		} else if ((ccc = LEAF_CCC(leaf)) == DECOMPOSE) {
+			u8c->len -= utf8clen(u8c->s);
+			u8c->p = u8c->s + utf8clen(u8c->s);
+			u8c->s = LEAF_STR(leaf);
+			/* Empty decomposition implies CCC 0. */
+			if (*u8c->s == '\0') {
+				if (u8c->ccc == STOPPER)
+					continue;
+				ccc = STOPPER;
+				goto ccc_mismatch;
+			}
+			leaf = utf8lookup(u8c->tree, u8c->s);
+			ccc = LEAF_CCC(leaf);
+		}
+		u8c->unichar = utf8code(u8c->s);
+
+		/*
+		 * If this is not a stopper, then see if it updates
+		 * the next canonical class to be emitted.
+		 */
+		if (ccc != STOPPER && u8c->ccc < ccc && ccc < u8c->nccc)
+			u8c->nccc = ccc;
+
+		/*
+		 * Return the current byte if this is the current
+		 * combining class.
+		 */
+		if (ccc == u8c->ccc) {
+			if (!u8c->p)
+				u8c->len--;
+			return (unsigned char)*u8c->s++;
+		}
+
+		/* Current combining class mismatch. */
+	ccc_mismatch:
+		if (u8c->nccc == STOPPER) {
+			/*
+			 * Scan forward for the first canonical class
+			 * to be emitted.  Save the position from
+			 * which to restart.
+			 */
+			assert(u8c->ccc == STOPPER);
+			u8c->ccc = MINCCC - 1;
+			u8c->nccc = ccc;
+			u8c->sp = u8c->p;
+			u8c->ss = u8c->s;
+			u8c->slen = u8c->len;
+			if (!u8c->p)
+				u8c->len -= utf8clen(u8c->s);
+			u8c->s += utf8clen(u8c->s);
+		} else if (ccc != STOPPER) {
+			/* Not a stopper, and not the ccc we're emitting. */
+			if (!u8c->p)
+				u8c->len -= utf8clen(u8c->s);
+			u8c->s += utf8clen(u8c->s);
+		} else if (u8c->nccc != MAXCCC + 1) {
+			/* At a stopper, restart for next ccc. */
+			u8c->ccc = u8c->nccc;
+			u8c->nccc = MAXCCC + 1;
+			u8c->s = u8c->ss;
+			u8c->p = u8c->sp;
+			u8c->len = u8c->slen;
+		} else {
+			/* All done, proceed from here. */
+			u8c->ccc = STOPPER;
+			u8c->nccc = STOPPER;
+			u8c->sp = NULL;
+			u8c->ss = NULL;
+			u8c->slen = 0;
+		}
+	}
+}
+
+/* ------------------------------------------------------------------ */
+
+static int
+normalize_line(struct tree *tree)
+{
+	char *s;
+	char *t;
+	int c;
+	struct utf8cursor u8c;
+
+	/* First test: null-terminated string. */
+	s = buf2;
+	t = buf3;
+	if (utf8cursor(&u8c, tree, s))
+		return -1;
+	while ((c = utf8byte(&u8c)) > 0)
+		if (c != (unsigned char)*t++)
+			return -1;
+	if (c < 0)
+		return -1;
+	if (*t != 0)
+		return -1;
+
+	/* Second test: length-limited string. */
+	s = buf2;
+	t = buf3;
+	if (utf8ncursor(&u8c, tree, s, strlen(s)))
+		return -1;
+	/* Replace NUL with a value that will cause an error if seen. */
+	s[u8c.len] = -1;
+	while ((c = utf8byte(&u8c)) > 0)
+		if (c != (unsigned char)*t++)
+			return -1;
+	if (c < 0)
+		return -1;
+	if (*t != 0)
+		return -1;
+
+	return 0;
+}
+
+static void
+normalization_test(void)
+{
+	FILE *file;
+	unsigned int unichar;
+	struct unicode_data *data;
+	char *s;
+	char *t;
+	int ret;
+	int ignorables;
+	int tests = 0;
+	int failures = 0;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", test_name);
+	/* Step one, read data from file. */
+	file = fopen(test_name, "r");
+	if (!file)
+		open_fail(test_name, errno);
+
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%[^;];%*[^;];%*[^;];%*[^;];%[^;];",
+			     buf0, buf1);
+		if (ret != 2 || *line == '#')
+			continue;
+		s = buf0;
+		t = buf2;
+		while (*s) {
+			unichar = strtoul(s, &s, 16);
+			t += utf8key(unichar, t);
+		}
+		*t = '\0';
+
+		ignorables = 0;
+		s = buf1;
+		t = buf3;
+		while (*s) {
+			unichar = strtoul(s, &s, 16);
+			data = &unicode_data[unichar];
+			if (data->utf8nfkdi && !*data->utf8nfkdi)
+				ignorables = 1;
+			else
+				t += utf8key(unichar, t);
+		}
+		*t = '\0';
+
+		tests++;
+		if (normalize_line(nfkdi_tree) < 0) {
+			printf("\nline %s -> %s", buf0, buf1);
+			if (ignorables)
+				printf(" (ignorables removed)");
+			printf(" failure\n");
+			failures++;
+		}
+	}
+	fclose(file);
+	if (verbose > 0)
+		printf("Ran %d tests with %d failures\n", tests, failures);
+	if (failures)
+		file_fail(test_name);
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+write_file(void)
+{
+	FILE *file;
+	int i;
+	int j;
+	int t;
+	int gen;
+
+	if (verbose > 0)
+		printf("Writing %s\n", utf8_name);
+	file = fopen(utf8_name, "w");
+	if (!file)
+		open_fail(utf8_name, errno);
+
+	fprintf(file, "/* This file is generated code, do not edit. */\n");
+	fprintf(file, "#ifndef __INCLUDED_FROM_UTF8NORM_C__\n");
+	fprintf(file, "#error Only utf8norm.c may include this file.\n");
+	fprintf(file, "#endif\n");
+	fprintf(file, "\n");
+	fprintf(file, "static const unsigned int utf8vers = %#x;\n",
+		unicode_maxage);
+	fprintf(file, "\n");
+	fprintf(file, "static const unsigned int utf8agetab[] = {\n");
+	for (i = 0; i != ages_count; i++)
+		fprintf(file, "\t%#x%s\n", ages[i],
+			ages[i] == unicode_maxage ? "" : ",");
+	fprintf(file, "};\n");
+	fprintf(file, "\n");
+	fprintf(file, "static const struct utf8data utf8nfkdicfdata[] = {\n");
+	t = 0;
+	for (gen = 0; gen < ages_count; gen++) {
+		fprintf(file, "\t{ %#x, %d }%s\n",
+			ages[gen], trees[t].index,
+			ages[gen] == unicode_maxage ? "" : ",");
+		if (trees[t].maxage == ages[gen])
+			t += 2;
+	}
+	fprintf(file, "};\n");
+	fprintf(file, "\n");
+	fprintf(file, "static const struct utf8data utf8nfkdidata[] = {\n");
+	t = 1;
+	for (gen = 0; gen < ages_count; gen++) {
+		fprintf(file, "\t{ %#x, %d }%s\n",
+			ages[gen], trees[t].index,
+			ages[gen] == unicode_maxage ? "" : ",");
+		if (trees[t].maxage == ages[gen])
+			t += 2;
+	}
+	fprintf(file, "};\n");
+	fprintf(file, "\n");
+	fprintf(file, "static const unsigned char utf8data[%zd] = {\n",
+		utf8data_size);
+	t = 0;
+	for (i = 0; i != utf8data_size; i += 16) {
+		if (i == trees[t].index) {
+			fprintf(file, "\t/* %s_%x */\n",
+				trees[t].type, trees[t].maxage);
+			if (t < trees_count-1)
+				t++;
+		}
+		fprintf(file, "\t");
+		for (j = i; j != i + 16; j++)
+			fprintf(file, "0x%.2x%s", utf8data[j],
+				(j < utf8data_size -1 ? "," : ""));
+		fprintf(file, "\n");
+	}
+	fprintf(file, "};\n");
+	fclose(file);
+}
+
+/* ------------------------------------------------------------------ */
+
+int
+main(int argc, char *argv[])
+{
+	unsigned int unichar;
+	int opt;
+
+	argv0 = argv[0];
+
+	while ((opt = getopt(argc, argv, "a:c:d:f:hn:o:p:t:v")) != -1) {
+		switch (opt) {
+		case 'a':
+			age_name = optarg;
+			break;
+		case 'c':
+			ccc_name = optarg;
+			break;
+		case 'd':
+			data_name = optarg;
+			break;
+		case 'f':
+			fold_name = optarg;
+			break;
+		case 'n':
+			norm_name = optarg;
+			break;
+		case 'o':
+			utf8_name = optarg;
+			break;
+		case 'p':
+			prop_name = optarg;
+			break;
+		case 't':
+			test_name = optarg;
+			break;
+		case 'v':
+			verbose++;
+			break;
+		case 'h':
+			help();
+			exit(0);
+		default:
+			usage();
+		}
+	}
+
+	if (verbose > 1)
+		help();
+	for (unichar = 0; unichar != 0x110000; unichar++)
+		unicode_data[unichar].code = unichar;
+	age_init();
+	ccc_init();
+	nfkdi_init();
+	nfkdicf_init();
+	ignore_init();
+	corrections_init();
+	hangul_decompose();
+	nfkdi_decompose();
+	nfkdicf_decompose();
+	utf8_init();
+	trees_init();
+	trees_populate();
+	trees_reduce();
+	trees_verify();
+	/* Prevent "unused function" warning. */
+	(void)lookup(nfkdi_tree, " ");
+	if (verbose > 2)
+		tree_walk(nfkdi_tree);
+	if (verbose > 2)
+		tree_walk(nfkdicf_tree);
+	normalization_test();
+	write_file();
+
+	return 0;
+}
-- 
1.7.12.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 03/16] lib: add supporting code for UTF-8.
  2014-10-03 21:47 [RFC v3] Unicode/UTF-8 support for XFS Ben Myers
  2014-10-03 21:50 ` [PATCH 01/16] lib: add unicode character database files Ben Myers
  2014-10-03 21:51 ` [PATCH 02/16] scripts: add trie generator for UTF-8 Ben Myers
@ 2014-10-03 21:54 ` Ben Myers
  2014-10-03 21:54 ` [PATCH 04/16] lib/utf8norm.c: reduce the size of utf8data[] Ben Myers
                   ` (31 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Ben Myers @ 2014-10-03 21:54 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: olaf, xfs

From: Olaf Weber <olaf@sgi.com>

Supporting functions for UTF-8 normalization are in utf8norm.c with the
header utf8norm.h. Two normalization forms are supported: nfkdi and nfkdicf.

  nfkdi:
   - Apply unicode normalization form NFKD.
   - Remove any Default_Ignorable_Code_Point.

  nfkdicf:
   - Apply unicode normalization form NFKD.
   - Remove any Default_Ignorable_Code_Point.
   - Apply a full casefold (C + F).

For the purposes of the code, a string is valid UTF-8 if:

 - The values encoded are 0x1..0x10FFFF.
 - The surrogate codepoints 0xD800..0xDFFF are not encoded.
 - The shortest possible encoding is used for all values.
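
As an illustration only, the three rules above can be checked directly on a
byte string. The sketch below is standalone userspace C and is not part of
this patch, which validates input through trie lookups instead; the name
utf8_is_valid() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of the patch: return 1 if the len bytes
 * at s are valid UTF-8 under the three rules above, 0 otherwise.
 */
static int
utf8_is_valid(const unsigned char *s, size_t len)
{
	/* Smallest value that needs an n-byte sequence, indexed by n. */
	static const unsigned int min[5] = { 0, 0, 0x80, 0x800, 0x10000 };
	size_t i = 0;

	assert(s != NULL || len == 0);
	while (i < len) {
		unsigned char c = s[i];
		unsigned int uc;
		int n, k;

		if (c < 0x80) {
			n = 1; uc = c;
		} else if ((c & 0xE0) == 0xC0) {
			n = 2; uc = c & 0x1F;
		} else if ((c & 0xF0) == 0xE0) {
			n = 3; uc = c & 0x0F;
		} else if ((c & 0xF8) == 0xF0) {
			n = 4; uc = c & 0x07;
		} else {
			return 0;	/* stray continuation or bad lead byte */
		}
		if (len - i < (size_t)n)
			return 0;	/* truncated sequence */
		for (k = 1; k < n; k++) {
			if ((s[i + k] & 0xC0) != 0x80)
				return 0;	/* not a continuation byte */
			uc = (uc << 6) | (s[i + k] & 0x3F);
		}
		if (uc < min[n])
			return 0;	/* not the shortest encoding */
		if (uc == 0 || uc > 0x10FFFF)
			return 0;	/* outside 0x1..0x10FFFF */
		if (uc >= 0xD800 && uc <= 0xDFFF)
			return 0;	/* surrogate */
		i += n;
	}
	return 1;
}
```

Note that under the first rule an embedded NUL byte is rejected, since the
value 0 is outside the range 0x1..0x10FFFF.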

The supporting functions work on NUL-terminated strings (utf8 prefix) and
on length-limited strings (utf8n prefix).

Signed-off-by: Olaf Weber <olaf@sgi.com>

---
[v2: the trie is now separated into utf8norm.ko;
     utf8version is now a function and exported;
     introduced CONFIG_XFS_UTF8;
     removed trie generator due to vger size constraint.  --bpm]
[v3: replaced utf8version with utf8version_is_supported;
     moved utf8norm.ko to lib/ --bpm]
---
 include/linux/utf8norm.h | 116 +++++++++
 lib/Makefile             |   3 +
 lib/utf8norm.c           | 657 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 776 insertions(+)
 create mode 100644 include/linux/utf8norm.h
 create mode 100644 lib/utf8norm.c

diff --git a/include/linux/utf8norm.h b/include/linux/utf8norm.h
new file mode 100644
index 0000000..82f86c4
--- /dev/null
+++ b/include/linux/utf8norm.h
@@ -0,0 +1,116 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#ifndef UTF8NORM_H
+#define UTF8NORM_H
+
+#include <linux/types.h>
+#include <linux/export.h>
+#include <linux/string.h>
+#include <linux/module.h>
+
+/* An opaque type used to determine the normalization in use. */
+typedef const struct utf8data *utf8data_t;
+
+/* Encoding a unicode version number as a single unsigned int. */
+#define UNICODE_MAJ_SHIFT		(16)
+#define UNICODE_MIN_SHIFT		(8)
+
+#define UNICODE_AGE(MAJ,MIN,REV)			\
+	(((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) |	\
+	 ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) |	\
+	 ((unsigned int)(REV)))
+
+/* Returns 1 if the unicode version is supported by the data tables, else 0. */
+extern int utf8version_is_supported(unsigned int);
+
+/*
+ * Look for the correct utf8data_t for a unicode version.
+ * Returns NULL if the version requested is too new.
+ *
+ * Two normalization forms are supported: nfkdi and nfkdicf.
+ *
+ * nfkdi:
+ *  - Apply unicode normalization form NFKD.
+ *  - Remove any Default_Ignorable_Code_Point.
+ *
+ * nfkdicf:
+ *  - Apply unicode normalization form NFKD.
+ *  - Remove any Default_Ignorable_Code_Point.
+ *  - Apply a full casefold (C + F).
+ */
+extern utf8data_t utf8nfkdi(unsigned int);
+extern utf8data_t utf8nfkdicf(unsigned int);
+
+/*
+ * Determine the maximum age of any unicode character in the string.
+ * Returns 0 if only unassigned code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern int utf8agemax(utf8data_t, const char *);
+extern int utf8nagemax(utf8data_t, const char *, size_t);
+
+/*
+ * Determine the minimum age of any unicode character in the string.
+ * Returns 0 if any unassigned code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern int utf8agemin(utf8data_t, const char *);
+extern int utf8nagemin(utf8data_t, const char *, size_t);
+
+/*
+ * Determine the length of the normalized form of the string,
+ * excluding any terminating NUL byte.
+ * Returns 0 if only ignorable code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern ssize_t utf8len(utf8data_t, const char *);
+extern ssize_t utf8nlen(utf8data_t, const char *, size_t);
+
+/*
+ * Cursor structure used by the normalizer.
+ */
+struct utf8cursor {
+	utf8data_t	data;
+	const char	*s;
+	const char	*p;
+	const char	*ss;
+	const char	*sp;
+	unsigned int	len;
+	unsigned int	slen;
+	short int	ccc;
+	short int	nccc;
+};
+
+/*
+ * Initialize a utf8cursor to normalize a string.
+ * Returns 0 on success.
+ * Returns -1 on failure.
+ */
+extern int utf8cursor(struct utf8cursor *, utf8data_t, const char *);
+extern int utf8ncursor(struct utf8cursor *, utf8data_t, const char *, size_t);
+
+/*
+ * Get the next byte in the normalization.
+ * Returns a value > 0 && < 256 on success.
+ * Returns 0 when the end of the normalization is reached.
+ * Returns -1 if the string being normalized is not valid UTF-8.
+ */
+extern int utf8byte(struct utf8cursor *);
+
+#endif /* UTF8NORM_H */
diff --git a/lib/Makefile b/lib/Makefile
index b0b0d57..9e15e2b 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -197,6 +197,9 @@ clean-files	+= oid_registry_data.c
 
 obj-$(CONFIG_UCS2_STRING) += ucs2_string.o
 
+obj-$(CONFIG_UTF8_NORMALIZATION) += utf8norm.o
+
+$(obj)/utf8norm.o: $(obj)/utf8data.h
 $(obj)/utf8data.h: $(src)/ucd/*.txt $(objtree)/scripts/mkutf8data FORCE
 	$(call cmd,mkutf8data)
 quiet_cmd_mkutf8data = MKUTF8DATA $@
diff --git a/lib/utf8norm.c b/lib/utf8norm.c
new file mode 100644
index 0000000..0fa97d1
--- /dev/null
+++ b/lib/utf8norm.c
@@ -0,0 +1,657 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#include <linux/utf8norm.h>
+
+struct utf8data {
+	unsigned int maxage;
+	unsigned int offset;
+};
+
+#define __INCLUDED_FROM_UTF8NORM_C__
+#include "utf8data.h"
+#undef __INCLUDED_FROM_UTF8NORM_C__
+
+int
+utf8version_is_supported(unsigned int sb_utf8version)
+{
+	int i = sizeof(utf8agetab)/sizeof(utf8agetab[0]) - 1;
+
+	while (i >= 0 && utf8agetab[i] != 0) {
+		if (sb_utf8version == utf8agetab[i])
+			return 1;
+		i--;
+	}
+	return 0;
+}
+EXPORT_SYMBOL(utf8version_is_supported);
+
+/*
+ * UTF-8 valid ranges.
+ *
+ * The UTF-8 encoding spreads the bits of a 32bit word over several
+ * bytes. This table gives the ranges that can be held and how they'd
+ * be represented.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000000 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000000 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * There is an additional requirement on UTF-8, in that only the
+ * shortest representation of a 32bit value is to be used.  A decoder
+ * must not decode sequences that do not satisfy this requirement.
+ * Thus the allowed ranges have a lower bound.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000080 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000800 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00010000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00200000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x04000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * Actual unicode characters are limited to the range 0x0 - 0x10FFFF,
+ * 17 planes of 65536 values.  This limits the sequences actually seen
+ * even more, to just the following.
+ *
+ *          0 -     0x7F: 0                   - 0x7F
+ *       0x80 -    0x7FF: 0xC2 0x80           - 0xDF 0xBF
+ *      0x800 -   0xFFFF: 0xE0 0xA0 0x80      - 0xEF 0xBF 0xBF
+ *    0x10000 - 0x10FFFF: 0xF0 0x90 0x80 0x80 - 0xF4 0x8F 0xBF 0xBF
+ *
+ * Within those ranges the surrogates 0xD800 - 0xDFFF are not allowed.
+ *
+ * Note that the longest sequence seen with valid usage is 4 bytes,
+ * the same as a single UTF-32 character.  This makes the UTF-8
+ * representation of Unicode no larger than UTF-32.
+ *
+ * The shortest sequence requirement was introduced by:
+ *    Corrigendum #1: UTF-8 Shortest Form
+ * It can be found here:
+ *    http://www.unicode.org/versions/corrigendum1.html
+ *
+ */
+
+/*
+ * Return the number of bytes used by the current UTF-8 sequence.
+ * Assumes the input points to the first byte of a valid UTF-8
+ * sequence.
+ */
+static inline int
+utf8clen(const char *s)
+{
+	unsigned char c = *s;
+	return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0);
+}
+
+/*
+ * utf8trie_t
+ *
+ * A compact binary tree, used to decode UTF-8 characters.
+ *
+ * Internal nodes are one byte for the node itself, and up to three
+ * bytes for an offset into the tree.  The first byte contains the
+ * following information:
+ *  NEXTBYTE  - flag        - advance to next byte if set
+ *  BITNUM    - 3 bit field - the bit number to be tested
+ *  OFFLEN    - 2 bit field - number of bytes in the offset
+ * if offlen == 0 (non-branching node)
+ *  RIGHTPATH - 1 bit field - set if the following node is for the
+ *                            right-hand path (tested bit is set)
+ *  TRIENODE  - 1 bit field - set if the following node is an internal
+ *                            node, otherwise it is a leaf node
+ * if offlen != 0 (branching node)
+ *  LEFTNODE  - 1 bit field - set if the left-hand node is internal
+ *  RIGHTNODE - 1 bit field - set if the right-hand node is internal
+ *
+ * Due to the way utf8 works, there cannot be branching nodes with
+ * NEXTBYTE set, and moreover those nodes always have a righthand
+ * descendant.
+ */
+typedef const unsigned char utf8trie_t;
+#define BITNUM		0x07
+#define NEXTBYTE	0x08
+#define OFFLEN		0x30
+#define OFFLEN_SHIFT	4
+#define RIGHTPATH	0x40
+#define TRIENODE	0x80
+#define RIGHTNODE	0x40
+#define LEFTNODE	0x80
+
+/*
+ * utf8leaf_t
+ *
+ * The leaves of the trie are embedded in the trie, and so the same
+ * underlying datatype: unsigned char.
+ *
+ * leaf[0]: The unicode version, stored as a generation number that is
+ *          an index into utf8agetab[].  With this we can filter code
+ *          points based on the unicode version in which they were
+ *          defined.  The CCC of a non-defined code point is 0.
+ * leaf[1]: Canonical Combining Class. During normalization, we need
+ *          to do a stable sort into ascending order of all characters
+ *          with a non-zero CCC that occur between two characters with
+ *          a CCC of 0, or at the beginning or end of a string.
+ *          The unicode standard guarantees that all CCC values are
+ *          between 0 and 254 inclusive, which leaves 255 available as
+ *          a special value.
+ *          Code points with CCC 0 are known as stoppers.
+ * leaf[2]: Decomposition. If leaf[1] == 255, then leaf[2] is the
+ *          start of a NUL-terminated string that is the decomposition
+ *          of the character.
+ *          The CCC of a decomposable character is the same as the CCC
+ *          of the first character of its decomposition.
+ *          Some characters decompose as the empty string: these are
+ *          characters with the Default_Ignorable_Code_Point property.
+ *          These do affect normalization, as they all have CCC 0.
+ *
+ * The decompositions in the trie have been fully expanded.
+ *
+ * Casefolding, if applicable, is also done using decompositions.
+ *
+ * The trie is constructed in such a way that leaves exist for all
+ * UTF-8 sequences that match the criteria from the "UTF-8 valid
+ * ranges" comment above, and only for those sequences.  Therefore a
+ * lookup in the trie can be used to validate the UTF-8 input.
+ */
+typedef const unsigned char utf8leaf_t;
+
+#define LEAF_GEN(LEAF)	((LEAF)[0])
+#define LEAF_CCC(LEAF)	((LEAF)[1])
+#define LEAF_STR(LEAF)	((const char*)((LEAF) + 2))
+
+#define MINCCC		(0)
+#define MAXCCC		(254)
+#define STOPPER		(0)
+#define	DECOMPOSE	(255)
+
+/*
+ * Use trie to scan s, touching at most len bytes.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * A non-NULL return guarantees that the UTF-8 sequence starting at s
+ * is well-formed and corresponds to a known unicode code point.  The
+ * shorthand for this will be "is valid UTF-8 unicode".
+ */
+static utf8leaf_t *
+utf8nlookup(utf8data_t data, const char *s, size_t len)
+{
+	utf8trie_t	*trie = utf8data + data->offset;
+	int		offlen;
+	int		offset;
+	int		mask;
+	int		node;
+
+	if (!data)
+		return NULL;
+	if (len == 0)
+		return NULL;
+	node = 1;
+	while (node) {
+		offlen = (*trie & OFFLEN) >> OFFLEN_SHIFT;
+		if (*trie & NEXTBYTE) {
+			if (--len == 0)
+				return NULL;
+			s++;
+		}
+		mask = 1 << (*trie & BITNUM);
+		if (*s & mask) {
+			/* Right leg */
+			if (offlen) {
+				/* Right node at offset of trie */
+				node = (*trie & RIGHTNODE);
+				offset = trie[offlen];
+				while (--offlen) {
+					offset <<= 8;
+					offset |= trie[offlen];
+				}
+				trie += offset;
+			} else if (*trie & RIGHTPATH) {
+				/* Right node after this node */
+				node = (*trie & TRIENODE);
+				trie++;
+			} else {
+				/* No right node. */
+				node = 0;
+				trie = NULL;
+			}
+		} else {
+			/* Left leg */
+			if (offlen) {
+				/* Left node after this node. */
+				node = (*trie & LEFTNODE);
+				trie += offlen + 1;
+			} else if (*trie & RIGHTPATH) {
+				/* No left node. */
+				node = 0;
+				trie = NULL;
+			} else {
+				/* Left node after this node */
+				node = (*trie & TRIENODE);
+				trie++;
+			}
+		}
+	}
+	return trie;
+}
+
+/*
+ * Use trie to scan s.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * Forwards to utf8nlookup().
+ */
+static utf8leaf_t *
+utf8lookup(utf8data_t data, const char *s)
+{
+	return utf8nlookup(data, s, (size_t)-1);
+}
+
+/*
+ * Maximum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if only non-assigned code points are used.
+ */
+int
+utf8agemax(utf8data_t data, const char *s)
+{
+	utf8leaf_t	*leaf;
+	int		age = 0;
+	int		leaf_age;
+
+	if (!data)
+		return -1;
+	while (*s) {
+		if (!(leaf = utf8lookup(data, s)))
+			return -1;
+		leaf_age = utf8agetab[LEAF_GEN(leaf)];
+		if (leaf_age <= data->maxage && leaf_age > age)
+			age = leaf_age;
+		s += utf8clen(s);
+	}
+	return age;
+}
+EXPORT_SYMBOL(utf8agemax);
+
+/*
+ * Minimum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if non-assigned code points are used.
+ */
+int
+utf8agemin(utf8data_t data, const char *s)
+{
+	utf8leaf_t	*leaf;
+	int		age;
+	int		leaf_age;
+
+	if (!data)
+		return -1;
+	age = data->maxage;
+	while (*s) {
+		if (!(leaf = utf8lookup(data, s)))
+			return -1;
+		leaf_age = utf8agetab[LEAF_GEN(leaf)];
+		if (leaf_age <= data->maxage && leaf_age < age)
+			age = leaf_age;
+		s += utf8clen(s);
+	}
+	return age;
+}
+EXPORT_SYMBOL(utf8agemin);
+
+/*
+ * Maximum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemax(utf8data_t data, const char *s, size_t len)
+{
+	utf8leaf_t	*leaf;
+	int		age = 0;
+	int		leaf_age;
+
+	if (!data)
+		return -1;
+	while (len && *s) {
+		if (!(leaf = utf8nlookup(data, s, len)))
+			return -1;
+		leaf_age = utf8agetab[LEAF_GEN(leaf)];
+		if (leaf_age <= data->maxage && leaf_age > age)
+			age = leaf_age;
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return age;
+}
+EXPORT_SYMBOL(utf8nagemax);
+
+/*
+ * Minimum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemin(utf8data_t data, const char *s, size_t len)
+{
+	utf8leaf_t	*leaf;
+	int		leaf_age;
+	int		age;
+
+	if (!data)
+		return -1;
+	age = data->maxage;
+	while (len && *s) {
+		if (!(leaf = utf8nlookup(data, s, len)))
+			return -1;
+		leaf_age = utf8agetab[LEAF_GEN(leaf)];
+		if (leaf_age <= data->maxage && leaf_age < age)
+			age = leaf_age;
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return age;
+}
+EXPORT_SYMBOL(utf8nagemin);
+
+/*
+ * Length of the normalization of s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ *
+ * A string of Default_Ignorable_Code_Point has length 0.
+ */
+ssize_t
+utf8len(utf8data_t data, const char *s)
+{
+	utf8leaf_t	*leaf;
+	size_t		ret = 0;
+
+	if (!data)
+		return -1;
+	while (*s) {
+		if (!(leaf = utf8lookup(data, s)))
+			return -1;
+		if (utf8agetab[LEAF_GEN(leaf)] > data->maxage)
+			ret += utf8clen(s);
+		else if (LEAF_CCC(leaf) == DECOMPOSE)
+			ret += strlen(LEAF_STR(leaf));
+		else
+			ret += utf8clen(s);
+		s += utf8clen(s);
+	}
+	return ret;
+}
+EXPORT_SYMBOL(utf8len);
+
+/*
+ * Length of the normalization of s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+ssize_t
+utf8nlen(utf8data_t data, const char *s, size_t len)
+{
+	utf8leaf_t	*leaf;
+	size_t		ret = 0;
+
+	if (!data)
+		return -1;
+	while (len && *s) {
+		if (!(leaf = utf8nlookup(data, s, len)))
+			return -1;
+		if (utf8agetab[LEAF_GEN(leaf)] > data->maxage)
+			ret += utf8clen(s);
+		else if (LEAF_CCC(leaf) == DECOMPOSE)
+			ret += strlen(LEAF_STR(leaf));
+		else
+			ret += utf8clen(s);
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return ret;
+}
+EXPORT_SYMBOL(utf8nlen);
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ *   u8c    : pointer to cursor.
+ *   data   : utf8data_t to use for normalization.
+ *   s      : string.
+ *   len    : length of s.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int
+utf8ncursor(
+	struct utf8cursor *u8c,
+	utf8data_t	data,
+	const char	*s,
+	size_t		len)
+{
+	if (!data)
+		return -1;
+	if (!s)
+		return -1;
+	u8c->data = data;
+	u8c->s = s;
+	u8c->p = NULL;
+	u8c->ss = NULL;
+	u8c->sp = NULL;
+	u8c->len = len;
+	u8c->slen = 0;
+	u8c->ccc = STOPPER;
+	u8c->nccc = STOPPER;
+	/* Check we didn't clobber the maximum length. */
+	if (u8c->len != len)
+		return -1;
+	/* The first byte of s may not be an utf8 continuation. */
+	if (len > 0 && (*s & 0xC0) == 0x80)
+		return -1;
+	return 0;
+}
+EXPORT_SYMBOL(utf8ncursor);
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ *   u8c    : pointer to cursor.
+ *   data   : utf8data_t to use for normalization.
+ *   s      : NUL-terminated string.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int
+utf8cursor(
+	struct utf8cursor *u8c,
+	utf8data_t	data,
+	const char	*s)
+{
+	return utf8ncursor(u8c, data, s, (unsigned int)-1);
+}
+EXPORT_SYMBOL(utf8cursor);
+
+/*
+ * Get one byte from the normalized form of the string described by u8c.
+ *
+ * Returns the byte cast to an unsigned char on success, and -1 on failure.
+ *
+ * The cursor keeps track of the location in the string in u8c->s.
+ * When a character is decomposed, the current location is stored in
+ * u8c->p, and u8c->s is set to the start of the decomposition. Note
+ * that bytes from a decomposition do not count against u8c->len.
+ *
+ * Characters are emitted if they match the current CCC in u8c->ccc.
+ * Hitting end-of-string while u8c->ccc == STOPPER means we're done,
+ * and the function returns 0 in that case.
+ *
+ * Sorting by CCC is done by repeatedly scanning the string.  The
+ * values of u8c->s and u8c->p are stored in u8c->ss and u8c->sp at
+ * the start of the scan.  The first pass finds the lowest CCC to be
+ * emitted and stores it in u8c->nccc, the second pass emits the
+ * characters with this CCC and finds the next lowest CCC. This limits
+ * the number of passes to 1 + the number of different CCCs in the
+ * sequence being scanned.
+ *
+ * Therefore:
+ *  u8c->p  != NULL -> a decomposition is being scanned.
+ *  u8c->ss != NULL -> this is a repeating scan.
+ *  u8c->ccc == -1   -> this is the first scan of a repeating scan.
+ */
+int
+utf8byte(struct utf8cursor *u8c)
+{
+	utf8leaf_t *leaf;
+	int ccc;
+
+	for (;;) {
+		/* Check for the end of a decomposed character. */
+		if (u8c->p && *u8c->s == '\0') {
+			u8c->s = u8c->p;
+			u8c->p = NULL;
+		}
+
+		/* Check for end-of-string. */
+		if (!u8c->p && (u8c->len == 0 || *u8c->s == '\0')) {
+			/* There is no next byte. */
+			if (u8c->ccc == STOPPER)
+				return 0;
+			/* End-of-string during a scan counts as a stopper. */
+			ccc = STOPPER;
+			goto ccc_mismatch;
+		} else if ((*u8c->s & 0xC0) == 0x80) {
+			/* This is a continuation of the current character. */
+			if (!u8c->p)
+				u8c->len--;
+			return (unsigned char)*u8c->s++;
+		}
+
+		/* Look up the data for the current character. */
+		if (u8c->p)
+			leaf = utf8lookup(u8c->data, u8c->s);
+		else
+			leaf = utf8nlookup(u8c->data, u8c->s, u8c->len);
+
+		/* No leaf found implies that the input is a binary blob. */
+		if (!leaf)
+			return -1;
+
+		/* Characters that are too new have CCC 0. */
+		if (utf8agetab[LEAF_GEN(leaf)] > u8c->data->maxage) {
+			ccc = STOPPER;
+		} else if ((ccc = LEAF_CCC(leaf)) == DECOMPOSE) {
+			u8c->len -= utf8clen(u8c->s);
+			u8c->p = u8c->s + utf8clen(u8c->s);
+			u8c->s = LEAF_STR(leaf);
+			/* Empty decomposition implies CCC 0. */
+			if (*u8c->s == '\0') {
+				if (u8c->ccc == STOPPER)
+					continue;
+				ccc = STOPPER;
+				goto ccc_mismatch;
+			}
+			leaf = utf8lookup(u8c->data, u8c->s);
+			ccc = LEAF_CCC(leaf);
+		}
+
+		/*
+		 * If this is not a stopper, then see if it updates
+		 * the next canonical class to be emitted.
+		 */
+		if (ccc != STOPPER && u8c->ccc < ccc && ccc < u8c->nccc)
+			u8c->nccc = ccc;
+
+		/*
+		 * Return the current byte if this is the current
+		 * combining class.
+		 */
+		if (ccc == u8c->ccc) {
+			if (!u8c->p)
+				u8c->len--;
+			return (unsigned char)*u8c->s++;
+		}
+
+		/* Current combining class mismatch. */
+	ccc_mismatch:
+		if (u8c->nccc == STOPPER) {
+			/*
+			 * Scan forward for the first canonical class
+			 * to be emitted.  Save the position from
+			 * which to restart.
+			 */
+			u8c->ccc = MINCCC - 1;
+			u8c->nccc = ccc;
+			u8c->sp = u8c->p;
+			u8c->ss = u8c->s;
+			u8c->slen = u8c->len;
+			if (!u8c->p)
+				u8c->len -= utf8clen(u8c->s);
+			u8c->s += utf8clen(u8c->s);
+		} else if (ccc != STOPPER) {
+			/* Not a stopper, and not the ccc we're emitting. */
+			if (!u8c->p)
+				u8c->len -= utf8clen(u8c->s);
+			u8c->s += utf8clen(u8c->s);
+		} else if (u8c->nccc != MAXCCC + 1) {
+			/* At a stopper, restart for next ccc. */
+			u8c->ccc = u8c->nccc;
+			u8c->nccc = MAXCCC + 1;
+			u8c->s = u8c->ss;
+			u8c->p = u8c->sp;
+			u8c->len = u8c->slen;
+		} else {
+			/* All done, proceed from here. */
+			u8c->ccc = STOPPER;
+			u8c->nccc = STOPPER;
+			u8c->sp = NULL;
+			u8c->ss = NULL;
+			u8c->slen = 0;
+		}
+	}
+}
+EXPORT_SYMBOL(utf8byte);
+
+const struct utf8data *
+utf8nfkdi(unsigned int maxage)
+{
+	int i = sizeof(utf8nfkdidata)/sizeof(utf8nfkdidata[0]) - 1;
+
+	while (maxage < utf8nfkdidata[i].maxage)
+		i--;
+	if (maxage > utf8nfkdidata[i].maxage)
+		return NULL;
+	return &utf8nfkdidata[i];
+}
+EXPORT_SYMBOL(utf8nfkdi);
+
+const struct utf8data *
+utf8nfkdicf(unsigned int maxage)
+{
+	int i = sizeof(utf8nfkdicfdata)/sizeof(utf8nfkdicfdata[0]) - 1;
+
+	while (maxage < utf8nfkdicfdata[i].maxage)
+		i--;
+	if (maxage > utf8nfkdicfdata[i].maxage)
+		return NULL;
+	return &utf8nfkdicfdata[i];
+}
+EXPORT_SYMBOL(utf8nfkdicf);
+
+MODULE_AUTHOR("SGI");
+MODULE_DESCRIPTION("utf8 normalization");
+MODULE_LICENSE("GPL");
-- 
1.7.12.4


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 04/16] lib/utf8norm.c: reduce the size of utf8data[]
  2014-10-03 21:47 [RFC v3] Unicode/UTF-8 support for XFS Ben Myers
                   ` (2 preceding siblings ...)
  2014-10-03 21:54 ` [PATCH 03/16] lib: add supporting code " Ben Myers
@ 2014-10-03 21:54 ` Ben Myers
  2014-10-05 21:52     ` Dave Chinner
  2014-10-03 21:55 ` [PATCH 05/16] xfs: return the first match during case-insensitive lookup Ben Myers
                   ` (30 subsequent siblings)
  34 siblings, 1 reply; 63+ messages in thread
From: Ben Myers @ 2014-10-03 21:54 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: olaf, xfs

From: Olaf Weber <olaf@sgi.com>

Remove the Hangul decompositions from the utf8data trie, and do
algorithmic decomposition to calculate them on the fly. To store
the decomposition the caller of utf8lookup()/utf8nlookup() must
provide a 12-byte buffer, which is used to synthesize a leaf with
the decomposition. Trie size is reduced from 245kB to 90kB.
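
For readers unfamiliar with the algorithm, here is a minimal standalone
sketch of the full (L, V, T) decomposition at the code point level, using
the constants from Section 3.12 of the Unicode standard. It is userspace C
and not the code in this patch: the name hangul_decompose() is
hypothetical, and the patch's utf8hangul() works on UTF-8 bytes and
synthesizes a trie leaf instead.

```c
#include <assert.h>
#include <stddef.h>

enum {
	SBASE = 0xAC00, LBASE = 0x1100, VBASE = 0x1161, TBASE = 0x11A7,
	LCOUNT = 19, VCOUNT = 21, TCOUNT = 28,
	NCOUNT = VCOUNT * TCOUNT,	/* 588 */
	SCOUNT = LCOUNT * NCOUNT	/* 11172 */
};

/*
 * Hypothetical helper: fully decompose the Hangul syllable s into out[].
 * Returns the number of code points written (2 or 3), or 0 if s is not
 * a precomposed Hangul syllable.
 */
static int
hangul_decompose(unsigned int s, unsigned int out[3])
{
	unsigned int si, ti;

	assert(out != NULL);
	if (s < SBASE || s >= SBASE + SCOUNT)
		return 0;
	si = s - SBASE;
	ti = si % TCOUNT;
	out[0] = LBASE + si / NCOUNT;			/* LPart */
	out[1] = VBASE + (si % NCOUNT) / TCOUNT;	/* VPart */
	if (ti == 0)
		return 2;				/* LV syllable */
	out[2] = TBASE + ti;				/* TPart */
	return 3;					/* LVT syllable */
}
```

The 12-byte buffer size follows from the layout of the synthesized leaf:
2 bytes for the generation and CCC fields, up to three jamo of 3 UTF-8
bytes each, and a terminating NUL (2 + 9 + 1 = 12).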

This change also contains a number of robustness fixes to the
trie generator mkutf8data.c.

Signed-off-by: Olaf Weber <olaf@sgi.com>
---
 include/linux/utf8norm.h |   4 +
 lib/utf8norm.c           | 190 ++++++++++++++++++---
 scripts/mkutf8data.c     | 421 +++++++++++++++++++++++++++++++++++------------
 3 files changed, 492 insertions(+), 123 deletions(-)

diff --git a/include/linux/utf8norm.h b/include/linux/utf8norm.h
index 82f86c4..a6d8ce4 100644
--- a/include/linux/utf8norm.h
+++ b/include/linux/utf8norm.h
@@ -27,6 +27,9 @@
 /* An opaque type used to determine the normalization in use. */
 typedef const struct utf8data *utf8data_t;
 
+/* Needed in struct utf8cursor below. */
+#define UTF8HANGULLEAF	(12)
+
 /* Encoding a unicode version number as a single unsigned int. */
 #define UNICODE_MAJ_SHIFT		(16)
 #define UNICODE_MIN_SHIFT		(8)
@@ -95,6 +98,7 @@ struct utf8cursor {
 	unsigned int	slen;
 	short int	ccc;
 	short int	nccc;
+	unsigned char	hangul[UTF8HANGULLEAF];
 };
 
 /*
diff --git a/lib/utf8norm.c b/lib/utf8norm.c
index 0fa97d1..3ed9636 100644
--- a/lib/utf8norm.c
+++ b/lib/utf8norm.c
@@ -102,6 +102,38 @@ utf8clen(const char *s)
 }
 
 /*
+ * Decode a 3-byte UTF-8 sequence.
+ */
+static unsigned int
+utf8decode3(const char *str)
+{
+	unsigned int		uc;
+
+	uc = *str++ & 0x0F;
+	uc <<= 6;
+	uc |= *str++ & 0x3F;
+	uc <<= 6;
+	uc |= *str++ & 0x3F;
+
+	return uc;
+}
+
+/*
+ * Encode a 3-byte UTF-8 sequence.
+ */
+static int
+utf8encode3(char *str, unsigned int val)
+{
+	str[2] = (val & 0x3F) | 0x80;
+	val >>= 6;
+	str[1] = (val & 0x3F) | 0x80;
+	val >>= 6;
+	str[0] = val | 0xE0;
+
+	return 3;
+}
+
+/*
  * utf8trie_t
  *
  * A compact binary tree, used to decode UTF-8 characters.
@@ -162,7 +194,8 @@ typedef const unsigned char utf8trie_t;
  *          characters with the Default_Ignorable_Code_Point property.
  *          These do affect normalization, as they all have CCC 0.
  *
- * The decompositions in the trie have been fully expanded.
+ * The decompositions in the trie have been fully expanded, with the
+ * exception of Hangul syllables, which are decomposed algorithmically.
  *
  * Casefolding, if applicable, is also done using decompositions.
  *
@@ -182,6 +215,105 @@ typedef const unsigned char utf8leaf_t;
 #define STOPPER		(0)
 #define	DECOMPOSE	(255)
 
+/* Marker for hangul syllable decomposition. */
+#define HANGUL		((char)(255))
+/* Size of the synthesized leaf used for Hangul syllable decomposition. */
+#define UTF8HANGULLEAF	(12)
+
+/*
+ * Hangul decomposition (algorithm from Section 3.12 of Unicode 6.3.0)
+ *
+ * AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
+ * D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
+ *
+ * SBase = 0xAC00
+ * LBase = 0x1100
+ * VBase = 0x1161
+ * TBase = 0x11A7
+ * LCount = 19
+ * VCount = 21
+ * TCount = 28
+ * NCount = 588 (VCount * TCount)
+ * SCount = 11172 (LCount * NCount)
+ *
+ * Decomposition:
+ *   SIndex = s - SBase
+ *
+ * LV (Canonical/Full)
+ *   LIndex = SIndex / NCount
+ *   VIndex = (SIndex % NCount) / TCount
+ *   LPart = LBase + LIndex
+ *   VPart = VBase + VIndex
+ *
+ * LVT (Canonical)
+ *   LVIndex = (SIndex / TCount) * TCount
+ *   TIndex = (SIndex % TCount)
+ *   LVPart = SBase + LVIndex
+ *   TPart = TBase + TIndex
+ *
+ * LVT (Full)
+ *   LIndex = SIndex / NCount
+ *   VIndex = (SIndex % NCount) / TCount
+ *   TIndex = (SIndex % TCount)
+ *   LPart = LBase + LIndex
+ *   VPart = VBase + VIndex
+ *   if (TIndex == 0) {
+ *          d = <LPart, VPart>
+ *   } else {
+ *          TPart = TBase + TIndex
+ *          d = <LPart, VPart, TPart>
+ *   }
+ */
+
+/* Constants */
+#define SB	(0xAC00)
+#define LB	(0x1100)
+#define VB	(0x1161)
+#define TB	(0x11A7)
+#define LC	(19)
+#define VC	(21)
+#define TC	(28)
+#define NC	(VC * TC)
+#define SC	(LC * NC)
+
+/* Algorithmic decomposition of hangul syllable. */
+static utf8leaf_t *
+utf8hangul(const char *str, unsigned char *hangul)
+{
+	unsigned int	si;
+	unsigned int	li;
+	unsigned int	vi;
+	unsigned int	ti;
+	unsigned char	*h;
+
+	/* Calculate the SI, LI, VI, and TI values. */
+	si = utf8decode3(str) - SB;
+	li = si / NC;
+	vi = (si % NC) / TC;
+	ti = si % TC;
+
+	/* Fill in base of leaf. */
+	h = hangul;
+	LEAF_GEN(h) = 2;
+	LEAF_CCC(h) = DECOMPOSE;
+	h += 2;
+
+	/* Add LPart, a 3-byte UTF-8 sequence. */
+	h += utf8encode3((char*)h, li + LB);
+
+	/* Add VPart, a 3-byte UTF-8 sequence. */
+	h += utf8encode3((char*)h, vi + VB);
+
+	/* Add TPart if required, also a 3-byte UTF-8 sequence. */
+	if (ti)
+		h += utf8encode3((char*)h, ti + TB);
+
+	/* Terminate string. */
+	h[0] = '\0';
+
+	return hangul;
+}
+
 /*
  * Use trie to scan s, touching at most len bytes.
  * Returns the leaf if one exists, NULL otherwise.
@@ -191,7 +323,7 @@ typedef const unsigned char utf8leaf_t;
  * shorthand for this will be "is valid UTF-8 unicode".
  */
 static utf8leaf_t *
-utf8nlookup(utf8data_t data, const char *s, size_t len)
+utf8nlookup(utf8data_t data, unsigned char *hangul, const char *s, size_t len)
 {
 	utf8trie_t	*trie = utf8data + data->offset;
 	int		offlen;
@@ -229,8 +361,7 @@ utf8nlookup(utf8data_t data, const char *s, size_t len)
 				trie++;
 			} else {
 				/* No right node. */
-				node = 0;
-				trie = NULL;
+				return NULL;
 			}
 		} else {
 			/* Left leg */
@@ -240,8 +371,7 @@ utf8nlookup(utf8data_t data, const char *s, size_t len)
 				trie += offlen + 1;
 			} else if (*trie & RIGHTPATH) {
 				/* No left node. */
-				node = 0;
-				trie = NULL;
+				return NULL;
 			} else {
 				/* Left node after this node */
 				node = (*trie & TRIENODE);
@@ -249,6 +379,14 @@ utf8nlookup(utf8data_t data, const char *s, size_t len)
 			}
 		}
 	}
+	/*
+	 * Hangul decomposition is done algorithmically. These are the
+	 * codepoints >= 0xAC00 and <= 0xD7A3. Their UTF-8 encoding is
+	 * always 3 bytes long, so s has been advanced twice, and the
+	 * start of the sequence is at s-2.
+	 */
+	if (LEAF_CCC(trie) == DECOMPOSE && LEAF_STR(trie)[0] == HANGUL)
+		trie = utf8hangul(s - 2, hangul);
 	return trie;
 }
 
@@ -259,9 +397,9 @@ utf8nlookup(utf8data_t data, const char *s, size_t len)
  * Forwards to utf8nlookup().
  */
 static utf8leaf_t *
-utf8lookup(utf8data_t data, const char *s)
+utf8lookup(utf8data_t data, unsigned char *hangul, const char *s)
 {
-	return utf8nlookup(data, s, (size_t)-1);
+	return utf8nlookup(data, hangul, s, (size_t)-1);
 }
 
 /*
@@ -273,13 +411,15 @@ int
 utf8agemax(utf8data_t data, const char *s)
 {
 	utf8leaf_t	*leaf;
-	int		age = 0;
+	int		age;
 	int		leaf_age;
+	unsigned char	hangul[UTF8HANGULLEAF];
 
 	if (!data)
 		return -1;
+	age = 0;
 	while (*s) {
-		if (!(leaf = utf8lookup(data, s)))
+		if (!(leaf = utf8lookup(data, hangul, s)))
 			return -1;
 		leaf_age = utf8agetab[LEAF_GEN(leaf)];
 		if (leaf_age <= data->maxage && leaf_age > age)
@@ -301,12 +441,13 @@ utf8agemin(utf8data_t data, const char *s)
 	utf8leaf_t	*leaf;
 	int		age;
 	int		leaf_age;
+	unsigned char	hangul[UTF8HANGULLEAF];
 
 	if (!data)
 		return -1;
 	age = data->maxage;
 	while (*s) {
-		if (!(leaf = utf8lookup(data, s)))
+		if (!(leaf = utf8lookup(data, hangul, s)))
 			return -1;
 		leaf_age = utf8agetab[LEAF_GEN(leaf)];
 		if (leaf_age <= data->maxage && leaf_age < age)
@@ -325,13 +466,15 @@ int
 utf8nagemax(utf8data_t data, const char *s, size_t len)
 {
 	utf8leaf_t	*leaf;
-	int		age = 0;
+	int		age;
 	int		leaf_age;
+	unsigned char	hangul[UTF8HANGULLEAF];
 
 	if (!data)
 		return -1;
+	age = 0;
         while (len && *s) {
-		if (!(leaf = utf8nlookup(data, s, len)))
+		if (!(leaf = utf8nlookup(data, hangul, s, len)))
 			return -1;
 		leaf_age = utf8agetab[LEAF_GEN(leaf)];
 		if (leaf_age <= data->maxage && leaf_age > age)
@@ -353,12 +496,13 @@ utf8nagemin(utf8data_t data, const char *s, size_t len)
 	utf8leaf_t	*leaf;
 	int		leaf_age;
 	int		age;
+	unsigned char	hangul[UTF8HANGULLEAF];
 
 	if (!data)
 		return -1;
 	age = data->maxage;
 	while (len && *s) {
-		if (!(leaf = utf8nlookup(data, s, len)))
+		if (!(leaf = utf8nlookup(data, hangul, s, len)))
 			return -1;
 		leaf_age = utf8agetab[LEAF_GEN(leaf)];
 		if (leaf_age <= data->maxage && leaf_age < age)
@@ -381,11 +525,12 @@ utf8len(utf8data_t data, const char *s)
 {
 	utf8leaf_t	*leaf;
 	size_t		ret = 0;
+	unsigned char	hangul[UTF8HANGULLEAF];
 
 	if (!data)
 		return -1;
 	while (*s) {
-		if (!(leaf = utf8lookup(data, s)))
+		if (!(leaf = utf8lookup(data, hangul, s)))
 			return -1;
 		if (utf8agetab[LEAF_GEN(leaf)] > data->maxage)
 			ret += utf8clen(s);
@@ -408,11 +553,12 @@ utf8nlen(utf8data_t data, const char *s, size_t len)
 {
 	utf8leaf_t	*leaf;
 	size_t		ret = 0;
+	unsigned char	hangul[UTF8HANGULLEAF];
 
 	if (!data)
 		return -1;
 	while (len && *s) {
-		if (!(leaf = utf8nlookup(data, s, len)))
+		if (!(leaf = utf8nlookup(data, hangul, s, len)))
 			return -1;
 		if (utf8agetab[LEAF_GEN(leaf)] > data->maxage)
 			ret += utf8clen(s);
@@ -542,10 +688,12 @@ utf8byte(struct utf8cursor *u8c)
 		}
 
 		/* Look up the data for the current character. */
-		if (u8c->p)
-			leaf = utf8lookup(u8c->data, u8c->s);
-		else
-			leaf = utf8nlookup(u8c->data, u8c->s, u8c->len);
+		if (u8c->p) {
+			leaf = utf8lookup(u8c->data, u8c->hangul, u8c->s);
+		} else {
+			leaf = utf8nlookup(u8c->data, u8c->hangul,
+					   u8c->s, u8c->len);
+		}
 
 		/* No leaf found implies that the input is a binary blob. */
 		if (!leaf)
@@ -565,7 +713,7 @@ utf8byte(struct utf8cursor *u8c)
 				ccc = STOPPER;
 				goto ccc_mismatch;
 			}
-			leaf = utf8lookup(u8c->data, u8c->s);
+			leaf = utf8lookup(u8c->data, u8c->hangul, u8c->s);
 			ccc = LEAF_CCC(leaf);
 		}
 
diff --git a/scripts/mkutf8data.c b/scripts/mkutf8data.c
index 1d6ec02..7c7756f 100644
--- a/scripts/mkutf8data.c
+++ b/scripts/mkutf8data.c
@@ -179,11 +179,15 @@ typedef unsigned char utf8leaf_t;
 #define MINCCC		(0)
 #define MAXCCC		(254)
 #define STOPPER		(0)
-#define	DECOMPOSE	(255)
+#define DECOMPOSE	(255)
+#define HANGUL		((char)(255))
+
+#define UTF8HANGULLEAF	(12)
 
 struct tree;
-static utf8leaf_t *utf8nlookup(struct tree *, const char *, size_t);
-static utf8leaf_t *utf8lookup(struct tree *, const char *);
+static utf8leaf_t *utf8nlookup(struct tree *, unsigned char *,
+			       const char *, size_t);
+static utf8leaf_t *utf8lookup(struct tree *, unsigned char *, const char *);
 
 unsigned char *utf8data;
 size_t utf8data_size;
@@ -254,52 +258,52 @@ utf8trie_t *nfkdicf;
 #define UTF8_V_SHIFT    6
 
 static int
-utf8key(unsigned int key, char keyval[])
-{
-	int keylen;
-
-	if (key < 0x80) {
-		keyval[0] = key;
-		keylen = 1;
-	} else if (key < 0x800) {
-		keyval[1] = key & UTF8_V_MASK;
-		keyval[1] |= UTF8_N_BITS;
-		key >>= UTF8_V_SHIFT;
-		keyval[0] = key;
-		keyval[0] |= UTF8_2_BITS;
-		keylen = 2;
-	} else if (key < 0x10000) {
-		keyval[2] = key & UTF8_V_MASK;
-		keyval[2] |= UTF8_N_BITS;
-		key >>= UTF8_V_SHIFT;
-		keyval[1] = key & UTF8_V_MASK;
-		keyval[1] |= UTF8_N_BITS;
-		key >>= UTF8_V_SHIFT;
-		keyval[0] = key;
-		keyval[0] |= UTF8_3_BITS;
-		keylen = 3;
-	} else if (key < 0x110000) {
-		keyval[3] = key & UTF8_V_MASK;
-		keyval[3] |= UTF8_N_BITS;
-		key >>= UTF8_V_SHIFT;
-		keyval[2] = key & UTF8_V_MASK;
-		keyval[2] |= UTF8_N_BITS;
-		key >>= UTF8_V_SHIFT;
-		keyval[1] = key & UTF8_V_MASK;
-		keyval[1] |= UTF8_N_BITS;
-		key >>= UTF8_V_SHIFT;
-		keyval[0] = key;
-		keyval[0] |= UTF8_4_BITS;
-		keylen = 4;
+utf8encode(char *str, unsigned int val)
+{
+	int len;
+
+	if (val < 0x80) {
+		str[0] = val;
+		len = 1;
+	} else if (val < 0x800) {
+		str[1] = val & UTF8_V_MASK;
+		str[1] |= UTF8_N_BITS;
+		val >>= UTF8_V_SHIFT;
+		str[0] = val;
+		str[0] |= UTF8_2_BITS;
+		len = 2;
+	} else if (val < 0x10000) {
+		str[2] = val & UTF8_V_MASK;
+		str[2] |= UTF8_N_BITS;
+		val >>= UTF8_V_SHIFT;
+		str[1] = val & UTF8_V_MASK;
+		str[1] |= UTF8_N_BITS;
+		val >>= UTF8_V_SHIFT;
+		str[0] = val;
+		str[0] |= UTF8_3_BITS;
+		len = 3;
+	} else if (val < 0x110000) {
+		str[3] = val & UTF8_V_MASK;
+		str[3] |= UTF8_N_BITS;
+		val >>= UTF8_V_SHIFT;
+		str[2] = val & UTF8_V_MASK;
+		str[2] |= UTF8_N_BITS;
+		val >>= UTF8_V_SHIFT;
+		str[1] = val & UTF8_V_MASK;
+		str[1] |= UTF8_N_BITS;
+		val >>= UTF8_V_SHIFT;
+		str[0] = val;
+		str[0] |= UTF8_4_BITS;
+		len = 4;
 	} else {
-		printf("%#x: illegal key\n", key);
-		keylen = 0;
+		printf("%#x: illegal val\n", val);
+		len = 0;
 	}
-	return keylen;
+	return len;
 }
 
 static unsigned int
-utf8code(const char *str)
+utf8decode(const char *str)
 {
 	const unsigned char *s = (const unsigned char*)str;
 	unsigned int unichar = 0;
@@ -334,6 +338,8 @@ utf32valid(unsigned int unichar)
 	return unichar < 0x110000;
 }
 
+#define HANGUL_SYLLABLE(U)	((U) >= 0xAC00 && (U) <= 0xD7A3)
+
 #define NODE 1
 #define LEAF 0
 
@@ -937,7 +943,7 @@ done:
 
 /*
  * Compute the index of each node and leaf, which is the offset in the
- * emitted trie.  These value must be pre-computed because relative
+ * emitted trie.  These values must be pre-computed because relative
  * offsets between nodes are used to navigate the tree.
  */
 static int
@@ -958,7 +964,7 @@ index_nodes(struct tree *tree, int index)
 	count = 0;
 
 	if (verbose > 0)
-		printf("Indexing %s_%x: %d", tree->type, tree->maxage, index);
+		printf("Indexing %s_%x: %d\n", tree->type, tree->maxage, index);
 	if (tree->childnode == LEAF) {
 		index += tree->leaf_size(tree->root);
 		goto done;
@@ -1022,6 +1028,26 @@ done:
 }
 
 /*
+ * Mark the nodes in a subtree, helper for size_nodes().
+ */
+static int
+mark_subtree(struct node *node)
+{
+	int changed;
+
+	if (!node || node->mark)
+		return 0;
+	node->mark = 1;
+	node->index = node->parent->index;
+	changed = 1;
+	if (node->leftnode == NODE)
+		changed += mark_subtree(node->left);
+	if (node->rightnode == NODE)
+		changed += mark_subtree(node->right);
+	return changed;
+}
+
+/*
  * Compute the size of nodes and leaves. We start by assuming that
  * each node needs to store a three-byte offset. The indexes of the
  * nodes are calculated based on that, and then this function is
@@ -1040,6 +1066,7 @@ size_nodes(struct tree *tree)
 	unsigned int bitmask;
 	unsigned int pathbits;
 	unsigned int pathmask;
+	unsigned int nbit;
 	int changed;
 	int offset;
 	int size;
@@ -1050,7 +1077,7 @@ size_nodes(struct tree *tree)
 	size = 0;
 
 	if (verbose > 0)
-		printf("Sizing %s_%x", tree->type, tree->maxage);
+		printf("Sizing %s_%x\n", tree->type, tree->maxage);
 	if (tree->childnode == LEAF)
 		goto done;
 
@@ -1067,22 +1094,40 @@ size_nodes(struct tree *tree)
 			size = 1;
 		} else {
 			if (node->rightnode == NODE) {
+				/*
+				 * If the right node is not marked,
+				 * look for a corresponding node in
+				 * the next tree.  Such a node need
+				 * not exist.
+				 */
 				right = node->right;
 				next = tree->next;
 				while (!right->mark) {
 					assert(next);
 					n = next->root;
 					while (n->bitnum != node->bitnum) {
-						if (pathbits & (1<<n->bitnum))
+						nbit = 1 << n->bitnum;
+						if (!(pathmask & nbit))
+							break;
+						if (pathbits & nbit) {
+							if (n->rightnode==LEAF)
+								break;
 							n = n->right;
-						else
+						} else {
+							if (n->leftnode==LEAF)
+								break;
 							n = n->left;
+						}
 					}
+					if (n->bitnum != node->bitnum)
+						break;
 					n = n->right;
-					assert(right->bitnum == n->bitnum);
 					right = n;
 					next = next->next;
 				}
+				/* Make sure the right node is marked. */
+				if (!right->mark)
+					changed += mark_subtree(right);
 				offset = right->index - node->index;
 			} else {
 				offset = *tree->leaf_index(tree, node->right);
@@ -1158,8 +1203,15 @@ emit(struct tree *tree, unsigned char *data)
 	int offset;
 	int index;
 	int indent;
+	int size;
+	int bytes;
+	int leaves;
+	int nodes[4];
 	unsigned char byte;
 
+	nodes[0] = nodes[1] = nodes[2] = nodes[3] = 0;
+	leaves = 0;
+	bytes = 0;
 	index = tree->index;
 	data += index;
 	indent = 1;
@@ -1168,7 +1220,10 @@ emit(struct tree *tree, unsigned char *data)
 	if (tree->childnode == LEAF) {
 		assert(tree->root);
 		tree->leaf_emit(tree->root, data);
-		return;
+		size = tree->leaf_size(tree->root);
+		index += size;
+		leaves++;
+		goto done;
 	}
 
 	assert(tree->childnode == NODE);
@@ -1195,6 +1250,7 @@ emit(struct tree *tree, unsigned char *data)
 				offlen = 2;
 			else
 				offlen = 3;
+			nodes[offlen]++;
 			offset = node->offset;
 			byte |= offlen << OFFLEN_SHIFT;
 			*data++ = byte;
@@ -1207,12 +1263,14 @@ emit(struct tree *tree, unsigned char *data)
 		} else if (node->left) {
 			if (node->leftnode == NODE)
 				byte |= TRIENODE;
+			nodes[0]++;
 			*data++ = byte;
 			index++;
 		} else if (node->right) {
 			byte |= RIGHTNODE;
 			if (node->rightnode == NODE)
 				byte |= TRIENODE;
+			nodes[0]++;
 			*data++ = byte;
 			index++;
 		} else {
@@ -1227,7 +1285,10 @@ skip:
 					assert(node->left);
 					data = tree->leaf_emit(node->left,
 							       data);
-					index += tree->leaf_size(node->left);
+					size = tree->leaf_size(node->left);
+					index += size;
+					bytes += size;
+					leaves++;
 				} else if (node->left) {
 					assert(node->leftnode == NODE);
 					indent += 1;
@@ -1241,7 +1302,10 @@ skip:
 					assert(node->right);
 					data = tree->leaf_emit(node->right,
 							       data);
-					index += tree->leaf_size(node->right);
+					size = tree->leaf_size(node->right);
+					index += size;
+					bytes += size;
+					leaves++;
 				} else if (node->right) {
 					assert(node->rightnode==NODE);
 					indent += 1;
@@ -1255,6 +1319,15 @@ skip:
 			indent -= 1;
 		}
 	}
+done:
+	if (verbose > 0) {
+		printf("Emitted %d (%d) leaves",
+			leaves, bytes);
+		printf(" %d (%d+%d+%d+%d) nodes",
+			nodes[0] + nodes[1] + nodes[2] + nodes[3],
+			nodes[0], nodes[1], nodes[2], nodes[3]);
+		printf(" %d total\n", index - tree->index);
+	}
 }
 
 /* ------------------------------------------------------------------ */
@@ -1360,7 +1433,9 @@ nfkdi_print(void *l, int indent)
 
 	printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf,
 		leaf->code, leaf->ccc, leaf->gen);
-	if (leaf->utf8nfkdi)
+	if (leaf->utf8nfkdi && leaf->utf8nfkdi[0] == HANGUL)
+		printf(" nfkdi \"%s\"", "HANGUL SYLLABLE");
+	else if (leaf->utf8nfkdi)
 		printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi);
 	printf("\n");
 }
@@ -1374,6 +1449,8 @@ nfkdicf_print(void *l, int indent)
 		leaf->code, leaf->ccc, leaf->gen);
 	if (leaf->utf8nfkdicf)
 		printf(" nfkdicf \"%s\"", (const char*)leaf->utf8nfkdicf);
+	else if (leaf->utf8nfkdi && leaf->utf8nfkdi[0] == HANGUL)
+		printf(" nfkdi \"%s\"", "HANGUL SYLLABLE");
 	else if (leaf->utf8nfkdi)
 		printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi);
 	printf("\n");
@@ -1409,7 +1486,9 @@ nfkdi_size(void *l)
 	struct unicode_data *leaf = l;
 
 	int size = 2;
-	if (leaf->utf8nfkdi)
+	if (HANGUL_SYLLABLE(leaf->code))
+		size += 1;
+	else if (leaf->utf8nfkdi)
 		size += strlen(leaf->utf8nfkdi) + 1;
 	return size;
 }
@@ -1420,7 +1499,9 @@ nfkdicf_size(void *l)
 	struct unicode_data *leaf = l;
 
 	int size = 2;
-	if (leaf->utf8nfkdicf)
+	if (HANGUL_SYLLABLE(leaf->code))
+		size += 1;
+	else if (leaf->utf8nfkdicf)
 		size += strlen(leaf->utf8nfkdicf) + 1;
 	else if (leaf->utf8nfkdi)
 		size += strlen(leaf->utf8nfkdi) + 1;
@@ -1450,7 +1531,10 @@ nfkdi_emit(void *l, unsigned char *data)
 	unsigned char *s;
 
 	*data++ = leaf->gen;
-	if (leaf->utf8nfkdi) {
+	if (HANGUL_SYLLABLE(leaf->code)) {
+		*data++ = DECOMPOSE;
+		*data++ = HANGUL;
+	} else if (leaf->utf8nfkdi) {
 		*data++ = DECOMPOSE;
 		s = (unsigned char*)leaf->utf8nfkdi;
 		while ((*data++ = *s++) != 0)
@@ -1468,7 +1552,10 @@ nfkdicf_emit(void *l, unsigned char *data)
 	unsigned char *s;
 
 	*data++ = leaf->gen;
-	if (leaf->utf8nfkdicf) {
+	if (HANGUL_SYLLABLE(leaf->code)) {
+		*data++ = DECOMPOSE;
+		*data++ = HANGUL;
+	} else if (leaf->utf8nfkdicf) {
 		*data++ = DECOMPOSE;
 		s = (unsigned char*)leaf->utf8nfkdicf;
 		while ((*data++ = *s++) != 0)
@@ -1492,22 +1579,27 @@ utf8_create(struct unicode_data *data)
 	unsigned int *um;
 	int i;
 
+	if (data->utf8nfkdi) {
+		assert(data->utf8nfkdi[0] == HANGUL);
+		return;
+	}
+
 	u = utf;
 	um = data->utf32nfkdi;
 	if (um) {
 		for (i = 0; um[i]; i++)
-			u += utf8key(um[i], u);
+			u += utf8encode(u, um[i]);
 		*u = '\0';
-		data->utf8nfkdi = strdup((char*)utf);
+		data->utf8nfkdi = strdup(utf);
 	}
 	u = utf;
 	um = data->utf32nfkdicf;
 	if (um) {
 		for (i = 0; um[i]; i++)
-			u += utf8key(um[i], u);
+			u += utf8encode(u, um[i]);
 		*u = '\0';
-		if (!data->utf8nfkdi || strcmp(data->utf8nfkdi, (char*)utf))
-			data->utf8nfkdicf = strdup((char*)utf);
+		if (!data->utf8nfkdi || strcmp(data->utf8nfkdi, utf))
+			data->utf8nfkdicf = strdup(utf);
 	}
 }
 
@@ -1627,7 +1719,7 @@ trees_populate(void)
 		for (unichar = 0; unichar != 0x110000; unichar++) {
 			if (unicode_data[unichar].gen < 0)
 				continue;
-			keylen = utf8key(unichar, keyval);
+			keylen = utf8encode(keyval, unichar);
 			data = corrections_lookup(&unicode_data[unichar]);
 			if (data->correction <= trees[i].maxage)
 				data = &unicode_data[unichar];
@@ -1682,6 +1774,7 @@ verify(struct tree *tree)
 	utf8leaf_t	*leaf;
 	unsigned int	unichar;
 	char		key[4];
+	unsigned char	hangul[UTF8HANGULLEAF];
 	int		report;
 	int		nocf;
 
@@ -1694,8 +1787,8 @@ verify(struct tree *tree)
 		data = corrections_lookup(&unicode_data[unichar]);
 		if (data->correction <= tree->maxage)
 			data = &unicode_data[unichar];
-		utf8key(unichar, key);
-		leaf = utf8lookup(tree, key);
+		utf8encode(key, unichar);
+		leaf = utf8lookup(tree, hangul, key);
 		if (!leaf) {
 			if (data->gen != -1)
 				report++;
@@ -1709,7 +1802,10 @@ verify(struct tree *tree)
 			if (data->gen != LEAF_GEN(leaf))
 				report++;
 			if (LEAF_CCC(leaf) == DECOMPOSE) {
-				if (nocf) {
+				if (HANGUL_SYLLABLE(data->code)) {
+					if (data->utf8nfkdi[0] != HANGUL)
+						report++;
+				} else if (nocf) {
 					if (!data->utf8nfkdi) {
 						report++;
 					} else if (strcmp(data->utf8nfkdi,
@@ -1725,7 +1821,7 @@ verify(struct tree *tree)
 							   LEAF_STR(leaf)))
 							report++;
 					} else if (strcmp(data->utf8nfkdi,
-							  LEAF_STR(leaf))) {
+							LEAF_STR(leaf))) {
 						report++;
 					}
 				}
@@ -1735,13 +1831,13 @@ verify(struct tree *tree)
 		}
 		if (report) {
 			printf("%X code %X gen %d ccc %d"
-				" nfdki -> \"%s\"",
+				" nfkdi -> \"%s\"",
 				unichar, data->code, data->gen,
 				data->ccc,
 				data->utf8nfkdi);
 			if (leaf) {
-				printf(" age %d ccc %d"
-					" nfdki -> \"%s\"\n",
+				printf(" gen %d ccc %d"
+					" nfkdi -> \"%s\"",
 					LEAF_GEN(leaf),
 					LEAF_CCC(leaf),
 					LEAF_CCC(leaf) == DECOMPOSE ?
@@ -2330,21 +2426,21 @@ corrections_init(void)
  *
  * LVT (Canonical)
  *   LVIndex = (SIndex / TCount) * TCount
- *   TIndex = (Sindex % TCount
- *   LVPart = LBase + LVIndex
+ *   TIndex = (Sindex % TCount)
+ *   LVPart = SBase + LVIndex
  *   TPart = TBase + TIndex
  *
  * LVT (Full)
  *   LIndex = SIndex / NCount
  *   VIndex = (Sindex % NCount) / TCount
- *   TIndex = (Sindex % TCount
+ *   TIndex = (Sindex % TCount)
  *   LPart = LBase + LIndex
  *   VPart = VBase + VIndex
  *   if (TIndex == 0) {
  *          d = <LPart, VPart>
  *   } else {
  *          TPart = TBase + TIndex
- *          d = <LPart, TPart, VPart>
+ *          d = <LPart, VPart, TPart>
  *   }
  *
  */
@@ -2394,9 +2490,17 @@ hangul_decompose(void)
 		memcpy(um, mapping, i * sizeof(unsigned int));
 		unicode_data[unichar].utf32nfkdicf = um;
 
+		/*
+		 * Add a cookie as a reminder that the hangul syllable
+		 * decompositions must not be stored in the generated
+		 * trie.
+		 */
+		unicode_data[unichar].utf8nfkdi = malloc(2);
+		unicode_data[unichar].utf8nfkdi[0] = HANGUL;
+		unicode_data[unichar].utf8nfkdi[1] = '\0';
+
 		if (verbose > 1)
 			print_utf32nfkdi(unichar);
-
 		count++;
 	}
 	if (verbose > 0)
@@ -2522,6 +2626,100 @@ int utf8ncursor(struct utf8cursor *, struct tree *, const char *, size_t);
 int utf8byte(struct utf8cursor *);
 
 /*
+ * Hangul decomposition (algorithm from Section 3.12 of Unicode 6.3.0)
+ *
+ * AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
+ * D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
+ *
+ * SBase = 0xAC00
+ * LBase = 0x1100
+ * VBase = 0x1161
+ * TBase = 0x11A7
+ * LCount = 19
+ * VCount = 21
+ * TCount = 28
+ * NCount = 588 (VCount * TCount)
+ * SCount = 11172 (LCount * NCount)
+ *
+ * Decomposition:
+ *   SIndex = s - SBase
+ *
+ * LV (Canonical/Full)
+ *   LIndex = SIndex / NCount
+ *   VIndex = (Sindex % NCount) / TCount
+ *   LPart = LBase + LIndex
+ *   VPart = VBase + VIndex
+ *
+ * LVT (Canonical)
+ *   LVIndex = (SIndex / TCount) * TCount
+ *   TIndex = (Sindex % TCount)
+ *   LVPart = SBase + LVIndex
+ *   TPart = TBase + TIndex
+ *
+ * LVT (Full)
+ *   LIndex = SIndex / NCount
+ *   VIndex = (Sindex % NCount) / TCount
+ *   TIndex = (Sindex % TCount)
+ *   LPart = LBase + LIndex
+ *   VPart = VBase + VIndex
+ *   if (TIndex == 0) {
+ *          d = <LPart, VPart>
+ *   } else {
+ *          TPart = TBase + TIndex
+ *          d = <LPart, VPart, TPart>
+ *   }
+ */
+
+/* Constants */
+#define SB	(0xAC00)
+#define LB	(0x1100)
+#define VB	(0x1161)
+#define TB	(0x11A7)
+#define LC	(19)
+#define VC	(21)
+#define TC	(28)
+#define NC	(VC * TC)
+#define SC	(LC * NC)
+
+/* Algorithmic decomposition of hangul syllable. */
+static utf8leaf_t *
+utf8hangul(const char *str, unsigned char *hangul)
+{
+	unsigned int	si;
+	unsigned int	li;
+	unsigned int	vi;
+	unsigned int	ti;
+	unsigned char	*h;
+
+	/* Calculate the SI, LI, VI, and TI values. */
+	si = utf8decode(str) - SB;
+	li = si / NC;
+	vi = (si % NC) / TC;
+	ti = si % TC;
+
+	/* Fill in base of leaf. */
+	h = hangul;
+	LEAF_GEN(h) = 2;
+	LEAF_CCC(h) = DECOMPOSE;
+	h += 2;
+
+	/* Add LPart, a 3-byte UTF-8 sequence. */
+	h += utf8encode((char *)h, li + LB);
+
+	/* Add VPart, a 3-byte UTF-8 sequence. */
+	h += utf8encode((char *)h, vi + VB);
+
+	/* Add TPart if required, also a 3-byte UTF-8 sequence. */
+	if (ti)
+		h += utf8encode((char *)h, ti + TB);
+
+	/* Terminate string. */
+	h[0] = '\0';
+
+	return hangul;
+}
+
+/*
  * Use trie to scan s, touching at most len bytes.
  * Returns the leaf if one exists, NULL otherwise.
  *
@@ -2530,7 +2728,7 @@ int utf8byte(struct utf8cursor *);
  * shorthand for this will be "is valid UTF-8 unicode".
  */
 static utf8leaf_t *
-utf8nlookup(struct tree *tree, const char *s, size_t len)
+utf8nlookup(struct tree *tree, unsigned char *hangul, const char *s, size_t len)
 {
 	utf8trie_t	*trie = utf8data + tree->index;
 	int		offlen;
@@ -2568,8 +2766,7 @@ utf8nlookup(struct tree *tree, const char *s, size_t len)
 				trie++;
 			} else {
 				/* No right node. */
-				node = 0;
-				trie = NULL;
+				return NULL;
 			}
 		} else {
 			/* Left leg */
@@ -2579,8 +2776,7 @@ utf8nlookup(struct tree *tree, const char *s, size_t len)
 				trie += offlen + 1;
 			} else if (*trie & RIGHTPATH) {
 				/* No left node. */
-				node = 0;
-				trie = NULL;
+				return NULL;
 			} else {
 				/* Left node after this node */
 				node = (*trie & TRIENODE);
@@ -2588,6 +2784,14 @@ utf8nlookup(struct tree *tree, const char *s, size_t len)
 			}
 		}
 	}
+	/*
+	 * Hangul decomposition is done algorithmically. These are the
+	 * codepoints >= 0xAC00 and <= 0xD7A3. Their UTF-8 encoding is
+	 * always 3 bytes long, so s has been advanced twice, and the
+	 * start of the sequence is at s-2.
+	 */
+	if (LEAF_CCC(trie) == DECOMPOSE && LEAF_STR(trie)[0] == HANGUL)
+		trie = utf8hangul(s - 2, hangul);
 	return trie;
 }
 
@@ -2598,9 +2802,9 @@ utf8nlookup(struct tree *tree, const char *s, size_t len)
  * Forwards to trie_nlookup().
  */
 static utf8leaf_t *
-utf8lookup(struct tree *tree, const char *s)
+utf8lookup(struct tree *tree, unsigned char *hangul, const char *s)
 {
-	return utf8nlookup(tree, s, (size_t)-1);
+	return utf8nlookup(tree, hangul, s, (size_t)-1);
 }
 
 /*
@@ -2624,13 +2828,15 @@ int
 utf8agemax(struct tree *tree, const char *s)
 {
 	utf8leaf_t	*leaf;
-	int		age = 0;
+	int		age;
 	int		leaf_age;
+	unsigned char	hangul[UTF8HANGULLEAF];
 
 	if (!tree)
 		return -1;
+	age = 0;
 	while (*s) {
-		if (!(leaf = utf8lookup(tree, s)))
+		if (!(leaf = utf8lookup(tree, hangul, s)))
 			return -1;
 		leaf_age = ages[LEAF_GEN(leaf)];
 		if (leaf_age <= tree->maxage && leaf_age > age)
@@ -2649,13 +2855,15 @@ int
 utf8agemin(struct tree *tree, const char *s)
 {
 	utf8leaf_t	*leaf;
-	int		age = tree->maxage;
+	int		age;
 	int		leaf_age;
+	unsigned char	hangul[UTF8HANGULLEAF];
 
 	if (!tree)
 		return -1;
+	age = tree->maxage;
 	while (*s) {
-		if (!(leaf = utf8lookup(tree, s)))
+		if (!(leaf = utf8lookup(tree, hangul, s)))
 			return -1;
 		leaf_age = ages[LEAF_GEN(leaf)];
 		if (leaf_age <= tree->maxage && leaf_age < age)
@@ -2673,13 +2881,15 @@ int
 utf8nagemax(struct tree *tree, const char *s, size_t len)
 {
 	utf8leaf_t	*leaf;
-	int		age = 0;
+	int		age;
 	int		leaf_age;
+	unsigned char	hangul[UTF8HANGULLEAF];
 
 	if (!tree)
 		return -1;
+	age = 0;
         while (len && *s) {
-		if (!(leaf = utf8nlookup(tree, s, len)))
+		if (!(leaf = utf8nlookup(tree, hangul, s, len)))
 			return -1;
 		leaf_age = ages[LEAF_GEN(leaf)];
 		if (leaf_age <= tree->maxage && leaf_age > age)
@@ -2699,12 +2909,14 @@ utf8nagemin(struct tree *tree, const char *s, size_t len)
 {
 	utf8leaf_t	*leaf;
 	int		leaf_age;
-	int		age = tree->maxage;
+	int		age;
+	unsigned char	hangul[UTF8HANGULLEAF];
 
 	if (!tree)
 		return -1;
+	age = tree->maxage;
         while (len && *s) {
-		if (!(leaf = utf8nlookup(tree, s, len)))
+		if (!(leaf = utf8nlookup(tree, hangul, s, len)))
 			return -1;
 		leaf_age = ages[LEAF_GEN(leaf)];
 		if (leaf_age <= tree->maxage && leaf_age < age)
@@ -2726,11 +2938,12 @@ utf8len(struct tree *tree, const char *s)
 {
 	utf8leaf_t	*leaf;
 	size_t		ret = 0;
+	unsigned char	hangul[UTF8HANGULLEAF];
 
 	if (!tree)
 		return -1;
 	while (*s) {
-		if (!(leaf = utf8lookup(tree, s)))
+		if (!(leaf = utf8lookup(tree, hangul, s)))
 			return -1;
 		if (ages[LEAF_GEN(leaf)] > tree->maxage)
 			ret += utf8clen(s);
@@ -2752,11 +2965,12 @@ utf8nlen(struct tree *tree, const char *s, size_t len)
 {
 	utf8leaf_t	*leaf;
 	size_t		ret = 0;
+	unsigned char	hangul[UTF8HANGULLEAF];
 
 	if (!tree)
 		return -1;
 	while (len && *s) {
-		if (!(leaf = utf8nlookup(tree, s, len)))
+		if (!(leaf = utf8nlookup(tree, hangul, s, len)))
 			return -1;
 		if (ages[LEAF_GEN(leaf)] > tree->maxage)
 			ret += utf8clen(s);
@@ -2784,6 +2998,7 @@ struct utf8cursor {
 	short int	ccc;
 	short int	nccc;
 	unsigned int	unichar;
+	unsigned char	hangul[UTF8HANGULLEAF];
 };
 
 /*
@@ -2900,10 +3115,12 @@ utf8byte(struct utf8cursor *u8c)
 		}
 
 		/* Look up the data for the current character. */
-		if (u8c->p)
-			leaf = utf8lookup(u8c->tree, u8c->s);
-		else
-			leaf = utf8nlookup(u8c->tree, u8c->s, u8c->len);
+		if (u8c->p) {
+			leaf = utf8lookup(u8c->tree, u8c->hangul, u8c->s);
+		} else {
+			leaf = utf8nlookup(u8c->tree, u8c->hangul,
+					   u8c->s, u8c->len);
+		}
 
 		/* No leaf found implies that the input is a binary blob. */
 		if (!leaf)
@@ -2923,10 +3140,10 @@ utf8byte(struct utf8cursor *u8c)
 				ccc = STOPPER;
 				goto ccc_mismatch;
 			}
-			leaf = utf8lookup(u8c->tree, u8c->s);
+			leaf = utf8lookup(u8c->tree, u8c->hangul, u8c->s);
 			ccc = LEAF_CCC(leaf);
 		}
-		u8c->unichar = utf8code(u8c->s);
+		u8c->unichar = utf8decode(u8c->s);
 
 		/*
 		 * If this is not a stopper, then see if it updates
@@ -3055,7 +3272,7 @@ normalization_test(void)
 		t = buf2;
 		while (*s) {
 			unichar = strtoul(s, &s, 16);
-			t += utf8key(unichar, t);
+			t += utf8encode(t, unichar);
 		}
 		*t = '\0';
 
@@ -3068,13 +3285,13 @@ normalization_test(void)
 			if (data->utf8nfkdi && !*data->utf8nfkdi)
 				ignorables = 1;
 			else
-				t += utf8key(unichar, t);
+				t += utf8encode(t, unichar);
 		}
 		*t = '\0';
 
 		tests++;
 		if (normalize_line(nfkdi_tree) < 0) {
-			printf("\nline %s -> %s", buf0, buf1);
+			printf("Line %s -> %s", buf0, buf1);
 			if (ignorables)
 				printf(" (ignorables removed)");
 			printf(" failure\n");
-- 
1.7.12.4


* [PATCH 05/16] xfs: return the first match during case-insensitive lookup.
  2014-10-03 21:47 [RFC v3] Unicode/UTF-8 support for XFS Ben Myers
                   ` (3 preceding siblings ...)
  2014-10-03 21:54 ` [PATCH 04/16] lib/utf8norm.c: reduce the size of utf8data[] Ben Myers
@ 2014-10-03 21:55 ` Ben Myers
  2014-10-06 22:19   ` Dave Chinner
  2014-10-03 21:56 ` [PATCH 06/16] xfs: rename XFS_CMP_CASE to XFS_CMP_MATCH Ben Myers
                   ` (29 subsequent siblings)
  34 siblings, 1 reply; 63+ messages in thread
From: Ben Myers @ 2014-10-03 21:55 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: olaf, xfs

From: Olaf Weber <olaf@sgi.com>

Change the XFS case-insensitive lookup code to return the first match
found, even if it is not an exact match. Whether a filesystem uses
case-insensitive lookups is determined by a superblock bit set during
filesystem creation, so in normal use two files whose names match each
other cannot both be created, and the first match is the only match.

Signed-off-by: Olaf Weber <olaf@sgi.com>
---
 fs/xfs/libxfs/xfs_dir2_block.c | 17 +++------
 fs/xfs/libxfs/xfs_dir2_leaf.c  | 37 ++++----------------
 fs/xfs/libxfs/xfs_dir2_node.c  | 79 ++++++++++++++++--------------------------
 fs/xfs/libxfs/xfs_dir2_sf.c    |  8 ++---
 4 files changed, 45 insertions(+), 96 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_dir2_block.c b/fs/xfs/libxfs/xfs_dir2_block.c
index 9628cec..990bf0c 100644
--- a/fs/xfs/libxfs/xfs_dir2_block.c
+++ b/fs/xfs/libxfs/xfs_dir2_block.c
@@ -725,28 +725,21 @@ xfs_dir2_block_lookup_int(
 		dep = (xfs_dir2_data_entry_t *)
 			((char *)hdr + xfs_dir2_dataptr_to_off(args->geo, addr));
 		/*
-		 * Compare name and if it's an exact match, return the index
-		 * and buffer. If it's the first case-insensitive match, store
-		 * the index and buffer and continue looking for an exact match.
+		 * Compare name and if it's a match, return the
+		 * index and buffer.
 		 */
 		cmp = mp->m_dirnameops->compname(args, dep->name, dep->namelen);
-		if (cmp != XFS_CMP_DIFFERENT && cmp != args->cmpresult) {
+		if (cmp != XFS_CMP_DIFFERENT) {
 			args->cmpresult = cmp;
 			*bpp = bp;
 			*entno = mid;
-			if (cmp == XFS_CMP_EXACT)
-				return 0;
+			return 0;
 		}
 	} while (++mid < be32_to_cpu(btp->count) &&
 			be32_to_cpu(blp[mid].hashval) == hash);
 
 	ASSERT(args->op_flags & XFS_DA_OP_OKNOENT);
-	/*
-	 * Here, we can only be doing a lookup (not a rename or replace).
-	 * If a case-insensitive match was found earlier, return success.
-	 */
-	if (args->cmpresult == XFS_CMP_CASE)
-		return 0;
+	ASSERT(args->cmpresult == XFS_CMP_DIFFERENT);
 	/*
 	 * No match, release the buffer and return ENOENT.
 	 */
diff --git a/fs/xfs/libxfs/xfs_dir2_leaf.c b/fs/xfs/libxfs/xfs_dir2_leaf.c
index a19174e..3d572ee 100644
--- a/fs/xfs/libxfs/xfs_dir2_leaf.c
+++ b/fs/xfs/libxfs/xfs_dir2_leaf.c
@@ -1226,7 +1226,6 @@ xfs_dir2_leaf_lookup_int(
 	xfs_mount_t		*mp;		/* filesystem mount point */
 	xfs_dir2_db_t		newdb;		/* new data block number */
 	xfs_trans_t		*tp;		/* transaction pointer */
-	xfs_dir2_db_t		cidb = -1;	/* case match data block no. */
 	enum xfs_dacmp		cmp;		/* name compare result */
 	struct xfs_dir2_leaf_entry *ents;
 	struct xfs_dir3_icleaf_hdr leafhdr;
@@ -1290,46 +1289,22 @@ xfs_dir2_leaf_lookup_int(
 						be32_to_cpu(lep->address)));
 		/*
 		 * Compare name and if it's an exact match, return the index
-		 * and buffer. If it's the first case-insensitive match, store
-		 * the index and buffer and continue looking for an exact match.
+		 * and buffer
 		 */
 		cmp = mp->m_dirnameops->compname(args, dep->name, dep->namelen);
-		if (cmp != XFS_CMP_DIFFERENT && cmp != args->cmpresult) {
+		if (cmp != XFS_CMP_DIFFERENT) {
 			args->cmpresult = cmp;
 			*indexp = index;
-			/* case exact match: return the current buffer. */
-			if (cmp == XFS_CMP_EXACT) {
-				*dbpp = dbp;
-				return 0;
-			}
-			cidb = curdb;
+			*dbpp = dbp;
+			return 0;
 		}
 	}
 	ASSERT(args->op_flags & XFS_DA_OP_OKNOENT);
-	/*
-	 * Here, we can only be doing a lookup (not a rename or remove).
-	 * If a case-insensitive match was found earlier, re-read the
-	 * appropriate data block if required and return it.
-	 */
-	if (args->cmpresult == XFS_CMP_CASE) {
-		ASSERT(cidb != -1);
-		if (cidb != curdb) {
-			xfs_trans_brelse(tp, dbp);
-			error = xfs_dir3_data_read(tp, dp,
-					   xfs_dir2_db_to_da(args->geo, cidb),
-					   -1, &dbp);
-			if (error) {
-				xfs_trans_brelse(tp, lbp);
-				return error;
-			}
-		}
-		*dbpp = dbp;
-		return 0;
-	}
+	ASSERT(args->cmpresult == XFS_CMP_DIFFERENT);
+
 	/*
 	 * No match found, return -ENOENT.
 	 */
-	ASSERT(cidb == -1);
 	if (dbp)
 		xfs_trans_brelse(tp, dbp);
 	xfs_trans_brelse(tp, lbp);
diff --git a/fs/xfs/libxfs/xfs_dir2_node.c b/fs/xfs/libxfs/xfs_dir2_node.c
index 2ae6ac2..1778c40 100644
--- a/fs/xfs/libxfs/xfs_dir2_node.c
+++ b/fs/xfs/libxfs/xfs_dir2_node.c
@@ -679,6 +679,7 @@ xfs_dir2_leafn_lookup_for_entry(
 	xfs_dir2_data_entry_t	*dep;		/* data block entry */
 	xfs_inode_t		*dp;		/* incore directory inode */
 	int			error;		/* error return value */
+	int			di = -1;	/* data entry index */
 	int			index;		/* leaf entry index */
 	xfs_dir2_leaf_t		*leaf;		/* leaf structure */
 	xfs_dir2_leaf_entry_t	*lep;		/* leaf entry */
@@ -709,6 +710,7 @@ xfs_dir2_leafn_lookup_for_entry(
 	if (state->extravalid) {
 		curbp = state->extrablk.bp;
 		curdb = state->extrablk.blkno;
+		di = state->extrablk.index;
 	}
 	/*
 	 * Loop over leaf entries with the right hash value.
@@ -734,28 +736,20 @@ xfs_dir2_leafn_lookup_for_entry(
 		 */
 		if (newdb != curdb) {
 			/*
-			 * If we had a block before that we aren't saving
-			 * for a CI name, drop it
+			 * If we had a block, drop it
 			 */
-			if (curbp && (args->cmpresult == XFS_CMP_DIFFERENT ||
-						curdb != state->extrablk.blkno))
+			if (curbp) {
 				xfs_trans_brelse(tp, curbp);
+				di = -1;
+			}
 			/*
-			 * If needing the block that is saved with a CI match,
-			 * use it otherwise read in the new data block.
+			 * Read in the new data block.
 			 */
-			if (args->cmpresult != XFS_CMP_DIFFERENT &&
-					newdb == state->extrablk.blkno) {
-				ASSERT(state->extravalid);
-				curbp = state->extrablk.bp;
-			} else {
-				error = xfs_dir3_data_read(tp, dp,
-						xfs_dir2_db_to_da(args->geo,
-								  newdb),
+			error = xfs_dir3_data_read(tp, dp,
+					xfs_dir2_db_to_da(args->geo, newdb),
 						-1, &curbp);
-				if (error)
-					return error;
-			}
+			if (error)
+				return error;
 			xfs_dir3_data_check(dp, curbp);
 			curdb = newdb;
 		}
@@ -766,53 +760,40 @@ xfs_dir2_leafn_lookup_for_entry(
 			xfs_dir2_dataptr_to_off(args->geo,
 						be32_to_cpu(lep->address)));
 		/*
-		 * Compare the entry and if it's an exact match, return
-		 * EEXIST immediately. If it's the first case-insensitive
-		 * match, store the block & inode number and continue looking.
+		 * Compare the entry and if it's a match, return
+		 * EEXIST immediately.
 		 */
 		cmp = mp->m_dirnameops->compname(args, dep->name, dep->namelen);
-		if (cmp != XFS_CMP_DIFFERENT && cmp != args->cmpresult) {
-			/* If there is a CI match block, drop it */
-			if (args->cmpresult != XFS_CMP_DIFFERENT &&
-						curdb != state->extrablk.blkno)
-				xfs_trans_brelse(tp, state->extrablk.bp);
+		if (cmp != XFS_CMP_DIFFERENT) {
 			args->cmpresult = cmp;
 			args->inumber = be64_to_cpu(dep->inumber);
 			args->filetype = dp->d_ops->data_get_ftype(dep);
-			*indexp = index;
-			state->extravalid = 1;
-			state->extrablk.bp = curbp;
-			state->extrablk.blkno = curdb;
-			state->extrablk.index = (int)((char *)dep -
-							(char *)curbp->b_addr);
-			state->extrablk.magic = XFS_DIR2_DATA_MAGIC;
 			curbp->b_ops = &xfs_dir3_data_buf_ops;
 			xfs_trans_buf_set_type(tp, curbp, XFS_BLFT_DIR_DATA_BUF);
-			if (cmp == XFS_CMP_EXACT)
-				return -EEXIST;
+			di = (int)((char *)dep - (char *)curbp->b_addr);
+			error = -EEXIST;
+			goto out;
+
 		}
 	}
+	/* Didn't find a match */
+	error = -ENOENT;
 	ASSERT(index == leafhdr.count || (args->op_flags & XFS_DA_OP_OKNOENT));
+out:
 	if (curbp) {
-		if (args->cmpresult == XFS_CMP_DIFFERENT) {
-			/* Giving back last used data block. */
-			state->extravalid = 1;
-			state->extrablk.bp = curbp;
-			state->extrablk.index = -1;
-			state->extrablk.blkno = curdb;
-			state->extrablk.magic = XFS_DIR2_DATA_MAGIC;
-			curbp->b_ops = &xfs_dir3_data_buf_ops;
-			xfs_trans_buf_set_type(tp, curbp, XFS_BLFT_DIR_DATA_BUF);
-		} else {
-			/* If the curbp is not the CI match block, drop it */
-			if (state->extrablk.bp != curbp)
-				xfs_trans_brelse(tp, curbp);
-		}
+		/* Giving back last used data block. */
+		state->extravalid = 1;
+		state->extrablk.bp = curbp;
+		state->extrablk.index = di;
+		state->extrablk.blkno = curdb;
+		state->extrablk.magic = XFS_DIR2_DATA_MAGIC;
+		curbp->b_ops = &xfs_dir3_data_buf_ops;
+		xfs_trans_buf_set_type(tp, curbp, XFS_BLFT_DIR_DATA_BUF);
 	} else {
 		state->extravalid = 0;
 	}
 	*indexp = index;
-	return -ENOENT;
+	return error;
 }
 
 /*
diff --git a/fs/xfs/libxfs/xfs_dir2_sf.c b/fs/xfs/libxfs/xfs_dir2_sf.c
index 5079e05..e69fdb7 100644
--- a/fs/xfs/libxfs/xfs_dir2_sf.c
+++ b/fs/xfs/libxfs/xfs_dir2_sf.c
@@ -757,19 +757,19 @@ xfs_dir2_sf_lookup(
 	for (i = 0, sfep = xfs_dir2_sf_firstentry(sfp); i < sfp->count;
 	     i++, sfep = dp->d_ops->sf_nextentry(sfp, sfep)) {
 		/*
-		 * Compare name and if it's an exact match, return the inode
-		 * number. If it's the first case-insensitive match, store the
-		 * inode number and continue looking for an exact match.
+		 * Compare name and if it's a match, return the inode
+		 * number.
 		 */
 		cmp = dp->i_mount->m_dirnameops->compname(args, sfep->name,
 								sfep->namelen);
-		if (cmp != XFS_CMP_DIFFERENT && cmp != args->cmpresult) {
+		if (cmp != XFS_CMP_DIFFERENT) {
 			args->cmpresult = cmp;
 			args->inumber = dp->d_ops->sf_get_ino(sfp, sfep);
 			args->filetype = dp->d_ops->sf_get_ftype(sfep);
 			if (cmp == XFS_CMP_EXACT)
 				return -EEXIST;
 			ci_sfep = sfep;
+			break;
 		}
 	}
 	ASSERT(args->op_flags & XFS_DA_OP_OKNOENT);
-- 
1.7.12.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 06/16] xfs: rename XFS_CMP_CASE to XFS_CMP_MATCH
  2014-10-03 21:47 [RFC v3] Unicode/UTF-8 support for XFS Ben Myers
                   ` (4 preceding siblings ...)
  2014-10-03 21:55 ` [PATCH 05/16] xfs: return the first match during case-insensitive lookup Ben Myers
@ 2014-10-03 21:56 ` Ben Myers
  2014-10-03 21:58 ` [PATCH 07/16] xfs: add xfs_nameops.normhash Ben Myers
                   ` (28 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Ben Myers @ 2014-10-03 21:56 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: olaf, xfs

From: Olaf Weber <olaf@sgi.com>

Rename XFS_CMP_CASE to XFS_CMP_MATCH. With Unicode filenames and
normalization, different strings can match on criteria other than
case insensitivity.
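
For illustration only (not part of the patch): a minimal userspace sketch
of the three-way compare result, modeled on xfs_ascii_ci_compname() from
the diff below. The names name_cmp and ascii_ci_compare are hypothetical
and exist only in this sketch.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for enum xfs_dacmp after the rename. */
enum name_cmp {
	NAME_CMP_DIFFERENT,	/* names do not match at all */
	NAME_CMP_EXACT,		/* names are byte-for-byte identical */
	NAME_CMP_MATCH		/* names match, but not byte-for-byte */
};

/*
 * ASCII case-insensitive compare. Starts out assuming an exact match
 * and downgrades to MATCH on the first byte that differs only in case.
 */
enum name_cmp
ascii_ci_compare(const char *a, size_t alen, const char *b, size_t blen)
{
	enum name_cmp	result = NAME_CMP_EXACT;
	size_t		i;

	if (alen != blen)
		return NAME_CMP_DIFFERENT;

	for (i = 0; i < alen; i++) {
		if (a[i] == b[i])
			continue;
		if (tolower((unsigned char)a[i]) !=
		    tolower((unsigned char)b[i]))
			return NAME_CMP_DIFFERENT;
		result = NAME_CMP_MATCH;
	}
	return result;
}
```

A normalizing comparison would return the same MATCH value when two names
differ in byte sequence but normalize to the same string, which is why a
case-specific name no longer fits.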

Signed-off-by: Olaf Weber <olaf@sgi.com>
---
 fs/xfs/libxfs/xfs_da_btree.h  | 2 +-
 fs/xfs/libxfs/xfs_dir2.c      | 9 ++++++---
 fs/xfs/libxfs/xfs_dir2_node.c | 2 +-
 3 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_da_btree.h b/fs/xfs/libxfs/xfs_da_btree.h
index 6e153e3..9ebcc23 100644
--- a/fs/xfs/libxfs/xfs_da_btree.h
+++ b/fs/xfs/libxfs/xfs_da_btree.h
@@ -52,7 +52,7 @@ struct xfs_da_geometry {
 enum xfs_dacmp {
 	XFS_CMP_DIFFERENT,	/* names are completely different */
 	XFS_CMP_EXACT,		/* names are exactly the same */
-	XFS_CMP_CASE		/* names are same but differ in case */
+	XFS_CMP_MATCH		/* names are same but differ in encoding */
 };
 
 /*
diff --git a/fs/xfs/libxfs/xfs_dir2.c b/fs/xfs/libxfs/xfs_dir2.c
index 6cef221..32e769b 100644
--- a/fs/xfs/libxfs/xfs_dir2.c
+++ b/fs/xfs/libxfs/xfs_dir2.c
@@ -74,7 +74,7 @@ xfs_ascii_ci_compname(
 			continue;
 		if (tolower(args->name[i]) != tolower(name[i]))
 			return XFS_CMP_DIFFERENT;
-		result = XFS_CMP_CASE;
+		result = XFS_CMP_MATCH;
 	}
 
 	return result;
@@ -315,8 +315,11 @@ xfs_dir_cilookup_result(
 {
 	if (args->cmpresult == XFS_CMP_DIFFERENT)
 		return -ENOENT;
-	if (args->cmpresult != XFS_CMP_CASE ||
-					!(args->op_flags & XFS_DA_OP_CILOOKUP))
+	if (args->cmpresult == XFS_CMP_EXACT)
+		return -EEXIST;
+	ASSERT(args->cmpresult == XFS_CMP_MATCH);
+	/* Only dup the found name if XFS_DA_OP_CILOOKUP is set. */
+	if (!(args->op_flags & XFS_DA_OP_CILOOKUP))
 		return -EEXIST;
 
 	args->value = kmem_alloc(len, KM_NOFS | KM_MAYFAIL);
diff --git a/fs/xfs/libxfs/xfs_dir2_node.c b/fs/xfs/libxfs/xfs_dir2_node.c
index 1778c40..9d46e8d 100644
--- a/fs/xfs/libxfs/xfs_dir2_node.c
+++ b/fs/xfs/libxfs/xfs_dir2_node.c
@@ -2023,7 +2023,7 @@ xfs_dir2_node_lookup(
 	error = xfs_da3_node_lookup_int(state, &rval);
 	if (error)
 		rval = error;
-	else if (rval == -ENOENT && args->cmpresult == XFS_CMP_CASE) {
+	else if (rval == -ENOENT && args->cmpresult == XFS_CMP_MATCH) {
 		/* If a CI match, dup the actual name and return -EEXIST */
 		xfs_dir2_data_entry_t	*dep;
 
-- 
1.7.12.4

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 07/16] xfs: add xfs_nameops.normhash
  2014-10-03 21:47 [RFC v3] Unicode/UTF-8 support for XFS Ben Myers
                   ` (5 preceding siblings ...)
  2014-10-03 21:56 ` [PATCH 06/16] xfs: rename XFS_CMP_CASE to XFS_CMP_MATCH Ben Myers
@ 2014-10-03 21:58 ` Ben Myers
  2014-10-03 21:58 ` [PATCH 08/16] xfs: change interface of xfs_nameops.hashname Ben Myers
                   ` (27 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Ben Myers @ 2014-10-03 21:58 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: olaf, xfs

From: Olaf Weber <olaf@sgi.com>

Add a normhash callout to the xfs_nameops. This callout takes an xfs_da_args
structure as its argument, calculates a hash value over the name, and stores
it in the hashval field. In the process it may create a normalized form of
the name and assign that to the norm/normlen fields in the xfs_da_args
structure. Callers free the normalized name, if any, once the operation
completes.
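
For illustration only (not part of the patch): a userspace sketch of a
normhash-style callout that does fill in norm/normlen. The struct
lookup_args and casefold_normhash() are hypothetical, and ASCII case
folding stands in for real normalization. Note that the two callouts
added by this patch only set hashval and leave norm untouched.

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, trimmed-down stand-in for struct xfs_da_args. */
struct lookup_args {
	const unsigned char	*name;		/* name as passed in */
	int			namelen;
	unsigned char		*norm;		/* normalized name, may be NULL */
	int			normlen;
	uint32_t		hashval;	/* hash over the normalized name */
};

static uint32_t rol32(uint32_t word, unsigned int shift)
{
	return (word << shift) | (word >> (32 - shift));
}

/*
 * Build a case-folded copy of the name, hash the copy, and hand both
 * back through args. Returns 0 on success or -ENOMEM; on success the
 * caller owns args->norm and must free it.
 */
int casefold_normhash(struct lookup_args *args)
{
	uint32_t	hash = 0;
	int		i;

	args->norm = malloc(args->namelen ? args->namelen : 1);
	if (!args->norm)
		return -ENOMEM;

	for (i = 0; i < args->namelen; i++) {
		args->norm[i] = (unsigned char)tolower(args->name[i]);
		hash = args->norm[i] ^ rol32(hash, 7);
	}
	args->normlen = args->namelen;
	args->hashval = hash;
	return 0;
}
```

The calling pattern mirrors the diff below: call the callout once the
name fields are set, bail out on error, and free norm on the common exit
path.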

Signed-off-by: Olaf Weber <olaf@sgi.com>
---
 fs/xfs/libxfs/xfs_da_btree.c |  9 +++++++++
 fs/xfs/libxfs/xfs_da_btree.h |  3 +++
 fs/xfs/libxfs/xfs_dir2.c     | 42 +++++++++++++++++++++++++++++++++++++-----
 3 files changed, 49 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
index 2c42ae2..07a3acf 100644
--- a/fs/xfs/libxfs/xfs_da_btree.c
+++ b/fs/xfs/libxfs/xfs_da_btree.c
@@ -1990,8 +1990,17 @@ xfs_default_hashname(
 	return xfs_da_hashname(name->name, name->len);
 }
 
+STATIC int
+xfs_da_normhash(
+	struct xfs_da_args *args)
+{
+	args->hashval = xfs_da_hashname(args->name, args->namelen);
+	return 0;
+}
+
 const struct xfs_nameops xfs_default_nameops = {
 	.hashname	= xfs_default_hashname,
+	.normhash	= xfs_da_normhash,
 	.compname	= xfs_da_compname
 };
 
diff --git a/fs/xfs/libxfs/xfs_da_btree.h b/fs/xfs/libxfs/xfs_da_btree.h
index 9ebcc23..6cdafee 100644
--- a/fs/xfs/libxfs/xfs_da_btree.h
+++ b/fs/xfs/libxfs/xfs_da_btree.h
@@ -61,7 +61,9 @@ enum xfs_dacmp {
 typedef struct xfs_da_args {
 	struct xfs_da_geometry *geo;	/* da block geometry */
 	const __uint8_t	*name;		/* string (maybe not NULL terminated) */
+	const __uint8_t	*norm;		/* normalized name (may be NULL) */
 	int		namelen;	/* length of string (maybe no NULL) */
+	int		normlen;	/* length of normalized name */
 	__uint8_t	filetype;	/* filetype of inode for directories */
 	__uint8_t	*value;		/* set of bytes (maybe contain NULLs) */
 	int		valuelen;	/* length of value */
@@ -150,6 +152,7 @@ typedef struct xfs_da_state {
  */
 struct xfs_nameops {
 	xfs_dahash_t	(*hashname)(struct xfs_name *);
+	int		(*normhash)(struct xfs_da_args *);
 	enum xfs_dacmp	(*compname)(struct xfs_da_args *,
 					const unsigned char *, int);
 };
diff --git a/fs/xfs/libxfs/xfs_dir2.c b/fs/xfs/libxfs/xfs_dir2.c
index 32e769b..55733a6 100644
--- a/fs/xfs/libxfs/xfs_dir2.c
+++ b/fs/xfs/libxfs/xfs_dir2.c
@@ -56,6 +56,21 @@ xfs_ascii_ci_hashname(
 	return hash;
 }
 
+STATIC int
+xfs_ascii_ci_normhash(
+	struct xfs_da_args *args)
+{
+	xfs_dahash_t	hash;
+	int		i;
+
+	for (i = 0, hash = 0; i < args->namelen; i++)
+		hash = tolower(args->name[i]) ^ rol32(hash, 7);
+
+	args->hashval = hash;
+	return 0;
+}
+
+
 STATIC enum xfs_dacmp
 xfs_ascii_ci_compname(
 	struct xfs_da_args *args,
@@ -82,6 +97,7 @@ xfs_ascii_ci_compname(
 
 static struct xfs_nameops xfs_ascii_ci_nameops = {
 	.hashname	= xfs_ascii_ci_hashname,
+	.normhash	= xfs_ascii_ci_normhash,
 	.compname	= xfs_ascii_ci_compname,
 };
 
@@ -267,7 +283,6 @@ xfs_dir_createname(
 	args->name = name->name;
 	args->namelen = name->len;
 	args->filetype = name->type;
-	args->hashval = dp->i_mount->m_dirnameops->hashname(name);
 	args->inumber = inum;
 	args->dp = dp;
 	args->firstblock = first;
@@ -276,6 +291,8 @@ xfs_dir_createname(
 	args->whichfork = XFS_DATA_FORK;
 	args->trans = tp;
 	args->op_flags = XFS_DA_OP_ADDNAME | XFS_DA_OP_OKNOENT;
+	if ((rval = dp->i_mount->m_dirnameops->normhash(args)))
+		goto out_free;
 
 	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) {
 		rval = xfs_dir2_sf_addname(args);
@@ -299,6 +316,8 @@ xfs_dir_createname(
 		rval = xfs_dir2_node_addname(args);
 
 out_free:
+	if (args->norm)
+		kmem_free(args->norm);
 	kmem_free(args);
 	return rval;
 }
@@ -365,13 +384,14 @@ xfs_dir_lookup(
 	args->name = name->name;
 	args->namelen = name->len;
 	args->filetype = name->type;
-	args->hashval = dp->i_mount->m_dirnameops->hashname(name);
 	args->dp = dp;
 	args->whichfork = XFS_DATA_FORK;
 	args->trans = tp;
 	args->op_flags = XFS_DA_OP_OKNOENT;
 	if (ci_name)
 		args->op_flags |= XFS_DA_OP_CILOOKUP;
+	if ((rval = dp->i_mount->m_dirnameops->normhash(args)))
+		goto out_free;
 
 	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) {
 		rval = xfs_dir2_sf_lookup(args);
@@ -405,6 +425,9 @@ out_check_rval:
 		}
 	}
 out_free:
+	if (args->norm)
+		kmem_free(args->norm);
+
 	kmem_free(args);
 	return rval;
 }
@@ -437,7 +460,6 @@ xfs_dir_removename(
 	args->name = name->name;
 	args->namelen = name->len;
 	args->filetype = name->type;
-	args->hashval = dp->i_mount->m_dirnameops->hashname(name);
 	args->inumber = ino;
 	args->dp = dp;
 	args->firstblock = first;
@@ -445,6 +467,8 @@ xfs_dir_removename(
 	args->total = total;
 	args->whichfork = XFS_DATA_FORK;
 	args->trans = tp;
+	if ((rval = dp->i_mount->m_dirnameops->normhash(args)))
+		goto out_free;
 
 	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) {
 		rval = xfs_dir2_sf_removename(args);
@@ -467,6 +491,8 @@ xfs_dir_removename(
 	else
 		rval = xfs_dir2_node_removename(args);
 out_free:
+	if (args->norm)
+		kmem_free(args->norm);
 	kmem_free(args);
 	return rval;
 }
@@ -502,7 +528,6 @@ xfs_dir_replace(
 	args->name = name->name;
 	args->namelen = name->len;
 	args->filetype = name->type;
-	args->hashval = dp->i_mount->m_dirnameops->hashname(name);
 	args->inumber = inum;
 	args->dp = dp;
 	args->firstblock = first;
@@ -510,6 +535,8 @@ xfs_dir_replace(
 	args->total = total;
 	args->whichfork = XFS_DATA_FORK;
 	args->trans = tp;
+	if ((rval = dp->i_mount->m_dirnameops->normhash(args)))
+		goto out_free;
 
 	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) {
 		rval = xfs_dir2_sf_replace(args);
@@ -532,6 +559,8 @@ xfs_dir_replace(
 	else
 		rval = xfs_dir2_node_replace(args);
 out_free:
+	if (args->norm)
+		kmem_free(args->norm);
 	kmem_free(args);
 	return rval;
 }
@@ -564,12 +593,13 @@ xfs_dir_canenter(
 	args->name = name->name;
 	args->namelen = name->len;
 	args->filetype = name->type;
-	args->hashval = dp->i_mount->m_dirnameops->hashname(name);
 	args->dp = dp;
 	args->whichfork = XFS_DATA_FORK;
 	args->trans = tp;
 	args->op_flags = XFS_DA_OP_JUSTCHECK | XFS_DA_OP_ADDNAME |
 							XFS_DA_OP_OKNOENT;
+	if ((rval = dp->i_mount->m_dirnameops->normhash(args)))
+		goto out_free;
 
 	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) {
 		rval = xfs_dir2_sf_addname(args);
@@ -592,6 +622,8 @@ xfs_dir_canenter(
 	else
 		rval = xfs_dir2_node_addname(args);
 out_free:
+	if (args->norm)
+		kmem_free(args->norm);
 	kmem_free(args);
 	return rval;
 }
-- 
1.7.12.4

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 08/16] xfs: change interface of xfs_nameops.hashname
  2014-10-03 21:47 [RFC v3] Unicode/UTF-8 support for XFS Ben Myers
                   ` (6 preceding siblings ...)
  2014-10-03 21:58 ` [PATCH 07/16] xfs: add xfs_nameops.normhash Ben Myers
@ 2014-10-03 21:58 ` Ben Myers
  2014-10-06 22:17     ` Dave Chinner
  2014-10-03 21:59 ` [PATCH 09/16] xfs: add a superblock feature bit to indicate UTF-8 support Ben Myers
                   ` (26 subsequent siblings)
  34 siblings, 1 reply; 63+ messages in thread
From: Ben Myers @ 2014-10-03 21:58 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: olaf, xfs

From: Olaf Weber <olaf@sgi.com>

With the introduction of the xfs_nameops.normhash callout, all uses of the
hashname callout now occur in places where an xfs_name structure must be
explicitly created just to match the parameter passing convention of this
callout. Change the arguments to a const unsigned char * and int instead.
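
For illustration only (not part of the patch): a userspace sketch of the
new calling convention, where the callout receives the name bytes, the
length, and a version argument directly. The names name_ops,
plain_hashname and ascii_ci_hashname are hypothetical, and the
byte-at-a-time hash in plain_hashname() is a simple stand-in, not the
real xfs_da_hashname().

```c
#include <ctype.h>
#include <stdint.h>

/* Hypothetical stand-in for the hashname member of struct xfs_nameops. */
typedef uint32_t (*hashname_fn)(const unsigned char *name, int len,
				unsigned int version);

struct name_ops {
	hashname_fn	hashname;
};

static uint32_t rol32(uint32_t word, unsigned int shift)
{
	return (word << shift) | (word >> (32 - shift));
}

/* Case-sensitive hash; the version argument is ignored. */
uint32_t plain_hashname(const unsigned char *name, int len,
			unsigned int unused)
{
	uint32_t	hash = 0;
	int		i;

	(void)unused;
	for (i = 0; i < len; i++)
		hash = name[i] ^ rol32(hash, 7);
	return hash;
}

/* ASCII case-insensitive hash, same loop as in the diff below. */
uint32_t ascii_ci_hashname(const unsigned char *name, int len,
			   unsigned int unused)
{
	uint32_t	hash = 0;
	int		i;

	(void)unused;
	for (i = 0; i < len; i++)
		hash = tolower(name[i]) ^ rol32(hash, 7);
	return hash;
}

const struct name_ops plain_ops = { plain_hashname };
const struct name_ops ascii_ci_ops = { ascii_ci_hashname };
```

A caller that already holds a directory entry can then pass the entry's
name and namelen straight through, without filling in a temporary
xfs_name first.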

Signed-off-by: Olaf Weber <olaf@sgi.com>

[v2: pass a 3rd argument for sb_utf8version to hashname.  --bpm]
---
 fs/xfs/libxfs/xfs_da_btree.c   | 18 ++++++++++--------
 fs/xfs/libxfs/xfs_da_btree.h   |  3 ++-
 fs/xfs/libxfs/xfs_dir2.c       |  8 +++++---
 fs/xfs/libxfs/xfs_dir2_block.c |  4 +++-
 fs/xfs/libxfs/xfs_dir2_data.c  |  4 +++-
 5 files changed, 23 insertions(+), 14 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
index 07a3acf..ec6cc98 100644
--- a/fs/xfs/libxfs/xfs_da_btree.c
+++ b/fs/xfs/libxfs/xfs_da_btree.c
@@ -1973,6 +1973,15 @@ xfs_da_hashname(const __uint8_t *name, int namelen)
 	}
 }
 
+xfs_dahash_t
+xfs_da_hashname_op(
+	const __uint8_t		*name,
+	int 			namelen,
+	unsigned int 		unused)
+{
+	return xfs_da_hashname(name, namelen);
+}
+
 enum xfs_dacmp
 xfs_da_compname(
 	struct xfs_da_args *args,
@@ -1983,13 +1992,6 @@ xfs_da_compname(
 					XFS_CMP_EXACT : XFS_CMP_DIFFERENT;
 }
 
-static xfs_dahash_t
-xfs_default_hashname(
-	struct xfs_name	*name)
-{
-	return xfs_da_hashname(name->name, name->len);
-}
-
 STATIC int
 xfs_da_normhash(
 	struct xfs_da_args *args)
@@ -1999,7 +2001,7 @@ xfs_da_normhash(
 }
 
 const struct xfs_nameops xfs_default_nameops = {
-	.hashname	= xfs_default_hashname,
+	.hashname	= xfs_da_hashname_op,
 	.normhash	= xfs_da_normhash,
 	.compname	= xfs_da_compname
 };
diff --git a/fs/xfs/libxfs/xfs_da_btree.h b/fs/xfs/libxfs/xfs_da_btree.h
index 6cdafee..ce6888a 100644
--- a/fs/xfs/libxfs/xfs_da_btree.h
+++ b/fs/xfs/libxfs/xfs_da_btree.h
@@ -151,7 +151,8 @@ typedef struct xfs_da_state {
  * Name ops for directory and/or attr name operations
  */
 struct xfs_nameops {
-	xfs_dahash_t	(*hashname)(struct xfs_name *);
+	xfs_dahash_t	(*hashname)(const unsigned char *, int,
+					unsigned int);
 	int		(*normhash)(struct xfs_da_args *);
 	enum xfs_dacmp	(*compname)(struct xfs_da_args *,
 					const unsigned char *, int);
diff --git a/fs/xfs/libxfs/xfs_dir2.c b/fs/xfs/libxfs/xfs_dir2.c
index 55733a6..4eb0973 100644
--- a/fs/xfs/libxfs/xfs_dir2.c
+++ b/fs/xfs/libxfs/xfs_dir2.c
@@ -45,13 +45,15 @@ struct xfs_name xfs_name_dotdot = { (unsigned char *)"..", 2, XFS_DIR3_FT_DIR };
  */
 STATIC xfs_dahash_t
 xfs_ascii_ci_hashname(
-	struct xfs_name	*name)
+	const unsigned char *name,
+	int len,
+	unsigned int unused)
 {
 	xfs_dahash_t	hash;
 	int		i;
 
-	for (i = 0, hash = 0; i < name->len; i++)
-		hash = tolower(name->name[i]) ^ rol32(hash, 7);
+	for (i = 0, hash = 0; i < len; i++)
+		hash = tolower(name[i]) ^ rol32(hash, 7);
 
 	return hash;
 }
diff --git a/fs/xfs/libxfs/xfs_dir2_block.c b/fs/xfs/libxfs/xfs_dir2_block.c
index 990bf0c..12ebdd8 100644
--- a/fs/xfs/libxfs/xfs_dir2_block.c
+++ b/fs/xfs/libxfs/xfs_dir2_block.c
@@ -1231,7 +1231,9 @@ xfs_dir2_sf_to_block(
 		name.name = sfep->name;
 		name.len = sfep->namelen;
 		blp[2 + i].hashval = cpu_to_be32(mp->m_dirnameops->
-							hashname(&name));
+					hashname(sfep->name,
+						 sfep->namelen,
+						 0 /* version for later */));
 		blp[2 + i].address = cpu_to_be32(xfs_dir2_byte_to_dataptr(
 						 (char *)dep - (char *)hdr));
 		offset = (int)((char *)(tagp + 1) - (char *)hdr);
diff --git a/fs/xfs/libxfs/xfs_dir2_data.c b/fs/xfs/libxfs/xfs_dir2_data.c
index fdd803f..25b0f7b 100644
--- a/fs/xfs/libxfs/xfs_dir2_data.c
+++ b/fs/xfs/libxfs/xfs_dir2_data.c
@@ -179,7 +179,9 @@ __xfs_dir3_data_check(
 						((char *)dep - (char *)hdr));
 			name.name = dep->name;
 			name.len = dep->namelen;
-			hash = mp->m_dirnameops->hashname(&name);
+			hash = mp->m_dirnameops->hashname(dep->name,
+					dep->namelen,
+					0 /* version for later */);
 			for (i = 0; i < be32_to_cpu(btp->count); i++) {
 				if (be32_to_cpu(lep[i].address) == addr &&
 				    be32_to_cpu(lep[i].hashval) == hash)
-- 
1.7.12.4

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 09/16] xfs: add a superblock feature bit to indicate UTF-8 support.
  2014-10-03 21:47 [RFC v3] Unicode/UTF-8 support for XFS Ben Myers
                   ` (7 preceding siblings ...)
  2014-10-03 21:58 ` [PATCH 08/16] xfs: change interface of xfs_nameops.hashname Ben Myers
@ 2014-10-03 21:59 ` Ben Myers
  2014-10-06 21:25   ` Dave Chinner
  2014-10-03 22:00 ` [PATCH 10/16] xfs: store utf8version in the superblock Ben Myers
                   ` (25 subsequent siblings)
  34 siblings, 1 reply; 63+ messages in thread
From: Ben Myers @ 2014-10-03 21:59 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: olaf, xfs

From: Olaf Weber <olaf@sgi.com>

When UTF-8 support is enabled, the xfs_dir_ci_inode_operations must be
installed. Add xfs_sb_version_hasci(), which tests both the borgbit and
the utf8bit, and returns true if at least one of them is set. Replace
calls to xfs_sb_version_hasasciici() as needed.
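
For illustration only (not part of the patch): a simplified userspace
model of the combined test. The struct sb_model and the two bit values
are hypothetical; the real helpers in the diff below also check the
superblock version and look at different on-disk fields.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature bits, chosen only for this sketch. */
#define SB_MODEL_ASCIICI	0x1u	/* ASCII case-insensitive names */
#define SB_MODEL_UTF8		0x2u	/* UTF-8 normalized names */

/* Hypothetical, heavily reduced superblock. */
struct sb_model {
	uint32_t	features;
};

static inline bool sb_has_asciici(const struct sb_model *sb)
{
	return (sb->features & SB_MODEL_ASCIICI) != 0;
}

static inline bool sb_has_utf8(const struct sb_model *sb)
{
	return (sb->features & SB_MODEL_UTF8) != 0;
}

/*
 * True if names in this filesystem can match without being
 * byte-identical, for either reason. Callers that only care about
 * "is lookup inexact?" test this instead of the individual bits.
 */
static inline bool sb_has_ci(const struct sb_model *sb)
{
	return sb_has_asciici(sb) || sb_has_utf8(sb);
}
```

In the patch, xfs_vn_unlink() is one such caller: it invalidates the
dentry whenever either kind of inexact matching is in effect.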

Signed-off-by: Olaf Weber <olaf@sgi.com>
---
 fs/xfs/libxfs/xfs_sb.h | 24 +++++++++++++++++++++++-
 fs/xfs/xfs_fs.h        |  1 +
 fs/xfs/xfs_fsops.c     |  4 +++-
 fs/xfs/xfs_iops.c      |  4 ++--
 4 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_sb.h b/fs/xfs/libxfs/xfs_sb.h
index 2e73970..525eacb 100644
--- a/fs/xfs/libxfs/xfs_sb.h
+++ b/fs/xfs/libxfs/xfs_sb.h
@@ -70,6 +70,7 @@ struct xfs_trans;
 #define XFS_SB_VERSION2_RESERVED4BIT	0x00000004
 #define XFS_SB_VERSION2_ATTR2BIT	0x00000008	/* Inline attr rework */
 #define XFS_SB_VERSION2_PARENTBIT	0x00000010	/* parent pointers */
+#define XFS_SB_VERSION2_UTF8BIT		0x00000020      /* utf8 names */
 #define XFS_SB_VERSION2_PROJID32BIT	0x00000080	/* 32 bit project id */
 #define XFS_SB_VERSION2_CRCBIT		0x00000100	/* metadata CRCs */
 #define XFS_SB_VERSION2_FTYPE		0x00000200	/* inode type in dir */
@@ -77,6 +78,7 @@ struct xfs_trans;
 #define	XFS_SB_VERSION2_OKBITS		\
 	(XFS_SB_VERSION2_LAZYSBCOUNTBIT	| \
 	 XFS_SB_VERSION2_ATTR2BIT	| \
+	 XFS_SB_VERSION2_UTF8BIT	| \
 	 XFS_SB_VERSION2_PROJID32BIT	| \
 	 XFS_SB_VERSION2_FTYPE)
 
@@ -509,8 +511,10 @@ xfs_sb_has_ro_compat_feature(
 }
 
 #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
+#define XFS_SB_FEAT_INCOMPAT_UTF8	(1 << 1)	/* utf-8 name support */
 #define XFS_SB_FEAT_INCOMPAT_ALL \
-		(XFS_SB_FEAT_INCOMPAT_FTYPE)
+		(XFS_SB_FEAT_INCOMPAT_FTYPE | \
+		 XFS_SB_FEAT_INCOMPAT_UTF8)
 
 #define XFS_SB_FEAT_INCOMPAT_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_ALL
 static inline bool
@@ -558,6 +562,24 @@ static inline int xfs_sb_version_hasfinobt(xfs_sb_t *sbp)
 		(sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_FINOBT);
 }
 
+static inline int xfs_sb_version_hasutf8(xfs_sb_t *sbp)
+{
+	return (XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5 &&
+		xfs_sb_has_incompat_feature(sbp, XFS_SB_FEAT_INCOMPAT_UTF8)) ||
+		(xfs_sb_version_hasmorebits(sbp) &&
+		(sbp->sb_features2 & XFS_SB_VERSION2_UTF8BIT));
+}
+
+/*
+ * Special case: there are a number of places where we need to test
+ * both the borgbit and the utf8bit, and take the same action if
+ * either of those is set.
+ */
+static inline int xfs_sb_version_hasci(xfs_sb_t *sbp)
+{
+	return xfs_sb_version_hasasciici(sbp) || xfs_sb_version_hasutf8(sbp);
+}
+
 /*
  * end of superblock version macros
  */
diff --git a/fs/xfs/xfs_fs.h b/fs/xfs/xfs_fs.h
index 18dc721..e845d75 100644
--- a/fs/xfs/xfs_fs.h
+++ b/fs/xfs/xfs_fs.h
@@ -239,6 +239,7 @@ typedef struct xfs_fsop_resblks {
 #define XFS_FSOP_GEOM_FLAGS_V5SB	0x8000	/* version 5 superblock */
 #define XFS_FSOP_GEOM_FLAGS_FTYPE	0x10000	/* inode directory types */
 #define XFS_FSOP_GEOM_FLAGS_FINOBT	0x20000	/* free inode btree */
+#define XFS_FSOP_GEOM_FLAGS_UTF8	0x40000	/* utf8 filenames */
 
 /*
  * Minimum and maximum sizes need for growth checks.
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index f91de1e..1a83eef 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -103,7 +103,9 @@ xfs_fs_geometry(
 			(xfs_sb_version_hasftype(&mp->m_sb) ?
 				XFS_FSOP_GEOM_FLAGS_FTYPE : 0) |
 			(xfs_sb_version_hasfinobt(&mp->m_sb) ?
-				XFS_FSOP_GEOM_FLAGS_FINOBT : 0);
+				XFS_FSOP_GEOM_FLAGS_FINOBT : 0) |
+			(xfs_sb_version_hasutf8(&mp->m_sb) ?
+				XFS_FSOP_GEOM_FLAGS_UTF8 : 0);
 		geo->logsectsize = xfs_sb_version_hassector(&mp->m_sb) ?
 				mp->m_sb.sb_logsectsize : BBSIZE;
 		geo->rtsectsize = mp->m_sb.sb_blocksize;
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 7212949..cea3d64 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -335,9 +335,9 @@ xfs_vn_unlink(
 	/*
 	 * With unlink, the VFS makes the dentry "negative": no inode,
 	 * but still hashed. This is incompatible with case-insensitive
-	 * mode, so invalidate (unhash) the dentry in CI-mode.
+	 * or utf8 mode, so invalidate (unhash) the dentry in CI-mode.
 	 */
-	if (xfs_sb_version_hasasciici(&XFS_M(dir->i_sb)->m_sb))
+	if (xfs_sb_version_hasci(&XFS_M(dir->i_sb)->m_sb))
 		d_invalidate(dentry);
 	return 0;
 }
-- 
1.7.12.4

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 10/16] xfs: store utf8version in the superblock
  2014-10-03 21:47 [RFC v3] Unicode/UTF-8 support for XFS Ben Myers
                   ` (8 preceding siblings ...)
  2014-10-03 21:59 ` [PATCH 09/16] xfs: add a superblock feature bit to indicate UTF-8 support Ben Myers
@ 2014-10-03 22:00 ` Ben Myers
  2014-10-06 21:53     ` Dave Chinner
  2014-10-03 22:01 ` [PATCH 11/16] xfs: add xfs_nameops for utf8 and utf8+casefold Ben Myers
                   ` (24 subsequent siblings)
  34 siblings, 1 reply; 63+ messages in thread
From: Ben Myers @ 2014-10-03 22:00 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: olaf, xfs

From: Ben Myers <bpm@sgi.com>

The Unicode version a filesystem was created with must be stored so that
normalization remains stable over the lifetime of the filesystem.  Convert
sb_pad to sb_utf8version in the superblock.  Also add a check at mount
time that the Unicode normalization module supports the version of
Unicode that the filesystem requires.  If it does not, the mount fails.
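
For illustration only (not part of the patch): a userspace sketch of how
a packed version number could be decoded and checked. The shift values
(16 and 8) are an assumption inferred from the masks used in
xfs_utf8_version_ok() below; the real UNICODE_MAJ_SHIFT and
UNICODE_MIN_SHIFT come from linux/utf8norm.h, which is not shown here.
The "anything up to 7.0.0" policy in unicode_version_ok() is likewise
hypothetical and is not what utf8version_is_supported() is known to do.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed layout: major in bits 16 and up, minor in 8..15, revision in 0..7. */
#define MODEL_MAJ_SHIFT		16
#define MODEL_MIN_SHIFT		8
#define MODEL_SUPPORTED_MAX	0x070000u	/* hypothetical: 7.0.0 */

uint32_t unicode_version(unsigned int maj, unsigned int min, unsigned int rev)
{
	return ((uint32_t)maj << MODEL_MAJ_SHIFT) |
	       ((uint32_t)min << MODEL_MIN_SHIFT) | rev;
}

/* Hypothetical policy: accept any nonzero version up to the maximum. */
bool unicode_version_ok(uint32_t version)
{
	return version != 0 && version <= MODEL_SUPPORTED_MAX;
}

/*
 * Format the version the way the warning messages in the patch do:
 * the revision is printed only when it is nonzero.
 */
int unicode_version_str(uint32_t version, char *buf, size_t len)
{
	unsigned int	major = version >> MODEL_MAJ_SHIFT;
	unsigned int	minor = (version & 0xff00) >> MODEL_MIN_SHIFT;
	unsigned int	revision = version & 0xff;

	if (revision)
		return snprintf(buf, len, "%u.%u.%u", major, minor, revision);
	return snprintf(buf, len, "%u.%u", major, minor);
}
```

Packing the version into a single integer keeps the superblock field at
32 bits, so it fits in the space previously occupied by sb_pad.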

Signed-off-by: Ben Myers <bpm@sgi.com>
---
 fs/xfs/libxfs/xfs_dir2.c | 28 ++++++++++++++++---
 fs/xfs/libxfs/xfs_sb.c   |  4 +--
 fs/xfs/libxfs/xfs_sb.h   | 10 ++++---
 fs/xfs/libxfs/xfs_utf8.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_utf8.h | 24 +++++++++++++++++
 5 files changed, 126 insertions(+), 10 deletions(-)
 create mode 100644 fs/xfs/libxfs/xfs_utf8.c
 create mode 100644 fs/xfs/libxfs/xfs_utf8.h

diff --git a/fs/xfs/libxfs/xfs_dir2.c b/fs/xfs/libxfs/xfs_dir2.c
index 4eb0973..2c89211 100644
--- a/fs/xfs/libxfs/xfs_dir2.c
+++ b/fs/xfs/libxfs/xfs_dir2.c
@@ -157,10 +157,30 @@ xfs_da_mount(
 				(uint)sizeof(xfs_da_node_entry_t);
 	dageo->magicpct = (dageo->blksize * 37) / 100;
 
-	if (xfs_sb_version_hasasciici(&mp->m_sb))
-		mp->m_dirnameops = &xfs_ascii_ci_nameops;
-	else
-		mp->m_dirnameops = &xfs_default_nameops;
+	if (xfs_sb_version_hasutf8(&mp->m_sb)) {
+#ifdef CONFIG_XFS_UTF8
+		if (!xfs_utf8_version_ok(mp))
+			return -ENOSYS;
+
+		/* XXX: these are replaced in the next patch; some
+		   reordering is needed here. */
+		if (xfs_sb_version_hasasciici(&mp->m_sb))
+			mp->m_dirnameops = &xfs_ascii_ci_nameops;
+		else
+			mp->m_dirnameops = &xfs_default_nameops;
+#else
+		xfs_warn(mp,
+"Recompile XFS with CONFIG_XFS_UTF8 to mount this filesystem");
+		kmem_free(mp->m_dir_geo);
+		kmem_free(mp->m_attr_geo);
+		return -ENOSYS;
+#endif
+	} else {
+		if (xfs_sb_version_hasasciici(&mp->m_sb))
+			mp->m_dirnameops = &xfs_ascii_ci_nameops;
+		else
+			mp->m_dirnameops = &xfs_default_nameops;
+	}
 
 	return 0;
 }
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index ad525a5..1ee7d33 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -99,7 +99,7 @@ static const struct {
 	{ offsetof(xfs_sb_t, sb_features_incompat),	0 },
 	{ offsetof(xfs_sb_t, sb_features_log_incompat),	0 },
 	{ offsetof(xfs_sb_t, sb_crc),		0 },
-	{ offsetof(xfs_sb_t, sb_pad),		0 },
+	{ offsetof(xfs_sb_t, sb_utf8version),	0 },
 	{ offsetof(xfs_sb_t, sb_pquotino),	0 },
 	{ offsetof(xfs_sb_t, sb_lsn),		0 },
 	{ sizeof(xfs_sb_t),			0 }
@@ -443,7 +443,7 @@ __xfs_sb_from_disk(
 	to->sb_features_incompat = be32_to_cpu(from->sb_features_incompat);
 	to->sb_features_log_incompat =
 				be32_to_cpu(from->sb_features_log_incompat);
-	to->sb_pad = 0;
+	to->sb_utf8version = be32_to_cpu(from->sb_utf8version);
 	to->sb_pquotino = be64_to_cpu(from->sb_pquotino);
 	to->sb_lsn = be64_to_cpu(from->sb_lsn);
 	/* Convert on-disk flags to in-memory flags? */
diff --git a/fs/xfs/libxfs/xfs_sb.h b/fs/xfs/libxfs/xfs_sb.h
index 525eacb..dc7b6c6 100644
--- a/fs/xfs/libxfs/xfs_sb.h
+++ b/fs/xfs/libxfs/xfs_sb.h
@@ -159,7 +159,7 @@ typedef struct xfs_sb {
 	__uint32_t	sb_features_log_incompat;
 
 	__uint32_t	sb_crc;		/* superblock crc */
-	__uint32_t	sb_pad;
+	__uint32_t	sb_utf8version;	/* unicode version */
 
 	xfs_ino_t	sb_pquotino;	/* project quota inode */
 	xfs_lsn_t	sb_lsn;		/* last write sequence */
@@ -245,7 +245,7 @@ typedef struct xfs_dsb {
 	__be32		sb_features_log_incompat;
 
 	__le32		sb_crc;		/* superblock crc */
-	__be32		sb_pad;
+	__be32		sb_utf8version;	/* version of unicode */
 
 	__be64		sb_pquotino;	/* project quota inode */
 	__be64		sb_lsn;		/* last write sequence */
@@ -271,7 +271,7 @@ typedef enum {
 	XFS_SBS_LOGSECTLOG, XFS_SBS_LOGSECTSIZE, XFS_SBS_LOGSUNIT,
 	XFS_SBS_FEATURES2, XFS_SBS_BAD_FEATURES2, XFS_SBS_FEATURES_COMPAT,
 	XFS_SBS_FEATURES_RO_COMPAT, XFS_SBS_FEATURES_INCOMPAT,
-	XFS_SBS_FEATURES_LOG_INCOMPAT, XFS_SBS_CRC, XFS_SBS_PAD,
+	XFS_SBS_FEATURES_LOG_INCOMPAT, XFS_SBS_CRC, XFS_SBS_UTF8VERSION,
 	XFS_SBS_PQUOTINO, XFS_SBS_LSN,
 	XFS_SBS_FIELDCOUNT
 } xfs_sb_field_t;
@@ -303,6 +303,7 @@ typedef enum {
 #define XFS_SB_FEATURES_INCOMPAT XFS_SB_MVAL(FEATURES_INCOMPAT)
 #define XFS_SB_FEATURES_LOG_INCOMPAT XFS_SB_MVAL(FEATURES_LOG_INCOMPAT)
 #define XFS_SB_CRC		XFS_SB_MVAL(CRC)
+#define XFS_SB_UTF8VERSION	XFS_SB_MVAL(UTF8VERSION)
 #define XFS_SB_PQUOTINO		XFS_SB_MVAL(PQUOTINO)
 #define	XFS_SB_NUM_BITS		((int)XFS_SBS_FIELDCOUNT)
 #define	XFS_SB_ALL_BITS		((1LL << XFS_SB_NUM_BITS) - 1)
@@ -313,7 +314,8 @@ typedef enum {
 	 XFS_SB_ICOUNT | XFS_SB_IFREE | XFS_SB_FDBLOCKS | XFS_SB_FEATURES2 | \
 	 XFS_SB_BAD_FEATURES2 | XFS_SB_FEATURES_COMPAT | \
 	 XFS_SB_FEATURES_RO_COMPAT | XFS_SB_FEATURES_INCOMPAT | \
-	 XFS_SB_FEATURES_LOG_INCOMPAT | XFS_SB_PQUOTINO)
+	 XFS_SB_FEATURES_LOG_INCOMPAT | XFS_SB_UTF8VERSION | \
+	 XFS_SB_PQUOTINO)
 
 
 /*
diff --git a/fs/xfs/libxfs/xfs_utf8.c b/fs/xfs/libxfs/xfs_utf8.c
new file mode 100644
index 0000000..7e63111
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_utf8.c
@@ -0,0 +1,70 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_types.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_inum.h"
+#include "xfs_trans.h"
+#include "xfs_trans_resv.h"
+#include "xfs_sb.h"
+#include "xfs_ag.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include "xfs_dir2.h"
+#include "xfs_mount.h"
+#include "xfs_da_btree.h"
+#include "xfs_format.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_alloc_btree.h"
+#include "xfs_dinode.h"
+#include "xfs_inode.h"
+#include "xfs_inode_item.h"
+#include "xfs_bmap.h"
+#include "xfs_error.h"
+#include "xfs_trace.h"
+#include "xfs_utf8.h"
+#include <linux/utf8norm.h>
+
+int
+xfs_utf8_version_ok(
+	struct xfs_mount	*mp)
+{
+	int	major, minor, revision;
+
+	if (utf8version_is_supported(mp->m_sb.sb_utf8version))
+		return 1;
+
+	major = mp->m_sb.sb_utf8version >> UNICODE_MAJ_SHIFT;
+	minor = (mp->m_sb.sb_utf8version & 0xff00) >> UNICODE_MIN_SHIFT;
+	revision = mp->m_sb.sb_utf8version & 0xff;
+
+	if (revision) {
+		xfs_warn(mp,
+		"Unicode version %d.%d.%d not supported by utf8norm.ko",
+		major, minor, revision);
+	} else {
+		xfs_warn(mp,
+		"Unicode version %d.%d not supported by utf8norm.ko",
+		major, minor);
+	}
+
+	return 0;
+}
diff --git a/fs/xfs/libxfs/xfs_utf8.h b/fs/xfs/libxfs/xfs_utf8.h
new file mode 100644
index 0000000..8a700de
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_utf8.h
@@ -0,0 +1,24 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#ifndef XFS_UTF8_H
+#define XFS_UTF8_H
+
+extern int xfs_utf8_version_ok(struct xfs_mount *);
+
+#endif /* XFS_UTF8_H */
-- 
1.7.12.4
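
For readers following the version check in xfs_utf8_version_ok(), here
is a small userspace sketch of how a packed Unicode version splits into
major, minor and revision.  The shift values (16 and 8) and every DEMO_
or demo_ name are assumptions made for illustration; they are chosen to
agree with the 0xff00 and 0xff masks in the patch, not taken from
utf8norm.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Assumed packing: major above bit 16, minor in bits 8-15, revision in 0-7. */
#define DEMO_MAJ_SHIFT	16
#define DEMO_MIN_SHIFT	8

#define DEMO_UNICODE_AGE(maj, min, rev)			\
	(((unsigned int)(maj) << DEMO_MAJ_SHIFT) |	\
	 ((unsigned int)(min) << DEMO_MIN_SHIFT) |	\
	 ((unsigned int)(rev)))

/*
 * Format a packed version as "M.m.r", or as "M.m" when the revision is
 * zero, mirroring the two warning formats in the patch.
 * Returns the value returned by snprintf().
 */
int
demo_format_version(unsigned int v, char *buf, size_t len)
{
	unsigned int major = v >> DEMO_MAJ_SHIFT;
	unsigned int minor = (v & 0xff00) >> DEMO_MIN_SHIFT;
	unsigned int revision = v & 0xff;

	assert(buf != NULL && len > 0);
	if (revision)
		return snprintf(buf, len, "%u.%u.%u", major, minor, revision);
	return snprintf(buf, len, "%u.%u", major, minor);
}
```

Under these assumptions, DEMO_UNICODE_AGE(7, 0, 0) formats as a two-part
version and DEMO_UNICODE_AGE(6, 3, 1) as a three-part one.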

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* [PATCH 11/16] xfs: add xfs_nameops for utf8 and utf8+casefold.
  2014-10-03 21:47 [RFC v3] Unicode/UTF-8 support for XFS Ben Myers
                   ` (9 preceding siblings ...)
  2014-10-03 22:00 ` [PATCH 10/16] xfs: store utf8version in the superblock Ben Myers
@ 2014-10-03 22:01 ` Ben Myers
  2014-10-06 22:10     ` Dave Chinner
  2014-10-03 22:03 ` [PATCH 12/16] xfs: apply utf-8 normalization rules to user extended attribute names Ben Myers
                   ` (23 subsequent siblings)
  34 siblings, 1 reply; 63+ messages in thread
From: Ben Myers @ 2014-10-03 22:01 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: olaf, xfs

From: Olaf Weber <olaf@sgi.com>

The xfs_utf8_nameops use the nfkdi normalization when comparing filenames,
and are installed if the utf8 feature bit is set in the superblock.

The xfs_utf8_ci_nameops use the nfkdicf normalization when comparing
filenames, and are installed if both the utf8 feature bit and the borgbit
(the ASCII case-insensitive feature bit) are set in the superblock.

Normalized filenames are not stored on disk. Normalization will fail if a
filename is not valid UTF-8, in which case the filename is treated as an
opaque blob.

Signed-off-by: Olaf Weber <olaf@sgi.com>

---
[v2: updated to use utf8norm.ko module;
     compiled conditionally on CONFIG_XFS_UTF8=y;
     utf8version is now a function;
     move xfs_utf8.[ch] into libxfs. --bpm]
[v3: pass utf8version from the superblock through xfs_nameops
     instead of the max version of the normalization module. --bpm]
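
[Illustration only, not part of the patch.  A userspace sketch of the
 three-way compare described above: a byte-identical name is an exact
 match, names that normalize to the same bytes are a match, and a name
 that fails to normalize is a blob that can only match exactly.  The
 normalizer here is a hypothetical stand-in (ASCII lower-casing, with
 any byte >= 0x80 treated as invalid); the real code walks the trie
 with utf8ncursor()/utf8byte().  All demo_ names are made up.]

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum demo_cmp { DEMO_CMP_DIFFERENT, DEMO_CMP_MATCH, DEMO_CMP_EXACT };

/*
 * Stand-in normalizer: lower-case ASCII, reject anything else.
 * Returns the normalized length, or -1 if the name must be treated
 * as an opaque blob.
 */
long
demo_normalize(const unsigned char *name, size_t len,
	       unsigned char *out, size_t outlen)
{
	size_t i;

	if (len + 1 > outlen)
		return -1;
	for (i = 0; i < len; i++) {
		if (name[i] >= 0x80)
			return -1;
		out[i] = (unsigned char)tolower(name[i]);
	}
	out[len] = '\0';
	return (long)len;
}

enum demo_cmp
demo_compname(const unsigned char *a, size_t alen,
	      const unsigned char *b, size_t blen)
{
	unsigned char na[256], nb[256];
	long nalen, nblen;

	/* Byte-identical names always match, valid UTF-8 or not. */
	if (alen == blen && memcmp(a, b, alen) == 0)
		return DEMO_CMP_EXACT;

	/* A blob on either side can only ever match exactly. */
	nalen = demo_normalize(a, alen, na, sizeof(na));
	nblen = demo_normalize(b, blen, nb, sizeof(nb));
	if (nalen < 0 || nblen < 0)
		return DEMO_CMP_DIFFERENT;

	if (nalen == nblen && memcmp(na, nb, (size_t)nalen) == 0)
		return DEMO_CMP_MATCH;
	return DEMO_CMP_DIFFERENT;
}
```

[The patch normalizes the looked-up name once in normhash and streams
 the on-disk candidate through a cursor; the sketch normalizes both
 sides into buffers only to keep the example short.]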
---
 fs/xfs/Kconfig           |   9 ++
 fs/xfs/Makefile          |   2 +
 fs/xfs/libxfs/xfs_dir2.c |   4 +-
 fs/xfs/libxfs/xfs_utf8.c | 208 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_utf8.h |   3 +
 fs/xfs/xfs_iops.c        |   2 +-
 6 files changed, 225 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index 5d47b4d..1e8a463 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -95,3 +95,12 @@ config XFS_DEBUG
 	  not useful unless you are debugging a particular problem.
 
 	  Say N unless you are an XFS developer, or you play one on TV.
+
+config XFS_UTF8
+	bool "XFS UTF-8 support"
+	depends on XFS_FS
+	select UTF8_NORMALIZATION
+	help
+	  Say Y here to enable utf8 normalization support in XFS.  You
+	  will be able to mount and use filesystems created with the
+	  utf8 mkfs.xfs option.
diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index d617999..192aaca 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -114,6 +114,8 @@ xfs-$(CONFIG_XFS_QUOTA)		+= xfs_dquot.o \
 				   xfs_qm.o \
 				   xfs_quotaops.o
 
+xfs-$(CONFIG_XFS_UTF8)		+= libxfs/xfs_utf8.o
+
 # xfs_rtbitmap is shared with libxfs
 xfs-$(CONFIG_XFS_RT)		+= xfs_rtalloc.o
 
diff --git a/fs/xfs/libxfs/xfs_dir2.c b/fs/xfs/libxfs/xfs_dir2.c
index 2c89211..9cfbd6b 100644
--- a/fs/xfs/libxfs/xfs_dir2.c
+++ b/fs/xfs/libxfs/xfs_dir2.c
@@ -165,9 +165,9 @@ xfs_da_mount(
 		/* XXX these are replaced in the next patch need
 		   to do some kind of reordering here */
 		if (xfs_sb_version_hasasciici(&mp->m_sb))
-			mp->m_dirnameops = &xfs_ascii_ci_nameops;
+			mp->m_dirnameops = &xfs_utf8_ci_nameops;
 		else
-			mp->m_dirnameops = &xfs_default_nameops;
+			mp->m_dirnameops = &xfs_utf8_nameops;
 #else
 		xfs_warn(mp,
 "Recompile XFS with CONFIG_XFS_UTF8 to mount this filesystem");
diff --git a/fs/xfs/libxfs/xfs_utf8.c b/fs/xfs/libxfs/xfs_utf8.c
index 7e63111..1e75299 100644
--- a/fs/xfs/libxfs/xfs_utf8.c
+++ b/fs/xfs/libxfs/xfs_utf8.c
@@ -68,3 +68,211 @@ xfs_utf8_version_ok(
 
 	return 0;
 }
+
+/*
+ * xfs nameops using nfkdi
+ */
+
+static xfs_dahash_t
+xfs_utf8_hashname(
+	const unsigned char *name,
+	int len,
+	unsigned int sb_utf8version)
+{
+	utf8data_t	nfkdi;
+	struct utf8cursor u8c;
+	xfs_dahash_t	hash;
+	int		val;
+
+	nfkdi = utf8nfkdi(sb_utf8version);
+	hash = 0;
+	if (utf8ncursor(&u8c, nfkdi, name, len) < 0)
+		goto blob;
+	while ((val = utf8byte(&u8c)) > 0)
+		hash = val ^ rol32(hash, 7);
+	/* In case of error treat the name as a binary blob. */
+	if (val == 0)
+		return hash;
+blob:
+	return xfs_da_hashname(name, len);
+}
+
+static int
+xfs_utf8_normhash(
+	struct xfs_da_args *args)
+{
+	utf8data_t	nfkdi;
+	struct utf8cursor u8c;
+	unsigned char	*norm;
+	ssize_t		normlen;
+	int		c;
+	unsigned int	sb_utf8version =
+		args->dp->i_mount->m_sb.sb_utf8version;
+
+	nfkdi = utf8nfkdi(sb_utf8version);
+	/* Failure to normalize is treated as a blob. */
+	if ((normlen = utf8nlen(nfkdi, args->name, args->namelen)) < 0)
+		goto blob;
+	if (utf8ncursor(&u8c, nfkdi, args->name, args->namelen) < 0)
+		goto blob;
+	if (!(norm = kmem_alloc(normlen + 1, KM_NOFS|KM_MAYFAIL)))
+		return -ENOMEM;
+	args->norm = norm;
+	args->normlen = normlen;
+	while ((c = utf8byte(&u8c)) > 0)
+		*norm++ = c;
+	if (c == 0) {
+		*norm = '\0';
+		args->hashval = xfs_da_hashname(args->norm, args->normlen);
+		return 0;
+	}
+	kmem_free(args->norm);
+blob:
+	args->norm = NULL;
+	args->normlen = -1;
+	args->hashval = xfs_da_hashname(args->name, args->namelen);
+	return 0;
+}
+
+static enum xfs_dacmp
+xfs_utf8_compname(
+	struct xfs_da_args *args,
+	const unsigned char *name,
+	int		len)
+{
+	utf8data_t	nfkdi;
+	struct utf8cursor u8c;
+	const unsigned char *norm;
+	int		c;
+	unsigned int	sb_utf8version =
+		args->dp->i_mount->m_sb.sb_utf8version;
+
+	ASSERT(args->norm || args->normlen == -1);
+
+	/* Check for an exact match first. */
+	if (args->namelen == len && memcmp(args->name, name, len) == 0)
+		return XFS_CMP_EXACT;
+	/* xfs_utf8_normhash() set args->normlen to -1 for a blob */
+	if (args->normlen < 0)
+		return XFS_CMP_DIFFERENT;
+	nfkdi = utf8nfkdi(sb_utf8version);
+	if (utf8ncursor(&u8c, nfkdi, name, len) < 0)
+		return XFS_CMP_DIFFERENT;
+	norm = args->norm;
+	while ((c = utf8byte(&u8c)) > 0)
+		if (c != *norm++)
+			return XFS_CMP_DIFFERENT;
+	if (c < 0 || *norm != '\0')
+		return XFS_CMP_DIFFERENT;
+	return XFS_CMP_MATCH;
+}
+
+struct xfs_nameops xfs_utf8_nameops = {
+	.hashname = xfs_utf8_hashname,
+	.normhash = xfs_utf8_normhash,
+	.compname = xfs_utf8_compname,
+};
+
+/*
+ * xfs nameops using nfkdicf
+ */
+
+static xfs_dahash_t
+xfs_utf8_ci_hashname(
+	const unsigned char *name,
+	int len,
+	unsigned int sb_utf8version)
+{
+	utf8data_t	nfkdicf;
+	struct utf8cursor u8c;
+	xfs_dahash_t	hash;
+	int		val;
+
+	nfkdicf = utf8nfkdicf(sb_utf8version);
+	hash = 0;
+	if (utf8ncursor(&u8c, nfkdicf, name, len) < 0)
+		goto blob;
+	while ((val = utf8byte(&u8c)) > 0)
+		hash = val ^ rol32(hash, 7);
+	/* In case of error treat the name as a binary blob. */
+	if (val == 0)
+		return hash;
+blob:
+	return xfs_da_hashname(name, len);
+}
+
+static int
+xfs_utf8_ci_normhash(
+	struct xfs_da_args *args)
+{
+	utf8data_t	nfkdicf;
+	struct utf8cursor u8c;
+	unsigned char	*norm;
+	ssize_t		normlen;
+	int		c;
+	unsigned int	sb_utf8version =
+		args->dp->i_mount->m_sb.sb_utf8version;
+
+	nfkdicf = utf8nfkdicf(sb_utf8version);
+	/* Failure to normalize is treated as a blob. */
+	if ((normlen = utf8nlen(nfkdicf, args->name, args->namelen)) < 0)
+		goto blob;
+	if (utf8ncursor(&u8c, nfkdicf, args->name, args->namelen) < 0)
+		goto blob;
+	if (!(norm = kmem_alloc(normlen + 1, KM_NOFS|KM_MAYFAIL)))
+		return -ENOMEM;
+	args->norm = norm;
+	args->normlen = normlen;
+	while ((c = utf8byte(&u8c)) > 0)
+		*norm++ = c;
+	if (c == 0) {
+		*norm = '\0';
+		args->hashval = xfs_da_hashname(args->norm, args->normlen);
+		return 0;
+	}
+	kmem_free(args->norm);
+blob:
+	args->norm = NULL;
+	args->normlen = -1;
+	args->hashval = xfs_da_hashname(args->name, args->namelen);
+	return 0;
+}
+
+static enum xfs_dacmp
+xfs_utf8_ci_compname(
+	struct xfs_da_args *args,
+	const unsigned char *name,
+	int		len)
+{
+	utf8data_t	nfkdicf;
+	struct utf8cursor u8c;
+	const unsigned char *norm;
+	int		c;
+	unsigned int	sb_utf8version =
+		args->dp->i_mount->m_sb.sb_utf8version;
+
+	ASSERT(args->norm || args->normlen == -1);
+
+	/* Check for an exact match first. */
+	if (args->namelen == len && memcmp(args->name, name, len) == 0)
+		return XFS_CMP_EXACT;
+	/* xfs_utf8_ci_normhash() set args->normlen to -1 for a blob */
+	if (args->normlen < 0)
+		return XFS_CMP_DIFFERENT;
+	nfkdicf = utf8nfkdicf(sb_utf8version);
+	if (utf8ncursor(&u8c, nfkdicf, name, len) < 0)
+		return XFS_CMP_DIFFERENT;
+	norm = args->norm;
+	while ((c = utf8byte(&u8c)) > 0)
+		if (c != *norm++)
+			return XFS_CMP_DIFFERENT;
+	if (c < 0 || *norm != '\0')
+		return XFS_CMP_DIFFERENT;
+	return XFS_CMP_MATCH;
+}
+
+struct xfs_nameops xfs_utf8_ci_nameops = {
+	.hashname = xfs_utf8_ci_hashname,
+	.normhash = xfs_utf8_ci_normhash,
+	.compname = xfs_utf8_ci_compname,
+};
diff --git a/fs/xfs/libxfs/xfs_utf8.h b/fs/xfs/libxfs/xfs_utf8.h
index 8a700de..404db54 100644
--- a/fs/xfs/libxfs/xfs_utf8.h
+++ b/fs/xfs/libxfs/xfs_utf8.h
@@ -21,4 +21,7 @@
 
 extern int xfs_utf8_version_ok(struct xfs_mount *);
 
+extern struct xfs_nameops xfs_utf8_nameops;
+extern struct xfs_nameops xfs_utf8_ci_nameops;
+
 #endif /* XFS_UTF8_H */
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index cea3d64..fbfb1bb 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1257,7 +1257,7 @@ xfs_setup_inode(
 		break;
 	case S_IFDIR:
 		lockdep_set_class(&ip->i_lock.mr_lock, &xfs_dir_ilock_class);
-		if (xfs_sb_version_hasasciici(&XFS_M(inode->i_sb)->m_sb))
+		if (xfs_sb_version_hasci(&XFS_M(inode->i_sb)->m_sb))
 			inode->i_op = &xfs_dir_ci_inode_operations;
 		else
 			inode->i_op = &xfs_dir_inode_operations;
-- 
1.7.12.4
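
A note on the hash used above: xfs_utf8_hashname() folds one normalized
byte at a time with hash = val ^ rol32(hash, 7), while
xfs_utf8_normhash() hashes the whole normalized buffer with
xfs_da_hashname().  The userspace sketch below suggests why the two
should agree.  It assumes xfs_da_hashname() folds four bytes per step
with the shifts shown, which is how I read the mainline code of that
time; treat demo_block_hash() as an approximation, not a copy.  All
demo_ names are made up.

```c
#include <stddef.h>
#include <stdint.h>

/* Rotate left; s must be in 1..31 for this simple form. */
uint32_t
demo_rol32(uint32_t w, unsigned int s)
{
	return (w << s) | (w >> (32 - s));
}

/* Per-byte update, as in the utf8byte() loop of xfs_utf8_hashname(). */
uint32_t
demo_stream_hash(const unsigned char *p, size_t len)
{
	uint32_t hash = 0;
	size_t i;

	for (i = 0; i < len; i++)
		hash = p[i] ^ demo_rol32(hash, 7);
	return hash;
}

/* Assumed shape of a four-bytes-per-step hash over a whole buffer. */
uint32_t
demo_block_hash(const unsigned char *p, size_t len)
{
	uint32_t hash = 0;

	for (; len >= 4; len -= 4, p += 4)
		hash = ((uint32_t)p[0] << 21) ^ ((uint32_t)p[1] << 14) ^
		       ((uint32_t)p[2] << 7) ^ (uint32_t)p[3] ^
		       demo_rol32(hash, 28);

	/* Fold the remaining 0-3 bytes. */
	switch (len) {
	case 3:
		return ((uint32_t)p[0] << 14) ^ ((uint32_t)p[1] << 7) ^
		       (uint32_t)p[2] ^ demo_rol32(hash, 21);
	case 2:
		return ((uint32_t)p[0] << 7) ^ (uint32_t)p[1] ^
		       demo_rol32(hash, 14);
	case 1:
		return (uint32_t)p[0] ^ demo_rol32(hash, 7);
	default:
		return hash;
	}
}
```

Because every input is a single byte, shifting it left by 7, 14 or 21
loses no bits, so four per-byte steps expand to exactly one block step.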


* [PATCH 12/16] xfs: apply utf-8 normalization rules to user extended attribute names
  2014-10-03 21:47 [RFC v3] Unicode/UTF-8 support for XFS Ben Myers
                   ` (10 preceding siblings ...)
  2014-10-03 22:01 ` [PATCH 11/16] xfs: add xfs_nameops for utf8 and utf8+casefold Ben Myers
@ 2014-10-03 22:03 ` Ben Myers
  2014-10-03 22:03 ` [PATCH 13/16] xfs: implement demand load of utf8norm.ko Ben Myers
                   ` (22 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Ben Myers @ 2014-10-03 22:03 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: olaf, xfs

From: Olaf Weber <olaf@sgi.com>

Apply the same rules for UTF-8 normalization to the names of user-defined
extended attributes. System attributes are excluded because they are not
user-visible in the first place, and the kernel is expected to know what
it is doing when naming them.

Signed-off-by: Olaf Weber <olaf@sgi.com>
---
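[Illustration only, not part of the patch.  The rule in the commit
 message reduces to one predicate: normalize only on a utf8 filesystem,
 and only for names outside the root and secure namespaces.  The flag
 values below are assumptions picked for the sketch, and every DEMO_ or
 demo_ name is made up; the kernel uses ATTR_ROOT and ATTR_SECURE from
 xfs_attr.h.]

```c
/* Assumed flag values, for illustration only. */
#define DEMO_ATTR_ROOT		0x0002
#define DEMO_ATTR_SECURE	0x0008

/*
 * Return 1 if an attribute name should go through UTF-8 normalization,
 * 0 if it should be hashed and compared as raw bytes.
 */
int
demo_should_normalize(int fs_has_utf8, int flags)
{
	if (!fs_has_utf8)
		return 0;
	/* System namespaces are named by the kernel; leave them alone. */
	if (flags & (DEMO_ATTR_ROOT | DEMO_ATTR_SECURE))
		return 0;
	return 1;
}
```

[Whatever form the check takes, the same decision has to be made when
 an attribute is created and when it is looked up, otherwise the stored
 hash and the lookup hash disagree.]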
 fs/xfs/libxfs/xfs_attr.c      | 56 ++++++++++++++++++++++++++++++++++++-------
 fs/xfs/libxfs/xfs_attr_leaf.c | 11 +++++++--
 fs/xfs/libxfs/xfs_utf8.c      |  7 ++++++
 fs/xfs/xfs_attr_list.c        | 12 +++++++++-
 4 files changed, 75 insertions(+), 11 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
index 353fb42..68e7ce3 100644
--- a/fs/xfs/libxfs/xfs_attr.c
+++ b/fs/xfs/libxfs/xfs_attr.c
@@ -83,12 +83,14 @@ xfs_attr_args_init(
 	const unsigned char	*name,
 	int			flags)
 {
+	struct xfs_mount	*mp = dp->i_mount;
+	int			error;
 
 	if (!name)
 		return -EINVAL;
 
 	memset(args, 0, sizeof(*args));
-	args->geo = dp->i_mount->m_attr_geo;
+	args->geo = mp->m_attr_geo;
 	args->whichfork = XFS_ATTR_FORK;
 	args->dp = dp;
 	args->flags = flags;
@@ -97,7 +99,11 @@ xfs_attr_args_init(
 	if (args->namelen >= MAXNAMELEN)
 		return -EFAULT;		/* match IRIX behaviour */
 
-	args->hashval = xfs_da_hashname(args->name, args->namelen);
+	if (!xfs_sb_version_hasutf8(&mp->m_sb))
+		args->hashval = xfs_da_hashname(args->name, args->namelen);
+	else if ((error = mp->m_dirnameops->normhash(args)) != 0)
+		return error;
+
 	return 0;
 }
 
@@ -154,6 +160,9 @@ xfs_attr_get(
 		error = xfs_attr_node_get(&args);
 	xfs_iunlock(ip, lock_mode);
 
+	if (args.norm)
+		kmem_free(args.norm);
+
 	*valuelenp = args.valuelen;
 	return error == -EEXIST ? 0 : error;
 }
@@ -216,8 +225,11 @@ xfs_attr_set(
 		return -EIO;
 
 	error = xfs_attr_args_init(&args, dp, name, flags);
-	if (error)
+	if (error) {
+		if (args.norm)
+			kmem_free(args.norm);
 		return error;
+	}
 
 	args.value = value;
 	args.valuelen = valuelen;
@@ -227,8 +239,11 @@ xfs_attr_set(
 	args.total = xfs_attr_calc_size(&args, &local);
 
 	error = xfs_qm_dqattach(dp, 0);
-	if (error)
+	if (error) {
+		if (args.norm)
+			kmem_free(args.norm);
 		return error;
+	}
 
 	/*
 	 * If the inode doesn't have an attribute fork, add one.
@@ -239,8 +254,11 @@ xfs_attr_set(
 			XFS_ATTR_SF_ENTSIZE_BYNAME(args.namelen, valuelen);
 
 		error = xfs_bmap_add_attrfork(dp, sf_size, rsvd);
-		if (error)
+		if (error) {
+			if (args.norm)
+				kmem_free(args.norm);
 			return error;
+		}
 	}
 
 	/*
@@ -270,6 +288,8 @@ xfs_attr_set(
 	error = xfs_trans_reserve(args.trans, &tres, args.total, 0);
 	if (error) {
 		xfs_trans_cancel(args.trans, 0);
+		if (args.norm)
+			kmem_free(args.norm);
 		return error;
 	}
 	xfs_ilock(dp, XFS_ILOCK_EXCL);
@@ -280,6 +300,8 @@ xfs_attr_set(
 	if (error) {
 		xfs_iunlock(dp, XFS_ILOCK_EXCL);
 		xfs_trans_cancel(args.trans, XFS_TRANS_RELEASE_LOG_RES);
+		if (args.norm)
+			kmem_free(args.norm);
 		return error;
 	}
 
@@ -327,6 +349,8 @@ xfs_attr_set(
 						 XFS_TRANS_RELEASE_LOG_RES);
 			xfs_iunlock(dp, XFS_ILOCK_EXCL);
 
+			if (args.norm)
+				kmem_free(args.norm);
 			return error ? error : err2;
 		}
 
@@ -388,7 +412,8 @@ xfs_attr_set(
 	xfs_trans_log_inode(args.trans, dp, XFS_ILOG_CORE);
 	error = xfs_trans_commit(args.trans, XFS_TRANS_RELEASE_LOG_RES);
 	xfs_iunlock(dp, XFS_ILOCK_EXCL);
-
+	if (args.norm)
+		kmem_free(args.norm);
 	return error;
 
 out:
@@ -397,6 +422,8 @@ out:
 			XFS_TRANS_RELEASE_LOG_RES|XFS_TRANS_ABORT);
 	}
 	xfs_iunlock(dp, XFS_ILOCK_EXCL);
+	if (args.norm)
+		kmem_free(args.norm);
 	return error;
 }
 
@@ -425,8 +452,11 @@ xfs_attr_remove(
 		return -ENOATTR;
 
 	error = xfs_attr_args_init(&args, dp, name, flags);
-	if (error)
+	if (error) {
+		if (args.norm)
+			kmem_free(args.norm);
 		return error;
+	}
 
 	args.firstblock = &firstblock;
 	args.flist = &flist;
@@ -439,8 +469,11 @@ xfs_attr_remove(
 	args.op_flags = XFS_DA_OP_OKNOENT;
 
 	error = xfs_qm_dqattach(dp, 0);
-	if (error)
+	if (error) {
+		if (args.norm)
+			kmem_free(args.norm);
 		return error;
+	}
 
 	/*
 	 * Start our first transaction of the day.
@@ -466,6 +499,8 @@ xfs_attr_remove(
 				  XFS_ATTRRM_SPACE_RES(mp), 0);
 	if (error) {
 		xfs_trans_cancel(args.trans, 0);
+		if (args.norm)
+			kmem_free(args.norm);
 		return error;
 	}
 
@@ -506,6 +541,8 @@ xfs_attr_remove(
 	xfs_trans_log_inode(args.trans, dp, XFS_ILOG_CORE);
 	error = xfs_trans_commit(args.trans, XFS_TRANS_RELEASE_LOG_RES);
 	xfs_iunlock(dp, XFS_ILOCK_EXCL);
+	if (args.norm)
+		kmem_free(args.norm);
 
 	return error;
 
@@ -515,6 +552,9 @@ out:
 			XFS_TRANS_RELEASE_LOG_RES|XFS_TRANS_ABORT);
 	}
 	xfs_iunlock(dp, XFS_ILOCK_EXCL);
+	if (args.norm)
+		kmem_free(args.norm);
+
 	return error;
 }
 
diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
index b1f73db..c991a88 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.c
+++ b/fs/xfs/libxfs/xfs_attr_leaf.c
@@ -661,6 +661,7 @@ int
 xfs_attr_shortform_to_leaf(xfs_da_args_t *args)
 {
 	xfs_inode_t *dp;
+	struct xfs_mount *mp;
 	xfs_attr_shortform_t *sf;
 	xfs_attr_sf_entry_t *sfe;
 	xfs_da_args_t nargs;
@@ -673,6 +674,7 @@ xfs_attr_shortform_to_leaf(xfs_da_args_t *args)
 	trace_xfs_attr_sf_to_leaf(args);
 
 	dp = args->dp;
+	mp = dp->i_mount;
 	ifp = dp->i_afp;
 	sf = (xfs_attr_shortform_t *)ifp->if_u1.if_data;
 	size = be16_to_cpu(sf->hdr.totsize);
@@ -726,13 +728,18 @@ xfs_attr_shortform_to_leaf(xfs_da_args_t *args)
 		nargs.namelen = sfe->namelen;
 		nargs.value = &sfe->nameval[nargs.namelen];
 		nargs.valuelen = sfe->valuelen;
-		nargs.hashval = xfs_da_hashname(sfe->nameval,
-						sfe->namelen);
 		nargs.flags = XFS_ATTR_NSP_ONDISK_TO_ARGS(sfe->flags);
+		if (!xfs_sb_version_hasutf8(&mp->m_sb))
+			nargs.hashval = xfs_da_hashname(sfe->nameval,
+							sfe->namelen);
+		else if ((error = mp->m_dirnameops->normhash(&nargs)) != 0)
+			goto out;
 		error = xfs_attr3_leaf_lookup_int(bp, &nargs); /* set a->index */
 		ASSERT(error == -ENOATTR);
 		error = xfs_attr3_leaf_add(bp, &nargs);
 		ASSERT(error != -ENOSPC);
+		if (nargs.norm)
+			kmem_free(nargs.norm);
 		if (error)
 			goto out;
 		sfe = XFS_ATTR_SF_NEXTENTRY(sfe);
diff --git a/fs/xfs/libxfs/xfs_utf8.c b/fs/xfs/libxfs/xfs_utf8.c
index 1e75299..ede6228 100644
--- a/fs/xfs/libxfs/xfs_utf8.c
+++ b/fs/xfs/libxfs/xfs_utf8.c
@@ -38,6 +38,7 @@
 #include "xfs_inode.h"
 #include "xfs_inode_item.h"
 #include "xfs_bmap.h"
+#include "xfs_attr.h"
 #include "xfs_error.h"
 #include "xfs_trace.h"
 #include "xfs_utf8.h"
@@ -109,6 +110,9 @@ xfs_utf8_normhash(
 	unsigned int	sb_utf8version =
 		args->dp->i_mount->m_sb.sb_utf8version;
 
+	/* Don't normalize system attribute names. */
+	if (args->flags & (ATTR_ROOT|ATTR_SECURE))
+		goto blob;
 	nfkdi = utf8nfkdi(sb_utf8version);
 	/* Failure to normalize is treated as a blob. */
 	if ((normlen = utf8nlen(nfkdi, args->name, args->namelen)) < 0)
@@ -213,6 +217,9 @@ xfs_utf8_ci_normhash(
 	unsigned int	sb_utf8version =
 		args->dp->i_mount->m_sb.sb_utf8version;
 
+	/* Don't normalize system attribute names. */
+	if (args->flags & (ATTR_ROOT|ATTR_SECURE))
+		goto blob;
 	nfkdicf = utf8nfkdicf(sb_utf8version);
 	/* Failure to normalize is treated as a blob. */
 	if ((normlen = utf8nlen(nfkdicf, args->name, args->namelen)) < 0)
diff --git a/fs/xfs/xfs_attr_list.c b/fs/xfs/xfs_attr_list.c
index 62db83a..034199d 100644
--- a/fs/xfs/xfs_attr_list.c
+++ b/fs/xfs/xfs_attr_list.c
@@ -76,12 +76,14 @@ xfs_attr_shortform_list(xfs_attr_list_context_t *context)
 	xfs_attr_shortform_t *sf;
 	xfs_attr_sf_entry_t *sfe;
 	xfs_inode_t *dp;
+	struct xfs_mount *mp;
 	int sbsize, nsbuf, count, i;
 	int error;
 
 	ASSERT(context != NULL);
 	dp = context->dp;
 	ASSERT(dp != NULL);
+	mp = dp->i_mount;
 	ASSERT(dp->i_afp != NULL);
 	sf = (xfs_attr_shortform_t *)dp->i_afp->if_u1.if_data;
 	ASSERT(sf != NULL);
@@ -154,7 +156,15 @@ xfs_attr_shortform_list(xfs_attr_list_context_t *context)
 		}
 
 		sbp->entno = i;
-		sbp->hash = xfs_da_hashname(sfe->nameval, sfe->namelen);
+
+		/* On-disk ROOT and SECURE names are never normalized. */
+		if (!xfs_sb_version_hasutf8(&mp->m_sb) ||
+		    (sfe->flags & (XFS_ATTR_ROOT|XFS_ATTR_SECURE))) {
+			sbp->hash = xfs_da_hashname(sfe->nameval, sfe->namelen);
+		} else {
+			sbp->hash = mp->m_dirnameops->hashname(sfe->nameval,
+				       sfe->namelen, mp->m_sb.sb_utf8version);
+		}
 		sbp->name = sfe->nameval;
 		sbp->namelen = sfe->namelen;
 		/* These are bytes, and both on-disk, don't endian-flip */
-- 
1.7.12.4


* [PATCH 13/16] xfs: implement demand load of utf8norm.ko
  2014-10-03 21:47 [RFC v3] Unicode/UTF-8 support for XFS Ben Myers
                   ` (11 preceding siblings ...)
  2014-10-03 22:03 ` [PATCH 12/16] xfs: apply utf-8 normalization rules to user extended attribute names Ben Myers
@ 2014-10-03 22:03 ` Ben Myers
  2014-10-04  7:16     ` Christoph Hellwig
  2014-10-03 22:04 ` [PATCH 14/16] xfs: rename XFS_IOC_FSGEOM to XFS_IOC_FSGEOM_V2 Ben Myers
                   ` (21 subsequent siblings)
  34 siblings, 1 reply; 63+ messages in thread
From: Ben Myers @ 2014-10-03 22:03 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: olaf, xfs

From: Ben Myers <bpm@sgi.com>

The utf8 normalization module is large, and there is no need to have it
loaded unless an XFS filesystem with utf8 enabled has been mounted.
Load utf8norm.ko at mount time, and only for filesystems that need it.

Signed-off-by: Ben Myers <bpm@sgi.com>

---
[v2: updated for utf8version_is_supported. --bpm]
[v3: removed CONFIG_XFS_UTF8_DEMAND_LOAD. --bpm]
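
[Illustration only, not part of the patch.  A userspace sketch of the
 all-or-nothing pattern used by xfs_init_utf8_module(): take a
 reference on every symbol, and if any one is missing, drop the ones
 already taken.  The registry below is a hypothetical stand-in for
 symbol_get()/symbol_put(); all demo_ names are made up.  Unlike the
 patch, the sketch clears each pointer after dropping it, so that a
 repeated teardown cannot drop a reference twice.  That is a choice of
 this sketch.]

```c
#include <stddef.h>
#include <string.h>

#define DEMO_NSYMS 3

struct demo_sym {
	const char	*name;
	int		available;	/* 0 simulates "module not loaded" */
	int		refs;		/* references currently held */
};

struct demo_sym demo_registry[DEMO_NSYMS] = {
	{ "utf8nlen", 1, 0 },
	{ "utf8ncursor", 1, 0 },
	{ "utf8byte", 1, 0 },
};

struct demo_sym *demo_resolved[DEMO_NSYMS];
int demo_initialized;

/* Stand-in for symbol_get(): returns NULL if the symbol is unavailable. */
struct demo_sym *
demo_symbol_get(const char *name)
{
	int i;

	for (i = 0; i < DEMO_NSYMS; i++) {
		if (strcmp(demo_registry[i].name, name) == 0 &&
		    demo_registry[i].available) {
			demo_registry[i].refs++;
			return &demo_registry[i];
		}
	}
	return NULL;
}

/* Drop every reference held and forget the pointers. */
void
demo_put_all(void)
{
	int i;

	for (i = 0; i < DEMO_NSYMS; i++) {
		if (demo_resolved[i]) {
			demo_resolved[i]->refs--;
			demo_resolved[i] = NULL;
		}
	}
	demo_initialized = 0;
}

/* Resolve all symbols or none; returns 0 on success, -1 on failure. */
int
demo_init(void)
{
	int i;

	if (demo_initialized)
		return 0;
	for (i = 0; i < DEMO_NSYMS; i++) {
		demo_resolved[i] = demo_symbol_get(demo_registry[i].name);
		if (!demo_resolved[i]) {
			demo_put_all();
			return -1;
		}
	}
	demo_initialized = 1;
	return 0;
}
```

[Calling demo_init() twice after a success takes no extra references,
 which matches the utf8norm_initialized short-circuit in the patch.]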
---
 fs/xfs/libxfs/xfs_dir2.c |   9 +++++
 fs/xfs/libxfs/xfs_utf8.c | 100 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_utf8.h |   3 ++
 fs/xfs/xfs_super.c       |   6 +++
 4 files changed, 118 insertions(+)

diff --git a/fs/xfs/libxfs/xfs_dir2.c b/fs/xfs/libxfs/xfs_dir2.c
index 9cfbd6b..844044b 100644
--- a/fs/xfs/libxfs/xfs_dir2.c
+++ b/fs/xfs/libxfs/xfs_dir2.c
@@ -35,6 +35,9 @@
 #include "xfs_error.h"
 #include "xfs_trace.h"
 #include "xfs_dinode.h"
+#ifdef CONFIG_XFS_UTF8
+#include "xfs_utf8.h"
+#endif
 
 struct xfs_name xfs_name_dotdot = { (unsigned char *)"..", 2, XFS_DIR3_FT_DIR };
 
@@ -159,6 +162,12 @@ xfs_da_mount(
 
 	if (xfs_sb_version_hasutf8(&mp->m_sb)) {
 #ifdef CONFIG_XFS_UTF8
+		if (xfs_init_utf8_module(mp)) {
+			kmem_free(mp->m_dir_geo);
+			kmem_free(mp->m_attr_geo);
+			return -ENOSYS;
+		}
+
 		if (!xfs_utf8_version_ok(mp))
 			return -ENOSYS;
 
diff --git a/fs/xfs/libxfs/xfs_utf8.c b/fs/xfs/libxfs/xfs_utf8.c
index ede6228..09efcda 100644
--- a/fs/xfs/libxfs/xfs_utf8.c
+++ b/fs/xfs/libxfs/xfs_utf8.c
@@ -43,6 +43,106 @@
 #include "xfs_trace.h"
 #include "xfs_utf8.h"
 #include <linux/utf8norm.h>
+#include <linux/kmod.h>
+
+static DEFINE_SPINLOCK(utf8norm_lock);
+static int utf8norm_initialized;
+
+static int (*utf8version_is_supported_func)(unsigned int);
+static utf8data_t (*utf8nfkdi_func)(unsigned int);
+static utf8data_t (*utf8nfkdicf_func)(unsigned int);
+static ssize_t (*utf8nlen_func)(utf8data_t, const char *, size_t);
+static int (*utf8ncursor_func)(struct utf8cursor *, utf8data_t,
+		const char *, size_t);
+static int (*utf8byte_func)(struct utf8cursor *);
+
+static void
+xfs_put_utf8_module_locked(void)
+{
+	if (utf8version_is_supported_func)
+		symbol_put(utf8version_is_supported);
+
+	if (utf8nfkdi_func)
+		symbol_put(utf8nfkdi);
+
+	if (utf8nfkdicf_func)
+		symbol_put(utf8nfkdicf);
+
+	if (utf8nlen_func)
+		symbol_put(utf8nlen);
+
+	if (utf8ncursor_func)
+		symbol_put(utf8ncursor);
+
+	if (utf8byte_func)
+		symbol_put(utf8byte);
+}
+
+void
+xfs_put_utf8_module(void)
+{
+	spin_lock(&utf8norm_lock);
+	if (!utf8norm_initialized) {
+		spin_unlock(&utf8norm_lock);
+		return;
+	}
+	xfs_put_utf8_module_locked();
+	spin_unlock(&utf8norm_lock);
+}
+
+int
+xfs_init_utf8_module(struct xfs_mount	*mp)
+{
+	request_module("utf8norm");
+
+	spin_lock(&utf8norm_lock);
+	if (utf8norm_initialized) {
+		spin_unlock(&utf8norm_lock);
+		return 0;
+	}
+
+	utf8version_is_supported_func = symbol_get(utf8version_is_supported);
+	if (!utf8version_is_supported_func)
+		goto error;
+
+	utf8nfkdi_func = symbol_get(utf8nfkdi);
+	if (!utf8nfkdi_func)
+		goto error;
+
+	utf8nfkdicf_func = symbol_get(utf8nfkdicf);
+	if (!utf8nfkdicf_func)
+		goto error;
+
+	utf8nlen_func = symbol_get(utf8nlen);
+	if (!utf8nlen_func)
+		goto error;
+
+	utf8ncursor_func = symbol_get(utf8ncursor);
+	if (!utf8ncursor_func)
+		goto error;
+
+	utf8byte_func = symbol_get(utf8byte);
+	if (!utf8byte_func)
+		goto error;
+
+	utf8norm_initialized = 1;
+	spin_unlock(&utf8norm_lock);
+	return 0;
+error:
+	xfs_put_utf8_module_locked();
+	spin_unlock(&utf8norm_lock);
+	xfs_warn(mp,
+		"Failed to load utf8norm.ko which is required to "
+		"mount a filesystem with utf8 support.");
+	return -ENOSYS;
+}
+
+#define utf8version_is_supported (*utf8version_is_supported_func)
+#define utf8nfkdi (*utf8nfkdi_func)
+#define utf8nfkdicf (*utf8nfkdicf_func)
+#define utf8nlen (*utf8nlen_func)
+#define utf8ncursor (*utf8ncursor_func)
+#define utf8byte (*utf8byte_func)
 
 int
 xfs_utf8_version_ok(
diff --git a/fs/xfs/libxfs/xfs_utf8.h b/fs/xfs/libxfs/xfs_utf8.h
index 404db54..b79ce05 100644
--- a/fs/xfs/libxfs/xfs_utf8.h
+++ b/fs/xfs/libxfs/xfs_utf8.h
@@ -24,4 +24,7 @@ extern int xfs_utf8_version_ok(struct xfs_mount *);
 extern struct xfs_nameops xfs_utf8_nameops;
 extern struct xfs_nameops xfs_utf8_ci_nameops;
 
+extern int xfs_init_utf8_module(struct xfs_mount *);
+extern void xfs_put_utf8_module(void);
+
 #endif /* XFS_UTF8_H */
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index b194652..60a3ebc 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -47,6 +47,9 @@
 #include "xfs_dinode.h"
 #include "xfs_filestream.h"
 #include "xfs_quota.h"
+#ifdef CONFIG_XFS_UTF8
+#include "xfs_utf8.h"
+#endif
 
 #include <linux/namei.h>
 #include <linux/init.h>
@@ -1809,6 +1812,9 @@ exit_xfs_fs(void)
 	xfs_mru_cache_uninit();
 	xfs_destroy_workqueues();
 	xfs_destroy_zones();
+#ifdef CONFIG_XFS_UTF8
+	xfs_put_utf8_module();
+#endif
 }
 
 module_init(init_xfs_fs);
-- 
1.7.12.4


* [PATCH 14/16] xfs: rename XFS_IOC_FSGEOM to XFS_IOC_FSGEOM_V2
  2014-10-03 21:47 [RFC v3] Unicode/UTF-8 support for XFS Ben Myers
                   ` (12 preceding siblings ...)
  2014-10-03 22:03 ` [PATCH 13/16] xfs: implement demand load of utf8norm.ko Ben Myers
@ 2014-10-03 22:04 ` Ben Myers
  2014-10-06 20:33     ` Dave Chinner
  2014-10-03 22:05 ` [PATCH 15/16] xfs: xfs_fs_geometry returns a number of bytes to copy Ben Myers
                   ` (20 subsequent siblings)
  34 siblings, 1 reply; 63+ messages in thread
From: Ben Myers @ 2014-10-03 22:04 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: olaf, xfs

From: Ben Myers <bpm@sgi.com>

We'll be creating a new versioned XFS_IOC_FSGEOMETRY ioctl and structure,
so rename the current revision to _V2.

Signed-off-by: Ben Myers <bpm@sgi.com>
---
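[Background illustration, not part of the patch.  On Linux the _IOR()
 macro folds sizeof() of the argument structure into the command
 number, so two geometry structures of different size yield distinct
 ioctl commands even when they share a type and number.  The structures
 and DEMO_ names below are hypothetical, and whether the new versioned
 ioctl in this series reuses number 124 is not something this sketch
 claims.  Linux-only: it needs <linux/ioctl.h>.]

```c
#include <linux/ioctl.h>
#include <stdint.h>

/* Hypothetical old and new geometry layouts. */
struct demo_geom_old {
	uint32_t	blocksize;
	uint32_t	agcount;
};

struct demo_geom_new {
	uint32_t	blocksize;
	uint32_t	agcount;
	uint32_t	utf8version;
	uint32_t	pad;
};

#define DEMO_IOC_GEOM_OLD	_IOR('X', 124, struct demo_geom_old)
#define DEMO_IOC_GEOM_NEW	_IOR('X', 124, struct demo_geom_new)

/* Size of the argument structure encoded in a command number. */
unsigned long
demo_ioc_size(unsigned long cmd)
{
	return _IOC_SIZE(cmd);
}

/* Sequence number encoded in a command number. */
unsigned long
demo_ioc_nr(unsigned long cmd)
{
	return _IOC_NR(cmd);
}
```

[One consequence is that renaming a structure without changing its
 layout, as this patch does, leaves the command number unchanged for
 existing binaries.]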
 fs/xfs/xfs_fs.h      |  8 ++++----
 fs/xfs/xfs_fsops.c   |  2 +-
 fs/xfs/xfs_fsops.h   |  3 ++-
 fs/xfs/xfs_ioctl.c   | 12 ++++++------
 fs/xfs/xfs_ioctl32.c |  4 ++--
 5 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/fs/xfs/xfs_fs.h b/fs/xfs/xfs_fs.h
index e845d75..fd45cbe 100644
--- a/fs/xfs/xfs_fs.h
+++ b/fs/xfs/xfs_fs.h
@@ -180,9 +180,9 @@ typedef struct xfs_fsop_geom_v1 {
 } xfs_fsop_geom_v1_t;
 
 /*
- * Output for XFS_IOC_FSGEOMETRY
+ * Output for XFS_IOC_FSGEOMETRY_V2
  */
-typedef struct xfs_fsop_geom {
+typedef struct xfs_fsop_geom_v2 {
 	__u32		blocksize;	/* filesystem (data) block size */
 	__u32		rtextsize;	/* realtime extent size		*/
 	__u32		agblocks;	/* fsblocks in an AG		*/
@@ -204,7 +204,7 @@ typedef struct xfs_fsop_geom {
 	__u32		rtsectsize;	/* realtime sector size, bytes	*/
 	__u32		dirblocksize;	/* directory block size, bytes	*/
 	__u32		logsunit;	/* log stripe unit, bytes */
-} xfs_fsop_geom_t;
+} xfs_fsop_geom_v2_t;
 
 /* Output for XFS_FS_COUNTS */
 typedef struct xfs_fsop_counts {
@@ -555,7 +555,7 @@ typedef struct xfs_swapext
 #define XFS_IOC_FSSETDM_BY_HANDLE    _IOW ('X', 121, struct xfs_fsop_setdm_handlereq)
 #define XFS_IOC_ATTRLIST_BY_HANDLE   _IOW ('X', 122, struct xfs_fsop_attrlist_handlereq)
 #define XFS_IOC_ATTRMULTI_BY_HANDLE  _IOW ('X', 123, struct xfs_fsop_attrmulti_handlereq)
-#define XFS_IOC_FSGEOMETRY	     _IOR ('X', 124, struct xfs_fsop_geom)
+#define XFS_IOC_FSGEOMETRY_V2	     _IOR ('X', 124, struct xfs_fsop_geom_v2)
 #define XFS_IOC_GOINGDOWN	     _IOR ('X', 125, __uint32_t)
 /*	XFS_IOC_GETFSUUID ---------- deprecated 140	 */
 
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 1a83eef..b69468e 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -50,7 +50,7 @@
 int
 xfs_fs_geometry(
 	xfs_mount_t		*mp,
-	xfs_fsop_geom_t		*geo,
+	xfs_fsop_geom_v2_t	*geo,
 	int			new_version)
 {
 
diff --git a/fs/xfs/xfs_fsops.h b/fs/xfs/xfs_fsops.h
index 1b6a98b..26e7343 100644
--- a/fs/xfs/xfs_fsops.h
+++ b/fs/xfs/xfs_fsops.h
@@ -18,7 +18,8 @@
 #ifndef __XFS_FSOPS_H__
 #define	__XFS_FSOPS_H__
 
-extern int xfs_fs_geometry(xfs_mount_t *mp, xfs_fsop_geom_t *geo, int nversion);
+extern int xfs_fs_geometry(xfs_mount_t *mp, xfs_fsop_geom_v2_t *geo,
+		int nversion);
 extern int xfs_growfs_data(xfs_mount_t *mp, xfs_growfs_data_t *in);
 extern int xfs_growfs_log(xfs_mount_t *mp, xfs_growfs_log_t *in);
 extern int xfs_fs_counts(xfs_mount_t *mp, xfs_fsop_counts_t *cnt);
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 3799695..4393405 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -823,7 +823,7 @@ xfs_ioc_fsgeometry_v1(
 	xfs_mount_t		*mp,
 	void			__user *arg)
 {
-	xfs_fsop_geom_t         fsgeo;
+	xfs_fsop_geom_v2_t	fsgeo;
 	int			error;
 
 	error = xfs_fs_geometry(mp, &fsgeo, 3);
@@ -841,18 +841,18 @@ xfs_ioc_fsgeometry_v1(
 }
 
 STATIC int
-xfs_ioc_fsgeometry(
+xfs_ioc_fsgeometry_v2(
 	xfs_mount_t		*mp,
 	void			__user *arg)
 {
-	xfs_fsop_geom_t		fsgeo;
+	xfs_fsop_geom_v2_t	fsgeo;
 	int			error;
 
 	error = xfs_fs_geometry(mp, &fsgeo, 4);
 	if (error)
 		return error;
 
-	if (copy_to_user(arg, &fsgeo, sizeof(fsgeo)))
+	if (copy_to_user(arg, &fsgeo, sizeof(xfs_fsop_geom_v2_t)))
 		return -EFAULT;
 	return 0;
 }
@@ -1564,8 +1564,8 @@ xfs_file_ioctl(
 	case XFS_IOC_FSGEOMETRY_V1:
 		return xfs_ioc_fsgeometry_v1(mp, arg);
 
-	case XFS_IOC_FSGEOMETRY:
-		return xfs_ioc_fsgeometry(mp, arg);
+	case XFS_IOC_FSGEOMETRY_V2:
+		return xfs_ioc_fsgeometry_v2(mp, arg);
 
 	case XFS_IOC_GETVERSION:
 		return put_user(inode->i_generation, (int __user *)arg);
diff --git a/fs/xfs/xfs_ioctl32.c b/fs/xfs/xfs_ioctl32.c
index a554646..207b224 100644
--- a/fs/xfs/xfs_ioctl32.c
+++ b/fs/xfs/xfs_ioctl32.c
@@ -64,7 +64,7 @@ xfs_compat_ioc_fsgeometry_v1(
 	struct xfs_mount	  *mp,
 	compat_xfs_fsop_geom_v1_t __user *arg32)
 {
-	xfs_fsop_geom_t		  fsgeo;
+	xfs_fsop_geom_v2_t	  fsgeo;
 	int			  error;
 
 	error = xfs_fs_geometry(mp, &fsgeo, 3);
@@ -543,7 +543,7 @@ xfs_file_compat_ioctl(
 	switch (cmd) {
 	/* No size or alignment issues on any arch */
 	case XFS_IOC_DIOINFO:
-	case XFS_IOC_FSGEOMETRY:
+	case XFS_IOC_FSGEOMETRY_V2:
 	case XFS_IOC_FSGETXATTR:
 	case XFS_IOC_FSSETXATTR:
 	case XFS_IOC_FSGETXATTRA:
-- 
1.7.12.4


* [PATCH 15/16] xfs: xfs_fs_geometry returns a number of bytes to copy
  2014-10-03 21:47 [RFC v3] Unicode/UTF-8 support for XFS Ben Myers
                   ` (13 preceding siblings ...)
  2014-10-03 22:04 ` [PATCH 14/16] xfs: rename XFS_IOC_FSGEOM to XFS_IOC_FSGEOM_V2 Ben Myers
@ 2014-10-03 22:05 ` Ben Myers
  2014-10-06 20:41     ` Dave Chinner
  2014-10-03 22:05 ` [PATCH 16/16] xfs: add versioned fsgeom ioctl with utf8version field Ben Myers
                   ` (19 subsequent siblings)
  34 siblings, 1 reply; 63+ messages in thread
From: Ben Myers @ 2014-10-03 22:05 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: olaf, xfs

From: Ben Myers <bpm@sgi.com>

The versioned xfs_fsop_geom_t will be of variable size.  Make
xfs_fs_geometry return the number of bytes to copy out to userspace for
a given version of the structure.

Signed-off-by: Ben Myers <bpm@sgi.com>
---
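Note for reviewers: the calling convention described above can be
sketched in plain userspace C.  This is only an illustration under stated
assumptions, not the kernel code: struct geom_v2, struct geom_v5,
fill_geometry() and the constant values are hypothetical, cut-down
stand-ins for the xfs_fsop_geom structures and xfs_fs_geometry().  The
caller zeroes the buffer, and the helper reports how many leading bytes
are meaningful for the requested version.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, cut-down stand-ins for the real geometry structures. */
struct geom_v2 {
	uint32_t blocksize;
	int32_t  version;
	uint32_t logsunit;
};

struct geom_v5 {
	uint32_t blocksize;
	int32_t  version;
	uint32_t logsunit;
	uint32_t utf8version;	/* only filled in from version 5 on */
};

/*
 * Fill *geo for the requested version.  If bytes is non-NULL, store the
 * number of leading bytes of *geo that are valid for that version.
 * The caller is expected to have zeroed *geo beforehand.
 */
int fill_geometry(struct geom_v5 *geo, int new_version, size_t *bytes)
{
	size_t len = sizeof(struct geom_v2);

	assert(geo != NULL);

	geo->blocksize = 4096;		/* illustrative value */
	geo->logsunit = 0;
	geo->version = 0;
	if (new_version >= 5) {
		geo->version = 5;
		geo->utf8version = 7u << 16;	/* illustrative value */
		len = offsetof(struct geom_v5, utf8version) +
		      sizeof(geo->utf8version);
	}
	if (bytes)
		*bytes = len;
	return 0;
}
```

A caller that does not care about the length passes NULL for bytes, as
the v1 and v2 ioctl paths in this patch do.  A versioned caller passes a
size_t pointer and copies out only that many bytes.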
 fs/xfs/xfs_fsops.c   | 6 ++----
 fs/xfs/xfs_fsops.h   | 2 +-
 fs/xfs/xfs_ioctl.c   | 6 ++++--
 fs/xfs/xfs_ioctl32.c | 3 ++-
 4 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index b69468e..cf87e16 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -51,11 +51,9 @@ int
 xfs_fs_geometry(
 	xfs_mount_t		*mp,
 	xfs_fsop_geom_v2_t	*geo,
-	int			new_version)
+	int			new_version,
+	size_t			*bytes)
 {
-
-	memset(geo, 0, sizeof(*geo));
-
 	geo->blocksize = mp->m_sb.sb_blocksize;
 	geo->rtextsize = mp->m_sb.sb_rextsize;
 	geo->agblocks = mp->m_sb.sb_agblocks;
diff --git a/fs/xfs/xfs_fsops.h b/fs/xfs/xfs_fsops.h
index 26e7343..74e1fee 100644
--- a/fs/xfs/xfs_fsops.h
+++ b/fs/xfs/xfs_fsops.h
@@ -19,7 +19,7 @@
 #define	__XFS_FSOPS_H__
 
 extern int xfs_fs_geometry(xfs_mount_t *mp, xfs_fsop_geom_v2_t *geo,
-		int nversion);
+		int new_version, size_t *bytes);
 extern int xfs_growfs_data(xfs_mount_t *mp, xfs_growfs_data_t *in);
 extern int xfs_growfs_log(xfs_mount_t *mp, xfs_growfs_log_t *in);
 extern int xfs_fs_counts(xfs_mount_t *mp, xfs_fsop_counts_t *cnt);
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 4393405..1657ce5 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -826,7 +826,8 @@ xfs_ioc_fsgeometry_v1(
 	xfs_fsop_geom_v2_t	fsgeo;
 	int			error;
 
-	error = xfs_fs_geometry(mp, &fsgeo, 3);
+	memset(&fsgeo, 0, sizeof(fsgeo));
+	error = xfs_fs_geometry(mp, &fsgeo, 3, NULL);
 	if (error)
 		return error;
 
@@ -848,7 +849,8 @@ xfs_ioc_fsgeometry_v2(
 	xfs_fsop_geom_v2_t	fsgeo;
 	int			error;
 
-	error = xfs_fs_geometry(mp, &fsgeo, 4);
+	memset(&fsgeo, 0, sizeof(fsgeo));
+	error = xfs_fs_geometry(mp, &fsgeo, 4, NULL);
 	if (error)
 		return error;
 
diff --git a/fs/xfs/xfs_ioctl32.c b/fs/xfs/xfs_ioctl32.c
index 207b224..aca988a 100644
--- a/fs/xfs/xfs_ioctl32.c
+++ b/fs/xfs/xfs_ioctl32.c
@@ -67,7 +67,8 @@ xfs_compat_ioc_fsgeometry_v1(
 	xfs_fsop_geom_v2_t	  fsgeo;
 	int			  error;
 
-	error = xfs_fs_geometry(mp, &fsgeo, 3);
+	memset(&fsgeo, 0, sizeof(fsgeo));
+	error = xfs_fs_geometry(mp, &fsgeo, 3, NULL);
 	if (error)
 		return error;
 	/* The 32-bit variant simply has some padding at the end */
-- 
1.7.12.4

* [PATCH 16/16] xfs: add versioned fsgeom ioctl with utf8version field
  2014-10-03 21:47 [RFC v3] Unicode/UTF-8 support for XFS Ben Myers
                   ` (14 preceding siblings ...)
  2014-10-03 22:05 ` [PATCH 15/16] xfs: xfs_fs_geometry returns a number of bytes to copy Ben Myers
@ 2014-10-03 22:05 ` Ben Myers
  2014-10-06 21:13     ` Dave Chinner
  2014-10-03 22:06 ` [PATCH 17/35] xfsprogs: add unicode character database files Ben Myers
                   ` (18 subsequent siblings)
  34 siblings, 1 reply; 63+ messages in thread
From: Ben Myers @ 2014-10-03 22:05 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: olaf, xfs

From: Ben Myers <bpm@sgi.com>

This adds a utf8version field to the xfs_fsop_geom structure.  An
important characteristic of this version of the ioctl is that
fsgeo.version needs to be set by the caller to specify which version of
the structure to return.

Signed-off-by: Ben Myers <bpm@sgi.com>
---
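Note for reviewers: the "caller sets the version first" contract can be
sketched in userspace C as follows.  This is a simulation under stated
assumptions, not the real ioctl: struct geom, geometry_request() and
query_utf8version() are hypothetical names, GEOM_VERSION5 mirrors
XFS_FSOP_GEOM_VERSION5, and the returned values are made up.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define GEOM_VERSION5	5	/* mirrors XFS_FSOP_GEOM_VERSION5 */

/* Hypothetical cut-down geometry structure. */
struct geom {
	uint32_t blocksize;
	int32_t  version;	/* in: requested version, out: version returned */
	uint32_t utf8version;
};

/*
 * Stand-in for the ioctl handler: read the requested version from the
 * caller's buffer, reject anything older than version 5, then overwrite
 * the buffer with the reply.
 */
int geometry_request(struct geom *arg)
{
	struct geom reply;
	int version;

	assert(arg != NULL);
	version = arg->version;

	if (version < GEOM_VERSION5)
		return -EINVAL;

	memset(&reply, 0, sizeof(reply));
	reply.blocksize = 4096;		/* illustrative value */
	reply.version = GEOM_VERSION5;
	reply.utf8version = 7u << 16;	/* illustrative value */
	*arg = reply;
	return 0;
}

/* Caller side: the version field must be set before the call. */
int query_utf8version(uint32_t *utf8version)
{
	struct geom geo;
	int error;

	memset(&geo, 0, sizeof(geo));
	geo.version = GEOM_VERSION5;
	error = geometry_request(&geo);
	if (error)
		return error;
	*utf8version = geo.utf8version;
	return 0;
}
```

In this sketch a zeroed buffer carries version 0, so a caller that
forgets to set the field gets -EINVAL instead of a structure it did not
ask for.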
 fs/xfs/xfs_fs.h    | 31 +++++++++++++++++++++++++++++++
 fs/xfs/xfs_fsops.c | 13 ++++++++++++-
 fs/xfs/xfs_fsops.h |  2 +-
 fs/xfs/xfs_ioctl.c | 31 +++++++++++++++++++++++++++++++
 4 files changed, 75 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_fs.h b/fs/xfs/xfs_fs.h
index fd45cbe..2f4d430 100644
--- a/fs/xfs/xfs_fs.h
+++ b/fs/xfs/xfs_fs.h
@@ -206,6 +206,34 @@ typedef struct xfs_fsop_geom_v2 {
 	__u32		logsunit;	/* log stripe unit, bytes */
 } xfs_fsop_geom_v2_t;
 
+/*
+ * Output for XFS_IOC_FSGEOMETRY
+ */
+typedef struct xfs_fsop_geom {
+	__u32		blocksize;	/* filesystem (data) block size */
+	__u32		rtextsize;	/* realtime extent size		*/
+	__u32		agblocks;	/* fsblocks in an AG		*/
+	__u32		agcount;	/* number of allocation groups	*/
+	__u32		logblocks;	/* fsblocks in the log		*/
+	__u32		sectsize;	/* (data) sector size, bytes	*/
+	__u32		inodesize;	/* inode size in bytes		*/
+	__u32		imaxpct;	/* max allowed inode space(%)	*/
+	__u64		datablocks;	/* fsblocks in data subvolume	*/
+	__u64		rtblocks;	/* fsblocks in realtime subvol	*/
+	__u64		rtextents;	/* rt extents in realtime subvol*/
+	__u64		logstart;	/* starting fsblock of the log	*/
+	unsigned char	uuid[16];	/* unique id of the filesystem	*/
+	__u32		sunit;		/* stripe unit, fsblocks	*/
+	__u32		swidth;		/* stripe width, fsblocks	*/
+	__s32		version;	/* structure version		*/
+	__u32		flags;		/* superblock version flags	*/
+	__u32		logsectsize;	/* log sector size, bytes	*/
+	__u32		rtsectsize;	/* realtime sector size, bytes	*/
+	__u32		dirblocksize;	/* directory block size, bytes	*/
+	__u32		logsunit;	/* log stripe unit, bytes */
+	__u32		utf8version;	/* Unicode version		*/
+} xfs_fsop_geom_t;
+
 /* Output for XFS_FS_COUNTS */
 typedef struct xfs_fsop_counts {
 	__u64	freedata;	/* free data section blocks */
@@ -221,6 +249,8 @@ typedef struct xfs_fsop_resblks {
 } xfs_fsop_resblks_t;
 
 #define XFS_FSOP_GEOM_VERSION	0
+/* skipped 1-4 to match existing new_version xfs_fs_geometry argument */
+#define XFS_FSOP_GEOM_VERSION5	5
 
 #define XFS_FSOP_GEOM_FLAGS_ATTR	0x0001	/* attributes in use	*/
 #define XFS_FSOP_GEOM_FLAGS_NLINK	0x0002	/* 32-bit nlink values	*/
@@ -557,6 +587,7 @@ typedef struct xfs_swapext
 #define XFS_IOC_ATTRMULTI_BY_HANDLE  _IOW ('X', 123, struct xfs_fsop_attrmulti_handlereq)
 #define XFS_IOC_FSGEOMETRY_V2	     _IOR ('X', 124, struct xfs_fsop_geom_v2)
 #define XFS_IOC_GOINGDOWN	     _IOR ('X', 125, __uint32_t)
+#define XFS_IOC_FSGEOMETRY	     _IOR ('X', 126, struct xfs_fsop_geom)
 /*	XFS_IOC_GETFSUUID ---------- deprecated 140	 */
 
 
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index cf87e16..d70acf8 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -50,10 +50,12 @@
 int
 xfs_fs_geometry(
 	xfs_mount_t		*mp,
-	xfs_fsop_geom_v2_t	*geo,
+	void			*buffer,
 	int			new_version,
 	size_t			*bytes)
 {
+	xfs_fsop_geom_t		*geo = (xfs_fsop_geom_t *)buffer;
+
 	geo->blocksize = mp->m_sb.sb_blocksize;
 	geo->rtextsize = mp->m_sb.sb_rextsize;
 	geo->agblocks = mp->m_sb.sb_agblocks;
@@ -115,6 +117,15 @@ xfs_fs_geometry(
 				XFS_FSOP_GEOM_FLAGS_LOGV2 : 0);
 		geo->logsunit = mp->m_sb.sb_logsunit;
 	}
+	if (new_version >= XFS_FSOP_GEOM_VERSION5) {
+		geo->version = XFS_FSOP_GEOM_VERSION5;
+		geo->flags |= (xfs_sb_version_hasutf8(&mp->m_sb) ?
+				XFS_FSOP_GEOM_FLAGS_UTF8 : 0);
+		geo->utf8version = mp->m_sb.sb_utf8version;
+		if (bytes)
+			*bytes = sizeof(xfs_fsop_geom_v2_t) +
+				 sizeof(geo->utf8version);
+	}
 	return 0;
 }
 
diff --git a/fs/xfs/xfs_fsops.h b/fs/xfs/xfs_fsops.h
index 74e1fee..b723f36 100644
--- a/fs/xfs/xfs_fsops.h
+++ b/fs/xfs/xfs_fsops.h
@@ -18,7 +18,7 @@
 #ifndef __XFS_FSOPS_H__
 #define	__XFS_FSOPS_H__
 
-extern int xfs_fs_geometry(xfs_mount_t *mp, xfs_fsop_geom_v2_t *geo,
+extern int xfs_fs_geometry(xfs_mount_t *mp, void *buffer,
 		int new_version, size_t *bytes);
 extern int xfs_growfs_data(xfs_mount_t *mp, xfs_growfs_data_t *in);
 extern int xfs_growfs_log(xfs_mount_t *mp, xfs_growfs_log_t *in);
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 1657ce5..6308680 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -859,6 +859,34 @@ xfs_ioc_fsgeometry_v2(
 	return 0;
 }
 
+STATIC int
+xfs_ioc_fsgeometry(
+	struct xfs_mount	*mp,
+	void			__user *arg)
+{
+	xfs_fsop_geom_t		fsgeo;
+	int			version, error;
+	size_t			bytes;
+
+	/* offsetof(version)? XXX just get 32 bits? */
+	if (copy_from_user(&fsgeo, arg, sizeof(xfs_fsop_geom_v1_t)))
+		return -EFAULT;
+
+	version = fsgeo.version;
+
+	if (version < XFS_FSOP_GEOM_VERSION5)
+		return -EINVAL;
+
+	memset(&fsgeo, 0, sizeof(fsgeo));
+	error = xfs_fs_geometry(mp, &fsgeo, version, &bytes);
+	if (error)
+		return error;
+
+	if (copy_to_user(arg, &fsgeo, bytes))
+		return -EFAULT;
+	return 0;
+}
+
 /*
  * Linux extended inode flags interface.
  */
@@ -1569,6 +1597,9 @@ xfs_file_ioctl(
 	case XFS_IOC_FSGEOMETRY_V2:
 		return xfs_ioc_fsgeometry_v2(mp, arg);
 
+	case XFS_IOC_FSGEOMETRY:
+		return xfs_ioc_fsgeometry(mp, arg);
+
 	case XFS_IOC_GETVERSION:
 		return put_user(inode->i_generation, (int __user *)arg);
 
-- 
1.7.12.4

* [PATCH 17/35] xfsprogs: add unicode character database files
  2014-10-03 21:47 [RFC v3] Unicode/UTF-8 support for XFS Ben Myers
                   ` (15 preceding siblings ...)
  2014-10-03 22:05 ` [PATCH 16/16] xfs: add versioned fsgeom ioctl with utf8version field Ben Myers
@ 2014-10-03 22:06 ` Ben Myers
  2014-10-03 22:07 ` [PATCH 18/35] xfsprogs: add trie generator for UTF-8 Ben Myers
                   ` (17 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Ben Myers @ 2014-10-03 22:06 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: olaf, xfs

From: Olaf Weber <olaf@sgi.com>

Add files from the Unicode Character Database, version 7.0.0, to the source.
A helper program that generates a trie used for normalization from these
files is part of a separate commit.

Signed-off-by: Olaf Weber <olaf@sgi.com>

[v2: moved from support to utf8norm/ucd -bpm]
---
 utf8norm/ucd/README | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)
 create mode 100644 utf8norm/ucd/README

diff --git a/utf8norm/ucd/README b/utf8norm/ucd/README
new file mode 100644
index 0000000..d713e66
--- /dev/null
+++ b/utf8norm/ucd/README
@@ -0,0 +1,33 @@
+The files in this directory are part of the Unicode Character Database
+for version 7.0.0 of the Unicode standard.
+
+The full set of files can be found here:
+
+  http://www.unicode.org/Public/7.0.0/ucd/
+
+The latest released version of the UCD can be found here:
+
+  http://www.unicode.org/Public/UCD/latest/
+
+The files in this directory are identical, except that they have been
+renamed with a suffix indicating the unicode version.
+
+Individual source links:
+
+  http://www.unicode.org/Public/7.0.0/ucd/CaseFolding.txt
+  http://www.unicode.org/Public/7.0.0/ucd/DerivedAge.txt
+  http://www.unicode.org/Public/7.0.0/ucd/extracted/DerivedCombiningClass.txt
+  http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt
+  http://www.unicode.org/Public/7.0.0/ucd/NormalizationCorrections.txt
+  http://www.unicode.org/Public/7.0.0/ucd/NormalizationTest.txt
+  http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt
+
+md5sums
+
+  9a92b2bfe56c6719def926bab524fefd  CaseFolding-7.0.0.txt
+  07b8b1027eb824cf0835314e94f23d2e  DerivedAge-7.0.0.txt
+  90c3340b16821e2f2153acdbe6fc6180  DerivedCombiningClass-7.0.0.txt
+  c41c0601f808116f623de47110ed4f93  DerivedCoreProperties-7.0.0.txt
+  522720ddfc150d8e63a2518634829bce  NormalizationCorrections-7.0.0.txt
+  1f35175eba4a2ad795db489f789ae352  NormalizationTest-7.0.0.txt
+  c8355655731d75e6a3de8c20d7e601ba  UnicodeData-7.0.0.txt
-- 
1.7.12.4

* [PATCH 18/35] xfsprogs: add trie generator for UTF-8.
  2014-10-03 21:47 [RFC v3] Unicode/UTF-8 support for XFS Ben Myers
                   ` (16 preceding siblings ...)
  2014-10-03 22:06 ` [PATCH 17/35] xfsprogs: add unicode character database files Ben Myers
@ 2014-10-03 22:07 ` Ben Myers
  2014-10-03 22:07 ` [PATCH 19/35] xfsprogs: add supporting code " Ben Myers
                   ` (16 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Ben Myers @ 2014-10-03 22:07 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: olaf, xfs

From: Olaf Weber <olaf@sgi.com>

mkutf8data.c is the source for a program that generates utf8data.h, which
contains the trie that utf8norm.c uses. The trie is generated from the
Unicode 7.0.0 data files. The format of the utf8data[] table is described
in utf8norm.c, which is added in the next patch.

Signed-off-by: Olaf Weber <olaf@sgi.com>

[v2: moved to utf8norm directory.  -bpm]
---
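Note for reviewers: the trie is keyed on the UTF-8 byte sequence of each
code point.  The encoding step can be sketched as below.  This is a
standalone illustration, not the code in mkutf8data.c: utf8_encode() and
utf8_decode() are hypothetical names, the surrogate check is an addition
of this sketch, and the decoder assumes well-formed input.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Encode a code point as UTF-8.  Returns the number of bytes written
 * (1 to 4), or 0 if cp is a surrogate or lies above 0x10FFFF.
 */
size_t utf8_encode(unsigned int cp, unsigned char out[4])
{
	assert(out != NULL);

	if (cp >= 0xD800 && cp <= 0xDFFF)
		return 0;	/* surrogates are never encoded */
	if (cp < 0x80) {
		out[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {
		out[0] = (unsigned char)(0xC0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp < 0x10000) {
		out[0] = (unsigned char)(0xE0 | (cp >> 12));
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	}
	if (cp < 0x110000) {
		out[0] = (unsigned char)(0xF0 | (cp >> 18));
		out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
		out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[3] = (unsigned char)(0x80 | (cp & 0x3F));
		return 4;
	}
	return 0;
}

/* Decode one well-formed UTF-8 sequence back into a code point. */
unsigned int utf8_decode(const unsigned char *s)
{
	if (s[0] < 0x80)
		return s[0];
	if (s[0] < 0xE0)
		return ((s[0] & 0x1Fu) << 6) | (s[1] & 0x3Fu);
	if (s[0] < 0xF0)
		return ((s[0] & 0x0Fu) << 12) | ((s[1] & 0x3Fu) << 6) |
		       (s[2] & 0x3Fu);
	return ((s[0] & 0x07u) << 18) | ((s[1] & 0x3Fu) << 12) |
	       ((s[2] & 0x3Fu) << 6) | (s[3] & 0x3Fu);
}
```

For example, U+20AC encodes to the three bytes 0xE2 0x82 0xAC, and
U+10FFFF to 0xF4 0x8F 0xBF 0xBF, the upper bound of the four-byte range.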
 Makefile              |    2 +-
 utf8norm/Makefile     |   24 +
 utf8norm/mkutf8data.c | 3239 +++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 3264 insertions(+), 1 deletion(-)
 create mode 100644 utf8norm/Makefile
 create mode 100644 utf8norm/mkutf8data.c

diff --git a/Makefile b/Makefile
index f56aebd..74778b5 100644
--- a/Makefile
+++ b/Makefile
@@ -40,7 +40,7 @@ LDIRDIRT = $(SRCDIR)
 LDIRT += $(SRCTAR)
 endif
 
-LIB_SUBDIRS = libxfs libxlog libxcmd libhandle libdisk
+LIB_SUBDIRS = utf8norm libxfs libxlog libxcmd libhandle libdisk
 TOOL_SUBDIRS = copy db estimate fsck fsr growfs io logprint mkfs quota \
 		mdrestore repair rtcp m4 man doc po debian
 
diff --git a/utf8norm/Makefile b/utf8norm/Makefile
new file mode 100644
index 0000000..a32660e
--- /dev/null
+++ b/utf8norm/Makefile
@@ -0,0 +1,24 @@
+#
+# Copyright (c) 2014 SGI. All Rights Reserved.
+#
+
+TOPDIR = ..
+include $(TOPDIR)/include/builddefs
+
+default = ../include/utf8data.h
+
+../include/utf8data.h: mkutf8data.c
+	cc -o mkutf8data mkutf8data.c
+	cd ucd; ../mkutf8data
+	mv ucd/utf8data.h ../include
+
+default clean:
+	rm -f mkutf8data ../include/utf8data.h
+
+default install:
+
+default install-dev:
+
+default install-qa:
+
+-include .ltdep
diff --git a/utf8norm/mkutf8data.c b/utf8norm/mkutf8data.c
new file mode 100644
index 0000000..1d6ec02
--- /dev/null
+++ b/utf8norm/mkutf8data.c
@@ -0,0 +1,3239 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+/* Generator for a compact trie for unicode normalization */
+
+#include <sys/types.h>
+#include <stddef.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <assert.h>
+#include <string.h>
+#include <unistd.h>
+#include <errno.h>
+
+/* Default names of the in- and output files. */
+
+#define AGE_NAME	"DerivedAge.txt"
+#define CCC_NAME	"DerivedCombiningClass.txt"
+#define PROP_NAME	"DerivedCoreProperties.txt"
+#define DATA_NAME	"UnicodeData.txt"
+#define FOLD_NAME	"CaseFolding.txt"
+#define NORM_NAME	"NormalizationCorrections.txt"
+#define TEST_NAME	"NormalizationTest.txt"
+#define UTF8_NAME	"utf8data.h"
+
+const char	*age_name  = AGE_NAME;
+const char	*ccc_name  = CCC_NAME;
+const char	*prop_name = PROP_NAME;
+const char	*data_name = DATA_NAME;
+const char	*fold_name = FOLD_NAME;
+const char	*norm_name = NORM_NAME;
+const char	*test_name = TEST_NAME;
+const char	*utf8_name = UTF8_NAME;
+
+int verbose = 0;
+
+/* An arbitrary line size limit on input lines. */
+
+#define LINESIZE	1024
+char line[LINESIZE];
+char buf0[LINESIZE];
+char buf1[LINESIZE];
+char buf2[LINESIZE];
+char buf3[LINESIZE];
+
+const char *argv0;
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * Unicode version numbers consist of three parts: major, minor, and a
+ * revision.  These numbers are packed into an unsigned int to obtain
+ * a single version number.
+ *
+ * To save space in the generated trie, the unicode version is not
+ * stored directly, instead we calculate a generation number from the
+ * unicode versions seen in the DerivedAge file, and use that as an
+ * index into a table of unicode versions.
+ */
+#define UNICODE_MAJ_SHIFT		(16)
+#define UNICODE_MIN_SHIFT		(8)
+
+#define UNICODE_MAJ_MAX			((unsigned short)-1)
+#define UNICODE_MIN_MAX			((unsigned char)-1)
+#define UNICODE_REV_MAX			((unsigned char)-1)
+
+#define UNICODE_AGE(MAJ,MIN,REV)			\
+	(((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) |	\
+	 ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) |	\
+	 ((unsigned int)(REV)))
+
+unsigned int *ages;
+int ages_count;
+
+unsigned int unicode_maxage;
+
+static int
+age_valid(unsigned int major, unsigned int minor, unsigned int revision)
+{
+	if (major > UNICODE_MAJ_MAX)
+		return 0;
+	if (minor > UNICODE_MIN_MAX)
+		return 0;
+	if (revision > UNICODE_REV_MAX)
+		return 0;
+	return 1;
+}
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * utf8trie_t
+ *
+ * A compact binary tree, used to decode UTF-8 characters.
+ *
+ * Internal nodes are one byte for the node itself, and up to three
+ * bytes for an offset into the tree.  The first byte contains the
+ * following information:
+ *  NEXTBYTE  - flag        - advance to next byte if set
+ *  BITNUM    - 3 bit field - the bit number to be tested
+ *  OFFLEN    - 2 bit field - number of bytes in the offset
+ * if offlen == 0 (non-branching node)
+ *  RIGHTPATH - 1 bit field - set if the following node is for the
+ *                            right-hand path (tested bit is set)
+ *  TRIENODE  - 1 bit field - set if the following node is an internal
+ *                            node, otherwise it is a leaf node
+ * if offlen != 0 (branching node)
+ *  LEFTNODE  - 1 bit field - set if the left-hand node is internal
+ *  RIGHTNODE - 1 bit field - set if the right-hand node is internal
+ *
+ * Due to the way utf8 works, there cannot be branching nodes with
+ * NEXTBYTE set, and moreover those nodes always have a righthand
+ * descendant.
+ */
+typedef unsigned char utf8trie_t;
+#define BITNUM		0x07
+#define NEXTBYTE	0x08
+#define OFFLEN		0x30
+#define OFFLEN_SHIFT	4
+#define RIGHTPATH	0x40
+#define TRIENODE	0x80
+#define RIGHTNODE	0x40
+#define LEFTNODE	0x80
+
+/*
+ * utf8leaf_t
+ *
+ * The leaves of the trie are embedded in the trie, and so use the
+ * same underlying datatype, unsigned char.
+ *
+ * leaf[0]: The unicode version, stored as a generation number that is
+ *          an index into utf8agetab[].  With this we can filter code
+ *          points based on the unicode version in which they were
+ *          defined.  The CCC of a non-defined code point is 0.
+ * leaf[1]: Canonical Combining Class. During normalization, we need
+ *          to do a stable sort into ascending order of all characters
+ *          with a non-zero CCC that occur between two characters with
+ *          a CCC of 0, or at the beginning or end of a string.
+ *          The unicode standard guarantees that all CCC values are
+ *          between 0 and 254 inclusive, which leaves 255 available as
+ *          a special value.
+ *          Code points with CCC 0 are known as stoppers.
+ * leaf[2]: Decomposition. If leaf[1] == 255, then leaf[2] is the
+ *          start of a NUL-terminated string that is the decomposition
+ *          of the character.
+ *          The CCC of a decomposable character is the same as the CCC
+ *          of the first character of its decomposition.
+ *          Some characters decompose as the empty string: these are
+ *          characters with the Default_Ignorable_Code_Point property.
+ *          These do affect normalization, as they all have CCC 0.
+ *
+ * The decompositions in the trie have been fully expanded.
+ *
+ * Casefolding, if applicable, is also done using decompositions.
+ */
+typedef unsigned char utf8leaf_t;
+
+#define LEAF_GEN(LEAF)	((LEAF)[0])
+#define LEAF_CCC(LEAF)	((LEAF)[1])
+#define LEAF_STR(LEAF)	((const char*)((LEAF) + 2))
+
+#define MAXGEN		(255)
+
+#define MINCCC		(0)
+#define MAXCCC		(254)
+#define STOPPER		(0)
+#define	DECOMPOSE	(255)
+
+struct tree;
+static utf8leaf_t *utf8nlookup(struct tree *, const char *, size_t);
+static utf8leaf_t *utf8lookup(struct tree *, const char *);
+
+unsigned char *utf8data;
+size_t utf8data_size;
+
+utf8trie_t *nfkdi;
+utf8trie_t *nfkdicf;
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * UTF8 valid ranges.
+ *
+ * The UTF-8 encoding spreads the bits of a 32bit word over several
+ * bytes. This table gives the ranges that can be held and how they'd
+ * be represented.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000000 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000000 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * There is an additional requirement on UTF-8, in that only the
+ * shortest representation of a 32bit value is to be used.  A decoder
+ * must not decode sequences that do not satisfy this requirement.
+ * Thus the allowed ranges have a lower bound.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000080 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000800 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00010000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00200000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x04000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * Actual unicode characters are limited to the range 0x0 - 0x10FFFF,
+ * 17 planes of 65536 values.  This limits the sequences actually seen
+ * even more, to just the following.
+ *
+ *          0 -     0x7f: 0                     0x7f
+ *       0x80 -    0x7ff: 0xc2 0x80             0xdf 0xbf
+ *      0x800 -   0xffff: 0xe0 0xa0 0x80        0xef 0xbf 0xbf
+ *    0x10000 - 0x10ffff: 0xf0 0x90 0x80 0x80   0xf4 0x8f 0xbf 0xbf
+ *
+ * Even within those ranges not all values are allowed: the surrogates
+ * 0xd800 - 0xdfff should never be seen.
+ *
+ * Note that the longest sequence seen with valid usage is 4 bytes,
+ * the same as a single UTF-32 character.  This makes the UTF-8
+ * representation of Unicode never larger than UTF-32.
+ *
+ * The shortest sequence requirement was introduced by:
+ *    Corrigendum #1: UTF-8 Shortest Form
+ * It can be found here:
+ *    http://www.unicode.org/versions/corrigendum1.html
+ *
+ */
+
+#define UTF8_2_BITS     0xC0
+#define UTF8_3_BITS     0xE0
+#define UTF8_4_BITS     0xF0
+#define UTF8_N_BITS     0x80
+#define UTF8_2_MASK     0xE0
+#define UTF8_3_MASK     0xF0
+#define UTF8_4_MASK     0xF8
+#define UTF8_N_MASK     0xC0
+#define UTF8_V_MASK     0x3F
+#define UTF8_V_SHIFT    6
+
+static int
+utf8key(unsigned int key, char keyval[])
+{
+	int keylen;
+
+	if (key < 0x80) {
+		keyval[0] = key;
+		keylen = 1;
+	} else if (key < 0x800) {
+		keyval[1] = key & UTF8_V_MASK;
+		keyval[1] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[0] = key;
+		keyval[0] |= UTF8_2_BITS;
+		keylen = 2;
+	} else if (key < 0x10000) {
+		keyval[2] = key & UTF8_V_MASK;
+		keyval[2] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[1] = key & UTF8_V_MASK;
+		keyval[1] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[0] = key;
+		keyval[0] |= UTF8_3_BITS;
+		keylen = 3;
+	} else if (key < 0x110000) {
+		keyval[3] = key & UTF8_V_MASK;
+		keyval[3] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[2] = key & UTF8_V_MASK;
+		keyval[2] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[1] = key & UTF8_V_MASK;
+		keyval[1] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[0] = key;
+		keyval[0] |= UTF8_4_BITS;
+		keylen = 4;
+	} else {
+		printf("%#x: illegal key\n", key);
+		keylen = 0;
+	}
+	return keylen;
+}
+
+static unsigned int
+utf8code(const char *str)
+{
+	const unsigned char *s = (const unsigned char*)str;
+	unsigned int unichar = 0;
+
+	if (*s < 0x80) {
+		unichar = *s;
+	} else if (*s < UTF8_3_BITS) {
+		unichar = *s++ & 0x1F;
+		unichar <<= UTF8_V_SHIFT;
+		unichar |= *s & 0x3F;
+	} else if (*s < UTF8_4_BITS) {
+		unichar = *s++ & 0x0F;
+		unichar <<= UTF8_V_SHIFT;
+		unichar |= *s++ & 0x3F;
+		unichar <<= UTF8_V_SHIFT;
+		unichar |= *s & 0x3F;
+	} else {
+		unichar = *s++ & 0x0F;
+		unichar <<= UTF8_V_SHIFT;
+		unichar |= *s++ & 0x3F;
+		unichar <<= UTF8_V_SHIFT;
+		unichar |= *s++ & 0x3F;
+		unichar <<= UTF8_V_SHIFT;
+		unichar |= *s & 0x3F;
+	}
+	return unichar;
+}
+
+static int
+utf32valid(unsigned int unichar)
+{
+	return unichar < 0x110000;
+}
+
+#define NODE 1
+#define LEAF 0
+
+struct tree {
+	void *root;
+	int childnode;
+	const char *type;
+	unsigned int maxage;
+	struct tree *next;
+	int (*leaf_equal)(void *, void *);
+	void (*leaf_print)(void *, int);
+	int (*leaf_mark)(void *);
+	int (*leaf_size)(void *);
+	int *(*leaf_index)(struct tree *, void *);
+	unsigned char *(*leaf_emit)(void *, unsigned char *);
+	int leafindex[0x110000];
+	int index;
+};
+
+struct node {
+	int index;
+	int offset;
+	int mark;
+	int size;
+	struct node *parent;
+	void *left;
+	void *right;
+	unsigned char bitnum;
+	unsigned char nextbyte;
+	unsigned char leftnode;
+	unsigned char rightnode;
+	unsigned int keybits;
+	unsigned int keymask;
+};
+
+/*
+ * Example lookup function for a tree.
+ */
+static void *
+lookup(struct tree *tree, const char *key)
+{
+	struct node *node;
+	void *leaf = NULL;
+
+	node = tree->root;
+	while (!leaf && node) {
+		if (node->nextbyte)
+			key++;
+		if (*key & (1 << (node->bitnum & 7))) {
+			/* Right leg */
+			if (node->rightnode == NODE) {
+				node = node->right;
+			} else if (node->rightnode == LEAF) {
+				leaf = node->right;
+			} else {
+				node = NULL;
+			}
+		} else {
+			/* Left leg */
+			if (node->leftnode == NODE) {
+				node = node->left;
+			} else if (node->leftnode == LEAF) {
+				leaf = node->left;
+			} else {
+				node = NULL;
+			}
+		}
+	}
+
+	return leaf;
+}
+
+/*
+ * A simple non-recursive tree walker: keep track of visits to the
+ * left and right branches in the leftmask and rightmask.
+ */
+static void
+tree_walk(struct tree *tree)
+{
+	struct node *node;
+	unsigned int leftmask;
+	unsigned int rightmask;
+	unsigned int bitmask;
+	int indent = 1;
+	int nodes, singletons, leaves;
+
+	nodes = singletons = leaves = 0;
+
+	printf("%s_%x root %p\n", tree->type, tree->maxage, tree->root);
+	if (tree->childnode == LEAF) {
+		assert(tree->root);
+		tree->leaf_print(tree->root, indent);
+		leaves = 1;
+	} else {
+		assert(tree->childnode == NODE);
+		node = tree->root;
+		leftmask = rightmask = 0;
+		while (node) {
+			printf("%*snode @ %p bitnum %d nextbyte %d"
+			       " left %p right %p mask %x bits %x\n",
+				indent, "", node,
+				node->bitnum, node->nextbyte,
+				node->left, node->right,
+				node->keymask, node->keybits);
+			nodes += 1;
+			if (!(node->left && node->right))
+				singletons += 1;
+
+			while (node) {
+				bitmask = 1 << node->bitnum;
+				if ((leftmask & bitmask) == 0) {
+					leftmask |= bitmask;
+					if (node->leftnode == LEAF) {
+						assert(node->left);
+						tree->leaf_print(node->left,
+								 indent+1);
+						leaves += 1;
+					} else if (node->left) {
+						assert(node->leftnode == NODE);
+						indent += 1;
+						node = node->left;
+						break;
+					}
+				}
+				if ((rightmask & bitmask) == 0) {
+					rightmask |= bitmask;
+					if (node->rightnode == LEAF) {
+						assert(node->right);
+						tree->leaf_print(node->right,
+								 indent+1);
+						leaves += 1;
+					} else if (node->right) {
+						assert(node->rightnode==NODE);
+						indent += 1;
+						node = node->right;
+						break;
+					}
+				}
+				leftmask &= ~bitmask;
+				rightmask &= ~bitmask;
+				node = node->parent;
+				indent -= 1;
+			}
+		}
+	}
+	printf("nodes %d leaves %d singletons %d\n",
+	       nodes, leaves, singletons);
+}
+
+/*
+ * Allocate and initialize a new internal node.
+ */
+static struct node *
+alloc_node(struct node *parent)
+{
+	struct node *node;
+	int bitnum;
+
+	node = malloc(sizeof(*node));
+	node->left = node->right = NULL;
+	node->parent = parent;
+	node->leftnode = NODE;
+	node->rightnode = NODE;
+	node->keybits = 0;
+	node->keymask = 0;
+	node->mark = 0;
+	node->index = 0;
+	node->offset = -1;
+	node->size = 4;
+
+	if (node->parent) {
+		bitnum = parent->bitnum;
+		if ((bitnum & 7) == 0) {
+			node->bitnum = bitnum + 7 + 8;
+			node->nextbyte = 1;
+		} else {
+			node->bitnum = bitnum - 1;
+			node->nextbyte = 0;
+		}
+	} else {
+		node->bitnum = 7;
+		node->nextbyte = 0;
+	}
+
+	return node;
+}
+
+/*
+ * Insert a new leaf into the tree, and collapse any subtrees that are
+ * fully populated and end in identical leaves. A nextbyte tagged
+ * internal node will not be removed to preserve the tree's integrity.
+ * Note that due to the structure of utf8, no nextbyte tagged node
+ * will be a candidate for removal.
+ */
+static int
+insert(struct tree *tree, char *key, int keylen, void *leaf)
+{
+	struct node *node;
+	struct node *parent;
+	void **cursor;
+	int keybits;
+
+	assert(keylen >= 1 && keylen <= 4);
+
+	node = NULL;
+	cursor = &tree->root;
+	keybits = 8 * keylen;
+
+	/* Insert, creating path along the way. */
+	while (keybits) {
+		if (!*cursor)
+			*cursor = alloc_node(node);
+		node = *cursor;
+		if (node->nextbyte)
+			key++;
+		if (*key & (1 << (node->bitnum & 7)))
+			cursor = &node->right;
+		else
+			cursor = &node->left;
+		keybits--;
+	}
+	*cursor = leaf;
+
+	/* Merge subtrees if possible. */
+	while (node) {
+		if (*key & (1 << (node->bitnum & 7)))
+			node->rightnode = LEAF;
+		else
+			node->leftnode = LEAF;
+		if (node->nextbyte)
+			break;
+		if (node->leftnode == NODE || node->rightnode == NODE)
+			break;
+		assert(node->left);
+		assert(node->right);
+		/* Compare */
+		if (! tree->leaf_equal(node->left, node->right))
+			break;
+		/* Keep left, drop right leaf. */
+		leaf = node->left;
+		/* Check in parent */
+		parent = node->parent;
+		if (!parent) {
+			/* root of tree! */
+			tree->root = leaf;
+			tree->childnode = LEAF;
+		} else if (parent->left == node) {
+			parent->left = leaf;
+			parent->leftnode = LEAF;
+			if (parent->right) {
+				parent->keymask = 0;
+				parent->keybits = 0;
+			} else {
+				parent->keymask |= (1 << node->bitnum);
+			}
+		} else if (parent->right == node) {
+			parent->right = leaf;
+			parent->rightnode = LEAF;
+			if (parent->left) {
+				parent->keymask = 0;
+				parent->keybits = 0;
+			} else {
+				parent->keymask |= (1 << node->bitnum);
+				parent->keybits |= (1 << node->bitnum);
+			}
+		} else {
+			/* internal tree error */
+			assert(0);
+		}
+		free(node);
+		node = parent;
+	}
+
+	/* Propagate keymasks up along singleton chains. */
+	while (node) {
+		parent = node->parent;
+		if (!parent)
+			break;
+		/* Nix the mask for parents with two children. */
+		if (node->keymask == 0) {
+			parent->keymask = 0;
+			parent->keybits = 0;
+		} else if (parent->left && parent->right) {
+			parent->keymask = 0;
+			parent->keybits = 0;
+		} else {
+			assert((parent->keymask & node->keymask) == 0);
+			parent->keymask |= node->keymask;
+			parent->keymask |= (1 << parent->bitnum);
+			parent->keybits |= node->keybits;
+			if (parent->right)
+				parent->keybits |= (1 << parent->bitnum);
+		}
+		node = parent;
+	}
+
+	return 0;
+}
+
+/*
+ * Prune internal nodes.
+ *
+ * Fully populated subtrees that end at the same leaf have already
+ * been collapsed.  There are still internal nodes that have for both
+ * their left and right branches a sequence of singletons that make
+ * identical choices and end in identical leaves.  The keymask and
+ * keybits collected in the nodes describe the choices made in these
+ * singleton chains.  When they are identical for the left and right
+ * branch of a node, and the two leaves compare identical, the node in
+ * question can be removed.
+ *
+ * Note that nodes with the nextbyte tag set will not be removed by
+ * this to ensure tree integrity.  Note as well that the structure of
+ * utf8 ensures that these nodes would not have been candidates for
+ * removal in any case.
+ */
+static void
+prune(struct tree *tree)
+{
+	struct node *node;
+	struct node *left;
+	struct node *right;
+	struct node *parent;
+	void *leftleaf;
+	void *rightleaf;
+	unsigned int leftmask;
+	unsigned int rightmask;
+	unsigned int bitmask;
+	int count;
+
+	if (verbose > 0)
+		printf("Pruning %s_%x\n", tree->type, tree->maxage);
+
+	count = 0;
+	if (tree->childnode == LEAF)
+		return;
+	if (!tree->root)
+		return;
+
+	leftmask = rightmask = 0;
+	node = tree->root;
+	while (node) {
+		if (node->nextbyte)
+			goto advance;
+		if (node->leftnode == LEAF)
+			goto advance;
+		if (node->rightnode == LEAF)
+			goto advance;
+		if (!node->left)
+			goto advance;
+		if (!node->right)
+			goto advance;
+		left = node->left;
+		right = node->right;
+		if (left->keymask == 0)
+			goto advance;
+		if (right->keymask == 0)
+			goto advance;
+		if (left->keymask != right->keymask)
+			goto advance;
+		if (left->keybits != right->keybits)
+			goto advance;
+		leftleaf = NULL;
+		while (!leftleaf) {
+			assert(left->left || left->right);
+			if (left->leftnode == LEAF)
+				leftleaf = left->left;
+			else if (left->rightnode == LEAF)
+				leftleaf = left->right;
+			else if (left->left)
+				left = left->left;
+			else if (left->right)
+				left = left->right;
+			else
+				assert(0);
+		}
+		rightleaf = NULL;
+		while (!rightleaf) {
+			assert(right->left || right->right);
+			if (right->leftnode == LEAF)
+				rightleaf = right->left;
+			else if (right->rightnode == LEAF)
+				rightleaf = right->right;
+			else if (right->left)
+				right = right->left;
+			else if (right->right)
+				right = right->right;
+			else
+				assert(0);
+		}
+		if (!tree->leaf_equal(leftleaf, rightleaf))
+			goto advance;
+		/*
+		 * This node has identical singleton-only subtrees.
+		 * Remove it.
+		 */
+		parent = node->parent;
+		left = node->left;
+		right = node->right;
+		if (parent->left == node)
+			parent->left = left;
+		else if (parent->right == node)
+			parent->right = left;
+		else
+			assert(0);
+		left->parent = parent;
+		left->keymask |= (1 << node->bitnum);
+		node->left = NULL;
+		while (node) {
+			bitmask = 1 << node->bitnum;
+			leftmask &= ~bitmask;
+			rightmask &= ~bitmask;
+			if (node->leftnode == NODE && node->left) {
+				left = node->left;
+				free(node);
+				count++;
+				node = left;
+			} else if (node->rightnode == NODE && node->right) {
+				right = node->right;
+				free(node);
+				count++;
+				node = right;
+			} else {
+				node = NULL;
+			}
+		}
+		/* Propagate keymasks up along singleton chains. */
+		node = parent;
+		/* Force re-check */
+		bitmask = 1 << node->bitnum;
+		leftmask &= ~bitmask;
+		rightmask &= ~bitmask;
+		for (;;) {
+			if (node->left && node->right)
+				break;
+			if (node->left) {
+				left = node->left;
+				node->keymask |= left->keymask;
+				node->keybits |= left->keybits;
+			}
+			if (node->right) {
+				right = node->right;
+				node->keymask |= right->keymask;
+				node->keybits |= right->keybits;
+			}
+			node->keymask |= (1 << node->bitnum);
+			node = node->parent;
+			/* Force re-check */
+			bitmask = 1 << node->bitnum;
+			leftmask &= ~bitmask;
+			rightmask &= ~bitmask;
+		}
+	advance:
+		bitmask = 1 << node->bitnum;
+		if ((leftmask & bitmask) == 0 &&
+		    node->leftnode == NODE &&
+		    node->left) {
+			leftmask |= bitmask;
+			node = node->left;
+		} else if ((rightmask & bitmask) == 0 &&
+			   node->rightnode == NODE &&
+			   node->right) {
+			rightmask |= bitmask;
+			node = node->right;
+		} else {
+			leftmask &= ~bitmask;
+			rightmask &= ~bitmask;
+			node = node->parent;
+		}
+	}
+	if (verbose > 0)
+		printf("Pruned %d nodes\n", count);
+}
+
+/*
+ * Mark the nodes in the tree that lead to leaves that must be
+ * emitted.
+ */
+static void
+mark_nodes(struct tree *tree)
+{
+	struct node *node;
+	struct node *n;
+	unsigned int leftmask;
+	unsigned int rightmask;
+	unsigned int bitmask;
+	int marked;
+
+	marked = 0;
+	if (verbose > 0)
+		printf("Marking %s_%x\n", tree->type, tree->maxage);
+	if (tree->childnode == LEAF)
+		goto done;
+
+	assert(tree->childnode == NODE);
+	node = tree->root;
+	leftmask = rightmask = 0;
+	while (node) {
+		bitmask = 1 << node->bitnum;
+		if ((leftmask & bitmask) == 0) {
+			leftmask |= bitmask;
+			if (node->leftnode == LEAF) {
+				assert(node->left);
+				if (tree->leaf_mark(node->left)) {
+					n = node;
+					while (n && !n->mark) {
+						marked++;
+						n->mark = 1;
+						n = n->parent;
+					}
+				}
+			} else if (node->left) {
+				assert(node->leftnode == NODE);
+				node = node->left;
+				continue;
+			}
+		}
+		if ((rightmask & bitmask) == 0) {
+			rightmask |= bitmask;
+			if (node->rightnode == LEAF) {
+				assert(node->right);
+				if (tree->leaf_mark(node->right)) {
+					n = node;
+					while (n && !n->mark) {
+						marked++;
+						n->mark = 1;
+						n = n->parent;
+					}
+				}
+			} else if (node->right) {
+				assert(node->rightnode == NODE);
+				node = node->right;
+				continue;
+			}
+		}
+		leftmask &= ~bitmask;
+		rightmask &= ~bitmask;
+		node = node->parent;
+	}
+
+	/* second pass: left siblings and singletons */
+
+	assert(tree->childnode == NODE);
+	node = tree->root;
+	leftmask = rightmask = 0;
+	while (node) {
+		bitmask = 1 << node->bitnum;
+		if ((leftmask & bitmask) == 0) {
+			leftmask |= bitmask;
+			if (node->leftnode == LEAF) {
+				assert(node->left);
+				if (tree->leaf_mark(node->left)) {
+					n = node;
+					while (n && !n->mark) {
+						marked++;
+						n->mark = 1;
+						n = n->parent;
+					}
+				}
+			} else if (node->left) {
+				assert(node->leftnode == NODE);
+				node = node->left;
+				if (!node->mark && node->parent->mark) {
+					marked++;
+					node->mark = 1;
+				}
+				continue;
+			}
+		}
+		if ((rightmask & bitmask) == 0) {
+			rightmask |= bitmask;
+			if (node->rightnode == LEAF) {
+				assert(node->right);
+				if (tree->leaf_mark(node->right)) {
+					n = node;
+					while (n && !n->mark) {
+						marked++;
+						n->mark = 1;
+						n = n->parent;
+					}
+				}
+			} else if (node->right) {
+				assert(node->rightnode == NODE);
+				node = node->right;
+				if (!node->mark && node->parent->mark &&
+				    !node->parent->left) {
+					marked++;
+					node->mark = 1;
+				}
+				continue;
+			}
+		}
+		leftmask &= ~bitmask;
+		rightmask &= ~bitmask;
+		node = node->parent;
+	}
+done:
+	if (verbose > 0)
+		printf("Marked %d nodes\n", marked);
+}
+
+/*
+ * Compute the index of each node and leaf, which is the offset in the
+ * emitted trie.  These values must be pre-computed because relative
+ * offsets between nodes are used to navigate the tree.
+ */
+static int
+index_nodes(struct tree *tree, int index)
+{
+	struct node *node;
+	unsigned int leftmask;
+	unsigned int rightmask;
+	unsigned int bitmask;
+	int count;
+	int indent;
+
+	/* Align to a cache line (or half a cache line?). */
+	while (index % 64)
+		index++;
+	tree->index = index;
+	indent = 1;
+	count = 0;
+
+	if (verbose > 0)
+		printf("Indexing %s_%x: %d", tree->type, tree->maxage, index);
+	if (tree->childnode == LEAF) {
+		index += tree->leaf_size(tree->root);
+		goto done;
+	}
+
+	assert(tree->childnode == NODE);
+	node = tree->root;
+	leftmask = rightmask = 0;
+	while (node) {
+		if (!node->mark)
+			goto skip;
+		count++;
+		if (node->index != index)
+			node->index = index;
+		index += node->size;
+skip:
+		while (node) {
+			bitmask = 1 << node->bitnum;
+			if (node->mark && (leftmask & bitmask) == 0) {
+				leftmask |= bitmask;
+				if (node->leftnode == LEAF) {
+					assert(node->left);
+					*tree->leaf_index(tree, node->left) =
+									index;
+					index += tree->leaf_size(node->left);
+					count++;
+				} else if (node->left) {
+					assert(node->leftnode == NODE);
+					indent += 1;
+					node = node->left;
+					break;
+				}
+			}
+			if (node->mark && (rightmask & bitmask) == 0) {
+				rightmask |= bitmask;
+				if (node->rightnode == LEAF) {
+					assert(node->right);
+					*tree->leaf_index(tree, node->right) = index;
+					index += tree->leaf_size(node->right);
+					count++;
+				} else if (node->right) {
+					assert(node->rightnode == NODE);
+					indent += 1;
+					node = node->right;
+					break;
+				}
+			}
+			leftmask &= ~bitmask;
+			rightmask &= ~bitmask;
+			node = node->parent;
+			indent -= 1;
+		}
+	}
+done:
+	/* Round up to a multiple of 16 */
+	while (index % 16)
+		index++;
+	if (verbose > 0)
+		printf("Final index %d\n", index);
+	return index;
+}
+
+/*
+ * Compute the size of nodes and leaves. We start by assuming that
+ * each node needs to store a three-byte offset. The indexes of the
+ * nodes are calculated based on that, and then this function is
+ * called to see if the sizes of some nodes can be reduced.  This is
+ * repeated until no more changes are seen.
+ */
+static int
+size_nodes(struct tree *tree)
+{
+	struct tree *next;
+	struct node *node;
+	struct node *right;
+	struct node *n;
+	unsigned int leftmask;
+	unsigned int rightmask;
+	unsigned int bitmask;
+	unsigned int pathbits;
+	unsigned int pathmask;
+	int changed;
+	int offset;
+	int size;
+	int indent;
+
+	indent = 1;
+	changed = 0;
+	size = 0;
+
+	if (verbose > 0)
+		printf("Sizing %s_%x", tree->type, tree->maxage);
+	if (tree->childnode == LEAF)
+		goto done;
+
+	assert(tree->childnode == NODE);
+	pathbits = 0;
+	pathmask = 0;
+	node = tree->root;
+	leftmask = rightmask = 0;
+	while (node) {
+		if (!node->mark)
+			goto skip;
+		offset = 0;
+		if (!node->left || !node->right) {
+			size = 1;
+		} else {
+			if (node->rightnode == NODE) {
+				right = node->right;
+				next = tree->next;
+				while (!right->mark) {
+					assert(next);
+					n = next->root;
+					while (n->bitnum != node->bitnum) {
+						if (pathbits & (1<<n->bitnum))
+							n = n->right;
+						else
+							n = n->left;
+					}
+					n = n->right;
+					assert(right->bitnum == n->bitnum);
+					right = n;
+					next = next->next;
+				}
+				offset = right->index - node->index;
+			} else {
+				offset = *tree->leaf_index(tree, node->right);
+				offset -= node->index;
+			}
+			assert(offset >= 0);
+			assert(offset <= 0xffffff);
+			if (offset <= 0xff) {
+				size = 2;
+			} else if (offset <= 0xffff) {
+				size = 3;
+			} else { /* offset <= 0xffffff */
+				size = 4;
+			}
+		}
+		if (node->size != size || node->offset != offset) {
+			node->size = size;
+			node->offset = offset;
+			changed++;
+		}
+skip:
+		while (node) {
+			bitmask = 1 << node->bitnum;
+			pathmask |= bitmask;
+			if (node->mark && (leftmask & bitmask) == 0) {
+				leftmask |= bitmask;
+				if (node->leftnode == LEAF) {
+					assert(node->left);
+				} else if (node->left) {
+					assert(node->leftnode == NODE);
+					indent += 1;
+					node = node->left;
+					break;
+				}
+			}
+			if (node->mark && (rightmask & bitmask) == 0) {
+				rightmask |= bitmask;
+				pathbits |= bitmask;
+				if (node->rightnode == LEAF) {
+					assert(node->right);
+				} else if (node->right) {
+					assert(node->rightnode == NODE);
+					indent += 1;
+					node = node->right;
+					break;
+				}
+			}
+			leftmask &= ~bitmask;
+			rightmask &= ~bitmask;
+			pathmask &= ~bitmask;
+			pathbits &= ~bitmask;
+			node = node->parent;
+			indent -= 1;
+		}
+	}
+done:
+	if (verbose > 0)
+		printf("Found %d changes\n", changed);
+	return changed;
+}
+
+/*
+ * Emit a trie for the given tree into the data array.
+ */
+static void
+emit(struct tree *tree, unsigned char *data)
+{
+	struct node *node;
+	unsigned int leftmask;
+	unsigned int rightmask;
+	unsigned int bitmask;
+	int offlen;
+	int offset;
+	int index;
+	int indent;
+	unsigned char byte;
+
+	index = tree->index;
+	data += index;
+	indent = 1;
+	if (verbose > 0)
+		printf("Emitting %s_%x\n", tree->type, tree->maxage);
+	if (tree->childnode == LEAF) {
+		assert(tree->root);
+		tree->leaf_emit(tree->root, data);
+		return;
+	}
+
+	assert(tree->childnode == NODE);
+	node = tree->root;
+	leftmask = rightmask = 0;
+	while (node) {
+		if (!node->mark)
+			goto skip;
+		assert(node->offset != -1);
+		assert(node->index == index);
+
+		byte = 0;
+		if (node->nextbyte)
+			byte |= NEXTBYTE;
+		byte |= (node->bitnum & BITNUM);
+		if (node->left && node->right) {
+			if (node->leftnode == NODE)
+				byte |= LEFTNODE;
+			if (node->rightnode == NODE)
+				byte |= RIGHTNODE;
+			if (node->offset <= 0xff)
+				offlen = 1;
+			else if (node->offset <= 0xffff)
+				offlen = 2;
+			else
+				offlen = 3;
+			offset = node->offset;
+			byte |= offlen << OFFLEN_SHIFT;
+			*data++ = byte;
+			index++;
+			while (offlen--) {
+				*data++ = offset & 0xff;
+				index++;
+				offset >>= 8;
+			}
+		} else if (node->left) {
+			if (node->leftnode == NODE)
+				byte |= TRIENODE;
+			*data++ = byte;
+			index++;
+		} else if (node->right) {
+			byte |= RIGHTNODE;
+			if (node->rightnode == NODE)
+				byte |= TRIENODE;
+			*data++ = byte;
+			index++;
+		} else {
+			assert(0);
+		}
+skip:
+		while (node) {
+			bitmask = 1 << node->bitnum;
+			if (node->mark && (leftmask & bitmask) == 0) {
+				leftmask |= bitmask;
+				if (node->leftnode == LEAF) {
+					assert(node->left);
+					data = tree->leaf_emit(node->left,
+							       data);
+					index += tree->leaf_size(node->left);
+				} else if (node->left) {
+					assert(node->leftnode == NODE);
+					indent += 1;
+					node = node->left;
+					break;
+				}
+			}
+			if (node->mark && (rightmask & bitmask) == 0) {
+				rightmask |= bitmask;
+				if (node->rightnode == LEAF) {
+					assert(node->right);
+					data = tree->leaf_emit(node->right,
+							       data);
+					index += tree->leaf_size(node->right);
+				} else if (node->right) {
+					assert(node->rightnode == NODE);
+					indent += 1;
+					node = node->right;
+					break;
+				}
+			}
+			leftmask &= ~bitmask;
+			rightmask &= ~bitmask;
+			node = node->parent;
+			indent -= 1;
+		}
+	}
+}
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * Unicode data.
+ *
+ * We need to keep track of the Canonical Combining Class, the Age,
+ * and decompositions for a code point.
+ *
+ * For the Age, we store the index into the ages table.  Effectively
+ * this is a generation number that the table maps to a unicode
+ * version.
+ *
+ * The correction field is used to indicate that this entry is in the
+ * corrections array, which contains decompositions that were
+ * corrected in later revisions.  The value of the correction field is
+ * the Unicode version in which the mapping was corrected.
+ */
+struct unicode_data {
+	unsigned int code;
+	int ccc;
+	int gen;
+	int correction;
+	unsigned int *utf32nfkdi;
+	unsigned int *utf32nfkdicf;
+	char *utf8nfkdi;
+	char *utf8nfkdicf;
+};
+
+struct unicode_data unicode_data[0x110000];
+struct unicode_data *corrections;
+int    corrections_count;
+
+struct tree *nfkdi_tree;
+struct tree *nfkdicf_tree;
+
+struct tree *trees;
+int          trees_count;
+
+/*
+ * Check the corrections array to see if this entry was corrected at
+ * some point.
+ */
+static struct unicode_data *
+corrections_lookup(struct unicode_data *u)
+{
+	int i;
+
+	for (i = 0; i != corrections_count; i++)
+		if (u->code == corrections[i].code)
+			return &corrections[i];
+	return u;
+}
+
+static int
+nfkdi_equal(void *l, void *r)
+{
+	struct unicode_data *left = l;
+	struct unicode_data *right = r;
+
+	if (left->gen != right->gen)
+		return 0;
+	if (left->ccc != right->ccc)
+		return 0;
+	if (left->utf8nfkdi && right->utf8nfkdi &&
+	    strcmp(left->utf8nfkdi, right->utf8nfkdi) == 0)
+		return 1;
+	if (left->utf8nfkdi || right->utf8nfkdi)
+		return 0;
+	return 1;
+}
+
+static int
+nfkdicf_equal(void *l, void *r)
+{
+	struct unicode_data *left = l;
+	struct unicode_data *right = r;
+
+	if (left->gen != right->gen)
+		return 0;
+	if (left->ccc != right->ccc)
+		return 0;
+	if (left->utf8nfkdicf && right->utf8nfkdicf &&
+	    strcmp(left->utf8nfkdicf, right->utf8nfkdicf) == 0)
+		return 1;
+	if (left->utf8nfkdicf && right->utf8nfkdicf)
+		return 0;
+	if (left->utf8nfkdicf || right->utf8nfkdicf)
+		return 0;
+	if (left->utf8nfkdi && right->utf8nfkdi &&
+	    strcmp(left->utf8nfkdi, right->utf8nfkdi) == 0)
+		return 1;
+	if (left->utf8nfkdi || right->utf8nfkdi)
+		return 0;
+	return 1;
+}
+
+static void
+nfkdi_print(void *l, int indent)
+{
+	struct unicode_data *leaf = l;
+
+	printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf,
+		leaf->code, leaf->ccc, leaf->gen);
+	if (leaf->utf8nfkdi)
+		printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi);
+	printf("\n");
+}
+
+static void
+nfkdicf_print(void *l, int indent)
+{
+	struct unicode_data *leaf = l;
+
+	printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf,
+		leaf->code, leaf->ccc, leaf->gen);
+	if (leaf->utf8nfkdicf)
+		printf(" nfkdicf \"%s\"", (const char*)leaf->utf8nfkdicf);
+	else if (leaf->utf8nfkdi)
+		printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi);
+	printf("\n");
+}
+
+static int
+nfkdi_mark(void *l)
+{
+	return 1;
+}
+
+static int
+nfkdicf_mark(void *l)
+{
+	struct unicode_data *leaf = l;
+
+	if (leaf->utf8nfkdicf)
+		return 1;
+	return 0;
+}
+
+static int
+correction_mark(void *l)
+{
+	struct unicode_data *leaf = l;
+
+	return leaf->correction;
+}
+
+static int
+nfkdi_size(void *l)
+{
+	struct unicode_data *leaf = l;
+
+	int size = 2;
+	if (leaf->utf8nfkdi)
+		size += strlen(leaf->utf8nfkdi) + 1;
+	return size;
+}
+
+static int
+nfkdicf_size(void *l)
+{
+	struct unicode_data *leaf = l;
+
+	int size = 2;
+	if (leaf->utf8nfkdicf)
+		size += strlen(leaf->utf8nfkdicf) + 1;
+	else if (leaf->utf8nfkdi)
+		size += strlen(leaf->utf8nfkdi) + 1;
+	return size;
+}
+
+static int *
+nfkdi_index(struct tree *tree, void *l)
+{
+	struct unicode_data *leaf = l;
+
+	return &tree->leafindex[leaf->code];
+}
+
+static int *
+nfkdicf_index(struct tree *tree, void *l)
+{
+	struct unicode_data *leaf = l;
+
+	return &tree->leafindex[leaf->code];
+}
+
+static unsigned char *
+nfkdi_emit(void *l, unsigned char *data)
+{
+	struct unicode_data *leaf = l;
+	unsigned char *s;
+
+	*data++ = leaf->gen;
+	if (leaf->utf8nfkdi) {
+		*data++ = DECOMPOSE;
+		s = (unsigned char*)leaf->utf8nfkdi;
+		while ((*data++ = *s++) != 0)
+			;
+	} else {
+		*data++ = leaf->ccc;
+	}
+	return data;
+}
+
+static unsigned char *
+nfkdicf_emit(void *l, unsigned char *data)
+{
+	struct unicode_data *leaf = l;
+	unsigned char *s;
+
+	*data++ = leaf->gen;
+	if (leaf->utf8nfkdicf) {
+		*data++ = DECOMPOSE;
+		s = (unsigned char*)leaf->utf8nfkdicf;
+		while ((*data++ = *s++) != 0)
+			;
+	} else if (leaf->utf8nfkdi) {
+		*data++ = DECOMPOSE;
+		s = (unsigned char*)leaf->utf8nfkdi;
+		while ((*data++ = *s++) != 0)
+			;
+	} else {
+		*data++ = leaf->ccc;
+	}
+	return data;
+}
+
+static void
+utf8_create(struct unicode_data *data)
+{
+	char utf[18*4+1];
+	char *u;
+	unsigned int *um;
+	int i;
+
+	u = utf;
+	um = data->utf32nfkdi;
+	if (um) {
+		for (i = 0; um[i]; i++)
+			u += utf8key(um[i], u);
+		*u = '\0';
+		data->utf8nfkdi = strdup((char*)utf);
+	}
+	u = utf;
+	um = data->utf32nfkdicf;
+	if (um) {
+		for (i = 0; um[i]; i++)
+			u += utf8key(um[i], u);
+		*u = '\0';
+		if (!data->utf8nfkdi || strcmp(data->utf8nfkdi, (char*)utf))
+			data->utf8nfkdicf = strdup((char*)utf);
+	}
+}
+
+static void
+utf8_init(void)
+{
+	unsigned int unichar;
+	int i;
+
+	for (unichar = 0; unichar != 0x110000; unichar++)
+		utf8_create(&unicode_data[unichar]);
+
+	for (i = 0; i != corrections_count; i++)
+		utf8_create(&corrections[i]);
+}
+
+static void
+trees_init(void)
+{
+	struct unicode_data *data;
+	unsigned int maxage;
+	unsigned int nextage;
+	int count;
+	int i;
+	int j;
+
+	/* Count the number of different ages. */
+	count = 0;
+	nextage = (unsigned int)-1;
+	do {
+		maxage = nextage;
+		nextage = 0;
+		for (i = 0; i != corrections_count; i++) {
+			data = &corrections[i];
+			if (nextage < data->correction &&
+			    data->correction < maxage)
+				nextage = data->correction;
+		}
+		count++;
+	} while (nextage);
+
+	/* Two trees per age: nfkdi and nfkdicf */
+	trees_count = count * 2;
+	trees = calloc(trees_count, sizeof(struct tree));
+
+	/* Assign ages to the trees. */
+	count = trees_count;
+	nextage = (unsigned int)-1;
+	do {
+		maxage = nextage;
+		trees[--count].maxage = maxage;
+		trees[--count].maxage = maxage;
+		nextage = 0;
+		for (i = 0; i != corrections_count; i++) {
+			data = &corrections[i];
+			if (nextage < data->correction &&
+			    data->correction < maxage)
+				nextage = data->correction;
+		}
+	} while (nextage);
+
+	/* The ages assigned above are off by one. */
+	for (i = 0; i != trees_count; i++) {
+		j = 0;
+		while (ages[j] < trees[i].maxage)
+			j++;
+		trees[i].maxage = ages[j-1];
+	}
+
+	/* Set up the forwarding between trees. */
+	trees[trees_count-2].next = &trees[trees_count-1];
+	trees[trees_count-1].leaf_mark = nfkdi_mark;
+	trees[trees_count-2].leaf_mark = nfkdicf_mark;
+	for (i = 0; i != trees_count-2; i += 2) {
+		trees[i].next = &trees[trees_count-2];
+		trees[i].leaf_mark = correction_mark;
+		trees[i+1].next = &trees[trees_count-1];
+		trees[i+1].leaf_mark = correction_mark;
+	}
+
+	/* Assign the callouts. */
+	for (i = 0; i != trees_count; i += 2) {
+		trees[i].type = "nfkdicf";
+		trees[i].leaf_equal = nfkdicf_equal;
+		trees[i].leaf_print = nfkdicf_print;
+		trees[i].leaf_size = nfkdicf_size;
+		trees[i].leaf_index = nfkdicf_index;
+		trees[i].leaf_emit = nfkdicf_emit;
+
+		trees[i+1].type = "nfkdi";
+		trees[i+1].leaf_equal = nfkdi_equal;
+		trees[i+1].leaf_print = nfkdi_print;
+		trees[i+1].leaf_size = nfkdi_size;
+		trees[i+1].leaf_index = nfkdi_index;
+		trees[i+1].leaf_emit = nfkdi_emit;
+	}
+
+	/* Finish init. */
+	for (i = 0; i != trees_count; i++)
+		trees[i].childnode = NODE;
+}
+
+static void
+trees_populate(void)
+{
+	struct unicode_data *data;
+	unsigned int unichar;
+	char keyval[4];
+	int keylen;
+	int i;
+
+	for (i = 0; i != trees_count; i++) {
+		if (verbose > 0) {
+			printf("Populating %s_%x\n",
+				trees[i].type, trees[i].maxage);
+		}
+		for (unichar = 0; unichar != 0x110000; unichar++) {
+			if (unicode_data[unichar].gen < 0)
+				continue;
+			keylen = utf8key(unichar, keyval);
+			data = corrections_lookup(&unicode_data[unichar]);
+			if (data->correction <= trees[i].maxage)
+				data = &unicode_data[unichar];
+			insert(&trees[i], keyval, keylen, data);
+		}
+	}
+}
+
+static void
+trees_reduce(void)
+{
+	int i;
+	int size;
+	int changed;
+
+	for (i = 0; i != trees_count; i++)
+		prune(&trees[i]);
+	for (i = 0; i != trees_count; i++)
+		mark_nodes(&trees[i]);
+	do {
+		size = 0;
+		for (i = 0; i != trees_count; i++)
+			size = index_nodes(&trees[i], size);
+		changed = 0;
+		for (i = 0; i != trees_count; i++)
+			changed += size_nodes(&trees[i]);
+	} while (changed);
+
+	utf8data = calloc(size, 1);
+	utf8data_size = size;
+	for (i = 0; i != trees_count; i++)
+		emit(&trees[i], utf8data);
+
+	if (verbose > 0) {
+		for (i = 0; i != trees_count; i++) {
+			printf("%s_%x idx %d\n",
+				trees[i].type, trees[i].maxage, trees[i].index);
+		}
+	}
+
+	nfkdi = utf8data + trees[trees_count-1].index;
+	nfkdicf = utf8data + trees[trees_count-2].index;
+
+	nfkdi_tree = &trees[trees_count-1];
+	nfkdicf_tree = &trees[trees_count-2];
+}
+
+static void
+verify(struct tree *tree)
+{
+	struct unicode_data *data;
+	utf8leaf_t	*leaf;
+	unsigned int	unichar;
+	char		key[4];
+	int		report;
+	int		nocf;
+
+	if (verbose > 0)
+		printf("Verifying %s_%x\n", tree->type, tree->maxage);
+	nocf = strcmp(tree->type, "nfkdicf");
+
+	for (unichar = 0; unichar != 0x110000; unichar++) {
+		report = 0;
+		data = corrections_lookup(&unicode_data[unichar]);
+		if (data->correction <= tree->maxage)
+			data = &unicode_data[unichar];
+		utf8key(unichar, key);
+		leaf = utf8lookup(tree, key);
+		if (!leaf) {
+			if (data->gen != -1)
+				report++;
+			if (unichar < 0xd800 || unichar > 0xdfff)
+				report++;
+		} else {
+			if (unichar >= 0xd800 && unichar <= 0xdfff)
+				report++;
+			if (data->gen == -1)
+				report++;
+			if (data->gen != LEAF_GEN(leaf))
+				report++;
+			if (LEAF_CCC(leaf) == DECOMPOSE) {
+				if (nocf) {
+					if (!data->utf8nfkdi) {
+						report++;
+					} else if (strcmp(data->utf8nfkdi,
+							LEAF_STR(leaf))) {
+						report++;
+					}
+				} else {
+					if (!data->utf8nfkdicf &&
+					    !data->utf8nfkdi) {
+						report++;
+					} else if (data->utf8nfkdicf) {
+						if (strcmp(data->utf8nfkdicf,
+							   LEAF_STR(leaf)))
+							report++;
+					} else if (strcmp(data->utf8nfkdi,
+							  LEAF_STR(leaf))) {
+						report++;
+					}
+				}
+			} else if (data->ccc != LEAF_CCC(leaf)) {
+				report++;
+			}
+		}
+		if (report) {
+			printf("%X code %X gen %d ccc %d"
+				" nfkdi -> \"%s\"",
+				unichar, data->code, data->gen,
+				data->ccc,
+				data->utf8nfkdi);
+			if (leaf) {
+				printf(" age %d ccc %d"
+					" nfkdi -> \"%s\"\n",
+					LEAF_GEN(leaf),
+					LEAF_CCC(leaf),
+					LEAF_CCC(leaf) == DECOMPOSE ?
+						LEAF_STR(leaf) : "");
+			}
+			printf("\n");
+		}
+	}
+}
+
+static void
+trees_verify(void)
+{
+	int i;
+
+	for (i = 0; i != trees_count; i++)
+		verify(&trees[i]);
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+help(void)
+{
+	printf("Usage: %s [options]\n", argv0);
+	printf("\n");
+	printf("This program creates a data trie used for parsing and\n");
+	printf("normalization of UTF-8 strings. The trie is derived from\n");
+	printf("a set of input files from the Unicode character database\n");
+	printf("found at: http://www.unicode.org/Public/UCD/latest/ucd/\n");
+	printf("\n");
+	printf("The generated tree supports two normalization forms:\n");
+	printf("\n");
+	printf("\tnfkdi:\n");
+	printf("\t- Apply unicode normalization form NFKD.\n");
+	printf("\t- Remove any Default_Ignorable_Code_Point.\n");
+	printf("\n");
+	printf("\tnfkdicf:\n");
+	printf("\t- Apply unicode normalization form NFKD.\n");
+	printf("\t- Remove any Default_Ignorable_Code_Point.\n");
+	printf("\t- Apply a full casefold (C + F).\n");
+	printf("\n");
+	printf("These forms were chosen as being most useful when dealing\n");
+	printf("with file names: NFKD catches most cases where characters\n");
+	printf("should be considered equivalent. The ignorables are mostly\n");
+	printf("invisible, making names hard to type.\n");
+	printf("\n");
+	printf("The options to specify the files to be used are listed\n");
+	printf("below with their default values, which are the names used\n");
+	printf("by version 7.0.0 of the Unicode Character Database.\n");
+	printf("\n");
+	printf("The input files:\n");
+	printf("\t-a %s\n", AGE_NAME);
+	printf("\t-c %s\n", CCC_NAME);
+	printf("\t-p %s\n", PROP_NAME);
+	printf("\t-d %s\n", DATA_NAME);
+	printf("\t-f %s\n", FOLD_NAME);
+	printf("\t-n %s\n", NORM_NAME);
+	printf("\n");
+	printf("Additionally, the generated tables are tested using:\n");
+	printf("\t-t %s\n", TEST_NAME);
+	printf("\n");
+	printf("Finally, the output file:\n");
+	printf("\t-o %s\n", UTF8_NAME);
+	printf("\n");
+}
+
+static void
+usage(void)
+{
+	help();
+	exit(1);
+}
+
+static void
+open_fail(const char *name, int error)
+{
+	printf("Error %d opening %s: %s\n", error, name, strerror(error));
+	exit(1);
+}
+
+static void
+file_fail(const char *filename)
+{
+	printf("Error parsing %s\n", filename);
+	exit(1);
+}
+
+static void
+line_fail(const char *filename, const char *line)
+{
+	printf("Error parsing %s:%s\n", filename, line);
+	exit(1);
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+print_utf32(unsigned int *utf32str)
+{
+	int	i;
+
+	for (i = 0; utf32str[i]; i++)
+		printf(" %X", utf32str[i]);
+}
+
+static void
+print_utf32nfkdi(unsigned int unichar)
+{
+	printf(" %X ->", unichar);
+	print_utf32(unicode_data[unichar].utf32nfkdi);
+	printf("\n");
+}
+
+static void
+print_utf32nfkdicf(unsigned int unichar)
+{
+	printf(" %X ->", unichar);
+	print_utf32(unicode_data[unichar].utf32nfkdicf);
+	printf("\n");
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+age_init(void)
+{
+	FILE *file;
+	unsigned int first;
+	unsigned int last;
+	unsigned int unichar;
+	unsigned int major;
+	unsigned int minor;
+	unsigned int revision;
+	int gen;
+	int count;
+	int ret;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", age_name);
+
+	file = fopen(age_name, "r");
+	if (!file)
+		open_fail(age_name, errno);
+	count = 0;
+
+	gen = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "# Age=V%d_%d_%d",
+				&major, &minor, &revision);
+		if (ret == 3) {
+			ages_count++;
+			if (verbose > 1)
+				printf(" Age V%d_%d_%d\n",
+					major, minor, revision);
+			if (!age_valid(major, minor, revision))
+				line_fail(age_name, line);
+			continue;
+		}
+		ret = sscanf(line, "# Age=V%d_%d", &major, &minor);
+		if (ret == 2) {
+			ages_count++;
+			if (verbose > 1)
+				printf(" Age V%d_%d\n", major, minor);
+			if (!age_valid(major, minor, 0))
+				line_fail(age_name, line);
+			continue;
+		}
+	}
+
+	/* We must have found something above. */
+	if (verbose > 1)
+		printf("%d age entries\n", ages_count);
+	if (ages_count == 0 || ages_count > MAXGEN)
+		file_fail(age_name);
+
+	/* There is a 0 entry. */
+	ages_count++;
+	ages = calloc(ages_count + 1, sizeof(*ages));
+	/* And a guard entry. */
+	ages[ages_count] = (unsigned int)-1;
+
+	rewind(file);
+	count = 0;
+	gen = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "# Age=V%d_%d_%d",
+				&major, &minor, &revision);
+		if (ret == 3) {
+			ages[++gen] =
+				UNICODE_AGE(major, minor, revision);
+			if (verbose > 1)
+				printf(" Age V%d_%d_%d = gen %d\n",
+					major, minor, revision, gen);
+			if (!age_valid(major, minor, revision))
+				line_fail(age_name, line);
+			continue;
+		}
+		ret = sscanf(line, "# Age=V%d_%d", &major, &minor);
+		if (ret == 2) {
+			ages[++gen] = UNICODE_AGE(major, minor, 0);
+			if (verbose > 1)
+				printf(" Age V%d_%d = %d\n",
+					major, minor, gen);
+			if (!age_valid(major, minor, 0))
+				line_fail(age_name, line);
+			continue;
+		}
+		ret = sscanf(line, "%X..%X ; %d.%d #",
+			     &first, &last, &major, &minor);
+		if (ret == 4) {
+			for (unichar = first; unichar <= last; unichar++)
+				unicode_data[unichar].gen = gen;
+			count += 1 + last - first;
+			if (verbose > 1)
+				printf("  %X..%X gen %d\n", first, last, gen);
+			if (!utf32valid(first) || !utf32valid(last))
+				line_fail(age_name, line);
+			continue;
+		}
+		ret = sscanf(line, "%X ; %d.%d #", &unichar, &major, &minor);
+		if (ret == 3) {
+			unicode_data[unichar].gen = gen;
+			count++;
+			if (verbose > 1)
+				printf("  %X gen %d\n", unichar, gen);
+			if (!utf32valid(unichar))
+				line_fail(age_name, line);
+			continue;
+		}
+	}
+	unicode_maxage = ages[gen];
+	fclose(file);
+
+	/* Nix surrogate block */
+	if (verbose > 1)
+		printf(" Removing surrogate block D800..DFFF\n");
+	for (unichar = 0xd800; unichar <= 0xdfff; unichar++)
+		unicode_data[unichar].gen = -1;
+
+	if (verbose > 0)
+		printf("Found %d entries\n", count);
+	if (count == 0)
+		file_fail(age_name);
+}
+
+static void
+ccc_init(void)
+{
+	FILE *file;
+	unsigned int first;
+	unsigned int last;
+	unsigned int unichar;
+	unsigned int value;
+	int count;
+	int ret;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", ccc_name);
+
+	file = fopen(ccc_name, "r");
+	if (!file)
+		open_fail(ccc_name, errno);
+
+	count = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%X..%X ; %d #", &first, &last, &value);
+		if (ret == 3) {
+			for (unichar = first; unichar <= last; unichar++) {
+				unicode_data[unichar].ccc = value;
+				count++;
+			}
+			if (verbose > 1)
+				printf(" %X..%X ccc %d\n", first, last, value);
+			if (!utf32valid(first) || !utf32valid(last))
+				line_fail(ccc_name, line);
+			continue;
+		}
+		ret = sscanf(line, "%X ; %d #", &unichar, &value);
+		if (ret == 2) {
+			unicode_data[unichar].ccc = value;
+			count++;
+			if (verbose > 1)
+				printf(" %X ccc %d\n", unichar, value);
+			if (!utf32valid(unichar))
+				line_fail(ccc_name, line);
+			continue;
+		}
+	}
+	fclose(file);
+
+	if (verbose > 0)
+		printf("Found %d entries\n", count);
+	if (count == 0)
+		file_fail(ccc_name);
+}
+
+static void
+nfkdi_init(void)
+{
+	FILE *file;
+	unsigned int unichar;
+	unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+	char *s;
+	unsigned int *um;
+	int count;
+	int i;
+	int ret;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", data_name);
+	file = fopen(data_name, "r");
+	if (!file)
+		open_fail(data_name, errno);
+
+	count = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%X;%*[^;];%*[^;];%*[^;];%*[^;];%[^;];",
+			     &unichar, buf0);
+		if (ret != 2)
+			continue;
+		if (!utf32valid(unichar))
+			line_fail(data_name, line);
+
+		s = buf0;
+		/* skip over <tag> */
+		if (*s == '<')
+			while (*s++ != ' ')
+				;
+		/* decode the decomposition into UTF-32 */
+		i = 0;
+		while (*s) {
+			mapping[i] = strtoul(s, &s, 16);
+			if (!utf32valid(mapping[i]))
+				line_fail(data_name, line);
+			i++;
+		}
+		mapping[i++] = 0;
+
+		um = malloc(i * sizeof(unsigned int));
+		memcpy(um, mapping, i * sizeof(unsigned int));
+		unicode_data[unichar].utf32nfkdi = um;
+
+		if (verbose > 1)
+			print_utf32nfkdi(unichar);
+		count++;
+	}
+	fclose(file);
+	if (verbose > 0)
+		printf("Found %d entries\n", count);
+	if (count == 0)
+		file_fail(data_name);
+}
+
+static void
+nfkdicf_init(void)
+{
+	FILE *file;
+	unsigned int unichar;
+	unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+	char status;
+	char *s;
+	unsigned int *um;
+	int i;
+	int count;
+	int ret;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", fold_name);
+	file = fopen(fold_name, "r");
+	if (!file)
+		open_fail(fold_name, errno);
+
+	count = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%X; %c; %[^;];", &unichar, &status, buf0);
+		if (ret != 3)
+			continue;
+		if (!utf32valid(unichar))
+			line_fail(fold_name, line);
+		/* Use the C+F casefold. */
+		if (status != 'C' && status != 'F')
+			continue;
+		s = buf0;
+		if (*s == '<')
+			while (*s++ != ' ')
+				;
+		i = 0;
+		while (*s) {
+			mapping[i] = strtoul(s, &s, 16);
+			if (!utf32valid(mapping[i]))
+				line_fail(fold_name, line);
+			i++;
+		}
+		mapping[i++] = 0;
+
+		um = malloc(i * sizeof(unsigned int));
+		memcpy(um, mapping, i * sizeof(unsigned int));
+		unicode_data[unichar].utf32nfkdicf = um;
+
+		if (verbose > 1)
+			print_utf32nfkdicf(unichar);
+		count++;
+	}
+	fclose(file);
+	if (verbose > 0)
+		printf("Found %d entries\n", count);
+	if (count == 0)
+		file_fail(fold_name);
+}
+
+static void
+ignore_init(void)
+{
+	FILE *file;
+	unsigned int unichar;
+	unsigned int first;
+	unsigned int last;
+	unsigned int *um;
+	int count;
+	int ret;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", prop_name);
+	file = fopen(prop_name, "r");
+	if (!file)
+		open_fail(prop_name, errno);
+	assert(file);
+	count = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%X..%X ; %s # ", &first, &last, buf0);
+		if (ret == 3) {
+			if (strcmp(buf0, "Default_Ignorable_Code_Point"))
+				continue;
+			if (!utf32valid(first) || !utf32valid(last))
+				line_fail(prop_name, line);
+			for (unichar = first; unichar <= last; unichar++) {
+				free(unicode_data[unichar].utf32nfkdi);
+				um = malloc(sizeof(unsigned int));
+				*um = 0;
+				unicode_data[unichar].utf32nfkdi = um;
+				free(unicode_data[unichar].utf32nfkdicf);
+				um = malloc(sizeof(unsigned int));
+				*um = 0;
+				unicode_data[unichar].utf32nfkdicf = um;
+				count++;
+			}
+			if (verbose > 1)
+				printf(" %X..%X Default_Ignorable_Code_Point\n",
+					first, last);
+			continue;
+		}
+		ret = sscanf(line, "%X ; %s # ", &unichar, buf0);
+		if (ret == 2) {
+			if (strcmp(buf0, "Default_Ignorable_Code_Point"))
+				continue;
+			if (!utf32valid(unichar))
+				line_fail(prop_name, line);
+			free(unicode_data[unichar].utf32nfkdi);
+			um = malloc(sizeof(unsigned int));
+			*um = 0;
+			unicode_data[unichar].utf32nfkdi = um;
+			free(unicode_data[unichar].utf32nfkdicf);
+			um = malloc(sizeof(unsigned int));
+			*um = 0;
+			unicode_data[unichar].utf32nfkdicf = um;
+			if (verbose > 1)
+				printf(" %X Default_Ignorable_Code_Point\n",
+					unichar);
+			count++;
+			continue;
+		}
+	}
+	fclose(file);
+
+	if (verbose > 0)
+		printf("Found %d entries\n", count);
+	if (count == 0)
+		file_fail(prop_name);
+}
+
+static void
+corrections_init(void)
+{
+	FILE *file;
+	unsigned int unichar;
+	unsigned int major;
+	unsigned int minor;
+	unsigned int revision;
+	unsigned int age;
+	unsigned int *um;
+	unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+	char *s;
+	int i;
+	int count;
+	int ret;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", norm_name);
+	file = fopen(norm_name, "r");
+	if (!file)
+		open_fail(norm_name, errno);
+
+	count = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%X;%[^;];%[^;];%d.%d.%d #",
+				&unichar, buf0, buf1,
+				&major, &minor, &revision);
+		if (ret != 6)
+			continue;
+		if (!utf32valid(unichar) || !age_valid(major, minor, revision))
+			line_fail(norm_name, line);
+		count++;
+	}
+	corrections = calloc(count, sizeof(struct unicode_data));
+	corrections_count = count;
+	rewind(file);
+
+	count = 0;
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%X;%[^;];%[^;];%d.%d.%d #",
+				&unichar, buf0, buf1,
+				&major, &minor, &revision);
+		if (ret != 6)
+			continue;
+		if (!utf32valid(unichar) || !age_valid(major, minor, revision))
+			line_fail(norm_name, line);
+		corrections[count] = unicode_data[unichar];
+		assert(corrections[count].code == unichar);
+		age = UNICODE_AGE(major, minor, revision);
+		corrections[count].correction = age;
+
+		i = 0;
+		s = buf0;
+		while (*s) {
+			mapping[i] = strtoul(s, &s, 16);
+			if (!utf32valid(mapping[i]))
+				line_fail(norm_name, line);
+			i++;
+		}
+		mapping[i++] = 0;
+
+		um = malloc(i * sizeof(unsigned int));
+		memcpy(um, mapping, i * sizeof(unsigned int));
+		corrections[count].utf32nfkdi = um;
+
+		if (verbose > 1)
+			printf(" %X -> %s -> %s V%d_%d_%d\n",
+				unichar, buf0, buf1, major, minor, revision);
+		count++;
+	}
+	fclose(file);
+
+	if (verbose > 0)
+		printf("Found %d entries\n", count);
+	if (count == 0)
+		file_fail(norm_name);
+}
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * Hangul decomposition (algorithm from Section 3.12 of Unicode 6.3.0)
+ *
+ * AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
+ * D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
+ *
+ * SBase = 0xAC00
+ * LBase = 0x1100
+ * VBase = 0x1161
+ * TBase = 0x11A7
+ * LCount = 19
+ * VCount = 21
+ * TCount = 28
+ * NCount = 588 (VCount * TCount)
+ * SCount = 11172 (LCount * NCount)
+ *
+ * Decomposition:
+ *   SIndex = s - SBase
+ *
+ * LV (Canonical/Full)
+ *   LIndex = SIndex / NCount
+ *   VIndex = (SIndex % NCount) / TCount
+ *   LPart = LBase + LIndex
+ *   VPart = VBase + VIndex
+ *
+ * LVT (Canonical)
+ *   LVIndex = (SIndex / TCount) * TCount
+ *   TIndex = SIndex % TCount
+ *   LVPart = SBase + LVIndex
+ *   TPart = TBase + TIndex
+ *
+ * LVT (Full)
+ *   LIndex = SIndex / NCount
+ *   VIndex = (SIndex % NCount) / TCount
+ *   TIndex = SIndex % TCount
+ *   LPart = LBase + LIndex
+ *   VPart = VBase + VIndex
+ *   if (TIndex == 0) {
+ *          d = <LPart, VPart>
+ *   } else {
+ *          TPart = TBase + TIndex
+ *          d = <LPart, VPart, TPart>
+ *   }
+ *
+ */
+
+static void
+hangul_decompose(void)
+{
+	unsigned int sb = 0xAC00;
+	unsigned int lb = 0x1100;
+	unsigned int vb = 0x1161;
+	unsigned int tb = 0x11a7;
+	/* unsigned int lc = 19; */
+	unsigned int vc = 21;
+	unsigned int tc = 28;
+	unsigned int nc = (vc * tc);
+	/* unsigned int sc = (lc * nc); */
+	unsigned int unichar;
+	unsigned int mapping[4];
+	unsigned int *um;
+	int count;
+	int i;
+
+	if (verbose > 0)
+		printf("Decomposing hangul\n");
+	/* Hangul */
+	count = 0;
+	for (unichar = 0xAC00; unichar <= 0xD7A3; unichar++) {
+		unsigned int si = unichar - sb;
+		unsigned int li = si / nc;
+		unsigned int vi = (si % nc) / tc;
+		unsigned int ti = si % tc;
+
+		i = 0;
+		mapping[i++] = lb + li;
+		mapping[i++] = vb + vi;
+		if (ti)
+			mapping[i++] = tb + ti;
+		mapping[i++] = 0;
+
+		assert(!unicode_data[unichar].utf32nfkdi);
+		um = malloc(i * sizeof(unsigned int));
+		memcpy(um, mapping, i * sizeof(unsigned int));
+		unicode_data[unichar].utf32nfkdi = um;
+
+		assert(!unicode_data[unichar].utf32nfkdicf);
+		um = malloc(i * sizeof(unsigned int));
+		memcpy(um, mapping, i * sizeof(unsigned int));
+		unicode_data[unichar].utf32nfkdicf = um;
+
+		if (verbose > 1)
+			print_utf32nfkdi(unichar);
+
+		count++;
+	}
+	if (verbose > 0)
+		printf("Created %d entries\n", count);
+}
+
+static void
+nfkdi_decompose(void)
+{
+	unsigned int unichar;
+	unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+	unsigned int *um;
+	unsigned int *dc;
+	int count;
+	int i;
+	int j;
+	int ret;
+
+	if (verbose > 0)
+		printf("Decomposing nfkdi\n");
+
+	count = 0;
+	for (unichar = 0; unichar != 0x110000; unichar++) {
+		if (!unicode_data[unichar].utf32nfkdi)
+			continue;
+		for (;;) {
+			ret = 1;
+			i = 0;
+			um = unicode_data[unichar].utf32nfkdi;
+			while (*um) {
+				dc = unicode_data[*um].utf32nfkdi;
+				if (dc) {
+					for (j = 0; dc[j]; j++)
+						mapping[i++] = dc[j];
+					ret = 0;
+				} else {
+					mapping[i++] = *um;
+				}
+				um++;
+			}
+			mapping[i++] = 0;
+			if (ret)
+				break;
+			free(unicode_data[unichar].utf32nfkdi);
+			um = malloc(i * sizeof(unsigned int));
+			memcpy(um, mapping, i * sizeof(unsigned int));
+			unicode_data[unichar].utf32nfkdi = um;
+		}
+		/* Add this decomposition to nfkdicf if there is no entry. */
+		if (!unicode_data[unichar].utf32nfkdicf) {
+			um = malloc(i * sizeof(unsigned int));
+			memcpy(um, mapping, i * sizeof(unsigned int));
+			unicode_data[unichar].utf32nfkdicf = um;
+		}
+		if (verbose > 1)
+			print_utf32nfkdi(unichar);
+		count++;
+	}
+	if (verbose > 0)
+		printf("Processed %d entries\n", count);
+}
+
+static void
+nfkdicf_decompose(void)
+{
+	unsigned int unichar;
+	unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+	unsigned int *um;
+	unsigned int *dc;
+	int count;
+	int i;
+	int j;
+	int ret;
+
+	if (verbose > 0)
+		printf("Decomposing nfkdicf\n");
+	count = 0;
+	for (unichar = 0; unichar != 0x110000; unichar++) {
+		if (!unicode_data[unichar].utf32nfkdicf)
+			continue;
+		for (;;) {
+			ret = 1;
+			i = 0;
+			um = unicode_data[unichar].utf32nfkdicf;
+			while (*um) {
+				dc = unicode_data[*um].utf32nfkdicf;
+				if (dc) {
+					for (j = 0; dc[j]; j++)
+						mapping[i++] = dc[j];
+					ret = 0;
+				} else {
+					mapping[i++] = *um;
+				}
+				um++;
+			}
+			mapping[i++] = 0;
+			if (ret)
+				break;
+			free(unicode_data[unichar].utf32nfkdicf);
+			um = malloc(i * sizeof(unsigned int));
+			memcpy(um, mapping, i * sizeof(unsigned int));
+			unicode_data[unichar].utf32nfkdicf = um;
+		}
+		if (verbose > 1)
+			print_utf32nfkdicf(unichar);
+		count++;
+	}
+	if (verbose > 0)
+		printf("Processed %d entries\n", count);
+}
+
+/* ------------------------------------------------------------------ */
+
+int utf8agemax(struct tree *, const char *);
+int utf8nagemax(struct tree *, const char *, size_t);
+int utf8agemin(struct tree *, const char *);
+int utf8nagemin(struct tree *, const char *, size_t);
+ssize_t utf8len(struct tree *, const char *);
+ssize_t utf8nlen(struct tree *, const char *, size_t);
+struct utf8cursor;
+int utf8cursor(struct utf8cursor *, struct tree *, const char *);
+int utf8ncursor(struct utf8cursor *, struct tree *, const char *, size_t);
+int utf8byte(struct utf8cursor *);
+
+/*
+ * Use trie to scan s, touching at most len bytes.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * A non-NULL return guarantees that the UTF-8 sequence starting at s
+ * is well-formed and corresponds to a known unicode code point.  The
+ * shorthand for this will be "is valid UTF-8 unicode".
+ */
+static utf8leaf_t *
+utf8nlookup(struct tree *tree, const char *s, size_t len)
+{
+	utf8trie_t	*trie;
+	int		offlen;
+	int		offset;
+	int		mask;
+	int		node;
+
+	if (!tree || len == 0)
+		return NULL;
+	trie = utf8data + tree->index;
+
+	node = 1;
+	while (node) {
+		offlen = (*trie & OFFLEN) >> OFFLEN_SHIFT;
+		if (*trie & NEXTBYTE) {
+			if (--len == 0)
+				return NULL;
+			s++;
+		}
+		mask = 1 << (*trie & BITNUM);
+		if (*s & mask) {
+			/* Right leg */
+			if (offlen) {
+				/* Right node at offset of trie */
+				node = (*trie & RIGHTNODE);
+				offset = trie[offlen];
+				while (--offlen) {
+					offset <<= 8;
+					offset |= trie[offlen];
+				}
+				trie += offset;
+			} else if (*trie & RIGHTPATH) {
+				/* Right node after this node */
+				node = (*trie & TRIENODE);
+				trie++;
+			} else {
+				/* No right node. */
+				node = 0;
+				trie = NULL;
+			}
+		} else {
+			/* Left leg */
+			if (offlen) {
+				/* Left node after this node. */
+				node = (*trie & LEFTNODE);
+				trie += offlen + 1;
+			} else if (*trie & RIGHTPATH) {
+				/* No left node. */
+				node = 0;
+				trie = NULL;
+			} else {
+				/* Left node after this node */
+				node = (*trie & TRIENODE);
+				trie++;
+			}
+		}
+	}
+	return trie;
+}
+
+/*
+ * Use trie to scan s.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * Forwards to utf8nlookup().
+ */
+static utf8leaf_t *
+utf8lookup(struct tree *tree, const char *s)
+{
+	return utf8nlookup(tree, s, (size_t)-1);
+}
+
+/*
+ * Return the number of bytes used by the current UTF-8 sequence.
+ * Assumes the input points to the first byte of a valid UTF-8
+ * sequence.
+ */
+static inline int
+utf8clen(const char *s)
+{
+	unsigned char c = *s;
+	return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0);
+}
+
+/*
+ * Maximum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if only non-assigned code points are used.
+ */
+int
+utf8agemax(struct tree *tree, const char *s)
+{
+	utf8leaf_t	*leaf;
+	int		age = 0;
+	int		leaf_age;
+
+	if (!tree)
+		return -1;
+	while (*s) {
+		if (!(leaf = utf8lookup(tree, s)))
+			return -1;
+		leaf_age = ages[LEAF_GEN(leaf)];
+		if (leaf_age <= tree->maxage && leaf_age > age)
+			age = leaf_age;
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Minimum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if non-assigned code points are used.
+ */
+int
+utf8agemin(struct tree *tree, const char *s)
+{
+	utf8leaf_t	*leaf;
+	int		age, leaf_age;
+
+	if (!tree)
+		return -1;
+	age = tree->maxage;
+	while (*s) {
+		if (!(leaf = utf8lookup(tree, s)))
+			return -1;
+		leaf_age = ages[LEAF_GEN(leaf)];
+		if (leaf_age <= tree->maxage && leaf_age < age)
+			age = leaf_age;
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Maximum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemax(struct tree *tree, const char *s, size_t len)
+{
+	utf8leaf_t	*leaf;
+	int		age = 0;
+	int		leaf_age;
+
+	if (!tree)
+		return -1;
+	while (len && *s) {
+		if (!(leaf = utf8nlookup(tree, s, len)))
+			return -1;
+		leaf_age = ages[LEAF_GEN(leaf)];
+		if (leaf_age <= tree->maxage && leaf_age > age)
+			age = leaf_age;
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Minimum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemin(struct tree *tree, const char *s, size_t len)
+{
+	utf8leaf_t	*leaf;
+	int		age, leaf_age;
+
+	if (!tree)
+		return -1;
+	age = tree->maxage;
+	while (len && *s) {
+		if (!(leaf = utf8nlookup(tree, s, len)))
+			return -1;
+		leaf_age = ages[LEAF_GEN(leaf)];
+		if (leaf_age <= tree->maxage && leaf_age < age)
+			age = leaf_age;
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Length of the normalization of s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ *
+ * A string of Default_Ignorable_Code_Point has length 0.
+ */
+ssize_t
+utf8len(struct tree *tree, const char *s)
+{
+	utf8leaf_t	*leaf;
+	size_t		ret = 0;
+
+	if (!tree)
+		return -1;
+	while (*s) {
+		if (!(leaf = utf8lookup(tree, s)))
+			return -1;
+		if (ages[LEAF_GEN(leaf)] > tree->maxage)
+			ret += utf8clen(s);
+		else if (LEAF_CCC(leaf) == DECOMPOSE)
+			ret += strlen(LEAF_STR(leaf));
+		else
+			ret += utf8clen(s);
+		s += utf8clen(s);
+	}
+	return ret;
+}
+
+/*
+ * Length of the normalization of s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+ssize_t
+utf8nlen(struct tree *tree, const char *s, size_t len)
+{
+	utf8leaf_t	*leaf;
+	size_t		ret = 0;
+
+	if (!tree)
+		return -1;
+	while (len && *s) {
+		if (!(leaf = utf8nlookup(tree, s, len)))
+			return -1;
+		if (ages[LEAF_GEN(leaf)] > tree->maxage)
+			ret += utf8clen(s);
+		else if (LEAF_CCC(leaf) == DECOMPOSE)
+			ret += strlen(LEAF_STR(leaf));
+		else
+			ret += utf8clen(s);
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return ret;
+}
+
+/*
+ * Cursor structure used by the normalizer.
+ */
+struct utf8cursor {
+	struct tree	*tree;
+	const char	*s;
+	const char	*p;
+	const char	*ss;
+	const char	*sp;
+	unsigned int	len;
+	unsigned int	slen;
+	short int	ccc;
+	short int	nccc;
+	unsigned int	unichar;
+};
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ *   s      : string.
+ *   len    : length of s.
+ *   u8c    : pointer to cursor.
+ *   tree   : struct tree to use for normalization.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int
+utf8ncursor(
+	struct utf8cursor *u8c,
+	struct tree	*tree,
+	const char	*s,
+	size_t		len)
+{
+	if (!tree)
+		return -1;
+	if (!s)
+		return -1;
+	u8c->tree = tree;
+	u8c->s = s;
+	u8c->p = NULL;
+	u8c->ss = NULL;
+	u8c->sp = NULL;
+	u8c->len = len;
+	u8c->slen = 0;
+	u8c->ccc = STOPPER;
+	u8c->nccc = STOPPER;
+	u8c->unichar = 0;
+	/* Check we didn't clobber the maximum length. */
+	if (u8c->len != len)
+		return -1;
+	/* The first byte of s may not be an utf8 continuation. */
+	if (len > 0 && (*s & 0xC0) == 0x80)
+		return -1;
+	return 0;
+}
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ *   s      : NUL-terminated string.
+ *   u8c    : pointer to cursor.
+ *   tree   : struct tree to use for normalization.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int
+utf8cursor(
+	struct utf8cursor *u8c,
+	struct tree	*tree,
+	const char	*s)
+{
+	return utf8ncursor(u8c, tree, s, (unsigned int)-1);
+}
+
+/*
+ * Get one byte from the normalized form of the string described by u8c.
+ *
+ * Returns the byte cast to an unsigned char on success, and -1 on failure.
+ *
+ * The cursor keeps track of the location in the string in u8c->s.
+ * When a character is decomposed, the current location is stored in
+ * u8c->p, and u8c->s is set to the start of the decomposition. Note
+ * that bytes from a decomposition do not count against u8c->len.
+ *
+ * Characters are emitted if they match the current CCC in u8c->ccc.
+ * Hitting end-of-string while u8c->ccc == STOPPER means we're done,
+ * and the function returns 0 in that case.
+ *
+ * Sorting by CCC is done by repeatedly scanning the string.  The
+ * values of u8c->s and u8c->p are stored in u8c->ss and u8c->sp at
+ * the start of the scan.  The first pass finds the lowest CCC to be
+ * emitted and stores it in u8c->nccc, the second pass emits the
+ * characters with this CCC and finds the next lowest CCC. This limits
+ * the number of passes to 1 + the number of different CCCs in the
+ * sequence being scanned.
+ *
+ * Therefore:
+ *  u8c->p  != NULL -> a decomposition is being scanned.
+ *  u8c->ss != NULL -> this is a repeating scan.
+ *  u8c->ccc == -1  -> this is the first scan of a repeating scan.
+ */
+int
+utf8byte(struct utf8cursor *u8c)
+{
+	utf8leaf_t *leaf;
+	int ccc;
+
+	for (;;) {
+		/* Check for the end of a decomposed character. */
+		if (u8c->p && *u8c->s == '\0') {
+			u8c->s = u8c->p;
+			u8c->p = NULL;
+		}
+
+		/* Check for end-of-string. */
+		if (!u8c->p && (u8c->len == 0 || *u8c->s == '\0')) {
+			/* There is no next byte. */
+			if (u8c->ccc == STOPPER)
+				return 0;
+			/* End-of-string during a scan counts as a stopper. */
+			ccc = STOPPER;
+			goto ccc_mismatch;
+		} else if ((*u8c->s & 0xC0) == 0x80) {
+			/* This is a continuation of the current character. */
+			if (!u8c->p)
+				u8c->len--;
+			return (unsigned char)*u8c->s++;
+		}
+
+		/* Look up the data for the current character. */
+		if (u8c->p)
+			leaf = utf8lookup(u8c->tree, u8c->s);
+		else
+			leaf = utf8nlookup(u8c->tree, u8c->s, u8c->len);
+
+		/* No leaf found implies that the input is a binary blob. */
+		if (!leaf)
+			return -1;
+
+		/* Characters that are too new have CCC 0. */
+		if (ages[LEAF_GEN(leaf)] > u8c->tree->maxage) {
+			ccc = STOPPER;
+		} else if ((ccc = LEAF_CCC(leaf)) == DECOMPOSE) {
+			u8c->len -= utf8clen(u8c->s);
+			u8c->p = u8c->s + utf8clen(u8c->s);
+			u8c->s = LEAF_STR(leaf);
+			/* Empty decomposition implies CCC 0. */
+			if (*u8c->s == '\0') {
+				if (u8c->ccc == STOPPER)
+					continue;
+				ccc = STOPPER;
+				goto ccc_mismatch;
+			}
+			leaf = utf8lookup(u8c->tree, u8c->s);
+			ccc = LEAF_CCC(leaf);
+		}
+		u8c->unichar = utf8code(u8c->s);
+
+		/*
+		 * If this is not a stopper, then see if it updates
+		 * the next canonical class to be emitted.
+		 */
+		if (ccc != STOPPER && u8c->ccc < ccc && ccc < u8c->nccc)
+			u8c->nccc = ccc;
+
+		/*
+		 * Return the current byte if this is the current
+		 * combining class.
+		 */
+		if (ccc == u8c->ccc) {
+			if (!u8c->p)
+				u8c->len--;
+			return (unsigned char)*u8c->s++;
+		}
+
+		/* Current combining class mismatch. */
+	ccc_mismatch:
+		if (u8c->nccc == STOPPER) {
+			/*
+			 * Scan forward for the first canonical class
+			 * to be emitted.  Save the position from
+			 * which to restart.
+			 */
+			assert(u8c->ccc == STOPPER);
+			u8c->ccc = MINCCC - 1;
+			u8c->nccc = ccc;
+			u8c->sp = u8c->p;
+			u8c->ss = u8c->s;
+			u8c->slen = u8c->len;
+			if (!u8c->p)
+				u8c->len -= utf8clen(u8c->s);
+			u8c->s += utf8clen(u8c->s);
+		} else if (ccc != STOPPER) {
+			/* Not a stopper, and not the ccc we're emitting. */
+			if (!u8c->p)
+				u8c->len -= utf8clen(u8c->s);
+			u8c->s += utf8clen(u8c->s);
+		} else if (u8c->nccc != MAXCCC + 1) {
+			/* At a stopper, restart for next ccc. */
+			u8c->ccc = u8c->nccc;
+			u8c->nccc = MAXCCC + 1;
+			u8c->s = u8c->ss;
+			u8c->p = u8c->sp;
+			u8c->len = u8c->slen;
+		} else {
+			/* All done, proceed from here. */
+			u8c->ccc = STOPPER;
+			u8c->nccc = STOPPER;
+			u8c->sp = NULL;
+			u8c->ss = NULL;
+			u8c->slen = 0;
+		}
+	}
+}
+
+/* ------------------------------------------------------------------ */
+
+static int
+normalize_line(struct tree *tree)
+{
+	char *s;
+	char *t;
+	int c;
+	struct utf8cursor u8c;
+
+	/* First test: null-terminated string. */
+	s = buf2;
+	t = buf3;
+	if (utf8cursor(&u8c, tree, s))
+		return -1;
+	while ((c = utf8byte(&u8c)) > 0)
+		if (c != (unsigned char)*t++)
+			return -1;
+	if (c < 0)
+		return -1;
+	if (*t != 0)
+		return -1;
+
+	/* Second test: length-limited string. */
+	s = buf2;
+	/* Poison the byte after the NUL; it must never be read. */
+	s[strlen(s) + 1] = -1;
+	t = buf3;
+	if (utf8ncursor(&u8c, tree, s, strlen(s)))
+		return -1;
+	while ((c = utf8byte(&u8c)) > 0)
+		if (c != (unsigned char)*t++)
+			return -1;
+	if (c < 0)
+		return -1;
+	if (*t != 0)
+		return -1;
+
+	return 0;
+}
+
+static void
+normalization_test(void)
+{
+	FILE *file;
+	unsigned int unichar;
+	struct unicode_data *data;
+	char *s;
+	char *t;
+	int ret;
+	int ignorables;
+	int tests = 0;
+	int failures = 0;
+
+	if (verbose > 0)
+		printf("Parsing %s\n", test_name);
+	/* Step one, read data from file. */
+	file = fopen(test_name, "r");
+	if (!file)
+		open_fail(test_name, errno);
+
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%[^;];%*[^;];%*[^;];%*[^;];%[^;];",
+			     buf0, buf1);
+		if (ret != 2 || *line == '#')
+			continue;
+		s = buf0;
+		t = buf2;
+		while (*s) {
+			unichar = strtoul(s, &s, 16);
+			t += utf8key(unichar, t);
+		}
+		*t = '\0';
+
+		ignorables = 0;
+		s = buf1;
+		t = buf3;
+		while (*s) {
+			unichar = strtoul(s, &s, 16);
+			data = &unicode_data[unichar];
+			if (data->utf8nfkdi && !*data->utf8nfkdi)
+				ignorables = 1;
+			else
+				t += utf8key(unichar, t);
+		}
+		*t = '\0';
+
+		tests++;
+		if (normalize_line(nfkdi_tree) < 0) {
+			printf("\nline %s -> %s", buf0, buf1);
+			if (ignorables)
+				printf(" (ignorables removed)");
+			printf(" failure\n");
+			failures++;
+		}
+	}
+	fclose(file);
+	if (verbose > 0)
+		printf("Ran %d tests with %d failures\n", tests, failures);
+	if (failures)
+		file_fail(test_name);
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+write_file(void)
+{
+	FILE *file;
+	int i;
+	int j;
+	int t;
+	int gen;
+
+	if (verbose > 0)
+		printf("Writing %s\n", utf8_name);
+	file = fopen(utf8_name, "w");
+	if (!file)
+		open_fail(utf8_name, errno);
+
+	fprintf(file, "/* This file is generated code, do not edit. */\n");
+	fprintf(file, "#ifndef __INCLUDED_FROM_UTF8NORM_C__\n");
+	fprintf(file, "#error Only utf8norm.c may include this file.\n");
+	fprintf(file, "#endif\n");
+	fprintf(file, "\n");
+	fprintf(file, "static const unsigned int utf8vers = %#x;\n",
+		unicode_maxage);
+	fprintf(file, "\n");
+	fprintf(file, "static const unsigned int utf8agetab[] = {\n");
+	for (i = 0; i != ages_count; i++)
+		fprintf(file, "\t%#x%s\n", ages[i],
+			ages[i] == unicode_maxage ? "" : ",");
+	fprintf(file, "};\n");
+	fprintf(file, "\n");
+	fprintf(file, "static const struct utf8data utf8nfkdicfdata[] = {\n");
+	t = 0;
+	for (gen = 0; gen < ages_count; gen++) {
+		fprintf(file, "\t{ %#x, %d }%s\n",
+			ages[gen], trees[t].index,
+			ages[gen] == unicode_maxage ? "" : ",");
+		if (trees[t].maxage == ages[gen])
+			t += 2;
+	}
+	fprintf(file, "};\n");
+	fprintf(file, "\n");
+	fprintf(file, "static const struct utf8data utf8nfkdidata[] = {\n");
+	t = 1;
+	for (gen = 0; gen < ages_count; gen++) {
+		fprintf(file, "\t{ %#x, %d }%s\n",
+			ages[gen], trees[t].index,
+			ages[gen] == unicode_maxage ? "" : ",");
+		if (trees[t].maxage == ages[gen])
+			t += 2;
+	}
+	fprintf(file, "};\n");
+	fprintf(file, "\n");
+	fprintf(file, "static const unsigned char utf8data[%zd] = {\n",
+		utf8data_size);
+	t = 0;
+	for (i = 0; i != utf8data_size; i += 16) {
+		if (i == trees[t].index) {
+			fprintf(file, "\t/* %s_%x */\n",
+				trees[t].type, trees[t].maxage);
+			if (t < trees_count-1)
+				t++;
+		}
+		fprintf(file, "\t");
+		for (j = i; j != i + 16; j++)
+			fprintf(file, "0x%.2x%s", utf8data[j],
+				(j < utf8data_size -1 ? "," : ""));
+		fprintf(file, "\n");
+	}
+	fprintf(file, "};\n");
+	fclose(file);
+}
+
+/* ------------------------------------------------------------------ */
+
+int
+main(int argc, char *argv[])
+{
+	unsigned int unichar;
+	int opt;
+
+	argv0 = argv[0];
+
+	while ((opt = getopt(argc, argv, "a:c:d:f:hn:o:p:t:v")) != -1) {
+		switch (opt) {
+		case 'a':
+			age_name = optarg;
+			break;
+		case 'c':
+			ccc_name = optarg;
+			break;
+		case 'd':
+			data_name = optarg;
+			break;
+		case 'f':
+			fold_name = optarg;
+			break;
+		case 'n':
+			norm_name = optarg;
+			break;
+		case 'o':
+			utf8_name = optarg;
+			break;
+		case 'p':
+			prop_name = optarg;
+			break;
+		case 't':
+			test_name = optarg;
+			break;
+		case 'v':
+			verbose++;
+			break;
+		case 'h':
+			help();
+			exit(0);
+		default:
+			usage();
+		}
+	}
+
+	if (verbose > 1)
+		help();
+	for (unichar = 0; unichar != 0x110000; unichar++)
+		unicode_data[unichar].code = unichar;
+	age_init();
+	ccc_init();
+	nfkdi_init();
+	nfkdicf_init();
+	ignore_init();
+	corrections_init();
+	hangul_decompose();
+	nfkdi_decompose();
+	nfkdicf_decompose();
+	utf8_init();
+	trees_init();
+	trees_populate();
+	trees_reduce();
+	trees_verify();
+	/* Prevent "unused function" warning. */
+	(void)lookup(nfkdi_tree, " ");
+	if (verbose > 2)
+		tree_walk(nfkdi_tree);
+	if (verbose > 2)
+		tree_walk(nfkdicf_tree);
+	normalization_test();
+	write_file();
+
+	return 0;
+}
-- 
1.7.12.4
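
For reference, the arithmetic that hangul_decompose() performs in the
patch above can be sketched as a standalone function.  This is only a
minimal sketch: demo_hangul_decompose() and its calling convention are
made up for illustration and are not part of the patch; the constants
are the ones listed in the comment above hangul_decompose().

```c
#include <assert.h>
#include <stddef.h>

/* Constants from the Hangul comment (Section 3.12 of Unicode 6.3.0). */
#define SBASE	0xAC00u
#define LBASE	0x1100u
#define VBASE	0x1161u
#define TBASE	0x11A7u
#define VCOUNT	21u
#define TCOUNT	28u
#define NCOUNT	(VCOUNT * TCOUNT)	/* 588 */
#define SCOUNT	(19u * NCOUNT)		/* 11172 */

/*
 * Hypothetical helper, for illustration only.
 * Writes the full decomposition of s into out[] (room for 3 entries)
 * and returns the number of code points written, or 0 if s is not a
 * precomposed Hangul syllable.
 */
size_t
demo_hangul_decompose(unsigned int s, unsigned int out[3])
{
	unsigned int si;
	unsigned int ti;
	size_t n = 0;

	if (s < SBASE || s >= SBASE + SCOUNT)
		return 0;
	si = s - SBASE;
	ti = si % TCOUNT;
	out[n++] = LBASE + si / NCOUNT;			/* leading consonant */
	out[n++] = VBASE + (si % NCOUNT) / TCOUNT;	/* vowel */
	if (ti)
		out[n++] = TBASE + ti;			/* trailing consonant */
	return n;
}
```

With these definitions U+AC00 yields <U+1100, U+1161>, and U+D55C
yields <U+1112, U+1161, U+11AB>.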


* [PATCH 19/35] xfsprogs: add supporting code for UTF-8.
  2014-10-03 21:47 [RFC v3] Unicode/UTF-8 support for XFS Ben Myers
                   ` (17 preceding siblings ...)
  2014-10-03 22:07 ` [PATCH 18/35] xfsprogs: add trie generator for UTF-8 Ben Myers
@ 2014-10-03 22:07 ` Ben Myers
  2014-10-03 22:08 ` [PATCH 20/35] xfsprogs: reduce the size of utf8data[] Ben Myers
                   ` (15 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Ben Myers @ 2014-10-03 22:07 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: olaf, xfs

From: Olaf Weber <olaf@sgi.com>

Supporting functions for UTF-8 normalization are in utf8norm.c with the
header utf8norm.h. Two normalization forms are supported: nfkdi and nfkdicf.

  nfkdi:
   - Apply unicode normalization form NFKD.
   - Remove any Default_Ignorable_Code_Point.

  nfkdicf:
   - Apply unicode normalization form NFKD.
   - Remove any Default_Ignorable_Code_Point.
   - Apply a full casefold (C + F).

For the purposes of the code, a string is valid UTF-8 if:

 - The values encoded are 0x1..0x10FFFF.
 - The surrogate codepoints 0xD800..0xDFFF are not encoded.
 - The shortest possible encoding is used for all values.
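
The three rules above can be sketched as a table-free check.  This is
an illustration only, and an assumption on my part about how to state
the rules in code: the patch itself validates by walking the generated
trie, and demo_utf8_valid() is a made-up name that does not appear in
utf8norm.c.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, for illustration only.
 * Returns 1 if the NUL-terminated string s satisfies the three rules
 * above, 0 otherwise.
 */
int
demo_utf8_valid(const char *s)
{
	const unsigned char *p = (const unsigned char *)s;
	unsigned int c;
	unsigned int min;
	int n;

	while (*p) {
		if (*p < 0x80) {
			/* Plain ASCII. */
			p++;
			continue;
		} else if ((*p & 0xE0) == 0xC0) {
			c = *p & 0x1F;
			n = 1;
			min = 0x80;
		} else if ((*p & 0xF0) == 0xE0) {
			c = *p & 0x0F;
			n = 2;
			min = 0x800;
		} else if ((*p & 0xF8) == 0xF0) {
			c = *p & 0x07;
			n = 3;
			min = 0x10000;
		} else {
			/* Stray continuation or invalid lead byte. */
			return 0;
		}
		p++;
		while (n--) {
			/* A NUL here also fails, so we never overrun. */
			if ((*p & 0xC0) != 0x80)
				return 0;
			c = (c << 6) | (*p++ & 0x3F);
		}
		if (c < min)
			return 0;	/* not the shortest encoding */
		if (c > 0x10FFFF)
			return 0;	/* outside 0x1..0x10FFFF */
		if (c >= 0xD800 && c <= 0xDFFF)
			return 0;	/* surrogate */
	}
	return 1;
}
```

The value 0 never needs an explicit check here, because a NUL byte
terminates the string and the overlong form C0 80 fails the
shortest-encoding test.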

The supporting functions work on null-terminated strings (utf8 prefix) and
on length-limited strings (utf8n prefix).
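
The pairing of the two prefixes can be sketched with a pair of made-up
functions.  demo_count() and demo_ncount() are hypothetical names used
only for this illustration; they are not part of utf8norm.h.  The
NUL-terminated variant forwards to the length-limited one with the
largest possible length, in the same way utf8lookup() forwards to
utf8nlookup().

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical example: count the characters in s, touching at most
 * len bytes and stopping early at a NUL.  The input is assumed to be
 * valid UTF-8, so every byte that is not a continuation byte
 * (10xxxxxx) starts a new character.
 */
size_t
demo_ncount(const char *s, size_t len)
{
	size_t n = 0;

	while (len && *s) {
		if (((unsigned char)*s & 0xC0) != 0x80)
			n++;
		s++;
		len--;
	}
	return n;
}

/* NUL-terminated variant: forward with the largest possible length. */
size_t
demo_count(const char *s)
{
	return demo_ncount(s, (size_t)-1);
}
```

Both variants stop at a NUL byte; the length-limited one additionally
stops once len bytes have been examined.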

Signed-off-by: Olaf Weber <olaf@sgi.com>

[v2: synced from kernel version, minor updates to remove/add includes,
     and remove linux module exports. --bpm]
---
 include/utf8norm.h | 111 +++++++++
 libxfs/Makefile    |   1 +
 libxfs/utf8norm.c  | 643 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 755 insertions(+)
 create mode 100644 include/utf8norm.h
 create mode 100644 libxfs/utf8norm.c

diff --git a/include/utf8norm.h b/include/utf8norm.h
new file mode 100644
index 0000000..cd77580
--- /dev/null
+++ b/include/utf8norm.h
@@ -0,0 +1,111 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#ifndef UTF8NORM_H
+#define UTF8NORM_H
+
+/* An opaque type used to determine the normalization in use. */
+typedef const struct utf8data *utf8data_t;
+
+/* Encoding a unicode version number as a single unsigned int. */
+#define UNICODE_MAJ_SHIFT		(16)
+#define UNICODE_MIN_SHIFT		(8)
+
+#define UNICODE_AGE(MAJ,MIN,REV)			\
+	(((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) |	\
+	 ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) |	\
+	 ((unsigned int)(REV)))
+
+/* Returns 1 if the unicode version is supported by the data tables. */
+extern int utf8version_is_supported(unsigned int);
+
+/*
+ * Look for the correct utf8data_t for a unicode version.
+ * Returns NULL if the version requested is too new.
+ *
+ * Two normalization forms are supported: nfkdi and nfkdicf.
+ *
+ * nfkdi:
+ *  - Apply unicode normalization form NFKD.
+ *  - Remove any Default_Ignorable_Code_Point.
+ *
+ * nfkdicf:
+ *  - Apply unicode normalization form NFKD.
+ *  - Remove any Default_Ignorable_Code_Point.
+ *  - Apply a full casefold (C + F).
+ */
+extern utf8data_t utf8nfkdi(unsigned int);
+extern utf8data_t utf8nfkdicf(unsigned int);
+
+/*
+ * Determine the maximum age of any unicode character in the string.
+ * Returns 0 if only unassigned code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern int utf8agemax(utf8data_t, const char *);
+extern int utf8nagemax(utf8data_t, const char *, size_t);
+
+/*
+ * Determine the minimum age of any unicode character in the string.
+ * Returns 0 if any unassigned code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern int utf8agemin(utf8data_t, const char *);
+extern int utf8nagemin(utf8data_t, const char *, size_t);
+
+/*
+ * Determine the length of the normalized form of the string,
+ * excluding any terminating NUL byte.
+ * Returns 0 if only ignorable code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern ssize_t utf8len(utf8data_t, const char *);
+extern ssize_t utf8nlen(utf8data_t, const char *, size_t);
+
+/*
+ * Cursor structure used by the normalizer.
+ */
+struct utf8cursor {
+	utf8data_t	data;
+	const char	*s;
+	const char	*p;
+	const char	*ss;
+	const char	*sp;
+	unsigned int	len;
+	unsigned int	slen;
+	short int	ccc;
+	short int	nccc;
+};
+
+/*
+ * Initialize a utf8cursor to normalize a string.
+ * Returns 0 on success.
+ * Returns -1 on failure.
+ */
+extern int utf8cursor(struct utf8cursor *, utf8data_t, const char *);
+extern int utf8ncursor(struct utf8cursor *, utf8data_t, const char *, size_t);
+
+/*
+ * Get the next byte in the normalization.
+ * Returns a value > 0 && < 256 on success.
+ * Returns 0 when the end of the normalization is reached.
+ * Returns -1 if the string being normalized is not valid UTF-8.
+ */
+extern int utf8byte(struct utf8cursor *);
+
+#endif /* UTF8NORM_H */
diff --git a/libxfs/Makefile b/libxfs/Makefile
index ae15a5d..a1e85ef 100644
--- a/libxfs/Makefile
+++ b/libxfs/Makefile
@@ -14,6 +14,7 @@ HFILES = xfs.h init.h xfs_dir2_priv.h crc32defs.h crc32table.h
 CFILES = cache.c \
 	crc32.c \
 	init.c kmem.c logitem.c radix-tree.c rdwr.c trans.c util.c \
+	utf8norm.c \
 	xfs_alloc.c \
 	xfs_alloc_btree.c \
 	xfs_attr.c \
diff --git a/libxfs/utf8norm.c b/libxfs/utf8norm.c
new file mode 100644
index 0000000..8076e99
--- /dev/null
+++ b/libxfs/utf8norm.c
@@ -0,0 +1,643 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#include "xfs.h"
+#include "xfs_types.h"
+#include "utf8norm.h"
+
+struct utf8data {
+	unsigned int maxage;
+	unsigned int offset;
+};
+
+#define __INCLUDED_FROM_UTF8NORM_C__
+#include "utf8data.h"
+#undef __INCLUDED_FROM_UTF8NORM_C__
+
+int
+utf8version_is_supported(unsigned int sb_utf8version)
+{
+	int i = sizeof(utf8agetab)/sizeof(utf8agetab[0]) - 1;
+
+	while (i >= 0) {
+		if (sb_utf8version == utf8agetab[i])
+			return 1;
+		i--;
+	}
+	return 0;
+}
+
+/*
+ * UTF-8 valid ranges.
+ *
+ * The UTF-8 encoding spreads the bits of a 32bit word over several
+ * bytes. This table gives the ranges that can be held and how they'd
+ * be represented.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000000 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000000 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * There is an additional requirement on UTF-8, in that only the
+ * shortest representation of a 32bit value is to be used.  A decoder
+ * must not decode sequences that do not satisfy this requirement.
+ * Thus the allowed ranges have a lower bound.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000080 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000800 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00010000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00200000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x04000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * Actual unicode characters are limited to the range 0x0 - 0x10FFFF,
+ * 17 planes of 65536 values.  This limits the sequences actually seen
+ * even more, to just the following.
+ *
+ *          0 -     0x7F: 0                   - 0x7F
+ *       0x80 -    0x7FF: 0xC2 0x80           - 0xDF 0xBF
+ *      0x800 -   0xFFFF: 0xE0 0xA0 0x80      - 0xEF 0xBF 0xBF
+ *    0x10000 - 0x10FFFF: 0xF0 0x90 0x80 0x80 - 0xF4 0x8F 0xBF 0xBF
+ *
+ * Within those ranges the surrogates 0xD800 - 0xDFFF are not allowed.
+ *
+ * Note that the longest sequence seen with valid usage is 4 bytes,
+ * the same as a single UTF-32 character.  This makes the UTF-8
+ * representation of Unicode never larger than UTF-32.
+ *
+ * The shortest sequence requirement was introduced by:
+ *    Corrigendum #1: UTF-8 Shortest Form
+ * It can be found here:
+ *    http://www.unicode.org/versions/corrigendum1.html
+ *
+ */
+
+/*
+ * Return the number of bytes used by the current UTF-8 sequence.
+ * Assumes the input points to the first byte of a valid UTF-8
+ * sequence.
+ */
+static inline int
+utf8clen(const char *s)
+{
+	unsigned char c = *s;
+	return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0);
+}
+
+/*
+ * utf8trie_t
+ *
+ * A compact binary tree, used to decode UTF-8 characters.
+ *
+ * Internal nodes are one byte for the node itself, and up to three
+ * bytes for an offset into the tree.  The first byte contains the
+ * following information:
+ *  NEXTBYTE  - flag        - advance to next byte if set
+ *  BITNUM    - 3 bit field - the bit number to be tested
+ *  OFFLEN    - 2 bit field - number of bytes in the offset
+ * if offlen == 0 (non-branching node)
+ *  RIGHTPATH - 1 bit field - set if the following node is for the
+ *                            right-hand path (tested bit is set)
+ *  TRIENODE  - 1 bit field - set if the following node is an internal
+ *                            node, otherwise it is a leaf node
+ * if offlen != 0 (branching node)
+ *  LEFTNODE  - 1 bit field - set if the left-hand node is internal
+ *  RIGHTNODE - 1 bit field - set if the right-hand node is internal
+ *
+ * Due to the way utf8 works, there cannot be branching nodes with
+ * NEXTBYTE set, and moreover those nodes always have a righthand
+ * descendant.
+ */
+typedef const unsigned char utf8trie_t;
+#define BITNUM		0x07
+#define NEXTBYTE	0x08
+#define OFFLEN		0x30
+#define OFFLEN_SHIFT	4
+#define RIGHTPATH	0x40
+#define TRIENODE	0x80
+#define RIGHTNODE	0x40
+#define LEFTNODE	0x80
+
+/*
+ * utf8leaf_t
+ *
+ * The leaves of the trie are embedded in the trie, and so have the same
+ * underlying datatype: unsigned char.
+ *
+ * leaf[0]: The unicode version, stored as a generation number that is
+ *          an index into utf8agetab[].  With this we can filter code
+ *          points based on the unicode version in which they were
+ *          defined.  The CCC of a non-defined code point is 0.
+ * leaf[1]: Canonical Combining Class. During normalization, we need
+ *          to do a stable sort into ascending order of all characters
+ *          with a non-zero CCC that occur between two characters with
+ *          a CCC of 0, or at the beginning or end of a string.
+ *          The unicode standard guarantees that all CCC values are
+ *          between 0 and 254 inclusive, which leaves 255 available as
+ *          a special value.
+ *          Code points with CCC 0 are known as stoppers.
+ * leaf[2]: Decomposition. If leaf[1] == 255, then leaf[2] is the
+ *          start of a NUL-terminated string that is the decomposition
+ *          of the character.
+ *          The CCC of a decomposable character is the same as the CCC
+ *          of the first character of its decomposition.
+ *          Some characters decompose as the empty string: these are
+ *          characters with the Default_Ignorable_Code_Point property.
+ *          These do affect normalization, as they all have CCC 0.
+ *
+ * The decompositions in the trie have been fully expanded.
+ *
+ * Casefolding, if applicable, is also done using decompositions.
+ *
+ * The trie is constructed in such a way that leaves exist for all
+ * UTF-8 sequences that match the criteria from the "UTF-8 valid
+ * ranges" comment above, and only for those sequences.  Therefore a
+ * lookup in the trie can be used to validate the UTF-8 input.
+ */
+typedef const unsigned char utf8leaf_t;
+
+#define LEAF_GEN(LEAF)	((LEAF)[0])
+#define LEAF_CCC(LEAF)	((LEAF)[1])
+#define LEAF_STR(LEAF)	((const char*)((LEAF) + 2))
+
+#define MINCCC		(0)
+#define MAXCCC		(254)
+#define STOPPER		(0)
+#define	DECOMPOSE	(255)
+
+/*
+ * Use trie to scan s, touching at most len bytes.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * A non-NULL return guarantees that the UTF-8 sequence starting at s
+ * is well-formed and corresponds to a known unicode code point.  The
+ * shorthand for this will be "is valid UTF-8 unicode".
+ */
+static utf8leaf_t *
+utf8nlookup(utf8data_t data, const char *s, size_t len)
+{
+	utf8trie_t	*trie = utf8data + data->offset;
+	int		offlen;
+	int		offset;
+	int		mask;
+	int		node;
+
+	if (!data)
+		return NULL;
+	if (len == 0)
+		return NULL;
+	node = 1;
+	while (node) {
+		offlen = (*trie & OFFLEN) >> OFFLEN_SHIFT;
+		if (*trie & NEXTBYTE) {
+			if (--len == 0)
+				return NULL;
+			s++;
+		}
+		mask = 1 << (*trie & BITNUM);
+		if (*s & mask) {
+			/* Right leg */
+			if (offlen) {
+				/* Right node at offset of trie */
+				node = (*trie & RIGHTNODE);
+				offset = trie[offlen];
+				while (--offlen) {
+					offset <<= 8;
+					offset |= trie[offlen];
+				}
+				trie += offset;
+			} else if (*trie & RIGHTPATH) {
+				/* Right node after this node */
+				node = (*trie & TRIENODE);
+				trie++;
+			} else {
+				/* No right node. */
+				node = 0;
+				trie = NULL;
+			}
+		} else {
+			/* Left leg */
+			if (offlen) {
+				/* Left node after this node. */
+				node = (*trie & LEFTNODE);
+				trie += offlen + 1;
+			} else if (*trie & RIGHTPATH) {
+				/* No left node. */
+				node = 0;
+				trie = NULL;
+			} else {
+				/* Left node after this node */
+				node = (*trie & TRIENODE);
+				trie++;
+			}
+		}
+	}
+	return trie;
+}
+
+/*
+ * Use trie to scan s.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * Forwards to utf8nlookup().
+ */
+static utf8leaf_t *
+utf8lookup(utf8data_t data, const char *s)
+{
+	return utf8nlookup(data, s, (size_t)-1);
+}
+
+/*
+ * Maximum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if only non-assigned code points are used.
+ */
+int
+utf8agemax(utf8data_t data, const char *s)
+{
+	utf8leaf_t	*leaf;
+	int		age = 0;
+	int		leaf_age;
+
+	if (!data)
+		return -1;
+	while (*s) {
+		if (!(leaf = utf8lookup(data, s)))
+			return -1;
+		leaf_age = utf8agetab[LEAF_GEN(leaf)];
+		if (leaf_age <= data->maxage && leaf_age > age)
+			age = leaf_age;
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Minimum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if non-assigned code points are used.
+ */
+int
+utf8agemin(utf8data_t data, const char *s)
+{
+	utf8leaf_t	*leaf;
+	int		age;
+	int		leaf_age;
+
+	if (!data)
+		return -1;
+	age = data->maxage;
+	while (*s) {
+		if (!(leaf = utf8lookup(data, s)))
+			return -1;
+		leaf_age = utf8agetab[LEAF_GEN(leaf)];
+		if (leaf_age <= data->maxage && leaf_age < age)
+			age = leaf_age;
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Maximum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemax(utf8data_t data, const char *s, size_t len)
+{
+	utf8leaf_t	*leaf;
+	int		age = 0;
+	int		leaf_age;
+
+	if (!data)
+		return -1;
+        while (len && *s) {
+		if (!(leaf = utf8nlookup(data, s, len)))
+			return -1;
+		leaf_age = utf8agetab[LEAF_GEN(leaf)];
+		if (leaf_age <= data->maxage && leaf_age > age)
+			age = leaf_age;
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Minimum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemin(utf8data_t data, const char *s, size_t len)
+{
+	utf8leaf_t	*leaf;
+	int		leaf_age;
+	int		age;
+
+	if (!data)
+		return -1;
+	age = data->maxage;
+	while (len && *s) {
+		if (!(leaf = utf8nlookup(data, s, len)))
+			return -1;
+		leaf_age = utf8agetab[LEAF_GEN(leaf)];
+		if (leaf_age <= data->maxage && leaf_age < age)
+			age = leaf_age;
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return age;
+}
+
+/*
+ * Length of the normalization of s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ *
+ * A string of Default_Ignorable_Code_Point has length 0.
+ */
+ssize_t
+utf8len(utf8data_t data, const char *s)
+{
+	utf8leaf_t	*leaf;
+	size_t		ret = 0;
+
+	if (!data)
+		return -1;
+	while (*s) {
+		if (!(leaf = utf8lookup(data, s)))
+			return -1;
+		if (utf8agetab[LEAF_GEN(leaf)] > data->maxage)
+			ret += utf8clen(s);
+		else if (LEAF_CCC(leaf) == DECOMPOSE)
+			ret += strlen(LEAF_STR(leaf));
+		else
+			ret += utf8clen(s);
+		s += utf8clen(s);
+	}
+	return ret;
+}
+
+/*
+ * Length of the normalization of s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+ssize_t
+utf8nlen(utf8data_t data, const char *s, size_t len)
+{
+	utf8leaf_t	*leaf;
+	size_t		ret = 0;
+
+	if (!data)
+		return -1;
+	while (len && *s) {
+		if (!(leaf = utf8nlookup(data, s, len)))
+			return -1;
+		if (utf8agetab[LEAF_GEN(leaf)] > data->maxage)
+			ret += utf8clen(s);
+		else if (LEAF_CCC(leaf) == DECOMPOSE)
+			ret += strlen(LEAF_STR(leaf));
+		else
+			ret += utf8clen(s);
+		len -= utf8clen(s);
+		s += utf8clen(s);
+	}
+	return ret;
+}
+
+/*
+ * Set up a utf8cursor for use by utf8byte().
+ *
+ *   u8c    : pointer to cursor.
+ *   data   : utf8data_t to use for normalization.
+ *   s      : string.
+ *   len    : length of s.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int
+utf8ncursor(
+	struct utf8cursor *u8c,
+	utf8data_t	data,
+	const char	*s,
+	size_t		len)
+{
+	if (!data)
+		return -1;
+	if (!s)
+		return -1;
+	u8c->data = data;
+	u8c->s = s;
+	u8c->p = NULL;
+	u8c->ss = NULL;
+	u8c->sp = NULL;
+	u8c->len = len;
+	u8c->slen = 0;
+	u8c->ccc = STOPPER;
+	u8c->nccc = STOPPER;
+	/* Check we didn't clobber the maximum length. */
+	if (u8c->len != len)
+		return -1;
+	/* The first byte of s may not be a utf8 continuation. */
+	if (len > 0 && (*s & 0xC0) == 0x80)
+		return -1;
+	return 0;
+}
+
+/*
+ * Set up a utf8cursor for use by utf8byte().
+ *
+ *   u8c    : pointer to cursor.
+ *   data   : utf8data_t to use for normalization.
+ *   s      : NUL-terminated string.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int
+utf8cursor(
+	struct utf8cursor *u8c,
+	utf8data_t	data,
+	const char	*s)
+{
+	return utf8ncursor(u8c, data, s, (unsigned int)-1);
+}
+
+/*
+ * Get one byte from the normalized form of the string described by u8c.
+ *
+ * Returns the byte cast to an unsigned char on success, and -1 on failure.
+ *
+ * The cursor keeps track of the location in the string in u8c->s.
+ * When a character is decomposed, the current location is stored in
+ * u8c->p, and u8c->s is set to the start of the decomposition. Note
+ * that bytes from a decomposition do not count against u8c->len.
+ *
+ * Characters are emitted if they match the current CCC in u8c->ccc.
+ * Hitting end-of-string while u8c->ccc == STOPPER means we're done,
+ * and the function returns 0 in that case.
+ *
+ * Sorting by CCC is done by repeatedly scanning the string.  The
+ * values of u8c->s and u8c->p are stored in u8c->ss and u8c->sp at
+ * the start of the scan.  The first pass finds the lowest CCC to be
+ * emitted and stores it in u8c->nccc, the second pass emits the
+ * characters with this CCC and finds the next lowest CCC. This limits
+ * the number of passes to 1 + the number of different CCCs in the
+ * sequence being scanned.
+ *
+ * Therefore:
+ *  u8c->p  != NULL -> a decomposition is being scanned.
+ *  u8c->ss != NULL -> this is a repeating scan.
+ *  u8c->ccc == -1   -> this is the first scan of a repeating scan.
+ */
+int
+utf8byte(struct utf8cursor *u8c)
+{
+	utf8leaf_t *leaf;
+	int ccc;
+
+	for (;;) {
+		/* Check for the end of a decomposed character. */
+		if (u8c->p && *u8c->s == '\0') {
+			u8c->s = u8c->p;
+			u8c->p = NULL;
+		}
+
+		/* Check for end-of-string. */
+		if (!u8c->p && (u8c->len == 0 || *u8c->s == '\0')) {
+			/* There is no next byte. */
+			if (u8c->ccc == STOPPER)
+				return 0;
+			/* End-of-string during a scan counts as a stopper. */
+			ccc = STOPPER;
+			goto ccc_mismatch;
+		} else if ((*u8c->s & 0xC0) == 0x80) {
+			/* This is a continuation of the current character. */
+			if (!u8c->p)
+				u8c->len--;
+			return (unsigned char)*u8c->s++;
+		}
+
+		/* Look up the data for the current character. */
+		if (u8c->p)
+			leaf = utf8lookup(u8c->data, u8c->s);
+		else
+			leaf = utf8nlookup(u8c->data, u8c->s, u8c->len);
+
+		/* No leaf found implies that the input is a binary blob. */
+		if (!leaf)
+			return -1;
+
+		/* Characters that are too new have CCC 0. */
+		if (utf8agetab[LEAF_GEN(leaf)] > u8c->data->maxage) {
+			ccc = STOPPER;
+		} else if ((ccc = LEAF_CCC(leaf)) == DECOMPOSE) {
+			u8c->len -= utf8clen(u8c->s);
+			u8c->p = u8c->s + utf8clen(u8c->s);
+			u8c->s = LEAF_STR(leaf);
+			/* Empty decomposition implies CCC 0. */
+			if (*u8c->s == '\0') {
+				if (u8c->ccc == STOPPER)
+					continue;
+				ccc = STOPPER;
+				goto ccc_mismatch;
+			}
+			leaf = utf8lookup(u8c->data, u8c->s);
+			ccc = LEAF_CCC(leaf);
+		}
+
+		/*
+		 * If this is not a stopper, then see if it updates
+		 * the next canonical class to be emitted.
+		 */
+		if (ccc != STOPPER && u8c->ccc < ccc && ccc < u8c->nccc)
+			u8c->nccc = ccc;
+
+		/*
+		 * Return the current byte if this is the current
+		 * combining class.
+		 */
+		if (ccc == u8c->ccc) {
+			if (!u8c->p)
+				u8c->len--;
+			return (unsigned char)*u8c->s++;
+		}
+
+		/* Current combining class mismatch. */
+	ccc_mismatch:
+		if (u8c->nccc == STOPPER) {
+			/*
+			 * Scan forward for the first canonical class
+			 * to be emitted.  Save the position from
+			 * which to restart.
+			 */
+			u8c->ccc = MINCCC - 1;
+			u8c->nccc = ccc;
+			u8c->sp = u8c->p;
+			u8c->ss = u8c->s;
+			u8c->slen = u8c->len;
+			if (!u8c->p)
+				u8c->len -= utf8clen(u8c->s);
+			u8c->s += utf8clen(u8c->s);
+		} else if (ccc != STOPPER) {
+			/* Not a stopper, and not the ccc we're emitting. */
+			if (!u8c->p)
+				u8c->len -= utf8clen(u8c->s);
+			u8c->s += utf8clen(u8c->s);
+		} else if (u8c->nccc != MAXCCC + 1) {
+			/* At a stopper, restart for next ccc. */
+			u8c->ccc = u8c->nccc;
+			u8c->nccc = MAXCCC + 1;
+			u8c->s = u8c->ss;
+			u8c->p = u8c->sp;
+			u8c->len = u8c->slen;
+		} else {
+			/* All done, proceed from here. */
+			u8c->ccc = STOPPER;
+			u8c->nccc = STOPPER;
+			u8c->sp = NULL;
+			u8c->ss = NULL;
+			u8c->slen = 0;
+		}
+	}
+}
+
+const struct utf8data *
+utf8nfkdi(unsigned int maxage)
+{
+	int i = sizeof(utf8nfkdidata)/sizeof(utf8nfkdidata[0]) - 1;
+
+	while (maxage < utf8nfkdidata[i].maxage)
+		i--;
+	if (maxage > utf8nfkdidata[i].maxage)
+		return NULL;
+	return &utf8nfkdidata[i];
+}
+
+const struct utf8data *
+utf8nfkdicf(unsigned int maxage)
+{
+	int i = sizeof(utf8nfkdicfdata)/sizeof(utf8nfkdicfdata[0]) - 1;
+
+	while (maxage < utf8nfkdicfdata[i].maxage)
+		i--;
+	if (maxage > utf8nfkdicfdata[i].maxage)
+		return NULL;
+	return &utf8nfkdicfdata[i];
+}
-- 
1.7.12.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

* [PATCH 20/35] xfsprogs: reduce the size of utf8data[]
  2014-10-03 21:47 [RFC v3] Unicode/UTF-8 support for XFS Ben Myers
                   ` (18 preceding siblings ...)
  2014-10-03 22:07 ` [PATCH 19/35] xfsprogs: add supporting code " Ben Myers
@ 2014-10-03 22:08 ` Ben Myers
  2014-10-03 22:09 ` [PATCH 21/35] libxfs: return the first match during case-insensitive lookup Ben Myers
                   ` (14 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Ben Myers @ 2014-10-03 22:08 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: olaf, xfs

From: Olaf Weber <olaf@sgi.com>

Remove the Hangul decompositions from the utf8data trie, and do
algorithmic decomposition to calculate them on the fly. To store
the decomposition the caller of utf8lookup()/utf8nlookup() must
provide a 12-byte buffer, which is used to synthesize a leaf with
the decomposition. Trie size is reduced from 245kB to 90kB.
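
For readers unfamiliar with the algorithm, the arithmetic can be
sketched on code points alone. This is a minimal illustration based on
the constants in Section 3.12 of the Unicode standard. The helper name
hangul_decompose() and its interface are hypothetical. The patch itself
works on UTF-8 bytes and synthesizes a leaf instead.

```c
#include <stddef.h>

/* Constants from Section 3.12 of the Unicode standard. */
#define SBASE	0xAC00u
#define LBASE	0x1100u
#define VBASE	0x1161u
#define TBASE	0x11A7u
#define LCOUNT	19u
#define VCOUNT	21u
#define TCOUNT	28u
#define NCOUNT	(VCOUNT * TCOUNT)	/* 588 */
#define SCOUNT	(LCOUNT * NCOUNT)	/* 11172 */

/*
 * Hypothetical helper, not part of the patch.
 * Fully decompose the Hangul syllable s into out[] as <L, V> or
 * <L, V, T>.  Returns the number of code points written (2 or 3),
 * or 0 if s is not a precomposed Hangul syllable.
 */
static size_t
hangul_decompose(unsigned int s, unsigned int out[3])
{
	unsigned int	si;
	unsigned int	ti;

	if (s < SBASE || s >= SBASE + SCOUNT)
		return 0;
	si = s - SBASE;
	ti = si % TCOUNT;
	out[0] = LBASE + si / NCOUNT;			/* leading consonant */
	out[1] = VBASE + (si % NCOUNT) / TCOUNT;	/* vowel */
	if (ti == 0)
		return 2;				/* LV syllable */
	out[2] = TBASE + ti;				/* trailing consonant */
	return 3;					/* LVT syllable */
}
```

For example, U+D55C decomposes to U+1112 U+1161 U+11AB, and U+AC00
decomposes to U+1100 U+1161. Each of these parts encodes as a 3-byte
UTF-8 sequence, so a synthesized leaf needs at most 2 + 3 * 3 + 1 = 12
bytes, which matches the buffer size required of the caller.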

This change also contains a number of robustness fixes to the
trie generator mkutf8data.c.

Signed-off-by: Olaf Weber <olaf@sgi.com>
---
 include/utf8norm.h    |   4 +
 libxfs/utf8norm.c     | 190 ++++++++++++++++++++---
 utf8norm/mkutf8data.c | 421 ++++++++++++++++++++++++++++++++++++++------------
 3 files changed, 492 insertions(+), 123 deletions(-)

diff --git a/include/utf8norm.h b/include/utf8norm.h
index cd77580..44a869c 100644
--- a/include/utf8norm.h
+++ b/include/utf8norm.h
@@ -22,6 +22,9 @@
 /* An opaque type used to determine the normalization in use. */
 typedef const struct utf8data *utf8data_t;
 
+/* Needed in struct utf8cursor below. */
+#define UTF8HANGULLEAF	(12)
+
 /* Encoding a unicode version number as a single unsigned int. */
 #define UNICODE_MAJ_SHIFT		(16)
 #define UNICODE_MIN_SHIFT		(8)
@@ -90,6 +93,7 @@ struct utf8cursor {
 	unsigned int	slen;
 	short int	ccc;
 	short int	nccc;
+	unsigned char	hangul[UTF8HANGULLEAF];
 };
 
 /*
diff --git a/libxfs/utf8norm.c b/libxfs/utf8norm.c
index 8076e99..5c5ece5 100644
--- a/libxfs/utf8norm.c
+++ b/libxfs/utf8norm.c
@@ -103,6 +103,38 @@ utf8clen(const char *s)
 }
 
 /*
+ * Decode a 3-byte UTF-8 sequence.
+ */
+static unsigned int
+utf8decode3(const char *str)
+{
+	unsigned int		uc;
+
+	uc = *str++ & 0x0F;
+	uc <<= 6;
+	uc |= *str++ & 0x3F;
+	uc <<= 6;
+	uc |= *str++ & 0x3F;
+
+	return uc;
+}
+
+/*
+ * Encode a 3-byte UTF-8 sequence.
+ */
+static int
+utf8encode3(char *str, unsigned int val)
+{
+	str[2] = (val & 0x3F) | 0x80;
+	val >>= 6;
+	str[1] = (val & 0x3F) | 0x80;
+	val >>= 6;
+	str[0] = val | 0xE0;
+
+	return 3;
+}
+
+/*
  * utf8trie_t
  *
  * A compact binary tree, used to decode UTF-8 characters.
@@ -163,7 +195,8 @@ typedef const unsigned char utf8trie_t;
  *          characters with the Default_Ignorable_Code_Point property.
  *          These do affect normalization, as they all have CCC 0.
  *
- * The decompositions in the trie have been fully expanded.
+ * The decompositions in the trie have been fully expanded, with the
+ * exception of Hangul syllables, which are decomposed algorithmically.
  *
  * Casefolding, if applicable, is also done using decompositions.
  *
@@ -183,6 +216,105 @@ typedef const unsigned char utf8leaf_t;
 #define STOPPER		(0)
 #define	DECOMPOSE	(255)
 
+/* Marker for hangul syllable decomposition. */
+#define HANGUL		((char)(255))
+/* Size of the synthesized leaf used for Hangul syllable decomposition. */
+#define UTF8HANGULLEAF	(12)
+
+/*
+ * Hangul decomposition (algorithm from Section 3.12 of Unicode 6.3.0)
+ *
+ * AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
+ * D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
+ *
+ * SBase = 0xAC00
+ * LBase = 0x1100
+ * VBase = 0x1161
+ * TBase = 0x11A7
+ * LCount = 19
+ * VCount = 21
+ * TCount = 28
+ * NCount = 588 (VCount * TCount)
+ * SCount = 11172 (LCount * NCount)
+ *
+ * Decomposition:
+ *   SIndex = s - SBase
+ *
+ * LV (Canonical/Full)
+ *   LIndex = SIndex / NCount
+ *   VIndex = (SIndex % NCount) / TCount
+ *   LPart = LBase + LIndex
+ *   VPart = VBase + VIndex
+ *
+ * LVT (Canonical)
+ *   LVIndex = (SIndex / TCount) * TCount
+ *   TIndex = (SIndex % TCount)
+ *   LVPart = SBase + LVIndex
+ *   TPart = TBase + TIndex
+ *
+ * LVT (Full)
+ *   LIndex = SIndex / NCount
+ *   VIndex = (SIndex % NCount) / TCount
+ *   TIndex = (SIndex % TCount)
+ *   LPart = LBase + LIndex
+ *   VPart = VBase + VIndex
+ *   if (TIndex == 0) {
+ *          d = <LPart, VPart>
+ *   } else {
+ *          TPart = TBase + TIndex
+ *          d = <LPart, VPart, TPart>
+ *   }
+ */
+
+/* Constants */
+#define SB	(0xAC00)
+#define LB	(0x1100)
+#define VB	(0x1161)
+#define TB	(0x11A7)
+#define LC	(19)
+#define VC	(21)
+#define TC	(28)
+#define NC	(VC * TC)
+#define SC	(LC * NC)
+
+/* Algorithmic decomposition of hangul syllable. */
+static utf8leaf_t *
+utf8hangul(const char *str, unsigned char *hangul)
+{
+	unsigned int	si;
+	unsigned int	li;
+	unsigned int	vi;
+	unsigned int	ti;
+	unsigned char	*h;
+
+	/* Calculate the SI, LI, VI, and TI values. */
+	si = utf8decode3(str) - SB;
+	li = si / NC;
+	vi = (si % NC) / TC;
+	ti = si % TC;
+
+	/* Fill in base of leaf. */
+	h = hangul;
+	LEAF_GEN(h) = 2;
+	LEAF_CCC(h) = DECOMPOSE;
+	h += 2;
+
+	/* Add LPart, a 3-byte UTF-8 sequence. */
+	h += utf8encode3((char*)h, li + LB);
+
+	/* Add VPart, a 3-byte UTF-8 sequence. */
+	h += utf8encode3((char*)h, vi + VB);
+
+	/* Add TPart if required, also a 3-byte UTF-8 sequence. */
+	if (ti)
+		h += utf8encode3((char*)h, ti + TB);
+
+	/* Terminate string. */
+	h[0] = '\0';
+
+	return hangul;
+}
+
 /*
  * Use trie to scan s, touching at most len bytes.
  * Returns the leaf if one exists, NULL otherwise.
@@ -192,7 +324,7 @@ typedef const unsigned char utf8leaf_t;
  * shorthand for this will be "is valid UTF-8 unicode".
  */
 static utf8leaf_t *
-utf8nlookup(utf8data_t data, const char *s, size_t len)
+utf8nlookup(utf8data_t data, unsigned char *hangul, const char *s, size_t len)
 {
 	utf8trie_t	*trie = utf8data + data->offset;
 	int		offlen;
@@ -230,8 +362,7 @@ utf8nlookup(utf8data_t data, const char *s, size_t len)
 				trie++;
 			} else {
 				/* No right node. */
-				node = 0;
-				trie = NULL;
+				return NULL;
 			}
 		} else {
 			/* Left leg */
@@ -241,8 +372,7 @@ utf8nlookup(utf8data_t data, const char *s, size_t len)
 				trie += offlen + 1;
 			} else if (*trie & RIGHTPATH) {
 				/* No left node. */
-				node = 0;
-				trie = NULL;
+				return NULL;
 			} else {
 				/* Left node after this node */
 				node = (*trie & TRIENODE);
@@ -250,6 +380,14 @@ utf8nlookup(utf8data_t data, const char *s, size_t len)
 			}
 		}
 	}
+	/*
+	 * Hangul decomposition is done algorithmically. These are the
+	 * codepoints >= 0xAC00 and <= 0xD7A3. Their UTF-8 encoding is
+	 * always 3 bytes long, so s has been advanced twice, and the
+	 * start of the sequence is at s-2.
+	 */
+	if (LEAF_CCC(trie) == DECOMPOSE && LEAF_STR(trie)[0] == HANGUL)
+		trie = utf8hangul(s - 2, hangul);
 	return trie;
 }
 
@@ -260,9 +398,9 @@ utf8nlookup(utf8data_t data, const char *s, size_t len)
  * Forwards to utf8nlookup().
  */
 static utf8leaf_t *
-utf8lookup(utf8data_t data, const char *s)
+utf8lookup(utf8data_t data, unsigned char *hangul, const char *s)
 {
-	return utf8nlookup(data, s, (size_t)-1);
+	return utf8nlookup(data, hangul, s, (size_t)-1);
 }
 
 /*
@@ -274,13 +412,15 @@ int
 utf8agemax(utf8data_t data, const char *s)
 {
 	utf8leaf_t	*leaf;
-	int		age = 0;
+	int		age;
 	int		leaf_age;
+	unsigned char	hangul[UTF8HANGULLEAF];
 
 	if (!data)
 		return -1;
+	age = 0;
 	while (*s) {
-		if (!(leaf = utf8lookup(data, s)))
+		if (!(leaf = utf8lookup(data, hangul, s)))
 			return -1;
 		leaf_age = utf8agetab[LEAF_GEN(leaf)];
 		if (leaf_age <= data->maxage && leaf_age > age)
@@ -301,12 +441,13 @@ utf8agemin(utf8data_t data, const char *s)
 	utf8leaf_t	*leaf;
 	int		age;
 	int		leaf_age;
+	unsigned char	hangul[UTF8HANGULLEAF];
 
 	if (!data)
 		return -1;
 	age = data->maxage;
 	while (*s) {
-		if (!(leaf = utf8lookup(data, s)))
+		if (!(leaf = utf8lookup(data, hangul, s)))
 			return -1;
 		leaf_age = utf8agetab[LEAF_GEN(leaf)];
 		if (leaf_age <= data->maxage && leaf_age < age)
@@ -324,13 +465,15 @@ int
 utf8nagemax(utf8data_t data, const char *s, size_t len)
 {
 	utf8leaf_t	*leaf;
-	int		age = 0;
+	int		age;
 	int		leaf_age;
+	unsigned char	hangul[UTF8HANGULLEAF];
 
 	if (!data)
 		return -1;
+	age = 0;
         while (len && *s) {
-		if (!(leaf = utf8nlookup(data, s, len)))
+		if (!(leaf = utf8nlookup(data, hangul, s, len)))
 			return -1;
 		leaf_age = utf8agetab[LEAF_GEN(leaf)];
 		if (leaf_age <= data->maxage && leaf_age > age)
@@ -351,12 +494,13 @@ utf8nagemin(utf8data_t data, const char *s, size_t len)
 	utf8leaf_t	*leaf;
 	int		leaf_age;
 	int		age;
+	unsigned char	hangul[UTF8HANGULLEAF];
 
 	if (!data)
 		return -1;
 	age = data->maxage;
 	while (len && *s) {
-		if (!(leaf = utf8nlookup(data, s, len)))
+		if (!(leaf = utf8nlookup(data, hangul, s, len)))
 			return -1;
 		leaf_age = utf8agetab[LEAF_GEN(leaf)];
 		if (leaf_age <= data->maxage && leaf_age < age)
@@ -378,11 +522,12 @@ utf8len(utf8data_t data, const char *s)
 {
 	utf8leaf_t	*leaf;
 	size_t		ret = 0;
+	unsigned char	hangul[UTF8HANGULLEAF];
 
 	if (!data)
 		return -1;
 	while (*s) {
-		if (!(leaf = utf8lookup(data, s)))
+		if (!(leaf = utf8lookup(data, hangul, s)))
 			return -1;
 		if (utf8agetab[LEAF_GEN(leaf)] > data->maxage)
 			ret += utf8clen(s);
@@ -404,11 +549,12 @@ utf8nlen(utf8data_t data, const char *s, size_t len)
 {
 	utf8leaf_t	*leaf;
 	size_t		ret = 0;
+	unsigned char	hangul[UTF8HANGULLEAF];
 
 	if (!data)
 		return -1;
 	while (len && *s) {
-		if (!(leaf = utf8nlookup(data, s, len)))
+		if (!(leaf = utf8nlookup(data, hangul, s, len)))
 			return -1;
 		if (utf8agetab[LEAF_GEN(leaf)] > data->maxage)
 			ret += utf8clen(s);
@@ -535,10 +681,12 @@ utf8byte(struct utf8cursor *u8c)
 		}
 
 		/* Look up the data for the current character. */
-		if (u8c->p)
-			leaf = utf8lookup(u8c->data, u8c->s);
-		else
-			leaf = utf8nlookup(u8c->data, u8c->s, u8c->len);
+		if (u8c->p) {
+			leaf = utf8lookup(u8c->data, u8c->hangul, u8c->s);
+		} else {
+			leaf = utf8nlookup(u8c->data, u8c->hangul,
+					   u8c->s, u8c->len);
+		}
 
 		/* No leaf found implies that the input is a binary blob. */
 		if (!leaf)
@@ -558,7 +706,7 @@ utf8byte(struct utf8cursor *u8c)
 				ccc = STOPPER;
 				goto ccc_mismatch;
 			}
-			leaf = utf8lookup(u8c->data, u8c->s);
+			leaf = utf8lookup(u8c->data, u8c->hangul, u8c->s);
 			ccc = LEAF_CCC(leaf);
 		}
 
diff --git a/utf8norm/mkutf8data.c b/utf8norm/mkutf8data.c
index 1d6ec02..7c7756f 100644
--- a/utf8norm/mkutf8data.c
+++ b/utf8norm/mkutf8data.c
@@ -179,11 +179,15 @@ typedef unsigned char utf8leaf_t;
 #define MINCCC		(0)
 #define MAXCCC		(254)
 #define STOPPER		(0)
-#define	DECOMPOSE	(255)
+#define DECOMPOSE	(255)
+#define HANGUL		((char)(255))
+
+#define UTF8HANGULLEAF	(12)
 
 struct tree;
-static utf8leaf_t *utf8nlookup(struct tree *, const char *, size_t);
-static utf8leaf_t *utf8lookup(struct tree *, const char *);
+static utf8leaf_t *utf8nlookup(struct tree *, unsigned char *,
+			       const char *, size_t);
+static utf8leaf_t *utf8lookup(struct tree *, unsigned char *, const char *);
 
 unsigned char *utf8data;
 size_t utf8data_size;
@@ -254,52 +258,52 @@ utf8trie_t *nfkdicf;
 #define UTF8_V_SHIFT    6
 
 static int
-utf8key(unsigned int key, char keyval[])
-{
-	int keylen;
-
-	if (key < 0x80) {
-		keyval[0] = key;
-		keylen = 1;
-	} else if (key < 0x800) {
-		keyval[1] = key & UTF8_V_MASK;
-		keyval[1] |= UTF8_N_BITS;
-		key >>= UTF8_V_SHIFT;
-		keyval[0] = key;
-		keyval[0] |= UTF8_2_BITS;
-		keylen = 2;
-	} else if (key < 0x10000) {
-		keyval[2] = key & UTF8_V_MASK;
-		keyval[2] |= UTF8_N_BITS;
-		key >>= UTF8_V_SHIFT;
-		keyval[1] = key & UTF8_V_MASK;
-		keyval[1] |= UTF8_N_BITS;
-		key >>= UTF8_V_SHIFT;
-		keyval[0] = key;
-		keyval[0] |= UTF8_3_BITS;
-		keylen = 3;
-	} else if (key < 0x110000) {
-		keyval[3] = key & UTF8_V_MASK;
-		keyval[3] |= UTF8_N_BITS;
-		key >>= UTF8_V_SHIFT;
-		keyval[2] = key & UTF8_V_MASK;
-		keyval[2] |= UTF8_N_BITS;
-		key >>= UTF8_V_SHIFT;
-		keyval[1] = key & UTF8_V_MASK;
-		keyval[1] |= UTF8_N_BITS;
-		key >>= UTF8_V_SHIFT;
-		keyval[0] = key;
-		keyval[0] |= UTF8_4_BITS;
-		keylen = 4;
+utf8encode(char *str, unsigned int val)
+{
+	int len;
+
+	if (val < 0x80) {
+		str[0] = val;
+		len = 1;
+	} else if (val < 0x800) {
+		str[1] = val & UTF8_V_MASK;
+		str[1] |= UTF8_N_BITS;
+		val >>= UTF8_V_SHIFT;
+		str[0] = val;
+		str[0] |= UTF8_2_BITS;
+		len = 2;
+	} else if (val < 0x10000) {
+		str[2] = val & UTF8_V_MASK;
+		str[2] |= UTF8_N_BITS;
+		val >>= UTF8_V_SHIFT;
+		str[1] = val & UTF8_V_MASK;
+		str[1] |= UTF8_N_BITS;
+		val >>= UTF8_V_SHIFT;
+		str[0] = val;
+		str[0] |= UTF8_3_BITS;
+		len = 3;
+	} else if (val < 0x110000) {
+		str[3] = val & UTF8_V_MASK;
+		str[3] |= UTF8_N_BITS;
+		val >>= UTF8_V_SHIFT;
+		str[2] = val & UTF8_V_MASK;
+		str[2] |= UTF8_N_BITS;
+		val >>= UTF8_V_SHIFT;
+		str[1] = val & UTF8_V_MASK;
+		str[1] |= UTF8_N_BITS;
+		val >>= UTF8_V_SHIFT;
+		str[0] = val;
+		str[0] |= UTF8_4_BITS;
+		len = 4;
 	} else {
-		printf("%#x: illegal key\n", key);
-		keylen = 0;
+		printf("%#x: illegal val\n", val);
+		len = 0;
 	}
-	return keylen;
+	return len;
 }
 
 static unsigned int
-utf8code(const char *str)
+utf8decode(const char *str)
 {
 	const unsigned char *s = (const unsigned char*)str;
 	unsigned int unichar = 0;
@@ -334,6 +338,8 @@ utf32valid(unsigned int unichar)
 	return unichar < 0x110000;
 }
 
+#define HANGUL_SYLLABLE(U)	((U) >= 0xAC00 && (U) <= 0xD7A3)
+
 #define NODE 1
 #define LEAF 0
 
@@ -937,7 +943,7 @@ done:
 
 /*
  * Compute the index of each node and leaf, which is the offset in the
- * emitted trie.  These value must be pre-computed because relative
+ * emitted trie.  These values must be pre-computed because relative
  * offsets between nodes are used to navigate the tree.
  */
 static int
@@ -958,7 +964,7 @@ index_nodes(struct tree *tree, int index)
 	count = 0;
 
 	if (verbose > 0)
-		printf("Indexing %s_%x: %d", tree->type, tree->maxage, index);
+		printf("Indexing %s_%x: %d\n", tree->type, tree->maxage, index);
 	if (tree->childnode == LEAF) {
 		index += tree->leaf_size(tree->root);
 		goto done;
@@ -1022,6 +1028,26 @@ done:
 }
 
 /*
+ * Mark the nodes in a subtree, helper for size_nodes().
+ */
+static int
+mark_subtree(struct node *node)
+{
+	int changed;
+
+	if (!node || node->mark)
+		return 0;
+	node->mark = 1;
+	node->index = node->parent->index;
+	changed = 1;
+	if (node->leftnode == NODE)
+		changed += mark_subtree(node->left);
+	if (node->rightnode == NODE)
+		changed += mark_subtree(node->right);
+	return changed;
+}
+
+/*
  * Compute the size of nodes and leaves. We start by assuming that
  * each node needs to store a three-byte offset. The indexes of the
  * nodes are calculated based on that, and then this function is
@@ -1040,6 +1066,7 @@ size_nodes(struct tree *tree)
 	unsigned int bitmask;
 	unsigned int pathbits;
 	unsigned int pathmask;
+	unsigned int nbit;
 	int changed;
 	int offset;
 	int size;
@@ -1050,7 +1077,7 @@ size_nodes(struct tree *tree)
 	size = 0;
 
 	if (verbose > 0)
-		printf("Sizing %s_%x", tree->type, tree->maxage);
+		printf("Sizing %s_%x\n", tree->type, tree->maxage);
 	if (tree->childnode == LEAF)
 		goto done;
 
@@ -1067,22 +1094,40 @@ size_nodes(struct tree *tree)
 			size = 1;
 		} else {
 			if (node->rightnode == NODE) {
+				/*
+				 * If the right node is not marked,
+				 * look for a corresponding node in
+				 * the next tree.  Such a node need
+				 * not exist.
+				 */
 				right = node->right;
 				next = tree->next;
 				while (!right->mark) {
 					assert(next);
 					n = next->root;
 					while (n->bitnum != node->bitnum) {
-						if (pathbits & (1<<n->bitnum))
+						nbit = 1 << n->bitnum;
+						if (!(pathmask & nbit))
+							break;
+						if (pathbits & nbit) {
+							if (n->rightnode==LEAF)
+								break;
 							n = n->right;
-						else
+						} else {
+							if (n->leftnode==LEAF)
+								break;
 							n = n->left;
+						}
 					}
+					if (n->bitnum != node->bitnum)
+						break;
 					n = n->right;
-					assert(right->bitnum == n->bitnum);
 					right = n;
 					next = next->next;
 				}
+				/* Make sure the right node is marked. */
+				if (!right->mark)
+					changed += mark_subtree(right);
 				offset = right->index - node->index;
 			} else {
 				offset = *tree->leaf_index(tree, node->right);
@@ -1158,8 +1203,15 @@ emit(struct tree *tree, unsigned char *data)
 	int offset;
 	int index;
 	int indent;
+	int size;
+	int bytes;
+	int leaves;
+	int nodes[4];
 	unsigned char byte;
 
+	nodes[0] = nodes[1] = nodes[2] = nodes[3] = 0;
+	leaves = 0;
+	bytes = 0;
 	index = tree->index;
 	data += index;
 	indent = 1;
@@ -1168,7 +1220,10 @@ emit(struct tree *tree, unsigned char *data)
 	if (tree->childnode == LEAF) {
 		assert(tree->root);
 		tree->leaf_emit(tree->root, data);
-		return;
+		size = tree->leaf_size(tree->root);
+		index += size;
+		leaves++;
+		goto done;
 	}
 
 	assert(tree->childnode == NODE);
@@ -1195,6 +1250,7 @@ emit(struct tree *tree, unsigned char *data)
 				offlen = 2;
 			else
 				offlen = 3;
+			nodes[offlen]++;
 			offset = node->offset;
 			byte |= offlen << OFFLEN_SHIFT;
 			*data++ = byte;
@@ -1207,12 +1263,14 @@ emit(struct tree *tree, unsigned char *data)
 		} else if (node->left) {
 			if (node->leftnode == NODE)
 				byte |= TRIENODE;
+			nodes[0]++;
 			*data++ = byte;
 			index++;
 		} else if (node->right) {
 			byte |= RIGHTNODE;
 			if (node->rightnode == NODE)
 				byte |= TRIENODE;
+			nodes[0]++;
 			*data++ = byte;
 			index++;
 		} else {
@@ -1227,7 +1285,10 @@ skip:
 					assert(node->left);
 					data = tree->leaf_emit(node->left,
 							       data);
-					index += tree->leaf_size(node->left);
+					size = tree->leaf_size(node->left);
+					index += size;
+					bytes += size;
+					leaves++;
 				} else if (node->left) {
 					assert(node->leftnode == NODE);
 					indent += 1;
@@ -1241,7 +1302,10 @@ skip:
 					assert(node->right);
 					data = tree->leaf_emit(node->right,
 							       data);
-					index += tree->leaf_size(node->right);
+					size = tree->leaf_size(node->right);
+					index += size;
+					bytes += size;
+					leaves++;
 				} else if (node->right) {
 					assert(node->rightnode==NODE);
 					indent += 1;
@@ -1255,6 +1319,15 @@ skip:
 			indent -= 1;
 		}
 	}
+done:
+	if (verbose > 0) {
+		printf("Emitted %d (%d) leaves",
+			leaves, bytes);
+		printf(" %d (%d+%d+%d+%d) nodes",
+			nodes[0] + nodes[1] + nodes[2] + nodes[3],
+			nodes[0], nodes[1], nodes[2], nodes[3]);
+		printf(" %d total\n", index - tree->index);
+	}
 }
 
 /* ------------------------------------------------------------------ */
@@ -1360,7 +1433,9 @@ nfkdi_print(void *l, int indent)
 
 	printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf,
 		leaf->code, leaf->ccc, leaf->gen);
-	if (leaf->utf8nfkdi)
+	if (leaf->utf8nfkdi && leaf->utf8nfkdi[0] == HANGUL)
+		printf(" nfkdi \"%s\"", "HANGUL SYLLABLE");
+	else if (leaf->utf8nfkdi)
 		printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi);
 	printf("\n");
 }
@@ -1374,6 +1449,8 @@ nfkdicf_print(void *l, int indent)
 		leaf->code, leaf->ccc, leaf->gen);
 	if (leaf->utf8nfkdicf)
 		printf(" nfkdicf \"%s\"", (const char*)leaf->utf8nfkdicf);
+	else if (leaf->utf8nfkdi && leaf->utf8nfkdi[0] == HANGUL)
+		printf(" nfkdi \"%s\"", "HANGUL SYLLABLE");
 	else if (leaf->utf8nfkdi)
 		printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi);
 	printf("\n");
@@ -1409,7 +1486,9 @@ nfkdi_size(void *l)
 	struct unicode_data *leaf = l;
 
 	int size = 2;
-	if (leaf->utf8nfkdi)
+	if (HANGUL_SYLLABLE(leaf->code))
+		size += 1;
+	else if (leaf->utf8nfkdi)
 		size += strlen(leaf->utf8nfkdi) + 1;
 	return size;
 }
@@ -1420,7 +1499,9 @@ nfkdicf_size(void *l)
 	struct unicode_data *leaf = l;
 
 	int size = 2;
-	if (leaf->utf8nfkdicf)
+	if (HANGUL_SYLLABLE(leaf->code))
+		size += 1;
+	else if (leaf->utf8nfkdicf)
 		size += strlen(leaf->utf8nfkdicf) + 1;
 	else if (leaf->utf8nfkdi)
 		size += strlen(leaf->utf8nfkdi) + 1;
@@ -1450,7 +1531,10 @@ nfkdi_emit(void *l, unsigned char *data)
 	unsigned char *s;
 
 	*data++ = leaf->gen;
-	if (leaf->utf8nfkdi) {
+	if (HANGUL_SYLLABLE(leaf->code)) {
+		*data++ = DECOMPOSE;
+		*data++ = HANGUL;
+	} else if (leaf->utf8nfkdi) {
 		*data++ = DECOMPOSE;
 		s = (unsigned char*)leaf->utf8nfkdi;
 		while ((*data++ = *s++) != 0)
@@ -1468,7 +1552,10 @@ nfkdicf_emit(void *l, unsigned char *data)
 	unsigned char *s;
 
 	*data++ = leaf->gen;
-	if (leaf->utf8nfkdicf) {
+	if (HANGUL_SYLLABLE(leaf->code)) {
+		*data++ = DECOMPOSE;
+		*data++ = HANGUL;
+	} else if (leaf->utf8nfkdicf) {
 		*data++ = DECOMPOSE;
 		s = (unsigned char*)leaf->utf8nfkdicf;
 		while ((*data++ = *s++) != 0)
@@ -1492,22 +1579,27 @@ utf8_create(struct unicode_data *data)
 	unsigned int *um;
 	int i;
 
+	if (data->utf8nfkdi) {
+		assert(data->utf8nfkdi[0] == HANGUL);
+		return;
+	}
+
 	u = utf;
 	um = data->utf32nfkdi;
 	if (um) {
 		for (i = 0; um[i]; i++)
-			u += utf8key(um[i], u);
+			u += utf8encode(u, um[i]);
 		*u = '\0';
-		data->utf8nfkdi = strdup((char*)utf);
+		data->utf8nfkdi = strdup(utf);
 	}
 	u = utf;
 	um = data->utf32nfkdicf;
 	if (um) {
 		for (i = 0; um[i]; i++)
-			u += utf8key(um[i], u);
+			u += utf8encode(u, um[i]);
 		*u = '\0';
-		if (!data->utf8nfkdi || strcmp(data->utf8nfkdi, (char*)utf))
-			data->utf8nfkdicf = strdup((char*)utf);
+		if (!data->utf8nfkdi || strcmp(data->utf8nfkdi, utf))
+			data->utf8nfkdicf = strdup(utf);
 	}
 }
 
@@ -1627,7 +1719,7 @@ trees_populate(void)
 		for (unichar = 0; unichar != 0x110000; unichar++) {
 			if (unicode_data[unichar].gen < 0)
 				continue;
-			keylen = utf8key(unichar, keyval);
+			keylen = utf8encode(keyval, unichar);
 			data = corrections_lookup(&unicode_data[unichar]);
 			if (data->correction <= trees[i].maxage)
 				data = &unicode_data[unichar];
@@ -1682,6 +1774,7 @@ verify(struct tree *tree)
 	utf8leaf_t	*leaf;
 	unsigned int	unichar;
 	char		key[4];
+	unsigned char	hangul[UTF8HANGULLEAF];
 	int		report;
 	int		nocf;
 
@@ -1694,8 +1787,8 @@ verify(struct tree *tree)
 		data = corrections_lookup(&unicode_data[unichar]);
 		if (data->correction <= tree->maxage)
 			data = &unicode_data[unichar];
-		utf8key(unichar, key);
-		leaf = utf8lookup(tree, key);
+		utf8encode(key, unichar);
+		leaf = utf8lookup(tree, hangul, key);
 		if (!leaf) {
 			if (data->gen != -1)
 				report++;
@@ -1709,7 +1802,10 @@ verify(struct tree *tree)
 			if (data->gen != LEAF_GEN(leaf))
 				report++;
 			if (LEAF_CCC(leaf) == DECOMPOSE) {
-				if (nocf) {
+				if (HANGUL_SYLLABLE(data->code)) {
+					if (data->utf8nfkdi[0] != HANGUL)
+						report++;
+				} else if (nocf) {
 					if (!data->utf8nfkdi) {
 						report++;
 					} else if (strcmp(data->utf8nfkdi,
@@ -1725,7 +1821,7 @@ verify(struct tree *tree)
 							   LEAF_STR(leaf)))
 							report++;
 					} else if (strcmp(data->utf8nfkdi,
-							  LEAF_STR(leaf))) {
+							LEAF_STR(leaf))) {
 						report++;
 					}
 				}
@@ -1735,13 +1831,13 @@ verify(struct tree *tree)
 		}
 		if (report) {
 			printf("%X code %X gen %d ccc %d"
-				" nfdki -> \"%s\"",
+				" nfkdi -> \"%s\"",
 				unichar, data->code, data->gen,
 				data->ccc,
 				data->utf8nfkdi);
 			if (leaf) {
-				printf(" age %d ccc %d"
-					" nfdki -> \"%s\"\n",
+				printf(" gen %d ccc %d"
+					" nfkdi -> \"%s\"",
 					LEAF_GEN(leaf),
 					LEAF_CCC(leaf),
 					LEAF_CCC(leaf) == DECOMPOSE ?
@@ -2330,21 +2426,21 @@ corrections_init(void)
  *
  * LVT (Canonical)
  *   LVIndex = (SIndex / TCount) * TCount
- *   TIndex = (Sindex % TCount
- *   LVPart = LBase + LVIndex
+ *   TIndex = (Sindex % TCount)
+ *   LVPart = SBase + LVIndex
  *   TPart = TBase + TIndex
  *
  * LVT (Full)
  *   LIndex = SIndex / NCount
  *   VIndex = (Sindex % NCount) / TCount
- *   TIndex = (Sindex % TCount
+ *   TIndex = (Sindex % TCount)
  *   LPart = LBase + LIndex
  *   VPart = VBase + VIndex
  *   if (TIndex == 0) {
  *          d = <LPart, VPart>
  *   } else {
  *          TPart = TBase + TIndex
- *          d = <LPart, TPart, VPart>
+ *          d = <LPart, VPart, TPart>
  *   }
  *
  */
@@ -2394,9 +2490,17 @@ hangul_decompose(void)
 		memcpy(um, mapping, i * sizeof(unsigned int));
 		unicode_data[unichar].utf32nfkdicf = um;
 
+		/*
+		 * Add a cookie as a reminder that the hangul syllable
+		 * decompositions must not be stored in the generated
+		 * trie.
+		 */
+		unicode_data[unichar].utf8nfkdi = malloc(2);
+		unicode_data[unichar].utf8nfkdi[0] = HANGUL;
+		unicode_data[unichar].utf8nfkdi[1] = '\0';
+
 		if (verbose > 1)
 			print_utf32nfkdi(unichar);
-
 		count++;
 	}
 	if (verbose > 0)
@@ -2522,6 +2626,100 @@ int utf8ncursor(struct utf8cursor *, struct tree *, const char *, size_t);
 int utf8byte(struct utf8cursor *);
 
 /*
+ * Hangul decomposition (algorithm from Section 3.12 of Unicode 6.3.0)
+ *
+ * AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
+ * D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
+ *
+ * SBase = 0xAC00
+ * LBase = 0x1100
+ * VBase = 0x1161
+ * TBase = 0x11A7
+ * LCount = 19
+ * VCount = 21
+ * TCount = 28
+ * NCount = 588 (VCount * TCount)
+ * SCount = 11172 (LCount * NCount)
+ *
+ * Decomposition:
+ *   SIndex = s - SBase
+ *
+ * LV (Canonical/Full)
+ *   LIndex = SIndex / NCount
+ *   VIndex = (SIndex % NCount) / TCount
+ *   LPart = LBase + LIndex
+ *   VPart = VBase + VIndex
+ *
+ * LVT (Canonical)
+ *   LVIndex = (SIndex / TCount) * TCount
+ *   TIndex = (SIndex % TCount)
+ *   LVPart = SBase + LVIndex
+ *   TPart = TBase + TIndex
+ *
+ * LVT (Full)
+ *   LIndex = SIndex / NCount
+ *   VIndex = (Sindex % NCount) / TCount
+ *   TIndex = (SIndex % TCount)
+ *   LPart = LBase + LIndex
+ *   VPart = VBase + VIndex
+ *   if (TIndex == 0) {
+ *          d = <LPart, VPart>
+ *   } else {
+ *          TPart = TBase + TIndex
+ *          d = <LPart, VPart, TPart>
+ *   }
+ */
+
+/* Constants */
+#define SB	(0xAC00)
+#define LB	(0x1100)
+#define VB	(0x1161)
+#define TB	(0x11A7)
+#define LC	(19)
+#define VC	(21)
+#define TC	(28)
+#define NC	(VC * TC)
+#define SC	(LC * NC)
+
+/* Algorithmic decomposition of hangul syllable. */
+static utf8leaf_t *
+utf8hangul(const char *str, unsigned char *hangul)
+{
+	unsigned int	si;
+	unsigned int	li;
+	unsigned int	vi;
+	unsigned int	ti;
+	unsigned char	*h;
+
+	/* Calculate the SI, LI, VI, and TI values. */
+	si = utf8decode(str) - SB;
+	li = si / NC;
+	vi = (si % NC) / TC;
+	ti = si % TC;
+
+	/* Fill in base of leaf. */
+	h = hangul;
+	LEAF_GEN(h) = 2;
+	LEAF_CCC(h) = DECOMPOSE;
+	h += 2;
+
+	/* Add LPart, a 3-byte UTF-8 sequence. */
+	h += utf8encode((char *)h, li + LB);
+
+	/* Add VPart, a 3-byte UTF-8 sequence. */
+	h += utf8encode((char *)h, vi + VB);
+
+	/* Add TPart if required, also a 3-byte UTF-8 sequence. */
+	if (ti)
+		h += utf8encode((char *)h, ti + TB);
+
+	/* Terminate string. */
+	h[0] = '\0';
+
+	return hangul;
+}
+
+/*
  * Use trie to scan s, touching at most len bytes.
  * Returns the leaf if one exists, NULL otherwise.
  *
@@ -2530,7 +2728,7 @@ int utf8byte(struct utf8cursor *);
  * shorthand for this will be "is valid UTF-8 unicode".
  */
 static utf8leaf_t *
-utf8nlookup(struct tree *tree, const char *s, size_t len)
+utf8nlookup(struct tree *tree, unsigned char *hangul, const char *s, size_t len)
 {
 	utf8trie_t	*trie = utf8data + tree->index;
 	int		offlen;
@@ -2568,8 +2766,7 @@ utf8nlookup(struct tree *tree, const char *s, size_t len)
 				trie++;
 			} else {
 				/* No right node. */
-				node = 0;
-				trie = NULL;
+				return NULL;
 			}
 		} else {
 			/* Left leg */
@@ -2579,8 +2776,7 @@ utf8nlookup(struct tree *tree, const char *s, size_t len)
 				trie += offlen + 1;
 			} else if (*trie & RIGHTPATH) {
 				/* No left node. */
-				node = 0;
-				trie = NULL;
+				return NULL;
 			} else {
 				/* Left node after this node */
 				node = (*trie & TRIENODE);
@@ -2588,6 +2784,14 @@ utf8nlookup(struct tree *tree, const char *s, size_t len)
 			}
 		}
 	}
+	/*
+	 * Hangul decomposition is done algorithmically. These are the
+	 * codepoints >= 0xAC00 and <= 0xD7A3. Their UTF-8 encoding is
+	 * always 3 bytes long, so s has been advanced twice, and the
+	 * start of the sequence is at s-2.
+	 */
+	if (LEAF_CCC(trie) == DECOMPOSE && LEAF_STR(trie)[0] == HANGUL)
+		trie = utf8hangul(s - 2, hangul);
 	return trie;
 }
 
@@ -2598,9 +2802,9 @@ utf8nlookup(struct tree *tree, const char *s, size_t len)
  * Forwards to trie_nlookup().
  */
 static utf8leaf_t *
-utf8lookup(struct tree *tree, const char *s)
+utf8lookup(struct tree *tree, unsigned char *hangul, const char *s)
 {
-	return utf8nlookup(tree, s, (size_t)-1);
+	return utf8nlookup(tree, hangul, s, (size_t)-1);
 }
 
 /*
@@ -2624,13 +2828,15 @@ int
 utf8agemax(struct tree *tree, const char *s)
 {
 	utf8leaf_t	*leaf;
-	int		age = 0;
+	int		age;
 	int		leaf_age;
+	unsigned char	hangul[UTF8HANGULLEAF];
 
 	if (!tree)
 		return -1;
+	age = 0;
 	while (*s) {
-		if (!(leaf = utf8lookup(tree, s)))
+		if (!(leaf = utf8lookup(tree, hangul, s)))
 			return -1;
 		leaf_age = ages[LEAF_GEN(leaf)];
 		if (leaf_age <= tree->maxage && leaf_age > age)
@@ -2649,13 +2855,15 @@ int
 utf8agemin(struct tree *tree, const char *s)
 {
 	utf8leaf_t	*leaf;
-	int		age = tree->maxage;
+	int		age;
 	int		leaf_age;
+	unsigned char	hangul[UTF8HANGULLEAF];
 
 	if (!tree)
 		return -1;
+	age = tree->maxage;
 	while (*s) {
-		if (!(leaf = utf8lookup(tree, s)))
+		if (!(leaf = utf8lookup(tree, hangul, s)))
 			return -1;
 		leaf_age = ages[LEAF_GEN(leaf)];
 		if (leaf_age <= tree->maxage && leaf_age < age)
@@ -2673,13 +2881,15 @@ int
 utf8nagemax(struct tree *tree, const char *s, size_t len)
 {
 	utf8leaf_t	*leaf;
-	int		age = 0;
+	int		age;
 	int		leaf_age;
+	unsigned char	hangul[UTF8HANGULLEAF];
 
 	if (!tree)
 		return -1;
+	age = 0;
         while (len && *s) {
-		if (!(leaf = utf8nlookup(tree, s, len)))
+		if (!(leaf = utf8nlookup(tree, hangul, s, len)))
 			return -1;
 		leaf_age = ages[LEAF_GEN(leaf)];
 		if (leaf_age <= tree->maxage && leaf_age > age)
@@ -2699,12 +2909,14 @@ utf8nagemin(struct tree *tree, const char *s, size_t len)
 {
 	utf8leaf_t	*leaf;
 	int		leaf_age;
-	int		age = tree->maxage;
+	int		age;
+	unsigned char	hangul[UTF8HANGULLEAF];
 
 	if (!tree)
 		return -1;
+	age = tree->maxage;
         while (len && *s) {
-		if (!(leaf = utf8nlookup(tree, s, len)))
+		if (!(leaf = utf8nlookup(tree, hangul, s, len)))
 			return -1;
 		leaf_age = ages[LEAF_GEN(leaf)];
 		if (leaf_age <= tree->maxage && leaf_age < age)
@@ -2726,11 +2938,12 @@ utf8len(struct tree *tree, const char *s)
 {
 	utf8leaf_t	*leaf;
 	size_t		ret = 0;
+	unsigned char	hangul[UTF8HANGULLEAF];
 
 	if (!tree)
 		return -1;
 	while (*s) {
-		if (!(leaf = utf8lookup(tree, s)))
+		if (!(leaf = utf8lookup(tree, hangul, s)))
 			return -1;
 		if (ages[LEAF_GEN(leaf)] > tree->maxage)
 			ret += utf8clen(s);
@@ -2752,11 +2965,12 @@ utf8nlen(struct tree *tree, const char *s, size_t len)
 {
 	utf8leaf_t	*leaf;
 	size_t		ret = 0;
+	unsigned char	hangul[UTF8HANGULLEAF];
 
 	if (!tree)
 		return -1;
 	while (len && *s) {
-		if (!(leaf = utf8nlookup(tree, s, len)))
+		if (!(leaf = utf8nlookup(tree, hangul, s, len)))
 			return -1;
 		if (ages[LEAF_GEN(leaf)] > tree->maxage)
 			ret += utf8clen(s);
@@ -2784,6 +2998,7 @@ struct utf8cursor {
 	short int	ccc;
 	short int	nccc;
 	unsigned int	unichar;
+	unsigned char	hangul[UTF8HANGULLEAF];
 };
 
 /*
@@ -2900,10 +3115,12 @@ utf8byte(struct utf8cursor *u8c)
 		}
 
 		/* Look up the data for the current character. */
-		if (u8c->p)
-			leaf = utf8lookup(u8c->tree, u8c->s);
-		else
-			leaf = utf8nlookup(u8c->tree, u8c->s, u8c->len);
+		if (u8c->p) {
+			leaf = utf8lookup(u8c->tree, u8c->hangul, u8c->s);
+		} else {
+			leaf = utf8nlookup(u8c->tree, u8c->hangul,
+					   u8c->s, u8c->len);
+		}
 
 		/* No leaf found implies that the input is a binary blob. */
 		if (!leaf)
@@ -2923,10 +3140,10 @@ utf8byte(struct utf8cursor *u8c)
 				ccc = STOPPER;
 				goto ccc_mismatch;
 			}
-			leaf = utf8lookup(u8c->tree, u8c->s);
+			leaf = utf8lookup(u8c->tree, u8c->hangul, u8c->s);
 			ccc = LEAF_CCC(leaf);
 		}
-		u8c->unichar = utf8code(u8c->s);
+		u8c->unichar = utf8decode(u8c->s);
 
 		/*
 		 * If this is not a stopper, then see if it updates
@@ -3055,7 +3272,7 @@ normalization_test(void)
 		t = buf2;
 		while (*s) {
 			unichar = strtoul(s, &s, 16);
-			t += utf8key(unichar, t);
+			t += utf8encode(t, unichar);
 		}
 		*t = '\0';
 
@@ -3068,13 +3285,13 @@ normalization_test(void)
 			if (data->utf8nfkdi && !*data->utf8nfkdi)
 				ignorables = 1;
 			else
-				t += utf8key(unichar, t);
+				t += utf8encode(t, unichar);
 		}
 		*t = '\0';
 
 		tests++;
 		if (normalize_line(nfkdi_tree) < 0) {
-			printf("\nline %s -> %s", buf0, buf1);
+			printf("Line %s -> %s", buf0, buf1);
 			if (ignorables)
 				printf(" (ignorables removed)");
 			printf(" failure\n");
-- 
1.7.12.4
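
The Hangul arithmetic quoted in the comment block of this patch can be
sketched as a small standalone C fragment. This is an illustration, not
the code from mkutf8data.c: hangul_split(), encode3() and hangul_utf8()
are hypothetical helper names, and only the constants come from the
patch. It also suggests where UTF8HANGULLEAF = 12 plausibly comes from:
two leaf header bytes, at most three 3-byte UTF-8 sequences, and a
terminating NUL.

```c
#include <stddef.h>

/* Constants from the comment block in the patch (Unicode section 3.12). */
enum {
	SBASE = 0xAC00, LBASE = 0x1100, VBASE = 0x1161, TBASE = 0x11A7,
	VCOUNT = 21, TCOUNT = 28, NCOUNT = VCOUNT * TCOUNT
};

/*
 * Hypothetical helper: split a precomposed syllable into jamo.
 * Returns the number of code points stored in out[] (2 or 3), or 0
 * if s is not a Hangul syllable.
 */
int hangul_split(unsigned int s, unsigned int out[3])
{
	unsigned int si, ti;

	if (s < 0xAC00 || s > 0xD7A3)
		return 0;
	si = s - SBASE;
	out[0] = LBASE + si / NCOUNT;		/* LPart */
	out[1] = VBASE + (si % NCOUNT) / TCOUNT;	/* VPart */
	ti = si % TCOUNT;
	if (ti == 0)
		return 2;
	out[2] = TBASE + ti;			/* TPart */
	return 3;
}

/* Hypothetical helper: encode a code point in 0x800..0xFFFF as 3 bytes. */
int encode3(unsigned char *p, unsigned int c)
{
	p[0] = 0xE0 | (c >> 12);
	p[1] = 0x80 | ((c >> 6) & 0x3F);
	p[2] = 0x80 | (c & 0x3F);
	return 3;
}

/*
 * Hypothetical helper: write the NUL-terminated UTF-8 decomposition of
 * s into buf (at least 10 bytes).  Returns the string length without
 * the NUL, or 0 if s is not a Hangul syllable.
 */
size_t hangul_utf8(unsigned int s, unsigned char *buf)
{
	unsigned int jamo[3];
	size_t len = 0;
	int i, n;

	n = hangul_split(s, jamo);
	for (i = 0; i < n; i++)
		len += encode3(buf + len, jamo[i]);
	buf[len] = '\0';
	return len;
}
```

For U+D55C, hangul_split() gives U+1112, U+1161, U+11AB, and
hangul_utf8() writes 9 bytes plus the NUL. Adding the two leaf header
bytes gives 12, which matches UTF8HANGULLEAF.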

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

* [PATCH 21/35] libxfs: return the first match during case-insensitive lookup
  2014-10-03 21:47 [RFC v3] Unicode/UTF-8 support for XFS Ben Myers
                   ` (19 preceding siblings ...)
  2014-10-03 22:08 ` [PATCH 20/35] xfsprogs: reduce the size of utf8data[] Ben Myers
@ 2014-10-03 22:09 ` Ben Myers
  2014-10-03 22:09 ` [PATCH 22/35] libxfs: rename XFS_CMP_CASE to XFS_CMP_MATCH Ben Myers
                   ` (13 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Ben Myers @ 2014-10-03 22:09 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: olaf, xfs

From: Olaf Weber <olaf@sgi.com>

Change the XFS case-insensitive lookup code to return the first match
found, even if it is not an exact match. Whether a filesystem uses
case-insensitive lookups is determined by a superblock bit set during
filesystem creation, so every name in a directory was checked with the
same comparison when it was created. This means that normal use cannot
create two files that both match the same filename, and the first match
found is the only one.

Signed-off-by: Olaf Weber <olaf@sgi.com>
---
 libxfs/xfs_dir2_block.c | 17 ++++-------
 libxfs/xfs_dir2_leaf.c  | 38 ++++-------------------
 libxfs/xfs_dir2_node.c  | 80 ++++++++++++++++++-------------------------------
 libxfs/xfs_dir2_sf.c    |  8 ++---
 4 files changed, 44 insertions(+), 99 deletions(-)
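
The behaviour change described in the commit message can be illustrated
with a toy lookup over an array of names. This is only a sketch, not the
XFS directory code: the TOY_CMP_* values mirror the XFS_CMP_* names, and
toy_compname() and the array "directory" are hypothetical stand-ins for
m_dirnameops->compname and the on-disk formats.

```c
#include <ctype.h>
#include <stddef.h>

enum toy_cmp { TOY_CMP_DIFFERENT, TOY_CMP_EXACT, TOY_CMP_CASE };

/* Hypothetical ASCII-only stand-in for the compname operation. */
enum toy_cmp toy_compname(const char *want, const char *have)
{
	enum toy_cmp result = TOY_CMP_EXACT;

	for (; *want && *have; want++, have++) {
		if (*want == *have)
			continue;
		if (tolower((unsigned char)*want) !=
		    tolower((unsigned char)*have))
			return TOY_CMP_DIFFERENT;
		result = TOY_CMP_CASE;
	}
	/* Different lengths never match. */
	return (*want || *have) ? TOY_CMP_DIFFERENT : result;
}

/* Old behaviour: remember the first CI match, keep looking for an exact one. */
int lookup_prefer_exact(const char **names, size_t n, const char *want)
{
	int ci = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		enum toy_cmp cmp = toy_compname(want, names[i]);

		if (cmp == TOY_CMP_EXACT)
			return (int)i;
		if (cmp == TOY_CMP_CASE && ci < 0)
			ci = (int)i;
	}
	return ci;
}

/* New behaviour: return as soon as any entry matches. */
int lookup_first_match(const char **names, size_t n, const char *want)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (toy_compname(want, names[i]) != TOY_CMP_DIFFERENT)
			return (int)i;
	return -1;
}
```

The two functions only disagree on a hand-built array such as
{"readme", "README"}, where a lookup of "README" returns index 1 with the
old logic and index 0 with the new one. If entries can only be added
through the same comparison, such an array cannot arise, which is the
argument the commit message makes.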

diff --git a/libxfs/xfs_dir2_block.c b/libxfs/xfs_dir2_block.c
index cede01f..2880431 100644
--- a/libxfs/xfs_dir2_block.c
+++ b/libxfs/xfs_dir2_block.c
@@ -705,28 +705,21 @@ xfs_dir2_block_lookup_int(
 		dep = (xfs_dir2_data_entry_t *)
 			((char *)hdr + xfs_dir2_dataptr_to_off(mp, addr));
 		/*
-		 * Compare name and if it's an exact match, return the index
-		 * and buffer. If it's the first case-insensitive match, store
-		 * the index and buffer and continue looking for an exact match.
+		 * Compare name and if it's a match, return the
+		 * index and buffer.
 		 */
 		cmp = mp->m_dirnameops->compname(args, dep->name, dep->namelen);
-		if (cmp != XFS_CMP_DIFFERENT && cmp != args->cmpresult) {
+		if (cmp != XFS_CMP_DIFFERENT) {
 			args->cmpresult = cmp;
 			*bpp = bp;
 			*entno = mid;
-			if (cmp == XFS_CMP_EXACT)
-				return 0;
+			return 0;
 		}
 	} while (++mid < be32_to_cpu(btp->count) &&
 			be32_to_cpu(blp[mid].hashval) == hash);
 
 	ASSERT(args->op_flags & XFS_DA_OP_OKNOENT);
-	/*
-	 * Here, we can only be doing a lookup (not a rename or replace).
-	 * If a case-insensitive match was found earlier, return success.
-	 */
-	if (args->cmpresult == XFS_CMP_CASE)
-		return 0;
+	ASSERT(args->cmpresult == XFS_CMP_DIFFERENT);
 	/*
 	 * No match, release the buffer and return ENOENT.
 	 */
diff --git a/libxfs/xfs_dir2_leaf.c b/libxfs/xfs_dir2_leaf.c
index 8e0cbc9..b1901d3 100644
--- a/libxfs/xfs_dir2_leaf.c
+++ b/libxfs/xfs_dir2_leaf.c
@@ -1246,7 +1246,6 @@ xfs_dir2_leaf_lookup_int(
 	xfs_mount_t		*mp;		/* filesystem mount point */
 	xfs_dir2_db_t		newdb;		/* new data block number */
 	xfs_trans_t		*tp;		/* transaction pointer */
-	xfs_dir2_db_t		cidb = -1;	/* case match data block no. */
 	enum xfs_dacmp		cmp;		/* name compare result */
 	struct xfs_dir2_leaf_entry *ents;
 	struct xfs_dir3_icleaf_hdr leafhdr;
@@ -1307,47 +1306,22 @@ xfs_dir2_leaf_lookup_int(
 		dep = (xfs_dir2_data_entry_t *)((char *)dbp->b_addr +
 			xfs_dir2_dataptr_to_off(mp, be32_to_cpu(lep->address)));
 		/*
-		 * Compare name and if it's an exact match, return the index
-		 * and buffer. If it's the first case-insensitive match, store
-		 * the index and buffer and continue looking for an exact match.
+		 * Compare name and if it's a match, return the index
+		 * and buffer.
 		 */
 		cmp = mp->m_dirnameops->compname(args, dep->name, dep->namelen);
-		if (cmp != XFS_CMP_DIFFERENT && cmp != args->cmpresult) {
+		if (cmp != XFS_CMP_DIFFERENT) {
 			args->cmpresult = cmp;
 			*indexp = index;
-			/* case exact match: return the current buffer. */
-			if (cmp == XFS_CMP_EXACT) {
-				*dbpp = dbp;
-				return 0;
-			}
-			cidb = curdb;
+			*dbpp = dbp;
+			return 0;
 		}
 	}
 	ASSERT(args->op_flags & XFS_DA_OP_OKNOENT);
-	/*
-	 * Here, we can only be doing a lookup (not a rename or remove).
-	 * If a case-insensitive match was found earlier, re-read the
-	 * appropriate data block if required and return it.
-	 */
-	if (args->cmpresult == XFS_CMP_CASE) {
-		ASSERT(cidb != -1);
-		if (cidb != curdb) {
-			xfs_trans_brelse(tp, dbp);
-			error = xfs_dir3_data_read(tp, dp,
-						   xfs_dir2_db_to_da(mp, cidb),
-						   -1, &dbp);
-			if (error) {
-				xfs_trans_brelse(tp, lbp);
-				return error;
-			}
-		}
-		*dbpp = dbp;
-		return 0;
-	}
+	ASSERT(args->cmpresult == XFS_CMP_DIFFERENT);
 	/*
 	 * No match found, return ENOENT.
 	 */
-	ASSERT(cidb == -1);
 	if (dbp)
 		xfs_trans_brelse(tp, dbp);
 	xfs_trans_brelse(tp, lbp);
diff --git a/libxfs/xfs_dir2_node.c b/libxfs/xfs_dir2_node.c
index 3737e4e..fb27506 100644
--- a/libxfs/xfs_dir2_node.c
+++ b/libxfs/xfs_dir2_node.c
@@ -702,6 +702,7 @@ xfs_dir2_leafn_lookup_for_entry(
 	xfs_dir2_db_t		curdb = -1;	/* current data block number */
 	xfs_dir2_data_entry_t	*dep;		/* data block entry */
 	xfs_inode_t		*dp;		/* incore directory inode */
+	int			di = -1;	/* data entry index */
 	int			error;		/* error return value */
 	int			index;		/* leaf entry index */
 	xfs_dir2_leaf_t		*leaf;		/* leaf structure */
@@ -733,6 +734,7 @@ xfs_dir2_leafn_lookup_for_entry(
 	if (state->extravalid) {
 		curbp = state->extrablk.bp;
 		curdb = state->extrablk.blkno;
+		di = state->extrablk.index;
 	}
 	/*
 	 * Loop over leaf entries with the right hash value.
@@ -757,27 +759,20 @@ xfs_dir2_leafn_lookup_for_entry(
 		 */
 		if (newdb != curdb) {
 			/*
-			 * If we had a block before that we aren't saving
-			 * for a CI name, drop it
+			 * If we had a block, drop it
 			 */
-			if (curbp && (args->cmpresult == XFS_CMP_DIFFERENT ||
-						curdb != state->extrablk.blkno))
+			if (curbp) {
 				xfs_trans_brelse(tp, curbp);
+				di = -1;
+			}
 			/*
-			 * If needing the block that is saved with a CI match,
-			 * use it otherwise read in the new data block.
+			 * Read in the new data block.
 			 */
-			if (args->cmpresult != XFS_CMP_DIFFERENT &&
-					newdb == state->extrablk.blkno) {
-				ASSERT(state->extravalid);
-				curbp = state->extrablk.bp;
-			} else {
-				error = xfs_dir3_data_read(tp, dp,
-						xfs_dir2_db_to_da(mp, newdb),
-						-1, &curbp);
-				if (error)
-					return error;
-			}
+			error = xfs_dir3_data_read(tp, dp,
+					xfs_dir2_db_to_da(mp, newdb),
+					-1, &curbp);
+			if (error)
+				return error;
 			xfs_dir3_data_check(dp, curbp);
 			curdb = newdb;
 		}
@@ -787,53 +782,36 @@ xfs_dir2_leafn_lookup_for_entry(
 		dep = (xfs_dir2_data_entry_t *)((char *)curbp->b_addr +
 			xfs_dir2_dataptr_to_off(mp, be32_to_cpu(lep->address)));
 		/*
-		 * Compare the entry and if it's an exact match, return
-		 * EEXIST immediately. If it's the first case-insensitive
-		 * match, store the block & inode number and continue looking.
+		 * Compare the entry and if it's a match, return
+		 * EEXIST immediately.
 		 */
 		cmp = mp->m_dirnameops->compname(args, dep->name, dep->namelen);
-		if (cmp != XFS_CMP_DIFFERENT && cmp != args->cmpresult) {
-			/* If there is a CI match block, drop it */
-			if (args->cmpresult != XFS_CMP_DIFFERENT &&
-						curdb != state->extrablk.blkno)
-				xfs_trans_brelse(tp, state->extrablk.bp);
+		if (cmp != XFS_CMP_DIFFERENT) {
 			args->cmpresult = cmp;
 			args->inumber = be64_to_cpu(dep->inumber);
 			args->filetype = xfs_dir3_dirent_get_ftype(mp, dep);
-			*indexp = index;
-			state->extravalid = 1;
-			state->extrablk.bp = curbp;
-			state->extrablk.blkno = curdb;
-			state->extrablk.index = (int)((char *)dep -
-							(char *)curbp->b_addr);
-			state->extrablk.magic = XFS_DIR2_DATA_MAGIC;
-			curbp->b_ops = &xfs_dir3_data_buf_ops;
-			xfs_trans_buf_set_type(tp, curbp, XFS_BLFT_DIR_DATA_BUF);
-			if (cmp == XFS_CMP_EXACT)
-				return XFS_ERROR(EEXIST);
+			error = EEXIST;
+			goto out;
 		}
 	}
+	/* Didn't find a match */
+	error = ENOENT;
 	ASSERT(index == leafhdr.count || (args->op_flags & XFS_DA_OP_OKNOENT));
+out:
 	if (curbp) {
-		if (args->cmpresult == XFS_CMP_DIFFERENT) {
-			/* Giving back last used data block. */
-			state->extravalid = 1;
-			state->extrablk.bp = curbp;
-			state->extrablk.index = -1;
-			state->extrablk.blkno = curdb;
-			state->extrablk.magic = XFS_DIR2_DATA_MAGIC;
-			curbp->b_ops = &xfs_dir3_data_buf_ops;
-			xfs_trans_buf_set_type(tp, curbp, XFS_BLFT_DIR_DATA_BUF);
-		} else {
-			/* If the curbp is not the CI match block, drop it */
-			if (state->extrablk.bp != curbp)
-				xfs_trans_brelse(tp, curbp);
-		}
+		/* Giving back last used data block. */
+		state->extravalid = 1;
+		state->extrablk.bp = curbp;
+		state->extrablk.index = di;
+		state->extrablk.blkno = curdb;
+		state->extrablk.magic = XFS_DIR2_DATA_MAGIC;
+		curbp->b_ops = &xfs_dir3_data_buf_ops;
+		xfs_trans_buf_set_type(tp, curbp, XFS_BLFT_DIR_DATA_BUF);
 	} else {
 		state->extravalid = 0;
 	}
 	*indexp = index;
-	return XFS_ERROR(ENOENT);
+	return XFS_ERROR(error);
 }
 
 /*
diff --git a/libxfs/xfs_dir2_sf.c b/libxfs/xfs_dir2_sf.c
index 7580333..7b01d43 100644
--- a/libxfs/xfs_dir2_sf.c
+++ b/libxfs/xfs_dir2_sf.c
@@ -833,13 +833,12 @@ xfs_dir2_sf_lookup(
 	for (i = 0, sfep = xfs_dir2_sf_firstentry(sfp); i < sfp->count;
 	     i++, sfep = xfs_dir3_sf_nextentry(dp->i_mount, sfp, sfep)) {
 		/*
-		 * Compare name and if it's an exact match, return the inode
-		 * number. If it's the first case-insensitive match, store the
-		 * inode number and continue looking for an exact match.
+		 * Compare name and if it's a match, return the inode
+		 * number.
 		 */
 		cmp = dp->i_mount->m_dirnameops->compname(args, sfep->name,
 								sfep->namelen);
-		if (cmp != XFS_CMP_DIFFERENT && cmp != args->cmpresult) {
+		if (cmp != XFS_CMP_DIFFERENT) {
 			args->cmpresult = cmp;
 			args->inumber = xfs_dir3_sfe_get_ino(dp->i_mount,
 							     sfp, sfep);
@@ -848,6 +847,7 @@ xfs_dir2_sf_lookup(
 			if (cmp == XFS_CMP_EXACT)
 				return XFS_ERROR(EEXIST);
 			ci_sfep = sfep;
+			break;
 		}
 	}
 	ASSERT(args->op_flags & XFS_DA_OP_OKNOENT);
-- 
1.7.12.4


* [PATCH 22/35] libxfs: rename XFS_CMP_CASE to XFS_CMP_MATCH
  2014-10-03 21:47 [RFC v3] Unicode/UTF-8 support for XFS Ben Myers
                   ` (20 preceding siblings ...)
  2014-10-03 22:09 ` [PATCH 21/35] libxfs: return the first match during case-insensitive lookup Ben Myers
@ 2014-10-03 22:09 ` Ben Myers
  2014-10-03 22:10 ` [PATCH 23/35] libxfs: add xfs_nameops.normhash Ben Myers
                   ` (12 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Ben Myers @ 2014-10-03 22:09 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: olaf, xfs

From: Olaf Weber <olaf@sgi.com>

Rename XFS_CMP_CASE to XFS_CMP_MATCH. With Unicode filenames and
normalization, different strings can match on criteria other than
case insensitivity.
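
For illustration only (not part of the patch): a minimal, self-contained
sketch of the three-way comparison result, modelled on
xfs_ascii_ci_compname() from the diff below. The enum and function names
here are simplified stand-ins, not the libxfs symbols.

```c
#include <ctype.h>
#include <stddef.h>

/* Simplified stand-in for enum xfs_dacmp after the rename. */
enum dacmp {
	CMP_DIFFERENT,	/* names are completely different */
	CMP_EXACT,	/* names are byte-for-byte identical */
	CMP_MATCH	/* bytes differ, but the names are considered equal */
};

/* ASCII case-insensitive compare, modelled on xfs_ascii_ci_compname(). */
enum dacmp
ascii_ci_compname(const unsigned char *a, size_t alen,
		  const unsigned char *b, size_t blen)
{
	enum dacmp result = CMP_EXACT;
	size_t i;

	if (alen != blen)
		return CMP_DIFFERENT;
	for (i = 0; i < alen; i++) {
		if (a[i] == b[i])
			continue;
		if (tolower(a[i]) != tolower(b[i]))
			return CMP_DIFFERENT;
		/* Bytes differ but fold to the same value. */
		result = CMP_MATCH;
	}
	return result;
}
```

With a normalizing comparison in place of tolower(), the same CMP_MATCH
result would cover names that differ in more than case, which is why the
old name no longer fits.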

Signed-off-by: Olaf Weber <olaf@sgi.com>
---
 include/xfs_da_btree.h | 2 +-
 libxfs/xfs_dir2.c      | 9 ++++++---
 libxfs/xfs_dir2_node.c | 2 +-
 3 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/include/xfs_da_btree.h b/include/xfs_da_btree.h
index e492dca..3d9f9dd 100644
--- a/include/xfs_da_btree.h
+++ b/include/xfs_da_btree.h
@@ -34,7 +34,7 @@ struct zone;
 enum xfs_dacmp {
 	XFS_CMP_DIFFERENT,	/* names are completely different */
 	XFS_CMP_EXACT,		/* names are exactly the same */
-	XFS_CMP_CASE		/* names are same but differ in case */
+	XFS_CMP_MATCH		/* names are same but differ in encoding */
 };
 
 /*
diff --git a/libxfs/xfs_dir2.c b/libxfs/xfs_dir2.c
index 4c8c836..57e98a3 100644
--- a/libxfs/xfs_dir2.c
+++ b/libxfs/xfs_dir2.c
@@ -72,7 +72,7 @@ xfs_ascii_ci_compname(
 			continue;
 		if (tolower(args->name[i]) != tolower(name[i]))
 			return XFS_CMP_DIFFERENT;
-		result = XFS_CMP_CASE;
+		result = XFS_CMP_MATCH;
 	}
 
 	return result;
@@ -248,8 +248,11 @@ xfs_dir_cilookup_result(
 {
 	if (args->cmpresult == XFS_CMP_DIFFERENT)
 		return ENOENT;
-	if (args->cmpresult != XFS_CMP_CASE ||
-					!(args->op_flags & XFS_DA_OP_CILOOKUP))
+	if (args->cmpresult == XFS_CMP_EXACT)
+		return EEXIST;
+	ASSERT(args->cmpresult == XFS_CMP_MATCH);
+	/* Only dup the found name if XFS_DA_OP_CILOOKUP is set. */
+	if (!(args->op_flags & XFS_DA_OP_CILOOKUP))
 		return EEXIST;
 
 	args->value = kmem_alloc(len, KM_NOFS | KM_MAYFAIL);
diff --git a/libxfs/xfs_dir2_node.c b/libxfs/xfs_dir2_node.c
index fb27506..550ca99 100644
--- a/libxfs/xfs_dir2_node.c
+++ b/libxfs/xfs_dir2_node.c
@@ -2034,7 +2034,7 @@ xfs_dir2_node_lookup(
 	error = xfs_da3_node_lookup_int(state, &rval);
 	if (error)
 		rval = error;
-	else if (rval == ENOENT && args->cmpresult == XFS_CMP_CASE) {
+	else if (rval == ENOENT && args->cmpresult == XFS_CMP_MATCH) {
 		/* If a CI match, dup the actual name and return EEXIST */
 		xfs_dir2_data_entry_t	*dep;
 
-- 
1.7.12.4


* [PATCH 23/35] libxfs: add xfs_nameops.normhash
  2014-10-03 21:47 [RFC v3] Unicode/UTF-8 support for XFS Ben Myers
                   ` (21 preceding siblings ...)
  2014-10-03 22:09 ` [PATCH 22/35] libxfs: rename XFS_CMP_CASE to XFS_CMP_MATCH Ben Myers
@ 2014-10-03 22:10 ` Ben Myers
  2014-10-03 22:11 ` [PATCH 24/35] libxfs: change interface of xfs_nameops.hashname Ben Myers
                   ` (11 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Ben Myers @ 2014-10-03 22:10 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: olaf, xfs

From: Olaf Weber <olaf@sgi.com>

Add a normhash callout to the xfs_nameops. This callout takes an
xfs_da_args structure as its argument, and calculates a hash value
over the name. It may in the process create a normalized form of the
name, and assign that to the norm/normlen fields in the xfs_da_args
structure.
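
For illustration only (not part of the patch): a userspace sketch of what
a normhash callout that does allocate a normalized name might look like,
and of the caller's duty to free it. The struct is a pared-down,
hypothetical stand-in for xfs_da_args, ASCII lowercasing stands in for
real normalization, and the byte-wise hash follows the style of
xfs_ascii_ci_hashname() rather than xfs_da_hashname().

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Pared-down, hypothetical stand-in for struct xfs_da_args. */
struct da_args {
	const unsigned char	*name;		/* name as supplied */
	const unsigned char	*norm;		/* normalized name, or NULL */
	int			namelen;
	int			normlen;
	uint32_t		hashval;
};

uint32_t rol32(uint32_t word, unsigned int shift)
{
	return (word << shift) | (word >> (32 - shift));
}

/* Byte-at-a-time hash in the style of xfs_ascii_ci_hashname(). */
uint32_t hash_bytes(const unsigned char *p, int len)
{
	uint32_t hash = 0;
	int i;

	for (i = 0; i < len; i++)
		hash = p[i] ^ rol32(hash, 7);
	return hash;
}

/*
 * Example callout: build a lowercased copy of the name, store it in
 * norm/normlen, and hash the normalized form.  Returns 0 or ENOMEM.
 */
int ci_normhash(struct da_args *args)
{
	unsigned char *norm;
	int i;

	norm = malloc(args->namelen + 1);
	if (!norm)
		return ENOMEM;
	for (i = 0; i < args->namelen; i++)
		norm[i] = tolower(args->name[i]);
	norm[i] = '\0';
	args->norm = norm;
	args->normlen = args->namelen;
	args->hashval = hash_bytes(norm, args->normlen);
	return 0;
}

/* Caller side: release the normalized copy once the operation is done. */
void args_release(struct da_args *args)
{
	free((void *)args->norm);
	args->norm = NULL;
	args->normlen = 0;
}
```

The cast in args_release() mirrors the one the patch applies before
kmem_free(), since norm is declared const.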

Changes: The pointer passed to kmem_free() is cast to void * to
suppress compiler warnings about discarding the const qualifier.

Signed-off-by: Olaf Weber <olaf@sgi.com>
---
 include/xfs_da_btree.h |  3 +++
 libxfs/xfs_da_btree.c  |  9 ++++++++
 libxfs/xfs_dir2.c      | 56 +++++++++++++++++++++++++++++++++++++++-----------
 3 files changed, 56 insertions(+), 12 deletions(-)

diff --git a/include/xfs_da_btree.h b/include/xfs_da_btree.h
index 3d9f9dd..f532d63 100644
--- a/include/xfs_da_btree.h
+++ b/include/xfs_da_btree.h
@@ -42,7 +42,9 @@ enum xfs_dacmp {
  */
 typedef struct xfs_da_args {
 	const __uint8_t	*name;		/* string (maybe not NULL terminated) */
+	const __uint8_t	*norm;		/* normalized name (may be NULL) */
 	int		namelen;	/* length of string (maybe no NULL) */
+	int		normlen;	/* length of normalized name */
 	__uint8_t	filetype;	/* filetype of inode for directories */
 	__uint8_t	*value;		/* set of bytes (maybe contain NULLs) */
 	int		valuelen;	/* length of value */
@@ -131,6 +133,7 @@ typedef struct xfs_da_state {
  */
 struct xfs_nameops {
 	xfs_dahash_t	(*hashname)(struct xfs_name *);
+	int		(*normhash)(struct xfs_da_args *);
 	enum xfs_dacmp	(*compname)(struct xfs_da_args *,
 					const unsigned char *, int);
 };
diff --git a/libxfs/xfs_da_btree.c b/libxfs/xfs_da_btree.c
index b731b54..eb97317 100644
--- a/libxfs/xfs_da_btree.c
+++ b/libxfs/xfs_da_btree.c
@@ -2000,8 +2000,17 @@ xfs_default_hashname(
 	return xfs_da_hashname(name->name, name->len);
 }
 
+STATIC int
+xfs_da_normhash(
+	struct xfs_da_args *args)
+{
+	args->hashval = xfs_da_hashname(args->name, args->namelen);
+	return 0;
+}
+
 const struct xfs_nameops xfs_default_nameops = {
 	.hashname	= xfs_default_hashname,
+	.normhash	= xfs_da_normhash,
 	.compname	= xfs_da_compname
 };
 
diff --git a/libxfs/xfs_dir2.c b/libxfs/xfs_dir2.c
index 57e98a3..e52d082 100644
--- a/libxfs/xfs_dir2.c
+++ b/libxfs/xfs_dir2.c
@@ -54,6 +54,21 @@ xfs_ascii_ci_hashname(
 	return hash;
 }
 
+STATIC int
+xfs_ascii_ci_normhash(
+	struct xfs_da_args *args)
+{
+	xfs_dahash_t	hash;
+	int		i;
+
+	for (i = 0, hash = 0; i < args->namelen; i++)
+		hash = tolower(args->name[i]) ^ rol32(hash, 7);
+
+	args->hashval = hash;
+	return 0;
+}
+
+
 STATIC enum xfs_dacmp
 xfs_ascii_ci_compname(
 	struct xfs_da_args *args,
@@ -80,6 +95,7 @@ xfs_ascii_ci_compname(
 
 static struct xfs_nameops xfs_ascii_ci_nameops = {
 	.hashname	= xfs_ascii_ci_hashname,
+	.normhash	= xfs_ascii_ci_normhash,
 	.compname	= xfs_ascii_ci_compname,
 };
 
@@ -211,7 +227,6 @@ xfs_dir_createname(
 	args.name = name->name;
 	args.namelen = name->len;
 	args.filetype = name->type;
-	args.hashval = dp->i_mount->m_dirnameops->hashname(name);
 	args.inumber = inum;
 	args.dp = dp;
 	args.firstblock = first;
@@ -220,19 +235,24 @@ xfs_dir_createname(
 	args.whichfork = XFS_DATA_FORK;
 	args.trans = tp;
 	args.op_flags = XFS_DA_OP_ADDNAME | XFS_DA_OP_OKNOENT;
+	if ((rval = dp->i_mount->m_dirnameops->normhash(&args)))
+		return rval;
 
 	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL)
 		rval = xfs_dir2_sf_addname(&args);
 	else if ((rval = xfs_dir2_isblock(tp, dp, &v)))
-		return rval;
+		goto out_free;
 	else if (v)
 		rval = xfs_dir2_block_addname(&args);
 	else if ((rval = xfs_dir2_isleaf(tp, dp, &v)))
-		return rval;
+		goto out_free;
 	else if (v)
 		rval = xfs_dir2_leaf_addname(&args);
 	else
 		rval = xfs_dir2_node_addname(&args);
+out_free:
+	if (args.norm)
+		kmem_free((void *)args.norm);
 	return rval;
 }
 
@@ -289,22 +309,23 @@ xfs_dir_lookup(
 	args.name = name->name;
 	args.namelen = name->len;
 	args.filetype = name->type;
-	args.hashval = dp->i_mount->m_dirnameops->hashname(name);
 	args.dp = dp;
 	args.whichfork = XFS_DATA_FORK;
 	args.trans = tp;
 	args.op_flags = XFS_DA_OP_OKNOENT;
 	if (ci_name)
 		args.op_flags |= XFS_DA_OP_CILOOKUP;
+	if ((rval = dp->i_mount->m_dirnameops->normhash(&args)))
+		return rval;
 
 	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL)
 		rval = xfs_dir2_sf_lookup(&args);
 	else if ((rval = xfs_dir2_isblock(tp, dp, &v)))
-		return rval;
+		goto out_free;
 	else if (v)
 		rval = xfs_dir2_block_lookup(&args);
 	else if ((rval = xfs_dir2_isleaf(tp, dp, &v)))
-		return rval;
+		goto out_free;
 	else if (v)
 		rval = xfs_dir2_leaf_lookup(&args);
 	else
@@ -318,6 +339,9 @@ xfs_dir_lookup(
 			ci_name->len = args.valuelen;
 		}
 	}
+out_free:
+	if (args.norm)
+		kmem_free((void *)args.norm);
 	return rval;
 }
 
@@ -345,7 +369,6 @@ xfs_dir_removename(
 	args.name = name->name;
 	args.namelen = name->len;
 	args.filetype = name->type;
-	args.hashval = dp->i_mount->m_dirnameops->hashname(name);
 	args.inumber = ino;
 	args.dp = dp;
 	args.firstblock = first;
@@ -353,19 +376,24 @@ xfs_dir_removename(
 	args.total = total;
 	args.whichfork = XFS_DATA_FORK;
 	args.trans = tp;
+	if ((rval = dp->i_mount->m_dirnameops->normhash(&args)))
+		return rval;
 
 	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL)
 		rval = xfs_dir2_sf_removename(&args);
 	else if ((rval = xfs_dir2_isblock(tp, dp, &v)))
-		return rval;
+		goto out_free;
 	else if (v)
 		rval = xfs_dir2_block_removename(&args);
 	else if ((rval = xfs_dir2_isleaf(tp, dp, &v)))
-		return rval;
+		goto out_free;
 	else if (v)
 		rval = xfs_dir2_leaf_removename(&args);
 	else
 		rval = xfs_dir2_node_removename(&args);
+out_free:
+	if (args.norm)
+		kmem_free((void *)args.norm);
 	return rval;
 }
 
@@ -395,7 +423,6 @@ xfs_dir_replace(
 	args.name = name->name;
 	args.namelen = name->len;
 	args.filetype = name->type;
-	args.hashval = dp->i_mount->m_dirnameops->hashname(name);
 	args.inumber = inum;
 	args.dp = dp;
 	args.firstblock = first;
@@ -403,19 +430,24 @@ xfs_dir_replace(
 	args.total = total;
 	args.whichfork = XFS_DATA_FORK;
 	args.trans = tp;
+	if ((rval = dp->i_mount->m_dirnameops->normhash(&args)))
+		return rval;
 
 	if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL)
 		rval = xfs_dir2_sf_replace(&args);
 	else if ((rval = xfs_dir2_isblock(tp, dp, &v)))
-		return rval;
+		goto out_free;
 	else if (v)
 		rval = xfs_dir2_block_replace(&args);
 	else if ((rval = xfs_dir2_isleaf(tp, dp, &v)))
-		return rval;
+		goto out_free;
 	else if (v)
 		rval = xfs_dir2_leaf_replace(&args);
 	else
 		rval = xfs_dir2_node_replace(&args);
+out_free:
+	if (args.norm)
+		kmem_free((void *)args.norm);
 	return rval;
 }
 
-- 
1.7.12.4


* [PATCH 24/35] libxfs: change interface of xfs_nameops.hashname
  2014-10-03 21:47 [RFC v3] Unicode/UTF-8 support for XFS Ben Myers
                   ` (22 preceding siblings ...)
  2014-10-03 22:10 ` [PATCH 23/35] libxfs: add xfs_nameops.normhash Ben Myers
@ 2014-10-03 22:11 ` Ben Myers
  2014-10-03 22:11 ` [PATCH 25/35] libxfs: add a superblock feature bit to indicate UTF-8 support Ben Myers
                   ` (10 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Ben Myers @ 2014-10-03 22:11 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: olaf, xfs

From: Olaf Weber <olaf@sgi.com>

With the introduction of the xfs_nameops.normhash callout, all uses of the
hashname callout now occur in places where an xfs_name structure must be
explicitly created just to match the parameter passing convention of this
callout. Change the arguments to a const unsigned char * and int instead.
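
For illustration only (not part of the patch): a small sketch of the new
calling convention, where a caller that already holds a name pointer and
length can hash it without building a temporary xfs_name. The struct and
function names are hypothetical, and the byte-wise hash used for the
default ops is a simplification of xfs_da_hashname().

```c
#include <ctype.h>
#include <stdint.h>

/* Hypothetical, reduced nameops table using the new signature. */
struct nameops {
	uint32_t (*hashname)(const unsigned char *name, int len,
			     unsigned int version);
};

uint32_t rol32(uint32_t word, unsigned int shift)
{
	return (word << shift) | (word >> (32 - shift));
}

/* Simplified default hash: plain bytes, version argument unused. */
uint32_t default_hashname(const unsigned char *name, int len,
			  unsigned int unused)
{
	uint32_t hash = 0;
	int i;

	(void)unused;
	for (i = 0; i < len; i++)
		hash = name[i] ^ rol32(hash, 7);
	return hash;
}

/* ASCII case-insensitive hash, as in xfs_ascii_ci_hashname(). */
uint32_t ascii_ci_hashname(const unsigned char *name, int len,
			   unsigned int unused)
{
	uint32_t hash = 0;
	int i;

	(void)unused;
	for (i = 0; i < len; i++)
		hash = tolower(name[i]) ^ rol32(hash, 7);
	return hash;
}

/* Hypothetical on-disk-style entry: a length and a name, nothing else. */
struct dirent_lite {
	unsigned char	namelen;
	unsigned char	name[16];
};

/* The entry's fields are passed straight through to the callout. */
uint32_t hash_entry(const struct nameops *ops, const struct dirent_lite *dep)
{
	return ops->hashname(dep->name, dep->namelen,
			     0 /* version for later */);
}
```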

Signed-off-by: Olaf Weber <olaf@sgi.com>

[v2: pass a 3rd argument for sb_utf8version to hashname.  --bpm]
---
 db/check.c              |  7 +++----
 include/xfs_da_btree.h  |  3 ++-
 libxfs/xfs_da_btree.c   | 18 ++++++++++--------
 libxfs/xfs_dir2.c       | 11 +++++++----
 libxfs/xfs_dir2_block.c |  7 +++----
 libxfs/xfs_dir2_data.c  |  7 +++----
 repair/phase6.c         |  3 ++-
 7 files changed, 30 insertions(+), 26 deletions(-)

diff --git a/db/check.c b/db/check.c
index 4fd9fd0..d317a71 100644
--- a/db/check.c
+++ b/db/check.c
@@ -2212,7 +2212,6 @@ process_data_dir_v2(
 	int			stale = 0;
 	int			tag_err;
 	__be16			*tagp;
-	struct xfs_name		xname;
 
 	data = iocur_top->data;
 	block = iocur_top->data;
@@ -2323,9 +2322,9 @@ process_data_dir_v2(
 		tag_err += be16_to_cpu(*tagp) != (char *)dep - (char *)data;
 		addr = xfs_dir2_db_off_to_dataptr(mp, db,
 			(char *)dep - (char *)data);
-		xname.name = dep->name;
-		xname.len = dep->namelen;
-		dir_hash_add(mp->m_dirnameops->hashname(&xname), addr);
+		dir_hash_add(mp->m_dirnameops->hashname(dep->name,
+					dep->namelen,
+					0 /* version for later */), addr);
 		ptr += xfs_dir3_data_entsize(mp, dep->namelen);
 		count++;
 		lastfree = 0;
diff --git a/include/xfs_da_btree.h b/include/xfs_da_btree.h
index f532d63..33efd3e 100644
--- a/include/xfs_da_btree.h
+++ b/include/xfs_da_btree.h
@@ -132,7 +132,8 @@ typedef struct xfs_da_state {
  * Name ops for directory and/or attr name operations
  */
 struct xfs_nameops {
-	xfs_dahash_t	(*hashname)(struct xfs_name *);
+	xfs_dahash_t	(*hashname)(const unsigned char *, int,
+					unsigned int);
 	int		(*normhash)(struct xfs_da_args *);
 	enum xfs_dacmp	(*compname)(struct xfs_da_args *,
 					const unsigned char *, int);
diff --git a/libxfs/xfs_da_btree.c b/libxfs/xfs_da_btree.c
index eb97317..c9784bd 100644
--- a/libxfs/xfs_da_btree.c
+++ b/libxfs/xfs_da_btree.c
@@ -1983,6 +1983,15 @@ xfs_da_hashname(const __uint8_t *name, int namelen)
 	}
 }
 
+xfs_dahash_t
+xfs_da_hashname_op(
+	const __uint8_t		*name,
+	int 			namelen,
+	unsigned int 		unused)
+{
+	return xfs_da_hashname(name, namelen);
+}
+
 enum xfs_dacmp
 xfs_da_compname(
 	struct xfs_da_args *args,
@@ -1993,13 +2002,6 @@ xfs_da_compname(
 					XFS_CMP_EXACT : XFS_CMP_DIFFERENT;
 }
 
-static xfs_dahash_t
-xfs_default_hashname(
-	struct xfs_name	*name)
-{
-	return xfs_da_hashname(name->name, name->len);
-}
-
 STATIC int
 xfs_da_normhash(
 	struct xfs_da_args *args)
@@ -2009,7 +2011,7 @@ xfs_da_normhash(
 }
 
 const struct xfs_nameops xfs_default_nameops = {
-	.hashname	= xfs_default_hashname,
+	.hashname	= xfs_da_hashname_op,
 	.normhash	= xfs_da_normhash,
 	.compname	= xfs_da_compname
 };
diff --git a/libxfs/xfs_dir2.c b/libxfs/xfs_dir2.c
index e52d082..191925d 100644
--- a/libxfs/xfs_dir2.c
+++ b/libxfs/xfs_dir2.c
@@ -43,13 +43,15 @@ const unsigned char xfs_mode_to_ftype[S_IFMT >> S_SHIFT] = {
  */
 STATIC xfs_dahash_t
 xfs_ascii_ci_hashname(
-	struct xfs_name	*name)
+	const unsigned char *name,
+	int len,
+	unsigned int unused)
 {
 	xfs_dahash_t	hash;
 	int		i;
 
-	for (i = 0, hash = 0; i < name->len; i++)
-		hash = tolower(name->name[i]) ^ rol32(hash, 7);
+	for (i = 0, hash = 0; i < len; i++)
+		hash = tolower(name[i]) ^ rol32(hash, 7);
 
 	return hash;
 }
@@ -475,7 +477,8 @@ xfs_dir_canenter(
 	args.name = name->name;
 	args.namelen = name->len;
 	args.filetype = name->type;
-	args.hashval = dp->i_mount->m_dirnameops->hashname(name);
+	args.hashval = dp->i_mount->m_dirnameops->hashname(name->name,
+			name->len, 0 /* version for later */);
 	args.dp = dp;
 	args.whichfork = XFS_DATA_FORK;
 	args.trans = tp;
diff --git a/libxfs/xfs_dir2_block.c b/libxfs/xfs_dir2_block.c
index 2880431..c26308e 100644
--- a/libxfs/xfs_dir2_block.c
+++ b/libxfs/xfs_dir2_block.c
@@ -1047,7 +1047,6 @@ xfs_dir2_sf_to_block(
 	xfs_dir2_sf_hdr_t	*sfp;		/* shortform header  */
 	__be16			*tagp;		/* end of data entry */
 	xfs_trans_t		*tp;		/* transaction pointer */
-	struct xfs_name		name;
 	struct xfs_ifork	*ifp;
 
 	trace_xfs_dir2_sf_to_block(args);
@@ -1205,10 +1204,10 @@ xfs_dir2_sf_to_block(
 		tagp = xfs_dir3_data_entry_tag_p(mp, dep);
 		*tagp = cpu_to_be16((char *)dep - (char *)hdr);
 		xfs_dir2_data_log_entry(tp, bp, dep);
-		name.name = sfep->name;
-		name.len = sfep->namelen;
 		blp[2 + i].hashval = cpu_to_be32(mp->m_dirnameops->
-							hashname(&name));
+					hashname(sfep->name,
+						 sfep->namelen,
+						 0 /* version for later */));
 		blp[2 + i].address = cpu_to_be32(xfs_dir2_byte_to_dataptr(mp,
 						 (char *)dep - (char *)hdr));
 		offset = (int)((char *)(tagp + 1) - (char *)hdr);
diff --git a/libxfs/xfs_dir2_data.c b/libxfs/xfs_dir2_data.c
index dc9df4d..ada4d1d 100644
--- a/libxfs/xfs_dir2_data.c
+++ b/libxfs/xfs_dir2_data.c
@@ -46,7 +46,6 @@ __xfs_dir3_data_check(
 	xfs_mount_t		*mp;		/* filesystem mount point */
 	char			*p;		/* current data position */
 	int			stale;		/* count of stale leaves */
-	struct xfs_name		name;
 
 	mp = bp->b_target->bt_mount;
 	hdr = bp->b_addr;
@@ -142,9 +141,9 @@ __xfs_dir3_data_check(
 			addr = xfs_dir2_db_off_to_dataptr(mp, mp->m_dirdatablk,
 				(xfs_dir2_data_aoff_t)
 				((char *)dep - (char *)hdr));
-			name.name = dep->name;
-			name.len = dep->namelen;
-			hash = mp->m_dirnameops->hashname(&name);
+			hash = mp->m_dirnameops->hashname(dep->name,
+					dep->namelen,
+					0 /* version for later */);
 			for (i = 0; i < be32_to_cpu(btp->count); i++) {
 				if (be32_to_cpu(lep[i].address) == addr &&
 				    be32_to_cpu(lep[i].hashval) == hash)
diff --git a/repair/phase6.c b/repair/phase6.c
index f13069f..c18ef69 100644
--- a/repair/phase6.c
+++ b/repair/phase6.c
@@ -195,7 +195,8 @@ dir_hash_add(
 	dup = 0;
 
 	if (!junk) {
-		hash = mp->m_dirnameops->hashname(&xname);
+		hash = mp->m_dirnameops->hashname(name, namelen,
+				0 /* version for later */);
 		byhash = DIR_HASH_FUNC(hashtab, hash);
 
 		/*
-- 
1.7.12.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 25/35] libxfs: add a superblock feature bit to indicate UTF-8 support.
  2014-10-03 21:47 [RFC v3] Unicode/UTF-8 support for XFS Ben Myers
                   ` (23 preceding siblings ...)
  2014-10-03 22:11 ` [PATCH 24/35] libxfs: change interface of xfs_nameops.hashname Ben Myers
@ 2014-10-03 22:11 ` Ben Myers
  2014-10-03 22:12 ` [PATCH 26/35] libxfs: store utf8version in the superblock Ben Myers
                   ` (9 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Ben Myers @ 2014-10-03 22:11 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: olaf, xfs

From: Olaf Weber <olaf@sgi.com>

When UTF-8 support is enabled, the xfs_dir_ci_inode_operations must be
installed. Add xfs_sb_version_hasci(), which tests both the borgbit and
the utf8bit, and returns true if at least one of them is set. Replace
calls to xfs_sb_version_hasasciici() as needed.
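
For illustration only (not part of the patch): a reduced sketch of the two
feature tests. The bit values are the ones this patch adds; the struct is
a hypothetical stand-in in which the existing xfs_sb_version_hasmorebits()
and xfs_sb_version_hasasciici() checks are collapsed into plain booleans.

```c
#include <stdbool.h>
#include <stdint.h>

#define SB_VERSION_5		5
#define SB_VERSION2_UTF8BIT	0x00000020u	/* value from this patch */
#define SB_FEAT_INCOMPAT_UTF8	(1u << 1)	/* value from this patch */

/* Hypothetical, reduced superblock. */
struct sb_lite {
	unsigned int	version;
	bool		morebits;	/* stands in for hasmorebits() */
	bool		asciici;	/* stands in for hasasciici() */
	uint32_t	features2;
	uint32_t	features_incompat;
};

/* UTF-8 is on via the v5 incompat bit or the features2 bit. */
bool sb_hasutf8(const struct sb_lite *sb)
{
	return (sb->version == SB_VERSION_5 &&
		(sb->features_incompat & SB_FEAT_INCOMPAT_UTF8)) ||
	       (sb->morebits && (sb->features2 & SB_VERSION2_UTF8BIT));
}

/* True if either the ASCII-CI bit or the UTF-8 bit is set. */
bool sb_hasci(const struct sb_lite *sb)
{
	return sb->asciici || sb_hasutf8(sb);
}
```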

Signed-off-by: Olaf Weber <olaf@sgi.com>
---
 include/xfs_fs.h |  1 +
 include/xfs_sb.h | 24 +++++++++++++++++++++++-
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/include/xfs_fs.h b/include/xfs_fs.h
index 59c40fc..c260ec6 100644
--- a/include/xfs_fs.h
+++ b/include/xfs_fs.h
@@ -239,6 +239,7 @@ typedef struct xfs_fsop_resblks {
 #define XFS_FSOP_GEOM_FLAGS_V5SB	0x8000	/* version 5 superblock */
 #define XFS_FSOP_GEOM_FLAGS_FTYPE	0x10000	/* inode directory types */
 #define XFS_FSOP_GEOM_FLAGS_FINOBT	0x20000	/* free inode btree */
+#define XFS_FSOP_GEOM_FLAGS_UTF8	0x40000	/* utf8 filenames */
 
 
 /*
diff --git a/include/xfs_sb.h b/include/xfs_sb.h
index 950d1ea..c8563ce 100644
--- a/include/xfs_sb.h
+++ b/include/xfs_sb.h
@@ -82,6 +82,7 @@ struct xfs_trans;
 #define XFS_SB_VERSION2_RESERVED4BIT	0x00000004
 #define XFS_SB_VERSION2_ATTR2BIT	0x00000008	/* Inline attr rework */
 #define XFS_SB_VERSION2_PARENTBIT	0x00000010	/* parent pointers */
+#define XFS_SB_VERSION2_UTF8BIT		0x00000020      /* utf8 names */
 #define XFS_SB_VERSION2_PROJID32BIT	0x00000080	/* 32 bit project id */
 #define XFS_SB_VERSION2_CRCBIT		0x00000100	/* metadata CRCs */
 #define XFS_SB_VERSION2_FTYPE		0x00000200	/* inode type in dir */
@@ -89,6 +90,7 @@ struct xfs_trans;
 #define	XFS_SB_VERSION2_OKREALFBITS	\
 	(XFS_SB_VERSION2_LAZYSBCOUNTBIT	| \
 	 XFS_SB_VERSION2_ATTR2BIT	| \
+	 XFS_SB_VERSION2_UTF8BIT	| \
 	 XFS_SB_VERSION2_PROJID32BIT	| \
 	 XFS_SB_VERSION2_FTYPE)
 #define	XFS_SB_VERSION2_OKSASHFBITS	\
@@ -600,8 +602,10 @@ xfs_sb_has_ro_compat_feature(
 }
 
 #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
+#define XFS_SB_FEAT_INCOMPAT_UTF8	(1 << 1)	/* utf-8 name support */
 #define XFS_SB_FEAT_INCOMPAT_ALL \
-		(XFS_SB_FEAT_INCOMPAT_FTYPE)
+		(XFS_SB_FEAT_INCOMPAT_FTYPE | \
+		 XFS_SB_FEAT_INCOMPAT_UTF8)
 
 #define XFS_SB_FEAT_INCOMPAT_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_ALL
 static inline bool
@@ -649,6 +653,24 @@ static inline int xfs_sb_version_hasfinobt(xfs_sb_t *sbp)
 		(sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_FINOBT);
 }
 
+static inline int xfs_sb_version_hasutf8(xfs_sb_t *sbp)
+{
+	return (XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5 &&
+		xfs_sb_has_incompat_feature(sbp, XFS_SB_FEAT_INCOMPAT_UTF8)) ||
+		(xfs_sb_version_hasmorebits(sbp) &&
+		(sbp->sb_features2 & XFS_SB_VERSION2_UTF8BIT));
+}
+
+/*
+ * Special case: there are a number of places where we need to test
+ * both the borgbit and the utf8bit, and take the same action if
+ * either of those is set.
+ */
+static inline int xfs_sb_version_hasci(xfs_sb_t *sbp)
+{
+	return xfs_sb_version_hasasciici(sbp) || xfs_sb_version_hasutf8(sbp);
+}
+
 /*
  * end of superblock version macros
  */
-- 
1.7.12.4


* [PATCH 26/35] libxfs: store utf8version in the superblock
  2014-10-03 21:47 [RFC v3] Unicode/UTF-8 support for XFS Ben Myers
                   ` (24 preceding siblings ...)
  2014-10-03 22:11 ` [PATCH 25/35] libxfs: add a superblock feature bit to indicate UTF-8 support Ben Myers
@ 2014-10-03 22:12 ` Ben Myers
  2014-10-03 22:13 ` [PATCH 27/35] libxfs: add xfs_nameops for utf8 and utf8+casefold Ben Myers
                   ` (8 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Ben Myers @ 2014-10-03 22:12 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: olaf, xfs

From: Ben Myers <bpm@sgi.com>

The Unicode version a filesystem was created with must be stored so
that normalization remains stable over the lifetime of the
filesystem.  Convert sb_pad to sb_utf8version in the superblock.  Also
add a check at mount time for whether the Unicode normalization
module supports the version of Unicode that the filesystem
requires.  If it does not, the mount fails.
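
For illustration only (not part of the patch): a sketch of how the packed
version word is split into major, minor and revision, following the
arithmetic in xfs_utf8_version_ok() below. The shift values 16 and 8 are
assumptions here; the real UNICODE_MAJ_SHIFT and UNICODE_MIN_SHIFT come
from utf8norm.h, which is not shown in this message. unicode_age() and
format_unicode_version() are hypothetical helpers.

```c
#include <stdio.h>

/* Assumed values; the real definitions live in utf8norm.h. */
#define UNICODE_MAJ_SHIFT	16
#define UNICODE_MIN_SHIFT	8

/* Hypothetical helper: pack major.minor.revision into one word. */
unsigned int unicode_age(unsigned int maj, unsigned int min,
			 unsigned int rev)
{
	return (maj << UNICODE_MAJ_SHIFT) | (min << UNICODE_MIN_SHIFT) | rev;
}

/*
 * Unpack the version the same way xfs_utf8_version_ok() does and
 * format it, omitting a zero revision.  Returns snprintf()'s result.
 */
int format_unicode_version(char *buf, size_t len, unsigned int version)
{
	unsigned int major = version >> UNICODE_MAJ_SHIFT;
	unsigned int minor = (version & 0xff00) >> UNICODE_MIN_SHIFT;
	unsigned int revision = version & 0xff;

	if (revision)
		return snprintf(buf, len, "%u.%u.%u", major, minor, revision);
	return snprintf(buf, len, "%u.%u", major, minor);
}
```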

Signed-off-by: Ben Myers <bpm@sgi.com>
---
 include/xfs_sb.h   | 10 ++++----
 include/xfs_utf8.h | 24 +++++++++++++++++++
 libxfs/xfs_sb.c    |  4 ++--
 libxfs/xfs_utf8.c  | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 102 insertions(+), 6 deletions(-)
 create mode 100644 include/xfs_utf8.h
 create mode 100644 libxfs/xfs_utf8.c

diff --git a/include/xfs_sb.h b/include/xfs_sb.h
index c8563ce..a400cf1 100644
--- a/include/xfs_sb.h
+++ b/include/xfs_sb.h
@@ -176,7 +176,7 @@ typedef struct xfs_sb {
 	__uint32_t	sb_features_log_incompat;
 
 	__uint32_t	sb_crc;		/* superblock crc */
-	__uint32_t	sb_pad;
+	__uint32_t	sb_utf8version;	/* unicode version */
 
 	xfs_ino_t	sb_pquotino;	/* project quota inode */
 	xfs_lsn_t	sb_lsn;		/* last write sequence */
@@ -262,7 +262,7 @@ typedef struct xfs_dsb {
 	__be32		sb_features_log_incompat;
 
 	__le32		sb_crc;		/* superblock crc */
-	__be32		sb_pad;
+	__be32		sb_utf8version;	/* version of unicode */
 
 	__be64		sb_pquotino;	/* project quota inode */
 	__be64		sb_lsn;		/* last write sequence */
@@ -288,7 +288,7 @@ typedef enum {
 	XFS_SBS_LOGSECTLOG, XFS_SBS_LOGSECTSIZE, XFS_SBS_LOGSUNIT,
 	XFS_SBS_FEATURES2, XFS_SBS_BAD_FEATURES2, XFS_SBS_FEATURES_COMPAT,
 	XFS_SBS_FEATURES_RO_COMPAT, XFS_SBS_FEATURES_INCOMPAT,
-	XFS_SBS_FEATURES_LOG_INCOMPAT, XFS_SBS_CRC, XFS_SBS_PAD,
+	XFS_SBS_FEATURES_LOG_INCOMPAT, XFS_SBS_CRC, XFS_SBS_UTF8VERSION,
 	XFS_SBS_PQUOTINO, XFS_SBS_LSN,
 	XFS_SBS_FIELDCOUNT
 } xfs_sb_field_t;
@@ -320,6 +320,7 @@ typedef enum {
 #define XFS_SB_FEATURES_INCOMPAT XFS_SB_MVAL(FEATURES_INCOMPAT)
 #define XFS_SB_FEATURES_LOG_INCOMPAT XFS_SB_MVAL(FEATURES_LOG_INCOMPAT)
 #define XFS_SB_CRC		XFS_SB_MVAL(CRC)
+#define XFS_SB_UTF8VERSION	XFS_SB_MVAL(UTF8VERSION)
 #define XFS_SB_PQUOTINO		XFS_SB_MVAL(PQUOTINO)
 #define	XFS_SB_NUM_BITS		((int)XFS_SBS_FIELDCOUNT)
 #define	XFS_SB_ALL_BITS		((1LL << XFS_SB_NUM_BITS) - 1)
@@ -330,7 +331,8 @@ typedef enum {
 	 XFS_SB_ICOUNT | XFS_SB_IFREE | XFS_SB_FDBLOCKS | XFS_SB_FEATURES2 | \
 	 XFS_SB_BAD_FEATURES2 | XFS_SB_FEATURES_COMPAT | \
 	 XFS_SB_FEATURES_RO_COMPAT | XFS_SB_FEATURES_INCOMPAT | \
-	 XFS_SB_FEATURES_LOG_INCOMPAT | XFS_SB_PQUOTINO)
+	 XFS_SB_FEATURES_LOG_INCOMPAT | XFS_SB_UTF8VERSION | \
+	 XFS_SB_PQUOTINO)
 
 
 /*
diff --git a/include/xfs_utf8.h b/include/xfs_utf8.h
new file mode 100644
index 0000000..8a700de
--- /dev/null
+++ b/include/xfs_utf8.h
@@ -0,0 +1,24 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#ifndef XFS_UTF8_H
+#define XFS_UTF8_H
+
+extern int xfs_utf8_version_ok(struct xfs_mount *);
+
+#endif /* XFS_UTF8_H */
diff --git a/libxfs/xfs_sb.c b/libxfs/xfs_sb.c
index ea89367..642a2df 100644
--- a/libxfs/xfs_sb.c
+++ b/libxfs/xfs_sb.c
@@ -78,7 +78,7 @@ static const struct {
 	{ offsetof(xfs_sb_t, sb_features_incompat),	0 },
 	{ offsetof(xfs_sb_t, sb_features_log_incompat),	0 },
 	{ offsetof(xfs_sb_t, sb_crc),		0 },
-	{ offsetof(xfs_sb_t, sb_pad),		0 },
+	{ offsetof(xfs_sb_t, sb_utf8version),	0 },
 	{ offsetof(xfs_sb_t, sb_pquotino),	0 },
 	{ offsetof(xfs_sb_t, sb_lsn),		0 },
 	{ sizeof(xfs_sb_t),			0 }
@@ -410,7 +410,7 @@ xfs_sb_from_disk(
 				be32_to_cpu(from->sb_features_log_incompat);
 	/* crc is only used on disk, not in memory; just init to 0 here. */
 	to->sb_crc = 0;
-	to->sb_pad = 0;
+	to->sb_utf8version = be32_to_cpu(from->sb_utf8version);
 	to->sb_pquotino = be64_to_cpu(from->sb_pquotino);
 	to->sb_lsn = be64_to_cpu(from->sb_lsn);
 }
diff --git a/libxfs/xfs_utf8.c b/libxfs/xfs_utf8.c
new file mode 100644
index 0000000..ebfdaec
--- /dev/null
+++ b/libxfs/xfs_utf8.c
@@ -0,0 +1,70 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_types.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_inum.h"
+#include "xfs_trans.h"
+#include "xfs_trans_resv.h"
+#include "xfs_sb.h"
+#include "xfs_ag.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include "xfs_dir2.h"
+#include "xfs_mount.h"
+#include "xfs_da_btree.h"
+#include "xfs_format.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_alloc_btree.h"
+#include "xfs_dinode.h"
+#include "xfs_inode.h"
+#include "xfs_inode_item.h"
+#include "xfs_bmap.h"
+#include "xfs_error.h"
+#include "xfs_trace.h"
+#include "xfs_utf8.h"
+#include <utf8norm/utf8norm.h>
+
+int
+xfs_utf8_version_ok(
+	struct xfs_mount	*mp)
+{
+	int	major, minor, revision;
+
+	if (utf8version_is_supported(mp->m_sb.sb_utf8version))
+		return 1;
+
+	major = mp->m_sb.sb_utf8version >> UNICODE_MAJ_SHIFT;
+	minor = (mp->m_sb.sb_utf8version & 0xff00) >> UNICODE_MIN_SHIFT;
+	revision = mp->m_sb.sb_utf8version & 0xff;
+
+	if (revision) {
+		xfs_warn(mp,
+		"Unicode version %d.%d.%d not supported by utf8norm.ko",
+		major, minor, revision);
+	} else {
+		xfs_warn(mp,
+		"Unicode version %d.%d not supported by utf8norm.ko",
+		major, minor);
+	}
+
+	return 0;
+}
-- 
1.7.12.4


* [PATCH 27/35] libxfs: add xfs_nameops for utf8 and utf8+casefold.
  2014-10-03 21:47 [RFC v3] Unicode/UTF-8 support for XFS Ben Myers
                   ` (25 preceding siblings ...)
  2014-10-03 22:12 ` [PATCH 26/35] libxfs: store utf8version in the superblock Ben Myers
@ 2014-10-03 22:13 ` Ben Myers
  2014-10-03 22:13 ` [PATCH 28/35] libxfs: apply utf-8 normalization rules to user extended attribute names Ben Myers
                   ` (7 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Ben Myers @ 2014-10-03 22:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: olaf, xfs

From: Olaf Weber <olaf@sgi.com>

The xfs_utf8_nameops use the nfkdi normalization when comparing filenames,
and are installed if the utf8bit is set in the superblock.

The xfs_utf8_ci_nameops use the nfkdicf normalization when comparing
filenames, and are installed if both the utf8bit and the borgbit are set
in the superblock.

Normalized filenames are not stored on disk. Normalization will fail if a
filename is not valid UTF-8, in which case the filename is treated as an
opaque blob.
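
For illustration only (not part of the patch): a sketch of the blob
fallback in the compare path, shaped after xfs_utf8_compname(). The
cursor is a hypothetical stand-in for the utf8norm one: it "normalizes"
by ASCII lowercasing and reports any byte >= 0x80 (or NUL) as invalid,
which is not real UTF-8 validation. A NULL norm plays the role of
normlen == -1.

```c
#include <ctype.h>
#include <string.h>

enum dacmp { CMP_DIFFERENT, CMP_EXACT, CMP_MATCH };

/* Hypothetical stand-in for struct utf8cursor. */
struct cursor {
	const unsigned char	*p;
	int			left;
};

/* Next normalized byte, 0 at end of input, -1 on invalid input. */
int next_byte(struct cursor *c)
{
	if (c->left == 0)
		return 0;
	if (*c->p >= 0x80 || *c->p == 0)
		return -1;
	c->left--;
	return tolower(*c->p++);
}

/*
 * name/namelen: the lookup name as supplied.
 * norm: its NUL-terminated normalized form, or NULL if it was a blob.
 * ent/entlen: the directory entry name being compared against.
 */
enum dacmp
compname(const unsigned char *name, int namelen, const unsigned char *norm,
	 const unsigned char *ent, int entlen)
{
	struct cursor c = { ent, entlen };
	int b;

	/* Exact byte equality always wins, blob or not. */
	if (namelen == entlen && memcmp(name, ent, entlen) == 0)
		return CMP_EXACT;
	/* A blob lookup name can only ever match exactly. */
	if (!norm)
		return CMP_DIFFERENT;
	/* Normalize the entry on the fly and compare byte by byte. */
	while ((b = next_byte(&c)) > 0)
		if (b != *norm++)
			return CMP_DIFFERENT;
	if (b < 0 || *norm != '\0')
		return CMP_DIFFERENT;
	return CMP_MATCH;
}
```

Only the lookup name's normalized form is kept in memory; each directory
entry is normalized as it is walked, consistent with normalized names not
being stored on disk.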

Signed-off-by: Olaf Weber <olaf@sgi.com>

[v2: updated to pass through sb_utf8version. -bpm]
---
 libxfs/xfs_utf8.c | 208 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 208 insertions(+)

diff --git a/libxfs/xfs_utf8.c b/libxfs/xfs_utf8.c
index ebfdaec..3be1fbb 100644
--- a/libxfs/xfs_utf8.c
+++ b/libxfs/xfs_utf8.c
@@ -68,3 +68,211 @@ xfs_utf8_version_ok(
 
 	return 0;
 }
+
+/*
+ * xfs nameops using nfkdi
+ */
+
+static xfs_dahash_t
+xfs_utf8_hashname(
+	const unsigned char *name,
+	int len,
+	unsigned int sb_utf8version)
+{
+	utf8data_t	nfkdi;
+	struct utf8cursor u8c;
+	xfs_dahash_t	hash;
+	int		val;
+
+	nfkdi = utf8nfkdi(sb_utf8version);
+	hash = 0;
+	if (utf8ncursor(&u8c, nfkdi, name, len) < 0)
+		goto blob;
+	while ((val = utf8byte(&u8c)) > 0)
+		hash = val ^ rol32(hash, 7);
+	/* In case of error treat the name as a binary blob. */
+	if (val == 0)
+		return hash;
+blob:
+	return xfs_da_hashname(name, len);
+}
+
+static int
+xfs_utf8_normhash(
+	struct xfs_da_args *args)
+{
+	utf8data_t	nfkdi;
+	struct utf8cursor u8c;
+	unsigned char	*norm;
+	ssize_t		normlen;
+	int		c;
+	unsigned int	sb_utf8version =
+		args->dp->i_mount->m_sb.sb_utf8version;
+
+	nfkdi = utf8nfkdi(sb_utf8version);
+	/* Failure to normalize is treated as a blob. */
+	if ((normlen = utf8nlen(nfkdi, args->name, args->namelen)) < 0)
+		goto blob;
+	if (utf8ncursor(&u8c, nfkdi, args->name, args->namelen) < 0)
+		goto blob;
+	if (!(norm = kmem_alloc(normlen + 1, KM_NOFS|KM_MAYFAIL)))
+		return -ENOMEM;
+	args->norm = norm;
+	args->normlen = normlen;
+	while ((c = utf8byte(&u8c)) > 0)
+		*norm++ = c;
+	if (c == 0) {
+		*norm = '\0';
+		args->hashval = xfs_da_hashname(args->norm, args->normlen);
+		return 0;
+	}
+	kmem_free(args->norm);
+blob:
+	args->norm = NULL;
+	args->normlen = -1;
+	args->hashval = xfs_da_hashname(args->name, args->namelen);
+	return 0;
+}
+
+static enum xfs_dacmp
+xfs_utf8_compname(
+	struct xfs_da_args *args,
+	const unsigned char *name,
+	int		len)
+{
+	utf8data_t	nfkdi;
+	struct utf8cursor u8c;
+	const unsigned char *norm;
+	int		c;
+	unsigned int	sb_utf8version =
+		args->dp->i_mount->m_sb.sb_utf8version;
+
+	ASSERT(args->norm || args->normlen == -1);
+
+	/* Check for an exact match first. */
+	if (args->namelen == len && memcmp(args->name, name, len) == 0)
+		return XFS_CMP_EXACT;
+	/* xfs_utf8_normhash() set args->normlen to -1 for a blob */
+	if (args->normlen < 0)
+		return XFS_CMP_DIFFERENT;
+	nfkdi = utf8nfkdi(sb_utf8version);
+	if (utf8ncursor(&u8c, nfkdi, name, len) < 0)
+		return XFS_CMP_DIFFERENT;
+	norm = args->norm;
+	while ((c = utf8byte(&u8c)) > 0)
+		if (c != *norm++)
+			return XFS_CMP_DIFFERENT;
+	if (c < 0 || *norm != '\0')
+		return XFS_CMP_DIFFERENT;
+	return XFS_CMP_MATCH;
+}
+
+struct xfs_nameops xfs_utf8_nameops = {
+	.hashname = xfs_utf8_hashname,
+	.normhash = xfs_utf8_normhash,
+	.compname = xfs_utf8_compname,
+};
+
+/*
+ * xfs nameops using nfkdicf
+ */
+
+static xfs_dahash_t
+xfs_utf8_ci_hashname(
+	const unsigned char *name,
+	int len,
+	unsigned int sb_utf8version)
+{
+	utf8data_t	nfkdicf;
+	struct utf8cursor u8c;
+	xfs_dahash_t	hash;
+	int		val;
+
+	nfkdicf = utf8nfkdicf(sb_utf8version);
+	hash = 0;
+	if (utf8ncursor(&u8c, nfkdicf, name, len) < 0)
+		goto blob;
+	while ((val = utf8byte(&u8c)) > 0)
+		hash = val ^ rol32(hash, 7);
+	/* In case of error treat the name as a binary blob. */
+	if (val == 0)
+		return hash;
+blob:
+	return xfs_da_hashname(name, len);
+}
+
+static int
+xfs_utf8_ci_normhash(
+	struct xfs_da_args *args)
+{
+	utf8data_t	nfkdicf;
+	struct utf8cursor u8c;
+	unsigned char	*norm;
+	ssize_t		normlen;
+	int		c;
+	unsigned int	sb_utf8version =
+		args->dp->i_mount->m_sb.sb_utf8version;
+
+	nfkdicf = utf8nfkdicf(sb_utf8version);
+	/* Failure to normalize is treated as a blob. */
+	if ((normlen = utf8nlen(nfkdicf, args->name, args->namelen)) < 0)
+		goto blob;
+	if (utf8ncursor(&u8c, nfkdicf, args->name, args->namelen) < 0)
+		goto blob;
+	if (!(norm = kmem_alloc(normlen + 1, KM_NOFS|KM_MAYFAIL)))
+		return -ENOMEM;
+	args->norm = norm;
+	args->normlen = normlen;
+	while ((c = utf8byte(&u8c)) > 0)
+		*norm++ = c;
+	if (c == 0) {
+		*norm = '\0';
+		args->hashval = xfs_da_hashname(args->norm, args->normlen);
+		return 0;
+	}
+	kmem_free(args->norm);
+blob:
+	args->norm = NULL;
+	args->normlen = -1;
+	args->hashval = xfs_da_hashname(args->name, args->namelen);
+	return 0;
+}
+
+static enum xfs_dacmp
+xfs_utf8_ci_compname(
+	struct xfs_da_args *args,
+	const unsigned char *name,
+	int		len)
+{
+	utf8data_t	nfkdicf;
+	struct utf8cursor u8c;
+	const unsigned char *norm;
+	int		c;
+	unsigned int	sb_utf8version =
+		args->dp->i_mount->m_sb.sb_utf8version;
+
+	ASSERT(args->norm || args->normlen == -1);
+
+	/* Check for an exact match first. */
+	if (args->namelen == len && memcmp(args->name, name, len) == 0)
+		return XFS_CMP_EXACT;
+	/* xfs_utf8_ci_normhash() set args->normlen to -1 for a blob */
+	if (args->normlen < 0)
+		return XFS_CMP_DIFFERENT;
+	nfkdicf = utf8nfkdicf(sb_utf8version);
+	if (utf8ncursor(&u8c, nfkdicf, name, len) < 0)
+		return XFS_CMP_DIFFERENT;
+	norm = args->norm;
+	while ((c = utf8byte(&u8c)) > 0)
+		if (c != *norm++)
+			return XFS_CMP_DIFFERENT;
+	if (c < 0 || *norm != '\0')
+		return XFS_CMP_DIFFERENT;
+	return XFS_CMP_MATCH;
+}
+
+struct xfs_nameops xfs_utf8_ci_nameops = {
+	.hashname = xfs_utf8_ci_hashname,
+	.normhash = xfs_utf8_ci_normhash,
+	.compname = xfs_utf8_ci_compname,
+};
-- 
1.7.12.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

* [PATCH 28/35] libxfs: apply utf-8 normalization rules to user extended attribute names
  2014-10-03 21:47 [RFC v3] Unicode/UTF-8 support for XFS Ben Myers
                   ` (26 preceding siblings ...)
  2014-10-03 22:13 ` [PATCH 27/35] libxfs: add xfs_nameops for utf8 and utf8+casefold Ben Myers
@ 2014-10-03 22:13 ` Ben Myers
  2014-10-03 22:14 ` [PATCH 29/35] libxfs: rename XFS_IOC_FSGEOM to XFS_IOC_FSGEOM_V2 Ben Myers
                   ` (6 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Ben Myers @ 2014-10-03 22:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: olaf, xfs

From: Olaf Weber <olaf@sgi.com>

Apply the same rules for UTF-8 normalization to the names of user-defined
extended attributes. System attributes are excluded because they are not
user-visible in the first place, and the kernel is expected to know what
it is doing when naming them.

Signed-off-by: Olaf Weber <olaf@sgi.com>
---
 libxfs/xfs_attr.c      | 49 +++++++++++++++++++++++++++++++++++++++++--------
 libxfs/xfs_attr_leaf.c | 11 +++++++++--
 libxfs/xfs_utf8.c      |  6 ++++++
 3 files changed, 56 insertions(+), 10 deletions(-)

diff --git a/libxfs/xfs_attr.c b/libxfs/xfs_attr.c
index 17519d3..c30703b 100644
--- a/libxfs/xfs_attr.c
+++ b/libxfs/xfs_attr.c
@@ -88,8 +88,9 @@ xfs_attr_get_int(
 	int			*valuelenp,
 	int			flags)
 {
-	xfs_da_args_t   args;
-	int             error;
+	xfs_da_args_t   	args;
+	struct xfs_mount	*mp = ip->i_mount;
+	int             	error;
 
 	if (!xfs_inode_hasattr(ip))
 		return ENOATTR;
@@ -103,9 +104,12 @@ xfs_attr_get_int(
 	args.value = value;
 	args.valuelen = *valuelenp;
 	args.flags = flags;
-	args.hashval = xfs_da_hashname(args.name, args.namelen);
 	args.dp = ip;
 	args.whichfork = XFS_ATTR_FORK;
+	if (!xfs_sb_version_hasutf8(&mp->m_sb))
+		args.hashval = xfs_da_hashname(args.name, args.namelen);
+	else if ((error = mp->m_dirnameops->normhash(&args)) != 0)
+		return error;
 
 	/*
 	 * Decide on what work routines to call based on the inode size.
@@ -118,6 +122,9 @@ xfs_attr_get_int(
 		error = xfs_attr_node_get(&args);
 	}
 
+	if (args.norm)
+		kmem_free((void *)args.norm);
+
 	/*
 	 * Return the number of bytes in the value to the caller.
 	 */
@@ -239,12 +246,15 @@ xfs_attr_set_int(
 	args.value = value;
 	args.valuelen = valuelen;
 	args.flags = flags;
-	args.hashval = xfs_da_hashname(args.name, args.namelen);
 	args.dp = dp;
 	args.firstblock = &firstblock;
 	args.flist = &flist;
 	args.whichfork = XFS_ATTR_FORK;
 	args.op_flags = XFS_DA_OP_ADDNAME | XFS_DA_OP_OKNOENT;
+	if (!xfs_sb_version_hasutf8(&mp->m_sb))
+		args.hashval = xfs_da_hashname(args.name, args.namelen);
+	else if ((error = mp->m_dirnameops->normhash(&args)) != 0)
+		return error;
 
 	/* Size is now blocks for attribute data */
 	args.total = xfs_attr_calc_size(dp, name->len, valuelen, &local);
@@ -276,6 +286,8 @@ xfs_attr_set_int(
 	error = xfs_trans_reserve(args.trans, &tres, args.total, 0);
 	if (error) {
 		xfs_trans_cancel(args.trans, 0);
+		if (args.norm)
+			kmem_free((void *)args.norm);
 		return(error);
 	}
 	xfs_ilock(dp, XFS_ILOCK_EXCL);
@@ -286,6 +298,8 @@ xfs_attr_set_int(
 	if (error) {
 		xfs_iunlock(dp, XFS_ILOCK_EXCL);
 		xfs_trans_cancel(args.trans, XFS_TRANS_RELEASE_LOG_RES);
+		if (args.norm)
+			kmem_free((void *)args.norm);
 		return (error);
 	}
 
@@ -333,7 +347,8 @@ xfs_attr_set_int(
 			err2 = xfs_trans_commit(args.trans,
 						 XFS_TRANS_RELEASE_LOG_RES);
 			xfs_iunlock(dp, XFS_ILOCK_EXCL);
-
+			if (args.norm)
+				kmem_free((void *)args.norm);
 			return(error == 0 ? err2 : error);
 		}
 
@@ -398,6 +413,8 @@ xfs_attr_set_int(
 	xfs_trans_log_inode(args.trans, dp, XFS_ILOG_CORE);
 	error = xfs_trans_commit(args.trans, XFS_TRANS_RELEASE_LOG_RES);
 	xfs_iunlock(dp, XFS_ILOCK_EXCL);
+	if (args.norm)
+		kmem_free((void *)args.norm);
 
 	return(error);
 
@@ -406,6 +423,9 @@ out:
 		xfs_trans_cancel(args.trans,
 			XFS_TRANS_RELEASE_LOG_RES|XFS_TRANS_ABORT);
 	xfs_iunlock(dp, XFS_ILOCK_EXCL);
+	if (args.norm)
+		kmem_free((void *)args.norm);
+
 	return(error);
 }
 
@@ -452,12 +472,15 @@ xfs_attr_remove_int(xfs_inode_t *dp, struct xfs_name *name, int flags)
 	args.name = name->name;
 	args.namelen = name->len;
 	args.flags = flags;
-	args.hashval = xfs_da_hashname(args.name, args.namelen);
 	args.dp = dp;
 	args.firstblock = &firstblock;
 	args.flist = &flist;
 	args.total = 0;
 	args.whichfork = XFS_ATTR_FORK;
+	if (!xfs_sb_version_hasutf8(&mp->m_sb))
+		args.hashval = xfs_da_hashname(args.name, args.namelen);
+	else if ((error = mp->m_dirnameops->normhash(&args)) != 0)
+		return error;
 
 	/*
 	 * we have no control over the attribute names that userspace passes us
@@ -470,8 +493,11 @@ xfs_attr_remove_int(xfs_inode_t *dp, struct xfs_name *name, int flags)
 	 * Attach the dquots to the inode.
 	 */
 	error = xfs_qm_dqattach(dp, 0);
-	if (error)
-		return error;
+	if (error) {
+		if (args.norm)
+			kmem_free((void *)args.norm);
+		return error;
+	}
 
 	/*
 	 * Start our first transaction of the day.
@@ -497,6 +523,8 @@ xfs_attr_remove_int(xfs_inode_t *dp, struct xfs_name *name, int flags)
 				  XFS_ATTRRM_SPACE_RES(mp), 0);
 	if (error) {
 		xfs_trans_cancel(args.trans, 0);
+		if (args.norm)
+			kmem_free((void *)args.norm);
 		return(error);
 	}
 
@@ -546,6 +574,8 @@ xfs_attr_remove_int(xfs_inode_t *dp, struct xfs_name *name, int flags)
 	xfs_trans_log_inode(args.trans, dp, XFS_ILOG_CORE);
 	error = xfs_trans_commit(args.trans, XFS_TRANS_RELEASE_LOG_RES);
 	xfs_iunlock(dp, XFS_ILOCK_EXCL);
+	if (args.norm)
+		kmem_free((void *)args.norm);
 
 	return(error);
 
@@ -554,6 +584,9 @@ out:
 		xfs_trans_cancel(args.trans,
 			XFS_TRANS_RELEASE_LOG_RES|XFS_TRANS_ABORT);
 	xfs_iunlock(dp, XFS_ILOCK_EXCL);
+	if (args.norm)
+		kmem_free((void *)args.norm);
+
 	return(error);
 }
 
diff --git a/libxfs/xfs_attr_leaf.c b/libxfs/xfs_attr_leaf.c
index f7f02ae..052a6a1 100644
--- a/libxfs/xfs_attr_leaf.c
+++ b/libxfs/xfs_attr_leaf.c
@@ -634,6 +634,7 @@ int
 xfs_attr_shortform_to_leaf(xfs_da_args_t *args)
 {
 	xfs_inode_t *dp;
+	struct xfs_mount *mp;
 	xfs_attr_shortform_t *sf;
 	xfs_attr_sf_entry_t *sfe;
 	xfs_da_args_t nargs;
@@ -646,6 +647,7 @@ xfs_attr_shortform_to_leaf(xfs_da_args_t *args)
 	trace_xfs_attr_sf_to_leaf(args);
 
 	dp = args->dp;
+	mp = dp->i_mount;
 	ifp = dp->i_afp;
 	sf = (xfs_attr_shortform_t *)ifp->if_u1.if_data;
 	size = be16_to_cpu(sf->hdr.totsize);
@@ -698,13 +700,18 @@ xfs_attr_shortform_to_leaf(xfs_da_args_t *args)
 		nargs.namelen = sfe->namelen;
 		nargs.value = &sfe->nameval[nargs.namelen];
 		nargs.valuelen = sfe->valuelen;
-		nargs.hashval = xfs_da_hashname(sfe->nameval,
-						sfe->namelen);
 		nargs.flags = XFS_ATTR_NSP_ONDISK_TO_ARGS(sfe->flags);
+		if (!xfs_sb_version_hasutf8(&mp->m_sb))
+			nargs.hashval = xfs_da_hashname(sfe->nameval,
+							sfe->namelen);
+		else if ((error = mp->m_dirnameops->normhash(&nargs)) != 0)
+			goto out;
 		error = xfs_attr3_leaf_lookup_int(bp, &nargs); /* set a->index */
 		ASSERT(error == ENOATTR);
 		error = xfs_attr3_leaf_add(bp, &nargs);
 		ASSERT(error != ENOSPC);
+		if (nargs.norm)
+			kmem_free((void *)nargs.norm);
 		if (error)
 			goto out;
 		sfe = XFS_ATTR_SF_NEXTENTRY(sfe);
diff --git a/libxfs/xfs_utf8.c b/libxfs/xfs_utf8.c
index 3be1fbb..f7042ef 100644
--- a/libxfs/xfs_utf8.c
+++ b/libxfs/xfs_utf8.c
@@ -109,6 +109,9 @@ xfs_utf8_normhash(
 	unsigned int	sb_utf8version =
 		args->dp->i_mount->m_sb.sb_utf8version;
 
+	/* Don't normalize system attribute names. */
+	if (args->flags & (ATTR_ROOT|ATTR_SECURE))
+		goto blob;
 	nfkdi = utf8nfkdi(sb_utf8version);
 	/* Failure to normalize is treated as a blob. */
 	if ((normlen = utf8nlen(nfkdi, args->name, args->namelen)) < 0)
@@ -213,6 +216,9 @@ xfs_utf8_ci_normhash(
 	unsigned int	sb_utf8version =
 		args->dp->i_mount->m_sb.sb_utf8version;
 
+	/* Don't normalize system attribute names. */
+	if (args->flags & (ATTR_ROOT|ATTR_SECURE))
+		goto blob;
 	nfkdicf = utf8nfkdicf(sb_utf8version);
 	/* Failure to normalize is treated as a blob. */
 	if ((normlen = utf8nlen(nfkdicf, args->name, args->namelen)) < 0)
-- 
1.7.12.4

* [PATCH 29/35] libxfs: rename XFS_IOC_FSGEOM to XFS_IOC_FSGEOM_V2
  2014-10-03 21:47 [RFC v3] Unicode/UTF-8 support for XFS Ben Myers
                   ` (27 preceding siblings ...)
  2014-10-03 22:13 ` [PATCH 28/35] libxfs: apply utf-8 normalization rules to user extended attribute names Ben Myers
@ 2014-10-03 22:14 ` Ben Myers
  2014-10-03 22:14 ` [PATCH 30/35] libxfs: add versioned fsgeom ioctl with utf8version field Ben Myers
                   ` (5 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Ben Myers @ 2014-10-03 22:14 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: olaf, xfs

From: Ben Myers <bpm@sgi.com>

We'll be creating a new versioned XFS_IOC_FSGEOMETRY ioctl and structure,
so rename the current revision to _V2.

Signed-off-by: Ben Myers <bpm@sgi.com>
---
 growfs/xfs_growfs.c |  8 ++++----
 include/xfs_fs.h    |  8 ++++----
 io/bmap.c           |  2 +-
 io/init.c           |  2 +-
 io/io.h             |  6 +++---
 io/open.c           | 10 +++++-----
 quota/free.c        |  2 +-
 7 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/growfs/xfs_growfs.c b/growfs/xfs_growfs.c
index 8e611b6..c3df0c0 100644
--- a/growfs/xfs_growfs.c
+++ b/growfs/xfs_growfs.c
@@ -44,7 +44,7 @@ Options:\n\
 
 void
 report_info(
-	xfs_fsop_geom_t	geo,
+	xfs_fsop_geom_v2_t	geo,
 	char		*mntpoint,
 	int		isint,
 	char		*logname,
@@ -101,7 +101,7 @@ main(int argc, char **argv)
 	int			error;	/* we have hit an error */
 	long			esize;	/* new rt extent size */
 	int			ffd;	/* mount point file descriptor */
-	xfs_fsop_geom_t		geo;	/* current fs geometry */
+	xfs_fsop_geom_v2_t	geo;	/* current fs geometry */
 	int			iflag;	/* -i flag */
 	int			isint;	/* log is currently internal */
 	int			lflag;	/* -l flag */
@@ -109,7 +109,7 @@ main(int argc, char **argv)
 	int			maxpct;	/* -m flag value */
 	int			mflag;	/* -m flag */
 	int			nflag;	/* -n flag */
-	xfs_fsop_geom_t		ngeo;	/* new fs geometry */
+	xfs_fsop_geom_v2_t	ngeo;	/* new fs geometry */
 	int			rflag;	/* -r flag */
 	long long		rsize;	/* new rt size in fs blocks */
 	int			ci;	/* ASCII case-insensitive fs */
@@ -220,7 +220,7 @@ main(int argc, char **argv)
 	}
 
 	/* get the current filesystem size & geometry */
-	if (xfsctl(fname, ffd, XFS_IOC_FSGEOMETRY, &geo) < 0) {
+	if (xfsctl(fname, ffd, XFS_IOC_FSGEOMETRY_V2, &geo) < 0) {
 		/*
 		 * OK, new xfsctl barfed - back off and try earlier version
 		 * as we're probably running an older kernel version.
diff --git a/include/xfs_fs.h b/include/xfs_fs.h
index c260ec6..38faf5d 100644
--- a/include/xfs_fs.h
+++ b/include/xfs_fs.h
@@ -180,9 +180,9 @@ typedef struct xfs_fsop_geom_v1 {
 } xfs_fsop_geom_v1_t;
 
 /*
- * Output for XFS_IOC_FSGEOMETRY
+ * Output for XFS_IOC_FSGEOMETRY_V2
  */
-typedef struct xfs_fsop_geom {
+typedef struct xfs_fsop_geom_v2 {
 	__u32		blocksize;	/* filesystem (data) block size */
 	__u32		rtextsize;	/* realtime extent size		*/
 	__u32		agblocks;	/* fsblocks in an AG		*/
@@ -204,7 +204,7 @@ typedef struct xfs_fsop_geom {
 	__u32		rtsectsize;	/* realtime sector size, bytes	*/
 	__u32		dirblocksize;	/* directory block size, bytes	*/
 	__u32		logsunit;	/* log stripe unit, bytes */
-} xfs_fsop_geom_t;
+} xfs_fsop_geom_v2_t;
 
 /* Output for XFS_FS_COUNTS */
 typedef struct xfs_fsop_counts {
@@ -553,7 +553,7 @@ typedef struct xfs_swapext
 #define XFS_IOC_FSSETDM_BY_HANDLE    _IOW ('X', 121, struct xfs_fsop_setdm_handlereq)
 #define XFS_IOC_ATTRLIST_BY_HANDLE   _IOW ('X', 122, struct xfs_fsop_attrlist_handlereq)
 #define XFS_IOC_ATTRMULTI_BY_HANDLE  _IOW ('X', 123, struct xfs_fsop_attrmulti_handlereq)
-#define XFS_IOC_FSGEOMETRY	     _IOR ('X', 124, struct xfs_fsop_geom)
+#define XFS_IOC_FSGEOMETRY_V2	     _IOR ('X', 124, struct xfs_fsop_geom_v2)
 #define XFS_IOC_GOINGDOWN	     _IOR ('X', 125, __uint32_t)
 /*	XFS_IOC_GETFSUUID ---------- deprecated 140	 */
 
diff --git a/io/bmap.c b/io/bmap.c
index a78cbb1..614eba1 100644
--- a/io/bmap.c
+++ b/io/bmap.c
@@ -70,7 +70,7 @@ bmap_f(
 {
 	struct fsxattr		fsx;
 	struct getbmapx		*map;
-	struct xfs_fsop_geom	fsgeo;
+	struct xfs_fsop_geom_v2	fsgeo;
 	int			map_size;
 	int			loop = 0;
 	int			flg = 0;
diff --git a/io/init.c b/io/init.c
index bfc35bf..9622a6c 100644
--- a/io/init.c
+++ b/io/init.c
@@ -127,7 +127,7 @@ init(
 	int		c, flags = 0;
 	char		*sp;
 	mode_t		mode = 0600;
-	xfs_fsop_geom_t	geometry = { 0 };
+	xfs_fsop_geom_v2_t	geometry = { 0 };
 
 	progname = basename(argv[0]);
 	setlocale(LC_ALL, "");
diff --git a/io/io.h b/io/io.h
index 1b3bca1..1837fe4 100644
--- a/io/io.h
+++ b/io/io.h
@@ -44,7 +44,7 @@ typedef struct fileio {
 	int		fd;		/* open file descriptor */
 	int		flags;		/* flags describing file state */
 	char		*name;		/* file name at time of open */
-	xfs_fsop_geom_t	geom;		/* XFS filesystem geometry */
+	xfs_fsop_geom_v2_t	geom;	/* XFS filesystem geometry */
 } fileio_t;
 
 extern fileio_t		*filetable;	/* open file table */
@@ -74,8 +74,8 @@ extern void *check_mapping_range(mmap_region_t *, off64_t, size_t, int);
  */
 
 extern off64_t		filesize(void);
-extern int		openfile(char *, xfs_fsop_geom_t *, int, mode_t);
-extern int		addfile(char *, int , xfs_fsop_geom_t *, int);
+extern int		openfile(char *, xfs_fsop_geom_v2_t *, int, mode_t);
+extern int		addfile(char *, int , xfs_fsop_geom_v2_t *, int);
 extern void		printxattr(uint, int, int, const char *, int, int);
 
 extern unsigned int	recurse_all;
diff --git a/io/open.c b/io/open.c
index c106fa7..81c19c9 100644
--- a/io/open.c
+++ b/io/open.c
@@ -140,7 +140,7 @@ stat_f(
 int
 openfile(
 	char		*path,
-	xfs_fsop_geom_t	*geom,
+	xfs_fsop_geom_v2_t	*geom,
 	int		flags,
 	mode_t		mode)
 {
@@ -185,7 +185,7 @@ openfile(
 	if (!geom || !platform_test_xfs_fd(fd))
 		return fd;
 
-	if (xfsctl(path, fd, XFS_IOC_FSGEOMETRY, geom) < 0) {
+	if (xfsctl(path, fd, XFS_IOC_FSGEOMETRY_V2, geom) < 0) {
 		perror("XFS_IOC_FSGEOMETRY");
 		close(fd);
 		return -1;
@@ -215,7 +215,7 @@ int
 addfile(
 	char		*name,
 	int		fd,
-	xfs_fsop_geom_t	*geometry,
+	xfs_fsop_geom_v2_t	*geometry,
 	int		flags)
 {
 	char		*filename;
@@ -284,7 +284,7 @@ open_f(
 	int		c, fd, flags = 0;
 	char		*sp;
 	mode_t		mode = 0600;
-	xfs_fsop_geom_t	geometry = { 0 };
+	xfs_fsop_geom_v2_t	geometry = { 0 };
 
 	if (argc == 1) {
 		if (file)
@@ -701,7 +701,7 @@ statfs_f(
 	char			**argv)
 {
 	struct xfs_fsop_counts	fscounts;
-	struct xfs_fsop_geom	fsgeo;
+	struct xfs_fsop_geom_v2	fsgeo;
 	struct statfs		st;
 
 	printf(_("fd.path = \"%s\"\n"), file->name);
diff --git a/quota/free.c b/quota/free.c
index 79b52e9..b2e325b 100644
--- a/quota/free.c
+++ b/quota/free.c
@@ -59,7 +59,7 @@ mount_free_space_data(
 	__uint64_t		*rfree)
 {
 	struct xfs_fsop_counts	fscounts;
-	struct xfs_fsop_geom	fsgeo;
+	struct xfs_fsop_geom_v2	fsgeo;
 	struct statfs		st;
 	__uint64_t		logsize, count, free;
 	int			fd;
-- 
1.7.12.4

* [PATCH 30/35] libxfs: add versioned fsgeom ioctl with utf8version field
  2014-10-03 21:47 [RFC v3] Unicode/UTF-8 support for XFS Ben Myers
                   ` (28 preceding siblings ...)
  2014-10-03 22:14 ` [PATCH 29/35] libxfs: rename XFS_IOC_FSGEOM to XFS_IOC_FSGEOM_V2 Ben Myers
@ 2014-10-03 22:14 ` Ben Myers
  2014-10-03 22:15 ` [PATCH 31/35] xfsprogs: add utf8 support to growfs Ben Myers
                   ` (4 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Ben Myers @ 2014-10-03 22:14 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: olaf, xfs

From: Ben Myers <bpm@sgi.com>

This adds a utf8version field to the xfs_fsop_geom structure.  An
important characteristic of this version of the ioctl is that
fsgeo.version needs to be set by the caller to specify which version of
the structure to return.
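
To make the calling convention concrete, here is a minimal sketch of how a
caller might use a versioned geometry call and fall back to the older one.
It does not touch a real filesystem: struct toy_geom, the toy_ioctl_fn
callbacks and the mock "kernels" are all hypothetical stand-ins for
struct xfs_fsop_geom and xfsctl(), and the 0x070000 value assumes a
major/minor/revision packing of 16/8/0 bits.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_GEOM_VERSION5	5	/* mirrors XFS_FSOP_GEOM_VERSION5 */

/* Cut-down, hypothetical stand-in for struct xfs_fsop_geom. */
struct toy_geom {
	int32_t		version;	/* in: requested structure version */
	uint32_t	utf8version;	/* out: only valid for version >= 5 */
};

/* Hypothetical ioctl callback: 0 on success, -1 on failure. */
typedef int (*toy_ioctl_fn)(struct toy_geom *geo);

/*
 * Zero the structure, request version 5, and fall back to the older
 * call if the new one is rejected.  *oldioctl is set to 1 when the
 * fallback was used, in which case utf8version carries no information.
 */
static int toy_get_geometry(toy_ioctl_fn new_ioctl, toy_ioctl_fn old_ioctl,
			    struct toy_geom *geo, int *oldioctl)
{
	assert(geo && oldioctl);
	memset(geo, 0, sizeof(*geo));
	geo->version = TOY_GEOM_VERSION5;
	*oldioctl = 0;
	if (new_ioctl(geo) == 0)
		return 0;
	*oldioctl = 1;
	return old_ioctl(geo);
}

/* Mock of a kernel that understands the versioned call. */
static int toy_new_kernel(struct toy_geom *geo)
{
	if (geo->version != TOY_GEOM_VERSION5)
		return -1;
	geo->utf8version = 0x070000;	/* 7.0.0 under the assumed packing */
	return 0;
}

/* Mock of a kernel that does not know the new ioctl number. */
static int toy_unsupported(struct toy_geom *geo)
{
	(void)geo;
	return -1;
}

/* Mock of the older call, which has no utf8version field to fill in. */
static int toy_old_kernel_v2(struct toy_geom *geo)
{
	geo->version = 0;
	return 0;
}
```

Zeroing the structure before the call is what lets the caller tell "field
not filled in" apart from stale stack contents after a fallback.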

Signed-off-by: Ben Myers <bpm@sgi.com>
---
 include/xfs_fs.h | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/include/xfs_fs.h b/include/xfs_fs.h
index 38faf5d..f920498 100644
--- a/include/xfs_fs.h
+++ b/include/xfs_fs.h
@@ -206,6 +206,34 @@ typedef struct xfs_fsop_geom_v2 {
 	__u32		logsunit;	/* log stripe unit, bytes */
 } xfs_fsop_geom_v2_t;
 
+/*
+ * Output for XFS_IOC_FSGEOMETRY
+ */
+typedef struct xfs_fsop_geom {
+	__u32		blocksize;	/* filesystem (data) block size */
+	__u32		rtextsize;	/* realtime extent size		*/
+	__u32		agblocks;	/* fsblocks in an AG		*/
+	__u32		agcount;	/* number of allocation groups	*/
+	__u32		logblocks;	/* fsblocks in the log		*/
+	__u32		sectsize;	/* (data) sector size, bytes	*/
+	__u32		inodesize;	/* inode size in bytes		*/
+	__u32		imaxpct;	/* max allowed inode space(%)	*/
+	__u64		datablocks;	/* fsblocks in data subvolume	*/
+	__u64		rtblocks;	/* fsblocks in realtime subvol	*/
+	__u64		rtextents;	/* rt extents in realtime subvol*/
+	__u64		logstart;	/* starting fsblock of the log	*/
+	unsigned char	uuid[16];	/* unique id of the filesystem	*/
+	__u32		sunit;		/* stripe unit, fsblocks	*/
+	__u32		swidth;		/* stripe width, fsblocks	*/
+	__s32		version;	/* structure version		*/
+	__u32		flags;		/* superblock version flags	*/
+	__u32		logsectsize;	/* log sector size, bytes	*/
+	__u32		rtsectsize;	/* realtime sector size, bytes	*/
+	__u32		dirblocksize;	/* directory block size, bytes	*/
+	__u32		logsunit;	/* log stripe unit, bytes */
+	__u32		utf8version;	/* Unicode version		*/
+} xfs_fsop_geom_t;
+
 /* Output for XFS_FS_COUNTS */
 typedef struct xfs_fsop_counts {
 	__u64	freedata;	/* free data section blocks */
@@ -221,6 +249,8 @@ typedef struct xfs_fsop_resblks {
 } xfs_fsop_resblks_t;
 
 #define XFS_FSOP_GEOM_VERSION	0
+/* skipped 1-4 to match existing new_version xfs_fs_geometry argument */
+#define XFS_FSOP_GEOM_VERSION5	5
 
 #define XFS_FSOP_GEOM_FLAGS_ATTR	0x0001	/* attributes in use	*/
 #define XFS_FSOP_GEOM_FLAGS_NLINK	0x0002	/* 32-bit nlink values	*/
@@ -555,6 +585,7 @@ typedef struct xfs_swapext
 #define XFS_IOC_ATTRMULTI_BY_HANDLE  _IOW ('X', 123, struct xfs_fsop_attrmulti_handlereq)
 #define XFS_IOC_FSGEOMETRY_V2	     _IOR ('X', 124, struct xfs_fsop_geom_v2)
 #define XFS_IOC_GOINGDOWN	     _IOR ('X', 125, __uint32_t)
+#define XFS_IOC_FSGEOMETRY	     _IOR ('X', 126, struct xfs_fsop_geom)
 /*	XFS_IOC_GETFSUUID ---------- deprecated 140	 */
 
 
-- 
1.7.12.4

* [PATCH 31/35] xfsprogs: add utf8 support to growfs
  2014-10-03 21:47 [RFC v3] Unicode/UTF-8 support for XFS Ben Myers
                   ` (29 preceding siblings ...)
  2014-10-03 22:14 ` [PATCH 30/35] libxfs: add versioned fsgeom ioctl with utf8version field Ben Myers
@ 2014-10-03 22:15 ` Ben Myers
  2014-10-03 22:15 ` [PATCH 32/35] xfsprogs: add utf8 support to mkfs.xfs Ben Myers
                   ` (3 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Ben Myers @ 2014-10-03 22:15 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: olaf, xfs

From: Mark Tinguely <tinguely@sgi.com>

Add utf-8 reporting to xfs_growfs and xfs_info: the geometry output gains a
utf8= field showing the Unicode version.
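
For reference, a small standalone sketch of how a packed version can be
turned into the printed string. The shift values 16 and 8 are assumptions
(the patch takes UNICODE_MAJ_SHIFT and UNICODE_MIN_SHIFT from
<xfs/utf8norm.h>), and toy_utf8_version_string() is a hypothetical helper,
not part of xfsprogs. It uses snprintf() with a caller-supplied length so
that unexpectedly large fields cannot overrun the buffer.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed shifts; the real values live in <xfs/utf8norm.h>. */
#define TOY_UNICODE_MAJ_SHIFT	16
#define TOY_UNICODE_MIN_SHIFT	8

/*
 * Format a packed Unicode version as "M", "M.m" or "M.m.r", dropping
 * trailing zero components.  Returns the snprintf() result.
 */
static int toy_utf8_version_string(uint32_t v, char *buf, size_t len)
{
	unsigned int major = v >> TOY_UNICODE_MAJ_SHIFT;
	unsigned int minor = (v & 0xff00) >> TOY_UNICODE_MIN_SHIFT;
	unsigned int revision = v & 0xff;

	assert(buf && len > 0);
	if (!revision && !minor)
		return snprintf(buf, len, "%u", major);
	if (!revision)
		return snprintf(buf, len, "%u.%u", major, minor);
	return snprintf(buf, len, "%u.%u.%u", major, minor, revision);
}
```

A 16-byte buffer is enough for any 32-bit input under this packing, since
the longest possible result is "65535.255.255".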

Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

[v2: use versioned fsgeom ioctl. -bpm]
---
 growfs/xfs_growfs.c | 85 ++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 61 insertions(+), 24 deletions(-)

diff --git a/growfs/xfs_growfs.c b/growfs/xfs_growfs.c
index c3df0c0..5e9d575 100644
--- a/growfs/xfs_growfs.c
+++ b/growfs/xfs_growfs.c
@@ -18,6 +18,7 @@
 
 #include <xfs/libxfs.h>
 #include <xfs/path.h>
+#include <xfs/utf8norm.h>
 
 static void
 usage(void)
@@ -44,7 +45,8 @@ Options:\n\
 
 void
 report_info(
-	xfs_fsop_geom_v2_t	geo,
+	xfs_fsop_geom_t	geo,
+	int		oldioctl,
 	char		*mntpoint,
 	int		isint,
 	char		*logname,
@@ -57,8 +59,31 @@ report_info(
 	int		crcs_enabled,
 	int		cimode,
 	int		ftype_enabled,
-	int		finobt_enabled)
+	int		finobt_enabled,
+	int		utf8)
 {
+	char		utf8_version_string[10];
+
+	/* XXX Can we assume that geo.version has always been zeroed by
+	 * the kernel so it is always meaningful? */
+	if (!oldioctl && geo.version >= XFS_FSOP_GEOM_VERSION5 && utf8) {
+		int	major, minor, revision;
+
+		major = geo.utf8version >> UNICODE_MAJ_SHIFT;
+		minor = (geo.utf8version & 0xff00) >> UNICODE_MIN_SHIFT;
+		revision = geo.utf8version & 0xff;
+
+		if (!revision && !minor)
+			sprintf(utf8_version_string, "%d", major);
+		else if (!revision)
+			sprintf(utf8_version_string, "%d.%d", major, minor);
+		else
+			sprintf(utf8_version_string, "%d.%d.%d",
+					major, minor, revision);
+	} else {
+		sprintf(utf8_version_string, "0");
+	}
+
 	printf(_(
 	    "meta-data=%-22s isize=%-6u agcount=%u, agsize=%u blks\n"
 	    "         =%-22s sectsz=%-5u attr=%u, projid32bit=%u\n"
@@ -66,6 +91,7 @@ report_info(
 	    "data     =%-22s bsize=%-6u blocks=%llu, imaxpct=%u\n"
 	    "         =%-22s sunit=%-6u swidth=%u blks\n"
 	    "naming   =version %-14u bsize=%-6u ascii-ci=%d ftype=%d\n"
+	    "         =%-22s utf8=%s\n"
 	    "log      =%-22s bsize=%-6u blocks=%u, version=%u\n"
 	    "         =%-22s sectsz=%-5u sunit=%u blks, lazy-count=%u\n"
 	    "realtime =%-22s extsz=%-6u blocks=%llu, rtextents=%llu\n"),
@@ -76,7 +102,8 @@ report_info(
 		"", geo.blocksize, (unsigned long long)geo.datablocks,
 			geo.imaxpct,
 		"", geo.sunit, geo.swidth,
-  		dirversion, geo.dirblocksize, cimode, ftype_enabled,
+		dirversion, geo.dirblocksize, cimode, ftype_enabled,
+		"", utf8_version_string,
 		isint ? _("internal") : logname ? logname : _("external"),
 			geo.blocksize, geo.logblocks, logversion,
 		"", geo.logsectsize, geo.logsunit / geo.blocksize, lazycount,
@@ -101,7 +128,7 @@ main(int argc, char **argv)
 	int			error;	/* we have hit an error */
 	long			esize;	/* new rt extent size */
 	int			ffd;	/* mount point file descriptor */
-	xfs_fsop_geom_v2_t	geo;	/* current fs geometry */
+	xfs_fsop_geom_t		geo;	/* current fs geometry */
 	int			iflag;	/* -i flag */
 	int			isint;	/* log is currently internal */
 	int			lflag;	/* -l flag */
@@ -109,11 +136,12 @@ main(int argc, char **argv)
 	int			maxpct;	/* -m flag value */
 	int			mflag;	/* -m flag */
 	int			nflag;	/* -n flag */
-	xfs_fsop_geom_v2_t	ngeo;	/* new fs geometry */
+	xfs_fsop_geom_t		ngeo;	/* new fs geometry */
 	int			rflag;	/* -r flag */
 	long long		rsize;	/* new rt size in fs blocks */
 	int			ci;	/* ASCII case-insensitive fs */
 	int			lazycount; /* lazy superblock counters */
+	int			utf8;	/* Unicode chars supported */
 	int			xflag;	/* -x flag */
 	char			*fname;	/* mount point name */
 	char			*datadev; /* data device name */
@@ -125,6 +153,7 @@ main(int argc, char **argv)
 	int			crcs_enabled;
 	int			ftype_enabled = 0;
 	int			finobt_enabled;	/* free inode btree */
+	int			oldioctl = 0;
 
 	progname = basename(argv[0]);
 	setlocale(LC_ALL, "");
@@ -219,21 +248,28 @@ main(int argc, char **argv)
 		exit(1);
 	}
 
-	/* get the current filesystem size & geometry */
-	if (xfsctl(fname, ffd, XFS_IOC_FSGEOMETRY_V2, &geo) < 0) {
-		/*
-		 * OK, new xfsctl barfed - back off and try earlier version
-		 * as we're probably running an older kernel version.
-		 * Only field added in the v2 geometry xfsctl is "logsunit"
-		 * so we'll zero that out for later display (as zero).
-		 */
-		geo.logsunit = 0;
-		if (xfsctl(fname, ffd, XFS_IOC_FSGEOMETRY_V1, &geo) < 0) {
-			fprintf(stderr, _(
-				"%s: cannot determine geometry of filesystem"
-				" mounted at %s: %s\n"),
-				progname, fname, strerror(errno));
-			exit(1);
+	memset(&geo, '\0', sizeof(geo));
+	geo.version = XFS_FSOP_GEOM_VERSION5;
+	if (xfsctl(fname, ffd, XFS_IOC_FSGEOMETRY, &geo) < 0) {
+	
+		oldioctl = 1;
+		/* get the current filesystem size & geometry */
+		if (xfsctl(fname, ffd, XFS_IOC_FSGEOMETRY_V2, &geo) < 0) {
+			/*
+			 * OK, new xfsctl barfed - back off and try
+			 * earlier version as we're probably running an
+			 * older kernel version.  Only field added in
+			 * the v2 geometry xfsctl is "logsunit" so we'll
+			 * zero that out for later display (as zero).
+			 */
+			geo.logsunit = 0;
+			if (xfsctl(fname, ffd, XFS_IOC_FSGEOMETRY_V1, &geo)
+					< 0) {
+				fprintf(stderr,
+	_("%s: cannot determine geometry of filesystem mounted at %s: %s\n"),
+					progname, fname, strerror(errno));
+				exit(1);
+			}
 		}
 	}
 	isint = geo.logstart > 0;
@@ -247,11 +283,12 @@ main(int argc, char **argv)
 	crcs_enabled = geo.flags & XFS_FSOP_GEOM_FLAGS_V5SB ? 1 : 0;
 	ftype_enabled = geo.flags & XFS_FSOP_GEOM_FLAGS_FTYPE ? 1 : 0;
 	finobt_enabled = geo.flags & XFS_FSOP_GEOM_FLAGS_FINOBT ? 1 : 0;
+	utf8 = geo.flags & XFS_FSOP_GEOM_FLAGS_UTF8 ? 1 : 0;
 	if (nflag) {
-		report_info(geo, datadev, isint, logdev, rtdev,
+		report_info(geo, oldioctl, datadev, isint, logdev, rtdev,
 				lazycount, dirversion, logversion,
 				attrversion, projid32bit, crcs_enabled, ci,
-				ftype_enabled, finobt_enabled);
+				ftype_enabled, finobt_enabled, utf8);
 		exit(0);
 	}
 
@@ -286,10 +323,10 @@ main(int argc, char **argv)
 		exit(1);
 	}
 
-	report_info(geo, datadev, isint, logdev, rtdev,
+	report_info(geo, oldioctl, datadev, isint, logdev, rtdev,
 			lazycount, dirversion, logversion,
 			attrversion, projid32bit, crcs_enabled, ci, ftype_enabled,
-			finobt_enabled);
+			finobt_enabled, utf8);
 
 	ddsize = xi.dsize;
 	dlsize = ( xi.logBBsize? xi.logBBsize :
-- 
1.7.12.4

* [PATCH 32/35] xfsprogs: add utf8 support to mkfs.xfs
  2014-10-03 21:47 [RFC v3] Unicode/UTF-8 support for XFS Ben Myers
                   ` (30 preceding siblings ...)
  2014-10-03 22:15 ` [PATCH 31/35] xfsprogs: add utf8 support to growfs Ben Myers
@ 2014-10-03 22:15 ` Ben Myers
  2014-10-03 22:16 ` [PATCH 33/35] xfsprogs: add utf8 support to xfs_repair Ben Myers
                   ` (2 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Ben Myers @ 2014-10-03 22:15 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: olaf, xfs

From: Mark Tinguely <tinguely@sgi.com>

Add a utf8 naming option to mkfs.xfs that sets the utf-8 feature bit.
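
The option value is a dotted version string. A minimal sketch of parsing
one into a packed number follows. TOY_UNICODE_AGE() and its 16/8/0 bit
packing are assumptions standing in for UNICODE_AGE() from
<xfs/utf8norm.h>, the range limits follow from that assumed packing, and
toy_parse_utf8_version() is a hypothetical helper. Unlike the patch it
also rejects trailing characters.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed packing; the patch uses UNICODE_AGE() from <xfs/utf8norm.h>. */
#define TOY_UNICODE_AGE(maj, min, rev) \
	(((unsigned int)(maj) << 16) | ((unsigned int)(min) << 8) | \
	 (unsigned int)(rev))

/*
 * Parse "major.minor.revision" into a packed version.  Returns 0 on
 * success, -1 if the string does not hold exactly three in-range fields.
 */
static int toy_parse_utf8_version(const char *value, unsigned int *packed)
{
	unsigned int maj, min, rev;
	char trailing;

	assert(packed);
	if (!value || *value == '\0')
		value = "7.0.0";	/* default used by the patch */
	/* The extra %c only converts if something follows the revision. */
	if (sscanf(value, "%u.%u.%u%c", &maj, &min, &rev, &trailing) != 3)
		return -1;
	if (maj > 0xffff || min > 0xff || rev > 0xff)
		return -1;
	*packed = TOY_UNICODE_AGE(maj, min, rev);
	return 0;
}
```

Using %u with unsigned int variables keeps the conversion specifiers and
the argument types in agreement.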

Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

[v2: add support for utf8version. -bpm]
---
 man/man8/mkfs.xfs.8 |  9 ++++-
 mkfs/xfs_mkfs.c     | 98 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 mkfs/xfs_mkfs.h     |  3 +-
 3 files changed, 105 insertions(+), 5 deletions(-)

diff --git a/man/man8/mkfs.xfs.8 b/man/man8/mkfs.xfs.8
index ad9ff3d..aa43cf5 100644
--- a/man/man8/mkfs.xfs.8
+++ b/man/man8/mkfs.xfs.8
@@ -558,7 +558,7 @@ any power of 2 size from the filesystem block size up to 65536.
 .IP
 The
 .B version=ci
-option enables ASCII only case-insensitive filename lookup and version
+option enables ASCII or UTF-8 case-insensitive filename lookup and version
 2 directories. Filenames are case-preserving, that is, the names
 are stored in directories using the case they were created with.
 .IP
@@ -582,6 +582,13 @@ When CRCs are enabled via
 the ftype functionality is always enabled. This feature can not be turned
 off for such filesystem configurations.
 .IP
+.TP
+.BI utf8[= value ]
+This is used to enable the UTF-8 character set support. The
+.I value
+is either 0 or 1, with 1 signifying that UTF-8 character support is to be
+enabled. If the value is omitted, 1 is assumed.
+.IP
 .RE
 .TP
 .BI \-p " protofile"
diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index c85258a..8cf5f9a 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -25,6 +25,7 @@
 #include <disk/volume.h>
 #endif
 #include "xfs_mkfs.h"
+#include <xfs/utf8norm.h>
 
 /*
  * Device topology information.
@@ -149,6 +150,8 @@ char	*nopts[] = {
 	"version",
 #define	N_FTYPE		3
 	"ftype",
+#define	N_UTF8		4
+	"utf8",
 	NULL,
 };
 
@@ -958,6 +961,11 @@ main(
 	int			nsflag;
 	int			nvflag;
 	int			nci;
+	unsigned int		utf8;
+	unsigned int		utf8_major;
+	unsigned int		utf8_minor;
+	unsigned int		utf8_revision;
+	char			utf8_version_string[10];
 	int			Nflag;
 	int			discard = 1;
 	char			*p;
@@ -984,6 +992,7 @@ main(
 	int			lazy_sb_counters;
 	int			crcs_enabled;
 	int			finobt;
+	int			ret;
 
 	progname = basename(argv[0]);
 	setlocale(LC_ALL, "");
@@ -1004,6 +1013,7 @@ main(
 	logagno = logblocks = rtblocks = rtextblocks = 0;
 	Nflag = nlflag = nsflag = nvflag = nci = 0;
 	nftype = dirftype = 0;		/* inode type information in the dir */
+	utf8 = 0;			/* utf-8 support */
 	dirblocklog = dirblocksize = 0;
 	dirversion = XFS_DFL_DIR_VERSION;
 	qflag = 0;
@@ -1565,7 +1575,8 @@ _("cannot specify both crc and ftype\n"));
 					if (nvflag)
 						respec('n', nopts, N_VERSION);
 					if (!strcasecmp(value, "ci")) {
-						nci = 1; /* ASCII CI mode */
+						/* ASCII or UTF-8 CI mode */
+						nci = 1;
 					} else {
 						dirversion = atoi(value);
 						if (dirversion != 2)
@@ -1587,6 +1598,62 @@ _("cannot specify both crc and ftype\n"));
 					}
 					nftype = 1;
 					break;
+				case N_UTF8:
+					if (!value || *value == '\0')
+						value = "7.0.0";
+					ret = sscanf(value, "%d.%d.%d",
+							&utf8_major,
+							&utf8_minor,
+							&utf8_revision);
+					if (ret == 3) {
+						utf8 = UNICODE_AGE(
+							utf8_major,
+							utf8_minor,
+							utf8_revision);
+						if (!utf8version_is_supported(
+									utf8)) {
+							fprintf(stderr,
+_("utf8 version %d.%d.%d not supported\n"),
+							utf8_major,
+							utf8_minor,
+							utf8_revision);
+							usage();
+						}
+						break;
+					}
+					ret = sscanf(value, "%d.%d",
+							&utf8_major,
+							&utf8_minor);
+					if (ret == 2) {
+						utf8 = UNICODE_AGE(
+							utf8_major,
+							utf8_minor,
+							0);
+						if (!utf8version_is_supported(
+									utf8)) {
+							fprintf(stderr,
+_("utf8 version %d.%d not supported\n"),
+							utf8_major,
+							utf8_minor);
+							usage();
+						}
+						break;
+					}
+					ret = sscanf(value, "%d", &utf8_major);
+					if (ret == 1) {
+						utf8 = UNICODE_AGE(
+							utf8_major,
+							0, 0);
+						if (!utf8version_is_supported(
+									utf8)) {
+							fprintf(stderr,
+_("utf8 version %d not supported\n"),
+							utf8_major);
+							usage();
+						}
+						break;
+					}
+					/* fallthrough */
 				default:
 					unknown('n', value);
 				}
@@ -2460,7 +2527,8 @@ _("size %s specified for log subvolume is too large, maximum is %lld blocks\n"),
 	 */
 	sbp->sb_features2 = XFS_SB_VERSION2_MKFS(crcs_enabled, lazy_sb_counters,
 					attrversion == 2, !projid16bit, 0,
-					(!crcs_enabled && dirftype));
+					(!crcs_enabled && dirftype),
+					(!crcs_enabled && utf8));
 	sbp->sb_versionnum = XFS_SB_VERSION_MKFS(crcs_enabled, iaflag,
 					dsunit != 0,
 					logversion == 2, attrversion == 1,
@@ -2534,6 +2602,26 @@ _("size %s specified for log subvolume is too large, maximum is %lld blocks\n"),
 	if (crcs_enabled) {
 		sbp->sb_features_incompat = XFS_SB_FEAT_INCOMPAT_FTYPE;
 		dirftype = 1;
+		/* turn on the utf-8 support */
+		if (utf8)
+			sbp->sb_features_incompat |= XFS_SB_FEAT_INCOMPAT_UTF8;
+	}
+	if (utf8) {
+		int	major, minor, revision;
+
+		major = utf8 >> UNICODE_MAJ_SHIFT;
+		minor = (utf8 & 0xff00) >> UNICODE_MIN_SHIFT;
+		revision = utf8 & 0xff;
+
+		if (!revision && !minor)
+			sprintf(utf8_version_string, "%d", major);
+		else if (!revision)
+			sprintf(utf8_version_string, "%d.%d", major, minor);
+		else
+			sprintf(utf8_version_string, "%d.%d.%d",
+						major, minor, revision);
+	} else {
+		strcpy(utf8_version_string, "0");
 	}
 
 	if (!qflag || Nflag) {
@@ -2544,6 +2632,7 @@ _("size %s specified for log subvolume is too large, maximum is %lld blocks\n"),
 		   "data     =%-22s bsize=%-6u blocks=%llu, imaxpct=%u\n"
 		   "         =%-22s sunit=%-6u swidth=%u blks\n"
 		   "naming   =version %-14u bsize=%-6u ascii-ci=%d ftype=%d\n"
+		   "         =%-22s utf8=%s\n"
 		   "log      =%-22s bsize=%-6d blocks=%lld, version=%d\n"
 		   "         =%-22s sectsz=%-5u sunit=%d blks, lazy-count=%d\n"
 		   "realtime =%-22s extsz=%-6d blocks=%lld, rtextents=%lld\n"),
@@ -2553,6 +2642,7 @@ _("size %s specified for log subvolume is too large, maximum is %lld blocks\n"),
 			"", blocksize, (long long)dblocks, imaxpct,
 			"", dsunit, dswidth,
 			dirversion, dirblocksize, nci, dirftype,
+			"", utf8_version_string,
 			logfile, 1 << blocklog, (long long)logblocks,
 			logversion, "", lsectorsize, lsunit, lazy_sb_counters,
 			rtfile, rtextblocks << blocklog,
@@ -2617,6 +2707,7 @@ _("size %s specified for log subvolume is too large, maximum is %lld blocks\n"),
 		sbp->sb_logsectlog = 0;
 		sbp->sb_logsectsize = 0;
 	}
+	sbp->sb_utf8version = utf8;
 
 	if (force_overwrite)
 		zero_old_xfs_structures(&xi, sbp);
@@ -3171,7 +3262,8 @@ usage( void )
 			    sunit=value|su=num,sectlog=n|sectsize=num,\n\
 			    lazy-count=0|1]\n\
 /* label */		[-L label (maximum 12 characters)]\n\
-/* naming */		[-n log=n|size=num,version=2|ci,ftype=0|1]\n\
+/* naming */		[-n log=n|size=num,version=2|ci,ftype=0|1,\n\
+			    utf8=0|7]\n\
 /* no-op info only */	[-N]\n\
 /* prototype file */	[-p fname]\n\
 /* quiet */		[-q]\n\
diff --git a/mkfs/xfs_mkfs.h b/mkfs/xfs_mkfs.h
index 9df5f37..f40b284 100644
--- a/mkfs/xfs_mkfs.h
+++ b/mkfs/xfs_mkfs.h
@@ -37,13 +37,14 @@
 	0 ) : XFS_SB_VERSION_1 )
 
 #define XFS_SB_VERSION2_MKFS(crc, lazycount, attr2, projid32bit, parent, \
-			     ftype) (\
+			     ftype, utf8) (\
 	((lazycount) ? XFS_SB_VERSION2_LAZYSBCOUNTBIT : 0) |		\
 	((attr2) ? XFS_SB_VERSION2_ATTR2BIT : 0) |			\
 	((projid32bit) ? XFS_SB_VERSION2_PROJID32BIT : 0) |		\
 	((parent) ? XFS_SB_VERSION2_PARENTBIT : 0) |			\
 	((crc) ? XFS_SB_VERSION2_CRCBIT : 0) |				\
 	((ftype) ? XFS_SB_VERSION2_FTYPE : 0) |				\
+	((utf8) ? XFS_SB_VERSION2_UTF8BIT : 0) |			\
 	0 )
 
 #define	XFS_DFL_BLOCKSIZE_LOG	12		/* 4096 byte blocks */
-- 
1.7.12.4

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 33/35] xfsprogs: add utf8 support to xfs_repair
  2014-10-03 21:47 [RFC v3] Unicode/UTF-8 support for XFS Ben Myers
                   ` (31 preceding siblings ...)
  2014-10-03 22:15 ` [PATCH 32/35] xfsprogs: add utf8 support to mkfs.xfs Ben Myers
@ 2014-10-03 22:16 ` Ben Myers
  2014-10-03 22:16 ` [PATCH 34/35] xfsprogs: xfs_db support for sb_utf8version Ben Myers
  2014-10-03 22:17 ` [PATCH 35/35] xfsprogs: add a test for utf8 support Ben Myers
  34 siblings, 0 replies; 63+ messages in thread
From: Ben Myers @ 2014-10-03 22:16 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: olaf, xfs

From: Mark Tinguely <tinguely@sgi.com>

Fix the duplicate filename detection to use the utf-8 normalization
routines.

Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

[XXX use sb_utf8version on the global xfs_mount.
 TODO maybe add the xfs_mount to the args structure? --bpm]
---
 db/check.c        |  2 +-
 libxfs/xfs_utf8.c | 16 ++++++++--------
 repair/phase6.c   | 36 +++++++++++++++++++++++++-----------
 3 files changed, 34 insertions(+), 20 deletions(-)
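
The change to dir_hash_add() below replaces a raw memcmp() of the name
bytes with a comparison of normalized names, so that two byte-wise
different spellings of the same name are flagged as duplicates. A toy
sketch of that idea follows. Everything in it is hypothetical: the
normalizer is plain ASCII lower-casing standing in for the real UTF-8
normalization, and a linear scan stands in for the hash buckets that
xfs_repair actually uses.

```c
#include <ctype.h>
#include <string.h>

#define SKETCH_MAXNAMES	16
#define SKETCH_MAXLEN	64

struct sketch_names {
	char	norm[SKETCH_MAXNAMES][SKETCH_MAXLEN];
	int	count;
};

/* Stand-in normalizer: ASCII lower-casing only, NOT Unicode NFKD. */
static void
sketch_normalize(const char *name, char *out, size_t outlen)
{
	size_t	i;

	for (i = 0; name[i] != '\0' && i + 1 < outlen; i++)
		out[i] = (char)tolower((unsigned char)name[i]);
	out[i] = '\0';
}

/* Returns 1 if the name was added, 0 if it is a duplicate, -1 if full. */
static int
sketch_names_add(struct sketch_names *t, const char *name)
{
	char	norm[SKETCH_MAXLEN];
	int	i;

	/* Compare normalized forms, never the raw bytes. */
	sketch_normalize(name, norm, sizeof(norm));
	for (i = 0; i < t->count; i++) {
		if (strcmp(t->norm[i], norm) == 0)
			return 0;
	}
	if (t->count == SKETCH_MAXNAMES)
		return -1;
	strcpy(t->norm[t->count++], norm);
	return 1;
}
```

With this stand-in, adding "Makefile" and then "makefile" reports the
second name as a duplicate, which is the kind of collision the patch
wants repair to detect for normalized UTF-8 names.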

diff --git a/db/check.c b/db/check.c
index d317a71..9219cc8 100644
--- a/db/check.c
+++ b/db/check.c
@@ -2324,7 +2324,7 @@ process_data_dir_v2(
 			(char *)dep - (char *)data);
 		dir_hash_add(mp->m_dirnameops->hashname(dep->name,
 					dep->namelen,
-					0 /* version for later */), addr);
+					mp->m_sb.sb_utf8version), addr);
 		ptr += xfs_dir3_data_entsize(mp, dep->namelen);
 		count++;
 		lastfree = 0;
diff --git a/libxfs/xfs_utf8.c b/libxfs/xfs_utf8.c
index f7042ef..e7a717e 100644
--- a/libxfs/xfs_utf8.c
+++ b/libxfs/xfs_utf8.c
@@ -106,8 +106,8 @@ xfs_utf8_normhash(
 	unsigned char	*norm;
 	ssize_t		normlen;
 	int		c;
-	unsigned int	sb_utf8version =
-		args->dp->i_mount->m_sb.sb_utf8version;
+	unsigned int	sb_utf8version = mp->m_sb.sb_utf8version;
+/* XXX		args->dp->i_mount->m_sb.sb_utf8version; */
 
 	/* Don't normalize system attribute names. */
 	if (args->flags & (ATTR_ROOT|ATTR_SECURE))
@@ -147,8 +147,8 @@ xfs_utf8_compname(
 	struct utf8cursor u8c;
 	const unsigned char *norm;
 	int		c;
-	unsigned int	sb_utf8version =
-		args->dp->i_mount->m_sb.sb_utf8version;
+	unsigned int	sb_utf8version = mp->m_sb.sb_utf8version;
+/* XXX		args->dp->i_mount->m_sb.sb_utf8version; */
 
 	ASSERT(args->norm || args->normlen == -1);
 
@@ -213,8 +213,8 @@ xfs_utf8_ci_normhash(
 	unsigned char	*norm;
 	ssize_t		normlen;
 	int		c;
-	unsigned int	sb_utf8version =
-		args->dp->i_mount->m_sb.sb_utf8version;
+	unsigned int	sb_utf8version = mp->m_sb.sb_utf8version;
+/* XXX		args->dp->i_mount->m_sb.sb_utf8version; */
 
 	/* Don't normalize system attribute names. */
 	if (args->flags & (ATTR_ROOT|ATTR_SECURE))
@@ -254,8 +254,8 @@ xfs_utf8_ci_compname(
 	struct utf8cursor u8c;
 	const unsigned char *norm;
 	int		c;
-	unsigned int	sb_utf8version =
-		args->dp->i_mount->m_sb.sb_utf8version;
+	unsigned int	sb_utf8version = mp->m_sb.sb_utf8version;
+/* XXX		args->dp->i_mount->m_sb.sb_utf8version; */
 
 	ASSERT(args->norm || args->normlen == -1);
 
diff --git a/repair/phase6.c b/repair/phase6.c
index c18ef69..eb3ea35 100644
--- a/repair/phase6.c
+++ b/repair/phase6.c
@@ -176,13 +176,15 @@ dir_hash_add(
 	unsigned char		*name,
 	__uint8_t		ftype)
 {
-	xfs_dahash_t		hash = 0;
 	int			byaddr;
 	int			byhash = 0;
 	dir_hash_ent_t		*p;
 	int			dup;
 	short			junk;
 	struct xfs_name		xname;
+	xfs_da_args_t		args;
+
+	memset(&args, 0, sizeof(xfs_da_args_t));
 
 	ASSERT(!hashtab->names_duped);
 
@@ -195,20 +197,30 @@ dir_hash_add(
 	dup = 0;
 
 	if (!junk) {
-		hash = mp->m_dirnameops->hashname(name, namelen,
-				0 /* version for later */);
-		byhash = DIR_HASH_FUNC(hashtab, hash);
+		int error;
+
+		args.name = name;
+		args.namelen = namelen;
+		args.inumber = inum;
+		args.whichfork = XFS_DATA_FORK;
+
+		error = mp->m_dirnameops->normhash(&args);
+		if (error)
+			do_error(_("normalize has failed (%d)\n"), error);
+
+		byhash = DIR_HASH_FUNC(hashtab, args.hashval);
 
 		/*
 		 * search hash bucket for existing name.
 		 */
 		for (p = hashtab->byhash[byhash]; p; p = p->nextbyhash) {
-			if (p->hashval == hash && p->name.len == namelen) {
-				if (memcmp(p->name.name, name, namelen) == 0) {
-					dup = 1;
-					junk = 1;
-					break;
-				}
+			if (p->hashval == args.hashval &&
+			    mp->m_dirnameops->compname(&args, p->name.name,
+						       p->name.len) !=
+							 XFS_CMP_DIFFERENT) {
+				dup = 1;
+				junk = 1;
+				break;
 			}
 		}
 	}
@@ -227,7 +239,7 @@ dir_hash_add(
 	hashtab->last = p;
 
 	if (!(p->junkit = junk)) {
-		p->hashval = hash;
+		p->hashval = args.hashval;
 		p->nextbyhash = hashtab->byhash[byhash];
 		hashtab->byhash[byhash] = p;
 	}
@@ -236,6 +248,8 @@ dir_hash_add(
 	p->seen = 0;
 	p->name = xname;
 
+	if (args.norm)
+		kmem_free((void *) args.norm);
 	return !dup;
 }
 
-- 
1.7.12.4

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 34/35] xfsprogs: xfs_db support for sb_utf8version
  2014-10-03 21:47 [RFC v3] Unicode/UTF-8 support for XFS Ben Myers
                   ` (32 preceding siblings ...)
  2014-10-03 22:16 ` [PATCH 33/35] xfsprogs: add utf8 support to xfs_repair Ben Myers
@ 2014-10-03 22:16 ` Ben Myers
  2014-10-03 22:17 ` [PATCH 35/35] xfsprogs: add a test for utf8 support Ben Myers
  34 siblings, 0 replies; 63+ messages in thread
From: Ben Myers @ 2014-10-03 22:16 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: olaf, xfs

From: Ben Myers <bpm@sgi.com>

Add support for accessing and setting sb_utf8version to xfs_db.

Signed-off-by: Ben Myers <bpm@sgi.com>
---
 db/hash.c | 1 +
 db/sb.c   | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/db/hash.c b/db/hash.c
index 02376e6..5196442 100644
--- a/db/hash.c
+++ b/db/hash.c
@@ -52,6 +52,7 @@ hash_f(
 {
 	xfs_dahash_t	hashval;
 
+	/* XXX utf8version? */
 	hashval = libxfs_da_hashname((uchar_t *)argv[1], (int)strlen(argv[1]));
 	dbprintf("0x%x\n", hashval);
 	return 0;
diff --git a/db/sb.c b/db/sb.c
index 6cb665d..e32790a 100644
--- a/db/sb.c
+++ b/db/sb.c
@@ -119,6 +119,7 @@ const field_t	sb_flds[] = {
 	{ "features_log_incompat", FLDT_UINT32X, OI(OFF(features_log_incompat)),
 		C1, 0, TYP_NONE },
 	{ "crc", FLDT_CRC, OI(OFF(crc)), C1, 0, TYP_NONE },
+	{ "utf8version", FLDT_UINT32D, OI(OFF(utf8version)), C1, 0, TYP_NONE },
 	{ "pquotino", FLDT_INO, OI(OFF(pquotino)), C1, 0, TYP_INODE },
 	{ "lsn", FLDT_UINT64X, OI(OFF(lsn)), C1, 0, TYP_NONE },
 	{ NULL }
@@ -646,6 +647,8 @@ version_string(
 		strcat(s, ",CRC");
 	if (xfs_sb_version_hasftype(sbp))
 		strcat(s, ",FTYPE");
+	if (xfs_sb_version_hasutf8(sbp))
+		strcat(s, ",UTF8");
 	return s;
 }
 
-- 
1.7.12.4

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 35/35] xfsprogs: add a test for utf8 support
  2014-10-03 21:47 [RFC v3] Unicode/UTF-8 support for XFS Ben Myers
                   ` (33 preceding siblings ...)
  2014-10-03 22:16 ` [PATCH 34/35] xfsprogs: xfs_db support for sb_utf8version Ben Myers
@ 2014-10-03 22:17 ` Ben Myers
  34 siblings, 0 replies; 63+ messages in thread
From: Ben Myers @ 2014-10-03 22:17 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: olaf, xfs

From: Ben Myers <bpm@sgi.com>

Here's a basic test for utf8 support in xfs.  It is based on code that
does testing in the trie generator.  Here too we are using the
NormalizationTest-7.0.0.txt file from the unicode distribution.  We
check that the normalization in libxfs is working and then run checks on
a filesystem mounted on /mnt (currently this is hardcoded).  Note that
there are some 'blacklisted' unichars which normalize to reserved
characters.

Signed-off-by: Ben Myers <bpm@sgi.com>
---
 Makefile                  |   2 +-
 chkutf8data/Makefile      |  21 +++
 chkutf8data/chkutf8data.c | 451 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 473 insertions(+), 1 deletion(-)
 create mode 100644 chkutf8data/Makefile
 create mode 100644 chkutf8data/chkutf8data.c
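
For readers unfamiliar with the test data: each data line of
NormalizationTest.txt holds five semicolon-separated columns of
hexadecimal code points (source, NFC, NFD, NFKC, NFKD). The test below
picks columns 1 and 5 and converts them to UTF-8. A minimal standalone
sketch of that step follows; the function names are hypothetical and
the sample line used in the note after the code is only meant to
illustrate the file format.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Encode one code point as UTF-8; returns byte count, 0 if out of range. */
static int
encode_utf8(unsigned long cp, unsigned char *out)
{
	if (cp < 0x80) {
		out[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {
		out[0] = (unsigned char)(0xC0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp < 0x10000) {
		out[0] = (unsigned char)(0xE0 | (cp >> 12));
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	}
	if (cp < 0x110000) {
		out[0] = (unsigned char)(0xF0 | (cp >> 18));
		out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
		out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[3] = (unsigned char)(0x80 | (cp & 0x3F));
		return 4;
	}
	return 0;
}

/* Convert a field such as "0044 0307" to UTF-8; returns length or -1. */
static int
hex_field_to_utf8(const char *field, unsigned char *out, size_t outlen)
{
	const char	*s = field;
	char		*end;
	size_t		len = 0;
	int		n;

	for (;;) {
		unsigned long cp = strtoul(s, &end, 16);

		if (end == s)
			break;		/* no further hex digits */
		if (len + 4 >= outlen)
			return -1;	/* keep room for 4 bytes and NUL */
		n = encode_utf8(cp, out + len);
		if (n == 0)
			return -1;
		len += n;
		s = end;
	}
	while (*s == ' ')
		s++;
	if (*s != '\0')
		return -1;		/* unexpected trailing characters */
	out[len] = '\0';
	return (int)len;
}

/* Extract columns 1 (source) and 5 (NFKD); buffers hold 64 bytes each. */
static int
split_test_line(const char *line, char *source, char *nfkd)
{
	if (*line == '#')
		return -1;		/* comment line */
	if (sscanf(line, "%63[^;];%*[^;];%*[^;];%*[^;];%63[^;];",
		   source, nfkd) != 2)
		return -1;
	return 0;
}
```

For a line of the form "1E0A;1E0A;0044 0307;1E0A;0044 0307; # ...",
the source column encodes to the three bytes E1 B8 8A and the NFKD
column to 44 CC 87. The explicit "end == s" check matters: a bare
"while (*s)" loop around strtoul() does not terminate if the field
ends in a space.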

diff --git a/Makefile b/Makefile
index 74778b5..d2be322 100644
--- a/Makefile
+++ b/Makefile
@@ -42,7 +42,7 @@ endif
 
 LIB_SUBDIRS = utf8norm libxfs libxlog libxcmd libhandle libdisk
 TOOL_SUBDIRS = copy db estimate fsck fsr growfs io logprint mkfs quota \
-		mdrestore repair rtcp m4 man doc po debian
+		mdrestore repair rtcp m4 man doc po debian chkutf8data
 
 SUBDIRS = include $(LIB_SUBDIRS) $(TOOL_SUBDIRS)
 
diff --git a/chkutf8data/Makefile b/chkutf8data/Makefile
new file mode 100644
index 0000000..6ce5706
--- /dev/null
+++ b/chkutf8data/Makefile
@@ -0,0 +1,21 @@
+#
+# Copyright (c) 2014 SGI. All Rights Reserved.
+#
+
+TOPDIR = ..
+include $(TOPDIR)/include/builddefs
+
+LTCOMMAND = chkutf8data
+CFILES = chkutf8data.c
+
+LLDLIBS = $(LIBXFS)
+LTDEPENDENCIES = $(LIBXFS)
+LLDFLAGS = -static
+
+default: depend $(LTCOMMAND)
+
+include $(BUILDRULES)
+
+install: default
+
+-include .ltdep
diff --git a/chkutf8data/chkutf8data.c b/chkutf8data/chkutf8data.c
new file mode 100644
index 0000000..7fe052f
--- /dev/null
+++ b/chkutf8data/chkutf8data.c
@@ -0,0 +1,451 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+#include <sys/types.h>
+#include <stddef.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <assert.h>
+#include <string.h>
+#include <unistd.h>
+#include <errno.h>
+#include <fcntl.h>
+#include "utf8norm.h"
+
+#define FOLD_NAME	"CaseFolding.txt"
+#define TEST_NAME	"NormalizationTest.txt"
+
+const char	*fold_name = FOLD_NAME;
+const char	*test_name = TEST_NAME;
+
+/* An arbitrary line size limit on input lines. */
+
+#define LINESIZE	1024
+char line[LINESIZE];
+char buf0[LINESIZE];
+char buf1[LINESIZE];
+char buf2[LINESIZE];
+char buf3[LINESIZE];
+char buf4[LINESIZE];
+char buf5[LINESIZE];
+
+const char *mtpt;
+
+/* ------------------------------------------------------------------ */
+
+static void
+help(void)
+{
+	printf("The input files:\n");
+	printf("\t-f %s\n", FOLD_NAME);
+	printf("\t-t %s\n", TEST_NAME);
+	printf("\n");
+}
+
+static void
+usage(void)
+{
+	help();
+	exit(1);
+}
+
+static void
+open_fail(const char *name, int error)
+{
+	printf("Error %d opening %s: %s\n", error, name, strerror(error));
+	exit(1);
+}
+
+static void
+file_fail(const char *filename)
+{
+	printf("Error parsing %s\n", filename);
+	exit(1);
+}
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * UTF8 valid ranges.
+ *
+ * The UTF-8 encoding spreads the bits of a 32bit word over several
+ * bytes. This table gives the ranges that can be held and how they'd
+ * be represented.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000000 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000000 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * There is an additional requirement on UTF-8, in that only the
+ * shortest representation of a 32bit value is to be used.  A decoder
+ * must not decode sequences that do not satisfy this requirement.
+ * Thus the allowed ranges have a lower bound.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000080 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000800 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00010000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00200000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x04000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * Actual unicode characters are limited to the range 0x0 - 0x10FFFF,
+ * 17 planes of 65536 values.  This limits the sequences actually seen
+ * even more, to just the following.
+ *
+ *          0 -     0x7f: 0                     0x7f
+ *       0x80 -    0x7ff: 0xc2 0x80             0xdf 0xbf
+ *      0x800 -   0xffff: 0xe0 0xa0 0x80        0xef 0xbf 0xbf
+ *    0x10000 - 0x10ffff: 0xf0 0x90 0x80 0x80   0xf4 0x8f 0xbf 0xbf
+ *
+ * Even within those ranges not all values are allowed: the surrogates
+ * 0xd800 - 0xdfff should never be seen.
+ *
+ * Note that the longest sequence seen with valid usage is 4 bytes,
+ * the same as a single UTF-32 character.  This makes the UTF-8
+ * representation of Unicode never larger than UTF-32.
+ *
+ * The shortest sequence requirement was introduced by:
+ *    Corrigendum #1: UTF-8 Shortest Form
+ * It can be found here:
+ *    http://www.unicode.org/versions/corrigendum1.html
+ *
+ */
+
+#define UTF8_2_BITS     0xC0
+#define UTF8_3_BITS     0xE0
+#define UTF8_4_BITS     0xF0
+#define UTF8_N_BITS     0x80
+#define UTF8_2_MASK     0xE0
+#define UTF8_3_MASK     0xF0
+#define UTF8_4_MASK     0xF8
+#define UTF8_N_MASK     0xC0
+#define UTF8_V_MASK     0x3F
+#define UTF8_V_SHIFT    6
+
+static int
+utf8key(unsigned int key, char keyval[])
+{
+	int keylen;
+
+	if (key < 0x80) {
+		keyval[0] = key;
+		keylen = 1;
+	} else if (key < 0x800) {
+		keyval[1] = key & UTF8_V_MASK;
+		keyval[1] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[0] = key;
+		keyval[0] |= UTF8_2_BITS;
+		keylen = 2;
+	} else if (key < 0x10000) {
+		keyval[2] = key & UTF8_V_MASK;
+		keyval[2] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[1] = key & UTF8_V_MASK;
+		keyval[1] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[0] = key;
+		keyval[0] |= UTF8_3_BITS;
+		keylen = 3;
+	} else if (key < 0x110000) {
+		keyval[3] = key & UTF8_V_MASK;
+		keyval[3] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[2] = key & UTF8_V_MASK;
+		keyval[2] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[1] = key & UTF8_V_MASK;
+		keyval[1] |= UTF8_N_BITS;
+		key >>= UTF8_V_SHIFT;
+		keyval[0] = key;
+		keyval[0] |= UTF8_4_BITS;
+		keylen = 4;
+	} else {
+		printf("%#x: illegal key\n", key);
+		keylen = 0;
+	}
+	return keylen;
+}
+
+static unsigned int
+utf8code(const char *str)
+{
+	const unsigned char *s = (const unsigned char*)str;
+	unsigned int unichar = 0;
+
+	if (*s < 0x80) {
+		unichar = *s;
+	} else if (*s < UTF8_3_BITS) {
+		unichar = *s++ & 0x1F;
+		unichar <<= UTF8_V_SHIFT;
+		unichar |= *s & 0x3F;
+	} else if (*s < UTF8_4_BITS) {
+		unichar = *s++ & 0x0F;
+		unichar <<= UTF8_V_SHIFT;
+		unichar |= *s++ & 0x3F;
+		unichar <<= UTF8_V_SHIFT;
+		unichar |= *s & 0x3F;
+	} else {
+		unichar = *s++ & 0x0F;
+		unichar <<= UTF8_V_SHIFT;
+		unichar |= *s++ & 0x3F;
+		unichar <<= UTF8_V_SHIFT;
+		unichar |= *s++ & 0x3F;
+		unichar <<= UTF8_V_SHIFT;
+		unichar |= *s & 0x3F;
+	}
+	return unichar;
+}
+
+static int
+normalize_line(utf8data_t tree, char *s, char *t)
+{
+	struct utf8cursor u8c;
+
+	if (utf8cursor(&u8c, tree, s)) {
+		printf("%s return utf8cursor failed\n", __func__);
+		return -1;
+	}
+
+	while ((*t = utf8byte(&u8c)) > 0)
+		t++;
+
+	if (*t != 0) {
+		printf("%s return t not 0\n", __func__);
+		return -1;
+	}
+
+	return 0;
+}
+
+static void
+test_key(char	*source,
+	 char	*NFC,
+	 char	*NFD,
+	 char	*NFKC,
+	 char	*NFKD)
+{
+	int	fd;
+	int	error;
+
+	printf("Testing %s -> %s\n", source, NFKD);
+
+	error = chdir("/mnt");	/* XXX hardcoded mount point */
+	if (error) {
+		perror(mtpt);
+		perror("/mnt");
+	}
+
+	/* the initial create should succeed */
+	printf("Initial create %s... ", source);
+	fd = open(source, O_CREAT|O_EXCL, 0);
+	if (fd < 0) {
+		printf("Failed to create %s XXX\n", source);
+		perror(source);
+		close(fd);
+//		return;
+		exit(-1);
+	}
+	close(fd);
+	printf("Success\n");
+
+	/* a second create should fail */
+	printf("Second create %s (should return EEXIST)... ", NFKD);
+	fd = open(NFKD, O_CREAT|O_EXCL, 0);
+	if (fd >= 0) {
+		printf("Test Failed.  Was able to create %s XXX\n", NFKD);
+		perror(NFKD);
+		close(fd);
+//		return;
+		exit(-1);
+	}
+	close(fd);
+	printf("EEXIST\n");
+
+	error = unlink(NFKD);
+	if (error) {
+		printf("Unlink failed\n");
+		perror(NFKD);
+		exit(-1);
+	}
+}
+
+int
+blacklisted(unsigned int unichar)
+{
+	/* these unichars normalize to characters we don't allow */
+	unsigned int list[] = {	0x2024 /* . */,
+				0x2025 /* .. */,
+				0x2100 /* a/c */,
+				0x2101 /* a/s */,
+				0x2105 /* c/o */,
+				0x2106 /* c/u */,
+				0xFE30 /* .. */,
+				0xFE52 /* . */,
+				0xFF0E /* . */,
+				0xFF0F /* / */};
+	int i;
+
+	for (i=0; i < (sizeof(list) / sizeof(unichar)); i++) {
+		if (list[i] == unichar)
+			return 1;
+	}
+	return 0;
+}
+
+static void
+normalization_test(void)
+{
+	FILE *file;
+	unsigned int unichar;
+	char *s;
+	char *t;
+	int ret;
+	int tests = 0;
+	int failures = 0;
+	char	source[LINESIZE];
+	char	NFKD[LINESIZE];
+	int	skip;
+	utf8data_t	nfkdi = utf8nfkdi(7 << 16);
+
+	printf("Parsing %s\n", test_name);
+	/* Step one, read data from file. */
+	file = fopen(test_name, "r");
+	if (!file)
+		open_fail(test_name, errno);
+
+	while (fgets(line, LINESIZE, file)) {
+		ret = sscanf(line, "%[^;];%*[^;];%*[^;];%*[^;];%[^;];",
+				source, NFKD);
+		if (ret != 2 || *line == '#')
+			continue;
+
+		s = source;
+		t = buf2;
+		skip = 0;
+		while (*s) {
+			unichar = strtoul(s, &s, 16);
+			if (blacklisted(unichar))
+				skip++;
+			t += utf8key(unichar, t);
+		}
+		*t = '\0';
+
+		if (skip)
+			continue;
+
+		s = NFKD;
+		t = buf3;
+		while (*s) {
+			unichar = strtoul(s, &s, 16);
+			t += utf8key(unichar, t);
+		}
+		*t = '\0';
+
+		/* normalize source */
+		if (normalize_line(nfkdi, buf2, buf4) < 0) {
+			printf("normalize_line for unichar %s Failed\n", buf2);
+			exit(1);
+		}
+		printf("(%s) %s normalized to %s... ", source, buf2, buf4);
+
+		/* does it match NFKD? */
+		if (memcmp(buf4, buf3, strlen(buf3))) {
+			printf("Fail!\n");
+		} else {
+			printf("Correct!\n");
+		}
+
+		/* normalize NFKD */
+		if (normalize_line(nfkdi, buf3, buf5) < 0) {
+			printf("normalize_line for unichar %s Failed\n",
+					buf3);
+			exit(1);
+		}
+		printf("(%s) %s normalized to %s... ", NFKD, buf3, buf5);
+
+		/* does it normalize to itself? */
+		if (memcmp(buf5, buf3, strlen(buf3))) {
+			printf("Fail!\n");
+		} else {
+			printf("Correct!\n");
+		}
+
+		test_key(buf2, NULL, NULL, NULL, buf3);
+
+		/* XXX ignorables need to be taken into account? */
+//		printf("%s normalized to %s\n", buf0, buf4);
+//		printf("%s normalized to %s\n", buf1, buf5);
+//		test_key(buf2, NULL, NULL, NULL, buf3);
+#if 0
+		ignorables = 0;
+		s = buf1;
+		t = buf3;
+		while (*s) {
+			unichar = strtoul(s, &s, 16);
+			data = &unicode_data[unichar];
+			if (data->utf8nfkdi && !*data->utf8nfkdi)
+				ignorables = 1;
+			else
+				t += utf8key(unichar, t);
+		}
+		*t = '\0';
+
+		tests++;
+		if (normalize_line(nfkdi_tree) < 0) {
+			printf("\nline %s -> %s", buf0, buf1);
+			if (ignorables)
+				printf(" (ignorables removed)");
+			printf(" failure\n");
+			failures++;
+		}
+#endif
+	}
+	fclose(file);
+	printf("Ran %d tests with %d failures\n", tests, failures);
+	if (failures)
+		file_fail(test_name);
+}
+
+int
+main(int argc, char *argv[])
+{
+	int opt;
+
+	while ((opt = getopt(argc, argv, "f:t:h")) != -1) {
+		switch (opt) {
+		case 'f':
+			fold_name = optarg;
+			break;
+		case 't':
+			test_name = optarg;
+			break;
+		case 'h':
+			help();
+			exit(0);
+		default:
+			usage();
+		}
+	}
+
+	normalization_test();
+
+	return 0;
+}
-- 
1.7.12.4

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH 13/16] xfs: implement demand load of utf8norm.ko
  2014-10-03 22:03 ` [PATCH 13/16] xfs: implement demand load of utf8norm.ko Ben Myers
@ 2014-10-04  7:16     ` Christoph Hellwig
  0 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2014-10-04  7:16 UTC (permalink / raw)
  To: Ben Myers; +Cc: linux-fsdevel, xfs, olaf

> +int
> +xfs_init_utf8_module(struct xfs_mount	*mp)
> +{
> +	request_module("utf8norm");
> +
> +	spin_lock(&utf8norm_lock);
> +	if (utf8norm_initialized) {
> +		spin_unlock(&utf8norm_lock);
> +		return 0;
> +	}
> +
> +	utf8version_is_supported_func = symbol_get(utf8version_is_supported);
> +	if (!utf8version_is_supported_func)
> +		goto error;
> +
> +	utf8nfkdi_func = symbol_get(utf8nfkdi);
> +	if (!utf8nfkdi_func)
> +		goto error;

Please export a structure with function pointers so that we just need
a single symbol_get call.  I'd have to look up how symbol_get works,
but unless there's something that speaks against this it might be
simpler to then just do a symbol_get per mount structure that uses
utf8 and can point to that structure so that we don't have to add
additional reference counting infrastructure around it.
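
The suggestion above could look roughly like the following userspace
sketch. All names here are hypothetical (struct utf8norm_ops,
sketch_symbol_get(), struct sketch_mount), and sketch_symbol_get() is
only a stand-in for the kernel's symbol_get(); this is not code from
the series.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical operations table exported by the normalization module. */
struct utf8norm_ops {
	int		(*version_is_supported)(unsigned int version);
	const void	*(*nfkdi)(unsigned int version);
};

/* Stand-in module implementation: only "7.0.0" (7 << 16) is known. */
static int
sketch_version_is_supported(unsigned int version)
{
	return version == (7u << 16);
}

static const void *
sketch_nfkdi(unsigned int version)
{
	static const char tree[] = "nfkdi";

	return sketch_version_is_supported(version) ? tree : NULL;
}

static const struct utf8norm_ops sketch_utf8norm_ops = {
	.version_is_supported	= sketch_version_is_supported,
	.nfkdi			= sketch_nfkdi,
};

/* Userspace stand-in for symbol_get(): one lookup yields all functions. */
static const void *
sketch_symbol_get(const char *name)
{
	if (strcmp(name, "utf8norm_ops") == 0)
		return &sketch_utf8norm_ops;
	return NULL;
}

/* Each mount keeps its own pointer to the table. */
struct sketch_mount {
	const struct utf8norm_ops	*utf8ops;
	unsigned int			utf8version;
};

static int
sketch_mount_init_utf8(struct sketch_mount *mp, unsigned int version)
{
	const struct utf8norm_ops *ops = sketch_symbol_get("utf8norm_ops");

	if (!ops || !ops->version_is_supported(version))
		return -1;
	mp->utf8ops = ops;
	mp->utf8version = version;
	return 0;
}
```

In a real kernel version each successful lookup would presumably be
paired with a symbol_put() at unmount, so the module reference count
would track the number of utf8 mounts without a separate global flag
and spinlock.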


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 04/16] lib/utf8norm.c: reduce the size of utf8data[]
  2014-10-03 21:54 ` [PATCH 04/16] lib/utf8norm.c: reduce the size of utf8data[] Ben Myers
@ 2014-10-05 21:52     ` Dave Chinner
  0 siblings, 0 replies; 63+ messages in thread
From: Dave Chinner @ 2014-10-05 21:52 UTC (permalink / raw)
  To: Ben Myers; +Cc: linux-fsdevel, olaf, xfs

On Fri, Oct 03, 2014 at 04:54:55PM -0500, Ben Myers wrote:
> From: Olaf Weber <olaf@sgi.com>
> 
> Remove the Hangul decompositions from the utf8data trie, and do
> algorithmic decomposition to calculate them on the fly. To store
> the decomposition the caller of utf8lookup()/utf8nlookup() must
> provide a 12-byte buffer, which is used to synthesize a leaf with
> the decomposition. Trie size is reduced from 245kB to 90kB.
> 
> This change also contains a number of robustness fixes to the
> trie generator mkutf8data.c.

Please separate out the robustness fixes or merge them back into the
original patch. e.g. Bulk renaming of code like this:


>  static int
> -utf8key(unsigned int key, char keyval[])
> -{
> -	int keylen;
> -
> -	if (key < 0x80) {
> -		keyval[0] = key;
> -		keylen = 1;
> -	} else if (key < 0x800) {
> -		keyval[1] = key & UTF8_V_MASK;
> -		keyval[1] |= UTF8_N_BITS;
> -		key >>= UTF8_V_SHIFT;
....
> +utf8encode(char *str, unsigned int val)
> +{
> +	int len;
> +
> +	if (val < 0x80) {
> +		str[0] = val;
> +		len = 1;
> +	} else if (val < 0x800) {
> +		str[1] = val & UTF8_V_MASK;
> +		str[1] |= UTF8_N_BITS;
> +		val >>= UTF8_V_SHIFT;

Doesn't belong in a patch that introduces special hangul character
handling....
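
For context, the algorithmic decomposition mentioned in the quoted
commit message is the arithmetic mapping the Unicode Standard defines
for precomposed Hangul syllables. A minimal sketch of that arithmetic
follows; the constants are the standard ones, but the function name is
hypothetical and this is not necessarily how the patch implements it.

```c
/* Constants from the Unicode Standard's Hangul syllable decomposition. */
#define HANGUL_SBASE	0xAC00u
#define HANGUL_LBASE	0x1100u
#define HANGUL_VBASE	0x1161u
#define HANGUL_TBASE	0x11A7u
#define HANGUL_VCOUNT	21u
#define HANGUL_TCOUNT	28u
#define HANGUL_NCOUNT	(HANGUL_VCOUNT * HANGUL_TCOUNT)	/* 588 */
#define HANGUL_SCOUNT	(19u * HANGUL_NCOUNT)		/* 11172 */

/*
 * Decompose a precomposed Hangul syllable into its jamo.
 * Returns the number of code points written to out[] (2 or 3),
 * or 0 if cp is not a precomposed Hangul syllable.
 */
static int
hangul_decompose(unsigned int cp, unsigned int out[3])
{
	unsigned int	sindex;
	unsigned int	tindex;

	if (cp < HANGUL_SBASE || cp >= HANGUL_SBASE + HANGUL_SCOUNT)
		return 0;

	sindex = cp - HANGUL_SBASE;
	out[0] = HANGUL_LBASE + sindex / HANGUL_NCOUNT;	/* leading consonant */
	out[1] = HANGUL_VBASE + (sindex % HANGUL_NCOUNT) / HANGUL_TCOUNT;
	tindex = sindex % HANGUL_TCOUNT;
	if (tindex == 0)
		return 2;			/* no trailing consonant */
	out[2] = HANGUL_TBASE + tindex;		/* trailing consonant */
	return 3;
}
```

For example, U+D55C decomposes to U+1112 U+1161 U+11AB. Each jamo
takes three bytes in UTF-8, so a full decomposition needs at most nine
bytes, which fits in the 12-byte buffer mentioned above.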

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 14/16] xfs: rename XFS_IOC_FSGEOM to XFS_IOC_FSGEOM_V2
  2014-10-03 22:04 ` [PATCH 14/16] xfs: rename XFS_IOC_FSGEOM to XFS_IOC_FSGEOM_V2 Ben Myers
@ 2014-10-06 20:33     ` Dave Chinner
  0 siblings, 0 replies; 63+ messages in thread
From: Dave Chinner @ 2014-10-06 20:33 UTC (permalink / raw)
  To: Ben Myers; +Cc: linux-fsdevel, olaf, xfs

On Fri, Oct 03, 2014 at 05:04:35PM -0500, Ben Myers wrote:
> From: Ben Myers <bpm@sgi.com>
> 
> We'll be creating a new versioned XFS_IOC_FSGEOMETRY ioctl and structure
> so rename the current revision to _V2.

Urk, no.

This will result in older applications picking up the new ioctl when
they are rebuilt without having explicit support for the new ioctl.

Just create a new ioctl with a new name and modify applications to
use it. If the kernel does not support the new ioctl, then fall back
to the old one in userspace.
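
A minimal sketch of the userspace fallback described above, not code from the series. The request codes `GEOM_REQ_OLD` and `GEOM_REQ_NEW`, the `ioctl_fn` type and the two stand-in "kernels" are hypothetical names made up for illustration; a real caller would pass the actual ioctl numbers from xfs_fs.h to ioctl(2).

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical request codes; real values would come from xfs_fs.h. */
enum { GEOM_REQ_OLD = 1, GEOM_REQ_NEW = 2 };

/* Shape of an ioctl-like call: 0 on success, -1 with errno set. */
typedef int (*ioctl_fn)(int fd, unsigned long req, void *arg);

/*
 * Try the new request first and fall back to the old one only when the
 * kernel does not recognise it. ENOTTY is what Linux normally reports
 * for an unknown ioctl; EINVAL is tolerated as well.
 * Returns the request code that succeeded, or -1 if both failed.
 */
static int geom_query(ioctl_fn do_ioctl, int fd, void *new_buf, void *old_buf)
{
	if (do_ioctl(fd, GEOM_REQ_NEW, new_buf) == 0)
		return GEOM_REQ_NEW;
	if (errno != ENOTTY && errno != EINVAL)
		return -1;	/* genuine failure, do not mask it */
	if (do_ioctl(fd, GEOM_REQ_OLD, old_buf) == 0)
		return GEOM_REQ_OLD;
	return -1;
}

/* Stand-in for an old kernel that only knows the old request. */
static int old_kernel_ioctl(int fd, unsigned long req, void *arg)
{
	(void)fd;
	(void)arg;
	if (req == GEOM_REQ_OLD)
		return 0;
	errno = ENOTTY;
	return -1;
}

/* Stand-in for a new kernel that knows both requests. */
static int new_kernel_ioctl(int fd, unsigned long req, void *arg)
{
	(void)fd;
	(void)req;
	(void)arg;
	return 0;
}
```

With this shape an application that was never taught about the new request keeps issuing the old one unchanged, which is the property the rename would have broken.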

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 14/16] xfs: rename XFS_IOC_FSGEOM to XFS_IOC_FSGEOM_V2
  2014-10-06 20:33     ` Dave Chinner
  (?)
@ 2014-10-06 20:38     ` Ben Myers
  -1 siblings, 0 replies; 63+ messages in thread
From: Ben Myers @ 2014-10-06 20:38 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, olaf, xfs

On Tue, Oct 07, 2014 at 07:33:00AM +1100, Dave Chinner wrote:
> On Fri, Oct 03, 2014 at 05:04:35PM -0500, Ben Myers wrote:
> > From: Ben Myers <bpm@sgi.com>
> > 
> > We'll be creating a new versioned XFS_IOC_FSGEOMETRY ioctl and structure
> > so rename the current revision to _V2.
> 
> Urk, no.
> 
> This will result in older applications picking up the new ioctl when
> they are rebuilt without having explicit support for the new ioctl.
> 
> Just create a new ioctl with a new name and modify applications to
> use it. If the kernel does not support the new ioctl, then fall back
> to the old one in userspace.

D'oh.  Yeah, ok.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 15/16] xfs: xfs_fs_geometry returns a number of bytes to copy
  2014-10-03 22:05 ` [PATCH 15/16] xfs: xfs_fs_geometry returns a number of bytes to copy Ben Myers
@ 2014-10-06 20:41     ` Dave Chinner
  0 siblings, 0 replies; 63+ messages in thread
From: Dave Chinner @ 2014-10-06 20:41 UTC (permalink / raw)
  To: Ben Myers; +Cc: linux-fsdevel, olaf, xfs

On Fri, Oct 03, 2014 at 05:05:09PM -0500, Ben Myers wrote:
> From: Ben Myers <bpm@sgi.com>
> 
> The versioned xfs_fsop_geom_t will be of variable size.  Make
> xfs_fs_geometry return the number of bytes to copy out to userspace for
> a given version of the structure.

xfs_fs_geometry() should be a void right now - it doesn't return any
error value at all.

Further, the size of the structure that is filled in is determined
by the version of the ioctl being called, not the xfs_fs_geometry()
function. Hence the caller already knows the size of the structure
being used, and hence does not need xfs_fs_geometry() to tell it
that information.

So I don't think this change is necessary.
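
A rough sketch of the point above, under stated assumptions: `struct geom_v1`, `struct geom_v2`, `fill_geometry()` and the two handlers are hypothetical stand-ins, the v1 layout is assumed to be a strict prefix of v2, and `memcpy()` stands in for copy_to_user(). Each handler already knows which structure its ioctl uses, so the fill function can return void.

```c
#include <string.h>

/* Hypothetical layouts: v1 is a strict prefix of v2. */
struct geom_v1 {
	unsigned int	blocksize;
	unsigned int	agcount;
};

struct geom_v2 {
	unsigned int	blocksize;
	unsigned int	agcount;
	unsigned int	logsunit;
	unsigned int	pad;
};

/* Fills the largest layout. Nothing can fail, so it returns void. */
static void fill_geometry(struct geom_v2 *geo)
{
	memset(geo, 0, sizeof(*geo));
	geo->blocksize = 4096;
	geo->agcount = 4;
	geo->logsunit = 512;
}

/* The v1 handler copies out exactly sizeof(struct geom_v1) bytes. */
static size_t handle_geom_v1(void *user_buf)
{
	struct geom_v2 geo;

	fill_geometry(&geo);
	memcpy(user_buf, &geo, sizeof(struct geom_v1));
	return sizeof(struct geom_v1);
}

/* The v2 handler copies out exactly sizeof(struct geom_v2) bytes. */
static size_t handle_geom_v2(void *user_buf)
{
	struct geom_v2 geo;

	fill_geometry(&geo);
	memcpy(user_buf, &geo, sizeof(struct geom_v2));
	return sizeof(struct geom_v2);
}
```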

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 16/16] xfs: add versioned fsgeom ioctl with utf8version field
  2014-10-03 22:05 ` [PATCH 16/16] xfs: add versioned fsgeom ioctl with utf8version field Ben Myers
@ 2014-10-06 21:13     ` Dave Chinner
  0 siblings, 0 replies; 63+ messages in thread
From: Dave Chinner @ 2014-10-06 21:13 UTC (permalink / raw)
  To: Ben Myers; +Cc: linux-fsdevel, olaf, xfs

On Fri, Oct 03, 2014 at 05:05:46PM -0500, Ben Myers wrote:
> From: Ben Myers <bpm@sgi.com>
> 
> This adds a utf8version field to the xfs_fs_geom structure.  An
> important characteristic of this version of the ioctl is that
> fsgeo.version needs to be set by the caller to specify which version of
> the structure to return.
> 
> Signed-off-by: Ben Myers <bpm@sgi.com>
> ---
>  fs/xfs/xfs_fs.h    | 31 +++++++++++++++++++++++++++++++
>  fs/xfs/xfs_fsops.c | 13 ++++++++++++-
>  fs/xfs/xfs_fsops.h |  2 +-
>  fs/xfs/xfs_ioctl.c | 31 +++++++++++++++++++++++++++++++
>  4 files changed, 75 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_fs.h b/fs/xfs/xfs_fs.h
> index fd45cbe..2f4d430 100644
> --- a/fs/xfs/xfs_fs.h
> +++ b/fs/xfs/xfs_fs.h
> @@ -206,6 +206,34 @@ typedef struct xfs_fsop_geom_v2 {
>  	__u32		logsunit;	/* log stripe unit, bytes */
>  } xfs_fsop_geom_v2_t;
>  
> +/*
> + * Output for XFS_IOC_FSGEOMETRY
> + */
> +typedef struct xfs_fsop_geom {
> +	__u32		blocksize;	/* filesystem (data) block size */
> +	__u32		rtextsize;	/* realtime extent size		*/
> +	__u32		agblocks;	/* fsblocks in an AG		*/
> +	__u32		agcount;	/* number of allocation groups	*/
> +	__u32		logblocks;	/* fsblocks in the log		*/
> +	__u32		sectsize;	/* (data) sector size, bytes	*/
> +	__u32		inodesize;	/* inode size in bytes		*/
> +	__u32		imaxpct;	/* max allowed inode space(%)	*/
> +	__u64		datablocks;	/* fsblocks in data subvolume	*/
> +	__u64		rtblocks;	/* fsblocks in realtime subvol	*/
> +	__u64		rtextents;	/* rt extents in realtime subvol*/
> +	__u64		logstart;	/* starting fsblock of the log	*/
> +	unsigned char	uuid[16];	/* unique id of the filesystem	*/
> +	__u32		sunit;		/* stripe unit, fsblocks	*/
> +	__u32		swidth;		/* stripe width, fsblocks	*/
> +	__s32		version;	/* structure version		*/
> +	__u32		flags;		/* superblock version flags	*/
> +	__u32		logsectsize;	/* log sector size, bytes	*/
> +	__u32		rtsectsize;	/* realtime sector size, bytes	*/
> +	__u32		dirblocksize;	/* directory block size, bytes	*/
> +	__u32		logsunit;	/* log stripe unit, bytes */
> +	__u32		utf8version;	/* Unicode version		*/
> +} xfs_fsop_geom_t;

New structure definition, not a redefinition of the old name, please.
Drop the typedef, and the structure needs to be 64 bit size
aligned so we don't get problems with 32 bit userspace on 64 bit
kernels (e.g. we have a v1 compat ioctl handler because of this
issue).

Further, let's avoid needing to rev the ioctl again in future by
adding a bunch of "must be zero" padding to the new structure so we
can extend the information we push to userspace easily. i.e. padding
only becomes meaningful when the superblock flag that exposes
meaning is set. i.e. userspace can do this to conditionally access
the utf8version value if it is meaningful:

	utf8_ver = 0;
	if (geo.flags & XFS_FSOP_GEOM_FLAGS_UTF8)
		utf8_ver = geo.utf8version;

i.e. let's make the new structure forwards compatible with new
features...
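
One possible shape for such a structure, as a sketch only. `struct geom_v3`, `GEOM_FLAG_UTF8`, the field order and the amount of reserved space are all assumptions for illustration and not the layout that was eventually merged. Explicit 32-bit padding ahead of the 64-bit members keeps the layout identical for 32-bit and 64-bit userspace.

```c
#include <assert.h>
#include <stdint.h>

#define GEOM_FLAG_UTF8	(1u << 0)	/* hypothetical flag bit */

/* Hypothetical layout, not the real XFS structure. */
struct geom_v3 {
	uint32_t	blocksize;
	uint32_t	flags;		/* feature flags gate the fields below */
	uint32_t	utf8version;	/* valid only if GEOM_FLAG_UTF8 is set */
	uint32_t	pad32;		/* keeps the next member 8-byte aligned */
	uint64_t	reserved[16];	/* must be zero until given a meaning */
};

/* Size must be a multiple of 8 so 32-bit and 64-bit callers agree. */
_Static_assert(sizeof(struct geom_v3) % 8 == 0,
	       "geom_v3 must be 64-bit size aligned");

/* Read the Unicode version only when the flag says it is meaningful. */
static uint32_t geom_utf8_version(const struct geom_v3 *geo)
{
	if (geo->flags & GEOM_FLAG_UTF8)
		return geo->utf8version;
	return 0;
}
```

A later feature would claim part of `reserved[]` and a new flag bit, leaving the structure size and the ioctl number unchanged.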

> @@ -115,6 +117,15 @@ xfs_fs_geometry(
>  				XFS_FSOP_GEOM_FLAGS_LOGV2 : 0);
>  		geo->logsunit = mp->m_sb.sb_logsunit;
>  	}
> +	if (new_version >= XFS_FSOP_GEOM_VERSION5) {
> +		geo->version = XFS_FSOP_GEOM_VERSION5;
> +		geo->flags |= (xfs_sb_version_hasutf8(&mp->m_sb) ?
> +				XFS_FSOP_GEOM_FLAGS_UTF8 : 0);
> +		geo->utf8version = mp->m_sb.sb_utf8version;
> +		if (bytes)
> +			*bytes = sizeof(xfs_fsop_geom_v2_t) +
> +				 sizeof(geo->utf8version);

Further indication that the *bytes variable should die.

> +	}
>  	return 0;
>  }
>  
> diff --git a/fs/xfs/xfs_fsops.h b/fs/xfs/xfs_fsops.h
> index 74e1fee..b723f36 100644
> --- a/fs/xfs/xfs_fsops.h
> +++ b/fs/xfs/xfs_fsops.h
> @@ -18,7 +18,7 @@
>  #ifndef __XFS_FSOPS_H__
>  #define	__XFS_FSOPS_H__
>  
> -extern int xfs_fs_geometry(xfs_mount_t *mp, xfs_fsop_geom_v2_t *geo,
> +extern int xfs_fs_geometry(xfs_mount_t *mp, void *buffer,
>  		int new_version, size_t *bytes);
>  extern int xfs_growfs_data(xfs_mount_t *mp, xfs_growfs_data_t *in);
>  extern int xfs_growfs_log(xfs_mount_t *mp, xfs_growfs_log_t *in);
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index 1657ce5..6308680 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -859,6 +859,34 @@ xfs_ioc_fsgeometry_v2(
>  	return 0;
>  }
>  
> +STATIC int
> +xfs_ioc_fsgeometry(
> +	struct xfs_mount	*mp,
> +	void			__user *arg)
> +{
> +	xfs_fsop_geom_t		fsgeo;
> +	int			version, error;
> +	size_t			bytes;
> +
> +	/* offsetof(version)? XXX just get 32 bits? */
> +	if (copy_from_user(&fsgeo, arg, sizeof(xfs_fsop_geom_v1_t)))
> +		return -EFAULT;

It's best to copy in the entire structure rather than play offset
games.

> +	version = fsgeo.version;
> +
> +	if (version < XFS_FSOP_GEOM_VERSION5)
> +		return -EINVAL;

Here it rejects anything that is not a v3 structure aware of the
unicode extensions, which means it breaks any old recompiled
application that hasn't been updated to support
XFS_FSOP_GEOM_VERSION5 despite the fact that they will compile
against headers with the new definition without warnings or errors.

> +
> +	memset(&fsgeo, 0, sizeof(fsgeo));
> +	error = xfs_fs_geometry(mp, &fsgeo, version, &bytes);
> +	if (error)
> +		return error;
> +
> +	if (copy_to_user(arg, &fsgeo, bytes))
> +		return -EFAULT;

and you can use sizeof(struct xfs_fs_geom_v3) here instead of bytes.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 09/16] xfs: add a superblock feature bit to indicate UTF-8 support.
  2014-10-03 21:59 ` [PATCH 09/16] xfs: add a superblock feature bit to indicate UTF-8 support Ben Myers
@ 2014-10-06 21:25   ` Dave Chinner
  2014-10-09 15:26     ` Ben Myers
  0 siblings, 1 reply; 63+ messages in thread
From: Dave Chinner @ 2014-10-06 21:25 UTC (permalink / raw)
  To: Ben Myers; +Cc: linux-fsdevel, olaf, xfs

On Fri, Oct 03, 2014 at 04:59:46PM -0500, Ben Myers wrote:
> From: Olaf Weber <olaf@sgi.com>
> 
> When UTF-8 support is enabled, the xfs_dir_ci_inode_operations must be
> installed. Add xfs_sb_version_hasci(), which tests both the borgbit and
> the utf8bit, and returns true if at least one of them is set. Replace
> calls to xfs_sb_version_hasasciici() as needed.
> 
> Signed-off-by: Olaf Weber <olaf@sgi.com>
> ---
>  fs/xfs/libxfs/xfs_sb.h | 24 +++++++++++++++++++++++-
>  fs/xfs/xfs_fs.h        |  1 +
>  fs/xfs/xfs_fsops.c     |  4 +++-
>  fs/xfs/xfs_iops.c      |  4 ++--
>  4 files changed, 29 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_sb.h b/fs/xfs/libxfs/xfs_sb.h
> index 2e73970..525eacb 100644
> --- a/fs/xfs/libxfs/xfs_sb.h
> +++ b/fs/xfs/libxfs/xfs_sb.h
> @@ -70,6 +70,7 @@ struct xfs_trans;
>  #define XFS_SB_VERSION2_RESERVED4BIT	0x00000004
>  #define XFS_SB_VERSION2_ATTR2BIT	0x00000008	/* Inline attr rework */
>  #define XFS_SB_VERSION2_PARENTBIT	0x00000010	/* parent pointers */
> +#define XFS_SB_VERSION2_UTF8BIT		0x00000020      /* utf8 names */
>  #define XFS_SB_VERSION2_PROJID32BIT	0x00000080	/* 32 bit project id */

Can you explain why this bit is safe to use? I don't recall why
XFS_SB_VERSION2_PROJID32BIT skipped several bits because there
aren't any comments explaining why that value was chosen. Adding a
comment about the 0x00000040 bit at the same time would be useful.

>  #define XFS_SB_VERSION2_CRCBIT		0x00000100	/* metadata CRCs */
>  #define XFS_SB_VERSION2_FTYPE		0x00000200	/* inode type in dir */
> @@ -77,6 +78,7 @@ struct xfs_trans;
>  #define	XFS_SB_VERSION2_OKBITS		\
>  	(XFS_SB_VERSION2_LAZYSBCOUNTBIT	| \
>  	 XFS_SB_VERSION2_ATTR2BIT	| \
> +	 XFS_SB_VERSION2_UTF8BIT	| \
>  	 XFS_SB_VERSION2_PROJID32BIT	| \
>  	 XFS_SB_VERSION2_FTYPE)
>  
> @@ -509,8 +511,10 @@ xfs_sb_has_ro_compat_feature(
>  }
>  
>  #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
> +#define XFS_SB_FEAT_INCOMPAT_UTF8	(1 << 1)	/* utf-8 name support */
>  #define XFS_SB_FEAT_INCOMPAT_ALL \
> -		(XFS_SB_FEAT_INCOMPAT_FTYPE)
> +		(XFS_SB_FEAT_INCOMPAT_FTYPE | \
> +		 XFS_SB_FEAT_INCOMPAT_UTF8)

Don't add support to the filesystem until all the supporting
code is in place. This avoids git bisects landing on commits in the
series where the filesystem says it supports the feature bit but it
doesn't actually work. Add a patch at the end of the series that
adds these bits to the feature masks.

>  
>  #define XFS_SB_FEAT_INCOMPAT_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_ALL
>  static inline bool
> @@ -558,6 +562,24 @@ static inline int xfs_sb_version_hasfinobt(xfs_sb_t *sbp)
>  		(sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_FINOBT);
>  }
>  
> +static inline int xfs_sb_version_hasutf8(xfs_sb_t *sbp)

bool, no typedefs.

> +{
> +	return (XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5 &&
> +		xfs_sb_has_incompat_feature(sbp, XFS_SB_FEAT_INCOMPAT_UTF8)) ||
> +		(xfs_sb_version_hasmorebits(sbp) &&
> +		(sbp->sb_features2 & XFS_SB_VERSION2_UTF8BIT));

xfs_sb_version_hasmorebits() already checks for XFS_SB_VERSION_5,
so this could be:

	return xfs_sb_version_hasmorebits(sbp) &&
		(xfs_sb_has_incompat_feature(sbp, XFS_SB_FEAT_INCOMPAT_UTF8) ||
		 (sbp->sb_features2 & XFS_SB_VERSION2_UTF8BIT));


> +}
> +
> +/*
> + * Special case: there are a number of places where we need to test
> + * both the borgbit and the utf8bit, and take the same action if
> + * either of those is set.
> + */
> +static inline int xfs_sb_version_hasci(xfs_sb_t *sbp)
> +{

bool, no typedefs, and probably should be a separate patch.


-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 10/16] xfs: store utf8version in the superblock
  2014-10-03 22:00 ` [PATCH 10/16] xfs: store utf8version in the superblock Ben Myers
@ 2014-10-06 21:53     ` Dave Chinner
  0 siblings, 0 replies; 63+ messages in thread
From: Dave Chinner @ 2014-10-06 21:53 UTC (permalink / raw)
  To: Ben Myers; +Cc: linux-fsdevel, olaf, xfs

On Fri, Oct 03, 2014 at 05:00:31PM -0500, Ben Myers wrote:
> From: Ben Myers <bpm@sgi.com>
> 
> The utf8 version a filesystem was created with needs to be stored in
> order that normalizations will remain stable over the lifetime of the
> filesystem.  Convert sb_pad to sb_utf8version in the super block.  This
> also adds checks at mount time to see whether the unicode normalization
> module has support for the version of unicode that the filesystem
> requires.  If not we fail the mount.
> 
> Signed-off-by: Ben Myers <bpm@sgi.com>
> ---
>  fs/xfs/libxfs/xfs_dir2.c | 28 ++++++++++++++++---
>  fs/xfs/libxfs/xfs_sb.c   |  4 +--
>  fs/xfs/libxfs/xfs_sb.h   | 10 ++++---
>  fs/xfs/libxfs/xfs_utf8.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/libxfs/xfs_utf8.h | 24 +++++++++++++++++
>  5 files changed, 126 insertions(+), 10 deletions(-)
>  create mode 100644 fs/xfs/libxfs/xfs_utf8.c
>  create mode 100644 fs/xfs/libxfs/xfs_utf8.h
> 
> diff --git a/fs/xfs/libxfs/xfs_dir2.c b/fs/xfs/libxfs/xfs_dir2.c
> index 4eb0973..2c89211 100644
> --- a/fs/xfs/libxfs/xfs_dir2.c
> +++ b/fs/xfs/libxfs/xfs_dir2.c
> @@ -157,10 +157,30 @@ xfs_da_mount(
>  				(uint)sizeof(xfs_da_node_entry_t);
>  	dageo->magicpct = (dageo->blksize * 37) / 100;
>  
> -	if (xfs_sb_version_hasasciici(&mp->m_sb))
> -		mp->m_dirnameops = &xfs_ascii_ci_nameops;
> -	else
> -		mp->m_dirnameops = &xfs_default_nameops;
> +	if (xfs_sb_version_hasutf8(&mp->m_sb)) {
> +#ifdef CONFIG_XFS_UTF8
> +		if (!xfs_utf8_version_ok(mp))
> +			return -ENOSYS;
> +
> +		/* XXX these are replaced in the next patch need
> +		   to do some kind of reordering here */
> +		if (xfs_sb_version_hasasciici(&mp->m_sb))
> +			mp->m_dirnameops = &xfs_ascii_ci_nameops;
> +		else
> +			mp->m_dirnameops = &xfs_default_nameops;
> +#else
> +		xfs_warn(mp,
> +"Recompile XFS with CONFIG_XFS_UTF8 to mount this filesystem");
> +		kmem_free(mp->m_dir_geo);
> +		kmem_free(mp->m_attr_geo);
> +		return -ENOSYS;
> +#endif

This config check doesn't belong here. Validation of superblock
fields vs the current config goes in the superblock verifier. I also
think that indication of UTF8 support being compiled in needs to go
in the XFS_VERSION_STRING, not have ifdef hackery added into the
code.

i.e. the mount should fail very early on with a superblock
verification failure from xfs_mount_validate_sb().
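
A simplified sketch of doing that check once, at superblock validation time, instead of inside directory setup. `struct sb`, `SB_FEAT_UTF8`, `validate_sb()` and the `max_supported` comparison are hypothetical; the series itself asks the normalization module whether a version is supported rather than comparing against a maximum, and the -ENOSYS value is simply carried over from the quoted patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define SB_FEAT_UTF8	(1u << 1)	/* hypothetical incompat feature bit */

/* Hypothetical, heavily reduced in-memory superblock. */
struct sb {
	uint32_t	features_incompat;
	uint32_t	utf8version;
};

/*
 * Returns 0 if the superblock can be mounted by this build, or a
 * negative errno. Running this before any other mount work means
 * nothing has been allocated yet when the mount is refused.
 */
static int validate_sb(const struct sb *sb, bool utf8_built_in,
		       uint32_t max_supported)
{
	if (!(sb->features_incompat & SB_FEAT_UTF8))
		return 0;		/* feature not in use */
	if (!utf8_built_in)
		return -ENOSYS;		/* support not compiled in */
	if (sb->utf8version > max_supported)
		return -ENOSYS;		/* Unicode version too new */
	return 0;
}
```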


> +	} else {
> +		if (xfs_sb_version_hasasciici(&mp->m_sb))
> +			mp->m_dirnameops = &xfs_ascii_ci_nameops;
> +		else
> +			mp->m_dirnameops = &xfs_default_nameops;
> +	}
>  
>  	return 0;
>  }
> diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
> index ad525a5..1ee7d33 100644
> --- a/fs/xfs/libxfs/xfs_sb.c
> +++ b/fs/xfs/libxfs/xfs_sb.c
> @@ -99,7 +99,7 @@ static const struct {
>  	{ offsetof(xfs_sb_t, sb_features_incompat),	0 },
>  	{ offsetof(xfs_sb_t, sb_features_log_incompat),	0 },
>  	{ offsetof(xfs_sb_t, sb_crc),		0 },
> -	{ offsetof(xfs_sb_t, sb_pad),		0 },
> +	{ offsetof(xfs_sb_t, sb_utf8version),	0 },
>  	{ offsetof(xfs_sb_t, sb_pquotino),	0 },
>  	{ offsetof(xfs_sb_t, sb_lsn),		0 },
>  	{ sizeof(xfs_sb_t),			0 }
> @@ -443,7 +443,7 @@ __xfs_sb_from_disk(
>  	to->sb_features_incompat = be32_to_cpu(from->sb_features_incompat);
>  	to->sb_features_log_incompat =
>  				be32_to_cpu(from->sb_features_log_incompat);
> -	to->sb_pad = 0;
> +	to->sb_utf8version = be32_to_cpu(from->sb_utf8version);
>  	to->sb_pquotino = be64_to_cpu(from->sb_pquotino);
>  	to->sb_lsn = be64_to_cpu(from->sb_lsn);
>  	/* Convert on-disk flags to in-memory flags? */
> diff --git a/fs/xfs/libxfs/xfs_sb.h b/fs/xfs/libxfs/xfs_sb.h
> index 525eacb..dc7b6c6 100644
> --- a/fs/xfs/libxfs/xfs_sb.h
> +++ b/fs/xfs/libxfs/xfs_sb.h
> @@ -159,7 +159,7 @@ typedef struct xfs_sb {
>  	__uint32_t	sb_features_log_incompat;
>  
>  	__uint32_t	sb_crc;		/* superblock crc */
> -	__uint32_t	sb_pad;
> +	__uint32_t	sb_utf8version;	/* unicode version */
>  
>  	xfs_ino_t	sb_pquotino;	/* project quota inode */
>  	xfs_lsn_t	sb_lsn;		/* last write sequence */
> @@ -245,7 +245,7 @@ typedef struct xfs_dsb {
>  	__be32		sb_features_log_incompat;
>  
>  	__le32		sb_crc;		/* superblock crc */
> -	__be32		sb_pad;
> +	__be32		sb_utf8version;	/* version of unicode */
>  
>  	__be64		sb_pquotino;	/* project quota inode */
>  	__be64		sb_lsn;		/* last write sequence */
> @@ -271,7 +271,7 @@ typedef enum {
>  	XFS_SBS_LOGSECTLOG, XFS_SBS_LOGSECTSIZE, XFS_SBS_LOGSUNIT,
>  	XFS_SBS_FEATURES2, XFS_SBS_BAD_FEATURES2, XFS_SBS_FEATURES_COMPAT,
>  	XFS_SBS_FEATURES_RO_COMPAT, XFS_SBS_FEATURES_INCOMPAT,
> -	XFS_SBS_FEATURES_LOG_INCOMPAT, XFS_SBS_CRC, XFS_SBS_PAD,
> +	XFS_SBS_FEATURES_LOG_INCOMPAT, XFS_SBS_CRC, XFS_SBS_UTF8VERSION,
>  	XFS_SBS_PQUOTINO, XFS_SBS_LSN,
>  	XFS_SBS_FIELDCOUNT
>  } xfs_sb_field_t;
> @@ -303,6 +303,7 @@ typedef enum {
>  #define XFS_SB_FEATURES_INCOMPAT XFS_SB_MVAL(FEATURES_INCOMPAT)
>  #define XFS_SB_FEATURES_LOG_INCOMPAT XFS_SB_MVAL(FEATURES_LOG_INCOMPAT)
>  #define XFS_SB_CRC		XFS_SB_MVAL(CRC)
> +#define XFS_SB_UTF8VERSION	XFS_SB_MVAL(UTF8VERSION)
>  #define XFS_SB_PQUOTINO		XFS_SB_MVAL(PQUOTINO)
>  #define	XFS_SB_NUM_BITS		((int)XFS_SBS_FIELDCOUNT)
>  #define	XFS_SB_ALL_BITS		((1LL << XFS_SB_NUM_BITS) - 1)
> @@ -313,7 +314,8 @@ typedef enum {
>  	 XFS_SB_ICOUNT | XFS_SB_IFREE | XFS_SB_FDBLOCKS | XFS_SB_FEATURES2 | \
>  	 XFS_SB_BAD_FEATURES2 | XFS_SB_FEATURES_COMPAT | \
>  	 XFS_SB_FEATURES_RO_COMPAT | XFS_SB_FEATURES_INCOMPAT | \
> -	 XFS_SB_FEATURES_LOG_INCOMPAT | XFS_SB_PQUOTINO)
> +	 XFS_SB_FEATURES_LOG_INCOMPAT | XFS_SB_UTF8VERSION | \
> +	 XFS_SB_PQUOTINO)

We should never be modifying the utf8 version number from the
kernel, so this shouldn't be set in the XFS_SB_MOD_BITS mask.

> diff --git a/fs/xfs/libxfs/xfs_utf8.c b/fs/xfs/libxfs/xfs_utf8.c
> new file mode 100644
> index 0000000..7e63111
> --- /dev/null
> +++ b/fs/xfs/libxfs/xfs_utf8.c
> @@ -0,0 +1,70 @@
> +/*
> + * Copyright (c) 2014 SGI.
> + * All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it would be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write the Free Software Foundation,
> + * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> + */
> +
> +#include "xfs.h"
> +#include "xfs_fs.h"
> +#include "xfs_types.h"
> +#include "xfs_bit.h"
> +#include "xfs_log_format.h"
> +#include "xfs_inum.h"
> +#include "xfs_trans.h"
> +#include "xfs_trans_resv.h"
> +#include "xfs_sb.h"
> +#include "xfs_ag.h"
> +#include "xfs_da_format.h"
> +#include "xfs_da_btree.h"
> +#include "xfs_dir2.h"
> +#include "xfs_mount.h"
> +#include "xfs_da_btree.h"
> +#include "xfs_format.h"
> +#include "xfs_bmap_btree.h"
> +#include "xfs_alloc_btree.h"
> +#include "xfs_dinode.h"
> +#include "xfs_inode.h"
> +#include "xfs_inode_item.h"
> +#include "xfs_bmap.h"
> +#include "xfs_error.h"
> +#include "xfs_trace.h"
> +#include "xfs_utf8.h"

This may sound pedantic, but in all the libxfs rework I've managed
to standardise the initial include file order to be roughly:

#include "xfs.h"
#include "xfs_fs.h"
#include "xfs_shared.h"
#include "xfs_format.h"
#include "xfs_log_format.h"
#include "xfs_trans_resv.h"
#include "xfs_bit.h"
#include "xfs_inum.h"
#include "xfs_sb.h"
#include "xfs_ag.h"
#include "xfs_mount.h"
#include "xfs_da_format.h"

i.e. include all the type, shared and on-disk format information
first. Can you re-order these to follow the same convention?

> +#include <linux/utf8norm.h>

And that should end up being included from fs/xfs/xfs_linux.h,
because we can't include things directly from the linux kernel
headers in fs/xfs/libxfs/ files.

> +
> +int

Bool.

> +xfs_utf8_version_ok(
> +	struct xfs_mount	*mp)
> +{
> +	int	major, minor, revision;
> +
> +	if (utf8version_is_supported(mp->m_sb.sb_utf8version))
> +		return 1;

return true;
> +
> +	major = mp->m_sb.sb_utf8version >> UNICODE_MAJ_SHIFT;
> +	minor = (mp->m_sb.sb_utf8version & 0xff00) >> UNICODE_MIN_SHIFT;
> +	revision = mp->m_sb.sb_utf8version & 0xff;
> +
> +	if (revision) {
> +		xfs_warn(mp,
> +		"Unicode version %d.%d.%d not supported by utf8norm.ko",
> +		major, minor, revision);
> +	} else {
> +		xfs_warn(mp,
> +		"Unicode version %d.%d not supported by utf8norm.ko",
> +		major, minor);
> +	}

Why do you need two different print statements? Version X.Y.0 is
pretty much recognisable as being the same as X.Y....
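
For illustration, a single formatting path could look like the sketch below. The shift values 16 and 8 are assumptions inferred from the 0xff00 and 0xff masks in the quoted code, since the definitions of UNICODE_MAJ_SHIFT and UNICODE_MIN_SHIFT are not shown here, and `format_unicode_version()` is a hypothetical helper.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed values; the real definitions are not shown in this thread. */
#define UNICODE_MAJ_SHIFT	16
#define UNICODE_MIN_SHIFT	8

/*
 * Always print major.minor.revision, so 7.0 is shown as "7.0.0".
 * Returns what snprintf() returns: the length the full string needs.
 */
static int format_unicode_version(uint32_t version, char *buf, size_t len)
{
	unsigned int major = version >> UNICODE_MAJ_SHIFT;
	unsigned int minor = (version & 0xff00) >> UNICODE_MIN_SHIFT;
	unsigned int revision = version & 0xff;

	return snprintf(buf, len, "%u.%u.%u", major, minor, revision);
}
```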

> +
> +	return 0;

return false;

> +}
> diff --git a/fs/xfs/libxfs/xfs_utf8.h b/fs/xfs/libxfs/xfs_utf8.h
> new file mode 100644
> index 0000000..8a700de
> --- /dev/null
> +++ b/fs/xfs/libxfs/xfs_utf8.h
> @@ -0,0 +1,24 @@
> +/*
> + * Copyright (c) 2014 SGI.
> + * All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it would be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write the Free Software Foundation,
> + * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> + */
> +
> +#ifndef XFS_UTF8_H
> +#define XFS_UTF8_H
> +
> +extern int xfs_utf8_version_ok(struct xfs_mount *);
> +
> +#endif /* XFS_UTF8_H */

Do we really need a separate header file for this?
fs/xfs/libxfs/xfs_shared.h was created for such one-off or
limited definitions that need to be shared with userspace...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 11/16] xfs: add xfs_nameops for utf8 and utf8+casefold.
  2014-10-03 22:01 ` [PATCH 11/16] xfs: add xfs_nameops for utf8 and utf8+casefold Ben Myers
@ 2014-10-06 22:10     ` Dave Chinner
  0 siblings, 0 replies; 63+ messages in thread
From: Dave Chinner @ 2014-10-06 22:10 UTC (permalink / raw)
  To: Ben Myers; +Cc: linux-fsdevel, olaf, xfs

On Fri, Oct 03, 2014 at 05:01:18PM -0500, Ben Myers wrote:
> From: Olaf Weber <olaf@sgi.com>
> 
> The xfs_utf8_nameops use the nfkdi normalization when comparing filenames,
> and are installed if the utf8bit is set in the super block.
> 
> The xfs_utf8_ci_nameops use the nfkdicf normalization when comparing
> filenames, and are installed if both the utf8bit and the borgbit are set
> in the superblock.
> 
> Normalized filenames are not stored on disk. Normalization will fail if a
> filename is not valid UTF-8, in which case the filename is treated as an
> opaque blob.
> 
> Signed-off-by: Olaf Weber <olaf@sgi.com>
> 
> ---
> [v2: updated to use utf8norm.ko module;
>      compiled conditionally on CONFIG_XFS_UTF8=y;
>      utf8version is now a function;
>      move xfs_utf8.[ch] into libxfs. --bpm]
> [v3: pass utf8version from the superblock through xfs_nameops
>      instead of the max version of the normalization module. --bpm]
> ---
>  fs/xfs/Kconfig           |   9 ++
>  fs/xfs/Makefile          |   2 +
>  fs/xfs/libxfs/xfs_dir2.c |   4 +-
>  fs/xfs/libxfs/xfs_utf8.c | 208 +++++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/libxfs/xfs_utf8.h |   3 +
>  fs/xfs/xfs_iops.c        |   2 +-
>  6 files changed, 225 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
> index 5d47b4d..1e8a463 100644
> --- a/fs/xfs/Kconfig
> +++ b/fs/xfs/Kconfig
> @@ -95,3 +95,12 @@ config XFS_DEBUG
>  	  not useful unless you are debugging a particular problem.
>  
>  	  Say N unless you are an XFS developer, or you play one on TV.
> +
> +config XFS_UTF8
> +	bool "XFS UTF-8 support"
> +	depends on XFS_FS
> +	select CONFIG_UTF8_NORMALIZATION
> +	help
> +	  Say Y here to enable utf8 normalization support in XFS.  You
> +	  will be able to mount and use filesystems created with the
> +	  utf8 mkfs.xfs option.

"created with UTF8 support enabled."

> diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
> index d617999..192aaca 100644
> --- a/fs/xfs/Makefile
> +++ b/fs/xfs/Makefile
> @@ -114,6 +114,8 @@ xfs-$(CONFIG_XFS_QUOTA)		+= xfs_dquot.o \
>  				   xfs_qm.o \
>  				   xfs_quotaops.o
>  
> +xfs-$(CONFIG_XFS_UTF8)		+= libxfs/xfs_utf8.o
> +

libxfs definitions come first. Also, please use the same prefixing
syntax that the other libxfs rules use.

>  # xfs_rtbitmap is shared with libxfs
>  xfs-$(CONFIG_XFS_RT)		+= xfs_rtalloc.o
>  
> diff --git a/fs/xfs/libxfs/xfs_dir2.c b/fs/xfs/libxfs/xfs_dir2.c
> index 2c89211..9cfbd6b 100644
> --- a/fs/xfs/libxfs/xfs_dir2.c
> +++ b/fs/xfs/libxfs/xfs_dir2.c
> @@ -165,9 +165,9 @@ xfs_da_mount(
>  		/* XXX these are replaced in the next patch need
>  		   to do some kind of reordering here */
>  		if (xfs_sb_version_hasasciici(&mp->m_sb))
> -			mp->m_dirnameops = &xfs_ascii_ci_nameops;
> +			mp->m_dirnameops = &xfs_utf8_ci_nameops;
>  		else
> -			mp->m_dirnameops = &xfs_default_nameops;
> +			mp->m_dirnameops = &xfs_utf8_nameops;
>  #else

xfs_sb_version_hasasciici()? The overloading of the asciici bit is
still used for the utf8 CI functionality? Please fix this for the
next version of the patchset.

>  		xfs_warn(mp,
>  "Recompile XFS with CONFIG_XFS_UTF8 to mount this filesystem");
> diff --git a/fs/xfs/libxfs/xfs_utf8.c b/fs/xfs/libxfs/xfs_utf8.c
> index 7e63111..1e75299 100644
> --- a/fs/xfs/libxfs/xfs_utf8.c
> +++ b/fs/xfs/libxfs/xfs_utf8.c
> @@ -68,3 +68,211 @@ xfs_utf8_version_ok(
>  
>  	return 0;
>  }
> +
> +/*
> + * xfs nameops using nfkdi
> + */

Remind me again what nfkdi means? If I can't remember the details
after a week or two, then perhaps better explanatory comments are
needed in the code?

> +static xfs_dahash_t
> +xfs_utf8_hashname(
> +	const unsigned char *name,
> +	int len,
> +	unsigned int sb_utf8version)

Please use the same indentation levels for the declarations, i.e.

	const unsigned char	*name,
	int			len,
	unsigned int		sb_utf8version)

Can you go through all the XFS code and make sure this is done?

> +{
> +	utf8data_t	nfkdi;
> +	struct utf8cursor u8c;
> +	xfs_dahash_t	hash;
> +	int		val;

And these should line up, too.

> +
> +	nfkdi = utf8nfkdi(sb_utf8version);
> +	hash = 0;

Initialise at declaration.

> +	if (utf8ncursor(&u8c, nfkdi, name, len) < 0)
> +		goto blob;

Still has the "invalid binary blob" issue.

> +	while ((val = utf8byte(&u8c)) > 0)
> +		hash = val ^ rol32(hash, 7);
> +	/* In case of error treat the name as a binary blob. */
> +	if (val == 0)
> +		return hash;
> +blob:
> +	return xfs_da_hashname(name, len);
> +}
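
For reference, the hash loop quoted above can be sketched in userspace
as follows. This is an illustration under stated assumptions, not the
patch's code: rol32() is reimplemented locally, rolling_hash() is a
hypothetical name, and the bytes are taken straight from the buffer
rather than from utf8byte(), so no normalization happens here:

```c
#include <stddef.h>
#include <stdint.h>

/* Rotate a 32-bit word left by 'shift' bits (0 < shift < 32). */
static uint32_t
rol32(
	uint32_t		word,
	unsigned int		shift)
{
	return (word << shift) | (word >> (32 - shift));
}

/*
 * Fold each byte into the hash, as the quoted loop does with the
 * bytes returned by utf8byte().  Hypothetical helper: it hashes the
 * raw bytes of the buffer and performs no normalization.
 */
static uint32_t
rolling_hash(
	const unsigned char	*name,
	size_t			len)
{
	uint32_t		hash = 0;
	size_t			i;

	for (i = 0; i < len; i++)
		hash = name[i] ^ rol32(hash, 7);
	return hash;
}
```

When the loop is fed normalized bytes instead, any two names that
normalize to the same byte sequence produce the same hash, which is
what lets a lookup find an entry stored under a different but
equivalent spelling.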
> +
> +static int
> +xfs_utf8_normhash(

More comments needed explaining what is going on.

> +	struct xfs_da_args *args)
> +{
> +	utf8data_t	nfkdi;
> +	struct utf8cursor u8c;
> +	unsigned char	*norm;
> +	ssize_t		normlen;
> +	int		c;
> +	unsigned int	sb_utf8version =
> +		args->dp->i_mount->m_sb.sb_utf8version;

Urk. Initialise on a separate line.

> +
> +	nfkdi = utf8nfkdi(sb_utf8version);
> +	/* Failure to normalize is treated as a blob. */
> +	if ((normlen = utf8nlen(nfkdi, args->name, args->namelen)) < 0)
> +		goto blob;

No assignments in logic statements, please.

	normlen = utf8nlen(nfkdi, args->name, args->namelen);
	if (normlen < 0)

This is all through the code - can you please go through and fix up
all the patches to remove this pattern? checkpatch might be helpful
here....

As it is, this still has the invalid binary blob issue.


> +	if (utf8ncursor(&u8c, nfkdi, args->name, args->namelen) < 0)
> +		goto blob;
> +	if (!(norm = kmem_alloc(normlen + 1, KM_NOFS|KM_MAYFAIL)))
> +		return -ENOMEM;

Urk.

So, what happens if this memory allocation fails in the middle of a
create transaction?

(Hint: transaction is dirty at this point in time)
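
One way to avoid that situation, sketched under stated assumptions, is
to do the fallible allocation before anything is dirtied, so that
-ENOMEM can be returned while the operation can still be backed out
cleanly. struct txn, prepare_norm() and txn_create() are hypothetical
stand-ins for illustration, not XFS interfaces, and the "normalized"
name here is just a copy of the input:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a transaction; not the XFS structure. */
struct txn {
	bool	dirty;
};

/*
 * Allocate a buffer and copy the name into it (a placeholder for
 * real normalization).  A NULL return means the allocation failed.
 */
static unsigned char *
prepare_norm(
	const unsigned char	*name,
	size_t			len)
{
	unsigned char		*norm;

	norm = malloc(len + 1);
	if (!norm)
		return NULL;
	memcpy(norm, name, len);
	norm[len] = '\0';
	return norm;
}

/*
 * Do all fallible setup first; only then dirty the transaction.
 * On allocation failure the transaction is still clean.
 */
static int
txn_create(
	struct txn		*tp,
	const unsigned char	*name,
	size_t			len)
{
	unsigned char		*norm;

	norm = prepare_norm(name, len);
	if (!norm)
		return -ENOMEM;		/* tp->dirty is still false */

	tp->dirty = true;		/* modifications start here */
	/* ... norm would be used for the directory insert here ... */
	free(norm);
	return 0;
}
```

The ordering is the only point of the sketch: once the allocation has
succeeded, nothing after the transaction is dirtied can fail for lack
of memory.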

The rest of the code in this patch has similar issues to what I've
already pointed out.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 08/16] xfs: change interface of xfs_nameops.hashname
  2014-10-03 21:58 ` [PATCH 08/16] xfs: change interface of xfs_nameops.hashname Ben Myers
@ 2014-10-06 22:17     ` Dave Chinner
  0 siblings, 0 replies; 63+ messages in thread
From: Dave Chinner @ 2014-10-06 22:17 UTC (permalink / raw)
  To: Ben Myers; +Cc: linux-fsdevel, olaf, xfs

On Fri, Oct 03, 2014 at 04:58:44PM -0500, Ben Myers wrote:
> From: Olaf Weber <olaf@sgi.com>
> 
> With the introduction of the xfs_nameops.normhash callout, all uses of the
> hashname callout now occur in places where an xfs_name structure must be
> explicitly created just to match the parameter passing convention of this
> callout. Change the arguments to a const unsigned char * and int instead.
> 
> Signed-off-by: Olaf Weber <olaf@sgi.com>
> 
> [v2: pass a 3rd argument for sb_utf8version to hashname.  --bpm]

So now that I've looked at most of the rest of the patch set, I think
this is the wrong thing to do. I see no reason apart from "it's less
typing" to drop the use of the xfs_name structure, and dropping it
removes a key piece of documentation from the code, i.e. that
name/namelen form an inseparable tuple. Indeed, lots of
the utf8 xfs code declares norm/normlen tuples on the stack for
temporary use, so really this comes down to a matter of taste.

And in that matter, I'd prefer that we keep the existing name
abstraction and propagate it into the new code rather than the other
way around.
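
To illustrate the point about keeping the pair together, here is a
minimal sketch. struct name_ref is a hypothetical type modelled loosely
on struct xfs_name, and name_hash() uses a placeholder hash; neither is
the XFS definition:

```c
#include <stdint.h>

/* Hypothetical pair type, modelled loosely on struct xfs_name. */
struct name_ref {
	const unsigned char	*name;
	int			len;
};

/*
 * Callers cannot pass a pointer without its matching length.
 * The multiply-and-add hash is a placeholder, not the XFS hash.
 */
static uint32_t
name_hash(
	const struct name_ref	*nr)
{
	uint32_t		hash = 0;
	int			i;

	for (i = 0; i < nr->len; i++)
		hash = hash * 31 + nr->name[i];
	return hash;
}

/* A normalized name can travel as the same pair type. */
static struct name_ref
name_ref_make(
	const unsigned char	*buf,
	int			len)
{
	struct name_ref		nr = { .name = buf, .len = len };

	return nr;
}
```

A norm/normlen pair built with name_ref_make() can then be handed to
the same callouts as an ordinary name, instead of being passed as two
loose arguments.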

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 05/16] xfs: return the first match during case-insensitive lookup.
  2014-10-03 21:55 ` [PATCH 05/16] xfs: return the first match during case-insensitive lookup Ben Myers
@ 2014-10-06 22:19   ` Dave Chinner
  2014-10-09 15:42     ` Ben Myers
  0 siblings, 1 reply; 63+ messages in thread
From: Dave Chinner @ 2014-10-06 22:19 UTC (permalink / raw)
  To: Ben Myers; +Cc: linux-fsdevel, olaf, xfs

On Fri, Oct 03, 2014 at 04:55:42PM -0500, Ben Myers wrote:
> From: Olaf Weber <olaf@sgi.com>
> 
> Change the XFS case-insensitive lookup code to return the first match
> found, even if it is not an exact match. Whether a filesystem uses
> case-insensitive lookups is determined by a superblock bit set during
> filesystem creation.  This means that normal use cannot create two files
> that both match the same filename.
> 
> Signed-off-by: Olaf Weber <olaf@sgi.com>

This is really dependent on whether we want to support mixed
CI/non-CI filesystems, yes? i.e. if we want to support mixed case
setups, then we need to keep the code as it stands? What is the
downside of keeping the code unchanged and our options open?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 13/16] xfs: implement demand load of utf8norm.ko
  2014-10-04  7:16     ` Christoph Hellwig
  (?)
@ 2014-10-09 15:19     ` Ben Myers
  -1 siblings, 0 replies; 63+ messages in thread
From: Ben Myers @ 2014-10-09 15:19 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, olaf, xfs

On Sat, Oct 04, 2014 at 12:16:51AM -0700, Christoph Hellwig wrote:
> > +int
> > +xfs_init_utf8_module(struct xfs_mount	*mp)
> > +{
> > +	request_module("utf8norm");
> > +
> > +	spin_lock(&utf8norm_lock);
> > +	if (utf8norm_initialized) {
> > +		spin_unlock(&utf8norm_lock);
> > +		return 0;
> > +	}
> > +
> > +	utf8version_is_supported_func = symbol_get(utf8version_is_supported);
> > +	if (!utf8version_is_supported_func)
> > +		goto error;
> > +
> > +	utf8nfkdi_func = symbol_get(utf8nfkdi);
> > +	if (!utf8nfkdi_func)
> > +		goto error;
> 
> Please export a structure with function pointers so that we just need
> a single symbol_get call.  I'd have to look up how symbol_get works,
> but unless there's something that speaks against this it might be
> simpler to then just do a symbol_get per mount structure that uses
> utf8 and can point to that structure, so that we don't have to add
> additional reference counting infrastructure around it.

Sure, sounds good.  I believe I've seen the approach you are suggesting
used elsewhere and it works fine.
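
A rough userspace sketch of the suggested shape, assuming one exported
table of function pointers that a single lookup resolves. The structure
layout, the simplified signatures, the toy implementations and
lookup_ops() are all hypothetical; in the kernel the lookup would be
the one symbol_get() call:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical operations table exported by the normalization module. */
struct utf8norm_ops {
	bool	(*version_is_supported)(unsigned int version);
	size_t	(*normalized_len)(const char *name, size_t len);
};

/* Toy implementations standing in for the module's real functions. */
static bool
toy_version_is_supported(
	unsigned int	version)
{
	return version <= 0x070000;
}

static size_t
toy_normalized_len(
	const char	*name,
	size_t		len)
{
	(void)name;
	return len;	/* identity "normalization" */
}

static const struct utf8norm_ops toy_utf8norm_ops = {
	.version_is_supported	= toy_version_is_supported,
	.normalized_len		= toy_normalized_len,
};

/*
 * Stand-in for the single symbol lookup: one name resolves to the
 * whole table, so there is only one reference to take and drop.
 */
static const struct utf8norm_ops *
lookup_ops(
	const char	*symbol)
{
	if (strcmp(symbol, "utf8norm_ops") == 0)
		return &toy_utf8norm_ops;
	return NULL;
}
```

Each mount that needs UTF-8 support would hold one pointer to the
table and call through it, rather than tracking a separate symbol for
every function.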

-Ben

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 09/16] xfs: add a superblock feature bit to indicate UTF-8 support.
  2014-10-06 21:25   ` Dave Chinner
@ 2014-10-09 15:26     ` Ben Myers
  0 siblings, 0 replies; 63+ messages in thread
From: Ben Myers @ 2014-10-09 15:26 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, olaf, xfs

On Tue, Oct 07, 2014 at 08:25:58AM +1100, Dave Chinner wrote:
> On Fri, Oct 03, 2014 at 04:59:46PM -0500, Ben Myers wrote:
> > From: Olaf Weber <olaf@sgi.com>
> > 
> > When UTF-8 support is enabled, the xfs_dir_ci_inode_operations must be
> > installed. Add xfs_sb_version_hasci(), which tests both the borgbit and
> > the utf8bit, and returns true if at least one of them is set. Replace
> > calls to xfs_sb_version_hasasciici() as needed.
> > 
> > Signed-off-by: Olaf Weber <olaf@sgi.com>
> > ---
> >  fs/xfs/libxfs/xfs_sb.h | 24 +++++++++++++++++++++++-
> >  fs/xfs/xfs_fs.h        |  1 +
> >  fs/xfs/xfs_fsops.c     |  4 +++-
> >  fs/xfs/xfs_iops.c      |  4 ++--
> >  4 files changed, 29 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_sb.h b/fs/xfs/libxfs/xfs_sb.h
> > index 2e73970..525eacb 100644
> > --- a/fs/xfs/libxfs/xfs_sb.h
> > +++ b/fs/xfs/libxfs/xfs_sb.h
> > @@ -70,6 +70,7 @@ struct xfs_trans;
> >  #define XFS_SB_VERSION2_RESERVED4BIT	0x00000004
> >  #define XFS_SB_VERSION2_ATTR2BIT	0x00000008	/* Inline attr rework */
> >  #define XFS_SB_VERSION2_PARENTBIT	0x00000010	/* parent pointers */
> > +#define XFS_SB_VERSION2_UTF8BIT		0x00000020      /* utf8 names */
> >  #define XFS_SB_VERSION2_PROJID32BIT	0x00000080	/* 32 bit project id */
> 
> Can you explain why this bit is safe to use?

I believe Olaf chose this value to match what was used in Barry's
implementation.

> I don't recall why
> XFS_SB_VERSION2_PROJID32BIT skipped several bits because there
> aren't any comments explaining why that value was chosen. Adding a
> comment about the 0x00000040 bit at the same time would be useful.

I'm not sure why we skipped.  I'll see what I can find in mail archives.
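
For context, the combined check described in the commit message can be
sketched as below. XFS_SB_VERSION2_UTF8BIT is taken from the quoted
patch; the value of XFS_SB_VERSION_BORGBIT, struct sb_bits and the
sb_has_*() helper names are assumptions for illustration, and the
sketch omits any version-number or MOREBITS checks the real helpers
may perform:

```c
#include <stdbool.h>
#include <stdint.h>

/* From the quoted patch. */
#define XFS_SB_VERSION2_UTF8BIT		0x00000020
/* Assumed value of the ASCII-CI ("borg") bit in sb_versionnum. */
#define XFS_SB_VERSION_BORGBIT		0x4000

/* Simplified stand-in for the superblock fields involved. */
struct sb_bits {
	uint16_t	versionnum;
	uint32_t	features2;
};

static bool
sb_has_asciici(
	const struct sb_bits	*sb)
{
	return (sb->versionnum & XFS_SB_VERSION_BORGBIT) != 0;
}

static bool
sb_has_utf8(
	const struct sb_bits	*sb)
{
	return (sb->features2 & XFS_SB_VERSION2_UTF8BIT) != 0;
}

/* True if at least one of the two bits is set. */
static bool
sb_has_ci(
	const struct sb_bits	*sb)
{
	return sb_has_asciici(sb) || sb_has_utf8(sb);
}
```

In this shape, the callers that previously asked only about the
ASCII-CI bit would ask sb_has_ci() instead.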

-Ben

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 05/16] xfs: return the first match during case-insensitive lookup.
  2014-10-06 22:19   ` Dave Chinner
@ 2014-10-09 15:42     ` Ben Myers
  2014-10-09 20:38         ` Dave Chinner
  0 siblings, 1 reply; 63+ messages in thread
From: Ben Myers @ 2014-10-09 15:42 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, olaf, xfs

On Tue, Oct 07, 2014 at 09:19:28AM +1100, Dave Chinner wrote:
> On Fri, Oct 03, 2014 at 04:55:42PM -0500, Ben Myers wrote:
> > From: Olaf Weber <olaf@sgi.com>
> > 
> > Change the XFS case-insensitive lookup code to return the first match
> > found, even if it is not an exact match. Whether a filesystem uses
> > case-insensitive lookups is determined by a superblock bit set during
> > filesystem creation.  This means that normal use cannot create two files
> > that both match the same filename.
> > 
> > Signed-off-by: Olaf Weber <olaf@sgi.com>
> 
> This is really dependent on whether we want to support mixed
> CI/non-CI filesystems, yes? i.e. if we want to support mixed case
> setups, then we need to keep the code as it stands?

It depends upon what semantics you decide are correct in the mixed case.
This is just one solution.

> What is the
> downside of keeping the code unchanged and our options open?

The code that is being removed here was for the case when you could have
multiple filenames that match for a lookup, which was only possible when
the ascii-ci bit was implemented as a mount option.  The mount option
was never merged, and the ascii-ci bit is implemented at mkfs and is not
changeable over the lifetime of the filesystem.  So today this code is
not doing us much good.  Maybe there are performance implications where
you have a hash collision, but mostly I think it makes sense to remove
this code from a maintainability standpoint.
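
To make the two policies concrete, here is a rough sketch of the difference being discussed. It assumes a flat array of NUL-terminated names in place of a real directory block and an ASCII-only comparison; all sketch_ names are hypothetical and none of this is the XFS lookup code itself.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum sketch_cmp { SKETCH_CMP_DIFFERENT, SKETCH_CMP_EXACT, SKETCH_CMP_CASE };

/* ASCII-only comparison: exact, equal ignoring case, or different. */
static enum sketch_cmp sketch_compname(const char *a, const char *b)
{
	enum sketch_cmp result = SKETCH_CMP_EXACT;

	assert(a != NULL && b != NULL);
	for (; *a && *b; a++, b++) {
		if (*a == *b)
			continue;
		if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
			return SKETCH_CMP_DIFFERENT;
		result = SKETCH_CMP_CASE;
	}
	/* Names of different length never match. */
	return (*a || *b) ? SKETCH_CMP_DIFFERENT : result;
}

/* Policy being removed: remember a case-only hit, keep scanning for exact. */
static int sketch_lookup_prefer_exact(const char *const *ents, size_t n,
				      const char *name)
{
	int ci = -1;

	for (size_t i = 0; i < n; i++) {
		enum sketch_cmp c = sketch_compname(ents[i], name);

		if (c == SKETCH_CMP_EXACT)
			return (int)i;
		if (c == SKETCH_CMP_CASE && ci < 0)
			ci = (int)i;
	}
	return ci;
}

/* Policy in this patch: stop at the first entry that matches at all. */
static int sketch_lookup_first_match(const char *const *ents, size_t n,
				     const char *name)
{
	for (size_t i = 0; i < n; i++)
		if (sketch_compname(ents[i], name) != SKETCH_CMP_DIFFERENT)
			return (int)i;
	return -1;
}
```

The two functions only disagree when a directory already holds two names that differ in case alone, which is the situation that cannot arise when the bit is fixed at mkfs time.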

-Ben

* Re: [PATCH 05/16] xfs: return the first match during case-insensitive lookup.
  2014-10-09 15:42     ` Ben Myers
@ 2014-10-09 20:38         ` Dave Chinner
  0 siblings, 0 replies; 63+ messages in thread
From: Dave Chinner @ 2014-10-09 20:38 UTC (permalink / raw)
  To: Ben Myers; +Cc: linux-fsdevel, olaf, xfs

On Thu, Oct 09, 2014 at 10:42:40AM -0500, Ben Myers wrote:
> On Tue, Oct 07, 2014 at 09:19:28AM +1100, Dave Chinner wrote:
> > On Fri, Oct 03, 2014 at 04:55:42PM -0500, Ben Myers wrote:
> > > From: Olaf Weber <olaf@sgi.com>
> > > 
> > > Change the XFS case-insensitive lookup code to return the first match
> > > found, even if it is not an exact match. Whether a filesystem uses
> > > case-insensitive lookups is determined by a superblock bit set during
> > > filesystem creation.  This means that normal use cannot create two files
> > > that both match the same filename.
> > > 
> > > Signed-off-by: Olaf Weber <olaf@sgi.com>
> > 
> > This is really dependent on whether we want to support mixed
> > CI/non-CI filesystems, yes? i.e. if we want to support mixed case
> > setups, then we need to keep the code as it stands?
> 
> It depends upon what semantics you decide are correct in the mixed case.
> This is just one solution.

Ok, so we need this code or something very similar to support mixed
case filesystems.  Can you tell us what other possible solutions
and semantics have been considered?

> > What is the downside of keeping the code unchanged and our
> > options open?
> 
> The code that is being removed here was for the case when you
> could have multiple filenames that match for a lookup, which was
> only possible when the ascii-ci bit was implemented as a mount
> option.

Yes, I know what the code did - it allowed us to support mixed case
ascii-ci filesystems. All you've said is "if we remove mixed case
support the code is cleaner" but not addressed the issue at hand.

I'll try asking the same question a different way: if we keep this
code, will it work for mixed case unicode filesystems or do we have
to re-implement mixed case matching differently?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 05/16] xfs: return the first match during case-insensitive lookup.
  2014-10-09 20:38         ` Dave Chinner
@ 2014-10-14 15:04           ` Ben Myers
  -1 siblings, 0 replies; 63+ messages in thread
From: Ben Myers @ 2014-10-14 15:04 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, olaf, xfs

Dave,

On Fri, Oct 10, 2014 at 07:38:14AM +1100, Dave Chinner wrote:
> On Thu, Oct 09, 2014 at 10:42:40AM -0500, Ben Myers wrote:
> > On Tue, Oct 07, 2014 at 09:19:28AM +1100, Dave Chinner wrote:
> > > On Fri, Oct 03, 2014 at 04:55:42PM -0500, Ben Myers wrote:
> > > > From: Olaf Weber <olaf@sgi.com>
> > > > 
> > > > Change the XFS case-insensitive lookup code to return the first match
> > > > found, even if it is not an exact match. Whether a filesystem uses
> > > > case-insensitive lookups is determined by a superblock bit set during
> > > > filesystem creation.  This means that normal use cannot create two files
> > > > that both match the same filename.
> > > > 
> > > > Signed-off-by: Olaf Weber <olaf@sgi.com>
> > > 
> > > This is really dependent on whether we want to support mixed
> > > CI/non-CI filesystems, yes? i.e. if we want to support mixed case
> > > setups, then we need to keep the code as it stands?
> > 
> > It depends upon what semantics you decide are correct in the mixed case.
> > This is just one solution.
> 
> Ok, so we need this code or something very similar to support mixed
> case filesystems.  Can you tell us what other possible solutions
> and semantics have been considered?

There was some discussion of this in the v2 posting of this RFC.

http://marc.info/?l=linux-xfs&m=141176024430150&w=2

Olaf's "readme" example at the above link is a pretty good example of
what we're facing.  And I don't have a good answer for which file to
open.  So for now we're just going for the cleanest solution.

> > > What is the downside of keeping the code unchanged and our
> > > options open?
> > 
> > The code that is being removed here was for the case when you
> > could have multiple filenames that match for a lookup, which was
> > only possible when the ascii-ci bit was implemented as a mount
> > option.
> 
> Yes, I know what the code did - it allowed us to support mixed case
> ascii-ci filesystems. All you've said is "if we remove mixed case
> support the code is cleaner" but not addressed the issue at hand.

I also tried to explain that as the codebase stands today, removal of
this code does not represent a loss of functionality.  It is dead code.

> I'll try asking the same question a different way: if we keep this
> code, will it work for mixed case unicode filesystems or do we have
> to re-implement mixed case matching differently?

If you definitely want to keep this code around I'll look into this, but
right now I don't have plans to extend the patchset to support mixed
case insensitivity in a single filesystem.

-Ben

* Re: [PATCH 08/16] xfs: change interface of xfs_nameops.hashname
  2014-10-06 22:17     ` Dave Chinner
  (?)
@ 2014-10-14 15:34     ` Ben Myers
  -1 siblings, 0 replies; 63+ messages in thread
From: Ben Myers @ 2014-10-14 15:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, olaf, xfs

On Tue, Oct 07, 2014 at 09:17:08AM +1100, Dave Chinner wrote:
> On Fri, Oct 03, 2014 at 04:58:44PM -0500, Ben Myers wrote:
> > From: Olaf Weber <olaf@sgi.com>
> > 
> > With the introduction of the xfs_nameops.normhash callout, all uses of the
> > hashname callout now occur in places where an xfs_name structure must be
> > explicitly created just to match the parameter passing convention of this
> > callout. Change the arguments to a const unsigned char * and int instead.
> > 
> > Signed-off-by: Olaf Weber <olaf@sgi.com>
> > 
> > [v2: pass a 3rd argument for sb_utf8version to hashname.  --bpm]
> 
> So now I've looked at most of the rest of the patch set, I think
> this is the wrong thing to do. I see no reason apart from "it's less
> typing" to drop the use of the xfs_name structure, but it removes a
> key piece of documentation from the code. i.e. that the name/namelen
> are an inseparable tuple and cannot be separated. Indeed, lots of
> the utf8 xfs code declares norm/normlen tuples on the stack for
> temporary use, so really this comes down to a matter of taste.
> 
> And in that matter, I'd prefer that we keep the existing name
> abstraction and propagate it into the new code rather than the other
> way around.

Does something like this suit you?

struct xfs_name {
	const unsigned char	*name;
	int			len;
	int			type;
	__uint32_t		utf8version;
};

Thread overview: 63+ messages
2014-10-03 21:47 [RFC v3] Unicode/UTF-8 support for XFS Ben Myers
2014-10-03 21:50 ` [PATCH 01/16] lib: add unicode character database files Ben Myers
2014-10-03 21:51 ` [PATCH 02/16] scripts: add trie generator for UTF-8 Ben Myers
2014-10-03 21:54 ` [PATCH 03/16] lib: add supporting code " Ben Myers
2014-10-03 21:54 ` [PATCH 04/16] lib/utf8norm.c: reduce the size of utf8data[] Ben Myers
2014-10-05 21:52   ` Dave Chinner
2014-10-05 21:52     ` Dave Chinner
2014-10-03 21:55 ` [PATCH 05/16] xfs: return the first match during case-insensitive lookup Ben Myers
2014-10-06 22:19   ` Dave Chinner
2014-10-09 15:42     ` Ben Myers
2014-10-09 20:38       ` Dave Chinner
2014-10-09 20:38         ` Dave Chinner
2014-10-14 15:04         ` Ben Myers
2014-10-14 15:04           ` Ben Myers
2014-10-03 21:56 ` [PATCH 06/16] xfs: rename XFS_CMP_CASE to XFS_CMP_MATCH Ben Myers
2014-10-03 21:58 ` [PATCH 07/16] xfs: add xfs_nameops.normhash Ben Myers
2014-10-03 21:58 ` [PATCH 08/16] xfs: change interface of xfs_nameops.hashname Ben Myers
2014-10-06 22:17   ` Dave Chinner
2014-10-06 22:17     ` Dave Chinner
2014-10-14 15:34     ` Ben Myers
2014-10-03 21:59 ` [PATCH 09/16] xfs: add a superblock feature bit to indicate UTF-8 support Ben Myers
2014-10-06 21:25   ` Dave Chinner
2014-10-09 15:26     ` Ben Myers
2014-10-03 22:00 ` [PATCH 10/16] xfs: store utf8version in the superblock Ben Myers
2014-10-06 21:53   ` Dave Chinner
2014-10-06 21:53     ` Dave Chinner
2014-10-03 22:01 ` [PATCH 11/16] xfs: add xfs_nameops for utf8 and utf8+casefold Ben Myers
2014-10-06 22:10   ` Dave Chinner
2014-10-06 22:10     ` Dave Chinner
2014-10-03 22:03 ` [PATCH 12/16] xfs: apply utf-8 normalization rules to user extended attribute names Ben Myers
2014-10-03 22:03 ` [PATCH 13/16] xfs: implement demand load of utf8norm.ko Ben Myers
2014-10-04  7:16   ` Christoph Hellwig
2014-10-04  7:16     ` Christoph Hellwig
2014-10-09 15:19     ` Ben Myers
2014-10-03 22:04 ` [PATCH 14/16] xfs: rename XFS_IOC_FSGEOM to XFS_IOC_FSGEOM_V2 Ben Myers
2014-10-06 20:33   ` Dave Chinner
2014-10-06 20:33     ` Dave Chinner
2014-10-06 20:38     ` Ben Myers
2014-10-03 22:05 ` [PATCH 15/16] xfs: xfs_fs_geometry returns a number of bytes to copy Ben Myers
2014-10-06 20:41   ` Dave Chinner
2014-10-06 20:41     ` Dave Chinner
2014-10-03 22:05 ` [PATCH 16/16] xfs: add versioned fsgeom ioctl with utf8version field Ben Myers
2014-10-06 21:13   ` Dave Chinner
2014-10-06 21:13     ` Dave Chinner
2014-10-03 22:06 ` [PATCH 17/35] xfsprogs: add unicode character database files Ben Myers
2014-10-03 22:07 ` [PATCH 18/35] xfsprogs: add trie generator for UTF-8 Ben Myers
2014-10-03 22:07 ` [PATCH 19/35] xfsprogs: add supporting code " Ben Myers
2014-10-03 22:08 ` [PATCH 20/35] xfsprogs: reduce the size of utf8data[] Ben Myers
2014-10-03 22:09 ` [PATCH 21/35] libxfs: return the first match during case-insensitive lookup Ben Myers
2014-10-03 22:09 ` [PATCH 22/35] libxfs: rename XFS_CMP_CASE to XFS_CMP_MATCH Ben Myers
2014-10-03 22:10 ` [PATCH 23/35] libxfs: add xfs_nameops.normhash Ben Myers
2014-10-03 22:11 ` [PATCH 24/35] libxfs: change interface of xfs_nameops.hashname Ben Myers
2014-10-03 22:11 ` [PATCH 25/35] libxfs: add a superblock feature bit to indicate UTF-8 support Ben Myers
2014-10-03 22:12 ` [PATCH 26/35] libxfs: store utf8version in the superblock Ben Myers
2014-10-03 22:13 ` [PATCH 27/35] libxfs: add xfs_nameops for utf8 and utf8+casefold Ben Myers
2014-10-03 22:13 ` [PATCH 28/35] libxfs: apply utf-8 normalization rules to user extended attribute names Ben Myers
2014-10-03 22:14 ` [PATCH 29/35] libxfs: rename XFS_IOC_FSGEOM to XFS_IOC_FSGEOM_V2 Ben Myers
2014-10-03 22:14 ` [PATCH 30/35] libxfs: add versioned fsgeom ioctl with utf8version field Ben Myers
2014-10-03 22:15 ` [PATCH 31/35] xfsprogs: add utf8 support to growfs Ben Myers
2014-10-03 22:15 ` [PATCH 32/35] xfsprogs: add utf8 support to mkfs.xfs Ben Myers
2014-10-03 22:16 ` [PATCH 33/35] xfsprogs: add utf8 support to xfs_repair Ben Myers
2014-10-03 22:16 ` [PATCH 34/35] xfsprogs: xfs_db support for sb_utf8version Ben Myers
2014-10-03 22:17 ` [PATCH 35/35] xfsprogs: add a test for utf8 support Ben Myers
