linux-ext4.vger.kernel.org archive mirror
* unicode cleanups, and split the data table into a separate module v2
@ 2021-09-15  6:59 Christoph Hellwig
  2021-09-15  6:59 ` [PATCH 01/11] ext4: simplify ext4_sb_read_encoding Christoph Hellwig
                   ` (10 more replies)
  0 siblings, 11 replies; 21+ messages in thread
From: Christoph Hellwig @ 2021-09-15  6:59 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Shreeya Patel, linux-fsdevel, linux-ext4, linux-f2fs-devel

Hi all,

this series is an alternative approach to splitting the utf8 table into a
separate module, and it comes with a number of cleanups.

Changes since v1:
 - don't uglify the mount time messages from ext4/f2fs

Diffstat:


* [PATCH 01/11] ext4: simplify ext4_sb_read_encoding
  2021-09-15  6:59 unicode cleanups, and split the data table into a separate module v2 Christoph Hellwig
@ 2021-09-15  6:59 ` Christoph Hellwig
  2021-09-15  6:59 ` [PATCH 02/11] f2fs: simplify f2fs_sb_read_encoding Christoph Hellwig
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2021-09-15  6:59 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Shreeya Patel, linux-fsdevel, linux-ext4, linux-f2fs-devel,
	Theodore Ts'o

Return the encoding table as the return value instead of as an argument,
and don't bother with the encoding flags as the caller can handle that
trivially.
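
As a rough userspace illustration of the calling convention described above
(not the kernel code itself), the lookup can return a pointer to the matching
table entry, with NULL standing in for the old -EINVAL return. The names
`sb_encoding`, `encoding_map`, `read_encoding` and the value of
`ENC_UTF8_12_1` are hypothetical stand-ins, not the ext4 definitions:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct ext4_sb_encodings. */
struct sb_encoding {
	uint16_t magic;
	const char *name;
	const char *version;
};

/* Illustrative value only; not the real EXT4_ENC_UTF8_12_1 definition. */
#define ENC_UTF8_12_1 1

static const struct sb_encoding encoding_map[] = {
	{ ENC_UTF8_12_1, "utf8", "12.1.0" },
};

/* Return the matching table entry, or NULL if the magic is unknown. */
const struct sb_encoding *read_encoding(uint16_t magic)
{
	size_t i;

	for (i = 0; i < sizeof(encoding_map) / sizeof(encoding_map[0]); i++)
		if (magic == encoding_map[i].magic)
			return &encoding_map[i];

	return NULL;
}
```

A caller then only needs a NULL check, and reads any additional superblock
fields (such as the encoding flags) itself.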

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Acked-by: Theodore Ts'o <tytso@mit.edu>
---
 fs/ext4/super.c | 21 +++++++--------------
 1 file changed, 7 insertions(+), 14 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 0775950ee84e3..7401a181878e5 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2017,24 +2017,17 @@ static const struct ext4_sb_encodings {
 	{EXT4_ENC_UTF8_12_1, "utf8", "12.1.0"},
 };
 
-static int ext4_sb_read_encoding(const struct ext4_super_block *es,
-				 const struct ext4_sb_encodings **encoding,
-				 __u16 *flags)
+static const struct ext4_sb_encodings *
+ext4_sb_read_encoding(const struct ext4_super_block *es)
 {
 	__u16 magic = le16_to_cpu(es->s_encoding);
 	int i;
 
 	for (i = 0; i < ARRAY_SIZE(ext4_sb_encoding_map); i++)
 		if (magic == ext4_sb_encoding_map[i].magic)
-			break;
-
-	if (i >= ARRAY_SIZE(ext4_sb_encoding_map))
-		return -EINVAL;
+			return &ext4_sb_encoding_map[i];
 
-	*encoding = &ext4_sb_encoding_map[i];
-	*flags = le16_to_cpu(es->s_encoding_flags);
-
-	return 0;
+	return NULL;
 }
 #endif
 
@@ -4155,10 +4148,10 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	if (ext4_has_feature_casefold(sb) && !sb->s_encoding) {
 		const struct ext4_sb_encodings *encoding_info;
 		struct unicode_map *encoding;
-		__u16 encoding_flags;
+		__u16 encoding_flags = le16_to_cpu(es->s_encoding_flags);
 
-		if (ext4_sb_read_encoding(es, &encoding_info,
-					  &encoding_flags)) {
+		encoding_info = ext4_sb_read_encoding(es);
+		if (!encoding_info) {
 			ext4_msg(sb, KERN_ERR,
 				 "Encoding requested by superblock is unknown");
 			goto failed_mount;
-- 
2.30.2



* [PATCH 02/11] f2fs: simplify f2fs_sb_read_encoding
  2021-09-15  6:59 unicode cleanups, and split the data table into a separate module v2 Christoph Hellwig
  2021-09-15  6:59 ` [PATCH 01/11] ext4: simplify ext4_sb_read_encoding Christoph Hellwig
@ 2021-09-15  6:59 ` Christoph Hellwig
  2021-09-15  6:59 ` [PATCH 03/11] unicode: remove the charset field from struct unicode_map Christoph Hellwig
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2021-09-15  6:59 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Shreeya Patel, linux-fsdevel, linux-ext4, linux-f2fs-devel, Chao Yu

Return the encoding table as the return value instead of as an argument,
and don't bother with the encoding flags as the caller can handle that
trivially.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Reviewed-by: Chao Yu <chao@kernel.org>
---
 fs/f2fs/super.c | 20 +++++++-------------
 1 file changed, 7 insertions(+), 13 deletions(-)

diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index 78ebc306ee2b5..4c457100f18ea 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -264,24 +264,17 @@ static const struct f2fs_sb_encodings {
 	{F2FS_ENC_UTF8_12_1, "utf8", "12.1.0"},
 };
 
-static int f2fs_sb_read_encoding(const struct f2fs_super_block *sb,
-				 const struct f2fs_sb_encodings **encoding,
-				 __u16 *flags)
+static const struct f2fs_sb_encodings *
+f2fs_sb_read_encoding(const struct f2fs_super_block *sb)
 {
 	__u16 magic = le16_to_cpu(sb->s_encoding);
 	int i;
 
 	for (i = 0; i < ARRAY_SIZE(f2fs_sb_encoding_map); i++)
 		if (magic == f2fs_sb_encoding_map[i].magic)
-			break;
-
-	if (i >= ARRAY_SIZE(f2fs_sb_encoding_map))
-		return -EINVAL;
+			return &f2fs_sb_encoding_map[i];
 
-	*encoding = &f2fs_sb_encoding_map[i];
-	*flags = le16_to_cpu(sb->s_encoding_flags);
-
-	return 0;
+	return NULL;
 }
 
 struct kmem_cache *f2fs_cf_name_slab;
@@ -3843,13 +3836,14 @@ static int f2fs_setup_casefold(struct f2fs_sb_info *sbi)
 		struct unicode_map *encoding;
 		__u16 encoding_flags;
 
-		if (f2fs_sb_read_encoding(sbi->raw_super, &encoding_info,
-					  &encoding_flags)) {
+		encoding_info = f2fs_sb_read_encoding(sbi->raw_super);
+		if (!encoding_info) {
 			f2fs_err(sbi,
 				 "Encoding requested by superblock is unknown");
 			return -EINVAL;
 		}
 
+		encoding_flags = le16_to_cpu(sbi->raw_super->s_encoding_flags);
 		encoding = utf8_load(encoding_info->version);
 		if (IS_ERR(encoding)) {
 			f2fs_err(sbi,
-- 
2.30.2



* [PATCH 03/11] unicode: remove the charset field from struct unicode_map
  2021-09-15  6:59 unicode cleanups, and split the data table into a separate module v2 Christoph Hellwig
  2021-09-15  6:59 ` [PATCH 01/11] ext4: simplify ext4_sb_read_encoding Christoph Hellwig
  2021-09-15  6:59 ` [PATCH 02/11] f2fs: simplify f2fs_sb_read_encoding Christoph Hellwig
@ 2021-09-15  6:59 ` Christoph Hellwig
  2021-09-15  6:59 ` [PATCH 04/11] unicode: mark the version field in struct unicode_map unsigned Christoph Hellwig
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2021-09-15  6:59 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Shreeya Patel, linux-fsdevel, linux-ext4, linux-f2fs-devel

The charset is hardcoded and only used for an f2fs sysfs file, where it can
be hardcoded just as easily.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com>
---
 fs/f2fs/sysfs.c         | 3 +--
 fs/unicode/utf8-core.c  | 3 ---
 include/linux/unicode.h | 1 -
 3 files changed, 1 insertion(+), 6 deletions(-)

diff --git a/fs/f2fs/sysfs.c b/fs/f2fs/sysfs.c
index a32fe31c33b8e..650e84398f744 100644
--- a/fs/f2fs/sysfs.c
+++ b/fs/f2fs/sysfs.c
@@ -196,8 +196,7 @@ static ssize_t encoding_show(struct f2fs_attr *a,
 	struct super_block *sb = sbi->sb;
 
 	if (f2fs_sb_has_casefold(sbi))
-		return snprintf(buf, PAGE_SIZE, "%s (%d.%d.%d)\n",
-			sb->s_encoding->charset,
+		return snprintf(buf, PAGE_SIZE, "UTF-8 (%d.%d.%d)\n",
 			(sb->s_encoding->version >> 16) & 0xff,
 			(sb->s_encoding->version >> 8) & 0xff,
 			sb->s_encoding->version & 0xff);
diff --git a/fs/unicode/utf8-core.c b/fs/unicode/utf8-core.c
index dc25823bfed96..86f42a078d99b 100644
--- a/fs/unicode/utf8-core.c
+++ b/fs/unicode/utf8-core.c
@@ -219,10 +219,7 @@ struct unicode_map *utf8_load(const char *version)
 	um = kzalloc(sizeof(struct unicode_map), GFP_KERNEL);
 	if (!um)
 		return ERR_PTR(-ENOMEM);
-
-	um->charset = "UTF-8";
 	um->version = unicode_version;
-
 	return um;
 }
 EXPORT_SYMBOL(utf8_load);
diff --git a/include/linux/unicode.h b/include/linux/unicode.h
index 74484d44c7554..6a392cd9f076d 100644
--- a/include/linux/unicode.h
+++ b/include/linux/unicode.h
@@ -6,7 +6,6 @@
 #include <linux/dcache.h>
 
 struct unicode_map {
-	const char *charset;
 	int version;
 };
 
-- 
2.30.2



* [PATCH 04/11] unicode: mark the version field in struct unicode_map unsigned
  2021-09-15  6:59 unicode cleanups, and split the data table into a separate module v2 Christoph Hellwig
                   ` (2 preceding siblings ...)
  2021-09-15  6:59 ` [PATCH 03/11] unicode: remove the charset field from struct unicode_map Christoph Hellwig
@ 2021-09-15  6:59 ` Christoph Hellwig
  2021-09-15  7:00 ` [PATCH 05/11] unicode: pass a UNICODE_AGE() tripple to utf8_load Christoph Hellwig
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2021-09-15  6:59 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Shreeya Patel, linux-fsdevel, linux-ext4, linux-f2fs-devel

Unicode version triplets are always unsigned.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com>
---
 include/linux/unicode.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/unicode.h b/include/linux/unicode.h
index 6a392cd9f076d..0744f81c4b5fc 100644
--- a/include/linux/unicode.h
+++ b/include/linux/unicode.h
@@ -6,7 +6,7 @@
 #include <linux/dcache.h>
 
 struct unicode_map {
-	int version;
+	unsigned int version;
 };
 
 int utf8_validate(const struct unicode_map *um, const struct qstr *str);
-- 
2.30.2



* [PATCH 05/11] unicode: pass a UNICODE_AGE() tripple to utf8_load
  2021-09-15  6:59 unicode cleanups, and split the data table into a separate module v2 Christoph Hellwig
                   ` (3 preceding siblings ...)
  2021-09-15  6:59 ` [PATCH 04/11] unicode: mark the version field in struct unicode_map unsigned Christoph Hellwig
@ 2021-09-15  7:00 ` Christoph Hellwig
  2021-09-15  7:00 ` [PATCH 06/11] unicode: remove the unused utf8{,n}age{min,max} functions Christoph Hellwig
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2021-09-15  7:00 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Shreeya Patel, linux-fsdevel, linux-ext4, linux-f2fs-devel

Don't bother with pointless string parsing when the caller can just pass
the version in the format that the core expects.  Also remove the
fallback to the latest version that none of the callers actually uses.
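
For readers unfamiliar with the packed format, here is a small userspace
sketch of the encoding. The macro and the three accessors mirror the ones
this patch adds to include/linux/unicode.h, with `uint8_t` substituted for
the kernel's `u8`; `format_unicode_age` is a hypothetical helper added only
for illustration:

```c
#include <stdint.h>
#include <stdio.h>

#define UNICODE_MAJ_SHIFT	16
#define UNICODE_MIN_SHIFT	8

/* Pack major.minor.rev into one unsigned int, one byte per component. */
#define UNICODE_AGE(MAJ, MIN, REV)			\
	(((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) |	\
	 ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) |	\
	 ((unsigned int)(REV)))

static inline uint8_t unicode_major(unsigned int age)
{
	return (age >> UNICODE_MAJ_SHIFT) & 0xff;
}

static inline uint8_t unicode_minor(unsigned int age)
{
	return (age >> UNICODE_MIN_SHIFT) & 0xff;
}

static inline uint8_t unicode_rev(unsigned int age)
{
	return age & 0xff;
}

/* Hypothetical helper: render a packed version as "maj.min.rev". */
int format_unicode_age(char *buf, size_t len, unsigned int age)
{
	return snprintf(buf, len, "%u.%u.%u",
			(unsigned int)unicode_major(age),
			(unsigned int)unicode_minor(age),
			(unsigned int)unicode_rev(age));
}
```

With this layout a caller passes `UNICODE_AGE(12, 1, 0)` directly, so no
string parsing is needed on the load path.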

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/ext4/super.c            | 18 +++++++++-----
 fs/f2fs/super.c            | 18 +++++++++-----
 fs/unicode/utf8-core.c     | 50 ++++----------------------------------
 fs/unicode/utf8-norm.c     | 11 ++-------
 fs/unicode/utf8-selftest.c | 15 ++++++------
 fs/unicode/utf8n.h         | 14 ++---------
 include/linux/unicode.h    | 25 ++++++++++++++++++-
 7 files changed, 65 insertions(+), 86 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 7401a181878e5..da4e307d7599f 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2012,9 +2012,9 @@ static const struct mount_opts {
 static const struct ext4_sb_encodings {
 	__u16 magic;
 	char *name;
-	char *version;
+	unsigned int version;
 } ext4_sb_encoding_map[] = {
-	{EXT4_ENC_UTF8_12_1, "utf8", "12.1.0"},
+	{EXT4_ENC_UTF8_12_1, "utf8", UNICODE_AGE(12, 1, 0)},
 };
 
 static const struct ext4_sb_encodings *
@@ -4160,15 +4160,21 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 		encoding = utf8_load(encoding_info->version);
 		if (IS_ERR(encoding)) {
 			ext4_msg(sb, KERN_ERR,
-				 "can't mount with superblock charset: %s-%s "
+				 "can't mount with superblock charset: %s-%u.%u.%u "
 				 "not supported by the kernel. flags: 0x%x.",
-				 encoding_info->name, encoding_info->version,
+				 encoding_info->name,
+				 unicode_major(encoding_info->version),
+				 unicode_minor(encoding_info->version),
+				 unicode_rev(encoding_info->version),
 				 encoding_flags);
 			goto failed_mount;
 		}
 		ext4_msg(sb, KERN_INFO,"Using encoding defined by superblock: "
-			 "%s-%s with flags 0x%hx", encoding_info->name,
-			 encoding_info->version?:"\b", encoding_flags);
+			 "%s-%u.%u.%u with flags 0x%hx", encoding_info->name,
+			 unicode_major(encoding_info->version),
+			 unicode_minor(encoding_info->version),
+			 unicode_rev(encoding_info->version),
+			 encoding_flags);
 
 		sb->s_encoding = encoding;
 		sb->s_encoding_flags = encoding_flags;
diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index 4c457100f18ea..3029f71bf2b74 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -259,9 +259,9 @@ void f2fs_printk(struct f2fs_sb_info *sbi, const char *fmt, ...)
 static const struct f2fs_sb_encodings {
 	__u16 magic;
 	char *name;
-	char *version;
+	unsigned int version;
 } f2fs_sb_encoding_map[] = {
-	{F2FS_ENC_UTF8_12_1, "utf8", "12.1.0"},
+	{F2FS_ENC_UTF8_12_1, "utf8", UNICODE_AGE(12, 1, 0)},
 };
 
 static const struct f2fs_sb_encodings *
@@ -3847,15 +3847,21 @@ static int f2fs_setup_casefold(struct f2fs_sb_info *sbi)
 		encoding = utf8_load(encoding_info->version);
 		if (IS_ERR(encoding)) {
 			f2fs_err(sbi,
-				 "can't mount with superblock charset: %s-%s "
+				 "can't mount with superblock charset: %s-%u.%u.%u "
 				 "not supported by the kernel. flags: 0x%x.",
-				 encoding_info->name, encoding_info->version,
+				 encoding_info->name,
+				 unicode_major(encoding_info->version),
+				 unicode_minor(encoding_info->version),
+				 unicode_rev(encoding_info->version),
 				 encoding_flags);
 			return PTR_ERR(encoding);
 		}
 		f2fs_info(sbi, "Using encoding defined by superblock: "
-			 "%s-%s with flags 0x%hx", encoding_info->name,
-			 encoding_info->version?:"\b", encoding_flags);
+			 "%s-%u.%u.%u with flags 0x%hx", encoding_info->name,
+			 unicode_major(encoding_info->version),
+			 unicode_minor(encoding_info->version),
+			 unicode_rev(encoding_info->version),
+			 encoding_flags);
 
 		sbi->sb->s_encoding = encoding;
 		sbi->sb->s_encoding_flags = encoding_flags;
diff --git a/fs/unicode/utf8-core.c b/fs/unicode/utf8-core.c
index 86f42a078d99b..dca2865c3bee8 100644
--- a/fs/unicode/utf8-core.c
+++ b/fs/unicode/utf8-core.c
@@ -167,59 +167,19 @@ int utf8_normalize(const struct unicode_map *um, const struct qstr *str,
 	}
 	return -EINVAL;
 }
-
 EXPORT_SYMBOL(utf8_normalize);
 
-static int utf8_parse_version(const char *version, unsigned int *maj,
-			      unsigned int *min, unsigned int *rev)
+struct unicode_map *utf8_load(unsigned int version)
 {
-	substring_t args[3];
-	char version_string[12];
-	static const struct match_token token[] = {
-		{1, "%d.%d.%d"},
-		{0, NULL}
-	};
-
-	strncpy(version_string, version, sizeof(version_string));
-
-	if (match_token(version_string, token, args) != 1)
-		return -EINVAL;
-
-	if (match_int(&args[0], maj) || match_int(&args[1], min) ||
-	    match_int(&args[2], rev))
-		return -EINVAL;
+	struct unicode_map *um;
 
-	return 0;
-}
-
-struct unicode_map *utf8_load(const char *version)
-{
-	struct unicode_map *um = NULL;
-	int unicode_version;
-
-	if (version) {
-		unsigned int maj, min, rev;
-
-		if (utf8_parse_version(version, &maj, &min, &rev) < 0)
-			return ERR_PTR(-EINVAL);
-
-		if (!utf8version_is_supported(maj, min, rev))
-			return ERR_PTR(-EINVAL);
-
-		unicode_version = UNICODE_AGE(maj, min, rev);
-	} else {
-		unicode_version = utf8version_latest();
-		printk(KERN_WARNING"UTF-8 version not specified. "
-		       "Assuming latest supported version (%d.%d.%d).",
-		       (unicode_version >> 16) & 0xff,
-		       (unicode_version >> 8) & 0xff,
-		       (unicode_version & 0xff));
-	}
+	if (!utf8version_is_supported(version))
+		return ERR_PTR(-EINVAL);
 
 	um = kzalloc(sizeof(struct unicode_map), GFP_KERNEL);
 	if (!um)
 		return ERR_PTR(-ENOMEM);
-	um->version = unicode_version;
+	um->version = version;
 	return um;
 }
 EXPORT_SYMBOL(utf8_load);
diff --git a/fs/unicode/utf8-norm.c b/fs/unicode/utf8-norm.c
index 1d2d2e5b906ae..12abf89ae6eca 100644
--- a/fs/unicode/utf8-norm.c
+++ b/fs/unicode/utf8-norm.c
@@ -15,13 +15,12 @@ struct utf8data {
 #include "utf8data.h"
 #undef __INCLUDED_FROM_UTF8NORM_C__
 
-int utf8version_is_supported(u8 maj, u8 min, u8 rev)
+int utf8version_is_supported(unsigned int version)
 {
 	int i = ARRAY_SIZE(utf8agetab) - 1;
-	unsigned int sb_utf8version = UNICODE_AGE(maj, min, rev);
 
 	while (i >= 0 && utf8agetab[i] != 0) {
-		if (sb_utf8version == utf8agetab[i])
+		if (version == utf8agetab[i])
 			return 1;
 		i--;
 	}
@@ -29,12 +28,6 @@ int utf8version_is_supported(u8 maj, u8 min, u8 rev)
 }
 EXPORT_SYMBOL(utf8version_is_supported);
 
-int utf8version_latest(void)
-{
-	return utf8vers;
-}
-EXPORT_SYMBOL(utf8version_latest);
-
 /*
  * UTF-8 valid ranges.
  *
diff --git a/fs/unicode/utf8-selftest.c b/fs/unicode/utf8-selftest.c
index 6fe8af7edccbb..37f33890e012f 100644
--- a/fs/unicode/utf8-selftest.c
+++ b/fs/unicode/utf8-selftest.c
@@ -235,7 +235,7 @@ static void check_utf8_nfdicf(void)
 static void check_utf8_comparisons(void)
 {
 	int i;
-	struct unicode_map *table = utf8_load("12.1.0");
+	struct unicode_map *table = utf8_load(UNICODE_AGE(12, 1, 0));
 
 	if (IS_ERR(table)) {
 		pr_err("%s: Unable to load utf8 %d.%d.%d. Skipping.\n",
@@ -269,18 +269,19 @@ static void check_utf8_comparisons(void)
 static void check_supported_versions(void)
 {
 	/* Unicode 7.0.0 should be supported. */
-	test(utf8version_is_supported(7, 0, 0));
+	test(utf8version_is_supported(UNICODE_AGE(7, 0, 0)));
 
 	/* Unicode 9.0.0 should be supported. */
-	test(utf8version_is_supported(9, 0, 0));
+	test(utf8version_is_supported(UNICODE_AGE(9, 0, 0)));
 
 	/* Unicode 1x.0.0 (the latest version) should be supported. */
-	test(utf8version_is_supported(latest_maj, latest_min, latest_rev));
+	test(utf8version_is_supported(
+		UNICODE_AGE(latest_maj, latest_min, latest_rev)));
 
 	/* Next versions don't exist. */
-	test(!utf8version_is_supported(13, 0, 0));
-	test(!utf8version_is_supported(0, 0, 0));
-	test(!utf8version_is_supported(-1, -1, -1));
+	test(!utf8version_is_supported(UNICODE_AGE(13, 0, 0)));
+	test(!utf8version_is_supported(UNICODE_AGE(0, 0, 0)));
+	test(!utf8version_is_supported(UNICODE_AGE(-1, -1, -1)));
 }
 
 static int __init init_test_ucd(void)
diff --git a/fs/unicode/utf8n.h b/fs/unicode/utf8n.h
index 0acd530c2c791..85a7bebf69275 100644
--- a/fs/unicode/utf8n.h
+++ b/fs/unicode/utf8n.h
@@ -11,19 +11,9 @@
 #include <linux/export.h>
 #include <linux/string.h>
 #include <linux/module.h>
+#include <linux/unicode.h>
 
-/* Encoding a unicode version number as a single unsigned int. */
-#define UNICODE_MAJ_SHIFT		(16)
-#define UNICODE_MIN_SHIFT		(8)
-
-#define UNICODE_AGE(MAJ, MIN, REV)			\
-	(((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) |	\
-	 ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) |	\
-	 ((unsigned int)(REV)))
-
-/* Highest unicode version supported by the data tables. */
-extern int utf8version_is_supported(u8 maj, u8 min, u8 rev);
-extern int utf8version_latest(void);
+int utf8version_is_supported(unsigned int version);
 
 /*
  * Look for the correct const struct utf8data for a unicode version.
diff --git a/include/linux/unicode.h b/include/linux/unicode.h
index 0744f81c4b5fc..77bb915fd1f05 100644
--- a/include/linux/unicode.h
+++ b/include/linux/unicode.h
@@ -5,6 +5,29 @@
 #include <linux/init.h>
 #include <linux/dcache.h>
 
+#define UNICODE_MAJ_SHIFT		16
+#define UNICODE_MIN_SHIFT		8
+
+#define UNICODE_AGE(MAJ, MIN, REV)			\
+	(((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) |	\
+	 ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) |	\
+	 ((unsigned int)(REV)))
+
+static inline u8 unicode_major(unsigned int age)
+{
+	return (age >> UNICODE_MAJ_SHIFT) & 0xff;
+}
+
+static inline u8 unicode_minor(unsigned int age)
+{
+	return (age >> UNICODE_MIN_SHIFT) & 0xff;
+}
+
+static inline u8 unicode_rev(unsigned int age)
+{
+	return age & 0xff;
+}
+
 struct unicode_map {
 	unsigned int version;
 };
@@ -29,7 +52,7 @@ int utf8_casefold(const struct unicode_map *um, const struct qstr *str,
 int utf8_casefold_hash(const struct unicode_map *um, const void *salt,
 		       struct qstr *str);
 
-struct unicode_map *utf8_load(const char *version);
+struct unicode_map *utf8_load(unsigned int version);
 void utf8_unload(struct unicode_map *um);
 
 #endif /* _LINUX_UNICODE_H */
-- 
2.30.2



* [PATCH 06/11] unicode: remove the unused utf8{,n}age{min,max} functions
  2021-09-15  6:59 unicode cleanups, and split the data table into a separate module v2 Christoph Hellwig
                   ` (4 preceding siblings ...)
  2021-09-15  7:00 ` [PATCH 05/11] unicode: pass a UNICODE_AGE() tripple to utf8_load Christoph Hellwig
@ 2021-09-15  7:00 ` Christoph Hellwig
  2021-09-15  7:00 ` [PATCH 07/11] unicode: simplify utf8len Christoph Hellwig
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2021-09-15  7:00 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Shreeya Patel, linux-fsdevel, linux-ext4, linux-f2fs-devel

Not actually used anywhere.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/unicode/utf8-norm.c | 113 -----------------------------------------
 fs/unicode/utf8n.h     |  16 ------
 2 files changed, 129 deletions(-)

diff --git a/fs/unicode/utf8-norm.c b/fs/unicode/utf8-norm.c
index 12abf89ae6eca..4b1b53391ce4b 100644
--- a/fs/unicode/utf8-norm.c
+++ b/fs/unicode/utf8-norm.c
@@ -391,119 +391,6 @@ static utf8leaf_t *utf8lookup(const struct utf8data *data,
 	return utf8nlookup(data, hangul, s, (size_t)-1);
 }
 
-/*
- * Maximum age of any character in s.
- * Return -1 if s is not valid UTF-8 unicode.
- * Return 0 if only non-assigned code points are used.
- */
-int utf8agemax(const struct utf8data *data, const char *s)
-{
-	utf8leaf_t	*leaf;
-	int		age = 0;
-	int		leaf_age;
-	unsigned char	hangul[UTF8HANGULLEAF];
-
-	if (!data)
-		return -1;
-
-	while (*s) {
-		leaf = utf8lookup(data, hangul, s);
-		if (!leaf)
-			return -1;
-
-		leaf_age = utf8agetab[LEAF_GEN(leaf)];
-		if (leaf_age <= data->maxage && leaf_age > age)
-			age = leaf_age;
-		s += utf8clen(s);
-	}
-	return age;
-}
-EXPORT_SYMBOL(utf8agemax);
-
-/*
- * Minimum age of any character in s.
- * Return -1 if s is not valid UTF-8 unicode.
- * Return 0 if non-assigned code points are used.
- */
-int utf8agemin(const struct utf8data *data, const char *s)
-{
-	utf8leaf_t	*leaf;
-	int		age;
-	int		leaf_age;
-	unsigned char	hangul[UTF8HANGULLEAF];
-
-	if (!data)
-		return -1;
-	age = data->maxage;
-	while (*s) {
-		leaf = utf8lookup(data, hangul, s);
-		if (!leaf)
-			return -1;
-		leaf_age = utf8agetab[LEAF_GEN(leaf)];
-		if (leaf_age <= data->maxage && leaf_age < age)
-			age = leaf_age;
-		s += utf8clen(s);
-	}
-	return age;
-}
-EXPORT_SYMBOL(utf8agemin);
-
-/*
- * Maximum age of any character in s, touch at most len bytes.
- * Return -1 if s is not valid UTF-8 unicode.
- */
-int utf8nagemax(const struct utf8data *data, const char *s, size_t len)
-{
-	utf8leaf_t	*leaf;
-	int		age = 0;
-	int		leaf_age;
-	unsigned char	hangul[UTF8HANGULLEAF];
-
-	if (!data)
-		return -1;
-
-	while (len && *s) {
-		leaf = utf8nlookup(data, hangul, s, len);
-		if (!leaf)
-			return -1;
-		leaf_age = utf8agetab[LEAF_GEN(leaf)];
-		if (leaf_age <= data->maxage && leaf_age > age)
-			age = leaf_age;
-		len -= utf8clen(s);
-		s += utf8clen(s);
-	}
-	return age;
-}
-EXPORT_SYMBOL(utf8nagemax);
-
-/*
- * Maximum age of any character in s, touch at most len bytes.
- * Return -1 if s is not valid UTF-8 unicode.
- */
-int utf8nagemin(const struct utf8data *data, const char *s, size_t len)
-{
-	utf8leaf_t	*leaf;
-	int		leaf_age;
-	int		age;
-	unsigned char	hangul[UTF8HANGULLEAF];
-
-	if (!data)
-		return -1;
-	age = data->maxage;
-	while (len && *s) {
-		leaf = utf8nlookup(data, hangul, s, len);
-		if (!leaf)
-			return -1;
-		leaf_age = utf8agetab[LEAF_GEN(leaf)];
-		if (leaf_age <= data->maxage && leaf_age < age)
-			age = leaf_age;
-		len -= utf8clen(s);
-		s += utf8clen(s);
-	}
-	return age;
-}
-EXPORT_SYMBOL(utf8nagemin);
-
 /*
  * Length of the normalization of s.
  * Return -1 if s is not valid UTF-8 unicode.
diff --git a/fs/unicode/utf8n.h b/fs/unicode/utf8n.h
index 85a7bebf69275..e4c8a767cf7a5 100644
--- a/fs/unicode/utf8n.h
+++ b/fs/unicode/utf8n.h
@@ -33,22 +33,6 @@ int utf8version_is_supported(unsigned int version);
 extern const struct utf8data *utf8nfdi(unsigned int maxage);
 extern const struct utf8data *utf8nfdicf(unsigned int maxage);
 
-/*
- * Determine the maximum age of any unicode character in the string.
- * Returns 0 if only unassigned code points are present.
- * Returns -1 if the input is not valid UTF-8.
- */
-extern int utf8agemax(const struct utf8data *data, const char *s);
-extern int utf8nagemax(const struct utf8data *data, const char *s, size_t len);
-
-/*
- * Determine the minimum age of any unicode character in the string.
- * Returns 0 if any unassigned code points are present.
- * Returns -1 if the input is not valid UTF-8.
- */
-extern int utf8agemin(const struct utf8data *data, const char *s);
-extern int utf8nagemin(const struct utf8data *data, const char *s, size_t len);
-
 /*
  * Determine the length of the normalized from of the string,
  * excluding any terminating NULL byte.
-- 
2.30.2



* [PATCH 07/11] unicode: simplify utf8len
  2021-09-15  6:59 unicode cleanups, and split the data table into a separate module v2 Christoph Hellwig
                   ` (5 preceding siblings ...)
  2021-09-15  7:00 ` [PATCH 06/11] unicode: remove the unused utf8{,n}age{min,max} functions Christoph Hellwig
@ 2021-09-15  7:00 ` Christoph Hellwig
  2021-09-15  7:00 ` [PATCH 08/11] unicode: move utf8cursor to utf8-selftest.c Christoph Hellwig
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2021-09-15  7:00 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Shreeya Patel, linux-fsdevel, linux-ext4, linux-f2fs-devel

Just use the utf8nlen implementation with a (size_t)-1 len argument,
similar to utf8lookup.  Also move the function to utf8-selftest.c, as
it isn't used anywhere else.
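
The wrapper pattern can be sketched in userspace as follows. This is only an
illustration of passing (size_t)-1 as "no limit": `bounded_len` and
`unbounded_len` are hypothetical names, and the byte counting below is a
placeholder for the real normalization-length logic in utf8nlen:

```c
#include <stddef.h>
#include <sys/types.h>	/* ssize_t */

/*
 * Hypothetical bounded routine: count bytes up to the terminating NUL,
 * touching at most len bytes.  Returns -1 on a NULL input.
 */
ssize_t bounded_len(const char *s, size_t len)
{
	size_t n = 0;

	if (!s)
		return -1;
	while (n < len && s[n])
		n++;
	return (ssize_t)n;
}

/* NUL-terminated variant: the largest size_t value means "no limit". */
ssize_t unbounded_len(const char *s)
{
	return bounded_len(s, (size_t)-1);
}
```

Because the bounded loop already stops at the NUL byte, the unbounded variant
needs no logic of its own.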

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/unicode/utf8-norm.c     | 30 ------------------------------
 fs/unicode/utf8-selftest.c |  5 +++++
 fs/unicode/utf8n.h         |  1 -
 3 files changed, 5 insertions(+), 31 deletions(-)

diff --git a/fs/unicode/utf8-norm.c b/fs/unicode/utf8-norm.c
index 4b1b53391ce4b..348d6e97553f2 100644
--- a/fs/unicode/utf8-norm.c
+++ b/fs/unicode/utf8-norm.c
@@ -391,36 +391,6 @@ static utf8leaf_t *utf8lookup(const struct utf8data *data,
 	return utf8nlookup(data, hangul, s, (size_t)-1);
 }
 
-/*
- * Length of the normalization of s.
- * Return -1 if s is not valid UTF-8 unicode.
- *
- * A string of Default_Ignorable_Code_Point has length 0.
- */
-ssize_t utf8len(const struct utf8data *data, const char *s)
-{
-	utf8leaf_t	*leaf;
-	size_t		ret = 0;
-	unsigned char	hangul[UTF8HANGULLEAF];
-
-	if (!data)
-		return -1;
-	while (*s) {
-		leaf = utf8lookup(data, hangul, s);
-		if (!leaf)
-			return -1;
-		if (utf8agetab[LEAF_GEN(leaf)] > data->maxage)
-			ret += utf8clen(s);
-		else if (LEAF_CCC(leaf) == DECOMPOSE)
-			ret += strlen(LEAF_STR(leaf));
-		else
-			ret += utf8clen(s);
-		s += utf8clen(s);
-	}
-	return ret;
-}
-EXPORT_SYMBOL(utf8len);
-
 /*
  * Length of the normalization of s, touch at most len bytes.
  * Return -1 if s is not valid UTF-8 unicode.
diff --git a/fs/unicode/utf8-selftest.c b/fs/unicode/utf8-selftest.c
index 37f33890e012f..80fb7c75acb28 100644
--- a/fs/unicode/utf8-selftest.c
+++ b/fs/unicode/utf8-selftest.c
@@ -160,6 +160,11 @@ static const struct {
 	}
 };
 
+static ssize_t utf8len(const struct utf8data *data, const char *s)
+{
+	return utf8nlen(data, s, (size_t)-1);
+}
+
 static void check_utf8_nfdi(void)
 {
 	int i;
diff --git a/fs/unicode/utf8n.h b/fs/unicode/utf8n.h
index e4c8a767cf7a5..41182e5464dfa 100644
--- a/fs/unicode/utf8n.h
+++ b/fs/unicode/utf8n.h
@@ -39,7 +39,6 @@ extern const struct utf8data *utf8nfdicf(unsigned int maxage);
  * Returns 0 if only ignorable code points are present.
  * Returns -1 if the input is not valid UTF-8.
  */
-extern ssize_t utf8len(const struct utf8data *data, const char *s);
 extern ssize_t utf8nlen(const struct utf8data *data, const char *s, size_t len);
 
 /* Needed in struct utf8cursor below. */
-- 
2.30.2



* [PATCH 08/11] unicode: move utf8cursor to utf8-selftest.c
  2021-09-15  6:59 unicode cleanups, and split the data table into a separate module v2 Christoph Hellwig
                   ` (6 preceding siblings ...)
  2021-09-15  7:00 ` [PATCH 07/11] unicode: simplify utf8len Christoph Hellwig
@ 2021-09-15  7:00 ` Christoph Hellwig
  2021-09-15  7:00 ` [PATCH 09/11] unicode: cache the normalization tables in struct unicode_map Christoph Hellwig
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2021-09-15  7:00 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Shreeya Patel, linux-fsdevel, linux-ext4, linux-f2fs-devel

Only used by the tests, so no need to keep it in the core.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/unicode/utf8-norm.c     | 16 ----------------
 fs/unicode/utf8-selftest.c |  6 ++++++
 fs/unicode/utf8n.h         |  2 --
 3 files changed, 6 insertions(+), 18 deletions(-)

diff --git a/fs/unicode/utf8-norm.c b/fs/unicode/utf8-norm.c
index 348d6e97553f2..1ac90fa00070d 100644
--- a/fs/unicode/utf8-norm.c
+++ b/fs/unicode/utf8-norm.c
@@ -456,22 +456,6 @@ int utf8ncursor(struct utf8cursor *u8c, const struct utf8data *data,
 }
 EXPORT_SYMBOL(utf8ncursor);
 
-/*
- * Set up an utf8cursor for use by utf8byte().
- *
- *   u8c    : pointer to cursor.
- *   data   : const struct utf8data to use for normalization.
- *   s      : NUL-terminated string.
- *
- * Returns -1 on error, 0 on success.
- */
-int utf8cursor(struct utf8cursor *u8c, const struct utf8data *data,
-	       const char *s)
-{
-	return utf8ncursor(u8c, data, s, (unsigned int)-1);
-}
-EXPORT_SYMBOL(utf8cursor);
-
 /*
  * Get one byte from the normalized form of the string described by u8c.
  *
diff --git a/fs/unicode/utf8-selftest.c b/fs/unicode/utf8-selftest.c
index 80fb7c75acb28..04628b50351d3 100644
--- a/fs/unicode/utf8-selftest.c
+++ b/fs/unicode/utf8-selftest.c
@@ -165,6 +165,12 @@ static ssize_t utf8len(const struct utf8data *data, const char *s)
 	return utf8nlen(data, s, (size_t)-1);
 }
 
+static int utf8cursor(struct utf8cursor *u8c, const struct utf8data *data,
+		const char *s)
+{
+	return utf8ncursor(u8c, data, s, (unsigned int)-1);
+}
+
 static void check_utf8_nfdi(void)
 {
 	int i;
diff --git a/fs/unicode/utf8n.h b/fs/unicode/utf8n.h
index 41182e5464dfa..736b6460a38cb 100644
--- a/fs/unicode/utf8n.h
+++ b/fs/unicode/utf8n.h
@@ -65,8 +65,6 @@ struct utf8cursor {
  * Returns 0 on success.
  * Returns -1 on failure.
  */
-extern int utf8cursor(struct utf8cursor *u8c, const struct utf8data *data,
-		      const char *s);
 extern int utf8ncursor(struct utf8cursor *u8c, const struct utf8data *data,
 		       const char *s, size_t len);
 
-- 
2.30.2



* [PATCH 09/11] unicode: cache the normalization tables in struct unicode_map
  2021-09-15  6:59 unicode cleanups, and split the data table into a separate module v2 Christoph Hellwig
                   ` (7 preceding siblings ...)
  2021-09-15  7:00 ` [PATCH 08/11] unicode: move utf8cursor to utf8-selftest.c Christoph Hellwig
@ 2021-09-15  7:00 ` Christoph Hellwig
  2021-09-15  7:00 ` [PATCH 10/11] unicode: Add utf8-data module Christoph Hellwig
  2021-09-15  7:00 ` [PATCH 11/11] unicode: only export internal symbols for the selftests Christoph Hellwig
  10 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2021-09-15  7:00 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Shreeya Patel, linux-fsdevel, linux-ext4, linux-f2fs-devel

Instead of repeatedly looking up the tables for the version, add
pointers to the NFD and NFD+CF tables to struct unicode_map, and pass
a unicode_map plus an index to the functions that use the
normalization tables.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
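A minimal userspace sketch of the caching idea described above, for
readers following along.  This is not the kernel code: the demo_* names,
the table contents and the offsets are hypothetical, and only the
0xc0100 encoding of version 12.1.0 and the two-slot layout are taken
from this series.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct utf8data. */
struct demo_table {
	unsigned int maxage;
	unsigned int offset;
};

enum demo_norm {
	DEMO_NFDI = 0,
	DEMO_NFDICF,
	DEMO_NMAX,
};

/* Hypothetical stand-in for struct unicode_map. */
struct demo_map {
	unsigned int version;
	const struct demo_table *ntab[DEMO_NMAX];
};

/* Made-up table entries; the offsets carry no real meaning. */
static const struct demo_table demo_nfdi = { 0xc0100, 0 };
static const struct demo_table demo_nfdicf = { 0xc0100, 4096 };

/* Resolve both tables once, at load time. */
static int demo_load(struct demo_map *um, unsigned int version)
{
	assert(um != NULL);
	if (version != 0xc0100)
		return -1;
	um->version = version;
	um->ntab[DEMO_NFDI] = &demo_nfdi;
	um->ntab[DEMO_NFDICF] = &demo_nfdicf;
	return 0;
}

/* Per-call users index the cache instead of searching by version. */
static unsigned int demo_trie_offset(const struct demo_map *um,
		enum demo_norm n)
{
	return um->ntab[n]->offset;
}
```

The point of the layout is that the version search happens once per
load, and every later call is a plain array index.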
 fs/unicode/utf8-core.c     | 37 +++++++++---------
 fs/unicode/utf8-norm.c     | 45 ++++++++++-----------
 fs/unicode/utf8-selftest.c | 80 ++++++++++++++++----------------------
 fs/unicode/utf8n.h         | 10 +++--
 include/linux/unicode.h    | 19 +++++++++
 5 files changed, 97 insertions(+), 94 deletions(-)

diff --git a/fs/unicode/utf8-core.c b/fs/unicode/utf8-core.c
index dca2865c3bee8..d9f713d38c0ad 100644
--- a/fs/unicode/utf8-core.c
+++ b/fs/unicode/utf8-core.c
@@ -5,16 +5,13 @@
 #include <linux/slab.h>
 #include <linux/parser.h>
 #include <linux/errno.h>
-#include <linux/unicode.h>
 #include <linux/stringhash.h>
 
 #include "utf8n.h"
 
 int utf8_validate(const struct unicode_map *um, const struct qstr *str)
 {
-	const struct utf8data *data = utf8nfdi(um->version);
-
-	if (utf8nlen(data, str->name, str->len) < 0)
+	if (utf8nlen(um, UTF8_NFDI, str->name, str->len) < 0)
 		return -1;
 	return 0;
 }
@@ -23,14 +20,13 @@ EXPORT_SYMBOL(utf8_validate);
 int utf8_strncmp(const struct unicode_map *um,
 		 const struct qstr *s1, const struct qstr *s2)
 {
-	const struct utf8data *data = utf8nfdi(um->version);
 	struct utf8cursor cur1, cur2;
 	int c1, c2;
 
-	if (utf8ncursor(&cur1, data, s1->name, s1->len) < 0)
+	if (utf8ncursor(&cur1, um, UTF8_NFDI, s1->name, s1->len) < 0)
 		return -EINVAL;
 
-	if (utf8ncursor(&cur2, data, s2->name, s2->len) < 0)
+	if (utf8ncursor(&cur2, um, UTF8_NFDI, s2->name, s2->len) < 0)
 		return -EINVAL;
 
 	do {
@@ -50,14 +46,13 @@ EXPORT_SYMBOL(utf8_strncmp);
 int utf8_strncasecmp(const struct unicode_map *um,
 		     const struct qstr *s1, const struct qstr *s2)
 {
-	const struct utf8data *data = utf8nfdicf(um->version);
 	struct utf8cursor cur1, cur2;
 	int c1, c2;
 
-	if (utf8ncursor(&cur1, data, s1->name, s1->len) < 0)
+	if (utf8ncursor(&cur1, um, UTF8_NFDICF, s1->name, s1->len) < 0)
 		return -EINVAL;
 
-	if (utf8ncursor(&cur2, data, s2->name, s2->len) < 0)
+	if (utf8ncursor(&cur2, um, UTF8_NFDICF, s2->name, s2->len) < 0)
 		return -EINVAL;
 
 	do {
@@ -81,12 +76,11 @@ int utf8_strncasecmp_folded(const struct unicode_map *um,
 			    const struct qstr *cf,
 			    const struct qstr *s1)
 {
-	const struct utf8data *data = utf8nfdicf(um->version);
 	struct utf8cursor cur1;
 	int c1, c2;
 	int i = 0;
 
-	if (utf8ncursor(&cur1, data, s1->name, s1->len) < 0)
+	if (utf8ncursor(&cur1, um, UTF8_NFDICF, s1->name, s1->len) < 0)
 		return -EINVAL;
 
 	do {
@@ -105,11 +99,10 @@ EXPORT_SYMBOL(utf8_strncasecmp_folded);
 int utf8_casefold(const struct unicode_map *um, const struct qstr *str,
 		  unsigned char *dest, size_t dlen)
 {
-	const struct utf8data *data = utf8nfdicf(um->version);
 	struct utf8cursor cur;
 	size_t nlen = 0;
 
-	if (utf8ncursor(&cur, data, str->name, str->len) < 0)
+	if (utf8ncursor(&cur, um, UTF8_NFDICF, str->name, str->len) < 0)
 		return -EINVAL;
 
 	for (nlen = 0; nlen < dlen; nlen++) {
@@ -128,12 +121,11 @@ EXPORT_SYMBOL(utf8_casefold);
 int utf8_casefold_hash(const struct unicode_map *um, const void *salt,
 		       struct qstr *str)
 {
-	const struct utf8data *data = utf8nfdicf(um->version);
 	struct utf8cursor cur;
 	int c;
 	unsigned long hash = init_name_hash(salt);
 
-	if (utf8ncursor(&cur, data, str->name, str->len) < 0)
+	if (utf8ncursor(&cur, um, UTF8_NFDICF, str->name, str->len) < 0)
 		return -EINVAL;
 
 	while ((c = utf8byte(&cur))) {
@@ -149,11 +141,10 @@ EXPORT_SYMBOL(utf8_casefold_hash);
 int utf8_normalize(const struct unicode_map *um, const struct qstr *str,
 		   unsigned char *dest, size_t dlen)
 {
-	const struct utf8data *data = utf8nfdi(um->version);
 	struct utf8cursor cur;
 	ssize_t nlen = 0;
 
-	if (utf8ncursor(&cur, data, str->name, str->len) < 0)
+	if (utf8ncursor(&cur, um, UTF8_NFDI, str->name, str->len) < 0)
 		return -EINVAL;
 
 	for (nlen = 0; nlen < dlen; nlen++) {
@@ -180,7 +171,17 @@ struct unicode_map *utf8_load(unsigned int version)
 	if (!um)
 		return ERR_PTR(-ENOMEM);
 	um->version = version;
+	um->ntab[UTF8_NFDI] = utf8nfdi(version);
+	if (!um->ntab[UTF8_NFDI])
+		goto out_free_um;
+	um->ntab[UTF8_NFDICF] = utf8nfdicf(version);
+	if (!um->ntab[UTF8_NFDICF])
+		goto out_free_um;
 	return um;
+
+out_free_um:
+	kfree(um);
+	return ERR_PTR(-EINVAL);
 }
 EXPORT_SYMBOL(utf8_load);
 
diff --git a/fs/unicode/utf8-norm.c b/fs/unicode/utf8-norm.c
index 1ac90fa00070d..7c1f28ab31a80 100644
--- a/fs/unicode/utf8-norm.c
+++ b/fs/unicode/utf8-norm.c
@@ -309,21 +309,19 @@ utf8hangul(const char *str, unsigned char *hangul)
  * is well-formed and corresponds to a known unicode code point.  The
  * shorthand for this will be "is valid UTF-8 unicode".
  */
-static utf8leaf_t *utf8nlookup(const struct utf8data *data,
-			       unsigned char *hangul, const char *s, size_t len)
+static utf8leaf_t *utf8nlookup(const struct unicode_map *um,
+		enum utf8_normalization n, unsigned char *hangul, const char *s,
+		size_t len)
 {
-	utf8trie_t	*trie = NULL;
+	utf8trie_t	*trie = utf8data + um->ntab[n]->offset;
 	int		offlen;
 	int		offset;
 	int		mask;
 	int		node;
 
-	if (!data)
-		return NULL;
 	if (len == 0)
 		return NULL;
 
-	trie = utf8data + data->offset;
 	node = 1;
 	while (node) {
 		offlen = (*trie & OFFLEN) >> OFFLEN_SHIFT;
@@ -385,29 +383,28 @@ static utf8leaf_t *utf8nlookup(const struct utf8data *data,
  *
  * Forwards to utf8nlookup().
  */
-static utf8leaf_t *utf8lookup(const struct utf8data *data,
-			      unsigned char *hangul, const char *s)
+static utf8leaf_t *utf8lookup(const struct unicode_map *um,
+		enum utf8_normalization n, unsigned char *hangul, const char *s)
 {
-	return utf8nlookup(data, hangul, s, (size_t)-1);
+	return utf8nlookup(um, n, hangul, s, (size_t)-1);
 }
 
 /*
  * Length of the normalization of s, touch at most len bytes.
  * Return -1 if s is not valid UTF-8 unicode.
  */
-ssize_t utf8nlen(const struct utf8data *data, const char *s, size_t len)
+ssize_t utf8nlen(const struct unicode_map *um, enum utf8_normalization n,
+		const char *s, size_t len)
 {
 	utf8leaf_t	*leaf;
 	size_t		ret = 0;
 	unsigned char	hangul[UTF8HANGULLEAF];
 
-	if (!data)
-		return -1;
 	while (len && *s) {
-		leaf = utf8nlookup(data, hangul, s, len);
+		leaf = utf8nlookup(um, n, hangul, s, len);
 		if (!leaf)
 			return -1;
-		if (utf8agetab[LEAF_GEN(leaf)] > data->maxage)
+		if (utf8agetab[LEAF_GEN(leaf)] > um->ntab[n]->maxage)
 			ret += utf8clen(s);
 		else if (LEAF_CCC(leaf) == DECOMPOSE)
 			ret += strlen(LEAF_STR(leaf));
@@ -430,14 +427,13 @@ EXPORT_SYMBOL(utf8nlen);
  *
  * Returns -1 on error, 0 on success.
  */
-int utf8ncursor(struct utf8cursor *u8c, const struct utf8data *data,
-		const char *s, size_t len)
+int utf8ncursor(struct utf8cursor *u8c, const struct unicode_map *um,
+		enum utf8_normalization n, const char *s, size_t len)
 {
-	if (!data)
-		return -1;
 	if (!s)
 		return -1;
-	u8c->data = data;
+	u8c->um = um;
+	u8c->n = n;
 	u8c->s = s;
 	u8c->p = NULL;
 	u8c->ss = NULL;
@@ -512,9 +508,9 @@ int utf8byte(struct utf8cursor *u8c)
 
 		/* Look up the data for the current character. */
 		if (u8c->p) {
-			leaf = utf8lookup(u8c->data, u8c->hangul, u8c->s);
+			leaf = utf8lookup(u8c->um, u8c->n, u8c->hangul, u8c->s);
 		} else {
-			leaf = utf8nlookup(u8c->data, u8c->hangul,
+			leaf = utf8nlookup(u8c->um, u8c->n, u8c->hangul,
 					   u8c->s, u8c->len);
 		}
 
@@ -524,7 +520,8 @@ int utf8byte(struct utf8cursor *u8c)
 
 		ccc = LEAF_CCC(leaf);
 		/* Characters that are too new have CCC 0. */
-		if (utf8agetab[LEAF_GEN(leaf)] > u8c->data->maxage) {
+		if (utf8agetab[LEAF_GEN(leaf)] >
+		    u8c->um->ntab[u8c->n]->maxage) {
 			ccc = STOPPER;
 		} else if (ccc == DECOMPOSE) {
 			u8c->len -= utf8clen(u8c->s);
@@ -538,7 +535,7 @@ int utf8byte(struct utf8cursor *u8c)
 				goto ccc_mismatch;
 			}
 
-			leaf = utf8lookup(u8c->data, u8c->hangul, u8c->s);
+			leaf = utf8lookup(u8c->um, u8c->n, u8c->hangul, u8c->s);
 			if (!leaf)
 				return -1;
 			ccc = LEAF_CCC(leaf);
@@ -611,7 +608,6 @@ const struct utf8data *utf8nfdi(unsigned int maxage)
 		return NULL;
 	return &utf8nfdidata[i];
 }
-EXPORT_SYMBOL(utf8nfdi);
 
 const struct utf8data *utf8nfdicf(unsigned int maxage)
 {
@@ -623,4 +619,3 @@ const struct utf8data *utf8nfdicf(unsigned int maxage)
 		return NULL;
 	return &utf8nfdicfdata[i];
 }
-EXPORT_SYMBOL(utf8nfdicf);
diff --git a/fs/unicode/utf8-selftest.c b/fs/unicode/utf8-selftest.c
index 04628b50351d3..cfa3832b75f42 100644
--- a/fs/unicode/utf8-selftest.c
+++ b/fs/unicode/utf8-selftest.c
@@ -18,9 +18,7 @@ unsigned int failed_tests;
 unsigned int total_tests;
 
 /* Tests will be based on this version. */
-#define latest_maj 12
-#define latest_min 1
-#define latest_rev 0
+#define UTF8_LATEST	UNICODE_AGE(12, 1, 0)
 
 #define _test(cond, func, line, fmt, ...) do {				\
 		total_tests++;						\
@@ -160,29 +158,22 @@ static const struct {
 	}
 };
 
-static ssize_t utf8len(const struct utf8data *data, const char *s)
+static ssize_t utf8len(const struct unicode_map *um, enum utf8_normalization n,
+		const char *s)
 {
-	return utf8nlen(data, s, (size_t)-1);
+	return utf8nlen(um, n, s, (size_t)-1);
 }
 
-static int utf8cursor(struct utf8cursor *u8c, const struct utf8data *data,
-		const char *s)
+static int utf8cursor(struct utf8cursor *u8c, const struct unicode_map *um,
+		enum utf8_normalization n, const char *s)
 {
-	return utf8ncursor(u8c, data, s, (unsigned int)-1);
+	return utf8ncursor(u8c, um, n, s, (unsigned int)-1);
 }
 
-static void check_utf8_nfdi(void)
+static void check_utf8_nfdi(struct unicode_map *um)
 {
 	int i;
 	struct utf8cursor u8c;
-	const struct utf8data *data;
-
-	data = utf8nfdi(UNICODE_AGE(latest_maj, latest_min, latest_rev));
-	if (!data) {
-		pr_err("%s: Unable to load utf8-%d.%d.%d. Skipping.\n",
-		       __func__, latest_maj, latest_min, latest_rev);
-		return;
-	}
 
 	for (i = 0; i < ARRAY_SIZE(nfdi_test_data); i++) {
 		int len = strlen(nfdi_test_data[i].str);
@@ -190,10 +181,11 @@ static void check_utf8_nfdi(void)
 		int j = 0;
 		unsigned char c;
 
-		test((utf8len(data, nfdi_test_data[i].str) == nlen));
-		test((utf8nlen(data, nfdi_test_data[i].str, len) == nlen));
+		test((utf8len(um, UTF8_NFDI, nfdi_test_data[i].str) == nlen));
+		test((utf8nlen(um, UTF8_NFDI, nfdi_test_data[i].str, len) ==
+			nlen));
 
-		if (utf8cursor(&u8c, data, nfdi_test_data[i].str) < 0)
+		if (utf8cursor(&u8c, um, UTF8_NFDI, nfdi_test_data[i].str) < 0)
 			pr_err("can't create cursor\n");
 
 		while ((c = utf8byte(&u8c)) > 0) {
@@ -207,18 +199,10 @@ static void check_utf8_nfdi(void)
 	}
 }
 
-static void check_utf8_nfdicf(void)
+static void check_utf8_nfdicf(struct unicode_map *um)
 {
 	int i;
 	struct utf8cursor u8c;
-	const struct utf8data *data;
-
-	data = utf8nfdicf(UNICODE_AGE(latest_maj, latest_min, latest_rev));
-	if (!data) {
-		pr_err("%s: Unable to load utf8-%d.%d.%d. Skipping.\n",
-		       __func__, latest_maj, latest_min, latest_rev);
-		return;
-	}
 
 	for (i = 0; i < ARRAY_SIZE(nfdicf_test_data); i++) {
 		int len = strlen(nfdicf_test_data[i].str);
@@ -226,10 +210,13 @@ static void check_utf8_nfdicf(void)
 		int j = 0;
 		unsigned char c;
 
-		test((utf8len(data, nfdicf_test_data[i].str) == nlen));
-		test((utf8nlen(data, nfdicf_test_data[i].str, len) == nlen));
+		test((utf8len(um, UTF8_NFDICF, nfdicf_test_data[i].str) ==
+				nlen));
+		test((utf8nlen(um, UTF8_NFDICF, nfdicf_test_data[i].str, len) ==
+				nlen));
 
-		if (utf8cursor(&u8c, data, nfdicf_test_data[i].str) < 0)
+		if (utf8cursor(&u8c, um, UTF8_NFDICF,
+				nfdicf_test_data[i].str) < 0)
 			pr_err("can't create cursor\n");
 
 		while ((c = utf8byte(&u8c)) > 0) {
@@ -243,16 +230,9 @@ static void check_utf8_nfdicf(void)
 	}
 }
 
-static void check_utf8_comparisons(void)
+static void check_utf8_comparisons(struct unicode_map *table)
 {
 	int i;
-	struct unicode_map *table = utf8_load(UNICODE_AGE(12, 1, 0));
-
-	if (IS_ERR(table)) {
-		pr_err("%s: Unable to load utf8 %d.%d.%d. Skipping.\n",
-		       __func__, latest_maj, latest_min, latest_rev);
-		return;
-	}
 
 	for (i = 0; i < ARRAY_SIZE(nfdi_test_data); i++) {
 		const struct qstr s1 = {.name = nfdi_test_data[i].str,
@@ -273,8 +253,6 @@ static void check_utf8_comparisons(void)
 		test_f(!utf8_strncasecmp(table, &s1, &s2),
 		       "%s %s comparison mismatch\n", s1.name, s2.name);
 	}
-
-	utf8_unload(table);
 }
 
 static void check_supported_versions(void)
@@ -286,8 +264,7 @@ static void check_supported_versions(void)
 	test(utf8version_is_supported(UNICODE_AGE(9, 0, 0)));
 
 	/* Unicode 1x.0.0 (the latest version) should be supported. */
-	test(utf8version_is_supported(
-		UNICODE_AGE(latest_maj, latest_min, latest_rev)));
+	test(utf8version_is_supported(UTF8_LATEST));
 
 	/* Next versions don't exist. */
 	test(!utf8version_is_supported(UNICODE_AGE(13, 0, 0)));
@@ -297,19 +274,28 @@ static void check_supported_versions(void)
 
 static int __init init_test_ucd(void)
 {
+	struct unicode_map *um;
+
 	failed_tests = 0;
 	total_tests = 0;
 
+	um = utf8_load(UTF8_LATEST);
+	if (IS_ERR(um)) {
+		pr_err("%s: Unable to load utf8 table.\n", __func__);
+		return PTR_ERR(um);
+	}
+
 	check_supported_versions();
-	check_utf8_nfdi();
-	check_utf8_nfdicf();
-	check_utf8_comparisons();
+	check_utf8_nfdi(um);
+	check_utf8_nfdicf(um);
+	check_utf8_comparisons(um);
 
 	if (!failed_tests)
 		pr_info("All %u tests passed\n", total_tests);
 	else
 		pr_err("%u out of %u tests failed\n", failed_tests,
 		       total_tests);
+	utf8_unload(um);
 	return 0;
 }
 
diff --git a/fs/unicode/utf8n.h b/fs/unicode/utf8n.h
index 736b6460a38cb..206c89f0dbf71 100644
--- a/fs/unicode/utf8n.h
+++ b/fs/unicode/utf8n.h
@@ -39,7 +39,8 @@ extern const struct utf8data *utf8nfdicf(unsigned int maxage);
  * Returns 0 if only ignorable code points are present.
  * Returns -1 if the input is not valid UTF-8.
  */
-extern ssize_t utf8nlen(const struct utf8data *data, const char *s, size_t len);
+ssize_t utf8nlen(const struct unicode_map *um, enum utf8_normalization n,
+		const char *s, size_t len);
 
 /* Needed in struct utf8cursor below. */
 #define UTF8HANGULLEAF	(12)
@@ -48,7 +49,8 @@ extern ssize_t utf8nlen(const struct utf8data *data, const char *s, size_t len);
  * Cursor structure used by the normalizer.
  */
 struct utf8cursor {
-	const struct utf8data	*data;
+	const struct unicode_map *um;
+	enum utf8_normalization n;
 	const char	*s;
 	const char	*p;
 	const char	*ss;
@@ -65,8 +67,8 @@ struct utf8cursor {
  * Returns 0 on success.
  * Returns -1 on failure.
  */
-extern int utf8ncursor(struct utf8cursor *u8c, const struct utf8data *data,
-		       const char *s, size_t len);
+int utf8ncursor(struct utf8cursor *u8c, const struct unicode_map *um,
+		enum utf8_normalization n, const char *s, size_t len);
 
 /*
  * Get the next byte in the normalization.
diff --git a/include/linux/unicode.h b/include/linux/unicode.h
index 77bb915fd1f05..526ca8b8391a5 100644
--- a/include/linux/unicode.h
+++ b/include/linux/unicode.h
@@ -5,6 +5,8 @@
 #include <linux/init.h>
 #include <linux/dcache.h>
 
+struct utf8data;
+
 #define UNICODE_MAJ_SHIFT		16
 #define UNICODE_MIN_SHIFT		8
 
@@ -28,8 +30,25 @@ static inline u8 unicode_rev(unsigned int age)
 	return age & 0xff;
 }
 
+/*
+ * Two normalization forms are supported:
+ * 1) NFDI
+ *   - Apply unicode normalization form NFD.
+ *   - Remove any Default_Ignorable_Code_Point.
+ * 2) NFDICF
+ *   - Apply unicode normalization form NFD.
+ *   - Remove any Default_Ignorable_Code_Point.
+ *   - Apply a full casefold (C + F).
+ */
+enum utf8_normalization {
+	UTF8_NFDI = 0,
+	UTF8_NFDICF,
+	UTF8_NMAX,
+};
+
 struct unicode_map {
 	unsigned int version;
+	const struct utf8data *ntab[UTF8_NMAX];
 };
 
 int utf8_validate(const struct unicode_map *um, const struct qstr *str);
-- 
2.30.2



* [PATCH 10/11] unicode: Add utf8-data module
  2021-09-15  6:59 unicode cleanups, and split the data table into a separate module v2 Christoph Hellwig
                   ` (8 preceding siblings ...)
  2021-09-15  7:00 ` [PATCH 09/11] unicode: cache the normalization tables in struct unicode_map Christoph Hellwig
@ 2021-09-15  7:00 ` Christoph Hellwig
  2021-10-12 11:25   ` Gabriel Krisman Bertazi
  2021-09-15  7:00 ` [PATCH 11/11] unicode: only export internal symbols for the selftests Christoph Hellwig
  10 siblings, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2021-09-15  7:00 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Shreeya Patel, linux-fsdevel, linux-ext4, linux-f2fs-devel

utf8data.h contains a large, auto-generated table: the decoding trie
used by the unicode normalization functions.

Allow building it as a separate module.

Based on a patch from Shreeya Patel <shreeya.patel@collabora.com>.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
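A minimal userspace sketch of the backwards version search that
utf8_load() performs on the exported tables.  It is a variant of
find_table_version() in this patch, not a copy: the demo_* names are
hypothetical, and the explicit lower bound on the index is an addition
for tables that do not start with a zero-age entry.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct utf8data. */
struct demo_table {
	unsigned int maxage;
	unsigned int offset;
};

/*
 * Return the entry whose maxage equals version exactly, or NULL if
 * there is none.  Entries are assumed to be sorted by ascending maxage.
 */
static const struct demo_table *demo_find_version(
		const struct demo_table *table, size_t nr_entries,
		unsigned int version)
{
	size_t i = nr_entries;

	assert(table != NULL);

	/* Walk down from the newest entry, never stepping below index 0. */
	while (i > 0 && version < table[i - 1].maxage)
		i--;
	if (i == 0 || version > table[i - 1].maxage)
		return NULL;
	return &table[i - 1];
}
```

A version that falls between two entries is rejected rather than
rounded down, which matches the NULL return in the patch.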
 fs/unicode/Kconfig                            | 13 ++++-
 fs/unicode/Makefile                           | 13 ++---
 fs/unicode/mkutf8data.c                       | 24 ++++++++--
 fs/unicode/utf8-core.c                        | 35 +++++++++++---
 fs/unicode/utf8-norm.c                        | 48 ++++---------------
 fs/unicode/utf8-selftest.c                    | 16 +++----
 ...{utf8data.h_shipped => utf8data.c_shipped} | 22 +++++++--
 fs/unicode/utf8n.h                            | 40 ++++++++--------
 include/linux/unicode.h                       |  2 +
 9 files changed, 123 insertions(+), 90 deletions(-)
 rename fs/unicode/{utf8data.h_shipped => utf8data.c_shipped} (99%)

diff --git a/fs/unicode/Kconfig b/fs/unicode/Kconfig
index 2c27b9a5cd6ce..610d7bc05d6e3 100644
--- a/fs/unicode/Kconfig
+++ b/fs/unicode/Kconfig
@@ -8,7 +8,16 @@ config UNICODE
 	  Say Y here to enable UTF-8 NFD normalization and NFD+CF casefolding
 	  support.
 
+config UNICODE_UTF8_DATA
+	tristate "UTF-8 normalization and casefolding tables"
+	depends on UNICODE
+	default UNICODE
+	help
+	  This contains a large table of case foldings, which can be loaded as
+	  a separate module if you say M here.  To be on the safe side, stick
+	  to the default of Y.  Saying N here makes no sense: if you do not
+	  want utf8 casefolding support, disable CONFIG_UNICODE instead.
+
 config UNICODE_NORMALIZATION_SELFTEST
 	tristate "Test UTF-8 normalization support"
-	depends on UNICODE
-	default n
+	depends on UNICODE_UTF8_DATA
diff --git a/fs/unicode/Makefile b/fs/unicode/Makefile
index b88aecc865502..2f9d9188852b5 100644
--- a/fs/unicode/Makefile
+++ b/fs/unicode/Makefile
@@ -2,14 +2,15 @@
 
 obj-$(CONFIG_UNICODE) += unicode.o
 obj-$(CONFIG_UNICODE_NORMALIZATION_SELFTEST) += utf8-selftest.o
+obj-$(CONFIG_UNICODE_UTF8_DATA) += utf8data.o
 
 unicode-y := utf8-norm.o utf8-core.o
 
-$(obj)/utf8-norm.o: $(obj)/utf8data.h
+$(obj)/utf8data.o: $(obj)/utf8data.c
 
-# In the normal build, the checked-in utf8data.h is just shipped.
+# In the normal build, the checked-in utf8data.c is just shipped.
 #
-# To generate utf8data.h from UCD, put *.txt files in this directory
+# To generate utf8data.c from UCD, put *.txt files in this directory
 # and pass REGENERATE_UTF8DATA=1 from the command line.
 ifdef REGENERATE_UTF8DATA
 
@@ -24,15 +25,15 @@ quiet_cmd_utf8data = GEN     $@
 		-t $(srctree)/$(src)/NormalizationTest.txt \
 		-o $@
 
-$(obj)/utf8data.h: $(obj)/mkutf8data $(filter %.txt, $(cmd_utf8data)) FORCE
+$(obj)/utf8data.c: $(obj)/mkutf8data $(filter %.txt, $(cmd_utf8data)) FORCE
 	$(call if_changed,utf8data)
 
 else
 
-$(obj)/utf8data.h: $(src)/utf8data.h_shipped FORCE
+$(obj)/utf8data.c: $(src)/utf8data.c_shipped FORCE
 	$(call if_changed,shipped)
 
 endif
 
-targets += utf8data.h
+targets += utf8data.c
 hostprogs += mkutf8data
diff --git a/fs/unicode/mkutf8data.c b/fs/unicode/mkutf8data.c
index ff2025ac5a325..bc1a7c8b5c8df 100644
--- a/fs/unicode/mkutf8data.c
+++ b/fs/unicode/mkutf8data.c
@@ -3287,12 +3287,10 @@ static void write_file(void)
 		open_fail(utf8_name, errno);
 
 	fprintf(file, "/* This file is generated code, do not edit. */\n");
-	fprintf(file, "#ifndef __INCLUDED_FROM_UTF8NORM_C__\n");
-	fprintf(file, "#error Only nls_utf8-norm.c should include this file.\n");
-	fprintf(file, "#endif\n");
 	fprintf(file, "\n");
-	fprintf(file, "static const unsigned int utf8vers = %#x;\n",
-		unicode_maxage);
+	fprintf(file, "#include <linux/module.h>\n");
+	fprintf(file, "#include <linux/kernel.h>\n");
+	fprintf(file, "#include \"utf8n.h\"\n");
 	fprintf(file, "\n");
 	fprintf(file, "static const unsigned int utf8agetab[] = {\n");
 	for (i = 0; i != ages_count; i++)
@@ -3339,6 +3337,22 @@ static void write_file(void)
 		fprintf(file, "\n");
 	}
 	fprintf(file, "};\n");
+	fprintf(file, "\n");
+	fprintf(file, "struct utf8data_table utf8_data_table = {\n");
+	fprintf(file, "\t.utf8agetab = utf8agetab,\n");
+	fprintf(file, "\t.utf8agetab_size = ARRAY_SIZE(utf8agetab),\n");
+	fprintf(file, "\n");
+	fprintf(file, "\t.utf8nfdicfdata = utf8nfdicfdata,\n");
+	fprintf(file, "\t.utf8nfdicfdata_size = ARRAY_SIZE(utf8nfdicfdata),\n");
+	fprintf(file, "\n");
+	fprintf(file, "\t.utf8nfdidata = utf8nfdidata,\n");
+	fprintf(file, "\t.utf8nfdidata_size = ARRAY_SIZE(utf8nfdidata),\n");
+	fprintf(file, "\n");
+	fprintf(file, "\t.utf8data = utf8data,\n");
+	fprintf(file, "};\n");
+	fprintf(file, "EXPORT_SYMBOL_GPL(utf8_data_table);");
+	fprintf(file, "\n");
+	fprintf(file, "MODULE_LICENSE(\"GPL v2\");\n");
 	fclose(file);
 }
 
diff --git a/fs/unicode/utf8-core.c b/fs/unicode/utf8-core.c
index d9f713d38c0ad..38ca824f10158 100644
--- a/fs/unicode/utf8-core.c
+++ b/fs/unicode/utf8-core.c
@@ -160,25 +160,45 @@ int utf8_normalize(const struct unicode_map *um, const struct qstr *str,
 }
 EXPORT_SYMBOL(utf8_normalize);
 
+static const struct utf8data *find_table_version(const struct utf8data *table,
+		size_t nr_entries, unsigned int version)
+{
+	size_t i = nr_entries - 1;
+
+	while (version < table[i].maxage)
+		i--;
+	if (version > table[i].maxage)
+		return NULL;
+	return &table[i];
+}
+
 struct unicode_map *utf8_load(unsigned int version)
 {
 	struct unicode_map *um;
 
-	if (!utf8version_is_supported(version))
-		return ERR_PTR(-EINVAL);
-
 	um = kzalloc(sizeof(struct unicode_map), GFP_KERNEL);
 	if (!um)
 		return ERR_PTR(-ENOMEM);
 	um->version = version;
-	um->ntab[UTF8_NFDI] = utf8nfdi(version);
-	if (!um->ntab[UTF8_NFDI])
+
+	um->tables = symbol_request(utf8_data_table);
+	if (!um->tables)
 		goto out_free_um;
-	um->ntab[UTF8_NFDICF] = utf8nfdicf(version);
+
+	if (!utf8version_is_supported(um, version))
+		goto out_symbol_put;
+	um->ntab[UTF8_NFDI] = find_table_version(um->tables->utf8nfdidata,
+			um->tables->utf8nfdidata_size, um->version);
+	if (!um->ntab[UTF8_NFDI])
+		goto out_symbol_put;
+	um->ntab[UTF8_NFDICF] = find_table_version(um->tables->utf8nfdicfdata,
+			um->tables->utf8nfdicfdata_size, um->version);
 	if (!um->ntab[UTF8_NFDICF])
-		goto out_free_um;
+		goto out_symbol_put;
 	return um;
 
+out_symbol_put:
+	symbol_put(utf8_data_table);
 out_free_um:
 	kfree(um);
 	return ERR_PTR(-EINVAL);
@@ -187,6 +207,7 @@ EXPORT_SYMBOL(utf8_load);
 
 void utf8_unload(struct unicode_map *um)
 {
+	symbol_put(utf8_data_table);
 	kfree(um);
 }
 EXPORT_SYMBOL(utf8_unload);
diff --git a/fs/unicode/utf8-norm.c b/fs/unicode/utf8-norm.c
index 7c1f28ab31a80..829c7e2ad764a 100644
--- a/fs/unicode/utf8-norm.c
+++ b/fs/unicode/utf8-norm.c
@@ -6,21 +6,12 @@
 
 #include "utf8n.h"
 
-struct utf8data {
-	unsigned int maxage;
-	unsigned int offset;
-};
-
-#define __INCLUDED_FROM_UTF8NORM_C__
-#include "utf8data.h"
-#undef __INCLUDED_FROM_UTF8NORM_C__
-
-int utf8version_is_supported(unsigned int version)
+int utf8version_is_supported(const struct unicode_map *um, unsigned int version)
 {
-	int i = ARRAY_SIZE(utf8agetab) - 1;
+	int i = um->tables->utf8agetab_size - 1;
 
-	while (i >= 0 && utf8agetab[i] != 0) {
-		if (version == utf8agetab[i])
+	while (i >= 0 && um->tables->utf8agetab[i] != 0) {
+		if (version == um->tables->utf8agetab[i])
 			return 1;
 		i--;
 	}
@@ -161,7 +152,7 @@ typedef const unsigned char utf8trie_t;
  * underlying datatype: unsigned char.
  *
  * leaf[0]: The unicode version, stored as a generation number that is
- *          an index into utf8agetab[].  With this we can filter code
+ *          an index into ->utf8agetab[].  With this we can filter code
  *          points based on the unicode version in which they were
  *          defined.  The CCC of a non-defined code point is 0.
  * leaf[1]: Canonical Combining Class. During normalization, we need
@@ -313,7 +304,7 @@ static utf8leaf_t *utf8nlookup(const struct unicode_map *um,
 		enum utf8_normalization n, unsigned char *hangul, const char *s,
 		size_t len)
 {
-	utf8trie_t	*trie = utf8data + um->ntab[n]->offset;
+	utf8trie_t	*trie = um->tables->utf8data + um->ntab[n]->offset;
 	int		offlen;
 	int		offset;
 	int		mask;
@@ -404,7 +395,8 @@ ssize_t utf8nlen(const struct unicode_map *um, enum utf8_normalization n,
 		leaf = utf8nlookup(um, n, hangul, s, len);
 		if (!leaf)
 			return -1;
-		if (utf8agetab[LEAF_GEN(leaf)] > um->ntab[n]->maxage)
+		if (um->tables->utf8agetab[LEAF_GEN(leaf)] >
+		    um->ntab[n]->maxage)
 			ret += utf8clen(s);
 		else if (LEAF_CCC(leaf) == DECOMPOSE)
 			ret += strlen(LEAF_STR(leaf));
@@ -520,7 +512,7 @@ int utf8byte(struct utf8cursor *u8c)
 
 		ccc = LEAF_CCC(leaf);
 		/* Characters that are too new have CCC 0. */
-		if (utf8agetab[LEAF_GEN(leaf)] >
+		if (u8c->um->tables->utf8agetab[LEAF_GEN(leaf)] >
 		    u8c->um->ntab[u8c->n]->maxage) {
 			ccc = STOPPER;
 		} else if (ccc == DECOMPOSE) {
@@ -597,25 +589,3 @@ int utf8byte(struct utf8cursor *u8c)
 	}
 }
 EXPORT_SYMBOL(utf8byte);
-
-const struct utf8data *utf8nfdi(unsigned int maxage)
-{
-	int i = ARRAY_SIZE(utf8nfdidata) - 1;
-
-	while (maxage < utf8nfdidata[i].maxage)
-		i--;
-	if (maxage > utf8nfdidata[i].maxage)
-		return NULL;
-	return &utf8nfdidata[i];
-}
-
-const struct utf8data *utf8nfdicf(unsigned int maxage)
-{
-	int i = ARRAY_SIZE(utf8nfdicfdata) - 1;
-
-	while (maxage < utf8nfdicfdata[i].maxage)
-		i--;
-	if (maxage > utf8nfdicfdata[i].maxage)
-		return NULL;
-	return &utf8nfdicfdata[i];
-}
diff --git a/fs/unicode/utf8-selftest.c b/fs/unicode/utf8-selftest.c
index cfa3832b75f42..eb2bbdd688d71 100644
--- a/fs/unicode/utf8-selftest.c
+++ b/fs/unicode/utf8-selftest.c
@@ -255,21 +255,21 @@ static void check_utf8_comparisons(struct unicode_map *table)
 	}
 }
 
-static void check_supported_versions(void)
+static void check_supported_versions(struct unicode_map *um)
 {
 	/* Unicode 7.0.0 should be supported. */
-	test(utf8version_is_supported(UNICODE_AGE(7, 0, 0)));
+	test(utf8version_is_supported(um, UNICODE_AGE(7, 0, 0)));
 
 	/* Unicode 9.0.0 should be supported. */
-	test(utf8version_is_supported(UNICODE_AGE(9, 0, 0)));
+	test(utf8version_is_supported(um, UNICODE_AGE(9, 0, 0)));
 
 	/* Unicode 1x.0.0 (the latest version) should be supported. */
-	test(utf8version_is_supported(UTF8_LATEST));
+	test(utf8version_is_supported(um, UTF8_LATEST));
 
 	/* Next versions don't exist. */
-	test(!utf8version_is_supported(UNICODE_AGE(13, 0, 0)));
-	test(!utf8version_is_supported(UNICODE_AGE(0, 0, 0)));
-	test(!utf8version_is_supported(UNICODE_AGE(-1, -1, -1)));
+	test(!utf8version_is_supported(um, UNICODE_AGE(13, 0, 0)));
+	test(!utf8version_is_supported(um, UNICODE_AGE(0, 0, 0)));
+	test(!utf8version_is_supported(um, UNICODE_AGE(-1, -1, -1)));
 }
 
 static int __init init_test_ucd(void)
@@ -285,7 +285,7 @@ static int __init init_test_ucd(void)
 		return PTR_ERR(um);
 	}
 
-	check_supported_versions();
+	check_supported_versions(um);
 	check_utf8_nfdi(um);
 	check_utf8_nfdicf(um);
 	check_utf8_comparisons(um);
diff --git a/fs/unicode/utf8data.h_shipped b/fs/unicode/utf8data.c_shipped
similarity index 99%
rename from fs/unicode/utf8data.h_shipped
rename to fs/unicode/utf8data.c_shipped
index 76e4f0e1b0891..d9b62901aa96b 100644
--- a/fs/unicode/utf8data.h_shipped
+++ b/fs/unicode/utf8data.c_shipped
@@ -1,9 +1,8 @@
 /* This file is generated code, do not edit. */
-#ifndef __INCLUDED_FROM_UTF8NORM_C__
-#error Only nls_utf8-norm.c should include this file.
-#endif
 
-static const unsigned int utf8vers = 0xc0100;
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include "utf8n.h"
 
 static const unsigned int utf8agetab[] = {
 	0,
@@ -4107,3 +4106,18 @@ static const unsigned char utf8data[64256] = {
 	0x52,0x04,0x00,0x00,0x11,0x04,0x00,0x00,0x02,0x00,0xcf,0x86,0xcf,0x06,0x02,0x00,
 	0x81,0x80,0xcf,0x86,0x85,0x84,0xcf,0x86,0xcf,0x06,0x02,0x00,0x00,0x00,0x00,0x00
 };
+
+struct utf8data_table utf8_data_table = {
+	.utf8agetab = utf8agetab,
+	.utf8agetab_size = ARRAY_SIZE(utf8agetab),
+
+	.utf8nfdicfdata = utf8nfdicfdata,
+	.utf8nfdicfdata_size = ARRAY_SIZE(utf8nfdicfdata),
+
+	.utf8nfdidata = utf8nfdidata,
+	.utf8nfdidata_size = ARRAY_SIZE(utf8nfdidata),
+
+	.utf8data = utf8data,
+};
+EXPORT_SYMBOL_GPL(utf8_data_table);
+MODULE_LICENSE("GPL v2");
diff --git a/fs/unicode/utf8n.h b/fs/unicode/utf8n.h
index 206c89f0dbf71..bd00d587747a7 100644
--- a/fs/unicode/utf8n.h
+++ b/fs/unicode/utf8n.h
@@ -13,25 +13,7 @@
 #include <linux/module.h>
 #include <linux/unicode.h>
 
-int utf8version_is_supported(unsigned int version);
-
-/*
- * Look for the correct const struct utf8data for a unicode version.
- * Returns NULL if the version requested is too new.
- *
- * Two normalization forms are supported: nfdi and nfdicf.
- *
- * nfdi:
- *  - Apply unicode normalization form NFD.
- *  - Remove any Default_Ignorable_Code_Point.
- *
- * nfdicf:
- *  - Apply unicode normalization form NFD.
- *  - Remove any Default_Ignorable_Code_Point.
- *  - Apply a full casefold (C + F).
- */
-extern const struct utf8data *utf8nfdi(unsigned int maxage);
-extern const struct utf8data *utf8nfdicf(unsigned int maxage);
+int utf8version_is_supported(const struct unicode_map *um, unsigned int version);
 
 /*
  * Determine the length of the normalized from of the string,
@@ -78,4 +60,24 @@ int utf8ncursor(struct utf8cursor *u8c, const struct unicode_map *um,
  */
 extern int utf8byte(struct utf8cursor *u8c);
 
+struct utf8data {
+	unsigned int maxage;
+	unsigned int offset;
+};
+
+struct utf8data_table {
+	const unsigned int *utf8agetab;
+	int utf8agetab_size;
+
+	const struct utf8data *utf8nfdicfdata;
+	int utf8nfdicfdata_size;
+
+	const struct utf8data *utf8nfdidata;
+	int utf8nfdidata_size;
+
+	const unsigned char *utf8data;
+};
+
+extern struct utf8data_table utf8_data_table;
+
 #endif /* UTF8NORM_H */
diff --git a/include/linux/unicode.h b/include/linux/unicode.h
index 526ca8b8391a5..4d39e6e11a950 100644
--- a/include/linux/unicode.h
+++ b/include/linux/unicode.h
@@ -6,6 +6,7 @@
 #include <linux/dcache.h>
 
 struct utf8data;
+struct utf8data_table;
 
 #define UNICODE_MAJ_SHIFT		16
 #define UNICODE_MIN_SHIFT		8
@@ -49,6 +50,7 @@ enum utf8_normalization {
 struct unicode_map {
 	unsigned int version;
 	const struct utf8data *ntab[UTF8_NMAX];
+	const struct utf8data_table *tables;
 };
 
 int utf8_validate(const struct unicode_map *um, const struct qstr *str);
-- 
2.30.2



* [PATCH 11/11] unicode: only export internal symbols for the selftests
  2021-09-15  6:59 unicode cleanups, and split the data table into a separate module v2 Christoph Hellwig
                   ` (9 preceding siblings ...)
  2021-09-15  7:00 ` [PATCH 10/11] unicode: Add utf8-data module Christoph Hellwig
@ 2021-09-15  7:00 ` Christoph Hellwig
  10 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2021-09-15  7:00 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Shreeya Patel, linux-fsdevel, linux-ext4, linux-f2fs-devel

The exported symbols in utf8-norm.c are not needed for normal
file system consumers, so move them to conditional _GPL exports
just for the selftest.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/unicode/utf8-norm.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/unicode/utf8-norm.c b/fs/unicode/utf8-norm.c
index 829c7e2ad764a..768f8ab448b8f 100644
--- a/fs/unicode/utf8-norm.c
+++ b/fs/unicode/utf8-norm.c
@@ -17,7 +17,6 @@ int utf8version_is_supported(const struct unicode_map *um, unsigned int version)
 	}
 	return 0;
 }
-EXPORT_SYMBOL(utf8version_is_supported);
 
 /*
  * UTF-8 valid ranges.
@@ -407,7 +406,6 @@ ssize_t utf8nlen(const struct unicode_map *um, enum utf8_normalization n,
 	}
 	return ret;
 }
-EXPORT_SYMBOL(utf8nlen);
 
 /*
  * Set up an utf8cursor for use by utf8byte().
@@ -442,7 +440,6 @@ int utf8ncursor(struct utf8cursor *u8c, const struct unicode_map *um,
 		return -1;
 	return 0;
 }
-EXPORT_SYMBOL(utf8ncursor);
 
 /*
  * Get one byte from the normalized form of the string described by u8c.
@@ -588,4 +585,10 @@ int utf8byte(struct utf8cursor *u8c)
 		}
 	}
 }
-EXPORT_SYMBOL(utf8byte);
+
+#ifdef CONFIG_UNICODE_NORMALIZATION_SELFTEST_MODULE
+EXPORT_SYMBOL_GPL(utf8version_is_supported);
+EXPORT_SYMBOL_GPL(utf8nlen);
+EXPORT_SYMBOL_GPL(utf8ncursor);
+EXPORT_SYMBOL_GPL(utf8byte);
+#endif
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 10/11] unicode: Add utf8-data module
  2021-09-15  7:00 ` [PATCH 10/11] unicode: Add utf8-data module Christoph Hellwig
@ 2021-10-12 11:25   ` Gabriel Krisman Bertazi
  2021-10-12 12:49     ` Christoph Hellwig
  0 siblings, 1 reply; 21+ messages in thread
From: Gabriel Krisman Bertazi @ 2021-10-12 11:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Shreeya Patel, linux-fsdevel, linux-ext4, linux-f2fs-devel

Christoph Hellwig <hch@lst.de> writes:

> utf8data.h contains a large database table which is an auto-generated
> decodification trie for the unicode normalization functions.
>
> Allow building it into a separate module.
>
> Based on a patch from Shreeya Patel <shreeya.patel@collabora.com>.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/unicode/Kconfig                            | 13 ++++-
>  fs/unicode/Makefile                           | 13 ++---
>  fs/unicode/mkutf8data.c                       | 24 ++++++++--
>  fs/unicode/utf8-core.c                        | 35 +++++++++++---
>  fs/unicode/utf8-norm.c                        | 48 ++++---------------
>  fs/unicode/utf8-selftest.c                    | 16 +++----
>  ...{utf8data.h_shipped => utf8data.c_shipped} | 22 +++++++--
>  fs/unicode/utf8n.h                            | 40 ++++++++--------
>  include/linux/unicode.h                       |  2 +
>  9 files changed, 123 insertions(+), 90 deletions(-)
>  rename fs/unicode/{utf8data.h_shipped => utf8data.c_shipped} (99%)
>
> diff --git a/fs/unicode/Kconfig b/fs/unicode/Kconfig
> index 2c27b9a5cd6ce..610d7bc05d6e3 100644
> --- a/fs/unicode/Kconfig
> +++ b/fs/unicode/Kconfig
> @@ -8,7 +8,16 @@ config UNICODE
>  	  Say Y here to enable UTF-8 NFD normalization and NFD+CF casefolding
>  	  support.
>  
> +config UNICODE_UTF8_DATA
> +	tristate "UTF-8 normalization and casefolding tables"
> +	depends on UNICODE
> +	default UNICODE
> +	help
> +	  This contains a large table of case foldings, which can be loaded as
> +	  a separate module if you say M here.  To be on the safe side stick
> +	  to the default of Y.  Saying N here makes no sense, if you do not want
> +	  utf8 casefolding support, disable CONFIG_UNICODE instead.
> +
>  config UNICODE_NORMALIZATION_SELFTEST
>  	tristate "Test UTF-8 normalization support"
> -	depends on UNICODE
> -	default n
> +	depends on UNICODE_UTF8_DATA
> diff --git a/fs/unicode/Makefile b/fs/unicode/Makefile
> index b88aecc865502..2f9d9188852b5 100644
> --- a/fs/unicode/Makefile
> +++ b/fs/unicode/Makefile
> @@ -2,14 +2,15 @@
>  
>  obj-$(CONFIG_UNICODE) += unicode.o
>  obj-$(CONFIG_UNICODE_NORMALIZATION_SELFTEST) += utf8-selftest.o
> +obj-$(CONFIG_UNICODE_UTF8_DATA) += utf8data.o
>  
>  unicode-y := utf8-norm.o utf8-core.o
>  
> -$(obj)/utf8-norm.o: $(obj)/utf8data.h
> +$(obj)/utf8-data.o: $(obj)/utf8data.c
>  
> -# In the normal build, the checked-in utf8data.h is just shipped.
> +# In the normal build, the checked-in utf8data.c is just shipped.
>  #
> -# To generate utf8data.h from UCD, put *.txt files in this directory
> +# To generate utf8data.c from UCD, put *.txt files in this directory
>  # and pass REGENERATE_UTF8DATA=1 from the command line.
>  ifdef REGENERATE_UTF8DATA
>  
> @@ -24,15 +25,15 @@ quiet_cmd_utf8data = GEN     $@
>  		-t $(srctree)/$(src)/NormalizationTest.txt \
>  		-o $@
>  
> -$(obj)/utf8data.h: $(obj)/mkutf8data $(filter %.txt, $(cmd_utf8data)) FORCE
> +$(obj)/utf8data.c: $(obj)/mkutf8data $(filter %.txt, $(cmd_utf8data)) FORCE
>  	$(call if_changed,utf8data)
>  
>  else
>  
> -$(obj)/utf8data.h: $(src)/utf8data.h_shipped FORCE
> +$(obj)/utf8data.c: $(src)/utf8data.c_shipped FORCE
>  	$(call if_changed,shipped)
>  
>  endif
>  
> -targets += utf8data.h
> +targets += utf8data.c
>  hostprogs += mkutf8data
> diff --git a/fs/unicode/mkutf8data.c b/fs/unicode/mkutf8data.c
> index ff2025ac5a325..bc1a7c8b5c8df 100644
> --- a/fs/unicode/mkutf8data.c
> +++ b/fs/unicode/mkutf8data.c
> @@ -3287,12 +3287,10 @@ static void write_file(void)
>  		open_fail(utf8_name, errno);
>  
>  	fprintf(file, "/* This file is generated code, do not edit. */\n");
> -	fprintf(file, "#ifndef __INCLUDED_FROM_UTF8NORM_C__\n");
> -	fprintf(file, "#error Only nls_utf8-norm.c should include this file.\n");
> -	fprintf(file, "#endif\n");
>  	fprintf(file, "\n");
> -	fprintf(file, "static const unsigned int utf8vers = %#x;\n",
> -		unicode_maxage);
> +	fprintf(file, "#include <linux/module.h>\n");
> +	fprintf(file, "#include <linux/kernel.h>\n");
> +	fprintf(file, "#include \"utf8n.h\"\n");
>  	fprintf(file, "\n");
>  	fprintf(file, "static const unsigned int utf8agetab[] = {\n");
>  	for (i = 0; i != ages_count; i++)
> @@ -3339,6 +3337,22 @@ static void write_file(void)
>  		fprintf(file, "\n");
>  	}
>  	fprintf(file, "};\n");
> +	fprintf(file, "\n");
> +	fprintf(file, "struct utf8data_table utf8_data_table = {\n");
> +	fprintf(file, "\t.utf8agetab = utf8agetab,\n");
> +	fprintf(file, "\t.utf8agetab_size = ARRAY_SIZE(utf8agetab),\n");
> +	fprintf(file, "\n");
> +	fprintf(file, "\t.utf8nfdicfdata = utf8nfdicfdata,\n");
> +	fprintf(file, "\t.utf8nfdicfdata_size = ARRAY_SIZE(utf8nfdicfdata),\n");
> +	fprintf(file, "\n");
> +	fprintf(file, "\t.utf8nfdidata = utf8nfdidata,\n");
> +	fprintf(file, "\t.utf8nfdidata_size = ARRAY_SIZE(utf8nfdidata),\n");
> +	fprintf(file, "\n");
> +	fprintf(file, "\t.utf8data = utf8data,\n");
> +	fprintf(file, "};\n");
> +	fprintf(file, "EXPORT_SYMBOL_GPL(utf8_data_table);");
> +	fprintf(file, "\n");
> +	fprintf(file, "MODULE_LICENSE(\"GPL v2\");\n");
>  	fclose(file);
>  }
>  
> diff --git a/fs/unicode/utf8-core.c b/fs/unicode/utf8-core.c
> index d9f713d38c0ad..38ca824f10158 100644
> --- a/fs/unicode/utf8-core.c
> +++ b/fs/unicode/utf8-core.c
> @@ -160,25 +160,45 @@ int utf8_normalize(const struct unicode_map *um, const struct qstr *str,
>  }
>  EXPORT_SYMBOL(utf8_normalize);
>  
> +static const struct utf8data *find_table_version(const struct utf8data *table,
> +		size_t nr_entries, unsigned int version)
> +{
> +	size_t i = nr_entries - 1;
> +
> +	while (version < table[i].maxage)
> +		i--;
> +	if (version > table[i].maxage)
> +		return NULL;
> +	return &table[i];
> +}
> +
>  struct unicode_map *utf8_load(unsigned int version)
>  {
>  	struct unicode_map *um;
>  
> -	if (!utf8version_is_supported(version))
> -		return ERR_PTR(-EINVAL);
> -
>  	um = kzalloc(sizeof(struct unicode_map), GFP_KERNEL);
>  	if (!um)
>  		return ERR_PTR(-ENOMEM);
>  	um->version = version;
> -	um->ntab[UTF8_NFDI] = utf8nfdi(version);
> -	if (!um->ntab[UTF8_NFDI])
> +
> +	um->tables = symbol_request(utf8_data_table);
> +	if (!um->tables)
>  		goto out_free_um;
> -	um->ntab[UTF8_NFDICF] = utf8nfdicf(version);
> +
> +	if (!utf8version_is_supported(um, version))
> +		goto out_symbol_put;
> +	um->ntab[UTF8_NFDI] = find_table_version(um->tables->utf8nfdidata,
> +			um->tables->utf8nfdidata_size, um->version);
> +	if (!um->ntab[UTF8_NFDI])
> +		goto out_symbol_put;
> +	um->ntab[UTF8_NFDICF] = find_table_version(um->tables->utf8nfdicfdata,
> +			um->tables->utf8nfdicfdata_size, um->version);
>  	if (!um->ntab[UTF8_NFDICF])
> -		goto out_free_um;
> +		goto out_symbol_put;
>  	return um;
>  
> +out_symbol_put:
> +	symbol_put(um->tables);
>  out_free_um:
>  	kfree(um);
>  	return ERR_PTR(-EINVAL);
> @@ -187,6 +207,7 @@ EXPORT_SYMBOL(utf8_load);
>  
>  void utf8_unload(struct unicode_map *um)
>  {
> +	symbol_put(utf8_data_table);

This triggers a BUG_ON if the symbol isn't loaded/loadable,
i.e. ext4_fill_super fails early.  I'm not sure how to fix it, though.


 Failed to find symbol utf8_data_table
 ------------[ cut here ]------------
 kernel BUG at kernel/module.c:1022!
 invalid opcode: 0000 [#1] SMP
 CPU: 1 PID: 387 Comm: mount Not tainted 5.15.0-rc4-for-next_5.15 #5
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
 RIP: 0010:__symbol_put+0x88/0x90
 Code: 84 c0 74 26 48 8b 7c 24 10 e8 44 f9 ff ff 65 ff 0d 1d 44 ea 7e 48 8b 44 24 30 65 48 33 04 25 28 00 00 00 75 07 48 83 c4 38 c3 <0f> 0b e8 51 ca a9 00 90 0f 1f 44 00 00 48 63 46 04 48 8d 74

 RSP: 0018:ffffc90000623cc0 EFLAGS: 00010246
 RAX: 0000000000000000 RBX: ffff888102e91490 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: ffff88813b9d7860 RDI: ffff88813b9d7868
 RBP: ffffc90000623de0 R08: 0000000000000000 R09: c0000000ffffefff
 R10: ffffc900006239d8 R11: ffffc900006239d0 R12: 00000000ffffffea
 R13: 0000000000000000 R14: ffff888102e94000 R15: ffff888102e91000
 FS:  00007efcab508800(0000) GS:ffff88813b800000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007ff08eec56f4 CR3: 0000000102f31000 CR4: 00000000000006e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  ext4_fill_super+0x289/0x32b0
  ? bdev_name.isra.7+0x53/0xd0
  ? vsnprintf+0x379/0x520
  ? ext4_enable_quotas+0x260/0x260
  ? mount_bdev+0x18a/0x1c0
  ? ext4_enable_quotas+0x260/0x260
  mount_bdev+0x18a/0x1c0
  legacy_get_tree+0x30/0x50
  vfs_get_tree+0x23/0x90
  ? ns_capable_common+0x2b/0x50
  path_mount+0x6da/0xa50
  ? kmem_cache_free+0xf4/0x140
  do_mount+0x75/0x90
  __x64_sys_mount+0xc4/0xe0
  do_syscall_64+0x3a/0xb0
  entry_SYSCALL_64_after_hwframe+0x44/0xae
 RIP: 0033:0x7efcab71f6ba
 Code: 48 8b 0d b1 f7 0b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 7e f7 0b 00 f7 d8 64 89

 RSP: 002b:00007ffefb824338 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
 RAX: ffffffffffffffda RBX: 00007efcab873264 RCX: 00007efcab71f6ba
 RDX: 000055a2867dad10 RSI: 000055a2867d40f0 RDI: 000055a2867d40d0
 RBP: 000055a2867d3ea0 R08: 0000000000000000 R09: 000055a2867d3010
 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
 R13: 000055a2867d40d0 R14: 000055a2867dad10 R15: 000055a2867d3ea0
 Modules linked in:
 ---[ end trace abcd43d820168730 ]---



>  	kfree(um);
>  }
>  EXPORT_SYMBOL(utf8_unload);
> diff --git a/fs/unicode/utf8-norm.c b/fs/unicode/utf8-norm.c
> index 7c1f28ab31a80..829c7e2ad764a 100644
> --- a/fs/unicode/utf8-norm.c
> +++ b/fs/unicode/utf8-norm.c
> @@ -6,21 +6,12 @@
>  
>  #include "utf8n.h"
>  
> -struct utf8data {
> -	unsigned int maxage;
> -	unsigned int offset;
> -};
> -
> -#define __INCLUDED_FROM_UTF8NORM_C__
> -#include "utf8data.h"
> -#undef __INCLUDED_FROM_UTF8NORM_C__
> -
> -int utf8version_is_supported(unsigned int version)
> +int utf8version_is_supported(const struct unicode_map *um, unsigned int version)
>  {
> -	int i = ARRAY_SIZE(utf8agetab) - 1;
> +	int i = um->tables->utf8agetab_size - 1;
>  
> -	while (i >= 0 && utf8agetab[i] != 0) {
> -		if (version == utf8agetab[i])
> +	while (i >= 0 && um->tables->utf8agetab[i] != 0) {
> +		if (version == um->tables->utf8agetab[i])
>  			return 1;
>  		i--;
>  	}
> @@ -161,7 +152,7 @@ typedef const unsigned char utf8trie_t;
>   * underlying datatype: unsigned char.
>   *
>   * leaf[0]: The unicode version, stored as a generation number that is
> - *          an index into utf8agetab[].  With this we can filter code
> + *          an index into ->utf8agetab[].  With this we can filter code
>   *          points based on the unicode version in which they were
>   *          defined.  The CCC of a non-defined code point is 0.
>   * leaf[1]: Canonical Combining Class. During normalization, we need
> @@ -313,7 +304,7 @@ static utf8leaf_t *utf8nlookup(const struct unicode_map *um,
>  		enum utf8_normalization n, unsigned char *hangul, const char *s,
>  		size_t len)
>  {
> -	utf8trie_t	*trie = utf8data + um->ntab[n]->offset;
> +	utf8trie_t	*trie = um->tables->utf8data + um->ntab[n]->offset;
>  	int		offlen;
>  	int		offset;
>  	int		mask;
> @@ -404,7 +395,8 @@ ssize_t utf8nlen(const struct unicode_map *um, enum utf8_normalization n,
>  		leaf = utf8nlookup(um, n, hangul, s, len);
>  		if (!leaf)
>  			return -1;
> -		if (utf8agetab[LEAF_GEN(leaf)] > um->ntab[n]->maxage)
> +		if (um->tables->utf8agetab[LEAF_GEN(leaf)] >
> +		    um->ntab[n]->maxage)
>  			ret += utf8clen(s);
>  		else if (LEAF_CCC(leaf) == DECOMPOSE)
>  			ret += strlen(LEAF_STR(leaf));
> @@ -520,7 +512,7 @@ int utf8byte(struct utf8cursor *u8c)
>  
>  		ccc = LEAF_CCC(leaf);
>  		/* Characters that are too new have CCC 0. */
> -		if (utf8agetab[LEAF_GEN(leaf)] >
> +		if (u8c->um->tables->utf8agetab[LEAF_GEN(leaf)] >
>  		    u8c->um->ntab[u8c->n]->maxage) {
>  			ccc = STOPPER;
>  		} else if (ccc == DECOMPOSE) {
> @@ -597,25 +589,3 @@ int utf8byte(struct utf8cursor *u8c)
>  	}
>  }
>  EXPORT_SYMBOL(utf8byte);
> -
> -const struct utf8data *utf8nfdi(unsigned int maxage)
> -{
> -	int i = ARRAY_SIZE(utf8nfdidata) - 1;
> -
> -	while (maxage < utf8nfdidata[i].maxage)
> -		i--;
> -	if (maxage > utf8nfdidata[i].maxage)
> -		return NULL;
> -	return &utf8nfdidata[i];
> -}
> -
> -const struct utf8data *utf8nfdicf(unsigned int maxage)
> -{
> -	int i = ARRAY_SIZE(utf8nfdicfdata) - 1;
> -
> -	while (maxage < utf8nfdicfdata[i].maxage)
> -		i--;
> -	if (maxage > utf8nfdicfdata[i].maxage)
> -		return NULL;
> -	return &utf8nfdicfdata[i];
> -}
> diff --git a/fs/unicode/utf8-selftest.c b/fs/unicode/utf8-selftest.c
> index cfa3832b75f42..eb2bbdd688d71 100644
> --- a/fs/unicode/utf8-selftest.c
> +++ b/fs/unicode/utf8-selftest.c
> @@ -255,21 +255,21 @@ static void check_utf8_comparisons(struct unicode_map *table)
>  	}
>  }
>  
> -static void check_supported_versions(void)
> +static void check_supported_versions(struct unicode_map *um)
>  {
>  	/* Unicode 7.0.0 should be supported. */
> -	test(utf8version_is_supported(UNICODE_AGE(7, 0, 0)));
> +	test(utf8version_is_supported(um, UNICODE_AGE(7, 0, 0)));
>  
>  	/* Unicode 9.0.0 should be supported. */
> -	test(utf8version_is_supported(UNICODE_AGE(9, 0, 0)));
> +	test(utf8version_is_supported(um, UNICODE_AGE(9, 0, 0)));
>  
>  	/* Unicode 1x.0.0 (the latest version) should be supported. */
> -	test(utf8version_is_supported(UTF8_LATEST));
> +	test(utf8version_is_supported(um, UTF8_LATEST));
>  
>  	/* Next versions don't exist. */
> -	test(!utf8version_is_supported(UNICODE_AGE(13, 0, 0)));
> -	test(!utf8version_is_supported(UNICODE_AGE(0, 0, 0)));
> -	test(!utf8version_is_supported(UNICODE_AGE(-1, -1, -1)));
> +	test(!utf8version_is_supported(um, UNICODE_AGE(13, 0, 0)));
> +	test(!utf8version_is_supported(um, UNICODE_AGE(0, 0, 0)));
> +	test(!utf8version_is_supported(um, UNICODE_AGE(-1, -1, -1)));
>  }
>  
>  static int __init init_test_ucd(void)
> @@ -285,7 +285,7 @@ static int __init init_test_ucd(void)
>  		return PTR_ERR(um);
>  	}
>  
> -	check_supported_versions();
> +	check_supported_versions(um);
>  	check_utf8_nfdi(um);
>  	check_utf8_nfdicf(um);
>  	check_utf8_comparisons(um);
> diff --git a/fs/unicode/utf8data.h_shipped b/fs/unicode/utf8data.c_shipped
> similarity index 99%
> rename from fs/unicode/utf8data.h_shipped
> rename to fs/unicode/utf8data.c_shipped
> index 76e4f0e1b0891..d9b62901aa96b 100644
> --- a/fs/unicode/utf8data.h_shipped
> +++ b/fs/unicode/utf8data.c_shipped
> @@ -1,9 +1,8 @@
>  /* This file is generated code, do not edit. */
> -#ifndef __INCLUDED_FROM_UTF8NORM_C__
> -#error Only nls_utf8-norm.c should include this file.
> -#endif
>  
> -static const unsigned int utf8vers = 0xc0100;
> +#include <linux/module.h>
> +#include <linux/kernel.h>
> +#include "utf8n.h"
>  
>  static const unsigned int utf8agetab[] = {
>  	0,
> @@ -4107,3 +4106,18 @@ static const unsigned char utf8data[64256] = {
>  	0x52,0x04,0x00,0x00,0x11,0x04,0x00,0x00,0x02,0x00,0xcf,0x86,0xcf,0x06,0x02,0x00,
>  	0x81,0x80,0xcf,0x86,0x85,0x84,0xcf,0x86,0xcf,0x06,0x02,0x00,0x00,0x00,0x00,0x00
>  };
> +
> +struct utf8data_table utf8_data_table = {
> +	.utf8agetab = utf8agetab,
> +	.utf8agetab_size = ARRAY_SIZE(utf8agetab),
> +
> +	.utf8nfdicfdata = utf8nfdicfdata,
> +	.utf8nfdicfdata_size = ARRAY_SIZE(utf8nfdicfdata),
> +
> +	.utf8nfdidata = utf8nfdidata,
> +	.utf8nfdidata_size = ARRAY_SIZE(utf8nfdidata),
> +
> +	.utf8data = utf8data,
> +};
> +EXPORT_SYMBOL_GPL(utf8_data_table);
> +MODULE_LICENSE("GPL v2");
> diff --git a/fs/unicode/utf8n.h b/fs/unicode/utf8n.h
> index 206c89f0dbf71..bd00d587747a7 100644
> --- a/fs/unicode/utf8n.h
> +++ b/fs/unicode/utf8n.h
> @@ -13,25 +13,7 @@
>  #include <linux/module.h>
>  #include <linux/unicode.h>
>  
> -int utf8version_is_supported(unsigned int version);
> -
> -/*
> - * Look for the correct const struct utf8data for a unicode version.
> - * Returns NULL if the version requested is too new.
> - *
> - * Two normalization forms are supported: nfdi and nfdicf.
> - *
> - * nfdi:
> - *  - Apply unicode normalization form NFD.
> - *  - Remove any Default_Ignorable_Code_Point.
> - *
> - * nfdicf:
> - *  - Apply unicode normalization form NFD.
> - *  - Remove any Default_Ignorable_Code_Point.
> - *  - Apply a full casefold (C + F).
> - */
> -extern const struct utf8data *utf8nfdi(unsigned int maxage);
> -extern const struct utf8data *utf8nfdicf(unsigned int maxage);
> +int utf8version_is_supported(const struct unicode_map *um, unsigned int version);
>  
>  /*
>   * Determine the length of the normalized from of the string,
> @@ -78,4 +60,24 @@ int utf8ncursor(struct utf8cursor *u8c, const struct unicode_map *um,
>   */
>  extern int utf8byte(struct utf8cursor *u8c);
>  
> +struct utf8data {
> +	unsigned int maxage;
> +	unsigned int offset;
> +};
> +
> +struct utf8data_table {
> +	const unsigned int *utf8agetab;
> +	int utf8agetab_size;
> +
> +	const struct utf8data *utf8nfdicfdata;
> +	int utf8nfdicfdata_size;
> +
> +	const struct utf8data *utf8nfdidata;
> +	int utf8nfdidata_size;
> +
> +	const unsigned char *utf8data;
> +};
> +
> +extern struct utf8data_table utf8_data_table;
> +
>  #endif /* UTF8NORM_H */
> diff --git a/include/linux/unicode.h b/include/linux/unicode.h
> index 526ca8b8391a5..4d39e6e11a950 100644
> --- a/include/linux/unicode.h
> +++ b/include/linux/unicode.h
> @@ -6,6 +6,7 @@
>  #include <linux/dcache.h>
>  
>  struct utf8data;
> +struct utf8data_table;
>  
>  #define UNICODE_MAJ_SHIFT		16
>  #define UNICODE_MIN_SHIFT		8
> @@ -49,6 +50,7 @@ enum utf8_normalization {
>  struct unicode_map {
>  	unsigned int version;
>  	const struct utf8data *ntab[UTF8_NMAX];
> +	const struct utf8data_table *tables;
>  };
>  
>  int utf8_validate(const struct unicode_map *um, const struct qstr *str);
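
As a side note on find_table_version() quoted above: the loop has no lower
bound check on i, so it appears to rely on the table being sorted by maxage
with a first entry whose maxage is 0. The sketch below is a userspace rendering
under that assumption; demo_nfdidata[] and demo_find_table_version() are
hypothetical, and the assert() is added only to make the assumption explicit.

```c
#include <assert.h>
#include <stddef.h>

struct utf8data {
	unsigned int maxage;
	unsigned int offset;
};

/* Hypothetical table: ascending maxage, first entry has maxage == 0. */
static const struct utf8data demo_nfdidata[] = {
	{ 0x000000,   0 },
	{ 0x070000, 100 },
	{ 0x090000, 200 },
	{ 0x0c0100, 300 },
};

static const struct utf8data *
demo_find_table_version(const struct utf8data *table, size_t nr_entries,
			unsigned int version)
{
	size_t i;

	/* Assumption made explicit: without this the loop could underflow. */
	assert(nr_entries > 0 && table[0].maxage == 0);

	i = nr_entries - 1;
	/* Walk down to the newest entry that is not newer than version. */
	while (version < table[i].maxage)
		i--;
	/* Only an exact match is accepted. */
	if (version > table[i].maxage)
		return NULL;
	return &table[i];
}
```

With this table, 9.0.0 resolves to the third entry, while 8.0.0 and 13.0.0
both yield NULL because neither matches an entry exactly.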

-- 
Gabriel Krisman Bertazi

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 10/11] unicode: Add utf8-data module
  2021-10-12 11:25   ` Gabriel Krisman Bertazi
@ 2021-10-12 12:49     ` Christoph Hellwig
  2021-10-12 14:40       ` Gabriel Krisman Bertazi
  0 siblings, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2021-10-12 12:49 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Christoph Hellwig, Shreeya Patel, linux-fsdevel, linux-ext4,
	linux-f2fs-devel

[fullquote deleted]

On Tue, Oct 12, 2021 at 08:25:23AM -0300, Gabriel Krisman Bertazi wrote:
> > @@ -187,6 +207,7 @@ EXPORT_SYMBOL(utf8_load);
> >  
> >  void utf8_unload(struct unicode_map *um)
> >  {
> > +	symbol_put(utf8_data_table);
> 
> This triggers a BUG_ON if the symbol isn't loaded/loadable,
> i.e. ext4_fill_super fails early.  I'm not sure how to fix it, though.

Does this fix it?

diff --git a/fs/unicode/utf8-core.c b/fs/unicode/utf8-core.c
index 38ca824f10158..67aaadc3ab072 100644
--- a/fs/unicode/utf8-core.c
+++ b/fs/unicode/utf8-core.c
@@ -207,8 +207,10 @@ EXPORT_SYMBOL(utf8_load);
 
 void utf8_unload(struct unicode_map *um)
 {
-	symbol_put(utf8_data_table);
-	kfree(um);
+	if (um) {
+		symbol_put(utf8_data_table);
+		kfree(um);
+	}
 }
 EXPORT_SYMBOL(utf8_unload);
 

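To illustrate why the guard helps, here is a small userspace model of the
failure, under the assumption that the early ext4_fill_super error path ends
up calling utf8_unload() with a NULL map after utf8_load() never took a
reference on the table. Everything prefixed demo_ is hypothetical: a counter
stands in for the module reference, and a second counter records where the
kernel would hit the BUG in __symbol_put. demo_load() returns NULL on failure
for simplicity, whereas utf8_load() returns an ERR_PTR.

```c
#include <stdlib.h>

struct demo_map {
	unsigned int version;
};

static int demo_table_refs;	/* models references held on the data table */
static int demo_bug_hits;	/* counts puts that had nothing to release */

/* Models symbol_put(): releasing a reference nobody holds is a bug. */
static void demo_symbol_put(void)
{
	if (demo_table_refs == 0) {
		demo_bug_hits++;
		return;
	}
	demo_table_refs--;
}

/* Models utf8_load(): a reference is taken only when the load succeeds. */
static struct demo_map *demo_load(int table_available, unsigned int version)
{
	struct demo_map *um;

	if (!table_available)
		return NULL;
	um = calloc(1, sizeof(*um));
	if (!um)
		return NULL;
	demo_table_refs++;
	um->version = version;
	return um;
}

/* Before the fix: always drops a reference, even for a NULL map. */
static void demo_unload_unguarded(struct demo_map *um)
{
	demo_symbol_put();
	free(um);
}

/* After the fix: a NULL map means no reference was ever taken. */
static void demo_unload_guarded(struct demo_map *um)
{
	if (um) {
		demo_symbol_put();
		free(um);
	}
}
```

In this model the unguarded variant records a bug when handed NULL, while the
guarded one is a no-op for NULL and still balances the reference for a map
that was actually loaded.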
^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 10/11] unicode: Add utf8-data module
  2021-10-12 12:49     ` Christoph Hellwig
@ 2021-10-12 14:40       ` Gabriel Krisman Bertazi
  2021-10-26  7:45         ` Christoph Hellwig
  0 siblings, 1 reply; 21+ messages in thread
From: Gabriel Krisman Bertazi @ 2021-10-12 14:40 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Shreeya Patel, linux-fsdevel, linux-ext4, linux-f2fs-devel

Christoph Hellwig <hch@lst.de> writes:

> [fullquote deleted]
>
> On Tue, Oct 12, 2021 at 08:25:23AM -0300, Gabriel Krisman Bertazi wrote:
>> > @@ -187,6 +207,7 @@ EXPORT_SYMBOL(utf8_load);
>> >  
>> >  void utf8_unload(struct unicode_map *um)
>> >  {
>> > +	symbol_put(utf8_data_table);
>> 
>> This triggers a BUG_ON if the symbol isn't loaded/loadable,
>> i.e. ext4_fill_super fails early.  I'm not sure how to fix it, though.
>
> Does this fix it?

Yes, it does.

I will fold this into the original patch and queue this series for 5.16.

Thank you,

-- 
Gabriel Krisman Bertazi

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 10/11] unicode: Add utf8-data module
  2021-10-12 14:40       ` Gabriel Krisman Bertazi
@ 2021-10-26  7:45         ` Christoph Hellwig
  2021-10-26 13:56           ` Gabriel Krisman Bertazi
  0 siblings, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2021-10-26  7:45 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Christoph Hellwig, Shreeya Patel, linux-fsdevel, linux-ext4,
	linux-f2fs-devel

On Tue, Oct 12, 2021 at 11:40:56AM -0300, Gabriel Krisman Bertazi wrote:
> > Does this fix it?
> 
> Yes, it does.
> 
> I will fold this into the original patch and queue this series for 5.16.

This series still doesn't seem to be queued up.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 10/11] unicode: Add utf8-data module
  2021-10-26  7:45         ` Christoph Hellwig
@ 2021-10-26 13:56           ` Gabriel Krisman Bertazi
  2021-10-26 22:02             ` Stephen Rothwell
  0 siblings, 1 reply; 21+ messages in thread
From: Gabriel Krisman Bertazi @ 2021-10-26 13:56 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Shreeya Patel, linux-fsdevel, linux-ext4, linux-f2fs-devel,
	Stephen Rothwell

Christoph Hellwig <hch@lst.de> writes:

> On Tue, Oct 12, 2021 at 11:40:56AM -0300, Gabriel Krisman Bertazi wrote:
>> > Does this fix it?
>> 
>> Yes, it does.
>> 
>> I will fold this into the original patch and queue this series for 5.16.
>
> This series still doesn't seem to be queued up.

Hm, I'm keeping it here:

https://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode.git/log/?h=for-next_5.16

Sorry, but I'm not sure what the process is to get tracked by
linux-next.  I'm Cc'ing Stephen to hopefully help me figure it out.

Thanks,

-- 
Gabriel Krisman Bertazi

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 10/11] unicode: Add utf8-data module
  2021-10-26 13:56           ` Gabriel Krisman Bertazi
@ 2021-10-26 22:02             ` Stephen Rothwell
  2021-10-28  2:00               ` Track unicode tree in linux-next (was Re: [PATCH 10/11] unicode: Add utf8-data module) Gabriel Krisman Bertazi
  0 siblings, 1 reply; 21+ messages in thread
From: Stephen Rothwell @ 2021-10-26 22:02 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Christoph Hellwig, Shreeya Patel, linux-fsdevel, linux-ext4,
	linux-f2fs-devel

[-- Attachment #1: Type: text/plain, Size: 1211 bytes --]

Hi Gabriel,

On Tue, 26 Oct 2021 10:56:20 -0300 Gabriel Krisman Bertazi <krisman@collabora.com> wrote:
>
> Christoph Hellwig <hch@lst.de> writes:
> 
> > On Tue, Oct 12, 2021 at 11:40:56AM -0300, Gabriel Krisman Bertazi wrote:
> >> > Does this fix it?
> >>
> >> Yes, it does.
> >>
> >> I will fold this into the original patch and queue this series for 5.16.
> >
> > This series still doesn't seem to be queued up.
> 
> Hm, I'm keeping it here:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode.git/log/?h=for-next_5.16
> 
> Sorry, but I'm not sure what the process is to get tracked by
> linux-next.  I'm Cc'ing Stephen to hopefully help me figure it out.

You just need to send me a git URL for your tree/branch (not a cgit or
gitweb URL, please), plus some idea of what the tree includes and how it
is sent to Linus (directly or via another tree).  The branch should
have a generic name (i.e. not including a version) as I will continue
to fetch that branch every day until you tell me to stop.  When your
code is ready to be included in linux-next, all you have to do is
update that branch to include the new code.

-- 
Cheers,
Stephen Rothwell

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Track unicode tree in linux-next (was Re: [PATCH 10/11] unicode: Add utf8-data module)
  2021-10-26 22:02             ` Stephen Rothwell
@ 2021-10-28  2:00               ` Gabriel Krisman Bertazi
  2021-10-28  9:47                 ` Stephen Rothwell
  0 siblings, 1 reply; 21+ messages in thread
From: Gabriel Krisman Bertazi @ 2021-10-28  2:00 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: Christoph Hellwig, Shreeya Patel, linux-fsdevel, linux-ext4,
	linux-f2fs-devel

Stephen Rothwell <sfr@canb.auug.org.au> writes:

> You just need to send me a git URL for your tree/branch (not a cgit or
> gitweb URL, please), plus some idea of what the tree includes and how it
> is sent to Linus (directly or via another tree).  The branch should
> have a generic name (i.e. not including a version) as I will continue
> to fetch that branch every day until you tell me to stop.  When your
> code is ready to be included in linux-next, all you have to do is
> update that branch to include the new code.

Hi Stephen,

Thanks for the information.

I'd like to ask you to track the branch 'for-next' of the following repository:

git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode.git

This branch is used as a staging area for development of the Unicode
subsystem used by native case-insensitive filesystems for file name
normalization and casefolding.  It goes to Linus through Ted Ts'o's ext4
tree.

Thank you,

-- 
Gabriel Krisman Bertazi

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Track unicode tree in linux-next (was Re: [PATCH 10/11] unicode: Add utf8-data module)
  2021-10-28  2:00               ` Track unicode tree in linux-next (was Re: [PATCH 10/11] unicode: Add utf8-data module) Gabriel Krisman Bertazi
@ 2021-10-28  9:47                 ` Stephen Rothwell
  0 siblings, 0 replies; 21+ messages in thread
From: Stephen Rothwell @ 2021-10-28  9:47 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Christoph Hellwig, Shreeya Patel, linux-fsdevel, linux-ext4,
	linux-f2fs-devel

[-- Attachment #1: Type: text/plain, Size: 1419 bytes --]

Hi Gabriel,

On Wed, 27 Oct 2021 23:00:55 -0300 Gabriel Krisman Bertazi <krisman@collabora.com> wrote:
>> 
> I'd like to ask you to track the branch 'for-next' of the following repository:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode.git
> 
> This branch is used as a staging area for development of the Unicode
> subsystem used by native case-insensitive filesystems for file name
> normalization and casefolding.  It goes to Linus through Ted Ts'o's ext4
> tree.

Added from today.

Thanks for adding your subsystem tree as a participant of linux-next.  As
you may know, this is not a judgement of your code.  The purpose of
linux-next is for integration testing and to lower the impact of
conflicts between subsystems in the next merge window. 

You will need to ensure that the patches/commits in your tree/series have
been:
     * submitted under GPL v2 (or later) and include the Contributor's
        Signed-off-by,
     * posted to the relevant mailing list,
     * reviewed by you (or another maintainer of your subsystem tree),
     * successfully unit tested, and 
     * destined for the current or next Linux merge window.

Basically, this should be just what you would send to Linus (or ask him
to fetch).  It is allowed to be rebased if you deem it necessary.

-- 
Cheers,
Stephen Rothwell 
sfr@canb.auug.org.au

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 10/11] unicode: Add utf8-data module
  2021-08-18 14:06 unicode cleanups, and split the data table into a separate module Christoph Hellwig
@ 2021-08-18 14:06 ` Christoph Hellwig
  0 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2021-08-18 14:06 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Shreeya Patel, linux-fsdevel, linux-ext4, linux-f2fs-devel

utf8data.h contains a large database table which is an auto-generated
decodification trie for the unicode normalization functions.

Allow building it into a separate module.

Based on a patch from Shreeya Patel <shreeya.patel@collabora.com>.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/unicode/Kconfig                            | 13 ++++-
 fs/unicode/Makefile                           | 13 ++---
 fs/unicode/mkutf8data.c                       | 24 ++++++++--
 fs/unicode/utf8-core.c                        | 35 +++++++++++---
 fs/unicode/utf8-norm.c                        | 48 ++++---------------
 fs/unicode/utf8-selftest.c                    | 16 +++----
 ...{utf8data.h_shipped => utf8data.c_shipped} | 22 +++++++--
 fs/unicode/utf8n.h                            | 40 ++++++++--------
 include/linux/unicode.h                       |  2 +
 9 files changed, 123 insertions(+), 90 deletions(-)
 rename fs/unicode/{utf8data.h_shipped => utf8data.c_shipped} (99%)

diff --git a/fs/unicode/Kconfig b/fs/unicode/Kconfig
index 2c27b9a5cd6c..610d7bc05d6e 100644
--- a/fs/unicode/Kconfig
+++ b/fs/unicode/Kconfig
@@ -8,7 +8,16 @@ config UNICODE
 	  Say Y here to enable UTF-8 NFD normalization and NFD+CF casefolding
 	  support.
 
+config UNICODE_UTF8_DATA
+	tristate "UTF-8 normalization and casefolding tables"
+	depends on UNICODE
+	default UNICODE
+	help
+	  This contains a large table of case foldings, which can be loaded as
+	  a separate module if you say M here.  To be on the safe side stick
+	  to the default of Y.  Saying N here makes no sense, if you do not want
+	  utf8 casefolding support, disable CONFIG_UNICODE instead.
+
 config UNICODE_NORMALIZATION_SELFTEST
 	tristate "Test UTF-8 normalization support"
-	depends on UNICODE
-	default n
+	depends on UNICODE_UTF8_DATA
diff --git a/fs/unicode/Makefile b/fs/unicode/Makefile
index b88aecc86550..2f9d9188852b 100644
--- a/fs/unicode/Makefile
+++ b/fs/unicode/Makefile
@@ -2,14 +2,15 @@
 
 obj-$(CONFIG_UNICODE) += unicode.o
 obj-$(CONFIG_UNICODE_NORMALIZATION_SELFTEST) += utf8-selftest.o
+obj-$(CONFIG_UNICODE_UTF8_DATA) += utf8data.o
 
 unicode-y := utf8-norm.o utf8-core.o
 
-$(obj)/utf8-norm.o: $(obj)/utf8data.h
+$(obj)/utf8data.o: $(obj)/utf8data.c
 
-# In the normal build, the checked-in utf8data.h is just shipped.
+# In the normal build, the checked-in utf8data.c is just shipped.
 #
-# To generate utf8data.h from UCD, put *.txt files in this directory
+# To generate utf8data.c from UCD, put *.txt files in this directory
 # and pass REGENERATE_UTF8DATA=1 from the command line.
 ifdef REGENERATE_UTF8DATA
 
@@ -24,15 +25,15 @@ quiet_cmd_utf8data = GEN     $@
 		-t $(srctree)/$(src)/NormalizationTest.txt \
 		-o $@
 
-$(obj)/utf8data.h: $(obj)/mkutf8data $(filter %.txt, $(cmd_utf8data)) FORCE
+$(obj)/utf8data.c: $(obj)/mkutf8data $(filter %.txt, $(cmd_utf8data)) FORCE
 	$(call if_changed,utf8data)
 
 else
 
-$(obj)/utf8data.h: $(src)/utf8data.h_shipped FORCE
+$(obj)/utf8data.c: $(src)/utf8data.c_shipped FORCE
 	$(call if_changed,shipped)
 
 endif
 
-targets += utf8data.h
+targets += utf8data.c
 hostprogs += mkutf8data
diff --git a/fs/unicode/mkutf8data.c b/fs/unicode/mkutf8data.c
index ff2025ac5a32..bc1a7c8b5c8d 100644
--- a/fs/unicode/mkutf8data.c
+++ b/fs/unicode/mkutf8data.c
@@ -3287,12 +3287,10 @@ static void write_file(void)
 		open_fail(utf8_name, errno);
 
 	fprintf(file, "/* This file is generated code, do not edit. */\n");
-	fprintf(file, "#ifndef __INCLUDED_FROM_UTF8NORM_C__\n");
-	fprintf(file, "#error Only nls_utf8-norm.c should include this file.\n");
-	fprintf(file, "#endif\n");
 	fprintf(file, "\n");
-	fprintf(file, "static const unsigned int utf8vers = %#x;\n",
-		unicode_maxage);
+	fprintf(file, "#include <linux/module.h>\n");
+	fprintf(file, "#include <linux/kernel.h>\n");
+	fprintf(file, "#include \"utf8n.h\"\n");
 	fprintf(file, "\n");
 	fprintf(file, "static const unsigned int utf8agetab[] = {\n");
 	for (i = 0; i != ages_count; i++)
@@ -3339,6 +3337,22 @@ static void write_file(void)
 		fprintf(file, "\n");
 	}
 	fprintf(file, "};\n");
+	fprintf(file, "\n");
+	fprintf(file, "struct utf8data_table utf8_data_table = {\n");
+	fprintf(file, "\t.utf8agetab = utf8agetab,\n");
+	fprintf(file, "\t.utf8agetab_size = ARRAY_SIZE(utf8agetab),\n");
+	fprintf(file, "\n");
+	fprintf(file, "\t.utf8nfdicfdata = utf8nfdicfdata,\n");
+	fprintf(file, "\t.utf8nfdicfdata_size = ARRAY_SIZE(utf8nfdicfdata),\n");
+	fprintf(file, "\n");
+	fprintf(file, "\t.utf8nfdidata = utf8nfdidata,\n");
+	fprintf(file, "\t.utf8nfdidata_size = ARRAY_SIZE(utf8nfdidata),\n");
+	fprintf(file, "\n");
+	fprintf(file, "\t.utf8data = utf8data,\n");
+	fprintf(file, "};\n");
+	fprintf(file, "EXPORT_SYMBOL_GPL(utf8_data_table);");
+	fprintf(file, "\n");
+	fprintf(file, "MODULE_LICENSE(\"GPL v2\");\n");
 	fclose(file);
 }
 
diff --git a/fs/unicode/utf8-core.c b/fs/unicode/utf8-core.c
index d9f713d38c0a..38ca824f1015 100644
--- a/fs/unicode/utf8-core.c
+++ b/fs/unicode/utf8-core.c
@@ -160,25 +160,45 @@ int utf8_normalize(const struct unicode_map *um, const struct qstr *str,
 }
 EXPORT_SYMBOL(utf8_normalize);
 
+static const struct utf8data *find_table_version(const struct utf8data *table,
+		size_t nr_entries, unsigned int version)
+{
+	size_t i = nr_entries - 1;
+
+	while (version < table[i].maxage)
+		i--;
+	if (version > table[i].maxage)
+		return NULL;
+	return &table[i];
+}
+
 struct unicode_map *utf8_load(unsigned int version)
 {
 	struct unicode_map *um;
 
-	if (!utf8version_is_supported(version))
-		return ERR_PTR(-EINVAL);
-
 	um = kzalloc(sizeof(struct unicode_map), GFP_KERNEL);
 	if (!um)
 		return ERR_PTR(-ENOMEM);
 	um->version = version;
-	um->ntab[UTF8_NFDI] = utf8nfdi(version);
-	if (!um->ntab[UTF8_NFDI])
+
+	um->tables = symbol_request(utf8_data_table);
+	if (!um->tables)
 		goto out_free_um;
-	um->ntab[UTF8_NFDICF] = utf8nfdicf(version);
+
+	if (!utf8version_is_supported(um, version))
+		goto out_symbol_put;
+	um->ntab[UTF8_NFDI] = find_table_version(um->tables->utf8nfdidata,
+			um->tables->utf8nfdidata_size, um->version);
+	if (!um->ntab[UTF8_NFDI])
+		goto out_symbol_put;
+	um->ntab[UTF8_NFDICF] = find_table_version(um->tables->utf8nfdicfdata,
+			um->tables->utf8nfdicfdata_size, um->version);
 	if (!um->ntab[UTF8_NFDICF])
-		goto out_free_um;
+		goto out_symbol_put;
 	return um;
 
+out_symbol_put:
+	symbol_put(utf8_data_table);
 out_free_um:
 	kfree(um);
 	return ERR_PTR(-EINVAL);
@@ -187,6 +207,7 @@ EXPORT_SYMBOL(utf8_load);
 
 void utf8_unload(struct unicode_map *um)
 {
+	symbol_put(utf8_data_table);
 	kfree(um);
 }
 EXPORT_SYMBOL(utf8_unload);
diff --git a/fs/unicode/utf8-norm.c b/fs/unicode/utf8-norm.c
index 7c1f28ab31a8..829c7e2ad764 100644
--- a/fs/unicode/utf8-norm.c
+++ b/fs/unicode/utf8-norm.c
@@ -6,21 +6,12 @@
 
 #include "utf8n.h"
 
-struct utf8data {
-	unsigned int maxage;
-	unsigned int offset;
-};
-
-#define __INCLUDED_FROM_UTF8NORM_C__
-#include "utf8data.h"
-#undef __INCLUDED_FROM_UTF8NORM_C__
-
-int utf8version_is_supported(unsigned int version)
+int utf8version_is_supported(const struct unicode_map *um, unsigned int version)
 {
-	int i = ARRAY_SIZE(utf8agetab) - 1;
+	int i = um->tables->utf8agetab_size - 1;
 
-	while (i >= 0 && utf8agetab[i] != 0) {
-		if (version == utf8agetab[i])
+	while (i >= 0 && um->tables->utf8agetab[i] != 0) {
+		if (version == um->tables->utf8agetab[i])
 			return 1;
 		i--;
 	}
@@ -161,7 +152,7 @@ typedef const unsigned char utf8trie_t;
  * underlying datatype: unsigned char.
  *
  * leaf[0]: The unicode version, stored as a generation number that is
- *          an index into utf8agetab[].  With this we can filter code
+ *          an index into ->utf8agetab[].  With this we can filter code
  *          points based on the unicode version in which they were
  *          defined.  The CCC of a non-defined code point is 0.
  * leaf[1]: Canonical Combining Class. During normalization, we need
@@ -313,7 +304,7 @@ static utf8leaf_t *utf8nlookup(const struct unicode_map *um,
 		enum utf8_normalization n, unsigned char *hangul, const char *s,
 		size_t len)
 {
-	utf8trie_t	*trie = utf8data + um->ntab[n]->offset;
+	utf8trie_t	*trie = um->tables->utf8data + um->ntab[n]->offset;
 	int		offlen;
 	int		offset;
 	int		mask;
@@ -404,7 +395,8 @@ ssize_t utf8nlen(const struct unicode_map *um, enum utf8_normalization n,
 		leaf = utf8nlookup(um, n, hangul, s, len);
 		if (!leaf)
 			return -1;
-		if (utf8agetab[LEAF_GEN(leaf)] > um->ntab[n]->maxage)
+		if (um->tables->utf8agetab[LEAF_GEN(leaf)] >
+		    um->ntab[n]->maxage)
 			ret += utf8clen(s);
 		else if (LEAF_CCC(leaf) == DECOMPOSE)
 			ret += strlen(LEAF_STR(leaf));
@@ -520,7 +512,7 @@ int utf8byte(struct utf8cursor *u8c)
 
 		ccc = LEAF_CCC(leaf);
 		/* Characters that are too new have CCC 0. */
-		if (utf8agetab[LEAF_GEN(leaf)] >
+		if (u8c->um->tables->utf8agetab[LEAF_GEN(leaf)] >
 		    u8c->um->ntab[u8c->n]->maxage) {
 			ccc = STOPPER;
 		} else if (ccc == DECOMPOSE) {
@@ -597,25 +589,3 @@ int utf8byte(struct utf8cursor *u8c)
 	}
 }
 EXPORT_SYMBOL(utf8byte);
-
-const struct utf8data *utf8nfdi(unsigned int maxage)
-{
-	int i = ARRAY_SIZE(utf8nfdidata) - 1;
-
-	while (maxage < utf8nfdidata[i].maxage)
-		i--;
-	if (maxage > utf8nfdidata[i].maxage)
-		return NULL;
-	return &utf8nfdidata[i];
-}
-
-const struct utf8data *utf8nfdicf(unsigned int maxage)
-{
-	int i = ARRAY_SIZE(utf8nfdicfdata) - 1;
-
-	while (maxage < utf8nfdicfdata[i].maxage)
-		i--;
-	if (maxage > utf8nfdicfdata[i].maxage)
-		return NULL;
-	return &utf8nfdicfdata[i];
-}
diff --git a/fs/unicode/utf8-selftest.c b/fs/unicode/utf8-selftest.c
index cfa3832b75f4..eb2bbdd688d7 100644
--- a/fs/unicode/utf8-selftest.c
+++ b/fs/unicode/utf8-selftest.c
@@ -255,21 +255,21 @@ static void check_utf8_comparisons(struct unicode_map *table)
 	}
 }
 
-static void check_supported_versions(void)
+static void check_supported_versions(struct unicode_map *um)
 {
 	/* Unicode 7.0.0 should be supported. */
-	test(utf8version_is_supported(UNICODE_AGE(7, 0, 0)));
+	test(utf8version_is_supported(um, UNICODE_AGE(7, 0, 0)));
 
 	/* Unicode 9.0.0 should be supported. */
-	test(utf8version_is_supported(UNICODE_AGE(9, 0, 0)));
+	test(utf8version_is_supported(um, UNICODE_AGE(9, 0, 0)));
 
 	/* Unicode 1x.0.0 (the latest version) should be supported. */
-	test(utf8version_is_supported(UTF8_LATEST));
+	test(utf8version_is_supported(um, UTF8_LATEST));
 
 	/* Next versions don't exist. */
-	test(!utf8version_is_supported(UNICODE_AGE(13, 0, 0)));
-	test(!utf8version_is_supported(UNICODE_AGE(0, 0, 0)));
-	test(!utf8version_is_supported(UNICODE_AGE(-1, -1, -1)));
+	test(!utf8version_is_supported(um, UNICODE_AGE(13, 0, 0)));
+	test(!utf8version_is_supported(um, UNICODE_AGE(0, 0, 0)));
+	test(!utf8version_is_supported(um, UNICODE_AGE(-1, -1, -1)));
 }
 
 static int __init init_test_ucd(void)
@@ -285,7 +285,7 @@ static int __init init_test_ucd(void)
 		return PTR_ERR(um);
 	}
 
-	check_supported_versions();
+	check_supported_versions(um);
 	check_utf8_nfdi(um);
 	check_utf8_nfdicf(um);
 	check_utf8_comparisons(um);
diff --git a/fs/unicode/utf8data.h_shipped b/fs/unicode/utf8data.c_shipped
similarity index 99%
rename from fs/unicode/utf8data.h_shipped
rename to fs/unicode/utf8data.c_shipped
index 76e4f0e1b089..d9b62901aa96 100644
--- a/fs/unicode/utf8data.h_shipped
+++ b/fs/unicode/utf8data.c_shipped
@@ -1,9 +1,8 @@
 /* This file is generated code, do not edit. */
-#ifndef __INCLUDED_FROM_UTF8NORM_C__
-#error Only nls_utf8-norm.c should include this file.
-#endif
 
-static const unsigned int utf8vers = 0xc0100;
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include "utf8n.h"
 
 static const unsigned int utf8agetab[] = {
 	0,
@@ -4107,3 +4106,18 @@ static const unsigned char utf8data[64256] = {
 	0x52,0x04,0x00,0x00,0x11,0x04,0x00,0x00,0x02,0x00,0xcf,0x86,0xcf,0x06,0x02,0x00,
 	0x81,0x80,0xcf,0x86,0x85,0x84,0xcf,0x86,0xcf,0x06,0x02,0x00,0x00,0x00,0x00,0x00
 };
+
+struct utf8data_table utf8_data_table = {
+	.utf8agetab = utf8agetab,
+	.utf8agetab_size = ARRAY_SIZE(utf8agetab),
+
+	.utf8nfdicfdata = utf8nfdicfdata,
+	.utf8nfdicfdata_size = ARRAY_SIZE(utf8nfdicfdata),
+
+	.utf8nfdidata = utf8nfdidata,
+	.utf8nfdidata_size = ARRAY_SIZE(utf8nfdidata),
+
+	.utf8data = utf8data,
+};
+EXPORT_SYMBOL_GPL(utf8_data_table);
+MODULE_LICENSE("GPL v2");
diff --git a/fs/unicode/utf8n.h b/fs/unicode/utf8n.h
index 206c89f0dbf7..bd00d587747a 100644
--- a/fs/unicode/utf8n.h
+++ b/fs/unicode/utf8n.h
@@ -13,25 +13,7 @@
 #include <linux/module.h>
 #include <linux/unicode.h>
 
-int utf8version_is_supported(unsigned int version);
-
-/*
- * Look for the correct const struct utf8data for a unicode version.
- * Returns NULL if the version requested is too new.
- *
- * Two normalization forms are supported: nfdi and nfdicf.
- *
- * nfdi:
- *  - Apply unicode normalization form NFD.
- *  - Remove any Default_Ignorable_Code_Point.
- *
- * nfdicf:
- *  - Apply unicode normalization form NFD.
- *  - Remove any Default_Ignorable_Code_Point.
- *  - Apply a full casefold (C + F).
- */
-extern const struct utf8data *utf8nfdi(unsigned int maxage);
-extern const struct utf8data *utf8nfdicf(unsigned int maxage);
+int utf8version_is_supported(const struct unicode_map *um, unsigned int version);
 
 /*
  * Determine the length of the normalized from of the string,
@@ -78,4 +60,24 @@ int utf8ncursor(struct utf8cursor *u8c, const struct unicode_map *um,
  */
 extern int utf8byte(struct utf8cursor *u8c);
 
+struct utf8data {
+	unsigned int maxage;
+	unsigned int offset;
+};
+
+struct utf8data_table {
+	const unsigned int *utf8agetab;
+	int utf8agetab_size;
+
+	const struct utf8data *utf8nfdicfdata;
+	int utf8nfdicfdata_size;
+
+	const struct utf8data *utf8nfdidata;
+	int utf8nfdidata_size;
+
+	const unsigned char *utf8data;
+};
+
+extern struct utf8data_table utf8_data_table;
+
 #endif /* UTF8NORM_H */
diff --git a/include/linux/unicode.h b/include/linux/unicode.h
index 3e502c7456e8..2b3849e7cd64 100644
--- a/include/linux/unicode.h
+++ b/include/linux/unicode.h
@@ -6,6 +6,7 @@
 #include <linux/dcache.h>
 
 struct utf8data;
+struct utf8data_table;
 
 /* Encoding a unicode version number as a single unsigned int. */
 #define UNICODE_MAJ_SHIFT		(16)
@@ -35,6 +36,7 @@ enum utf8_normalization {
 struct unicode_map {
 	unsigned int version;
 	const struct utf8data *ntab[UTF8_NMAX];
+	const struct utf8data_table *tables;
 };
 
 int utf8_validate(const struct unicode_map *um, const struct qstr *str);
-- 
2.30.2



end of thread, other threads:[~2021-10-28  9:47 UTC | newest]

Thread overview: 21+ messages
2021-09-15  6:59 unicode cleanups, and split the data table into a separate module v2 Christoph Hellwig
2021-09-15  6:59 ` [PATCH 01/11] ext4: simplify ext4_sb_read_encoding Christoph Hellwig
2021-09-15  6:59 ` [PATCH 02/11] f2fs: simplify f2fs_sb_read_encoding Christoph Hellwig
2021-09-15  6:59 ` [PATCH 03/11] unicode: remove the charset field from struct unicode_map Christoph Hellwig
2021-09-15  6:59 ` [PATCH 04/11] unicode: mark the version field in struct unicode_map unsigned Christoph Hellwig
2021-09-15  7:00 ` [PATCH 05/11] unicode: pass a UNICODE_AGE() tripple to utf8_load Christoph Hellwig
2021-09-15  7:00 ` [PATCH 06/11] unicode: remove the unused utf8{,n}age{min,max} functions Christoph Hellwig
2021-09-15  7:00 ` [PATCH 07/11] unicode: simplify utf8len Christoph Hellwig
2021-09-15  7:00 ` [PATCH 08/11] unicode: move utf8cursor to utf8-selftest.c Christoph Hellwig
2021-09-15  7:00 ` [PATCH 09/11] unicode: cache the normalization tables in struct unicode_map Christoph Hellwig
2021-09-15  7:00 ` [PATCH 10/11] unicode: Add utf8-data module Christoph Hellwig
2021-10-12 11:25   ` Gabriel Krisman Bertazi
2021-10-12 12:49     ` Christoph Hellwig
2021-10-12 14:40       ` Gabriel Krisman Bertazi
2021-10-26  7:45         ` Christoph Hellwig
2021-10-26 13:56           ` Gabriel Krisman Bertazi
2021-10-26 22:02             ` Stephen Rothwell
2021-10-28  2:00               ` Track unicode tree in linux-next (was Re: [PATCH 10/11] unicode: Add utf8-data module) Gabriel Krisman Bertazi
2021-10-28  9:47                 ` Stephen Rothwell
2021-09-15  7:00 ` [PATCH 11/11] unicode: only export internal symbols for the selftests Christoph Hellwig
  -- strict thread matches above, loose matches on Subject: below --
2021-08-18 14:06 unicode cleanups, and split the data table into a separate module Christoph Hellwig
2021-08-18 14:06 ` [PATCH 10/11] unicode: Add utf8-data module Christoph Hellwig
