linux-ext4.vger.kernel.org archive mirror
From: Gabriel Krisman Bertazi <krisman@collabora.com>
To: tytso@mit.edu
Cc: linux-fsdevel@vger.kernel.org, kernel@collabora.com,
	linux-ext4@vger.kernel.org,
	Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Subject: [PATCH v4 10/23] nls: Add optional normalization and casefold hooks
Date: Thu,  6 Dec 2018 18:08:50 -0500
Message-ID: <20181206230903.30011-11-krisman@collabora.com>
In-Reply-To: <20181206230903.30011-1-krisman@collabora.com>

From: Gabriel Krisman Bertazi <krisman@collabora.co.uk>

The Normalization operation transforms a string into its normalized
form, which allows the user to determine whether any two strings are
equivalent to each other.  The NLS subsystem doesn't impose any
constraint on what it means to be equivalent, for any charset.
Unicode-based charsets, for instance, are free to support one, a few or
all kinds of Unicode equivalences.

The Casefold operation is similar to Normalization, in the sense that
it also allows the caller to identify equivalent strings, but it
disregards case, making it ideal for case-insensitive comparisons.

Default implementations are provided by the nls core, so that existing
charsets can operate on the new interface.  The default normalization
type is NLS_NORMALIZATION_TYPE_PLAIN, which returns the string
unchanged, that is, no normalization is performed.  The default
casefold type is NLS_CASEFOLD_TYPE_TOUPPER, which returns the string
with all characters converted to uppercase.

Changes since V1:
  - Add default operations for casefold and normalization

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
---
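As a rough illustration of the default operations described above, here
is a minimal userspace sketch, not the kernel code itself.  The
sketch_* names are hypothetical, toupper() from <ctype.h> stands in for
nls_toupper(), and strict-mode validation is left out.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the default casefold path: uppercase the
 * input byte by byte, as described for NLS_CASEFOLD_TYPE_TOUPPER. */
static int sketch_casefold_toupper(const unsigned char *str, size_t len,
				   unsigned char *dest, size_t dlen)
{
	size_t i;

	assert(str && dest);
	if (len > dlen)
		return -EINVAL;	/* destination too small */

	for (i = 0; i < len; i++)
		dest[i] = (unsigned char)toupper(str[i]);

	return (int)len;
}

/* Hypothetical stand-in for the default normalization path: the
 * PLAIN type is the identity, so the bytes are copied unchanged. */
static int sketch_normalize_plain(const unsigned char *str, size_t len,
				  unsigned char *dest, size_t dlen)
{
	assert(str && dest);
	if (len > dlen)
		return -EINVAL;

	memcpy(dest, str, len);
	return (int)len;
}

/* Compare two short names while ignoring case: fold both, then compare
 * the folded bytes.  Returns 1 if equivalent, 0 otherwise. */
static int sketch_names_equal_ci(const unsigned char *a, size_t alen,
				 const unsigned char *b, size_t blen)
{
	unsigned char fa[64], fb[64];
	int la = sketch_casefold_toupper(a, alen, fa, sizeof(fa));
	int lb = sketch_casefold_toupper(b, blen, fb, sizeof(fb));

	if (la < 0 || lb < 0)
		return 0;

	return la == lb && memcmp(fa, fb, (size_t)la) == 0;
}
```

A caller would fold both names into scratch buffers and compare the
results; a negative return value means the input could not be folded.
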
 fs/nls/nls_core.c   |  11 +++++
 include/linux/nls.h | 116 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 127 insertions(+)

diff --git a/fs/nls/nls_core.c b/fs/nls/nls_core.c
index 49a15bb2174f..c49088f36f4c 100644
--- a/fs/nls/nls_core.c
+++ b/fs/nls/nls_core.c
@@ -25,6 +25,17 @@ static int nls_validate_flags(struct nls_table *table, unsigned int flags)
 	if (flags & NLS_STRICT_MODE && !table->ops->validate)
 		return -1;
 
+	if ((flags & NLS_NORMALIZATION_TYPE_MASK) && !table->ops->normalize)
+		return -1;
+
+	if ((flags & NLS_CASEFOLD_TYPE_MASK) && !table->ops->casefold)
+		return -1;
+
+	/* Reject unused flags */
+	if (flags & ~(NLS_CASEFOLD_TYPE_MASK | NLS_NORMALIZATION_TYPE_MASK |
+		      NLS_STRICT_MODE))
+		return -1;
+
 	return 0;
 }
 
diff --git a/include/linux/nls.h b/include/linux/nls.h
index 980103d4c363..44a06a9c69e7 100644
--- a/include/linux/nls.h
+++ b/include/linux/nls.h
@@ -4,6 +4,7 @@
 
 #include <linux/init.h>
 #include <linux/string.h>
+#include <linux/errno.h>
 
 /* Unicode has changed over the years.  Unicode code points no longer
  * fit into 16 bits; as of Unicode 5 valid code points range from 0
@@ -65,6 +66,51 @@ struct nls_ops {
 	int (*strncasecmp)(const struct nls_table *charset,
 			   const unsigned char *str1, size_t len1,
 			   const unsigned char *str2, size_t len2);
+	/**
+	 * @normalize:
+	 *
+	 * Obtain the normalized form of a string, which can be used to
+	 * determine whether any two strings are equivalent.  The NLS
+	 * subsystem doesn't impose any constraint on the charsets
+	 * regarding what it means to be equivalent.  Unicode-based
+	 * charsets, for instance, are free to support one, a few or all
+	 * kinds of Unicode equivalences.  Different kinds of
+	 * normalizations can be specified using the nls_table flags.
+	 *
+	 * This hook is responsible for performing string validation if
+	 * the strict mode flag is set.  The only case where it is not
+	 * called by nls_core is when strict mode and normalization are
+	 * disabled, because in this case the normalization is
+	 * guaranteed to be the string identity.
+	 *
+	 * Not every charset implements this hook.  It is only required
+	 * if the charset supports strict mode or some kind of
+	 * normalization.
+	 *
+	 * If this operation cannot be executed for this charset,
+	 * -ENOTSUPP is returned.  If the sequence is invalid, -EINVAL
+	 * is returned.  Otherwise, this function returns the size of the
+	 * new string.
+	 **/
+	int (*normalize)(const struct nls_table *charset,
+			 const unsigned char *str, size_t len,
+			 unsigned char *dest, size_t dlen);
+	/**
+	 * @casefold:
+	 *
+	 * Casefold returns a version of the string that can be used to
+	 * perform case-insensitive comparisons.  The kind of casefold
+	 * algorithm that will be used is charset dependent, and can be
+	 * configured using the nls_table flags field.
+	 *
+	 * If this operation cannot be executed for this charset,
+	 * -ENOTSUPP is returned.  If the sequence is invalid, -EINVAL
+	 * is returned.  Otherwise, this function returns the size of the
+	 * new string.
+	 **/
+	int (*casefold)(const struct nls_table *charset,
+			const unsigned char *str, size_t len,
+			unsigned char *dest, size_t dlen);
 	unsigned char (*lowercase)(const struct nls_table *charset,
 				   unsigned int c);
 	unsigned char (*uppercase)(const struct nls_table *charset,
@@ -101,13 +147,37 @@ enum utf16_endian {
 	UTF16_BIG_ENDIAN
 };
 
+#define NLS_NORMALIZATION_TYPE(i)	(((i) & 0x7) << 1)
+#define NLS_CASEFOLD_TYPE(i)		(((i) & 0x7) << 4)
+
 #define NLS_STRICT_MODE			0x00000001
+#define NLS_NORMALIZATION_TYPE_PLAIN	NLS_NORMALIZATION_TYPE(0)
+#define NLS_NORMALIZATION_TYPE_MASK	0x0000000E
+#define NLS_CASEFOLD_TYPE_TOUPPER	NLS_CASEFOLD_TYPE(0)
+#define NLS_CASEFOLD_TYPE_MASK		0x00000070
 
 static inline int IS_STRICT_MODE(const struct nls_table *charset)
 {
 	return (charset->flags & NLS_STRICT_MODE);
 }
 
+#define NLS_NORMALIZATION_FUNCS(charset, type, i)			\
+static inline int							\
+IS_NORMALIZATION_TYPE_##charset##_##type(const struct nls_table *c)	\
+{									\
+	return ((c->flags & NLS_NORMALIZATION_TYPE_MASK) == (i));	\
+}
+
+#define NLS_CASEFOLD_FUNCS(charset, type, i)			    	\
+static inline int							\
+IS_CASEFOLD_TYPE_##charset##_##type(const struct nls_table *c)		\
+{									\
+	return ((c->flags & NLS_CASEFOLD_TYPE_MASK) == (i));		\
+}
+
+NLS_NORMALIZATION_FUNCS(ALL, PLAIN, NLS_NORMALIZATION_TYPE_PLAIN)
+NLS_CASEFOLD_FUNCS(ALL, TOUPPER, NLS_CASEFOLD_TYPE_TOUPPER)
+
 /* nls_base.c */
 extern int __register_nls(struct nls_charset *, struct module *);
 extern int unregister_nls(struct nls_charset *);
@@ -213,6 +283,52 @@ static inline int nls_strnicmp(struct nls_table *t, const unsigned char *s1,
 	return nls_strncasecmp(t, s1, len, s2, len);
 }
 
+static inline int nls_casefold(const struct nls_table *t,
+			       const unsigned char *str, size_t len,
+			       unsigned char *dest, size_t dlen)
+{
+	size_t i;
+
+	if (t->ops->casefold)
+		return t->ops->casefold(t, str, len, dest, dlen);
+
+	if (!IS_CASEFOLD_TYPE_ALL_TOUPPER(t))
+		return -ENOTSUPP;
+
+	if (IS_STRICT_MODE(t) && nls_validate(t, str, len))
+		return -EINVAL;
+
+	if (len > dlen)
+		return -EINVAL;
+
+	for (i = 0; i < len; i++)
+		dest[i] = nls_toupper(t, str[i]);
+
+	return len;
+}
+
+static inline int nls_normalize(const struct nls_table *t,
+				const unsigned char *str, size_t len,
+				unsigned char *dest, size_t dlen)
+{
+	if (t->ops->normalize)
+		return t->ops->normalize(t, str, len, dest, dlen);
+
+	if (!IS_NORMALIZATION_TYPE_ALL_PLAIN(t))
+		return -ENOTSUPP;
+
+	if (IS_STRICT_MODE(t) && nls_validate(t, str, len))
+		return -EINVAL;
+
+	if (len > dlen)
+		return -EINVAL;
+
+	/* If normalization is disabled, the normalized form is the
+	 * identity. */
+	memcpy(dest, str, len);
+	return len;
+}
+
 /*
  * nls_nullsize - return length of null character for codepage
  * @codepage - codepage for which to return length of NULL terminator
-- 
2.20.0.rc2
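
For readers following the flag encoding in this patch, the sketch below
shows how the bit layout fits together: bit 0 is strict mode, bits 1-3
carry the normalization type and bits 4-6 the casefold type.  The mask
and shift values are taken from the patch; the SKETCH_*/sketch_* names
are hypothetical and exist only for this userspace illustration.

```c
#include <assert.h>

/* Values mirror the patch; names are illustrative only. */
#define SKETCH_STRICT_MODE		0x00000001u
#define SKETCH_NORMALIZATION_TYPE(i)	(((i) & 0x7u) << 1)
#define SKETCH_CASEFOLD_TYPE(i)		(((i) & 0x7u) << 4)
#define SKETCH_NORMALIZATION_TYPE_MASK	0x0000000Eu
#define SKETCH_CASEFOLD_TYPE_MASK	0x00000070u

/* Build a flags word from a strict-mode switch and two 3-bit types. */
static unsigned int sketch_make_flags(int strict, unsigned int norm,
				      unsigned int fold)
{
	assert(norm <= 7 && fold <= 7);

	return (strict ? SKETCH_STRICT_MODE : 0) |
	       SKETCH_NORMALIZATION_TYPE(norm) |
	       SKETCH_CASEFOLD_TYPE(fold);
}

/* Recover the normalization type index stored in bits 1-3. */
static unsigned int sketch_normalization_index(unsigned int flags)
{
	return (flags & SKETCH_NORMALIZATION_TYPE_MASK) >> 1;
}

/* Recover the casefold type index stored in bits 4-6. */
static unsigned int sketch_casefold_index(unsigned int flags)
{
	return (flags & SKETCH_CASEFOLD_TYPE_MASK) >> 4;
}

/* Same idea as the "reject unused flags" check: any bit outside the
 * three known fields makes the flags word invalid. */
static int sketch_flags_known(unsigned int flags)
{
	return !(flags & ~(SKETCH_CASEFOLD_TYPE_MASK |
			   SKETCH_NORMALIZATION_TYPE_MASK |
			   SKETCH_STRICT_MODE));
}
```

Type index 0 leaves the corresponding mask bits clear, which is why the
default PLAIN and TOUPPER types pass nls_validate_flags() even when a
charset provides no normalize or casefold hook.
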
