linux-fsdevel.vger.kernel.org archive mirror
* [PATCH RFC v5 00/11] Ext4 Encoding and Case-insensitive support
@ 2019-01-28 21:32 Gabriel Krisman Bertazi
From: Gabriel Krisman Bertazi @ 2019-01-28 21:32 UTC
  To: tytso
  Cc: linux-fsdevel, linux-ext4, sfrench, darrick.wong,
	samba-technical, jlayton, bfields, paulus,
	Gabriel Krisman Bertazi

Hi Ted,

Following Linus's comments, this version is back as an RFC, in order to
discuss the normalization method used.  At first glance, you will
notice the series got a lot smaller, with the separation of the unicode
code from the NLS subsystem, as Linus requested.  The ext4 parts are
pretty much the same, with only the addition of a check in
ext4_feature_set_ok() to fail mounts of filesystems with the encoding
feature on newer kernels built without CONFIG_UNICODE.

The main change presented here is a proposal to migrate the
normalization method from NFKD to NFD.  After our discussions, and after
reviewing how other operating systems and languages handle this, I am
more convinced that canonical decomposition is a more viable solution
than compatibility decomposition, because it doesn't eliminate any
semantic meaning, with superscript numbers being the definitive example.
NFD is also the documented method used by HFS+ and APFS, so there is
precedent.  Notice, however, that as far as my research goes, APFS
doesn't completely follow NFD: in some cases, like <compat> flags, it
actually does NFKD, but not in others (<fraction>), where it applies the
canonical form.  We take a more consistent approach and always do plain
NFD.
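
To make the NFD/NFKD difference concrete, here is a minimal userspace
sketch.  It uses Python's standard unicodedata module, not the utf8
normalization code from this series, and the helper name same_under()
is made up for illustration only.  It shows that canonically equivalent
names match under both forms, while a superscript digit is only folded
into a plain digit by the compatibility form.

```python
import unicodedata


def same_under(form: str, a: str, b: str) -> bool:
    """Return True if a and b are equal after normalizing both to `form`.

    `form` is a normalization form name accepted by
    unicodedata.normalize(), e.g. "NFD" or "NFKD".
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# "e with acute" as one code point vs. "e" + combining acute accent.
# These are canonically equivalent, so both forms treat them as equal.
composed = "caf\u00e9"
decomposed = "cafe\u0301"

# SUPERSCRIPT TWO (U+00B2) only has a compatibility decomposition.
# NFD keeps it distinct from "2"; NFKD erases the distinction.
superscript = "x\u00b2"
plain = "x2"

if __name__ == "__main__":
    print(same_under("NFD", composed, decomposed))
    print(same_under("NFKD", composed, decomposed))
    print(same_under("NFD", superscript, plain))
    print(same_under("NFKD", superscript, plain))
    # The UCD decomposition field carries the tag mentioned above.
    print(unicodedata.decomposition("\u00b2"))  # <super> 0032
    print(unicodedata.decomposition("\u00bd"))  # <fraction> 0031 2044 0032
```

Under plain NFD, any decomposition that carries a tag such as <super>,
<compat> or <fraction> is left untouched, which is the consistent
behavior proposed here.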

This RFC, therefore, aims to resume/start a conversation with some
stakeholders who may have something to say regarding the normalization
method used.  I added people from SMB, NFS and FS development who
might be interested in this.

Regarding Casefold, I am unsure whether Casefold Common + Full still
makes sense after migrating from the compatibility to the canonical
form.  While Casefold Full, by definition, addresses cases where the
casefolding grows in size, like the casefold of the German eszett to SS,
it is also responsible for folding lowercase ligatures without a
corresponding uppercase to their compatibility counterpart.  This means
that on -F directories, o_f_f_i_c_e and o_ff_i_c_e will differ, while on
+F directories they will match.  This seems unacceptable to me,
suggesting that we should start to use Common + Simple instead of Common
+ Full, but I would like more input on what seems more reasonable to
you.
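
The ligature problem can be reproduced in userspace with the sketch
below.  It relies on Python's str.casefold(), which applies full case
folding, plus a hypothetical fold_simple_approx() helper of my own that
only approximates Common + Simple by dropping every folding that grows
in length.  That approximation does not reproduce the real S entries of
CaseFolding.txt (U+1E9E, for instance), so take it as an illustration,
not as what the kernel code does.

```python
import unicodedata


def nfd(name: str) -> str:
    """Canonical decomposition only; models a -F directory lookup key."""
    return unicodedata.normalize("NFD", name)


def fold_full(name: str) -> str:
    """NFD followed by full case folding (Common + Full)."""
    return nfd(name).casefold()


def fold_simple_approx(name: str) -> str:
    """NFD followed by an approximation of Common + Simple.

    Assumption: a folding is kept only when it maps one code point to
    exactly one code point; anything that would grow is left unchanged.
    """
    out = []
    for ch in nfd(name):
        folded = ch.casefold()
        out.append(folded if len(folded) == 1 else ch)
    return "".join(out)


plain = "office"
ligature = "o\ufb00ice"   # U+FB00 LATIN SMALL LIGATURE FF
eszett = "Stra\u00dfe"    # U+00DF LATIN SMALL LETTER SHARP S

if __name__ == "__main__":
    # -F directory: NFD leaves the <compat> ligature alone, names differ.
    print(nfd(ligature) == nfd(plain))
    # +F directory with Common + Full: the ligature expands, names match.
    print(fold_full(ligature) == fold_full(plain))
    # +F directory with the Simple approximation: names still differ.
    print(fold_simple_approx(ligature) == fold_simple_approx(plain))
    # Full folding also grows the eszett into "ss".
    print(fold_full(eszett))
```

With Common + Simple, a +F directory would only differ from a -F one in
letter case, which is the consistency argued for above.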

After we decide on this, I will be sending new patches to update
e2fsprogs to the agreed method and remove the normalization/casefold
type flags (EXT4_UTF8_NORMALIZATION_TYPE_NFKD,
EXT4_UTF8_CASEFOLD_TYPE_NFKDCF), before actually proposing the current
patch series for inclusion in the kernel.

Practical things, w.r.t. this patch series:

  - As usual, the UCD files are not part of the series, because they
    would bounce.  To test this series, one needs to fetch the files as
    explained in the commit message.

  - If you prefer, you can check out the branch from
     https://gitlab.collabora.com/krisman/linux -b ext4-ci-directory-no-nls

  - More details on the design decisions restricted to ext4 are
    available in the corresponding commit messages.

Thanks for keeping up with this.

Gabriel Krisman Bertazi (7):
  unicode: Implement higher level API for string handling
  unicode: Introduce test module for normalized utf8 implementation
  MAINTAINERS: Add Unicode subsystem entry
  ext4: Include encoding information in the superblock
  ext4: Support encoding-aware file name lookups
  ext4: Implement EXT4_CASEFOLD_FL flag
  docs: ext4.rst: Document encoding and case-insensitive

Olaf Weber (4):
  unicode: Add unicode character database files
  scripts: add trie generator for UTF-8
  unicode: Introduce code for UTF-8 normalization
  unicode: reduce the size of utf8data[]

 Documentation/admin-guide/ext4.rst |   41 +
 MAINTAINERS                        |    6 +
 fs/Kconfig                         |    1 +
 fs/Makefile                        |    1 +
 fs/ext4/dir.c                      |   43 +
 fs/ext4/ext4.h                     |   42 +-
 fs/ext4/hash.c                     |   38 +-
 fs/ext4/ialloc.c                   |    2 +-
 fs/ext4/inline.c                   |    2 +-
 fs/ext4/inode.c                    |    4 +-
 fs/ext4/ioctl.c                    |   18 +
 fs/ext4/namei.c                    |  104 +-
 fs/ext4/super.c                    |   91 +
 fs/unicode/Kconfig                 |   13 +
 fs/unicode/Makefile                |   22 +
 fs/unicode/ucd/README              |   33 +
 fs/unicode/utf8-core.c             |  183 ++
 fs/unicode/utf8-norm.c             |  797 +++++++
 fs/unicode/utf8-selftest.c         |  320 +++
 fs/unicode/utf8n.h                 |  117 +
 include/linux/fs.h                 |    2 +
 include/linux/unicode.h            |   30 +
 scripts/Makefile                   |    1 +
 scripts/mkutf8data.c               | 3418 ++++++++++++++++++++++++++++
 24 files changed, 5307 insertions(+), 22 deletions(-)
 create mode 100644 fs/unicode/Kconfig
 create mode 100644 fs/unicode/Makefile
 create mode 100644 fs/unicode/ucd/README
 create mode 100644 fs/unicode/utf8-core.c
 create mode 100644 fs/unicode/utf8-norm.c
 create mode 100644 fs/unicode/utf8-selftest.c
 create mode 100644 fs/unicode/utf8n.h
 create mode 100644 include/linux/unicode.h
 create mode 100644 scripts/mkutf8data.c

-- 
2.20.1


