linux-fsdevel.vger.kernel.org archive mirror
From: Randy Dunlap <rdunlap@infradead.org>
To: Gabriel Krisman Bertazi <krisman@collabora.com>, tytso@mit.edu
Cc: linux-ext4@vger.kernel.org, sfrench@samba.org,
	darrick.wong@oracle.com, jlayton@kernel.org,
	bfields@fieldses.org, paulus@samba.org,
	linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH RFC v6 00/11] Ext4 Encoding and Case-insensitive support
Date: Thu, 21 Mar 2019 15:30:35 -0700
Message-ID: <05dfd6a7-49f0-81a7-cd68-ff9f07182461@infradead.org>
In-Reply-To: <20190318202745.5200-1-krisman@collabora.com>

On 3/18/19 1:27 PM, Gabriel Krisman Bertazi wrote:
> Hi,
> 
> This version is pretty much the same as v5.  I am resending because the
> previous version didn't draw much discussion on the main topic of moving
> from KD to D.
> 
> Same as version 5, at first glance you will notice the series got a
> lot smaller, with the separation of unicode code from the NLS subsystem,
> as Linus requested.  The ext4 parts are pretty much the same, with only
> the addition of a verification in ext4_feature_set_ok() to fail encoding
> mounts on newer kernels built without CONFIG_UNICODE.
> 
> The main change presented here is a proposal to migrate the
> normalization method from NFKD to NFD.  After our discussions, and
> after reviewing aspects of other operating systems and languages, I am
> more convinced that canonical decomposition is a more viable solution
> than compatibility decomposition, because it doesn't eliminate any
> semantic meaning, as in the definitive case of superscript numbers.
> NFD is also the documented method used by HFS+ and APFS, so there is
> precedent.  Notice, however, that as far as my research goes, APFS
> doesn't completely follow NFD: in some cases, like <compat> flags, it
> actually does NFKD, but in others (<fraction>) it applies the
> canonical form.  We take a more consistent approach and always do plain NFD.
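
A minimal sketch of the NFD versus NFKD difference described above. It
uses Python's standard unicodedata module, not the kernel code from this
series, and the sample strings and helper names are chosen only for
illustration.

```python
import unicodedata


def nfd(name: str) -> str:
    """Canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", name)


def nfkd(name: str) -> str:
    """Compatibility decomposition (NFKD)."""
    return unicodedata.normalize("NFKD", name)


# Illustrative names, not taken from the patch series.
superscript = "x\u00b2"    # "x" followed by SUPERSCRIPT TWO
accented = "caf\u00e9"     # precomposed LATIN SMALL LETTER E WITH ACUTE

# NFD keeps the superscript, so "x²" and "x2" remain distinct names.
assert nfd(superscript) == "x\u00b2"

# NFKD replaces the superscript with a plain digit, losing the distinction.
assert nfkd(superscript) == "x2"

# Both forms split a precomposed accent into base + combining mark.
assert nfd(accented) == nfkd(accented) == "cafe\u0301"
```

Under NFKD the names "x²" and "x2" would refer to the same file, which
is the loss of semantic meaning the paragraph above refers to.
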
> 
> This RFC, therefore, aims to resume/start conversation with some
> stakeholders that may have something to say regarding the normalization
> method used.  I added people from SMB, NFS and FS development who
> might be interested in this.
> 
> Regarding Casefold, I am unsure whether Casefold Common + Full still
> makes sense after migrating from the compatibility to the canonical
> form.  While Casefold Full, by definition, addresses cases where the
> casefolding grows in size, like the casefold of the German eszett to SS,
> it is also responsible for folding lowercase ligatures without a
> corresponding uppercase to their compatible counterpart.  This means
> that on -F directories, o_f_f_i_c_e and o_ff_i_c_e will differ, while on
> +F directories they will match.  This seems unacceptable to me,
> suggesting that we should start to use Common + Simple instead of Common
> + Full, but I would like more input on what seems more reasonable to
> you.
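
A rough sketch of the ligature behaviour described above, again in
Python rather than with the kernel code. str.casefold() applies full
case folding. The standard library has no simple case folding, so
approx_simple_fold() below is a hypothetical stand-in that keeps a
character unchanged whenever its full folding would grow in length.
That is only an approximation of Common + Simple.

```python
import unicodedata


def full_fold(name: str) -> str:
    """Full case folding followed by NFD."""
    return unicodedata.normalize("NFD", name.casefold())


def approx_simple_fold(name: str) -> str:
    """Approximation of simple case folding followed by NFD.

    Assumption for illustration only: a character whose full folding
    expands to more than one character is left unchanged.
    """
    folded = []
    for ch in name:
        f = ch.casefold()
        folded.append(f if len(f) == 1 else ch)
    return unicodedata.normalize("NFD", "".join(folded))


plain = "office"
ligature = "o\ufb00ice"    # U+FB00 LATIN SMALL LIGATURE FF

# Without casefolding, NFD alone keeps the two names distinct.
assert unicodedata.normalize("NFD", ligature) != plain

# Full folding expands the ligature, so the names collide.
assert full_fold(ligature) == full_fold(plain)

# The simple-folding approximation keeps them distinct.
assert approx_simple_fold(ligature) != approx_simple_fold(plain)

# Full folding is also what expands the eszett.
assert full_fold("Stra\u00dfe") == "strasse"
```

With a simple folding, ligature handling stays the same whether or not
casefolding is enabled on the directory; the cost is that the eszett no
longer folds to "ss".
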
> 
> After we decide on this, I will be sending new patches to update
> e2fsprogs to the agreed method and remove the normalization/casefold
> type flags (EXT4_UTF8_NORMALIZATION_TYPE_NFKD,
> EXT4_UTF8_CASEFOLD_TYPE_NFKDCF), before actually proposing the current
> patch series for inclusion in the kernel.
> 
> For the record, I am aware that Unicode 12 was released 2 weeks ago. The
> world can't live without a new set of emojis every 6 months.  I will
> withhold updating the Unicode version until we get something
> upstreamable, then I will update to the latest version and send a new
> version.  This way I avoid having to update versions that will never
> actually be used.
> 
> Practical things, w.r.t. this patch series:
> 
>   - As usual, the UCD files are not part of the series, because they
>   would cause the email to bounce.  To test this one would need to fetch
>   the files as explained in the commit message.
> 
>   - If you prefer, you can check out the branch from
>      https://gitlab.collabora.com/krisman/linux -b ext4-ci-directory-no-nls
> 
>   - More details on the design decisions restricted to ext4 are
>     available in the corresponding commit messages.
> 
> Thanks!
> 

Hi,
I briefly scanned but did not look terribly closely:

Does this patch series ignore ext3 filesystems that are being handled
by the ext4fs code?

Thanks.

> 
> Gabriel Krisman Bertazi (7):
>   unicode: Implement higher level API for string handling
>   unicode: Introduce test module for normalized utf8 implementation
>   MAINTAINERS: Add Unicode subsystem entry
>   ext4: Include encoding information in the superblock
>   ext4: Support encoding-aware file name lookups
>   ext4: Implement EXT4_CASEFOLD_FL flag
>   docs: ext4.rst: Document encoding and case-insensitive
> 
> Olaf Weber (4):
>   unicode: Add unicode character database files
>   scripts: add trie generator for UTF-8
>   unicode: Introduce code for UTF-8 normalization
>   unicode: reduce the size of utf8data[]
> 
>  Documentation/admin-guide/ext4.rst |   41 +
>  MAINTAINERS                        |    6 +
>  fs/Kconfig                         |    1 +
>  fs/Makefile                        |    1 +
>  fs/ext4/dir.c                      |   43 +
>  fs/ext4/ext4.h                     |   42 +-
>  fs/ext4/hash.c                     |   38 +-
>  fs/ext4/ialloc.c                   |    2 +-
>  fs/ext4/inline.c                   |    2 +-
>  fs/ext4/inode.c                    |    4 +-
>  fs/ext4/ioctl.c                    |   18 +
>  fs/ext4/namei.c                    |  104 +-
>  fs/ext4/super.c                    |   91 +
>  fs/unicode/Kconfig                 |   13 +
>  fs/unicode/Makefile                |   22 +
>  fs/unicode/ucd/README              |   33 +
>  fs/unicode/utf8-core.c             |  183 ++
>  fs/unicode/utf8-norm.c             |  797 +++++++
>  fs/unicode/utf8-selftest.c         |  320 +++
>  fs/unicode/utf8n.h                 |  117 +
>  include/linux/fs.h                 |    2 +
>  include/linux/unicode.h            |   30 +
>  scripts/Makefile                   |    1 +
>  scripts/mkutf8data.c               | 3418 ++++++++++++++++++++++++++++
>  24 files changed, 5307 insertions(+), 22 deletions(-)
>  create mode 100644 fs/unicode/Kconfig
>  create mode 100644 fs/unicode/Makefile
>  create mode 100644 fs/unicode/ucd/README
>  create mode 100644 fs/unicode/utf8-core.c
>  create mode 100644 fs/unicode/utf8-norm.c
>  create mode 100644 fs/unicode/utf8-selftest.c
>  create mode 100644 fs/unicode/utf8n.h
>  create mode 100644 include/linux/unicode.h
>  create mode 100644 scripts/mkutf8data.c
> 


-- 
~Randy
