All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Theodore Ts'o" <tytso@mit.edu>
To: Shreeya Patel <shreeya.patel@collabora.com>
Cc: Christoph Hellwig <hch@infradead.org>,
	adilger.kernel@dilger.ca, jaegeuk@kernel.org, chao@kernel.org,
	krisman@collabora.com, ebiggers@google.com, drosen@google.com,
	ebiggers@kernel.org, yuchao0@huawei.com,
	linux-ext4@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-f2fs-devel@lists.sourceforge.net,
	linux-fsdevel@vger.kernel.org, kernel@collabora.com,
	andre.almeida@collabora.com
Subject: Re: [PATCH v8 4/4] fs: unicode: Add utf8 module and a unicode layer
Date: Tue, 27 Apr 2021 10:50:38 -0400	[thread overview]
Message-ID: <YIgkvjdrJPjeoJH7@mit.edu> (raw)
In-Reply-To: <61d85255-d23e-7016-7fb5-7ab0a6b4b39f@collabora.com>

On Tue, Apr 27, 2021 at 03:39:15PM +0530, Shreeya Patel wrote:
> > > Hence, make UTF-8 encoding loadable by converting it into a module and
> > > also add built-in UTF-8 support option for compiling it into the
> > > kernel whenever required by the filesystem.
> > The way this is implemement looks rather awkward.

I think that's a bit awkard is the trying to create an abstraction
separation between the unicode and utf8 layers, just in case, at some
point, we want fs/unicode to support more than just utf8.

I think we're better off being opinionated here, and say that the only
unicode encoding that will be supported by the kernel is UTF-8.
Period.  In which case, we don't need to try to insert this unneeded
abstraction layer.

If you really want to make make fs/unicode support more than one
encoding --- say, UTF-16LE, as used by NTFS --- at that point we can
think about what the abstractions should look like.  For example, it
doesn't _actually_ make sense for the data-trie structures to be part
of the utf-8 encoding.  The normalization tables are for Unicode, and
it wouldn't make sense for UTF-16 to have its own normalization
tables, bloating the kernel even more.

It *is* true that the normalization tables have been optimized for
utf-8, because that's what the whole world actually uses; utf-16le is
really a legacy use case.  So presumably, we would probably find a way
to code up the utf-16 functions in a way that used the utf-8 data
tables, even if it wasn't 100% optimal in terms of speed.

But it's probably not worth it at this point.

> > Given that the large memory usage is for a data table and not for code,
> > why not treat is as a firmware blob and load it using request_firmware?
> 
> utf8 module not just has the data table but also has some kernel code.
> The big part that we are trying to keep out of the kernel is a tree
> structure that gets traversed based on a key that is the file name.
> This is done when issuing a lookup in the filesystem, which has to be very
> fast. So maybe it would not be so good to use request_firmware for
> such a core feature.

Speed really isn't a great argument here; the request_firmware is
something that would only need to be done once, when a file system
which requires Unicode normalization and/or case-folding is mounted.

I think the better argument to make is just one of simplicity;
separating the Unicode data table from the kernel adds complexity.  It
also reduces flexibility, since for use cases where it's actually
_preferable_ to have Unicode functionality permanently built-in the
kernel, we now force the use of some kind of initial ramdisk to load a
module before the root file system (which might require Unicode
support) could even be mounted.

The argument *for* making the Unicode table be a loadable firmware is
that it might make it possible to upgrade to a newer version of
Unicode without needing to do a kernel recompile.  On average, Unicode
relases a new to support new character sets every year or so, or when
there Japanese Emperor requiring a new reign name :-).  Usually the
new character sets are for obscure ancient alphabets, and so it's
really not a big deal if the kernel doesn't support, say,
Chorasmian[1] or Dives Akuru[2].  Perhaps people would make a much
bigger deal about new Emoji characters, or new code points for the
Creative Commons symbols.  I'm personally not excited enough to claim
that it's worth the extra complexity, but some people might think so.  :-)

[1] used in Central Asia across Uzbekistan, Kazakhstan, and
Turkmenistan to write an extinct Eastern Iranian language.

[2] historically used in the Maldives until the 20th century.

Of course, using those new Emoji symbols in file names would reduce
portability of that file system if Strict Normalization was mandated.
Fortunately, ext4 and f2fs don't enable strict normalizaation by
default, which is also good, because it means if we don't have the
latest Unicode update in the kernel, it doesn't really matter that
much.... again, not worth the extra complexity/headache IMHO.

Cheers,

					- Ted

WARNING: multiple messages have this Message-ID (diff)
From: "Theodore Ts'o" <tytso@mit.edu>
To: Shreeya Patel <shreeya.patel@collabora.com>
Cc: ebiggers@kernel.org, kernel@collabora.com, drosen@google.com,
	ebiggers@google.com, linux-kernel@vger.kernel.org,
	linux-f2fs-devel@lists.sourceforge.net,
	Christoph Hellwig <hch@infradead.org>,
	adilger.kernel@dilger.ca, linux-fsdevel@vger.kernel.org,
	jaegeuk@kernel.org, andre.almeida@collabora.com,
	linux-ext4@vger.kernel.org, krisman@collabora.com
Subject: Re: [f2fs-dev] [PATCH v8 4/4] fs: unicode: Add utf8 module and a unicode layer
Date: Tue, 27 Apr 2021 10:50:38 -0400	[thread overview]
Message-ID: <YIgkvjdrJPjeoJH7@mit.edu> (raw)
In-Reply-To: <61d85255-d23e-7016-7fb5-7ab0a6b4b39f@collabora.com>

On Tue, Apr 27, 2021 at 03:39:15PM +0530, Shreeya Patel wrote:
> > > Hence, make UTF-8 encoding loadable by converting it into a module and
> > > also add built-in UTF-8 support option for compiling it into the
> > > kernel whenever required by the filesystem.
> > The way this is implemement looks rather awkward.

I think that's a bit awkard is the trying to create an abstraction
separation between the unicode and utf8 layers, just in case, at some
point, we want fs/unicode to support more than just utf8.

I think we're better off being opinionated here, and say that the only
unicode encoding that will be supported by the kernel is UTF-8.
Period.  In which case, we don't need to try to insert this unneeded
abstraction layer.

If you really want to make make fs/unicode support more than one
encoding --- say, UTF-16LE, as used by NTFS --- at that point we can
think about what the abstractions should look like.  For example, it
doesn't _actually_ make sense for the data-trie structures to be part
of the utf-8 encoding.  The normalization tables are for Unicode, and
it wouldn't make sense for UTF-16 to have its own normalization
tables, bloating the kernel even more.

It *is* true that the normalization tables have been optimized for
utf-8, because that's what the whole world actually uses; utf-16le is
really a legacy use case.  So presumably, we would probably find a way
to code up the utf-16 functions in a way that used the utf-8 data
tables, even if it wasn't 100% optimal in terms of speed.

But it's probably not worth it at this point.

> > Given that the large memory usage is for a data table and not for code,
> > why not treat is as a firmware blob and load it using request_firmware?
> 
> utf8 module not just has the data table but also has some kernel code.
> The big part that we are trying to keep out of the kernel is a tree
> structure that gets traversed based on a key that is the file name.
> This is done when issuing a lookup in the filesystem, which has to be very
> fast. So maybe it would not be so good to use request_firmware for
> such a core feature.

Speed really isn't a great argument here; the request_firmware is
something that would only need to be done once, when a file system
which requires Unicode normalization and/or case-folding is mounted.

I think the better argument to make is just one of simplicity;
separating the Unicode data table from the kernel adds complexity.  It
also reduces flexibility, since for use cases where it's actually
_preferable_ to have Unicode functionality permanently built-in the
kernel, we now force the use of some kind of initial ramdisk to load a
module before the root file system (which might require Unicode
support) could even be mounted.

The argument *for* making the Unicode table be a loadable firmware is
that it might make it possible to upgrade to a newer version of
Unicode without needing to do a kernel recompile.  On average, Unicode
relases a new to support new character sets every year or so, or when
there Japanese Emperor requiring a new reign name :-).  Usually the
new character sets are for obscure ancient alphabets, and so it's
really not a big deal if the kernel doesn't support, say,
Chorasmian[1] or Dives Akuru[2].  Perhaps people would make a much
bigger deal about new Emoji characters, or new code points for the
Creative Commons symbols.  I'm personally not excited enough to claim
that it's worth the extra complexity, but some people might think so.  :-)

[1] used in Central Asia across Uzbekistan, Kazakhstan, and
Turkmenistan to write an extinct Eastern Iranian language.

[2] historically used in the Maldives until the 20th century.

Of course, using those new Emoji symbols in file names would reduce
portability of that file system if Strict Normalization was mandated.
Fortunately, ext4 and f2fs don't enable strict normalizaation by
default, which is also good, because it means if we don't have the
latest Unicode update in the kernel, it doesn't really matter that
much.... again, not worth the extra complexity/headache IMHO.

Cheers,

					- Ted


_______________________________________________
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

  reply	other threads:[~2021-04-27 14:51 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-23 20:51 [PATCH v8 0/4] Make UTF-8 encoding loadable Shreeya Patel
2021-04-23 20:51 ` [f2fs-dev] " Shreeya Patel
2021-04-23 20:51 ` [PATCH v8 1/4] fs: unicode: Use strscpy() instead of strncpy() Shreeya Patel
2021-04-23 20:51   ` [f2fs-dev] " Shreeya Patel
2021-04-23 20:51 ` [PATCH v8 2/4] fs: unicode: Rename function names from utf8 to unicode Shreeya Patel
2021-04-23 20:51   ` [f2fs-dev] " Shreeya Patel
2021-04-23 20:51 ` [PATCH v8 3/4] fs: unicode: Rename utf8-core file to unicode-core Shreeya Patel
2021-04-23 20:51   ` [f2fs-dev] " Shreeya Patel
2021-04-23 20:51 ` [PATCH v8 4/4] fs: unicode: Add utf8 module and a unicode layer Shreeya Patel
2021-04-23 20:51   ` [f2fs-dev] " Shreeya Patel
2021-04-27  6:29   ` Christoph Hellwig
2021-04-27  6:29     ` [f2fs-dev] " Christoph Hellwig
2021-04-27 10:09     ` Shreeya Patel
2021-04-27 10:09       ` [f2fs-dev] " Shreeya Patel
2021-04-27 14:50       ` Theodore Ts'o [this message]
2021-04-27 14:50         ` Theodore Ts'o
2021-04-27 15:06         ` Gabriel Krisman Bertazi
2021-04-27 15:06           ` [f2fs-dev] " Gabriel Krisman Bertazi
2021-04-28 14:12           ` Theodore Ts'o
2021-04-28 14:12             ` [f2fs-dev] " Theodore Ts'o
2021-04-28 18:58             ` Gabriel Krisman Bertazi
2021-04-28 18:58               ` [f2fs-dev] " Gabriel Krisman Bertazi
     [not found]               ` <7caab939-2800-0cc2-7b65-345af3fce73d@collabora.com>
2021-05-11  4:35                 ` Christoph Hellwig
2021-05-11  4:35                   ` [f2fs-dev] " Christoph Hellwig
2021-05-20 20:19                   ` Shreeya Patel
2021-05-20 20:19                     ` [f2fs-dev] " Shreeya Patel
2021-06-03  0:07                     ` Gabriel Krisman Bertazi
2021-06-03  0:07                       ` [f2fs-dev] " Gabriel Krisman Bertazi
2021-06-16  4:09                       ` Christoph Hellwig
2021-06-16  4:09                         ` [f2fs-dev] " Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YIgkvjdrJPjeoJH7@mit.edu \
    --to=tytso@mit.edu \
    --cc=adilger.kernel@dilger.ca \
    --cc=andre.almeida@collabora.com \
    --cc=chao@kernel.org \
    --cc=drosen@google.com \
    --cc=ebiggers@google.com \
    --cc=ebiggers@kernel.org \
    --cc=hch@infradead.org \
    --cc=jaegeuk@kernel.org \
    --cc=kernel@collabora.com \
    --cc=krisman@collabora.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-f2fs-devel@lists.sourceforge.net \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=shreeya.patel@collabora.com \
    --cc=yuchao0@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.