linux-cifs.vger.kernel.org archive mirror
From: "Pali Rohár" <pali.rohar@gmail.com>
To: Jan Kara <jack@suse.cz>
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-ntfs-dev@lists.sourceforge.net, linux-cifs@vger.kernel.org,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Luis de Bethencourt <luisbg@kernel.org>,
	Salah Triki <salah.triki@gmail.com>,
	Steve French <sfrench@samba.org>,
	OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	David Sterba <dsterba@suse.com>,
	Dave Kleikamp <shaggy@kernel.org>,
	Anton Altaparmakov <anton@tuxera.com>, Jan Kara <jack@suse.com>,
	"Theodore Y. Ts'o" <tytso@mit.edu>,
	Eric Sandeen <sandeen@redhat.com>,
	Namjae Jeon <linkinjeon@gmail.com>, Pavel Machek <pavel@ucw.cz>,
	Christoph Hellwig <hch@infradead.org>
Subject: Re: Unification of filesystem encoding options
Date: Tue, 7 Jan 2020 18:38:42 +0100
Message-ID: <20200107173842.ciskn4ahuhiklycm@pali>
In-Reply-To: <20200107133233.GC25547@quack2.suse.cz>

On Tuesday 07 January 2020 14:32:33 Jan Kara wrote:
> On Thu 02-01-20 22:18:55, Pali Rohár wrote:
> > 1) Unify mount options for specifying charset.
> > 
> > Currently all filesystems except msdos and hfsplus have mount option
> > iocharset=<charset>. hfsplus has nls=<charset> and msdos does not
> > implement re-encoding support. Plus vfat, udf and isofs have broken
> > iocharset=utf8 option (but working utf8 option) And ntfs has deprecated
> > iocharset=<charset> option.
> > 
> > I would suggest following changes for unification:
> > 
> > * Add a new alias iocharset= for hfsplus which would do same as nls=
> > * Make iocharset=utf8 option for vfat, udf and isofs to do same as utf8
> > * Un-deprecate iocharset=<charset> option for ntfs
> > 
> > This would cause that all filesystems would have iocharset=<charset>
> > option which would work for any charset, including iocharset=utf8.
> > And it would fix also broken iocharset=utf8 for vfat, udf and isofs.
> 
> Makes sense to me.

Ok!

> > 2) Add support for Unicode code points above U+FFFF for filesystems
> > befs, hfs, hfsplus, jfs and ntfs, so iocharset=utf8 option would work
> > also with filenames in userspace which would be 4 bytes long UTF-8.
> 
> Also looks good but when doing this, I'd suggest we extend NLS to support
> full UTF-8 rather than implementing it by hand like e.g. we did for UDF.

The current kernel NLS framework API supports upper-case / lower-case
conversion only for single-byte encodings, so there is no
case-insensitive support for the UTF-8 encoding. And for Unicode
conversion it supports only UCS-2, therefore only code points up to
U+FFFF, which for UTF-8 means sequences of at most 3 bytes.
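
To make the 3-byte limit concrete, here is a small userspace sketch
(not kernel code; the function name is made up for illustration) that
returns how many UTF-8 bytes a code point needs. Everything a single
UCS-2 unit can hold fits in 3 bytes; anything above U+FFFF needs 4.

```c
#include <stdint.h>

/*
 * Number of bytes UTF-8 needs for one code point, or 0 if the value
 * is not a valid Unicode scalar value.
 * Hypothetical helper for illustration, not part of the kernel NLS API.
 */
static int utf8_encoded_len(uint32_t cp)
{
	if (cp < 0x80)
		return 1;
	if (cp < 0x800)
		return 2;
	if (cp >= 0xD800 && cp <= 0xDFFF)
		return 0;	/* surrogates are not scalar values */
	if (cp < 0x10000)
		return 3;	/* highest range a single UCS-2 unit can hold */
	if (cp <= 0x10FFFF)
		return 4;	/* beyond UCS-2: needs UTF-16 pairs or UTF-32 */
	return 0;
}
```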

This really is not possible to fix without rewriting the existing
filesystems which use the NLS API.

One hacky option would be to extend the NLS API from UCS-2 to UTF-16
and fix all users of the NLS API to expect UTF-16 surrogate pairs.
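
For reference, this is roughly what every NLS user would then have to
cope with. A minimal userspace sketch of the surrogate pair arithmetic;
the helper names are my own for illustration and are not existing
kernel functions.

```c
#include <stdint.h>

/* True if the 16-bit unit is one half of a surrogate pair. */
static int utf16_is_surrogate(uint16_t u)
{
	return u >= 0xD800 && u <= 0xDFFF;
}

/* Split a code point in U+10000..U+10FFFF into high and low surrogate. */
static void utf16_split(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
	cp -= 0x10000;				/* leaves a 20-bit value */
	*hi = (uint16_t)(0xD800 | (cp >> 10));	/* top 10 bits */
	*lo = (uint16_t)(0xDC00 | (cp & 0x3FF));	/* bottom 10 bits */
}

/* Combine a valid high/low surrogate pair back into a code point. */
static uint32_t utf16_join(uint16_t hi, uint16_t lo)
{
	return 0x10000 + (((uint32_t)(hi & 0x3FF)) << 10) + (lo & 0x3FF);
}
```

Code which treats each 16-bit unit as a complete character (as UCS-2
code does) would see two meaningless units here instead of one code
point.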

But I dislike UTF-16 and would rather use unicode_t (UTF-32), which is
already present in the kernel. But because existing filesystem drivers
pass their UCS-2/UTF-16 buffers from the FS to the NLS API, it is not
easy to change the whole NLS API from UCS-2 to UTF-32.

And still this change does not add support for case-insensitivity, so
it is useless for all MS filesystems (msdos, vfat, ntfs), which are the
majority.

The kernel already provides functions for converting between UTF-8 and
UTF-16, so this seems to be the easiest way to provide full UTF-8
support for filesystems which internally use UTF-16, similar to how it
is implemented in UDF.
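
To illustrate what such a conversion involves, here is a self-contained
userspace sketch. It is not the kernel implementation; the sketch_*
names are hypothetical and the strict error handling (reject malformed
input, no replacement character) is my assumption.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one UTF-8 sequence from s (len bytes available) into *cp.
 * Returns the number of bytes consumed, or -1 on malformed input.
 */
static int sketch_utf8_decode(const uint8_t *s, size_t len, uint32_t *cp)
{
	uint32_t c, min;
	int n, i;

	if (len == 0)
		return -1;
	if (s[0] < 0x80) {
		*cp = s[0];
		return 1;
	}
	if ((s[0] & 0xE0) == 0xC0) {
		n = 2; c = s[0] & 0x1F; min = 0x80;
	} else if ((s[0] & 0xF0) == 0xE0) {
		n = 3; c = s[0] & 0x0F; min = 0x800;
	} else if ((s[0] & 0xF8) == 0xF0) {
		n = 4; c = s[0] & 0x07; min = 0x10000;
	} else {
		return -1;	/* stray continuation or invalid lead byte */
	}
	if (len < (size_t)n)
		return -1;	/* truncated sequence */
	for (i = 1; i < n; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return -1;
		c = (c << 6) | (s[i] & 0x3F);
	}
	/* Reject overlong forms, surrogates and values past U+10FFFF. */
	if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
		return -1;
	*cp = c;
	return n;
}

/*
 * Convert a UTF-8 buffer to UTF-16 code units in native byte order.
 * Returns the number of units written, or -1 on malformed input or
 * when the output buffer is too small.
 */
static int sketch_utf8s_to_utf16s(const uint8_t *s, size_t len,
				  uint16_t *out, size_t maxout)
{
	size_t o = 0;

	while (len > 0) {
		uint32_t cp;
		int n = sketch_utf8_decode(s, len, &cp);

		if (n < 0)
			return -1;
		if (cp >= 0x10000) {
			/* 4-byte UTF-8 becomes a surrogate pair. */
			if (maxout - o < 2)
				return -1;
			cp -= 0x10000;
			out[o++] = (uint16_t)(0xD800 | (cp >> 10));
			out[o++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
		} else {
			if (o >= maxout)
				return -1;
			out[o++] = (uint16_t)cp;
		}
		s += n;
		len -= (size_t)n;
	}
	return (int)o;
}
```

Note that nothing here goes through a per-character NLS table, which is
why this path can handle 4-byte sequences without touching the NLS API.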

Moreover, all NLS encodings except UTF-8 are single-byte encodings and
map into Plane 0, so they can be represented by the currently used
UCS-2 encoding. Therefore conversion to Unicode works correctly for
them, and so do their case-insensitivity functions (or rather tables).
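
A rough sketch of why tables are enough in the single-byte case, using
ISO-8859-1 as the example. This is userspace illustration only; the
names are hypothetical and only loosely modeled on the idea of
per-charset tables, not copied from the kernel.

```c
#include <stdint.h>

/* One upper-case entry per possible byte value of the charset. */
static uint8_t latin1_upper[256];

/* ISO-8859-1 maps byte N directly to U+00NN, always inside Plane 0. */
static uint16_t latin1_char2uni(uint8_t c)
{
	return c;
}

/* Fill the upper-case table for ISO-8859-1. */
static void latin1_build_upper(void)
{
	int c;

	for (c = 0; c < 256; c++) {
		if (c >= 'a' && c <= 'z')
			latin1_upper[c] = (uint8_t)(c - 0x20);
		else if (c >= 0xE0 && c <= 0xFE && c != 0xF7)
			latin1_upper[c] = (uint8_t)(c - 0x20);
		else
			latin1_upper[c] = (uint8_t)c;	/* no mapping in charset */
	}
}
```

A 256-entry byte table like this has no way to describe case mapping
for a variable-length encoding, which is the problem with UTF-8 here.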

Adding support for case-insensitivity to the UTF-8 NLS encoding would
mean creating a completely new kernel NLS API (which would support
variable-length encodings) and rewriting all NLS filesystems to use
this new API. All existing NLS encodings would also need to be ported
to this new API.

Is that really something which has value? Just because of UTF-8?

To me it looks like the better option would be to remove the UTF-8 NLS
encoding, as it is broken. Some filesystems already do not use the NLS
API for their UTF-8 support (e.g. vfat, udf or the newly prepared
exfat), and others could be modified/extended/fixed in a similar way.

> > 3) Add support for iocharset= and codepage= options for msdos
> > filesystem. It shares lot of parts of code with vfat driver.
> 
> I guess this is for msdos filesystem maintainers to decide.

Yes!

-- 
Pali Rohár
pali.rohar@gmail.com

Thread overview: 7+ messages
2020-01-02 21:18 Unification of filesystem encoding options Pali Rohár
2020-01-07 13:32 ` Jan Kara
2020-01-07 17:38   ` Pali Rohár [this message]
2020-01-07 20:03     ` Theodore Y. Ts'o
2020-01-07 20:37       ` Pali Rohár
2020-01-08  7:13         ` OGAWA Hirofumi
2020-01-08  7:00     ` OGAWA Hirofumi