linux-kernel.vger.kernel.org archive mirror
* How to add Unicode character tables to the kernel?
From: Theodore Ts'o @ 2019-03-31 23:09 UTC
  To: Linus Torvalds; +Cc: linux-kernel, Gabriel Krisman Bertazi

Hi Linus,

I'm currently looking to integrate Unicode case-folding and
normalization support into ext4.  In order to do this, we need to
include some Unicode table data into the kernel sources.  Per your
suggestion, the plan is to put them in fs/unicode and not fs/nls.

The question is how to do this, with different tradeoffs.  One is to
simply include a utf8data.h file, which will be 320k.  That might
sound large, but in fs/nls there are 3544k worth of similar files.
Some are relatively small --- only 16k.  But others are quite large
--- 480k to 856k.  The table for Chinese character set is such an
example.  So in comparison, the 320k size of utf8data.h is quite
compact.

The problem with this solution is that the files in fs/nls, and the
proposed utf8data.h, are generated files.  For example:

/*
 * linux/fs/nls/nls_cp850.c
 *
 * Charset cp850 translation tables.
 * Generated automatically from the Unicode and charset
 * tables from the Unicode Organization (www.unicode.org).
 * The Unicode to charset table has only exact mappings.
 */
	....
static const wchar_t charset2uni[256] = {
	/* 0x00*/
	0x0000, 0x0001, 0x0002, 0x0003,
	0x0004, 0x0005, 0x0006, 0x0007,
	....

Now, one could argue that these tables are not the preferred form of
modification, per the definition in the GPL.  So alternatively we
could include the underlying Unicode data files from unicode.org, and
a program that generates utf8data.h from those data files.  The
downside of this approach is that it will increase the size of the
kernel tree by over 5 megabytes:

<tytso@lambda> {/usr/projects/linux/ext4}   (unicode)
1395% ls fs/unicode/ucd
total 5544
  84 CaseFolding-11.0.0.txt	          4 NormalizationCorrections-11.0.0.txt
 112 DerivedAge-11.0.0.txt	       2492 NormalizationTest-11.0.0.txt
 160 DerivedCombiningClass-11.0.0.txt     4 README
 960 DerivedCoreProperties-11.0.0.txt  1728 UnicodeData-11.0.0.txt

Generation of the utf8data.h is fast; so this is basically a disk
space question.  The files *are* compressible; and if we compressed
them all, it would be about 932k.  This won't help the increase in the
size of the git pack files, and we'll still need to decompress the
files when building the kernel, so some might still not be excited
about this.
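
As a rough illustration of the generation step described above, the
Python sketch below turns lines in the CaseFolding.txt format into a
small C array.  It is not the generator proposed for fs/unicode; the
function names, the array name casefold_simple, and the flat pair
layout are assumptions made for this example, and the sample lines are
a short hand-picked excerpt rather than the full data file.

```python
import io


def parse_case_folding(lines, statuses=("C", "S")):
    """Parse CaseFolding.txt-style lines into {code point: folded code point}.

    Each data line has the form "<code>; <status>; <mapping>; # <name>".
    Only the one-to-one statuses C (common) and S (simple) are kept by
    default, so multi-code-point F (full) mappings are skipped.
    """
    table = {}
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        code, status, mapping = fields[0], fields[1], fields[2]
        if status not in statuses:
            continue
        table[int(code, 16)] = int(mapping, 16)
    return table


def emit_c_table(table, name="casefold_simple"):
    """Render the mapping as a C array of {code point, folded} pairs.

    The array name and layout are hypothetical, chosen for illustration.
    """
    rows = [
        f"\t{{ 0x{cp:04X}, 0x{folded:04X} }},"
        for cp, folded in sorted(table.items())
    ]
    return "\n".join(
        [f"static const unsigned int {name}[][2] = {{", *rows, "};"]
    )


# Hand-picked sample lines in the CaseFolding.txt format (not the full file).
SAMPLE = """\
# Excerpt for illustration only
0041; C; 0061; # LATIN CAPITAL LETTER A
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
"""

if __name__ == "__main__":
    print(emit_c_table(parse_case_folding(io.StringIO(SAMPLE))))
```

A real generator would also need the normalization data and a much
more compact encoding than a flat array of pairs; the sketch only
shows why a generated header is a derived artifact of the upstream
text files.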

So Linus, do you have a preference between:

* Just drop the 320k utf8data.h file into fs/unicode.  The file is
  basically much like the files in fs/nls, so there is precedent for
  this.  Similarly, the files in lib/font are also data files, and
  we've historically not been worried about whether or not this would
  cause objections from people who would argue that these are not the
  "preferred form of modification".  Of course, I very much doubt
  anyone has ever *wanted* to modify these files, but....

* Drop the uncompressed 5544k worth of fs/unicode/ucd/*.txt files into
  the kernel sources.

* Drop the compressed fs/unicode/ucd/*.txt.gz into the kernel sources.
  This will increase the kernel sources by 932k.

If we go down the first path, we will include the program to generate
the utf8data.h, and instructions for how to download the
fs/unicode/ucd/*.txt from unicode.org.  I don't foresee any kernel
developers actually wanting to modify these files, since if we do this
we break compatibility with everyone else using Unicode.  The only
reason to include them is people who are nit-picky with respect to the
GPL.

Personally, I don't care.  I just want direction on which path you
would prefer, since I predict that no matter which path gets chosen,
there will be some people who will be kvetching and registering
objections.

Thanks,

					- Ted


* Re: How to add Unicode character tables to the kernel?
From: Linus Torvalds @ 2019-04-01 17:51 UTC
  To: Theodore Ts'o, Linus Torvalds, Linux List Kernel Mailing,
	Gabriel Krisman Bertazi

On Sun, Mar 31, 2019 at 4:09 PM Theodore Ts'o <tytso@mit.edu> wrote:
>
> The question is how to do this, with different tradeoffs.  One is to
> simply include a utf8data.h file, which will be 320k.  That might
> sound large, but in fs/nls there are 3544k worth of similar files.
> Some are relatively small --- only 16k.  But others are quite large
> --- 480k to 856k.  The table for Chinese character set is such an
> example.  So in comparison, the 320k size of utf8data.h is quite
> compact.
>
> The problem with this solution is that the files in fs/nls, and the
> proposed utf8data.h, are generated files.

Oh, we definitely don't want to copy the original huge tables, and we
don't even *want* people to edit those things in the first place.

So generated files are fine. It's not like the source data isn't
public, and yes, the commit message should have a pointer to it and
how to get the source and the generated files. But no, we shouldn't
feel like we should encourage people to be able to generate their own
modified unicode tables.

                Linus


* Re: How to add Unicode character tables to the kernel?
From: Gabriel Krisman Bertazi @ 2019-04-01 19:23 UTC
  To: Linus Torvalds; +Cc: Theodore Ts'o, Linux List Kernel Mailing

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Sun, Mar 31, 2019 at 4:09 PM Theodore Ts'o <tytso@mit.edu> wrote:
>>
>> The question is how to do this, with different tradeoffs.  One is to
>> simply include a utf8data.h file, which will be 320k.  That might
>> sound large, but in fs/nls there are 3544k worth of similar files.
>> Some are relatively small --- only 16k.  But others are quite large
>> --- 480k to 856k.  The table for Chinese character set is such an
>> example.  So in comparison, the 320k size of utf8data.h is quite
>> compact.
>>
>> The problem with this solution is that the files in fs/nls, and the
>> proposed utf8data.h, are generated files.
>
> Oh, we definitely don't want to copy the original huge tables, and we
> don't even *want* people to edit those things in the first place.
>
> So generated files are fine. It's not like the source data isn't
> public, and yes, the commit message should have a pointer to it and
> how to get the source and the generated files. But no, we shouldn't
> feel like we should encourage people to be able to generate their own
> modified unicode tables.

Thanks!  Ted, as you know, the current patchset depends on the original
tables, so I will adapt it to only include the generated files and
submit a new version to the mailing list.

-- 
Gabriel Krisman Bertazi
