linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH] hfsplus: fix decomposition of Hangul characters
       [not found] <114108759.7167494.1511886580033.ref@mail.yahoo.com>
@ 2017-11-28 16:29 ` Hin-Tak Leung
  0 siblings, 0 replies; 19+ messages in thread
From: Hin-Tak Leung @ 2017-11-28 16:29 UTC (permalink / raw)
  To: Viacheslav Dubeyko, Ernesto A. Fernández
  Cc: linux-fsdevel, tchou, Alexander Viro, Andrew Morton,
	Ernesto A. Fernández


--------------------------------------------
On Tue, 28/11/17, Hin-Tak Leung <htl10@users.sourceforge.net> wrote:

> 1. Hangul canonical composition and
> decomposition is a separate topic from compositions of latin
> characters with accents. It is described in 
 
>  http://www.unicode.org/reports/tr15/tr15-18.html#Hangul


More current version of the URL: 

http://www.unicode.org/reports/tr15/#Hangul_Composition

The code samples, etc had been splitted off to urls, etc.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] hfsplus: fix decomposition of Hangul characters
  2018-08-23 18:29               ` Ernesto A. Fernández
@ 2018-08-24  1:20                 ` tchou
  0 siblings, 0 replies; 19+ messages in thread
From: tchou @ 2018-08-24  1:20 UTC (permalink / raw)
  To: Ernesto A. Fernández
  Cc: slava, linux-fsdevel, htl10, linux-fsdevel-owner

Hi,

Ernesto A. Fernández 於 2018-08-24 02:29 寫到:
> Hi again:
> 
> On Mon, Nov 27, 2017 at 04:36:54PM -0300, Ernesto A. Fernández wrote:
>> It would be good to have some feedback on the use of code under the
>> Unicode License, I don't know if that can be a problem.
>> 
>> To Ting-Chang Hou: This patch is slightly different from the one you
>> tested with your mac. You are still tagged as Tested-by, but you may
>> want to check again that I have not broken anything.
>> 
>> Thanks,
>> Ernest
>> 
>> -- >8 --
>> Subject: [PATCH] hfsplus: fix decomposition of Hangul characters
>> 
>> Files created under macOS become essentially unusable under linux if
>> their names contain any Hangul, and vice versa.
>> 
>> This happens because the normalization of Hangul characters is a 
>> special
>> case: it can be done algorithmically, without need of a table. The
>> hfsplus module deals with Hangul correctly when composing, but
>> completely forgets about it when performing a decomposition.
>> 
>> Solve this by using the Hangul decomposition function provided in the
>> Unicode Standard. It's under the Unicode License, compatible with the
>> GPL.
>> 
>> This patch will cause trouble for Hangul filenames already created by
>> the module in the past. This shouldn't be a special concern because 
>> the
>> main purpose of the module was always sharing with macOS. If a user
>> actually needs to access such a file the nodecompose mount option 
>> should
>> be enough.
>> 
>> Reported-by: Ting-Chang Hou <tchou@synology.com>
>> Tested-by: Ting-Chang Hou <tchou@synology.com>
>> Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
>> 
> 
> It took a while but this patch has been merged now. Thanks again for 
> your
> report.

And thanks for your patch, too.

> 
> Ernest

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] hfsplus: fix decomposition of Hangul characters
  2017-11-27 19:36             ` [PATCH] hfsplus: fix decomposition of Hangul characters Ernesto A. Fernández
  2017-11-27 22:40               ` Viacheslav Dubeyko
@ 2018-08-23 18:29               ` Ernesto A. Fernández
  2018-08-24  1:20                 ` tchou
  1 sibling, 1 reply; 19+ messages in thread
From: Ernesto A. Fernández @ 2018-08-23 18:29 UTC (permalink / raw)
  To: tchou; +Cc: slava, linux-fsdevel, htl10

Hi again:

On Mon, Nov 27, 2017 at 04:36:54PM -0300, Ernesto A. Fernández wrote:
> It would be good to have some feedback on the use of code under the
> Unicode License, I don't know if that can be a problem.
> 
> To Ting-Chang Hou: This patch is slightly different from the one you
> tested with your mac. You are still tagged as Tested-by, but you may
> want to check again that I have not broken anything.
> 
> Thanks,
> Ernest
> 
> -- >8 --
> Subject: [PATCH] hfsplus: fix decomposition of Hangul characters
> 
> Files created under macOS become essentially unusable under linux if
> their names contain any Hangul, and vice versa.
> 
> This happens because the normalization of Hangul characters is a special
> case: it can be done algorithmically, without need of a table. The
> hfsplus module deals with Hangul correctly when composing, but
> completely forgets about it when performing a decomposition.
> 
> Solve this by using the Hangul decomposition function provided in the
> Unicode Standard. It's under the Unicode License, compatible with the
> GPL.
> 
> This patch will cause trouble for Hangul filenames already created by
> the module in the past. This shouldn't be a special concern because the
> main purpose of the module was always sharing with macOS. If a user
> actually needs to access such a file the nodecompose mount option should
> be enough.
> 
> Reported-by: Ting-Chang Hou <tchou@synology.com>
> Tested-by: Ting-Chang Hou <tchou@synology.com>
> Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
>

It took a while but this patch has been merged now. Thanks again for your
report.

Ernest

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH] hfsplus: fix decomposition of Hangul characters
@ 2018-07-17 22:09 Ernesto A. Fernández
  0 siblings, 0 replies; 19+ messages in thread
From: Ernesto A. Fernández @ 2018-07-17 22:09 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Andrew Morton, Ting-Chang Hou

Files created under macOS cannot be opened under linux if their names
contain Korean characters, and vice versa.

The Korean alphabet is special because its normalization is done without
a table. The module deals with it correctly when composing, but forgets
about it for the decomposition.

Fix this using the Hangul decomposition function provided in the Unicode
Standard. The code fits a bit awkwardly because it requires a buffer,
while all the other normalizations are returned as pointers to the
decomposition table. This is actually also a bug because reordering may
still be needed, but for now leave it as it is.

The patch will cause trouble for Hangul filenames already created by the
module in the past. This shouldn't really be concern because its main
purpose was always sharing with macOS. If a user actually needs to
access such a file the nodecompose mount option should be enough.

Reported-and-tested-by: Ting-Chang Hou <tchou@synology.com>
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
---
 fs/hfsplus/unicode.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 56 insertions(+), 6 deletions(-)

diff --git a/fs/hfsplus/unicode.c b/fs/hfsplus/unicode.c
index dfa90c21948f..c8d1b2be7854 100644
--- a/fs/hfsplus/unicode.c
+++ b/fs/hfsplus/unicode.c
@@ -272,8 +272,8 @@ static inline int asc2unichar(struct super_block *sb, const char *astr, int len,
 	return size;
 }
 
-/* Decomposes a single unicode character. */
-static inline u16 *decompose_unichar(wchar_t uc, int *size)
+/* Decomposes a non-Hangul unicode character. */
+static u16 *hfsplus_decompose_nonhangul(wchar_t uc, int *size)
 {
 	int off;
 
@@ -296,6 +296,51 @@ static inline u16 *decompose_unichar(wchar_t uc, int *size)
 	return hfsplus_decompose_table + (off / 4);
 }
 
+/*
+ * Try to decompose a unicode character as Hangul. Return 0 if @uc is not
+ * precomposed Hangul, otherwise return the length of the decomposition.
+ *
+ * This function was adapted from sample code from the Unicode Standard
+ * Annex #15: Unicode Normalization Forms, version 3.2.0.
+ *
+ * Copyright (C) 1991-2018 Unicode, Inc.  All rights reserved.  Distributed
+ * under the Terms of Use in http://www.unicode.org/copyright.html.
+ */
+static int hfsplus_try_decompose_hangul(wchar_t uc, u16 *result)
+{
+	int index;
+	int l, v, t;
+
+	index = uc - Hangul_SBase;
+	if (index < 0 || index >= Hangul_SCount)
+		return 0;
+
+	l = Hangul_LBase + index / Hangul_NCount;
+	v = Hangul_VBase + (index % Hangul_NCount) / Hangul_TCount;
+	t = Hangul_TBase + index % Hangul_TCount;
+
+	result[0] = l;
+	result[1] = v;
+	if (t != Hangul_TBase) {
+		result[2] = t;
+		return 3;
+	}
+	return 2;
+}
+
+/* Decomposes a single unicode character. */
+static u16 *decompose_unichar(wchar_t uc, int *size, u16 *hangul_buffer)
+{
+	u16 *result;
+
+	/* Hangul is handled separately */
+	result = hangul_buffer;
+	*size = hfsplus_try_decompose_hangul(uc, result);
+	if (*size == 0)
+		result = hfsplus_decompose_nonhangul(uc, size);
+	return result;
+}
+
 int hfsplus_asc2uni(struct super_block *sb,
 		    struct hfsplus_unistr *ustr, int max_unistr_len,
 		    const char *astr, int len)
@@ -303,13 +348,14 @@ int hfsplus_asc2uni(struct super_block *sb,
 	int size, dsize, decompose;
 	u16 *dstr, outlen = 0;
 	wchar_t c;
+	u16 dhangul[3];
 
 	decompose = !test_bit(HFSPLUS_SB_NODECOMPOSE, &HFSPLUS_SB(sb)->flags);
 	while (outlen < max_unistr_len && len > 0) {
 		size = asc2unichar(sb, astr, len, &c);
 
 		if (decompose)
-			dstr = decompose_unichar(c, &dsize);
+			dstr = decompose_unichar(c, &dsize, dhangul);
 		else
 			dstr = NULL;
 		if (dstr) {
@@ -344,6 +390,7 @@ int hfsplus_hash_dentry(const struct dentry *dentry, struct qstr *str)
 	unsigned long hash;
 	wchar_t c;
 	u16 c2;
+	u16 dhangul[3];
 
 	casefold = test_bit(HFSPLUS_SB_CASEFOLD, &HFSPLUS_SB(sb)->flags);
 	decompose = !test_bit(HFSPLUS_SB_NODECOMPOSE, &HFSPLUS_SB(sb)->flags);
@@ -357,7 +404,7 @@ int hfsplus_hash_dentry(const struct dentry *dentry, struct qstr *str)
 		len -= size;
 
 		if (decompose)
-			dstr = decompose_unichar(c, &dsize);
+			dstr = decompose_unichar(c, &dsize, dhangul);
 		else
 			dstr = NULL;
 		if (dstr) {
@@ -396,6 +443,7 @@ int hfsplus_compare_dentry(const struct dentry *dentry,
 	const char *astr1, *astr2;
 	u16 c1, c2;
 	wchar_t c;
+	u16 dhangul_1[3], dhangul_2[3];
 
 	casefold = test_bit(HFSPLUS_SB_CASEFOLD, &HFSPLUS_SB(sb)->flags);
 	decompose = !test_bit(HFSPLUS_SB_NODECOMPOSE, &HFSPLUS_SB(sb)->flags);
@@ -413,7 +461,8 @@ int hfsplus_compare_dentry(const struct dentry *dentry,
 			len1 -= size;
 
 			if (decompose)
-				dstr1 = decompose_unichar(c, &dsize1);
+				dstr1 = decompose_unichar(c, &dsize1,
+							  dhangul_1);
 			if (!decompose || !dstr1) {
 				c1 = c;
 				dstr1 = &c1;
@@ -427,7 +476,8 @@ int hfsplus_compare_dentry(const struct dentry *dentry,
 			len2 -= size;
 
 			if (decompose)
-				dstr2 = decompose_unichar(c, &dsize2);
+				dstr2 = decompose_unichar(c, &dsize2,
+							  dhangul_2);
 			if (!decompose || !dstr2) {
 				c2 = c;
 				dstr2 = &c2;
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH] hfsplus: fix decomposition of Hangul characters
  2017-11-29 16:28     ` Viacheslav Dubeyko
@ 2017-11-30  4:03       ` Ernesto A. Fernández
  0 siblings, 0 replies; 19+ messages in thread
From: Ernesto A. Fernández @ 2017-11-30  4:03 UTC (permalink / raw)
  To: Viacheslav Dubeyko
  Cc: linux-fsdevel, tchou, Alexander Viro, Andrew Morton,
	Ernesto A. Fernández, Hin-Tak Leung

On Wed, Nov 29, 2017 at 08:28:56AM -0800, Viacheslav Dubeyko wrote:
> On Wed, 2017-11-29 at 11:42 -0300, Ernesto A. Fernández wrote:
> > On Wed, Nov 29, 2017 at 08:57:46AM +0000, Hin-Tak Leung wrote:
> > > 
> > > 
> > > --------------------------------------------
> > > On Tue, 28/11/17, Ernesto A. Fernández <ernesto.mnd.fernandez@gmail
> > > .com> wrote:
> > > 
> > > > 
> > > > This does not link to the code
> > > > sample at all, in fact it links to a page
> > > > that links to a page that links to a page that
> > > > links to the pdf.
> > >  
> > > I think this process is called bike-shedding - i.e. spending too
> > > much time arguing about trivial matters. The amount of (wasted)
> > > effort put into arguing about not providing pointers to relevant
> > > information. Please just put the relevant information in the next
> > > revision of the patch and re-submit.
> > Of course you ignored me, but I must insist: I would be far more
> > inclined
> > to take you seriously if you could just get your email client to send
> > replies correctly. And while you are at it, keep the lines under 80
> > characters, will you?
> 
> You cannot insist. Everybody has the right to do the things in own way.

I beg to differ. From Documentation/process/email-clients.rst, lines
43-44:

    "Email clients should generate and maintain References: or
     In-Reply-To: headers so that mail threading is not broken."

> You are ignoring our opinions. Why do we need to change our opinions?

I have not ignored you at all, in fact I have taken the time to reply to
every single one of your opinions. I even asked you to elaborate on some
points, which you are yet to do.

The real problem here is that you expect me to take your advice blindly,
which I'm not really inclined to do given our past conversations.

> Why do we need to take into account your vision? 

What vision are you talking about? This is just a bugfix. The reason my
work should be considered is because I took the trouble of replying to the
reporter, tracking down the bug and fixing their problem.

I did all that while you kept stubbornly denying that there was a bug at
all, and you actually recommended using a mount option that would have
made their files unreadable on a mac.

> What's the point to
> consider you in a serious way if you treat us like the kids?

Getting the bug fixed? That's why we are here, right?

If you don't want me to bother you with my patches anymore just say so.
I only cc'd you in this last one because you were already part of the
conversation and I thought it would be rude to remove you, but I have no
problem if you want out.

> By the way, my remarks are still the same. I don't agree this current
> state of the patch. The patch needs to be improved.

This was to be expected. I've been dealing with you for a few months now
and I have never seen you consider that you could be wrong about anything.

> Thanks,
> Vyacheslav Dubeyko.
> 
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] hfsplus: fix decomposition of Hangul characters
  2017-11-29 14:42   ` Ernesto A. Fernández
@ 2017-11-29 16:28     ` Viacheslav Dubeyko
  2017-11-30  4:03       ` Ernesto A. Fernández
  0 siblings, 1 reply; 19+ messages in thread
From: Viacheslav Dubeyko @ 2017-11-29 16:28 UTC (permalink / raw)
  To: Ernesto A. Fernández, Hin-Tak Leung
  Cc: linux-fsdevel, tchou, Alexander Viro, Andrew Morton

On Wed, 2017-11-29 at 11:42 -0300, Ernesto A. Fernández wrote:
> On Wed, Nov 29, 2017 at 08:57:46AM +0000, Hin-Tak Leung wrote:
> > 
> > 
> > --------------------------------------------
> > On Tue, 28/11/17, Ernesto A. Fernández <ernesto.mnd.fernandez@gmail
> > .com> wrote:
> > 
> > > 
> > > This does not link to the code
> > > sample at all, in fact it links to a page
> > > that links to a page that links to a page that
> > > links to the pdf.
> >  
> > I think this process is called bike-shedding - i.e. spending too
> > much time arguing about trivial matters. The amount of (wasted)
> > effort put into arguing about not providing pointers to relevant
> > information. Please just put the relevant information in the next
> > revision of the patch and re-submit.
> Of course you ignored me, but I must insist: I would be far more
> inclined
> to take you seriously if you could just get your email client to send
> replies correctly. And while you are at it, keep the lines under 80
> characters, will you?

You cannot insist. Everybody has the right to do the things in own way.
You are ignoring our opinions. Why do we need to change our opinions?
Why do we need to take into account your vision? What's the point to
consider you in a serious way if you treat us like the kids?

By the way, my remarks are still the same. I don't agree this current
state of the patch. The patch needs to be improved.

Thanks,
Vyacheslav Dubeyko.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] hfsplus: fix decomposition of Hangul characters
  2017-11-29  8:57 ` Hin-Tak Leung
@ 2017-11-29 14:42   ` Ernesto A. Fernández
  2017-11-29 16:28     ` Viacheslav Dubeyko
  0 siblings, 1 reply; 19+ messages in thread
From: Ernesto A. Fernández @ 2017-11-29 14:42 UTC (permalink / raw)
  To: Hin-Tak Leung
  Cc: linux-fsdevel, tchou, Alexander Viro, Andrew Morton,
	Ernesto A. Fernández, Viacheslav Dubeyko

On Wed, Nov 29, 2017 at 08:57:46AM +0000, Hin-Tak Leung wrote:
> 
> --------------------------------------------
> On Tue, 28/11/17, Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com> wrote:
> 
> > This does not link to the code
> > sample at all, in fact it links to a page
> > that links to a page that links to a page that
> > links to the pdf.
>  
> I think this process is called bike-shedding - i.e. spending too much time arguing about trivial matters. The amount of (wasted) effort put into arguing about not providing pointers to relevant information. Please just put the relevant information in the next revision of the patch and re-submit.

Of course you ignored me, but I must insist: I would be far more inclined
to take you seriously if you could just get your email client to send
replies correctly. And while you are at it, keep the lines under 80
characters, will you?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] hfsplus: fix decomposition of Hangul characters
       [not found] <748914583.7912009.1511948801051.ref@mail.yahoo.com>
@ 2017-11-29  9:46 ` Hin-Tak Leung
  0 siblings, 0 replies; 19+ messages in thread
From: Hin-Tak Leung @ 2017-11-29  9:46 UTC (permalink / raw)
  To: Ernesto A. Fernández
  Cc: linux-fsdevel, tchou, Alexander Viro, Andrew Morton,
	Ernesto A. Fernández, Viacheslav Dubeyko

Please include this (or other relevant information) in the commit message when re-submitting the patch:

http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf

Page 143, Section 3.12 Conjourning Jamo Behavior,
page 149-151, "Sample Code for Hangul Algorithms" .

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] hfsplus: fix decomposition of Hangul characters
       [not found] <1691800505.7838608.1511945866988.ref@mail.yahoo.com>
@ 2017-11-29  8:57 ` Hin-Tak Leung
  2017-11-29 14:42   ` Ernesto A. Fernández
  0 siblings, 1 reply; 19+ messages in thread
From: Hin-Tak Leung @ 2017-11-29  8:57 UTC (permalink / raw)
  To: Ernesto A. Fernández
  Cc: linux-fsdevel, tchou, Alexander Viro, Andrew Morton,
	Ernesto A. Fernández, Viacheslav Dubeyko


--------------------------------------------
On Tue, 28/11/17, Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com> wrote:

> This does not link to the code
> sample at all, in fact it links to a page
> that links to a page that links to a page that
> links to the pdf.
 
I think this process is called bike-shedding - i.e. spending too much time arguing about trivial matters. The amount of (wasted) effort put into arguing about not providing pointers to relevant information. Please just put the relevant information in the next revision of the patch and re-submit.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] hfsplus: fix decomposition of Hangul characters
  2017-11-28 21:48 ` Hin-Tak Leung
@ 2017-11-28 23:10   ` Ernesto A. Fernández
  0 siblings, 0 replies; 19+ messages in thread
From: Ernesto A. Fernández @ 2017-11-28 23:10 UTC (permalink / raw)
  To: Hin-Tak Leung
  Cc: linux-fsdevel, tchou, Alexander Viro, Andrew Morton,
	Ernesto A. Fernández, Viacheslav Dubeyko

On Tue, Nov 28, 2017 at 09:48:40PM +0000, Hin-Tak Leung wrote:
> --------------------------------------------
> On Tue, 28/11/17, Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com> wrote:
> 
> > Is that necessary? I think people can find the
> > Unicode Standard on their
> > own. Even if I did
> > provide a link, it's a pdf, so people would still
> > need
> > to look up the right sections. It's
> > not going to save anybody much work.
>  
> Yes, it is necessary. No, it is not a pdf. The code sample is inlined in the older version of this technical report:
> 
> http://www.unicode.org/reports/tr15/tr15-18.html#Hangul

This is extremely old, clearly not the version of Unicode I'm referencing.

> The code samples, etc had been splitted off and referenced in the current version:
> 
> http://www.unicode.org/reports/tr15/#Hangul_Composition

This does not link to the code sample at all, in fact it links to a page
that links to a page that links to a page that links to the pdf.

> Please:
> 1. include the above urls when you re-submit the next revision of the patch.
> 
> 2. include a brief description of minimal example - e.g. touch "\xXX\xYY\xZZ..." to create a minimal empty file for testing,
> in the commit message.

The way to test this patch is by creating a Hangul filename under MacOS
and trying to open it under linux, as the reporter did. You will not notice
anything wrong if you only work under linux.

> "People could look it up" is not an excuse not to give precise pointers to relevant information.
>

I would say "Section 3.12 of the Unicode Standard, version 10.0" is a very
precise pointer.

By the way, your mail client is making a huge mess of this thread; I have
no idea what message you are replying to anymore.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] hfsplus: fix decomposition of Hangul characters
       [not found] <1047295440.7535049.1511905720846.ref@mail.yahoo.com>
@ 2017-11-28 21:48 ` Hin-Tak Leung
  2017-11-28 23:10   ` Ernesto A. Fernández
  0 siblings, 1 reply; 19+ messages in thread
From: Hin-Tak Leung @ 2017-11-28 21:48 UTC (permalink / raw)
  To: Viacheslav Dubeyko, Ernesto A. Fernández
  Cc: linux-fsdevel, tchou, Alexander Viro, Andrew Morton,
	Ernesto A. Fernández

--------------------------------------------
On Tue, 28/11/17, Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com> wrote:

> Is that necessary? I think people can find the
> Unicode Standard on their
> own. Even if I did
> provide a link, it's a pdf, so people would still
> need
> to look up the right sections. It's
> not going to save anybody much work.
 
Yes, it is necessary. No, it is not a pdf. The code sample is inlined in the older version of this technical report:

http://www.unicode.org/reports/tr15/tr15-18.html#Hangul

The code samples, etc had been splitted off and referenced in the current version:

http://www.unicode.org/reports/tr15/#Hangul_Composition

Please:
1. include the above urls when you re-submit the next revision of the patch.

2. include a brief description of minimal example - e.g. touch "\xXX\xYY\xZZ..." to create a minimal empty file for testing,
in the commit message.

"People could look it up" is not an excuse not to give precise pointers to relevant information.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] hfsplus: fix decomposition of Hangul characters
  2017-11-28 16:15 ` Hin-Tak Leung
  2017-11-28 17:50   ` Viacheslav Dubeyko
@ 2017-11-28 18:51   ` Ernesto A. Fernández
  1 sibling, 0 replies; 19+ messages in thread
From: Ernesto A. Fernández @ 2017-11-28 18:51 UTC (permalink / raw)
  To: Hin-Tak Leung
  Cc: linux-fsdevel, tchou, Alexander Viro, Andrew Morton,
	Viacheslav Dubeyko, Ernesto A. Fernández

On Tue, Nov 28, 2017 at 04:15:18PM +0000, Hin-Tak Leung wrote:
> 
> --------------------------------------------
> On Tue, 28/11/17, Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com> wrote:
> 
> > The algorithm is very simple, the best way to
> > understand it is just
> > looking at the code. I
> > don't know the first thing about Korean writing, so
> > I don't think I should attempt to explain
> > why the decomposition is done
> > this way. If
> > somebody else is interested in the details, they can
> > follow
> > the citation in the header comment of
> > the decompose_hangul function.
>  
> Apologies for coming into this a bit late.
> 
> A couple of points:
> 
> 1. Hangul canonical composition and decomposition is a separate topic from compositions of latin characters with accents. It is described in 
> 
>  http://www.unicode.org/reports/tr15/tr15-18.html#Hangul
> 
> among other sources.

I am aware of that, that's the reason this patch exists. If you check the
commit message, you will see it reads:

    "This happens because the normalization of Hangul characters is
    a special case"
 
> 2. I think the mount option is a bit of a red-herring. I think we should just do what Mac OS X does - I think in the tech note it says something about storing things always in the decomposed form or composed form. Ideally we should make the differing mount options no-ops. Mac OS X does not need extra mount options, we shouldn't either.

MacOS stores the filenames in the NFD form, that is, decomposed. The problem
here was that linux was forgetting to decompose the Hangul.

The mount option has nothing to do with this patch, other than the fact
that it could later be used to access Hangul filenames potentially
stored (mistakenly) without decomposition. It is a very unlikely situation.

As to why the mount option exists, I couldn't tell you. It was here before
the git tree. It is disabled by default anyway, so it's not bothering
anyone.

Thanks,
Ernest

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] hfsplus: fix decomposition of Hangul characters
  2017-11-28 16:30                   ` Viacheslav Dubeyko
@ 2017-11-28 18:15                     ` Ernesto A. Fernández
  0 siblings, 0 replies; 19+ messages in thread
From: Ernesto A. Fernández @ 2017-11-28 18:15 UTC (permalink / raw)
  To: Viacheslav Dubeyko
  Cc: linux-fsdevel, tchou, htl10, Alexander Viro, Andrew Morton,
	Ernesto A. Fernández

On Tue, Nov 28, 2017 at 08:30:54AM -0800, Viacheslav Dubeyko wrote:
> On Tue, 2017-11-28 at 12:02 -0300, Ernesto A. Fernández wrote:
> > On Mon, Nov 27, 2017 at 02:40:53PM -0800, Viacheslav Dubeyko wrote:
> > > 
> > > On Mon, 2017-11-27 at 16:36 -0300, Ernesto A. Fernández wrote:
> > > > 
> > > > It would be good to have some feedback on the use of code under
> > > > the
> > > > Unicode License, I don't know if that can be a problem.
> > > > 
> > > > To Ting-Chang Hou: This patch is slightly different from the one
> > > > you
> > > > tested with your mac. You are still tagged as Tested-by, but you
> > > > may
> > > > want to check again that I have not broken anything.
> > > > 
> > > > Thanks,
> > > > Ernest
> > > > 
> > > > -- >8 --
> > > > Subject: [PATCH] hfsplus: fix decomposition of Hangul characters
> > > > 
> > > > Files created under macOS become essentially unusable under linux
> > > > if
> > > > their names contain any Hangul, and vice versa.
> > > > 
> > > > This happens because the normalization of Hangul characters is a
> > > > special
> > > > case: it can be done algorithmically, without need of a table.
> > > > The
> > > > hfsplus module deals with Hangul correctly when composing, but
> > > > completely forgets about it when performing a decomposition.
> > > > 
> > > > Solve this by using the Hangul decomposition function provided in
> > > > the
> > > > Unicode Standard. It's under the Unicode License, compatible with
> > > > the
> > > > GPL.
> > > > 
> > > I believe it makes sense to share more details about algorithm
> > > details.
> > > I think that the Unicode standard contains some explanation how to
> > > decompose correctly. It will be great to have some brief citation
> > > in the
> > > patch description.
> > The algorithm is very simple, the best way to understand it is just
> > looking at the code. I don't know the first thing about Korean
> > writing, so
> > I don't think I should attempt to explain why the decomposition is
> > done
> > this way. If somebody else is interested in the details, they can
> > follow
> > the citation in the header comment of the decompose_hangul function.
> > 
> 
> You should share the link where it can be found.

Is that necessary? I think people can find the Unicode Standard on their
own. Even if I did provide a link, it's a pdf, so people would still need
to look up the right sections. It's not going to save anybody much work.

> Finally, the comment
> needs in more detailed explanation.

Are you talking about the function header comment? I don't know if I
follow. Perhaps I could explain that the return value is the size, but
it seems pretty obvious from the use.

> > > 
> > > > 
> > > > This patch will cause trouble for Hangul filenames already
> > > > created by
> > > > the module in the past. This shouldn't be a special concern
> > > > because the
> > > > main purpose of the module was always sharing with macOS. If a
> > > > user
> > > > actually needs to access such a file the nodecompose mount option
> > > > should
> > > > be enough.
> > > > 
> > > > Reported-by: Ting-Chang Hou <tchou@synology.com>
> > > > Tested-by: Ting-Chang Hou <tchou@synology.com>
> > > > Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.
> > > > com>
> > > > ---
> > > >  fs/hfsplus/unicode.c | 62
> > > > +++++++++++++++++++++++++++++++++++++++++++++++-----
> > > >  1 file changed, 56 insertions(+), 6 deletions(-)
> > > > 
> > > > diff --git a/fs/hfsplus/unicode.c b/fs/hfsplus/unicode.c
> > > > index dfa90c2..7ceae91 100644
> > > > --- a/fs/hfsplus/unicode.c
> > > > +++ b/fs/hfsplus/unicode.c
> > > > @@ -272,8 +272,8 @@ static inline int asc2unichar(struct
> > > > super_block *sb, const char *astr, int len,
> > > >  	return size;
> > > >  }
> > > >  
> > > > -/* Decomposes a single unicode character. */
> > > > -static inline u16 *decompose_unichar(wchar_t uc, int *size)
> > > > +/* Decomposes a non-Hangul unicode character. */
> > > > +static u16 *decompose_nonhangul(wchar_t uc, int *size)
> > > >  {
> > > >  	int off;
> > > >  
> > > > @@ -296,6 +296,51 @@ static inline u16 *decompose_unichar(wchar_t
> > > > uc, int *size)
> > > >  	return hfsplus_decompose_table + (off / 4);
> > > >  }
> > > >  
> > > > +/*
> > > > + * Decomposes a Hangul unicode character.
> > > > + *
> > > > + * This function was adapted from sample code in section 3.12 of
> > > > the
> > > > + * Unicode Standard, version 10.0.
> > > > + *
> > > > + * Copyright (C) 1991-2017 Unicode, Inc.  All rights
> > > > reserved.  Distributed
> > > > + * under the Terms of Use in http://www.unicode.org/copyright.ht
> > > > ml.
> > > > + */
> > > > +static int decompose_hangul(wchar_t uc, u16 *result)
> > > > +{
> > > > +	int index;
> > > > +	int l, v, t;
> > > > +
> > > > +	index = uc - Hangul_SBase;
> > > > +	if (index < 0 || index >= Hangul_SCount)
> > > > +		return 0;
> > > > +
> > > > +	l = Hangul_LBase + index / Hangul_NCount;
> > > > +	v = Hangul_VBase + (index % Hangul_NCount) /
> > > > Hangul_TCount;
> > > > +	t = Hangul_TBase + index % Hangul_TCount;
> > > > +
> > > > +	result[0] = l;
> > > > +	result[1] = v;
> > > > +	if (t != Hangul_TBase) {
> > > > +		result[2] = t;
> > > > +		return 3;
> > > > +	}
> > > > +	return 2;
> > > > +}
> > > > +
> > > > +/* Decomposes a single unicode character. */
> > > > +static u16 *decompose_unichar(wchar_t uc, int *size, u16
> > > > *hangul_buffer)
> > > > +{
> > > > +	u16 *result;
> > > > +
> > > > +	/* Hangul is handled separately */
> > > > +	result = hangul_buffer;
> > > > +	*size = decompose_hangul(uc, result);
> > > Is it possible to distinguish the different cases beforehand and to
> > > select one proper function for execution without the execution both
> > > ones
> > > in the worst case?
> > The decompose_hangul and decompose_nonhangul functions are called
> > only once,
> > so I would expect the compiler to inline them both. The result should
> > be
> > similar to your suggestion.
> > 
> 
> 
> It will be much better to have:
> 
> If (hangful) {
>     <decompose_hangul>
> } else {
>     <decompose_nonhangul>
> }
> 
> Currently, the code looks weird. Especially, decompose_hangul() is able
> to distinguish two different cases. So, it's easy to extract the check.

OK, your previous objection seemed to be about efficiency, not style.

Moving the Hangul check outside the function would require a reader of
decompose_unichar() to figure out what Hangul_SBase and Hangul_SCount
are, when they may not be interested in the Hangul case at all. My
version may be weird but I think it's easier to understand.

Keep in mind that the old decompose_unichar function was also checking
if the character was Hangul, and doing nothing if that was the case. So
this is not a big departure from the existing code.

Finally, there is the issue of the license; I don't know if I should be
splitting decompose_hangul() since it would make it harder to give
attribution.

> > > 
> > > > 
> > > > +	if (*size == 0)
> > > > +		/* not Hangul */
> > > > +		result = decompose_nonhangul(uc, size);
> > > I think it makes sense to use the brackets here. If I remember
> > > correctly, this is the code style's requirement.
> > I couldn't find such a requirement, but I did find some other places
> > within
> > the hfsplus code where the comments are done my way. For example, in
> > the
> > hfsplus_cat_write_inode() function of the inode.c file.
> > 
> 
> 
> The hfsplus code contains many awful places.

I'd say most of the older code is decent and easy to understand.

> It doesn't mean that it
> needs to follow the bad style examples. If you have only one string:
> 
> if (condition)
>     <do_something>
> 
> then it's OK not to use the brackets. Otherwise, you should include the
> branch into the brackets.

I looked around outside of hfsplus, and it seems that both styles are
in use. The most charming example I found was in the rb_next_postorder()
function of lib/rbtree.c, where one branch of a conditional uses your
style and the other uses mine.

If you can actually find the place in the syle guide where this is
required then I will make the change. Otherwise I would prefer to leave
it as it is.

> Thanks,
> Vyacheslav Dubeyko.
> 
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] hfsplus: fix decomposition of Hangul characters
  2017-11-28 16:15 ` Hin-Tak Leung
@ 2017-11-28 17:50   ` Viacheslav Dubeyko
  2017-11-28 18:51   ` Ernesto A. Fernández
  1 sibling, 0 replies; 19+ messages in thread
From: Viacheslav Dubeyko @ 2017-11-28 17:50 UTC (permalink / raw)
  To: Hin-Tak Leung, Ernesto A. Fernández
  Cc: linux-fsdevel, tchou, Alexander Viro, Andrew Morton

On Tue, 2017-11-28 at 16:15 +0000, Hin-Tak Leung wrote:
> --------------------------------------------
> On Tue, 28/11/17, Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.c
> om> wrote:
> 
> > 
> > The algorithm is very simple, the best way to
> > understand it is just
> > looking at the code. I
> > don't know the first thing about Korean writing, so
> > I don't think I should attempt to explain
> > why the decomposition is done
> > this way. If
> > somebody else is interested in the details, they can
> > follow
> > the citation in the header comment of
> > the decompose_hangul function.
>  
> Apologies for coming into this a bit late.
> 
> A couple of points:
> 
> 1. Hangul canonical composition and decomposition is a separate topic
> from compositions of latin characters with accents. It is described
> in 
> 
>  http://www.unicode.org/reports/tr15/tr15-18.html#Hangul
> 
> among other sources.
> 
> 2. I think the mount option is a bit of a red-herring. I think we
> should just do what Mac OS X does - I think in the tech note it says
> something about storing things always in the decomposed form or
> composed form. Ideally we should make the differing mount options no-
> ops. Mac OS X does not need extra mount options, we shouldn't either.

Do you mean the using of textEncoding field in the Volume Header or/and
textEncoding in the Ctalog Folder's records? But I am not completelely
sure what flag can be Hangul related. Do you have any hints?

Thanks,
Vyacheslav Dubeyko.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] hfsplus: fix decomposition of Hangul characters
  2017-11-28 15:02                 ` Ernesto A. Fernández
@ 2017-11-28 16:30                   ` Viacheslav Dubeyko
  2017-11-28 18:15                     ` Ernesto A. Fernández
  0 siblings, 1 reply; 19+ messages in thread
From: Viacheslav Dubeyko @ 2017-11-28 16:30 UTC (permalink / raw)
  To: Ernesto A. Fernández
  Cc: linux-fsdevel, tchou, htl10, Alexander Viro, Andrew Morton

On Tue, 2017-11-28 at 12:02 -0300, Ernesto A. Fernández wrote:
> On Mon, Nov 27, 2017 at 02:40:53PM -0800, Viacheslav Dubeyko wrote:
> > 
> > On Mon, 2017-11-27 at 16:36 -0300, Ernesto A. Fernández wrote:
> > > 
> > > It would be good to have some feedback on the use of code under
> > > the
> > > Unicode License, I don't know if that can be a problem.
> > > 
> > > To Ting-Chang Hou: This patch is slightly different from the one
> > > you
> > > tested with your mac. You are still tagged as Tested-by, but you
> > > may
> > > want to check again that I have not broken anything.
> > > 
> > > Thanks,
> > > Ernest
> > > 
> > > -- >8 --
> > > Subject: [PATCH] hfsplus: fix decomposition of Hangul characters
> > > 
> > > Files created under macOS become essentially unusable under linux
> > > if
> > > their names contain any Hangul, and vice versa.
> > > 
> > > This happens because the normalization of Hangul characters is a
> > > special
> > > case: it can be done algorithmically, without need of a table.
> > > The
> > > hfsplus module deals with Hangul correctly when composing, but
> > > completely forgets about it when performing a decomposition.
> > > 
> > > Solve this by using the Hangul decomposition function provided in
> > > the
> > > Unicode Standard. It's under the Unicode License, compatible with
> > > the
> > > GPL.
> > > 
> > I believe it makes sense to share more details about algorithm
> > details.
> > I think that the Unicode standard contains some explanation how to
> > decompose correctly. It will be great to have some brief citation
> > in the
> > patch description.
> The algorithm is very simple, the best way to understand it is just
> looking at the code. I don't know the first thing about Korean
> writing, so
> I don't think I should attempt to explain why the decomposition is
> done
> this way. If somebody else is interested in the details, they can
> follow
> the citation in the header comment of the decompose_hangul function.
> 

You should share the link where it can be found. Finally, the comment
needs in more detailed explanation.


> > 
> > > 
> > > This patch will cause trouble for Hangul filenames already
> > > created by
> > > the module in the past. This shouldn't be a special concern
> > > because the
> > > main purpose of the module was always sharing with macOS. If a
> > > user
> > > actually needs to access such a file the nodecompose mount option
> > > should
> > > be enough.
> > > 
> > > Reported-by: Ting-Chang Hou <tchou@synology.com>
> > > Tested-by: Ting-Chang Hou <tchou@synology.com>
> > > Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.
> > > com>
> > > ---
> > >  fs/hfsplus/unicode.c | 62
> > > +++++++++++++++++++++++++++++++++++++++++++++++-----
> > >  1 file changed, 56 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/fs/hfsplus/unicode.c b/fs/hfsplus/unicode.c
> > > index dfa90c2..7ceae91 100644
> > > --- a/fs/hfsplus/unicode.c
> > > +++ b/fs/hfsplus/unicode.c
> > > @@ -272,8 +272,8 @@ static inline int asc2unichar(struct
> > > super_block *sb, const char *astr, int len,
> > >  	return size;
> > >  }
> > >  
> > > -/* Decomposes a single unicode character. */
> > > -static inline u16 *decompose_unichar(wchar_t uc, int *size)
> > > +/* Decomposes a non-Hangul unicode character. */
> > > +static u16 *decompose_nonhangul(wchar_t uc, int *size)
> > >  {
> > >  	int off;
> > >  
> > > @@ -296,6 +296,51 @@ static inline u16 *decompose_unichar(wchar_t
> > > uc, int *size)
> > >  	return hfsplus_decompose_table + (off / 4);
> > >  }
> > >  
> > > +/*
> > > + * Decomposes a Hangul unicode character.
> > > + *
> > > + * This function was adapted from sample code in section 3.12 of
> > > the
> > > + * Unicode Standard, version 10.0.
> > > + *
> > > + * Copyright (C) 1991-2017 Unicode, Inc.  All rights
> > > reserved.  Distributed
> > > + * under the Terms of Use in http://www.unicode.org/copyright.ht
> > > ml.
> > > + */
> > > +static int decompose_hangul(wchar_t uc, u16 *result)
> > > +{
> > > +	int index;
> > > +	int l, v, t;
> > > +
> > > +	index = uc - Hangul_SBase;
> > > +	if (index < 0 || index >= Hangul_SCount)
> > > +		return 0;
> > > +
> > > +	l = Hangul_LBase + index / Hangul_NCount;
> > > +	v = Hangul_VBase + (index % Hangul_NCount) /
> > > Hangul_TCount;
> > > +	t = Hangul_TBase + index % Hangul_TCount;
> > > +
> > > +	result[0] = l;
> > > +	result[1] = v;
> > > +	if (t != Hangul_TBase) {
> > > +		result[2] = t;
> > > +		return 3;
> > > +	}
> > > +	return 2;
> > > +}
> > > +
> > > +/* Decomposes a single unicode character. */
> > > +static u16 *decompose_unichar(wchar_t uc, int *size, u16
> > > *hangul_buffer)
> > > +{
> > > +	u16 *result;
> > > +
> > > +	/* Hangul is handled separately */
> > > +	result = hangul_buffer;
> > > +	*size = decompose_hangul(uc, result);
> > Is it possible to distinguish the different cases beforehand and to
> > select one proper function for execution without the execution both
> > ones
> > in the worst case?
> The decompose_hangul and decompose_nonhangul functions are called
> only once,
> so I would expect the compiler to inline them both. The result should
> be
> similar to your suggestion.
> 


It will be much better to have:

If (hangful) {
    <decompose_hangul>
} else {
    <decompose_nonhangul>
}

Currently, the code looks weird. Especially, decompose_hangul() is able
to distinguish two different cases. So, it's easy to extract the check.

> > 
> > > 
> > > +	if (*size == 0)
> > > +		/* not Hangul */
> > > +		result = decompose_nonhangul(uc, size);
> > I think it makes sense to use the brackets here. If I remember
> > correctly, this is the code style's requirement.
> I couldn't find such a requirement, but I did find some other places
> within
> the hfsplus code where the comments are done my way. For example, in
> the
> hfsplus_cat_write_inode() function of the inode.c file.
> 


The hfsplus code contains many awful places. It doesn't mean that it
needs to follow the bad style examples. If you have only one string:

if (condition)
    <do_something>

then it's OK not to use the brackets. Otherwise, you should include the
branch into the brackets.

Thanks,
Vyacheslav Dubeyko.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] hfsplus: fix decomposition of Hangul characters
       [not found] <1999104027.7153972.1511885718712.ref@mail.yahoo.com>
@ 2017-11-28 16:15 ` Hin-Tak Leung
  2017-11-28 17:50   ` Viacheslav Dubeyko
  2017-11-28 18:51   ` Ernesto A. Fernández
  0 siblings, 2 replies; 19+ messages in thread
From: Hin-Tak Leung @ 2017-11-28 16:15 UTC (permalink / raw)
  To: Viacheslav Dubeyko, Ernesto A. Fernández
  Cc: linux-fsdevel, tchou, Alexander Viro, Andrew Morton,
	Ernesto A. Fernández


--------------------------------------------
On Tue, 28/11/17, Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com> wrote:

> The algorithm is very simple, the best way to
> understand it is just
> looking at the code. I
> don't know the first thing about Korean writing, so
> I don't think I should attempt to explain
> why the decomposition is done
> this way. If
> somebody else is interested in the details, they can
> follow
> the citation in the header comment of
> the decompose_hangul function.
 
Apologies for coming into this a bit late.

A couple of points:

1. Hangul canonical composition and decomposition is a separate topic from compositions of latin characters with accents. It is described in 

 http://www.unicode.org/reports/tr15/tr15-18.html#Hangul

among other sources.

2. I think the mount option is a bit of a red-herring. I think we should just do what Mac OS X does - I think in the tech note it says something about storing things always in the decomposed form or composed form. Ideally we should make the differing mount options no-ops. Mac OS X does not need extra mount options, we shouldn't either.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] hfsplus: fix decomposition of Hangul characters
  2017-11-27 22:40               ` Viacheslav Dubeyko
@ 2017-11-28 15:02                 ` Ernesto A. Fernández
  2017-11-28 16:30                   ` Viacheslav Dubeyko
  0 siblings, 1 reply; 19+ messages in thread
From: Ernesto A. Fernández @ 2017-11-28 15:02 UTC (permalink / raw)
  To: Viacheslav Dubeyko
  Cc: linux-fsdevel, tchou, htl10, Alexander Viro, Andrew Morton,
	Ernesto A. Fernández

On Mon, Nov 27, 2017 at 02:40:53PM -0800, Viacheslav Dubeyko wrote:
> On Mon, 2017-11-27 at 16:36 -0300, Ernesto A. Fernández wrote:
> > It would be good to have some feedback on the use of code under the
> > Unicode License, I don't know if that can be a problem.
> > 
> > To Ting-Chang Hou: This patch is slightly different from the one you
> > tested with your mac. You are still tagged as Tested-by, but you may
> > want to check again that I have not broken anything.
> > 
> > Thanks,
> > Ernest
> > 
> > -- >8 --
> > Subject: [PATCH] hfsplus: fix decomposition of Hangul characters
> > 
> > Files created under macOS become essentially unusable under linux if
> > their names contain any Hangul, and vice versa.
> > 
> > This happens because the normalization of Hangul characters is a special
> > case: it can be done algorithmically, without need of a table. The
> > hfsplus module deals with Hangul correctly when composing, but
> > completely forgets about it when performing a decomposition.
> > 
> > Solve this by using the Hangul decomposition function provided in the
> > Unicode Standard. It's under the Unicode License, compatible with the
> > GPL.
> > 
> 
> I believe it makes sense to share more details about algorithm details.
> I think that the Unicode standard contains some explanation how to
> decompose correctly. It will be great to have some brief citation in the
> patch description.

The algorithm is very simple, the best way to understand it is just
looking at the code. I don't know the first thing about Korean writing, so
I don't think I should attempt to explain why the decomposition is done
this way. If somebody else is interested in the details, they can follow
the citation in the header comment of the decompose_hangul function.

> > This patch will cause trouble for Hangul filenames already created by
> > the module in the past. This shouldn't be a special concern because the
> > main purpose of the module was always sharing with macOS. If a user
> > actually needs to access such a file the nodecompose mount option should
> > be enough.
> > 
> > Reported-by: Ting-Chang Hou <tchou@synology.com>
> > Tested-by: Ting-Chang Hou <tchou@synology.com>
> > Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
> > ---
> >  fs/hfsplus/unicode.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++-----
> >  1 file changed, 56 insertions(+), 6 deletions(-)
> > 
> > diff --git a/fs/hfsplus/unicode.c b/fs/hfsplus/unicode.c
> > index dfa90c2..7ceae91 100644
> > --- a/fs/hfsplus/unicode.c
> > +++ b/fs/hfsplus/unicode.c
> > @@ -272,8 +272,8 @@ static inline int asc2unichar(struct super_block *sb, const char *astr, int len,
> >  	return size;
> >  }
> >  
> > -/* Decomposes a single unicode character. */
> > -static inline u16 *decompose_unichar(wchar_t uc, int *size)
> > +/* Decomposes a non-Hangul unicode character. */
> > +static u16 *decompose_nonhangul(wchar_t uc, int *size)
> >  {
> >  	int off;
> >  
> > @@ -296,6 +296,51 @@ static inline u16 *decompose_unichar(wchar_t uc, int *size)
> >  	return hfsplus_decompose_table + (off / 4);
> >  }
> >  
> > +/*
> > + * Decomposes a Hangul unicode character.
> > + *
> > + * This function was adapted from sample code in section 3.12 of the
> > + * Unicode Standard, version 10.0.
> > + *
> > + * Copyright (C) 1991-2017 Unicode, Inc.  All rights reserved.  Distributed
> > + * under the Terms of Use in http://www.unicode.org/copyright.html.
> > + */
> > +static int decompose_hangul(wchar_t uc, u16 *result)
> > +{
> > +	int index;
> > +	int l, v, t;
> > +
> > +	index = uc - Hangul_SBase;
> > +	if (index < 0 || index >= Hangul_SCount)
> > +		return 0;
> > +
> > +	l = Hangul_LBase + index / Hangul_NCount;
> > +	v = Hangul_VBase + (index % Hangul_NCount) / Hangul_TCount;
> > +	t = Hangul_TBase + index % Hangul_TCount;
> > +
> > +	result[0] = l;
> > +	result[1] = v;
> > +	if (t != Hangul_TBase) {
> > +		result[2] = t;
> > +		return 3;
> > +	}
> > +	return 2;
> > +}
> > +
> > +/* Decomposes a single unicode character. */
> > +static u16 *decompose_unichar(wchar_t uc, int *size, u16 *hangul_buffer)
> > +{
> > +	u16 *result;
> > +
> > +	/* Hangul is handled separately */
> > +	result = hangul_buffer;
> > +	*size = decompose_hangul(uc, result);
> 
> Is it possible to distinguish the different cases beforehand and to
> select one proper function for execution without the execution both ones
> in the worst case?

The decompose_hangul and decompose_nonhangul functions are called only once,
so I would expect the compiler to inline them both. The result should be
similar to your suggestion.

> > +	if (*size == 0)
> > +		/* not Hangul */
> > +		result = decompose_nonhangul(uc, size);
> 
> I think it makes sense to use the brackets here. If I remember
> correctly, this is the code style's requirement.

I couldn't find such a requirement, but I did find some other places within
the hfsplus code where the comments are done my way. For example, in the
hfsplus_cat_write_inode() function of the inode.c file.

> Thanks,
> Vyacheslav Dubeyko.

Thanks,
Ernest
 
> > +	return result;
> > +}
> > +
> >  int hfsplus_asc2uni(struct super_block *sb,
> >  		    struct hfsplus_unistr *ustr, int max_unistr_len,
> >  		    const char *astr, int len)
> > @@ -303,13 +348,14 @@ int hfsplus_asc2uni(struct super_block *sb,
> >  	int size, dsize, decompose;
> >  	u16 *dstr, outlen = 0;
> >  	wchar_t c;
> > +	u16 dhangul[3];
> >  
> >  	decompose = !test_bit(HFSPLUS_SB_NODECOMPOSE, &HFSPLUS_SB(sb)->flags);
> >  	while (outlen < max_unistr_len && len > 0) {
> >  		size = asc2unichar(sb, astr, len, &c);
> >  
> >  		if (decompose)
> > -			dstr = decompose_unichar(c, &dsize);
> > +			dstr = decompose_unichar(c, &dsize, &dhangul[0]);
> >  		else
> >  			dstr = NULL;
> >  		if (dstr) {
> > @@ -344,6 +390,7 @@ int hfsplus_hash_dentry(const struct dentry *dentry, struct qstr *str)
> >  	unsigned long hash;
> >  	wchar_t c;
> >  	u16 c2;
> > +	u16 dhangul[3];
> >  
> >  	casefold = test_bit(HFSPLUS_SB_CASEFOLD, &HFSPLUS_SB(sb)->flags);
> >  	decompose = !test_bit(HFSPLUS_SB_NODECOMPOSE, &HFSPLUS_SB(sb)->flags);
> > @@ -357,7 +404,7 @@ int hfsplus_hash_dentry(const struct dentry *dentry, struct qstr *str)
> >  		len -= size;
> >  
> >  		if (decompose)
> > -			dstr = decompose_unichar(c, &dsize);
> > +			dstr = decompose_unichar(c, &dsize, &dhangul[0]);
> >  		else
> >  			dstr = NULL;
> >  		if (dstr) {
> > @@ -396,6 +443,7 @@ int hfsplus_compare_dentry(const struct dentry *dentry,
> >  	const char *astr1, *astr2;
> >  	u16 c1, c2;
> >  	wchar_t c;
> > +	u16 dhangul_1[3], dhangul_2[3];
> >  
> >  	casefold = test_bit(HFSPLUS_SB_CASEFOLD, &HFSPLUS_SB(sb)->flags);
> >  	decompose = !test_bit(HFSPLUS_SB_NODECOMPOSE, &HFSPLUS_SB(sb)->flags);
> > @@ -413,7 +461,8 @@ int hfsplus_compare_dentry(const struct dentry *dentry,
> >  			len1 -= size;
> >  
> >  			if (decompose)
> > -				dstr1 = decompose_unichar(c, &dsize1);
> > +				dstr1 = decompose_unichar(c, &dsize1,
> > +							  &dhangul_1[0]);
> >  			if (!decompose || !dstr1) {
> >  				c1 = c;
> >  				dstr1 = &c1;
> > @@ -427,7 +476,8 @@ int hfsplus_compare_dentry(const struct dentry *dentry,
> >  			len2 -= size;
> >  
> >  			if (decompose)
> > -				dstr2 = decompose_unichar(c, &dsize2);
> > +				dstr2 = decompose_unichar(c, &dsize2,
> > +							  &dhangul_2[0]);
> >  			if (!decompose || !dstr2) {
> >  				c2 = c;
> >  				dstr2 = &c2;
> 
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] hfsplus: fix decomposition of Hangul characters
  2017-11-27 19:36             ` [PATCH] hfsplus: fix decomposition of Hangul characters Ernesto A. Fernández
@ 2017-11-27 22:40               ` Viacheslav Dubeyko
  2017-11-28 15:02                 ` Ernesto A. Fernández
  2018-08-23 18:29               ` Ernesto A. Fernández
  1 sibling, 1 reply; 19+ messages in thread
From: Viacheslav Dubeyko @ 2017-11-27 22:40 UTC (permalink / raw)
  To: Ernesto A. Fernández
  Cc: linux-fsdevel, tchou, htl10, Alexander Viro, Andrew Morton

On Mon, 2017-11-27 at 16:36 -0300, Ernesto A. Fernández wrote:
> It would be good to have some feedback on the use of code under the
> Unicode License, I don't know if that can be a problem.
> 
> To Ting-Chang Hou: This patch is slightly different from the one you
> tested with your mac. You are still tagged as Tested-by, but you may
> want to check again that I have not broken anything.
> 
> Thanks,
> Ernest
> 
> -- >8 --
> Subject: [PATCH] hfsplus: fix decomposition of Hangul characters
> 
> Files created under macOS become essentially unusable under linux if
> their names contain any Hangul, and vice versa.
> 
> This happens because the normalization of Hangul characters is a special
> case: it can be done algorithmically, without need of a table. The
> hfsplus module deals with Hangul correctly when composing, but
> completely forgets about it when performing a decomposition.
> 
> Solve this by using the Hangul decomposition function provided in the
> Unicode Standard. It's under the Unicode License, compatible with the
> GPL.
> 

I believe it makes sense to share more details about algorithm details.
I think that the Unicode standard contains some explanation how to
decompose correctly. It will be great to have some brief citation in the
patch description.

> This patch will cause trouble for Hangul filenames already created by
> the module in the past. This shouldn't be a special concern because the
> main purpose of the module was always sharing with macOS. If a user
> actually needs to access such a file the nodecompose mount option should
> be enough.
> 
> Reported-by: Ting-Chang Hou <tchou@synology.com>
> Tested-by: Ting-Chang Hou <tchou@synology.com>
> Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
> ---
>  fs/hfsplus/unicode.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 56 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/hfsplus/unicode.c b/fs/hfsplus/unicode.c
> index dfa90c2..7ceae91 100644
> --- a/fs/hfsplus/unicode.c
> +++ b/fs/hfsplus/unicode.c
> @@ -272,8 +272,8 @@ static inline int asc2unichar(struct super_block *sb, const char *astr, int len,
>  	return size;
>  }
>  
> -/* Decomposes a single unicode character. */
> -static inline u16 *decompose_unichar(wchar_t uc, int *size)
> +/* Decomposes a non-Hangul unicode character. */
> +static u16 *decompose_nonhangul(wchar_t uc, int *size)
>  {
>  	int off;
>  
> @@ -296,6 +296,51 @@ static inline u16 *decompose_unichar(wchar_t uc, int *size)
>  	return hfsplus_decompose_table + (off / 4);
>  }
>  
> +/*
> + * Decomposes a Hangul unicode character.
> + *
> + * This function was adapted from sample code in section 3.12 of the
> + * Unicode Standard, version 10.0.
> + *
> + * Copyright (C) 1991-2017 Unicode, Inc.  All rights reserved.  Distributed
> + * under the Terms of Use in http://www.unicode.org/copyright.html.
> + */
> +static int decompose_hangul(wchar_t uc, u16 *result)
> +{
> +	int index;
> +	int l, v, t;
> +
> +	index = uc - Hangul_SBase;
> +	if (index < 0 || index >= Hangul_SCount)
> +		return 0;
> +
> +	l = Hangul_LBase + index / Hangul_NCount;
> +	v = Hangul_VBase + (index % Hangul_NCount) / Hangul_TCount;
> +	t = Hangul_TBase + index % Hangul_TCount;
> +
> +	result[0] = l;
> +	result[1] = v;
> +	if (t != Hangul_TBase) {
> +		result[2] = t;
> +		return 3;
> +	}
> +	return 2;
> +}
> +
> +/* Decomposes a single unicode character. */
> +static u16 *decompose_unichar(wchar_t uc, int *size, u16 *hangul_buffer)
> +{
> +	u16 *result;
> +
> +	/* Hangul is handled separately */
> +	result = hangul_buffer;
> +	*size = decompose_hangul(uc, result);

Is it possible to distinguish the different cases beforehand and to
select one proper function for execution without the execution both ones
in the worst case?

> +	if (*size == 0)
> +		/* not Hangul */
> +		result = decompose_nonhangul(uc, size);

I think it makes sense to use the brackets here. If I remember
correctly, this is the code style's requirement.

Thanks,
Vyacheslav Dubeyko.

> +	return result;
> +}
> +
>  int hfsplus_asc2uni(struct super_block *sb,
>  		    struct hfsplus_unistr *ustr, int max_unistr_len,
>  		    const char *astr, int len)
> @@ -303,13 +348,14 @@ int hfsplus_asc2uni(struct super_block *sb,
>  	int size, dsize, decompose;
>  	u16 *dstr, outlen = 0;
>  	wchar_t c;
> +	u16 dhangul[3];
>  
>  	decompose = !test_bit(HFSPLUS_SB_NODECOMPOSE, &HFSPLUS_SB(sb)->flags);
>  	while (outlen < max_unistr_len && len > 0) {
>  		size = asc2unichar(sb, astr, len, &c);
>  
>  		if (decompose)
> -			dstr = decompose_unichar(c, &dsize);
> +			dstr = decompose_unichar(c, &dsize, &dhangul[0]);
>  		else
>  			dstr = NULL;
>  		if (dstr) {
> @@ -344,6 +390,7 @@ int hfsplus_hash_dentry(const struct dentry *dentry, struct qstr *str)
>  	unsigned long hash;
>  	wchar_t c;
>  	u16 c2;
> +	u16 dhangul[3];
>  
>  	casefold = test_bit(HFSPLUS_SB_CASEFOLD, &HFSPLUS_SB(sb)->flags);
>  	decompose = !test_bit(HFSPLUS_SB_NODECOMPOSE, &HFSPLUS_SB(sb)->flags);
> @@ -357,7 +404,7 @@ int hfsplus_hash_dentry(const struct dentry *dentry, struct qstr *str)
>  		len -= size;
>  
>  		if (decompose)
> -			dstr = decompose_unichar(c, &dsize);
> +			dstr = decompose_unichar(c, &dsize, &dhangul[0]);
>  		else
>  			dstr = NULL;
>  		if (dstr) {
> @@ -396,6 +443,7 @@ int hfsplus_compare_dentry(const struct dentry *dentry,
>  	const char *astr1, *astr2;
>  	u16 c1, c2;
>  	wchar_t c;
> +	u16 dhangul_1[3], dhangul_2[3];
>  
>  	casefold = test_bit(HFSPLUS_SB_CASEFOLD, &HFSPLUS_SB(sb)->flags);
>  	decompose = !test_bit(HFSPLUS_SB_NODECOMPOSE, &HFSPLUS_SB(sb)->flags);
> @@ -413,7 +461,8 @@ int hfsplus_compare_dentry(const struct dentry *dentry,
>  			len1 -= size;
>  
>  			if (decompose)
> -				dstr1 = decompose_unichar(c, &dsize1);
> +				dstr1 = decompose_unichar(c, &dsize1,
> +							  &dhangul_1[0]);
>  			if (!decompose || !dstr1) {
>  				c1 = c;
>  				dstr1 = &c1;
> @@ -427,7 +476,8 @@ int hfsplus_compare_dentry(const struct dentry *dentry,
>  			len2 -= size;
>  
>  			if (decompose)
> -				dstr2 = decompose_unichar(c, &dsize2);
> +				dstr2 = decompose_unichar(c, &dsize2,
> +							  &dhangul_2[0]);
>  			if (!decompose || !dstr2) {
>  				c2 = c;
>  				dstr2 = &c2;

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH] hfsplus: fix decomposition of Hangul characters
  2017-11-27  2:07           ` tchou
@ 2017-11-27 19:36             ` Ernesto A. Fernández
  2017-11-27 22:40               ` Viacheslav Dubeyko
  2018-08-23 18:29               ` Ernesto A. Fernández
  0 siblings, 2 replies; 19+ messages in thread
From: Ernesto A. Fernández @ 2017-11-27 19:36 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: tchou, slava, htl10, Alexander Viro, Andrew Morton,
	Ernesto A. Fernández

It would be good to have some feedback on the use of code under the
Unicode License, I don't know if that can be a problem.

To Ting-Chang Hou: This patch is slightly different from the one you
tested with your mac. You are still tagged as Tested-by, but you may
want to check again that I have not broken anything.

Thanks,
Ernest

-- >8 --
Subject: [PATCH] hfsplus: fix decomposition of Hangul characters

Files created under macOS become essentially unusable under linux if
their names contain any Hangul, and vice versa.

This happens because the normalization of Hangul characters is a special
case: it can be done algorithmically, without need of a table. The
hfsplus module deals with Hangul correctly when composing, but
completely forgets about it when performing a decomposition.

Solve this by using the Hangul decomposition function provided in the
Unicode Standard. It's under the Unicode License, compatible with the
GPL.

This patch will cause trouble for Hangul filenames already created by
the module in the past. This shouldn't be a special concern because the
main purpose of the module was always sharing with macOS. If a user
actually needs to access such a file the nodecompose mount option should
be enough.

Reported-by: Ting-Chang Hou <tchou@synology.com>
Tested-by: Ting-Chang Hou <tchou@synology.com>
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
---
 fs/hfsplus/unicode.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 56 insertions(+), 6 deletions(-)

diff --git a/fs/hfsplus/unicode.c b/fs/hfsplus/unicode.c
index dfa90c2..7ceae91 100644
--- a/fs/hfsplus/unicode.c
+++ b/fs/hfsplus/unicode.c
@@ -272,8 +272,8 @@ static inline int asc2unichar(struct super_block *sb, const char *astr, int len,
 	return size;
 }
 
-/* Decomposes a single unicode character. */
-static inline u16 *decompose_unichar(wchar_t uc, int *size)
+/* Decomposes a non-Hangul unicode character. */
+static u16 *decompose_nonhangul(wchar_t uc, int *size)
 {
 	int off;
 
@@ -296,6 +296,51 @@ static inline u16 *decompose_unichar(wchar_t uc, int *size)
 	return hfsplus_decompose_table + (off / 4);
 }
 
+/*
+ * Decomposes a Hangul unicode character.
+ *
+ * This function was adapted from sample code in section 3.12 of the
+ * Unicode Standard, version 10.0.
+ *
+ * Copyright (C) 1991-2017 Unicode, Inc.  All rights reserved.  Distributed
+ * under the Terms of Use in http://www.unicode.org/copyright.html.
+ */
+static int decompose_hangul(wchar_t uc, u16 *result)
+{
+	int index;
+	int l, v, t;
+
+	index = uc - Hangul_SBase;
+	if (index < 0 || index >= Hangul_SCount)
+		return 0;
+
+	l = Hangul_LBase + index / Hangul_NCount;
+	v = Hangul_VBase + (index % Hangul_NCount) / Hangul_TCount;
+	t = Hangul_TBase + index % Hangul_TCount;
+
+	result[0] = l;
+	result[1] = v;
+	if (t != Hangul_TBase) {
+		result[2] = t;
+		return 3;
+	}
+	return 2;
+}
+
+/* Decomposes a single unicode character. */
+static u16 *decompose_unichar(wchar_t uc, int *size, u16 *hangul_buffer)
+{
+	u16 *result;
+
+	/* Hangul is handled separately */
+	result = hangul_buffer;
+	*size = decompose_hangul(uc, result);
+	if (*size == 0)
+		/* not Hangul */
+		result = decompose_nonhangul(uc, size);
+	return result;
+}
+
 int hfsplus_asc2uni(struct super_block *sb,
 		    struct hfsplus_unistr *ustr, int max_unistr_len,
 		    const char *astr, int len)
@@ -303,13 +348,14 @@ int hfsplus_asc2uni(struct super_block *sb,
 	int size, dsize, decompose;
 	u16 *dstr, outlen = 0;
 	wchar_t c;
+	u16 dhangul[3];
 
 	decompose = !test_bit(HFSPLUS_SB_NODECOMPOSE, &HFSPLUS_SB(sb)->flags);
 	while (outlen < max_unistr_len && len > 0) {
 		size = asc2unichar(sb, astr, len, &c);
 
 		if (decompose)
-			dstr = decompose_unichar(c, &dsize);
+			dstr = decompose_unichar(c, &dsize, &dhangul[0]);
 		else
 			dstr = NULL;
 		if (dstr) {
@@ -344,6 +390,7 @@ int hfsplus_hash_dentry(const struct dentry *dentry, struct qstr *str)
 	unsigned long hash;
 	wchar_t c;
 	u16 c2;
+	u16 dhangul[3];
 
 	casefold = test_bit(HFSPLUS_SB_CASEFOLD, &HFSPLUS_SB(sb)->flags);
 	decompose = !test_bit(HFSPLUS_SB_NODECOMPOSE, &HFSPLUS_SB(sb)->flags);
@@ -357,7 +404,7 @@ int hfsplus_hash_dentry(const struct dentry *dentry, struct qstr *str)
 		len -= size;
 
 		if (decompose)
-			dstr = decompose_unichar(c, &dsize);
+			dstr = decompose_unichar(c, &dsize, &dhangul[0]);
 		else
 			dstr = NULL;
 		if (dstr) {
@@ -396,6 +443,7 @@ int hfsplus_compare_dentry(const struct dentry *dentry,
 	const char *astr1, *astr2;
 	u16 c1, c2;
 	wchar_t c;
+	u16 dhangul_1[3], dhangul_2[3];
 
 	casefold = test_bit(HFSPLUS_SB_CASEFOLD, &HFSPLUS_SB(sb)->flags);
 	decompose = !test_bit(HFSPLUS_SB_NODECOMPOSE, &HFSPLUS_SB(sb)->flags);
@@ -413,7 +461,8 @@ int hfsplus_compare_dentry(const struct dentry *dentry,
 			len1 -= size;
 
 			if (decompose)
-				dstr1 = decompose_unichar(c, &dsize1);
+				dstr1 = decompose_unichar(c, &dsize1,
+							  &dhangul_1[0]);
 			if (!decompose || !dstr1) {
 				c1 = c;
 				dstr1 = &c1;
@@ -427,7 +476,8 @@ int hfsplus_compare_dentry(const struct dentry *dentry,
 			len2 -= size;
 
 			if (decompose)
-				dstr2 = decompose_unichar(c, &dsize2);
+				dstr2 = decompose_unichar(c, &dsize2,
+							  &dhangul_2[0]);
 			if (!decompose || !dstr2) {
 				c2 = c;
 				dstr2 = &c2;
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2018-08-24  4:53 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <114108759.7167494.1511886580033.ref@mail.yahoo.com>
2017-11-28 16:29 ` [PATCH] hfsplus: fix decomposition of Hangul characters Hin-Tak Leung
2018-07-17 22:09 Ernesto A. Fernández
     [not found] <748914583.7912009.1511948801051.ref@mail.yahoo.com>
2017-11-29  9:46 ` Hin-Tak Leung
     [not found] <1691800505.7838608.1511945866988.ref@mail.yahoo.com>
2017-11-29  8:57 ` Hin-Tak Leung
2017-11-29 14:42   ` Ernesto A. Fernández
2017-11-29 16:28     ` Viacheslav Dubeyko
2017-11-30  4:03       ` Ernesto A. Fernández
     [not found] <1047295440.7535049.1511905720846.ref@mail.yahoo.com>
2017-11-28 21:48 ` Hin-Tak Leung
2017-11-28 23:10   ` Ernesto A. Fernández
     [not found] <1999104027.7153972.1511885718712.ref@mail.yahoo.com>
2017-11-28 16:15 ` Hin-Tak Leung
2017-11-28 17:50   ` Viacheslav Dubeyko
2017-11-28 18:51   ` Ernesto A. Fernández
  -- strict thread matches above, loose matches on Subject: below --
2017-11-17  8:20 [PATCH] hfsplus: fix the bug that cannot recognize files with hangul file name Ting-Chang Hou
2017-11-19  0:57 ` Ernesto A. Fernández
2017-11-23  3:57   ` tchou
2017-11-23 11:32     ` Ernesto A. Fernández
2017-11-24  7:25       ` tchou
2017-11-24 11:45         ` Ernesto A. Fernández
2017-11-27  2:07           ` tchou
2017-11-27 19:36             ` [PATCH] hfsplus: fix decomposition of Hangul characters Ernesto A. Fernández
2017-11-27 22:40               ` Viacheslav Dubeyko
2017-11-28 15:02                 ` Ernesto A. Fernández
2017-11-28 16:30                   ` Viacheslav Dubeyko
2017-11-28 18:15                     ` Ernesto A. Fernández
2018-08-23 18:29               ` Ernesto A. Fernández
2018-08-24  1:20                 ` tchou

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).