From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-fsdevel-owner@vger.kernel.org>
Received: from mail-qt0-f175.google.com ([209.85.216.175]:41401 "EHLO
        mail-qt0-f175.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1750904AbdK1SPh (ORCPT
        <rfc822;linux-fsdevel@vger.kernel.org>);
        Tue, 28 Nov 2017 13:15:37 -0500
Received: by mail-qt0-f175.google.com with SMTP id i40so1010925qti.8
        for <linux-fsdevel@vger.kernel.org>; Tue, 28 Nov 2017 10:15:37 -0800 (PST)
Date: Tue, 28 Nov 2017 15:15:31 -0300
From: Ernesto =?utf-8?Q?A=2E_Fern=C3=A1ndez?=
        <ernesto.mnd.fernandez@gmail.com>
To: Viacheslav Dubeyko <slava@dubeyko.com>
Cc: linux-fsdevel@vger.kernel.org, tchou <tchou@synology.com>,
        htl10@users.sourceforge.net,
        Alexander Viro <viro@zeniv.linux.org.uk>,
        Andrew Morton <akpm@linux-foundation.org>,
        Ernesto =?utf-8?Q?A=2E_Fern=C3=A1ndez?=
        <ernesto.mnd.fernandez@gmail.com>
Subject: Re: [PATCH] hfsplus: fix decomposition of Hangul characters
Message-ID: <20171128181530.GA2801@debian.home>
References: <20171119005704.GA3495@debian.home>
 <080024a85dc413b72c181c6e75bdc736@synology.com>
 <20171123113230.GA5581@debian.home>
 <1028889c01cdd513c2cf8cabf6c1762e@synology.com>
 <20171124114525.GA3265@debian.home>
 <510d6873cbcd4d93f0902c62e3a2c22d@synology.com>
 <20171127193652.GA3690@debian.home>
 <1511822453.2830.7.camel@slavad-ubuntu-14.04>
 <20171128150234.GA1268@debian.home>
 <1511886654.2822.6.camel@dubeyko.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <1511886654.2822.6.camel@dubeyko.com>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: <linux-fsdevel.vger.kernel.org>

On Tue, Nov 28, 2017 at 08:30:54AM -0800, Viacheslav Dubeyko wrote:
> On Tue, 2017-11-28 at 12:02 -0300, Ernesto A. Fernández wrote:
> > On Mon, Nov 27, 2017 at 02:40:53PM -0800, Viacheslav Dubeyko wrote:
> > > 
> > > On Mon, 2017-11-27 at 16:36 -0300, Ernesto A. Fernández wrote:
> > > > 
> > > > It would be good to have some feedback on the use of code under
> > > > the
> > > > Unicode License, I don't know if that can be a problem.
> > > > 
> > > > To Ting-Chang Hou: This patch is slightly different from the one
> > > > you
> > > > tested with your mac. You are still tagged as Tested-by, but you
> > > > may
> > > > want to check again that I have not broken anything.
> > > > 
> > > > Thanks,
> > > > Ernest
> > > > 
> > > > -- >8 --
> > > > Subject: [PATCH] hfsplus: fix decomposition of Hangul characters
> > > > 
> > > > Files created under macOS become essentially unusable under linux
> > > > if
> > > > their names contain any Hangul, and vice versa.
> > > > 
> > > > This happens because the normalization of Hangul characters is a
> > > > special
> > > > case: it can be done algorithmically, without need of a table.
> > > > The
> > > > hfsplus module deals with Hangul correctly when composing, but
> > > > completely forgets about it when performing a decomposition.
> > > > 
> > > > Solve this by using the Hangul decomposition function provided in
> > > > the
> > > > Unicode Standard. It's under the Unicode License, compatible with
> > > > the
> > > > GPL.
> > > > 
> > > I believe it makes sense to share more details about algorithm
> > > details.
> > > I think that the Unicode standard contains some explanation how to
> > > decompose correctly. It will be great to have some brief citation
> > > in the
> > > patch description.
> > The algorithm is very simple, the best way to understand it is just
> > looking at the code. I don't know the first thing about Korean
> > writing, so
> > I don't think I should attempt to explain why the decomposition is
> > done
> > this way. If somebody else is interested in the details, they can
> > follow
> > the citation in the header comment of the decompose_hangul function.
> > 
> 
> You should share the link where it can be found.

Is that necessary? I think people can find the Unicode Standard on their
own. Even if I did provide a link, it's a pdf, so people would still need
to look up the right sections. It's not going to save anybody much work.

> Finally, the comment
> needs in more detailed explanation.

Are you talking about the function header comment? I don't know if I
follow. Perhaps I could explain that the return value is the size, but
it seems pretty obvious from the use.

> > > 
> > > > 
> > > > This patch will cause trouble for Hangul filenames already
> > > > created by
> > > > the module in the past. This shouldn't be a special concern
> > > > because the
> > > > main purpose of the module was always sharing with macOS. If a
> > > > user
> > > > actually needs to access such a file the nodecompose mount option
> > > > should
> > > > be enough.
> > > > 
> > > > Reported-by: Ting-Chang Hou <tchou@synology.com>
> > > > Tested-by: Ting-Chang Hou <tchou@synology.com>
> > > > Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.
> > > > com>
> > > > ---
> > > >  fs/hfsplus/unicode.c | 62
> > > > +++++++++++++++++++++++++++++++++++++++++++++++-----
> > > >  1 file changed, 56 insertions(+), 6 deletions(-)
> > > > 
> > > > diff --git a/fs/hfsplus/unicode.c b/fs/hfsplus/unicode.c
> > > > index dfa90c2..7ceae91 100644
> > > > --- a/fs/hfsplus/unicode.c
> > > > +++ b/fs/hfsplus/unicode.c
> > > > @@ -272,8 +272,8 @@ static inline int asc2unichar(struct
> > > > super_block *sb, const char *astr, int len,
> > > >  	return size;
> > > >  }
> > > >  
> > > > -/* Decomposes a single unicode character. */
> > > > -static inline u16 *decompose_unichar(wchar_t uc, int *size)
> > > > +/* Decomposes a non-Hangul unicode character. */
> > > > +static u16 *decompose_nonhangul(wchar_t uc, int *size)
> > > >  {
> > > >  	int off;
> > > >  
> > > > @@ -296,6 +296,51 @@ static inline u16 *decompose_unichar(wchar_t
> > > > uc, int *size)
> > > >  	return hfsplus_decompose_table + (off / 4);
> > > >  }
> > > >  
> > > > +/*
> > > > + * Decomposes a Hangul unicode character.
> > > > + *
> > > > + * This function was adapted from sample code in section 3.12 of
> > > > the
> > > > + * Unicode Standard, version 10.0.
> > > > + *
> > > > + * Copyright (C) 1991-2017 Unicode, Inc.  All rights
> > > > reserved.  Distributed
> > > > + * under the Terms of Use in http://www.unicode.org/copyright.ht
> > > > ml.
> > > > + */
> > > > +static int decompose_hangul(wchar_t uc, u16 *result)
> > > > +{
> > > > +	int index;
> > > > +	int l, v, t;
> > > > +
> > > > +	index = uc - Hangul_SBase;
> > > > +	if (index < 0 || index >= Hangul_SCount)
> > > > +		return 0;
> > > > +
> > > > +	l = Hangul_LBase + index / Hangul_NCount;
> > > > +	v = Hangul_VBase + (index % Hangul_NCount) /
> > > > Hangul_TCount;
> > > > +	t = Hangul_TBase + index % Hangul_TCount;
> > > > +
> > > > +	result[0] = l;
> > > > +	result[1] = v;
> > > > +	if (t != Hangul_TBase) {
> > > > +		result[2] = t;
> > > > +		return 3;
> > > > +	}
> > > > +	return 2;
> > > > +}
> > > > +
> > > > +/* Decomposes a single unicode character. */
> > > > +static u16 *decompose_unichar(wchar_t uc, int *size, u16
> > > > *hangul_buffer)
> > > > +{
> > > > +	u16 *result;
> > > > +
> > > > +	/* Hangul is handled separately */
> > > > +	result = hangul_buffer;
> > > > +	*size = decompose_hangul(uc, result);
> > > Is it possible to distinguish the different cases beforehand and to
> > > select one proper function for execution without the execution both
> > > ones
> > > in the worst case?
> > The decompose_hangul and decompose_nonhangul functions are called
> > only once,
> > so I would expect the compiler to inline them both. The result should
> > be
> > similar to your suggestion.
> > 
> 
> 
> It will be much better to have:
> 
> If (hangful) {
>     <decompose_hangul>
> } else {
>     <decompose_nonhangul>
> }
> 
> Currently, the code looks weird. Especially, decompose_hangul() is able
> to distinguish two different cases. So, it's easy to extract the check.

OK, your previous objection seemed to be about efficiency, not style.

Moving the Hangul check outside the function would require a reader of
decompose_unichar() to figure out what Hangul_SBase and Hangul_SCount
are, when they may not be interested in the Hangul case at all. My
version may be weird but I think it's easier to understand.

Keep in mind that the old decompose_unichar function was also checking
if the character was Hangul, and doing nothing if that was the case. So
this is not a big departure from the existing code.

Finally, there is the issue of the license; I don't know if I should be
splitting decompose_hangul() since it would make it harder to give
attribution.

> > > 
> > > > 
> > > > +	if (*size == 0)
> > > > +		/* not Hangul */
> > > > +		result = decompose_nonhangul(uc, size);
> > > I think it makes sense to use the brackets here. If I remember
> > > correctly, this is the code style's requirement.
> > I couldn't find such a requirement, but I did find some other places
> > within
> > the hfsplus code where the comments are done my way. For example, in
> > the
> > hfsplus_cat_write_inode() function of the inode.c file.
> > 
> 
> 
> The hfsplus code contains many awful places.

I'd say most of the older code is decent and easy to understand.

> It doesn't mean that it
> needs to follow the bad style examples. If you have only one string:
> 
> if (condition)
>     <do_something>
> 
> then it's OK not to use the brackets. Otherwise, you should include the
> branch into the brackets.

I looked around outside of hfsplus, and it seems that both styles are
in use. The most charming example I found was in the rb_next_postorder()
function of lib/rbtree.c, where one branch of a conditional uses your
style and the other uses mine.

If you can actually find the place in the syle guide where this is
required then I will make the change. Otherwise I would prefer to leave
it as it is.

> Thanks,
> Vyacheslav Dubeyko.
> 
>