All of lore.kernel.org
 help / color / mirror / Atom feed
From: Hin-Tak Leung <htl10@users.sourceforge.net>
To: aia21@cam.ac.uk
Cc: linux-fsdevel@vger.kernel.org, slava@dubeyko.com,
	viro@zeniv.linux.org.uk, hch@infradead.org
Subject: Re: [PATCH] hfsplus: fixes worst-case unicode to char conversion of file names
Date: Fri, 4 Apr 2014 23:11:11 +0100 (BST)	[thread overview]
Message-ID: <1396649471.39656.YahooMailBasic@web172304.mail.ir2.yahoo.com> (raw)

------------------------------
On Fri, Apr 4, 2014 10:24 PM BST Anton Altaparmakov wrote:

>Hi,
>
>On 4 Apr 2014, at 20:46, Hin-Tak Leung <hintak.leung@gmail.com> wrote:
>> From: Hin-Tak Leung <htl10@users.sourceforge.net>
>> 
>> The HFS Plus Volume Format specification (TN1150) states that
>> file names are stored internally as a maximum of 255 unicode
>> characters, as defined by The Unicode Standard, Version 2.0
>> [Unicode, Inc. ISBN 0-201-48345-9]. File names are converted by
>> the NLS system on Linux before presented to the user.
>> 
>> Though it is rare, the worst-case is 255 CJK characters converting
>> to UTF-8 with 1 unicode character to 3 bytes. Surrogate pairs are
>> no worse. The receiver buffer needs to be 255 x 3 bytes,
>> not 255 bytes as the code has always been.
>
>You are correct that that buffer is too small.  However:
>
>1) The correct size for the buffer is NLS_MAX_CHARSET_SIZE * HFSPLUS_MAX_STRLEN + 1 and not using a magic constant "3" (which is actually not big enough in case the string is storing UTF-16 rather than UCS-2 Unicode which I have observed happen on NTFS written to by asian versions of Windows but I see no reason why it could not happen on OS X, too, especially on a HFS+ volume that has been written to by a Windows HFS+ driver - even if native OS X driver would not normally do it - I have not looked at it I admit).  That reliable source of information Wikipedia suggests Mac OS X also uses UTF-16 as of OS X 10.3 at least in userspace so chances are it either also uses it in the kernel or if not yet it might well do in future:
>
>    http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings
>
>2) You are now allocating a huge buffer on the stack.  This is not a good thing to do in the kernel (think 4k stack kernel config - that single variable is consuming about a quarter of available stack).  You need to allocate the buffer dynamically.  As you only need to do the allocation on entry to hfsplus_readdir() and deallocate it on exit it is not a problem as it could be if you had to allocate/free for every filename.
>

Hi Anton,

Thanks for the comments. NLS_MAX_CHARSET_SIZE is 6 include/linux/nls.h but I think it is too generous in this case. It is correct that a unicode character needs at worst 6 bytes to code, but those in the upper range of that when encoded in UTF-16 would require a surrogate pair - i.e. it goes from *two* UTF-16 units to 6 bytes. So that's still x3, not x6. Also Unicode 2.0 covers only the first supplementary plane, and only requires up to 4 bytes. So that's what my "Surrogate pairs are no worse." part of the message was about. Please correct me if this reasoning is wrong.

I'll switch to dynamic allocation and prepare a revised patch, after further discussion on the x3 vs x6 issue.

Hin-Tak
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

             reply	other threads:[~2014-04-04 22:14 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-04-04 22:11 Hin-Tak Leung [this message]
2014-04-05  9:07 ` [PATCH] hfsplus: fixes worst-case unicode to char conversion of file names Vyacheslav Dubeyko
2014-04-05 20:37 ` Anton Altaparmakov
  -- strict thread matches above, loose matches on Subject: below --
2014-04-05 22:09 Hin-Tak Leung
2014-04-04 19:46 Hin-Tak Leung
2014-04-04 21:24 ` Anton Altaparmakov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1396649471.39656.YahooMailBasic@web172304.mail.ir2.yahoo.com \
    --to=htl10@users.sourceforge.net \
    --cc=aia21@cam.ac.uk \
    --cc=hch@infradead.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=slava@dubeyko.com \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.