Re: [PATCH] hfsplus: fixes worst-case unicode to char conversion of file names - Hin-Tak Leung

From: Hin-Tak Leung <htl10@users.sourceforge.net>
To: aia21@cam.ac.uk
Cc: linux-fsdevel@vger.kernel.org, slava@dubeyko.com,
	viro@zeniv.linux.org.uk, hch@infradead.org
Subject: Re: [PATCH] hfsplus: fixes worst-case unicode to char conversion of file names
Date: Fri, 4 Apr 2014 23:11:11 +0100 (BST)	[thread overview]
Message-ID: <1396649471.39656.YahooMailBasic@web172304.mail.ir2.yahoo.com> (raw)

------------------------------
On Fri, Apr 4, 2014 10:24 PM BST Anton Altaparmakov wrote:

>Hi,
>
>On 4 Apr 2014, at 20:46, Hin-Tak Leung <hintak.leung@gmail.com> wrote:
>> From: Hin-Tak Leung <htl10@users.sourceforge.net>
>> 
>> The HFS Plus Volume Format specification (TN1150) states that
>> file names are stored internally as a maximum of 255 unicode
>> characters, as defined by The Unicode Standard, Version 2.0
>> [Unicode, Inc. ISBN 0-201-48345-9]. File names are converted by
>> the NLS system on Linux before presented to the user.
>> 
>> Though it is rare, the worst-case is 255 CJK characters converting
>> to UTF-8 with 1 unicode character to 3 bytes. Surrogate pairs are
>> no worse. The receiver buffer needs to be 255 x 3 bytes,
>> not 255 bytes as the code has always been.
>
>You are correct that that buffer is too small.  However:
>
>1) The correct size for the buffer is NLS_MAX_CHARSET_SIZE * HFSPLUS_MAX_STRLEN + 1 and not using a magic constant "3" (which is actually not big enough in case the string is storing UTF-16 rather than UCS-2 Unicode which I have observed happen on NTFS written to by asian versions of Windows but I see no reason why it could not happen on OS X, too, especially on a HFS+ volume that has been written to by a Windows HFS+ driver - even if native OS X driver would not normally do it - I have not looked at it I admit).  That reliable source of information Wikipedia suggests Mac OS X also uses UTF-16 as of OS X 10.3 at least in userspace so chances are it either also uses it in the kernel or if not yet it might well do in future:
>
>    http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings
>
>2) You are now allocating a huge buffer on the stack.  This is not a good thing to do in the kernel (think 4k stack kernel config - that single variable is consuming about a quarter of available stack).  You need to allocate the buffer dynamically.  As you only need to do the allocation on entry to hfsplus_readdir() and deallocate it on exit it is not a problem as it could be if you had to allocate/free for every filename.
>

Hi Anton,

Thanks for the comments. NLS_MAX_CHARSET_SIZE is 6 include/linux/nls.h but I think it is too generous in this case. It is correct that a unicode character needs at worst 6 bytes to code, but those in the upper range of that when encoded in UTF-16 would require a surrogate pair - i.e. it goes from *two* UTF-16 units to 6 bytes. So that's still x3, not x6. Also Unicode 2.0 covers only the first supplementary plane, and only requires up to 4 bytes. So that's what my "Surrogate pairs are no worse." part of the message was about. Please correct me if this reasoning is wrong.

I'll switch to dynamic allocation and prepare a revised patch, after further discussion on the x3 vs x6 issue.

Hin-Tak
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html