All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 6/8] Support non-BMP characters in UDF
@ 2012-05-15 23:10 Vladimir 'φ-coder/phcoder' Serbinenko
  2012-05-16 14:34 ` Jan Kara
  0 siblings, 1 reply; 11+ messages in thread
From: Vladimir 'φ-coder/phcoder' Serbinenko @ 2012-05-15 23:10 UTC (permalink / raw)
  To: Jan Kara, linux-kernel, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 2004 bytes --]

I also have a counterpart for mkudffs/udf-tools but sourceforge homepage seems to be abandoned does anybody know if there is a new homepage for mkudffs?

Signed-off-by: Vladimir Serbinenko <phcoder@gmail.com>
---
 fs/udf/unicode.c |   28 +++++++++++++++++++++++-----
 1 file changed, 23 insertions(+), 5 deletions(-)

diff --git a/fs/udf/unicode.c b/fs/udf/unicode.c
index 9b1b2de..2d8cc12 100644
--- a/fs/udf/unicode.c
+++ b/fs/udf/unicode.c
@@ -280,6 +280,14 @@ static int udf_CS0toNLS(struct nls_table *nls, struct ustr *utf_o,
 		if (cmp_id == 16)
 			c = (c << 8) | ocu[i++];
 
+		if (cmp_id == 16 && (c & 0xfc00) == 0xd800
+		    && i + 1 < ocu_len && ((ocu[i] & 0xfc) == 0xdc)) {
+			uint16_t l;
+			l = ocu[i++] << 8;
+			l |= ocu[i++];
+			c = (((c & 0x3ff) << 10) | (l & 0x3ff)) + 0x10000;
+		}
+
 		len = nls->uni2char(c, &utf_o->u_name[utf_o->u_len],
 				    UDF_NAME_LEN - utf_o->u_len);
 		/* Valid character? */
@@ -312,20 +320,30 @@ try_again:
 		if (!len)
 			continue;
 		/* Invalid character, deal with it */
-		if (len < 0 || uni_char > 0xffff) {
+		if (len < 0 || uni_char > 0x10ffff) {
 			len = 1;
 			uni_char = '?';
 		}
 
 		if (uni_char > max_val) {
-			max_val = 0xffffU;
+			max_val = 0x10ffffU;
 			ocu[0] = (uint8_t)0x10U;
 			goto try_again;
 		}
 
-		if (max_val == 0xffffU)
-			ocu[++u_len] = (uint8_t)(uni_char >> 8);
-		ocu[++u_len] = (uint8_t)(uni_char & 0xffU);
+		if (uni_char > 0xffff) {
+			u16 h, l;
+			h = 0xd800 | (((uni_char - 0x10000) >> 10) & 0x3ff);
+			l = 0xdc00 | ((uni_char - 0x10000) & 0x3ff);
+			ocu[++u_len] = (uint8_t)(h >> 8);
+			ocu[++u_len] = (uint8_t)(h & 0xffU);
+			ocu[++u_len] = (uint8_t)(l >> 8);
+			ocu[++u_len] = (uint8_t)(l & 0xffU);
+		} else {
+			if (max_val == 0x10ffffU)
+				ocu[++u_len] = (uint8_t)(uni_char >> 8);
+			ocu[++u_len] = (uint8_t)(uni_char & 0xffU);
+		}
 		i += len - 1;
 	}
 
-- 
1.7.10

-- 
Regards
Vladimir 'φ-coder/phcoder' Serbinenko


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 294 bytes --]

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 6/8] Support non-BMP characters in UDF
  2012-05-15 23:10 [PATCH 6/8] Support non-BMP characters in UDF Vladimir 'φ-coder/phcoder' Serbinenko
@ 2012-05-16 14:34 ` Jan Kara
  2012-05-16 15:14   ` Vladimir 'φ-coder/phcoder' Serbinenko
  0 siblings, 1 reply; 11+ messages in thread
From: Jan Kara @ 2012-05-16 14:34 UTC (permalink / raw)
  To: Vladimir 'φ-coder/phcoder' Serbinenko
  Cc: Jan Kara, linux-kernel, linux-fsdevel

On Wed 16-05-12 01:10:22, Vladimir 'φ-coder/phcoder' Serbinenko wrote:
> I also have a counterpart for mkudffs/udf-tools but sourceforge homepage
> seems to be abandoned does anybody know if there is a new homepage for
> mkudffs?
  Thanks for the patch! It looks OK but shouldn't we rather use the helper
functions you introduced in the NLS code? It look wrong to replicate
decoding of UTF16 here.

								Honza
> 
> Signed-off-by: Vladimir Serbinenko <phcoder@gmail.com>
> ---
>  fs/udf/unicode.c |   28 +++++++++++++++++++++++-----
>  1 file changed, 23 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/udf/unicode.c b/fs/udf/unicode.c
> index 9b1b2de..2d8cc12 100644
> --- a/fs/udf/unicode.c
> +++ b/fs/udf/unicode.c
> @@ -280,6 +280,14 @@ static int udf_CS0toNLS(struct nls_table *nls, struct ustr *utf_o,
>  		if (cmp_id == 16)
>  			c = (c << 8) | ocu[i++];
>  
> +		if (cmp_id == 16 && (c & 0xfc00) == 0xd800
> +		    && i + 1 < ocu_len && ((ocu[i] & 0xfc) == 0xdc)) {
> +			uint16_t l;
> +			l = ocu[i++] << 8;
> +			l |= ocu[i++];
> +			c = (((c & 0x3ff) << 10) | (l & 0x3ff)) + 0x10000;
> +		}
> +
>  		len = nls->uni2char(c, &utf_o->u_name[utf_o->u_len],
>  				    UDF_NAME_LEN - utf_o->u_len);
>  		/* Valid character? */
> @@ -312,20 +320,30 @@ try_again:
>  		if (!len)
>  			continue;
>  		/* Invalid character, deal with it */
> -		if (len < 0 || uni_char > 0xffff) {
> +		if (len < 0 || uni_char > 0x10ffff) {
>  			len = 1;
>  			uni_char = '?';
>  		}
>  
>  		if (uni_char > max_val) {
> -			max_val = 0xffffU;
> +			max_val = 0x10ffffU;
>  			ocu[0] = (uint8_t)0x10U;
>  			goto try_again;
>  		}
>  
> -		if (max_val == 0xffffU)
> -			ocu[++u_len] = (uint8_t)(uni_char >> 8);
> -		ocu[++u_len] = (uint8_t)(uni_char & 0xffU);
> +		if (uni_char > 0xffff) {
> +			u16 h, l;
> +			h = 0xd800 | (((uni_char - 0x10000) >> 10) & 0x3ff);
> +			l = 0xdc00 | ((uni_char - 0x10000) & 0x3ff);
> +			ocu[++u_len] = (uint8_t)(h >> 8);
> +			ocu[++u_len] = (uint8_t)(h & 0xffU);
> +			ocu[++u_len] = (uint8_t)(l >> 8);
> +			ocu[++u_len] = (uint8_t)(l & 0xffU);
> +		} else {
> +			if (max_val == 0x10ffffU)
> +				ocu[++u_len] = (uint8_t)(uni_char >> 8);
> +			ocu[++u_len] = (uint8_t)(uni_char & 0xffU);
> +		}
>  		i += len - 1;
>  	}
>  
> -- 
> 1.7.10
> 
> -- 
> Regards
> Vladimir 'φ-coder/phcoder' Serbinenko
> 


-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 6/8] Support non-BMP characters in UDF
  2012-05-16 14:34 ` Jan Kara
@ 2012-05-16 15:14   ` Vladimir 'φ-coder/phcoder' Serbinenko
  2012-05-16 20:04       ` Jan Kara
  0 siblings, 1 reply; 11+ messages in thread
From: Vladimir 'φ-coder/phcoder' Serbinenko @ 2012-05-16 15:14 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-kernel, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 1059 bytes --]

On 16.05.2012 16:34, Jan Kara wrote:

> On Wed 16-05-12 01:10:22, Vladimir 'φ-coder/phcoder' Serbinenko wrote:
>> I also have a counterpart for mkudffs/udf-tools but sourceforge homepage
>> seems to be abandoned does anybody know if there is a new homepage for
>> mkudffs?
>   Thanks for the patch!

You're welcome. Thanks for reviewing.

 It looks OK but shouldn't we rather use the helper
> functions you introduced in the NLS code? It look wrong to replicate
> decoding of UTF16 here.
> 

The helper functions are limited to buffers aligned on 16-bit boundary
which is not the case of this buffer. I see following solutions:
0) Homegrown like in previous patch
1) Add a new "endianness" UTF16_LITTLE_ENDIAN_UNALIGNED
2) Split code for "compressed" vs "uncompressed" and copy the string to
a temporary buffer in "uncompressed" branch.
3) Like 2 but make buffer sliding and contain only 2 elements.

I think 1 or 3 would be the most reasonable. Which solution do you prefer?

-- 
Regards
Vladimir 'φ-coder/phcoder' Serbinenko


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 294 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 6/8] Support non-BMP characters in UDF
  2012-05-16 15:14   ` Vladimir 'φ-coder/phcoder' Serbinenko
@ 2012-05-16 20:04       ` Jan Kara
  0 siblings, 0 replies; 11+ messages in thread
From: Jan Kara @ 2012-05-16 20:04 UTC (permalink / raw)
  To: Vladimir 'φ-coder/phcoder' Serbinenko
  Cc: Jan Kara, linux-kernel, linux-fsdevel

On Wed 16-05-12 17:14:23, Vladimir 'φ-coder/phcoder' Serbinenko wrote:
> On 16.05.2012 16:34, Jan Kara wrote:
> > On Wed 16-05-12 01:10:22, Vladimir 'φ-coder/phcoder' Serbinenko wrote:
> >> I also have a counterpart for mkudffs/udf-tools but sourceforge homepage
> >> seems to be abandoned does anybody know if there is a new homepage for
> >> mkudffs?
  Oh, and I forgot to reply here: mkudffs is really unmaintained. But also
it's not used too much AFAIK. Most people use genisoimage to generate udf
filesystems.

> >   Thanks for the patch!
> 
> You're welcome. Thanks for reviewing.
> 
>  It looks OK but shouldn't we rather use the helper
> > functions you introduced in the NLS code? It look wrong to replicate
> > decoding of UTF16 here.
> > 
> 
> The helper functions are limited to buffers aligned on 16-bit boundary
> which is not the case of this buffer. I see following solutions:
  I see.

> 0) Homegrown like in previous patch
> 1) Add a new "endianness" UTF16_LITTLE_ENDIAN_UNALIGNED
> 2) Split code for "compressed" vs "uncompressed" and copy the string to
> a temporary buffer in "uncompressed" branch.
> 3) Like 2 but make buffer sliding and contain only 2 elements.
> 
> I think 1 or 3 would be the most reasonable. Which solution do you prefer?
  I think 1 would be the best since then it can be easily reused by other
filesystems which may have similar issue.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 6/8] Support non-BMP characters in UDF
@ 2012-05-16 20:04       ` Jan Kara
  0 siblings, 0 replies; 11+ messages in thread
From: Jan Kara @ 2012-05-16 20:04 UTC (permalink / raw)
  To: Vladimir 'φ-coder/phcoder' Serbinenko
  Cc: Jan Kara, linux-kernel, linux-fsdevel

On Wed 16-05-12 17:14:23, Vladimir 'φ-coder/phcoder' Serbinenko wrote:
> On 16.05.2012 16:34, Jan Kara wrote:
> > On Wed 16-05-12 01:10:22, Vladimir 'φ-coder/phcoder' Serbinenko wrote:
> >> I also have a counterpart for mkudffs/udf-tools but sourceforge homepage
> >> seems to be abandoned does anybody know if there is a new homepage for
> >> mkudffs?
  Oh, and I forgot to reply here: mkudffs is really unmaintained. But also
it's not used too much AFAIK. Most people use genisoimage to generate udf
filesystems.

> >   Thanks for the patch!
> 
> You're welcome. Thanks for reviewing.
> 
>  It looks OK but shouldn't we rather use the helper
> > functions you introduced in the NLS code? It look wrong to replicate
> > decoding of UTF16 here.
> > 
> 
> The helper functions are limited to buffers aligned on 16-bit boundary
> which is not the case of this buffer. I see following solutions:
  I see.

> 0) Homegrown like in previous patch
> 1) Add a new "endianness" UTF16_LITTLE_ENDIAN_UNALIGNED
> 2) Split code for "compressed" vs "uncompressed" and copy the string to
> a temporary buffer in "uncompressed" branch.
> 3) Like 2 but make buffer sliding and contain only 2 elements.
> 
> I think 1 or 3 would be the most reasonable. Which solution do you prefer?
  I think 1 would be the best since then it can be easily reused by other
filesystems which may have similar issue.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 6/8] Support non-BMP characters in UDF
  2012-05-16 20:04       ` Jan Kara
  (?)
@ 2012-05-17  0:37       ` Vladimir 'φ-coder/phcoder' Serbinenko
  2012-05-17  0:48         ` Eliminating UDF iocharset!=utf8 code (Re: [PATCH 6/8] Support non-BMP characters in UDF) Vladimir 'φ-coder/phcoder' Serbinenko
  -1 siblings, 1 reply; 11+ messages in thread
From: Vladimir 'φ-coder/phcoder' Serbinenko @ 2012-05-17  0:37 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-kernel, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 1927 bytes --]

On 16.05.2012 22:04, Jan Kara wrote:

> On Wed 16-05-12 17:14:23, Vladimir 'φ-coder/phcoder' Serbinenko wrote:
>> On 16.05.2012 16:34, Jan Kara wrote:
>>> On Wed 16-05-12 01:10:22, Vladimir 'φ-coder/phcoder' Serbinenko wrote:
>>>> I also have a counterpart for mkudffs/udf-tools but sourceforge homepage
>>>> seems to be abandoned does anybody know if there is a new homepage for
>>>> mkudffs?
>   Oh, and I forgot to reply here: mkudffs is really unmaintained. But also
> it's not used too much AFAIK. Most people use genisoimage to generate udf
> filesystems.

But it doesn't seem to be appropriate for non-optical media.


>> 0) Homegrown like in previous patch
>> 1) Add a new "endianness" UTF16_LITTLE_ENDIAN_UNALIGNED
>> 2) Split code for "compressed" vs "uncompressed" and copy the string to
>> a temporary buffer in "uncompressed" branch.
>> 3) Like 2 but make buffer sliding and contain only 2 elements.
>>
>> I think 1 or 3 would be the most reasonable. Which solution do you prefer?
>   I think 1 would be the best since then it can be easily reused by other
> filesystems which may have similar issue.
> 

Ok, I'll do it. I've noticed another duplication in the UDF code: there
is NLS support and separate UTF-8 support. UTF-8 is support by 2 ways
actually: with -o utf8 and -o iocharset=utf8 which imply different
codepaths. Specific UTF-8 support is probably slightly faster by
avoiding calls and basically doing everything with shifts (or can be
made so with a small patch). Should I perhaps kill one of them? Is
iocharset!=utf8 still of any importance? I haven't seen it in ages.
Perhaps we could keep just the performant UTF-8 support and map
iocharset=utf8 to it and drop iocharset!=utf8? iocharset!=utf8 probably
has no users anyway so keeping it we're likely to keep bugs and code
duplication with no benefit.

-- 
Regards
Vladimir 'φ-coder/phcoder' Serbinenko


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 294 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Eliminating UDF iocharset!=utf8 code (Re: [PATCH 6/8] Support non-BMP characters in UDF)
  2012-05-17  0:37       ` Vladimir 'φ-coder/phcoder' Serbinenko
@ 2012-05-17  0:48         ` Vladimir 'φ-coder/phcoder' Serbinenko
  2012-05-17 14:40             ` Jan Kara
  0 siblings, 1 reply; 11+ messages in thread
From: Vladimir 'φ-coder/phcoder' Serbinenko @ 2012-05-17  0:48 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-kernel, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 1158 bytes --]


> I've noticed another duplication in the UDF code: there
> is NLS support and separate UTF-8 support. UTF-8 is support by 2 ways
> actually: with -o utf8 and -o iocharset=utf8 which imply different
> codepaths. Specific UTF-8 support is probably slightly faster by
> avoiding calls and basically doing everything with shifts (or can be
> made so with a small patch). Should I perhaps kill one of them? Is
> iocharset!=utf8 still of any importance? I haven't seen it in ages.
> Perhaps we could keep just the performant UTF-8 support and map
> iocharset=utf8 to it and drop iocharset!=utf8? iocharset!=utf8 probably
> has no users anyway so keeping it we're likely to keep bugs and code
> duplication with no benefit.
> 

Linux seems to support UTF-8-only pretty strongly: http://yarchive.net/comp/linux/utf8.html
(message from Sun, 15 Feb 2004 02:42:45 GMT).
And I completely agree.
If it's ok to kill iocharset!=utf8 I'll propose a series of 3 patches (killing iocharset!=utf8,
extending utf16toutf8/utf8toutf16 for unaligned input, changing UDF code to use common functions)


-- 
Regards
Vladimir 'φ-coder/phcoder' Serbinenko


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 294 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Eliminating UDF iocharset!=utf8 code (Re: [PATCH 6/8] Support non-BMP characters in UDF)
  2012-05-17  0:48         ` Eliminating UDF iocharset!=utf8 code (Re: [PATCH 6/8] Support non-BMP characters in UDF) Vladimir 'φ-coder/phcoder' Serbinenko
@ 2012-05-17 14:40             ` Jan Kara
  0 siblings, 0 replies; 11+ messages in thread
From: Jan Kara @ 2012-05-17 14:40 UTC (permalink / raw)
  To: Vladimir 'φ-coder/phcoder' Serbinenko
  Cc: Jan Kara, linux-kernel, linux-fsdevel

On Thu 17-05-12 02:48:49, Vladimir 'φ-coder/phcoder' Serbinenko wrote:
> 
> > I've noticed another duplication in the UDF code: there
> > is NLS support and separate UTF-8 support. UTF-8 is support by 2 ways
> > actually: with -o utf8 and -o iocharset=utf8 which imply different
> > codepaths. Specific UTF-8 support is probably slightly faster by
> > avoiding calls and basically doing everything with shifts (or can be
> > made so with a small patch). Should I perhaps kill one of them? Is
> > iocharset!=utf8 still of any importance? I haven't seen it in ages.
> > Perhaps we could keep just the performant UTF-8 support and map
> > iocharset=utf8 to it and drop iocharset!=utf8? iocharset!=utf8 probably
> > has no users anyway so keeping it we're likely to keep bugs and code
> > duplication with no benefit.
> > 
> 
> Linux seems to support UTF-8-only pretty strongly: http://yarchive.net/comp/linux/utf8.html
> (message from Sun, 15 Feb 2004 02:42:45 GMT).
> And I completely agree.
> If it's ok to kill iocharset!=utf8 I'll propose a series of 3 patches (killing iocharset!=utf8,
> extending utf16toutf8/utf8toutf16 for unaligned input, changing UDF code to use common functions)
  Well, yes, utf8 is currently the only sane setting but that doesn't mean
someone isn't using (e.g. iso8859-2) for strange reasons... We should
regress in user visible functionality only for really good reasons and here
I don't see a strong reason. So I'd like to keep current iocharset mount
option and make utf8 option equivalent to iocharset=utf8. Since I don't
think the speed benefit of dedicated CS0<->UTF8 functions is really that
big and UDF isn't exactly a filesystem where it would matter anyway, I'd
just remove those dedicated functions and use the generic ones instead.

								Honza


-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Eliminating UDF iocharset!=utf8 code (Re: [PATCH 6/8] Support non-BMP characters in UDF)
@ 2012-05-17 14:40             ` Jan Kara
  0 siblings, 0 replies; 11+ messages in thread
From: Jan Kara @ 2012-05-17 14:40 UTC (permalink / raw)
  To: Vladimir 'φ-coder/phcoder' Serbinenko
  Cc: Jan Kara, linux-kernel, linux-fsdevel

On Thu 17-05-12 02:48:49, Vladimir 'φ-coder/phcoder' Serbinenko wrote:
> 
> > I've noticed another duplication in the UDF code: there
> > is NLS support and separate UTF-8 support. UTF-8 is support by 2 ways
> > actually: with -o utf8 and -o iocharset=utf8 which imply different
> > codepaths. Specific UTF-8 support is probably slightly faster by
> > avoiding calls and basically doing everything with shifts (or can be
> > made so with a small patch). Should I perhaps kill one of them? Is
> > iocharset!=utf8 still of any importance? I haven't seen it in ages.
> > Perhaps we could keep just the performant UTF-8 support and map
> > iocharset=utf8 to it and drop iocharset!=utf8? iocharset!=utf8 probably
> > has no users anyway so keeping it we're likely to keep bugs and code
> > duplication with no benefit.
> > 
> 
> Linux seems to support UTF-8-only pretty strongly: http://yarchive.net/comp/linux/utf8.html
> (message from Sun, 15 Feb 2004 02:42:45 GMT).
> And I completely agree.
> If it's ok to kill iocharset!=utf8 I'll propose a series of 3 patches (killing iocharset!=utf8,
> extending utf16toutf8/utf8toutf16 for unaligned input, changing UDF code to use common functions)
  Well, yes, utf8 is currently the only sane setting but that doesn't mean
someone isn't using (e.g. iso8859-2) for strange reasons... We should
regress in user visible functionality only for really good reasons and here
I don't see a strong reason. So I'd like to keep current iocharset mount
option and make utf8 option equivalent to iocharset=utf8. Since I don't
think the speed benefit of dedicated CS0<->UTF8 functions is really that
big and UDF isn't exactly a filesystem where it would matter anyway, I'd
just remove those dedicated functions and use the generic ones instead.

								Honza


-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Eliminating UDF iocharset!=utf8 code (Re: [PATCH 6/8] Support non-BMP characters in UDF)
  2012-05-17 14:40             ` Jan Kara
  (?)
@ 2012-05-17 15:30             ` Vladimir 'φ-coder/phcoder' Serbinenko
  2012-05-17 19:45               ` Jan Kara
  -1 siblings, 1 reply; 11+ messages in thread
From: Vladimir 'φ-coder/phcoder' Serbinenko @ 2012-05-17 15:30 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-kernel, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 3170 bytes --]

On 17.05.2012 16:40, Jan Kara wrote:

> On Thu 17-05-12 02:48:49, Vladimir 'φ-coder/phcoder' Serbinenko wrote:
>>
>>> I've noticed another duplication in the UDF code: there
>>> is NLS support and separate UTF-8 support. UTF-8 is support by 2 ways
>>> actually: with -o utf8 and -o iocharset=utf8 which imply different
>>> codepaths. Specific UTF-8 support is probably slightly faster by
>>> avoiding calls and basically doing everything with shifts (or can be
>>> made so with a small patch). Should I perhaps kill one of them? Is
>>> iocharset!=utf8 still of any importance? I haven't seen it in ages.
>>> Perhaps we could keep just the performant UTF-8 support and map
>>> iocharset=utf8 to it and drop iocharset!=utf8? iocharset!=utf8 probably
>>> has no users anyway so keeping it we're likely to keep bugs and code
>>> duplication with no benefit.
>>>
>>
>> Linux seems to support UTF-8-only pretty strongly: http://yarchive.net/comp/linux/utf8.html
>> (message from Sun, 15 Feb 2004 02:42:45 GMT).
>> And I completely agree.
>> If it's ok to kill iocharset!=utf8 I'll propose a series of 3 patches (killing iocharset!=utf8,
>> extending utf16toutf8/utf8toutf16 for unaligned input, changing UDF code to use common functions)
>   Well, yes, utf8 is currently the only sane setting but that doesn't mean
> someone isn't using (e.g. iso8859-2) for strange reasons...


What would be the correct behaviour if we encounter the characters which
can't be represented in the given charset? Currently the code replaces
them with question marks but since this doesn't complete round trip
successfully someone attempting to open or stat the file by name won't
be able to. So these files become pretty much "ghosts" that you see but
can't do anything with them. Hiding them altogether would lead to
situations when the disk appears empty but df shows that it's 100% full.
While encodings like iso-8859-1 are relatively straightforward, some
other (East Asian) encodings may produce '/' as part of another
character and so confuse the kernel. Such encodings are also stateful
and I'm pretty sure that current code bugs on them.
I don't know if these quirks can be used to make a program load a file
it wasn't intended to and whether it's of any security concern.
I'm aware of bash security problems with such characters when part of
Chinese character is interpreted as backtick.
I don't think that these problems can create a security hole on kernel
side, they can be used to confuse userspace but I doubt it's anything
exploitable but it's something I'd be doubtful about.

> We should
> regress in user visible functionality only for really good reasons and here
> I don't see a strong reason. So I'd like to keep current iocharset mount
> option and make utf8 option equivalent to iocharset=utf8. Since I don't
> think the speed benefit of dedicated CS0<->UTF8 functions is really that
> big and UDF isn't exactly a filesystem where it would matter anyway, I'd
> just remove those dedicated functions and use the generic ones instead.

Ok, I'll prepare a patch.
-- 
Regards
Vladimir 'φ-coder/phcoder' Serbinenko


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 294 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Eliminating UDF iocharset!=utf8 code (Re: [PATCH 6/8] Support non-BMP characters in UDF)
  2012-05-17 15:30             ` Vladimir 'φ-coder/phcoder' Serbinenko
@ 2012-05-17 19:45               ` Jan Kara
  0 siblings, 0 replies; 11+ messages in thread
From: Jan Kara @ 2012-05-17 19:45 UTC (permalink / raw)
  To: Vladimir 'φ-coder/phcoder' Serbinenko
  Cc: Jan Kara, linux-kernel, linux-fsdevel

On Thu 17-05-12 17:30:32, Vladimir 'φ-coder/phcoder' Serbinenko wrote:
> On 17.05.2012 16:40, Jan Kara wrote:
> > On Thu 17-05-12 02:48:49, Vladimir 'φ-coder/phcoder' Serbinenko wrote:
> >>
> >>> I've noticed another duplication in the UDF code: there
> >>> is NLS support and separate UTF-8 support. UTF-8 is support by 2 ways
> >>> actually: with -o utf8 and -o iocharset=utf8 which imply different
> >>> codepaths. Specific UTF-8 support is probably slightly faster by
> >>> avoiding calls and basically doing everything with shifts (or can be
> >>> made so with a small patch). Should I perhaps kill one of them? Is
> >>> iocharset!=utf8 still of any importance? I haven't seen it in ages.
> >>> Perhaps we could keep just the performant UTF-8 support and map
> >>> iocharset=utf8 to it and drop iocharset!=utf8? iocharset!=utf8 probably
> >>> has no users anyway so keeping it we're likely to keep bugs and code
> >>> duplication with no benefit.
> >>>
> >>
> >> Linux seems to support UTF-8-only pretty strongly: http://yarchive.net/comp/linux/utf8.html
> >> (message from Sun, 15 Feb 2004 02:42:45 GMT).
> >> And I completely agree.
> >> If it's ok to kill iocharset!=utf8 I'll propose a series of 3 patches (killing iocharset!=utf8,
> >> extending utf16toutf8/utf8toutf16 for unaligned input, changing UDF code to use common functions)
> >   Well, yes, utf8 is currently the only sane setting but that doesn't mean
> > someone isn't using (e.g. iso8859-2) for strange reasons...
> 
> 
> What would be the correct behaviour if we encounter the characters which
> can't be represented in the given charset? Currently the code replaces
> them with question marks but since this doesn't complete round trip
> successfully someone attempting to open or stat the file by name won't
> be able to. So these files become pretty much "ghosts" that you see but
> can't do anything with them.
  Yeah. So maybe we can just pass the bytes encoding such characters
further? Sure the names would look awkward but at least they would be some
names to use. I don't say it's ideal but it's at least some sensible way...
But that's a separate question from our current discussion AFAICT.  Also so
far noone has complained about the question marks either so if someone is
using iocharset, he probably knows what he is doing ;). So I don't think
fixing this is really important.
 
> Hiding them altogether would lead to
> situations when the disk appears empty but df shows that it's 100% full.
> While encodings like iso-8859-1 are relatively straightforward, some
> other (East Asian) encodings may produce '/' as part of another
> character and so confuse the kernel. Such encodings are also stateful
> and I'm pretty sure that current code bugs on them.
> I don't know if these quirks can be used to make a program load a file
> it wasn't intended to and whether it's of any security concern.
> I'm aware of bash security problems with such characters when part of
> Chinese character is interpreted as backtick.
> I don't think that these problems can create a security hole on kernel
> side, they can be used to confuse userspace but I doubt it's anything
> exploitable but it's something I'd be doubtful about.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2012-05-17 19:45 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-05-15 23:10 [PATCH 6/8] Support non-BMP characters in UDF Vladimir 'φ-coder/phcoder' Serbinenko
2012-05-16 14:34 ` Jan Kara
2012-05-16 15:14   ` Vladimir 'φ-coder/phcoder' Serbinenko
2012-05-16 20:04     ` Jan Kara
2012-05-16 20:04       ` Jan Kara
2012-05-17  0:37       ` Vladimir 'φ-coder/phcoder' Serbinenko
2012-05-17  0:48         ` Eliminating UDF iocharset!=utf8 code (Re: [PATCH 6/8] Support non-BMP characters in UDF) Vladimir 'φ-coder/phcoder' Serbinenko
2012-05-17 14:40           ` Jan Kara
2012-05-17 14:40             ` Jan Kara
2012-05-17 15:30             ` Vladimir 'φ-coder/phcoder' Serbinenko
2012-05-17 19:45               ` Jan Kara

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.