linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/4] Fixes for exfat driver
@ 2020-03-17 22:25 ` Pali Rohár
  2020-03-17 22:25   ` [PATCH 1/4] exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF Pali Rohár
                     ` (4 more replies)
  0 siblings, 5 replies; 22+ messages in thread
From: Pali Rohár @ 2020-03-17 22:25 UTC (permalink / raw)
  To: Namjae Jeon, Sungjong Seo, Alexander Viro; +Cc: linux-fsdevel, linux-kernel

This patch series contains small fixes for the exfat driver. It removes
the conversion from UTF-32 to UTF-16 at two places where it is not
needed and fixes discard support.

Patches are also in my exfat branch:
https://git.kernel.org/pub/scm/linux/kernel/git/pali/linux.git/log/?h=exfat

Pali Rohár (4):
  exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF
  exfat: Simplify exfat_utf8_d_cmp() for code points above U+FFFF
  exfat: Remove unused functions exfat_high_surrogate() and
    exfat_low_surrogate()
  exfat: Fix discard support

 fs/exfat/exfat_fs.h |  2 --
 fs/exfat/namei.c    | 19 ++++---------------
 fs/exfat/nls.c      | 13 -------------
 fs/exfat/super.c    |  5 +++--
 4 files changed, 7 insertions(+), 32 deletions(-)

-- 
2.20.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/4] exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF
  2020-03-17 22:25 ` [PATCH 0/4] Fixes for exfat driver Pali Rohár
@ 2020-03-17 22:25   ` Pali Rohár
  2020-03-18  0:09     ` Al Viro
  2020-03-17 22:25   ` [PATCH 2/4] exfat: Simplify exfat_utf8_d_cmp() " Pali Rohár
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 22+ messages in thread
From: Pali Rohár @ 2020-03-17 22:25 UTC (permalink / raw)
  To: Namjae Jeon, Sungjong Seo, Alexander Viro; +Cc: linux-fsdevel, linux-kernel

Function partial_name_hash() takes an unsigned long value, which is wide
enough to hold one Unicode code point. Therefore the conversion from
UTF-32 to UTF-16 is not needed.

Signed-off-by: Pali Rohár <pali@kernel.org>
---
 fs/exfat/namei.c | 10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/fs/exfat/namei.c b/fs/exfat/namei.c
index a8681d91f569..e0ec4ff366f5 100644
--- a/fs/exfat/namei.c
+++ b/fs/exfat/namei.c
@@ -147,16 +147,10 @@ static int exfat_utf8_d_hash(const struct dentry *dentry, struct qstr *qstr)
 			return charlen;
 
 		/*
-		 * Convert to UTF-16: code points above U+FFFF are encoded as
-		 * surrogate pairs.
 		 * exfat_toupper() works only for code points up to the U+FFFF.
 		 */
-		if (u > 0xFFFF) {
-			hash = partial_name_hash(exfat_high_surrogate(u), hash);
-			hash = partial_name_hash(exfat_low_surrogate(u), hash);
-		} else {
-			hash = partial_name_hash(exfat_toupper(sb, u), hash);
-		}
+		hash = partial_name_hash(u <= 0xFFFF ? exfat_toupper(sb, u) : u,
+					 hash);
 	}
 
 	qstr->hash = end_name_hash(hash);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread
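
Editorial sketch (not part of the patch): the change can be illustrated in
user space, assuming the partial_name_hash() arithmetic quoted later in this
thread. The names hash_as_utf16() and hash_as_utf32() are made up for this
illustration, and the surrogate helpers only mirror the ones removed in
patch 3/4.

```c
#include <assert.h>
#include <stdint.h>

/* Same arithmetic as the partial_name_hash() quoted in this thread. */
static unsigned long partial_name_hash(unsigned long c, unsigned long prevhash)
{
	return (prevhash + (c << 4) + (c >> 4)) * 11;
}

/* High (leading) surrogate of a code point above U+FFFF. */
static uint16_t high_surrogate(uint32_t u)
{
	assert(u > 0xffff && u <= 0x10ffff);
	return (uint16_t)(((u - 0x10000) >> 10) + 0xd800);
}

/* Low (trailing) surrogate of a code point above U+FFFF. */
static uint16_t low_surrogate(uint32_t u)
{
	assert(u > 0xffff && u <= 0x10ffff);
	return (uint16_t)(((u - 0x10000) & 0x3ff) | 0xdc00);
}

/* Old behaviour: feed the two UTF-16 code units into the hash. */
static unsigned long hash_as_utf16(uint32_t u, unsigned long hash)
{
	hash = partial_name_hash(high_surrogate(u), hash);
	return partial_name_hash(low_surrogate(u), hash);
}

/* New behaviour: feed the code point itself into the hash. */
static unsigned long hash_as_utf32(uint32_t u, unsigned long hash)
{
	return partial_name_hash(u, hash);
}
```

Either variant is usable as a dentry hash as long as names that the compare
callback treats as equal also hash to the same value; hashing the code point
directly simply skips the surrogate computation.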

* [PATCH 2/4] exfat: Simplify exfat_utf8_d_cmp() for code points above U+FFFF
  2020-03-17 22:25 ` [PATCH 0/4] Fixes for exfat driver Pali Rohár
  2020-03-17 22:25   ` [PATCH 1/4] exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF Pali Rohár
@ 2020-03-17 22:25   ` Pali Rohár
  2020-03-17 22:25   ` [PATCH 3/4] exfat: Remove unused functions exfat_high_surrogate() and exfat_low_surrogate() Pali Rohár
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 22+ messages in thread
From: Pali Rohár @ 2020-03-17 22:25 UTC (permalink / raw)
  To: Namjae Jeon, Sungjong Seo, Alexander Viro; +Cc: linux-fsdevel, linux-kernel

If two Unicode code points differ in their UTF-16 representation, then
their UTF-32 representations must differ as well. Therefore conversion
from UTF-32 to UTF-16 is not needed.

Signed-off-by: Pali Rohár <pali@kernel.org>
---
 fs/exfat/namei.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/fs/exfat/namei.c b/fs/exfat/namei.c
index e0ec4ff366f5..f07cab5fcd28 100644
--- a/fs/exfat/namei.c
+++ b/fs/exfat/namei.c
@@ -179,14 +179,9 @@ static int exfat_utf8_d_cmp(const struct dentry *dentry, unsigned int len,
 		if (u_a <= 0xFFFF && u_b <= 0xFFFF) {
 			if (exfat_toupper(sb, u_a) != exfat_toupper(sb, u_b))
 				return 1;
-		} else if (u_a > 0xFFFF && u_b > 0xFFFF) {
-			if (exfat_low_surrogate(u_a) !=
-					exfat_low_surrogate(u_b) ||
-			    exfat_high_surrogate(u_a) !=
-					exfat_high_surrogate(u_b))
-				return 1;
 		} else {
-			return 1;
+			if (u_a != u_b)
+				return 1;
 		}
 	}
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread
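
Editorial sketch (not part of the patch): the equivalence the commit message
relies on can be checked exhaustively in user space. The function names below
are hypothetical, and the surrogate helpers only mirror the ones removed in
patch 3/4.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static uint16_t high_surrogate(uint32_t u)
{
	assert(u > 0xffff && u <= 0x10ffff);
	return (uint16_t)(((u - 0x10000) >> 10) + 0xd800);
}

static uint16_t low_surrogate(uint32_t u)
{
	assert(u > 0xffff && u <= 0x10ffff);
	return (uint16_t)(((u - 0x10000) & 0x3ff) | 0xdc00);
}

/* Old check for two code points that are both above U+FFFF. */
static bool differ_as_utf16(uint32_t a, uint32_t b)
{
	return low_surrogate(a) != low_surrogate(b) ||
	       high_surrogate(a) != high_surrogate(b);
}

/* New check: compare the code points directly. */
static bool differ_as_utf32(uint32_t a, uint32_t b)
{
	return a != b;
}

/* Compare both checks for `ref` against every non-BMP code point. */
static bool checks_agree_with(uint32_t ref)
{
	for (uint32_t u = 0x10000; u <= 0x10ffff; u++)
		if (differ_as_utf16(ref, u) != differ_as_utf32(ref, u))
			return false;
	return true;
}
```

The mixed case (one code point at or below U+FFFF, the other above it) needs
no separate branch after the patch, because two such values can never be
equal.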

* [PATCH 3/4] exfat: Remove unused functions exfat_high_surrogate() and exfat_low_surrogate()
  2020-03-17 22:25 ` [PATCH 0/4] Fixes for exfat driver Pali Rohár
  2020-03-17 22:25   ` [PATCH 1/4] exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF Pali Rohár
  2020-03-17 22:25   ` [PATCH 2/4] exfat: Simplify exfat_utf8_d_cmp() " Pali Rohár
@ 2020-03-17 22:25   ` Pali Rohár
  2020-03-17 22:25   ` [PATCH 4/4] exfat: Fix discard support Pali Rohár
  2020-03-17 23:20   ` [PATCH 0/4] Fixes for exfat driver Namjae Jeon
  4 siblings, 0 replies; 22+ messages in thread
From: Pali Rohár @ 2020-03-17 22:25 UTC (permalink / raw)
  To: Namjae Jeon, Sungjong Seo, Alexander Viro; +Cc: linux-fsdevel, linux-kernel

After applying the previous two patches, these functions are no longer used.

Signed-off-by: Pali Rohár <pali@kernel.org>
---
 fs/exfat/exfat_fs.h |  2 --
 fs/exfat/nls.c      | 13 -------------
 2 files changed, 15 deletions(-)

diff --git a/fs/exfat/exfat_fs.h b/fs/exfat/exfat_fs.h
index 67d4e46fb810..8a176a803206 100644
--- a/fs/exfat/exfat_fs.h
+++ b/fs/exfat/exfat_fs.h
@@ -492,8 +492,6 @@ int exfat_nls_to_utf16(struct super_block *sb,
 		struct exfat_uni_name *uniname, int *p_lossy);
 int exfat_create_upcase_table(struct super_block *sb);
 void exfat_free_upcase_table(struct exfat_sb_info *sbi);
-unsigned short exfat_high_surrogate(unicode_t u);
-unsigned short exfat_low_surrogate(unicode_t u);
 
 /* exfat/misc.c */
 void __exfat_fs_error(struct super_block *sb, int report, const char *fmt, ...)
diff --git a/fs/exfat/nls.c b/fs/exfat/nls.c
index 6d1c3ae130ff..e3a9f5e08f68 100644
--- a/fs/exfat/nls.c
+++ b/fs/exfat/nls.c
@@ -537,22 +537,9 @@ static int exfat_utf8_to_utf16(struct super_block *sb,
 	return unilen;
 }
 
-#define PLANE_SIZE	0x00010000
 #define SURROGATE_MASK	0xfffff800
 #define SURROGATE_PAIR	0x0000d800
 #define SURROGATE_LOW	0x00000400
-#define SURROGATE_BITS	0x000003ff
-
-unsigned short exfat_high_surrogate(unicode_t u)
-{
-	return ((u - PLANE_SIZE) >> 10) + SURROGATE_PAIR;
-}
-
-unsigned short exfat_low_surrogate(unicode_t u)
-{
-	return ((u - PLANE_SIZE) & SURROGATE_BITS) | SURROGATE_PAIR |
-		SURROGATE_LOW;
-}
 
 static int __exfat_utf16_to_nls(struct super_block *sb,
 		struct exfat_uni_name *p_uniname, unsigned char *p_cstring,
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 4/4] exfat: Fix discard support
  2020-03-17 22:25 ` [PATCH 0/4] Fixes for exfat driver Pali Rohár
                     ` (2 preceding siblings ...)
  2020-03-17 22:25   ` [PATCH 3/4] exfat: Remove unused functions exfat_high_surrogate() and exfat_low_surrogate() Pali Rohár
@ 2020-03-17 22:25   ` Pali Rohár
  2020-03-17 23:20   ` [PATCH 0/4] Fixes for exfat driver Namjae Jeon
  4 siblings, 0 replies; 22+ messages in thread
From: Pali Rohár @ 2020-03-17 22:25 UTC (permalink / raw)
  To: Namjae Jeon, Sungjong Seo, Alexander Viro; +Cc: linux-fsdevel, linux-kernel

Discard support was unconditionally disabled, even on devices that
support it. Now it is disabled only when blk_queue_discard() returns
false.

Signed-off-by: Pali Rohár <pali@kernel.org>
---
 fs/exfat/super.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/exfat/super.c b/fs/exfat/super.c
index 16ed202ef527..30e914ad17b5 100644
--- a/fs/exfat/super.c
+++ b/fs/exfat/super.c
@@ -531,10 +531,11 @@ static int exfat_fill_super(struct super_block *sb, struct fs_context *fc)
 	if (opts->discard) {
 		struct request_queue *q = bdev_get_queue(sb->s_bdev);
 
-		if (!blk_queue_discard(q))
+		if (!blk_queue_discard(q)) {
 			exfat_msg(sb, KERN_WARNING,
 				"mounting with \"discard\" option, but the device does not support discard");
-		opts->discard = 0;
+			opts->discard = 0;
+		}
 	}
 
 	sb->s_flags |= SB_NODIRATIME;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread
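
Editorial sketch (not part of the patch): the bug is a missing pair of braces,
so the assignment ran whenever the mount option was set. The struct and
function names below are hypothetical stand-ins, and a counter replaces the
kernel warning message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the mount options touched by the patch. */
struct mount_opts {
	bool discard;
	int warnings;	/* counts warnings instead of printing them */
};

/* Before the patch: only the warning is guarded by the inner if. */
static void check_discard_old(struct mount_opts *opts, bool dev_supports_discard)
{
	assert(opts != NULL);
	if (opts->discard) {
		if (!dev_supports_discard)
			opts->warnings++;
		opts->discard = false;	/* always executed */
	}
}

/* After the patch: warning and assignment share one block. */
static void check_discard_new(struct mount_opts *opts, bool dev_supports_discard)
{
	assert(opts != NULL);
	if (opts->discard) {
		if (!dev_supports_discard) {
			opts->warnings++;
			opts->discard = false;
		}
	}
}
```

With a device that supports discard, the old variant clears the option
silently, while the new variant leaves it enabled.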

* RE: [PATCH 0/4] Fixes for exfat driver
  2020-03-17 22:25 ` [PATCH 0/4] Fixes for exfat driver Pali Rohár
                     ` (3 preceding siblings ...)
  2020-03-17 22:25   ` [PATCH 4/4] exfat: Fix discard support Pali Rohár
@ 2020-03-17 23:20   ` Namjae Jeon
  2020-04-15  8:01     ` Pali Rohár
  4 siblings, 1 reply; 22+ messages in thread
From: Namjae Jeon @ 2020-03-17 23:20 UTC (permalink / raw)
  To: 'Pali Rohár', 'Alexander Viro'
  Cc: linux-fsdevel, linux-kernel, 'Sungjong Seo'

> This patch series contains small fixes for exfat driver. It removes
> conversion from UTF-32 to UTF-16 at two places where it is not needed and
> fixes discard support.
Looks good to me.
Acked-by: Namjae Jeon <namjae.jeon@samsung.com>

Hi Al,

Could you please push these patches into your #for-next ?
Thanks!

> 
> Patches are also in my exfat branch:
> https://git.kernel.org/pub/scm/linux/kernel/git/pali/linux.git/log/?h=exfat
> 
> Pali Rohár (4):
>   exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF
>   exfat: Simplify exfat_utf8_d_cmp() for code points above U+FFFF
>   exfat: Remove unused functions exfat_high_surrogate() and
>     exfat_low_surrogate()
>   exfat: Fix discard support
> 
>  fs/exfat/exfat_fs.h |  2 --
>  fs/exfat/namei.c    | 19 ++++---------------
>  fs/exfat/nls.c      | 13 -------------
>  fs/exfat/super.c    |  5 +++--
>  4 files changed, 7 insertions(+), 32 deletions(-)
> 
> --
> 2.20.1




^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/4] exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF
  2020-03-17 22:25   ` [PATCH 1/4] exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF Pali Rohár
@ 2020-03-18  0:09     ` Al Viro
  2020-03-18  9:32       ` Pali Rohár
  0 siblings, 1 reply; 22+ messages in thread
From: Al Viro @ 2020-03-18  0:09 UTC (permalink / raw)
  To: Pali Rohár; +Cc: Namjae Jeon, Sungjong Seo, linux-fsdevel, linux-kernel

On Tue, Mar 17, 2020 at 11:25:52PM +0100, Pali Rohár wrote:
> Function partial_name_hash() takes long type value into which can be stored
> one Unicode code point. Therefore conversion from UTF-32 to UTF-16 is not
> needed.

Hmm...  You might want to update the comment in stringhash.h...

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/4] exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF
  2020-03-18  0:09     ` Al Viro
@ 2020-03-18  9:32       ` Pali Rohár
  2020-03-28 23:40         ` Pali Rohár
  0 siblings, 1 reply; 22+ messages in thread
From: Pali Rohár @ 2020-03-18  9:32 UTC (permalink / raw)
  To: Al Viro; +Cc: Namjae Jeon, Sungjong Seo, linux-fsdevel, linux-kernel

On Wednesday 18 March 2020 00:09:25 Al Viro wrote:
> On Tue, Mar 17, 2020 at 11:25:52PM +0100, Pali Rohár wrote:
> > Function partial_name_hash() takes long type value into which can be stored
> > one Unicode code point. Therefore conversion from UTF-32 to UTF-16 is not
> > needed.
> 
> Hmm...  You might want to update the comment in stringhash.h...

Well, initially I did not look at the hashing functions closely. The
hashing function used here is defined in stringhash.h as:

static inline unsigned long
partial_name_hash(unsigned long c, unsigned long prevhash)
{
	return (prevhash + (c << 4) + (c >> 4)) * 11;
}

I guess it was designed for 8-bit types, not for long (64-bit) types,
and I'm not sure how effective it is even for the 16-bit types for which
it is already used.

So the question is: what should we do for either a 21-bit number (one
Unicode code point, the equivalent of UTF-32) or a sequence of 16-bit
numbers (UTF-16)?

Any opinion?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/4] exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF
  2020-03-18  9:32       ` Pali Rohár
@ 2020-03-28 23:40         ` Pali Rohár
  0 siblings, 0 replies; 22+ messages in thread
From: Pali Rohár @ 2020-03-28 23:40 UTC (permalink / raw)
  To: Al Viro; +Cc: Namjae Jeon, Sungjong Seo, linux-fsdevel, linux-kernel

On Wednesday 18 March 2020 10:32:51 Pali Rohár wrote:
> On Wednesday 18 March 2020 00:09:25 Al Viro wrote:
> > On Tue, Mar 17, 2020 at 11:25:52PM +0100, Pali Rohár wrote:
> > > Function partial_name_hash() takes long type value into which can be stored
> > > one Unicode code point. Therefore conversion from UTF-32 to UTF-16 is not
> > > needed.
> > 
> > Hmm...  You might want to update the comment in stringhash.h...
> 
> Well, initially I have not looked at hashing functions deeply. Used
> hashing function in stringhash.h is defined as:
> 
> static inline unsigned long
> partial_name_hash(unsigned long c, unsigned long prevhash)
> {
> 	return (prevhash + (c << 4) + (c >> 4)) * 11;
> }
> 
> I guess it was designed for 8bit types, not for long (64bit types) and
> I'm not sure how effective it is even for 16bit types for which it is
> already used.
> 
> So question is, what should we do for either 21bit number (one Unicode
> code point = equivalent of UTF-32) or for sequence of 16bit numbers
> (UTF-16)?
> 
> Any opinion?

So what should we do with that hashing function?

Anyway, "[PATCH 4/4] exfat: Fix discard support" should be reviewed, as
discard support in exfat is currently broken.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/4] Fixes for exfat driver
  2020-03-17 23:20   ` [PATCH 0/4] Fixes for exfat driver Namjae Jeon
@ 2020-04-15  8:01     ` Pali Rohár
  2020-04-15 23:43       ` Namjae Jeon
  0 siblings, 1 reply; 22+ messages in thread
From: Pali Rohár @ 2020-04-15  8:01 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Namjae Jeon, linux-fsdevel, linux-kernel, Sungjong Seo

On Wednesday 18 March 2020 08:20:04 Namjae Jeon wrote:
> > This patch series contains small fixes for exfat driver. It removes
> > conversion from UTF-32 to UTF-16 at two places where it is not needed and
> > fixes discard support.
> Looks good to me.
> Acked-by: Namjae Jeon <namjae.jeon@samsung.com>
> 
> Hi Al,
> 
> Could you please push these patches into your #for-next ?
> Thanks!

Al, could you please take this patch series? Based on the feedback, the
current hashing code is good enough, and we do not want to have broken
discard support in the upcoming Linux kernel version.

> > 
> > Patches are also in my exfat branch:
> > https://git.kernel.org/pub/scm/linux/kernel/git/pali/linux.git/log/?h=exfat
> > 
> > Pali Rohár (4):
> >   exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF
> >   exfat: Simplify exfat_utf8_d_cmp() for code points above U+FFFF
> >   exfat: Remove unused functions exfat_high_surrogate() and
> >     exfat_low_surrogate()
> >   exfat: Fix discard support
> > 
> >  fs/exfat/exfat_fs.h |  2 --
> >  fs/exfat/namei.c    | 19 ++++---------------
> >  fs/exfat/nls.c      | 13 -------------
> >  fs/exfat/super.c    |  5 +++--
> >  4 files changed, 7 insertions(+), 32 deletions(-)
> > 
> > --
> > 2.20.1
> 
> 
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: [PATCH 0/4] Fixes for exfat driver
  2020-04-15  8:01     ` Pali Rohár
@ 2020-04-15 23:43       ` Namjae Jeon
  0 siblings, 0 replies; 22+ messages in thread
From: Namjae Jeon @ 2020-04-15 23:43 UTC (permalink / raw)
  To: 'Pali Rohár'
  Cc: linux-fsdevel, linux-kernel, 'Sungjong Seo',
	'Alexander Viro'

> On Wednesday 18 March 2020 08:20:04 Namjae Jeon wrote:
> > > This patch series contains small fixes for exfat driver. It removes
> > > conversion from UTF-32 to UTF-16 at two places where it is not
> > > needed and fixes discard support.
> > Looks good to me.
> > Acked-by: Namjae Jeon <namjae.jeon@samsung.com>
> >
> > Hi Al,
> >
> > Could you please push these patches into your #for-next ?
> > Thanks!
> 
> Al, could you please take this patch series? Based on feedback current
> hashing code is good enough. And we do not want to have broken discard
> support in upcoming Linux kernel version.
Hi Pali,

I will push them to exfat git tree.

Thanks for your work!
> 
> > >
> > > Patches are also in my exfat branch:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/pali/linux.git/log/?h=exfat
> > >
> > > Pali Rohár (4):
> > >   exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF
> > >   exfat: Simplify exfat_utf8_d_cmp() for code points above U+FFFF
> > >   exfat: Remove unused functions exfat_high_surrogate() and
> > >     exfat_low_surrogate()
> > >   exfat: Fix discard support
> > >
> > >  fs/exfat/exfat_fs.h |  2 --
> > >  fs/exfat/namei.c    | 19 ++++---------------
> > >  fs/exfat/nls.c      | 13 -------------
> > >  fs/exfat/super.c    |  5 +++--
> > >  4 files changed, 7 insertions(+), 32 deletions(-)
> > >
> > > --
> > > 2.20.1
> >
> >
> >



^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: [PATCH 1/4] exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF
  2020-04-14  9:47                 ` Pali Rohár
@ 2020-04-15  7:46                   ` Kohada.Tetsuhiro
  0 siblings, 0 replies; 22+ messages in thread
From: Kohada.Tetsuhiro @ 2020-04-15  7:46 UTC (permalink / raw)
  To: 'Pali Rohár'
  Cc: viro, 'linux-fsdevel@vger.kernel.org',
	'linux-kernel@vger.kernel.org',
	'namjae.jeon@samsung.com',
	'sj1557.seo@samsung.com',
	Mori.Takahiro, Ohara.Eiji

> > UCS-2, UCS-4, and UTF-16 terms do not appear in the exfat specification.
> > It just says "Unicode".
> 
> That is because in MS world, "Unicode" term lot of times means UCS-2 or UTF-16. 

For example, the Joliet specification describes using UCS-2 for character sets.
Similarly, the UDF specification describes using Unicode Version 2.0 for character sets.
However, Windows file systems also accept UTF-16 encoded UCS-4.
The foundation of their main product (Windows NT) was designed in the era when UTF-16 and UCS-2 were equivalent.
The non-BMP planes were probably not fully considered.

> You need to have a crystal ball to correctly understand their specifications.

Exactly!!
My crystal ball says ...
"They've designed D800-DFFF to be a mysterious area, so it's going through it."

> > Microsoft's File Systems uses the UTF-16 encoded UCS-4 code set.
> > The character type is basically 'wchar_t'(16bit).
> > The description "0000h to FFFFh" also assumes the use of 'wchar_t'.
> >
> > This “0000h to FFFFh” also includes surrogate characters(U+D800 to
> > U+DFFF), but these should not be converted to upper case.
> > Passing a surrogate character to RtlUpcaseUnicodeChar() on Windows, just returns the same value.
> > (* RtlUpcaseUnicodeChar() is one of Windows native API)
> >
> > If the upcase-table contains surrogate characters, exfat_toupper() will cause incorrect conversion.
> > With the current implementation, the results of exfat_utf8_d_cmp() and exfat_uniname_ncmp() may differ.
> >
> > The normal exfat's upcase-table does not contain surrogate characters, so the problem does not occur.
> > To be more strict...
> > D800h to DFFFh should be excluded when loading upcase-table or in exfat_toupper().
> 
> Exactly, that is why surrogate pairs cannot be put into any "to upper"
> function. Or rather "to upper" function needs to be identity for them to not break anything. "to upper" does not make
> any sense on one u16 item from UTF-16 sequence when you do not have a complete code point.
> So API for UTF-16 "to upper" function needs to take full string, not just one u16.
>
> So for code points above U+FFFF it is needed some other mechanism how to represent upcase table (e.g. by providing full
> UTF-16 pair or code point encoded in UTF-32). And this is unknown and reason why I put question which was IIRC forwarded
> to MS.

That's exactly the case with a "generic" UTF-16 toupper function.
However, exfat (and other MS file systems) does not require uppercase conversion for non-BMP characters.
For non-BMP characters, I think it's enough to do nothing (no skipping, no conversion), just like Windows.


> > WTF-8 is new to me.
> > That's an interesting idea, but is it needed for exfat?
> >
> > For characters over U+FFFF,
> >  -For UTF-32, a value of 0x10000 or more
> >  -For UTF-16, the value from 0xd800 to 0xdfff
> > I think these are just "don't convert to uppercase."
> >
> > If the File Name Directory Entry contains illegal surrogate
> > characters(such as one unpaired surrogate half), it will simply be ignored by utf16s_to_utf8s().
> 
> This is the example why it can be useful for exfat on linux. exfat filename can contain just sequence of unpaired halves
> of surrogate pairs. Such thing is not representable in UTF-8, but valid in exfat.
> Therefore current linux kernel exfat driver with UTF-8 encoding cannot handle such filenames. But with WTF-8 it is possible.

In fact, exfat (and other MS file systems) accepts unpaired surrogate characters.
But this is illegal Unicode.
Also, it is very rarely generated by normal user operation (except for VFAT short names).
Illegal Unicode characters have often been a security risk, and I think they should not be accepted, even if that is possible.

> So if we want that userspace would be able to read such files from exfat fs, some mechanism for converting "unpaired halves"
> to NULL-term char* string suitable for filenames is needed. And WTF-8 seems like a good choice as it is backward compatible
> with UTF-8.

I think there is very little need to access such file names.
It is rare to use non-BMP characters in file names, and it is even rarer to illegally record only half of one.

> > string after utf8 conversion does not include illegal byte sequence.
> 
> Yes, but this is loosy conversion. When you would have two filenames with different "surrogate halves" they would be converted
> to same file name. So you would not be able to access both of them.

I also think there is a problem with this conversion.
Illegal byte sequences are stripped off, and the result behaves as if they had
never existed (like a legal UTF-8 string).
I think it's safest to fail the conversion if it detects an illegal byte sequence.
It's also common to replace it with another character (such as '_').
(not perfect, but works reasonably)

Anyway, we don't need to convert non-BMP characters or unpaired surrogate characters
to uppercase in exfat (and other MS file systems).


BR
---
Kohada Tetsuhiro <Kohada.Tetsuhiro@dc.MitsubishiElectric.co.jp>

^ permalink raw reply	[flat|nested] 22+ messages in thread
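
Editorial sketch (not part of the thread): the "do nothing for surrogates and
non-BMP characters" behaviour argued for above could look like this in user
space. toupper_sketch() and the table layout (0x10000 entries, 0 meaning "no
mapping") are assumptions made for illustration; only the lookup expression
follows the exfat_toupper() one-liner quoted elsewhere in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PLANE_SIZE 0x10000u

/*
 * Hypothetical upcase lookup. `table` must hold PLANE_SIZE entries;
 * an entry of 0 means the character has no uppercase mapping.
 */
static uint32_t toupper_sketch(const uint16_t *table, uint32_t u)
{
	assert(table != NULL);

	if (u >= PLANE_SIZE)			/* non-BMP: no conversion */
		return u;
	if (u >= 0xd800 && u <= 0xdfff)		/* surrogate code unit: identity */
		return u;
	return table[u] ? table[u] : u;
}
```

Because the surrogate range is skipped before the table is consulted, even an
on-disk upcase table that (incorrectly) maps D800h to DFFFh would not change
the result in this sketch.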

* Re: [PATCH 1/4] exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF
  2020-04-14  9:29               ` Kohada.Tetsuhiro
@ 2020-04-14  9:47                 ` Pali Rohár
  2020-04-15  7:46                   ` Kohada.Tetsuhiro
  0 siblings, 1 reply; 22+ messages in thread
From: Pali Rohár @ 2020-04-14  9:47 UTC (permalink / raw)
  To: Kohada.Tetsuhiro
  Cc: viro, 'linux-fsdevel@vger.kernel.org',
	'linux-kernel@vger.kernel.org',
	'namjae.jeon@samsung.com',
	'sj1557.seo@samsung.com',
	Mori.Takahiro

On Tuesday 14 April 2020 09:29:32 Kohada.Tetsuhiro@dc.MitsubishiElectric.co.jp wrote:
> > We do not know how code points above U+FFFF could be converted to upper case. 
> 
> Code points above U+FFFF do not need to be converted to uppercase.
> 
> > Basically from exfat specification can be deduced it only for
> > U+0000 .. U+FFFF code points. 
> 
> exFAT specification (sec. 7.2.5.1) says ...
> -- table shall cover the complete Unicode character range (from character codes 0000h to FFFFh inclusive).
> 
> UCS-2, UCS-4, and UTF-16 terms do not appear in the exfat specification.
> It just says "Unicode".

That is because in the MS world, the term "Unicode" often means UCS-2 or
UTF-16. You need a crystal ball to correctly understand their
specifications.

> 
> > Second problem is that all MS filesystems (vfat, ntfs and exfat) do not use UCS-2 nor UTF-16, but rather some mix between
> > it. Basically any sequence of 16bit values (except those :/<>... vfat chars) is valid, even unpaired surrogate half. So
> > surrogate pair (two 16bit values) represents one unicode code point (as in UTF-16), but one unpaired surrogate half is
> > also valid and represent (invalid) unicode code point of its value. In unicode are not defined code points for values
> > of single / half surrogate.
> 
> Microsoft's File Systems uses the UTF-16 encoded UCS-4 code set.
> The character type is basically 'wchar_t'(16bit).
> The description "0000h to FFFFh" also assumes the use of 'wchar_t'.
> 
> This “0000h to FFFFh” also includes surrogate characters(U+D800 to U+DFFF),
> but these should not be converted to upper case.
> Passing a surrogate character to RtlUpcaseUnicodeChar() on Windows, just returns the same value.
> (* RtlUpcaseUnicodeChar() is one of Windows native API)
> 
> If the upcase-table contains surrogate characters, exfat_toupper() will cause incorrect conversion.
> With the current implementation, the results of exfat_utf8_d_cmp() and exfat_uniname_ncmp() may differ.
> 
> The normal exfat's upcase-table does not contain surrogate characters, so the problem does not occur.
> To be more strict...
> D800h to DFFFh should be excluded when loading upcase-table or in exfat_toupper().

Exactly, that is why surrogate pairs cannot be put into any "to upper"
function. Or rather, a "to upper" function needs to be the identity for
them so as not to break anything. "To upper" does not make any sense on
one u16 item from a UTF-16 sequence when you do not have a complete code
point. So the API for a UTF-16 "to upper" function needs to take the
full string, not just one u16.

So for code points above U+FFFF, some other mechanism is needed to
represent the upcase table (e.g. by providing a full UTF-16 pair or a
code point encoded in UTF-32). This is unknown, and it is the reason why
I asked the question which was, IIRC, forwarded to MS.

> > Therefore if we talk about encoding UTF-16 vs UTF-32 we first need to fix a way how to handle those non-representative
> > values in VFS encoding (iocharset=) as UTF-8 is not able to represent it too. One option is to extend UTF-8 to WTF-8 
> > encoding [1] (yes, this is a real and make sense!) and then ideally change exfat_toupper() to UTF-32 without restriction 
> > for surrogate pairs values.
> 
> WTF-8 is new to me.
> That's an interesting idea, but is it needed for exfat?
> 
> For characters over U+FFFF,
>  -For UTF-32, a value of 0x10000 or more
>  -For UTF-16, the value from 0xd800 to 0xdfff
> I think these are just "don't convert to uppercase."
> 
> If the File Name Directory Entry contains illegal surrogate characters(such as one unpaired surrogate half),
> it will simply be ignored by utf16s_to_utf8s().

This is an example of why it can be useful for exfat on Linux. An exfat
filename can consist of just a sequence of unpaired halves of surrogate
pairs. Such a name is not representable in UTF-8, but is valid in exfat.
Therefore the current Linux kernel exfat driver with UTF-8 encoding
cannot handle such filenames. But with WTF-8 it is possible.

So if we want userspace to be able to read such files from an exfat fs,
some mechanism for converting "unpaired halves" to a NUL-terminated
char* string suitable for filenames is needed. And WTF-8 seems like a
good choice, as it is backward compatible with UTF-8.

> string after utf8 conversion does not include illegal byte sequence.

Yes, but this is a lossy conversion. If you had two filenames that
differ only in their "surrogate halves", they would be converted to the
same file name. So you would not be able to access both of them.

> 
> > Btw, same problem with UTF-16 also in vfat, ntfs and also in iso/joliet kernel drivers.
> 
> Ugh...
> 
> 
> BR
> ---
> Kohada Tetsuhiro <Kohada.Tetsuhiro@dc.MitsubishiElectric.co.jp>
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread
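
Editorial sketch (not part of the thread): WTF-8 keeps the UTF-8 bit layout but
also allows an unpaired surrogate to be encoded as a three-byte sequence, so
two names that differ only in a lone surrogate stay distinct. The function name
is hypothetical, and the sketch encodes one value at a time; it assumes the
caller has already combined proper surrogate pairs into a single code point,
as WTF-8 requires.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one value with the generalized UTF-8 bit layout.
 * Returns the number of bytes written to `out` (at most 4),
 * or 0 if the value is above U+10FFFF.
 */
static size_t encode_generalized_utf8(uint32_t u, unsigned char *out)
{
	assert(out != NULL);

	if (u < 0x80) {
		out[0] = (unsigned char)u;
		return 1;
	}
	if (u < 0x800) {
		out[0] = (unsigned char)(0xc0 | (u >> 6));
		out[1] = (unsigned char)(0x80 | (u & 0x3f));
		return 2;
	}
	if (u < 0x10000) {
		/* Includes U+D800..U+DFFF, which strict UTF-8 rejects. */
		out[0] = (unsigned char)(0xe0 | (u >> 12));
		out[1] = (unsigned char)(0x80 | ((u >> 6) & 0x3f));
		out[2] = (unsigned char)(0x80 | (u & 0x3f));
		return 3;
	}
	if (u <= 0x10ffff) {
		out[0] = (unsigned char)(0xf0 | (u >> 18));
		out[1] = (unsigned char)(0x80 | ((u >> 12) & 0x3f));
		out[2] = (unsigned char)(0x80 | ((u >> 6) & 0x3f));
		out[3] = (unsigned char)(0x80 | (u & 0x3f));
		return 4;
	}
	return 0;
}
```

A lone U+D800 becomes the bytes ED A0 80 and a lone U+D801 becomes ED A0 81,
whereas a conversion that drops unpaired surrogates would map both names to
the same string.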

* RE: [PATCH 1/4] exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF
  2020-04-13 10:10             ` Pali Rohár
@ 2020-04-14  9:29               ` Kohada.Tetsuhiro
  2020-04-14  9:47                 ` Pali Rohár
  0 siblings, 1 reply; 22+ messages in thread
From: Kohada.Tetsuhiro @ 2020-04-14  9:29 UTC (permalink / raw)
  To: 'Pali Rohár'
  Cc: viro, 'linux-fsdevel@vger.kernel.org',
	'linux-kernel@vger.kernel.org',
	'namjae.jeon@samsung.com',
	'sj1557.seo@samsung.com',
	Mori.Takahiro

> We do not know how code points above U+FFFF could be converted to upper case. 

Code points above U+FFFF do not need to be converted to uppercase.

> Basically from exfat specification can be deduced it only for
> U+0000 .. U+FFFF code points. 

exFAT specification (sec. 7.2.5.1) says ...
-- table shall cover the complete Unicode character range (from character codes 0000h to FFFFh inclusive).

UCS-2, UCS-4, and UTF-16 terms do not appear in the exfat specification.
It just says "Unicode".


> Second problem is that all MS filesystems (vfat, ntfs and exfat) do not use UCS-2 nor UTF-16, but rather some mix between
> it. Basically any sequence of 16bit values (except those :/<>... vfat chars) is valid, even unpaired surrogate half. So
> surrogate pair (two 16bit values) represents one unicode code point (as in UTF-16), but one unpaired surrogate half is
> also valid and represent (invalid) unicode code point of its value. In unicode are not defined code points for values
> of single / half surrogate.

Microsoft's file systems use the UTF-16 encoded UCS-4 code set.
The character type is basically 'wchar_t' (16-bit).
The description "0000h to FFFFh" also assumes the use of 'wchar_t'.

This “0000h to FFFFh” also includes surrogate characters (U+D800 to U+DFFF),
but these should not be converted to upper case.
Passing a surrogate character to RtlUpcaseUnicodeChar() on Windows just returns the same value.
(* RtlUpcaseUnicodeChar() is one of the Windows native APIs)

If the upcase-table contains surrogate characters, exfat_toupper() will cause incorrect conversion.
With the current implementation, the results of exfat_utf8_d_cmp() and exfat_uniname_ncmp() may differ.

The normal exfat's upcase-table does not contain surrogate characters, so the problem does not occur.
To be more strict...
D800h to DFFFh should be excluded when loading upcase-table or in exfat_toupper().

> Therefore if we talk about encoding UTF-16 vs UTF-32 we first need to fix a way how to handle those non-representative
> values in VFS encoding (iocharset=) as UTF-8 is not able to represent it too. One option is to extend UTF-8 to WTF-8 
> encoding [1] (yes, this is a real and make sense!) and then ideally change exfat_toupper() to UTF-32 without restriction 
> for surrogate pairs values.

WTF-8 is new to me.
That's an interesting idea, but is it needed for exfat?

For characters over U+FFFF,
 -For UTF-32, a value of 0x10000 or more
 -For UTF-16, the value from 0xd800 to 0xdfff
I think these are just "don't convert to uppercase."

If the File Name Directory Entry contains illegal surrogate characters (such as one unpaired surrogate half),
they will simply be ignored by utf16s_to_utf8s().
The string after UTF-8 conversion does not include illegal byte sequences.


> Btw, same problem with UTF-16 also in vfat, ntfs and also in iso/joliet kernel drivers.

Ugh...


BR
---
Kohada Tetsuhiro <Kohada.Tetsuhiro@dc.MitsubishiElectric.co.jp>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/4] exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF
  2020-04-13  8:13           ` Kohada.Tetsuhiro
@ 2020-04-13 10:10             ` Pali Rohár
  2020-04-14  9:29               ` Kohada.Tetsuhiro
  0 siblings, 1 reply; 22+ messages in thread
From: Pali Rohár @ 2020-04-13 10:10 UTC (permalink / raw)
  To: Kohada.Tetsuhiro
  Cc: viro, 'linux-fsdevel@vger.kernel.org',
	'linux-kernel@vger.kernel.org',
	'namjae.jeon@samsung.com',
	'sj1557.seo@samsung.com'

On Monday 13 April 2020 08:13:45 Kohada.Tetsuhiro@dc.MitsubishiElectric.co.jp wrote:
> > On Wednesday 08 April 2020 03:59:06 Kohada.Tetsuhiro@dc.MitsubishiElectric.co.jp wrote:
> > > > So partial_name_hash() like I used it in this patch series is enough?
> > >
> > > I think partial_name_hash() is enough for 8/16/21bit characters.
> > 
> > Great!
> > 
> > Al, could you please take this patch series?
> 
> I think it's good.
> 
> 
> > > Another point about the discrimination of 21bit characters:
> > > I think that checking in exfat_toupper () can be more simplified.
> > >
> > >  ex: return a < PLANE_SIZE && sbi->vol_utbl[a] ? sbi->vol_utbl[a] : a;
> > 
> > I was thinking about it, but it needs more refactoring. Currently
> > exfat_toupper() is used in other places for UTF-16 (u16 array) and
> > therefore it cannot be extended to take more than a 16-bit value.
> 
> I’m also a little worried that exfat_toupper() is designed for only utf16.
> Currently, it is converting from utf8 to utf32 in some places, and from utf8 to utf16 in others.
> Another way would be to unify to utf16.
> 
> > But I agree that this is another step which can be improved.
> 
> Yes.

There are two problems with it:

We do not know how code points above U+FFFF should be converted to upper
case. The exfat specification only lets us deduce this for code points
U+0000 .. U+FFFF. We asked MS for an answer, but I have not received any
response yet.

The second problem is that all MS filesystems (vfat, ntfs and exfat) use
neither UCS-2 nor UTF-16, but rather a mix of the two. Basically any
sequence of 16-bit values (except those :/<>... vfat chars) is valid,
even an unpaired surrogate half. So a surrogate pair (two 16-bit values)
represents one Unicode code point (as in UTF-16), but one unpaired
surrogate half is also valid and represents the (invalid) Unicode code
point of its value. Unicode does not define code points for the values
of single / half surrogates.

Therefore, if we talk about UTF-16 vs UTF-32 encoding, we first need to
settle how to handle those non-representable values in the VFS encoding
(iocharset=), as UTF-8 is not able to represent them either. One option
is to extend UTF-8 to the WTF-8 encoding [1] (yes, this is real and
makes sense!) and then ideally change exfat_toupper() to UTF-32 without
restriction for surrogate pair values.

Btw, the same problem with UTF-16 also exists in the vfat, ntfs and
iso/joliet kernel drivers.

[1] - https://simonsapin.github.io/wtf-8/
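
The "mix" described above can be sketched as a decoder over 16-bit units:
a well-formed surrogate pair yields one code point above U+FFFF, while an
unpaired surrogate half is passed through as its own value. This is only a
userspace illustration under those assumptions; decode_unit_sequence() is a
hypothetical name and not a function of the exfat driver.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one value from a sequence of 16-bit units.
 * Returns the number of units consumed (0 if the input is empty).
 */
static size_t decode_unit_sequence(const uint16_t *s, size_t len, uint32_t *out)
{
	if (len == 0)
		return 0;

	/* High surrogate followed by low surrogate: one code point. */
	if (s[0] >= 0xD800 && s[0] <= 0xDBFF && len >= 2 &&
	    s[1] >= 0xDC00 && s[1] <= 0xDFFF) {
		*out = 0x10000 + (((uint32_t)(s[0] - 0xD800) << 10) |
				  (uint32_t)(s[1] - 0xDC00));
		return 2;
	}

	/* BMP character or unpaired surrogate half: the value itself. */
	*out = s[0];
	return 1;
}
```

The unpaired case is the one that UTF-8 cannot represent, which is why the
choice of VFS encoding comes up before any change to exfat_toupper().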


* RE: [PATCH 1/4] exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF
  2020-04-08  9:04         ` Pali Rohár
@ 2020-04-13  8:13           ` Kohada.Tetsuhiro
  2020-04-13 10:10             ` Pali Rohár
  0 siblings, 1 reply; 22+ messages in thread
From: Kohada.Tetsuhiro @ 2020-04-13  8:13 UTC (permalink / raw)
  To: 'Pali Rohár', viro
  Cc: 'linux-fsdevel@vger.kernel.org',
	'linux-kernel@vger.kernel.org',
	'namjae.jeon@samsung.com',
	'sj1557.seo@samsung.com'

> On Wednesday 08 April 2020 03:59:06 Kohada.Tetsuhiro@dc.MitsubishiElectric.co.jp wrote:
> > > So partial_name_hash() like I used it in this patch series is enough?
> >
> > I think partial_name_hash() is enough for 8/16/21bit characters.
> 
> Great!
> 
> Al, could you please take this patch series?

I think it's good.


> > Another point about the discrimination of 21bit characters:
> > I think that checking in exfat_toupper () can be more simplified.
> >
> >  ex: return a < PLANE_SIZE && sbi->vol_utbl[a] ? sbi->vol_utbl[a] : a;
> 
> I was thinking about it, but it needs more refactoring. Currently
> exfat_toupper() is used in other places for UTF-16 (u16 array) and
> therefore it cannot be extended to take more than a 16-bit value.

I’m also a little worried that exfat_toupper() is designed for only utf16.
Currently, it is converting from utf8 to utf32 in some places, and from utf8 to utf16 in others.
Another way would be to unify to utf16.

> But I agree that this is another step which can be improved.

Yes.



* Re: [PATCH 1/4] exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF
  2020-04-08  3:59       ` Kohada.Tetsuhiro
@ 2020-04-08  9:04         ` Pali Rohár
  2020-04-13  8:13           ` Kohada.Tetsuhiro
  0 siblings, 1 reply; 22+ messages in thread
From: Pali Rohár @ 2020-04-08  9:04 UTC (permalink / raw)
  To: Kohada.Tetsuhiro, viro
  Cc: 'linux-fsdevel@vger.kernel.org',
	'linux-kernel@vger.kernel.org',
	'namjae.jeon@samsung.com',
	'sj1557.seo@samsung.com'

On Wednesday 08 April 2020 03:59:06 Kohada.Tetsuhiro@dc.MitsubishiElectric.co.jp wrote:
> > So partial_name_hash() like I used it in this patch series is enough?
> 
> I think partial_name_hash() is enough for 8/16/21bit characters.

Great!

Al, could you please take this patch series?

> Another point about the discrimination of 21bit characters:
> I think that checking in exfat_toupper () can be more simplified.
> 
>  ex: return a < PLANE_SIZE && sbi->vol_utbl[a] ? sbi->vol_utbl[a] : a;

I was thinking about it, but it needs more refactoring. Currently
exfat_toupper() is used in other places for UTF-16 (u16 array) and
therefore it cannot be extended to take more than a 16-bit value.

But I agree that this is another step which can be improved.


* RE: [PATCH 1/4] exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF
  2020-04-07 10:06     ` Pali Rohár
@ 2020-04-08  3:59       ` Kohada.Tetsuhiro
  2020-04-08  9:04         ` Pali Rohár
  0 siblings, 1 reply; 22+ messages in thread
From: Kohada.Tetsuhiro @ 2020-04-08  3:59 UTC (permalink / raw)
  To: 'pali@kernel.org'
  Cc: 'linux-fsdevel@vger.kernel.org',
	'linux-kernel@vger.kernel.org',
	'namjae.jeon@samsung.com',
	'sj1557.seo@samsung.com',
	'viro@zeniv.linux.org.uk'

> So partial_name_hash() like I used it in this patch series is enough?

I think partial_name_hash() is enough for 8/16/21bit characters.

Another point about the discrimination of 21-bit characters:
I think the check in exfat_toupper() can be simplified further.

 ex: return a < PLANE_SIZE && sbi->vol_utbl[a] ? sbi->vol_utbl[a] : a;
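
As a rough userspace sketch of that one-liner, assuming the lookup takes a
32-bit value and that PLANE_SIZE is 0x10000 (one Unicode plane). Here
vol_utbl is a plain static array standing in for sbi->vol_utbl, and
toupper_wide() is a hypothetical name:

```c
#include <stdint.h>

#define PLANE_SIZE 0x10000u	/* assumed: number of code points in one plane */

/* Stand-in for sbi->vol_utbl; an entry of 0 means "no mapping". */
static uint16_t vol_utbl[PLANE_SIZE];

/* Values at or above PLANE_SIZE never index the table. */
static uint32_t toupper_wide(uint32_t a)
{
	return a < PLANE_SIZE && vol_utbl[a] ? vol_utbl[a] : a;
}
```

Code points above U+FFFF fall through unchanged, so the caller would no
longer need its own range check before calling the function.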

---
Kohada Tetsuhiro <Kohada.Tetsuhiro@dc.MitsubishiElectric.co.jp>



* Re: [PATCH 1/4] exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF
  2020-04-06  9:37   ` Kohada.Tetsuhiro
@ 2020-04-07 10:06     ` Pali Rohár
  2020-04-08  3:59       ` Kohada.Tetsuhiro
  0 siblings, 1 reply; 22+ messages in thread
From: Pali Rohár @ 2020-04-07 10:06 UTC (permalink / raw)
  To: Kohada.Tetsuhiro
  Cc: 'linux-fsdevel@vger.kernel.org',
	'linux-kernel@vger.kernel.org',
	'namjae.jeon@samsung.com',
	'sj1557.seo@samsung.com',
	'viro@zeniv.linux.org.uk'

On Monday 06 April 2020 09:37:38 Kohada.Tetsuhiro@dc.MitsubishiElectric.co.jp wrote:
> > > If you want to get an unbiased hash value by specifying an 8 or 16-bit
> > > value,
> > 
> > Hello! In exfat we have sequence of 21-bit values (not 8, not 16).
> 
> hash_32() generates a less-biased hash, even for 21-bit characters.
> 
> The hash of partial_name_hash() for the filename with the following character is ...
>  - 21-bit(surrogate pair): the upper 3-bits of hash tend to be 0.
>  - 16-bit(mostly CJKV): the upper 8-bits of hash tend to be 0.
>  - 8-bit(mostly latin): the upper 16-bits of hash tend to be 0.
> 
> I think the more frequently used latin/CJKV characters are more important
> when considering the hash efficiency of surrogate pair characters.
> 
> The hash of partial_name_hash() for 8/16-bit characters is also biased.
> However, it works well.
> 
> Surrogate pair characters are used less frequently, and the hash of 
> partial_name_hash() has less bias than for 8/16 bit characters.
> 
> So I think there is no problem with your patch.

So partial_name_hash() like I used it in this patch series is enough?


* RE: [PATCH 1/4] exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF
  2020-04-03 20:40 ` Pali Rohár
@ 2020-04-06  9:37   ` Kohada.Tetsuhiro
  2020-04-07 10:06     ` Pali Rohár
  0 siblings, 1 reply; 22+ messages in thread
From: Kohada.Tetsuhiro @ 2020-04-06  9:37 UTC (permalink / raw)
  To: 'Pali Rohár'
  Cc: 'linux-fsdevel@vger.kernel.org',
	'linux-kernel@vger.kernel.org',
	'namjae.jeon@samsung.com',
	'sj1557.seo@samsung.com',
	'viro@zeniv.linux.org.uk'

> > If you want to get an unbiased hash value by specifying an 8 or 16-bit
> > value,
> 
> Hello! In exfat we have sequence of 21-bit values (not 8, not 16).

hash_32() generates a less-biased hash, even for 21-bit characters.

The hash of partial_name_hash() for the filename with the following character is ...
 - 21-bit(surrogate pair): the upper 3-bits of hash tend to be 0.
 - 16-bit(mostly CJKV): the upper 8-bits of hash tend to be 0.
 - 8-bit(mostly latin): the upper 16-bits of hash tend to be 0.

I think the more frequently used latin/CJKV characters are more important
than surrogate pair characters when considering hash efficiency.

The hash of partial_name_hash() for 8/16-bit characters is also biased.
However, it works well.

Surrogate pair characters are used less frequently, and the hash of 
partial_name_hash() has less bias than for 8/16 bit characters.

So I think there is no problem with your patch.
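
To make the bias concrete, here is a userspace copy of the formula quoted
in this thread, applied one code point per step. It is a sketch for
experimenting outside the kernel; name_hash_step() and hash_code_points()
are hypothetical names, and only the arithmetic is taken from the quoted
partial_name_hash().

```c
#include <stddef.h>
#include <stdint.h>

/* Same arithmetic as the partial_name_hash() quoted in this thread. */
static unsigned long name_hash_step(unsigned long c, unsigned long prevhash)
{
	return (prevhash + (c << 4) + (c >> 4)) * 11;
}

/* Fold a whole name into the hash, one code point per step. */
static unsigned long hash_code_points(const uint32_t *cp, size_t len)
{
	unsigned long hash = 0;
	size_t i;

	for (i = 0; i < len; i++)
		hash = name_hash_step(cp[i], hash);
	return hash;
}
```

After a single step from a zero hash, an 8-bit input gives a value below
2^16 and a 16-bit input gives a value below 2^24, which is consistent with
the upper bits tending to be 0 for short names.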


> Did you mean hash_32() function from linux/hash.h?

Oops. I forgot '_'.
hash_32() is correct.


---
Kohada Tetsuhiro <Kohada.Tetsuhiro@dc.MitsubishiElectric.co.jp>


* Re: [PATCH 1/4] exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF
  2020-04-03  2:18 [PATCH 1/4] exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF Kohada.Tetsuhiro
@ 2020-04-03 20:40 ` Pali Rohár
  2020-04-06  9:37   ` Kohada.Tetsuhiro
  0 siblings, 1 reply; 22+ messages in thread
From: Pali Rohár @ 2020-04-03 20:40 UTC (permalink / raw)
  To: Kohada.Tetsuhiro
  Cc: 'linux-fsdevel@vger.kernel.org',
	'linux-kernel@vger.kernel.org',
	'namjae.jeon@samsung.com',
	'sj1557.seo@samsung.com',
	'viro@zeniv.linux.org.uk'

On Friday 03 April 2020 02:18:15 Kohada.Tetsuhiro@dc.MitsubishiElectric.co.jp wrote:
> > I guess it was designed for 8bit types, not for long (64bit types) and
> > I'm not sure how effective it is even for 16bit types for which it is
> > already used.
> 
> In partial_name_hash (), when 8bit value or 16bit value is specified, 
> upper 8-12bits tend to be 0.
> 
> > So question is, what should we do for either 21bit number (one Unicode
> > code point = equivalent of UTF-32) or for sequence of 16bit numbers
> > (UTF-16)?
> 
> If you want to get an unbiased hash value by specifying an 8 or 16-bit value,

Hello! In exfat we have sequence of 21-bit values (not 8, not 16).

> the hash32() function is a good choice.
> ex1: Prepare the value with the hash32() function.
>    hash = partial_name_hash(hash32(val16, 32), hash);
> ex2: Use the hash32() function directly.
>    hash += hash32(val16, 32);

Did you mean hash_32() function from linux/hash.h?

> > partial_name_hash(unsigned long c, unsigned long prevhash)
> > {
> >	return (prevhash + (c << 4) + (c >> 4)) * 11;
> > }
> 
> Another way would be to replace partial_name_hash():
> 
> 	return prevhash + hash32(c, 32);
> 


* Re: [PATCH 1/4] exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF
@ 2020-04-03  2:18 Kohada.Tetsuhiro
  2020-04-03 20:40 ` Pali Rohár
  0 siblings, 1 reply; 22+ messages in thread
From: Kohada.Tetsuhiro @ 2020-04-03  2:18 UTC (permalink / raw)
  To: 'pali@kernel.org'
  Cc: 'linux-fsdevel@vger.kernel.org',
	'linux-kernel@vger.kernel.org',
	'namjae.jeon@samsung.com',
	'sj1557.seo@samsung.com',
	'viro@zeniv.linux.org.uk'


> I guess it was designed for 8bit types, not for long (64bit types) and
> I'm not sure how effective it is even for 16bit types for which it is
> already used.

In partial_name_hash(), when an 8-bit or 16-bit value is specified,
the upper 8-12 bits tend to be 0.

> So question is, what should we do for either 21bit number (one Unicode
> code point = equivalent of UTF-32) or for sequence of 16bit numbers
> (UTF-16)?

If you want to get an unbiased hash value by specifying an 8 or 16-bit value,
the hash32() function is a good choice.
ex1: Prepare the value with the hash32() function.
   hash = partial_name_hash(hash32(val16, 32), hash);
ex2: Use the hash32() function directly.
   hash += hash32(val16, 32);

> partial_name_hash(unsigned long c, unsigned long prevhash)
> {
>	return (prevhash + (c << 4) + (c >> 4)) * 11;
> }

Another way would be to replace partial_name_hash():

	return prevhash + hash32(c, 32);
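
A userspace sketch of that alternative, assuming hash32() here refers to
hash_32() from linux/hash.h (as clarified later in the thread). The
multiplier below is believed to match the kernel's GOLDEN_RATIO_32, but
treat it as an assumption; hash_32_sketch() and mixed_step() are
hypothetical names.

```c
#include <stdint.h>

#define GOLDEN_RATIO_32 0x61C88647u	/* assumed to match linux/hash.h */

/* Multiplicative hash: keep the top 'bits' bits (1..32) of the product. */
static uint32_t hash_32_sketch(uint32_t val, unsigned int bits)
{
	return (val * GOLDEN_RATIO_32) >> (32 - bits);
}

/* The proposed replacement step: add the mixed value to the running hash. */
static uint32_t mixed_step(uint32_t c, uint32_t prevhash)
{
	return prevhash + hash_32_sketch(c, 32);
}
```

Because the multiplication spreads even a small input over all 32 bits, the
upper bits are no longer mostly 0 for 8-bit or 16-bit characters. Note that
plain addition makes the result independent of character order, which may
or may not matter for a dentry hash.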



end of thread, other threads:[~2020-04-15 23:43 UTC | newest]

Thread overview: 22+ messages
     [not found] <CGME20200317222604epcas1p1559308b0199c5320a9c77f5ad9f033a2@epcas1p1.samsung.com>
2020-03-17 22:25 ` [PATCH 0/4] Fixes for exfat driver Pali Rohár
2020-03-17 22:25   ` [PATCH 1/4] exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF Pali Rohár
2020-03-18  0:09     ` Al Viro
2020-03-18  9:32       ` Pali Rohár
2020-03-28 23:40         ` Pali Rohár
2020-03-17 22:25   ` [PATCH 2/4] exfat: Simplify exfat_utf8_d_cmp() " Pali Rohár
2020-03-17 22:25   ` [PATCH 3/4] exfat: Remove unused functions exfat_high_surrogate() and exfat_low_surrogate() Pali Rohár
2020-03-17 22:25   ` [PATCH 4/4] exfat: Fix discard support Pali Rohár
2020-03-17 23:20   ` [PATCH 0/4] Fixes for exfat driver Namjae Jeon
2020-04-15  8:01     ` Pali Rohár
2020-04-15 23:43       ` Namjae Jeon
2020-04-03  2:18 [PATCH 1/4] exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF Kohada.Tetsuhiro
2020-04-03 20:40 ` Pali Rohár
2020-04-06  9:37   ` Kohada.Tetsuhiro
2020-04-07 10:06     ` Pali Rohár
2020-04-08  3:59       ` Kohada.Tetsuhiro
2020-04-08  9:04         ` Pali Rohár
2020-04-13  8:13           ` Kohada.Tetsuhiro
2020-04-13 10:10             ` Pali Rohár
2020-04-14  9:29               ` Kohada.Tetsuhiro
2020-04-14  9:47                 ` Pali Rohár
2020-04-15  7:46                   ` Kohada.Tetsuhiro
