[PATCH] udf: Fix regression which is preventing symlink access

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] udf: Fix regression which is preventing symlink access
@ 2021-09-10 16:03 Glenn Washburn
  2021-09-14 14:27 ` Daniel Kiper
  0 siblings, 1 reply; 6+ messages in thread
From: Glenn Washburn @ 2021-09-10 16:03 UTC (permalink / raw)
  To: grub-devel, Daniel Kiper
  Cc: Vladimir 'phcoder' Serbinenko, Peter Jones, Glenn Washburn

This code was broken by commit 3f05d693 ("malloc: Use overflow checking
primitives where we do complex allocations"), which added overflow checking
in many areas. The problem here is that the changes update the local
variable sz, which was already in use and which was not updated before the
change. So the code using sz was getting a different value of than it would
have previously for the same UDF image. This causes the logic getting the
destination of the symlink to not realize that its gotten the full
destination, but keeps trying to read past the end of the destination. The
bytes after the end are generally NULL padding bytes, but that's not a
valid component type (ECMA-167 14.16.1.1). So grub_udf_read_symlink branches
to error logic, returning NULL, instead of the symlink destination path.

The result of this bug is that the UDF filesystem tests were failing in the
symlink test with the grub-fstest error message:

    grub-fstest: error: cannot open `(loop0)/sym': invalid symlink.

This change stores the result of doubling sz in another local variable s,
so as not to modify sz. Also remove unnecessary grub_add, which increased
the output by 1 to account for a NULL byte. This isn't needed because an
output buffer of size twice sz is already guaranteed to be more than enough
to contain the path components converted to UTF-8. The worst case upper-
bound for the needed output buffer size is (sz-4)*1.5, where 4 is the size
of a path component header and 1.5 is the maximum growth in bytes when
converting from 2-byte unicode code-points to UTF-8 (from 2 bytes to 3).

Signed-off-by: Glenn Washburn <development@efficientek.com>
---
 grub-core/fs/udf.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/grub-core/fs/udf.c b/grub-core/fs/udf.c
index 2ac5c1d00..197e60d12 100644
--- a/grub-core/fs/udf.c
+++ b/grub-core/fs/udf.c
@@ -1022,7 +1022,7 @@ grub_udf_iterate_dir (grub_fshelp_node_t dir,
 static char *
 grub_udf_read_symlink (grub_fshelp_node_t node)
 {
-  grub_size_t sz = U64 (node->block.fe.file_size);
+  grub_size_t s, sz = U64 (node->block.fe.file_size);
   grub_uint8_t *raw;
   const grub_uint8_t *ptr;
   char *out = NULL, *optr;
@@ -1035,11 +1035,19 @@ grub_udf_read_symlink (grub_fshelp_node_t node)
   if (grub_udf_read_file (node, NULL, NULL, 0, sz, (char *) raw) < 0)
     goto fail_1;

-  if (grub_mul (sz, 2, &sz) ||
-      grub_add (sz, 1, &sz))
+  /*
+   * Local sz is the size of the symlink file data, which contains a sequence
+   * of path components (ECMA-167 14.16.1) representing the link destination.
+   * This size is an upper-bound on the number of bytes of a contained and
+   * potentially compressed 2-byte character string. Allocate 2*sz for the
+   * output buffer contained the string converted to UTF-8 because the
+   * resulting string can not be more than double the size (2-byte unicode
+   * points will be converted to a maximum of 3 bytes in UTF-8).
+   */
+  if (grub_mul (sz, 2, &s))
     goto fail_0;

-  out = grub_malloc (sz);
+  out = grub_malloc (s);
   if (!out)
     {
  fail_0:
@@ -1051,7 +1059,6 @@ grub_udf_read_symlink (grub_fshelp_node_t node)

   for (ptr = raw; ptr < raw + sz; )
     {
-      grub_size_t s;
       if ((grub_size_t) (ptr - raw + 4) > sz)
 	goto fail_1;
       if (!(ptr[2] == 0 && ptr[3] == 0))
-- 
2.32.0

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] udf: Fix regression which is preventing symlink access
  2021-09-10 16:03 [PATCH] udf: Fix regression which is preventing symlink access Glenn Washburn
@ 2021-09-14 14:27 ` Daniel Kiper
  2021-09-14 18:19   ` Glenn Washburn
  0 siblings, 1 reply; 6+ messages in thread
From: Daniel Kiper @ 2021-09-14 14:27 UTC (permalink / raw)
  To: Glenn Washburn
  Cc: grub-devel, Vladimir 'phcoder' Serbinenko, Peter Jones

On Fri, Sep 10, 2021 at 04:03:23PM +0000, Glenn Washburn wrote:
> This code was broken by commit 3f05d693 ("malloc: Use overflow checking
> primitives where we do complex allocations"), which added overflow checking
> in many areas. The problem here is that the changes update the local
> variable sz, which was already in use and which was not updated before the
> change. So the code using sz was getting a different value of than it would
> have previously for the same UDF image. This causes the logic getting the
> destination of the symlink to not realize that its gotten the full
> destination, but keeps trying to read past the end of the destination. The
> bytes after the end are generally NULL padding bytes, but that's not a
> valid component type (ECMA-167 14.16.1.1). So grub_udf_read_symlink branches
> to error logic, returning NULL, instead of the symlink destination path.
>
> The result of this bug is that the UDF filesystem tests were failing in the
> symlink test with the grub-fstest error message:
>
>     grub-fstest: error: cannot open `(loop0)/sym': invalid symlink.
>
> This change stores the result of doubling sz in another local variable s,
> so as not to modify sz. Also remove unnecessary grub_add, which increased
> the output by 1 to account for a NULL byte. This isn't needed because an
> output buffer of size twice sz is already guaranteed to be more than enough
> to contain the path components converted to UTF-8. The worst case upper-
> bound for the needed output buffer size is (sz-4)*1.5, where 4 is the size

I think 4 comes from ECMA-167 spec. Could you add a reference to it here?
The number of paragraph would be perfect...

> of a path component header and 1.5 is the maximum growth in bytes when
> converting from 2-byte unicode code-points to UTF-8 (from 2 bytes to 3).

Could you explain how did you come up with the 1.5 value? It would be
nice if you refer to a spec or something like that.

Daniel

> Signed-off-by: Glenn Washburn <development@efficientek.com>
> ---
>  grub-core/fs/udf.c | 17 ++++++++++++-----
>  1 file changed, 12 insertions(+), 5 deletions(-)
>
> diff --git a/grub-core/fs/udf.c b/grub-core/fs/udf.c
> index 2ac5c1d00..197e60d12 100644
> --- a/grub-core/fs/udf.c
> +++ b/grub-core/fs/udf.c
> @@ -1022,7 +1022,7 @@ grub_udf_iterate_dir (grub_fshelp_node_t dir,
>  static char *
>  grub_udf_read_symlink (grub_fshelp_node_t node)
>  {
> -  grub_size_t sz = U64 (node->block.fe.file_size);
> +  grub_size_t s, sz = U64 (node->block.fe.file_size);
>    grub_uint8_t *raw;
>    const grub_uint8_t *ptr;
>    char *out = NULL, *optr;
> @@ -1035,11 +1035,19 @@ grub_udf_read_symlink (grub_fshelp_node_t node)
>    if (grub_udf_read_file (node, NULL, NULL, 0, sz, (char *) raw) < 0)
>      goto fail_1;
>
> -  if (grub_mul (sz, 2, &sz) ||
> -      grub_add (sz, 1, &sz))
> +  /*
> +   * Local sz is the size of the symlink file data, which contains a sequence
> +   * of path components (ECMA-167 14.16.1) representing the link destination.
> +   * This size is an upper-bound on the number of bytes of a contained and
> +   * potentially compressed 2-byte character string. Allocate 2*sz for the
> +   * output buffer contained the string converted to UTF-8 because the
> +   * resulting string can not be more than double the size (2-byte unicode
> +   * points will be converted to a maximum of 3 bytes in UTF-8).
> +   */
> +  if (grub_mul (sz, 2, &s))
>      goto fail_0;
>
> -  out = grub_malloc (sz);
> +  out = grub_malloc (s);
>    if (!out)
>      {
>   fail_0:
> @@ -1051,7 +1059,6 @@ grub_udf_read_symlink (grub_fshelp_node_t node)
>
>    for (ptr = raw; ptr < raw + sz; )
>      {
> -      grub_size_t s;
>        if ((grub_size_t) (ptr - raw + 4) > sz)
>  	goto fail_1;
>        if (!(ptr[2] == 0 && ptr[3] == 0))
> --
> 2.32.0


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] udf: Fix regression which is preventing symlink access
  2021-09-14 14:27 ` Daniel Kiper
@ 2021-09-14 18:19   ` Glenn Washburn
  2021-09-15 14:52     ` Daniel Kiper
  0 siblings, 1 reply; 6+ messages in thread
From: Glenn Washburn @ 2021-09-14 18:19 UTC (permalink / raw)
  To: Daniel Kiper
  Cc: grub-devel, Vladimir 'phcoder' Serbinenko, Peter Jones

On Tue, 14 Sep 2021 16:27:55 +0200
Daniel Kiper <dkiper@net-space.pl> wrote:

> On Fri, Sep 10, 2021 at 04:03:23PM +0000, Glenn Washburn wrote:
> > This code was broken by commit 3f05d693 ("malloc: Use overflow
> > checking primitives where we do complex allocations"), which added
> > overflow checking in many areas. The problem here is that the
> > changes update the local variable sz, which was already in use and
> > which was not updated before the change. So the code using sz was
> > getting a different value of than it would have previously for the
> > same UDF image. This causes the logic getting the destination of
> > the symlink to not realize that its gotten the full destination,
> > but keeps trying to read past the end of the destination. The bytes
> > after the end are generally NULL padding bytes, but that's not a
> > valid component type (ECMA-167 14.16.1.1). So grub_udf_read_symlink
> > branches to error logic, returning NULL, instead of the symlink
> > destination path.
> >
> > The result of this bug is that the UDF filesystem tests were
> > failing in the symlink test with the grub-fstest error message:
> >
> >     grub-fstest: error: cannot open `(loop0)/sym': invalid symlink.
> >
> > This change stores the result of doubling sz in another local
> > variable s, so as not to modify sz. Also remove unnecessary
> > grub_add, which increased the output by 1 to account for a NULL
> > byte. This isn't needed because an output buffer of size twice sz
> > is already guaranteed to be more than enough to contain the path
> > components converted to UTF-8. The worst case upper- bound for the
> > needed output buffer size is (sz-4)*1.5, where 4 is the size
> 
> I think 4 comes from ECMA-167 spec. Could you add a reference to it
> here? The number of paragraph would be perfect...

Its 14.16.1 basically in the same place as the reference earlier, which
is why I didn't include it. But, yes, I can include it.

> > of a path component header and 1.5 is the maximum growth in bytes
> > when converting from 2-byte unicode code-points to UTF-8 (from 2
> > bytes to 3).
> 
> Could you explain how did you come up with the 1.5 value? It would be
> nice if you refer to a spec or something like that.

There is no spec that I know of (but would be happy to know of one). Its
something I've deduced based on my understanding of Unicode, UTF-8, and
UTF-16. All unicode code points less than or equal to 2 bytes (code
points <0x10000) can be represented in UTF-8 by a maximum of 3 bytes
[1]. Longer UTF-16 encodings don't matter because those will be 4 bytes
or longer. The maximum number of bytes for a UTF-8 encoding of a unicode
code point is 4 bytes. So the longer UTF-16 encodings can only be equal
to or longer than the UTF-8 encoding, thus the UTF-16 -> UTF-8 would be
shrinking or maintaining the length of the original byte string. Since
the worst case growth is 2 bytes to 3, that's 1.5 times the original
string size. QED.

Do you want all that in there? I could just remove that part about the
1.5 too.

Here's an SO question that addresses this. Yes, unofficial, but I think
adds some weight as to the correctness of the logic above.

https://stackoverflow.com/questions/55056322/maximum-utf-8-string-size-given-utf-16-size

Glenn

[1] https://en.wikipedia.org/wiki/UTF-8#Encoding

> 
> Daniel
> 
> > Signed-off-by: Glenn Washburn <development@efficientek.com>
> > ---
> >  grub-core/fs/udf.c | 17 ++++++++++++-----
> >  1 file changed, 12 insertions(+), 5 deletions(-)
> >
> > diff --git a/grub-core/fs/udf.c b/grub-core/fs/udf.c
> > index 2ac5c1d00..197e60d12 100644
> > --- a/grub-core/fs/udf.c
> > +++ b/grub-core/fs/udf.c
> > @@ -1022,7 +1022,7 @@ grub_udf_iterate_dir (grub_fshelp_node_t dir,
> >  static char *
> >  grub_udf_read_symlink (grub_fshelp_node_t node)
> >  {
> > -  grub_size_t sz = U64 (node->block.fe.file_size);
> > +  grub_size_t s, sz = U64 (node->block.fe.file_size);
> >    grub_uint8_t *raw;
> >    const grub_uint8_t *ptr;
> >    char *out = NULL, *optr;
> > @@ -1035,11 +1035,19 @@ grub_udf_read_symlink (grub_fshelp_node_t
> > node) if (grub_udf_read_file (node, NULL, NULL, 0, sz, (char *)
> > raw) < 0) goto fail_1;
> >
> > -  if (grub_mul (sz, 2, &sz) ||
> > -      grub_add (sz, 1, &sz))
> > +  /*
> > +   * Local sz is the size of the symlink file data, which contains
> > a sequence
> > +   * of path components (ECMA-167 14.16.1) representing the link
> > destination.
> > +   * This size is an upper-bound on the number of bytes of a
> > contained and
> > +   * potentially compressed 2-byte character string. Allocate 2*sz
> > for the
> > +   * output buffer contained the string converted to UTF-8 because
> > the
> > +   * resulting string can not be more than double the size (2-byte
> > unicode
> > +   * points will be converted to a maximum of 3 bytes in UTF-8).
> > +   */
> > +  if (grub_mul (sz, 2, &s))
> >      goto fail_0;
> >
> > -  out = grub_malloc (sz);
> > +  out = grub_malloc (s);
> >    if (!out)
> >      {
> >   fail_0:
> > @@ -1051,7 +1059,6 @@ grub_udf_read_symlink (grub_fshelp_node_t
> > node)
> >
> >    for (ptr = raw; ptr < raw + sz; )
> >      {
> > -      grub_size_t s;
> >        if ((grub_size_t) (ptr - raw + 4) > sz)
> >  	goto fail_1;
> >        if (!(ptr[2] == 0 && ptr[3] == 0))
> > --
> > 2.32.0


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] udf: Fix regression which is preventing symlink access
  2021-09-14 18:19   ` Glenn Washburn
@ 2021-09-15 14:52     ` Daniel Kiper
  2021-09-15 22:52       ` Glenn Washburn
  0 siblings, 1 reply; 6+ messages in thread
From: Daniel Kiper @ 2021-09-15 14:52 UTC (permalink / raw)
  To: Glenn Washburn
  Cc: grub-devel, Vladimir 'phcoder' Serbinenko, Peter Jones

On Tue, Sep 14, 2021 at 06:19:03PM +0000, Glenn Washburn wrote:
> On Tue, 14 Sep 2021 16:27:55 +0200
> Daniel Kiper <dkiper@net-space.pl> wrote:
>
> > On Fri, Sep 10, 2021 at 04:03:23PM +0000, Glenn Washburn wrote:
> > > This code was broken by commit 3f05d693 ("malloc: Use overflow
> > > checking primitives where we do complex allocations"), which added
> > > overflow checking in many areas. The problem here is that the
> > > changes update the local variable sz, which was already in use and
> > > which was not updated before the change. So the code using sz was
> > > getting a different value of than it would have previously for the
> > > same UDF image. This causes the logic getting the destination of
> > > the symlink to not realize that its gotten the full destination,
> > > but keeps trying to read past the end of the destination. The bytes
> > > after the end are generally NULL padding bytes, but that's not a
> > > valid component type (ECMA-167 14.16.1.1). So grub_udf_read_symlink
> > > branches to error logic, returning NULL, instead of the symlink
> > > destination path.
> > >
> > > The result of this bug is that the UDF filesystem tests were
> > > failing in the symlink test with the grub-fstest error message:
> > >
> > >     grub-fstest: error: cannot open `(loop0)/sym': invalid symlink.
> > >
> > > This change stores the result of doubling sz in another local
> > > variable s, so as not to modify sz. Also remove unnecessary
> > > grub_add, which increased the output by 1 to account for a NULL
> > > byte. This isn't needed because an output buffer of size twice sz
> > > is already guaranteed to be more than enough to contain the path
> > > components converted to UTF-8. The worst case upper- bound for the
> > > needed output buffer size is (sz-4)*1.5, where 4 is the size
> >
> > I think 4 comes from ECMA-167 spec. Could you add a reference to it
> > here? The number of paragraph would be perfect...
>
> Its 14.16.1 basically in the same place as the reference earlier, which
> is why I didn't include it. But, yes, I can include it.

Yes, please.

> > > of a path component header and 1.5 is the maximum growth in bytes
> > > when converting from 2-byte unicode code-points to UTF-8 (from 2
> > > bytes to 3).
> >
> > Could you explain how did you come up with the 1.5 value? It would be
> > nice if you refer to a spec or something like that.
>
> There is no spec that I know of (but would be happy to know of one). Its
> something I've deduced based on my understanding of Unicode, UTF-8, and
> UTF-16. All unicode code points less than or equal to 2 bytes (code
> points <0x10000) can be represented in UTF-8 by a maximum of 3 bytes
> [1]. Longer UTF-16 encodings don't matter because those will be 4 bytes
> or longer. The maximum number of bytes for a UTF-8 encoding of a unicode

The [1] says: Since Errata DCN-5157, the range of code points was
expanded to all code points from Unicode 4.0 (or any newer or older
version), which includes Plane 1-16 characters such as Emoji.

So, I think your assumption about longer encodings is incorrect for the UDF.

> code point is 4 bytes. So the longer UTF-16 encodings can only be equal
> to or longer than the UTF-8 encoding, thus the UTF-16 -> UTF-8 would be
> shrinking or maintaining the length of the original byte string. Since
> the worst case growth is 2 bytes to 3, that's 1.5 times the original
> string size. QED.
>
> Do you want all that in there? I could just remove that part about the
> 1.5 too.
>
> Here's an SO question that addresses this. Yes, unofficial, but I think
> adds some weight as to the correctness of the logic above.
>
> https://stackoverflow.com/questions/55056322/maximum-utf-8-string-size-given-utf-16-size
>
> Glenn
>
> [1] https://en.wikipedia.org/wiki/UTF-8#Encoding

Daniel

[1] https://en.wikipedia.org/wiki/Universal_Disk_Format#Character_set


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] udf: Fix regression which is preventing symlink access
  2021-09-15 14:52     ` Daniel Kiper
@ 2021-09-15 22:52       ` Glenn Washburn
  2021-09-16 21:34         ` Daniel Kiper
  0 siblings, 1 reply; 6+ messages in thread
From: Glenn Washburn @ 2021-09-15 22:52 UTC (permalink / raw)
  To: Daniel Kiper
  Cc: grub-devel, Vladimir 'phcoder' Serbinenko, Peter Jones

On Wed, 15 Sep 2021 16:52:28 +0200
Daniel Kiper <dkiper@net-space.pl> wrote:

> On Tue, Sep 14, 2021 at 06:19:03PM +0000, Glenn Washburn wrote:
> > On Tue, 14 Sep 2021 16:27:55 +0200
> > Daniel Kiper <dkiper@net-space.pl> wrote:
> >
> > > On Fri, Sep 10, 2021 at 04:03:23PM +0000, Glenn Washburn wrote:
> > > > This code was broken by commit 3f05d693 ("malloc: Use overflow
> > > > checking primitives where we do complex allocations"), which
> > > > added overflow checking in many areas. The problem here is that
> > > > the changes update the local variable sz, which was already in
> > > > use and which was not updated before the change. So the code
> > > > using sz was getting a different value of than it would have
> > > > previously for the same UDF image. This causes the logic
> > > > getting the destination of the symlink to not realize that its
> > > > gotten the full destination, but keeps trying to read past the
> > > > end of the destination. The bytes after the end are generally
> > > > NULL padding bytes, but that's not a valid component type
> > > > (ECMA-167 14.16.1.1). So grub_udf_read_symlink branches to
> > > > error logic, returning NULL, instead of the symlink destination
> > > > path.
> > > >
> > > > The result of this bug is that the UDF filesystem tests were
> > > > failing in the symlink test with the grub-fstest error message:
> > > >
> > > >     grub-fstest: error: cannot open `(loop0)/sym': invalid
> > > > symlink.
> > > >
> > > > This change stores the result of doubling sz in another local
> > > > variable s, so as not to modify sz. Also remove unnecessary
> > > > grub_add, which increased the output by 1 to account for a NULL
> > > > byte. This isn't needed because an output buffer of size twice
> > > > sz is already guaranteed to be more than enough to contain the
> > > > path components converted to UTF-8. The worst case upper- bound
> > > > for the needed output buffer size is (sz-4)*1.5, where 4 is the
> > > > size
> > >
> > > I think 4 comes from ECMA-167 spec. Could you add a reference to
> > > it here? The number of paragraph would be perfect...
> >
> > Its 14.16.1 basically in the same place as the reference earlier,
> > which is why I didn't include it. But, yes, I can include it.
> 
> Yes, please.

Ok, will do.

> > > > of a path component header and 1.5 is the maximum growth in
> > > > bytes when converting from 2-byte unicode code-points to UTF-8
> > > > (from 2 bytes to 3).
> > >
> > > Could you explain how did you come up with the 1.5 value? It
> > > would be nice if you refer to a spec or something like that.
> >
> > There is no spec that I know of (but would be happy to know of
> > one). Its something I've deduced based on my understanding of
> > Unicode, UTF-8, and UTF-16. All unicode code points less than or
> > equal to 2 bytes (code points <0x10000) can be represented in UTF-8
> > by a maximum of 3 bytes [1]. Longer UTF-16 encodings don't matter
> > because those will be 4 bytes or longer. The maximum number of
> > bytes for a UTF-8 encoding of a unicode
> 
> The [1] says: Since Errata DCN-5157, the range of code points was
> expanded to all code points from Unicode 4.0 (or any newer or older
> version), which includes Plane 1-16 characters such as Emoji.
> 
> So, I think your assumption about longer encodings is incorrect for
> the UDF.

No, I don't believe so. The "assumption", which was actually a logical
conclusion, that I understand you to be saying is incorrect is "Longer
UTF-16 encodings don't matter because those will be 4 bytes or longer".
This was perhaps a little confusing as there are no UTF-16 encoded
codepoints longer than 4 bytes. Be that as it may, I go on to explain
that they do not matter because those UTF-16 bytes strings will never
occupy more bytes when encoded in UTF-8. The question of debate is "how
much can a UTF-16 byte string grow when re-encoded as UTF-8?". Thus
codepoints that can not grow (most in fact strink!) can be disregarded
in the analysis. You talk about unicode Planes 1-16 as relevant, but
remember those Planes are the ones where the UTF-16 is 4 bytes, and
thus the set of code words we just disregarded as not relevant.

Perhaps you do not believe that all codepoints outside of Plane 0
require no more bytes in UTF-16 than in UTF-8. If so, please provide one
counter example, that is one codepoint satisfying the condition.
Likewise, if you believe that there exists a 2-byte codepoint in UTF-16
which occupies 4-bytes in UTF-8, a counter example would help to make
your case.

This explanation may provide more clarity on the matter[1].

> > code point is 4 bytes. So the longer UTF-16 encodings can only be
> > equal to or longer than the UTF-8 encoding, thus the UTF-16 ->
> > UTF-8 would be shrinking or maintaining the length of the original
> > byte string. Since the worst case growth is 2 bytes to 3, that's
> > 1.5 times the original string size. QED.
> >
> > Do you want all that in there? I could just remove that part about
> > the 1.5 too.
> >
> > Here's an SO question that addresses this. Yes, unofficial, but I
> > think adds some weight as to the correctness of the logic above.
> >
> > https://stackoverflow.com/questions/55056322/maximum-utf-8-string-size-given-utf-16-size
> >
> > Glenn
> >
> > [1] https://en.wikipedia.org/wiki/UTF-8#Encoding
> 
> Daniel
> 
> [1] https://en.wikipedia.org/wiki/Universal_Disk_Format#Character_set

Glenn

[1]
https://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings#Efficiency


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] udf: Fix regression which is preventing symlink access
  2021-09-15 22:52       ` Glenn Washburn
@ 2021-09-16 21:34         ` Daniel Kiper
  0 siblings, 0 replies; 6+ messages in thread
From: Daniel Kiper @ 2021-09-16 21:34 UTC (permalink / raw)
  To: Glenn Washburn
  Cc: grub-devel, Vladimir 'phcoder' Serbinenko, Peter Jones

On Wed, Sep 15, 2021 at 10:52:40PM +0000, Glenn Washburn wrote:
> On Wed, 15 Sep 2021 16:52:28 +0200
> Daniel Kiper <dkiper@net-space.pl> wrote:
>
> > On Tue, Sep 14, 2021 at 06:19:03PM +0000, Glenn Washburn wrote:
> > > On Tue, 14 Sep 2021 16:27:55 +0200
> > > Daniel Kiper <dkiper@net-space.pl> wrote:
> > >
> > > > On Fri, Sep 10, 2021 at 04:03:23PM +0000, Glenn Washburn wrote:
> > > > > This code was broken by commit 3f05d693 ("malloc: Use overflow
> > > > > checking primitives where we do complex allocations"), which
> > > > > added overflow checking in many areas. The problem here is that
> > > > > the changes update the local variable sz, which was already in
> > > > > use and which was not updated before the change. So the code
> > > > > using sz was getting a different value of than it would have
> > > > > previously for the same UDF image. This causes the logic
> > > > > getting the destination of the symlink to not realize that its
> > > > > gotten the full destination, but keeps trying to read past the
> > > > > end of the destination. The bytes after the end are generally
> > > > > NULL padding bytes, but that's not a valid component type
> > > > > (ECMA-167 14.16.1.1). So grub_udf_read_symlink branches to
> > > > > error logic, returning NULL, instead of the symlink destination
> > > > > path.
> > > > >
> > > > > The result of this bug is that the UDF filesystem tests were
> > > > > failing in the symlink test with the grub-fstest error message:
> > > > >
> > > > >     grub-fstest: error: cannot open `(loop0)/sym': invalid
> > > > > symlink.
> > > > >
> > > > > This change stores the result of doubling sz in another local
> > > > > variable s, so as not to modify sz. Also remove unnecessary
> > > > > grub_add, which increased the output by 1 to account for a NULL
> > > > > byte. This isn't needed because an output buffer of size twice
> > > > > sz is already guaranteed to be more than enough to contain the
> > > > > path components converted to UTF-8. The worst case upper- bound
> > > > > for the needed output buffer size is (sz-4)*1.5, where 4 is the
> > > > > size
> > > >
> > > > I think 4 comes from ECMA-167 spec. Could you add a reference to
> > > > it here? The number of paragraph would be perfect...
> > >
> > > Its 14.16.1 basically in the same place as the reference earlier,
> > > which is why I didn't include it. But, yes, I can include it.
> >
> > Yes, please.
>
> Ok, will do.

Thanks!

> > > > > of a path component header and 1.5 is the maximum growth in
> > > > > bytes when converting from 2-byte unicode code-points to UTF-8
> > > > > (from 2 bytes to 3).
> > > >
> > > > Could you explain how did you come up with the 1.5 value? It
> > > > would be nice if you refer to a spec or something like that.
> > >
> > > There is no spec that I know of (but would be happy to know of
> > > one). Its something I've deduced based on my understanding of
> > > Unicode, UTF-8, and UTF-16. All unicode code points less than or
> > > equal to 2 bytes (code points <0x10000) can be represented in UTF-8
> > > by a maximum of 3 bytes [1]. Longer UTF-16 encodings don't matter
> > > because those will be 4 bytes or longer. The maximum number of
> > > bytes for a UTF-8 encoding of a unicode
> >
> > The [1] says: Since Errata DCN-5157, the range of code points was
> > expanded to all code points from Unicode 4.0 (or any newer or older
> > version), which includes Plane 1-16 characters such as Emoji.
> >
> > So, I think your assumption about longer encodings is incorrect for
> > the UDF.
>
> No, I don't believe so. The "assumption", which was actually a logical
> conclusion, that I understand you to be saying is incorrect is "Longer
> UTF-16 encodings don't matter because those will be 4 bytes or longer".

Yep...

> This was perhaps a little confusing as there are no UTF-16 encoded
> codepoints longer than 4 bytes. Be that as it may, I go on to explain
> that they do not matter because those UTF-16 bytes strings will never
> occupy more bytes when encoded in UTF-8. The question of debate is "how
> much can a UTF-16 byte string grow when re-encoded as UTF-8?". Thus
> codepoints that can not grow (most in fact strink!) can be disregarded
> in the analysis. You talk about unicode Planes 1-16 as relevant, but
> remember those Planes are the ones where the UTF-16 is 4 bytes, and
> thus the set of code words we just disregarded as not relevant.
>
> Perhaps you do not believe that all codepoints outside of Plane 0
> require no more bytes in UTF-16 than in UTF-8. If so, please provide one
> counter example, that is one codepoint satisfying the condition.
> Likewise, if you believe that there exists a 2-byte codepoint in UTF-16
> which occupies 4-bytes in UTF-8, a counter example would help to make
> your case.
>
> This explanation may provide more clarity on the matter[1].

OK, now everything is clear. It seams to me that "Longer UTF-16 encodings
don't matter because those will be 4 bytes or longer" sentence should be
rephrased a bit to make it less confusing. I think "don't matter" is
especially confusing here. At least it confused me. Additionally, I think
taking some text from the [1] would help too.

Anyway, please polish the commit messages then I will get this patch.

Daniel

[1] https://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings#Efficiency


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2021-09-16 21:35 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-10 16:03 [PATCH] udf: Fix regression which is preventing symlink access Glenn Washburn
2021-09-14 14:27 ` Daniel Kiper
2021-09-14 18:19   ` Glenn Washburn
2021-09-15 14:52     ` Daniel Kiper
2021-09-15 22:52       ` Glenn Washburn
2021-09-16 21:34         ` Daniel Kiper

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.