All of lore.kernel.org
 help / color / mirror / Atom feed
* Practical Limit on EROFS lcluster size
@ 2021-12-20 13:45 Kelvin Zhang via Linux-erofs
  2021-12-20 14:53 ` Gao Xiang
  0 siblings, 1 reply; 7+ messages in thread
From: Kelvin Zhang via Linux-erofs @ 2021-12-20 13:45 UTC (permalink / raw)
  To: linux-erofs mailing list

Hi Gao,
    I was playing with large pcluster sizes recently, I noticed a
quirk about EROFS. In summary, logical cluster size has a practical
limit of 8MB. Here's why:

   Looking at https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git/tree/lib/compress.c?h=experimental&id=564adb0a852b38a1790db20516862fc31bca314d#n92
, line 92, we see the following code:

if (d0 == 1 && erofs_sb_has_big_pcluster()) {
        type = Z_EROFS_VLE_CLUSTER_TYPE_NONHEAD;
        di.di_u.delta[0] = cpu_to_le16(ctx->compressedblks |
                Z_EROFS_VLE_DI_D0_CBLKCNT); // This line
        di.di_u.delta[1] = cpu_to_le16(d1);
} else if (d0) {
        type = Z_EROFS_VLE_CLUSTER_TYPE_NONHEAD;

        di.di_u.delta[0] = cpu_to_le16(d0);  // and this line
        di.di_u.delta[1] = cpu_to_le16(d1);
}

When a compressed index has type NOHEAD, delta[0] stores d0(distance
to head block). But The 11th bit of d0 is also used as a flag bit to
indicate that d0 stores the pcluster size. This means d0 cannot exceed
Z_EROFS_VLE_DI_D0_CBLKCNT(2048), or else the parser will incorrectly
interpret d0 as pcluster size, rather than distance to head block.
    Is this an intentional design choice? It's not necessarily bad,
but it's something I think is worth documenting in code.


-- 
Sincerely,

Kelvin Zhang

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Practical Limit on EROFS lcluster size
  2021-12-20 13:45 Practical Limit on EROFS lcluster size Kelvin Zhang via Linux-erofs
@ 2021-12-20 14:53 ` Gao Xiang
  2021-12-20 15:02   ` Kelvin Zhang via Linux-erofs
  2021-12-20 15:04   ` Gao Xiang
  0 siblings, 2 replies; 7+ messages in thread
From: Gao Xiang @ 2021-12-20 14:53 UTC (permalink / raw)
  To: Kelvin Zhang; +Cc: linux-erofs mailing list

Hi Kelvin,

On Mon, Dec 20, 2021 at 08:45:42AM -0500, Kelvin Zhang wrote:
> Hi Gao,
>     I was playing with large pcluster sizes recently, I noticed a
> quirk about EROFS. In summary, logical cluster size has a practical
> limit of 8MB. Here's why:
> 
>    Looking at https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git/tree/lib/compress.c?h=experimental&id=564adb0a852b38a1790db20516862fc31bca314d#n92
> , line 92, we see the following code:
> 
> if (d0 == 1 && erofs_sb_has_big_pcluster()) {
>         type = Z_EROFS_VLE_CLUSTER_TYPE_NONHEAD;
>         di.di_u.delta[0] = cpu_to_le16(ctx->compressedblks |
>                 Z_EROFS_VLE_DI_D0_CBLKCNT); // This line
>         di.di_u.delta[1] = cpu_to_le16(d1);
> } else if (d0) {
>         type = Z_EROFS_VLE_CLUSTER_TYPE_NONHEAD;
> 
>         di.di_u.delta[0] = cpu_to_le16(d0);  // and this line
>         di.di_u.delta[1] = cpu_to_le16(d1);
> }
> 
> When a compressed index has type NOHEAD, delta[0] stores d0(distance
> to head block). But The 11th bit of d0 is also used as a flag bit to
> indicate that d0 stores the pcluster size. This means d0 cannot exceed
> Z_EROFS_VLE_DI_D0_CBLKCNT(2048), or else the parser will incorrectly
> interpret d0 as pcluster size, rather than distance to head block.
>     Is this an intentional design choice? It's not necessarily bad,
> but it's something I think is worth documenting in code.

Thanks for this great insight! Actually on-disk EROFS format doesn't
have such limitation by design, since if it looks back to the delta0
lcluster and it's still a NONHEAD lcluster, it will look back with
new delta0 again until finding the final HEAD lcluster.

But I'm not sure if mkfs code can handle > 8MiB lcluster properly yet,
without modification since lcluster size is strictly limited with
https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git/tree/include/erofs/compress.h#n14
EROFS_CONFIG_COMPR_MAX_SZ * 2

Yeah, I have to admit the current document might not be so detailed,
partially due to my somewhat bad English written speed, and limited
time...

Thanks,
Gao Xiang

> 
> 
> -- 
> Sincerely,
> 
> Kelvin Zhang

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Practical Limit on EROFS lcluster size
  2021-12-20 14:53 ` Gao Xiang
@ 2021-12-20 15:02   ` Kelvin Zhang via Linux-erofs
  2021-12-20 15:09     ` Gao Xiang
  2021-12-20 15:04   ` Gao Xiang
  1 sibling, 1 reply; 7+ messages in thread
From: Kelvin Zhang via Linux-erofs @ 2021-12-20 15:02 UTC (permalink / raw)
  To: Gao Xiang; +Cc: linux-erofs mailing list

On Mon, Dec 20, 2021 at 9:53 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>
> Hi Kelvin,
>
> On Mon, Dec 20, 2021 at 08:45:42AM -0500, Kelvin Zhang wrote:
> > Hi Gao,
> >     I was playing with large pcluster sizes recently, I noticed a
> > quirk about EROFS. In summary, logical cluster size has a practical
> > limit of 8MB. Here's why:
> >
> >    Looking at https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git/tree/lib/compress.c?h=experimental&id=564adb0a852b38a1790db20516862fc31bca314d#n92
> > , line 92, we see the following code:
> >
> > if (d0 == 1 && erofs_sb_has_big_pcluster()) {
> >         type = Z_EROFS_VLE_CLUSTER_TYPE_NONHEAD;
> >         di.di_u.delta[0] = cpu_to_le16(ctx->compressedblks |
> >                 Z_EROFS_VLE_DI_D0_CBLKCNT); // This line
> >         di.di_u.delta[1] = cpu_to_le16(d1);
> > } else if (d0) {
> >         type = Z_EROFS_VLE_CLUSTER_TYPE_NONHEAD;
> >
> >         di.di_u.delta[0] = cpu_to_le16(d0);  // and this line
> >         di.di_u.delta[1] = cpu_to_le16(d1);
> > }
> >
> > When a compressed index has type NOHEAD, delta[0] stores d0(distance
> > to head block). But The 11th bit of d0 is also used as a flag bit to
> > indicate that d0 stores the pcluster size. This means d0 cannot exceed
> > Z_EROFS_VLE_DI_D0_CBLKCNT(2048), or else the parser will incorrectly
> > interpret d0 as pcluster size, rather than distance to head block.
> >     Is this an intentional design choice? It's not necessarily bad,
> > but it's something I think is worth documenting in code.
>
> Thanks for this great insight! Actually on-disk EROFS format doesn't
> have such limitation by design, since if it looks back to the delta0
> lcluster and it's still a NONHEAD lcluster, it will look back with
> new delta0 again until finding the final HEAD lcluster.
>
> But I'm not sure if mkfs code can handle > 8MiB lcluster properly yet,
> without modification since lcluster size is strictly limited with
> https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git/tree/include/erofs/compress.h#n14
> EROFS_CONFIG_COMPR_MAX_SZ * 2

Right, the current lcluster buffer is on stack, and has size
EROFS_CONFIG_COMPR_MAX_SZ*2.
I was working on a patch that moves lcluster buffer to heap and
increase it to way beyond 900KB.
With large pclusters it make sense to have a larger lcluster limit as
well, or else users wouldn't
be able to take full advantage of large pclusters.
Currently this fails during writing compressed indices. As
write_compacted_indexes
https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git/tree/lib/compress.c?h=experimental&id=564adb0a852b38a1790db20516862fc31bca314d#n313
would mistakenly interpret a large delta0 value as
Z_EROFS_VLE_DI_D0_CBLKCNT, resulting in
sanity check failure later on. Let me see if I can fix that... I
haven't  read the kernel code so I'm not
sure what the kernel is going to do with a large delta0 value which
happens to have the
Z_EROFS_VLE_DI_D0_CBLKCNT bit set.

>
> Yeah, I have to admit the current document might not be so detailed,
> partially due to my somewhat bad English written speed, and limited
> time...
>
> Thanks,
> Gao Xiang
>
> >
> >
> > --
> > Sincerely,
> >
> > Kelvin Zhang



-- 
Sincerely,

Kelvin Zhang

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Practical Limit on EROFS lcluster size
  2021-12-20 14:53 ` Gao Xiang
  2021-12-20 15:02   ` Kelvin Zhang via Linux-erofs
@ 2021-12-20 15:04   ` Gao Xiang
  2021-12-20 15:06     ` Kelvin Zhang via Linux-erofs
  1 sibling, 1 reply; 7+ messages in thread
From: Gao Xiang @ 2021-12-20 15:04 UTC (permalink / raw)
  To: Kelvin Zhang; +Cc: linux-erofs mailing list

On Mon, Dec 20, 2021 at 10:53:10PM +0800, Gao Xiang wrote:
> Hi Kelvin,
> 
> On Mon, Dec 20, 2021 at 08:45:42AM -0500, Kelvin Zhang wrote:
> > Hi Gao,
> >     I was playing with large pcluster sizes recently, I noticed a
> > quirk about EROFS. In summary, logical cluster size has a practical
> > limit of 8MB. Here's why:
> > 
> >    Looking at https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git/tree/lib/compress.c?h=experimental&id=564adb0a852b38a1790db20516862fc31bca314d#n92
> > , line 92, we see the following code:
> > 
> > if (d0 == 1 && erofs_sb_has_big_pcluster()) {
> >         type = Z_EROFS_VLE_CLUSTER_TYPE_NONHEAD;
> >         di.di_u.delta[0] = cpu_to_le16(ctx->compressedblks |
> >                 Z_EROFS_VLE_DI_D0_CBLKCNT); // This line
> >         di.di_u.delta[1] = cpu_to_le16(d1);
> > } else if (d0) {
> >         type = Z_EROFS_VLE_CLUSTER_TYPE_NONHEAD;
> > 
> >         di.di_u.delta[0] = cpu_to_le16(d0);  // and this line
> >         di.di_u.delta[1] = cpu_to_le16(d1);
> > }
> > 
> > When a compressed index has type NOHEAD, delta[0] stores d0(distance
> > to head block). But The 11th bit of d0 is also used as a flag bit to
> > indicate that d0 stores the pcluster size. This means d0 cannot exceed
> > Z_EROFS_VLE_DI_D0_CBLKCNT(2048), or else the parser will incorrectly
> > interpret d0 as pcluster size, rather than distance to head block.
> >     Is this an intentional design choice? It's not necessarily bad,
> > but it's something I think is worth documenting in code.
> 
> Thanks for this great insight! Actually on-disk EROFS format doesn't
> have such limitation by design, since if it looks back to the delta0
> lcluster and it's still a NONHEAD lcluster, it will look back with
> new delta0 again until finding the final HEAD lcluster.
> 
> But I'm not sure if mkfs code can handle > 8MiB lcluster properly yet,
> without modification since lcluster size is strictly limited with
> https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git/tree/include/erofs/compress.h#n14
> EROFS_CONFIG_COMPR_MAX_SZ * 2
> 
> Yeah, I have to admit the current document might not be so detailed,
> partially due to my somewhat bad English written speed, and limited
> time...

By the way, I'd like to correct some concepts here (sorry I didn't use
them correctly in my previous email).

A lcluster includes a HEAD or NONHEAD lcluster, currently only 4KiB
is supported.
A pcluster contains arbitrary compressed blocks, which can be
decompressed independently.
A compressed extent matches a pcluster, and several lclusters (maybe
partially).

So strictly speaking is "practical Limit on EROFS compressed extent
size"...

> 
> Thanks,
> Gao Xiang
> 
> > 
> > 
> > -- 
> > Sincerely,
> > 
> > Kelvin Zhang

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Practical Limit on EROFS lcluster size
  2021-12-20 15:04   ` Gao Xiang
@ 2021-12-20 15:06     ` Kelvin Zhang via Linux-erofs
  2021-12-20 15:20       ` Gao Xiang
  0 siblings, 1 reply; 7+ messages in thread
From: Kelvin Zhang via Linux-erofs @ 2021-12-20 15:06 UTC (permalink / raw)
  To: Gao Xiang; +Cc: linux-erofs mailing list

[-- Attachment #1: Type: text/plain, Size: 3190 bytes --]

Thanks… Another thing, I’m happy to help writing English documentation for
EROFS if you have a Chinese version.

On Mon, Dec 20, 2021 at 10:04 Gao Xiang <hsiangkao@linux.alibaba.com> wrote:

> On Mon, Dec 20, 2021 at 10:53:10PM +0800, Gao Xiang wrote:
> > Hi Kelvin,
> >
> > On Mon, Dec 20, 2021 at 08:45:42AM -0500, Kelvin Zhang wrote:
> > > Hi Gao,
> > >     I was playing with large pcluster sizes recently, I noticed a
> > > quirk about EROFS. In summary, logical cluster size has a practical
> > > limit of 8MB. Here's why:
> > >
> > >    Looking at
> https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git/tree/lib/compress.c?h=experimental&id=564adb0a852b38a1790db20516862fc31bca314d#n92
> > > , line 92, we see the following code:
> > >
> > > if (d0 == 1 && erofs_sb_has_big_pcluster()) {
> > >         type = Z_EROFS_VLE_CLUSTER_TYPE_NONHEAD;
> > >         di.di_u.delta[0] = cpu_to_le16(ctx->compressedblks |
> > >                 Z_EROFS_VLE_DI_D0_CBLKCNT); // This line
> > >         di.di_u.delta[1] = cpu_to_le16(d1);
> > > } else if (d0) {
> > >         type = Z_EROFS_VLE_CLUSTER_TYPE_NONHEAD;
> > >
> > >         di.di_u.delta[0] = cpu_to_le16(d0);  // and this line
> > >         di.di_u.delta[1] = cpu_to_le16(d1);
> > > }
> > >
> > > When a compressed index has type NOHEAD, delta[0] stores d0(distance
> > > to head block). But The 11th bit of d0 is also used as a flag bit to
> > > indicate that d0 stores the pcluster size. This means d0 cannot exceed
> > > Z_EROFS_VLE_DI_D0_CBLKCNT(2048), or else the parser will incorrectly
> > > interpret d0 as pcluster size, rather than distance to head block.
> > >     Is this an intentional design choice? It's not necessarily bad,
> > > but it's something I think is worth documenting in code.
> >
> > Thanks for this great insight! Actually on-disk EROFS format doesn't
> > have such limitation by design, since if it looks back to the delta0
> > lcluster and it's still a NONHEAD lcluster, it will look back with
> > new delta0 again until finding the final HEAD lcluster.
> >
> > But I'm not sure if mkfs code can handle > 8MiB lcluster properly yet,
> > without modification since lcluster size is strictly limited with
> >
> https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git/tree/include/erofs/compress.h#n14
> > EROFS_CONFIG_COMPR_MAX_SZ * 2
> >
> > Yeah, I have to admit the current document might not be so detailed,
> > partially due to my somewhat bad English written speed, and limited
> > time...
>
> By the way, I'd like to correct some concepts here (sorry I didn't use
> them correctly in my previous email).
>
> A lcluster includes a HEAD or NONHEAD lcluster, currently only 4KiB
> is supported.
> A pcluster contains arbitrary compressed blocks, which can be
> decompressed independently.
> A compressed extent matches a pcluster, and several lclusters (maybe
> partially).
>
> So strictly speaking is "practical Limit on EROFS compressed extent
> size"...
>
> >
> > Thanks,
> > Gao Xiang
> >
> > >
> > >
> > > --
> > > Sincerely,
> > >
> > > Kelvin Zhang
>
-- 
Sincerely,

Kelvin Zhang

[-- Attachment #2: Type: text/html, Size: 4518 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Practical Limit on EROFS lcluster size
  2021-12-20 15:02   ` Kelvin Zhang via Linux-erofs
@ 2021-12-20 15:09     ` Gao Xiang
  0 siblings, 0 replies; 7+ messages in thread
From: Gao Xiang @ 2021-12-20 15:09 UTC (permalink / raw)
  To: Kelvin Zhang; +Cc: linux-erofs mailing list

On Mon, Dec 20, 2021 at 10:02:13AM -0500, Kelvin Zhang wrote:
> On Mon, Dec 20, 2021 at 9:53 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> >
> > Hi Kelvin,
> >
> > On Mon, Dec 20, 2021 at 08:45:42AM -0500, Kelvin Zhang wrote:
> > > Hi Gao,
> > >     I was playing with large pcluster sizes recently, I noticed a
> > > quirk about EROFS. In summary, logical cluster size has a practical
> > > limit of 8MB. Here's why:
> > >
> > >    Looking at https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git/tree/lib/compress.c?h=experimental&id=564adb0a852b38a1790db20516862fc31bca314d#n92
> > > , line 92, we see the following code:
> > >
> > > if (d0 == 1 && erofs_sb_has_big_pcluster()) {
> > >         type = Z_EROFS_VLE_CLUSTER_TYPE_NONHEAD;
> > >         di.di_u.delta[0] = cpu_to_le16(ctx->compressedblks |
> > >                 Z_EROFS_VLE_DI_D0_CBLKCNT); // This line
> > >         di.di_u.delta[1] = cpu_to_le16(d1);
> > > } else if (d0) {
> > >         type = Z_EROFS_VLE_CLUSTER_TYPE_NONHEAD;
> > >
> > >         di.di_u.delta[0] = cpu_to_le16(d0);  // and this line
> > >         di.di_u.delta[1] = cpu_to_le16(d1);
> > > }
> > >
> > > When a compressed index has type NOHEAD, delta[0] stores d0(distance
> > > to head block). But The 11th bit of d0 is also used as a flag bit to
> > > indicate that d0 stores the pcluster size. This means d0 cannot exceed
> > > Z_EROFS_VLE_DI_D0_CBLKCNT(2048), or else the parser will incorrectly
> > > interpret d0 as pcluster size, rather than distance to head block.
> > >     Is this an intentional design choice? It's not necessarily bad,
> > > but it's something I think is worth documenting in code.
> >
> > Thanks for this great insight! Actually on-disk EROFS format doesn't
> > have such limitation by design, since if it looks back to the delta0
> > lcluster and it's still a NONHEAD lcluster, it will look back with
> > new delta0 again until finding the final HEAD lcluster.
> >
> > But I'm not sure if mkfs code can handle > 8MiB lcluster properly yet,
> > without modification since lcluster size is strictly limited with
> > https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git/tree/include/erofs/compress.h#n14
> > EROFS_CONFIG_COMPR_MAX_SZ * 2
> 
> Right, the current lcluster buffer is on stack, and has size
> EROFS_CONFIG_COMPR_MAX_SZ*2.
> I was working on a patch that moves lcluster buffer to heap and
> increase it to way beyond 900KB.
> With large pclusters it make sense to have a larger lcluster limit as
> well, or else users wouldn't
> be able to take full advantage of large pclusters.

Okay, make sense.

> Currently this fails during writing compressed indices. As
> write_compacted_indexes
> https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git/tree/lib/compress.c?h=experimental&id=564adb0a852b38a1790db20516862fc31bca314d#n313
> would mistakenly interpret a large delta0 value as
> Z_EROFS_VLE_DI_D0_CBLKCNT, resulting in
> sanity check failure later on. Let me see if I can fix that... I
> haven't  read the kernel code so I'm not
> sure what the kernel is going to do with a large delta0 value which
> happens to have the
> Z_EROFS_VLE_DI_D0_CBLKCNT bit set.

I think kernel lookback function works as expected, it just looks back
with delta0 hints until finding the HEAD lcluster to get the whole
extent information.. But yeah, we might need to actually test it if
possible.

(Although my current top priority stuff now is ztailpacking feature..)

Thanks,
Gao Xiang

> 
> >
> > Yeah, I have to admit the current document might not be so detailed,
> > partially due to my somewhat bad English written speed, and limited
> > time...
> >
> > Thanks,
> > Gao Xiang
> >
> > >
> > >
> > > --
> > > Sincerely,
> > >
> > > Kelvin Zhang
> 
> 
> 
> -- 
> Sincerely,
> 
> Kelvin Zhang

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Practical Limit on EROFS lcluster size
  2021-12-20 15:06     ` Kelvin Zhang via Linux-erofs
@ 2021-12-20 15:20       ` Gao Xiang
  0 siblings, 0 replies; 7+ messages in thread
From: Gao Xiang @ 2021-12-20 15:20 UTC (permalink / raw)
  To: Kelvin Zhang; +Cc: linux-erofs mailing list

On Mon, Dec 20, 2021 at 10:06:33AM -0500, Kelvin Zhang wrote:
> Thanks… Another thing, I’m happy to help writing English documentation for
> EROFS if you have a Chinese version.
>

:) I wrote several Chinese materials many years ago internally, especially
for somewhat complicated compact compressed indexes (which is a reduced
metadata to minimize metadata runtime overhead for better performance,
each lcluster only takes 2-byte metadata on average).

Finally I think we might need a website to document all of this, both in
English and Chinese, including the new container use cases -- Nydus
image service.

I plan to do these after ztailpacking, fscache, folio adaptions are all
finished... (and I'm happy to try any better maintaining approach to
make the overall EROFS solution better/useful for everyone...)

Thanks,
Gao Xiang

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2021-12-20 15:21 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-20 13:45 Practical Limit on EROFS lcluster size Kelvin Zhang via Linux-erofs
2021-12-20 14:53 ` Gao Xiang
2021-12-20 15:02   ` Kelvin Zhang via Linux-erofs
2021-12-20 15:09     ` Gao Xiang
2021-12-20 15:04   ` Gao Xiang
2021-12-20 15:06     ` Kelvin Zhang via Linux-erofs
2021-12-20 15:20       ` Gao Xiang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.