From: Gao Xiang <xiang@kernel.org>
To: linux-erofs@lists.ozlabs.org, Chao Yu <yuchao0@huawei.com>,
	Chao Yu <chao@kernel.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v2 00/10] erofs: add big pcluster compression support
Date: Sat, 3 Apr 2021 11:17:56 +0800	[thread overview]
Message-ID: <20210403031755.GA16255@hsiangkao-HP-ZHAN-66-Pro-G1> (raw)
In-Reply-To: <20210401032954.20555-1-xiang@kernel.org>

On Thu, Apr 01, 2021 at 11:29:44AM +0800, Gao Xiang wrote:
> Hi folks,
> 
> This is the formal version of EROFS big pcluster support, which means
> EROFS can compress data into more than 1 fs block after this patchset.
> 
> {l,p}cluster are EROFS-specific concepts, standing for `logical cluster'
> and `physical cluster' respectively. A logical cluster is the basic unit
> of compress indexes in a file's logical mapping; e.g., compress indexes
> could be built on 2-block lclusters rather than 1-block ones (currently
> only 1-block lclusters are supported). A physical cluster is a container
> of physical compressed blocks holding the compressed data, and its size
> is a multiple of the lcluster size.
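> 
> For concreteness, each (full) compress index describes exactly one
> lcluster; roughly, its on-disk layout looks like the sketch below
> (simplified from erofs_fs.h, so see the real header for the
> authoritative definition):
> 
>   struct z_erofs_vle_decompressed_index {
>           __le16 di_advise;
>           /* where to decompress in the HEAD lcluster */
>           __le16 di_clusterofs;
>           union {
>                   /* for HEAD lclusters: blkaddr of the pcluster */
>                   __le32 blkaddr;
>                   /*
>                    * for NONHEAD lclusters:
>                    *  delta[0] - distance to its HEAD lcluster
>                    *  delta[1] - distance to the next HEAD lcluster
>                    */
>                   __le16 delta[2];
>           } di_u;
>   };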
> 
> Different from previous thoughts, which had a fixed-sized pclusterblks
> recorded in the on-disk compress index header, our on-disk design now
> allows variable-sized pclusterblks. The main reasons are:
>  - user data varies in compression ratio locally, so a fixed-sized
>    pcluster approach wastes space and causes extra read amplification
>    for high-CR cases;
> 
>  - inplace decompression needs zero padding to guarantee its safe margin,
>    but we don't want to pad more than 1 fs block for big pcluster;
> 
>  - end users can now customize the pcluster size according to data type,
>    since various pcluster sizes can coexist in a file; for example, they
>    can use different pcluster sizes for executable code and one-shot
>    data. Such a design should be more flexible than many other public
>    compression fses.
>    (Btw, each file in EROFS can use a maximum of 2 algorithms at the
>    same time by means of HEAD1/2, which will be formally added with
>    LZMA support.)
> 
> In brief, EROFS can now compress from variable-sized input to
> variable-sized pcluster blocks, as illustrated below:
> 
>   |<-_lcluster_->|________________________|<-_lcluster_->|
>   |____._________|_________ .. ___________|_______.______|
>         .                                        .
>          .                                     .
>           .__________________________________.
>           |______________| .. |______________|
>           |<-          pcluster            ->|
> 
> The next step is how to record the compressed block count in lclusters.
> In compress indexes, there are 2 concepts called HEAD and NONHEAD
> lclusters. The difference is that a HEAD lcluster starts a new pcluster,
> while a NONHEAD one doesn't. It's easy to see that a big pcluster
> consists of at least 2 blocks, and thus covers at least 2 lclusters as
> well.
> 
> Therefore, let the delta0 (distance to its HEAD lcluster) of the first
> NONHEAD compress index store the compressed block count together with a
> special flag, forming a new type of compress index called CBLKCNT. This
> is possible since such an index's delta0 would otherwise constantly be
> 1, as illustrated below:
>   ________________________________________________________
>  |_HEAD_|_CBLKCNT_|_NONHEAD_|_..._|_NONHEAD_|_HEAD | HEAD |
>     |<------ a pcluster with CBLKCNT --------->|<-- -->|
>                                                    ^ a pcluster with 1 block
> 
> If another HEAD lcluster immediately follows a HEAD lcluster, there is
> no room to record CBLKCNT, but it's easy to know that the size of such
> a pcluster is 1 block.
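> 
> A minimal decoding sketch (the flag name and bit below follow this
> series; the helper itself is hypothetical, just for illustration):
> 
>   /* special flag in delta[0] marking a CBLKCNT compress index */
>   #define Z_EROFS_VLE_DI_D0_CBLKCNT	(1 << 11)
> 
>   /* decode the pcluster size (in blocks) from the 1st NONHEAD delta[0] */
>   static unsigned int z_erofs_pclusterblks(__le16 delta0_le)
>   {
>           unsigned int delta0 = le16_to_cpu(delta0_le);
> 
>           if (delta0 & Z_EROFS_VLE_DI_D0_CBLKCNT)
>                   return delta0 & ~Z_EROFS_VLE_DI_D0_CBLKCNT;
>           return 1;	/* plain NONHEAD index: a 1-block pcluster */
>   }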
> 
> More implementation details about this and the compact indexes can be
> found in the corresponding commit messages.
> 
> On the runtime performance side, the current EROFS test results are:
>  ________________________________________________________________
> |  file system  |   size    | seq read | rand read | rand9m read |
> |_______________|__(bytes)__|_ MiB/s __|__ MiB/s __|___ MiB/s ___|
> |___erofs_4k____|_556879872_|_ 781.4 __|__ 55.3 ___|___ 25.3  ___|
> |___erofs_16k___|_452509696_|_ 864.8 __|_ 123.2 ___|___ 20.8  ___|
> |___erofs_32k___|_415223808_|_ 899.8 __|_ 105.8 _*_|___ 16.8 ____|
> |___erofs_64k___|_393814016_|_ 906.6 __|__ 66.6 _*_|___ 11.8 ____|
> |__squashfs_8k__|_556191744_|_  64.9 __|__ 19.3 ___|____ 9.1 ____|
> |__squashfs_16k_|_502661120_|_  98.9 __|__ 38.0 ___|____ 9.8 ____|
> |__squashfs_32k_|_458784768_|_ 115.4 __|__ 71.6 _*_|___ 10.0 ____|
> |_squashfs_128k_|_398204928_|_ 257.2 __|_ 253.8 _*_|___ 10.9 ____|
> |____ext4_4k____|____()_____|_ 786.6 __|__ 28.6 ___|___ 27.8 ____|
> 
> 
> * Squashfs uses grab_cache_page_nowait() to keep all decompressed data
>   in the page cache, beyond the normally requested readahead window (see
>   squashfs_copy_cache and squashfs_readpage_block).
>   In principle, EROFS could also cache all such decompressed data if
>   necessary, yet it's low priority for now and of little use (rand9m is
>   actually a better rand read workload, since the total amount of I/O
>   is 9m rather than the full-sized 1000m).
> 
> More details are in
> https://lore.kernel.org/r/20210329053654.GA3281654@xiangao.remote.csb
> 
> Also, since EROFS is not a fixed-pcluster design, users can apply
> different optimization strategies according to data type at mkfs time.
> And there is still room to optimize big pcluster runtime performance
> even further.
> 
> Finally, it passes ro_fsstress and also successfully boots buildroot
> & Android systems with the android-mainline repo.
> 
> current mkfs repo for big pcluster:
> https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git -b experimental-bigpcluster-compact
> 
> Thanks for your time on reading this!
> 
> Thanks,
> Gao Xiang
> 
> changes since v1:
>  - add a missing vunmap in erofs_pcpubuf_exit();
>  - refine comments and commit messages.
> 
>  (btw, I'll apply this patchset to -next first for further integration
>   testing, aiming at 5.13-rc1.)
>

As a quick update, I've applied the following changes to the next
version (some minor fixes):

diff --git a/fs/erofs/decompressor.c b/fs/erofs/decompressor.c
index c7b1d3fe8184..fe46a9c34923 100644
--- a/fs/erofs/decompressor.c
+++ b/fs/erofs/decompressor.c
@@ -43,11 +43,15 @@ int z_erofs_load_lz4_config(struct super_block *sb,
 		distance = le16_to_cpu(lz4->max_distance);
 
 		sbi->lz4.max_pclusterblks = le16_to_cpu(lz4->max_pclusterblks);
-		if (sbi->lz4.max_pclusterblks >
-		    Z_EROFS_PCLUSTER_MAX_SIZE / EROFS_BLKSIZ) {
-			erofs_err(sb, "too large lz4 pcluster blocks %u",
+		if (!sbi->lz4.max_pclusterblks) {
+			sbi->lz4.max_pclusterblks = 1;	/* reserved case */
+		} else if (sbi->lz4.max_pclusterblks >
+			   Z_EROFS_PCLUSTER_MAX_SIZE / EROFS_BLKSIZ) {
+			erofs_err(sb, "too large lz4 pclusterblks %u",
 				  sbi->lz4.max_pclusterblks);
 			return -EINVAL;
+		} else if (sbi->lz4.max_pclusterblks >= 2) {
+			erofs_info(sb, "EXPERIMENTAL big pcluster feature in use. Use at your own risk!");
 		}
 	} else {
 		distance = le16_to_cpu(dsb->u1.lz4_max_distance);
diff --git a/fs/erofs/zmap.c b/fs/erofs/zmap.c
index 545cd5989e6a..6fc8e7fdaef8 100644
--- a/fs/erofs/zmap.c
+++ b/fs/erofs/zmap.c
@@ -77,12 +77,21 @@ static int z_erofs_fill_inode_lazy(struct inode *inode)
 	}
 
 	vi->z_logical_clusterbits = LOG_BLOCK_SIZE + (h->h_clusterbits & 7);
+	if (!erofs_sb_has_big_pcluster(EROFS_SB(sb)) &&
+	    vi->z_advise & (Z_EROFS_ADVISE_BIG_PCLUSTER_1 |
+			    Z_EROFS_ADVISE_BIG_PCLUSTER_2)) {
+		erofs_err(sb, "per-inode big pcluster without sb feature for nid %llu",
+			  vi->nid);
+		err = -EFSCORRUPTED;
+		goto unmap_done;
+	}
 	if (vi->datalayout == EROFS_INODE_FLAT_COMPRESSION &&
 	    !(vi->z_advise & Z_EROFS_ADVISE_BIG_PCLUSTER_1) ^
 	    !(vi->z_advise & Z_EROFS_ADVISE_BIG_PCLUSTER_2)) {
 		erofs_err(sb, "big pcluster head1/2 of compact indexes should be consistent for nid %llu",
 			  vi->nid);
-		return -EFSCORRUPTED;
+		err = -EFSCORRUPTED;
+		goto unmap_done;
 	}
 	/* paired with smp_mb() at the beginning of the function */
 	smp_mb();

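
For reference, the max_pclusterblks field checked in the first hunk
comes from the per-algorithm LZ4 configuration introduced earlier in
this series; its on-disk layout is roughly the following (a simplified
sketch, see erofs_fs.h in the series for the authoritative definition):

	/* 14 bytes (+ 2-byte length field = 16 bytes in total) */
	struct z_erofs_lz4_cfgs {
		__le16 max_distance;
		__le16 max_pclusterblks;	/* max pcluster size in blocks */
		u8 reserved[10];
	} __packed;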
