Subject: Re: [PATCH v2 00/10] erofs: add big pcluster compression support
From: Chao Yu
To: Gao Xiang, linux-erofs@lists.ozlabs.org, Chao Yu
Cc: LKML
Date: Sat, 3 Apr 2021 11:52:06 +0800
Message-ID: <18509211-374c-be19-bae7-f2ce852bfb15@kernel.org>
In-Reply-To: <20210401032954.20555-1-xiang@kernel.org>
References: <20210401032954.20555-1-xiang@kernel.org>

On 2021/4/1 11:29, Gao Xiang wrote:
> Hi folks,
>
> This is the formal version of EROFS big pcluster support, which means
> EROFS can compress data into more than 1 fs block after this patchset.
>
> {l,p}cluster are EROFS-specific concepts, standing for `logical cluster'
> and `physical cluster' respectively. A logical cluster is the basic unit
> of compress indexes in the file logical mapping, e.g. it can build compress
> indexes in 2 blocks rather than 1 block (currently only 1 block lcluster
> is supported). A physical cluster is a container of physical compressed
> blocks which contains compressed data, the size of which is a multiple
> of lclustersize.
>
> Different from previous thoughts, which had fixed-sized pclusterblks
> recorded in the on-disk compress index header, our on-disk design allows
> variable-sized pclusterblks now. The main reasons are:
>  - user data varies in compression ratio locally, so a fixed-sized
>    clustersize approach wastes space and causes extra read
>    amplification for high CR cases;
>
>  - inplace decompression needs zero padding to guarantee its safe margin,
>    but we don't want to pad more than 1 fs block for big pcluster;
>
>  - end users can now customize the pcluster size according to data type
>    since various pclustersize can exist in a file, for example, using
>    different pcluster size for executable code and one-shot data. Such a
>    design should be more flexible than many other public compression fses
>    (btw, each file in EROFS can have at most 2 algorithms at the same time
>    by using HEAD1/2, which will be formally added with LZMA support).
>
> In brief, EROFS can now compress from variable-sized input to
> variable-sized pcluster blocks, as illustrated below:
>
>   |<-_lcluster_->|________________________|<-_lcluster_->|
>   |____._________|_________ .. ___________|_______.______|
>         .                                        .
>          .                                     .
>           .__________________________________.
>           |______________| .. |______________|
>           |<-           pcluster           ->|
>
> The next step would be how to record the compressed block count in
> lclusters. In compress indexes, there are 2 concepts called HEAD and
> NONHEAD lclusters.
> The difference is that a HEAD lcluster starts a new pcluster in the
> lcluster, but a NONHEAD one does not. It's easy to understand that big
> pclusters have at least 2 blocks, thus at least 2 lclusters as well.
>
> Therefore, let the delta0 (distance to its HEAD lcluster) of the first
> NONHEAD compress index store the compressed block count with a special
> flag, as a new kind of compress index called CBLKCNT. It's also easy to
> know that its delta0 would otherwise always be 1, as illustrated below:
>    ________________________________________________________
>   |_HEAD_|_CBLKCNT_|_NONHEAD_|_..._|_NONHEAD_|_HEAD | HEAD |
>   |<------ a pcluster with CBLKCNT --------->|<-- -->|
>                                                  ^ a pcluster with 1
>
> If another HEAD follows a HEAD lcluster, there is no room to record
> CBLKCNT, but it's easy to know the size of such a pcluster will be 1.
>
> More implementation details about this and compact indexes are in the
> commit message.
>
> On the runtime performance side, the current EROFS test results are:
>  ________________________________________________________________
> |  file system  |    size   | seq read | rand read | rand9m read |
> |_______________|___________|_ MiB/s __|__ MiB/s __|___ MiB/s ___|
> |___erofs_4k____|_556879872_|_ 781.4 __|__ 55.3 ___|___ 25.3 ____|
> |___erofs_16k___|_452509696_|_ 864.8 __|_ 123.2 ___|___ 20.8 ____|
> |___erofs_32k___|_415223808_|_ 899.8 __|_ 105.8 _*_|___ 16.8 ____|
> |___erofs_64k___|_393814016_|_ 906.6 __|__ 66.6 _*_|___ 11.8 ____|
> |__squashfs_8k__|_556191744_|_  64.9 __|__ 19.3 ___|____ 9.1 ____|
> |__squashfs_16k_|_502661120_|_  98.9 __|__ 38.0 ___|____ 9.8 ____|
> |__squashfs_32k_|_458784768_|_ 115.4 __|__ 71.6 _*_|___ 10.0 ____|
> |_squashfs_128k_|_398204928_|_ 257.2 __|_ 253.8 _*_|___ 10.9 ____|
> |____ext4_4k____|____()_____|_ 786.6 __|__ 28.6 ___|___ 27.8 ____|
>
>
> * Squashfs grabs more page cache to keep all decompressed data with
>   grab_cache_page_nowait() than the normal requested readahead (see
>   squashfs_copy_cache and squashfs_readpage_block).
>   In principle, EROFS can also cache all such decompressed data
>   if necessary, yet it's low priority for now and has little use
>   (rand9m is actually a better rand read workload, since the amount
>   of I/O is 9m rather than full-sized 1000m).
>
> More details are in
> https://lore.kernel.org/r/20210329053654.GA3281654@xiangao.remote.csb
>
> Also it's easy to know EROFS is not a fixed pcluster design, so users
> can apply several optimized strategies according to data type at mkfs
> time. And there is still room to optimize runtime performance for big
> pcluster even further.
>
> Finally, it passes ro_fsstress and can also successfully boot buildroot
> & Android system with android-mainline repo.
>
> current mkfs repo for big pcluster:
> https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git -b experimental-bigpcluster-compact
>
> Thanks for your time on reading this!

Nice job!

Acked-by: Chao Yu

Thanks,

>
> Thanks,
> Gao Xiang
>
> changes since v1:
>  - add a missing vunmap in erofs_pcpubuf_exit();
>  - refine comments and commit messages.
>
> (btw, I'll apply this patchset for -next first for further integration
>  test, which will be aimed at 5.13-rc1.)
>
> Gao Xiang (10):
>   erofs: reserve physical_clusterbits[]
>   erofs: introduce multipage per-CPU buffers
>   erofs: introduce physical cluster slab pools
>   erofs: fix up inplace I/O pointer for big pcluster
>   erofs: add big physical cluster definition
>   erofs: adjust per-CPU buffers according to max_pclusterblks
>   erofs: support parsing big pcluster compress indexes
>   erofs: support parsing big pcluster compact indexes
>   erofs: support decompress big pcluster for lz4 backend
>   erofs: enable big pcluster feature
>
>  fs/erofs/Kconfig        |  14 ---
>  fs/erofs/Makefile       |   2 +-
>  fs/erofs/decompressor.c | 216 +++++++++++++++++++++++++---------------
>  fs/erofs/erofs_fs.h     |  31 ++++--
>  fs/erofs/internal.h     |  31 ++----
>  fs/erofs/pcpubuf.c      | 134 +++++++++++++++++++++++++
>  fs/erofs/super.c        |   1 +
>  fs/erofs/utils.c        |  12 ---
>  fs/erofs/zdata.c        | 193 ++++++++++++++++++++++-------------
>  fs/erofs/zdata.h        |  14 +--
>  fs/erofs/zmap.c         | 155 ++++++++++++++++++++++------
>  11 files changed, 560 insertions(+), 243 deletions(-)
>  create mode 100644 fs/erofs/pcpubuf.c
>
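
For readers new to the index layout, the HEAD / CBLKCNT / NONHEAD rule
described in the cover letter can be sketched roughly as below. This is only
an illustration in Python, not code from the patches: the `Index` class, the
string labels and `pcluster_blocks()` are hypothetical names, and the model
ignores compact indexes and the real on-disk encoding. It assumes that the
first non-HEAD entry after a HEAD is always the CBLKCNT entry, and that a
HEAD directly followed by another HEAD means a 1-block pcluster.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical labels for this sketch; the on-disk format encodes these differently.
HEAD, CBLKCNT, NONHEAD = "HEAD", "CBLKCNT", "NONHEAD"


@dataclass
class Index:
    kind: str       # HEAD, CBLKCNT or NONHEAD
    value: int = 0  # CBLKCNT: compressed block count; NONHEAD: delta0


def pcluster_blocks(indexes: List[Index]) -> List[Tuple[int, int]]:
    """Return (HEAD lcluster number, compressed block count) per pcluster."""
    result = []
    i = 0
    while i < len(indexes):
        if indexes[i].kind != HEAD:
            raise ValueError(f"lcluster {i}: expected HEAD")
        head, blocks = i, 1  # no CBLKCNT slot available: pcluster is 1 block
        i += 1
        if i < len(indexes) and indexes[i].kind == CBLKCNT:
            # This slot would otherwise hold delta0 == 1, so it carries the count.
            blocks = indexes[i].value
            i += 1
            # Remaining NONHEAD entries store their distance to the HEAD lcluster.
            while i < len(indexes) and indexes[i].kind == NONHEAD:
                if indexes[i].value != i - head:
                    raise ValueError(f"lcluster {i}: bad delta0")
                i += 1
        result.append((head, blocks))
    return result


# The layout from the second diagram, with an assumed count of 2 blocks.
layout = [Index(HEAD), Index(CBLKCNT, 2), Index(NONHEAD, 2),
          Index(NONHEAD, 3), Index(HEAD), Index(HEAD)]
print(pcluster_blocks(layout))  # [(0, 2), (4, 1), (5, 1)]
```

The point of the sketch is that no extra field is needed: the slot whose
delta0 is implied by its position is reused for the block count.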