linux-kernel.vger.kernel.org archive mirror
From: Richard Weinberger <richard.weinberger@gmail.com>
To: Gao Xiang <gaoxiang25@huawei.com>
Cc: LKML <linux-kernel@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	miaoxie@huawei.com, yuchao0@huawei.com, sunqiuyang@huawei.com,
	fangwei1@huawei.com, liguifu2@huawei.com, weidu.du@huawei.com,
	chen.chun.yen@huawei.com, brooke.wangzhigang@hisilicon.com,
	dongjinguang@huawei.com
Subject: Re: [NOMERGE] [RFC PATCH 00/12] erofs: introduce erofs file system
Date: Fri, 1 Jun 2018 09:48:12 +0200
Message-ID: <CAFLxGvw5PXBLKuaaK5xipiwTOXohtdeenD0XQHgX6+r4rS=GqQ@mail.gmail.com>
In-Reply-To: <1527764767-22190-1-git-send-email-gaoxiang25@huawei.com>

On Thu, May 31, 2018 at 1:06 PM, Gao Xiang <gaoxiang25@huawei.com> wrote:
> Hi all,
>
> Read-only file systems are used in many cases, such as read-only storage media.
> We are now focusing on Android devices, on which several read-only partitions exist.
> Because existing read-only solutions are limited, a new read-only file system, EROFS
> (Extendable Read-Only File System), is introduced.

In which sense is it extendable?

> As in other read-only file systems, several meta regions found in generic file systems,
> such as the free space bitmap, are omitted. The difference is that EROFS focuses
> more on performance than purely on saving as much storage space as possible.
>
> Furthermore, we also add compression support, called z_erofs.
>
> Traditional file systems with compression support use fixed-sized input
> compression, so the compressed output units can have arbitrary lengths.
> However, data on block devices is accessed in units of blocks, which means
> (A) if the accessed compressed data is not buffered, some data read from
> the physical block cannot be further utilized, as illustrated below:
>
>    ++-----------++-----------++         ++-----------++-----------++
> ...||           ||           ||   ...   ||           ||           ||  ... original data
>    ++-----------++-----------++         ++-----------++-----------++
>     \                         /          \                         /
>        \                   /                \                    /
>           \             /                      \               /
>       ++---|-------++--|--------++       ++-----|----++--------|--++
>       ||xxx|       ||  |xxxxxxxx||  ...  ||xxxxx|    ||        |xx||  compressed data
>       ++---|-------++--|--------++       ++-----|----++--------|--++
>
> The shaded regions (x) are read from the block device but cannot be used for decompression.
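
The read amplification in case (A) can be put into numbers. The following is a minimal Python sketch, not code from the patch set; the 4096-byte block size and the helper name `wasted_bytes` are assumptions made for illustration. It computes how many bytes have to be read from the block device to fetch one compressed unit, and how many of them belong to neighbouring units (the x regions in the diagram).

```python
BLOCK_SIZE = 4096  # assumed block size in bytes


def wasted_bytes(offset: int, length: int, block_size: int = BLOCK_SIZE) -> int:
    """Bytes read from disk that do not belong to the requested compressed unit.

    offset and length describe where one compressed unit sits on the device.
    """
    first_block = offset // block_size
    last_block = (offset + length - 1) // block_size
    bytes_read = (last_block - first_block + 1) * block_size
    return bytes_read - length


# A 2500-byte unit starting at byte 3000 straddles two blocks.
print(wasted_bytes(3000, 2500))  # 8192 bytes read, 5692 of them unusable
```
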
>
> (B) If the compressed data is also buffered, the memory overhead increases.
> Because the data is compressed, it cannot be used directly, and we don't know
> when the corresponding compressed blocks will be accessed, which is not friendly to
> random reads.
>
> In order to reduce the proportion of data that cannot be directly decompressed,
> larger compressed sizes tend to be selected, which is also not friendly to
> random reads.
>
> EROFS implements compression with a different approach, the details of which will
> be discussed in the next section.
>
> In brief, the following points summarize our design at a high level:
>
> 1) Use page-sized blocks so that there are no buffer heads.
>
> 2) By introducing a more general inline data / xattr, metadata and small data have
> the opportunity to be read with the inode metadata at the same time.
>
> 3) Introduce another shared xattr region in order to store the common xattrs (e.g.
> SELinux labels) or xattrs too large to be stored inline with the metadata.
>
> 4) Metadata and data could be mixed by design, so it could be more flexible for mkfs
> to organize files and data.
>
> 5) Instead of using fixed-sized input compression, we put forward a new fixed-sized
> output compression to make full use of IO (which means all data from IO can be
> decompressed), reduce read amplification, improve random reads and keep
> compression ratios relatively low, as illustrated below:
>
>
>         |---- variable-length extent ---|------ VLE ------|---  VLE ---|
>          /> clusterofs                  /> clusterofs     /> clusterofs /> clusterofs
>    ++---|-------++-----------++---------|-++-----------++-|---------++-|
> ...||   |       ||           ||         | ||           || |         || | ... original data
>    ++---|-------++-----------++---------|-++-----------++-|---------++-|
>    ++->cluster<-++->cluster<-++->cluster<-++->cluster<-++->cluster<-++
>         size         size         size         size         size
>          \                             /                 /            /
>           \                      /              /            /
>            \               /            /            /
>             ++-----------++-----------++-----------++
>         ... ||           ||           ||           || ... compressed clusters
>             ++-----------++-----------++-----------++
>             ++->cluster<-++->cluster<-++->cluster<-++
>                  size         size         size
>
>    A cluster could contain more than one block by design, but currently we only have the
> page-sized cluster implementation (page-sized fixed output compression can also achieve
> a better compression ratio than fixed input compression).
>
>    All compressed clusters have a fixed size but could be decompressed into extents with
> arbitrary lengths.
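
The idea of fixed-sized output compression can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the algorithm used by the EROFS tools or kernel code: zlib stands in for LZ4, the 4096-byte cluster size is assumed, and the helper names `longest_fitting_prefix` and `split_fixed_output` are hypothetical. The binary search assumes the compressed size grows roughly monotonically with the input length, which zlib does not strictly guarantee; the returned prefix is nevertheless always one that was verified to fit. Padding each compressed extent up to a full cluster is omitted.

```python
import zlib

CLUSTER_SIZE = 4096  # assumed compressed cluster size in bytes


def longest_fitting_prefix(data: bytes, pos: int) -> int:
    """Length of the longest run starting at pos whose zlib output fits one cluster."""
    lo, hi = 1, len(data) - pos  # a single byte always fits into a cluster
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if len(zlib.compress(data[pos:pos + mid])) <= CLUSTER_SIZE:
            lo = mid          # verified to fit, try a longer extent
        else:
            hi = mid - 1      # too large, try a shorter extent
    return lo


def split_fixed_output(data: bytes):
    """Split data into variable-length extents, each compressing into one cluster.

    Returns a list of (start, length, compressed_size) tuples.
    """
    extents = []
    pos = 0
    while pos < len(data):
        length = longest_fitting_prefix(data, pos)
        csize = len(zlib.compress(data[pos:pos + length]))
        extents.append((pos, length, csize))
        pos += length
    return extents


sample = b"erofs " * 20000
for start, length, csize in split_fixed_output(sample):
    print(start, length, csize)
```

Every extent has a different uncompressed length, while each compressed result occupies at most one cluster, so a read of one cluster never drags in bytes that belong to another extent.
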
>
>    In addition, if a buffered IO reads the following shaded region (x), we could make a more
>    customized path (to replace generic_file_buffered_read) which only reads one compressed
>    cluster and makes the partial page available.
>          /> clusterofs
>    ++---|-------++
> ...||   | xxxx  || ...
>    ||---|-------||
>
> Some numbers using fixed output compression (VLE, cluster size = block size = 4k) on
> the server and Android phone (kirin970 platform):
>
> Server (magnetic disk):
>
> compression  EROFS seq read  EXT4 seq read        EROFS random read  EXT4 random read
> ratio           bw[MB/s]       bw[MB/s]             bw[MB/s] (20%)    bw[MB/s] (20%)
>
>   4              480.3          502.5                   69.8               11.1
>  10              472.3          503.3                   56.4               10.0
>  15              457.6          495.3                   47.0               10.9
>  26              401.5          511.2                   34.7               11.1
>  35              389.1          512.5                   28.0               11.0
>  48              375.4          496.5                   23.2               10.6
>  53              370.2          512.0                   21.8               11.0
>  66              349.2          512.0                   19.0               11.4
>  76              310.5          497.3                   17.3               11.6
>  85              301.2          512.0                   16.0               11.0
>  94              292.7          496.5                   14.6               11.1
> 100              538.9          512.0                   11.4               10.8
>
> Kirin970 (A73 Big-core 2361MHz, A53 little-core 0MHz, DDR 1866MHz):

What storage was used? An eMMC?

> compression  EROFS seq read  EXT4 seq read        EROFS random read  EXT4 random read
> ratio           bw[MB/s]       bw[MB/s]             bw[MB/s] (20%)    bw[MB/s] (20%)
>
>   4              546.7          544.3                    157.7              57.9
>  10              535.7          521.0                    152.7              62.0
>  15              529.0          520.3                    125.0              65.0
>  26              418.0          526.3                     97.6              63.7
>  35              367.7          511.7                     89.0              63.7
>  48              415.7          500.7                     78.2              61.2
>  53              423.0          566.7                     72.8              62.9
>  66              334.3          537.3                     69.8              58.3
>  76              387.3          546.0                     65.2              56.0
>  85              306.3          546.0                     63.8              57.7
>  94              345.0          589.7                     59.2              49.9
> 100              579.7          556.7                     62.1              57.7

How does it compare to existing read-only filesystems, such as squashfs?

-- 
Thanks,
//richard

Thread overview: 58+ messages
2018-05-31 11:06 [NOMERGE] [RFC PATCH 00/12] erofs: introduce erofs file system Gao Xiang
2018-06-01  7:48 ` Richard Weinberger [this message]
2018-06-01  9:11   ` Gao Xiang
2018-06-01  9:28     ` Richard Weinberger
2018-06-01 11:16       ` Gao Xiang
2018-06-07 10:26         ` Pavel Machek
2018-07-27  0:55       ` Joey Pabalinas
2018-07-27  0:57         ` Joey Pabalinas
2018-07-26 12:21 ` [PATCH 00/25] staging: " Gao Xiang
2018-07-26 12:21   ` [PATCH 01/25] staging: erofs: add on-disk layout Gao Xiang
2018-07-26 12:21   ` [PATCH 02/25] staging: erofs: add erofs in-memory stuffs Gao Xiang
2018-07-26 12:21   ` [PATCH 03/25] staging: erofs: add super block operations Gao Xiang
2018-07-26 12:21   ` [PATCH 04/25] staging: erofs: add raw address_space operations Gao Xiang
2018-07-26 12:21   ` [PATCH 05/25] staging: erofs: add inode operations Gao Xiang
2018-07-26 12:21   ` [PATCH 06/25] staging: erofs: add directory operations Gao Xiang
2018-07-26 12:21   ` [PATCH 07/25] staging: erofs: add namei functions Gao Xiang
2018-07-26 12:21   ` [PATCH 08/25] staging: erofs: update Kconfig and Makefile Gao Xiang
2018-07-26 12:21   ` [PATCH 09/25] staging: erofs: introduce xattr & acl support Gao Xiang
2018-07-26 12:21   ` [PATCH 10/25] staging: erofs: support special inode Gao Xiang
2018-07-26 12:21   ` [PATCH 11/25] staging: erofs: introduce error injection infrastructure Gao Xiang
2018-07-26 12:21   ` [PATCH 12/25] staging: erofs: support tracepoint Gao Xiang
2018-07-26 12:21   ` [PATCH 13/25] staging: erofs: <linux/tagptr.h>: introduce tagged pointer Gao Xiang
2018-07-26 12:21   ` [PATCH 14/25] staging: erofs: introduce pagevec for unzip subsystem Gao Xiang
2018-07-26 12:21   ` [PATCH 15/25] staging: erofs: add erofs_map_blocks_iter Gao Xiang
2018-07-26 12:21   ` [PATCH 16/25] staging: erofs: add erofs_allocpage Gao Xiang
2018-07-26 12:22   ` [PATCH 17/25] staging: erofs: globalize prepare_bio and __submit_bio Gao Xiang
2018-07-26 12:22   ` [PATCH 18/25] staging: erofs: introduce a customized LZ4 decompression Gao Xiang
2018-07-26 12:22   ` [PATCH 19/25] staging: erofs: add a generic z_erofs VLE decompressor Gao Xiang
2018-07-26 12:22   ` [PATCH 20/25] staging: erofs: introduce superblock registration Gao Xiang
2018-07-26 12:22   ` [PATCH 21/25] staging: erofs: introduce erofs shrinker Gao Xiang
2018-07-26 12:22   ` [PATCH 22/25] staging: erofs: introduce workstation for decompression Gao Xiang
2018-07-26 12:22   ` [PATCH 23/25] staging: erofs: introduce VLE decompression support Gao Xiang
2018-07-26 12:22   ` [PATCH 24/25] staging: erofs: introduce cached decompression Gao Xiang
2018-07-26 12:22   ` [PATCH 25/25] staging: erofs: add a TODO and update MAINTAINERS for staging Gao Xiang
2018-07-28  7:10     ` [PATCH] staging: erofs: fix a compile warning of Z_EROFS_VLE_VMAP_ONSTACK_PAGES Gao Xiang
2018-07-28 10:43       ` Chao Yu
2018-07-29  5:34       ` [PATCH 1/2] staging: erofs: fix compile error without built-in decompression support Gao Xiang
2018-07-29  5:37         ` [PATCH 2/2] staging: erofs: fix conditional uninitialized `pcn' in z_erofs_map_blocks_iter Gao Xiang
2018-07-30  1:51           ` [PATCH] staging: erofs: use the wrapped PTR_ERR_OR_ZERO instead of open code Gao Xiang
2018-07-30  6:58             ` Chao Yu
2018-08-01  6:38             ` [PATCH 1/2] staging: erofs: add the missing break in z_erofs_map_blocks_iter Gao Xiang
2018-08-01  6:38               ` [PATCH 2/2] staging: erofs: remove a redundant marco in xattr Gao Xiang
2018-08-01  9:02               ` [PATCH 1/2] staging: erofs: add the missing break in z_erofs_map_blocks_iter Dan Carpenter
2018-08-01  9:19                 ` Gao Xiang
2018-08-01  9:36                   ` [PATCH RESEND " Gao Xiang
2018-08-01 11:36                     ` Dan Carpenter
2018-08-01 12:08                       ` Gao Xiang
2018-07-30  2:07           ` [PATCH 2/2] staging: erofs: fix conditional uninitialized `pcn' " Chao Yu
2018-07-30  2:07         ` [PATCH 1/2] staging: erofs: fix compile error without built-in decompression support Chao Yu
2018-07-30  2:32           ` Gao Xiang
2018-07-30  3:07             ` Chao Yu
2018-07-30  3:55               ` Gao Xiang
2018-07-27  0:25   ` [PATCH 00/25] staging: erofs: introduce erofs file system Christian Kujau
2018-07-27  1:39     ` Gao Xiang
2018-07-27  1:56       ` Gao Xiang
2018-07-28  7:25   ` Greg Kroah-Hartman
2018-07-28  9:33     ` Gao Xiang
2018-07-28 10:34     ` Chao Yu
