linux-kernel.vger.kernel.org archive mirror
From: Gao Xiang <gaoxiang25@huawei.com>
To: Richard Weinberger <richard.weinberger@gmail.com>
Cc: LKML <linux-kernel@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	<miaoxie@huawei.com>, <yuchao0@huawei.com>,
	<sunqiuyang@huawei.com>, <fangwei1@huawei.com>,
	<liguifu2@huawei.com>, <weidu.du@huawei.com>,
	<chen.chun.yen@huawei.com>, <brooke.wangzhigang@hisilicon.com>,
	<dongjinguang@huawei.com>
Subject: Re: [NOMERGE] [RFC PATCH 00/12] erofs: introduce erofs file system
Date: Fri, 1 Jun 2018 17:11:21 +0800
Message-ID: <aae078b7-c364-106a-dd56-91abc945d7bb@huawei.com>
In-Reply-To: <CAFLxGvw5PXBLKuaaK5xipiwTOXohtdeenD0XQHgX6+r4rS=GqQ@mail.gmail.com>

Hi Richard,

On 2018/6/1 15:48, Richard Weinberger wrote:
> On Thu, May 31, 2018 at 1:06 PM, Gao Xiang <gaoxiang25@huawei.com> wrote:
>> Hi all,
>>
>> Read-only file systems are used in many cases, such as read-only storage media.
>> We are now focusing on Android devices, on which several read-only partitions exist.
>> Due to the limited choice of read-only solutions, a new read-only file system, EROFS
>> (Extendable Read-Only File System), is introduced.
> 
> In which sense is it extendable?

Actually, what we want to emphasize is an enhanced read-only file system: not just
read-only, but with a scalable on-disk layout, compression, or fs-verify in the future.

We also considered other candidate full names, such as
Enhanced / Extended Read-Only File System; any name that abbreviates to "erofs" is okay.

> 
>> As in other read-only file systems, several meta regions found in generic file systems,
>> such as the free space bitmap, are omitted. The difference is that EROFS focuses
>> more on performance than purely on saving as much storage space as possible.
>>
>> Furthermore, we also add the compression support called z_erofs.
>>
>> Traditional file systems with compression support use fixed-sized input
>> compression, so the compressed output units can have arbitrary lengths.
>> However, block devices are accessed in units of blocks, which means
>> (A) if the accessed compressed data is not buffered, some data read from
>> the physical block cannot be further utilized, as illustrated below:
>>
>>    ++-----------++-----------++         ++-----------++-----------++
>> ...||           ||           ||   ...   ||           ||           ||  ... original data
>>    ++-----------++-----------++         ++-----------++-----------++
>>     \                         /          \                         /
>>        \                   /                \                    /
>>           \             /                      \               /
>>       ++---|-------++--|--------++       ++-----|----++--------|--++
>>       ||xxx|       ||  |xxxxxxxx||  ...  ||xxxxx|    ||        |xx||  compressed data
>>       ++---|-------++--|--------++       ++-----|----++--------|--++
>>
>> The shaded regions (x) are read from the block device but cannot be used for decompression.
>>
>> (B) If the compressed data is also buffered, the memory overhead increases.
>> Because the data is compressed, it cannot be used directly, and we don't know
>> when the corresponding compressed blocks will be accessed, which is unfriendly to
>> random reads.
>>
>> To reduce the proportion of data that cannot be directly decompressed,
>> larger compressed sizes tend to be selected, which is also unfriendly to
>> random reads.
>>
>> Erofs implements the compression in a different approach, the details of which will
>> be discussed in the next section.
>>
>> In brief, the following points summarize our design at a high level:
>>
>> 1) Use page-sized blocks so that there are no buffer heads.
>>
>> 2) By introducing a more general inline data / xattr, metadata and small data have
>> the opportunity to be read with the inode metadata at the same time.
>>
>> 3) Introduce another shared xattr region in order to store common xattrs (e.g.
>> SELinux labels) or xattrs too large to be stored inline with the metadata.
>>
>> 4) Metadata and data could be mixed by design, so it could be more flexible for mkfs
>> to organize files and data.
>>
>> 5) Instead of using fixed-sized input compression, we put forward a new fixed-output
>> compression to make full use of IO (which means all data from IO can be
>> decompressed), reduce read amplification, improve random reads, and keep
>> compression ratios relatively low, illustrated as follows:
>>
>>
>>         |---- variable-length extent ---|------ VLE ------|---  VLE ---|
>>          /> clusterofs                  /> clusterofs     /> clusterofs /> clusterofs
>>    ++---|-------++-----------++---------|-++-----------++-|---------++-|
>> ...||   |       ||           ||         | ||           || |         || | ... original data
>>    ++---|-------++-----------++---------|-++-----------++-|---------++-|
>>    ++->cluster<-++->cluster<-++->cluster<-++->cluster<-++->cluster<-++
>>         size         size         size         size         size
>>          \                             /                 /            /
>>           \                      /              /            /
>>            \               /            /            /
>>             ++-----------++-----------++-----------++
>>         ... ||           ||           ||           || ... compressed clusters
>>             ++-----------++-----------++-----------++
>>             ++->cluster<-++->cluster<-++->cluster<-++
>>                  size         size         size
>>
>>    A cluster could contain more than one block by design, but currently we only have the
>> page-sized cluster implementation (page-sized fixed-output compression can also achieve a
>> better compression ratio than fixed-input compression).
>>
>>    All compressed clusters have a fixed size but could be decompressed into extents with
>> arbitrary lengths.
>>
>>    In addition, if a buffered IO reads the following shaded region (x), we could provide a
>>    customized path (to replace generic_file_buffered_read) which reads only one compressed
>>    cluster and makes the partial page available.
>>          /> clusterofs
>>    ++---|-------++
>> ...||   | xxxx  || ...
>>    ||---|-------||
>>
>> Some numbers using fixed output compression (VLE, cluster size = block size = 4k) on
>> the server and Android phone (kirin970 platform):
>>
>> Server (magnetic disk):
>>
>> compression  EROFS seq read  EXT4 seq read        EROFS random read  EXT4 random read
>> ratio           bw[MB/s]       bw[MB/s]             bw[MB/s] (20%)    bw[MB/s] (20%)
>>
>>   4              480.3          502.5                   69.8               11.1
>>  10              472.3          503.3                   56.4               10.0
>>  15              457.6          495.3                   47.0               10.9
>>  26              401.5          511.2                   34.7               11.1
>>  35              389.1          512.5                   28.0               11.0
>>  48              375.4          496.5                   23.2               10.6
>>  53              370.2          512.0                   21.8               11.0
>>  66              349.2          512.0                   19.0               11.4
>>  76              310.5          497.3                   17.3               11.6
>>  85              301.2          512.0                   16.0               11.0
>>  94              292.7          496.5                   14.6               11.1
>> 100              538.9          512.0                   11.4               10.8
>>
>> Kirin970 (A73 big core 2361MHz, A53 little core 0MHz, DDR 1866MHz):
> 
> What storage was used? An eMMC?

A UFS device. The numbers come from fio with the psync ioengine, bs=4k, iodepth=1.

> 
>> compression  EROFS seq read  EXT4 seq read        EROFS random read  EXT4 random read
>> ratio           bw[MB/s]       bw[MB/s]             bw[MB/s] (20%)    bw[MB/s] (20%)
>>
>>   4              546.7          544.3                    157.7              57.9
>>  10              535.7          521.0                    152.7              62.0
>>  15              529.0          520.3                    125.0              65.0
>>  26              418.0          526.3                     97.6              63.7
>>  35              367.7          511.7                     89.0              63.7
>>  48              415.7          500.7                     78.2              61.2
>>  53              423.0          566.7                     72.8              62.9
>>  66              334.3          537.3                     69.8              58.3
>>  76              387.3          546.0                     65.2              56.0
>>  85              306.3          546.0                     63.8              57.7
>>  94              345.0          589.7                     59.2              49.9
>> 100              579.7          556.7                     62.1              57.7
> 
> How does it compare to existing read only filesystems, such as squashfs?
> 

You are quite right.

We are now focusing on improving our decompression subsystem, and
these numbers will be added in future non-RFC patches.

We haven't paid much attention to comparing squashfs and erofs
yet. We tried squashfs on our products with different block sizes
several years ago, and besides its performance, its behavior in
low-free-memory scenarios was unacceptable.
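
To make the fixed-output idea from the quoted cover letter a bit more concrete,
here is a rough user-space sketch in Python. It is only an illustration under
several assumptions: zlib stands in for the LZ4 used by the patchset, the helper
names (fits, pack_fixed_output, read_at) are made up for this example, and the
layout is not the erofs on-disk format. Each 4k cluster is filled by
searching for a long input prefix that still compresses into it, so every
cluster decompresses into one variable-length extent.

```python
import zlib

CLUSTER_SIZE = 4096  # page-sized compressed cluster, as in the cover letter


def fits(chunk: bytes) -> bool:
    """True if chunk compresses into a single cluster (zlib is a stand-in)."""
    return len(zlib.compress(chunk)) <= CLUSTER_SIZE


def pack_fixed_output(data: bytes):
    """Cut data into variable-length extents, one fixed-size cluster each.

    Returns a list of (logical_offset, extent_length, cluster_bytes).
    """
    extents = []
    pos = 0
    while pos < len(data):
        # Binary search for a long prefix that still fits in one cluster.
        # lo always fits: a 1-byte input compresses to a handful of bytes.
        lo, hi = 1, len(data) - pos
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if fits(data[pos:pos + mid]):
                lo = mid
            else:
                hi = mid - 1
        payload = zlib.compress(data[pos:pos + lo])
        # Pad so that every cluster has exactly the same size.
        cluster = payload + b"\0" * (CLUSTER_SIZE - len(payload))
        extents.append((pos, lo, cluster))
        pos += lo
    return extents


def read_at(extents, offset: int, length: int) -> bytes:
    """Serve a read lying within one extent by decompressing one cluster."""
    for start, ext_len, cluster in extents:
        if start <= offset < start + ext_len:
            # decompressobj stops at the end of the stream, so the
            # zero padding after the payload is simply left unused.
            plain = zlib.decompressobj().decompress(cluster)
            return plain[offset - start:offset - start + length]
    raise ValueError("offset beyond end of data")


sample = b"".join(b"line %d of sample text\n" % i for i in range(5000))
extents = pack_fixed_output(sample)
```

A small random read such as read_at(extents, 10, 5) touches exactly one
4k cluster, and every byte of that cluster belongs to the extent being
decompressed. That is the property the fixed-input scheme lacks.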


This version of the patchset is mainly intended for the open-source archive.


Thanks for your attention :)


Thanks,
