linux-erofs.lists.ozlabs.org archive mirror
 help / color / mirror / Atom feed
From: Gao Xiang <hsiangkao@redhat.com>
To: 胡玮文 <sehuww@mail.scut.edu.cn>
Cc: linux-erofs@lists.ozlabs.org
Subject: Re: [PATCH v4 2/2] erofs-utils: optimize buffer allocation logic
Date: Mon, 18 Jan 2021 22:06:54 +0800	[thread overview]
Message-ID: <20210118140654.GC2423918@xiangao.remote.csb> (raw)
In-Reply-To: <35F8431A-29B6-4211-8BB1-E5241238D45D@mail.scut.edu.cn>

Hi Weiwen,

On Mon, Jan 18, 2021 at 09:02:16PM +0800, 胡玮文 wrote:
> 
> > 在 2021年1月16日,14:41,Gao Xiang <hsiangkao@aol.com> 写道:
> > 
> > From: Hu Weiwen <sehuww@mail.scut.edu.cn>
> > 
> > When using EROFS to pack our dataset which consists of millions of
> > files, mkfs.erofs is very slow compared with mksquashfs.
> > 
> > The bottleneck is `erofs_balloc` and `erofs_mapbh` function, which
> > iterate over all previously allocated buffer blocks, making the
> > complexity of the algrithm O(N^2) where N is the number of files.
> > 
> > With this patch:
> > 
> > * global `last_mapped_block` is mantained to avoid full scan in
> >  `erofs_mapbh` function.
> > 
> > * global `non_full_buffer_blocks` mantains a list of buffer block for
> >  each type and each possible remaining bytes in the block. Then it is
> >  used to identify the most suitable blocks in future `erofs_balloc`,
> >  avoiding full scan.
> > 
> > Some new data structure is allocated in this patch, more RAM usage is
> > expected, but not much. When I test it with ImageNet dataset (1.33M
> > files), 7GiB RAM is consumed, and it takes about 4 hours. Most time is
> > spent on IO.
> > 
> > Signed-off-by: Hu Weiwen <sehuww@mail.scut.edu.cn>
> > Signed-off-by: Gao Xiang <hsiangkao@aol.com>
> > ---
> > 
> > I've simplified the most-fit finding logic of v3... Since buffers.off
> > has to be aligned to alignsize, so I think it's better to use
> > buffers.off as the index of mapped_buckets compared to using remaining
> > size as it looks more straight-forward.
> > 
> > Also, I found the exist logic handling expended blocks might be
> > potential ineffective as well... we have to skip used < used0 only
> > after oob (extra blocks is allocated, so not expend such blocks but
> > allocate a new bb...) It might be more effective to reuse such
> > non-mapped buffer blocks...
> 
> I don’t get this. If not oob, then “used” should be larger than “used_before”, then we will not skip such block anyway, right?

Sorry, I think I was completely misleaded by the comment above the
code at that time. It's too far to get the original intention :(
I think you're right (anyway, I think the main purpose of this
condition was that I didn't want to introduce too many new bbs
with a lot of large unused space. But anyway such condition is
approximate since it's actually a 0-1 Knapsack problem). Please
ignore my previous words about this...

Thanks,
Gao Xiang


  reply	other threads:[~2021-01-18 14:07 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20210116050438.4456-1-hsiangkao.ref@aol.com>
2021-01-16  5:04 ` [PATCH v3 1/2] erofs-utils: get rid of `end' argument from erofs_mapbh() Gao Xiang via Linux-erofs
2021-01-16  5:04   ` [PATCH v3 2/2] erofs-utils: optimize mkfs to O(N), N = #files Gao Xiang via Linux-erofs
2021-01-16  5:20     ` Gao Xiang
2021-01-16  6:31     ` [PATCH v4 2/2] erofs-utils: optimize buffer allocation logic Gao Xiang via Linux-erofs
2021-01-18 12:34       ` [PATCH v5 " Hu Weiwen
2021-01-18 13:02       ` [PATCH v4 " 胡玮文
2021-01-18 14:06         ` Gao Xiang [this message]
2021-01-19  5:49       ` [PATCH v6 " Hu Weiwen
2021-01-22 13:14         ` Gao Xiang
2021-01-22 13:21           ` Gao Xiang
2021-01-22 15:57             ` Huang Jianan
2021-01-18 12:29 [PATCH v4 " 胡玮文
2021-01-18 13:54 ` Gao Xiang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210118140654.GC2423918@xiangao.remote.csb \
    --to=hsiangkao@redhat.com \
    --cc=linux-erofs@lists.ozlabs.org \
    --cc=sehuww@mail.scut.edu.cn \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).