linux-ext4.vger.kernel.org archive mirror
From: Andreas Dilger <adilger@dilger.ca>
To: "Alex Zhuravlev" <azhuravlev@whamcloud.com>,
	"Благодаренко Артём" <artem.blagodarenko@gmail.com>
Cc: "linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>
Subject: Re: [RFC] improve malloc for large filesystems
Date: Mon, 25 Nov 2019 14:39:59 -0700
Message-ID: <E02E44A9-6206-4B73-B52F-C3A1BC4C7D1E@dilger.ca>
In-Reply-To: <BCFC8274-0A4E-42E7-9D11-647D47316BD2@whamcloud.com>

On Nov 21, 2019, at 7:41 AM, Alex Zhuravlev <azhuravlev@whamcloud.com> wrote:
> 
> On 21 Nov 2019, at 12:18, Artem Blagodarenko <artem.blagodarenko@gmail.com> wrote:
>> Assume we have one fragmented part of the disk and all other parts are mostly
>> free. The allocator will spend a lot of time going through this fragmented
>> part, because it will break cr0 and cr1 and get a range that satisfies cr3.
> 
> Even at cr=3 we still search for the goal size.
> 
> Thus we shouldn’t really allocate bad chunks because we break cr=0 and cr=1,
> we just stop looking for nice-looking groups and fall back to the regular
> (more expensive) search for free extents.

I think it is important to understand what the actual goal size is at this
point.  The filesystems where we are seeing problems are _huge_ (650TiB and
larger) and are relatively full (70% or more) but take tens of minutes to
finish mounting.  Lustre does some small writes at mount time, but it shouldn't
take so long to find some small allocations for the config log update.

The filesystems are automatically getting "s_stripe_size = 512" from mke2fs
(presumably from the underlying RAID), and I _think_ this is causing mballoc
to inflate the IO request to 8-16MB prealloc chunks, which would be much
harder to find, and unnecessary for a small allocation.
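
To put those sizes side by side, here is a rough unit-conversion sketch. It
assumes 4 KiB filesystem blocks (consistent with "256 blocks = 1MB" later in
this message) and that the stripe value of 512 is counted in filesystem
blocks. The helper names are made up for illustration, and this does not model
how mballoc actually normalizes a request.

```python
BLOCK_SIZE = 4096  # assumption: 4 KiB filesystem blocks


def blocks_to_mib(blocks, block_size=BLOCK_SIZE):
    """Convert a block count to MiB."""
    return blocks * block_size / 2**20


def mib_to_blocks(mib, block_size=BLOCK_SIZE):
    """Convert a size in MiB to a whole number of blocks."""
    return mib * 2**20 // block_size


stripe_blocks = 512  # stripe value from the message, assumed to be in blocks
print(blocks_to_mib(stripe_blocks))         # 2.0 (MiB per stripe)
print(mib_to_blocks(8), mib_to_blocks(16))  # 2048 4096 (blocks per chunk)
```

Under these assumptions an 8-16MB chunk is 2048-4096 contiguous free blocks,
which is far more than a small config log write needs.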

>> The cr3 requirement is quite simple: “get the first group that has enough
>> free blocks to allocate the requested range”.
> 
> This is only group selection, then we try to find that extent within that
> group, can fail and move to the next group.
> EXT4_MB_HINT_FIRST is set outside of the main cr=0..3 loop.
> 
>> With high probability the allocator finds such a group at the start of the
>> cr3 loop, so the goal (the allocator starts its search from the goal) will
>> not change significantly. Thus the allocator goes through this fragmented
>> range in small steps.
>> 
>> Without the suggested optimisation, the allocator skips this fragmented range
>> right away and continues to allocate blocks.
> 
> 1000 groups * 5ms avg.time = 5 seconds to skip 1000 bad uninitialized groups. This is the real problem. You mentioned 4M groups...

Yes, these filesystems have 5M or more groups, which is a real problem.
Alex is working on a patch to do prefetch of the bitmaps, and to read them
in chunks of flex_bg size (256 blocks = 1MB) to cut down on the number of
seeks needed to fetch them from disk.
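
A back-of-the-envelope sketch of the numbers involved, assuming 4 KiB blocks
and the usual 8 * blocksize blocks per group (one bitmap block per group). The
5 ms per-group cost is Alex's figure quoted above, and treating every group as
a separate read is a deliberate worst case, not a measurement.

```python
TIB = 2**40
BLOCK_SIZE = 4096                  # assumption: 4 KiB blocks
BLOCKS_PER_GROUP = 8 * BLOCK_SIZE  # one bitmap block covers 32768 blocks
FLEX_BG_SIZE = 256                 # bitmaps per chunked read, from the message
MS_PER_GROUP = 5                   # quoted average cost per uninitialized group


def ceil_div(a, b):
    """Integer division rounding up."""
    return -(-a // b)


def group_count(fs_bytes):
    """Number of block groups needed to cover fs_bytes."""
    return ceil_div(fs_bytes, BLOCKS_PER_GROUP * BLOCK_SIZE)


groups = group_count(650 * TIB)
worst_case_seconds = groups * MS_PER_GROUP / 1000
chunked_reads = ceil_div(groups, FLEX_BG_SIZE)

print(groups)              # 5324800
print(worst_case_seconds)  # 26624.0
print(chunked_reads)       # 20800
```

So a 650TiB filesystem has about 5.3M groups, and reading the bitmaps in
flex_bg-sized chunks cuts the number of separate reads by a factor of 256.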

Using bigalloc would also help, and getting the number of block groups lower
will avoid the need for meta_bg (which puts each group descriptor block into a
separate group, rather than packing them contiguously), but we've had to fix a
few performance issues with bigalloc as well, and have not deployed it yet in
production.
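
To illustrate how much bigalloc could shrink the group count, here is a rough
sketch. The 64 KiB cluster size is a hypothetical choice, not something we
have settled on, and the 64-byte descriptor size assumes the 64bit feature is
enabled.

```python
TIB = 2**40
BLOCK_SIZE = 4096                 # assumption: 4 KiB blocks
UNITS_PER_GROUP = 8 * BLOCK_SIZE  # bits in one bitmap block
DESC_SIZE = 64                    # assumption: 64-byte group descriptors


def ceil_div(a, b):
    """Integer division rounding up."""
    return -(-a // b)


def group_count(fs_bytes, cluster_size):
    """Groups needed when each bitmap bit tracks cluster_size bytes."""
    return ceil_div(fs_bytes, UNITS_PER_GROUP * cluster_size)


def descriptor_blocks(groups):
    """Blocks needed to hold one descriptor per group."""
    return ceil_div(groups, BLOCK_SIZE // DESC_SIZE)


fs_bytes = 650 * TIB
plain = group_count(fs_bytes, BLOCK_SIZE)    # no bigalloc: cluster == block
bigalloc = group_count(fs_bytes, 64 * 1024)  # hypothetical 64 KiB clusters

print(plain, descriptor_blocks(plain))        # 5324800 83200
print(bigalloc, descriptor_blocks(bigalloc))  # 332800 5200
```

With these assumed parameters the group count drops by a factor of 16, and
the descriptor table shrinks from about 325MiB to about 20MiB.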

Cheers, Andreas