From: Holger Hoffstätte
To: Qu Wenruo, John Ettedgui
Cc: Qu Wenruo, Austin S Hemmelgarn, btrfs
Subject: Re: mount btrfs takes 30 minutes, btrfs check runs out of memory
Date: Tue, 13 Feb 2018 17:24:37 +0100
In-Reply-To: <900c77fc-6ad0-624c-6831-b8e5da636fb4@gmx.com>

On 02/13/18 13:54, Qu Wenruo wrote:
> On 2018年02月13日 20:26, Holger Hoffstätte wrote:
>> On 02/13/18 12:40, Qu Wenruo wrote:
>>>>> The problem is not about how much space it takes, but how many extents
>>>>> are here in the filesystem.
>>
>> I have no idea why btrfs' mount even needs to touch all block groups to
>> get going (which seems to be the root of the problem), but here's a
>> not so crazy idea for more "mechanical sympathy". Feel free to mock
>> me if this is terribly wrong or not possible. ;)
>>
>> Mounting of even large filesystems (with many extents) seems to be fine
>> on SSDs, but not so fine on rotational storage. We've heard that from
>> several people with large (multi-TB) filesystems, and obviously it's
>> even more terrible on 5400RPM drives because their seeks are sooo sloow.
>>
>> If the problem is that the bgs are touched/iterated in "tree order",
>> would it then not be possible to sort the block groups in physical order
>> before trying to load whatever mount needs to load?
>
> This is in fact a good idea.
> Make block group into its own tree.

Well, that's not what I was thinking about at all..yet. :)
(keep in mind I'm not really that familiar with the internals).

Out of curiosity I ran a bit of perf on my own mount process, which is
fast (~700 ms) despite being a ~1.1TB fs, mixture of lots of large and
small files. Unfortunately it's also very fresh since I recreated it just
this weekend, so everything is neatly packed together and fast.

In contrast a friend's fs is ~800 GB, but has 11 GB metadata and is pretty
old and fragmented (but running an up-to-date kernel). His fs mounts in ~5s.

My perf run shows that the only interesting part responsible for mount time
is the nested loop in btrfs_read_block_groups calling find_first_block_group
(which got inlined & is not in the perf callgraph) over and over again,
accounting for 75% of time spent.
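
To make that lookup pattern concrete, here is a toy model in Python. It is
only a sketch under my own assumptions, not the kernel code: the "tree" is a
sorted list of (objectid, type, offset) keys, build_toy_extent_tree and
read_block_groups are made-up helpers, and find_first_block_group only
borrows the kernel function's name. The key type values mirror the btrfs
ones as far as I know, but only their relative order matters here.

```python
import bisect

# Key types; only their relative order matters in this toy model.
EXTENT_ITEM = 168
BLOCK_GROUP_ITEM = 192


def build_toy_extent_tree(num_block_groups, bg_size, extents_per_bg):
    """Return a sorted list of (objectid, type, offset) keys.

    Block group items end up interleaved with the extent items
    that live inside each block group's byte range.
    """
    keys = []
    extent_len = bg_size // extents_per_bg
    for i in range(num_block_groups):
        start = i * bg_size
        keys.append((start, BLOCK_GROUP_ITEM, bg_size))
        for j in range(extents_per_bg):
            keys.append((start + j * extent_len, EXTENT_ITEM, extent_len))
    keys.sort()
    return keys


def find_first_block_group(keys, min_objectid):
    """One fresh search, then walk forward to the next block group item."""
    slot = bisect.bisect_left(keys, (min_objectid, 0, 0))
    while slot < len(keys):
        if keys[slot][1] == BLOCK_GROUP_ITEM:
            return keys[slot]
        slot += 1
    return None


def read_block_groups(keys):
    """Find every block group, restarting the search after each one."""
    found = []
    searches = 0
    next_objectid = 0
    while True:
        searches += 1
        bg = find_first_block_group(keys, next_objectid)
        if bg is None:
            break
        found.append(bg)
        # Continue just past the end of the block group we found.
        next_objectid = bg[0] + bg[2]
    return found, searches


if __name__ == "__main__":
    tree = build_toy_extent_tree(num_block_groups=4, bg_size=1024, extents_per_bg=8)
    block_groups, searches = read_block_groups(tree)
    print(len(tree), len(block_groups), searches)
```

In this model every block group costs one fresh search, plus one final
search that finds nothing. In memory that is cheap; the concern is what each
of those searches would cost on a cold rotational disk if the items it
touches are scattered.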

I now understand your comment why the real solution to this problem
is to move bgs into their own tree, and agree: both kitchens and databases
have figured out a long time ago that the key to fast scan and lookup
performance is to not put different things in the same storage container;
in the case of analytical DBMS this is columnar storage. :)

But what I originally meant was something much simpler and more
brute-force-ish. I see that btrfs_read_block_groups adds readahead
(is that actually effective?) but what I was looking for was the equivalent
of a DBMS' sequential scan. Right now finding (and loading) a bg seems to
involve a nested loop of tree lookups. It seems easier to rip through the
entire tree in nice 8MB chunks and discard what you don't need instead
of seeking around trying to find all the right bits in scattered order.

Could we alleviate cold mounts by starting more readaheads in
btrfs_read_block_groups, so that the extent tree is scanned more linearly?

cheers,
Holger