Subject: Re: exclusive subvolume space missing
From: Qu Wenruo
To: Tomasz Pala, linux-btrfs@vger.kernel.org
Message-ID: <599e2f5d-8e78-9b59-879c-6ba375510508@gmx.com>
Date: Mon, 11 Dec 2017 08:24:16 +0800
References: <20171201161555.GA11892@polanet.pl>
 <55036341-2e8e-41dc-535f-f68d8e74d43f@gmx.com>
 <20171202012324.GB20205@polanet.pl>
 <0d3cd6f5-04ad-b080-6e62-7f25824860f1@gmx.com>
 <20171202022153.GA7727@polanet.pl>
 <20171202093301.GA28256@polanet.pl>
 <65f1545c-7fc5-ee26-ed6b-cf1ed6e4f226@gmx.com>
 <20171210112738.GA24090@polanet.pl>

On 2017-12-11 07:44, Qu Wenruo wrote:
>
>
> On 2017-12-10 19:27, Tomasz Pala wrote:
>> On Mon, Dec 04, 2017 at 08:34:28 +0800, Qu Wenruo wrote:
>>
>>>> 1. is there any switch resulting in 'defrag only exclusive data'?
>>>
>>> IIRC, no.
>>
>> I have found a directory - pam_abl databases - which occupies 10 MB (yes,
>> TEN MEGAbytes) and released ...8.7 GB (almost NINE GIGAbytes) after
>> defrag. After defragging, the files were not snapshotted again, and I've
>> lost 3.6 GB again, so I've got this fully reproducible.
>> There are 7 files, one of which is 99% of the space (10 MB). None of
>> them has nocow set, so they're riding all-btrfs.
>>
>> I could debug something before I clean this up - is there anything you
>> want me to check/know about the files?
>
> The fiemap result, along with the btrfs dump-tree -t2 result.
>
> Neither output contains anything related to file names/dir names, only
> some "meaningless" bytenrs, so it should be completely OK to share them.
>
>>
>> The fragmentation impact is HUGE here - a 1000x ratio is almost a DoS
>> condition which could be triggered by a malicious user within a few
>> hours, or faster
>
> You won't want to hear this:
> the biggest ratio in theory is 128M / 4K = 32768.
>
>> - I've lost 3.6 GB during the night with a reasonably small
>> amount of writes; I guess it might be possible to trash an entire
>> filesystem within 10 minutes if doing this on purpose.
>
> That's a little complex.
> To get into such a situation, snapshots must be used, and one must know
> which file extent is shared and how it's shared.
>
> But yes, it's possible.
>
> On the other hand, XFS, which also supports reflink, handles this quite
> well, so I'm wondering if it's possible for btrfs to follow its behavior.
>
>>
>>>> 3. I guess there aren't, so how could I accomplish my target, i.e.
>>>>    reclaiming space that was lost due to fragmentation, without
>>>>    breaking snapshotted CoW where it would be not only pointless,
>>>>    but actually harmful?
>>>
>>> What about using an old kernel, like v4.13?
>>
>> Unfortunately (I guess you had 3.13 in mind), I need the new ones and
>> will be pushing towards 4.14.
>
> No, I really mean v4.13.

My fault, it is v3.13. What a stupid error...

>
> From btrfs(5):
> ---
>     Warning
>         Defragmenting with Linux kernel versions < 3.9 or ≥ 3.14-rc2,
>         as well as with Linux stable kernel versions ≥ 3.10.31,
>         ≥ 3.12.12 or ≥ 3.13.4, will break up the ref-links of CoW data
>         (for example files copied with cp --reflink, snapshots or
>         de-duplicated data). This may cause considerable increase of
>         space usage depending on the broken up ref-links.
> ---
>
>>
>>>> 4. How can I prevent this from happening again? All the files that are
>>>>    written constantly (stats collector here, PostgreSQL database and
>>>>    logs on other machines) are marked with nocow (+C); maybe some new
>>>>    attribute to mark a file as autodefrag? +t?
>>>
>>> Unfortunately, nocow only works if there is no other subvolume/inode
>>> referring to it.
>>
>> This shouldn't be my case anymore after defrag (== breaking links).
>> I guess there's no easy way to check the refcounts of the blocks?
>
> No easy way, unfortunately.
> It's either time-consuming (as used by qgroup) or complex (a manual tree
> search, doing the backref walk by yourself).
>
>>
>>> But in my understanding, btrfs is not suitable for such a conflicting
>>> situation, where you want to have snapshots of frequent partial updates.
>>>
>>> IIRC, btrfs is better for use cases where either updates are less
>>> frequent, or an update replaces the whole file, not just part of it.
>>>
>>> So btrfs is good for a root filesystem like /etc /usr (and /bin /lib,
>>> which point to /usr/bin and /usr/lib), but not for /var or /run.
>>
>> That is coherent with my conclusions after 2 years on btrfs; however,
>> I didn't expect a single file to eat 1000 times more space than it
>> should...
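For context on the worst-case figure mentioned earlier (128M / 4K): a single live 4 KiB block can pin an entire 128 MiB data extent, btrfs' maximum data extent size, for as long as any snapshot still references part of it. A minimal shell sketch of the arithmetic:

```shell
# Worst-case space amplification on btrfs: one live 4 KiB block can keep
# a whole 128 MiB data extent (the maximum extent size) from being freed
# while a snapshot still references part of it.
extent_bytes=$((128 * 1024 * 1024))   # maximum btrfs data extent size
block_bytes=4096                      # typical filesystem block size
ratio=$((extent_bytes / block_bytes))
echo "worst-case amplification: ${ratio}x"   # prints 32768x
```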
>>
>>
>> I wonder how many other filesystems were trashed like this - I'm short
>> of ~10 GB on another system; many other users might be affected by that
>> (telling the Internet stories about btrfs running out of space).
>
> Firstly, no other filesystem supports snapshots.
> So it's pretty hard to get a baseline.
>
> But as I mentioned, XFS supports reflink, which means a file extent can
> be shared between several inodes.
>
> From the message I got from the XFS guys, they free any unused space of
> a file extent, so it should handle this quite well.
>
> But that's quite hard to achieve in btrfs; it needs years of development
> at least.
>
>>
>> It is not a problem that I need to defrag a file; the problem is I don't
>> know:
>> 1. whether I need to defrag,
>> 2. *what* should I defrag,
>> nor do I have a tool that would defrag smart - only the exclusive data
>> or, in general, the blocks that are worth defragging, where the space
>> released from extents is greater than the space lost on inter-snapshot
>> duplication.
>>
>> I can't just defrag the entire filesystem since it breaks links with
>> snapshots.
>> This change was a real deal-breaker here...
>
> IIRC it would be better to add an option to make defrag snapshot-aware
> (don't break snapshot sharing, but defrag only the exclusive data).
>
> Thanks,
> Qu
>
>>
>> Any way to feed the deduplication code with snapshots maybe? There are
>> directories and files in the same layout; this could be fast-tracked to
>> check and deduplicate.
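On points 1 and 2 above (whether and what to defrag): one way to spot candidates is `btrfs filesystem du`, which reports per-path Total/Exclusive/Shared usage, so defrag can be aimed at exclusive-heavy files only. A hedged sketch, assuming btrfs-progs >= 4.3; the pam_abl path is a placeholder:

```shell
# Report Total / Exclusive / Shared usage for a path so that defrag can
# be targeted at files whose exclusive usage dwarfs their apparent size.
# Falls back with a message when run outside a btrfs mount.
report_usage() {
    target="${1:-.}"
    if command -v btrfs >/dev/null 2>&1; then
        btrfs filesystem du -s "$target" 2>/dev/null \
            || echo "not a btrfs filesystem (or no permission): $target"
    else
        echo "btrfs-progs not installed"
    fi
}

report_usage /var/lib/pam_abl   # placeholder path
```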
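A footnote on the +C discussion above: nocow only takes effect on empty files, so in practice it is set on the containing directory, which newly created files (e.g. database files) then inherit; and, as noted in the thread, a snapshot still forces CoW on shared extents. A hedged sketch:

```shell
# Mark a directory NOCOW so newly created files inherit +C.
# +C has no effect on non-empty files, hence the empty-directory trick.
demo_nocow() {
    d=$(mktemp -d)
    chattr +C "$d" 2>/dev/null \
        || echo "chattr +C not supported on this filesystem"
    lsattr -d "$d" 2>/dev/null || echo "lsattr unavailable"
    rm -rf "$d"
}

demo_nocow
```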