From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-io0-f174.google.com ([209.85.223.174]:51204 "EHLO
	mail-io0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751663AbdKFTMZ (ORCPT ); Mon, 6 Nov 2017 14:12:25 -0500
Received: by mail-io0-f174.google.com with SMTP id b186so16817125iof.8
	for ; Mon, 06 Nov 2017 11:12:25 -0800 (PST)
Subject: Re: Problem with file system
To: Chris Murphy
Cc: Adam Borowski, Marat Khalili, Dave, Linux fs Btrfs, Fred Van Andel
References: <9871a669-141b-ac64-9da6-9050bcad7640@cn.fujitsu.com>
	<10fb0b92-bc93-a217-0608-5284ac1a05cd@rqc.ru>
	<20171104044634.thg7mnchm4hvzdic@angband.pl>
	<6833d956-05c6-ee7b-ba80-b0a29c2772c6@gmail.com>
From: "Austin S. Hemmelgarn"
Message-ID: <01e731bf-8831-b7de-81a9-e0ce2f7d3f88@gmail.com>
Date: Mon, 6 Nov 2017 14:12:20 -0500
MIME-Version: 1.0
In-Reply-To:
Content-Type: text/plain; charset=utf-8; format=flowed
Sender: linux-btrfs-owner@vger.kernel.org
List-ID:

On 2017-11-06 13:45, Chris Murphy wrote:
> On Mon, Nov 6, 2017 at 6:29 AM, Austin S. Hemmelgarn
> wrote:
>
>>
>> With ATA devices (including SATA), except on newer SSD's, TRIM commands
>> can't be queued,
>
> SATA spec 3.1 includes queued trim. There are SATA spec 3.1 products
> on the market claiming to do queued trim. Some of them fuck up, and
> have been black listed in the kernel for queued trim.
>
Yes, but some still work, and they are invariably very new devices by
most people's definitions.

>>> Anyway right now I consider discard mount option fundamentally broken
>>> on Btrfs for SSDs. I haven't tested this on LVM thinp, maybe it's
>>> broken there too.
>>
>> For LVM thinp, discard there deallocates the blocks, and unallocated regions
>> read back as zeroes, just like in a sparse file (in fact, if you just think
>> of LVM thinp as a sparse file with reflinking for snapshots, you get
>> remarkably close to how it's actually implemented from a semantic
>> perspective), so it is broken there.
>> In fact, it's guaranteed broken on any
>> block device that has the discard_zeroes_data flag set, and theoretically
>> broken on many things that don't have that flag (although block devices that
>> don't have that flag are inherently broken from a security perspective
>> anyway, but that's orthogonal to this discussion).
>
> So this is really only solvable by having Btrfs delay, possibly
> substantially, the discarding of metadata blocks. Aside from physical
> device trim, there are benefits in thin provisioning for trim and some
> use cases will require file system discard, being unable to rely on
> periodic fstrim.

Yes. However, from a simplicity of implementation perspective, it makes
more sense to keep some number of old trees instead of keeping old trees
for some amount of time. That would remove the need to track timing info
in the filesystem, provide sufficient protection, and probably be a bit
easier to explain in the documentation. Such logic could also be applied
to regular block devices that don't support discard to provide a better
guarantee that you won't overwrite old trees that might be useful for
recovery.
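
As a rough illustration of the count-based idea, here is a minimal Python sketch. It is not Btrfs code: the class name, the `keep_trees` parameter, and the block numbers are all hypothetical, and it only models the bookkeeping. Blocks freed by a commit are held back until `keep_trees` newer commits have happened, so the last `keep_trees` old trees stay readable without any timestamps.

```python
from collections import deque


class DeferredDiscard:
    """Hypothetical model: delay discards until enough newer trees exist."""

    def __init__(self, keep_trees):
        if keep_trees < 0:
            raise ValueError("keep_trees must be non-negative")
        self.keep_trees = keep_trees
        # Oldest entry first: (generation, blocks freed by that commit).
        self.pending = deque()

    def commit(self, generation, freed_blocks):
        """Record the blocks freed by this commit.

        Returns the set of blocks that are now safe to discard (or reuse),
        because the tree that referenced them has aged out of the window.
        """
        self.pending.append((generation, frozenset(freed_blocks)))
        safe = set()
        # Only a count is compared here; no timing info is tracked.
        while len(self.pending) > self.keep_trees:
            _, blocks = self.pending.popleft()
            safe |= blocks
        return safe


if __name__ == "__main__":
    tracker = DeferredDiscard(keep_trees=2)
    for gen, freed in [(1, {10, 11}), (2, {12}), (3, {13}), (4, set())]:
        print(gen, sorted(tracker.commit(gen, freed)))
```

The same queue would work for devices without discard support: instead of issuing a discard, the allocator would simply refuse to reuse a block until it comes out of `commit()`.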