Date: Mon, 20 Apr 2020 08:36:46 +1000
From: Dave Chinner
To: "Martin K. Petersen"
Cc: Chaitanya Kulkarni, hch@lst.de, darrick.wong@oracle.com,
	axboe@kernel.dk, tytso@mit.edu, adilger.kernel@dilger.ca,
	ming.lei@redhat.com, jthumshirn@suse.de, minwoo.im.dev@gmail.com,
	damien.lemoal@wdc.com, andrea.parri@amarulasolutions.com,
	hare@suse.com, tj@kernel.org, hannes@cmpxchg.org,
	khlebnikov@yandex-team.ru, ajay.joshi@wdc.com, bvanassche@acm.org,
	arnd@arndb.de, houtao1@huawei.com, asml.silence@gmail.com,
	linux-block@vger.kernel.org, linux-ext4@vger.kernel.org
Subject: Re: [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE
Message-ID: <20200419223646.GB9765@dread.disaster.area>
References: <20200329174714.32416-1-chaitanya.kulkarni@wdc.com>
	<20200402224124.GK10737@dread.disaster.area>
	<20200403025757.GL10737@dread.disaster.area>
	<20200407022705.GA24067@dread.disaster.area>

On Wed, Apr 08, 2020 at 12:10:12AM -0400, Martin K. Petersen wrote:
> 
> Hi Dave!
> 
> >> In the standards space, the allocation concept was mainly aimed at
> >> protecting filesystem internals against out-of-space conditions on
> >> devices that dedup identical blocks and where simply zeroing the
> >> blocks therefore is ineffective.
> 
> > Um, so we're supposed to use space allocation before overwriting
> > existing metadata in the filesystem?
> 
> Not before overwriting, no. Once you have allocated an LBA it remains
> allocated until you discard it.

That is not a consistent argument. If the data has been deduped and we
overwrite, the storage array has to allocate new physical space for an
overwrite to an existing LBA. i.e. deduped data has multiple LBAs
pointing to the same physical storage. Any overwrite of an LBA that
maps to multiply referenced physical storage requires the storage
array to allocate new physical space for that overwrite. i.e.
allocation is not determined by whether the LBA has been written to,
"pinned" or not - it's whether the act of writing to that LBA requires
the storage to allocate new space to allow the write to proceed.

That's my point here - one particular shared data overwrite case is
being special cased by preallocation (avoiding dedupe of zero filled
data) to prevent ENOSPC, ignoring all the other cases where we
overwrite shared non-zero data and will also require new physical
space for the new data. In all those cases, the storage has to take
the same action - allocation on overwrite - and so all of them are
susceptible to ENOSPC.
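To make the allocate-on-overwrite point concrete, here's a minimal
sketch of a deduping array's write path, assuming reference-counted
physical extents. The types and helpers (pextent, alloc_pextent(),
write_lba()) are made up for illustration - they come from neither
real array firmware nor this patchset:

/*
 * Toy model of a deduping array: several LBAs may map to the same
 * physical extent, tracked by a refcount.
 */
#include <stdlib.h>

struct pextent {
	unsigned long refcount;		/* number of LBAs mapping here */
	/* physical location, data, ... */
};

/* Allocate backing store; NULL means the physical pool is exhausted. */
static struct pextent *alloc_pextent(void)
{
	struct pextent *pe = calloc(1, sizeof(*pe));

	if (pe)
		pe->refcount = 1;
	return pe;
}

/*
 * Writing to an LBA whose extent is multiply referenced must allocate
 * new physical space first, regardless of whether the LBA was ever
 * "preallocated" or "pinned", and regardless of the data written.
 * This is the path where overwrites of shared data can hit ENOSPC.
 */
static int write_lba(struct pextent **map, size_t lba)
{
	struct pextent *pe = map[lba];

	if (pe && pe->refcount > 1) {
		struct pextent *new_pe = alloc_pextent();

		if (!new_pe)
			return -1;	/* ENOSPC on overwrite */
		pe->refcount--;
		map[lba] = new_pe;
	}
	/* ... write the new data into map[lba]'s extent ... */
	return 0;
}

Preallocation only special-cases the zero-filled instance of this; the
refcount > 1 branch is taken for any overwrite of shared data.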
> > So that the underlying storage can reserve space for it before we
> > write it? Which would mean we have to issue a space allocation
> > before we dirty the metadata, which means before we dirty any
> > metadata in a transaction. Which means we'll basically have to
> > redesign the filesystems from the ground up, yes?
> 
> My understanding is that this facility was aimed at filesystems that
> do not dynamically allocate metadata. The intent was that mkfs would
> preallocate the metadata LBA ranges, not the filesystem. For
> filesystems that allocate metadata dynamically, then yes, an
> additional step is required if you want to pin the LBAs.

Ok, so you are confirming what I thought: it's almost completely
useless to us. i.e. this requires issuing IO to "reserve" space whilst
preserving data before every metadata object goes from clean to dirty
in memory. But the problem with that is we don't know how much
metadata we are going to dirty in any specific operation. Worse is
that we don't know exactly *what* metadata we will modify until we
walk structures and do lookups, which often happen after we've dirtied
other structures. An ENOSPC from a space reservation at that point is
fatal to the filesystem anyway, so there's no point in even trying to
do this. Like I said, functionality like this cannot be retrofitted to
existing filesystems.

IOWs, this is pretty much useless functionality for the filesystem
layer, and if the only use is for some mythical filesystem with
completely static metadata then the standards space really jumped the
shark on this one....

> > You might be talking about filesystem metadata and block devices,
> > but this patchset ends up connecting ext4's user data fallocate()
> > to the block device, thereby allowing users to reserve space
> > directly in the underlying block device and directly exposing this
> > issue to userspace.
> 
> I missed that Chaitanya's repost of this series included the ext4
> patch. Sorry!
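For reference, the userspace path being objected to here is just a
plain fallocate() on the block device node. A hedged sketch of what a
user could do once the block layer is wired up this way - note that
mainline kernels at this point reject mode 0 on block devices with
EOPNOTSUPP, so this only means something with a series like this one
applied:

/* Reserve an LBA range directly on a block device. Untested sketch. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	off_t offset, len;
	int fd;

	if (argc != 4) {
		fprintf(stderr, "usage: %s <bdev> <offset> <len>\n",
			argv[0]);
		return 1;
	}

	fd = open(argv[1], O_WRONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	offset = strtoll(argv[2], NULL, 0);
	len = strtoll(argv[3], NULL, 0);

	/*
	 * mode 0 == allocate; as I understand the series, this is what
	 * would be issued down the stack as REQ_OP_ASSIGN_RANGE.
	 */
	if (fallocate(fd, 0, offset, len) < 0)
		perror("fallocate");

	close(fd);
	return 0;
}

i.e. anyone with write access to the device node could pin down
physical space in an underprovisioned pool with a single syscall,
which is exactly the administration problem discussed below.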
> >> How XFS decides to enforce space allocation policy and potentially
> >> leverage this plumbing is entirely up to you.
> 
> > Do I understand this correctly? i.e. that it is the filesystem's
> > responsibility to prevent users from preallocating more space than
> > exists in an underlying storage pool that has been intentionally
> > hidden from the filesystem so it can be underprovisioned?
> 
> No. But as an administrative policy it is useful to prevent runaway
> applications from writing a petabyte of random garbage to media. My
> point was that it is up to you and the other filesystem developers
> to decide how you want to leverage the low-level allocation
> capability and how you want to provide it to processes. And whether
> CAP_SYS_ADMIN, ulimit, or something else is the appropriate policy
> interface for this.

My cynical translation: the storage standards space hasn't given any
thought to how this can be used and/or administered in the real world.
Pass the buck - let the filesystem people work that out.

What I'm hearing is that this wasn't designed for typical filesystem
use, it wasn't designed for typical user application use, and how to
prevent abuse wasn't thought about at all. That sounds like a big fat
NACK to me....

> In terms of thin provisioning and space management there are various
> thresholds that may be reported by the device. In past discussions
> there hasn't been much interest in getting these exposed. It is also
> unclear to me whether it is actually beneficial to send low space
> warnings to hundreds or thousands of hosts attached to an array. In
> many cases the individual server admins are not even the right
> audience. The most common notification mechanism is a message to the
> storage array admin saying "click here to buy more disk".

Notifications are not relevant to preallocation functionality at all.

-Dave.
-- 
Dave Chinner
david@fromorbit.com