From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 26 May 2021 14:07:42 -0700
From: Keith Busch
To: Matthew Wilcox
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
	linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org,
	linux-ide@vger.kernel.org, linux-scsi@vger.kernel.org,
	linux-nvme@lists.infradead.org
Subject: Re: [LSF/MM/BPF TOPIC] Memory folios
Message-ID: <20210526210742.GA3706388@dhcp-10-100-145-180.wdc.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Fri, May 14, 2021 at 06:48:26PM +0100, Matthew Wilcox wrote:
> On Mon, May 10, 2021 at 06:56:17PM +0100, Matthew Wilcox wrote:
> > I don't know exactly how much will be left to discuss about supporting
> > larger memory allocation units in the page cache by December. In my
> > ideal world, all the patches I've submitted so far are accepted, I
> > persuade every filesystem maintainer to convert their own filesystem
> > and struct page is nothing but a bad memory by December. In reality,
> > I'm just not that persuasive.
> >
> > So, probably some kind of discussion will be worthwhile about
> > converting the remaining filesystems to use folios, when it's worth
> > having filesystems opt-in to multi-page folios, what we can do about
> > buffer-head based filesystems, and so on.
> >
> > Hopefully we aren't still discussing whether folios are a good idea
> > or not by then.
>
> I got an email from Hannes today asking about memory folios as they
> pertain to the block layer, and I thought this would be a good chance
> to talk about them. If you're not familiar with the term "folio",
> https://lore.kernel.org/lkml/20210505150628.111735-10-willy@infradead.org/
> is not a bad introduction.
>
> Thanks to the work done by Ming Lei in 2017, the block layer already
> supports multipage bvecs, so to a first order of approximation, I don't
> need anything from the block layer on down through the various storage
> layers. Which is why I haven't been talking to anyone in storage!
>
> It might change (slightly) the contents of bios. For example,
> bvec[n]->bv_offset might now be larger than PAGE_SIZE. Drivers should
> handle this OK, but probably haven't been audited to make sure they do.
> Mostly, it's simply that drivers will now see fewer, larger, segments
> in their bios. Once a filesystem supports multipage folios, we will
> allocate order-N pages as part of readahead (and sufficiently large
> writes). Dirtiness is tracked on a per-folio basis (not per page),
> so folios take trips around the LRU as a single unit and finally make
> it to being written back as a single unit.
>
> Drivers still need to cope with sub-folio-sized reads and writes.
> O_DIRECT still exists and (eg) doing a sub-page, block-aligned write
> will not necessarily cause readaround to happen. Filesystems may read
> and write their own metadata at whatever granularity and alignment they
> see fit. But the vast majority of pagecache I/O will be folio-sized
> and folio-aligned.
>
> I do have two small patches which make it easier for the one
> filesystem that I've converted so far (iomap/xfs) to add folios to bios
> and get folios back out of bios:
>
> https://lore.kernel.org/lkml/20210505150628.111735-72-willy@infradead.org/
> https://lore.kernel.org/lkml/20210505150628.111735-73-willy@infradead.org/
>
> as well as a third patch that estimates how large a bio to allocate,
> given the current folio that it's working on:
> https://git.infradead.org/users/willy/pagecache.git/commitdiff/89541b126a59dc7319ad618767e2d880fcadd6c2
>
> It would be possible to make other changes in future. For example, if
> we decide it'd be better, we could change bvecs from being (page, offset,
> length) to (folio, offset, length). I don't know that it's worth doing;
> it would need to be evaluated on its merits. Personally, I'd rather
> see us move to a (phys_addr, length) pair, but I'm a little busy at the
> moment.
>
> Hannes has some fun ideas about using the folio work to support larger
> sector sizes, and I think they're doable.

I'm also interested in this, and was looking into the exact same thing
recently. Some of the very high capacity SSDs can really benefit from
better large sector support.

If this is a topic for the conference, I would like to attend this
session.
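
For anyone auditing a driver against the "fewer, larger segments" point
above, the distinction mostly comes down to which iterator the driver
uses. A rough sketch of the two granularities, going by my reading of
the current bvec helpers (count_segments() is just a made-up example,
not taken from any real driver):

#include <linux/bio.h>
#include <linux/printk.h>

/* Hypothetical helper, only to show the two iteration granularities. */
static void count_segments(struct bio *bio)
{
	struct bio_vec bv;
	struct bvec_iter iter;
	unsigned int pages = 0, bvecs = 0;

	/*
	 * Single-page granularity: the iterator splits multipage bvecs,
	 * so bv.bv_offset stays below PAGE_SIZE and bv.bv_len never
	 * crosses a page boundary.  Existing per-page assumptions hold.
	 */
	bio_for_each_segment(bv, bio, iter)
		pages++;

	/*
	 * Multipage granularity: one bvec may cover a whole folio, so
	 * bv.bv_len can span many pages and bv.bv_offset is no longer
	 * guaranteed to be below PAGE_SIZE once folio-backed pagecache
	 * feeds the bio.
	 */
	bio_for_each_bvec(bv, bio, iter)
		bvecs++;

	pr_info("%u single-page segments across %u multipage bvecs\n",
		pages, bvecs);
}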
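
And for what it's worth, the two helpers in the patches linked above
look roughly like the following from the filesystem side, going by that
posting (a sketch of the proposed interface, not its final form;
example_submit() and example_complete() are made up for illustration):

#include <linux/bio.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

static void example_submit(struct bio *bio, struct folio *folio)
{
	/*
	 * Attach the whole folio instead of looping over its component
	 * pages.  bio_add_folio() returns false when the bio is full;
	 * a real caller would submit it, allocate a new bio and retry
	 * the folio.
	 */
	if (!bio_add_folio(bio, folio, folio_size(folio), 0))
		submit_bio(bio);
}

static void example_complete(struct bio *bio)
{
	struct folio_iter fi;

	/* Walk the folios back out of the bio on I/O completion. */
	bio_for_each_folio_all(fi, bio)
		folio_end_writeback(fi.folio);
}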
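
The bvec question at the end is easier to picture side by side: today's
definition from include/linux/bvec.h, next to the two alternatives
mentioned above, written out as hypothetical structs purely for
comparison (neither of the latter exists):

/* Today: (page, offset, length) */
struct bio_vec {
	struct page	*bv_page;
	unsigned int	bv_len;
	unsigned int	bv_offset;
};

/* Hypothetical (folio, offset, length) variant */
struct folio_vec {
	struct folio	*fv_folio;
	size_t		fv_len;
	size_t		fv_offset;
};

/* Hypothetical (phys_addr, length) pair */
struct phys_vec {
	phys_addr_t	pv_addr;
	size_t		pv_len;
};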