From: Hannes Reinecke <hare@suse.de>
To: Keith Busch, Matthew Wilcox
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
    linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org,
    linux-ide@vger.kernel.org, linux-scsi@vger.kernel.org,
    linux-nvme@lists.infradead.org
Subject: Re: [LSF/MM/BPF TOPIC] Memory folios
Date: Thu, 27 May 2021 09:41:51 +0200
Message-ID: <97698689-0a18-81e0-a0ff-b4f92e56be5b@suse.de>
In-Reply-To: <20210526210742.GA3706388@dhcp-10-100-145-180.wdc.com>
References: <20210526210742.GA3706388@dhcp-10-100-145-180.wdc.com>

On 5/26/21 11:07 PM, Keith Busch wrote:
> On Fri, May 14, 2021 at 06:48:26PM +0100, Matthew Wilcox wrote:
>> On Mon, May 10, 2021 at 06:56:17PM +0100, Matthew Wilcox wrote:
>>> I don't know exactly how much will be left to discuss about supporting
>>> larger memory allocation units in the page cache by December. In my
>>> ideal world, all the patches I've submitted so far are accepted, I
>>> persuade every filesystem maintainer to convert their own filesystem,
>>> and struct page is nothing but a bad memory by December. In reality,
>>> I'm just not that persuasive.
>>>
>>> So, probably some kind of discussion will be worthwhile about
>>> converting the remaining filesystems to use folios, when it's worth
>>> having filesystems opt in to multi-page folios, what we can do about
>>> buffer-head based filesystems, and so on.
>>>
>>> Hopefully we aren't still discussing whether folios are a good idea
>>> or not by then.
>>
>> I got an email from Hannes today asking about memory folios as they
>> pertain to the block layer, and I thought this would be a good chance
>> to talk about them.
>> If you're not familiar with the term "folio",
>> https://lore.kernel.org/lkml/20210505150628.111735-10-willy@infradead.org/
>> is not a bad introduction.
>>
>> Thanks to the work done by Ming Lei in 2017, the block layer already
>> supports multipage bvecs, so to a first order of approximation, I don't
>> need anything from the block layer on down through the various storage
>> layers. Which is why I haven't been talking to anyone in storage!
>>
>> It might change (slightly) the contents of bios. For example,
>> bvec[n]->bv_offset might now be larger than PAGE_SIZE. Drivers should
>> handle this OK, but probably haven't been audited to make sure they do.
>> Mostly, it's simply that drivers will now see fewer, larger, segments
>> in their bios. Once a filesystem supports multipage folios, we will
>> allocate order-N pages as part of readahead (and sufficiently large
>> writes). Dirtiness is tracked on a per-folio basis (not per page),
>> so folios take trips around the LRU as a single unit and finally make
>> it to being written back as a single unit.
>>
>> Drivers still need to cope with sub-folio-sized reads and writes.
>> O_DIRECT still exists and (eg) doing a sub-page, block-aligned write
>> will not necessarily cause readaround to happen. Filesystems may read
>> and write their own metadata at whatever granularity and alignment they
>> see fit. But the vast majority of pagecache I/O will be folio-sized
>> and folio-aligned.
>>
>> I do have two small patches which make it easier for the one
>> filesystem that I've converted so far (iomap/xfs) to add folios to bios
>> and get folios back out of bios:
>>
>> https://lore.kernel.org/lkml/20210505150628.111735-72-willy@infradead.org/
>> https://lore.kernel.org/lkml/20210505150628.111735-73-willy@infradead.org/
>>
>> as well as a third patch that estimates how large a bio to allocate,
>> given the current folio that it's working on:
>> https://git.infradead.org/users/willy/pagecache.git/commitdiff/89541b126a59dc7319ad618767e2d880fcadd6c2
>>
>> It would be possible to make other changes in future. For example, if
>> we decide it'd be better, we could change bvecs from being (page, offset,
>> length) to (folio, offset, length). I don't know that it's worth doing;
>> it would need to be evaluated on its merits. Personally, I'd rather
>> see us move to a (phys_addr, length) pair, but I'm a little busy at the
>> moment.
>>
>> Hannes has some fun ideas about using the folio work to support larger
>> sector sizes, and I think they're doable.
>
> I'm also interested in this, and was looking into the exact same thing
> recently. Some of the very high capacity SSDs can really benefit
> from better large sector support. If this is a topic for the conference,
> I would like to attend this session.
>

And, of course, so would I :-)

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)