Subject: Re: [PATCH 0/1] mm: restore full accuracy in COW page reuse
To: Linus Torvalds, Andrea Arcangeli
CC: Andrew Morton, Linux-MM, Linux Kernel Mailing List, Yu Zhao,
 Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport,
 "Minchan Kim", Will Deacon, Peter Zijlstra, Hugh Dickins, "Kirill A.
Shutemov" , Matthew Wilcox , Oleg Nesterov , Jann Horn , Kees Cook , Leon Romanovsky , Jason Gunthorpe , Jan Kara , Kirill Tkhai , Nadav Amit , Jens Axboe References: <20210110004435.26382-1-aarcange@redhat.com> From: John Hubbard Message-ID: <45806a5a-65c2-67ce-fc92-dc8c2144d766@nvidia.com> Date: Sun, 10 Jan 2021 23:26:57 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:84.0) Gecko/20100101 Thunderbird/84.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset="UTF-8"; format=flowed Content-Language: en-US Content-Transfer-Encoding: 7bit X-Originating-IP: [172.20.145.6] X-ClientProxiedBy: HQMAIL105.nvidia.com (172.20.187.12) To HQMAIL107.nvidia.com (172.20.187.13) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=nvidia.com; s=n1; t=1610350018; bh=V3jZWyS0o/TKWgXi/0u/OEHx5PeFoE3S+tnbYQhoPtw=; h=Subject:To:CC:References:From:Message-ID:Date:User-Agent: MIME-Version:In-Reply-To:Content-Type:Content-Language: Content-Transfer-Encoding:X-Originating-IP:X-ClientProxiedBy; b=BPH7HXZnuY44oYibc6WX2DQVkEELCdtiVj347qOnPW4XEsAX4wpvZ+eyuaaSSDfd8 ZaRcbGPtArCDXcgSjGxT45QM/QnHo2kxG13T4ysr8ql5fd6OyjUyB4uBek8mZAkYir 9AGfP9VYxI/IhvVVadVarfk8etrVNmVLEY9TiJJR8gSZvkgBy5CHJ+l9k3UTEjQivd JDEpHoG9NcJIEakKcUfvleFkcNn0YBg/oVt6NcTHVaf8LNlPYJg8/YRbT/wxT4rgtb fNTamGyglm/eXszVkgpwRj60/7G30icCqVcJ4lFAHr7cV6z/GHid3B9hDIrBXI1b6W WFEGuREUo/fpQ== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 1/10/21 11:30 AM, Linus Torvalds wrote: > On Sat, Jan 9, 2021 at 7:51 PM Linus Torvalds > wrote: > > Just having a bit in the page flags for "I already made this > exclusive, and fork() will keep it so" is I feel the best option. In a > way, "page is writable" right now _is_ that bit. By definition, if you > have a writable page in an anonymous mapping, that is an exclusive > user. > > But because "writable" has these interactions with other operations, > it would be better if it was a harder bit than that "maybe_pinned()", > though. It would be lovely if a regular non-pinning write-GUP just > always set it, for example. > > "maybe_pinned()" is good enough for the fork() case, which is the one > that matters for long-term pinning. But it's admittedly not perfect. > There is at least one way to improve this part of it--maybe. IMHO, a lot of the bits in page _refcount are still being wasted (even after GUP_PIN_COUNTING_BIAS overloading), because it's unlikely that there are many callers of gup/pup per page. If anyone points out that that is wrong, then the rest of this falls apart, but...if we were to make a rule that "only a very few FOLL_GET or FOLL_PIN pins are ever simultaneously allowed on a given page", then several things become possible: 1) "maybe dma pinned" could become "very likely dma-pinned", by allowing perhaps 23 bits for normal page refcounts (leaving just 8 bits for dma pins), thus all but ensuring no aliasing between normal refcounts and dma pins. The name change below to "likely" is not actually necessary, I'm just using it to make the point clear: diff --git a/include/linux/mm.h b/include/linux/mm.h index ecdf8a8cd6ae..28f7f64e888f 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1241,7 +1241,7 @@ static inline void put_page(struct page *page) * get_user_pages and page_mkclean and other calls that race to set up page * table entries. 
  */
-#define GUP_PIN_COUNTING_BIAS (1U << 10)
+#define GUP_PIN_COUNTING_BIAS (1U << 23)

 void unpin_user_page(struct page *page);
 void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
@@ -1274,7 +1274,7 @@ void unpin_user_pages(struct page **pages, unsigned long npages);
  * @Return: True, if it is likely that the page has been "dma-pinned".
  *          False, if the page is definitely not dma-pinned.
  */
-static inline bool page_maybe_dma_pinned(struct page *page)
+static inline bool page_likely_dma_pinned(struct page *page)
 {
 	if (hpage_pincount_available(page))
 		return compound_pincount(page) > 0;

2) Doing (1) allows, arguably, failing mprotect calls if
page_likely_dma_pinned() returns true, because the level of confidence is
much higher now.

3) An additional counter, for normal gup (FOLL_GET), is possible: divide
up the top 8 bits into two separate counters of just 4 bits each. Only
allow either FOLL_GET or FOLL_PIN (and btw, I'm now mentally equating
FOLL_PIN with FOLL_LONGTERM), not both, on a given page. Like this:

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8cd6ae..ce5af27e3057 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1241,7 +1241,8 @@ static inline void put_page(struct page *page)
  * get_user_pages and page_mkclean and other calls that race to set up page
  * table entries.
  */
-#define GUP_PIN_COUNTING_BIAS (1U << 10)
+#define DMA_PIN_COUNTING_BIAS (1U << 23)
+#define GUP_PIN_COUNTING_BIAS (1U << 27)

 void unpin_user_page(struct page *page);
 void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,

So:

FOLL_PIN: would use DMA_PIN_COUNTING_BIAS to increment the page refcount.
These are long-term pins for dma.

FOLL_GET: would use GUP_PIN_COUNTING_BIAS to increment the page refcount.
These are not long-term pins.

Doing (3) would require yet another release call variant:
unpin_user_pages() would need to take a flag to say which refcount to
release: FOLL_GET or FOLL_PIN. However, I think that's relatively easy
(compared to the original effort of teasing apart generic put_page()
calls into unpin_user_pages() calls). We already have all the
unpin_user_pages() calls in place, so it's "merely" a matter of adding a
flag to 74 call sites.

The real question is whether we can get away with supporting only a very
low count of FOLL_PIN and FOLL_GET pages. Can we?

thanks,
--
John Hubbard
NVIDIA
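To make the two-counter accounting in (3) concrete, here is a minimal,
standalone userspace sketch of how it could behave. It is not the kernel
code: struct toy_page and the toy_*() helpers are hypothetical names used
only for illustration, and only the two bias constants come from the diff
above.

/*
 * Sketch of keeping FOLL_PIN and FOLL_GET in separate 4-bit fields of a
 * single refcount. Hypothetical names; not include/linux/mm.h.
 */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define DMA_PIN_COUNTING_BIAS (1U << 23)   /* FOLL_PIN (long-term dma) pins */
#define GUP_PIN_COUNTING_BIAS (1U << 27)   /* plain FOLL_GET references */

struct toy_page {
	atomic_uint refcount;              /* stand-in for page->_refcount */
};

static void toy_pin_user_page(struct toy_page *p)          /* FOLL_PIN */
{
	atomic_fetch_add(&p->refcount, DMA_PIN_COUNTING_BIAS);
}

static void toy_get_user_page(struct toy_page *p)          /* FOLL_GET */
{
	atomic_fetch_add(&p->refcount, GUP_PIN_COUNTING_BIAS);
}

/* Release takes a flag naming the refcount to drop, as suggested in (3). */
static void toy_unpin_user_page(struct toy_page *p, bool foll_pin)
{
	atomic_fetch_sub(&p->refcount, foll_pin ?
			 DMA_PIN_COUNTING_BIAS : GUP_PIN_COUNTING_BIAS);
}

/* "Likely dma-pinned": any count present in the 4-bit FOLL_PIN field. */
static bool toy_page_likely_dma_pinned(struct toy_page *p)
{
	return ((atomic_load(&p->refcount) / DMA_PIN_COUNTING_BIAS) & 0xf) != 0;
}

int main(void)
{
	struct toy_page page = { .refcount = 3 };  /* a few normal references */

	toy_pin_user_page(&page);                  /* long-term dma pin */
	assert(toy_page_likely_dma_pinned(&page));

	toy_get_user_page(&page);                  /* FOLL_GET uses its own field */
	toy_unpin_user_page(&page, true);          /* drop the FOLL_PIN pin */
	assert(!toy_page_likely_dma_pinned(&page));

	toy_unpin_user_page(&page, false);         /* drop the FOLL_GET reference */
	printf("final refcount: %u\n", atomic_load(&page.refcount));
	return 0;
}

With 4 bits per field, this only works if at most 15 FOLL_PIN pins and 15
FOLL_GET references are ever outstanding on a page at once, which is
exactly the "very low count" question raised at the end of the mail.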