Date: Tue, 9 Feb 2021 19:58:17 +0000
From: Matthew Wilcox
To: Jason Gunthorpe
Cc: Laurent Dufour, linux-mm@kvack.org, "Liam R. Howlett", Paul McKenney
Subject: Re: synchronize_rcu in munmap?
Message-ID: <20210209195817.GZ308988@casper.infradead.org>
In-Reply-To: <20210209173822.GH4718@ziepe.ca>

On Tue, Feb 09, 2021 at 01:38:22PM -0400, Jason Gunthorpe wrote:
> On Tue, Feb 09, 2021 at 06:19:35PM +0100, Laurent Dufour wrote:
> > On 09/02/2021 at 15:29, Matthew Wilcox wrote:
> > > On Mon, Feb 08, 2021 at 01:26:43PM +0000, Matthew Wilcox wrote:
> > > > Next problem: /proc/$pid/smaps calls walk_page_vma() which starts out by
> > > > saying:
> > > >         mmap_assert_locked(walk.mm);
> > > > which made me realise that smaps is also going to walk the page tables.
> > > > So the page tables have to be pinned by the existence of the VMA.
> > > > Which means the page tables must be freed by the same RCU callback that
> > > > frees the VMA.  But doing that means that a task which calls mmap();
> > > > munmap(); mmap(); must avoid allocating the same address for the second
> > > > mmap (until the RCU grace period has elapsed), otherwise threads on
> > > > other CPUs may see the stale PTEs instead of the new ones.
> > > >
> > > > Solution 1: Move the page table freeing into the RCU callback, call
> > > > synchronize_rcu() in munmap().
> > > >
> > > > Solution 2: Refcount the VMA and free the page tables on refcount
> > > > dropping to zero.  This doesn't actually work because the stale PTE
> > > > problem still exists.
> > > >
> > > > Solution 3: When unmapping a VMA, instead of erasing the VMA from the
> > > > maple tree, put a "dead" entry in its place.
> > > > Once the RCU freeing and the TLB shootdown has happened, erase the
> > > > entry and it can then be allocated.  If we do that MAP_FIXED will have
> > > > to synchronize_rcu() if it overlaps a dead entry.
> > >
> > > Solution 4: RCU free the page table pages and teach pagewalk.c to
> > > be RCU-safe.  That means that it will have to use rcu_dereference()
> > > or READ_ONCE to dereference (eg) pmdp, but also allows GUP-fast to run
> > > under the rcu read lock instead of disabling interrupts.
> >
> > I might be wrong but my understanding is that the RCU window could not be
> > closed on a CPU where IRQs are disabled. So in a first step GUP-fast might
> > continue to disable interrupts to get safe walking the page directories.
>
> Yes, this is right. PPC already uses RCU for the TLB flush and the
> GUP-fast trick is safe against that.
>
> The comments for PPC say the downside of RCU is having to do an
> allocation in paths that really don't want to fail on memory
> exhaustion
>
> The pagewalk.c needs to call its ops in a sleepable context, otherwise
> it could just use the normal page table locks.. Not sure RCU could be
> fit into here?

Depends on the caller of walk_page_*() whether the ops need to sleep
or not.  The specific problem we're trying to solve here is avoiding
taking the mmap_sem in /proc/$pid/smaps.  Now, we could just disable
interrupts instead of taking the mmap_sem, but I was hoping to do better.
So let's call that Solution 5:

 - smaps disables interrupts while calling pagewalk.
 - pagewalk accepts that it can be called locklessly (uses
   ptep_get_lockless() and so on)
 - smaps figures out how to handle races with khugepaged
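
As a rough illustration of the second bullet, here is a userspace C model of
what a lockless PTE read can look like when the entry is wider than a single
atomic load.  It is a sketch under assumptions, not the kernel's code:
struct model_pte, model_pte_init(), model_pte_set(), model_pte_get_lockless(),
model_pte_present() and model_selfcheck() are hypothetical names, bit 0 is
assumed to be the "present" bit, and C11 atomics stand in for the kernel's
own accessors and barriers.  The reader loads the low half, then the high
half, then rechecks the low half and retries if it changed in between.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: a 64-bit PTE stored as two 32-bit halves. */
struct model_pte {
	_Atomic uint32_t low;	/* assumed to carry the "present" bit (bit 0) */
	_Atomic uint32_t high;
};

static inline void model_pte_init(struct model_pte *ptep)
{
	atomic_init(&ptep->low, 0);
	atomic_init(&ptep->high, 0);
}

/*
 * Writer side of the model: clear the low half first, so a concurrent
 * reader either sees "not present" or notices that the low half changed.
 */
static inline void model_pte_set(struct model_pte *ptep, uint64_t val)
{
	atomic_store(&ptep->low, 0);
	atomic_store(&ptep->high, (uint32_t)(val >> 32));
	atomic_store(&ptep->low, (uint32_t)val);
}

/*
 * Reader side of the model: no lock is taken.  If the low half differs
 * between the first and the last load, the two halves may not belong
 * together, so the read is retried.
 */
static inline uint64_t model_pte_get_lockless(struct model_pte *ptep)
{
	uint32_t low, high;

	do {
		low = atomic_load(&ptep->low);
		high = atomic_load(&ptep->high);
	} while (low != atomic_load(&ptep->low));

	return ((uint64_t)high << 32) | low;
}

static inline bool model_pte_present(uint64_t val)
{
	return (val & 1u) != 0;
}

/* Single-threaded sanity check of the model; it does not exercise races. */
static inline void model_selfcheck(void)
{
	struct model_pte pte;

	model_pte_init(&pte);
	assert(!model_pte_present(model_pte_get_lockless(&pte)));

	model_pte_set(&pte, 0x0000000123456067ULL);
	assert(model_pte_get_lockless(&pte) == 0x0000000123456067ULL);
	assert(model_pte_present(model_pte_get_lockless(&pte)));
}
```

Where a PTE fits in a single machine word, one atomic load would do and the
retry loop is unnecessary.  The model says nothing about the khugepaged race
in the third bullet; that still has to be handled by the caller.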