From: Dan Williams
Date: Tue, 23 Feb 2021 17:32:16 -0800
Subject: Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps
To: Jason Gunthorpe
Cc: Joao Martins, Linux MM, Ira Weiny, linux-nvdimm, Matthew Wilcox, Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton, Ralph Campbell
In-Reply-To: <20210224010017.GQ2643399@ziepe.ca>

On Tue, Feb 23, 2021 at 5:00 PM Jason Gunthorpe wrote:
>
> On Tue, Feb 23, 2021 at 04:14:01PM -0800, Dan Williams wrote:
> > [ add Ralph ]
> >
> > On Tue, Feb 23, 2021 at 3:07 PM Jason Gunthorpe wrote:
> > >
> > > On Tue, Feb 23, 2021 at 02:48:20PM -0800, Dan Williams wrote:
> > > > On Tue, Feb 23, 2021 at 10:54 AM Jason Gunthorpe wrote:
> > > > >
> > > > > On Tue, Feb 23, 2021 at 08:44:52AM -0800, Dan Williams wrote:
> > > > >
> > > > > > > The downside would be one extra lookup in dev_pagemap tree
> > > > > > > for other pgmap->types (P2P, FSDAX, PRIVATE). But just one
> > > > > > > per gup-fast() call.
> > > > > >
> > > > > > I'd guess a dev_pagemap lookup is faster than a get_user_pages slow
> > > > > > path. It should be measurable that this change is at least as fast or
> > > > > > faster than falling back to the slow path, but it would be good to
> > > > > > measure.
> > > > >
> > > > > What is the dev_pagemap thing doing in gup fast anyhow?
> > > > >
> > > > > I've been wondering for a while..
> > > >
> > > > It's there to synchronize against dax-device removal. The device will
> > > > suspend removal awaiting all page references to be dropped, but
> > > > gup-fast could be racing device removal. So gup-fast checks for
> > > > pte_devmap() to grab a live reference to the device before assuming it
> > > > can pin a page.
> > >
> > > From the perspective of CPU A it can't tell if CPU B is doing a HW
> > > page table walk or a GUP fast when it invalidates a page table. The
> > > design of gup-fast is supposed to be the same as the design of a HW
> > > page table walk, and the tlb invalidate CPU A does when removing a
> > > page from a page table is supposed to serialize against both a HW page
> > > table walk and gup-fast.
> > >
> > > Given that the HW page table walker does not do dev_pagemap stuff, why
> > > does gup-fast?
> >
> > gup-fast historically assumed that the 'struct page' and memory
> > backing the page-table walk could not physically be removed from the
> > system during its walk because those pages were allocated from the
> > page allocator before being mapped into userspace.
>
> No, I'd say gup-fast assumes that any non-special PTE it finds in a
> page table must have a struct page.
>
> If something wants to remove that struct page it must first remove all
> the PTEs pointing at it from the entire system and flush the TLBs,
> which directly prevents a future gup-fast from running and trying to
> access the struct page. No extra locking needed
>
> > implied elevated reference on any page that gup-fast would be asked to
> > walk, or pte_special() is there to "say wait, nevermind this isn't a
> > page allocator page fallback to gup-slow()".
>
> pte_special says there is no struct page, and some of those cases can
> be fixed up in gup-slow.
>
> > > Can you sketch the exact race this is protecting against?
> >
> > Thread1 mmaps /mnt/daxfile1 from a "mount -o dax" filesystem and
> > issues direct I/O with that mapping as the target buffer, Thread2 does
> > "echo "namespace0.0" > /sys/bus/nd/drivers/nd_pmem/unbind". Without
> > the dev_pagemap check reference gup-fast could execute
> > get_page(pte_page(pte)) on a page that doesn't even exist anymore
> > because the driver unbind has already performed remove_pages().
>
> Surely the unbind either waits for all the VMAs to be destroyed or
> zaps them before allowing things to progress to remove_pages()?

If we're talking about device-dax this is precisely what it does, zaps
and prevents new faults from resolving, but filesystem-dax...

> Having a situation where the CPU page tables still point at physical
> pages that have been removed sounds so crazy/insecure, that can't be
> what is happening, can it??

Hmm, that may be true and an original dax bug! The unbind of a
block-device from underneath the filesystem does trigger the
filesystem to emergency shutdown / go read-only, but unless that
process also includes a global zap of all dax mappings not only is
that violating expectations of "page-tables to disappearing memory",
but the filesystem may also want to guarantee that no further dax
writes can happen after shutdown. Right now I believe it only assumes
that mmap I/O will come from page writeback so there's no need to
bother applications with mappings to page cache, but dax mappings need
to be ripped away.

/me goes to look at what filesystems guarantee when the block-device is
surprise removed out from under them.

In any event, this accelerates the effort to go implement
fs-global-dax-zap at the request of the device driver.
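
The pte_devmap() check described in the thread amounts to a "take a live
reference or fall back" pattern. What follows is a minimal userspace sketch
of that idea in C11, not kernel code. struct toy_pgmap, toy_pgmap_init(),
toy_pgmap_tryget_live(), toy_pgmap_put(), toy_pgmap_kill(),
toy_gup_fast_one() and toy_demo() are hypothetical names invented for
illustration, and the counter-plus-dying-flag scheme is only an assumed
approximation of how a dev_pagemap reference behaves. It models the existing
check only; it does not settle whether zapping PTEs before removal would
make the check unnecessary.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a dev_pagemap: a user count plus a "dying" flag. */
struct toy_pgmap {
	atomic_long refs;   /* references currently held by fast-path walkers */
	atomic_bool dying;  /* set once device removal has started */
};

static void toy_pgmap_init(struct toy_pgmap *p)
{
	atomic_init(&p->refs, 0);
	atomic_init(&p->dying, false);
}

/* Take a reference only if removal has not started ("tryget live"). */
static bool toy_pgmap_tryget_live(struct toy_pgmap *p)
{
	atomic_fetch_add(&p->refs, 1);
	/* Check after the increment so a concurrent kill sees our reference. */
	if (atomic_load(&p->dying)) {
		atomic_fetch_sub(&p->refs, 1);
		return false;
	}
	return true;
}

static void toy_pgmap_put(struct toy_pgmap *p)
{
	atomic_fetch_sub(&p->refs, 1);
}

/* Removal side: mark the map dying, then report whether no references remain. */
static bool toy_pgmap_kill(struct toy_pgmap *p)
{
	atomic_store(&p->dying, true);
	return atomic_load(&p->refs) == 0;
}

/*
 * Toy fast-path step for one PTE. Returns 0 if the page was pinned and -1 if
 * the caller must fall back to the slow path.
 */
static int toy_gup_fast_one(struct toy_pgmap *p, bool is_devmap, long *pins)
{
	if (is_devmap && !toy_pgmap_tryget_live(p))
		return -1;
	(*pins)++;              /* stands in for taking the page reference */
	if (is_devmap)
		toy_pgmap_put(p);   /* waiting on the page pin itself is not modeled */
	return 0;
}

/* Single-threaded walk-through; returns 0 when each step behaves as described. */
int toy_demo(void)
{
	struct toy_pgmap pgmap;
	long pins = 0;

	toy_pgmap_init(&pgmap);
	int before = toy_gup_fast_one(&pgmap, true, &pins); /* device still live */
	bool idle = toy_pgmap_kill(&pgmap);                 /* unbind begins */
	int after = toy_gup_fast_one(&pgmap, true, &pins);  /* must be refused */

	printf("before=%d idle=%d after=%d pins=%ld\n", before, idle, after, pins);
	assert(before == 0 && idle && after == -1 && pins == 1);
	return 0;
}
```

Calling toy_demo() from a main() prints "before=0 idle=1 after=-1 pins=1":
the pin succeeds while the map is live and is refused once removal has
started. The increment-then-check order in toy_pgmap_tryget_live() is the
design choice that matters here, because it guarantees that the removal
side either observes the walker's reference or the walker observes the
dying flag.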