From: Dan Williams
Date: Tue, 23 Feb 2021 16:14:01 -0800
Subject: Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps
To: Jason Gunthorpe
Cc: Joao Martins, Linux MM, Ira Weiny, linux-nvdimm, Matthew Wilcox,
    Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton, Ralph Campbell
References: <20201208172901.17384-1-joao.m.martins@oracle.com>
    <6a18179e-65f7-367d-89a9-d5162f10fef0@oracle.com>
    <20210223185435.GO2643399@ziepe.ca>
    <20210223230723.GP2643399@ziepe.ca>
In-Reply-To: <20210223230723.GP2643399@ziepe.ca>

[ add Ralph ]

On Tue, Feb 23, 2021 at 3:07 PM Jason Gunthorpe wrote:
>
> On Tue, Feb 23, 2021 at 02:48:20PM -0800, Dan Williams wrote:
> > On Tue, Feb 23, 2021 at 10:54 AM Jason Gunthorpe wrote:
> > >
> > > On Tue, Feb 23, 2021 at 08:44:52AM -0800, Dan Williams wrote:
> > >
> > > > > The downside would be one extra lookup in dev_pagemap tree
> > > > > for other pgmap->types (P2P, FSDAX, PRIVATE). But just one
> > > > > per gup-fast() call.
> > > >
> > > > I'd guess a dev_pagemap lookup is faster than a get_user_pages slow
> > > > path. It should be measurable that this change is at least as fast or
> > > > faster than falling back to the slow path, but it would be good to
> > > > measure.
> > >
> > > What is the dev_pagemap thing doing in gup fast anyhow?
> > >
> > > I've been wondering for a while..
> >
> > It's there to synchronize against dax-device removal. The device will
> > suspend removal awaiting all page references to be dropped, but
> > gup-fast could be racing device removal. So gup-fast checks for
> > pte_devmap() to grab a live reference to the device before assuming it
> > can pin a page.
>
> From the perspective of CPU A it can't tell if CPU B is doing a HW
> page table walk or a GUP fast when it invalidates a page table.
> The design of gup-fast is supposed to be the same as the design of a HW
> page table walk, and the tlb invalidate CPU A does when removing a
> page from a page table is supposed to serialize against both a HW page
> table walk and gup-fast.
>
> Given that the HW page table walker does not do dev_pagemap stuff, why
> does gup-fast?

gup-fast historically assumed that the 'struct page' and the memory
backing the page-table walk could not physically be removed from the
system during its walk, because those pages were allocated from the
page allocator before being mapped into userspace. So there is an
implied elevated reference on any page that gup-fast would be asked to
walk, or pte_special() is there to say "wait, never mind, this isn't a
page allocator page, fall back to gup-slow()". pte_devmap() is there to
say "wait, there is no implied elevated reference for this page, check
and hold dev_pagemap alive until a page reference can be taken". So it
splits the difference between pte_special() and typical page allocator
pages.

> Can you sketch the exact race this is protecting against?

Thread1 mmaps /mnt/daxfile1 from a "mount -o dax" filesystem and
issues direct I/O with that mapping as the target buffer. Thread2 does
"echo "namespace0.0" > /sys/bus/nd/drivers/nd_pmem/unbind". Without
the dev_pagemap reference check, gup-fast could execute
get_page(pte_page(pte)) on a page that doesn't even exist anymore
because the driver unbind has already performed remove_pages().

Effectively, the same percpu_ref that protects the pmem0 block device
from new command submissions while the device is dying also prevents
new dax page references from being taken while the device is dying.

This could be solved with the traditional gup-fast rules if the device
driver could tell the filesystem to unmap all dax files and force them
to re-fault through the gup-slow path to see that the device is now
dying. I'll likely be working on that sooner rather than later given
some of the expectations of the CXL persistent memory "dirty shutdown"
detection.
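
To make the three PTE cases and the "tryget live" rule concrete, here is
a rough userspace sketch in C11. It is not kernel code: every name in it
(pgmap_model, pte_model, gup_fast_one, and so on) is made up for
illustration, and a single atomic word with a "dying" bit stands in for
the real percpu_ref. It only models the dev_pagemap side; the part where
unbind also waits for outstanding page references is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
#define PGMAP_DEAD 0x80000000u

struct pgmap_model {
    atomic_uint state;          /* bit 31: dying flag, low bits: live refs */
};

struct page_model {
    atomic_uint refcount;
};

enum pte_kind { PTE_NORMAL, PTE_SPECIAL, PTE_DEVMAP };

struct pte_model {
    enum pte_kind kind;
    struct page_model *page;
    struct pgmap_model *pgmap;  /* only meaningful for PTE_DEVMAP */
};

enum gup_result { GUP_PINNED, GUP_FAST_BAIL };

/* Take a pgmap reference only while the device is not dying. */
static bool pgmap_tryget_live(struct pgmap_model *p)
{
    unsigned int v = atomic_load(&p->state);

    do {
        if (v & PGMAP_DEAD)
            return false;       /* unbind already started, refuse */
    } while (!atomic_compare_exchange_weak(&p->state, &v, v + 1));
    return true;
}

static void pgmap_put(struct pgmap_model *p)
{
    unsigned int old = atomic_fetch_sub(&p->state, 1);

    assert((old & ~PGMAP_DEAD) > 0);    /* catch an unbalanced put */
}

/* Unbind side: after this, no new live references can be taken. */
static void pgmap_kill(struct pgmap_model *p)
{
    atomic_fetch_or(&p->state, PGMAP_DEAD);
}

/* Unbind side: true once no pgmap references remain outstanding. */
static bool pgmap_idle(struct pgmap_model *p)
{
    return (atomic_load(&p->state) & ~PGMAP_DEAD) == 0;
}

/* One step of the modelled fast path for a single PTE. */
static enum gup_result gup_fast_one(struct pte_model *pte)
{
    switch (pte->kind) {
    case PTE_SPECIAL:
        /* Not a page allocator page: leave it to the slow path. */
        return GUP_FAST_BAIL;
    case PTE_DEVMAP:
        /* No implied reference: hold the pgmap while pinning. */
        if (!pgmap_tryget_live(pte->pgmap))
            return GUP_FAST_BAIL;
        atomic_fetch_add(&pte->page->refcount, 1);
        pgmap_put(pte->pgmap);
        return GUP_PINNED;
    case PTE_NORMAL:
    default:
        /* Page allocator page: the implied reference keeps it valid. */
        atomic_fetch_add(&pte->page->refcount, 1);
        return GUP_PINNED;
    }
}
```

In this model, calling pgmap_kill() before gup_fast_one() on a
PTE_DEVMAP entry makes the fast path bail without touching the page,
which is the Thread1/Thread2 ordering described above. If the pin wins
the race instead, pgmap_idle() stays false until the matching
pgmap_put(), so teardown in the model cannot proceed underneath it.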