Date: Tue, 23 Feb 2021 21:00:17 -0400
From: Jason Gunthorpe
To: Dan Williams
Cc: Joao Martins, Linux MM, Ira Weiny, linux-nvdimm, Matthew Wilcox,
 Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton, Ralph Campbell
Subject: Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps
Message-ID: <20210224010017.GQ2643399@ziepe.ca>
References: <20201208172901.17384-1-joao.m.martins@oracle.com>
 <6a18179e-65f7-367d-89a9-d5162f10fef0@oracle.com>
 <20210223185435.GO2643399@ziepe.ca>
 <20210223230723.GP2643399@ziepe.ca>

On Tue, Feb 23, 2021 at 04:14:01PM -0800, Dan Williams wrote:
> [ add Ralph ]
>
> On Tue, Feb 23, 2021 at 3:07 PM Jason Gunthorpe wrote:
> >
> > On Tue, Feb 23, 2021 at 02:48:20PM -0800, Dan Williams wrote:
> > > On Tue, Feb 23, 2021 at 10:54 AM Jason Gunthorpe wrote:
> > > >
> > > > On Tue, Feb 23, 2021 at 08:44:52AM -0800, Dan Williams wrote:
> > > >
> > > > > > The downside would be one extra lookup in dev_pagemap tree
> > > > > > for other pgmap->types (P2P, FSDAX, PRIVATE). But just one
> > > > > > per gup-fast() call.
> > > > >
> > > > > I'd guess a dev_pagemap lookup is faster than a get_user_pages slow
> > > > > path. It should be measurable that this change is at least as fast or
> > > > > faster than falling back to the slow path, but it would be good to
> > > > > measure.
> > > >
> > > > What is the dev_pagemap thing doing in gup fast anyhow?
> > > >
> > > > I've been wondering for a while..
> > >
> > > It's there to synchronize against dax-device removal. The device will
> > > suspend removal awaiting all page references to be dropped, but
> > > gup-fast could be racing device removal. So gup-fast checks for
> > > pte_devmap() to grab a live reference to the device before assuming it
> > > can pin a page.
> >
> > From the perspective of CPU A it can't tell if CPU B is doing a HW
> > page table walk or a GUP fast when it invalidates a page table. The
> > design of gup-fast is supposed to be the same as the design of a HW
> > page table walk, and the tlb invalidate CPU A does when removing a
> > page from a page table is supposed to serialize against both a HW page
> > table walk and gup-fast.
> >
> > Given that the HW page table walker does not do dev_pagemap stuff, why
> > does gup-fast?
>
> gup-fast historically assumed that the 'struct page' and memory
> backing the page-table walk could not physically be removed from the
> system during its walk because those pages were allocated from the
> page allocator before being mapped into userspace.

No, I'd say gup-fast assumes that any non-special PTE it finds in a
page table must have a struct page.

If something wants to remove that struct page it must first remove all
the PTEs pointing at it from the entire system and flush the TLBs,
which directly prevents a future gup-fast from running and trying to
access the struct page.
No extra locking needed

> implied elevated reference on any page that gup-fast would be asked to
> walk, or pte_special() is there to "say wait, nevermind this isn't a
> page allocator page fallback to gup-slow()".

pte_special says there is no struct page, and some of those cases can
be fixed up in gup-slow.

> > Can you sketch the exact race this is protecting against?
>
> Thread1 mmaps /mnt/daxfile1 from a "mount -o dax" filesystem and
> issues direct I/O with that mapping as the target buffer, Thread2 does
> "echo "namespace0.0" > /sys/bus/nd/drivers/nd_pmem/unbind". Without
> the dev_pagemap check reference gup-fast could execute
> get_page(pte_page(pte)) on a page that doesn't even exist anymore
> because the driver unbind has already performed remove_pages().

Surely the unbind either waits for all the VMAs to be destroyed or
zaps them before allowing things to progress to remove_pages()?

Having a situation where the CPU page tables still point at physical
pages that have been removed sounds so crazy/insecure, that can't be
what is happening, can it??

Jason
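
The ordering argued for in this message (zap every PTE and flush the
TLBs first, remove the struct page only afterwards) can be sketched as
a toy, single-threaded C model. Every name below (toy_page, toy_pte,
toy_gup_fast, toy_put_page, toy_remove_page) is a hypothetical
illustration and not a kernel API. The refcount here only tracks
references taken through toy_gup_fast, and the TLB flush is
represented by a comment. This is a sketch of the argument under those
assumptions, not a description of how mm/gup.c is implemented.

```c
/* Toy model only: none of these names exist in the kernel. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_page {
	int refcount;   /* references taken through toy_gup_fast() */
	bool removed;   /* set once the backing struct page is gone */
};

struct toy_pte {
	struct toy_page *page; /* NULL means "not present" */
	bool special;          /* special PTE: no struct page behind it */
};

/*
 * Lockless-style lookup: take a reference only through a present,
 * non-special PTE, then re-check that the PTE still points at the
 * same page. On any failure the caller would use a slow path.
 */
static struct toy_page *toy_gup_fast(const struct toy_pte *ptep)
{
	struct toy_pte snap = *ptep; /* stands in for one atomic read */

	if (!snap.page || snap.special)
		return NULL;

	snap.page->refcount++;
	if (ptep->page != snap.page) { /* PTE changed: back off */
		snap.page->refcount--;
		return NULL;
	}
	return snap.page;
}

static void toy_put_page(struct toy_page *page)
{
	assert(page->refcount > 0);
	page->refcount--;
}

/*
 * Teardown in the order argued for above: clear every PTE that maps
 * the page, then (in a real system) flush the TLBs, and only then
 * remove the page. Returns false while references are outstanding.
 */
static bool toy_remove_page(struct toy_pte *ptes, size_t n,
			    struct toy_page *page)
{
	for (size_t i = 0; i < n; i++)
		if (ptes[i].page == page)
			ptes[i].page = NULL;

	/* a real system would flush the TLBs here */

	if (page->refcount != 0)
		return false;

	page->removed = true;
	return true;
}
```

In this model, once toy_remove_page() has cleared the PTEs, a later
toy_gup_fast() finds nothing to follow and never touches the page, so
the only thing removal has to wait for is references that were taken
before the zap.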