From: Dan Williams <dan.j.williams@intel.com> To: Jason Gunthorpe <jgg@ziepe.ca> Cc: Joao Martins <joao.m.martins@oracle.com>, Linux MM <linux-mm@kvack.org>, Ira Weiny <ira.weiny@intel.com>, linux-nvdimm <linux-nvdimm@lists.01.org>, Matthew Wilcox <willy@infradead.org>, Jane Chu <jane.chu@oracle.com>, Muchun Song <songmuchun@bytedance.com>, Mike Kravetz <mike.kravetz@oracle.com>, Andrew Morton <akpm@linux-foundation.org>, Ralph Campbell <rcampbell@nvidia.com> Subject: Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps Date: Tue, 23 Feb 2021 17:32:16 -0800 [thread overview] Message-ID: <CAPcyv4i+PZhYZiePf2PaH0dT5jDfkmkDX-3usQy1fAhf6LPyfw@mail.gmail.com> (raw) In-Reply-To: <20210224010017.GQ2643399@ziepe.ca> On Tue, Feb 23, 2021 at 5:00 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Tue, Feb 23, 2021 at 04:14:01PM -0800, Dan Williams wrote: > > [ add Ralph ] > > > > On Tue, Feb 23, 2021 at 3:07 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > > > On Tue, Feb 23, 2021 at 02:48:20PM -0800, Dan Williams wrote: > > > > On Tue, Feb 23, 2021 at 10:54 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > > > > > > > On Tue, Feb 23, 2021 at 08:44:52AM -0800, Dan Williams wrote: > > > > > > > > > > > > The downside would be one extra lookup in dev_pagemap tree > > > > > > > for other pgmap->types (P2P, FSDAX, PRIVATE). But just one > > > > > > > per gup-fast() call. > > > > > > > > > > > > I'd guess a dev_pagemap lookup is faster than a get_user_pages slow > > > > > > path. It should be measurable that this change is at least as fast or > > > > > > faster than falling back to the slow path, but it would be good to > > > > > > measure. > > > > > > > > > > What is the dev_pagemap thing doing in gup fast anyhow? > > > > > > > > > > I've been wondering for a while.. > > > > > > > > It's there to synchronize against dax-device removal. The device will > > > > suspend removal awaiting all page references to be dropped, but > > > > gup-fast could be racing device removal. So gup-fast checks for > > > > pte_devmap() to grab a live reference to the device before assuming it > > > > can pin a page. > > > > > > From the perspective of CPU A it can't tell if CPU B is doing a HW > > > page table walk or a GUP fast when it invalidates a page table. The > > > design of gup-fast is supposed to be the same as the design of a HW > > > page table walk, and the tlb invalidate CPU A does when removing a > > > page from a page table is supposed to serialize against both a HW page > > > table walk and gup-fast. > > > > > > Given that the HW page table walker does not do dev_pagemap stuff, why > > > does gup-fast? > > > > gup-fast historically assumed that the 'struct page' and memory > > backing the page-table walk could not physically be removed from the > > system during its walk because those pages were allocated from the > > page allocator before being mapped into userspace. > > No, I'd say gup-fast assumes that any non-special PTE it finds in a > page table must have a struct page. > > If something wants to remove that struct page it must first remove all > the PTEs pointing at it from the entire system and flush the TLBs, > which directly prevents a future gup-fast from running and trying to > access the struct page. No extra locking needed > > > implied elevated reference on any page that gup-fast would be asked to > > walk, or pte_special() is there to "say wait, nevermind this isn't a > > page allocator page fallback to gup-slow()". > > pte_special says there is no struct page, and some of those cases can > be fixed up in gup-slow. > > > > Can you sketch the exact race this is protecting against? > > > > Thread1 mmaps /mnt/daxfile1 from a "mount -o dax" filesystem and > > issues direct I/O with that mapping as the target buffer, Thread2 does > > "echo "namespace0.0" > /sys/bus/nd/drivers/nd_pmem/unbind". Without > > the dev_pagemap check reference gup-fast could execute > > get_page(pte_page(pte)) on a page that doesn't even exist anymore > > because the driver unbind has already performed remove_pages(). > > Surely the unbind either waits for all the VMAs to be destroyed or > zaps them before allowing things to progress to remove_pages()? If we're talking about device-dax this is precisely what it does, zaps and prevents new faults from resolving, but filesystem-dax... > Having a situation where the CPU page tables still point at physical > pages that have been removed sounds so crazy/insecure, that can't be > what is happening, can it?? Hmm, that may be true and an original dax bug! The unbind of a block-device from underneath the filesystem does trigger the filesystem to emergency shutdown / go read-only, but unless that process also includes a global zap of all dax mappings not only is that violating expectations of "page-tables to disappearing memory", but the filesystem may also want to guarantee that no further dax writes can happen after shutdown. Right now I believe it only assumes that mmap I/O will come from page writeback so there's no need to bother applications with mappings to page cache, but dax mappings need to be ripped away. /me goes to look at what filesytems guarantee when the block-device is surprise removed out from under them. In any event, this accelerates the effort to go implement fs-global-dax-zap at the request of the device driver.
prev parent reply other threads:[~2021-02-24 1:32 UTC|newest] Thread overview: 80+ messages / expand[flat|nested] mbox.gz Atom feed top 2020-12-08 17:28 Joao Martins 2020-12-08 17:28 ` [PATCH RFC 1/9] memremap: add ZONE_DEVICE support for compound pages Joao Martins 2020-12-09 5:59 ` John Hubbard 2020-12-09 6:33 ` Matthew Wilcox 2020-12-09 13:12 ` Joao Martins 2021-02-20 1:43 ` Dan Williams 2021-02-22 11:24 ` Joao Martins 2021-02-22 20:37 ` Dan Williams 2021-02-23 15:46 ` Joao Martins 2021-02-23 16:50 ` Dan Williams 2021-02-23 17:18 ` Joao Martins 2021-02-23 18:18 ` Dan Williams 2021-03-10 18:12 ` Joao Martins 2021-03-12 5:54 ` Dan Williams 2021-02-20 1:24 ` Dan Williams 2021-02-22 11:09 ` Joao Martins 2020-12-08 17:28 ` [PATCH RFC 2/9] sparse-vmemmap: Consolidate arguments in vmemmap section populate Joao Martins 2020-12-09 6:16 ` John Hubbard 2020-12-09 13:51 ` Joao Martins 2021-02-20 1:49 ` Dan Williams 2021-02-22 11:26 ` Joao Martins 2020-12-08 17:28 ` [PATCH RFC 3/9] sparse-vmemmap: Reuse vmemmap areas for a given mhp_params::align Joao Martins 2020-12-08 17:38 ` Joao Martins 2020-12-08 17:28 ` [PATCH RFC 3/9] sparse-vmemmap: Reuse vmemmap areas for a given page size Joao Martins 2021-02-20 3:34 ` Dan Williams 2021-02-22 11:42 ` Joao Martins 2021-02-22 22:40 ` Dan Williams 2021-02-23 15:46 ` Joao Martins 2020-12-08 17:28 ` [PATCH RFC 4/9] mm/page_alloc: Reuse tail struct pages for compound pagemaps Joao Martins 2021-02-20 6:17 ` Dan Williams 2021-02-22 12:01 ` Joao Martins 2020-12-08 17:28 ` [PATCH RFC 5/9] device-dax: Compound pagemap support Joao Martins 2020-12-08 17:28 ` [PATCH RFC 6/9] mm/gup: Grab head page refcount once for group of subpages Joao Martins 2020-12-08 19:49 ` Jason Gunthorpe 2020-12-09 11:05 ` Joao Martins 2020-12-09 15:15 ` Jason Gunthorpe 2020-12-09 16:02 ` Joao Martins 2020-12-09 16:24 ` Jason Gunthorpe 2020-12-09 17:27 ` Joao Martins 2020-12-09 18:14 ` Matthew Wilcox 2020-12-09 19:08 ` Jason Gunthorpe 2020-12-10 15:43 ` Joao Martins 2020-12-09 4:40 ` John Hubbard 2020-12-09 13:44 ` Joao Martins 2020-12-08 17:28 ` [PATCH RFC 7/9] mm/gup: Decrement head page " Joao Martins 2020-12-08 19:34 ` Jason Gunthorpe 2020-12-09 5:06 ` John Hubbard 2020-12-09 13:43 ` Jason Gunthorpe 2020-12-09 12:17 ` Joao Martins 2020-12-17 19:05 ` Joao Martins 2020-12-17 20:05 ` Jason Gunthorpe 2020-12-17 22:34 ` Joao Martins 2020-12-18 14:25 ` Jason Gunthorpe 2020-12-19 2:06 ` John Hubbard 2020-12-19 13:10 ` Joao Martins 2020-12-08 17:29 ` [PATCH RFC 8/9] RDMA/umem: batch page unpin in __ib_mem_release() Joao Martins 2020-12-08 19:29 ` Jason Gunthorpe 2020-12-09 10:59 ` Joao Martins 2020-12-19 13:15 ` Joao Martins 2020-12-09 5:18 ` John Hubbard 2020-12-08 17:29 ` [PATCH RFC 9/9] mm: Add follow_devmap_page() for devdax vmas Joao Martins 2020-12-08 19:57 ` Jason Gunthorpe 2020-12-09 8:05 ` Christoph Hellwig 2020-12-09 11:19 ` Joao Martins 2020-12-09 5:23 ` John Hubbard 2020-12-09 9:38 ` [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps David Hildenbrand 2020-12-09 9:52 ` [External] " Muchun Song 2021-02-20 1:18 ` Dan Williams 2021-02-22 11:06 ` Joao Martins 2021-02-22 14:32 ` Joao Martins 2021-02-23 16:28 ` Joao Martins 2021-02-23 16:44 ` Dan Williams 2021-02-23 17:15 ` Joao Martins 2021-02-23 18:15 ` Dan Williams 2021-02-23 18:54 ` Jason Gunthorpe 2021-02-23 22:48 ` Dan Williams 2021-02-23 23:07 ` Jason Gunthorpe 2021-02-24 0:14 ` Dan Williams 2021-02-24 1:00 ` Jason Gunthorpe 2021-02-24 1:32 ` Dan Williams [this message]
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=CAPcyv4i+PZhYZiePf2PaH0dT5jDfkmkDX-3usQy1fAhf6LPyfw@mail.gmail.com \ --to=dan.j.williams@intel.com \ --cc=akpm@linux-foundation.org \ --cc=ira.weiny@intel.com \ --cc=jane.chu@oracle.com \ --cc=jgg@ziepe.ca \ --cc=joao.m.martins@oracle.com \ --cc=linux-mm@kvack.org \ --cc=linux-nvdimm@lists.01.org \ --cc=mike.kravetz@oracle.com \ --cc=rcampbell@nvidia.com \ --cc=songmuchun@bytedance.com \ --cc=willy@infradead.org \ --subject='Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps' \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).