From: Dan Williams <dan.j.williams@intel.com> To: Jason Gunthorpe <jgg@ziepe.ca> Cc: Joao Martins <joao.m.martins@oracle.com>, Linux MM <linux-mm@kvack.org>, linux-nvdimm <linux-nvdimm@lists.01.org>, Matthew Wilcox <willy@infradead.org>, Muchun Song <songmuchun@bytedance.com>, Mike Kravetz <mike.kravetz@oracle.com>, Andrew Morton <akpm@linux-foundation.org>, Ralph Campbell <rcampbell@nvidia.com> Subject: Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps Date: Tue, 23 Feb 2021 17:32:16 -0800 [thread overview] Message-ID: <CAPcyv4i+PZhYZiePf2PaH0dT5jDfkmkDX-3usQy1fAhf6LPyfw@mail.gmail.com> (raw) In-Reply-To: <20210224010017.GQ2643399@ziepe.ca> On Tue, Feb 23, 2021 at 5:00 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Tue, Feb 23, 2021 at 04:14:01PM -0800, Dan Williams wrote: > > [ add Ralph ] > > > > On Tue, Feb 23, 2021 at 3:07 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > > > On Tue, Feb 23, 2021 at 02:48:20PM -0800, Dan Williams wrote: > > > > On Tue, Feb 23, 2021 at 10:54 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > > > > > > > On Tue, Feb 23, 2021 at 08:44:52AM -0800, Dan Williams wrote: > > > > > > > > > > > > The downside would be one extra lookup in dev_pagemap tree > > > > > > > for other pgmap->types (P2P, FSDAX, PRIVATE). But just one > > > > > > > per gup-fast() call. > > > > > > > > > > > > I'd guess a dev_pagemap lookup is faster than a get_user_pages slow > > > > > > path. It should be measurable that this change is at least as fast or > > > > > > faster than falling back to the slow path, but it would be good to > > > > > > measure. > > > > > > > > > > What is the dev_pagemap thing doing in gup fast anyhow? > > > > > > > > > > I've been wondering for a while.. > > > > > > > > It's there to synchronize against dax-device removal. The device will > > > > suspend removal awaiting all page references to be dropped, but > > > > gup-fast could be racing device removal. So gup-fast checks for > > > > pte_devmap() to grab a live reference to the device before assuming it > > > > can pin a page. > > > > > > From the perspective of CPU A it can't tell if CPU B is doing a HW > > > page table walk or a GUP fast when it invalidates a page table. The > > > design of gup-fast is supposed to be the same as the design of a HW > > > page table walk, and the tlb invalidate CPU A does when removing a > > > page from a page table is supposed to serialize against both a HW page > > > table walk and gup-fast. > > > > > > Given that the HW page table walker does not do dev_pagemap stuff, why > > > does gup-fast? > > > > gup-fast historically assumed that the 'struct page' and memory > > backing the page-table walk could not physically be removed from the > > system during its walk because those pages were allocated from the > > page allocator before being mapped into userspace. > > No, I'd say gup-fast assumes that any non-special PTE it finds in a > page table must have a struct page. > > If something wants to remove that struct page it must first remove all > the PTEs pointing at it from the entire system and flush the TLBs, > which directly prevents a future gup-fast from running and trying to > access the struct page. No extra locking needed > > > implied elevated reference on any page that gup-fast would be asked to > > walk, or pte_special() is there to "say wait, nevermind this isn't a > > page allocator page fallback to gup-slow()". > > pte_special says there is no struct page, and some of those cases can > be fixed up in gup-slow. > > > > Can you sketch the exact race this is protecting against? > > > > Thread1 mmaps /mnt/daxfile1 from a "mount -o dax" filesystem and > > issues direct I/O with that mapping as the target buffer, Thread2 does > > "echo "namespace0.0" > /sys/bus/nd/drivers/nd_pmem/unbind". Without > > the dev_pagemap check reference gup-fast could execute > > get_page(pte_page(pte)) on a page that doesn't even exist anymore > > because the driver unbind has already performed remove_pages(). > > Surely the unbind either waits for all the VMAs to be destroyed or > zaps them before allowing things to progress to remove_pages()? If we're talking about device-dax this is precisely what it does, zaps and prevents new faults from resolving, but filesystem-dax... > Having a situation where the CPU page tables still point at physical > pages that have been removed sounds so crazy/insecure, that can't be > what is happening, can it?? Hmm, that may be true and an original dax bug! The unbind of a block-device from underneath the filesystem does trigger the filesystem to emergency shutdown / go read-only, but unless that process also includes a global zap of all dax mappings not only is that violating expectations of "page-tables to disappearing memory", but the filesystem may also want to guarantee that no further dax writes can happen after shutdown. Right now I believe it only assumes that mmap I/O will come from page writeback so there's no need to bother applications with mappings to page cache, but dax mappings need to be ripped away. /me goes to look at what filesytems guarantee when the block-device is surprise removed out from under them. In any event, this accelerates the effort to go implement fs-global-dax-zap at the request of the device driver. _______________________________________________ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-leave@lists.01.org
WARNING: multiple messages have this Message-ID (diff)
From: Dan Williams <dan.j.williams@intel.com> To: Jason Gunthorpe <jgg@ziepe.ca> Cc: Joao Martins <joao.m.martins@oracle.com>, Linux MM <linux-mm@kvack.org>, Ira Weiny <ira.weiny@intel.com>, linux-nvdimm <linux-nvdimm@lists.01.org>, Matthew Wilcox <willy@infradead.org>, Jane Chu <jane.chu@oracle.com>, Muchun Song <songmuchun@bytedance.com>, Mike Kravetz <mike.kravetz@oracle.com>, Andrew Morton <akpm@linux-foundation.org>, Ralph Campbell <rcampbell@nvidia.com> Subject: Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps Date: Tue, 23 Feb 2021 17:32:16 -0800 [thread overview] Message-ID: <CAPcyv4i+PZhYZiePf2PaH0dT5jDfkmkDX-3usQy1fAhf6LPyfw@mail.gmail.com> (raw) In-Reply-To: <20210224010017.GQ2643399@ziepe.ca> On Tue, Feb 23, 2021 at 5:00 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Tue, Feb 23, 2021 at 04:14:01PM -0800, Dan Williams wrote: > > [ add Ralph ] > > > > On Tue, Feb 23, 2021 at 3:07 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > > > On Tue, Feb 23, 2021 at 02:48:20PM -0800, Dan Williams wrote: > > > > On Tue, Feb 23, 2021 at 10:54 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > > > > > > > On Tue, Feb 23, 2021 at 08:44:52AM -0800, Dan Williams wrote: > > > > > > > > > > > > The downside would be one extra lookup in dev_pagemap tree > > > > > > > for other pgmap->types (P2P, FSDAX, PRIVATE). But just one > > > > > > > per gup-fast() call. > > > > > > > > > > > > I'd guess a dev_pagemap lookup is faster than a get_user_pages slow > > > > > > path. It should be measurable that this change is at least as fast or > > > > > > faster than falling back to the slow path, but it would be good to > > > > > > measure. > > > > > > > > > > What is the dev_pagemap thing doing in gup fast anyhow? > > > > > > > > > > I've been wondering for a while.. > > > > > > > > It's there to synchronize against dax-device removal. The device will > > > > suspend removal awaiting all page references to be dropped, but > > > > gup-fast could be racing device removal. So gup-fast checks for > > > > pte_devmap() to grab a live reference to the device before assuming it > > > > can pin a page. > > > > > > From the perspective of CPU A it can't tell if CPU B is doing a HW > > > page table walk or a GUP fast when it invalidates a page table. The > > > design of gup-fast is supposed to be the same as the design of a HW > > > page table walk, and the tlb invalidate CPU A does when removing a > > > page from a page table is supposed to serialize against both a HW page > > > table walk and gup-fast. > > > > > > Given that the HW page table walker does not do dev_pagemap stuff, why > > > does gup-fast? > > > > gup-fast historically assumed that the 'struct page' and memory > > backing the page-table walk could not physically be removed from the > > system during its walk because those pages were allocated from the > > page allocator before being mapped into userspace. > > No, I'd say gup-fast assumes that any non-special PTE it finds in a > page table must have a struct page. > > If something wants to remove that struct page it must first remove all > the PTEs pointing at it from the entire system and flush the TLBs, > which directly prevents a future gup-fast from running and trying to > access the struct page. No extra locking needed > > > implied elevated reference on any page that gup-fast would be asked to > > walk, or pte_special() is there to "say wait, nevermind this isn't a > > page allocator page fallback to gup-slow()". > > pte_special says there is no struct page, and some of those cases can > be fixed up in gup-slow. > > > > Can you sketch the exact race this is protecting against? > > > > Thread1 mmaps /mnt/daxfile1 from a "mount -o dax" filesystem and > > issues direct I/O with that mapping as the target buffer, Thread2 does > > "echo "namespace0.0" > /sys/bus/nd/drivers/nd_pmem/unbind". Without > > the dev_pagemap check reference gup-fast could execute > > get_page(pte_page(pte)) on a page that doesn't even exist anymore > > because the driver unbind has already performed remove_pages(). > > Surely the unbind either waits for all the VMAs to be destroyed or > zaps them before allowing things to progress to remove_pages()? If we're talking about device-dax this is precisely what it does, zaps and prevents new faults from resolving, but filesystem-dax... > Having a situation where the CPU page tables still point at physical > pages that have been removed sounds so crazy/insecure, that can't be > what is happening, can it?? Hmm, that may be true and an original dax bug! The unbind of a block-device from underneath the filesystem does trigger the filesystem to emergency shutdown / go read-only, but unless that process also includes a global zap of all dax mappings not only is that violating expectations of "page-tables to disappearing memory", but the filesystem may also want to guarantee that no further dax writes can happen after shutdown. Right now I believe it only assumes that mmap I/O will come from page writeback so there's no need to bother applications with mappings to page cache, but dax mappings need to be ripped away. /me goes to look at what filesytems guarantee when the block-device is surprise removed out from under them. In any event, this accelerates the effort to go implement fs-global-dax-zap at the request of the device driver.
next prev parent reply other threads:[~2021-02-24 1:32 UTC|newest] Thread overview: 147+ messages / expand[flat|nested] mbox.gz Atom feed top 2020-12-08 17:28 [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins 2020-12-08 17:28 ` Joao Martins 2020-12-08 17:28 ` [PATCH RFC 1/9] memremap: add ZONE_DEVICE support for compound pages Joao Martins 2020-12-08 17:28 ` Joao Martins 2020-12-09 5:59 ` John Hubbard 2020-12-09 5:59 ` John Hubbard 2020-12-09 6:33 ` Matthew Wilcox 2020-12-09 6:33 ` Matthew Wilcox 2020-12-09 13:12 ` Joao Martins 2020-12-09 13:12 ` Joao Martins 2021-02-20 1:43 ` Dan Williams 2021-02-20 1:43 ` Dan Williams 2021-02-22 11:24 ` Joao Martins 2021-02-22 11:24 ` Joao Martins 2021-02-22 20:37 ` Dan Williams 2021-02-22 20:37 ` Dan Williams 2021-02-23 15:46 ` Joao Martins 2021-02-23 15:46 ` Joao Martins 2021-02-23 16:50 ` Dan Williams 2021-02-23 16:50 ` Dan Williams 2021-02-23 17:18 ` Joao Martins 2021-02-23 17:18 ` Joao Martins 2021-02-23 18:18 ` Dan Williams 2021-02-23 18:18 ` Dan Williams 2021-03-10 18:12 ` Joao Martins 2021-03-10 18:12 ` Joao Martins 2021-03-12 5:54 ` Dan Williams 2021-03-12 5:54 ` Dan Williams 2021-02-20 1:24 ` Dan Williams 2021-02-20 1:24 ` Dan Williams 2021-02-22 11:09 ` Joao Martins 2021-02-22 11:09 ` Joao Martins 2020-12-08 17:28 ` [PATCH RFC 2/9] sparse-vmemmap: Consolidate arguments in vmemmap section populate Joao Martins 2020-12-08 17:28 ` Joao Martins 2020-12-09 6:16 ` John Hubbard 2020-12-09 6:16 ` John Hubbard 2020-12-09 13:51 ` Joao Martins 2020-12-09 13:51 ` Joao Martins 2021-02-20 1:49 ` Dan Williams 2021-02-20 1:49 ` Dan Williams 2021-02-22 11:26 ` Joao Martins 2021-02-22 11:26 ` Joao Martins 2020-12-08 17:28 ` [PATCH RFC 3/9] sparse-vmemmap: Reuse vmemmap areas for a given mhp_params::align Joao Martins 2020-12-08 17:28 ` Joao Martins 2020-12-08 17:38 ` Joao Martins 2020-12-08 17:38 ` Joao Martins 2020-12-08 17:28 ` [PATCH RFC 3/9] sparse-vmemmap: Reuse vmemmap areas for a given page size Joao Martins 2020-12-08 17:28 ` Joao Martins 2021-02-20 3:34 ` Dan Williams 2021-02-20 3:34 ` Dan Williams 2021-02-22 11:42 ` Joao Martins 2021-02-22 11:42 ` Joao Martins 2021-02-22 22:40 ` Dan Williams 2021-02-22 22:40 ` Dan Williams 2021-02-23 15:46 ` Joao Martins 2021-02-23 15:46 ` Joao Martins 2020-12-08 17:28 ` [PATCH RFC 4/9] mm/page_alloc: Reuse tail struct pages for compound pagemaps Joao Martins 2020-12-08 17:28 ` Joao Martins 2021-02-20 6:17 ` Dan Williams 2021-02-20 6:17 ` Dan Williams 2021-02-22 12:01 ` Joao Martins 2021-02-22 12:01 ` Joao Martins 2020-12-08 17:28 ` [PATCH RFC 5/9] device-dax: Compound pagemap support Joao Martins 2020-12-08 17:28 ` Joao Martins 2020-12-08 17:28 ` [PATCH RFC 6/9] mm/gup: Grab head page refcount once for group of subpages Joao Martins 2020-12-08 17:28 ` Joao Martins 2020-12-08 19:49 ` Jason Gunthorpe 2020-12-09 11:05 ` Joao Martins 2020-12-09 11:05 ` Joao Martins 2020-12-09 15:15 ` Jason Gunthorpe 2020-12-09 16:02 ` Joao Martins 2020-12-09 16:02 ` Joao Martins 2020-12-09 16:24 ` Jason Gunthorpe 2020-12-09 17:27 ` Joao Martins 2020-12-09 17:27 ` Joao Martins 2020-12-09 18:14 ` Matthew Wilcox 2020-12-09 18:14 ` Matthew Wilcox 2020-12-09 19:08 ` Jason Gunthorpe 2020-12-10 15:43 ` Joao Martins 2020-12-10 15:43 ` Joao Martins 2020-12-09 4:40 ` John Hubbard 2020-12-09 4:40 ` John Hubbard 2020-12-09 13:44 ` Joao Martins 2020-12-09 13:44 ` Joao Martins 2020-12-08 17:28 ` [PATCH RFC 7/9] mm/gup: Decrement head page " Joao Martins 2020-12-08 17:28 ` Joao Martins 2020-12-08 19:34 ` Jason Gunthorpe 2020-12-09 5:06 ` John Hubbard 2020-12-09 5:06 ` John Hubbard 2020-12-09 13:43 ` Jason Gunthorpe 2020-12-09 12:17 ` Joao Martins 2020-12-09 12:17 ` Joao Martins 2020-12-17 19:05 ` Joao Martins 2020-12-17 19:05 ` Joao Martins 2020-12-17 20:05 ` Jason Gunthorpe 2020-12-17 22:34 ` Joao Martins 2020-12-17 22:34 ` Joao Martins 2020-12-18 14:25 ` Jason Gunthorpe 2020-12-19 2:06 ` John Hubbard 2020-12-19 2:06 ` John Hubbard 2020-12-19 13:10 ` Joao Martins 2020-12-19 13:10 ` Joao Martins 2020-12-08 17:29 ` [PATCH RFC 8/9] RDMA/umem: batch page unpin in __ib_mem_release() Joao Martins 2020-12-08 17:29 ` Joao Martins 2020-12-08 19:29 ` Jason Gunthorpe 2020-12-09 10:59 ` Joao Martins 2020-12-09 10:59 ` Joao Martins 2020-12-19 13:15 ` Joao Martins 2020-12-19 13:15 ` Joao Martins 2020-12-09 5:18 ` John Hubbard 2020-12-09 5:18 ` John Hubbard 2020-12-08 17:29 ` [PATCH RFC 9/9] mm: Add follow_devmap_page() for devdax vmas Joao Martins 2020-12-08 17:29 ` Joao Martins 2020-12-08 19:57 ` Jason Gunthorpe 2020-12-09 8:05 ` Christoph Hellwig 2020-12-09 8:05 ` Christoph Hellwig 2020-12-09 11:19 ` Joao Martins 2020-12-09 11:19 ` Joao Martins 2020-12-09 5:23 ` John Hubbard 2020-12-09 5:23 ` John Hubbard 2020-12-09 9:38 ` [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps David Hildenbrand 2020-12-09 9:38 ` David Hildenbrand 2020-12-09 9:52 ` [External] " Muchun Song 2020-12-09 9:52 ` Muchun Song 2021-02-20 1:18 ` Dan Williams 2021-02-20 1:18 ` Dan Williams 2021-02-22 11:06 ` Joao Martins 2021-02-22 11:06 ` Joao Martins 2021-02-22 14:32 ` Joao Martins 2021-02-22 14:32 ` Joao Martins 2021-02-23 16:28 ` Joao Martins 2021-02-23 16:28 ` Joao Martins 2021-02-23 16:44 ` Dan Williams 2021-02-23 16:44 ` Dan Williams 2021-02-23 17:15 ` Joao Martins 2021-02-23 17:15 ` Joao Martins 2021-02-23 18:15 ` Dan Williams 2021-02-23 18:15 ` Dan Williams 2021-02-23 18:54 ` Jason Gunthorpe 2021-02-23 22:48 ` Dan Williams 2021-02-23 22:48 ` Dan Williams 2021-02-23 23:07 ` Jason Gunthorpe 2021-02-24 0:14 ` Dan Williams 2021-02-24 0:14 ` Dan Williams 2021-02-24 1:00 ` Jason Gunthorpe 2021-02-24 1:32 ` Dan Williams [this message] 2021-02-24 1:32 ` Dan Williams
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=CAPcyv4i+PZhYZiePf2PaH0dT5jDfkmkDX-3usQy1fAhf6LPyfw@mail.gmail.com \ --to=dan.j.williams@intel.com \ --cc=akpm@linux-foundation.org \ --cc=jgg@ziepe.ca \ --cc=joao.m.martins@oracle.com \ --cc=linux-mm@kvack.org \ --cc=linux-nvdimm@lists.01.org \ --cc=mike.kravetz@oracle.com \ --cc=rcampbell@nvidia.com \ --cc=songmuchun@bytedance.com \ --cc=willy@infradead.org \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.