From: Dan Williams <dan.j.williams@intel.com>
To: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Joao Martins <joao.m.martins@oracle.com>,
	Linux MM <linux-mm@kvack.org>,
	linux-nvdimm <linux-nvdimm@lists.01.org>,
	Matthew Wilcox <willy@infradead.org>,
	Muchun Song <songmuchun@bytedance.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Ralph Campbell <rcampbell@nvidia.com>
Subject: Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps
Date: Tue, 23 Feb 2021 16:14:01 -0800
Message-ID: <CAPcyv4hAHaGZ52TtZxTyYtQQVMKW+MaqYDsDKJe94o-cNZNv4g@mail.gmail.com>
In-Reply-To: <20210223230723.GP2643399@ziepe.ca>

[ add Ralph ]

On Tue, Feb 23, 2021 at 3:07 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Feb 23, 2021 at 02:48:20PM -0800, Dan Williams wrote:
> > On Tue, Feb 23, 2021 at 10:54 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > On Tue, Feb 23, 2021 at 08:44:52AM -0800, Dan Williams wrote:
> > >
> > > > > The downside would be one extra lookup in dev_pagemap tree
> > > > > for other pgmap->types (P2P, FSDAX, PRIVATE). But just one
> > > > > per gup-fast() call.
> > > >
> > > > I'd guess a dev_pagemap lookup is faster than a get_user_pages slow
> > > > path. It should be measurable that this change is at least as fast or
> > > > faster than falling back to the slow path, but it would be good to
> > > > measure.
> > >
> > > What is the dev_pagemap thing doing in gup fast anyhow?
> > >
> > > I've been wondering for a while..
> >
> > It's there to synchronize against dax-device removal. The device will
> > suspend removal awaiting all page references to be dropped, but
> > gup-fast could be racing device removal. So gup-fast checks for
> > pte_devmap() to grab a live reference to the device before assuming it
> > can pin a page.
>
> From the perspective of CPU A it can't tell if CPU B is doing a HW
> page table walk or a GUP fast when it invalidates a page table. The
> design of gup-fast is supposed to be the same as the design of a HW
> page table walk, and the tlb invalidate CPU A does when removing a
> page from a page table is supposed to serialize against both a HW page
> table walk and gup-fast.
>
> Given that the HW page table walker does not do dev_pagemap stuff, why
> does gup-fast?

gup-fast historically assumed that the 'struct page' and memory
backing the page-table walk could not physically be removed from the
system during its walk because those pages were allocated from the
page allocator before being mapped into userspace. So there is an
implied elevated reference on any page that gup-fast would be asked to
walk, or pte_special() is there to say "wait, never mind, this isn't a
page allocator page, fall back to gup-slow()". pte_devmap() is there to
say "wait, there is no implied elevated reference for this page, check
and hold dev_pagemap alive until a page reference can be taken". So it
splits the difference between pte_special() and typical page allocator
pages.
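
A minimal userspace sketch of that three-way split, for illustration
only. The enum and function names are hypothetical and are not kernel
APIs; the sketch assumes nothing beyond the classification described
above.

```c
#include <assert.h>

/* Hypothetical model of the three kinds of PTE gup-fast can meet. */
enum pte_kind {
	PTE_NORMAL,	/* page allocator page */
	PTE_SPECIAL,	/* pte_special() would be true */
	PTE_DEVMAP,	/* pte_devmap() would be true */
};

/* What the fast path does for each kind, per the description above. */
enum gup_fast_action {
	GUP_TAKE_PAGE_REF,	/* implied elevated reference: pin directly */
	GUP_FALL_BACK_TO_SLOW,	/* not a page allocator page: use gup-slow */
	GUP_HOLD_PGMAP_FIRST,	/* no implied reference: hold dev_pagemap,
				 * then take the page reference */
};

static enum gup_fast_action classify_pte(enum pte_kind kind)
{
	switch (kind) {
	case PTE_SPECIAL:
		return GUP_FALL_BACK_TO_SLOW;
	case PTE_DEVMAP:
		return GUP_HOLD_PGMAP_FIRST;
	case PTE_NORMAL:
	default:
		return GUP_TAKE_PAGE_REF;
	}
}
```

Here classify_pte(PTE_DEVMAP) yields GUP_HOLD_PGMAP_FIRST, which is the
middle ground between the other two cases.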

> Can you sketch the exact race this is protecting against?

Thread1 mmaps /mnt/daxfile1 from a "mount -o dax" filesystem and
issues direct I/O with that mapping as the target buffer. Thread2 does
'echo "namespace0.0" > /sys/bus/nd/drivers/nd_pmem/unbind'. Without
the dev_pagemap reference check, gup-fast could execute
get_page(pte_page(pte)) on a page that doesn't even exist anymore
because the driver unbind has already performed remove_pages().

Effectively the same percpu_ref that protects the pmem0 block device
from new command submissions while the device is dying also prevents
new dax page references being taken while the device is dying.
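
The ordering can be sketched in userspace C11, under loud assumptions:
model_ref is a hypothetical single-counter stand-in for the kernel's
percpu_ref (the real one is per-CPU and RCU-based), and model_page only
carries a refcount. None of these names exist in the kernel. The point
is only that the "live" check gates the page reference.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for percpu_ref: count plus a "dead" bit. */
#define MODEL_REF_DEAD (1UL << 31)

struct model_ref {
	atomic_ulong state;
};

struct model_page {
	atomic_int refcount;
};

/* Take a reference only if the ref has not been killed. */
static bool model_ref_tryget_live(struct model_ref *ref)
{
	unsigned long old = atomic_load(&ref->state);

	do {
		if (old & MODEL_REF_DEAD)
			return false;
	} while (!atomic_compare_exchange_weak(&ref->state, &old, old + 1));
	return true;
}

static void model_ref_put(struct model_ref *ref)
{
	atomic_fetch_sub(&ref->state, 1);
}

/* Models the driver unbind marking the device as dying. */
static void model_ref_kill(struct model_ref *ref)
{
	atomic_fetch_or(&ref->state, MODEL_REF_DEAD);
}

/* Models gup-fast on a devmap PTE: hold the pagemap, then pin the page. */
static bool model_gup_fast_devmap(struct model_ref *pgmap,
				  struct model_page *page)
{
	if (!model_ref_tryget_live(pgmap))
		return false;	/* device is dying: do not touch the page */
	atomic_fetch_add(&page->refcount, 1);	/* stands in for get_page() */
	model_ref_put(pgmap);	/* page reference now keeps removal waiting */
	return true;
}
```

In this model, a call to model_gup_fast_devmap() made before
model_ref_kill() pins the page, and a call made afterwards fails
without touching the page refcount.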

This could be solved with the traditional gup-fast rules if the device
driver could tell the filesystem to unmap all dax files and force them
to re-fault through the gup-slow path to see that the device is now
dying. I'll likely be working on that sooner rather than later given
some of the expectations of the CXL persistent memory "dirty shutdown"
detection.

