linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH RFC 0/3] Fix KVM misinterpreting Reserved page as an MMIO page
@ 2018-12-03 19:25 Alexander Duyck
  2018-12-03 19:25 ` [PATCH RFC 1/3] kvm: Split use cases for kvm_is_reserved_pfn to kvm_is_refcounted_pfn Alexander Duyck
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Alexander Duyck @ 2018-12-03 19:25 UTC (permalink / raw)
  To: dan.j.williams, pbonzini, yi.z.zhang, brho, kvm, linux-nvdimm
  Cc: linux-kernel, linux-mm, dave.jiang, yu.c.zhang, pagupta, david,
	jack, hch, rkrcmar, jglisse

I have loosely based this patch series off of the following patch series
from Zhang Yi:
https://lore.kernel.org/lkml/cover.1536342881.git.yi.z.zhang@linux.intel.com

The original set had attempted to address the fact that DAX pages were
treated like MMIO pages which had resulted in reduced performance. It
attempted to address this by ignoring the PageReserved flag if the page
was either a DEV_DAX or FS_DAX page.

I am proposing this as an alternative to that set. The main reason for this
is because I believe there are a few issues that were overlooked with that
original set. Specifically KVM seems to have two different uses for the
PageReserved flag. One being whether or not we can pin the memory, the other
being if we should be marking the pages as dirty or accessed. I believe
only the pinning really applies so I have split the uses of
kvm_is_reserved_pfn and updated the function uses to determine support for
page pinning to include a check of the pgmap to see if it supports pinning.

---

Alexander Duyck (3):
      kvm: Split use cases for kvm_is_reserved_pfn to kvm_is_refcounted_pfn
      mm: Add support for exposing if dev_pagemap supports refcount pinning
      kvm: Add additional check to determine if a page is refcounted


 arch/x86/kvm/mmu.c        |    6 +++---
 drivers/nvdimm/pfn_devs.c |    2 ++
 include/linux/kvm_host.h  |    2 +-
 include/linux/memremap.h  |    5 ++++-
 include/linux/mm.h        |   11 +++++++++++
 virt/kvm/kvm_main.c       |   34 +++++++++++++++++++++++++---------
 6 files changed, 46 insertions(+), 14 deletions(-)

--

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH RFC 1/3] kvm: Split use cases for kvm_is_reserved_pfn to kvm_is_refcounted_pfn
  2018-12-03 19:25 [PATCH RFC 0/3] Fix KVM misinterpreting Reserved page as an MMIO page Alexander Duyck
@ 2018-12-03 19:25 ` Alexander Duyck
  2018-12-03 19:25 ` [PATCH RFC 2/3] mm: Add support for exposing if dev_pagemap supports refcount pinning Alexander Duyck
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 18+ messages in thread
From: Alexander Duyck @ 2018-12-03 19:25 UTC (permalink / raw)
  To: dan.j.williams, pbonzini, yi.z.zhang, brho, kvm, linux-nvdimm
  Cc: linux-kernel, linux-mm, dave.jiang, yu.c.zhang, pagupta, david,
	jack, hch, rkrcmar, jglisse

The function kvm_is_reserved_pfn really has two uses. One is to test for if
we should be updating the reference count on a page when we are accessing
it. The other is to determine if we should be updating the dirty flag or
marking pages as accessed.

In preparation for blurring the lines between ZONE_DEVICE and system RAM I
am splitting out the dirty/accessed cases into their own checks. Doing this
allows us to add ZONE_DEVICE to the list of refcounted pages without having
to worry about us introducing possible issues with pages being marked as
dirty or accessed and possibly causing any issues with attempted LRU
accesses on the ZONE_DEVICE pages.

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 arch/x86/kvm/mmu.c       |    6 +++---
 include/linux/kvm_host.h |    2 +-
 virt/kvm/kvm_main.c      |   22 +++++++++++++---------
 3 files changed, 17 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 7c03c0f35444..7c61cc260c23 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -798,7 +798,7 @@ static int mmu_spte_clear_track_bits(u64 *sptep)
 	 * kvm mmu, before reclaiming the page, we should
 	 * unmap it from mmu first.
 	 */
-	WARN_ON(!kvm_is_reserved_pfn(pfn) && !page_count(pfn_to_page(pfn)));
+	WARN_ON(kvm_is_refcounted_pfn(pfn) && !page_count(pfn_to_page(pfn)));
 
 	if (is_accessed_spte(old_spte))
 		kvm_set_pfn_accessed(pfn);
@@ -3166,7 +3166,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
 	 * PT_PAGE_TABLE_LEVEL and there would be no adjustment done
 	 * here.
 	 */
-	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
+	if (!is_error_noslot_pfn(pfn) && kvm_is_refcounted_pfn(pfn) &&
 	    level == PT_PAGE_TABLE_LEVEL &&
 	    PageTransCompoundMap(pfn_to_page(pfn)) &&
 	    !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
@@ -5668,7 +5668,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 		 * mapping if the indirect sp has level = 1.
 		 */
 		if (sp->role.direct &&
-			!kvm_is_reserved_pfn(pfn) &&
+			kvm_is_refcounted_pfn(pfn) &&
 			PageTransCompoundMap(pfn_to_page(pfn))) {
 			pte_list_remove(rmap_head, sptep);
 			need_tlb_flush = 1;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c926698040e0..132e5dbc9049 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -906,7 +906,7 @@ void kvm_arch_sync_events(struct kvm *kvm);
 int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu);
 void kvm_vcpu_kick(struct kvm_vcpu *vcpu);
 
-bool kvm_is_reserved_pfn(kvm_pfn_t pfn);
+bool kvm_is_refcounted_pfn(kvm_pfn_t pfn);
 
 struct kvm_irq_ack_notifier {
 	struct hlist_node link;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2679e476b6c3..5e666df5666d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -146,7 +146,15 @@ __weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
 	return 0;
 }
 
-bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
+bool kvm_is_refcounted_pfn(kvm_pfn_t pfn)
+{
+	if (pfn_valid(pfn))
+		return !PageReserved(pfn_to_page(pfn));
+
+	return false;
+}
+
+static bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
 {
 	if (pfn_valid(pfn))
 		return PageReserved(pfn_to_page(pfn));
@@ -1678,7 +1686,7 @@ EXPORT_SYMBOL_GPL(kvm_release_page_clean);
 
 void kvm_release_pfn_clean(kvm_pfn_t pfn)
 {
-	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn))
+	if (!is_error_noslot_pfn(pfn) && kvm_is_refcounted_pfn(pfn))
 		put_page(pfn_to_page(pfn));
 }
 EXPORT_SYMBOL_GPL(kvm_release_pfn_clean);
@@ -1700,12 +1708,8 @@ EXPORT_SYMBOL_GPL(kvm_release_pfn_dirty);
 
 void kvm_set_pfn_dirty(kvm_pfn_t pfn)
 {
-	if (!kvm_is_reserved_pfn(pfn)) {
-		struct page *page = pfn_to_page(pfn);
-
-		if (!PageReserved(page))
-			SetPageDirty(page);
-	}
+	if (!kvm_is_reserved_pfn(pfn))
+		SetPageDirty(pfn_to_page(pfn));
 }
 EXPORT_SYMBOL_GPL(kvm_set_pfn_dirty);
 
@@ -1718,7 +1722,7 @@ EXPORT_SYMBOL_GPL(kvm_set_pfn_accessed);
 
 void kvm_get_pfn(kvm_pfn_t pfn)
 {
-	if (!kvm_is_reserved_pfn(pfn))
+	if (kvm_is_refcounted_pfn(pfn))
 		get_page(pfn_to_page(pfn));
 }
 EXPORT_SYMBOL_GPL(kvm_get_pfn);


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH RFC 2/3] mm: Add support for exposing if dev_pagemap supports refcount pinning
  2018-12-03 19:25 [PATCH RFC 0/3] Fix KVM misinterpreting Reserved page as an MMIO page Alexander Duyck
  2018-12-03 19:25 ` [PATCH RFC 1/3] kvm: Split use cases for kvm_is_reserved_pfn to kvm_is_refcounted_pfn Alexander Duyck
@ 2018-12-03 19:25 ` Alexander Duyck
  2018-12-03 19:47   ` Dan Williams
  2018-12-03 19:25 ` [PATCH RFC 3/3] kvm: Add additional check to determine if a page is refcounted Alexander Duyck
  2018-12-04  6:59 ` [PATCH RFC 0/3] Fix KVM misinterpreting Reserved page as an MMIO page Yi Zhang
  3 siblings, 1 reply; 18+ messages in thread
From: Alexander Duyck @ 2018-12-03 19:25 UTC (permalink / raw)
  To: dan.j.williams, pbonzini, yi.z.zhang, brho, kvm, linux-nvdimm
  Cc: linux-kernel, linux-mm, dave.jiang, yu.c.zhang, pagupta, david,
	jack, hch, rkrcmar, jglisse

Add a means of exposing if a pagemap supports refcount pinning. I am doing
this to expose if a given pagemap has backing struct pages that will allow
for the reference count of the page to be incremented to lock the page
into place.

The KVM code already has several spots where it was trying to use a
pfn_valid check combined with a PageReserved check to determien if it could
take a reference on the page. I am adding this check so in the case of the
page having the reserved flag checked we can check the pagemap for the page
to determine if we might fall into the special DAX case.

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 drivers/nvdimm/pfn_devs.c |    2 ++
 include/linux/memremap.h  |    5 ++++-
 include/linux/mm.h        |   11 +++++++++++
 3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index 6f22272e8d80..7a4a85bcf7f4 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -640,6 +640,8 @@ static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn, struct dev_pagemap *pgmap)
 	} else
 		return -ENXIO;
 
+	pgmap->support_refcount_pinning = true;
+
 	return 0;
 }
 
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 55db66b3716f..6e7b85542208 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -109,6 +109,8 @@ typedef void (*dev_page_free_t)(struct page *page, void *data);
  * @page_fault: callback when CPU fault on an unaddressable device page
  * @page_free: free page callback when page refcount reaches 1
  * @altmap: pre-allocated/reserved memory for vmemmap allocations
+ * @altmap_valid: bitflag indicating if altmap is valid
+ * @support_refcount_pinning: bitflag indicating if we support refcount pinning
  * @res: physical address range covered by @ref
  * @ref: reference count that pins the devm_memremap_pages() mapping
  * @kill: callback to transition @ref to the dead state
@@ -120,7 +122,8 @@ struct dev_pagemap {
 	dev_page_fault_t page_fault;
 	dev_page_free_t page_free;
 	struct vmem_altmap altmap;
-	bool altmap_valid;
+	bool altmap_valid:1;
+	bool support_refcount_pinning:1;
 	struct resource res;
 	struct percpu_ref *ref;
 	void (*kill)(struct percpu_ref *ref);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3eb3bf7774f1..5faf66dd4559 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -970,6 +970,12 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
 }
 #endif /* CONFIG_PCI_P2PDMA */
 
+static inline bool is_device_pinnable_page(const struct page *page)
+{
+	return is_zone_device_page(page) &&
+		page->pgmap->support_refcount_pinning;
+}
+
 #else /* CONFIG_DEV_PAGEMAP_OPS */
 static inline void dev_pagemap_get_ops(void)
 {
@@ -998,6 +1004,11 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
 {
 	return false;
 }
+
+static inline bool is_device_pinnable_page(const struct page *page)
+{
+	return false;
+}
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
 
 static inline void get_page(struct page *page)


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH RFC 3/3] kvm: Add additional check to determine if a page is refcounted
  2018-12-03 19:25 [PATCH RFC 0/3] Fix KVM misinterpreting Reserved page as an MMIO page Alexander Duyck
  2018-12-03 19:25 ` [PATCH RFC 1/3] kvm: Split use cases for kvm_is_reserved_pfn to kvm_is_refcounted_pfn Alexander Duyck
  2018-12-03 19:25 ` [PATCH RFC 2/3] mm: Add support for exposing if dev_pagemap supports refcount pinning Alexander Duyck
@ 2018-12-03 19:25 ` Alexander Duyck
  2018-12-04  6:59 ` [PATCH RFC 0/3] Fix KVM misinterpreting Reserved page as an MMIO page Yi Zhang
  3 siblings, 0 replies; 18+ messages in thread
From: Alexander Duyck @ 2018-12-03 19:25 UTC (permalink / raw)
  To: dan.j.williams, pbonzini, yi.z.zhang, brho, kvm, linux-nvdimm
  Cc: linux-kernel, linux-mm, dave.jiang, yu.c.zhang, pagupta, david,
	jack, hch, rkrcmar, jglisse

The function kvm_is_refcounted_page is used primarily to determine if KVM
is allowed to take a reference on the page. It was using the PG_reserved
flag to determine this previously, however in the case of DAX the page has
the PG_reserved flag set, but supports pinning by taking a reference on
the page. As such I have updated the check to add a special case for
ZONE_DEVICE pages that have the new support_refcount_pinning flag set.

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 virt/kvm/kvm_main.c |   16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5e666df5666d..2e7e9fbb67bf 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -148,8 +148,20 @@ __weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
 
 bool kvm_is_refcounted_pfn(kvm_pfn_t pfn)
 {
-	if (pfn_valid(pfn))
-		return !PageReserved(pfn_to_page(pfn));
+	if (pfn_valid(pfn)) {
+		struct page *page = pfn_to_page(pfn);
+
+		/*
+		 * The reference count for MMIO pages are not updated.
+		 * Previously this was being tested for with just the
+		 * PageReserved check, however now ZONE_DEVICE pages may
+		 * also allow for the refcount to be updated for the sake
+		 * of pinning the pages so use the additional check provided
+		 * to determine if the reference count on the page can be
+		 * used to pin it.
+		 */
+		return !PageReserved(page) || is_device_pinnable_page(page);
+	}
 
 	return false;
 }


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 2/3] mm: Add support for exposing if dev_pagemap supports refcount pinning
  2018-12-03 19:25 ` [PATCH RFC 2/3] mm: Add support for exposing if dev_pagemap supports refcount pinning Alexander Duyck
@ 2018-12-03 19:47   ` Dan Williams
  2018-12-03 20:21     ` Alexander Duyck
  0 siblings, 1 reply; 18+ messages in thread
From: Dan Williams @ 2018-12-03 19:47 UTC (permalink / raw)
  To: alexander.h.duyck
  Cc: Paolo Bonzini, Zhang Yi, Barret Rhoden, KVM list, linux-nvdimm,
	Linux Kernel Mailing List, Linux MM, Dave Jiang, Zhang, Yu C,
	Pankaj Gupta, David Hildenbrand, Jan Kara, Christoph Hellwig,
	rkrcmar, Jérôme Glisse

On Mon, Dec 3, 2018 at 11:25 AM Alexander Duyck
<alexander.h.duyck@linux.intel.com> wrote:
>
> Add a means of exposing if a pagemap supports refcount pinning. I am doing
> this to expose if a given pagemap has backing struct pages that will allow
> for the reference count of the page to be incremented to lock the page
> into place.
>
> The KVM code already has several spots where it was trying to use a
> pfn_valid check combined with a PageReserved check to determien if it could
> take a reference on the page. I am adding this check so in the case of the
> page having the reserved flag checked we can check the pagemap for the page
> to determine if we might fall into the special DAX case.
>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> ---
>  drivers/nvdimm/pfn_devs.c |    2 ++
>  include/linux/memremap.h  |    5 ++++-
>  include/linux/mm.h        |   11 +++++++++++
>  3 files changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
> index 6f22272e8d80..7a4a85bcf7f4 100644
> --- a/drivers/nvdimm/pfn_devs.c
> +++ b/drivers/nvdimm/pfn_devs.c
> @@ -640,6 +640,8 @@ static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn, struct dev_pagemap *pgmap)
>         } else
>                 return -ENXIO;
>
> +       pgmap->support_refcount_pinning = true;
> +

There should be no dev_pagemap instance instance where this isn't
true, so I'm missing why this is needed?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 2/3] mm: Add support for exposing if dev_pagemap supports refcount pinning
  2018-12-03 19:47   ` Dan Williams
@ 2018-12-03 20:21     ` Alexander Duyck
  2018-12-03 20:31       ` Dan Williams
  0 siblings, 1 reply; 18+ messages in thread
From: Alexander Duyck @ 2018-12-03 20:21 UTC (permalink / raw)
  To: Dan Williams
  Cc: Paolo Bonzini, Zhang Yi, Barret Rhoden, KVM list, linux-nvdimm,
	Linux Kernel Mailing List, Linux MM, Dave Jiang, Zhang, Yu C,
	Pankaj Gupta, David Hildenbrand, Jan Kara, Christoph Hellwig,
	rkrcmar, Jérôme Glisse

On Mon, 2018-12-03 at 11:47 -0800, Dan Williams wrote:
> On Mon, Dec 3, 2018 at 11:25 AM Alexander Duyck
> <alexander.h.duyck@linux.intel.com> wrote:
> > 
> > Add a means of exposing if a pagemap supports refcount pinning. I am doing
> > this to expose if a given pagemap has backing struct pages that will allow
> > for the reference count of the page to be incremented to lock the page
> > into place.
> > 
> > The KVM code already has several spots where it was trying to use a
> > pfn_valid check combined with a PageReserved check to determien if it could
> > take a reference on the page. I am adding this check so in the case of the
> > page having the reserved flag checked we can check the pagemap for the page
> > to determine if we might fall into the special DAX case.
> > 
> > Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> > ---
> >  drivers/nvdimm/pfn_devs.c |    2 ++
> >  include/linux/memremap.h  |    5 ++++-
> >  include/linux/mm.h        |   11 +++++++++++
> >  3 files changed, 17 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
> > index 6f22272e8d80..7a4a85bcf7f4 100644
> > --- a/drivers/nvdimm/pfn_devs.c
> > +++ b/drivers/nvdimm/pfn_devs.c
> > @@ -640,6 +640,8 @@ static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn, struct dev_pagemap *pgmap)
> >         } else
> >                 return -ENXIO;
> > 
> > +       pgmap->support_refcount_pinning = true;
> > +
> 
> There should be no dev_pagemap instance instance where this isn't
> true, so I'm missing why this is needed?

I thought in the case of HMM there were instances where you couldn't
pin the page, isn't there? Specifically I am thinking of the definition
of MEMORY_DEVICE_PUBLIC:
  Device memory that is cache coherent from device and CPU point of 
  view. This is use on platform that have an advance system bus (like 
  CAPI or CCIX). A driver can hotplug the device memory using 
  ZONE_DEVICE and with that memory type. Any page of a process can be 
  migrated to such memory. However no one should be allow to pin such 
  memory so that it can always be evicted.

It sounds like MEMORY_DEVICE_PUBLIC and MMIO would want to fall into
the same category here in order to allow a hot-plug event to remove the
device and take the memory with it, or is my understanding on this not
correct?

Thanks.

- Alex




^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 2/3] mm: Add support for exposing if dev_pagemap supports refcount pinning
  2018-12-03 20:21     ` Alexander Duyck
@ 2018-12-03 20:31       ` Dan Williams
  2018-12-03 20:53         ` Alexander Duyck
  0 siblings, 1 reply; 18+ messages in thread
From: Dan Williams @ 2018-12-03 20:31 UTC (permalink / raw)
  To: alexander.h.duyck
  Cc: Paolo Bonzini, Zhang Yi, Barret Rhoden, KVM list, linux-nvdimm,
	Linux Kernel Mailing List, Linux MM, Dave Jiang, Zhang, Yu C,
	Pankaj Gupta, David Hildenbrand, Jan Kara, Christoph Hellwig,
	rkrcmar, Jérôme Glisse

On Mon, Dec 3, 2018 at 12:21 PM Alexander Duyck
<alexander.h.duyck@linux.intel.com> wrote:
>
> On Mon, 2018-12-03 at 11:47 -0800, Dan Williams wrote:
> > On Mon, Dec 3, 2018 at 11:25 AM Alexander Duyck
> > <alexander.h.duyck@linux.intel.com> wrote:
> > >
> > > Add a means of exposing if a pagemap supports refcount pinning. I am doing
> > > this to expose if a given pagemap has backing struct pages that will allow
> > > for the reference count of the page to be incremented to lock the page
> > > into place.
> > >
> > > The KVM code already has several spots where it was trying to use a
> > > pfn_valid check combined with a PageReserved check to determien if it could
> > > take a reference on the page. I am adding this check so in the case of the
> > > page having the reserved flag checked we can check the pagemap for the page
> > > to determine if we might fall into the special DAX case.
> > >
> > > Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> > > ---
> > >  drivers/nvdimm/pfn_devs.c |    2 ++
> > >  include/linux/memremap.h  |    5 ++++-
> > >  include/linux/mm.h        |   11 +++++++++++
> > >  3 files changed, 17 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
> > > index 6f22272e8d80..7a4a85bcf7f4 100644
> > > --- a/drivers/nvdimm/pfn_devs.c
> > > +++ b/drivers/nvdimm/pfn_devs.c
> > > @@ -640,6 +640,8 @@ static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn, struct dev_pagemap *pgmap)
> > >         } else
> > >                 return -ENXIO;
> > >
> > > +       pgmap->support_refcount_pinning = true;
> > > +
> >
> > There should be no dev_pagemap instance instance where this isn't
> > true, so I'm missing why this is needed?
>
> I thought in the case of HMM there were instances where you couldn't
> pin the page, isn't there? Specifically I am thinking of the definition
> of MEMORY_DEVICE_PUBLIC:
>   Device memory that is cache coherent from device and CPU point of
>   view. This is use on platform that have an advance system bus (like
>   CAPI or CCIX). A driver can hotplug the device memory using
>   ZONE_DEVICE and with that memory type. Any page of a process can be
>   migrated to such memory. However no one should be allow to pin such
>   memory so that it can always be evicted.
>
> It sounds like MEMORY_DEVICE_PUBLIC and MMIO would want to fall into
> the same category here in order to allow a hot-plug event to remove the
> device and take the memory with it, or is my understanding on this not
> correct?

I don't understand how HMM expects to enforce no pinning, but in any
event it should always be the expectation an elevated reference count
on a page prevents that page from disappearing. Anything else is
broken.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 2/3] mm: Add support for exposing if dev_pagemap supports refcount pinning
  2018-12-03 20:31       ` Dan Williams
@ 2018-12-03 20:53         ` Alexander Duyck
  2018-12-03 21:05           ` Dan Williams
  0 siblings, 1 reply; 18+ messages in thread
From: Alexander Duyck @ 2018-12-03 20:53 UTC (permalink / raw)
  To: Dan Williams
  Cc: Paolo Bonzini, Zhang Yi, Barret Rhoden, KVM list, linux-nvdimm,
	Linux Kernel Mailing List, Linux MM, Dave Jiang, Zhang, Yu C,
	Pankaj Gupta, David Hildenbrand, Jan Kara, Christoph Hellwig,
	rkrcmar, Jérôme Glisse

On Mon, 2018-12-03 at 12:31 -0800, Dan Williams wrote:
> On Mon, Dec 3, 2018 at 12:21 PM Alexander Duyck
> <alexander.h.duyck@linux.intel.com> wrote:
> > 
> > On Mon, 2018-12-03 at 11:47 -0800, Dan Williams wrote:
> > > On Mon, Dec 3, 2018 at 11:25 AM Alexander Duyck
> > > <alexander.h.duyck@linux.intel.com> wrote:
> > > > 
> > > > Add a means of exposing if a pagemap supports refcount pinning. I am doing
> > > > this to expose if a given pagemap has backing struct pages that will allow
> > > > for the reference count of the page to be incremented to lock the page
> > > > into place.
> > > > 
> > > > The KVM code already has several spots where it was trying to use a
> > > > pfn_valid check combined with a PageReserved check to determien if it could
> > > > take a reference on the page. I am adding this check so in the case of the
> > > > page having the reserved flag checked we can check the pagemap for the page
> > > > to determine if we might fall into the special DAX case.
> > > > 
> > > > Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> > > > ---
> > > >  drivers/nvdimm/pfn_devs.c |    2 ++
> > > >  include/linux/memremap.h  |    5 ++++-
> > > >  include/linux/mm.h        |   11 +++++++++++
> > > >  3 files changed, 17 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
> > > > index 6f22272e8d80..7a4a85bcf7f4 100644
> > > > --- a/drivers/nvdimm/pfn_devs.c
> > > > +++ b/drivers/nvdimm/pfn_devs.c
> > > > @@ -640,6 +640,8 @@ static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn, struct dev_pagemap *pgmap)
> > > >         } else
> > > >                 return -ENXIO;
> > > > 
> > > > +       pgmap->support_refcount_pinning = true;
> > > > +
> > > 
> > > There should be no dev_pagemap instance instance where this isn't
> > > true, so I'm missing why this is needed?
> > 
> > I thought in the case of HMM there were instances where you couldn't
> > pin the page, isn't there? Specifically I am thinking of the definition
> > of MEMORY_DEVICE_PUBLIC:
> >   Device memory that is cache coherent from device and CPU point of
> >   view. This is use on platform that have an advance system bus (like
> >   CAPI or CCIX). A driver can hotplug the device memory using
> >   ZONE_DEVICE and with that memory type. Any page of a process can be
> >   migrated to such memory. However no one should be allow to pin such
> >   memory so that it can always be evicted.
> > 
> > It sounds like MEMORY_DEVICE_PUBLIC and MMIO would want to fall into
> > the same category here in order to allow a hot-plug event to remove the
> > device and take the memory with it, or is my understanding on this not
> > correct?
> 
> I don't understand how HMM expects to enforce no pinning, but in any
> event it should always be the expectation an elevated reference count
> on a page prevents that page from disappearing. Anything else is
> broken.

I don't think that is true for device MMIO though.

In the case of MMIO you have the memory region backed by a device, if
that device is hot-plugged or fails in some way then that backing would
go away and the reads would return and all 1's response.

Holding a reference to the page doesn't guarantee that the backing
device cannot go away. I believe that is the origin of the original use
of the PageReserved check in KVM in terms of if it will try to use the
get_page/put_page functions. I believe this is also why
MEMORY_DEVICE_PUBLIC specifically calls out that you should not allow
pinning such memory.

- Alex


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 2/3] mm: Add support for exposing if dev_pagemap supports refcount pinning
  2018-12-03 20:53         ` Alexander Duyck
@ 2018-12-03 21:05           ` Dan Williams
  2018-12-03 21:50             ` Alexander Duyck
  0 siblings, 1 reply; 18+ messages in thread
From: Dan Williams @ 2018-12-03 21:05 UTC (permalink / raw)
  To: alexander.h.duyck
  Cc: Paolo Bonzini, Zhang Yi, Barret Rhoden, KVM list, linux-nvdimm,
	Linux Kernel Mailing List, Linux MM, Dave Jiang, Zhang, Yu C,
	Pankaj Gupta, David Hildenbrand, Jan Kara, Christoph Hellwig,
	rkrcmar, Jérôme Glisse

On Mon, Dec 3, 2018 at 12:53 PM Alexander Duyck
<alexander.h.duyck@linux.intel.com> wrote:
>
> On Mon, 2018-12-03 at 12:31 -0800, Dan Williams wrote:
> > On Mon, Dec 3, 2018 at 12:21 PM Alexander Duyck
> > <alexander.h.duyck@linux.intel.com> wrote:
> > >
> > > On Mon, 2018-12-03 at 11:47 -0800, Dan Williams wrote:
> > > > On Mon, Dec 3, 2018 at 11:25 AM Alexander Duyck
> > > > <alexander.h.duyck@linux.intel.com> wrote:
> > > > >
> > > > > Add a means of exposing if a pagemap supports refcount pinning. I am doing
> > > > > this to expose if a given pagemap has backing struct pages that will allow
> > > > > for the reference count of the page to be incremented to lock the page
> > > > > into place.
> > > > >
> > > > > The KVM code already has several spots where it was trying to use a
> > > > > pfn_valid check combined with a PageReserved check to determien if it could
> > > > > take a reference on the page. I am adding this check so in the case of the
> > > > > page having the reserved flag checked we can check the pagemap for the page
> > > > > to determine if we might fall into the special DAX case.
> > > > >
> > > > > Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> > > > > ---
> > > > >  drivers/nvdimm/pfn_devs.c |    2 ++
> > > > >  include/linux/memremap.h  |    5 ++++-
> > > > >  include/linux/mm.h        |   11 +++++++++++
> > > > >  3 files changed, 17 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
> > > > > index 6f22272e8d80..7a4a85bcf7f4 100644
> > > > > --- a/drivers/nvdimm/pfn_devs.c
> > > > > +++ b/drivers/nvdimm/pfn_devs.c
> > > > > @@ -640,6 +640,8 @@ static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn, struct dev_pagemap *pgmap)
> > > > >         } else
> > > > >                 return -ENXIO;
> > > > >
> > > > > +       pgmap->support_refcount_pinning = true;
> > > > > +
> > > >
> > > > There should be no dev_pagemap instance instance where this isn't
> > > > true, so I'm missing why this is needed?
> > >
> > > I thought in the case of HMM there were instances where you couldn't
> > > pin the page, isn't there? Specifically I am thinking of the definition
> > > of MEMORY_DEVICE_PUBLIC:
> > >   Device memory that is cache coherent from device and CPU point of
> > >   view. This is use on platform that have an advance system bus (like
> > >   CAPI or CCIX). A driver can hotplug the device memory using
> > >   ZONE_DEVICE and with that memory type. Any page of a process can be
> > >   migrated to such memory. However no one should be allow to pin such
> > >   memory so that it can always be evicted.
> > >
> > > It sounds like MEMORY_DEVICE_PUBLIC and MMIO would want to fall into
> > > the same category here in order to allow a hot-plug event to remove the
> > > device and take the memory with it, or is my understanding on this not
> > > correct?
> >
> > I don't understand how HMM expects to enforce no pinning, but in any
> > event it should always be the expectation an elevated reference count
> > on a page prevents that page from disappearing. Anything else is
> > broken.
>
> I don't think that is true for device MMIO though.
>
> In the case of MMIO you have the memory region backed by a device, if
> that device is hot-plugged or fails in some way then that backing would
> go away and the reads would return and all 1's response.

Until p2pdma there are no struct pages for device memory, is that what
you're referring?

Otherwise any device driver that leaks "struct pages" into random code
paths in the kernel had better not expect to be able to
surprise-remove those pages from the system. Any dev_pagemap user
should expect to do a coordinated removal with the driver that waits
for page references to drop before the device can be physically
removed.

> Holding a reference to the page doesn't guarantee that the backing
> device cannot go away.

Correct there is no physical guarantee, but that's not the point. It
needs to be coordinated, otherwise all bets are off with respect to
system stability.

> I believe that is the origin of the original use
> of the PageReserved check in KVM in terms of if it will try to use the
> get_page/put_page functions.

Is it? MMIO does not typically have a corresponding 'struct page'.

> I believe this is also why
> MEMORY_DEVICE_PUBLIC specifically calls out that you should not allow
> pinning such memory.

I don't think that call out was referencing device hotplug, I believe
it was the HMM expectation that it should be able to move an HMM page
from device to System-RAM at will.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 2/3] mm: Add support for exposing if dev_pagemap supports refcount pinning
  2018-12-03 21:05           ` Dan Williams
@ 2018-12-03 21:50             ` Alexander Duyck
  2018-12-04 19:08               ` Dan Williams
  0 siblings, 1 reply; 18+ messages in thread
From: Alexander Duyck @ 2018-12-03 21:50 UTC (permalink / raw)
  To: Dan Williams
  Cc: Paolo Bonzini, Zhang Yi, Barret Rhoden, KVM list, linux-nvdimm,
	Linux Kernel Mailing List, Linux MM, Dave Jiang, Zhang, Yu C,
	Pankaj Gupta, David Hildenbrand, Jan Kara, Christoph Hellwig,
	rkrcmar, Jérôme Glisse

On Mon, 2018-12-03 at 13:05 -0800, Dan Williams wrote:
> On Mon, Dec 3, 2018 at 12:53 PM Alexander Duyck
> <alexander.h.duyck@linux.intel.com> wrote:
> > 
> > On Mon, 2018-12-03 at 12:31 -0800, Dan Williams wrote:
> > > On Mon, Dec 3, 2018 at 12:21 PM Alexander Duyck
> > > <alexander.h.duyck@linux.intel.com> wrote:
> > > > 
> > > > On Mon, 2018-12-03 at 11:47 -0800, Dan Williams wrote:
> > > > > On Mon, Dec 3, 2018 at 11:25 AM Alexander Duyck
> > > > > <alexander.h.duyck@linux.intel.com> wrote:
> > > > > > 
> > > > > > Add a means of exposing if a pagemap supports refcount pinning. I am doing
> > > > > > this to expose if a given pagemap has backing struct pages that will allow
> > > > > > for the reference count of the page to be incremented to lock the page
> > > > > > into place.
> > > > > > 
> > > > > > The KVM code already has several spots where it was trying to use a
> > > > > > pfn_valid check combined with a PageReserved check to determien if it could
> > > > > > take a reference on the page. I am adding this check so in the case of the
> > > > > > page having the reserved flag checked we can check the pagemap for the page
> > > > > > to determine if we might fall into the special DAX case.
> > > > > > 
> > > > > > Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> > > > > > ---
> > > > > >  drivers/nvdimm/pfn_devs.c |    2 ++
> > > > > >  include/linux/memremap.h  |    5 ++++-
> > > > > >  include/linux/mm.h        |   11 +++++++++++
> > > > > >  3 files changed, 17 insertions(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
> > > > > > index 6f22272e8d80..7a4a85bcf7f4 100644
> > > > > > --- a/drivers/nvdimm/pfn_devs.c
> > > > > > +++ b/drivers/nvdimm/pfn_devs.c
> > > > > > @@ -640,6 +640,8 @@ static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn, struct dev_pagemap *pgmap)
> > > > > >         } else
> > > > > >                 return -ENXIO;
> > > > > > 
> > > > > > +       pgmap->support_refcount_pinning = true;
> > > > > > +
> > > > > 
> > > > > There should be no dev_pagemap instance instance where this isn't
> > > > > true, so I'm missing why this is needed?
> > > > 
> > > > I thought in the case of HMM there were instances where you couldn't
> > > > pin the page, isn't there? Specifically I am thinking of the definition
> > > > of MEMORY_DEVICE_PUBLIC:
> > > >   Device memory that is cache coherent from device and CPU point of
> > > >   view. This is use on platform that have an advance system bus (like
> > > >   CAPI or CCIX). A driver can hotplug the device memory using
> > > >   ZONE_DEVICE and with that memory type. Any page of a process can be
> > > >   migrated to such memory. However no one should be allow to pin such
> > > >   memory so that it can always be evicted.
> > > > 
> > > > It sounds like MEMORY_DEVICE_PUBLIC and MMIO would want to fall into
> > > > the same category here in order to allow a hot-plug event to remove the
> > > > device and take the memory with it, or is my understanding on this not
> > > > correct?
> > > 
> > > I don't understand how HMM expects to enforce no pinning, but in any
> > > event it should always be the expectation an elevated reference count
> > > on a page prevents that page from disappearing. Anything else is
> > > broken.
> > 
> > I don't think that is true for device MMIO though.
> > 
> > In the case of MMIO you have the memory region backed by a device, if
> > that device is hot-plugged or fails in some way then that backing would
> > go away and the reads would return and all 1's response.
> 
> Until p2pdma there are no struct pages for device memory, is that what
> you're referring?

Honestly I am not sure. It is possible I am getting beyond my depth.

My understanding is that we have a 'struct page' for any of these pages
that we are currently using. It is just a matter of if we want to pass
the struct page around or not. So for example in the case of an MMIO
page we still have a 'struct page', however the PG_reserved flag is set
on such a page, so KVM is opting to not touch the reference count,
modify the dirty/accessed bits, and is generally reducing performance
as a result.

> Otherwise any device driver that leaks "struct pages" into random code
> paths in the kernel had better not expect to be able to
> surprise-remove those pages from the system. Any dev_pagemap user
> should expect to do a coordinated removal with the driver that waits
> for page references to drop before the device can be physically
> removed.

Right. This part I get. However I would imagine there still has to be
some exception handling in the case of a PCIe backed region of memory
so that if the device falls of the bus we clean up the dev_pagemap
memory.

> > Holding a reference to the page doesn't guarantee that the backing
> > device cannot go away.
> 
> Correct there is no physical guarantee, but that's not the point. It
> needs to be coordinated, otherwise all bets are off with respect to
> system stability.

Right.

> > I believe that is the origin of the original use
> > of the PageReserved check in KVM in terms of if it will try to use the
> > get_page/put_page functions.
> 
> Is it? MMIO does not typically have a corresponding 'struct page'.

I think we might be talking about different things when we say 'struct
page'. I'm pretty sure there has to be a 'struct page' for the MMIO
region as otherwise we wouldn't be able to check for the PG_reserved
bit in the 'struct page'. Do you maybe mean that MMIO doesn't have a
corresponding virtual address or TLB entry? I know that is what we are
normally generating via the ioremap family of calls in device drivers
in order to access such memory if I am not mistaken.

> > I believe this is also why
> > MEMORY_DEVICE_PUBLIC specifically calls out that you should not allow
> > pinning such memory.
> 
> I don't think that call out was referencing device hotplug, I believe
> it was the HMM expectation that it should be able to move an HMM page
> from device to System-RAM at will.

I could be wrong. If so that would make this patch set easier since
essentially it would just mean that any PageReserved page that matches
is_zone_device_page would fall into this category then and I could just
drop patch 2, and probably combine the entire fix for all of this into
one patch as it would only really be a few additional lines.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 0/3] Fix KVM misinterpreting Reserved page as an MMIO page
  2018-12-03 19:25 [PATCH RFC 0/3] Fix KVM misinterpreting Reserved page as an MMIO page Alexander Duyck
                   ` (2 preceding siblings ...)
  2018-12-03 19:25 ` [PATCH RFC 3/3] kvm: Add additional check to determine if a page is refcounted Alexander Duyck
@ 2018-12-04  6:59 ` Yi Zhang
  2018-12-04 18:45   ` Alexander Duyck
  3 siblings, 1 reply; 18+ messages in thread
From: Yi Zhang @ 2018-12-04  6:59 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: dan.j.williams, pbonzini, brho, kvm, linux-nvdimm, linux-kernel,
	linux-mm, dave.jiang, yu.c.zhang, pagupta, david, jack, hch,
	rkrcmar, jglisse

On 2018-12-03 at 11:25:20 -0800, Alexander Duyck wrote:
> I have loosely based this patch series off of the following patch series
> from Zhang Yi:
> https://lore.kernel.org/lkml/cover.1536342881.git.yi.z.zhang@linux.intel.com
> 
> The original set had attempted to address the fact that DAX pages were
> treated like MMIO pages which had resulted in reduced performance. It
> attempted to address this by ignoring the PageReserved flag if the page
> was either a DEV_DAX or FS_DAX page.
> 
> I am proposing this as an alternative to that set. The main reason for this
> is because I believe there are a few issues that were overlooked with that
> original set. Specifically KVM seems to have two different uses for the
> PageReserved flag. One being whether or not we can pin the memory, the other
> being if we should be marking the pages as dirty or accessed. I believe
> only the pinning really applies so I have split the uses of
> kvm_is_reserved_pfn and updated the function uses to determine support for
> page pinning to include a check of the pgmap to see if it supports pinning.
kvm is not the only one users of the dax page.

A similar user of PageReserved to look at is:
 drivers/vfio/vfio_iommu_type1.c:is_invalid_reserved_pfn(
vfio is also want to know the page is capable for pinning.

I throught that you have removed the reserved flag on the dax page

in https://patchwork.kernel.org/patch/10707267/

is something I missing here?

> 
> ---
> 
> Alexander Duyck (3):
>       kvm: Split use cases for kvm_is_reserved_pfn to kvm_is_refcounted_pfn
>       mm: Add support for exposing if dev_pagemap supports refcount pinning
>       kvm: Add additional check to determine if a page is refcounted
> 
> 
>  arch/x86/kvm/mmu.c        |    6 +++---
>  drivers/nvdimm/pfn_devs.c |    2 ++
>  include/linux/kvm_host.h  |    2 +-
>  include/linux/memremap.h  |    5 ++++-
>  include/linux/mm.h        |   11 +++++++++++
>  virt/kvm/kvm_main.c       |   34 +++++++++++++++++++++++++---------
>  6 files changed, 46 insertions(+), 14 deletions(-)
> 
> --

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 0/3] Fix KVM misinterpreting Reserved page as an MMIO page
  2018-12-04  6:59 ` [PATCH RFC 0/3] Fix KVM misinterpreting Reserved page as an MMIO page Yi Zhang
@ 2018-12-04 18:45   ` Alexander Duyck
  0 siblings, 0 replies; 18+ messages in thread
From: Alexander Duyck @ 2018-12-04 18:45 UTC (permalink / raw)
  To: Yi Zhang
  Cc: dan.j.williams, pbonzini, brho, kvm, linux-nvdimm, linux-kernel,
	linux-mm, dave.jiang, yu.c.zhang, pagupta, david, jack, hch,
	rkrcmar, jglisse

On Tue, 2018-12-04 at 14:59 +0800, Yi Zhang wrote:
> On 2018-12-03 at 11:25:20 -0800, Alexander Duyck wrote:
> > I have loosely based this patch series off of the following patch series
> > from Zhang Yi:
> > https://lore.kernel.org/lkml/cover.1536342881.git.yi.z.zhang@linux.intel.com
> > 
> > The original set had attempted to address the fact that DAX pages were
> > treated like MMIO pages which had resulted in reduced performance. It
> > attempted to address this by ignoring the PageReserved flag if the page
> > was either a DEV_DAX or FS_DAX page.
> > 
> > I am proposing this as an alternative to that set. The main reason for this
> > is because I believe there are a few issues that were overlooked with that
> > original set. Specifically KVM seems to have two different uses for the
> > PageReserved flag. One being whether or not we can pin the memory, the other
> > being if we should be marking the pages as dirty or accessed. I believe
> > only the pinning really applies so I have split the uses of
> > kvm_is_reserved_pfn and updated the function uses to determine support for
> > page pinning to include a check of the pgmap to see if it supports pinning.
> 
> kvm is not the only one users of the dax page.

Yes, but KVM and virtualization in general seems to be the place where
the code carrying the assumption that PageReserved == MMIO exists.

> A similar user of PageReserved to look at is:
>  drivers/vfio/vfio_iommu_type1.c:is_invalid_reserved_pfn(
> vfio is also want to know the page is capable for pinning.

I would lump vfio in with virtualization as I said above.

A quick search also shows that there is also
arch/x86/kvm/mmu.c:kvm_is_mmio_pfn() which had a similar assumption but
is already carrying workarounds.

> I throught that you have removed the reserved flag on the dax page
> 
> in https://patchwork.kernel.org/patch/10707267/
> 
> is something I missing here?

That patch wasn't about DAX memory. That patch was about the fact that
the reserved flag was expensive as a __set_bit operation. I was leaving
the bit set for DAX and all other hot-plug memory and not setting it
for deferred memory init.

The reserved bit is essentially meant to flag everything that is not
standard system RAM page. Historically speaking most of that was MMIO,
now that isn't necessarily the case with the introduction of
ZONE_DEVICE pages.

The issue is DAX isn't necessarily system RAM either. So if we don't
set the reserved bit for DAX then we have to go through and start
adding exception cases to the paths that handle system RAM to split it
off from DAX. Dan had pointed out one such example in
kernel/power/snapshot.c:saveable_page() as I recall.

> > 
> > ---
> > 
> > Alexander Duyck (3):
> >       kvm: Split use cases for kvm_is_reserved_pfn to kvm_is_refcounted_pfn
> >       mm: Add support for exposing if dev_pagemap supports refcount pinning
> >       kvm: Add additional check to determine if a page is refcounted
> > 
> > 
> >  arch/x86/kvm/mmu.c        |    6 +++---
> >  drivers/nvdimm/pfn_devs.c |    2 ++
> >  include/linux/kvm_host.h  |    2 +-
> >  include/linux/memremap.h  |    5 ++++-
> >  include/linux/mm.h        |   11 +++++++++++
> >  virt/kvm/kvm_main.c       |   34 +++++++++++++++++++++++++---------
> >  6 files changed, 46 insertions(+), 14 deletions(-)
> > 
> > --


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 2/3] mm: Add support for exposing if dev_pagemap supports refcount pinning
  2018-12-03 21:50             ` Alexander Duyck
@ 2018-12-04 19:08               ` Dan Williams
  2018-12-04 22:51                 ` Alexander Duyck
  0 siblings, 1 reply; 18+ messages in thread
From: Dan Williams @ 2018-12-04 19:08 UTC (permalink / raw)
  To: alexander.h.duyck
  Cc: Paolo Bonzini, Zhang Yi, Barret Rhoden, KVM list, linux-nvdimm,
	Linux Kernel Mailing List, Linux MM, Dave Jiang, Zhang, Yu C,
	Pankaj Gupta, David Hildenbrand, Jan Kara, Christoph Hellwig,
	rkrcmar, Jérôme Glisse

On Mon, Dec 3, 2018 at 1:50 PM Alexander Duyck
<alexander.h.duyck@linux.intel.com> wrote:
>
> On Mon, 2018-12-03 at 13:05 -0800, Dan Williams wrote:
> > On Mon, Dec 3, 2018 at 12:53 PM Alexander Duyck
> > <alexander.h.duyck@linux.intel.com> wrote:
> > >
> > > On Mon, 2018-12-03 at 12:31 -0800, Dan Williams wrote:
> > > > On Mon, Dec 3, 2018 at 12:21 PM Alexander Duyck
> > > > <alexander.h.duyck@linux.intel.com> wrote:
> > > > >
> > > > > On Mon, 2018-12-03 at 11:47 -0800, Dan Williams wrote:
> > > > > > On Mon, Dec 3, 2018 at 11:25 AM Alexander Duyck
> > > > > > <alexander.h.duyck@linux.intel.com> wrote:
> > > > > > >
> > > > > > > Add a means of exposing if a pagemap supports refcount pinning. I am doing
> > > > > > > this to expose if a given pagemap has backing struct pages that will allow
> > > > > > > for the reference count of the page to be incremented to lock the page
> > > > > > > into place.
> > > > > > >
> > > > > > > The KVM code already has several spots where it was trying to use a
> > > > > > > pfn_valid check combined with a PageReserved check to determien if it could
> > > > > > > take a reference on the page. I am adding this check so in the case of the
> > > > > > > page having the reserved flag checked we can check the pagemap for the page
> > > > > > > to determine if we might fall into the special DAX case.
> > > > > > >
> > > > > > > Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> > > > > > > ---
> > > > > > >  drivers/nvdimm/pfn_devs.c |    2 ++
> > > > > > >  include/linux/memremap.h  |    5 ++++-
> > > > > > >  include/linux/mm.h        |   11 +++++++++++
> > > > > > >  3 files changed, 17 insertions(+), 1 deletion(-)
> > > > > > >
> > > > > > > diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
> > > > > > > index 6f22272e8d80..7a4a85bcf7f4 100644
> > > > > > > --- a/drivers/nvdimm/pfn_devs.c
> > > > > > > +++ b/drivers/nvdimm/pfn_devs.c
> > > > > > > @@ -640,6 +640,8 @@ static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn, struct dev_pagemap *pgmap)
> > > > > > >         } else
> > > > > > >                 return -ENXIO;
> > > > > > >
> > > > > > > +       pgmap->support_refcount_pinning = true;
> > > > > > > +
> > > > > >
> > > > > > There should be no dev_pagemap instance instance where this isn't
> > > > > > true, so I'm missing why this is needed?
> > > > >
> > > > > I thought in the case of HMM there were instances where you couldn't
> > > > > pin the page, isn't there? Specifically I am thinking of the definition
> > > > > of MEMORY_DEVICE_PUBLIC:
> > > > >   Device memory that is cache coherent from device and CPU point of
> > > > >   view. This is use on platform that have an advance system bus (like
> > > > >   CAPI or CCIX). A driver can hotplug the device memory using
> > > > >   ZONE_DEVICE and with that memory type. Any page of a process can be
> > > > >   migrated to such memory. However no one should be allow to pin such
> > > > >   memory so that it can always be evicted.
> > > > >
> > > > > It sounds like MEMORY_DEVICE_PUBLIC and MMIO would want to fall into
> > > > > the same category here in order to allow a hot-plug event to remove the
> > > > > device and take the memory with it, or is my understanding on this not
> > > > > correct?
> > > >
> > > > I don't understand how HMM expects to enforce no pinning, but in any
> > > > event it should always be the expectation an elevated reference count
> > > > on a page prevents that page from disappearing. Anything else is
> > > > broken.
> > >
> > > I don't think that is true for device MMIO though.
> > >
> > > In the case of MMIO you have the memory region backed by a device, if
> > > that device is hot-plugged or fails in some way then that backing would
> > > go away and the reads would return and all 1's response.
> >
> > Until p2pdma there are no struct pages for device memory, is that what
> > you're referring?
>
> Honestly I am not sure. It is possible I am getting beyond my depth.
>
> My understanding is that we have a 'struct page' for any of these pages
> that we are currently using. It is just a matter of if we want to pass
> the struct page around or not. So for example in the case of an MMIO
> page we still have a 'struct page', however the PG_reserved flag is set
> on such a page, so KVM is opting to not touch the reference count,
> modify the dirty/accessed bits, and is generally reducing performance
> as a result.
>
> > Otherwise any device driver that leaks "struct pages" into random code
> > paths in the kernel had better not expect to be able to
> > surprise-remove those pages from the system. Any dev_pagemap user
> > should expect to do a coordinated removal with the driver that waits
> > for page references to drop before the device can be physically
> > removed.
>
> Right. This part I get. However I would imagine there still has to be
> some exception handling in the case of a PCIe backed region of memory
> so that if the device falls of the bus we clean up the dev_pagemap
> memory.

There isn't. This is a gap. Persistent memory gets around this by the
fact that none of the current NVDIMM bus implementations support
surprise physical unplug. HMM may be broken in this regard when GPUs
fall of the bus, but I don't think this is a recoverable situation
once applications have been handed direct access to the memory
resource. There is no recovery without a device-driver intermediary to
gracefully fail transactions.

> > > Holding a reference to the page doesn't guarantee that the backing
> > > device cannot go away.
> >
> > Correct there is no physical guarantee, but that's not the point. It
> > needs to be coordinated, otherwise all bets are off with respect to
> > system stability.
>
> Right.
>
> > > I believe that is the origin of the original use
> > > of the PageReserved check in KVM in terms of if it will try to use the
> > > get_page/put_page functions.
> >
> > Is it? MMIO does not typically have a corresponding 'struct page'.
>
> I think we might be talking about different things when we say 'struct
> page'. I'm pretty sure there has to be a 'struct page' for the MMIO
> region as otherwise we wouldn't be able to check for the PG_reserved
> bit in the 'struct page'. Do you maybe mean that MMIO doesn't have a
> corresponding virtual address or TLB entry?

No.

> I know that is what we are
> normally generating via the ioremap family of calls in device drivers
> in order to access such memory if I am not mistaken.

I think the confusion arises from the fact that there are a few MMIO
resources with a struct page and all the rest MMIO resources without.
The problem comes from the coarse definition of pfn_valid(), it may
return 'true' for things that are not System-RAM, because pfn_valid()
may be something as simplistic as a single "address < X" check. Then
PageReserved is a fallback to clarify the pfn_valid() result. The
typical case is that MMIO space is not caught up in this linear map
confusion. An MMIO address may or may not have an associated 'struct
page' and in most cases it does not.

> > > I believe this is also why
> > > MEMORY_DEVICE_PUBLIC specifically calls out that you should not allow
> > > pinning such memory.
> >
> > I don't think that call out was referencing device hotplug, I believe
> > it was the HMM expectation that it should be able to move an HMM page
> > from device to System-RAM at will.
>
> I could be wrong. If so that would make this patch set easier since
> essentially it would just mean that any PageReserved page that matches
> is_zone_device_page would fall into this category then and I could just
> drop patch 2, and probably combine the entire fix for all of this into
> one patch as it would only really be a few additional lines.
>

I took another look at PageReserved usage, I really think we're better
off just deleting PageReserved for dax and fixing up the small number
of places where it matters. By my count there are only 44 tests for
PageReserved and most of those are locations that would never see a
dax page.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 2/3] mm: Add support for exposing if dev_pagemap supports refcount pinning
  2018-12-04 19:08               ` Dan Williams
@ 2018-12-04 22:51                 ` Alexander Duyck
  2018-12-04 23:24                   ` Barret Rhoden
  0 siblings, 1 reply; 18+ messages in thread
From: Alexander Duyck @ 2018-12-04 22:51 UTC (permalink / raw)
  To: Dan Williams
  Cc: Paolo Bonzini, Zhang Yi, Barret Rhoden, KVM list, linux-nvdimm,
	Linux Kernel Mailing List, Linux MM, Dave Jiang, Zhang, Yu C,
	Pankaj Gupta, David Hildenbrand, Jan Kara, Christoph Hellwig,
	rkrcmar, Jérôme Glisse

On Tue, 2018-12-04 at 11:08 -0800, Dan Williams wrote:
> On Mon, Dec 3, 2018 at 1:50 PM Alexander Duyck
> <alexander.h.duyck@linux.intel.com> wrote:
> > 
> > On Mon, 2018-12-03 at 13:05 -0800, Dan Williams wrote:
> > > On Mon, Dec 3, 2018 at 12:53 PM Alexander Duyck
> > > <alexander.h.duyck@linux.intel.com> wrote:
> > > > 
> > > > On Mon, 2018-12-03 at 12:31 -0800, Dan Williams wrote:
> > > > > On Mon, Dec 3, 2018 at 12:21 PM Alexander Duyck
> > > > > <alexander.h.duyck@linux.intel.com> wrote:
> > > > > > 
> > > > > > On Mon, 2018-12-03 at 11:47 -0800, Dan Williams wrote:
> > > > > > > On Mon, Dec 3, 2018 at 11:25 AM Alexander Duyck
> > > > > > > <alexander.h.duyck@linux.intel.com> wrote:
> > > > > > > > 
> > > > > > > > Add a means of exposing if a pagemap supports refcount pinning. I am doing
> > > > > > > > this to expose if a given pagemap has backing struct pages that will allow
> > > > > > > > for the reference count of the page to be incremented to lock the page
> > > > > > > > into place.
> > > > > > > > 
> > > > > > > > The KVM code already has several spots where it was trying to use a
> > > > > > > > pfn_valid check combined with a PageReserved check to determien if it could
> > > > > > > > take a reference on the page. I am adding this check so in the case of the
> > > > > > > > page having the reserved flag checked we can check the pagemap for the page
> > > > > > > > to determine if we might fall into the special DAX case.
> > > > > > > > 
> > > > > > > > Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> > > > > > > > ---
> > > > > > > >  drivers/nvdimm/pfn_devs.c |    2 ++
> > > > > > > >  include/linux/memremap.h  |    5 ++++-
> > > > > > > >  include/linux/mm.h        |   11 +++++++++++
> > > > > > > >  3 files changed, 17 insertions(+), 1 deletion(-)
> > > > > > > > 
> > > > > > > > diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
> > > > > > > > index 6f22272e8d80..7a4a85bcf7f4 100644
> > > > > > > > --- a/drivers/nvdimm/pfn_devs.c
> > > > > > > > +++ b/drivers/nvdimm/pfn_devs.c
> > > > > > > > @@ -640,6 +640,8 @@ static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn, struct dev_pagemap *pgmap)
> > > > > > > >         } else
> > > > > > > >                 return -ENXIO;
> > > > > > > > 
> > > > > > > > +       pgmap->support_refcount_pinning = true;
> > > > > > > > +
> > > > > > > 
> > > > > > > There should be no dev_pagemap instance instance where this isn't
> > > > > > > true, so I'm missing why this is needed?
> > > > > > 
> > > > > > I thought in the case of HMM there were instances where you couldn't
> > > > > > pin the page, isn't there? Specifically I am thinking of the definition
> > > > > > of MEMORY_DEVICE_PUBLIC:
> > > > > >   Device memory that is cache coherent from device and CPU point of
> > > > > >   view. This is use on platform that have an advance system bus (like
> > > > > >   CAPI or CCIX). A driver can hotplug the device memory using
> > > > > >   ZONE_DEVICE and with that memory type. Any page of a process can be
> > > > > >   migrated to such memory. However no one should be allow to pin such
> > > > > >   memory so that it can always be evicted.
> > > > > > 
> > > > > > It sounds like MEMORY_DEVICE_PUBLIC and MMIO would want to fall into
> > > > > > the same category here in order to allow a hot-plug event to remove the
> > > > > > device and take the memory with it, or is my understanding on this not
> > > > > > correct?
> > > > > 
> > > > > I don't understand how HMM expects to enforce no pinning, but in any
> > > > > event it should always be the expectation an elevated reference count
> > > > > on a page prevents that page from disappearing. Anything else is
> > > > > broken.
> > > > 
> > > > I don't think that is true for device MMIO though.
> > > > 
> > > > In the case of MMIO you have the memory region backed by a device, if
> > > > that device is hot-plugged or fails in some way then that backing would
> > > > go away and the reads would return and all 1's response.
> > > 
> > > Until p2pdma there are no struct pages for device memory, is that what
> > > you're referring?
> > 
> > Honestly I am not sure. It is possible I am getting beyond my depth.
> > 
> > My understanding is that we have a 'struct page' for any of these pages
> > that we are currently using. It is just a matter of if we want to pass
> > the struct page around or not. So for example in the case of an MMIO
> > page we still have a 'struct page', however the PG_reserved flag is set
> > on such a page, so KVM is opting to not touch the reference count,
> > modify the dirty/accessed bits, and is generally reducing performance
> > as a result.
> > 
> > > Otherwise any device driver that leaks "struct pages" into random code
> > > paths in the kernel had better not expect to be able to
> > > surprise-remove those pages from the system. Any dev_pagemap user
> > > should expect to do a coordinated removal with the driver that waits
> > > for page references to drop before the device can be physically
> > > removed.
> > 
> > Right. This part I get. However I would imagine there still has to be
> > some exception handling in the case of a PCIe backed region of memory
> > so that if the device falls of the bus we clean up the dev_pagemap
> > memory.
> 
> There isn't. This is a gap. Persistent memory gets around this by the
> fact that none of the current NVDIMM bus implementations support
> surprise physical unplug. HMM may be broken in this regard when GPUs
> fall of the bus, but I don't think this is a recoverable situation
> once applications have been handed direct access to the memory
> resource. There is no recovery without a device-driver intermediary to
> gracefully fail transactions.
> 
> > > > Holding a reference to the page doesn't guarantee that the backing
> > > > device cannot go away.
> > > 
> > > Correct there is no physical guarantee, but that's not the point. It
> > > needs to be coordinated, otherwise all bets are off with respect to
> > > system stability.
> > 
> > Right.
> > 
> > > > I believe that is the origin of the original use
> > > > of the PageReserved check in KVM in terms of if it will try to use the
> > > > get_page/put_page functions.
> > > 
> > > Is it? MMIO does not typically have a corresponding 'struct page'.
> > 
> > I think we might be talking about different things when we say 'struct
> > page'. I'm pretty sure there has to be a 'struct page' for the MMIO
> > region as otherwise we wouldn't be able to check for the PG_reserved
> > bit in the 'struct page'. Do you maybe mean that MMIO doesn't have a
> > corresponding virtual address or TLB entry?
> 
> No.
> 
> > I know that is what we are
> > normally generating via the ioremap family of calls in device drivers
> > in order to access such memory if I am not mistaken.
> 
> I think the confusion arises from the fact that there are a few MMIO
> resources with a struct page and all the rest MMIO resources without.
> The problem comes from the coarse definition of pfn_valid(), it may
> return 'true' for things that are not System-RAM, because pfn_valid()
> may be something as simplistic as a single "address < X" check. Then
> PageReserved is a fallback to clarify the pfn_valid() result. The
> typical case is that MMIO space is not caught up in this linear map
> confusion. An MMIO address may or may not have an associated 'struct
> page' and in most cases it does not.

Okay. I think I understand this somewhat now. So the page might be
physically there, but with the reserved bit it is not supposed to be
touched.

My main concern with just dropping the bit is that we start seeing some
other uses that I was not certain what the impact would be. For example
the functions like kvm_set_pfn_accessed start going in and manipulating
things that I am not sure should be messed with for a DAX page.

> > > > I believe this is also why
> > > > MEMORY_DEVICE_PUBLIC specifically calls out that you should not allow
> > > > pinning such memory.
> > > 
> > > I don't think that call out was referencing device hotplug, I believe
> > > it was the HMM expectation that it should be able to move an HMM page
> > > from device to System-RAM at will.
> > 
> > I could be wrong. If so that would make this patch set easier since
> > essentially it would just mean that any PageReserved page that matches
> > is_zone_device_page would fall into this category then and I could just
> > drop patch 2, and probably combine the entire fix for all of this into
> > one patch as it would only really be a few additional lines.
> > 
> 
> I took another look at PageReserved usage, I really think we're better
> off just deleting PageReserved for dax and fixing up the small number
> of places where it matters. By my count there are only 44 tests for
> PageReserved and most of those are locations that would never see a
> dax page.

Right. The thing I am worried about is the cascade of secondary effects
that start kicking in because of us not setting the PageReserved bit.

So for example kvm_set_pfn_accessed can call mark_page_accessed. That
in turn takes a few calls to get into any issues as the first call will
leave the page as inactive, and set the referenced flag. The second
call to it is supposed to set the active flag but I am assuming that
won't happen due to the fact that we aren't supposed to touch LRU.
However, we are not a swap backed page so I suspect we will go through
workingset_activation() and I am not sure what the impact there will be
since I don't believe we have a memory cgroup associated with the DAX
pages do we?





^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 2/3] mm: Add support for exposing if dev_pagemap supports refcount pinning
  2018-12-04 22:51                 ` Alexander Duyck
@ 2018-12-04 23:24                   ` Barret Rhoden
  2018-12-05  0:01                     ` Alexander Duyck
  0 siblings, 1 reply; 18+ messages in thread
From: Barret Rhoden @ 2018-12-04 23:24 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Dan Williams, Paolo Bonzini, Zhang Yi, KVM list, linux-nvdimm,
	Linux Kernel Mailing List, Linux MM, Dave Jiang, Zhang, Yu C,
	Pankaj Gupta, David Hildenbrand, Jan Kara, Christoph Hellwig,
	rkrcmar, Jérôme Glisse

Hi -

On 2018-12-04 at 14:51 Alexander Duyck
<alexander.h.duyck@linux.intel.com> wrote:

[snip]

> > I think the confusion arises from the fact that there are a few MMIO
> > resources with a struct page and all the rest MMIO resources without.
> > The problem comes from the coarse definition of pfn_valid(), it may
> > return 'true' for things that are not System-RAM, because pfn_valid()
> > may be something as simplistic as a single "address < X" check. Then
> > PageReserved is a fallback to clarify the pfn_valid() result. The
> > typical case is that MMIO space is not caught up in this linear map
> > confusion. An MMIO address may or may not have an associated 'struct
> > page' and in most cases it does not.  
> 
> Okay. I think I understand this somewhat now. So the page might be
> physically there, but with the reserved bit it is not supposed to be
> touched.
> 
> My main concern with just dropping the bit is that we start seeing some
> other uses that I was not certain what the impact would be. For example
> the functions like kvm_set_pfn_accessed start going in and manipulating
> things that I am not sure should be messed with for a DAX page.

One thing regarding the accessed and dirty bits is that we might want
to have DAX pages marked dirty/accessed, even if we can't LRU-reclaim
or swap them.  I don't have a real example and I'm fairly ignorant
about the specifics here.  But one possibility would be using the A/D
bits to detect changes to a guest's memory for VM migration.  Maybe
there would be issues with KSM too.

Barret


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 2/3] mm: Add support for exposing if dev_pagemap supports refcount pinning
  2018-12-04 23:24                   ` Barret Rhoden
@ 2018-12-05  0:01                     ` Alexander Duyck
  2018-12-05  0:26                       ` Dan Williams
  0 siblings, 1 reply; 18+ messages in thread
From: Alexander Duyck @ 2018-12-05  0:01 UTC (permalink / raw)
  To: Barret Rhoden
  Cc: Dan Williams, Paolo Bonzini, Zhang Yi, KVM list, linux-nvdimm,
	Linux Kernel Mailing List, Linux MM, Dave Jiang, Zhang, Yu C,
	Pankaj Gupta, David Hildenbrand, Jan Kara, Christoph Hellwig,
	rkrcmar, Jérôme Glisse

On Tue, 2018-12-04 at 18:24 -0500, Barret Rhoden wrote:
> Hi -
> 
> On 2018-12-04 at 14:51 Alexander Duyck
> <alexander.h.duyck@linux.intel.com> wrote:
> 
> [snip]
> 
> > > I think the confusion arises from the fact that there are a few MMIO
> > > resources with a struct page and all the rest MMIO resources without.
> > > The problem comes from the coarse definition of pfn_valid(), it may
> > > return 'true' for things that are not System-RAM, because pfn_valid()
> > > may be something as simplistic as a single "address < X" check. Then
> > > PageReserved is a fallback to clarify the pfn_valid() result. The
> > > typical case is that MMIO space is not caught up in this linear map
> > > confusion. An MMIO address may or may not have an associated 'struct
> > > page' and in most cases it does not.  
> > 
> > Okay. I think I understand this somewhat now. So the page might be
> > physically there, but with the reserved bit it is not supposed to be
> > touched.
> > 
> > My main concern with just dropping the bit is that we start seeing some
> > other uses that I was not certain what the impact would be. For example
> > the functions like kvm_set_pfn_accessed start going in and manipulating
> > things that I am not sure should be messed with for a DAX page.
> 
> One thing regarding the accessed and dirty bits is that we might want
> to have DAX pages marked dirty/accessed, even if we can't LRU-reclaim
> or swap them.  I don't have a real example and I'm fairly ignorant
> about the specifics here.  But one possibility would be using the A/D
> bits to detect changes to a guest's memory for VM migration.  Maybe
> there would be issues with KSM too.
> 
> Barret

I get that, but the issue is that the code associated with those bits
currently assumes you are working with either an anonymous swap backed
page or a page cache page. We should really be updating that logic now,
and then enabling DAX to access it rather than trying to do things the
other way around which is how this feels.

- Alex



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 2/3] mm: Add support for exposing if dev_pagemap supports refcount pinning
  2018-12-05  0:01                     ` Alexander Duyck
@ 2018-12-05  0:26                       ` Dan Williams
  2018-12-05  8:13                         ` David Hildenbrand
  0 siblings, 1 reply; 18+ messages in thread
From: Dan Williams @ 2018-12-05  0:26 UTC (permalink / raw)
  To: alexander.h.duyck
  Cc: Barret Rhoden, Paolo Bonzini, Zhang Yi, KVM list, linux-nvdimm,
	Linux Kernel Mailing List, Linux MM, Dave Jiang, Zhang, Yu C,
	Pankaj Gupta, David Hildenbrand, Jan Kara, Christoph Hellwig,
	rkrcmar, Jérôme Glisse

On Tue, Dec 4, 2018 at 4:01 PM Alexander Duyck
<alexander.h.duyck@linux.intel.com> wrote:
>
> On Tue, 2018-12-04 at 18:24 -0500, Barret Rhoden wrote:
> > Hi -
> >
> > On 2018-12-04 at 14:51 Alexander Duyck
> > <alexander.h.duyck@linux.intel.com> wrote:
> >
> > [snip]
> >
> > > > I think the confusion arises from the fact that there are a few MMIO
> > > > resources with a struct page and all the rest MMIO resources without.
> > > > The problem comes from the coarse definition of pfn_valid(), it may
> > > > return 'true' for things that are not System-RAM, because pfn_valid()
> > > > may be something as simplistic as a single "address < X" check. Then
> > > > PageReserved is a fallback to clarify the pfn_valid() result. The
> > > > typical case is that MMIO space is not caught up in this linear map
> > > > confusion. An MMIO address may or may not have an associated 'struct
> > > > page' and in most cases it does not.
> > >
> > > Okay. I think I understand this somewhat now. So the page might be
> > > physically there, but with the reserved bit it is not supposed to be
> > > touched.
> > >
> > > My main concern with just dropping the bit is that we start seeing some
> > > other uses that I was not certain what the impact would be. For example
> > > the functions like kvm_set_pfn_accessed start going in and manipulating
> > > things that I am not sure should be messed with for a DAX page.
> >
> > One thing regarding the accessed and dirty bits is that we might want
> > to have DAX pages marked dirty/accessed, even if we can't LRU-reclaim
> > or swap them.  I don't have a real example and I'm fairly ignorant
> > about the specifics here.  But one possibility would be using the A/D
> > bits to detect changes to a guest's memory for VM migration.  Maybe
> > there would be issues with KSM too.
> >
> > Barret
>
> I get that, but the issue is that the code associated with those bits
> currently assumes you are working with either an anonymous swap backed
> page or a page cache page. We should really be updating that logic now,
> and then enabling DAX to access it rather than trying to do things the
> other way around which is how this feels.

Agree. I understand the concern about unintended side effects of
dropping PageReserved for dax pages, but they simply don't fit the
definition of the intended use of PageReserved. We've already had
fallout from legacy code paths doing the wrong thing with dax pages
where PageReserved wouldn't have helped. For example, see commit
6e2608dfd934 "xfs, dax: introduce xfs_dax_aops", or commit
6100e34b2526 "mm, memory_failure: Teach memory_failure() about
dev_pagemap pages". So formerly teaching kvm about these page
semantics and dropping the reliance on a side effect of PageReserved()
seems the right direction.

That said, for mark_page_accessed(), it does not look like it will
have any effect on dax pages. PageLRU will be false,
__lru_cache_activate_page() will not find a page on a percpu pagevec,
and workingset_activation() won't find an associated memcg. I would
not be surprised if mark_page_accessed() is already being called today
via the ext4 + dax use case.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 2/3] mm: Add support for exposing if dev_pagemap supports refcount pinning
  2018-12-05  0:26                       ` Dan Williams
@ 2018-12-05  8:13                         ` David Hildenbrand
  0 siblings, 0 replies; 18+ messages in thread
From: David Hildenbrand @ 2018-12-05  8:13 UTC (permalink / raw)
  To: Dan Williams, alexander.h.duyck
  Cc: Barret Rhoden, Paolo Bonzini, Zhang Yi, KVM list, linux-nvdimm,
	Linux Kernel Mailing List, Linux MM, Dave Jiang, Zhang, Yu C,
	Pankaj Gupta, Jan Kara, Christoph Hellwig, rkrcmar,
	Jérôme Glisse

On 05.12.18 01:26, Dan Williams wrote:
> On Tue, Dec 4, 2018 at 4:01 PM Alexander Duyck
> <alexander.h.duyck@linux.intel.com> wrote:
>>
>> On Tue, 2018-12-04 at 18:24 -0500, Barret Rhoden wrote:
>>> Hi -
>>>
>>> On 2018-12-04 at 14:51 Alexander Duyck
>>> <alexander.h.duyck@linux.intel.com> wrote:
>>>
>>> [snip]
>>>
>>>>> I think the confusion arises from the fact that there are a few MMIO
>>>>> resources with a struct page and all the rest MMIO resources without.
>>>>> The problem comes from the coarse definition of pfn_valid(), it may
>>>>> return 'true' for things that are not System-RAM, because pfn_valid()
>>>>> may be something as simplistic as a single "address < X" check. Then
>>>>> PageReserved is a fallback to clarify the pfn_valid() result. The
>>>>> typical case is that MMIO space is not caught up in this linear map
>>>>> confusion. An MMIO address may or may not have an associated 'struct
>>>>> page' and in most cases it does not.
>>>>
>>>> Okay. I think I understand this somewhat now. So the page might be
>>>> physically there, but with the reserved bit it is not supposed to be
>>>> touched.
>>>>
>>>> My main concern with just dropping the bit is that we start seeing some
>>>> other uses that I was not certain what the impact would be. For example
>>>> the functions like kvm_set_pfn_accessed start going in and manipulating
>>>> things that I am not sure should be messed with for a DAX page.
>>>
>>> One thing regarding the accessed and dirty bits is that we might want
>>> to have DAX pages marked dirty/accessed, even if we can't LRU-reclaim
>>> or swap them.  I don't have a real example and I'm fairly ignorant
>>> about the specifics here.  But one possibility would be using the A/D
>>> bits to detect changes to a guest's memory for VM migration.  Maybe
>>> there would be issues with KSM too.
>>>
>>> Barret
>>
>> I get that, but the issue is that the code associated with those bits
>> currently assumes you are working with either an anonymous swap backed
>> page or a page cache page. We should really be updating that logic now,
>> and then enabling DAX to access it rather than trying to do things the
>> other way around which is how this feels.
> 
> Agree. I understand the concern about unintended side effects of
> dropping PageReserved for dax pages, but they simply don't fit the
> definition of the intended use of PageReserved. We've already had
> fallout from legacy code paths doing the wrong thing with dax pages
> where PageReserved wouldn't have helped. For example, see commit
> 6e2608dfd934 "xfs, dax: introduce xfs_dax_aops", or commit
> 6100e34b2526 "mm, memory_failure: Teach memory_failure() about
> dev_pagemap pages". So formerly teaching kvm about these page
> semantics and dropping the reliance on a side effect of PageReserved()
> seems the right direction.
> 
> That said, for mark_page_accessed(), it does not look like it will
> have any effect on dax pages. PageLRU will be false,
> __lru_cache_activate_page() will not find a page on a percpu pagevec,
> and workingset_activation() won't find an associated memcg. I would
> not be surprised if mark_page_accessed() is already being called today
> via the ext4 + dax use case.
> 

I agree to what Dan says here. I'd vote for getting rid of the
PageReserved bit for these pages and rather fixing the fallout from that
(if any, I also doubt that there will be much). One thing I already
mentioned in another thread is hindering hibernation code from touching
ZONE_DEVICE memory is one thing to take care of.

PageReserved as part of a user space process can mean many things (and I
still have a patch pending for submission to document that). It can mean
zero pages, VDSO pages, MMIO pages and right now DAX pages. For the
first three, we don't want to touch the struct page ever
(->PageReserved). For DAX it should not harm (-> no need for PageReserved).

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2018-12-05  8:13 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-03 19:25 [PATCH RFC 0/3] Fix KVM misinterpreting Reserved page as an MMIO page Alexander Duyck
2018-12-03 19:25 ` [PATCH RFC 1/3] kvm: Split use cases for kvm_is_reserved_pfn to kvm_is_refcounted_pfn Alexander Duyck
2018-12-03 19:25 ` [PATCH RFC 2/3] mm: Add support for exposing if dev_pagemap supports refcount pinning Alexander Duyck
2018-12-03 19:47   ` Dan Williams
2018-12-03 20:21     ` Alexander Duyck
2018-12-03 20:31       ` Dan Williams
2018-12-03 20:53         ` Alexander Duyck
2018-12-03 21:05           ` Dan Williams
2018-12-03 21:50             ` Alexander Duyck
2018-12-04 19:08               ` Dan Williams
2018-12-04 22:51                 ` Alexander Duyck
2018-12-04 23:24                   ` Barret Rhoden
2018-12-05  0:01                     ` Alexander Duyck
2018-12-05  0:26                       ` Dan Williams
2018-12-05  8:13                         ` David Hildenbrand
2018-12-03 19:25 ` [PATCH RFC 3/3] kvm: Add additional check to determine if a page is refcounted Alexander Duyck
2018-12-04  6:59 ` [PATCH RFC 0/3] Fix KVM misinterpreting Reserved page as an MMIO page Yi Zhang
2018-12-04 18:45   ` Alexander Duyck

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).