DriverDev-Devel Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH v1 00/10] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)
@ 2019-10-24 12:09 David Hildenbrand
  2019-10-24 12:09 ` [PATCH v1 01/10] mm/memory_hotplug: Don't allow to online/offline memory blocks with holes David Hildenbrand
                   ` (10 more replies)
  0 siblings, 11 replies; 43+ messages in thread
From: David Hildenbrand @ 2019-10-24 12:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	kvm, David Hildenbrand, KarimAllah Ahmed, Benjamin Herrenschmidt,
	Dave Hansen, Alexander Duyck, Michal Hocko, Paul Mackerras,
	linux-mm, Pavel Tatashin, Paul Mackerras, Michael Ellerman,
	H. Peter Anvin, Wanpeng Li, Alexander Duyck, Dan Williams,
	Kees Cook, devel, Stefano Stabellini, Stephen Hemminger,
	Aneesh Kumar K.V, Joerg Roedel, x86, YueHaibing,
	Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Ingo Molnar, Vlastimil Babka,
	Anthony Yznaga, Oscar Salvador, Isaac J. Manjarres,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang, Sasha Levin,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Christophe Leroy, Mel Gorman, Cornelia Huck, Pavel Tatashin,
	Sean Christopherson, Thomas Gleixner, Johannes Weiner,
	Paolo Bonzini, Andrew Morton, linuxppc-dev

This is the result of a recent discussion with Michal ([1], [2]). Right
now we set all pages PG_reserved when initializing hotplugged memmaps. This
includes ZONE_DEVICE memory. In case of system memory, PG_reserved is
cleared again when onlining the memory, in case of ZONE_DEVICE memory
never.

In ancient times, we needed PG_reserved, because there was no way to tell
whether the memmap was already properly initialized. We now have
SECTION_IS_ONLINE for that in the case of !ZONE_DEVICE memory. ZONE_DEVICE
memory is already initialized deferred, and there shouldn't be a visible
change in that regard.

One of the biggest fears were side effects. I went ahead and audited all
users of PageReserved(). The details can be found in "mm/memory_hotplug:
Don't mark pages PG_reserved when initializing the memmap".

This patch set adapts all relevant users of PageReserved() to keep the
existing behavior in respect to ZONE_DEVICE pages. The biggest part part
that needs changes is KVM, to keep the existing behavior (that's all I
care about in this series).

Note that this series is able to rely completely on pfn_to_online_page().
No new is_zone_device_page() calles are introduced (as requested by Dan).
We are currently discussing a way to mark also ZONE_DEVICE memmaps as
active/initialized - pfn_active() - and lightweight locking to make sure
memmaps remain active (e.g., using RCU). We might later be able to convert
some suers of pfn_to_online_page() to pfn_active(). Details can be found
in [3], however, this represents yet another cleanup/fix we'll perform
on top of this cleanup.

I only gave it a quick test with DIMMs on x86-64, but didn't test the
ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Also, I didn't
test the KVM parts (especially with ZONE_DEVICE pages or no memmap at all).
Compile-tested on x86-64 and PPC.

Based on next/master. The current version (kept updated) can be found at:
    https://github.com/davidhildenbrand/linux.git online_reserved_cleanup

RFC -> v1:
- Dropped "staging/gasket: Prepare gasket_release_page() for PG_reserved
  changes"
- Dropped "staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved
  changes"
- Converted "mm/usercopy.c: Prepare check_page_span() for PG_reserved
  changes" to "mm/usercopy.c: Update comment in check_page_span()
  regarding ZONE_DEVICE"
- No new users of is_zone_device_page() are introduced.
- Rephrased comments and patch descriptions.

[1] https://lkml.org/lkml/2019/10/21/736
[2] https://lkml.org/lkml/2019/10/21/1034
[3] https://www.spinics.net/lists/linux-mm/msg194112.html

Cc: Michal Hocko <mhocko@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: kvm-ppc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: kvm@vger.kernel.org
Cc: linux-hyperv@vger.kernel.org
Cc: devel@driverdev.osuosl.org
Cc: xen-devel@lists.xenproject.org
Cc: x86@kernel.org
Cc: Alexander Duyck <alexander.duyck@gmail.com>

David Hildenbrand (10):
  mm/memory_hotplug: Don't allow to online/offline memory blocks with
    holes
  KVM: x86/mmu: Prepare kvm_is_mmio_pfn() for PG_reserved changes
  KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes
  vfio/type1: Prepare is_invalid_reserved_pfn() for PG_reserved changes
  powerpc/book3s: Prepare kvmppc_book3s_instantiate_page() for
    PG_reserved changes
  powerpc/64s: Prepare hash_page_do_lazy_icache() for PG_reserved
    changes
  powerpc/mm: Prepare maybe_pte_to_page() for PG_reserved changes
  x86/mm: Prepare __ioremap_check_ram() for PG_reserved changes
  mm/memory_hotplug: Don't mark pages PG_reserved when initializing the
    memmap
  mm/usercopy.c: Update comment in check_page_span() regarding
    ZONE_DEVICE

 arch/powerpc/kvm/book3s_64_mmu_radix.c | 14 +++++----
 arch/powerpc/mm/book3s64/hash_utils.c  | 10 +++---
 arch/powerpc/mm/pgtable.c              | 10 +++---
 arch/x86/kvm/mmu.c                     | 29 ++++++++++-------
 arch/x86/mm/ioremap.c                  | 13 ++++++--
 drivers/hv/hv_balloon.c                |  6 ++++
 drivers/vfio/vfio_iommu_type1.c        | 10 ++++--
 drivers/xen/balloon.c                  |  7 +++++
 include/linux/page-flags.h             |  8 +----
 mm/memory_hotplug.c                    | 43 +++++++++++++++++++-------
 mm/page_alloc.c                        | 11 -------
 mm/usercopy.c                          |  6 ++--
 virt/kvm/kvm_main.c                    | 10 ++++--
 13 files changed, 111 insertions(+), 66 deletions(-)

-- 
2.21.0

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH v1 01/10] mm/memory_hotplug: Don't allow to online/offline memory blocks with holes
  2019-10-24 12:09 [PATCH v1 00/10] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE) David Hildenbrand
@ 2019-10-24 12:09 ` David Hildenbrand
  2019-11-05  1:30   ` Dan Williams
  2019-10-24 12:09 ` [PATCH v1 02/10] KVM: x86/mmu: Prepare kvm_is_mmio_pfn() for PG_reserved changes David Hildenbrand
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 43+ messages in thread
From: David Hildenbrand @ 2019-10-24 12:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	kvm, David Hildenbrand, KarimAllah Ahmed, Benjamin Herrenschmidt,
	Dave Hansen, Alexander Duyck, Michal Hocko, Paul Mackerras,
	linux-mm, Pavel Tatashin, Paul Mackerras, Michael Ellerman,
	H. Peter Anvin, Wanpeng Li, Alexander Duyck, Dan Williams,
	Kees Cook, devel, Stefano Stabellini, Stephen Hemminger,
	Aneesh Kumar K.V, Joerg Roedel, x86, YueHaibing,
	Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Ingo Molnar, Vlastimil Babka,
	Anthony Yznaga, Oscar Salvador, Isaac J. Manjarres,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang, Sasha Levin,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Christophe Leroy, Mel Gorman, Cornelia Huck, Pavel Tatashin,
	Sean Christopherson, Thomas Gleixner, Johannes Weiner,
	Paolo Bonzini, Andrew Morton, linuxppc-dev

Our onlining/offlining code is unnecessarily complicated. Only memory
blocks added during boot can have holes (a range that is not
IORESOURCE_SYSTEM_RAM). Hotplugged memory never has holes (e.g., see
add_memory_resource()). All boot memory is alread online.

Therefore, when we stop allowing to offline memory blocks with holes, we
implicitly no longer have to deal with onlining memory blocks with holes.

This allows to simplify the code. For example, we no longer have to
worry about marking pages that fall into memory holes PG_reserved when
onlining memory. We can stop setting pages PG_reserved.

Offlining memory blocks added during boot is usually not guranteed to work
either way (unmovable data might have easily ended up on that memory during
boot). So stopping to do that should not really hurt (+ people are not
even aware of a setup where that used to work and that the existing code
still works correctly with memory holes). For the use case of offlining
memory to unplug DIMMs, we should see no change. (holes on DIMMs would be
weird).

Please note that hardware errors (PG_hwpoison) are not memory holes and
not affected by this change when offlining.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory_hotplug.c | 26 ++++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 561371ead39a..8d81730cf036 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1447,10 +1447,19 @@ static void node_states_clear_node(int node, struct memory_notify *arg)
 		node_clear_state(node, N_MEMORY);
 }
 
+static int count_system_ram_pages_cb(unsigned long start_pfn,
+				     unsigned long nr_pages, void *data)
+{
+	unsigned long *nr_system_ram_pages = data;
+
+	*nr_system_ram_pages += nr_pages;
+	return 0;
+}
+
 static int __ref __offline_pages(unsigned long start_pfn,
 		  unsigned long end_pfn)
 {
-	unsigned long pfn, nr_pages;
+	unsigned long pfn, nr_pages = 0;
 	unsigned long offlined_pages = 0;
 	int ret, node, nr_isolate_pageblock;
 	unsigned long flags;
@@ -1461,6 +1470,20 @@ static int __ref __offline_pages(unsigned long start_pfn,
 
 	mem_hotplug_begin();
 
+	/*
+	 * Don't allow to offline memory blocks that contain holes.
+	 * Consecuently, memory blocks with holes can never get onlined
+	 * (hotplugged memory has no holes and all boot memory is online).
+	 * This allows to simplify the onlining/offlining code quite a lot.
+	 */
+	walk_system_ram_range(start_pfn, end_pfn - start_pfn, &nr_pages,
+			      count_system_ram_pages_cb);
+	if (nr_pages != end_pfn - start_pfn) {
+		ret = -EINVAL;
+		reason = "memory holes";
+		goto failed_removal;
+	}
+
 	/* This makes hotplug much easier...and readable.
 	   we assume this for now. .*/
 	if (!test_pages_in_a_zone(start_pfn, end_pfn, &valid_start,
@@ -1472,7 +1495,6 @@ static int __ref __offline_pages(unsigned long start_pfn,
 
 	zone = page_zone(pfn_to_page(valid_start));
 	node = zone_to_nid(zone);
-	nr_pages = end_pfn - start_pfn;
 
 	/* set above range as isolated */
 	ret = start_isolate_page_range(start_pfn, end_pfn,
-- 
2.21.0

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH v1 02/10] KVM: x86/mmu: Prepare kvm_is_mmio_pfn() for PG_reserved changes
  2019-10-24 12:09 [PATCH v1 00/10] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE) David Hildenbrand
  2019-10-24 12:09 ` [PATCH v1 01/10] mm/memory_hotplug: Don't allow to online/offline memory blocks with holes David Hildenbrand
@ 2019-10-24 12:09 ` David Hildenbrand
  2019-11-05  1:37   ` Dan Williams
  2019-10-24 12:09 ` [PATCH v1 03/10] KVM: Prepare kvm_is_reserved_pfn() " David Hildenbrand
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 43+ messages in thread
From: David Hildenbrand @ 2019-10-24 12:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	kvm, David Hildenbrand, KarimAllah Ahmed, Benjamin Herrenschmidt,
	Dave Hansen, Alexander Duyck, Michal Hocko, Paul Mackerras,
	linux-mm, Pavel Tatashin, Paul Mackerras, Michael Ellerman,
	H. Peter Anvin, Wanpeng Li, Alexander Duyck, Dan Williams,
	Kees Cook, devel, Stefano Stabellini, Stephen Hemminger,
	Aneesh Kumar K.V, Joerg Roedel, x86, YueHaibing,
	Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Ingo Molnar, Vlastimil Babka,
	Anthony Yznaga, Oscar Salvador, Isaac J. Manjarres,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang, Sasha Levin,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Christophe Leroy, Mel Gorman, Cornelia Huck, Pavel Tatashin,
	Sean Christopherson, Thomas Gleixner, Johannes Weiner,
	Paolo Bonzini, Andrew Morton, linuxppc-dev

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

KVM has this weird use case that you can map anything from /dev/mem
into the guest. pfn_valid() is not a reliable check whether the memmap
was initialized and can be touched. pfn_to_online_page() makes sure
that we have an initialized memmap (and don't have ZONE_DEVICE memory).

Rewrite kvm_is_mmio_pfn() to make sure the function produces the
same result once we stop setting ZONE_DEVICE pages PG_reserved.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: KarimAllah Ahmed <karahmed@amazon.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/x86/kvm/mmu.c | 29 +++++++++++++++++------------
 1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 24c23c66b226..f03089a336de 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2962,20 +2962,25 @@ static bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
 
 static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
 {
+	struct page *page = pfn_to_online_page(pfn);
+
+	/*
+	 * ZONE_DEVICE pages are never online. Online pages that are reserved
+	 * either indicate the zero page or MMIO pages.
+	 */
+	if (page)
+		return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn));
+
+	/*
+	 * Anything with a valid (but not online) memmap could be ZONE_DEVICE.
+	 * Treat only UC/UC-/WC pages as MMIO.
+	 */
 	if (pfn_valid(pfn))
-		return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn)) &&
-			/*
-			 * Some reserved pages, such as those from NVDIMM
-			 * DAX devices, are not for MMIO, and can be mapped
-			 * with cached memory type for better performance.
-			 * However, the above check misconceives those pages
-			 * as MMIO, and results in KVM mapping them with UC
-			 * memory type, which would hurt the performance.
-			 * Therefore, we check the host memory type in addition
-			 * and only treat UC/UC-/WC pages as MMIO.
-			 */
-			(!pat_enabled() || pat_pfn_immune_to_uc_mtrr(pfn));
+		return !pat_enabled() || pat_pfn_immune_to_uc_mtrr(pfn);
 
+	/*
+	 * Any RAM that has no memmap (e.g., mapped via /dev/mem) is not MMIO.
+	 */
 	return !e820__mapped_raw_any(pfn_to_hpa(pfn),
 				     pfn_to_hpa(pfn + 1) - 1,
 				     E820_TYPE_RAM);
-- 
2.21.0

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH v1 03/10] KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes
  2019-10-24 12:09 [PATCH v1 00/10] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE) David Hildenbrand
  2019-10-24 12:09 ` [PATCH v1 01/10] mm/memory_hotplug: Don't allow to online/offline memory blocks with holes David Hildenbrand
  2019-10-24 12:09 ` [PATCH v1 02/10] KVM: x86/mmu: Prepare kvm_is_mmio_pfn() for PG_reserved changes David Hildenbrand
@ 2019-10-24 12:09 ` " David Hildenbrand
  2019-11-05  4:38   ` Dan Williams
  2019-10-24 12:09 ` [PATCH v1 04/10] vfio/type1: Prepare is_invalid_reserved_pfn() " David Hildenbrand
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 43+ messages in thread
From: David Hildenbrand @ 2019-10-24 12:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	kvm, David Hildenbrand, KarimAllah Ahmed, Benjamin Herrenschmidt,
	Dave Hansen, Alexander Duyck, Michal Hocko, Paul Mackerras,
	linux-mm, Pavel Tatashin, Paul Mackerras, Michael Ellerman,
	H. Peter Anvin, Wanpeng Li, Alexander Duyck, Dan Williams,
	Kees Cook, devel, Stefano Stabellini, Stephen Hemminger,
	Aneesh Kumar K.V, Joerg Roedel, x86, YueHaibing,
	Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Ingo Molnar, Vlastimil Babka,
	Anthony Yznaga, Oscar Salvador, Isaac J. Manjarres,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang, Sasha Levin,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Christophe Leroy, Mel Gorman, Cornelia Huck, Pavel Tatashin,
	Sean Christopherson, Thomas Gleixner, Johannes Weiner,
	Paolo Bonzini, Andrew Morton, linuxppc-dev

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

KVM has this weird use case that you can map anything from /dev/mem
into the guest. pfn_valid() is not a reliable check whether the memmap
was initialized and can be touched. pfn_to_online_page() makes sure
that we have an initialized memmap (and don't have ZONE_DEVICE memory).

Rewrite kvm_is_reserved_pfn() to make sure the function produces the
same result once we stop setting ZONE_DEVICE pages PG_reserved.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 virt/kvm/kvm_main.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e9eb666eb6e8..9d18cc67d124 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -151,9 +151,15 @@ __weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
 
 bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
 {
-	if (pfn_valid(pfn))
-		return PageReserved(pfn_to_page(pfn));
+	struct page *page = pfn_to_online_page(pfn);
 
+	/*
+	 * We treat any pages that are not online (not managed by the buddy)
+	 * as reserved - this includes ZONE_DEVICE pages and pages without
+	 * a memmap (e.g., mapped via /dev/mem).
+	 */
+	if (page)
+		return PageReserved(page);
 	return true;
 }
 
-- 
2.21.0

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH v1 04/10] vfio/type1: Prepare is_invalid_reserved_pfn() for PG_reserved changes
  2019-10-24 12:09 [PATCH v1 00/10] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE) David Hildenbrand
                   ` (2 preceding siblings ...)
  2019-10-24 12:09 ` [PATCH v1 03/10] KVM: Prepare kvm_is_reserved_pfn() " David Hildenbrand
@ 2019-10-24 12:09 ` " David Hildenbrand
  2019-11-07 15:40   ` Dan Williams
  2019-10-24 12:09 ` [PATCH v1 05/10] powerpc/book3s: Prepare kvmppc_book3s_instantiate_page() " David Hildenbrand
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 43+ messages in thread
From: David Hildenbrand @ 2019-10-24 12:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	kvm, David Hildenbrand, KarimAllah Ahmed, Benjamin Herrenschmidt,
	Dave Hansen, Alexander Duyck, Michal Hocko, Paul Mackerras,
	linux-mm, Pavel Tatashin, Paul Mackerras, Michael Ellerman,
	H. Peter Anvin, Wanpeng Li, Alexander Duyck, Dan Williams,
	Kees Cook, devel, Stefano Stabellini, Stephen Hemminger,
	Aneesh Kumar K.V, Joerg Roedel, x86, YueHaibing,
	Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Ingo Molnar, Vlastimil Babka,
	Anthony Yznaga, Oscar Salvador, Isaac J. Manjarres,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang, Sasha Levin,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Christophe Leroy, Mel Gorman, Cornelia Huck, Pavel Tatashin,
	Sean Christopherson, Thomas Gleixner, Johannes Weiner,
	Paolo Bonzini, Andrew Morton, linuxppc-dev

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

KVM has this weird use case that you can map anything from /dev/mem
into the guest. pfn_valid() is not a reliable check whether the memmap
was initialized and can be touched. pfn_to_online_page() makes sure
that we have an initialized memmap (and don't have ZONE_DEVICE memory).

Rewrite is_invalid_reserved_pfn() similar to kvm_is_reserved_pfn() to make
sure the function produces the same result once we stop setting ZONE_DEVICE
pages PG_reserved.

Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/vfio/vfio_iommu_type1.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 2ada8e6cdb88..f8ce8c408ba8 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -299,9 +299,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
  */
 static bool is_invalid_reserved_pfn(unsigned long pfn)
 {
-	if (pfn_valid(pfn))
-		return PageReserved(pfn_to_page(pfn));
+	struct page *page = pfn_to_online_page(pfn);
 
+	/*
+	 * We treat any pages that are not online (not managed by the buddy)
+	 * as reserved - this includes ZONE_DEVICE pages and pages without
+	 * a memmap (e.g., mapped via /dev/mem).
+	 */
+	if (page)
+		return PageReserved(page);
 	return true;
 }
 
-- 
2.21.0

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH v1 05/10] powerpc/book3s: Prepare kvmppc_book3s_instantiate_page() for PG_reserved changes
  2019-10-24 12:09 [PATCH v1 00/10] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE) David Hildenbrand
                   ` (3 preceding siblings ...)
  2019-10-24 12:09 ` [PATCH v1 04/10] vfio/type1: Prepare is_invalid_reserved_pfn() " David Hildenbrand
@ 2019-10-24 12:09 ` " David Hildenbrand
  2019-10-24 12:09 ` [PATCH v1 06/10] powerpc/64s: Prepare hash_page_do_lazy_icache() " David Hildenbrand
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 43+ messages in thread
From: David Hildenbrand @ 2019-10-24 12:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	kvm, David Hildenbrand, KarimAllah Ahmed, Benjamin Herrenschmidt,
	Dave Hansen, Alexander Duyck, Michal Hocko, Paul Mackerras,
	linux-mm, Pavel Tatashin, Paul Mackerras, Michael Ellerman,
	H. Peter Anvin, Wanpeng Li, Alexander Duyck, Dan Williams,
	Kees Cook, devel, Stefano Stabellini, Stephen Hemminger,
	Aneesh Kumar K.V, Joerg Roedel, x86, YueHaibing,
	Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Ingo Molnar, Vlastimil Babka,
	Anthony Yznaga, Oscar Salvador, Isaac J. Manjarres,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang, Sasha Levin,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Christophe Leroy, Mel Gorman, Cornelia Huck, Pavel Tatashin,
	Sean Christopherson, Thomas Gleixner, Johannes Weiner,
	Paolo Bonzini, Andrew Morton, linuxppc-dev

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

KVM has this weird use case that you can map anything from /dev/mem
into the guest. pfn_valid() is not a reliable check whether the memmap
was initialized and can be touched. pfn_to_online_page() makes sure
that we have an initialized memmap (and don't have ZONE_DEVICE memory).

Rewrite kvmppc_book3s_instantiate_page() similar to kvm_is_reserved_pfn()
to make sure the function produces the same result once we stop setting
ZONE_DEVICE pages PG_reserved.

Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 2d415c36a61d..05397c0561fc 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -801,12 +801,14 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
 					   writing, upgrade_p);
 		if (is_error_noslot_pfn(pfn))
 			return -EFAULT;
-		page = NULL;
-		if (pfn_valid(pfn)) {
-			page = pfn_to_page(pfn);
-			if (PageReserved(page))
-				page = NULL;
-		}
+		/*
+		 * We treat any pages that are not online (not managed by the
+		 * buddy) as reserved - this includes ZONE_DEVICE pages and
+		 * pages without a memmap (e.g., mapped via /dev/mem).
+		 */
+		page = pfn_to_online_page(pfn);
+		if (page && PageReserved(page))
+			page = NULL;
 	}
 
 	/*
-- 
2.21.0

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH v1 06/10] powerpc/64s: Prepare hash_page_do_lazy_icache() for PG_reserved changes
  2019-10-24 12:09 [PATCH v1 00/10] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE) David Hildenbrand
                   ` (4 preceding siblings ...)
  2019-10-24 12:09 ` [PATCH v1 05/10] powerpc/book3s: Prepare kvmppc_book3s_instantiate_page() " David Hildenbrand
@ 2019-10-24 12:09 ` " David Hildenbrand
  2019-10-24 12:09 ` [PATCH v1 07/10] powerpc/mm: Prepare maybe_pte_to_page() " David Hildenbrand
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 43+ messages in thread
From: David Hildenbrand @ 2019-10-24 12:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	kvm, David Hildenbrand, KarimAllah Ahmed, Benjamin Herrenschmidt,
	Dave Hansen, Alexander Duyck, Michal Hocko, Paul Mackerras,
	linux-mm, Pavel Tatashin, Paul Mackerras, Michael Ellerman,
	H. Peter Anvin, Wanpeng Li, Alexander Duyck, Dan Williams,
	Kees Cook, devel, Stefano Stabellini, Stephen Hemminger,
	Aneesh Kumar K.V, Joerg Roedel, x86, YueHaibing,
	Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Ingo Molnar, Vlastimil Babka,
	Anthony Yznaga, Oscar Salvador, Isaac J. Manjarres,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang, Sasha Levin,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Christophe Leroy, Mel Gorman, Cornelia Huck, Pavel Tatashin,
	Sean Christopherson, Thomas Gleixner, Johannes Weiner,
	Paolo Bonzini, Andrew Morton, linuxppc-dev

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

Rewrite hash_page_do_lazy_icache() to make sure the function produces the
same result once we stop setting ZONE_DEVICE pages PG_reserved.

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/powerpc/mm/book3s64/hash_utils.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c
index 6c123760164e..a1566039e747 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -1084,13 +1084,15 @@ void hash__early_init_mmu_secondary(void)
  */
 unsigned int hash_page_do_lazy_icache(unsigned int pp, pte_t pte, int trap)
 {
-	struct page *page;
+	struct page *page = pfn_to_online_page(pte_pfn(pte));
 
-	if (!pfn_valid(pte_pfn(pte)))
+	/*
+	 * We ignore any pages that are not online (not managed by the buddy).
+	 * This includes ZONE_DEVICE pages.
+	 */
+	if (!page)
 		return pp;
 
-	page = pte_page(pte);
-
 	/* page is dirty */
 	if (!test_bit(PG_arch_1, &page->flags) && !PageReserved(page)) {
 		if (trap == 0x400) {
-- 
2.21.0

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH v1 07/10] powerpc/mm: Prepare maybe_pte_to_page() for PG_reserved changes
  2019-10-24 12:09 [PATCH v1 00/10] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE) David Hildenbrand
                   ` (5 preceding siblings ...)
  2019-10-24 12:09 ` [PATCH v1 06/10] powerpc/64s: Prepare hash_page_do_lazy_icache() " David Hildenbrand
@ 2019-10-24 12:09 ` " David Hildenbrand
  2019-10-24 12:09 ` [PATCH v1 08/10] x86/mm: Prepare __ioremap_check_ram() " David Hildenbrand
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 43+ messages in thread
From: David Hildenbrand @ 2019-10-24 12:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	kvm, David Hildenbrand, KarimAllah Ahmed, Benjamin Herrenschmidt,
	Dave Hansen, Alexander Duyck, Michal Hocko, Paul Mackerras,
	linux-mm, Pavel Tatashin, Paul Mackerras, Michael Ellerman,
	H. Peter Anvin, Wanpeng Li, Alexander Duyck, Dan Williams,
	Kees Cook, devel, Stefano Stabellini, Stephen Hemminger,
	Aneesh Kumar K.V, Joerg Roedel, x86, YueHaibing,
	Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Ingo Molnar, Vlastimil Babka,
	Anthony Yznaga, Oscar Salvador, Isaac J. Manjarres,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang, Sasha Levin,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Christophe Leroy, Mel Gorman, Cornelia Huck, Pavel Tatashin,
	Sean Christopherson, Thomas Gleixner, Johannes Weiner,
	Paolo Bonzini, Andrew Morton, linuxppc-dev

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

Rewrite maybe_pte_to_page() to make sure the function produces the
same result once we stop setting ZONE_DEVICE pages PG_reserved.

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Allison Randal <allison@lohutok.net>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/powerpc/mm/pgtable.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index e3759b69f81b..613c98fa7dc0 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -55,10 +55,12 @@ static struct page *maybe_pte_to_page(pte_t pte)
 	unsigned long pfn = pte_pfn(pte);
 	struct page *page;
 
-	if (unlikely(!pfn_valid(pfn)))
-		return NULL;
-	page = pfn_to_page(pfn);
-	if (PageReserved(page))
+	/*
+	 * We reject any pages that are not online (not managed by the buddy).
+	 * This includes ZONE_DEVICE pages.
+	 */
+	page = pfn_to_online_page(pfn);
+	if (unlikely(!page || PageReserved(page)))
 		return NULL;
 	return page;
 }
-- 
2.21.0

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH v1 08/10] x86/mm: Prepare __ioremap_check_ram() for PG_reserved changes
  2019-10-24 12:09 [PATCH v1 00/10] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE) David Hildenbrand
                   ` (6 preceding siblings ...)
  2019-10-24 12:09 ` [PATCH v1 07/10] powerpc/mm: Prepare maybe_pte_to_page() " David Hildenbrand
@ 2019-10-24 12:09 ` " David Hildenbrand
  2019-10-24 12:09 ` [PATCH v1 09/10] mm/memory_hotplug: Don't mark pages PG_reserved when initializing the memmap David Hildenbrand
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 43+ messages in thread
From: David Hildenbrand @ 2019-10-24 12:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	kvm, David Hildenbrand, KarimAllah Ahmed, Benjamin Herrenschmidt,
	Dave Hansen, Alexander Duyck, Michal Hocko, Paul Mackerras,
	linux-mm, Pavel Tatashin, Paul Mackerras, Michael Ellerman,
	H. Peter Anvin, Wanpeng Li, Alexander Duyck, Dan Williams,
	Kees Cook, devel, Stefano Stabellini, Stephen Hemminger,
	Aneesh Kumar K.V, Joerg Roedel, x86, YueHaibing,
	Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Ingo Molnar, Vlastimil Babka,
	Anthony Yznaga, Oscar Salvador, Isaac J. Manjarres,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang, Sasha Levin,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Christophe Leroy, Mel Gorman, Cornelia Huck, Pavel Tatashin,
	Sean Christopherson, Thomas Gleixner, Johannes Weiner,
	Paolo Bonzini, Andrew Morton, linuxppc-dev

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

Rewrite __ioremap_check_ram() to make sure the function produces the
same result once we stop setting ZONE_DEVICE pages PG_reserved.

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/x86/mm/ioremap.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index a39dcdb5ae34..db6913b48edf 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -77,10 +77,17 @@ static unsigned int __ioremap_check_ram(struct resource *res)
 	start_pfn = (res->start + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	stop_pfn = (res->end + 1) >> PAGE_SHIFT;
 	if (stop_pfn > start_pfn) {
-		for (i = 0; i < (stop_pfn - start_pfn); ++i)
-			if (pfn_valid(start_pfn + i) &&
-			    !PageReserved(pfn_to_page(start_pfn + i)))
+		for (i = 0; i < (stop_pfn - start_pfn); ++i) {
+			struct page *page;
+			 /*
+			  * We treat any pages that are not online (not managed
+			  * by the buddy) as not being RAM. This includes
+			  * ZONE_DEVICE pages.
+			  */
+			page = pfn_to_online_page(start_pfn + i);
+			if (page && !PageReserved(page))
 				return IORES_MAP_SYSTEM_RAM;
+		}
 	}
 
 	return 0;
-- 
2.21.0

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH v1 09/10] mm/memory_hotplug: Don't mark pages PG_reserved when initializing the memmap
  2019-10-24 12:09 [PATCH v1 00/10] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE) David Hildenbrand
                   ` (7 preceding siblings ...)
  2019-10-24 12:09 ` [PATCH v1 08/10] x86/mm: Prepare __ioremap_check_ram() " David Hildenbrand
@ 2019-10-24 12:09 ` David Hildenbrand
  2019-11-04 22:44   ` Boris Ostrovsky
  2019-10-24 12:09 ` [PATCH v1 10/10] mm/usercopy.c: Update comment in check_page_span() regarding ZONE_DEVICE David Hildenbrand
  2019-11-01 19:24 ` [PATCH v1 00/10] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE) David Hildenbrand
  10 siblings, 1 reply; 43+ messages in thread
From: David Hildenbrand @ 2019-10-24 12:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	kvm, David Hildenbrand, KarimAllah Ahmed, Benjamin Herrenschmidt,
	Dave Hansen, Alexander Duyck, Michal Hocko, Paul Mackerras,
	linux-mm, Pavel Tatashin, Paul Mackerras, Michael Ellerman,
	H. Peter Anvin, Wanpeng Li, Alexander Duyck, Dan Williams,
	Kees Cook, devel, Stefano Stabellini, Stephen Hemminger,
	Aneesh Kumar K.V, Joerg Roedel, x86, YueHaibing,
	Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Ingo Molnar, Vlastimil Babka,
	Anthony Yznaga, Oscar Salvador, Isaac J. Manjarres,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang, Sasha Levin,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Christophe Leroy, Mel Gorman, Cornelia Huck, Pavel Tatashin,
	Sean Christopherson, Thomas Gleixner, Johannes Weiner,
	Paolo Bonzini, Andrew Morton, linuxppc-dev

Everything should be prepared to stop setting pages PG_reserved when
initializing the memmap on memory hotplug. Most importantly, we
stop marking ZONE_DEVICE pages PG_reserved.

a) We made sure that any code that relied on PG_reserved to detect
   ZONE_DEVICE memory will no longer rely on PG_reserved (especially,
   by relying on pfn_to_online_page() for now). Details can be found
   below.
b) We made sure that memory blocks with holes cannot be offlined and
   therefore also not onlined. We have quite some code that relies on
   memory holes being marked PG_reserved. This is now not an issue
   anymore.

generic_online_page() still calls __free_pages_core(), which performs
__ClearPageReserved(p). AFAIKS, this should not hurt.

It is worth nothing that the users of online_page_callback_t might see a
change. E.g., until now, pages not freed to the buddy by the HyperV
balloonm were set PG_reserved until freed via generic_online_page(). Now,
they would look like ordinarily allocated pages (refcount == 1). This
callback is used by the XEN balloon and the HyperV balloon. To not
introduce any silent errors, keep marking the pages PG_reserved. We can
most probably stop doing that, but have to double check if there are
issues (e.g., offlining code aborts right away in has_unmovable_pages()
when it runs into a PageReserved(page))

Update the documentation at various places in the MM core.

There are three PageReserved() users that might be affected by this change.
 - drivers/staging/gasket/gasket_page_table.c:gasket_release_page()
   -> We might (unlikely) set SetPageDirty() on a ZONE_DEVICE page
   -> I assume "we don't care"
 - drivers/staging/kpc2000/kpc_dma/fileops.c:transfer_complete_cb()
   -> We might (unlikely) set SetPageDirty() on a ZONE_DEVICE page
   -> I assume "we don't care"
 - mm/usercopy.c: check_page_span()
   -> According to Dan, non-HMM ZONE_DEVICE usage excluded this code since
      commit 52f476a323f9 ("libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY
      overhead")
   -> It is unclear whether we rally cared about ZONE_DEVICE here (HMM) or
      simply about "PG_reserved". The worst thing that could happen is a
      false negative with CONFIG_HARDENED_USERCOPY we should be able to
      identify easily.
   -> There is a discussion to rip out that code completely
   -> I assume "not relevant" / "we don't care"

I audited the other PageReserved() users. They don't affect ZONE_DEVICE:
 - mm/page_owner.c:pagetypeinfo_showmixedcount_print()
   -> Never called for ZONE_DEVICE, (+ pfn_to_online_page(pfn))
 - mm/page_owner.c:init_pages_in_zone()
   -> Never called for ZONE_DEVICE (!populated_zone(zone))
 - mm/page_ext.c:free_page_ext()
   -> Only a BUG_ON(PageReserved(page)), not relevant
 - mm/page_ext.c:has_unmovable_pages()
   -> Not releveant for ZONE_DEVICE
 - mm/page_ext.c:pfn_range_valid_contig()
   -> pfn_to_online_page() already guards us
 - mm/mempolicy.c:queue_pages_pte_range()
   -> vm_normal_page() checks against pte_devmap()
 - mm/memory-failure.c:hwpoison_user_mappings()
   -> Not reached via memory_failure() due to pfn_to_online_page()
   -> Also not reached indirectly via memory_failure_hugetlb()
 - mm/hugetlb.c:gather_bootmem_prealloc()
   -> Only a WARN_ON(PageReserved(page)), not relevant
 - kernel/power/snapshot.c:saveable_highmem_page()
   -> pfn_to_online_page() already guards us
 - kernel/power/snapshot.c:saveable_page()
   -> pfn_to_online_page() already guards us
 - fs/proc/task_mmu.c:can_gather_numa_stats()
   -> vm_normal_page() checks against pte_devmap()
 - fs/proc/task_mmu.c:can_gather_numa_stats_pmd
   -> vm_normal_page_pmd() checks against pte_devmap()
 - fs/proc/page.c:stable_page_flags()
   -> The reserved bit is simply copied, irrelevant
 - drivers/firmware/memmap.c:release_firmware_map_entry()
   -> really only a check to detect bootmem. Not relevant for ZONE_DEVICE
 - arch/ia64/kernel/mca_drv.c
 - arch/mips/mm/init.c
 - arch/mips/mm/ioremap.c
 - arch/nios2/mm/ioremap.c
 - arch/parisc/mm/ioremap.c
 - arch/sparc/mm/tlb.c
 - arch/xtensa/mm/cache.c
   -> No ZONE_DEVICE support
 - arch/powerpc/mm/init_64.c:vmemmap_free()
   -> Special-cases memmap on altmap
   -> Only a check for bootmem
 - arch/x86/kernel/alternative.c:__text_poke()
   -> Only a WARN_ON(!PageReserved(pages[0])) to verify it is bootmem
 - arch/x86/mm/init_64.c
   -> Only a check for bootmem

Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Matt Sickler <Matt.Sickler@daktronics.com>
Cc: Kees Cook <keescook@chromium.org>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/hv/hv_balloon.c    |  6 ++++++
 drivers/xen/balloon.c      |  7 +++++++
 include/linux/page-flags.h |  8 +-------
 mm/memory_hotplug.c        | 17 +++++++----------
 mm/page_alloc.c            | 11 -----------
 5 files changed, 21 insertions(+), 28 deletions(-)

diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
index c722079d3c24..3214b0ef5247 100644
--- a/drivers/hv/hv_balloon.c
+++ b/drivers/hv/hv_balloon.c
@@ -670,6 +670,12 @@ static struct notifier_block hv_memory_nb = {
 /* Check if the particular page is backed and can be onlined and online it. */
 static void hv_page_online_one(struct hv_hotadd_state *has, struct page *pg)
 {
+	/*
+	 * TODO: The core used to mark the pages reserved. Most probably
+	 * we can stop doing that now.
+	 */
+	__SetPageReserved(pg);
+
 	if (!has_pfn_is_backed(has, page_to_pfn(pg))) {
 		if (!PageOffline(pg))
 			__SetPageOffline(pg);
diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 4f2e78a5e4db..af69f057913a 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -374,6 +374,13 @@ static void xen_online_page(struct page *page, unsigned int order)
 	mutex_lock(&balloon_mutex);
 	for (i = 0; i < size; i++) {
 		p = pfn_to_page(start_pfn + i);
+		/*
+		 * TODO: The core used to mark the pages reserved. Most probably
+		 * we can stop doing that now. However, especially
+		 * alloc_xenballooned_pages() left PG_reserved set
+		 * on pages that can get mapped to user space.
+		 */
+		__SetPageReserved(p);
 		balloon_append(p);
 	}
 	mutex_unlock(&balloon_mutex);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 3b8e5c5f7e1f..e9a7465219d1 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -30,24 +30,18 @@
  * - Pages falling into physical memory gaps - not IORESOURCE_SYSRAM. Trying
  *   to read/write these pages might end badly. Don't touch!
  * - The zero page(s)
- * - Pages not added to the page allocator when onlining a section because
- *   they were excluded via the online_page_callback() or because they are
- *   PG_hwpoison.
  * - Pages allocated in the context of kexec/kdump (loaded kernel image,
  *   control pages, vmcoreinfo)
  * - MMIO/DMA pages. Some architectures don't allow to ioremap pages that are
  *   not marked PG_reserved (as they might be in use by somebody else who does
  *   not respect the caching strategy).
- * - Pages part of an offline section (struct pages of offline sections should
- *   not be trusted as they will be initialized when first onlined).
  * - MCA pages on ia64
  * - Pages holding CPU notes for POWER Firmware Assisted Dump
- * - Device memory (e.g. PMEM, DAX, HMM)
  * Some PG_reserved pages will be excluded from the hibernation image.
  * PG_reserved does in general not hinder anybody from dumping or swapping
  * and is no longer required for remap_pfn_range(). ioremap might require it.
  * Consequently, PG_reserved for a page mapped into user space can indicate
- * the zero page, the vDSO, MMIO pages or device memory.
+ * the zero page, the vDSO, or MMIO pages.
  *
  * The PG_private bitflag is set on pagecache pages if they contain filesystem
  * specific data (which is normally at page->private). It can be used by
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 8d81730cf036..2714edce98dd 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -501,9 +501,7 @@ static void __remove_section(unsigned long pfn, unsigned long nr_pages,
  * @altmap: alternative device page map or %NULL if default memmap is used
  *
  * Generic helper function to remove section mappings and sysfs entries
- * for the section of the memory we are removing. Caller needs to make
- * sure that pages are marked reserved and zones are adjust properly by
- * calling offline_pages().
+ * for the section of the memory we are removing.
  */
 void __remove_pages(unsigned long pfn, unsigned long nr_pages,
 		    struct vmem_altmap *altmap)
@@ -584,9 +582,9 @@ static int online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
 	int order;
 
 	/*
-	 * Online the pages. The callback might decide to keep some pages
-	 * PG_reserved (to add them to the buddy later), but we still account
-	 * them as being online/belonging to this zone ("present").
+	 * Online the pages. The callback might decide to not free some pages
+	 * (to add them to the buddy later), but we still account them as
+	 * being online/belonging to this zone ("present").
 	 */
 	for (pfn = start_pfn; pfn < end_pfn; pfn += 1ul << order) {
 		order = min(MAX_ORDER - 1, get_order(PFN_PHYS(end_pfn - pfn)));
@@ -659,8 +657,7 @@ static void __meminit resize_pgdat_range(struct pglist_data *pgdat, unsigned lon
 }
 /*
  * Associate the pfn range with the given zone, initializing the memmaps
- * and resizing the pgdat/zone data to span the added pages. After this
- * call, all affected pages are PG_reserved.
+ * and resizing the pgdat/zone data to span the added pages.
  */
 void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
 		unsigned long nr_pages, struct vmem_altmap *altmap)
@@ -684,8 +681,8 @@ void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
 	/*
 	 * TODO now we have a visible range of pages which are not associated
 	 * with their zone properly. Not nice but set_pfnblock_flags_mask
-	 * expects the zone spans the pfn range. All the pages in the range
-	 * are reserved so nobody should be touching them so we should be safe
+	 * expects the zone spans the pfn range. The sections are not yet
+	 * marked online so nobody should be touching the memmap.
 	 */
 	memmap_init_zone(nr_pages, nid, zone_idx(zone), start_pfn,
 			MEMMAP_HOTPLUG, altmap);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f9488efff680..aa6ecac27b68 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5927,8 +5927,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 
 		page = pfn_to_page(pfn);
 		__init_single_page(page, pfn, zone, nid);
-		if (context == MEMMAP_HOTPLUG)
-			__SetPageReserved(page);
 
 		/*
 		 * Mark the block movable so that blocks are reserved for
@@ -5980,15 +5978,6 @@ void __ref memmap_init_zone_device(struct zone *zone,
 
 		__init_single_page(page, pfn, zone_idx, nid);
 
-		/*
-		 * Mark page reserved as it will need to wait for onlining
-		 * phase for it to be fully associated with a zone.
-		 *
-		 * We can use the non-atomic __set_bit operation for setting
-		 * the flag as we are still initializing the pages.
-		 */
-		__SetPageReserved(page);
-
 		/*
 		 * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
 		 * and zone_device_data.  It is a bug if a ZONE_DEVICE page is
-- 
2.21.0

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH v1 10/10] mm/usercopy.c: Update comment in check_page_span() regarding ZONE_DEVICE
  2019-10-24 12:09 [PATCH v1 00/10] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE) David Hildenbrand
                   ` (8 preceding siblings ...)
  2019-10-24 12:09 ` [PATCH v1 09/10] mm/memory_hotplug: Don't mark pages PG_reserved when initializing the memmap David Hildenbrand
@ 2019-10-24 12:09 ` David Hildenbrand
  2019-11-01 19:24 ` [PATCH v1 00/10] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE) David Hildenbrand
  10 siblings, 0 replies; 43+ messages in thread
From: David Hildenbrand @ 2019-10-24 12:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	kvm, David Hildenbrand, KarimAllah Ahmed, Benjamin Herrenschmidt,
	Dave Hansen, Alexander Duyck, Michal Hocko, Paul Mackerras,
	linux-mm, Pavel Tatashin, Paul Mackerras, Michael Ellerman,
	H. Peter Anvin, Wanpeng Li, Alexander Duyck, Dan Williams,
	Kees Cook, devel, Stefano Stabellini, Stephen Hemminger,
	Aneesh Kumar K.V, Joerg Roedel, x86, YueHaibing,
	Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Ingo Molnar, Vlastimil Babka,
	Anthony Yznaga, Oscar Salvador, Isaac J. Manjarres,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang, Sasha Levin,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Christophe Leroy, Mel Gorman, Cornelia Huck, Pavel Tatashin,
	Sean Christopherson, Thomas Gleixner, Johannes Weiner,
	Paolo Bonzini, Andrew Morton, linuxppc-dev

ZONE_DEVICE (a.k.a. device memory) is no longer marked PG_reserved. Update
the comment.

While at it, make it match what the code is acutally doing (reject vs.
accept).

Cc: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Isaac J. Manjarres" <isaacm@codeaurora.org>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/usercopy.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/usercopy.c b/mm/usercopy.c
index 660717a1ea5c..80f254024c97 100644
--- a/mm/usercopy.c
+++ b/mm/usercopy.c
@@ -199,9 +199,9 @@ static inline void check_page_span(const void *ptr, unsigned long n,
 		return;
 
 	/*
-	 * Reject if range is entirely either Reserved (i.e. special or
-	 * device memory), or CMA. Otherwise, reject since the object spans
-	 * several independently allocated pages.
+	 * Accept if the range is entirely either Reserved ("special") or
+	 * CMA. Otherwise, reject since the object spans several independently
+	 * allocated pages.
 	 */
 	is_reserved = PageReserved(page);
 	is_cma = is_migrate_cma_page(page);
-- 
2.21.0

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v1 00/10] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)
  2019-10-24 12:09 [PATCH v1 00/10] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE) David Hildenbrand
                   ` (9 preceding siblings ...)
  2019-10-24 12:09 ` [PATCH v1 10/10] mm/usercopy.c: Update comment in check_page_span() regarding ZONE_DEVICE David Hildenbrand
@ 2019-11-01 19:24 ` David Hildenbrand
  10 siblings, 0 replies; 43+ messages in thread
From: David Hildenbrand @ 2019-11-01 19:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	kvm, Pavel Tatashin, KarimAllah Ahmed, Benjamin Herrenschmidt,
	Dave Hansen, Alexander Duyck, Michal Hocko, Paul Mackerras,
	linux-mm, Michael Ellerman, H. Peter Anvin, Wanpeng Li,
	Alexander Duyck, Dan Williams, Kees Cook, devel,
	Stefano Stabellini, Stephen Hemminger, Aneesh Kumar K.V,
	Joerg Roedel, x86, YueHaibing, Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Jeff Moyer, Ingo Molnar,
	Vlastimil Babka, Anthony Yznaga, Oscar Salvador,
	Isaac J. Manjarres, Juergen Gross, Anshuman Khandual,
	Haiyang Zhang, Sasha Levin, kvm-ppc, Qian Cai, Alex Williamson,
	Mike Rapoport, Borislav Petkov, Nicholas Piggin, Andy Lutomirski,
	xen-devel, Boris Ostrovsky, Vitaly Kuznetsov, Allison Randal,
	Jim Mattson, Christophe Leroy, Mel Gorman, Cornelia Huck,
	Pavel Tatashin, Sean Christopherson, Thomas Gleixner,
	Johannes Weiner, Paolo Bonzini, Andrew Morton, linuxppc-dev

On 24.10.19 14:09, David Hildenbrand wrote:
> This is the result of a recent discussion with Michal ([1], [2]). Right
> now we set all pages PG_reserved when initializing hotplugged memmaps. This
> includes ZONE_DEVICE memory. In case of system memory, PG_reserved is
> cleared again when onlining the memory, in case of ZONE_DEVICE memory
> never.
> 
> In ancient times, we needed PG_reserved, because there was no way to tell
> whether the memmap was already properly initialized. We now have
> SECTION_IS_ONLINE for that in the case of !ZONE_DEVICE memory. ZONE_DEVICE
> memory is already initialized deferred, and there shouldn't be a visible
> change in that regard.
> 
> One of the biggest fears were side effects. I went ahead and audited all
> users of PageReserved(). The details can be found in "mm/memory_hotplug:
> Don't mark pages PG_reserved when initializing the memmap".
> 
> This patch set adapts all relevant users of PageReserved() to keep the
> existing behavior in respect to ZONE_DEVICE pages. The biggest part part
> that needs changes is KVM, to keep the existing behavior (that's all I
> care about in this series).
> 
> Note that this series is able to rely completely on pfn_to_online_page().
> No new is_zone_device_page() calles are introduced (as requested by Dan).
> We are currently discussing a way to mark also ZONE_DEVICE memmaps as
> active/initialized - pfn_active() - and lightweight locking to make sure
> memmaps remain active (e.g., using RCU). We might later be able to convert
> some suers of pfn_to_online_page() to pfn_active(). Details can be found
> in [3], however, this represents yet another cleanup/fix we'll perform
> on top of this cleanup.
> 
> I only gave it a quick test with DIMMs on x86-64, but didn't test the
> ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Also, I didn't
> test the KVM parts (especially with ZONE_DEVICE pages or no memmap at all).
> Compile-tested on x86-64 and PPC.
> 

Jeff Moyer ran some NVDIMM test cases for me (thanks!!!), including 
xfstests, pmdk, and ndctl. No regressions found.

I will run some KVM tests, especially NDIMM passthrough, but will have 
to setup a test environment first.

I would appreciate some review in the meantime. :)

-- 

Thanks,

David / dhildenb

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v1 09/10] mm/memory_hotplug: Don't mark pages PG_reserved when initializing the memmap
  2019-10-24 12:09 ` [PATCH v1 09/10] mm/memory_hotplug: Don't mark pages PG_reserved when initializing the memmap David Hildenbrand
@ 2019-11-04 22:44   ` Boris Ostrovsky
  2019-11-05 10:18     ` David Hildenbrand
  0 siblings, 1 reply; 43+ messages in thread
From: Boris Ostrovsky @ 2019-11-04 22:44 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	kvm, Pavel Tatashin, KarimAllah Ahmed, Benjamin Herrenschmidt,
	Dave Hansen, Alexander Duyck, Michal Hocko, Paul Mackerras,
	linux-mm, Paul Mackerras, Michael Ellerman, H. Peter Anvin,
	Wanpeng Li, Alexander Duyck, Thomas Gleixner, Kees Cook, devel,
	Stefano Stabellini, Stephen Hemminger, Aneesh Kumar K.V,
	Joerg Roedel, x86, YueHaibing, Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Ingo Molnar, Vlastimil Babka,
	Anthony Yznaga, Oscar Salvador, Isaac J. Manjarres,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang, Sasha Levin,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Dan Williams, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Christophe Leroy, Mel Gorman, Cornelia Huck, Pavel Tatashin,
	Sean Christopherson, Johannes Weiner, Paolo Bonzini,
	Andrew Morton, linuxppc-dev

On 10/24/19 8:09 AM, David Hildenbrand wrote:
> diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
> index 4f2e78a5e4db..af69f057913a 100644
> --- a/drivers/xen/balloon.c
> +++ b/drivers/xen/balloon.c
> @@ -374,6 +374,13 @@ static void xen_online_page(struct page *page, unsigned int order)
>  	mutex_lock(&balloon_mutex);
>  	for (i = 0; i < size; i++) {
>  		p = pfn_to_page(start_pfn + i);
> +		/*
> +		 * TODO: The core used to mark the pages reserved. Most probably
> +		 * we can stop doing that now. However, especially
> +		 * alloc_xenballooned_pages() left PG_reserved set
> +		 * on pages that can get mapped to user space.
> +		 */
> +		__SetPageReserved(p);

I suspect this is not needed. Pages can get into balloon either from
here or from non-hotplug path (e.g. decrease_reservation()) and so when
we get a page from the balloon we would get a random page that may or
may not have Reserved bit set.

-boris


>  		balloon_append(p);
>  	}
>  	mutex_unlock(&balloon_mutex);
>

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v1 01/10] mm/memory_hotplug: Don't allow to online/offline memory blocks with holes
  2019-10-24 12:09 ` [PATCH v1 01/10] mm/memory_hotplug: Don't allow to online/offline memory blocks with holes David Hildenbrand
@ 2019-11-05  1:30   ` Dan Williams
  2019-11-05  9:31     ` David Hildenbrand
  0 siblings, 1 reply; 43+ messages in thread
From: Dan Williams @ 2019-11-05  1:30 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	KVM list, Pavel Tatashin, KarimAllah Ahmed,
	Benjamin Herrenschmidt, Dave Hansen, Alexander Duyck,
	Michal Hocko, Paul Mackerras, Linux MM, Paul Mackerras,
	Michael Ellerman, H. Peter Anvin, Wanpeng Li, Alexander Duyck,
	Thomas Gleixner, Kees Cook, devel, Stefano Stabellini,
	Stephen Hemminger, Aneesh Kumar K.V, Joerg Roedel, X86 ML,
	YueHaibing, Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Ingo Molnar, Vlastimil Babka,
	Anthony Yznaga, Oscar Salvador, Isaac J. Manjarres,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang, Sasha Levin,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Christophe Leroy, Mel Gorman, Cornelia Huck, Pavel Tatashin,
	Linux Kernel Mailing List, Sean Christopherson, Johannes Weiner,
	Paolo Bonzini, Andrew Morton, linuxppc-dev

On Thu, Oct 24, 2019 at 5:10 AM David Hildenbrand <david@redhat.com> wrote:
>
> Our onlining/offlining code is unnecessarily complicated. Only memory
> blocks added during boot can have holes (a range that is not
> IORESOURCE_SYSTEM_RAM). Hotplugged memory never has holes (e.g., see
> add_memory_resource()). All boot memory is alread online.

s/alread/already/

...also perhaps clarify "already online" by what point in time and why
that is relevant. For example a description of the difference between
the SetPageReserved() in the bootmem path and the one in the hotplug
path.

> Therefore, when we stop allowing to offline memory blocks with holes, we
> implicitly no longer have to deal with onlining memory blocks with holes.

Maybe an explicit reference of the code areas that deal with holes
would help to back up that assertion. Certainly it would have saved me
some time for the review.

> This allows to simplify the code. For example, we no longer have to
> worry about marking pages that fall into memory holes PG_reserved when
> onlining memory. We can stop setting pages PG_reserved.

...but not for bootmem, right?

>
> Offlining memory blocks added during boot is usually not guranteed to work

s/guranteed/guaranteed/

> either way (unmovable data might have easily ended up on that memory during
> boot). So stopping to do that should not really hurt (+ people are not
> even aware of a setup where that used to work

Maybe put a "Link: https://lkml.kernel.org/r/$msg_id" to that discussion?

> and that the existing code
> still works correctly with memory holes). For the use case of offlining
> memory to unplug DIMMs, we should see no change. (holes on DIMMs would be
> weird).

However, less memory can be offlined than was theoretically allowed
previously, so I don't understand the "we should see no change"
comment. I still agree that's a price worth paying to get the code
cleanups and if someone screams we can look at adding it back, but the
fact that it was already fragile seems decent enough protection.

>
> Please note that hardware errors (PG_hwpoison) are not memory holes and
> not affected by this change when offlining.
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/memory_hotplug.c | 26 ++++++++++++++++++++++++--
>  1 file changed, 24 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 561371ead39a..8d81730cf036 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1447,10 +1447,19 @@ static void node_states_clear_node(int node, struct memory_notify *arg)
>                 node_clear_state(node, N_MEMORY);
>  }
>
> +static int count_system_ram_pages_cb(unsigned long start_pfn,
> +                                    unsigned long nr_pages, void *data)
> +{
> +       unsigned long *nr_system_ram_pages = data;
> +
> +       *nr_system_ram_pages += nr_pages;
> +       return 0;
> +}
> +
>  static int __ref __offline_pages(unsigned long start_pfn,
>                   unsigned long end_pfn)
>  {
> -       unsigned long pfn, nr_pages;
> +       unsigned long pfn, nr_pages = 0;
>         unsigned long offlined_pages = 0;
>         int ret, node, nr_isolate_pageblock;
>         unsigned long flags;
> @@ -1461,6 +1470,20 @@ static int __ref __offline_pages(unsigned long start_pfn,
>
>         mem_hotplug_begin();
>
> +       /*
> +        * Don't allow to offline memory blocks that contain holes.
> +        * Consecuently, memory blocks with holes can never get onlined

s/Consecuently/Consequently/

> +        * (hotplugged memory has no holes and all boot memory is online).
> +        * This allows to simplify the onlining/offlining code quite a lot.
> +        */

The last sentence of this comment makes sense in the context of this
patch, but I don't think it stands by itself in the code base after
the fact. The person reading the comment can't see the simplifications
because the code is already gone. I'd clarify it to talk about why it
is safe to not mess around with PG_Reserved in the hotplug path
because of this check.

After those clarifications you can add:

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v1 02/10] KVM: x86/mmu: Prepare kvm_is_mmio_pfn() for PG_reserved changes
  2019-10-24 12:09 ` [PATCH v1 02/10] KVM: x86/mmu: Prepare kvm_is_mmio_pfn() for PG_reserved changes David Hildenbrand
@ 2019-11-05  1:37   ` Dan Williams
  2019-11-05 11:09     ` David Hildenbrand
  0 siblings, 1 reply; 43+ messages in thread
From: Dan Williams @ 2019-11-05  1:37 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	KVM list, Pavel Tatashin, KarimAllah Ahmed,
	Benjamin Herrenschmidt, Dave Hansen, Alexander Duyck,
	Michal Hocko, Paul Mackerras, Linux MM, Paul Mackerras,
	Michael Ellerman, H. Peter Anvin, Wanpeng Li, Alexander Duyck,
	Thomas Gleixner, Kees Cook, devel, Stefano Stabellini,
	Stephen Hemminger, Aneesh Kumar K.V, Joerg Roedel, X86 ML,
	YueHaibing, Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Ingo Molnar, Vlastimil Babka,
	Anthony Yznaga, Oscar Salvador, Isaac J. Manjarres,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang, Sasha Levin,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Christophe Leroy, Mel Gorman, Cornelia Huck, Pavel Tatashin,
	Linux Kernel Mailing List, Sean Christopherson, Johannes Weiner,
	Paolo Bonzini, Andrew Morton, linuxppc-dev

On Thu, Oct 24, 2019 at 5:10 AM David Hildenbrand <david@redhat.com> wrote:
>
> Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
> change that.
>
> KVM has this weird use case that you can map anything from /dev/mem
> into the guest. pfn_valid() is not a reliable check whether the memmap
> was initialized and can be touched. pfn_to_online_page() makes sure
> that we have an initialized memmap (and don't have ZONE_DEVICE memory).
>
> Rewrite kvm_is_mmio_pfn() to make sure the function produces the
> same result once we stop setting ZONE_DEVICE pages PG_reserved.
>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: "Radim Krčmář" <rkrcmar@redhat.com>
> Cc: Sean Christopherson <sean.j.christopherson@intel.com>
> Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
> Cc: Wanpeng Li <wanpengli@tencent.com>
> Cc: Jim Mattson <jmattson@google.com>
> Cc: Joerg Roedel <joro@8bytes.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: KarimAllah Ahmed <karahmed@amazon.de>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  arch/x86/kvm/mmu.c | 29 +++++++++++++++++------------
>  1 file changed, 17 insertions(+), 12 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 24c23c66b226..f03089a336de 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -2962,20 +2962,25 @@ static bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
>
>  static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
>  {
> +       struct page *page = pfn_to_online_page(pfn);
> +
> +       /*
> +        * ZONE_DEVICE pages are never online. Online pages that are reserved
> +        * either indicate the zero page or MMIO pages.
> +        */
> +       if (page)
> +               return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn));
> +
> +       /*
> +        * Anything with a valid (but not online) memmap could be ZONE_DEVICE.
> +        * Treat only UC/UC-/WC pages as MMIO.

You might clarify that ZONE_DEVICE pages that belong to WB cacheable
pmem have the correct memtype established by devm_memremap_pages().

Other than that, feel free to add:

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v1 03/10] KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes
  2019-10-24 12:09 ` [PATCH v1 03/10] KVM: Prepare kvm_is_reserved_pfn() " David Hildenbrand
@ 2019-11-05  4:38   ` Dan Williams
  2019-11-05  9:17     ` David Hildenbrand
  0 siblings, 1 reply; 43+ messages in thread
From: Dan Williams @ 2019-11-05  4:38 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	KVM list, Pavel Tatashin, KarimAllah Ahmed,
	Benjamin Herrenschmidt, Dave Hansen, Alexander Duyck,
	Michal Hocko, Paul Mackerras, Linux MM, Paul Mackerras,
	Michael Ellerman, H. Peter Anvin, Wanpeng Li, Alexander Duyck,
	Thomas Gleixner, Kees Cook, devel, Stefano Stabellini,
	Stephen Hemminger, Aneesh Kumar K.V, Joerg Roedel, X86 ML,
	YueHaibing, Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Ingo Molnar, Vlastimil Babka,
	Anthony Yznaga, Oscar Salvador, Isaac J. Manjarres,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang, Sasha Levin,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Christophe Leroy, Mel Gorman, Adam Borowski, Cornelia Huck,
	Pavel Tatashin, Linux Kernel Mailing List, Sean Christopherson,
	Johannes Weiner, Paolo Bonzini, Andrew Morton, linuxppc-dev

On Thu, Oct 24, 2019 at 5:11 AM David Hildenbrand <david@redhat.com> wrote:
>
> Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
> change that.
>
> KVM has this weird use case that you can map anything from /dev/mem
> into the guest. pfn_valid() is not a reliable check whether the memmap
> was initialized and can be touched. pfn_to_online_page() makes sure
> that we have an initialized memmap (and don't have ZONE_DEVICE memory).
>
> Rewrite kvm_is_reserved_pfn() to make sure the function produces the
> same result once we stop setting ZONE_DEVICE pages PG_reserved.
>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: "Radim Krčmář" <rkrcmar@redhat.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: KarimAllah Ahmed <karahmed@amazon.de>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  virt/kvm/kvm_main.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index e9eb666eb6e8..9d18cc67d124 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -151,9 +151,15 @@ __weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
>
>  bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
>  {
> -       if (pfn_valid(pfn))
> -               return PageReserved(pfn_to_page(pfn));
> +       struct page *page = pfn_to_online_page(pfn);
>
> +       /*
> +        * We treat any pages that are not online (not managed by the buddy)
> +        * as reserved - this includes ZONE_DEVICE pages and pages without
> +        * a memmap (e.g., mapped via /dev/mem).
> +        */
> +       if (page)
> +               return PageReserved(page);
>         return true;
>  }

So after this all the pfn_valid() usage in kvm_main.c is replaced with
pfn_to_online_page()? Looks correct to me.

However, I'm worried that kvm is taking reference on ZONE_DEVICE pages
through some other path resulting in this:

    https://lore.kernel.org/linux-nvdimm/20190919154708.GA24650@angband.pl/

I'll see if this patch set modulates or maintains that failure mode.
_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v1 03/10] KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes
  2019-11-05  4:38   ` Dan Williams
@ 2019-11-05  9:17     ` David Hildenbrand
  2019-11-05  9:49       ` David Hildenbrand
  0 siblings, 1 reply; 43+ messages in thread
From: David Hildenbrand @ 2019-11-05  9:17 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	KVM list, Pavel Tatashin, KarimAllah Ahmed,
	Benjamin Herrenschmidt, Dave Hansen, Alexander Duyck,
	Michal Hocko, Paul Mackerras, Linux MM, Paul Mackerras,
	Michael Ellerman, H. Peter Anvin, Wanpeng Li, Alexander Duyck,
	Thomas Gleixner, Kees Cook, devel, Stefano Stabellini,
	Stephen Hemminger, Aneesh Kumar K.V, Joerg Roedel, X86 ML,
	YueHaibing, Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Ingo Molnar, Vlastimil Babka,
	Anthony Yznaga, Oscar Salvador, Isaac J. Manjarres,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang, Sasha Levin,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Christophe Leroy, Mel Gorman, Adam Borowski, Cornelia Huck,
	Pavel Tatashin, Linux Kernel Mailing List, Sean Christopherson,
	Johannes Weiner, Paolo Bonzini, Andrew Morton, linuxppc-dev

On 05.11.19 05:38, Dan Williams wrote:
> On Thu, Oct 24, 2019 at 5:11 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
>> change that.
>>
>> KVM has this weird use case that you can map anything from /dev/mem
>> into the guest. pfn_valid() is not a reliable check whether the memmap
>> was initialized and can be touched. pfn_to_online_page() makes sure
>> that we have an initialized memmap (and don't have ZONE_DEVICE memory).
>>
>> Rewrite kvm_is_reserved_pfn() to make sure the function produces the
>> same result once we stop setting ZONE_DEVICE pages PG_reserved.
>>
>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>> Cc: "Radim Krčmář" <rkrcmar@redhat.com>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: KarimAllah Ahmed <karahmed@amazon.de>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   virt/kvm/kvm_main.c | 10 ++++++++--
>>   1 file changed, 8 insertions(+), 2 deletions(-)
>>
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index e9eb666eb6e8..9d18cc67d124 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -151,9 +151,15 @@ __weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
>>
>>   bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
>>   {
>> -       if (pfn_valid(pfn))
>> -               return PageReserved(pfn_to_page(pfn));
>> +       struct page *page = pfn_to_online_page(pfn);
>>
>> +       /*
>> +        * We treat any pages that are not online (not managed by the buddy)
>> +        * as reserved - this includes ZONE_DEVICE pages and pages without
>> +        * a memmap (e.g., mapped via /dev/mem).
>> +        */
>> +       if (page)
>> +               return PageReserved(page);
>>          return true;
>>   }
> 
> So after this all the pfn_valid() usage in kvm_main.c is replaced with
> pfn_to_online_page()? Looks correct to me.
> 
> However, I'm worried that kvm is taking reference on ZONE_DEVICE pages
> through some other path resulting in this:
> 
>      https://lore.kernel.org/linux-nvdimm/20190919154708.GA24650@angband.pl/
> 
> I'll see if this patch set modulates or maintains that failure mode.
> 

I'd assume that the behavior is unchanged. Ithink we get a reference to 
these ZONE_DEVICE pages via __get_user_pages_fast() and friends in 
hva_to_pfn_fast() and friends in virt/kvm/kvm_main.c

-- 

Thanks,

David / dhildenb

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v1 01/10] mm/memory_hotplug: Don't allow to online/offline memory blocks with holes
  2019-11-05  1:30   ` Dan Williams
@ 2019-11-05  9:31     ` David Hildenbrand
  0 siblings, 0 replies; 43+ messages in thread
From: David Hildenbrand @ 2019-11-05  9:31 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	KVM list, Pavel Tatashin, KarimAllah Ahmed,
	Benjamin Herrenschmidt, Dave Hansen, Alexander Duyck,
	Michal Hocko, Paul Mackerras, Linux MM, Paul Mackerras,
	Michael Ellerman, H. Peter Anvin, Wanpeng Li, Alexander Duyck,
	Thomas Gleixner, Kees Cook, devel, Stefano Stabellini,
	Stephen Hemminger, Aneesh Kumar K.V, Joerg Roedel, X86 ML,
	YueHaibing, Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Ingo Molnar, Vlastimil Babka,
	Anthony Yznaga, Oscar Salvador, Isaac J. Manjarres,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang, Sasha Levin,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Christophe Leroy, Mel Gorman, Cornelia Huck, Pavel Tatashin,
	Linux Kernel Mailing List, Sean Christopherson, Johannes Weiner,
	Paolo Bonzini, Andrew Morton, linuxppc-dev

On 05.11.19 02:30, Dan Williams wrote:
> On Thu, Oct 24, 2019 at 5:10 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> Our onlining/offlining code is unnecessarily complicated. Only memory
>> blocks added during boot can have holes (a range that is not
>> IORESOURCE_SYSTEM_RAM). Hotplugged memory never has holes (e.g., see
>> add_memory_resource()). All boot memory is alread online.
> 
> s/alread/already/
> 

Thanks.

> ...also perhaps clarify "already online" by what point in time and why
> that is relevant. For example a description of the difference between
> the SetPageReserved() in the bootmem path and the one in the hotplug
> path.

Will add.

> 
>> Therefore, when we stop allowing to offline memory blocks with holes, we
>> implicitly no longer have to deal with onlining memory blocks with holes.
> 
> Maybe an explicit reference of the code areas that deal with holes
> would help to back up that assertion. Certainly it would have saved me
> some time for the review.

I can add a reference to the onlining code that will only online pages 
that don't fall into memory holes.

> 
>> This allows to simplify the code. For example, we no longer have to
>> worry about marking pages that fall into memory holes PG_reserved when
>> onlining memory. We can stop setting pages PG_reserved.
> 
> ...but not for bootmem, right?

Yes, bootmem is not changed. (especially, early allocations and memory 
holes are marked PG_reserved - basically everything not given to the 
buddy after boot)

> 
>>
>> Offlining memory blocks added during boot is usually not guranteed to work
> 
> s/guranteed/guaranteed/

Thanks.

> 
>> either way (unmovable data might have easily ended up on that memory during
>> boot). So stopping to do that should not really hurt (+ people are not
>> even aware of a setup where that used to work
> 
> Maybe put a "Link: https://lkml.kernel.org/r/$msg_id" to that discussion?
> 
>> and that the existing code
>> still works correctly with memory holes). For the use case of offlining
>> memory to unplug DIMMs, we should see no change. (holes on DIMMs would be
>> weird).
> 
> However, less memory can be offlined than was theoretically allowed
> previously, so I don't understand the "we should see no change"
> comment. I still agree that's a price worth paying to get the code
> cleanups and if someone screams we can look at adding it back, but the
> fact that it was already fragile seems decent enough protection.

Hotplugged DIMMs (add_memory()) have no holes and will therefore see no 
change.

>>
>> Please note that hardware errors (PG_hwpoison) are not memory holes and
>> not affected by this change when offlining.
>>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Oscar Salvador <osalvador@suse.de>
>> Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   mm/memory_hotplug.c | 26 ++++++++++++++++++++++++--
>>   1 file changed, 24 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 561371ead39a..8d81730cf036 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -1447,10 +1447,19 @@ static void node_states_clear_node(int node, struct memory_notify *arg)
>>                  node_clear_state(node, N_MEMORY);
>>   }
>>
>> +static int count_system_ram_pages_cb(unsigned long start_pfn,
>> +                                    unsigned long nr_pages, void *data)
>> +{
>> +       unsigned long *nr_system_ram_pages = data;
>> +
>> +       *nr_system_ram_pages += nr_pages;
>> +       return 0;
>> +}
>> +
>>   static int __ref __offline_pages(unsigned long start_pfn,
>>                    unsigned long end_pfn)
>>   {
>> -       unsigned long pfn, nr_pages;
>> +       unsigned long pfn, nr_pages = 0;
>>          unsigned long offlined_pages = 0;
>>          int ret, node, nr_isolate_pageblock;
>>          unsigned long flags;
>> @@ -1461,6 +1470,20 @@ static int __ref __offline_pages(unsigned long start_pfn,
>>
>>          mem_hotplug_begin();
>>
>> +       /*
>> +        * Don't allow to offline memory blocks that contain holes.
>> +        * Consecuently, memory blocks with holes can never get onlined
> 
> s/Consecuently/Consequently/

Thanks.

> 
>> +        * (hotplugged memory has no holes and all boot memory is online).
>> +        * This allows to simplify the onlining/offlining code quite a lot.
>> +        */
> 
> The last sentence of this comment makes sense in the context of this
> patch, but I don't think it stands by itself in the code base after
> the fact. The person reading the comment can't see the simplifications
> because the code is already gone. I'd clarify it to talk about why it
> is safe to not mess around with PG_Reserved in the hotplug path
> because of this check.

I'll think of something. It's not just the PG_reserved handling but the 
whole pfn_valid()/walk_system_ram_range() handling that can be simplified.

> 
> After those clarifications you can add:
> 
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> 

Thanks!

-- 

Thanks,

David / dhildenb

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v1 03/10] KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes
  2019-11-05  9:17     ` David Hildenbrand
@ 2019-11-05  9:49       ` David Hildenbrand
  2019-11-05 10:02         ` David Hildenbrand
  0 siblings, 1 reply; 43+ messages in thread
From: David Hildenbrand @ 2019-11-05  9:49 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	KVM list, Pavel Tatashin, KarimAllah Ahmed,
	Benjamin Herrenschmidt, Dave Hansen, Alexander Duyck,
	Michal Hocko, Paul Mackerras, Linux MM, Paul Mackerras,
	Michael Ellerman, H. Peter Anvin, Wanpeng Li, Alexander Duyck,
	Thomas Gleixner, Kees Cook, devel, Stefano Stabellini,
	Stephen Hemminger, Aneesh Kumar K.V, Joerg Roedel, X86 ML,
	YueHaibing, Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Ingo Molnar, Vlastimil Babka,
	Anthony Yznaga, Oscar Salvador, Isaac J. Manjarres,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang, Sasha Levin,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Christophe Leroy, Mel Gorman, Adam Borowski, Cornelia Huck,
	Pavel Tatashin, Linux Kernel Mailing List, Sean Christopherson,
	Johannes Weiner, Paolo Bonzini, Andrew Morton, linuxppc-dev

On 05.11.19 10:17, David Hildenbrand wrote:
> On 05.11.19 05:38, Dan Williams wrote:
>> On Thu, Oct 24, 2019 at 5:11 AM David Hildenbrand <david@redhat.com> wrote:
>>>
>>> Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
>>> change that.
>>>
>>> KVM has this weird use case that you can map anything from /dev/mem
>>> into the guest. pfn_valid() is not a reliable check whether the memmap
>>> was initialized and can be touched. pfn_to_online_page() makes sure
>>> that we have an initialized memmap (and don't have ZONE_DEVICE memory).
>>>
>>> Rewrite kvm_is_reserved_pfn() to make sure the function produces the
>>> same result once we stop setting ZONE_DEVICE pages PG_reserved.
>>>
>>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>>> Cc: "Radim Krčmář" <rkrcmar@redhat.com>
>>> Cc: Michal Hocko <mhocko@kernel.org>
>>> Cc: Dan Williams <dan.j.williams@intel.com>
>>> Cc: KarimAllah Ahmed <karahmed@amazon.de>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>>    virt/kvm/kvm_main.c | 10 ++++++++--
>>>    1 file changed, 8 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>> index e9eb666eb6e8..9d18cc67d124 100644
>>> --- a/virt/kvm/kvm_main.c
>>> +++ b/virt/kvm/kvm_main.c
>>> @@ -151,9 +151,15 @@ __weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
>>>
>>>    bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
>>>    {
>>> -       if (pfn_valid(pfn))
>>> -               return PageReserved(pfn_to_page(pfn));
>>> +       struct page *page = pfn_to_online_page(pfn);
>>>
>>> +       /*
>>> +        * We treat any pages that are not online (not managed by the buddy)
>>> +        * as reserved - this includes ZONE_DEVICE pages and pages without
>>> +        * a memmap (e.g., mapped via /dev/mem).
>>> +        */
>>> +       if (page)
>>> +               return PageReserved(page);
>>>           return true;
>>>    }
>>
>> So after this all the pfn_valid() usage in kvm_main.c is replaced with
>> pfn_to_online_page()? Looks correct to me.
>>
>> However, I'm worried that kvm is taking reference on ZONE_DEVICE pages
>> through some other path resulting in this:
>>
>>       https://lore.kernel.org/linux-nvdimm/20190919154708.GA24650@angband.pl/
>>
>> I'll see if this patch set modulates or maintains that failure mode.
>>
> 
> I'd assume that the behavior is unchanged. Ithink we get a reference to
> these ZONE_DEVICE pages via __get_user_pages_fast() and friends in
> hva_to_pfn_fast() and friends in virt/kvm/kvm_main.c
> 

I think I know what's going wrong:

Pages that are pinned via gfn_to_pfn() and friends take a references, 
however are often released via 
kvm_release_pfn_clean()/kvm_release_pfn_dirty()/kvm_release_page_clean()...


E.g., in arch/x86/kvm/x86.c:reexecute_instruction()

...
pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa));
...
kvm_release_pfn_clean(pfn);



void kvm_release_pfn_clean(kvm_pfn_t pfn)
{
	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn))
		put_page(pfn_to_page(pfn));
}

This function makes perfect sense as the counterpart for kvm_get_pfn():

void kvm_get_pfn(kvm_pfn_t pfn)
{
	if (!kvm_is_reserved_pfn(pfn))
		get_page(pfn_to_page(pfn));
}


As all ZONE_DEVICE pages are currently reserved, pages pinned via 
gfn_to_pfn() and friends will often not see a put_page() AFAIKS.

Now, my patch does not change that, the result of 
kvm_is_reserved_pfn(pfn) will be unchanged. A proper fix for that would 
probably be

a) To drop the reference to ZONE_DEVICE pages in gfn_to_pfn() and 
friends, after you successfully pinned the pages. (not sure if that's 
the right thing to do but you're the expert)

b) To not use kvm_release_pfn_clean() and friends on pages that were 
definitely pinned.

-- 

Thanks,

David / dhildenb

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v1 03/10] KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes
  2019-11-05  9:49       ` David Hildenbrand
@ 2019-11-05 10:02         ` David Hildenbrand
  2019-11-05 16:00           ` Sean Christopherson
  0 siblings, 1 reply; 43+ messages in thread
From: David Hildenbrand @ 2019-11-05 10:02 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	KVM list, Pavel Tatashin, KarimAllah Ahmed,
	Benjamin Herrenschmidt, Dave Hansen, Alexander Duyck,
	Michal Hocko, Paul Mackerras, Linux MM, Paul Mackerras,
	Michael Ellerman, H. Peter Anvin, Wanpeng Li, Alexander Duyck,
	Thomas Gleixner, Kees Cook, devel, Stefano Stabellini,
	Stephen Hemminger, Aneesh Kumar K.V, Joerg Roedel, X86 ML,
	YueHaibing, Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Ingo Molnar, Vlastimil Babka,
	Anthony Yznaga, Oscar Salvador, Isaac J. Manjarres,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang, Sasha Levin,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Christophe Leroy, Mel Gorman, Adam Borowski, Cornelia Huck,
	Pavel Tatashin, Linux Kernel Mailing List, Sean Christopherson,
	Johannes Weiner, Paolo Bonzini, Andrew Morton, linuxppc-dev

On 05.11.19 10:49, David Hildenbrand wrote:
> On 05.11.19 10:17, David Hildenbrand wrote:
>> On 05.11.19 05:38, Dan Williams wrote:
>>> On Thu, Oct 24, 2019 at 5:11 AM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
>>>> change that.
>>>>
>>>> KVM has this weird use case that you can map anything from /dev/mem
>>>> into the guest. pfn_valid() is not a reliable check whether the memmap
>>>> was initialized and can be touched. pfn_to_online_page() makes sure
>>>> that we have an initialized memmap (and don't have ZONE_DEVICE memory).
>>>>
>>>> Rewrite kvm_is_reserved_pfn() to make sure the function produces the
>>>> same result once we stop setting ZONE_DEVICE pages PG_reserved.
>>>>
>>>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>>>> Cc: "Radim Krčmář" <rkrcmar@redhat.com>
>>>> Cc: Michal Hocko <mhocko@kernel.org>
>>>> Cc: Dan Williams <dan.j.williams@intel.com>
>>>> Cc: KarimAllah Ahmed <karahmed@amazon.de>
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>> ---
>>>>     virt/kvm/kvm_main.c | 10 ++++++++--
>>>>     1 file changed, 8 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>>> index e9eb666eb6e8..9d18cc67d124 100644
>>>> --- a/virt/kvm/kvm_main.c
>>>> +++ b/virt/kvm/kvm_main.c
>>>> @@ -151,9 +151,15 @@ __weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
>>>>
>>>>     bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
>>>>     {
>>>> -       if (pfn_valid(pfn))
>>>> -               return PageReserved(pfn_to_page(pfn));
>>>> +       struct page *page = pfn_to_online_page(pfn);
>>>>
>>>> +       /*
>>>> +        * We treat any pages that are not online (not managed by the buddy)
>>>> +        * as reserved - this includes ZONE_DEVICE pages and pages without
>>>> +        * a memmap (e.g., mapped via /dev/mem).
>>>> +        */
>>>> +       if (page)
>>>> +               return PageReserved(page);
>>>>            return true;
>>>>     }
>>>
>>> So after this all the pfn_valid() usage in kvm_main.c is replaced with
>>> pfn_to_online_page()? Looks correct to me.
>>>
>>> However, I'm worried that kvm is taking reference on ZONE_DEVICE pages
>>> through some other path resulting in this:
>>>
>>>        https://lore.kernel.org/linux-nvdimm/20190919154708.GA24650@angband.pl/
>>>
>>> I'll see if this patch set modulates or maintains that failure mode.
>>>
>>
>> I'd assume that the behavior is unchanged. Ithink we get a reference to
>> these ZONE_DEVICE pages via __get_user_pages_fast() and friends in
>> hva_to_pfn_fast() and friends in virt/kvm/kvm_main.c
>>
> 
> I think I know what's going wrong:
> 
> Pages that are pinned via gfn_to_pfn() and friends take a references,
> however are often released via
> kvm_release_pfn_clean()/kvm_release_pfn_dirty()/kvm_release_page_clean()...
> 
> 
> E.g., in arch/x86/kvm/x86.c:reexecute_instruction()
> 
> ...
> pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa));
> ...
> kvm_release_pfn_clean(pfn);
> 
> 
> 
> void kvm_release_pfn_clean(kvm_pfn_t pfn)
> {
> 	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn))
> 		put_page(pfn_to_page(pfn));
> }
> 
> This function makes perfect sense as the counterpart for kvm_get_pfn():
> 
> void kvm_get_pfn(kvm_pfn_t pfn)
> {
> 	if (!kvm_is_reserved_pfn(pfn))
> 		get_page(pfn_to_page(pfn));
> }
> 
> 
> As all ZONE_DEVICE pages are currently reserved, pages pinned via
> gfn_to_pfn() and friends will often not see a put_page() AFAIKS.
> 
> Now, my patch does not change that, the result of
> kvm_is_reserved_pfn(pfn) will be unchanged. A proper fix for that would
> probably be
> 
> a) To drop the reference to ZONE_DEVICE pages in gfn_to_pfn() and
> friends, after you successfully pinned the pages. (not sure if that's
> the right thing to do but you're the expert)
> 
> b) To not use kvm_release_pfn_clean() and friends on pages that were
> definitely pinned.
> 

(talking to myself, sorry)

Thinking again, dropping this patch from this series could effectively 
also fix that issue. E.g., kvm_release_pfn_clean() and friends would 
always do a put_page() if "pfn_valid() and !PageReserved()", so after 
patch 9 also on ZONDE_DEVICE pages.

But it would have side effects that might not be desired. E.g.,:

1. kvm_pfn_to_page() would also return ZONE_DEVICE pages (might even be 
the right thing to do).

2. kvm_set_pfn_dirty() would also set ZONE_DEVICE pages dirty (might be 
okay)

-- 

Thanks,

David / dhildenb

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v1 09/10] mm/memory_hotplug: Don't mark pages PG_reserved when initializing the memmap
  2019-11-04 22:44   ` Boris Ostrovsky
@ 2019-11-05 10:18     ` David Hildenbrand
  2019-11-05 16:06       ` Boris Ostrovsky
  0 siblings, 1 reply; 43+ messages in thread
From: David Hildenbrand @ 2019-11-05 10:18 UTC (permalink / raw)
  To: Boris Ostrovsky, linux-kernel
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	kvm, Pavel Tatashin, KarimAllah Ahmed, Benjamin Herrenschmidt,
	Dave Hansen, Alexander Duyck, Michal Hocko, Paul Mackerras,
	linux-mm, Paul Mackerras, Michael Ellerman, H. Peter Anvin,
	Wanpeng Li, Alexander Duyck, Thomas Gleixner, Kees Cook, devel,
	Stefano Stabellini, Stephen Hemminger, Aneesh Kumar K.V,
	Joerg Roedel, x86, YueHaibing, Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Ingo Molnar, Vlastimil Babka,
	Anthony Yznaga, Oscar Salvador, Isaac J. Manjarres,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang, Sasha Levin,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Dan Williams, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Christophe Leroy, Mel Gorman, Cornelia Huck, Pavel Tatashin,
	Sean Christopherson, Johannes Weiner, Paolo Bonzini,
	Andrew Morton, linuxppc-dev

On 04.11.19 23:44, Boris Ostrovsky wrote:
> On 10/24/19 8:09 AM, David Hildenbrand wrote:
>> diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
>> index 4f2e78a5e4db..af69f057913a 100644
>> --- a/drivers/xen/balloon.c
>> +++ b/drivers/xen/balloon.c
>> @@ -374,6 +374,13 @@ static void xen_online_page(struct page *page, unsigned int order)
>>   	mutex_lock(&balloon_mutex);
>>   	for (i = 0; i < size; i++) {
>>   		p = pfn_to_page(start_pfn + i);
>> +		/*
>> +		 * TODO: The core used to mark the pages reserved. Most probably
>> +		 * we can stop doing that now. However, especially
>> +		 * alloc_xenballooned_pages() left PG_reserved set
>> +		 * on pages that can get mapped to user space.
>> +		 */
>> +		__SetPageReserved(p);
> 
> I suspect this is not needed. Pages can get into balloon either from
> here or from non-hotplug path (e.g. decrease_reservation()) and so when
> we get a page from the balloon we would get a random page that may or
> may not have Reserved bit set.

Yeah, I also think it is fine. If you agree, I'll drop this hunk and add 
details to the patch description.


-- 

Thanks,

David / dhildenb

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v1 02/10] KVM: x86/mmu: Prepare kvm_is_mmio_pfn() for PG_reserved changes
  2019-11-05  1:37   ` Dan Williams
@ 2019-11-05 11:09     ` David Hildenbrand
  0 siblings, 0 replies; 43+ messages in thread
From: David Hildenbrand @ 2019-11-05 11:09 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	KVM list, Pavel Tatashin, KarimAllah Ahmed,
	Benjamin Herrenschmidt, Dave Hansen, Alexander Duyck,
	Michal Hocko, Paul Mackerras, Linux MM, Paul Mackerras,
	Michael Ellerman, H. Peter Anvin, Wanpeng Li, Alexander Duyck,
	Thomas Gleixner, Kees Cook, devel, Stefano Stabellini,
	Stephen Hemminger, Aneesh Kumar K.V, Joerg Roedel, X86 ML,
	YueHaibing, Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Ingo Molnar, Vlastimil Babka,
	Anthony Yznaga, Oscar Salvador, Isaac J. Manjarres,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang, Sasha Levin,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Christophe Leroy, Mel Gorman, Cornelia Huck, Pavel Tatashin,
	Linux Kernel Mailing List, Sean Christopherson, Johannes Weiner,
	Paolo Bonzini, Andrew Morton, linuxppc-dev

On 05.11.19 02:37, Dan Williams wrote:
> On Thu, Oct 24, 2019 at 5:10 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
>> change that.
>>
>> KVM has this weird use case that you can map anything from /dev/mem
>> into the guest. pfn_valid() is not a reliable check whether the memmap
>> was initialized and can be touched. pfn_to_online_page() makes sure
>> that we have an initialized memmap (and don't have ZONE_DEVICE memory).
>>
>> Rewrite kvm_is_mmio_pfn() to make sure the function produces the
>> same result once we stop setting ZONE_DEVICE pages PG_reserved.
>>
>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>> Cc: "Radim Krčmář" <rkrcmar@redhat.com>
>> Cc: Sean Christopherson <sean.j.christopherson@intel.com>
>> Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
>> Cc: Wanpeng Li <wanpengli@tencent.com>
>> Cc: Jim Mattson <jmattson@google.com>
>> Cc: Joerg Roedel <joro@8bytes.org>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Borislav Petkov <bp@alien8.de>
>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>> Cc: KarimAllah Ahmed <karahmed@amazon.de>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   arch/x86/kvm/mmu.c | 29 +++++++++++++++++------------
>>   1 file changed, 17 insertions(+), 12 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 24c23c66b226..f03089a336de 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -2962,20 +2962,25 @@ static bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
>>
>>   static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
>>   {
>> +       struct page *page = pfn_to_online_page(pfn);
>> +
>> +       /*
>> +        * ZONE_DEVICE pages are never online. Online pages that are reserved
>> +        * either indicate the zero page or MMIO pages.
>> +        */
>> +       if (page)
>> +               return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn));
>> +
>> +       /*
>> +        * Anything with a valid (but not online) memmap could be ZONE_DEVICE.
>> +        * Treat only UC/UC-/WC pages as MMIO.
> 
> You might clarify that ZONE_DEVICE pages that belong to WB cacheable
> pmem have the correct memtype established by devm_memremap_pages().

/*
  * Anything with a valid (but not online) memmap could be ZONE_DEVICE.
  * Treat only UC/UC-/WC pages as MMIO. devm_memremap_pages() established
  * the correct memtype for WB cacheable ZONE_DEVICE pages.
  */

Thanks!

> 
> Other than that, feel free to add:
> 
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> 


-- 

Thanks,

David / dhildenb

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v1 03/10] KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes
  2019-11-05 10:02         ` David Hildenbrand
@ 2019-11-05 16:00           ` Sean Christopherson
  2019-11-05 20:30             ` David Hildenbrand
  0 siblings, 1 reply; 43+ messages in thread
From: Sean Christopherson @ 2019-11-05 16:00 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	KVM list, Pavel Tatashin, KarimAllah Ahmed,
	Benjamin Herrenschmidt, Dave Hansen, Alexander Duyck,
	Michal Hocko, Paul Mackerras, Linux MM, Paul Mackerras,
	Michael Ellerman, H. Peter Anvin, Wanpeng Li, Alexander Duyck,
	Boris Ostrovsky, Kees Cook, devel, Stefano Stabellini,
	Stephen Hemminger, Aneesh Kumar K.V, Joerg Roedel, X86 ML,
	YueHaibing, Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Ingo Molnar, Vlastimil Babka,
	Anthony Yznaga, Oscar Salvador, Isaac J. Manjarres,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang, Sasha Levin,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Dan Williams, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Christophe Leroy, Mel Gorman, Adam Borowski, Cornelia Huck,
	Pavel Tatashin, Linux Kernel Mailing List, Thomas Gleixner,
	Johannes Weiner, Paolo Bonzini, Andrew Morton, linuxppc-dev

On Tue, Nov 05, 2019 at 11:02:46AM +0100, David Hildenbrand wrote:
> On 05.11.19 10:49, David Hildenbrand wrote:
> >On 05.11.19 10:17, David Hildenbrand wrote:
> >>On 05.11.19 05:38, Dan Williams wrote:
> >>>On Thu, Oct 24, 2019 at 5:11 AM David Hildenbrand <david@redhat.com> wrote:
> >>>>
> >>>>Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
> >>>>change that.
> >>>>
> >>>>KVM has this weird use case that you can map anything from /dev/mem
> >>>>into the guest. pfn_valid() is not a reliable check whether the memmap
> >>>>was initialized and can be touched. pfn_to_online_page() makes sure
> >>>>that we have an initialized memmap (and don't have ZONE_DEVICE memory).
> >>>>
> >>>>Rewrite kvm_is_reserved_pfn() to make sure the function produces the
> >>>>same result once we stop setting ZONE_DEVICE pages PG_reserved.
> >>>>
> >>>>Cc: Paolo Bonzini <pbonzini@redhat.com>
> >>>>Cc: "Radim Krčmář" <rkrcmar@redhat.com>
> >>>>Cc: Michal Hocko <mhocko@kernel.org>
> >>>>Cc: Dan Williams <dan.j.williams@intel.com>
> >>>>Cc: KarimAllah Ahmed <karahmed@amazon.de>
> >>>>Signed-off-by: David Hildenbrand <david@redhat.com>
> >>>>---
> >>>>    virt/kvm/kvm_main.c | 10 ++++++++--
> >>>>    1 file changed, 8 insertions(+), 2 deletions(-)
> >>>>
> >>>>diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> >>>>index e9eb666eb6e8..9d18cc67d124 100644
> >>>>--- a/virt/kvm/kvm_main.c
> >>>>+++ b/virt/kvm/kvm_main.c
> >>>>@@ -151,9 +151,15 @@ __weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> >>>>
> >>>>    bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
> >>>>    {
> >>>>-       if (pfn_valid(pfn))
> >>>>-               return PageReserved(pfn_to_page(pfn));
> >>>>+       struct page *page = pfn_to_online_page(pfn);
> >>>>
> >>>>+       /*
> >>>>+        * We treat any pages that are not online (not managed by the buddy)
> >>>>+        * as reserved - this includes ZONE_DEVICE pages and pages without
> >>>>+        * a memmap (e.g., mapped via /dev/mem).
> >>>>+        */
> >>>>+       if (page)
> >>>>+               return PageReserved(page);
> >>>>           return true;
> >>>>    }
> >>>
> >>>So after this all the pfn_valid() usage in kvm_main.c is replaced with
> >>>pfn_to_online_page()? Looks correct to me.
> >>>
> >>>However, I'm worried that kvm is taking reference on ZONE_DEVICE pages
> >>>through some other path resulting in this:
> >>>
> >>>       https://lore.kernel.org/linux-nvdimm/20190919154708.GA24650@angband.pl/
> >>>
> >>>I'll see if this patch set modulates or maintains that failure mode.
> >>>
> >>
> >>I'd assume that the behavior is unchanged. Ithink we get a reference to
> >>these ZONE_DEVICE pages via __get_user_pages_fast() and friends in
> >>hva_to_pfn_fast() and friends in virt/kvm/kvm_main.c
> >>
> >
> >I think I know what's going wrong:
> >
> >Pages that are pinned via gfn_to_pfn() and friends take a references,
> >however are often released via
> >kvm_release_pfn_clean()/kvm_release_pfn_dirty()/kvm_release_page_clean()...
> >
> >
> >E.g., in arch/x86/kvm/x86.c:reexecute_instruction()
> >
> >...
> >pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa));
> >...
> >kvm_release_pfn_clean(pfn);
> >
> >
> >
> >void kvm_release_pfn_clean(kvm_pfn_t pfn)
> >{
> >	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn))
> >		put_page(pfn_to_page(pfn));
> >}
> >
> >This function makes perfect sense as the counterpart for kvm_get_pfn():
> >
> >void kvm_get_pfn(kvm_pfn_t pfn)
> >{
> >	if (!kvm_is_reserved_pfn(pfn))
> >		get_page(pfn_to_page(pfn));
> >}
> >
> >
> >As all ZONE_DEVICE pages are currently reserved, pages pinned via
> >gfn_to_pfn() and friends will often not see a put_page() AFAIKS.

Assuming gup() takes a reference for ZONE_DEVICE pages, yes, this is a
KVM bug.

> >Now, my patch does not change that, the result of
> >kvm_is_reserved_pfn(pfn) will be unchanged. A proper fix for that would
> >probably be
> >
> >a) To drop the reference to ZONE_DEVICE pages in gfn_to_pfn() and
> >friends, after you successfully pinned the pages. (not sure if that's
> >the right thing to do but you're the expert)
> >
> >b) To not use kvm_release_pfn_clean() and friends on pages that were
> >definitely pinned.

This is already KVM's intent, i.e. the purpose of the PageReserved() check
is simply to avoid putting a non-existent reference.  The problem is that
KVM assumes pages with PG_reserved set are never pinned, which AFAICT was
true when the code was first added.

> (talking to myself, sorry)
> 
> Thinking again, dropping this patch from this series could effectively also
> fix that issue. E.g., kvm_release_pfn_clean() and friends would always do a
> put_page() if "pfn_valid() and !PageReserved()", so after patch 9 also on
> ZONDE_DEVICE pages.

Yeah, this appears to be the correct fix.

> But it would have side effects that might not be desired. E.g.,:
> 
> 1. kvm_pfn_to_page() would also return ZONE_DEVICE pages (might even be the
> right thing to do).

This should be ok, at least on x86.  There are only three users of
kvm_pfn_to_page().  Two of those are on allocations that are controlled by
KVM and are guaranteed to be vanilla MAP_ANONYMOUS.  The third is on guest
memory when running a nested guest, and in that case supporting ZONE_DEVICE
memory is desirable, i.e. KVM should play nice with a guest that is backed
by ZONE_DEVICE memory.

> 2. kvm_set_pfn_dirty() would also set ZONE_DEVICE pages dirty (might be
> okay)

This is ok from a KVM perspective.

The scarier code (for me) is transparent_hugepage_adjust() and
kvm_mmu_zap_collapsible_spte(), as I don't at all understand the
interaction between THP and _PAGE_DEVMAP.

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v1 09/10] mm/memory_hotplug: Don't mark pages PG_reserved when initializing the memmap
  2019-11-05 10:18     ` David Hildenbrand
@ 2019-11-05 16:06       ` Boris Ostrovsky
  0 siblings, 0 replies; 43+ messages in thread
From: Boris Ostrovsky @ 2019-11-05 16:06 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	kvm, Pavel Tatashin, KarimAllah Ahmed, Benjamin Herrenschmidt,
	Dave Hansen, Alexander Duyck, Michal Hocko, Paul Mackerras,
	linux-mm, Paul Mackerras, Michael Ellerman, H. Peter Anvin,
	Wanpeng Li, Alexander Duyck, Thomas Gleixner, Kees Cook, devel,
	Stefano Stabellini, Stephen Hemminger, Aneesh Kumar K.V,
	Joerg Roedel, x86, YueHaibing, Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Ingo Molnar, Vlastimil Babka,
	Anthony Yznaga, Oscar Salvador, Isaac J. Manjarres,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang, Sasha Levin,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Dan Williams, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Christophe Leroy, Mel Gorman, Cornelia Huck, Pavel Tatashin,
	Sean Christopherson, Johannes Weiner, Paolo Bonzini,
	Andrew Morton, linuxppc-dev

On 11/5/19 5:18 AM, David Hildenbrand wrote:
> On 04.11.19 23:44, Boris Ostrovsky wrote:
>> On 10/24/19 8:09 AM, David Hildenbrand wrote:
>>> diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
>>> index 4f2e78a5e4db..af69f057913a 100644
>>> --- a/drivers/xen/balloon.c
>>> +++ b/drivers/xen/balloon.c
>>> @@ -374,6 +374,13 @@ static void xen_online_page(struct page *page,
>>> unsigned int order)
>>>       mutex_lock(&balloon_mutex);
>>>       for (i = 0; i < size; i++) {
>>>           p = pfn_to_page(start_pfn + i);
>>> +        /*
>>> +         * TODO: The core used to mark the pages reserved. Most
>>> probably
>>> +         * we can stop doing that now. However, especially
>>> +         * alloc_xenballooned_pages() left PG_reserved set
>>> +         * on pages that can get mapped to user space.
>>> +         */
>>> +        __SetPageReserved(p);
>>
>> I suspect this is not needed. Pages can get into balloon either from
>> here or from non-hotplug path (e.g. decrease_reservation()) and so when
>> we get a page from the balloon we would get a random page that may or
>> may not have Reserved bit set.
>
> Yeah, I also think it is fine. If you agree, I'll drop this hunk and
> add details to the patch description.
>
>


Yes, let's do that. Thanks.

-boris
_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v1 03/10] KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes
  2019-11-05 16:00           ` Sean Christopherson
@ 2019-11-05 20:30             ` David Hildenbrand
  2019-11-05 22:22               ` Sean Christopherson
  2019-11-05 23:02               ` Dan Williams
  0 siblings, 2 replies; 43+ messages in thread
From: David Hildenbrand @ 2019-11-05 20:30 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	KVM list, Pavel Tatashin, KarimAllah Ahmed,
	Benjamin Herrenschmidt, Dave Hansen, Alexander Duyck,
	Michal Hocko, Paul Mackerras, Linux MM, Paul Mackerras,
	Michael Ellerman, H. Peter Anvin, Wanpeng Li, Alexander Duyck,
	Boris Ostrovsky, Kees Cook, devel, Stefano Stabellini,
	Stephen Hemminger, Aneesh Kumar K.V, Joerg Roedel, X86 ML,
	YueHaibing, Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Ingo Molnar, Vlastimil Babka,
	Anthony Yznaga, Oscar Salvador, Isaac J. Manjarres,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang, Sasha Levin,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Dan Williams, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Christophe Leroy, Mel Gorman, Adam Borowski, Cornelia Huck,
	Pavel Tatashin, Linux Kernel Mailing List, Thomas Gleixner,
	Johannes Weiner, Paolo Bonzini, Andrew Morton, linuxppc-dev

>>> I think I know what's going wrong:
>>>
>>> Pages that are pinned via gfn_to_pfn() and friends take a references,
>>> however are often released via
>>> kvm_release_pfn_clean()/kvm_release_pfn_dirty()/kvm_release_page_clean()...
>>>
>>>
>>> E.g., in arch/x86/kvm/x86.c:reexecute_instruction()
>>>
>>> ...
>>> pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa));
>>> ...
>>> kvm_release_pfn_clean(pfn);
>>>
>>>
>>>
>>> void kvm_release_pfn_clean(kvm_pfn_t pfn)
>>> {
>>> 	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn))
>>> 		put_page(pfn_to_page(pfn));
>>> }
>>>
>>> This function makes perfect sense as the counterpart for kvm_get_pfn():
>>>
>>> void kvm_get_pfn(kvm_pfn_t pfn)
>>> {
>>> 	if (!kvm_is_reserved_pfn(pfn))
>>> 		get_page(pfn_to_page(pfn));
>>> }
>>>
>>>
>>> As all ZONE_DEVICE pages are currently reserved, pages pinned via
>>> gfn_to_pfn() and friends will often not see a put_page() AFAIKS.
> 
> Assuming gup() takes a reference for ZONE_DEVICE pages, yes, this is a
> KVM bug.

Yes, it does take a reference AFAIKs. E.g.,

mm/gup.c:gup_pte_range():
...
		if (pte_devmap(pte)) {
			if (unlikely(flags & FOLL_LONGTERM))
				goto pte_unmap;

			pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
			if (unlikely(!pgmap)) {
				undo_dev_pagemap(nr, nr_start, pages);
				goto pte_unmap;
			}
		} else if (pte_special(pte))
			goto pte_unmap;

		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
		page = pte_page(pte);

		head = try_get_compound_head(page, 1);

try_get_compound_head() will increment the reference count.

> 
>>> Now, my patch does not change that, the result of
>>> kvm_is_reserved_pfn(pfn) will be unchanged. A proper fix for that would
>>> probably be
>>>
>>> a) To drop the reference to ZONE_DEVICE pages in gfn_to_pfn() and
>>> friends, after you successfully pinned the pages. (not sure if that's
>>> the right thing to do but you're the expert)
>>>
>>> b) To not use kvm_release_pfn_clean() and friends on pages that were
>>> definitely pinned.
> 
> This is already KVM's intent, i.e. the purpose of the PageReserved() check
> is simply to avoid putting a non-existent reference.  The problem is that
> KVM assumes pages with PG_reserved set are never pinned, which AFAICT was
> true when the code was first added.
> 
>> (talking to myself, sorry)
>>
>> Thinking again, dropping this patch from this series could effectively also
>> fix that issue. E.g., kvm_release_pfn_clean() and friends would always do a
>> put_page() if "pfn_valid() and !PageReserved()", so after patch 9 also on
>> ZONDE_DEVICE pages.
> 
> Yeah, this appears to be the correct fix.
> 
>> But it would have side effects that might not be desired. E.g.,:
>>
>> 1. kvm_pfn_to_page() would also return ZONE_DEVICE pages (might even be the
>> right thing to do).
> 
> This should be ok, at least on x86.  There are only three users of
> kvm_pfn_to_page().  Two of those are on allocations that are controlled by
> KVM and are guaranteed to be vanilla MAP_ANONYMOUS.  The third is on guest
> memory when running a nested guest, and in that case supporting ZONE_DEVICE
> memory is desirable, i.e. KVM should play nice with a guest that is backed
> by ZONE_DEVICE memory.
> 
>> 2. kvm_set_pfn_dirty() would also set ZONE_DEVICE pages dirty (might be
>> okay)
> 
> This is ok from a KVM perspective.

What about

void kvm_get_pfn(kvm_pfn_t pfn)
{
	if (!kvm_is_reserved_pfn(pfn))
		get_page(pfn_to_page(pfn));
}

Is a pure get_page() sufficient in case of ZONE_DEVICE?
(asking because of the live references obtained via 
get_dev_pagemap(pte_pfn(pte), pgmap) in mm/gup.c:gup_pte_range() 
somewhat confuse me :) )


> 
> The scarier code (for me) is transparent_hugepage_adjust() and
> kvm_mmu_zap_collapsible_spte(), as I don't at all understand the
> interaction between THP and _PAGE_DEVMAP.

The x86 KVM MMU code is one of the ugliest code I know (sorry, but it 
had to be said :/ ). Luckily, this should be independent of the 
PG_reserved thingy AFAIKs.

-- 

Thanks,

David / dhildenb

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v1 03/10] KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes
  2019-11-05 20:30             ` David Hildenbrand
@ 2019-11-05 22:22               ` Sean Christopherson
  2019-11-05 23:02               ` Dan Williams
  1 sibling, 0 replies; 43+ messages in thread
From: Sean Christopherson @ 2019-11-05 22:22 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	KVM list, Pavel Tatashin, KarimAllah Ahmed,
	Benjamin Herrenschmidt, Dave Hansen, Alexander Duyck,
	Michal Hocko, Paul Mackerras, Linux MM, Paul Mackerras,
	Michael Ellerman, H. Peter Anvin, Wanpeng Li, Alexander Duyck,
	Boris Ostrovsky, Kees Cook, devel, Stefano Stabellini,
	Stephen Hemminger, Aneesh Kumar K.V, Joerg Roedel, X86 ML,
	YueHaibing, Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Ingo Molnar, Vlastimil Babka,
	Anthony Yznaga, Oscar Salvador, Isaac J. Manjarres,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang, Sasha Levin,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Dan Williams, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Christophe Leroy, Mel Gorman, Adam Borowski, Cornelia Huck,
	Pavel Tatashin, Linux Kernel Mailing List, Thomas Gleixner,
	Johannes Weiner, Paolo Bonzini, Andrew Morton, linuxppc-dev

On Tue, Nov 05, 2019 at 09:30:53PM +0100, David Hildenbrand wrote:
> >>>I think I know what's going wrong:
> >>>
> >>>Pages that are pinned via gfn_to_pfn() and friends take a references,
> >>>however are often released via
> >>>kvm_release_pfn_clean()/kvm_release_pfn_dirty()/kvm_release_page_clean()...
> >>>
> >>>
> >>>E.g., in arch/x86/kvm/x86.c:reexecute_instruction()
> >>>
> >>>...
> >>>pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa));
> >>>...
> >>>kvm_release_pfn_clean(pfn);
> >>>
> >>>
> >>>
> >>>void kvm_release_pfn_clean(kvm_pfn_t pfn)
> >>>{
> >>>	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn))
> >>>		put_page(pfn_to_page(pfn));
> >>>}
> >>>
> >>>This function makes perfect sense as the counterpart for kvm_get_pfn():
> >>>
> >>>void kvm_get_pfn(kvm_pfn_t pfn)
> >>>{
> >>>	if (!kvm_is_reserved_pfn(pfn))
> >>>		get_page(pfn_to_page(pfn));
> >>>}
> >>>
> >>>
> >>>As all ZONE_DEVICE pages are currently reserved, pages pinned via
> >>>gfn_to_pfn() and friends will often not see a put_page() AFAIKS.
> >
> >Assuming gup() takes a reference for ZONE_DEVICE pages, yes, this is a
> >KVM bug.
> 
> Yes, it does take a reference AFAIKs. E.g.,
> 
> mm/gup.c:gup_pte_range():
> ...
> 		if (pte_devmap(pte)) {
> 			if (unlikely(flags & FOLL_LONGTERM))
> 				goto pte_unmap;
> 
> 			pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
> 			if (unlikely(!pgmap)) {
> 				undo_dev_pagemap(nr, nr_start, pages);
> 				goto pte_unmap;
> 			}
> 		} else if (pte_special(pte))
> 			goto pte_unmap;
> 
> 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
> 		page = pte_page(pte);
> 
> 		head = try_get_compound_head(page, 1);
> 
> try_get_compound_head() will increment the reference count.

Doh, I looked right at that code and somehow didn't connect the dots.
Thanks!

> >>>Now, my patch does not change that, the result of
> >>>kvm_is_reserved_pfn(pfn) will be unchanged. A proper fix for that would
> >>>probably be
> >>>
> >>>a) To drop the reference to ZONE_DEVICE pages in gfn_to_pfn() and
> >>>friends, after you successfully pinned the pages. (not sure if that's
> >>>the right thing to do but you're the expert)
> >>>
> >>>b) To not use kvm_release_pfn_clean() and friends on pages that were
> >>>definitely pinned.
> >
> >This is already KVM's intent, i.e. the purpose of the PageReserved() check
> >is simply to avoid putting a non-existent reference.  The problem is that
> >KVM assumes pages with PG_reserved set are never pinned, which AFAICT was
> >true when the code was first added.
> >
> >>(talking to myself, sorry)
> >>
> >>Thinking again, dropping this patch from this series could effectively also
> >>fix that issue. E.g., kvm_release_pfn_clean() and friends would always do a
> >>put_page() if "pfn_valid() and !PageReserved()", so after patch 9 also on
> >>ZONDE_DEVICE pages.
> >
> >Yeah, this appears to be the correct fix.
> >
> >>But it would have side effects that might not be desired. E.g.,:
> >>
> >>1. kvm_pfn_to_page() would also return ZONE_DEVICE pages (might even be the
> >>right thing to do).
> >
> >This should be ok, at least on x86.  There are only three users of
> >kvm_pfn_to_page().  Two of those are on allocations that are controlled by
> >KVM and are guaranteed to be vanilla MAP_ANONYMOUS.  The third is on guest
> >memory when running a nested guest, and in that case supporting ZONE_DEVICE
> >memory is desirable, i.e. KVM should play nice with a guest that is backed
> >by ZONE_DEVICE memory.
> >
> >>2. kvm_set_pfn_dirty() would also set ZONE_DEVICE pages dirty (might be
> >>okay)
> >
> >This is ok from a KVM perspective.
> 
> What about
> 
> void kvm_get_pfn(kvm_pfn_t pfn)
> {
> 	if (!kvm_is_reserved_pfn(pfn))
> 		get_page(pfn_to_page(pfn));
> }
> 
> Is a pure get_page() sufficient in case of ZONE_DEVICE?
> (asking because of the live references obtained via
> get_dev_pagemap(pte_pfn(pte), pgmap) in mm/gup.c:gup_pte_range() somewhat
> confuse me :) )

This ties into my concern with thp_adjust().  On x86, kvm_get_pfn() is
only used in two flows, to manually get a ref for VM_IO/VM_PFNMAP pages
and to switch the ref when mapping a non-hugetlbfs compound page, i.e. a
THP.

I assume VM_IO and PFNMAP can't apply to ZONE_DEVICE pages.

In the thp_adjust() case, when a THP is encountered and the original PFN
is for a non-PG_head page, KVM transfers the reference to the associated
PG_head page[*] and maps the associated 2mb chunk/page.  This is where KVM
uses kvm_get_pfn() and could run afoul of the get_dev_pagemap() refcounts.


[*] Technically I don't think it's guaranteed to be a PG_head, e.g. if the
    THP is a 1gb page, as KVM currently only maps THP as 2mb pages.  But
    the idea is the same, transfer the refcount the PFN that's actually
    going into KVM's page tables.

> >
> >The scarier code (for me) is transparent_hugepage_adjust() and
> >kvm_mmu_zap_collapsible_spte(), as I don't at all understand the
> >interaction between THP and _PAGE_DEVMAP.
> 
> The x86 KVM MMU code is one of the ugliest code I know (sorry, but it had to
> be said :/ ). Luckily, this should be independent of the PG_reserved thingy
> AFAIKs.

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v1 03/10] KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes
  2019-11-05 20:30             ` David Hildenbrand
  2019-11-05 22:22               ` Sean Christopherson
@ 2019-11-05 23:02               ` Dan Williams
  2019-11-05 23:13                 ` Sean Christopherson
  1 sibling, 1 reply; 43+ messages in thread
From: Dan Williams @ 2019-11-05 23:02 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	KVM list, Pavel Tatashin, KarimAllah Ahmed,
	Benjamin Herrenschmidt, Dave Hansen, Alexander Duyck,
	Michal Hocko, Paul Mackerras, Linux MM, Paul Mackerras,
	Michael Ellerman, H. Peter Anvin, Wanpeng Li, Alexander Duyck,
	Thomas Gleixner, Kees Cook, devel, Stefano Stabellini,
	Stephen Hemminger, Aneesh Kumar K.V, Joerg Roedel, X86 ML,
	YueHaibing, Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Ingo Molnar, Vlastimil Babka,
	Anthony Yznaga, Oscar Salvador, Isaac J. Manjarres,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang, Sasha Levin,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Christophe Leroy, Mel Gorman, Adam Borowski, Cornelia Huck,
	Pavel Tatashin, Linux Kernel Mailing List, Sean Christopherson,
	Johannes Weiner, Paolo Bonzini, Andrew Morton, linuxppc-dev

On Tue, Nov 5, 2019 at 12:31 PM David Hildenbrand <david@redhat.com> wrote:
>
> >>> I think I know what's going wrong:
> >>>
> >>> Pages that are pinned via gfn_to_pfn() and friends take a references,
> >>> however are often released via
> >>> kvm_release_pfn_clean()/kvm_release_pfn_dirty()/kvm_release_page_clean()...
> >>>
> >>>
> >>> E.g., in arch/x86/kvm/x86.c:reexecute_instruction()
> >>>
> >>> ...
> >>> pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa));
> >>> ...
> >>> kvm_release_pfn_clean(pfn);
> >>>
> >>>
> >>>
> >>> void kvm_release_pfn_clean(kvm_pfn_t pfn)
> >>> {
> >>>     if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn))
> >>>             put_page(pfn_to_page(pfn));
> >>> }
> >>>
> >>> This function makes perfect sense as the counterpart for kvm_get_pfn():
> >>>
> >>> void kvm_get_pfn(kvm_pfn_t pfn)
> >>> {
> >>>     if (!kvm_is_reserved_pfn(pfn))
> >>>             get_page(pfn_to_page(pfn));
> >>> }
> >>>
> >>>
> >>> As all ZONE_DEVICE pages are currently reserved, pages pinned via
> >>> gfn_to_pfn() and friends will often not see a put_page() AFAIKS.
> >
> > Assuming gup() takes a reference for ZONE_DEVICE pages, yes, this is a
> > KVM bug.
>
> Yes, it does take a reference AFAIKs. E.g.,
>
> mm/gup.c:gup_pte_range():
> ...
>                 if (pte_devmap(pte)) {
>                         if (unlikely(flags & FOLL_LONGTERM))
>                                 goto pte_unmap;
>
>                         pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
>                         if (unlikely(!pgmap)) {
>                                 undo_dev_pagemap(nr, nr_start, pages);
>                                 goto pte_unmap;
>                         }
>                 } else if (pte_special(pte))
>                         goto pte_unmap;
>
>                 VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
>                 page = pte_page(pte);
>
>                 head = try_get_compound_head(page, 1);
>
> try_get_compound_head() will increment the reference count.
>
> >
> >>> Now, my patch does not change that, the result of
> >>> kvm_is_reserved_pfn(pfn) will be unchanged. A proper fix for that would
> >>> probably be
> >>>
> >>> a) To drop the reference to ZONE_DEVICE pages in gfn_to_pfn() and
> >>> friends, after you successfully pinned the pages. (not sure if that's
> >>> the right thing to do but you're the expert)
> >>>
> >>> b) To not use kvm_release_pfn_clean() and friends on pages that were
> >>> definitely pinned.
> >
> > This is already KVM's intent, i.e. the purpose of the PageReserved() check
> > is simply to avoid putting a non-existent reference.  The problem is that
> > KVM assumes pages with PG_reserved set are never pinned, which AFAICT was
> > true when the code was first added.
> >
> >> (talking to myself, sorry)
> >>
> >> Thinking again, dropping this patch from this series could effectively also
> >> fix that issue. E.g., kvm_release_pfn_clean() and friends would always do a
> >> put_page() if "pfn_valid() and !PageReserved()", so after patch 9 also on
> >> ZONDE_DEVICE pages.
> >
> > Yeah, this appears to be the correct fix.
> >
> >> But it would have side effects that might not be desired. E.g.,:
> >>
> >> 1. kvm_pfn_to_page() would also return ZONE_DEVICE pages (might even be the
> >> right thing to do).
> >
> > This should be ok, at least on x86.  There are only three users of
> > kvm_pfn_to_page().  Two of those are on allocations that are controlled by
> > KVM and are guaranteed to be vanilla MAP_ANONYMOUS.  The third is on guest
> > memory when running a nested guest, and in that case supporting ZONE_DEVICE
> > memory is desirable, i.e. KVM should play nice with a guest that is backed
> > by ZONE_DEVICE memory.
> >
> >> 2. kvm_set_pfn_dirty() would also set ZONE_DEVICE pages dirty (might be
> >> okay)
> >
> > This is ok from a KVM perspective.
>
> What about
>
> void kvm_get_pfn(kvm_pfn_t pfn)
> {
>         if (!kvm_is_reserved_pfn(pfn))
>                 get_page(pfn_to_page(pfn));
> }
>
> Is a pure get_page() sufficient in case of ZONE_DEVICE?
> (asking because of the live references obtained via
> get_dev_pagemap(pte_pfn(pte), pgmap) in mm/gup.c:gup_pte_range()
> somewhat confuse me :) )

It is not sufficient. PTE_DEVMAP is there to tell the gup path "be
careful, this pfn has device-lifetime, make sure the device is pinned
and not actively in the process of dying before pinning any pages
associated with this device".

However, if kvm_get_pfn() is honoring kvm_is_reserved_pfn() that
returns true for ZONE_DEVICE, I'm missing how it is messing up the
reference counts.

> > The scarier code (for me) is transparent_hugepage_adjust() and
> > kvm_mmu_zap_collapsible_spte(), as I don't at all understand the
> > interaction between THP and _PAGE_DEVMAP.
>
> The x86 KVM MMU code is one of the ugliest code I know (sorry, but it
> had to be said :/ ). Luckily, this should be independent of the
> PG_reserved thingy AFAIKs.

Both transparent_hugepage_adjust() and kvm_mmu_zap_collapsible_spte()
are honoring kvm_is_reserved_pfn(), so again I'm missing where the
page count gets mismanaged and leads to the reported hang.
_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v1 03/10] KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes
  2019-11-05 23:02               ` Dan Williams
@ 2019-11-05 23:13                 ` Sean Christopherson
  2019-11-05 23:30                   ` Dan Williams
  0 siblings, 1 reply; 43+ messages in thread
From: Sean Christopherson @ 2019-11-05 23:13 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	KVM list, David Hildenbrand, KarimAllah Ahmed,
	Benjamin Herrenschmidt, Dave Hansen, Alexander Duyck,
	Michal Hocko, Paul Mackerras, Linux MM, Pavel Tatashin,
	Paul Mackerras, Michael Ellerman, H. Peter Anvin, Wanpeng Li,
	Alexander Duyck, Thomas Gleixner, Kees Cook, devel,
	Stefano Stabellini, Stephen Hemminger, Aneesh Kumar K.V,
	Joerg Roedel, X86 ML, YueHaibing, Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Ingo Molnar, Vlastimil Babka,
	Anthony Yznaga, Oscar Salvador, Isaac J. Manjarres,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang, Sasha Levin,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Christophe Leroy, Mel Gorman, Adam Borowski, Cornelia Huck,
	Pavel Tatashin, Linux Kernel Mailing List, Johannes Weiner,
	Paolo Bonzini, Andrew Morton, linuxppc-dev

On Tue, Nov 05, 2019 at 03:02:40PM -0800, Dan Williams wrote:
> On Tue, Nov 5, 2019 at 12:31 PM David Hildenbrand <david@redhat.com> wrote:
> > > The scarier code (for me) is transparent_hugepage_adjust() and
> > > kvm_mmu_zap_collapsible_spte(), as I don't at all understand the
> > > interaction between THP and _PAGE_DEVMAP.
> >
> > The x86 KVM MMU code is one of the ugliest code I know (sorry, but it
> > had to be said :/ ). Luckily, this should be independent of the
> > PG_reserved thingy AFAIKs.
> 
> Both transparent_hugepage_adjust() and kvm_mmu_zap_collapsible_spte()
> are honoring kvm_is_reserved_pfn(), so again I'm missing where the
> page count gets mismanaged and leads to the reported hang.

When mapping pages into the guest, KVM gets the page via gup(), which
increments the page count for ZONE_DEVICE pages.  But KVM puts the page
using kvm_release_pfn_clean(), which skips put_page() if PageReserved()
and so never puts its reference to ZONE_DEVICE pages.

My transparent_hugepage_adjust() and kvm_mmu_zap_collapsible_spte()
comments were for a post-patch/series scenario wheren PageReserved() is
no longer true for ZONE_DEVICE pages.
_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v1 03/10] KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes
  2019-11-05 23:13                 ` Sean Christopherson
@ 2019-11-05 23:30                   ` Dan Williams
  2019-11-05 23:42                     ` Sean Christopherson
  2019-11-05 23:43                     ` Dan Williams
  0 siblings, 2 replies; 43+ messages in thread
From: Dan Williams @ 2019-11-05 23:30 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	KVM list, David Hildenbrand, KarimAllah Ahmed,
	Benjamin Herrenschmidt, Dave Hansen, Alexander Duyck,
	Michal Hocko, Paul Mackerras, Linux MM, Pavel Tatashin,
	Paul Mackerras, Michael Ellerman, H. Peter Anvin, Wanpeng Li,
	Alexander Duyck, Thomas Gleixner, Kees Cook, devel,
	Stefano Stabellini, Stephen Hemminger, Aneesh Kumar K.V,
	Joerg Roedel, X86 ML, YueHaibing, Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Ingo Molnar, Vlastimil Babka,
	Anthony Yznaga, Oscar Salvador, Isaac J. Manjarres,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang, Sasha Levin,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Christophe Leroy, Mel Gorman, Adam Borowski, Cornelia Huck,
	Pavel Tatashin, Linux Kernel Mailing List, Johannes Weiner,
	Paolo Bonzini, Andrew Morton, linuxppc-dev

On Tue, Nov 5, 2019 at 3:13 PM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Tue, Nov 05, 2019 at 03:02:40PM -0800, Dan Williams wrote:
> > On Tue, Nov 5, 2019 at 12:31 PM David Hildenbrand <david@redhat.com> wrote:
> > > > The scarier code (for me) is transparent_hugepage_adjust() and
> > > > kvm_mmu_zap_collapsible_spte(), as I don't at all understand the
> > > > interaction between THP and _PAGE_DEVMAP.
> > >
> > > The x86 KVM MMU code is one of the ugliest code I know (sorry, but it
> > > had to be said :/ ). Luckily, this should be independent of the
> > > PG_reserved thingy AFAIKs.
> >
> > Both transparent_hugepage_adjust() and kvm_mmu_zap_collapsible_spte()
> > are honoring kvm_is_reserved_pfn(), so again I'm missing where the
> > page count gets mismanaged and leads to the reported hang.
>
> When mapping pages into the guest, KVM gets the page via gup(), which
> increments the page count for ZONE_DEVICE pages.  But KVM puts the page
> using kvm_release_pfn_clean(), which skips put_page() if PageReserved()
> and so never puts its reference to ZONE_DEVICE pages.

Oh, yeah, that's busted.

> My transparent_hugepage_adjust() and kvm_mmu_zap_collapsible_spte()
> comments were for a post-patch/series scenario wheren PageReserved() is
> no longer true for ZONE_DEVICE pages.

Ah, ok, for that David is preserving kvm_is_reserved_pfn() returning
true for ZONE_DEVICE because pfn_to_online_page() will fail for
ZONE_DEVICE.
_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v1 03/10] KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes
  2019-11-05 23:30                   ` Dan Williams
@ 2019-11-05 23:42                     ` Sean Christopherson
  2019-11-05 23:43                     ` Dan Williams
  1 sibling, 0 replies; 43+ messages in thread
From: Sean Christopherson @ 2019-11-05 23:42 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	KVM list, David Hildenbrand, KarimAllah Ahmed,
	Benjamin Herrenschmidt, Dave Hansen, Alexander Duyck,
	Michal Hocko, Paul Mackerras, Linux MM, Pavel Tatashin,
	Paul Mackerras, Michael Ellerman, H. Peter Anvin, Wanpeng Li,
	Alexander Duyck, Thomas Gleixner, Kees Cook, devel,
	Stefano Stabellini, Stephen Hemminger, Aneesh Kumar K.V,
	Joerg Roedel, X86 ML, YueHaibing, Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Ingo Molnar, Vlastimil Babka,
	Anthony Yznaga, Oscar Salvador, Isaac J. Manjarres,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang, Sasha Levin,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Christophe Leroy, Mel Gorman, Adam Borowski, Cornelia Huck,
	Pavel Tatashin, Linux Kernel Mailing List, Johannes Weiner,
	Paolo Bonzini, Andrew Morton, linuxppc-dev

On Tue, Nov 05, 2019 at 03:30:00PM -0800, Dan Williams wrote:
> On Tue, Nov 5, 2019 at 3:13 PM Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
> >
> > On Tue, Nov 05, 2019 at 03:02:40PM -0800, Dan Williams wrote:
> > > On Tue, Nov 5, 2019 at 12:31 PM David Hildenbrand <david@redhat.com> wrote:
> > > > > The scarier code (for me) is transparent_hugepage_adjust() and
> > > > > kvm_mmu_zap_collapsible_spte(), as I don't at all understand the
> > > > > interaction between THP and _PAGE_DEVMAP.
> > > >
> > > > The x86 KVM MMU code is one of the ugliest code I know (sorry, but it
> > > > had to be said :/ ). Luckily, this should be independent of the
> > > > PG_reserved thingy AFAIKs.
> > >
> > > Both transparent_hugepage_adjust() and kvm_mmu_zap_collapsible_spte()
> > > are honoring kvm_is_reserved_pfn(), so again I'm missing where the
> > > page count gets mismanaged and leads to the reported hang.
> >
> > When mapping pages into the guest, KVM gets the page via gup(), which
> > increments the page count for ZONE_DEVICE pages.  But KVM puts the page
> > using kvm_release_pfn_clean(), which skips put_page() if PageReserved()
> > and so never puts its reference to ZONE_DEVICE pages.
> 
> Oh, yeah, that's busted.
> 
> > My transparent_hugepage_adjust() and kvm_mmu_zap_collapsible_spte()
> > comments were for a post-patch/series scenario wheren PageReserved() is
> > no longer true for ZONE_DEVICE pages.
> 
> Ah, ok, for that David is preserving kvm_is_reserved_pfn() returning
> true for ZONE_DEVICE because pfn_to_online_page() will fail for
> ZONE_DEVICE.

But David's proposed fix for the above refcount bug is to omit the patch
so that KVM no longer treats ZONE_DEVICE pages as reserved.  That seems
like the right thing to do, including for thp_adjust(), e.g. it would
naturally let KVM use 2mb pages for the guest when a ZONE_DEVICE page is
mapped with a huge page (2mb or above) in the host.  The only hiccup is
figuring out how to correctly transfer the reference.
_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v1 03/10] KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes
  2019-11-05 23:30                   ` Dan Williams
  2019-11-05 23:42                     ` Sean Christopherson
@ 2019-11-05 23:43                     ` Dan Williams
  2019-11-06  0:03                       ` Sean Christopherson
  1 sibling, 1 reply; 43+ messages in thread
From: Dan Williams @ 2019-11-05 23:43 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	KVM list, David Hildenbrand, KarimAllah Ahmed,
	Benjamin Herrenschmidt, Dave Hansen, Alexander Duyck,
	Michal Hocko, Paul Mackerras, Linux MM, Pavel Tatashin,
	Paul Mackerras, Michael Ellerman, H. Peter Anvin, Wanpeng Li,
	Alexander Duyck, Thomas Gleixner, Kees Cook, devel,
	Stefano Stabellini, Stephen Hemminger, Aneesh Kumar K.V,
	Joerg Roedel, X86 ML, YueHaibing, Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Ingo Molnar, Vlastimil Babka,
	Anthony Yznaga, Oscar Salvador, Isaac J. Manjarres,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang, Sasha Levin,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Christophe Leroy, Mel Gorman, Adam Borowski, Cornelia Huck,
	Pavel Tatashin, Linux Kernel Mailing List, Johannes Weiner,
	Paolo Bonzini, Andrew Morton, linuxppc-dev

On Tue, Nov 5, 2019 at 3:30 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Tue, Nov 5, 2019 at 3:13 PM Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
> >
> > On Tue, Nov 05, 2019 at 03:02:40PM -0800, Dan Williams wrote:
> > > On Tue, Nov 5, 2019 at 12:31 PM David Hildenbrand <david@redhat.com> wrote:
> > > > > The scarier code (for me) is transparent_hugepage_adjust() and
> > > > > kvm_mmu_zap_collapsible_spte(), as I don't at all understand the
> > > > > interaction between THP and _PAGE_DEVMAP.
> > > >
> > > > The x86 KVM MMU code is one of the ugliest code I know (sorry, but it
> > > > had to be said :/ ). Luckily, this should be independent of the
> > > > PG_reserved thingy AFAIKs.
> > >
> > > Both transparent_hugepage_adjust() and kvm_mmu_zap_collapsible_spte()
> > > are honoring kvm_is_reserved_pfn(), so again I'm missing where the
> > > page count gets mismanaged and leads to the reported hang.
> >
> > When mapping pages into the guest, KVM gets the page via gup(), which
> > increments the page count for ZONE_DEVICE pages.  But KVM puts the page
> > using kvm_release_pfn_clean(), which skips put_page() if PageReserved()
> > and so never puts its reference to ZONE_DEVICE pages.
>
> Oh, yeah, that's busted.

Ugh, it's extra busted because every other gup user in the kernel
tracks the pages resulting from gup and puts them (put_page()) when
they are done. KVM wants to forget about whether it did a gup to get
the page and optionally trigger put_page() based purely on the pfn.
Outside of VFIO device assignment that needs pages pinned for DMA, why
does KVM itself need to pin pages? If pages are pinned over a return
to userspace that needs to be a FOLL_LONGTERM gup.
_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v1 03/10] KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes
  2019-11-05 23:43                     ` Dan Williams
@ 2019-11-06  0:03                       ` Sean Christopherson
  2019-11-06  0:08                         ` Dan Williams
  0 siblings, 1 reply; 43+ messages in thread
From: Sean Christopherson @ 2019-11-06  0:03 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	KVM list, David Hildenbrand, KarimAllah Ahmed,
	Benjamin Herrenschmidt, Dave Hansen, Alexander Duyck,
	Michal Hocko, Paul Mackerras, Linux MM, Pavel Tatashin,
	Paul Mackerras, Michael Ellerman, H. Peter Anvin, Wanpeng Li,
	Alexander Duyck, Thomas Gleixner, Kees Cook, devel,
	Stefano Stabellini, Stephen Hemminger, Aneesh Kumar K.V,
	Joerg Roedel, X86 ML, YueHaibing, Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Ingo Molnar, Vlastimil Babka,
	Anthony Yznaga, Oscar Salvador, Isaac J. Manjarres,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang, Sasha Levin,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Christophe Leroy, Mel Gorman, Adam Borowski, Cornelia Huck,
	Pavel Tatashin, Linux Kernel Mailing List, Johannes Weiner,
	Paolo Bonzini, Andrew Morton, linuxppc-dev

On Tue, Nov 05, 2019 at 03:43:29PM -0800, Dan Williams wrote:
> On Tue, Nov 5, 2019 at 3:30 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > On Tue, Nov 5, 2019 at 3:13 PM Sean Christopherson
> > <sean.j.christopherson@intel.com> wrote:
> > >
> > > On Tue, Nov 05, 2019 at 03:02:40PM -0800, Dan Williams wrote:
> > > > On Tue, Nov 5, 2019 at 12:31 PM David Hildenbrand <david@redhat.com> wrote:
> > > > > > The scarier code (for me) is transparent_hugepage_adjust() and
> > > > > > kvm_mmu_zap_collapsible_spte(), as I don't at all understand the
> > > > > > interaction between THP and _PAGE_DEVMAP.
> > > > >
> > > > > The x86 KVM MMU code is one of the ugliest code I know (sorry, but it
> > > > > had to be said :/ ). Luckily, this should be independent of the
> > > > > PG_reserved thingy AFAIKs.
> > > >
> > > > Both transparent_hugepage_adjust() and kvm_mmu_zap_collapsible_spte()
> > > > are honoring kvm_is_reserved_pfn(), so again I'm missing where the
> > > > page count gets mismanaged and leads to the reported hang.
> > >
> > > When mapping pages into the guest, KVM gets the page via gup(), which
> > > increments the page count for ZONE_DEVICE pages.  But KVM puts the page
> > > using kvm_release_pfn_clean(), which skips put_page() if PageReserved()
> > > and so never puts its reference to ZONE_DEVICE pages.
> >
> > Oh, yeah, that's busted.
> 
> Ugh, it's extra busted because every other gup user in the kernel
> tracks the pages resulting from gup and puts them (put_page()) when
> they are done. KVM wants to forget about whether it did a gup to get
> the page and optionally trigger put_page() based purely on the pfn.
> Outside of VFIO device assignment that needs pages pinned for DMA, why
> does KVM itself need to pin pages? If pages are pinned over a return
> to userspace that needs to be a FOLL_LONGTERM gup.

Short answer, KVM pins the page to ensure correctness with respect to the
primary MMU invalidating the associated host virtual address, e.g. when
the page is being migrated or unmapped from host userspace.

The main use of gup() is to handle guest page faults and map pages into
the guest, i.e. into KVM's secondary MMU.  KVM uses gup() to both get the
PFN and to temporarily pin the page.  The pin is held just long enough to
guaranteed that any invalidation via the mmu_notifier will be stalled
until after KVM finishes installing the page into the secondary MMU, i.e.
the pin is short-term and not held across a return to userspace or entry
into the guest.  When a subsequent mmu_notifier invalidation occurs, KVM
pulls the PFN from the secondary MMU and uses that to update accessed
and dirty bits in the host.

There are a few other KVM flows that eventually call into gup(), but those
are "traditional" short-term pins and use put_page() directly.
_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v1 03/10] KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes
  2019-11-06  0:03                       ` Sean Christopherson
@ 2019-11-06  0:08                         ` Dan Williams
  2019-11-06  6:56                           ` David Hildenbrand
  0 siblings, 1 reply; 43+ messages in thread
From: Dan Williams @ 2019-11-06  0:08 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	KVM list, David Hildenbrand, KarimAllah Ahmed,
	Benjamin Herrenschmidt, Dave Hansen, Alexander Duyck,
	Michal Hocko, Paul Mackerras, Linux MM, Pavel Tatashin,
	Paul Mackerras, Michael Ellerman, H. Peter Anvin, Wanpeng Li,
	Alexander Duyck, Thomas Gleixner, Kees Cook, devel,
	Stefano Stabellini, Stephen Hemminger, Aneesh Kumar K.V,
	Joerg Roedel, X86 ML, YueHaibing, Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Ingo Molnar, Vlastimil Babka,
	Anthony Yznaga, Oscar Salvador, Isaac J. Manjarres,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang, Sasha Levin,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Christophe Leroy, Mel Gorman, Adam Borowski, Cornelia Huck,
	Pavel Tatashin, Linux Kernel Mailing List, Johannes Weiner,
	Paolo Bonzini, Andrew Morton, linuxppc-dev

On Tue, Nov 5, 2019 at 4:03 PM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Tue, Nov 05, 2019 at 03:43:29PM -0800, Dan Williams wrote:
> > On Tue, Nov 5, 2019 at 3:30 PM Dan Williams <dan.j.williams@intel.com> wrote:
> > >
> > > On Tue, Nov 5, 2019 at 3:13 PM Sean Christopherson
> > > <sean.j.christopherson@intel.com> wrote:
> > > >
> > > > On Tue, Nov 05, 2019 at 03:02:40PM -0800, Dan Williams wrote:
> > > > > On Tue, Nov 5, 2019 at 12:31 PM David Hildenbrand <david@redhat.com> wrote:
> > > > > > > The scarier code (for me) is transparent_hugepage_adjust() and
> > > > > > > kvm_mmu_zap_collapsible_spte(), as I don't at all understand the
> > > > > > > interaction between THP and _PAGE_DEVMAP.
> > > > > >
> > > > > > The x86 KVM MMU code is one of the ugliest code I know (sorry, but it
> > > > > > had to be said :/ ). Luckily, this should be independent of the
> > > > > > PG_reserved thingy AFAIKs.
> > > > >
> > > > > Both transparent_hugepage_adjust() and kvm_mmu_zap_collapsible_spte()
> > > > > are honoring kvm_is_reserved_pfn(), so again I'm missing where the
> > > > > page count gets mismanaged and leads to the reported hang.
> > > >
> > > > When mapping pages into the guest, KVM gets the page via gup(), which
> > > > increments the page count for ZONE_DEVICE pages.  But KVM puts the page
> > > > using kvm_release_pfn_clean(), which skips put_page() if PageReserved()
> > > > and so never puts its reference to ZONE_DEVICE pages.
> > >
> > > Oh, yeah, that's busted.
> >
> > Ugh, it's extra busted because every other gup user in the kernel
> > tracks the pages resulting from gup and puts them (put_page()) when
> > they are done. KVM wants to forget about whether it did a gup to get
> > the page and optionally trigger put_page() based purely on the pfn.
> > Outside of VFIO device assignment that needs pages pinned for DMA, why
> > does KVM itself need to pin pages? If pages are pinned over a return
> > to userspace that needs to be a FOLL_LONGTERM gup.
>
> Short answer, KVM pins the page to ensure correctness with respect to the
> primary MMU invalidating the associated host virtual address, e.g. when
> the page is being migrated or unmapped from host userspace.
>
> The main use of gup() is to handle guest page faults and map pages into
> the guest, i.e. into KVM's secondary MMU.  KVM uses gup() to both get the
> PFN and to temporarily pin the page.  The pin is held just long enough to
> guaranteed that any invalidation via the mmu_notifier will be stalled
> until after KVM finishes installing the page into the secondary MMU, i.e.
> the pin is short-term and not held across a return to userspace or entry
> into the guest.  When a subsequent mmu_notifier invalidation occurs, KVM
> pulls the PFN from the secondary MMU and uses that to update accessed
> and dirty bits in the host.
>
> There are a few other KVM flows that eventually call into gup(), but those
> are "traditional" short-term pins and use put_page() directly.

Ok, I was misinterpreting the effect of the bug with what KVM is using
the reference to do.

To your other point:

> But David's proposed fix for the above refcount bug is to omit the patch
> so that KVM no longer treats ZONE_DEVICE pages as reserved.  That seems
> like the right thing to do, including for thp_adjust(), e.g. it would
> naturally let KVM use 2mb pages for the guest when a ZONE_DEVICE page is
> mapped with a huge page (2mb or above) in the host.  The only hiccup is
> figuring out how to correctly transfer the reference.

That might not be the only hiccup. There's currently no such thing as
huge pages for ZONE_DEVICE, there are huge *mappings* (pmd and pud),
but the result of pfn_to_page() on such a mapping does not yield a
huge 'struct page'. It seems there are other paths in KVM that assume
that more typical page machinery is active like SetPageDirty() based
on kvm_is_reserved_pfn(). While I told David that I did not want to
see more usage of is_zone_device_page(), this patch below (untested)
seems a cleaner path with less surprises:

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4df0aa6b8e5c..fbea17c1810c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1831,7 +1831,8 @@ EXPORT_SYMBOL_GPL(kvm_release_page_clean);

 void kvm_release_pfn_clean(kvm_pfn_t pfn)
 {
-       if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn))
+       if ((!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn)) ||
+           (pfn_valid(pfn) && is_zone_device_page(pfn_to_page(pfn))))
                put_page(pfn_to_page(pfn));
 }
 EXPORT_SYMBOL_GPL(kvm_release_pfn_clean);

This is safe because the reference that KVM took earlier protects the
is_zone_device_page() lookup from racing device teardown. Otherwise,
if KVM does not have a reference it's unsafe, but that's already even
more broken because KVM would be releasing a page that it never
referenced. Every other KVM operation that assumes page allocator
pages would continue to honor kvm_is_reserved_pfn().
_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v1 03/10] KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes
  2019-11-06  0:08                         ` Dan Williams
@ 2019-11-06  6:56                           ` David Hildenbrand
  2019-11-06 16:09                             ` Sean Christopherson
  0 siblings, 1 reply; 43+ messages in thread
From: David Hildenbrand @ 2019-11-06  6:56 UTC (permalink / raw)
  To: Dan Williams, Sean Christopherson
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	KVM list, Pavel Tatashin, KarimAllah Ahmed,
	Benjamin Herrenschmidt, Dave Hansen, Alexander Duyck,
	Michal Hocko, Paul Mackerras, Linux MM, Paul Mackerras,
	Michael Ellerman, H. Peter Anvin, Wanpeng Li, Alexander Duyck,
	Thomas Gleixner, Kees Cook, devel, Stefano Stabellini,
	Stephen Hemminger, Aneesh Kumar K.V, Joerg Roedel, X86 ML,
	YueHaibing, Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Ingo Molnar, Vlastimil Babka,
	Anthony Yznaga, Oscar Salvador, Isaac J. Manjarres,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang, Sasha Levin,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Christophe Leroy, Mel Gorman, Adam Borowski, Cornelia Huck,
	Pavel Tatashin, Linux Kernel Mailing List, Johannes Weiner,
	Paolo Bonzini, Andrew Morton, linuxppc-dev

On 06.11.19 01:08, Dan Williams wrote:
> On Tue, Nov 5, 2019 at 4:03 PM Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
>>
>> On Tue, Nov 05, 2019 at 03:43:29PM -0800, Dan Williams wrote:
>>> On Tue, Nov 5, 2019 at 3:30 PM Dan Williams <dan.j.williams@intel.com> wrote:
>>>>
>>>> On Tue, Nov 5, 2019 at 3:13 PM Sean Christopherson
>>>> <sean.j.christopherson@intel.com> wrote:
>>>>>
>>>>> On Tue, Nov 05, 2019 at 03:02:40PM -0800, Dan Williams wrote:
>>>>>> On Tue, Nov 5, 2019 at 12:31 PM David Hildenbrand <david@redhat.com> wrote:
>>>>>>>> The scarier code (for me) is transparent_hugepage_adjust() and
>>>>>>>> kvm_mmu_zap_collapsible_spte(), as I don't at all understand the
>>>>>>>> interaction between THP and _PAGE_DEVMAP.
>>>>>>>
>>>>>>> The x86 KVM MMU code is one of the ugliest code I know (sorry, but it
>>>>>>> had to be said :/ ). Luckily, this should be independent of the
>>>>>>> PG_reserved thingy AFAIKs.
>>>>>>
>>>>>> Both transparent_hugepage_adjust() and kvm_mmu_zap_collapsible_spte()
>>>>>> are honoring kvm_is_reserved_pfn(), so again I'm missing where the
>>>>>> page count gets mismanaged and leads to the reported hang.
>>>>>
>>>>> When mapping pages into the guest, KVM gets the page via gup(), which
>>>>> increments the page count for ZONE_DEVICE pages.  But KVM puts the page
>>>>> using kvm_release_pfn_clean(), which skips put_page() if PageReserved()
>>>>> and so never puts its reference to ZONE_DEVICE pages.
>>>>
>>>> Oh, yeah, that's busted.
>>>
>>> Ugh, it's extra busted because every other gup user in the kernel
>>> tracks the pages resulting from gup and puts them (put_page()) when
>>> they are done. KVM wants to forget about whether it did a gup to get
>>> the page and optionally trigger put_page() based purely on the pfn.
>>> Outside of VFIO device assignment that needs pages pinned for DMA, why
>>> does KVM itself need to pin pages? If pages are pinned over a return
>>> to userspace that needs to be a FOLL_LONGTERM gup.
>>
>> Short answer, KVM pins the page to ensure correctness with respect to the
>> primary MMU invalidating the associated host virtual address, e.g. when
>> the page is being migrated or unmapped from host userspace.
>>
>> The main use of gup() is to handle guest page faults and map pages into
>> the guest, i.e. into KVM's secondary MMU.  KVM uses gup() to both get the
>> PFN and to temporarily pin the page.  The pin is held just long enough to
>> guaranteed that any invalidation via the mmu_notifier will be stalled
>> until after KVM finishes installing the page into the secondary MMU, i.e.
>> the pin is short-term and not held across a return to userspace or entry
>> into the guest.  When a subsequent mmu_notifier invalidation occurs, KVM
>> pulls the PFN from the secondary MMU and uses that to update accessed
>> and dirty bits in the host.
>>
>> There are a few other KVM flows that eventually call into gup(), but those
>> are "traditional" short-term pins and use put_page() directly.
> 
> Ok, I was misinterpreting the effect of the bug with what KVM is using
> the reference to do.
> 
> To your other point:
> 
>> But David's proposed fix for the above refcount bug is to omit the patch
>> so that KVM no longer treats ZONE_DEVICE pages as reserved.  That seems
>> like the right thing to do, including for thp_adjust(), e.g. it would
>> naturally let KVM use 2mb pages for the guest when a ZONE_DEVICE page is
>> mapped with a huge page (2mb or above) in the host.  The only hiccup is
>> figuring out how to correctly transfer the reference.
> 
> That might not be the only hiccup. There's currently no such thing as
> huge pages for ZONE_DEVICE, there are huge *mappings* (pmd and pud),
> but the result of pfn_to_page() on such a mapping does not yield a
> huge 'struct page'. It seems there are other paths in KVM that assume
> that more typical page machinery is active like SetPageDirty() based
> on kvm_is_reserved_pfn(). While I told David that I did not want to
> see more usage of is_zone_device_page(), this patch below (untested)
> seems a cleaner path with less surprises:
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 4df0aa6b8e5c..fbea17c1810c 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1831,7 +1831,8 @@ EXPORT_SYMBOL_GPL(kvm_release_page_clean);
> 
>   void kvm_release_pfn_clean(kvm_pfn_t pfn)
>   {
> -       if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn))
> +       if ((!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn)) ||
> +           (pfn_valid(pfn) && is_zone_device_page(pfn_to_page(pfn))))
>                  put_page(pfn_to_page(pfn));
>   }
>   EXPORT_SYMBOL_GPL(kvm_release_pfn_clean);

I had the same thought, but I do wonder about the kvm_get_pfn() users, 
e.g.,:

hva_to_pfn_remapped():
	r = follow_pfn(vma, addr, &pfn);
	...
	kvm_get_pfn(pfn);
	...

We would not take a reference for ZONE_DEVICE, but later drop one 
reference via kvm_release_pfn_clean(). IOW, kvm_get_pfn() gets *really* 
dangerous to use. I can't tell if this can happen right now.

We do have 3 users of kvm_get_pfn() that we have to audit before this 
change. Also, we should add a comment to kvm_get_pfn() that it should 
never be used with possible ZONE_DEVICE pages.

Also, we should add a comment to kvm_release_pfn_clean(), describing why 
we treat ZONE_DEVICE in a special way here.


We can then progress like this

1. Get this fix upstream, it's somewhat unrelated to this series.
2. This patch here remains as is in this series (+/- documentation update)
3. Long term, rework KVM to not have to not treat ZONE_DEVICE like 
reserved pages. E.g., get rid of kvm_get_pfn(). Then, this special zone 
check can go.

-- 

Thanks,

David / dhildenb

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v1 03/10] KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes
  2019-11-06  6:56                           ` David Hildenbrand
@ 2019-11-06 16:09                             ` Sean Christopherson
  0 siblings, 0 replies; 43+ messages in thread
From: Sean Christopherson @ 2019-11-06 16:09 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	KVM list, Pavel Tatashin, KarimAllah Ahmed,
	Benjamin Herrenschmidt, Dave Hansen, Alexander Duyck,
	Michal Hocko, Paul Mackerras, Linux MM, Paul Mackerras,
	Michael Ellerman, H. Peter Anvin, Wanpeng Li, Alexander Duyck,
	Boris Ostrovsky, Kees Cook, devel, Stefano Stabellini,
	Stephen Hemminger, Aneesh Kumar K.V, Joerg Roedel, X86 ML,
	YueHaibing, Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Ingo Molnar, Vlastimil Babka,
	Anthony Yznaga, Oscar Salvador, Isaac J. Manjarres,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang, Sasha Levin,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Dan Williams, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Christophe Leroy, Mel Gorman, Adam Borowski, Cornelia Huck,
	Pavel Tatashin, Linux Kernel Mailing List, Thomas Gleixner,
	Johannes Weiner, Paolo Bonzini, Andrew Morton, linuxppc-dev

On Wed, Nov 06, 2019 at 07:56:34AM +0100, David Hildenbrand wrote:
> On 06.11.19 01:08, Dan Williams wrote:
> >On Tue, Nov 5, 2019 at 4:03 PM Sean Christopherson
> >>But David's proposed fix for the above refcount bug is to omit the patch
> >>so that KVM no longer treats ZONE_DEVICE pages as reserved.  That seems
> >>like the right thing to do, including for thp_adjust(), e.g. it would
> >>naturally let KVM use 2mb pages for the guest when a ZONE_DEVICE page is
> >>mapped with a huge page (2mb or above) in the host.  The only hiccup is
> >>figuring out how to correctly transfer the reference.
> >
> >That might not be the only hiccup. There's currently no such thing as
> >huge pages for ZONE_DEVICE, there are huge *mappings* (pmd and pud),
> >but the result of pfn_to_page() on such a mapping does not yield a
> >huge 'struct page'. It seems there are other paths in KVM that assume
> >that more typical page machinery is active like SetPageDirty() based
> >on kvm_is_reserved_pfn(). While I told David that I did not want to
> >see more usage of is_zone_device_page(), this patch below (untested)
> >seems a cleaner path with less surprises:
> >
> >diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> >index 4df0aa6b8e5c..fbea17c1810c 100644
> >--- a/virt/kvm/kvm_main.c
> >+++ b/virt/kvm/kvm_main.c
> >@@ -1831,7 +1831,8 @@ EXPORT_SYMBOL_GPL(kvm_release_page_clean);
> >
> >  void kvm_release_pfn_clean(kvm_pfn_t pfn)
> >  {
> >-       if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn))
> >+       if ((!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn)) ||

The is_error_noslot_pfn() check shouldn't be overriden by zone_device.

> >+           (pfn_valid(pfn) && is_zone_device_page(pfn_to_page(pfn))))

But rather than special case kvm_release_pfn_clean(), I'd rather KVM
explicitly handle ZONE_DEVICE pages, there are other flows where KVM
really should be aware of ZONE_DEVICE pages, e.g. for sanity checks and
whatnot.  There are surprisingly few callers of kvm_is_reserved_pfn(), so
it's actually not too big of a change. 

> >                 put_page(pfn_to_page(pfn));
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_release_pfn_clean);
> 
> I had the same thought, but I do wonder about the kvm_get_pfn() users,
> e.g.,:
> 
> hva_to_pfn_remapped():
> 	r = follow_pfn(vma, addr, &pfn);
> 	...
> 	kvm_get_pfn(pfn);
> 	...
> 
> We would not take a reference for ZONE_DEVICE, but later drop one reference
> via kvm_release_pfn_clean(). IOW, kvm_get_pfn() gets *really* dangerous to
> use. I can't tell if this can happen right now.
> 
> We do have 3 users of kvm_get_pfn() that we have to audit before this
> change. Also, we should add a comment to kvm_get_pfn() that it should never
> be used with possible ZONE_DEVICE pages.
> 
> Also, we should add a comment to kvm_release_pfn_clean(), describing why we
> treat ZONE_DEVICE in a special way here.
> 
> 
> We can then progress like this
> 
> 1. Get this fix upstream, it's somewhat unrelated to this series.
> 2. This patch here remains as is in this series (+/- documentation update)
> 3. Long term, rework KVM to not have to not treat ZONE_DEVICE like reserved
> pages. E.g., get rid of kvm_get_pfn(). Then, this special zone check can go.

Dropping kvm_get_pfn() is less than ideal, and at this point unnecessary.
I'm 99% sure the existing call sites for kvm_get_pfn() can never be
reached with ZONE_DEVICE pages.  I think we can do:

  1. Get a fix upstream to have KVM stop treating ZONE_DEVICE pages as
     reserved PFNs, i.e. exempt them in kvm_is_reserved_pfn() and change
     the callers of kvm_is_reserved_pfn() to handle ZONE_DEVICE pages.
  2. Drop this patch from the series, and instead remove the special
     treatment of ZONE_DEVICE pages from kvm_is_reserved_pfn().

Give me a few minutes to prep a patch.

> 
> -- 
> 
> Thanks,
> 
> David / dhildenb
> 
_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v1 04/10] vfio/type1: Prepare is_invalid_reserved_pfn() for PG_reserved changes
  2019-10-24 12:09 ` [PATCH v1 04/10] vfio/type1: Prepare is_invalid_reserved_pfn() " David Hildenbrand
@ 2019-11-07 15:40   ` Dan Williams
  2019-11-07 18:22     ` David Hildenbrand
  0 siblings, 1 reply; 43+ messages in thread
From: Dan Williams @ 2019-11-07 15:40 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	KVM list, Pavel Tatashin, KarimAllah Ahmed,
	Benjamin Herrenschmidt, Dave Hansen, Alexander Duyck,
	Michal Hocko, Paul Mackerras, Linux MM, Paul Mackerras,
	Michael Ellerman, H. Peter Anvin, Wanpeng Li, Alexander Duyck,
	Thomas Gleixner, Kees Cook, devel, Stefano Stabellini,
	Stephen Hemminger, Aneesh Kumar K.V, Joerg Roedel, X86 ML,
	YueHaibing, Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Ingo Molnar, Vlastimil Babka,
	Anthony Yznaga, Oscar Salvador, Isaac J. Manjarres,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang, Sasha Levin,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Christophe Leroy, Mel Gorman, Cornelia Huck, Pavel Tatashin,
	Linux Kernel Mailing List, Sean Christopherson, Johannes Weiner,
	Paolo Bonzini, Andrew Morton, linuxppc-dev

On Thu, Oct 24, 2019 at 5:12 AM David Hildenbrand <david@redhat.com> wrote:
>
> Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
> change that.
>
> KVM has this weird use case that you can map anything from /dev/mem
> into the guest. pfn_valid() is not a reliable check whether the memmap
> was initialized and can be touched. pfn_to_online_page() makes sure
> that we have an initialized memmap (and don't have ZONE_DEVICE memory).
>
> Rewrite is_invalid_reserved_pfn() similar to kvm_is_reserved_pfn() to make
> sure the function produces the same result once we stop setting ZONE_DEVICE
> pages PG_reserved.
>
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Cornelia Huck <cohuck@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 2ada8e6cdb88..f8ce8c408ba8 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -299,9 +299,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
>   */
>  static bool is_invalid_reserved_pfn(unsigned long pfn)
>  {
> -       if (pfn_valid(pfn))
> -               return PageReserved(pfn_to_page(pfn));
> +       struct page *page = pfn_to_online_page(pfn);

Ugh, I just realized this is not a safe conversion until
pfn_to_online_page() is moved over to subsection granularity. As it
stands it will return true for any ZONE_DEVICE pages that share a
section with boot memory.
_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v1 04/10] vfio/type1: Prepare is_invalid_reserved_pfn() for PG_reserved changes
  2019-11-07 15:40   ` Dan Williams
@ 2019-11-07 18:22     ` David Hildenbrand
  2019-11-07 22:07       ` David Hildenbrand
  0 siblings, 1 reply; 43+ messages in thread
From: David Hildenbrand @ 2019-11-07 18:22 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	KVM list, David Hildenbrand, KarimAllah Ahmed,
	Benjamin Herrenschmidt, Dave Hansen, Alexander Duyck,
	Michal Hocko, Paul Mackerras, Linux MM, Pavel Tatashin,
	Paul Mackerras, Michael Ellerman, H. Peter Anvin, Wanpeng Li,
	Alexander Duyck, Thomas Gleixner, Kees Cook, devel,
	Stefano Stabellini, Stephen Hemminger, Aneesh Kumar K.V,
	Joerg Roedel, X86 ML, YueHaibing, Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Ingo Molnar, Vlastimil Babka,
	Anthony Yznaga, Oscar Salvador, Isaac J. Manjarres,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang, Sasha Levin,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Christophe Leroy, Mel Gorman, Cornelia Huck, Pavel Tatashin,
	Linux Kernel Mailing List, Sean Christopherson, Johannes Weiner,
	Paolo Bonzini, Andrew Morton, linuxppc-dev



> Am 07.11.2019 um 16:40 schrieb Dan Williams <dan.j.williams@intel.com>:
> 
> On Thu, Oct 24, 2019 at 5:12 AM David Hildenbrand <david@redhat.com> wrote:
>> 
>> Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
>> change that.
>> 
>> KVM has this weird use case that you can map anything from /dev/mem
>> into the guest. pfn_valid() is not a reliable check whether the memmap
>> was initialized and can be touched. pfn_to_online_page() makes sure
>> that we have an initialized memmap (and don't have ZONE_DEVICE memory).
>> 
>> Rewrite is_invalid_reserved_pfn() similar to kvm_is_reserved_pfn() to make
>> sure the function produces the same result once we stop setting ZONE_DEVICE
>> pages PG_reserved.
>> 
>> Cc: Alex Williamson <alex.williamson@redhat.com>
>> Cc: Cornelia Huck <cohuck@redhat.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>> drivers/vfio/vfio_iommu_type1.c | 10 ++++++++--
>> 1 file changed, 8 insertions(+), 2 deletions(-)
>> 
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index 2ada8e6cdb88..f8ce8c408ba8 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -299,9 +299,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
>>  */
>> static bool is_invalid_reserved_pfn(unsigned long pfn)
>> {
>> -       if (pfn_valid(pfn))
>> -               return PageReserved(pfn_to_page(pfn));
>> +       struct page *page = pfn_to_online_page(pfn);
> 
> Ugh, I just realized this is not a safe conversion until
> pfn_to_online_page() is moved over to subsection granularity. As it
> stands it will return true for any ZONE_DEVICE pages that share a
> section with boot memory.

That should not happen right now and I commented back when you introduced subsection support that I don’t want to have ZONE_DEVICE mixed with online pages in a section. Having memory block devices that partially span ZONE_DEVICE would be ... really weird. With something like pfn_active() - as discussed - we could at least make this check work - but I am not sure if we really want to go down that path. In the worst case, some MB of RAM are lost ... I guess this needs more thought.
_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v1 04/10] vfio/type1: Prepare is_invalid_reserved_pfn() for PG_reserved changes
  2019-11-07 18:22     ` David Hildenbrand
@ 2019-11-07 22:07       ` David Hildenbrand
  2019-11-08  5:09         ` Dan Williams
  0 siblings, 1 reply; 43+ messages in thread
From: David Hildenbrand @ 2019-11-07 22:07 UTC (permalink / raw)
  To: David Hildenbrand, Dan Williams
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	KVM list, Pavel Tatashin, KarimAllah Ahmed,
	Benjamin Herrenschmidt, Dave Hansen, Alexander Duyck,
	Michal Hocko, Paul Mackerras, Linux MM, Paul Mackerras,
	Michael Ellerman, H. Peter Anvin, Wanpeng Li, Alexander Duyck,
	Thomas Gleixner, Kees Cook, devel, Stefano Stabellini,
	Stephen Hemminger, Aneesh Kumar K.V, Joerg Roedel, X86 ML,
	YueHaibing, Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Ingo Molnar, Vlastimil Babka,
	Anthony Yznaga, Oscar Salvador, Isaac J. Manjarres,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang, Sasha Levin,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Christophe Leroy, Mel Gorman, Cornelia Huck, Pavel Tatashin,
	Linux Kernel Mailing List, Sean Christopherson, Johannes Weiner,
	Paolo Bonzini, Andrew Morton, linuxppc-dev

On 07.11.19 19:22, David Hildenbrand wrote:
> 
> 
>> Am 07.11.2019 um 16:40 schrieb Dan Williams <dan.j.williams@intel.com>:
>>
>> On Thu, Oct 24, 2019 at 5:12 AM David Hildenbrand <david@redhat.com> wrote:
>>>
>>> Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
>>> change that.
>>>
>>> KVM has this weird use case that you can map anything from /dev/mem
>>> into the guest. pfn_valid() is not a reliable check whether the memmap
>>> was initialized and can be touched. pfn_to_online_page() makes sure
>>> that we have an initialized memmap (and don't have ZONE_DEVICE memory).
>>>
>>> Rewrite is_invalid_reserved_pfn() similar to kvm_is_reserved_pfn() to make
>>> sure the function produces the same result once we stop setting ZONE_DEVICE
>>> pages PG_reserved.
>>>
>>> Cc: Alex Williamson <alex.williamson@redhat.com>
>>> Cc: Cornelia Huck <cohuck@redhat.com>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>> drivers/vfio/vfio_iommu_type1.c | 10 ++++++++--
>>> 1 file changed, 8 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>> index 2ada8e6cdb88..f8ce8c408ba8 100644
>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>> @@ -299,9 +299,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
>>>   */
>>> static bool is_invalid_reserved_pfn(unsigned long pfn)
>>> {
>>> -       if (pfn_valid(pfn))
>>> -               return PageReserved(pfn_to_page(pfn));
>>> +       struct page *page = pfn_to_online_page(pfn);
>>
>> Ugh, I just realized this is not a safe conversion until
>> pfn_to_online_page() is moved over to subsection granularity. As it
>> stands it will return true for any ZONE_DEVICE pages that share a
>> section with boot memory.
> 
> That should not happen right now and I commented back when you introduced subsection support that I don’t want to have ZONE_DEVICE mixed with online pages in a section. Having memory block devices that partially span ZONE_DEVICE would be ... really weird. With something like pfn_active() - as discussed - we could at least make this check work - but I am not sure if we really want to go down that path. In the worst case, some MB of RAM are lost ... I guess this needs more thought.
> 

I just realized the "boot memory" part. Is that a real thing? IOW, can 
we have ZONE_DEVICE falling into a memory block (with holes)? I somewhat 
have doubts that this would work ...

-- 

Thanks,

David / dhildenb

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v1 04/10] vfio/type1: Prepare is_invalid_reserved_pfn() for PG_reserved changes
  2019-11-07 22:07       ` David Hildenbrand
@ 2019-11-08  5:09         ` Dan Williams
  2019-11-08  7:14           ` David Hildenbrand
  0 siblings, 1 reply; 43+ messages in thread
From: Dan Williams @ 2019-11-08  5:09 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	KVM list, Pavel Tatashin, KarimAllah Ahmed,
	Benjamin Herrenschmidt, Dave Hansen, Alexander Duyck,
	Michal Hocko, Paul Mackerras, Linux MM, Paolo Bonzini,
	Paul Mackerras, Michael Ellerman, H. Peter Anvin, Wanpeng Li,
	Alexander Duyck, Thomas Gleixner, Kees Cook, devel,
	Stefano Stabellini, Stephen Hemminger, Aneesh Kumar K.V,
	Joerg Roedel, X86 ML, YueHaibing, Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Ingo Molnar, Vlastimil Babka,
	Anthony Yznaga, Oscar Salvador, Isaac J. Manjarres,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang, Sasha Levin,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	David Hildenbrand, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Christophe Leroy, Mel Gorman, Cornelia Huck, Pavel Tatashin,
	Linux Kernel Mailing List, Sean Christopherson, Johannes Weiner,
	Borislav Petkov, Andrew Morton, linuxppc-dev

On Thu, Nov 7, 2019 at 2:07 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 07.11.19 19:22, David Hildenbrand wrote:
> >
> >
> >> Am 07.11.2019 um 16:40 schrieb Dan Williams <dan.j.williams@intel.com>:
> >>
> >> On Thu, Oct 24, 2019 at 5:12 AM David Hildenbrand <david@redhat.com> wrote:
> >>>
> >>> Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
> >>> change that.
> >>>
> >>> KVM has this weird use case that you can map anything from /dev/mem
> >>> into the guest. pfn_valid() is not a reliable check whether the memmap
> >>> was initialized and can be touched. pfn_to_online_page() makes sure
> >>> that we have an initialized memmap (and don't have ZONE_DEVICE memory).
> >>>
> >>> Rewrite is_invalid_reserved_pfn() similar to kvm_is_reserved_pfn() to make
> >>> sure the function produces the same result once we stop setting ZONE_DEVICE
> >>> pages PG_reserved.
> >>>
> >>> Cc: Alex Williamson <alex.williamson@redhat.com>
> >>> Cc: Cornelia Huck <cohuck@redhat.com>
> >>> Signed-off-by: David Hildenbrand <david@redhat.com>
> >>> ---
> >>> drivers/vfio/vfio_iommu_type1.c | 10 ++++++++--
> >>> 1 file changed, 8 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> >>> index 2ada8e6cdb88..f8ce8c408ba8 100644
> >>> --- a/drivers/vfio/vfio_iommu_type1.c
> >>> +++ b/drivers/vfio/vfio_iommu_type1.c
> >>> @@ -299,9 +299,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
> >>>   */
> >>> static bool is_invalid_reserved_pfn(unsigned long pfn)
> >>> {
> >>> -       if (pfn_valid(pfn))
> >>> -               return PageReserved(pfn_to_page(pfn));
> >>> +       struct page *page = pfn_to_online_page(pfn);
> >>
> >> Ugh, I just realized this is not a safe conversion until
> >> pfn_to_online_page() is moved over to subsection granularity. As it
> >> stands it will return true for any ZONE_DEVICE pages that share a
> >> section with boot memory.
> >
> > That should not happen right now and I commented back when you introduced subsection support that I don’t want to have ZONE_DEVICE mixed with online pages in a section. Having memory block devices that partially span ZONE_DEVICE would be ... really weird. With something like pfn_active() - as discussed - we could at least make this check work - but I am not sure if we really want to go down that path. In the worst case, some MB of RAM are lost ... I guess this needs more thought.
> >
>
> I just realized the "boot memory" part. Is that a real thing? IOW, can
> we have ZONE_DEVICE falling into a memory block (with holes)? I somewhat
> have doubts that this would work ...

One of the real world failure cases that started the subsection effect
is that Persistent Memory collides with System RAM on a 64MB boundary
on shipping platforms. System RAM ends on a 64MB boundary and due to a
lack of memory controller resources PMEM is mapped contiguously at the
end of that boundary. Some more details in the subsection cover letter
/ changelogs [1] [2]. It's not sufficient to just lose some memory,
that's the broken implementation that lead to the subsection work
because the lost memory may change from one boot to the next and
software can't reliably inject a padding that conforms to the x86
128MB section constraint.

Suffice to say I think we need your pfn_active() to get subsection
granularity pfn_to_online_page() before PageReserved() can be removed.

[1]: https://lore.kernel.org/linux-mm/156092349300.979959.17603710711957735135.stgit@dwillia2-desk3.amr.corp.intel.com/
[2]: https://lore.kernel.org/linux-mm/156092354368.979959.6232443923440952359.stgit@dwillia2-desk3.amr.corp.intel.com/
_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v1 04/10] vfio/type1: Prepare is_invalid_reserved_pfn() for PG_reserved changes
  2019-11-08  5:09         ` Dan Williams
@ 2019-11-08  7:14           ` David Hildenbrand
  2019-11-08 10:21             ` David Hildenbrand
  0 siblings, 1 reply; 43+ messages in thread
From: David Hildenbrand @ 2019-11-08  7:14 UTC (permalink / raw)
  To: Dan Williams, Michal Hocko
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	KVM list, Pavel Tatashin, KarimAllah Ahmed,
	Benjamin Herrenschmidt, Dave Hansen, Alexander Duyck,
	Paul Mackerras, Linux MM, Paolo Bonzini, Paul Mackerras,
	Michael Ellerman, H. Peter Anvin, Wanpeng Li, Alexander Duyck,
	Thomas Gleixner, Kees Cook, devel, Stefano Stabellini,
	Stephen Hemminger, Aneesh Kumar K.V, Joerg Roedel, X86 ML,
	YueHaibing, Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Ingo Molnar, Vlastimil Babka,
	Anthony Yznaga, Oscar Salvador, Isaac J. Manjarres,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang, Sasha Levin,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	David Hildenbrand, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Christophe Leroy, Mel Gorman, Cornelia Huck, Pavel Tatashin,
	Linux Kernel Mailing List, Sean Christopherson, Johannes Weiner,
	Borislav Petkov, Andrew Morton, linuxppc-dev

On 08.11.19 06:09, Dan Williams wrote:
> On Thu, Nov 7, 2019 at 2:07 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 07.11.19 19:22, David Hildenbrand wrote:
>>>
>>>
>>>> Am 07.11.2019 um 16:40 schrieb Dan Williams <dan.j.williams@intel.com>:
>>>>
>>>> On Thu, Oct 24, 2019 at 5:12 AM David Hildenbrand <david@redhat.com> wrote:
>>>>>
>>>>> Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
>>>>> change that.
>>>>>
>>>>> KVM has this weird use case that you can map anything from /dev/mem
>>>>> into the guest. pfn_valid() is not a reliable check whether the memmap
>>>>> was initialized and can be touched. pfn_to_online_page() makes sure
>>>>> that we have an initialized memmap (and don't have ZONE_DEVICE memory).
>>>>>
>>>>> Rewrite is_invalid_reserved_pfn() similar to kvm_is_reserved_pfn() to make
>>>>> sure the function produces the same result once we stop setting ZONE_DEVICE
>>>>> pages PG_reserved.
>>>>>
>>>>> Cc: Alex Williamson <alex.williamson@redhat.com>
>>>>> Cc: Cornelia Huck <cohuck@redhat.com>
>>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>>> ---
>>>>> drivers/vfio/vfio_iommu_type1.c | 10 ++++++++--
>>>>> 1 file changed, 8 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>>>> index 2ada8e6cdb88..f8ce8c408ba8 100644
>>>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>>>> @@ -299,9 +299,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
>>>>>    */
>>>>> static bool is_invalid_reserved_pfn(unsigned long pfn)
>>>>> {
>>>>> -       if (pfn_valid(pfn))
>>>>> -               return PageReserved(pfn_to_page(pfn));
>>>>> +       struct page *page = pfn_to_online_page(pfn);
>>>>
>>>> Ugh, I just realized this is not a safe conversion until
>>>> pfn_to_online_page() is moved over to subsection granularity. As it
>>>> stands it will return true for any ZONE_DEVICE pages that share a
>>>> section with boot memory.
>>>
>>> That should not happen right now and I commented back when you introduced subsection support that I don’t want to have ZONE_DEVICE mixed with online pages in a section. Having memory block devices that partially span ZONE_DEVICE would be ... really weird. With something like pfn_active() - as discussed - we could at least make this check work - but I am not sure if we really want to go down that path. In the worst case, some MB of RAM are lost ... I guess this needs more thought.
>>>
>>
>> I just realized the "boot memory" part. Is that a real thing? IOW, can
>> we have ZONE_DEVICE falling into a memory block (with holes)? I somewhat
>> have doubts that this would work ...
> 
> One of the real world failure cases that started the subsection effect
> is that Persistent Memory collides with System RAM on a 64MB boundary
> on shipping platforms. System RAM ends on a 64MB boundary and due to a
> lack of memory controller resources PMEM is mapped contiguously at the
> end of that boundary. Some more details in the subsection cover letter
> / changelogs [1] [2]. It's not sufficient to just lose some memory,
> that's the broken implementation that lead to the subsection work
> because the lost memory may change from one boot to the next and
> software can't reliably inject a padding that conforms to the x86
> 128MB section constraint.

Thanks, I thought it was mostly for weird alignment where other parts of 
the section are basically "holes" and not memory.

Yes, it is a real bug that ZONE_DEVICE pages fall into sections that are 
marked SECTION_IS_ONLINE.

> 
> Suffice to say I think we need your pfn_active() to get subsection
> granularity pfn_to_online_page() before PageReserved() can be removed.

I agree that we have to fix this. I don't like ZONE_DEVICE pages falling 
into memory device blocks (e.g., cannot get offlined), but I guess that 
train is gone :) As long as it's not for memory hotplug, I can most 
probably live with this.

Also, I'd like to get Michals opinion on this and the pfn_active() 
approach, but I can understand he's busy.

This patch set can wait, I won't be working next week besides 
reading/writing mails either way.

Is anybody looking into the pfn_active() thingy?

> 
> [1]: https://lore.kernel.org/linux-mm/156092349300.979959.17603710711957735135.stgit@dwillia2-desk3.amr.corp.intel.com/
> [2]: https://lore.kernel.org/linux-mm/156092354368.979959.6232443923440952359.stgit@dwillia2-desk3.amr.corp.intel.com/
> 


-- 

Thanks,

David / dhildenb

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v1 04/10] vfio/type1: Prepare is_invalid_reserved_pfn() for PG_reserved changes
  2019-11-08  7:14           ` David Hildenbrand
@ 2019-11-08 10:21             ` David Hildenbrand
  2019-11-08 18:29               ` Dan Williams
  0 siblings, 1 reply; 43+ messages in thread
From: David Hildenbrand @ 2019-11-08 10:21 UTC (permalink / raw)
  To: Dan Williams, Michal Hocko
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	KVM list, Pavel Tatashin, KarimAllah Ahmed,
	Benjamin Herrenschmidt, Dave Hansen, Alexander Duyck,
	Paul Mackerras, Linux MM, Paul Mackerras, Michael Ellerman,
	H. Peter Anvin, Wanpeng Li, Alexander Duyck, Thomas Gleixner,
	Kees Cook, devel, Stefano Stabellini, Stephen Hemminger,
	Aneesh Kumar K.V, Joerg Roedel, X86 ML, YueHaibing,
	Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Ingo Molnar, Vlastimil Babka,
	Anthony Yznaga, Oscar Salvador, Isaac J. Manjarres,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang, Sasha Levin,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Christophe Leroy, Mel Gorman, Cornelia Huck, Pavel Tatashin,
	Linux Kernel Mailing List, Sean Christopherson, Johannes Weiner,
	Paolo Bonzini, Andrew Morton, linuxppc-dev

On 08.11.19 08:14, David Hildenbrand wrote:
> On 08.11.19 06:09, Dan Williams wrote:
>> On Thu, Nov 7, 2019 at 2:07 PM David Hildenbrand <david@redhat.com> wrote:
>>>
>>> On 07.11.19 19:22, David Hildenbrand wrote:
>>>>
>>>>
>>>>> Am 07.11.2019 um 16:40 schrieb Dan Williams <dan.j.williams@intel.com>:
>>>>>
>>>>> On Thu, Oct 24, 2019 at 5:12 AM David Hildenbrand <david@redhat.com> wrote:
>>>>>>
>>>>>> Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
>>>>>> change that.
>>>>>>
>>>>>> KVM has this weird use case that you can map anything from /dev/mem
>>>>>> into the guest. pfn_valid() is not a reliable check whether the memmap
>>>>>> was initialized and can be touched. pfn_to_online_page() makes sure
>>>>>> that we have an initialized memmap (and don't have ZONE_DEVICE memory).
>>>>>>
>>>>>> Rewrite is_invalid_reserved_pfn() similar to kvm_is_reserved_pfn() to make
>>>>>> sure the function produces the same result once we stop setting ZONE_DEVICE
>>>>>> pages PG_reserved.
>>>>>>
>>>>>> Cc: Alex Williamson <alex.williamson@redhat.com>
>>>>>> Cc: Cornelia Huck <cohuck@redhat.com>
>>>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>>>> ---
>>>>>> drivers/vfio/vfio_iommu_type1.c | 10 ++++++++--
>>>>>> 1 file changed, 8 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>>>>> index 2ada8e6cdb88..f8ce8c408ba8 100644
>>>>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>>>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>>>>> @@ -299,9 +299,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
>>>>>>     */
>>>>>> static bool is_invalid_reserved_pfn(unsigned long pfn)
>>>>>> {
>>>>>> -       if (pfn_valid(pfn))
>>>>>> -               return PageReserved(pfn_to_page(pfn));
>>>>>> +       struct page *page = pfn_to_online_page(pfn);
>>>>>
>>>>> Ugh, I just realized this is not a safe conversion until
>>>>> pfn_to_online_page() is moved over to subsection granularity. As it
>>>>> stands it will return true for any ZONE_DEVICE pages that share a
>>>>> section with boot memory.
>>>>
>>>> That should not happen right now and I commented back when you introduced subsection support that I don’t want to have ZONE_DEVICE mixed with online pages in a section. Having memory block devices that partially span ZONE_DEVICE would be ... really weird. With something like pfn_active() - as discussed - we could at least make this check work - but I am not sure if we really want to go down that path. In the worst case, some MB of RAM are lost ... I guess this needs more thought.
>>>>
>>>
>>> I just realized the "boot memory" part. Is that a real thing? IOW, can
>>> we have ZONE_DEVICE falling into a memory block (with holes)? I somewhat
>>> have doubts that this would work ...
>>
>> One of the real world failure cases that started the subsection effect
>> is that Persistent Memory collides with System RAM on a 64MB boundary
>> on shipping platforms. System RAM ends on a 64MB boundary and due to a
>> lack of memory controller resources PMEM is mapped contiguously at the
>> end of that boundary. Some more details in the subsection cover letter
>> / changelogs [1] [2]. It's not sufficient to just lose some memory,
>> that's the broken implementation that lead to the subsection work
>> because the lost memory may change from one boot to the next and
>> software can't reliably inject a padding that conforms to the x86
>> 128MB section constraint.
> 
> Thanks, I thought it was mostly for weird alignment where other parts of
> the section are basically "holes" and not memory.
> 
> Yes, it is a real bug that ZONE_DEVICE pages fall into sections that are
> marked SECTION_IS_ONLINE.
> 
>>
>> Suffice to say I think we need your pfn_active() to get subsection
>> granularity pfn_to_online_page() before PageReserved() can be removed.
> 
> I agree that we have to fix this. I don't like ZONE_DEVICE pages falling
> into memory device blocks (e.g., cannot get offlined), but I guess that
> train is gone :) As long as it's not for memory hotplug, I can most
> probably live with this.
> 
> Also, I'd like to get Michals opinion on this and the pfn_active()
> approach, but I can understand he's busy.
> 
> This patch set can wait, I won't be working next week besides
> reading/writing mails either way.
> 
> Is anybody looking into the pfn_active() thingy?
> 

I wonder if we should do something like this right now to fix this 
(exclude the false positive ZONE_DEVICE pages we could have within an 
online section, which was not possible before subsection hotplug):

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 384ffb3d69ab..490a9e9358b3 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -30,6 +30,8 @@ struct vmem_altmap;
         if (___nr < NR_MEM_SECTIONS && online_section_nr(___nr) && \
             pfn_valid_within(___pfn))                              \
                 ___page = pfn_to_page(___pfn);                     \
+       if (unlikely(___page && is_zone_device_page(___page)))     \
+               ___page = NULL;                                    \
         ___page;                                                   \
  })


Yeah, it's another is_zone_device_page(), but it should not be racy 
here, as we want to exclude, not include ZONE_DEVICE.

I don't have time to look into this right now, unfortunately.

-- 

Thanks,

David / dhildenb

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v1 04/10] vfio/type1: Prepare is_invalid_reserved_pfn() for PG_reserved changes
  2019-11-08 10:21             ` David Hildenbrand
@ 2019-11-08 18:29               ` Dan Williams
  2019-11-08 23:01                 ` David Hildenbrand
  0 siblings, 1 reply; 43+ messages in thread
From: Dan Williams @ 2019-11-08 18:29 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	KVM list, Pavel Tatashin, KarimAllah Ahmed,
	Benjamin Herrenschmidt, Dave Hansen, Alexander Duyck,
	Michal Hocko, Paul Mackerras, Linux MM, Paul Mackerras,
	Michael Ellerman, H. Peter Anvin, Wanpeng Li, Alexander Duyck,
	Thomas Gleixner, Kees Cook, devel, Stefano Stabellini,
	Stephen Hemminger, Aneesh Kumar K.V, Joerg Roedel, X86 ML,
	YueHaibing, Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Ingo Molnar, Vlastimil Babka,
	Anthony Yznaga, Oscar Salvador, Isaac J. Manjarres,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang, Sasha Levin,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Christophe Leroy, Mel Gorman, Cornelia Huck, Pavel Tatashin,
	Linux Kernel Mailing List, Sean Christopherson, Johannes Weiner,
	Paolo Bonzini, Andrew Morton, linuxppc-dev

On Fri, Nov 8, 2019 at 2:22 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 08.11.19 08:14, David Hildenbrand wrote:
> > On 08.11.19 06:09, Dan Williams wrote:
> >> On Thu, Nov 7, 2019 at 2:07 PM David Hildenbrand <david@redhat.com> wrote:
> >>>
> >>> On 07.11.19 19:22, David Hildenbrand wrote:
> >>>>
> >>>>
> >>>>> Am 07.11.2019 um 16:40 schrieb Dan Williams <dan.j.williams@intel.com>:
> >>>>>
> >>>>> On Thu, Oct 24, 2019 at 5:12 AM David Hildenbrand <david@redhat.com> wrote:
> >>>>>>
> >>>>>> Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
> >>>>>> change that.
> >>>>>>
> >>>>>> KVM has this weird use case that you can map anything from /dev/mem
> >>>>>> into the guest. pfn_valid() is not a reliable check whether the memmap
> >>>>>> was initialized and can be touched. pfn_to_online_page() makes sure
> >>>>>> that we have an initialized memmap (and don't have ZONE_DEVICE memory).
> >>>>>>
> >>>>>> Rewrite is_invalid_reserved_pfn() similar to kvm_is_reserved_pfn() to make
> >>>>>> sure the function produces the same result once we stop setting ZONE_DEVICE
> >>>>>> pages PG_reserved.
> >>>>>>
> >>>>>> Cc: Alex Williamson <alex.williamson@redhat.com>
> >>>>>> Cc: Cornelia Huck <cohuck@redhat.com>
> >>>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
> >>>>>> ---
> >>>>>> drivers/vfio/vfio_iommu_type1.c | 10 ++++++++--
> >>>>>> 1 file changed, 8 insertions(+), 2 deletions(-)
> >>>>>>
> >>>>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> >>>>>> index 2ada8e6cdb88..f8ce8c408ba8 100644
> >>>>>> --- a/drivers/vfio/vfio_iommu_type1.c
> >>>>>> +++ b/drivers/vfio/vfio_iommu_type1.c
> >>>>>> @@ -299,9 +299,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
> >>>>>>     */
> >>>>>> static bool is_invalid_reserved_pfn(unsigned long pfn)
> >>>>>> {
> >>>>>> -       if (pfn_valid(pfn))
> >>>>>> -               return PageReserved(pfn_to_page(pfn));
> >>>>>> +       struct page *page = pfn_to_online_page(pfn);
> >>>>>
> >>>>> Ugh, I just realized this is not a safe conversion until
> >>>>> pfn_to_online_page() is moved over to subsection granularity. As it
> >>>>> stands it will return true for any ZONE_DEVICE pages that share a
> >>>>> section with boot memory.
> >>>>
> >>>> That should not happen right now and I commented back when you introduced subsection support that I don’t want to have ZONE_DEVICE mixed with online pages in a section. Having memory block devices that partially span ZONE_DEVICE would be ... really weird. With something like pfn_active() - as discussed - we could at least make this check work - but I am not sure if we really want to go down that path. In the worst case, some MB of RAM are lost ... I guess this needs more thought.
> >>>>
> >>>
> >>> I just realized the "boot memory" part. Is that a real thing? IOW, can
> >>> we have ZONE_DEVICE falling into a memory block (with holes)? I somewhat
> >>> have doubts that this would work ...
> >>
> >> One of the real world failure cases that started the subsection effect
> >> is that Persistent Memory collides with System RAM on a 64MB boundary
> >> on shipping platforms. System RAM ends on a 64MB boundary and due to a
> >> lack of memory controller resources PMEM is mapped contiguously at the
> >> end of that boundary. Some more details in the subsection cover letter
> >> / changelogs [1] [2]. It's not sufficient to just lose some memory,
> >> that's the broken implementation that lead to the subsection work
> >> because the lost memory may change from one boot to the next and
> >> software can't reliably inject a padding that conforms to the x86
> >> 128MB section constraint.
> >
> > Thanks, I thought it was mostly for weird alignment where other parts of
> > the section are basically "holes" and not memory.
> >
> > Yes, it is a real bug that ZONE_DEVICE pages fall into sections that are
> > marked SECTION_IS_ONLINE.
> >
> >>
> >> Suffice to say I think we need your pfn_active() to get subsection
> >> granularity pfn_to_online_page() before PageReserved() can be removed.
> >
> > I agree that we have to fix this. I don't like ZONE_DEVICE pages falling
> > into memory device blocks (e.g., cannot get offlined), but I guess that
> > train is gone :) As long as it's not for memory hotplug, I can most
> > probably live with this.
> >
> > Also, I'd like to get Michals opinion on this and the pfn_active()
> > approach, but I can understand he's busy.
> >
> > This patch set can wait, I won't be working next week besides
> > reading/writing mails either way.
> >
> > Is anybody looking into the pfn_active() thingy?
> >
>
> I wonder if we should do something like this right now to fix this
> (exclude the false positive ZONE_DEVICE pages we could have within an
> online section, which was not possible before subsection hotplug):
>
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 384ffb3d69ab..490a9e9358b3 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -30,6 +30,8 @@ struct vmem_altmap;
>          if (___nr < NR_MEM_SECTIONS && online_section_nr(___nr) && \
>              pfn_valid_within(___pfn))                              \
>                  ___page = pfn_to_page(___pfn);                     \
> +       if (unlikely(___page && is_zone_device_page(___page)))     \
> +               ___page = NULL;                                    \
>          ___page;                                                   \
>   })
>
>
> Yeah, it's another is_zone_device_page(), but it should not be racy
> here, as we want to exclude, not include ZONE_DEVICE.
>
> I don't have time to look into this right now, unfortunately.

I don't want to band-aid without an actual bug report. I'll take a
look at a subsection-map for the online state.
_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v1 04/10] vfio/type1: Prepare is_invalid_reserved_pfn() for PG_reserved changes
  2019-11-08 18:29               ` Dan Williams
@ 2019-11-08 23:01                 ` David Hildenbrand
  0 siblings, 0 replies; 43+ messages in thread
From: David Hildenbrand @ 2019-11-08 23:01 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-hyperv, Michal Hocko, Radim Krčmář,
	KVM list, Pavel Tatashin, KarimAllah Ahmed,
	Benjamin Herrenschmidt, Dave Hansen, Alexander Duyck,
	Michal Hocko, Paul Mackerras, Linux MM, Paul Mackerras,
	Michael Ellerman, H. Peter Anvin, Wanpeng Li, Alexander Duyck,
	Thomas Gleixner, Kees Cook, devel, Stefano Stabellini,
	Stephen Hemminger, Aneesh Kumar K.V, Joerg Roedel, X86 ML,
	YueHaibing, Matthew Wilcox (Oracle),
	Mike Rapoport, Peter Zijlstra, Ingo Molnar, Vlastimil Babka,
	Anthony Yznaga, Oscar Salvador, Isaac J. Manjarres,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang, Sasha Levin,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Christophe Leroy, Mel Gorman, Cornelia Huck, Pavel Tatashin,
	Linux Kernel Mailing List, Sean Christopherson, Johannes Weiner,
	Paolo Bonzini, Andrew Morton, linuxppc-dev

On 08.11.19 19:29, Dan Williams wrote:
> On Fri, Nov 8, 2019 at 2:22 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 08.11.19 08:14, David Hildenbrand wrote:
>>> On 08.11.19 06:09, Dan Williams wrote:
>>>> On Thu, Nov 7, 2019 at 2:07 PM David Hildenbrand <david@redhat.com> wrote:
>>>>>
>>>>> On 07.11.19 19:22, David Hildenbrand wrote:
>>>>>>
>>>>>>
>>>>>>> Am 07.11.2019 um 16:40 schrieb Dan Williams <dan.j.williams@intel.com>:
>>>>>>>
>>>>>>> On Thu, Oct 24, 2019 at 5:12 AM David Hildenbrand <david@redhat.com> wrote:
>>>>>>>>
>>>>>>>> Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
>>>>>>>> change that.
>>>>>>>>
>>>>>>>> KVM has this weird use case that you can map anything from /dev/mem
>>>>>>>> into the guest. pfn_valid() is not a reliable check whether the memmap
>>>>>>>> was initialized and can be touched. pfn_to_online_page() makes sure
>>>>>>>> that we have an initialized memmap (and don't have ZONE_DEVICE memory).
>>>>>>>>
>>>>>>>> Rewrite is_invalid_reserved_pfn() similar to kvm_is_reserved_pfn() to make
>>>>>>>> sure the function produces the same result once we stop setting ZONE_DEVICE
>>>>>>>> pages PG_reserved.
>>>>>>>>
>>>>>>>> Cc: Alex Williamson <alex.williamson@redhat.com>
>>>>>>>> Cc: Cornelia Huck <cohuck@redhat.com>
>>>>>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>>>>>> ---
>>>>>>>> drivers/vfio/vfio_iommu_type1.c | 10 ++++++++--
>>>>>>>> 1 file changed, 8 insertions(+), 2 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>>>>>>> index 2ada8e6cdb88..f8ce8c408ba8 100644
>>>>>>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>>>>>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>>>>>>> @@ -299,9 +299,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
>>>>>>>>      */
>>>>>>>> static bool is_invalid_reserved_pfn(unsigned long pfn)
>>>>>>>> {
>>>>>>>> -       if (pfn_valid(pfn))
>>>>>>>> -               return PageReserved(pfn_to_page(pfn));
>>>>>>>> +       struct page *page = pfn_to_online_page(pfn);
>>>>>>>
>>>>>>> Ugh, I just realized this is not a safe conversion until
>>>>>>> pfn_to_online_page() is moved over to subsection granularity. As it
>>>>>>> stands it will return true for any ZONE_DEVICE pages that share a
>>>>>>> section with boot memory.
>>>>>>
>>>>>> That should not happen right now and I commented back when you introduced subsection support that I don’t want to have ZONE_DEVICE mixed with online pages in a section. Having memory block devices that partially span ZONE_DEVICE would be ... really weird. With something like pfn_active() - as discussed - we could at least make this check work - but I am not sure if we really want to go down that path. In the worst case, some MB of RAM are lost ... I guess this needs more thought.
>>>>>>
>>>>>
>>>>> I just realized the "boot memory" part. Is that a real thing? IOW, can
>>>>> we have ZONE_DEVICE falling into a memory block (with holes)? I somewhat
>>>>> have doubts that this would work ...
>>>>
>>>> One of the real world failure cases that started the subsection effect
>>>> is that Persistent Memory collides with System RAM on a 64MB boundary
>>>> on shipping platforms. System RAM ends on a 64MB boundary and due to a
>>>> lack of memory controller resources PMEM is mapped contiguously at the
>>>> end of that boundary. Some more details in the subsection cover letter
>>>> / changelogs [1] [2]. It's not sufficient to just lose some memory,
>>>> that's the broken implementation that lead to the subsection work
>>>> because the lost memory may change from one boot to the next and
>>>> software can't reliably inject a padding that conforms to the x86
>>>> 128MB section constraint.
>>>
>>> Thanks, I thought it was mostly for weird alignment where other parts of
>>> the section are basically "holes" and not memory.
>>>
>>> Yes, it is a real bug that ZONE_DEVICE pages fall into sections that are
>>> marked SECTION_IS_ONLINE.
>>>
>>>>
>>>> Suffice to say I think we need your pfn_active() to get subsection
>>>> granularity pfn_to_online_page() before PageReserved() can be removed.
>>>
>>> I agree that we have to fix this. I don't like ZONE_DEVICE pages falling
>>> into memory device blocks (e.g., cannot get offlined), but I guess that
>>> train is gone :) As long as it's not for memory hotplug, I can most
>>> probably live with this.
>>>
>>> Also, I'd like to get Michals opinion on this and the pfn_active()
>>> approach, but I can understand he's busy.
>>>
>>> This patch set can wait, I won't be working next week besides
>>> reading/writing mails either way.
>>>
>>> Is anybody looking into the pfn_active() thingy?
>>>
>>
>> I wonder if we should do something like this right now to fix this
>> (exclude the false positive ZONE_DEVICE pages we could have within an
>> online section, which was not possible before subsection hotplug):
>>
>> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
>> index 384ffb3d69ab..490a9e9358b3 100644
>> --- a/include/linux/memory_hotplug.h
>> +++ b/include/linux/memory_hotplug.h
>> @@ -30,6 +30,8 @@ struct vmem_altmap;
>>           if (___nr < NR_MEM_SECTIONS && online_section_nr(___nr) && \
>>               pfn_valid_within(___pfn))                              \
>>                   ___page = pfn_to_page(___pfn);                     \
>> +       if (unlikely(___page && is_zone_device_page(___page)))     \
>> +               ___page = NULL;                                    \
>>           ___page;                                                   \
>>    })
>>
>>
>> Yeah, it's another is_zone_device_page(), but it should not be racy
>> here, as we want to exclude, not include ZONE_DEVICE.
>>
>> I don't have time to look into this right now, unfortunately.
> 
> I don't want to band-aid without an actual bug report. I'll take a
> look at a subsection-map for the online state.
> 

Fair enough, but at least in what I proposed for pfn_active(), this 
check would exist in pfn_to_online_page() in a similar way - and it is 
certainly easier to backport. But yeah, triggering this might not be easy.

-- 

Thanks,

David / dhildenb

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, back to index

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-24 12:09 [PATCH v1 00/10] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE) David Hildenbrand
2019-10-24 12:09 ` [PATCH v1 01/10] mm/memory_hotplug: Don't allow to online/offline memory blocks with holes David Hildenbrand
2019-11-05  1:30   ` Dan Williams
2019-11-05  9:31     ` David Hildenbrand
2019-10-24 12:09 ` [PATCH v1 02/10] KVM: x86/mmu: Prepare kvm_is_mmio_pfn() for PG_reserved changes David Hildenbrand
2019-11-05  1:37   ` Dan Williams
2019-11-05 11:09     ` David Hildenbrand
2019-10-24 12:09 ` [PATCH v1 03/10] KVM: Prepare kvm_is_reserved_pfn() " David Hildenbrand
2019-11-05  4:38   ` Dan Williams
2019-11-05  9:17     ` David Hildenbrand
2019-11-05  9:49       ` David Hildenbrand
2019-11-05 10:02         ` David Hildenbrand
2019-11-05 16:00           ` Sean Christopherson
2019-11-05 20:30             ` David Hildenbrand
2019-11-05 22:22               ` Sean Christopherson
2019-11-05 23:02               ` Dan Williams
2019-11-05 23:13                 ` Sean Christopherson
2019-11-05 23:30                   ` Dan Williams
2019-11-05 23:42                     ` Sean Christopherson
2019-11-05 23:43                     ` Dan Williams
2019-11-06  0:03                       ` Sean Christopherson
2019-11-06  0:08                         ` Dan Williams
2019-11-06  6:56                           ` David Hildenbrand
2019-11-06 16:09                             ` Sean Christopherson
2019-10-24 12:09 ` [PATCH v1 04/10] vfio/type1: Prepare is_invalid_reserved_pfn() " David Hildenbrand
2019-11-07 15:40   ` Dan Williams
2019-11-07 18:22     ` David Hildenbrand
2019-11-07 22:07       ` David Hildenbrand
2019-11-08  5:09         ` Dan Williams
2019-11-08  7:14           ` David Hildenbrand
2019-11-08 10:21             ` David Hildenbrand
2019-11-08 18:29               ` Dan Williams
2019-11-08 23:01                 ` David Hildenbrand
2019-10-24 12:09 ` [PATCH v1 05/10] powerpc/book3s: Prepare kvmppc_book3s_instantiate_page() " David Hildenbrand
2019-10-24 12:09 ` [PATCH v1 06/10] powerpc/64s: Prepare hash_page_do_lazy_icache() " David Hildenbrand
2019-10-24 12:09 ` [PATCH v1 07/10] powerpc/mm: Prepare maybe_pte_to_page() " David Hildenbrand
2019-10-24 12:09 ` [PATCH v1 08/10] x86/mm: Prepare __ioremap_check_ram() " David Hildenbrand
2019-10-24 12:09 ` [PATCH v1 09/10] mm/memory_hotplug: Don't mark pages PG_reserved when initializing the memmap David Hildenbrand
2019-11-04 22:44   ` Boris Ostrovsky
2019-11-05 10:18     ` David Hildenbrand
2019-11-05 16:06       ` Boris Ostrovsky
2019-10-24 12:09 ` [PATCH v1 10/10] mm/usercopy.c: Update comment in check_page_span() regarding ZONE_DEVICE David Hildenbrand
2019-11-01 19:24 ` [PATCH v1 00/10] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE) David Hildenbrand

DriverDev-Devel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/driverdev-devel/0 driverdev-devel/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 driverdev-devel driverdev-devel/ https://lore.kernel.org/driverdev-devel \
		driverdev-devel@linuxdriverproject.org devel@driverdev.osuosl.org
	public-inbox-index driverdev-devel

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.linuxdriverproject.driverdev-devel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git