* [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)
@ 2019-10-22 17:12 ` David Hildenbrand
  0 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2019-10-22 17:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Michal Hocko, Andrew Morton,
	kvm-ppc, linuxppc-dev, kvm, linux-hyperv, devel, xen-devel, x86,
	Alexander Duyck, Alexander Duyck, Alex Williamson,
	Allison Randal, Andy Lutomirski, Aneesh Kumar K.V,
	Anshuman Khandual, Anthony Yznaga, Ben Chan,
	Benjamin Herrenschmidt, Borislav Petkov, Boris Ostrovsky,
	Christophe Leroy, Cornelia Huck, Dan Carpenter, Dan Williams,
	Dave Hansen, Fabio Estevam, Greg Kroah-Hartman, Haiyang Zhang,
	H. Peter Anvin, Ingo Molnar, Isaac J. Manjarres, Jeremy Sowden,
	Jim Mattson, Joerg Roedel, Johannes Weiner, Juergen Gross,
	KarimAllah Ahmed, Kate Stewart, Kees Cook, K. Y. Srinivasan,
	Madhumitha Prabakaran, Matt Sickler, Mel Gorman,
	Michael Ellerman, Michal Hocko, Mike Rapoport, Mike Rapoport,
	Nicholas Piggin, Nishka Dasgupta, Oscar Salvador, Paolo Bonzini,
	Paul Mackerras, Paul Mackerras, Pavel Tatashin, Pavel Tatashin,
	Peter Zijlstra, Qian Cai, Radim Krčmář,
	Rob Springer, Sasha Levin, Sean Christopherson,
	Simon Sandström, Stefano Stabellini, Stephen Hemminger,
	Thomas Gleixner, Todd Poynor, Vandana BN, Vitaly Kuznetsov,
	Vlastimil Babka, Wanpeng Li, YueHaibing

This series is based on [2], which should pop up in linux/next soon:
	https://lkml.org/lkml/2019/10/21/1034

This is the result of a recent discussion with Michal ([1], [2]). Right
now we set all pages PG_reserved when initializing hotplugged memmaps,
including those of ZONE_DEVICE memory. For system memory, PG_reserved is
cleared again when onlining the memory; for ZONE_DEVICE memory, it never
is. In ancient times we needed PG_reserved because there was no way to
tell whether the memmap was already properly initialized. For
!ZONE_DEVICE memory we now have SECTION_IS_ONLINE for that. ZONE_DEVICE
memory is initialized in a deferred fashion anyway, so there shouldn't be
a visible change in that regard.

I remember that some time ago we already discussed on the list stopping
to set ZONE_DEVICE pages PG_reserved, but I never saw any patches. I
also forgot who was part of that discussion :)

One of the biggest fears was side effects. I went ahead and audited all
users of PageReserved(). The ones that don't need any care (i.e., no
patches) can be found below. I will double-check, and hope I am not
missing anything important.

I am probably a little bit too careful (but I don't want to break
things). In most places (besides KVM and vfio, which are nuts), the
pfn_to_online_page() check could most probably be replaced by an
is_zone_device_page() check. However, I usually get suspicious when I
see a bare pfn_valid() check (especially after I learned that people
mmap parts of /dev/mem into user space, including memory without
memmaps; people could also mmap offline memory blocks this way :/). As
long as this does not hurt performance, I think we should rather do it
the clean way.

I only gave it a quick test with DIMMs on x86-64, but didn't test the
ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Compile-tested
on x86-64 and PPC.

Other users of PageReserved() that should be fine:
- mm/page_owner.c:pagetypeinfo_showmixedcount_print()
  -> Never called for ZONE_DEVICE (+ pfn_to_online_page(pfn))
- mm/page_owner.c:init_pages_in_zone()
  -> Never called for ZONE_DEVICE (!populated_zone(zone))
- mm/page_ext.c:free_page_ext()
  -> Only a BUG_ON(PageReserved(page)), not relevant
- mm/page_ext.c:has_unmovable_pages()
  -> Not relevant for ZONE_DEVICE
- mm/page_ext.c:pfn_range_valid_contig()
  -> pfn_to_online_page() already guards us
- mm/mempolicy.c:queue_pages_pte_range()
  -> vm_normal_page() checks against pte_devmap()
- mm/memory-failure.c:hwpoison_user_mappings()
  -> Not reached via memory_failure() due to pfn_to_online_page()
  -> Also not reached indirectly via memory_failure_hugetlb()
- mm/hugetlb.c:gather_bootmem_prealloc()
  -> Only a WARN_ON(PageReserved(page)), not relevant
- kernel/power/snapshot.c:saveable_highmem_page()
  -> pfn_to_online_page() already guards us
- kernel/power/snapshot.c:saveable_page()
  -> pfn_to_online_page() already guards us
- fs/proc/task_mmu.c:can_gather_numa_stats()
  -> vm_normal_page() checks against pte_devmap()
- fs/proc/task_mmu.c:can_gather_numa_stats_pmd()
  -> vm_normal_page_pmd() checks against pte_devmap()
- fs/proc/page.c:stable_page_flags()
  -> The reserved bit is simply copied, irrelevant
- drivers/firmware/memmap.c:release_firmware_map_entry()
  -> really only a check to detect bootmem. Not relevant for ZONE_DEVICE
- arch/ia64/kernel/mca_drv.c
- arch/mips/mm/init.c
- arch/mips/mm/ioremap.c
- arch/nios2/mm/ioremap.c
- arch/parisc/mm/ioremap.c
- arch/sparc/mm/tlb.c
- arch/xtensa/mm/cache.c
  -> No ZONE_DEVICE support
- arch/powerpc/mm/init_64.c:vmemmap_free()
  -> Special-cases memmap on altmap
  -> Only a check for bootmem
- arch/x86/kernel/alternative.c:__text_poke()
  -> Only a WARN_ON(!PageReserved(pages[0])) to verify it is bootmem
- arch/x86/mm/init_64.c
  -> Only a check for bootmem

[1] https://lkml.org/lkml/2019/10/21/736
[2] https://lkml.org/lkml/2019/10/21/1034

Cc: Michal Hocko <mhocko@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: kvm-ppc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: kvm@vger.kernel.org
Cc: linux-hyperv@vger.kernel.org
Cc: devel@driverdev.osuosl.org
Cc: xen-devel@lists.xenproject.org
Cc: x86@kernel.org
Cc: Alexander Duyck <alexander.duyck@gmail.com>

David Hildenbrand (12):
  mm/memory_hotplug: Don't allow to online/offline memory blocks with
    holes
  mm/usercopy.c: Prepare check_page_span() for PG_reserved changes
  KVM: x86/mmu: Prepare kvm_is_mmio_pfn() for PG_reserved changes
  KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes
  vfio/type1: Prepare is_invalid_reserved_pfn() for PG_reserved changes
  staging/gasket: Prepare gasket_release_page() for PG_reserved changes
  staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved
    changes
  powerpc/book3s: Prepare kvmppc_book3s_instantiate_page() for
    PG_reserved changes
  powerpc/64s: Prepare hash_page_do_lazy_icache() for PG_reserved
    changes
  powerpc/mm: Prepare maybe_pte_to_page() for PG_reserved changes
  x86/mm: Prepare __ioremap_check_ram() for PG_reserved changes
  mm/memory_hotplug: Don't mark pages PG_reserved when initializing the
    memmap

 arch/powerpc/kvm/book3s_64_mmu_radix.c     | 14 ++++---
 arch/powerpc/mm/book3s64/hash_utils.c      | 10 +++--
 arch/powerpc/mm/pgtable.c                  | 10 +++--
 arch/x86/kvm/mmu.c                         | 30 +++++++++------
 arch/x86/mm/ioremap.c                      | 13 +++++--
 drivers/hv/hv_balloon.c                    |  6 +++
 drivers/staging/gasket/gasket_page_table.c |  2 +-
 drivers/staging/kpc2000/kpc_dma/fileops.c  |  3 +-
 drivers/vfio/vfio_iommu_type1.c            | 10 ++++-
 drivers/xen/balloon.c                      |  7 ++++
 include/linux/page-flags.h                 |  8 +---
 mm/memory_hotplug.c                        | 43 ++++++++++++++++------
 mm/page_alloc.c                            | 11 ------
 mm/usercopy.c                              |  5 ++-
 virt/kvm/kvm_main.c                        | 10 ++++-
 15 files changed, 115 insertions(+), 67 deletions(-)

-- 
2.21.0



* [PATCH RFC v1 01/12] mm/memory_hotplug: Don't allow to online/offline memory blocks with holes
  2019-10-22 17:12 ` David Hildenbrand
@ 2019-10-22 17:12   ` David Hildenbrand
  -1 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2019-10-22 17:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Michal Hocko, Andrew Morton,
	kvm-ppc, linuxppc-dev, kvm, linux-hyperv, devel, xen-devel, x86,
	Alexander Duyck, Alexander Duyck, Alex Williamson,
	Allison Randal, Andy Lutomirski, Aneesh Kumar K.V,
	Anshuman Khandual, Anthony Yznaga, Ben Chan,
	Benjamin Herrenschmidt, Borislav Petkov, Boris Ostrovsky,
	Christophe Leroy, Cornelia Huck, Dan Carpenter, Dan Williams,
	Dave Hansen, Fabio Estevam, Greg Kroah-Hartman, Haiyang Zhang,
	H. Peter Anvin, Ingo Molnar, Isaac J. Manjarres, Jeremy Sowden,
	Jim Mattson, Joerg Roedel, Johannes Weiner, Juergen Gross,
	KarimAllah Ahmed, Kate Stewart, Kees Cook, K. Y. Srinivasan,
	Madhumitha Prabakaran, Matt Sickler, Mel Gorman,
	Michael Ellerman, Michal Hocko, Mike Rapoport, Mike Rapoport,
	Nicholas Piggin, Nishka Dasgupta, Oscar Salvador, Paolo Bonzini,
	Paul Mackerras, Paul Mackerras, Pavel Tatashin, Pavel Tatashin,
	Peter Zijlstra, Qian Cai, Radim Krčmář,
	Rob Springer, Sasha Levin, Sean Christopherson,
	Simon Sandström, Stefano Stabellini, Stephen Hemminger,
	Thomas Gleixner, Todd Poynor, Vandana BN, Vitaly Kuznetsov,
	Vlastimil Babka, Wanpeng Li, YueHaibing

Our onlining/offlining code is unnecessarily complicated. Only memory
blocks added during boot can have holes. Hotplugged memory never has
holes; that memory is already online.

When we stop allowing memory blocks with holes to be offlined, we
implicitly also stop allowing memory blocks with holes to be onlined.

This allows us to simplify the code. For example, we no longer have to
worry about marking pages that fall into memory holes PG_reserved when
onlining memory, and we can stop setting pages PG_reserved altogether.

Offlining memory blocks added during boot is usually not guranteed to work
either way. So stopping to do that (if anybody really used and tested
this over the years) should not really hurt. For the use case of
offlining memory to unplug DIMMs, we should see no change. (holes on
DIMMs would be weird)

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory_hotplug.c | 26 ++++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 561371ead39a..7210f4375279 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1447,10 +1447,19 @@ static void node_states_clear_node(int node, struct memory_notify *arg)
 		node_clear_state(node, N_MEMORY);
 }
 
+static int count_system_ram_pages_cb(unsigned long start_pfn,
+				     unsigned long nr_pages, void *data)
+{
+	unsigned long *nr_system_ram_pages = data;
+
+	*nr_system_ram_pages += nr_pages;
+	return 0;
+}
+
 static int __ref __offline_pages(unsigned long start_pfn,
 		  unsigned long end_pfn)
 {
-	unsigned long pfn, nr_pages;
+	unsigned long pfn, nr_pages = 0;
 	unsigned long offlined_pages = 0;
 	int ret, node, nr_isolate_pageblock;
 	unsigned long flags;
@@ -1461,6 +1470,20 @@ static int __ref __offline_pages(unsigned long start_pfn,
 
 	mem_hotplug_begin();
 
 +	/*
 +	 * We don't allow offlining memory blocks that contain holes and
 +	 * consequently don't allow onlining memory blocks that contain
 +	 * holes. This simplifies the code quite a lot and we don't have
 +	 * to mess with PG_reserved pages for memory holes.
 +	 */
+	walk_system_ram_range(start_pfn, end_pfn - start_pfn, &nr_pages,
+			      count_system_ram_pages_cb);
+	if (nr_pages != end_pfn - start_pfn) {
+		ret = -EINVAL;
+		reason = "memory holes";
+		goto failed_removal;
+	}
+
 	/* This makes hotplug much easier...and readable.
 	   we assume this for now. .*/
 	if (!test_pages_in_a_zone(start_pfn, end_pfn, &valid_start,
@@ -1472,7 +1495,6 @@ static int __ref __offline_pages(unsigned long start_pfn,
 
 	zone = page_zone(pfn_to_page(valid_start));
 	node = zone_to_nid(zone);
-	nr_pages = end_pfn - start_pfn;
 
 	/* set above range as isolated */
 	ret = start_isolate_page_range(start_pfn, end_pfn,
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 112+ messages in thread
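The hole check added to __offline_pages() boils down to: walk the System
RAM resources intersecting the block, sum the backed pages, and compare
the sum against the pfn span. The logic can be sketched in userspace as
follows (struct ram_range, walk_ram() and count_pages_cb() are invented
stand-ins for the kernel's resource tree and walk_system_ram_range(),
not real kernel APIs):

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's System RAM resource tree: each
 * entry is a pfn range backed by System RAM; gaps between entries are
 * memory holes. */
struct ram_range {
	unsigned long start_pfn;
	unsigned long nr_pages;
};

/* Same shape as count_system_ram_pages_cb() in the patch: accumulate
 * the number of System RAM pages seen during the walk. */
static int count_pages_cb(unsigned long start_pfn, unsigned long nr_pages,
			  void *data)
{
	unsigned long *nr_system_ram_pages = data;

	*nr_system_ram_pages += nr_pages;
	return 0;
}

/* Simplified walk_system_ram_range(): invoke the callback for the
 * intersection of [start_pfn, start_pfn + nr_pages) with each range. */
static void walk_ram(const struct ram_range *map, size_t n,
		     unsigned long start_pfn, unsigned long nr_pages,
		     void *data,
		     int (*cb)(unsigned long, unsigned long, void *))
{
	unsigned long end_pfn = start_pfn + nr_pages;
	size_t i;

	for (i = 0; i < n; i++) {
		unsigned long s = map[i].start_pfn;
		unsigned long e = s + map[i].nr_pages;

		if (s < start_pfn)
			s = start_pfn;
		if (e > end_pfn)
			e = end_pfn;
		if (s < e)
			cb(s, e - s, data);
	}
}
```

If the accumulated count is smaller than end_pfn - start_pfn, part of the
block falls into a hole and __offline_pages() bails out with -EINVAL
before any isolation work is done.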

* [PATCH RFC v1 02/12] mm/usercopy.c: Prepare check_page_span() for PG_reserved changes
  2019-10-22 17:12 ` David Hildenbrand
@ 2019-10-22 17:12   ` David Hildenbrand
  -1 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2019-10-22 17:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Michal Hocko, Andrew Morton,
	kvm-ppc, linuxppc-dev, kvm, linux-hyperv, devel, xen-devel, x86,
	Alexander Duyck, Alexander Duyck, Alex Williamson,
	Allison Randal, Andy Lutomirski, Aneesh Kumar K.V,
	Anshuman Khandual, Anthony Yznaga, Ben Chan,
	Benjamin Herrenschmidt, Borislav Petkov, Boris Ostrovsky,
	Christophe Leroy, Cornelia Huck, Dan Carpenter, Dan Williams,
	Dave Hansen, Fabio Estevam, Greg Kroah-Hartman, Haiyang Zhang,
	H. Peter Anvin, Ingo Molnar, Isaac J. Manjarres, Jeremy Sowden,
	Jim Mattson, Joerg Roedel, Johannes Weiner, Juergen Gross,
	KarimAllah Ahmed, Kate Stewart, Kees Cook, K. Y. Srinivasan,
	Madhumitha Prabakaran, Matt Sickler, Mel Gorman,
	Michael Ellerman, Michal Hocko, Mike Rapoport, Mike Rapoport,
	Nicholas Piggin, Nishka Dasgupta, Oscar Salvador, Paolo Bonzini,
	Paul Mackerras, Paul Mackerras, Pavel Tatashin, Pavel Tatashin,
	Peter Zijlstra, Qian Cai, Radim Krčmář,
	Rob Springer, Sasha Levin, Sean Christopherson,
	Simon Sandström, Stefano Stabellini, Stephen Hemminger,
	Thomas Gleixner, Todd Poynor, Vandana BN, Vitaly Kuznetsov,
	Vlastimil Babka, Wanpeng Li, YueHaibing

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

Let's make sure that the logic in the function won't change. Once we no
longer set these pages to reserved, we can rework this function to
perform separate checks for ZONE_DEVICE (split from PG_reserved checks).

Cc: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Allison Randal <allison@lohutok.net>
Cc: "Isaac J. Manjarres" <isaacm@codeaurora.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/usercopy.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/usercopy.c b/mm/usercopy.c
index 660717a1ea5c..a3ac4be35cde 100644
--- a/mm/usercopy.c
+++ b/mm/usercopy.c
@@ -203,14 +203,15 @@ static inline void check_page_span(const void *ptr, unsigned long n,
 	 * device memory), or CMA. Otherwise, reject since the object spans
 	 * several independently allocated pages.
 	 */
-	is_reserved = PageReserved(page);
+	is_reserved = PageReserved(page) || is_zone_device_page(page);
 	is_cma = is_migrate_cma_page(page);
 	if (!is_reserved && !is_cma)
 		usercopy_abort("spans multiple pages", NULL, to_user, 0, n);
 
 	for (ptr += PAGE_SIZE; ptr <= end; ptr += PAGE_SIZE) {
 		page = virt_to_head_page(ptr);
-		if (is_reserved && !PageReserved(page))
+		if (is_reserved && !(PageReserved(page) ||
+				     is_zone_device_page(page)))
 			usercopy_abort("spans Reserved and non-Reserved pages",
 				       NULL, to_user, 0, n);
 		if (is_cma && !is_migrate_cma_page(page))
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 112+ messages in thread
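The effect on check_page_span() can be seen in isolation: whatever
"reserved" classification the first page gets must hold for every page
the object spans, and ZONE_DEVICE now counts as reserved for this
purpose even once such pages stop being marked PG_reserved. A
simplified sketch (struct fake_page and page_span_ok() are invented
stand-ins; the CMA leg of the real function is omitted):

```c
#include <stdbool.h>
#include <stddef.h>

/* Per-page state for this sketch; in the kernel these would come from
 * PageReserved(page) and is_zone_device_page(page). */
struct fake_page {
	bool reserved;
	bool zone_device;
};

/* Mirrors the reworked span check: a multi-page object is acceptable
 * only if every page it spans is reserved-or-ZONE_DEVICE, matching the
 * classification of its first page. */
static bool page_span_ok(const struct fake_page *pages, size_t nr)
{
	bool is_reserved = pages[0].reserved || pages[0].zone_device;
	size_t i;

	if (!is_reserved)
		return false;	/* object spans multiple ordinary pages */

	for (i = 1; i < nr; i++)
		if (!(pages[i].reserved || pages[i].zone_device))
			return false;	/* mixes reserved and non-reserved */
	return true;
}
```

With this preparation, dropping PG_reserved from ZONE_DEVICE pages later
in the series does not change which copies check_page_span() rejects.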

* [PATCH RFC v1 03/12] KVM: x86/mmu: Prepare kvm_is_mmio_pfn() for PG_reserved changes
  2019-10-22 17:12 ` David Hildenbrand
@ 2019-10-22 17:12   ` David Hildenbrand
  -1 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2019-10-22 17:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Michal Hocko, Andrew Morton,
	kvm-ppc, linuxppc-dev, kvm, linux-hyperv, devel, xen-devel, x86,
	Alexander Duyck, Alexander Duyck, Alex Williamson,
	Allison Randal, Andy Lutomirski, Aneesh Kumar K.V,
	Anshuman Khandual, Anthony Yznaga, Ben Chan,
	Benjamin Herrenschmidt, Borislav Petkov, Boris Ostrovsky,
	Christophe Leroy, Cornelia Huck, Dan Carpenter, Dan Williams,
	Dave Hansen, Fabio Estevam, Greg Kroah-Hartman, Haiyang Zhang,
	H. Peter Anvin, Ingo Molnar, Isaac J. Manjarres, Jeremy Sowden,
	Jim Mattson, Joerg Roedel, Johannes Weiner, Juergen Gross,
	KarimAllah Ahmed, Kate Stewart, Kees Cook, K. Y. Srinivasan,
	Madhumitha Prabakaran, Matt Sickler, Mel Gorman,
	Michael Ellerman, Michal Hocko, Mike Rapoport, Mike Rapoport,
	Nicholas Piggin, Nishka Dasgupta, Oscar Salvador, Paolo Bonzini,
	Paul Mackerras, Paul Mackerras, Pavel Tatashin, Pavel Tatashin,
	Peter Zijlstra, Qian Cai, Radim Krčmář,
	Rob Springer, Sasha Levin, Sean Christopherson,
	Simon Sandström, Stefano Stabellini, Stephen Hemminger,
	Thomas Gleixner, Todd Poynor, Vandana BN, Vitaly Kuznetsov,
	Vlastimil Babka, Wanpeng Li, YueHaibing

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

KVM has this weird use case that you can map anything from /dev/mem
into the guest. pfn_valid() is not a reliable check whether the memmap
was initialized and can be touched. pfn_to_online_page() makes sure
that we have an initialized memmap - however, there is no reliable and
fast check to detect memmaps that were initialized and are ZONE_DEVICE.

Let's rewrite kvm_is_mmio_pfn() so we really only touch initialized
memmaps that are guaranteed to not contain garbage. Make sure that
RAM without a memmap is still not detected as MMIO and that ZONE_DEVICE
that is not UC/UC-/WC is not detected as MMIO.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: KarimAllah Ahmed <karahmed@amazon.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/x86/kvm/mmu.c | 30 ++++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 24c23c66b226..795869ffd4bb 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2962,20 +2962,26 @@ static bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
 
 static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
 {
+	struct page *page = pfn_to_online_page(pfn);
+
+	/*
+	 * Online pages consist of pages managed by the buddy. Especially,
+	 * ZONE_DEVICE pages are never online. Online pages that are reserved
+	 * indicate the zero page and MMIO pages.
+	 */
+	if (page)
+		return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn));
+
+	/*
+	 * Anything with a valid memmap could be ZONE_DEVICE - or the
+	 * memmap could be uninitialized. Treat only UC/UC-/WC pages as MMIO.
+	 */
 	if (pfn_valid(pfn))
-		return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn)) &&
-			/*
-			 * Some reserved pages, such as those from NVDIMM
-			 * DAX devices, are not for MMIO, and can be mapped
-			 * with cached memory type for better performance.
-			 * However, the above check misconceives those pages
-			 * as MMIO, and results in KVM mapping them with UC
-			 * memory type, which would hurt the performance.
-			 * Therefore, we check the host memory type in addition
-			 * and only treat UC/UC-/WC pages as MMIO.
-			 */
-			(!pat_enabled() || pat_pfn_immune_to_uc_mtrr(pfn));
+		return !pat_enabled() || pat_pfn_immune_to_uc_mtrr(pfn);
 
+	/*
+	 * Any RAM that has no memmap (e.g., mapped via /dev/mem) is not MMIO.
+	 */
 	return !e820__mapped_raw_any(pfn_to_hpa(pfn),
 				     pfn_to_hpa(pfn + 1) - 1,
 				     E820_TYPE_RAM);
-- 
2.21.0
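The decision order of the patched kvm_is_mmio_pfn() can be modeled as a small
standalone sketch. This is an illustration only, not kernel code: struct
pfn_state and its fields are made-up stand-ins that flatten the results of
pfn_to_online_page(), pfn_valid(), is_zero_pfn(), PageReserved(),
pat_enabled(), pat_pfn_immune_to_uc_mtrr() and e820__mapped_raw_any()
into booleans.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, flattened view of a pfn's state; illustration only,
 * not the kernel's representation. */
struct pfn_state {
	bool online;       /* pfn_to_online_page() != NULL: buddy-managed */
	bool valid;        /* pfn_valid(): a memmap exists (maybe ZONE_DEVICE) */
	bool zero_page;    /* is_zero_pfn() */
	bool reserved;     /* PageReserved() */
	bool pat_enabled;  /* pat_enabled() */
	bool uc_or_wc;     /* pat_pfn_immune_to_uc_mtrr(): mapped UC/UC-/WC */
	bool e820_ram;     /* e820__mapped_raw_any(..., E820_TYPE_RAM) */
};

/* Mirrors the decision order of the patched kvm_is_mmio_pfn(). */
static bool is_mmio(const struct pfn_state *p)
{
	/* 1. Online pages: only reserved, non-zero pages indicate MMIO. */
	if (p->online)
		return !p->zero_page && p->reserved;

	/* 2. Valid but not online: could be ZONE_DEVICE or an uninitialized
	 *    memmap; trust the host memory type instead of the memmap. */
	if (p->valid)
		return !p->pat_enabled || p->uc_or_wc;

	/* 3. No memmap at all: RAM per e820 (e.g. /dev/mem) is not MMIO. */
	return !p->e820_ram;
}
```

Note how a cached (non-UC/UC-/WC) ZONE_DEVICE pfn falls into case 2 and is
classified as not-MMIO without its memmap ever being dereferenced.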



^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH RFC v1 04/12] KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes
  2019-10-22 17:12 ` David Hildenbrand
  (?)
  (?)
@ 2019-10-22 17:12   ` David Hildenbrand
  -1 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2019-10-22 17:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Michal Hocko, Andrew Morton,
	kvm-ppc, linuxppc-dev, kvm, linux-hyperv, devel, xen-devel, x86,
	Alexander Duyck, Alexander Duyck, Alex Williamson,
	Allison Randal, Andy Lutomirski, Aneesh Kumar K.V,
	Anshuman Khandual, Anthony Yznaga, Ben Chan,
	Benjamin Herrenschmidt, Borislav Petkov, Boris Ostrovsky,
	Christophe Leroy, Cornelia Huck, Dan Carpenter, Dan Williams,
	Dave Hansen, Fabio Estevam, Greg Kroah-Hartman, Haiyang Zhang,
	H. Peter Anvin, Ingo Molnar, Isaac J. Manjarres, Jeremy Sowden,
	Jim Mattson, Joerg Roedel, Johannes Weiner, Juergen Gross,
	KarimAllah Ahmed, Kate Stewart, Kees Cook, K. Y. Srinivasan,
	Madhumitha Prabakaran, Matt Sickler, Mel Gorman,
	Michael Ellerman, Michal Hocko, Mike Rapoport, Mike Rapoport,
	Nicholas Piggin, Nishka Dasgupta, Oscar Salvador, Paolo Bonzini,
	Paul Mackerras, Paul Mackerras, Pavel Tatashin, Pavel Tatashin,
	Peter Zijlstra, Qian Cai, Radim Krčmář,
	Rob Springer, Sasha Levin, Sean Christopherson,
	Simon Sandström, Stefano Stabellini, Stephen Hemminger,
	Thomas Gleixner, Todd Poynor, Vandana BN, Vitaly Kuznetsov,
	Vlastimil Babka, Wanpeng Li, YueHaibing

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

KVM has this weird use case that you can map anything from /dev/mem
into the guest. pfn_valid() is not a reliable check for whether the memmap
was initialized and can be touched. pfn_to_online_page() makes sure
that we have an initialized memmap. Note that ZONE_DEVICE memory is
never online (IOW, never managed by the buddy).

Switching to pfn_to_online_page() keeps the existing behavior for
PFNs without a memmap and for ZONE_DEVICE memory: they are treated as
reserved and the page is not touched (e.g., to set it dirty or accessed).

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 virt/kvm/kvm_main.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 66a977472a1c..b233d4129014 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -151,9 +151,15 @@ __weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
 
 bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
 {
-	if (pfn_valid(pfn))
-		return PageReserved(pfn_to_page(pfn));
+	struct page *page = pfn_to_online_page(pfn);
 
+	/*
+	 * We treat any pages that are not online (not managed by the buddy)
+	 * as reserved - this includes ZONE_DEVICE pages and pages without
+	 * a memmap (e.g., mapped via /dev/mem).
+	 */
+	if (page)
+		return PageReserved(page);
 	return true;
 }
 
-- 
2.21.0
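The diff above reduces to one rule: trust the memmap only for online pages,
and conservatively call everything else reserved. A userspace sketch of that
rule (struct fake_page and to_online_page() are made-up stand-ins for the
kernel's struct page and pfn_to_online_page(), not real APIs):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; illustration only. */
struct fake_page {
	bool reserved;   /* PageReserved() */
};

/* Models pfn_to_online_page(): NULL for ZONE_DEVICE pages and for
 * pfns without an initialized memmap (e.g. raw /dev/mem mappings). */
static struct fake_page *to_online_page(struct fake_page *page, bool online)
{
	return online ? page : NULL;
}

/* Mirrors the patched kvm_is_reserved_pfn(): anything that is not
 * online is conservatively treated as reserved and never touched. */
static bool is_reserved_pfn(struct fake_page *page, bool online)
{
	struct fake_page *p = to_online_page(page, online);

	if (p)
		return p->reserved;
	return true;
}
```

Only in the online case is the page structure dereferenced at all, which is
exactly why the check stays safe for memmaps that may contain garbage.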



^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH RFC v1 05/12] vfio/type1: Prepare is_invalid_reserved_pfn() for PG_reserved changes
  2019-10-22 17:12 ` David Hildenbrand
  (?)
  (?)
@ 2019-10-22 17:12   ` David Hildenbrand
  -1 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2019-10-22 17:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Michal Hocko, Andrew Morton,
	kvm-ppc, linuxppc-dev, kvm, linux-hyperv, devel, xen-devel, x86,
	Alexander Duyck, Alexander Duyck, Alex Williamson,
	Allison Randal, Andy Lutomirski, Aneesh Kumar K.V,
	Anshuman Khandual, Anthony Yznaga, Ben Chan,
	Benjamin Herrenschmidt, Borislav Petkov, Boris Ostrovsky,
	Christophe Leroy, Cornelia Huck, Dan Carpenter, Dan Williams,
	Dave Hansen, Fabio Estevam, Greg Kroah-Hartman, Haiyang Zhang,
	H. Peter Anvin, Ingo Molnar, Isaac J. Manjarres, Jeremy Sowden,
	Jim Mattson, Joerg Roedel, Johannes Weiner, Juergen Gross,
	KarimAllah Ahmed, Kate Stewart, Kees Cook, K. Y. Srinivasan,
	Madhumitha Prabakaran, Matt Sickler, Mel Gorman,
	Michael Ellerman, Michal Hocko, Mike Rapoport, Mike Rapoport,
	Nicholas Piggin, Nishka Dasgupta, Oscar Salvador, Paolo Bonzini,
	Paul Mackerras, Paul Mackerras, Pavel Tatashin, Pavel Tatashin,
	Peter Zijlstra, Qian Cai, Radim Krčmář,
	Rob Springer, Sasha Levin, Sean Christopherson,
	Simon Sandström, Stefano Stabellini, Stephen Hemminger,
	Thomas Gleixner, Todd Poynor, Vandana BN, Vitaly Kuznetsov,
	Vlastimil Babka, Wanpeng Li, YueHaibing

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

KVM has this weird use case that you can map anything from /dev/mem
into the guest. pfn_valid() is not a reliable check for whether the memmap
was initialized and can be touched. pfn_to_online_page() makes sure
that we have an initialized memmap. Note that ZONE_DEVICE memory is
never online (IOW, never managed by the buddy).

Switching to pfn_to_online_page() keeps the existing behavior for
PFNs without a memmap and for ZONE_DEVICE memory: they are treated as
reserved and the page is not touched (e.g., to set it dirty or accessed).

Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/vfio/vfio_iommu_type1.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 2ada8e6cdb88..f8ce8c408ba8 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -299,9 +299,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
  */
 static bool is_invalid_reserved_pfn(unsigned long pfn)
 {
-	if (pfn_valid(pfn))
-		return PageReserved(pfn_to_page(pfn));
+	struct page *page = pfn_to_online_page(pfn);
 
+	/*
+	 * We treat any pages that are not online (not managed by the buddy)
+	 * as reserved - this includes ZONE_DEVICE pages and pages without
+	 * a memmap (e.g., mapped via /dev/mem).
+	 */
+	if (page)
+		return PageReserved(page);
 	return true;
 }
 
-- 
2.21.0
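The motivation for all of these conversions can be condensed into one
property: once ZONE_DEVICE pages stop being marked PG_reserved (the goal of
this series), a pfn_valid()-based check would silently change its verdict,
while a pfn_to_online_page()-based check keeps today's behavior. A minimal
model of that difference (hypothetical helpers, not kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a ZONE_DEVICE pfn; illustration only. */
struct zone_device_pfn {
	bool valid;      /* pfn_valid(): true, a memmap exists */
	bool online;     /* pfn_to_online_page(): false, never buddy-managed */
	bool reserved;   /* PageReserved(): true today, false after the series */
};

/* Old check: trusts pfn_valid() and reads PageReserved(). */
static bool old_is_reserved(const struct zone_device_pfn *p)
{
	if (p->valid)
		return p->reserved;
	return true;
}

/* New check: only trusts online pages; everything else stays reserved. */
static bool new_is_reserved(const struct zone_device_pfn *p)
{
	if (p->online)
		return p->reserved;
	return true;
}
```

With .reserved flipped to false (the end state of the series),
old_is_reserved() would start returning false for ZONE_DEVICE, whereas
new_is_reserved() still returns true, i.e. the existing treatment is
preserved.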



^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH RFC v1 05/12] vfio/type1: Prepare is_invalid_reserved_pfn() for PG_reserved changes
@ 2019-10-22 17:12   ` David Hildenbrand
  0 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2019-10-22 17:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kate Stewart, Sasha Levin, linux-hyperv, Michal Hocko,
	Radim Krčmář,
	kvm, David Hildenbrand, KarimAllah Ahmed, Benjamin Herrenschmidt,
	Dave Hansen, Alexander Duyck, Michal Hocko, Paul Mackerras,
	linux-mm, Pavel Tatashin, Paul Mackerras, Michael Ellerman,
	H. Peter Anvin, Wanpeng Li, Alexander Duyck, Kees Cook, devel,
	Stefano Stabellini, Stephen Hemminger, Aneesh Kumar K.V,
	Joerg Roedel, x86, YueHaibing, Mike Rapoport,
	Madhumitha Prabakaran, Peter Zijlstra, Ingo Molnar,
	Vlastimil Babka, Nishka Dasgupta, Anthony Yznaga, Oscar Salvador,
	Dan Carpenter, Isaac J. Manjarres, Juergen Gross,
	Anshuman Khandual, Haiyang Zhang, Simon Sandström,
	Dan Williams, kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Todd Poynor, Vitaly Kuznetsov, Allison Randal,
	Jim Mattson, Christophe Leroy, Vandana BN, Greg Kroah-Hartman,
	Cornelia Huck, Pavel Tatashin, Mel Gorman, Sean Christopherson,
	Rob Springer, Thomas Gleixner, Johannes Weiner, Paolo Bonzini,
	Andrew Morton, linuxppc-dev

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

KVM has the unusual use case that anything from /dev/mem can be mapped
into the guest. pfn_valid() is not a reliable check for whether the
memmap was initialized and can be touched. pfn_to_online_page() makes
sure that we have an initialized memmap. Note that ZONE_DEVICE memory
is never online (IOW, never managed by the buddy).

Switching to pfn_to_online_page() keeps the existing behavior for
PFNs without a memmap and for ZONE_DEVICE memory. They are treated as
reserved and the page is not touched (e.g., to set it dirty or accessed).

Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/vfio/vfio_iommu_type1.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 2ada8e6cdb88..f8ce8c408ba8 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -299,9 +299,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
  */
 static bool is_invalid_reserved_pfn(unsigned long pfn)
 {
-	if (pfn_valid(pfn))
-		return PageReserved(pfn_to_page(pfn));
+	struct page *page = pfn_to_online_page(pfn);
 
+	/*
+	 * We treat any pages that are not online (not managed by the buddy)
+	 * as reserved - this includes ZONE_DEVICE pages and pages without
+	 * a memmap (e.g., mapped via /dev/mem).
+	 */
+	if (page)
+		return PageReserved(page);
 	return true;
 }
 
-- 
2.21.0

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH RFC v1 05/12] vfio/type1: Prepare is_invalid_reserved_pfn() for PG_reserved changes
@ 2019-10-22 17:12   ` David Hildenbrand
  0 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2019-10-22 17:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kate Stewart, Sasha Levin, linux-hyperv, Michal Hocko,
	Radim Krčmář,
	kvm, David Hildenbrand, KarimAllah Ahmed, Dave Hansen,
	Alexander Duyck, Michal Hocko, linux-mm, Pavel Tatashin,
	Paul Mackerras, H. Peter Anvin, Wanpeng Li, Alexander Duyck,
	K. Y. Srinivasan, Fabio Estevam, Ben Chan, Kees Cook, devel,
	Stefano Stabellini, Stephen Hemminger, Aneesh Kumar K.V,
	Joerg Roedel, x86, YueHaibing, Mike Rapoport,
	Madhumitha Prabakaran, Peter Zijlstra, Ingo Molnar,
	Vlastimil Babka, Nishka Dasgupta, Anthony Yznaga, Oscar Salvador,
	Dan Carpenter, Isaac J. Manjarres, Matt Sickler, Juergen Gross,
	Anshuman Khandual, Haiyang Zhang, Simon Sandström,
	Dan Williams, kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Todd Poynor, Vitaly Kuznetsov, Allison Randal,
	Jim Mattson, Vandana BN, Jeremy Sowden, Greg Kroah-Hartman,
	Cornelia Huck, Pavel Tatashin, Mel Gorman, Sean Christopherson,
	Rob Springer, Thomas Gleixner, Johannes Weiner, Paolo Bonzini,
	Andrew Morton, linuxppc-dev

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

KVM has the unusual use case that anything from /dev/mem can be mapped
into the guest. pfn_valid() is not a reliable check for whether the
memmap was initialized and can be touched. pfn_to_online_page() makes
sure that we have an initialized memmap. Note that ZONE_DEVICE memory
is never online (IOW, never managed by the buddy).

Switching to pfn_to_online_page() keeps the existing behavior for
PFNs without a memmap and for ZONE_DEVICE memory. They are treated as
reserved and the page is not touched (e.g., to set it dirty or accessed).

Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/vfio/vfio_iommu_type1.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 2ada8e6cdb88..f8ce8c408ba8 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -299,9 +299,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
  */
 static bool is_invalid_reserved_pfn(unsigned long pfn)
 {
-	if (pfn_valid(pfn))
-		return PageReserved(pfn_to_page(pfn));
+	struct page *page = pfn_to_online_page(pfn);
 
+	/*
+	 * We treat any pages that are not online (not managed by the buddy)
+	 * as reserved - this includes ZONE_DEVICE pages and pages without
+	 * a memmap (e.g., mapped via /dev/mem).
+	 */
+	if (page)
+		return PageReserved(page);
 	return true;
 }
 
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [Xen-devel] [PATCH RFC v1 05/12] vfio/type1: Prepare is_invalid_reserved_pfn() for PG_reserved changes
@ 2019-10-22 17:12   ` David Hildenbrand
  0 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2019-10-22 17:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kate Stewart, Sasha Levin, linux-hyperv, Michal Hocko,
	Radim Krčmář,
	kvm, David Hildenbrand, KarimAllah Ahmed, Benjamin Herrenschmidt,
	Dave Hansen, Alexander Duyck, Michal Hocko, Paul Mackerras,
	linux-mm, Pavel Tatashin, Paul Mackerras, Michael Ellerman,
	H. Peter Anvin, Wanpeng Li, Alexander Duyck, K. Y. Srinivasan,
	Fabio Estevam, Ben Chan, Kees Cook, devel, Stefano Stabellini,
	Stephen Hemminger, Aneesh Kumar K.V, Joerg Roedel, x86,
	YueHaibing, Mike Rapoport, Madhumitha Prabakaran, Peter Zijlstra,
	Ingo Molnar, Vlastimil Babka, Nishka Dasgupta, Anthony Yznaga,
	Oscar Salvador, Dan Carpenter, Isaac J. Manjarres, Matt Sickler,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang,
	Simon Sandström, Dan Williams, kvm-ppc, Qian Cai,
	Alex Williamson, Mike Rapoport, Borislav Petkov, Nicholas Piggin,
	Andy Lutomirski, xen-devel, Boris Ostrovsky, Todd Poynor,
	Vitaly Kuznetsov, Allison Randal, Jim Mattson, Christophe Leroy,
	Vandana BN, Jeremy Sowden, Greg Kroah-Hartman, Cornelia Huck,
	Pavel Tatashin, Mel Gorman, Sean Christopherson, Rob Springer,
	Thomas Gleixner, Johannes Weiner, Paolo Bonzini, Andrew Morton,
	linuxppc-dev

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

KVM has the unusual use case that anything from /dev/mem can be mapped
into the guest. pfn_valid() is not a reliable check for whether the
memmap was initialized and can be touched. pfn_to_online_page() makes
sure that we have an initialized memmap. Note that ZONE_DEVICE memory
is never online (IOW, never managed by the buddy).

Switching to pfn_to_online_page() keeps the existing behavior for
PFNs without a memmap and for ZONE_DEVICE memory. They are treated as
reserved and the page is not touched (e.g., to set it dirty or accessed).

Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/vfio/vfio_iommu_type1.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 2ada8e6cdb88..f8ce8c408ba8 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -299,9 +299,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
  */
 static bool is_invalid_reserved_pfn(unsigned long pfn)
 {
-	if (pfn_valid(pfn))
-		return PageReserved(pfn_to_page(pfn));
+	struct page *page = pfn_to_online_page(pfn);
 
+	/*
+	 * We treat any pages that are not online (not managed by the buddy)
+	 * as reserved - this includes ZONE_DEVICE pages and pages without
+	 * a memmap (e.g., mapped via /dev/mem).
+	 */
+	if (page)
+		return PageReserved(page);
 	return true;
 }
 
-- 
2.21.0


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH RFC v1 06/12] staging/gasket: Prepare gasket_release_page() for PG_reserved changes
  2019-10-22 17:12 ` David Hildenbrand
@ 2019-10-22 17:12   ` David Hildenbrand
  -1 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2019-10-22 17:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Michal Hocko, Andrew Morton,
	kvm-ppc, linuxppc-dev, kvm, linux-hyperv, devel, xen-devel, x86,
	Alexander Duyck, Alexander Duyck, Alex Williamson,
	Allison Randal, Andy Lutomirski, Aneesh Kumar K.V,
	Anshuman Khandual, Anthony Yznaga, Ben Chan,
	Benjamin Herrenschmidt, Borislav Petkov, Boris Ostrovsky,
	Christophe Leroy, Cornelia Huck, Dan Carpenter, Dan Williams,
	Dave Hansen, Fabio Estevam, Greg Kroah-Hartman, Haiyang Zhang,
	H. Peter Anvin, Ingo Molnar, Isaac J. Manjarres, Jeremy Sowden,
	Jim Mattson, Joerg Roedel, Johannes Weiner, Juergen Gross,
	KarimAllah Ahmed, Kate Stewart, Kees Cook, K. Y. Srinivasan,
	Madhumitha Prabakaran, Matt Sickler, Mel Gorman,
	Michael Ellerman, Michal Hocko, Mike Rapoport, Mike Rapoport,
	Nicholas Piggin, Nishka Dasgupta, Oscar Salvador, Paolo Bonzini,
	Paul Mackerras, Paul Mackerras, Pavel Tatashin, Pavel Tatashin,
	Peter Zijlstra, Qian Cai, Radim Krčmář,
	Rob Springer, Sasha Levin, Sean Christopherson,
	Simon Sandström, Stefano Stabellini, Stephen Hemminger,
	Thomas Gleixner, Todd Poynor, Vandana BN, Vitaly Kuznetsov,
	Vlastimil Babka, Wanpeng Li, YueHaibing

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

The pages are obtained via get_user_pages_fast(). I assume these
could be ZONE_DEVICE pages, so let's exclude them explicitly as well.

Cc: Rob Springer <rspringer@google.com>
Cc: Todd Poynor <toddpoynor@google.com>
Cc: Ben Chan <benchan@chromium.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/staging/gasket/gasket_page_table.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/staging/gasket/gasket_page_table.c b/drivers/staging/gasket/gasket_page_table.c
index f6d715787da8..d43fed58bf65 100644
--- a/drivers/staging/gasket/gasket_page_table.c
+++ b/drivers/staging/gasket/gasket_page_table.c
@@ -447,7 +447,7 @@ static bool gasket_release_page(struct page *page)
 	if (!page)
 		return false;
 
-	if (!PageReserved(page))
+	if (!PageReserved(page) && !is_zone_device_page(page))
 		SetPageDirty(page);
 	put_page(page);
 
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH RFC v1 06/12] staging/gasket: Prepare gasket_release_page() for PG_reserved changes
@ 2019-10-22 17:12   ` David Hildenbrand
  0 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2019-10-22 17:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kate Stewart, Sasha Levin, linux-hyperv, Michal Hocko,
	Radim Krčmář,
	kvm, David Hildenbrand, KarimAllah Ahmed, Benjamin Herrenschmidt,
	Dave Hansen, Alexander Duyck, Michal Hocko, Paul Mackerras,
	linux-mm, Pavel Tatashin, Paul Mackerras, Michael Ellerman,
	H. Peter Anvin, Wanpeng Li, Alexander Duyck, Kees Cook, devel,
	Stefano Stabellini, Stephen Hemminger, Aneesh Kumar K.V,
	Joerg Roedel, x86, YueHaibing, Mike Rapoport,
	Madhumitha Prabakaran, Peter Zijlstra, Ingo Molnar,
	Vlastimil Babka, Nishka Dasgupta, Anthony Yznaga, Oscar Salvador,
	Dan Carpenter, Isaac J. Manjarres, Juergen Gross,
	Anshuman Khandual, Haiyang Zhang, Simon Sandström,
	Dan Williams, kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Todd Poynor, Vitaly Kuznetsov, Allison Randal,
	Jim Mattson, Christophe Leroy, Vandana BN, Greg Kroah-Hartman,
	Cornelia Huck, Pavel Tatashin, Mel Gorman, Sean Christopherson,
	Rob Springer, Thomas Gleixner, Johannes Weiner, Paolo Bonzini,
	Andrew Morton, linuxppc-dev

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

The pages are obtained via get_user_pages_fast(). I assume these
could be ZONE_DEVICE pages, so let's exclude them explicitly as well.

Cc: Rob Springer <rspringer@google.com>
Cc: Todd Poynor <toddpoynor@google.com>
Cc: Ben Chan <benchan@chromium.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/staging/gasket/gasket_page_table.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/staging/gasket/gasket_page_table.c b/drivers/staging/gasket/gasket_page_table.c
index f6d715787da8..d43fed58bf65 100644
--- a/drivers/staging/gasket/gasket_page_table.c
+++ b/drivers/staging/gasket/gasket_page_table.c
@@ -447,7 +447,7 @@ static bool gasket_release_page(struct page *page)
 	if (!page)
 		return false;
 
-	if (!PageReserved(page))
+	if (!PageReserved(page) && !is_zone_device_page(page))
 		SetPageDirty(page);
 	put_page(page);
 
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH RFC v1 06/12] staging/gasket: Prepare gasket_release_page() for PG_reserved changes
@ 2019-10-22 17:12   ` David Hildenbrand
  0 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2019-10-22 17:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kate Stewart, Sasha Levin, linux-hyperv, Michal Hocko,
	Radim Krčmář,
	kvm, David Hildenbrand, KarimAllah Ahmed, Dave Hansen,
	Alexander Duyck, Michal Hocko, linux-mm, Pavel Tatashin,
	Paul Mackerras, H. Peter Anvin, Wanpeng Li, Alexander Duyck,
	K. Y. Srinivasan, Fabio Estevam, Ben Chan, Kees Cook, devel,
	Stefano Stabellini, Stephen Hemminger, Aneesh Kumar K.V,
	Joerg Roedel, x86, YueHaibing, Mike Rapoport,
	Madhumitha Prabakaran, Peter Zijlstra, Ingo Molnar,
	Vlastimil Babka, Nishka Dasgupta, Anthony Yznaga, Oscar Salvador,
	Dan Carpenter, Isaac J. Manjarres, Matt Sickler, Juergen Gross,
	Anshuman Khandual, Haiyang Zhang, Simon Sandström,
	Dan Williams, kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Todd Poynor, Vitaly Kuznetsov, Allison Randal,
	Jim Mattson, Vandana BN, Jeremy Sowden, Greg Kroah-Hartman,
	Cornelia Huck, Pavel Tatashin, Mel Gorman, Sean Christopherson,
	Rob Springer, Thomas Gleixner, Johannes Weiner, Paolo Bonzini,
	Andrew Morton, linuxppc-dev

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

The pages are obtained via get_user_pages_fast(). I assume these
could be ZONE_DEVICE pages, so let's exclude them explicitly as well.

Cc: Rob Springer <rspringer@google.com>
Cc: Todd Poynor <toddpoynor@google.com>
Cc: Ben Chan <benchan@chromium.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/staging/gasket/gasket_page_table.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/staging/gasket/gasket_page_table.c b/drivers/staging/gasket/gasket_page_table.c
index f6d715787da8..d43fed58bf65 100644
--- a/drivers/staging/gasket/gasket_page_table.c
+++ b/drivers/staging/gasket/gasket_page_table.c
@@ -447,7 +447,7 @@ static bool gasket_release_page(struct page *page)
 	if (!page)
 		return false;
 
-	if (!PageReserved(page))
+	if (!PageReserved(page) && !is_zone_device_page(page))
 		SetPageDirty(page);
 	put_page(page);
 
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [Xen-devel] [PATCH RFC v1 06/12] staging/gasket: Prepare gasket_release_page() for PG_reserved changes
@ 2019-10-22 17:12   ` David Hildenbrand
  0 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2019-10-22 17:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kate Stewart, Sasha Levin, linux-hyperv, Michal Hocko,
	Radim Krčmář,
	kvm, David Hildenbrand, KarimAllah Ahmed, Benjamin Herrenschmidt,
	Dave Hansen, Alexander Duyck, Michal Hocko, Paul Mackerras,
	linux-mm, Pavel Tatashin, Paul Mackerras, Michael Ellerman,
	H. Peter Anvin, Wanpeng Li, Alexander Duyck, K. Y. Srinivasan,
	Fabio Estevam, Ben Chan, Kees Cook, devel, Stefano Stabellini,
	Stephen Hemminger, Aneesh Kumar K.V, Joerg Roedel, x86,
	YueHaibing, Mike Rapoport, Madhumitha Prabakaran, Peter Zijlstra,
	Ingo Molnar, Vlastimil Babka, Nishka Dasgupta, Anthony Yznaga,
	Oscar Salvador, Dan Carpenter, Isaac J. Manjarres, Matt Sickler,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang,
	Simon Sandström, Dan Williams, kvm-ppc, Qian Cai,
	Alex Williamson, Mike Rapoport, Borislav Petkov, Nicholas Piggin,
	Andy Lutomirski, xen-devel, Boris Ostrovsky, Todd Poynor,
	Vitaly Kuznetsov, Allison Randal, Jim Mattson, Christophe Leroy,
	Vandana BN, Jeremy Sowden, Greg Kroah-Hartman, Cornelia Huck,
	Pavel Tatashin, Mel Gorman, Sean Christopherson, Rob Springer,
	Thomas Gleixner, Johannes Weiner, Paolo Bonzini, Andrew Morton,
	linuxppc-dev

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

The pages are obtained via get_user_pages_fast(). I assume these
could be ZONE_DEVICE pages, so let's exclude them explicitly as well.

Cc: Rob Springer <rspringer@google.com>
Cc: Todd Poynor <toddpoynor@google.com>
Cc: Ben Chan <benchan@chromium.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/staging/gasket/gasket_page_table.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/staging/gasket/gasket_page_table.c b/drivers/staging/gasket/gasket_page_table.c
index f6d715787da8..d43fed58bf65 100644
--- a/drivers/staging/gasket/gasket_page_table.c
+++ b/drivers/staging/gasket/gasket_page_table.c
@@ -447,7 +447,7 @@ static bool gasket_release_page(struct page *page)
 	if (!page)
 		return false;
 
-	if (!PageReserved(page))
+	if (!PageReserved(page) && !is_zone_device_page(page))
 		SetPageDirty(page);
 	put_page(page);
 
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH RFC v1 07/12] staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved changes
  2019-10-22 17:12 ` David Hildenbrand
@ 2019-10-22 17:12   ` David Hildenbrand
  -1 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2019-10-22 17:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Michal Hocko, Andrew Morton,
	kvm-ppc, linuxppc-dev, kvm, linux-hyperv, devel, xen-devel, x86,
	Alexander Duyck, Alexander Duyck, Alex Williamson,
	Allison Randal, Andy Lutomirski, Aneesh Kumar K.V,
	Anshuman Khandual, Anthony Yznaga, Ben Chan,
	Benjamin Herrenschmidt, Borislav Petkov, Boris Ostrovsky,
	Christophe Leroy, Cornelia Huck, Dan Carpenter, Dan Williams,
	Dave Hansen, Fabio Estevam, Greg Kroah-Hartman, Haiyang Zhang,
	H. Peter Anvin, Ingo Molnar, Isaac J. Manjarres, Jeremy Sowden,
	Jim Mattson, Joerg Roedel, Johannes Weiner, Juergen Gross,
	KarimAllah Ahmed, Kate Stewart, Kees Cook, K. Y. Srinivasan,
	Madhumitha Prabakaran, Matt Sickler, Mel Gorman,
	Michael Ellerman, Michal Hocko, Mike Rapoport, Mike Rapoport,
	Nicholas Piggin, Nishka Dasgupta, Oscar Salvador, Paolo Bonzini,
	Paul Mackerras, Paul Mackerras, Pavel Tatashin, Pavel Tatashin,
	Peter Zijlstra, Qian Cai, Radim Krčmář,
	Rob Springer, Sasha Levin, Sean Christopherson,
	Simon Sandström, Stefano Stabellini, Stephen Hemminger,
	Thomas Gleixner, Todd Poynor, Vandana BN, Vitaly Kuznetsov,
	Vlastimil Babka, Wanpeng Li, YueHaibing

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

The pages are obtained via get_user_pages_fast(). I assume these
could be ZONE_DEVICE pages, so let's exclude them explicitly as well.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Vandana BN <bnvandana@gmail.com>
Cc: "Simon Sandström" <simon@nikanor.nu>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Nishka Dasgupta <nishkadg.linux@gmail.com>
Cc: Madhumitha Prabakaran <madhumithabiw@gmail.com>
Cc: Fabio Estevam <festevam@gmail.com>
Cc: Matt Sickler <Matt.Sickler@daktronics.com>
Cc: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/staging/kpc2000/kpc_dma/fileops.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/staging/kpc2000/kpc_dma/fileops.c b/drivers/staging/kpc2000/kpc_dma/fileops.c
index cb52bd9a6d2f..457adcc81fe6 100644
--- a/drivers/staging/kpc2000/kpc_dma/fileops.c
+++ b/drivers/staging/kpc2000/kpc_dma/fileops.c
@@ -212,7 +212,8 @@ void  transfer_complete_cb(struct aio_cb_data *acd, size_t xfr_count, u32 flags)
 	BUG_ON(acd->ldev->pldev == NULL);
 
 	for (i = 0 ; i < acd->page_count ; i++) {
-		if (!PageReserved(acd->user_pages[i])) {
+		if (!PageReserved(acd->user_pages[i]) &&
+		    !is_zone_device_page(acd->user_pages[i])) {
 			set_page_dirty(acd->user_pages[i]);
 		}
 	}
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH RFC v1 07/12] staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved changes
@ 2019-10-22 17:12   ` David Hildenbrand
  0 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2019-10-22 17:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kate Stewart, Sasha Levin, linux-hyperv, Michal Hocko,
	Radim Krčmář,
	kvm, David Hildenbrand, KarimAllah Ahmed, Benjamin Herrenschmidt,
	Dave Hansen, Alexander Duyck, Michal Hocko, Paul Mackerras,
	linux-mm, Pavel Tatashin, Paul Mackerras, Michael Ellerman,
	H. Peter Anvin, Wanpeng Li, Alexander Duyck, Kees Cook, devel,
	Stefano Stabellini, Stephen Hemminger, Aneesh Kumar K.V,
	Joerg Roedel, x86, YueHaibing, Mike Rapoport,
	Madhumitha Prabakaran, Peter Zijlstra, Ingo Molnar,
	Vlastimil Babka, Nishka Dasgupta, Anthony Yznaga, Oscar Salvador,
	Dan Carpenter, Isaac J. Manjarres, Juergen Gross,
	Anshuman Khandual, Haiyang Zhang, Simon Sandström,
	Dan Williams, kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Todd Poynor, Vitaly Kuznetsov, Allison Randal,
	Jim Mattson, Christophe Leroy, Vandana BN, Greg Kroah-Hartman,
	Cornelia Huck, Pavel Tatashin, Mel Gorman, Sean Christopherson,
	Rob Springer, Thomas Gleixner, Johannes Weiner, Paolo Bonzini,
	Andrew Morton, linuxppc-dev

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

The pages are obtained via get_user_pages_fast(). I assume these
could be ZONE_DEVICE pages, so let's exclude them explicitly as well.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Vandana BN <bnvandana@gmail.com>
Cc: "Simon Sandström" <simon@nikanor.nu>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Nishka Dasgupta <nishkadg.linux@gmail.com>
Cc: Madhumitha Prabakaran <madhumithabiw@gmail.com>
Cc: Fabio Estevam <festevam@gmail.com>
Cc: Matt Sickler <Matt.Sickler@daktronics.com>
Cc: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/staging/kpc2000/kpc_dma/fileops.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/staging/kpc2000/kpc_dma/fileops.c b/drivers/staging/kpc2000/kpc_dma/fileops.c
index cb52bd9a6d2f..457adcc81fe6 100644
--- a/drivers/staging/kpc2000/kpc_dma/fileops.c
+++ b/drivers/staging/kpc2000/kpc_dma/fileops.c
@@ -212,7 +212,8 @@ void  transfer_complete_cb(struct aio_cb_data *acd, size_t xfr_count, u32 flags)
 	BUG_ON(acd->ldev->pldev == NULL);
 
 	for (i = 0 ; i < acd->page_count ; i++) {
-		if (!PageReserved(acd->user_pages[i])) {
+		if (!PageReserved(acd->user_pages[i]) &&
+		    !is_zone_device_page(acd->user_pages[i])) {
 			set_page_dirty(acd->user_pages[i]);
 		}
 	}
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH RFC v1 07/12] staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved changes
@ 2019-10-22 17:12   ` David Hildenbrand
  0 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2019-10-22 17:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kate Stewart, Sasha Levin, linux-hyperv, Michal Hocko,
	Radim Krčmář,
	kvm, David Hildenbrand, KarimAllah Ahmed, Dave Hansen,
	Alexander Duyck, Michal Hocko, linux-mm, Pavel Tatashin,
	Paul Mackerras, H. Peter Anvin, Wanpeng Li, Alexander Duyck,
	K. Y. Srinivasan, Fabio Estevam, Ben Chan, Kees Cook, devel,
	Stefano Stabellini, Stephen Hemminger, Aneesh Kumar K.V,
	Joerg Roedel, x86, YueHaibing, Mike Rapoport,
	Madhumitha Prabakaran, Peter Zijlstra, Ingo Molnar,
	Vlastimil Babka, Nishka Dasgupta, Anthony Yznaga, Oscar Salvador,
	Dan Carpenter, Isaac J. Manjarres, Matt Sickler, Juergen Gross,
	Anshuman Khandual, Haiyang Zhang, Simon Sandström,
	Dan Williams, kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Todd Poynor, Vitaly Kuznetsov, Allison Randal,
	Jim Mattson, Vandana BN, Jeremy Sowden, Greg Kroah-Hartman,
	Cornelia Huck, Pavel Tatashin, Mel Gorman, Sean Christopherson,
	Rob Springer, Thomas Gleixner, Johannes Weiner, Paolo Bonzini,
	Andrew Morton, linuxppc-dev

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

The pages are obtained via get_user_pages_fast(). I assume these
could be ZONE_DEVICE pages, so let's exclude them explicitly as well.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Vandana BN <bnvandana@gmail.com>
Cc: "Simon Sandström" <simon@nikanor.nu>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Nishka Dasgupta <nishkadg.linux@gmail.com>
Cc: Madhumitha Prabakaran <madhumithabiw@gmail.com>
Cc: Fabio Estevam <festevam@gmail.com>
Cc: Matt Sickler <Matt.Sickler@daktronics.com>
Cc: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/staging/kpc2000/kpc_dma/fileops.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/staging/kpc2000/kpc_dma/fileops.c b/drivers/staging/kpc2000/kpc_dma/fileops.c
index cb52bd9a6d2f..457adcc81fe6 100644
--- a/drivers/staging/kpc2000/kpc_dma/fileops.c
+++ b/drivers/staging/kpc2000/kpc_dma/fileops.c
@@ -212,7 +212,8 @@ void  transfer_complete_cb(struct aio_cb_data *acd, size_t xfr_count, u32 flags)
 	BUG_ON(acd->ldev->pldev == NULL);
 
 	for (i = 0 ; i < acd->page_count ; i++) {
-		if (!PageReserved(acd->user_pages[i])) {
+		if (!PageReserved(acd->user_pages[i]) &&
+		    !is_zone_device_page(acd->user_pages[i])) {
 			set_page_dirty(acd->user_pages[i]);
 		}
 	}
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [Xen-devel] [PATCH RFC v1 07/12] staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved changes
@ 2019-10-22 17:12   ` David Hildenbrand
  0 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2019-10-22 17:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kate Stewart, Sasha Levin, linux-hyperv, Michal Hocko,
	Radim Krčmář,
	kvm, David Hildenbrand, KarimAllah Ahmed, Benjamin Herrenschmidt,
	Dave Hansen, Alexander Duyck, Michal Hocko, Paul Mackerras,
	linux-mm, Pavel Tatashin, Paul Mackerras, Michael Ellerman,
	H. Peter Anvin, Wanpeng Li, Alexander Duyck, K. Y. Srinivasan,
	Fabio Estevam, Ben Chan, Kees Cook, devel, Stefano Stabellini,
	Stephen Hemminger, Aneesh Kumar K.V, Joerg Roedel, x86,
	YueHaibing, Mike Rapoport, Madhumitha Prabakaran, Peter Zijlstra,
	Ingo Molnar, Vlastimil Babka, Nishka Dasgupta, Anthony Yznaga,
	Oscar Salvador, Dan Carpenter, Isaac J. Manjarres, Matt Sickler,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang,
	Simon Sandström, Dan Williams, kvm-ppc, Qian Cai,
	Alex Williamson, Mike Rapoport, Borislav Petkov, Nicholas Piggin,
	Andy Lutomirski, xen-devel, Boris Ostrovsky, Todd Poynor,
	Vitaly Kuznetsov, Allison Randal, Jim Mattson, Christophe Leroy,
	Vandana BN, Jeremy Sowden, Greg Kroah-Hartman, Cornelia Huck,
	Pavel Tatashin, Mel Gorman, Sean Christopherson, Rob Springer,
	Thomas Gleixner, Johannes Weiner, Paolo Bonzini, Andrew Morton,
	linuxppc-dev

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

The pages are obtained via get_user_pages_fast(). I assume these
could be ZONE_DEVICE pages, so let's exclude them explicitly as well.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Vandana BN <bnvandana@gmail.com>
Cc: "Simon Sandström" <simon@nikanor.nu>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Nishka Dasgupta <nishkadg.linux@gmail.com>
Cc: Madhumitha Prabakaran <madhumithabiw@gmail.com>
Cc: Fabio Estevam <festevam@gmail.com>
Cc: Matt Sickler <Matt.Sickler@daktronics.com>
Cc: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/staging/kpc2000/kpc_dma/fileops.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/staging/kpc2000/kpc_dma/fileops.c b/drivers/staging/kpc2000/kpc_dma/fileops.c
index cb52bd9a6d2f..457adcc81fe6 100644
--- a/drivers/staging/kpc2000/kpc_dma/fileops.c
+++ b/drivers/staging/kpc2000/kpc_dma/fileops.c
@@ -212,7 +212,8 @@ void  transfer_complete_cb(struct aio_cb_data *acd, size_t xfr_count, u32 flags)
 	BUG_ON(acd->ldev->pldev == NULL);
 
 	for (i = 0 ; i < acd->page_count ; i++) {
-		if (!PageReserved(acd->user_pages[i])) {
+		if (!PageReserved(acd->user_pages[i]) &&
+		    !is_zone_device_page(acd->user_pages[i])) {
 			set_page_dirty(acd->user_pages[i]);
 		}
 	}
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH RFC v1 08/12] powerpc/book3s: Prepare kvmppc_book3s_instantiate_page() for PG_reserved changes
  2019-10-22 17:12 ` David Hildenbrand
@ 2019-10-22 17:12   ` David Hildenbrand
  -1 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2019-10-22 17:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Michal Hocko, Andrew Morton,
	kvm-ppc, linuxppc-dev, kvm, linux-hyperv, devel, xen-devel, x86,
	Alexander Duyck, Alexander Duyck, Alex Williamson,
	Allison Randal, Andy Lutomirski, Aneesh Kumar K.V,
	Anshuman Khandual, Anthony Yznaga, Ben Chan,
	Benjamin Herrenschmidt, Borislav Petkov, Boris Ostrovsky,
	Christophe Leroy, Cornelia Huck, Dan Carpenter, Dan Williams,
	Dave Hansen, Fabio Estevam, Greg Kroah-Hartman, Haiyang Zhang,
	H. Peter Anvin, Ingo Molnar, Isaac J. Manjarres, Jeremy Sowden,
	Jim Mattson, Joerg Roedel, Johannes Weiner, Juergen Gross,
	KarimAllah Ahmed, Kate Stewart, Kees Cook, K. Y. Srinivasan,
	Madhumitha Prabakaran, Matt Sickler, Mel Gorman,
	Michael Ellerman, Michal Hocko, Mike Rapoport, Mike Rapoport,
	Nicholas Piggin, Nishka Dasgupta, Oscar Salvador, Paolo Bonzini,
	Paul Mackerras, Paul Mackerras, Pavel Tatashin, Pavel Tatashin,
	Peter Zijlstra, Qian Cai, Radim Krčmář,
	Rob Springer, Sasha Levin, Sean Christopherson,
	Simon Sandström, Stefano Stabellini, Stephen Hemminger,
	Thomas Gleixner, Todd Poynor, Vandana BN, Vitaly Kuznetsov,
	Vlastimil Babka, Wanpeng Li, YueHaibing

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

KVM has this weird use case that you can map anything from /dev/mem
into the guest. pfn_valid() is not a reliable check of whether the
memmap was initialized and can be touched. pfn_to_online_page() makes
sure that we have an initialized memmap. Note that ZONE_DEVICE memory
is never online (IOW, never managed by the buddy).

Switching to pfn_to_online_page() keeps the existing behavior for
PFNs without a memmap and for ZONE_DEVICE memory.
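The semantics of the new check can be modeled in a small userspace sketch (illustrative only — `struct page_model` and the `model_*` helpers are invented for this example and are not kernel APIs; real code uses pfn_to_online_page() and PageReserved()):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Invented model types; not kernel code. */
enum model_zone { MODEL_ONLINE, MODEL_ZONE_DEVICE };

struct page_model {
	enum model_zone zone;
	bool reserved;			/* models PageReserved() */
};

/*
 * Models pfn_to_online_page(): only PFNs in an online (buddy-managed)
 * section yield a usable page; ZONE_DEVICE PFNs and PFNs without a
 * memmap (modeled here as NULL) do not.
 */
static struct page_model *model_pfn_to_online_page(struct page_model *p)
{
	if (!p || p->zone != MODEL_ONLINE)
		return NULL;
	return p;
}

/* Models the new logic in kvmppc_book3s_instantiate_page(). */
static struct page_model *model_instantiate_check(struct page_model *p)
{
	struct page_model *page = model_pfn_to_online_page(p);

	if (page && page->reserved)
		page = NULL;
	return page;
}
```

For ZONE_DEVICE PFNs and PFNs without a memmap the result is NULL either way, which is why the conversion preserves the existing behavior regardless of whether ZONE_DEVICE pages carry PG_reserved.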

Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 2d415c36a61d..05397c0561fc 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -801,12 +801,14 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
 					   writing, upgrade_p);
 		if (is_error_noslot_pfn(pfn))
 			return -EFAULT;
-		page = NULL;
-		if (pfn_valid(pfn)) {
-			page = pfn_to_page(pfn);
-			if (PageReserved(page))
-				page = NULL;
-		}
+		/*
+		 * We treat any pages that are not online (not managed by the
+		 * buddy) as reserved - this includes ZONE_DEVICE pages and
+		 * pages without a memmap (e.g., mapped via /dev/mem).
+		 */
+		page = pfn_to_online_page(pfn);
+		if (page && PageReserved(page))
+			page = NULL;
 	}
 
 	/*
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH RFC v1 09/12] powerpc/64s: Prepare hash_page_do_lazy_icache() for PG_reserved changes
  2019-10-22 17:12 ` David Hildenbrand
@ 2019-10-22 17:12   ` David Hildenbrand
  -1 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2019-10-22 17:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Michal Hocko, Andrew Morton,
	kvm-ppc, linuxppc-dev, kvm, linux-hyperv, devel, xen-devel, x86,
	Alexander Duyck, Alexander Duyck, Alex Williamson,
	Allison Randal, Andy Lutomirski, Aneesh Kumar K.V,
	Anshuman Khandual, Anthony Yznaga, Ben Chan,
	Benjamin Herrenschmidt, Borislav Petkov, Boris Ostrovsky,
	Christophe Leroy, Cornelia Huck, Dan Carpenter, Dan Williams,
	Dave Hansen, Fabio Estevam, Greg Kroah-Hartman, Haiyang Zhang,
	H. Peter Anvin, Ingo Molnar, Isaac J. Manjarres, Jeremy Sowden,
	Jim Mattson, Joerg Roedel, Johannes Weiner, Juergen Gross,
	KarimAllah Ahmed, Kate Stewart, Kees Cook, K. Y. Srinivasan,
	Madhumitha Prabakaran, Matt Sickler, Mel Gorman,
	Michael Ellerman, Michal Hocko, Mike Rapoport, Mike Rapoport,
	Nicholas Piggin, Nishka Dasgupta, Oscar Salvador, Paolo Bonzini,
	Paul Mackerras, Paul Mackerras, Pavel Tatashin, Pavel Tatashin,
	Peter Zijlstra, Qian Cai, Radim Krčmář,
	Rob Springer, Sasha Levin, Sean Christopherson,
	Simon Sandström, Stefano Stabellini, Stephen Hemminger,
	Thomas Gleixner, Todd Poynor, Vandana BN, Vitaly Kuznetsov,
	Vlastimil Babka, Wanpeng Li, YueHaibing

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

We could explicitly check for is_zone_device_page(page). But looking at
the pfn_valid() check, it seems safer to just use pfn_to_online_page()
here, which will skip all ZONE_DEVICE pages right away.
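Why pfn_to_online_page() subsumes an explicit is_zone_device_page() check can be sketched in userspace (illustrative model only — `struct pg` and the helper names are invented, not kernel APIs):

```c
#include <assert.h>
#include <stdbool.h>

/* Invented model; not kernel code. */
struct pg {
	bool has_memmap;	/* pfn_valid() would succeed            */
	bool online;		/* section is online (buddy-managed)    */
	bool zone_device;	/* is_zone_device_page() would be true  */
};

/*
 * Alternative: keep pfn_valid() and add an explicit ZONE_DEVICE check,
 * which would be needed once ZONE_DEVICE pages stop being PG_reserved.
 */
static bool usable_with_explicit_check(const struct pg *p)
{
	return p->has_memmap && !p->zone_device;
}

/*
 * Chosen approach: pfn_to_online_page() rejects anything not online,
 * covering ZONE_DEVICE (never online) and memmap-less PFNs in one test.
 */
static bool usable_via_online_check(const struct pg *p)
{
	return p->has_memmap && p->online;
}
```

The model reflects that ZONE_DEVICE sections are never marked online, so the single online check is the simpler and safer predicate here.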

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/powerpc/mm/book3s64/hash_utils.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c
index 6c123760164e..a1566039e747 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -1084,13 +1084,15 @@ void hash__early_init_mmu_secondary(void)
  */
 unsigned int hash_page_do_lazy_icache(unsigned int pp, pte_t pte, int trap)
 {
-	struct page *page;
+	struct page *page = pfn_to_online_page(pte_pfn(pte));
 
-	if (!pfn_valid(pte_pfn(pte)))
+	/*
+	 * We ignore any pages that are not online (not managed by the buddy).
+	 * This includes ZONE_DEVICE pages.
+	 */
+	if (!page)
 		return pp;
 
-	page = pte_page(pte);
-
 	/* page is dirty */
 	if (!test_bit(PG_arch_1, &page->flags) && !PageReserved(page)) {
 		if (trap == 0x400) {
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH RFC v1 10/12] powerpc/mm: Prepare maybe_pte_to_page() for PG_reserved changes
  2019-10-22 17:12 ` David Hildenbrand
@ 2019-10-22 17:12   ` David Hildenbrand
  -1 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2019-10-22 17:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Michal Hocko, Andrew Morton,
	kvm-ppc, linuxppc-dev, kvm, linux-hyperv, devel, xen-devel, x86,
	Alexander Duyck, Alexander Duyck, Alex Williamson,
	Allison Randal, Andy Lutomirski, Aneesh Kumar K.V,
	Anshuman Khandual, Anthony Yznaga, Ben Chan,
	Benjamin Herrenschmidt, Borislav Petkov, Boris Ostrovsky,
	Christophe Leroy, Cornelia Huck, Dan Carpenter, Dan Williams,
	Dave Hansen, Fabio Estevam, Greg Kroah-Hartman, Haiyang Zhang,
	H. Peter Anvin, Ingo Molnar, Isaac J. Manjarres, Jeremy Sowden,
	Jim Mattson, Joerg Roedel, Johannes Weiner, Juergen Gross,
	KarimAllah Ahmed, Kate Stewart, Kees Cook, K. Y. Srinivasan,
	Madhumitha Prabakaran, Matt Sickler, Mel Gorman,
	Michael Ellerman, Michal Hocko, Mike Rapoport, Mike Rapoport,
	Nicholas Piggin, Nishka Dasgupta, Oscar Salvador, Paolo Bonzini,
	Paul Mackerras, Paul Mackerras, Pavel Tatashin, Pavel Tatashin,
	Peter Zijlstra, Qian Cai, Radim Krčmář,
	Rob Springer, Sasha Levin, Sean Christopherson,
	Simon Sandström, Stefano Stabellini, Stephen Hemminger,
	Thomas Gleixner, Todd Poynor, Vandana BN, Vitaly Kuznetsov,
	Vlastimil Babka, Wanpeng Li, YueHaibing

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

We could explicitly check for is_zone_device_page(page). But looking at
the pfn_valid() check, it seems safer to just use pfn_to_online_page()
here, which will skip all ZONE_DEVICE pages right away.

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Allison Randal <allison@lohutok.net>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/powerpc/mm/pgtable.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index e3759b69f81b..613c98fa7dc0 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -55,10 +55,12 @@ static struct page *maybe_pte_to_page(pte_t pte)
 	unsigned long pfn = pte_pfn(pte);
 	struct page *page;
 
-	if (unlikely(!pfn_valid(pfn)))
-		return NULL;
-	page = pfn_to_page(pfn);
-	if (PageReserved(page))
+	/*
+	 * We reject any pages that are not online (not managed by the buddy).
+	 * This includes ZONE_DEVICE pages.
+	 */
+	page = pfn_to_online_page(pfn);
+	if (unlikely(!page || PageReserved(page)))
 		return NULL;
 	return page;
 }
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 112+ messages in thread

	Pavel Tatashin, Mel Gorman, Sean Christopherson, Rob Springer,
	Thomas Gleixner, Johannes Weiner, Paolo Bonzini, Andrew Morton,
	linuxppc-dev

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

We could explicitly check for is_zone_device_page(page). But looking at
the pfn_valid() check, it seems safer to just use pfn_to_online_page()
here, which will skip all ZONE_DEVICE pages right away.

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Allison Randal <allison@lohutok.net>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/powerpc/mm/pgtable.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index e3759b69f81b..613c98fa7dc0 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -55,10 +55,12 @@ static struct page *maybe_pte_to_page(pte_t pte)
 	unsigned long pfn = pte_pfn(pte);
 	struct page *page;
 
-	if (unlikely(!pfn_valid(pfn)))
-		return NULL;
-	page = pfn_to_page(pfn);
-	if (PageReserved(page))
+	/*
+	 * We reject any pages that are not online (not managed by the buddy).
+	 * This includes ZONE_DEVICE pages.
+	 */
+	page = pfn_to_online_page(pfn);
+	if (unlikely(!page || PageReserved(page)))
 		return NULL;
 	return page;
 }
-- 
2.21.0


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH RFC v1 11/12] x86/mm: Prepare __ioremap_check_ram() for PG_reserved changes
  2019-10-22 17:12 ` David Hildenbrand
  (?)
  (?)
@ 2019-10-22 17:12   ` David Hildenbrand
  -1 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2019-10-22 17:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Michal Hocko, Andrew Morton,
	kvm-ppc, linuxppc-dev, kvm, linux-hyperv, devel, xen-devel, x86,
	Alexander Duyck, Alexander Duyck, Alex Williamson,
	Allison Randal, Andy Lutomirski, Aneesh Kumar K.V,
	Anshuman Khandual, Anthony Yznaga, Ben Chan,
	Benjamin Herrenschmidt, Borislav Petkov, Boris Ostrovsky,
	Christophe Leroy, Cornelia Huck, Dan Carpenter, Dan Williams,
	Dave Hansen, Fabio Estevam, Greg Kroah-Hartman, Haiyang Zhang,
	H. Peter Anvin, Ingo Molnar, Isaac J. Manjarres, Jeremy Sowden,
	Jim Mattson, Joerg Roedel, Johannes Weiner, Juergen Gross,
	KarimAllah Ahmed, Kate Stewart, Kees Cook, K. Y. Srinivasan,
	Madhumitha Prabakaran, Matt Sickler, Mel Gorman,
	Michael Ellerman, Michal Hocko, Mike Rapoport, Mike Rapoport,
	Nicholas Piggin, Nishka Dasgupta, Oscar Salvador, Paolo Bonzini,
	Paul Mackerras, Paul Mackerras, Pavel Tatashin, Pavel Tatashin,
	Peter Zijlstra, Qian Cai, Radim Krčmář,
	Rob Springer, Sasha Levin, Sean Christopherson,
	Simon Sandström, Stefano Stabellini, Stephen Hemminger,
	Thomas Gleixner, Todd Poynor, Vandana BN, Vitaly Kuznetsov,
	Vlastimil Babka, Wanpeng Li, YueHaibing

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

We could explicitly check for is_zone_device_page(page). But looking at
the pfn_valid() check, it seems safer to just use pfn_to_online_page()
here, which will skip all ZONE_DEVICE pages right away.

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/x86/mm/ioremap.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index a39dcdb5ae34..db6913b48edf 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -77,10 +77,17 @@ static unsigned int __ioremap_check_ram(struct resource *res)
 	start_pfn = (res->start + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	stop_pfn = (res->end + 1) >> PAGE_SHIFT;
 	if (stop_pfn > start_pfn) {
-		for (i = 0; i < (stop_pfn - start_pfn); ++i)
-			if (pfn_valid(start_pfn + i) &&
-			    !PageReserved(pfn_to_page(start_pfn + i)))
+		for (i = 0; i < (stop_pfn - start_pfn); ++i) {
+			struct page *page;
+			 /*
+			  * We treat any pages that are not online (not managed
+			  * by the buddy) as not being RAM. This includes
+			  * ZONE_DEVICE pages.
+			  */
+			page = pfn_to_online_page(start_pfn + i);
+			if (page && !PageReserved(page))
 				return IORES_MAP_SYSTEM_RAM;
+		}
 	}
 
 	return 0;
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH RFC v1 11/12] x86/mm: Prepare __ioremap_check_ram() for PG_reserved changes
@ 2019-10-22 17:12   ` David Hildenbrand
  0 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2019-10-22 17:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kate Stewart, Sasha Levin, linux-hyperv, Michal Hocko,
	Radim Krčmář,
	kvm, David Hildenbrand, KarimAllah Ahmed, Benjamin Herrenschmidt,
	Dave Hansen, Alexander Duyck, Michal Hocko, Paul Mackerras,
	linux-mm, Pavel Tatashin, Paul Mackerras, Michael Ellerman,
	H. Peter Anvin, Wanpeng Li, Alexander Duyck, Kees Cook, devel,
	Stefano Stabellini, Stephen Hemminger, Aneesh Kumar K.V,
	Joerg Roedel, x86, YueHaibing, Mike Rapoport,
	Madhumitha Prabakaran, Peter Zijlstra, Ingo Molnar,
	Vlastimil Babka, Nishka Dasgupta, Anthony Yznaga, Oscar Salvador,
	Dan Carpenter, Isaac J. Manjarres, Juergen Gross,
	Anshuman Khandual, Haiyang Zhang, Simon Sandström,
	Dan Williams, kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Todd Poynor, Vitaly Kuznetsov, Allison Randal,
	Jim Mattson, Christophe Leroy, Vandana BN, Greg Kroah-Hartman,
	Cornelia Huck, Pavel Tatashin, Mel Gorman, Sean Christopherson,
	Rob Springer, Thomas Gleixner, Johannes Weiner, Paolo Bonzini,
	Andrew Morton, linuxppc-dev

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

We could explicitly check for is_zone_device_page(page). But looking at
the pfn_valid() check, it seems safer to just use pfn_to_online_page()
here, which will skip all ZONE_DEVICE pages right away.

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/x86/mm/ioremap.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index a39dcdb5ae34..db6913b48edf 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -77,10 +77,17 @@ static unsigned int __ioremap_check_ram(struct resource *res)
 	start_pfn = (res->start + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	stop_pfn = (res->end + 1) >> PAGE_SHIFT;
 	if (stop_pfn > start_pfn) {
-		for (i = 0; i < (stop_pfn - start_pfn); ++i)
-			if (pfn_valid(start_pfn + i) &&
-			    !PageReserved(pfn_to_page(start_pfn + i)))
+		for (i = 0; i < (stop_pfn - start_pfn); ++i) {
+			struct page *page;
+			 /*
+			  * We treat any pages that are not online (not managed
+			  * by the buddy) as not being RAM. This includes
+			  * ZONE_DEVICE pages.
+			  */
+			page = pfn_to_online_page(start_pfn + i);
+			if (page && !PageReserved(page))
 				return IORES_MAP_SYSTEM_RAM;
+		}
 	}
 
 	return 0;
-- 
2.21.0

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH RFC v1 11/12] x86/mm: Prepare __ioremap_check_ram() for PG_reserved changes
@ 2019-10-22 17:12   ` David Hildenbrand
  0 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2019-10-22 17:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kate Stewart, Sasha Levin, linux-hyperv, Michal Hocko,
	Radim Krčmář,
	kvm, David Hildenbrand, KarimAllah Ahmed, Dave Hansen,
	Alexander Duyck, Michal Hocko, linux-mm, Pavel Tatashin,
	Paul Mackerras, H. Peter Anvin, Wanpeng Li, Alexander Duyck,
	K. Y. Srinivasan, Fabio Estevam, Ben Chan, Kees Cook, devel,
	Stefano Stabellini, Stephen Hemminger, Aneesh Kumar K.V,
	Joerg Roedel, x86, YueHaibing, Mike Rapoport,
	Madhumitha Prabakaran, Peter Zijlstra, Ingo Molnar,
	Vlastimil Babka, Nishka Dasgupta, Anthony Yznaga, Oscar Salvador,
	Dan Carpenter, Isaac J. Manjarres, Matt Sickler, Juergen Gross,
	Anshuman Khandual, Haiyang Zhang, Simon Sandström,
	Dan Williams, kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Todd Poynor, Vitaly Kuznetsov, Allison Randal,
	Jim Mattson, Vandana BN, Jeremy Sowden, Greg Kroah-Hartman,
	Cornelia Huck, Pavel Tatashin, Mel Gorman, Sean Christopherson,
	Rob Springer, Thomas Gleixner, Johannes Weiner, Paolo Bonzini,
	Andrew Morton, linuxppc-dev

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

We could explicitly check for is_zone_device_page(page). But looking at
the pfn_valid() check, it seems safer to just use pfn_to_online_page()
here, which will skip all ZONE_DEVICE pages right away.

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/x86/mm/ioremap.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index a39dcdb5ae34..db6913b48edf 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -77,10 +77,17 @@ static unsigned int __ioremap_check_ram(struct resource *res)
 	start_pfn = (res->start + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	stop_pfn = (res->end + 1) >> PAGE_SHIFT;
 	if (stop_pfn > start_pfn) {
-		for (i = 0; i < (stop_pfn - start_pfn); ++i)
-			if (pfn_valid(start_pfn + i) &&
-			    !PageReserved(pfn_to_page(start_pfn + i)))
+		for (i = 0; i < (stop_pfn - start_pfn); ++i) {
+			struct page *page;
+			 /*
+			  * We treat any pages that are not online (not managed
+			  * by the buddy) as not being RAM. This includes
+			  * ZONE_DEVICE pages.
+			  */
+			page = pfn_to_online_page(start_pfn + i);
+			if (page && !PageReserved(page))
 				return IORES_MAP_SYSTEM_RAM;
+		}
 	}
 
 	return 0;
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [Xen-devel] [PATCH RFC v1 11/12] x86/mm: Prepare __ioremap_check_ram() for PG_reserved changes
@ 2019-10-22 17:12   ` David Hildenbrand
  0 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2019-10-22 17:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kate Stewart, Sasha Levin, linux-hyperv, Michal Hocko,
	Radim Krčmář,
	kvm, David Hildenbrand, KarimAllah Ahmed, Benjamin Herrenschmidt,
	Dave Hansen, Alexander Duyck, Michal Hocko, Paul Mackerras,
	linux-mm, Pavel Tatashin, Paul Mackerras, Michael Ellerman,
	H. Peter Anvin, Wanpeng Li, Alexander Duyck, K. Y. Srinivasan,
	Fabio Estevam, Ben Chan, Kees Cook, devel, Stefano Stabellini,
	Stephen Hemminger, Aneesh Kumar K.V, Joerg Roedel, x86,
	YueHaibing, Mike Rapoport, Madhumitha Prabakaran, Peter Zijlstra,
	Ingo Molnar, Vlastimil Babka, Nishka Dasgupta, Anthony Yznaga,
	Oscar Salvador, Dan Carpenter, Isaac J. Manjarres, Matt Sickler,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang,
	Simon Sandström, Dan Williams, kvm-ppc, Qian Cai,
	Alex Williamson, Mike Rapoport, Borislav Petkov, Nicholas Piggin,
	Andy Lutomirski, xen-devel, Boris Ostrovsky, Todd Poynor,
	Vitaly Kuznetsov, Allison Randal, Jim Mattson, Christophe Leroy,
	Vandana BN, Jeremy Sowden, Greg Kroah-Hartman, Cornelia Huck,
	Pavel Tatashin, Mel Gorman, Sean Christopherson, Rob Springer,
	Thomas Gleixner, Johannes Weiner, Paolo Bonzini, Andrew Morton,
	linuxppc-dev

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

We could explicitly check for is_zone_device_page(page). But looking at
the pfn_valid() check, it seems safer to just use pfn_to_online_page()
here, which will skip all ZONE_DEVICE pages right away.

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/x86/mm/ioremap.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index a39dcdb5ae34..db6913b48edf 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -77,10 +77,17 @@ static unsigned int __ioremap_check_ram(struct resource *res)
 	start_pfn = (res->start + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	stop_pfn = (res->end + 1) >> PAGE_SHIFT;
 	if (stop_pfn > start_pfn) {
-		for (i = 0; i < (stop_pfn - start_pfn); ++i)
-			if (pfn_valid(start_pfn + i) &&
-			    !PageReserved(pfn_to_page(start_pfn + i)))
+		for (i = 0; i < (stop_pfn - start_pfn); ++i) {
+			struct page *page;
+			 /*
+			  * We treat any pages that are not online (not managed
+			  * by the buddy) as not being RAM. This includes
+			  * ZONE_DEVICE pages.
+			  */
+			page = pfn_to_online_page(start_pfn + i);
+			if (page && !PageReserved(page))
 				return IORES_MAP_SYSTEM_RAM;
+		}
 	}
 
 	return 0;
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH RFC v1 12/12] mm/memory_hotplug: Don't mark pages PG_reserved when initializing the memmap
  2019-10-22 17:12 ` David Hildenbrand
  (?)
  (?)
@ 2019-10-22 17:12   ` David Hildenbrand
  -1 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2019-10-22 17:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Michal Hocko, Andrew Morton,
	kvm-ppc, linuxppc-dev, kvm, linux-hyperv, devel, xen-devel, x86,
	Alexander Duyck, Alexander Duyck, Alex Williamson,
	Allison Randal, Andy Lutomirski, Aneesh Kumar K.V,
	Anshuman Khandual, Anthony Yznaga, Ben Chan,
	Benjamin Herrenschmidt, Borislav Petkov, Boris Ostrovsky,
	Christophe Leroy, Cornelia Huck, Dan Carpenter, Dan Williams,
	Dave Hansen, Fabio Estevam, Greg Kroah-Hartman, Haiyang Zhang,
	H. Peter Anvin, Ingo Molnar, Isaac J. Manjarres, Jeremy Sowden,
	Jim Mattson, Joerg Roedel, Johannes Weiner, Juergen Gross,
	KarimAllah Ahmed, Kate Stewart, Kees Cook, K. Y. Srinivasan,
	Madhumitha Prabakaran, Matt Sickler, Mel Gorman,
	Michael Ellerman, Michal Hocko, Mike Rapoport, Mike Rapoport,
	Nicholas Piggin, Nishka Dasgupta, Oscar Salvador, Paolo Bonzini,
	Paul Mackerras, Paul Mackerras, Pavel Tatashin, Pavel Tatashin,
	Peter Zijlstra, Qian Cai, Radim Krčmář,
	Rob Springer, Sasha Levin, Sean Christopherson,
	Simon Sandström, Stefano Stabellini, Stephen Hemminger,
	Thomas Gleixner, Todd Poynor, Vandana BN, Vitaly Kuznetsov,
	Vlastimil Babka, Wanpeng Li, YueHaibing

Everything should be prepared to stop setting pages PG_reserved when
initializing the memmap on memory hotplug. Most importantly, we
stop marking ZONE_DEVICE pages PG_reserved.

a) We made sure that any code that relied on PG_reserved to detect
   ZONE_DEVICE memory will no longer rely on PG_reserved - either
   by using pfn_to_online_page() to exclude them right away or by
   checking against is_zone_device_page().
b) We made sure that memory blocks with holes cannot be offlined and
   therefore also cannot be onlined. We have quite some code that relies
   on memory holes being marked PG_reserved. This is no longer an issue.

generic_online_page() still calls __free_pages_core(), which performs
__ClearPageReserved(p). AFAIKS, this should not hurt.

It is worth noting that the users of online_page_callback_t might see a
change. E.g., until now, pages not freed to the buddy by the HyperV
balloon were set PG_reserved until freed via generic_online_page(). Now,
they would look like ordinarily allocated pages (refcount == 1). This
callback is used by the XEN balloon and the HyperV balloon. To not
introduce any silent errors, keep marking the pages PG_reserved. We can
most probably stop doing that, but we have to double-check whether there
are issues (e.g., the offlining code aborts right away in
has_unmovable_pages() when it runs into a PageReserved(page)).

Update the documentation at various places.

Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/hv/hv_balloon.c    |  6 ++++++
 drivers/xen/balloon.c      |  7 +++++++
 include/linux/page-flags.h |  8 +-------
 mm/memory_hotplug.c        | 17 +++++++----------
 mm/page_alloc.c            | 11 -----------
 5 files changed, 21 insertions(+), 28 deletions(-)

diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
index c722079d3c24..3214b0ef5247 100644
--- a/drivers/hv/hv_balloon.c
+++ b/drivers/hv/hv_balloon.c
@@ -670,6 +670,12 @@ static struct notifier_block hv_memory_nb = {
 /* Check if the particular page is backed and can be onlined and online it. */
 static void hv_page_online_one(struct hv_hotadd_state *has, struct page *pg)
 {
+	/*
+	 * TODO: The core used to mark the pages reserved. Most probably
+	 * we can stop doing that now.
+	 */
+	__SetPageReserved(pg);
+
 	if (!has_pfn_is_backed(has, page_to_pfn(pg))) {
 		if (!PageOffline(pg))
 			__SetPageOffline(pg);
diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 4f2e78a5e4db..af69f057913a 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -374,6 +374,13 @@ static void xen_online_page(struct page *page, unsigned int order)
 	mutex_lock(&balloon_mutex);
 	for (i = 0; i < size; i++) {
 		p = pfn_to_page(start_pfn + i);
+		/*
+		 * TODO: The core used to mark the pages reserved. Most probably
+		 * we can stop doing that now. However, especially
+		 * alloc_xenballooned_pages() left PG_reserved set
+		 * on pages that can get mapped to user space.
+		 */
+		__SetPageReserved(p);
 		balloon_append(p);
 	}
 	mutex_unlock(&balloon_mutex);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f91cb8898ff0..d4f85d866b71 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -30,24 +30,18 @@
  * - Pages falling into physical memory gaps - not IORESOURCE_SYSRAM. Trying
  *   to read/write these pages might end badly. Don't touch!
  * - The zero page(s)
- * - Pages not added to the page allocator when onlining a section because
- *   they were excluded via the online_page_callback() or because they are
- *   PG_hwpoison.
  * - Pages allocated in the context of kexec/kdump (loaded kernel image,
  *   control pages, vmcoreinfo)
  * - MMIO/DMA pages. Some architectures don't allow to ioremap pages that are
  *   not marked PG_reserved (as they might be in use by somebody else who does
  *   not respect the caching strategy).
- * - Pages part of an offline section (struct pages of offline sections should
- *   not be trusted as they will be initialized when first onlined).
  * - MCA pages on ia64
  * - Pages holding CPU notes for POWER Firmware Assisted Dump
- * - Device memory (e.g. PMEM, DAX, HMM)
  * Some PG_reserved pages will be excluded from the hibernation image.
  * PG_reserved does in general not hinder anybody from dumping or swapping
  * and is no longer required for remap_pfn_range(). ioremap might require it.
  * Consequently, PG_reserved for a page mapped into user space can indicate
- * the zero page, the vDSO, MMIO pages or device memory.
+ * the zero page, the vDSO, or MMIO pages.
  *
  * The PG_private bitflag is set on pagecache pages if they contain filesystem
  * specific data (which is normally at page->private). It can be used by
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 7210f4375279..9fbcdeaf0339 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -501,9 +501,7 @@ static void __remove_section(unsigned long pfn, unsigned long nr_pages,
  * @altmap: alternative device page map or %NULL if default memmap is used
  *
  * Generic helper function to remove section mappings and sysfs entries
- * for the section of the memory we are removing. Caller needs to make
- * sure that pages are marked reserved and zones are adjust properly by
- * calling offline_pages().
+ * for the section of the memory we are removing.
  */
 void __remove_pages(unsigned long pfn, unsigned long nr_pages,
 		    struct vmem_altmap *altmap)
@@ -584,9 +582,9 @@ static int online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
 	int order;
 
 	/*
-	 * Online the pages. The callback might decide to keep some pages
-	 * PG_reserved (to add them to the buddy later), but we still account
-	 * them as being online/belonging to this zone ("present").
+	 * Online the pages. The callback might decide to not free some pages
+	 * (to add them to the buddy later), but we still account them as
+	 * being online/belonging to this zone ("present").
 	 */
 	for (pfn = start_pfn; pfn < end_pfn; pfn += 1ul << order) {
 		order = min(MAX_ORDER - 1, get_order(PFN_PHYS(end_pfn - pfn)));
@@ -659,8 +657,7 @@ static void __meminit resize_pgdat_range(struct pglist_data *pgdat, unsigned lon
 }
 /*
  * Associate the pfn range with the given zone, initializing the memmaps
- * and resizing the pgdat/zone data to span the added pages. After this
- * call, all affected pages are PG_reserved.
+ * and resizing the pgdat/zone data to span the added pages.
  */
 void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
 		unsigned long nr_pages, struct vmem_altmap *altmap)
@@ -684,8 +681,8 @@ void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
 	/*
 	 * TODO now we have a visible range of pages which are not associated
 	 * with their zone properly. Not nice but set_pfnblock_flags_mask
-	 * expects the zone spans the pfn range. All the pages in the range
-	 * are reserved so nobody should be touching them so we should be safe
+	 * expects the zone spans the pfn range. The sections are not yet
+	 * marked online so nobody should be touching the memmap.
 	 */
 	memmap_init_zone(nr_pages, nid, zone_idx(zone), start_pfn,
 			MEMMAP_HOTPLUG, altmap);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e153280bde9a..29787ac4aeb8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5927,8 +5927,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 
 		page = pfn_to_page(pfn);
 		__init_single_page(page, pfn, zone, nid);
-		if (context == MEMMAP_HOTPLUG)
-			__SetPageReserved(page);
 
 		/*
 		 * Mark the block movable so that blocks are reserved for
@@ -5980,15 +5978,6 @@ void __ref memmap_init_zone_device(struct zone *zone,
 
 		__init_single_page(page, pfn, zone_idx, nid);
 
-		/*
-		 * Mark page reserved as it will need to wait for onlining
-		 * phase for it to be fully associated with a zone.
-		 *
-		 * We can use the non-atomic __set_bit operation for setting
-		 * the flag as we are still initializing the pages.
-		 */
-		__SetPageReserved(page);
-
 		/*
 		 * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
 		 * and zone_device_data.  It is a bug if a ZONE_DEVICE page is
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH RFC v1 12/12] mm/memory_hotplug: Don't mark pages PG_reserved when initializing the memmap
@ 2019-10-22 17:12   ` David Hildenbrand
  0 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2019-10-22 17:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kate Stewart, Sasha Levin, linux-hyperv, Michal Hocko,
	Radim Krčmář,
	kvm, David Hildenbrand, KarimAllah Ahmed, Benjamin Herrenschmidt,
	Dave Hansen, Alexander Duyck, Michal Hocko, Paul Mackerras,
	linux-mm, Pavel Tatashin, Paul Mackerras, Michael Ellerman,
	H. Peter Anvin, Wanpeng Li, Alexander Duyck, Kees Cook, devel,
	Stefano Stabellini, Stephen Hemminger, Aneesh Kumar K.V,
	Joerg Roedel, x86, YueHaibing, Mike Rapoport,
	Madhumitha Prabakaran, Peter Zijlstra, Ingo Molnar,
	Vlastimil Babka, Nishka Dasgupta, Anthony Yznaga, Oscar Salvador,
	Dan Carpenter, Isaac J. Manjarres, Juergen Gross,
	Anshuman Khandual, Haiyang Zhang, Simon Sandström,
	Dan Williams, kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Todd Poynor, Vitaly Kuznetsov, Allison Randal,
	Jim Mattson, Christophe Leroy, Vandana BN, Greg Kroah-Hartman,
	Cornelia Huck, Pavel Tatashin, Mel Gorman, Sean Christopherson,
	Rob Springer, Thomas Gleixner, Johannes Weiner, Paolo Bonzini,
	Andrew Morton, linuxppc-dev

Everything should be prepared to stop setting pages PG_reserved when
initializing the memmap on memory hotplug. Most importantly, we
stop marking ZONE_DEVICE pages PG_reserved.

a) We made sure that any code that relied on PG_reserved to detect
   ZONE_DEVICE memory will no longer rely on PG_reserved - either
   by using pfn_to_online_page() to exclude them right away or by
   checking against is_zone_device_page().
b) We made sure that memory blocks with holes cannot be offlined and
   therefore also cannot be onlined. We have quite some code that relies
   on memory holes being marked PG_reserved. This is no longer an issue.

generic_online_page() still calls __free_pages_core(), which performs
__ClearPageReserved(p). AFAIKS, this should not hurt.

It is worth noting that the users of online_page_callback_t might see a
change. E.g., until now, pages not freed to the buddy by the HyperV
balloon were set PG_reserved until freed via generic_online_page(). Now,
they would look like ordinarily allocated pages (refcount == 1). This
callback is used by the XEN balloon and the HyperV balloon. To not
introduce any silent errors, keep marking the pages PG_reserved. We can
most probably stop doing that, but we have to double-check whether there
are issues (e.g., the offlining code aborts right away in
has_unmovable_pages() when it runs into a PageReserved(page)).

Update the documentation at various places.

Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/hv/hv_balloon.c    |  6 ++++++
 drivers/xen/balloon.c      |  7 +++++++
 include/linux/page-flags.h |  8 +-------
 mm/memory_hotplug.c        | 17 +++++++----------
 mm/page_alloc.c            | 11 -----------
 5 files changed, 21 insertions(+), 28 deletions(-)

diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
index c722079d3c24..3214b0ef5247 100644
--- a/drivers/hv/hv_balloon.c
+++ b/drivers/hv/hv_balloon.c
@@ -670,6 +670,12 @@ static struct notifier_block hv_memory_nb = {
 /* Check if the particular page is backed and can be onlined and online it. */
 static void hv_page_online_one(struct hv_hotadd_state *has, struct page *pg)
 {
+	/*
+	 * TODO: The core used to mark the pages reserved. Most probably
+	 * we can stop doing that now.
+	 */
+	__SetPageReserved(pg);
+
 	if (!has_pfn_is_backed(has, page_to_pfn(pg))) {
 		if (!PageOffline(pg))
 			__SetPageOffline(pg);
diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 4f2e78a5e4db..af69f057913a 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -374,6 +374,13 @@ static void xen_online_page(struct page *page, unsigned int order)
 	mutex_lock(&balloon_mutex);
 	for (i = 0; i < size; i++) {
 		p = pfn_to_page(start_pfn + i);
+		/*
+		 * TODO: The core used to mark the pages reserved. Most probably
+		 * we can stop doing that now. However, especially
+		 * alloc_xenballooned_pages() left PG_reserved set
+		 * on pages that can get mapped to user space.
+		 */
+		__SetPageReserved(p);
 		balloon_append(p);
 	}
 	mutex_unlock(&balloon_mutex);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f91cb8898ff0..d4f85d866b71 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -30,24 +30,18 @@
  * - Pages falling into physical memory gaps - not IORESOURCE_SYSRAM. Trying
  *   to read/write these pages might end badly. Don't touch!
  * - The zero page(s)
- * - Pages not added to the page allocator when onlining a section because
- *   they were excluded via the online_page_callback() or because they are
- *   PG_hwpoison.
  * - Pages allocated in the context of kexec/kdump (loaded kernel image,
  *   control pages, vmcoreinfo)
  * - MMIO/DMA pages. Some architectures don't allow to ioremap pages that are
  *   not marked PG_reserved (as they might be in use by somebody else who does
  *   not respect the caching strategy).
- * - Pages part of an offline section (struct pages of offline sections should
- *   not be trusted as they will be initialized when first onlined).
  * - MCA pages on ia64
  * - Pages holding CPU notes for POWER Firmware Assisted Dump
- * - Device memory (e.g. PMEM, DAX, HMM)
  * Some PG_reserved pages will be excluded from the hibernation image.
  * PG_reserved does in general not hinder anybody from dumping or swapping
  * and is no longer required for remap_pfn_range(). ioremap might require it.
  * Consequently, PG_reserved for a page mapped into user space can indicate
- * the zero page, the vDSO, MMIO pages or device memory.
+ * the zero page, the vDSO, or MMIO pages.
  *
  * The PG_private bitflag is set on pagecache pages if they contain filesystem
  * specific data (which is normally at page->private). It can be used by
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 7210f4375279..9fbcdeaf0339 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -501,9 +501,7 @@ static void __remove_section(unsigned long pfn, unsigned long nr_pages,
  * @altmap: alternative device page map or %NULL if default memmap is used
  *
  * Generic helper function to remove section mappings and sysfs entries
- * for the section of the memory we are removing. Caller needs to make
- * sure that pages are marked reserved and zones are adjust properly by
- * calling offline_pages().
+ * for the section of the memory we are removing.
  */
 void __remove_pages(unsigned long pfn, unsigned long nr_pages,
 		    struct vmem_altmap *altmap)
@@ -584,9 +582,9 @@ static int online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
 	int order;
 
 	/*
-	 * Online the pages. The callback might decide to keep some pages
-	 * PG_reserved (to add them to the buddy later), but we still account
-	 * them as being online/belonging to this zone ("present").
+	 * Online the pages. The callback might decide to not free some pages
+	 * (to add them to the buddy later), but we still account them as
+	 * being online/belonging to this zone ("present").
 	 */
 	for (pfn = start_pfn; pfn < end_pfn; pfn += 1ul << order) {
 		order = min(MAX_ORDER - 1, get_order(PFN_PHYS(end_pfn - pfn)));
@@ -659,8 +657,7 @@ static void __meminit resize_pgdat_range(struct pglist_data *pgdat, unsigned lon
 }
 /*
  * Associate the pfn range with the given zone, initializing the memmaps
- * and resizing the pgdat/zone data to span the added pages. After this
- * call, all affected pages are PG_reserved.
+ * and resizing the pgdat/zone data to span the added pages.
  */
 void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
 		unsigned long nr_pages, struct vmem_altmap *altmap)
@@ -684,8 +681,8 @@ void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
 	/*
 	 * TODO now we have a visible range of pages which are not associated
 	 * with their zone properly. Not nice but set_pfnblock_flags_mask
-	 * expects the zone spans the pfn range. All the pages in the range
-	 * are reserved so nobody should be touching them so we should be safe
+	 * expects the zone spans the pfn range. The sections are not yet
+	 * marked online so nobody should be touching the memmap.
 	 */
 	memmap_init_zone(nr_pages, nid, zone_idx(zone), start_pfn,
 			MEMMAP_HOTPLUG, altmap);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e153280bde9a..29787ac4aeb8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5927,8 +5927,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 
 		page = pfn_to_page(pfn);
 		__init_single_page(page, pfn, zone, nid);
-		if (context == MEMMAP_HOTPLUG)
-			__SetPageReserved(page);
 
 		/*
 		 * Mark the block movable so that blocks are reserved for
@@ -5980,15 +5978,6 @@ void __ref memmap_init_zone_device(struct zone *zone,
 
 		__init_single_page(page, pfn, zone_idx, nid);
 
-		/*
-		 * Mark page reserved as it will need to wait for onlining
-		 * phase for it to be fully associated with a zone.
-		 *
-		 * We can use the non-atomic __set_bit operation for setting
-		 * the flag as we are still initializing the pages.
-		 */
-		__SetPageReserved(page);
-
 		/*
 		 * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
 		 * and zone_device_data.  It is a bug if a ZONE_DEVICE page is
-- 
2.21.0

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH RFC v1 12/12] mm/memory_hotplug: Don't mark pages PG_reserved when initializing the memmap
@ 2019-10-22 17:12   ` David Hildenbrand
  0 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2019-10-22 17:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kate Stewart, Sasha Levin, linux-hyperv, Michal Hocko,
	Radim Krčmář,
	kvm, David Hildenbrand, KarimAllah Ahmed, Dave Hansen,
	Alexander Duyck, Michal Hocko, linux-mm, Pavel Tatashin,
	Paul Mackerras, H. Peter Anvin, Wanpeng Li, Alexander Duyck,
	K. Y. Srinivasan, Fabio Estevam, Ben Chan, Kees Cook, devel,
	Stefano Stabellini, Stephen Hemminger, Aneesh Kumar K.V,
	Joerg Roedel, x86, YueHaibing, Mike Rapoport,
	Madhumitha Prabakaran, Peter Zijlstra, Ingo Molnar,
	Vlastimil Babka, Nishka Dasgupta, Anthony Yznaga, Oscar Salvador,
	Dan Carpenter, Isaac J. Manjarres, Matt Sickler, Juergen Gross,
	Anshuman Khandual, Haiyang Zhang, Simon Sandström,
	Dan Williams, kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Todd Poynor, Vitaly Kuznetsov, Allison Randal,
	Jim Mattson, Vandana BN, Jeremy Sowden, Greg Kroah-Hartman,
	Cornelia Huck, Pavel Tatashin, Mel Gorman, Sean Christopherson,
	Rob Springer, Thomas Gleixner, Johannes Weiner, Paolo Bonzini,
	Andrew Morton, linuxppc-dev

Everything should be prepared to stop setting pages PG_reserved when
initializing the memmap on memory hotplug. Most importantly, we
stop marking ZONE_DEVICE pages PG_reserved.

a) We made sure that any code that relied on PG_reserved to detect
   ZONE_DEVICE memory will no longer rely on PG_reserved - either
   by using pfn_to_online_page() to exclude them right away or by
   checking against is_zone_device_page().
b) We made sure that memory blocks with holes cannot be offlined and
   therefore also not onlined. We have quite some code that relies on
   memory holes being marked PG_reserved. This is no longer an issue.
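
The detection described in a) - testing is_zone_device_page() instead of
relying on PG_reserved - can be sketched with a self-contained toy model
(the struct page layout, flag bits, and helper bodies below are simplified
stand-ins, not the kernel's real definitions):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the kernel's page flags -- not the real layout. */
#define PG_reserved     (1u << 0)
#define ZONE_DEVICE_BIT (1u << 1)   /* models page_zonenum(page) == ZONE_DEVICE */

struct page { unsigned int flags; };

static bool PageReserved(const struct page *page)
{
	return page->flags & PG_reserved;
}

static bool is_zone_device_page(const struct page *page)
{
	return page->flags & ZONE_DEVICE_BIT;
}

/*
 * After this series, callers that used PG_reserved as a proxy for
 * "not ordinary RAM" must test ZONE_DEVICE explicitly, mirroring the
 * fileops.c hunk of patch 07/12 in this series.
 */
static bool may_set_page_dirty(const struct page *page)
{
	return !PageReserved(page) && !is_zone_device_page(page);
}
```

With this split, clearing PG_reserved on ZONE_DEVICE pages no longer makes
them look like dirtyable system RAM to such callers.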

generic_online_page() still calls __free_pages_core(), which performs
__ClearPageReserved(p). AFAICS, this should not hurt.

It is worth noting that the users of online_page_callback_t might see a
change. E.g., until now, pages not freed to the buddy by the Hyper-V
balloon were set PG_reserved until freed via generic_online_page(). Now,
they would look like ordinarily allocated pages (refcount == 1). This
callback is used by the Xen balloon and the Hyper-V balloon. To avoid
introducing any silent errors, keep marking the pages PG_reserved. We can
most probably stop doing that, but have to double-check whether there are
issues (e.g., the offlining code aborts right away in
has_unmovable_pages() when it runs into a PageReserved(page)).

Update the documentation at various places.

Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/hv/hv_balloon.c    |  6 ++++++
 drivers/xen/balloon.c      |  7 +++++++
 include/linux/page-flags.h |  8 +-------
 mm/memory_hotplug.c        | 17 +++++++----------
 mm/page_alloc.c            | 11 -----------
 5 files changed, 21 insertions(+), 28 deletions(-)

diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
index c722079d3c24..3214b0ef5247 100644
--- a/drivers/hv/hv_balloon.c
+++ b/drivers/hv/hv_balloon.c
@@ -670,6 +670,12 @@ static struct notifier_block hv_memory_nb = {
 /* Check if the particular page is backed and can be onlined and online it. */
 static void hv_page_online_one(struct hv_hotadd_state *has, struct page *pg)
 {
+	/*
+	 * TODO: The core used to mark the pages reserved. Most probably
+	 * we can stop doing that now.
+	 */
+	__SetPageReserved(pg);
+
 	if (!has_pfn_is_backed(has, page_to_pfn(pg))) {
 		if (!PageOffline(pg))
 			__SetPageOffline(pg);
diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 4f2e78a5e4db..af69f057913a 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -374,6 +374,13 @@ static void xen_online_page(struct page *page, unsigned int order)
 	mutex_lock(&balloon_mutex);
 	for (i = 0; i < size; i++) {
 		p = pfn_to_page(start_pfn + i);
+		/*
+		 * TODO: The core used to mark the pages reserved. Most probably
+		 * we can stop doing that now. However, especially
+		 * alloc_xenballooned_pages() left PG_reserved set
+		 * on pages that can get mapped to user space.
+		 */
+		__SetPageReserved(p);
 		balloon_append(p);
 	}
 	mutex_unlock(&balloon_mutex);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f91cb8898ff0..d4f85d866b71 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -30,24 +30,18 @@
  * - Pages falling into physical memory gaps - not IORESOURCE_SYSRAM. Trying
  *   to read/write these pages might end badly. Don't touch!
  * - The zero page(s)
- * - Pages not added to the page allocator when onlining a section because
- *   they were excluded via the online_page_callback() or because they are
- *   PG_hwpoison.
  * - Pages allocated in the context of kexec/kdump (loaded kernel image,
  *   control pages, vmcoreinfo)
  * - MMIO/DMA pages. Some architectures don't allow to ioremap pages that are
  *   not marked PG_reserved (as they might be in use by somebody else who does
  *   not respect the caching strategy).
- * - Pages part of an offline section (struct pages of offline sections should
- *   not be trusted as they will be initialized when first onlined).
  * - MCA pages on ia64
  * - Pages holding CPU notes for POWER Firmware Assisted Dump
- * - Device memory (e.g. PMEM, DAX, HMM)
  * Some PG_reserved pages will be excluded from the hibernation image.
  * PG_reserved does in general not hinder anybody from dumping or swapping
  * and is no longer required for remap_pfn_range(). ioremap might require it.
  * Consequently, PG_reserved for a page mapped into user space can indicate
- * the zero page, the vDSO, MMIO pages or device memory.
+ * the zero page, the vDSO, or MMIO pages.
  *
  * The PG_private bitflag is set on pagecache pages if they contain filesystem
  * specific data (which is normally at page->private). It can be used by
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 7210f4375279..9fbcdeaf0339 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -501,9 +501,7 @@ static void __remove_section(unsigned long pfn, unsigned long nr_pages,
  * @altmap: alternative device page map or %NULL if default memmap is used
  *
  * Generic helper function to remove section mappings and sysfs entries
- * for the section of the memory we are removing. Caller needs to make
- * sure that pages are marked reserved and zones are adjust properly by
- * calling offline_pages().
+ * for the section of the memory we are removing.
  */
 void __remove_pages(unsigned long pfn, unsigned long nr_pages,
 		    struct vmem_altmap *altmap)
@@ -584,9 +582,9 @@ static int online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
 	int order;
 
 	/*
-	 * Online the pages. The callback might decide to keep some pages
-	 * PG_reserved (to add them to the buddy later), but we still account
-	 * them as being online/belonging to this zone ("present").
+	 * Online the pages. The callback might decide to not free some pages
+	 * (to add them to the buddy later), but we still account them as
+	 * being online/belonging to this zone ("present").
 	 */
 	for (pfn = start_pfn; pfn < end_pfn; pfn += 1ul << order) {
 		order = min(MAX_ORDER - 1, get_order(PFN_PHYS(end_pfn - pfn)));
@@ -659,8 +657,7 @@ static void __meminit resize_pgdat_range(struct pglist_data *pgdat, unsigned lon
 }
 /*
  * Associate the pfn range with the given zone, initializing the memmaps
- * and resizing the pgdat/zone data to span the added pages. After this
- * call, all affected pages are PG_reserved.
+ * and resizing the pgdat/zone data to span the added pages.
  */
 void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
 		unsigned long nr_pages, struct vmem_altmap *altmap)
@@ -684,8 +681,8 @@ void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
 	/*
 	 * TODO now we have a visible range of pages which are not associated
 	 * with their zone properly. Not nice but set_pfnblock_flags_mask
-	 * expects the zone spans the pfn range. All the pages in the range
-	 * are reserved so nobody should be touching them so we should be safe
+	 * expects the zone spans the pfn range. The sections are not yet
+	 * marked online so nobody should be touching the memmap.
 	 */
 	memmap_init_zone(nr_pages, nid, zone_idx(zone), start_pfn,
 			MEMMAP_HOTPLUG, altmap);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e153280bde9a..29787ac4aeb8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5927,8 +5927,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 
 		page = pfn_to_page(pfn);
 		__init_single_page(page, pfn, zone, nid);
-		if (context == MEMMAP_HOTPLUG)
-			__SetPageReserved(page);
 
 		/*
 		 * Mark the block movable so that blocks are reserved for
@@ -5980,15 +5978,6 @@ void __ref memmap_init_zone_device(struct zone *zone,
 
 		__init_single_page(page, pfn, zone_idx, nid);
 
-		/*
-		 * Mark page reserved as it will need to wait for onlining
-		 * phase for it to be fully associated with a zone.
-		 *
-		 * We can use the non-atomic __set_bit operation for setting
-		 * the flag as we are still initializing the pages.
-		 */
-		__SetPageReserved(page);
-
 		/*
 		 * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
 		 * and zone_device_data.  It is a bug if a ZONE_DEVICE page is
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 112+ messages in thread

* RE: [PATCH RFC v1 07/12] staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved changes
  2019-10-22 17:12   ` David Hildenbrand
  (?)
  (?)
@ 2019-10-22 17:55     ` Matt Sickler
  -1 siblings, 0 replies; 112+ messages in thread
From: Matt Sickler @ 2019-10-22 17:55 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Michal Hocko, Andrew Morton, kvm-ppc, linuxppc-dev,
	kvm, linux-hyperv, devel, xen-devel, x86, Alexander Duyck,
	Alexander Duyck, Alex Williamson, Allison Randal,
	Andy Lutomirski, Aneesh Kumar K.V, Anshuman Khandual,
	Anthony Yznaga, Ben Chan, Benjamin Herrenschmidt,
	Borislav Petkov, Boris Ostrovsky, Christophe Leroy,
	Cornelia Huck, Dan Carpenter, Dan Williams, Dave Hansen,
	Fabio Estevam, Greg Kroah-Hartman, Haiyang Zhang, H. Peter Anvin,
	Ingo Molnar, Isaac J. Manjarres, Jeremy Sowden, Jim Mattson,
	Joerg Roedel, Johannes Weiner, Juergen Gross, KarimAllah Ahmed,
	Kate Stewart, Kees Cook, K. Y. Srinivasan, Madhumitha Prabakaran,
	Mel Gorman, Michael Ellerman, Michal Hocko, Mike Rapoport,
	Mike Rapoport, Nicholas Piggin, Nishka Dasgupta, Oscar Salvador,
	Paolo Bonzini, Paul Mackerras, Paul Mackerras, Pavel Tatashin,
	Pavel Tatashin, Peter Zijlstra, Qian Cai,
	Radim Krčmář,
	Rob Springer, Sasha Levin, Sean Christopherson,
	Simon Sandström, Stefano Stabellini, Stephen Hemminger,
	Thomas Gleixner, Todd Poynor, Vandana BN, Vitaly Kuznetsov,
	Vlastimil Babka, Wanpeng Li, YueHaibing

>Right now, ZONE_DEVICE memory is always set PG_reserved. We want to change that.
>
>The pages are obtained via get_user_pages_fast(). I assume, these could be ZONE_DEVICE pages. Let's just exclude them as well explicitly.

I'm not sure what ZONE_DEVICE pages are, but these pages are normal system RAM, typically HugePages (but not always).
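
(For context: ZONE_DEVICE covers device-provided memory such as PMEM, DAX,
and HMM ranges that get a memmap but are never handed to the buddy
allocator. A toy model of the pfn_to_online_page() filter mentioned in the
cover text - simplified names and data, not the kernel implementation:)

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct page { unsigned long flags; bool zone_device; };

/* Toy memmap: pfns 0-1 are ordinary online RAM, pfn 2 is ZONE_DEVICE. */
static struct page memmap[3] = {
	{ .zone_device = false },
	{ .zone_device = false },
	{ .zone_device = true },
};

/*
 * Mirrors the idea of pfn_to_online_page(): hand back a struct page only
 * for online system RAM, so ZONE_DEVICE (and nonexistent) pfns are
 * filtered out up front, without consulting PG_reserved at all.
 */
static struct page *pfn_to_online_page(unsigned long pfn)
{
	if (pfn >= 3 || memmap[pfn].zone_device)
		return NULL;
	return &memmap[pfn];
}
```

Pages pinned via get_user_pages_fast() from ordinary RAM would pass this
filter; device-backed mappings would not.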

>
>Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>Cc: Vandana BN <bnvandana@gmail.com>
>Cc: "Simon Sandström" <simon@nikanor.nu>
>Cc: Dan Carpenter <dan.carpenter@oracle.com>
>Cc: Nishka Dasgupta <nishkadg.linux@gmail.com>
>Cc: Madhumitha Prabakaran <madhumithabiw@gmail.com>
>Cc: Fabio Estevam <festevam@gmail.com>
>Cc: Matt Sickler <Matt.Sickler@daktronics.com>
>Cc: Jeremy Sowden <jeremy@azazel.net>
>Signed-off-by: David Hildenbrand <david@redhat.com>
>---
> drivers/staging/kpc2000/kpc_dma/fileops.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/staging/kpc2000/kpc_dma/fileops.c b/drivers/staging/kpc2000/kpc_dma/fileops.c
>index cb52bd9a6d2f..457adcc81fe6 100644
>--- a/drivers/staging/kpc2000/kpc_dma/fileops.c
>+++ b/drivers/staging/kpc2000/kpc_dma/fileops.c
>@@ -212,7 +212,8 @@ void  transfer_complete_cb(struct aio_cb_data *acd, size_t xfr_count, u32 flags)
>        BUG_ON(acd->ldev->pldev == NULL);
>
>        for (i = 0 ; i < acd->page_count ; i++) {
>-               if (!PageReserved(acd->user_pages[i])) {
>+               if (!PageReserved(acd->user_pages[i]) &&
>+                   !is_zone_device_page(acd->user_pages[i])) {
>                        set_page_dirty(acd->user_pages[i]);
>                }
>        }
>--
>2.21.0


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH RFC v1 07/12] staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved changes
  2019-10-22 17:55     ` Matt Sickler
@ 2019-10-22 21:01       ` David Hildenbrand
  -1 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2019-10-22 21:01 UTC (permalink / raw)
  To: Matt Sickler, linux-kernel
  Cc: linux-mm, Michal Hocko, Andrew Morton, kvm-ppc, linuxppc-dev,
	kvm, linux-hyperv, devel, xen-devel, x86, Alexander Duyck,
	Alexander Duyck, Alex Williamson, Allison Randal,
	Andy Lutomirski, Aneesh Kumar K.V, Anshuman Khandual,
	Anthony Yznaga, Ben Chan, Benjamin Herrenschmidt,
	Borislav Petkov, Boris Ostrovsky, Christophe Leroy,
	Cornelia Huck, Dan Carpenter, Dan Williams, Dave Hansen,
	Fabio Estevam, Greg Kroah-Hartman, Haiyang Zhang, H. Peter Anvin,
	Ingo Molnar, Isaac J. Manjarres, Jeremy Sowden, Jim Mattson,
	Joerg Roedel, Johannes Weiner, Juergen Gross, KarimAllah Ahmed,
	Kate Stewart, Kees Cook, K. Y. Srinivasan, Madhumitha Prabakaran,
	Mel Gorman, Michael Ellerman, Michal Hocko, Mike Rapoport,
	Mike Rapoport, Nicholas Piggin, Nishka Dasgupta, Oscar Salvador,
	Paolo Bonzini, Paul Mackerras, Paul Mackerras, Pavel Tatashin,
	Pavel Tatashin, Peter Zijlstra, Qian Cai,
	Radim Krčmář,
	Rob Springer, Sasha Levin, Sean Christopherson,
	Simon Sandström, Stefano Stabellini, Stephen Hemminger,
	Thomas Gleixner, Todd Poynor, Vandana BN, Vitaly Kuznetsov,
	Vlastimil Babka, Wanpeng Li, YueHaibing

On 22.10.19 19:55, Matt Sickler wrote:
>> Right now, ZONE_DEVICE memory is always set PG_reserved. We want to change that.
>>
>> The pages are obtained via get_user_pages_fast(). I assume, these could be ZONE_DEVICE pages. Let's just exclude them as well explicitly.
> 
> I'm not sure what ZONE_DEVICE pages are, but these pages are normal system RAM, typically HugePages (but not always).

ZONE_DEVICE pages, a.k.a. devmem, bypass the page cache completely 
(e.g., DAX) and will therefore never be swapped. These pages are not 
managed by any page allocator (especially not the buddy); they are 
rather "directly mapped device memory".

E.g., an NVDIMM is mapped into the physical address space similarly to 
ordinary RAM (a DIMM). Any write to such a PFN will directly end up on 
the target device. In contrast to a DIMM, the memory is persistent 
across reboots.

Now, if you mmap such an NVDIMM into a user space process, you will end 
up with ZONE_DEVICE pages as part of the user space mapping (VMA). 
get_user_pages_fast() on this memory will result in "struct pages" that 
belong to ZONE_DEVICE. This is where this patch comes into play.

This patch makes sure that nothing changes once we stop setting these 
ZONE_DEVICE pages PG_reserved. E.g., AFAIK, setting a ZONE_DEVICE page 
dirty does not make much sense (it is never swapped).

Yes, it might not be a likely setup, but it is possible. In this 
series I collect all places that *could* be affected. Whether each 
change is really needed still has to be decided. I can see that the two 
staging drivers I have patches for might be able to just live with the 
change - but then at least we have talked about it and are aware of it.

Thanks!

-- 

Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)
  2019-10-22 17:12 ` David Hildenbrand
@ 2019-10-22 21:54   ` Dan Williams
  -1 siblings, 0 replies; 112+ messages in thread
From: Dan Williams @ 2019-10-22 21:54 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Linux Kernel Mailing List, Linux MM, Michal Hocko, Andrew Morton,
	kvm-ppc, linuxppc-dev, KVM list, linux-hyperv, devel, xen-devel,
	X86 ML, Alexander Duyck, Alexander Duyck, Alex Williamson,
	Allison Randal, Andy Lutomirski, Aneesh Kumar K.V,
	Anshuman Khandual, Anthony Yznaga, Ben Chan,
	Benjamin Herrenschmidt, Borislav Petkov, Boris Ostrovsky,
	Christophe Leroy, Cornelia Huck, Dan Carpenter, Dave Hansen,
	Fabio Estevam, Greg Kroah-Hartman, Haiyang Zhang, H. Peter Anvin,
	Ingo Molnar, Isaac J. Manjarres, Jeremy Sowden, Jim Mattson,
	Joerg Roedel, Johannes Weiner, Juergen Gross, KarimAllah Ahmed,
	Kate Stewart, Kees Cook, K. Y. Srinivasan, Madhumitha Prabakaran,
	Matt Sickler, Mel Gorman, Michael Ellerman, Michal Hocko,
	Mike Rapoport, Mike Rapoport, Nicholas Piggin, Nishka Dasgupta,
	Oscar Salvador, Paolo Bonzini, Paul Mackerras, Paul Mackerras,
	Pavel Tatashin, Pavel Tatashin, Peter Zijlstra, Qian Cai,
	Radim Krčmář,
	Rob Springer, Sasha Levin, Sean Christopherson,
	Simon Sandström, Stefano Stabellini, Stephen Hemminger,
	Thomas Gleixner, Todd Poynor, Vandana BN, Vitaly Kuznetsov,
	Vlastimil Babka, Wanpeng Li, YueHaibing

Hi David,

Thanks for tackling this!

On Tue, Oct 22, 2019 at 10:13 AM David Hildenbrand <david@redhat.com> wrote:
>
> This series is based on [2], which should pop up in linux/next soon:
>         https://lkml.org/lkml/2019/10/21/1034
>
> This is the result of a recent discussion with Michal ([1], [2]). Right
> now we set all pages PG_reserved when initializing hotplugged memmaps. This
> includes ZONE_DEVICE memory. In case of system memory, PG_reserved is
> cleared again when onlining the memory, in case of ZONE_DEVICE memory
> never. In ancient times, we needed PG_reserved, because there was no way
> to tell whether the memmap was already properly initialized. We now have
> SECTION_IS_ONLINE for that in the case of !ZONE_DEVICE memory. ZONE_DEVICE
> memory is already initialized deferred, and there shouldn't be a visible
> change in that regard.
>
> I remember that some time ago, we already talked about stopping to set
> ZONE_DEVICE pages PG_reserved on the list, but I never saw any patches.
> Also, I forgot who was part of the discussion :)

You got me, Alex, and KVM folks on the Cc, so I'd say that was it.

> One of the biggest fears was side effects. I went ahead and audited all
> users of PageReserved(). The ones that don't need any care (patches)
> can be found below. I will double check and hope I am not missing something
> important.
>
> I am probably a little bit too careful (but I don't want to break things).
> In most places (besides KVM and vfio that are nuts), the
> pfn_to_online_page() check could most probably be avoided by a
> is_zone_device_page() check. However, I usually get suspicious when I see
> a pfn_valid() check (especially after I learned that people mmap parts of
> /dev/mem into user space, including memory without memmaps. Also, people
> could memmap offline memory blocks this way :/). As long as this does not
> hurt performance, I think we should rather do it the clean way.

I'm concerned about using is_zone_device_page() in places that are not
known to already have a reference to the page. Here's an audit of
current usages, and the ones I think need to be cleaned up. The "unsafe"
ones do not appear to have any protections against the device page
being removed (get_dev_pagemap()). Yes, some of these were added by
me. The "unsafe? HMM" ones need HMM eyes because HMM leaks device
pages into anonymous memory paths and I'm not up to speed on how it
guarantees 'struct page' validity vs device shutdown without using
get_dev_pagemap().

smaps_pmd_entry(): unsafe

put_devmap_managed_page(): safe, page reference is held

is_device_private_page(): safe? gpu driver manages private page lifetime

is_pci_p2pdma_page(): safe, page reference is held

uncharge_page(): unsafe? HMM

add_to_kill(): safe, protected by get_dev_pagemap() and dax_lock_page()

soft_offline_page(): unsafe

remove_migration_pte(): unsafe? HMM

move_to_new_page(): unsafe? HMM

migrate_vma_pages() and helpers: unsafe? HMM

try_to_unmap_one(): unsafe? HMM

__put_page(): safe

release_pages(): safe

I'm hoping all the HMM ones can be converted to
is_device_private_page() directly and have that routine grow a nice
comment about how it knows it can always safely de-reference its @page
argument.

For the rest I'd like to propose that we add a facility to determine
ZONE_DEVICE by pfn rather than page. The most straightforward way I
can think of would be to just add another bitmap to mem_section_usage
to indicate if a subsection is ZONE_DEVICE or not.

>
> I only gave it a quick test with DIMMs on x86-64, but didn't test the
> ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Compile-tested
> on x86-64 and PPC.

I'll give it a spin, but I don't think the kernel wants to grow more
is_zone_device_page() users.


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)
@ 2019-10-22 21:54   ` Dan Williams
  0 siblings, 0 replies; 112+ messages in thread
From: Dan Williams @ 2019-10-22 21:54 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kate Stewart, linux-hyperv, Michal Hocko,
	Radim Krčmář,
	KVM list, Pavel Tatashin, KarimAllah Ahmed,
	Benjamin Herrenschmidt, Dave Hansen, Alexander Duyck,
	Michal Hocko, Paul Mackerras, Linux MM, Paul Mackerras,
	Michael Ellerman, H. Peter Anvin, Wanpeng Li, Alexander Duyck,
	Kees Cook, devel, Stefano Stabellini, Stephen Hemminger,
	Aneesh Kumar K.V, Joerg Roedel, X86 ML, YueHaibing,
	Mike Rapoport, Madhumitha Prabakaran, Peter Zijlstra,
	Ingo Molnar, Vlastimil Babka, Nishka Dasgupta, Anthony Yznaga,
	Oscar Salvador, Dan Carpenter, Isaac J. Manjarres, Juergen Gross,
	Anshuman Khandual, Haiyang Zhang, Simon Sandström,
	Sasha Levin, kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Todd Poynor, Vitaly Kuznetsov, Allison Randal,
	Jim Mattson, Christophe Leroy, Vandana BN, Mel Gorman,
	Greg Kroah-Hartman, Cornelia Huck, Pavel Tatashin,
	Linux Kernel Mailing List, Sean Christopherson, Rob Springer,
	Thomas Gleixner, Johannes Weiner, Paolo Bonzini, Andrew Morton,
	linuxppc-dev

Hi David,

Thanks for tackling this!

On Tue, Oct 22, 2019 at 10:13 AM David Hildenbrand <david@redhat.com> wrote:
>
> This series is based on [2], which should pop up in linux/next soon:
>         https://lkml.org/lkml/2019/10/21/1034
>
> This is the result of a recent discussion with Michal ([1], [2]). Right
> now we set all pages PG_reserved when initializing hotplugged memmaps. This
> includes ZONE_DEVICE memory. In case of system memory, PG_reserved is
> cleared again when onlining the memory, in case of ZONE_DEVICE memory
> never. In ancient times, we needed PG_reserved, because there was no way
> to tell whether the memmap was already properly initialized. We now have
> SECTION_IS_ONLINE for that in the case of !ZONE_DEVICE memory. ZONE_DEVICE
> memory is already initialized deferred, and there shouldn't be a visible
> change in that regard.
>
> I remember that some time ago, we already talked about stopping to set
> ZONE_DEVICE pages PG_reserved on the list, but I never saw any patches.
> Also, I forgot who was part of the discussion :)

You got me, Alex, and KVM folks on the Cc, so I'd say that was it.

> One of the biggest fear were side effects. I went ahead and audited all
> users of PageReserved(). The ones that don't need any care (patches)
> can be found below. I will double check and hope I am not missing something
> important.
>
> I am probably a little bit too careful (but I don't want to break things).
> In most places (besides KVM and vfio that are nuts), the
> pfn_to_online_page() check could most probably be avoided by a
> is_zone_device_page() check. However, I usually get suspicious when I see
> a pfn_valid() check (especially after I learned that people mmap parts of
> /dev/mem into user space, including memory without memmaps. Also, people
> could memmap offline memory blocks this way :/). As long as this does not
> hurt performance, I think we should rather do it the clean way.

I'm concerned about using is_zone_device_page() in places that are not
known to already have a reference to the page. Here's an audit of
current usages, and the ones I think need to cleaned up. The "unsafe"
ones do not appear to have any protections against the device page
being removed (get_dev_pagemap()). Yes, some of these were added by
me. The "unsafe? HMM" ones need HMM eyes because HMM leaks device
pages into anonymous memory paths and I'm not up to speed on how it
guarantees 'struct page' validity vs device shutdown without using
get_dev_pagemap().

smaps_pmd_entry(): unsafe

put_devmap_managed_page(): safe, page reference is held

is_device_private_page(): safe? gpu driver manages private page lifetime

is_pci_p2pdma_page(): safe, page reference is held

uncharge_page(): unsafe? HMM

add_to_kill(): safe, protected by get_dev_pagemap() and dax_lock_page()

soft_offline_page(): unsafe

remove_migration_pte(): unsafe? HMM

move_to_new_page(): unsafe? HMM

migrate_vma_pages() and helpers: unsafe? HMM

try_to_unmap_one(): unsafe? HMM

__put_page(): safe

release_pages(): safe

I'm hoping all the HMM ones can be converted to
is_device_private_page() directlly and have that routine grow a nice
comment about how it knows it can always safely de-reference its @page
argument.

For the rest I'd like to propose that we add a facility to determine
ZONE_DEVICE by pfn rather than page. The most straightforward why I
can think of would be to just add another bitmap to mem_section_usage
to indicate if a subsection is ZONE_DEVICE or not.

>
> I only gave it a quick test with DIMMs on x86-64, but didn't test the
> ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Compile-tested
> on x86-64 and PPC.

I'll give it a spin, but I don't think the kernel wants to grow more
is_zone_device_page() users.
_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)
@ 2019-10-22 21:54   ` Dan Williams
  0 siblings, 0 replies; 112+ messages in thread
From: Dan Williams @ 2019-10-22 21:54 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kate Stewart, linux-hyperv, Michal Hocko,
	Radim Krčmář,
	KVM list, Pavel Tatashin, KarimAllah Ahmed, Dave Hansen,
	Alexander Duyck, Michal Hocko, Linux MM, Paul Mackerras,
	H. Peter Anvin, Wanpeng Li, Alexander Duyck, K. Y. Srinivasan,
	Fabio Estevam, Ben Chan, Kees Cook, devel, Stefano Stabellini,
	Stephen Hemminger, Aneesh Kumar K.V, Joerg Roedel, X86 ML,
	YueHaibing, Mike Rapoport, Madhumitha Prabakaran, Peter Zijlstra,
	Ingo Molnar, Vlastimil Babka, Nishka Dasgupta, Anthony Yznaga,
	Oscar Salvador, Dan Carpenter, Isaac J. Manjarres, Matt Sickler,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang,
	Simon Sandström, Sasha Levin, kvm-ppc, Qian Cai,
	Alex Williamson, Mike Rapoport, Borislav Petkov, Nicholas Piggin,
	Andy Lutomirski, xen-devel, Boris Ostrovsky, Todd Poynor,
	Vitaly Kuznetsov, Allison Randal, Jim Mattson, Vandana BN,
	Jeremy Sowden, Mel Gorman, Greg Kroah-Hartman, Cornelia Huck,
	Pavel Tatashin, Linux Kernel Mailing List, Sean Christopherson,
	Rob Springer, Thomas Gleixner, Johannes Weiner, Paolo Bonzini,
	Andrew Morton, linuxppc-dev

Hi David,

Thanks for tackling this!

On Tue, Oct 22, 2019 at 10:13 AM David Hildenbrand <david@redhat.com> wrote:
>
> This series is based on [2], which should pop up in linux/next soon:
>         https://lkml.org/lkml/2019/10/21/1034
>
> This is the result of a recent discussion with Michal ([1], [2]). Right
> now we set all pages PG_reserved when initializing hotplugged memmaps. This
> includes ZONE_DEVICE memory. In case of system memory, PG_reserved is
> cleared again when onlining the memory, in case of ZONE_DEVICE memory
> never. In ancient times, we needed PG_reserved, because there was no way
> to tell whether the memmap was already properly initialized. We now have
> SECTION_IS_ONLINE for that in the case of !ZONE_DEVICE memory. ZONE_DEVICE
> memory is already initialized deferred, and there shouldn't be a visible
> change in that regard.
>
> I remember that some time ago, we already talked about stopping to set
> ZONE_DEVICE pages PG_reserved on the list, but I never saw any patches.
> Also, I forgot who was part of the discussion :)

You got me, Alex, and KVM folks on the Cc, so I'd say that was it.

> One of the biggest fears was side effects. I went ahead and audited all
> users of PageReserved(). The ones that don't need any care (patches)
> can be found below. I will double check and hope I am not missing something
> important.
>
> I am probably a little bit too careful (but I don't want to break things).
> In most places (besides KVM and vfio that are nuts), the
> pfn_to_online_page() check could most probably be avoided by a
> is_zone_device_page() check. However, I usually get suspicious when I see
> a pfn_valid() check (especially after I learned that people mmap parts of
> /dev/mem into user space, including memory without memmaps. Also, people
> could memmap offline memory blocks this way :/). As long as this does not
> hurt performance, I think we should rather do it the clean way.

I'm concerned about using is_zone_device_page() in places that are not
known to already have a reference to the page. Here's an audit of
current usages, and the ones I think need to be cleaned up. The "unsafe"
ones do not appear to have any protections against the device page
being removed (get_dev_pagemap()). Yes, some of these were added by
me. The "unsafe? HMM" ones need HMM eyes because HMM leaks device
pages into anonymous memory paths and I'm not up to speed on how it
guarantees 'struct page' validity vs device shutdown without using
get_dev_pagemap().

smaps_pmd_entry(): unsafe

put_devmap_managed_page(): safe, page reference is held

is_device_private_page(): safe? gpu driver manages private page lifetime

is_pci_p2pdma_page(): safe, page reference is held

uncharge_page(): unsafe? HMM

add_to_kill(): safe, protected by get_dev_pagemap() and dax_lock_page()

soft_offline_page(): unsafe

remove_migration_pte(): unsafe? HMM

move_to_new_page(): unsafe? HMM

migrate_vma_pages() and helpers: unsafe? HMM

try_to_unmap_one(): unsafe? HMM

__put_page(): safe

release_pages(): safe

I'm hoping all the HMM ones can be converted to
is_device_private_page() directly and have that routine grow a nice
comment about how it knows it can always safely de-reference its @page
argument.

For the rest I'd like to propose that we add a facility to determine
ZONE_DEVICE by pfn rather than page. The most straightforward way I
can think of would be to just add another bitmap to mem_section_usage
to indicate if a subsection is ZONE_DEVICE or not.
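
A minimal userspace sketch of what such a per-subsection ZONE_DEVICE
bitmap could look like (all type, field, and helper names below are
hypothetical, modeled loosely on the existing subsection usemap; this
is not actual kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/* Model of the idea: one bit per 2 MiB subsection, kept alongside the
 * existing subsection-present map in struct mem_section_usage.
 * Shift values correspond to x86-64 (4 KiB pages, 128 MiB sections). */
#define PFN_SHIFT        12
#define SUBSECTION_SHIFT 21
#define SECTION_SHIFT    27
#define SUBSECTIONS_PER_SECTION (1UL << (SECTION_SHIFT - SUBSECTION_SHIFT))

struct mem_section_usage_model {
	unsigned long subsection_map;  /* existing: subsection present */
	unsigned long zone_device_map; /* proposed: subsection is ZONE_DEVICE */
};

static unsigned long pfn_to_subsection(unsigned long pfn)
{
	/* index of the 2 MiB subsection within its 128 MiB section */
	return (pfn >> (SUBSECTION_SHIFT - PFN_SHIFT)) % SUBSECTIONS_PER_SECTION;
}

static void mark_subsection_zone_device(struct mem_section_usage_model *u,
					unsigned long pfn)
{
	u->zone_device_map |= 1UL << pfn_to_subsection(pfn);
}

static bool pfn_is_zone_device(const struct mem_section_usage_model *u,
			       unsigned long pfn)
{
	return u->zone_device_map & (1UL << pfn_to_subsection(pfn));
}
```

The lookup stays O(1) per pfn and reuses the section/subsection
machinery that subsection hotplug already maintains.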

>
> I only gave it a quick test with DIMMs on x86-64, but didn't test the
> ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Compile-tested
> on x86-64 and PPC.

I'll give it a spin, but I don't think the kernel wants to grow more
is_zone_device_page() users.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)
  2019-10-22 21:54   ` Dan Williams
  (?)
  (?)
@ 2019-10-23  7:26     ` David Hildenbrand
  -1 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2019-10-23  7:26 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux Kernel Mailing List, Linux MM, Michal Hocko, Andrew Morton,
	kvm-ppc, linuxppc-dev, KVM list, linux-hyperv, devel, xen-devel,
	X86 ML, Alexander Duyck, Kees Cook, Alex Williamson,
	Allison Randal, Andy Lutomirski, Aneesh Kumar K.V,
	Anshuman Khandual, Anthony Yznaga, Ben Chan,
	Benjamin Herrenschmidt, Borislav Petkov, Boris Ostrovsky,
	Christophe Leroy, Cornelia Huck, Dan Carpenter, Dave Hansen,
	Fabio Estevam, Greg Kroah-Hartman, Haiyang Zhang, H. Peter Anvin,
	Ingo Molnar, Isaac J. Manjarres, Jeremy Sowden, Jim Mattson,
	Joerg Roedel, Johannes Weiner, Juergen Gross, KarimAllah Ahmed,
	Kate Stewart, Kees Cook, K. Y. Srinivasan, Madhumitha Prabakaran,
	Matt Sickler, Mel Gorman, Michael Ellerman, Michal Hocko,
	Mike Rapoport, Mike Rapoport, Nicholas Piggin, Nishka Dasgupta,
	Oscar Salvador, Paolo Bonzini, Paul Mackerras, Paul Mackerras,
	Pavel Tatashin, Pavel Tatashin, Peter Zijlstra, Qian Cai,
	Radim Krčmář,
	Rob Springer, Sasha Levin, Sean Christopherson,
	Simon Sandström, Stefano Stabellini, Stephen Hemminger,
	Thomas Gleixner, Todd Poynor, Vandana BN, Vitaly Kuznetsov,
	Vlastimil Babka, Wanpeng Li, YueHaibing

On 22.10.19 23:54, Dan Williams wrote:
> Hi David,
> 
> Thanks for tackling this!

Thanks for having a look :)

[...]


>> I am probably a little bit too careful (but I don't want to break things).
>> In most places (besides KVM and vfio that are nuts), the
>> pfn_to_online_page() check could most probably be avoided by a
>> is_zone_device_page() check. However, I usually get suspicious when I see
>> a pfn_valid() check (especially after I learned that people mmap parts of
>> /dev/mem into user space, including memory without memmaps. Also, people
>> could memmap offline memory blocks this way :/). As long as this does not
>> hurt performance, I think we should rather do it the clean way.
> 
> I'm concerned about using is_zone_device_page() in places that are not
> known to already have a reference to the page. Here's an audit of
> current usages, and the ones I think need to be cleaned up. The "unsafe"
> ones do not appear to have any protections against the device page
> being removed (get_dev_pagemap()). Yes, some of these were added by
> me. The "unsafe? HMM" ones need HMM eyes because HMM leaks device
> pages into anonymous memory paths and I'm not up to speed on how it
> guarantees 'struct page' validity vs device shutdown without using
> get_dev_pagemap().
> 
> smaps_pmd_entry(): unsafe
> 
> put_devmap_managed_page(): safe, page reference is held
> 
> is_device_private_page(): safe? gpu driver manages private page lifetime
> 
> is_pci_p2pdma_page(): safe, page reference is held
> 
> uncharge_page(): unsafe? HMM
> 
> add_to_kill(): safe, protected by get_dev_pagemap() and dax_lock_page()
> 
> soft_offline_page(): unsafe
> 
> remove_migration_pte(): unsafe? HMM
> 
> move_to_new_page(): unsafe? HMM
> 
> migrate_vma_pages() and helpers: unsafe? HMM
> 
> try_to_unmap_one(): unsafe? HMM
> 
> __put_page(): safe
> 
> release_pages(): safe
> 
> I'm hoping all the HMM ones can be converted to
> is_device_private_page() directly and have that routine grow a nice
> comment about how it knows it can always safely de-reference its @page
> argument.
> 
> For the rest I'd like to propose that we add a facility to determine
> ZONE_DEVICE by pfn rather than page. The most straightforward way I
> can think of would be to just add another bitmap to mem_section_usage
> to indicate if a subsection is ZONE_DEVICE or not.

(it's a somewhat unrelated bigger discussion, but we can start discussing it in this thread)

I dislike this for three reasons:

a) It does not protect against any races; really, it does not improve things.
b) We do have the exact same problem with pfn_to_online_page(). As long as we
   don't hold the memory hotplug lock, memory can get offlined and removed at
   any time. Racy.
c) We mix ZONE-specific stuff into the core. It should be "just another zone".

What I propose instead (already discussed in https://lkml.org/lkml/2019/10/10/87):

1. Convert SECTION_IS_ONLINE to SECTION_IS_ACTIVE
2. Convert SECTION_IS_ACTIVE to a subsection bitmap
3. Introduce pfn_active() that checks against the subsection bitmap
4. Once the memmap was initialized / prepared, set the subsection active
   (similar to SECTION_IS_ONLINE in the buddy right now)
5. Before the memmap gets invalidated, set the subsection inactive
   (similar to SECTION_IS_ONLINE in the buddy right now)
6. pfn_to_online_page() = pfn_active() && zone != ZONE_DEVICE
7. pfn_to_device_page() = pfn_active() && zone == ZONE_DEVICE

Especially, driver-reserved device memory will not get set active in
the subsection bitmap.
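
As a rough userspace model of the pfn_to_online_page() /
pfn_to_device_page() split sketched above (all types and names are
illustrative, not actual kernel interfaces):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A page is reachable only if its subsection is "active"; the zone
 * then decides which of the two lookup helpers hands it out. */
enum zone_type { ZONE_NORMAL, ZONE_DEVICE };

struct page_model {
	enum zone_type zone;
	bool active; /* models the subsection-active bit */
};

#define MAX_PFN 8
static struct page_model memmap[MAX_PFN];

static bool pfn_active(unsigned long pfn)
{
	return pfn < MAX_PFN && memmap[pfn].active;
}

/* pfn_to_online_page() = pfn_active() && zone != ZONE_DEVICE */
static struct page_model *pfn_to_online_page(unsigned long pfn)
{
	if (pfn_active(pfn) && memmap[pfn].zone != ZONE_DEVICE)
		return &memmap[pfn];
	return NULL;
}

/* pfn_to_device_page() = pfn_active() && zone == ZONE_DEVICE */
static struct page_model *pfn_to_device_page(unsigned long pfn)
{
	if (pfn_active(pfn) && memmap[pfn].zone == ZONE_DEVICE)
		return &memmap[pfn];
	return NULL;
}
```

Driver-reserved device memory would simply never have its active bit
set, so both helpers return NULL for it.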

Now to the race. Taking the memory hotplug lock at random places is ugly. I do
wonder if we can use RCU:

The user of pfn_active()/pfn_to_online_page()/pfn_to_device_page():

	/* the memmap is guaranteed to remain active under RCU */
	rcu_read_lock();
	if (pfn_active(random_pfn)) {
		page = pfn_to_page(random_pfn);
		... use the page, stays valid
	}
	rcu_read_unlock();

Memory offlining/memremap code:

	set_subsections_inactive(pfn, nr_pages); /* clears the bit atomically */
	synchronize_rcu();
	/* all users saw the bitmap update, we can invalidate the memmap */
	remove_pfn_range_from_zone(zone, pfn, nr_pages);

> 
>>
>> I only gave it a quick test with DIMMs on x86-64, but didn't test the
>> ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Compile-tested
>> on x86-64 and PPC.
> 
> I'll give it a spin, but I don't think the kernel wants to grow more
> is_zone_device_page() users.

Let's recap. In this RFC, I introduce a total of only 4 (!) users.
The other parts can rely on pfn_to_online_page().

1. "staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved changes"
- Basically never used with ZONE_DEVICE.
- We hold a reference!
- All it protects is a SetPageDirty(page);

2. "staging/gasket: Prepare gasket_release_page() for PG_reserved changes"
- Same as 1.

3. "mm/usercopy.c: Prepare check_page_span() for PG_reserved changes"
- We come via virt_to_head_page(), not sure about 
  references (I assume this should be fine as we don't come via random 
  PFNs)
- We check that we don't mix Reserved (including device memory) and CMA 
  pages when crossing compound pages.
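
A simplified userspace model of that span rule (hypothetical helper and
flag names; the real check lives in check_page_span() in mm/usercopy.c):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of the usercopy span rule: an object crossing page
 * boundaries is acceptable only if every page in the span is Reserved,
 * or every page is CMA. Mixing kinds across a span is rejected. */
enum page_kind { PAGE_NORMAL, PAGE_RESERVED, PAGE_CMA };

static bool span_allowed(const enum page_kind *pages, int nr)
{
	bool all_reserved = true, all_cma = true;

	for (int i = 0; i < nr; i++) {
		all_reserved &= (pages[i] == PAGE_RESERVED);
		all_cma &= (pages[i] == PAGE_CMA);
	}
	/* single-page objects never span, so they are always fine */
	return nr <= 1 || all_reserved || all_cma;
}
```

If ZONE_DEVICE pages stop being PG_reserved, a device-memory span would
no longer fall under the "all Reserved" exception, which is why this
call site needs clarifying.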

I think we can drop 1. and 2., resulting in a total of 2 new users in
the same context. I think that is totally tolerable to finally clean
this up.


However, I think we also have to clarify if we need the change in 3 at all.
It comes from

commit f5509cc18daa7f82bcc553be70df2117c8eedc16
Author: Kees Cook <keescook@chromium.org>
Date:   Tue Jun 7 11:05:33 2016 -0700

    mm: Hardened usercopy
    
    This is the start of porting PAX_USERCOPY into the mainline kernel. This
    is the first set of features, controlled by CONFIG_HARDENED_USERCOPY. The
    work is based on code by PaX Team and Brad Spengler, and an earlier port
    from Casey Schaufler. Additional non-slab page tests are from Rik van Riel.
[...]
    - otherwise, object must not span page allocations (excepting Reserved
      and CMA ranges)

Not sure if we really have to care about ZONE_DEVICE at this point.


-- 

Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 112+ messages in thread



* Re: [PATCH RFC v1 06/12] staging/gasket: Prepare gasket_release_page() for PG_reserved changes
  2019-10-22 17:12   ` David Hildenbrand
  (?)
  (?)
@ 2019-10-23  8:17     ` David Hildenbrand
  -1 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2019-10-23  8:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Michal Hocko, Andrew Morton, kvm-ppc, linuxppc-dev,
	kvm, linux-hyperv, devel, xen-devel, x86, Alexander Duyck,
	Alexander Duyck, Alex Williamson, Allison Randal,
	Andy Lutomirski, Aneesh Kumar K.V, Anshuman Khandual,
	Anthony Yznaga, Ben Chan, Benjamin Herrenschmidt,
	Borislav Petkov, Boris Ostrovsky, Christophe Leroy,
	Cornelia Huck, Dan Carpenter, Dan Williams, Dave Hansen,
	Fabio Estevam, Greg Kroah-Hartman, Haiyang Zhang, H. Peter Anvin,
	Ingo Molnar, Isaac J. Manjarres, Jeremy Sowden, Jim Mattson,
	Joerg Roedel, Johannes Weiner, Juergen Gross, KarimAllah Ahmed,
	Kate Stewart, Kees Cook, K. Y. Srinivasan, Madhumitha Prabakaran,
	Matt Sickler, Mel Gorman, Michael Ellerman, Michal Hocko,
	Mike Rapoport, Mike Rapoport, Nicholas Piggin, Nishka Dasgupta,
	Oscar Salvador, Paolo Bonzini, Paul Mackerras, Paul Mackerras,
	Pavel Tatashin, Pavel Tatashin, Peter Zijlstra, Qian Cai,
	Radim Krčmář,
	Rob Springer, Sasha Levin, Sean Christopherson,
	Simon Sandström, Stefano Stabellini, Stephen Hemminger,
	Thomas Gleixner, Todd Poynor, Vandana BN, Vitaly Kuznetsov,
	Vlastimil Babka, Wanpeng Li, YueHaibing

On 22.10.19 19:12, David Hildenbrand wrote:
> Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
> change that.
> 
> The pages are obtained via get_user_pages_fast(). I assume these
> could be ZONE_DEVICE pages. Let's just explicitly exclude them as well.
> 
> Cc: Rob Springer <rspringer@google.com>
> Cc: Todd Poynor <toddpoynor@google.com>
> Cc: Ben Chan <benchan@chromium.org>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>   drivers/staging/gasket/gasket_page_table.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/staging/gasket/gasket_page_table.c b/drivers/staging/gasket/gasket_page_table.c
> index f6d715787da8..d43fed58bf65 100644
> --- a/drivers/staging/gasket/gasket_page_table.c
> +++ b/drivers/staging/gasket/gasket_page_table.c
> @@ -447,7 +447,7 @@ static bool gasket_release_page(struct page *page)
>   	if (!page)
>   		return false;
>   
> -	if (!PageReserved(page))
> +	if (!PageReserved(page) && !is_zone_device_page(page))
>   		SetPageDirty(page);
>   	put_page(page);
>   
> 


@Dan, is SetPageDirty() on ZONE_DEVICE pages bad or do we simply not 
care? I think that ending up with ZONE_DEVICE pages here is very 
unlikely. I'd like to drop this (and the next) patch and document why it 
is okay to do so.

-- 

Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 112+ messages in thread


* Re: [PATCH RFC v1 02/12] mm/usercopy.c: Prepare check_page_span() for PG_reserved changes
  2019-10-22 17:12   ` David Hildenbrand
  (?)
  (?)
@ 2019-10-23  8:20     ` David Hildenbrand
  -1 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2019-10-23  8:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Michal Hocko, Andrew Morton, kvm-ppc, linuxppc-dev,
	kvm, linux-hyperv, devel, xen-devel, x86, Alexander Duyck,
	Alexander Duyck, Alex Williamson, Allison Randal,
	Andy Lutomirski, Aneesh Kumar K.V, Anshuman Khandual,
	Anthony Yznaga, Ben Chan, Benjamin Herrenschmidt,
	Borislav Petkov, Boris Ostrovsky, Christophe Leroy,
	Cornelia Huck, Dan Carpenter, Dan Williams, Dave Hansen,
	Fabio Estevam, Greg Kroah-Hartman, Haiyang Zhang, H. Peter Anvin,
	Ingo Molnar, Isaac J. Manjarres, Jeremy Sowden, Jim Mattson,
	Joerg Roedel, Johannes Weiner, Juergen Gross, KarimAllah Ahmed,
	Kate Stewart, Kees Cook, K. Y. Srinivasan, Madhumitha Prabakaran,
	Matt Sickler, Mel Gorman, Michael Ellerman, Michal Hocko,
	Mike Rapoport, Mike Rapoport, Nicholas Piggin, Nishka Dasgupta,
	Oscar Salvador, Paolo Bonzini, Paul Mackerras, Paul Mackerras,
	Pavel Tatashin, Pavel Tatashin, Peter Zijlstra, Qian Cai,
	Radim Krčmář,
	Rob Springer, Sasha Levin, Sean Christopherson,
	Simon Sandström, Stefano Stabellini, Stephen Hemminger,
	Thomas Gleixner, Todd Poynor, Vandana BN, Vitaly Kuznetsov,
	Vlastimil Babka, Wanpeng Li, YueHaibing

On 22.10.19 19:12, David Hildenbrand wrote:
> Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
> change that.
> 
> Let's make sure that the logic in the function won't change. Once we no
> longer set these pages to reserved, we can rework this function to
> perform separate checks for ZONE_DEVICE (split from PG_reserved checks).
> 
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Kate Stewart <kstewart@linuxfoundation.org>
> Cc: Allison Randal <allison@lohutok.net>
> Cc: "Isaac J. Manjarres" <isaacm@codeaurora.org>
> Cc: Qian Cai <cai@lca.pw>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>   mm/usercopy.c | 5 +++--
>   1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/usercopy.c b/mm/usercopy.c
> index 660717a1ea5c..a3ac4be35cde 100644
> --- a/mm/usercopy.c
> +++ b/mm/usercopy.c
> @@ -203,14 +203,15 @@ static inline void check_page_span(const void *ptr, unsigned long n,
>   	 * device memory), or CMA. Otherwise, reject since the object spans
>   	 * several independently allocated pages.
>   	 */
> -	is_reserved = PageReserved(page);
> +	is_reserved = PageReserved(page) || is_zone_device_page(page);
>   	is_cma = is_migrate_cma_page(page);
>   	if (!is_reserved && !is_cma)
>   		usercopy_abort("spans multiple pages", NULL, to_user, 0, n);
>   
>   	for (ptr += PAGE_SIZE; ptr <= end; ptr += PAGE_SIZE) {
>   		page = virt_to_head_page(ptr);
> -		if (is_reserved && !PageReserved(page))
> +		if (is_reserved && !(PageReserved(page) ||
> +				     is_zone_device_page(page)))
>   			usercopy_abort("spans Reserved and non-Reserved pages",
>   				       NULL, to_user, 0, n);
>   		if (is_cma && !is_migrate_cma_page(page))
> 

@Kees, would it be okay to stop checking against ZONE_DEVICE pages here 
or is there a good rationale behind this?

(I would turn this patch into a simple update of the comment if we agree 
that we don't care)

-- 

Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH RFC v1 02/12] mm/usercopy.c: Prepare check_page_span() for PG_reserved changes
@ 2019-10-23  8:20     ` David Hildenbrand
  0 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2019-10-23  8:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kate Stewart, Sasha Levin, linux-hyperv, Michal Hocko,
	Radim Krčmář,
	kvm, Pavel Tatashin, KarimAllah Ahmed, Benjamin Herrenschmidt,
	Dave Hansen, Alexander Duyck, Michal Hocko, Paul Mackerras,
	linux-mm, Paul Mackerras, Michael Ellerman, H. Peter Anvin,
	Wanpeng Li, Alexander Duyck, Kees Cook, devel,
	Stefano Stabellini, Stephen Hemminger, Aneesh Kumar K.V,
	Joerg Roedel, x86, YueHaibing, Mike Rapoport,
	Madhumitha Prabakaran, Peter Zijlstra, Ingo Molnar,
	Vlastimil Babka, Nishka Dasgupta, Anthony Yznaga, Oscar Salvador,
	Dan Carpenter, Isaac J. Manjarres, Juergen Gross,
	Anshuman Khandual, Haiyang Zhang, Simon Sandström,
	Dan Williams, kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Todd Poynor, Vitaly Kuznetsov, Allison Randal,
	Jim Mattson, Christophe Leroy, Vandana BN, Greg Kroah-Hartman,
	Cornelia Huck, Pavel Tatashin, Mel Gorman, Sean Christopherson,
	Rob Springer, Thomas Gleixner, Johannes Weiner, Paolo Bonzini,
	Andrew Morton, linuxppc-dev

On 22.10.19 19:12, David Hildenbrand wrote:
> Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
> change that.
> 
> Let's make sure that the logic in the function won't change. Once we no
> longer set these pages to reserved, we can rework this function to
> perform separate checks for ZONE_DEVICE (split from PG_reserved checks).
> 
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Kate Stewart <kstewart@linuxfoundation.org>
> Cc: Allison Randal <allison@lohutok.net>
> Cc: "Isaac J. Manjarres" <isaacm@codeaurora.org>
> Cc: Qian Cai <cai@lca.pw>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>   mm/usercopy.c | 5 +++--
>   1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/usercopy.c b/mm/usercopy.c
> index 660717a1ea5c..a3ac4be35cde 100644
> --- a/mm/usercopy.c
> +++ b/mm/usercopy.c
> @@ -203,14 +203,15 @@ static inline void check_page_span(const void *ptr, unsigned long n,
>   	 * device memory), or CMA. Otherwise, reject since the object spans
>   	 * several independently allocated pages.
>   	 */
> -	is_reserved = PageReserved(page);
> +	is_reserved = PageReserved(page) || is_zone_device_page(page);
>   	is_cma = is_migrate_cma_page(page);
>   	if (!is_reserved && !is_cma)
>   		usercopy_abort("spans multiple pages", NULL, to_user, 0, n);
>   
>   	for (ptr += PAGE_SIZE; ptr <= end; ptr += PAGE_SIZE) {
>   		page = virt_to_head_page(ptr);
> -		if (is_reserved && !PageReserved(page))
> +		if (is_reserved && !(PageReserved(page) ||
> +				     is_zone_device_page(page)))
>   			usercopy_abort("spans Reserved and non-Reserved pages",
>   				       NULL, to_user, 0, n);
>   		if (is_cma && !is_migrate_cma_page(page))
> 

@Kees, would it be okay to stop checking against ZONE_DEVICE pages here 
or is there a good rationale behind this?

(I would turn this patch into a simple update of the comment if we agree 
that we don't care)

-- 

Thanks,

David / dhildenb

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH RFC v1 02/12] mm/usercopy.c: Prepare check_page_span() for PG_reserved changes
@ 2019-10-23  8:20     ` David Hildenbrand
  0 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2019-10-23  8:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kate Stewart, Sasha Levin, linux-hyperv, Michal Hocko,
	Radim Krčmář,
	kvm, Pavel Tatashin, KarimAllah Ahmed, Dave Hansen,
	Alexander Duyck, Michal Hocko, linux-mm, Paul Mackerras,
	H. Peter Anvin, Wanpeng Li, Alexander Duyck, K. Y. Srinivasan,
	Fabio Estevam, Ben Chan, Kees Cook, devel, Stefano Stabellini,
	Stephen Hemminger, Aneesh Kumar K.V, Joerg Roedel, x86,
	YueHaibing, Mike Rapoport, Madhumitha Prabakaran, Peter Zijlstra,
	Ingo Molnar, Vlastimil Babka, Nishka Dasgupta, Anthony Yznaga,
	Oscar Salvador, Dan Carpenter, Isaac J. Manjarres, Matt Sickler,
	Juergen Gross, Anshuman Khandual, Haiyang Zhang,
	Simon Sandström, Dan Williams, kvm-ppc, Qian Cai,
	Alex Williamson, Mike Rapoport, Borislav Petkov, Nicholas Piggin,
	Andy Lutomirski, xen-devel, Boris Ostrovsky, Todd Poynor,
	Vitaly Kuznetsov, Allison Randal, Jim Mattson, Vandana BN,
	Jeremy Sowden, Greg Kroah-Hartman, Cornelia Huck, Pavel Tatashin,
	Mel Gorman, Sean Christopherson, Rob Springer, Thomas Gleixner,
	Johannes Weiner, Paolo Bonzini, Andrew Morton, linuxppc-dev

On 22.10.19 19:12, David Hildenbrand wrote:
> Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
> change that.
> 
> Let's make sure that the logic in the function won't change. Once we no
> longer set these pages to reserved, we can rework this function to
> perform separate checks for ZONE_DEVICE (split from PG_reserved checks).
> 
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Kate Stewart <kstewart@linuxfoundation.org>
> Cc: Allison Randal <allison@lohutok.net>
> Cc: "Isaac J. Manjarres" <isaacm@codeaurora.org>
> Cc: Qian Cai <cai@lca.pw>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>   mm/usercopy.c | 5 +++--
>   1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/usercopy.c b/mm/usercopy.c
> index 660717a1ea5c..a3ac4be35cde 100644
> --- a/mm/usercopy.c
> +++ b/mm/usercopy.c
> @@ -203,14 +203,15 @@ static inline void check_page_span(const void *ptr, unsigned long n,
>   	 * device memory), or CMA. Otherwise, reject since the object spans
>   	 * several independently allocated pages.
>   	 */
> -	is_reserved = PageReserved(page);
> +	is_reserved = PageReserved(page) || is_zone_device_page(page);
>   	is_cma = is_migrate_cma_page(page);
>   	if (!is_reserved && !is_cma)
>   		usercopy_abort("spans multiple pages", NULL, to_user, 0, n);
>   
>   	for (ptr += PAGE_SIZE; ptr <= end; ptr += PAGE_SIZE) {
>   		page = virt_to_head_page(ptr);
> -		if (is_reserved && !PageReserved(page))
> +		if (is_reserved && !(PageReserved(page) ||
> +				     is_zone_device_page(page)))
>   			usercopy_abort("spans Reserved and non-Reserved pages",
>   				       NULL, to_user, 0, n);
>   		if (is_cma && !is_migrate_cma_page(page))
> 

@Kees, would it be okay to stop checking against ZONE_DEVICE pages here 
or is there a good rationale behind this?

(I would turn this patch into a simple update of the comment if we agree 
that we don't care)

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 112+ messages in thread


* Re: [PATCH RFC v1 02/12] mm/usercopy.c: Prepare check_page_span() for PG_reserved changes
  2019-10-23  8:20     ` David Hildenbrand
@ 2019-10-23 16:25       ` Kees Cook
  -1 siblings, 0 replies; 112+ messages in thread
From: Kees Cook @ 2019-10-23 16:25 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Matthew Wilcox, Michal Hocko,
	Andrew Morton, kvm-ppc, linuxppc-dev, kvm, linux-hyperv, devel,
	xen-devel, x86, Alexander Duyck, Alexander Duyck,
	Alex Williamson, Allison Randal, Andy Lutomirski,
	Aneesh Kumar K.V, Anshuman Khandual, Anthony Yznaga, Ben Chan,
	Benjamin Herrenschmidt, Borislav Petkov, Boris Ostrovsky,
	Christophe Leroy, Cornelia Huck, Dan Carpenter, Dan Williams,
	Dave Hansen, Fabio Estevam, Greg Kroah-Hartman, Haiyang Zhang,
	H. Peter Anvin, Ingo Molnar, Isaac J. Manjarres, Jeremy Sowden,
	Jim Mattson, Joerg Roedel, Johannes Weiner, Juergen Gross,
	KarimAllah Ahmed, Kate Stewart, K. Y. Srinivasan,
	Madhumitha Prabakaran, Matt Sickler, Mel Gorman,
	Michael Ellerman, Michal Hocko, Mike Rapoport, Mike Rapoport,
	Nicholas Piggin, Nishka Dasgupta, Oscar Salvador, Paolo Bonzini,
	Paul Mackerras, Paul Mackerras, Pavel Tatashin, Pavel Tatashin,
	Peter Zijlstra, Qian Cai, Radim Krčmář,
	Rob Springer, Sasha Levin, Sean Christopherson,
	Simon Sandström, Stefano Stabellini, Stephen Hemminger,
	Thomas Gleixner, Todd Poynor, Vandana BN, Vitaly Kuznetsov,
	Vlastimil Babka, Wanpeng Li, YueHaibing

On Wed, Oct 23, 2019 at 10:20:14AM +0200, David Hildenbrand wrote:
> On 22.10.19 19:12, David Hildenbrand wrote:
> > Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
> > change that.
> > 
> > Let's make sure that the logic in the function won't change. Once we no
> > longer set these pages to reserved, we can rework this function to
> > perform separate checks for ZONE_DEVICE (split from PG_reserved checks).
> > 
> > Cc: Kees Cook <keescook@chromium.org>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Kate Stewart <kstewart@linuxfoundation.org>
> > Cc: Allison Randal <allison@lohutok.net>
> > Cc: "Isaac J. Manjarres" <isaacm@codeaurora.org>
> > Cc: Qian Cai <cai@lca.pw>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Signed-off-by: David Hildenbrand <david@redhat.com>
> > ---
> >   mm/usercopy.c | 5 +++--
> >   1 file changed, 3 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/usercopy.c b/mm/usercopy.c
> > index 660717a1ea5c..a3ac4be35cde 100644
> > --- a/mm/usercopy.c
> > +++ b/mm/usercopy.c
> > @@ -203,14 +203,15 @@ static inline void check_page_span(const void *ptr, unsigned long n,
> >   	 * device memory), or CMA. Otherwise, reject since the object spans
> >   	 * several independently allocated pages.
> >   	 */
> > -	is_reserved = PageReserved(page);
> > +	is_reserved = PageReserved(page) || is_zone_device_page(page);
> >   	is_cma = is_migrate_cma_page(page);
> >   	if (!is_reserved && !is_cma)
> >   		usercopy_abort("spans multiple pages", NULL, to_user, 0, n);
> >   	for (ptr += PAGE_SIZE; ptr <= end; ptr += PAGE_SIZE) {
> >   		page = virt_to_head_page(ptr);
> > -		if (is_reserved && !PageReserved(page))
> > +		if (is_reserved && !(PageReserved(page) ||
> > +				     is_zone_device_page(page)))
> >   			usercopy_abort("spans Reserved and non-Reserved pages",
> >   				       NULL, to_user, 0, n);
> >   		if (is_cma && !is_migrate_cma_page(page))
> > 
> 
> @Kees, would it be okay to stop checking against ZONE_DEVICE pages here or
> is there a good rationale behind this?
> 
> (I would turn this patch into a simple update of the comment if we agree
> that we don't care)

There has been work to actually remove the page span checks entirely,
but there wasn't consensus on what the right way forward was. I continue
to lean toward just dropping it entirely, but Matthew Wilcox has some
alternative ideas that could use some further thought/testing.
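To make the semantics of the check being debated concrete, here is a small
user-space model of the classification check_page_span() performs with this
patch applied. This is an illustrative sketch only: `struct fake_page` and
`span_allowed()` are made-up stand-ins for the kernel's `struct page`,
`PageReserved()`, `is_zone_device_page()` and `is_migrate_cma_page()`, not
real kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for struct page; each flag models one predicate. */
struct fake_page {
	bool reserved;     /* models PageReserved() */
	bool zone_device;  /* models is_zone_device_page() */
	bool cma;          /* models is_migrate_cma_page() */
};

/*
 * Model of the span check: an object crossing page boundaries is allowed
 * only if all of its pages belong to one homogeneous class, either
 * "reserved" (which, with the patch, includes ZONE_DEVICE) or CMA.
 */
static bool span_allowed(const struct fake_page *pages, size_t n)
{
	bool is_reserved = pages[0].reserved || pages[0].zone_device;
	bool is_cma = pages[0].cma;

	if (!is_reserved && !is_cma)
		return false; /* would abort: "spans multiple pages" */

	for (size_t i = 1; i < n; i++) {
		if (is_reserved && !(pages[i].reserved || pages[i].zone_device))
			return false; /* mixes Reserved and non-Reserved */
		if (is_cma && !pages[i].cma)
			return false; /* mixes CMA and non-CMA */
	}
	return true;
}
```

Dropping the check entirely, as discussed, would amount to `span_allowed()`
always returning true; turning the patch into a comment update would keep
only the `pages[i].reserved` test without the `zone_device` alternative.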

-- 
Kees Cook


^ permalink raw reply	[flat|nested] 112+ messages in thread


* Re: [PATCH RFC v1 02/12] mm/usercopy.c: Prepare check_page_span() for PG_reserved changes
  2019-10-23 16:25       ` Kees Cook
  (?)
  (?)
@ 2019-10-23 16:32         ` David Hildenbrand
  -1 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2019-10-23 16:32 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, linux-mm, Matthew Wilcox, Michal Hocko,
	Andrew Morton, kvm-ppc, linuxppc-dev, kvm, linux-hyperv, devel,
	xen-devel, x86, Alexander Duyck, Alexander Duyck,
	Alex Williamson, Allison Randal, Andy Lutomirski,
	Aneesh Kumar K.V, Anshuman Khandual, Anthony Yznaga, Ben Chan,
	Benjamin Herrenschmidt, Borislav Petkov, Boris Ostrovsky,
	Christophe Leroy, Cornelia Huck, Dan Carpenter, Dan Williams,
	Dave Hansen, Fabio Estevam, Greg Kroah-Hartman, Haiyang Zhang,
	H. Peter Anvin, Ingo Molnar, Isaac J. Manjarres, Jeremy Sowden,
	Jim Mattson, Joerg Roedel, Johannes Weiner, Juergen Gross,
	KarimAllah Ahmed, Kate Stewart, K. Y. Srinivasan,
	Madhumitha Prabakaran, Matt Sickler, Mel Gorman,
	Michael Ellerman, Michal Hocko, Mike Rapoport, Mike Rapoport,
	Nicholas Piggin, Nishka Dasgupta, Oscar Salvador, Paolo Bonzini,
	Paul Mackerras, Paul Mackerras, Pavel Tatashin, Pavel Tatashin,
	Peter Zijlstra, Qian Cai, Radim Krčmář,
	Rob Springer, Sasha Levin, Sean Christopherson,
	Simon Sandström, Stefano Stabellini, Stephen Hemminger,
	Thomas Gleixner, Todd Poynor, Vandana BN, Vitaly Kuznetsov,
	Vlastimil Babka, Wanpeng Li, YueHaibing

On 23.10.19 18:25, Kees Cook wrote:
> On Wed, Oct 23, 2019 at 10:20:14AM +0200, David Hildenbrand wrote:
>> On 22.10.19 19:12, David Hildenbrand wrote:
>>> Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
>>> change that.
>>>
>>> Let's make sure that the logic in the function won't change. Once we no
>>> longer set these pages to reserved, we can rework this function to
>>> perform separate checks for ZONE_DEVICE (split from PG_reserved checks).
>>>
>>> Cc: Kees Cook <keescook@chromium.org>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: Kate Stewart <kstewart@linuxfoundation.org>
>>> Cc: Allison Randal <allison@lohutok.net>
>>> Cc: "Isaac J. Manjarres" <isaacm@codeaurora.org>
>>> Cc: Qian Cai <cai@lca.pw>
>>> Cc: Thomas Gleixner <tglx@linutronix.de>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>>   mm/usercopy.c | 5 +++--
>>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/usercopy.c b/mm/usercopy.c
>>> index 660717a1ea5c..a3ac4be35cde 100644
>>> --- a/mm/usercopy.c
>>> +++ b/mm/usercopy.c
>>> @@ -203,14 +203,15 @@ static inline void check_page_span(const void *ptr, unsigned long n,
>>>   	 * device memory), or CMA. Otherwise, reject since the object spans
>>>   	 * several independently allocated pages.
>>>   	 */
>>> -	is_reserved = PageReserved(page);
>>> +	is_reserved = PageReserved(page) || is_zone_device_page(page);
>>>   	is_cma = is_migrate_cma_page(page);
>>>   	if (!is_reserved && !is_cma)
>>>   		usercopy_abort("spans multiple pages", NULL, to_user, 0, n);
>>>   	for (ptr += PAGE_SIZE; ptr <= end; ptr += PAGE_SIZE) {
>>>   		page = virt_to_head_page(ptr);
>>> -		if (is_reserved && !PageReserved(page))
>>> +		if (is_reserved && !(PageReserved(page) ||
>>> +				     is_zone_device_page(page)))
>>>   			usercopy_abort("spans Reserved and non-Reserved pages",
>>>   				       NULL, to_user, 0, n);
>>>   		if (is_cma && !is_migrate_cma_page(page))
>>>
>>
>> @Kees, would it be okay to stop checking against ZONE_DEVICE pages here or
>> is there a good rationale behind this?
>>
>> (I would turn this patch into a simple update of the comment if we agree
>> that we don't care)
> 
> There has been work to actually remove the page span checks entirely,
> but there wasn't consensus on what the right way forward was. I continue
> to lean toward just dropping it entirely, but Matthew Wilcox has some
> alternative ideas that could use some further thought/testing.

Thanks for your reply!

So, the worst that could happen if we drop this patch is that we would
reject some ranges when hardening is on, correct? (Sounds like that can
easily be found by testing, if it is actually relevant.)

Do you remember if there were real ZONE_DEVICE usecases that required
this filter to be in place for PG_reserved pages?

-- 

Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 112+ messages in thread

>>>   	for (ptr += PAGE_SIZE; ptr <= end; ptr += PAGE_SIZE) {
>>>   		page = virt_to_head_page(ptr);
>>> -		if (is_reserved && !PageReserved(page))
>>> +		if (is_reserved && !(PageReserved(page) ||
>>> +				     is_zone_device_page(page)))
>>>   			usercopy_abort("spans Reserved and non-Reserved pages",
>>>   				       NULL, to_user, 0, n);
>>>   		if (is_cma && !is_migrate_cma_page(page))
>>>
>>
>> @Kees, would it be okay to stop checking against ZONE_DEVICE pages here or
>> is there a good rationale behind this?
>>
>> (I would turn this patch into a simple update of the comment if we agree
>> that we don't care)
> 
> There has been work to actually remove the page span checks entirely,
> but there wasn't consensus on what the right way forward was. I continue
> to lean toward just dropping it entirely, but Matthew Wilcox has some
> alternative ideas that could use some further thought/testing.

Thanks for your reply!

So, the worst thing that could happen right now, when dropping this
patch, is that we would reject some ranges when hardening is on,
correct? (sounds like that can easily be found by testing if it is
actually relevant)

Do you remember if there were real ZONE_DEVICE usecases that required
this filter to be in place for PG_reserved pages?

-- 

Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)
  2019-10-23  7:26     ` David Hildenbrand
  (?)
  (?)
@ 2019-10-23 17:09       ` Dan Williams
  -1 siblings, 0 replies; 112+ messages in thread
From: Dan Williams @ 2019-10-23 17:09 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Linux Kernel Mailing List, Linux MM, Michal Hocko, Andrew Morton,
	kvm-ppc, linuxppc-dev, KVM list, linux-hyperv, devel, xen-devel,
	X86 ML, Alexander Duyck, Kees Cook, Alex Williamson,
	Allison Randal, Andy Lutomirski, Aneesh Kumar K.V,
	Anshuman Khandual, Anthony Yznaga, Ben Chan,
	Benjamin Herrenschmidt, Borislav Petkov, Boris Ostrovsky,
	Christophe Leroy, Cornelia Huck, Dan Carpenter, Dave Hansen,
	Fabio Estevam, Greg Kroah-Hartman, Haiyang Zhang, H. Peter Anvin,
	Ingo Molnar, Isaac J. Manjarres, Jeremy Sowden, Jim Mattson,
	Joerg Roedel, Johannes Weiner, Juergen Gross, KarimAllah Ahmed,
	Kate Stewart, K. Y. Srinivasan, Madhumitha Prabakaran,
	Matt Sickler, Mel Gorman, Michael Ellerman, Michal Hocko,
	Mike Rapoport, Mike Rapoport, Nicholas Piggin, Nishka Dasgupta,
	Oscar Salvador, Paolo Bonzini, Paul Mackerras, Paul Mackerras,
	Pavel Tatashin, Pavel Tatashin, Peter Zijlstra, Qian Cai,
	Radim Krčmář,
	Rob Springer, Sasha Levin, Sean Christopherson,
	Simon Sandström, Stefano Stabellini, Stephen Hemminger,
	Thomas Gleixner, Todd Poynor, Vandana BN, Vitaly Kuznetsov,
	Vlastimil Babka, Wanpeng Li, YueHaibing

On Wed, Oct 23, 2019 at 12:26 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 22.10.19 23:54, Dan Williams wrote:
> > Hi David,
> >
> > Thanks for tackling this!
>
> Thanks for having a look :)
>
> [...]
>
>
> >> I am probably a little bit too careful (but I don't want to break things).
> >> In most places (besides KVM and vfio that are nuts), the
> >> pfn_to_online_page() check could most probably be avoided by a
> >> is_zone_device_page() check. However, I usually get suspicious when I see
> >> a pfn_valid() check (especially after I learned that people mmap parts of
> >> /dev/mem into user space, including memory without memmaps. Also, people
> >> could memmap offline memory blocks this way :/). As long as this does not
> >> hurt performance, I think we should rather do it the clean way.
> >
> > I'm concerned about using is_zone_device_page() in places that are not
> > known to already have a reference to the page. Here's an audit of
> > current usages, and the ones I think need to be cleaned up. The "unsafe"
> > ones do not appear to have any protections against the device page
> > being removed (get_dev_pagemap()). Yes, some of these were added by
> > me. The "unsafe? HMM" ones need HMM eyes because HMM leaks device
> > pages into anonymous memory paths and I'm not up to speed on how it
> > guarantees 'struct page' validity vs device shutdown without using
> > get_dev_pagemap().
> >
> > smaps_pmd_entry(): unsafe
> >
> > put_devmap_managed_page(): safe, page reference is held
> >
> > is_device_private_page(): safe? gpu driver manages private page lifetime
> >
> > is_pci_p2pdma_page(): safe, page reference is held
> >
> > uncharge_page(): unsafe? HMM
> >
> > add_to_kill(): safe, protected by get_dev_pagemap() and dax_lock_page()
> >
> > soft_offline_page(): unsafe
> >
> > remove_migration_pte(): unsafe? HMM
> >
> > move_to_new_page(): unsafe? HMM
> >
> > migrate_vma_pages() and helpers: unsafe? HMM
> >
> > try_to_unmap_one(): unsafe? HMM
> >
> > __put_page(): safe
> >
> > release_pages(): safe
> >
> > I'm hoping all the HMM ones can be converted to
> > is_device_private_page() directly and have that routine grow a nice
> > comment about how it knows it can always safely de-reference its @page
> > argument.
> >
> > For the rest I'd like to propose that we add a facility to determine
> > ZONE_DEVICE by pfn rather than page. The most straightforward way I
> > can think of would be to just add another bitmap to mem_section_usage
> > to indicate if a subsection is ZONE_DEVICE or not.
>
> (it's a somewhat unrelated bigger discussion, but we can start discussing it in this thread)
>
> I dislike this for three reasons
>
> a) It does not protect against any races, really, it does not improve things.
> b) We do have the exact same problem with pfn_to_online_page(). As long as we
>    don't hold the memory hotplug lock, memory can get offlined and removed at any time. Racy.

True, we need to solve that problem too. That seems to want something
lighter weight than the hotplug lock that can be held over pfn lookups
+  use rather than requiring a page lookup in paths where it's not
clear that a page reference would prevent unplug.

> c) We mix in ZONE specific stuff into the core. It should be "just another zone"

Not sure I grok this when the RFC is sprinkling zone-specific
is_zone_device_page() throughout the core?

>
> What I propose instead (already discussed in https://lkml.org/lkml/2019/10/10/87)

Sorry I missed this earlier...

>
> 1. Convert SECTION_IS_ONLINE to SECTION_IS_ACTIVE
> 2. Convert SECTION_IS_ACTIVE to a subsection bitmap
> 3. Introduce pfn_active() that checks against the subsection bitmap
> 4. Once the memmap was initialized / prepared, set the subsection active
>    (similar to SECTION_IS_ONLINE in the buddy right now)
> 5. Before the memmap gets invalidated, set the subsection inactive
>    (similar to SECTION_IS_ONLINE in the buddy right now)
> 6. pfn_to_online_page() = pfn_active() && zone != ZONE_DEVICE
> 7. pfn_to_device_page() = pfn_active() && zone == ZONE_DEVICE

This does not seem to reduce any complexity because it still requires
a pfn to zone lookup at the end of the process.

I.e. converting pfn_to_online_page() to use a new pfn_active()
subsection map plus looking up the zone from pfn_to_page() is more
steps than just doing a direct pfn to zone lookup. What am I missing?

>
> Especially, driver-reserved device memory will not get set active in
> the subsection bitmap.
>
> Now to the race. Taking the memory hotplug lock at random places is ugly. I do
> wonder if we can use RCU:

Ah, yes, exactly what I was thinking above.

>
> The user of pfn_active()/pfn_to_online_page()/pfn_to_device_page():
>
>         /* the memmap is guaranteed to remain active under RCU */
>         rcu_read_lock();
>         if (pfn_active(random_pfn)) {
>                 page = pfn_to_page(random_pfn);
>                 ... use the page, stays valid
>         }
>         rcu_read_unlock();
>
> Memory offlining/memremap code:
>
>         set_subsections_inactive(pfn, nr_pages); /* clears the bit atomically */
>         synchronize_rcu();
>         /* all users saw the bitmap update, we can invalidate the memmap */
>         remove_pfn_range_from_zone(zone, pfn, nr_pages);

Looks good to me.

>
> >
> >>
> >> I only gave it a quick test with DIMMs on x86-64, but didn't test the
> >> ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Compile-tested
> >> on x86-64 and PPC.
> >
> > I'll give it a spin, but I don't think the kernel wants to grow more
> > is_zone_device_page() users.
>
> Let's recap. In this RFC, I introduce a total of 4 (!) users only.
> The other parts can rely on pfn_to_online_page() only.
>
> 1. "staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved changes"
> - Basically never used with ZONE_DEVICE.
> - We hold a reference!
> - All it protects is a SetPageDirty(page);
>
> 2. "staging/gasket: Prepare gasket_release_page() for PG_reserved changes"
> - Same as 1.
>
> 3. "mm/usercopy.c: Prepare check_page_span() for PG_reserved changes"
> - We come via virt_to_head_page(), not sure about
>   references (I assume this should be fine as we don't come via random
>   PFNs)
> - We check that we don't mix Reserved (including device memory) and CMA
>   pages when crossing compound pages.
>
> I think we can drop 1. and 2., resulting in a total of 2 new users in
> the same context. I think that is totally tolerable to finally clean
> this up.

...but more is_zone_device_page() doesn't "finally clean this up".
Like we discussed above it's the missing locking that's the real
cleanup, the pfn_to_online_page() internals are secondary.

>
>
> However, I think we also have to clarify if we need the change in 3 at all.
> It comes from
>
> commit f5509cc18daa7f82bcc553be70df2117c8eedc16
> Author: Kees Cook <keescook@chromium.org>
> Date:   Tue Jun 7 11:05:33 2016 -0700
>
>     mm: Hardened usercopy
>
>     This is the start of porting PAX_USERCOPY into the mainline kernel. This
>     is the first set of features, controlled by CONFIG_HARDENED_USERCOPY. The
>     work is based on code by PaX Team and Brad Spengler, and an earlier port
>     from Casey Schaufler. Additional non-slab page tests are from Rik van Riel.
> [...]
>     - otherwise, object must not span page allocations (excepting Reserved
>       and CMA ranges)
>
> Not sure if we really have to care about ZONE_DEVICE at this point.

That check needs to be careful to ignore ZONE_DEVICE pages. There's
nothing wrong with a copy spanning ZONE_DEVICE and typical pages.


^ permalink raw reply	[flat|nested] 112+ messages in thread

> >
> > remove_migration_pte(): unsafe? HMM
> >
> > move_to_new_page(): unsafe? HMM
> >
> > migrate_vma_pages() and helpers: unsafe? HMM
> >
> > try_to_unmap_one(): unsafe? HMM
> >
> > __put_page(): safe
> >
> > release_pages(): safe
> >
> > I'm hoping all the HMM ones can be converted to
> > is_device_private_page() directly and have that routine grow a nice
> > comment about how it knows it can always safely de-reference its @page
> > argument.
> >
> > For the rest I'd like to propose that we add a facility to determine
> > ZONE_DEVICE by pfn rather than page. The most straightforward way I
> > can think of would be to just add another bitmap to mem_section_usage
> > to indicate if a subsection is ZONE_DEVICE or not.
>
> (it's a somewhat unrelated bigger discussion, but we can start discussing it in this thread)
>
> I dislike this for three reasons
>
> a) It does not protect against any races, really, it does not improve things.
> b) We do have the exact same problem with pfn_to_online_page(). As long as we
>    don't hold the memory hotplug lock, memory can get offlined and removed at any time. Racy.

True, we need to solve that problem too. That seems to want something
lighter weight than the hotplug lock that can be held over pfn lookup
+ use, rather than requiring a page lookup in paths where it's not
clear that a page reference would prevent unplug.

> c) We mix in ZONE specific stuff into the core. It should be "just another zone"

Not sure I grok this when the RFC is sprinkling zone-specific
is_zone_device_page() throughout the core?

>
> What I propose instead (already discussed in https://lkml.org/lkml/2019/10/10/87)

Sorry I missed this earlier...

>
> 1. Convert SECTION_IS_ONLINE to SECTION_IS_ACTIVE
> 2. Convert SECTION_IS_ACTIVE to a subsection bitmap
> 3. Introduce pfn_active() that checks against the subsection bitmap
> 4. Once the memmap was initialized / prepared, set the subsection active
>    (similar to SECTION_IS_ONLINE in the buddy right now)
> 5. Before the memmap gets invalidated, set the subsection inactive
>    (similar to SECTION_IS_ONLINE in the buddy right now)
> 6. pfn_to_online_page() = pfn_active() && zone != ZONE_DEVICE
> 7. pfn_to_device_page() = pfn_active() && zone == ZONE_DEVICE

This does not seem to reduce any complexity because it still requires
a pfn to zone lookup at the end of the process.

I.e. converting pfn_to_online_page() to use a new pfn_active()
subsection map plus looking up the zone from pfn_to_page() is more
steps than just doing a direct pfn to zone lookup. What am I missing?

>
> Especially, driver-reserved device memory will not get set active in
> the subsection bitmap.
>
> Now to the race. Taking the memory hotplug lock at random places is ugly. I do
> wonder if we can use RCU:

Ah, yes, exactly what I was thinking above.

>
> The user of pfn_active()/pfn_to_online_page()/pfn_to_device_page():
>
>         /* the memmap is guaranteed to remain active under RCU */
>         rcu_read_lock();
>         if (pfn_active(random_pfn)) {
>                 page = pfn_to_page(random_pfn);
>                 ... use the page, stays valid
>         }
>         rcu_read_unlock();
>
> Memory offlining/memremap code:
>
>         set_subsections_inactive(pfn, nr_pages); /* clears the bit atomically */
>         synchronize_rcu();
>         /* all users saw the bitmap update, we can invalidate the memmap */
>         remove_pfn_range_from_zone(zone, pfn, nr_pages);

Looks good to me.

>
> >
> >>
> >> I only gave it a quick test with DIMMs on x86-64, but didn't test the
> >> ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Compile-tested
> >> on x86-64 and PPC.
> >
> > I'll give it a spin, but I don't think the kernel wants to grow more
> > is_zone_device_page() users.
>
> Let's recap. In this RFC, I introduce a total of 4 (!) users only.
> The other parts can rely on pfn_to_online_page() only.
>
> 1. "staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved changes"
> - Basically never used with ZONE_DEVICE.
> - We hold a reference!
> - All it protects is a SetPageDirty(page);
>
> 2. "staging/gasket: Prepare gasket_release_page() for PG_reserved changes"
> - Same as 1.
>
> 3. "mm/usercopy.c: Prepare check_page_span() for PG_reserved changes"
> - We come via virt_to_head_page(), not sure about
>   references (I assume this should be fine as we don't come via random
>   PFNs)
> - We check that we don't mix Reserved (including device memory) and CMA
>   pages when crossing compound pages.
>
> I think we can drop 1. and 2., resulting in a total of 2 new users in
> the same context. I think that is totally tolerable to finally clean
> this up.

...but more is_zone_device_page() doesn't "finally clean this up".
Like we discussed above it's the missing locking that's the real
cleanup, the pfn_to_online_page() internals are secondary.

>
>
> However, I think we also have to clarify if we need the change in 3 at all.
> It comes from
>
> commit f5509cc18daa7f82bcc553be70df2117c8eedc16
> Author: Kees Cook <keescook@chromium.org>
> Date:   Tue Jun 7 11:05:33 2016 -0700
>
>     mm: Hardened usercopy
>
>     This is the start of porting PAX_USERCOPY into the mainline kernel. This
>     is the first set of features, controlled by CONFIG_HARDENED_USERCOPY. The
>     work is based on code by PaX Team and Brad Spengler, and an earlier port
>     from Casey Schaufler. Additional non-slab page tests are from Rik van Riel.
> [...]
>     - otherwise, object must not span page allocations (excepting Reserved
>       and CMA ranges)
>
> Not sure if we really have to care about ZONE_DEVICE at this point.

That check needs to be careful to ignore ZONE_DEVICE pages. There's
nothing wrong with a copy spanning ZONE_DEVICE and typical pages.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)
  2019-10-23 17:09       ` Dan Williams
@ 2019-10-23 17:27         ` David Hildenbrand
  -1 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2019-10-23 17:27 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux Kernel Mailing List, Linux MM, Michal Hocko, Andrew Morton,
	kvm-ppc, linuxppc-dev, KVM list, linux-hyperv, devel, xen-devel,
	X86 ML, Alexander Duyck, Kees Cook, Alex Williamson,
	Allison Randal, Andy Lutomirski, Aneesh Kumar K.V,
	Anshuman Khandual, Anthony Yznaga, Ben Chan,
	Benjamin Herrenschmidt, Borislav Petkov, Boris Ostrovsky,
	Christophe Leroy, Cornelia Huck, Dan Carpenter, Dave Hansen,
	Fabio Estevam, Greg Kroah-Hartman, Haiyang Zhang, H. Peter Anvin,
	Ingo Molnar, Isaac J. Manjarres, Jeremy Sowden, Jim Mattson,
	Joerg Roedel, Johannes Weiner, Juergen Gross, KarimAllah Ahmed,
	Kate Stewart, K. Y. Srinivasan, Madhumitha Prabakaran,
	Matt Sickler, Mel Gorman, Michael Ellerman, Michal Hocko,
	Mike Rapoport, Mike Rapoport, Nicholas Piggin, Nishka Dasgupta,
	Oscar Salvador, Paolo Bonzini, Paul Mackerras, Paul Mackerras,
	Pavel Tatashin, Pavel Tatashin, Peter Zijlstra, Qian Cai,
	Radim Krčmář,
	Rob Springer, Sasha Levin, Sean Christopherson,
	Simon Sandström, Stefano Stabellini, Stephen Hemminger,
	Thomas Gleixner, Todd Poynor, Vandana BN, Vitaly Kuznetsov,
	Vlastimil Babka, Wanpeng Li, YueHaibing

>> I dislike this for three reasons
>>
>> a) It does not protect against any races, really, it does not improve things.
>> b) We do have the exact same problem with pfn_to_online_page(). As long as we
>>    don't hold the memory hotplug lock, memory can get offlined and removed at any time. Racy.
> 
> True, we need to solve that problem too. That seems to want something
> lighter weight than the hotplug lock that can be held over pfn lookup
> + use, rather than requiring a page lookup in paths where it's not
> clear that a page reference would prevent unplug.
> 
>> c) We mix in ZONE specific stuff into the core. It should be "just another zone"
> 
> Not sure I grok this when the RFC is sprinkling zone-specific
> is_zone_device_page() throughout the core?

Most users should not care about the zone. pfn_active() would be enough
in most situations, especially most PFN walkers - "this memmap is valid
and e.g., contains a valid zone ...".

> 
>>
>> What I propose instead (already discussed in https://lkml.org/lkml/2019/10/10/87)
> 
> Sorry I missed this earlier...
> 
>>
>> 1. Convert SECTION_IS_ONLINE to SECTION_IS_ACTIVE
>> 2. Convert SECTION_IS_ACTIVE to a subsection bitmap
>> 3. Introduce pfn_active() that checks against the subsection bitmap
>> 4. Once the memmap was initialized / prepared, set the subsection active
>>    (similar to SECTION_IS_ONLINE in the buddy right now)
>> 5. Before the memmap gets invalidated, set the subsection inactive
>>    (similar to SECTION_IS_ONLINE in the buddy right now)
>> 6. pfn_to_online_page() = pfn_active() && zone != ZONE_DEVICE
>> 7. pfn_to_device_page() = pfn_active() && zone == ZONE_DEVICE
> 
> This does not seem to reduce any complexity because it still requires
> a pfn to zone lookup at the end of the process.
> 
> I.e. converting pfn_to_online_page() to use a new pfn_active()
> subsection map plus looking up the zone from pfn_to_page() is more
> steps than just doing a direct pfn to zone lookup. What am I missing?

That a real "pfn to zone" lookup without going via the struct page would
require more than just a single bitmap. IMHO, keeping the information
in a single place (the memmap) is the clean thing to do (not
replicating it somewhere else). Going via the memmap might not be as
fast as a direct lookup, but do we actually care? We are already looking
at "random PFNs we are not sure have a valid memmap".
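For illustration, the lookup being debated can be modeled in plain C. This is a userspace sketch only: pfn_active(), the subsection granularity, and the memmap array are hypothetical stand-ins for the proposal, not existing kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model: 64 subsections of 16 pfns each (granularity is made up). */
#define PFNS_PER_SUBSECTION 16
#define NR_SUBSECTIONS      64

enum zone_type { ZONE_NORMAL, ZONE_DEVICE };

struct page {
	enum zone_type zone;
};

/* One "active" bit per subsection, set once the memmap is initialized. */
static unsigned long subsection_active_map;
static struct page memmap[NR_SUBSECTIONS * PFNS_PER_SUBSECTION];

static bool pfn_active(unsigned long pfn)
{
	return subsection_active_map & (1UL << (pfn / PFNS_PER_SUBSECTION));
}

static struct page *pfn_to_page(unsigned long pfn)
{
	return &memmap[pfn];
}

/* pfn_active() && zone != ZONE_DEVICE */
static struct page *pfn_to_online_page(unsigned long pfn)
{
	struct page *page;

	if (!pfn_active(pfn))
		return NULL;
	page = pfn_to_page(pfn);
	return page->zone != ZONE_DEVICE ? page : NULL;
}

/* pfn_active() && zone == ZONE_DEVICE */
static struct page *pfn_to_device_page(unsigned long pfn)
{
	struct page *page;

	if (!pfn_active(pfn))
		return NULL;
	page = pfn_to_page(pfn);
	return page->zone == ZONE_DEVICE ? page : NULL;
}
```

The trade-off under discussion is visible here: Dan's counter-proposal (a ZONE_DEVICE bitmap in mem_section_usage) would answer the zone question from a bitmap alone, while this variant answers it from the memmap itself, at the cost of the extra pfn_to_page() dereference.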

>>
>> Especially, driver-reserved device memory will not get set active in
>> the subsection bitmap.
>>
>> Now to the race. Taking the memory hotplug lock at random places is ugly. I do
>> wonder if we can use RCU:
> 
> Ah, yes, exactly what I was thinking above.
> 
>>
>> The user of pfn_active()/pfn_to_online_page()/pfn_to_device_page():
>>
>>         /* the memmap is guaranteed to remain active under RCU */
>>         rcu_read_lock();
>>         if (pfn_active(random_pfn)) {
>>                 page = pfn_to_page(random_pfn);
>>                 ... use the page, stays valid
>>         }
>>         rcu_read_unlock();
>>
>> Memory offlining/memremap code:
>>
>>         set_subsections_inactive(pfn, nr_pages); /* clears the bit atomically */
>>         synchronize_rcu();
>>         /* all users saw the bitmap update, we can invalidate the memmap */
>>         remove_pfn_range_from_zone(zone, pfn, nr_pages);
> 
> Looks good to me.
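The set/clear side of that sketch could be approximated in userspace with C11 atomics standing in for the kernel primitives. The names set_subsections_active()/set_subsections_inactive() are hypothetical; in the kernel, the reader side would sit under rcu_read_lock() and the clear side would be followed by synchronize_rcu() before the memmap is torn down.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define PFNS_PER_SUBSECTION 16

/* One bit per subsection; readers of pfn_active() only see fully
 * prepared memmaps because the bit is published after initialization
 * and retracted before teardown. */
static _Atomic unsigned long subsection_active_map;

static unsigned long subsection_bits(unsigned long pfn,
				     unsigned long nr_pages)
{
	unsigned long first = pfn / PFNS_PER_SUBSECTION;
	unsigned long last = (pfn + nr_pages - 1) / PFNS_PER_SUBSECTION;
	unsigned long bits = 0;

	for (unsigned long s = first; s <= last; s++)
		bits |= 1UL << s;
	return bits;
}

static bool pfn_active(unsigned long pfn)
{
	unsigned long map = atomic_load_explicit(&subsection_active_map,
						 memory_order_acquire);
	return map & (1UL << (pfn / PFNS_PER_SUBSECTION));
}

/* After the memmap has been initialized: publish the range. */
static void set_subsections_active(unsigned long pfn,
				   unsigned long nr_pages)
{
	atomic_fetch_or_explicit(&subsection_active_map,
				 subsection_bits(pfn, nr_pages),
				 memory_order_release);
}

/* Before the memmap is invalidated: retract the range.  The kernel
 * version would synchronize_rcu() here before freeing anything. */
static void set_subsections_inactive(unsigned long pfn,
				     unsigned long nr_pages)
{
	atomic_fetch_and_explicit(&subsection_active_map,
				  ~subsection_bits(pfn, nr_pages),
				  memory_order_release);
}
```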
> 
>>
>>>
>>>>
>>>> I only gave it a quick test with DIMMs on x86-64, but didn't test the
>>>> ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Compile-tested
>>>> on x86-64 and PPC.
>>>
>>> I'll give it a spin, but I don't think the kernel wants to grow more
>>> is_zone_device_page() users.
>>
>> Let's recap. In this RFC, I introduce a total of 4 (!) users only.
>> The other parts can rely on pfn_to_online_page() only.
>>
>> 1. "staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved changes"
>> - Basically never used with ZONE_DEVICE.
>> - We hold a reference!
>> - All it protects is a SetPageDirty(page);
>>
>> 2. "staging/gasket: Prepare gasket_release_page() for PG_reserved changes"
>> - Same as 1.
>>
>> 3. "mm/usercopy.c: Prepare check_page_span() for PG_reserved changes"
>> - We come via virt_to_head_page(), not sure about
>>   references (I assume this should be fine as we don't come via random
>>   PFNs)
>> - We check that we don't mix Reserved (including device memory) and CMA
>>   pages when crossing compound pages.
>>
>> I think we can drop 1. and 2., resulting in a total of 2 new users in
>> the same context. I think that is totally tolerable to finally clean
>> this up.
> 
> ...but more is_zone_device_page() doesn't "finally clean this up".
> Like we discussed above it's the missing locking that's the real
> cleanup, the pfn_to_online_page() internals are secondary.

It's a different cleanup IMHO. We can't do everything in one shot. But
maybe I can drop the is_zone_device_page() parts from this patch and
completely rely on pfn_to_online_page(). Yes, that needs fixing too, but
it's a different story.

The important part of this patch:

While pfn_to_online_page() will always exclude ZONE_DEVICE pages,
checking PG_reserved on ZONE_DEVICE pages (what we do right now!) is
racy as hell (especially when concurrently initializing the memmap).

This does improve the situation.

>>
>> However, I think we also have to clarify if we need the change in 3 at all.
>> It comes from
>>
>> commit f5509cc18daa7f82bcc553be70df2117c8eedc16
>> Author: Kees Cook <keescook@chromium.org>
>> Date:   Tue Jun 7 11:05:33 2016 -0700
>>
>>     mm: Hardened usercopy
>>
>>     This is the start of porting PAX_USERCOPY into the mainline kernel. This
>>     is the first set of features, controlled by CONFIG_HARDENED_USERCOPY. The
>>     work is based on code by PaX Team and Brad Spengler, and an earlier port
>>     from Casey Schaufler. Additional non-slab page tests are from Rik van Riel.
>> [...]
>>     - otherwise, object must not span page allocations (excepting Reserved
>>       and CMA ranges)
>>
>> Not sure if we really have to care about ZONE_DEVICE at this point.
> 
> That check needs to be careful to ignore ZONE_DEVICE pages. There's
> nothing wrong with a copy spanning ZONE_DEVICE and typical pages.

Please note that the current check would *forbid* this (AFAICS for a
single heap object). As discussed in the relevant patch, we might be
able to just stop doing that and limit it to real PG_reserved pages
(without ZONE_DEVICE). I'd be happy to not introduce new
is_zone_device_page() users.
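To make the disagreement concrete, here is a heavily simplified model of the check_page_span() rule in question (the real code in mm/usercopy.c inspects many more page types; the two flags here are stand-ins):

```c
#include <assert.h>
#include <stdbool.h>

struct page {
	bool reserved;	/* PG_reserved */
	bool cma;	/* part of a CMA range */
};

/* Simplified check_page_span() rule: an object that spans several pages
 * is only tolerated when every page is Reserved, or every page is CMA. */
static bool span_allowed(const struct page *pages, int nr)
{
	bool all_reserved = true, all_cma = true;

	for (int i = 0; i < nr; i++) {
		all_reserved = all_reserved && pages[i].reserved;
		all_cma = all_cma && pages[i].cma;
	}
	return all_reserved || all_cma;
}
```

With today's semantics a ZONE_DEVICE page is PG_reserved, so a span of only device pages passes while a span mixing device and typical pages is rejected, which is exactly the behavior at issue here: the check has to be revisited once ZONE_DEVICE pages stop being marked reserved.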

-- 

Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)
@ 2019-10-23 17:27         ` David Hildenbrand
  0 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2019-10-23 17:27 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kate Stewart, linux-hyperv, Michal Hocko,
	Radim Krčmář,
	KVM list, Pavel Tatashin, KarimAllah Ahmed,
	Benjamin Herrenschmidt, Dave Hansen, Alexander Duyck,
	Michal Hocko, Paul Mackerras, Linux MM, Paul Mackerras,
	Michael Ellerman, H. Peter Anvin, Wanpeng Li, Pavel Tatashin,
	devel, Stefano Stabellini, Stephen Hemminger, Aneesh Kumar K.V,
	Joerg Roedel, X86 ML, YueHaibing, Mike Rapoport,
	Madhumitha Prabakaran, Peter Zijlstra, Ingo Molnar,
	Vlastimil Babka, Nishka Dasgupta, Anthony Yznaga, Oscar Salvador,
	Dan Carpenter, Isaac J. Manjarres, Kees Cook, Anshuman Khandual,
	Haiyang Zhang, Simon Sandström, Sasha Levin, Juergen Gross,
	kvm-ppc, Qian Cai, Alex Williamson, Mike Rapoport,
	Borislav Petkov, Nicholas Piggin, Andy Lutomirski, xen-devel,
	Boris Ostrovsky, Todd Poynor, Vitaly Kuznetsov, Allison Randal,
	Jim Mattson, Christophe Leroy, Vandana BN, Mel Gorman,
	Greg Kroah-Hartman, Cornelia Huck, Linux Kernel Mailing List,
	Sean Christopherson, Rob Springer, Thomas Gleixner,
	Johannes Weiner, Paolo Bonzini, Andrew Morton, linuxppc-dev

>> I dislike this for three reasons
>>
>> a) It does not protect against any races, really, it does not improve things.
>> b) We do have the exact same problem with pfn_to_online_page(). As long as we
>>    don't hold the memory hotplug lock, memory can get offlined and remove any time. Racy.
> 
> True, we need to solve that problem too. That seems to want something
> lighter weight than the hotplug lock that can be held over pfn lookups
> +  use rather than requiring a page lookup in paths where it's not
> clear that a page reference would prevent unplug.
> 
>> c) We mix in ZONE specific stuff into the core. It should be "just another zone"
> 
> Not sure I grok this when the RFC is sprinkling zone-specific
> is_zone_device_page() throughout the core?

Most users should not care about the zone. pfn_active() would be enough
in most situations, especially most PFN walkers - "this memmap is valid
and e.g., contains a valid zone ...".

> 
>>
>> What I propose instead (already discussed in https://lkml.org/lkml/2019/10/10/87)
> 
> Sorry I missed this earlier...
> 
>>
>> 1. Convert SECTION_IS_ONLINE to SECTION_IS_ACTIVE
>> 2. Convert SECTION_IS_ACTIVE to a subsection bitmap
>> 3. Introduce pfn_active() that checks against the subsection bitmap
>> 4. Once the memmap was initialized / prepared, set the subsection active
>>    (similar to SECTION_IS_ONLINE in the buddy right now)
>> 5. Before the memmap gets invalidated, set the subsection inactive
>>    (similar to SECTION_IS_ONLINE in the buddy right now)
>> 5. pfn_to_online_page() = pfn_active() && zone != ZONE_DEVICE
>> 6. pfn_to_device_page() = pfn_active() && zone == ZONE_DEVICE
> 
> This does not seem to reduce any complexity because it still requires
> a pfn to zone lookup at the end of the process.
> 
> I.e. converting pfn_to_online_page() to use a new pfn_active()
> subsection map plus looking up the zone from pfn_to_page() is more
> steps than just doing a direct pfn to zone lookup. What am I missing?

That a real "pfn to zone" lookup without going via the struct page will
require to have more than just a single bitmap. IMHO, keeping the
information at a single place (memmap) is the clean thing to do (not
replicating it somewhere else). Going via the memmap might not be as
fast as a direct lookup, but do we actually care? We are already looking
at "random PFNs we are not sure if there is a valid memmap".

>>
>> Especially, driver-reserved device memory will not get set active in
>> the subsection bitmap.
>>
>> Now to the race. Taking the memory hotplug lock at random places is ugly. I do
>> wonder if we can use RCU:
> 
> Ah, yes, exactly what I was thinking above.
> 
>>
>> The user of pfn_active()/pfn_to_online_page()/pfn_to_device_page():
>>
>>         /* the memmap is guaranteed to remain active under RCU */
>>         rcu_read_lock();
>>         if (pfn_active(random_pfn)) {
>>                 page = pfn_to_page(random_pfn);
>>                 ... use the page, stays valid
>>         }
>>         rcu_unread_lock();
>>
>> Memory offlining/memremap code:
>>
>>         set_subsections_inactive(pfn, nr_pages); /* clears the bit atomically */
>>         synchronize_rcu();
>>         /* all users saw the bitmap update, we can invalide the memmap */
>>         remove_pfn_range_from_zone(zone, pfn, nr_pages);
> 
> Looks good to me.
> 
>>
>>>
>>>>
>>>> I only gave it a quick test with DIMMs on x86-64, but didn't test the
>>>> ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Compile-tested
>>>> on x86-64 and PPC.
>>>
>>> I'll give it a spin, but I don't think the kernel wants to grow more
>>> is_zone_device_page() users.
>>
>> Let's recap. In this RFC, I introduce a total of 4 (!) users only.
>> The other parts can rely on pfn_to_online_page() only.
>>
>> 1. "staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved changes"
>> - Basically never used with ZONE_DEVICE.
>> - We hold a reference!
>> - All it protects is a SetPageDirty(page);
>>
>> 2. "staging/gasket: Prepare gasket_release_page() for PG_reserved changes"
>> - Same as 1.
>>
>> 3. "mm/usercopy.c: Prepare check_page_span() for PG_reserved changes"
>> - We come via virt_to_head_page() / virt_to_head_page(), not sure about
>>   references (I assume this should be fine as we don't come via random
>>   PFNs)
>> - We check that we don't mix Reserved (including device memory) and CMA
>>   pages when crossing compound pages.
>>
>> I think we can drop 1. and 2., resulting in a total of 2 new users in
>> the same context. I think that is totally tolerable to finally clean
>> this up.
> 
> ...but more is_zone_device_page() doesn't "finally clean this up".
> Like we discussed above it's the missing locking that's the real
> cleanup, the pfn_to_online_page() internals are secondary.

It's a different cleanup IMHO. We can't do everything in one shot. But
maybe I can drop the is_zone_device_page() parts from this patch and
completely rely on pfn_to_online_page(). Yes, that needs fixing to, but
it's a different story.

The important part of this patch:

While pfn_to_online_page() will always exclude ZONE_DEVICE pages,
checking PG_reserved on ZONE_DEVICE pages (what we do right now!) is
racy as hell (especially when concurrently initializing the memmap).

This does improve the situation.

>>
>> However, I think we also have to clarify if we need the change in 3 at all.
>> It comes from
>>
>> commit f5509cc18daa7f82bcc553be70df2117c8eedc16
>> Author: Kees Cook <keescook@chromium.org>
>> Date:   Tue Jun 7 11:05:33 2016 -0700
>>
>>     mm: Hardened usercopy
>>
>>     This is the start of porting PAX_USERCOPY into the mainline kernel. This
>>     is the first set of features, controlled by CONFIG_HARDENED_USERCOPY. The
>>     work is based on code by PaX Team and Brad Spengler, and an earlier port
>>     from Casey Schaufler. Additional non-slab page tests are from Rik van Riel.
>> [...]
>>     - otherwise, object must not span page allocations (excepting Reserved
>>       and CMA ranges)
>>
>> Not sure if we really have to care about ZONE_DEVICE at this point.
> 
> That check needs to be careful to ignore ZONE_DEVICE pages. There's
> nothing wrong with a copy spanning ZONE_DEVICE and typical pages.

Please note that the current check would *forbid* this (AFAIKs for a
single heap object). As discussed in the relevant patch, we might be
able to just stop doing that and limit it to real PG_reserved pages
(without ZONE_DEVICE). I'd be happy to not introduce new
is_zone_device_page() users.

-- 

Thanks,

David / dhildenb

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)
@ 2019-10-23 17:27         ` David Hildenbrand
  0 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2019-10-23 17:27 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kate Stewart, linux-hyperv, Michal Hocko,
	Radim Krčmář,
	KVM list, Pavel Tatashin, KarimAllah Ahmed, Dave Hansen,
	Alexander Duyck, Michal Hocko, Linux MM, Paul Mackerras,
	H. Peter Anvin, Wanpeng Li, K. Y. Srinivasan, Fabio Estevam,
	Ben Chan, Pavel Tatashin, devel, Stefano Stabellini,
	Stephen Hemminger, Aneesh Kumar K.V, Joerg Roedel, X86 ML,
	YueHaibing, Mike Rapoport, Madhumitha Prabakaran, Peter Zijlstra,
	Ingo Molnar, Vlastimil Babka, Nishka Dasgupta, Anthony Yznaga,
	Oscar Salvador, Dan Carpenter, Isaac J. Manjarres, Matt Sickler,
	Kees Cook, Anshuman Khandual, Haiyang Zhang,
	Simon Sandström, Sasha Levin, Juergen Gross, kvm-ppc,
	Qian Cai, Alex Williamson, Mike Rapoport, Borislav Petkov,
	Nicholas Piggin, Andy Lutomirski, xen-devel, Boris Ostrovsky,
	Todd Poynor, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Vandana BN, Jeremy Sowden, Mel Gorman, Greg Kroah-Hartman,
	Cornelia Huck, Linux Kernel Mailing List, Sean Christopherson,
	Rob Springer, Thomas Gleixner, Johannes Weiner, Paolo Bonzini,
	Andrew Morton, linuxppc-dev

>> I dislike this for three reasons
>>
>> a) It does not protect against any races, really, it does not improve things.
>> b) We do have the exact same problem with pfn_to_online_page(). As long as we
>>    don't hold the memory hotplug lock, memory can get offlined and remove any time. Racy.
> 
> True, we need to solve that problem too. That seems to want something
> lighter weight than the hotplug lock that can be held over pfn lookups
> +  use rather than requiring a page lookup in paths where it's not
> clear that a page reference would prevent unplug.
> 
>> c) We mix in ZONE specific stuff into the core. It should be "just another zone"
> 
> Not sure I grok this when the RFC is sprinkling zone-specific
> is_zone_device_page() throughout the core?

Most users should not care about the zone. pfn_active() would be enough
in most situations, especially most PFN walkers - "this memmap is valid
and e.g., contains a valid zone ...".

> 
>>
>> What I propose instead (already discussed in https://lkml.org/lkml/2019/10/10/87)
> 
> Sorry I missed this earlier...
> 
>>
>> 1. Convert SECTION_IS_ONLINE to SECTION_IS_ACTIVE
>> 2. Convert SECTION_IS_ACTIVE to a subsection bitmap
>> 3. Introduce pfn_active() that checks against the subsection bitmap
>> 4. Once the memmap was initialized / prepared, set the subsection active
>>    (similar to SECTION_IS_ONLINE in the buddy right now)
>> 5. Before the memmap gets invalidated, set the subsection inactive
>>    (similar to SECTION_IS_ONLINE in the buddy right now)
>> 5. pfn_to_online_page() = pfn_active() && zone != ZONE_DEVICE
>> 6. pfn_to_device_page() = pfn_active() && zone == ZONE_DEVICE
> 
> This does not seem to reduce any complexity because it still requires
> a pfn to zone lookup at the end of the process.
> 
> I.e. converting pfn_to_online_page() to use a new pfn_active()
> subsection map plus looking up the zone from pfn_to_page() is more
> steps than just doing a direct pfn to zone lookup. What am I missing?

That a real "pfn to zone" lookup without going via the struct page will
require to have more than just a single bitmap. IMHO, keeping the
information at a single place (memmap) is the clean thing to do (not
replicating it somewhere else). Going via the memmap might not be as
fast as a direct lookup, but do we actually care? We are already looking
at "random PFNs we are not sure if there is a valid memmap".

>>
>> Especially, driver-reserved device memory will not get set active in
>> the subsection bitmap.
>>
>> Now to the race. Taking the memory hotplug lock at random places is ugly. I do
>> wonder if we can use RCU:
> 
> Ah, yes, exactly what I was thinking above.
> 
>>
>> The user of pfn_active()/pfn_to_online_page()/pfn_to_device_page():
>>
>>         /* the memmap is guaranteed to remain active under RCU */
>>         rcu_read_lock();
>>         if (pfn_active(random_pfn)) {
>>                 page = pfn_to_page(random_pfn);
>>                 ... use the page, stays valid
>>         }
>>         rcu_unread_lock();
>>
>> Memory offlining/memremap code:
>>
>>         set_subsections_inactive(pfn, nr_pages); /* clears the bit atomically */
>>         synchronize_rcu();
>>         /* all users saw the bitmap update, we can invalide the memmap */
>>         remove_pfn_range_from_zone(zone, pfn, nr_pages);
> 
> Looks good to me.
> 
>>
>>>
>>>>
>>>> I only gave it a quick test with DIMMs on x86-64, but didn't test the
>>>> ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Compile-tested
>>>> on x86-64 and PPC.
>>>
>>> I'll give it a spin, but I don't think the kernel wants to grow more
>>> is_zone_device_page() users.
>>
>> Let's recap. In this RFC, I introduce a total of 4 (!) users only.
>> The other parts can rely on pfn_to_online_page() only.
>>
>> 1. "staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved changes"
>> - Basically never used with ZONE_DEVICE.
>> - We hold a reference!
>> - All it protects is a SetPageDirty(page);
>>
>> 2. "staging/gasket: Prepare gasket_release_page() for PG_reserved changes"
>> - Same as 1.
>>
>> 3. "mm/usercopy.c: Prepare check_page_span() for PG_reserved changes"
>> - We come via virt_to_head_page() / virt_to_head_page(), not sure about
>>   references (I assume this should be fine as we don't come via random
>>   PFNs)
>> - We check that we don't mix Reserved (including device memory) and CMA
>>   pages when crossing compound pages.
>>
>> I think we can drop 1. and 2., resulting in a total of 2 new users in
>> the same context. I think that is totally tolerable to finally clean
>> this up.
> 
> ...but more is_zone_device_page() doesn't "finally clean this up".
> Like we discussed above it's the missing locking that's the real
> cleanup, the pfn_to_online_page() internals are secondary.

It's a different cleanup IMHO. We can't do everything in one shot. But
maybe I can drop the is_zone_device_page() parts from this patch and
completely rely on pfn_to_online_page(). Yes, that needs fixing to, but
it's a different story.

The important part of this patch:

While pfn_to_online_page() will always exclude ZONE_DEVICE pages,
checking PG_reserved on ZONE_DEVICE pages (what we do right now!) is
racy as hell (especially when concurrently initializing the memmap).

This does improve the situation.

>>
>> However, I think we also have to clarify if we need the change in 3 at all.
>> It comes from
>>
>> commit f5509cc18daa7f82bcc553be70df2117c8eedc16
>> Author: Kees Cook <keescook@chromium.org>
>> Date:   Tue Jun 7 11:05:33 2016 -0700
>>
>>     mm: Hardened usercopy
>>
>>     This is the start of porting PAX_USERCOPY into the mainline kernel. This
>>     is the first set of features, controlled by CONFIG_HARDENED_USERCOPY. The
>>     work is based on code by PaX Team and Brad Spengler, and an earlier port
>>     from Casey Schaufler. Additional non-slab page tests are from Rik van Riel.
>> [...]
>>     - otherwise, object must not span page allocations (excepting Reserved
>>       and CMA ranges)
>>
>> Not sure if we really have to care about ZONE_DEVICE at this point.
> 
> That check needs to be careful to ignore ZONE_DEVICE pages. There's
> nothing wrong with a copy spanning ZONE_DEVICE and typical pages.

Please note that the current check would *forbid* this (AFAIKs for a
single heap object). As discussed in the relevant patch, we might be
able to just stop doing that and limit it to real PG_reserved pages
(without ZONE_DEVICE). I'd be happy to not introduce new
is_zone_device_page() users.

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)
  2019-10-23 17:27         ` David Hildenbrand
  (?)
  (?)
@ 2019-10-23 19:39           ` Dan Williams
  -1 siblings, 0 replies; 112+ messages in thread
From: Dan Williams @ 2019-10-23 19:39 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Linux Kernel Mailing List, Linux MM, Michal Hocko, Andrew Morton,
	kvm-ppc, linuxppc-dev, KVM list, linux-hyperv, devel, xen-devel,
	X86 ML, Alexander Duyck, Kees Cook, Alex Williamson,
	Allison Randal, Andy Lutomirski, Aneesh Kumar K.V,
	Anshuman Khandual, Anthony Yznaga, Ben Chan,
	Benjamin Herrenschmidt, Borislav Petkov, Boris Ostrovsky,
	Christophe Leroy, Cornelia Huck, Dan Carpenter, Dave Hansen,
	Fabio Estevam, Greg Kroah-Hartman, Haiyang Zhang, H. Peter Anvin,
	Ingo Molnar, Isaac J. Manjarres, Jeremy Sowden, Jim Mattson,
	Joerg Roedel, Johannes Weiner, Juergen Gross, KarimAllah Ahmed,
	Kate Stewart, K. Y. Srinivasan, Madhumitha Prabakaran,
	Matt Sickler, Mel Gorman, Michael Ellerman, Michal Hocko,
	Mike Rapoport, Mike Rapoport, Nicholas Piggin, Nishka Dasgupta,
	Oscar Salvador, Paolo Bonzini, Paul Mackerras, Paul Mackerras,
	Pavel Tatashin, Pavel Tatashin, Peter Zijlstra, Qian Cai,
	Radim Krčmář,
	Rob Springer, Sasha Levin, Sean Christopherson,
	Simon Sandström, Stefano Stabellini, Stephen Hemminger,
	Thomas Gleixner, Todd Poynor, Vandana BN, Vitaly Kuznetsov,
	Vlastimil Babka, Wanpeng Li, YueHaibing

On Wed, Oct 23, 2019 at 10:28 AM David Hildenbrand <david@redhat.com> wrote:
>
> >> I dislike this for three reasons
> >>
> >> a) It does not protect against any races, really, it does not improve things.
> >> b) We do have the exact same problem with pfn_to_online_page(). As long as we
> >>    don't hold the memory hotplug lock, memory can get offlined and removed at any time. Racy.
> >
> > True, we need to solve that problem too. That seems to want something
> > lighter weight than the hotplug lock that can be held over pfn lookups
> > +  use rather than requiring a page lookup in paths where it's not
> > clear that a page reference would prevent unplug.
> >
> >> c) We mix in ZONE specific stuff into the core. It should be "just another zone"
> >
> > Not sure I grok this when the RFC is sprinkling zone-specific
> > is_zone_device_page() throughout the core?
>
> Most users should not care about the zone. pfn_active() would be enough
> in most situations, especially most PFN walkers - "this memmap is valid
> and e.g., contains a valid zone ...".

Oh, I see, you're saying convert most users to pfn_active() (and some
TBD rcu locking), but only pfn_to_online_page() users would need the
zone lookup? I can get on board with that.
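[Editor's note: the division of labor being agreed on here — a subsection bitmap answering "is this memmap valid?" and a zone check only inside pfn_to_online_page() — can be sketched in user-space C. All names, sizes, and the flat global arrays below are hypothetical simplifications; the real kernel hangs this state off mem_section and struct page.]

```c
/* User-space sketch of the pfn_active() / pfn_to_online_page() split.
 * Hypothetical simplification: one flat bitmap plus a per-subsection
 * zone array standing in for the zone stored in the memmap. */
#include <stdbool.h>
#include <stdint.h>

#define SUBSECTION_SHIFT 15            /* e.g. 2 MiB subsections, 4 KiB pages */
#define NR_SUBSECTIONS   1024

enum zone_type { ZONE_NORMAL, ZONE_DEVICE };

static uint64_t active_bitmap[NR_SUBSECTIONS / 64];
static enum zone_type zone_of[NR_SUBSECTIONS];  /* stand-in for page_zone() */

/* "This memmap is valid and e.g. contains a valid zone." */
static bool pfn_active(unsigned long pfn)
{
        unsigned long idx = pfn >> SUBSECTION_SHIFT;

        return active_bitmap[idx / 64] & (1ULL << (idx % 64));
}

/* Set after the memmap was initialized/prepared (step 4 above). */
static void set_subsection_active(unsigned long pfn, enum zone_type zone)
{
        unsigned long idx = pfn >> SUBSECTION_SHIFT;

        zone_of[idx] = zone;
        active_bitmap[idx / 64] |= 1ULL << (idx % 64);
}

/* pfn_to_online_page() = pfn_active() && zone != ZONE_DEVICE */
static bool pfn_is_online(unsigned long pfn)
{
        return pfn_active(pfn) && zone_of[pfn >> SUBSECTION_SHIFT] != ZONE_DEVICE;
}

/* pfn_to_device_page() = pfn_active() && zone == ZONE_DEVICE */
static bool pfn_is_device(unsigned long pfn)
{
        return pfn_active(pfn) && zone_of[pfn >> SUBSECTION_SHIFT] == ZONE_DEVICE;
}
```

Only the two pfn_to_*_page() helpers look at the zone; every other PFN walker could stop at pfn_active().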

>
> >
> >>
> >> What I propose instead (already discussed in https://lkml.org/lkml/2019/10/10/87)
> >
> > Sorry I missed this earlier...
> >
> >>
> >> 1. Convert SECTION_IS_ONLINE to SECTION_IS_ACTIVE
> >> 2. Convert SECTION_IS_ACTIVE to a subsection bitmap
> >> 3. Introduce pfn_active() that checks against the subsection bitmap
> >> 4. Once the memmap was initialized / prepared, set the subsection active
> >>    (similar to SECTION_IS_ONLINE in the buddy right now)
> >> 5. Before the memmap gets invalidated, set the subsection inactive
> >>    (similar to SECTION_IS_ONLINE in the buddy right now)
> >> 5. pfn_to_online_page() = pfn_active() && zone != ZONE_DEVICE
> >> 6. pfn_to_device_page() = pfn_active() && zone == ZONE_DEVICE
> >
> > This does not seem to reduce any complexity because it still requires
> > a pfn to zone lookup at the end of the process.
> >
> > I.e. converting pfn_to_online_page() to use a new pfn_active()
> > subsection map plus looking up the zone from pfn_to_page() is more
> > steps than just doing a direct pfn to zone lookup. What am I missing?
>
> That a real "pfn to zone" lookup without going via the struct page will
> require to have more than just a single bitmap. IMHO, keeping the
> information at a single place (memmap) is the clean thing to do (not
> replicating it somewhere else). Going via the memmap might not be as
> fast as a direct lookup, but do we actually care? We are already looking
> at "random PFNs we are not sure if there is a valid memmap".

True, we only care about the validity of the check, and as you pointed
out moving the check to the pfn level does not solve the validity
race. It needs a lock.
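[Editor's note: the RCU protocol quoted below can be modeled deterministically in a single user-space process. The reader only dereferences the memmap inside a read-side section after seeing the subsection active; the offline path clears the bit, waits out readers, and only then invalidates. The RCU-style names are stand-ins, not the kernel implementation.]

```c
/* Toy single-process model of the pfn_active() + RCU protocol.
 * Everything here is a hypothetical stand-in for the kernel primitives. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static atomic_int  readers;             /* outstanding read-side sections */
static atomic_bool subsection_active;   /* the "bitmap" for one subsection */
static int  fake_memmap[1];             /* memmap backing the pfn */
static int *memmap = fake_memmap;

static void toy_rcu_read_lock(void)   { atomic_fetch_add(&readers, 1); }
static void toy_rcu_read_unlock(void) { atomic_fetch_sub(&readers, 1); }

static void toy_synchronize_rcu(void)
{
        /* In the kernel this blocks; here all readers must already be done. */
        while (atomic_load(&readers) != 0)
                ;
}

/* Reader: only dereference the memmap while the subsection is active. */
static bool reader_sees_page(void)
{
        bool seen = false;

        toy_rcu_read_lock();
        if (atomic_load(&subsection_active))
                seen = (memmap != NULL);    /* safe: memmap still valid here */
        toy_rcu_read_unlock();
        return seen;
}

/* Offline/memremap path: clear the bit, wait for readers, then invalidate. */
static void offline_subsection(void)
{
        atomic_store(&subsection_active, false);
        toy_synchronize_rcu();
        memmap = NULL;                      /* no reader can reach the pfn now */
}
```

The point of the ordering is that no reader can still be dereferencing the memmap once toy_synchronize_rcu() returns, so the invalidation is safe.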

>
> >>
> >> Especially, driver-reserved device memory will not get set active in
> >> the subsection bitmap.
> >>
> >> Now to the race. Taking the memory hotplug lock at random places is ugly. I do
> >> wonder if we can use RCU:
> >
> > Ah, yes, exactly what I was thinking above.
> >
> >>
> >> The user of pfn_active()/pfn_to_online_page()/pfn_to_device_page():
> >>
> >>         /* the memmap is guaranteed to remain active under RCU */
> >>         rcu_read_lock();
> >>         if (pfn_active(random_pfn)) {
> >>                 page = pfn_to_page(random_pfn);
> >>                 ... use the page, stays valid
> >>         }
> >>         rcu_read_unlock();
> >>
> >> Memory offlining/memremap code:
> >>
> >>         set_subsections_inactive(pfn, nr_pages); /* clears the bit atomically */
> >>         synchronize_rcu();
> >>         /* all users saw the bitmap update, we can invalidate the memmap */
> >>         remove_pfn_range_from_zone(zone, pfn, nr_pages);
> >
> > Looks good to me.
> >
> >>
> >>>
> >>>>
> >>>> I only gave it a quick test with DIMMs on x86-64, but didn't test the
> >>>> ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Compile-tested
> >>>> on x86-64 and PPC.
> >>>
> >>> I'll give it a spin, but I don't think the kernel wants to grow more
> >>> is_zone_device_page() users.
> >>
> >> Let's recap. In this RFC, I introduce a total of 4 (!) users only.
> >> The other parts can rely on pfn_to_online_page() only.
> >>
> >> 1. "staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved changes"
> >> - Basically never used with ZONE_DEVICE.
> >> - We hold a reference!
> >> - All it protects is a SetPageDirty(page);
> >>
> >> 2. "staging/gasket: Prepare gasket_release_page() for PG_reserved changes"
> >> - Same as 1.
> >>
> >> 3. "mm/usercopy.c: Prepare check_page_span() for PG_reserved changes"
> >> - We come via virt_to_head_page() / virt_to_head_page(), not sure about
> >>   references (I assume this should be fine as we don't come via random
> >>   PFNs)
> >> - We check that we don't mix Reserved (including device memory) and CMA
> >>   pages when crossing compound pages.
> >>
> >> I think we can drop 1. and 2., resulting in a total of 2 new users in
> >> the same context. I think that is totally tolerable to finally clean
> >> this up.
> >
> > ...but more is_zone_device_page() doesn't "finally clean this up".
> > Like we discussed above it's the missing locking that's the real
> > cleanup, the pfn_to_online_page() internals are secondary.
>
> It's a different cleanup IMHO. We can't do everything in one shot. But
> maybe I can drop the is_zone_device_page() parts from this patch and
> completely rely on pfn_to_online_page(). Yes, that needs fixing too, but
> it's a different story.
>
> The important part of this patch:
>
> While pfn_to_online_page() will always exclude ZONE_DEVICE pages,
> checking PG_reserved on ZONE_DEVICE pages (what we do right now!) is
> racy as hell (especially when concurrently initializing the memmap).
>
> This does improve the situation.

True, that's a race vector I was not considering.

>
> >>
> >> However, I think we also have to clarify if we need the change in 3 at all.
> >> It comes from
> >>
> >> commit f5509cc18daa7f82bcc553be70df2117c8eedc16
> >> Author: Kees Cook <keescook@chromium.org>
> >> Date:   Tue Jun 7 11:05:33 2016 -0700
> >>
> >>     mm: Hardened usercopy
> >>
> >>     This is the start of porting PAX_USERCOPY into the mainline kernel. This
> >>     is the first set of features, controlled by CONFIG_HARDENED_USERCOPY. The
> >>     work is based on code by PaX Team and Brad Spengler, and an earlier port
> >>     from Casey Schaufler. Additional non-slab page tests are from Rik van Riel.
> >> [...]
> >>     - otherwise, object must not span page allocations (excepting Reserved
> >>       and CMA ranges)
> >>
> >> Not sure if we really have to care about ZONE_DEVICE at this point.
> >
> > That check needs to be careful to ignore ZONE_DEVICE pages. There's
> > nothing wrong with a copy spanning ZONE_DEVICE and typical pages.
>
> Please note that the current check would *forbid* this (AFAICS for a
> single heap object). As discussed in the relevant patch, we might be
> able to just stop doing that and limit it to real PG_reserved pages
> (without ZONE_DEVICE). I'd be happy to not introduce new
> is_zone_device_page() users.

At least non-HMM ZONE_DEVICE usage, i.e. the dax + pmem stuff, is
excluded from this path by:

52f476a323f9 libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead

So this case is one more to add to the pile of HMM auditing.
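[Editor's note: for reference, the check_page_span() rule being debated in this sub-thread — an object may not span a page boundary unless both pages are Reserved or both are CMA — reduces to roughly the following. The flag bits and helper are hypothetical simplifications of mm/usercopy.c, not the real interface.]

```c
/* Rough model of the hardened-usercopy span rule. Since ZONE_DEVICE
 * pages are currently PG_reserved, a device page next to an ordinary
 * page fails the check — exactly the case discussed above. */
#include <stdbool.h>

enum { PG_RESERVED = 1 << 0, PG_CMA = 1 << 1 };

/* May a single heap object span from a page with start_flags
 * to an adjacent page with end_flags? */
static bool span_allowed(unsigned int start_flags, unsigned int end_flags)
{
        if (start_flags & end_flags & PG_RESERVED)
                return true;    /* both Reserved (incl. ZONE_DEVICE today) */
        if (start_flags & end_flags & PG_CMA)
                return true;    /* both CMA */
        return false;           /* mixed or ordinary pages: reject */
}
```

A span from a Reserved (device) page into a typical page returns false, which is the behavior David notes the current check would *forbid*.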


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)
@ 2019-10-23 19:39           ` Dan Williams
  0 siblings, 0 replies; 112+ messages in thread
From: Dan Williams @ 2019-10-23 19:39 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kate Stewart, linux-hyperv, Michal Hocko,
	Radim Krčmář,
	KVM list, Pavel Tatashin, KarimAllah Ahmed, Dave Hansen,
	Alexander Duyck, Michal Hocko, Linux MM, Paul Mackerras,
	H. Peter Anvin, Wanpeng Li, K. Y. Srinivasan, Fabio Estevam,
	Ben Chan, Pavel Tatashin, devel, Stefano Stabellini,
	Stephen Hemminger, Aneesh Kumar K.V, Joerg Roedel, X86 ML,
	YueHaibing, Mike Rapoport, Madhumitha Prabakaran, Peter Zijlstra,
	Ingo Molnar, Vlastimil Babka, Nishka Dasgupta, Anthony Yznaga,
	Oscar Salvador, Dan Carpenter, Isaac J. Manjarres, Matt Sickler,
	Kees Cook, Anshuman Khandual, Haiyang Zhang,
	Simon Sandström, Sasha Levin, Juergen Gross, kvm-ppc,
	Qian Cai, Alex Williamson, Mike Rapoport, Borislav Petkov,
	Nicholas Piggin, Andy Lutomirski, xen-devel, Boris Ostrovsky,
	Todd Poynor, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Vandana BN, Jeremy Sowden, Mel Gorman, Greg Kroah-Hartman,
	Cornelia Huck, Linux Kernel Mailing List, Sean Christopherson,
	Rob Springer, Thomas Gleixner, Johannes Weiner, Paolo Bonzini,
	Andrew Morton, linuxppc-dev

On Wed, Oct 23, 2019 at 10:28 AM David Hildenbrand <david@redhat.com> wrote:
>
> >> I dislike this for three reasons
> >>
> >> a) It does not protect against any races, really, it does not improve things.
> >> b) We do have the exact same problem with pfn_to_online_page(). As long as we
> >>    don't hold the memory hotplug lock, memory can get offlined and removed at any time. Racy.
> >
> > True, we need to solve that problem too. That seems to want something
> > lighter weight than the hotplug lock that can be held over pfn lookups
> > +  use rather than requiring a page lookup in paths where it's not
> > clear that a page reference would prevent unplug.
> >
> >> c) We mix in ZONE specific stuff into the core. It should be "just another zone"
> >
> > Not sure I grok this when the RFC is sprinkling zone-specific
> > is_zone_device_page() throughout the core?
>
> Most users should not care about the zone. pfn_active() would be enough
> in most situations, especially most PFN walkers - "this memmap is valid
> and e.g., contains a valid zone ...".

Oh, I see, you're saying convert most users to pfn_active() (and some
TBD rcu locking), but only pfn_to_online_page() users would need the
zone lookup? I can get on board with that.

>
> >
> >>
> >> What I propose instead (already discussed in https://lkml.org/lkml/2019/10/10/87)
> >
> > Sorry I missed this earlier...
> >
> >>
> >> 1. Convert SECTION_IS_ONLINE to SECTION_IS_ACTIVE
> >> 2. Convert SECTION_IS_ACTIVE to a subsection bitmap
> >> 3. Introduce pfn_active() that checks against the subsection bitmap
> >> 4. Once the memmap was initialized / prepared, set the subsection active
> >>    (similar to SECTION_IS_ONLINE in the buddy right now)
> >> 5. Before the memmap gets invalidated, set the subsection inactive
> >>    (similar to SECTION_IS_ONLINE in the buddy right now)
> >> 5. pfn_to_online_page() = pfn_active() && zone != ZONE_DEVICE
> >> 6. pfn_to_device_page() = pfn_active() && zone == ZONE_DEVICE
> >
> > This does not seem to reduce any complexity because it still requires
> > a pfn to zone lookup at the end of the process.
> >
> > I.e. converting pfn_to_online_page() to use a new pfn_active()
> > subsection map plus looking up the zone from pfn_to_page() is more
> > steps than just doing a direct pfn to zone lookup. What am I missing?
>
> That a real "pfn to zone" lookup without going via the struct page will
> require to have more than just a single bitmap. IMHO, keeping the
> information at a single place (memmap) is the clean thing to do (not
> replicating it somewhere else). Going via the memmap might not be as
> fast as a direct lookup, but do we actually care? We are already looking
> at "random PFNs we are not sure if there is a valid memmap".

True, we only care about the validity of the check, and as you pointed
out moving the check to the pfn level does not solve the validity
race. It needs a lock.

>
> >>
> >> Especially, driver-reserved device memory will not get set active in
> >> the subsection bitmap.
> >>
> >> Now to the race. Taking the memory hotplug lock at random places is ugly. I do
> >> wonder if we can use RCU:
> >
> > Ah, yes, exactly what I was thinking above.
> >
> >>
> >> The user of pfn_active()/pfn_to_online_page()/pfn_to_device_page():
> >>
> >>         /* the memmap is guaranteed to remain active under RCU */
> >>         rcu_read_lock();
> >>         if (pfn_active(random_pfn)) {
> >>                 page = pfn_to_page(random_pfn);
> >>                 ... use the page, stays valid
> >>         }
> >>         rcu_read_unlock();
> >>
> >> Memory offlining/memremap code:
> >>
> >>         set_subsections_inactive(pfn, nr_pages); /* clears the bit atomically */
> >>         synchronize_rcu();
> >>         /* all users saw the bitmap update, we can invalidate the memmap */
> >>         remove_pfn_range_from_zone(zone, pfn, nr_pages);
> >
> > Looks good to me.
> >
> >>
> >>>
> >>>>
> >>>> I only gave it a quick test with DIMMs on x86-64, but didn't test the
> >>>> ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Compile-tested
> >>>> on x86-64 and PPC.
> >>>
> >>> I'll give it a spin, but I don't think the kernel wants to grow more
> >>> is_zone_device_page() users.
> >>
> >> Let's recap. In this RFC, I introduce a total of 4 (!) users only.
> >> The other parts can rely on pfn_to_online_page() only.
> >>
> >> 1. "staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved changes"
> >> - Basically never used with ZONE_DEVICE.
> >> - We hold a reference!
> >> - All it protects is a SetPageDirty(page);
> >>
> >> 2. "staging/gasket: Prepare gasket_release_page() for PG_reserved changes"
> >> - Same as 1.
> >>
> >> 3. "mm/usercopy.c: Prepare check_page_span() for PG_reserved changes"
> >> - We come via virt_to_head_page(), not sure about
> >>   references (I assume this should be fine as we don't come via random
> >>   PFNs)
> >> - We check that we don't mix Reserved (including device memory) and CMA
> >>   pages when crossing compound pages.
> >>
> >> I think we can drop 1. and 2., resulting in a total of 2 new users in
> >> the same context. I think that is totally tolerable to finally clean
> >> this up.
> >
> > ...but more is_zone_device_page() doesn't "finally clean this up".
> > Like we discussed above it's the missing locking that's the real
> > cleanup, the pfn_to_online_page() internals are secondary.
>
> It's a different cleanup IMHO. We can't do everything in one shot. But
> maybe I can drop the is_zone_device_page() parts from this patch and
> completely rely on pfn_to_online_page(). Yes, that needs fixing too, but
> it's a different story.
>
> The important part of this patch:
>
> While pfn_to_online_page() will always exclude ZONE_DEVICE pages,
> checking PG_reserved on ZONE_DEVICE pages (what we do right now!) is
> racy as hell (especially when concurrently initializing the memmap).
>
> This does improve the situation.

True, that's a race vector I was not considering.

>
> >>
> >> However, I think we also have to clarify if we need the change in 3 at all.
> >> It comes from
> >>
> >> commit f5509cc18daa7f82bcc553be70df2117c8eedc16
> >> Author: Kees Cook <keescook@chromium.org>
> >> Date:   Tue Jun 7 11:05:33 2016 -0700
> >>
> >>     mm: Hardened usercopy
> >>
> >>     This is the start of porting PAX_USERCOPY into the mainline kernel. This
> >>     is the first set of features, controlled by CONFIG_HARDENED_USERCOPY. The
> >>     work is based on code by PaX Team and Brad Spengler, and an earlier port
> >>     from Casey Schaufler. Additional non-slab page tests are from Rik van Riel.
> >> [...]
> >>     - otherwise, object must not span page allocations (excepting Reserved
> >>       and CMA ranges)
> >>
> >> Not sure if we really have to care about ZONE_DEVICE at this point.
> >
> > That check needs to be careful to ignore ZONE_DEVICE pages. There's
> > nothing wrong with a copy spanning ZONE_DEVICE and typical pages.
>
> Please note that the current check would *forbid* this (AFAIKs for a
> single heap object). As discussed in the relevant patch, we might be
> able to just stop doing that and limit it to real PG_reserved pages
> (without ZONE_DEVICE). I'd be happy to not introduce new
> is_zone_device_page() users.

At least non-HMM ZONE_DEVICE usage, i.e. the dax + pmem stuff, is
excluded from this path by:

52f476a323f9 libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead

So this case is one more to add to the pile of HMM auditing.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)
  2019-10-23 19:39           ` Dan Williams
@ 2019-10-23 21:22             ` David Hildenbrand
  -1 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2019-10-23 21:22 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux Kernel Mailing List, Linux MM, Michal Hocko, Andrew Morton,
	kvm-ppc, linuxppc-dev, KVM list, linux-hyperv, devel, xen-devel,
	X86 ML, Alexander Duyck, Kees Cook, Alex Williamson,
	Allison Randal, Andy Lutomirski, Aneesh Kumar K.V,
	Anshuman Khandual, Anthony Yznaga, Ben Chan,
	Benjamin Herrenschmidt, Borislav Petkov, Boris Ostrovsky,
	Christophe Leroy, Cornelia Huck, Dan Carpenter, Dave Hansen,
	Fabio Estevam, Greg Kroah-Hartman, Haiyang Zhang, H. Peter Anvin,
	Ingo Molnar, Isaac J. Manjarres, Jeremy Sowden, Jim Mattson,
	Joerg Roedel, Johannes Weiner, Juergen Gross, KarimAllah Ahmed,
	Kate Stewart, K. Y. Srinivasan, Madhumitha Prabakaran,
	Matt Sickler, Mel Gorman, Michael Ellerman, Michal Hocko,
	Mike Rapoport, Mike Rapoport, Nicholas Piggin, Nishka Dasgupta,
	Oscar Salvador, Paolo Bonzini, Paul Mackerras, Paul Mackerras,
	Pavel Tatashin, Pavel Tatashin, Peter Zijlstra, Qian Cai,
	Radim Krčmář,
	Rob Springer, Sasha Levin, Sean Christopherson,
	Simon Sandström, Stefano Stabellini, Stephen Hemminger,
	Thomas Gleixner, Todd Poynor, Vandana BN, Vitaly Kuznetsov,
	Vlastimil Babka, Wanpeng Li, YueHaibing

On 23.10.19 21:39, Dan Williams wrote:
> On Wed, Oct 23, 2019 at 10:28 AM David Hildenbrand <david@redhat.com> wrote:
>>
>>>> I dislike this for three reasons
>>>>
>>>> a) It does not protect against any races, really, it does not improve things.
>>>> b) We do have the exact same problem with pfn_to_online_page(). As long as we
>>>>    don't hold the memory hotplug lock, memory can get offlined and removed at any time. Racy.
>>>
>>> True, we need to solve that problem too. That seems to want something
>>> lighter weight than the hotplug lock that can be held over pfn lookups
>>> +  use rather than requiring a page lookup in paths where it's not
>>> clear that a page reference would prevent unplug.
>>>
>>>> c) We mix in ZONE specific stuff into the core. It should be "just another zone"
>>>
>>> Not sure I grok this when the RFC is sprinkling zone-specific
>>> is_zone_device_page() throughout the core?
>>
>> Most users should not care about the zone. pfn_active() would be enough
>> in most situations, especially most PFN walkers - "this memmap is valid
>> and e.g., contains a valid zone ...".
> 
> Oh, I see, you're saying convert most users to pfn_active() (and some
> TBD rcu locking), but only pfn_to_online_page() users would need the
> zone lookup? I can get on board with that.

I guess my answer to that is simple: if we only care about "is this
memmap safe to touch", use pfn_active()

(well, with a pfn_valid_within() check similar to what
pfn_to_online_page() does due to memory holes, but these are details -
e.g., pfn_active() can check pfn_valid_within() right away internally),
plus the locking TBD to make sure it remains active.

However, if we additionally want to special-case on zones (!ZONE_DEVICE
(a.k.a., onlined via memory blocks, managed by the buddy), ZONE_DEVICE,
whatever might come in the future, ...), also access the zone stored in
the memmap, e.g., by using pfn_to_online_page().

> 
>>
>>>
>>>>
>>>> What I propose instead (already discussed in https://lkml.org/lkml/2019/10/10/87)
>>>
>>> Sorry I missed this earlier...
>>>
>>>>
>>>> 1. Convert SECTION_IS_ONLINE to SECTION_IS_ACTIVE
>>>> 2. Convert SECTION_IS_ACTIVE to a subsection bitmap
>>>> 3. Introduce pfn_active() that checks against the subsection bitmap
>>>> 4. Once the memmap was initialized / prepared, set the subsection active
>>>>    (similar to SECTION_IS_ONLINE in the buddy right now)
>>>> 5. Before the memmap gets invalidated, set the subsection inactive
>>>>    (similar to SECTION_IS_ONLINE in the buddy right now)
>>>> 5. pfn_to_online_page() = pfn_active() && zone != ZONE_DEVICE
>>>> 6. pfn_to_device_page() = pfn_active() && zone == ZONE_DEVICE
>>>
>>> This does not seem to reduce any complexity because it still requires
>>> a pfn to zone lookup at the end of the process.
>>>
>>> I.e. converting pfn_to_online_page() to use a new pfn_active()
>>> subsection map plus looking up the zone from pfn_to_page() is more
>>> steps than just doing a direct pfn to zone lookup. What am I missing?
>>
>> That a real "pfn to zone" lookup without going via the struct page will
>> require more than just a single bitmap. IMHO, keeping the
>> information at a single place (memmap) is the clean thing to do (not
>> replicating it somewhere else). Going via the memmap might not be as
>> fast as a direct lookup, but do we actually care? We are already looking
>> at "random PFNs we are not sure if there is a valid memmap".
> 
> True, we only care about the validity of the check, and as you pointed
> out moving the check to the pfn level does not solve the validity
> race. It needs a lock.

Let's call pfn_active() "a pfn that is active in the system and has an
initialized memmap, which contains sane values" (valid memmap sounds
like pfn_valid(), which is actually "there is a memmap which might
contain garbage"). Yes we need some sort of lightweight locking as
discussed.

[...]

>>>> However, I think we also have to clarify if we need the change in 3 at all.
>>>> It comes from
>>>>
>>>> commit f5509cc18daa7f82bcc553be70df2117c8eedc16
>>>> Author: Kees Cook <keescook@chromium.org>
>>>> Date:   Tue Jun 7 11:05:33 2016 -0700
>>>>
>>>>     mm: Hardened usercopy
>>>>
>>>>     This is the start of porting PAX_USERCOPY into the mainline kernel. This
>>>>     is the first set of features, controlled by CONFIG_HARDENED_USERCOPY. The
>>>>     work is based on code by PaX Team and Brad Spengler, and an earlier port
>>>>     from Casey Schaufler. Additional non-slab page tests are from Rik van Riel.
>>>> [...]
>>>>     - otherwise, object must not span page allocations (excepting Reserved
>>>>       and CMA ranges)
>>>>
>>>> Not sure if we really have to care about ZONE_DEVICE at this point.
>>>
>>> That check needs to be careful to ignore ZONE_DEVICE pages. There's
>>> nothing wrong with a copy spanning ZONE_DEVICE and typical pages.
>>
>> Please note that the current check would *forbid* this (AFAIKs for a
>> single heap object). As discussed in the relevant patch, we might be
>> able to just stop doing that and limit it to real PG_reserved pages
>> (without ZONE_DEVICE). I'd be happy to not introduce new
>> is_zone_device_page() users.
> 
> At least for non-HMM ZONE_DEVICE usage, i.e. the dax + pmem stuff, is
> excluded from this path by:
> 
> 52f476a323f9 libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead

Interesting, and very valuable information. So this sounds like patch #2
can go (or convert it to a documentation update).

> 
> So this case is one more to add to the pile of HMM auditing.

Sounds like HMM is some dangerous piece of software we have. This needs
auditing, fixing, and documentation.

BTW, do you have a good source of details about HMM? Especially about
these oddities you mentioned?

Also, can you have a look at patch #2 7/8 and confirm that doing a
SetPageDirty() on a ZONE_DEVICE page is okay (although not useful)? Thanks!

-- 

Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)
@ 2019-10-23 21:22             ` David Hildenbrand
  0 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2019-10-23 21:22 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kate Stewart, linux-hyperv, Michal Hocko,
	Radim Krčmář,
	KVM list, Pavel Tatashin, KarimAllah Ahmed, Dave Hansen,
	Alexander Duyck, Michal Hocko, Linux MM, Paul Mackerras,
	H. Peter Anvin, Wanpeng Li, K. Y. Srinivasan, Fabio Estevam,
	Ben Chan, Pavel Tatashin, devel, Stefano Stabellini,
	Stephen Hemminger, Aneesh Kumar K.V, Joerg Roedel, X86 ML,
	YueHaibing, Mike Rapoport, Madhumitha Prabakaran, Peter Zijlstra,
	Ingo Molnar, Vlastimil Babka, Nishka Dasgupta, Anthony Yznaga,
	Oscar Salvador, Dan Carpenter, Isaac J. Manjarres, Matt Sickler,
	Kees Cook, Anshuman Khandual, Haiyang Zhang,
	Simon Sandström, Sasha Levin, Juergen Gross, kvm-ppc,
	Qian Cai, Alex Williamson, Mike Rapoport, Borislav Petkov,
	Nicholas Piggin, Andy Lutomirski, xen-devel, Boris Ostrovsky,
	Todd Poynor, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Vandana BN, Jeremy Sowden, Mel Gorman, Greg Kroah-Hartman,
	Cornelia Huck, Linux Kernel Mailing List, Sean Christopherson,
	Rob Springer, Thomas Gleixner, Johannes Weiner, Paolo Bonzini,
	Andrew Morton, linuxppc-dev

On 23.10.19 21:39, Dan Williams wrote:
> On Wed, Oct 23, 2019 at 10:28 AM David Hildenbrand <david@redhat.com> wrote:
>>
>>>> I dislike this for three reasons
>>>>
>>>> a) It does not protect against any races, really, it does not improve things.
>>>> b) We do have the exact same problem with pfn_to_online_page(). As long as we
>>>>    don't hold the memory hotplug lock, memory can get offlined and remove any time. Racy.
>>>
>>> True, we need to solve that problem too. That seems to want something
>>> lighter weight than the hotplug lock that can be held over pfn lookups
>>> +  use rather than requiring a page lookup in paths where it's not
>>> clear that a page reference would prevent unplug.
>>>
>>>> c) We mix in ZONE specific stuff into the core. It should be "just another zone"
>>>
>>> Not sure I grok this when the RFC is sprinkling zone-specific
>>> is_zone_device_page() throughout the core?
>>
>> Most users should not care about the zone. pfn_active() would be enough
>> in most situations, especially most PFN walkers - "this memmap is valid
>> and e.g., contains a valid zone ...".
> 
> Oh, I see, you're saying convert most users to pfn_active() (and some
> TBD rcu locking), but only pfn_to_online_page() users would need the
> zone lookup? I can get on board with that.

I guess my answer to that is simple: If we only care about "is this
memmap safe to touch", use pfn_active()

(well, with pfn_valid_within() similar as done in pfn_to_online_page()
due to memory holes, but these are details - e.g., pfn_active() can
check against pfn_valid_within() right away internally). (+locking TBD
to make sure it remains active)

However, if we want to special case in addition on zones (!ZONE_DEVICE
(a.k.a., onlined via memory blocks, managed by the buddy), ZONE_DEVICE,
whatever might come in the future, ...), also access the zone stored in
the memmap. E.g., by using pfn_to_online_page().

> 
>>
>>>
>>>>
>>>> What I propose instead (already discussed in https://lkml.org/lkml/2019/10/10/87)
>>>
>>> Sorry I missed this earlier...
>>>
>>>>
>>>> 1. Convert SECTION_IS_ONLINE to SECTION_IS_ACTIVE
>>>> 2. Convert SECTION_IS_ACTIVE to a subsection bitmap
>>>> 3. Introduce pfn_active() that checks against the subsection bitmap
>>>> 4. Once the memmap was initialized / prepared, set the subsection active
>>>>    (similar to SECTION_IS_ONLINE in the buddy right now)
>>>> 5. Before the memmap gets invalidated, set the subsection inactive
>>>>    (similar to SECTION_IS_ONLINE in the buddy right now)
>>>> 5. pfn_to_online_page() = pfn_active() && zone != ZONE_DEVICE
>>>> 6. pfn_to_device_page() = pfn_active() && zone == ZONE_DEVICE
>>>
>>> This does not seem to reduce any complexity because it still requires
>>> a pfn to zone lookup at the end of the process.
>>>
>>> I.e. converting pfn_to_online_page() to use a new pfn_active()
>>> subsection map plus looking up the zone from pfn_to_page() is more
>>> steps than just doing a direct pfn to zone lookup. What am I missing?
>>
>> That a real "pfn to zone" lookup without going via the struct page will
>> require to have more than just a single bitmap. IMHO, keeping the
>> information at a single place (memmap) is the clean thing to do (not
>> replicating it somewhere else). Going via the memmap might not be as
>> fast as a direct lookup, but do we actually care? We are already looking
>> at "random PFNs we are not sure if there is a valid memmap".
> 
> True, we only care about the validity of the check, and as you pointed
> out moving the check to the pfn level does not solve the validity
> race. It needs a lock.

Let's call pfn_active() "a pfn that is active in the system and has an
initialized memmap, which contains sane values" (a "valid" memmap sounds
like pfn_valid(), which actually means "there is a memmap, but it might
contain garbage"). Yes, we need some sort of lightweight locking as
discussed.

[...]

>>>> However, I think we also have to clarify if we need the change in 3 at all.
>>>> It comes from
>>>>
>>>> commit f5509cc18daa7f82bcc553be70df2117c8eedc16
>>>> Author: Kees Cook <keescook@chromium.org>
>>>> Date:   Tue Jun 7 11:05:33 2016 -0700
>>>>
>>>>     mm: Hardened usercopy
>>>>
>>>>     This is the start of porting PAX_USERCOPY into the mainline kernel. This
>>>>     is the first set of features, controlled by CONFIG_HARDENED_USERCOPY. The
>>>>     work is based on code by PaX Team and Brad Spengler, and an earlier port
>>>>     from Casey Schaufler. Additional non-slab page tests are from Rik van Riel.
>>>> [...]
>>>>     - otherwise, object must not span page allocations (excepting Reserved
>>>>       and CMA ranges)
>>>>
>>>> Not sure if we really have to care about ZONE_DEVICE at this point.
>>>
>>> That check needs to be careful to ignore ZONE_DEVICE pages. There's
>>> nothing wrong with a copy spanning ZONE_DEVICE and typical pages.
>>
>> Please note that the current check would *forbid* this (AFAIKs for a
>> single heap object). As discussed in the relevant patch, we might be
>> able to just stop doing that and limit it to real PG_reserved pages
>> (without ZONE_DEVICE). I'd be happy to not introduce new
>> is_zone_device_page() users.
> 
> At least for non-HMM ZONE_DEVICE usage, i.e. the dax + pmem stuff, is
> excluded from this path by:
> 
> 52f476a323f9 libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead

Interesting, and very valuable information. So it sounds like patch #2
can go (or be converted into a documentation update).

> 
> So this case is one more to add to the pile of HMM auditing.

Sounds like HMM is some dangerous piece of software we have. This needs
auditing, fixing, and documentation.

BTW, do you have a good source of details about HMM? Especially about
these oddities you mentioned?

Also, can you have a look at patch #2 7/8 and confirm that doing a
SetPageDirty() on a ZONE_DEVICE page is okay (although not useful)? Thanks!

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH RFC v1 01/12] mm/memory_hotplug: Don't allow to online/offline memory blocks with holes
  2019-10-22 17:12   ` David Hildenbrand
  (?)
  (?)
@ 2019-10-24  3:53     ` Anshuman Khandual
  -1 siblings, 0 replies; 112+ messages in thread
From: Anshuman Khandual @ 2019-10-24  3:53 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Michal Hocko, Andrew Morton, kvm-ppc, linuxppc-dev,
	kvm, linux-hyperv, devel, xen-devel, x86, Alexander Duyck,
	Alexander Duyck, Alex Williamson, Allison Randal,
	Andy Lutomirski, Aneesh Kumar K.V, Anthony Yznaga, Ben Chan,
	Benjamin Herrenschmidt, Borislav Petkov, Boris Ostrovsky,
	Christophe Leroy, Cornelia Huck, Dan Carpenter, Dan Williams,
	Dave Hansen, Fabio Estevam, Greg Kroah-Hartman, Haiyang Zhang,
	H. Peter Anvin, Ingo Molnar, Isaac J. Manjarres, Jeremy Sowden,
	Jim Mattson, Joerg Roedel, Johannes Weiner, Juergen Gross,
	KarimAllah Ahmed, Kate Stewart, Kees Cook, K. Y. Srinivasan,
	Madhumitha Prabakaran, Matt Sickler, Mel Gorman,
	Michael Ellerman, Michal Hocko, Mike Rapoport, Mike Rapoport,
	Nicholas Piggin, Nishka Dasgupta, Oscar Salvador, Paolo Bonzini,
	Paul Mackerras, Paul Mackerras, Pavel Tatashin, Pavel Tatashin,
	Peter Zijlstra, Qian Cai, Radim Krčmář,
	Rob Springer, Sasha Levin, Sean Christopherson,
	Simon Sandström, Stefano Stabellini, Stephen Hemminger,
	Thomas Gleixner, Todd Poynor, Vandana BN, Vitaly Kuznetsov,
	Vlastimil Babka, Wanpeng Li, YueHaibing


On 10/22/2019 10:42 PM, David Hildenbrand wrote:
> Our onlining/offlining code is unnecessarily complicated. Only memory
> blocks added during boot can have holes. Hotplugged memory never has
> holes. That memory is already online.

Why can memory hot plugged at runtime not have holes (e.g., a semi-bad
DIMM)? Currently, do we just abort adding that memory block if there are holes?

> 
> When we stop allowing to offline memory blocks with holes, we implicitly
> stop to online memory blocks with holes.

Reducing hotplug support for memory blocks with holes just to simplify
the code - is it worth it?

> 
> This allows to simplify the code. For example, we no longer have to
> worry about marking pages that fall into memory holes PG_reserved when
> onlining memory. We can stop setting pages PG_reserved.

Could these holes not be tracked some other way than with the page
reserved bit - e.g., in the memory section itself, with the corresponding
struct pages just remaining poisoned? Just wondering; I might be all wrong here.

> 
> Offlining memory blocks added during boot is usually not guaranteed to work
> either way. So stopping to do that (if anybody really used and tested

That guarantee does not exist right now because of how boot memory could
have been used after boot, not because of a limitation of memory hot remove itself.

> this over the years) should not really hurt. For the use case of
> offlining memory to unplug DIMMs, we should see no change. (holes on
> DIMMs would be weird)

Holes on a DIMM could be due to HW errors affecting only parts of it.
By not allowing such DIMMs to be hot added and removed, we are
definitely reducing the scope of overall hotplug functionality. Is code
simplification in itself worth this reduction in functionality?

> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/memory_hotplug.c | 26 ++++++++++++++++++++++++--
>  1 file changed, 24 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 561371ead39a..7210f4375279 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1447,10 +1447,19 @@ static void node_states_clear_node(int node, struct memory_notify *arg)
>  		node_clear_state(node, N_MEMORY);
>  }
>  
> +static int count_system_ram_pages_cb(unsigned long start_pfn,
> +				     unsigned long nr_pages, void *data)
> +{
> +	unsigned long *nr_system_ram_pages = data;
> +
> +	*nr_system_ram_pages += nr_pages;
> +	return 0;
> +}
> +
>  static int __ref __offline_pages(unsigned long start_pfn,
>  		  unsigned long end_pfn)
>  {
> -	unsigned long pfn, nr_pages;
> +	unsigned long pfn, nr_pages = 0;
>  	unsigned long offlined_pages = 0;
>  	int ret, node, nr_isolate_pageblock;
>  	unsigned long flags;
> @@ -1461,6 +1470,20 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  
>  	mem_hotplug_begin();
>  
> +	/*
> +	 * We don't allow to offline memory blocks that contain holes
> +	 * and consequently don't allow to online memory blocks that contain
> +	 * holes. This allows to simplify the code quite a lot and we don't
> +	 * have to mess with PG_reserved pages for memory holes.
> +	 */
> +	walk_system_ram_range(start_pfn, end_pfn - start_pfn, &nr_pages,
> +			      count_system_ram_pages_cb);
> +	if (nr_pages != end_pfn - start_pfn) {
> +		ret = -EINVAL;
> +		reason = "memory holes";
> +		goto failed_removal;
> +	}
> +
>  	/* This makes hotplug much easier...and readable.
>  	   we assume this for now. .*/
>  	if (!test_pages_in_a_zone(start_pfn, end_pfn, &valid_start,
> @@ -1472,7 +1495,6 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  
>  	zone = page_zone(pfn_to_page(valid_start));
>  	node = zone_to_nid(zone);
> -	nr_pages = end_pfn - start_pfn;
>  
>  	/* set above range as isolated */
>  	ret = start_isolate_page_range(start_pfn, end_pfn,
> 


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH RFC v1 01/12] mm/memory_hotplug: Don't allow to online/offline memory blocks with holes
  2019-10-24  3:53     ` Anshuman Khandual
  (?)
  (?)
@ 2019-10-24  7:55       ` David Hildenbrand
  -1 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2019-10-24  7:55 UTC (permalink / raw)
  To: Anshuman Khandual, linux-kernel
  Cc: linux-mm, Michal Hocko, Andrew Morton, kvm-ppc, linuxppc-dev,
	kvm, linux-hyperv, devel, xen-devel, x86, Alexander Duyck,
	Alexander Duyck, Alex Williamson, Allison Randal,
	Andy Lutomirski, Aneesh Kumar K.V, Anthony Yznaga, Ben Chan,
	Benjamin Herrenschmidt, Borislav Petkov, Boris Ostrovsky,
	Christophe Leroy, Cornelia Huck, Dan Carpenter, Dan Williams,
	Dave Hansen, Fabio Estevam, Greg Kroah-Hartman, Haiyang Zhang,
	H. Peter Anvin, Ingo Molnar, Isaac J. Manjarres, Jeremy Sowden,
	Jim Mattson, Joerg Roedel, Johannes Weiner, Juergen Gross,
	KarimAllah Ahmed, Kate Stewart, Kees Cook, K. Y. Srinivasan,
	Madhumitha Prabakaran, Matt Sickler, Mel Gorman,
	Michael Ellerman, Michal Hocko, Mike Rapoport, Mike Rapoport,
	Nicholas Piggin, Nishka Dasgupta, Oscar Salvador, Paolo Bonzini,
	Paul Mackerras, Paul Mackerras, Pavel Tatashin, Pavel Tatashin,
	Peter Zijlstra, Qian Cai, Radim Krčmář,
	Rob Springer, Sasha Levin, Sean Christopherson,
	Simon Sandström, Stefano Stabellini, Stephen Hemminger,
	Thomas Gleixner, Todd Poynor, Vandana BN, Vitaly Kuznetsov,
	Vlastimil Babka, Wanpeng Li, YueHaibing

On 24.10.19 05:53, Anshuman Khandual wrote:
> 
> On 10/22/2019 10:42 PM, David Hildenbrand wrote:
>> Our onlining/offlining code is unnecessarily complicated. Only memory
>> blocks added during boot can have holes. Hotplugged memory never has
>> holes. That memory is already online.
> 
> Why can't memory hot-plugged at runtime have holes (e.g., a semi-bad DIMM)?

Important: HWPoison != memory hole

A memory hole is memory that is not "IORESOURCE_SYSRAM". These pages are 
currently marked PG_reserved. Such holes are sometimes used for mapping 
something into kernel space. Some archs use PG_reserved to detect 
a memory hole ("not RAM") and ignore the memmap.

Poisoned pages are marked PG_hwpoison.

> Currently, do we just abort adding that memory block if there are holes?

There is no interface to do that.

E.g., have a look at add_memory() / add_memory_resource(). You can only 
pass one memory resource (that is, all IORESOURCE_SYSRAM | IORESOURCE_BUSY).

Hotplugging memory with holes is not supported (nor can I imagine a use 
case for that).

>>
>> When we stop allowing offlining of memory blocks with holes, we
>> implicitly stop allowing onlining of memory blocks with holes.
> 
> Reducing hotplug support for memory blocks with holes just to simplify
> the code. Is it worth it?

Michal and I are not aware of any users, not even of a use case. 
Keeping around code that nobody really needs and that limits cleanups: no 
thanks. This is similar to us not supporting offlining of memory blocks 
that span multiple nodes/zones.

E.g., have a look at the isolation code. It is full of code that jumps 
over memory holes (start_isolate_page_range() -> __first_valid_page()). 
That made sense for our complicated memory offlining code, but it is 
actually harmful when dealing with alloc_contig_range(). Allocation 
never wants to jump over memory holes. After this patch, we can just 
fail hard on any memory hole we detect, instead of ignoring it (or 
special-casing it).

> 
>>
>> This allows to simplify the code. For example, we no longer have to
>> worry about marking pages that fall into memory holes PG_reserved when
>> onlining memory. We can stop setting pages PG_reserved.
> 
> Could there not be any other way of tracking these holes, if not the page
> reserved bit? In the memory section itself, with the corresponding struct pages
> just remaining poisoned? Just wondering, might be all wrong here.

Of course there could be ways (e.g., using PG_offline eventually), but 
it boils down to us having to deal with it in onlining/offlining code. 
And that is some handling nobody really seems to need.

> 
>>
>> Offlining memory blocks added during boot is usually not guaranteed to work
>> either way. So stopping to do that (if anybody really used and tested
> 
> That guarantee does not exist right now because of how boot memory could have
> been used after boot, not because of a limitation of memory hot-remove itself.

Yep. However, Michal and I are not even aware of a setup that would make 
this work and guarantee that the existing code is actually still able to 
deal with holes. Are you?

> 
>> this over the years) should not really hurt. For the use case of
>> offlining memory to unplug DIMMs, we should see no change. (holes on
>> DIMMs would be weird)
> 
> Holes on DIMM could be due to HW errors affecting only parts of it. By not

Again, HW errors != holes. We have PG_hwpoison for that.

> allowing such DIMMs' hot add and remove, we are definitely reducing the
> scope of overall hotplug functionality. Is code simplification in itself
> worth this reduction in functionality?

What you describe is not affected.

Thanks!

-- 

Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)
  2019-10-23  7:26     ` David Hildenbrand
  (?)
  (?)
@ 2019-10-24 12:50       ` David Hildenbrand
  -1 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2019-10-24 12:50 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux Kernel Mailing List, Linux MM, Michal Hocko, Andrew Morton,
	kvm-ppc, linuxppc-dev, KVM list, linux-hyperv, devel, xen-devel,
	X86 ML, Alexander Duyck, Kees Cook, Alex Williamson,
	Allison Randal, Andy Lutomirski, Aneesh Kumar K.V,
	Anshuman Khandual, Anthony Yznaga, Ben Chan,
	Benjamin Herrenschmidt, Borislav Petkov, Boris Ostrovsky,
	Christophe Leroy, Cornelia Huck, Dan Carpenter, Dave Hansen,
	Fabio Estevam, Greg Kroah-Hartman, Haiyang Zhang, H. Peter Anvin,
	Ingo Molnar, Isaac J. Manjarres, Jeremy Sowden, Jim Mattson,
	Joerg Roedel, Johannes Weiner, Juergen Gross, KarimAllah Ahmed,
	Kate Stewart, K. Y. Srinivasan, Madhumitha Prabakaran,
	Matt Sickler, Mel Gorman, Michael Ellerman, Michal Hocko,
	Mike Rapoport, Mike Rapoport, Nicholas Piggin, Nishka Dasgupta,
	Oscar Salvador, Paolo Bonzini, Paul Mackerras, Paul Mackerras,
	Pavel Tatashin, Pavel Tatashin, Peter Zijlstra, Qian Cai,
	Radim Krčmář,
	Rob Springer, Sasha Levin, Sean Christopherson,
	Simon Sandström, Stefano Stabellini, Stephen Hemminger,
	Thomas Gleixner, Todd Poynor, Vandana BN, Vitaly Kuznetsov,
	Vlastimil Babka, Wanpeng Li, YueHaibing

On 23.10.19 09:26, David Hildenbrand wrote:
> On 22.10.19 23:54, Dan Williams wrote:
>> Hi David,
>>
>> Thanks for tackling this!
> 
> Thanks for having a look :)
> 
> [...]
> 
> 
>>> I am probably a little bit too careful (but I don't want to break things).
>>> In most places (besides KVM and vfio that are nuts), the
>>> pfn_to_online_page() check could most probably be avoided by a
>>> is_zone_device_page() check. However, I usually get suspicious when I see
>>> a pfn_valid() check (especially after I learned that people mmap parts of
>>> /dev/mem into user space, including memory without memmaps. Also, people
>>> could memmap offline memory blocks this way :/). As long as this does not
>>> hurt performance, I think we should rather do it the clean way.
>>
>> I'm concerned about using is_zone_device_page() in places that are not
>> known to already have a reference to the page. Here's an audit of
>> current usages, and the ones I think need to cleaned up. The "unsafe"
>> ones do not appear to have any protections against the device page
>> being removed (get_dev_pagemap()). Yes, some of these were added by
>> me. The "unsafe? HMM" ones need HMM eyes because HMM leaks device
>> pages into anonymous memory paths and I'm not up to speed on how it
>> guarantees 'struct page' validity vs device shutdown without using
>> get_dev_pagemap().
>>
>> smaps_pmd_entry(): unsafe
>>
>> put_devmap_managed_page(): safe, page reference is held
>>
>> is_device_private_page(): safe? gpu driver manages private page lifetime
>>
>> is_pci_p2pdma_page(): safe, page reference is held
>>
>> uncharge_page(): unsafe? HMM
>>
>> add_to_kill(): safe, protected by get_dev_pagemap() and dax_lock_page()
>>
>> soft_offline_page(): unsafe
>>
>> remove_migration_pte(): unsafe? HMM
>>
>> move_to_new_page(): unsafe? HMM
>>
>> migrate_vma_pages() and helpers: unsafe? HMM
>>
>> try_to_unmap_one(): unsafe? HMM
>>
>> __put_page(): safe
>>
>> release_pages(): safe
>>
>> I'm hoping all the HMM ones can be converted to
>> is_device_private_page() directly and have that routine grow a nice
>> comment about how it knows it can always safely de-reference its @page
>> argument.
>>
>> For the rest I'd like to propose that we add a facility to determine
>> ZONE_DEVICE by pfn rather than page. The most straightforward way I
>> can think of would be to just add another bitmap to mem_section_usage
>> to indicate if a subsection is ZONE_DEVICE or not.
> 
> (it's a somewhat unrelated bigger discussion, but we can start discussing it in this thread)
> 
> I dislike this for three reasons
> 
> a) It does not protect against any races, really, it does not improve things.
> b) We do have the exact same problem with pfn_to_online_page(). As long as we
>     don't hold the memory hotplug lock, memory can get offlined and removed at any time. Racy.
> c) We mix in ZONE specific stuff into the core. It should be "just another zone"
> 
> What I propose instead (already discussed in https://lkml.org/lkml/2019/10/10/87)
> 
> 1. Convert SECTION_IS_ONLINE to SECTION_IS_ACTIVE
> 2. Convert SECTION_IS_ACTIVE to a subsection bitmap
> 3. Introduce pfn_active() that checks against the subsection bitmap
> 4. Once the memmap was initialized / prepared, set the subsection active
>     (similar to SECTION_IS_ONLINE in the buddy right now)
> 5. Before the memmap gets invalidated, set the subsection inactive
>     (similar to SECTION_IS_ONLINE in the buddy right now)
> 6. pfn_to_online_page() = pfn_active() && zone != ZONE_DEVICE
> 7. pfn_to_device_page() = pfn_active() && zone == ZONE_DEVICE
> 

Dan, I suspect that you want a pfn_to_zone() that will not touch 
the memmap, because the memmap could potentially (with an altmap) lie on 
slow memory, right?

A modification might make this possible (but I am not yet sure if we 
want a less generic MM implementation just to fine-tune slow memmap 
access here):

1. Keep SECTION_IS_ONLINE as it is with the same semantics
2. Introduce a subsection bitmap to record active ("initialized memmap")
    PFNs. E.g., also set it when setting sections online.
3. Introduce pfn_active() that checks against the subsection bitmap
4. Once the memmap was initialized / prepared, set the subsection active
    (similar to SECTION_IS_ONLINE in the buddy right now)
5. Before the memmap gets invalidated, set the subsection inactive
    (similar to SECTION_IS_ONLINE in the buddy right now)
6. pfn_to_online_page() = pfn_active() && section == SECTION_IS_ONLINE
    (or keep it as is, depends on the RCU locking we eventually
     implement)
7. pfn_to_device_page() = pfn_active() && section != SECTION_IS_ONLINE
8. use pfn_active() whenever we don't care about the zone.

Again, I am really not a friend of that; it hardcodes ZONE_DEVICE vs. 
!ZONE_DEVICE. When we do a random pfn_to_page() (e.g., in a pfn walker), 
we really want to touch the memmap right away either way, so we can also 
directly read the zone from it. Right now, I really do prefer a more 
generic implementation.

-- 

Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 112+ messages in thread

directly read the zone from it. I really do prefer right now a more 
generic implementation.

-- 

Thanks,

David / dhildenb

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)
@ 2019-10-24 12:50       ` David Hildenbrand
  0 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2019-10-24 12:50 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kate Stewart, linux-hyperv, Michal Hocko,
	Radim Krčmář,
	KVM list, Pavel Tatashin, KarimAllah Ahmed, Dave Hansen,
	Alexander Duyck, Michal Hocko, Linux MM, Paul Mackerras,
	H. Peter Anvin, Wanpeng Li, K. Y. Srinivasan, Fabio Estevam,
	Ben Chan, Pavel Tatashin, devel, Stefano Stabellini,
	Stephen Hemminger, Aneesh Kumar K.V, Joerg Roedel, X86 ML,
	YueHaibing, Mike Rapoport, Madhumitha Prabakaran, Peter Zijlstra,
	Ingo Molnar, Vlastimil Babka, Nishka Dasgupta, Anthony Yznaga,
	Oscar Salvador, Dan Carpenter, Isaac J. Manjarres, Matt Sickler,
	Kees Cook, Anshuman Khandual, Haiyang Zhang,
	Simon Sandström, Sasha Levin, Juergen Gross, kvm-ppc,
	Qian Cai, Alex Williamson, Mike Rapoport, Borislav Petkov,
	Nicholas Piggin, Andy Lutomirski, xen-devel, Boris Ostrovsky,
	Todd Poynor, Vitaly Kuznetsov, Allison Randal, Jim Mattson,
	Vandana BN, Jeremy Sowden, Mel Gorman, Greg Kroah-Hartman,
	Cornelia Huck, Linux Kernel Mailing List, Sean Christopherson,
	Rob Springer, Thomas Gleixner, Johannes Weiner, Paolo Bonzini,
	Andrew Morton, linuxppc-dev

On 23.10.19 09:26, David Hildenbrand wrote:
> On 22.10.19 23:54, Dan Williams wrote:
>> Hi David,
>>
>> Thanks for tackling this!
> 
> Thanks for having a look :)
> 
> [...]
> 
> 
>>> I am probably a little bit too careful (but I don't want to break things).
>>> In most places (besides KVM and vfio that are nuts), the
>>> pfn_to_online_page() check could most probably be avoided by a
>>> is_zone_device_page() check. However, I usually get suspicious when I see
>>> a pfn_valid() check (especially after I learned that people mmap parts of
>>> /dev/mem into user space, including memory without memmaps. Also, people
>>> could memmap offline memory blocks this way :/). As long as this does not
>>> hurt performance, I think we should rather do it the clean way.
>>
>> I'm concerned about using is_zone_device_page() in places that are not
>> known to already have a reference to the page. Here's an audit of
>> current usages, and the ones I think need to be cleaned up. The "unsafe"
>> ones do not appear to have any protections against the device page
>> being removed (get_dev_pagemap()). Yes, some of these were added by
>> me. The "unsafe? HMM" ones need HMM eyes because HMM leaks device
>> pages into anonymous memory paths and I'm not up to speed on how it
>> guarantees 'struct page' validity vs device shutdown without using
>> get_dev_pagemap().
>>
>> smaps_pmd_entry(): unsafe
>>
>> put_devmap_managed_page(): safe, page reference is held
>>
>> is_device_private_page(): safe? gpu driver manages private page lifetime
>>
>> is_pci_p2pdma_page(): safe, page reference is held
>>
>> uncharge_page(): unsafe? HMM
>>
>> add_to_kill(): safe, protected by get_dev_pagemap() and dax_lock_page()
>>
>> soft_offline_page(): unsafe
>>
>> remove_migration_pte(): unsafe? HMM
>>
>> move_to_new_page(): unsafe? HMM
>>
>> migrate_vma_pages() and helpers: unsafe? HMM
>>
>> try_to_unmap_one(): unsafe? HMM
>>
>> __put_page(): safe
>>
>> release_pages(): safe
>>
>> I'm hoping all the HMM ones can be converted to
>> is_device_private_page() directly and have that routine grow a nice
>> comment about how it knows it can always safely de-reference its @page
>> argument.
>>
>> For the rest I'd like to propose that we add a facility to determine
>> ZONE_DEVICE by pfn rather than page. The most straightforward way I
>> can think of would be to just add another bitmap to mem_section_usage
>> to indicate if a subsection is ZONE_DEVICE or not.
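[The safe/unsafe split in the audit above boils down to whether a reference pins the device pagemap so removal has to wait. A minimal userspace model of that idea — struct dev_pagemap and the helpers here are simplified stand-ins for illustration, not the real kernel code:]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Sketch only: a simplified stand-in for the kernel's struct dev_pagemap. */
struct dev_pagemap {
	int refs;	/* outstanding references pinning the device memmap */
	bool present;	/* device (and its memmap) not yet removed */
};

/* Take a reference; fails once the device is gone, so callers must bail out. */
static struct dev_pagemap *get_dev_pagemap(struct dev_pagemap *pgmap)
{
	if (!pgmap || !pgmap->present)
		return NULL;
	pgmap->refs++;
	return pgmap;
}

static void put_dev_pagemap(struct dev_pagemap *pgmap)
{
	pgmap->refs--;
}

/* Device removal must wait until every reference has been dropped. */
static bool try_remove_dev_pagemap(struct dev_pagemap *pgmap)
{
	if (pgmap->refs > 0)
		return false;
	pgmap->present = false;
	return true;
}
```

["Unsafe" call sites in the audit are the ones that dereference a ZONE_DEVICE page without having gone through get_dev_pagemap() (or holding a page reference), so nothing stops the memmap from disappearing underneath them.]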
> 
> (it's a somewhat unrelated bigger discussion, but we can start discussing it in this thread)
> 
> I dislike this for three reasons
> 
> a) It does not protect against any races, really, it does not improve things.
> b) We do have the exact same problem with pfn_to_online_page(). As long as we
>     don't hold the memory hotplug lock, memory can get offlined and removed at any time. Racy.
> c) We mix ZONE-specific stuff into the core. ZONE_DEVICE should be "just another zone"
> 
> What I propose instead (already discussed in https://lkml.org/lkml/2019/10/10/87)
> 
> 1. Convert SECTION_IS_ONLINE to SECTION_IS_ACTIVE
> 2. Convert SECTION_IS_ACTIVE to a subsection bitmap
> 3. Introduce pfn_active() that checks against the subsection bitmap
> 4. Once the memmap was initialized / prepared, set the subsection active
>     (similar to SECTION_IS_ONLINE in the buddy right now)
> 5. Before the memmap gets invalidated, set the subsection inactive
>     (similar to SECTION_IS_ONLINE in the buddy right now)
> 6. pfn_to_online_page() = pfn_active() && zone != ZONE_DEVICE
> 7. pfn_to_device_page() = pfn_active() && zone == ZONE_DEVICE
> 
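[The scheme above, as a userspace sketch: the shift values, array sizes, and helper names are illustrative assumptions (and the lookups return bool rather than a struct page *), not the real sparsemem code.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch only: sizes chosen to mirror 128 MiB sections / 2 MiB subsections. */
#define PFN_SECTION_SHIFT	15
#define PFN_SUBSECTION_SHIFT	9
#define SUBSECTIONS_PER_SECTION	(1UL << (PFN_SECTION_SHIFT - PFN_SUBSECTION_SHIFT))
#define NR_SECTIONS		16

struct mem_section_usage {
	uint64_t subsection_active_map;	/* 1 bit per subsection: memmap initialized? */
};

static struct mem_section_usage usage[NR_SECTIONS];
static bool section_zone_device[NR_SECTIONS];	/* stand-in for the zone lookup */

static unsigned long pfn_to_section(unsigned long pfn)
{
	return pfn >> PFN_SECTION_SHIFT;
}

static unsigned long pfn_to_subsection(unsigned long pfn)
{
	return (pfn >> PFN_SUBSECTION_SHIFT) & (SUBSECTIONS_PER_SECTION - 1);
}

/* step 3: pfn_active() checks against the subsection bitmap */
static bool pfn_active(unsigned long pfn)
{
	return usage[pfn_to_section(pfn)].subsection_active_map &
	       (1ULL << pfn_to_subsection(pfn));
}

/* step 4: set the subsection active once the memmap was initialized */
static void subsection_set_active(unsigned long pfn)
{
	usage[pfn_to_section(pfn)].subsection_active_map |=
		1ULL << pfn_to_subsection(pfn);
}

/* steps 6/7: both lookups first require an active (initialized) memmap */
static bool pfn_to_online_page(unsigned long pfn)
{
	return pfn_active(pfn) && !section_zone_device[pfn_to_section(pfn)];
}

static bool pfn_to_device_page(unsigned long pfn)
{
	return pfn_active(pfn) && section_zone_device[pfn_to_section(pfn)];
}
```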

Dan, I suspect that you want a pfn_to_zone() that will not touch
the memmap, because it could potentially (altmap) lie on slow memory, right?

A modification might make this possible (but I am not yet sure whether we
want a less generic MM implementation just to fine-tune slow memmap
access here):

1. Keep SECTION_IS_ONLINE as it is with the same semantics
2. Introduce a subsection bitmap to record active ("initialized memmap")
    PFNs. E.g., also set it when setting sections online.
3. Introduce pfn_active() that checks against the subsection bitmap
4. Once the memmap was initialized / prepared, set the subsection active
    (similar to SECTION_IS_ONLINE in the buddy right now)
5. Before the memmap gets invalidated, set the subsection inactive
    (similar to SECTION_IS_ONLINE in the buddy right now)
6. pfn_to_online_page() = pfn_active() && section == SECTION_IS_ONLINE
    (or keep it as is, depends on the RCU locking we eventually
     implement)
7. pfn_to_device_page() = pfn_active() && section != SECTION_IS_ONLINE
8. use pfn_active() whenever we don't care about the zone.

Again, I am not really a fan of that: it hardcodes ZONE_DEVICE vs.
!ZONE_DEVICE. When we do a random "pfn_to_page()" (e.g., in a pfn walker),
we really want to touch the memmap right away either way, so we can also
directly read the zone from it. Right now I really do prefer a more
generic implementation.
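[Reading the zone directly from the memmap is cheap because the zone is encoded in the page flags. A userspace model of that encoding — the struct page layout, shift values, and helpers here are simplified stand-ins for the real include/linux/mm.h code:]

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch only: a simplified stand-in for the kernel's zone encoding. */
enum zone_type { ZONE_NORMAL, ZONE_MOVABLE, ZONE_DEVICE };

struct page {
	unsigned long flags;	/* zone stored in the upper flag bits */
};

#define ZONES_SHIFT	2
#define ZONES_MASK	((1UL << ZONES_SHIFT) - 1)
#define ZONES_PGSHIFT	(sizeof(unsigned long) * 8 - ZONES_SHIFT)

/* Extract the zone from the page's own flags -- no extra bitmap needed. */
static enum zone_type page_zonenum(const struct page *page)
{
	return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
}

static bool is_zone_device_page(const struct page *page)
{
	return page_zonenum(page) == ZONE_DEVICE;
}

/* Set at memmap initialization time, before the page becomes visible. */
static void set_page_zone(struct page *page, enum zone_type zone)
{
	page->flags &= ~(ZONES_MASK << ZONES_PGSHIFT);
	page->flags |= (unsigned long)zone << ZONES_PGSHIFT;
}
```

[A pfn walker that already does pfn_to_page() has the flags word in hand anyway, which is the point of the argument above: once the memmap is guaranteed initialized, the zone check is a single masked read.]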

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 112+ messages in thread


end of thread, other threads:[~2019-10-24 20:15 UTC | newest]

Thread overview: 112+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-22 17:12 [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE) David Hildenbrand
2019-10-22 17:12 ` [PATCH RFC v1 01/12] mm/memory_hotplug: Don't allow to online/offline memory blocks with holes David Hildenbrand
2019-10-24  3:53   ` Anshuman Khandual
2019-10-24  7:55     ` David Hildenbrand
2019-10-22 17:12 ` [PATCH RFC v1 02/12] mm/usercopy.c: Prepare check_page_span() for PG_reserved changes David Hildenbrand
2019-10-23  8:20   ` David Hildenbrand
2019-10-23 16:25     ` Kees Cook
2019-10-23 16:32       ` David Hildenbrand
2019-10-22 17:12 ` [PATCH RFC v1 03/12] KVM: x86/mmu: Prepare kvm_is_mmio_pfn() " David Hildenbrand
2019-10-22 17:12 ` [PATCH RFC v1 04/12] KVM: Prepare kvm_is_reserved_pfn() " David Hildenbrand
2019-10-22 17:12 ` [PATCH RFC v1 05/12] vfio/type1: Prepare is_invalid_reserved_pfn() " David Hildenbrand
2019-10-22 17:12 ` [PATCH RFC v1 06/12] staging/gasket: Prepare gasket_release_page() " David Hildenbrand
2019-10-23  8:17   ` David Hildenbrand
2019-10-22 17:12 ` [PATCH RFC v1 07/12] staging: kpc2000: Prepare transfer_complete_cb() " David Hildenbrand
2019-10-22 17:55   ` Matt Sickler
2019-10-22 21:01     ` David Hildenbrand
2019-10-22 17:12 ` [PATCH RFC v1 08/12] powerpc/book3s: Prepare kvmppc_book3s_instantiate_page() " David Hildenbrand
2019-10-22 17:12 ` [PATCH RFC v1 09/12] powerpc/64s: Prepare hash_page_do_lazy_icache() " David Hildenbrand
2019-10-22 17:12 ` [PATCH RFC v1 10/12] powerpc/mm: Prepare maybe_pte_to_page() " David Hildenbrand
2019-10-22 17:12 ` [PATCH RFC v1 11/12] x86/mm: Prepare __ioremap_check_ram() " David Hildenbrand
2019-10-22 17:12 ` [PATCH RFC v1 12/12] mm/memory_hotplug: Don't mark pages PG_reserved when initializing the memmap David Hildenbrand
2019-10-22 21:54 ` [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE) Dan Williams
2019-10-23  7:26   ` David Hildenbrand
2019-10-23 17:09     ` Dan Williams
2019-10-23 17:27       ` David Hildenbrand
2019-10-23 19:39         ` Dan Williams
2019-10-23 21:22           ` David Hildenbrand
2019-10-24 12:50     ` David Hildenbrand