All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH v3 0/2] mm: remove extra ZONE_DEVICE struct page refcount
@ 2020-10-01 18:17 ` Ralph Campbell
  0 siblings, 0 replies; 22+ messages in thread
From: Ralph Campbell @ 2020-10-01 18:17 UTC (permalink / raw)
  To: linux-mm, kvm-ppc, nouveau, linux-kernel
  Cc: Dan Williams, Ira Weiny, Matthew Wilcox, Jerome Glisse,
	John Hubbard, Alistair Popple, Christoph Hellwig,
	Jason Gunthorpe, Bharata B Rao, Zi Yan, Kirill A . Shutemov,
	Yang Shi, Paul Mackerras, Ben Skeggs, Andrew Morton,
	Ralph Campbell

This is still an RFC because after looking at the pmem/dax code some
more, I realized that the ZONE_DEVICE struct pages are being inserted
into the process' page tables with vmf_insert_mixed() and a zero
refcount on the ZONE_DEVICE struct page. This is sort of OK because
insert_pfn() increments the reference count on the pgmap which is what
prevents memunmap_pages() from freeing the struct pages and it doesn't
check for a non-zero struct page reference count.
But, any calls to get_page() will hit the VM_BUG_ON_PAGE() that
checks for a reference count == 0.

// mmap() an ext4 file that is mounted -o dax.
ext4_dax_fault()
  ext4_dax_huge_fault()
    dax_iomap_fault(&ext4_iomap_ops)
      dax_iomap_pte_fault()
        ops->iomap_begin() // ext4_iomap_begin()
          ext4_map_blocks()
          ext4_set_iomap()
        dax_iomap_pfn()
        dax_insert_entry()
        vmf_insert_mixed(pfn)
          __vm_insert_mixed()
            if (!IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) &&
                !pfn_t_devmap(pfn) && pfn_t_valid(pfn))
              insert_page()
                get_page(page) // XXX would trigger VM_BUG_ON_PAGE()
                page_add_file_rmap()
                set_pte_at()
            else
              insert_pfn()
                pte_mkdevmap()
                set_pte_at()

Should pmem set the page reference count to one before inserting the
pfn into the page tables (and decrement when removing devmap PTEs)?
What about MEMORY_DEVICE_GENERIC and MEMORY_DEVICE_PCI_P2PDMA use cases?
Where should they icrement/decrement the page reference count?
I don't know enough about how these are used to really know what to
do at this point. If people want me to continue to work on this series,
I will need some guidance.

---

Matthew Wilcox, Ira Weiny, and others have complained that ZONE_DEVICE
struct page reference counting is ugly because they are "free" when the
reference count is one instead of zero. This leads to explicit checks
for ZONE_DEVICE pages in places like put_page(), GUP, THP splitting, and
page migration which have to adjust the expected reference count when
determining if the page is isolated or idle. This is my attempt to make
ZONE_DEVICE pages be free when the reference count is zero and removing
the special cases.

I'm only sending this out as a RFC since I'm not that familiar with the
DAX, PMEM, XEN, and other uses of ZONE_DEVICE struct pages allocated
with devm_memremap_pages() or memremap_pages() but my best reading of
the code looks like it might be OK. I could use help testing these
configurations.
I have been able to successfully run xfstests on ext4 with the memmap
kernel boot option to simulate pmem.

Changes in v3:
Rebase to linux-mm 5.9.0-rc7-mm1.
Added a check for page_free() as suggested by Christoph Hellwig.
Added a helper for dax_wait_page() as suggested by Christoph Hellwig.

Changes in v2:
One of the big changes in v2 is that devm_memremap_pages() and
memremap_pages() now return the struct pages' reference count set to
zero instead of one. Normally, get_page() will VM_BUG_ON_PAGE() if
page->_refcount is zero. I didn't see any such warnings running the
xfstests with dax/pmem but I'm not clear how the zero to one reference
count is handled.

Other changes in v2:
Rebased to Linux-5.9.0-rc6 to include pmem fixes.
I added patch 1 to introduce a page refcount helper for ext4 and xfs as
suggested by Christoph Hellwig.
I also applied Christoph Hellwig's other suggested changes for removing
the devmap_managed_key, etc.

Ralph Campbell (2):
  ext4/xfs: add page refcount helper
  mm: remove extra ZONE_DEVICE struct page refcount

 arch/powerpc/kvm/book3s_hv_uvmem.c     |  2 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
 fs/dax.c                               |  8 +--
 fs/ext4/inode.c                        |  5 +-
 fs/xfs/xfs_file.c                      |  4 +-
 include/linux/dax.h                    | 10 +++
 include/linux/memremap.h               |  7 ++-
 include/linux/mm.h                     | 44 --------------
 lib/test_hmm.c                         |  2 +-
 mm/gup.c                               | 44 --------------
 mm/internal.h                          |  8 +++
 mm/memremap.c                          | 84 +++++++-------------------
 mm/migrate.c                           |  5 --
 mm/page_alloc.c                        |  3 +
 mm/swap.c                              | 44 ++------------
 15 files changed, 63 insertions(+), 209 deletions(-)

-- 
2.20.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC PATCH v3 0/2] mm: remove extra ZONE_DEVICE struct page refcount
@ 2020-10-01 18:17 ` Ralph Campbell
  0 siblings, 0 replies; 22+ messages in thread
From: Ralph Campbell @ 2020-10-01 18:17 UTC (permalink / raw)
  To: linux-mm, kvm-ppc, nouveau, linux-kernel
  Cc: Dan Williams, Ira Weiny, Matthew Wilcox, Jerome Glisse,
	John Hubbard, Alistair Popple, Christoph Hellwig,
	Jason Gunthorpe, Bharata B Rao, Zi Yan, Kirill A . Shutemov,
	Yang Shi, Paul Mackerras, Ben Skeggs, Andrew Morton,
	Ralph Campbell

This is still an RFC because after looking at the pmem/dax code some
more, I realized that the ZONE_DEVICE struct pages are being inserted
into the process' page tables with vmf_insert_mixed() and a zero
refcount on the ZONE_DEVICE struct page. This is sort of OK because
insert_pfn() increments the reference count on the pgmap which is what
prevents memunmap_pages() from freeing the struct pages and it doesn't
check for a non-zero struct page reference count.
But, any calls to get_page() will hit the VM_BUG_ON_PAGE() that
checks for a reference count == 0.

// mmap() an ext4 file that is mounted -o dax.
ext4_dax_fault()
  ext4_dax_huge_fault()
    dax_iomap_fault(&ext4_iomap_ops)
      dax_iomap_pte_fault()
        ops->iomap_begin() // ext4_iomap_begin()
          ext4_map_blocks()
          ext4_set_iomap()
        dax_iomap_pfn()
        dax_insert_entry()
        vmf_insert_mixed(pfn)
          __vm_insert_mixed()
            if (!IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) &&
                !pfn_t_devmap(pfn) && pfn_t_valid(pfn))
              insert_page()
                get_page(page) // XXX would trigger VM_BUG_ON_PAGE()
                page_add_file_rmap()
                set_pte_at()
            else
              insert_pfn()
                pte_mkdevmap()
                set_pte_at()

Should pmem set the page reference count to one before inserting the
pfn into the page tables (and decrement when removing devmap PTEs)?
What about MEMORY_DEVICE_GENERIC and MEMORY_DEVICE_PCI_P2PDMA use cases?
Where should they icrement/decrement the page reference count?
I don't know enough about how these are used to really know what to
do at this point. If people want me to continue to work on this series,
I will need some guidance.

---

Matthew Wilcox, Ira Weiny, and others have complained that ZONE_DEVICE
struct page reference counting is ugly because they are "free" when the
reference count is one instead of zero. This leads to explicit checks
for ZONE_DEVICE pages in places like put_page(), GUP, THP splitting, and
page migration which have to adjust the expected reference count when
determining if the page is isolated or idle. This is my attempt to make
ZONE_DEVICE pages be free when the reference count is zero and removing
the special cases.

I'm only sending this out as a RFC since I'm not that familiar with the
DAX, PMEM, XEN, and other uses of ZONE_DEVICE struct pages allocated
with devm_memremap_pages() or memremap_pages() but my best reading of
the code looks like it might be OK. I could use help testing these
configurations.
I have been able to successfully run xfstests on ext4 with the memmap
kernel boot option to simulate pmem.

Changes in v3:
Rebase to linux-mm 5.9.0-rc7-mm1.
Added a check for page_free() as suggested by Christoph Hellwig.
Added a helper for dax_wait_page() as suggested by Christoph Hellwig.

Changes in v2:
One of the big changes in v2 is that devm_memremap_pages() and
memremap_pages() now return the struct pages' reference count set to
zero instead of one. Normally, get_page() will VM_BUG_ON_PAGE() if
page->_refcount is zero. I didn't see any such warnings running the
xfstests with dax/pmem but I'm not clear how the zero to one reference
count is handled.

Other changes in v2:
Rebased to Linux-5.9.0-rc6 to include pmem fixes.
I added patch 1 to introduce a page refcount helper for ext4 and xfs as
suggested by Christoph Hellwig.
I also applied Christoph Hellwig's other suggested changes for removing
the devmap_managed_key, etc.

Ralph Campbell (2):
  ext4/xfs: add page refcount helper
  mm: remove extra ZONE_DEVICE struct page refcount

 arch/powerpc/kvm/book3s_hv_uvmem.c     |  2 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
 fs/dax.c                               |  8 +--
 fs/ext4/inode.c                        |  5 +-
 fs/xfs/xfs_file.c                      |  4 +-
 include/linux/dax.h                    | 10 +++
 include/linux/memremap.h               |  7 ++-
 include/linux/mm.h                     | 44 --------------
 lib/test_hmm.c                         |  2 +-
 mm/gup.c                               | 44 --------------
 mm/internal.h                          |  8 +++
 mm/memremap.c                          | 84 +++++++-------------------
 mm/migrate.c                           |  5 --
 mm/page_alloc.c                        |  3 +
 mm/swap.c                              | 44 ++------------
 15 files changed, 63 insertions(+), 209 deletions(-)

-- 
2.20.1

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC PATCH v3 1/2] ext4/xfs: add page refcount helper
  2020-10-01 18:17 ` Ralph Campbell
@ 2020-10-01 18:17   ` Ralph Campbell
  -1 siblings, 0 replies; 22+ messages in thread
From: Ralph Campbell @ 2020-10-01 18:17 UTC (permalink / raw)
  To: linux-mm, kvm-ppc, nouveau, linux-kernel
  Cc: Dan Williams, Ira Weiny, Matthew Wilcox, Jerome Glisse,
	John Hubbard, Alistair Popple, Christoph Hellwig,
	Jason Gunthorpe, Bharata B Rao, Zi Yan, Kirill A . Shutemov,
	Yang Shi, Paul Mackerras, Ben Skeggs, Andrew Morton,
	Ralph Campbell

There are several places where ZONE_DEVICE struct pages assume a reference
count == 1 means the page is idle and free. Instead of open coding this,
add a helper function to hide this detail.

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
---
 fs/dax.c            |  4 ++--
 fs/ext4/inode.c     |  5 +----
 fs/xfs/xfs_file.c   |  4 +---
 include/linux/dax.h | 10 ++++++++++
 4 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 5b47834f2e1b..85c63f735909 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -358,7 +358,7 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
 	for_each_mapped_pfn(entry, pfn) {
 		struct page *page = pfn_to_page(pfn);
 
-		WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
+		WARN_ON_ONCE(trunc && !dax_layout_is_idle_page(page));
 		WARN_ON_ONCE(page->mapping && page->mapping != mapping);
 		page->mapping = NULL;
 		page->index = 0;
@@ -372,7 +372,7 @@ static struct page *dax_busy_page(void *entry)
 	for_each_mapped_pfn(entry, pfn) {
 		struct page *page = pfn_to_page(pfn);
 
-		if (page_ref_count(page) > 1)
+		if (!dax_layout_is_idle_page(page))
 			return page;
 	}
 	return NULL;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index bf596467c234..4c3b80e68121 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3926,10 +3926,7 @@ int ext4_break_layouts(struct inode *inode)
 		if (!page)
 			return 0;
 
-		error = ___wait_var_event(&page->_refcount,
-				atomic_read(&page->_refcount) == 1,
-				TASK_INTERRUPTIBLE, 0, 0,
-				ext4_wait_dax_page(ei));
+		error = dax_wait_page(ei, page, ext4_wait_dax_page);
 	} while (error == 0);
 
 	return error;
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 3d1b95124744..a5304aaeaa3a 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -749,9 +749,7 @@ xfs_break_dax_layouts(
 		return 0;
 
 	*retry = true;
-	return ___wait_var_event(&page->_refcount,
-			atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
-			0, 0, xfs_wait_dax_page(inode));
+	return dax_wait_page(inode, page, xfs_wait_dax_page);
 }
 
 int
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b52f084aa643..f906cf4db1cc 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -243,6 +243,16 @@ static inline bool dax_mapping(struct address_space *mapping)
 	return mapping->host && IS_DAX(mapping->host);
 }
 
+static inline bool dax_layout_is_idle_page(struct page *page)
+{
+	return page_ref_count(page) == 1;
+}
+
+#define dax_wait_page(_inode, _page, _wait_cb)				\
+	___wait_var_event(&(_page)->_refcount,				\
+		dax_layout_is_idle_page(_page),				\
+		TASK_INTERRUPTIBLE, 0, 0, _wait_cb(_inode))
+
 #ifdef CONFIG_DEV_DAX_HMEM_DEVICES
 void hmem_register_device(int target_nid, struct resource *r);
 #else
-- 
2.20.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC PATCH v3 1/2] ext4/xfs: add page refcount helper
@ 2020-10-01 18:17   ` Ralph Campbell
  0 siblings, 0 replies; 22+ messages in thread
From: Ralph Campbell @ 2020-10-01 18:17 UTC (permalink / raw)
  To: linux-mm, kvm-ppc, nouveau, linux-kernel
  Cc: Dan Williams, Ira Weiny, Matthew Wilcox, Jerome Glisse,
	John Hubbard, Alistair Popple, Christoph Hellwig,
	Jason Gunthorpe, Bharata B Rao, Zi Yan, Kirill A . Shutemov,
	Yang Shi, Paul Mackerras, Ben Skeggs, Andrew Morton,
	Ralph Campbell

There are several places where ZONE_DEVICE struct pages assume a reference
count == 1 means the page is idle and free. Instead of open coding this,
add a helper function to hide this detail.

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
---
 fs/dax.c            |  4 ++--
 fs/ext4/inode.c     |  5 +----
 fs/xfs/xfs_file.c   |  4 +---
 include/linux/dax.h | 10 ++++++++++
 4 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 5b47834f2e1b..85c63f735909 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -358,7 +358,7 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
 	for_each_mapped_pfn(entry, pfn) {
 		struct page *page = pfn_to_page(pfn);
 
-		WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
+		WARN_ON_ONCE(trunc && !dax_layout_is_idle_page(page));
 		WARN_ON_ONCE(page->mapping && page->mapping != mapping);
 		page->mapping = NULL;
 		page->index = 0;
@@ -372,7 +372,7 @@ static struct page *dax_busy_page(void *entry)
 	for_each_mapped_pfn(entry, pfn) {
 		struct page *page = pfn_to_page(pfn);
 
-		if (page_ref_count(page) > 1)
+		if (!dax_layout_is_idle_page(page))
 			return page;
 	}
 	return NULL;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index bf596467c234..4c3b80e68121 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3926,10 +3926,7 @@ int ext4_break_layouts(struct inode *inode)
 		if (!page)
 			return 0;
 
-		error = ___wait_var_event(&page->_refcount,
-				atomic_read(&page->_refcount) == 1,
-				TASK_INTERRUPTIBLE, 0, 0,
-				ext4_wait_dax_page(ei));
+		error = dax_wait_page(ei, page, ext4_wait_dax_page);
 	} while (error == 0);
 
 	return error;
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 3d1b95124744..a5304aaeaa3a 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -749,9 +749,7 @@ xfs_break_dax_layouts(
 		return 0;
 
 	*retry = true;
-	return ___wait_var_event(&page->_refcount,
-			atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
-			0, 0, xfs_wait_dax_page(inode));
+	return dax_wait_page(inode, page, xfs_wait_dax_page);
 }
 
 int
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b52f084aa643..f906cf4db1cc 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -243,6 +243,16 @@ static inline bool dax_mapping(struct address_space *mapping)
 	return mapping->host && IS_DAX(mapping->host);
 }
 
+static inline bool dax_layout_is_idle_page(struct page *page)
+{
+	return page_ref_count(page) == 1;
+}
+
+#define dax_wait_page(_inode, _page, _wait_cb)				\
+	___wait_var_event(&(_page)->_refcount,				\
+		dax_layout_is_idle_page(_page),				\
+		TASK_INTERRUPTIBLE, 0, 0, _wait_cb(_inode))
+
 #ifdef CONFIG_DEV_DAX_HMEM_DEVICES
 void hmem_register_device(int target_nid, struct resource *r);
 #else
-- 
2.20.1

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC PATCH v3 2/2] mm: remove extra ZONE_DEVICE struct page refcount
  2020-10-01 18:17 ` Ralph Campbell
@ 2020-10-01 18:17   ` Ralph Campbell
  -1 siblings, 0 replies; 22+ messages in thread
From: Ralph Campbell @ 2020-10-01 18:17 UTC (permalink / raw)
  To: linux-mm, kvm-ppc, nouveau, linux-kernel
  Cc: Dan Williams, Ira Weiny, Matthew Wilcox, Jerome Glisse,
	John Hubbard, Alistair Popple, Christoph Hellwig,
	Jason Gunthorpe, Bharata B Rao, Zi Yan, Kirill A . Shutemov,
	Yang Shi, Paul Mackerras, Ben Skeggs, Andrew Morton,
	Ralph Campbell

ZONE_DEVICE struct pages have an extra reference count that complicates the
code for put_page() and several places in the kernel that need to check the
reference count to see that a page is not being used (gup, compaction,
migration, etc.). Clean up the code so the reference count doesn't need to
be treated specially for ZONE_DEVICE.

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
---
 arch/powerpc/kvm/book3s_hv_uvmem.c     |  2 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
 fs/dax.c                               |  4 +-
 include/linux/dax.h                    |  2 +-
 include/linux/memremap.h               |  7 ++-
 include/linux/mm.h                     | 44 --------------
 lib/test_hmm.c                         |  2 +-
 mm/gup.c                               | 44 --------------
 mm/internal.h                          |  8 +++
 mm/memremap.c                          | 84 +++++++-------------------
 mm/migrate.c                           |  5 --
 mm/page_alloc.c                        |  3 +
 mm/swap.c                              | 44 ++------------
 13 files changed, 50 insertions(+), 201 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 84e5a2dc8be5..acee67710620 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -711,7 +711,7 @@ static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm)
 
 	dpage = pfn_to_page(uvmem_pfn);
 	dpage->zone_device_data = pvt;
-	get_page(dpage);
+	init_page_count(dpage);
 	lock_page(dpage);
 	return dpage;
 out_clear:
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 92987daa5e17..8bc7120e1216 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -324,7 +324,7 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
 			return NULL;
 	}
 
-	get_page(page);
+	init_page_count(page);
 	lock_page(page);
 	return page;
 }
diff --git a/fs/dax.c b/fs/dax.c
index 85c63f735909..4804348f62e6 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -560,14 +560,14 @@ static void *grab_mapping_entry(struct xa_state *xas,
 
 /**
  * dax_layout_busy_page_range - find first pinned page in @mapping
- * @mapping: address space to scan for a page with ref count > 1
+ * @mapping: address space to scan for a page with ref count > 0
  * @start: Starting offset. Page containing 'start' is included.
  * @end: End offset. Page containing 'end' is included. If 'end' is LLONG_MAX,
  *       pages from 'start' till the end of file are included.
  *
  * DAX requires ZONE_DEVICE mapped pages. These pages are never
  * 'onlined' to the page allocator so they are considered idle when
- * page->count == 1. A filesystem uses this interface to determine if
+ * page->count == 0. A filesystem uses this interface to determine if
  * any page in the mapping is busy, i.e. for DMA, or other
  * get_user_pages() usages.
  *
diff --git a/include/linux/dax.h b/include/linux/dax.h
index f906cf4db1cc..e4920ea6abd3 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -245,7 +245,7 @@ static inline bool dax_mapping(struct address_space *mapping)
 
 static inline bool dax_layout_is_idle_page(struct page *page)
 {
-	return page_ref_count(page) == 1;
+	return page_ref_count(page) == 0;
 }
 
 #define dax_wait_page(_inode, _page, _wait_cb)				\
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 86c6c368ce9b..2f63747caf56 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -66,9 +66,10 @@ enum memory_type {
 
 struct dev_pagemap_ops {
 	/*
-	 * Called once the page refcount reaches 1.  (ZONE_DEVICE pages never
-	 * reach 0 refcount unless there is a refcount bug. This allows the
-	 * device driver to implement its own memory management.)
+	 * Called once the page refcount reaches 0. The reference count
+	 * should be reset to one with init_page_count(page) before reusing
+	 * the page. This allows the device driver to implement its own
+	 * memory management.
 	 */
 	void (*page_free)(struct page *page);
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 27c64d0d7520..3d71a820ae38 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1100,39 +1100,6 @@ static inline bool is_zone_device_page(const struct page *page)
 }
 #endif
 
-#ifdef CONFIG_DEV_PAGEMAP_OPS
-void free_devmap_managed_page(struct page *page);
-DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
-
-static inline bool page_is_devmap_managed(struct page *page)
-{
-	if (!static_branch_unlikely(&devmap_managed_key))
-		return false;
-	if (!is_zone_device_page(page))
-		return false;
-	switch (page->pgmap->type) {
-	case MEMORY_DEVICE_PRIVATE:
-	case MEMORY_DEVICE_FS_DAX:
-		return true;
-	default:
-		break;
-	}
-	return false;
-}
-
-void put_devmap_managed_page(struct page *page);
-
-#else /* CONFIG_DEV_PAGEMAP_OPS */
-static inline bool page_is_devmap_managed(struct page *page)
-{
-	return false;
-}
-
-static inline void put_devmap_managed_page(struct page *page)
-{
-}
-#endif /* CONFIG_DEV_PAGEMAP_OPS */
-
 static inline bool is_device_private_page(const struct page *page)
 {
 	return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
@@ -1179,17 +1146,6 @@ static inline void put_page(struct page *page)
 {
 	page = compound_head(page);
 
-	/*
-	 * For devmap managed pages we need to catch refcount transition from
-	 * 2 to 1, when refcount reach one it means the page is free and we
-	 * need to inform the device driver through callback. See
-	 * include/linux/memremap.h and HMM for details.
-	 */
-	if (page_is_devmap_managed(page)) {
-		put_devmap_managed_page(page);
-		return;
-	}
-
 	if (put_page_testzero(page))
 		__put_page(page);
 }
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index e151a7f10519..37a2dea5e805 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -561,7 +561,7 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
 	}
 
 	dpage->zone_device_data = rpage;
-	get_page(dpage);
+	init_page_count(dpage);
 	lock_page(dpage);
 	return dpage;
 
diff --git a/mm/gup.c b/mm/gup.c
index 102877ed77a4..ed117e225c86 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -177,41 +177,6 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags)
 	return true;
 }
 
-#ifdef CONFIG_DEV_PAGEMAP_OPS
-static bool __unpin_devmap_managed_user_page(struct page *page)
-{
-	int count, refs = 1;
-
-	if (!page_is_devmap_managed(page))
-		return false;
-
-	if (hpage_pincount_available(page))
-		hpage_pincount_sub(page, 1);
-	else
-		refs = GUP_PIN_COUNTING_BIAS;
-
-	count = page_ref_sub_return(page, refs);
-
-	mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1);
-	/*
-	 * devmap page refcounts are 1-based, rather than 0-based: if
-	 * refcount is 1, then the page is free and the refcount is
-	 * stable because nobody holds a reference on the page.
-	 */
-	if (count == 1)
-		free_devmap_managed_page(page);
-	else if (!count)
-		__put_page(page);
-
-	return true;
-}
-#else
-static bool __unpin_devmap_managed_user_page(struct page *page)
-{
-	return false;
-}
-#endif /* CONFIG_DEV_PAGEMAP_OPS */
-
 /**
  * unpin_user_page() - release a dma-pinned page
  * @page:            pointer to page to be released
@@ -227,15 +192,6 @@ void unpin_user_page(struct page *page)
 
 	page = compound_head(page);
 
-	/*
-	 * For devmap managed pages we need to catch refcount transition from
-	 * GUP_PIN_COUNTING_BIAS to 1, when refcount reach one it means the
-	 * page is free and we need to inform the device driver through
-	 * callback. See include/linux/memremap.h and HMM for details.
-	 */
-	if (__unpin_devmap_managed_user_page(page))
-		return;
-
 	if (hpage_pincount_available(page))
 		hpage_pincount_sub(page, 1);
 	else
diff --git a/mm/internal.h b/mm/internal.h
index 6345b08ce86c..629959a6f26d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -618,4 +618,12 @@ struct migration_target_control {
 	gfp_t gfp_mask;
 };
 
+#ifdef CONFIG_DEV_PAGEMAP_OPS
+void free_zone_device_page(struct page *page);
+#else
+static inline void free_zone_device_page(struct page *page)
+{
+}
+#endif
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/memremap.c b/mm/memremap.c
index 2bb276680837..336f088c92a1 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -12,6 +12,7 @@
 #include <linux/types.h>
 #include <linux/wait_bit.h>
 #include <linux/xarray.h>
+#include "internal.h"
 
 static DEFINE_XARRAY(pgmap_array);
 
@@ -37,36 +38,6 @@ unsigned long memremap_compat_align(void)
 EXPORT_SYMBOL_GPL(memremap_compat_align);
 #endif
 
-#ifdef CONFIG_DEV_PAGEMAP_OPS
-DEFINE_STATIC_KEY_FALSE(devmap_managed_key);
-EXPORT_SYMBOL(devmap_managed_key);
-
-static void devmap_managed_enable_put(void)
-{
-	static_branch_dec(&devmap_managed_key);
-}
-
-static int devmap_managed_enable_get(struct dev_pagemap *pgmap)
-{
-	if (pgmap->type == MEMORY_DEVICE_PRIVATE &&
-	    (!pgmap->ops || !pgmap->ops->page_free)) {
-		WARN(1, "Missing page_free method\n");
-		return -EINVAL;
-	}
-
-	static_branch_inc(&devmap_managed_key);
-	return 0;
-}
-#else
-static int devmap_managed_enable_get(struct dev_pagemap *pgmap)
-{
-	return -EINVAL;
-}
-static void devmap_managed_enable_put(void)
-{
-}
-#endif /* CONFIG_DEV_PAGEMAP_OPS */
-
 static void pgmap_array_delete(struct range *range)
 {
 	xa_store_range(&pgmap_array, PHYS_PFN(range->start), PHYS_PFN(range->end),
@@ -91,13 +62,6 @@ static unsigned long pfn_end(struct dev_pagemap *pgmap, int range_id)
 	return (range->start + range_len(range)) >> PAGE_SHIFT;
 }
 
-static unsigned long pfn_next(unsigned long pfn)
-{
-	if (pfn % 1024 == 0)
-		cond_resched();
-	return pfn + 1;
-}
-
 /*
  * This returns true if the page is reserved by ZONE_DEVICE driver.
  */
@@ -118,9 +82,6 @@ bool pfn_zone_device_reserved(unsigned long pfn)
 	return ret;
 }
 
-#define for_each_device_pfn(pfn, map, i) \
-	for (pfn = pfn_first(map, i); pfn < pfn_end(map, i); pfn = pfn_next(pfn))
-
 static void dev_pagemap_kill(struct dev_pagemap *pgmap)
 {
 	if (pgmap->ops && pgmap->ops->kill)
@@ -176,20 +137,18 @@ static void pageunmap_range(struct dev_pagemap *pgmap, int range_id)
 
 void memunmap_pages(struct dev_pagemap *pgmap)
 {
-	unsigned long pfn;
 	int i;
 
 	dev_pagemap_kill(pgmap);
 	for (i = 0; i < pgmap->nr_range; i++)
-		for_each_device_pfn(pfn, pgmap, i)
-			put_page(pfn_to_page(pfn));
+		percpu_ref_put_many(pgmap->ref, pfn_end(pgmap, i) -
+						pfn_first(pgmap, i));
 	dev_pagemap_cleanup(pgmap);
 
 	for (i = 0; i < pgmap->nr_range; i++)
 		pageunmap_range(pgmap, i);
 
 	WARN_ONCE(pgmap->altmap.alloc, "failed to free all reserved pages\n");
-	devmap_managed_enable_put();
 }
 EXPORT_SYMBOL_GPL(memunmap_pages);
 
@@ -327,7 +286,6 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 		.pgprot = PAGE_KERNEL,
 	};
 	const int nr_range = pgmap->nr_range;
-	bool need_devmap_managed = true;
 	int error, i;
 
 	if (WARN_ONCE(!nr_range, "nr_range must be specified\n"))
@@ -343,6 +301,10 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 			WARN(1, "Missing migrate_to_ram method\n");
 			return ERR_PTR(-EINVAL);
 		}
+		if (!pgmap->ops->page_free) {
+			WARN(1, "Missing page_free method\n");
+			return ERR_PTR(-EINVAL);
+		}
 		if (!pgmap->owner) {
 			WARN(1, "Missing owner\n");
 			return ERR_PTR(-EINVAL);
@@ -356,11 +318,9 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 		}
 		break;
 	case MEMORY_DEVICE_GENERIC:
-		need_devmap_managed = false;
 		break;
 	case MEMORY_DEVICE_PCI_P2PDMA:
 		params.pgprot = pgprot_noncached(params.pgprot);
-		need_devmap_managed = false;
 		break;
 	default:
 		WARN(1, "Invalid pgmap type %d\n", pgmap->type);
@@ -384,12 +344,6 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 		}
 	}
 
-	if (need_devmap_managed) {
-		error = devmap_managed_enable_get(pgmap);
-		if (error)
-			return ERR_PTR(error);
-	}
-
 	/*
 	 * Clear the pgmap nr_range as it will be incremented for each
 	 * successfully processed range. This communicates how many
@@ -504,16 +458,9 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 EXPORT_SYMBOL_GPL(get_dev_pagemap);
 
 #ifdef CONFIG_DEV_PAGEMAP_OPS
-void free_devmap_managed_page(struct page *page)
+static void free_device_private_page(struct page *page)
 {
-	/* notify page idle for dax */
-	if (!is_device_private_page(page)) {
-		wake_up_var(&page->_refcount);
-		return;
-	}
-
 	__ClearPageWaiters(page);
-
 	mem_cgroup_uncharge(page);
 
 	/*
@@ -540,4 +487,19 @@ void free_devmap_managed_page(struct page *page)
 	page->mapping = NULL;
 	page->pgmap->ops->page_free(page);
 }
+
+void free_zone_device_page(struct page *page)
+{
+	switch (page->pgmap->type) {
+	case MEMORY_DEVICE_FS_DAX:
+		/* notify page idle */
+		wake_up_var(&page->_refcount);
+		return;
+	case MEMORY_DEVICE_PRIVATE:
+		free_device_private_page(page);
+		return;
+	default:
+		return;
+	}
+}
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
diff --git a/mm/migrate.c b/mm/migrate.c
index 5ca5842df5db..ee09334b46d8 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -380,11 +380,6 @@ static int expected_page_refs(struct address_space *mapping, struct page *page)
 {
 	int expected_count = 1;
 
-	/*
-	 * Device private pages have an extra refcount as they are
-	 * ZONE_DEVICE pages.
-	 */
-	expected_count += is_device_private_page(page);
 	if (mapping)
 		expected_count += thp_nr_pages(page) + page_has_private(page);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eb0962976a0e..039b88bb3978 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6154,6 +6154,9 @@ void __ref memmap_init_zone_device(struct zone *zone,
 
 		__init_single_page(page, pfn, zone_idx, nid);
 
+		/* ZONE_DEVICE pages start with a zero reference count. */
+		set_page_count(page, 0);
+
 		/*
 		 * Mark page reserved as it will need to wait for onlining
 		 * phase for it to be fully associated with a zone.
diff --git a/mm/swap.c b/mm/swap.c
index 0eb057141a04..82cbe069dcf3 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -116,12 +116,11 @@ static void __put_compound_page(struct page *page)
 void __put_page(struct page *page)
 {
 	if (is_zone_device_page(page)) {
-		put_dev_pagemap(page->pgmap);
-
 		/*
 		 * The page belongs to the device that created pgmap. Do
 		 * not return it to page allocator.
 		 */
+		free_zone_device_page(page);
 		return;
 	}
 
@@ -891,26 +890,18 @@ void release_pages(struct page **pages, int nr)
 		if (is_huge_zero_page(page))
 			continue;
 
+		if (!put_page_testzero(page))
+			continue;
+
 		if (is_zone_device_page(page)) {
 			if (locked_pgdat) {
 				spin_unlock_irqrestore(&locked_pgdat->lru_lock,
 						       flags);
 				locked_pgdat = NULL;
 			}
-			/*
-			 * ZONE_DEVICE pages that return 'false' from
-			 * page_is_devmap_managed() do not require special
-			 * processing, and instead, expect a call to
-			 * put_page_testzero().
-			 */
-			if (page_is_devmap_managed(page)) {
-				put_devmap_managed_page(page);
-				continue;
-			}
-		}
-
-		if (!put_page_testzero(page))
+			free_zone_device_page(page);
 			continue;
+		}
 
 		if (PageCompound(page)) {
 			if (locked_pgdat) {
@@ -1188,26 +1179,3 @@ void __init swap_setup(void)
 	 * _really_ don't want to cluster much more
 	 */
 }
-
-#ifdef CONFIG_DEV_PAGEMAP_OPS
-void put_devmap_managed_page(struct page *page)
-{
-	int count;
-
-	if (WARN_ON_ONCE(!page_is_devmap_managed(page)))
-		return;
-
-	count = page_ref_dec_return(page);
-
-	/*
-	 * devmap page refcounts are 1-based, rather than 0-based: if
-	 * refcount is 1, then the page is free and the refcount is
-	 * stable because nobody holds a reference on the page.
-	 */
-	if (count == 1)
-		free_devmap_managed_page(page);
-	else if (!count)
-		__put_page(page);
-}
-EXPORT_SYMBOL(put_devmap_managed_page);
-#endif
-- 
2.20.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC PATCH v3 2/2] mm: remove extra ZONE_DEVICE struct page refcount
@ 2020-10-01 18:17   ` Ralph Campbell
  0 siblings, 0 replies; 22+ messages in thread
From: Ralph Campbell @ 2020-10-01 18:17 UTC (permalink / raw)
  To: linux-mm, kvm-ppc, nouveau, linux-kernel
  Cc: Dan Williams, Ira Weiny, Matthew Wilcox, Jerome Glisse,
	John Hubbard, Alistair Popple, Christoph Hellwig,
	Jason Gunthorpe, Bharata B Rao, Zi Yan, Kirill A . Shutemov,
	Yang Shi, Paul Mackerras, Ben Skeggs, Andrew Morton,
	Ralph Campbell

ZONE_DEVICE struct pages have an extra reference count that complicates the
code for put_page() and several places in the kernel that need to check the
reference count to see that a page is not being used (gup, compaction,
migration, etc.). Clean up the code so the reference count doesn't need to
be treated specially for ZONE_DEVICE.

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
---
 arch/powerpc/kvm/book3s_hv_uvmem.c     |  2 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
 fs/dax.c                               |  4 +-
 include/linux/dax.h                    |  2 +-
 include/linux/memremap.h               |  7 ++-
 include/linux/mm.h                     | 44 --------------
 lib/test_hmm.c                         |  2 +-
 mm/gup.c                               | 44 --------------
 mm/internal.h                          |  8 +++
 mm/memremap.c                          | 84 +++++++-------------------
 mm/migrate.c                           |  5 --
 mm/page_alloc.c                        |  3 +
 mm/swap.c                              | 44 ++------------
 13 files changed, 50 insertions(+), 201 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 84e5a2dc8be5..acee67710620 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -711,7 +711,7 @@ static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm)
 
 	dpage = pfn_to_page(uvmem_pfn);
 	dpage->zone_device_data = pvt;
-	get_page(dpage);
+	init_page_count(dpage);
 	lock_page(dpage);
 	return dpage;
 out_clear:
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 92987daa5e17..8bc7120e1216 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -324,7 +324,7 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
 			return NULL;
 	}
 
-	get_page(page);
+	init_page_count(page);
 	lock_page(page);
 	return page;
 }
diff --git a/fs/dax.c b/fs/dax.c
index 85c63f735909..4804348f62e6 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -560,14 +560,14 @@ static void *grab_mapping_entry(struct xa_state *xas,
 
 /**
  * dax_layout_busy_page_range - find first pinned page in @mapping
- * @mapping: address space to scan for a page with ref count > 1
+ * @mapping: address space to scan for a page with ref count > 0
  * @start: Starting offset. Page containing 'start' is included.
  * @end: End offset. Page containing 'end' is included. If 'end' is LLONG_MAX,
  *       pages from 'start' till the end of file are included.
  *
  * DAX requires ZONE_DEVICE mapped pages. These pages are never
  * 'onlined' to the page allocator so they are considered idle when
- * page->count == 1. A filesystem uses this interface to determine if
+ * page->count == 0. A filesystem uses this interface to determine if
  * any page in the mapping is busy, i.e. for DMA, or other
  * get_user_pages() usages.
  *
diff --git a/include/linux/dax.h b/include/linux/dax.h
index f906cf4db1cc..e4920ea6abd3 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -245,7 +245,7 @@ static inline bool dax_mapping(struct address_space *mapping)
 
 static inline bool dax_layout_is_idle_page(struct page *page)
 {
-	return page_ref_count(page) == 1;
+	return page_ref_count(page) == 0;
 }
 
 #define dax_wait_page(_inode, _page, _wait_cb)				\
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 86c6c368ce9b..2f63747caf56 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -66,9 +66,10 @@ enum memory_type {
 
 struct dev_pagemap_ops {
 	/*
-	 * Called once the page refcount reaches 1.  (ZONE_DEVICE pages never
-	 * reach 0 refcount unless there is a refcount bug. This allows the
-	 * device driver to implement its own memory management.)
+	 * Called once the page refcount reaches 0. The reference count
+	 * should be reset to one with init_page_count(page) before reusing
+	 * the page. This allows the device driver to implement its own
+	 * memory management.
 	 */
 	void (*page_free)(struct page *page);
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 27c64d0d7520..3d71a820ae38 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1100,39 +1100,6 @@ static inline bool is_zone_device_page(const struct page *page)
 }
 #endif
 
-#ifdef CONFIG_DEV_PAGEMAP_OPS
-void free_devmap_managed_page(struct page *page);
-DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
-
-static inline bool page_is_devmap_managed(struct page *page)
-{
-	if (!static_branch_unlikely(&devmap_managed_key))
-		return false;
-	if (!is_zone_device_page(page))
-		return false;
-	switch (page->pgmap->type) {
-	case MEMORY_DEVICE_PRIVATE:
-	case MEMORY_DEVICE_FS_DAX:
-		return true;
-	default:
-		break;
-	}
-	return false;
-}
-
-void put_devmap_managed_page(struct page *page);
-
-#else /* CONFIG_DEV_PAGEMAP_OPS */
-static inline bool page_is_devmap_managed(struct page *page)
-{
-	return false;
-}
-
-static inline void put_devmap_managed_page(struct page *page)
-{
-}
-#endif /* CONFIG_DEV_PAGEMAP_OPS */
-
 static inline bool is_device_private_page(const struct page *page)
 {
 	return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
@@ -1179,17 +1146,6 @@ static inline void put_page(struct page *page)
 {
 	page = compound_head(page);
 
-	/*
-	 * For devmap managed pages we need to catch refcount transition from
-	 * 2 to 1, when refcount reach one it means the page is free and we
-	 * need to inform the device driver through callback. See
-	 * include/linux/memremap.h and HMM for details.
-	 */
-	if (page_is_devmap_managed(page)) {
-		put_devmap_managed_page(page);
-		return;
-	}
-
 	if (put_page_testzero(page))
 		__put_page(page);
 }
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index e151a7f10519..37a2dea5e805 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -561,7 +561,7 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
 	}
 
 	dpage->zone_device_data = rpage;
-	get_page(dpage);
+	init_page_count(dpage);
 	lock_page(dpage);
 	return dpage;
 
diff --git a/mm/gup.c b/mm/gup.c
index 102877ed77a4..ed117e225c86 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -177,41 +177,6 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags)
 	return true;
 }
 
-#ifdef CONFIG_DEV_PAGEMAP_OPS
-static bool __unpin_devmap_managed_user_page(struct page *page)
-{
-	int count, refs = 1;
-
-	if (!page_is_devmap_managed(page))
-		return false;
-
-	if (hpage_pincount_available(page))
-		hpage_pincount_sub(page, 1);
-	else
-		refs = GUP_PIN_COUNTING_BIAS;
-
-	count = page_ref_sub_return(page, refs);
-
-	mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1);
-	/*
-	 * devmap page refcounts are 1-based, rather than 0-based: if
-	 * refcount is 1, then the page is free and the refcount is
-	 * stable because nobody holds a reference on the page.
-	 */
-	if (count == 1)
-		free_devmap_managed_page(page);
-	else if (!count)
-		__put_page(page);
-
-	return true;
-}
-#else
-static bool __unpin_devmap_managed_user_page(struct page *page)
-{
-	return false;
-}
-#endif /* CONFIG_DEV_PAGEMAP_OPS */
-
 /**
  * unpin_user_page() - release a dma-pinned page
  * @page:            pointer to page to be released
@@ -227,15 +192,6 @@ void unpin_user_page(struct page *page)
 
 	page = compound_head(page);
 
-	/*
-	 * For devmap managed pages we need to catch refcount transition from
-	 * GUP_PIN_COUNTING_BIAS to 1, when refcount reach one it means the
-	 * page is free and we need to inform the device driver through
-	 * callback. See include/linux/memremap.h and HMM for details.
-	 */
-	if (__unpin_devmap_managed_user_page(page))
-		return;
-
 	if (hpage_pincount_available(page))
 		hpage_pincount_sub(page, 1);
 	else
diff --git a/mm/internal.h b/mm/internal.h
index 6345b08ce86c..629959a6f26d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -618,4 +618,12 @@ struct migration_target_control {
 	gfp_t gfp_mask;
 };
 
+#ifdef CONFIG_DEV_PAGEMAP_OPS
+void free_zone_device_page(struct page *page);
+#else
+static inline void free_zone_device_page(struct page *page)
+{
+}
+#endif
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/memremap.c b/mm/memremap.c
index 2bb276680837..336f088c92a1 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -12,6 +12,7 @@
 #include <linux/types.h>
 #include <linux/wait_bit.h>
 #include <linux/xarray.h>
+#include "internal.h"
 
 static DEFINE_XARRAY(pgmap_array);
 
@@ -37,36 +38,6 @@ unsigned long memremap_compat_align(void)
 EXPORT_SYMBOL_GPL(memremap_compat_align);
 #endif
 
-#ifdef CONFIG_DEV_PAGEMAP_OPS
-DEFINE_STATIC_KEY_FALSE(devmap_managed_key);
-EXPORT_SYMBOL(devmap_managed_key);
-
-static void devmap_managed_enable_put(void)
-{
-	static_branch_dec(&devmap_managed_key);
-}
-
-static int devmap_managed_enable_get(struct dev_pagemap *pgmap)
-{
-	if (pgmap->type == MEMORY_DEVICE_PRIVATE &&
-	    (!pgmap->ops || !pgmap->ops->page_free)) {
-		WARN(1, "Missing page_free method\n");
-		return -EINVAL;
-	}
-
-	static_branch_inc(&devmap_managed_key);
-	return 0;
-}
-#else
-static int devmap_managed_enable_get(struct dev_pagemap *pgmap)
-{
-	return -EINVAL;
-}
-static void devmap_managed_enable_put(void)
-{
-}
-#endif /* CONFIG_DEV_PAGEMAP_OPS */
-
 static void pgmap_array_delete(struct range *range)
 {
 	xa_store_range(&pgmap_array, PHYS_PFN(range->start), PHYS_PFN(range->end),
@@ -91,13 +62,6 @@ static unsigned long pfn_end(struct dev_pagemap *pgmap, int range_id)
 	return (range->start + range_len(range)) >> PAGE_SHIFT;
 }
 
-static unsigned long pfn_next(unsigned long pfn)
-{
-	if (pfn % 1024 == 0)
-		cond_resched();
-	return pfn + 1;
-}
-
 /*
  * This returns true if the page is reserved by ZONE_DEVICE driver.
  */
@@ -118,9 +82,6 @@ bool pfn_zone_device_reserved(unsigned long pfn)
 	return ret;
 }
 
-#define for_each_device_pfn(pfn, map, i) \
-	for (pfn = pfn_first(map, i); pfn < pfn_end(map, i); pfn = pfn_next(pfn))
-
 static void dev_pagemap_kill(struct dev_pagemap *pgmap)
 {
 	if (pgmap->ops && pgmap->ops->kill)
@@ -176,20 +137,18 @@ static void pageunmap_range(struct dev_pagemap *pgmap, int range_id)
 
 void memunmap_pages(struct dev_pagemap *pgmap)
 {
-	unsigned long pfn;
 	int i;
 
 	dev_pagemap_kill(pgmap);
 	for (i = 0; i < pgmap->nr_range; i++)
-		for_each_device_pfn(pfn, pgmap, i)
-			put_page(pfn_to_page(pfn));
+		percpu_ref_put_many(pgmap->ref, pfn_end(pgmap, i) -
+						pfn_first(pgmap, i));
 	dev_pagemap_cleanup(pgmap);
 
 	for (i = 0; i < pgmap->nr_range; i++)
 		pageunmap_range(pgmap, i);
 
 	WARN_ONCE(pgmap->altmap.alloc, "failed to free all reserved pages\n");
-	devmap_managed_enable_put();
 }
 EXPORT_SYMBOL_GPL(memunmap_pages);
 
@@ -327,7 +286,6 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 		.pgprot = PAGE_KERNEL,
 	};
 	const int nr_range = pgmap->nr_range;
-	bool need_devmap_managed = true;
 	int error, i;
 
 	if (WARN_ONCE(!nr_range, "nr_range must be specified\n"))
@@ -343,6 +301,10 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 			WARN(1, "Missing migrate_to_ram method\n");
 			return ERR_PTR(-EINVAL);
 		}
+		if (!pgmap->ops->page_free) {
+			WARN(1, "Missing page_free method\n");
+			return ERR_PTR(-EINVAL);
+		}
 		if (!pgmap->owner) {
 			WARN(1, "Missing owner\n");
 			return ERR_PTR(-EINVAL);
@@ -356,11 +318,9 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 		}
 		break;
 	case MEMORY_DEVICE_GENERIC:
-		need_devmap_managed = false;
 		break;
 	case MEMORY_DEVICE_PCI_P2PDMA:
 		params.pgprot = pgprot_noncached(params.pgprot);
-		need_devmap_managed = false;
 		break;
 	default:
 		WARN(1, "Invalid pgmap type %d\n", pgmap->type);
@@ -384,12 +344,6 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 		}
 	}
 
-	if (need_devmap_managed) {
-		error = devmap_managed_enable_get(pgmap);
-		if (error)
-			return ERR_PTR(error);
-	}
-
 	/*
 	 * Clear the pgmap nr_range as it will be incremented for each
 	 * successfully processed range. This communicates how many
@@ -504,16 +458,9 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 EXPORT_SYMBOL_GPL(get_dev_pagemap);
 
 #ifdef CONFIG_DEV_PAGEMAP_OPS
-void free_devmap_managed_page(struct page *page)
+static void free_device_private_page(struct page *page)
 {
-	/* notify page idle for dax */
-	if (!is_device_private_page(page)) {
-		wake_up_var(&page->_refcount);
-		return;
-	}
-
 	__ClearPageWaiters(page);
-
 	mem_cgroup_uncharge(page);
 
 	/*
@@ -540,4 +487,19 @@ void free_devmap_managed_page(struct page *page)
 	page->mapping = NULL;
 	page->pgmap->ops->page_free(page);
 }
+
+void free_zone_device_page(struct page *page)
+{
+	switch (page->pgmap->type) {
+	case MEMORY_DEVICE_FS_DAX:
+		/* notify page idle */
+		wake_up_var(&page->_refcount);
+		return;
+	case MEMORY_DEVICE_PRIVATE:
+		free_device_private_page(page);
+		return;
+	default:
+		return;
+	}
+}
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
diff --git a/mm/migrate.c b/mm/migrate.c
index 5ca5842df5db..ee09334b46d8 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -380,11 +380,6 @@ static int expected_page_refs(struct address_space *mapping, struct page *page)
 {
 	int expected_count = 1;
 
-	/*
-	 * Device private pages have an extra refcount as they are
-	 * ZONE_DEVICE pages.
-	 */
-	expected_count += is_device_private_page(page);
 	if (mapping)
 		expected_count += thp_nr_pages(page) + page_has_private(page);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eb0962976a0e..039b88bb3978 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6154,6 +6154,9 @@ void __ref memmap_init_zone_device(struct zone *zone,
 
 		__init_single_page(page, pfn, zone_idx, nid);
 
+		/* ZONE_DEVICE pages start with a zero reference count. */
+		set_page_count(page, 0);
+
 		/*
 		 * Mark page reserved as it will need to wait for onlining
 		 * phase for it to be fully associated with a zone.
diff --git a/mm/swap.c b/mm/swap.c
index 0eb057141a04..82cbe069dcf3 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -116,12 +116,11 @@ static void __put_compound_page(struct page *page)
 void __put_page(struct page *page)
 {
 	if (is_zone_device_page(page)) {
-		put_dev_pagemap(page->pgmap);
-
 		/*
 		 * The page belongs to the device that created pgmap. Do
 		 * not return it to page allocator.
 		 */
+		free_zone_device_page(page);
 		return;
 	}
 
@@ -891,26 +890,18 @@ void release_pages(struct page **pages, int nr)
 		if (is_huge_zero_page(page))
 			continue;
 
+		if (!put_page_testzero(page))
+			continue;
+
 		if (is_zone_device_page(page)) {
 			if (locked_pgdat) {
 				spin_unlock_irqrestore(&locked_pgdat->lru_lock,
 						       flags);
 				locked_pgdat = NULL;
 			}
-			/*
-			 * ZONE_DEVICE pages that return 'false' from
-			 * page_is_devmap_managed() do not require special
-			 * processing, and instead, expect a call to
-			 * put_page_testzero().
-			 */
-			if (page_is_devmap_managed(page)) {
-				put_devmap_managed_page(page);
-				continue;
-			}
-		}
-
-		if (!put_page_testzero(page))
+			free_zone_device_page(page);
 			continue;
+		}
 
 		if (PageCompound(page)) {
 			if (locked_pgdat) {
@@ -1188,26 +1179,3 @@ void __init swap_setup(void)
 	 * _really_ don't want to cluster much more
 	 */
 }
-
-#ifdef CONFIG_DEV_PAGEMAP_OPS
-void put_devmap_managed_page(struct page *page)
-{
-	int count;
-
-	if (WARN_ON_ONCE(!page_is_devmap_managed(page)))
-		return;
-
-	count = page_ref_dec_return(page);
-
-	/*
-	 * devmap page refcounts are 1-based, rather than 0-based: if
-	 * refcount is 1, then the page is free and the refcount is
-	 * stable because nobody holds a reference on the page.
-	 */
-	if (count == 1)
-		free_devmap_managed_page(page);
-	else if (!count)
-		__put_page(page);
-}
-EXPORT_SYMBOL(put_devmap_managed_page);
-#endif
-- 
2.20.1

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v3 1/2] ext4/xfs: add page refcount helper
  2020-10-01 18:17   ` Ralph Campbell
  (?)
@ 2020-10-02  5:56   ` Christoph Hellwig
  2020-10-05 16:29       ` Ralph Campbell
  -1 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2020-10-02  5:56 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: linux-mm, kvm-ppc, nouveau, linux-kernel, Dan Williams,
	Ira Weiny, Matthew Wilcox, Jerome Glisse, John Hubbard,
	Alistair Popple, Christoph Hellwig, Jason Gunthorpe,
	Bharata B Rao, Zi Yan, Kirill A . Shutemov, Yang Shi,
	Paul Mackerras, Ben Skeggs, Andrew Morton

On Thu, Oct 01, 2020 at 11:17:14AM -0700, Ralph Campbell wrote:
> There are several places where ZONE_DEVICE struct pages assume a reference
> count == 1 means the page is idle and free. Instead of open coding this,
> add a helper function to hide this detail.
> 
> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v3 2/2] mm: remove extra ZONE_DEVICE struct page refcount
  2020-10-01 18:17   ` Ralph Campbell
  (?)
@ 2020-10-02  5:59   ` Christoph Hellwig
  2020-10-05 16:40       ` Ralph Campbell
  -1 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2020-10-02  5:59 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: linux-mm, kvm-ppc, nouveau, linux-kernel, Dan Williams,
	Ira Weiny, Matthew Wilcox, Jerome Glisse, John Hubbard,
	Alistair Popple, Christoph Hellwig, Jason Gunthorpe,
	Bharata B Rao, Zi Yan, Kirill A . Shutemov, Yang Shi,
	Paul Mackerras, Ben Skeggs, Andrew Morton

On Thu, Oct 01, 2020 at 11:17:15AM -0700, Ralph Campbell wrote:
> ZONE_DEVICE struct pages have an extra reference count that complicates the
> code for put_page() and several places in the kernel that need to check the
> reference count to see that a page is not being used (gup, compaction,
> migration, etc.). Clean up the code so the reference count doesn't need to
> be treated specially for ZONE_DEVICE.
> 
> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v3 1/2] ext4/xfs: add page refcount helper
  2020-10-02  5:56   ` Christoph Hellwig
@ 2020-10-05 16:29       ` Ralph Campbell
  0 siblings, 0 replies; 22+ messages in thread
From: Ralph Campbell @ 2020-10-05 16:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-mm, kvm-ppc, nouveau, linux-kernel, Dan Williams,
	Ira Weiny, Matthew Wilcox, Jerome Glisse, John Hubbard,
	Alistair Popple, Jason Gunthorpe, Bharata B Rao, Zi Yan,
	Kirill A . Shutemov, Yang Shi, Paul Mackerras, Ben Skeggs,
	Andrew Morton



On 10/1/20 10:56 PM, Christoph Hellwig wrote:
> On Thu, Oct 01, 2020 at 11:17:14AM -0700, Ralph Campbell wrote:
>> There are several places where ZONE_DEVICE struct pages assume a reference
>> count == 1 means the page is idle and free. Instead of open coding this,
>> add a helper function to hide this detail.
>>
>> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
> 
> Looks good:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> 

Thanks for the review.
I'll resend this as an independent patch.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v3 1/2] ext4/xfs: add page refcount helper
@ 2020-10-05 16:29       ` Ralph Campbell
  0 siblings, 0 replies; 22+ messages in thread
From: Ralph Campbell @ 2020-10-05 16:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-mm, kvm-ppc, nouveau, linux-kernel, Dan Williams,
	Ira Weiny, Matthew Wilcox, Jerome Glisse, John Hubbard,
	Alistair Popple, Jason Gunthorpe, Bharata B Rao, Zi Yan,
	Kirill A . Shutemov, Yang Shi, Paul Mackerras, Ben Skeggs,
	Andrew Morton



On 10/1/20 10:56 PM, Christoph Hellwig wrote:
> On Thu, Oct 01, 2020 at 11:17:14AM -0700, Ralph Campbell wrote:
>> There are several places where ZONE_DEVICE struct pages assume a reference
>> count == 1 means the page is idle and free. Instead of open coding this,
>> add a helper function to hide this detail.
>>
>> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
> 
> Looks good:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> 

Thanks for the review.
I'll resend this as an independent patch.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v3 2/2] mm: remove extra ZONE_DEVICE struct page refcount
  2020-10-02  5:59   ` Christoph Hellwig
@ 2020-10-05 16:40       ` Ralph Campbell
  0 siblings, 0 replies; 22+ messages in thread
From: Ralph Campbell @ 2020-10-05 16:40 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-mm, kvm-ppc, nouveau, linux-kernel, Dan Williams,
	Ira Weiny, Matthew Wilcox, Jerome Glisse, John Hubbard,
	Alistair Popple, Jason Gunthorpe, Bharata B Rao, Zi Yan,
	Kirill A . Shutemov, Yang Shi, Paul Mackerras, Ben Skeggs,
	Andrew Morton


On 10/1/20 10:59 PM, Christoph Hellwig wrote:
> On Thu, Oct 01, 2020 at 11:17:15AM -0700, Ralph Campbell wrote:
>> ZONE_DEVICE struct pages have an extra reference count that complicates the
>> code for put_page() and several places in the kernel that need to check the
>> reference count to see that a page is not being used (gup, compaction,
>> migration, etc.). Clean up the code so the reference count doesn't need to
>> be treated specially for ZONE_DEVICE.
>>
>> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
> 
> Looks good,
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Thanks for the review.

I still have reservations about making this an official patch.
Did you see the updated cover letter?
Basically, I'm concerned about ZONE_DEVICE struct pages being inserted into
the process page table with a zero reference count with vmf_insert_mixed().
If it is to be a non-zero reference count, then DAX, pmem, and other uses
of ZONE_DEVICE pages need to be changed (or vmf_insert_mixed()) to
inc/dec in appropriate places but I don't feel I know that code well enough
to make those changes.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v3 2/2] mm: remove extra ZONE_DEVICE struct page refcount
@ 2020-10-05 16:40       ` Ralph Campbell
  0 siblings, 0 replies; 22+ messages in thread
From: Ralph Campbell @ 2020-10-05 16:40 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-mm, kvm-ppc, nouveau, linux-kernel, Dan Williams,
	Ira Weiny, Matthew Wilcox, Jerome Glisse, John Hubbard,
	Alistair Popple, Jason Gunthorpe, Bharata B Rao, Zi Yan,
	Kirill A . Shutemov, Yang Shi, Paul Mackerras, Ben Skeggs,
	Andrew Morton


On 10/1/20 10:59 PM, Christoph Hellwig wrote:
> On Thu, Oct 01, 2020 at 11:17:15AM -0700, Ralph Campbell wrote:
>> ZONE_DEVICE struct pages have an extra reference count that complicates the
>> code for put_page() and several places in the kernel that need to check the
>> reference count to see that a page is not being used (gup, compaction,
>> migration, etc.). Clean up the code so the reference count doesn't need to
>> be treated specially for ZONE_DEVICE.
>>
>> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
> 
> Looks good,
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Thanks for the review.

I still have reservations about making this an official patch.
Did you see the updated cover letter?
Basically, I'm concerned about ZONE_DEVICE struct pages being inserted into
the process page table with a zero reference count with vmf_insert_mixed().
If it is to be a non-zero reference count, then DAX, pmem, and other uses
of ZONE_DEVICE pages need to be changed (or vmf_insert_mixed()) to
inc/dec in appropriate places but I don't feel I know that code well enough
to make those changes.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v3 0/2] mm: remove extra ZONE_DEVICE struct page refcount
  2020-10-01 18:17 ` Ralph Campbell
                   ` (2 preceding siblings ...)
  (?)
@ 2020-10-05 16:56 ` Matthew Wilcox
  -1 siblings, 0 replies; 22+ messages in thread
From: Matthew Wilcox @ 2020-10-05 16:56 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: linux-mm, kvm-ppc, nouveau, linux-kernel, Dan Williams,
	Ira Weiny, Jerome Glisse, John Hubbard, Alistair Popple,
	Christoph Hellwig, Jason Gunthorpe, Bharata B Rao, Zi Yan,
	Kirill A . Shutemov, Yang Shi, Paul Mackerras, Ben Skeggs,
	Andrew Morton

On Thu, Oct 01, 2020 at 11:17:13AM -0700, Ralph Campbell wrote:
> This is still an RFC because after looking at the pmem/dax code some
> more, I realized that the ZONE_DEVICE struct pages are being inserted
> into the process' page tables with vmf_insert_mixed() and a zero
> refcount on the ZONE_DEVICE struct page. This is sort of OK because
> insert_pfn() increments the reference count on the pgmap which is what
> prevents memunmap_pages() from freeing the struct pages and it doesn't
> check for a non-zero struct page reference count.
> But, any calls to get_page() will hit the VM_BUG_ON_PAGE() that
> checks for a reference count == 0.

And presumably get_user_pages() would call get_page()?  Or is that gated
off somewhere?

Looking at this from a design standpoint instead of getting into the
code, I would say that a struct page which cannot be referenced from
anywhere should have a reference count of 0.  As soon as it can be
found by something, it should have a positive refcount.

For DAX, I think that means it should have its refcount raised to
1 immediately before a DAX entry for it is put in the page cache.
It should have its refcount decreased when its DAX entry is removed from
the page cache.

That should ensure that we can't put a 0-ref DAX page in the page tables,
and thus it can't be found by get_user_pages().

> // mmap() an ext4 file that is mounted -o dax.
> ext4_dax_fault()
>   ext4_dax_huge_fault()
>     dax_iomap_fault(&ext4_iomap_ops)
>       dax_iomap_pte_fault()
>         ops->iomap_begin() // ext4_iomap_begin()
>           ext4_map_blocks()
>           ext4_set_iomap()
>         dax_iomap_pfn()
>         dax_insert_entry()
>         vmf_insert_mixed(pfn)
>           __vm_insert_mixed()
>             if (!IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) &&
>                 !pfn_t_devmap(pfn) && pfn_t_valid(pfn))
>               insert_page()
>                 get_page(page) // XXX would trigger VM_BUG_ON_PAGE()
>                 page_add_file_rmap()
>                 set_pte_at()
>             else
>               insert_pfn()
>                 pte_mkdevmap()
>                 set_pte_at()
> 
> Should pmem set the page reference count to one before inserting the
> pfn into the page tables (and decrement when removing devmap PTEs)?
> What about MEMORY_DEVICE_GENERIC and MEMORY_DEVICE_PCI_P2PDMA use cases?
> Where should they icrement/decrement the page reference count?
> I don't know enough about how these are used to really know what to
> do at this point. If people want me to continue to work on this series,
> I will need some guidance.
> 
> ---
> 
> Matthew Wilcox, Ira Weiny, and others have complained that ZONE_DEVICE
> struct page reference counting is ugly because they are "free" when the
> reference count is one instead of zero. This leads to explicit checks
> for ZONE_DEVICE pages in places like put_page(), GUP, THP splitting, and
> page migration which have to adjust the expected reference count when
> determining if the page is isolated or idle. This is my attempt to make
> ZONE_DEVICE pages be free when the reference count is zero and removing
> the special cases.
> 
> I'm only sending this out as a RFC since I'm not that familiar with the
> DAX, PMEM, XEN, and other uses of ZONE_DEVICE struct pages allocated
> with devm_memremap_pages() or memremap_pages() but my best reading of
> the code looks like it might be OK. I could use help testing these
> configurations.
> I have been able to successfully run xfstests on ext4 with the memmap
> kernel boot option to simulate pmem.
> 
> Changes in v3:
> Rebase to linux-mm 5.9.0-rc7-mm1.
> Added a check for page_free() as suggested by Christoph Hellwig.
> Added a helper for dax_wait_page() as suggested by Christoph Hellwig.
> 
> Changes in v2:
> One of the big changes in v2 is that devm_memremap_pages() and
> memremap_pages() now return the struct pages' reference count set to
> zero instead of one. Normally, get_page() will VM_BUG_ON_PAGE() if
> page->_refcount is zero. I didn't see any such warnings running the
> xfstests with dax/pmem but I'm not clear how the zero to one reference
> count is handled.
> 
> Other changes in v2:
> Rebased to Linux-5.9.0-rc6 to include pmem fixes.
> I added patch 1 to introduce a page refcount helper for ext4 and xfs as
> suggested by Christoph Hellwig.
> I also applied Christoph Hellwig's other suggested changes for removing
> the devmap_managed_key, etc.
> 
> Ralph Campbell (2):
>   ext4/xfs: add page refcount helper
>   mm: remove extra ZONE_DEVICE struct page refcount
> 
>  arch/powerpc/kvm/book3s_hv_uvmem.c     |  2 +-
>  drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
>  fs/dax.c                               |  8 +--
>  fs/ext4/inode.c                        |  5 +-
>  fs/xfs/xfs_file.c                      |  4 +-
>  include/linux/dax.h                    | 10 +++
>  include/linux/memremap.h               |  7 ++-
>  include/linux/mm.h                     | 44 --------------
>  lib/test_hmm.c                         |  2 +-
>  mm/gup.c                               | 44 --------------
>  mm/internal.h                          |  8 +++
>  mm/memremap.c                          | 84 +++++++-------------------
>  mm/migrate.c                           |  5 --
>  mm/page_alloc.c                        |  3 +
>  mm/swap.c                              | 44 ++------------
>  15 files changed, 63 insertions(+), 209 deletions(-)
> 
> -- 
> 2.20.1
> 
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v3 0/2] mm: remove extra ZONE_DEVICE struct page refcount
  2020-10-01 18:17 ` Ralph Campbell
@ 2020-10-05 18:24   ` Dan Williams
  -1 siblings, 0 replies; 22+ messages in thread
From: Dan Williams @ 2020-10-05 18:24 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: Linux MM, kvm-ppc, nouveau, Linux Kernel Mailing List, Ira Weiny,
	Matthew Wilcox, Jerome Glisse, John Hubbard, Alistair Popple,
	Christoph Hellwig, Jason Gunthorpe, Bharata B Rao, Zi Yan,
	Kirill A . Shutemov, Yang Shi, Paul Mackerras, Ben Skeggs,
	Andrew Morton

On Thu, Oct 1, 2020 at 11:17 AM Ralph Campbell <rcampbell@nvidia.com> wrote:
>
> This is still an RFC because after looking at the pmem/dax code some
> more, I realized that the ZONE_DEVICE struct pages are being inserted
> into the process' page tables with vmf_insert_mixed() and a zero
> refcount on the ZONE_DEVICE struct page. This is sort of OK because
> insert_pfn() increments the reference count on the pgmap which is what
> prevents memunmap_pages() from freeing the struct pages and it doesn't
> check for a non-zero struct page reference count.
> But, any calls to get_page() will hit the VM_BUG_ON_PAGE() that
> checks for a reference count == 0.
>
> // mmap() an ext4 file that is mounted -o dax.
> ext4_dax_fault()
>   ext4_dax_huge_fault()
>     dax_iomap_fault(&ext4_iomap_ops)
>       dax_iomap_pte_fault()
>         ops->iomap_begin() // ext4_iomap_begin()
>           ext4_map_blocks()
>           ext4_set_iomap()
>         dax_iomap_pfn()
>         dax_insert_entry()
>         vmf_insert_mixed(pfn)
>           __vm_insert_mixed()
>             if (!IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) &&
>                 !pfn_t_devmap(pfn) && pfn_t_valid(pfn))
>               insert_page()
>                 get_page(page) // XXX would trigger VM_BUG_ON_PAGE()
>                 page_add_file_rmap()
>                 set_pte_at()
>             else
>               insert_pfn()
>                 pte_mkdevmap()
>                 set_pte_at()
>
> Should pmem set the page reference count to one before inserting the
> pfn into the page tables (and decrement when removing devmap PTEs)?
> What about MEMORY_DEVICE_GENERIC and MEMORY_DEVICE_PCI_P2PDMA use cases?
> Where should they icrement/decrement the page reference count?
> I don't know enough about how these are used to really know what to
> do at this point. If people want me to continue to work on this series,
> I will need some guidance.

fs/dax could take the reference when inserting, but that would mean
that ext4 and xfs would need to go back to checking for 1 to be page
idle. I think that's ok because the filesystem is actually not
checking for page-idle it's checking for "get_user_pages()" idle.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v3 0/2] mm: remove extra ZONE_DEVICE struct page refcount
@ 2020-10-05 18:24   ` Dan Williams
  0 siblings, 0 replies; 22+ messages in thread
From: Dan Williams @ 2020-10-05 18:24 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: Linux MM, kvm-ppc, nouveau, Linux Kernel Mailing List, Ira Weiny,
	Matthew Wilcox, Jerome Glisse, John Hubbard, Alistair Popple,
	Christoph Hellwig, Jason Gunthorpe, Bharata B Rao, Zi Yan,
	Kirill A . Shutemov, Yang Shi, Paul Mackerras, Ben Skeggs,
	Andrew Morton

On Thu, Oct 1, 2020 at 11:17 AM Ralph Campbell <rcampbell@nvidia.com> wrote:
>
> This is still an RFC because after looking at the pmem/dax code some
> more, I realized that the ZONE_DEVICE struct pages are being inserted
> into the process' page tables with vmf_insert_mixed() and a zero
> refcount on the ZONE_DEVICE struct page. This is sort of OK because
> insert_pfn() increments the reference count on the pgmap which is what
> prevents memunmap_pages() from freeing the struct pages and it doesn't
> check for a non-zero struct page reference count.
> But, any calls to get_page() will hit the VM_BUG_ON_PAGE() that
> checks for a reference count == 0.
>
> // mmap() an ext4 file that is mounted -o dax.
> ext4_dax_fault()
>   ext4_dax_huge_fault()
>     dax_iomap_fault(&ext4_iomap_ops)
>       dax_iomap_pte_fault()
>         ops->iomap_begin() // ext4_iomap_begin()
>           ext4_map_blocks()
>           ext4_set_iomap()
>         dax_iomap_pfn()
>         dax_insert_entry()
>         vmf_insert_mixed(pfn)
>           __vm_insert_mixed()
>             if (!IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) &&
>                 !pfn_t_devmap(pfn) && pfn_t_valid(pfn))
>               insert_page()
>                 get_page(page) // XXX would trigger VM_BUG_ON_PAGE()
>                 page_add_file_rmap()
>                 set_pte_at()
>             else
>               insert_pfn()
>                 pte_mkdevmap()
>                 set_pte_at()
>
> Should pmem set the page reference count to one before inserting the
> pfn into the page tables (and decrement when removing devmap PTEs)?
> What about MEMORY_DEVICE_GENERIC and MEMORY_DEVICE_PCI_P2PDMA use cases?
> Where should they icrement/decrement the page reference count?
> I don't know enough about how these are used to really know what to
> do at this point. If people want me to continue to work on this series,
> I will need some guidance.

fs/dax could take the reference when inserting, but that would mean
that ext4 and xfs would need to go back to checking for 1 to be page
idle. I think that's ok because the filesystem is actually not
checking for page-idle it's checking for "get_user_pages()" idle.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v3 2/2] mm: remove extra ZONE_DEVICE struct page refcount
  2020-10-01 18:17   ` Ralph Campbell
  (?)
  (?)
@ 2020-10-08  5:17   ` Ram Pai
  2020-10-08 16:45       ` Ralph Campbell
  -1 siblings, 1 reply; 22+ messages in thread
From: Ram Pai @ 2020-10-08  5:17 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: linux-mm, kvm-ppc, nouveau, linux-kernel, Dan Williams,
	Ira Weiny, Matthew Wilcox, Jerome Glisse, John Hubbard,
	Alistair Popple, Christoph Hellwig, Jason Gunthorpe,
	Bharata B Rao, Zi Yan, Kirill A . Shutemov, Yang Shi,
	Paul Mackerras, Ben Skeggs, Andrew Morton

On Thu, Oct 01, 2020 at 11:17:15AM -0700, Ralph Campbell wrote:
> ZONE_DEVICE struct pages have an extra reference count that complicates the
> code for put_page() and several places in the kernel that need to check the
> reference count to see that a page is not being used (gup, compaction,
> migration, etc.). Clean up the code so the reference count doesn't need to
> be treated specially for ZONE_DEVICE.


I was hoping this patch would resolve a page-reference issue that we run
into at random times while migrating a page to a device page backed by
secure-memory.

Unfortunately I continue to see the problem. There is a reference
held on that page, which fails the migration.

FYI
RP

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v3 2/2] mm: remove extra ZONE_DEVICE struct page refcount
  2020-10-08  5:17   ` Ram Pai
@ 2020-10-08 16:45       ` Ralph Campbell
  0 siblings, 0 replies; 22+ messages in thread
From: Ralph Campbell @ 2020-10-08 16:45 UTC (permalink / raw)
  To: Ram Pai
  Cc: linux-mm, kvm-ppc, nouveau, linux-kernel, Dan Williams,
	Ira Weiny, Matthew Wilcox, Jerome Glisse, John Hubbard,
	Alistair Popple, Christoph Hellwig, Jason Gunthorpe,
	Bharata B Rao, Zi Yan, Kirill A . Shutemov, Yang Shi,
	Paul Mackerras, Ben Skeggs, Andrew Morton


On 10/7/20 10:17 PM, Ram Pai wrote:
> On Thu, Oct 01, 2020 at 11:17:15AM -0700, Ralph Campbell wrote:
>> ZONE_DEVICE struct pages have an extra reference count that complicates the
>> code for put_page() and several places in the kernel that need to check the
>> reference count to see that a page is not being used (gup, compaction,
>> migration, etc.). Clean up the code so the reference count doesn't need to
>> be treated specially for ZONE_DEVICE.
> 
> 
> I was hoping this patch would resolve a page-reference issue that we run
> into at random times while migrating a page to a device page backed by
> secure-memory.
> 
> Unfortunately I continue to see the problem. There is a reference
> held on that page, which fails the migration.
> 
> FYI
> RP

I'm willing to look into it but I would need more information.
Can you give any more details about the conditions when it happens?
It would be great if you have a program that reproduces the problem.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v3 2/2] mm: remove extra ZONE_DEVICE struct page refcount
@ 2020-10-08 16:45       ` Ralph Campbell
  0 siblings, 0 replies; 22+ messages in thread
From: Ralph Campbell @ 2020-10-08 16:45 UTC (permalink / raw)
  To: Ram Pai
  Cc: linux-mm, kvm-ppc, nouveau, linux-kernel, Dan Williams,
	Ira Weiny, Matthew Wilcox, Jerome Glisse, John Hubbard,
	Alistair Popple, Christoph Hellwig, Jason Gunthorpe,
	Bharata B Rao, Zi Yan, Kirill A . Shutemov, Yang Shi,
	Paul Mackerras, Ben Skeggs, Andrew Morton


On 10/7/20 10:17 PM, Ram Pai wrote:
> On Thu, Oct 01, 2020 at 11:17:15AM -0700, Ralph Campbell wrote:
>> ZONE_DEVICE struct pages have an extra reference count that complicates the
>> code for put_page() and several places in the kernel that need to check the
>> reference count to see that a page is not being used (gup, compaction,
>> migration, etc.). Clean up the code so the reference count doesn't need to
>> be treated specially for ZONE_DEVICE.
> 
> 
> I was hoping this patch would resolve a page-reference issue that we run
> into at random times while migrating a page to a device page backed by
> secure-memory.
> 
> Unfortunately I continue to see the problem. There is a reference
> held on that page, which fails the migration.
> 
> FYI
> RP

I'm willing to look into it but I would need more information.
Can you give any more details about the conditions when it happens?
It would be great if you have a program that reproduces the problem.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v3 0/2] mm: remove extra ZONE_DEVICE struct page refcount
  2020-10-01 18:17 ` Ralph Campbell
@ 2021-05-13 13:15   ` Matthew Wilcox
  -1 siblings, 0 replies; 22+ messages in thread
From: Matthew Wilcox @ 2021-05-13 13:15 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: linux-mm, kvm-ppc, nouveau, linux-kernel, Dan Williams,
	Ira Weiny, Jerome Glisse, John Hubbard, Alistair Popple,
	Christoph Hellwig, Jason Gunthorpe, Bharata B Rao, Zi Yan,
	Kirill A . Shutemov, Yang Shi, Paul Mackerras, Ben Skeggs,
	Andrew Morton

On Thu, Oct 01, 2020 at 11:17:13AM -0700, Ralph Campbell wrote:
> This is still an RFC because after looking at the pmem/dax code some
> more, I realized that the ZONE_DEVICE struct pages are being inserted
> into the process' page tables with vmf_insert_mixed() and a zero
> refcount on the ZONE_DEVICE struct page. This is sort of OK because
> insert_pfn() increments the reference count on the pgmap which is what
> prevents memunmap_pages() from freeing the struct pages and it doesn't
> check for a non-zero struct page reference count.
> But, any calls to get_page() will hit the VM_BUG_ON_PAGE() that
> checks for a reference count == 0.

This seems to have gone quiet.  What needs to happen to resurrect this?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Nouveau] [RFC PATCH v3 0/2] mm: remove extra ZONE_DEVICE struct page refcount
@ 2021-05-13 13:15   ` Matthew Wilcox
  0 siblings, 0 replies; 22+ messages in thread
From: Matthew Wilcox @ 2021-05-13 13:15 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: Yang Shi, Zi Yan, nouveau, Alistair Popple, linux-kernel,
	kvm-ppc, Bharata B Rao, Paul Mackerras, linux-mm, Ben Skeggs,
	Jason Gunthorpe, Dan Williams, Andrew Morton, Ira Weiny,
	Christoph Hellwig, Kirill A . Shutemov

On Thu, Oct 01, 2020 at 11:17:13AM -0700, Ralph Campbell wrote:
> This is still an RFC because after looking at the pmem/dax code some
> more, I realized that the ZONE_DEVICE struct pages are being inserted
> into the process' page tables with vmf_insert_mixed() and a zero
> refcount on the ZONE_DEVICE struct page. This is sort of OK because
> insert_pfn() increments the reference count on the pgmap which is what
> prevents memunmap_pages() from freeing the struct pages and it doesn't
> check for a non-zero struct page reference count.
> But, any calls to get_page() will hit the VM_BUG_ON_PAGE() that
> checks for a reference count == 0.

This seems to have gone quiet.  What needs to happen to resurrect this?
_______________________________________________
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v3 0/2] mm: remove extra ZONE_DEVICE struct page refcount
  2021-05-13 13:15   ` [Nouveau] " Matthew Wilcox
@ 2021-05-13 17:35     ` Ralph Campbell
  -1 siblings, 0 replies; 22+ messages in thread
From: Ralph Campbell @ 2021-05-13 17:35 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, kvm-ppc, nouveau, linux-kernel, Dan Williams,
	Ira Weiny, Jerome Glisse, John Hubbard, Alistair Popple,
	Christoph Hellwig, Jason Gunthorpe, Bharata B Rao, Zi Yan,
	Kirill A . Shutemov, Yang Shi, Paul Mackerras, Ben Skeggs,
	Andrew Morton


On 5/13/21 6:15 AM, Matthew Wilcox wrote:
> On Thu, Oct 01, 2020 at 11:17:13AM -0700, Ralph Campbell wrote:
>> This is still an RFC because after looking at the pmem/dax code some
>> more, I realized that the ZONE_DEVICE struct pages are being inserted
>> into the process' page tables with vmf_insert_mixed() and a zero
>> refcount on the ZONE_DEVICE struct page. This is sort of OK because
>> insert_pfn() increments the reference count on the pgmap which is what
>> prevents memunmap_pages() from freeing the struct pages and it doesn't
>> check for a non-zero struct page reference count.
>> But, any calls to get_page() will hit the VM_BUG_ON_PAGE() that
>> checks for a reference count == 0.
> 
> This seems to have gone quiet.  What needs to happen to resurrect this?
> 
The main thing I need is time. I have been tied up with other commitments,
there has been a lot of changes going on in FS/DAX and the page cache,
and FS/DAX doesn't use the page reference count to indicate the page is
"free" but rather that it is "idle" so I need a lot of time to really
understand why FS/DAX isn't just any FS on top of a DAX block device.

I too wish this was easier to fix.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Nouveau] [RFC PATCH v3 0/2] mm: remove extra ZONE_DEVICE struct page refcount
@ 2021-05-13 17:35     ` Ralph Campbell
  0 siblings, 0 replies; 22+ messages in thread
From: Ralph Campbell @ 2021-05-13 17:35 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Yang Shi, Zi Yan, nouveau, Alistair Popple, linux-kernel,
	kvm-ppc, Bharata B Rao, Paul Mackerras, linux-mm, Ben Skeggs,
	Jason Gunthorpe, Dan Williams, Andrew Morton, Ira Weiny,
	Christoph Hellwig, Kirill A . Shutemov


On 5/13/21 6:15 AM, Matthew Wilcox wrote:
> On Thu, Oct 01, 2020 at 11:17:13AM -0700, Ralph Campbell wrote:
>> This is still an RFC because after looking at the pmem/dax code some
>> more, I realized that the ZONE_DEVICE struct pages are being inserted
>> into the process' page tables with vmf_insert_mixed() and a zero
>> refcount on the ZONE_DEVICE struct page. This is sort of OK because
>> insert_pfn() increments the reference count on the pgmap which is what
>> prevents memunmap_pages() from freeing the struct pages and it doesn't
>> check for a non-zero struct page reference count.
>> But, any calls to get_page() will hit the VM_BUG_ON_PAGE() that
>> checks for a reference count == 0.
> 
> This seems to have gone quiet.  What needs to happen to resurrect this?
> 
The main thing I need is time. I have been tied up with other commitments,
there has been a lot of changes going on in FS/DAX and the page cache,
and FS/DAX doesn't use the page reference count to indicate the page is
"free" but rather that it is "idle" so I need a lot of time to really
understand why FS/DAX isn't just any FS on top of a DAX block device.

I too wish this was easier to fix.
_______________________________________________
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2021-05-13 17:36 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-01 18:17 [RFC PATCH v3 0/2] mm: remove extra ZONE_DEVICE struct page refcount Ralph Campbell
2020-10-01 18:17 ` Ralph Campbell
2020-10-01 18:17 ` [RFC PATCH v3 1/2] ext4/xfs: add page refcount helper Ralph Campbell
2020-10-01 18:17   ` Ralph Campbell
2020-10-02  5:56   ` Christoph Hellwig
2020-10-05 16:29     ` Ralph Campbell
2020-10-05 16:29       ` Ralph Campbell
2020-10-01 18:17 ` [RFC PATCH v3 2/2] mm: remove extra ZONE_DEVICE struct page refcount Ralph Campbell
2020-10-01 18:17   ` Ralph Campbell
2020-10-02  5:59   ` Christoph Hellwig
2020-10-05 16:40     ` Ralph Campbell
2020-10-05 16:40       ` Ralph Campbell
2020-10-08  5:17   ` Ram Pai
2020-10-08 16:45     ` Ralph Campbell
2020-10-08 16:45       ` Ralph Campbell
2020-10-05 16:56 ` [RFC PATCH v3 0/2] " Matthew Wilcox
2020-10-05 18:24 ` Dan Williams
2020-10-05 18:24   ` Dan Williams
2021-05-13 13:15 ` Matthew Wilcox
2021-05-13 13:15   ` [Nouveau] " Matthew Wilcox
2021-05-13 17:35   ` Ralph Campbell
2021-05-13 17:35     ` [Nouveau] " Ralph Campbell

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.