linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v8 0/8] Add support for SVM atomics in Nouveau
@ 2021-04-07  8:42 Alistair Popple
  2021-04-07  8:42 ` [PATCH v8 1/8] mm: Remove special swap entry functions Alistair Popple
                   ` (8 more replies)
  0 siblings, 9 replies; 42+ messages in thread
From: Alistair Popple @ 2021-04-07  8:42 UTC (permalink / raw)
  To: linux-mm, nouveau, bskeggs, akpm
  Cc: Alistair Popple, linux-doc, linux-kernel, dri-devel, jhubbard,
	rcampbell, jglisse, jgg, hch, daniel, willy, bsingharora

This is the eighth version of a series to add support to Nouveau for atomic
memory operations on OpenCL shared virtual memory (SVM) regions.

The main change for this version is a simplification of device exclusive
entry handling. Instead of copying entries for copy-on-write mappings
during fork they are removed instead. This is safer because there could be
unique corner cases when copying, particularly for pinned pages which
should follow the same logic as copy_present_page(). Removing entries
avoids this possiblity by treating them as normal ptes.

Exclusive device access is implemented by adding a new swap entry type
(SWAP_DEVICE_EXCLUSIVE) which is similar to a migration entry. The main
difference is that on fault the original entry is immediately restored by
the fault handler instead of waiting.

Restoring the entry triggers calls to MMU notifers which allows a device
driver to revoke the atomic access permission from the GPU prior to the CPU
finalising the entry.

Patches 1 & 2 refactor existing migration and device private entry
functions.

Patches 3 & 4 rework try_to_unmap_one() by splitting out unrelated
functionality into separate functions - try_to_migrate_one() and
try_to_munlock_one(). These should not change any functionality, but any
help testing would be much appreciated as I have not been able to test
every usage of try_to_unmap_one().

Patch 5 contains the bulk of the implementation for device exclusive
memory.

Patch 6 contains some additions to the HMM selftests to ensure everything
works as expected.

Patch 7 is a cleanup for the Nouveau SVM implementation.

Patch 8 contains the implementation of atomic access for the Nouveau
driver.

This has been tested using the latest upstream Mesa userspace with a simple
OpenCL test program which checks the results of atomic GPU operations on a
SVM buffer whilst also writing to the same buffer from the CPU.

Alistair Popple (8):
  mm: Remove special swap entry functions
  mm/swapops: Rework swap entry manipulation code
  mm/rmap: Split try_to_munlock from try_to_unmap
  mm/rmap: Split migration into its own function
  mm: Device exclusive memory access
  mm: Selftests for exclusive device memory
  nouveau/svm: Refactor nouveau_range_fault
  nouveau/svm: Implement atomic SVM access

 Documentation/vm/hmm.rst                      |  19 +-
 Documentation/vm/unevictable-lru.rst          |  33 +-
 arch/s390/mm/pgtable.c                        |   2 +-
 drivers/gpu/drm/nouveau/include/nvif/if000c.h |   1 +
 drivers/gpu/drm/nouveau/nouveau_svm.c         | 156 ++++-
 drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h |   1 +
 .../drm/nouveau/nvkm/subdev/mmu/vmmgp100.c    |   6 +
 fs/proc/task_mmu.c                            |  23 +-
 include/linux/mmu_notifier.h                  |  26 +-
 include/linux/rmap.h                          |  11 +-
 include/linux/swap.h                          |   8 +-
 include/linux/swapops.h                       | 123 ++--
 lib/test_hmm.c                                | 126 +++-
 lib/test_hmm_uapi.h                           |   2 +
 mm/debug_vm_pgtable.c                         |  12 +-
 mm/hmm.c                                      |  12 +-
 mm/huge_memory.c                              |  45 +-
 mm/hugetlb.c                                  |  10 +-
 mm/memcontrol.c                               |   2 +-
 mm/memory.c                                   | 196 +++++-
 mm/migrate.c                                  |  51 +-
 mm/mlock.c                                    |  10 +-
 mm/mprotect.c                                 |  18 +-
 mm/page_vma_mapped.c                          |  15 +-
 mm/rmap.c                                     | 612 +++++++++++++++---
 tools/testing/selftests/vm/hmm-tests.c        | 158 +++++
 26 files changed, 1366 insertions(+), 312 deletions(-)

-- 
2.20.1



^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH v8 1/8] mm: Remove special swap entry functions
  2021-04-07  8:42 [PATCH v8 0/8] Add support for SVM atomics in Nouveau Alistair Popple
@ 2021-04-07  8:42 ` Alistair Popple
  2021-05-18  2:17   ` Peter Xu
  2021-04-07  8:42 ` [PATCH v8 2/8] mm/swapops: Rework swap entry manipulation code Alistair Popple
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 42+ messages in thread
From: Alistair Popple @ 2021-04-07  8:42 UTC (permalink / raw)
  To: linux-mm, nouveau, bskeggs, akpm
  Cc: Alistair Popple, linux-doc, linux-kernel, dri-devel, jhubbard,
	rcampbell, jglisse, jgg, hch, daniel, willy, bsingharora,
	Christoph Hellwig

Remove multiple similar inline functions for dealing with different
types of special swap entries.

Both migration and device private swap entries use the swap offset to
store a pfn. Instead of multiple inline functions to obtain a struct
page for each swap entry type use a common function
pfn_swap_entry_to_page(). Also open-code the various entry_to_pfn()
functions as this results is shorter code that is easier to understand.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>

---

v7:
* Reworded commit message to include pfn_swap_entry_to_page()
* Added Christoph's Reviewed-by

v6:
* Removed redundant compound_page() call from inside PageLocked()
* Fixed a minor build issue for s390 reported by kernel test bot

v4:
* Added pfn_swap_entry_to_page()
* Reinstated check that migration entries point to locked pages
* Removed #define swapcache_prepare which isn't needed for CONFIG_SWAP=0
  builds
---
 arch/s390/mm/pgtable.c  |  2 +-
 fs/proc/task_mmu.c      | 23 +++++---------
 include/linux/swap.h    |  4 +--
 include/linux/swapops.h | 69 ++++++++++++++---------------------------
 mm/hmm.c                |  5 ++-
 mm/huge_memory.c        |  4 +--
 mm/memcontrol.c         |  2 +-
 mm/memory.c             | 10 +++---
 mm/migrate.c            |  6 ++--
 mm/page_vma_mapped.c    |  6 ++--
 10 files changed, 50 insertions(+), 81 deletions(-)

diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 18205f851c24..eec3a9d7176e 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -691,7 +691,7 @@ static void ptep_zap_swap_entry(struct mm_struct *mm, swp_entry_t entry)
 	if (!non_swap_entry(entry))
 		dec_mm_counter(mm, MM_SWAPENTS);
 	else if (is_migration_entry(entry)) {
-		struct page *page = migration_entry_to_page(entry);
+		struct page *page = pfn_swap_entry_to_page(entry);
 
 		dec_mm_counter(mm, mm_counter(page));
 	}
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 3cec6fbef725..08ee59d945c0 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -514,10 +514,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 			} else {
 				mss->swap_pss += (u64)PAGE_SIZE << PSS_SHIFT;
 			}
-		} else if (is_migration_entry(swpent))
-			page = migration_entry_to_page(swpent);
-		else if (is_device_private_entry(swpent))
-			page = device_private_entry_to_page(swpent);
+		} else if (is_pfn_swap_entry(swpent))
+			page = pfn_swap_entry_to_page(swpent);
 	} else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap
 							&& pte_none(*pte))) {
 		page = xa_load(&vma->vm_file->f_mapping->i_pages,
@@ -549,7 +547,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
 		swp_entry_t entry = pmd_to_swp_entry(*pmd);
 
 		if (is_migration_entry(entry))
-			page = migration_entry_to_page(entry);
+			page = pfn_swap_entry_to_page(entry);
 	}
 	if (IS_ERR_OR_NULL(page))
 		return;
@@ -691,10 +689,8 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
 	} else if (is_swap_pte(*pte)) {
 		swp_entry_t swpent = pte_to_swp_entry(*pte);
 
-		if (is_migration_entry(swpent))
-			page = migration_entry_to_page(swpent);
-		else if (is_device_private_entry(swpent))
-			page = device_private_entry_to_page(swpent);
+		if (is_pfn_swap_entry(swpent))
+			page = pfn_swap_entry_to_page(swpent);
 	}
 	if (page) {
 		int mapcount = page_mapcount(page);
@@ -1383,11 +1379,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
 			frame = swp_type(entry) |
 				(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
 		flags |= PM_SWAP;
-		if (is_migration_entry(entry))
-			page = migration_entry_to_page(entry);
-
-		if (is_device_private_entry(entry))
-			page = device_private_entry_to_page(entry);
+		if (is_pfn_swap_entry(entry))
+			page = pfn_swap_entry_to_page(entry);
 	}
 
 	if (page && !PageAnon(page))
@@ -1444,7 +1437,7 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
 			if (pmd_swp_soft_dirty(pmd))
 				flags |= PM_SOFT_DIRTY;
 			VM_BUG_ON(!is_pmd_migration_entry(pmd));
-			page = migration_entry_to_page(entry);
+			page = pfn_swap_entry_to_page(entry);
 		}
 #endif
 
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4cc6ec3bf0ab..516104b9334b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -523,8 +523,8 @@ static inline void show_swap_cache_info(void)
 {
 }
 
-#define free_swap_and_cache(e) ({(is_migration_entry(e) || is_device_private_entry(e));})
-#define swapcache_prepare(e) ({(is_migration_entry(e) || is_device_private_entry(e));})
+/* used to sanity check ptes in zap_pte_range when CONFIG_SWAP=0 */
+#define free_swap_and_cache(e) is_pfn_swap_entry(e)
 
 static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
 {
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index d9b7c9132c2f..139be8235ad2 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -121,16 +121,6 @@ static inline bool is_write_device_private_entry(swp_entry_t entry)
 {
 	return unlikely(swp_type(entry) == SWP_DEVICE_WRITE);
 }
-
-static inline unsigned long device_private_entry_to_pfn(swp_entry_t entry)
-{
-	return swp_offset(entry);
-}
-
-static inline struct page *device_private_entry_to_page(swp_entry_t entry)
-{
-	return pfn_to_page(swp_offset(entry));
-}
 #else /* CONFIG_DEVICE_PRIVATE */
 static inline swp_entry_t make_device_private_entry(struct page *page, bool write)
 {
@@ -150,16 +140,6 @@ static inline bool is_write_device_private_entry(swp_entry_t entry)
 {
 	return false;
 }
-
-static inline unsigned long device_private_entry_to_pfn(swp_entry_t entry)
-{
-	return 0;
-}
-
-static inline struct page *device_private_entry_to_page(swp_entry_t entry)
-{
-	return NULL;
-}
 #endif /* CONFIG_DEVICE_PRIVATE */
 
 #ifdef CONFIG_MIGRATION
@@ -182,22 +162,6 @@ static inline int is_write_migration_entry(swp_entry_t entry)
 	return unlikely(swp_type(entry) == SWP_MIGRATION_WRITE);
 }
 
-static inline unsigned long migration_entry_to_pfn(swp_entry_t entry)
-{
-	return swp_offset(entry);
-}
-
-static inline struct page *migration_entry_to_page(swp_entry_t entry)
-{
-	struct page *p = pfn_to_page(swp_offset(entry));
-	/*
-	 * Any use of migration entries may only occur while the
-	 * corresponding page is locked
-	 */
-	BUG_ON(!PageLocked(compound_head(p)));
-	return p;
-}
-
 static inline void make_migration_entry_read(swp_entry_t *entry)
 {
 	*entry = swp_entry(SWP_MIGRATION_READ, swp_offset(*entry));
@@ -217,16 +181,6 @@ static inline int is_migration_entry(swp_entry_t swp)
 	return 0;
 }
 
-static inline unsigned long migration_entry_to_pfn(swp_entry_t entry)
-{
-	return 0;
-}
-
-static inline struct page *migration_entry_to_page(swp_entry_t entry)
-{
-	return NULL;
-}
-
 static inline void make_migration_entry_read(swp_entry_t *entryp) { }
 static inline void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
 					spinlock_t *ptl) { }
@@ -241,6 +195,29 @@ static inline int is_write_migration_entry(swp_entry_t entry)
 
 #endif
 
+static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
+{
+	struct page *p = pfn_to_page(swp_offset(entry));
+
+	/*
+	 * Any use of migration entries may only occur while the
+	 * corresponding page is locked
+	 */
+	BUG_ON(is_migration_entry(entry) && !PageLocked(p));
+
+	return p;
+}
+
+/*
+ * A pfn swap entry is a special type of swap entry that always has a pfn stored
+ * in the swap offset. They are used to represent unaddressable device memory
+ * and to restrict access to a page undergoing migration.
+ */
+static inline bool is_pfn_swap_entry(swp_entry_t entry)
+{
+	return is_migration_entry(entry) || is_device_private_entry(entry);
+}
+
 struct page_vma_mapped_walk;
 
 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
diff --git a/mm/hmm.c b/mm/hmm.c
index 943cb2ba4442..3b2dda71d0ed 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -214,7 +214,7 @@ static inline bool hmm_is_device_private_entry(struct hmm_range *range,
 		swp_entry_t entry)
 {
 	return is_device_private_entry(entry) &&
-		device_private_entry_to_page(entry)->pgmap->owner ==
+		pfn_swap_entry_to_page(entry)->pgmap->owner ==
 		range->dev_private_owner;
 }
 
@@ -257,8 +257,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 			cpu_flags = HMM_PFN_VALID;
 			if (is_write_device_private_entry(entry))
 				cpu_flags |= HMM_PFN_WRITE;
-			*hmm_pfn = device_private_entry_to_pfn(entry) |
-					cpu_flags;
+			*hmm_pfn = swp_offset(entry) | cpu_flags;
 			return 0;
 		}
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 395c75111d33..a4cda8564bcf 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1700,7 +1700,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 
 			VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
 			entry = pmd_to_swp_entry(orig_pmd);
-			page = pfn_to_page(swp_offset(entry));
+			page = pfn_swap_entry_to_page(entry);
 			flush_needed = 0;
 		} else
 			WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
@@ -2108,7 +2108,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		swp_entry_t entry;
 
 		entry = pmd_to_swp_entry(old_pmd);
-		page = pfn_to_page(swp_offset(entry));
+		page = pfn_swap_entry_to_page(entry);
 		write = is_write_migration_entry(entry);
 		young = false;
 		soft_dirty = pmd_swp_soft_dirty(old_pmd);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 845eec01ef9d..043840dbe48a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5523,7 +5523,7 @@ static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
 	 * as special swap entry in the CPU page table.
 	 */
 	if (is_device_private_entry(ent)) {
-		page = device_private_entry_to_page(ent);
+		page = pfn_swap_entry_to_page(ent);
 		/*
 		 * MEMORY_DEVICE_PRIVATE means ZONE_DEVICE page and which have
 		 * a refcount of 1 when free (unlike normal page)
diff --git a/mm/memory.c b/mm/memory.c
index c8e357627318..1c98e3c1c2de 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -730,7 +730,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		}
 		rss[MM_SWAPENTS]++;
 	} else if (is_migration_entry(entry)) {
-		page = migration_entry_to_page(entry);
+		page = pfn_swap_entry_to_page(entry);
 
 		rss[mm_counter(page)]++;
 
@@ -749,7 +749,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 			set_pte_at(src_mm, addr, src_pte, pte);
 		}
 	} else if (is_device_private_entry(entry)) {
-		page = device_private_entry_to_page(entry);
+		page = pfn_swap_entry_to_page(entry);
 
 		/*
 		 * Update rss count even for unaddressable pages, as
@@ -1286,7 +1286,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 
 		entry = pte_to_swp_entry(ptent);
 		if (is_device_private_entry(entry)) {
-			struct page *page = device_private_entry_to_page(entry);
+			struct page *page = pfn_swap_entry_to_page(entry);
 
 			if (unlikely(details && details->check_mapping)) {
 				/*
@@ -1315,7 +1315,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		else if (is_migration_entry(entry)) {
 			struct page *page;
 
-			page = migration_entry_to_page(entry);
+			page = pfn_swap_entry_to_page(entry);
 			rss[mm_counter(page)]--;
 		}
 		if (unlikely(!free_swap_and_cache(entry)))
@@ -3282,7 +3282,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 			migration_entry_wait(vma->vm_mm, vmf->pmd,
 					     vmf->address);
 		} else if (is_device_private_entry(entry)) {
-			vmf->page = device_private_entry_to_page(entry);
+			vmf->page = pfn_swap_entry_to_page(entry);
 			ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
 		} else if (is_hwpoison_entry(entry)) {
 			ret = VM_FAULT_HWPOISON;
diff --git a/mm/migrate.c b/mm/migrate.c
index 62b81d5257aa..600978d18750 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -321,7 +321,7 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
 	if (!is_migration_entry(entry))
 		goto out;
 
-	page = migration_entry_to_page(entry);
+	page = pfn_swap_entry_to_page(entry);
 
 	/*
 	 * Once page cache replacement of page migration started, page_count
@@ -361,7 +361,7 @@ void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd)
 	ptl = pmd_lock(mm, pmd);
 	if (!is_pmd_migration_entry(*pmd))
 		goto unlock;
-	page = migration_entry_to_page(pmd_to_swp_entry(*pmd));
+	page = pfn_swap_entry_to_page(pmd_to_swp_entry(*pmd));
 	if (!get_page_unless_zero(page))
 		goto unlock;
 	spin_unlock(ptl);
@@ -2443,7 +2443,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			if (!is_device_private_entry(entry))
 				goto next;
 
-			page = device_private_entry_to_page(entry);
+			page = pfn_swap_entry_to_page(entry);
 			if (!(migrate->flags &
 				MIGRATE_VMA_SELECT_DEVICE_PRIVATE) ||
 			    page->pgmap->owner != migrate->pgmap_owner)
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 86e3a3688d59..eed988ab2e81 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -96,7 +96,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
 		if (!is_migration_entry(entry))
 			return false;
 
-		pfn = migration_entry_to_pfn(entry);
+		pfn = swp_offset(entry);
 	} else if (is_swap_pte(*pvmw->pte)) {
 		swp_entry_t entry;
 
@@ -105,7 +105,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
 		if (!is_device_private_entry(entry))
 			return false;
 
-		pfn = device_private_entry_to_pfn(entry);
+		pfn = swp_offset(entry);
 	} else {
 		if (!pte_present(*pvmw->pte))
 			return false;
@@ -200,7 +200,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 				if (is_migration_entry(pmd_to_swp_entry(*pvmw->pmd))) {
 					swp_entry_t entry = pmd_to_swp_entry(*pvmw->pmd);
 
-					if (migration_entry_to_page(entry) != page)
+					if (pfn_swap_entry_to_page(entry) != page)
 						return not_found(pvmw);
 					return true;
 				}
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v8 2/8] mm/swapops: Rework swap entry manipulation code
  2021-04-07  8:42 [PATCH v8 0/8] Add support for SVM atomics in Nouveau Alistair Popple
  2021-04-07  8:42 ` [PATCH v8 1/8] mm: Remove special swap entry functions Alistair Popple
@ 2021-04-07  8:42 ` Alistair Popple
  2021-04-07  8:42 ` [PATCH v8 3/8] mm/rmap: Split try_to_munlock from try_to_unmap Alistair Popple
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 42+ messages in thread
From: Alistair Popple @ 2021-04-07  8:42 UTC (permalink / raw)
  To: linux-mm, nouveau, bskeggs, akpm
  Cc: Alistair Popple, linux-doc, linux-kernel, dri-devel, jhubbard,
	rcampbell, jglisse, jgg, hch, daniel, willy, bsingharora,
	Christoph Hellwig

Both migration and device private pages use special swap entries that
are manipluated by a range of inline functions. The arguments to these
are somewhat inconsitent so rework them to remove flag type arguments
and to make the arguments similar for both read and write entry
creation.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
---
 include/linux/swapops.h | 56 ++++++++++++++++++++++-------------------
 mm/debug_vm_pgtable.c   | 12 ++++-----
 mm/hmm.c                |  2 +-
 mm/huge_memory.c        | 26 +++++++++++++------
 mm/hugetlb.c            | 10 +++++---
 mm/memory.c             | 10 +++++---
 mm/migrate.c            | 26 ++++++++++++++-----
 mm/mprotect.c           | 10 +++++---
 mm/rmap.c               | 10 +++++---
 9 files changed, 100 insertions(+), 62 deletions(-)

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 139be8235ad2..4dfd807ae52a 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -100,35 +100,35 @@ static inline void *swp_to_radix_entry(swp_entry_t entry)
 }
 
 #if IS_ENABLED(CONFIG_DEVICE_PRIVATE)
-static inline swp_entry_t make_device_private_entry(struct page *page, bool write)
+static inline swp_entry_t make_readable_device_private_entry(pgoff_t offset)
 {
-	return swp_entry(write ? SWP_DEVICE_WRITE : SWP_DEVICE_READ,
-			 page_to_pfn(page));
+	return swp_entry(SWP_DEVICE_READ, offset);
 }
 
-static inline bool is_device_private_entry(swp_entry_t entry)
+static inline swp_entry_t make_writable_device_private_entry(pgoff_t offset)
 {
-	int type = swp_type(entry);
-	return type == SWP_DEVICE_READ || type == SWP_DEVICE_WRITE;
+	return swp_entry(SWP_DEVICE_WRITE, offset);
 }
 
-static inline void make_device_private_entry_read(swp_entry_t *entry)
+static inline bool is_device_private_entry(swp_entry_t entry)
 {
-	*entry = swp_entry(SWP_DEVICE_READ, swp_offset(*entry));
+	int type = swp_type(entry);
+	return type == SWP_DEVICE_READ || type == SWP_DEVICE_WRITE;
 }
 
-static inline bool is_write_device_private_entry(swp_entry_t entry)
+static inline bool is_writable_device_private_entry(swp_entry_t entry)
 {
 	return unlikely(swp_type(entry) == SWP_DEVICE_WRITE);
 }
 #else /* CONFIG_DEVICE_PRIVATE */
-static inline swp_entry_t make_device_private_entry(struct page *page, bool write)
+static inline swp_entry_t make_readable_device_private_entry(pgoff_t offset)
 {
 	return swp_entry(0, 0);
 }
 
-static inline void make_device_private_entry_read(swp_entry_t *entry)
+static inline swp_entry_t make_writable_device_private_entry(pgoff_t offset)
 {
+	return swp_entry(0, 0);
 }
 
 static inline bool is_device_private_entry(swp_entry_t entry)
@@ -136,35 +136,32 @@ static inline bool is_device_private_entry(swp_entry_t entry)
 	return false;
 }
 
-static inline bool is_write_device_private_entry(swp_entry_t entry)
+static inline bool is_writable_device_private_entry(swp_entry_t entry)
 {
 	return false;
 }
 #endif /* CONFIG_DEVICE_PRIVATE */
 
 #ifdef CONFIG_MIGRATION
-static inline swp_entry_t make_migration_entry(struct page *page, int write)
-{
-	BUG_ON(!PageLocked(compound_head(page)));
-
-	return swp_entry(write ? SWP_MIGRATION_WRITE : SWP_MIGRATION_READ,
-			page_to_pfn(page));
-}
-
 static inline int is_migration_entry(swp_entry_t entry)
 {
 	return unlikely(swp_type(entry) == SWP_MIGRATION_READ ||
 			swp_type(entry) == SWP_MIGRATION_WRITE);
 }
 
-static inline int is_write_migration_entry(swp_entry_t entry)
+static inline int is_writable_migration_entry(swp_entry_t entry)
 {
 	return unlikely(swp_type(entry) == SWP_MIGRATION_WRITE);
 }
 
-static inline void make_migration_entry_read(swp_entry_t *entry)
+static inline swp_entry_t make_readable_migration_entry(pgoff_t offset)
 {
-	*entry = swp_entry(SWP_MIGRATION_READ, swp_offset(*entry));
+	return swp_entry(SWP_MIGRATION_READ, offset);
+}
+
+static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
+{
+	return swp_entry(SWP_MIGRATION_WRITE, offset);
 }
 
 extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
@@ -174,21 +171,28 @@ extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 extern void migration_entry_wait_huge(struct vm_area_struct *vma,
 		struct mm_struct *mm, pte_t *pte);
 #else
+static inline swp_entry_t make_readable_migration_entry(pgoff_t offset)
+{
+	return swp_entry(0, 0);
+}
+
+static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
+{
+	return swp_entry(0, 0);
+}
 
-#define make_migration_entry(page, write) swp_entry(0, 0)
 static inline int is_migration_entry(swp_entry_t swp)
 {
 	return 0;
 }
 
-static inline void make_migration_entry_read(swp_entry_t *entryp) { }
 static inline void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
 					spinlock_t *ptl) { }
 static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 					 unsigned long address) { }
 static inline void migration_entry_wait_huge(struct vm_area_struct *vma,
 		struct mm_struct *mm, pte_t *pte) { }
-static inline int is_write_migration_entry(swp_entry_t entry)
+static inline int is_writable_migration_entry(swp_entry_t entry)
 {
 	return 0;
 }
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index a9bd6ce1ba02..3697a80b32f8 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -817,17 +817,17 @@ static void __init swap_migration_tests(void)
 	 * locked, otherwise it stumbles upon a BUG_ON().
 	 */
 	__SetPageLocked(page);
-	swp = make_migration_entry(page, 1);
+	swp = make_writable_migration_entry(page_to_pfn(page));
 	WARN_ON(!is_migration_entry(swp));
-	WARN_ON(!is_write_migration_entry(swp));
+	WARN_ON(!is_writable_migration_entry(swp));
 
-	make_migration_entry_read(&swp);
+	swp = make_readable_migration_entry(swp_offset(swp));
 	WARN_ON(!is_migration_entry(swp));
-	WARN_ON(is_write_migration_entry(swp));
+	WARN_ON(is_writable_migration_entry(swp));
 
-	swp = make_migration_entry(page, 0);
+	swp = make_readable_migration_entry(page_to_pfn(page));
 	WARN_ON(!is_migration_entry(swp));
-	WARN_ON(is_write_migration_entry(swp));
+	WARN_ON(is_writable_migration_entry(swp));
 	__ClearPageLocked(page);
 	__free_page(page);
 }
diff --git a/mm/hmm.c b/mm/hmm.c
index 3b2dda71d0ed..11df3ca30b82 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -255,7 +255,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 		 */
 		if (hmm_is_device_private_entry(range, entry)) {
 			cpu_flags = HMM_PFN_VALID;
-			if (is_write_device_private_entry(entry))
+			if (is_writable_device_private_entry(entry))
 				cpu_flags |= HMM_PFN_WRITE;
 			*hmm_pfn = swp_offset(entry) | cpu_flags;
 			return 0;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a4cda8564bcf..89af065cea5b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1051,8 +1051,9 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		swp_entry_t entry = pmd_to_swp_entry(pmd);
 
 		VM_BUG_ON(!is_pmd_migration_entry(pmd));
-		if (is_write_migration_entry(entry)) {
-			make_migration_entry_read(&entry);
+		if (is_writable_migration_entry(entry)) {
+			entry = make_readable_migration_entry(
+							swp_offset(entry));
 			pmd = swp_entry_to_pmd(entry);
 			if (pmd_swp_soft_dirty(*src_pmd))
 				pmd = pmd_swp_mksoft_dirty(pmd);
@@ -1825,13 +1826,14 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		swp_entry_t entry = pmd_to_swp_entry(*pmd);
 
 		VM_BUG_ON(!is_pmd_migration_entry(*pmd));
-		if (is_write_migration_entry(entry)) {
+		if (is_writable_migration_entry(entry)) {
 			pmd_t newpmd;
 			/*
 			 * A protection check is difficult so
 			 * just be safe and disable write
 			 */
-			make_migration_entry_read(&entry);
+			entry = make_readable_migration_entry(
+							swp_offset(entry));
 			newpmd = swp_entry_to_pmd(entry);
 			if (pmd_swp_soft_dirty(*pmd))
 				newpmd = pmd_swp_mksoft_dirty(newpmd);
@@ -2109,7 +2111,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 
 		entry = pmd_to_swp_entry(old_pmd);
 		page = pfn_swap_entry_to_page(entry);
-		write = is_write_migration_entry(entry);
+		write = is_writable_migration_entry(entry);
 		young = false;
 		soft_dirty = pmd_swp_soft_dirty(old_pmd);
 		uffd_wp = pmd_swp_uffd_wp(old_pmd);
@@ -2141,7 +2143,12 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		 */
 		if (freeze || pmd_migration) {
 			swp_entry_t swp_entry;
-			swp_entry = make_migration_entry(page + i, write);
+			if (write)
+				swp_entry = make_writable_migration_entry(
+							page_to_pfn(page + i));
+			else
+				swp_entry = make_readable_migration_entry(
+							page_to_pfn(page + i));
 			entry = swp_entry_to_pte(swp_entry);
 			if (soft_dirty)
 				entry = pte_swp_mksoft_dirty(entry);
@@ -2998,7 +3005,10 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
 	pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
 	if (pmd_dirty(pmdval))
 		set_page_dirty(page);
-	entry = make_migration_entry(page, pmd_write(pmdval));
+	if (pmd_write(pmdval))
+		entry = make_writable_migration_entry(page_to_pfn(page));
+	else
+		entry = make_readable_migration_entry(page_to_pfn(page));
 	pmdswp = swp_entry_to_pmd(entry);
 	if (pmd_soft_dirty(pmdval))
 		pmdswp = pmd_swp_mksoft_dirty(pmdswp);
@@ -3024,7 +3034,7 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 	pmde = pmd_mkold(mk_huge_pmd(new, vma->vm_page_prot));
 	if (pmd_swp_soft_dirty(*pvmw->pmd))
 		pmde = pmd_mksoft_dirty(pmde);
-	if (is_write_migration_entry(entry))
+	if (is_writable_migration_entry(entry))
 		pmde = maybe_pmd_mkwrite(pmde, vma);
 
 	flush_cache_range(vma, mmun_start, mmun_start + HPAGE_PMD_SIZE);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8fb42c6dd74b..59645169839b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3795,12 +3795,13 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 				    is_hugetlb_entry_hwpoisoned(entry))) {
 			swp_entry_t swp_entry = pte_to_swp_entry(entry);
 
-			if (is_write_migration_entry(swp_entry) && cow) {
+			if (is_writable_migration_entry(swp_entry) && cow) {
 				/*
 				 * COW mappings require pages in both
 				 * parent and child to be set to read.
 				 */
-				make_migration_entry_read(&swp_entry);
+				swp_entry = make_readable_migration_entry(
+							swp_offset(swp_entry));
 				entry = swp_entry_to_pte(swp_entry);
 				set_huge_swap_pte_at(src, addr, src_pte,
 						     entry, sz);
@@ -4970,10 +4971,11 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 		if (unlikely(is_hugetlb_entry_migration(pte))) {
 			swp_entry_t entry = pte_to_swp_entry(pte);
 
-			if (is_write_migration_entry(entry)) {
+			if (is_writable_migration_entry(entry)) {
 				pte_t newpte;
 
-				make_migration_entry_read(&entry);
+				entry = make_readable_migration_entry(
+							swp_offset(entry));
 				newpte = swp_entry_to_pte(entry);
 				set_huge_swap_pte_at(mm, address, ptep,
 						     newpte, huge_page_size(h));
diff --git a/mm/memory.c b/mm/memory.c
index 1c98e3c1c2de..3a5705cfc891 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -734,13 +734,14 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 
 		rss[mm_counter(page)]++;
 
-		if (is_write_migration_entry(entry) &&
+		if (is_writable_migration_entry(entry) &&
 				is_cow_mapping(vm_flags)) {
 			/*
 			 * COW mappings require pages in both
 			 * parent and child to be set to read.
 			 */
-			make_migration_entry_read(&entry);
+			entry = make_readable_migration_entry(
+							swp_offset(entry));
 			pte = swp_entry_to_pte(entry);
 			if (pte_swp_soft_dirty(*src_pte))
 				pte = pte_swp_mksoft_dirty(pte);
@@ -771,9 +772,10 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		 * when a device driver is involved (you cannot easily
 		 * save and restore device driver state).
 		 */
-		if (is_write_device_private_entry(entry) &&
+		if (is_writable_device_private_entry(entry) &&
 		    is_cow_mapping(vm_flags)) {
-			make_device_private_entry_read(&entry);
+			entry = make_readable_device_private_entry(
+							swp_offset(entry));
 			pte = swp_entry_to_pte(entry);
 			if (pte_swp_uffd_wp(*src_pte))
 				pte = pte_swp_mkuffd_wp(pte);
diff --git a/mm/migrate.c b/mm/migrate.c
index 600978d18750..b752543adb64 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -237,13 +237,18 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
 		 * Recheck VMA as permissions can change since migration started
 		 */
 		entry = pte_to_swp_entry(*pvmw.pte);
-		if (is_write_migration_entry(entry))
+		if (is_writable_migration_entry(entry))
 			pte = maybe_mkwrite(pte, vma);
 		else if (pte_swp_uffd_wp(*pvmw.pte))
 			pte = pte_mkuffd_wp(pte);
 
 		if (unlikely(is_device_private_page(new))) {
-			entry = make_device_private_entry(new, pte_write(pte));
+			if (pte_write(pte))
+				entry = make_writable_device_private_entry(
+							page_to_pfn(new));
+			else
+				entry = make_readable_device_private_entry(
+							page_to_pfn(new));
 			pte = swp_entry_to_pte(entry);
 			if (pte_swp_soft_dirty(*pvmw.pte))
 				pte = pte_swp_mksoft_dirty(pte);
@@ -2451,7 +2456,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 
 			mpfn = migrate_pfn(page_to_pfn(page)) |
 					MIGRATE_PFN_MIGRATE;
-			if (is_write_device_private_entry(entry))
+			if (is_writable_device_private_entry(entry))
 				mpfn |= MIGRATE_PFN_WRITE;
 		} else {
 			if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
@@ -2497,8 +2502,12 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			ptep_get_and_clear(mm, addr, ptep);
 
 			/* Setup special migration page table entry */
-			entry = make_migration_entry(page, mpfn &
-						     MIGRATE_PFN_WRITE);
+			if (mpfn & MIGRATE_PFN_WRITE)
+				entry = make_writable_migration_entry(
+							page_to_pfn(page));
+			else
+				entry = make_readable_migration_entry(
+							page_to_pfn(page));
 			swp_pte = swp_entry_to_pte(entry);
 			if (pte_present(pte)) {
 				if (pte_soft_dirty(pte))
@@ -2971,7 +2980,12 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 		if (is_device_private_page(page)) {
 			swp_entry_t swp_entry;
 
-			swp_entry = make_device_private_entry(page, vma->vm_flags & VM_WRITE);
+			if (vma->vm_flags & VM_WRITE)
+				swp_entry = make_writable_device_private_entry(
+							page_to_pfn(page));
+			else
+				swp_entry = make_readable_device_private_entry(
+							page_to_pfn(page));
 			entry = swp_entry_to_pte(swp_entry);
 		}
 	} else {
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 94188df1ee55..f21b760ec809 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -143,23 +143,25 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 			swp_entry_t entry = pte_to_swp_entry(oldpte);
 			pte_t newpte;
 
-			if (is_write_migration_entry(entry)) {
+			if (is_writable_migration_entry(entry)) {
 				/*
 				 * A protection check is difficult so
 				 * just be safe and disable write
 				 */
-				make_migration_entry_read(&entry);
+				entry = make_readable_migration_entry(
+							swp_offset(entry));
 				newpte = swp_entry_to_pte(entry);
 				if (pte_swp_soft_dirty(oldpte))
 					newpte = pte_swp_mksoft_dirty(newpte);
 				if (pte_swp_uffd_wp(oldpte))
 					newpte = pte_swp_mkuffd_wp(newpte);
-			} else if (is_write_device_private_entry(entry)) {
+			} else if (is_writable_device_private_entry(entry)) {
 				/*
 				 * We do not preserve soft-dirtiness. See
 				 * copy_one_pte() for explanation.
 				 */
-				make_device_private_entry_read(&entry);
+				entry = make_readable_device_private_entry(
+							swp_offset(entry));
 				newpte = swp_entry_to_pte(entry);
 				if (pte_swp_uffd_wp(oldpte))
 					newpte = pte_swp_mkuffd_wp(newpte);
diff --git a/mm/rmap.c b/mm/rmap.c
index b0fc27e77d6d..977e70803ed8 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1526,7 +1526,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			 * pte. do_swap_page() will wait until the migration
 			 * pte is removed and then restart fault handling.
 			 */
-			entry = make_migration_entry(page, 0);
+			entry = make_readable_migration_entry(page_to_pfn(page));
 			swp_pte = swp_entry_to_pte(entry);
 
 			/*
@@ -1622,8 +1622,12 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			 * pte. do_swap_page() will wait until the migration
 			 * pte is removed and then restart fault handling.
 			 */
-			entry = make_migration_entry(subpage,
-					pte_write(pteval));
+			if (pte_write(pteval))
+				entry = make_writable_migration_entry(
+							page_to_pfn(subpage));
+			else
+				entry = make_readable_migration_entry(
+							page_to_pfn(subpage));
 			swp_pte = swp_entry_to_pte(entry);
 			if (pte_soft_dirty(pteval))
 				swp_pte = pte_swp_mksoft_dirty(swp_pte);
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v8 3/8] mm/rmap: Split try_to_munlock from try_to_unmap
  2021-04-07  8:42 [PATCH v8 0/8] Add support for SVM atomics in Nouveau Alistair Popple
  2021-04-07  8:42 ` [PATCH v8 1/8] mm: Remove special swap entry functions Alistair Popple
  2021-04-07  8:42 ` [PATCH v8 2/8] mm/swapops: Rework swap entry manipulation code Alistair Popple
@ 2021-04-07  8:42 ` Alistair Popple
  2021-05-18 20:04   ` Liam Howlett
  2021-04-07  8:42 ` [PATCH v8 4/8] mm/rmap: Split migration into its own function Alistair Popple
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 42+ messages in thread
From: Alistair Popple @ 2021-04-07  8:42 UTC (permalink / raw)
  To: linux-mm, nouveau, bskeggs, akpm
  Cc: Alistair Popple, linux-doc, linux-kernel, dri-devel, jhubbard,
	rcampbell, jglisse, jgg, hch, daniel, willy, bsingharora,
	Christoph Hellwig

The behaviour of try_to_unmap_one() is difficult to follow because it
performs different operations based on a fairly large set of flags used
in different combinations.

TTU_MUNLOCK is one such flag. However it is exclusively used by
try_to_munlock() which specifies no other flags. Therefore rather than
overload try_to_unmap_one() with unrelated behaviour split this out into
it's own function and remove the flag.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>

---

v8:
* Renamed try_to_munlock to page_mlock to better reflect what the
  function actually does.
* Removed the TODO from the documentation that this patch addresses.

v7:
* Added Christoph's Reviewed-by

v4:
* Removed redundant check for VM_LOCKED
---
 Documentation/vm/unevictable-lru.rst | 33 ++++++++-----------
 include/linux/rmap.h                 |  3 +-
 mm/mlock.c                           | 10 +++---
 mm/rmap.c                            | 48 +++++++++++++++++++++-------
 4 files changed, 55 insertions(+), 39 deletions(-)

diff --git a/Documentation/vm/unevictable-lru.rst b/Documentation/vm/unevictable-lru.rst
index 0e1490524f53..eae3af17f2d9 100644
--- a/Documentation/vm/unevictable-lru.rst
+++ b/Documentation/vm/unevictable-lru.rst
@@ -389,14 +389,14 @@ mlocked, munlock_vma_page() updates that zone statistics for the number of
 mlocked pages.  Note, however, that at this point we haven't checked whether
 the page is mapped by other VM_LOCKED VMAs.
 
-We can't call try_to_munlock(), the function that walks the reverse map to
+We can't call page_mlock(), the function that walks the reverse map to
 check for other VM_LOCKED VMAs, without first isolating the page from the LRU.
-try_to_munlock() is a variant of try_to_unmap() and thus requires that the page
+page_mlock() is a variant of try_to_unmap() and thus requires that the page
 not be on an LRU list [more on these below].  However, the call to
-isolate_lru_page() could fail, in which case we couldn't try_to_munlock().  So,
+isolate_lru_page() could fail, in which case we can't call page_mlock().  So,
 we go ahead and clear PG_mlocked up front, as this might be the only chance we
-have.  If we can successfully isolate the page, we go ahead and
-try_to_munlock(), which will restore the PG_mlocked flag and update the zone
+have.  If we can successfully isolate the page, we go ahead and call
+page_mlock(), which will restore the PG_mlocked flag and update the zone
 page statistics if it finds another VMA holding the page mlocked.  If we fail
 to isolate the page, we'll have left a potentially mlocked page on the LRU.
 This is fine, because we'll catch it later if and if vmscan tries to reclaim
@@ -545,31 +545,24 @@ munlock or munmap system calls, mm teardown (munlock_vma_pages_all), reclaim,
 holepunching, and truncation of file pages and their anonymous COWed pages.
 
 
-try_to_munlock() Reverse Map Scan
+page_mlock() Reverse Map Scan
 ---------------------------------
 
-.. warning::
-   [!] TODO/FIXME: a better name might be page_mlocked() - analogous to the
-   page_referenced() reverse map walker.
-
 When munlock_vma_page() [see section :ref:`munlock()/munlockall() System Call
 Handling <munlock_munlockall_handling>` above] tries to munlock a
 page, it needs to determine whether or not the page is mapped by any
 VM_LOCKED VMA without actually attempting to unmap all PTEs from the
 page.  For this purpose, the unevictable/mlock infrastructure
-introduced a variant of try_to_unmap() called try_to_munlock().
+introduced a variant of try_to_unmap() called page_mlock().
 
-try_to_munlock() calls the same functions as try_to_unmap() for anonymous and
-mapped file and KSM pages with a flag argument specifying unlock versus unmap
-processing.  Again, these functions walk the respective reverse maps looking
-for VM_LOCKED VMAs.  When such a VMA is found, as in the try_to_unmap() case,
-the functions mlock the page via mlock_vma_page() and return SWAP_MLOCK.  This
-undoes the pre-clearing of the page's PG_mlocked done by munlock_vma_page.
+page_mlock() walks the respective reverse maps looking for VM_LOCKED VMAs. When
+such a VMA is found the page is mlocked via mlock_vma_page(). This undoes the
+pre-clearing of the page's PG_mlocked done by munlock_vma_page.
 
-Note that try_to_munlock()'s reverse map walk must visit every VMA in a page's
+Note that page_mlock()'s reverse map walk must visit every VMA in a page's
 reverse map to determine that a page is NOT mapped into any VM_LOCKED VMA.
 However, the scan can terminate when it encounters a VM_LOCKED VMA.
-Although try_to_munlock() might be called a great many times when munlocking a
+Although page_mlock() might be called a great many times when munlocking a
 large region or tearing down a large address space that has been mlocked via
 mlockall(), overall this is a fairly rare event.
 
@@ -602,7 +595,7 @@ inactive lists to the appropriate node's unevictable list.
 shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
 after shrink_active_list() had moved them to the inactive list, or pages mapped
 into VM_LOCKED VMAs that munlock_vma_page() couldn't isolate from the LRU to
-recheck via try_to_munlock().  shrink_inactive_list() won't notice the latter,
+recheck via page_mlock().  shrink_inactive_list() won't notice the latter,
 but will pass on to shrink_page_list().
 
 shrink_page_list() again culls obviously unevictable pages that it could
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index def5c62c93b3..38a746787c2f 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -87,7 +87,6 @@ struct anon_vma_chain {
 
 enum ttu_flags {
 	TTU_MIGRATION		= 0x1,	/* migration mode */
-	TTU_MUNLOCK		= 0x2,	/* munlock mode */
 
 	TTU_SPLIT_HUGE_PMD	= 0x4,	/* split huge PMD if any */
 	TTU_IGNORE_MLOCK	= 0x8,	/* ignore mlock */
@@ -239,7 +238,7 @@ int page_mkclean(struct page *);
  * called in munlock()/munmap() path to check for other vmas holding
  * the page mlocked.
  */
-void try_to_munlock(struct page *);
+void page_mlock(struct page *page);
 
 void remove_migration_ptes(struct page *old, struct page *new, bool locked);
 
diff --git a/mm/mlock.c b/mm/mlock.c
index f8f8cc32d03d..9b8b82cfbbff 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -108,7 +108,7 @@ void mlock_vma_page(struct page *page)
 /*
  * Finish munlock after successful page isolation
  *
- * Page must be locked. This is a wrapper for try_to_munlock()
+ * Page must be locked. This is a wrapper for page_mlock()
  * and putback_lru_page() with munlock accounting.
  */
 static void __munlock_isolated_page(struct page *page)
@@ -118,7 +118,7 @@ static void __munlock_isolated_page(struct page *page)
 	 * and we don't need to check all the other vmas.
 	 */
 	if (page_mapcount(page) > 1)
-		try_to_munlock(page);
+		page_mlock(page);
 
 	/* Did try_to_unlock() succeed or punt? */
 	if (!PageMlocked(page))
@@ -158,7 +158,7 @@ static void __munlock_isolation_failed(struct page *page)
  * munlock()ed or munmap()ed, we want to check whether other vmas hold the
  * page locked so that we can leave it on the unevictable lru list and not
  * bother vmscan with it.  However, to walk the page's rmap list in
- * try_to_munlock() we must isolate the page from the LRU.  If some other
+ * page_mlock() we must isolate the page from the LRU.  If some other
  * task has removed the page from the LRU, we won't be able to do that.
  * So we clear the PageMlocked as we might not get another chance.  If we
  * can't isolate the page, we leave it for putback_lru_page() and vmscan
@@ -168,7 +168,7 @@ unsigned int munlock_vma_page(struct page *page)
 {
 	int nr_pages;
 
-	/* For try_to_munlock() and to serialize with page migration */
+	/* For page_mlock() and to serialize with page migration */
 	BUG_ON(!PageLocked(page));
 	VM_BUG_ON_PAGE(PageTail(page), page);
 
@@ -205,7 +205,7 @@ static int __mlock_posix_error_return(long retval)
  *
  * The fast path is available only for evictable pages with single mapping.
  * Then we can bypass the per-cpu pvec and get better performance.
- * when mapcount > 1 we need try_to_munlock() which can fail.
+ * when mapcount > 1 we need page_mlock() which can fail.
  * when !page_evictable(), we need the full redo logic of putback_lru_page to
  * avoid leaving evictable page in unevictable list.
  *
diff --git a/mm/rmap.c b/mm/rmap.c
index 977e70803ed8..f09d522725b9 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1405,10 +1405,6 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	struct mmu_notifier_range range;
 	enum ttu_flags flags = (enum ttu_flags)(long)arg;
 
-	/* munlock has nothing to gain from examining un-locked vmas */
-	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
-		return true;
-
 	if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
 	    is_zone_device_page(page) && !is_device_private_page(page))
 		return true;
@@ -1469,8 +1465,6 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 				page_vma_mapped_walk_done(&pvmw);
 				break;
 			}
-			if (flags & TTU_MUNLOCK)
-				continue;
 		}
 
 		/* Unexpected PMD-mapped THP? */
@@ -1784,8 +1778,39 @@ bool try_to_unmap(struct page *page, enum ttu_flags flags)
 	return !page_mapcount(page) ? true : false;
 }
 
+static bool page_mlock_one(struct page *page, struct vm_area_struct *vma,
+				 unsigned long address, void *arg)
+{
+	struct page_vma_mapped_walk pvmw = {
+		.page = page,
+		.vma = vma,
+		.address = address,
+	};
+
+	/* munlock has nothing to gain from examining un-locked vmas */
+	if (!(vma->vm_flags & VM_LOCKED))
+		return true;
+
+	while (page_vma_mapped_walk(&pvmw)) {
+		/* PTE-mapped THP are never mlocked */
+		if (!PageTransCompound(page)) {
+			/*
+			 * Holding pte lock, we do *not* need
+			 * mmap_lock here
+			 */
+			mlock_vma_page(page);
+		}
+		page_vma_mapped_walk_done(&pvmw);
+
+		/* found a mlocked page, no point continuing munlock check */
+		return false;
+	}
+
+	return true;
+}
+
 /**
- * try_to_munlock - try to munlock a page
+ * page_mlock - try to munlock a page
  * @page: the page to be munlocked
  *
  * Called from munlock code.  Checks all of the VMAs mapping the page
@@ -1793,11 +1818,10 @@ bool try_to_unmap(struct page *page, enum ttu_flags flags)
  * returned with PG_mlocked cleared if no other vmas have it mlocked.
  */
 
-void try_to_munlock(struct page *page)
+void page_mlock(struct page *page)
 {
 	struct rmap_walk_control rwc = {
-		.rmap_one = try_to_unmap_one,
-		.arg = (void *)TTU_MUNLOCK,
+		.rmap_one = page_mlock_one,
 		.done = page_not_mapped,
 		.anon_lock = page_lock_anon_vma_read,
 
@@ -1849,7 +1873,7 @@ static struct anon_vma *rmap_walk_anon_lock(struct page *page,
  * Find all the mappings of a page using the mapping pointer and the vma chains
  * contained in the anon_vma struct it points to.
  *
- * When called from try_to_munlock(), the mmap_lock of the mm containing the vma
+ * When called from page_mlock(), the mmap_lock of the mm containing the vma
  * where the page was found will be held for write.  So, we won't recheck
  * vm_flags for that VMA.  That should be OK, because that vma shouldn't be
  * LOCKED.
@@ -1901,7 +1925,7 @@ static void rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc,
  * Find all the mappings of a page using the mapping pointer and the vma chains
  * contained in the address_space struct it points to.
  *
- * When called from try_to_munlock(), the mmap_lock of the mm containing the vma
+ * When called from page_mlock(), the mmap_lock of the mm containing the vma
  * where the page was found will be held for write.  So, we won't recheck
  * vm_flags for that VMA.  That should be OK, because that vma shouldn't be
  * LOCKED.
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v8 4/8] mm/rmap: Split migration into its own function
  2021-04-07  8:42 [PATCH v8 0/8] Add support for SVM atomics in Nouveau Alistair Popple
                   ` (2 preceding siblings ...)
  2021-04-07  8:42 ` [PATCH v8 3/8] mm/rmap: Split try_to_munlock from try_to_unmap Alistair Popple
@ 2021-04-07  8:42 ` Alistair Popple
  2021-04-07  8:42 ` [PATCH v8 5/8] mm: Device exclusive memory access Alistair Popple
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 42+ messages in thread
From: Alistair Popple @ 2021-04-07  8:42 UTC (permalink / raw)
  To: linux-mm, nouveau, bskeggs, akpm
  Cc: Alistair Popple, linux-doc, linux-kernel, dri-devel, jhubbard,
	rcampbell, jglisse, jgg, hch, daniel, willy, bsingharora,
	Christoph Hellwig

Migration is currently implemented as a mode of operation for
try_to_unmap_one() generally specified by passing the TTU_MIGRATION flag
or in the case of splitting a huge anonymous page TTU_SPLIT_FREEZE.

However it does not have much in common with the rest of the unmap
functionality of try_to_unmap_one() and thus splitting it into a
separate function reduces the complexity of try_to_unmap_one() making it
more readable.

Several simplifications can also be made in try_to_migrate_one() based
on the following observations:

 - All users of TTU_MIGRATION also set TTU_IGNORE_MLOCK.
 - No users of TTU_MIGRATION ever set TTU_IGNORE_HWPOISON.
 - No users of TTU_MIGRATION ever set TTU_BATCH_FLUSH.

TTU_SPLIT_FREEZE is a special case of migration used when splitting an
anonymous page. This is most easily dealt with by calling the correct
function from unmap_page() in mm/huge_memory.c  - either
try_to_migrate() for PageAnon or try_to_unmap().

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>

---

v5:
* Added comments about how PMD splitting works for migration vs.
  unmapping
* Tightened up the flag check in try_to_migrate() to be explicit about
  which TTU_XXX flags are supported.
---
 include/linux/rmap.h |   4 +-
 mm/huge_memory.c     |  15 +-
 mm/migrate.c         |   9 +-
 mm/rmap.c            | 358 ++++++++++++++++++++++++++++++++-----------
 4 files changed, 280 insertions(+), 106 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 38a746787c2f..0e25d829f742 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -86,8 +86,6 @@ struct anon_vma_chain {
 };
 
 enum ttu_flags {
-	TTU_MIGRATION		= 0x1,	/* migration mode */
-
 	TTU_SPLIT_HUGE_PMD	= 0x4,	/* split huge PMD if any */
 	TTU_IGNORE_MLOCK	= 0x8,	/* ignore mlock */
 	TTU_IGNORE_HWPOISON	= 0x20,	/* corrupted page is recoverable */
@@ -96,7 +94,6 @@ enum ttu_flags {
 					 * do a final flush if necessary */
 	TTU_RMAP_LOCKED		= 0x80,	/* do not grab rmap lock:
 					 * caller holds it */
-	TTU_SPLIT_FREEZE	= 0x100,		/* freeze pte under splitting thp */
 };
 
 #ifdef CONFIG_MMU
@@ -193,6 +190,7 @@ static inline void page_dup_rmap(struct page *page, bool compound)
 int page_referenced(struct page *, int is_locked,
 			struct mem_cgroup *memcg, unsigned long *vm_flags);
 
+bool try_to_migrate(struct page *page, enum ttu_flags flags);
 bool try_to_unmap(struct page *, enum ttu_flags flags);
 
 /* Avoid racy checks */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 89af065cea5b..eab004331b97 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2357,16 +2357,21 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,
 
 static void unmap_page(struct page *page)
 {
-	enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK |
-		TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
+	enum ttu_flags ttu_flags = TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
 	bool unmap_success;
 
 	VM_BUG_ON_PAGE(!PageHead(page), page);
 
 	if (PageAnon(page))
-		ttu_flags |= TTU_SPLIT_FREEZE;
-
-	unmap_success = try_to_unmap(page, ttu_flags);
+		unmap_success = try_to_migrate(page, ttu_flags);
+	else
+		/*
+		 * Don't install migration entries for file backed pages. This
+		 * helps handle cases when i_size is in the middle of the page
+		 * as there is no need to unmap pages beyond i_size manually.
+		 */
+		unmap_success = try_to_unmap(page, ttu_flags |
+						TTU_IGNORE_MLOCK);
 	VM_BUG_ON_PAGE(!unmap_success, page);
 }
 
diff --git a/mm/migrate.c b/mm/migrate.c
index b752543adb64..cc4612e2a246 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1130,7 +1130,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 		/* Establish migration ptes */
 		VM_BUG_ON_PAGE(PageAnon(page) && !PageKsm(page) && !anon_vma,
 				page);
-		try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK);
+		try_to_migrate(page, 0);
 		page_was_mapped = 1;
 	}
 
@@ -1332,7 +1332,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 
 	if (page_mapped(hpage)) {
 		bool mapping_locked = false;
-		enum ttu_flags ttu = TTU_MIGRATION|TTU_IGNORE_MLOCK;
+		enum ttu_flags ttu = 0;
 
 		if (!PageAnon(hpage)) {
 			/*
@@ -1349,7 +1349,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 			ttu |= TTU_RMAP_LOCKED;
 		}
 
-		try_to_unmap(hpage, ttu);
+		try_to_migrate(hpage, ttu);
 		page_was_mapped = 1;
 
 		if (mapping_locked)
@@ -2756,7 +2756,6 @@ static void migrate_vma_prepare(struct migrate_vma *migrate)
  */
 static void migrate_vma_unmap(struct migrate_vma *migrate)
 {
-	int flags = TTU_MIGRATION | TTU_IGNORE_MLOCK;
 	const unsigned long npages = migrate->npages;
 	const unsigned long start = migrate->start;
 	unsigned long addr, i, restore = 0;
@@ -2768,7 +2767,7 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
 			continue;
 
 		if (page_mapped(page)) {
-			try_to_unmap(page, flags);
+			try_to_migrate(page, 0);
 			if (page_mapped(page))
 				goto restore;
 		}
diff --git a/mm/rmap.c b/mm/rmap.c
index f09d522725b9..7f91f058f1f5 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1405,14 +1405,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	struct mmu_notifier_range range;
 	enum ttu_flags flags = (enum ttu_flags)(long)arg;
 
-	if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
-	    is_zone_device_page(page) && !is_device_private_page(page))
-		return true;
-
-	if (flags & TTU_SPLIT_HUGE_PMD) {
-		split_huge_pmd_address(vma, address,
-				flags & TTU_SPLIT_FREEZE, page);
-	}
+	if (flags & TTU_SPLIT_HUGE_PMD)
+		split_huge_pmd_address(vma, address, false, page);
 
 	/*
 	 * For THP, we have to assume the worse case ie pmd for invalidation.
@@ -1436,16 +1430,6 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	mmu_notifier_invalidate_range_start(&range);
 
 	while (page_vma_mapped_walk(&pvmw)) {
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
-		/* PMD-mapped THP migration entry */
-		if (!pvmw.pte && (flags & TTU_MIGRATION)) {
-			VM_BUG_ON_PAGE(PageHuge(page) || !PageTransCompound(page), page);
-
-			set_pmd_migration_entry(&pvmw, page);
-			continue;
-		}
-#endif
-
 		/*
 		 * If the page is mlock()d, we cannot swap it out.
 		 * If it's recently referenced (perhaps page_referenced
@@ -1507,46 +1491,6 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			}
 		}
 
-		if (IS_ENABLED(CONFIG_MIGRATION) &&
-		    (flags & TTU_MIGRATION) &&
-		    is_zone_device_page(page)) {
-			swp_entry_t entry;
-			pte_t swp_pte;
-
-			pteval = ptep_get_and_clear(mm, pvmw.address, pvmw.pte);
-
-			/*
-			 * Store the pfn of the page in a special migration
-			 * pte. do_swap_page() will wait until the migration
-			 * pte is removed and then restart fault handling.
-			 */
-			entry = make_readable_migration_entry(page_to_pfn(page));
-			swp_pte = swp_entry_to_pte(entry);
-
-			/*
-			 * pteval maps a zone device page and is therefore
-			 * a swap pte.
-			 */
-			if (pte_swp_soft_dirty(pteval))
-				swp_pte = pte_swp_mksoft_dirty(swp_pte);
-			if (pte_swp_uffd_wp(pteval))
-				swp_pte = pte_swp_mkuffd_wp(swp_pte);
-			set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
-			/*
-			 * No need to invalidate here it will synchronize on
-			 * against the special swap migration pte.
-			 *
-			 * The assignment to subpage above was computed from a
-			 * swap PTE which results in an invalid pointer.
-			 * Since only PAGE_SIZE pages can currently be
-			 * migrated, just set it to page. This will need to be
-			 * changed when hugepage migrations to device private
-			 * memory are supported.
-			 */
-			subpage = page;
-			goto discard;
-		}
-
 		/* Nuke the page table entry. */
 		flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
 		if (should_defer_flush(mm, flags)) {
@@ -1599,39 +1543,6 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			/* We have to invalidate as we cleared the pte */
 			mmu_notifier_invalidate_range(mm, address,
 						      address + PAGE_SIZE);
-		} else if (IS_ENABLED(CONFIG_MIGRATION) &&
-				(flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) {
-			swp_entry_t entry;
-			pte_t swp_pte;
-
-			if (arch_unmap_one(mm, vma, address, pteval) < 0) {
-				set_pte_at(mm, address, pvmw.pte, pteval);
-				ret = false;
-				page_vma_mapped_walk_done(&pvmw);
-				break;
-			}
-
-			/*
-			 * Store the pfn of the page in a special migration
-			 * pte. do_swap_page() will wait until the migration
-			 * pte is removed and then restart fault handling.
-			 */
-			if (pte_write(pteval))
-				entry = make_writable_migration_entry(
-							page_to_pfn(subpage));
-			else
-				entry = make_readable_migration_entry(
-							page_to_pfn(subpage));
-			swp_pte = swp_entry_to_pte(entry);
-			if (pte_soft_dirty(pteval))
-				swp_pte = pte_swp_mksoft_dirty(swp_pte);
-			if (pte_uffd_wp(pteval))
-				swp_pte = pte_swp_mkuffd_wp(swp_pte);
-			set_pte_at(mm, address, pvmw.pte, swp_pte);
-			/*
-			 * No need to invalidate here it will synchronize on
-			 * against the special swap migration pte.
-			 */
 		} else if (PageAnon(page)) {
 			swp_entry_t entry = { .val = page_private(subpage) };
 			pte_t swp_pte;
@@ -1758,6 +1669,268 @@ bool try_to_unmap(struct page *page, enum ttu_flags flags)
 		.anon_lock = page_lock_anon_vma_read,
 	};
 
+	if (flags & TTU_RMAP_LOCKED)
+		rmap_walk_locked(page, &rwc);
+	else
+		rmap_walk(page, &rwc);
+
+	return !page_mapcount(page) ? true : false;
+}
+
+/*
+ * @arg: enum ttu_flags will be passed to this argument.
+ *
+ * If TTU_SPLIT_HUGE_PMD is specified any PMD mappings will be split into PTEs
+ * containing migration entries. This and TTU_RMAP_LOCKED are the only supported
+ * flags.
+ */
+static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
+		     unsigned long address, void *arg)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct page_vma_mapped_walk pvmw = {
+		.page = page,
+		.vma = vma,
+		.address = address,
+	};
+	pte_t pteval;
+	struct page *subpage;
+	bool ret = true;
+	struct mmu_notifier_range range;
+	enum ttu_flags flags = (enum ttu_flags)(long)arg;
+
+	if (is_zone_device_page(page) && !is_device_private_page(page))
+		return true;
+
+	/*
+	 * unmap_page() in mm/huge_memory.c is the only user of migration with
+	 * TTU_SPLIT_HUGE_PMD and it wants to freeze.
+	 */
+	if (flags & TTU_SPLIT_HUGE_PMD)
+		split_huge_pmd_address(vma, address, true, page);
+
+	/*
+	 * For THP, we have to assume the worse case ie pmd for invalidation.
+	 * For hugetlb, it could be much worse if we need to do pud
+	 * invalidation in the case of pmd sharing.
+	 *
+	 * Note that the page can not be free in this function as call of
+	 * try_to_unmap() must hold a reference on the page.
+	 */
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
+				address,
+				min(vma->vm_end, address + page_size(page)));
+	if (PageHuge(page)) {
+		/*
+		 * If sharing is possible, start and end will be adjusted
+		 * accordingly.
+		 */
+		adjust_range_if_pmd_sharing_possible(vma, &range.start,
+						     &range.end);
+	}
+	mmu_notifier_invalidate_range_start(&range);
+
+	while (page_vma_mapped_walk(&pvmw)) {
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+		/* PMD-mapped THP migration entry */
+		if (!pvmw.pte) {
+			VM_BUG_ON_PAGE(PageHuge(page) ||
+				       !PageTransCompound(page), page);
+
+			set_pmd_migration_entry(&pvmw, page);
+			continue;
+		}
+#endif
+
+		/* Unexpected PMD-mapped THP? */
+		VM_BUG_ON_PAGE(!pvmw.pte, page);
+
+		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
+		address = pvmw.address;
+
+		if (PageHuge(page) && !PageAnon(page)) {
+			/*
+			 * To call huge_pmd_unshare, i_mmap_rwsem must be
+			 * held in write mode.  Caller needs to explicitly
+			 * do this outside rmap routines.
+			 */
+			VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
+			if (huge_pmd_unshare(mm, vma, &address, pvmw.pte)) {
+				/*
+				 * huge_pmd_unshare unmapped an entire PMD
+				 * page.  There is no way of knowing exactly
+				 * which PMDs may be cached for this mm, so
+				 * we must flush them all.  start/end were
+				 * already adjusted above to cover this range.
+				 */
+				flush_cache_range(vma, range.start, range.end);
+				flush_tlb_range(vma, range.start, range.end);
+				mmu_notifier_invalidate_range(mm, range.start,
+							      range.end);
+
+				/*
+				 * The ref count of the PMD page was dropped
+				 * which is part of the way map counting
+				 * is done for shared PMDs.  Return 'true'
+				 * here.  When there is no other sharing,
+				 * huge_pmd_unshare returns false and we will
+				 * unmap the actual page and drop map count
+				 * to zero.
+				 */
+				page_vma_mapped_walk_done(&pvmw);
+				break;
+			}
+		}
+
+		/* Nuke the page table entry. */
+		flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
+		pteval = ptep_clear_flush(vma, address, pvmw.pte);
+
+		/* Move the dirty bit to the page. Now the pte is gone. */
+		if (pte_dirty(pteval))
+			set_page_dirty(page);
+
+		/* Update high watermark before we lower rss */
+		update_hiwater_rss(mm);
+
+		if (is_zone_device_page(page)) {
+			swp_entry_t entry;
+			pte_t swp_pte;
+
+			/*
+			 * Store the pfn of the page in a special migration
+			 * pte. do_swap_page() will wait until the migration
+			 * pte is removed and then restart fault handling.
+			 */
+			entry = make_readable_migration_entry(
+							page_to_pfn(page));
+			swp_pte = swp_entry_to_pte(entry);
+
+			/*
+			 * pteval maps a zone device page and is therefore
+			 * a swap pte.
+			 */
+			if (pte_swp_soft_dirty(pteval))
+				swp_pte = pte_swp_mksoft_dirty(swp_pte);
+			if (pte_swp_uffd_wp(pteval))
+				swp_pte = pte_swp_mkuffd_wp(swp_pte);
+			set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
+			/*
+			 * No need to invalidate here it will synchronize on
+			 * against the special swap migration pte.
+			 *
+			 * The assignment to subpage above was computed from a
+			 * swap PTE which results in an invalid pointer.
+			 * Since only PAGE_SIZE pages can currently be
+			 * migrated, just set it to page. This will need to be
+			 * changed when hugepage migrations to device private
+			 * memory are supported.
+			 */
+			subpage = page;
+		} else if (PageHWPoison(page)) {
+			pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
+			if (PageHuge(page)) {
+				hugetlb_count_sub(compound_nr(page), mm);
+				set_huge_swap_pte_at(mm, address,
+						     pvmw.pte, pteval,
+						     vma_mmu_pagesize(vma));
+			} else {
+				dec_mm_counter(mm, mm_counter(page));
+				set_pte_at(mm, address, pvmw.pte, pteval);
+			}
+
+		} else if (pte_unused(pteval) && !userfaultfd_armed(vma)) {
+			/*
+			 * The guest indicated that the page content is of no
+			 * interest anymore. Simply discard the pte, vmscan
+			 * will take care of the rest.
+			 * A future reference will then fault in a new zero
+			 * page. When userfaultfd is active, we must not drop
+			 * this page though, as its main user (postcopy
+			 * migration) will not expect userfaults on already
+			 * copied pages.
+			 */
+			dec_mm_counter(mm, mm_counter(page));
+			/* We have to invalidate as we cleared the pte */
+			mmu_notifier_invalidate_range(mm, address,
+						      address + PAGE_SIZE);
+		} else {
+			swp_entry_t entry;
+			pte_t swp_pte;
+
+			if (arch_unmap_one(mm, vma, address, pteval) < 0) {
+				set_pte_at(mm, address, pvmw.pte, pteval);
+				ret = false;
+				page_vma_mapped_walk_done(&pvmw);
+				break;
+			}
+
+			/*
+			 * Store the pfn of the page in a special migration
+			 * pte. do_swap_page() will wait until the migration
+			 * pte is removed and then restart fault handling.
+			 */
+			if (pte_write(pteval))
+				entry = make_writable_migration_entry(
+							page_to_pfn(subpage));
+			else
+				entry = make_readable_migration_entry(
+							page_to_pfn(subpage));
+
+			swp_pte = swp_entry_to_pte(entry);
+			if (pte_soft_dirty(pteval))
+				swp_pte = pte_swp_mksoft_dirty(swp_pte);
+			if (pte_uffd_wp(pteval))
+				swp_pte = pte_swp_mkuffd_wp(swp_pte);
+			set_pte_at(mm, address, pvmw.pte, swp_pte);
+			/*
+			 * No need to invalidate here it will synchronize on
+			 * against the special swap migration pte.
+			 */
+		}
+
+		/*
+		 * No need to call mmu_notifier_invalidate_range() it has be
+		 * done above for all cases requiring it to happen under page
+		 * table lock before mmu_notifier_invalidate_range_end()
+		 *
+		 * See Documentation/vm/mmu_notifier.rst
+		 */
+		page_remove_rmap(subpage, PageHuge(page));
+		put_page(page);
+	}
+
+	mmu_notifier_invalidate_range_end(&range);
+
+	return ret;
+}
+
+/**
+ * try_to_migrate - try to replace all page table mappings with swap entries
+ * @page: the page to replace page table entries for
+ * @flags: action and flags
+ *
+ * Tries to remove all the page table entries which are mapping this page and
+ * replace them with special swap entries. Caller must hold the page lock.
+ *
+ * If is successful, return true. Otherwise, false.
+ */
+bool try_to_migrate(struct page *page, enum ttu_flags flags)
+{
+	struct rmap_walk_control rwc = {
+		.rmap_one = try_to_migrate_one,
+		.arg = (void *)flags,
+		.done = page_not_mapped,
+		.anon_lock = page_lock_anon_vma_read,
+	};
+
+	/*
+	 * Migration always ignores mlock and only supports TTU_RMAP_LOCKED and
+	 * TTU_SPLIT_HUGE_PMD flags.
+	 */
+	if (WARN_ON_ONCE(flags & ~(TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD)))
+		return false;
+
 	/*
 	 * During exec, a temporary VMA is setup and later moved.
 	 * The VMA is moved under the anon_vma lock but not the
@@ -1766,8 +1939,7 @@ bool try_to_unmap(struct page *page, enum ttu_flags flags)
 	 * locking requirements of exec(), migration skips
 	 * temporary VMAs until after exec() completes.
 	 */
-	if ((flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))
-	    && !PageKsm(page) && PageAnon(page))
+	if (!PageKsm(page) && PageAnon(page))
 		rwc.invalid_vma = invalid_migration_vma;
 
 	if (flags & TTU_RMAP_LOCKED)
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v8 5/8] mm: Device exclusive memory access
  2021-04-07  8:42 [PATCH v8 0/8] Add support for SVM atomics in Nouveau Alistair Popple
                   ` (3 preceding siblings ...)
  2021-04-07  8:42 ` [PATCH v8 4/8] mm/rmap: Split migration into its own function Alistair Popple
@ 2021-04-07  8:42 ` Alistair Popple
  2021-05-18  2:08   ` Peter Xu
  2021-05-18 21:16   ` Peter Xu
  2021-04-07  8:42 ` [PATCH v8 6/8] mm: Selftests for exclusive device memory Alistair Popple
                   ` (3 subsequent siblings)
  8 siblings, 2 replies; 42+ messages in thread
From: Alistair Popple @ 2021-04-07  8:42 UTC (permalink / raw)
  To: linux-mm, nouveau, bskeggs, akpm
  Cc: Alistair Popple, linux-doc, linux-kernel, dri-devel, jhubbard,
	rcampbell, jglisse, jgg, hch, daniel, willy, bsingharora,
	Christoph Hellwig

Some devices require exclusive write access to shared virtual
memory (SVM) ranges to perform atomic operations on that memory. This
requires CPU page tables to be updated to deny access whilst atomic
operations are occurring.

In order to do this introduce a new swap entry
type (SWP_DEVICE_EXCLUSIVE). When a SVM range needs to be marked for
exclusive access by a device all page table mappings for the particular
range are replaced with device exclusive swap entries. This causes any
CPU access to the page to result in a fault.

Faults are resovled by replacing the faulting entry with the original
mapping. This results in MMU notifiers being called which a driver uses
to update access permissions such as revoking atomic access. After
notifiers have been called the device will no longer have exclusive
access to the region.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>

---

v8:
* Remove device exclusive entries on fork rather than copy them.

v7:
* Added Christoph's Reviewed-by.
* Minor cosmetic cleanups suggested by Christoph.
* Replace mmu_notifier_range_init_migrate/exclusive with
  mmu_notifier_range_init_owner as suggested by Christoph.
* Replaced lock_page() with lock_page_retry() when handling faults.
* Restrict to anonymous pages for now.

v6:
* Fixed a bisectablity issue due to incorrectly applying the rename of
  migrate_pgmap_owner to the wrong patches for Nouveau and hmm_test.

v5:
* Renamed range->migrate_pgmap_owner to range->owner.
* Added MMU_NOTIFY_EXCLUSIVE to allow passing of a driver cookie which
  allows notifiers called as a result of make_device_exclusive_range() to
  be ignored.
* Added a check to try_to_protect_one() to detect if the pages originally
  returned from get_user_pages() have been unmapped or not.
* Removed check_device_exclusive_range() as it is no longer required with
  the other changes.
* Documentation update.

v4:
* Add function to check that mappings are still valid and exclusive.
* s/long/unsigned long/ in make_device_exclusive_entry().
---
 Documentation/vm/hmm.rst              |  19 ++-
 drivers/gpu/drm/nouveau/nouveau_svm.c |   2 +-
 include/linux/mmu_notifier.h          |  26 ++--
 include/linux/rmap.h                  |   4 +
 include/linux/swap.h                  |   4 +-
 include/linux/swapops.h               |  44 +++++-
 lib/test_hmm.c                        |   2 +-
 mm/hmm.c                              |   5 +
 mm/memory.c                           | 176 +++++++++++++++++++--
 mm/migrate.c                          |  10 +-
 mm/mprotect.c                         |   8 +
 mm/page_vma_mapped.c                  |   9 +-
 mm/rmap.c                             | 210 ++++++++++++++++++++++++++
 13 files changed, 487 insertions(+), 32 deletions(-)

diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
index 09e28507f5b2..a14c2938e7af 100644
--- a/Documentation/vm/hmm.rst
+++ b/Documentation/vm/hmm.rst
@@ -332,7 +332,7 @@ between device driver specific code and shared common code:
    walks to fill in the ``args->src`` array with PFNs to be migrated.
    The ``invalidate_range_start()`` callback is passed a
    ``struct mmu_notifier_range`` with the ``event`` field set to
-   ``MMU_NOTIFY_MIGRATE`` and the ``migrate_pgmap_owner`` field set to
+   ``MMU_NOTIFY_MIGRATE`` and the ``owner`` field set to
    the ``args->pgmap_owner`` field passed to migrate_vma_setup(). This is
    allows the device driver to skip the invalidation callback and only
    invalidate device private MMU mappings that are actually migrating.
@@ -405,6 +405,23 @@ between device driver specific code and shared common code:
 
    The lock can now be released.
 
+Exclusive access memory
+=======================
+
+Some devices have features such as atomic PTE bits that can be used to implement
+atomic access to system memory. To support atomic operations to a shared virtual
+memory page such a device needs access to that page which is exclusive of any
+userspace access from the CPU. The ``make_device_exclusive_range()`` function
+can be used to make a memory range inaccessible from userspace.
+
+This replaces all mappings for pages in the given range with special swap
+entries. Any attempt to access the swap entry results in a fault which is
+resovled by replacing the entry with the original mapping. A driver gets
+notified that the mapping has been changed by MMU notifiers, after which point
+it will no longer have exclusive access to the page. Exclusive access is
+guranteed to last until the driver drops the page lock and page reference, at
+which point any CPU faults on the page may proceed as described.
+
 Memory cgroup (memcg) and rss accounting
 ========================================
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
index f18bd53da052..94f841026c3b 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -265,7 +265,7 @@ nouveau_svmm_invalidate_range_start(struct mmu_notifier *mn,
 	 * the invalidation is handled as part of the migration process.
 	 */
 	if (update->event == MMU_NOTIFY_MIGRATE &&
-	    update->migrate_pgmap_owner == svmm->vmm->cli->drm->dev)
+	    update->owner == svmm->vmm->cli->drm->dev)
 		goto out;
 
 	if (limit > svmm->unmanaged.start && start < svmm->unmanaged.limit) {
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index b8200782dede..2e6068d3fb9f 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -41,7 +41,12 @@ struct mmu_interval_notifier;
  *
  * @MMU_NOTIFY_MIGRATE: used during migrate_vma_collect() invalidate to signal
  * a device driver to possibly ignore the invalidation if the
- * migrate_pgmap_owner field matches the driver's device private pgmap owner.
+ * owner field matches the driver's device private pgmap owner.
+ *
+ * @MMU_NOTIFY_EXCLUSIVE: to signal a device driver that the device will no
+ * longer have exclusive access to the page. May ignore the invalidation that's
+ * part of make_device_exclusive_range() if the owner field
+ * matches the value passed to make_device_exclusive_range().
  */
 enum mmu_notifier_event {
 	MMU_NOTIFY_UNMAP = 0,
@@ -51,6 +56,7 @@ enum mmu_notifier_event {
 	MMU_NOTIFY_SOFT_DIRTY,
 	MMU_NOTIFY_RELEASE,
 	MMU_NOTIFY_MIGRATE,
+	MMU_NOTIFY_EXCLUSIVE,
 };
 
 #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
@@ -269,7 +275,7 @@ struct mmu_notifier_range {
 	unsigned long end;
 	unsigned flags;
 	enum mmu_notifier_event event;
-	void *migrate_pgmap_owner;
+	void *owner;
 };
 
 static inline int mm_has_notifiers(struct mm_struct *mm)
@@ -521,14 +527,14 @@ static inline void mmu_notifier_range_init(struct mmu_notifier_range *range,
 	range->flags = flags;
 }
 
-static inline void mmu_notifier_range_init_migrate(
-			struct mmu_notifier_range *range, unsigned int flags,
+static inline void mmu_notifier_range_init_owner(
+			struct mmu_notifier_range *range,
+			enum mmu_notifier_event event, unsigned int flags,
 			struct vm_area_struct *vma, struct mm_struct *mm,
-			unsigned long start, unsigned long end, void *pgmap)
+			unsigned long start, unsigned long end, void *owner)
 {
-	mmu_notifier_range_init(range, MMU_NOTIFY_MIGRATE, flags, vma, mm,
-				start, end);
-	range->migrate_pgmap_owner = pgmap;
+	mmu_notifier_range_init(range, event, flags, vma, mm, start, end);
+	range->owner = owner;
 }
 
 #define ptep_clear_flush_young_notify(__vma, __address, __ptep)		\
@@ -655,8 +661,8 @@ static inline void _mmu_notifier_range_init(struct mmu_notifier_range *range,
 
 #define mmu_notifier_range_init(range,event,flags,vma,mm,start,end)  \
 	_mmu_notifier_range_init(range, start, end)
-#define mmu_notifier_range_init_migrate(range, flags, vma, mm, start, end, \
-					pgmap) \
+#define mmu_notifier_range_init_owner(range, event, flags, vma, mm, start, \
+					end, owner) \
 	_mmu_notifier_range_init(range, start, end)
 
 static inline bool
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 0e25d829f742..3a1ce4ef9276 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -193,6 +193,10 @@ int page_referenced(struct page *, int is_locked,
 bool try_to_migrate(struct page *page, enum ttu_flags flags);
 bool try_to_unmap(struct page *, enum ttu_flags flags);
 
+int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
+				unsigned long end, struct page **pages,
+				void *arg);
+
 /* Avoid racy checks */
 #define PVMW_SYNC		(1 << 0)
 /* Look for migarion entries rather than present PTEs */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 516104b9334b..7a3c260146df 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -63,9 +63,11 @@ static inline int current_is_kswapd(void)
  * to a special SWP_DEVICE_* entry.
  */
 #ifdef CONFIG_DEVICE_PRIVATE
-#define SWP_DEVICE_NUM 2
+#define SWP_DEVICE_NUM 4
 #define SWP_DEVICE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM)
 #define SWP_DEVICE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+1)
+#define SWP_DEVICE_EXCLUSIVE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+2)
+#define SWP_DEVICE_EXCLUSIVE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+3)
 #else
 #define SWP_DEVICE_NUM 0
 #endif
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 4dfd807ae52a..4129bd2ff9d6 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -120,6 +120,27 @@ static inline bool is_writable_device_private_entry(swp_entry_t entry)
 {
 	return unlikely(swp_type(entry) == SWP_DEVICE_WRITE);
 }
+
+static inline swp_entry_t make_readable_device_exclusive_entry(pgoff_t offset)
+{
+	return swp_entry(SWP_DEVICE_EXCLUSIVE_READ, offset);
+}
+
+static inline swp_entry_t make_writable_device_exclusive_entry(pgoff_t offset)
+{
+	return swp_entry(SWP_DEVICE_EXCLUSIVE_WRITE, offset);
+}
+
+static inline bool is_device_exclusive_entry(swp_entry_t entry)
+{
+	return swp_type(entry) == SWP_DEVICE_EXCLUSIVE_READ ||
+		swp_type(entry) == SWP_DEVICE_EXCLUSIVE_WRITE;
+}
+
+static inline bool is_writable_device_exclusive_entry(swp_entry_t entry)
+{
+	return unlikely(swp_type(entry) == SWP_DEVICE_EXCLUSIVE_WRITE);
+}
 #else /* CONFIG_DEVICE_PRIVATE */
 static inline swp_entry_t make_readable_device_private_entry(pgoff_t offset)
 {
@@ -140,6 +161,26 @@ static inline bool is_writable_device_private_entry(swp_entry_t entry)
 {
 	return false;
 }
+
+static inline swp_entry_t make_readable_device_exclusive_entry(pgoff_t offset)
+{
+	return swp_entry(0, 0);
+}
+
+static inline swp_entry_t make_writable_device_exclusive_entry(pgoff_t offset)
+{
+	return swp_entry(0, 0);
+}
+
+static inline bool is_device_exclusive_entry(swp_entry_t entry)
+{
+	return false;
+}
+
+static inline bool is_writable_device_exclusive_entry(swp_entry_t entry)
+{
+	return false;
+}
 #endif /* CONFIG_DEVICE_PRIVATE */
 
 #ifdef CONFIG_MIGRATION
@@ -219,7 +260,8 @@ static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
  */
 static inline bool is_pfn_swap_entry(swp_entry_t entry)
 {
-	return is_migration_entry(entry) || is_device_private_entry(entry);
+	return is_migration_entry(entry) || is_device_private_entry(entry) ||
+	       is_device_exclusive_entry(entry);
 }
 
 struct page_vma_mapped_walk;
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 80a78877bd93..5c9f5a020c1d 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -218,7 +218,7 @@ static bool dmirror_interval_invalidate(struct mmu_interval_notifier *mni,
 	 * the invalidation is handled as part of the migration process.
 	 */
 	if (range->event == MMU_NOTIFY_MIGRATE &&
-	    range->migrate_pgmap_owner == dmirror->mdevice)
+	    range->owner == dmirror->mdevice)
 		return true;
 
 	if (mmu_notifier_range_blockable(range))
diff --git a/mm/hmm.c b/mm/hmm.c
index 11df3ca30b82..fad6be2bf072 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -26,6 +26,8 @@
 #include <linux/mmu_notifier.h>
 #include <linux/memory_hotplug.h>
 
+#include "internal.h"
+
 struct hmm_vma_walk {
 	struct hmm_range	*range;
 	unsigned long		last;
@@ -271,6 +273,9 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 		if (!non_swap_entry(entry))
 			goto fault;
 
+		if (is_device_exclusive_entry(entry))
+			goto fault;
+
 		if (is_migration_entry(entry)) {
 			pte_unmap(ptep);
 			hmm_vma_walk->last = addr;
diff --git a/mm/memory.c b/mm/memory.c
index 3a5705cfc891..556ff396f2e9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -700,6 +700,84 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 }
 #endif
 
+static void restore_exclusive_pte(struct vm_area_struct *vma,
+				  struct page *page, unsigned long address,
+				  pte_t *ptep)
+{
+	pte_t pte;
+	swp_entry_t entry;
+
+	pte = pte_mkold(mk_pte(page, READ_ONCE(vma->vm_page_prot)));
+	if (pte_swp_soft_dirty(*ptep))
+		pte = pte_mksoft_dirty(pte);
+
+	entry = pte_to_swp_entry(*ptep);
+	if (pte_swp_uffd_wp(*ptep))
+		pte = pte_mkuffd_wp(pte);
+	else if (is_writable_device_exclusive_entry(entry))
+		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
+
+	set_pte_at(vma->vm_mm, address, ptep, pte);
+
+	/*
+	 * No need to take a page reference as one was already
+	 * created when the swap entry was made.
+	 */
+	if (PageAnon(page))
+		page_add_anon_rmap(page, vma, address, false);
+	else
+		page_add_file_rmap(page, false);
+
+	if (vma->vm_flags & VM_LOCKED)
+		mlock_vma_page(page);
+
+	/*
+	 * No need to invalidate - it was non-present before. However
+	 * secondary CPUs may have mappings that need invalidating.
+	 */
+	update_mmu_cache(vma, address, ptep);
+}
+
+/*
+ * Tries to restore an exclusive pte if the page lock can be acquired without
+ * sleeping. Returns 0 on success or -EBUSY if the page could not be locked or
+ * the entry no longer points at locked_page in which case locked_page should be
+ * locked before retrying the call.
+ */
+static unsigned long
+try_restore_exclusive_pte(struct mm_struct *src_mm, pte_t *src_pte,
+			  struct vm_area_struct *vma, unsigned long addr,
+			  struct page **locked_page)
+{
+	swp_entry_t entry = pte_to_swp_entry(*src_pte);
+	struct page *page = pfn_swap_entry_to_page(entry);
+
+	if (*locked_page) {
+		/* The entry changed, retry */
+		if (unlikely(*locked_page != page)) {
+			unlock_page(*locked_page);
+			put_page(*locked_page);
+			*locked_page = page;
+			return -EBUSY;
+		}
+		restore_exclusive_pte(vma, page, addr, src_pte);
+		unlock_page(page);
+		put_page(page);
+		*locked_page = NULL;
+		return 0;
+	}
+
+	if (trylock_page(page)) {
+		restore_exclusive_pte(vma, page, addr, src_pte);
+		unlock_page(page);
+		return 0;
+	}
+
+	/* The page couldn't be locked so drop the locks and retry. */
+	*locked_page = page;
+	return -EBUSY;
+}
+
 /*
  * copy one vm_area from one task to the other. Assumes the page tables
  * already present in the new task to be cleared in the whole range
@@ -781,6 +859,12 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 				pte = pte_swp_mkuffd_wp(pte);
 			set_pte_at(src_mm, addr, src_pte, pte);
 		}
+	} else if (is_device_exclusive_entry(entry)) {
+		/* COW mappings should be dealt with by removing the entry */
+		VM_BUG_ON(is_cow_mapping(vm_flags));
+		page = pfn_swap_entry_to_page(entry);
+		get_page(page);
+		rss[mm_counter(page)]++;
 	}
 	set_pte_at(dst_mm, addr, dst_pte, pte);
 	return 0;
@@ -947,6 +1031,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	int rss[NR_MM_COUNTERS];
 	swp_entry_t entry = (swp_entry_t){0};
 	struct page *prealloc = NULL;
+	struct page *locked_page = NULL;
 
 again:
 	progress = 0;
@@ -980,13 +1065,36 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 			continue;
 		}
 		if (unlikely(!pte_present(*src_pte))) {
-			entry.val = copy_nonpresent_pte(dst_mm, src_mm,
-							dst_pte, src_pte,
-							src_vma, addr, rss);
-			if (entry.val)
-				break;
-			progress += 8;
-			continue;
+			swp_entry_t swp_entry = pte_to_swp_entry(*src_pte);
+
+			if (unlikely(is_cow_mapping(src_vma->vm_flags) &&
+			    is_device_exclusive_entry(swp_entry))) {
+				/*
+				 * Normally this would require sending mmu
+				 * notifiers, but copy_page_range() has already
+				 * done that for COW mappings.
+				 */
+				ret = try_restore_exclusive_pte(src_mm, src_pte,
+								src_vma, addr,
+								&locked_page);
+				if (ret == -EBUSY)
+					break;
+			} else {
+				entry.val = copy_nonpresent_pte(dst_mm, src_mm,
+								dst_pte, src_pte,
+								src_vma, addr,
+								rss);
+				if (entry.val)
+					break;
+				progress += 8;
+				continue;
+			}
+		}
+		/* a non-present pte became present after dropping the ptl */
+		if (unlikely(locked_page)) {
+			unlock_page(locked_page);
+			put_page(locked_page);
+			locked_page = NULL;
 		}
 		/* copy_present_pte() will clear `*prealloc' if consumed */
 		ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte,
@@ -1023,6 +1131,11 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 			goto out;
 		}
 		entry.val = 0;
+	} else if (ret == -EBUSY) {
+		if (get_page_unless_zero(locked_page))
+			lock_page(locked_page);
+		else
+			locked_page = NULL;
 	} else if (ret) {
 		WARN_ON_ONCE(ret != -EAGAIN);
 		prealloc = page_copy_prealloc(src_mm, src_vma, addr);
@@ -1287,7 +1400,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		}
 
 		entry = pte_to_swp_entry(ptent);
-		if (is_device_private_entry(entry)) {
+		if (is_device_private_entry(entry) ||
+		    is_device_exclusive_entry(entry)) {
 			struct page *page = pfn_swap_entry_to_page(entry);
 
 			if (unlikely(details && details->check_mapping)) {
@@ -1303,7 +1417,10 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 
 			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
 			rss[mm_counter(page)]--;
-			page_remove_rmap(page, false);
+
+			if (is_device_private_entry(entry))
+				page_remove_rmap(page, false);
+
 			put_page(page);
 			continue;
 		}
@@ -3256,6 +3373,44 @@ void unmap_mapping_range(struct address_space *mapping,
 }
 EXPORT_SYMBOL(unmap_mapping_range);
 
+/*
+ * Restore a potential device exclusive pte to a working pte entry
+ */
+static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
+{
+	struct page *page = vmf->page;
+	struct vm_area_struct *vma = vmf->vma;
+	struct page_vma_mapped_walk pvmw = {
+		.page = page,
+		.vma = vma,
+		.address = vmf->address,
+		.flags = PVMW_SYNC,
+	};
+	vm_fault_t ret = 0;
+	struct mmu_notifier_range range;
+
+	if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags))
+		return VM_FAULT_RETRY;
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
+				vmf->address & PAGE_MASK,
+				(vmf->address & PAGE_MASK) + PAGE_SIZE);
+	mmu_notifier_invalidate_range_start(&range);
+
+	while (page_vma_mapped_walk(&pvmw)) {
+		if (unlikely(!pte_same(*pvmw.pte, vmf->orig_pte))) {
+			page_vma_mapped_walk_done(&pvmw);
+			break;
+		}
+
+		restore_exclusive_pte(vma, page, pvmw.address, pvmw.pte);
+	}
+
+	unlock_page(page);
+
+	mmu_notifier_invalidate_range_end(&range);
+	return ret;
+}
+
 /*
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -3283,6 +3438,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		if (is_migration_entry(entry)) {
 			migration_entry_wait(vma->vm_mm, vmf->pmd,
 					     vmf->address);
+		} else if (is_device_exclusive_entry(entry)) {
+			vmf->page = pfn_swap_entry_to_page(entry);
+			ret = remove_device_exclusive_entry(vmf);
 		} else if (is_device_private_entry(entry)) {
 			vmf->page = pfn_swap_entry_to_page(entry);
 			ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
diff --git a/mm/migrate.c b/mm/migrate.c
index cc4612e2a246..9cc9251d4802 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2570,8 +2570,8 @@ static void migrate_vma_collect(struct migrate_vma *migrate)
 	 * that the registered device driver can skip invalidating device
 	 * private page mappings that won't be migrated.
 	 */
-	mmu_notifier_range_init_migrate(&range, 0, migrate->vma,
-		migrate->vma->vm_mm, migrate->start, migrate->end,
+	mmu_notifier_range_init_owner(&range, MMU_NOTIFY_MIGRATE, 0,
+		migrate->vma, migrate->vma->vm_mm, migrate->start, migrate->end,
 		migrate->pgmap_owner);
 	mmu_notifier_invalidate_range_start(&range);
 
@@ -3074,9 +3074,9 @@ void migrate_vma_pages(struct migrate_vma *migrate)
 			if (!notified) {
 				notified = true;
 
-				mmu_notifier_range_init_migrate(&range, 0,
-					migrate->vma, migrate->vma->vm_mm,
-					addr, migrate->end,
+				mmu_notifier_range_init_owner(&range,
+					MMU_NOTIFY_MIGRATE, 0, migrate->vma,
+					migrate->vma->vm_mm, addr, migrate->end,
 					migrate->pgmap_owner);
 				mmu_notifier_invalidate_range_start(&range);
 			}
diff --git a/mm/mprotect.c b/mm/mprotect.c
index f21b760ec809..c6018541ea3d 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -165,6 +165,14 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				newpte = swp_entry_to_pte(entry);
 				if (pte_swp_uffd_wp(oldpte))
 					newpte = pte_swp_mkuffd_wp(newpte);
+			} else if (is_writable_device_exclusive_entry(entry)) {
+				entry = make_readable_device_exclusive_entry(
+							swp_offset(entry));
+				newpte = swp_entry_to_pte(entry);
+				if (pte_swp_soft_dirty(oldpte))
+					newpte = pte_swp_mksoft_dirty(newpte);
+				if (pte_swp_uffd_wp(oldpte))
+					newpte = pte_swp_mkuffd_wp(newpte);
 			} else {
 				newpte = oldpte;
 			}
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index eed988ab2e81..29842f169219 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -41,7 +41,8 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw)
 
 				/* Handle un-addressable ZONE_DEVICE memory */
 				entry = pte_to_swp_entry(*pvmw->pte);
-				if (!is_device_private_entry(entry))
+				if (!is_device_private_entry(entry) &&
+				    !is_device_exclusive_entry(entry))
 					return false;
 			} else if (!pte_present(*pvmw->pte))
 				return false;
@@ -93,7 +94,8 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
 			return false;
 		entry = pte_to_swp_entry(*pvmw->pte);
 
-		if (!is_migration_entry(entry))
+		if (!is_migration_entry(entry) &&
+		    !is_device_exclusive_entry(entry))
 			return false;
 
 		pfn = swp_offset(entry);
@@ -102,7 +104,8 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
 
 		/* Handle un-addressable ZONE_DEVICE memory */
 		entry = pte_to_swp_entry(*pvmw->pte);
-		if (!is_device_private_entry(entry))
+		if (!is_device_private_entry(entry) &&
+		    !is_device_exclusive_entry(entry))
 			return false;
 
 		pfn = swp_offset(entry);
diff --git a/mm/rmap.c b/mm/rmap.c
index 7f91f058f1f5..32b99a7bb358 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2005,6 +2005,216 @@ void page_mlock(struct page *page)
 	rmap_walk(page, &rwc);
 }
 
+struct ttp_args {
+	struct mm_struct *mm;
+	unsigned long address;
+	void *arg;
+	bool valid;
+};
+
+static bool try_to_protect_one(struct page *page, struct vm_area_struct *vma,
+			unsigned long address, void *arg)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct page_vma_mapped_walk pvmw = {
+		.page = page,
+		.vma = vma,
+		.address = address,
+	};
+	struct ttp_args *ttp = arg;
+	pte_t pteval;
+	struct page *subpage;
+	bool ret = true;
+	struct mmu_notifier_range range;
+	swp_entry_t entry;
+	pte_t swp_pte;
+
+	mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, vma,
+				      vma->vm_mm, address,
+				      min(vma->vm_end,
+					  address + page_size(page)),
+				      ttp->arg);
+	if (PageHuge(page)) {
+		/*
+		 * If sharing is possible, start and end will be adjusted
+		 * accordingly.
+		 */
+		adjust_range_if_pmd_sharing_possible(vma, &range.start,
+						     &range.end);
+	}
+	mmu_notifier_invalidate_range_start(&range);
+
+	while (page_vma_mapped_walk(&pvmw)) {
+		/* Unexpected PMD-mapped THP? */
+		VM_BUG_ON_PAGE(!pvmw.pte, page);
+
+		if (!pte_present(*pvmw.pte)) {
+			ret = false;
+			page_vma_mapped_walk_done(&pvmw);
+			break;
+		}
+
+		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
+		address = pvmw.address;
+
+		/* Nuke the page table entry. */
+		flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
+		pteval = ptep_clear_flush(vma, address, pvmw.pte);
+
+		/* Move the dirty bit to the page. Now the pte is gone. */
+		if (pte_dirty(pteval))
+			set_page_dirty(page);
+
+		/* Update high watermark before we lower rss */
+		update_hiwater_rss(mm);
+
+		if (arch_unmap_one(mm, vma, address, pteval) < 0) {
+			set_pte_at(mm, address, pvmw.pte, pteval);
+			ret = false;
+			page_vma_mapped_walk_done(&pvmw);
+			break;
+		}
+
+		/*
+		 * Check that our target page is still mapped at the expected
+		 * address.
+		 */
+		if (ttp->mm == mm && ttp->address == address &&
+		    pte_write(pteval))
+			ttp->valid = true;
+
+		/*
+		 * Store the pfn of the page in a special migration
+		 * pte. do_swap_page() will wait until the migration
+		 * pte is removed and then restart fault handling.
+		 */
+		if (pte_write(pteval))
+			entry = make_writable_device_exclusive_entry(
+							page_to_pfn(subpage));
+		else
+			entry = make_readable_device_exclusive_entry(
+							page_to_pfn(subpage));
+		swp_pte = swp_entry_to_pte(entry);
+		if (pte_soft_dirty(pteval))
+			swp_pte = pte_swp_mksoft_dirty(swp_pte);
+		if (pte_uffd_wp(pteval))
+			swp_pte = pte_swp_mkuffd_wp(swp_pte);
+
+		/* Take a reference for the swap entry */
+		get_page(page);
+		set_pte_at(mm, address, pvmw.pte, swp_pte);
+
+		page_remove_rmap(subpage, PageHuge(page));
+		put_page(page);
+	}
+
+	mmu_notifier_invalidate_range_end(&range);
+
+	return ret;
+}
+
+/**
+ * try_to_protect - try to replace all page table mappings with swap entries
+ * @page: the page to replace page table entries for
+ * @flags: action and flags
+ * @mm: the mm_struct where the page is expected to be mapped
+ * @address: address where the page is expected to be mapped
+ * @arg: passed to MMU_NOTIFY_EXCLUSIVE range notifier callbacks
+ *
+ * Tries to remove all the page table entries which are mapping this page and
+ * replace them with special swap entries to grant a device exclusive access to
+ * the page. Caller must hold the page lock.
+ *
+ * Returns false if the page is still mapped, or if it could not be unmapped
+ * from the expected address. Otherwise returns true (success).
+ */
+static bool try_to_protect(struct page *page, struct mm_struct *mm,
+			   unsigned long address, void *arg)
+{
+	struct ttp_args ttp = {
+		.mm = mm,
+		.address = address,
+		.arg = arg,
+		.valid = false,
+	};
+	struct rmap_walk_control rwc = {
+		.rmap_one = try_to_protect_one,
+		.done = page_not_mapped,
+		.anon_lock = page_lock_anon_vma_read,
+		.arg = &ttp,
+	};
+
+	/*
+	 * Restrict to anonymous pages for now to avoid potential writeback
+	 * issues.
+	 */
+	if (!PageAnon(page))
+		return false;
+
+	/*
+	 * During exec, a temporary VMA is setup and later moved.
+	 * The VMA is moved under the anon_vma lock but not the
+	 * page tables leading to a race where migration cannot
+	 * find the migration ptes. Rather than increasing the
+	 * locking requirements of exec(), migration skips
+	 * temporary VMAs until after exec() completes.
+	 */
+	if (!PageKsm(page) && PageAnon(page))
+		rwc.invalid_vma = invalid_migration_vma;
+
+	rmap_walk(page, &rwc);
+
+	return ttp.valid && !page_mapcount(page);
+}
+
+/**
+ * make_device_exclusive_range() - Mark a range for exclusive use by a device
+ * @mm: mm_struct of assoicated target process
+ * @start: start of the region to mark for exclusive device access
+ * @end: end address of region
+ * @pages: returns the pages which were successfully marked for exclusive access
+ * @arg: passed to MMU_NOTIFY_EXCLUSIVE range notifier too allow filtering
+ *
+ * Returns: number of pages successfully marked for exclusive access
+ *
+ * This function finds ptes mapping page(s) to the given address range, locks
+ * them and replaces mappings with special swap entries preventing userspace CPU
+ * access. On fault these entries are replaced with the original mapping after
+ * calling MMU notifiers.
+ *
+ * A driver using this to program access from a device must use a mmu notifier
+ * critical section to hold a device specific lock during programming. Once
+ * programming is complete it should drop the page lock and reference after
+ * which point CPU access to the page will revoke the exclusive access.
+ */
+int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
+				unsigned long end, struct page **pages,
+				void *arg)
+{
+	unsigned long npages = (end - start) >> PAGE_SHIFT;
+	unsigned long i;
+
+	npages = get_user_pages_remote(mm, start, npages,
+				       FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD,
+				       pages, NULL, NULL);
+	for (i = 0; i < npages; i++, start += PAGE_SIZE) {
+		if (!trylock_page(pages[i])) {
+			put_page(pages[i]);
+			pages[i] = NULL;
+			continue;
+		}
+
+		if (!try_to_protect(pages[i], mm, start, arg)) {
+			unlock_page(pages[i]);
+			put_page(pages[i]);
+			pages[i] = NULL;
+		}
+	}
+
+	return npages;
+}
+EXPORT_SYMBOL_GPL(make_device_exclusive_range);
+
 void __put_anon_vma(struct anon_vma *anon_vma)
 {
 	struct anon_vma *root = anon_vma->root;
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v8 6/8] mm: Selftests for exclusive device memory
  2021-04-07  8:42 [PATCH v8 0/8] Add support for SVM atomics in Nouveau Alistair Popple
                   ` (4 preceding siblings ...)
  2021-04-07  8:42 ` [PATCH v8 5/8] mm: Device exclusive memory access Alistair Popple
@ 2021-04-07  8:42 ` Alistair Popple
  2021-04-07  8:42 ` [PATCH v8 7/8] nouveau/svm: Refactor nouveau_range_fault Alistair Popple
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 42+ messages in thread
From: Alistair Popple @ 2021-04-07  8:42 UTC (permalink / raw)
  To: linux-mm, nouveau, bskeggs, akpm
  Cc: Alistair Popple, linux-doc, linux-kernel, dri-devel, jhubbard,
	rcampbell, jglisse, jgg, hch, daniel, willy, bsingharora

Adds some selftests for exclusive device memory.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Acked-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
---
 lib/test_hmm.c                         | 124 +++++++++++++++++++
 lib/test_hmm_uapi.h                    |   2 +
 tools/testing/selftests/vm/hmm-tests.c | 158 +++++++++++++++++++++++++
 3 files changed, 284 insertions(+)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 5c9f5a020c1d..305a9d9e2b4c 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -25,6 +25,7 @@
 #include <linux/swapops.h>
 #include <linux/sched/mm.h>
 #include <linux/platform_device.h>
+#include <linux/rmap.h>
 
 #include "test_hmm_uapi.h"
 
@@ -46,6 +47,7 @@ struct dmirror_bounce {
 	unsigned long		cpages;
 };
 
+#define DPT_XA_TAG_ATOMIC 1UL
 #define DPT_XA_TAG_WRITE 3UL
 
 /*
@@ -619,6 +621,54 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
 	}
 }
 
+static int dmirror_check_atomic(struct dmirror *dmirror, unsigned long start,
+			     unsigned long end)
+{
+	unsigned long pfn;
+
+	for (pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); pfn++) {
+		void *entry;
+		struct page *page;
+
+		entry = xa_load(&dmirror->pt, pfn);
+		page = xa_untag_pointer(entry);
+		if (xa_pointer_tag(entry) == DPT_XA_TAG_ATOMIC)
+			return -EPERM;
+	}
+
+	return 0;
+}
+
+static int dmirror_atomic_map(unsigned long start, unsigned long end,
+			      struct page **pages, struct dmirror *dmirror)
+{
+	unsigned long pfn, mapped = 0;
+	int i;
+
+	/* Map the migrated pages into the device's page tables. */
+	mutex_lock(&dmirror->mutex);
+
+	for (i = 0, pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); pfn++, i++) {
+		void *entry;
+
+		if (!pages[i])
+			continue;
+
+		entry = pages[i];
+		entry = xa_tag_pointer(entry, DPT_XA_TAG_ATOMIC);
+		entry = xa_store(&dmirror->pt, pfn, entry, GFP_ATOMIC);
+		if (xa_is_err(entry)) {
+			mutex_unlock(&dmirror->mutex);
+			return xa_err(entry);
+		}
+
+		mapped++;
+	}
+
+	mutex_unlock(&dmirror->mutex);
+	return mapped;
+}
+
 static int dmirror_migrate_finalize_and_map(struct migrate_vma *args,
 					    struct dmirror *dmirror)
 {
@@ -661,6 +711,71 @@ static int dmirror_migrate_finalize_and_map(struct migrate_vma *args,
 	return 0;
 }
 
+static int dmirror_exclusive(struct dmirror *dmirror,
+			     struct hmm_dmirror_cmd *cmd)
+{
+	unsigned long start, end, addr;
+	unsigned long size = cmd->npages << PAGE_SHIFT;
+	struct mm_struct *mm = dmirror->notifier.mm;
+	struct page *pages[64];
+	struct dmirror_bounce bounce;
+	unsigned long next;
+	int ret;
+
+	start = cmd->addr;
+	end = start + size;
+	if (end < start)
+		return -EINVAL;
+
+	/* Since the mm is for the mirrored process, get a reference first. */
+	if (!mmget_not_zero(mm))
+		return -EINVAL;
+
+	mmap_read_lock(mm);
+	for (addr = start; addr < end; addr = next) {
+		int i, mapped;
+
+		if (end < addr + (ARRAY_SIZE(pages) << PAGE_SHIFT))
+			next = end;
+		else
+			next = addr + (ARRAY_SIZE(pages) << PAGE_SHIFT);
+
+		ret = make_device_exclusive_range(mm, addr, next, pages, NULL);
+		mapped = dmirror_atomic_map(addr, next, pages, dmirror);
+		for (i = 0; i < ret; i++) {
+			if (pages[i]) {
+				unlock_page(pages[i]);
+				put_page(pages[i]);
+			}
+		}
+
+		if (addr + (mapped << PAGE_SHIFT) < next) {
+			mmap_read_unlock(mm);
+			mmput(mm);
+			return -EBUSY;
+		}
+	}
+	mmap_read_unlock(mm);
+	mmput(mm);
+
+	/* Return the migrated data for verification. */
+	ret = dmirror_bounce_init(&bounce, start, size);
+	if (ret)
+		return ret;
+	mutex_lock(&dmirror->mutex);
+	ret = dmirror_do_read(dmirror, start, end, &bounce);
+	mutex_unlock(&dmirror->mutex);
+	if (ret == 0) {
+		if (copy_to_user(u64_to_user_ptr(cmd->ptr), bounce.ptr,
+				 bounce.size))
+			ret = -EFAULT;
+	}
+
+	cmd->cpages = bounce.cpages;
+	dmirror_bounce_fini(&bounce);
+	return ret;
+}
+
 static int dmirror_migrate(struct dmirror *dmirror,
 			   struct hmm_dmirror_cmd *cmd)
 {
@@ -949,6 +1064,15 @@ static long dmirror_fops_unlocked_ioctl(struct file *filp,
 		ret = dmirror_migrate(dmirror, &cmd);
 		break;
 
+	case HMM_DMIRROR_EXCLUSIVE:
+		ret = dmirror_exclusive(dmirror, &cmd);
+		break;
+
+	case HMM_DMIRROR_CHECK_EXCLUSIVE:
+		ret = dmirror_check_atomic(dmirror, cmd.addr,
+					cmd.addr + (cmd.npages << PAGE_SHIFT));
+		break;
+
 	case HMM_DMIRROR_SNAPSHOT:
 		ret = dmirror_snapshot(dmirror, &cmd);
 		break;
diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h
index 670b4ef2a5b6..f14dea5dcd06 100644
--- a/lib/test_hmm_uapi.h
+++ b/lib/test_hmm_uapi.h
@@ -33,6 +33,8 @@ struct hmm_dmirror_cmd {
 #define HMM_DMIRROR_WRITE		_IOWR('H', 0x01, struct hmm_dmirror_cmd)
 #define HMM_DMIRROR_MIGRATE		_IOWR('H', 0x02, struct hmm_dmirror_cmd)
 #define HMM_DMIRROR_SNAPSHOT		_IOWR('H', 0x03, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_EXCLUSIVE		_IOWR('H', 0x04, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_CHECK_EXCLUSIVE	_IOWR('H', 0x05, struct hmm_dmirror_cmd)
 
 /*
  * Values returned in hmm_dmirror_cmd.ptr for HMM_DMIRROR_SNAPSHOT.
diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c
index 5d1ac691b9f4..864f126ffd78 100644
--- a/tools/testing/selftests/vm/hmm-tests.c
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -1485,4 +1485,162 @@ TEST_F(hmm2, double_map)
 	hmm_buffer_free(buffer);
 }
 
+/*
+ * Basic check of exclusive faulting.
+ */
+TEST_F(hmm, exclusive)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int ret;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Initialize buffer in system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Map memory exclusively for device access. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_EXCLUSIVE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	/* Fault pages back to system memory and check them. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i]++, i);
+
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i+1);
+
+	/* Check atomic access revoked */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_CHECK_EXCLUSIVE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+
+	hmm_buffer_free(buffer);
+}
+
+TEST_F(hmm, exclusive_mprotect)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int ret;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Initialize buffer in system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Map memory exclusively for device access. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_EXCLUSIVE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	ret = mprotect(buffer->ptr, size, PROT_READ);
+	ASSERT_EQ(ret, 0);
+
+	/* Simulate a device writing system memory. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_WRITE, buffer, npages);
+	ASSERT_EQ(ret, -EPERM);
+
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Check copy-on-write works.
+ */
+TEST_F(hmm, exclusive_cow)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int ret;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Initialize buffer in system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Map memory exclusively for device access. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_EXCLUSIVE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	fork();
+
+	/* Fault pages back to system memory and check them. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i]++, i);
+
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i+1);
+
+	hmm_buffer_free(buffer);
+}
+
 TEST_HARNESS_MAIN
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v8 7/8] nouveau/svm: Refactor nouveau_range_fault
  2021-04-07  8:42 [PATCH v8 0/8] Add support for SVM atomics in Nouveau Alistair Popple
                   ` (5 preceding siblings ...)
  2021-04-07  8:42 ` [PATCH v8 6/8] mm: Selftests for exclusive device memory Alistair Popple
@ 2021-04-07  8:42 ` Alistair Popple
  2021-04-07  8:42 ` [PATCH v8 8/8] nouveau/svm: Implement atomic SVM access Alistair Popple
  2021-05-06  7:43 ` [PATCH v8 0/8] Add support for SVM atomics in Nouveau Alistair Popple
  8 siblings, 0 replies; 42+ messages in thread
From: Alistair Popple @ 2021-04-07  8:42 UTC (permalink / raw)
  To: linux-mm, nouveau, bskeggs, akpm
  Cc: Alistair Popple, linux-doc, linux-kernel, dri-devel, jhubbard,
	rcampbell, jglisse, jgg, hch, daniel, willy, bsingharora

Call mmu_interval_notifier_insert() as part of nouveau_range_fault().
This doesn't introduce any functional change but makes it easier for a
subsequent patch to alter the behaviour of nouveau_range_fault() to
support GPU atomic operations.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
 drivers/gpu/drm/nouveau/nouveau_svm.c | 34 ++++++++++++++++-----------
 1 file changed, 20 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
index 94f841026c3b..a195e48c9aee 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -567,18 +567,27 @@ static int nouveau_range_fault(struct nouveau_svmm *svmm,
 	unsigned long hmm_pfns[1];
 	struct hmm_range range = {
 		.notifier = &notifier->notifier,
-		.start = notifier->notifier.interval_tree.start,
-		.end = notifier->notifier.interval_tree.last + 1,
 		.default_flags = hmm_flags,
 		.hmm_pfns = hmm_pfns,
 		.dev_private_owner = drm->dev,
 	};
-	struct mm_struct *mm = notifier->notifier.mm;
+	struct mm_struct *mm = svmm->notifier.mm;
 	int ret;
 
+	ret = mmu_interval_notifier_insert(&notifier->notifier, mm,
+					args->p.addr, args->p.size,
+					&nouveau_svm_mni_ops);
+	if (ret)
+		return ret;
+
+	range.start = notifier->notifier.interval_tree.start;
+	range.end = notifier->notifier.interval_tree.last + 1;
+
 	while (true) {
-		if (time_after(jiffies, timeout))
-			return -EBUSY;
+		if (time_after(jiffies, timeout)) {
+			ret = -EBUSY;
+			goto out;
+		}
 
 		range.notifier_seq = mmu_interval_read_begin(range.notifier);
 		mmap_read_lock(mm);
@@ -587,7 +596,7 @@ static int nouveau_range_fault(struct nouveau_svmm *svmm,
 		if (ret) {
 			if (ret == -EBUSY)
 				continue;
-			return ret;
+			goto out;
 		}
 
 		mutex_lock(&svmm->mutex);
@@ -606,6 +615,9 @@ static int nouveau_range_fault(struct nouveau_svmm *svmm,
 	svmm->vmm->vmm.object.client->super = false;
 	mutex_unlock(&svmm->mutex);
 
+out:
+	mmu_interval_notifier_remove(&notifier->notifier);
+
 	return ret;
 }
 
@@ -727,14 +739,8 @@ nouveau_svm_fault(struct nvif_notify *notify)
 		}
 
 		notifier.svmm = svmm;
-		ret = mmu_interval_notifier_insert(&notifier.notifier, mm,
-						   args.i.p.addr, args.i.p.size,
-						   &nouveau_svm_mni_ops);
-		if (!ret) {
-			ret = nouveau_range_fault(svmm, svm->drm, &args.i,
-				sizeof(args), hmm_flags, &notifier);
-			mmu_interval_notifier_remove(&notifier.notifier);
-		}
+		ret = nouveau_range_fault(svmm, svm->drm, &args.i,
+					sizeof(args), hmm_flags, &notifier);
 		mmput(mm);
 
 		limit = args.i.p.addr + args.i.p.size;
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v8 8/8] nouveau/svm: Implement atomic SVM access
  2021-04-07  8:42 [PATCH v8 0/8] Add support for SVM atomics in Nouveau Alistair Popple
                   ` (6 preceding siblings ...)
  2021-04-07  8:42 ` [PATCH v8 7/8] nouveau/svm: Refactor nouveau_range_fault Alistair Popple
@ 2021-04-07  8:42 ` Alistair Popple
  2021-05-21  4:04   ` Ben Skeggs
  2021-05-06  7:43 ` [PATCH v8 0/8] Add support for SVM atomics in Nouveau Alistair Popple
  8 siblings, 1 reply; 42+ messages in thread
From: Alistair Popple @ 2021-04-07  8:42 UTC (permalink / raw)
  To: linux-mm, nouveau, bskeggs, akpm
  Cc: Alistair Popple, linux-doc, linux-kernel, dri-devel, jhubbard,
	rcampbell, jglisse, jgg, hch, daniel, willy, bsingharora

Some NVIDIA GPUs do not support direct atomic access to system memory
via PCIe. Instead this must be emulated by granting the GPU exclusive
access to the memory. This is achieved by replacing CPU page table
entries with special swap entries that fault on userspace access.

The driver then grants the GPU permission to update the page undergoing
atomic access via the GPU page tables. When CPU access to the page is
required a CPU fault is raised which calls into the device driver via
MMU notifiers to revoke the atomic access. The original page table
entries are then restored allowing CPU access to proceed.

Signed-off-by: Alistair Popple <apopple@nvidia.com>

---

v7:
* Removed magic values for fault access levels
* Improved readability of fault comparison code

v4:
* Check that page table entries haven't changed before mapping on the
  device
---
 drivers/gpu/drm/nouveau/include/nvif/if000c.h |   1 +
 drivers/gpu/drm/nouveau/nouveau_svm.c         | 126 ++++++++++++++++--
 drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h |   1 +
 .../drm/nouveau/nvkm/subdev/mmu/vmmgp100.c    |   6 +
 4 files changed, 123 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/include/nvif/if000c.h b/drivers/gpu/drm/nouveau/include/nvif/if000c.h
index d6dd40f21eed..9c7ff56831c5 100644
--- a/drivers/gpu/drm/nouveau/include/nvif/if000c.h
+++ b/drivers/gpu/drm/nouveau/include/nvif/if000c.h
@@ -77,6 +77,7 @@ struct nvif_vmm_pfnmap_v0 {
 #define NVIF_VMM_PFNMAP_V0_APER                           0x00000000000000f0ULL
 #define NVIF_VMM_PFNMAP_V0_HOST                           0x0000000000000000ULL
 #define NVIF_VMM_PFNMAP_V0_VRAM                           0x0000000000000010ULL
+#define NVIF_VMM_PFNMAP_V0_A				  0x0000000000000004ULL
 #define NVIF_VMM_PFNMAP_V0_W                              0x0000000000000002ULL
 #define NVIF_VMM_PFNMAP_V0_V                              0x0000000000000001ULL
 #define NVIF_VMM_PFNMAP_V0_NONE                           0x0000000000000000ULL
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
index a195e48c9aee..81526d65b4e2 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -35,6 +35,7 @@
 #include <linux/sched/mm.h>
 #include <linux/sort.h>
 #include <linux/hmm.h>
+#include <linux/rmap.h>
 
 struct nouveau_svm {
 	struct nouveau_drm *drm;
@@ -67,6 +68,11 @@ struct nouveau_svm {
 	} buffer[1];
 };
 
+#define FAULT_ACCESS_READ 0
+#define FAULT_ACCESS_WRITE 1
+#define FAULT_ACCESS_ATOMIC 2
+#define FAULT_ACCESS_PREFETCH 3
+
 #define SVM_DBG(s,f,a...) NV_DEBUG((s)->drm, "svm: "f"\n", ##a)
 #define SVM_ERR(s,f,a...) NV_WARN((s)->drm, "svm: "f"\n", ##a)
 
@@ -411,6 +417,24 @@ nouveau_svm_fault_cancel_fault(struct nouveau_svm *svm,
 				      fault->client);
 }
 
+static int
+nouveau_svm_fault_priority(u8 fault)
+{
+	switch (fault) {
+	case FAULT_ACCESS_PREFETCH:
+		return 0;
+	case FAULT_ACCESS_READ:
+		return 1;
+	case FAULT_ACCESS_WRITE:
+		return 2;
+	case FAULT_ACCESS_ATOMIC:
+		return 3;
+	default:
+		WARN_ON_ONCE(1);
+		return -1;
+	}
+}
+
 static int
 nouveau_svm_fault_cmp(const void *a, const void *b)
 {
@@ -421,9 +445,8 @@ nouveau_svm_fault_cmp(const void *a, const void *b)
 		return ret;
 	if ((ret = (s64)fa->addr - fb->addr))
 		return ret;
-	/*XXX: atomic? */
-	return (fa->access == 0 || fa->access == 3) -
-	       (fb->access == 0 || fb->access == 3);
+	return nouveau_svm_fault_priority(fa->access) -
+		nouveau_svm_fault_priority(fb->access);
 }
 
 static void
@@ -487,6 +510,10 @@ static bool nouveau_svm_range_invalidate(struct mmu_interval_notifier *mni,
 	struct svm_notifier *sn =
 		container_of(mni, struct svm_notifier, notifier);
 
+	if (range->event == MMU_NOTIFY_EXCLUSIVE &&
+	    range->owner == sn->svmm->vmm->cli->drm->dev)
+		return true;
+
 	/*
 	 * serializes the update to mni->invalidate_seq done by caller and
 	 * prevents invalidation of the PTE from progressing while HW is being
@@ -555,6 +582,71 @@ static void nouveau_hmm_convert_pfn(struct nouveau_drm *drm,
 		args->p.phys[0] |= NVIF_VMM_PFNMAP_V0_W;
 }
 
+static int nouveau_atomic_range_fault(struct nouveau_svmm *svmm,
+			       struct nouveau_drm *drm,
+			       struct nouveau_pfnmap_args *args, u32 size,
+			       struct svm_notifier *notifier)
+{
+	unsigned long timeout =
+		jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
+	struct mm_struct *mm = svmm->notifier.mm;
+	struct page *page;
+	unsigned long start = args->p.addr;
+	unsigned long notifier_seq;
+	int ret = 0;
+
+	ret = mmu_interval_notifier_insert(&notifier->notifier, mm,
+					args->p.addr, args->p.size,
+					&nouveau_svm_mni_ops);
+	if (ret)
+		return ret;
+
+	while (true) {
+		if (time_after(jiffies, timeout)) {
+			ret = -EBUSY;
+			goto out;
+		}
+
+		notifier_seq = mmu_interval_read_begin(&notifier->notifier);
+		mmap_read_lock(mm);
+		make_device_exclusive_range(mm, start, start + PAGE_SIZE,
+					    &page, drm->dev);
+		mmap_read_unlock(mm);
+		if (!page) {
+			ret = -EINVAL;
+			goto out;
+		}
+
+		mutex_lock(&svmm->mutex);
+		if (!mmu_interval_read_retry(&notifier->notifier,
+					     notifier_seq))
+			break;
+		mutex_unlock(&svmm->mutex);
+	}
+
+	/* Map the page on the GPU. */
+	args->p.page = 12;
+	args->p.size = PAGE_SIZE;
+	args->p.addr = start;
+	args->p.phys[0] = page_to_phys(page) |
+		NVIF_VMM_PFNMAP_V0_V |
+		NVIF_VMM_PFNMAP_V0_W |
+		NVIF_VMM_PFNMAP_V0_A |
+		NVIF_VMM_PFNMAP_V0_HOST;
+
+	svmm->vmm->vmm.object.client->super = true;
+	ret = nvif_object_ioctl(&svmm->vmm->vmm.object, args, size, NULL);
+	svmm->vmm->vmm.object.client->super = false;
+	mutex_unlock(&svmm->mutex);
+
+	unlock_page(page);
+	put_page(page);
+
+out:
+	mmu_interval_notifier_remove(&notifier->notifier);
+	return ret;
+}
+
 static int nouveau_range_fault(struct nouveau_svmm *svmm,
 			       struct nouveau_drm *drm,
 			       struct nouveau_pfnmap_args *args, u32 size,
@@ -637,7 +729,7 @@ nouveau_svm_fault(struct nvif_notify *notify)
 	unsigned long hmm_flags;
 	u64 inst, start, limit;
 	int fi, fn;
-	int replay = 0, ret;
+	int replay = 0, atomic = 0, ret;
 
 	/* Parse available fault buffer entries into a cache, and update
 	 * the GET pointer so HW can reuse the entries.
@@ -718,12 +810,14 @@ nouveau_svm_fault(struct nvif_notify *notify)
 		/*
 		 * Determine required permissions based on GPU fault
 		 * access flags.
-		 * XXX: atomic?
 		 */
 		switch (buffer->fault[fi]->access) {
 		case 0: /* READ. */
 			hmm_flags = HMM_PFN_REQ_FAULT;
 			break;
+		case 2: /* ATOMIC. */
+			atomic = true;
+			break;
 		case 3: /* PREFETCH. */
 			hmm_flags = 0;
 			break;
@@ -739,8 +833,14 @@ nouveau_svm_fault(struct nvif_notify *notify)
 		}
 
 		notifier.svmm = svmm;
-		ret = nouveau_range_fault(svmm, svm->drm, &args.i,
-					sizeof(args), hmm_flags, &notifier);
+		if (atomic)
+			ret = nouveau_atomic_range_fault(svmm, svm->drm,
+							 &args.i, sizeof(args),
+							 &notifier);
+		else
+			ret = nouveau_range_fault(svmm, svm->drm, &args.i,
+						  sizeof(args), hmm_flags,
+						  &notifier);
 		mmput(mm);
 
 		limit = args.i.p.addr + args.i.p.size;
@@ -756,11 +856,15 @@ nouveau_svm_fault(struct nvif_notify *notify)
 			 */
 			if (buffer->fault[fn]->svmm != svmm ||
 			    buffer->fault[fn]->addr >= limit ||
-			    (buffer->fault[fi]->access == 0 /* READ. */ &&
+			    (buffer->fault[fi]->access == FAULT_ACCESS_READ &&
 			     !(args.phys[0] & NVIF_VMM_PFNMAP_V0_V)) ||
-			    (buffer->fault[fi]->access != 0 /* READ. */ &&
-			     buffer->fault[fi]->access != 3 /* PREFETCH. */ &&
-			     !(args.phys[0] & NVIF_VMM_PFNMAP_V0_W)))
+			    (buffer->fault[fi]->access != FAULT_ACCESS_READ &&
+			     buffer->fault[fi]->access != FAULT_ACCESS_PREFETCH &&
+			     !(args.phys[0] & NVIF_VMM_PFNMAP_V0_W)) ||
+			    (buffer->fault[fi]->access != FAULT_ACCESS_READ &&
+			     buffer->fault[fi]->access != FAULT_ACCESS_WRITE &&
+			     buffer->fault[fi]->access != FAULT_ACCESS_PREFETCH &&
+			     !(args.phys[0] & NVIF_VMM_PFNMAP_V0_A)))
 				break;
 		}
 
diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h
index a2b179568970..f6188aa9171c 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h
@@ -178,6 +178,7 @@ void nvkm_vmm_unmap_region(struct nvkm_vmm *, struct nvkm_vma *);
 #define NVKM_VMM_PFN_APER                                 0x00000000000000f0ULL
 #define NVKM_VMM_PFN_HOST                                 0x0000000000000000ULL
 #define NVKM_VMM_PFN_VRAM                                 0x0000000000000010ULL
+#define NVKM_VMM_PFN_A					  0x0000000000000004ULL
 #define NVKM_VMM_PFN_W                                    0x0000000000000002ULL
 #define NVKM_VMM_PFN_V                                    0x0000000000000001ULL
 #define NVKM_VMM_PFN_NONE                                 0x0000000000000000ULL
diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c
index 236db5570771..f02abd9cb4dd 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c
@@ -88,6 +88,9 @@ gp100_vmm_pgt_pfn(struct nvkm_vmm *vmm, struct nvkm_mmu_pt *pt,
 		if (!(*map->pfn & NVKM_VMM_PFN_W))
 			data |= BIT_ULL(6); /* RO. */
 
+		if (!(*map->pfn & NVKM_VMM_PFN_A))
+			data |= BIT_ULL(7); /* Atomic disable. */
+
 		if (!(*map->pfn & NVKM_VMM_PFN_VRAM)) {
 			addr = *map->pfn >> NVKM_VMM_PFN_ADDR_SHIFT;
 			addr = dma_map_page(dev, pfn_to_page(addr), 0,
@@ -322,6 +325,9 @@ gp100_vmm_pd0_pfn(struct nvkm_vmm *vmm, struct nvkm_mmu_pt *pt,
 		if (!(*map->pfn & NVKM_VMM_PFN_W))
 			data |= BIT_ULL(6); /* RO. */
 
+		if (!(*map->pfn & NVKM_VMM_PFN_A))
+			data |= BIT_ULL(7); /* Atomic disable. */
+
 		if (!(*map->pfn & NVKM_VMM_PFN_VRAM)) {
 			addr = *map->pfn >> NVKM_VMM_PFN_ADDR_SHIFT;
 			addr = dma_map_page(dev, pfn_to_page(addr), 0,
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH v8 0/8] Add support for SVM atomics in Nouveau
  2021-04-07  8:42 [PATCH v8 0/8] Add support for SVM atomics in Nouveau Alistair Popple
                   ` (7 preceding siblings ...)
  2021-04-07  8:42 ` [PATCH v8 8/8] nouveau/svm: Implement atomic SVM access Alistair Popple
@ 2021-05-06  7:43 ` Alistair Popple
  8 siblings, 0 replies; 42+ messages in thread
From: Alistair Popple @ 2021-05-06  7:43 UTC (permalink / raw)
  To: linux-mm
  Cc: nouveau, bskeggs, akpm, linux-doc, linux-kernel, dri-devel,
	jhubbard, rcampbell, jglisse, jgg, hch, daniel, willy,
	bsingharora

Hi Andrew,

There is currently no outstanding feedback for this series so I am hoping it 
may be considered for inclusion (or at least the mm portions - I still need 
Reviews/Acks for the Nouveau bits). The main change for v8 was removal of 
entries on fork rather than copying in response to feedback from Jason so any 
follow up comments on patch 5 would also be welcome. The series contains a 
number of general clean-ups suggested by Christoph along with a feature to 
temporarily make selected user page mappings write-protected.

This is needed to support OpenCL atomic operations in Nouveau to shared 
virtual memory (SVM) regions allocated with the CL_MEM_SVM_ATOMICS clSVMAlloc 
flag. A more complete description of the OpenCL SVM feature is available at 
https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/
OpenCL_API.html#_shared_virtual_memory .

I have been testing this with Mesa 21.1.0 and a simple OpenCL program which 
checks GPU atomic accesses to system memory are atomic. Without this series 
the test fails as there is no way of write-protecting the userspace page 
mapping which results in the device clobbering CPU writes. For reference the 
test is available at https://ozlabs.org/~apopple/opencl_svm_atomics/ .

 - Alistair

On Wednesday, 7 April 2021 6:42:30 PM AEST Alistair Popple wrote:
> This is the eighth version of a series to add support to Nouveau for atomic
> memory operations on OpenCL shared virtual memory (SVM) regions.
> 
> The main change for this version is a simplification of device exclusive
> entry handling. Instead of copying entries for copy-on-write mappings
> during fork they are removed instead. This is safer because there could be
> unique corner cases when copying, particularly for pinned pages which
> should follow the same logic as copy_present_page(). Removing entries
> avoids this possiblity by treating them as normal ptes.
> 
> Exclusive device access is implemented by adding a new swap entry type
> (SWAP_DEVICE_EXCLUSIVE) which is similar to a migration entry. The main
> difference is that on fault the original entry is immediately restored by
> the fault handler instead of waiting.
> 
> Restoring the entry triggers calls to MMU notifers which allows a device
> driver to revoke the atomic access permission from the GPU prior to the CPU
> finalising the entry.
> 
> Patches 1 & 2 refactor existing migration and device private entry
> functions.
> 
> Patches 3 & 4 rework try_to_unmap_one() by splitting out unrelated
> functionality into separate functions - try_to_migrate_one() and
> try_to_munlock_one(). These should not change any functionality, but any
> help testing would be much appreciated as I have not been able to test
> every usage of try_to_unmap_one().
> 
> Patch 5 contains the bulk of the implementation for device exclusive
> memory.
> 
> Patch 6 contains some additions to the HMM selftests to ensure everything
> works as expected.
> 
> Patch 7 is a cleanup for the Nouveau SVM implementation.
> 
> Patch 8 contains the implementation of atomic access for the Nouveau
> driver.
> 
> This has been tested using the latest upstream Mesa userspace with a simple
> OpenCL test program which checks the results of atomic GPU operations on a
> SVM buffer whilst also writing to the same buffer from the CPU.
> 
> Alistair Popple (8):
>   mm: Remove special swap entry functions
>   mm/swapops: Rework swap entry manipulation code
>   mm/rmap: Split try_to_munlock from try_to_unmap
>   mm/rmap: Split migration into its own function
>   mm: Device exclusive memory access
>   mm: Selftests for exclusive device memory
>   nouveau/svm: Refactor nouveau_range_fault
>   nouveau/svm: Implement atomic SVM access
> 
>  Documentation/vm/hmm.rst                      |  19 +-
>  Documentation/vm/unevictable-lru.rst          |  33 +-
>  arch/s390/mm/pgtable.c                        |   2 +-
>  drivers/gpu/drm/nouveau/include/nvif/if000c.h |   1 +
>  drivers/gpu/drm/nouveau/nouveau_svm.c         | 156 ++++-
>  drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h |   1 +
>  .../drm/nouveau/nvkm/subdev/mmu/vmmgp100.c    |   6 +
>  fs/proc/task_mmu.c                            |  23 +-
>  include/linux/mmu_notifier.h                  |  26 +-
>  include/linux/rmap.h                          |  11 +-
>  include/linux/swap.h                          |   8 +-
>  include/linux/swapops.h                       | 123 ++--
>  lib/test_hmm.c                                | 126 +++-
>  lib/test_hmm_uapi.h                           |   2 +
>  mm/debug_vm_pgtable.c                         |  12 +-
>  mm/hmm.c                                      |  12 +-
>  mm/huge_memory.c                              |  45 +-
>  mm/hugetlb.c                                  |  10 +-
>  mm/memcontrol.c                               |   2 +-
>  mm/memory.c                                   | 196 +++++-
>  mm/migrate.c                                  |  51 +-
>  mm/mlock.c                                    |  10 +-
>  mm/mprotect.c                                 |  18 +-
>  mm/page_vma_mapped.c                          |  15 +-
>  mm/rmap.c                                     | 612 +++++++++++++++---
>  tools/testing/selftests/vm/hmm-tests.c        | 158 +++++
>  26 files changed, 1366 insertions(+), 312 deletions(-)
> 
> 






^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v8 5/8] mm: Device exclusive memory access
  2021-04-07  8:42 ` [PATCH v8 5/8] mm: Device exclusive memory access Alistair Popple
@ 2021-05-18  2:08   ` Peter Xu
  2021-05-18 13:19     ` Alistair Popple
  2021-05-18 21:16   ` Peter Xu
  1 sibling, 1 reply; 42+ messages in thread
From: Peter Xu @ 2021-05-18  2:08 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, nouveau, bskeggs, akpm, linux-doc, linux-kernel,
	dri-devel, jhubbard, rcampbell, jglisse, jgg, hch, daniel, willy,
	bsingharora, Christoph Hellwig

Hi, Alistair,

The overall patch looks good to me, however I have a few comments or questions
inlined below.

On Wed, Apr 07, 2021 at 06:42:35PM +1000, Alistair Popple wrote:
> Some devices require exclusive write access to shared virtual
> memory (SVM) ranges to perform atomic operations on that memory. This
> requires CPU page tables to be updated to deny access whilst atomic
> operations are occurring.
> 
> In order to do this introduce a new swap entry
> type (SWP_DEVICE_EXCLUSIVE). When a SVM range needs to be marked for
> exclusive access by a device all page table mappings for the particular
> range are replaced with device exclusive swap entries. This causes any
> CPU access to the page to result in a fault.
> 
> Faults are resovled by replacing the faulting entry with the original
> mapping. This results in MMU notifiers being called which a driver uses
> to update access permissions such as revoking atomic access. After
> notifiers have been called the device will no longer have exclusive
> access to the region.
> 
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> 
> ---
> 
> v8:
> * Remove device exclusive entries on fork rather than copy them.
> 
> v7:
> * Added Christoph's Reviewed-by.
> * Minor cosmetic cleanups suggested by Christoph.
> * Replace mmu_notifier_range_init_migrate/exclusive with
>   mmu_notifier_range_init_owner as suggested by Christoph.
> * Replaced lock_page() with lock_page_retry() when handling faults.
> * Restrict to anonymous pages for now.
> 
> v6:
> * Fixed a bisectablity issue due to incorrectly applying the rename of
>   migrate_pgmap_owner to the wrong patches for Nouveau and hmm_test.
> 
> v5:
> * Renamed range->migrate_pgmap_owner to range->owner.

May be nicer to mention this rename in commit message (or make it a separate
patch)?

> * Added MMU_NOTIFY_EXCLUSIVE to allow passing of a driver cookie which
>   allows notifiers called as a result of make_device_exclusive_range() to
>   be ignored.
> * Added a check to try_to_protect_one() to detect if the pages originally
>   returned from get_user_pages() have been unmapped or not.
> * Removed check_device_exclusive_range() as it is no longer required with
>   the other changes.
> * Documentation update.
> 
> v4:
> * Add function to check that mappings are still valid and exclusive.
> * s/long/unsigned long/ in make_device_exclusive_entry().
> ---
>  Documentation/vm/hmm.rst              |  19 ++-
>  drivers/gpu/drm/nouveau/nouveau_svm.c |   2 +-
>  include/linux/mmu_notifier.h          |  26 ++--
>  include/linux/rmap.h                  |   4 +
>  include/linux/swap.h                  |   4 +-
>  include/linux/swapops.h               |  44 +++++-
>  lib/test_hmm.c                        |   2 +-
>  mm/hmm.c                              |   5 +
>  mm/memory.c                           | 176 +++++++++++++++++++--
>  mm/migrate.c                          |  10 +-
>  mm/mprotect.c                         |   8 +
>  mm/page_vma_mapped.c                  |   9 +-
>  mm/rmap.c                             | 210 ++++++++++++++++++++++++++
>  13 files changed, 487 insertions(+), 32 deletions(-)
> 
> diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
> index 09e28507f5b2..a14c2938e7af 100644
> --- a/Documentation/vm/hmm.rst
> +++ b/Documentation/vm/hmm.rst
> @@ -332,7 +332,7 @@ between device driver specific code and shared common code:
>     walks to fill in the ``args->src`` array with PFNs to be migrated.
>     The ``invalidate_range_start()`` callback is passed a
>     ``struct mmu_notifier_range`` with the ``event`` field set to
> -   ``MMU_NOTIFY_MIGRATE`` and the ``migrate_pgmap_owner`` field set to
> +   ``MMU_NOTIFY_MIGRATE`` and the ``owner`` field set to
>     the ``args->pgmap_owner`` field passed to migrate_vma_setup(). This is
>     allows the device driver to skip the invalidation callback and only
>     invalidate device private MMU mappings that are actually migrating.
> @@ -405,6 +405,23 @@ between device driver specific code and shared common code:
>  
>     The lock can now be released.
>  
> +Exclusive access memory
> +=======================
> +
> +Some devices have features such as atomic PTE bits that can be used to implement
> +atomic access to system memory. To support atomic operations to a shared virtual
> +memory page such a device needs access to that page which is exclusive of any
> +userspace access from the CPU. The ``make_device_exclusive_range()`` function
> +can be used to make a memory range inaccessible from userspace.
> +
> +This replaces all mappings for pages in the given range with special swap
> +entries. Any attempt to access the swap entry results in a fault which is
> +resovled by replacing the entry with the original mapping. A driver gets
> +notified that the mapping has been changed by MMU notifiers, after which point
> +it will no longer have exclusive access to the page. Exclusive access is
> +guranteed to last until the driver drops the page lock and page reference, at
> +which point any CPU faults on the page may proceed as described.
> +
>  Memory cgroup (memcg) and rss accounting
>  ========================================
>  
> diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
> index f18bd53da052..94f841026c3b 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_svm.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
> @@ -265,7 +265,7 @@ nouveau_svmm_invalidate_range_start(struct mmu_notifier *mn,
>  	 * the invalidation is handled as part of the migration process.
>  	 */
>  	if (update->event == MMU_NOTIFY_MIGRATE &&
> -	    update->migrate_pgmap_owner == svmm->vmm->cli->drm->dev)
> +	    update->owner == svmm->vmm->cli->drm->dev)
>  		goto out;
>  
>  	if (limit > svmm->unmanaged.start && start < svmm->unmanaged.limit) {
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index b8200782dede..2e6068d3fb9f 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -41,7 +41,12 @@ struct mmu_interval_notifier;
>   *
>   * @MMU_NOTIFY_MIGRATE: used during migrate_vma_collect() invalidate to signal
>   * a device driver to possibly ignore the invalidation if the
> - * migrate_pgmap_owner field matches the driver's device private pgmap owner.
> + * owner field matches the driver's device private pgmap owner.
> + *
> + * @MMU_NOTIFY_EXCLUSIVE: to signal a device driver that the device will no
> + * longer have exclusive access to the page. May ignore the invalidation that's
> + * part of make_device_exclusive_range() if the owner field
> + * matches the value passed to make_device_exclusive_range().
>   */
>  enum mmu_notifier_event {
>  	MMU_NOTIFY_UNMAP = 0,
> @@ -51,6 +56,7 @@ enum mmu_notifier_event {
>  	MMU_NOTIFY_SOFT_DIRTY,
>  	MMU_NOTIFY_RELEASE,
>  	MMU_NOTIFY_MIGRATE,
> +	MMU_NOTIFY_EXCLUSIVE,
>  };
>  
>  #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
> @@ -269,7 +275,7 @@ struct mmu_notifier_range {
>  	unsigned long end;
>  	unsigned flags;
>  	enum mmu_notifier_event event;
> -	void *migrate_pgmap_owner;
> +	void *owner;
>  };
>  
>  static inline int mm_has_notifiers(struct mm_struct *mm)
> @@ -521,14 +527,14 @@ static inline void mmu_notifier_range_init(struct mmu_notifier_range *range,
>  	range->flags = flags;
>  }
>  
> -static inline void mmu_notifier_range_init_migrate(
> -			struct mmu_notifier_range *range, unsigned int flags,
> +static inline void mmu_notifier_range_init_owner(
> +			struct mmu_notifier_range *range,
> +			enum mmu_notifier_event event, unsigned int flags,
>  			struct vm_area_struct *vma, struct mm_struct *mm,
> -			unsigned long start, unsigned long end, void *pgmap)
> +			unsigned long start, unsigned long end, void *owner)
>  {
> -	mmu_notifier_range_init(range, MMU_NOTIFY_MIGRATE, flags, vma, mm,
> -				start, end);
> -	range->migrate_pgmap_owner = pgmap;
> +	mmu_notifier_range_init(range, event, flags, vma, mm, start, end);
> +	range->owner = owner;
>  }
>  
>  #define ptep_clear_flush_young_notify(__vma, __address, __ptep)		\
> @@ -655,8 +661,8 @@ static inline void _mmu_notifier_range_init(struct mmu_notifier_range *range,
>  
>  #define mmu_notifier_range_init(range,event,flags,vma,mm,start,end)  \
>  	_mmu_notifier_range_init(range, start, end)
> -#define mmu_notifier_range_init_migrate(range, flags, vma, mm, start, end, \
> -					pgmap) \
> +#define mmu_notifier_range_init_owner(range, event, flags, vma, mm, start, \
> +					end, owner) \
>  	_mmu_notifier_range_init(range, start, end)
>  
>  static inline bool
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 0e25d829f742..3a1ce4ef9276 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -193,6 +193,10 @@ int page_referenced(struct page *, int is_locked,
>  bool try_to_migrate(struct page *page, enum ttu_flags flags);
>  bool try_to_unmap(struct page *, enum ttu_flags flags);
>  
> +int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
> +				unsigned long end, struct page **pages,
> +				void *arg);
> +
>  /* Avoid racy checks */
>  #define PVMW_SYNC		(1 << 0)
>  /* Look for migarion entries rather than present PTEs */
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 516104b9334b..7a3c260146df 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -63,9 +63,11 @@ static inline int current_is_kswapd(void)
>   * to a special SWP_DEVICE_* entry.
>   */

Should we add another short description for the newly added two types?
Otherwise the reader could get confused assuming the above comment is
explaining all four types, while it is for SWP_DEVICE_{READ|WRITE} only?

>  #ifdef CONFIG_DEVICE_PRIVATE
> -#define SWP_DEVICE_NUM 2
> +#define SWP_DEVICE_NUM 4
>  #define SWP_DEVICE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM)
>  #define SWP_DEVICE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+1)
> +#define SWP_DEVICE_EXCLUSIVE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+2)
> +#define SWP_DEVICE_EXCLUSIVE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+3)
>  #else
>  #define SWP_DEVICE_NUM 0
>  #endif
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 4dfd807ae52a..4129bd2ff9d6 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -120,6 +120,27 @@ static inline bool is_writable_device_private_entry(swp_entry_t entry)
>  {
>  	return unlikely(swp_type(entry) == SWP_DEVICE_WRITE);
>  }
> +
> +static inline swp_entry_t make_readable_device_exclusive_entry(pgoff_t offset)
> +{
> +	return swp_entry(SWP_DEVICE_EXCLUSIVE_READ, offset);
> +}
> +
> +static inline swp_entry_t make_writable_device_exclusive_entry(pgoff_t offset)
> +{
> +	return swp_entry(SWP_DEVICE_EXCLUSIVE_WRITE, offset);
> +}
> +
> +static inline bool is_device_exclusive_entry(swp_entry_t entry)
> +{
> +	return swp_type(entry) == SWP_DEVICE_EXCLUSIVE_READ ||
> +		swp_type(entry) == SWP_DEVICE_EXCLUSIVE_WRITE;
> +}
> +
> +static inline bool is_writable_device_exclusive_entry(swp_entry_t entry)
> +{
> +	return unlikely(swp_type(entry) == SWP_DEVICE_EXCLUSIVE_WRITE);
> +}
>  #else /* CONFIG_DEVICE_PRIVATE */
>  static inline swp_entry_t make_readable_device_private_entry(pgoff_t offset)
>  {
> @@ -140,6 +161,26 @@ static inline bool is_writable_device_private_entry(swp_entry_t entry)
>  {
>  	return false;
>  }
> +
> +static inline swp_entry_t make_readable_device_exclusive_entry(pgoff_t offset)
> +{
> +	return swp_entry(0, 0);
> +}
> +
> +static inline swp_entry_t make_writable_device_exclusive_entry(pgoff_t offset)
> +{
> +	return swp_entry(0, 0);
> +}
> +
> +static inline bool is_device_exclusive_entry(swp_entry_t entry)
> +{
> +	return false;
> +}
> +
> +static inline bool is_writable_device_exclusive_entry(swp_entry_t entry)
> +{
> +	return false;
> +}
>  #endif /* CONFIG_DEVICE_PRIVATE */
>  
>  #ifdef CONFIG_MIGRATION
> @@ -219,7 +260,8 @@ static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
>   */
>  static inline bool is_pfn_swap_entry(swp_entry_t entry)
>  {
> -	return is_migration_entry(entry) || is_device_private_entry(entry);
> +	return is_migration_entry(entry) || is_device_private_entry(entry) ||
> +	       is_device_exclusive_entry(entry);
>  }
>  
>  struct page_vma_mapped_walk;
> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
> index 80a78877bd93..5c9f5a020c1d 100644
> --- a/lib/test_hmm.c
> +++ b/lib/test_hmm.c
> @@ -218,7 +218,7 @@ static bool dmirror_interval_invalidate(struct mmu_interval_notifier *mni,
>  	 * the invalidation is handled as part of the migration process.
>  	 */
>  	if (range->event == MMU_NOTIFY_MIGRATE &&
> -	    range->migrate_pgmap_owner == dmirror->mdevice)
> +	    range->owner == dmirror->mdevice)
>  		return true;
>  
>  	if (mmu_notifier_range_blockable(range))
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 11df3ca30b82..fad6be2bf072 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -26,6 +26,8 @@
>  #include <linux/mmu_notifier.h>
>  #include <linux/memory_hotplug.h>
>  
> +#include "internal.h"
> +
>  struct hmm_vma_walk {
>  	struct hmm_range	*range;
>  	unsigned long		last;
> @@ -271,6 +273,9 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
>  		if (!non_swap_entry(entry))
>  			goto fault;
>  
> +		if (is_device_exclusive_entry(entry))
> +			goto fault;
> +
>  		if (is_migration_entry(entry)) {
>  			pte_unmap(ptep);
>  			hmm_vma_walk->last = addr;
> diff --git a/mm/memory.c b/mm/memory.c
> index 3a5705cfc891..556ff396f2e9 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -700,6 +700,84 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
>  }
>  #endif
>  
> +static void restore_exclusive_pte(struct vm_area_struct *vma,
> +				  struct page *page, unsigned long address,
> +				  pte_t *ptep)
> +{
> +	pte_t pte;
> +	swp_entry_t entry;
> +
> +	pte = pte_mkold(mk_pte(page, READ_ONCE(vma->vm_page_prot)));
> +	if (pte_swp_soft_dirty(*ptep))
> +		pte = pte_mksoft_dirty(pte);
> +
> +	entry = pte_to_swp_entry(*ptep);
> +	if (pte_swp_uffd_wp(*ptep))
> +		pte = pte_mkuffd_wp(pte);
> +	else if (is_writable_device_exclusive_entry(entry))
> +		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> +
> +	set_pte_at(vma->vm_mm, address, ptep, pte);
> +
> +	/*
> +	 * No need to take a page reference as one was already
> +	 * created when the swap entry was made.
> +	 */
> +	if (PageAnon(page))
> +		page_add_anon_rmap(page, vma, address, false);
> +	else
> +		page_add_file_rmap(page, false);
> +
> +	if (vma->vm_flags & VM_LOCKED)
> +		mlock_vma_page(page);
> +
> +	/*
> +	 * No need to invalidate - it was non-present before. However
> +	 * secondary CPUs may have mappings that need invalidating.
> +	 */
> +	update_mmu_cache(vma, address, ptep);
> +}
> +
> +/*
> + * Tries to restore an exclusive pte if the page lock can be acquired without
> + * sleeping. Returns 0 on success or -EBUSY if the page could not be locked or
> + * the entry no longer points at locked_page in which case locked_page should be
> + * locked before retrying the call.
> + */
> +static unsigned long
> +try_restore_exclusive_pte(struct mm_struct *src_mm, pte_t *src_pte,
> +			  struct vm_area_struct *vma, unsigned long addr,
> +			  struct page **locked_page)
> +{
> +	swp_entry_t entry = pte_to_swp_entry(*src_pte);
> +	struct page *page = pfn_swap_entry_to_page(entry);
> +
> +	if (*locked_page) {
> +		/* The entry changed, retry */
> +		if (unlikely(*locked_page != page)) {
> +			unlock_page(*locked_page);
> +			put_page(*locked_page);
> +			*locked_page = page;
> +			return -EBUSY;
> +		}
> +		restore_exclusive_pte(vma, page, addr, src_pte);
> +		unlock_page(page);
> +		put_page(page);
> +		*locked_page = NULL;
> +		return 0;
> +	}
> +
> +	if (trylock_page(page)) {
> +		restore_exclusive_pte(vma, page, addr, src_pte);
> +		unlock_page(page);
> +		return 0;
> +	}
> +
> +	/* The page couldn't be locked so drop the locks and retry. */
> +	*locked_page = page;
> +	return -EBUSY;
> +}
> +
>  /*
>   * copy one vm_area from one task to the other. Assumes the page tables
>   * already present in the new task to be cleared in the whole range
> @@ -781,6 +859,12 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  				pte = pte_swp_mkuffd_wp(pte);
>  			set_pte_at(src_mm, addr, src_pte, pte);
>  		}
> +	} else if (is_device_exclusive_entry(entry)) {
> +		/* COW mappings should be dealt with by removing the entry */
> +		VM_BUG_ON(is_cow_mapping(vm_flags));
> +		page = pfn_swap_entry_to_page(entry);
> +		get_page(page);
> +		rss[mm_counter(page)]++;
>  	}
>  	set_pte_at(dst_mm, addr, dst_pte, pte);
>  	return 0;
> @@ -947,6 +1031,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>  	int rss[NR_MM_COUNTERS];
>  	swp_entry_t entry = (swp_entry_t){0};
>  	struct page *prealloc = NULL;
> +	struct page *locked_page = NULL;
>  
>  again:
>  	progress = 0;
> @@ -980,13 +1065,36 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>  			continue;
>  		}
>  		if (unlikely(!pte_present(*src_pte))) {
> -			entry.val = copy_nonpresent_pte(dst_mm, src_mm,
> -							dst_pte, src_pte,
> -							src_vma, addr, rss);
> -			if (entry.val)
> -				break;
> -			progress += 8;
> -			continue;
> +			swp_entry_t swp_entry = pte_to_swp_entry(*src_pte);
> +
> +			if (unlikely(is_cow_mapping(src_vma->vm_flags) &&
> +			    is_device_exclusive_entry(swp_entry))) {
> +				/*
> +				 * Normally this would require sending mmu
> +				 * notifiers, but copy_page_range() has already
> +				 * done that for COW mappings.
> +				 */
> +				ret = try_restore_exclusive_pte(src_mm, src_pte,
> +								src_vma, addr,
> +								&locked_page);
> +				if (ret == -EBUSY)
> +					break;
> +			} else {
> +				entry.val = copy_nonpresent_pte(dst_mm, src_mm,
> +								dst_pte, src_pte,
> +								src_vma, addr,
> +								rss);
> +				if (entry.val)
> +					break;
> +				progress += 8;
> +				continue;
> +			}
> +		}
> +		/* a non-present pte became present after dropping the ptl */
> +		if (unlikely(locked_page)) {
> +			unlock_page(locked_page);
> +			put_page(locked_page);
> +			locked_page = NULL;
>  		}
>  		/* copy_present_pte() will clear `*prealloc' if consumed */
>  		ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte,
> @@ -1023,6 +1131,11 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>  			goto out;
>  		}
>  		entry.val = 0;
> +	} else if (ret == -EBUSY) {
> +		if (get_page_unless_zero(locked_page))
> +			lock_page(locked_page);
> +		else
> +			locked_page = NULL;
>  	} else if (ret) {
>  		WARN_ON_ONCE(ret != -EAGAIN);
>  		prealloc = page_copy_prealloc(src_mm, src_vma, addr);
> @@ -1287,7 +1400,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  		}
>  
>  		entry = pte_to_swp_entry(ptent);
> -		if (is_device_private_entry(entry)) {
> +		if (is_device_private_entry(entry) ||
> +		    is_device_exclusive_entry(entry)) {
>  			struct page *page = pfn_swap_entry_to_page(entry);
>  
>  			if (unlikely(details && details->check_mapping)) {
> @@ -1303,7 +1417,10 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  
>  			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>  			rss[mm_counter(page)]--;
> -			page_remove_rmap(page, false);
> +
> +			if (is_device_private_entry(entry))
> +				page_remove_rmap(page, false);
> +
>  			put_page(page);
>  			continue;
>  		}
> @@ -3256,6 +3373,44 @@ void unmap_mapping_range(struct address_space *mapping,
>  }
>  EXPORT_SYMBOL(unmap_mapping_range);
>  
> +/*
> + * Restore a potential device exclusive pte to a working pte entry
> + */
> +static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> +{
> +	struct page *page = vmf->page;
> +	struct vm_area_struct *vma = vmf->vma;
> +	struct page_vma_mapped_walk pvmw = {
> +		.page = page,
> +		.vma = vma,
> +		.address = vmf->address,
> +		.flags = PVMW_SYNC,
> +	};
> +	vm_fault_t ret = 0;
> +	struct mmu_notifier_range range;
> +
> +	if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags))
> +		return VM_FAULT_RETRY;
> +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
> +				vmf->address & PAGE_MASK,
> +				(vmf->address & PAGE_MASK) + PAGE_SIZE);
> +	mmu_notifier_invalidate_range_start(&range);

I looked at MMU_NOTIFIER_CLEAR document and it tells me:

 * @MMU_NOTIFY_CLEAR: clear page table entry (many reasons for this like
 * madvise() or replacing a page by another one, ...).

Does MMU_NOTIFIER_CLEAR suite for this case?  Normally I think for such a case
(existing pte is invalid) we don't need to notify at all.  However from what I
read from the whole series, this seems to be a critical point where we'd like
to kick the owner/driver to synchronously stop doing atomic operations from the
device.  Not sure whether we'd like a new notifier type, or maybe at least
comment on why to use CLEAR?

> +
> +	while (page_vma_mapped_walk(&pvmw)) {

IIUC a while loop of page_vma_mapped_walk() handles thps, however here it's
already in do_swap_page() so it's small pte only?  Meanwhile...

> +		if (unlikely(!pte_same(*pvmw.pte, vmf->orig_pte))) {
> +			page_vma_mapped_walk_done(&pvmw);
> +			break;
> +		}
> +
> +		restore_exclusive_pte(vma, page, pvmw.address, pvmw.pte);

... I'm not sure whether passing in page would work for thp after all, as iiuc
we may need to pass in the subpage rather than the head page if so.

> +	}
> +
> +	unlock_page(page);
> +
> +	mmu_notifier_invalidate_range_end(&range);
> +	return ret;
> +}
> +
>  /*
>   * We enter with non-exclusive mmap_lock (to exclude vma changes,
>   * but allow concurrent faults), and pte mapped but not yet locked.
> @@ -3283,6 +3438,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  		if (is_migration_entry(entry)) {
>  			migration_entry_wait(vma->vm_mm, vmf->pmd,
>  					     vmf->address);
> +		} else if (is_device_exclusive_entry(entry)) {
> +			vmf->page = pfn_swap_entry_to_page(entry);
> +			ret = remove_device_exclusive_entry(vmf);
>  		} else if (is_device_private_entry(entry)) {
>  			vmf->page = pfn_swap_entry_to_page(entry);
>  			ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
> diff --git a/mm/migrate.c b/mm/migrate.c
> index cc4612e2a246..9cc9251d4802 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2570,8 +2570,8 @@ static void migrate_vma_collect(struct migrate_vma *migrate)
>  	 * that the registered device driver can skip invalidating device
>  	 * private page mappings that won't be migrated.
>  	 */
> -	mmu_notifier_range_init_migrate(&range, 0, migrate->vma,
> -		migrate->vma->vm_mm, migrate->start, migrate->end,
> +	mmu_notifier_range_init_owner(&range, MMU_NOTIFY_MIGRATE, 0,
> +		migrate->vma, migrate->vma->vm_mm, migrate->start, migrate->end,
>  		migrate->pgmap_owner);
>  	mmu_notifier_invalidate_range_start(&range);
>  
> @@ -3074,9 +3074,9 @@ void migrate_vma_pages(struct migrate_vma *migrate)
>  			if (!notified) {
>  				notified = true;
>  
> -				mmu_notifier_range_init_migrate(&range, 0,
> -					migrate->vma, migrate->vma->vm_mm,
> -					addr, migrate->end,
> +				mmu_notifier_range_init_owner(&range,
> +					MMU_NOTIFY_MIGRATE, 0, migrate->vma,
> +					migrate->vma->vm_mm, addr, migrate->end,
>  					migrate->pgmap_owner);

(As I read more, I feel more that maybe it's better to move this renaming
 change along with mmu_notifier_range_init_owner() rework into a separate
 patch, as even the changes are straightforward there're still quite a few
 places that need touch up; no strong opinion though)

>  				mmu_notifier_invalidate_range_start(&range);
>  			}
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index f21b760ec809..c6018541ea3d 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -165,6 +165,14 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  				newpte = swp_entry_to_pte(entry);
>  				if (pte_swp_uffd_wp(oldpte))
>  					newpte = pte_swp_mkuffd_wp(newpte);
> +			} else if (is_writable_device_exclusive_entry(entry)) {
> +				entry = make_readable_device_exclusive_entry(
> +							swp_offset(entry));
> +				newpte = swp_entry_to_pte(entry);
> +				if (pte_swp_soft_dirty(oldpte))
> +					newpte = pte_swp_mksoft_dirty(newpte);
> +				if (pte_swp_uffd_wp(oldpte))
> +					newpte = pte_swp_mkuffd_wp(newpte);
>  			} else {
>  				newpte = oldpte;
>  			}
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index eed988ab2e81..29842f169219 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -41,7 +41,8 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw)
>  
>  				/* Handle un-addressable ZONE_DEVICE memory */
>  				entry = pte_to_swp_entry(*pvmw->pte);
> -				if (!is_device_private_entry(entry))
> +				if (!is_device_private_entry(entry) &&
> +				    !is_device_exclusive_entry(entry))
>  					return false;
>  			} else if (!pte_present(*pvmw->pte))
>  				return false;
> @@ -93,7 +94,8 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
>  			return false;
>  		entry = pte_to_swp_entry(*pvmw->pte);
>  
> -		if (!is_migration_entry(entry))
> +		if (!is_migration_entry(entry) &&
> +		    !is_device_exclusive_entry(entry))
>  			return false;
>  
>  		pfn = swp_offset(entry);
> @@ -102,7 +104,8 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
>  
>  		/* Handle un-addressable ZONE_DEVICE memory */
>  		entry = pte_to_swp_entry(*pvmw->pte);
> -		if (!is_device_private_entry(entry))
> +		if (!is_device_private_entry(entry) &&
> +		    !is_device_exclusive_entry(entry))
>  			return false;
>  
>  		pfn = swp_offset(entry);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 7f91f058f1f5..32b99a7bb358 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2005,6 +2005,216 @@ void page_mlock(struct page *page)
>  	rmap_walk(page, &rwc);
>  }
>  
> +struct ttp_args {
> +	struct mm_struct *mm;
> +	unsigned long address;
> +	void *arg;
> +	bool valid;
> +};
> +
> +static bool try_to_protect_one(struct page *page, struct vm_area_struct *vma,
> +			unsigned long address, void *arg)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	struct page_vma_mapped_walk pvmw = {
> +		.page = page,
> +		.vma = vma,
> +		.address = address,
> +	};
> +	struct ttp_args *ttp = arg;
> +	pte_t pteval;
> +	struct page *subpage;
> +	bool ret = true;
> +	struct mmu_notifier_range range;
> +	swp_entry_t entry;
> +	pte_t swp_pte;
> +
> +	mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, vma,
> +				      vma->vm_mm, address,
> +				      min(vma->vm_end,
> +					  address + page_size(page)),

(this indent looks odd; seems better to join with previous line?  Slightly over
 80 but seems kernel code is not extremely strict on that)

> +				      ttp->arg);
> +	if (PageHuge(page)) {
> +		/*
> +		 * If sharing is possible, start and end will be adjusted
> +		 * accordingly.
> +		 */
> +		adjust_range_if_pmd_sharing_possible(vma, &range.start,
> +						     &range.end);
> +	}

Is this for hugetlb specific?  Can we drop this chunk if we know it's
PageAnon(), or is this a preparation for the future?

IMHO if possible we shouldn't introduce code that may never run, so to me that
sounds still better to be postponed until the hugetlbfs support (if there will
be).  So raise this question up.

> +	mmu_notifier_invalidate_range_start(&range);
> +
> +	while (page_vma_mapped_walk(&pvmw)) {

Same here, not sure whether "if" would be easier.

> +		/* Unexpected PMD-mapped THP? */
> +		VM_BUG_ON_PAGE(!pvmw.pte, page);
> +
> +		if (!pte_present(*pvmw.pte)) {
> +			ret = false;
> +			page_vma_mapped_walk_done(&pvmw);
> +			break;
> +		}
> +
> +		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
> +		address = pvmw.address;

Same question here: could the subpage be not the same as page at all?

> +
> +		/* Nuke the page table entry. */
> +		flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
> +		pteval = ptep_clear_flush(vma, address, pvmw.pte);
> +
> +		/* Move the dirty bit to the page. Now the pte is gone. */
> +		if (pte_dirty(pteval))
> +			set_page_dirty(page);
> +
> +		/* Update high watermark before we lower rss */
> +		update_hiwater_rss(mm);

We don't update RSS, right?  If so, can this be dropped?

> +
> +		if (arch_unmap_one(mm, vma, address, pteval) < 0) {
> +			set_pte_at(mm, address, pvmw.pte, pteval);
> +			ret = false;
> +			page_vma_mapped_walk_done(&pvmw);
> +			break;
> +		}
> +
> +		/*
> +		 * Check that our target page is still mapped at the expected
> +		 * address.
> +		 */
> +		if (ttp->mm == mm && ttp->address == address &&
> +		    pte_write(pteval))
> +			ttp->valid = true;

I think I get the point of doing this (as after GUP the pte could have been
changed to point to another page), however it smells a bit odd to me (or it's
also possible that I'm not familiar enough with the code base..).  IIUC this is
the _only_ reason that we passed in "address" into try_to_protect() too, and
further into the ttp_args.

The odd part is the remote GUP should have walked the page table already, so
since the target here is the vaddr to replace, the 1st page table walk should
be able to both trylock/lock the page, then modify the pte with pgtable lock
held, return the locked page, then walk the rmap again to remove all the rest
of the ptes that are mapping to this page.  In that case before we call the
rmap_walk() we know this must be the page we want to take care of, and no one
will be able to restore the original mm pte either (as we're with the page
lock).  Then we don't need this check, neither do we need ttp->address.

However frankly I didn't think deeper on how to best implement that and how
many code changes are needed.  So just raise it up as question too here.

> +
> +		/*
> +		 * Store the pfn of the page in a special migration
> +		 * pte. do_swap_page() will wait until the migration
> +		 * pte is removed and then restart fault handling.
> +		 */
> +		if (pte_write(pteval))
> +			entry = make_writable_device_exclusive_entry(
> +							page_to_pfn(subpage));
> +		else
> +			entry = make_readable_device_exclusive_entry(
> +							page_to_pfn(subpage));

(For my own preference I actually prefer make_device_exclusive_entry(writable)
 and helpers defined like that, then these lines can be written as oneliner by
 passing in pte_write(); however that's quite subjective opinion and I saw
 there're discussions around that on patch 2, so I'll avoid commenting more)

> +		swp_pte = swp_entry_to_pte(entry);
> +		if (pte_soft_dirty(pteval))
> +			swp_pte = pte_swp_mksoft_dirty(swp_pte);
> +		if (pte_uffd_wp(pteval))
> +			swp_pte = pte_swp_mkuffd_wp(swp_pte);
> +
> +		/* Take a reference for the swap entry */
> +		get_page(page);
> +		set_pte_at(mm, address, pvmw.pte, swp_pte);
> +
> +		page_remove_rmap(subpage, PageHuge(page));
> +		put_page(page);
> +	}
> +
> +	mmu_notifier_invalidate_range_end(&range);
> +
> +	return ret;
> +}
> +
> +/**
> + * try_to_protect - try to replace all page table mappings with swap entries

Is this too general?  Either on the word "protect" or the comment after it, as
there're a lot of types of swap entries (and a lot of types of protect too..),
while it seems this is only for the device exclusive swap entries.

How about rename it to try_to_mark_device_exclusive?  Or even dropping the
"try_to_" (e.g. device_exclusive_mark_page())?

> + * @page: the page to replace page table entries for
> + * @flags: action and flags

Obsolete line?

> + * @mm: the mm_struct where the page is expected to be mapped
> + * @address: address where the page is expected to be mapped
> + * @arg: passed to MMU_NOTIFY_EXCLUSIVE range notifier callbacks
> + *
> + * Tries to remove all the page table entries which are mapping this page and
> + * replace them with special swap entries to grant a device exclusive access to
> + * the page. Caller must hold the page lock.
> + *
> + * Returns false if the page is still mapped, or if it could not be unmapped
> + * from the expected address. Otherwise returns true (success).
> + */
> +static bool try_to_protect(struct page *page, struct mm_struct *mm,
> +			   unsigned long address, void *arg)
> +{
> +	struct ttp_args ttp = {
> +		.mm = mm,
> +		.address = address,
> +		.arg = arg,
> +		.valid = false,
> +	};
> +	struct rmap_walk_control rwc = {
> +		.rmap_one = try_to_protect_one,
> +		.done = page_not_mapped,
> +		.anon_lock = page_lock_anon_vma_read,
> +		.arg = &ttp,
> +	};
> +
> +	/*
> +	 * Restrict to anonymous pages for now to avoid potential writeback
> +	 * issues.
> +	 */
> +	if (!PageAnon(page))
> +		return false;
> +
> +	/*
> +	 * During exec, a temporary VMA is setup and later moved.
> +	 * The VMA is moved under the anon_vma lock but not the
> +	 * page tables leading to a race where migration cannot
> +	 * find the migration ptes. Rather than increasing the
> +	 * locking requirements of exec(), migration skips
> +	 * temporary VMAs until after exec() completes.
> +	 */
> +	if (!PageKsm(page) && PageAnon(page))

I think we can drop the PageAnon() check as it's just done above.

I feel like this chunk was copied over from try_to_unmap(), however is that
necessary?  Is it possible that the caller of make_device_exclusive_range()
pass in a temp stack vma during exec()?

> +		rwc.invalid_vma = invalid_migration_vma;
> +
> +	rmap_walk(page, &rwc);
> +
> +	return ttp.valid && !page_mapcount(page);
> +}
> +
> +/**
> + * make_device_exclusive_range() - Mark a range for exclusive use by a device
> + * @mm: mm_struct of assoicated target process
> + * @start: start of the region to mark for exclusive device access
> + * @end: end address of region
> + * @pages: returns the pages which were successfully marked for exclusive access
> + * @arg: passed to MMU_NOTIFY_EXCLUSIVE range notifier too allow filtering

s/too/to/?

> + *
> + * Returns: number of pages successfully marked for exclusive access

Hmm, I see that try_to_protect() can fail even if npages returned from GUP, so
perhaps "number of pages successfully GUPed, however the page is marked for
exclusive access only if the page pointer is non-NULL", or something like that?

> + *
> + * This function finds ptes mapping page(s) to the given address range, locks
> + * them and replaces mappings with special swap entries preventing userspace CPU

s/userspace//?  As same for kernel access?

(I don't think I fully read all the codes in this patch, but I'll stop here for
 today..)

Thanks,

> + * access. On fault these entries are replaced with the original mapping after
> + * calling MMU notifiers.
> + *
> + * A driver using this to program access from a device must use a mmu notifier
> + * critical section to hold a device specific lock during programming. Once
> + * programming is complete it should drop the page lock and reference after
> + * which point CPU access to the page will revoke the exclusive access.
> + */
> +int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
> +				unsigned long end, struct page **pages,
> +				void *arg)
> +{
> +	unsigned long npages = (end - start) >> PAGE_SHIFT;
> +	unsigned long i;
> +
> +	npages = get_user_pages_remote(mm, start, npages,
> +				       FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD,
> +				       pages, NULL, NULL);
> +	for (i = 0; i < npages; i++, start += PAGE_SIZE) {
> +		if (!trylock_page(pages[i])) {
> +			put_page(pages[i]);
> +			pages[i] = NULL;
> +			continue;
> +		}
> +
> +		if (!try_to_protect(pages[i], mm, start, arg)) {
> +			unlock_page(pages[i]);
> +			put_page(pages[i]);
> +			pages[i] = NULL;
> +		}
> +	}
> +
> +	return npages;
> +}
> +EXPORT_SYMBOL_GPL(make_device_exclusive_range);
> +
>  void __put_anon_vma(struct anon_vma *anon_vma)
>  {
>  	struct anon_vma *root = anon_vma->root;
> -- 
> 2.20.1
> 
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v8 1/8] mm: Remove special swap entry functions
  2021-04-07  8:42 ` [PATCH v8 1/8] mm: Remove special swap entry functions Alistair Popple
@ 2021-05-18  2:17   ` Peter Xu
  2021-05-18 11:58     ` Alistair Popple
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Xu @ 2021-05-18  2:17 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, nouveau, bskeggs, akpm, linux-doc, linux-kernel,
	dri-devel, jhubbard, rcampbell, jglisse, jgg, hch, daniel, willy,
	bsingharora, Christoph Hellwig

On Wed, Apr 07, 2021 at 06:42:31PM +1000, Alistair Popple wrote:
> +static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
> +{
> +	struct page *p = pfn_to_page(swp_offset(entry));
> +
> +	/*
> +	 * Any use of migration entries may only occur while the
> +	 * corresponding page is locked
> +	 */
> +	BUG_ON(is_migration_entry(entry) && !PageLocked(p));
> +
> +	return p;
> +}

Would swap_pfn_entry_to_page() be slightly better?

The thing is it's very easy to read pfn_*() as a function to take a pfn as
parameter...

Since I'm also recently working on some swap-related new ptes [1], I'm thinking
whether we could name these swap entries as "swap XXX entries".  Say, "swap
hwpoison entry", "swap pfn entry" (which is a superset of "swap migration
entry", "swap device exclusive entry", ...).  That's where I came with the
above swap_pfn_entry_to_page(), then below will be naturally is_swap_pfn_entry().

No strong opinion as this is already a v8 series (and sorry to chim in this
late), just to raise this up.

[1] https://lore.kernel.org/lkml/20210427161317.50682-1-peterx@redhat.com/

Thanks,

> +
> +/*
> + * A pfn swap entry is a special type of swap entry that always has a pfn stored
> + * in the swap offset. They are used to represent unaddressable device memory
> + * and to restrict access to a page undergoing migration.
> + */
> +static inline bool is_pfn_swap_entry(swp_entry_t entry)
> +{
> +	return is_migration_entry(entry) || is_device_private_entry(entry);
> +}

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v8 1/8] mm: Remove special swap entry functions
  2021-05-18  2:17   ` Peter Xu
@ 2021-05-18 11:58     ` Alistair Popple
  2021-05-18 14:17       ` Peter Xu
  0 siblings, 1 reply; 42+ messages in thread
From: Alistair Popple @ 2021-05-18 11:58 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, nouveau, bskeggs, akpm, linux-doc, linux-kernel,
	dri-devel, jhubbard, rcampbell, jglisse, jgg, hch, daniel, willy,
	bsingharora, Christoph Hellwig

On Tuesday, 18 May 2021 12:17:32 PM AEST Peter Xu wrote:
> On Wed, Apr 07, 2021 at 06:42:31PM +1000, Alistair Popple wrote:
> > +static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
> > +{
> > +     struct page *p = pfn_to_page(swp_offset(entry));
> > +
> > +     /*
> > +      * Any use of migration entries may only occur while the
> > +      * corresponding page is locked
> > +      */
> > +     BUG_ON(is_migration_entry(entry) && !PageLocked(p));
> > +
> > +     return p;
> > +}
> 
> Would swap_pfn_entry_to_page() be slightly better?
> 
> The thing is it's very easy to read pfn_*() as a function to take a pfn as
> parameter...
> 
> Since I'm also recently working on some swap-related new ptes [1], I'm
> thinking whether we could name these swap entries as "swap XXX entries". 
> Say, "swap hwpoison entry", "swap pfn entry" (which is a superset of "swap
> migration entry", "swap device exclusive entry", ...).  That's where I came
> with the above swap_pfn_entry_to_page(), then below will be naturally
> is_swap_pfn_entry().

Equally though "hwpoison swap entry", "pfn swap entry", "migration swap 
entry", etc. also makes sense (at least to me), but does that not fit in as 
well with your series? I haven't looked too deeply at your series but have 
been meaning to so thanks for the pointer.

> No strong opinion as this is already a v8 series (and sorry to chim in this
> late), just to raise this up.

No worries, it's good timing as I was about to send a v9 which was just a 
rebase anyway. I am hoping to try and get this accepted for the next merge 
window but I will wait before sending v9 to see if anyone else has thoughts on 
the naming here.

I don't have a particularly strong opinion either, and your justification is 
more thought than I gave to naming these originally so am happy to rename if 
it's more readable or fits better with your series.

Thanks.

 - Alistair

> [1] https://lore.kernel.org/lkml/20210427161317.50682-1-peterx@redhat.com/
> 
> Thanks,
> 
> > +
> > +/*
> > + * A pfn swap entry is a special type of swap entry that always has a pfn
> > stored + * in the swap offset. They are used to represent unaddressable
> > device memory + * and to restrict access to a page undergoing migration.
> > + */
> > +static inline bool is_pfn_swap_entry(swp_entry_t entry)
> > +{
> > +     return is_migration_entry(entry) || is_device_private_entry(entry);
> > +}
> 
> --
> Peter Xu






^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v8 5/8] mm: Device exclusive memory access
  2021-05-18  2:08   ` Peter Xu
@ 2021-05-18 13:19     ` Alistair Popple
  2021-05-18 17:27       ` Peter Xu
  2021-05-21  6:53       ` Alistair Popple
  0 siblings, 2 replies; 42+ messages in thread
From: Alistair Popple @ 2021-05-18 13:19 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, nouveau, bskeggs, akpm, linux-doc, linux-kernel,
	dri-devel, jhubbard, rcampbell, jglisse, jgg, hch, daniel, willy,
	bsingharora, Christoph Hellwig

On Tuesday, 18 May 2021 12:08:34 PM AEST Peter Xu wrote:
> > v6:
> > * Fixed a bisectablity issue due to incorrectly applying the rename of
> > 
> >   migrate_pgmap_owner to the wrong patches for Nouveau and hmm_test.
> > 
> > v5:
> > * Renamed range->migrate_pgmap_owner to range->owner.
> 
> May be nicer to mention this rename in commit message (or make it a separate
> patch)?

Ok, I think if it needs to be explicitly called out in the commit message it 
should be a separate patch. Originally I thought the change was _just_ small 
enough to include here, but this patch has since grown so I'll split it out.
 
<snip>

> > diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> > index 0e25d829f742..3a1ce4ef9276 100644
> > --- a/include/linux/rmap.h
> > +++ b/include/linux/rmap.h
> > @@ -193,6 +193,10 @@ int page_referenced(struct page *, int is_locked,
> > 
> >  bool try_to_migrate(struct page *page, enum ttu_flags flags);
> >  bool try_to_unmap(struct page *, enum ttu_flags flags);
> > 
> > +int make_device_exclusive_range(struct mm_struct *mm, unsigned long
> > start,
> > +                             unsigned long end, struct page **pages,
> > +                             void *arg);
> > +
> > 
> >  /* Avoid racy checks */
> >  #define PVMW_SYNC            (1 << 0)
> >  /* Look for migarion entries rather than present PTEs */
> > 
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 516104b9334b..7a3c260146df 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -63,9 +63,11 @@ static inline int current_is_kswapd(void)
> > 
> >   * to a special SWP_DEVICE_* entry.
> >   */
> 
> Should we add another short description for the newly added two types?
> Otherwise the reader could get confused assuming the above comment is
> explaining all four types, while it is for SWP_DEVICE_{READ|WRITE} only?

Good idea, I can see how that could be confusing. Will add a short 
description.

<snip>

> > diff --git a/mm/memory.c b/mm/memory.c
> > index 3a5705cfc891..556ff396f2e9 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -700,6 +700,84 @@ struct page *vm_normal_page_pmd(struct vm_area_struct
> > *vma, unsigned long addr,> 
> >  }
> >  #endif
> > 
> > +static void restore_exclusive_pte(struct vm_area_struct *vma,
> > +                               struct page *page, unsigned long address,
> > +                               pte_t *ptep)
> > +{
> > +     pte_t pte;
> > +     swp_entry_t entry;
> > +
> > +     pte = pte_mkold(mk_pte(page, READ_ONCE(vma->vm_page_prot)));
> > +     if (pte_swp_soft_dirty(*ptep))
> > +             pte = pte_mksoft_dirty(pte);
> > +
> > +     entry = pte_to_swp_entry(*ptep);
> > +     if (pte_swp_uffd_wp(*ptep))
> > +             pte = pte_mkuffd_wp(pte);
> > +     else if (is_writable_device_exclusive_entry(entry))
> > +             pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> > +
> > +     set_pte_at(vma->vm_mm, address, ptep, pte);
> > +
> > +     /*
> > +      * No need to take a page reference as one was already
> > +      * created when the swap entry was made.
> > +      */
> > +     if (PageAnon(page))
> > +             page_add_anon_rmap(page, vma, address, false);
> > +     else
> > +             page_add_file_rmap(page, false);
> > +
> > +     if (vma->vm_flags & VM_LOCKED)
> > +             mlock_vma_page(page);
> > +
> > +     /*
> > +      * No need to invalidate - it was non-present before. However
> > +      * secondary CPUs may have mappings that need invalidating.
> > +      */
> > +     update_mmu_cache(vma, address, ptep);
> > +}
> > +
> > +/*
> > + * Tries to restore an exclusive pte if the page lock can be acquired
> > without + * sleeping. Returns 0 on success or -EBUSY if the page could
> > not be locked or + * the entry no longer points at locked_page in which
> > case locked_page should be + * locked before retrying the call.
> > + */
> > +static unsigned long
> > +try_restore_exclusive_pte(struct mm_struct *src_mm, pte_t *src_pte,
> > +                       struct vm_area_struct *vma, unsigned long addr,
> > +                       struct page **locked_page)
> > +{
> > +     swp_entry_t entry = pte_to_swp_entry(*src_pte);
> > +     struct page *page = pfn_swap_entry_to_page(entry);
> > +
> > +     if (*locked_page) {
> > +             /* The entry changed, retry */
> > +             if (unlikely(*locked_page != page)) {
> > +                     unlock_page(*locked_page);
> > +                     put_page(*locked_page);
> > +                     *locked_page = page;
> > +                     return -EBUSY;
> > +             }
> > +             restore_exclusive_pte(vma, page, addr, src_pte);
> > +             unlock_page(page);
> > +             put_page(page);
> > +             *locked_page = NULL;
> > +             return 0;
> > +     }
> > +
> > +     if (trylock_page(page)) {
> > +             restore_exclusive_pte(vma, page, addr, src_pte);
> > +             unlock_page(page);
> > +             return 0;
> > +     }
> > +
> > +     /* The page couldn't be locked so drop the locks and retry. */
> > +     *locked_page = page;
> > +     return -EBUSY;
> > +}
> > +
> > 
> >  /*
> >  
> >   * copy one vm_area from one task to the other. Assumes the page tables
> >   * already present in the new task to be cleared in the whole range
> > 
> > @@ -781,6 +859,12 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct
> > mm_struct *src_mm,> 
> >                               pte = pte_swp_mkuffd_wp(pte);
> >                       
> >                       set_pte_at(src_mm, addr, src_pte, pte);
> >               
> >               }
> > 
> > +     } else if (is_device_exclusive_entry(entry)) {
> > +             /* COW mappings should be dealt with by removing the entry
> > */
> > +             VM_BUG_ON(is_cow_mapping(vm_flags));
> > +             page = pfn_swap_entry_to_page(entry);
> > +             get_page(page);
> > +             rss[mm_counter(page)]++;
> > 
> >       }
> >       set_pte_at(dst_mm, addr, dst_pte, pte);
> >       return 0;
> > 
> > @@ -947,6 +1031,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct
> > vm_area_struct *src_vma,> 
> >       int rss[NR_MM_COUNTERS];
> >       swp_entry_t entry = (swp_entry_t){0};
> >       struct page *prealloc = NULL;
> > 
> > +     struct page *locked_page = NULL;
> > 
> >  again:
> >       progress = 0;
> > 
> > @@ -980,13 +1065,36 @@ copy_pte_range(struct vm_area_struct *dst_vma,
> > struct vm_area_struct *src_vma,> 
> >                       continue;
> >               
> >               }
> >               if (unlikely(!pte_present(*src_pte))) {
> > 
> > -                     entry.val = copy_nonpresent_pte(dst_mm, src_mm,
> > -                                                     dst_pte, src_pte,
> > -                                                     src_vma, addr, rss);
> > -                     if (entry.val)
> > -                             break;
> > -                     progress += 8;
> > -                     continue;
> > +                     swp_entry_t swp_entry = pte_to_swp_entry(*src_pte);
> > +
> > +                     if (unlikely(is_cow_mapping(src_vma->vm_flags) &&
> > +                         is_device_exclusive_entry(swp_entry))) {
> > +                             /*
> > +                              * Normally this would require sending mmu
> > +                              * notifiers, but copy_page_range() has
> > already +                              * done that for COW mappings.
> > +                              */
> > +                             ret = try_restore_exclusive_pte(src_mm,
> > src_pte, +                                                            
> > src_vma, addr, +                                                         
> >    &locked_page); +                             if (ret == -EBUSY)
> > +                                     break;
> > +                     } else {
> > +                             entry.val = copy_nonpresent_pte(dst_mm,
> > src_mm, +                                                            
> > dst_pte, src_pte, +                                                      
> >       src_vma, addr, +                                                   
> >          rss); +                             if (entry.val)
> > +                                     break;
> > +                             progress += 8;
> > +                             continue;
> > +                     }
> > +             }
> > +             /* a non-present pte became present after dropping the ptl
> > */
> > +             if (unlikely(locked_page)) {
> > +                     unlock_page(locked_page);
> > +                     put_page(locked_page);
> > +                     locked_page = NULL;
> > 
> >               }
> >               /* copy_present_pte() will clear `*prealloc' if consumed */
> >               ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte,
> > 
> > @@ -1023,6 +1131,11 @@ copy_pte_range(struct vm_area_struct *dst_vma,
> > struct vm_area_struct *src_vma,> 
> >                       goto out;
> >               
> >               }
> >               entry.val = 0;
> > 
> > +     } else if (ret == -EBUSY) {
> > +             if (get_page_unless_zero(locked_page))
> > +                     lock_page(locked_page);
> > +             else
> > +                     locked_page = NULL;
> > 
> >       } else if (ret) {
> >       
> >               WARN_ON_ONCE(ret != -EAGAIN);
> >               prealloc = page_copy_prealloc(src_mm, src_vma, addr);
> > 
> > @@ -1287,7 +1400,8 @@ static unsigned long zap_pte_range(struct mmu_gather
> > *tlb,> 
> >               }
> >               
> >               entry = pte_to_swp_entry(ptent);
> > 
> > -             if (is_device_private_entry(entry)) {
> > +             if (is_device_private_entry(entry) ||
> > +                 is_device_exclusive_entry(entry)) {
> > 
> >                       struct page *page = pfn_swap_entry_to_page(entry);
> >                       
> >                       if (unlikely(details && details->check_mapping)) {
> > 
> > @@ -1303,7 +1417,10 @@ static unsigned long zap_pte_range(struct
> > mmu_gather *tlb,> 
> >                       pte_clear_not_present_full(mm, addr, pte,
> >                       tlb->fullmm);
> >                       rss[mm_counter(page)]--;
> > 
> > -                     page_remove_rmap(page, false);
> > +
> > +                     if (is_device_private_entry(entry))
> > +                             page_remove_rmap(page, false);
> > +
> > 
> >                       put_page(page);
> >                       continue;
> >               
> >               }
> > 
> > @@ -3256,6 +3373,44 @@ void unmap_mapping_range(struct address_space
> > *mapping,> 
> >  }
> >  EXPORT_SYMBOL(unmap_mapping_range);
> > 
> > +/*
> > + * Restore a potential device exclusive pte to a working pte entry
> > + */
> > +static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> > +{
> > +     struct page *page = vmf->page;
> > +     struct vm_area_struct *vma = vmf->vma;
> > +     struct page_vma_mapped_walk pvmw = {
> > +             .page = page,
> > +             .vma = vma,
> > +             .address = vmf->address,
> > +             .flags = PVMW_SYNC,
> > +     };
> > +     vm_fault_t ret = 0;
> > +     struct mmu_notifier_range range;
> > +
> > +     if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags))
> > +             return VM_FAULT_RETRY;
> > +     mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma,
> > vma->vm_mm,
> > +                             vmf->address & PAGE_MASK,
> > +                             (vmf->address & PAGE_MASK) + PAGE_SIZE);
> > +     mmu_notifier_invalidate_range_start(&range);
> 
> I looked at MMU_NOTIFIER_CLEAR document and it tells me:
> 
>  * @MMU_NOTIFY_CLEAR: clear page table entry (many reasons for this like
>  * madvise() or replacing a page by another one, ...).
> 
> Does MMU_NOTIFIER_CLEAR suite for this case?  Normally I think for such a
> case (existing pte is invalid) we don't need to notify at all.  However
> from what I read from the whole series, this seems to be a critical point
> where we'd like to kick the owner/driver to synchronously stop doing atomic
> operations from the device.  Not sure whether we'd like a new notifier
> type, or maybe at least comment on why to use CLEAR?

Right, notifying the owner/driver when it no longer has exclusive access to 
the page and allowing it to stop atomic operations is the critical point and 
why it notifies when we ordinarily wouldn't (ie. invalid -> valid).

I did consider adding a new type, but in the driver implementation it ends up 
being treated the same as a CLEAR notification anyway so didn't think it was 
necessary. But I suppose adding a different type would allow other listening 
notifiers to filter these which might be worthwhile.

> > +
> > +     while (page_vma_mapped_walk(&pvmw)) {
> 
> IIUC a while loop of page_vma_mapped_walk() handles thps, however here it's
> already in do_swap_page() so it's small pte only?  Meanwhile...
> 
> > +             if (unlikely(!pte_same(*pvmw.pte, vmf->orig_pte))) {
> > +                     page_vma_mapped_walk_done(&pvmw);
> > +                     break;
> > +             }
> > +
> > +             restore_exclusive_pte(vma, page, pvmw.address, pvmw.pte);
> 
> ... I'm not sure whether passing in page would work for thp after all, as
> iiuc we may need to pass in the subpage rather than the head page if so.

Hmm, I need to check this and follow up.

> > +     }
> > +
> > +     unlock_page(page);
> > +
> > +     mmu_notifier_invalidate_range_end(&range);
> > +     return ret;
> > +}
> > +
> > 
> >  /*
> >  
> >   * We enter with non-exclusive mmap_lock (to exclude vma changes,
> >   * but allow concurrent faults), and pte mapped but not yet locked.
> > 
> > @@ -3283,6 +3438,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > 
> >               if (is_migration_entry(entry)) {
> >               
> >                       migration_entry_wait(vma->vm_mm, vmf->pmd,
> >                       
> >                                            vmf->address);
> > 
> > +             } else if (is_device_exclusive_entry(entry)) {
> > +                     vmf->page = pfn_swap_entry_to_page(entry);
> > +                     ret = remove_device_exclusive_entry(vmf);
> > 
> >               } else if (is_device_private_entry(entry)) {
> >               
> >                       vmf->page = pfn_swap_entry_to_page(entry);
> >                       ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
> > 
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index cc4612e2a246..9cc9251d4802 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -2570,8 +2570,8 @@ static void migrate_vma_collect(struct migrate_vma
> > *migrate)> 
> >        * that the registered device driver can skip invalidating device
> >        * private page mappings that won't be migrated.
> >        */
> > 
> > -     mmu_notifier_range_init_migrate(&range, 0, migrate->vma,
> > -             migrate->vma->vm_mm, migrate->start, migrate->end,
> > +     mmu_notifier_range_init_owner(&range, MMU_NOTIFY_MIGRATE, 0,
> > +             migrate->vma, migrate->vma->vm_mm, migrate->start,
> > migrate->end,> 
> >               migrate->pgmap_owner);
> >       
> >       mmu_notifier_invalidate_range_start(&range);
> > 
> > @@ -3074,9 +3074,9 @@ void migrate_vma_pages(struct migrate_vma *migrate)
> > 
> >                       if (!notified) {
> >                       
> >                               notified = true;
> > 
> > -                             mmu_notifier_range_init_migrate(&range, 0,
> > -                                     migrate->vma, migrate->vma->vm_mm,
> > -                                     addr, migrate->end,
> > +                             mmu_notifier_range_init_owner(&range,
> > +                                     MMU_NOTIFY_MIGRATE, 0, migrate->vma,
> > +                                     migrate->vma->vm_mm, addr,
> > migrate->end,> 
> >                                       migrate->pgmap_owner);
> 
> (As I read more, I feel more that maybe it's better to move this renaming
>  change along with mmu_notifier_range_init_owner() rework into a separate
>  patch, as even the changes are straightforward there're still quite a few
>  places that need touch up; no strong opinion though)

Ack, will take a look at doing this.

> >                               mmu_notifier_invalidate_range_start(&range);
> >                       
> >                       }
> > 
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index f21b760ec809..c6018541ea3d 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -165,6 +165,14 @@ static unsigned long change_pte_range(struct
> > vm_area_struct *vma, pmd_t *pmd,> 
> >                               newpte = swp_entry_to_pte(entry);
> >                               if (pte_swp_uffd_wp(oldpte))
> >                               
> >                                       newpte = pte_swp_mkuffd_wp(newpte);
> > 
> > +                     } else if
> > (is_writable_device_exclusive_entry(entry)) { +                          
> >   entry = make_readable_device_exclusive_entry( +                        
> >                             swp_offset(entry)); +                        
> >     newpte = swp_entry_to_pte(entry);
> > +                             if (pte_swp_soft_dirty(oldpte))
> > +                                     newpte =
> > pte_swp_mksoft_dirty(newpte); +                             if
> > (pte_swp_uffd_wp(oldpte))
> > +                                     newpte = pte_swp_mkuffd_wp(newpte);
> > 
> >                       } else {
> >                       
> >                               newpte = oldpte;
> >                       
> >                       }
> > 
> > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> > index eed988ab2e81..29842f169219 100644
> > --- a/mm/page_vma_mapped.c
> > +++ b/mm/page_vma_mapped.c
> > @@ -41,7 +41,8 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw)
> > 
> >                               /* Handle un-addressable ZONE_DEVICE memory
> >                               */
> >                               entry = pte_to_swp_entry(*pvmw->pte);
> > 
> > -                             if (!is_device_private_entry(entry))
> > +                             if (!is_device_private_entry(entry) &&
> > +                                 !is_device_exclusive_entry(entry))
> > 
> >                                       return false;
> >                       
> >                       } else if (!pte_present(*pvmw->pte))
> >                       
> >                               return false;
> > 
> > @@ -93,7 +94,8 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
> > 
> >                       return false;
> >               
> >               entry = pte_to_swp_entry(*pvmw->pte);
> > 
> > -             if (!is_migration_entry(entry))
> > +             if (!is_migration_entry(entry) &&
> > +                 !is_device_exclusive_entry(entry))
> > 
> >                       return false;
> >               
> >               pfn = swp_offset(entry);
> > 
> > @@ -102,7 +104,8 @@ static bool check_pte(struct page_vma_mapped_walk
> > *pvmw)> 
> >               /* Handle un-addressable ZONE_DEVICE memory */
> >               entry = pte_to_swp_entry(*pvmw->pte);
> > 
> > -             if (!is_device_private_entry(entry))
> > +             if (!is_device_private_entry(entry) &&
> > +                 !is_device_exclusive_entry(entry))
> > 
> >                       return false;
> >               
> >               pfn = swp_offset(entry);
> > 
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 7f91f058f1f5..32b99a7bb358 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -2005,6 +2005,216 @@ void page_mlock(struct page *page)
> > 
> >       rmap_walk(page, &rwc);
> >  
> >  }
> > 
> > +struct ttp_args {
> > +     struct mm_struct *mm;
> > +     unsigned long address;
> > +     void *arg;
> > +     bool valid;
> > +};
> > +
> > +static bool try_to_protect_one(struct page *page, struct vm_area_struct
> > *vma, +                     unsigned long address, void *arg)
> > +{
> > +     struct mm_struct *mm = vma->vm_mm;
> > +     struct page_vma_mapped_walk pvmw = {
> > +             .page = page,
> > +             .vma = vma,
> > +             .address = address,
> > +     };
> > +     struct ttp_args *ttp = arg;
> > +     pte_t pteval;
> > +     struct page *subpage;
> > +     bool ret = true;
> > +     struct mmu_notifier_range range;
> > +     swp_entry_t entry;
> > +     pte_t swp_pte;
> > +
> > +     mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, vma,
> > +                                   vma->vm_mm, address,
> > +                                   min(vma->vm_end,
> > +                                       address + page_size(page)),
> 
> (this indent looks odd; seems better to join with previous line?  Slightly
> over 80 but seems kernel code is not extremely strict on that)

It seems sometimes sticking to 80 is still preferable though so generally I 
still try to stick to that. Agree the indent does look a little odd though so 
perhaps this is better:

+     mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, vma,
+                                   vma->vm_mm, address, min(vma->vm_end,
+                                   address + page_size(page)),
+                                   ttp->arg);


> > +     if (PageHuge(page)) {
> > +             /*
> > +              * If sharing is possible, start and end will be adjusted
> > +              * accordingly.
> > +              */
> > +             adjust_range_if_pmd_sharing_possible(vma, &range.start,
> > +                                                  &range.end);
> > +     }
> 
> Is this for hugetlb specific?  Can we drop this chunk if we know it's
> PageAnon(), or is this a preparation for the future?
> 
> IMHO if possible we shouldn't introduce code that may never run, so to me
> that sounds still better to be postponed until the hugetlbfs support (if
> there will be).  So raise this question up.

Yeah I agree, this appears to be a hangover from an earlier version when 
try_to_protect_one() was part of the giant try_to_unmap_one() function which I 
subsequently cleaned up so I think it can be removed. Thanks.

> > +     mmu_notifier_invalidate_range_start(&range);
> > +
> > +     while (page_vma_mapped_walk(&pvmw)) {
> 
> Same here, not sure whether "if" would be easier.
> 
> > +             /* Unexpected PMD-mapped THP? */
> > +             VM_BUG_ON_PAGE(!pvmw.pte, page);
> > +
> > +             if (!pte_present(*pvmw.pte)) {
> > +                     ret = false;
> > +                     page_vma_mapped_walk_done(&pvmw);
> > +                     break;
> > +             }
> > +
> > +             subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
> > +             address = pvmw.address;
> 
> Same question here: could the subpage be not the same as page at all?
> 
> > +
> > +             /* Nuke the page table entry. */
> > +             flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
> > +             pteval = ptep_clear_flush(vma, address, pvmw.pte);
> > +
> > +             /* Move the dirty bit to the page. Now the pte is gone. */
> > +             if (pte_dirty(pteval))
> > +                     set_page_dirty(page);
> > +
> > +             /* Update high watermark before we lower rss */
> > +             update_hiwater_rss(mm);
> 
> We don't update RSS, right?  If so, can this be dropped?

Yep.

> > +
> > +             if (arch_unmap_one(mm, vma, address, pteval) < 0) {
> > +                     set_pte_at(mm, address, pvmw.pte, pteval);
> > +                     ret = false;
> > +                     page_vma_mapped_walk_done(&pvmw);
> > +                     break;
> > +             }
> > +
> > +             /*
> > +              * Check that our target page is still mapped at the
> > expected
> > +              * address.
> > +              */
> > +             if (ttp->mm == mm && ttp->address == address &&
> > +                 pte_write(pteval))
> > +                     ttp->valid = true;
> 
> I think I get the point of doing this (as after GUP the pte could have been
> changed to point to another page), however it smells a bit odd to me (or
> it's also possible that I'm not familiar enough with the code base..). 
> IIUC this is the _only_ reason that we passed in "address" into
> try_to_protect() too, and further into the ttp_args.

Yes, this is why "address" is passed up to ttp_args.

> The odd part is the remote GUP should have walked the page table already, so
> since the target here is the vaddr to replace, the 1st page table walk
> should be able to both trylock/lock the page, then modify the pte with
> pgtable lock held, return the locked page, then walk the rmap again to
> remove all the rest of the ptes that are mapping to this page.  In that
> case before we call the rmap_walk() we know this must be the page we want
> to take care of, and no one will be able to restore the original mm pte
> either (as we're with the page lock).  Then we don't need this check,
> neither do we need ttp->address.

If I am understanding you correctly I think this would be similar to the 
approach that was taken in v2. However it pretty much ended up being just an 
open-coded version of gup which is useful anyway to fault the page in.

> However frankly I didn't think deeper on how to best implement that and how
> many code changes are needed.  So just raise it up as question too here.
> 
> > +
> > +             /*
> > +              * Store the pfn of the page in a special migration
> > +              * pte. do_swap_page() will wait until the migration
> > +              * pte is removed and then restart fault handling.
> > +              */
> > +             if (pte_write(pteval))
> > +                     entry = make_writable_device_exclusive_entry(
> > +                                                    
> > page_to_pfn(subpage)); +             else
> > +                     entry = make_readable_device_exclusive_entry(
> > +                                                    
> > page_to_pfn(subpage));
> (For my own preference I actually prefer
> make_device_exclusive_entry(writable) and helpers defined like that, then
> these lines can be written as oneliner by passing in pte_write(); however
> that's quite subjective opinion and I saw there're discussions around that
> on patch 2, so I'll avoid commenting more)

As you say this is somewhat subjective and personally I prefer this to adding 
flag type arguments (although when the flag is a single bool I can go either 
way). The real reason is the original swap entry code cleaned up in patches 1 
& 2 had a mixture of both styles and I didn't really think a third helper 
(make_device_exclusive_entry(writable)) was necessary.

> > +             swp_pte = swp_entry_to_pte(entry);
> > +             if (pte_soft_dirty(pteval))
> > +                     swp_pte = pte_swp_mksoft_dirty(swp_pte);
> > +             if (pte_uffd_wp(pteval))
> > +                     swp_pte = pte_swp_mkuffd_wp(swp_pte);
> > +
> > +             /* Take a reference for the swap entry */
> > +             get_page(page);
> > +             set_pte_at(mm, address, pvmw.pte, swp_pte);
> > +
> > +             page_remove_rmap(subpage, PageHuge(page));
> > +             put_page(page);
> > +     }
> > +
> > +     mmu_notifier_invalidate_range_end(&range);
> > +
> > +     return ret;
> > +}
> > +
> > +/**
> > + * try_to_protect - try to replace all page table mappings with swap
> > entries
> Is this too general?  Either on the word "protect" or the comment after it,
> as there're a lot of types of swap entries (and a lot of types of protect
> too..), while it seems this is only for the device exclusive swap entries.
> 
> How about rename it to try_to_mark_device_exclusive?  Or even dropping the
> "try_to_" (e.g. device_exclusive_mark_page())?
> 
> > + * @page: the page to replace page table entries for
> > + * @flags: action and flags
> 
> Obsolete line?

Yep.

> > + * @mm: the mm_struct where the page is expected to be mapped
> > + * @address: address where the page is expected to be mapped
> > + * @arg: passed to MMU_NOTIFY_EXCLUSIVE range notifier callbacks
> > + *
> > + * Tries to remove all the page table entries which are mapping this page
> > and + * replace them with special swap entries to grant a device
> > exclusive access to + * the page. Caller must hold the page lock.
> > + *
> > + * Returns false if the page is still mapped, or if it could not be
> > unmapped + * from the expected address. Otherwise returns true (success).
> > + */
> > +static bool try_to_protect(struct page *page, struct mm_struct *mm,
> > +                        unsigned long address, void *arg)
> > +{
> > +     struct ttp_args ttp = {
> > +             .mm = mm,
> > +             .address = address,
> > +             .arg = arg,
> > +             .valid = false,
> > +     };
> > +     struct rmap_walk_control rwc = {
> > +             .rmap_one = try_to_protect_one,
> > +             .done = page_not_mapped,
> > +             .anon_lock = page_lock_anon_vma_read,
> > +             .arg = &ttp,
> > +     };
> > +
> > +     /*
> > +      * Restrict to anonymous pages for now to avoid potential writeback
> > +      * issues.
> > +      */
> > +     if (!PageAnon(page))
> > +             return false;
> > +
> > +     /*
> > +      * During exec, a temporary VMA is setup and later moved.
> > +      * The VMA is moved under the anon_vma lock but not the
> > +      * page tables leading to a race where migration cannot
> > +      * find the migration ptes. Rather than increasing the
> > +      * locking requirements of exec(), migration skips
> > +      * temporary VMAs until after exec() completes.
> > +      */
> > +     if (!PageKsm(page) && PageAnon(page))
> 
> I think we can drop the PageAnon() check as it's just done above.
> 
> I feel like this chunk was copied over from try_to_unmap(), however is that
> necessary?  Is it possible that the caller of make_device_exclusive_range()
> pass in a temp stack vma during exec()?

Indeed it was and I doubt it's possible for a caller to pass in a temp stack. 
Will double check.

> > +             rwc.invalid_vma = invalid_migration_vma;
> > +
> > +     rmap_walk(page, &rwc);
> > +
> > +     return ttp.valid && !page_mapcount(page);
> > +}
> > +
> > +/**
> > + * make_device_exclusive_range() - Mark a range for exclusive use by a
> > device + * @mm: mm_struct of assoicated target process
> > + * @start: start of the region to mark for exclusive device access
> > + * @end: end address of region
> > + * @pages: returns the pages which were successfully marked for exclusive
> > access + * @arg: passed to MMU_NOTIFY_EXCLUSIVE range notifier too allow
> > filtering
> s/too/to/?
> 
> > + *
> > + * Returns: number of pages successfully marked for exclusive access
> 
> Hmm, I see that try_to_protect() can fail even if npages returned from GUP,
> so perhaps "number of pages successfully GUPed, however the page is marked
> for exclusive access only if the page pointer is non-NULL", or something
> like that?

Good point, thanks.

> > + *
> > + * This function finds ptes mapping page(s) to the given address range,
> > locks + * them and replaces mappings with special swap entries preventing
> > userspace CPU
> s/userspace//?  As same for kernel access?
> 
> (I don't think I fully read all the codes in this patch, but I'll stop here
> for today..)

No problem, thanks for the comments and reading this far!

 - Alistair

> Thanks,
> 
> > + * access. On fault these entries are replaced with the original mapping
> > after + * calling MMU notifiers.
> > + *
> > + * A driver using this to program access from a device must use a mmu
> > notifier + * critical section to hold a device specific lock during
> > programming. Once + * programming is complete it should drop the page
> > lock and reference after + * which point CPU access to the page will
> > revoke the exclusive access. + */
> > +int make_device_exclusive_range(struct mm_struct *mm, unsigned long
> > start,
> > +                             unsigned long end, struct page **pages,
> > +                             void *arg)
> > +{
> > +     unsigned long npages = (end - start) >> PAGE_SHIFT;
> > +     unsigned long i;
> > +
> > +     npages = get_user_pages_remote(mm, start, npages,
> > +                                    FOLL_GET | FOLL_WRITE |
> > FOLL_SPLIT_PMD, +                                    pages, NULL, NULL);
> > +     for (i = 0; i < npages; i++, start += PAGE_SIZE) {
> > +             if (!trylock_page(pages[i])) {
> > +                     put_page(pages[i]);
> > +                     pages[i] = NULL;
> > +                     continue;
> > +             }
> > +
> > +             if (!try_to_protect(pages[i], mm, start, arg)) {
> > +                     unlock_page(pages[i]);
> > +                     put_page(pages[i]);
> > +                     pages[i] = NULL;
> > +             }
> > +     }
> > +
> > +     return npages;
> > +}
> > +EXPORT_SYMBOL_GPL(make_device_exclusive_range);
> > +
> > 
> >  void __put_anon_vma(struct anon_vma *anon_vma)
> >  {
> >  
> >       struct anon_vma *root = anon_vma->root;
> > 
> > --
> > 2.20.1
> 
> --
> Peter Xu






^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v8 1/8] mm: Remove special swap entry functions
  2021-05-18 11:58     ` Alistair Popple
@ 2021-05-18 14:17       ` Peter Xu
  0 siblings, 0 replies; 42+ messages in thread
From: Peter Xu @ 2021-05-18 14:17 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, nouveau, bskeggs, akpm, linux-doc, linux-kernel,
	dri-devel, jhubbard, rcampbell, jglisse, jgg, hch, daniel, willy,
	bsingharora, Christoph Hellwig

On Tue, May 18, 2021 at 09:58:09PM +1000, Alistair Popple wrote:
> On Tuesday, 18 May 2021 12:17:32 PM AEST Peter Xu wrote:
> > On Wed, Apr 07, 2021 at 06:42:31PM +1000, Alistair Popple wrote:
> > > +static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
> > > +{
> > > +     struct page *p = pfn_to_page(swp_offset(entry));
> > > +
> > > +     /*
> > > +      * Any use of migration entries may only occur while the
> > > +      * corresponding page is locked
> > > +      */
> > > +     BUG_ON(is_migration_entry(entry) && !PageLocked(p));
> > > +
> > > +     return p;
> > > +}
> > 
> > Would swap_pfn_entry_to_page() be slightly better?
> > 
> > The thing is it's very easy to read pfn_*() as a function to take a pfn as
> > parameter...
> > 
> > Since I'm also recently working on some swap-related new ptes [1], I'm
> > thinking whether we could name these swap entries as "swap XXX entries". 
> > Say, "swap hwpoison entry", "swap pfn entry" (which is a superset of "swap
> > migration entry", "swap device exclusive entry", ...).  That's where I came
> > with the above swap_pfn_entry_to_page(), then below will be naturally
> > is_swap_pfn_entry().
> 
> Equally though "hwpoison swap entry", "pfn swap entry", "migration swap 
> entry", etc. also makes sense (at least to me), but does that not fit in as 
> well with your series? I haven't looked too deeply at your series but have 
> been meaning to so thanks for the pointer.

Yeah it looks good too.  It's just to avoid starting with "pfn_" I guess, so at
least we know that's something related to swap not one specific pfn.

I found the naming is important as e.g. in my series I introduced another idea
called "swap special pte" or "special swap pte" (yeah so indeed my naming is
not that coherent as well... :), then I noticed if without a good naming I'm
afraid we can get lost easier.

I have a description here in the commit message re the new swap special pte:

https://lore.kernel.org/lkml/20210427161317.50682-5-peterx@redhat.com/

In which the uffd-wp marker pte will be the first swap special pte.  Feel free
to ignore the details too, just want to mention some context, while it should
be orthogonal to this series.

I think yet-another swap entry suites for my case too but it'll be a waste as
in that pte I don't need to keep pfn information, but only a marker (one single
bit would suffice), so I chose one single pte value (one for each arch, I only
defined that on x86 in my series which is a special pte with only one bit set)
to not pollute the swap entry address spaces.

> 
> > No strong opinion as this is already a v8 series (and sorry to chim in this
> > late), just to raise this up.
> 
> No worries, it's good timing as I was about to send a v9 which was just a 
> rebase anyway. I am hoping to try and get this accepted for the next merge 
> window but I will wait before sending v9 to see if anyone else has thoughts on 
> the naming here.
> 
> I don't have a particularly strong opinion either, and your justification is 
> more thought than I gave to naming these originally so am happy to rename if 
> it's more readable or fits better with your series.

I'll leave the decision to you, especially in case you still prefer the old
naming.  Or feel free to wait too until someone else shares the thoughts so as
to avoid unnecessary rebase work.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v8 5/8] mm: Device exclusive memory access
  2021-05-18 13:19     ` Alistair Popple
@ 2021-05-18 17:27       ` Peter Xu
  2021-05-18 17:33         ` Jason Gunthorpe
  2021-05-19 11:35         ` Alistair Popple
  2021-05-21  6:53       ` Alistair Popple
  1 sibling, 2 replies; 42+ messages in thread
From: Peter Xu @ 2021-05-18 17:27 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, nouveau, bskeggs, akpm, linux-doc, linux-kernel,
	dri-devel, jhubbard, rcampbell, jglisse, jgg, hch, daniel, willy,
	bsingharora, Christoph Hellwig

Alistair,

While I got one reply below to your previous email, I also looked at the rest
code (majorly restore and fork sides) today and added a few more comments.

On Tue, May 18, 2021 at 11:19:05PM +1000, Alistair Popple wrote:

[...]

> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 3a5705cfc891..556ff396f2e9 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -700,6 +700,84 @@ struct page *vm_normal_page_pmd(struct vm_area_struct
> > > *vma, unsigned long addr,> 
> > >  }
> > >  #endif
> > > 
> > > +static void restore_exclusive_pte(struct vm_area_struct *vma,
> > > +                               struct page *page, unsigned long address,
> > > +                               pte_t *ptep)
> > > +{
> > > +     pte_t pte;
> > > +     swp_entry_t entry;
> > > +
> > > +     pte = pte_mkold(mk_pte(page, READ_ONCE(vma->vm_page_prot)));
> > > +     if (pte_swp_soft_dirty(*ptep))
> > > +             pte = pte_mksoft_dirty(pte);
> > > +
> > > +     entry = pte_to_swp_entry(*ptep);
> > > +     if (pte_swp_uffd_wp(*ptep))
> > > +             pte = pte_mkuffd_wp(pte);
> > > +     else if (is_writable_device_exclusive_entry(entry))
> > > +             pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> > > +
> > > +     set_pte_at(vma->vm_mm, address, ptep, pte);
> > > +
> > > +     /*
> > > +      * No need to take a page reference as one was already
> > > +      * created when the swap entry was made.
> > > +      */
> > > +     if (PageAnon(page))
> > > +             page_add_anon_rmap(page, vma, address, false);
> > > +     else
> > > +             page_add_file_rmap(page, false);

This seems to be another leftover; maybe WARN_ON_ONCE(!PageAnon(page))?

> > > +
> > > +     if (vma->vm_flags & VM_LOCKED)
> > > +             mlock_vma_page(page);
> > > +
> > > +     /*
> > > +      * No need to invalidate - it was non-present before. However
> > > +      * secondary CPUs may have mappings that need invalidating.
> > > +      */
> > > +     update_mmu_cache(vma, address, ptep);
> > > +}

[...]

> > >  /*
> > >  
> > >   * copy one vm_area from one task to the other. Assumes the page tables
> > >   * already present in the new task to be cleared in the whole range
> > > 
> > > @@ -781,6 +859,12 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct
> > > mm_struct *src_mm,> 
> > >                               pte = pte_swp_mkuffd_wp(pte);
> > >                       
> > >                       set_pte_at(src_mm, addr, src_pte, pte);
> > >               
> > >               }
> > > 
> > > +     } else if (is_device_exclusive_entry(entry)) {
> > > +             /* COW mappings should be dealt with by removing the entry
> > > */

Here the comment says "removing the entry" however I think it didn't remove the
pte, instead it keeps it (as there's no "return", so set_pte_at() will be
called below), so I got a bit confused.

> > > +             VM_BUG_ON(is_cow_mapping(vm_flags));

Also here, if PageAnon() is the only case to support so far, doesn't that
easily satisfy is_cow_mapping()? Maybe I missed something..

I also have a pure and high level question regarding a process fork() when
there're device exclusive ptes: would the two processes then own the device
together?  Is this a real usecase?

Indeed it'll be odd for a COW page since for COW page then it means after
parent/child writting to the page it'll clone into two, then it's a mistery on
which one will be the one that "exclusived owned" by the device..

> > > +             page = pfn_swap_entry_to_page(entry);
> > > +             get_page(page);
> > > +             rss[mm_counter(page)]++;
> > > 
> > >       }
> > >       set_pte_at(dst_mm, addr, dst_pte, pte);
> > >       return 0;
> > > 
> > > @@ -947,6 +1031,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct
> > > vm_area_struct *src_vma,> 
> > >       int rss[NR_MM_COUNTERS];
> > >       swp_entry_t entry = (swp_entry_t){0};
> > >       struct page *prealloc = NULL;
> > > 
> > > +     struct page *locked_page = NULL;
> > > 
> > >  again:
> > >       progress = 0;
> > > 
> > > @@ -980,13 +1065,36 @@ copy_pte_range(struct vm_area_struct *dst_vma,
> > > struct vm_area_struct *src_vma,> 
> > >                       continue;
> > >               
> > >               }
> > >               if (unlikely(!pte_present(*src_pte))) {
> > > 
> > > -                     entry.val = copy_nonpresent_pte(dst_mm, src_mm,
> > > -                                                     dst_pte, src_pte,
> > > -                                                     src_vma, addr, rss);
> > > -                     if (entry.val)
> > > -                             break;
> > > -                     progress += 8;
> > > -                     continue;
> > > +                     swp_entry_t swp_entry = pte_to_swp_entry(*src_pte);

(Just a side note to all of us: this will be one more place that I'll need to
 look after in my uffd-wp series if this series lands first, as after that
 series we can only call pte_to_swp_entry after a pte_has_swap_entry check, as
 sometimes non-present pte won't contain a swap entry at all)

> > > +
> > > +                     if (unlikely(is_cow_mapping(src_vma->vm_flags) &&
> > > +                         is_device_exclusive_entry(swp_entry))) {
> > > +                             /*
> > > +                              * Normally this would require sending mmu
> > > +                              * notifiers, but copy_page_range() has
> > > already +                              * done that for COW mappings.
> > > +                              */
> > > +                             ret = try_restore_exclusive_pte(src_mm,
> > > src_pte, +                                                            
> > > src_vma, addr, +                                                         
> > >    &locked_page); +                             if (ret == -EBUSY)
> > > +                                     break;

Would it be possible that we put all the handling of device exclusive ptes into
copy_nonpresent_pte()?  As IMHO all device exclusive ptes should still be one
kind of non-present pte.  Splitting the cases really make it (at least to
me...) even harder to read.

Maybe you wanted to avoid the rework of copy_nonpresent_pte() as it currently
returns a entry.val which is indeed not straightforward already.. I wanted to
clean that up but not yet.

An easier option is perhaps failing the fork() directly when trylock_page()
failed when restoring the pte? So the userspace could try again the whole
fork(). However that'll also depend on my previous question on whether this is
a valid scenario after all.  If "maintaining fork correctness" is the only
thing we persue for, maybe still worth to consider?

> > > +                     } else {
> > > +                             entry.val = copy_nonpresent_pte(dst_mm,
> > > src_mm, +                                                            
> > > dst_pte, src_pte, +                                                      
> > >       src_vma, addr, +                                                   
> > >          rss); +                             if (entry.val)
> > > +                                     break;
> > > +                             progress += 8;
> > > +                             continue;
> > > +                     }
> > > +             }
> > > +             /* a non-present pte became present after dropping the ptl
> > > */
> > > +             if (unlikely(locked_page)) {
> > > +                     unlock_page(locked_page);
> > > +                     put_page(locked_page);
> > > +                     locked_page = NULL;
> > > 
> > >               }
> > >               /* copy_present_pte() will clear `*prealloc' if consumed */
> > >               ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte,
> > > 

[...]

> > > +static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> > > +{
> > > +     struct page *page = vmf->page;
> > > +     struct vm_area_struct *vma = vmf->vma;
> > > +     struct page_vma_mapped_walk pvmw = {
> > > +             .page = page,
> > > +             .vma = vma,
> > > +             .address = vmf->address,
> > > +             .flags = PVMW_SYNC,
> > > +     };
> > > +     vm_fault_t ret = 0;
> > > +     struct mmu_notifier_range range;
> > > +
> > > +     if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags))
> > > +             return VM_FAULT_RETRY;
> > > +     mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma,
> > > vma->vm_mm,
> > > +                             vmf->address & PAGE_MASK,
> > > +                             (vmf->address & PAGE_MASK) + PAGE_SIZE);
> > > +     mmu_notifier_invalidate_range_start(&range);
> > 
> > I looked at MMU_NOTIFIER_CLEAR document and it tells me:
> > 
> >  * @MMU_NOTIFY_CLEAR: clear page table entry (many reasons for this like
> >  * madvise() or replacing a page by another one, ...).
> > 
> > Does MMU_NOTIFIER_CLEAR suite for this case?  Normally I think for such a
> > case (existing pte is invalid) we don't need to notify at all.  However
> > from what I read from the whole series, this seems to be a critical point
> > where we'd like to kick the owner/driver to synchronously stop doing atomic
> > operations from the device.  Not sure whether we'd like a new notifier
> > type, or maybe at least comment on why to use CLEAR?
> 
> Right, notifying the owner/driver when it no longer has exclusive access to 
> the page and allowing it to stop atomic operations is the critical point and 
> why it notifies when we ordinarily wouldn't (ie. invalid -> valid).
> 
> I did consider adding a new type, but in the driver implementation it ends up 
> being treated the same as a CLEAR notification anyway so didn't think it was 
> necessary. But I suppose adding a different type would allow other listening 
> notifiers to filter these which might be worthwhile.

Sounds good to me.

[...]

> > > +             /*
> > > +              * Check that our target page is still mapped at the
> > > expected
> > > +              * address.
> > > +              */
> > > +             if (ttp->mm == mm && ttp->address == address &&
> > > +                 pte_write(pteval))
> > > +                     ttp->valid = true;
> > 
> > I think I get the point of doing this (as after GUP the pte could have been
> > changed to point to another page), however it smells a bit odd to me (or
> > it's also possible that I'm not familiar enough with the code base..). 
> > IIUC this is the _only_ reason that we passed in "address" into
> > try_to_protect() too, and further into the ttp_args.
> 
> Yes, this is why "address" is passed up to ttp_args.
> 
> > The odd part is the remote GUP should have walked the page table already, so
> > since the target here is the vaddr to replace, the 1st page table walk
> > should be able to both trylock/lock the page, then modify the pte with
> > pgtable lock held, return the locked page, then walk the rmap again to
> > remove all the rest of the ptes that are mapping to this page.  In that
> > case before we call the rmap_walk() we know this must be the page we want
> > to take care of, and no one will be able to restore the original mm pte
> > either (as we're with the page lock).  Then we don't need this check,
> > neither do we need ttp->address.
> 
> If I am understanding you correctly I think this would be similar to the 
> approach that was taken in v2. However it pretty much ended up being just an 
> open-coded version of gup which is useful anyway to fault the page in.

I see.  For easier reference this is v2 patch 1:

https://lore.kernel.org/lkml/20210219020750.16444-2-apopple@nvidia.com/

Indeed that looks like it, it's just that instead of grabbing the page only in
hmm_exclusive_pmd() we can do the pte modification along the way to seal the
whole thing (address/pte & page).  I saw Christoph and Jason commented in that
patch, but not regarding to this approach.  So is there a reason that you
switched?  Do you think it'll work?

I have no strong opinion either, it's just not crystal clear why we'd need that
ttp->address at all for a rmap walk along with that "valid" field. Meanwhile it
should be slightly less efficient too to go with current approach, especially
when the page array gets huge, I think: since there'll be longer time we do GUP
before doing the rmap walk, so higher possibility that the GUPed pages got
replaced for whatever reason.  Then the call to make_device_exclusive_range()
will fail as a whole just for a single page replacement within the range.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v8 5/8] mm: Device exclusive memory access
  2021-05-18 17:27       ` Peter Xu
@ 2021-05-18 17:33         ` Jason Gunthorpe
  2021-05-18 18:01           ` Peter Xu
  2021-05-19 11:35         ` Alistair Popple
  1 sibling, 1 reply; 42+ messages in thread
From: Jason Gunthorpe @ 2021-05-18 17:33 UTC (permalink / raw)
  To: Peter Xu
  Cc: Alistair Popple, linux-mm, nouveau, bskeggs, akpm, linux-doc,
	linux-kernel, dri-devel, jhubbard, rcampbell, jglisse, hch,
	daniel, willy, bsingharora, Christoph Hellwig

On Tue, May 18, 2021 at 01:27:42PM -0400, Peter Xu wrote:

> I also have a pure and high level question regarding a process fork() when
> there're device exclusive ptes: would the two processes then own the device
> together?  Is this a real usecase?

If the pages are MAP_SHARED then yes. All VMAs should point at the
same device_exclusive page and all VMA should migrate back to CPU
pages together.

> Indeed it'll be odd for a COW page since for COW page then it means after
> parent/child writting to the page it'll clone into two, then it's a mistery on
> which one will be the one that "exclusived owned" by the device..

For COW pages it is like every other fork case.. We can't reliably
write-protect the device_exclusive page during fork so we must copy it
at fork time.

Thus three reasonable choices:
 - Copy to a new CPU page
 - Migrate back to a CPU page and write protect it
 - Copy to a new device exclusive page

Jason


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v8 5/8] mm: Device exclusive memory access
  2021-05-18 17:33         ` Jason Gunthorpe
@ 2021-05-18 18:01           ` Peter Xu
  2021-05-18 19:45             ` Jason Gunthorpe
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Xu @ 2021-05-18 18:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alistair Popple, linux-mm, nouveau, bskeggs, akpm, linux-doc,
	linux-kernel, dri-devel, jhubbard, rcampbell, jglisse, hch,
	daniel, willy, bsingharora, Christoph Hellwig

On Tue, May 18, 2021 at 02:33:34PM -0300, Jason Gunthorpe wrote:
> On Tue, May 18, 2021 at 01:27:42PM -0400, Peter Xu wrote:
> 
> > I also have a pure and high level question regarding a process fork() when
> > there're device exclusive ptes: would the two processes then own the device
> > together?  Is this a real usecase?
> 
> If the pages are MAP_SHARED then yes. All VMAs should point at the
> same device_exclusive page and all VMA should migrate back to CPU
> pages together.

Makes sense.  If we keep the anonymous-only in this series (I think it's good
to separate these), maybe we can drop the !COW case, plus some proper installed
WARN_ON_ONCE()s.

> 
> > Indeed it'll be odd for a COW page since for COW page then it means after
> > parent/child writting to the page it'll clone into two, then it's a mistery on
> > which one will be the one that "exclusived owned" by the device..
> 
> For COW pages it is like every other fork case.. We can't reliably
> write-protect the device_exclusive page during fork so we must copy it
> at fork time.
> 
> Thus three reasonable choices:
>  - Copy to a new CPU page
>  - Migrate back to a CPU page and write protect it
>  - Copy to a new device exclusive page

IMHO the ownership question would really help us to answer this one..

If the device ownership should be kept in parent IMHO option (1) is the best
approach. To be explicit on page copy: we can do best-effort, even if the copy
is during a device atomic operation, perhaps?

If the ownership will be shared, seems option (3) will be easier as I don't see
a strong reason to do immediate restorinng of ptes; as long as we're careful on
the refcounting.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v8 5/8] mm: Device exclusive memory access
  2021-05-18 18:01           ` Peter Xu
@ 2021-05-18 19:45             ` Jason Gunthorpe
  2021-05-18 20:29               ` Peter Xu
  0 siblings, 1 reply; 42+ messages in thread
From: Jason Gunthorpe @ 2021-05-18 19:45 UTC (permalink / raw)
  To: Peter Xu
  Cc: Alistair Popple, linux-mm, nouveau, bskeggs, akpm, linux-doc,
	linux-kernel, dri-devel, jhubbard, rcampbell, jglisse, hch,
	daniel, willy, bsingharora, Christoph Hellwig

On Tue, May 18, 2021 at 02:01:36PM -0400, Peter Xu wrote:
> > > Indeed it'll be odd for a COW page since for COW page then it means after
> > > parent/child writting to the page it'll clone into two, then it's a mistery on
> > > which one will be the one that "exclusived owned" by the device..
> > 
> > For COW pages it is like every other fork case.. We can't reliably
> > write-protect the device_exclusive page during fork so we must copy it
> > at fork time.
> > 
> > Thus three reasonable choices:
> >  - Copy to a new CPU page
> >  - Migrate back to a CPU page and write protect it
> >  - Copy to a new device exclusive page
> 
> IMHO the ownership question would really help us to answer this one..

I'm confused about what device ownership you are talking about

It is just a page and it is tied to some specific pgmap?

If the thing providing the migration stuff goes away then all
device_exclusive pages should revert back to CPU pages and destroy the
pgmap?

Jason


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v8 3/8] mm/rmap: Split try_to_munlock from try_to_unmap
  2021-04-07  8:42 ` [PATCH v8 3/8] mm/rmap: Split try_to_munlock from try_to_unmap Alistair Popple
@ 2021-05-18 20:04   ` Liam Howlett
  2021-05-19 12:38     ` Alistair Popple
  0 siblings, 1 reply; 42+ messages in thread
From: Liam Howlett @ 2021-05-18 20:04 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, nouveau, bskeggs, akpm, linux-doc, linux-kernel,
	dri-devel, jhubbard, rcampbell, jglisse, jgg, hch, daniel, willy,
	bsingharora, Christoph Hellwig

* Alistair Popple <apopple@nvidia.com> [210407 04:43]:
> The behaviour of try_to_unmap_one() is difficult to follow because it
> performs different operations based on a fairly large set of flags used
> in different combinations.
> 
> TTU_MUNLOCK is one such flag. However it is exclusively used by
> try_to_munlock() which specifies no other flags. Therefore rather than
> overload try_to_unmap_one() with unrelated behaviour split this out into
> it's own function and remove the flag.
> 
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> 
> ---
> 
> v8:
> * Renamed try_to_munlock to page_mlock to better reflect what the
>   function actually does.
> * Removed the TODO from the documentation that this patch addresses.
> 
> v7:
> * Added Christoph's Reviewed-by
> 
> v4:
> * Removed redundant check for VM_LOCKED
> ---
>  Documentation/vm/unevictable-lru.rst | 33 ++++++++-----------
>  include/linux/rmap.h                 |  3 +-
>  mm/mlock.c                           | 10 +++---
>  mm/rmap.c                            | 48 +++++++++++++++++++++-------
>  4 files changed, 55 insertions(+), 39 deletions(-)
> 
> diff --git a/Documentation/vm/unevictable-lru.rst b/Documentation/vm/unevictable-lru.rst
> index 0e1490524f53..eae3af17f2d9 100644
> --- a/Documentation/vm/unevictable-lru.rst
> +++ b/Documentation/vm/unevictable-lru.rst
> @@ -389,14 +389,14 @@ mlocked, munlock_vma_page() updates that zone statistics for the number of
>  mlocked pages.  Note, however, that at this point we haven't checked whether
>  the page is mapped by other VM_LOCKED VMAs.
>  
> -We can't call try_to_munlock(), the function that walks the reverse map to
> +We can't call page_mlock(), the function that walks the reverse map to
>  check for other VM_LOCKED VMAs, without first isolating the page from the LRU.
> -try_to_munlock() is a variant of try_to_unmap() and thus requires that the page
> +page_mlock() is a variant of try_to_unmap() and thus requires that the page
>  not be on an LRU list [more on these below].  However, the call to
> -isolate_lru_page() could fail, in which case we couldn't try_to_munlock().  So,
> +isolate_lru_page() could fail, in which case we can't call page_mlock().  So,
>  we go ahead and clear PG_mlocked up front, as this might be the only chance we
> -have.  If we can successfully isolate the page, we go ahead and
> -try_to_munlock(), which will restore the PG_mlocked flag and update the zone
> +have.  If we can successfully isolate the page, we go ahead and call
> +page_mlock(), which will restore the PG_mlocked flag and update the zone
>  page statistics if it finds another VMA holding the page mlocked.  If we fail
>  to isolate the page, we'll have left a potentially mlocked page on the LRU.
>  This is fine, because we'll catch it later if and if vmscan tries to reclaim
> @@ -545,31 +545,24 @@ munlock or munmap system calls, mm teardown (munlock_vma_pages_all), reclaim,
>  holepunching, and truncation of file pages and their anonymous COWed pages.
>  
>  
> -try_to_munlock() Reverse Map Scan
> +page_mlock() Reverse Map Scan
>  ---------------------------------
>  
> -.. warning::
> -   [!] TODO/FIXME: a better name might be page_mlocked() - analogous to the
> -   page_referenced() reverse map walker.
> -
>  When munlock_vma_page() [see section :ref:`munlock()/munlockall() System Call
>  Handling <munlock_munlockall_handling>` above] tries to munlock a
>  page, it needs to determine whether or not the page is mapped by any
>  VM_LOCKED VMA without actually attempting to unmap all PTEs from the
>  page.  For this purpose, the unevictable/mlock infrastructure
> -introduced a variant of try_to_unmap() called try_to_munlock().
> +introduced a variant of try_to_unmap() called page_mlock().
>  
> -try_to_munlock() calls the same functions as try_to_unmap() for anonymous and
> -mapped file and KSM pages with a flag argument specifying unlock versus unmap
> -processing.  Again, these functions walk the respective reverse maps looking
> -for VM_LOCKED VMAs.  When such a VMA is found, as in the try_to_unmap() case,
> -the functions mlock the page via mlock_vma_page() and return SWAP_MLOCK.  This
> -undoes the pre-clearing of the page's PG_mlocked done by munlock_vma_page.
> +page_mlock() walks the respective reverse maps looking for VM_LOCKED VMAs. When
> +such a VMA is found the page is mlocked via mlock_vma_page(). This undoes the
> +pre-clearing of the page's PG_mlocked done by munlock_vma_page.
>  
> -Note that try_to_munlock()'s reverse map walk must visit every VMA in a page's
> +Note that page_mlock()'s reverse map walk must visit every VMA in a page's
>  reverse map to determine that a page is NOT mapped into any VM_LOCKED VMA.
>  However, the scan can terminate when it encounters a VM_LOCKED VMA.
> -Although try_to_munlock() might be called a great many times when munlocking a
> +Although page_mlock() might be called a great many times when munlocking a
>  large region or tearing down a large address space that has been mlocked via
>  mlockall(), overall this is a fairly rare event.
>  
> @@ -602,7 +595,7 @@ inactive lists to the appropriate node's unevictable list.
>  shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
>  after shrink_active_list() had moved them to the inactive list, or pages mapped
>  into VM_LOCKED VMAs that munlock_vma_page() couldn't isolate from the LRU to
> -recheck via try_to_munlock().  shrink_inactive_list() won't notice the latter,
> +recheck via page_mlock().  shrink_inactive_list() won't notice the latter,
>  but will pass on to shrink_page_list().
>  
>  shrink_page_list() again culls obviously unevictable pages that it could
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index def5c62c93b3..38a746787c2f 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -87,7 +87,6 @@ struct anon_vma_chain {
>  
>  enum ttu_flags {
>  	TTU_MIGRATION		= 0x1,	/* migration mode */
> -	TTU_MUNLOCK		= 0x2,	/* munlock mode */
>  
>  	TTU_SPLIT_HUGE_PMD	= 0x4,	/* split huge PMD if any */
>  	TTU_IGNORE_MLOCK	= 0x8,	/* ignore mlock */
> @@ -239,7 +238,7 @@ int page_mkclean(struct page *);
>   * called in munlock()/munmap() path to check for other vmas holding
>   * the page mlocked.
>   */
> -void try_to_munlock(struct page *);
> +void page_mlock(struct page *page);
>  
>  void remove_migration_ptes(struct page *old, struct page *new, bool locked);
>  
> diff --git a/mm/mlock.c b/mm/mlock.c
> index f8f8cc32d03d..9b8b82cfbbff 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -108,7 +108,7 @@ void mlock_vma_page(struct page *page)
>  /*
>   * Finish munlock after successful page isolation
>   *
> - * Page must be locked. This is a wrapper for try_to_munlock()
> + * Page must be locked. This is a wrapper for page_mlock()
>   * and putback_lru_page() with munlock accounting.
>   */
>  static void __munlock_isolated_page(struct page *page)
> @@ -118,7 +118,7 @@ static void __munlock_isolated_page(struct page *page)
>  	 * and we don't need to check all the other vmas.
>  	 */
>  	if (page_mapcount(page) > 1)
> -		try_to_munlock(page);
> +		page_mlock(page);
>  
>  	/* Did try_to_unlock() succeed or punt? */
>  	if (!PageMlocked(page))
> @@ -158,7 +158,7 @@ static void __munlock_isolation_failed(struct page *page)
>   * munlock()ed or munmap()ed, we want to check whether other vmas hold the
>   * page locked so that we can leave it on the unevictable lru list and not
>   * bother vmscan with it.  However, to walk the page's rmap list in
> - * try_to_munlock() we must isolate the page from the LRU.  If some other
> + * page_mlock() we must isolate the page from the LRU.  If some other
>   * task has removed the page from the LRU, we won't be able to do that.
>   * So we clear the PageMlocked as we might not get another chance.  If we
>   * can't isolate the page, we leave it for putback_lru_page() and vmscan
> @@ -168,7 +168,7 @@ unsigned int munlock_vma_page(struct page *page)
>  {
>  	int nr_pages;
>  
> -	/* For try_to_munlock() and to serialize with page migration */
> +	/* For page_mlock() and to serialize with page migration */
>  	BUG_ON(!PageLocked(page));
>  	VM_BUG_ON_PAGE(PageTail(page), page);
>  
> @@ -205,7 +205,7 @@ static int __mlock_posix_error_return(long retval)
>   *
>   * The fast path is available only for evictable pages with single mapping.
>   * Then we can bypass the per-cpu pvec and get better performance.
> - * when mapcount > 1 we need try_to_munlock() which can fail.
> + * when mapcount > 1 we need page_mlock() which can fail.
>   * when !page_evictable(), we need the full redo logic of putback_lru_page to
>   * avoid leaving evictable page in unevictable list.
>   *
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 977e70803ed8..f09d522725b9 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1405,10 +1405,6 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  	struct mmu_notifier_range range;
>  	enum ttu_flags flags = (enum ttu_flags)(long)arg;
>  
> -	/* munlock has nothing to gain from examining un-locked vmas */
> -	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
> -		return true;
> -
>  	if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
>  	    is_zone_device_page(page) && !is_device_private_page(page))
>  		return true;
> @@ -1469,8 +1465,6 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  				page_vma_mapped_walk_done(&pvmw);
>  				break;
>  			}
> -			if (flags & TTU_MUNLOCK)
> -				continue;
>  		}
>  
>  		/* Unexpected PMD-mapped THP? */
> @@ -1784,8 +1778,39 @@ bool try_to_unmap(struct page *page, enum ttu_flags flags)
>  	return !page_mapcount(page) ? true : false;
>  }
>  

Please add a comment here, especially around locking.

> +static bool page_mlock_one(struct page *page, struct vm_area_struct *vma,
> +				 unsigned long address, void *arg)
> +{
> +	struct page_vma_mapped_walk pvmw = {
> +		.page = page,
> +		.vma = vma,
> +		.address = address,
> +	};
> +
> +	/* munlock has nothing to gain from examining un-locked vmas */
> +	if (!(vma->vm_flags & VM_LOCKED))
> +		return true;

The logic here doesn't make sense.  You called page_mlock_one() on a VMA
that isn't locked and it returns true?  Is this a check to see if the
VMA has zero mlock'ed pages?

> +
> +	while (page_vma_mapped_walk(&pvmw)) {
> +		/* PTE-mapped THP are never mlocked */
> +		if (!PageTransCompound(page)) {
> +			/*
> +			 * Holding pte lock, we do *not* need
> +			 * mmap_lock here
> +			 */

Are you sure?  I think you at least need to hold the mmap lock for
reading to ensure there's no race here?  mlock_vma_page() eludes to such
a scenario when lazy mlocking.

The mmap_lock is held for writing in the scenarios I have checked.

> +			mlock_vma_page(page);
> +		}
> +		page_vma_mapped_walk_done(&pvmw);
> +
> +		/* found a mlocked page, no point continuing munlock check */

I think you need to check_pte() to be sure it is mapped?

> +		return false;
> +	}
> +
> +	return true;

Again, I don't get the return values.  If page_mlock_one() returns true,
I'd expect for my page to now be locked.  This isn't the case here,
page_mlock_one() returns true if there are no pages present for a locked
VMA, correct?

> +}
> +
>  /**
> - * try_to_munlock - try to munlock a page
> + * page_mlock - try to munlock a page

Is this an mlock or an munlock?  I'm not confident it's either, but more
of a check to see if there are pages mapped in a locked VMA?

>   * @page: the page to be munlocked
>   *
>   * Called from munlock code.  Checks all of the VMAs mapping the page
> @@ -1793,11 +1818,10 @@ bool try_to_unmap(struct page *page, enum ttu_flags flags)
>   * returned with PG_mlocked cleared if no other vmas have it mlocked.
>   */
>  
> -void try_to_munlock(struct page *page)
> +void page_mlock(struct page *page)
>  {
>  	struct rmap_walk_control rwc = {
> -		.rmap_one = try_to_unmap_one,
> -		.arg = (void *)TTU_MUNLOCK,
> +		.rmap_one = page_mlock_one,
>  		.done = page_not_mapped,
>  		.anon_lock = page_lock_anon_vma_read,
>  
> @@ -1849,7 +1873,7 @@ static struct anon_vma *rmap_walk_anon_lock(struct page *page,
>   * Find all the mappings of a page using the mapping pointer and the vma chains
>   * contained in the anon_vma struct it points to.
>   *
> - * When called from try_to_munlock(), the mmap_lock of the mm containing the vma
> + * When called from page_mlock(), the mmap_lock of the mm containing the vma
>   * where the page was found will be held for write.  So, we won't recheck
>   * vm_flags for that VMA.  That should be OK, because that vma shouldn't be
>   * LOCKED.
> @@ -1901,7 +1925,7 @@ static void rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc,
>   * Find all the mappings of a page using the mapping pointer and the vma chains
>   * contained in the address_space struct it points to.
>   *
> - * When called from try_to_munlock(), the mmap_lock of the mm containing the vma
> + * When called from page_mlock(), the mmap_lock of the mm containing the vma
>   * where the page was found will be held for write.  So, we won't recheck

Above it is stated that the lock does not need to be held, but this
comment says it is already held for writing - which is it?

>   * vm_flags for that VMA.  That should be OK, because that vma shouldn't be
>   * LOCKED.
> -- 
> 2.20.1
> 

munlock_vma_pages_range() comments references try_to_{munlock|unmap}


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v8 5/8] mm: Device exclusive memory access
  2021-05-18 19:45             ` Jason Gunthorpe
@ 2021-05-18 20:29               ` Peter Xu
  2021-05-18 23:03                 ` Jason Gunthorpe
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Xu @ 2021-05-18 20:29 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alistair Popple, linux-mm, nouveau, bskeggs, akpm, linux-doc,
	linux-kernel, dri-devel, jhubbard, rcampbell, jglisse, hch,
	daniel, willy, bsingharora, Christoph Hellwig

On Tue, May 18, 2021 at 04:45:09PM -0300, Jason Gunthorpe wrote:
> On Tue, May 18, 2021 at 02:01:36PM -0400, Peter Xu wrote:
> > > > Indeed it'll be odd for a COW page since for COW page then it means after
> > > > parent/child writting to the page it'll clone into two, then it's a mistery on
> > > > which one will be the one that "exclusived owned" by the device..
> > > 
> > > For COW pages it is like every other fork case.. We can't reliably
> > > write-protect the device_exclusive page during fork so we must copy it
> > > at fork time.
> > > 
> > > Thus three reasonable choices:
> > >  - Copy to a new CPU page
> > >  - Migrate back to a CPU page and write protect it
> > >  - Copy to a new device exclusive page
> > 
> > IMHO the ownership question would really help us to answer this one..
> 
> I'm confused about what device ownership you are talking about

My question was more about the user scenario rather than anything related to
the kernel code, nor does it related to page struct at all.

Let me try to be a little bit more verbose...

Firstly, I think one simple solution to handle fork() of device exclusive ptes
is to do just like device private ptes: if COW we convert writable ptes into
readable ptes.  Then when CPU access happens (in either parent/child) page
restore triggers which will convert those readable ptes into read-only present
ptes (with the original page backing it).  Then do_wp_page() will take care of
page copy.

However... if you see that also means parent/child have the equal opportunity
to reuse that original page: who access first will do COW because refcount>1
for that page (note! it's possible that mapcount==1 here, as we drop mapcount
when converting to device exclusive ptes; however with the most recent
do_wp_page change from Linus where we'll also check page_count(), we'll still
do COW just like when this page was GUPed by someone else).  While that matters
because the device is writting to that original page only, not the COWed one.

Then here comes the ownership question: If we still want to have the parent
process behave like before it fork()ed, IMHO we must make sure that original
page (that exclusively owned by the device once) still belongs to the parent
process not the child.  That's why I think if that's the case we'd do early cow
in fork(), because it guarantees that.

I can't say I fully understand the whole picture, so sorry if I missed
something important there.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v8 5/8] mm: Device exclusive memory access
  2021-04-07  8:42 ` [PATCH v8 5/8] mm: Device exclusive memory access Alistair Popple
  2021-05-18  2:08   ` Peter Xu
@ 2021-05-18 21:16   ` Peter Xu
  2021-05-19 10:49     ` Alistair Popple
  1 sibling, 1 reply; 42+ messages in thread
From: Peter Xu @ 2021-05-18 21:16 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, nouveau, bskeggs, akpm, linux-doc, linux-kernel,
	dri-devel, jhubbard, rcampbell, jglisse, jgg, hch, daniel, willy,
	bsingharora, Christoph Hellwig

On Wed, Apr 07, 2021 at 06:42:35PM +1000, Alistair Popple wrote:

[...]

> +static bool try_to_protect(struct page *page, struct mm_struct *mm,
> +			   unsigned long address, void *arg)
> +{
> +	struct ttp_args ttp = {
> +		.mm = mm,
> +		.address = address,
> +		.arg = arg,
> +		.valid = false,
> +	};
> +	struct rmap_walk_control rwc = {
> +		.rmap_one = try_to_protect_one,
> +		.done = page_not_mapped,
> +		.anon_lock = page_lock_anon_vma_read,
> +		.arg = &ttp,
> +	};
> +
> +	/*
> +	 * Restrict to anonymous pages for now to avoid potential writeback
> +	 * issues.
> +	 */
> +	if (!PageAnon(page))
> +		return false;
> +
> +	/*
> +	 * During exec, a temporary VMA is setup and later moved.
> +	 * The VMA is moved under the anon_vma lock but not the
> +	 * page tables leading to a race where migration cannot
> +	 * find the migration ptes. Rather than increasing the
> +	 * locking requirements of exec(), migration skips
> +	 * temporary VMAs until after exec() completes.
> +	 */
> +	if (!PageKsm(page) && PageAnon(page))
> +		rwc.invalid_vma = invalid_migration_vma;
> +
> +	rmap_walk(page, &rwc);
> +
> +	return ttp.valid && !page_mapcount(page);
> +}

I raised a question in the other thread regarding fork():

https://lore.kernel.org/lkml/YKQjmtMo+YQGx%2FwZ@t490s/

While I suddenly noticed that we may have similar issues even if we fork()
before creating the ptes.

In that case, we may see multiple read-only ptes pointing to the same page.  We
will convert all of them into device exclusive read ptes in rmap_walk() above,
however how do we guarantee after all COW done in the parent and all the childs
processes, the device owned page will be returned to the parent?

E.g., if parent accesses the page earlier than the children processes
(actually, as long as not the last one), do_wp_page() will do COW for parent on
this page because refcount(page)>1, then the page seems to get lost to a random
child too..

To resolve all these complexity, not sure whether try_to_protect() could
enforce VM_DONTCOPY (needs madvise MADV_DONTFORK in the user app), meanwhile
make sure mapcount(page)==1 before granting the page to the device, so that
this will guarantee this mm owns this page forever, I think?  It'll simplify
fork() too as a side effect, since VM_DONTCOPY vma go away when fork.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v8 5/8] mm: Device exclusive memory access
  2021-05-18 20:29               ` Peter Xu
@ 2021-05-18 23:03                 ` Jason Gunthorpe
  2021-05-18 23:45                   ` Peter Xu
  0 siblings, 1 reply; 42+ messages in thread
From: Jason Gunthorpe @ 2021-05-18 23:03 UTC (permalink / raw)
  To: Peter Xu
  Cc: Alistair Popple, linux-mm, nouveau, bskeggs, akpm, linux-doc,
	linux-kernel, dri-devel, jhubbard, rcampbell, jglisse, hch,
	daniel, willy, bsingharora, Christoph Hellwig

On Tue, May 18, 2021 at 04:29:14PM -0400, Peter Xu wrote:
> On Tue, May 18, 2021 at 04:45:09PM -0300, Jason Gunthorpe wrote:
> > On Tue, May 18, 2021 at 02:01:36PM -0400, Peter Xu wrote:
> > > > > Indeed it'll be odd for a COW page since for COW page then it means after
> > > > > parent/child writting to the page it'll clone into two, then it's a mistery on
> > > > > which one will be the one that "exclusived owned" by the device..
> > > > 
> > > > For COW pages it is like every other fork case.. We can't reliably
> > > > write-protect the device_exclusive page during fork so we must copy it
> > > > at fork time.
> > > > 
> > > > Thus three reasonable choices:
> > > >  - Copy to a new CPU page
> > > >  - Migrate back to a CPU page and write protect it
> > > >  - Copy to a new device exclusive page
> > > 
> > > IMHO the ownership question would really help us to answer this one..
> > 
> > I'm confused about what device ownership you are talking about
> 
> My question was more about the user scenario rather than anything related to
> the kernel code, nor does it related to page struct at all.
> 
> Let me try to be a little bit more verbose...
> 
> Firstly, I think one simple solution to handle fork() of device exclusive ptes
> is to do just like device private ptes: if COW we convert writable ptes into
> readable ptes.  Then when CPU access happens (in either parent/child) page
> restore triggers which will convert those readable ptes into read-only present
> ptes (with the original page backing it).  Then do_wp_page() will take care of
> page copy.

I suspect it doesn't work. This is much more like pinning than
anything, the data in the page is still under active use by a device
and if we cannot globally write write protect it, both from CPU and
device access, then we cannot do COW. IIRC the mm can't trigger a full
global write protect through the pgmap?
 
> Then here comes the ownership question: If we still want to have the parent
> process behave like before it fork()ed, IMHO we must make sure that original
> page (that exclusively owned by the device once) still belongs to the parent
> process not the child.  That's why I think if that's the case we'd do early cow
> in fork(), because it guarantees that.

Logically during fork all these device exclusive pages should be
reverted back to their CPU pages, write protected and the CPU page PTE
copied to the fork.

We should not copy the device exclusive page PTE to the fork. I think
I pointed to this on an earlier rev..

We can optimize this into the various variants above, but logically
device exclusive stop existing during fork.

Jason


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v8 5/8] mm: Device exclusive memory access
  2021-05-18 23:03                 ` Jason Gunthorpe
@ 2021-05-18 23:45                   ` Peter Xu
  2021-05-19 11:04                     ` Alistair Popple
  2021-05-19 13:28                     ` Jason Gunthorpe
  0 siblings, 2 replies; 42+ messages in thread
From: Peter Xu @ 2021-05-18 23:45 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alistair Popple, linux-mm, nouveau, bskeggs, akpm, linux-doc,
	linux-kernel, dri-devel, jhubbard, rcampbell, jglisse, hch,
	daniel, willy, bsingharora, Christoph Hellwig

On Tue, May 18, 2021 at 08:03:27PM -0300, Jason Gunthorpe wrote:
> Logically during fork all these device exclusive pages should be
> reverted back to their CPU pages, write protected and the CPU page PTE
> copied to the fork.
> 
> We should not copy the device exclusive page PTE to the fork. I think
> I pointed to this on an earlier rev..

Agreed.  Though please see the question I posted in the other thread: now I am
not very sure whether we'll be able to mark a page as device exclusive if that
page has mapcount>1.

> 
> We can optimize this into the various variants above, but logically
> device exclusive stop existing during fork.

Makes sense, I think that's indeed what this patch did at least for the COW
case, so I think Alistair did address that comment.  It's just that I think we
need to drop the other !COW case (imho that should correspond to the changes in
copy_nonpresent_pte()) in this patch to guarantee it.

I also hope we don't make copy_pte_range() even more complicated just to do the
lock_page() right, so we could fail the fork() if the lock is hard to take.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v8 5/8] mm: Device exclusive memory access
  2021-05-18 21:16   ` Peter Xu
@ 2021-05-19 10:49     ` Alistair Popple
  2021-05-19 12:24       ` Peter Xu
  0 siblings, 1 reply; 42+ messages in thread
From: Alistair Popple @ 2021-05-19 10:49 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, nouveau, bskeggs, akpm, linux-doc, linux-kernel,
	dri-devel, jhubbard, rcampbell, jglisse, jgg, hch, daniel, willy,
	bsingharora, Christoph Hellwig

On Wednesday, 19 May 2021 7:16:38 AM AEST Peter Xu wrote:
> External email: Use caution opening links or attachments
> 
> 
> On Wed, Apr 07, 2021 at 06:42:35PM +1000, Alistair Popple wrote:
> 
> [...]
> 
> > +static bool try_to_protect(struct page *page, struct mm_struct *mm,
> > +                        unsigned long address, void *arg)
> > +{
> > +     struct ttp_args ttp = {
> > +             .mm = mm,
> > +             .address = address,
> > +             .arg = arg,
> > +             .valid = false,
> > +     };
> > +     struct rmap_walk_control rwc = {
> > +             .rmap_one = try_to_protect_one,
> > +             .done = page_not_mapped,
> > +             .anon_lock = page_lock_anon_vma_read,
> > +             .arg = &ttp,
> > +     };
> > +
> > +     /*
> > +      * Restrict to anonymous pages for now to avoid potential writeback
> > +      * issues.
> > +      */
> > +     if (!PageAnon(page))
> > +             return false;
> > +
> > +     /*
> > +      * During exec, a temporary VMA is setup and later moved.
> > +      * The VMA is moved under the anon_vma lock but not the
> > +      * page tables leading to a race where migration cannot
> > +      * find the migration ptes. Rather than increasing the
> > +      * locking requirements of exec(), migration skips
> > +      * temporary VMAs until after exec() completes.
> > +      */
> > +     if (!PageKsm(page) && PageAnon(page))
> > +             rwc.invalid_vma = invalid_migration_vma;
> > +
> > +     rmap_walk(page, &rwc);
> > +
> > +     return ttp.valid && !page_mapcount(page);
> > +}
> 
> I raised a question in the other thread regarding fork():
> 
> https://lore.kernel.org/lkml/YKQjmtMo+YQGx%2FwZ@t490s/
> 
> While I suddenly noticed that we may have similar issues even if we fork()
> before creating the ptes.
> 
> In that case, we may see multiple read-only ptes pointing to the same page. 
> We will convert all of them into device exclusive read ptes in rmap_walk()
> above, however how do we guarantee after all COW done in the parent and all
> the childs processes, the device owned page will be returned to the parent?

I assume you are talking about a fork() followed by a call to 
make_device_exclusive()? I think this should be ok because 
make_device_exclusive() always calls GUP with FOLL_WRITE both to break COW and 
because a device performing atomic operations needs to write to the page. I 
suppose a comment here highlighting the need to break COW to avoid this 
scenario would be useful though.

> E.g., if parent accesses the page earlier than the children processes
> (actually, as long as not the last one), do_wp_page() will do COW for parent
> on this page because refcount(page)>1, then the page seems to get lost to a
> random child too..
>
> To resolve all these complexity, not sure whether try_to_protect() could
> enforce VM_DONTCOPY (needs madvise MADV_DONTFORK in the user app), meanwhile
> make sure mapcount(page)==1 before granting the page to the device, so that
> this will guarantee this mm owns this page forever, I think?  It'll
> simplify fork() too as a side effect, since VM_DONTCOPY vma go away when
> fork.
> 
> --
> Peter Xu






^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v8 5/8] mm: Device exclusive memory access
  2021-05-18 23:45                   ` Peter Xu
@ 2021-05-19 11:04                     ` Alistair Popple
  2021-05-19 12:15                       ` Peter Xu
  2021-05-19 13:28                     ` Jason Gunthorpe
  1 sibling, 1 reply; 42+ messages in thread
From: Alistair Popple @ 2021-05-19 11:04 UTC (permalink / raw)
  To: Peter Xu
  Cc: Jason Gunthorpe, linux-mm, nouveau, bskeggs, akpm, linux-doc,
	linux-kernel, dri-devel, jhubbard, rcampbell, jglisse, hch,
	daniel, willy, bsingharora, Christoph Hellwig

On Wednesday, 19 May 2021 9:45:05 AM AEST Peter Xu wrote:
> External email: Use caution opening links or attachments
> 
> On Tue, May 18, 2021 at 08:03:27PM -0300, Jason Gunthorpe wrote:
> > Logically during fork all these device exclusive pages should be
> > reverted back to their CPU pages, write protected and the CPU page PTE
> > copied to the fork.
> > 
> > We should not copy the device exclusive page PTE to the fork. I think
> > I pointed to this on an earlier rev..
> 
> Agreed.  Though please see the question I posted in the other thread: now I
> am not very sure whether we'll be able to mark a page as device exclusive
> if that page has mapcount>1.
>
> > We can optimize this into the various variants above, but logically
> > device exclusive stop existing during fork.
> 
> Makes sense, I think that's indeed what this patch did at least for the COW
> case, so I think Alistair did address that comment.  It's just that I think
> we need to drop the other !COW case (imho that should correspond to the
> changes in copy_nonpresent_pte()) in this patch to guarantee it.

Right. The main change from v7 -> v8 was to remove device exclusive entries on 
fork instead of copying them. The change in copy_nonpresent_pte() is for the
!COW case. I think what you are getting at is given exclusive entries are 
(currently) only supported for PageAnon pages is_cow_mapping() will always be 
true and therefore the change to copy_nonpresent_pte() can be dropped. That 
logic seems reasonable so I will change the exclusive case in 
copy_nonpresent_pte() to a VM_WARN_ON.

> I also hope we don't make copy_pte_range() even more complicated just to do
> the lock_page() right, so we could fail the fork() if the lock is hard to
> take.

Failing fork() because we couldn't take a lock doesn't seem like the right 
approach though, especially as there is already existing code that retries. I 
get this adds complexity though, so would be happy to take a look at cleaning 
copy_pte_range() up in future.

> --
> Peter Xu






^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v8 5/8] mm: Device exclusive memory access
  2021-05-18 17:27       ` Peter Xu
  2021-05-18 17:33         ` Jason Gunthorpe
@ 2021-05-19 11:35         ` Alistair Popple
  2021-05-19 12:21           ` Peter Xu
  1 sibling, 1 reply; 42+ messages in thread
From: Alistair Popple @ 2021-05-19 11:35 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, nouveau, bskeggs, akpm, linux-doc, linux-kernel,
	dri-devel, jhubbard, rcampbell, jglisse, jgg, hch, daniel, willy,
	bsingharora, Christoph Hellwig

On Wednesday, 19 May 2021 3:27:42 AM AEST Peter Xu wrote:
> > > The odd part is the remote GUP should have walked the page table
> > > already, so since the target here is the vaddr to replace, the 1st page
> > > table walk should be able to both trylock/lock the page, then modify
> > > the pte with pgtable lock held, return the locked page, then walk the
> > > rmap again to remove all the rest of the ptes that are mapping to this
> > > page.  In that case before we call the rmap_walk() we know this must be
> > > the page we want to take care of, and no one will be able to restore
> > > the original mm pte either (as we're with the page lock).  Then we
> > > don't need this check, neither do we need ttp->address.
> > 
> > If I am understanding you correctly I think this would be similar to the
> > approach that was taken in v2. However it pretty much ended up being just
> > an open-coded version of gup which is useful anyway to fault the page in.
> I see.  For easier reference this is v2 patch 1:
> 
> https://lore.kernel.org/lkml/20210219020750.16444-2-apopple@nvidia.com/

Sorry, I should have been clearer and just included that reference for you.

> Indeed that looks like it, it's just that instead of grabbing the page only
> in hmm_exclusive_pmd() we can do the pte modification along the way to seal
> the whole thing (address/pte & page).  I saw Christoph and Jason commented
> in that patch, but not regarding to this approach.  So is there a reason
> that you switched?  Do you think it'll work?

I think the approach you are describing is similar to what 
migrate_vma_collect()/migrate_vma_unamp() does now and I think it could be 
made to work. I ended up going with the GUP+unmap approach in part because 
Christoph suggested it but primarily because it required less code especially 
given we needed to call something to fault the page in/break COW anyway (or 
extend what was there to call handle_mm_fault(), etc.).

> I have no strong opinion either, it's just not crystal clear why we'd need
> that ttp->address at all for a rmap walk along with that "valid" field.

It's purely to ensure the PTE pointing to the GUP page was replaced with an 
exclusive swap entry and that the mapping didn't change between calls.

> Meanwhile it should be slightly less efficient too to go with current
> approach, especially when the page array gets huge, I think: since there'll
> be longer time we do GUP before doing the rmap walk, so higher possibility
> that the GUPed pages got replaced for whatever reason.  Then the call to
> make_device_exclusive_range() will fail as a whole just for a single page
> replacement within the range.






^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v8 5/8] mm: Device exclusive memory access
  2021-05-19 11:04                     ` Alistair Popple
@ 2021-05-19 12:15                       ` Peter Xu
  2021-05-19 13:11                         ` Alistair Popple
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Xu @ 2021-05-19 12:15 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Jason Gunthorpe, linux-mm, nouveau, bskeggs, akpm, linux-doc,
	linux-kernel, dri-devel, jhubbard, rcampbell, jglisse, hch,
	daniel, willy, bsingharora, Christoph Hellwig

On Wed, May 19, 2021 at 09:04:53PM +1000, Alistair Popple wrote:
> Failing fork() because we couldn't take a lock doesn't seem like the right 
> approach though, especially as there is already existing code that retries. I 
> get this adds complexity though, so would be happy to take a look at cleaning 
> copy_pte_range() up in future.

Yes, I proposed that as this one won't affect any existing applications (unlike
the existing ones) but only new userspace driver apps that will use this new
atomic feature.

IMHO it'll be a pity to add extra complexity and maintainance burden into
fork() if only for keeping the "logical correctness of fork()" however the code
never triggers. If we start with trylock we'll know whether people will use it,
since people will complain with a reason when needed; however I still doubt
whether a sane userspace device driver should fork() within busy interaction
with the device underneath..

In all cases, please still consider to keep them in copy_nonpresent_pte() (and
if to rework, separating patches would be great).

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v8 5/8] mm: Device exclusive memory access
  2021-05-19 11:35         ` Alistair Popple
@ 2021-05-19 12:21           ` Peter Xu
  2021-05-19 12:46             ` Alistair Popple
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Xu @ 2021-05-19 12:21 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, nouveau, bskeggs, akpm, linux-doc, linux-kernel,
	dri-devel, jhubbard, rcampbell, jglisse, jgg, hch, daniel, willy,
	bsingharora, Christoph Hellwig

On Wed, May 19, 2021 at 09:35:10PM +1000, Alistair Popple wrote:
> I think the approach you are describing is similar to what 
> migrate_vma_collect()/migrate_vma_unamp() does now and I think it could be 
> made to work. I ended up going with the GUP+unmap approach in part because 
> Christoph suggested it but primarily because it required less code especially 
> given we needed to call something to fault the page in/break COW anyway (or 
> extend what was there to call handle_mm_fault(), etc.).

I see, thank. Would you mind add a short paragraph in the commit message
talking about these two solutions and why we choose the GUP way?

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v8 5/8] mm: Device exclusive memory access
  2021-05-19 10:49     ` Alistair Popple
@ 2021-05-19 12:24       ` Peter Xu
  2021-05-19 12:46         ` Alistair Popple
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Xu @ 2021-05-19 12:24 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, nouveau, bskeggs, akpm, linux-doc, linux-kernel,
	dri-devel, jhubbard, rcampbell, jglisse, jgg, hch, daniel, willy,
	bsingharora, Christoph Hellwig

On Wed, May 19, 2021 at 08:49:01PM +1000, Alistair Popple wrote:
> On Wednesday, 19 May 2021 7:16:38 AM AEST Peter Xu wrote:
> > External email: Use caution opening links or attachments
> > 
> > 
> > On Wed, Apr 07, 2021 at 06:42:35PM +1000, Alistair Popple wrote:
> > 
> > [...]
> > 
> > > +static bool try_to_protect(struct page *page, struct mm_struct *mm,
> > > +                        unsigned long address, void *arg)
> > > +{
> > > +     struct ttp_args ttp = {
> > > +             .mm = mm,
> > > +             .address = address,
> > > +             .arg = arg,
> > > +             .valid = false,
> > > +     };
> > > +     struct rmap_walk_control rwc = {
> > > +             .rmap_one = try_to_protect_one,
> > > +             .done = page_not_mapped,
> > > +             .anon_lock = page_lock_anon_vma_read,
> > > +             .arg = &ttp,
> > > +     };
> > > +
> > > +     /*
> > > +      * Restrict to anonymous pages for now to avoid potential writeback
> > > +      * issues.
> > > +      */
> > > +     if (!PageAnon(page))
> > > +             return false;
> > > +
> > > +     /*
> > > +      * During exec, a temporary VMA is setup and later moved.
> > > +      * The VMA is moved under the anon_vma lock but not the
> > > +      * page tables leading to a race where migration cannot
> > > +      * find the migration ptes. Rather than increasing the
> > > +      * locking requirements of exec(), migration skips
> > > +      * temporary VMAs until after exec() completes.
> > > +      */
> > > +     if (!PageKsm(page) && PageAnon(page))
> > > +             rwc.invalid_vma = invalid_migration_vma;
> > > +
> > > +     rmap_walk(page, &rwc);
> > > +
> > > +     return ttp.valid && !page_mapcount(page);
> > > +}
> > 
> > I raised a question in the other thread regarding fork():
> > 
> > https://lore.kernel.org/lkml/YKQjmtMo+YQGx%2FwZ@t490s/
> > 
> > While I suddenly noticed that we may have similar issues even if we fork()
> > before creating the ptes.
> > 
> > In that case, we may see multiple read-only ptes pointing to the same page. 
> > We will convert all of them into device exclusive read ptes in rmap_walk()
> > above, however how do we guarantee after all COW done in the parent and all
> > the childs processes, the device owned page will be returned to the parent?
> 
> I assume you are talking about a fork() followed by a call to 
> make_device_exclusive()? I think this should be ok because 
> make_device_exclusive() always calls GUP with FOLL_WRITE both to break COW and 
> because a device performing atomic operations needs to write to the page. I 
> suppose a comment here highlighting the need to break COW to avoid this 
> scenario would be useful though.

Indeed, sorry for the false alarm!  Yes it would be great to mention that too.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v8 3/8] mm/rmap: Split try_to_munlock from try_to_unmap
  2021-05-18 20:04   ` Liam Howlett
@ 2021-05-19 12:38     ` Alistair Popple
  2021-05-20 20:24       ` Liam Howlett
  0 siblings, 1 reply; 42+ messages in thread
From: Alistair Popple @ 2021-05-19 12:38 UTC (permalink / raw)
  To: Liam Howlett, Hugh Dickins
  Cc: linux-mm, nouveau, bskeggs, akpm, linux-doc, linux-kernel,
	dri-devel, jhubbard, rcampbell, jglisse, jgg, hch, daniel, willy,
	bsingharora, Christoph Hellwig, Shakeel Butt

On Wednesday, 19 May 2021 6:04:51 AM AEST Liam Howlett wrote:
> External email: Use caution opening links or attachments
> 
> * Alistair Popple <apopple@nvidia.com> [210407 04:43]:
> > The behaviour of try_to_unmap_one() is difficult to follow because it
> > performs different operations based on a fairly large set of flags used
> > in different combinations.
> > 
> > TTU_MUNLOCK is one such flag. However it is exclusively used by
> > try_to_munlock() which specifies no other flags. Therefore rather than
> > overload try_to_unmap_one() with unrelated behaviour split this out into
> > it's own function and remove the flag.
> > 
> > Signed-off-by: Alistair Popple <apopple@nvidia.com>
> > Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > 
> > ---
> > 
> > v8:
> > * Renamed try_to_munlock to page_mlock to better reflect what the
> > 
> >   function actually does.
> > 
> > * Removed the TODO from the documentation that this patch addresses.
> > 
> > v7:
> > * Added Christoph's Reviewed-by
> > 
> > v4:
> > * Removed redundant check for VM_LOCKED
> > ---
> > 
> >  Documentation/vm/unevictable-lru.rst | 33 ++++++++-----------
> >  include/linux/rmap.h                 |  3 +-
> >  mm/mlock.c                           | 10 +++---
> >  mm/rmap.c                            | 48 +++++++++++++++++++++-------
> >  4 files changed, 55 insertions(+), 39 deletions(-)
> > 
> > diff --git a/Documentation/vm/unevictable-lru.rst
> > b/Documentation/vm/unevictable-lru.rst index 0e1490524f53..eae3af17f2d9
> > 100644
> > --- a/Documentation/vm/unevictable-lru.rst
> > +++ b/Documentation/vm/unevictable-lru.rst
> > @@ -389,14 +389,14 @@ mlocked, munlock_vma_page() updates that zone
> > statistics for the number of> 
> >  mlocked pages.  Note, however, that at this point we haven't checked
> >  whether the page is mapped by other VM_LOCKED VMAs.
> > 
> > -We can't call try_to_munlock(), the function that walks the reverse map
> > to
> > +We can't call page_mlock(), the function that walks the reverse map to
> > 
> >  check for other VM_LOCKED VMAs, without first isolating the page from the
> >  LRU.> 
> > -try_to_munlock() is a variant of try_to_unmap() and thus requires that
> > the page +page_mlock() is a variant of try_to_unmap() and thus requires
> > that the page> 
> >  not be on an LRU list [more on these below].  However, the call to
> > 
> > -isolate_lru_page() could fail, in which case we couldn't
> > try_to_munlock().  So, +isolate_lru_page() could fail, in which case we
> > can't call page_mlock().  So,> 
> >  we go ahead and clear PG_mlocked up front, as this might be the only
> >  chance we> 
> > -have.  If we can successfully isolate the page, we go ahead and
> > -try_to_munlock(), which will restore the PG_mlocked flag and update the
> > zone +have.  If we can successfully isolate the page, we go ahead and
> > call +page_mlock(), which will restore the PG_mlocked flag and update the
> > zone> 
> >  page statistics if it finds another VMA holding the page mlocked.  If we
> >  fail to isolate the page, we'll have left a potentially mlocked page on
> >  the LRU. This is fine, because we'll catch it later if and if vmscan
> >  tries to reclaim> 
> > @@ -545,31 +545,24 @@ munlock or munmap system calls, mm teardown
> > (munlock_vma_pages_all), reclaim,> 
> >  holepunching, and truncation of file pages and their anonymous COWed
> >  pages.
> > 
> > -try_to_munlock() Reverse Map Scan
> > +page_mlock() Reverse Map Scan
> > 
> >  ---------------------------------
> > 
> > -.. warning::
> > -   [!] TODO/FIXME: a better name might be page_mlocked() - analogous to
> > the -   page_referenced() reverse map walker.
> > -
> > 
> >  When munlock_vma_page() [see section :ref:`munlock()/munlockall() System
> >  Call Handling <munlock_munlockall_handling>` above] tries to munlock a
> >  page, it needs to determine whether or not the page is mapped by any
> >  VM_LOCKED VMA without actually attempting to unmap all PTEs from the
> >  page.  For this purpose, the unevictable/mlock infrastructure
> > 
> > -introduced a variant of try_to_unmap() called try_to_munlock().
> > +introduced a variant of try_to_unmap() called page_mlock().
> > 
> > -try_to_munlock() calls the same functions as try_to_unmap() for anonymous
> > and -mapped file and KSM pages with a flag argument specifying unlock
> > versus unmap -processing.  Again, these functions walk the respective
> > reverse maps looking -for VM_LOCKED VMAs.  When such a VMA is found, as
> > in the try_to_unmap() case, -the functions mlock the page via
> > mlock_vma_page() and return SWAP_MLOCK.  This -undoes the pre-clearing of
> > the page's PG_mlocked done by munlock_vma_page. +page_mlock() walks the
> > respective reverse maps looking for VM_LOCKED VMAs. When +such a VMA is
> > found the page is mlocked via mlock_vma_page(). This undoes the
> > +pre-clearing of the page's PG_mlocked done by munlock_vma_page.
> > 
> > -Note that try_to_munlock()'s reverse map walk must visit every VMA in a
> > page's +Note that page_mlock()'s reverse map walk must visit every VMA in
> > a page's> 
> >  reverse map to determine that a page is NOT mapped into any VM_LOCKED
> >  VMA.
> >  However, the scan can terminate when it encounters a VM_LOCKED VMA.
> > 
> > -Although try_to_munlock() might be called a great many times when
> > munlocking a +Although page_mlock() might be called a great many times
> > when munlocking a> 
> >  large region or tearing down a large address space that has been mlocked
> >  via mlockall(), overall this is a fairly rare event.
> > 
> > @@ -602,7 +595,7 @@ inactive lists to the appropriate node's unevictable
> > list.> 
> >  shrink_inactive_list() should only see SHM_LOCK'd pages that became
> >  SHM_LOCK'd after shrink_active_list() had moved them to the inactive
> >  list, or pages mapped into VM_LOCKED VMAs that munlock_vma_page()
> >  couldn't isolate from the LRU to> 
> > -recheck via try_to_munlock().  shrink_inactive_list() won't notice the
> > latter, +recheck via page_mlock().  shrink_inactive_list() won't notice
> > the latter,> 
> >  but will pass on to shrink_page_list().
> >  
> >  shrink_page_list() again culls obviously unevictable pages that it could
> > 
> > diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> > index def5c62c93b3..38a746787c2f 100644
> > --- a/include/linux/rmap.h
> > +++ b/include/linux/rmap.h
> > @@ -87,7 +87,6 @@ struct anon_vma_chain {
> > 
> >  enum ttu_flags {
> >  
> >       TTU_MIGRATION           = 0x1,  /* migration mode */
> > 
> > -     TTU_MUNLOCK             = 0x2,  /* munlock mode */
> > 
> >       TTU_SPLIT_HUGE_PMD      = 0x4,  /* split huge PMD if any */
> >       TTU_IGNORE_MLOCK        = 0x8,  /* ignore mlock */
> > 
> > @@ -239,7 +238,7 @@ int page_mkclean(struct page *);
> > 
> >   * called in munlock()/munmap() path to check for other vmas holding
> >   * the page mlocked.
> >   */
> > 
> > -void try_to_munlock(struct page *);
> > +void page_mlock(struct page *page);
> > 
> >  void remove_migration_ptes(struct page *old, struct page *new, bool
> >  locked);> 
> > diff --git a/mm/mlock.c b/mm/mlock.c
> > index f8f8cc32d03d..9b8b82cfbbff 100644
> > --- a/mm/mlock.c
> > +++ b/mm/mlock.c
> > @@ -108,7 +108,7 @@ void mlock_vma_page(struct page *page)
> > 
> >  /*
> >  
> >   * Finish munlock after successful page isolation
> >   *
> > 
> > - * Page must be locked. This is a wrapper for try_to_munlock()
> > + * Page must be locked. This is a wrapper for page_mlock()
> > 
> >   * and putback_lru_page() with munlock accounting.
> >   */
> >  
> >  static void __munlock_isolated_page(struct page *page)
> > 
> > @@ -118,7 +118,7 @@ static void __munlock_isolated_page(struct page *page)
> > 
> >        * and we don't need to check all the other vmas.
> >        */
> >       
> >       if (page_mapcount(page) > 1)
> > 
> > -             try_to_munlock(page);
> > +             page_mlock(page);
> > 
> >       /* Did try_to_unlock() succeed or punt? */
> >       if (!PageMlocked(page))
> > 
> > @@ -158,7 +158,7 @@ static void __munlock_isolation_failed(struct page
> > *page)> 
> >   * munlock()ed or munmap()ed, we want to check whether other vmas hold
> >   the
> >   * page locked so that we can leave it on the unevictable lru list and
> >   not
> >   * bother vmscan with it.  However, to walk the page's rmap list in
> > 
> > - * try_to_munlock() we must isolate the page from the LRU.  If some other
> > + * page_mlock() we must isolate the page from the LRU.  If some other
> > 
> >   * task has removed the page from the LRU, we won't be able to do that.
> >   * So we clear the PageMlocked as we might not get another chance.  If we
> >   * can't isolate the page, we leave it for putback_lru_page() and vmscan
> > 
> > @@ -168,7 +168,7 @@ unsigned int munlock_vma_page(struct page *page)
> > 
> >  {
> >  
> >       int nr_pages;
> > 
> > -     /* For try_to_munlock() and to serialize with page migration */
> > +     /* For page_mlock() and to serialize with page migration */
> > 
> >       BUG_ON(!PageLocked(page));
> >       VM_BUG_ON_PAGE(PageTail(page), page);
> > 
> > @@ -205,7 +205,7 @@ static int __mlock_posix_error_return(long retval)
> > 
> >   *
> >   * The fast path is available only for evictable pages with single
> >   mapping.
> >   * Then we can bypass the per-cpu pvec and get better performance.
> > 
> > - * when mapcount > 1 we need try_to_munlock() which can fail.
> > + * when mapcount > 1 we need page_mlock() which can fail.
> > 
> >   * when !page_evictable(), we need the full redo logic of
> >   putback_lru_page to * avoid leaving evictable page in unevictable list.
> >   *
> > 
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 977e70803ed8..f09d522725b9 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1405,10 +1405,6 @@ static bool try_to_unmap_one(struct page *page,
> > struct vm_area_struct *vma,> 
> >       struct mmu_notifier_range range;
> >       enum ttu_flags flags = (enum ttu_flags)(long)arg;
> > 
> > -     /* munlock has nothing to gain from examining un-locked vmas */
> > -     if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
> > -             return true;
> > -
> > 
> >       if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
> >       
> >           is_zone_device_page(page) && !is_device_private_page(page))
> >           
> >               return true;
> > 
> > @@ -1469,8 +1465,6 @@ static bool try_to_unmap_one(struct page *page,
> > struct vm_area_struct *vma,> 
> >                               page_vma_mapped_walk_done(&pvmw);
> >                               break;
> >                       
> >                       }
> > 
> > -                     if (flags & TTU_MUNLOCK)
> > -                             continue;
> > 
> >               }
> >               
> >               /* Unexpected PMD-mapped THP? */
> > 
> > @@ -1784,8 +1778,39 @@ bool try_to_unmap(struct page *page, enum ttu_flags
> > flags)> 
> >       return !page_mapcount(page) ? true : false;
> >  
> >  }
> 
> Please add a comment here, especially around locking.
> 
> > +static bool page_mlock_one(struct page *page, struct vm_area_struct *vma,
> > +                              unsigned long address, void *arg)
> > +{
> > +     struct page_vma_mapped_walk pvmw = {
> > +             .page = page,
> > +             .vma = vma,
> > +             .address = address,
> > +     };
> > +
> > +     /* munlock has nothing to gain from examining un-locked vmas */
> > +     if (!(vma->vm_flags & VM_LOCKED))
> > +             return true;
> 
> The logic here doesn't make sense.  You called page_mlock_one() on a VMA
> that isn't locked and it returns true?  Is this a check to see if the
> VMA has zero mlock'ed pages?

I'm pretty sure the logic is correct. This is used for an rmap_walk, so we 
return true to continue the page table scan to see if other VMAs have the page 
locked.

> > +
> > +     while (page_vma_mapped_walk(&pvmw)) {
> > +             /* PTE-mapped THP are never mlocked */
> > +             if (!PageTransCompound(page)) {
> > +                     /*
> > +                      * Holding pte lock, we do *not* need
> > +                      * mmap_lock here
> > +                      */
> 
> Are you sure?  I think you at least need to hold the mmap lock for
> reading to ensure there's no race here?  mlock_vma_page() eludes to such
> a scenario when lazy mlocking.

Not really. I don't claim to be an mlock expert but as this is a clean-up for 
try_to_unmap() the intent was to not change existing behaviour.

However presenting the function in this simplified form did raise this and 
some other questions during previous reviews - see https://lore.kernel.org/
dri-devel/20210331115746.GA1463678@nvidia.com/ for the previous discussion.

To answer the questions around locking though I did do some git sha1 mining. 
The best explanation seemed to be contained in https://git.kernel.org/pub/scm/
linux/kernel/git/torvalds/linux.git/commit/?
id=b87537d9e2feb30f6a962f27eb32768682698d3b from Hugh (whom I've added again 
here in case he can help answer some of these).

> The mmap_lock is held for writing in the scenarios I have checked.
> 
> > +                     mlock_vma_page(page);
> > +             }
> > +             page_vma_mapped_walk_done(&pvmw);
> > +
> > +             /* found a mlocked page, no point continuing munlock check
> > */
> 
> I think you need to check_pte() to be sure it is mapped?
> 
> > +             return false;
> > +     }
> > +
> > +     return true;
> 
> Again, I don't get the return values.  If page_mlock_one() returns true,
> I'd expect for my page to now be locked.  This isn't the case here,
> page_mlock_one() returns true if there are no pages present for a locked
> VMA, correct?

This is used for an rmap walk, if we were able to mlock the page we return 
false to stop the walk as there is no need to examine other VMAs if the page 
has already been mlocked.

> > +}
> > +
> > 
> >  /**
> > 
> > - * try_to_munlock - try to munlock a page
> > + * page_mlock - try to munlock a page
> 
> Is this an mlock or an munlock?  I'm not confident it's either, but more
> of a check to see if there are pages mapped in a locked VMA?

It is called (only) from the munlock code to check a page does not need to be 
mlocked, or to mlock it if it does.

> >   * @page: the page to be munlocked
> >   *
> >   * Called from munlock code.  Checks all of the VMAs mapping the page
> > 
> > @@ -1793,11 +1818,10 @@ bool try_to_unmap(struct page *page, enum
> > ttu_flags flags)> 
> >   * returned with PG_mlocked cleared if no other vmas have it mlocked.
> >   */
> > 
> > -void try_to_munlock(struct page *page)
> > +void page_mlock(struct page *page)
> > 
> >  {
> >  
> >       struct rmap_walk_control rwc = {
> > 
> > -             .rmap_one = try_to_unmap_one,
> > -             .arg = (void *)TTU_MUNLOCK,
> > +             .rmap_one = page_mlock_one,
> > 
> >               .done = page_not_mapped,
> >               .anon_lock = page_lock_anon_vma_read,
> > 
> > @@ -1849,7 +1873,7 @@ static struct anon_vma *rmap_walk_anon_lock(struct
> > page *page,> 
> >   * Find all the mappings of a page using the mapping pointer and the vma
> >   chains * contained in the anon_vma struct it points to.
> >   *
> > 
> > - * When called from try_to_munlock(), the mmap_lock of the mm containing
> > the vma + * When called from page_mlock(), the mmap_lock of the mm
> > containing the vma> 
> >   * where the page was found will be held for write.  So, we won't recheck
> >   * vm_flags for that VMA.  That should be OK, because that vma shouldn't
> >   be
> >   * LOCKED.
> > 
> > @@ -1901,7 +1925,7 @@ static void rmap_walk_anon(struct page *page, struct
> > rmap_walk_control *rwc,> 
> >   * Find all the mappings of a page using the mapping pointer and the vma
> >   chains * contained in the address_space struct it points to.
> >   *
> > 
> > - * When called from try_to_munlock(), the mmap_lock of the mm containing
> > the vma + * When called from page_mlock(), the mmap_lock of the mm
> > containing the vma> 
> >   * where the page was found will be held for write.  So, we won't recheck
> 
> Above it is stated that the lock does not need to be held, but this
> comment says it is already held for writing - which is it?

I will have to check.

> >   * vm_flags for that VMA.  That should be OK, because that vma shouldn't
> >   be
> >   * LOCKED.
> > 
> > --
> > 2.20.1
> 
> munlock_vma_pages_range() comments references try_to_{munlock|unmap}

Thanks, I noticed that too when I was rereading it just now. Will fix that up.

 - Alistair





^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v8 5/8] mm: Device exclusive memory access
  2021-05-19 12:24       ` Peter Xu
@ 2021-05-19 12:46         ` Alistair Popple
  0 siblings, 0 replies; 42+ messages in thread
From: Alistair Popple @ 2021-05-19 12:46 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, nouveau, bskeggs, akpm, linux-doc, linux-kernel,
	dri-devel, jhubbard, rcampbell, jglisse, jgg, hch, daniel, willy,
	bsingharora, Christoph Hellwig

On Wednesday, 19 May 2021 10:24:27 PM AEST Peter Xu wrote:
> External email: Use caution opening links or attachments
> 
> On Wed, May 19, 2021 at 08:49:01PM +1000, Alistair Popple wrote:
> > On Wednesday, 19 May 2021 7:16:38 AM AEST Peter Xu wrote:
> > > External email: Use caution opening links or attachments
> > > 
> > > 
> > > On Wed, Apr 07, 2021 at 06:42:35PM +1000, Alistair Popple wrote:
> > > 
> > > [...]
> > > 
> > > > +static bool try_to_protect(struct page *page, struct mm_struct *mm,
> > > > +                        unsigned long address, void *arg)
> > > > +{
> > > > +     struct ttp_args ttp = {
> > > > +             .mm = mm,
> > > > +             .address = address,
> > > > +             .arg = arg,
> > > > +             .valid = false,
> > > > +     };
> > > > +     struct rmap_walk_control rwc = {
> > > > +             .rmap_one = try_to_protect_one,
> > > > +             .done = page_not_mapped,
> > > > +             .anon_lock = page_lock_anon_vma_read,
> > > > +             .arg = &ttp,
> > > > +     };
> > > > +
> > > > +     /*
> > > > +      * Restrict to anonymous pages for now to avoid potential
> > > > writeback
> > > > +      * issues.
> > > > +      */
> > > > +     if (!PageAnon(page))
> > > > +             return false;
> > > > +
> > > > +     /*
> > > > +      * During exec, a temporary VMA is setup and later moved.
> > > > +      * The VMA is moved under the anon_vma lock but not the
> > > > +      * page tables leading to a race where migration cannot
> > > > +      * find the migration ptes. Rather than increasing the
> > > > +      * locking requirements of exec(), migration skips
> > > > +      * temporary VMAs until after exec() completes.
> > > > +      */
> > > > +     if (!PageKsm(page) && PageAnon(page))
> > > > +             rwc.invalid_vma = invalid_migration_vma;
> > > > +
> > > > +     rmap_walk(page, &rwc);
> > > > +
> > > > +     return ttp.valid && !page_mapcount(page);
> > > > +}
> > > 
> > > I raised a question in the other thread regarding fork():
> > > 
> > > https://lore.kernel.org/lkml/YKQjmtMo+YQGx%2FwZ@t490s/
> > > 
> > > While I suddenly noticed that we may have similar issues even if we
> > > fork()
> > > before creating the ptes.
> > > 
> > > In that case, we may see multiple read-only ptes pointing to the same
> > > page.
> > > We will convert all of them into device exclusive read ptes in
> > > rmap_walk()
> > > above, however how do we guarantee after all COW done in the parent and
> > > all
> > > the childs processes, the device owned page will be returned to the
> > > parent?
> > 
> > I assume you are talking about a fork() followed by a call to
> > make_device_exclusive()? I think this should be ok because
> > make_device_exclusive() always calls GUP with FOLL_WRITE both to break COW
> > and because a device performing atomic operations needs to write to the
> > page. I suppose a comment here highlighting the need to break COW to
> > avoid this scenario would be useful though.
> 
> Indeed, sorry for the false alarm!  Yes it would be great to mention that
> too.

No problem! Thanks for the comments.

> --
> Peter Xu






^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v8 5/8] mm: Device exclusive memory access
  2021-05-19 12:21           ` Peter Xu
@ 2021-05-19 12:46             ` Alistair Popple
  0 siblings, 0 replies; 42+ messages in thread
From: Alistair Popple @ 2021-05-19 12:46 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, nouveau, bskeggs, akpm, linux-doc, linux-kernel,
	dri-devel, jhubbard, rcampbell, jglisse, jgg, hch, daniel, willy,
	bsingharora, Christoph Hellwig

On Wednesday, 19 May 2021 10:21:08 PM AEST Peter Xu wrote:
> External email: Use caution opening links or attachments
> 
> On Wed, May 19, 2021 at 09:35:10PM +1000, Alistair Popple wrote:
> > I think the approach you are describing is similar to what
> > migrate_vma_collect()/migrate_vma_unamp() does now and I think it could be
> > made to work. I ended up going with the GUP+unmap approach in part because
> > Christoph suggested it but primarily because it required less code
> > especially given we needed to call something to fault the page in/break
> > COW anyway (or extend what was there to call handle_mm_fault(), etc.).
> 
> I see, thank. Would you mind add a short paragraph in the commit message
> talking about these two solutions and why we choose the GUP way?

Sure.

 - Alistair

> --
> Peter Xu






^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v8 5/8] mm: Device exclusive memory access
  2021-05-19 12:15                       ` Peter Xu
@ 2021-05-19 13:11                         ` Alistair Popple
  2021-05-19 14:04                           ` Peter Xu
  0 siblings, 1 reply; 42+ messages in thread
From: Alistair Popple @ 2021-05-19 13:11 UTC (permalink / raw)
  To: Peter Xu
  Cc: Jason Gunthorpe, linux-mm, nouveau, bskeggs, akpm, linux-doc,
	linux-kernel, dri-devel, jhubbard, rcampbell, jglisse, hch,
	daniel, willy, bsingharora, Christoph Hellwig

On Wednesday, 19 May 2021 10:15:41 PM AEST Peter Xu wrote:
> External email: Use caution opening links or attachments
> 
> On Wed, May 19, 2021 at 09:04:53PM +1000, Alistair Popple wrote:
> > Failing fork() because we couldn't take a lock doesn't seem like the right
> > approach though, especially as there is already existing code that
> > retries. I get this adds complexity though, so would be happy to take a
> > look at cleaning copy_pte_range() up in future.
> 
> Yes, I proposed that as this one won't affect any existing applications
> (unlike the existing ones) but only new userspace driver apps that will use
> this new atomic feature.
> 
> IMHO it'll be a pity to add extra complexity and maintainance burden into
> fork() if only for keeping the "logical correctness of fork()" however the
> code never triggers. If we start with trylock we'll know whether people
> will use it, since people will complain with a reason when needed; however
> I still doubt whether a sane userspace device driver should fork() within
> busy interaction with the device underneath..

I will refrain from commenting on the sanity or otherwise of doing that :-)

Agree such a scenario seems unlikely in practice (and possibly unreasonable). 
Keeping the "logical correctness of fork()" still seems worthwhile to me, but 
if the added complexity/maintenance burden for an admittedly fairly specific 
feature is going to stop progress here I am happy to take the fail fork 
approach. I could then possibly fix it up as a future clean up to 
copy_pte_range(). Perhaps others have thoughts?

> In all cases, please still consider to keep them in copy_nonpresent_pte()
> (and if to rework, separating patches would be great).
>
> Thanks,
> 
> --
> Peter Xu






^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v8 5/8] mm: Device exclusive memory access
  2021-05-18 23:45                   ` Peter Xu
  2021-05-19 11:04                     ` Alistair Popple
@ 2021-05-19 13:28                     ` Jason Gunthorpe
  2021-05-19 14:09                       ` Peter Xu
  1 sibling, 1 reply; 42+ messages in thread
From: Jason Gunthorpe @ 2021-05-19 13:28 UTC (permalink / raw)
  To: Peter Xu
  Cc: Alistair Popple, linux-mm, nouveau, bskeggs, akpm, linux-doc,
	linux-kernel, dri-devel, jhubbard, rcampbell, jglisse, hch,
	daniel, willy, bsingharora, Christoph Hellwig

On Tue, May 18, 2021 at 07:45:05PM -0400, Peter Xu wrote:
> On Tue, May 18, 2021 at 08:03:27PM -0300, Jason Gunthorpe wrote:
> > Logically during fork all these device exclusive pages should be
> > reverted back to their CPU pages, write protected and the CPU page PTE
> > copied to the fork.
> > 
> > We should not copy the device exclusive page PTE to the fork. I think
> > I pointed to this on an earlier rev..
> 
> Agreed.  Though please see the question I posted in the other thread: now I am
> not very sure whether we'll be able to mark a page as device exclusive if that
> page has mapcount>1.

IMHO it is similar to write protect done by filesystems on shared
mappings - all VMAs with a copy of the CPU page have to get switched
to the device exclusive PTE. This is why the rmap stuff is involved in
the migration helpers

Jason


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v8 5/8] mm: Device exclusive memory access
  2021-05-19 13:11                         ` Alistair Popple
@ 2021-05-19 14:04                           ` Peter Xu
  0 siblings, 0 replies; 42+ messages in thread
From: Peter Xu @ 2021-05-19 14:04 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Jason Gunthorpe, linux-mm, nouveau, bskeggs, akpm, linux-doc,
	linux-kernel, dri-devel, jhubbard, rcampbell, jglisse, hch,
	daniel, willy, bsingharora, Christoph Hellwig

On Wed, May 19, 2021 at 11:11:55PM +1000, Alistair Popple wrote:
> On Wednesday, 19 May 2021 10:15:41 PM AEST Peter Xu wrote:
> > External email: Use caution opening links or attachments
> > 
> > On Wed, May 19, 2021 at 09:04:53PM +1000, Alistair Popple wrote:
> > > Failing fork() because we couldn't take a lock doesn't seem like the right
> > > approach though, especially as there is already existing code that
> > > retries. I get this adds complexity though, so would be happy to take a
> > > look at cleaning copy_pte_range() up in future.
> > 
> > Yes, I proposed that as this one won't affect any existing applications
> > (unlike the existing ones) but only new userspace driver apps that will use
> > this new atomic feature.
> > 
> > IMHO it'll be a pity to add extra complexity and maintainance burden into
> > fork() if only for keeping the "logical correctness of fork()" however the
> > code never triggers. If we start with trylock we'll know whether people
> > will use it, since people will complain with a reason when needed; however
> > I still doubt whether a sane userspace device driver should fork() within
> > busy interaction with the device underneath..
> 
> I will refrain from commenting on the sanity or otherwise of doing that :-)
> 
> Agree such a scenario seems unlikely in practice (and possibly unreasonable). 
> Keeping the "logical correctness of fork()" still seems worthwhile to me, but 
> if the added complexity/maintenance burden for an admittedly fairly specific 
> feature is going to stop progress here I am happy to take the fail fork 
> approach. I could then possibly fix it up as a future clean up to 
> copy_pte_range(). Perhaps others have thoughts?

Yes, it's more about making this series easier to be accepted, and it'll be
great to have others' input.

Btw, just to mention that I don't even think fail fork() on failed trylock() is
against "logical correctness of fork()": IMHO it's still 100% correct just like
most syscalls can return with -EAGAIN, that suggests the userspace to try again
the syscall, and I hope that also works for fork().  I'd be more than glad to
be corrected too.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v8 5/8] mm: Device exclusive memory access
  2021-05-19 13:28                     ` Jason Gunthorpe
@ 2021-05-19 14:09                       ` Peter Xu
  2021-05-19 18:11                         ` Jason Gunthorpe
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Xu @ 2021-05-19 14:09 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alistair Popple, linux-mm, nouveau, bskeggs, akpm, linux-doc,
	linux-kernel, dri-devel, jhubbard, rcampbell, jglisse, hch,
	daniel, willy, bsingharora, Christoph Hellwig

On Wed, May 19, 2021 at 10:28:42AM -0300, Jason Gunthorpe wrote:
> On Tue, May 18, 2021 at 07:45:05PM -0400, Peter Xu wrote:
> > On Tue, May 18, 2021 at 08:03:27PM -0300, Jason Gunthorpe wrote:
> > > Logically during fork all these device exclusive pages should be
> > > reverted back to their CPU pages, write protected and the CPU page PTE
> > > copied to the fork.
> > > 
> > > We should not copy the device exclusive page PTE to the fork. I think
> > > I pointed to this on an earlier rev..
> > 
> > Agreed.  Though please see the question I posted in the other thread: now I am
> > not very sure whether we'll be able to mark a page as device exclusive if that
> > page has mapcount>1.
> 
> IMHO it is similar to write protect done by filesystems on shared
> mappings - all VMAs with a copy of the CPU page have to get switched
> to the device exclusive PTE. This is why the rmap stuff is involved in
> the migration helpers

Right, I think Alistair corrected me there that I missed the early COW
happening in GUP.

Actually even without that GUP triggering early COW it won't be a problem,
because as long as one child mm restored the pte from exclusive to normal
(before any further COW happens) device exclusiveness is broken in the mmu
notifiers, and after that point all previous-exclusive ptes actually becomes
the same as a very normal PageAnon.  Then it's very sane to even not have the
original page in parent process, because we know each COWed page will contain
all the device atomic modifications (so we don't really have the requirement to
return the original page to parent).

Sorry for the noise.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v8 5/8] mm: Device exclusive memory access
  2021-05-19 14:09                       ` Peter Xu
@ 2021-05-19 18:11                         ` Jason Gunthorpe
  0 siblings, 0 replies; 42+ messages in thread
From: Jason Gunthorpe @ 2021-05-19 18:11 UTC (permalink / raw)
  To: Peter Xu
  Cc: Alistair Popple, linux-mm, nouveau, bskeggs, akpm, linux-doc,
	linux-kernel, dri-devel, jhubbard, rcampbell, jglisse, hch,
	daniel, willy, bsingharora, Christoph Hellwig

> Sorry for the noise.

Not at all, it is good that more people understand things!

Jason


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v8 3/8] mm/rmap: Split try_to_munlock from try_to_unmap
  2021-05-19 12:38     ` Alistair Popple
@ 2021-05-20 20:24       ` Liam Howlett
  2021-05-21  2:23         ` Alistair Popple
  0 siblings, 1 reply; 42+ messages in thread
From: Liam Howlett @ 2021-05-20 20:24 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Hugh Dickins, linux-mm, nouveau, bskeggs, akpm, linux-doc,
	linux-kernel, dri-devel, jhubbard, rcampbell, jglisse, jgg, hch,
	daniel, willy, bsingharora, Christoph Hellwig, Shakeel Butt

* Alistair Popple <apopple@nvidia.com> [210519 08:38]:
> On Wednesday, 19 May 2021 6:04:51 AM AEST Liam Howlett wrote:
> > External email: Use caution opening links or attachments
> > 
> > * Alistair Popple <apopple@nvidia.com> [210407 04:43]:
> > > The behaviour of try_to_unmap_one() is difficult to follow because it
> > > performs different operations based on a fairly large set of flags used
> > > in different combinations.
> > > 
> > > TTU_MUNLOCK is one such flag. However it is exclusively used by
> > > try_to_munlock() which specifies no other flags. Therefore rather than
> > > overload try_to_unmap_one() with unrelated behaviour split this out into
> > > it's own function and remove the flag.
> > > 
> > > Signed-off-by: Alistair Popple <apopple@nvidia.com>
> > > Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
> > > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > > 
> > > ---
> > > 
> > > v8:
> > > * Renamed try_to_munlock to page_mlock to better reflect what the
> > > 
> > >   function actually does.
> > > 
> > > * Removed the TODO from the documentation that this patch addresses.
> > > 
> > > v7:
> > > * Added Christoph's Reviewed-by
> > > 
> > > v4:
> > > * Removed redundant check for VM_LOCKED
> > > ---
> > > 
> > >  Documentation/vm/unevictable-lru.rst | 33 ++++++++-----------
> > >  include/linux/rmap.h                 |  3 +-
> > >  mm/mlock.c                           | 10 +++---
> > >  mm/rmap.c                            | 48 +++++++++++++++++++++-------
> > >  4 files changed, 55 insertions(+), 39 deletions(-)
> > > 
> > > diff --git a/Documentation/vm/unevictable-lru.rst
> > > b/Documentation/vm/unevictable-lru.rst index 0e1490524f53..eae3af17f2d9
> > > 100644
> > > --- a/Documentation/vm/unevictable-lru.rst
> > > +++ b/Documentation/vm/unevictable-lru.rst
> > > @@ -389,14 +389,14 @@ mlocked, munlock_vma_page() updates that zone
> > > statistics for the number of> 
> > >  mlocked pages.  Note, however, that at this point we haven't checked
> > >  whether the page is mapped by other VM_LOCKED VMAs.
> > > 
> > > -We can't call try_to_munlock(), the function that walks the reverse map
> > > to
> > > +We can't call page_mlock(), the function that walks the reverse map to
> > > 
> > >  check for other VM_LOCKED VMAs, without first isolating the page from the
> > >  LRU.> 
> > > -try_to_munlock() is a variant of try_to_unmap() and thus requires that
> > > the page +page_mlock() is a variant of try_to_unmap() and thus requires
> > > that the page> 
> > >  not be on an LRU list [more on these below].  However, the call to
> > > 
> > > -isolate_lru_page() could fail, in which case we couldn't
> > > try_to_munlock().  So, +isolate_lru_page() could fail, in which case we
> > > can't call page_mlock().  So,> 
> > >  we go ahead and clear PG_mlocked up front, as this might be the only
> > >  chance we> 
> > > -have.  If we can successfully isolate the page, we go ahead and
> > > -try_to_munlock(), which will restore the PG_mlocked flag and update the
> > > zone +have.  If we can successfully isolate the page, we go ahead and
> > > call +page_mlock(), which will restore the PG_mlocked flag and update the
> > > zone> 
> > >  page statistics if it finds another VMA holding the page mlocked.  If we
> > >  fail to isolate the page, we'll have left a potentially mlocked page on
> > >  the LRU. This is fine, because we'll catch it later if and if vmscan
> > >  tries to reclaim> 
> > > @@ -545,31 +545,24 @@ munlock or munmap system calls, mm teardown
> > > (munlock_vma_pages_all), reclaim,> 
> > >  holepunching, and truncation of file pages and their anonymous COWed
> > >  pages.
> > > 
> > > -try_to_munlock() Reverse Map Scan
> > > +page_mlock() Reverse Map Scan
> > > 
> > >  ---------------------------------
> > > 
> > > -.. warning::
> > > -   [!] TODO/FIXME: a better name might be page_mlocked() - analogous to
> > > the -   page_referenced() reverse map walker.
> > > -
> > > 
> > >  When munlock_vma_page() [see section :ref:`munlock()/munlockall() System
> > >  Call Handling <munlock_munlockall_handling>` above] tries to munlock a
> > >  page, it needs to determine whether or not the page is mapped by any
> > >  VM_LOCKED VMA without actually attempting to unmap all PTEs from the
> > >  page.  For this purpose, the unevictable/mlock infrastructure
> > > 
> > > -introduced a variant of try_to_unmap() called try_to_munlock().
> > > +introduced a variant of try_to_unmap() called page_mlock().
> > > 
> > > -try_to_munlock() calls the same functions as try_to_unmap() for anonymous
> > > and -mapped file and KSM pages with a flag argument specifying unlock
> > > versus unmap -processing.  Again, these functions walk the respective
> > > reverse maps looking -for VM_LOCKED VMAs.  When such a VMA is found, as
> > > in the try_to_unmap() case, -the functions mlock the page via
> > > mlock_vma_page() and return SWAP_MLOCK.  This -undoes the pre-clearing of
> > > the page's PG_mlocked done by munlock_vma_page. +page_mlock() walks the
> > > respective reverse maps looking for VM_LOCKED VMAs. When +such a VMA is
> > > found the page is mlocked via mlock_vma_page(). This undoes the
> > > +pre-clearing of the page's PG_mlocked done by munlock_vma_page.
> > > 
> > > -Note that try_to_munlock()'s reverse map walk must visit every VMA in a
> > > page's +Note that page_mlock()'s reverse map walk must visit every VMA in
> > > a page's> 
> > >  reverse map to determine that a page is NOT mapped into any VM_LOCKED
> > >  VMA.
> > >  However, the scan can terminate when it encounters a VM_LOCKED VMA.
> > > 
> > > -Although try_to_munlock() might be called a great many times when
> > > munlocking a +Although page_mlock() might be called a great many times
> > > when munlocking a> 
> > >  large region or tearing down a large address space that has been mlocked
> > >  via mlockall(), overall this is a fairly rare event.
> > > 
> > > @@ -602,7 +595,7 @@ inactive lists to the appropriate node's unevictable
> > > list.> 
> > >  shrink_inactive_list() should only see SHM_LOCK'd pages that became
> > >  SHM_LOCK'd after shrink_active_list() had moved them to the inactive
> > >  list, or pages mapped into VM_LOCKED VMAs that munlock_vma_page()
> > >  couldn't isolate from the LRU to> 
> > > -recheck via try_to_munlock().  shrink_inactive_list() won't notice the
> > > latter, +recheck via page_mlock().  shrink_inactive_list() won't notice
> > > the latter,> 
> > >  but will pass on to shrink_page_list().
> > >  
> > >  shrink_page_list() again culls obviously unevictable pages that it could
> > > 
> > > diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> > > index def5c62c93b3..38a746787c2f 100644
> > > --- a/include/linux/rmap.h
> > > +++ b/include/linux/rmap.h
> > > @@ -87,7 +87,6 @@ struct anon_vma_chain {
> > > 
> > >  enum ttu_flags {
> > >  
> > >       TTU_MIGRATION           = 0x1,  /* migration mode */
> > > 
> > > -     TTU_MUNLOCK             = 0x2,  /* munlock mode */
> > > 
> > >       TTU_SPLIT_HUGE_PMD      = 0x4,  /* split huge PMD if any */
> > >       TTU_IGNORE_MLOCK        = 0x8,  /* ignore mlock */
> > > 
> > > @@ -239,7 +238,7 @@ int page_mkclean(struct page *);
> > > 
> > >   * called in munlock()/munmap() path to check for other vmas holding
> > >   * the page mlocked.
> > >   */
> > > 
> > > -void try_to_munlock(struct page *);
> > > +void page_mlock(struct page *page);
> > > 
> > >  void remove_migration_ptes(struct page *old, struct page *new, bool
> > >  locked);> 
> > > diff --git a/mm/mlock.c b/mm/mlock.c
> > > index f8f8cc32d03d..9b8b82cfbbff 100644
> > > --- a/mm/mlock.c
> > > +++ b/mm/mlock.c
> > > @@ -108,7 +108,7 @@ void mlock_vma_page(struct page *page)
> > > 
> > >  /*
> > >  
> > >   * Finish munlock after successful page isolation
> > >   *
> > > 
> > > - * Page must be locked. This is a wrapper for try_to_munlock()
> > > + * Page must be locked. This is a wrapper for page_mlock()
> > > 
> > >   * and putback_lru_page() with munlock accounting.
> > >   */
> > >  
> > >  static void __munlock_isolated_page(struct page *page)
> > > 
> > > @@ -118,7 +118,7 @@ static void __munlock_isolated_page(struct page *page)
> > > 
> > >        * and we don't need to check all the other vmas.
> > >        */
> > >       
> > >       if (page_mapcount(page) > 1)
> > > 
> > > -             try_to_munlock(page);
> > > +             page_mlock(page);
> > > 
> > >       /* Did try_to_unlock() succeed or punt? */
> > >       if (!PageMlocked(page))
> > > 
> > > @@ -158,7 +158,7 @@ static void __munlock_isolation_failed(struct page
> > > *page)> 
> > >   * munlock()ed or munmap()ed, we want to check whether other vmas hold
> > >   the
> > >   * page locked so that we can leave it on the unevictable lru list and
> > >   not
> > >   * bother vmscan with it.  However, to walk the page's rmap list in
> > > 
> > > - * try_to_munlock() we must isolate the page from the LRU.  If some other
> > > + * page_mlock() we must isolate the page from the LRU.  If some other
> > > 
> > >   * task has removed the page from the LRU, we won't be able to do that.
> > >   * So we clear the PageMlocked as we might not get another chance.  If we
> > >   * can't isolate the page, we leave it for putback_lru_page() and vmscan
> > > 
> > > @@ -168,7 +168,7 @@ unsigned int munlock_vma_page(struct page *page)
> > > 
> > >  {
> > >  
> > >       int nr_pages;
> > > 
> > > -     /* For try_to_munlock() and to serialize with page migration */
> > > +     /* For page_mlock() and to serialize with page migration */
> > > 
> > >       BUG_ON(!PageLocked(page));
> > >       VM_BUG_ON_PAGE(PageTail(page), page);
> > > 
> > > @@ -205,7 +205,7 @@ static int __mlock_posix_error_return(long retval)
> > > 
> > >   *
> > >   * The fast path is available only for evictable pages with single
> > >   mapping.
> > >   * Then we can bypass the per-cpu pvec and get better performance.
> > > 
> > > - * when mapcount > 1 we need try_to_munlock() which can fail.
> > > + * when mapcount > 1 we need page_mlock() which can fail.
> > > 
> > >   * when !page_evictable(), we need the full redo logic of
> > >   putback_lru_page to * avoid leaving evictable page in unevictable list.
> > >   *
> > > 
> > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > index 977e70803ed8..f09d522725b9 100644
> > > --- a/mm/rmap.c
> > > +++ b/mm/rmap.c
> > > @@ -1405,10 +1405,6 @@ static bool try_to_unmap_one(struct page *page,
> > > struct vm_area_struct *vma,> 
> > >       struct mmu_notifier_range range;
> > >       enum ttu_flags flags = (enum ttu_flags)(long)arg;
> > > 
> > > -     /* munlock has nothing to gain from examining un-locked vmas */
> > > -     if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
> > > -             return true;
> > > -
> > > 
> > >       if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
> > >       
> > >           is_zone_device_page(page) && !is_device_private_page(page))
> > >           
> > >               return true;
> > > 
> > > @@ -1469,8 +1465,6 @@ static bool try_to_unmap_one(struct page *page,
> > > struct vm_area_struct *vma,> 
> > >                               page_vma_mapped_walk_done(&pvmw);
> > >                               break;
> > >                       
> > >                       }
> > > 
> > > -                     if (flags & TTU_MUNLOCK)
> > > -                             continue;
> > > 
> > >               }
> > >               
> > >               /* Unexpected PMD-mapped THP? */
> > > 
> > > @@ -1784,8 +1778,39 @@ bool try_to_unmap(struct page *page, enum ttu_flags
> > > flags)> 
> > >       return !page_mapcount(page) ? true : false;
> > >  
> > >  }
> > 
> > Please add a comment here, especially around locking.

Did you miss this comment?  I think the name confusion alone means this
needs some documentation.  It's also worth mentioning arg is unused.

> > 
> > > +static bool page_mlock_one(struct page *page, struct vm_area_struct *vma,
> > > +                              unsigned long address, void *arg)
> > > +{
> > > +     struct page_vma_mapped_walk pvmw = {
> > > +             .page = page,
> > > +             .vma = vma,
> > > +             .address = address,
> > > +     };
> > > +
> > > +     /* munlock has nothing to gain from examining un-locked vmas */
> > > +     if (!(vma->vm_flags & VM_LOCKED))
> > > +             return true;
> > 
> > The logic here doesn't make sense.  You called page_mlock_one() on a VMA
> > that isn't locked and it returns true?  Is this a check to see if the
> > VMA has zero mlock'ed pages?
> 
> I'm pretty sure the logic is correct. This is used for an rmap_walk, so we 
> return true to continue the page table scan to see if other VMAs have the page 
> locked.

yes, sorry.  The logic is correct but doesn't read as though it does.
I cannot see what is going on easily and there are no comments stating
what is happening.

> 
> > > +
> > > +     while (page_vma_mapped_walk(&pvmw)) {
> > > +             /* PTE-mapped THP are never mlocked */
> > > +             if (!PageTransCompound(page)) {
> > > +                     /*
> > > +                      * Holding pte lock, we do *not* need
> > > +                      * mmap_lock here
> > > +                      */
> > 
> > Are you sure?  I think you at least need to hold the mmap lock for
> > reading to ensure there's no race here?  mlock_vma_page() eludes to such
> > a scenario when lazy mlocking.
> 
> Not really. I don't claim to be an mlock expert but as this is a clean-up for 
> try_to_unmap() the intent was to not change existing behaviour.
> 
> However presenting the function in this simplified form did raise this and 
> some other questions during previous reviews - see https://lore.kernel.org/
> dri-devel/20210331115746.GA1463678@nvidia.com/ for the previous discussion.

From what I can see, at least the following paths have mmap_lock held
for writing:

munlock_vma_pages_range() from __do_munmap()
munlokc_vma_pages_range() from remap_file_pages()

> 
> To answer the questions around locking though I did do some git sha1 mining. 
> The best explanation seemed to be contained in https://git.kernel.org/pub/scm/
> linux/kernel/git/torvalds/linux.git/commit/?
> id=b87537d9e2feb30f6a962f27eb32768682698d3b from Hugh (whom I've added again 
> here in case he can help answer some of these).

Thanks for the pointer.  That race doesn't make the lock unnecessary.
It is the exception to the rule because the exit of the process will
correct what has happened - and in all other cases, the mmap_lock is
held already (or ought to be) at a higher level.  I think this comment
is just wrong and should be removed as it is confusing.  I see that this
is a copy from elsewhere, but it's at least wrong in this context.

> 
> > The mmap_lock is held for writing in the scenarios I have checked.
> > 
> > > +                     mlock_vma_page(page);
> > > +             }
> > > +             page_vma_mapped_walk_done(&pvmw);
> > > +
> > > +             /* found a mlocked page, no point continuing munlock check
> > > */
> > 
> > I think you need to check_pte() to be sure it is mapped?

On second thought, my comment here is wrong.

> > 
> > > +             return false;
> > > +     }
> > > +
> > > +     return true;
> > 
> > Again, I don't get the return values.  If page_mlock_one() returns true,
> > I'd expect for my page to now be locked.  This isn't the case here,
> > page_mlock_one() returns true if there are no pages present for a locked
> > VMA, correct?
> 
> This is used for an rmap walk, if we were able to mlock the page we return 
> false to stop the walk as there is no need to examine other VMAs if the page 
> has already been mlocked.

My confusion is about the function name.  The code is not
self-documenting as it is written and there are no comments.  Maybe add
a comment block saying what it returns and why?

> 
> > > +}
> > > +
> > > 
> > >  /**
> > > 
> > > - * try_to_munlock - try to munlock a page
> > > + * page_mlock - try to munlock a page
> > 
> > Is this an mlock or an munlock?  I'm not confident it's either, but more
> > of a check to see if there are pages mapped in a locked VMA?
> 
> It is called (only) from the munlock code to check a page does not need to be 
> mlocked, or to mlock it if it does.

I don't want to re-open the discussion on the name, but I just might
with this comment: Do you think it is correct to have a comment for
page_mlock that says "try to munlock a page"?  Those seem like opposite
functions to me.  The @page also references the munlock'ing.  This is
ensuring a locked page that is also locked elsewhere will not be
removed, right?

> 
> > >   * @page: the page to be munlocked
> > >   *
> > >   * Called from munlock code.  Checks all of the VMAs mapping the page
> > > 
> > > @@ -1793,11 +1818,10 @@ bool try_to_unmap(struct page *page, enum
> > > ttu_flags flags)> 
> > >   * returned with PG_mlocked cleared if no other vmas have it mlocked.
> > >   */
> > > 
> > > -void try_to_munlock(struct page *page)
> > > +void page_mlock(struct page *page)
> > > 
> > >  {
> > >  
> > >       struct rmap_walk_control rwc = {
> > > 
> > > -             .rmap_one = try_to_unmap_one,
> > > -             .arg = (void *)TTU_MUNLOCK,
> > > +             .rmap_one = page_mlock_one,
> > > 
> > >               .done = page_not_mapped,
> > >               .anon_lock = page_lock_anon_vma_read,
> > > 
> > > @@ -1849,7 +1873,7 @@ static struct anon_vma *rmap_walk_anon_lock(struct
> > > page *page,> 
> > >   * Find all the mappings of a page using the mapping pointer and the vma
> > >   chains * contained in the anon_vma struct it points to.
> > >   *
> > > 
> > > - * When called from try_to_munlock(), the mmap_lock of the mm containing
> > > the vma + * When called from page_mlock(), the mmap_lock of the mm
> > > containing the vma> 
> > >   * where the page was found will be held for write.  So, we won't recheck
> > >   * vm_flags for that VMA.  That should be OK, because that vma shouldn't
> > >   be
> > >   * LOCKED.
> > > 
> > > @@ -1901,7 +1925,7 @@ static void rmap_walk_anon(struct page *page, struct
> > > rmap_walk_control *rwc,> 
> > >   * Find all the mappings of a page using the mapping pointer and the vma
> > >   chains * contained in the address_space struct it points to.
> > >   *
> > > 
> > > - * When called from try_to_munlock(), the mmap_lock of the mm containing
> > > the vma + * When called from page_mlock(), the mmap_lock of the mm
> > > containing the vma> 
> > >   * where the page was found will be held for write.  So, we won't recheck
> > 
> > Above it is stated that the lock does not need to be held, but this
> > comment says it is already held for writing - which is it?
> 
> I will have to check.
> 
> > >   * vm_flags for that VMA.  That should be OK, because that vma shouldn't
> > >   be
> > >   * LOCKED.
> > > 
> > > --
> > > 2.20.1
> > 
> > munlock_vma_pages_range() comments references try_to_{munlock|unmap}
> 
> Thanks, I noticed that too when I was rereading it just now. Will fix that up.
> 
>  - Alistair
> 
> 
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v8 3/8] mm/rmap: Split try_to_munlock from try_to_unmap
  2021-05-20 20:24       ` Liam Howlett
@ 2021-05-21  2:23         ` Alistair Popple
  0 siblings, 0 replies; 42+ messages in thread
From: Alistair Popple @ 2021-05-21  2:23 UTC (permalink / raw)
  To: Liam Howlett
  Cc: Hugh Dickins, linux-mm, nouveau, bskeggs, akpm, linux-doc,
	linux-kernel, dri-devel, jhubbard, rcampbell, jglisse, jgg, hch,
	daniel, willy, bsingharora, Christoph Hellwig, Shakeel Butt

On Friday, 21 May 2021 6:24:27 AM AEST Liam Howlett wrote:

[...]
 
> > > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > > index 977e70803ed8..f09d522725b9 100644
> > > > --- a/mm/rmap.c
> > > > +++ b/mm/rmap.c
> > > > @@ -1405,10 +1405,6 @@ static bool try_to_unmap_one(struct page *page,
> > > > struct vm_area_struct *vma,>
> > > > 
> > > >       struct mmu_notifier_range range;
> > > >       enum ttu_flags flags = (enum ttu_flags)(long)arg;
> > > > 
> > > > -     /* munlock has nothing to gain from examining un-locked vmas */
> > > > -     if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
> > > > -             return true;
> > > > -
> > > > 
> > > >       if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
> > > >       
> > > >           is_zone_device_page(page) && !is_device_private_page(page))
> > > >           
> > > >               return true;
> > > > 
> > > > @@ -1469,8 +1465,6 @@ static bool try_to_unmap_one(struct page *page,
> > > > struct vm_area_struct *vma,>
> > > > 
> > > >                               page_vma_mapped_walk_done(&pvmw);
> > > >                               break;
> > > >                       
> > > >                       }
> > > > 
> > > > -                     if (flags & TTU_MUNLOCK)
> > > > -                             continue;
> > > > 
> > > >               }
> > > >               
> > > >               /* Unexpected PMD-mapped THP? */
> > > > 
> > > > @@ -1784,8 +1778,39 @@ bool try_to_unmap(struct page *page, enum
> > > > ttu_flags
> > > > flags)>
> > > > 
> > > >       return !page_mapcount(page) ? true : false;
> > > >  
> > > >  }
> > > 
> > > Please add a comment here, especially around locking.
> 
> Did you miss this comment?  I think the name confusion alone means this
> needs some documentation.  It's also worth mentioning arg is unused.

Ack. Was meant to come back to that after discussing some of the locking 
questions below. The other side effect of splitting this code out is it leaves 
space for more specific documentation which is only a good thing. I will try 
and summarise some of the discussion below into a comment here.

> > > > +static bool page_mlock_one(struct page *page, struct vm_area_struct
> > > > *vma,
> > > > +                              unsigned long address, void *arg)
> > > > +{
> > > > +     struct page_vma_mapped_walk pvmw = {
> > > > +             .page = page,
> > > > +             .vma = vma,
> > > > +             .address = address,
> > > > +     };
> > > > +
> > > > +     /* munlock has nothing to gain from examining un-locked vmas */
> > > > +     if (!(vma->vm_flags & VM_LOCKED))
> > > > +             return true;
> > > 
> > > The logic here doesn't make sense.  You called page_mlock_one() on a VMA
> > > that isn't locked and it returns true?  Is this a check to see if the
> > > VMA has zero mlock'ed pages?
> > 
> > I'm pretty sure the logic is correct. This is used for an rmap_walk, so we
> > return true to continue the page table scan to see if other VMAs have the
> > page locked.
> 
> yes, sorry.  The logic is correct but doesn't read as though it does.
> I cannot see what is going on easily and there are no comments stating
> what is happening.

Thanks for confirming. The documentation in Documentation/vm/unevictable-
lru.rst is helpful for higher level context but I will put some comments here 
around the logic.

> > > > +
> > > > +     while (page_vma_mapped_walk(&pvmw)) {
> > > > +             /* PTE-mapped THP are never mlocked */
> > > > +             if (!PageTransCompound(page)) {
> > > > +                     /*
> > > > +                      * Holding pte lock, we do *not* need
> > > > +                      * mmap_lock here
> > > > +                      */
> > > 
> > > Are you sure?  I think you at least need to hold the mmap lock for
> > > reading to ensure there's no race here?  mlock_vma_page() eludes to such
> > > a scenario when lazy mlocking.
> > 
> > Not really. I don't claim to be an mlock expert but as this is a clean-up
> > for try_to_unmap() the intent was to not change existing behaviour.
> > 
> > However presenting the function in this simplified form did raise this and
> > some other questions during previous reviews - see
> > https://lore.kernel.org/
> > dri-devel/20210331115746.GA1463678@nvidia.com/ for the previous
> > discussion.
> 
> From what I can see, at least the following paths have mmap_lock held
> for writing:
> 
> munlock_vma_pages_range() from __do_munmap()
> munlokc_vma_pages_range() from remap_file_pages()
> 
> > To answer the questions around locking though I did do some git sha1
> > mining. The best explanation seemed to be contained in
> > https://git.kernel.org/pub/scm/
> > linux/kernel/git/torvalds/linux.git/commit/?
> > id=b87537d9e2feb30f6a962f27eb32768682698d3b from Hugh (whom I've added
> > again here in case he can help answer some of these).
> 
> Thanks for the pointer.  That race doesn't make the lock unnecessary.
> It is the exception to the rule because the exit of the process will
> correct what has happened - and in all other cases, the mmap_lock is
> held already (or ought to be) at a higher level.  I think this comment
> is just wrong and should be removed as it is confusing.  I see that this
> is a copy from elsewhere, but it's at least wrong in this context.

Ha, thanks. I had just copied the comment without checking to see if mmap_lock 
was always held by callers of page_mlock() or not. From what I can tell it is 
which confirms what you are saying. Thanks.

> > > The mmap_lock is held for writing in the scenarios I have checked.
> > > 
> > > > +                     mlock_vma_page(page);
> > > > +             }
> > > > +             page_vma_mapped_walk_done(&pvmw);
> > > > +
> > > > +             /* found a mlocked page, no point continuing munlock
> > > > check
> > > > */
> > > 
> > > I think you need to check_pte() to be sure it is mapped?
> 
> On second thought, my comment here is wrong.

Thanks, I had meant to question that in my last reply.

> > > > +             return false;
> > > > +     }
> > > > +
> > > > +     return true;
> > > 
> > > Again, I don't get the return values.  If page_mlock_one() returns true,
> > > I'd expect for my page to now be locked.  This isn't the case here,
> > > page_mlock_one() returns true if there are no pages present for a locked
> > > VMA, correct?
> > 
> > This is used for an rmap walk, if we were able to mlock the page we return
> > false to stop the walk as there is no need to examine other VMAs if the
> > page has already been mlocked.
> 
> My confusion is about the function name.  The code is not
> self-documenting as it is written and there are no comments.  Maybe add
> a comment block saying what it returns and why?

Fair enough, I agree it isn't obvious from just reading the code so will 
incorporate a summary from the unevictable-lru documentation (which would have 
been helpful to have had referenced originally anyway).

> > > > +}
> > > > +
> > > > 
> > > >  /**
> > > > 
> > > > - * try_to_munlock - try to munlock a page
> > > > + * page_mlock - try to munlock a page
> > > 
> > > Is this an mlock or an munlock?  I'm not confident it's either, but more
> > > of a check to see if there are pages mapped in a locked VMA?
> > 
> > It is called (only) from the munlock code to check a page does not need to
> > be mlocked, or to mlock it if it does.
> 
> I don't want to re-open the discussion on the name, but I just might
> with this comment: Do you think it is correct to have a comment for
> page_mlock that says "try to munlock a page"?  Those seem like opposite
> functions to me.  The @page also references the munlock'ing.  This is
> ensuring a locked page that is also locked elsewhere will not be
> removed, right?

No, I don't think that is correct. Sorry for the confusion. I updated the name 
with a script but obviously missed some of the comments. When reviewing this 
for rebasing I noticed a few other places in the code still incorrectly 
referencing this as an munlock so will go through and improve the comments 
more generally.

I think a better description of what this function does and why might help 
with the inevitable confusion that arises from munlock calling a function to 
mlock pages.

> > > >   * @page: the page to be munlocked
> > > >   *
> > > >   * Called from munlock code.  Checks all of the VMAs mapping the page
> > > > 
> > > > @@ -1793,11 +1818,10 @@ bool try_to_unmap(struct page *page, enum
> > > > ttu_flags flags)>
> > > > 
> > > >   * returned with PG_mlocked cleared if no other vmas have it mlocked.
> > > >   */
> > > > 
> > > > -void try_to_munlock(struct page *page)
> > > > +void page_mlock(struct page *page)
> > > > 
> > > >  {
> > > >  
> > > >       struct rmap_walk_control rwc = {
> > > > 
> > > > -             .rmap_one = try_to_unmap_one,
> > > > -             .arg = (void *)TTU_MUNLOCK,
> > > > +             .rmap_one = page_mlock_one,
> > > > 
> > > >               .done = page_not_mapped,
> > > >               .anon_lock = page_lock_anon_vma_read,
> > > > 
> > > > @@ -1849,7 +1873,7 @@ static struct anon_vma
> > > > *rmap_walk_anon_lock(struct
> > > > page *page,>
> > > > 
> > > >   * Find all the mappings of a page using the mapping pointer and the
> > > >   vma
> > > >   chains * contained in the anon_vma struct it points to.
> > > >   *
> > > > 
> > > > - * When called from try_to_munlock(), the mmap_lock of the mm
> > > > containing
> > > > the vma + * When called from page_mlock(), the mmap_lock of the mm
> > > > containing the vma>
> > > > 
> > > >   * where the page was found will be held for write.  So, we won't
> > > >   recheck
> > > >   * vm_flags for that VMA.  That should be OK, because that vma
> > > >   shouldn't
> > > >   be
> > > >   * LOCKED.
> > > > 
> > > > @@ -1901,7 +1925,7 @@ static void rmap_walk_anon(struct page *page,
> > > > struct
> > > > rmap_walk_control *rwc,>
> > > > 
> > > >   * Find all the mappings of a page using the mapping pointer and the
> > > >   vma
> > > >   chains * contained in the address_space struct it points to.
> > > >   *
> > > > 
> > > > - * When called from try_to_munlock(), the mmap_lock of the mm
> > > > containing
> > > > the vma + * When called from page_mlock(), the mmap_lock of the mm
> > > > containing the vma>
> > > > 
> > > >   * where the page was found will be held for write.  So, we won't
> > > >   recheck
> > > 
> > > Above it is stated that the lock does not need to be held, but this
> > > comment says it is already held for writing - which is it?
> > 
> > I will have to check.
> > 
> > > >   * vm_flags for that VMA.  That should be OK, because that vma
> > > >   shouldn't
> > > >   be
> > > >   * LOCKED.
> > > > 
> > > > --
> > > > 2.20.1
> > > 
> > > munlock_vma_pages_range() comments references try_to_{munlock|unmap}
> > 
> > Thanks, I noticed that too when I was rereading it just now. Will fix that
> > up.> 
> >  - Alistair






^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v8 8/8] nouveau/svm: Implement atomic SVM access
  2021-04-07  8:42 ` [PATCH v8 8/8] nouveau/svm: Implement atomic SVM access Alistair Popple
@ 2021-05-21  4:04   ` Ben Skeggs
  0 siblings, 0 replies; 42+ messages in thread
From: Ben Skeggs @ 2021-05-21  4:04 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, ML nouveau, Ben Skeggs, Andrew Morton, Ralph Campbell,
	linux-doc, John Hubbard, bsingharora, LKML, ML dri-devel,
	Christoph Hellwig, jglisse, willy, Jason Gunthorpe

On Wed, 7 Apr 2021 at 18:43, Alistair Popple <apopple@nvidia.com> wrote:
>
> Some NVIDIA GPUs do not support direct atomic access to system memory
> via PCIe. Instead this must be emulated by granting the GPU exclusive
> access to the memory. This is achieved by replacing CPU page table
> entries with special swap entries that fault on userspace access.
>
> The driver then grants the GPU permission to update the page undergoing
> atomic access via the GPU page tables. When CPU access to the page is
> required a CPU fault is raised which calls into the device driver via
> MMU notifiers to revoke the atomic access. The original page table
> entries are then restored allowing CPU access to proceed.
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
The Nouveau bits at least look good to me.

For patches 7/8:
Reviewed-by: Ben Skeggs <bskeggs@redhat.com>

>
> ---
>
> v7:
> * Removed magic values for fault access levels
> * Improved readability of fault comparison code
>
> v4:
> * Check that page table entries haven't changed before mapping on the
>   device
> ---
>  drivers/gpu/drm/nouveau/include/nvif/if000c.h |   1 +
>  drivers/gpu/drm/nouveau/nouveau_svm.c         | 126 ++++++++++++++++--
>  drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h |   1 +
>  .../drm/nouveau/nvkm/subdev/mmu/vmmgp100.c    |   6 +
>  4 files changed, 123 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/gpu/drm/nouveau/include/nvif/if000c.h b/drivers/gpu/drm/nouveau/include/nvif/if000c.h
> index d6dd40f21eed..9c7ff56831c5 100644
> --- a/drivers/gpu/drm/nouveau/include/nvif/if000c.h
> +++ b/drivers/gpu/drm/nouveau/include/nvif/if000c.h
> @@ -77,6 +77,7 @@ struct nvif_vmm_pfnmap_v0 {
>  #define NVIF_VMM_PFNMAP_V0_APER                           0x00000000000000f0ULL
>  #define NVIF_VMM_PFNMAP_V0_HOST                           0x0000000000000000ULL
>  #define NVIF_VMM_PFNMAP_V0_VRAM                           0x0000000000000010ULL
> +#define NVIF_VMM_PFNMAP_V0_A                             0x0000000000000004ULL
>  #define NVIF_VMM_PFNMAP_V0_W                              0x0000000000000002ULL
>  #define NVIF_VMM_PFNMAP_V0_V                              0x0000000000000001ULL
>  #define NVIF_VMM_PFNMAP_V0_NONE                           0x0000000000000000ULL
> diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
> index a195e48c9aee..81526d65b4e2 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_svm.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
> @@ -35,6 +35,7 @@
>  #include <linux/sched/mm.h>
>  #include <linux/sort.h>
>  #include <linux/hmm.h>
> +#include <linux/rmap.h>
>
>  struct nouveau_svm {
>         struct nouveau_drm *drm;
> @@ -67,6 +68,11 @@ struct nouveau_svm {
>         } buffer[1];
>  };
>
> +#define FAULT_ACCESS_READ 0
> +#define FAULT_ACCESS_WRITE 1
> +#define FAULT_ACCESS_ATOMIC 2
> +#define FAULT_ACCESS_PREFETCH 3
> +
>  #define SVM_DBG(s,f,a...) NV_DEBUG((s)->drm, "svm: "f"\n", ##a)
>  #define SVM_ERR(s,f,a...) NV_WARN((s)->drm, "svm: "f"\n", ##a)
>
> @@ -411,6 +417,24 @@ nouveau_svm_fault_cancel_fault(struct nouveau_svm *svm,
>                                       fault->client);
>  }
>
> +static int
> +nouveau_svm_fault_priority(u8 fault)
> +{
> +       switch (fault) {
> +       case FAULT_ACCESS_PREFETCH:
> +               return 0;
> +       case FAULT_ACCESS_READ:
> +               return 1;
> +       case FAULT_ACCESS_WRITE:
> +               return 2;
> +       case FAULT_ACCESS_ATOMIC:
> +               return 3;
> +       default:
> +               WARN_ON_ONCE(1);
> +               return -1;
> +       }
> +}
> +
>  static int
>  nouveau_svm_fault_cmp(const void *a, const void *b)
>  {
> @@ -421,9 +445,8 @@ nouveau_svm_fault_cmp(const void *a, const void *b)
>                 return ret;
>         if ((ret = (s64)fa->addr - fb->addr))
>                 return ret;
> -       /*XXX: atomic? */
> -       return (fa->access == 0 || fa->access == 3) -
> -              (fb->access == 0 || fb->access == 3);
> +       return nouveau_svm_fault_priority(fa->access) -
> +               nouveau_svm_fault_priority(fb->access);
>  }
>
>  static void
> @@ -487,6 +510,10 @@ static bool nouveau_svm_range_invalidate(struct mmu_interval_notifier *mni,
>         struct svm_notifier *sn =
>                 container_of(mni, struct svm_notifier, notifier);
>
> +       if (range->event == MMU_NOTIFY_EXCLUSIVE &&
> +           range->owner == sn->svmm->vmm->cli->drm->dev)
> +               return true;
> +
>         /*
>          * serializes the update to mni->invalidate_seq done by caller and
>          * prevents invalidation of the PTE from progressing while HW is being
> @@ -555,6 +582,71 @@ static void nouveau_hmm_convert_pfn(struct nouveau_drm *drm,
>                 args->p.phys[0] |= NVIF_VMM_PFNMAP_V0_W;
>  }
>
> +static int nouveau_atomic_range_fault(struct nouveau_svmm *svmm,
> +                              struct nouveau_drm *drm,
> +                              struct nouveau_pfnmap_args *args, u32 size,
> +                              struct svm_notifier *notifier)
> +{
> +       unsigned long timeout =
> +               jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> +       struct mm_struct *mm = svmm->notifier.mm;
> +       struct page *page;
> +       unsigned long start = args->p.addr;
> +       unsigned long notifier_seq;
> +       int ret = 0;
> +
> +       ret = mmu_interval_notifier_insert(&notifier->notifier, mm,
> +                                       args->p.addr, args->p.size,
> +                                       &nouveau_svm_mni_ops);
> +       if (ret)
> +               return ret;
> +
> +       while (true) {
> +               if (time_after(jiffies, timeout)) {
> +                       ret = -EBUSY;
> +                       goto out;
> +               }
> +
> +               notifier_seq = mmu_interval_read_begin(&notifier->notifier);
> +               mmap_read_lock(mm);
> +               make_device_exclusive_range(mm, start, start + PAGE_SIZE,
> +                                           &page, drm->dev);
> +               mmap_read_unlock(mm);
> +               if (!page) {
> +                       ret = -EINVAL;
> +                       goto out;
> +               }
> +
> +               mutex_lock(&svmm->mutex);
> +               if (!mmu_interval_read_retry(&notifier->notifier,
> +                                            notifier_seq))
> +                       break;
> +               mutex_unlock(&svmm->mutex);
> +       }
> +
> +       /* Map the page on the GPU. */
> +       args->p.page = 12;
> +       args->p.size = PAGE_SIZE;
> +       args->p.addr = start;
> +       args->p.phys[0] = page_to_phys(page) |
> +               NVIF_VMM_PFNMAP_V0_V |
> +               NVIF_VMM_PFNMAP_V0_W |
> +               NVIF_VMM_PFNMAP_V0_A |
> +               NVIF_VMM_PFNMAP_V0_HOST;
> +
> +       svmm->vmm->vmm.object.client->super = true;
> +       ret = nvif_object_ioctl(&svmm->vmm->vmm.object, args, size, NULL);
> +       svmm->vmm->vmm.object.client->super = false;
> +       mutex_unlock(&svmm->mutex);
> +
> +       unlock_page(page);
> +       put_page(page);
> +
> +out:
> +       mmu_interval_notifier_remove(&notifier->notifier);
> +       return ret;
> +}
> +
>  static int nouveau_range_fault(struct nouveau_svmm *svmm,
>                                struct nouveau_drm *drm,
>                                struct nouveau_pfnmap_args *args, u32 size,
> @@ -637,7 +729,7 @@ nouveau_svm_fault(struct nvif_notify *notify)
>         unsigned long hmm_flags;
>         u64 inst, start, limit;
>         int fi, fn;
> -       int replay = 0, ret;
> +       int replay = 0, atomic = 0, ret;
>
>         /* Parse available fault buffer entries into a cache, and update
>          * the GET pointer so HW can reuse the entries.
> @@ -718,12 +810,14 @@ nouveau_svm_fault(struct nvif_notify *notify)
>                 /*
>                  * Determine required permissions based on GPU fault
>                  * access flags.
> -                * XXX: atomic?
>                  */
>                 switch (buffer->fault[fi]->access) {
>                 case 0: /* READ. */
>                         hmm_flags = HMM_PFN_REQ_FAULT;
>                         break;
> +               case 2: /* ATOMIC. */
> +                       atomic = true;
> +                       break;
>                 case 3: /* PREFETCH. */
>                         hmm_flags = 0;
>                         break;
> @@ -739,8 +833,14 @@ nouveau_svm_fault(struct nvif_notify *notify)
>                 }
>
>                 notifier.svmm = svmm;
> -               ret = nouveau_range_fault(svmm, svm->drm, &args.i,
> -                                       sizeof(args), hmm_flags, &notifier);
> +               if (atomic)
> +                       ret = nouveau_atomic_range_fault(svmm, svm->drm,
> +                                                        &args.i, sizeof(args),
> +                                                        &notifier);
> +               else
> +                       ret = nouveau_range_fault(svmm, svm->drm, &args.i,
> +                                                 sizeof(args), hmm_flags,
> +                                                 &notifier);
>                 mmput(mm);
>
>                 limit = args.i.p.addr + args.i.p.size;
> @@ -756,11 +856,15 @@ nouveau_svm_fault(struct nvif_notify *notify)
>                          */
>                         if (buffer->fault[fn]->svmm != svmm ||
>                             buffer->fault[fn]->addr >= limit ||
> -                           (buffer->fault[fi]->access == 0 /* READ. */ &&
> +                           (buffer->fault[fi]->access == FAULT_ACCESS_READ &&
>                              !(args.phys[0] & NVIF_VMM_PFNMAP_V0_V)) ||
> -                           (buffer->fault[fi]->access != 0 /* READ. */ &&
> -                            buffer->fault[fi]->access != 3 /* PREFETCH. */ &&
> -                            !(args.phys[0] & NVIF_VMM_PFNMAP_V0_W)))
> +                           (buffer->fault[fi]->access != FAULT_ACCESS_READ &&
> +                            buffer->fault[fi]->access != FAULT_ACCESS_PREFETCH &&
> +                            !(args.phys[0] & NVIF_VMM_PFNMAP_V0_W)) ||
> +                           (buffer->fault[fi]->access != FAULT_ACCESS_READ &&
> +                            buffer->fault[fi]->access != FAULT_ACCESS_WRITE &&
> +                            buffer->fault[fi]->access != FAULT_ACCESS_PREFETCH &&
> +                            !(args.phys[0] & NVIF_VMM_PFNMAP_V0_A)))
>                                 break;
>                 }
>
> diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h
> index a2b179568970..f6188aa9171c 100644
> --- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h
> +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h
> @@ -178,6 +178,7 @@ void nvkm_vmm_unmap_region(struct nvkm_vmm *, struct nvkm_vma *);
>  #define NVKM_VMM_PFN_APER                                 0x00000000000000f0ULL
>  #define NVKM_VMM_PFN_HOST                                 0x0000000000000000ULL
>  #define NVKM_VMM_PFN_VRAM                                 0x0000000000000010ULL
> +#define NVKM_VMM_PFN_A                                   0x0000000000000004ULL
>  #define NVKM_VMM_PFN_W                                    0x0000000000000002ULL
>  #define NVKM_VMM_PFN_V                                    0x0000000000000001ULL
>  #define NVKM_VMM_PFN_NONE                                 0x0000000000000000ULL
> diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c
> index 236db5570771..f02abd9cb4dd 100644
> --- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c
> +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c
> @@ -88,6 +88,9 @@ gp100_vmm_pgt_pfn(struct nvkm_vmm *vmm, struct nvkm_mmu_pt *pt,
>                 if (!(*map->pfn & NVKM_VMM_PFN_W))
>                         data |= BIT_ULL(6); /* RO. */
>
> +               if (!(*map->pfn & NVKM_VMM_PFN_A))
> +                       data |= BIT_ULL(7); /* Atomic disable. */
> +
>                 if (!(*map->pfn & NVKM_VMM_PFN_VRAM)) {
>                         addr = *map->pfn >> NVKM_VMM_PFN_ADDR_SHIFT;
>                         addr = dma_map_page(dev, pfn_to_page(addr), 0,
> @@ -322,6 +325,9 @@ gp100_vmm_pd0_pfn(struct nvkm_vmm *vmm, struct nvkm_mmu_pt *pt,
>                 if (!(*map->pfn & NVKM_VMM_PFN_W))
>                         data |= BIT_ULL(6); /* RO. */
>
> +               if (!(*map->pfn & NVKM_VMM_PFN_A))
> +                       data |= BIT_ULL(7); /* Atomic disable. */
> +
>                 if (!(*map->pfn & NVKM_VMM_PFN_VRAM)) {
>                         addr = *map->pfn >> NVKM_VMM_PFN_ADDR_SHIFT;
>                         addr = dma_map_page(dev, pfn_to_page(addr), 0,
> --
> 2.20.1
>
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v8 5/8] mm: Device exclusive memory access
  2021-05-18 13:19     ` Alistair Popple
  2021-05-18 17:27       ` Peter Xu
@ 2021-05-21  6:53       ` Alistair Popple
  1 sibling, 0 replies; 42+ messages in thread
From: Alistair Popple @ 2021-05-21  6:53 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, nouveau, bskeggs, akpm, linux-doc, linux-kernel,
	dri-devel, jhubbard, rcampbell, jglisse, jgg, hch, daniel, willy,
	bsingharora, Christoph Hellwig

On Tuesday, 18 May 2021 11:19:05 PM AEST Alistair Popple wrote:

[...]

> > > +/*
> > > + * Restore a potential device exclusive pte to a working pte entry
> > > + */
> > > +static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> > > +{
> > > +     struct page *page = vmf->page;
> > > +     struct vm_area_struct *vma = vmf->vma;
> > > +     struct page_vma_mapped_walk pvmw = {
> > > +             .page = page,
> > > +             .vma = vma,
> > > +             .address = vmf->address,
> > > +             .flags = PVMW_SYNC,
> > > +     };
> > > +     vm_fault_t ret = 0;
> > > +     struct mmu_notifier_range range;
> > > +
> > > +     if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags))
> > > +             return VM_FAULT_RETRY;
> > > +     mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma,
> > > vma->vm_mm,
> > > +                             vmf->address & PAGE_MASK,
> > > +                             (vmf->address & PAGE_MASK) + PAGE_SIZE);
> > > +     mmu_notifier_invalidate_range_start(&range);
> > 
> > I looked at MMU_NOTIFIER_CLEAR document and it tells me:
> > 
> > * @MMU_NOTIFY_CLEAR: clear page table entry (many reasons for this like
> > * madvise() or replacing a page by another one, ...).
> > 
> > Does MMU_NOTIFIER_CLEAR suite for this case?  Normally I think for such a
> > case (existing pte is invalid) we don't need to notify at all.  However
> > from what I read from the whole series, this seems to be a critical point
> > where we'd like to kick the owner/driver to synchronously stop doing
> > atomic
> > operations from the device.  Not sure whether we'd like a new notifier
> > type, or maybe at least comment on why to use CLEAR?
> 
> Right, notifying the owner/driver when it no longer has exclusive access to
> the page and allowing it to stop atomic operations is the critical point and
> why it notifies when we ordinarily wouldn't (ie. invalid -> valid).
> 
> I did consider adding a new type, but in the driver implementation it ends
> up
> being treated the same as a CLEAR notification anyway so didn't think it was
> necessary. But I suppose adding a different type would allow other listening
> notifiers to filter these which might be worthwhile.
>
> > > +
> > > +     while (page_vma_mapped_walk(&pvmw)) {
> > 
> > IIUC a while loop of page_vma_mapped_walk() handles thps, however here
> > it's
> > already in do_swap_page() so it's small pte only?  Meanwhile...
> > 
> > > +             if (unlikely(!pte_same(*pvmw.pte, vmf->orig_pte))) {
> > > +                     page_vma_mapped_walk_done(&pvmw);
> > > +                     break;
> > > +             }
> > > +
> > > +             restore_exclusive_pte(vma, page, pvmw.address, pvmw.pte);
> > 
> > ... I'm not sure whether passing in page would work for thp after all, as
> > iiuc we may need to pass in the subpage rather than the head page if so.
> 
> Hmm, I need to check this and follow up.

Thanks Peter for pointing that out. I needed to follow this up because I had 
slightly misunderstood page_vma_mapped_walk(). As you say this is actually a 
small pte and restore_exclusive_pte() needs the actual page from the fault. So 
I should be able to drop the page_vma_mapped_walk() and use 
pte_offset_map_lock() to call restore_exclusive_pte directly.

 - Alistair





^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2021-05-21  6:53 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-07  8:42 [PATCH v8 0/8] Add support for SVM atomics in Nouveau Alistair Popple
2021-04-07  8:42 ` [PATCH v8 1/8] mm: Remove special swap entry functions Alistair Popple
2021-05-18  2:17   ` Peter Xu
2021-05-18 11:58     ` Alistair Popple
2021-05-18 14:17       ` Peter Xu
2021-04-07  8:42 ` [PATCH v8 2/8] mm/swapops: Rework swap entry manipulation code Alistair Popple
2021-04-07  8:42 ` [PATCH v8 3/8] mm/rmap: Split try_to_munlock from try_to_unmap Alistair Popple
2021-05-18 20:04   ` Liam Howlett
2021-05-19 12:38     ` Alistair Popple
2021-05-20 20:24       ` Liam Howlett
2021-05-21  2:23         ` Alistair Popple
2021-04-07  8:42 ` [PATCH v8 4/8] mm/rmap: Split migration into its own function Alistair Popple
2021-04-07  8:42 ` [PATCH v8 5/8] mm: Device exclusive memory access Alistair Popple
2021-05-18  2:08   ` Peter Xu
2021-05-18 13:19     ` Alistair Popple
2021-05-18 17:27       ` Peter Xu
2021-05-18 17:33         ` Jason Gunthorpe
2021-05-18 18:01           ` Peter Xu
2021-05-18 19:45             ` Jason Gunthorpe
2021-05-18 20:29               ` Peter Xu
2021-05-18 23:03                 ` Jason Gunthorpe
2021-05-18 23:45                   ` Peter Xu
2021-05-19 11:04                     ` Alistair Popple
2021-05-19 12:15                       ` Peter Xu
2021-05-19 13:11                         ` Alistair Popple
2021-05-19 14:04                           ` Peter Xu
2021-05-19 13:28                     ` Jason Gunthorpe
2021-05-19 14:09                       ` Peter Xu
2021-05-19 18:11                         ` Jason Gunthorpe
2021-05-19 11:35         ` Alistair Popple
2021-05-19 12:21           ` Peter Xu
2021-05-19 12:46             ` Alistair Popple
2021-05-21  6:53       ` Alistair Popple
2021-05-18 21:16   ` Peter Xu
2021-05-19 10:49     ` Alistair Popple
2021-05-19 12:24       ` Peter Xu
2021-05-19 12:46         ` Alistair Popple
2021-04-07  8:42 ` [PATCH v8 6/8] mm: Selftests for exclusive device memory Alistair Popple
2021-04-07  8:42 ` [PATCH v8 7/8] nouveau/svm: Refactor nouveau_range_fault Alistair Popple
2021-04-07  8:42 ` [PATCH v8 8/8] nouveau/svm: Implement atomic SVM access Alistair Popple
2021-05-21  4:04   ` Ben Skeggs
2021-05-06  7:43 ` [PATCH v8 0/8] Add support for SVM atomics in Nouveau Alistair Popple

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).