nouveau.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed
* [Nouveau] [PATCH v2 0/8] Fix several device private page reference counting issues
@ 2022-09-28 12:01 Alistair Popple
  2022-09-28 12:01 ` [Nouveau] [PATCH v2 1/8] mm/memory.c: Fix race when faulting a device private page Alistair Popple
                   ` (8 more replies)
  0 siblings, 9 replies; 18+ messages in thread
From: Alistair Popple @ 2022-09-28 12:01 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: nouveau, dri-devel, Alistair Popple, linux-kernel, amd-gfx

This series aims to fix a number of page reference counting issues in
drivers dealing with device private ZONE_DEVICE pages. These result in
use-after-free type bugs, either from accessing a struct page which no
longer exists because it has been removed or accessing fields within the
struct page which are no longer valid because the page has been freed.

During normal usage it is unlikely these will cause any problems. However
without these fixes it is possible to crash the kernel from userspace.
These crashes can be triggered either by unloading the kernel module or
unbinding the device from the driver prior to a userspace task exiting. In
modules such as Nouveau it is also possible to trigger some of these issues
by explicitly closing the device file-descriptor prior to the task exiting
and then accessing device private memory.

This involves some minor changes to both PowerPC and AMD GPU code.
Unfortunately I lack hardware to test either of those so any help there
would be appreciated. The changes mimic what is done in for both Nouveau
and hmm-tests though so I doubt they will cause problems.

To: Andrew Morton <akpm@linux-foundation.org>
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: nouveau@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org

Alistair Popple (8):
  mm/memory.c: Fix race when faulting a device private page
  mm: Free device private pages have zero refcount
  mm/memremap.c: Take a pgmap reference on page allocation
  mm/migrate_device.c: Refactor migrate_vma and migrate_deivce_coherent_page()
  mm/migrate_device.c: Add migrate_device_range()
  nouveau/dmem: Refactor nouveau_dmem_fault_copy_one()
  nouveau/dmem: Evict device private memory during release
  hmm-tests: Add test for migrate_device_range()

 arch/powerpc/kvm/book3s_hv_uvmem.c       |  17 +-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  19 +-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h |   2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c     |  11 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c   | 108 +++++++----
 include/linux/memremap.h                 |   1 +-
 include/linux/migrate.h                  |  15 ++-
 lib/test_hmm.c                           | 129 ++++++++++---
 lib/test_hmm_uapi.h                      |   1 +-
 mm/memory.c                              |  16 +-
 mm/memremap.c                            |  30 ++-
 mm/migrate.c                             |  34 +--
 mm/migrate_device.c                      | 239 +++++++++++++++++-------
 mm/page_alloc.c                          |   8 +-
 tools/testing/selftests/vm/hmm-tests.c   |  49 +++++-
 15 files changed, 516 insertions(+), 163 deletions(-)

base-commit: 088b8aa537c2c767765f1c19b555f21ffe555786
-- 
git-series 0.9.1

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Nouveau] [PATCH v2 1/8] mm/memory.c: Fix race when faulting a device private page
  2022-09-28 12:01 [Nouveau] [PATCH v2 0/8] Fix several device private page reference counting issues Alistair Popple
@ 2022-09-28 12:01 ` Alistair Popple
  2022-09-29 18:30   ` Felix Kuehling
  2022-09-28 12:01 ` [Nouveau] [PATCH v2 2/8] mm: Free device private pages have zero refcount Alistair Popple
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 18+ messages in thread
From: Alistair Popple @ 2022-09-28 12:01 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Ralph Campbell, Michael Ellerman, nouveau, Felix Kuehling,
	Alistair Popple, linux-kernel, dri-devel, amd-gfx,
	Jason Gunthorpe

When the CPU tries to access a device private page the migrate_to_ram()
callback associated with the pgmap for the page is called. However no
reference is taken on the faulting page. Therefore a concurrent
migration of the device private page can free the page and possibly the
underlying pgmap. This results in a race which can crash the kernel due
to the migrate_to_ram() function pointer becoming invalid. It also means
drivers can't reliably read the zone_device_data field because the page
may have been freed with memunmap_pages().

Close the race by getting a reference on the page while holding the ptl
to ensure it has not been freed. Unfortunately the elevated reference
count will cause the migration required to handle the fault to fail. To
avoid this failure pass the faulting page into the migrate_vma functions
so that if an elevated reference count is found it can be checked to see
if it's expected or not.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Lyude Paul <lyude@redhat.com>
---
 arch/powerpc/kvm/book3s_hv_uvmem.c       | 15 ++++++-----
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 17 +++++++------
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h |  2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c     | 11 +++++---
 include/linux/migrate.h                  |  8 ++++++-
 lib/test_hmm.c                           |  7 ++---
 mm/memory.c                              | 16 +++++++++++-
 mm/migrate.c                             | 34 ++++++++++++++-----------
 mm/migrate_device.c                      | 18 +++++++++----
 9 files changed, 87 insertions(+), 41 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 5980063..d4eacf4 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -508,10 +508,10 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
 static int __kvmppc_svm_page_out(struct vm_area_struct *vma,
 		unsigned long start,
 		unsigned long end, unsigned long page_shift,
-		struct kvm *kvm, unsigned long gpa)
+		struct kvm *kvm, unsigned long gpa, struct page *fault_page)
 {
 	unsigned long src_pfn, dst_pfn = 0;
-	struct migrate_vma mig;
+	struct migrate_vma mig = { 0 };
 	struct page *dpage, *spage;
 	struct kvmppc_uvmem_page_pvt *pvt;
 	unsigned long pfn;
@@ -525,6 +525,7 @@ static int __kvmppc_svm_page_out(struct vm_area_struct *vma,
 	mig.dst = &dst_pfn;
 	mig.pgmap_owner = &kvmppc_uvmem_pgmap;
 	mig.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
+	mig.fault_page = fault_page;
 
 	/* The requested page is already paged-out, nothing to do */
 	if (!kvmppc_gfn_is_uvmem_pfn(gpa >> page_shift, kvm, NULL))
@@ -580,12 +581,14 @@ static int __kvmppc_svm_page_out(struct vm_area_struct *vma,
 static inline int kvmppc_svm_page_out(struct vm_area_struct *vma,
 				      unsigned long start, unsigned long end,
 				      unsigned long page_shift,
-				      struct kvm *kvm, unsigned long gpa)
+				      struct kvm *kvm, unsigned long gpa,
+				      struct page *fault_page)
 {
 	int ret;
 
 	mutex_lock(&kvm->arch.uvmem_lock);
-	ret = __kvmppc_svm_page_out(vma, start, end, page_shift, kvm, gpa);
+	ret = __kvmppc_svm_page_out(vma, start, end, page_shift, kvm, gpa,
+				fault_page);
 	mutex_unlock(&kvm->arch.uvmem_lock);
 
 	return ret;
@@ -736,7 +739,7 @@ static int kvmppc_svm_page_in(struct vm_area_struct *vma,
 		bool pagein)
 {
 	unsigned long src_pfn, dst_pfn = 0;
-	struct migrate_vma mig;
+	struct migrate_vma mig = { 0 };
 	struct page *spage;
 	unsigned long pfn;
 	struct page *dpage;
@@ -994,7 +997,7 @@ static vm_fault_t kvmppc_uvmem_migrate_to_ram(struct vm_fault *vmf)
 
 	if (kvmppc_svm_page_out(vmf->vma, vmf->address,
 				vmf->address + PAGE_SIZE, PAGE_SHIFT,
-				pvt->kvm, pvt->gpa))
+				pvt->kvm, pvt->gpa, vmf->page))
 		return VM_FAULT_SIGBUS;
 	else
 		return 0;
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index b059a77..776448b 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -409,7 +409,7 @@ svm_migrate_vma_to_vram(struct amdgpu_device *adev, struct svm_range *prange,
 	uint64_t npages = (end - start) >> PAGE_SHIFT;
 	struct kfd_process_device *pdd;
 	struct dma_fence *mfence = NULL;
-	struct migrate_vma migrate;
+	struct migrate_vma migrate = { 0 };
 	unsigned long cpages = 0;
 	dma_addr_t *scratch;
 	void *buf;
@@ -668,7 +668,7 @@ svm_migrate_copy_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
 static long
 svm_migrate_vma_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
 		       struct vm_area_struct *vma, uint64_t start, uint64_t end,
-		       uint32_t trigger)
+		       uint32_t trigger, struct page *fault_page)
 {
 	struct kfd_process *p = container_of(prange->svms, struct kfd_process, svms);
 	uint64_t npages = (end - start) >> PAGE_SHIFT;
@@ -676,7 +676,7 @@ svm_migrate_vma_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
 	unsigned long cpages = 0;
 	struct kfd_process_device *pdd;
 	struct dma_fence *mfence = NULL;
-	struct migrate_vma migrate;
+	struct migrate_vma migrate = { 0 };
 	dma_addr_t *scratch;
 	void *buf;
 	int r = -ENOMEM;
@@ -699,6 +699,7 @@ svm_migrate_vma_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
 
 	migrate.src = buf;
 	migrate.dst = migrate.src + npages;
+	migrate.fault_page = fault_page;
 	scratch = (dma_addr_t *)(migrate.dst + npages);
 
 	kfd_smi_event_migration_start(adev->kfd.dev, p->lead_thread->pid,
@@ -766,7 +767,7 @@ svm_migrate_vma_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
  * 0 - OK, otherwise error code
  */
 int svm_migrate_vram_to_ram(struct svm_range *prange, struct mm_struct *mm,
-			    uint32_t trigger)
+			    uint32_t trigger, struct page *fault_page)
 {
 	struct amdgpu_device *adev;
 	struct vm_area_struct *vma;
@@ -807,7 +808,8 @@ int svm_migrate_vram_to_ram(struct svm_range *prange, struct mm_struct *mm,
 		}
 
 		next = min(vma->vm_end, end);
-		r = svm_migrate_vma_to_ram(adev, prange, vma, addr, next, trigger);
+		r = svm_migrate_vma_to_ram(adev, prange, vma, addr, next, trigger,
+			fault_page);
 		if (r < 0) {
 			pr_debug("failed %ld to migrate prange %p\n", r, prange);
 			break;
@@ -851,7 +853,7 @@ svm_migrate_vram_to_vram(struct svm_range *prange, uint32_t best_loc,
 	pr_debug("from gpu 0x%x to gpu 0x%x\n", prange->actual_loc, best_loc);
 
 	do {
-		r = svm_migrate_vram_to_ram(prange, mm, trigger);
+		r = svm_migrate_vram_to_ram(prange, mm, trigger, NULL);
 		if (r)
 			return r;
 	} while (prange->actual_loc && --retries);
@@ -938,7 +940,8 @@ static vm_fault_t svm_migrate_to_ram(struct vm_fault *vmf)
 		goto out_unlock_prange;
 	}
 
-	r = svm_migrate_vram_to_ram(prange, mm, KFD_MIGRATE_TRIGGER_PAGEFAULT_CPU);
+	r = svm_migrate_vram_to_ram(prange, mm, KFD_MIGRATE_TRIGGER_PAGEFAULT_CPU,
+				vmf->page);
 	if (r)
 		pr_debug("failed %d migrate 0x%p [0x%lx 0x%lx] to ram\n", r,
 			 prange, prange->start, prange->last);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
index b3f0754..a5d7e6d 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
@@ -43,7 +43,7 @@ enum MIGRATION_COPY_DIR {
 int svm_migrate_to_vram(struct svm_range *prange,  uint32_t best_loc,
 			struct mm_struct *mm, uint32_t trigger);
 int svm_migrate_vram_to_ram(struct svm_range *prange, struct mm_struct *mm,
-			    uint32_t trigger);
+			    uint32_t trigger, struct page *fault_page);
 unsigned long
 svm_migrate_addr_to_pfn(struct amdgpu_device *adev, unsigned long addr);
 
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 11074cc..9139e5a 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -2913,13 +2913,15 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid,
 				 */
 				if (prange->actual_loc)
 					r = svm_migrate_vram_to_ram(prange, mm,
-					   KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU);
+					   KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU,
+					   NULL);
 				else
 					r = 0;
 			}
 		} else {
 			r = svm_migrate_vram_to_ram(prange, mm,
-					KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU);
+					KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU,
+					NULL);
 		}
 		if (r) {
 			pr_debug("failed %d to migrate svms %p [0x%lx 0x%lx]\n",
@@ -3242,7 +3244,8 @@ svm_range_trigger_migration(struct mm_struct *mm, struct svm_range *prange,
 		return 0;
 
 	if (!best_loc) {
-		r = svm_migrate_vram_to_ram(prange, mm, KFD_MIGRATE_TRIGGER_PREFETCH);
+		r = svm_migrate_vram_to_ram(prange, mm,
+					KFD_MIGRATE_TRIGGER_PREFETCH, NULL);
 		*migrated = !r;
 		return r;
 	}
@@ -3303,7 +3306,7 @@ static void svm_range_evict_svm_bo_worker(struct work_struct *work)
 		mutex_lock(&prange->migrate_mutex);
 		do {
 			r = svm_migrate_vram_to_ram(prange, mm,
-						KFD_MIGRATE_TRIGGER_TTM_EVICTION);
+					KFD_MIGRATE_TRIGGER_TTM_EVICTION, NULL);
 		} while (!r && prange->actual_loc && --retries);
 
 		if (!r && prange->actual_loc)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 22c0a0c..82ffa47 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -62,6 +62,8 @@ extern const char *migrate_reason_names[MR_TYPES];
 #ifdef CONFIG_MIGRATION
 
 extern void putback_movable_pages(struct list_head *l);
+int migrate_folio_extra(struct address_space *mapping, struct folio *dst,
+		struct folio *src, enum migrate_mode mode, int extra_count);
 int migrate_folio(struct address_space *mapping, struct folio *dst,
 		struct folio *src, enum migrate_mode mode);
 extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free,
@@ -212,6 +214,12 @@ struct migrate_vma {
 	 */
 	void			*pgmap_owner;
 	unsigned long		flags;
+
+	/*
+	 * Set to vmf->page if this is being called to migrate a page as part of
+	 * a migrate_to_ram() callback.
+	 */
+	struct page		*fault_page;
 };
 
 int migrate_vma_setup(struct migrate_vma *args);
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index e3965ca..89463ff 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -907,7 +907,7 @@ static int dmirror_migrate_to_system(struct dmirror *dmirror,
 	struct vm_area_struct *vma;
 	unsigned long src_pfns[64] = { 0 };
 	unsigned long dst_pfns[64] = { 0 };
-	struct migrate_vma args;
+	struct migrate_vma args = { 0 };
 	unsigned long next;
 	int ret;
 
@@ -968,7 +968,7 @@ static int dmirror_migrate_to_device(struct dmirror *dmirror,
 	unsigned long src_pfns[64] = { 0 };
 	unsigned long dst_pfns[64] = { 0 };
 	struct dmirror_bounce bounce;
-	struct migrate_vma args;
+	struct migrate_vma args = { 0 };
 	unsigned long next;
 	int ret;
 
@@ -1334,7 +1334,7 @@ static void dmirror_devmem_free(struct page *page)
 
 static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
 {
-	struct migrate_vma args;
+	struct migrate_vma args = { 0 };
 	unsigned long src_pfns = 0;
 	unsigned long dst_pfns = 0;
 	struct page *rpage;
@@ -1357,6 +1357,7 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
 	args.dst = &dst_pfns;
 	args.pgmap_owner = dmirror->mdevice;
 	args.flags = dmirror_select_device(dmirror);
+	args.fault_page = vmf->page;
 
 	if (migrate_vma_setup(&args))
 		return VM_FAULT_SIGBUS;
diff --git a/mm/memory.c b/mm/memory.c
index b994784..65d3977 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3742,7 +3742,21 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 			ret = remove_device_exclusive_entry(vmf);
 		} else if (is_device_private_entry(entry)) {
 			vmf->page = pfn_swap_entry_to_page(entry);
-			ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
+			vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+					vmf->address, &vmf->ptl);
+			if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
+				spin_unlock(vmf->ptl);
+				goto out;
+			}
+
+			/*
+			 * Get a page reference while we know the page can't be
+			 * freed.
+			 */
+			get_page(vmf->page);
+			pte_unmap_unlock(vmf->pte, vmf->ptl);
+			vmf->page->pgmap->ops->migrate_to_ram(vmf);
+			put_page(vmf->page);
 		} else if (is_hwpoison_entry(entry)) {
 			ret = VM_FAULT_HWPOISON;
 		} else if (is_swapin_error_entry(entry)) {
diff --git a/mm/migrate.c b/mm/migrate.c
index ce6a58f..e3f78a7 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -620,6 +620,25 @@ EXPORT_SYMBOL(folio_migrate_copy);
  *                    Migration functions
  ***********************************************************/
 
+int migrate_folio_extra(struct address_space *mapping, struct folio *dst,
+		struct folio *src, enum migrate_mode mode, int extra_count)
+{
+	int rc;
+
+	BUG_ON(folio_test_writeback(src));	/* Writeback must be complete */
+
+	rc = folio_migrate_mapping(mapping, dst, src, extra_count);
+
+	if (rc != MIGRATEPAGE_SUCCESS)
+		return rc;
+
+	if (mode != MIGRATE_SYNC_NO_COPY)
+		folio_migrate_copy(dst, src);
+	else
+		folio_migrate_flags(dst, src);
+	return MIGRATEPAGE_SUCCESS;
+}
+
 /**
  * migrate_folio() - Simple folio migration.
  * @mapping: The address_space containing the folio.
@@ -635,20 +654,7 @@ EXPORT_SYMBOL(folio_migrate_copy);
 int migrate_folio(struct address_space *mapping, struct folio *dst,
 		struct folio *src, enum migrate_mode mode)
 {
-	int rc;
-
-	BUG_ON(folio_test_writeback(src));	/* Writeback must be complete */
-
-	rc = folio_migrate_mapping(mapping, dst, src, 0);
-
-	if (rc != MIGRATEPAGE_SUCCESS)
-		return rc;
-
-	if (mode != MIGRATE_SYNC_NO_COPY)
-		folio_migrate_copy(dst, src);
-	else
-		folio_migrate_flags(dst, src);
-	return MIGRATEPAGE_SUCCESS;
+	return migrate_folio_extra(mapping, dst, src, mode, 0);
 }
 EXPORT_SYMBOL(migrate_folio);
 
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 7235424..f756c00 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -313,14 +313,14 @@ static void migrate_vma_collect(struct migrate_vma *migrate)
  * folio_migrate_mapping(), except that here we allow migration of a
  * ZONE_DEVICE page.
  */
-static bool migrate_vma_check_page(struct page *page)
+static bool migrate_vma_check_page(struct page *page, struct page *fault_page)
 {
 	/*
 	 * One extra ref because caller holds an extra reference, either from
 	 * isolate_lru_page() for a regular page, or migrate_vma_collect() for
 	 * a device page.
 	 */
-	int extra = 1;
+	int extra = 1 + (page == fault_page);
 
 	/*
 	 * FIXME support THP (transparent huge page), it is bit more complex to
@@ -393,7 +393,8 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
 		if (folio_mapped(folio))
 			try_to_migrate(folio, 0);
 
-		if (page_mapped(page) || !migrate_vma_check_page(page)) {
+		if (page_mapped(page) ||
+		    !migrate_vma_check_page(page, migrate->fault_page)) {
 			if (!is_zone_device_page(page)) {
 				get_page(page);
 				putback_lru_page(page);
@@ -505,6 +506,8 @@ int migrate_vma_setup(struct migrate_vma *args)
 		return -EINVAL;
 	if (!args->src || !args->dst)
 		return -EINVAL;
+	if (args->fault_page && !is_device_private_page(args->fault_page))
+		return -EINVAL;
 
 	memset(args->src, 0, sizeof(*args->src) * nr_pages);
 	args->cpages = 0;
@@ -735,8 +738,13 @@ void migrate_vma_pages(struct migrate_vma *migrate)
 			continue;
 		}
 
-		r = migrate_folio(mapping, page_folio(newpage),
-				page_folio(page), MIGRATE_SYNC_NO_COPY);
+		if (migrate->fault_page == page)
+			r = migrate_folio_extra(mapping, page_folio(newpage),
+						page_folio(page),
+						MIGRATE_SYNC_NO_COPY, 1);
+		else
+			r = migrate_folio(mapping, page_folio(newpage),
+					page_folio(page), MIGRATE_SYNC_NO_COPY);
 		if (r != MIGRATEPAGE_SUCCESS)
 			migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
 	}
-- 
git-series 0.9.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [Nouveau] [PATCH v2 2/8] mm: Free device private pages have zero refcount
  2022-09-28 12:01 [Nouveau] [PATCH v2 0/8] Fix several device private page reference counting issues Alistair Popple
  2022-09-28 12:01 ` [Nouveau] [PATCH v2 1/8] mm/memory.c: Fix race when faulting a device private page Alistair Popple
@ 2022-09-28 12:01 ` Alistair Popple
  2022-09-29 19:21   ` Felix Kuehling
  2022-09-28 12:01 ` [Nouveau] [PATCH v2 3/8] mm/memremap.c: Take a pgmap reference on page allocation Alistair Popple
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 18+ messages in thread
From: Alistair Popple @ 2022-09-28 12:01 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Alex Sierra, Ralph Campbell, nouveau, Felix Kuehling,
	Alistair Popple, linux-kernel, dri-devel, amd-gfx,
	Jason Gunthorpe, Michael Ellerman, Alex Deucher, Dan Williams,
	Christian König, Ben Skeggs

Since 27674ef6c73f ("mm: remove the extra ZONE_DEVICE struct page
refcount") device private pages have no longer had an extra reference
count when the page is in use. However before handing them back to the
owning device driver we add an extra reference count such that free
pages have a reference count of one.

This makes it difficult to tell if a page is free or not because both
free and in use pages will have a non-zero refcount. Instead we should
return pages to the drivers page allocator with a zero reference count.
Kernel code can then safely use kernel functions such as
get_page_unless_zero().

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Alex Sierra <alex.sierra@amd.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>

---

This will conflict with Dan's series to fix reference counts for DAX[1].
At the moment this only makes changes for device private and coherent
pages, however if DAX is fixed to remove the extra refcount then we
should just be able to drop the checks for private/coherent pages and
treat them the same.

[1] - https://lore.kernel.org/linux-mm/166329930818.2786261.6086109734008025807.stgit@dwillia2-xfh.jf.intel.com/
---
 arch/powerpc/kvm/book3s_hv_uvmem.c       |  2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  2 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c   |  2 +-
 include/linux/memremap.h                 |  1 +
 lib/test_hmm.c                           |  2 +-
 mm/memremap.c                            |  9 +++++++++
 mm/page_alloc.c                          |  8 ++++++++
 7 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index d4eacf4..9d8de68 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -718,7 +718,7 @@ static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm)
 
 	dpage = pfn_to_page(uvmem_pfn);
 	dpage->zone_device_data = pvt;
-	lock_page(dpage);
+	zone_device_page_init(dpage);
 	return dpage;
 out_clear:
 	spin_lock(&kvmppc_uvmem_bitmap_lock);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 776448b..97a6845 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -223,7 +223,7 @@ svm_migrate_get_vram_page(struct svm_range *prange, unsigned long pfn)
 	page = pfn_to_page(pfn);
 	svm_range_bo_ref(prange->svm_bo);
 	page->zone_device_data = prange->svm_bo;
-	lock_page(page);
+	zone_device_page_init(page);
 }
 
 static void
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 1635661..b092988 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -326,7 +326,7 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
 			return NULL;
 	}
 
-	lock_page(page);
+	zone_device_page_init(page);
 	return page;
 }
 
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 1901049..f68bf6d 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -182,6 +182,7 @@ static inline bool folio_is_device_coherent(const struct folio *folio)
 }
 
 #ifdef CONFIG_ZONE_DEVICE
+void zone_device_page_init(struct page *page);
 void *memremap_pages(struct dev_pagemap *pgmap, int nid);
 void memunmap_pages(struct dev_pagemap *pgmap);
 void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 89463ff..688c15d 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -627,8 +627,8 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
 			goto error;
 	}
 
+	zone_device_page_init(dpage);
 	dpage->zone_device_data = rpage;
-	lock_page(dpage);
 	return dpage;
 
 error:
diff --git a/mm/memremap.c b/mm/memremap.c
index 25029a4..1c2c038 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -505,8 +505,17 @@ void free_zone_device_page(struct page *page)
 	/*
 	 * Reset the page count to 1 to prepare for handing out the page again.
 	 */
+	if (page->pgmap->type != MEMORY_DEVICE_PRIVATE &&
+	    page->pgmap->type != MEMORY_DEVICE_COHERENT)
+		set_page_count(page, 1);
+}
+
+void zone_device_page_init(struct page *page)
+{
 	set_page_count(page, 1);
+	lock_page(page);
 }
+EXPORT_SYMBOL_GPL(zone_device_page_init);
 
 #ifdef CONFIG_FS_DAX
 bool __put_devmap_managed_page_refs(struct page *page, int refs)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9d49803..4df1e43 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6744,6 +6744,14 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
 		set_pageblock_migratetype(page, MIGRATE_MOVABLE);
 		cond_resched();
 	}
+
+	/*
+	 * ZONE_DEVICE pages are released directly to the driver page allocator
+	 * which will set the page count to 1 when allocating the page.
+	 */
+	if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
+	    pgmap->type == MEMORY_DEVICE_COHERENT)
+		set_page_count(page, 0);
 }
 
 /*
-- 
git-series 0.9.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [Nouveau] [PATCH v2 3/8] mm/memremap.c: Take a pgmap reference on page allocation
  2022-09-28 12:01 [Nouveau] [PATCH v2 0/8] Fix several device private page reference counting issues Alistair Popple
  2022-09-28 12:01 ` [Nouveau] [PATCH v2 1/8] mm/memory.c: Fix race when faulting a device private page Alistair Popple
  2022-09-28 12:01 ` [Nouveau] [PATCH v2 2/8] mm: Free device private pages have zero refcount Alistair Popple
@ 2022-09-28 12:01 ` Alistair Popple
  2022-09-28 12:01 ` [Nouveau] [PATCH v2 4/8] mm/migrate_device.c: Refactor migrate_vma and migrate_deivce_coherent_page() Alistair Popple
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Alistair Popple @ 2022-09-28 12:01 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Alex Sierra, Ralph Campbell, nouveau, Felix Kuehling,
	Alistair Popple, linux-kernel, dri-devel, amd-gfx,
	Jason Gunthorpe, Alex Deucher, Dan Williams,
	Christian König, Ben Skeggs

ZONE_DEVICE pages have a struct dev_pagemap which is allocated by a
driver. When the struct page is first allocated by the kernel in
memremap_pages() a reference is taken on the associated pagemap to
ensure it is not freed prior to the pages being freed.

Prior to 27674ef6c73f ("mm: remove the extra ZONE_DEVICE struct page
refcount") pages were considered free and returned to the driver when
the reference count dropped to one. However the pagemap reference was
not dropped until the page reference count hit zero. This would occur as
part of the final put_page() in memunmap_pages() which would wait for
all pages to be freed prior to returning.

When the extra refcount was removed the pagemap reference was no longer
being dropped in put_page(). Instead memunmap_pages() was changed to
explicitly drop the pagemap references. This means that memunmap_pages()
can complete even though pages are still mapped by the kernel which can
lead to kernel crashes, particularly if a driver frees the pagemap.

To fix this drivers should take a pagemap reference when allocating the
page. This reference can then be returned when the page is freed.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Fixes: 27674ef6c73f ("mm: remove the extra ZONE_DEVICE struct page refcount")
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Alex Sierra <alex.sierra@amd.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>

---

Again I expect this will conflict with Dan's series. This implements the
first suggestion from Jason at
https://lore.kernel.org/linux-mm/YzLy5jJOF0jdlrJK@nvidia.com/ so
whatever we end up doing for DAX we should do the same here.
---
 mm/memremap.c | 25 +++++++++++++++++++------
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/mm/memremap.c b/mm/memremap.c
index 1c2c038..421bec3 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -138,8 +138,11 @@ void memunmap_pages(struct dev_pagemap *pgmap)
 	int i;
 
 	percpu_ref_kill(&pgmap->ref);
-	for (i = 0; i < pgmap->nr_range; i++)
-		percpu_ref_put_many(&pgmap->ref, pfn_len(pgmap, i));
+	if (pgmap->type != MEMORY_DEVICE_PRIVATE &&
+	    pgmap->type != MEMORY_DEVICE_COHERENT)
+		for (i = 0; i < pgmap->nr_range; i++)
+			percpu_ref_put_many(&pgmap->ref, pfn_len(pgmap, i));
+
 	wait_for_completion(&pgmap->done);
 
 	for (i = 0; i < pgmap->nr_range; i++)
@@ -264,7 +267,9 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
 	memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
 				PHYS_PFN(range->start),
 				PHYS_PFN(range_len(range)), pgmap);
-	percpu_ref_get_many(&pgmap->ref, pfn_len(pgmap, range_id));
+	if (pgmap->type != MEMORY_DEVICE_PRIVATE &&
+	    pgmap->type != MEMORY_DEVICE_COHERENT)
+		percpu_ref_get_many(&pgmap->ref, pfn_len(pgmap, range_id));
 	return 0;
 
 err_add_memory:
@@ -502,16 +507,24 @@ void free_zone_device_page(struct page *page)
 	page->mapping = NULL;
 	page->pgmap->ops->page_free(page);
 
-	/*
-	 * Reset the page count to 1 to prepare for handing out the page again.
-	 */
 	if (page->pgmap->type != MEMORY_DEVICE_PRIVATE &&
 	    page->pgmap->type != MEMORY_DEVICE_COHERENT)
+		/*
+		 * Reset the page count to 1 to prepare for handing out the page
+		 * again.
+		 */
 		set_page_count(page, 1);
+	else
+		put_dev_pagemap(page->pgmap);
 }
 
 void zone_device_page_init(struct page *page)
 {
+	/*
+	 * Drivers shouldn't be allocating pages after calling
+	 * memunmap_pages().
+	 */
+	WARN_ON_ONCE(!percpu_ref_tryget_live(&page->pgmap->ref));
 	set_page_count(page, 1);
 	lock_page(page);
 }
-- 
git-series 0.9.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [Nouveau] [PATCH v2 4/8] mm/migrate_device.c: Refactor migrate_vma and migrate_deivce_coherent_page()
  2022-09-28 12:01 [Nouveau] [PATCH v2 0/8] Fix several device private page reference counting issues Alistair Popple
                   ` (2 preceding siblings ...)
  2022-09-28 12:01 ` [Nouveau] [PATCH v2 3/8] mm/memremap.c: Take a pgmap reference on page allocation Alistair Popple
@ 2022-09-28 12:01 ` Alistair Popple
  2022-09-28 12:01 ` [Nouveau] [PATCH v2 5/8] mm/migrate_device.c: Add migrate_device_range() Alistair Popple
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Alistair Popple @ 2022-09-28 12:01 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Ralph Campbell, Matthew Wilcox, David Hildenbrand, nouveau,
	Yang Shi, Alistair Popple, linux-kernel, dri-devel, amd-gfx,
	Zi Yan, Huang, Ying

migrate_device_coherent_page() reuses the existing migrate_vma family of
functions to migrate a specific page without providing a valid mapping
or vma. This looks a bit odd because it means we are calling
migrate_vma_*() without setting a valid vma, however it was considered
acceptable at the time because the details were internal to
migrate_device.c and there was only a single user.

One of the reasons the details could be kept internal was that this was
strictly for migrating device coherent memory. Such memory can be copied
directly by the CPU without intervention from a driver. However this
isn't true for device private memory, and a future change requires
similar functionality for device private memory. So refactor the code
into something more sensible for migrating device memory without a vma.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 mm/migrate_device.c | 150 +++++++++++++++++++++++++--------------------
 1 file changed, 85 insertions(+), 65 deletions(-)

diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index f756c00..ba479b5 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -345,26 +345,20 @@ static bool migrate_vma_check_page(struct page *page, struct page *fault_page)
 }
 
 /*
- * migrate_vma_unmap() - replace page mapping with special migration pte entry
- * @migrate: migrate struct containing all migration information
- *
- * Isolate pages from the LRU and replace mappings (CPU page table pte) with a
- * special migration pte entry and check if it has been pinned. Pinned pages are
- * restored because we cannot migrate them.
- *
- * This is the last step before we call the device driver callback to allocate
- * destination memory and copy contents of original page over to new page.
+ * Unmaps pages for migration. Returns number of unmapped pages.
  */
-static void migrate_vma_unmap(struct migrate_vma *migrate)
+static unsigned long migrate_device_unmap(unsigned long *src_pfns,
+					  unsigned long npages,
+					  struct page *fault_page)
 {
-	const unsigned long npages = migrate->npages;
 	unsigned long i, restore = 0;
 	bool allow_drain = true;
+	unsigned long unmapped = 0;
 
 	lru_add_drain();
 
 	for (i = 0; i < npages; i++) {
-		struct page *page = migrate_pfn_to_page(migrate->src[i]);
+		struct page *page = migrate_pfn_to_page(src_pfns[i]);
 		struct folio *folio;
 
 		if (!page)
@@ -379,8 +373,7 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
 			}
 
 			if (isolate_lru_page(page)) {
-				migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
-				migrate->cpages--;
+				src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
 				restore++;
 				continue;
 			}
@@ -394,34 +387,54 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
 			try_to_migrate(folio, 0);
 
 		if (page_mapped(page) ||
-		    !migrate_vma_check_page(page, migrate->fault_page)) {
+		    !migrate_vma_check_page(page, fault_page)) {
 			if (!is_zone_device_page(page)) {
 				get_page(page);
 				putback_lru_page(page);
 			}
 
-			migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
-			migrate->cpages--;
+			src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
 			restore++;
 			continue;
 		}
+
+		unmapped++;
 	}
 
 	for (i = 0; i < npages && restore; i++) {
-		struct page *page = migrate_pfn_to_page(migrate->src[i]);
+		struct page *page = migrate_pfn_to_page(src_pfns[i]);
 		struct folio *folio;
 
-		if (!page || (migrate->src[i] & MIGRATE_PFN_MIGRATE))
+		if (!page || (src_pfns[i] & MIGRATE_PFN_MIGRATE))
 			continue;
 
 		folio = page_folio(page);
 		remove_migration_ptes(folio, folio, false);
 
-		migrate->src[i] = 0;
+		src_pfns[i] = 0;
 		folio_unlock(folio);
 		folio_put(folio);
 		restore--;
 	}
+
+	return unmapped;
+}
+
+/*
+ * migrate_vma_unmap() - replace page mapping with special migration pte entry
+ * @migrate: migrate struct containing all migration information
+ *
+ * Isolate pages from the LRU and replace mappings (CPU page table pte) with a
+ * special migration pte entry and check if it has been pinned. Pinned pages are
+ * restored because we cannot migrate them.
+ *
+ * This is the last step before we call the device driver callback to allocate
+ * destination memory and copy contents of original page over to new page.
+ */
+static void migrate_vma_unmap(struct migrate_vma *migrate)
+{
+	migrate->cpages = migrate_device_unmap(migrate->src, migrate->npages,
+					migrate->fault_page);
 }
 
 /**
@@ -668,41 +681,36 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 	*src &= ~MIGRATE_PFN_MIGRATE;
 }
 
-/**
- * migrate_vma_pages() - migrate meta-data from src page to dst page
- * @migrate: migrate struct containing all migration information
- *
- * This migrates struct page meta-data from source struct page to destination
- * struct page. This effectively finishes the migration from source page to the
- * destination page.
- */
-void migrate_vma_pages(struct migrate_vma *migrate)
+static void migrate_device_pages(unsigned long *src_pfns,
+				unsigned long *dst_pfns, unsigned long npages,
+				struct migrate_vma *migrate)
 {
-	const unsigned long npages = migrate->npages;
-	const unsigned long start = migrate->start;
 	struct mmu_notifier_range range;
-	unsigned long addr, i;
+	unsigned long i;
 	bool notified = false;
 
-	for (i = 0, addr = start; i < npages; addr += PAGE_SIZE, i++) {
-		struct page *newpage = migrate_pfn_to_page(migrate->dst[i]);
-		struct page *page = migrate_pfn_to_page(migrate->src[i]);
+	for (i = 0; i < npages; i++) {
+		struct page *newpage = migrate_pfn_to_page(dst_pfns[i]);
+		struct page *page = migrate_pfn_to_page(src_pfns[i]);
 		struct address_space *mapping;
 		int r;
 
 		if (!newpage) {
-			migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+			src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
 			continue;
 		}
 
 		if (!page) {
+			unsigned long addr;
+
 			/*
 			 * The only time there is no vma is when called from
 			 * migrate_device_coherent_page(). However this isn't
 			 * called if the page could not be unmapped.
 			 */
-			VM_BUG_ON(!migrate->vma);
-			if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE))
+			VM_BUG_ON(!migrate);
+			addr = migrate->start + i*PAGE_SIZE;
+			if (!(src_pfns[i] & MIGRATE_PFN_MIGRATE))
 				continue;
 			if (!notified) {
 				notified = true;
@@ -714,7 +722,7 @@ void migrate_vma_pages(struct migrate_vma *migrate)
 				mmu_notifier_invalidate_range_start(&range);
 			}
 			migrate_vma_insert_page(migrate, addr, newpage,
-						&migrate->src[i]);
+						&src_pfns[i]);
 			continue;
 		}
 
@@ -727,18 +735,18 @@ void migrate_vma_pages(struct migrate_vma *migrate)
 			 * device private or coherent memory.
 			 */
 			if (mapping) {
-				migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+				src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
 				continue;
 			}
 		} else if (is_zone_device_page(newpage)) {
 			/*
 			 * Other types of ZONE_DEVICE page are not supported.
 			 */
-			migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+			src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
 			continue;
 		}
 
-		if (migrate->fault_page == page)
+		if (migrate && migrate->fault_page == page)
 			r = migrate_folio_extra(mapping, page_folio(newpage),
 						page_folio(page),
 						MIGRATE_SYNC_NO_COPY, 1);
@@ -746,7 +754,7 @@ void migrate_vma_pages(struct migrate_vma *migrate)
 			r = migrate_folio(mapping, page_folio(newpage),
 					page_folio(page), MIGRATE_SYNC_NO_COPY);
 		if (r != MIGRATEPAGE_SUCCESS)
-			migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+			src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
 	}
 
 	/*
@@ -757,28 +765,30 @@ void migrate_vma_pages(struct migrate_vma *migrate)
 	if (notified)
 		mmu_notifier_invalidate_range_only_end(&range);
 }
-EXPORT_SYMBOL(migrate_vma_pages);
 
 /**
- * migrate_vma_finalize() - restore CPU page table entry
+ * migrate_vma_pages() - migrate meta-data from src page to dst page
  * @migrate: migrate struct containing all migration information
  *
- * This replaces the special migration pte entry with either a mapping to the
- * new page if migration was successful for that page, or to the original page
- * otherwise.
- *
- * This also unlocks the pages and puts them back on the lru, or drops the extra
- * refcount, for device pages.
+ * This migrates struct page meta-data from source struct page to destination
+ * struct page. This effectively finishes the migration from source page to the
+ * destination page.
  */
-void migrate_vma_finalize(struct migrate_vma *migrate)
+void migrate_vma_pages(struct migrate_vma *migrate)
+{
+	migrate_device_pages(migrate->src, migrate->dst, migrate->npages, migrate);
+}
+EXPORT_SYMBOL(migrate_vma_pages);
+
+static void migrate_device_finalize(unsigned long *src_pfns,
+				unsigned long *dst_pfns, unsigned long npages)
 {
-	const unsigned long npages = migrate->npages;
 	unsigned long i;
 
 	for (i = 0; i < npages; i++) {
 		struct folio *dst, *src;
-		struct page *newpage = migrate_pfn_to_page(migrate->dst[i]);
-		struct page *page = migrate_pfn_to_page(migrate->src[i]);
+		struct page *newpage = migrate_pfn_to_page(dst_pfns[i]);
+		struct page *page = migrate_pfn_to_page(src_pfns[i]);
 
 		if (!page) {
 			if (newpage) {
@@ -788,7 +798,7 @@ void migrate_vma_finalize(struct migrate_vma *migrate)
 			continue;
 		}
 
-		if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE) || !newpage) {
+		if (!(src_pfns[i] & MIGRATE_PFN_MIGRATE) || !newpage) {
 			if (newpage) {
 				unlock_page(newpage);
 				put_page(newpage);
@@ -815,6 +825,22 @@ void migrate_vma_finalize(struct migrate_vma *migrate)
 		}
 	}
 }
+
+/**
+ * migrate_vma_finalize() - restore CPU page table entry
+ * @migrate: migrate struct containing all migration information
+ *
+ * This replaces the special migration pte entry with either a mapping to the
+ * new page if migration was successful for that page, or to the original page
+ * otherwise.
+ *
+ * This also unlocks the pages and puts them back on the lru, or drops the extra
+ * refcount, for device pages.
+ */
+void migrate_vma_finalize(struct migrate_vma *migrate)
+{
+	migrate_device_finalize(migrate->src, migrate->dst, migrate->npages);
+}
 EXPORT_SYMBOL(migrate_vma_finalize);
 
 /*
@@ -825,25 +851,19 @@ EXPORT_SYMBOL(migrate_vma_finalize);
 int migrate_device_coherent_page(struct page *page)
 {
 	unsigned long src_pfn, dst_pfn = 0;
-	struct migrate_vma args;
 	struct page *dpage;
 
 	WARN_ON_ONCE(PageCompound(page));
 
 	lock_page(page);
 	src_pfn = migrate_pfn(page_to_pfn(page)) | MIGRATE_PFN_MIGRATE;
-	args.src = &src_pfn;
-	args.dst = &dst_pfn;
-	args.cpages = 1;
-	args.npages = 1;
-	args.vma = NULL;
 
 	/*
 	 * We don't have a VMA and don't need to walk the page tables to find
 	 * the source page. So call migrate_vma_unmap() directly to unmap the
 	 * page as migrate_vma_setup() will fail if args.vma == NULL.
 	 */
-	migrate_vma_unmap(&args);
+	migrate_device_unmap(&src_pfn, 1, NULL);
 	if (!(src_pfn & MIGRATE_PFN_MIGRATE))
 		return -EBUSY;
 
@@ -853,10 +873,10 @@ int migrate_device_coherent_page(struct page *page)
 		dst_pfn = migrate_pfn(page_to_pfn(dpage));
 	}
 
-	migrate_vma_pages(&args);
+	migrate_device_pages(&src_pfn, &dst_pfn, 1, NULL);
 	if (src_pfn & MIGRATE_PFN_MIGRATE)
 		copy_highpage(dpage, page);
-	migrate_vma_finalize(&args);
+	migrate_device_finalize(&src_pfn, &dst_pfn, 1);
 
 	if (src_pfn & MIGRATE_PFN_MIGRATE)
 		return 0;
-- 
git-series 0.9.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [Nouveau] [PATCH v2 5/8] mm/migrate_device.c: Add migrate_device_range()
  2022-09-28 12:01 [Nouveau] [PATCH v2 0/8] Fix several device private page reference counting issues Alistair Popple
                   ` (3 preceding siblings ...)
  2022-09-28 12:01 ` [Nouveau] [PATCH v2 4/8] mm/migrate_device.c: Refactor migrate_vma and migrate_deivce_coherent_page() Alistair Popple
@ 2022-09-28 12:01 ` Alistair Popple
  2022-09-28 12:01 ` [Nouveau] [PATCH v2 6/8] nouveau/dmem: Refactor nouveau_dmem_fault_copy_one() Alistair Popple
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Alistair Popple @ 2022-09-28 12:01 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Ralph Campbell, Matthew Wilcox, David Hildenbrand, nouveau,
	Yang Shi, Alistair Popple, linux-kernel, dri-devel, amd-gfx,
	Zi Yan, Huang, Ying

Device drivers can use the migrate_vma family of functions to migrate
existing private anonymous mappings to device private pages. These pages
are backed by memory on the device with drivers being responsible for
copying data to and from device memory.

Device private pages are freed via the pgmap->page_free() callback when
they are unmapped and their refcount drops to zero. Alternatively they
may be freed indirectly via migration back to CPU memory in response to
a pgmap->migrate_to_ram() callback called whenever the CPU accesses
an address mapped to a device private page.

In other words drivers cannot control the lifetime of data allocated on
the devices and must wait until these pages are freed from userspace.
This causes issues when memory needs to reclaimed on the device, either
because the device is going away due to a ->release() callback or
because another user needs to use the memory.

Drivers could use the existing migrate_vma functions to migrate data off
the device. However this would require them to track the mappings of
each page which is both complicated and not always possible. Instead
drivers need to be able to migrate device pages directly so they can
free up device memory.

To allow that this patch introduces the migrate_device family of
functions which are functionally similar to migrate_vma but which skips
the initial lookup based on mapping.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/migrate.h |  7 +++-
 mm/migrate_device.c     | 89 ++++++++++++++++++++++++++++++++++++++----
 2 files changed, 89 insertions(+), 7 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 82ffa47..582cdc7 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -225,6 +225,13 @@ struct migrate_vma {
 int migrate_vma_setup(struct migrate_vma *args);
 void migrate_vma_pages(struct migrate_vma *migrate);
 void migrate_vma_finalize(struct migrate_vma *migrate);
+int migrate_device_range(unsigned long *src_pfns, unsigned long start,
+			unsigned long npages);
+void migrate_device_pages(unsigned long *src_pfns, unsigned long *dst_pfns,
+			unsigned long npages);
+void migrate_device_finalize(unsigned long *src_pfns,
+			unsigned long *dst_pfns, unsigned long npages);
+
 #endif /* CONFIG_MIGRATION */
 
 #endif /* _LINUX_MIGRATE_H */
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index ba479b5..824860a 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -681,7 +681,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 	*src &= ~MIGRATE_PFN_MIGRATE;
 }
 
-static void migrate_device_pages(unsigned long *src_pfns,
+static void __migrate_device_pages(unsigned long *src_pfns,
 				unsigned long *dst_pfns, unsigned long npages,
 				struct migrate_vma *migrate)
 {
@@ -703,6 +703,9 @@ static void migrate_device_pages(unsigned long *src_pfns,
 		if (!page) {
 			unsigned long addr;
 
+			if (!(src_pfns[i] & MIGRATE_PFN_MIGRATE))
+				continue;
+
 			/*
 			 * The only time there is no vma is when called from
 			 * migrate_device_coherent_page(). However this isn't
@@ -710,8 +713,6 @@ static void migrate_device_pages(unsigned long *src_pfns,
 			 */
 			VM_BUG_ON(!migrate);
 			addr = migrate->start + i*PAGE_SIZE;
-			if (!(src_pfns[i] & MIGRATE_PFN_MIGRATE))
-				continue;
 			if (!notified) {
 				notified = true;
 
@@ -767,6 +768,22 @@ static void migrate_device_pages(unsigned long *src_pfns,
 }
 
 /**
+ * migrate_device_pages() - migrate meta-data from src page to dst page
+ * @src_pfns: src_pfns returned from migrate_device_range()
+ * @dst_pfns: array of pfns allocated by the driver to migrate memory to
+ * @npages: number of pages in the range
+ *
+ * Equivalent to migrate_vma_pages(). This is called to migrate struct page
+ * meta-data from source struct page to destination.
+ */
+void migrate_device_pages(unsigned long *src_pfns, unsigned long *dst_pfns,
+			unsigned long npages)
+{
+	__migrate_device_pages(src_pfns, dst_pfns, npages, NULL);
+}
+EXPORT_SYMBOL(migrate_device_pages);
+
+/**
  * migrate_vma_pages() - migrate meta-data from src page to dst page
  * @migrate: migrate struct containing all migration information
  *
@@ -776,12 +793,22 @@ static void migrate_device_pages(unsigned long *src_pfns,
  */
 void migrate_vma_pages(struct migrate_vma *migrate)
 {
-	migrate_device_pages(migrate->src, migrate->dst, migrate->npages, migrate);
+	__migrate_device_pages(migrate->src, migrate->dst, migrate->npages, migrate);
 }
 EXPORT_SYMBOL(migrate_vma_pages);
 
-static void migrate_device_finalize(unsigned long *src_pfns,
-				unsigned long *dst_pfns, unsigned long npages)
+/*
+ * migrate_device_finalize() - complete page migration
+ * @src_pfns: src_pfns returned from migrate_device_range()
+ * @dst_pfns: array of pfns allocated by the driver to migrate memory to
+ * @npages: number of pages in the range
+ *
+ * Completes migration of the page by removing special migration entries.
+ * Drivers must ensure copying of page data is complete and visible to the CPU
+ * before calling this.
+ */
+void migrate_device_finalize(unsigned long *src_pfns,
+			unsigned long *dst_pfns, unsigned long npages)
 {
 	unsigned long i;
 
@@ -825,6 +852,7 @@ static void migrate_device_finalize(unsigned long *src_pfns,
 		}
 	}
 }
+EXPORT_SYMBOL(migrate_device_finalize);
 
 /**
  * migrate_vma_finalize() - restore CPU page table entry
@@ -843,6 +871,53 @@ void migrate_vma_finalize(struct migrate_vma *migrate)
 }
 EXPORT_SYMBOL(migrate_vma_finalize);
 
+/**
+ * migrate_device_range() - migrate device private pfns to normal memory.
+ * @src_pfns: array large enough to hold migrating source device private pfns.
+ * @start: starting pfn in the range to migrate.
+ * @npages: number of pages to migrate.
+ *
+ * migrate_vma_setup() is similar in concept to migrate_vma_setup() except that
+ * instead of looking up pages based on virtual address mappings a range of
+ * device pfns that should be migrated to system memory is used instead.
+ *
+ * This is useful when a driver needs to free device memory but doesn't know the
+ * virtual mappings of every page that may be in device memory. For example this
+ * is often the case when a driver is being unloaded or unbound from a device.
+ *
+ * Like migrate_vma_setup() this function will take a reference and lock any
+ * migrating pages that aren't free before unmapping them. Drivers may then
+ * allocate destination pages and start copying data from the device to CPU
+ * memory before calling migrate_device_pages().
+ */
+int migrate_device_range(unsigned long *src_pfns, unsigned long start,
+			unsigned long npages)
+{
+	unsigned long i, pfn;
+
+	for (pfn = start, i = 0; i < npages; pfn++, i++) {
+		struct page *page = pfn_to_page(pfn);
+
+		if (!get_page_unless_zero(page)) {
+			src_pfns[i] = 0;
+			continue;
+		}
+
+		if (!trylock_page(page)) {
+			src_pfns[i] = 0;
+			put_page(page);
+			continue;
+		}
+
+		src_pfns[i] = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
+	}
+
+	migrate_device_unmap(src_pfns, npages, NULL);
+
+	return 0;
+}
+EXPORT_SYMBOL(migrate_device_range);
+
 /*
  * Migrate a device coherent page back to normal memory. The caller should have
  * a reference on page which will be copied to the new page if migration is
@@ -873,7 +948,7 @@ int migrate_device_coherent_page(struct page *page)
 		dst_pfn = migrate_pfn(page_to_pfn(dpage));
 	}
 
-	migrate_device_pages(&src_pfn, &dst_pfn, 1, NULL);
+	migrate_device_pages(&src_pfn, &dst_pfn, 1);
 	if (src_pfn & MIGRATE_PFN_MIGRATE)
 		copy_highpage(dpage, page);
 	migrate_device_finalize(&src_pfn, &dst_pfn, 1);
-- 
git-series 0.9.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [Nouveau] [PATCH v2 6/8] nouveau/dmem: Refactor nouveau_dmem_fault_copy_one()
  2022-09-28 12:01 [Nouveau] [PATCH v2 0/8] Fix several device private page reference counting issues Alistair Popple
                   ` (4 preceding siblings ...)
  2022-09-28 12:01 ` [Nouveau] [PATCH v2 5/8] mm/migrate_device.c: Add migrate_device_range() Alistair Popple
@ 2022-09-28 12:01 ` Alistair Popple
  2022-09-28 12:01 ` [Nouveau] [PATCH v2 7/8] nouveau/dmem: Evict device private memory during release Alistair Popple
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Alistair Popple @ 2022-09-28 12:01 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Ralph Campbell, nouveau, Alistair Popple, linux-kernel,
	dri-devel, amd-gfx, Ben Skeggs

nouveau_dmem_fault_copy_one() is used during handling of CPU faults via
the migrate_to_ram() callback and is used to copy data from GPU to CPU
memory. It is currently specific to fault handling, however a future
patch implementing eviction of data during teardown needs similar
functionality.

Refactor out the core functionality so that it is not specific to fault
handling.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Lyude Paul <lyude@redhat.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 drivers/gpu/drm/nouveau/nouveau_dmem.c | 58 +++++++++++++--------------
 1 file changed, 28 insertions(+), 30 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index b092988..65f51fb 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -139,44 +139,24 @@ static void nouveau_dmem_fence_done(struct nouveau_fence **fence)
 	}
 }
 
-static vm_fault_t nouveau_dmem_fault_copy_one(struct nouveau_drm *drm,
-		struct vm_fault *vmf, struct migrate_vma *args,
-		dma_addr_t *dma_addr)
+static int nouveau_dmem_copy_one(struct nouveau_drm *drm, struct page *spage,
+				struct page *dpage, dma_addr_t *dma_addr)
 {
 	struct device *dev = drm->dev->dev;
-	struct page *dpage, *spage;
-	struct nouveau_svmm *svmm;
-
-	spage = migrate_pfn_to_page(args->src[0]);
-	if (!spage || !(args->src[0] & MIGRATE_PFN_MIGRATE))
-		return 0;
 
-	dpage = alloc_page_vma(GFP_HIGHUSER, vmf->vma, vmf->address);
-	if (!dpage)
-		return VM_FAULT_SIGBUS;
 	lock_page(dpage);
 
 	*dma_addr = dma_map_page(dev, dpage, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
 	if (dma_mapping_error(dev, *dma_addr))
-		goto error_free_page;
+		return -EIO;
 
-	svmm = spage->zone_device_data;
-	mutex_lock(&svmm->mutex);
-	nouveau_svmm_invalidate(svmm, args->start, args->end);
 	if (drm->dmem->migrate.copy_func(drm, 1, NOUVEAU_APER_HOST, *dma_addr,
-			NOUVEAU_APER_VRAM, nouveau_dmem_page_addr(spage)))
-		goto error_dma_unmap;
-	mutex_unlock(&svmm->mutex);
+					 NOUVEAU_APER_VRAM, nouveau_dmem_page_addr(spage))) {
+		dma_unmap_page(dev, *dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
+		return -EIO;
+	}
 
-	args->dst[0] = migrate_pfn(page_to_pfn(dpage));
 	return 0;
-
-error_dma_unmap:
-	mutex_unlock(&svmm->mutex);
-	dma_unmap_page(dev, *dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
-error_free_page:
-	__free_page(dpage);
-	return VM_FAULT_SIGBUS;
 }
 
 static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
@@ -184,9 +164,11 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
 	struct nouveau_drm *drm = page_to_drm(vmf->page);
 	struct nouveau_dmem *dmem = drm->dmem;
 	struct nouveau_fence *fence;
+	struct nouveau_svmm *svmm;
+	struct page *spage, *dpage;
 	unsigned long src = 0, dst = 0;
 	dma_addr_t dma_addr = 0;
-	vm_fault_t ret;
+	vm_fault_t ret = 0;
 	struct migrate_vma args = {
 		.vma		= vmf->vma,
 		.start		= vmf->address,
@@ -207,9 +189,25 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
 	if (!args.cpages)
 		return 0;
 
-	ret = nouveau_dmem_fault_copy_one(drm, vmf, &args, &dma_addr);
-	if (ret || dst == 0)
+	spage = migrate_pfn_to_page(src);
+	if (!spage || !(src & MIGRATE_PFN_MIGRATE))
+		goto done;
+
+	dpage = alloc_page_vma(GFP_HIGHUSER, vmf->vma, vmf->address);
+	if (!dpage)
+		goto done;
+
+	dst = migrate_pfn(page_to_pfn(dpage));
+
+	svmm = spage->zone_device_data;
+	mutex_lock(&svmm->mutex);
+	nouveau_svmm_invalidate(svmm, args.start, args.end);
+	ret = nouveau_dmem_copy_one(drm, spage, dpage, &dma_addr);
+	mutex_unlock(&svmm->mutex);
+	if (ret) {
+		ret = VM_FAULT_SIGBUS;
 		goto done;
+	}
 
 	nouveau_fence_new(dmem->migrate.chan, false, &fence);
 	migrate_vma_pages(&args);
-- 
git-series 0.9.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [Nouveau] [PATCH v2 7/8] nouveau/dmem: Evict device private memory during release
  2022-09-28 12:01 [Nouveau] [PATCH v2 0/8] Fix several device private page reference counting issues Alistair Popple
                   ` (5 preceding siblings ...)
  2022-09-28 12:01 ` [Nouveau] [PATCH v2 6/8] nouveau/dmem: Refactor nouveau_dmem_fault_copy_one() Alistair Popple
@ 2022-09-28 12:01 ` Alistair Popple
  2022-09-28 21:37   ` Lyude Paul
  2022-09-28 12:01 ` [Nouveau] [PATCH v2 8/8] hmm-tests: Add test for migrate_device_range() Alistair Popple
  2022-10-25 10:17 ` [Nouveau] [PATCH v2 0/8] Fix several device private page reference counting issues Vlastimil Babka (SUSE)
  8 siblings, 1 reply; 18+ messages in thread
From: Alistair Popple @ 2022-09-28 12:01 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Ralph Campbell, nouveau, Alistair Popple, linux-kernel,
	dri-devel, amd-gfx, Ben Skeggs

When the module is unloaded or a GPU is unbound from the module it is
possible for device private pages to still be mapped in currently
running processes. This can lead to a hangs and RCU stall warnings when
unbinding the device as memunmap_pages() will wait in an uninterruptible
state until all device pages have been freed which may never happen.

Fix this by migrating device mappings back to normal CPU memory prior to
freeing the GPU memory chunks and associated device private pages.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 drivers/gpu/drm/nouveau/nouveau_dmem.c | 48 +++++++++++++++++++++++++++-
 1 file changed, 48 insertions(+)

diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 65f51fb..5fe2091 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -367,6 +367,52 @@ nouveau_dmem_suspend(struct nouveau_drm *drm)
 	mutex_unlock(&drm->dmem->mutex);
 }
 
+/*
+ * Evict all pages mapping a chunk.
+ */
+static void
+nouveau_dmem_evict_chunk(struct nouveau_dmem_chunk *chunk)
+{
+	unsigned long i, npages = range_len(&chunk->pagemap.range) >> PAGE_SHIFT;
+	unsigned long *src_pfns, *dst_pfns;
+	dma_addr_t *dma_addrs;
+	struct nouveau_fence *fence;
+
+	src_pfns = kcalloc(npages, sizeof(*src_pfns), GFP_KERNEL);
+	dst_pfns = kcalloc(npages, sizeof(*dst_pfns), GFP_KERNEL);
+	dma_addrs = kcalloc(npages, sizeof(*dma_addrs), GFP_KERNEL);
+
+	migrate_device_range(src_pfns, chunk->pagemap.range.start >> PAGE_SHIFT,
+			npages);
+
+	for (i = 0; i < npages; i++) {
+		if (src_pfns[i] & MIGRATE_PFN_MIGRATE) {
+			struct page *dpage;
+
+			/*
+			 * _GFP_NOFAIL because the GPU is going away and there
+			 * is nothing sensible we can do if we can't copy the
+			 * data back.
+			 */
+			dpage = alloc_page(GFP_HIGHUSER | __GFP_NOFAIL);
+			dst_pfns[i] = migrate_pfn(page_to_pfn(dpage));
+			nouveau_dmem_copy_one(chunk->drm,
+					migrate_pfn_to_page(src_pfns[i]), dpage,
+					&dma_addrs[i]);
+		}
+	}
+
+	nouveau_fence_new(chunk->drm->dmem->migrate.chan, false, &fence);
+	migrate_device_pages(src_pfns, dst_pfns, npages);
+	nouveau_dmem_fence_done(&fence);
+	migrate_device_finalize(src_pfns, dst_pfns, npages);
+	kfree(src_pfns);
+	kfree(dst_pfns);
+	for (i = 0; i < npages; i++)
+		dma_unmap_page(chunk->drm->dev->dev, dma_addrs[i], PAGE_SIZE, DMA_BIDIRECTIONAL);
+	kfree(dma_addrs);
+}
+
 void
 nouveau_dmem_fini(struct nouveau_drm *drm)
 {
@@ -378,8 +424,10 @@ nouveau_dmem_fini(struct nouveau_drm *drm)
 	mutex_lock(&drm->dmem->mutex);
 
 	list_for_each_entry_safe(chunk, tmp, &drm->dmem->chunks, list) {
+		nouveau_dmem_evict_chunk(chunk);
 		nouveau_bo_unpin(chunk->bo);
 		nouveau_bo_ref(NULL, &chunk->bo);
+		WARN_ON(chunk->callocated);
 		list_del(&chunk->list);
 		memunmap_pages(&chunk->pagemap);
 		release_mem_region(chunk->pagemap.range.start,
-- 
git-series 0.9.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [Nouveau] [PATCH v2 8/8] hmm-tests: Add test for migrate_device_range()
  2022-09-28 12:01 [Nouveau] [PATCH v2 0/8] Fix several device private page reference counting issues Alistair Popple
                   ` (6 preceding siblings ...)
  2022-09-28 12:01 ` [Nouveau] [PATCH v2 7/8] nouveau/dmem: Evict device private memory during release Alistair Popple
@ 2022-09-28 12:01 ` Alistair Popple
  2022-09-28 15:10   ` Andrew Morton
  2022-10-25 10:17 ` [Nouveau] [PATCH v2 0/8] Fix several device private page reference counting issues Vlastimil Babka (SUSE)
  8 siblings, 1 reply; 18+ messages in thread
From: Alistair Popple @ 2022-09-28 12:01 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Alex Sierra, Ralph Campbell, nouveau, Felix Kuehling,
	Alistair Popple, linux-kernel, dri-devel, amd-gfx,
	Jason Gunthorpe

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Alex Sierra <alex.sierra@amd.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
---
 lib/test_hmm.c                         | 120 +++++++++++++++++++++-----
 lib/test_hmm_uapi.h                    |   1 +-
 tools/testing/selftests/vm/hmm-tests.c |  49 +++++++++++-
 3 files changed, 149 insertions(+), 21 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 688c15d..6c2fc85 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -100,6 +100,7 @@ struct dmirror {
 struct dmirror_chunk {
 	struct dev_pagemap	pagemap;
 	struct dmirror_device	*mdevice;
+	bool remove;
 };
 
 /*
@@ -192,11 +193,15 @@ static int dmirror_fops_release(struct inode *inode, struct file *filp)
 	return 0;
 }
 
+static struct dmirror_chunk *dmirror_page_to_chunk(struct page *page)
+{
+	return container_of(page->pgmap, struct dmirror_chunk, pagemap);
+}
+
 static struct dmirror_device *dmirror_page_to_device(struct page *page)
 
 {
-	return container_of(page->pgmap, struct dmirror_chunk,
-			    pagemap)->mdevice;
+	return dmirror_page_to_chunk(page)->mdevice;
 }
 
 static int dmirror_do_fault(struct dmirror *dmirror, struct hmm_range *range)
@@ -1218,6 +1223,85 @@ static int dmirror_snapshot(struct dmirror *dmirror,
 	return ret;
 }
 
+static void dmirror_device_evict_chunk(struct dmirror_chunk *chunk)
+{
+	unsigned long start_pfn = chunk->pagemap.range.start >> PAGE_SHIFT;
+	unsigned long end_pfn = chunk->pagemap.range.end >> PAGE_SHIFT;
+	unsigned long npages = end_pfn - start_pfn + 1;
+	unsigned long i;
+	unsigned long *src_pfns;
+	unsigned long *dst_pfns;
+
+	src_pfns = kcalloc(npages, sizeof(*src_pfns), GFP_KERNEL);
+	dst_pfns = kcalloc(npages, sizeof(*dst_pfns), GFP_KERNEL);
+
+	migrate_device_range(src_pfns, start_pfn, npages);
+	for (i = 0; i < npages; i++) {
+		struct page *dpage, *spage;
+
+		spage = migrate_pfn_to_page(src_pfns[i]);
+		if (!spage || !(src_pfns[i] & MIGRATE_PFN_MIGRATE))
+			continue;
+
+		if (WARN_ON(!is_device_private_page(spage) &&
+			    !is_device_coherent_page(spage)))
+			continue;
+		spage = BACKING_PAGE(spage);
+		dpage = alloc_page(GFP_HIGHUSER_MOVABLE | __GFP_NOFAIL);
+		lock_page(dpage);
+		copy_highpage(dpage, spage);
+		dst_pfns[i] = migrate_pfn(page_to_pfn(dpage));
+		if (src_pfns[i] & MIGRATE_PFN_WRITE)
+			dst_pfns[i] |= MIGRATE_PFN_WRITE;
+	}
+	migrate_device_pages(src_pfns, dst_pfns, npages);
+	migrate_device_finalize(src_pfns, dst_pfns, npages);
+	kfree(src_pfns);
+	kfree(dst_pfns);
+}
+
+/* Removes free pages from the free list so they can't be re-allocated */
+static void dmirror_remove_free_pages(struct dmirror_chunk *devmem)
+{
+	struct dmirror_device *mdevice = devmem->mdevice;
+	struct page *page;
+
+	for (page = mdevice->free_pages; page; page = page->zone_device_data)
+		if (dmirror_page_to_chunk(page) == devmem)
+			mdevice->free_pages = page->zone_device_data;
+}
+
+static void dmirror_device_remove_chunks(struct dmirror_device *mdevice)
+{
+	unsigned int i;
+
+	mutex_lock(&mdevice->devmem_lock);
+	if (mdevice->devmem_chunks) {
+		for (i = 0; i < mdevice->devmem_count; i++) {
+			struct dmirror_chunk *devmem =
+				mdevice->devmem_chunks[i];
+
+			spin_lock(&mdevice->lock);
+			devmem->remove = true;
+			dmirror_remove_free_pages(devmem);
+			spin_unlock(&mdevice->lock);
+
+			dmirror_device_evict_chunk(devmem);
+			memunmap_pages(&devmem->pagemap);
+			if (devmem->pagemap.type == MEMORY_DEVICE_PRIVATE)
+				release_mem_region(devmem->pagemap.range.start,
+						   range_len(&devmem->pagemap.range));
+			kfree(devmem);
+		}
+		mdevice->devmem_count = 0;
+		mdevice->devmem_capacity = 0;
+		mdevice->free_pages = NULL;
+		kfree(mdevice->devmem_chunks);
+		mdevice->devmem_chunks = NULL;
+	}
+	mutex_unlock(&mdevice->devmem_lock);
+}
+
 static long dmirror_fops_unlocked_ioctl(struct file *filp,
 					unsigned int command,
 					unsigned long arg)
@@ -1272,6 +1356,11 @@ static long dmirror_fops_unlocked_ioctl(struct file *filp,
 		ret = dmirror_snapshot(dmirror, &cmd);
 		break;
 
+	case HMM_DMIRROR_RELEASE:
+		dmirror_device_remove_chunks(dmirror->mdevice);
+		ret = 0;
+		break;
+
 	default:
 		return -EINVAL;
 	}
@@ -1326,9 +1415,13 @@ static void dmirror_devmem_free(struct page *page)
 
 	mdevice = dmirror_page_to_device(page);
 	spin_lock(&mdevice->lock);
-	mdevice->cfree++;
-	page->zone_device_data = mdevice->free_pages;
-	mdevice->free_pages = page;
+
+	/* Return page to our allocator if not freeing the chunk */
+	if (!dmirror_page_to_chunk(page)->remove) {
+		mdevice->cfree++;
+		page->zone_device_data = mdevice->free_pages;
+		mdevice->free_pages = page;
+	}
 	spin_unlock(&mdevice->lock);
 }
 
@@ -1401,22 +1494,7 @@ static int dmirror_device_init(struct dmirror_device *mdevice, int id)
 
 static void dmirror_device_remove(struct dmirror_device *mdevice)
 {
-	unsigned int i;
-
-	if (mdevice->devmem_chunks) {
-		for (i = 0; i < mdevice->devmem_count; i++) {
-			struct dmirror_chunk *devmem =
-				mdevice->devmem_chunks[i];
-
-			memunmap_pages(&devmem->pagemap);
-			if (devmem->pagemap.type == MEMORY_DEVICE_PRIVATE)
-				release_mem_region(devmem->pagemap.range.start,
-						   range_len(&devmem->pagemap.range));
-			kfree(devmem);
-		}
-		kfree(mdevice->devmem_chunks);
-	}
-
+	dmirror_device_remove_chunks(mdevice);
 	cdev_del(&mdevice->cdevice);
 }
 
diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h
index e31d58c..8c818a2 100644
--- a/lib/test_hmm_uapi.h
+++ b/lib/test_hmm_uapi.h
@@ -36,6 +36,7 @@ struct hmm_dmirror_cmd {
 #define HMM_DMIRROR_SNAPSHOT		_IOWR('H', 0x04, struct hmm_dmirror_cmd)
 #define HMM_DMIRROR_EXCLUSIVE		_IOWR('H', 0x05, struct hmm_dmirror_cmd)
 #define HMM_DMIRROR_CHECK_EXCLUSIVE	_IOWR('H', 0x06, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_RELEASE		_IOWR('H', 0x07, struct hmm_dmirror_cmd)
 
 /*
  * Values returned in hmm_dmirror_cmd.ptr for HMM_DMIRROR_SNAPSHOT.
diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c
index f2c2c97..28232ad 100644
--- a/tools/testing/selftests/vm/hmm-tests.c
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -1054,6 +1054,55 @@ TEST_F(hmm, migrate_fault)
 	hmm_buffer_free(buffer);
 }
 
+TEST_F(hmm, migrate_release)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int ret;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	buffer->ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS, buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Initialize buffer in system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Migrate memory to device. */
+	ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	/* Release device memory. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_RELEASE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+
+	/* Fault pages back to system memory and check them. */
+	for (i = 0, ptr = buffer->ptr; i < size / (2 * sizeof(*ptr)); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	hmm_buffer_free(buffer);
+}
+
 /*
  * Migrate anonymous shared memory to device private memory.
  */
-- 
git-series 0.9.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [Nouveau] [PATCH v2 8/8] hmm-tests: Add test for migrate_device_range()
  2022-09-28 12:01 ` [Nouveau] [PATCH v2 8/8] hmm-tests: Add test for migrate_device_range() Alistair Popple
@ 2022-09-28 15:10   ` Andrew Morton
  2022-09-29 11:00     ` Alistair Popple
  0 siblings, 1 reply; 18+ messages in thread
From: Andrew Morton @ 2022-09-28 15:10 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Alex Sierra, Ralph Campbell, nouveau, Felix Kuehling,
	linux-kernel, dri-devel, linux-mm, amd-gfx, Jason Gunthorpe

On Wed, 28 Sep 2022 22:01:22 +1000 Alistair Popple <apopple@nvidia.com> wrote:

> @@ -1401,22 +1494,7 @@ static int dmirror_device_init(struct dmirror_device *mdevice, int id)
>  
>  static void dmirror_device_remove(struct dmirror_device *mdevice)
>  {
> -	unsigned int i;
> -
> -	if (mdevice->devmem_chunks) {
> -		for (i = 0; i < mdevice->devmem_count; i++) {
> -			struct dmirror_chunk *devmem =
> -				mdevice->devmem_chunks[i];
> -
> -			memunmap_pages(&devmem->pagemap);
> -			if (devmem->pagemap.type == MEMORY_DEVICE_PRIVATE)
> -				release_mem_region(devmem->pagemap.range.start,
> -						   range_len(&devmem->pagemap.range));
> -			kfree(devmem);
> -		}
> -		kfree(mdevice->devmem_chunks);
> -	}
> -
> +	dmirror_device_remove_chunks(mdevice);
>  	cdev_del(&mdevice->cdevice);
>  }

Needed a bit or rework due to
https://lkml.kernel.org/r/20220826050631.25771-1-mpenttil@redhat.com. 
Please check my resolution.


--- a/lib/test_hmm.c~hmm-tests-add-test-for-migrate_device_range
+++ a/lib/test_hmm.c
@@ -100,6 +100,7 @@ struct dmirror {
 struct dmirror_chunk {
 	struct dev_pagemap	pagemap;
 	struct dmirror_device	*mdevice;
+	bool remove;
 };
 
 /*
@@ -192,11 +193,15 @@ static int dmirror_fops_release(struct i
 	return 0;
 }
 
+static struct dmirror_chunk *dmirror_page_to_chunk(struct page *page)
+{
+	return container_of(page->pgmap, struct dmirror_chunk, pagemap);
+}
+
 static struct dmirror_device *dmirror_page_to_device(struct page *page)
 
 {
-	return container_of(page->pgmap, struct dmirror_chunk,
-			    pagemap)->mdevice;
+	return dmirror_page_to_chunk(page)->mdevice;
 }
 
 static int dmirror_do_fault(struct dmirror *dmirror, struct hmm_range *range)
@@ -1218,6 +1223,85 @@ static int dmirror_snapshot(struct dmirr
 	return ret;
 }
 
+static void dmirror_device_evict_chunk(struct dmirror_chunk *chunk)
+{
+	unsigned long start_pfn = chunk->pagemap.range.start >> PAGE_SHIFT;
+	unsigned long end_pfn = chunk->pagemap.range.end >> PAGE_SHIFT;
+	unsigned long npages = end_pfn - start_pfn + 1;
+	unsigned long i;
+	unsigned long *src_pfns;
+	unsigned long *dst_pfns;
+
+	src_pfns = kcalloc(npages, sizeof(*src_pfns), GFP_KERNEL);
+	dst_pfns = kcalloc(npages, sizeof(*dst_pfns), GFP_KERNEL);
+
+	migrate_device_range(src_pfns, start_pfn, npages);
+	for (i = 0; i < npages; i++) {
+		struct page *dpage, *spage;
+
+		spage = migrate_pfn_to_page(src_pfns[i]);
+		if (!spage || !(src_pfns[i] & MIGRATE_PFN_MIGRATE))
+			continue;
+
+		if (WARN_ON(!is_device_private_page(spage) &&
+			    !is_device_coherent_page(spage)))
+			continue;
+		spage = BACKING_PAGE(spage);
+		dpage = alloc_page(GFP_HIGHUSER_MOVABLE | __GFP_NOFAIL);
+		lock_page(dpage);
+		copy_highpage(dpage, spage);
+		dst_pfns[i] = migrate_pfn(page_to_pfn(dpage));
+		if (src_pfns[i] & MIGRATE_PFN_WRITE)
+			dst_pfns[i] |= MIGRATE_PFN_WRITE;
+	}
+	migrate_device_pages(src_pfns, dst_pfns, npages);
+	migrate_device_finalize(src_pfns, dst_pfns, npages);
+	kfree(src_pfns);
+	kfree(dst_pfns);
+}
+
+/* Removes free pages from the free list so they can't be re-allocated */
+static void dmirror_remove_free_pages(struct dmirror_chunk *devmem)
+{
+	struct dmirror_device *mdevice = devmem->mdevice;
+	struct page *page;
+
+	for (page = mdevice->free_pages; page; page = page->zone_device_data)
+		if (dmirror_page_to_chunk(page) == devmem)
+			mdevice->free_pages = page->zone_device_data;
+}
+
+static void dmirror_device_remove_chunks(struct dmirror_device *mdevice)
+{
+	unsigned int i;
+
+	mutex_lock(&mdevice->devmem_lock);
+	if (mdevice->devmem_chunks) {
+		for (i = 0; i < mdevice->devmem_count; i++) {
+			struct dmirror_chunk *devmem =
+				mdevice->devmem_chunks[i];
+
+			spin_lock(&mdevice->lock);
+			devmem->remove = true;
+			dmirror_remove_free_pages(devmem);
+			spin_unlock(&mdevice->lock);
+
+			dmirror_device_evict_chunk(devmem);
+			memunmap_pages(&devmem->pagemap);
+			if (devmem->pagemap.type == MEMORY_DEVICE_PRIVATE)
+				release_mem_region(devmem->pagemap.range.start,
+						   range_len(&devmem->pagemap.range));
+			kfree(devmem);
+		}
+		mdevice->devmem_count = 0;
+		mdevice->devmem_capacity = 0;
+		mdevice->free_pages = NULL;
+		kfree(mdevice->devmem_chunks);
+		mdevice->devmem_chunks = NULL;
+	}
+	mutex_unlock(&mdevice->devmem_lock);
+}
+
 static long dmirror_fops_unlocked_ioctl(struct file *filp,
 					unsigned int command,
 					unsigned long arg)
@@ -1272,6 +1356,11 @@ static long dmirror_fops_unlocked_ioctl(
 		ret = dmirror_snapshot(dmirror, &cmd);
 		break;
 
+	case HMM_DMIRROR_RELEASE:
+		dmirror_device_remove_chunks(dmirror->mdevice);
+		ret = 0;
+		break;
+
 	default:
 		return -EINVAL;
 	}
@@ -1326,9 +1415,13 @@ static void dmirror_devmem_free(struct p
 
 	mdevice = dmirror_page_to_device(page);
 	spin_lock(&mdevice->lock);
-	mdevice->cfree++;
-	page->zone_device_data = mdevice->free_pages;
-	mdevice->free_pages = page;
+
+	/* Return page to our allocator if not freeing the chunk */
+	if (!dmirror_page_to_chunk(page)->remove) {
+		mdevice->cfree++;
+		page->zone_device_data = mdevice->free_pages;
+		mdevice->free_pages = page;
+	}
 	spin_unlock(&mdevice->lock);
 }
 
@@ -1408,22 +1501,7 @@ static int dmirror_device_init(struct dm
 
 static void dmirror_device_remove(struct dmirror_device *mdevice)
 {
-	unsigned int i;
-
-	if (mdevice->devmem_chunks) {
-		for (i = 0; i < mdevice->devmem_count; i++) {
-			struct dmirror_chunk *devmem =
-				mdevice->devmem_chunks[i];
-
-			memunmap_pages(&devmem->pagemap);
-			if (devmem->pagemap.type == MEMORY_DEVICE_PRIVATE)
-				release_mem_region(devmem->pagemap.range.start,
-						   range_len(&devmem->pagemap.range));
-			kfree(devmem);
-		}
-		kfree(mdevice->devmem_chunks);
-	}
-
+	dmirror_device_remove_chunks(mdevice);
 	cdev_device_del(&mdevice->cdevice, &mdevice->device);
 }
 
--- a/lib/test_hmm_uapi.h~hmm-tests-add-test-for-migrate_device_range
+++ a/lib/test_hmm_uapi.h
@@ -36,6 +36,7 @@ struct hmm_dmirror_cmd {
 #define HMM_DMIRROR_SNAPSHOT		_IOWR('H', 0x04, struct hmm_dmirror_cmd)
 #define HMM_DMIRROR_EXCLUSIVE		_IOWR('H', 0x05, struct hmm_dmirror_cmd)
 #define HMM_DMIRROR_CHECK_EXCLUSIVE	_IOWR('H', 0x06, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_RELEASE		_IOWR('H', 0x07, struct hmm_dmirror_cmd)
 
 /*
  * Values returned in hmm_dmirror_cmd.ptr for HMM_DMIRROR_SNAPSHOT.
--- a/tools/testing/selftests/vm/hmm-tests.c~hmm-tests-add-test-for-migrate_device_range
+++ a/tools/testing/selftests/vm/hmm-tests.c
@@ -1054,6 +1054,55 @@ TEST_F(hmm, migrate_fault)
 	hmm_buffer_free(buffer);
 }
 
+TEST_F(hmm, migrate_release)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int ret;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	buffer->ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS, buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Initialize buffer in system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Migrate memory to device. */
+	ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	/* Release device memory. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_RELEASE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+
+	/* Fault pages back to system memory and check them. */
+	for (i = 0, ptr = buffer->ptr; i < size / (2 * sizeof(*ptr)); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	hmm_buffer_free(buffer);
+}
+
 /*
  * Migrate anonymous shared memory to device private memory.
  */
_


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Nouveau] [PATCH v2 7/8] nouveau/dmem: Evict device private memory during release
  2022-09-28 12:01 ` [Nouveau] [PATCH v2 7/8] nouveau/dmem: Evict device private memory during release Alistair Popple
@ 2022-09-28 21:37   ` Lyude Paul
  0 siblings, 0 replies; 18+ messages in thread
From: Lyude Paul @ 2022-09-28 21:37 UTC (permalink / raw)
  To: Alistair Popple, Andrew Morton, linux-mm
  Cc: Ralph Campbell, nouveau, linux-kernel, dri-devel, amd-gfx, Ben Skeggs

Reviewed-by: Lyude Paul <lyude@redhat.com>

On Wed, 2022-09-28 at 22:01 +1000, Alistair Popple wrote:
> When the module is unloaded or a GPU is unbound from the module it is
> possible for device private pages to still be mapped in currently
> running processes. This can lead to a hangs and RCU stall warnings when
> unbinding the device as memunmap_pages() will wait in an uninterruptible
> state until all device pages have been freed which may never happen.
> 
> Fix this by migrating device mappings back to normal CPU memory prior to
> freeing the GPU memory chunks and associated device private pages.
> 
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Ben Skeggs <bskeggs@redhat.com>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> ---
>  drivers/gpu/drm/nouveau/nouveau_dmem.c | 48 +++++++++++++++++++++++++++-
>  1 file changed, 48 insertions(+)
> 
> diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> index 65f51fb..5fe2091 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> @@ -367,6 +367,52 @@ nouveau_dmem_suspend(struct nouveau_drm *drm)
>  	mutex_unlock(&drm->dmem->mutex);
>  }
>  
> +/*
> + * Evict all pages mapping a chunk.
> + */
> +static void
> +nouveau_dmem_evict_chunk(struct nouveau_dmem_chunk *chunk)
> +{
> +	unsigned long i, npages = range_len(&chunk->pagemap.range) >> PAGE_SHIFT;
> +	unsigned long *src_pfns, *dst_pfns;
> +	dma_addr_t *dma_addrs;
> +	struct nouveau_fence *fence;
> +
> +	src_pfns = kcalloc(npages, sizeof(*src_pfns), GFP_KERNEL);
> +	dst_pfns = kcalloc(npages, sizeof(*dst_pfns), GFP_KERNEL);
> +	dma_addrs = kcalloc(npages, sizeof(*dma_addrs), GFP_KERNEL);
> +
> +	migrate_device_range(src_pfns, chunk->pagemap.range.start >> PAGE_SHIFT,
> +			npages);
> +
> +	for (i = 0; i < npages; i++) {
> +		if (src_pfns[i] & MIGRATE_PFN_MIGRATE) {
> +			struct page *dpage;
> +
> +			/*
> +			 * _GFP_NOFAIL because the GPU is going away and there
> +			 * is nothing sensible we can do if we can't copy the
> +			 * data back.
> +			 */
> +			dpage = alloc_page(GFP_HIGHUSER | __GFP_NOFAIL);
> +			dst_pfns[i] = migrate_pfn(page_to_pfn(dpage));
> +			nouveau_dmem_copy_one(chunk->drm,
> +					migrate_pfn_to_page(src_pfns[i]), dpage,
> +					&dma_addrs[i]);
> +		}
> +	}
> +
> +	nouveau_fence_new(chunk->drm->dmem->migrate.chan, false, &fence);
> +	migrate_device_pages(src_pfns, dst_pfns, npages);
> +	nouveau_dmem_fence_done(&fence);
> +	migrate_device_finalize(src_pfns, dst_pfns, npages);
> +	kfree(src_pfns);
> +	kfree(dst_pfns);
> +	for (i = 0; i < npages; i++)
> +		dma_unmap_page(chunk->drm->dev->dev, dma_addrs[i], PAGE_SIZE, DMA_BIDIRECTIONAL);
> +	kfree(dma_addrs);
> +}
> +
>  void
>  nouveau_dmem_fini(struct nouveau_drm *drm)
>  {
> @@ -378,8 +424,10 @@ nouveau_dmem_fini(struct nouveau_drm *drm)
>  	mutex_lock(&drm->dmem->mutex);
>  
>  	list_for_each_entry_safe(chunk, tmp, &drm->dmem->chunks, list) {
> +		nouveau_dmem_evict_chunk(chunk);
>  		nouveau_bo_unpin(chunk->bo);
>  		nouveau_bo_ref(NULL, &chunk->bo);
> +		WARN_ON(chunk->callocated);
>  		list_del(&chunk->list);
>  		memunmap_pages(&chunk->pagemap);
>  		release_mem_region(chunk->pagemap.range.start,

-- 
Cheers,
 Lyude Paul (she/her)
 Software Engineer at Red Hat


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Nouveau] [PATCH v2 8/8] hmm-tests: Add test for migrate_device_range()
  2022-09-28 15:10   ` Andrew Morton
@ 2022-09-29 11:00     ` Alistair Popple
  0 siblings, 0 replies; 18+ messages in thread
From: Alistair Popple @ 2022-09-29 11:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alex Sierra, Ralph Campbell, nouveau, Felix Kuehling,
	linux-kernel, dri-devel, linux-mm, amd-gfx, Jason Gunthorpe


Andrew Morton <akpm@linux-foundation.org> writes:

> On Wed, 28 Sep 2022 22:01:22 +1000 Alistair Popple <apopple@nvidia.com> wrote:
>
>> @@ -1401,22 +1494,7 @@ static int dmirror_device_init(struct dmirror_device *mdevice, int id)
>>
>>  static void dmirror_device_remove(struct dmirror_device *mdevice)
>>  {
>> -	unsigned int i;
>> -
>> -	if (mdevice->devmem_chunks) {
>> -		for (i = 0; i < mdevice->devmem_count; i++) {
>> -			struct dmirror_chunk *devmem =
>> -				mdevice->devmem_chunks[i];
>> -
>> -			memunmap_pages(&devmem->pagemap);
>> -			if (devmem->pagemap.type == MEMORY_DEVICE_PRIVATE)
>> -				release_mem_region(devmem->pagemap.range.start,
>> -						   range_len(&devmem->pagemap.range));
>> -			kfree(devmem);
>> -		}
>> -		kfree(mdevice->devmem_chunks);
>> -	}
>> -
>> +	dmirror_device_remove_chunks(mdevice);
>>  	cdev_del(&mdevice->cdevice);
>>  }
>
> Needed a bit or rework due to
> https://lkml.kernel.org/r/20220826050631.25771-1-mpenttil@redhat.com.
> Please check my resolution.

Thanks. Rework looks good to me.

> --- a/lib/test_hmm.c~hmm-tests-add-test-for-migrate_device_range
> +++ a/lib/test_hmm.c
> @@ -100,6 +100,7 @@ struct dmirror {
>  struct dmirror_chunk {
>  	struct dev_pagemap	pagemap;
>  	struct dmirror_device	*mdevice;
> +	bool remove;
>  };
>
>  /*
> @@ -192,11 +193,15 @@ static int dmirror_fops_release(struct i
>  	return 0;
>  }
>
> +static struct dmirror_chunk *dmirror_page_to_chunk(struct page *page)
> +{
> +	return container_of(page->pgmap, struct dmirror_chunk, pagemap);
> +}
> +
>  static struct dmirror_device *dmirror_page_to_device(struct page *page)
>
>  {
> -	return container_of(page->pgmap, struct dmirror_chunk,
> -			    pagemap)->mdevice;
> +	return dmirror_page_to_chunk(page)->mdevice;
>  }
>
>  static int dmirror_do_fault(struct dmirror *dmirror, struct hmm_range *range)
> @@ -1218,6 +1223,85 @@ static int dmirror_snapshot(struct dmirr
>  	return ret;
>  }
>
> +static void dmirror_device_evict_chunk(struct dmirror_chunk *chunk)
> +{
> +	unsigned long start_pfn = chunk->pagemap.range.start >> PAGE_SHIFT;
> +	unsigned long end_pfn = chunk->pagemap.range.end >> PAGE_SHIFT;
> +	unsigned long npages = end_pfn - start_pfn + 1;
> +	unsigned long i;
> +	unsigned long *src_pfns;
> +	unsigned long *dst_pfns;
> +
> +	src_pfns = kcalloc(npages, sizeof(*src_pfns), GFP_KERNEL);
> +	dst_pfns = kcalloc(npages, sizeof(*dst_pfns), GFP_KERNEL);
> +
> +	migrate_device_range(src_pfns, start_pfn, npages);
> +	for (i = 0; i < npages; i++) {
> +		struct page *dpage, *spage;
> +
> +		spage = migrate_pfn_to_page(src_pfns[i]);
> +		if (!spage || !(src_pfns[i] & MIGRATE_PFN_MIGRATE))
> +			continue;
> +
> +		if (WARN_ON(!is_device_private_page(spage) &&
> +			    !is_device_coherent_page(spage)))
> +			continue;
> +		spage = BACKING_PAGE(spage);
> +		dpage = alloc_page(GFP_HIGHUSER_MOVABLE | __GFP_NOFAIL);
> +		lock_page(dpage);
> +		copy_highpage(dpage, spage);
> +		dst_pfns[i] = migrate_pfn(page_to_pfn(dpage));
> +		if (src_pfns[i] & MIGRATE_PFN_WRITE)
> +			dst_pfns[i] |= MIGRATE_PFN_WRITE;
> +	}
> +	migrate_device_pages(src_pfns, dst_pfns, npages);
> +	migrate_device_finalize(src_pfns, dst_pfns, npages);
> +	kfree(src_pfns);
> +	kfree(dst_pfns);
> +}
> +
> +/* Removes free pages from the free list so they can't be re-allocated */
> +static void dmirror_remove_free_pages(struct dmirror_chunk *devmem)
> +{
> +	struct dmirror_device *mdevice = devmem->mdevice;
> +	struct page *page;
> +
> +	for (page = mdevice->free_pages; page; page = page->zone_device_data)
> +		if (dmirror_page_to_chunk(page) == devmem)
> +			mdevice->free_pages = page->zone_device_data;
> +}
> +
> +static void dmirror_device_remove_chunks(struct dmirror_device *mdevice)
> +{
> +	unsigned int i;
> +
> +	mutex_lock(&mdevice->devmem_lock);
> +	if (mdevice->devmem_chunks) {
> +		for (i = 0; i < mdevice->devmem_count; i++) {
> +			struct dmirror_chunk *devmem =
> +				mdevice->devmem_chunks[i];
> +
> +			spin_lock(&mdevice->lock);
> +			devmem->remove = true;
> +			dmirror_remove_free_pages(devmem);
> +			spin_unlock(&mdevice->lock);
> +
> +			dmirror_device_evict_chunk(devmem);
> +			memunmap_pages(&devmem->pagemap);
> +			if (devmem->pagemap.type == MEMORY_DEVICE_PRIVATE)
> +				release_mem_region(devmem->pagemap.range.start,
> +						   range_len(&devmem->pagemap.range));
> +			kfree(devmem);
> +		}
> +		mdevice->devmem_count = 0;
> +		mdevice->devmem_capacity = 0;
> +		mdevice->free_pages = NULL;
> +		kfree(mdevice->devmem_chunks);
> +		mdevice->devmem_chunks = NULL;
> +	}
> +	mutex_unlock(&mdevice->devmem_lock);
> +}
> +
>  static long dmirror_fops_unlocked_ioctl(struct file *filp,
>  					unsigned int command,
>  					unsigned long arg)
> @@ -1272,6 +1356,11 @@ static long dmirror_fops_unlocked_ioctl(
>  		ret = dmirror_snapshot(dmirror, &cmd);
>  		break;
>
> +	case HMM_DMIRROR_RELEASE:
> +		dmirror_device_remove_chunks(dmirror->mdevice);
> +		ret = 0;
> +		break;
> +
>  	default:
>  		return -EINVAL;
>  	}
> @@ -1326,9 +1415,13 @@ static void dmirror_devmem_free(struct p
>
>  	mdevice = dmirror_page_to_device(page);
>  	spin_lock(&mdevice->lock);
> -	mdevice->cfree++;
> -	page->zone_device_data = mdevice->free_pages;
> -	mdevice->free_pages = page;
> +
> +	/* Return page to our allocator if not freeing the chunk */
> +	if (!dmirror_page_to_chunk(page)->remove) {
> +		mdevice->cfree++;
> +		page->zone_device_data = mdevice->free_pages;
> +		mdevice->free_pages = page;
> +	}
>  	spin_unlock(&mdevice->lock);
>  }
>
> @@ -1408,22 +1501,7 @@ static int dmirror_device_init(struct dm
>
>  static void dmirror_device_remove(struct dmirror_device *mdevice)
>  {
> -	unsigned int i;
> -
> -	if (mdevice->devmem_chunks) {
> -		for (i = 0; i < mdevice->devmem_count; i++) {
> -			struct dmirror_chunk *devmem =
> -				mdevice->devmem_chunks[i];
> -
> -			memunmap_pages(&devmem->pagemap);
> -			if (devmem->pagemap.type == MEMORY_DEVICE_PRIVATE)
> -				release_mem_region(devmem->pagemap.range.start,
> -						   range_len(&devmem->pagemap.range));
> -			kfree(devmem);
> -		}
> -		kfree(mdevice->devmem_chunks);
> -	}
> -
> +	dmirror_device_remove_chunks(mdevice);
>  	cdev_device_del(&mdevice->cdevice, &mdevice->device);
>  }
>
> --- a/lib/test_hmm_uapi.h~hmm-tests-add-test-for-migrate_device_range
> +++ a/lib/test_hmm_uapi.h
> @@ -36,6 +36,7 @@ struct hmm_dmirror_cmd {
>  #define HMM_DMIRROR_SNAPSHOT		_IOWR('H', 0x04, struct hmm_dmirror_cmd)
>  #define HMM_DMIRROR_EXCLUSIVE		_IOWR('H', 0x05, struct hmm_dmirror_cmd)
>  #define HMM_DMIRROR_CHECK_EXCLUSIVE	_IOWR('H', 0x06, struct hmm_dmirror_cmd)
> +#define HMM_DMIRROR_RELEASE		_IOWR('H', 0x07, struct hmm_dmirror_cmd)
>
>  /*
>   * Values returned in hmm_dmirror_cmd.ptr for HMM_DMIRROR_SNAPSHOT.
> --- a/tools/testing/selftests/vm/hmm-tests.c~hmm-tests-add-test-for-migrate_device_range
> +++ a/tools/testing/selftests/vm/hmm-tests.c
> @@ -1054,6 +1054,55 @@ TEST_F(hmm, migrate_fault)
>  	hmm_buffer_free(buffer);
>  }
>
> +TEST_F(hmm, migrate_release)
> +{
> +	struct hmm_buffer *buffer;
> +	unsigned long npages;
> +	unsigned long size;
> +	unsigned long i;
> +	int *ptr;
> +	int ret;
> +
> +	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
> +	ASSERT_NE(npages, 0);
> +	size = npages << self->page_shift;
> +
> +	buffer = malloc(sizeof(*buffer));
> +	ASSERT_NE(buffer, NULL);
> +
> +	buffer->fd = -1;
> +	buffer->size = size;
> +	buffer->mirror = malloc(size);
> +	ASSERT_NE(buffer->mirror, NULL);
> +
> +	buffer->ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
> +			   MAP_PRIVATE | MAP_ANONYMOUS, buffer->fd, 0);
> +	ASSERT_NE(buffer->ptr, MAP_FAILED);
> +
> +	/* Initialize buffer in system memory. */
> +	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
> +		ptr[i] = i;
> +
> +	/* Migrate memory to device. */
> +	ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
> +	ASSERT_EQ(ret, 0);
> +	ASSERT_EQ(buffer->cpages, npages);
> +
> +	/* Check what the device read. */
> +	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
> +		ASSERT_EQ(ptr[i], i);
> +
> +	/* Release device memory. */
> +	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_RELEASE, buffer, npages);
> +	ASSERT_EQ(ret, 0);
> +
> +	/* Fault pages back to system memory and check them. */
> +	for (i = 0, ptr = buffer->ptr; i < size / (2 * sizeof(*ptr)); ++i)
> +		ASSERT_EQ(ptr[i], i);
> +
> +	hmm_buffer_free(buffer);
> +}
> +
>  /*
>   * Migrate anonymous shared memory to device private memory.
>   */
> _

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Nouveau] [PATCH v2 1/8] mm/memory.c: Fix race when faulting a device private page
  2022-09-28 12:01 ` [Nouveau] [PATCH v2 1/8] mm/memory.c: Fix race when faulting a device private page Alistair Popple
@ 2022-09-29 18:30   ` Felix Kuehling
  2022-10-03  0:53     ` Alistair Popple
  0 siblings, 1 reply; 18+ messages in thread
From: Felix Kuehling @ 2022-09-29 18:30 UTC (permalink / raw)
  To: Alistair Popple, Andrew Morton, linux-mm
  Cc: Ralph Campbell, Michael Ellerman, nouveau, linux-kernel,
	dri-devel, amd-gfx, Jason Gunthorpe

On 2022-09-28 08:01, Alistair Popple wrote:
> When the CPU tries to access a device private page the migrate_to_ram()
> callback associated with the pgmap for the page is called. However no
> reference is taken on the faulting page. Therefore a concurrent
> migration of the device private page can free the page and possibly the
> underlying pgmap. This results in a race which can crash the kernel due
> to the migrate_to_ram() function pointer becoming invalid. It also means
> drivers can't reliably read the zone_device_data field because the page
> may have been freed with memunmap_pages().
>
> Close the race by getting a reference on the page while holding the ptl
> to ensure it has not been freed. Unfortunately the elevated reference
> count will cause the migration required to handle the fault to fail. To
> avoid this failure pass the faulting page into the migrate_vma functions
> so that if an elevated reference count is found it can be checked to see
> if it's expected or not.

Do we really have to drag the fault_page all the way into the migrate 
structure? Is the elevated refcount still needed at that time? Maybe it 
would be easier to drop the refcount early in the ops->migrage_to_ram 
callbacks, so we won't have to deal with it in all the migration code.

Regards,
   Felix


>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> Cc: Jason Gunthorpe <jgg@nvidia.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
> Cc: Lyude Paul <lyude@redhat.com>
> ---
>   arch/powerpc/kvm/book3s_hv_uvmem.c       | 15 ++++++-----
>   drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 17 +++++++------
>   drivers/gpu/drm/amd/amdkfd/kfd_migrate.h |  2 +-
>   drivers/gpu/drm/amd/amdkfd/kfd_svm.c     | 11 +++++---
>   include/linux/migrate.h                  |  8 ++++++-
>   lib/test_hmm.c                           |  7 ++---
>   mm/memory.c                              | 16 +++++++++++-
>   mm/migrate.c                             | 34 ++++++++++++++-----------
>   mm/migrate_device.c                      | 18 +++++++++----
>   9 files changed, 87 insertions(+), 41 deletions(-)
>
> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
> index 5980063..d4eacf4 100644
> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> @@ -508,10 +508,10 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
>   static int __kvmppc_svm_page_out(struct vm_area_struct *vma,
>   		unsigned long start,
>   		unsigned long end, unsigned long page_shift,
> -		struct kvm *kvm, unsigned long gpa)
> +		struct kvm *kvm, unsigned long gpa, struct page *fault_page)
>   {
>   	unsigned long src_pfn, dst_pfn = 0;
> -	struct migrate_vma mig;
> +	struct migrate_vma mig = { 0 };
>   	struct page *dpage, *spage;
>   	struct kvmppc_uvmem_page_pvt *pvt;
>   	unsigned long pfn;
> @@ -525,6 +525,7 @@ static int __kvmppc_svm_page_out(struct vm_area_struct *vma,
>   	mig.dst = &dst_pfn;
>   	mig.pgmap_owner = &kvmppc_uvmem_pgmap;
>   	mig.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
> +	mig.fault_page = fault_page;
>   
>   	/* The requested page is already paged-out, nothing to do */
>   	if (!kvmppc_gfn_is_uvmem_pfn(gpa >> page_shift, kvm, NULL))
> @@ -580,12 +581,14 @@ static int __kvmppc_svm_page_out(struct vm_area_struct *vma,
>   static inline int kvmppc_svm_page_out(struct vm_area_struct *vma,
>   				      unsigned long start, unsigned long end,
>   				      unsigned long page_shift,
> -				      struct kvm *kvm, unsigned long gpa)
> +				      struct kvm *kvm, unsigned long gpa,
> +				      struct page *fault_page)
>   {
>   	int ret;
>   
>   	mutex_lock(&kvm->arch.uvmem_lock);
> -	ret = __kvmppc_svm_page_out(vma, start, end, page_shift, kvm, gpa);
> +	ret = __kvmppc_svm_page_out(vma, start, end, page_shift, kvm, gpa,
> +				fault_page);
>   	mutex_unlock(&kvm->arch.uvmem_lock);
>   
>   	return ret;
> @@ -736,7 +739,7 @@ static int kvmppc_svm_page_in(struct vm_area_struct *vma,
>   		bool pagein)
>   {
>   	unsigned long src_pfn, dst_pfn = 0;
> -	struct migrate_vma mig;
> +	struct migrate_vma mig = { 0 };
>   	struct page *spage;
>   	unsigned long pfn;
>   	struct page *dpage;
> @@ -994,7 +997,7 @@ static vm_fault_t kvmppc_uvmem_migrate_to_ram(struct vm_fault *vmf)
>   
>   	if (kvmppc_svm_page_out(vmf->vma, vmf->address,
>   				vmf->address + PAGE_SIZE, PAGE_SHIFT,
> -				pvt->kvm, pvt->gpa))
> +				pvt->kvm, pvt->gpa, vmf->page))
>   		return VM_FAULT_SIGBUS;
>   	else
>   		return 0;
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> index b059a77..776448b 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> @@ -409,7 +409,7 @@ svm_migrate_vma_to_vram(struct amdgpu_device *adev, struct svm_range *prange,
>   	uint64_t npages = (end - start) >> PAGE_SHIFT;
>   	struct kfd_process_device *pdd;
>   	struct dma_fence *mfence = NULL;
> -	struct migrate_vma migrate;
> +	struct migrate_vma migrate = { 0 };
>   	unsigned long cpages = 0;
>   	dma_addr_t *scratch;
>   	void *buf;
> @@ -668,7 +668,7 @@ svm_migrate_copy_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
>   static long
>   svm_migrate_vma_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
>   		       struct vm_area_struct *vma, uint64_t start, uint64_t end,
> -		       uint32_t trigger)
> +		       uint32_t trigger, struct page *fault_page)
>   {
>   	struct kfd_process *p = container_of(prange->svms, struct kfd_process, svms);
>   	uint64_t npages = (end - start) >> PAGE_SHIFT;
> @@ -676,7 +676,7 @@ svm_migrate_vma_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
>   	unsigned long cpages = 0;
>   	struct kfd_process_device *pdd;
>   	struct dma_fence *mfence = NULL;
> -	struct migrate_vma migrate;
> +	struct migrate_vma migrate = { 0 };
>   	dma_addr_t *scratch;
>   	void *buf;
>   	int r = -ENOMEM;
> @@ -699,6 +699,7 @@ svm_migrate_vma_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
>   
>   	migrate.src = buf;
>   	migrate.dst = migrate.src + npages;
> +	migrate.fault_page = fault_page;
>   	scratch = (dma_addr_t *)(migrate.dst + npages);
>   
>   	kfd_smi_event_migration_start(adev->kfd.dev, p->lead_thread->pid,
> @@ -766,7 +767,7 @@ svm_migrate_vma_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
>    * 0 - OK, otherwise error code
>    */
>   int svm_migrate_vram_to_ram(struct svm_range *prange, struct mm_struct *mm,
> -			    uint32_t trigger)
> +			    uint32_t trigger, struct page *fault_page)
>   {
>   	struct amdgpu_device *adev;
>   	struct vm_area_struct *vma;
> @@ -807,7 +808,8 @@ int svm_migrate_vram_to_ram(struct svm_range *prange, struct mm_struct *mm,
>   		}
>   
>   		next = min(vma->vm_end, end);
> -		r = svm_migrate_vma_to_ram(adev, prange, vma, addr, next, trigger);
> +		r = svm_migrate_vma_to_ram(adev, prange, vma, addr, next, trigger,
> +			fault_page);
>   		if (r < 0) {
>   			pr_debug("failed %ld to migrate prange %p\n", r, prange);
>   			break;
> @@ -851,7 +853,7 @@ svm_migrate_vram_to_vram(struct svm_range *prange, uint32_t best_loc,
>   	pr_debug("from gpu 0x%x to gpu 0x%x\n", prange->actual_loc, best_loc);
>   
>   	do {
> -		r = svm_migrate_vram_to_ram(prange, mm, trigger);
> +		r = svm_migrate_vram_to_ram(prange, mm, trigger, NULL);
>   		if (r)
>   			return r;
>   	} while (prange->actual_loc && --retries);
> @@ -938,7 +940,8 @@ static vm_fault_t svm_migrate_to_ram(struct vm_fault *vmf)
>   		goto out_unlock_prange;
>   	}
>   
> -	r = svm_migrate_vram_to_ram(prange, mm, KFD_MIGRATE_TRIGGER_PAGEFAULT_CPU);
> +	r = svm_migrate_vram_to_ram(prange, mm, KFD_MIGRATE_TRIGGER_PAGEFAULT_CPU,
> +				vmf->page);
>   	if (r)
>   		pr_debug("failed %d migrate 0x%p [0x%lx 0x%lx] to ram\n", r,
>   			 prange, prange->start, prange->last);
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
> index b3f0754..a5d7e6d 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
> @@ -43,7 +43,7 @@ enum MIGRATION_COPY_DIR {
>   int svm_migrate_to_vram(struct svm_range *prange,  uint32_t best_loc,
>   			struct mm_struct *mm, uint32_t trigger);
>   int svm_migrate_vram_to_ram(struct svm_range *prange, struct mm_struct *mm,
> -			    uint32_t trigger);
> +			    uint32_t trigger, struct page *fault_page);
>   unsigned long
>   svm_migrate_addr_to_pfn(struct amdgpu_device *adev, unsigned long addr);
>   
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> index 11074cc..9139e5a 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> @@ -2913,13 +2913,15 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid,
>   				 */
>   				if (prange->actual_loc)
>   					r = svm_migrate_vram_to_ram(prange, mm,
> -					   KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU);
> +					   KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU,
> +					   NULL);
>   				else
>   					r = 0;
>   			}
>   		} else {
>   			r = svm_migrate_vram_to_ram(prange, mm,
> -					KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU);
> +					KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU,
> +					NULL);
>   		}
>   		if (r) {
>   			pr_debug("failed %d to migrate svms %p [0x%lx 0x%lx]\n",
> @@ -3242,7 +3244,8 @@ svm_range_trigger_migration(struct mm_struct *mm, struct svm_range *prange,
>   		return 0;
>   
>   	if (!best_loc) {
> -		r = svm_migrate_vram_to_ram(prange, mm, KFD_MIGRATE_TRIGGER_PREFETCH);
> +		r = svm_migrate_vram_to_ram(prange, mm,
> +					KFD_MIGRATE_TRIGGER_PREFETCH, NULL);
>   		*migrated = !r;
>   		return r;
>   	}
> @@ -3303,7 +3306,7 @@ static void svm_range_evict_svm_bo_worker(struct work_struct *work)
>   		mutex_lock(&prange->migrate_mutex);
>   		do {
>   			r = svm_migrate_vram_to_ram(prange, mm,
> -						KFD_MIGRATE_TRIGGER_TTM_EVICTION);
> +					KFD_MIGRATE_TRIGGER_TTM_EVICTION, NULL);
>   		} while (!r && prange->actual_loc && --retries);
>   
>   		if (!r && prange->actual_loc)
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 22c0a0c..82ffa47 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -62,6 +62,8 @@ extern const char *migrate_reason_names[MR_TYPES];
>   #ifdef CONFIG_MIGRATION
>   
>   extern void putback_movable_pages(struct list_head *l);
> +int migrate_folio_extra(struct address_space *mapping, struct folio *dst,
> +		struct folio *src, enum migrate_mode mode, int extra_count);
>   int migrate_folio(struct address_space *mapping, struct folio *dst,
>   		struct folio *src, enum migrate_mode mode);
>   extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free,
> @@ -212,6 +214,12 @@ struct migrate_vma {
>   	 */
>   	void			*pgmap_owner;
>   	unsigned long		flags;
> +
> +	/*
> +	 * Set to vmf->page if this is being called to migrate a page as part of
> +	 * a migrate_to_ram() callback.
> +	 */
> +	struct page		*fault_page;
>   };
>   
>   int migrate_vma_setup(struct migrate_vma *args);
> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
> index e3965ca..89463ff 100644
> --- a/lib/test_hmm.c
> +++ b/lib/test_hmm.c
> @@ -907,7 +907,7 @@ static int dmirror_migrate_to_system(struct dmirror *dmirror,
>   	struct vm_area_struct *vma;
>   	unsigned long src_pfns[64] = { 0 };
>   	unsigned long dst_pfns[64] = { 0 };
> -	struct migrate_vma args;
> +	struct migrate_vma args = { 0 };
>   	unsigned long next;
>   	int ret;
>   
> @@ -968,7 +968,7 @@ static int dmirror_migrate_to_device(struct dmirror *dmirror,
>   	unsigned long src_pfns[64] = { 0 };
>   	unsigned long dst_pfns[64] = { 0 };
>   	struct dmirror_bounce bounce;
> -	struct migrate_vma args;
> +	struct migrate_vma args = { 0 };
>   	unsigned long next;
>   	int ret;
>   
> @@ -1334,7 +1334,7 @@ static void dmirror_devmem_free(struct page *page)
>   
>   static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>   {
> -	struct migrate_vma args;
> +	struct migrate_vma args = { 0 };
>   	unsigned long src_pfns = 0;
>   	unsigned long dst_pfns = 0;
>   	struct page *rpage;
> @@ -1357,6 +1357,7 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>   	args.dst = &dst_pfns;
>   	args.pgmap_owner = dmirror->mdevice;
>   	args.flags = dmirror_select_device(dmirror);
> +	args.fault_page = vmf->page;
>   
>   	if (migrate_vma_setup(&args))
>   		return VM_FAULT_SIGBUS;
> diff --git a/mm/memory.c b/mm/memory.c
> index b994784..65d3977 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3742,7 +3742,21 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   			ret = remove_device_exclusive_entry(vmf);
>   		} else if (is_device_private_entry(entry)) {
>   			vmf->page = pfn_swap_entry_to_page(entry);
> -			ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
> +			vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> +					vmf->address, &vmf->ptl);
> +			if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
> +				spin_unlock(vmf->ptl);
> +				goto out;
> +			}
> +
> +			/*
> +			 * Get a page reference while we know the page can't be
> +			 * freed.
> +			 */
> +			get_page(vmf->page);
> +			pte_unmap_unlock(vmf->pte, vmf->ptl);
> +			vmf->page->pgmap->ops->migrate_to_ram(vmf);
> +			put_page(vmf->page);
>   		} else if (is_hwpoison_entry(entry)) {
>   			ret = VM_FAULT_HWPOISON;
>   		} else if (is_swapin_error_entry(entry)) {
> diff --git a/mm/migrate.c b/mm/migrate.c
> index ce6a58f..e3f78a7 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -620,6 +620,25 @@ EXPORT_SYMBOL(folio_migrate_copy);
>    *                    Migration functions
>    ***********************************************************/
>   
> +int migrate_folio_extra(struct address_space *mapping, struct folio *dst,
> +		struct folio *src, enum migrate_mode mode, int extra_count)
> +{
> +	int rc;
> +
> +	BUG_ON(folio_test_writeback(src));	/* Writeback must be complete */
> +
> +	rc = folio_migrate_mapping(mapping, dst, src, extra_count);
> +
> +	if (rc != MIGRATEPAGE_SUCCESS)
> +		return rc;
> +
> +	if (mode != MIGRATE_SYNC_NO_COPY)
> +		folio_migrate_copy(dst, src);
> +	else
> +		folio_migrate_flags(dst, src);
> +	return MIGRATEPAGE_SUCCESS;
> +}
> +
>   /**
>    * migrate_folio() - Simple folio migration.
>    * @mapping: The address_space containing the folio.
> @@ -635,20 +654,7 @@ EXPORT_SYMBOL(folio_migrate_copy);
>   int migrate_folio(struct address_space *mapping, struct folio *dst,
>   		struct folio *src, enum migrate_mode mode)
>   {
> -	int rc;
> -
> -	BUG_ON(folio_test_writeback(src));	/* Writeback must be complete */
> -
> -	rc = folio_migrate_mapping(mapping, dst, src, 0);
> -
> -	if (rc != MIGRATEPAGE_SUCCESS)
> -		return rc;
> -
> -	if (mode != MIGRATE_SYNC_NO_COPY)
> -		folio_migrate_copy(dst, src);
> -	else
> -		folio_migrate_flags(dst, src);
> -	return MIGRATEPAGE_SUCCESS;
> +	return migrate_folio_extra(mapping, dst, src, mode, 0);
>   }
>   EXPORT_SYMBOL(migrate_folio);
>   
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 7235424..f756c00 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -313,14 +313,14 @@ static void migrate_vma_collect(struct migrate_vma *migrate)
>    * folio_migrate_mapping(), except that here we allow migration of a
>    * ZONE_DEVICE page.
>    */
> -static bool migrate_vma_check_page(struct page *page)
> +static bool migrate_vma_check_page(struct page *page, struct page *fault_page)
>   {
>   	/*
>   	 * One extra ref because caller holds an extra reference, either from
>   	 * isolate_lru_page() for a regular page, or migrate_vma_collect() for
>   	 * a device page.
>   	 */
> -	int extra = 1;
> +	int extra = 1 + (page == fault_page);
>   
>   	/*
>   	 * FIXME support THP (transparent huge page), it is bit more complex to
> @@ -393,7 +393,8 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
>   		if (folio_mapped(folio))
>   			try_to_migrate(folio, 0);
>   
> -		if (page_mapped(page) || !migrate_vma_check_page(page)) {
> +		if (page_mapped(page) ||
> +		    !migrate_vma_check_page(page, migrate->fault_page)) {
>   			if (!is_zone_device_page(page)) {
>   				get_page(page);
>   				putback_lru_page(page);
> @@ -505,6 +506,8 @@ int migrate_vma_setup(struct migrate_vma *args)
>   		return -EINVAL;
>   	if (!args->src || !args->dst)
>   		return -EINVAL;
> +	if (args->fault_page && !is_device_private_page(args->fault_page))
> +		return -EINVAL;
>   
>   	memset(args->src, 0, sizeof(*args->src) * nr_pages);
>   	args->cpages = 0;
> @@ -735,8 +738,13 @@ void migrate_vma_pages(struct migrate_vma *migrate)
>   			continue;
>   		}
>   
> -		r = migrate_folio(mapping, page_folio(newpage),
> -				page_folio(page), MIGRATE_SYNC_NO_COPY);
> +		if (migrate->fault_page == page)
> +			r = migrate_folio_extra(mapping, page_folio(newpage),
> +						page_folio(page),
> +						MIGRATE_SYNC_NO_COPY, 1);
> +		else
> +			r = migrate_folio(mapping, page_folio(newpage),
> +					page_folio(page), MIGRATE_SYNC_NO_COPY);
>   		if (r != MIGRATEPAGE_SUCCESS)
>   			migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
>   	}

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Nouveau] [PATCH v2 2/8] mm: Free device private pages have zero refcount
  2022-09-28 12:01 ` [Nouveau] [PATCH v2 2/8] mm: Free device private pages have zero refcount Alistair Popple
@ 2022-09-29 19:21   ` Felix Kuehling
  0 siblings, 0 replies; 18+ messages in thread
From: Felix Kuehling @ 2022-09-29 19:21 UTC (permalink / raw)
  To: Alistair Popple, Andrew Morton, linux-mm
  Cc: Alex Sierra, Ralph Campbell, Michael Ellerman, linux-kernel,
	dri-devel, amd-gfx, Jason Gunthorpe, nouveau, Alex Deucher,
	Dan Williams, Christian König, Ben Skeggs


On 2022-09-28 08:01, Alistair Popple wrote:
> Since 27674ef6c73f ("mm: remove the extra ZONE_DEVICE struct page
> refcount") device private pages have no longer had an extra reference
> count when the page is in use. However before handing them back to the
> owning device driver we add an extra reference count such that free
> pages have a reference count of one.
>
> This makes it difficult to tell if a page is free or not because both
> free and in use pages will have a non-zero refcount. Instead we should
> return pages to the drivers page allocator with a zero reference count.
> Kernel code can then safely use kernel functions such as
> get_page_unless_zero().
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>

Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>


> Cc: Jason Gunthorpe <jgg@nvidia.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Ben Skeggs <bskeggs@redhat.com>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Alex Sierra <alex.sierra@amd.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
>
> ---
>
> This will conflict with Dan's series to fix reference counts for DAX[1].
> At the moment this only makes changes for device private and coherent
> pages, however if DAX is fixed to remove the extra refcount then we
> should just be able to drop the checks for private/coherent pages and
> treat them the same.
>
> [1] - https://lore.kernel.org/linux-mm/166329930818.2786261.6086109734008025807.stgit@dwillia2-xfh.jf.intel.com/
> ---
>   arch/powerpc/kvm/book3s_hv_uvmem.c       |  2 +-
>   drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  2 +-
>   drivers/gpu/drm/nouveau/nouveau_dmem.c   |  2 +-
>   include/linux/memremap.h                 |  1 +
>   lib/test_hmm.c                           |  2 +-
>   mm/memremap.c                            |  9 +++++++++
>   mm/page_alloc.c                          |  8 ++++++++
>   7 files changed, 22 insertions(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
> index d4eacf4..9d8de68 100644
> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> @@ -718,7 +718,7 @@ static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm)
>   
>   	dpage = pfn_to_page(uvmem_pfn);
>   	dpage->zone_device_data = pvt;
> -	lock_page(dpage);
> +	zone_device_page_init(dpage);
>   	return dpage;
>   out_clear:
>   	spin_lock(&kvmppc_uvmem_bitmap_lock);
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> index 776448b..97a6845 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> @@ -223,7 +223,7 @@ svm_migrate_get_vram_page(struct svm_range *prange, unsigned long pfn)
>   	page = pfn_to_page(pfn);
>   	svm_range_bo_ref(prange->svm_bo);
>   	page->zone_device_data = prange->svm_bo;
> -	lock_page(page);
> +	zone_device_page_init(page);
>   }
>   
>   static void
> diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> index 1635661..b092988 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> @@ -326,7 +326,7 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
>   			return NULL;
>   	}
>   
> -	lock_page(page);
> +	zone_device_page_init(page);
>   	return page;
>   }
>   
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 1901049..f68bf6d 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -182,6 +182,7 @@ static inline bool folio_is_device_coherent(const struct folio *folio)
>   }
>   
>   #ifdef CONFIG_ZONE_DEVICE
> +void zone_device_page_init(struct page *page);
>   void *memremap_pages(struct dev_pagemap *pgmap, int nid);
>   void memunmap_pages(struct dev_pagemap *pgmap);
>   void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
> index 89463ff..688c15d 100644
> --- a/lib/test_hmm.c
> +++ b/lib/test_hmm.c
> @@ -627,8 +627,8 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
>   			goto error;
>   	}
>   
> +	zone_device_page_init(dpage);
>   	dpage->zone_device_data = rpage;
> -	lock_page(dpage);
>   	return dpage;
>   
>   error:
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 25029a4..1c2c038 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -505,8 +505,17 @@ void free_zone_device_page(struct page *page)
>   	/*
>   	 * Reset the page count to 1 to prepare for handing out the page again.
>   	 */
> +	if (page->pgmap->type != MEMORY_DEVICE_PRIVATE &&
> +	    page->pgmap->type != MEMORY_DEVICE_COHERENT)
> +		set_page_count(page, 1);
> +}
> +
> +void zone_device_page_init(struct page *page)
> +{
>   	set_page_count(page, 1);
> +	lock_page(page);
>   }
> +EXPORT_SYMBOL_GPL(zone_device_page_init);
>   
>   #ifdef CONFIG_FS_DAX
>   bool __put_devmap_managed_page_refs(struct page *page, int refs)
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9d49803..4df1e43 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6744,6 +6744,14 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
>   		set_pageblock_migratetype(page, MIGRATE_MOVABLE);
>   		cond_resched();
>   	}
> +
> +	/*
> +	 * ZONE_DEVICE pages are released directly to the driver page allocator
> +	 * which will set the page count to 1 when allocating the page.
> +	 */
> +	if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
> +	    pgmap->type == MEMORY_DEVICE_COHERENT)
> +		set_page_count(page, 0);
>   }
>   
>   /*

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Nouveau] [PATCH v2 1/8] mm/memory.c: Fix race when faulting a device private page
  2022-09-29 18:30   ` Felix Kuehling
@ 2022-10-03  0:53     ` Alistair Popple
  2022-10-03 17:34       ` Felix Kuehling
  0 siblings, 1 reply; 18+ messages in thread
From: Alistair Popple @ 2022-10-03  0:53 UTC (permalink / raw)
  To: Felix Kuehling
  Cc: Ralph Campbell, Michael Ellerman, nouveau, linux-kernel,
	dri-devel, linux-mm, amd-gfx, Jason Gunthorpe, Andrew Morton


Felix Kuehling <felix.kuehling@amd.com> writes:

> On 2022-09-28 08:01, Alistair Popple wrote:
>> When the CPU tries to access a device private page the migrate_to_ram()
>> callback associated with the pgmap for the page is called. However no
>> reference is taken on the faulting page. Therefore a concurrent
>> migration of the device private page can free the page and possibly the
>> underlying pgmap. This results in a race which can crash the kernel due
>> to the migrate_to_ram() function pointer becoming invalid. It also means
>> drivers can't reliably read the zone_device_data field because the page
>> may have been freed with memunmap_pages().
>>
>> Close the race by getting a reference on the page while holding the ptl
>> to ensure it has not been freed. Unfortunately the elevated reference
>> count will cause the migration required to handle the fault to fail. To
>> avoid this failure pass the faulting page into the migrate_vma functions
>> so that if an elevated reference count is found it can be checked to see
>> if it's expected or not.
>
> Do we really have to drag the fault_page all the way into the migrate structure?
> Is the elevated refcount still needed at that time? Maybe it would be easier to
> drop the refcount early in the ops->migrage_to_ram callbacks, so we won't have
> to deal with it in all the migration code.

That would also work. Honestly I don't really like either solution :-)

I didn't like having to plumb it all through the migration code
but I ended up going this way because I felt it was easier to explain
the life time of vmf->page for the migrate_to_ram() callback. This way
vmf->page is guaranteed to be valid for the duration of the
migrate_to_ram() callbak.

As you suggest we could instead have drivers call put_page(vmf->page)
somewhere in migrate_to_ram() before calling migrate_vma_setup(). The
reason I didn't go this way is IMHO it's more subtle because in general
the page will remain valid after that put_page() anyway. So it would be
easy for drivers to introduce a bug assuming the vmf->page is still
valid and not notice because most of the time it is.

Let me know if you disagree with my reasoning though - would appreciate
any review here.

> Regards,
>   Felix
>
>
>>
>> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>> Cc: Jason Gunthorpe <jgg@nvidia.com>
>> Cc: John Hubbard <jhubbard@nvidia.com>
>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>> Cc: Michael Ellerman <mpe@ellerman.id.au>
>> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
>> Cc: Lyude Paul <lyude@redhat.com>
>> ---
>>   arch/powerpc/kvm/book3s_hv_uvmem.c       | 15 ++++++-----
>>   drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 17 +++++++------
>>   drivers/gpu/drm/amd/amdkfd/kfd_migrate.h |  2 +-
>>   drivers/gpu/drm/amd/amdkfd/kfd_svm.c     | 11 +++++---
>>   include/linux/migrate.h                  |  8 ++++++-
>>   lib/test_hmm.c                           |  7 ++---
>>   mm/memory.c                              | 16 +++++++++++-
>>   mm/migrate.c                             | 34 ++++++++++++++-----------
>>   mm/migrate_device.c                      | 18 +++++++++----
>>   9 files changed, 87 insertions(+), 41 deletions(-)
>>
>> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
>> index 5980063..d4eacf4 100644
>> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
>> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
>> @@ -508,10 +508,10 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
>>   static int __kvmppc_svm_page_out(struct vm_area_struct *vma,
>>   		unsigned long start,
>>   		unsigned long end, unsigned long page_shift,
>> -		struct kvm *kvm, unsigned long gpa)
>> +		struct kvm *kvm, unsigned long gpa, struct page *fault_page)
>>   {
>>   	unsigned long src_pfn, dst_pfn = 0;
>> -	struct migrate_vma mig;
>> +	struct migrate_vma mig = { 0 };
>>   	struct page *dpage, *spage;
>>   	struct kvmppc_uvmem_page_pvt *pvt;
>>   	unsigned long pfn;
>> @@ -525,6 +525,7 @@ static int __kvmppc_svm_page_out(struct vm_area_struct *vma,
>>   	mig.dst = &dst_pfn;
>>   	mig.pgmap_owner = &kvmppc_uvmem_pgmap;
>>   	mig.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
>> +	mig.fault_page = fault_page;
>>     	/* The requested page is already paged-out, nothing to do */
>>   	if (!kvmppc_gfn_is_uvmem_pfn(gpa >> page_shift, kvm, NULL))
>> @@ -580,12 +581,14 @@ static int __kvmppc_svm_page_out(struct vm_area_struct *vma,
>>   static inline int kvmppc_svm_page_out(struct vm_area_struct *vma,
>>   				      unsigned long start, unsigned long end,
>>   				      unsigned long page_shift,
>> -				      struct kvm *kvm, unsigned long gpa)
>> +				      struct kvm *kvm, unsigned long gpa,
>> +				      struct page *fault_page)
>>   {
>>   	int ret;
>>     	mutex_lock(&kvm->arch.uvmem_lock);
>> -	ret = __kvmppc_svm_page_out(vma, start, end, page_shift, kvm, gpa);
>> +	ret = __kvmppc_svm_page_out(vma, start, end, page_shift, kvm, gpa,
>> +				fault_page);
>>   	mutex_unlock(&kvm->arch.uvmem_lock);
>>     	return ret;
>> @@ -736,7 +739,7 @@ static int kvmppc_svm_page_in(struct vm_area_struct *vma,
>>   		bool pagein)
>>   {
>>   	unsigned long src_pfn, dst_pfn = 0;
>> -	struct migrate_vma mig;
>> +	struct migrate_vma mig = { 0 };
>>   	struct page *spage;
>>   	unsigned long pfn;
>>   	struct page *dpage;
>> @@ -994,7 +997,7 @@ static vm_fault_t kvmppc_uvmem_migrate_to_ram(struct vm_fault *vmf)
>>     	if (kvmppc_svm_page_out(vmf->vma, vmf->address,
>>   				vmf->address + PAGE_SIZE, PAGE_SHIFT,
>> -				pvt->kvm, pvt->gpa))
>> +				pvt->kvm, pvt->gpa, vmf->page))
>>   		return VM_FAULT_SIGBUS;
>>   	else
>>   		return 0;
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>> index b059a77..776448b 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>> @@ -409,7 +409,7 @@ svm_migrate_vma_to_vram(struct amdgpu_device *adev, struct svm_range *prange,
>>   	uint64_t npages = (end - start) >> PAGE_SHIFT;
>>   	struct kfd_process_device *pdd;
>>   	struct dma_fence *mfence = NULL;
>> -	struct migrate_vma migrate;
>> +	struct migrate_vma migrate = { 0 };
>>   	unsigned long cpages = 0;
>>   	dma_addr_t *scratch;
>>   	void *buf;
>> @@ -668,7 +668,7 @@ svm_migrate_copy_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
>>   static long
>>   svm_migrate_vma_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
>>   		       struct vm_area_struct *vma, uint64_t start, uint64_t end,
>> -		       uint32_t trigger)
>> +		       uint32_t trigger, struct page *fault_page)
>>   {
>>   	struct kfd_process *p = container_of(prange->svms, struct kfd_process, svms);
>>   	uint64_t npages = (end - start) >> PAGE_SHIFT;
>> @@ -676,7 +676,7 @@ svm_migrate_vma_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
>>   	unsigned long cpages = 0;
>>   	struct kfd_process_device *pdd;
>>   	struct dma_fence *mfence = NULL;
>> -	struct migrate_vma migrate;
>> +	struct migrate_vma migrate = { 0 };
>>   	dma_addr_t *scratch;
>>   	void *buf;
>>   	int r = -ENOMEM;
>> @@ -699,6 +699,7 @@ svm_migrate_vma_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
>>     	migrate.src = buf;
>>   	migrate.dst = migrate.src + npages;
>> +	migrate.fault_page = fault_page;
>>   	scratch = (dma_addr_t *)(migrate.dst + npages);
>>     	kfd_smi_event_migration_start(adev->kfd.dev, p->lead_thread->pid,
>> @@ -766,7 +767,7 @@ svm_migrate_vma_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
>>    * 0 - OK, otherwise error code
>>    */
>>   int svm_migrate_vram_to_ram(struct svm_range *prange, struct mm_struct *mm,
>> -			    uint32_t trigger)
>> +			    uint32_t trigger, struct page *fault_page)
>>   {
>>   	struct amdgpu_device *adev;
>>   	struct vm_area_struct *vma;
>> @@ -807,7 +808,8 @@ int svm_migrate_vram_to_ram(struct svm_range *prange, struct mm_struct *mm,
>>   		}
>>     		next = min(vma->vm_end, end);
>> -		r = svm_migrate_vma_to_ram(adev, prange, vma, addr, next, trigger);
>> +		r = svm_migrate_vma_to_ram(adev, prange, vma, addr, next, trigger,
>> +			fault_page);
>>   		if (r < 0) {
>>   			pr_debug("failed %ld to migrate prange %p\n", r, prange);
>>   			break;
>> @@ -851,7 +853,7 @@ svm_migrate_vram_to_vram(struct svm_range *prange, uint32_t best_loc,
>>   	pr_debug("from gpu 0x%x to gpu 0x%x\n", prange->actual_loc, best_loc);
>>     	do {
>> -		r = svm_migrate_vram_to_ram(prange, mm, trigger);
>> +		r = svm_migrate_vram_to_ram(prange, mm, trigger, NULL);
>>   		if (r)
>>   			return r;
>>   	} while (prange->actual_loc && --retries);
>> @@ -938,7 +940,8 @@ static vm_fault_t svm_migrate_to_ram(struct vm_fault *vmf)
>>   		goto out_unlock_prange;
>>   	}
>>   -	r = svm_migrate_vram_to_ram(prange, mm,
>> KFD_MIGRATE_TRIGGER_PAGEFAULT_CPU);
>> +	r = svm_migrate_vram_to_ram(prange, mm, KFD_MIGRATE_TRIGGER_PAGEFAULT_CPU,
>> +				vmf->page);
>>   	if (r)
>>   		pr_debug("failed %d migrate 0x%p [0x%lx 0x%lx] to ram\n", r,
>>   			 prange, prange->start, prange->last);
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
>> index b3f0754..a5d7e6d 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
>> @@ -43,7 +43,7 @@ enum MIGRATION_COPY_DIR {
>>   int svm_migrate_to_vram(struct svm_range *prange,  uint32_t best_loc,
>>   			struct mm_struct *mm, uint32_t trigger);
>>   int svm_migrate_vram_to_ram(struct svm_range *prange, struct mm_struct *mm,
>> -			    uint32_t trigger);
>> +			    uint32_t trigger, struct page *fault_page);
>>   unsigned long
>>   svm_migrate_addr_to_pfn(struct amdgpu_device *adev, unsigned long addr);
>>   diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> index 11074cc..9139e5a 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> @@ -2913,13 +2913,15 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid,
>>   				 */
>>   				if (prange->actual_loc)
>>   					r = svm_migrate_vram_to_ram(prange, mm,
>> -					   KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU);
>> +					   KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU,
>> +					   NULL);
>>   				else
>>   					r = 0;
>>   			}
>>   		} else {
>>   			r = svm_migrate_vram_to_ram(prange, mm,
>> -					KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU);
>> +					KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU,
>> +					NULL);
>>   		}
>>   		if (r) {
>>   			pr_debug("failed %d to migrate svms %p [0x%lx 0x%lx]\n",
>> @@ -3242,7 +3244,8 @@ svm_range_trigger_migration(struct mm_struct *mm, struct svm_range *prange,
>>   		return 0;
>>     	if (!best_loc) {
>> -		r = svm_migrate_vram_to_ram(prange, mm, KFD_MIGRATE_TRIGGER_PREFETCH);
>> +		r = svm_migrate_vram_to_ram(prange, mm,
>> +					KFD_MIGRATE_TRIGGER_PREFETCH, NULL);
>>   		*migrated = !r;
>>   		return r;
>>   	}
>> @@ -3303,7 +3306,7 @@ static void svm_range_evict_svm_bo_worker(struct work_struct *work)
>>   		mutex_lock(&prange->migrate_mutex);
>>   		do {
>>   			r = svm_migrate_vram_to_ram(prange, mm,
>> -						KFD_MIGRATE_TRIGGER_TTM_EVICTION);
>> +					KFD_MIGRATE_TRIGGER_TTM_EVICTION, NULL);
>>   		} while (!r && prange->actual_loc && --retries);
>>     		if (!r && prange->actual_loc)
>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
>> index 22c0a0c..82ffa47 100644
>> --- a/include/linux/migrate.h
>> +++ b/include/linux/migrate.h
>> @@ -62,6 +62,8 @@ extern const char *migrate_reason_names[MR_TYPES];
>>   #ifdef CONFIG_MIGRATION
>>     extern void putback_movable_pages(struct list_head *l);
>> +int migrate_folio_extra(struct address_space *mapping, struct folio *dst,
>> +		struct folio *src, enum migrate_mode mode, int extra_count);
>>   int migrate_folio(struct address_space *mapping, struct folio *dst,
>>   		struct folio *src, enum migrate_mode mode);
>>   extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free,
>> @@ -212,6 +214,12 @@ struct migrate_vma {
>>   	 */
>>   	void			*pgmap_owner;
>>   	unsigned long		flags;
>> +
>> +	/*
>> +	 * Set to vmf->page if this is being called to migrate a page as part of
>> +	 * a migrate_to_ram() callback.
>> +	 */
>> +	struct page		*fault_page;
>>   };
>>     int migrate_vma_setup(struct migrate_vma *args);
>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>> index e3965ca..89463ff 100644
>> --- a/lib/test_hmm.c
>> +++ b/lib/test_hmm.c
>> @@ -907,7 +907,7 @@ static int dmirror_migrate_to_system(struct dmirror *dmirror,
>>   	struct vm_area_struct *vma;
>>   	unsigned long src_pfns[64] = { 0 };
>>   	unsigned long dst_pfns[64] = { 0 };
>> -	struct migrate_vma args;
>> +	struct migrate_vma args = { 0 };
>>   	unsigned long next;
>>   	int ret;
>>   @@ -968,7 +968,7 @@ static int dmirror_migrate_to_device(struct dmirror
>> *dmirror,
>>   	unsigned long src_pfns[64] = { 0 };
>>   	unsigned long dst_pfns[64] = { 0 };
>>   	struct dmirror_bounce bounce;
>> -	struct migrate_vma args;
>> +	struct migrate_vma args = { 0 };
>>   	unsigned long next;
>>   	int ret;
>>   @@ -1334,7 +1334,7 @@ static void dmirror_devmem_free(struct page *page)
>>     static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>>   {
>> -	struct migrate_vma args;
>> +	struct migrate_vma args = { 0 };
>>   	unsigned long src_pfns = 0;
>>   	unsigned long dst_pfns = 0;
>>   	struct page *rpage;
>> @@ -1357,6 +1357,7 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>>   	args.dst = &dst_pfns;
>>   	args.pgmap_owner = dmirror->mdevice;
>>   	args.flags = dmirror_select_device(dmirror);
>> +	args.fault_page = vmf->page;
>>     	if (migrate_vma_setup(&args))
>>   		return VM_FAULT_SIGBUS;
>> diff --git a/mm/memory.c b/mm/memory.c
>> index b994784..65d3977 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -3742,7 +3742,21 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>   			ret = remove_device_exclusive_entry(vmf);
>>   		} else if (is_device_private_entry(entry)) {
>>   			vmf->page = pfn_swap_entry_to_page(entry);
>> -			ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
>> +			vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
>> +					vmf->address, &vmf->ptl);
>> +			if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
>> +				spin_unlock(vmf->ptl);
>> +				goto out;
>> +			}
>> +
>> +			/*
>> +			 * Get a page reference while we know the page can't be
>> +			 * freed.
>> +			 */
>> +			get_page(vmf->page);
>> +			pte_unmap_unlock(vmf->pte, vmf->ptl);
>> +			vmf->page->pgmap->ops->migrate_to_ram(vmf);
>> +			put_page(vmf->page);
>>   		} else if (is_hwpoison_entry(entry)) {
>>   			ret = VM_FAULT_HWPOISON;
>>   		} else if (is_swapin_error_entry(entry)) {
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index ce6a58f..e3f78a7 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -620,6 +620,25 @@ EXPORT_SYMBOL(folio_migrate_copy);
>>    *                    Migration functions
>>    ***********************************************************/
>>   +int migrate_folio_extra(struct address_space *mapping, struct folio *dst,
>> +		struct folio *src, enum migrate_mode mode, int extra_count)
>> +{
>> +	int rc;
>> +
>> +	BUG_ON(folio_test_writeback(src));	/* Writeback must be complete */
>> +
>> +	rc = folio_migrate_mapping(mapping, dst, src, extra_count);
>> +
>> +	if (rc != MIGRATEPAGE_SUCCESS)
>> +		return rc;
>> +
>> +	if (mode != MIGRATE_SYNC_NO_COPY)
>> +		folio_migrate_copy(dst, src);
>> +	else
>> +		folio_migrate_flags(dst, src);
>> +	return MIGRATEPAGE_SUCCESS;
>> +}
>> +
>>   /**
>>    * migrate_folio() - Simple folio migration.
>>    * @mapping: The address_space containing the folio.
>> @@ -635,20 +654,7 @@ EXPORT_SYMBOL(folio_migrate_copy);
>>   int migrate_folio(struct address_space *mapping, struct folio *dst,
>>   		struct folio *src, enum migrate_mode mode)
>>   {
>> -	int rc;
>> -
>> -	BUG_ON(folio_test_writeback(src));	/* Writeback must be complete */
>> -
>> -	rc = folio_migrate_mapping(mapping, dst, src, 0);
>> -
>> -	if (rc != MIGRATEPAGE_SUCCESS)
>> -		return rc;
>> -
>> -	if (mode != MIGRATE_SYNC_NO_COPY)
>> -		folio_migrate_copy(dst, src);
>> -	else
>> -		folio_migrate_flags(dst, src);
>> -	return MIGRATEPAGE_SUCCESS;
>> +	return migrate_folio_extra(mapping, dst, src, mode, 0);
>>   }
>>   EXPORT_SYMBOL(migrate_folio);
>>   diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>> index 7235424..f756c00 100644
>> --- a/mm/migrate_device.c
>> +++ b/mm/migrate_device.c
>> @@ -313,14 +313,14 @@ static void migrate_vma_collect(struct migrate_vma *migrate)
>>    * folio_migrate_mapping(), except that here we allow migration of a
>>    * ZONE_DEVICE page.
>>    */
>> -static bool migrate_vma_check_page(struct page *page)
>> +static bool migrate_vma_check_page(struct page *page, struct page *fault_page)
>>   {
>>   	/*
>>   	 * One extra ref because caller holds an extra reference, either from
>>   	 * isolate_lru_page() for a regular page, or migrate_vma_collect() for
>>   	 * a device page.
>>   	 */
>> -	int extra = 1;
>> +	int extra = 1 + (page == fault_page);
>>     	/*
>>   	 * FIXME support THP (transparent huge page), it is bit more complex to
>> @@ -393,7 +393,8 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
>>   		if (folio_mapped(folio))
>>   			try_to_migrate(folio, 0);
>>   -		if (page_mapped(page) || !migrate_vma_check_page(page)) {
>> +		if (page_mapped(page) ||
>> +		    !migrate_vma_check_page(page, migrate->fault_page)) {
>>   			if (!is_zone_device_page(page)) {
>>   				get_page(page);
>>   				putback_lru_page(page);
>> @@ -505,6 +506,8 @@ int migrate_vma_setup(struct migrate_vma *args)
>>   		return -EINVAL;
>>   	if (!args->src || !args->dst)
>>   		return -EINVAL;
>> +	if (args->fault_page && !is_device_private_page(args->fault_page))
>> +		return -EINVAL;
>>     	memset(args->src, 0, sizeof(*args->src) * nr_pages);
>>   	args->cpages = 0;
>> @@ -735,8 +738,13 @@ void migrate_vma_pages(struct migrate_vma *migrate)
>>   			continue;
>>   		}
>>   -		r = migrate_folio(mapping, page_folio(newpage),
>> -				page_folio(page), MIGRATE_SYNC_NO_COPY);
>> +		if (migrate->fault_page == page)
>> +			r = migrate_folio_extra(mapping, page_folio(newpage),
>> +						page_folio(page),
>> +						MIGRATE_SYNC_NO_COPY, 1);
>> +		else
>> +			r = migrate_folio(mapping, page_folio(newpage),
>> +					page_folio(page), MIGRATE_SYNC_NO_COPY);
>>   		if (r != MIGRATEPAGE_SUCCESS)
>>   			migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
>>   	}

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Nouveau] [PATCH v2 1/8] mm/memory.c: Fix race when faulting a device private page
  2022-10-03  0:53     ` Alistair Popple
@ 2022-10-03 17:34       ` Felix Kuehling
  0 siblings, 0 replies; 18+ messages in thread
From: Felix Kuehling @ 2022-10-03 17:34 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Ralph Campbell, Michael Ellerman, nouveau, linux-kernel,
	dri-devel, linux-mm, amd-gfx, Jason Gunthorpe, Andrew Morton

Am 2022-10-02 um 20:53 schrieb Alistair Popple:
> Felix Kuehling <felix.kuehling@amd.com> writes:
>
>> On 2022-09-28 08:01, Alistair Popple wrote:
>>> When the CPU tries to access a device private page the migrate_to_ram()
>>> callback associated with the pgmap for the page is called. However no
>>> reference is taken on the faulting page. Therefore a concurrent
>>> migration of the device private page can free the page and possibly the
>>> underlying pgmap. This results in a race which can crash the kernel due
>>> to the migrate_to_ram() function pointer becoming invalid. It also means
>>> drivers can't reliably read the zone_device_data field because the page
>>> may have been freed with memunmap_pages().
>>>
>>> Close the race by getting a reference on the page while holding the ptl
>>> to ensure it has not been freed. Unfortunately the elevated reference
>>> count will cause the migration required to handle the fault to fail. To
>>> avoid this failure pass the faulting page into the migrate_vma functions
>>> so that if an elevated reference count is found it can be checked to see
>>> if it's expected or not.
>> Do we really have to drag the fault_page all the way into the migrate structure?
>> Is the elevated refcount still needed at that time? Maybe it would be easier to
>> drop the refcount early in the ops->migrage_to_ram callbacks, so we won't have
>> to deal with it in all the migration code.
> That would also work. Honestly I don't really like either solution :-)

Then we agree. :)


> I didn't like having to plumb it all through the migration code
> but I ended up going this way because I felt it was easier to explain
> the life time of vmf->page for the migrate_to_ram() callback. This way
> vmf->page is guaranteed to be valid for the duration of the
> migrate_to_ram() callbak.
>
> As you suggest we could instead have drivers call put_page(vmf->page)
> somewhere in migrate_to_ram() before calling migrate_vma_setup(). The
> reason I didn't go this way is IMHO it's more subtle because in general
> the page will remain valid after that put_page() anyway. So it would be
> easy for drivers to introduce a bug assuming the vmf->page is still
> valid and not notice because most of the time it is.

I guess I'm biased because my migrate_to_ram implementation doesn't use 
the vmf->page at all. I agree that dropping the refcount in the callback 
is subtle. But handling an elevated refcount for just one special page 
in the migration code also looks a bit fragile to me.

It's not my call to make. But my preference is very clear. Either way, 
if the decision is to go with your solution, then you have my

Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>


>
> Let me know if you disagree with my reasoning though - would appreciate
> any review here.
>
>> Regards,
>>    Felix
>>
>>
>>> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>>> Cc: Jason Gunthorpe <jgg@nvidia.com>
>>> Cc: John Hubbard <jhubbard@nvidia.com>
>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>> Cc: Michael Ellerman <mpe@ellerman.id.au>
>>> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
>>> Cc: Lyude Paul <lyude@redhat.com>
>>> ---
>>>    arch/powerpc/kvm/book3s_hv_uvmem.c       | 15 ++++++-----
>>>    drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 17 +++++++------
>>>    drivers/gpu/drm/amd/amdkfd/kfd_migrate.h |  2 +-
>>>    drivers/gpu/drm/amd/amdkfd/kfd_svm.c     | 11 +++++---
>>>    include/linux/migrate.h                  |  8 ++++++-
>>>    lib/test_hmm.c                           |  7 ++---
>>>    mm/memory.c                              | 16 +++++++++++-
>>>    mm/migrate.c                             | 34 ++++++++++++++-----------
>>>    mm/migrate_device.c                      | 18 +++++++++----
>>>    9 files changed, 87 insertions(+), 41 deletions(-)
>>>
>>> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
>>> index 5980063..d4eacf4 100644
>>> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
>>> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
>>> @@ -508,10 +508,10 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
>>>    static int __kvmppc_svm_page_out(struct vm_area_struct *vma,
>>>    		unsigned long start,
>>>    		unsigned long end, unsigned long page_shift,
>>> -		struct kvm *kvm, unsigned long gpa)
>>> +		struct kvm *kvm, unsigned long gpa, struct page *fault_page)
>>>    {
>>>    	unsigned long src_pfn, dst_pfn = 0;
>>> -	struct migrate_vma mig;
>>> +	struct migrate_vma mig = { 0 };
>>>    	struct page *dpage, *spage;
>>>    	struct kvmppc_uvmem_page_pvt *pvt;
>>>    	unsigned long pfn;
>>> @@ -525,6 +525,7 @@ static int __kvmppc_svm_page_out(struct vm_area_struct *vma,
>>>    	mig.dst = &dst_pfn;
>>>    	mig.pgmap_owner = &kvmppc_uvmem_pgmap;
>>>    	mig.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
>>> +	mig.fault_page = fault_page;
>>>      	/* The requested page is already paged-out, nothing to do */
>>>    	if (!kvmppc_gfn_is_uvmem_pfn(gpa >> page_shift, kvm, NULL))
>>> @@ -580,12 +581,14 @@ static int __kvmppc_svm_page_out(struct vm_area_struct *vma,
>>>    static inline int kvmppc_svm_page_out(struct vm_area_struct *vma,
>>>    				      unsigned long start, unsigned long end,
>>>    				      unsigned long page_shift,
>>> -				      struct kvm *kvm, unsigned long gpa)
>>> +				      struct kvm *kvm, unsigned long gpa,
>>> +				      struct page *fault_page)
>>>    {
>>>    	int ret;
>>>      	mutex_lock(&kvm->arch.uvmem_lock);
>>> -	ret = __kvmppc_svm_page_out(vma, start, end, page_shift, kvm, gpa);
>>> +	ret = __kvmppc_svm_page_out(vma, start, end, page_shift, kvm, gpa,
>>> +				fault_page);
>>>    	mutex_unlock(&kvm->arch.uvmem_lock);
>>>      	return ret;
>>> @@ -736,7 +739,7 @@ static int kvmppc_svm_page_in(struct vm_area_struct *vma,
>>>    		bool pagein)
>>>    {
>>>    	unsigned long src_pfn, dst_pfn = 0;
>>> -	struct migrate_vma mig;
>>> +	struct migrate_vma mig = { 0 };
>>>    	struct page *spage;
>>>    	unsigned long pfn;
>>>    	struct page *dpage;
>>> @@ -994,7 +997,7 @@ static vm_fault_t kvmppc_uvmem_migrate_to_ram(struct vm_fault *vmf)
>>>      	if (kvmppc_svm_page_out(vmf->vma, vmf->address,
>>>    				vmf->address + PAGE_SIZE, PAGE_SHIFT,
>>> -				pvt->kvm, pvt->gpa))
>>> +				pvt->kvm, pvt->gpa, vmf->page))
>>>    		return VM_FAULT_SIGBUS;
>>>    	else
>>>    		return 0;
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>>> index b059a77..776448b 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>>> @@ -409,7 +409,7 @@ svm_migrate_vma_to_vram(struct amdgpu_device *adev, struct svm_range *prange,
>>>    	uint64_t npages = (end - start) >> PAGE_SHIFT;
>>>    	struct kfd_process_device *pdd;
>>>    	struct dma_fence *mfence = NULL;
>>> -	struct migrate_vma migrate;
>>> +	struct migrate_vma migrate = { 0 };
>>>    	unsigned long cpages = 0;
>>>    	dma_addr_t *scratch;
>>>    	void *buf;
>>> @@ -668,7 +668,7 @@ svm_migrate_copy_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
>>>    static long
>>>    svm_migrate_vma_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
>>>    		       struct vm_area_struct *vma, uint64_t start, uint64_t end,
>>> -		       uint32_t trigger)
>>> +		       uint32_t trigger, struct page *fault_page)
>>>    {
>>>    	struct kfd_process *p = container_of(prange->svms, struct kfd_process, svms);
>>>    	uint64_t npages = (end - start) >> PAGE_SHIFT;
>>> @@ -676,7 +676,7 @@ svm_migrate_vma_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
>>>    	unsigned long cpages = 0;
>>>    	struct kfd_process_device *pdd;
>>>    	struct dma_fence *mfence = NULL;
>>> -	struct migrate_vma migrate;
>>> +	struct migrate_vma migrate = { 0 };
>>>    	dma_addr_t *scratch;
>>>    	void *buf;
>>>    	int r = -ENOMEM;
>>> @@ -699,6 +699,7 @@ svm_migrate_vma_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
>>>      	migrate.src = buf;
>>>    	migrate.dst = migrate.src + npages;
>>> +	migrate.fault_page = fault_page;
>>>    	scratch = (dma_addr_t *)(migrate.dst + npages);
>>>      	kfd_smi_event_migration_start(adev->kfd.dev, p->lead_thread->pid,
>>> @@ -766,7 +767,7 @@ svm_migrate_vma_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
>>>     * 0 - OK, otherwise error code
>>>     */
>>>    int svm_migrate_vram_to_ram(struct svm_range *prange, struct mm_struct *mm,
>>> -			    uint32_t trigger)
>>> +			    uint32_t trigger, struct page *fault_page)
>>>    {
>>>    	struct amdgpu_device *adev;
>>>    	struct vm_area_struct *vma;
>>> @@ -807,7 +808,8 @@ int svm_migrate_vram_to_ram(struct svm_range *prange, struct mm_struct *mm,
>>>    		}
>>>      		next = min(vma->vm_end, end);
>>> -		r = svm_migrate_vma_to_ram(adev, prange, vma, addr, next, trigger);
>>> +		r = svm_migrate_vma_to_ram(adev, prange, vma, addr, next, trigger,
>>> +			fault_page);
>>>    		if (r < 0) {
>>>    			pr_debug("failed %ld to migrate prange %p\n", r, prange);
>>>    			break;
>>> @@ -851,7 +853,7 @@ svm_migrate_vram_to_vram(struct svm_range *prange, uint32_t best_loc,
>>>    	pr_debug("from gpu 0x%x to gpu 0x%x\n", prange->actual_loc, best_loc);
>>>      	do {
>>> -		r = svm_migrate_vram_to_ram(prange, mm, trigger);
>>> +		r = svm_migrate_vram_to_ram(prange, mm, trigger, NULL);
>>>    		if (r)
>>>    			return r;
>>>    	} while (prange->actual_loc && --retries);
>>> @@ -938,7 +940,8 @@ static vm_fault_t svm_migrate_to_ram(struct vm_fault *vmf)
>>>    		goto out_unlock_prange;
>>>    	}
>>>    -	r = svm_migrate_vram_to_ram(prange, mm,
>>> KFD_MIGRATE_TRIGGER_PAGEFAULT_CPU);
>>> +	r = svm_migrate_vram_to_ram(prange, mm, KFD_MIGRATE_TRIGGER_PAGEFAULT_CPU,
>>> +				vmf->page);
>>>    	if (r)
>>>    		pr_debug("failed %d migrate 0x%p [0x%lx 0x%lx] to ram\n", r,
>>>    			 prange, prange->start, prange->last);
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
>>> index b3f0754..a5d7e6d 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
>>> @@ -43,7 +43,7 @@ enum MIGRATION_COPY_DIR {
>>>    int svm_migrate_to_vram(struct svm_range *prange,  uint32_t best_loc,
>>>    			struct mm_struct *mm, uint32_t trigger);
>>>    int svm_migrate_vram_to_ram(struct svm_range *prange, struct mm_struct *mm,
>>> -			    uint32_t trigger);
>>> +			    uint32_t trigger, struct page *fault_page);
>>>    unsigned long
>>>    svm_migrate_addr_to_pfn(struct amdgpu_device *adev, unsigned long addr);
>>>    diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>> index 11074cc..9139e5a 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>> @@ -2913,13 +2913,15 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid,
>>>    				 */
>>>    				if (prange->actual_loc)
>>>    					r = svm_migrate_vram_to_ram(prange, mm,
>>> -					   KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU);
>>> +					   KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU,
>>> +					   NULL);
>>>    				else
>>>    					r = 0;
>>>    			}
>>>    		} else {
>>>    			r = svm_migrate_vram_to_ram(prange, mm,
>>> -					KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU);
>>> +					KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU,
>>> +					NULL);
>>>    		}
>>>    		if (r) {
>>>    			pr_debug("failed %d to migrate svms %p [0x%lx 0x%lx]\n",
>>> @@ -3242,7 +3244,8 @@ svm_range_trigger_migration(struct mm_struct *mm, struct svm_range *prange,
>>>    		return 0;
>>>      	if (!best_loc) {
>>> -		r = svm_migrate_vram_to_ram(prange, mm, KFD_MIGRATE_TRIGGER_PREFETCH);
>>> +		r = svm_migrate_vram_to_ram(prange, mm,
>>> +					KFD_MIGRATE_TRIGGER_PREFETCH, NULL);
>>>    		*migrated = !r;
>>>    		return r;
>>>    	}
>>> @@ -3303,7 +3306,7 @@ static void svm_range_evict_svm_bo_worker(struct work_struct *work)
>>>    		mutex_lock(&prange->migrate_mutex);
>>>    		do {
>>>    			r = svm_migrate_vram_to_ram(prange, mm,
>>> -						KFD_MIGRATE_TRIGGER_TTM_EVICTION);
>>> +					KFD_MIGRATE_TRIGGER_TTM_EVICTION, NULL);
>>>    		} while (!r && prange->actual_loc && --retries);
>>>      		if (!r && prange->actual_loc)
>>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
>>> index 22c0a0c..82ffa47 100644
>>> --- a/include/linux/migrate.h
>>> +++ b/include/linux/migrate.h
>>> @@ -62,6 +62,8 @@ extern const char *migrate_reason_names[MR_TYPES];
>>>    #ifdef CONFIG_MIGRATION
>>>      extern void putback_movable_pages(struct list_head *l);
>>> +int migrate_folio_extra(struct address_space *mapping, struct folio *dst,
>>> +		struct folio *src, enum migrate_mode mode, int extra_count);
>>>    int migrate_folio(struct address_space *mapping, struct folio *dst,
>>>    		struct folio *src, enum migrate_mode mode);
>>>    extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free,
>>> @@ -212,6 +214,12 @@ struct migrate_vma {
>>>    	 */
>>>    	void			*pgmap_owner;
>>>    	unsigned long		flags;
>>> +
>>> +	/*
>>> +	 * Set to vmf->page if this is being called to migrate a page as part of
>>> +	 * a migrate_to_ram() callback.
>>> +	 */
>>> +	struct page		*fault_page;
>>>    };
>>>      int migrate_vma_setup(struct migrate_vma *args);
>>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>>> index e3965ca..89463ff 100644
>>> --- a/lib/test_hmm.c
>>> +++ b/lib/test_hmm.c
>>> @@ -907,7 +907,7 @@ static int dmirror_migrate_to_system(struct dmirror *dmirror,
>>>    	struct vm_area_struct *vma;
>>>    	unsigned long src_pfns[64] = { 0 };
>>>    	unsigned long dst_pfns[64] = { 0 };
>>> -	struct migrate_vma args;
>>> +	struct migrate_vma args = { 0 };
>>>    	unsigned long next;
>>>    	int ret;
>>>    @@ -968,7 +968,7 @@ static int dmirror_migrate_to_device(struct dmirror
>>> *dmirror,
>>>    	unsigned long src_pfns[64] = { 0 };
>>>    	unsigned long dst_pfns[64] = { 0 };
>>>    	struct dmirror_bounce bounce;
>>> -	struct migrate_vma args;
>>> +	struct migrate_vma args = { 0 };
>>>    	unsigned long next;
>>>    	int ret;
>>>    @@ -1334,7 +1334,7 @@ static void dmirror_devmem_free(struct page *page)
>>>      static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>>>    {
>>> -	struct migrate_vma args;
>>> +	struct migrate_vma args = { 0 };
>>>    	unsigned long src_pfns = 0;
>>>    	unsigned long dst_pfns = 0;
>>>    	struct page *rpage;
>>> @@ -1357,6 +1357,7 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>>>    	args.dst = &dst_pfns;
>>>    	args.pgmap_owner = dmirror->mdevice;
>>>    	args.flags = dmirror_select_device(dmirror);
>>> +	args.fault_page = vmf->page;
>>>      	if (migrate_vma_setup(&args))
>>>    		return VM_FAULT_SIGBUS;
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index b994784..65d3977 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -3742,7 +3742,21 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>    			ret = remove_device_exclusive_entry(vmf);
>>>    		} else if (is_device_private_entry(entry)) {
>>>    			vmf->page = pfn_swap_entry_to_page(entry);
>>> -			ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
>>> +			vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
>>> +					vmf->address, &vmf->ptl);
>>> +			if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
>>> +				spin_unlock(vmf->ptl);
>>> +				goto out;
>>> +			}
>>> +
>>> +			/*
>>> +			 * Get a page reference while we know the page can't be
>>> +			 * freed.
>>> +			 */
>>> +			get_page(vmf->page);
>>> +			pte_unmap_unlock(vmf->pte, vmf->ptl);
>>> +			vmf->page->pgmap->ops->migrate_to_ram(vmf);
>>> +			put_page(vmf->page);
>>>    		} else if (is_hwpoison_entry(entry)) {
>>>    			ret = VM_FAULT_HWPOISON;
>>>    		} else if (is_swapin_error_entry(entry)) {
>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>> index ce6a58f..e3f78a7 100644
>>> --- a/mm/migrate.c
>>> +++ b/mm/migrate.c
>>> @@ -620,6 +620,25 @@ EXPORT_SYMBOL(folio_migrate_copy);
>>>     *                    Migration functions
>>>     ***********************************************************/
>>>    +int migrate_folio_extra(struct address_space *mapping, struct folio *dst,
>>> +		struct folio *src, enum migrate_mode mode, int extra_count)
>>> +{
>>> +	int rc;
>>> +
>>> +	BUG_ON(folio_test_writeback(src));	/* Writeback must be complete */
>>> +
>>> +	rc = folio_migrate_mapping(mapping, dst, src, extra_count);
>>> +
>>> +	if (rc != MIGRATEPAGE_SUCCESS)
>>> +		return rc;
>>> +
>>> +	if (mode != MIGRATE_SYNC_NO_COPY)
>>> +		folio_migrate_copy(dst, src);
>>> +	else
>>> +		folio_migrate_flags(dst, src);
>>> +	return MIGRATEPAGE_SUCCESS;
>>> +}
>>> +
>>>    /**
>>>     * migrate_folio() - Simple folio migration.
>>>     * @mapping: The address_space containing the folio.
>>> @@ -635,20 +654,7 @@ EXPORT_SYMBOL(folio_migrate_copy);
>>>    int migrate_folio(struct address_space *mapping, struct folio *dst,
>>>    		struct folio *src, enum migrate_mode mode)
>>>    {
>>> -	int rc;
>>> -
>>> -	BUG_ON(folio_test_writeback(src));	/* Writeback must be complete */
>>> -
>>> -	rc = folio_migrate_mapping(mapping, dst, src, 0);
>>> -
>>> -	if (rc != MIGRATEPAGE_SUCCESS)
>>> -		return rc;
>>> -
>>> -	if (mode != MIGRATE_SYNC_NO_COPY)
>>> -		folio_migrate_copy(dst, src);
>>> -	else
>>> -		folio_migrate_flags(dst, src);
>>> -	return MIGRATEPAGE_SUCCESS;
>>> +	return migrate_folio_extra(mapping, dst, src, mode, 0);
>>>    }
>>>    EXPORT_SYMBOL(migrate_folio);
>>>    diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>> index 7235424..f756c00 100644
>>> --- a/mm/migrate_device.c
>>> +++ b/mm/migrate_device.c
>>> @@ -313,14 +313,14 @@ static void migrate_vma_collect(struct migrate_vma *migrate)
>>>     * folio_migrate_mapping(), except that here we allow migration of a
>>>     * ZONE_DEVICE page.
>>>     */
>>> -static bool migrate_vma_check_page(struct page *page)
>>> +static bool migrate_vma_check_page(struct page *page, struct page *fault_page)
>>>    {
>>>    	/*
>>>    	 * One extra ref because caller holds an extra reference, either from
>>>    	 * isolate_lru_page() for a regular page, or migrate_vma_collect() for
>>>    	 * a device page.
>>>    	 */
>>> -	int extra = 1;
>>> +	int extra = 1 + (page == fault_page);
>>>      	/*
>>>    	 * FIXME support THP (transparent huge page), it is bit more complex to
>>> @@ -393,7 +393,8 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
>>>    		if (folio_mapped(folio))
>>>    			try_to_migrate(folio, 0);
>>>    -		if (page_mapped(page) || !migrate_vma_check_page(page)) {
>>> +		if (page_mapped(page) ||
>>> +		    !migrate_vma_check_page(page, migrate->fault_page)) {
>>>    			if (!is_zone_device_page(page)) {
>>>    				get_page(page);
>>>    				putback_lru_page(page);
>>> @@ -505,6 +506,8 @@ int migrate_vma_setup(struct migrate_vma *args)
>>>    		return -EINVAL;
>>>    	if (!args->src || !args->dst)
>>>    		return -EINVAL;
>>> +	if (args->fault_page && !is_device_private_page(args->fault_page))
>>> +		return -EINVAL;
>>>      	memset(args->src, 0, sizeof(*args->src) * nr_pages);
>>>    	args->cpages = 0;
>>> @@ -735,8 +738,13 @@ void migrate_vma_pages(struct migrate_vma *migrate)
>>>    			continue;
>>>    		}
>>>    -		r = migrate_folio(mapping, page_folio(newpage),
>>> -				page_folio(page), MIGRATE_SYNC_NO_COPY);
>>> +		if (migrate->fault_page == page)
>>> +			r = migrate_folio_extra(mapping, page_folio(newpage),
>>> +						page_folio(page),
>>> +						MIGRATE_SYNC_NO_COPY, 1);
>>> +		else
>>> +			r = migrate_folio(mapping, page_folio(newpage),
>>> +					page_folio(page), MIGRATE_SYNC_NO_COPY);
>>>    		if (r != MIGRATEPAGE_SUCCESS)
>>>    			migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
>>>    	}

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Nouveau] [PATCH v2 0/8] Fix several device private page reference counting issues
  2022-09-28 12:01 [Nouveau] [PATCH v2 0/8] Fix several device private page reference counting issues Alistair Popple
                   ` (7 preceding siblings ...)
  2022-09-28 12:01 ` [Nouveau] [PATCH v2 8/8] hmm-tests: Add test for migrate_device_range() Alistair Popple
@ 2022-10-25 10:17 ` Vlastimil Babka (SUSE)
  2022-10-26  1:47   ` Alistair Popple
  8 siblings, 1 reply; 18+ messages in thread
From: Vlastimil Babka (SUSE) @ 2022-10-25 10:17 UTC (permalink / raw)
  To: Alistair Popple, Andrew Morton, linux-mm
  Cc: nouveau, Felix Kuehling, linux-kernel, dri-devel, amd-gfx

On 9/28/22 14:01, Alistair Popple wrote:
> This series aims to fix a number of page reference counting issues in
> drivers dealing with device private ZONE_DEVICE pages. These result in
> use-after-free type bugs, either from accessing a struct page which no
> longer exists because it has been removed or accessing fields within the
> struct page which are no longer valid because the page has been freed.
> 
> During normal usage it is unlikely these will cause any problems. However
> without these fixes it is possible to crash the kernel from userspace.
> These crashes can be triggered either by unloading the kernel module or
> unbinding the device from the driver prior to a userspace task exiting. In
> modules such as Nouveau it is also possible to trigger some of these issues
> by explicitly closing the device file-descriptor prior to the task exiting
> and then accessing device private memory.

Hi, as this series was noticed to create a CVE [1], do you think a stable
backport is warranted? I think the "It is possible to launch the attack
remotely." in [1] is incorrect though, right?

It looks to me that patch 1 would be needed since the CONFIG_DEVICE_PRIVATE
introduction, while the following few only to kernels with 27674ef6c73f
(probably not so critical as that includes no LTS)?

Thanks,
Vlastimil

[1] https://nvd.nist.gov/vuln/detail/CVE-2022-3523

> This involves some minor changes to both PowerPC and AMD GPU code.
> Unfortunately I lack hardware to test either of those so any help there
> would be appreciated. The changes mimic what is done in for both Nouveau
> and hmm-tests though so I doubt they will cause problems.
> 
> To: Andrew Morton <akpm@linux-foundation.org>
> To: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Cc: amd-gfx@lists.freedesktop.org
> Cc: nouveau@lists.freedesktop.org
> Cc: dri-devel@lists.freedesktop.org
> 
> Alistair Popple (8):
>   mm/memory.c: Fix race when faulting a device private page
>   mm: Free device private pages have zero refcount
>   mm/memremap.c: Take a pgmap reference on page allocation
>   mm/migrate_device.c: Refactor migrate_vma and migrate_deivce_coherent_page()
>   mm/migrate_device.c: Add migrate_device_range()
>   nouveau/dmem: Refactor nouveau_dmem_fault_copy_one()
>   nouveau/dmem: Evict device private memory during release
>   hmm-tests: Add test for migrate_device_range()
> 
>  arch/powerpc/kvm/book3s_hv_uvmem.c       |  17 +-
>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  19 +-
>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.h |   2 +-
>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c     |  11 +-
>  drivers/gpu/drm/nouveau/nouveau_dmem.c   | 108 +++++++----
>  include/linux/memremap.h                 |   1 +-
>  include/linux/migrate.h                  |  15 ++-
>  lib/test_hmm.c                           | 129 ++++++++++---
>  lib/test_hmm_uapi.h                      |   1 +-
>  mm/memory.c                              |  16 +-
>  mm/memremap.c                            |  30 ++-
>  mm/migrate.c                             |  34 +--
>  mm/migrate_device.c                      | 239 +++++++++++++++++-------
>  mm/page_alloc.c                          |   8 +-
>  tools/testing/selftests/vm/hmm-tests.c   |  49 +++++-
>  15 files changed, 516 insertions(+), 163 deletions(-)
> 
> base-commit: 088b8aa537c2c767765f1c19b555f21ffe555786


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Nouveau] [PATCH v2 0/8] Fix several device private page reference counting issues
  2022-10-25 10:17 ` [Nouveau] [PATCH v2 0/8] Fix several device private page reference counting issues Vlastimil Babka (SUSE)
@ 2022-10-26  1:47   ` Alistair Popple
  0 siblings, 0 replies; 18+ messages in thread
From: Alistair Popple @ 2022-10-26  1:47 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: nouveau, Felix Kuehling, linux-kernel, dri-devel, linux-mm,
	amd-gfx, Andrew Morton


"Vlastimil Babka (SUSE)" <vbabka@kernel.org> writes:

> On 9/28/22 14:01, Alistair Popple wrote:
>> This series aims to fix a number of page reference counting issues in
>> drivers dealing with device private ZONE_DEVICE pages. These result in
>> use-after-free type bugs, either from accessing a struct page which no
>> longer exists because it has been removed or accessing fields within the
>> struct page which are no longer valid because the page has been freed.
>>
>> During normal usage it is unlikely these will cause any problems. However
>> without these fixes it is possible to crash the kernel from userspace.
>> These crashes can be triggered either by unloading the kernel module or
>> unbinding the device from the driver prior to a userspace task exiting. In
>> modules such as Nouveau it is also possible to trigger some of these issues
>> by explicitly closing the device file-descriptor prior to the task exiting
>> and then accessing device private memory.
>
> Hi, as this series was noticed to create a CVE [1], do you think a stable
> backport is warranted? I think the "It is possible to launch the attack
> remotely." in [1] is incorrect though, right?

Right, I don't see how this could be exploited remotely. And I'm pretty
sure you need root as well because in practice the pgmap needs to be
freed, and for Nouveau at least that only happens on device removal.

> It looks to me that patch 1 would be needed since the CONFIG_DEVICE_PRIVATE
> introduction, while the following few only to kernels with 27674ef6c73f
> (probably not so critical as that includes no LTS)?

Patch 3 already has a fixes tag for 27674ef6c73f. Patch 1 would need to
go back to CONFIG_DEVICE_PRIVATE introduction. I think patches 4-8 would
also need to go back to introduction of CONFIG_DEVICE_PRIVATE, but there
isn't as much impact there and they would be harder to backport I think.
Without them device removal can loop indefinitely in kernel mode (if
patch 3 is present or the kernel is older than 27674ef6c73f).

 - Alistair

> Thanks,
> Vlastimil
>
> [1] https://nvd.nist.gov/vuln/detail/CVE-2022-3523
>
>> This involves some minor changes to both PowerPC and AMD GPU code.
>> Unfortunately I lack hardware to test either of those so any help there
>> would be appreciated. The changes mimic what is done in for both Nouveau
>> and hmm-tests though so I doubt they will cause problems.
>>
>> To: Andrew Morton <akpm@linux-foundation.org>
>> To: linux-mm@kvack.org
>> Cc: linux-kernel@vger.kernel.org
>> Cc: amd-gfx@lists.freedesktop.org
>> Cc: nouveau@lists.freedesktop.org
>> Cc: dri-devel@lists.freedesktop.org
>>
>> Alistair Popple (8):
>>   mm/memory.c: Fix race when faulting a device private page
>>   mm: Free device private pages have zero refcount
>>   mm/memremap.c: Take a pgmap reference on page allocation
>>   mm/migrate_device.c: Refactor migrate_vma and migrate_deivce_coherent_page()
>>   mm/migrate_device.c: Add migrate_device_range()
>>   nouveau/dmem: Refactor nouveau_dmem_fault_copy_one()
>>   nouveau/dmem: Evict device private memory during release
>>   hmm-tests: Add test for migrate_device_range()
>>
>>  arch/powerpc/kvm/book3s_hv_uvmem.c       |  17 +-
>>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  19 +-
>>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.h |   2 +-
>>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c     |  11 +-
>>  drivers/gpu/drm/nouveau/nouveau_dmem.c   | 108 +++++++----
>>  include/linux/memremap.h                 |   1 +-
>>  include/linux/migrate.h                  |  15 ++-
>>  lib/test_hmm.c                           | 129 ++++++++++---
>>  lib/test_hmm_uapi.h                      |   1 +-
>>  mm/memory.c                              |  16 +-
>>  mm/memremap.c                            |  30 ++-
>>  mm/migrate.c                             |  34 +--
>>  mm/migrate_device.c                      | 239 +++++++++++++++++-------
>>  mm/page_alloc.c                          |   8 +-
>>  tools/testing/selftests/vm/hmm-tests.c   |  49 +++++-
>>  15 files changed, 516 insertions(+), 163 deletions(-)
>>
>> base-commit: 088b8aa537c2c767765f1c19b555f21ffe555786

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2023-05-04 12:33 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-28 12:01 [Nouveau] [PATCH v2 0/8] Fix several device private page reference counting issues Alistair Popple
2022-09-28 12:01 ` [Nouveau] [PATCH v2 1/8] mm/memory.c: Fix race when faulting a device private page Alistair Popple
2022-09-29 18:30   ` Felix Kuehling
2022-10-03  0:53     ` Alistair Popple
2022-10-03 17:34       ` Felix Kuehling
2022-09-28 12:01 ` [Nouveau] [PATCH v2 2/8] mm: Free device private pages have zero refcount Alistair Popple
2022-09-29 19:21   ` Felix Kuehling
2022-09-28 12:01 ` [Nouveau] [PATCH v2 3/8] mm/memremap.c: Take a pgmap reference on page allocation Alistair Popple
2022-09-28 12:01 ` [Nouveau] [PATCH v2 4/8] mm/migrate_device.c: Refactor migrate_vma and migrate_deivce_coherent_page() Alistair Popple
2022-09-28 12:01 ` [Nouveau] [PATCH v2 5/8] mm/migrate_device.c: Add migrate_device_range() Alistair Popple
2022-09-28 12:01 ` [Nouveau] [PATCH v2 6/8] nouveau/dmem: Refactor nouveau_dmem_fault_copy_one() Alistair Popple
2022-09-28 12:01 ` [Nouveau] [PATCH v2 7/8] nouveau/dmem: Evict device private memory during release Alistair Popple
2022-09-28 21:37   ` Lyude Paul
2022-09-28 12:01 ` [Nouveau] [PATCH v2 8/8] hmm-tests: Add test for migrate_device_range() Alistair Popple
2022-09-28 15:10   ` Andrew Morton
2022-09-29 11:00     ` Alistair Popple
2022-10-25 10:17 ` [Nouveau] [PATCH v2 0/8] Fix several device private page reference counting issues Vlastimil Babka (SUSE)
2022-10-26  1:47   ` Alistair Popple

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).