linuxppc-dev.lists.ozlabs.org archive mirror
 help / color / mirror / Atom feed
From: Alistair Popple <apopple@nvidia.com>
To: linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>
Cc: "Alex Sierra" <alex.sierra@amd.com>,
	"Karol Herbst" <kherbst@redhat.com>,
	"David Airlie" <airlied@linux.ie>,
	nouveau@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	amd-gfx@lists.freedesktop.org,
	"Alistair Popple" <apopple@nvidia.com>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	"Ben Skeggs" <bskeggs@redhat.com>,
	"Jason Gunthorpe" <jgg@nvidia.com>,
	"Ralph Campbell" <rcampbell@nvidia.com>,
	"Lyude Paul" <lyude@redhat.com>,
	"John Hubbard" <jhubbard@nvidia.com>,
	"Nicholas Piggin" <npiggin@gmail.com>,
	"Dan Williams" <dan.j.williams@intel.com>,
	"Felix Kuehling" <Felix.Kuehling@amd.com>,
	"Pan, Xinhui" <Xinhui.Pan@amd.com>,
	linux-kernel@vger.kernel.org, "Daniel Vetter" <daniel@ffwll.ch>,
	"Alex Deucher" <alexander.deucher@amd.com>,
	linuxppc-dev@lists.ozlabs.org,
	"Christian König" <christian.koenig@amd.com>
Subject: [PATCH 6/7] nouveau/dmem: Evict device private memory during release
Date: Mon, 26 Sep 2022 16:03:10 +1000	[thread overview]
Message-ID: <072e1ce590fe101a4cdbd5e91b1702efebb6d0fd.1664171943.git-series.apopple@nvidia.com> (raw)
In-Reply-To: <cover.f15b25597fc3afd45b144df863eeca3b2c13f9f4.1664171943.git-series.apopple@nvidia.com>

When the module is unloaded or a GPU is unbound from the module it is
possible for device private pages to be left mapped in currently running
processes. This leads to a kernel crash when the pages are either freed
or accessed from the CPU because the GPU and associated data structures
and callbacks have all been freed.

Fix this by migrating any mappings back to normal CPU memory prior to
freeing the GPU memory chunks and associated device private pages.

Signed-off-by: Alistair Popple <apopple@nvidia.com>

---

I assume the AMD driver might have a similar issue. However I can't see
where device private (or coherent) pages actually get unmapped/freed
during teardown as I couldn't find any relevant calls to
devm_memunmap(), memunmap(), devm_release_mem_region() or
release_mem_region(). So it appears that ZONE_DEVICE pages are not being
properly freed during module unload, unless I'm missing something?
---
 drivers/gpu/drm/nouveau/nouveau_dmem.c | 48 +++++++++++++++++++++++++++-
 1 file changed, 48 insertions(+)

diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 66ebbd4..3b247b8 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -369,6 +369,52 @@ nouveau_dmem_suspend(struct nouveau_drm *drm)
 	mutex_unlock(&drm->dmem->mutex);
 }
 
+/*
+ * Evict all pages mapping a chunk.
+ */
+void
+nouveau_dmem_evict_chunk(struct nouveau_dmem_chunk *chunk)
+{
+	unsigned long i, npages = range_len(&chunk->pagemap.range) >> PAGE_SHIFT;
+	unsigned long *src_pfns, *dst_pfns;
+	dma_addr_t *dma_addrs;
+	struct nouveau_fence *fence;
+
+	src_pfns = kcalloc(npages, sizeof(*src_pfns), GFP_KERNEL);
+	dst_pfns = kcalloc(npages, sizeof(*dst_pfns), GFP_KERNEL);
+	dma_addrs = kcalloc(npages, sizeof(*dma_addrs), GFP_KERNEL);
+
+	migrate_device_range(src_pfns, chunk->pagemap.range.start >> PAGE_SHIFT,
+			npages);
+
+	for (i = 0; i < npages; i++) {
+		if (src_pfns[i] & MIGRATE_PFN_MIGRATE) {
+			struct page *dpage;
+
+			/*
+			 * _GFP_NOFAIL because the GPU is going away and there
+			 * is nothing sensible we can do if we can't copy the
+			 * data back.
+			 */
+			dpage = alloc_page(GFP_HIGHUSER | __GFP_NOFAIL);
+			dst_pfns[i] = migrate_pfn(page_to_pfn(dpage));
+			nouveau_dmem_copy_one(chunk->drm,
+					migrate_pfn_to_page(src_pfns[i]), dpage,
+					&dma_addrs[i]);
+		}
+	}
+
+	nouveau_fence_new(chunk->drm->dmem->migrate.chan, false, &fence);
+	migrate_device_pages(src_pfns, dst_pfns, npages);
+	nouveau_dmem_fence_done(&fence);
+	migrate_device_finalize(src_pfns, dst_pfns, npages);
+	kfree(src_pfns);
+	kfree(dst_pfns);
+	for (i = 0; i < npages; i++)
+		dma_unmap_page(chunk->drm->dev->dev, dma_addrs[i], PAGE_SIZE, DMA_BIDIRECTIONAL);
+	kfree(dma_addrs);
+}
+
 void
 nouveau_dmem_fini(struct nouveau_drm *drm)
 {
@@ -380,8 +426,10 @@ nouveau_dmem_fini(struct nouveau_drm *drm)
 	mutex_lock(&drm->dmem->mutex);
 
 	list_for_each_entry_safe(chunk, tmp, &drm->dmem->chunks, list) {
+		nouveau_dmem_evict_chunk(chunk);
 		nouveau_bo_unpin(chunk->bo);
 		nouveau_bo_ref(NULL, &chunk->bo);
+		WARN_ON(chunk->callocated);
 		list_del(&chunk->list);
 		memunmap_pages(&chunk->pagemap);
 		release_mem_region(chunk->pagemap.range.start,
-- 
git-series 0.9.1

  parent reply	other threads:[~2022-09-26  6:09 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-09-26  6:03 [PATCH 0/7] Fix several device private page reference counting issues Alistair Popple
2022-09-26  6:03 ` [PATCH 1/7] mm/memory.c: Fix race when faulting a device private page Alistair Popple
2022-09-29  0:07   ` Michael Ellerman
2022-09-29  1:40     ` Alistair Popple
2022-09-29  5:07       ` Michael Ellerman
2022-09-26  6:03 ` [PATCH 2/7] mm: Free device private pages have zero refcount Alistair Popple
2022-09-26 14:36   ` Jason Gunthorpe
2022-09-27  2:06     ` Alistair Popple
2022-09-29 20:18       ` Dan Williams
2022-09-30  0:45         ` Alistair Popple
2022-09-30  1:49           ` Dan Williams
2022-09-26  6:03 ` [PATCH 3/7] mm/migrate_device.c: Refactor migrate_vma and migrate_deivce_coherent_page() Alistair Popple
2022-09-26  6:03 ` [PATCH 4/7] mm/migrate_device.c: Add migrate_device_range() Alistair Popple
2022-09-26  6:03 ` [PATCH 5/7] nouveau/dmem: Refactor nouveau_dmem_fault_copy_one() Alistair Popple
2022-09-26 21:29   ` Lyude Paul
2022-09-28 11:30     ` Alistair Popple
2022-09-26  6:03 ` Alistair Popple [this message]
2022-09-26 13:28   ` [PATCH 6/7] nouveau/dmem: Evict device private memory during release kernel test robot
2022-09-26 21:35   ` Lyude Paul
2022-09-26 22:14     ` John Hubbard
2022-09-26 23:45       ` Alistair Popple
2022-09-28 21:39         ` Lyude Paul
2022-09-26 23:07     ` Felix Kuehling
2022-09-27  1:39       ` Alistair Popple
2022-09-28 21:23         ` Lyude Paul
2022-09-26  6:03 ` [PATCH 7/7] hmm-tests: Add test for migrate_device_range() Alistair Popple

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=072e1ce590fe101a4cdbd5e91b1702efebb6d0fd.1664171943.git-series.apopple@nvidia.com \
    --to=apopple@nvidia.com \
    --cc=Felix.Kuehling@amd.com \
    --cc=Xinhui.Pan@amd.com \
    --cc=airlied@linux.ie \
    --cc=akpm@linux-foundation.org \
    --cc=alex.sierra@amd.com \
    --cc=alexander.deucher@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=bskeggs@redhat.com \
    --cc=christian.koenig@amd.com \
    --cc=dan.j.williams@intel.com \
    --cc=daniel@ffwll.ch \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=jgg@nvidia.com \
    --cc=jhubbard@nvidia.com \
    --cc=kherbst@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=lyude@redhat.com \
    --cc=nouveau@lists.freedesktop.org \
    --cc=npiggin@gmail.com \
    --cc=rcampbell@nvidia.com \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).