From: Ralph Campbell <rcampbell@nvidia.com>
To: John Hubbard <jhubbard@nvidia.com>,
	<nouveau@lists.freedesktop.org>, <linux-kernel@vger.kernel.org>
Cc: Jerome Glisse <jglisse@redhat.com>,
	Christoph Hellwig <hch@lst.de>,
	"Jason Gunthorpe" <jgg@mellanox.com>,
	Ben Skeggs <bskeggs@redhat.com>
Subject: Re: [RESEND PATCH 2/3] nouveau: fix mixed normal and device private page migration
Date: Mon, 22 Jun 2020 18:42:00 -0700
Message-ID: <730e85c9-33b5-9c57-7123-057b75cbbddf@nvidia.com>
In-Reply-To: <f2bf81df-8faa-0f51-3f74-cb3b31d96aad@nvidia.com>


On 6/22/20 5:30 PM, John Hubbard wrote:
> On 2020-06-22 16:38, Ralph Campbell wrote:
>> The OpenCL function clEnqueueSVMMigrateMem(), without any flags, will
>> migrate memory in the given address range to device private memory. The
>> source pages might already have been migrated to device private memory.
>> In that case, the source struct page is not checked to see if it is
>> a device private page, so the GPU's physical address of local memory
>> is computed incorrectly, leading to data corruption.
>> Fix this by checking the source struct page and computing the correct
>> physical address.
>>
>> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
>> ---
>>   drivers/gpu/drm/nouveau/nouveau_dmem.c | 8 ++++++++
>>   1 file changed, 8 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
>> index cc9993837508..f6a806ba3caa 100644
>> --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
>> +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
>> @@ -540,6 +540,12 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
>>       if (!(src & MIGRATE_PFN_MIGRATE))
>>           goto out;
>> +    if (spage && is_device_private_page(spage)) {
>> +        paddr = nouveau_dmem_page_addr(spage);
>> +        *dma_addr = DMA_MAPPING_ERROR;
>> +        goto done;
>> +    }
>> +
>>       dpage = nouveau_dmem_page_alloc_locked(drm);
>>       if (!dpage)
>>           goto out;
>> @@ -560,6 +566,7 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
>>               goto out_free_page;
>>       }
>> +done:
>>       *pfn = NVIF_VMM_PFNMAP_V0_V | NVIF_VMM_PFNMAP_V0_VRAM |
>>           ((paddr >> PAGE_SHIFT) << NVIF_VMM_PFNMAP_V0_ADDR_SHIFT);
>>       if (src & MIGRATE_PFN_WRITE)
>> @@ -615,6 +622,7 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
>>       struct migrate_vma args = {
>>           .vma        = vma,
>>           .start        = start,
>> +        .src_owner    = drm->dev,
> 
> Hi Ralph,
> 
> This .src_owner setting does look like a required fix, but it seems like
> a completely separate fix from what is listed in this patch's commit
> description, right? (It feels like a casualty of rearranging the patches.)
> 
> 
> thanks,

It's a bit more complex. There is a catch-22 between this patch and the
change to mm/migrate.c (see the sketches below):

- Without this patch and without the mm/migrate.c change, a second call
  to clEnqueueSVMMigrateMem() for the same address range invalidates the
  GPU mapping to device private memory created by the first call.
- With this patch but without the mm/migrate.c change, the first call to
  clEnqueueSVMMigrateMem() fails to migrate normal anonymous memory to
  device private memory.
- Without this patch but with the mm/migrate.c change, a second call to
  clEnqueueSVMMigrateMem() crashes the kernel because dma_map_page() is
  called with the device private PFN, which is not a valid CPU physical
  address.
- With both changes, a range of mixed normal anonymous and device private
  pages can be migrated to the GPU and the GPU page tables are updated
  properly.
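
For reference, the problem case is easy to trigger from user space. A
minimal, hypothetical sketch (assumes an OpenCL 2.x platform with SVM
support and an already set up queue and SVM allocation; error checking
omitted):

#include <CL/cl.h>

static void migrate_twice(cl_command_queue queue, void *svm_ptr, size_t size)
{
	const void *ptrs[1] = { svm_ptr };

	/* First call: normal anonymous pages migrate to device private memory. */
	clEnqueueSVMMigrateMem(queue, 1, ptrs, &size, 0, 0, NULL, NULL);

	/*
	 * Second call: the source pages are now device private pages, which
	 * is the case the is_device_private_page() check above handles.
	 */
	clEnqueueSVMMigrateMem(queue, 1, ptrs, &size, 0, 0, NULL, NULL);

	clFinish(queue);
}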
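
On the kernel side, the catch-22 comes from how migrate_vma_collect_pmd()
filters source pages when src_owner is set. A simplified sketch of that
logic as I read current mainline (not the verbatim source):

	if (!pte_present(pte)) {
		swp_entry_t entry = pte_to_swp_entry(pte);

		/* Only device private swap entries can be collected here. */
		if (!is_device_private_entry(entry))
			goto next;

		/* Device private pages are collected only for their owner. */
		page = device_private_entry_to_page(entry);
		if (page->pgmap->owner != migrate->src_owner)
			goto next;
	} else {
		/*
		 * Normal pages are skipped whenever src_owner is set, so
		 * setting .src_owner = drm->dev without the mm/migrate.c
		 * change stops anonymous memory from migrating at all.
		 */
		if (migrate->src_owner)
			goto next;
	}

The mm/migrate.c change relaxes that normal-page skip so a mixed range
collects both kinds of pages, which is what the last case above relies on.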

Thread overview: 14 messages

2020-06-22 23:38 [RESEND PATCH 0/3] nouveau: fixes for SVM Ralph Campbell
2020-06-22 23:38 ` [RESEND PATCH 1/3] nouveau: fix migrate page regression Ralph Campbell
2020-06-23  0:51   ` John Hubbard
2020-06-25  5:23     ` [Nouveau] " Ben Skeggs
2020-06-22 23:38 ` [RESEND PATCH 2/3] nouveau: fix mixed normal and device private page migration Ralph Campbell
2020-06-23  0:30   ` John Hubbard
2020-06-23  1:42     ` Ralph Campbell [this message]
2020-06-24  7:23   ` Christoph Hellwig
2020-06-24 18:10     ` Ralph Campbell
2020-06-25 17:25       ` Ralph Campbell
2020-06-25 17:31         ` Jason Gunthorpe
2020-06-25 17:42           ` Ralph Campbell
2020-06-22 23:38 ` [RESEND PATCH 3/3] nouveau: make nvkm_vmm_ctor() and nvkm_mmu_ptp_get() static Ralph Campbell
2020-06-23  0:57   ` John Hubbard
