linux-kernel.vger.kernel.org archive mirror
* [RESEND PATCH 0/3] nouveau: fixes for SVM
@ 2020-06-22 23:38 Ralph Campbell
  2020-06-22 23:38 ` [RESEND PATCH 1/3] nouveau: fix migrate page regression Ralph Campbell
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Ralph Campbell @ 2020-06-22 23:38 UTC (permalink / raw)
  To: nouveau, linux-kernel
  Cc: Jerome Glisse, John Hubbard, Christoph Hellwig, Jason Gunthorpe,
	Ben Skeggs, Ralph Campbell

These are based on 5.8.0-rc2 and intended for Ben Skeggs' nouveau tree.
I believe the changes can be queued for 5.8-rcX after being reviewed.
These were part of a larger series but I'm resending them separately as
suggested by Jason Gunthorpe.
https://lore.kernel.org/linux-mm/20200619215649.32297-1-rcampbell@nvidia.com/
Note that in order to exercise/test patch 2 here, you will need a
kernel with patch 1 from the original series (the fix to mm/migrate.c).
It is safe to apply these changes before the fix to mm/migrate.c
though.

Ralph Campbell (3):
  nouveau: fix migrate page regression
  nouveau: fix mixed normal and device private page migration
  nouveau: make nvkm_vmm_ctor() and nvkm_mmu_ptp_get() static

 drivers/gpu/drm/nouveau/nouveau_dmem.c         | 10 +++++++++-
 drivers/gpu/drm/nouveau/nvkm/subdev/mmu/base.c |  2 +-
 drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c  |  2 +-
 drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h  |  3 ---
 4 files changed, 11 insertions(+), 6 deletions(-)

-- 
2.20.1



* [RESEND PATCH 1/3] nouveau: fix migrate page regression
  2020-06-22 23:38 [RESEND PATCH 0/3] nouveau: fixes for SVM Ralph Campbell
@ 2020-06-22 23:38 ` Ralph Campbell
  2020-06-23  0:51   ` John Hubbard
  2020-06-22 23:38 ` [RESEND PATCH 2/3] nouveau: fix mixed normal and device private page migration Ralph Campbell
  2020-06-22 23:38 ` [RESEND PATCH 3/3] nouveau: make nvkm_vmm_ctor() and nvkm_mmu_ptp_get() static Ralph Campbell
  2 siblings, 1 reply; 15+ messages in thread
From: Ralph Campbell @ 2020-06-22 23:38 UTC (permalink / raw)
  To: nouveau, linux-kernel
  Cc: Jerome Glisse, John Hubbard, Christoph Hellwig, Jason Gunthorpe,
	Ben Skeggs, Ralph Campbell

The patch to add zero page migration to GPU memory inadvertantly included
part of a future change which broke normal page migration to GPU memory
by copying too much data and corrupting GPU memory.
Fix this by only copying one page instead of a byte count.

Fixes: 9d4296a7d4b3 ("drm/nouveau/nouveau/hmm: fix migrate zero page to GPU")
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
---
 drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index e5c230d9ae24..cc9993837508 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -550,7 +550,7 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
 					 DMA_BIDIRECTIONAL);
 		if (dma_mapping_error(dev, *dma_addr))
 			goto out_free_page;
-		if (drm->dmem->migrate.copy_func(drm, page_size(spage),
+		if (drm->dmem->migrate.copy_func(drm, 1,
 			NOUVEAU_APER_VRAM, paddr, NOUVEAU_APER_HOST, *dma_addr))
 			goto out_dma_unmap;
 	} else {
-- 
2.20.1



* [RESEND PATCH 2/3] nouveau: fix mixed normal and device private page migration
  2020-06-22 23:38 [RESEND PATCH 0/3] nouveau: fixes for SVM Ralph Campbell
  2020-06-22 23:38 ` [RESEND PATCH 1/3] nouveau: fix migrate page regression Ralph Campbell
@ 2020-06-22 23:38 ` Ralph Campbell
  2020-06-23  0:30   ` John Hubbard
  2020-06-24  7:23   ` Christoph Hellwig
  2020-06-22 23:38 ` [RESEND PATCH 3/3] nouveau: make nvkm_vmm_ctor() and nvkm_mmu_ptp_get() static Ralph Campbell
  2 siblings, 2 replies; 15+ messages in thread
From: Ralph Campbell @ 2020-06-22 23:38 UTC (permalink / raw)
  To: nouveau, linux-kernel
  Cc: Jerome Glisse, John Hubbard, Christoph Hellwig, Jason Gunthorpe,
	Ben Skeggs, Ralph Campbell

The OpenCL function clEnqueueSVMMigrateMem(), without any flags, will
migrate memory in the given address range to device private memory. The
source pages might already have been migrated to device private memory.
In that case, the source struct page is not checked to see if it is
a device private page and incorrectly computes the GPU's physical
address of local memory leading to data corruption.
Fix this by checking the source struct page and computing the correct
physical address.

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
---
 drivers/gpu/drm/nouveau/nouveau_dmem.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index cc9993837508..f6a806ba3caa 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -540,6 +540,12 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
 	if (!(src & MIGRATE_PFN_MIGRATE))
 		goto out;
 
+	if (spage && is_device_private_page(spage)) {
+		paddr = nouveau_dmem_page_addr(spage);
+		*dma_addr = DMA_MAPPING_ERROR;
+		goto done;
+	}
+
 	dpage = nouveau_dmem_page_alloc_locked(drm);
 	if (!dpage)
 		goto out;
@@ -560,6 +566,7 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
 			goto out_free_page;
 	}
 
+done:
 	*pfn = NVIF_VMM_PFNMAP_V0_V | NVIF_VMM_PFNMAP_V0_VRAM |
 		((paddr >> PAGE_SHIFT) << NVIF_VMM_PFNMAP_V0_ADDR_SHIFT);
 	if (src & MIGRATE_PFN_WRITE)
@@ -615,6 +622,7 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
 	struct migrate_vma args = {
 		.vma		= vma,
 		.start		= start,
+		.src_owner	= drm->dev,
 	};
 	unsigned long i;
 	u64 *pfns;
-- 
2.20.1



* [RESEND PATCH 3/3] nouveau: make nvkm_vmm_ctor() and nvkm_mmu_ptp_get() static
  2020-06-22 23:38 [RESEND PATCH 0/3] nouveau: fixes for SVM Ralph Campbell
  2020-06-22 23:38 ` [RESEND PATCH 1/3] nouveau: fix migrate page regression Ralph Campbell
  2020-06-22 23:38 ` [RESEND PATCH 2/3] nouveau: fix mixed normal and device private page migration Ralph Campbell
@ 2020-06-22 23:38 ` Ralph Campbell
  2020-06-23  0:57   ` John Hubbard
  2 siblings, 1 reply; 15+ messages in thread
From: Ralph Campbell @ 2020-06-22 23:38 UTC (permalink / raw)
  To: nouveau, linux-kernel
  Cc: Jerome Glisse, John Hubbard, Christoph Hellwig, Jason Gunthorpe,
	Ben Skeggs, Ralph Campbell

The functions nvkm_vmm_ctor() and nvkm_mmu_ptp_get() are not called outside
of the file defining them so make them static.

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
---
 drivers/gpu/drm/nouveau/nvkm/subdev/mmu/base.c | 2 +-
 drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c  | 2 +-
 drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h  | 3 ---
 3 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/base.c b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/base.c
index ee11ccaf0563..de91e9a26172 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/base.c
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/base.c
@@ -61,7 +61,7 @@ nvkm_mmu_ptp_put(struct nvkm_mmu *mmu, bool force, struct nvkm_mmu_pt *pt)
 	kfree(pt);
 }
 
-struct nvkm_mmu_pt *
+static struct nvkm_mmu_pt *
 nvkm_mmu_ptp_get(struct nvkm_mmu *mmu, u32 size, bool zero)
 {
 	struct nvkm_mmu_pt *pt;
diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
index 199f94e15c5f..67b00dcef4b8 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
@@ -1030,7 +1030,7 @@ nvkm_vmm_ctor_managed(struct nvkm_vmm *vmm, u64 addr, u64 size)
 	return 0;
 }
 
-int
+static int
 nvkm_vmm_ctor(const struct nvkm_vmm_func *func, struct nvkm_mmu *mmu,
 	      u32 pd_header, bool managed, u64 addr, u64 size,
 	      struct lock_class_key *key, const char *name,
diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h
index d3f8f916d0db..a2b179568970 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h
@@ -163,9 +163,6 @@ int nvkm_vmm_new_(const struct nvkm_vmm_func *, struct nvkm_mmu *,
 		  u32 pd_header, bool managed, u64 addr, u64 size,
 		  struct lock_class_key *, const char *name,
 		  struct nvkm_vmm **);
-int nvkm_vmm_ctor(const struct nvkm_vmm_func *, struct nvkm_mmu *,
-		  u32 pd_header, bool managed, u64 addr, u64 size,
-		  struct lock_class_key *, const char *name, struct nvkm_vmm *);
 struct nvkm_vma *nvkm_vmm_node_search(struct nvkm_vmm *, u64 addr);
 struct nvkm_vma *nvkm_vmm_node_split(struct nvkm_vmm *, struct nvkm_vma *,
 				     u64 addr, u64 size);
-- 
2.20.1



* Re: [RESEND PATCH 2/3] nouveau: fix mixed normal and device private page migration
  2020-06-22 23:38 ` [RESEND PATCH 2/3] nouveau: fix mixed normal and device private page migration Ralph Campbell
@ 2020-06-23  0:30   ` John Hubbard
  2020-06-23  1:42     ` Ralph Campbell
  2020-06-24  7:23   ` Christoph Hellwig
  1 sibling, 1 reply; 15+ messages in thread
From: John Hubbard @ 2020-06-23  0:30 UTC (permalink / raw)
  To: Ralph Campbell, nouveau, linux-kernel
  Cc: Jerome Glisse, Christoph Hellwig, Jason Gunthorpe, Ben Skeggs

On 2020-06-22 16:38, Ralph Campbell wrote:
> The OpenCL function clEnqueueSVMMigrateMem(), without any flags, will
> migrate memory in the given address range to device private memory. The
> source pages might already have been migrated to device private memory.
> In that case, the source struct page is not checked to see if it is
> a device private page and incorrectly computes the GPU's physical
> address of local memory leading to data corruption.
> Fix this by checking the source struct page and computing the correct
> physical address.
> 
> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
> ---
>   drivers/gpu/drm/nouveau/nouveau_dmem.c | 8 ++++++++
>   1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> index cc9993837508..f6a806ba3caa 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> @@ -540,6 +540,12 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
>   	if (!(src & MIGRATE_PFN_MIGRATE))
>   		goto out;
>   
> +	if (spage && is_device_private_page(spage)) {
> +		paddr = nouveau_dmem_page_addr(spage);
> +		*dma_addr = DMA_MAPPING_ERROR;
> +		goto done;
> +	}
> +
>   	dpage = nouveau_dmem_page_alloc_locked(drm);
>   	if (!dpage)
>   		goto out;
> @@ -560,6 +566,7 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
>   			goto out_free_page;
>   	}
>   
> +done:
>   	*pfn = NVIF_VMM_PFNMAP_V0_V | NVIF_VMM_PFNMAP_V0_VRAM |
>   		((paddr >> PAGE_SHIFT) << NVIF_VMM_PFNMAP_V0_ADDR_SHIFT);
>   	if (src & MIGRATE_PFN_WRITE)
> @@ -615,6 +622,7 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
>   	struct migrate_vma args = {
>   		.vma		= vma,
>   		.start		= start,
> +		.src_owner	= drm->dev,

Hi Ralph,

This .src_owner setting does look like a required fix, but it seems like
a completely separate fix from what is listed in this patch's commit
description, right? (It feels like a casualty of rearranging the patches.)


thanks,
-- 
John Hubbard
NVIDIA


* Re: [RESEND PATCH 1/3] nouveau: fix migrate page regression
  2020-06-22 23:38 ` [RESEND PATCH 1/3] nouveau: fix migrate page regression Ralph Campbell
@ 2020-06-23  0:51   ` John Hubbard
  2020-06-25  5:23     ` [Nouveau] " Ben Skeggs
  0 siblings, 1 reply; 15+ messages in thread
From: John Hubbard @ 2020-06-23  0:51 UTC (permalink / raw)
  To: Ralph Campbell, nouveau, linux-kernel
  Cc: Jerome Glisse, Christoph Hellwig, Jason Gunthorpe, Ben Skeggs

On 2020-06-22 16:38, Ralph Campbell wrote:
> The patch to add zero page migration to GPU memory inadvertantly included

inadvertently

> part of a future change which broke normal page migration to GPU memory
> by copying too much data and corrupting GPU memory.
> Fix this by only copying one page instead of a byte count.
> 
> Fixes: 9d4296a7d4b3 ("drm/nouveau/nouveau/hmm: fix migrate zero page to GPU")
> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
> ---
>   drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> index e5c230d9ae24..cc9993837508 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> @@ -550,7 +550,7 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
>   					 DMA_BIDIRECTIONAL);
>   		if (dma_mapping_error(dev, *dma_addr))
>   			goto out_free_page;
> -		if (drm->dmem->migrate.copy_func(drm, page_size(spage),
> +		if (drm->dmem->migrate.copy_func(drm, 1,
>   			NOUVEAU_APER_VRAM, paddr, NOUVEAU_APER_HOST, *dma_addr))
>   			goto out_dma_unmap;
>   	} else {
>


I Am Not A Nouveau Expert, nor is it really clear to me how
page_size(spage) came to contain something other than a page's worth of
byte count, but this fix looks accurate to me. It's better for
maintenance, too, because the function never intends to migrate "some
number of bytes". It intends to migrate exactly one page.

Hope I'm not missing something fundamental, but:

Reviewed-by: John Hubbard <jhubbard@nvidia.com>


thanks,
-- 
John Hubbard
NVIDIA
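
A minimal sketch of the unit mismatch discussed in this review, written as a
user-space C model rather than nouveau code. It assumes a 4 KiB base page and
a copy hook whose length argument is a page count; the names model_copy_bytes()
and model_overrun_bytes() are hypothetical and exist only for illustration.

```c
#include <assert.h>

#define MODEL_PAGE_SIZE 4096UL	/* assumption: 4 KiB base page */

/*
 * Hypothetical stand-in for a copy hook whose length argument is a
 * page count. Returns the number of bytes such a copy would touch.
 */
static unsigned long model_copy_bytes(unsigned long npages)
{
	return npages * MODEL_PAGE_SIZE;
}

/* Bytes written beyond the single destination page that was allocated. */
static unsigned long model_overrun_bytes(unsigned long npages)
{
	unsigned long copied = model_copy_bytes(npages);

	assert(copied >= MODEL_PAGE_SIZE);	/* at least one page requested */
	return copied - MODEL_PAGE_SIZE;
}
```

In this model, passing a byte count such as 4096 where a page count is
expected requests 4096 pages (16 MiB) instead of one page, so nearly all of
the copy lands outside the destination page. Passing 1 gives an overrun of
zero bytes.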


* Re: [RESEND PATCH 3/3] nouveau: make nvkm_vmm_ctor() and nvkm_mmu_ptp_get() static
  2020-06-22 23:38 ` [RESEND PATCH 3/3] nouveau: make nvkm_vmm_ctor() and nvkm_mmu_ptp_get() static Ralph Campbell
@ 2020-06-23  0:57   ` John Hubbard
  0 siblings, 0 replies; 15+ messages in thread
From: John Hubbard @ 2020-06-23  0:57 UTC (permalink / raw)
  To: Ralph Campbell, nouveau, linux-kernel
  Cc: Jerome Glisse, Christoph Hellwig, Jason Gunthorpe, Ben Skeggs

On 2020-06-22 16:38, Ralph Campbell wrote:
> The functions nvkm_vmm_ctor() and nvkm_mmu_ptp_get() are not called outside
> of the file defining them so make them static.
> 
> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
> ---
>   drivers/gpu/drm/nouveau/nvkm/subdev/mmu/base.c | 2 +-
>   drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c  | 2 +-
>   drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h  | 3 ---
>   3 files changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/base.c b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/base.c
> index ee11ccaf0563..de91e9a26172 100644
> --- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/base.c
> +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/base.c
> @@ -61,7 +61,7 @@ nvkm_mmu_ptp_put(struct nvkm_mmu *mmu, bool force, struct nvkm_mmu_pt *pt)
>   	kfree(pt);
>   }
>   
> -struct nvkm_mmu_pt *
> +static struct nvkm_mmu_pt *
>   nvkm_mmu_ptp_get(struct nvkm_mmu *mmu, u32 size, bool zero)
>   {
>   	struct nvkm_mmu_pt *pt;
> diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
> index 199f94e15c5f..67b00dcef4b8 100644
> --- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
> +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
> @@ -1030,7 +1030,7 @@ nvkm_vmm_ctor_managed(struct nvkm_vmm *vmm, u64 addr, u64 size)
>   	return 0;
>   }
>   
> -int
> +static int
>   nvkm_vmm_ctor(const struct nvkm_vmm_func *func, struct nvkm_mmu *mmu,
>   	      u32 pd_header, bool managed, u64 addr, u64 size,
>   	      struct lock_class_key *key, const char *name,
> diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h
> index d3f8f916d0db..a2b179568970 100644
> --- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h
> +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h
> @@ -163,9 +163,6 @@ int nvkm_vmm_new_(const struct nvkm_vmm_func *, struct nvkm_mmu *,
>   		  u32 pd_header, bool managed, u64 addr, u64 size,
>   		  struct lock_class_key *, const char *name,
>   		  struct nvkm_vmm **);
> -int nvkm_vmm_ctor(const struct nvkm_vmm_func *, struct nvkm_mmu *,
> -		  u32 pd_header, bool managed, u64 addr, u64 size,
> -		  struct lock_class_key *, const char *name, struct nvkm_vmm *);
>   struct nvkm_vma *nvkm_vmm_node_search(struct nvkm_vmm *, u64 addr);
>   struct nvkm_vma *nvkm_vmm_node_split(struct nvkm_vmm *, struct nvkm_vma *,
>   				     u64 addr, u64 size);
> 

Looks accurate: the order within vmm.c (now that there is no .h
declaration) is still good, and I found no other uses of either function
within the linux.git tree, so


Reviewed-by: John Hubbard <jhubbard@nvidia.com>


thanks,
-- 
John Hubbard
NVIDIA


* Re: [RESEND PATCH 2/3] nouveau: fix mixed normal and device private page migration
  2020-06-23  0:30   ` John Hubbard
@ 2020-06-23  1:42     ` Ralph Campbell
  0 siblings, 0 replies; 15+ messages in thread
From: Ralph Campbell @ 2020-06-23  1:42 UTC (permalink / raw)
  To: John Hubbard, nouveau, linux-kernel
  Cc: Jerome Glisse, Christoph Hellwig, Jason Gunthorpe, Ben Skeggs


On 6/22/20 5:30 PM, John Hubbard wrote:
> On 2020-06-22 16:38, Ralph Campbell wrote:
>> The OpenCL function clEnqueueSVMMigrateMem(), without any flags, will
>> migrate memory in the given address range to device private memory. The
>> source pages might already have been migrated to device private memory.
>> In that case, the source struct page is not checked to see if it is
>> a device private page and incorrectly computes the GPU's physical
>> address of local memory leading to data corruption.
>> Fix this by checking the source struct page and computing the correct
>> physical address.
>>
>> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
>> ---
>>   drivers/gpu/drm/nouveau/nouveau_dmem.c | 8 ++++++++
>>   1 file changed, 8 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
>> index cc9993837508..f6a806ba3caa 100644
>> --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
>> +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
>> @@ -540,6 +540,12 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
>>       if (!(src & MIGRATE_PFN_MIGRATE))
>>           goto out;
>> +    if (spage && is_device_private_page(spage)) {
>> +        paddr = nouveau_dmem_page_addr(spage);
>> +        *dma_addr = DMA_MAPPING_ERROR;
>> +        goto done;
>> +    }
>> +
>>       dpage = nouveau_dmem_page_alloc_locked(drm);
>>       if (!dpage)
>>           goto out;
>> @@ -560,6 +566,7 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
>>               goto out_free_page;
>>       }
>> +done:
>>       *pfn = NVIF_VMM_PFNMAP_V0_V | NVIF_VMM_PFNMAP_V0_VRAM |
>>           ((paddr >> PAGE_SHIFT) << NVIF_VMM_PFNMAP_V0_ADDR_SHIFT);
>>       if (src & MIGRATE_PFN_WRITE)
>> @@ -615,6 +622,7 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
>>       struct migrate_vma args = {
>>           .vma        = vma,
>>           .start        = start,
>> +        .src_owner    = drm->dev,
> 
> Hi Ralph,
> 
> This .src_owner setting does look like a required fix, but it seems like
> a completely separate fix from what is listed in this patch's commit
> description, right? (It feels like a casualty of rearranging the patches.)
> 
> 
> thanks,

It's a bit more complex. There is a catch-22 here with the change to
mm/migrate.c:

With neither this patch nor the mm/migrate.c change, a second call to
clEnqueueSVMMigrateMem() for the same address range will invalidate the GPU
mapping to device private memory created by the first call.

With this patch but not the mm/migrate.c change, the first call to
clEnqueueSVMMigrateMem() will fail to migrate normal anonymous memory to
device private memory.

With the mm/migrate.c change but not this patch, a second call to
clEnqueueSVMMigrateMem() will crash the kernel because dma_map_page() will be
called with the device private PFN, which is not a valid CPU physical address.

With both changes, a range of anonymous and device private pages can be
migrated to the GPU and the GPU page tables are updated properly.
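
The four combinations described above can be written down as a small lookup,
purely as a reading aid. This is a user-space C sketch; the enum and function
names are hypothetical, and the outcomes are copied from the description in
this message, not derived from kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Outcomes as described in the message above (names are illustrative). */
enum model_outcome {
	MODEL_SECOND_CALL_INVALIDATES_GPU_MAPPING,
	MODEL_FIRST_CALL_FAILS_TO_MIGRATE,
	MODEL_SECOND_CALL_CRASHES,
	MODEL_MIXED_RANGE_MIGRATES,
};

/* Map "which changes are applied" to the described behavior. */
static enum model_outcome model_outcome_for(bool nouveau_patch,
					    bool mm_migrate_fix)
{
	if (!nouveau_patch && !mm_migrate_fix)
		return MODEL_SECOND_CALL_INVALIDATES_GPU_MAPPING;
	if (nouveau_patch && !mm_migrate_fix)
		return MODEL_FIRST_CALL_FAILS_TO_MIGRATE;
	if (!nouveau_patch && mm_migrate_fix)
		return MODEL_SECOND_CALL_CRASHES;
	return MODEL_MIXED_RANGE_MIGRATES;
}

/* Human-readable form of each outcome. */
static const char *model_outcome_text(enum model_outcome outcome)
{
	switch (outcome) {
	case MODEL_SECOND_CALL_INVALIDATES_GPU_MAPPING:
		return "second call invalidates the GPU mapping";
	case MODEL_FIRST_CALL_FAILS_TO_MIGRATE:
		return "first call fails to migrate normal memory";
	case MODEL_SECOND_CALL_CRASHES:
		return "second call crashes in dma_map_page()";
	case MODEL_MIXED_RANGE_MIGRATES:
		return "mixed range migrates correctly";
	}
	assert(!"unknown outcome");
	return "unknown";
}
```

Only the last row, with both changes applied, gives the intended behavior,
which is why neither change is a complete fix when tested alone.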


* Re: [RESEND PATCH 2/3] nouveau: fix mixed normal and device private page migration
  2020-06-22 23:38 ` [RESEND PATCH 2/3] nouveau: fix mixed normal and device private page migration Ralph Campbell
  2020-06-23  0:30   ` John Hubbard
@ 2020-06-24  7:23   ` Christoph Hellwig
  2020-06-24 18:10     ` Ralph Campbell
  1 sibling, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2020-06-24  7:23 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: nouveau, linux-kernel, Jerome Glisse, John Hubbard,
	Christoph Hellwig, Jason Gunthorpe, Ben Skeggs

On Mon, Jun 22, 2020 at 04:38:53PM -0700, Ralph Campbell wrote:
> The OpenCL function clEnqueueSVMMigrateMem(), without any flags, will
> migrate memory in the given address range to device private memory. The
> source pages might already have been migrated to device private memory.
> In that case, the source struct page is not checked to see if it is
> a device private page and incorrectly computes the GPU's physical
> address of local memory leading to data corruption.
> Fix this by checking the source struct page and computing the correct
> physical address.

I'm really worried about all this delicate code to fix the mixed
ranges.  Can't we make it clear at the migrate_vma_* level if we want
to migrate from or to device private memory, and then skip all the work
for regions of memory that already are in the right place?  This might be
a little more work initially, but I think it leads to a much better
API.


* Re: [RESEND PATCH 2/3] nouveau: fix mixed normal and device private page migration
  2020-06-24  7:23   ` Christoph Hellwig
@ 2020-06-24 18:10     ` Ralph Campbell
  2020-06-25 17:25       ` Ralph Campbell
  0 siblings, 1 reply; 15+ messages in thread
From: Ralph Campbell @ 2020-06-24 18:10 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: nouveau, linux-kernel, Jerome Glisse, John Hubbard,
	Jason Gunthorpe, Ben Skeggs


On 6/24/20 12:23 AM, Christoph Hellwig wrote:
> On Mon, Jun 22, 2020 at 04:38:53PM -0700, Ralph Campbell wrote:
>> The OpenCL function clEnqueueSVMMigrateMem(), without any flags, will
>> migrate memory in the given address range to device private memory. The
>> source pages might already have been migrated to device private memory.
>> In that case, the source struct page is not checked to see if it is
>> a device private page and incorrectly computes the GPU's physical
>> address of local memory leading to data corruption.
>> Fix this by checking the source struct page and computing the correct
>> physical address.
> 
> I'm really worried about all this delicate code to fix the mixed
> ranges.  Can't we make it clear at the migrate_vma_* level if we want
> to migrate from or to device private memory, and then skip all the work
> for regions of memory that already are in the right place?  This might be
> a little more work initially, but I think it leads to a much better
> API.
> 

The current code does encode the direction with src_owner != NULL meaning
device private to system memory and src_owner == NULL meaning system
memory to device private memory. This patch would obviously defeat that
so perhaps a flag could be added to the struct migrate_vma to indicate the
direction but I'm unclear how that makes things less delicate.
Can you expand on what you are worried about?
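
To illustrate the direction encoding described above, here is a user-space C
sketch of a page filter keyed on src_owner. It is an assumption-laden model,
not the code in mm/migrate.c: struct model_page and model_collects() are
hypothetical names, and the rule is taken from the paragraph above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a source page. */
struct model_page {
	bool device_private;	/* true if the page lives in device memory */
	const void *owner;	/* owning driver for device private pages */
};

/*
 * Return true if the page would be selected for migration.
 * src_owner == NULL: system memory to device private, so only normal
 * pages are selected.
 * src_owner != NULL: device private to system memory, so only device
 * private pages belonging to that owner are selected.
 */
static bool model_collects(const struct model_page *page,
			   const void *src_owner)
{
	assert(page != NULL);

	if (page->device_private)
		return page->owner == src_owner;
	return src_owner == NULL;
}
```

Under this model, a range holding both normal and device private pages can
never be fully selected by one value of src_owner, which is the mixed-range
case this patch runs into.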

The issue with invalidations might be better addressed by letting the device
driver handle device private page TLB invalidations when migrating to
system memory and changing migrate_vma_setup() to only invalidate CPU
TLB entries for normal pages being migrated to device private memory.
If a page isn't migrating, it seems inefficient to invalidate those TLB
entries.

Any other suggestions?


* Re: [Nouveau] [RESEND PATCH 1/3] nouveau: fix migrate page regression
  2020-06-23  0:51   ` John Hubbard
@ 2020-06-25  5:23     ` Ben Skeggs
  2020-06-25  5:25       ` Ben Skeggs
  0 siblings, 1 reply; 15+ messages in thread
From: Ben Skeggs @ 2020-06-25  5:23 UTC (permalink / raw)
  To: John Hubbard
  Cc: Ralph Campbell, ML nouveau, LKML, Jason Gunthorpe,
	Christoph Hellwig, Ben Skeggs

On Tue, 23 Jun 2020 at 10:51, John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 2020-06-22 16:38, Ralph Campbell wrote:
> > The patch to add zero page migration to GPU memory inadvertantly included
>
> inadvertently
>
> > part of a future change which broke normal page migration to GPU memory
> > by copying too much data and corrupting GPU memory.
> > Fix this by only copying one page instead of a byte count.
> >
> > Fixes: 9d4296a7d4b3 ("drm/nouveau/nouveau/hmm: fix migrate zero page to GPU")
> > Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
> > ---
> >   drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> > index e5c230d9ae24..cc9993837508 100644
> > --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
> > +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> > @@ -550,7 +550,7 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
> >                                        DMA_BIDIRECTIONAL);
> >               if (dma_mapping_error(dev, *dma_addr))
> >                       goto out_free_page;
> > -             if (drm->dmem->migrate.copy_func(drm, page_size(spage),
> > +             if (drm->dmem->migrate.copy_func(drm, 1,
> >                       NOUVEAU_APER_VRAM, paddr, NOUVEAU_APER_HOST, *dma_addr))
> >                       goto out_dma_unmap;
> >       } else {
> >
>
>
> I Am Not A Nouveau Expert, nor is it really clear to me how
> page_size(spage) came to contain something other than a page's worth of
> byte count, but this fix looks accurate to me. It's better for
> maintenance, too, because the function never intends to migrate "some
> number of bytes". It intends to migrate exactly one page.
>
> Hope I'm not missing something fundamental, but:
I'm actually a bit confused here too, because it *looks* like the
function takes a byte count, not a page count, and unless I'm missing
something too, it has set up the copy class for a byte count as well.

>
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>
>
>
> thanks,
> --
> John Hubbard
> NVIDIA
> _______________________________________________
> Nouveau mailing list
> Nouveau@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/nouveau


* Re: [Nouveau] [RESEND PATCH 1/3] nouveau: fix migrate page regression
  2020-06-25  5:23     ` [Nouveau] " Ben Skeggs
@ 2020-06-25  5:25       ` Ben Skeggs
  0 siblings, 0 replies; 15+ messages in thread
From: Ben Skeggs @ 2020-06-25  5:25 UTC (permalink / raw)
  To: John Hubbard
  Cc: Ralph Campbell, ML nouveau, LKML, Jason Gunthorpe,
	Christoph Hellwig, Ben Skeggs

On Thu, 25 Jun 2020 at 15:23, Ben Skeggs <skeggsb@gmail.com> wrote:
>
> On Tue, 23 Jun 2020 at 10:51, John Hubbard <jhubbard@nvidia.com> wrote:
> >
> > On 2020-06-22 16:38, Ralph Campbell wrote:
> > > The patch to add zero page migration to GPU memory inadvertantly included
> >
> > inadvertently
> >
> > > part of a future change which broke normal page migration to GPU memory
> > > by copying too much data and corrupting GPU memory.
> > > Fix this by only copying one page instead of a byte count.
> > >
> > > Fixes: 9d4296a7d4b3 ("drm/nouveau/nouveau/hmm: fix migrate zero page to GPU")
> > > Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
> > > ---
> > >   drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +-
> > >   1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> > > index e5c230d9ae24..cc9993837508 100644
> > > --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
> > > +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> > > @@ -550,7 +550,7 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
> > >                                        DMA_BIDIRECTIONAL);
> > >               if (dma_mapping_error(dev, *dma_addr))
> > >                       goto out_free_page;
> > > -             if (drm->dmem->migrate.copy_func(drm, page_size(spage),
> > > +             if (drm->dmem->migrate.copy_func(drm, 1,
> > >                       NOUVEAU_APER_VRAM, paddr, NOUVEAU_APER_HOST, *dma_addr))
> > >                       goto out_dma_unmap;
> > >       } else {
> > >
> >
> >
> > I Am Not A Nouveau Expert, nor is it really clear to me how
> > page_size(spage) came to contain something other than a page's worth of
> > byte count, but this fix looks accurate to me. It's better for
> > maintenance, too, because the function never intends to migrate "some
> > number of bytes". It intends to migrate exactly one page.
> >
> > Hope I'm not missing something fundamental, but:
> I'm actually a bit confused here too, because it *looks* like the
> function takes a byte count, not a page count, and unless I'm missing
> something too, it has set up the copy class for a byte count as well.
No, never mind. I was looking at nvc0b5_migrate_clear() :)

>
> >
> > Reviewed-by: John Hubbard <jhubbard@nvidia.com>
> >
> >
> > thanks,
> > --
> > John Hubbard
> > NVIDIA


* Re: [RESEND PATCH 2/3] nouveau: fix mixed normal and device private page migration
  2020-06-24 18:10     ` Ralph Campbell
@ 2020-06-25 17:25       ` Ralph Campbell
  2020-06-25 17:31         ` Jason Gunthorpe
  0 siblings, 1 reply; 15+ messages in thread
From: Ralph Campbell @ 2020-06-25 17:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: nouveau, linux-kernel, Jerome Glisse, John Hubbard,
	Jason Gunthorpe, Ben Skeggs, Andrew Morton, linux-mm,
	Bharata B Rao

Making sure to include linux-mm and Bharata B Rao for IBM's
use of migrate_vma*().

On 6/24/20 11:10 AM, Ralph Campbell wrote:
> 
> On 6/24/20 12:23 AM, Christoph Hellwig wrote:
>> On Mon, Jun 22, 2020 at 04:38:53PM -0700, Ralph Campbell wrote:
>>> The OpenCL function clEnqueueSVMMigrateMem(), without any flags, will
>>> migrate memory in the given address range to device private memory. The
>>> source pages might already have been migrated to device private memory.
>>> In that case, the source struct page is not checked to see if it is
>>> a device private page and incorrectly computes the GPU's physical
>>> address of local memory leading to data corruption.
>>> Fix this by checking the source struct page and computing the correct
>>> physical address.
>>
>> I'm really worried about all this delicate code to fix the mixed
>> ranges.  Can't we make it clear at the migrate_vma_* level if we want
>> to migrate from or to device private memory, and then skip all the work
>> for regions of memory that already are in the right place?  This might be
>> a little more work initially, but I think it leads to a much better
>> API.
>>
> 
> The current code does encode the direction with src_owner != NULL meaning
> device private to system memory and src_owner == NULL meaning system
> memory to device private memory. This patch would obviously defeat that
> so perhaps a flag could be added to the struct migrate_vma to indicate the
> direction but I'm unclear how that makes things less delicate.
> Can you expand on what you are worried about?
> 
> The issue with invalidations might be better addressed by letting the device
> driver handle device private page TLB invalidations when migrating to
> system memory and changing migrate_vma_setup() to only invalidate CPU
> TLB entries for normal pages being migrated to device private memory.
> If a page isn't migrating, it seems inefficient to invalidate those TLB
> entries.
> 
> Any other suggestions?

After a night's sleep, I think this might work. What do others think?

1) Add a new MMU_NOTIFY_MIGRATE enum to mmu_notifier_event.

2) Change migrate_vma_collect() to use the new MMU_NOTIFY_MIGRATE event type.

3) Modify nouveau_svmm_invalidate_range_start() to simply return (no invalidations)
for MMU_NOTIFY_MIGRATE mmu notifier callbacks.

4) Leave the src_owner check in migrate_vma_collect_pmd() for normal pages so if the
device driver is migrating normal pages to device private memory, the driver would
set src_owner = NULL and already migrated device private pages would be skipped.
Since the mmu notifier callback did nothing, the device private entries remain valid
in the device's MMU. migrate_vma_collect_pmd() would still invalidate the CPU page
tables for migrated normal pages.
If the driver is migrating device private pages to system memory, it would set
src_owner != NULL and normal pages would be skipped, but now the device driver has to
invalidate device MMU mappings in the "alloc and copy" step before doing the copy.
This would be after migrate_vma_setup() returns, so the list of migrating device
pages is known to the driver.

The rest of migrate_vma_pages() and migrate_vma_finalize() remains the same.
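
Steps 1, 3 and 4 above can be modelled in a small userspace sketch. This is
not kernel code: the enum and struct below are simplified stand-ins for the
kernel types, MMU_NOTIFY_MIGRATE is the value proposed in step 1, and
struct model_page plus the two model_*() helpers are hypothetical names used
only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's enum mmu_notifier_event. */
enum mmu_notifier_event {
	MMU_NOTIFY_UNMAP,
	MMU_NOTIFY_CLEAR,
	MMU_NOTIFY_MIGRATE,	/* proposed in step 1 */
};

/* Simplified stand-in for struct mmu_notifier_range. */
struct mmu_notifier_range {
	unsigned long start;
	unsigned long end;
	enum mmu_notifier_event event;
};

/* Hypothetical page model: pgmap_owner is NULL for normal system memory. */
struct model_page {
	const void *pgmap_owner;
};

/*
 * Step 3: the driver's invalidate callback does nothing for
 * MMU_NOTIFY_MIGRATE. Returns true if device MMU mappings must be
 * invalidated for this range.
 */
static bool model_driver_must_invalidate(const struct mmu_notifier_range *range)
{
	assert(range != NULL);
	return range->event != MMU_NOTIFY_MIGRATE;
}

/*
 * Step 4: the src_owner check. Returns true if the page is collected
 * for migration, false if it is skipped.
 */
static bool model_collect_page(const struct model_page *page,
			       const void *src_owner)
{
	assert(page != NULL);
	if (src_owner == NULL)
		/* System memory to device: only normal pages migrate. */
		return page->pgmap_owner == NULL;
	/* Device to system memory: only this owner's private pages migrate. */
	return page->pgmap_owner == src_owner;
}
```

With src_owner == NULL, a page that is already device private is skipped, and
because the callback returned early, its device MMU entry is left alone.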


* Re: [RESEND PATCH 2/3] nouveau: fix mixed normal and device private page migration
  2020-06-25 17:25       ` Ralph Campbell
@ 2020-06-25 17:31         ` Jason Gunthorpe
  2020-06-25 17:42           ` Ralph Campbell
  0 siblings, 1 reply; 15+ messages in thread
From: Jason Gunthorpe @ 2020-06-25 17:31 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: Christoph Hellwig, nouveau, linux-kernel, Jerome Glisse,
	John Hubbard, Ben Skeggs, Andrew Morton, linux-mm, Bharata B Rao

On Thu, Jun 25, 2020 at 10:25:38AM -0700, Ralph Campbell wrote:
> Making sure to include linux-mm and Bharata B Rao for IBM's
> use of migrate_vma*().
> 
> On 6/24/20 11:10 AM, Ralph Campbell wrote:
> > 
> > On 6/24/20 12:23 AM, Christoph Hellwig wrote:
> > > On Mon, Jun 22, 2020 at 04:38:53PM -0700, Ralph Campbell wrote:
> > > > The OpenCL function clEnqueueSVMMigrateMem(), without any flags, will
> > > > migrate memory in the given address range to device private memory. The
> > > > source pages might already have been migrated to device private memory.
> > > > In that case, the source struct page is not checked to see if it is
> > > > a device private page and incorrectly computes the GPU's physical
> > > > address of local memory leading to data corruption.
> > > > Fix this by checking the source struct page and computing the correct
> > > > physical address.
> > > 
> > > I'm really worried about all this delicate code to fix the mixed
> > > ranges.  Can't we make it clear at the migrate_vma_* level if we want
> > > to migrate from or to device private memory, and then skip all the work
> > > for regions of memory that already are in the right place?  This might be
> > > a little more work initially, but I think it leads to a much better
> > > API.
> > > 
> > 
> > The current code does encode the direction with src_owner != NULL meaning
> > device private to system memory and src_owner == NULL meaning system
> > memory to device private memory. This patch would obviously defeat that
> > so perhaps a flag could be added to the struct migrate_vma to indicate the
> > direction but I'm unclear how that makes things less delicate.
> > Can you expand on what you are worried about?
> > 
> > The issue with invalidations might be better addressed by letting the device
> > driver handle device private page TLB invalidations when migrating to
> > system memory and changing migrate_vma_setup() to only invalidate CPU
> > TLB entries for normal pages being migrated to device private memory.
> > If a page isn't migrating, it seems inefficient to invalidate those TLB
> > entries.
> > 
> > Any other suggestions?
> 
> After a night's sleep, I think this might work. What do others think?
> 
> 1) Add a new MMU_NOTIFY_MIGRATE enum to mmu_notifier_event.
> 
> 2) Change migrate_vma_collect() to use the new MMU_NOTIFY_MIGRATE event type.
>
> 3) Modify nouveau_svmm_invalidate_range_start() to simply return (no invalidations)
> for MMU_NOTIFY_MIGRATE mmu notifier callbacks.

Isn't it a bit of an assumption that migrate_vma_collect() is only
used by nouveau itself?

What if some other devices' device_private pages are being migrated?

Jason


* Re: [RESEND PATCH 2/3] nouveau: fix mixed normal and device private page migration
  2020-06-25 17:31         ` Jason Gunthorpe
@ 2020-06-25 17:42           ` Ralph Campbell
  0 siblings, 0 replies; 15+ messages in thread
From: Ralph Campbell @ 2020-06-25 17:42 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, nouveau, linux-kernel, Jerome Glisse,
	John Hubbard, Ben Skeggs, Andrew Morton, linux-mm, Bharata B Rao


On 6/25/20 10:31 AM, Jason Gunthorpe wrote:
> On Thu, Jun 25, 2020 at 10:25:38AM -0700, Ralph Campbell wrote:
>> Making sure to include linux-mm and Bharata B Rao for IBM's
>> use of migrate_vma*().
>>
>> On 6/24/20 11:10 AM, Ralph Campbell wrote:
>>>
>>> On 6/24/20 12:23 AM, Christoph Hellwig wrote:
>>>> On Mon, Jun 22, 2020 at 04:38:53PM -0700, Ralph Campbell wrote:
>>>>> The OpenCL function clEnqueueSVMMigrateMem(), without any flags, will
>>>>> migrate memory in the given address range to device private memory. The
>>>>> source pages might already have been migrated to device private memory.
>>>>> In that case, the source struct page is not checked to see if it is
>>>>> a device private page and incorrectly computes the GPU's physical
>>>>> address of local memory leading to data corruption.
>>>>> Fix this by checking the source struct page and computing the correct
>>>>> physical address.
>>>>
>>>> I'm really worried about all this delicate code to fix the mixed
>>>> ranges.  Can't we make it clear at the migrate_vma_* level if we want
>>>> to migrate from or to device private memory, and then skip all the work
>>>> for regions of memory that already are in the right place?  This might be
>>>> a little more work initially, but I think it leads to a much better
>>>> API.
>>>>
>>>
>>> The current code does encode the direction with src_owner != NULL meaning
>>> device private to system memory and src_owner == NULL meaning system
>>> memory to device private memory. This patch would obviously defeat that
>>> so perhaps a flag could be added to the struct migrate_vma to indicate the
>>> direction but I'm unclear how that makes things less delicate.
>>> Can you expand on what you are worried about?
>>>
>>> The issue with invalidations might be better addressed by letting the device
>>> driver handle device private page TLB invalidations when migrating to
>>> system memory and changing migrate_vma_setup() to only invalidate CPU
>>> TLB entries for normal pages being migrated to device private memory.
>>> If a page isn't migrating, it seems inefficient to invalidate those TLB
>>> entries.
>>>
>>> Any other suggestions?
>>
>> After a night's sleep, I think this might work. What do others think?
>>
>> 1) Add a new MMU_NOTIFY_MIGRATE enum to mmu_notifier_event.
>>
>> 2) Change migrate_vma_collect() to use the new MMU_NOTIFY_MIGRATE event type.
>>
>> 3) Modify nouveau_svmm_invalidate_range_start() to simply return (no invalidations)
>> for MMU_NOTIFY_MIGRATE mmu notifier callbacks.
> 
> Isn't it a bit of an assumption that migrate_vma_collect() is only
> used by nouveau itself?
> 
> What if some other devices' device_private pages are being migrated?
> 
> Jason
> 

Good point. The driver needs a way of knowing the callback is due to its own call
to migrate_vma_setup() and not some other migration invalidation.
How about adding a void pointer to struct mmu_notifier_range
which migrate_vma_collect() can set to src_owner? If the event is
MMU_NOTIFY_MIGRATE and the src_owner matches the void pointer, then the
callback should be the one the driver initiated.
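
A minimal userspace sketch of that check, under stated assumptions: the
struct is a simplified stand-in for struct mmu_notifier_range, the field name
migrate_owner is hypothetical (no name has been chosen for the void pointer),
and model_is_own_migration() is an illustrative helper, not an existing
function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's enum mmu_notifier_event. */
enum mmu_notifier_event {
	MMU_NOTIFY_UNMAP,
	MMU_NOTIFY_MIGRATE,	/* proposed event type */
};

/* Simplified stand-in for struct mmu_notifier_range. */
struct mmu_notifier_range {
	unsigned long start;
	unsigned long end;
	enum mmu_notifier_event event;
	void *migrate_owner;	/* hypothetical: set to src_owner by the caller */
};

/*
 * Returns true only when the callback was triggered by a migration that
 * this driver started, so it can skip the device invalidation. Any other
 * event, or a migration started by a different owner, returns false.
 */
static bool model_is_own_migration(const struct mmu_notifier_range *range,
				   const void *driver_owner)
{
	assert(range != NULL);
	if (driver_owner == NULL)
		return false;
	return range->event == MMU_NOTIFY_MIGRATE &&
	       range->migrate_owner == driver_owner;
}
```

The sketch only covers a non-NULL owner. How the system memory to device
private direction (src_owner == NULL) would be identified is left open here.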


end of thread, other threads:[~2020-06-25 17:42 UTC | newest]

Thread overview: 15+ messages
2020-06-22 23:38 [RESEND PATCH 0/3] nouveau: fixes for SVM Ralph Campbell
2020-06-22 23:38 ` [RESEND PATCH 1/3] nouveau: fix migrate page regression Ralph Campbell
2020-06-23  0:51   ` John Hubbard
2020-06-25  5:23     ` [Nouveau] " Ben Skeggs
2020-06-25  5:25       ` Ben Skeggs
2020-06-22 23:38 ` [RESEND PATCH 2/3] nouveau: fix mixed normal and device private page migration Ralph Campbell
2020-06-23  0:30   ` John Hubbard
2020-06-23  1:42     ` Ralph Campbell
2020-06-24  7:23   ` Christoph Hellwig
2020-06-24 18:10     ` Ralph Campbell
2020-06-25 17:25       ` Ralph Campbell
2020-06-25 17:31         ` Jason Gunthorpe
2020-06-25 17:42           ` Ralph Campbell
2020-06-22 23:38 ` [RESEND PATCH 3/3] nouveau: make nvkm_vmm_ctor() and nvkm_mmu_ptp_get() static Ralph Campbell
2020-06-23  0:57   ` John Hubbard
