From: Ralph Campbell <rcampbell@nvidia.com>
To: Christoph Hellwig <hch@lst.de>
Cc: <nouveau@lists.freedesktop.org>, <linux-kernel@vger.kernel.org>,
	"Jerome Glisse" <jglisse@redhat.com>,
	John Hubbard <jhubbard@nvidia.com>,
	"Jason Gunthorpe" <jgg@mellanox.com>,
	Ben Skeggs <bskeggs@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>, <linux-mm@kvack.org>,
	Bharata B Rao <bharata@linux.ibm.com>
Subject: Re: [RESEND PATCH 2/3] nouveau: fix mixed normal and device private page migration
Date: Thu, 25 Jun 2020 10:25:38 -0700	[thread overview]
Message-ID: <a9aba057-3786-8204-f782-6e8f3c290b35@nvidia.com> (raw)
In-Reply-To: <330f6a82-d01d-db97-1dec-69346f41e707@nvidia.com>

Making sure to include linux-mm and Bharata B Rao for IBM's
use of migrate_vma*().

On 6/24/20 11:10 AM, Ralph Campbell wrote:
> 
> On 6/24/20 12:23 AM, Christoph Hellwig wrote:
>> On Mon, Jun 22, 2020 at 04:38:53PM -0700, Ralph Campbell wrote:
>>> The OpenCL function clEnqueueSVMMigrateMem(), without any flags, will
>>> migrate memory in the given address range to device private memory. The
>>> source pages might already have been migrated to device private memory.
>>> In that case, the source struct page is not checked to see if it is
>>> a device private page and incorrectly computes the GPU's physical
>>> address of local memory leading to data corruption.
>>> Fix this by checking the source struct page and computing the correct
>>> physical address.
>>
>> I'm really worried about all this delicate code to fix the mixed
>> ranges.  Can't we make it clear at the migrate_vma_* level if we want
>> to migrate from or to device private memory, and then skip all the work
>> for regions of memory that already are in the right place?  This might be
>> a little more work initially, but I think it leads to a much better
>> API.
>>
> 
> The current code does encode the direction with src_owner != NULL meaning
> device private to system memory and src_owner == NULL meaning system
> memory to device private memory. This patch would obviously defeat that,
> so perhaps a flag could be added to struct migrate_vma to indicate the
> direction, but I'm unclear how that makes things less delicate.
> Can you expand on what you are worried about?
> 
> The issue with invalidations might be better addressed by letting the device
> driver handle device private page TLB invalidations when migrating to
> system memory and changing migrate_vma_setup() to only invalidate CPU
> TLB entries for normal pages being migrated to device private memory.
> If a page isn't migrating, it seems inefficient to invalidate those TLB
> entries.
> 
> Any other suggestions?

After a night's sleep, I think this might work. What do others think?

1) Add a new MMU_NOTIFY_MIGRATE enum to mmu_notifier_event.

2) Change migrate_vma_collect() to use the new MMU_NOTIFY_MIGRATE event type.

3) Modify nouveau_svmm_invalidate_range_start() to simply return (no invalidations)
for MMU_NOTIFY_MIGRATE mmu notifier callbacks.

4) Leave the src_owner check in migrate_vma_collect_pmd() for normal pages. If the
device driver is migrating normal pages to device private memory, the driver would
set src_owner = NULL and already-migrated device private pages would be skipped.
Since the mmu notifier callback did nothing, the device private entries remain valid
in the device's MMU, while migrate_vma_collect_pmd() still invalidates the CPU page
tables for migrated normal pages.
If the driver is migrating device private pages to system memory, it would set
src_owner != NULL and normal pages would be skipped, but now the device driver has to
invalidate the device MMU mappings in the "alloc and copy" step before doing the copy.
This happens after migrate_vma_setup() returns, so the list of migrating device
pages is known to the driver.

The rest of migrate_vma_pages() and migrate_vma_finalize() would remain the same.

