From: Valmiki <valmikibow@gmail.com>
To: John Hubbard <jhubbard@nvidia.com>,
	Ralph Campbell <rcampbell@nvidia.com>,
	linux-mm@kvack.org
Cc: jglisse@redhat.com
Subject: Re: Regarding HMM
Date: Sun, 23 Aug 2020 18:51:35 +0530
Message-ID: <c1e7786a-76d8-93d6-b7db-91355d206920@gmail.com>
In-Reply-To: <9af4d56c-61f5-9367-28bf-b6f1236e90fa@nvidia.com>

On 19-08-2020 02:05 am, John Hubbard wrote:
> On 8/18/20 10:06 AM, Ralph Campbell wrote:
>>
>> On 8/18/20 12:15 AM, Valmiki wrote:
>>> Hi All,
>>>
>>> I'm trying to understand heterogeneous memory management (HMM), and
>>> I have the following doubts.
>>>
>>> If HMM is being used, do we not have to use a DMA controller on the
>>> device for memory transfers?
> 
> Hi,
> 
> Nothing about HMM either requires or prevents using DMA controllers.
> 
>>> Without DMA, if software is managing page faults and migrations,
>>> will there be any performance impact?
>>>
>>> Is HMM targeted at any specific use cases where there is no DMA
>>> controller on the device?
>>>
>>> Regards,
>>> Valmiki
>>>
>>
>> There are two APIs that are part of "HMM", and they are independent
>> of each other.
>>
>> hmm_range_fault() is for getting the physical address of a
>> system-resident memory page that a device can map. The page is not
>> pinned in the usual way, where I/O raises the page reference count to
>> pin it; instead, the device driver has to handle invalidation
>> callbacks and remove the device mapping when notified. This lets the
>> device access the page without moving it.
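>>
>> Very roughly, the driver-side loop looks like the sketch below. It is
>> simplified from the example in Documentation/vm/hmm.rst; the
>> driver_lock()/driver_unlock() calls and the device page table update
>> are placeholders for whatever the real driver does:
>>
>>   int driver_populate_range(struct mmu_interval_notifier *ni,
>>                             struct mm_struct *mm,
>>                             unsigned long start, unsigned long end,
>>                             unsigned long *pfns)
>>   {
>>           struct hmm_range range = {
>>                   .notifier = ni,
>>                   .start = start,
>>                   .end = end,
>>                   .hmm_pfns = pfns,
>>                   /* fault in any pages that are not present */
>>                   .default_flags = HMM_PFN_REQ_FAULT,
>>           };
>>           int ret;
>>
>> again:
>>           range.notifier_seq = mmu_interval_read_begin(ni);
>>           mmap_read_lock(mm);
>>           ret = hmm_range_fault(&range);
>>           mmap_read_unlock(mm);
>>           if (ret) {
>>                   if (ret == -EBUSY)
>>                           goto again;     /* range was invalidated */
>>                   return ret;
>>           }
>>
>>           driver_lock();
>>           /* An invalidation raced with the fault; start over. */
>>           if (mmu_interval_read_retry(ni, range.notifier_seq)) {
>>                   driver_unlock();
>>                   goto again;
>>           }
>>           /* pfns[] is now stable: program the device page table. */
>>           driver_unlock();
>>           return 0;
>>   }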
>>
>> migrate_vma_setup(), migrate_vma_pages(), and migrate_vma_finalize()
>> are used by the device driver to migrate data to device private
>> memory. After migration, the system memory is freed and the CPU page
>> table holds an invalid PTE that points to the device private struct
>> page (similar to a swap PTE). If the CPU process faults on that
>> address, there is a callback to the driver to migrate it back to
>> system memory. This is where device DMA engines can be used to copy
>> data to/from system memory and device private memory.
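>>
>> A compressed sketch of that flow follows. The
>> driver_alloc_device_private_page() and driver_copy_to_device()
>> helpers are made-up placeholders for the driver's allocation and copy
>> (DMA) steps; lib/test_hmm.c has a complete worked example:
>>
>>   struct migrate_vma args = {
>>           .vma = vma,
>>           .src = src_pfns,
>>           .dst = dst_pfns,
>>           .start = start,
>>           .end = end,
>>   };
>>   unsigned long i;
>>
>>   if (migrate_vma_setup(&args))
>>           return -EINVAL;
>>
>>   for (i = 0; i < args.npages; i++) {
>>           struct page *dpage;
>>
>>           /* Skip pages that could not be isolated for migration. */
>>           if (!(args.src[i] & MIGRATE_PFN_MIGRATE))
>>                   continue;
>>           dpage = driver_alloc_device_private_page();
>>           /* Copy the data; a device DMA engine fits here. */
>>           driver_copy_to_device(migrate_pfn_to_page(args.src[i]), dpage);
>>           args.dst[i] = migrate_pfn(page_to_pfn(dpage)) |
>>                         MIGRATE_PFN_LOCKED;
>>   }
>>
>>   migrate_vma_pages(&args);
>>   /* Entries that still have MIGRATE_PFN_MIGRATE set were migrated. */
>>   migrate_vma_finalize(&args);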
>>
>> The use case for the above is to be able to run code such as OpenCL
>> on GPUs and CPUs using the same virtual addresses without having to
>> call special memory allocators. In other words, just use mmap() and
>> malloc(), and not clSVMAlloc().
>>
>> There is a performance consideration here. If the GPU accesses the
>> data in system memory over PCIe, there is much less bandwidth
>> available than when accessing local GPU memory. If the data is going
>> to be accessed many times, it can be more efficient to migrate it to
>> local GPU memory. If the data is only accessed a few times, it is
>> probably more efficient to map system memory.
>>
> 
> Ralph, that's a good write-up!
> 
> Valmiki, did you already read Documentation/vm/hmm.rst before posting
> your question?
> 
> It's OK to say "no". I'm not asking in order to criticize, but in
> order to calibrate the documentation, because we should consider
> merging Ralph's write-up above into hmm.rst, depending on whether it
> helps (which I expect it does, but I'm tainted by having read hmm.rst
> too many times, and now I can't see what might be missing).
> 
> Any time someone new tries to understand the system, it's an
> opportunity to "unit test" the documentation. Ideally, hmm.rst would
> answer many of a first-time reader's questions; that's where we'd
> like to end up.
> 
> 
> thanks,
Hi John, I did give it an initial read, but Ralph's details on the
migrate_* APIs above helped a lot to clarify things. Yes, adding the
above details into hmm.rst would help beginners.

Regards,
Valmiki

