2017-12-14 12:16 GMT+08:00 Jerome Glisse:

> On Thu, Dec 14, 2017 at 11:53:40AM +0800, Figo.zhang wrote:
> > 2017-12-14 11:16 GMT+08:00 Jerome Glisse:
> > > On Thu, Dec 14, 2017 at 10:48:36AM +0800, Figo.zhang wrote:
> > > > 2017-12-14 0:12 GMT+08:00 Jerome Glisse:
> > > > > On Wed, Dec 13, 2017 at 08:10:42PM +0800, Figo.zhang wrote:
>
> [...]
>
> > > This is not what happens. Here is the workflow with HMM mirror (note
> > > that physical addresses do not matter here, so I do not even reference
> > > them; it is all about virtual addresses):
> > >   1 There are 3 buffers a, b and r at given virtual addresses; both
> > >     CPU and GPU can access them (concurrently or not, this does not
> > >     matter).
> > >   2 The GPU can fault, so if any virtual address does not have a page
> > >     table entry inside the GPU page table, this triggers a page fault
> > >     that calls the HMM mirror helper to snapshot the CPU page table
> > >     into the GPU page table. If there is no physical memory backing
> > >     the virtual address (i.e. the CPU page table is also empty for the
> > >     given virtual address) then the regular page fault handler of the
> > >     kernel is invoked.
> >
> > So when the HMM mirror is done, the contents of the GPU page table
> > entry and the CPU page table entry are the same, right? So the GPU and
> > CPU can access the same physical address, and this physical address is
> > allocated by a CPU malloc syscall. Isn't that a conflict and race
> > condition, with CPU and GPU writing to this physical address
> > concurrently?
>
> Correct, and yes it is conflict free. PCIe platforms already support
> cache coherent access by a device to main memory (snoop transactions in
> the PCIe specification). Accesses can happen concurrently to the same
> byte and it behaves exactly the same as if two CPU cores tried to access
> the same byte.
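The mirror workflow quoted above can be sketched as a toy user-space model. This is purely illustrative: the dictionaries, `gpu_page_fault` and `cpu_fault_handler` names are made up for this sketch and are not kernel or HMM APIs.

```python
# Toy model of the HMM mirror fault flow described above (not kernel code).
cpu_page_table = {}   # virtual page -> physical page
gpu_page_table = {}   # the GPU's mirror of the same address space
next_phys = [0]

def cpu_fault_handler(vaddr):
    """Stand-in for the regular kernel fault path: allocate backing memory."""
    phys = next_phys[0]
    next_phys[0] += 1
    cpu_page_table[vaddr] = phys
    return phys

def gpu_page_fault(vaddr):
    """Stand-in for the HMM mirror helper: snapshot the CPU PTE into the
    GPU page table. If the CPU page table is also empty, the regular
    fault handler runs first, then the result is mirrored."""
    if vaddr not in cpu_page_table:
        cpu_fault_handler(vaddr)
    gpu_page_table[vaddr] = cpu_page_table[vaddr]
    return gpu_page_table[vaddr]

# Buffers a, b and r are just virtual addresses; after a GPU fault both
# page tables point to the same physical page, so no copy is involved.
for buf in ("a", "b", "r"):
    gpu_page_fault(buf)
assert cpu_page_table == gpu_page_table
```

The point of the sketch is that the GPU page table is only ever a snapshot of the CPU one, so both sides always resolve a virtual address to the same physical page.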
> > I saw these slides:
> > http://on-demand.gputechconf.com/gtc/2017/presentation/s7764_john-hubbardgpus-using-hmm-blur-the-lines-between-cpu-and-gpu.pdf
> >
> > On pages 22~23:
> > When a CPU page fault occurs:
> >   * UM (unified memory driver) copies page data to the CPU, unmaps from
> >     the GPU
> >   * HMM maps the page to the CPU
> >
> > When a GPU page fault occurs:
> >   * HMM has a malloc record buffer, so UM copies page data to the GPU
> >   * HMM unmaps the page from the CPU
> >
> > So these slides say there will be two copies, from CPU to GPU, and from
> > GPU to CPU. In this case (mul_mat_on_gpu()), does it really need two
> > copies in kernel space?
>
> This slide is for the case where you use device memory on a PCIe
> platform. When that happens only the device can access the virtual
> address backed by device memory. If the CPU tries to access such an
> address a page fault is triggered and the data is migrated back to
> regular memory, where both GPU and CPU can access it concurrently.
>
> And again this behavior only happens if you use the HMM non cache
> coherent device memory model. If you use the device cache coherent model
> with HMM then the CPU can access the device memory directly too and the
> above scenario never happens.
>
> Note that memory copies when data moves from device to system or from
> system to device memory are inevitable. This is exactly as with autoNUMA.
> Also note that in some cases things can get allocated directly on the GPU
> and never copied back to regular memory (only used by the GPU and freed
> once the GPU is done with them): the zero copy case. But I want to stress
> that the zero copy case is unlikely to happen for input buffers. Usually
> you do not get your input data set directly on the GPU but from network
> or disk, and you might do pre-processing on the CPU (uncompress input, or
> do something that is better done on the CPU). Then you feed your data to
> the GPU and you do the computation there.

Great! Very detailed explanation of HMM, thanks a lot.
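The two-copy, non-cache-coherent flow from the slides can be modeled with a toy page that lives either in system memory or in device memory, and migrates on a fault from the side that cannot see it. The class and function names here are invented for the sketch; this is not the UM/HMM driver code.

```python
# Toy model of the non-coherent device memory case: each fault on the
# "wrong" side copies the data over and unmaps it from the other side.
class Page:
    def __init__(self, data):
        self.data = data
        self.location = "system"   # currently mapped for the CPU
        self.copies = 0            # count of data migrations

def gpu_fault(page):
    """GPU touched the page: copy data to device memory, unmap from CPU."""
    if page.location != "device":
        page.copies += 1           # data copied system -> device
        page.location = "device"   # CPU mapping replaced by special entry

def cpu_fault(page):
    """CPU touched the page: copy data back to system memory, unmap GPU."""
    if page.location != "system":
        page.copies += 1           # data copied device -> system
        page.location = "system"

p = Page(data=[1.0, 2.0])
gpu_fault(p)   # first GPU access: one copy into device memory
cpu_fault(p)   # later CPU access: one copy back to system memory
assert p.copies == 2 and p.location == "system"
```

This mirrors Jerome's point: the two copies only exist in the non-coherent model; with a cache coherent (CAPI/CCIX) device neither fault-driven migration is required.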
So could you check whether my conclusions are correct?

* If CCIX/CAPI is supported, the CPU can access GPU memory directly and
  the GPU can also access CPU memory directly, so no copy is needed in
  kernel space in the HMM solution.
* If CCIX/CAPI is not supported, the CPU cannot access GPU memory in a
  cache coherent way, and the GPU cannot access CPU memory coherently
  either, so it needs some copies as in John Hubbard's slides:
  * When a GPU page fault occurs, data is copied from the CPU page to the
    GPU page, and HMM unmaps the CPU page...
  * When a CPU page fault occurs, data is copied from the GPU page to the
    CPU page, the GPU page is unmapped and the CPU page is mapped...

> > > Without HMM mirror but with ATS/PASID (CCIX or CAPI):
> > >   1 There are 3 buffers a, b and r at given virtual addresses; both
> > >     CPU and GPU can access them (concurrently or not, this does not
> > >     matter).
> > >   2 The GPU uses the exact same page table as the CPU and faults
> > >     exactly like the CPU on an empty page table entry.
> > >
> > > So in the end, with HMM mirror or ATS/PASID, you get the same
> > > behavior. There is no complexity like you seem to assume. This is all
> > > about virtual addresses. At any point in time any given valid virtual
> > > address of a process points to a given physical memory address, and
> > > that physical memory address is the same on both the CPU and the GPU;
> > > they are never out of sync (both in the HMM mirror and in the
> > > ATS/PASID case).
> > >
> > > The exception is for platforms that do not have the CAPI or CCIX
> > > property, i.e. cache coherency for CPU access to device memory. On
> > > such platforms, when you migrate a virtual address to use device
> > > physical memory, you update the CPU page table with a special entry.
> > > If the CPU tries to access a virtual address with the special entry
> > > it triggers a fault and HMM will migrate the virtual address back to
> > > regular memory. But this does not apply to CAPI or CCIX platforms.
> > The example of a virtual address using device physical memory is:
> > gpu_r = gpu_alloc(m*m*sizeof(float)),
> > so when the CPU wants to access gpu_r it will trigger a migration back
> > to CPU memory; it will allocate CPU pages and copy gpu_r's contents to
> > the CPU pages, right?
>
> No. Here we are always talking about virtual addresses that are the
> outcome of an mmap syscall, either as private anonymous memory or as
> mmap of a regular file (i.e. not a device file but a regular file on a
> filesystem).
>
> A device driver can migrate any virtual address to use device memory for
> performance reasons (how, why and when such migration happens is totally
> opaque to HMM; it is under the control of the device driver).
>
> So if you do:
>   BUFA = malloc(size);
> and then do something with BUFA on the CPU (like reading input or
> network, ...), the memory is likely to be allocated from regular main
> memory (like DDR).
>
> Now if you start some job on your GPU that accesses BUFA, the device
> driver might call the migrate_vma() helper to migrate the memory to
> device memory. At that point the virtual address of BUFA points to
> physical device memory, in the CAPI or CCIX case. If it is not CAPI/CCIX
> then the GPU page table points to device memory while the CPU page table
> points to an invalid special entry. The GPU can work on BUFA, which now
> resides inside the device memory. Finally, in the non CAPI/CCIX case, if
> the CPU tries to access that memory then a migration back to regular
> memory happens.

In this scenario:
* If CAPI/CCIX is supported, the CPU's page table and the GPU's both
  point to the device physical page? In that case, does it still need the
  ZONE_DEVICE infrastructure for the CPU page table?
* If CAPI/CCIX is not supported, the CPU's page table is filled with an
  invalid special PTE.

> What you really need is to decouple the virtual address part from the
> physical memory that is backing a virtual address. HMM provides helpers
> for both aspects.
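The BUFA lifecycle described above (non-CAPI/CCIX case) can be sketched as a toy state machine. Only `migrate_vma` names a real kernel helper; the `Buffer` class, the `SPECIAL_ENTRY` marker and the function bodies are invented for this illustration.

```python
# Toy lifecycle of BUFA: regular malloc memory, migrated to device memory
# by the driver, migrated back when the CPU faults on the special entry.
SPECIAL_ENTRY = object()           # stands in for the poisoned CPU PTE

class Buffer:
    def __init__(self):
        self.cpu_pte = "ddr"       # BUFA = malloc(size): regular main memory
        self.gpu_pte = None        # GPU has not faulted on it yet

def driver_migrate_to_device(buf):
    """Driver decision (via something like migrate_vma()): move the
    backing store into device memory for the duration of the GPU job."""
    buf.gpu_pte = "device"
    buf.cpu_pte = SPECIAL_ENTRY    # any CPU access will now fault

def cpu_access(buf):
    """CPU load/store: a fault on the special entry migrates the data
    back to regular memory, which both sides can then access."""
    if buf.cpu_pte is SPECIAL_ENTRY:
        buf.cpu_pte = "ddr"        # data copied back to main memory
        buf.gpu_pte = "ddr"        # GPU mapping updated to main memory

bufa = Buffer()                    # CPU fills BUFA from disk/network
driver_migrate_to_device(bufa)     # GPU job starts working on BUFA
cpu_access(bufa)                   # CPU touches BUFA afterwards
assert bufa.cpu_pte == "ddr" and bufa.gpu_pte == "ddr"
```

On a CAPI/CCIX platform the `SPECIAL_ENTRY` step disappears: the CPU page table can point at device memory directly, so the CPU access never faults and no migration back is forced.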
> First, to mirror page tables so that every virtual address points to the
> same physical address. The second side of HMM is to allow device memory
> to be used transparently inside a process, by allowing any virtual
> address to be migrated to device memory. Both aspects are orthogonal to
> each other.
>
> Cheers,
> Jérôme