2017-12-14 12:16 GMT+08:00 Jerome Glisse:

> On Thu, Dec 14, 2017 at 11:53:40AM +0800, Figo.zhang wrote:
> > 2017-12-14 11:16 GMT+08:00 Jerome Glisse:
> > > On Thu, Dec 14, 2017 at 10:48:36AM +0800, Figo.zhang wrote:
> > > > 2017-12-14 0:12 GMT+08:00 Jerome Glisse:
> > > > > On Wed, Dec 13, 2017 at 08:10:42PM +0800, Figo.zhang wrote:
>
> [...]
>
> > > This is not what happens. Here is the workflow with HMM mirror (note
> > > that physical addresses do not matter here, so I do not even reference
> > > them; it is all about virtual addresses):
> > >   1 There are 3 buffers a, b and r at given virtual addresses; both
> > >     CPU and GPU can access them (concurrently or not, this does not
> > >     matter).
> > >   2 The GPU can fault, so if any virtual address does not have a page
> > >     table entry inside the GPU page table, this triggers a page fault
> > >     that calls the HMM mirror helper to snapshot the CPU page table
> > >     into the GPU page table. If there is no physical memory backing
> > >     the virtual address (i.e. the CPU page table is also empty for the
> > >     given virtual address) then the regular page fault handler of the
> > >     kernel is invoked.
> >
> > So when the HMM mirror is done, the contents of the GPU page table
> > entry and the CPU page table entry are the same, right? So the GPU and
> > CPU can access the same physical address, and this physical address is
> > allocated by a CPU malloc syscall. Isn't that a conflict and race
> > condition, with CPU and GPU writing to this physical address
> > concurrently?
>
> Correct, and yes it is conflict free. PCIe platforms already support
> cache coherent access by a device to main memory (snoop transactions in
> the PCIe specification). Accesses can happen concurrently to the same
> byte and it behaves exactly the same as if two CPU cores tried to access
> the same byte.
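The mirror workflow quoted above can be sketched as a toy user-space model. This is purely illustrative: the dictionaries, `gpu_page_fault` and `cpu_fault_handler` names are made up for this sketch and are not kernel or HMM APIs.

```python
# Toy model of the HMM mirror fault flow described above (not kernel code).
cpu_page_table = {}   # virtual page -> physical page
gpu_page_table = {}   # the GPU's mirror of the same address space
next_phys = [0]

def cpu_fault_handler(vaddr):
    """Stand-in for the regular kernel fault path: allocate backing memory."""
    phys = next_phys[0]
    next_phys[0] += 1
    cpu_page_table[vaddr] = phys
    return phys

def gpu_page_fault(vaddr):
    """Stand-in for the HMM mirror helper: snapshot the CPU PTE into the
    GPU page table. If the CPU page table is also empty, the regular
    fault handler runs first, then the result is mirrored."""
    if vaddr not in cpu_page_table:
        cpu_fault_handler(vaddr)
    gpu_page_table[vaddr] = cpu_page_table[vaddr]
    return gpu_page_table[vaddr]

# Buffers a, b and r are just virtual addresses; after a GPU fault both
# page tables point to the same physical page, so no copy is involved.
for buf in ("a", "b", "r"):
    gpu_page_fault(buf)
assert cpu_page_table == gpu_page_table
```

The point of the sketch is that the GPU page table is only ever a snapshot of the CPU one, so both sides always resolve a virtual address to the same physical page.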
> > I saw these slides:
> > http://on-demand.gputechconf.com/gtc/2017/presentation/s7764_john-hubbardgpus-using-hmm-blur-the-lines-between-cpu-and-gpu.pdf
> >
> > On pages 22~23:
> > When a CPU page fault occurs:
> >   * UM (unified memory driver) copies page data to the CPU, unmaps from
> >     the GPU
> >   * HMM maps the page to the CPU
> >
> > When a GPU page fault occurs:
> >   * HMM has a malloc record buffer, so UM copies page data to the GPU
> >   * HMM unmaps the page from the CPU
> >
> > So these slides say there will be two copies, from CPU to GPU, and from
> > GPU to CPU. In this case (mul_mat_on_gpu()), does it really need two
> > copies in kernel space?
>
> This slide is for the case where you use device memory on a PCIe
> platform. When that happens only the device can access the virtual
> address backed by device memory. If the CPU tries to access such an
> address a page fault is triggered and the data is migrated back to
> regular memory, where both GPU and CPU can access it concurrently.
>
> And again this behavior only happens if you use the HMM non cache
> coherent device memory model. If you use the device cache coherent model
> with HMM then the CPU can access the device memory directly too and the
> above scenario never happens.
>
> Note that memory copies when data moves from device to system or from
> system to device memory are inevitable. This is exactly as with autoNUMA.
> Also note that in some cases things can get allocated directly on the GPU
> and never copied back to regular memory (only used by the GPU and freed
> once the GPU is done with them): the zero copy case. But I want to stress
> that the zero copy case is unlikely to happen for input buffers. Usually
> you do not get your input data set directly on the GPU but from network
> or disk, and you might do pre-processing on the CPU (uncompress input, or
> do something that is better done on the CPU). Then you feed your data to
> the GPU and you do the computation there.

Great! Very detailed explanation of HMM, thanks a lot.
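The two-copy, non-cache-coherent flow from the slides can be modeled with a toy page that lives either in system memory or in device memory, and migrates on a fault from the side that cannot see it. The class and function names here are invented for the sketch; this is not the UM/HMM driver code.

```python
# Toy model of the non-coherent device memory case: each fault on the
# "wrong" side copies the data over and unmaps it from the other side.
class Page:
    def __init__(self, data):
        self.data = data
        self.location = "system"   # currently mapped for the CPU
        self.copies = 0            # count of data migrations

def gpu_fault(page):
    """GPU touched the page: copy data to device memory, unmap from CPU."""
    if page.location != "device":
        page.copies += 1           # data copied system -> device
        page.location = "device"   # CPU mapping replaced by special entry

def cpu_fault(page):
    """CPU touched the page: copy data back to system memory, unmap GPU."""
    if page.location != "system":
        page.copies += 1           # data copied device -> system
        page.location = "system"

p = Page(data=[1.0, 2.0])
gpu_fault(p)   # first GPU access: one copy into device memory
cpu_fault(p)   # later CPU access: one copy back to system memory
assert p.copies == 2 and p.location == "system"
```

This mirrors Jerome's point: the two copies only exist in the non-coherent model; with a cache coherent (CAPI/CCIX) device neither fault-driven migration is required.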
So could you check whether my conclusions are correct?

* If CCIX/CAPI is supported, the CPU can access GPU memory directly and
  the GPU can also access CPU memory directly, so no copy is needed in
  kernel space in the HMM solution.
* If CCIX/CAPI is not supported, the CPU cannot access GPU memory in a
  cache coherent way, and the GPU cannot access CPU memory coherently
  either, so it needs some copies as in John Hubbard's slides:
  * When a GPU page fault occurs, data is copied from the CPU page to the
    GPU page, and HMM unmaps the CPU page...
  * When a CPU page fault occurs, data is copied from the GPU page to the
    CPU page, the GPU page is unmapped and the CPU page is mapped...

> > > Without HMM mirror but with ATS/PASID (CCIX or CAPI):
> > >   1 There are 3 buffers a, b and r at given virtual addresses; both
> > >     CPU and GPU can access them (concurrently or not, this does not
> > >     matter).
> > >   2 The GPU uses the exact same page table as the CPU and faults
> > >     exactly like the CPU on an empty page table entry.
> > >
> > > So in the end, with HMM mirror or ATS/PASID, you get the same
> > > behavior. There is no complexity like you seem to assume. This is all
> > > about virtual addresses. At any point in time any given valid virtual
> > > address of a process points to a given physical memory address, and
> > > that physical memory address is the same on both the CPU and the GPU;
> > > they are never out of sync (both in the HMM mirror and in the
> > > ATS/PASID case).
> > >
> > > The exception is for platforms that do not have the CAPI or CCIX
> > > property, i.e. cache coherency for CPU access to device memory. On
> > > such platforms, when you migrate a virtual address to use device
> > > physical memory, you update the CPU page table with a special entry.
> > > If the CPU tries to access a virtual address with the special entry
> > > it triggers a fault and HMM will migrate the virtual address back to
> > > regular memory. But this does not apply to CAPI or CCIX platforms.
> > The example of a virtual address using device physical memory is:
> > gpu_r = gpu_alloc(m*m*sizeof(float)),
> > so when the CPU wants to access gpu_r it will trigger a migration back
> > to CPU memory; it will allocate CPU pages and copy gpu_r's contents to
> > the CPU pages, right?
>
> No. Here we are always talking about virtual addresses that are the
> outcome of an mmap syscall, either as private anonymous memory or as
> mmap of a regular file (i.e. not a device file but a regular file on a
> filesystem).
>
> A device driver can migrate any virtual address to use device memory for
> performance reasons (how, why and when such migration happens is totally
> opaque to HMM; it is under the control of the device driver).
>
> So if you do:
>   BUFA = malloc(size);
> and then do something with BUFA on the CPU (like reading input or
> network, ...), the memory is likely to be allocated from regular main
> memory (like DDR).
>
> Now if you start some job on your GPU that accesses BUFA, the device
> driver might call the migrate_vma() helper to migrate the memory to
> device memory. At that point the virtual address of BUFA points to
> physical device memory, in the CAPI or CCIX case. If it is not CAPI/CCIX
> then the GPU page table points to device memory while the CPU page table
> points to an invalid special entry. The GPU can work on BUFA, which now
> resides inside the device memory. Finally, in the non CAPI/CCIX case, if
> the CPU tries to access that memory then a migration back to regular
> memory happens.

In this scenario:
* If CAPI/CCIX is supported, the CPU's page table and the GPU's both
  point to the device physical page? In that case, does it still need the
  ZONE_DEVICE infrastructure for the CPU page table?
* If CAPI/CCIX is not supported, the CPU's page table is filled with an
  invalid special PTE.

> What you really need is to decouple the virtual address part from the
> physical memory that is backing a virtual address. HMM provides helpers
> for both aspects.
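The BUFA lifecycle described above (non-CAPI/CCIX case) can be sketched as a toy state machine. Only `migrate_vma` names a real kernel helper; the `Buffer` class, the `SPECIAL_ENTRY` marker and the function bodies are invented for this illustration.

```python
# Toy lifecycle of BUFA: regular malloc memory, migrated to device memory
# by the driver, migrated back when the CPU faults on the special entry.
SPECIAL_ENTRY = object()           # stands in for the poisoned CPU PTE

class Buffer:
    def __init__(self):
        self.cpu_pte = "ddr"       # BUFA = malloc(size): regular main memory
        self.gpu_pte = None        # GPU has not faulted on it yet

def driver_migrate_to_device(buf):
    """Driver decision (via something like migrate_vma()): move the
    backing store into device memory for the duration of the GPU job."""
    buf.gpu_pte = "device"
    buf.cpu_pte = SPECIAL_ENTRY    # any CPU access will now fault

def cpu_access(buf):
    """CPU load/store: a fault on the special entry migrates the data
    back to regular memory, which both sides can then access."""
    if buf.cpu_pte is SPECIAL_ENTRY:
        buf.cpu_pte = "ddr"        # data copied back to main memory
        buf.gpu_pte = "ddr"        # GPU mapping updated to main memory

bufa = Buffer()                    # CPU fills BUFA from disk/network
driver_migrate_to_device(bufa)     # GPU job starts working on BUFA
cpu_access(bufa)                   # CPU touches BUFA afterwards
assert bufa.cpu_pte == "ddr" and bufa.gpu_pte == "ddr"
```

On a CAPI/CCIX platform the `SPECIAL_ENTRY` step disappears: the CPU page table can point at device memory directly, so the CPU access never faults and no migration back is forced.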
> First, to mirror page tables so that every virtual address points to the
> same physical address. The second side of HMM is to allow device memory
> to be used transparently inside a process, by allowing any virtual
> address to be migrated to device memory. Both aspects are orthogonal to
> each other.
>
> Cheers,
> Jérôme