From: "Figo.zhang"
Date: Thu, 14 Dec 2017 15:05:39 +0800
Subject: Re: [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25
In-Reply-To: <20171214041650.GB17710@redhat.com>
To: Jerome Glisse <jglisse@redhat.com>
Cc: Andrew Morton, Linux MM, John Hubbard, Dan Williams, David Nellans, Balbir Singh

2017-12-14 12:16 GMT+08:00 Jerome Glisse <jglisse@redhat.com>:

> On Thu, Dec 14, 2017 at 11:53:40AM +0800, Figo.zhang wrote:
> > 2017-12-14 11:16 GMT+08:00 Jerome Glisse <jglisse@redhat.com>:
> > > On Thu, Dec 14, 2017 at 10:48:36AM +0800, Figo.zhang wrote:
> > > > 2017-12-14 0:12 GMT+08:00 Jerome Glisse <jglisse@redhat.com>:
> > > > > On Wed, Dec 13, 2017 at 08:10:42PM +0800, Figo.zhang wrote:
>
> [...]
>
> > > This is not what happens. Here is the workflow with HMM mirror
> > > (note that physical addresses do not matter here, so I do not even
> > > reference them; it is all about virtual addresses):
> > >  1 There are 3 buffers a, b and r at given virtual addresses; both
> > >    CPU and GPU can access them (concurrently or not, this does not
> > >    matter).
> > >  2 The GPU can fault, so if any virtual address has no page table
> > >    entry inside the GPU page table, this triggers a page fault that
> > >    calls the HMM mirror helper to snapshot the CPU page table into
> > >    the GPU page table. If there is no physical memory backing the
> > >    virtual address (ie the CPU page table is also empty for that
> > >    virtual address), then the regular page fault handler of the
> > >    kernel is invoked.
> >
> > So when the HMM mirror is done, the contents of the GPU page table
> > entry and the CPU page table entry are the same, right? So the GPU
> > and the CPU can access the same physical address, and this physical
> > address was allocated by a CPU malloc system call. Is there a
> > conflict or race condition when the CPU and the GPU write to this
> > physical address concurrently?
>
> Correct, and yes, it is conflict free. PCIe platforms already support
> cache coherent access by a device to main memory (snoop transactions
> in the PCIe specification). Accesses can happen concurrently to the
> same byte, and it behaves exactly the same as if two CPU cores tried
> to access the same byte.
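To make this concrete, here is a minimal userspace sketch of the
mul_mat_on_gpu() case discussed in this thread. It is only a sketch:
gpu_mul_mat() is a hypothetical entry point standing in for whatever an
HMM-aware GPU stack would expose (HMM itself defines no userspace API).
The point is that a, b and r are plain malloc memory and the very same
pointers are valid on both the CPU and the GPU, with no gpu_alloc() and
no explicit copy:

#include <stddef.h>
#include <stdlib.h>

/* Hypothetical entry point of an HMM-aware GPU driver/library. */
extern void gpu_mul_mat(const float *a, const float *b, float *r, size_t m);

int mul_mat_on_gpu(size_t m)
{
	/* Ordinary anonymous memory, no special allocator. */
	float *a = malloc(m * m * sizeof(float));
	float *b = malloc(m * m * sizeof(float));
	float *r = malloc(m * m * sizeof(float));

	if (!a || !b || !r) {
		free(a); free(b); free(r);
		return -1;
	}

	/* CPU writes fault in regular pages backing a and b. */
	for (size_t i = 0; i < m * m; i++) {
		a[i] = 1.0f;
		b[i] = 2.0f;
	}

	/*
	 * The GPU faults on a, b and r. With HMM mirror the driver
	 * snapshots the CPU page table into the GPU page table; with
	 * ATS/PASID the GPU simply walks the same page table. Either
	 * way these are the same virtual addresses the CPU uses, and
	 * concurrent access is coherent on a PCIe platform.
	 */
	gpu_mul_mat(a, b, r, m);

	/* The CPU reads the result through the very same pointer. */
	float first = r[0];
	(void)first;

	free(a); free(b); free(r);
	return 0;
}

Nothing in the sketch pins, registers or copies the buffers; keeping the
two views of the address space in sync is entirely the kernel's and the
driver's job.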
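On the kernel side, the mirror stays in sync through a callback that
fires whenever the CPU page table changes. Below is a hedged sketch
against the 4.14-era CONFIG_HMM_MIRROR interface; struct dev_mirror and
dev_invalidate_range() are hypothetical driver pieces, not part of HMM:

#include <linux/hmm.h>
#include <linux/kernel.h>

/* Hypothetical per-process driver state wrapping the HMM mirror. */
struct dev_mirror {
	struct hmm_mirror mirror;
	/* ... device page table handle, locks, etc. ... */
};

/* Hypothetical helper: tear down GPU PTEs covering [start, end). */
void dev_invalidate_range(struct dev_mirror *dm,
			  unsigned long start, unsigned long end);

/*
 * Called by HMM whenever the CPU page table changes for this mm, so
 * the device can never keep a stale snapshot of a virtual address.
 */
static void dev_sync_cpu_device_pagetables(struct hmm_mirror *mirror,
					   enum hmm_update_type update,
					   unsigned long start,
					   unsigned long end)
{
	struct dev_mirror *dm = container_of(mirror, struct dev_mirror,
					     mirror);

	dev_invalidate_range(dm, start, end);
}

static const struct hmm_mirror_ops dev_mirror_ops = {
	.sync_cpu_device_pagetables = dev_sync_cpu_device_pagetables,
};

static int dev_register_mirror(struct dev_mirror *dm, struct mm_struct *mm)
{
	dm->mirror.ops = &dev_mirror_ops;
	return hmm_mirror_register(&dm->mirror, mm);
}

On a GPU page fault the driver would then use the HMM snapshot helpers
to fill the GPU page table, falling back to the kernel's regular fault
path when nothing backs the address yet, as described above.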
>
> > I see these slides:
> > http://on-demand.gputechconf.com/gtc/2017/presentation/s7764_john-hubbardgpus-using-hmm-blur-the-lines-between-cpu-and-gpu.pdf
> >
> > On pages 22~23:
> > When a CPU page fault occurs:
> > * UM (the unified memory driver) copies the page data to the CPU and
> >   unmaps it from the GPU
> > * HMM maps the page to the CPU
> >
> > When a GPU page fault occurs:
> > * HMM has a malloc record buffer, so UM copies the page data to the GPU
> > * HMM unmaps the page from the CPU
> >
> > So these slides say there will be two copies, from CPU to GPU and
> > from GPU to CPU. In this case (mul_mat_on_gpu()), does it really need
> > two copies in kernel space?
>
> This slide is for the case where you use device memory on a PCIe
> platform. When that happens, only the device can access a virtual
> address backed by device memory. If the CPU tries to access such an
> address, a page fault is triggered and the data is migrated back to
> regular memory, where both GPU and CPU can access it concurrently.
>
> And again, this behavior only happens if you use the HMM non cache
> coherent device memory model. If you use the device cache coherent
> model with HMM, then the CPU can access the device memory directly too
> and the above scenario never happens.
>
> Note that memory copies when data moves from device to system or from
> system to device memory are inevitable. This is exactly as with
> autoNUMA. Also note that in some cases things can get allocated
> directly on the GPU and never copied back to regular memory (only used
> by the GPU and freed once the GPU is done with them): the zero copy
> case. But I want to stress that the zero copy case is unlikely to
> happen for input buffers. Usually you do not get your input data set
> directly on the GPU but from network or disk, and you might do
> pre-processing on the CPU (uncompress input, or do something else that
> is better done on the CPU). Then you feed your data to the GPU and you
> do the computation there.

Great! A very detailed explanation of HMM, thanks a lot.
Would you check whether my conclusions are correct?
* If CCIX/CAPI is supported, the CPU can access GPU memory directly and
the GPU can also access CPU memory directly, so there is no need for a
copy in kernel space in the HMM solution.

* If CCIX/CAPI is not supported, the CPU cannot access GPU memory with
cache coherency, and the GPU also cannot access CPU memory with cache
coherency; it needs some copies, as in John Hubbard's slides:
   * when a GPU page fault occurs, data must be copied from the CPU page
     to the GPU page, and HMM unmaps the CPU page...
   * when a CPU page fault occurs, data must be copied from the GPU page
     to the CPU page, and HMM unmaps the GPU page and maps the CPU page...

> > > Without HMM mirror but with ATS/PASID (CCIX or CAPI):
> > >  1 There are 3 buffers a, b and r at given virtual addresses; both
> > >    CPU and GPU can access them (concurrently or not, this does not
> > >    matter).
> > >  2 The GPU uses the exact same page table as the CPU and faults
> > >    exactly like the CPU on an empty page table entry.
> > >
> > > So in the end, with HMM mirror or ATS/PASID you get the same
> > > behavior. There is no complexity like you seem to assume. This is
> > > all about virtual addresses. At any point in time, any given valid
> > > virtual address of a process points to a given physical memory
> > > address, and that physical memory address is the same on both the
> > > CPU and the GPU; they are never out of sync (both in the HMM mirror
> > > and in the ATS/PASID case).
> > >
> > > The exception is for platforms that do not have the CAPI or CCIX
> > > property, ie cache coherency for CPU access to device memory. On
> > > such a platform, when you migrate a virtual address to use device
> > > physical memory, you update the CPU page table with a special
> > > entry. If the CPU tries to access a virtual address with such a
> > > special entry, it triggers a fault and HMM will migrate the virtual
> > > address back to regular memory. But this does not apply to CAPI or
> > > CCIX platforms.
> >
> > The example of a virtual address using device physical memory is:
> > gpu_r = gpu_alloc(m*m*sizeof(float)). So a CPU access to gpu_r will
> > trigger a migration back to CPU memory; it will allocate CPU pages
> > and copy gpu_r's contents to those CPU pages, right?
>
> No. Here we are always talking about virtual addresses that are the
> outcome of an mmap syscall, either as private anonymous memory or as
> an mmap of a regular file (ie not a device file but a regular file on
> a filesystem).
>
> A device driver can migrate any virtual address to use device memory
> for performance reasons (how, why and when such a migration happens is
> totally opaque to HMM; it is under the control of the device driver).
>
> So if you do:
>    BUFA = malloc(size);
> then do something with BUFA on the CPU (like reading input or network,
> ...), the memory is likely to be allocated from regular main memory
> (like DDR).
>
> Now if you start some job on your GPU that accesses BUFA, the device
> driver might call the migrate_vma() helper to migrate the memory to
> device memory. At that point the virtual address of BUFA points to
> physical device memory, here CAPI or CCIX. If it is not CAPI/CCIX,
> then the GPU page table points to device memory while the CPU page
> table points to an invalid special entry. The GPU can work on BUFA,
> which now resides in device memory. Finally, in the non CAPI/CCIX
> case, if the CPU tries to access that memory, a migration back to
> regular memory happens.
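The migrate_vma() flow Jerome describes maps onto the 4.14-era callback
API roughly as follows. This is a hedged sketch: dev_alloc_page() and
dev_copy_to_device() are hypothetical driver helpers, and locking
subtleties and error handling are elided:

#include <linux/migrate.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

/* Hypothetical driver helpers. */
struct page *dev_alloc_page(void);
void dev_copy_to_device(struct page *dpage, struct page *spage);

static void dev_alloc_and_copy(struct vm_area_struct *vma,
			       const unsigned long *src,
			       unsigned long *dst,
			       unsigned long start,
			       unsigned long end,
			       void *private)
{
	unsigned long addr, i;

	for (addr = start, i = 0; addr < end; addr += PAGE_SIZE, i++) {
		struct page *spage = migrate_pfn_to_page(src[i]);
		struct page *dpage;

		/* Core mm refused to migrate this page; leave it be. */
		if (!(src[i] & MIGRATE_PFN_MIGRATE))
			continue;

		dpage = dev_alloc_page();
		if (!dpage)
			continue;	/* page simply stays in system memory */

		if (spage)
			dev_copy_to_device(dpage, spage); /* eg a DMA copy */

		lock_page(dpage);
		dst[i] = migrate_pfn(page_to_pfn(dpage)) |
			 MIGRATE_PFN_LOCKED | MIGRATE_PFN_DEVICE;
	}
}

static void dev_finalize_and_map(struct vm_area_struct *vma,
				 const unsigned long *src,
				 const unsigned long *dst,
				 unsigned long start,
				 unsigned long end,
				 void *private)
{
	/*
	 * Migration is committed: point the GPU page table at the new
	 * device pages. In the non CAPI/CCIX case the CPU page table
	 * now holds the special entries that fault and migrate back on
	 * CPU access.
	 */
}

static const struct migrate_vma_ops dev_migrate_ops = {
	.alloc_and_copy		= dev_alloc_and_copy,
	.finalize_and_map	= dev_finalize_and_map,
};

/* src/dst are caller-provided arrays with one slot per page in the range. */
static int dev_migrate_range(struct vm_area_struct *vma,
			     unsigned long start, unsigned long end,
			     unsigned long *src, unsigned long *dst)
{
	return migrate_vma(&dev_migrate_ops, vma, start, end, src, dst, NULL);
}

The migration back on CPU fault runs the same helper in the other
direction, with device pages as the source and freshly allocated system
pages as the destination.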
In this scenario:
* If CAPI/CCIX is supported, the CPU's page table and the GPU's both
point to the device physical page? In that case, is the ZONE_DEVICE
infrastructure still needed for the CPU page table?

* If CAPI/CCIX is not supported, the CPU's page table is filled with an
invalid special pte.

> What you really need is to decouple the virtual address part from
> whatever physical memory is backing a virtual address. HMM provides
> helpers for both aspects. The first is to mirror page tables so that
> every virtual address points to the same physical address. The second
> side of HMM allows device memory to be used transparently inside a
> process, by allowing any virtual address to be migrated to device
> memory. Both aspects are orthogonal to each other.
>
> Cheers,
> Jérôme

