> Heterogeneous Memory Management (HMM) (description and justification)
>
> Today device drivers expose a dedicated memory allocation API through their device file, often relying on a combination of IOCTL and mmap calls. The device can only access and use memory allocated through this API. This effectively splits the program address space into objects allocated for the device and usable by it, and other regular memory (malloc, mmap of a file, shared memory, ...) accessible only by the CPU (or, in a very limited way, by a device through pinned memory).
>
> Allowing different isolated components of a program to use a device thus requires duplicating the input data structures through the device memory allocator. This is reasonable for simple data structures (arrays, grids, images, ...) but gets extremely complex with advanced data structures (lists, trees, graphs, ...) that rely on a web of memory pointers. This is becoming a serious limitation on the kinds of workloads that can be offloaded to devices like GPUs.

How does the current GPU software stack handle this? By maintaining a complex middleware framework/HAL?

> New industry standards like C++, OpenCL or CUDA are pushing to remove this barrier. This requires a shared address space between the GPU device and the CPU, so that the GPU can access any memory of a process (while still obeying memory protections such as read-only).

Can the GPU access all of the process's VMAs, or only those VMAs whose backing system memory has been migrated into the GPU page table?

> This kind of feature is also appearing in various other operating systems.
>
> HMM is a set of helpers to facilitate several aspects of address space sharing and device memory management. Unlike existing sharing mechanisms that rely on pinning the pages used by a device, HMM relies on mmu_notifier to propagate CPU page table updates to the device page table.
>
> Duplicating the CPU page table is only one aspect necessary for efficiently using a device like a GPU. GPU local memory has bandwidth in the terabytes/second range, but it is connected to main memory through a system bus like PCIe that is limited to 32 gigabytes/second (PCIe 4.0 x16). Thus it is necessary to allow migration of process memory from main system memory to device memory. The issue is that on platforms with only PCIe, device memory is not accessible by the CPU with the same properties as main memory (cache coherency, atomic operations, ...).
>
> To allow migration from main memory to device memory, HMM provides a set of helpers to hotplug device memory as a new type of ZONE_DEVICE memory which is un-addressable by the CPU but still has struct pages representing it. This allows most of the core kernel logic that deals with process memory to stay oblivious to the peculiarities of device memory.
>
> When a page backing an address of a process is migrated to device memory, the CPU page table entry is set to a new, special swap entry. A CPU access to such an address triggers a migration back to system memory, just as if the page had been swapped out to disk. HMM also blocks anyone from pinning a ZONE_DEVICE page so that it can always be migrated back to system memory if the CPU accesses it. Conversely, HMM does not migrate to device memory any page that is pinned in system memory.

Is the purpose of migrating system pages to the device that the device can then access that data at local speed? If the CPU/program wants to read the device data, does it need to pin/map the device memory into the process address space? And if multiple applications want to read the same device memory region concurrently, how is that done? A diagram showing how the CPU and GPU share the address space would help here.
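As a concrete illustration of the mmu_notifier mechanism quoted above, here is a minimal sketch of how a mirroring driver might keep its device page table coherent with the CPU one, with no pinning involved. It assumes the roughly 4.14-era notifier callback signatures (they changed in later kernels), and `my_device_ptes_unmap()` is a hypothetical driver hook.

```c
#include <linux/mm.h>
#include <linux/mmu_notifier.h>

struct my_mirror {
	struct mmu_notifier notifier;
	/* driver state: device page table handle, device pointer, ... */
};

/*
 * Called whenever the CPU page tables for [start, end) are about to
 * change (munmap, COW, reclaim, migration, ...). The driver must
 * invalidate the matching device page table entries so the device
 * re-faults and picks up the new CPU mapping.
 */
static void my_invalidate_range_start(struct mmu_notifier *mn,
				      struct mm_struct *mm,
				      unsigned long start,
				      unsigned long end)
{
	struct my_mirror *mirror = container_of(mn, struct my_mirror, notifier);

	my_device_ptes_unmap(mirror, start, end);	/* hypothetical hook */
}

static const struct mmu_notifier_ops my_mmu_notifier_ops = {
	.invalidate_range_start = my_invalidate_range_start,
};

/* Mirror @mm: updates now flow CPU -> device with no page pinning. */
static int my_mirror_register(struct my_mirror *mirror, struct mm_struct *mm)
{
	mirror->notifier.ops = &my_mmu_notifier_ops;
	return mmu_notifier_register(&mirror->notifier, mm);
}
```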
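The ZONE_DEVICE side described in the last two quoted paragraphs could look roughly like the sketch below: the driver hotplugs its memory and supplies a fault handler that runs when a CPU touch hits the special swap entry. `hmm_devmem_add()` and `struct hmm_devmem_ops` approximate the helpers added by this patchset as I recall them (later kernels replaced them with memremap_pages()), and all `my_*` names are hypothetical.

```c
#include <linux/hmm.h>
#include <linux/pci.h>

#define MY_DEVMEM_SIZE	(1UL << 30)	/* hypothetical: 1GB of device memory */

/*
 * A CPU access hit an address whose data lives in device memory. The
 * CPU page table holds the special swap entry, so the fault is routed
 * here; the driver migrates the page back to system memory and the
 * faulting access then retries against a normal page.
 */
static int my_devmem_fault(struct hmm_devmem *devmem,
			   struct vm_area_struct *vma,
			   unsigned long addr,
			   const struct page *page,
			   unsigned int flags,
			   pmd_t *pmdp)
{
	return my_migrate_back_to_ram(devmem, vma, addr);	/* hypothetical */
}

/* A device page was freed; hand it back to the driver's allocator. */
static void my_devmem_free(struct hmm_devmem *devmem, struct page *page)
{
	my_device_page_free(devmem, page);			/* hypothetical */
}

static const struct hmm_devmem_ops my_devmem_ops = {
	.fault = my_devmem_fault,
	.free  = my_devmem_free,
};

/*
 * Hotplug the device's local memory as un-addressable ZONE_DEVICE
 * pages: the CPU cannot map them, but each one has a struct page, so
 * core mm code can mostly treat them like ordinary memory.
 */
static int my_probe(struct pci_dev *pdev)
{
	struct hmm_devmem *devmem;

	devmem = hmm_devmem_add(&my_devmem_ops, &pdev->dev, MY_DEVMEM_SIZE);
	return IS_ERR(devmem) ? PTR_ERR(devmem) : 0;
}
```

Note that this partially answers the question above: on a PCIe-only platform the CPU never maps device memory into the process; per the letter, touching the data simply faults it back into system RAM.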
> To allow efficient migration between device memory and main memory, a new migrate_vma() helper is added with this patchset. It allows leveraging the device DMA engine to perform the copy operation.
>
> This feature will be used by upstream drivers like nouveau and mlx5, and probably others in the future (amdgpu is the next suspect in line). We are actively working on nouveau and mlx5 support. To test this patchset we also worked with the NVidia closed-source driver team; they have more resources than us to test this kind of infrastructure, and a bigger and better userspace ecosystem with various real industry workloads they can use to test and profile HMM.
>
> The expected workload is a program that builds a data set on the CPU (from disk, from network, from sensors, ...). The program uses a GPU API (OpenCL, CUDA, ...) to give hints on memory placement for the input data and also for the output buffer. The program then calls the GPU API to schedule a GPU job; this happens through a device-driver-specific ioctl. All of this is hidden from the programmer's point of view in the case of a C++ compiler that transparently offloads parts of a program to the GPU. The program can keep doing other work on the CPU while the GPU is crunching numbers.
>
> It is expected that the CPU will not access the same data set as the GPU while the GPU is working on it, but this is not mandatory. In fact we expect some small memory objects to be actively accessed by both GPU and CPU concurrently, as synchronization channels and/or for monitoring purposes. Such objects will stay in system memory and should not be bottlenecked by system bus bandwidth (rare write and read accesses from both CPU and GPU).
>
> As we are relying on the device driver API, HMM does not introduce any new syscall nor does it modify any existing ones. It does not change any POSIX semantics or behaviors. For instance, the child after a fork of a process that is using HMM will not be impacted in any way, nor is there any data hazard between child COW and parent COW of memory that was migrated to the device prior to the fork.
>
> HMM assumes a number of hardware features. The device must allow its page table to be updated at any time (i.e. device jobs must be preemptible). The device page table must provide memory protections such as read-only. The device must track write accesses (dirty bit). The device must support a minimum granularity that matches PAGE_SIZE (i.e. 4K).
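To make the migrate_vma() helper quoted above concrete, here is a rough driver-side sketch. The ops-based signature is my best recollection of the helper as proposed in this patchset (later kernels split it into migrate_vma_setup()/migrate_vma_pages()/migrate_vma_finalize()), and every `my_*` function is a hypothetical driver hook.

```c
#include <linux/migrate.h>
#include <linux/mm.h>

/*
 * Step 1: the source pages have been collected and locked; allocate
 * matching device pages and drive the device DMA engine to copy the
 * data, filling dst[] with migrate-pfn entries for pages that moved.
 */
static void my_alloc_and_copy(struct vm_area_struct *vma,
			      const unsigned long *src, unsigned long *dst,
			      unsigned long start, unsigned long end,
			      void *private)
{
	unsigned long i, npages = (end - start) >> PAGE_SHIFT;

	for (i = 0; i < npages; i++) {
		struct page *dpage;

		if (!(src[i] & MIGRATE_PFN_MIGRATE))
			continue;	/* page cannot move (e.g. pinned) */
		dpage = my_alloc_device_page(private);		/* hypothetical */
		my_dma_copy(private, dpage, migrate_pfn_to_page(src[i]));
		dst[i] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
	}
}

/*
 * Step 2: the CPU page table entries now hold the special swap
 * entries; point the device page table at the new device pages.
 */
static void my_finalize_and_map(struct vm_area_struct *vma,
				const unsigned long *src, const unsigned long *dst,
				unsigned long start, unsigned long end,
				void *private)
{
	my_device_ptes_map(private, start, end, dst);		/* hypothetical */
}

static const struct migrate_vma_ops my_migrate_ops = {
	.alloc_and_copy	  = my_alloc_and_copy,
	.finalize_and_map = my_finalize_and_map,
};

/* Migrate one page at @addr in @vma into device memory. */
static int my_migrate_to_device(struct vm_area_struct *vma,
				unsigned long addr, void *driver_data)
{
	unsigned long src = 0, dst = 0;	/* one slot per page */

	return migrate_vma(&my_migrate_ops, vma, addr, addr + PAGE_SIZE,
			   &src, &dst, driver_data);
}
```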
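Finally, the expected workload described above, seen from userspace. Everything here is invented for illustration (/dev/gpu0, the GPU_IOC_* ioctls, struct gpu_job, and the helper functions); a real program would go through OpenCL or CUDA instead. The point is that plain malloc'ed pointers are handed to the device as-is, with no duplication into a device allocator.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

struct gpu_job {		/* hypothetical driver ABI */
	void *input;		/* plain process pointers, no device copies */
	void *output;
	size_t size;
};

/* hypothetical ioctl numbers for a hypothetical device node */
#define GPU_IOC_PREFETCH _IOW('G', 1, struct gpu_job)
#define GPU_IOC_SUBMIT   _IOW('G', 2, struct gpu_job)
#define GPU_IOC_WAIT     _IOW('G', 3, struct gpu_job)

extern void build_data_set(float *buf, size_t n);	/* disk/network/sensors */
extern void do_other_cpu_work(void);

int main(void)
{
	size_t n = 64 << 20;
	float *input = malloc(n * sizeof(*input));	/* ordinary malloc */
	float *output = malloc(n * sizeof(*output));
	struct gpu_job job = { input, output, n };
	int fd = open("/dev/gpu0", O_RDWR);

	build_data_set(input, n);

	/* Placement hint: ask the driver to migrate these ranges into
	 * device memory up front (the GPU could also fault them in). */
	ioctl(fd, GPU_IOC_PREFETCH, &job);

	/* Schedule the job; the GPU walks the same pointers the CPU uses. */
	ioctl(fd, GPU_IOC_SUBMIT, &job);

	do_other_cpu_work();	/* CPU keeps working while the GPU runs */

	ioctl(fd, GPU_IOC_WAIT, &job);

	/* First CPU touch of a migrated page faults it back to system RAM. */
	printf("result[0] = %f\n", output[0]);

	close(fd);
	free(input);
	free(output);
	return 0;
}
```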