> Heterogeneous Memory Management (HMM) (description and justification)
>
> Today device drivers expose a dedicated memory allocation API through their device file, often relying on a combination of IOCTL and mmap calls. The device can only access and use memory allocated through this API. This effectively splits the program address space into objects allocated for the device and usable by it, and other regular memory (malloc, mmap of a file, shared memory, ...) accessible only by the CPU (or, in a very limited way, by a device through pinned memory).
>
> Allowing different isolated components of a program to use a device thus requires duplicating the input data structures through the device memory allocator. This is reasonable for simple data structures (arrays, grids, images, ...) but gets extremely complex with advanced data structures (lists, trees, graphs, ...) that rely on a web of memory pointers. This is becoming a serious limitation on the kinds of workloads that can be offloaded to devices like GPUs.

How does the current GPU software stack handle this? By maintaining a complex middleware framework/HAL?

> New industry standards like C++, OpenCL or CUDA are pushing to remove this barrier. This requires a shared address space between the GPU device and the CPU, so that the GPU can access any memory of a process (while still obeying memory protections such as read-only).

Can the GPU access all of the process's VMAs, or only those VMAs whose backing system memory has been migrated into the GPU page table?

> This kind of feature is also appearing in various other operating systems.
>
> HMM is a set of helpers to facilitate several aspects of address space sharing and device memory management. Unlike existing sharing mechanisms that rely on pinning the pages used by a device, HMM relies on mmu_notifier to propagate CPU page table updates to the device page table.
>
> Duplicating the CPU page table is only one aspect necessary for efficiently using a device like a GPU. GPU local memory has bandwidth in the terabytes/second range, but it is connected to main memory through a system bus like PCIe that is limited to 32 gigabytes/second (PCIe 4.0 x16). Thus it is necessary to allow migration of process memory from main system memory to device memory. The issue is that on platforms with only PCIe, device memory is not accessible by the CPU with the same properties as main memory (cache coherency, atomic operations, ...).
>
> To allow migration from main memory to device memory, HMM provides a set of helpers to hotplug device memory as a new type of ZONE_DEVICE memory which is un-addressable by the CPU but still has struct pages representing it. This allows most of the core kernel logic that deals with process memory to stay oblivious to the peculiarities of device memory.
>
> When a page backing an address of a process is migrated to device memory, the CPU page table entry is set to a new, special swap entry. A CPU access to such an address triggers a migration back to system memory, just as if the page had been swapped out to disk. HMM also blocks anyone from pinning a ZONE_DEVICE page so that it can always be migrated back to system memory if the CPU accesses it. Conversely, HMM does not migrate to device memory any page that is pinned in system memory.

Is the purpose of migrating system pages to the device that the device can then access that data at local speed? If the CPU/program wants to read the device data, does it need to pin/map the device memory into the process address space? And if multiple applications want to read the same device memory region concurrently, how is that done? A diagram showing how the CPU and GPU share the address space would help here.
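As a concrete illustration of the mmu_notifier mechanism quoted above, here is a minimal sketch of how a mirroring driver might keep its device page table coherent with the CPU one, with no pinning involved. It assumes the roughly 4.14-era notifier callback signatures (they changed in later kernels), and `my_device_ptes_unmap()` is a hypothetical driver hook.

```c
#include <linux/mm.h>
#include <linux/mmu_notifier.h>

struct my_mirror {
	struct mmu_notifier notifier;
	/* driver state: device page table handle, device pointer, ... */
};

/*
 * Called whenever the CPU page tables for [start, end) are about to
 * change (munmap, COW, reclaim, migration, ...). The driver must
 * invalidate the matching device page table entries so the device
 * re-faults and picks up the new CPU mapping.
 */
static void my_invalidate_range_start(struct mmu_notifier *mn,
				      struct mm_struct *mm,
				      unsigned long start,
				      unsigned long end)
{
	struct my_mirror *mirror = container_of(mn, struct my_mirror, notifier);

	my_device_ptes_unmap(mirror, start, end);	/* hypothetical hook */
}

static const struct mmu_notifier_ops my_mmu_notifier_ops = {
	.invalidate_range_start = my_invalidate_range_start,
};

/* Mirror @mm: updates now flow CPU -> device with no page pinning. */
static int my_mirror_register(struct my_mirror *mirror, struct mm_struct *mm)
{
	mirror->notifier.ops = &my_mmu_notifier_ops;
	return mmu_notifier_register(&mirror->notifier, mm);
}
```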
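The ZONE_DEVICE side described in the last two quoted paragraphs could look roughly like the sketch below: the driver hotplugs its memory and supplies a fault handler that runs when a CPU touch hits the special swap entry. `hmm_devmem_add()` and `struct hmm_devmem_ops` approximate the helpers added by this patchset as I recall them (later kernels replaced them with memremap_pages()), and all `my_*` names are hypothetical.

```c
#include <linux/hmm.h>
#include <linux/pci.h>

#define MY_DEVMEM_SIZE	(1UL << 30)	/* hypothetical: 1GB of device memory */

/*
 * A CPU access hit an address whose data lives in device memory. The
 * CPU page table holds the special swap entry, so the fault is routed
 * here; the driver migrates the page back to system memory and the
 * faulting access then retries against a normal page.
 */
static int my_devmem_fault(struct hmm_devmem *devmem,
			   struct vm_area_struct *vma,
			   unsigned long addr,
			   const struct page *page,
			   unsigned int flags,
			   pmd_t *pmdp)
{
	return my_migrate_back_to_ram(devmem, vma, addr);	/* hypothetical */
}

/* A device page was freed; hand it back to the driver's allocator. */
static void my_devmem_free(struct hmm_devmem *devmem, struct page *page)
{
	my_device_page_free(devmem, page);			/* hypothetical */
}

static const struct hmm_devmem_ops my_devmem_ops = {
	.fault = my_devmem_fault,
	.free  = my_devmem_free,
};

/*
 * Hotplug the device's local memory as un-addressable ZONE_DEVICE
 * pages: the CPU cannot map them, but each one has a struct page, so
 * core mm code can mostly treat them like ordinary memory.
 */
static int my_probe(struct pci_dev *pdev)
{
	struct hmm_devmem *devmem;

	devmem = hmm_devmem_add(&my_devmem_ops, &pdev->dev, MY_DEVMEM_SIZE);
	return IS_ERR(devmem) ? PTR_ERR(devmem) : 0;
}
```

Note that this partially answers the question above: on a PCIe-only platform the CPU never maps device memory into the process; per the letter, touching the data simply faults it back into system RAM.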
> To allow efficient migration between device memory and main memory, a new migrate_vma() helper is added with this patchset. It allows leveraging the device DMA engine to perform the copy operation.
>
> This feature will be used by upstream drivers like nouveau and mlx5, and probably others in the future (amdgpu is the next suspect in line). We are actively working on nouveau and mlx5 support. To test this patchset we also worked with the NVidia closed-source driver team; they have more resources than us to test this kind of infrastructure, and a bigger and better userspace ecosystem with various real industry workloads they can use to test and profile HMM.
>
> The expected workload is a program that builds a data set on the CPU (from disk, from network, from sensors, ...). The program uses a GPU API (OpenCL, CUDA, ...) to give hints on memory placement for the input data and also for the output buffer. The program then calls the GPU API to schedule a GPU job; this happens through a device-driver-specific ioctl. All of this is hidden from the programmer's point of view in the case of a C++ compiler that transparently offloads parts of a program to the GPU. The program can keep doing other work on the CPU while the GPU is crunching numbers.
>
> It is expected that the CPU will not access the same data set as the GPU while the GPU is working on it, but this is not mandatory. In fact we expect some small memory objects to be actively accessed by both GPU and CPU concurrently, as synchronization channels and/or for monitoring purposes. Such objects will stay in system memory and should not be bottlenecked by system bus bandwidth (rare write and read accesses from both CPU and GPU).
>
> As we are relying on the device driver API, HMM does not introduce any new syscall nor does it modify any existing ones. It does not change any POSIX semantics or behaviors. For instance, the child after a fork of a process that is using HMM will not be impacted in any way, nor is there any data hazard between child COW and parent COW of memory that was migrated to the device prior to the fork.
>
> HMM assumes a number of hardware features. The device must allow its page table to be updated at any time (i.e. device jobs must be preemptible). The device page table must provide memory protections such as read-only. The device must track write accesses (dirty bit). The device must support a minimum granularity that matches PAGE_SIZE (i.e. 4K).
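To make the migrate_vma() helper quoted above concrete, here is a rough driver-side sketch. The ops-based signature is my best recollection of the helper as proposed in this patchset (later kernels split it into migrate_vma_setup()/migrate_vma_pages()/migrate_vma_finalize()), and every `my_*` function is a hypothetical driver hook.

```c
#include <linux/migrate.h>
#include <linux/mm.h>

/*
 * Step 1: the source pages have been collected and locked; allocate
 * matching device pages and drive the device DMA engine to copy the
 * data, filling dst[] with migrate-pfn entries for pages that moved.
 */
static void my_alloc_and_copy(struct vm_area_struct *vma,
			      const unsigned long *src, unsigned long *dst,
			      unsigned long start, unsigned long end,
			      void *private)
{
	unsigned long i, npages = (end - start) >> PAGE_SHIFT;

	for (i = 0; i < npages; i++) {
		struct page *dpage;

		if (!(src[i] & MIGRATE_PFN_MIGRATE))
			continue;	/* page cannot move (e.g. pinned) */
		dpage = my_alloc_device_page(private);		/* hypothetical */
		my_dma_copy(private, dpage, migrate_pfn_to_page(src[i]));
		dst[i] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
	}
}

/*
 * Step 2: the CPU page table entries now hold the special swap
 * entries; point the device page table at the new device pages.
 */
static void my_finalize_and_map(struct vm_area_struct *vma,
				const unsigned long *src, const unsigned long *dst,
				unsigned long start, unsigned long end,
				void *private)
{
	my_device_ptes_map(private, start, end, dst);		/* hypothetical */
}

static const struct migrate_vma_ops my_migrate_ops = {
	.alloc_and_copy	  = my_alloc_and_copy,
	.finalize_and_map = my_finalize_and_map,
};

/* Migrate one page at @addr in @vma into device memory. */
static int my_migrate_to_device(struct vm_area_struct *vma,
				unsigned long addr, void *driver_data)
{
	unsigned long src = 0, dst = 0;	/* one slot per page */

	return migrate_vma(&my_migrate_ops, vma, addr, addr + PAGE_SIZE,
			   &src, &dst, driver_data);
}
```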
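Finally, the expected workload described above, seen from userspace. Everything here is invented for illustration (/dev/gpu0, the GPU_IOC_* ioctls, struct gpu_job, and the helper functions); a real program would go through OpenCL or CUDA instead. The point is that plain malloc'ed pointers are handed to the device as-is, with no duplication into a device allocator.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

struct gpu_job {		/* hypothetical driver ABI */
	void *input;		/* plain process pointers, no device copies */
	void *output;
	size_t size;
};

/* hypothetical ioctl numbers for a hypothetical device node */
#define GPU_IOC_PREFETCH _IOW('G', 1, struct gpu_job)
#define GPU_IOC_SUBMIT   _IOW('G', 2, struct gpu_job)
#define GPU_IOC_WAIT     _IOW('G', 3, struct gpu_job)

extern void build_data_set(float *buf, size_t n);	/* disk/network/sensors */
extern void do_other_cpu_work(void);

int main(void)
{
	size_t n = 64 << 20;
	float *input = malloc(n * sizeof(*input));	/* ordinary malloc */
	float *output = malloc(n * sizeof(*output));
	struct gpu_job job = { input, output, n };
	int fd = open("/dev/gpu0", O_RDWR);

	build_data_set(input, n);

	/* Placement hint: ask the driver to migrate these ranges into
	 * device memory up front (the GPU could also fault them in). */
	ioctl(fd, GPU_IOC_PREFETCH, &job);

	/* Schedule the job; the GPU walks the same pointers the CPU uses. */
	ioctl(fd, GPU_IOC_SUBMIT, &job);

	do_other_cpu_work();	/* CPU keeps working while the GPU runs */

	ioctl(fd, GPU_IOC_WAIT, &job);

	/* First CPU touch of a migrated page faults it back to system RAM. */
	printf("result[0] = %f\n", output[0]);

	close(fd);
	free(input);
	free(output);
	return 0;
}
```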