From: John Hubbard
Subject: Re: HMM (Heterogeneous Memory Management) v8
Date: Fri, 29 May 2015 20:01:04 -0700
In-Reply-To: <1432236705-4209-1-git-send-email-j.glisse@gmail.com>
Cc: Linus Torvalds, Mel Gorman, "H. Peter Anvin", Peter Zijlstra,
    Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
    Dave Airlie, Brendan Conoboy, Joe Donohue, Duncan Poole,
    Sherry Cheung, Subhash Gutti, Mark Hairgrove, Lucien Dunning,
    Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
    Shachar Raindel, Liran Liss

On Thu, 21 May 2015, j.glisse@gmail.com wrote:

> So sorry, I had to resend because I stupidly forgot to cc the mailing
> list. Ignore the private send done before.
>
> HMM (Heterogeneous Memory Management) is a helper layer for devices
> that want to mirror a process address space into their own MMU. The
> main target is GPUs, but other hardware, such as network devices, can
> also use HMM.
>
> There are two sides to HMM. The first is mirroring of a process
> address space on behalf of a device. HMM will manage a secondary page
> table for the device and keep it synchronized with the CPU page
> table. HMM also does DMA mapping on behalf of the device (which would
> allow new kinds of optimization further down the road (1)).
>
> The second side is allowing process memory to be migrated to device
> memory, where the device memory is unmappable by the CPU. Any CPU
> access will trigger a special fault that will migrate the memory
> back.
>
> From a design point of view, not much has changed since the last
> patchset (2). Most of the changes are in small details of the API
> exposed to device drivers. This version also includes device driver
> changes for Mellanox hardware to use HMM as an alternative to ODP
> (which provides a subset of HMM functionality specifically for RDMA
> devices). The long-term plan is to have HMM completely replace ODP.

Hi Jerome!

OK, seeing as how there is so much material to review here, I'll start
with the easiest part first: documentation.

There is a lot of information spread throughout this patchset that
needs to be preserved and made readily accessible, but some of it is
only found in the comments in patch headers. It would be better if the
information were right there in the source tree, not just in git
history. Also, the comment blocks that are in the code itself are
useful, but maybe not quite sufficient to provide the big picture.

With that in mind, I think that a Documentation/vm/hmm.txt file should
be provided. It could capture all of this. We can refer to it from
within the code, which raises the overall quality, because there is
only one place to update for big-picture documentation.

If it helps, I'll volunteer to piece something together from the
material that you have created, plus maybe a few notes about what a
typical calling sequence looks like (since I have actual backtraces
here from the ConnectIB cards).

Also, there are a lot of typographical errors that we can fix up as
part of that effort. We want to ensure that such tiny issues don't
distract people from the valuable content, so those need to be fixed.
I'll let others decide whether that sort of fit-and-finish needs to
happen now, or as a follow-up patch or two.

And finally, a critical part of good documentation is the naming of
things. We're sort of still in the "wow, it works" phase of this
project, so now is a good time to start fussing about names. Therefore,
you'll see a bunch of small and large naming recommendations coming
from me, for the various patches here.

thanks,
John Hubbard

> Why do this?
>
> Mirroring a process address space is mandatory with OpenCL 2.0 and
> with other GPU compute APIs. OpenCL 2.0 allows different levels of
> implementation, and currently only the lowest two are supported on
> Linux. To implement the highest level, where CPU and GPU accesses
> can happen concurrently and are cache coherent, HMM is needed, or
> something providing the same functionality, for instance through
> platform hardware.
>
> Hardware solutions such as PCIe ATS/PASID are limited to mirroring
> system memory and do not provide a way to migrate memory to device
> memory (which offers significantly more bandwidth, up to 10 times
> faster than regular system memory with a discrete GPU, and also has
> lower latency than PCIe transactions).
>
> Current CPUs with a GPU on the same die (AMD or Intel) use ATS/PASID,
> and for Intel a special level of cache (backed by a large pool of
> fast memory).
>
> For the foreseeable future, discrete GPUs will remain relevant, as
> they can have a larger quantity of faster memory than integrated
> GPUs.
>
> Thus we believe HMM will allow discrete GPU memory to be leveraged
> in a fashion that is transparent to the application, with minimum
> disruption to the Linux kernel mm code. Also, HMM can work alongside
> hardware solutions such as PCIe ATS/PASID (leaving the regular case
> to ATS/PASID while HMM handles the migrated memory case).
>
>
> Design:
>
> Patches 1, 2, 3 and 4 augment the mmu notifier API with new
> information to more efficiently mirror CPU page table updates.
>
> The first side of HMM, process address space mirroring, is
> implemented in patches 5 through 12. This uses a secondary page
> table, in which HMM mirrors memory actively used by the device.
> HMM does not take a reference on any of the pages; it uses the
> mmu notifier API to track changes to the CPU page table and to
> update the mirror page table, all while providing a simple API
> to device drivers.
>
> To implement this we use a "generic" page table and not a radix
> tree, because we need to store more flags than the radix tree
> allows, and we need to store DMA addresses (sizeof(dma_addr_t) >
> sizeof(long) on some platforms).
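[Interjecting here to check my understanding of the mirroring side:
below is a minimal sketch of how a driver could keep a device page
table in sync using the existing mmu notifier API, which is the kind
of plumbing HMM wraps up for drivers. This is my own illustration, not
code from these patches; my_mirror, dev_pt and my_device_unmap_range()
are invented names:

    #include <linux/kernel.h>
    #include <linux/mm.h>
    #include <linux/mmu_notifier.h>

    struct my_mirror {
            struct mmu_notifier notifier;  /* hooks into the CPU mm */
            void *dev_pt;                  /* device page table handle */
    };

    /* Invented driver hook: tear down device PTEs for [start, end). */
    void my_device_unmap_range(struct my_mirror *m,
                               unsigned long start, unsigned long end);

    static void my_invalidate_range_start(struct mmu_notifier *mn,
                                          struct mm_struct *mm,
                                          unsigned long start,
                                          unsigned long end)
    {
            struct my_mirror *m =
                    container_of(mn, struct my_mirror, notifier);

            /* The CPU page table is about to change: invalidate the
             * mirrored range before the primary entries go away. */
            my_device_unmap_range(m, start, end);
    }

    static const struct mmu_notifier_ops my_mirror_ops = {
            .invalidate_range_start = my_invalidate_range_start,
    };

    static int my_mirror_register(struct my_mirror *m,
                                  struct mm_struct *mm)
    {
            m->notifier.ops = &my_mirror_ops;
            return mmu_notifier_register(&m->notifier, mm);
    }

If I'm reading patches 5-12 correctly, HMM centralizes exactly this
sort of thing so that each driver doesn't have to re-invent it.
-- John]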
>
> Patch 14 passes the new child mm struct of a parent process being
> forked down the call chain. This is necessary to properly handle
> fork when the parent process has migrated memory (more on that
> below).
>
> Patch 15 allows getting the current memcg against which anonymous
> memory of a process should be accounted. It is useful because in
> HMM we do bulk transactions on the address space, and we wish to
> avoid storing a pointer to the memcg for each single page. All
> operations dealing with the memcg happen under the protection of
> the mmap semaphore.
>
>
> The second side of HMM, migration to device memory, is implemented
> in patches 16 to 28. This only deals with anonymous memory. A new
> special swap type is introduced. Migrated memory will have its CPU
> page table entries set to this special swap entry (similar to a
> migration entry, but unlike migration this is not a short-lived
> state).
>
> The remaining patches then add sets of functions that deal with
> those special entries in the various code paths that might
> encounter them.
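[Interjecting again: so, if I follow, a CPU fault on migrated memory
would go roughly like the sketch below. This is only my reading of the
design; is_hmm_entry() and hmm_migrate_back() are names I made up, not
the patchset's API:

    #include <linux/mm.h>
    #include <linux/swap.h>
    #include <linux/swapops.h>

    static int handle_hmm_swap_pte(struct mm_struct *mm,
                                   struct vm_area_struct *vma,
                                   unsigned long addr, pte_t pte)
    {
            swp_entry_t entry = pte_to_swp_entry(pte);

            if (!is_hmm_entry(entry))       /* invented helper */
                    return VM_FAULT_SIGBUS;

            /* Unlike a migration entry, this state can persist
             * indefinitely: the page stays in device memory until a
             * CPU access, such as this fault, forces it back. */
            return hmm_migrate_back(mm, vma, addr); /* invented */
    }

-- John]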
>
> Memory migration requires several steps. First, the memory is
> unmapped from the CPU and replaced with a special "locked" entry.
> The HMM locked entry is a short-lived transitional state; this is
> to avoid having two threads fight over a migration entry.
>
> Once the memory is unmapped, HMM can determine what can or cannot
> be migrated by comparing the mapcount and page count. If something
> holds a reference, then the page is not migrated and the CPU page
> table is restored. The next step is to schedule the copy to device
> memory and update the CPU page table to the regular HMM entry.
>
> Migration back follows the same pattern: replace with the special
> locked entry, then copy back, then update the CPU page table.
>
>
> (1) Because HMM keeps a secondary page table which keeps track of
>     DMA mappings, there is room for new optimizations. We want to
>     add a new DMA API to allow managing DMA page table mappings at
>     the directory level. This would minimize the memory consumption
>     of the mirror page table, and also the overhead of doing DMA
>     mapping page by page. This is a future feature we want to work
>     on, and we hope the idea will prove useful not only to HMM
>     users.
>
> (2) Previous patchset postings:
>     v1 http://lwn.net/Articles/597289/
>     v2 https://lkml.org/lkml/2014/6/12/559
>     v3 https://lkml.org/lkml/2014/6/13/633
>     v4 https://lkml.org/lkml/2014/8/29/423
>     v5 https://lkml.org/lkml/2014/11/3/759
>     v6 http://lwn.net/Articles/619737/
>     v7 http://lwn.net/Articles/627316/
>
>
> Cheers,
> Jérôme
>
> To: "Andrew Morton"
> Cc: linux-mm
> Cc: "Linus Torvalds"
> Cc: "Mel Gorman"
> Cc: "H. Peter Anvin"
> Cc: "Peter Zijlstra"
> Cc: "Linda Wang"
> Cc: "Kevin E Martin"
> Cc: "Andrea Arcangeli"
> Cc: "Johannes Weiner"
> Cc: "Larry Woodman"
> Cc: "Rik van Riel"
> Cc: "Dave Airlie"
> Cc: "Jeff Law"
> Cc: "Brendan Conoboy"
> Cc: "Joe Donohue"
> Cc: "Duncan Poole"
> Cc: "Sherry Cheung"
> Cc: "Subhash Gutti"
> Cc: "John Hubbard"
> Cc: "Mark Hairgrove"
> Cc: "Lucien Dunning"
> Cc: "Cameron Buschardt"
> Cc: "Arvind Gopalakrishnan"
> Cc: "Haggai Eran"
> Cc: "Or Gerlitz"
> Cc: "Sagi Grimberg"
> Cc: "Shachar Raindel"
> Cc: "Liran Liss"
> Cc: "Roland Dreier"
> Cc: "Sander, Ben"
> Cc: "Stoner, Greg"
> Cc: "Bridgman, John"
> Cc: "Mantor, Michael"
> Cc: "Blinzer, Paul"
> Cc: "Morichetti, Laurent"
> Cc: "Deucher, Alexander"
> Cc: "Gabbay, Oded"

thanks,
John H.
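P.S.: while going through patches 16 to 28, I sketched the migration
sequence above in pseudo-C, just to be sure I had the steps in the
right order. Posting it in case it helps other reviewers; every
hmm_*() function below is an invented placeholder, not the actual API
from these patches, and the exact reference accounting is whatever the
patchset does:

    #include <linux/errno.h>
    #include <linux/mm.h>

    /* Pseudo-C: migrate one anonymous page to device memory. */
    static int sketch_migrate_to_device(struct page *page)
    {
            /* Step 1: unmap from the CPU, installing the short-lived
             * "locked" entries so two threads can't race on the same
             * migration. */
            hmm_unmap_and_lock(page);               /* invented */

            /* Step 2: compare mapcount and page count; if someone
             * else still holds a reference, give up and restore the
             * CPU page table. */
            if (page_count(page) > page_mapcount(page) + 1) {
                    hmm_restore_cpu_ptes(page);     /* invented */
                    return -EBUSY;
            }

            /* Step 3: schedule the copy into device memory. */
            hmm_copy_to_device(page);               /* invented */

            /* Step 4: replace the locked entries with the long-lived
             * HMM special swap entries described earlier. */
            hmm_install_special_entries(page);      /* invented */
            return 0;
    }

Migration back would be the mirror image: lock, copy back, then
restore regular CPU page table entries.

-- John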