From mboxrd@z Thu Jan 1 00:00:00 1970
From: j.glisse@gmail.com
Subject: HMM (Heterogeneous Memory Management) v8
Date: Thu, 21 May 2015 15:31:09 -0400
Message-Id: <1432236705-4209-1-git-send-email-j.glisse@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
To: akpm@linux-foundation.org
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Linus Torvalds,
 joro@8bytes.org, Mel Gorman, "H. Peter Anvin", Peter Zijlstra,
 Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
 Dave Airlie, Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
 Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
 Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel,
 Liran Liss, Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
 Michael Mantor, Paul Blinzer, Laurent Morichetti, Alexander Deucher,
 Oded Gabbay, linux-fsdevel@vger.kernel.org, Linda Wang, Kevin E Martin,
 Jeff Law, Or Gerlitz, Sagi Grimberg

Sorry, I had to resend because I stupidly forgot to Cc the mailing lists. Please ignore the private send done before.

HMM (Heterogeneous Memory Management) is a helper layer for devices that want to mirror a process address space into their own MMU.
The main target is GPUs, but other hardware, such as network devices, can also use HMM.

There are two sides to HMM. The first is mirroring a process address space on behalf of a device. HMM manages a secondary page table for the device and keeps it synchronized with the CPU page table. HMM also does DMA mapping on behalf of the device, which would allow new kinds of optimization further down the road (1). The second side is migrating process memory to device memory, where device memory is not mappable by the CPU. Any CPU access triggers a special fault that migrates the memory back.

From a design point of view, not much has changed since the last patchset (2). Most of the changes are in small details of the API exposed to device drivers. This version also includes device driver changes for Mellanox hardware to use HMM as an alternative to ODP (which provides a subset of HMM functionality specifically for RDMA devices). The long term plan is to have HMM completely replace ODP.

Why do this?

Mirroring a process address space is mandatory with OpenCL 2.0 and with other GPU compute APIs. OpenCL 2.0 allows different levels of implementation, and currently only the lowest two are supported on Linux. To implement the highest level, where CPU and GPU accesses can happen concurrently and are cache coherent, HMM is needed, or something providing the same functionality, for instance through platform hardware.

Hardware solutions such as PCIe ATS/PASID are limited to mirroring system memory and do not provide a way to migrate memory to device memory (which offers significantly more bandwidth, up to 10 times that of regular system memory with a discrete GPU, and also has lower latency than a PCIe transaction).

Current CPUs with a GPU on the same die (AMD or Intel) use ATS/PASID and, for Intel, a special level of cache (backed by a large pool of fast memory).

For the foreseeable future, discrete GPUs will remain relevant, as they can have a large quantity of faster memory than integrated GPUs.
Thus we believe HMM will make it possible to leverage discrete GPU memory in a fashion transparent to the application, with minimal disruption to the Linux kernel mm code. HMM can also work alongside hardware solutions such as PCIe ATS/PASID (leaving the regular case to ATS/PASID while HMM handles the migrated memory case).

Design:

Patches 1, 2, 3 and 4 augment the mmu notifier API with new information to mirror CPU page table updates more efficiently.

The first side of HMM, process address space mirroring, is implemented in patches 5 through 12. It uses a secondary page table in which HMM mirrors the memory actively used by the device. HMM does not take a reference on any of the pages; it uses the mmu notifier API to track changes to the CPU page table and to update the mirror page table, all while providing a simple API to device drivers.

To implement this we use a "generic" page table and not a radix tree, because we need to store more flags than a radix tree allows and we need to store DMA addresses (sizeof(dma_addr_t) > sizeof(long) on some platforms). All this is

Patch 14 passes the new child mm struct of a parent process being forked further down the fork path. This is necessary to properly handle fork when the parent process has migrated memory (more on that below).

Patch 15 makes it possible to get the current memcg against which the anonymous memory of a process should be accounted. This is useful because in HMM we do bulk transactions on the address space and we wish to avoid storing a pointer to the memcg for every single page. All operations dealing with memcg happen under the protection of the mmap semaphore.

The second side of HMM, migration to device memory, is implemented in patches 16 to 28. This only deals with anonymous memory. A new special swap type is introduced. Migrated memory has its CPU page table entries set to this special swap entry (like a migration entry, but unlike migration this is not a short lived state).
All the patches are then sets of functions that deal with those special entries in the various code paths that might encounter them.

Memory migration requires several steps. First, the memory is unmapped from the CPU and replaced with a special "locked" entry. The HMM locked entry is a short lived transitional state; it keeps two threads from fighting over the migration entry. Once the memory is unmapped, HMM can determine what can or cannot be migrated by comparing mapcount and page count. If something holds a reference, the page is not migrated and the CPU page table is restored. The next step is to schedule the copy to device memory and update the CPU page table to a regular HMM entry.

Migration back follows the same pattern: replace with the special locked entry, copy back, then update the CPU page table.

(1) Because HMM keeps a secondary page table that tracks DMA mappings, there is room for new optimizations. We want to add a new DMA API that allows DMA page table mappings to be managed at the directory level. This would minimize the memory consumption of the mirror page table and also the overhead of doing DMA mapping page by page. This is a future feature we want to work on, and we hope the idea will prove useful not only to HMM users.

(2) Previous patchset postings:
v1 http://lwn.net/Articles/597289/
v2 https://lkml.org/lkml/2014/6/12/559
v3 https://lkml.org/lkml/2014/6/13/633
v4 https://lkml.org/lkml/2014/8/29/423
v5 https://lkml.org/lkml/2014/11/3/759
v6 http://lwn.net/Articles/619737/
v7 http://lwn.net/Articles/627316/

Cheers,
Jérôme
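
For illustration only, the migration steps described in this message can be sketched as a small user-space C model. This is not code from the patchset: the names (sim_page, pte_state, migrate_to_device, migrate_to_system) are hypothetical, and the pin check assumes, as a simplification, that every mapping holds exactly one reference, so any count beyond mapcount means something else holds the page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical CPU page table entry states (illustrative names only). */
enum pte_state {
	PTE_PRESENT,	/* page mapped in system memory */
	PTE_HMM_LOCKED,	/* short lived transitional entry during migration */
	PTE_HMM_DEVICE,	/* long lived special swap entry, data is in device memory */
};

/* Simplified stand-in for a page and its single CPU page table entry. */
struct sim_page {
	enum pte_state state;
	int mapcount;	/* number of page table mappings */
	int refcount;	/* assumed: one reference per mapping, plus any extra pins */
};

/* Try to migrate a page to device memory; returns true on success. */
bool migrate_to_device(struct sim_page *p)
{
	assert(p != NULL);

	/* Only a normally mapped page can start a migration. */
	if (p->state != PTE_PRESENT)
		return false;

	/* Step 1: unmap from the CPU and install the locked entry. */
	p->state = PTE_HMM_LOCKED;

	/* Step 2: an extra reference means the page cannot be migrated. */
	if (p->refcount != p->mapcount) {
		p->state = PTE_PRESENT;	/* restore the CPU page table */
		return false;
	}

	/* Step 3: the copy to device memory would be scheduled here. */

	/* Step 4: switch to the regular (long lived) HMM entry. */
	p->state = PTE_HMM_DEVICE;
	return true;
}

/* Migrate back, as a CPU fault on the special entry would request. */
bool migrate_to_system(struct sim_page *p)
{
	assert(p != NULL);

	if (p->state != PTE_HMM_DEVICE)
		return false;

	p->state = PTE_HMM_LOCKED;	/* same pattern: lock first */
	/* The copy back to system memory would happen here. */
	p->state = PTE_PRESENT;		/* then update the CPU page table */
	return true;
}
```

In this model, a page with mapcount 1 and refcount 1 migrates to the device and back, while a page with mapcount 1 and refcount 2 is left in place with its CPU mapping restored.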