Subject: Re: [PATCH 00/15] HMM (Heterogeneous Memory Management) v24
From: John Hubbard
Date: Thu, 29 Jun 2017 22:32:49 -0700
Message-ID: <960ef002-3cfd-5b91-054e-aa685abc5f1f@nvidia.com>
References: <20170628180047.5386-1-jglisse@redhat.com>
In-Reply-To: <20170628180047.5386-1-jglisse@redhat.com>
Sender: owner-linux-mm@kvack.org
To: Jérôme Glisse, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: Dan Williams, David Nellans

On 06/28/2017 11:00 AM, Jérôme Glisse wrote:
>
> Patchset is on top of git://git.cmpxchg.org/linux-mmotm.git so I
> test the same kernel as the kbuild system, git branch:
>
> https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v24
>
> Changes since v23: code comment fixes, simplified kernel configuration and
> improved allocation of new pages on migration to device memory (last patch
> in this patchset).

Hi Jerome,

Tiny note: one more change is that hmm_devmem_fault_range() has been removed
(and thanks for taking care of that, btw).

Anyway, this looks good. A basic smoke test shows the following:

1.
We definitely *require* your other patch,
"[PATCH] x86/mm/hotplug: fix BUG_ON() after hotremove by not freeing pud v3",
otherwise I will reliably hit that bug every time I run my simple page fault
test. So, let me know if I should ping that thread. It looks like your patch
was not rejected, but I can't tell if (!rejected == accepted), there. :)

We'll continue testing, but I expect at this point that anything we find can
be patched up after HMM finally gets merged.

thanks,
John Hubbard
NVIDIA

>
> Everything else is the same. Below is the long description of what HMM
> is about and why. At the end of this email I briefly describe each patch
> and suggest reviewers for each of them.
>
>
> Heterogeneous Memory Management (HMM) (description and justification)
>
> Today device drivers expose a dedicated memory allocation API through
> their device file, often relying on a combination of IOCTL and mmap calls.
> The device can only access and use memory allocated through this API. This
> effectively splits the program address space into objects allocated for
> the device and usable by the device, and other regular memory (malloc,
> mmap of a file, shared memory, ...) only accessible by the CPU (or in a
> very limited way by a device by pinning memory).
>
> Allowing different isolated components of a program to use a device thus
> requires duplication of the input data structures using the device memory
> allocator. This is reasonable for simple data structures (array, grid,
> image, ...) but it gets extremely complex with advanced data structures
> (list, tree, graph, ...) that rely on a web of memory pointers. This is
> becoming a serious limitation on the kind of workload that can be
> offloaded to devices like GPUs.
>
> New industry standards like C++, OpenCL or CUDA are pushing to remove this
> barrier.
> This requires a shared address space between the GPU device and the CPU so
> that the GPU can access any memory of a process (while still obeying
> memory protection like read only). This kind of feature is also appearing
> in various other operating systems.
>
> HMM is a set of helpers to facilitate several aspects of address space
> sharing and device memory management. Unlike existing sharing mechanisms
> that rely on pinning pages used by a device, HMM relies on mmu_notifier to
> propagate CPU page table updates to the device page table.
>
> Duplicating the CPU page table is only one aspect necessary for
> efficiently using a device like a GPU. GPU local memory has bandwidth in
> the TeraBytes/second range but it is connected to main memory through a
> system bus like PCIE that is limited to 32 GigaBytes/second (PCIE 4.0 16x).
> Thus it is necessary to allow migration of process memory from main system
> memory to device memory. The issue is that on platforms that only have
> PCIE the device memory is not accessible by the CPU with the same
> properties as main memory (cache coherency, atomic operations, ...).
>
> To allow migration from main memory to device memory HMM provides a set
> of helpers to hotplug device memory as a new type of ZONE_DEVICE memory
> which is un-addressable by the CPU but still has struct page representing
> it. This allows most of the core kernel logic that deals with process
> memory to stay oblivious of the peculiarities of device memory.
>
> When a page backing an address of a process is migrated to device memory
> the CPU page table entry is set to a new specific swap entry. CPU access
> to such an address triggers a migration back to system memory, just as if
> the page was swapped to disk. HMM also blocks anyone from pinning a
> ZONE_DEVICE page so that it can always be migrated back to system memory
> if the CPU accesses it. Conversely HMM does not migrate to device memory
> any page that is pinned in system memory.
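
A toy illustration of the migration rules in the paragraph above, written
as a small userspace Python model. This is not kernel code and is not taken
from the patchset. The names Location, Page, AddressSpace, pin(),
migrate_to_device() and cpu_read() are all hypothetical, and returning
False to refuse a pin or a migration is purely a modeling assumption; the
series itself does this with a special swap entry in the CPU page table and
ZONE_DEVICE struct pages.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Location(Enum):
    SYSTEM = auto()  # ordinary RAM, directly addressable by the CPU
    DEVICE = auto()  # device memory; stands in for the special swap entry


@dataclass
class Page:
    data: bytes
    location: Location = Location.SYSTEM
    pinned: bool = False


class AddressSpace:
    """Hypothetical model of one process; keys are virtual page numbers."""

    def __init__(self):
        self.pages = {}
        self.cpu_faults = 0  # counts migrations back triggered by CPU access

    def map(self, vpn, data):
        self.pages[vpn] = Page(data)

    def pin(self, vpn):
        page = self.pages[vpn]
        # Assumption: pinning a page that lives in device memory is refused,
        # so the page can always be migrated back on CPU access.
        if page.location is Location.DEVICE:
            return False
        page.pinned = True
        return True

    def migrate_to_device(self, vpn):
        page = self.pages[vpn]
        # Pages pinned in system memory are never migrated to the device.
        if page.pinned:
            return False
        page.location = Location.DEVICE
        return True

    def cpu_read(self, vpn):
        page = self.pages[vpn]
        # CPU access to a device-resident page behaves like a swap-in:
        # fault, migrate back to system memory, then complete the access.
        if page.location is Location.DEVICE:
            self.cpu_faults += 1
            page.location = Location.SYSTEM
        return page.data


space = AddressSpace()
space.map(0, b"input")   # bulk data the GPU will work on
space.map(1, b"sync")    # small object shared by CPU and GPU
space.pin(1)
space.migrate_to_device(0)
space.migrate_to_device(1)  # refused: page 1 is pinned
space.cpu_read(0)           # faults once and brings page 0 back
```

In this sketch the first CPU read of page 0 takes one fault and moves the
page back; later reads are ordinary accesses, and the pinned page never
leaves system memory.
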
>
> To allow efficient migration between device memory and main memory a new
> migrate_vma() helper is added with this patchset. It allows leveraging
> the device DMA engine to perform the copy operation.
>
> This feature will be used by upstream drivers like nouveau and mlx5, and
> probably others in the future (amdgpu is next suspect in line). We are
> actively working on nouveau and mlx5 support. To test this patchset we
> also worked with the NVidia closed source driver team; they have more
> resources than us to test this kind of infrastructure and also a bigger
> and better userspace eco-system with various real industry workloads that
> can be used to test and profile HMM.
>
> The expected workload is a program that builds a data set on the CPU
> (from disk, from network, from sensors, ...). The program uses a GPU API
> (OpenCL, CUDA, ...) to give hints on memory placement for the input data
> and also for the output buffer. The program calls the GPU API to schedule
> a GPU job; this happens using a device driver specific ioctl. All this is
> hidden from the programmer's point of view in the case of a C++ compiler
> that transparently offloads some part of a program to the GPU. The program
> can keep doing other stuff on the CPU while the GPU is crunching numbers.
>
> It is expected that the CPU will not access the same data set as the GPU
> while the GPU is working on it, but this is not mandatory. In fact we
> expect some small memory objects to be actively accessed by both GPU and
> CPU concurrently as a synchronization channel and/or for monitoring
> purposes. Such objects will stay in system memory and should not be
> bottlenecked by system bus bandwidth (rare write and read access from both
> CPU and GPU).
>
> As we are relying on the device driver API, HMM does not introduce any new
> syscall nor does it modify any existing ones. It does not change any POSIX
> semantics or behaviors.
> For instance the child after a fork of a process
> that is using HMM will not be impacted in any way, nor is there any data
> hazard between child COW or parent COW of memory that was migrated to
> device prior to fork.
>
> HMM assumes a number of hardware features. The device must allow the
> device page table to be updated at any time (i.e. device jobs must be
> preemptable). The device page table must provide memory protection such as
> read only. The device must track write access (dirty bit). The device must
> have a minimum granularity that matches PAGE_SIZE (i.e. 4k).
>
>
> Reviewer (just hint):
> Patch 1  HMM documentation
> Patch 2  introduces core infrastructure and definitions of HMM, pretty
>          small patch and easy to review
> Patch 3  introduces the mirror functionality of HMM; it relies on
>          mmu_notifier and thus someone familiar with that part would be
>          in a better position to review
> Patch 4  is a helper to snapshot the CPU page table while synchronizing
>          with concurrent page table updates. Understanding mmu_notifier
>          makes review easier.
> Patch 5  is mostly a wrapper around handle_mm_fault()
> Patch 6  adds a new add_pages() helper to avoid modifying each arch memory
>          hotplug function
> Patch 7  adds a new memory type for ZONE_DEVICE and also adds all the logic
>          in various core mm to support this new type. Dan Williams and
>          any core mm contributor are the best people to review each half
>          of this patchset
> Patch 8  special-cases HMM ZONE_DEVICE pages inside put_page(). Kirill and
>          Dan Williams are the best people to review this
> Patch 9  adds a helper to hotplug un-addressable device memory as the new
>          type of ZONE_DEVICE memory (new type introduced in patch 7 of
>          this series). This is boilerplate code around memory hotplug and
>          it also picks a free range of physical addresses for the device
>          memory. Note that the physical addresses do not point to anything
>          (at least as far as the kernel knows).
> Patch 10 introduces a new hmm_device class as a helper for device drivers
>          that want to expose multiple device memories under a common fake
>          device driver. This is useful for multi-gpu configurations.
>          Anyone familiar with device driver infrastructure can review
>          this. Boilerplate code really.
> Patch 11 adds a new migrate mode. Anyone familiar with page migration is
>          welcome to review.
> Patch 12 introduces a new migration helper (migrate_vma()) that allows
>          migrating a range of virtual addresses of a process using the
>          device DMA engine to perform the copy. It is not limited to
>          copies from and to a device but can also copy between any kind
>          of source and destination memory. Again anyone familiar with
>          migration code should be able to verify the logic.
> Patch 13 optimizes the new migrate_vma() by unmapping pages while we are
>          collecting them. This can be reviewed by any mm folks.
> Patch 14 adds unaddressable memory migration to the helper introduced in
>          patch 12; this can be reviewed by anyone familiar with migration
>          code
> Patch 15 adds a feature that allows the device to allocate non-present
>          pages on the GPU when migrating a range of addresses to device
>          memory.
>          This is a helper for device drivers to avoid having to first
>          allocate system memory before migration to device memory
>
>
> Previous patchset posting :
> v1 http://lwn.net/Articles/597289/
> v2 https://lkml.org/lkml/2014/6/12/559
> v3 https://lkml.org/lkml/2014/6/13/633
> v4 https://lkml.org/lkml/2014/8/29/423
> v5 https://lkml.org/lkml/2014/11/3/759
> v6 http://lwn.net/Articles/619737/
> v7 http://lwn.net/Articles/627316/
> v8 https://lwn.net/Articles/645515/
> v9 https://lwn.net/Articles/651553/
> v10 https://lwn.net/Articles/654430/
> v11 http://www.gossamer-threads.com/lists/linux/kernel/2286424
> v12 http://www.kernelhub.org/?msg=972982&p=2
> v13 https://lwn.net/Articles/706856/
> v14 https://lkml.org/lkml/2016/12/8/344
> v15 http://www.mail-archive.com/linux-kernel@xxxxxxxxxxxxxxx/msg1304107.html
> v16 http://www.spinics.net/lists/linux-mm/msg119814.html
> v17 https://lkml.org/lkml/2017/1/27/847
> v18 https://lkml.org/lkml/2017/3/16/596
> v19 https://lkml.org/lkml/2017/4/5/831
> v20 https://lwn.net/Articles/720715/
> v21 https://lkml.org/lkml/2017/4/24/747
> v22 http://lkml.iu.edu/hypermail/linux/kernel/1705.2/05176.html
>
>
> Jérôme Glisse (14):
>   hmm: heterogeneous memory management documentation v2
>   mm/hmm: heterogeneous memory management (HMM for short) v4
>   mm/hmm/mirror: mirror process address space on device with HMM helpers
>     v3
>   mm/hmm/mirror: helper to snapshot CPU page table v3
>   mm/hmm/mirror: device page fault handler
>   mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory v4
>   mm/ZONE_DEVICE: special case put_page() for device private pages v2
>   mm/hmm/devmem: device memory hotplug using ZONE_DEVICE v6
>   mm/hmm/devmem: dummy HMM device for ZONE_DEVICE memory v3
>   mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY
>   mm/migrate: new memory migration helper for use with device memory v4
>   mm/migrate: migrate_vma() unmap page from vma while collecting pages
>   mm/migrate: support un-addressable ZONE_DEVICE
>     page in migration v2
>   mm/migrate: allow migrate_vma() to alloc new page on empty entry v3
>
> Michal Hocko (1):
>   mm/memory_hotplug: introduce add_pages
>
>  Documentation/vm/hmm.txt       |  344 ++++++++++++
>  MAINTAINERS                    |    7 +
>  arch/x86/Kconfig               |    4 +
>  arch/x86/mm/init_64.c          |   22 +-
>  fs/aio.c                       |    8 +
>  fs/f2fs/data.c                 |    5 +-
>  fs/hugetlbfs/inode.c           |    5 +-
>  fs/proc/task_mmu.c             |    7 +
>  fs/ubifs/file.c                |    5 +-
>  include/linux/hmm.h            |  458 +++++++++++++++
>  include/linux/ioport.h         |    1 +
>  include/linux/memory_hotplug.h |   11 +
>  include/linux/memremap.h       |   86 +++
>  include/linux/migrate.h        |  124 +++++
>  include/linux/migrate_mode.h   |    5 +
>  include/linux/mm.h             |   25 +
>  include/linux/mm_types.h       |    6 +
>  include/linux/swap.h           |   24 +-
>  include/linux/swapops.h        |   68 +++
>  kernel/fork.c                  |    2 +
>  kernel/memremap.c              |   53 +-
>  mm/Kconfig                     |   34 ++
>  mm/Makefile                    |    2 +-
>  mm/balloon_compaction.c        |    8 +
>  mm/hmm.c                       | 1193 ++++++++++++++++++++++++++++++++++++++++
>  mm/memory.c                    |   61 ++
>  mm/memory_hotplug.c            |   10 +-
>  mm/migrate.c                   |  806 ++++++++++++++++++++++++++-
>  mm/mprotect.c                  |   14 +
>  mm/page_vma_mapped.c           |   10 +
>  mm/rmap.c                      |   25 +
>  mm/zsmalloc.c                  |    8 +
>  32 files changed, 3411 insertions(+), 30 deletions(-)
>  create mode 100644 Documentation/vm/hmm.txt
>  create mode 100644 include/linux/hmm.h
>  create mode 100644 mm/hmm.c
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .