From: "Jérôme Glisse" <jglisse@redhat.com>
To: akpm@linux-foundation.org, <linux-kernel@vger.kernel.org>,
	linux-mm@kvack.org
Cc: "Linus Torvalds" <torvalds@linux-foundation.org>,
	joro@8bytes.org, "Mel Gorman" <mgorman@suse.de>,
	"H. Peter Anvin" <hpa@zytor.com>,
	"Peter Zijlstra" <peterz@infradead.org>,
	"Andrea Arcangeli" <aarcange@redhat.com>,
	"Johannes Weiner" <jweiner@redhat.com>,
	"Larry Woodman" <lwoodman@redhat.com>,
	"Rik van Riel" <riel@redhat.com>,
	"Dave Airlie" <airlied@redhat.com>,
	"Brendan Conoboy" <blc@redhat.com>,
	"Joe Donohue" <jdonohue@redhat.com>,
	"Christophe Harle" <charle@nvidia.com>,
	"Duncan Poole" <dpoole@nvidia.com>,
	"Sherry Cheung" <SCheung@nvidia.com>,
	"Subhash Gutti" <sgutti@nvidia.com>,
	"John Hubbard" <jhubbard@nvidia.com>,
	"Mark Hairgrove" <mhairgrove@nvidia.com>,
	"Lucien Dunning" <ldunning@nvidia.com>,
	"Cameron Buschardt" <cabuschardt@nvidia.com>,
	"Arvind Gopalakrishnan" <arvindg@nvidia.com>,
	"Haggai Eran" <haggaie@mellanox.com>,
	"Shachar Raindel" <raindel@mellanox.com>,
	"Liran Liss" <liranl@mellanox.com>,
	"Roland Dreier" <roland@purestorage.com>,
	"Ben Sander" <ben.sander@amd.com>,
	"Greg Stoner" <Greg.Stoner@amd.com>,
	"John Bridgman" <John.Bridgman@amd.com>,
	"Michael Mantor" <Michael.Mantor@amd.com>,
	"Paul Blinzer" <Paul.Blinzer@amd.com>,
	"Leonid Shamis" <Leonid.Shamis@amd.com>,
	"Laurent Morichetti" <Laurent.Morichetti@amd.com>,
	"Alexander Deucher" <Alexander.Deucher@amd.com>,
	"Jérôme Glisse" <jglisse@redhat.com>
Subject: [PATCH v12 15/29] HMM: add documentation explaining HMM internals and how to use it.
Date: Tue,  8 Mar 2016 15:43:08 -0500
Message-ID: <1457469802-11850-16-git-send-email-jglisse@redhat.com>
In-Reply-To: <1457469802-11850-1-git-send-email-jglisse@redhat.com>

This adds documentation on how HMM works and a more in-depth view of how it
should be used by device driver writers.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 Documentation/vm/hmm.txt | 219 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 219 insertions(+)
 create mode 100644 Documentation/vm/hmm.txt

diff --git a/Documentation/vm/hmm.txt b/Documentation/vm/hmm.txt
new file mode 100644
index 0000000..febed50
--- /dev/null
+++ b/Documentation/vm/hmm.txt
@@ -0,0 +1,219 @@
+Heterogeneous Memory Management (HMM)
+-------------------------------------
+
+The raison d'être of HMM is to provide a common API for device drivers that
+want to mirror a process address space on their device and/or migrate system
+memory to device memory. A device driver can decide to use only one aspect of
+HMM (mirroring or memory migration). For instance, some devices can directly
+access the process address space through hardware (such as PCIe ATS/PASID),
+but still want to benefit from the memory migration capabilities HMM offers.
+
+While HMM relies on existing kernel infrastructure (namely mmu_notifier), some
+of its features (memory migration, atomic access) require integration with
+core mm kernel code. Having HMM as the common intermediary is more appealing
+than having each device driver hook itself into the common mm code.
+
+Moreover, HMM as a layer allows integration with the DMA API or page reclaim.
+
+
+Mirroring address space on the device:
+--------------------------------------
+
+A device that cannot directly and transparently access the process address
+space needs to mirror the CPU page table into its own page table. HMM helps
+keep the device page table synchronized with the CPU page table. The device is
+not expected to mirror the whole CPU page table, only the regions that it
+actively accesses. For that reason HMM only helps populate and synchronize
+the device page table for ranges that the device driver explicitly asks
+for.
+
+Mirroring address space inside the device page table is easy with HMM:
+
+  /* Create a mirror for the current process for your device. */
+  your_hmm_mirror->hmm_mirror.device = your_hmm_device;
+  hmm_mirror_register(&your_hmm_mirror->hmm_mirror);
+
+  ...
+
+  /* Mirror memory (in read mode) between addressA and addressB */
+  your_hmm_event->hmm_event.start = addressA;
+  your_hmm_event->hmm_event.end = addressB;
+  your_hmm_event->hmm_event.etype = HMM_DEVICE_RFAULT;
+  hmm_mirror_fault(&your_hmm_mirror->hmm_mirror, &your_hmm_event->hmm_event);
+    /* HMM calls back into your driver through the ->update() callback. During
+     * the callback use the HMM page table to populate the device page table. You
+     * can only use the HMM page table to populate the device page table for
+     * the specified range during the ->update() callback; at any other point in
+     * time the HMM page table content should be assumed to be undefined.
+     */
+    your_hmm_device->update(mirror, event);
+
+  ...
+
+  /* The process is quitting or the device is done: stop mirroring and clean up. */
+  hmm_mirror_unregister(&your_hmm_mirror->hmm_mirror);
+  /* The device driver can now free your_hmm_mirror. */
+
+
+HMM mirror page table:
+----------------------
+
+Each hmm_mirror object is associated with a mirror page table that HMM keeps
+synchronized with the CPU page table by using the mmu_notifier API. HMM uses
+its own generic page table format because it needs to store DMA addresses,
+which are bigger than long on some architectures, and needs more flags per
+entry than a radix tree allows.
+
+The HMM page table mostly mirrors the x86 page table layout. A page holds a
+global directory and each entry points to a lower level directory. Unlike in a
+regular CPU page table, directory levels are freed and removed from the HMM
+mirror page table more aggressively. This means the device driver needs to use
+the HMM helpers and follow the directives on when and how to access the mirror
+page table. HMM uses the per-page spinlock of a directory page to synchronize
+updates of that directory, so distinct directories can be updated concurrently.
+
+As a rule, the mirror page table can only be accessed by the device driver from
+one of the HMM device callbacks. Any access from outside a callback is illegal
+and gives undetermined results.
+
+Accessing the mirror page table from a device callback requires the HMM page
+table helpers. The core of a loop over a range of addresses looks like this:
+
+  /* Initialize a HMM page table iterator. */
+  struct hmm_pt_iter iter;
+  hmm_pt_iter_init(&iter, &mirror->pt);
+
+  /* Get pointer to HMM page table entry for a given address. */
+  dma_addr_t *hmm_pte;
+  hmm_pte = hmm_pt_iter_walk(&iter, &addr, &next);
+
+If there is no valid directory for the given address then hmm_pte is
+NULL. If there is a valid directory then you can access hmm_pte, and
+the pointer stays valid as long as you do not call hmm_pt_iter_walk() with
+the same iter struct for a different address or call hmm_pt_iter_fini().
+
+While the HMM page table entry pointer stays valid, you can only modify the
+value it points to by using one of the HMM helpers (hmm_pte_*()), as other
+threads might be updating the same entry concurrently. The device driver only
+needs to update an HMM page table entry to set the dirty bit, so the driver
+should only be using hmm_pte_set_dirty().
+
+Similarly, to extract information the device driver should use one of the
+helpers, like hmm_pte_dma_addr() or hmm_pte_pfn() (the latter if HMM is not
+doing the DMA mapping, which the device driver selects at initialization).
+
+
+Migrating system memory to device memory:
+-----------------------------------------
+
+Devices like discrete GPUs often have their own local memory, which offers
+higher bandwidth and lower latency to the GPU than system memory does. This
+local memory is not necessarily accessible by the CPU. Device local memory will
+remain relevant for the foreseeable future as the bandwidth of GPU memory keeps
+increasing faster than the bandwidth of system memory and as PCIe latency does
+not decrease.
+
+Thus, to make the most of a device like a GPU, programs need to use the device
+memory. Userspace APIs want to make this as transparent as possible, so that
+applications do not need complex modifications.
+
+Transparent use of device memory for a range of addresses of a process requires
+core mm code modifications. Adding a new memory zone for device memory did not
+make sense given that such memory is often accessible only by the device. This
+is why we decided to use a special kind of swap: migrated memory is marked with
+a special swap entry inside the CPU page table.
+
+While HMM handles the migration process, it does not decide which range to
+migrate or when. The decision to perform such a migration is under the control
+of the device driver. Migration back to system memory happens either because
+the CPU tries to access the memory or because the device driver decides to
+migrate the memory back.
+
+
+  /* Migrate system memory between addressA and addressB to device memory. */
+  your_hmm_event->hmm_event.start = addressA;
+  your_hmm_event->hmm_event.end = addressB;
+  your_hmm_event->hmm_event.etype = HMM_COPY_TO_DEVICE;
+  hmm_mirror_fault(&your_hmm_mirror->hmm_mirror, &your_hmm_event->hmm_event);
+    /* HMM calls back into your driver through the ->copy_to_device() callback.
+     * The device driver must allocate device memory, DMA system memory to device
+     * memory, update the device page table to point to device memory and
+     * return. See hmm.h for detailed instructions and how failures are handled.
+     */
+    your_hmm_device->copy_to_device(mirror, event, dst, addressA, addressB);
+
+
+Right now HMM only supports migrating anonymous private memory. Migration of
+shared memory and, more generally, file-mapped memory is on the road map.
+
+
+Locking considerations and overall design:
+------------------------------------------
+
+As a rule, HMM handles proper locking on behalf of the device driver, so the
+device driver does not need to take any mm lock before calling into
+the HMM code.
+
+HMM is also responsible for the lifetime of the hmm_device and hmm_mirror
+objects. The device driver can only free them after calling
+hmm_device_unregister() or hmm_mirror_unregister() respectively.
+
+The locks inside the HMM structures must never be used by the device
+driver. They are intended for use by HMM code only. Below is a short
+description of the 3 main locks that exist for HMM internal use, for
+educational purposes only.
+
+Each process mm has one and only one struct hmm associated with it. Each hmm
+struct can be used by several different mirrors. There is one and only one
+mirror per mm and device pair. So in essence the hmm struct is the core that
+dispatches everything to every mirror, each of them corresponding to a specific
+device. The list of mirrors for an hmm struct is protected by a semaphore as it
+sees mostly read access.
+
+Each time a device faults on a range of addresses, it calls hmm_mirror_fault().
+HMM keeps track, inside the hmm struct, of each range currently being faulted
+so that it can synchronize with any CPU page table update. If there is a CPU
+page table update, a callback through mmu_notifier happens and HMM tries to
+interrupt the conflicting device page faults (i.e. those whose address range
+overlaps the range being updated) and waits for them to back off. This
+ensures that at no point in time does the device driver see transient page
+table information. The list of active faults is protected by a spinlock; queries
+on that list should be short and quick (we have not gathered enough statistics
+on that side yet to have a good idea of the average access pattern).
+
+Each device driver wanting to use HMM must register one and only one hmm_device
+struct per physical device with HMM. The hmm_device struct has pointers to the
+device driver callbacks and keeps track of active mirrors for a given device.
+The active mirrors list is protected by a spinlock.
+
+
+Future work:
+------------
+
+Improved atomic access by the device to system memory. Some platform buses
+(PCIe) offer a limited number of atomic memory operations, and some platforms do
+not have any kind of atomic memory operation by a device. In order to allow such
+atomic operations we want to map the page read-only on the CPU while the device
+performs its operation. For this we need a new case inside the CPU write fault
+code path to synchronize with the device.
+
+We want to allow programs to lock a range of memory inside device memory and
+forbid CPU access while the memory is locked inside the device. Any CPU access
+to a locked range would result in SIGBUS. We think that madvise() would be the
+right syscall into which we could plug that feature.
+
+In order to minimize kernel memory consumption and the overhead of DMA mapping,
+we want to introduce a new DMA API that allows managing mappings on an IOMMU
+directory page basis. This would allow DMA mappings to be mapped, unmapped and
+updated in bulk and minimize IOMMU update and flushing overhead. Moreover, it
+would improve IOMMU bad access reporting for DMA addresses in those directories.
+
+Because updates to the device page table might require "heavy" synchronization
+with the device, the mmu_notifier callback might have to sleep while HMM is
+waiting for the device driver to report device page table update completion.
+This is especially bad if it happens during page reclaim, as it might
+bring the system to a pause. We want to mitigate this, either by maintaining a
+new intermediate LRU level in which we put pages actively mirrored by a device
+or by some other mechanism. For the time being we advise that device drivers
+using HMM explicitly document this corner case so that users are aware that it
+can happen under memory pressure.
-- 
2.4.3
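
The fault tracking described in the locking section (HMM interrupts device
faults whose range overlaps a CPU page table update) can be pictured with a
small userspace sketch. It is not taken from the HMM sources: struct
fault_range, ranges_overlap() and mark_conflicting_faults() are hypothetical
names, ranges are assumed to be half-open [start, end), and the real code
would additionally hold the spinlock around the list and wait for the flagged
faults to back off.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one in-flight device fault; not a kernel struct. */
struct fault_range {
	unsigned long start;	/* first address covered (inclusive) */
	unsigned long end;	/* one past the last address (exclusive) */
	bool backoff;		/* set when the fault must back off and retry */
};

/* True if half-open ranges [a_start, a_end) and [b_start, b_end) intersect. */
static bool ranges_overlap(unsigned long a_start, unsigned long a_end,
			   unsigned long b_start, unsigned long b_end)
{
	return a_start < b_end && b_start < a_end;
}

/*
 * Flag every active fault that intersects the invalidated range
 * [inv_start, inv_end) and return how many faults were flagged.
 */
static size_t mark_conflicting_faults(struct fault_range *faults, size_t n,
				      unsigned long inv_start,
				      unsigned long inv_end)
{
	size_t i, hit = 0;

	for (i = 0; i < n; i++) {
		if (ranges_overlap(faults[i].start, faults[i].end,
				   inv_start, inv_end)) {
			faults[i].backoff = true;
			hit++;
		}
	}
	return hit;
}
```

For example, with active faults on [0x1000, 0x3000), [0x3000, 0x5000) and
[0x8000, 0x9000), an invalidation of [0x2000, 0x4000) flags the first two and
leaves the third alone.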
